Human language is complicated, but understanding how artificial intelligence (AI) models break language into digestible components isn't as complex as language itself. As you learn about AI, you might come across the term "tokenization." While tokenization sounds like abstract jargon for tech geniuses, it isn't difficult to understand, even if you're new to it.

Whether you're interested in building AI programs using something like GitHub or want to understand how AI works, learning about tokenization furthers your knowledge of AI. Tokenization even applies to using your favorite Alexa devices and other AI assistants. Here's what AI tokenization is, how it works, and why it's important for large language models such as ChatGPT and other AI applications.

ChatGPT home screen on a mobile phone.

AI tokenization in generative AI explained

Tokenization is a method AI systems use to break language down into smaller parts. AI models that deal with language, such as chatbots, use a large language model (LLM). These models work with "tokens," each of which represents a word, character, or phrase. Humans learn to break down language into these small parts as children. For instance, we learned basic punctuation before moving on to grammatical concepts. As we get older, we don't think of language in terms of puzzle pieces that fit together to form a paragraph or phrase because it's ingrained.

Machine learning must break down language into these separate parts to understand the nuances of language and sound more human. Since LLMs aren't human, they use a mathematical process to grasp language. This numerical representation is what allows ChatGPT and other natural language processing (NLP) models to create phrases that make sense.

A screenshot of OpenAI’s Tokenizer page.

AI systems understand and process text by breaking parts of language into tokens. When using a chatbot, for instance, you don't see anything happening behind the scenes. The chatbot uses tokens to speak fluently and provide answers that make sense linguistically. Tokens are the smallest data units AI models use to generate text, per design and development company Povio.

Another explanation of tokenization comes from OpenAI, which developed ChatGPT. It uses transformers, one of the most advanced AI model architectures. OpenAI states that AI "models learn to understand the statistical relationships between these tokens and excel at producing the next token in a sequence of tokens." If an AI language model understands statistically which tokens (or words, phrases, and punctuation) go together, it learns to excel at producing sequences of text that make sense and improve with time.
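The idea of learning "statistical relationships between tokens" can be sketched in a few lines of Python. This toy example (not OpenAI's actual method, which uses transformer neural networks) counts which token most often follows which, then predicts the next token:

```python
from collections import Counter, defaultdict

# Toy corpus, already split into simple word tokens.
corpus = "the cat sat on the mat . the cat ate ."
tokens = corpus.split()

# Count how often each token follows each other token.
following = defaultdict(Counter)
for current, nxt in zip(tokens, tokens[1:]):
    following[current][nxt] += 1

def predict_next(token):
    """Return the statistically most likely next token."""
    return following[token].most_common(1)[0][0]

print(predict_next("the"))  # "cat" follows "the" twice, "mat" once -> "cat"
```

Real LLMs do something far more sophisticated over billions of tokens, but the core intuition is the same: predict the most plausible next token given what came before.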


Tokenization isn't only for chatbots. It's also used in AI language translation services like Google Translate, search engines, and voice recognition software used by AI assistants such as Amazon Alexa or Google Assistant.

Types of AI tokens used in tokenization

Language has different parts, such as punctuation, verbs, and adjectives. So does AI language processing. Tokenization generally involves “subtokens,” which are different types of tokens. Subtokens may include categories like punctuation tokens (reserved for periods, commas, colons, and other marks used in language) and word tokens, which represent whole words. For example, there might be a token for an exclamation point or a subtoken for the word “journey.”
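Word and punctuation tokens like these can be illustrated with a simplified tokenizer. This sketch (not how production tokenizers work, which learn subword vocabularies from data) splits text so each word and each punctuation mark becomes its own token:

```python
import re

def simple_tokenize(text):
    # \w+ matches a run of word characters (a word token);
    # [^\w\s] matches a single punctuation mark (a punctuation token).
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("What a journey!"))
# ['What', 'a', 'journey', '!']
```

Note how the exclamation point gets its own token, separate from the word it follows.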

Some words might be broken down into parts (called subword tokenization), such as the word “suddenly.” In tokenization, this might look like “sudden” and “ly.” This also might occur for words that are combined but are sometimes separate entities, such as “chatbot” occurring in two tokens, “chat” and “bot.” Or the word “boyfriend” becomes “boy” and “friend.”
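Subword splitting like this can be sketched with a greedy longest-match tokenizer. The vocabulary below is invented for illustration; real systems learn their vocabularies from data using algorithms such as byte pair encoding (BPE):

```python
# Hypothetical subword vocabulary for illustration only.
VOCAB = {"sudden", "ly", "chat", "bot", "boy", "friend"}

def subword_tokenize(word):
    tokens, start = [], 0
    while start < len(word):
        # Try the longest matching substring in the vocabulary first.
        for end in range(len(word), start, -1):
            if word[start:end] in VOCAB:
                tokens.append(word[start:end])
                start = end
                break
        else:
            # No vocabulary match: emit the character as its own token.
            tokens.append(word[start])
            start += 1
    return tokens

print(subword_tokenize("suddenly"))  # ['sudden', 'ly']
print(subword_tokenize("chatbot"))   # ['chat', 'bot']
```

Breaking rare words into common subwords lets a model cover a huge vocabulary with a manageable number of tokens.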

A screenshot of OpenAI’s Tokenizer page with a phrase in the text box.


There's also a type of tokenization called morphological tokenization, which breaks words into "morphemes," the smallest meaningful units of a word. It is commonly used for languages that build complex words from many meaningful roots and affixes.

A screenshot of OpenAI’s Tokenizer displaying a phrase with 20 tokens.

These subtokens are all parts of AI tokenization that are generated separately but come together to form fully fleshed-out paragraphs of text.

Limitations of tokens in AI language models

Certain AI language models, such as ChatGPT 3.5 and 4, have a token limit. The model can't go past a certain threshold of tokens when generating text. To understand what a sentence of text looks like in terms of the number of tokens, use OpenAI's Tokenizer for free. Type in any text and see what it converts to in tokens.

For example, type, "All cat ladies love ginger tabby cats, but not all ginger tabby cats love cat ladies." In the Tokenizer, the results show that this sentence comprises 20 tokens. When you scroll down, it also color codes the different words and punctuation to illustrate which parts are separate tokens.

You might not run into the total token limit on ChatGPT 3.5 or ChatGPT 4. Still, it's important to know the limit when entering an extended amount of text. On ChatGPT 3.5, the limit is around 4,100 tokens. There's a larger threshold for ChatGPT 4 of nearly 8,200 tokens. If a 16-word sentence produces 20 tokens, that's about a 25% increase in tokens versus words. This is a rough estimate, not a sure-fire rule.

Your token count changes depending on the amount of punctuation used and other factors. But for illustrative purposes, the 4,100-token limit on ChatGPT 3.5 works out to roughly 3,280 words of text.

OpenAI also states on its Tokenizer page that "one token generally corresponds to ~4 characters of text for common English text," which means 100 tokens is equivalent to roughly 75 words.
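That rule of thumb makes it easy to estimate token usage without calling the Tokenizer at all. This sketch uses OpenAI's ~4-characters-per-token heuristic; the 4,100-token limit is this article's approximation for ChatGPT 3.5, not an exact published figure:

```python
# Rough heuristic from OpenAI's Tokenizer page: ~4 characters per token
# for common English text. Actual counts vary with punctuation and wording.
CHARS_PER_TOKEN = 4
TOKEN_LIMIT_GPT35 = 4100  # approximate, per the article

def estimate_tokens(text):
    return max(1, round(len(text) / CHARS_PER_TOKEN))

def fits_in_context(text, limit=TOKEN_LIMIT_GPT35):
    return estimate_tokens(text) <= limit

sentence = ("All cat ladies love ginger tabby cats, "
            "but not all ginger tabby cats love cat ladies.")
print(estimate_tokens(sentence))  # 21 with this heuristic (the Tokenizer reports 20)
print(fits_in_context(sentence))  # True
```

The estimate lands close to the Tokenizer's real count for this sentence, which shows why the 4-character heuristic is useful for quick back-of-the-envelope checks.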

Other challenges in tokenization

A large language model uses deep learning to give real-world answers. Deep learning uses deep neural networks to process data and make decisions. Because human language is nuanced, some challenges arise in tokenization that data scientists haven't smoothed out. Text data goes through preprocessing before the AI model uses the resulting tokens to analyze and string together sentences.


Sentiment analysis, for example, is one area in AI chatbots that could use improvement. Sentiment analysis is the AI model’s combination of language processing, linguistics, and text analysis to determine the “affect” or emotion of language. But emotions aren’t always easy to grasp. Sometimes, a bit of text could be interpreted as sarcastic. Other times, it might come across as literal. Because of these ambiguities in language, the intent behind text output can be misread or not what the person asking the chatbot intended.

Another issue is in languages that don’t have clearly designated spaces between words. Tokenization is a harder task when analyzing languages like Chinese and Japanese since it isn’t always clear where words start and end.

Additionally, it isn’t always a walk in the park dealing with special characters, such as the “@” symbol in email addresses or the slashes, dashes, and other text of URLs.

How tokenization may improve AI language models in the future

AI language models are imperfect. However, as developers work on tokenization and make AI more context-aware, generated text is likely to improve. AI doesn't always provide the most human-sounding text, and it can miss the point you're looking to nail. You'll have to be patient about some of this progression.

In the meantime, you can take matters into your own hands and learn more about how AI works to get the best results. You can pay attention to AI repetition penalties and how to work around them. You can also learn how to coax the answers you want from AI models like ChatGPT. Being specific is best. Instead of asking an AI model for a paragraph about donkeys, ask it for a conversational paragraph about donkeys for a blog.

Be specific, be patient, and enjoy the wild ride that's learning about AI and the future of tech. While you're learning, download the best AI apps for your Android phone, because it's time to get with the times.