December 5, 2024
|
Gert Jan Spriensma

What are tokens and why do they matter?

If you've been exploring AI, particularly Large Language Models (LLMs), you've likely encountered the term "token." This concept often comes up in discussions about pricing, where fees are typically based on a per-thousand-tokens model.

Tokens serve as the fundamental units of data for LLMs, similar to how words form the basis of sentences. However, a token can vary in form—it might be part of a word, an entire word, or even punctuation. LLMs function as next-token predictors, calculating the probability of the next token based on the sequence of preceding tokens.
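
To make "next-token predictor" a little more concrete, here is a minimal Python sketch. The candidate tokens and their scores are invented for illustration; a real model computes a score for every token in its vocabulary, and the function name is our own.

```python
import math

def next_token_probabilities(scores):
    """Turn raw scores (logits) into probabilities with a softmax."""
    highest = max(scores.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - highest) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical scores for the token that follows "The cat sat on the"
scores = {" mat": 4.0, " sofa": 3.0, " moon": 0.5}
probs = next_token_probabilities(scores)
best = max(probs, key=probs.get)
print(best, round(probs[best], 2))
```

The model then picks (or samples) a token from this distribution, appends it to the sequence, and repeats.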

Let’s take the word “token” itself as an example: “tokens” is 1 token, but “tokenize” can be 2.

  • This sentence is 7 tokens.
  • We need to tokenize the text for better analysis.

Whether "tokenize" is one or two tokens depends on how often the word appeared in the data used to build the model's token vocabulary. A model with a large vocabulary has less need to split familiar words into smaller units, so "tokenize" would typically be a single token. With a smaller vocabulary, or when the word is less common, it may be split into two tokens.
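
The effect of vocabulary size can be sketched with a toy tokenizer in Python. This is an illustration only: it uses a simple greedy longest-match rule and two made-up vocabularies, whereas production tokenizers typically use learned merge rules such as byte-pair encoding.

```python
def tokenize(text, vocab):
    """Greedy longest-match split of `text` into pieces found in `vocab`.

    Characters not covered by any vocabulary entry become
    single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, fall back to a single character.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens

# Hypothetical vocabularies, not taken from any real model.
small_vocab = {"token", "ize", "s"}
large_vocab = small_vocab | {"tokenize", "tokens"}

print(tokenize("tokenize", small_vocab))  # ['token', 'ize']
print(tokenize("tokenize", large_vocab))  # ['tokenize']
```

The same word costs two tokens with the small vocabulary and one with the large one.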

Token usage differences across languages

The performance of Large Language Models varies across different languages. While we won't delve deeply into the specifics here, it's important to note the differences in the number of tokens needed to compose a typical sentence in various languages.

English, with its thousands of words typically represented as single tokens, is often more economical in token usage. This efficiency is partly due to the model's token vocabulary, which tends to favor English because of its dominant presence in the training datasets. As a result, English is a cost-effective language for LLM operations.

For instance, consider the word "probability": a relatively long word in English, yet it counts as only one token. When translated into other languages, the token count differs:

  • NL - Kans - 2 tokens
  • DE - Wahrscheinlichkeit - 4 tokens
  • FR - Probabilité - 2 tokens
  • ES - Probabilidad - 2 tokens

An exact comparison is hard to make, but major European languages like German, French, Spanish, and Dutch (not a large language, we know 😀) use roughly 1.5 to 1.7 times as many tokens as English for the same content. Smaller European languages can be even more demanding: Hungarian needs around 2.5 times, and Greek about 4 times, the tokens of English.

Even for a major language like Hindi, token usage is around 5 times that of English. This higher cost stems not only from the language's intricacy but also from its complex script rules and its use of conjunct characters.
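
One factor that may contribute, assuming a byte-level tokenizer (as several popular models use), is that non-Latin scripts take more bytes per character in UTF-8. The Python sketch below counts characters and bytes, not tokens, so treat it as a rough proxy only. The Greek and Hindi words for "probability" are our own additions for illustration.

```python
def utf8_profile(word):
    """Return (characters, UTF-8 bytes) for a word."""
    return len(word), len(word.encode("utf-8"))

words = {
    "EN": "probability",
    "DE": "Wahrscheinlichkeit",
    "EL": "πιθανότητα",  # Greek, added for illustration
    "HI": "संभावना",  # Hindi, added for illustration
}

for lang, word in words.items():
    chars, size = utf8_profile(word)
    print(f"{lang}: {chars} characters, {size} bytes")
```

Latin letters take one byte each, Greek letters two, and Devanagari characters three, so a tokenizer that starts from bytes has more raw material to compress for those scripts.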

Why is this important?

We started from the premise that token usage drives cost, so languages that need more tokens to express the same content can be significantly more expensive.

The context window is closely related. It is the amount of text (measured in tokens) that a model can process at once when generating a response or understanding a text. This means that working in a language like Hindi not only increases operational costs but also reduces how much content fits into a single request. We'll explore context windows in greater detail in an upcoming blog post.
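
A back-of-the-envelope calculation in Python shows both effects. The multipliers are the approximate figures quoted above; the price, content size, and context window are hypothetical values chosen only for the example.

```python
# Approximate token multipliers relative to English, from the figures above.
TOKEN_MULTIPLIER = {
    "English": 1.0,
    "German": 1.6,  # midpoint of the 1.5 to 1.7 range
    "Hungarian": 2.5,
    "Greek": 4.0,
    "Hindi": 5.0,
}

def estimate(language, english_tokens, price_per_1k, context_window):
    """Estimate tokens, cost, and usable context for content in `language`."""
    multiplier = TOKEN_MULTIPLIER[language]
    tokens = english_tokens * multiplier
    cost = tokens / 1000 * price_per_1k
    # How much English-equivalent content fits in the window.
    capacity = context_window / multiplier
    return tokens, cost, capacity

# Hypothetical: 2,000 English tokens, $0.01 per 1,000 tokens, 8,000-token window.
for language in TOKEN_MULTIPLIER:
    tokens, cost, capacity = estimate(language, 2000, 0.01, 8000)
    print(f"{language}: {tokens:.0f} tokens, ${cost:.3f}, "
          f"window holds ~{capacity:.0f} English-equivalent tokens")
```

Under these assumptions, the same content costs five times as much in Hindi as in English, and the window holds one fifth of the material.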

While major players like OpenAI and Anthropic charge per token, Google's PaLM 2 model introduced per-character pricing, which can reduce costs for languages that need many tokens per character. Another alternative is deploying an open-source model, typically billed per hour of use. However, this option often involves higher setup and fixed monthly fees, making it an optimization rather than a launch strategy. Moreover, the quality of open-source models may not meet the requirements for certain use cases.
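
To see when per-character billing helps, compare the two schemes for the same text. Every number in this Python sketch is hypothetical and does not reflect any provider's actual prices.

```python
def per_token_cost(n_chars, chars_per_token, price_per_1k_tokens):
    """Cost when the provider bills per token."""
    n_tokens = n_chars / chars_per_token
    return n_tokens / 1000 * price_per_1k_tokens

def per_character_cost(n_chars, price_per_1k_chars):
    """Cost when the provider bills per character."""
    return n_chars / 1000 * price_per_1k_chars

# All numbers below are hypothetical, chosen only to show the effect.
N_CHARS = 4000
PRICE_PER_1K_TOKENS = 0.002
PRICE_PER_1K_CHARS = 0.0005

for label, chars_per_token in [("about 4 characters per token", 4.0),
                               ("about 1 character per token", 1.0)]:
    by_token = per_token_cost(N_CHARS, chars_per_token, PRICE_PER_1K_TOKENS)
    by_char = per_character_cost(N_CHARS, PRICE_PER_1K_CHARS)
    print(f"{label}: per-token ${by_token:.4f}, per-character ${by_char:.4f}")
```

With these made-up prices the two schemes cost the same at four characters per token, while per-character billing is cheaper once the text needs more tokens per character.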

To determine the most suitable option for your needs, consider looking into our design sprints. We will review your content and help you identify the best approach for your specific situation.
