From the course: Introduction to Large Language Models

What are tokens?

- [Instructor] Large language models generate text word by word, right? Not quite. They generate tokens. So what are tokens? Basically, words are split into subwords, and one token corresponds to around four characters of English text. Let's head over to the OpenAI website to get a good visual example of what tokens are. So this is the Tokenizer on the OpenAI website. Let me just go ahead and scroll down a bit. Now I'm going to enter some text into the Tokenizer: "Tokenization is the process of splitting words into smaller chunks or tokens." Each of the different colors corresponds to a token. In general, you can see that most words correspond to single tokens, and each token includes the space in front of the word. There are a couple of exceptions. For example, the word tokenization is made up of two tokens, token and ization. The sentence I've typed has 12 words. This corresponds to 14 tokens, or 77 characters, and you can see that the full stop right at the end has its own individual token. All right, so now that we know that tokens are made up of subwords, let's take a look at context windows.
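The "around four characters per token" figure above is a rule of thumb, not an exact rule. As a minimal sketch (the function name `estimate_tokens` is our own, not an OpenAI API), you can use it to roughly estimate token counts without a real tokenizer:

```python
def estimate_tokens(text: str) -> int:
    # Rule of thumb from the lesson: one token is roughly four characters
    # of English text. Real tokenizers like OpenAI's split on learned
    # subword units, so actual counts differ.
    return max(1, round(len(text) / 4))

# The example sentence typed into the OpenAI Tokenizer in the lesson.
sentence = ("Tokenization is the process of splitting words "
            "into smaller chunks or tokens.")

print(len(sentence.split()), "words,", len(sentence), "characters")
print("estimated tokens:", estimate_tokens(sentence))
```

For this sentence the estimate comes out higher than the 14 tokens the actual Tokenizer reports, which shows why the heuristic is only a ballpark: common words map to a single token each, so English text often needs fewer tokens than the character count alone suggests.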
