From the course: Introduction to Large Language Models
BERT
- [Narrator] If you're like me, the only time you watch dressage is every four years during the Summer Olympics. Now, whenever I've used Google search in the past, I've often only entered keywords such as dressage goal. It turns out that Google search uses BERT. BERT, which stands for Bidirectional Encoder Representations from Transformers, is a large language model that was developed by Google. BERT is based on the transformer architecture and is composed of transformer encoder layers, which we looked at in a previous video. Since Google search uses BERT, we get better language understanding. This means I don't just have to use keywords; I can enter a question like, "What's the main objective of dressage?" And Google doesn't just return the most relevant page; the answer to my question is highlighted. So the main objective of dressage is improving and facilitating the horse's performance of normal tasks. Here's another example of BERT in production. In the past, if you did a Google search using the phrase "can you get medicine for someone pharmacy," it wouldn't have picked up on the fact that "for someone" was a really important part of the query, because you're looking for another person to pick up the medicine. Google search would've returned results about generally getting a prescription filled, which isn't relevant in this context. Now, with BERT, Google search captures the important nuance that another person needs to pick up the medicine, and it returns results about having a friend or family member pick up a prescription. BERT has around 110 million parameters and was trained on the English Wikipedia and BookCorpus, a collection of around 11,000 books written by as-yet-unpublished authors. And unlike text generation models like ChatGPT and GPT-4, which are trained to generate the next token, BERT is trained on two other tasks: masked language modeling, or MLM, and next sentence prediction, or NSP. The masked language modeling task requires BERT to predict a masked-out word. Let's take a look at an example: "The Tokyo Olympic Games were ____ from 2020 to 2021." The answer here is that the Tokyo Olympic Games were postponed from 2020 to 2021. If BERT doesn't guess "postponed" during training, the weights in the network get adjusted so that it's more likely to guess this word the next time around. The next sentence prediction task asks the question: does the second sentence follow immediately after the first? So does the sentence "This is the first instance," and so on, logically follow "The Tokyo Olympic Games were postponed from 2020 to 2021"? Now, when would you ever need either of these tasks, and why is either of them useful? These tasks force BERT to get a really good understanding of language. And we've seen it's so good that Google incorporated it as part of its search. All right, so we've seen that BERT doesn't generate text, but it's great at language understanding. Recall that transformer models have encoders and decoders. BERT is the only encoder model we look at; the rest will be decoder-only models. Right after BERT, research from OpenAI suggested that bigger models are better, and so the rest of the models we look at are truly large, with billions of parameters.
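If you want to try the masked language modeling task yourself, here is a minimal sketch of the idea. It assumes you have the Hugging Face Transformers library and PyTorch installed and uses the publicly available bert-base-uncased checkpoint; neither is required by the course, they are just one convenient way to poke at a pretrained BERT.

```python
# A minimal sketch of BERT's masked language modeling (MLM) task,
# assuming the Hugging Face "transformers" library and PyTorch are
# installed (pip install transformers torch) and using the public
# "bert-base-uncased" checkpoint (roughly 110 million parameters).
from transformers import pipeline

# The fill-mask pipeline loads BERT and predicts the masked-out token.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# [MASK] marks the word BERT has to predict.
predictions = unmasker("The Tokyo Olympic Games were [MASK] from 2020 to 2021.")

# Print the top candidate words and their scores; a word like
# "postponed" should appear among the highest-scoring candidates.
for p in predictions:
    print(f"{p['token_str']:>12}  {p['score']:.3f}")
```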
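And here is a similar sketch for the next sentence prediction task, again assuming Hugging Face Transformers, PyTorch, and the bert-base-uncased checkpoint. The second sentence below is just a placeholder follow-up I made up for illustration.

```python
# A minimal sketch of BERT's next sentence prediction (NSP) task,
# assuming Hugging Face "transformers" and PyTorch are installed and
# using the public "bert-base-uncased" checkpoint.
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

sentence_a = "The Tokyo Olympic Games were postponed from 2020 to 2021."
# Placeholder follow-up sentence, chosen only for this example.
sentence_b = "The Games kept the Tokyo 2020 name even though they took place in 2021."

# BERT sees both sentences as one input: [CLS] sentence A [SEP] sentence B [SEP]
encoding = tokenizer(sentence_a, sentence_b, return_tensors="pt")

with torch.no_grad():
    logits = model(**encoding).logits

# Index 0 means "sentence B follows sentence A"; index 1 means it does not.
probs = torch.softmax(logits, dim=-1)
print(f"P(sentence B follows sentence A) = {probs[0, 0]:.3f}")
```

Swapping in an unrelated second sentence should drive that probability down, which is exactly the signal BERT learns from during pretraining.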