From the course: Advanced Guide to ChatGPT, Embeddings, and Other Large Language Models (LLMs)
Evaluating generative tasks: Part 1
- While this isn't the first time we've talked about evaluations, this is the first time we're going to take an in-depth look at the many ways we can evaluate the many types of LLM tasks. But before we do, we should understand that evaluation is not simply checking whether or not a model works. That is part of it, but it's basically step one. True holistic evaluation of LLMs is really a step toward understanding how well the model will actually perform and be useful in a real-world scenario. We're also going to be discussing how to measure things like the trustworthiness and impact of our tasks and of our datasets. Now, most people's testing harnesses will look something like this when it comes to LLMs. They'll have, more or less, a two-axis grid, where you have models on one axis, like I have GPT-3.5, Llama 3, and Claude 3.5 as columns, and as rows, I have prompting variants. Now, this is of course, a testing harness…
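The two-axis grid described above can be sketched in code. This is a minimal, hypothetical harness: the model names come from the transcript, but the prompt variants and the `score` function are stand-ins for a real evaluation (which would call each model's API and score the output against a reference or rubric).

```python
from itertools import product

# Columns: the models mentioned in the transcript.
models = ["GPT-3.5", "Llama 3", "Claude 3.5"]

# Rows: hypothetical prompting variants for illustration.
prompt_variants = [
    "Zero-shot: answer the question.",
    "Few-shot: here are two examples, now answer.",
    "Chain-of-thought: think step by step, then answer.",
]

def score(model: str, prompt: str) -> float:
    """Placeholder metric. A real harness would send the prompt to
    the model and evaluate the response (accuracy, rubric score, etc.)."""
    return round((len(model) * len(prompt)) % 100 / 100, 2)

# Fill the grid: one score per (prompt variant, model) cell.
grid = {
    (variant, model): score(model, variant)
    for variant, model in product(prompt_variants, models)
}

# Print a simple table: rows = prompt variants, columns = models.
print("variant".ljust(12) + "".join(m.rjust(12) for m in models))
for i, variant in enumerate(prompt_variants):
    row = f"variant {i}".ljust(12)
    row += "".join(str(grid[(variant, m)]).rjust(12) for m in models)
    print(row)
```

Each cell of the grid then holds one evaluation result, so you can compare across models (columns) or across prompting strategies (rows).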