From the course: LLaMa for Developers
Using a vendor for serving LLaMA - Llama Tutorial
- [Instructor] So far, we've worked with Llama both in a chat setting on Hugging Face and running it locally in a Colab. In this video, we're going to discuss how we can use a vendor to access Llama. So let's go into a GitHub project. I'm here on the ray-project's llmperf-leaderboard. What this repo does is benchmark how quickly Llama models are served by different vendors. As you can see here, eight different vendors are benchmarked, and all of them offer a solution for running Llama. More recently, the Groq set of chips and APIs has been shown to be among the fastest in the industry, so I'm going to go ahead and try those out. I'm going to open up console.groq.com and hit Login with Google. Currently, GroqCloud is free, but it has a pretty low rate limit, so it probably shouldn't be used in production. So let's go ahead and create an API key by clicking on API Keys on the left. I have one here already,…
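Once you have an API key from the console, calling Llama through Groq is a plain HTTPS request, since GroqCloud exposes an OpenAI-compatible chat-completions endpoint. Here's a minimal sketch using only the Python standard library; the endpoint URL is Groq's documented one, but the model name `llama3-8b-8192` is an assumption — check the console for the models currently offered:

```python
# Sketch: single-turn chat completion against GroqCloud's
# OpenAI-compatible endpoint. Assumes GROQ_API_KEY is set in the
# environment; the model name below may change, so verify it in
# the Groq console before relying on it.
import json
import os
import urllib.request

GROQ_URL = "https://api.groq.com/openai/v1/chat/completions"


def build_chat_request(prompt: str, model: str = "llama3-8b-8192") -> dict:
    """Build the JSON body for a single-turn chat completion."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }


def ask_llama(prompt: str) -> str:
    """Send the prompt to Groq and return the model's reply text."""
    body = json.dumps(build_chat_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        GROQ_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {os.environ['GROQ_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        # The response follows the OpenAI schema: the reply text lives
        # under choices[0].message.content.
        return json.load(resp)["choices"][0]["message"]["content"]


# Usage (requires a valid key and network access):
#   print(ask_llama("Explain quantization in one sentence."))
```

Because the request and response shapes match the OpenAI API, the official `openai` or `groq` Python clients can be dropped in later with only the base URL and key changed.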
Contents
- (Locked) Resources required to serve LLaMA (4m 35s)
- (Locked) Quantizing LLaMA (4m 7s)
- (Locked) Using TGI for serving LLaMA (2m 40s)
- (Locked) Using vLLM for serving LLaMA (5m 27s)
- (Locked) Using DeepSpeed for serving LLaMA (4m 13s)
- (Locked) Explaining LoRA and S-LoRA (1m 59s)
- (Locked) Using a vendor for serving LLaMA (3m 16s)