Retrieval Augmented Generation in Practice
Scalable GenAI platforms with k8s, LangChain, HuggingFace and Vector DBs
Mihai Criveti, Principal Architect, CKA, RHCA III
September 7, 2023
Large Language Models and their Limitations
Retrieval Augmented Generation or Conversation with your Documents
Introduction
Mihai Criveti, Principal Architect, Platform Engineering
• Responsible for large scale Cloud Native and AI Solutions
• Red Hat Certified Architect III, CKA/CKS/CKAD
• Driving the development of Retrieval Augmented Generation platforms and
solutions for Generative AI at IBM that leverage WatsonX, Vector databases,
LangChain, HuggingFace and open source AI models.
Abstract
• Large Language Models: use cases and limitations
• Scaling Large Language Models: Retrieval Augmented Generation
• LLAMA2, HuggingFace TGIS, SentenceTransformers, Python, LangChain, Weaviate,
ChromaDB vector databases, deployment to k8s
Large Language Models and
their Limitations
GenAI and Large Language Models Explained
Think of LLMs like mathematical functions, or your phone’s autocomplete
f(x) = x'
• Where the input (x) and the output (x') are strings
A more accurate representation
f(training_data, model_parameters, input_string) = output_string
• training_data represents the data the model was trained on.
• model_parameters represent things like “temperature”
• input_string is the combination of prompt and context you give to the model. Ex:
“What is Kubernetes” or “Summarize the following document: ${DOCUMENT}”
• the ‘prompt’ is usually an instruction like “summarize”, “extract”, “translate” or
“classify”, but more complex prompts are common, e.g. “Be a helpful assistant that
responds to my question...”
• The function can process a maximum of TOKEN_LIMIT (total input and output),
usually ~4096 tokens (~3000 words in English; fewer in, say, Japanese).
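To make the function analogy concrete, here is a minimal sketch of a single LLM call with the
HuggingFace transformers pipeline; the model (gpt2 as a small stand-in), prompt and parameters
are illustrative assumptions, not from the talk:

    # pip install transformers torch
    from transformers import pipeline

    # "model_parameters": sampling temperature and a cap on generated tokens
    generate = pipeline("text-generation", model="gpt2")

    input_string = "Summarize the following document: Kubernetes is an open source container orchestrator..."
    result = generate(input_string, max_new_tokens=50, do_sample=True, temperature=0.7)

    print(result[0]["generated_text"])  # output_string: the prompt plus the model's continuation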
What Large Language Models DON’T DO
Learn
A model will not ‘learn’ from interactions (unless specifically trained/fine-tuned).
Remember
A model doesn’t remember previous prompts. In fact, it’s all done with prompt trickery:
previous prompts are injected. The API does a LOT of filtering and heavy lifting!
Reason
Think of LLMs like your phone’s autocomplete: they don’t reason or do math.
Use your data
LLMs don’t provide responses based on YOUR data (databases or files), unless it’s
included in the training dataset or the prompt (e.g. RAG).
Use the Internet
• An LLM doesn’t have the capacity to ‘search the internet’ or make API calls.
• In fact, a model does not perform any activity other than converting one string of text
into another string of text.
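A minimal sketch of the “prompt trickery” behind conversational memory: the application, not
the model, keeps the history and re-injects it into every prompt (names and format are
illustrative):

    history = []  # list of (user, assistant) turns kept by YOUR application, not the model

    def build_prompt(history, new_question):
        lines = ["Be a helpful assistant that responds to my question."]
        for user, assistant in history:
            lines += [f"User: {user}", f"Assistant: {assistant}"]
        lines += [f"User: {new_question}", "Assistant:"]
        return "\n".join(lines)  # the model only ever sees this one string

    prompt = build_prompt(history, "What is Kubernetes?")
    # answer = llm(prompt)                              # call your model endpoint
    # history.append(("What is Kubernetes?", answer))   # "memory" is just this list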
GenAI Use Cases
Figure 1: Use Cases
Think of adding this to your architecture
In fact, even a 9600 baud modem is much faster. Think teletype!
LLMs are really slow
• With WPM = ((BPS / 10) / 5) * 60, a 9600 baud modem delivers 11520 words /
minute.
• At an average 30 tokens / second (20 words) for LLAMA-70B, you’re getting 1200
words / minute!
• This is slower than a punch card reader :-)
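The arithmetic behind that comparison, as a quick sketch (the tokens-per-word ratio is an
approximation for English):

    BAUD = 9600
    modem_wpm = ((BAUD / 10) / 5) * 60           # ~10 bits per character on the wire, ~5 characters per word
    tokens_per_second = 30                        # rough figure for a 70B model
    llm_wpm = tokens_per_second * (2 / 3) * 60    # ~3 tokens per 2 English words

    print(modem_wpm, llm_wpm)                     # 11520.0 vs 1200.0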
LLMs are also expensive to run
• Running your own LLAMA2 70B might cost as much as $20K / month if you’re using a
dedicated GPU instance!
Model Limitations in Practice
Latency and Bandwidth: Tokens per second
• Large models (70B) such as LLAMA2 can be painfully slow
• Smaller models (20B, 13B, 7B) are faster, and can perform inference on a cheaper
GPU (less VRAM)
• Even so, models will perform at anywhere between 10 and 100 tokens / second.
Token Limits
• Models have a token limit as well
• Usually 4096 tokens (roughly ~3000 words) of total input and output
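Checking a prompt against the token budget before sending it is straightforward; a sketch using
a HuggingFace tokenizer (the tokenizer name and limits are illustrative, and every model
tokenizes differently):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")   # stand-in; use your model's own tokenizer
    TOKEN_LIMIT = 4096
    RESERVED_FOR_OUTPUT = 512                            # leave room for the generated answer

    prompt = "Summarize the following document: " + open("doc.txt").read()
    n_tokens = len(tokenizer.encode(prompt))

    if n_tokens > TOKEN_LIMIT - RESERVED_FOR_OUTPUT:
        print(f"Prompt too long ({n_tokens} tokens): trim it, or retrieve less context")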
Getting LLMs to work with our data
Training
• Very expensive, takes a long time
Fine Tuning
• Expensive, takes considerable time as well, but achievable
Retrieval Augmented Generation
• Insert your data into prompts every time
• Cheap, and can work with vast amounts of data
• While LLMs are SLOW, Vector Databases are FAST!
• Can help overcome model limitations (such as token limits) - as you’re only feeding
‘top search results’ to the LLM, instead of whole documents.
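A minimal sketch of that RAG flow with LangChain, SentenceTransformers embeddings and ChromaDB;
the model names, documents and prompt are assumptions for illustration, and the import paths
follow LangChain’s 2023 releases:

    # pip install langchain chromadb sentence-transformers
    from langchain.embeddings import HuggingFaceEmbeddings
    from langchain.vectorstores import Chroma

    # 1. Index your data once: the vector DB side is fast and cheap
    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
    texts = ["Kubernetes is an open source container orchestrator.",
             "Weaviate and ChromaDB are vector databases."]
    db = Chroma.from_texts(texts, embeddings)

    # 2. At question time, inject only the top search results into the prompt
    question = "What is Kubernetes?"
    top_docs = db.similarity_search(question, k=2)
    context = "\n".join(doc.page_content for doc in top_docs)

    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
    # answer = llm(prompt)   # send to LLAMA2 / watsonx.ai / any text-generation endpoint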
Retrieval Augmented
Generation or Conversation with
your Documents
RAG Explained
Figure 3: RAG Explained
Loading Documents
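The loading step typically splits documents into chunks before embedding, so that only small,
relevant pieces end up in the prompt; a sketch with LangChain (loader, file name and chunk
sizes are illustrative assumptions):

    # pip install langchain pypdf
    from langchain.document_loaders import PyPDFLoader
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    pages = PyPDFLoader("architecture.pdf").load()      # one Document per page
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    chunks = splitter.split_documents(pages)             # chunks are then embedded and stored in the vector DB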
Scaling factor for RAG
• Vector Database: consider sharding and High Availability
• Fine Tuning: collecting data to be used for fine tuning
• Governance and Model Benchmarking: how are you testing your model performance
over time, with different prompts, one-shot, and various parameters
• Chain of Reasoning and Agents
• Caching embeddings and responses
• Personalization and Conversational Memory Database
• Streaming Responses and optimizing performance. A fine-tuned 13B model may
perform better than a poor 70B one!
• Calling 3rd party functions or APIs for reasoning or other types of data (e.g. LLMs are
terrible at reasoning and prediction; consider calling other models)
• Fallback techniques: fall back to a different model, or to default answers (see the
sketch after this list)
• API scaling techniques, rate limiting, etc.
• Async, streaming and parallelization, multiprocessing, GPU acceleration (including
embeddings), generating your API using OpenAPI, etc.
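As a sketch of the fallback technique above (the model callables are placeholders wrapping
whatever inference endpoints you run):

    def answer(question, context, models):
        """Try each model endpoint in order; return a default answer if all fail."""
        prompt = f"Context:\n{context}\n\nQuestion: {question}"
        for call_model in models:         # e.g. [llama2_70b, llama2_13b], placeholder callables
            try:
                return call_model(prompt)
            except Exception:             # timeout, rate limit, endpoint down: try the next one
                continue
        return "Sorry, I can't answer that right now."   # default answer as the last fallback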
Contact
This talk can be found on GitHub
• https://github.com/crivetimihai/shipitcon-scaling-retrieval-augmented-generation
Social media
• https://twitter.com/CrivetiMihai - follow for more LLM content
• https://youtube.com/CrivetiMihai - more LLM videos to follow
• https://www.linkedin.com/in/crivetimihai/
