INSTRUCTOR-LED COURSE

Deploying Small Language Models (LFWS307)

Prepare for high-impact roles in MLOps and AI infrastructure by mastering real-world small language model deployment. Deploy SLMs across laptop, server, edge, and browser environments using Hugging Face, llamafile, and PAIML.

Who Is It For

For MLOps Engineers, Backend Engineers, Platform Engineers, and developers deploying AI in real environments who need a portable, production-ready approach to running small language models across laptop, server, edge, and browser targets.
What You’ll Learn

Learn how to deploy small language models end to end, from sourcing and packaging models to serving, scaling, and monitoring production workloads, using Hugging Face, llamafile, and the PAIML Rust stack. Topics include RAG pipelines, streaming APIs, browser-based WASM deployment, and observability.
What It Prepares You For

Position yourself for emerging AI career opportunities by mastering end-to-end SLM deployment across server, edge, and browser environments and building scalable, cost-efficient AI with Phi, Gemma, Llama, Qwen, and Mistral.
Course Outline
Course Introduction
Hugging Face Model Ecosystem
Lab 2.1. Download Phi-3-mini and Qwen2.5-1.5B. Compare model cards, licenses, and file sizes. Convert safetensors to GGUF.
Llamafile: Zero-Dependency Deployment
Lab 3.1. Create llamafile from Phi-3-mini GGUF. Test CLI completion and HTTP API. Benchmark tokens/sec on CPU vs GPU.
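For the benchmarking step, throughput can be computed from any token iterator, such as a streamed llamafile response. A minimal sketch (the helper name is illustrative, not part of the course materials):

```python
import time

def tokens_per_second(token_stream):
    """Consume a token iterator and report (count, tokens/sec)."""
    start = time.perf_counter()
    count = sum(1 for _ in token_stream)
    elapsed = time.perf_counter() - start
    return count, count / elapsed if elapsed > 0 else float("inf")

# Demo with a fake stream; swap in real streamed tokens to compare
# CPU vs GPU runs of the same prompt.
count, tps = tokens_per_second(iter(["Hello", ",", " world"]))
print(f"{count} tokens at {tps:.1f} tok/s")
```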
Quantization with llama.cpp
Lab 4.1. Quantize Qwen2.5-1.5B to Q4/Q5/Q8. Benchmark size, speed, and perplexity. Select optimal quantization for 8GB RAM target.
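The size trade-off in this lab can be estimated before quantizing. The bits-per-weight figures below are rough averages for llama.cpp quant types (actual GGUF sizes vary per tensor), so treat the numbers as ballpark assumptions:

```python
# Rough GGUF size estimate: parameters * bits-per-weight / 8.
BITS_PER_WEIGHT = {"Q4_K_M": 4.5, "Q5_K_M": 5.5, "Q8_0": 8.5, "F16": 16.0}

def gguf_size_gb(params: float, quant: str) -> float:
    return params * BITS_PER_WEIGHT[quant] / 8 / 1e9

params = 1.54e9  # Qwen2.5-1.5B
for quant in ("Q4_K_M", "Q5_K_M", "Q8_0", "F16"):
    print(f"{quant}: ~{gguf_size_gb(params, quant):.2f} GB")
```

Even Q8_0 leaves ample headroom for the 8GB RAM target; the lab's perplexity benchmarks then decide how far down to push.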
Llamafile HTTP Serving
Lab 5.1. Deploy llamafile server. Build Python/curl client. Test streaming completions. Load test with 10 concurrent users.
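llamafile's built-in server exposes an OpenAI-compatible HTTP API, so a Python client can be written with only the standard library. A minimal sketch, assuming the default port 8080; the `build_payload` and `complete` helper names are illustrative:

```python
import json
import urllib.request

def build_payload(prompt: str, stream: bool = True) -> dict:
    """OpenAI-style chat payload accepted by llamafile's server."""
    return {
        "model": "local",  # llamafile serves a single local model
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }

def complete(prompt: str, base_url: str = "http://localhost:8080") -> str:
    """Non-streaming completion against a running llamafile server."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(prompt, stream=False)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Usage, with a llamafile server running locally:
# print(complete("Name three uses of small language models."))
```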
Production Serving with Batuta
Lab 6.1. Build Batuta serving pipeline. Compare latency vs llamafile. Achieve <100ms p99 with continuous batching.
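The <100ms p99 target means 99% of requests must finish within 100ms. A nearest-rank percentile over collected latencies is enough to check it; a self-contained sketch with simulated samples:

```python
import math

def p99(latencies_ms):
    """Nearest-rank p99: smallest value >= 99% of samples."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.99 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]

# 1000 simulated request latencies: 990 fast, a 10-request slow tail.
samples = [50 + (i % 40) for i in range(990)] + \
          [120, 150, 180, 200, 220, 250, 300, 350, 400, 500]
print(f"p99 = {p99(samples)} ms")  # under the 100 ms target
```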
RAG with Patcha + Hugging Face Embeddings
Lab 7.1. Index 1000 docs using all-MiniLM-L6-v2 embeddings. Build RAG pipeline with Phi-3. Compare RAG vs pure generation accuracy.
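The retrieval core of the lab is cosine similarity between a query embedding and document embeddings. A toy sketch with hand-made 3-d vectors (a real pipeline would use 384-d all-MiniLM-L6-v2 embeddings; the `top_k` helper is illustrative):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, doc_vecs, k=2):
    """Indices of the k document vectors most similar to the query."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Toy "embeddings": docs 0 and 1 point roughly the same way as the query.
docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
query = [1.0, 0.05, 0.0]
print(top_k(query, docs))  # → [0, 1]
```

The retrieved chunks are then prepended to the Phi-3 prompt, which is what the lab's RAG-vs-pure-generation comparison measures.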
Edge Deployment
Lab 8.1. Deploy Q4 quantized model to ARM device (or emulator). Achieve interactive inference with 4GB RAM constraint.
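Whether the model fits the 4GB constraint can be estimated up front from weights plus KV cache. The architecture numbers below are rough assumptions for Phi-3-mini (~3.8B params, 32 layers, hidden size 3072, fp16 KV cache), not exact figures:

```python
# Back-of-envelope memory budget for a Q4 model on a 4 GB edge device.
PARAMS = 3.8e9
LAYERS, HIDDEN = 32, 3072
BITS_PER_WEIGHT = 4.5  # rough average for a Q4 K-quant

def budget_gb(context_len: int) -> float:
    weights = PARAMS * BITS_PER_WEIGHT / 8            # bytes
    kv_cache = 2 * LAYERS * HIDDEN * 2 * context_len  # K+V, fp16
    return (weights + kv_cache) / 1e9

for ctx in (4096, 2048, 1024):
    print(f"context {ctx}: ~{budget_gb(ctx):.2f} GB")
```

Under these assumptions a 4096-token context leaves almost no headroom for the OS, so shrinking the context window is one practical lever for meeting the constraint.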
Browser Deployment with Presentar
Lab 9.1. Deploy Phi-3 Q4 to browser via Presentar. Achieve <500ms first-token latency. Build chat interface with streaming.
Monitoring with Entrenar
Kubernetes Deployment
Capstone: Multi-Target Deployment
Course Summary

Prerequisites
Knowledge/Skills Prerequisites:

Learners should have Linux command line proficiency, a basic understanding of large language models (including prompts, tokens, and inference), and familiarity with HTTP/REST API concepts. Recommended but not required: basic Rust knowledge (helpful for customizing the PAIML stack) and Docker fundamentals (useful for understanding container-based alternatives).

Lab Environment Prerequisites:

  • Linux/macOS/WSL2
  • 16GB RAM, 50GB disk
  • Optional: NVIDIA GPU 8GB+ VRAM