INSTRUCTOR-LED COURSE

Deploying Small Language Models (LFWS307)

Prepare for high-impact roles in MLOps and AI infrastructure by mastering real-world small language model deployment. Deploy SLMs across laptop, server, edge, and browser environments using Hugging Face, llamafile, and PAIML.

Who Is It For

For MLOps Engineers, Backend Engineers, Platform Engineers, and developers deploying AI in real environments who need a portable, production-ready approach to running small language models across laptop, server, edge, and browser targets.
What You’ll Learn

Learn how to deploy small language models end to end, from sourcing and packaging models to serving, scaling, and monitoring production workloads, using Hugging Face, llamafile, and the PAIML Rust stack. Topics include RAG pipelines, streaming APIs, browser-based WASM deployment, and observability.
What It Prepares You For

Position yourself for emerging AI career opportunities by mastering end-to-end SLM deployment across server, edge, and browser environments and building scalable, cost-efficient AI with Phi, Gemma, Llama, Qwen, and Mistral.
Course Outline
Course Introduction
Hugging Face Model Ecosystem
Lab 2.1. Download Phi-3-mini and Qwen2.5-1.5B. Compare model cards, licenses, and file sizes. Convert safetensors to GGUF.
Llamafile: Zero-Dependency Deployment
Lab 3.1. Create llamafile from Phi-3-mini GGUF. Test CLI completion and HTTP API. Benchmark tokens/sec on CPU vs GPU.
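For the benchmarking step, throughput can be computed from any token iterator, such as a streamed llamafile response. A minimal sketch (the helper name is illustrative, not part of the course materials):

```python
import time

def tokens_per_second(token_stream):
    """Consume a token iterator and report (count, tokens/sec)."""
    start = time.perf_counter()
    count = sum(1 for _ in token_stream)
    elapsed = time.perf_counter() - start
    return count, count / elapsed if elapsed > 0 else float("inf")

# Demo with a fake stream; swap in real streamed tokens to compare
# CPU vs GPU runs of the same prompt.
count, tps = tokens_per_second(iter(["Hello", ",", " world"]))
print(f"{count} tokens at {tps:.1f} tok/s")
```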
Quantization with llama.cpp
Lab 4.1. Quantize Qwen2.5-1.5B to Q4/Q5/Q8. Benchmark size, speed, and perplexity. Select optimal quantization for 8GB RAM target.
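The size trade-off in this lab can be estimated before quantizing. The bits-per-weight figures below are rough averages for llama.cpp quant types (actual GGUF sizes vary per tensor), so treat the numbers as ballpark assumptions:

```python
# Rough GGUF size estimate: parameters * bits-per-weight / 8.
BITS_PER_WEIGHT = {"Q4_K_M": 4.5, "Q5_K_M": 5.5, "Q8_0": 8.5, "F16": 16.0}

def gguf_size_gb(params: float, quant: str) -> float:
    return params * BITS_PER_WEIGHT[quant] / 8 / 1e9

params = 1.54e9  # Qwen2.5-1.5B
for quant in ("Q4_K_M", "Q5_K_M", "Q8_0", "F16"):
    print(f"{quant}: ~{gguf_size_gb(params, quant):.2f} GB")
```

Even Q8_0 leaves ample headroom for the 8GB RAM target; the lab's perplexity benchmarks then decide how far down to push.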
Llamafile HTTP Serving
Lab 5.1. Deploy llamafile server. Build Python/curl client. Test streaming completions. Load test with 10 concurrent users.
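llamafile's built-in server exposes an OpenAI-compatible HTTP API, so a Python client can be written with only the standard library. A minimal sketch, assuming the default port 8080; the `build_payload` and `complete` helper names are illustrative:

```python
import json
import urllib.request

def build_payload(prompt: str, stream: bool = True) -> dict:
    """OpenAI-style chat payload accepted by llamafile's server."""
    return {
        "model": "local",  # llamafile serves a single local model
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }

def complete(prompt: str, base_url: str = "http://localhost:8080") -> str:
    """Non-streaming completion against a running llamafile server."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(prompt, stream=False)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Usage, with a llamafile server running locally:
# print(complete("Name three uses of small language models."))
```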
Production Serving with Batuta
Lab 6.1. Build Batuta serving pipeline. Compare latency vs llamafile. Achieve <100ms p99 with continuous batching.
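The <100ms p99 target means 99% of requests must finish within 100ms. A nearest-rank percentile over collected latencies is enough to check it; a self-contained sketch with simulated samples:

```python
import math

def p99(latencies_ms):
    """Nearest-rank p99: smallest value >= 99% of samples."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.99 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]

# 1000 simulated request latencies: 990 fast, a 10-request slow tail.
samples = [50 + (i % 40) for i in range(990)] + \
          [120, 150, 180, 200, 220, 250, 300, 350, 400, 500]
print(f"p99 = {p99(samples)} ms")  # under the 100 ms target
```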
RAG with Patcha + Hugging Face Embeddings
Lab 7.1. Index 1000 docs using all-MiniLM-L6-v2 embeddings. Build RAG pipeline with Phi-3. Compare RAG vs pure generation accuracy.
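The retrieval core of the lab is cosine similarity between a query embedding and document embeddings. A toy sketch with hand-made 3-d vectors (a real pipeline would use 384-d all-MiniLM-L6-v2 embeddings; the `top_k` helper is illustrative):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, doc_vecs, k=2):
    """Indices of the k document vectors most similar to the query."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Toy "embeddings": docs 0 and 1 point roughly the same way as the query.
docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
query = [1.0, 0.05, 0.0]
print(top_k(query, docs))  # → [0, 1]
```

The retrieved chunks are then prepended to the Phi-3 prompt, which is what the lab's RAG-vs-pure-generation comparison measures.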
Edge Deployment
Lab 8.1. Deploy Q4 quantized model to ARM device (or emulator). Achieve interactive inference with 4GB RAM constraint.
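Whether the model fits the 4GB constraint can be estimated up front from weights plus KV cache. The architecture numbers below are rough assumptions for Phi-3-mini (~3.8B params, 32 layers, hidden size 3072, fp16 KV cache), not exact figures:

```python
# Back-of-envelope memory budget for a Q4 model on a 4 GB edge device.
PARAMS = 3.8e9
LAYERS, HIDDEN = 32, 3072
BITS_PER_WEIGHT = 4.5  # rough average for a Q4 K-quant

def budget_gb(context_len: int) -> float:
    weights = PARAMS * BITS_PER_WEIGHT / 8            # bytes
    kv_cache = 2 * LAYERS * HIDDEN * 2 * context_len  # K+V, fp16
    return (weights + kv_cache) / 1e9

for ctx in (4096, 2048, 1024):
    print(f"context {ctx}: ~{budget_gb(ctx):.2f} GB")
```

Under these assumptions a 4096-token context leaves almost no headroom for the OS, so shrinking the context window is one practical lever for meeting the constraint.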
Browser Deployment with Presentar
Lab 9.1. Deploy Phi-3 Q4 to browser via Presentar. Achieve <500ms first-token latency. Build chat interface with streaming.
Monitoring with Entrenar
Kubernetes Deployment
Capstone: Multi-Target Deployment
Course Summary

Prerequisites
Knowledge/Skills Prerequisites:

Learners should have Linux command line proficiency, a basic understanding of large language models (including prompts, tokens, and inference), and familiarity with HTTP/REST API concepts. Recommended but not required: basic Rust knowledge (helpful for customizing the PAIML stack) and Docker fundamentals (useful for understanding container-based alternatives).

Lab Environment Prerequisites:

  • Linux/macOS/WSL2
  • 16GB RAM, 50GB disk
  • Optional: NVIDIA GPU 8GB+ VRAM