A smart semantic cache for high-scale GenAI workloads.
Note: v0.2.0 is now available! This version adds support for the Mistral and Claude providers, along with significant improvements in stability, performance, and configuration.
In production, a large percentage of LLM requests are repetitive:
- RAG applications: Variations of the same employee questions
- AI Agents: Repeated reasoning steps or tool calls
- Support Bots: Thousands of similar customer queries
Every redundant request means extra token cost and extra latency.
Why pay your LLM provider multiple times for the same answer?
PromptCache is a lightweight middleware that sits between your application and your LLM provider. It uses semantic understanding to detect when a new prompt has the same intent as a previous one, and returns the cached result instantly.
| Metric | Without Cache | With PromptCache | Benefit |
|---|---|---|---|
| Cost per 1,000 Requests | ~$30 | ~$6 | Lower cost |
| Avg Latency | ~1.5s | ~300ms | Faster UX |
| Throughput | Provider rate-limited | Cache hits bypass provider rate limits | Better scale |
Numbers vary per model, but the pattern holds across real workloads: semantic caching dramatically reduces cost and latency.
* Results may vary depending on model, usage patterns, and configuration.
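The arithmetic behind the table is simple: you only pay the provider for cache misses, so savings scale with your hit rate. Here is a quick Python sketch using a hypothetical 80% hit rate and a hypothetical $0.03-per-request price (neither number is a measurement):
# Rough cost model: only cache misses are billed by the LLM provider.
# The hit rate and per-request price below are hypothetical, not measurements.
num_requests = 1_000
cost_per_request = 0.03   # ~$30 per 1,000 uncached requests
hit_rate = 0.80           # fraction of requests answered from the cache

cost_without_cache = num_requests * cost_per_request
cost_with_cache = num_requests * (1 - hit_rate) * cost_per_request
print(f"Without cache: ${cost_without_cache:.2f}")   # $30.00
print(f"With cache:    ${cost_with_cache:.2f}")      # $6.00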
Naive semantic caches can be risky: they may return incorrect answers when prompts look similar but differ in intent.
PromptCache uses a two-stage verification strategy to ensure accuracy:
- High similarity → direct cache hit
- Low similarity → clear miss; skip the cache and call the LLM
- Gray zone → intent check using a small, cheap verification model
This ensures cached responses are semantically correct, not just "close enough".
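Conceptually, the lookup decision works like the minimal Python sketch below. This is an illustration of the dual-threshold idea, not the actual Go implementation; the threshold values mirror the defaults described in the configuration section further down.
# Defaults documented in the configuration section below.
HIGH_THRESHOLD = 0.70   # CACHE_HIGH_THRESHOLD
LOW_THRESHOLD = 0.30    # CACHE_LOW_THRESHOLD

def cache_decision(similarity, same_intent):
    """Decide how to treat the closest cached prompt.

    similarity  -- cosine similarity between the new prompt and the cached one
    same_intent -- zero-argument callable that asks a small verification model
                   whether the two prompts ask for the same thing
    """
    if similarity >= HIGH_THRESHOLD:
        return "hit"       # confident match: return the cached answer
    if similarity < LOW_THRESHOLD:
        return "miss"      # clearly different: forward to the LLM
    # Gray zone: the score alone is ambiguous, so verify intent first.
    return "hit" if same_intent() else "miss"

# Example: 0.55 falls in the gray zone, so the verifier gets the final say.
print(cache_decision(0.55, same_intent=lambda: True))   # -> "hit"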
PromptCache works as a drop-in replacement for the OpenAI API.
# Clone the repo
git clone https://github.com/messkan/prompt-cache.git
cd prompt-cache
# Set your embedding provider (default: openai)
export EMBEDDING_PROVIDER=openai # Options: openai, mistral, claude
# Set your provider API key(s)
export OPENAI_API_KEY=your_key_here
# export MISTRAL_API_KEY=your_key_here
# export ANTHROPIC_API_KEY=your_key_here
# export VOYAGE_API_KEY=your_key_here # Required for Claude embeddings
# Run with Docker Compose
docker-compose up -d
To run from source instead:
# Clone the repository
git clone https://github.com/messkan/prompt-cache.git
cd prompt-cache
# Set environment variables
export EMBEDDING_PROVIDER=openai
export OPENAI_API_KEY=your-openai-api-key
# Option 1: Use the run script (recommended)
./scripts/run.sh
# Option 2: Use Make
make run
# Option 3: Build and run manually
go build -o prompt-cache cmd/api/main.go
./prompt-cache
Test the performance and cache effectiveness:
# Set your API keys first
export OPENAI_API_KEY=your-openai-api-key
# Run the full benchmark suite (HTTP-based)
./scripts/benchmark.sh
# or
make benchmark
# Run Go micro-benchmarks (unit-level performance)
go test ./internal/semantic/... -bench=. -benchmem
# or
make bench-go
Example benchmark results:
BenchmarkCosineSimilarity-12 2593046 441.0 ns/op 0 B/op 0 allocs/op
BenchmarkFindSimilar-12 50000 32000 ns/op 2048 B/op 45 allocs/op
make help # Show all available commands
make build # Build the binary
make test # Run unit tests
make benchmark # Run full benchmark suite
make clean # Clean build artifacts
make docker-build # Build Docker image
make docker-run # Run with Docker Compose
Simply change the base_url in your SDK:
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8080/v1", # Point to PromptCache
api_key="your-openai-api-key"
)
# First request β goes to the LLM provider
client.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": "Explain quantum physics"}]
)
# Semantically similar request β served from PromptCache
client.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": "How does quantum physics work?"}]
)
No code changes. Just point your client to PromptCache.
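A simple way to see the cache in action is to time the two calls above. The snippet below is an illustrative sketch that reuses the same client setup and example prompts; the absolute numbers depend on your model and network.
import time
from openai import OpenAI

# Same client setup as above: only the base_url points at PromptCache.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="your-openai-api-key")

def timed_request(prompt):
    """Send one chat completion and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return time.perf_counter() - start

print(f"First call:  {timed_request('Explain quantum physics'):.2f}s")         # forwarded to the provider
print(f"Second call: {timed_request('How does quantum physics work?'):.2f}s")  # semantic cache hit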
PromptCache supports multiple AI providers for embeddings and semantic verification. Select your provider using the EMBEDDING_PROVIDER environment variable.
export EMBEDDING_PROVIDER=openai # Options: openai, mistral, claude
If not specified, OpenAI is used by default.
export EMBEDDING_PROVIDER=openai
export OPENAI_API_KEY=your-openai-api-key
- Embedding Model: text-embedding-3-small
- Verification Model: gpt-4o-mini
export EMBEDDING_PROVIDER=mistral
export MISTRAL_API_KEY=your_mistral_key
- Embedding Model: mistral-embed
- Verification Model: mistral-small-latest
export EMBEDDING_PROVIDER=claude
export ANTHROPIC_API_KEY=your_anthropic_key
export VOYAGE_API_KEY=your_voyage_key # Required for embeddings
- Embedding Model: voyage-3 (via Voyage AI)
- Verification Model: claude-3-haiku-20240307
Note: Claude uses Voyage AI for embeddings as recommended by Anthropic. You'll need both API keys.
The provider is automatically selected at startup based on the EMBEDDING_PROVIDER environment variable. Simply set the variable and restart the service:
# Switch to Mistral
export EMBEDDING_PROVIDER=mistral
export MISTRAL_API_KEY=your_key
docker-compose restart
# Switch to Claude
export EMBEDDING_PROVIDER=claude
export ANTHROPIC_API_KEY=your_key
export VOYAGE_API_KEY=your_voyage_key
docker-compose restart
Fine-tune the semantic cache behavior with these optional environment variables:
# High threshold: Direct cache hit (default: 0.70)
# Scores >= this value return cached results immediately
export CACHE_HIGH_THRESHOLD=0.70
# Low threshold: Clear miss (default: 0.30)
# Scores < this value skip the cache entirely
export CACHE_LOW_THRESHOLD=0.30
Recommended ranges:
- High threshold: 0.65-0.85 (higher = stricter matching)
- Low threshold: 0.25-0.40 (lower = more aggressive caching)
- Always ensure CACHE_HIGH_THRESHOLD > CACHE_LOW_THRESHOLD
# Enable/disable LLM verification for gray zone scores (default: true)
# Gray zone = scores between low and high thresholds
export ENABLE_GRAY_ZONE_VERIFIER=true # Accepts: true/false, 1/0, yes/no
When to disable:
- Cost optimization (skip verification API calls)
- Speed priority (accept slightly lower accuracy)
- Prompts are highly standardized
Keep enabled for:
- Production environments requiring high accuracy
- Varied prompt patterns
- Critical applications where wrong answers are costly
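The trade-off is easy to estimate: verification adds at most one extra call to the small model per gray-zone request. A rough sketch, with purely hypothetical traffic and pricing numbers:
# All numbers below are hypothetical -- substitute your own measurements.
requests_per_day = 10_000
gray_zone_rate = 0.10            # share of requests whose similarity lands between the thresholds
verifier_cost_per_call = 0.0002  # assumed price of one small-model intent check

extra_daily_cost = requests_per_day * gray_zone_rate * verifier_cost_per_call
print(f"Estimated extra verification cost: ${extra_daily_cost:.2f}/day")   # $0.20/day in this example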
Change the embedding provider at runtime without restarting the service.
Check the current provider:
curl http://localhost:8080/v1/config/provider
Response:
{
"provider": "openai",
"available_providers": ["openai", "mistral", "claude"]
}
Switch to a different provider:
curl -X POST http://localhost:8080/v1/config/provider \
-H "Content-Type: application/json" \
-d '{"provider": "mistral"}'Response:
{
"message": "Provider updated successfully",
"provider": "mistral"
}
Use cases:
- A/B testing different providers
- Fallback to alternative providers during outages
- Cost optimization by switching based on load
- Testing provider performance in production
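As an example of the outage-fallback use case, a small watchdog script could combine the two endpoints shown above. Only the /v1/config/provider endpoints come from PromptCache; the health-check function is a hypothetical placeholder you would replace with your own monitoring.
import requests  # third-party HTTP client: pip install requests

PROMPTCACHE_URL = "http://localhost:8080"

def current_provider():
    """Read the active embedding provider from PromptCache."""
    resp = requests.get(f"{PROMPTCACHE_URL}/v1/config/provider", timeout=5)
    resp.raise_for_status()
    return resp.json()["provider"]

def switch_provider(name):
    """Ask PromptCache to switch its embedding/verification provider."""
    resp = requests.post(
        f"{PROMPTCACHE_URL}/v1/config/provider",
        json={"provider": name},
        timeout=5,
    )
    resp.raise_for_status()
    print(resp.json()["message"], "->", name)

def primary_provider_is_down():
    # Hypothetical placeholder: plug in your own health check or monitoring signal.
    return False

if primary_provider_is_down() and current_provider() == "openai":
    switch_provider("mistral")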
Built for speed, safety, and reliability:
- Pure Go implementation (high concurrency, minimal overhead)
- BadgerDB for fast embedded persistent storage
- In-memory caching for ultra-fast responses
- OpenAI-compatible API for seamless integration
Core features:
- Multiple Provider Support: OpenAI, Mistral, and Claude (Anthropic)
- Docker Setup
- In-memory & BadgerDB storage
- Smart semantic verification (dual-threshold + intent check)
- OpenAI API compatibility
New in v0.2.0:
- Multiple Provider Support: Built-in support for OpenAI, Mistral, and Claude (Anthropic)
- Environment-Based Configuration: Dynamic provider selection and cache tuning via environment variables
- Configurable Thresholds: User-definable similarity thresholds with sensible defaults
- Gray Zone Control: Enable/disable LLM verification for cost/speed optimization
- Dynamic Provider Management: Switch providers at runtime via REST API
- Core Improvements: Bug fixes and performance optimizations
- Enhanced Testing: Comprehensive unit tests for all providers and configuration
- Better Documentation: Updated configuration guide for all features
On the roadmap:
- Public API: Enhanced cache management operations (create, delete, list)
- Metrics & Monitoring: Built-in hit rate and latency tracking
- Configuration API: Update thresholds and settings via REST endpoints
- Clustered mode (Raft or gossip-based replication)
- Custom embedding backends (Ollama, local models)
- Rate-limiting & request shaping
- Web dashboard (hit rate, latency, cost metrics)
We are working hard to reach v1.0.0! If you find this project useful, please give it a ⭐ on GitHub and consider contributing. Your support helps us ship v1.0.0 faster!
MIT License.
Complete documentation is available at: https://messkan.github.io/prompt-cache
