Open Source Multimodal Embeddings in Rust
👷‍♂️ Work in progress
- Access to state-of-the-art multimodal embedding models
- Only inference
- High performance on CPU
- Cheap & easy to scale
- Flexible integration
| Status | Model | Feature | Embeddings |
|---|---|---|---|
| Development | OpenAI/Clip-ViT-B-32 | clip_vit_b_32_image | Image |
| Next | OpenAI/Clip-ViT-B-32 | clip_vit_b_32_text | Text |
| - | OpenAI/Clip-ViT-B-32 | clip_vit_b_32 | Image, Text |
| - | ImageBind | | |
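The feature names in the table lend themselves to compile-time gating. The following is a minimal sketch, assuming each table entry is exposed as a Cargo feature of the same name; the `Modality` enum and the `enabled_modalities` function are hypothetical and only illustrate the idea, they are not part of the crate.

```rust
/// Hypothetical enum: the modalities a build can produce embeddings for.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Modality {
    Image,
    Text,
}

/// Lists the modalities compiled into this build.
/// Assumption: the Cargo features carry the names shown in the table.
pub fn enabled_modalities() -> Vec<Modality> {
    let mut out = Vec::new();
    // `cfg!` is evaluated at compile time and yields a plain bool.
    if cfg!(any(feature = "clip_vit_b_32_image", feature = "clip_vit_b_32")) {
        out.push(Modality::Image);
    }
    if cfg!(any(feature = "clip_vit_b_32_text", feature = "clip_vit_b_32")) {
        out.push(Modality::Text);
    }
    out
}
```

A build with none of these features enabled gets an empty list, so a caller can report a clear error instead of failing later.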
- Embed the model config and safetensors files:
  - No need to handle extra files
  - Reduce component size
- Each feature is responsible for generating embeddings for exactly one modality of one model
- Extra feature for the full model (low priority)
- Split safetensors files by modality (todo!: improve the code to automate this)
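One way the per-modality split could be automated is to group tensor names by prefix before writing each group to its own file. The sketch below covers only the grouping step and uses the standard library alone. The prefixes are assumptions modelled on common CLIP checkpoint layouts and may not match the files used here; `modality_of` and `split_by_modality` are hypothetical names.

```rust
use std::collections::BTreeMap;

// Assumed prefixes; real checkpoints may name their tensors differently.
const IMAGE_PREFIXES: &[&str] = &["vision_model.", "visual_projection."];
const TEXT_PREFIXES: &[&str] = &["text_model.", "text_projection."];

/// Classifies a tensor name as "image", "text" or "shared".
pub fn modality_of(name: &str) -> &'static str {
    if IMAGE_PREFIXES.iter().any(|p| name.starts_with(*p)) {
        "image"
    } else if TEXT_PREFIXES.iter().any(|p| name.starts_with(*p)) {
        "text"
    } else {
        // Anything unmatched (e.g. a logit scale) is kept in a shared group.
        "shared"
    }
}

/// Groups tensor names by modality, preserving input order within each group.
pub fn split_by_modality<'a>(names: &[&'a str]) -> BTreeMap<&'static str, Vec<&'a str>> {
    let mut groups: BTreeMap<&'static str, Vec<&'a str>> = BTreeMap::new();
    for &name in names {
        groups.entry(modality_of(name)).or_default().push(name);
    }
    groups
}
```

Each group could then be serialized to a separate safetensors file, so that a single-modality feature embeds only the weights it needs.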
Basic test to get an idea of the current CPU performance. Both implementations were run in the same environment.
Clip python, JIT
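import clip
import torch

device = "cpu"
model_name = "ViT-B/32"
# Load the TorchScript (JIT) build of the model on CPU.
model, transform = clip.load(model_name, device=device, jit=True)
# Dummy batch of 5 RGB images, 224x224.
input_data_b = torch.ones((5, 3, 224, 224), dtype=torch.float32, device=device)
out = model.encode_image(input_data_b)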
Results (times in seconds)
Time elapsed in build_model() is: 1.6573503890000003
Time elapsed in forward() is: 1.101630814
Rust, clip_vit_b_32_image
use std::time::Instant;
// Assumption: Tensor, DType and Device come from candle_core (the calls below match its API).
use candle_core::{DType, Device, Tensor};
use hs_mm_embeddings::clip_vit_b_32_image;

fn main() {
    // Time model construction.
    let start = Instant::now();
    let image_model = clip_vit_b_32_image::model::build_model().unwrap();
    let duration = start.elapsed();
    println!("Time elapsed in build_model() is: {:?}", duration);

    // Dummy batch of 5 RGB images, 224x224.
    let input_img_batch = Tensor::ones((5, 3, 224, 224), DType::F32, &Device::Cpu).unwrap();
    // Time a single forward pass.
    let start = Instant::now();
    let output = image_model.forward(&input_img_batch).unwrap();
    let duration = start.elapsed();
    println!("Time elapsed in forward() is: {:?}", duration);
}
Results
Time elapsed in build_model() is: 483.20213ms
Time elapsed in forward() is: 1.495299234s
Testing different batch sizes
- size 1:
  - Python: 0.5376610519999998 s
  - Rust: 282.812172 ms
- size 2:
  - Python: 0.6142103929999996 s
  - Rust: 941.727512 ms
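Single-run timings like the ones above can be noisy. A small helper that adds a warm-up call and averages over several runs may give steadier numbers when comparing batch sizes. This is a sketch using only the standard library; `time_it` and `mean_duration` are hypothetical helpers, not part of the crate.

```rust
use std::time::{Duration, Instant};

/// Runs `f` once and returns its result together with the elapsed time.
pub fn time_it<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let out = f();
    (out, start.elapsed())
}

/// Calls `f` once as a warm-up, then `runs` more times, and returns the
/// mean duration of the timed calls.
pub fn mean_duration(runs: u32, mut f: impl FnMut()) -> Duration {
    assert!(runs > 0, "runs must be positive");
    f(); // warm-up call, not included in the measurement
    let start = Instant::now();
    for _ in 0..runs {
        f();
    }
    start.elapsed() / runs
}
```

In the benchmark, the closure passed to `mean_duration` would wrap the `forward()` call for a given batch size.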
Observations
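- With batch_size > 1, generating embeddings in Rust is still slower than in Python (e.g. 1.495 s vs 1.102 s for forward() at batch size 5)
- With batch_size = 1, Rust is faster (283 ms vs 538 ms)
- Max difference between output tensors: 8.5831e-6 with JIT=False and 7.6294e-6 with JIT=True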
- Rust executable size (release): 183.7 MB
todo!: check for unnecessary copies or redundant ops
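The output comparison can be reproduced on flattened tensors. The sketch below assumes the reported figure is the largest element-wise absolute difference, which the source does not state explicitly; `max_abs_diff` is a hypothetical helper that works on plain `f32` slices.

```rust
/// Returns the largest element-wise absolute difference between `a` and `b`,
/// or `None` if the slices have different lengths.
pub fn max_abs_diff(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() {
        return None;
    }
    Some(
        a.iter()
            .zip(b)
            .map(|(x, y)| (x - y).abs())
            // Start from 0.0 so that two empty slices compare as identical.
            .fold(0.0_f32, f32::max),
    )
}
```

Both embeddings would first be flattened to one-dimensional `f32` buffers in the same element order before being passed in.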
todo!
- First Clip Model
- Output & Performance evaluation
- Add ImageBind Model
- Examples in Rust
- Examples integration with other languages
- Example Weaviate custom module
- Expand models catalog