From the course: Programming Generative AI: From Variational Autoencoders to Stable Diffusion with PyTorch and Hugging Face


Embedding text and images with CLIP

- [Instructor] So in our lesson on encoding text, at the very end, if you remember, I presented the SentenceTransformers library. Internally, most of those models are trained with an objective similar to the one I just presented for CLIP, essentially a contrastive loss: the model isn't predicting a label, another word, or an image; it's trying to pick out the most similar item in a group, or to predict how similar two separate things are when they're presented together. But that SentenceTransformers embedding model was really text-to-text, we were embedding sentences and comparing them against other sentences. CLIP, on the other hand, as we just saw, is the first multimodal model we'll encounter in these lessons: it embeds text and images in a shared embedding space. And that joint embedding space is very powerful since, as we'll see in this worked example, we can actually use it to do semantic…
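To make the idea of a shared embedding space concrete, here is a minimal sketch using the Hugging Face transformers library's CLIPModel and CLIPProcessor. It assumes the openai/clip-vit-base-patch32 checkpoint and a placeholder image file; the captions and file name are illustrative, not taken from the lesson.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a pretrained CLIP checkpoint (assumption: openai/clip-vit-base-patch32).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

texts = ["a photo of a cat", "a photo of a dog", "a diagram of a neural network"]
image = Image.open("example.jpg")  # placeholder path, not from the lesson

# The processor tokenizes the text and preprocesses the image in one call.
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# Both modalities are projected into the same embedding space.
image_embeds = outputs.image_embeds  # shape (1, 512)
text_embeds = outputs.text_embeds    # shape (3, 512)

# Normalize and take cosine similarity: higher means more semantically similar.
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
similarity = image_embeds @ text_embeds.T
print(similarity)  # one similarity score per caption
```

Because text and image vectors live in the same space, the same cosine-similarity comparison works for text-to-image search, which is what makes semantic search across modalities possible.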
