From the course: Advanced Guide to ChatGPT, Embeddings, and Other Large Language Models (LLMs)


Case study: Visual QA—Setting up a model

- In our case, we are going to be creating our very own visual question-answering system. VQA, for short, is a type of architecture that takes in two types of data: our system will accept both an image and a piece of text, usually a question about the image. It will process the image and text in tandem, combine those two pieces of information, and use cross attention to hand that combined information over to our decoder, in this case the open-source GPT-2, to start answering the question given the image. Now, this type of visual Q&A is seen as a precursor to some of the more, frankly, interesting tasks that researchers and practitioners are expecting from transformer-based architectures. It's all well and good to have text-to-text models like GPT and T5, and BERT to a degree, although it doesn't output text, but the idea is that we can start to cross data modalities, basically saying, well, it doesn't have to be just text. What if there's also an accompanying…
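The core mechanic described above, a GPT-2 decoder that cross-attends to image features while reading the question, can be sketched in a few lines with HuggingFace `transformers`. This is a minimal illustration, not the course's exact setup: the config sizes are tiny and untrained so it runs without downloading weights, and the "image features" are a random stand-in for what a real image encoder would produce.

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# Tiny, untrained GPT-2 config so the sketch runs instantly;
# a real system would load pretrained GPT-2 weights instead.
config = GPT2Config(
    vocab_size=100, n_positions=32, n_embd=32, n_layer=2, n_head=2,
    add_cross_attention=True,  # adds cross-attention layers so the
                               # decoder can attend to image features
)
decoder = GPT2LMHeadModel(config)

# Stand-in for an image encoder's output: one image represented
# as 16 feature vectors of the decoder's hidden size.
image_features = torch.randn(1, 16, config.n_embd)

# The question, as (dummy) token ids.
question_ids = torch.randint(0, config.vocab_size, (1, 8))

# The decoder reads the question and cross-attends to the image
# features at every layer, producing next-token logits.
out = decoder(input_ids=question_ids, encoder_hidden_states=image_features)
print(out.logits.shape)  # torch.Size([1, 8, 100])
```

Setting `add_cross_attention=True` is what turns a plain GPT-2 into a decoder that can be handed another modality: the `encoder_hidden_states` argument carries the image representation into each block's cross-attention, exactly the hand-off described above.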
