Try NVIDIA NIM APIs

Search Results

Searching for: Image-Text Retrieval

Sorting by Most Recent

nvidia Cosmos Dataset Search

Accelerate post-training of end-to-end autonomous vehicle stacks with vector search and retrieval for large video datasets.

blueprint Autonomous Vehicles data Physical AI Search Enterprise Cosmos NVIDIA AI

nvidia nemotron-nano-12b-v2-vl

Nemotron Nano 12B v2 VL enables multi-image and video understanding, along with visual Q&A and summarization capabilities.

language generation chat Image-to-Text vision assistant visual question answering

cyborg Cyborg Enterprise RAG

Securely extract, embed, and index multimodal data with encryption in-use for fast, accurate semantic search.

NIM Launchable Blueprint Retrieval-Augmented Generation NeMo Retriever

nvidia llama-3_2-nemoretriever-300m-embed-v2

Multilingual, cross-lingual embedding model for long-document QA retrieval, supporting 26 languages.

Retrieval Augmented Generation Text-to-Embedding NeMo Retriever

microsoft TRELLIS

MSFT TRELLIS is a 3D AI model that generates high-quality 3D assets from text or image inputs.

text-to-3d Run-on-RTX image-to-3d

nvidia Retail Shopping Assistant

Elevate Shopping Experiences Online and In Stores.

blueprint nemo retriever nim Launchable Retrieval-Augmented Generation NVIDIA AI

stabilityai stable-diffusion-3.5-large

Stable Diffusion 3.5 is a popular text-to-image generation model

Image Generation Text-to-Image

black-forest-labs FLUX.1-Kontext-dev

FLUX.1 Kontext is a multimodal model that enables in-context image generation and editing.

Image Generation Text-to-Image Run-on-RTX

nvidia nemoretriever-ocr-v1

Powerful OCR model for fast, accurate real-world image text extraction, layout, and structure analysis.

Optical Character Recognition Table Extraction nemo retriever data ingestion extraction

nvidia llama-3_2-nemoretriever-300m-embed-v1

Multilingual, cross-lingual embedding model for long-document QA retrieval, supporting 26 languages.

Retrieval Augmented Generation Text-to-Embedding NeMo Retriever

nvidia nemoretriever-ocr

Powerful OCR model for fast, accurate real-world image text extraction, layout, and structure analysis.

Optical Character Recognition Table Extraction nemo retriever data ingestion extraction

google gemma-3n-e4b-it

An edge computing AI model which accepts text, audio and image input, ideal for resource-constrained environments

language generation speech recognition Visual QA chat

google gemma-3n-e2b-it

An edge computing AI model which accepts text, audio and image input, ideal for resource-constrained environments

language generation speech recognition Visual QA chat

nvidia llama-3.2-nemoretriever-1b-vlm-embed-v1

Multimodal question-answer retrieval representing user queries as text and documents as images.

nemo retriever embedding Retrieval Augmented Generation Text-to-Embedding

nvidia Biomedical AI-Q Research Agent Blueprint

Build advanced AI agents within the biomedical domain using the AI-Q Blueprint and the BioNeMo Virtual Screening Blueprint

Launchable Agent Blueprint Blueprint Retrieval-augmented generation llm

nvidia llama-3.1-nemotron-nano-vl-8b-v1

Multi-modal vision-language model that understands text/img and creates informative responses

doc intelligence chat multiple image understanding OCR

black-forest-labs FLUX.1-schnell

FLUX.1-schnell is a distilled image generation model, producing high quality images at fast speeds

Image Generation Text-to-Image Run-on-RTX

mistralai mistral-small-3.1-24b-instruct-2503

Efficient multimodal model excelling at multilingual tasks, image understanding, and fast responses

language generation chat multimodal image understanding

black-forest-labs FLUX.1-dev

FLUX.1 is a state-of-the-art suite of image generation models

Image Generation Text-to-Image Run-on-RTX

nvidia Build an AI Agent for Enterprise Research

Build a custom enterprise research assistant powered by state-of-the-art models that process and synthesize multimodal data, enabling reasoning, planning, and refinement to generate comprehensive reports.

NIM Launchable Llama Nemotron Reasoning Blueprint Enterprise Retrieval-Augmented Generation NVIDIA AI NeMo Retriever

nvidia Synthetic Manipulation Motion Generation for Robotics

Generate exponentially large amounts of synthetic motion trajectories for robot manipulation from just a few human demonstrations.

NVIDIA Omniverse Blueprint synthetic data Enterprise robotics physical ai robot learning Humanoids NVIDIA Isaac GR00T text-to-world image-to-world teleop

nvidia cosmos-predict1-5b

Generates future frames of a physics-aware world state from just an image or short video prompt, for physical AI development.

Synthetic Data Generation Physical AI policy evaluation robotics video-to-world

nvidia nv-embedcode-7b-v1

The NV-EmbedCode model is a 7B Mistral-based embedding model optimized for code retrieval, supporting text, code, and hybrid queries.

nemo retriever Embedding Retrieval Augmented Generation

microsoft phi-4-multimodal-instruct

Cutting-edge open multimodal model excelling in high-quality reasoning from image and audio inputs.

Speech Recognition Visual QA chat Language Generation Image-to-Text Chart and Table Understanding

nvidia Build an Enterprise RAG Pipeline Blueprint

Power fast, accurate semantic search across multimodal enterprise data with NVIDIA’s RAG Blueprint—built on NeMo Retriever and Nemotron models—to connect your agents to trusted, authoritative sources of knowledge.

NIM Launchable Nemotron Blueprint Enterprise Retrieval-Augmented Generation NVIDIA AI NeMo Retriever

nvidia cosmos-nemotron-34b

Multi-modal vision-language model that understands text/img/video and creates informative responses

VLM Vision language model image caption image to text

nvidia llama-3.2-nv-embedqa-1b-v2

Multilingual and cross-lingual text question-answering retrieval with long context support and optimized data storage efficiency.

nemo retriever run-on-rtx embedding Retrieval Augmented Generation Text-to-Embedding

nvidia llama-3.2-nv-rerankqa-1b-v2

Fine-tuned reranking model for multilingual, cross-lingual text question-answering retrieval, with long context support.

nemo retriever run-on-rtx Retrieval Augmented Generation reranking

university-at-buffalo cached

Context-aware chart extraction that can detect 18 classes of basic chart elements, excluding plot elements.

nemo retriever Chart Element Detection Image-To-Text

baidu paddleocr

Model for table extraction that receives an image as input, runs OCR on the image, and returns the text within the image and its bounding boxes.

Optical Character Recognition Table Extraction Optical Character Detection nemo retriever data ingestion run-on-rtx extraction

hive deepfake-image-detection

Advanced AI model that detects faces and identifies deepfake images.

computer vision AI safety deep fake detection Content moderation

nvidia Build an AI Virtual Assistant

Create intelligent virtual assistants for customer service across every industry

Customer Service Launchable Blueprint Retrieval-augmented generation llm contact center NVIDIA AI

meta llama-3.2-11b-vision-instruct

Cutting-edge vision-language model excelling in high-quality reasoning from images.

Image-Text Retrieval Visual QA chat Image-to-Text Image Captioning Visual Grounding

meta llama-3.2-90b-vision-instruct

Cutting-edge vision-language model excelling in high-quality reasoning from images.

Image-Text Retrieval Visual QA image captioning chat Image-to-Text Visual Grounding

nvidia vila

Multi-modal vision-language model that understands text/img/video and creates informative responses

VLM Vision language model image caption image to text

hive ai-generated-image-detection

Robust image classification model for detecting and managing AI-generated content.

image classification computer vision AI safety Content moderation

nvidia nv-dinov2

NV-DINOv2 is a visual foundation model that generates vector embeddings for the input image.

Image-to-Embedding computer vision deepstream NVIDIA NIM object Classification

nvidia usdsearch

AI-powered search for OpenUSD data, 3D models, images, and assets using text or image-based inputs.

OpenUSD Synthetic Data Generation Digital Twin USD Text-to-3D

nvidia nv-embedqa-e5-v5

English text embedding model for question-answering retrieval.

Embedding run-on-rtx Retrieval Augmented Generation Nemo retriever Text-to-Embedding

nvidia nv-embedqa-mistral-7b-v2

Multilingual text question-answering retrieval, transforming textual information into dense vector representations.

nemo retriever Embedding Retrieval Augmented Generation

nvidia maisi

MAISI is a pre-trained volumetric (3D) CT Latent Diffusion Generative Model.

Image Generation Medical Imaging NVIDIA NIM

nvidia nvclip

NV-CLIP is a multimodal embeddings model for image and text.

Computer vision multimodal embeddings text and image Run-on-rtx
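
A multimodal embedding model of this kind places images and text in one vector space, so image-text retrieval reduces to a nearest-neighbour comparison. The sketch below illustrates only that ranking step, using cosine similarity as an assumed (common, but not confirmed here) similarity measure. The vectors and file names are made-up placeholders, not output from NV-CLIP, and `rank_images` is a hypothetical helper, not part of any NVIDIA API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def rank_images(text_embedding, image_embeddings):
    """Return (image name, similarity) pairs, best match first."""
    scored = [
        (name, cosine_similarity(text_embedding, vector))
        for name, vector in image_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional vectors standing in for real model embeddings.
image_embeddings = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "car.jpg": [0.0, 0.2, 0.9],
}
query_embedding = [1.0, 0.0, 0.0]  # placeholder for an embedded text query

ranking = rank_images(query_embedding, image_embeddings)
```

In practice the embeddings would come from the model itself, and a vector index would replace the linear scan once the image collection grows.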

stabilityai stable-diffusion-3-medium

Advanced text-to-image model for generating high quality images

Image Generation Text-to-Image

nvidia ocdrnet

OCDNet and OCRNet are pre-trained models designed for optical character detection and recognition, respectively.

Optical Character Recognition image Optical Character Detection cv vlm computer vision TAO Toolkit video

baai bge-m3

Embedding model for text retrieval tasks, excelling in dense, multi-vector, and sparse retrieval.

Embeddings Retrieval Augmented Generation Text-to-Embedding

nvidia visual-changenet

Visual Changenet detects pixel-level change maps between two images and outputs a semantic change segmentation mask

image Image Generation cv Image Segmentation vlm computer vision TAO Toolkit video NVIDIA NIM

nvidia retail-object-detection

EfficientDet-based object detection network to detect 100 specific retail objects from an input video.

Object Detection image cv vlm computer vision TAO Toolkit video NVIDIA NIM

google paligemma

Vision language model adept at comprehending text and visual inputs to produce informative responses

image cv Vision Assistant vlm Visual Question Answering computer vision Language Generation Image-to-Text video

nvidia rerank-qa-mistral-4b

GPU-accelerated model optimized for providing a probability score that a given passage contains the information to answer a question.

Ranking Retrieval Augmented Generation
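
A reranker that returns a per-passage probability score is typically used as a second stage after an initial retrieval step: candidates are rescored, weak ones are dropped, and the rest are reordered. The sketch below shows only that post-processing, assuming the scores are already available as probabilities between 0 and 1. The `rerank` function, its `top_k` and `min_score` parameters, and the example scores are all hypothetical illustrations, not the model's actual interface or output.

```python
def rerank(passages_with_scores, top_k=2, min_score=0.5):
    """Keep passages scoring at least min_score, best first, up to top_k."""
    kept = [(text, score) for text, score in passages_with_scores if score >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]


# Placeholder candidates with made-up relevance probabilities.
candidates = [
    ("Passage A", 0.91),
    ("Passage B", 0.12),
    ("Passage C", 0.67),
    ("Passage D", 0.55),
]

best = rerank(candidates, top_k=2, min_score=0.5)
```

The threshold and cutoff are tuning choices; a pipeline that feeds passages to a language model would usually size `top_k` to the available context window.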

nvidia vista-3d

VISTA-3D is a specialized interactive foundation model for segmenting and annotating human anatomies.

Interactive Annotation Image Segmentation Non-Commercial Use Only Medical Imaging