
Image Captioning with Florence-2

This project demonstrates image captioning with the Microsoft Florence-2 base model. Upload an image and receive a detailed textual description generated by the model.

Overview

This application uses Microsoft's Florence-2 vision-language model to generate detailed captions for images, describing visual content in natural language.

How to Use

You can try the live demo on Hugging Face Spaces: https://huggingface.co/spaces/prithivMLmods/Image-Captioning

Alternatively, you can run the Gradio application locally using the provided Python script.

Local Deployment

  1. Clone the repository (if you haven't already):

    git clone https://github.com/PRITHIVSAKTHIUR/Image-Captioning-Florence2.git
    cd Image-Captioning-Florence2
  2. Install the necessary dependencies:

    pip install -r requirements.txt

    The requirements.txt file should contain the following:

    gradio
    torch
    Pillow
    transformers
    
  3. Run the Gradio application:

    python app.py

    This will start a local web server, and you can access the interface in your browser (usually at http://localhost:7860).
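
If you need the app reachable from another machine or on a different port, Gradio's launch() accepts the standard server_name and server_port options. This is generic Gradio behavior rather than anything specific to this repository, so treat the call below as a sketch:

    import gradio as gr

    # Minimal stand-in app demonstrating Gradio's host/port options
    # (the defaults are 127.0.0.1 and 7860). The launch() call in app.py
    # accepts the same keyword arguments.
    demo = gr.Interface(fn=lambda text: text, inputs="text", outputs="text")
    demo.launch(server_name="0.0.0.0", server_port=7860)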

Interface

The Gradio interface is simple and intuitive (a minimal wiring sketch follows the list):

  1. Upload Image: Click on the "Upload Image" box to select and upload an image from your local machine.
  2. Generated Caption: Once the image is uploaded, the Florence-2 model will process it and generate a detailed caption, which will be displayed in the "Generated Caption" text box.
  3. Copy Caption: A "Copy" button is available to easily copy the generated caption to your clipboard.
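
For orientation, those three components map onto a few lines of Gradio. This is a hedged sketch rather than the repository's exact code; the stub callback stands in for the real describe_image function covered in the next section:

    import gradio as gr

    def describe_image(image):
        # Stub for illustration; the real implementation (see "Code
        # Details" below) runs Florence-2 on the uploaded image.
        return "a detailed caption would appear here"

    demo = gr.Interface(
        fn=describe_image,
        inputs=gr.Image(type="pil", label="Upload Image"),
        outputs=gr.Textbox(label="Generated Caption", show_copy_button=True),
        title="Image Captioning with Florence-2",
    )
    demo.launch()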

Code Details

The Python script (app.py) performs the following steps (a sketch of steps 2 through 4 follows the list):

  1. Imports Libraries: Imports necessary libraries like gradio, subprocess, torch, PIL, and transformers.
  2. Installs flash-attn (optional): Attempts to install the flash-attn library for potential performance improvements on CUDA-enabled systems. It gracefully handles installation errors by continuing without flash-attn.
  3. Loads Model and Processor: Loads the pre-trained Florence-2 base model and its corresponding processor from Hugging Face Transformers. The model is moved to the GPU if available.
  4. describe_image Function: This function takes an uploaded image as input, preprocesses it using the processor, generates a caption using the Florence-2 model, and post-processes the output to extract the detailed caption.
  5. Gradio Interface: Creates a Gradio interface with an image input component and a text output component to display the generated caption.
  6. Launches Interface: Starts the Gradio web server to make the application accessible.
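
Putting steps 2 through 4 together, a minimal sketch might look like the following; the interface wiring itself was sketched in the Interface section above. The <MORE_DETAILED_CAPTION> task prompt, the beam-search settings, and the variable names are assumptions based on the standard usage pattern from the Florence-2 model card, not verbatim code from this repository:

    import os
    import subprocess

    # Step 2: best-effort install of flash-attn for faster attention on
    # CUDA systems. FLASH_ATTENTION_SKIP_CUDA_BUILD=TRUE prefers a prebuilt
    # wheel; if installation fails, the app continues without flash-attn.
    try:
        subprocess.run(
            "pip install flash-attn --no-build-isolation",
            env={**os.environ, "FLASH_ATTENTION_SKIP_CUDA_BUILD": "TRUE"},
            shell=True,
            check=True,
        )
    except Exception:
        pass  # flash-attn is an optional speed-up, not a requirement

    import torch
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor

    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Step 3: Florence-2 ships its modeling code with the checkpoint,
    # hence trust_remote_code=True.
    model_id = "microsoft/Florence-2-base"
    model = AutoModelForCausalLM.from_pretrained(
        model_id, trust_remote_code=True
    ).to(device).eval()
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

    # Step 4: preprocess, generate, and post-process a detailed caption.
    def describe_image(image: Image.Image) -> str:
        task = "<MORE_DETAILED_CAPTION>"  # Florence-2 prompt for long captions
        inputs = processor(text=task, images=image, return_tensors="pt").to(device)
        generated_ids = model.generate(
            input_ids=inputs["input_ids"],
            pixel_values=inputs["pixel_values"],
            max_new_tokens=1024,
            num_beams=3,
        )
        raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
        # post_process_generation strips special tokens and returns a dict
        # keyed by the task prompt.
        parsed = processor.post_process_generation(
            raw, task=task, image_size=(image.width, image.height)
        )
        return parsed[task]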


Model Information

Florence-2 is a sequence-to-sequence vision-language model from Microsoft. A single checkpoint handles a range of vision tasks, including captioning, object detection, and OCR, selected through special task prompts; this application uses its detailed-captioning task to produce long, accurate image descriptions.
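
Per the Florence-2 model card, captioning granularity is controlled by the task prompt passed to the processor. The sketch below compares the three caption tasks, reusing the model, processor, and device set up in the Code Details sketch above; the filename is hypothetical:

    from PIL import Image

    # Hypothetical input image; model, processor, and device come from
    # the sketch in "Code Details".
    image = Image.open("example.jpg").convert("RGB")

    # Task tokens documented on the Florence-2 model card, from brief
    # to most verbose.
    for task in ("<CAPTION>", "<DETAILED_CAPTION>", "<MORE_DETAILED_CAPTION>"):
        inputs = processor(text=task, images=image, return_tensors="pt").to(device)
        ids = model.generate(
            input_ids=inputs["input_ids"],
            pixel_values=inputs["pixel_values"],
            max_new_tokens=1024,
        )
        raw = processor.batch_decode(ids, skip_special_tokens=False)[0]
        parsed = processor.post_process_generation(
            raw, task=task, image_size=image.size  # (width, height)
        )
        print(task, "->", parsed[task])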
