This project demonstrates image captioning using the Microsoft Florence-2 base model. It allows you to upload an image and receive a detailed textual description generated by the AI.
This application utilizes the powerful Florence-2 vision-language model from Microsoft to generate comprehensive captions for images. The model is capable of understanding visual content and expressing it in natural language.
You can try the live demo on Hugging Face Spaces: https://huggingface.co/spaces/prithivMLmods/Image-Captioning
Alternatively, you can run the Gradio application locally using the provided Python script.
- Clone the repository (if you haven't already):

  `git clone https://github.com/PRITHIVSAKTHIUR/Image-Captioning-Florence2.git`

  `cd Image-Captioning-Florence2`
- Install the necessary dependencies:

  `pip install -r requirements.txt`

  The `requirements.txt` file should contain the following: `gradio`, `torch`, `Pillow`, `transformers`.
- Run the Gradio application:

  `python app.py`

  This will start a local web server, and you can access the interface in your browser (usually at `http://localhost:7860`).
The Gradio interface is simple and intuitive:
- Upload Image: Click on the "Upload Image" box to select and upload an image from your local machine.
- Generated Caption: Once the image is uploaded, the Florence-2 model will process it and generate a detailed caption, which will be displayed in the "Generated Caption" text box.
- Copy Caption: A "Copy" button is available to easily copy the generated caption to your clipboard.
The Python script (`app.py`) performs the following steps (a minimal sketch of this flow is shown after the list):

- Imports Libraries: Imports the necessary libraries: `gradio`, `subprocess`, `torch`, `PIL`, and `transformers`.
- Installs `flash-attn` (optional): Attempts to install the `flash-attn` library for potential performance improvements on CUDA-enabled systems. Installation errors are handled gracefully by continuing without `flash-attn`.
- Loads Model and Processor: Loads the pre-trained Florence-2 base model and its corresponding processor from Hugging Face Transformers. The model is moved to the GPU if available.
- `describe_image` Function: Takes an uploaded image as input, preprocesses it with the processor, generates a caption with the Florence-2 model, and post-processes the output to extract the detailed caption.
- Gradio Interface: Creates a Gradio interface with an image input component and a text output component to display the generated caption.
- Launches Interface: Starts the Gradio web server to make the application accessible.
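For reference, here is a minimal sketch of that flow. It is not a verbatim copy of `app.py`: the `<MORE_DETAILED_CAPTION>` task prompt, the beam-search settings, the guarded `flash-attn` install command, and the exact Gradio components are assumptions based on the Florence-2 model card and the behavior described above.

```python
import subprocess
import sys

import gradio as gr
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Optional: try to install flash-attn for faster attention on CUDA systems.
# (Assumption: the exact command/environment in app.py may differ; the point
# is that a failed install is tolerated and the app continues without it.)
try:
    subprocess.run(
        [sys.executable, "-m", "pip", "install", "flash-attn", "--no-build-isolation"],
        check=True,
    )
except Exception:
    print("flash-attn unavailable; continuing without it.")

device = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if device == "cuda" else torch.float32

# Florence-2 ships custom modeling code, so trust_remote_code=True is required.
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-base", torch_dtype=torch_dtype, trust_remote_code=True
).to(device)
processor = AutoProcessor.from_pretrained(
    "microsoft/Florence-2-base", trust_remote_code=True
)


def describe_image(image: Image.Image, task_prompt: str = "<MORE_DETAILED_CAPTION>") -> str:
    """Generate a caption for a PIL image using a Florence-2 task prompt."""
    inputs = processor(text=task_prompt, images=image, return_tensors="pt").to(device, torch_dtype)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
        do_sample=False,
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    # post_process_generation strips the task token and returns {task_prompt: caption}.
    parsed = processor.post_process_generation(
        generated_text, task=task_prompt, image_size=(image.width, image.height)
    )
    return parsed[task_prompt]


def caption_image(image: Image.Image) -> str:
    # Thin wrapper so the Gradio interface only exposes the image input.
    return describe_image(image)


demo = gr.Interface(
    fn=caption_image,
    inputs=gr.Image(type="pil", label="Upload Image"),
    outputs=gr.Textbox(label="Generated Caption", show_copy_button=True),
    title="Image Captioning with Florence-2",
)

if __name__ == "__main__":
    demo.launch()
```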
- Model: microsoft/Florence-2-base
- Processor: microsoft/Florence-2-base
Florence-2 is a powerful vision-language model known for its ability to generate detailed and accurate descriptions of images.
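The level of detail is controlled by the task prompt passed to the processor. Assuming the `describe_image` helper from the sketch above, shorter or longer captions can be requested by swapping the task token (the token names come from the Florence-2 model card; `example.jpg` is a placeholder path):

```python
from PIL import Image

img = Image.open("example.jpg")  # placeholder path to a local image
print(describe_image(img, "<CAPTION>"))                 # brief one-line caption
print(describe_image(img, "<DETAILED_CAPTION>"))        # a few sentences
print(describe_image(img, "<MORE_DETAILED_CAPTION>"))   # the default used by the app
```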