This project demonstrates image captioning using the Microsoft Florence-2 base model. It allows you to upload an image and receive a detailed textual description generated by the AI.
This application utilizes the powerful Florence-2 vision-language model from Microsoft to generate comprehensive captions for images. The model is capable of understanding visual content and expressing it in natural language.
You can try the live demo on Hugging Face Spaces: https://huggingface.co/spaces/prithivMLmods/Image-Captioning
Alternatively, you can run the Gradio application locally using the provided Python script.
- Clone the repository (if you haven't already):

  `git clone https://github.com/PRITHIVSAKTHIUR/Image-Captioning-Florence2.git`

  `cd Image-Captioning-Florence2`
- Install the necessary dependencies:

  `pip install -r requirements.txt`

  The `requirements.txt` file should contain the following: `gradio`, `torch`, `Pillow`, `transformers`.
- Run the Gradio application:

  `python app.py`

  This will start a local web server, and you can access the interface in your browser (usually at `http://localhost:7860`).
The Gradio interface is simple and intuitive:
- Upload Image: Click on the "Upload Image" box to select and upload an image from your local machine.
- Generated Caption: Once the image is uploaded, the Florence-2 model will process it and generate a detailed caption, which will be displayed in the "Generated Caption" text box.
- Copy Caption: A "Copy" button is available to easily copy the generated caption to your clipboard.
The Python script (`app.py`) performs the following steps (a minimal sketch of this flow is shown after the list):

- Imports Libraries: Imports the necessary libraries: `gradio`, `subprocess`, `torch`, `PIL`, and `transformers`.
- Installs `flash-attn` (optional): Attempts to install the `flash-attn` library for potential performance improvements on CUDA-enabled systems. Installation errors are handled gracefully by continuing without `flash-attn`.
- Loads Model and Processor: Loads the pre-trained Florence-2 base model and its corresponding processor from Hugging Face Transformers. The model is moved to the GPU if available.
- `describe_image` Function: Takes an uploaded image as input, preprocesses it with the processor, generates a caption with the Florence-2 model, and post-processes the output to extract the detailed caption.
- Gradio Interface: Creates a Gradio interface with an image input component and a text output component to display the generated caption.
- Launches Interface: Starts the Gradio web server to make the application accessible.
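For reference, here is a minimal sketch of that flow. It is not a verbatim copy of `app.py`: the `<MORE_DETAILED_CAPTION>` task prompt, the beam-search settings, the guarded `flash-attn` install command, and the exact Gradio components are assumptions based on the Florence-2 model card and the behavior described above.

```python
import subprocess
import sys

import gradio as gr
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Optional: try to install flash-attn for faster attention on CUDA systems.
# (Assumption: the exact command/environment in app.py may differ; the point
# is that a failed install is tolerated and the app continues without it.)
try:
    subprocess.run(
        [sys.executable, "-m", "pip", "install", "flash-attn", "--no-build-isolation"],
        check=True,
    )
except Exception:
    print("flash-attn unavailable; continuing without it.")

device = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if device == "cuda" else torch.float32

# Florence-2 ships custom modeling code, so trust_remote_code=True is required.
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-base", torch_dtype=torch_dtype, trust_remote_code=True
).to(device)
processor = AutoProcessor.from_pretrained(
    "microsoft/Florence-2-base", trust_remote_code=True
)


def describe_image(image: Image.Image, task_prompt: str = "<MORE_DETAILED_CAPTION>") -> str:
    """Generate a caption for a PIL image using a Florence-2 task prompt."""
    inputs = processor(text=task_prompt, images=image, return_tensors="pt").to(device, torch_dtype)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
        do_sample=False,
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    # post_process_generation strips the task token and returns {task_prompt: caption}.
    parsed = processor.post_process_generation(
        generated_text, task=task_prompt, image_size=(image.width, image.height)
    )
    return parsed[task_prompt]


def caption_image(image: Image.Image) -> str:
    # Thin wrapper so the Gradio interface only exposes the image input.
    return describe_image(image)


demo = gr.Interface(
    fn=caption_image,
    inputs=gr.Image(type="pil", label="Upload Image"),
    outputs=gr.Textbox(label="Generated Caption", show_copy_button=True),
    title="Image Captioning with Florence-2",
)

if __name__ == "__main__":
    demo.launch()
```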
- Model: microsoft/Florence-2-base
- Processor: microsoft/Florence-2-base
Florence-2 is a powerful vision-language model known for its ability to generate detailed and accurate descriptions of images.
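The level of detail is controlled by the task prompt passed to the processor. Assuming the `describe_image` helper from the sketch above, shorter or longer captions can be requested by swapping the task token (the token names come from the Florence-2 model card; `example.jpg` is a placeholder path):

```python
from PIL import Image

img = Image.open("example.jpg")  # placeholder path to a local image
print(describe_image(img, "<CAPTION>"))                 # brief one-line caption
print(describe_image(img, "<DETAILED_CAPTION>"))        # a few sentences
print(describe_image(img, "<MORE_DETAILED_CAPTION>"))   # the default used by the app
```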