Audify is an agent-based application built with LangGraph and Streamlit that transforms visual media (images and videos) into unique, emotionally resonant musical compositions. It uses a team of AI agents to analyze visual input, develop a musical concept, generate a prompt, and compose a track, which you can then refine with your own feedback.
- Agentic Workflow: Powered by LangGraph, Audify uses a graph of specialized AI agents that collaborate to turn an idea into music.
- Multi-Modal Input: Generate music from either static images or dynamic videos.
- Dynamic Video Analysis: For videos, the app uses `scenedetect` to extract keyframes, which are then collectively analyzed to create a single, cohesive musical score that follows the video's narrative arc (see the keyframe-extraction sketch after this list).
- AI Music Theory: The `MusicTheorist` agent analyzes the visuals to determine mood, genre, tempo, and key instruments.
- Creative Refinement: A `MusicCritic` agent enhances the initial musical idea, making it more evocative and detailed for the generation model.
- Iterative Feedback Loop: Not satisfied with the result? Provide natural language feedback (e.g., "make it faster and more epic") and the `RefinementAgent` will rewrite the prompt and regenerate the track.
- Advanced Parameter Tuning: An AI `ParameterTuner` automatically adjusts technical settings for the music generation model based on the visual analysis.
- High-Quality Music Generation: Uses the powerful ACE-Step model for music synthesis.
- Interactive Web UI: A user-friendly interface built with Streamlit.
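The repository's actual `VideoAnalyzer` code lives in `src/nodes/video_analyzer.py`, but a minimal keyframe-extraction pass built on `scenedetect` and MoviePy might look like the sketch below. The file name, the one-frame-per-scene policy, and the MoviePy 1.x import path are assumptions, not the project's exact implementation.

```python
from moviepy.editor import VideoFileClip
from scenedetect import ContentDetector, detect


def extract_keyframes(video_path: str) -> list:
    """Detect scene cuts and grab one representative frame per scene."""
    # detect() returns a list of (start, end) FrameTimecode pairs.
    scenes = detect(video_path, ContentDetector())

    clip = VideoFileClip(video_path)
    keyframes = []
    for start, end in scenes:
        # Sample the middle of each scene as its keyframe (an RGB numpy array).
        midpoint = (start.get_seconds() + end.get_seconds()) / 2
        keyframes.append(clip.get_frame(midpoint))

    if not keyframes:
        # No cuts detected: fall back to a single frame from the middle of the clip.
        keyframes.append(clip.get_frame(clip.duration / 2))

    clip.close()
    return keyframes
```

The extracted frames can then be passed together to the vision model so it sees the video's narrative arc as a sequence rather than as isolated images.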
Audify's core logic is a stateful graph where each node is a specialized agent. The process flows from visual analysis to music generation, with decision points that direct the workflow based on the current state.
- Input Node (Router): The graph's entry point determines the first step based on the input:
  - If a video is provided, it routes to the `VideoAnalyzer`.
  - If an image is provided, it routes to the `MusicTheorist`.
  - If the user provides feedback on existing music, it routes to the `RefinementAgent`.
- `VideoAnalyzer`: If the input is a video, this node extracts keyframes, analyzes them as a sequence to understand the story, and generates a `MusicTheory` object (mood, genre, etc.) for the entire video.
- `MusicTheorist`: Analyzes a single image to generate a `MusicTheory` object.
- `MusicCritic`: Takes the `MusicTheory` object and refines the `detailed_prompt`, making it richer and more descriptive for the music model.
- `ParameterTuner`: Adjusts technical generation parameters (e.g., `omega_scale`, `guidance_scale`) based on the analyzed mood and genre.
- `LyricsGenerator` (optional): If requested, this agent writes lyrics that match the musical concept.
- `MusicGenerator`: The final step in the main flow. It takes the refined prompt and tuned parameters and uses the ACE-Step model to generate the audio file.
- `RefinementAgent`: This node is triggered by user feedback. It modifies the existing music prompt based on the user's request and sends the new prompt back to the `MusicGenerator`.
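The graph itself is defined in `src/graph.py`. As a rough illustration of how such a workflow can be wired with LangGraph, the sketch below builds a state graph with a conditional entry router and stub nodes. The node names, `AppState` fields, and routing details are illustrative assumptions (the optional `LyricsGenerator` branch is omitted for brevity), not the project's exact code.

```python
from typing import Optional, TypedDict

from langgraph.graph import END, StateGraph


class AppState(TypedDict, total=False):
    # Illustrative state fields; the real schema lives in src/state.py.
    video_path: Optional[str]
    image_path: Optional[str]
    user_feedback: Optional[str]
    music_theory: Optional[dict]
    music_prompt: Optional[str]
    audio_path: Optional[str]


def route_input(state: AppState) -> str:
    """Mirror of the Input Node router: pick the entry node from the state."""
    if state.get("user_feedback"):
        return "refinement_agent"
    if state.get("video_path"):
        return "video_analyzer"
    return "music_theorist"


# Stub nodes: each real agent returns a partial update to the shared state.
def video_analyzer(state: AppState) -> dict:
    return {"music_theory": {"mood": "epic", "genre": "orchestral"}}

def music_theorist(state: AppState) -> dict:
    return {"music_theory": {"mood": "calm", "genre": "ambient"}}

def music_critic(state: AppState) -> dict:
    return {"music_prompt": "refined, more evocative prompt"}

def parameter_tuner(state: AppState) -> dict:
    return {}

def music_generator(state: AppState) -> dict:
    return {"audio_path": "outputs/track.wav"}

def refinement_agent(state: AppState) -> dict:
    return {"music_prompt": "rewritten prompt based on user feedback"}


workflow = StateGraph(AppState)
for name, fn in [
    ("video_analyzer", video_analyzer),
    ("music_theorist", music_theorist),
    ("music_critic", music_critic),
    ("parameter_tuner", parameter_tuner),
    ("music_generator", music_generator),
    ("refinement_agent", refinement_agent),
]:
    workflow.add_node(name, fn)

# The router selects the first node; from there the flow runs linearly to generation.
workflow.set_conditional_entry_point(route_input)
workflow.add_edge("video_analyzer", "music_critic")
workflow.add_edge("music_theorist", "music_critic")
workflow.add_edge("music_critic", "parameter_tuner")
workflow.add_edge("parameter_tuner", "music_generator")
workflow.add_edge("refinement_agent", "music_generator")
workflow.add_edge("music_generator", END)

app = workflow.compile()
```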
- Orchestration: LangGraph
- Web Framework: Streamlit
- LLM (Analysis & Text): Google Gemini via `langchain_google_genai`
- Music Generation Model: ACE-Step
- Video Processing: `MoviePy`, `scenedetect`
- Deployment: Google Colab, `pyngrok`
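To show how these pieces combine in an analysis agent, the sketch below asks Gemini (through `langchain_google_genai`) to turn an image into a structured `MusicTheory` object. The model name, field names, and prompt are assumptions; the project's real schema is defined in `src/models.py`, and the real node logic in `src/nodes/music_theorist.py`.

```python
import base64

from langchain_core.messages import HumanMessage
from langchain_google_genai import ChatGoogleGenerativeAI
from pydantic import BaseModel, Field


class MusicTheory(BaseModel):
    # Illustrative fields; see src/models.py for the actual schema.
    mood: str = Field(description="Overall emotional tone of the visuals")
    genre: str = Field(description="Musical genre that fits the scene")
    tempo_bpm: int = Field(description="Suggested tempo in beats per minute")
    instruments: list[str] = Field(description="Key instruments to feature")
    detailed_prompt: str = Field(description="Rich text prompt for the music model")


def analyze_image(image_path: str) -> MusicTheory:
    """Ask Gemini to describe an image as a musical concept."""
    # Expects the Google API key in the environment (e.g. GOOGLE_API_KEY);
    # the notebook presumably maps the GEMINI_API_KEY secret accordingly.
    llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash")
    structured_llm = llm.with_structured_output(MusicTheory)

    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    message = HumanMessage(content=[
        {"type": "text", "text": "Describe this image as a musical concept."},
        {"type": "image_url",
         "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
    ])
    return structured_llm.invoke([message])
```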
This project is designed to be run in a Google Colab environment to leverage its free GPU resources.
- A Google Account
- Git
- Open the Notebook: Open the `Audify.ipynb` notebook in Google Colab.
- Set Up API Keys: You will need API keys for Google Gemini and ngrok.
  - Get a Google Gemini API Key from Google AI Studio.
  - Get an ngrok Authtoken by signing up at the ngrok Dashboard.
- Add Keys to Colab Secrets:
  - In your Colab notebook, click the key-shaped "Secrets" icon in the left sidebar.
  - Add two new secrets:
    - `GEMINI_API_KEY`: Paste your Google Gemini key here.
    - `NGROK_AUTH_TOKEN`: Paste your ngrok authtoken here.
  - Make sure to enable the "Notebook access" toggle for both secrets.
- Run the Cells:
  - Execute the cells in the notebook sequentially from top to bottom.
  - The first few cells will install all required dependencies and set up the project structure.
  - The final cells will start the Streamlit server and use `ngrok` to create a public URL (see the sketch after these steps for roughly what they do).
- Access the App:
  - The last cell's output will provide a public ngrok URL (e.g., `https://<unique-id>.ngrok-free.app`).
  - Click this URL to open the Audify web application in your browser.
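Roughly, the final notebook cells do something like the sketch below: read the two secrets, expose them to the app, launch Streamlit in the background, and open an ngrok tunnel. The port, file name, and environment-variable mapping are assumptions; the authoritative version is in `Audify.ipynb`.

```python
import os
import subprocess

from google.colab import userdata  # Colab-only helper for reading notebook secrets
from pyngrok import ngrok

# Pull the secrets added in the "Secrets" sidebar and expose them to app.py.
os.environ["GEMINI_API_KEY"] = userdata.get("GEMINI_API_KEY")
ngrok.set_auth_token(userdata.get("NGROK_AUTH_TOKEN"))

# Run the Streamlit app headlessly on port 8501 (Streamlit's default).
subprocess.Popen(
    ["streamlit", "run", "app.py", "--server.port", "8501", "--server.headless", "true"]
)

# Expose the local port through an ngrok tunnel and print the public URL.
tunnel = ngrok.connect(8501)
print("Open the app at:", tunnel.public_url)
```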
- Step 1: Upload: Upload an image (`.jpg`, `.png`) or a short video (`.mp4`, `.mov`).
- Step 2: Analyze: Click the "Analyze & Create Music Concept" button. The AI agents will analyze your media and generate a musical prompt.
- Step 3: Generate: Review the AI-generated prompt and parameters. You can edit them if you wish. Click "Generate Music!".
- Step 4: Listen & Refine: Listen to your new track! If it's not quite right, type your feedback into the refinement box and click "Refine & Regenerate" to try again.
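Under the hood, the refinement step simply re-invokes the compiled graph with your feedback in the state, so the router sends the run to the `RefinementAgent` instead of the analysis nodes. A hedged usage sketch, reusing the illustrative `app` and field names from the wiring example above:

```python
# First pass: analyze an uploaded image and generate a track.
result = app.invoke({"image_path": "temp_uploads/sunset.jpg"})

# Refinement pass: keep the previous prompt, add natural-language feedback.
refined = app.invoke({
    "music_prompt": result["music_prompt"],
    "user_feedback": "make it faster and more epic",
})
print(refined["audio_path"])
```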
```
/
├── Audify.ipynb        # The main Google Colab notebook for setup and execution.
├── ACE-Step/           # Cloned repository for the music generation model.
├── app.py              # The Streamlit web application front-end.
├── outputs/            # Directory where generated music and videos are saved.
├── temp_uploads/       # Temporary storage for user-uploaded files.
└── src/
    ├── __init__.py
    ├── config.py       # Configuration for models and default parameters.
    ├── graph.py        # Defines the LangGraph agentic workflow.
    ├── models.py       # Pydantic models for data structures (e.g., MusicTheory).
    ├── state.py        # Defines the AppState TypedDict for the graph.
    └── nodes/          # Contains the individual agent modules.
        ├── __init__.py
        ├── music_theorist.py
        ├── video_analyzer.py
        ├── music_critic.py
        ├── lyrics_generator.py
        ├── parameter_tuner.py
        ├── refinement_agent.py
        └── music_generator.py
```
This project is licensed under the MIT License. See the LICENSE file for details.
- The LangChain team for creating LangGraph.
- The developers of the ACE-Step model for their incredible work in music generation.
- Google for the powerful Gemini models.
- The Streamlit team for making it easy to build beautiful data apps.