Highlight

The Inference section provides various inference demos for multimodal generation using different model architectures. Key capabilities include:

  • Any-to-Many Generation: Support for cross-modal inputs and outputs (text, image, audio, video, box, mask)
  • Specialized Story Generation: Convert multimodal inputs into coherent text-image stories
  • Model Variants: Implementations using both Qwen2.5-Omni and DeepSeek-Llama architectures

SpiderFree Visualization

travel.guide.mp4
text2manymodal.mp4
story.mp4
seg.mp4

Supported Features

✅ Gradio web interfaces
✅ Python API calls
✅ Autodl cloud deployment

Quick Start Tip: We recommend starting with SpiderFree (Qwen2.5-Omni) for quick experimentation with multimodal generation capabilities.

Our Paper

Spider: Any-to-Many Multimodal LLM: https://arxiv.org/pdf/2411.09439

Multimodal LLMs (MLLMs) have emerged as an extension of Large Language Models (LLMs), enabling the integration of various modalities. However, Any-to-Any MLLMs are limited to generating pairwise modalities 'Text + X' within a single response, such as Text + {Image or Audio or Video}. To address this limitation, we introduce Spider, a novel efficient Any-to-Many Modalities Generation (AMMG) framework, which can generate an arbitrary combination of modalities 'Text + Xs', such as Text + {Image and Audio and Video}. To achieve efficient AMMG, our Spider integrates three core components: a Base Model for basic X-to-X (i.e., Any-to-Any) modality processing, a novel Efficient Decoders-Controller for controlling multimodal Decoders to generate Xs (many-modal) contents, and an Any-to-Many Instruction Template designed for producing Xs signal prompts. To train Spider, we constructed a novel Text-formatted Many-Modal (TMM) dataset, which facilitates the learning of the X-to-Xs (i.e., Any-to-Many) capability necessary for AMMG. Ultimately, the well-trained Spider generates a pseudo X-to-Xs dataset, the first-ever X-to-Xs many-modal dataset, enhancing the potential for the AMMG task in future research. Overall, this work not only pushes the boundary of multimodal interaction but also provides rich data support for advancing the field.
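
The sketch below is only an illustration of the core idea, not the Spider implementation: the LLM's text response carries modality "signal prompts", and a controller parses them and dispatches each one to the corresponding decoder. The <IMAGE>/<AUDIO>/<VIDEO> tags and the decoder stubs here are hypothetical placeholders.

import re

# Hypothetical decoder stubs; in Spider these would be real image/audio/video decoders.
DECODERS = {
    "IMAGE": lambda prompt: f"[image generated from: {prompt}]",
    "AUDIO": lambda prompt: f"[audio generated from: {prompt}]",
    "VIDEO": lambda prompt: f"[video generated from: {prompt}]",
}

def route_signal_prompts(llm_output: str):
    """Split an LLM response into plain text and per-modality decoder outputs."""
    results = [
        (modality, DECODERS[modality](prompt.strip()))
        for modality, prompt in re.findall(r"<(IMAGE|AUDIO|VIDEO)>(.*?)</\1>", llm_output, re.S)
    ]
    clean_text = re.sub(r"<(IMAGE|AUDIO|VIDEO)>.*?</\1>", "", llm_output, flags=re.S).strip()
    return clean_text, results

if __name__ == "__main__":
    demo = ("Here is your travel guide. <IMAGE>Eiffel Tower at sunset</IMAGE> "
            "<AUDIO>a short accordion tune</AUDIO>")
    text, outputs = route_signal_prompts(demo)
    print(text)
    for modality, result in outputs:
        print(modality, "->", result)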

Table of Contents

  • Quick start
  • Environment setting
  • Train
  • Inference
  • Code
  • Dataset
  • Citation
  • Contact

Quick start

Spider includes many models; some recommended ones are listed below:

  • Any-to-Many modalities generation (generated examples are shown in visual.md)
  • Any modalities to text-image story generation
  • Text to text-image story generation

Environment setting

(If you need the docker image on autodl, please provide your autodl ID for docker sharing and contact Jinxiang Lai: layjins1994@gmail.com)

Spider+Llama3+Qwen+Story Environment

base docker: PyTorch 2.1.0, Python 3.10 (ubuntu22.04), CUDA 12.1

docker: spider_qwen

cd /root/autodl-tmp/4e5ee6e154984712803fe75176fe7a38/myGPT/Spider
pip3 install -r requirements_spider_llama3.txt

Qwen Environment setting

cd /root/autodl-tmp/4e5ee6e154984712803fe75176fe7a38/myGPT/Spider
pip3 install -r requirements_spider_qwen.txt
  1. Install transformers with Qwen2.5-Omni support: https://github.com/QwenLM/Qwen2.5-Omni

or install the Qwen transformers build offline:

# pip3 uninstall transformers
cd /root/autodl-tmp/4e5ee6e154984712803fe75176fe7a38/Pretrain_model/transformers/dist
pip3 install transformers-4.50.0.dev0.tar.gz
# pip3 install accelerate
pip3 install qwen-omni-utils[decord]
sudo apt update && sudo apt install ffmpeg -y
  2. Optional: install Flash-Attention 2 to speed up generation. Manually download a wheel from https://github.com/Dao-AILab/flash-attention/releases
# pip3 install -U flash-attn --no-build-isolation
# pip3 install flash-attn==2.7.3 --no-build-isolation
cd /root/autodl-tmp/4e5ee6e154984712803fe75176fe7a38/Pretrain_model
pip3 install flash_attn-2.5.8+cu122torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
  3. If an error occurs during story generation:
File "/root/miniconda3/lib/python3.10/site-packages/diffusers/utils/dynamic_modules_utils.py", line 28, in <module>
    from huggingface_hub import cached_download, hf_hub_download, model_info
ImportError: cannot import name 'cached_download' from 'huggingface_hub' (/root/miniconda3/envs/qwen/lib/python3.10/site-packages/huggingface_hub/__init__.py)

Fix: remove cached_download from the import:

vim /root/miniconda3/lib/python3.10/site-packages/diffusers/utils/dynamic_modules_utils.py
# from huggingface_hub import cached_download, hf_hub_download, model_info
from huggingface_hub import hf_hub_download, model_info # remove cached_download
  4. Error:
File "/root/miniconda3/lib/python3.10/site-packages/pytorchvideo/transforms/augmentations.py", line 9, in <module>
    import torchvision.transforms.functional.to_tensor as F_t
ModuleNotFoundError: No module named 'torchvision.transforms.functional.to_tensor'

Fix:

vim /root/miniconda3/lib/python3.10/site-packages/pytorchvideo/transforms/augmentations.py
# import torchvision.transforms.functional.to_tensor as F_t
import torchvision.transforms.functional as F_t
  5. Fix the deepspeed import in Spider/spider/runners/runner_base.py:
vim /root/autodl-tmp/4e5ee6e154984712803fe75176fe7a38/myGPT/Spider/spider/runners/runner_base.py
#from transformers.deepspeed import is_deepspeed_zero3_enabled
from transformers.integrations.deepspeed import is_deepspeed_zero3_enabled

  6. Modify deepspeed:

(1) pip3 install --upgrade deepspeed==0.16.5

(2) In the python3 environment, find the path of the installed deepspeed:

import deepspeed
print(deepspeed)
# <module 'deepspeed' from '/root/miniconda3/lib/python3.10/site-packages/deepspeed/__init__.py'>

(3) Replace the following files of the installed deepspeed with the corresponding files from /myGPT/myDeepSpeed0.16.5:

deepspeed/inference/engine.py

deepspeed/module_inject/load_checkpoint.py

  7. In Spider/demo/inference_api.py, set load_ckpt_mode = 'manul'.

mmdet Environment

mmcv:

pip3 install -U openmim
mim install mmengine
# mim install mmcv==2.1.0
pip install mmcv==2.2.0 -f https://download.openmmlab.com/mmcv/dist/cu121/torch2.2/index.html
# https://mmcv.readthedocs.io/zh-cn/latest/get_started/installation.html

mmdet:

mim install mmdet

If you see the error "AssertionError: MMCV==2.2.0 is used but incompatible. Please install mmcv>=1.7.2, <2.2.0", raise the mmcv version cap:

vim /root/autodl-tmp/4e5ee6e154984712803fe75176fe7a38/myGPT/Spider/spider/models/mmdetection/mmdet/__init__.py
mmcv_maximum_version = '2.2.1'
vim /root/miniconda3/lib/python3.10/site-packages/mmdet/__init__.py
mmcv_maximum_version = '2.2.1'

nltk_data for Grounding DINO: https://blog.csdn.net/qq_43140627/article/details/103895811

mv /root/autodl-tmp/4e5ee6e154984712803fe75176fe7a38/Pretrain_model/nltk_data.zip /root
cd /root
unzip nltk_data.zip

Train

Train of Spider

docker in autodl: spider_qwen

  1. spider_demo_train

(1) modify start.sh:

mode="spider_demo_train"

(2) related config:

train_configs/spider_demo_train.py

train_configs/ds_config.json (make sure "train_batch_size" is set correctly for the number of GPUs; see the sanity-check sketch after these steps)

(3) Finally:

cd /root/autodl-tmp/4e5ee6e154984712803fe75176fe7a38/myGPT/Spider
sh start.sh
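
As a quick check for ds_config.json, the sketch below verifies the standard DeepSpeed constraint train_batch_size = train_micro_batch_size_per_gpu * gradient_accumulation_steps * number of GPUs. It assumes these standard fields are present in train_configs/ds_config.json; adapt the path and field names if the actual config differs.

import json

import torch

def check_ds_config(path="train_configs/ds_config.json", num_gpus=None):
    # DeepSpeed requires:
    #   train_batch_size == train_micro_batch_size_per_gpu * gradient_accumulation_steps * num_gpus
    if num_gpus is None:
        num_gpus = max(torch.cuda.device_count(), 1)
    with open(path) as f:
        cfg = json.load(f)
    micro = cfg.get("train_micro_batch_size_per_gpu", 1)
    accum = cfg.get("gradient_accumulation_steps", 1)
    expected = micro * accum * num_gpus
    actual = cfg.get("train_batch_size")
    print(f"num_gpus={num_gpus} micro={micro} accum={accum} -> expected train_batch_size={expected}, found {actual}")
    return actual == expected

if __name__ == "__main__":
    assert check_ds_config(), "Adjust train_batch_size in ds_config.json to match your GPU count."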

Inference

Inference Demo of Spider

Spider

docker in autodl: spider_qwen

Inference Pipeline: Spider/demo/inference_api.py

(1) Select the checkpoint (trained with train_configs/spider_demo_train.py) by modifying train_configs/demo_config.json:

"checkpoints": "path/to/checkpoint.pt"

(2) Modify demo/frontend.py; "server_name" is the IP of the machine running the demo:

demo.launch(share=True, enable_queue=True, server_port=8081, server_name='11.213.119.213')

demo.launch(share=True, server_port=6006) # autodl

(3) gradio in autodl: https://blog.csdn.net/weixin_43976646/article/details/143723135

E:\jinxianglai\code\AutoDL-SSH-Tools\AutoDL.exe

(4) Set the corresponding flag in Spider/spider/models/spider.py:

# init Grounding DINO if needed
init_dino_flag = True

(5) Finally:

cd /root/autodl-tmp/4e5ee6e154984712803fe75176fe7a38/myGPT/Spider
# mode="spider_demo_train", in demo.sh
sh demo.sh

SpiderStory

docker in autodl: spider_qwen

Inference Pipeline: Spider/demo/inference_api.py

(1) Select the checkpoint (trained with train_configs/spider_story.py) by modifying train_configs/demo_config.json:

"checkpoints": "path/to/checkpoint.pt"

(2) Finally:

cd /root/autodl-tmp/4e5ee6e154984712803fe75176fe7a38/myGPT/Spider
# mode="spider_story", in demo.sh
sh demo.sh

Inference Demo of SpiderFree

SpiderStory free (DeepSeek-R1-Distill-Llama-8B)

docker in autodl: spider_qwen

Inference Pipeline: Spider/demo/inference_api.py

(1) Finally:

cd /root/autodl-tmp/4e5ee6e154984712803fe75176fe7a38/myGPT/Spider
# mode="spider_story_free_llama3", in demo.sh
sh demo.sh

SpiderStory free (Qwen2.5-Omni)

docker in autodl: spider_qwen

Inference Pipeline: Spider/qwen2.5omni_spider_web.py

  1. Web chatbot:
cd /root/autodl-tmp/4e5ee6e154984712803fe75176fe7a38/myGPT/Spider
# MODEL_NAME="spider_story_free_qwen", in qwen2.5omni_spider_web.py
python3 qwen2.5omni_spider_web.py

SpiderFree (Qwen2.5-Omni)

docker in autodl: spider_qwen

Config: Spider/train_configs/spider_decoder_cfg.py (Note: we are still working on a better system prompt to make Qwen2.5-Omni output better formatted text.)

Inference Pipeline: Spider/spider_decoder_infer.py

Gradio: Spider/qwen2.5omni_spider_web.py

  1. Web chatbot:
cd /root/autodl-tmp/4e5ee6e154984712803fe75176fe7a38/myGPT/Spider
# MODEL_NAME="spider_free_qwen", in qwen2.5omni_spider_web.py
python3 qwen2.5omni_spider_web.py

Inference Demo of DeepSeek-R1-Distill-Llama-8B

docker in autodl: spider_qwen

  1. Text chatbot in Gradio:
cd /root/autodl-tmp/4e5ee6e154984712803fe75176fe7a38/myGPT/Spider
python3 r1_llama3_8B_gradio.py
  2. Text chat in python:
cd /root/autodl-tmp/4e5ee6e154984712803fe75176fe7a38/myGPT/Spider
python3 r1_llama3_8B_chat.py
  3. Text generation in python:
cd /root/autodl-tmp/4e5ee6e154984712803fe75176fe7a38/myGPT/Spider
python3 r1_llama3_8B_infer.py
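
For reference, the sketch below shows minimal text generation with DeepSeek-R1-Distill-Llama-8B through Hugging Face transformers. The Hub id and generation settings are assumptions; the repo's r1_llama3_8B_*.py scripts may load a local checkpoint and use different parameters.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed model id; replace with the local checkpoint path used on your machine.
model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "Write a three-sentence story about a spider."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output_ids = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=0.6)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output_ids[0][inputs.shape[-1]:], skip_special_tokens=True))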

Inference Demo of StoryDiffusion

docker in autodl: spider_qwen

cd /root/autodl-tmp/4e5ee6e154984712803fe75176fe7a38/myGPT/Spider
python3 story_diffusion_infer.py

Inference Demo of SpiderDecoder

docker in autodl: spider_qwen

cd /root/autodl-tmp/4e5ee6e154984712803fe75176fe7a38/myGPT/Spider
python3 spider_decoder_infer.py

Inference Demo of Qwen2.5-Omni

docker in autodl: spider_qwen

  1. Web chatbot:
cd /root/autodl-tmp/4e5ee6e154984712803fe75176fe7a38/myGPT/Spider
python3 qwen2.5omni_web.py
  2. Inference in python:
cd /root/autodl-tmp/4e5ee6e154984712803fe75176fe7a38/myGPT/Spider
python3 qwen2.5omni_infer.py
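
For reference, the sketch below follows the Qwen2.5-Omni quick-start for a minimal text-only query. It assumes the transformers-4.50.0.dev0 build and qwen-omni-utils installed in the environment section above (newer transformers releases rename the model class); qwen2.5omni_infer.py remains the authoritative pipeline for this repo.

from transformers import Qwen2_5OmniModel, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info

# Assumed model id; replace with the local path under Pretrain_model if needed.
model_id = "Qwen/Qwen2.5-Omni-7B"
model = Qwen2_5OmniModel.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
processor = Qwen2_5OmniProcessor.from_pretrained(model_id)

conversation = [
    {"role": "user", "content": [{"type": "text", "text": "Describe a rainy street scene."}]},
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=False)
inputs = processor(text=text, audios=audios, images=images, videos=videos,
                   return_tensors="pt", padding=True, use_audio_in_video=False)
inputs = inputs.to(model.device).to(model.dtype)

# return_audio=False keeps this sketch text-only (skips the audio Talker).
text_ids = model.generate(**inputs, use_audio_in_video=False, return_audio=False)
print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])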

Inference Demo of NextGPT

docker in autodl: nextgpt

conda activate nextgpt

cd /root/autodl-tmp/4e5ee6e154984712803fe75176fe7a38/myGPT/Opencodes-Multimodal/NExT-GPT/NExT-GPT-old-jinxiang/ckpt/pretrained_ckpt/imagebind_ckpt/huge
ln -s /root/autodl-tmp/4e5ee6e154984712803fe75176fe7a38/Pretrain_model/imagebind_huge.pth

cd /root/autodl-tmp/4e5ee6e154984712803fe75176fe7a38/myGPT/Opencodes-Multimodal/NExT-GPT/NExT-GPT-old-jinxiang/ckpt/pretrained_ckpt/vicuna_ckpt
ln -s /root/autodl-tmp/4e5ee6e154984712803fe75176fe7a38/Pretrain_model/vicuna/7b_v0 7b_v0

cd /root/autodl-tmp/4e5ee6e154984712803fe75176fe7a38/myGPT/Opencodes-Multimodal/NExT-GPT/NExT-GPT-old-jinxiang/ckpt/delta_ckpt/nextgpt
rm -rf 7b_tiva_v0
ln -s /root/autodl-tmp/4e5ee6e154984712803fe75176fe7a38/Pretrain_model/nextgpt_7b_tiva_v0 7b_tiva_v0

cd /root/autodl-tmp/4e5ee6e154984712803fe75176fe7a38/myGPT/Opencodes-Multimodal/NExT-GPT/NExT-GPT-old-jinxiang/code
bash scripts/app.sh

Code

Code Structure

The model code is in spider/models/spider.py

The inference code: Spider/demo/inference_api.py

Code Base

Code Base - Spider

https://github.com/Vision-CAIR/MiniGPT-4

https://github.com/NExT-GPT/NExT-GPT

https://github.com/microsoft/unilm/tree/master/kosmos-2

https://github.com/dvlab-research/LISA

https://github.com/QwenLM/Qwen2.5-Omni

Code Base - Story

https://github.com/HVision-NKU/StoryDiffusion

https://github.com/xichenpan/ARLDM

https://github.com/TencentARC/SEED-Story

Dataset

Dataset - Spider

https://github.com/Vision-CAIR/MiniGPT-4

https://github.com/NExT-GPT/NExT-GPT

https://huggingface.co/datasets/sailvideo/webvid10m/tree/main

https://huggingface.co/datasets/Olivia714/audiocaps

https://github.com/NExT-GPT/NExT-GPT/blob/main/data/T_X_pair_data/audiocap/prepare.md

Dataset - Story

https://github.com/xichenpan/ARLDM?tab=readme-ov-file

https://github.com/TencentARC/SEED-Story

Citation

If you use this code for your research, please cite our paper:

@article{lai2024spider,
  title={Spider: Any-to-Many Multimodal LLM},
  author={Lai, Jinxiang and Zhang, Jie and Liu, Jun and Li, Jian and Lu, Xiaocheng and Guo, Song},
  journal={arXiv preprint arXiv:2411.09439},
  year={2024}
}

Contact

Jinxiang Lai: layjins1994@gmail.com
