We build a CLIP-based Activity2Vec model to leverage the representation power of CLIP. Starting from the CLIP pre-trained model, we finetune it on our HAKE data with human body part state (PaSta) labels. The resulting model estimates all of the PaSta present in a whole image (as in the visualizations below). We believe it can serve as a more powerful image-level action semantic extractor for action understanding. If you have any advice, feel free to drop us an email!
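Conceptually, the finetuned model pairs the CLIP image encoder with one multi-label classifier per body part. The sketch below is illustrative only and is not the repository's code: the per-part class counts are placeholders, the head design is an assumption, and only the CLIP calls in the usage comment follow OpenAI's public `clip` package.

```python
# Illustrative sketch only -- NOT the repository's actual model code.
# Assumption: a CLIP visual embedding (512-d for ViT-B/16) feeds one sigmoid
# multi-label head per body part; the class counts below are placeholders,
# not the real HAKE PaSta label counts.
import torch
import torch.nn as nn

PLACEHOLDER_PART_CLASSES = {
    "foot": 16, "leg": 15, "hip": 6, "hand": 34, "arm": 8, "head": 14,
}

class PaStaHeadsSketch(nn.Module):
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.heads = nn.ModuleDict(
            {part: nn.Linear(feat_dim, n) for part, n in PLACEHOLDER_PART_CLASSES.items()}
        )

    def forward(self, img_feat: torch.Tensor) -> dict:
        # img_feat: (B, feat_dim) image embeddings from the CLIP visual encoder
        return {part: torch.sigmoid(head(img_feat)) for part, head in self.heads.items()}

# Rough usage with OpenAI's CLIP package (pip install git+https://github.com/openai/CLIP.git):
#   import clip
#   from PIL import Image
#   model, preprocess = clip.load("ViT-B/16", device="cpu")
#   img = preprocess(Image.open("example.jpg")).unsqueeze(0)
#   with torch.no_grad():
#       feat = model.encode_image(img).float()         # (1, 512)
#   pasta_scores = PaStaHeadsSketch()(feat)            # per-part PaSta probabilities
```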
- Install PyTorch 1.5+ and torchvision 0.6+:
conda install -c pytorch pytorch torchvision
- Clone this repository and check out the CLIP-Activity2Vec branch:
git clone https://github.com/DirtyHarryLYL/HAKE-Action-Torch.git PaStaNetCLIP
cd PaStaNetCLIP
git checkout CLIP-Activity2Vec
- Download the HAKE dataset from here and extract it:
tar xzvf hake-large.tgz
- Organize your data as follows:
PaStaNetCLIP
|_ data
   |_ hake
      |_ hake-large
      |  |_ hico-train
      |  |_ hico-test
      |  |_ ...
- Download the annotations and put them under the $PROJECT/data/hake folder.
- Download the CLIP pretrained model from Baidu Pan or Google Drive and put it in $PROJECT/pretrained/clip (ViT-B-16.new.pt: the original CLIP pre-trained model; ckpt_4.pth: the model finetuned on HAKE data). A quick sanity check for the downloaded files is sketched right after this list.
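As an optional sanity check (not part of the repository's scripts), the snippet below verifies that the folders and checkpoint files named above exist and that a checkpoint can be deserialized; all paths are taken from the instructions above, and the fallback to torch.jit.load is an assumption in case a file is stored as a TorchScript archive.

```python
# Optional, hedged sanity check -- not part of the repository.
import os
import torch

# Directories named in the setup instructions above.
expected_dirs = [
    "data/hake/hake-large/hico-train",
    "data/hake/hake-large/hico-test",
    "pretrained/clip",
]
for d in expected_dirs:
    print(f"{d}: {'OK' if os.path.isdir(d) else 'MISSING'}")

# Checkpoint file names follow the download instructions above.
for ckpt in ["pretrained/clip/ViT-B-16.new.pt", "pretrained/clip/ckpt_4.pth"]:
    if not os.path.isfile(ckpt):
        print(f"{ckpt}: not found")
        continue
    try:
        state = torch.load(ckpt, map_location="cpu")
    except Exception:
        # Assumption: the file may be a TorchScript archive (as original CLIP weights are).
        state = torch.jit.load(ckpt, map_location="cpu")
    print(f"{ckpt}: loaded ({type(state).__name__})")
```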
- Backbone: ViT-B/16
- Data split: train / val
- Model weights: link
| Part / task | mAP |
|---|---|
| foot | 64.6 |
| leg | 76.3 |
| hip | 64.5 |
| hand | 44.7 |
| arm | 72.9 |
| head | 60.6 |
| binary | 81.0 |
| verb | 68.1 |
# by default, we use gpu x batch = 8 x 4, i.e., 8 GPUs with a per-GPU batch size of 4
# you can use --batch_size and --nproc_per_node to adjust the batch size and the number of GPUs
./run.sh # set --vis_test to visualize the predictions
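For reference, with one process per GPU the effective batch size is the product of the two flags above; the tiny sketch below just spells out that arithmetic and is not taken from run.sh.

```python
# Generic DDP batch-size arithmetic -- not the internals of run.sh.
nproc_per_node = 8   # default number of GPUs / processes (cf. --nproc_per_node)
batch_size = 4       # default per-GPU batch size (cf. --batch_size)
print("effective batch size:", nproc_per_node * batch_size)  # 32
```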
./test.sh

If you find our work useful, please consider citing:
@inproceedings{li2020pastanet,
  title={PaStaNet: Toward Human Activity Knowledge Engine},
  author={Li, Yong-Lu and Xu, Liang and Liu, Xinpeng and Huang, Xijie and Xu, Yue and Wang, Shiyi and Fang, Hao-Shu and Ma, Ze and Chen, Mingyang and Lu, Cewu},
  booktitle={CVPR},
  year={2020}
}
