Mus_musculus.GRCm39.cdna.all.fa.gz (mouse) and Homo_sapiens.GRCh38.cdna.all.fa.gz (human) were downloaded from Ensembl.
- 00_get_seqs_mouse.ipynb, 00_get_seqs_human.ipynb
- get_aa_seq_mouse.py, get_aa_seq_human.py
- 03_inspect_aa_seq_human.ipynb, 04_inspect_aa_seq.ipynb
- prot_nuc_seqs_mouse.csv, prot_nuc_seqs_human.csv

| gene_symbol | ensg_id | enst_id | nuc_seq_length | aa_seq_length | nuc_aa_seq_ratio | chromosome | start | end | strand | nuc_seq | aa_seq |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Gm20730 | ENSMUSG00.. | ENSMUST00 | 359 | 119.0 | 3.01 | GRCm39:6 | 430 | 4305 | -1.0 | ATGAGGTGC | MRCLAEFLR. |
We used Nucleotide Transformer models to embed sequences with a maximum length of 5952 nucleotides; longer sequences were truncated to 5952 nucleotides. Specifically, we used the 500M_human_ref model for human and the 500M_multi_species_v2 model for mouse.
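The truncation step can be illustrated with a minimal sketch. It assumes that truncation keeps the first 5952 nucleotides of each sequence (the text does not say which end is dropped); the names `MAX_NUC_LEN` and `truncate_seq` are hypothetical and do not come from the notebooks.

```python
# Maximum sequence length stated in the text above.
MAX_NUC_LEN = 5952


def truncate_seq(nuc_seq: str, max_len: int = MAX_NUC_LEN) -> str:
    """Return at most the first `max_len` nucleotides of `nuc_seq`.

    Assumption: truncation keeps the start of the sequence; sequences at or
    below the limit are returned unchanged.
    """
    return nuc_seq[:max_len]


# Toy inputs: one short sequence and one longer than the limit (8000 nt).
seqs = {"short": "ATGAGGTGC", "long": "ACGT" * 2000}
truncated = {name: truncate_seq(seq) for name, seq in seqs.items()}
lengths = {name: len(seq) for name, seq in truncated.items()}
```

Here `lengths["short"]` stays at 9, while `lengths["long"]` is capped at 5952.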
04_nuc_nt_emb_mouse.ipynb, 05_nuc_nt_emb_human.ipynb

- See the nautilex-esm repository
We performed Leiden clustering on nt-emb and esm3-emb and computed 2D UMAP projections of each. The results are merged with those from MMIDAS joint clustering for further visualization.
06_make_df.ipynb, 07_static_plots.ipynb, 08_dynamic_plots.ipynb, 09_enrichr_vignettes.ipynb

- Rohan Gala
- Yeganeh Marghi