Where #AlphaFold prediction meets cryo-EM density
A recently deposited cryo-EM structure of a lipid export complex contains a small tracing issue and a 4-residue register shift in a transmembrane helix crucial for complex formation. Four residues are almost exactly one turn of an α-helix (~3.6 residues per turn), so the shifted side chains point in nearly the same directions as the correct ones. The mostly hydrophobic interface still looks perfectly reasonable. And yet, it’s wrong. After correcting the register and tracing, map fit improves, a plausible Phe–Met contact appears, and a proline moves from a helix cap into the helix core, gaining functional significance. Interestingly, in this case AlphaFold predicts the correct interface. However, details vary between repeated predictions while showing similar confidence scores. This illustrates that pLDDT, ipTM, and PAE are not designed to validate residue-level details in small interfaces. AI tools provide strong structural priors. Cryo-EM provides often weak but real experimental constraints. Tools that aggregate subtle map signals (like #checkMySequence) or extract residue-level contacts from weak evolutionary couplings (like #gapTrick) don’t replace human interpretation. They amplify weak evidence. The future of structural biology is not AI versus experiment. It’s iterative cross-validation between them.
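To see why a 4-residue register shift can hide so well, here is a minimal sketch using textbook ideal α-helix geometry (3.6 residues per turn, about 1.5 Å rise per residue). The function name is illustrative and not taken from any tool mentioned in the post.

```python
# Ideal alpha-helix geometry (textbook values, real helices deviate slightly).
RESIDUES_PER_TURN = 3.6
RISE_PER_RESIDUE_A = 1.5  # angstroms travelled along the helix axis per residue


def register_shift_effect(shift: int) -> tuple[float, float]:
    """Return (residual side-chain rotation in degrees, axial translation in angstroms)."""
    rotation = shift * 360.0 / RESIDUES_PER_TURN  # 100 degrees per residue
    residual = rotation % 360.0
    if residual > 180.0:
        residual -= 360.0  # report the smaller of the two equivalent angles
    return residual, shift * RISE_PER_RESIDUE_A


for shift in range(1, 6):
    rot, rise = register_shift_effect(shift)
    print(f"shift {shift}: side chains rotated {rot:+.0f} deg, moved {rise:.1f} A along the axis")
```

For a shift of 4 the residual rotation is only about 40° with a 6 Å slide along the axis, so hydrophobic side chains still face the same side of the helix, which is consistent with the interface looking reasonable despite being wrong.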
Structural Biology Data Analysis
Summary
Structural biology data analysis is the process of studying the three-dimensional shapes and movements of proteins and other biological molecules to understand their functions and interactions. New AI and machine learning methods are making it easier to interpret large datasets from experiments and simulations, helping scientists uncover how these molecules work and aiding drug discovery.
- Validate your models: Use both experimental data and AI predictions to cross-check structural details for improved accuracy and insight.
- Filter and focus: Systematically sift through generated data to highlight meaningful contacts and discard irrelevant information, saving time and resources in downstream research.
- Choose smart tools: Select software and computing hardware that match each analysis step, considering speed, cost, and the complexity of the data you’re working with.
-
Boltz-2: How much can 3D structure really tell us about molecular binding energetics? This week’s AI ∩ Bio: Reading the Revolution series covers Boltz-2, a new structural biology foundation model that exhibits strong performance for both structure and affinity prediction. To put this work in context, let’s start with the classic protein modeling pipeline logic: 🧬 Sequence → 🧱 Structure → 🎯 Function AlphaFold revolutionized the first step, grounded in the premise that function follows from structure. Boltz-2 puts that premise to the test. It starts at the middle of the pipeline — with the 3D structure of a protein–ligand complex — and asks: 👉 Can we predict binding affinity using only geometry? Key Insight: Structure is signal. Boltz-2 is a deep learning model that predicts binding affinity directly from 3D geometry — no sequence, no docking scores, no molecular dynamics. It learns by: >Using real 3D snapshots of protein–ligand complexes from experiments (via the PDBBind database) as “correct” examples >Comparing them to incorrect or nonbinding versions (decoys) >Teaching itself to distinguish between the two by assigning higher scores to the true binders — a method called contrastive learning >Viewing each complex from multiple angles and modeling how atoms interact using cross-attention between the ligand and protein The result? Accuracy approaching Free Energy Perturbation (FEP) — a gold-standard physics-based method — at a fraction of the computational cost. So: IF you have the correct structure, you can get binding affinity. But that’s the tradeoff. Boltz-2 doesn’t predict binding sites. It doesn’t model flexible loops or conformational dynamics. It assumes the structure is already known — and that it’s accurate. 
But we know that: 📎 Crystallography can trap proteins in inactive states 📎 Ligand poses may not reflect behavior in solution 📎 Flexibility is collapsed into a single static frame Still, Boltz-2 shows how much signal is embedded in structure — when that structure is right. 🌱 Reflection for Early-Career Scientists What happens when you flip the framing? Instead of building up from sequence to structure to function, Boltz-2 works from the middle, assuming structure is known, and asking how far that alone can take you. As a result, Boltz-2 sharpens the boundary of what structure can predict — and what it can’t. In other words, Boltz-2 is a boundary marker: a way to measure what’s possible if geometry is complete and correct.
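The contrastive idea as the post describes it (score true complexes above decoys) can be illustrated with a generic pairwise margin loss. This is a toy sketch of the general technique only; it makes no claim about Boltz-2's actual training objective, and the function name and margin value are assumptions.

```python
def margin_ranking_loss(true_score: float, decoy_scores: list[float], margin: float = 1.0) -> float:
    """Mean hinge loss: zero once the true complex outscores every decoy by at least `margin`."""
    if not decoy_scores:
        raise ValueError("need at least one decoy score")
    # Each decoy contributes a penalty proportional to how far it violates the margin.
    losses = [max(0.0, margin - (true_score - decoy)) for decoy in decoy_scores]
    return sum(losses) / len(losses)


# Hypothetical model scores for one true binder and two decoys.
print(margin_ranking_loss(3.0, [1.0, 0.5]))  # well separated
print(margin_ranking_loss(1.0, [1.5]))       # decoy scored above the true complex
```

Minimizing a loss of this shape pushes the scoring model to rank experimentally observed complexes above non-binding alternatives.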
-
Bayesian network modeling for analyzing protein dynamics Proteins are constantly moving, and these structural shifts help determine their roles in biology. Capturing the shifting conformations is critical for applications like drug development, yet the sheer amount of data produced from molecular simulations can be overwhelming. New strategies are needed to identify which interactions matter most and how they shape a protein’s overall behavior. Mukhaleva et al. introduce BaNDyT, a specialized software that employs Bayesian network modeling, an interpretable machine learning method designed to uncover probabilistic relationships in high-dimensional data. In this framework, each residue or residue pair is modeled as a node, and edges represent direct dependencies rather than mere correlations. The approach involves converting continuous simulation output into data bins, systematically searching for the best-fitting network structure, and then measuring each node’s weighted degree to highlight particularly influential contacts or regions. By filtering out redundant connections, the software effectively pinpoints functionally significant interactions buried in large-scale simulation datasets. Using this method on G protein-coupled receptor systems, the authors discovered both local and long-range interactions that drive protein dynamics. The researchers showed how BaNDyT can identify critical residues and communication pathways, even in distant parts of the structure, offering fresh insights into protein allostery. This interpretable machine learning approach lays a foundation for more nuanced studies of molecular interactions, broadening possibilities for research and therapeutic innovation. Paper: https://lnkd.in/dw6ypcaK #MachineLearning #BayesianNetworks #DataScience #ProteinDynamics #StructuralBiology #ComputationalBiology #Bioinformatics #DrugDiscovery #ComputationalChemistry #Proteomics #Pharmacology #ProteinFunction #MolecularModeling #AIforScience #Biotech
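Two of the steps described above, binning continuous simulation output and ranking nodes by weighted degree, can be sketched in a few lines. This is not the BaNDyT API: the function names, the equal-width binning scheme, the residue labels, and the edge weights are all illustrative assumptions, and the network structure search itself is omitted.

```python
from collections import defaultdict


def equal_width_bins(values, n_bins=3):
    """Discretize a continuous trajectory (e.g. a contact distance per frame) into bin indices."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # Clamp so the maximum value falls into the last bin instead of a new one.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def weighted_degree(edges):
    """Sum the weights of all edges touching each node; `edges` maps (node_a, node_b) -> weight."""
    degree = defaultdict(float)
    for (a, b), weight in edges.items():
        degree[a] += weight
        degree[b] += weight
    return dict(degree)


# Hypothetical per-frame distances (angstroms) and learned edge weights.
print(equal_width_bins([3.1, 3.4, 6.8, 7.2, 10.5]))
edges = {("R131", "Y219"): 0.5, ("R131", "D130"): 0.25, ("Y219", "N322"): 0.125}
ranked = sorted(weighted_degree(edges).items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

Nodes at the top of the ranking are the ones the learned network depends on most, which is the sense in which weighted degree highlights influential contacts.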
-
RosettaCommons just released a new bootcamp video series that covers the modern protein design stack end to end. It is one of the better educational resources I have seen for anyone entering this space or transitioning from legacy tools. Here are the key takeaways. 1. The standard workflow is now three modular steps: generate, design, validate. Backbone generation with RFdiffusion. Sequence design with ProteinMPNN (or LigandMPNN when small molecules are involved). Structure validation with AlphaFold2 or ESMFold. If the predicted structure matches the intended design at low RMSD, the sequence is considered designable and worth testing experimentally. Each step is independent and swappable, which is what makes the pipeline flexible. 2. Diffusion models have replaced rule-based design. Rosetta built proteins by minimizing physics-based energy functions. RFdiffusion starts from random noise and iteratively denoises it into a realistic backbone. AlphaFold3 and BoltzGen extend this further by reasoning over proteins, DNA, RNA, and small molecules simultaneously. The shift from energy minimization to learned generative models is the single biggest paradigm change in structural biology in the last five years. 3. MSAs and protein language models serve different design problems. AlphaFold2 uses multiple sequence alignments to extract co-evolutionary signal from natural homologs. ESMFold uses a protein language model trained on hundreds of millions of sequences to learn structural context implicitly, no alignment search needed. MSA-based methods excel when natural homologs exist. Language models are faster and better suited for de novo design where there is nothing to align against. Most real campaigns use both. 4. Filtering between steps matters more than generation parameters. RFdiffusion can produce thousands of backbones in a single run. The discipline is in what you throw away before moving to the next step. 
Check geometric criteria, verify contact distances, discard disordered loops. When designing inserts, test multiple lengths (20, 25, 30 residues) systematically rather than changing five variables at once. The compute cost of being methodical is trivial compared to running a failed experimental campaign with no diagnostic information. 5. Know your hardware. Rosetta runs well on CPUs because it performs sequential energy calculations. RFdiffusion, ProteinMPNN, and AlphaFold are typically run on GPUs, which accelerate their parallelized matrix operations. GPUs are faster but expensive and often scarce. For ProteinMPNN work on a large number of candidates, CPUs can be the cost-effective choice when turnaround time is not the constraint. Choosing the right hardware for each step is a real operational decision, not a trivial one. The full playlist is worth watching for anyone building or managing a protein design pipeline. https://lnkd.in/eTpdsvN9
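The "validate, then filter" step from points 1 and 4 can be sketched as an RMSD check between the designed backbone and the predicted structure. This is a simplified illustration: it assumes the two coordinate sets are already superposed and matched atom for atom, and the 2.0 Å cutoff and design names are hypothetical, not values from the bootcamp.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of (x, y, z) tuples."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and the same length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


def keep_designable(designs, max_rmsd=2.0):
    """Keep names of designs whose predicted structure stays within `max_rmsd` of the design."""
    return [name for name, designed, predicted in designs if rmsd(designed, predicted) <= max_rmsd]


# Hypothetical C-alpha traces: (name, designed coordinates, predicted coordinates).
designs = [
    ("design_a", [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)], [(0.5, 0.0, 0.0), (3.8, 0.5, 0.0)]),
    ("design_b", [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)], [(4.0, 3.0, 0.0), (3.8, 5.0, 0.0)]),
]
print(keep_designable(designs))
```

In a real pipeline the superposition (for example with the Kabsch algorithm) and confidence filters would come first; the point here is only that discarding designs is a cheap, explicit step between generation and experiment.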
-
𝗖𝗿𝘆𝗼𝟮𝗦𝘁𝗿𝘂𝗰𝘁: 𝗔𝗜-𝗗𝗿𝗶𝘃𝗲𝗻 𝗗𝗲 𝗡𝗼𝘃𝗼 𝗣𝗿𝗼𝘁𝗲𝗶𝗻 𝗦𝘁𝗿𝘂𝗰𝘁𝘂𝗿𝗲 𝗠𝗼𝗱𝗲𝗹𝗶𝗻𝗴 𝗳𝗼𝗿 𝗖𝗿𝘆𝗼-𝗘𝗠 Building atomic protein structures from Cryo-EM density maps remains one of the biggest challenges in structural biology, especially in de novo modeling scenarios where no homologous structures are available. Enter Cryo2Struct, a fully automated AI-based method that leverages 3D transformers and Hidden Markov Models (HMMs) to predict atomic-level protein structures directly from Cryo-EM maps. 🔬 Key Innovations in Cryo2Struct 🔹 𝟯𝗗 𝗧𝗿𝗮𝗻𝘀𝗳𝗼𝗿𝗺𝗲𝗿-𝗣𝗼𝘄𝗲𝗿𝗲𝗱 𝗔𝘁𝗼𝗺 𝗣𝗿𝗲𝗱𝗶𝗰𝘁𝗶𝗼𝗻 Cryo2Struct employs a deep learning model with an attention mechanism to accurately predict Cα, N, and C backbone atoms, as well as amino acid types, directly from Cryo-EM density maps. 🔹 𝗛𝗶𝗱𝗱𝗲𝗻 𝗠𝗮𝗿𝗸𝗼𝘃 𝗠𝗼𝗱𝗲𝗹 (𝗛𝗠𝗠)-𝗕𝗮𝘀𝗲𝗱 𝗕𝗮𝗰𝗸𝗯𝗼𝗻𝗲 𝗧𝗿𝗮𝗰𝗶𝗻𝗴 A customized Viterbi algorithm is used to connect predicted atoms into protein chains, aligning them with the input protein sequence to generate high-accuracy atomic structures. 🔹 𝗥𝗼𝗯𝘂𝘀𝘁 𝗔𝗴𝗮𝗶𝗻𝘀𝘁 𝗠𝗮𝗽 𝗥𝗲𝘀𝗼𝗹𝘂𝘁𝗶𝗼𝗻 𝗩𝗮𝗿𝗶𝗮𝗯𝗶𝗹𝗶𝘁𝘆 Unlike traditional methods, Cryo2Struct is highly robust across varying Cryo-EM resolutions (1.9–4.0 Å), maintaining high accuracy in atomic modeling even with low-resolution maps. 🔹 𝗦𝘂𝗽𝗲𝗿𝗶𝗼𝗿 𝘁𝗼 𝗘𝘅𝗶𝘀𝘁𝗶𝗻𝗴 𝗗𝗲 𝗡𝗼𝘃𝗼 𝗠𝗼𝗱𝗲𝗹𝗶𝗻𝗴 𝗧𝗼𝗼𝗹𝘀 Compared to Phenix, one of the most widely used Cryo-EM modeling tools, Cryo2Struct achieves: ✅ 65% recall vs. 40% in Phenix for identifying correct Cα atoms ✅ Higher F1 score (66%) and TM-score (0.22) for overall model quality ✅ 2.6× more complete protein structures than Phenix, ensuring better model coverage 🚀 𝗪𝗵𝘆 𝗧𝗵𝗶𝘀 𝗠𝗮𝘁𝘁𝗲𝗿𝘀 𝗳𝗼𝗿 𝗦𝘁𝗿𝘂𝗰𝘁𝘂𝗿𝗮𝗹 𝗕𝗶𝗼𝗹𝗼𝗴𝘆 & 𝗔𝗜-𝗗𝗿𝗶𝘃𝗲𝗻 𝗣𝗿𝗼𝘁𝗲𝗶𝗻 𝗠𝗼𝗱𝗲𝗹𝗶𝗻𝗴 🔬 Faster & more accurate atomic model building without homologous templates 🧩 A breakthrough in de novo protein structure prediction using Cryo-EM ⚡ A step forward in AI-powered structure-based drug discovery & biomolecular engineering 💬 𝗪𝗵𝗮𝘁’𝘀 𝗻𝗲𝘅𝘁 𝗳𝗼𝗿 𝗔𝗜-𝗱𝗿𝗶𝘃𝗲𝗻 𝗖𝗿𝘆𝗼-𝗘𝗠 𝗺𝗼𝗱𝗲𝗹𝗶𝗻𝗴? 𝗖𝗼𝘂𝗹𝗱 𝗖𝗿𝘆𝗼𝟮𝗦𝘁𝗿𝘂𝗰𝘁 𝗽𝗮𝘃𝗲 𝘁𝗵𝗲 𝘄𝗮𝘆 𝗳𝗼𝗿 𝗳𝘂𝗹𝗹𝘆 𝗮𝘂𝘁𝗼𝗺𝗮𝘁𝗲𝗱 𝘀𝘁𝗿𝘂𝗰𝘁𝘂𝗿𝗲 𝗱𝗲𝘁𝗲𝗿𝗺𝗶𝗻𝗮𝘁𝗶𝗼𝗻? 𝗟𝗲𝘁’𝘀 𝗱𝗶𝘀𝗰𝘂𝘀𝘀! 
🔗 Read the full paper here: https://lnkd.in/dKVRA7by #CryoEM #StructuralBiology #MachineLearning #ComputationalBiology #ProteinModeling #AIinScience #Bioinformatics #DeepLearning #DrugDiscovery #3DTransformers #HiddenMarkovModels
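For readers unfamiliar with the Viterbi algorithm mentioned above, here is a textbook log-space version on a toy two-state HMM. It is not Cryo2Struct's customized variant; the state names, observation symbols, and probabilities are invented for illustration.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most probable hidden-state path for a sequence of observations."""
    if not observations:
        return []
    # best[t][s] holds the best log-probability of any path ending in state s at step t.
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = [{}]
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state to recover the full path.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back[1:]):
        state = pointers[state]
        path.append(state)
    return path[::-1]


def logs(table):
    """Convert a nested dict of probabilities to log-probabilities."""
    return {k: {kk: math.log(vv) for kk, vv in v.items()} for k, v in table.items()}


states = ["A", "B"]
log_start = {"A": math.log(0.5), "B": math.log(0.5)}
log_trans = logs({"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.1, "B": 0.9}})
log_emit = logs({"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.1, "y": 0.9}})
print(viterbi(["x", "x", "y", "y"], states, log_start, log_trans, log_emit))
```

In backbone tracing the hidden states would correspond to positions in the known sequence and the observations to predicted atoms, so the best path is an assignment of sequence to density.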
-
In a paper newly published in mAbs, NaturalAntibody-, Genmab-, and Sanofi-based authors introduce NAStructuralDB, a new database consisting of general proteins, antibodies, Nanobodies®, and their interfaces, which can be used for deep learning featurization on par with state-of-the-art use cases for structure predictions, co-folding, diffusive design, docking, or binding prediction. From the abstract: Studying the interactions between antibodies and antigens is fundamental to the development of novel therapeutic biologics. Predictions of such interactions start with data collection. Though there exist reliable resources to identify antibody structures in the Protein Data Bank (PDB), such data still requires substantial processing to be usable in predictive tasks. Redundancy in sequences needs to be removed to avoid data leakages between train, test, and validation sets. Descriptors such as surface accessibility, secondary structure, and antibody region information need to be additionally annotated. Information on inter- and intra-molecular contacts, which is crucial to studying paratope/epitope information, needs to be collected. The specialized immunoglobulin format of Nanobodies® requires a separate dataset mirroring that of antibodies, given that their structure contains only a single VHH chain. Because antibody–antigen structures account for a small amount of all protein–protein contacts, having a molecular contact reference from other proteins is also desired. To address these issues, we introduce NAStructuralDB (https://lnkd.in/etiQnPjw), a dataset of processed structures of antibodies, Nanobodies®, proteins, and their complexes with molecular contact information and associated annotations. We use the opportunity of having collected the contact data to provide a reference of binding propensities of different residues across distinct contact types. https://lnkd.in/e-pw6c3p
-
Did you know the Protein Data Bank isn’t just one website? The global archive of 3D macromolecular structures, coordinated by the wwPDB, is synchronized across multiple portals, but each offers unique tools and strengths. If you’re working with structural biology or bioinformatics, knowing which one to use can save you time. 🔹 RCSB PDB – The U.S. portal • Advanced search (GraphQL API for complex queries) • Ligand Explorer and Chemical Component Dictionary • Mol* 3D viewer • PDB‑101 educational resources 🔹 PDBe (Protein Data Bank in Europe) • Clean, intuitive interface • PDBe‑KB: integrates functional data (GO, InterPro, UniProt links) • Fast EBI FTP for bulk downloads • Great for functional and genomic context 🔹 PDBj (Protein Data Bank Japan) • Multilingual (Japanese/English) • eF‑site for electrostatic surface mapping • SeSAW sequence‑structure motif searches • Aspera for ultra‑fast bulk downloads (ideal for Asia/global users) All three portals share the same structural data, updated weekly. The difference lies in the tools, APIs, and linked databases each provides. #Bioinformatics #StructuralBiology #ProteinDataBank #RCSB #PDBe #PDBj #MolecularBiology #DrugDiscovery #DataScience #APIs #LifeSciences
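As a small illustration of "same data, different portals", the sketch below builds mmCIF download URLs for RCSB PDB and PDBe and a minimal request body for the RCSB GraphQL API. The URL patterns and query shape reflect my understanding of the public services at the time of writing and should be checked against each portal's documentation; the helper names are hypothetical, and no network request is made.

```python
import json

RCSB_GRAPHQL_URL = "https://data.rcsb.org/graphql"  # endpoint to POST the payload to


def entry_file_urls(pdb_id: str) -> dict[str, str]:
    """Build mmCIF download URLs for the same entry on two wwPDB partner sites."""
    pid = pdb_id.strip()
    if len(pid) != 4 or not pid.isalnum():
        raise ValueError(f"not a classic 4-character PDB ID: {pdb_id!r}")
    return {
        "RCSB PDB": f"https://files.rcsb.org/download/{pid.upper()}.cif",
        "PDBe": f"https://www.ebi.ac.uk/pdbe/entry-files/download/{pid.lower()}.cif",
    }


def title_query_payload(pdb_id: str) -> str:
    """JSON body asking the RCSB GraphQL API for an entry's title."""
    query = '{ entry(entry_id: "%s") { struct { title } } }' % pdb_id.upper()
    return json.dumps({"query": query})


print(entry_file_urls("4hhb"))
print(title_query_payload("4hhb"))
```

The payload would be sent with any HTTP client as a POST with a JSON content type; PDBj is left out only because I am not certain of its current URL scheme.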
-
"Proteins are key to all cellular processes and their structure is important in understanding their function and evolution. Sequence-based predictions of protein structures have increased in accuracy, and over 214 million predicted structures are available in the AlphaFold database. However, studying protein structures at this scale requires highly efficient methods. Here, we developed a structural-alignment-based clustering algorithm—Foldseek cluster—that can cluster hundreds of millions of structures. Using this method, we have clustered all of the structures in the AlphaFold database, identifying 2.30 million non-singleton structural clusters, of which 31% lack annotations representing probable previously undescribed structures. Clusters without annotation tend to have few representatives covering only 4% of all proteins in the AlphaFold database. Evolutionary analysis suggests that most clusters are ancient in origin but 4% seem to be species specific, representing lower-quality predictions or examples of de novo gene birth. We also show how structural comparisons can be used to predict domain families and their relationships, identifying examples of remote structural similarity. On the basis of these analyses, we identify several examples of human immune-related proteins with putative remote homology in prokaryotic species, illustrating the value of this resource for studying protein function and evolution across the tree of life." Exciting study which used a structural-alignment-based clustering algorithm to align and cluster all 214 million proteins from the AlphaFold Protein Structure Database. Structural alignment analysis resulted in 2.30 million clusters with 31% of clusters being previously undescribed structural or domain family annotations. Paper and research by @Inigo Barrio-Hernandez, Jingi Yeo and larger team https://lnkd.in/gF2ipNdd
-
The paper addresses a fundamental challenge in computational structural biology: distinguishing metamorphic proteins, which can adopt multiple distinct conformations, from single-fold proteins that stably adopt only one structure. Traditional structure prediction tools like AlphaFold2 excel at predicting a single conformation from sequence, but this paradigm limits their ability to capture the conformational diversity intrinsic to metamorphic proteins. To overcome this, the authors repurpose AlphaFold2 by generating ensembles of predicted structures through a tailored multiple sequence alignment sampling strategy. From these ensembles, they extract features that quantify structural modality and dispersion, capturing variability across conformations that may indicate metamorphic behavior. Using these features, the authors develop a random forest classifier trained on a curated dataset of known metamorphic and single-fold proteins, achieving a mean area under the ROC curve (AUC) of 0.869 under cross-validation, which demonstrates strong discriminative performance. Applying the classifier to 600 randomly sampled proteins from the Protein Data Bank reveals several potentially novel metamorphic protein candidates, including the 40S ribosomal protein S30, whose conformational change is linked to antimicrobial defense. This integrated approach combining AlphaFold2 with statistical learning offers a powerful new tool for identifying conformationally diverse proteins and deepens our understanding of their functional roles. https://lnkd.in/gc5bj5Zq
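The reported AUC of 0.869 has a concrete meaning: it is the probability that a randomly chosen metamorphic protein receives a higher classifier score than a randomly chosen single-fold protein. A minimal sketch of that definition follows; the labels and scores are made up, and this is not the authors' evaluation code.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    # A positive outranking a negative counts 1, a tie counts 0.5.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical classifier outputs: 1 = metamorphic, 0 = single-fold.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(roc_auc(labels, scores))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which puts the paper's cross-validated value in context.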