Developments in Genomic Data Mining

Explore top LinkedIn content from expert professionals.

Summary

Developments in genomic data mining refer to advanced techniques and computational tools used to analyze vast amounts of genetic information, helping scientists uncover patterns linked to diseases, traits, and biological processes. These innovations are transforming how researchers interpret DNA, diagnose conditions, and create personalized treatments.

  • Integrate diverse data: Combine information from genomics, transcriptomics, and imaging to get a more complete picture of disease mechanisms and potential treatments.
  • Apply AI models: Use artificial intelligence algorithms to sift through complex genetic data, identify hidden mutations, and improve diagnostic accuracy for rare and common conditions.
  • Uncover regulatory clues: Focus on analyzing noncoding regions of DNA and functional relationships between nucleotides to find new biomarkers and understand gene regulation, which can guide targeted therapies.
Summarized by AI based on LinkedIn member posts
  • View profile for Luke Yun

    Founder @ Decisive Machines | AI Researcher @ Harvard Medical School

    33,077 followers

    Research from Harvard & MIT used AI to unlock molecular insights in cancer pathology. Foundation models are revolutionizing computational pathology, but most struggle to analyze entire whole-slide images (WSIs) and incorporate molecular data. THREADS introduces a multimodal foundation model that learns from both histopathology slides and molecular profiles.
    • Pretrained on 47,171 H&E-stained WSIs with genomic and transcriptomic profiles, the largest dataset of its kind.
    • Enabled state-of-the-art survival prediction, identifying high-risk patients with up to 8.9% higher accuracy than previous models.
    • Excelled in low-data scenarios, achieving near-clinical accuracy with just 4 training samples per class.
    • Introduced "molecular prompting", allowing the AI to classify cancer types and mutations without task-specific training.
    I like that the architecture of THREADS is notably modular. It begins with an ROI encoder based on CONCHV1.5 (a ViT-L model fine-tuned with vision-language data) to extract patch features. The patch features are then aggregated into a slide-level embedding via an attention-based multiple instance learning (ABMIL) slide encoder. In parallel, distinct encoders for transcriptomic data (a modified scGPT) and genomic data (a multi-layer perceptron) create molecular embeddings. This design not only enables integration of heterogeneous data types but also achieves remarkable parameter efficiency: THREADS is reported to be 4× smaller than PRISM and 7.5× smaller than GIGAPATH, yet outperforms them on 54 oncology tasks.
    Here's the awesome work: https://lnkd.in/g5y5HFuV Congrats to Faisal Mahmood, Anurag Vaidya, Andrew Zhang, Guillaume Jaume, and co! I post my takes on the latest developments in health AI; connect with me to stay updated! Also, check out my health AI blog here: https://lnkd.in/g3nrQFxW
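
    As a rough illustration of the modular design described above (an ROI patch encoder feeding an attention-based MIL slide encoder, with a parallel MLP over molecular profiles), here is a minimal PyTorch sketch. Layer sizes, module names, and the dummy usage are assumptions for illustration, not the published THREADS implementation.

```python
# Minimal sketch of a THREADS-style modular design (illustrative only;
# dimensions and names are assumptions, not the published code).
import torch
import torch.nn as nn

class ABMILSlideEncoder(nn.Module):
    """Attention-based multiple-instance learning: pools ROI patch features
    (e.g. from a ViT-L encoder such as CONCHV1.5) into one slide embedding."""
    def __init__(self, patch_dim=768, embed_dim=512):
        super().__init__()
        self.project = nn.Linear(patch_dim, embed_dim)
        self.attention = nn.Sequential(nn.Linear(embed_dim, 128), nn.Tanh(),
                                       nn.Linear(128, 1))

    def forward(self, patch_feats):                     # (n_patches, patch_dim)
        h = self.project(patch_feats)                   # (n_patches, embed_dim)
        weights = torch.softmax(self.attention(h), dim=0)  # one weight per patch
        return (weights * h).sum(dim=0)                 # (embed_dim,) slide embedding

class MolecularEncoder(nn.Module):
    """Stand-in for the genomic branch: a multilayer perceptron over a
    per-sample molecular profile."""
    def __init__(self, in_dim=20000, embed_dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 1024), nn.ReLU(),
                                 nn.Linear(1024, embed_dim))

    def forward(self, x):                               # (in_dim,) molecular profile
        return self.net(x)

# Usage with dummy data: both branches land in the same embedding space, so
# slide and molecular representations can be aligned or fused during pretraining.
slide_emb = ABMILSlideEncoder()(torch.randn(1000, 768))
mol_emb = MolecularEncoder()(torch.randn(20000))
```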

  • View profile for Ken Wasserman

    Assistant Professor at Georgetown University School of Medicine

    4,327 followers

    NotebookLM: "...[described is] a novel technique called nucleotide dependency analysis to enhance the interpretability of genomic language models (gLMs) and detect functional elements within DNA sequences. By quantifying how a single nucleotide substitution affects the predicted probability of another nucleotide, this method effectively uncovers functional relationships that existing gLM reconstruction methods often miss. The researchers demonstrate that these dependencies are superior at indicating the deleteriousness of genetic variants and can accurately map diverse genomic features, including regulatory motifs, interactions between distal elements like splice sites, and complex RNA secondary and tertiary structures, including pseudoknots, all in an alignment-free manner. Ultimately, dependency maps serve as a powerful new tool for dissecting the regulatory code and diagnosing the limitations of different gLM architectures and training data choices."
    From the source: "...we introduced nucleotide dependencies that quantify how nucleotide substitutions at one genomic position affect the likelihood of nucleotides at another position. This new metric appears as a general and effective approach to identifying functionally related nucleotides using gLMs. Nucleotide dependency maps reveal functional elements across various biological processes, including transcriptional, post-transcriptional regulatory elements, their interactions and RNA folding. Therefore, this new metric has implications across multiple areas of computational and genome biology."
    https://lnkd.in/ebVkQHp8
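
    To make the core idea concrete, here is a minimal sketch of how a dependency score between two positions could be computed with a masked genomic language model. The `predict_probs` stub and the symmetric-KL formulation are illustrative assumptions; the paper defines its own metric.

```python
# Hedged sketch of a nucleotide dependency score between two positions,
# computed with an arbitrary masked genomic language model (gLM).
import numpy as np

BASES = "ACGT"

def predict_probs(sequence: str, position: int) -> np.ndarray:
    """Placeholder: return the gLM's probability distribution over A/C/G/T
    at `position`, given the rest of `sequence` (e.g. via masked prediction)."""
    raise NotImplementedError

def dependency_score(sequence: str, query: int, target: int) -> float:
    """How much do substitutions at `query` shift the predicted distribution
    at `target`? Larger values suggest a functional relationship."""
    ref_probs = predict_probs(sequence, target)
    shifts = []
    for alt in BASES:
        if alt == sequence[query]:
            continue
        mutated = sequence[:query] + alt + sequence[query + 1:]
        alt_probs = predict_probs(mutated, target)
        # Symmetric KL divergence between reference and post-substitution distributions
        kl = np.sum(ref_probs * np.log(ref_probs / alt_probs)) + \
             np.sum(alt_probs * np.log(alt_probs / ref_probs))
        shifts.append(kl)
    return float(max(shifts))  # strongest effect over the three substitutions

# A full dependency map is the matrix of scores over all (query, target) pairs.
```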

  • View profile for Michele Ferrante

    Accomplished Sr. Program Director & AI/ML expert w/ a track record of scaling digital & computational psychiatry programs. Excels at bridging cutting-edge research, regulatory strategy, & cross-functional teams.

    6,176 followers

    A new article [in the comments] leverages computational methods to integrate high-dimensional genomic and neuroimaging data to uncover the developmental role of regional gene expression differences in the human cortex and their association with neurodevelopmental disorders like autism spectrum disorder (ASD) and schizophrenia (SCZ). The study explores how cortical gene expression dynamics during different developmental stages correlate with the structural and functional organization of the human brain, and how these patterns might deviate in neurodevelopmental disorders. Using a computational framework, the study analyzes gene expression data from the Allen Human Brain Atlas in conjunction with neuroimaging data and other genomic datasets like PsychENCODE. Data analytics and dimension reduction methods (e.g., PCA, DME) are employed to identify robust patterns.
    Findings: [1] The analysis highlights three major transcriptional components (C1, C2, C3) that correspond to different aspects of cerebral function and linkage to disorders. C1 is associated with neuron-specific patterns, C2 with metabolic processes, and C3 with synaptic planning and immune responses. [2] These components show distinct temporal patterns across fetal to adolescent brain development, with implications for understanding the evolution of cortical functions. [3] C1 and C2 show a strong correlation with ASD across multiple data modalities, whereas C3 is more closely associated with SCZ. This highlights how different developmental trajectories and gene expression disruptions can relate to specific clinical outcomes.
    Implications for Computational Psychiatry: [1] The research demonstrates the utility of integrating genomic, transcriptomic, and neuroimaging data in a computational framework to study complex brain disorders, providing a more comprehensive understanding of the underpinnings of these conditions. [2] The identified gene expression components could further be utilized to develop predictive models for identifying individuals at high risk for these disorders based on their cortical gene expression patterns. [3] Understanding specific gene-environment interactions that lead to disorder-specific deviations from normal cortical development might open up new avenues for targeted therapeutic interventions.
    Conclusion: The study effectively uses computational tools to link high-dimensional biological data with brain organization and disorder phenotypes and makes a significant contribution by providing insights into the molecular mechanisms contributing to neurodevelopmental disorders. This computational approach not only uncovers the intricate gene expression dynamics that shape the human cortex but also illustrates how deviations from these normative patterns are associated with clinical conditions, thus offering new pathways for diagnosis and treatment.
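
    As a rough sketch of the dimension-reduction step described above (extracting a few transcriptional components from a regions-by-genes expression matrix and relating them to a neuroimaging measure), the following uses plain PCA on stand-in data. It illustrates the general approach, not the study's actual pipeline.

```python
# Hedged sketch: extract leading transcriptional components from a cortical
# regions-by-genes expression matrix and correlate them with an imaging map.
# All data here are random placeholders.
import numpy as np
from sklearn.decomposition import PCA
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
expression = rng.normal(size=(180, 5000))   # (n_regions, n_genes), stand-in for atlas data
imaging_map = rng.normal(size=180)          # (n_regions,) e.g. a structural/functional measure

# Standardize genes, then extract the leading components (analogous to C1-C3)
expression = (expression - expression.mean(0)) / expression.std(0)
pca = PCA(n_components=3)
region_scores = pca.fit_transform(expression)   # (n_regions, 3) component score per region
gene_loadings = pca.components_                 # (3, n_genes) gene weights per component

# Correlate each component's regional score map with the imaging measure
for i in range(3):
    r, p = pearsonr(region_scores[:, i], imaging_map)
    print(f"Component {i + 1}: r = {r:.2f}, p = {p:.3f}")
```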

  • View profile for George L.

    Global Pharma & Life Sciences Executive | Expert in Biomarkers, Diagnostics, Computational Pathology & AI | Transformational Leader Driving Growth, Innovation & Patient-Centered Impact | AI for Medical Education

    5,251 followers

    Massive Genomic Study Reshapes Our View of Breast Cancer Progression and Treatment
    Groundbreaking new research from Samsung Medical Center, Sungkyunkwan University School of Medicine, Seoul, Republic of Korea, analyzing whole-genome sequences (WGS) from 1,364 clinically annotated breast cancers, significantly advances our understanding of the disease's genomic landscape and its profound link to patient outcomes. This large-scale study, which included deep transcriptome data, delivered several high-impact findings:
    • Expanded Driver Atlas: The team identified a broader range of oncogenic alterations, including novel driver genes, recurrent gene fusions, and structural variants, expanding the known repertoire of breast cancer mechanisms.
    • Decades-Long Timeline: Timing analysis of copy number alterations revealed that genomic instability emerges decades before a tumor is clinically diagnosed. This offers crucial new insights into the extremely early initiation of tumorigenesis, opening doors for prevention strategies.
    • Predictive Biomarker Power: Pattern-driven genomic features, including mutational signatures, homologous recombination deficiency (HRD), and tumor heterogeneity scores, were strongly associated with clinical outcomes.
    The Clinical Takeaway: These findings highlight the potential for using comprehensive WGS data to develop predictive biomarkers that can better guide therapeutic decisions for individual patients, particularly concerning the use of:
    • CDK4/6 inhibitors
    • HER2 inhibitors
    • Adjuvant and neoadjuvant chemotherapy
    This research underscores the power of large-scale, clinically integrated whole-genome sequencing to translate complex genomic data into actionable insights, ultimately driving personalized and improved patient care.
    #BreastCancer #Genomics #Oncology #PrecisionMedicine #WGS #Biomarkers
    Figure Courtesy: Nature
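
    One of the pattern-driven features mentioned above, mutational signatures, is commonly estimated by fitting a tumor's 96-channel substitution spectrum to a set of reference signatures. Below is a hedged sketch using non-negative least squares on random stand-in data; it illustrates the general technique and does not reproduce the study's methods.

```python
# Hedged sketch: estimate mutational-signature exposures for one tumor by
# non-negative least squares. The reference matrix is random stand-in data,
# not COSMIC signatures.
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(1)
n_channels, n_signatures = 96, 10

# Columns are signatures: each a probability vector over the 96 trinucleotide
# substitution channels.
reference_signatures = rng.random((n_channels, n_signatures))
reference_signatures /= reference_signatures.sum(axis=0)

# Observed mutation counts per channel for one tumor.
tumor_spectrum = rng.poisson(lam=50, size=n_channels).astype(float)

# Solve min ||S @ e - m||_2 subject to e >= 0.
exposures, residual = nnls(reference_signatures, tumor_spectrum)
exposure_fractions = exposures / exposures.sum()
print("Dominant signature index:", int(exposure_fractions.argmax()))
```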

  • View profile for Bill Gadless

    Founding Partner, emagineHealth | No-fluff, No-BS Marketing for Life Sciences, Healthcare, CDMOs, CROs, MedTech, & Diagnostics | Keep it real. Differentiate. No apologies | Current (esophageal) cancer fighter💪🏼

    37,878 followers

    Illumina has cracked a code that's been frustrating rare disease researchers for decades. Their new PromoterAI algorithm can finally interpret the 98% of the human genome that everyone's been ignoring - the noncoding regions where regulatory variants hide.
    Here's why this matters: only 30% of rare disease patients get an accurate diagnosis from exome sequencing. The other 70%? Their answers might be buried in promoter regions that control gene expression but have been impossible to decode at scale. Until now.
    Published in Science, PromoterAI discovered regulatory variants that contribute up to 6% of rare disease causes. When combined with Illumina's other AI tools (SpliceAI and PrimateAI-3D), they're doubling diagnostic yield compared to traditional approaches.
    This is more than a technical breakthrough. It's potentially life-changing for families who've spent years searching for answers.
    The bigger story? Illumina keeps building an AI ecosystem that turns genomic data into actionable insights. They're not just selling sequencers anymore - they're becoming the intelligence layer for precision medicine. Every rare disease diagnosis that was previously impossible just became possible.

  • View profile for Manuel Corpas

    Genomics & AI for Global Health Equity | Senior Lecturer, University of Westminster

    11,237 followers

    Last week I stood in front of 50 bioinformaticians and ran a pharmacogenomics analysis in under one second. No cloud. No data leaving the room. Within 24 hours, a researcher I'd never met submitted a pull request adding a nutrigenomics skill I hadn't planned. That's how ClawBio started.
    The problem: general-purpose AI is powerful but blind to biology. It hallucinates star allele calls. It uses outdated CPIC guidelines. And you can't send patient genomes to a cloud API. ClawBio fixes this. It's a skill library that gives AI agents real bioinformatics expertise - pharmacogenomics, equity scoring, metagenomics, nutrigenomics - all running locally on your machine.
    What we shipped in one week:
    - 7 production skills (PharmGx, Equity Scorer, NutriGx, Metagenomics, and more)
    - 57 automated tests, CI on 3 Python versions
    - 1 community contribution merged in 24 hours
    - Published on ClawHub registry
    What I learned:
    1. Methodology before code - a detailed spec is itself useful
    2. Local-first isn't a limitation, it's the moat
    3. One unsolicited PR proves architecture more than any benchmark
    4. Tests are trust signals - it's why I merged fast
    8 more skills are waiting for contributors: VCF annotation, scRNA-seq, protein structure, lit synthesis. If you work with genomic data and want to build: github.com/ClawBio/ClawBio MIT licensed. Every analysis ships with a reproducibility bundle.
    #Bioinformatics #AI #Genomics #OpenSource #Pharmacogenomics #HealthEquity
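
    Purely as a thought experiment on the local-first, reproducibility-bundle idea described above, here is a hypothetical sketch of what such a skill could look like. None of these names or functions come from the ClawBio codebase; see the repository for the real interface.

```python
# Hypothetical sketch of a local-first analysis "skill" that writes a
# reproducibility bundle alongside each result. Not ClawBio's actual API.
import hashlib, json, platform, sys
from datetime import datetime, timezone
from pathlib import Path

def run_pgx_skill(vcf_path: str, out_dir: str = "results") -> dict:
    """Run a (stubbed) pharmacogenomics analysis on a local VCF and record
    inputs, environment, and outputs so the run can be reproduced."""
    vcf_bytes = Path(vcf_path).read_bytes()          # data never leaves the machine
    result = {"star_alleles": "stub - a real caller goes here"}

    bundle = {
        "skill": "pharmgx-demo",                     # illustrative name
        "input_sha256": hashlib.sha256(vcf_bytes).hexdigest(),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "run_at": datetime.now(timezone.utc).isoformat(),
        "result": result,
    }
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    (out / "reproducibility_bundle.json").write_text(json.dumps(bundle, indent=2))
    return result
```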

  • View profile for Slavé Petrovski

    Vice President, Centre for Genomics Research at AstraZeneca

    5,584 followers

    We’ve stopped asking how AI can make us more efficient and started asking what becomes possible at a scale that was previously unachievable. With data, models, and compute accelerating, the new constraint is no longer access, but our ability to reason across it all. That’s why, at the AstraZeneca Centre for Genomics Research, we’re pioneering agentic systems that navigate massive datasets, interact with tools, and surface mechanistic hypotheses where literature is sparse, giving our teams a clearer, faster path to high‑value targets. It’s early, and we’re learning. In the article below, I explain why this shift from high-throughput discovery to scalable biological reasoning feels anything but incremental.

  • View profile for Thomas Pierrot

    Head of AI @ Stealth AI Startup

    3,921 followers

    When you think about computer vision, segmentation models label every pixel of an image, assigning each to a car, a tree, or a person. This is the type of system used, for instance, in self-driving cars to understand their surroundings and distinguish one object from another. In genomics, when annotating genomes, we face a conceptually similar challenge: identifying, for every nucleotide, what biological element it belongs to, such as a gene, an exon, a promoter, or an enhancer.
    When we built the Nucleotide Transformer (NT) foundation model, we observed that it had implicitly learned many of these genomic elements through large-scale pre-training, even without supervision. This raised a new question: if NT already captures such representations, could we combine it with the segmentation principles used in computer vision, treating the genome like a one-dimensional image, to achieve explicit, high-resolution annotation?
    That idea became SegmentNT, a model that builds on NT's learned representations and integrates a segmentation head inspired by computer vision architectures. It achieves single-nucleotide-resolution annotation across genic, splicing, and regulatory elements, multiple species, and sequences up to 50,000 base pairs. This work, now published in Nature Methods (Nature Portfolio), represents another important step toward a unified, data-driven framework for interpreting genomes with AI. Very proud of the InstaDeep team; they did an outstanding job on this one.
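
    A minimal sketch of the core idea, a 1D segmentation head that maps per-token embeddings from a pretrained DNA model to per-nucleotide element probabilities, is below. Layer sizes, the upsampling factor, and the head design are illustrative assumptions, not the SegmentNT architecture.

```python
# Hedged sketch of a 1D "genome segmentation" head on top of embeddings from a
# pretrained DNA foundation model. All dimensions are illustrative.
import torch
import torch.nn as nn

class SegmentationHead1D(nn.Module):
    """Convolutional head over the sequence axis, analogous to dense
    prediction heads in computer-vision segmentation."""
    def __init__(self, embed_dim=1024, n_elements=14, tokens_to_bases=6):
        super().__init__()
        # Upsample from token resolution (e.g. k-mer tokens) to single-nucleotide resolution
        self.upsample = nn.Upsample(scale_factor=tokens_to_bases, mode="linear",
                                    align_corners=False)
        self.conv = nn.Sequential(
            nn.Conv1d(embed_dim, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(256, n_elements, kernel_size=1),
        )

    def forward(self, token_embeddings):         # (batch, n_tokens, embed_dim)
        x = token_embeddings.transpose(1, 2)     # (batch, embed_dim, n_tokens)
        x = self.upsample(x)                     # (batch, embed_dim, n_bases)
        logits = self.conv(x)                    # (batch, n_elements, n_bases)
        return torch.sigmoid(logits)             # per-nucleotide element probabilities

# Usage with dummy embeddings for a 300-token window (~1,800 bp at 6 bases/token)
probs = SegmentationHead1D()(torch.randn(1, 300, 1024))
```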

  • View profile for Zhaohui Su

    VP, Biostatistics | Bridging Clinical Trials and Real-World Evidence

    4,738 followers

    The integration of artificial intelligence (#AI) into next-generation sequencing (#NGS) represents a transformative advancement in the field of genomics and #precision_medicine. AI-based tools now help with identifying genetic variants, conducting epigenomic and transcriptomic profiling, and improving single-cell sequencing, providing alternatives to traditional analytical methods. This integration has multiple implications. In clinical research, AI contributes to tumor subtyping, biomarker discovery, and the advancement of personalized therapies, supporting precision medicine initiatives. In drug discovery, AI assists with target identification and drug repurposing, influencing timelines and costs related to developing new treatments. However, there are ongoing challenges such as data heterogeneity, model interpretability, and ethical considerations related to privacy and clinical trust. This review outlines future areas for exploration, including federated learning and interpretable AI, which aim to enable secure, scalable, and reliable clinical applications. Reference: Athanasopoulou, K.; Michalopoulou, V.-I.; Scorilas, A.; Adamopoulos, P.G. (2025). Integrating Artificial Intelligence in Next-Generation Sequencing: Advances, Challenges, and Future Directions. Current Issues in Molecular Biology, 47, 470. https://lnkd.in/eJWDk_Kv

  • View profile for Elliot Hershberg

    Partner at Amplify | Author of Century of Bio

    11,961 followers

    Great functional genomics work from Genentech's AI team.
    They identified three isolated threads of AI research on functional genomics data:
    1. Predicting tracks of functional data (e.g., ChIP-seq, RNA-seq)
    2. Training on sequences alone with language models, and
    3. Training generative models to produce regulatory sequences.
    So they set out to make a "unifying multimodal masking framework for functional genomics" that creates a standard training paradigm across these different types of objectives. The framework, called Nona, takes both DNA sequences and functional data tracks as inputs to a model where one or both inputs can be masked. In ML terms, this is basically rolling "predictive, generative, and self-supervised learning within a single framework."
    They showed how this paradigm improves performance for both functional data prediction and sequence design/generation. It's awesome to live in a time where top AI scientists are working at Genentech and openly publishing their research. (Link to preprint in comments.)
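
    Here is a hedged sketch of the multimodal masking idea: a single batch construction where the DNA sequence, the functional tracks, or both can be masked, so one training setup covers track prediction, sequence design, and self-supervised pretraining. Shapes, masking rates, and the loss outline are assumptions, not Nona's implementation.

```python
# Hedged sketch of multimodal masking over paired DNA sequences and
# functional data tracks (illustrative only).
import torch

def make_masked_batch(seq_tokens, tracks, seq_mask_rate=0.15, track_mask_rate=0.15,
                      mask_token_id=4):
    """seq_tokens: (batch, length) integer-encoded DNA; tracks: (batch, n_tracks, length)."""
    seq_mask = torch.rand_like(seq_tokens, dtype=torch.float) < seq_mask_rate
    track_mask = torch.rand_like(tracks) < track_mask_rate

    masked_seq = seq_tokens.clone()
    masked_seq[seq_mask] = mask_token_id            # hide nucleotides to be reconstructed
    masked_tracks = tracks.clone()
    masked_tracks[track_mask] = 0.0                 # hide functional-track values

    # The model would be trained to reconstruct hidden positions in both modalities, e.g.:
    # loss = CE(seq_logits[seq_mask], seq_tokens[seq_mask])
    #      + MSE(track_preds[track_mask], tracks[track_mask])
    return masked_seq, masked_tracks, seq_mask, track_mask

# seq_mask_rate=0 recovers a pure track-prediction task; seq_mask_rate=1 with
# track_mask_rate=0 recovers sequence design conditioned on the tracks.
seq = torch.randint(0, 4, (2, 1000))
trk = torch.randn(2, 8, 1000)
batch = make_masked_batch(seq, trk)
```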
