We're releasing RoP v2026.04: 1.33 million harmonized biomedical Common Data Elements - open for the research community.

What & Why
RoP pairs CDEs with semantic embeddings, value sets, and governance parameters for true multi-cohort interoperability. It's the foundation for our HiTL automated data and schema harmonization registry.
Today, multi-cohort studies spend months manually mapping variables. One cohort calls it "MoCA Total Score," a second calls it MOCATOTS, a third calls it MOCA. Same assessment, zero compatibility. Researchers spend 40-70% of their time on data wrangling (Kaggle). RoP + The Forge cut standardization time and cost by an order of magnitude. Let's make research as FAIR as possible.
RoP covers every layer of biomedical data: identity, clinical, genomics, imaging, governance, and findability. Built on OMOP, LOINC, ICD-10, HPO, Mondo, and 9 other major vocabularies. Running in production across hundreds of thousands of samples and millions of data points for collaborators leading federated open science initiatives.

What's Inside
- 1,328,973 CDEs organized into 13 themes
- 768-dim semantic embeddings (SapBERT)
- IVF4096 FAISS index (sub-second similarity search)
- Full reproducibility (SHA256 checksums, build scripts)
- Sources: 9 public standards (OMOP, LOINC, ICD-10, HPO, Mondo, NINDS-CDE, PhenX, CDISC, DICOM, BIDS, DUO) + 9 project collections

Get RoP
📥 Data: https://lnkd.in/eyzswQZd (7.8 GB, DOI: 10.57967/hf/8781)
💻 Code & Docs: https://lnkd.in/eJPjmRZS
📋 Questions / Help deploying: https://lnkd.in/e2yhfSWr

Community Effort!
Primary Authors: Pietro Marini, Rodney Long, Hirotaka Iwaki, Mike Nalls, Dan Vitale
Collaborators: Mette Peters, Hampton Leonard, Andy Henrie, Amara Alexander, Elise Marsan, PhD, Yang Fann, Mark Cookson, Cornelis Blauwendraat, Andrew Singleton, Huw Morris, Timothy Hohman, Sarah Biber, PhD, John Crary, Syed Islamuddin Shah, Brittany Dugger, David Gutman, Chris Morris, Patrick Brannelly, Lietsel Jones, Mathew Koretsky, Cole Tindall, Mukta Phatak, Zane Jaunmuktane, Mimi Tambi, Brandon Jernigan, Terri Thompson, Ph.D., PMP, Michael Karlovich, Kurt Farrell, and many more… CDEs are a community effort.
Collaborative Studies: NIH CARD, GP2, NACC, Answer ALS, SEA-AD, ADSP-PHC, ASAP, BDR, BDSA, PART through their connection with the Path-ND Consortium by the 10,000 Brains Project
Foundational Concepts: Based on the preprint by Long et al 2024 (https://lnkd.in/e6HtXGwm).
Special recognition: We are deeply grateful to the National Library of Medicine (NLM) for their decades of foundational work in biomedical informatics and vocabulary development. We would be honored if any of our harmonization work could contribute back to NLM's mission in any way.
#OpenScience #DataHarmonization #FAIR #Bioinformatics #Neuroscience #Genomics #ClinicalResearch
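For readers wondering how embedding-based matching resolves names like "MoCA Total Score", MOCATOTS, and MOCA to one element, here is a minimal sketch. It assumes toy 3-dimensional vectors in place of the 768-dim SapBERT embeddings and a brute-force scan in place of the FAISS IVF index. The CDE identifiers, the vectors, and the `nearest_cde` helper are all hypothetical and are not taken from the RoP release.

```python
import math

# Hypothetical toy index: CDE identifier -> embedding.
# Real RoP vectors are 768-dim; 3 dimensions keep the example readable.
CDE_EMBEDDINGS = {
    "moca_total_score": [0.9, 0.1, 0.0],
    "systolic_blood_pressure": [0.0, 0.2, 0.9],
    "age_at_onset": [0.1, 0.9, 0.1],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_cde(query_vec, index=CDE_EMBEDDINGS):
    """Return (cde_id, similarity) for the most similar CDE (brute force)."""
    scored = ((cde_id, cosine(query_vec, vec)) for cde_id, vec in index.items())
    return max(scored, key=lambda pair: pair[1])


# Made-up embeddings for three cohort-specific names of the same assessment.
cohort_variables = {
    "MoCA Total Score": [0.88, 0.12, 0.02],
    "MOCATOTS": [0.85, 0.15, 0.05],
    "MOCA": [0.92, 0.08, 0.01],
}

# Map every cohort variable to its closest CDE identifier.
matches = {name: nearest_cde(vec)[0] for name, vec in cohort_variables.items()}
```

In this toy setup all three cohort names land on the same identifier. An approximate index such as IVF trades a little recall for speed at the million-element scale, which a brute-force scan like this one would not handle well.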
Biological Data Standardization
Summary
Biological data standardization refers to the process of organizing and harmonizing complex biological information—from gene sequences to cell types—so researchers can easily compare, share, and analyze data across studies, technologies, and species. This enables more reliable discoveries by reducing confusion caused by inconsistent naming, formats, and measurement methods.
- Streamline data integration: Adopt shared vocabularies and standardized processing pipelines to simplify combining datasets from different sources and studies.
- Promote reproducibility: Use open-source tools and community-driven frameworks that ensure findings can be verified and built upon by others in the scientific community.
- Support cross-disciplinary research: Choose data formats and platforms that connect molecular, cellular, and clinical information, making it easier to explore relationships across biological scales and fields.
-
Excited to share our latest paper in Nature Aging: “A unified framework for systematic curation and evaluation of aging biomarkers.” We introduce Biolearn, an open-source platform that standardizes and benchmarks 39 aging biomarkers across more than 20,000 individuals from diverse cohorts. Our analyses show that while Horvath’s skin and blood clock best predicts chronological age, GrimAge2 shows the strongest associations with mortality and healthspan, emphasizing the multidimensional nature of biological aging. Biolearn provides unified pipelines for data harmonization, imputation, and cell-type deconvolution, creating a foundation for reproducible and scalable aging biomarker research. 📦 Explore the toolkit: https://lnkd.in/gRXzpm5V 🔗 Read the paper: https://lnkd.in/gwFU7Eim Grateful to all collaborators and co-authors for this incredible team effort: Seth Paulson Alec Eames Alexander Tyshkovskiy Siyuan Li Nir Eynon Macsue Jacques Robin Grolaux Erik J. Ludger Goeminne Andrea Cipriano Martín Pérez-Guevara, PhD, Mehrnoosh Emamifar, Dayoon Kwon, Michael Snyder, Dane Gobel, Chiara Herzog, Daniel McCartney, Riccardo Marioni, Jessica Lasky-Su, Jesse Poganik, Mahdi Moqri, Vadim Gladyshev and others! #Aging #Epigenetics #AI #Biomarkers #NatureAging #OpenScience
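As a rough illustration of what benchmarking clocks against chronological age involves, the sketch below scores two clocks by mean absolute error. It does not use Biolearn's API. The clock names `clock_a` and `clock_b`, the ages, and the predictions are made-up toy values, and the comparison covers only age accuracy, not the mortality or healthspan associations mentioned in the post.

```python
from statistics import mean

# Made-up chronological ages for four individuals.
chronological_age = [34.0, 51.0, 67.0, 45.0]

# Made-up predictions from two hypothetical clocks on the same individuals.
predictions = {
    "clock_a": [36.0, 49.0, 70.0, 44.0],
    "clock_b": [30.0, 58.0, 60.0, 50.0],
}


def mean_absolute_error(predicted, actual):
    """Average absolute gap between predicted and actual ages, in years."""
    return mean(abs(p - a) for p, a in zip(predicted, actual))


# Score every clock on the same individuals so the numbers are comparable.
scores = {
    name: mean_absolute_error(pred, chronological_age)
    for name, pred in predictions.items()
}
best_clock = min(scores, key=scores.get)
```

Scoring every biomarker on identical, harmonized inputs is what makes such a ranking meaningful; a different target, such as time to death, can produce a different ordering.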
-
How to Simplify the Integration of Human Cell Types
Harmonizing cell types across datasets is a critical step in building a standardized and unified Human Cell Atlas (HCA). A team led by Sarah A. Teichmann at the Wellcome Sanger Institute has introduced CellHint, a powerful tree-based predictive clustering tool designed to address differences in annotation resolution and technical biases in single-cell datasets.
CellHint quantifies transcriptome similarities between cells with high accuracy, organizing cell types into a hierarchical relationship graph. This approach defines shared and unique subtypes across datasets, providing a powerful framework for harmonizing cell annotations.
When applied to multiple immune cell datasets, CellHint successfully replicated expert-curated annotations, demonstrating its accuracy and reliability. In addition, the tool revealed previously underexplored relationships between healthy and diseased lung cell states in eight diseases. This insight highlights its utility in identifying subtle cellular changes associated with disease.
The team also presents a rapid cross-dataset integration workflow guided by the harmonized cell types and hierarchical structure. The workflow identified underappreciated cell types in the adult hippocampus, demonstrating its potential to reveal new biological insights.
To further validate its versatility, the team applied CellHint to 12 tissues, covering 38 datasets, and collated a comprehensive cross-tissue database of approximately 3.7 million cells. This database, combined with a machine learning model developed for automatic cell annotation, provides an important resource for researchers studying human tissues.
This study, published in the journal Cell, provides a key tool for the single-cell community, facilitating the coordination and integration of datasets to build a standardized and deeply annotated human cell atlas. By improving cross-dataset compatibility and revealing new cell type relationships, CellHint is shaping the future of cell biology and biomedical research.
Reference
[1] Chuan Xu et al., Cell 2023 (DOI: 10.1016/j.cell.2023.11.026)
#HumanCellAtlas #SingleCellBiology #CellHint #DataIntegration #CellAnnotation #MachineLearning #Bioinformatics #BiomedicalResearch #Transcriptomics #CellBiology #HealthcareInnovation #LifeSciences
-
Building a virtual model of the cell is an emerging frontier at the intersection of artificial intelligence and biology, aided by the rapid growth of single-cell RNA sequencing data. By aggregating gene expression profiles from millions of cells across hundreds of studies, single cell atlases have provided a foundation for training AI-driven models of the cell. However, reliance on datasets with pre-processed counts limits the size and diversity of these repositories and constrains downstream model training to data curated for divergent purposes. This introduces analytical variability due to differences in the choice of alignment tools, genome references, and counting strategies.
Here, we introduce scBaseCamp, a continuously updated single-cell RNA-seq database that leverages an AI agent-driven hierarchical workflow to automate discovery, metadata extraction, and standardized data processing. Built by directly mining and processing all publicly accessible 10x Genomics single-cell RNA sequencing reads, scBaseCamp is currently the largest public repository of single-cell data, comprising over 230 million cells spanning 21 organisms and 72 tissues. Using studies comprised of both single cell and single nucleus sequencing data, we demonstrate that uniform processing across datasets helps mitigate analytical artifacts introduced by inconsistent data processing choices. This standardized approach lays the groundwork for more accurate virtual cell models and serves as a foundation for a wide range of biological and biomedical applications.
Interesting paper detailing the development of SRAgent, a genomics AI tool using LangChain's multi-agent system to automate complex biological data processing and RNA sequencing workflows from scientific databases. By Nicholas Youngblut and a larger team at the Arc Institute.
Link to full paper: https://lnkd.in/ezBGXrng
GitHub repository: https://lnkd.in/e-b-mRsy
-
BrAVe: Unifying the Brain Across Scales and Species
The quest to understand the brain’s structure and function has long been limited by one barrier: the inability to integrate molecular, structural, and functional data across scales and species. A new open-source framework, BrAVe (BrainAtlas Viewer), aims to change that.
BrAVe provides a 3D, species-agnostic, interactive platform for integrating multimodal brain atlas datasets, from gene expression and neuronal morphology to circuit connectivity and whole-brain activity. Supporting standardized data formats (.nrrd, .stl, .csv, .swc), it enables researchers to visualize and quantitatively analyze how molecular signatures align with neuronal structures and network wiring, across flies, fish, mice, and primates.
At its core, BrAVe bridges three frontiers of brain research:
◾ Cross-modal integration: linking molecular, structural, and functional data in a unified coordinate space to identify molecularly defined neurons and their functional circuits.
◾ Cross-scale mapping: connecting light and electron microscopy datasets to match neuronal types and reconstruct synaptic networks.
◾ Cross-species alignment: enabling comparative analyses from invertebrates to non-human primates within a single framework.
Technically, BrAVe combines intuitive 3D visualization with a distributed computing backend for high-performance analysis of large-scale datasets. Users can perform neuron morphology clustering, infer synaptic connectivity, and explore network motifs, all without code. It adheres to FAIR principles, promoting interoperability, reproducibility, and open science.
From a translational perspective, this kind of integrative spatial biology platform has far-reaching implications:
▪️ For neuroscience, it means accelerating cell-type discovery and functional annotation.
▪️ For neurodegenerative and psychiatric disease research, it opens a path to correlate molecular changes with circuit-level dysfunction.
▪️ And for the broader life sciences, BrAVe’s architecture offers a template for multi-omic, spatially anchored analysis that could extend to other organs (heart, lung, or kidney), ushering in a next generation of organ-level reference atlases.
The future of brain mapping will not be defined by one dataset or imaging modality, but by the integration of many. Tools like BrAVe move us toward an era where spatial registration is as foundational to biology as sequencing once was - linking molecules, cells, and networks into coherent, dynamic systems we can finally see and analyze as one.
Read the full preprint: https://lnkd.in/eYBkY_UX
#Neuroscience #BrainAtlas #SpatialBiology #Neuroinformatics #OpenScience
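To show what working with one of the listed formats looks like outside a viewer, here is a minimal sketch that reads neuron morphology in the widely used SWC layout (id, type, x, y, z, radius, parent, with -1 marking the root) and sums the cable length. It is not BrAVe code; the `parse_swc` and `total_cable_length` helpers and the three-node sample neuron are hypothetical.

```python
import math

# Hypothetical three-node morphology in SWC column order:
# id, type, x, y, z, radius, parent (-1 marks the root).
SWC_TEXT = """\
# id type x y z radius parent
1 1 0.0 0.0 0.0 1.0 -1
2 3 3.0 4.0 0.0 0.5 1
3 3 3.0 4.0 12.0 0.5 2
"""


def parse_swc(text):
    """Parse SWC text into {node_id: node_dict}, skipping comments and blanks."""
    nodes = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        node_id, node_type, x, y, z, radius, parent = line.split()
        nodes[int(node_id)] = {
            "type": int(node_type),
            "xyz": (float(x), float(y), float(z)),
            "radius": float(radius),
            "parent": int(parent),
        }
    return nodes


def total_cable_length(nodes):
    """Sum Euclidean distances between each node and its parent."""
    length = 0.0
    for node in nodes.values():
        parent = nodes.get(node["parent"])  # None for the root (-1)
        if parent is not None:
            length += math.dist(node["xyz"], parent["xyz"])
    return length


nodes = parse_swc(SWC_TEXT)
length = total_cable_length(nodes)
```

Because each node stores only a parent pointer, the same file supports both simple summaries like cable length and the tree traversals that morphology clustering needs, which is part of why a shared format helps when neurons come from different labs and species.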