𝗚𝗼𝗼𝗱 𝗺𝗲𝘁𝗮𝗱𝗮𝘁𝗮 𝗶𝘀 𝗸𝗲𝘆 𝘁𝗼 𝗺𝗮𝘅𝗶𝗺𝗶𝘀𝗶𝗻𝗴 𝘁𝗵𝗲 𝗿𝗲𝘂𝘀𝗲 𝗼𝗳 𝗱𝗮𝘁𝗮 𝗱𝗶𝘀𝘁𝗿𝗶𝗯𝘂𝘁𝗲𝗱 𝗯𝘆 𝗘𝗚𝗔 𝗮𝗻𝗱 𝗙𝗲𝗱𝗲𝗿𝗮𝘁𝗲𝗱 𝗘𝗚𝗔

Providing high-quality metadata relies on a good metadata model and validation approach. The #FederatedEGA Metadata Working Group is developing a new metadata model to manage and link metadata effectively, ensuring scalability, interoperability, and alignment with the wider research community. This second version of the EGA metadata model, still under active development, will enable EGA and Federated EGA (FEGA) to set a high standard for the (meta)data we archive.

Improved metadata quality leads to:
🔎 Researchers easily finding relevant datasets for their research in the EGA/FEGA catalogues.
💡 New scientific conclusions drawn from EGA and FEGA distributed data.

The ‘FEGA Metadata Technical Report’ has recently been published on Zenodo. This document is the outcome of collaborative work among groups, including established FEGA Nodes and Nodes currently under development: Germany (The German Human Genome-Phenome Archive (GHGA)), Poland (University of Lodz), Norway (University of Oslo), Finland (CSC - IT Center for Science), Portugal (BioData.pt), Sweden (NBIS – National Bioinformatics Infrastructure Sweden), Spain (Barcelona Supercomputing Center & Centre for Genomic Regulation (CRG)), Canada (BC Cancer), Switzerland (SIB Swiss Institute of Bioinformatics), France (INSERM), Estonia (University of Tartu), and Greece (BSRC Alexander Fleming). We also aligned our metadata modelling with well-known projects and standards such as ELIXIR, Genomic Data Infrastructure (GDI), the Global Alliance for Genomics and Health, Beacon, and #BioSchemas.

Data loses its value without metadata. At the EGA, we aim for the highest quality of metadata. We will continue to collaborate with various groups to ensure that data is stored and reused for the benefit of the research community, accelerating scientific discovery and healthcare breakthroughs.
🔗 You can access the report on Zenodo here: https://lnkd.in/e2_Y4ydu
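The post pairs a metadata model with a validation approach. As a toy illustration of what schema-style metadata validation looks like in practice (all field names and the controlled vocabulary below are invented for the example; this is not the actual EGA/FEGA model), a minimal Python sketch:

```python
# Toy sketch of schema-based metadata validation. Field names and the
# controlled vocabulary are hypothetical, NOT the real EGA/FEGA schema.

REQUIRED_FIELDS = {"title": str, "organism": str, "library_strategy": str}
ALLOWED_STRATEGIES = {"WGS", "WXS", "RNA-Seq", "ChIP-Seq"}

def validate_metadata(record: dict) -> list[str]:
    """Return a list of validation errors (empty list means valid)."""
    errors = []
    for field, ftype in REQUIRED_FIELDS.items():
        if field not in record:
            errors.append(f"missing required field: {field}")
        elif not isinstance(record[field], ftype):
            errors.append(f"{field} must be {ftype.__name__}")
    # Enforce a controlled vocabulary, a common interoperability requirement.
    strategy = record.get("library_strategy")
    if strategy is not None and strategy not in ALLOWED_STRATEGIES:
        errors.append(f"library_strategy not in controlled vocabulary: {strategy}")
    return errors

good = {"title": "Cohort X WGS", "organism": "Homo sapiens", "library_strategy": "WGS"}
bad = {"title": "Cohort Y", "library_strategy": "AMPLICON"}
print(validate_metadata(good))  # []
print(validate_metadata(bad))   # two errors: missing field + vocabulary miss
```

Validators of this kind are what make submitted metadata machine-checkable before it ever reaches a catalogue.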
European Genome-phenome Archive (EGA)’s Post
🔎 See the ‘FEGA Metadata Technical Report’, to which the Institut Français de Bioinformatique (Inserm) contributed as the lead organisation in the development of the FEGA France node (👋 David Salgado, Gaignard Alban)
In modern bioinformatics, a recurring challenge is how to move from raw genomic data to meaningful biological insight. Two powerful tools from the Bioconductor ecosystem, ggbio and Gviz, help researchers address this challenge with clarity and precision. While most scientists working with high-throughput sequencing data rely on statistical analysis, it is visualization that reveals patterns and leads to discoveries. Proper visualization of genomic regions, transcripts, and annotations allows hidden relationships to emerge.

Why ggbio and Gviz matter:
- ggbio extends the philosophy of ggplot2 to genomic data, enabling researchers to visualize genomic ranges, alignments, and annotations using a grammar-of-graphics style workflow. The result is flexible, layered genomic plots that integrate seamlessly with modern data science pipelines in R.
- Gviz, on the other hand, emphasizes genome browser-style visualization. It enables researchers to create publication-ready tracks displaying genes, transcripts, coverage data, and regulatory elements along chromosomes. This approach is particularly useful for sequencing experiments such as RNA-seq, ChIP-seq, or epigenomic datasets.

The bigger picture: these tools are not just about visualization; they are part of a broader ecosystem that supports reproducible science, open research, and scalable genomic analysis. In an era of exponentially growing datasets, bioinformatics transcends algorithms; it is about clearly communicating complex biological signals. Tools like ggbio and Gviz transform genomic data into stories that can be seen, interpreted, and shared.

What tools do you use for genomic data visualization? Have you worked with ggbio, Gviz, or other packages in the Bioconductor ecosystem?

#Bioinformatics #Genomics #DataVisualization #Bioconductor #RStats #ComputationalBiology
After several months of development alongside my MSc at Cranfield University, I am excited to announce the first public release (v0.1.0) of my bioinformatics pipeline: PRS_GWAS_SNP_PIPELINE. You can find it on GitHub: https://lnkd.in/e5xXHG_z.

This project is designed as an end-to-end, mode-aware genomics workflow capable of handling real-world variability in input formats while producing consistent, reproducible downstream outputs. Supported entry modes include:
- Raw FASTQ (alignment + variant calling)
- External VCF / VCF.GZ
- GWAS-style TSV/CSV summary tables

All processing paths converge into a shared interpretation and reporting layer, ensuring architectural consistency rather than fragmented tool chaining.

Core architecture features:
- Snakemake DAG orchestration
- Conda-based reproducible environments
- Preflight validation and build-aware checks
- Deterministic output structure
- Modular downstream interpretation layer
- Optional PRS scoring branch

Integrated tools include FastQC, fastp, BWA/BWA-MEM2, minimap2, STAR, samtools, bcftools, bedtools, mosdepth, MultiQC, and optional snpEff. Custom Python and R modules provide:
- Variant QC metrics
- GWAS visualizations (Manhattan, QQ, Volcano)
- Functional enrichment analysis
- Structured PRS scoring and reporting

An optional AI-assisted interpretation component is included to help summarize and contextualize variant-level findings.

The architecture is intentionally designed for further expansion, optimisation, and scaling: additional annotation layers, containerisation, and PRS modelling capabilities can all be extended. Developing this pipeline required thinking carefully about reproducibility across environments, compatibility between heterogeneous input modes, and the practical constraints of running genomics workflows across different compute contexts. Ensuring consistent outputs, whether starting from raw FASTQ, external VCF, or GWAS summary tables, meant repeatedly refactoring rule logic, validating dependency isolation, and stress-testing the DAG under different execution scenarios. Reproducibility, deterministic artifact generation, and architectural clarity were deliberate design priorities, not afterthoughts.

This release marks the first stable milestone in that process. I welcome thoughtful feedback, including strengths, limitations, and areas for improvement.

#Bioinformatics #Genomics #ComputationalBiology #ReproducibleResearch #Snakemake #GWAS #PolygenicRiskScore #WorkflowEngineering #OpenScience #AIinHealthcare
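The "mode-aware" entry logic described above can be sketched roughly as follows. This is a hypothetical Python illustration of input-mode detection, not the pipeline's actual Snakemake rule logic; the file extensions and mode names are assumptions for the example:

```python
from pathlib import Path

# Hypothetical sketch of mode detection for a mode-aware workflow: classify
# the entry point by input file type so that all modes can converge on a
# shared downstream layer. Extensions and mode names are illustrative only.

MODE_BY_SUFFIX = {
    ".fastq": "align_and_call",   # raw reads: alignment + variant calling
    ".fq": "align_and_call",
    ".vcf": "annotate",           # external variants: skip alignment
    ".tsv": "summary_stats",      # GWAS-style summary table
    ".csv": "summary_stats",
}

def detect_mode(path: str) -> str:
    """Infer the pipeline entry mode from the input file's extension."""
    suffixes = Path(path).suffixes               # e.g. sample.vcf.gz -> ['.vcf', '.gz']
    core = [s for s in suffixes if s != ".gz"]   # ignore compression suffix
    if not core:
        raise ValueError(f"cannot infer input mode for {path}")
    try:
        return MODE_BY_SUFFIX[core[-1].lower()]
    except KeyError:
        raise ValueError(f"unsupported input type: {core[-1]}") from None

print(detect_mode("sample_R1.fastq.gz"))  # align_and_call
print(detect_mode("cohort.vcf.gz"))       # annotate
print(detect_mode("gwas_hits.tsv"))       # summary_stats
```

Centralising this decision in one function is one way to keep heterogeneous inputs from leaking mode-specific assumptions into the shared reporting layer.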
Dug into Data and Sample Governance for Biomedical Research. https://lnkd.in/eC_YxS2f. Lead: Big data & human height: Scientists develop algorithm to boost biobank data retrieval & analysis. https://lnkd.in/e2dzrhWr
Demystifying the Bioinformatics Pipeline: From Sequencer to Insight 🚀

Thrilled to share this comprehensive infographic detailing a modern #NGS (Next-Generation Sequencing) analysis pipeline! Biomedical research moves from the lab bench to the computing cluster, and this visualization breaks down how raw genetic data is transformed into actionable biological knowledge.

Here’s a snapshot of the five essential stages:
1️⃣ Input Data: The journey begins with high-throughput sequencers generating raw FASTQ reads.
2️⃣ Data Pre-processing: Garbage in, garbage out! Quality control (QC) and read trimming are crucial for ensuring the integrity and accuracy of the analysis.
3️⃣ Alignment & Assembly: Reconstructing the genetic puzzle by either mapping reads to a reference genome or performing de novo assembly.
4️⃣ Variant Calling: Identifying the differences (SNPs & indels), filtering out noise, and annotating functional impacts.
5️⃣ Downstream Analysis: The "so what?" factor. Visualizing gene expression (heatmaps), building phylogenetic trees, and using genome browsers to extract real biological insights.

Whether you're working on personalized medicine, evolutionary biology, or functional genomics, a robust bioinformatics pipeline is the backbone of modern discovery. What tools and techniques is your team using? Drop your thoughts in the comments! 👇

#Bioinformatics #Genomics #NextGenerationSequencing #NGS #DataScience #DataAnalysis #ComputationalBiology #BiomedicalResearch #LifeSciences #TechInBio
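Stage 2 above (pre-processing) is the one most often taken on faith. As a toy illustration of what quality trimming actually does, here is a minimal Python sketch of 3'-end quality trimming of a single read, the kind of operation tools like fastp perform at scale; the quality threshold and Phred+33 encoding are typical conventions, not prescriptive:

```python
# Toy sketch of 3'-end quality trimming for one read (Phred+33 encoding,
# as used in standard FASTQ files). Real trimmers (fastp, Trimmomatic)
# use more sophisticated strategies such as sliding windows.

def trim_3prime(seq: str, qual: str, min_q: int = 20) -> tuple[str, str]:
    """Drop bases from the 3' end while their quality is below min_q."""
    phred = [ord(c) - 33 for c in qual]   # decode ASCII quality to Phred scores
    cut = len(seq)
    while cut > 0 and phred[cut - 1] < min_q:
        cut -= 1
    return seq[:cut], qual[:cut]

# 'I' encodes Q40 (high quality); '#' is Q2 and '!' is Q0 (to be trimmed).
seq, qual = trim_3prime("ACGTACGT", "IIIIII#!")
print(seq)   # ACGTAC
```

Q20 corresponds to a 1% base-call error probability, which is why it is a common default cutoff.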
📢 We’re excited to partner once again with the Bay Area Bioinformatics Forum to bring you an evening of engaging talks, great food, and networking!

📅 April 8, 2026
📍 Myriad Genetics' SSF facility in the Grand Nexus Building, 233 E Grand Ave., South San Francisco, CA
⏰ 5–9 PM PST

Hosted by Shane Brubaker, this program features speakers from Strand Life Sciences and Myriad Genetics discussing the role of AI in precision medicine and the development of bioinformatics workflows to scale clinical genomics.

🎤 Featured Talks:
▫️ Talk 1 (Strand): Connecting Metadata, Analytics, and AI to Accelerate Target and Drug Discovery. Presenter: Shrutee Jakhanwal, Associate Director, Product Management, Strand Life Sciences
▫️ Talk 2 (Cheiron): AI-Native Operating System for Drug Development. Presenter: Twesh Malkani, Growth Lead, Cheiron
▫️ Talk 3 (Myriad): AI & Digital Pathology. Presenter: Brianna Sanchez, Software Engineer III, Bioinformatics, Myriad Genetics
▫️ Talk 4 (Strand and Myriad): Bioinformatics Workflows for Scaled Clinical Genomics. Presenters: Vinay Kumar Sajja, Associate Software Engineer III, Strand Life Sciences, and Samuel Pearlman, Senior Bioinformatics Software Engineer, Myriad Genetics
▫️ Talk 5 (Myriad): DeepCAH: Deep Learning in Variant Calling for Difficult Genes. Presenter: Jing Liu, Bioinformatics Analyst II, Myriad Genetics

Following the talks, this edition also features an expert panel discussion on the challenges in validation, integration, and scalability of AI models. This interactive session will unpack key bottlenecks in deploying AI models for data curation, integrating scalable AI workflows, and how GPT-based chatbots are influencing this dynamic space.

The event will be recorded, and the videos will be available online.

👉 Register here: https://lnkd.in/gkFWJt7R

Parking will be available in the Nexus garage (next to Myriad) until 9:30 PM.
#BayAreaBioinformatics #GenomicsData #GenomicsWorkflows #ComputationalGenomics #BioinformaticsTools #BioinformaticsSolutions #BioinformaticsInfrastructure #PipelineAutomation #AIForPrecisionMedicine #AIForPathology
Most bioinformatics workflows don’t fail because of algorithms. They fail because the data is fragmented, inconsistent, and not designed for reuse.

At the 25th Annual Bio-IT World Conference & Expo, this hands-on session focuses on fixing that: Building Workflows and Advancing FAIR Bioinformatics Practices
📅 Tuesday, May 19 | 9:00 AM to 12:00 PM

This is a working session using the Playbook Workflow Builder (PWB) to design workflows that are truly findable, interoperable, and reusable. Led by Daniel Clarke and Avi Ma’ayan, the goal is straightforward: move from fragmented pipelines to structured, reusable systems.

Because the future of R&D is not more tools. It’s data ecosystems that actually work in production and support AI at scale. If your workflows don’t scale, neither does your science.

Agenda and Registration Details: https://lnkd.in/gRjADR6w

#FAIRData #Bioinformatics #ScientificData #ResearchData #DataInteroperability #AIforLifeSciences #ComputationalBiology #BioITExpo Bio-IT World
🧬 How do you find meaningful patterns in massive biological datasets? We tackled this problem by developing a new Big Data–ready framework for extracting informative k-mer dictionaries across multiple sequence datasets.

💡 What’s new:
• Scalable architecture for large-scale sequence analysis
• Two tools:
– BioFt → pattern discovery + IR-inspired metrics
– BioSet2Vec → flexible dictionary construction

🧪 Tested on real-world problems:
✔ Chromatin states
✔ Disease–gene relationships
✔ Whole genomes

The results show promising applications in genomics and computational biology.

📄 Read the paper here, published by BMC: https://lnkd.in/dqT4vga2

Paper Authors: Ylenia Galluzzo (🙋🏻♀️), Raffaele Giancarlo, Filippo Utro, Simona E. Rombo

Curious to hear your potential applications 👇

#Bioinformatics #AI #Spark #BigData #Genomics #Research #BMCBioinformatics https://lnkd.in/dtQR4R4X
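To make the idea of an "informative k-mer dictionary with IR-inspired metrics" concrete, here is a small generic sketch in Python that ranks k-mers across datasets with a TF-IDF-style score. This is a simplified illustration of the general technique, not the BioFt/BioSet2Vec algorithms themselves, which are built for Big Data scale (e.g. on Spark):

```python
from collections import Counter
import math

# Generic TF-IDF-style ranking of k-mers across multiple sequence sets:
# a k-mer scores high if it is frequent in one set but rare across sets.
# Illustrative only; not the actual BioFt/BioSet2Vec implementation.

def kmers(seq: str, k: int):
    """Yield all overlapping k-mers of a sequence."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

def informative_kmers(datasets: dict, k: int, top: int = 5):
    """Rank (dataset, k-mer) pairs by term frequency x inverse set frequency."""
    per_set = {name: Counter(km for s in seqs for km in kmers(s, k))
               for name, seqs in datasets.items()}
    n_sets = len(datasets)
    df = Counter()                       # in how many datasets each k-mer occurs
    for counts in per_set.values():
        df.update(set(counts))
    scores = {}
    for name, counts in per_set.items():
        for km, tf in counts.items():
            idf = math.log((1 + n_sets) / (1 + df[km])) + 1   # smoothed IDF
            scores[(name, km)] = tf * idf
    return sorted(scores, key=scores.get, reverse=True)[:top]

data = {"setA": ["ACGACGACG"], "setB": ["TTTTGGGG"]}
print(informative_kmers(data, k=3, top=3))
```

The same scoring idea scales out naturally, since per-dataset counts and document frequencies are both simple aggregations.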
🧬 New preprint | ESGI: Efficient Splitting of Generic Indices

We're excited to share ESGI, a flexible demultiplexing framework for single-cell sequencing data with complex barcode architectures. Most existing tools assume fixed barcode positions and substitution-only error models, which works well for standard platforms but breaks down for emerging technologies that often use complex, custom barcoding schemes.

➡️ ESGI addresses this in two key ways: it supports demultiplexing and processing single-cell sequencing data with arbitrary barcode designs, and it performs indel-aware barcode correction using Levenshtein distance rather than Hamming distance alone. This matters because deletions are the dominant error type in barcode synthesis.

➡️ We validated ESGI across six datasets spanning four technologies: SIGNAL-seq, SPLiT-seq, Phospho-seq, and xDBiT. Indel-aware mapping increased read recovery by >10% in combinatorial barcoding datasets, with count matrices highly concordant with established pipelines.

➡️ ESGI is written in C++, multi-threaded, well-documented, and comes with vignettes for diverse single-cell technologies.

📄 Preprint: https://lnkd.in/eQcuADN7
💻 Code: github.com/tstohn/ESGI
📖 Docs: https://lnkd.in/eEi2V5rR

Led by Tim Stohn, co-supervised with Lodewyk Wessels. With help from Nadine van de Brug, Anastasia Theodosiadou, Bram Thijssen, and Kathy Jastrzebski, PhD, at VU Bioinformatics and The Netherlands Cancer Institute.
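The core idea of indel-aware barcode correction can be sketched compactly: match an observed barcode against a whitelist using Levenshtein distance, which catches the deletions that Hamming distance misses. A simplified Python illustration (ESGI itself is a C++ implementation, and its actual matching logic is more involved):

```python
# Sketch of indel-aware barcode correction: Levenshtein distance counts
# substitutions, insertions, AND deletions, so a barcode with one dropped
# base can still be recovered. Simplified illustration, not ESGI's code.

def levenshtein(a: str, b: str) -> int:
    """Edit distance allowing substitutions, insertions, and deletions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def correct_barcode(observed: str, whitelist: list[str], max_dist: int = 1):
    """Return the unique whitelist barcode within max_dist, else None."""
    hits = [bc for bc in whitelist if levenshtein(observed, bc) <= max_dist]
    return hits[0] if len(hits) == 1 else None   # ambiguous or no match -> None

wl = ["ACGTACGT", "TTGGCCAA"]
print(correct_barcode("ACGACGT", wl))   # one base deleted -> ACGTACGT
print(correct_barcode("ACGTACGA", wl))  # one substitution -> ACGTACGT
```

Note the first example: a Hamming-only corrector would reject "ACGACGT" outright, since the deletion shifts every downstream base out of register.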
At Omicsify, we believe bioinformatics should not be treated as an afterthought. Too often, sequencing gets the attention, while the downstream analysis is expected to somehow “work itself out.” But in reality, the value of genomics is not created at the point of data generation alone. It is created when data becomes interpretable, reproducible, scalable, and useful for decision-making. That is where strong bioinformatics matters.

Good bioinformatics is not just running pipelines. It is building systems that can handle complexity without losing traceability. It is designing workflows that are scientifically sound, operationally practical, and adaptable as needs evolve. It is making sure analysis can move beyond one-off projects and become part of a reliable process.

At Omicsify, this is the problem we care about solving. We are focused on building practical bioinformatics infrastructure and analysis solutions that help organizations move from raw data to actionable results with more clarity, structure, and confidence.

As genomics continues to expand, the real bottleneck is no longer only data generation. It is how effectively that data can be processed, interpreted, and operationalized. That is where the future will be shaped.

#Omicsify #Bioinformatics #Genomics #DataAnalysis #PrecisionMedicine #NGS #HealthcareInnovation #LifeSciences