This paper describes how a large pharmaceutical company adopted an ontology-based data management strategy to ensure scientific data is findable, accessible, interoperable, and reusable from the moment it is generated.
1️⃣ The approach emphasizes creating structured, high-quality data at the source to preserve context and reduce downstream processing time.
2️⃣ Standardized vocabularies and models (ontologies) are used to align data across systems and teams, supporting consistency and integration.
3️⃣ Public ontologies are adapted with organization-specific extensions while maintaining compatibility with external data standards.
4️⃣ Simplified term lists are derived from complex models to enable broader adoption across teams with varying technical backgrounds.
5️⃣ Data from different systems is integrated virtually rather than physically moved, enabling secure, real-time access without redundancy.
6️⃣ This framework enhances the performance of advanced analytics and machine learning by providing clear, semantically rich context.
7️⃣ Controlled vocabularies are delivered through interfaces like APIs and dropdowns, ensuring consistent metadata usage at scale.
8️⃣ The unified semantic structure improves enterprise search, allowing users to retrieve contextually relevant data from across domains.
9️⃣ Adoption metrics show growing usage across multiple phases of the pharmaceutical value chain, reflecting system scalability and value.
🔟 Organizational alignment, from executive support to operational implementation, has been critical, with recent advances in AI further enabling this transformation.
✍🏻 Shawn Zheng Kai Tan, Shounak Baksi, Thomas Gade Bjerregaard, Preethi Elangovan, Thrishna Kuttikattu Gopalakrishnan, Darko Hric, Joffrey Joumaa, Beidi Li, Kashif Rabbani, Santhosh Kannan Venkatesan, Joshua Daniel Valdez, Saritha Vettikunnel Kuriakose. Digital evolution: Novo Nordisk’s shift to ontology-based data management. Journal of Biomedical Semantics. 2025. DOI: 10.1186/s13326-025-00327-4
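To make points 3️⃣, 4️⃣ and 7️⃣ a bit more concrete, here is a minimal sketch of how a simplified term list could be derived from a richer model and then used to normalize user input. It is not the paper's implementation: the `Term` class, the `EX:`/`ORG:` identifiers, and the helper names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Term:
    """One ontology term; the IDs and labels below are made up for illustration."""
    term_id: str
    label: str
    parent: Optional[str] = None
    synonyms: List[str] = field(default_factory=list)


# Hypothetical mini-ontology: public-style terms plus one organization-specific extension.
ONTOLOGY = [
    Term("EX:0001", "assay"),
    Term("EX:0002", "binding assay", parent="EX:0001", synonyms=["ligand binding assay"]),
    Term("EX:0003", "cell viability assay", parent="EX:0001"),
    Term("ORG:0001", "in-house potency assay", parent="EX:0002"),
]


def descendants(root_id, terms):
    """Return every term below root_id in the parent/child hierarchy."""
    children = {}
    for t in terms:
        children.setdefault(t.parent, []).append(t)
    found, stack = [], [root_id]
    while stack:
        for child in children.get(stack.pop(), []):
            found.append(child)
            stack.append(child.term_id)
    return found


def term_list(root_id, terms):
    """Flatten one branch into a label -> ID lookup, e.g. to populate a dropdown."""
    branch = sorted(descendants(root_id, terms), key=lambda t: t.label)
    return {t.label: t.term_id for t in branch}


def normalize(value, root_id, terms):
    """Map a typed label or synonym to its term ID, or None if it is not in the branch."""
    wanted = value.lower()
    for t in descendants(root_id, terms):
        if wanted in [t.label.lower(), *(s.lower() for s in t.synonyms)]:
            return t.term_id
    return None
```

The design choice worth noting is that the flat list is derived from the hierarchy rather than maintained by hand, so the in-house extension stays attached to its public parent term.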
Scientific Software Development
-
I’d like to introduce an experiment I’ve been working on for the past couple of years: Xarray-SQL. This library asks, what if SQL worked natively with arrays?
`xarray-sql` tries to make multi-dimensional arrays in Xarray queryable with SQL by imagining them as tables. The closest parallel that I've found to this project is XVec, which brings table-like geospatial concepts to rasters in Xarray. This project aims to do the opposite, which is to treat rasters like rows in tables, and then to fit that in with a broad SQL ecosystem via #DataFusion.
Currently, the library should “just work” with any Xarray dataset, i.e. datasets backed by #Zarr, #IceChunk, NetCDF, #Xee, TiTiler, etc. Once opened, the library lets you think of coordinates as primary keys and data_vars as columns as you filter, group by, and aggregate to your heart’s desire. Tuning the performance of queries requires a bit of skill with SQL, but ultimately will be a matter of data engineering (another case of the tyranny of the chunk).
This library was designed to address the use case of joining tabular data with weather data, such as Earthmover's temporal-chunked copy of ERA5. That's the dream, anyway. Right now, this package will only work with Xarray datasets on a single node. I expect it to break at early challenges of scale, and yet I am hopeful that the library will eventually rise to the complexity of geospatial or scientific datasets. It should be possible in theory to distribute on Ray clusters, and doing so is on the roadmap.
Currently, this project is looking for early adopters and open source contributors. Together with tire kickers and drive-by patches, I think xarray-sql could make scientific datasets way more accessible.
The high-level hypothesis posed by this package is that the Cloud-Native Geospatial Forum (CNG) ecosystem functions as a new type of database, one whose components we get to pick and choose à la carte. If this is the case, and #CNG is really a database, then I argue it ought to also have a decent SQL front-end. https://lnkd.in/gYMeCp9S #xql
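For readers who want to see the "coordinates as primary keys, data_vars as columns" idea in miniature, here is a rough sketch using only the Python standard library. It deliberately does not use the xarray-sql API (or Xarray or DataFusion at all): the grid is a plain nested list, the SQL engine is `sqlite3`, and every name and value is made up.

```python
import itertools
import sqlite3

# A tiny 2-D "dataset": two coordinates and two data variables on the same grid.
# All names and values are made up for illustration.
coords = {"day": ["2024-01-01", "2024-01-02"], "station": ["A", "B", "C"]}
data_vars = {
    "temperature": [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    "humidity": [[50.0, 55.0, 60.0], [65.0, 70.0, 75.0]],
}


def to_rows(coords, data_vars):
    """Unravel the grid: one row per coordinate combination, one column per variable."""
    names = list(coords)
    index_ranges = [range(len(coords[n])) for n in names]
    for idx in itertools.product(*index_ranges):
        key = [coords[n][i] for n, i in zip(names, idx)]
        values = []
        for grid in data_vars.values():
            cell = grid
            for i in idx:
                cell = cell[i]
            values.append(cell)
        yield (*key, *values)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE weather (day TEXT, station TEXT, temperature REAL, humidity REAL, "
    "PRIMARY KEY (day, station))"
)
conn.executemany("INSERT INTO weather VALUES (?, ?, ?, ?)", to_rows(coords, data_vars))

# A small tabular side table to join against, echoing the weather-join use case.
conn.execute("CREATE TABLE sites (station TEXT PRIMARY KEY, region TEXT)")
conn.executemany(
    "INSERT INTO sites VALUES (?, ?)",
    [("A", "north"), ("B", "north"), ("C", "south")],
)

query = """
    SELECT s.region, AVG(w.temperature) AS mean_temp
    FROM weather AS w JOIN sites AS s USING (station)
    GROUP BY s.region
    ORDER BY s.region
"""
result = conn.execute(query).fetchall()
print(result)
```

This copies every cell into a table up front, which is exactly what would stop working at scale. It only illustrates the mental model.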
-
A biotech company's most valuable asset is constantly underused. No, not your people: your data.
Here's a hard truth many biotech companies learn too late: storing your experimental data next to (or even worse, inside of) your documents is a recipe for disaster.
Why? Because data and documents have fundamentally different needs:
- Immutability: Your data should be set in stone once recorded. Documents, on the other hand, are living entities that evolve over time.
- Versioning: While both need versioning, the approach differs. Data versions should be additive (in line with the immutability concept above), while document versions often replace each other.
- Access patterns: Data often needs to be accessed programmatically, while documents are typically accessed manually.
- Audit trails: Changes to data should be strictly logged to help with reproducibility. Document edits? Not so much.
By storing data in document systems like Google Drive, SharePoint or Dropbox, you're exposing it to accidental alterations, making it harder to track changes, and complicating integration with analysis tools.
The solution? Separate your data and document storage. Use specialized data management systems for your experimental results and keep your documents elsewhere.
At Sphinx Bio, we're building tools for scientific data management, helping biotech companies safeguard their most valuable asset while maximizing its utility.
Would love to hear more success stories about achieving this split!
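As a rough illustration of what immutability, additive versioning, and audit trails can look like in code, here is a minimal in-memory sketch. It is not how Sphinx Bio or any particular product works: the `AppendOnlyStore` class and the record names are hypothetical, and a real system would persist to durable storage.

```python
import hashlib
import json
from datetime import datetime, timezone


class AppendOnlyStore:
    """In-memory sketch: records are never overwritten, only followed by new versions."""

    def __init__(self):
        self._versions = {}   # record_id -> list of version entries, oldest first
        self.audit_log = []   # one entry per write; entries are only ever appended

    def write(self, record_id, payload, user):
        """Store payload as the next version of record_id and log who wrote it."""
        blob = json.dumps(payload, sort_keys=True)
        entry = {
            "version": len(self._versions.get(record_id, [])) + 1,
            "sha256": hashlib.sha256(blob.encode()).hexdigest(),
            "blob": blob,
        }
        self._versions.setdefault(record_id, []).append(entry)
        self.audit_log.append({
            "record_id": record_id,
            "version": entry["version"],
            "user": user,
            "at": datetime.now(timezone.utc).isoformat(),
        })
        return entry["version"]

    def read(self, record_id, version=None):
        """Return the requested (default: latest) version after checking its hash."""
        versions = self._versions[record_id]
        entry = versions[-1] if version is None else versions[version - 1]
        if hashlib.sha256(entry["blob"].encode()).hexdigest() != entry["sha256"]:
            raise ValueError("stored data no longer matches its recorded hash")
        return json.loads(entry["blob"])


store = AppendOnlyStore()
store.write("plate-001", {"well": "A1", "od600": 0.42}, user="alice")
store.write("plate-001", {"well": "A1", "od600": 0.44}, user="bob")  # a correction, not an overwrite
```

Both versions stay readable, `read` is programmatic by construction, and the audit log records who wrote what and when. That contrast with "save over the file in a shared drive" is the point of the split.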
-
Data is the new lab bench in biotech, but most companies have a broken bench. Let me explain why this 1-2-3 approach is changing everything:
Most biotech data goes unanalyzed, trapped in siloed systems, proprietary formats, and disconnected workflows. The fundamental problem? Traditional architectures treat each experiment as isolated rather than part of an interconnected knowledge web. This creates a massive cognitive burden for scientists who spend more time wrangling data than making discoveries.
The solution isn't just better databases. It's creating what I call a "memory layer" for scientific knowledge. This layer has 3 critical components:
1) Structure first, analysis second
Most labs try to analyze raw data directly without proper structure. Effective systems focus on building semantic models that define relationships between experimental components before analysis begins. This seemingly simple shift helps our customers dramatically reduce analysis time and enable previously impossible cross-experimental insights.
2) Graphs, not tables
Biological systems are interconnected networks, yet we force data into rigid tables. Modern graph databases mirror how science actually works: through relationships, connections, and patterns. This approach allows scientists to discover "hidden bridges" between seemingly unrelated experiments.
3) Compound intelligence
The true power emerges when these structured, graph-based systems learn over time. Each experiment enriches the model rather than sitting as a static data point. This creates compounding value where the 100th experiment is far more valuable than the first because it connects to everything before it.
One genomics startup we worked with implemented this approach and saw remarkable acceleration:
• They identified targets in weeks rather than months
• Their experimental iterations became significantly faster
• Scientists uncovered novel insights from existing data
What's fascinating is that this approach makes scientists more effective while creating defensible IP in the data model itself. The biotech companies gaining the most investor traction aren't just producing molecules. They're building knowledge systems that get more valuable with every experiment. This is why forward-thinking VCs now evaluate data architecture as thoroughly as science.
As we enter this new era, companies that build proper memory layers will outperform those still treating data as an afterthought.
Wet lab scientists: Want to see how this memory layer approach could transform your research? DM me for a demo or subscribe to my newsletter: https://lnkd.in/gsyuTb_5
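To show what a "hidden bridge" between experiments might look like in practice, here is a small sketch of the graphs-not-tables idea. It is only an illustration under stated assumptions: the triples, node names, and helper functions are hypothetical, and a plain breadth-first search over a dictionary stands in for a real graph database.

```python
from collections import defaultdict, deque

# Hypothetical (subject, relation, object) triples; none come from a real dataset.
TRIPLES = [
    ("experiment_1", "measured", "gene_A"),
    ("experiment_2", "measured", "gene_B"),
    ("gene_A", "member_of", "pathway_X"),
    ("gene_B", "member_of", "pathway_X"),
    ("experiment_3", "measured", "gene_C"),
]


def build_graph(triples):
    """Build an undirected adjacency map, keeping the relation label on each edge."""
    graph = defaultdict(list)
    for subj, rel, obj in triples:
        graph[subj].append((rel, obj))
        graph[obj].append((rel, subj))
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the nodes linking start to goal, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for _rel, neighbor in graph[path[-1]]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


graph = build_graph(TRIPLES)
bridge = shortest_path(graph, "experiment_1", "experiment_2")
print(bridge)
```

Here the two experiments never share a gene, yet they connect through a shared pathway, while `experiment_3` stays unconnected. In a table-per-experiment layout, that link would need an explicit join someone thought to write in advance.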
-
Starting an R&D data management initiative always feels like heavy lifting. Much of the difficulty comes from defining a data model that can hold diverse, ever-evolving R&D data.
Relational data modeling, the default for enterprise data management, prioritizes transactional operations and storage efficiency. It assumes well-understood data requirements and predictable data workflows, both of which break down quickly in active R&D.
JSON is what I usually recommend instead. It aligns with the document-centric shape of R&D artifacts (lab notebook pages, analytical reports, etc.) and can largely store them as-is, no upfront reshaping into tables required. JSON Schema then adds the governance layer: it declares data shape, types, and required fields, and is easy to flex as the data evolves. It can also serve as the data contract for user interfaces, ML modeling and, more recently, the structured input/output specification for LLMs. I leaned on this pattern in past R&D work at DuPont (https://lnkd.in/dKi8gJF).
A recent paper puts that pattern to work on metal-organic framework (MOF) synthesis, where one JSON Schema reads lab data in, validates it, exports to community-standard formats, and feeds the ML analysis. Here is how the schema drives everything:
🔹 LLM-based extraction: an off-the-shelf LLM turns free-text ELN procedures into structured JSON, validated against the schema and cross-checked against a hand-written rule parser.
🔹 Integration and validation: Powder XRD measurements, ELN entries, and CSV tables are unified under the same schema, which also catches missing fields, wrong types, and out-of-range values.
🔹 Standard-based serialization: from the same schema, data exports cleanly into community standards, facilitating data sharing across labs and platforms.
🔹 Data analysis and ML: schema-validated data feeds visualization and decision-tree modeling that pinpoint critical synthesis parameters.
The result: the same schema-driven workflow handled two distinct MOF systems and nearly 200 synthesis trials. Even drafting the schema itself was LLM-supported, lowering the bar for non-specialists to start. A pattern worth considering for any R&D team weighing how to modernize its data management.
📄 Data Management and Analysis of Metal-Organic Framework Synthesis Using Data Models, Journal of Chemical Information and Modeling, May 8, 2026
🔗 https://lnkd.in/e23J7cPQ
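For a feel of how a schema can catch missing fields, wrong types, and out-of-range values, here is a minimal sketch. To be clear about the assumptions: the field names and ranges are hypothetical and not taken from the paper, and the hand-written `validate` function only understands a small subset of JSON Schema keywords (`type`, `required`, `properties`, `minimum`, `maximum`). It is a stand-in for a full JSON Schema validator, not a replacement for one.

```python
# A schema written with a small subset of JSON Schema keywords.
# Field names and ranges are hypothetical, not taken from the paper.
SYNTHESIS_SCHEMA = {
    "type": "object",
    "required": ["sample_id", "metal_salt", "temperature_c"],
    "properties": {
        "sample_id": {"type": "string"},
        "metal_salt": {"type": "string"},
        "temperature_c": {"type": "number", "minimum": 20, "maximum": 250},
        "reaction_time_h": {"type": "number", "minimum": 0},
    },
}

PYTHON_TYPES = {"string": str, "number": (int, float), "object": dict}


def validate(record, schema):
    """Return a list of readable problems; an empty list means the record passes."""
    if not isinstance(record, PYTHON_TYPES[schema["type"]]):
        return [f"expected {schema['type']}"]
    errors = []
    for name in schema.get("required", []):
        if name not in record:
            errors.append(f"missing required field: {name}")
    for name, rules in schema.get("properties", {}).items():
        if name not in record:
            continue
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, PYTHON_TYPES[rules["type"]]):
            errors.append(f"{name}: expected {rules['type']}")
            continue
        if "minimum" in rules and value < rules["minimum"]:
            errors.append(f"{name}: {value} is below minimum {rules['minimum']}")
        if "maximum" in rules and value > rules["maximum"]:
            errors.append(f"{name}: {value} is above maximum {rules['maximum']}")
    return errors


good = {"sample_id": "MOF-001", "metal_salt": "Zn(NO3)2", "temperature_c": 120}
bad = {"sample_id": "MOF-002", "temperature_c": "hot", "reaction_time_h": -1}

print(validate(good, SYNTHESIS_SCHEMA))
print(validate(bad, SYNTHESIS_SCHEMA))
```

Because the schema is plain data, the same dictionary could plausibly be handed to a UI form generator or used as an LLM output specification, which is what makes it work as a data contract.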
-
Still wrangling endless CSVs in your lab workflow? There's a smarter way: unify all your data with xarray. Curious how a single data structure can simplify everything? Read on.
After years of managing experimental and machine learning data across scattered files and formats, I realized the cognitive load of keeping everything aligned was overwhelming. I started exploring unified data structures to reduce this friction.
For example, I once spent days writing index-matching code just to keep my training data, features, and model outputs in sync across multiple files. It was exhausting and error-prone: one small misalignment could break the whole pipeline. This experience pushed me to look for a better, unified approach.
Traditional lab data management means scattered files, mismatched indices, and constant manual bookkeeping. It's error-prone and exhausting.
Inspired by a recent talk at SciPy, I built a synthetic microRNA study example to show how xarray can unify raw measurements, computed features, and model outputs in a single, coordinate-aligned Dataset, with no more index-matching headaches.
With xarray, you can store all your experimental measurements, computed features, statistical estimates, and even train/test splits in one dataset. Every piece of data knows exactly where it belongs, so there is no more index juggling.
In my latest blog post, I walk through this synthetic example step by step. The result? Cleaner workflows, bulletproof data consistency, and cloud-native scalability.
If you're ready to reduce friction in your experimental data lifecycle, check out my blog post for a practical guide. Would love to hear your thoughts or experiences! https://lnkd.in/eXqGJB57
How are you currently managing complex experimental or ML data? Have you tried a unified approach like xarray?
#datascience #laboratoryinformatics #machinelearning #xarray #bioinformatics
-
🚨 Is Traditional #LIMS Dead?
If you’re leading Life Science R&D or Manufacturing and still treating LIMS as the center of your universe, you’re quietly capping your AI potential and your scientific ROI. From our work at The Stellix Group / ZAETHER, here’s what we’re seeing with forward-thinking organizations who are letting go of old LIMS beliefs:
1️⃣ From monolithic LIMS to a unified scientific data fabric
LIMS is no longer the hub – it’s a transactional spoke in a wider scientific data platform that unifies LIMS, ELN, SDMS, instruments, and manufacturing systems into an enterprise scientific data fabric. The real value is the #datafabric, not the individual application.
2️⃣ Consultants must be #dataArchitects, not app installers
“Implementing one big LIMS that does everything” is an outdated mandate. Modern partners are architecting the full data ecosystem and workflow orchestration: letting each system do what it does best while the platform handles unification, lineage, and context.
3️⃣ Beyond digitization: build #AI‑ready, FAIR data assets
Scanning paper and automating old workflows is table stakes. Competitive advantage now comes from data that is #FAIR (Findable, Accessible, Interoperable, Reusable) and ready for advanced analytics and AI/ML across R&D and manufacturing – not locked in a single vendor’s database.
4️⃣ Unstructured data and scientific context are first‑class citizens
Free‑text observations, images, protocols, rationales – this is the “why and how” of science, and it’s where AI learns the most. LIMS consulting now requires ontology and metadata design, not just process mapping, so that context can actually be used for modeling and prediction.
5️⃣ #LIMS & #MES consulting = continuous digital transformation, not “go‑live”
The real work (and value) starts after go‑live. Leaders are measuring success in business and scientific outcomes:
• Shorter assay and method development cycles
• Higher data reusability for AI/ML
• Clear, defensible ROI from better decisions and fewer failures
That demands enterprise data stewardship, cloud data lake strategy, and governance – not just a “finished” LIMS project.
At Stellix / Zaether, we’re leaning into this shift: from application‑centric to data‑centric lab and manufacturing architectures that are genuinely AI‑ready.
Curious to hear from R&D and manufacturing leaders: where are you on the journey from LIMS-as-system to scientific data fabric as an asset class?
-
🚨 Here's the uncomfortable truth about "unstructured" data in life sciences: We've been approaching it all wrong.
When we label complex scientific data that come in bespoke formats, such as genomic variants, single-cell data, and biomedical imaging, as "unstructured," we automatically give up on properly organizing them. The result? Data silos, inefficient storage, and missed discoveries that could change lives.
But here's what I've learned after years of building data systems: No data is truly unstructured. An image isn't random pixels; it's a precise 2-D matrix. Genomic variants have clear positional relationships.
The real challenge isn't that biological data lacks structure. The issue is that we've lacked a data model flexible enough to capture its complexity efficiently.
At TileDB, we've solved this with multi-dimensional arrays that shape-shift to handle any data type, from tables to genomics to imaging, in a unified system with database-level performance.
The implications? Researchers can finally ask cross-modal questions that were previously impossible. Drug discovery accelerates when data engineering stops being the bottleneck.
What's your experience with complex scientific data? Are you still fighting with multiple formats and tools?
Read more about multi-dimensional arrays for multimodal multiomics data: https://lnkd.in/ebfihBUg
#DataScience #LifeSciences #DrugDiscovery #MultimodalData #Multiomics