New DPROD specification for data products by OMG

The immediate challenge for organisations is consolidating and connecting their data, enabling them to build AI systems tailored to their unique needs and semantics. While this may seem daunting, as Lao Tzu said, "A journey of a thousand miles begins with a single step." The first step on this particular journey is the new DPROD specification: a straightforward, semantic approach to defining data products.

Many organisations are reorganising their data around data products, and the most advanced are connecting their data with knowledge graphs for use with Gen AI. These efforts should not be separate endeavours; the real value lies in bridging them. This is where DPROD comes in. It is a freely available semantic ontology that defines data products, serving as both a specification and a first small step in creating a Distributed Knowledge Graph.

Here’s an example of a fictional UK Bonds data product:

    {
      "@context": "https://lnkd.in/epwdgBtD",
      "id": "y.com/products/uk-bonds",
      "type": "DataProduct",
      "title": "UK Bonds",
      "dataProductOwner": "linkedin.com/in/tonyseale/",
      "outputPort": {
        "type": "DataService",
        "id": "y.com/uk-10-year-bonds",
        "endpointURL": "y.com/uk-10-year-bonds",
        "isAccessServiceOf": {
          "type": "Distribution",
          "format": "https://lnkd.in/epxCH697",
          "isDistributionOf": {
            "type": "Dataset",
            "id": "y.com/ds/uk-10-year-bonds",
            "conformsTo": "https://lnkd.in/e9bxa4Fc"
          }
        }
      }
    }

🔵 Points to Note:
🔹 Simple to Implement: Define your data product in plain JSON.
🔹 Shared Schemas: Each product connects to a shared schema using the @context property.
🔹 Linkable: Data products have unique URLs, enabling interconnection in a distributed graph.
🔹 Semantics: The conformsTo property links data products to powerful semantic ontologies, enabling LLMs to understand what these data products represent.
🔹 Open Standards: DPROD will be an open standard built on established frameworks like RDF and DCAT.
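Because the UK Bonds example is plain JSON, it can be read with ordinary tooling before any RDF stack is involved. The following is a minimal Python sketch, not part of the DPROD specification: it loads the example and follows outputPort → isAccessServiceOf → isDistributionOf to collect the conformsTo links. The helper names as_list and conforms_to_iris are hypothetical, and the code treats the document as a nested dictionary; it does not resolve the @context or perform any JSON-LD processing.

```python
import json

# The fictional UK Bonds data product from the post, as plain JSON text.
DATA_PRODUCT_JSON = """
{
  "@context": "https://lnkd.in/epwdgBtD",
  "id": "y.com/products/uk-bonds",
  "type": "DataProduct",
  "title": "UK Bonds",
  "dataProductOwner": "linkedin.com/in/tonyseale/",
  "outputPort": {
    "type": "DataService",
    "id": "y.com/uk-10-year-bonds",
    "endpointURL": "y.com/uk-10-year-bonds",
    "isAccessServiceOf": {
      "type": "Distribution",
      "format": "https://lnkd.in/epxCH697",
      "isDistributionOf": {
        "type": "Dataset",
        "id": "y.com/ds/uk-10-year-bonds",
        "conformsTo": "https://lnkd.in/e9bxa4Fc"
      }
    }
  }
}
"""


def as_list(value):
    """Hypothetical helper: accept a missing value, a single value or a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def conforms_to_iris(product):
    """Hypothetical helper: collect conformsTo links reachable from the output ports."""
    iris = []
    for port in as_list(product.get("outputPort")):
        for distribution in as_list(port.get("isAccessServiceOf")):
            for dataset in as_list(distribution.get("isDistributionOf")):
                iris.extend(as_list(dataset.get("conformsTo")))
    return iris


product = json.loads(DATA_PRODUCT_JSON)
print(product["title"], "->", conforms_to_iris(product))
```

Normalising each property to a list is a deliberate choice in this sketch: a data product may expose more than one output port, and JSON-LD permits a property to hold either a single value or an array.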
🔵 Tried, Tested and Open
We’ve tested DPROD for over a year with large enterprises and gathered feedback from vendors and experts. It has been developed at the EKGF with the support of EDM Council members and is now open for public review and comment at the OMG.

🔵 The Time to Act is Now!
This challenge isn’t just technical; it’s organisational. The question is: can your teams agree on the shared semantics that will allow you to consolidate and connect your data products? DPROD is a practical first step towards answering this question. I encourage you to test DPROD within your organisation, provide feedback, and see if it helps unify your data. DPROD is more than a specification: it’s the first step towards an architecture that prepares your data for AI.

🔴 DPROD Specification: https://lnkd.in/ed9jAzGF
⭕ How to Article: https://lnkd.in/eScj_nfg
⭕ OMG RFC: https://lnkd.in/ee2qvBRp

95 Comments

Tony Seale 1y

Thanks to everyone who gave up their Friday evenings to work on the specification. It has been a pleasure working with you all (in no particular order): Natasa Varytimou, Pete Rivett, Marcel Fröhlich, Andrea Gioia, Jacobus Geluk, Ben Whittam Smith, Steve Fisher, Oli Bage, Ben Clinch, carl mattocks, Carlos Tubbax, Charles Ivie.

I would also like to thank all the reviewers who took the time to provide us with invaluable feedback. Here is a sample of some of them, again in no particular order: Juan Sequeda, Peter Winstanley, Matthias Autrata, Richard Perris, Manu Sporny, Ritu Sinha, Murali Suraparaju, Elisa Kendall, Stephen Gatchell, Mike Bennett, Gregor Wobbe, Jon Cooke.

17 Reactions

Tzvi Weitzner 1y

Tony Seale, it is difficult to start considering this approach for real enterprise use, given the inaccuracy of NL2SPARQL. The best accuracy currently reported for NL2SPARQL tasks comes from various approaches, including models like SPBERT and extensions of large language models (LLMs) such as GPT-based systems. Currently, models can achieve up to around 75-85% accuracy on benchmark datasets, depending on the architecture and fine-tuning methods employed. How can an enterprise make decisions based on such low accuracy? This is equivalent to making decisions with bad data.

3 Reactions

Stephen Channell 1y

You're to be congratulated on the effort - it's a good contribution to the body of knowledge and helps to address the perennial questions of where a dataset came from, how I know it's accurate, and what it represents. OData follows a similar scheme of annotating data with a context (e.g. https://services.odata.org/TripPinRESTierService/(S(upayljlkiwqrr5pvqxtqyopp))/People includes a reference to the metadata https://services.odata.org/TripPinRESTierService/(S(upayljlkiwqrr5pvqxtqyopp))/$metadata#People). "@dprod.context" might be a better name, to distinguish the data from other schemes.

dataProductOwner is useful provenance information, but it does not tell me whether Tony Seale is an authoritative source for "UK Bonds", where he sourced the list from, at what point in time it was assembled, what sensitivity/confidentiality (e.g. GDPR) applies, or what license restrictions there are on usage (e.g. Bloomberg will sue me for using data you captured from a terminal).

One strategy to address AI hallucination and deep-fakes is to reference the full provenance of information. If you bought the "UK Bonds" list from Bloomberg Data-license for a global Holding Co, you'll pay a higher price - you might expect a fee when I use it.

2 Reactions

Kingsley Uyi Idehen 1y

“I encourage you to test DPROD within your organization, provide feedback, and see if it helps unify your data.” In the age of Smart #AI Agents (or Assistants), achieving this goal is much simpler. By that, I mean leveraging these tools for handling critical tasks such as: 1. Reading the #Ontology 2. Generating sample instance data from the Ontology using various notations (e.g., #JSONLD or #RDFTurtle). Here are links demonstrating the points outlined above, using an interface to our (OpenLink Software) AI Layer (#OPAL), which integrates seamlessly with #LLMs from OpenAI and Mistral AI: 1. https://linkeddata.uriburner.com/assist-metal/?share_id=sh-4UZgEzW9rkKGDJWau2ev7gidqTnz&t=120 -- Animated view 2. https://linkeddata.uriburner.com/assist-metal/?share_id=sh-4UZgEzW9rkKGDJWau2ev7gidqTnz -- Static view (by simply dropping the URL parameter &t=120) I’ve also attached a GIF that showcases the entire process. Once upon a time, the journey to this point was fraught with riddles and gotchas. That’s no longer the case, thanks to the symbiotic relationship between modern LLM-based natural language processors and structured content (e.g., specifications, ontologies, etc.). #HowTo #SmartAgent #DataProduct #LinkedData #UseCase

6 Reactions

Mark Spivey 1y

What is the immediate differentiation from DCAT, though? Or is it more of a specialization of it? I’ve been modeling what you’re getting at with DPROD as a special case of DCAT, though in addition doing it in a hypermedia-oriented approach, as the JSON-LD doesn’t automatically capture the affordances assumed and expected from the RESTful side.

1 Reaction

Imran Chaudhri 1y

First of all, who calls an organization OMG 😆 Secondly, excellent work; we will be looking at this carefully.

2 Reactions

Niklas Lind 1y

Tony Seale This is very interesting! I have a question about the dataset's use of the <conformsTo> property mentioned in section 6.4.5 of the DPROD specification. In the specification, the example dataset is linked to the class SBA Pool (from FIBO) through a <conformsTo> relationship. What does this actually mean? Does it imply that all the object properties connecting the SBA Pool to other classes, like <hasMeasure some> prepayment speed, must also apply to the dataset? If the dataset is primarily meant to represent the SBA Pool, does it need to include data for each of the properties of the SBA Pool class? Or, is the rule that the data elements in the dataset schema must simply be valid in relation to the SBA Pool class as defined in the reference ontology? I think I need to read more to fully understand this. However, wouldn't it be better to reference the SBA Pool class using versioned IRIs, like this one: https://spec.edmcouncil.org/fibo/ontology/master/2024Q2/SEC/Debt/MortgageBackedSecurities/SBA-Pool? The ontology may have a different lifecycle compared to the release of a specific version of a dataset provided by a data product.

1 Reaction

Perry (Pin) Chen, PhD 1y

Great efforts to bridge the gap between data practice and AI/ML applications through semantic data products, Tony Seale, DPROD shows another practice of data product thinking (please allow me to say that there will be different practices of data product thinking to help produce or generate different data products for different purposes).

1 Reaction

Howard Guess 1y

A neat way to visualise DPROD is through a reference model in Solidatus. A demonstration link is available here: https://trial.solidatus.com/viewer/share/PId9yxYo0iN49evG7RXSSQuIG6PA3iFO. It would be neat to connect the ontology held in the reference model to a number of Solidatus lineage models to show how the ontology can be implemented in real-life cases.

3 Reactions

Roxanne Howdle-Rowe 1y

Thanks. What’s different from the last update on 24 July, please? Where’s a detailed log of changes? I can’t see it. Thanks!

1 Reaction
