
SQL

A Semantic Web ontology / triple Store built on MongoDB

In a previous blog post I discussed building a Semantic Triple Store using SQL Server. That approach works fine, but I’m struck by how many joins are needed to get any results from the data, and as I look at storing much larger ontologies containing billions of triples there are many potential scalability issues with it. So over the past few evenings I tried a different approach and created a semantic store based on MongoDB. In the MongoDB version I store the basic building blocks of semantic knowledge representation quite differently. For starters, I decided that typical ABox and TBox knowledge has quite different storage requirements, and that smashing all the complex TBox assertions into simple triples, stringing them together with meta fields, only to immediately join them back up whenever they are needed, seemed like a bad idea from the NOSQL / document-database perspective.

TBox/ABox: In the ABox you typically find simple triples of the form X-predicate-Y. These store simple assertions about individuals and classes. In the TBox you typically find complex sequents, that is to say complex logic statements having a head (or consequent) and a body (or antecedents). The head is ‘entailed’ by the body: if you can satisfy all of the body statements then the head is true. For example, the transitivity of rdfs:subClassOf can be written as the sequent { ?x rdfs:subClassOf ?y . ?y rdfs:subClassOf ?z } => { ?x rdfs:subClassOf ?z }. In a traditional store all the ABox assertions can be represented as triples, while the complex TBox assertions use quads with a meta field whose sole purpose is to rebuild the sequent with its head and body. The ABox/TBox distinction is, however, somewhat arbitrary (see http://www.semanticoverflow.com/questions/1107/why-is-it-necessary-to-split-reasoning-into-t-box-and-a-box).

I also decided that I wanted to use ObjectIds as the primary way of referring to any Entity in the store. Using the full Uri for every Entity is of course possible, and MongoDB could have used that as the index, but I wanted to make this efficient and easily shardable across multiple MongoDB servers. The MongoDB ObjectId is ideal for that purpose and makes queries and indexing more efficient.

The first step then was to create a collection that would hold Entities and would permit the mapping from Uri to ObjectId. That was easy: an Entity type inheriting from a Resource type produces a simple document like the one shown below. An index on Uri with a unique condition ensures that it’s easy to look up any Entity by Uri and that there can only ever be one mapping to an Id for any Uri.

RESOURCES COLLECTION - SAMPLE DOCUMENT

{
  "_id": "4d243af69b1f26166cb7606b",
  "_t": "Entity",
  "Uri": "http://www.w3.org/1999/02/22-rdf-syntax-ns#first"
}

Although I should use a proper Uri for every Entity, I also decided to allow arbitrary strings here, so if you are building a simple ontology that never needs to go beyond the bounds of this one system you can forgo namespaces and http:// prefixes and just put a string there, e.g. “SELLS”. Since every Entity reference is immediately mapped to an Id, and that Id is used throughout the rest of the system, it really doesn’t matter much.
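
In code the Resources collection is little more than a Resource / Entity class pair plus that unique index. Here is a minimal sketch using the official C# driver of the time; the connection setup, database name and get-or-create logic are illustrative assumptions rather than the store’s actual code:

using MongoDB.Bson;
using MongoDB.Driver;
using MongoDB.Driver.Builders;

public class Resource
{
    public ObjectId Id { get; set; }        // serialized as "_id"
}

public class Entity : Resource
{
    public string Uri { get; set; }         // a full Uri, or a plain string like "SELLS"
}

public static class ResourceStore
{
    // Illustrative setup - not necessarily how the store itself is wired up
    static readonly MongoDatabase database =
        MongoServer.Create("mongodb://localhost").GetDatabase("semanticstore");

    public static readonly MongoCollection<Entity> Resources =
        database.GetCollection<Entity>("resources");

    static ResourceStore()
    {
        // Unique index on Uri: one and only one Id for any given Uri
        Resources.EnsureIndex(IndexKeys.Ascending("Uri"), IndexOptions.SetUnique(true));
    }

    // Map a Uri (or plain string) to its ObjectId, creating the Entity if it has never been seen
    public static ObjectId GetOrAddId(string uri)
    {
        Entity entity = Resources.FindOne(Query.EQ("Uri", uri));
        if (entity == null)
        {
            entity = new Entity { Id = ObjectId.GenerateNewId(), Uri = uri };
            Resources.Insert(entity);
        }
        return entity.Id;
    }
}

With that in place, ObjectId sellsId = ResourceStore.GetOrAddId("SELLS"); gives you the Id used everywhere else in the store.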

The next step was to represent simple ABox assertions. Rather than storing each assertion as its own document I created a document that can hold several assertions, all related to the same subject. Of course, if there are too many assertions you’ll still need to split them across separate documents, but that’s easy to do. This was mainly a convenience while developing the system, as it makes it easy to look at all the assertions concerning a single Entity using MongoVue or the Mongo command line interface, but I’m hoping it will also help performance since typical access patterns bring in all of the statements concerning a given Entity.

Where a statement requires a literal the literal is stored directly in the document and since literals don’t have Uris there is no entry in the resources collection.

To make searches for statements easy and fast I added an array field “SPO” which stores the set of all Ids mentioned anywhere in any of the statements in the document. This array is indexed using MongoDB’s array (multikey) indexing feature, which makes it very efficient to find and fetch every document that mentions a particular Entity. If the Entity only ever appears in the subject position, that search may return just a single document containing all of the assertions about that Entity. For example:

STATEMENTGROUPS COLLECTION - SAMPLE DOCUMENT

{
  "_id": "4d243af99b1f26166cb760c6",
  "SPO": [
    "4d243af69b1f26166cb7606f",
    "4d243af69b1f26166cb76079",
    "4d243af69b1f26166cb7607c"
  ],
  "Statements": [
    {
      "_id": "4d243af99b1f26166cb760c5",
      "Subject": {
        "_t": "Entity",
        "_id": "4d243af69b1f26166cb7606f",
        "Uri": "GROCERYSTORE"
      },
      "Predicate": {
        "_t": "Entity",
        "_id": "4d243af69b1f26166cb7607c",
        "Uri": "SELLS"
      },
      "Object": {
        "_t": "Entity",
        "_id": "4d243af69b1f26166cb76079",
        "Uri": "DAIRY"
      }
    }
	... more statements here ...
  ]
}
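
Continuing the sketch above, the StatementGroup and Statement classes below are my reading of this sample document; fetching everything known about an Entity is then one indexed query against the SPO array:

using System.Collections.Generic;
using MongoDB.Bson;
using MongoDB.Driver;
using MongoDB.Driver.Builders;

public class Statement
{
    public ObjectId Id { get; set; }
    public Resource Subject { get; set; }
    public Resource Predicate { get; set; }
    public Resource Object { get; set; }              // an Entity reference, or a literal
}

public class StatementGroup
{
    public ObjectId Id { get; set; }
    public List<ObjectId> SPO { get; set; }           // every Id mentioned in any statement below
    public List<Statement> Statements { get; set; }
}

public static class StatementStore
{
    public static readonly MongoCollection<StatementGroup> Groups =
        MongoServer.Create("mongodb://localhost")
                   .GetDatabase("semanticstore")
                   .GetCollection<StatementGroup>("statementgroups");

    static StatementStore()
    {
        Groups.EnsureIndex(IndexKeys.Ascending("SPO"));   // multikey index over the array
    }

    // Every document holding a statement that mentions the given Entity,
    // whether as subject, predicate or object - typically one document per subject
    public static IEnumerable<StatementGroup> Mentioning(ObjectId entityId)
    {
        return Groups.Find(Query.EQ("SPO", entityId));
    }
}

For example, StatementStore.Mentioning(ResourceStore.GetOrAddId("GROCERYSTORE")) would bring back the document shown above.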

The third and final collection I created is used to store TBox sequents consisting of a head (consequent) and a body (antecedents). Once again I added an array which indexes all of the Entities mentioned anywhere in any of the statements used in the sequent. Below that there is an array of Antecedent statements and then a single Consequent statement. Although the statements don’t really need the full serialized version of an Entity (all they need is the _id), I include the Uri and type for each Entity for now. Variables also have Id values but, unlike Entities, they are not stored in the Resources collection; they exist only in the Rule collection as part of a sequent’s statements. Variables have no meaning outside a sequent unless they are bound to some other value.

RULE COLLECTION - SAMPLE DOCUMENT

{
  "_id": "4d243af99b1f26166cb76102",
  "References": [
    "4d243af69b1f26166cb7607d",
    "4d243af99b1f26166cb760f8",
    "4d243af99b1f26166cb760fa",
    "4d243af99b1f26166cb760fc",
    "4d243af99b1f26166cb760fe"
  ],
  "Antecedents": [
    {
      "_id": "4d243af99b1f26166cb760ff",
      "Subject": {
        "_t": "Variable",
        "_id": "4d243af99b1f26166cb760f8",
        "Uri": "V3-Subclass8"
      },
      "Predicate": {
        "_t": "Entity",
        "_id": "4d243af69b1f26166cb7607d",
        "Uri": "rdfs:subClassOf"
      },
      "Object": {
        "_t": "Variable",
        "_id": "4d243af99b1f26166cb760fa",
        "Uri": "V3-Class9"
      }
    },
    {
      "_id": "4d243af99b1f26166cb76100",
      "Subject": {
        "_t": "Variable",
        "_id": "4d243af99b1f26166cb760fa",
        "Uri": "V3-Class9"
      },
      "Predicate": {
        "_t": "Variable",
        "_id": "4d243af99b1f26166cb760fc",
        "Uri": "V3-Predicate10"
      },
      "Object": {
        "_t": "Variable",
        "_id": "4d243af99b1f26166cb760fe",
        "Uri": "V3-Something11"
      }
    }
  ],
  "Consequent": {
    "_id": "4d243af99b1f26166cb76101",
    "Subject": {
      "_t": "Variable",
      "_id": "4d243af99b1f26166cb760f8",
      "Uri": "V3-Subclass8"
    },
    "Predicate": {
      "_t": "Variable",
      "_id": "4d243af99b1f26166cb760fc",
      "Uri": "V3-Predicate10"
    },
    "Object": {
      "_t": "Variable",
      "_id": "4d243af99b1f26166cb760fe",
      "Uri": "V3-Something11"
    }
  }
}
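
The Rule collection can be queried the same way. Continuing the earlier sketches, the Rule class here is again inferred from the sample document; the References index lets a reasoner pull back only the sequents that could possibly involve a given Entity:

public class Rule
{
    public ObjectId Id { get; set; }
    public List<ObjectId> References { get; set; }    // every Entity and Variable Id used in the sequent
    public List<Statement> Antecedents { get; set; }  // the body
    public Statement Consequent { get; set; }         // the head
}

public static class RuleStore
{
    public static readonly MongoCollection<Rule> Rules =
        MongoServer.Create("mongodb://localhost")
                   .GetDatabase("semanticstore")
                   .GetCollection<Rule>("rules");

    static RuleStore()
    {
        Rules.EnsureIndex(IndexKeys.Ascending("References"));
    }

    // Every sequent whose head or body mentions the given Entity,
    // e.g. all rules involving rdfs:subClassOf
    public static IEnumerable<Rule> Involving(ObjectId entityId)
    {
        return Rules.Find(Query.EQ("References", entityId));
    }
}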

That is essentially the whole semantic store. I connected it up to a reasoner and have successfully run a few test cases against it. Next time I get a chance to experiment with this technology I plan to try loading a larger ontology and will rework the reasoner so that it can work directly against the database instead of taking in-memory copies of most queries that it performs.

At this point this is JUST AN EXPERIMENT but hopefully someone will find this blog entry useful. I hope later to connect this up to the home automation system so that it can begin reasoning across an ontology of the house and a set of ABox assertions about its current and past state.

Since I’m still relatively new to the semantic web I’d welcome feedback on this approach to storing ontologies in NOSQL databases from any experienced semanticists.

An ontology triple (quad) store for RDF/OWL using Entity Framework 4

This week’s side-project was the creation of an ontology store using Entity Framework 4. An ontology store holds axioms consisting of Subject, Predicate and Object, which are usually serialized as RDF, OWL, N3, … While there is plenty of detail available about these serialization formats, the actual mechanics of how to store and manipulate them were somewhat harder to come by. Nevertheless, after much experimentation I came up with an Entity Model that can store Quads (Subject, Predicate, Object and Meta) or Quins (Subject, Predicate, Object, Meta, Graph). The Meta field allows one Axiom to reference another. The Graph field allows the store to be segmented, making it easy to import some N3 or RDF into a graph and then flush that graph if it is no longer needed or when a newer version becomes available.

The store is currently hooked up to an Euler reasoner that can reason against it, lazily fetching just the necessary records from the SQL database that backs the Entity Model.
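
As a flavour of that lazy fetching, a partially-bound triple pattern can be expressed as a LINQ-to-Entities query so that only the matching axioms are pulled from SQL Server. Everything in this sketch beyond the Subject/Predicate/Object fields named above (the context name, the Axioms entity set, terms stored as strings) is an illustrative assumption, not the actual model:

using System;
using System.Linq;

class Example
{
    static void Main()
    {
        using (var context = new OntologyStoreEntities())   // generated ObjectContext - name assumed
        {
            // Pattern: ?x rdfs:subClassOf ?y - only the predicate is bound,
            // so only it constrains the SQL that Entity Framework generates
            string predicate = "rdfs:subClassOf";

            var candidates = from axiom in context.Axioms
                             where axiom.Predicate == predicate
                             select axiom;

            foreach (var axiom in candidates)                // enumeration runs the query lazily
            {
                Console.WriteLine("{0} {1} {2}", axiom.Subject, axiom.Predicate, axiom.Object);
            }
        }
    }
}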

Here’s the EDMX showing how I modeled the Ontology Store:

Ontology Store Entity Model

SQL Server – error: 18456, severity: 14, state: 38 – Incorrect Login

Despite the error message, this problem can be caused by something other than an authorization failure. In fact, simply misspelling the Initial Catalog can cause this message to appear. I wish developers wouldn’t reuse error messages when the problem and the solution are completely different.
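
For example, a connection string whose Initial Catalog is misspelled reproduces it even though the login itself is perfectly valid (the server and database names below are made up):

using System.Data.SqlClient;

class ConnectionTest
{
    static void Main()
    {
        // "Initial Catalog" names a database that does not exist ("MyDatabsae" instead of "MyDatabase"),
        // yet SQL Server logs error 18456, severity 14, state 38 and the client sees a login failure,
        // even though the credentials themselves are fine.
        string connectionString =
            "Data Source=.\\SQLEXPRESS;Initial Catalog=MyDatabsae;Integrated Security=True";

        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();   // throws SqlException
        }
    }
}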

Generate a SQL Compact Database from your Entity Model (EDMX)

I recently switched from SQLite to SQL Compact for some of the databases in a system I’m building. If you are using SQL Compact make sure you have the hotfix from Microsoft that fixes the Where() clause. The release that shipped as SP1 is utterly useless: it can’t even do a where clause on an nvarchar()! I don’t know why they didn’t recall it and replace it with the hotfix version!
Unfortunately SQL Server Management Studio doesn’t support generating scripts from a SQL Compact database, so if you design your database there you are out of luck when it comes to creating the scripts to auto-generate it when you deploy your code.
Steve Lasker’s Blog has a great article explaining how to embed these scripts as resources, but it still doesn’t solve the issue of how to generate the script in the first place.
There are a few articles that explain how to use T4 to generate stored procedures from the EDMX but I didn’t see one to generate the actual database itself. So I built one and you can get it below.
To use this, simply create a text file in your project called generate.tt and paste this text in. It will run and generate another file below itself containing the script to create each table it finds in each EDMX file in your project.
Limitations: within the EDMX file there isn’t enough information to fully create a database. While the code below can generate foreign keys and primary keys from the Associations and Keys defined in your model, it has no idea which indexes you might want. So for now it just generates an index for every field that looks vaguely indexable and leaves it up to you to delete the ones you don’t want.
<#
//
// Generate SQL Server CE database tables from EDMX definition - suitable for some simple databases only
// Creates tables, adds primary keys, adds an index for every conceivable field, adds foreign key relationships for all associations
//
#>
<#@ template language="C#v3.5" debug="True" #>
<#@ output extension=".sql" #>
<#@ assembly name="System.Core" #>
<#@ assembly name="System.Data" #>
<#@ assembly name="System.Data.Entity" #>
<#@ assembly name="System.Xml" #>
<#@ assembly name="System.Xml.Linq" #>
<#@ import namespace="System.Collections" #>
<#@ import namespace="System.Collections.Generic" #>
<#@ import namespace="System.Data.EntityClient" #>
<#@ import namespace="System.Data.SqlClient" #>
<#@ import namespace="System.Diagnostics" #>
<#@ import namespace="System.IO" #>
<#@ import namespace="System.Linq" #>
<#@ import namespace="System.Text" #>
<#@ import namespace="System.Text.RegularExpressions" #>
<#@ import namespace="System.Xml.Linq" #>
<#
// This will generate the code to build SQL COMPACT tables for every EDMX in the current directory
// Simply plug that code into SQL Server Management Studio or use it as a resource to build your database

// Get the current directory from the stack trace
string stackTraceFileName = new StackTrace(true).GetFrame(0).GetFileName();
string directoryName = Path.GetDirectoryName(stackTraceFileName);

#>
--------------------------------------------------------------------------------
-- This code was generated by a tool.
-- Changes to this file may cause incorrect behavior and will be lost if
-- the code is regenerated.
--------------------------------------------------------------------------------

<#
string[] entityFrameworkFiles = Directory.GetFiles(directoryName, "*.edmx");

foreach (string fileName in entityFrameworkFiles)
{
#>

-- Creating TSQL creation script <#= fileName #>.

<#
DoExtractTables(fileName);
}
#>

<#+
XNamespace edmxns = "http://schemas.microsoft.com/ado/2007/06/edmx";
XNamespace edmns = "http://schemas.microsoft.com/ado/2006/04/edm";
XNamespace ssdlns = "http://schemas.microsoft.com/ado/2006/04/edm/ssdl";

private void DoExtractTables(string edmxFilePath)
{
XDocument edmxDoc = XDocument.Load(edmxFilePath);
XElement edmxElement = edmxDoc.Element(edmxns + "Edmx"); // the <edmx:Edmx> root element
XElement runtimeElement = edmxElement.Element(edmxns + "Runtime"); // the <edmx:Runtime> element
XElement storageModelsElement = runtimeElement.Element(edmxns + "StorageModels"); // the <edmx:StorageModels> element
XElement ssdlSchemaElement = storageModelsElement.Element(ssdlns + "Schema");

string entityContainerName = runtimeElement
.Element(edmxns + "ConceptualModels") // the <edmx:ConceptualModels> element
.Element(edmns + "Schema") // the <Schema> element
.Element(edmns + "EntityContainer") // the <EntityContainer> element
.Attribute("Name").Value;
string ssdlNamespace = ssdlSchemaElement.Attribute("Namespace").Value;

// Get a list of tables from the SSDL.
XElement entityContainerElement = ssdlSchemaElement.Element(ssdlns + "EntityContainer");
IEnumerable<XElement> entitySets = entityContainerElement.Elements(ssdlns + "EntitySet");
IEnumerable<XElement> entityTypes = ssdlSchemaElement.Elements(ssdlns + "EntityType");
IEnumerable<XElement> associations = ssdlSchemaElement.Elements(ssdlns + "Association");
string defaultSchema = entityContainerElement.Attribute("Name").Value;

foreach (XElement table in entitySets)
{
string tableName = table.Attribute("Name").Value;
#>
DROP TABLE <#= tableName #>
GO

CREATE TABLE <#= tableName #> (
<#+

XElement entityType = entityTypes.First(et => et.Attribute("Name").Value == tableName);
IEnumerable<XElement> properties = entityType.Elements(ssdlns + "Property");
int i = 0;
int count = properties.Count();

// GET THE PRIMARY KEY INFORMATION
XElement key = entityType.Element(ssdlns + "Key");
string pkstr = "";
if (key != null)
{
pkstr = string.Join(",", key.Elements(ssdlns + "PropertyRef").Select(pk => pk.Attribute("Name").Value).ToArray());
count++; // To add a comma on last field definition
}
// // Accumulate a list of indexes to add as we go along examining the properties
// List<string> indexesToAdd = new List<string>();
foreach (XElement property in properties)
{
bool last = (i == count-1);
string propertyName = property.Attribute("Name").Value;
string propertyType = property.Attribute("Type").Value;
string nullField = "";
if (property.Attribute("Nullable") != null && property.Attribute("Nullable").Value == "false")
nullField = " NOT NULL";
string maxLength = "";
if (property.Attribute("MaxLength") != null)
maxLength = "(" + property.Attribute("MaxLength").Value + ")";
string comma = last ? "" : ",";
#>
<#= propertyName #> <#= propertyType #><#= maxLength #><#= nullField #><#= comma #>
<#+
i++;
} // each property

// Was there a primary key? If so, add it
if (pkstr != "")
{
#>
CONSTRAINT <#= "PK_" + tableName #> PRIMARY KEY ( <#= pkstr #> )
<#+
}

#>
)
GO

<#+
// Now add all the indexes we might need ... -----------------------------------------------------------------------
foreach (XElement property in properties)
{
string comment = "";
string propertyName = property.Attribute("Name").Value;
if (pkstr.Contains(propertyName))
{
comment = "already in primary key, no need to index again";
}

int maxLength = -1;
if (property.Attribute("MaxLength") != null)
int.TryParse(property.Attribute("MaxLength").Value, out maxLength);

if (maxLength > 256)
{
comment = "too long to meaningfully index";
}
string commentPrefix = comment != "" ? "-- " : "";
// TODO: More rules here about what to index and whether it's unique or not ...
// CREATE [UNIQUE] INDEX ...
if (comment != "")
{
#>
<#= commentPrefix #><#= comment #>
<#+
}

#>
<#= commentPrefix #>CREATE INDEX IX_<#= tableName #>_<#= propertyName #> ON <#= tableName #> (<#= propertyName #>);
<#+
} // each property
} // Each table
// Now add all the associations between tables ... -----------------------------------------------------------------
foreach (XElement association in associations)
{
string associationName = association.Attribute("Name").Value;

XElement referentialConstraint = association.Elements(ssdlns + "ReferentialConstraint").First();
XElement principalRole = referentialConstraint.Element(ssdlns + "Principal");
string thisTable = principalRole.Attribute("Role").Value;
string thisField = principalRole.Element(ssdlns + "PropertyRef").Attribute("Name").Value;
XElement dependentRole = referentialConstraint.Element(ssdlns + "Dependent");
string otherTable = dependentRole.Attribute("Role").Value;
string otherField = dependentRole.Element(ssdlns + "PropertyRef").Attribute("Name").Value;
#>

ALTER TABLE <#= otherTable #>
ADD CONSTRAINT <#= associationName #>
FOREIGN KEY (<#= otherField #>)
REFERENCES <#= thisTable #> (<#= thisField #>)
ON UPDATE CASCADE
ON DELETE CASCADE
GO
<#+
} // Each association
} // End of DoExtractTables
#>