Semantic Web, RDF and querying with SPARQL

The phrase ‘Semantic Web’ was coined by Tim Berners-Lee. In his own words, it’s a “web of data that can be processed by machines”. The idea is simple: websites publish their data in a way that allows it to be shared and reused across applications. The W3C defines standards for how this should be achieved, most notably RDF (Resource Description Framework).

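RDF is a family of W3C standards used to describe resources. It is a data model rather than a single file format, and it can be written down in several syntaxes, including RDF/XML, Turtle and N-Triples. A resource can be anything, e.g. a person, a book or a car. SPARQL is a query language that allows you to retrieve data stored in RDF format.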

In the context of the web, RDF allows the linking of data. Take Wikipedia as an example. It has many pages on people, for example Barack Obama. We could represent this person in RDF. The advantage of doing this is that we then have a machine-readable version of this person.

In addition to this, we can also define machine-readable relationships between this entity and others. For example, we could say Barack Obama is the president of the USA, and link to the RDF representation of the USA. The USA entity might have many other relationships, for example ‘has language’.

[Figure: basic graph with relationships that could be queried using SPARQL]

These entities and relationships form a graph - and the basis of the semantic web.

Semantic Web in practice

For example, say you run a website about cheese. You might have the following URIs:

www.mywebsiteaboutcheese.org/cheeses/brie - HTML page for visitors in a browser

www.mywebsiteaboutcheese.org/cheeses/brie/rdf - RDF version of the page about Brie, allowing other applications to reference this data

Our cheese RDF might have other handy bits of information, such as relationships (it might define a ‘country of origin’ relationship and point to the Wikipedia page for that country).

Ontologies/vocabularies

So we have loads of websites offering up their data as RDF. There are relationships and entities all over the place. As you might imagine, it might not always be obvious what a particular entity or relationship means. Take a complex domain such as medical science - if everyone is building their own relationships, everything will likely end up a mess with duplicates and misunderstandings.

The solution: ontologies. An ontology is a description of entities and their relationships. So if you’re building your own cheese appreciation site, you might look about to see if there is a decent ontology already in existence that you could use. This way, others can easily interpret your data as you’re using a shared standard.

For example, there is an ontology for e-commerce sites called GoodRelations, which is used by Google and Yahoo!. A list of popular ontologies can be found here.

Ontologies in the context of the semantic web are written in OWL (Web Ontology Language). OWL builds on RDF and allows it to encode description logic (to an extent). As there is a trade-off between how expressive an ontology is and how easily machines can reason over it, OWL comes in several ‘species’, such as OWL Lite, OWL DL and OWL Full.

Note: vocabularies are considered “light-weight ontologies”. Still very useful but perhaps not as beefy as a full-blown ontology.

Querying with SPARQL

If you’ve ever done any SQL queries, you’ll need to shift your perspective to start querying with SPARQL. Instead of thinking in terms of relational data, start thinking in terms of graphs. For example:

SELECT id, firstname, lastname, city FROM Sales.Customers WHERE city = 'Seattle';

This SQL statement is fairly easy to understand. We have a set of data (of rows) and we’re filtering based on the city. With graph data, you have to think in terms of the relationships. In a graph, the entity would have this relationship:

Person -- hasCity --> Seattle

So we would be writing a query that queries that relationship, not the entity itself. We’re looking for instances of a type of relationship (hasCity) that points to a specific entity (the city of Seattle).
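To make the shift in perspective concrete, here is a minimal Python sketch that answers the same question twice: once over rows and once over edges. The sample customers and the `person/1` style identifiers are made up for illustration; only `hasCity` and Seattle come from the example above.

```python
# Relational view: each customer is a row, and we filter on a column.
rows = [
    {"id": 1, "firstname": "Ann", "city": "Seattle"},
    {"id": 2, "firstname": "Bob", "city": "Portland"},
]
in_seattle_rows = [r["id"] for r in rows if r["city"] == "Seattle"]

# Graph view: the same facts stored as (subject, predicate, object) edges.
# We look for hasCity edges that end at Seattle and keep where they start.
edges = [
    ("person/1", "hasCity", "Seattle"),
    ("person/2", "hasCity", "Portland"),
]
in_seattle_graph = [s for s, p, o in edges if p == "hasCity" and o == "Seattle"]

print(in_seattle_rows)   # [1]
print(in_seattle_graph)  # ['person/1']
```

In the second version nothing is known about a person except the edges that leave it, which is exactly the mindset SPARQL expects.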

With that in mind, let’s jump in.

Your first SPARQL query

To run these queries, I’m going to use this site. It allows you to run SPARQL queries in your browser against a graph of your choice. The default is DBpedia, which extracts structured data from Wikipedia and exposes it as a graph (RDF).

SELECT ?x ?y ?z
WHERE { ?x ?y ?z }
LIMIT 100

The first thing to clear up: variables. Anything that starts with a ‘?’ is a variable.

The second: the WHERE clause. The WHERE clause contains the relationship we want to look for. Above, I’ve put three variables in a row. This means the relationship it will look for is this:

<anything> -- <anything> --> <anything>

The above structure in RDF is known as a triple. It’s made up like this:

<subject> -- <predicate> --> <object>
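If it helps, matching a triple pattern can be imitated in a few lines of Python. This is only a toy sketch of the idea, not how a real SPARQL engine works, and the identifiers (`presidentOf`, `hasLanguage`) are invented for the example rather than real URIs.

```python
def match(pattern, triples):
    """Return one dict of variable bindings per triple that fits the pattern.

    Terms starting with '?' are variables; any other term must match exactly.
    """
    results = []
    for triple in triples:
        bindings = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A repeated variable must bind to the same value each time.
                if bindings.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            # No break: every term of the pattern fitted this triple.
            results.append(bindings)
    return results

# Illustrative data only.
triples = [
    ("Barack_Obama", "presidentOf", "USA"),
    ("USA", "hasLanguage", "English"),
]

everything = match(("?x", "?y", "?z"), triples)          # three variables match any triple
languages = match(("USA", "hasLanguage", "?z"), triples)  # fixed subject and predicate
print(languages)  # [{'?z': 'English'}]
```

Three variables in a row match every triple in the data, which is why the query above needs a LIMIT.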

Further examples

So the above query will return the first 100 triples it finds. If you removed the ‘LIMIT 100’, you’d get a huge dataset returned.

Let’s fiddle with this query and fix the relationship to a birth date property, leaving the two ends of the relationship as variables:

SELECT ?x ?z
WHERE { ?x  ?z }
LIMIT 100

This query finds 100 instances where there is the relationship -- hasBirthDate -->

Run it, and you’ll get a list of URIs to people in the first column and their corresponding dates of birth in the second column.
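If you would rather send the query from a script than from the browser, the sketch below builds the request URL using only the Python standard library. Two things here are my assumptions rather than anything stated above: the endpoint address (DBpedia's public endpoint at `https://dbpedia.org/sparql`) and the predicate URI (`http://dbpedia.org/ontology/birthDate`), which I am using as one plausible way to spell the birth date relationship. The code only builds the URL and does not contact the server.

```python
from urllib.parse import urlencode

# Assumptions: neither the endpoint nor the property URI is given in the text.
ENDPOINT = "https://dbpedia.org/sparql"
QUERY = """
SELECT ?x ?z
WHERE { ?x <http://dbpedia.org/ontology/birthDate> ?z }
LIMIT 100
"""

def build_request_url(endpoint, query):
    """Encode the query as the `query` parameter of a GET request URL."""
    return endpoint + "?" + urlencode({"query": query.strip()})

url = build_request_url(ENDPOINT, QUERY)
print(url)
```

Passing the query text in a `query` parameter is the convention of the SPARQL protocol; how results are formatted depends on the endpoint, so check its documentation before relying on a particular format.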

What if we wanted to find people who were born in 1990? We need to replace our ?z variable with an actual value:

SELECT ?x
WHERE { ?x  "1990"^^ }
LIMIT 100

Another example: books where Obama is the subject:

SELECT ?x
WHERE { ?x   }
LIMIT 100

How to find the triple information

If you’re wondering where I get the URIs and data to make up these triples, here’s how:

  1. Navigate to the resource or property you wish to use. I do this by guessing the URL. For example, Obama was easy: http://dbpedia.org/resource/Barack_Obama
  2. Hit ‘Formats’ at the top and then ‘N-triples’
  3. You’ll be taken to a page that contains all the triples for the entity at which you were looking
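The N-Triples page is easy to read by eye, but the format is also simple enough to pick apart in code. Below is a rough sketch that splits one line into its three parts. It only handles the simple case where the subject and predicate are URIs in angle brackets, and the sample line is one I wrote by hand in that shape, not a line copied from DBpedia.

```python
import re

# Matches "<subject> <predicate> object ." and keeps the object as raw text.
# Simplification: blank nodes (_:b0) and comment lines are not handled.
LINE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+(.+?)\s*\.\s*$')

def parse_line(line):
    """Split one simple N-Triples line into (subject, predicate, object)."""
    found = LINE.match(line)
    if found is None:
        raise ValueError("not a simple N-Triples line: " + line)
    return found.groups()

# Hypothetical sample line, for illustration only.
example = ('<http://dbpedia.org/resource/Barack_Obama> '
           '<http://www.w3.org/2000/01/rdf-schema#label> "Barack Obama"@en .')
subject, predicate, obj = parse_line(example)
print(predicate)  # http://www.w3.org/2000/01/rdf-schema#label
```

The predicate and object you get back are the pieces you would paste into the WHERE clause of a query.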

Multiple WHERE predicates

WHERE clauses can contain multiple triple patterns. Simply use a period (.) to separate the patterns; a result must match all of them (this is similar to AND in SQL).

For example, to find books where Obama is the subject that are written in German:

SELECT ?x
WHERE {
?x   .
?x  "German"@en
}
LIMIT 100
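The reason this works like AND is that ?x has to take the same value in both patterns. Here is a small Python sketch of that behaviour over made-up data; the `book/1` identifiers and the short predicate names are invented, language tags such as @en are ignored, and a real engine would be far smarter about how it searches.

```python
def match(pattern, triple, bindings):
    """Extend `bindings` so that `pattern` fits `triple`, or return None."""
    new = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable keeps the value it already has, if any.
            if new.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new

def solve(patterns, triples):
    """Return the bindings that satisfy every pattern (patterns are ANDed)."""
    solutions = [{}]
    for pattern in patterns:
        extended = []
        for bindings in solutions:
            for triple in triples:
                candidate = match(pattern, triple, bindings)
                if candidate is not None:
                    extended.append(candidate)
        solutions = extended
    return solutions

# Made-up data: two books about Obama, only one of them in German.
triples = [
    ("book/1", "subject", "Barack_Obama"),
    ("book/1", "language", "German"),
    ("book/2", "subject", "Barack_Obama"),
    ("book/2", "language", "English"),
]
german_obama_books = solve(
    [("?x", "subject", "Barack_Obama"), ("?x", "language", "German")],
    triples,
)
print(german_obama_books)  # [{'?x': 'book/1'}]
```

Each extra pattern can only narrow the set of solutions, never widen it.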

Filter

The FILTER keyword restricts the results to those for which a given expression is true. For example:

SELECT ?x ?z
WHERE {
?x  ?z .
FILTER(?z = "Finance"@en)
}
LIMIT 100

The above query would return books on Finance.
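One way to picture FILTER is as a test applied to each set of variable bindings that the patterns produced. The sketch below shows that idea in Python with invented bindings; it ignores language tags such as @en and is not meant to reflect how an engine actually schedules the work.

```python
# Bindings as the pattern in a WHERE clause might produce them (made-up values).
solutions = [
    {"?x": "book/1", "?z": "Finance"},
    {"?x": "book/2", "?z": "Cooking"},
]

def apply_filter(solutions, condition):
    """Keep only the solutions for which `condition` returns True."""
    return [s for s in solutions if condition(s)]

finance_books = apply_filter(solutions, lambda s: s["?z"] == "Finance")
print(finance_books)  # [{'?x': 'book/1', '?z': 'Finance'}]
```

For a plain equality test like this one you could put the value straight into the triple pattern instead; FILTER earns its keep with comparisons and other expressions that a pattern cannot state.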

Applications

The semantic web and its associated technologies promise to facilitate the querying of the largest database in the world: the web. Any application that needs to pick out relevant information from this huge distributed graph could benefit. For example, news sites could automatically build complex profiles on the subjects of their stories by querying the semantic web.

The field of AI could also benefit. By having a huge machine-readable dataset about everything, machines can learn about the world very rapidly by traversing this huge graph.

Further Resources