Ivan’s private site

December 16, 2011

Where we are with RDFa 1.1?

There has been a flurry of activity around RDFa 1.1 in the past few months. Although a number of blogs and news items have been published on the changes, all of those have become “officialized” only in the past few days with the publication of the latest drafts, as well as with the publication of RDFa 1.1 Lite. It may be worth looking back at the past few months to get a clearer idea of what happened. I make references to a number of other blogs that were published in the past few months; interested readers should consult those for details.

The latest official drafts for RDFa 1.1 were published in Spring 2011. However, a lot has happened since. First of all, the RDFWA Working Group, working on this specification, has received a significant number of comments. Some of those were rooted in implementations and the difficulties encountered therein; some came from potential authors who asked for further simplifications. Also, the announcement of schema.org had an important effect: indeed, this initiative drew attention to the importance of structured data in Web pages, which also raised further questions on the usability of RDFa for that usage pattern. This came to the fore even more forcefully at the workshop organized by the stakeholders of schema.org in Mountain View. A new task force on the relationships of RDFa and microdata has been set up at W3C; beyond looking at the relationship of these two syntaxes, that task force also raised a number of issues on RDFa 1.1. These issues have been, by and large, accepted and handled by the Working Group (and reflected in the new drafts).

What does this mean for the new drafts? The bottom line: there have been some fundamental changes in RDFa 1.1. For example, profiles, introduced in earlier releases of RDFa 1.1, have been removed due to implementation challenges; however, the management of vocabularies has acquired an optional feature that helps vocabulary authors to “bind” their vocabularies to other vocabularies, without introducing an extra burden on authors (see another blog for more details). Another long-standing issue was whether RDFa should include a syntax for ordered lists; this has been done now (see the same blog for further details).

A more recent important change concerns the usage of @property and @rel. Although the usage of these attributes was never a real problem for RDF-savvy authors (the former is for the creation of literal objects, whereas the latter is for URI references), they have proven to be a major obstacle for ‘lambda’ HTML authors. This issue came up quite forcefully at the schema.org workshop in Mountain View, too. After a long technical discussion in the group, the new version reduces the usage difference between the two significantly. Essentially, if, on the same element, @property is present together with, say, @href or @resource, and neither @rel nor @rev is present, a URI reference is generated as the object of the triple. I.e., when used on, say, a <link> or <a> element, @property behaves exactly like @rel. It turns out that this usage pattern is so widespread that it covers most of the important use cases for authors. The new version of the RDFa 1.1 Primer (as well as the RDFa 1.1 Core, actually) has a number of examples that show these. There are also some other changes related to the behaviour of @typeof in relation to @property; please consult the specification for these.
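To make the rule concrete, here is a minimal Python sketch of the decision as it is described in this post. It is only an illustration: the function name `object_kind` and the dictionary representation of an element's attributes are assumptions made for this example, and the real processing rules in RDFa 1.1 Core have further conditions that are left out here.

```python
def object_kind(attrs):
    """Say which kind of object @property produces on one element.

    attrs is a dict mapping attribute names to values for a single
    element (a hypothetical representation, chosen for this sketch).
    Returns "uri", "literal", or None if @property is absent.
    """
    if "property" not in attrs:
        return None
    # The post names @href and @resource as examples of attributes
    # that carry a URI; the specification lists the exact set.
    has_target = "href" in attrs or "resource" in attrs
    has_rel_or_rev = "rel" in attrs or "rev" in attrs
    if has_target and not has_rel_or_rev:
        # @property now behaves like @rel: the object is a URI reference.
        return "uri"
    # Otherwise @property keeps its traditional role: a literal object.
    return "literal"
```

For instance, `object_kind({"property": "foaf:homepage", "href": "http://example.org/"})` classifies the object as a URI reference, whereas the same @property on an element without @href or @resource yields a literal.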

The publication of RDFa 1.1 Lite was also a very important step. This defines a subset of the RDFa attributes that can serve as a guideline for HTML authors to express simple structured data in HTML without bothering about more complex features. This is the subset of RDFa that schema.org will “accept”, as an alternative to microdata, as a possible syntax for schema.org vocabularies. (There are some examples of what some schema.org examples look like in RDFa 1.1 Lite on a different blog.) In some sense, RDFa 1.1 Lite can be considered the equivalent of microdata, except that it leaves the door open for more complex vocabulary usage, mixing of different vocabularies, etc. (The HTML Task Force will soon publish a more detailed comparison of the different syntaxes.)

So here is, roughly, where we are today. The recent publications by the W3C RDFWA Working Group have, as I said, “officialized” all the changes that were discussed since spring. The group decided not to publish a Last Call Working Draft, because the last few weeks of work in the HTML Task Force may reveal some new requirements; if not, the last round of publications will follow soon.

And what about implementations? Well, my “shadow” implementation of the RDFa distiller (which also includes a separate “validator” service) incorporates all the latest changes. I also added a new feature a few weeks ago, namely the possibility to serialize the output in JSON-LD (although this has become outdated a few days ago, due to some changes in JSON-LD…). I am not sure of the exact status of Gregg Kellogg’s RDF Distiller, but, knowing him, it is either already in line with the latest drafts or it is only a matter of a few days to be so. And there are surely more around that I do not know about.

This last series of publications has provided a nice closure for a busy RDFa year. I guess the only thing left now is to wish everyone a Merry Christmas, a peaceful and happy Hanukkah, or other festivities you honor at this time of the year. In any case, a very happy New Year!

April 18, 2011

Open data from Fukushima

This is just an extended tweet… Masahide Kanzaki has just posted an announcement on the LOD mailing list on releasing some data he collected on the radioactivity levels at different places in Japan, enriched with metadata (e.g., geo data or time). Though the original data were in PDF, the results are integrated in RDF with a SPARQL endpoint. He also added a visualization endpoint that gives a simple visualization of the SPARQL query results:

Visualization results for radioactivity data for Tokyo and Fukushima, using integrated datasets and SPARQL query

Simple but effective, and makes the point on the usage of open data in RDF… Thanks!

November 23, 2010

My first mapping from RDB to RDF using a direct mapping, cont.

A few days ago I posted a blog on how the RDB to RDF direct mapping could be used for a simple example. I do not want to repeat the whole blog: the essence of it was that database tables were mapped onto a simple RDF Graph (this is what the direct mapping does) and the resulting graph was transformed into the “target” graph using the following SPARQL 1.1 construct:

CONSTRUCT {
  ?id a:title ?title ;
    a:year  ?year ;
    a:author _:x .
  _:x a:name ?name ;
    a:homepage ?hp .
}
WHERE {
  SELECT (IRI(fn:concat("http://...",?isbn)) AS ?id)
          ?title ?year ?name
         (IRI(?homepage) AS ?hp)
  {
    ?book a  <Book>;
       <Book#ISBN> ?isbn ;
       <Book#Title> ?title ;
       <Book#Year> ?year ;
       <Book#Author> ?author .
    ?author a  <Author>;
       <Author#Name> ?name ;
       <Author#Homepage> ?homepage .
  }
}

where the trick was to use a nested SELECT whose main job was to create URI references from strings. I realized that if one uses the latest editors’ version of SPARQL 1.1 (i.e., the version that is much closer to what SPARQL 1.1 will be) then the solution is actually simpler, because variable assignment via BIND makes the nested SELECT unnecessary:

CONSTRUCT {
  ?id a:title ?title ;
    a:year  ?year ;
    a:author _:x .
  _:x a:name ?name ;
    a:homepage ?hp .
}
WHERE {
  ?book a  <Book>;
     <Book#ISBN> ?isbn ;
     <Book#Title> ?title ;
     <Book#Year> ?year ;
     <Book#Author> ?author .
  ?author a  <Author>;
     <Author#Name> ?name ;
     <Author#Homepage> ?homepage .
  BIND (IRI(fn:concat("http://...",?isbn)) AS ?id)
  BIND (IRI(?homepage) AS ?hp)
}

which makes, at least in my view, the mapping even clearer.
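For readers who think more easily in a general-purpose language, the same reshaping can be sketched in plain Python. This is only an illustration of what the two BIND clauses and the CONSTRUCT template do, not a replacement for a SPARQL engine. Several things here are assumptions made for the example: triples are plain tuples, IRIs are strings in angle brackets, literals are bare strings, and `http://isbn.example/` is a made-up stand-in for the prefix that the query leaves as `http://...`.

```python
from itertools import count


def to_target(triples, isbn_base):
    """Reshape direct-mapping triples into the target graph.

    Assumes every subject has at most one value per predicate.
    """
    # Index the graph as subject -> {predicate: object}.
    by_subject = {}
    for s, p, o in triples:
        by_subject.setdefault(s, {})[p] = o

    fresh = count()
    out = set()
    for props in by_subject.values():
        if props.get("a") != "<Book>":
            continue
        author = by_subject[props["<Book#Author>"]]
        # BIND (IRI(fn:concat("http://...", ?isbn)) AS ?id)
        book_id = "<" + isbn_base + props["<Book#ISBN>"] + ">"
        # BIND (IRI(?homepage) AS ?hp)
        hp = "<" + author["<Author#Homepage>"] + ">"
        # _:x in a CONSTRUCT template yields a fresh blank node per solution.
        bnode = "_:x%d" % next(fresh)
        out.update([
            (book_id, "a:title", props["<Book#Title>"]),
            (book_id, "a:year", props["<Book#Year>"]),
            (book_id, "a:author", bnode),
            (bnode, "a:name", author["<Author#Name>"]),
            (bnode, "a:homepage", hp),
        ])
    return out


book = "<Book/ID=0006511409X#_>"
person = "<Author/ID=id_xyz#_>"
result = to_target([
    (book, "a", "<Book>"),
    (book, "<Book#ISBN>", "0006511409X"),
    (book, "<Book#Title>", "The Glass Palace"),
    (book, "<Book#Year>", "2000"),
    (book, "<Book#Author>", person),
    (person, "a", "<Author>"),
    (person, "<Author#Name>", "Ghosh, Amitav"),
    (person, "<Author#Homepage>", "http://www.amitavghosh.com"),
], "http://isbn.example/")
```

The two string concatenations are the whole “bridge” from literals to URI-s; everything else is just renaming predicates.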

But SPARQL is not the only way to transform the graph. Another possibility is to use RIF Core. Essentially the same transformation can indeed be expressed using the RIF Presentation syntax. Here it is (with a little help from Sandro Hawke and Harold Boley):

Forall ?book ?title ?author ?isbn ?year ?id (
  ?id[a:year->?year a:title->?title a:author->?author] :-
    And(
      ?book[rdf:type-> <Book>
             a:isbn->?isbn
             a:title->?title
             a:year->?year
             a:author->?author]
      External(pred:iri-string(?id External( func:concat("http://..." ?isbn ) )))
    )
)
Forall ?author ?name ?hp ?homepage (
 ?author[a:name->?name a:homepage->?hp] :-
   And(
        ?author[rdf:type-> <Author>
                a:name->?name
                a:homepage->?homepage]
        External(pred:iri-string(?hp ?homepage))
  )
)

(As I did in the earlier examples, I did not put the prefix declarations and other syntactic details into the code above.)

The only difference between the two is that I retained the URI for the author, because generating a blank node on the fly in RIF Core does not seem to be possible. A better solution would be, probably, to mint a URI from the ?author variable just like I did for the ISBN value. Other than that, the two solutions are pretty much identical…

November 19, 2010

My first mapping from RDB to RDF using a direct mapping

A few weeks ago I wrote a blog on my first RDB to RDF mapping using R2RML; the W3C RDB2RDF Working Group had just published a first public Working Draft for R2RML. That mapping was based on a specific mapping language (i.e., R2RML). R2RML relies on R2RML processing done by, for example, the database system, which interprets the language, uses some SQL constructions, etc. The R2RML processing depends on the specific schema of the database, which guides the mapping.

As I already mentioned in that blog, a “direct” mapping was also in preparation by the Working Group; well, the first public Working Draft of that mapping has just been published. That mapping does not depend on the schema of the database: it defines a general mapping of any relational database structure into RDF; only a base URI has to be specified for the database, everything else is generated automatically. The resulting RDF graph is of course much coarser than the one generated by R2RML; whereas the result of an R2RML mapping may be a graph using well specified vocabularies, for example, this is not the case for the output of the direct mapping. But that is not really a problem: after all, we have SPARQL or RIF to transform graphs! I.e., the two approaches are really complementary.

What I will do in this blog is to show how the very same example as in my previous blog can be handled by a direct mapping. As a reminder: the toy example I use comes from my generic Semantic Web tutorial. Here is the (toy) table:

which is then converted into an RDF Graph:

(Just as in the previous case I will ignore the part of the graph dealing with the publisher, which has the same structure as the author part. I will also ignore the prefix definitions.)

The direct mapping of the first and second tables is pretty straightforward. The URI-s are a bit ugly but, well, this is what you get when you use a generic solution. So here it is:

@base <http://book.example/> .
<Book/ID=0006511409X#_> a <Book> ;
  <Book#ISBN> "0006511409X" ;
  <Book#Title> "The Glass Palace" ;
  <Book#Year>  "2000" ;
  <Book#Author> <Author/ID=id_xyz#_> .

<Author/ID=id_xyz#_> a <Author> ;
  <Author#ID> "id_xyz" ;
  <Author#Name> "Ghosh, Amitav" ;
  <Author#Homepage> "http://www.amitavghosh.com" .

Simple, isn’t it?

The result is fairly close to what we want, but not exactly. First of all, we want to use different vocabulary terms (like a:name). Also, note that the direct mapping produces literal objects most of the time, except when there is a “jump” from one table to another. Finally, the resulting graph should use a blank node for the author, which is not the case in the generated graph.
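The pattern behind the generated triples can be sketched in a few lines of Python. This is a rough illustration modelled on the Author output shown above, not the algorithm of the Working Draft, which covers more cases (composite keys, tables without keys, datatypes, etc.). The function name `direct_map` and the `foreign_keys` argument are hypothetical, introduced for this example only; the URI-s stay relative, to be resolved against the base.

```python
def direct_map(table, key, row, foreign_keys=None):
    """Return direct-mapping-style triples for one table row.

    table        -- table name, e.g. "Author"
    key          -- name of the primary key column
    row          -- dict mapping column names to cell values
    foreign_keys -- dict mapping a column name to a pair
                    (referenced table, referenced key column)
    """
    foreign_keys = foreign_keys or {}
    # The row itself becomes the subject, typed with the table name.
    subject = "<%s/%s=%s#_>" % (table, key, row[key])
    triples = [(subject, "a", "<%s>" % table)]
    for column, value in row.items():
        predicate = "<%s#%s>" % (table, column)
        if column in foreign_keys:
            # A "jump" to another table produces a URI reference...
            ref_table, ref_key = foreign_keys[column]
            obj = "<%s/%s=%s#_>" % (ref_table, ref_key, value)
        else:
            # ...whereas every other cell becomes a literal.
            obj = '"%s"' % value
        triples.append((subject, predicate, obj))
    return triples
```

Calling `direct_map("Author", "ID", {"ID": "id_xyz", "Name": "Ghosh, Amitav", "Homepage": "http://www.amitavghosh.com"})` reproduces the four Author triples listed earlier, which also shows why the homepage comes out as a literal rather than as a URI reference.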

Fortunately, we have tools in the Semantic Web domain to transform RDF graphs. RIF is one possible solution; another is SPARQL, using the CONSTRUCT form. Using SPARQL is an attractive solution because, in practice, the output of the direct mapping may not even be materialized; instead, one would expect a SPARQL engine attached to a particular relational database, mapping the SPARQL queries to the table on the fly. I will use SPARQL 1.1 below because that gives nice facilities to generate RDF URI Resources from strings, i.e., to have “bridges” from literals to URI-s. Here is a possible SPARQL 1.1 query/construct that could be used to achieve what we want:

CONSTRUCT {
  ?id a:title ?title ;
    a:year  ?year ;
    a:author _:x .
  _:x a:name ?name ;
    a:homepage ?hp .
}
WHERE {
  SELECT (IRI(fn:concat("http://...",?isbn)) AS ?id)
          ?title ?year ?name
         (IRI(?homepage) AS ?hp)
  {
    ?book a <Book> ;
      <Book#ISBN> ?isbn ;
      <Book#Title> ?title ;
      <Book#Year>  ?year ;
      <Book#Author> ?author .
    ?author a <Author> ;
      <Author#Name> ?name ;
      <Author#Homepage> ?homepage .
  }
}

Note the usage of a nested query; this is used to create new variables representing the URI references to be used by the outer query. The key is the IRI operator. (Both the nesting and the AS in the SELECT are SPARQL 1.1 features.)

That is it. Of course, the question does arise: which one would one use? The direct mapping or R2RML? Apart from the possible restriction that the local database system may implement the direct mapping only, it also becomes a question of taste. The heavy tool in R2RML is, in fact, the embedded SQL query; if one is comfortable with SQL then that is fine. But if the user is more comfortable with Semantic Web tools (e.g., SPARQL or RIF) then the direct mapping might be handier.

(Note that these are evolving documents still. I already know that my previous blog is wrong in the sense that it is not in line with the next version of R2RML. Oh well…)

November 2, 2010

My first mapping from RDB to RDF using R2RML

The W3C RDB2RDF Working Group has just published a first public Working Draft for the standardized RDB->RDF mapping language called R2RML. I decided that the only way to understand a specification like that is to try to use it for an example. Caveat: this is a “First Public Working Draft” for R2RML, so many things still have to happen and there will be changes.

For several years now I have been using a simple example in my generic Semantic Web tutorial (see, e.g., the one at SemTech). It is an artificial example referring to an imaginary bookshop’s table:

which is then converted into an RDF Graph:

(And the tutorial story is how this graph can be merged with a graph coming from another bookshop’s data.) Up until now I always glossed over how this mapping is done. Well, so how could that be done with R2RML?

R2RML defines mappings that describe how an RDB table is mapped on triples. (R2RML is in itself in RDF, b.t.w.) Simply put, in R2RML, each row of a table is mapped to an RDF subject; the individual cells, with the column names, provide the object and the predicates, respectively.

If we look at the middle table in the example, it corresponds to the lower right hand part of the graph. The R2RML mapping has to specify that the homepage column should actually produce an RDF resource (i.e., a URI reference) and not a literal string. Furthermore, the first column should become a blank node; that has to be specified, too. Here is the way this is all specified:

:Table2 rdf:type rr:TriplesMap ;
    rr:logicalTable "Select ('_:' || ID) AS pid, Name, ('<' || Homepage || '>') AS Home from person_table";
    rr:subjectMap [ a rr:BlankNodeMap ; rr:column "pid" ; ] ;
    rr:propertyObjectMap [ rr:property a:name; rr:column "Name" ] ;
    rr:propertyObjectMap [ a rr:IRIMap ; rr:property a:homepage; rr:column "Home" ] .

What happens here is:

  1. a mapping is defined that turns the original table into a virtual, “logical” table using SQL. The goal here is to generate a blank node ID on the fly, and a URI in NTriple syntax (note, however, that I am not sure it is o.k. to use that approach in the spec!);
  2. the subject for the triples is chosen to be a cell in a specific column (“pid”, generated by the SQL transform of the previous point), and it is also specified that this is a blank node;
  3. the other two properties are specified (for the same subject); the one for the home page also specifies that the object must be a URI resource (as opposed to a Literal).
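The three steps can be mimicked in Python to see what an engine would conceptually do for one row. This is only a sketch of the idea, not of any actual R2RML processor; the function name `map_person_row` is made up for the example, and the column names (`ID`, `Name`, `Homepage`) are the ones used in the SQL statement above.

```python
def map_person_row(row):
    """Turn one person_table row into triples, following steps 1-3."""
    # Step 1: the "logical table" derives a blank node label and a URI
    # in angle brackets from the raw cells, as the SQL statement does.
    logical = {
        "pid": "_:" + row["ID"],
        "Name": row["Name"],
        "Home": "<" + row["Homepage"] + ">",
    }
    # Step 2: the subject comes from the pid column, as a blank node.
    subject = logical["pid"]
    # Step 3: one literal-valued and one URI-valued property.
    return [
        (subject, "a:name", '"%s"' % logical["Name"]),
        (subject, "a:homepage", logical["Home"]),
    ]
```

The only real “work” is in step 1, which is exactly the part that R2RML delegates to SQL.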

That is it. The mapping of the bottom table to the lower left hand corner of the graph is quite similar; I will not go into this here.

But we still need the “root”, so to say, i.e., the node in the upper right hand corner, the top portion of the graph (with the title and the year) and, mainly, we also have to relate the root to the portion of the graph that is generated from the middle table.

First, the following R2RML part does the job of generating the top part of the graph:

:Table1 rdf:type rr:TriplesMap ;
    rr:logicalTable "Select ('<http:..isbn/' || ISBN || '>') AS isbn,
                     Author, Title, Publisher, Year from book_table";
    rr:subjectMap [ rdf:type rr:IRIMap ; rr:column "isbn" ] ;
    rr:propertyObjectMap [ rr:property a:title ; rr:column "Title" ; ] ;
    rr:propertyObjectMap [ rr:property a:year ; rr:column "Year" ; ] ;

The only role of the mapping to a logical table is to generate a URI from the ISBN; all the other cells are, conceptually, simply copied on the logical table. The rest is fairly straightforward.

The missing trick is to combine, i.e., to “join”, the two tables on the graph. R2RML has a separate construction for that, referred to as “mapping” the foreign keys. The following additional statements should be added to :Table1:

    rr:foreignKeyMap [
       rr:key a:author ;
       rr:parentTriplesMap :Table2 ; rr:joinCondition "{child}.Author = {parent}.pid"
    ] .

Which combines the nodes defined by :Table1 with those of :Table2. And voilà! We’re done: the R2RML document is ready, i.e., an R2RML engine would turn my example table into my example graph.

Of course, there are more complicated possibilities. Triples, or whole rows, can be explicitly stored in a specific named graph, for example. Or a column defining a predicate could, actually, use a cell in another column as an object. Etc. And, to be honest, I am not even 100% sure that the above is correct; I may have misunderstood some details. But the “melody” is still clear.

Note the role the SQL based mapping of the original table to the logical table has. For SQL experts, most of the work can be done there, i.e., the resulting RDF graph can be ready for further usage by an application, to be linked into the LOD, to be used with the right attributes, namespaces, etc. Which is very powerful indeed, provided… the user has the necessary SQL expertise. And, while that is obviously true for database managers, it is not necessarily true for RDF experts. For those, a slightly different model seems to be more appropriate: they would prefer to get an RDF graph ASAP, so to say, without any fancy transformation, and would then use RIF, SWSRL, SPARQL’s CONSTRUCT, etc., to turn it into the RDF graph they eventually want to have. In other words, they may not need the concept of a logical table. That is what is referred to by the group as the “default” mapping. I.e., what graph does one get if nothing is specified? If that is properly defined then, say, RIF experts can use their expertise instead of SQL. This default mapping is not yet fully specified by the group, but it is on its way; it will be published shortly, and will complete the R2RML picture. So watch that space…

June 29, 2010

SemTech2010 & co.

I am on my way home from a long trip in the US (writing these lines on the plane, to be posted from home). A few days in Seattle, SemTech 2010 in San Francisco, and finally the “RDF Next Steps” workshop in Palo Alto (i.e., Stanford). I do not want to write about the last one now, simply because we hope to have a more extended public report available within 10-15 days. I.e., more about that later.

Seattle consisted of a number of company visits, but it also included a talk at the SemWeb Meetup in Seattle. I gave a presentation on what happened at W3C over the last year which, I think, was well received. (Although one is never sure about these things.) I had a bunch of discussions and chats after the presentation; it was pleasant, relaxing… I and, mainly, my colleague from W3C, Eric Prud’hommeaux, also had a long discussion with two developers from Microsoft who are involved in the oData work; that was really interesting because we reached the conclusion of possibly outlining together a plan whereby we could write down how to “export” oData into RDF, and publish that, e.g., as a W3C Note (note that there are already systems doing something like that out there, but I am not knowledgeable enough to judge how complete those solutions are). I think it would be good for the community if this happens. It is important for a general Web of Data to include, well, all the data on the Web…

Semtech… it was big. Bigger than last year (I heard and read a figure of a 30% increase in attendance). This industry is lively indeed! The only problem is that it was almost too big; it was the conference of eternal frustration:-( Indeed, there were so many things in parallel that one always had the feeling of having missed something because another, parallel session may have been more interesting! I heard presentations from Facebook, from Google, saw stunning visualizations of RDF graphs, or heard about plans on ontology hosting and management. There was a report on the US and UK governmental data work (this stuff still amazes me, though it is not the first time I hear about it), there was a presentation of BestBuy (alas! I missed that one). There was a separate track on the publication world as a separate “vertical” area (and we also had some great discussions with the people from the New York Times with whom we outlined a possible first step in gathering that community). Lots of hallway conversations with companies and institutions and, of course, the social life, chatting with David, and Ian, and the other Ian, and Eric, and the other David, and Christine, and Jeremy, and Jim, and Fabien, and Sandro, and Jenni, and… I should stop and not even try to list everybody because it is simply impossible! I also gave an introductory Semantic Web Tutorial (quite a lot of people in the audience, and I think it went well), we had a panel on the W3C RDB2RDF work and another one on SPARQL 1.1. As a nice little touch, I could announce the publication of the W3C RIF Recommendation as a primeur during the tutorial as I was talking about RIF (the publication itself happened while I was talking…)

There were, as every year, some “buzz” topics. My impression is that the linked open governmental data effort was a buzz and was still new information for many. Facebook’s keynote on the Open Graph Protocol created another buzz. More generally, RDFa was definitely a buzz (big time!). I.e., as I said, this industry is lively and continues to be exciting.

But there are of course challenges. The way I see it, the biggest challenge is not technical. Yes, of course, there are technical issues, but those will be solved, eventually. The issue is outreach, to get to those new communities who may understand the value of a Web of Data in general but do not have enough guidance on how to start doing something. How to publish the data, how to link it to other data, how to consume it, use it, mash it up… How to talk to “C-level” people, how to reach out to them. There are books, of course, but not enough; there are tutorials and guides, of course, but not enough; there are experts around but definitely not enough. As one of our discussion partners put it: if I go to any better bookshop, there are rows of books on, say, XML (good or bad, but they are there). But books on RDF, on Linked Data, on SPARQL, on SKOS, on OWL: only a few here and there (comparatively, that is), and some of them are actually quite old. Let alone the problem of trying to hire experts that could do the job. I really feel that this is the biggest challenge our community faces. I say “community” and not only a single organization like W3C or another; the challenge is too great to be solved by one group only. We have been fighting with this issue for a while now, but it is still a challenge… And a challenge for us all who care about that stuff!

It was a good week!

May 12, 2010

RIF (Core) and LOD

W3C has just published a Proposed Recommendation for the Rule Interchange Format (RIF); this means, in the W3C jargon, that the technical work is done, and the W3C asks its members for a seal of approval to publish it as a Recommendation.

Somehow the RIF development was not on the radar screen of the Semantic Web community. There may be many reasons for that, and I think we should just accept this as part of history. The future is much more important; we should indeed realize that RIF is an important piece of the Semantic Web technical architecture and let us do our best to get it embraced widely.

RIF Core is the simplest variant of RIF. It is not very complicated. It is a simple rule language; one can define a series of Horn rules, there are some safety features built in so that the rules can be executed, conceptually, by a forward chaining engine, it has the familiar XSD datatypes with the usual operations, it operates on URI-s, and it has a notion analogous to RDF blank nodes. There is a separate document that describes how RIF (Core) rules operate with RDF data and how the various semantics (RIF, RDF(S), OWL) work together. The details are not really important here; suffice it to say that it, essentially, works as a layperson would expect… The RIF syntax is a little bit convoluted for the moment, but there may be work coming up to improve that in the form of alternative, more readable syntaxes.

So what can it be used for? At the W3C LOD Camp in Raleigh (held as part of the WWW2010 conference), Sandro Hawke already gave a simple example why RIF should be interesting for LOD applications. Let me add a few further examples that might be of interest.

Remember OWL-RL? The OWL Working Group has defined a subset of OWL that can be handled by rules. The rules themselves were also published by the OWL WG, albeit using an abstract notation. Those rules can be described in RIF Core as well; the RIF group has published this mapping in a separate document. Following those rules a RIF Core engine can handle OWL-RL directly.

Why is that interesting, you might ask? Well, there have been quite a few discussions, when defining OWL RL, on whether the features included in OWL RL represent the right set for users. Some claimed that there are other OWL features that could be added; others said that, on the contrary, the complexity of OWL RL is already too high and the features should be reduced to make them more palatable to users. In some ways, the usage of RIF Core may make this discussion moot. Indeed, users, or user communities, can define the rules they are interested in, in RIF, by cherry-picking the rules described by the RIF WG in the document cited above. They can send those rules to their RIF Core reasoner alongside their data, and get what they want. If that rule set consists only of 2-3 OWL rules, because that is all the application cares about, then all the better: the RIF inference engine will just do its job faster. If the user wants to add OWL features that are not in OWL RL, that may also be doable; the OWL 2 RDF-Based semantics specification is such that, in many cases, the extra rules can be extracted fairly easily from the OWL 2 Full semantics, using the patterns in the RIF/OWL RL document (although I have to emphasize that this does not work in all cases!). Note that this model of “sending” the RIF rule set alongside the RDF data to a reasoner is exactly the way RIF reasoning is being defined for SPARQL 1.1 in the separate Entailment Regimes document (still in draft). Note also that I referred to OWL RL here, but the same approach could be used with RDFS with, obviously, a smaller RIF rule set.

Another, albeit related, application of RIF came to my mind reading an email discussion on whether inferences should be materialized for large LOD datasets or not and, if yes, which ones. As an answer to Vasiliy Faronov’s question, Leigh Dodds also proposed a text to be added to his Linked Data Patterns book. The resulting discussion thread was really about which inferences should be materialized. Materializing them all may not be realistic; but if only a selection of the possible inferences is used (e.g., a subset of RDFS or OWL), how would consumers of the data know? It looks like RIF may come to the rescue. Publishers could simply publish the rules they use for materializing their inferences in RIF. (Again, this is not always possible; RIF cannot cover the whole of OWL. But it does cover a very large percentage of the use cases.) Consumers may actually choose whether they want to download all the triples, including the inferred triples, or whether they download data from the “core” dataset only, together with the RIF file, and materialize the inferences locally using a local RIF engine (or use the RIF file with a RIF entailment-aware SPARQL 1.1 engine).

RIF is and should be considered as an integral and essential part of the Semantic Web technology landscape. Let us hope many implementations of, at least, RIF Core will bloom to make this a reality! (There is a public list of existing implementations so far.)

December 12, 2009

RDFa usage spreading…

Filed under: Semantic Web,Work Related — Ivan Herman @ 14:53

It may be that I was not attentive enough, i.e., some of these may be old(er) news. But I did hit two interesting pieces of RDFa-related news yesterday and today (both via twitter, b.t.w.) that I think are really noteworthy.

1. A blog from Priyank Mohan “Online retail : How is using Semantic Technology to define a new trend” reported about a talk given by Jay Myers from Best Buy. Best Buy started using RDFa a while ago already, but they recently added statements using the GoodRelations Ontology that Martin Hepp published. What Jay said (quoting from Priyank’s blog here):

  1. GoodRelations + RDFa improved the rank of the respective pages in Google tremendously… In fact, if you try the query “BestBuy Ferris Bueller” on Google, then the page comes on rank # 1 ahead of the much more established page . This indicates a strong effect of GoodRelations + RDFa on Google’s appreciation of a page.…
  2. Jay also reported a 30 % percent (!) increase in traffic on the BestBuy stores pages
  3. Yahoo observes a 15% increase in the Click-through-Rate (CTR). Nick Cox from Yahoo also recently reported that augmented search results, e.g. those with GoodRelations / RDFa in Yahoo get a 15 % higher Click-through-Rate (CTR).

There have been some discussions on twitter on whether those numbers (e.g., 30%) are really reliable, and maybe these statements are indeed too good to be fully true. But even if the 30% is only 15%, it is still quite an achievement!

2. This morning I found out that O’Reilly has begun to systematically add RDFa to their catalog pages. E.g., the page on the “Switching to the Mac” book can produce the RDF information using the RDFa distiller. Note that the code uses well established vocabularies: Core FRBR, GoodRelations, FOAF, Dublin Core… i.e., using this data with other mashup sites becomes much easier!

Great news. And, by the way, it is worth noting that both also relate to Martin’s GoodRelations Ontology. That stuff is really coming to the fore, too…

September 29, 2009

OWL 2 RL closure

OWL 2 has just been published as a Proposed Recommendation (yay!) which means, in layman’s terms, that the technical work is done, and it is up to the membership of W3C to accept it as a full blown Recommendation.

As I already blogged before, I did some implementation work on a specific piece of OWL 2, namely the OWL 2 RL Profile. (I have also blogged about OWL 2 RL and its importance before, nothing to repeat here.) The implementation itself is not really optimized, and it would probably not stand a chance for any large scale deployment (the reader may want to look at the OWL 2 implementation report for other alternatives). But I can hope that the resulting service can be useful in getting a feel for what OWL 2 RL can give you: by just adding a few triples into the text box you can see what OWL 2 RL means. This is, by the way, an implementation of the OWL 2 RL rule set, which means that it can also accept triples that are not mandated by the Direct Semantics of OWL 2 (a.k.a. OWL 2 DL). Put another way, it is an implementation of a small portion of OWL 2 Full.

The core of my implementation turned out to be really straightforward: a forward chaining structure directly encoded in Python. I use RDFLib to handle the RDF triples and the triple store. Each triple in the RDF Graph is considered and compared to the premises of the rules; if there is a match then new triples are added to the Graph. (Well, most of the rules contain several triples to match with, and the usual approach is to pick one and explore the Graph deeper to check for additional matches. Which one to pick is important, though, because it may affect the overall speed.) If, through such a cycle, no additional triples are added to the Graph then we are done: the “deductive closure” of the Graph has been calculated. The rules of OWL 2 RL have been carefully chosen so that no new resources are added to the Graph (only new triples), ie, this process eventually stops.
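To make the cycle described above a bit more concrete, here is a minimal sketch of such a forward chaining loop. It is not the code of the actual service: plain Python tuples stand in for the RDFLib Graph, only one OWL 2 RL rule (cax-sco) is encoded, and the names `rule_cax_sco` and `closure` are made up for this illustration.

```python
# Plain (subject, predicate, object) tuples stand in for an RDFLib graph;
# all function names here are illustrative assumptions.
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

def rule_cax_sco(triple, graph):
    """cax-sco: (?x type ?c1), (?c1 subClassOf ?c2) => (?x type ?c2)."""
    s, p, o = triple
    if p == RDF_TYPE:
        # The triple matches one premise; search the graph for the other one.
        for c1, p2, c2 in graph:
            if p2 == SUBCLASS and c1 == o:
                yield (s, RDF_TYPE, c2)

def closure(graph, rules):
    """Apply the rules cycle after cycle until no new triple appears."""
    graph = set(graph)
    while True:
        new = set()
        for triple in graph:
            for rule in rules:
                for derived in rule(triple, graph):
                    if derived not in graph:
                        new.add(derived)
        if not new:
            # A full cycle added nothing: the deductive closure is reached.
            return graph
        graph |= new
```

Calling `closure(triples, [rule_cax_sco])` on a graph with a chain of two subclass statements needs two productive cycles and a third, empty one to detect the end. Since the rules only combine terms already present in the graph, the set of possible triples is finite and the loop must stop.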

The rules themselves are usually simple. Although it is possible and probably more efficient to encode the whole process using some sort of a rule engine (I know of implementations based on, eg, Jena’s rules or Jess), one can simply encode the rules using the usual conditional constructs of the programming language. The number of rules is relatively high but nothing that a good screen editor would not manage with copy-paste. There were only a few rules that required somewhat more careful coding (usually to take care of lists) or many searches through the graph like, for example, the rule for property chains (see rule prp-spo2 in the rule set). It is also important to note that the higher number of rules does not really affect the efficiency of the final system; if no triple matches a rule then, well, it just does not fire. There is no side effect from the mere existence of an unused rule.

So is it all easy and rosy? Not quite. First of all, this implementation is of course simplistic in so far as it generates all possible deduced triples, which include a number of trivial triples (like ?x owl:sameAs ?x for all possible resources). That means that the resulting graph becomes fairly big even if the (optional) axiomatic triples are not added. If the OWL 2 RL process is bound to a query engine (eg, the new version of SPARQL will, hopefully, give a precise specification of what it means to have OWL 2 RL reasoning on the data set prior to a SPARQL query) then many of these trivial triples could be generated at query time only, thereby avoiding an extra load on the database. Well, that is one place where a proof-of-concept and simple implementation like mine loses against a more professional one:-)

The second issue was the contrast between RDF triples and “generalized” RDF triples, ie, triples where literals can appear in subject positions and bnodes can appear as properties. OWL 2 explicitly says that it works with generalized triples and the OWL 2 RL rule set also shows why that is necessary. Indeed, consider the following set of triples:

ex:X rdfs:subClassOf [
  a owl:Restriction;
  owl:onProperty [ owl:inverseOf ex:p ];
  owl:allValuesFrom ex:A
].

This is a fairly standard “idiom” even for simple ontologies; one wants to restrict, so to say, the subjects instead of the objects using an OWL property restriction. In other words, that restriction combined with

ex:x rdf:type ex:X .
ex:y ex:p ex:x .

should yield

ex:y rdf:type ex:A .

Well, this deduction would not occur through the rule set if non-generalized RDF triples were used. Indeed, the inverse of ex:p is a blank node, ie, using it in a triple is not legal; but using that blank node to denote a property is necessary for the full chain of deductions. In other words, to get that deduction to work properly using RDF and rules, the author of the vocabulary would have to give an explicit URI to the inverse of ex:p. Possible, but slightly unnatural. If generalized triples are used, then the OWL 2 RL rules yield the proper result.
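The chain of deductions can be replayed with a small sketch. Again, this is only an illustration under simplifying assumptions, not the real implementation: triples are plain tuples, blank nodes are strings starting with `_:`, only the three OWL 2 RL rules needed here (cax-sco, prp-inv2, cls-avf) are encoded, and the `generalized` switch is a hypothetical flag that throws away derived triples with a bnode as property.

```python
def is_bnode(term):
    return term.startswith("_:")

def one_pass(graph):
    """Apply cax-sco, prp-inv2 and cls-avf once; return the derived triples."""
    out = set()
    for s, p, o in graph:
        if p == "rdf:type":
            # cax-sco: (s type C1), (C1 subClassOf C2) => (s type C2)
            for c1, q, c2 in graph:
                if q == "rdfs:subClassOf" and c1 == o:
                    out.add((s, "rdf:type", c2))
            # cls-avf: (s type R), (R allValuesFrom Y), (R onProperty P),
            #          (s P v) => (v type Y)
            ys = [y for r, q, y in graph if r == o and q == "owl:allValuesFrom"]
            ps = [pp for r, q, pp in graph if r == o and q == "owl:onProperty"]
            for y in ys:
                for prop in ps:
                    for u, q, v in graph:
                        if u == s and q == prop:
                            out.add((v, "rdf:type", y))
        if p == "owl:inverseOf":
            # prp-inv2: (s inverseOf o), (x o y) => (y s x)
            for x, q, y in graph:
                if q == o:
                    out.add((y, s, x))
    return out

def closure(graph, generalized=True):
    graph = set(graph)
    while True:
        new = one_pass(graph) - graph
        if not generalized:
            # Standard RDF: a blank node may not be used as a property.
            new = {t for t in new if not is_bnode(t[1])}
        if not new:
            return graph
        graph |= new

# The example of the post, with _:r for the restriction and _:inv for
# the anonymous inverse of ex:p.
DATA = {
    ("ex:X", "rdfs:subClassOf", "_:r"),
    ("_:r", "rdf:type", "owl:Restriction"),
    ("_:r", "owl:onProperty", "_:inv"),
    ("_:inv", "owl:inverseOf", "ex:p"),
    ("_:r", "owl:allValuesFrom", "ex:A"),
    ("ex:x", "rdf:type", "ex:X"),
    ("ex:y", "ex:p", "ex:x"),
}
```

With `closure(DATA)` the intermediate triple `ex:x _:inv ex:y` is derived and kept, and cls-avf can then produce `ex:y rdf:type ex:A`. With `closure(DATA, generalized=False)` that intermediate triple is dropped and the final deduction never happens.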

It turns out that, in my case, having bnodes as properties was not really an issue, because RDFLib could handle that directly (is that a bug in RDFLib?). But similar, though slightly more complex or even pathological, examples can be constructed involving literals in subject positions, and that was a problem because RDFLib refused to handle those triples. What I had to do was to replace each literal in the graph with a new bnode, perform all the deductions using those, and swap the bnodes “back” for their original literals at the end. (This mechanism is not my invention; it is actually described by the RDF Semantics document, in the section on Datatype entailment rules.) B.t.w., the triples returned by the system are all “legal” triples; generalized triples play a role during the deduction only (and illegal triples are filtered out at output).
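The literal-to-bnode trick may be easier to follow in code. The following is a rough sketch only: the `Literal` class, the `_:litN` naming of the surrogate bnodes, and the two helper functions are all invented for this example and do not reflect RDFLib’s API or the internals of my package.

```python
from dataclasses import dataclass
from itertools import count

@dataclass(frozen=True)
class Literal:
    """Minimal stand-in for a typed literal (not RDFLib's Literal class)."""
    lexical: str
    datatype: str

def hide_literals(graph):
    """Replace every literal by a surrogate bnode; return graph and mapping."""
    ids = count()
    to_bnode = {}

    def swap(term):
        if isinstance(term, Literal):
            if term not in to_bnode:
                # Assumes no existing bnode is already called _:litN.
                to_bnode[term] = f"_:lit{next(ids)}"
            return to_bnode[term]
        return term

    hidden = {tuple(swap(t) for t in triple) for triple in graph}
    return hidden, {bnode: lit for lit, bnode in to_bnode.items()}

def restore_literals(graph, to_literal):
    """Swap the surrogates back and keep only legal RDF triples."""
    restored = set()
    for triple in graph:
        s, p, o = (to_literal.get(t, t) for t in triple)
        # Legal means: no literal as subject, and a URI (not a literal,
        # not a bnode) as property.
        if isinstance(s, Literal):
            continue
        if isinstance(p, Literal) or p.startswith("_:"):
            continue
        restored.add((s, p, o))
    return restored
```

The deduction step would run between the two calls, on a graph that contains only URIs and bnodes. A derived triple like `_:lit0 owl:sameAs _:lit0` is harmless during the deduction and simply disappears at output, because restoring it would put a literal in subject position.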

Literals with datatypes were also a source of problems. This is probably where I spent most of my implementation time (I must thank Michael Schneider who, while developing the test cases for OWL 2 RDF Based Semantics, was constantly pushing me to handle those damn datatypes properly…). Indeed, the underlying RDFLib system is fairly lax on checking the typed literals against their definition by the XSD specification (eg, issues like minimum or maximum values were not checked…). As a consequence, I had to re-implement the lexical to value conversion for all datatypes. Once I found out how to do that (I had to dive a bit into the internals of RDFLib but, luckily, Python is an interpreted language…) it became relatively straightforward, repetitive, and slightly time consuming work. Actually, using bnodes instead of “real” literals made it easier to implement datatype subsumptions, too (eg, the fact that, say, an xsd:byte is also an xsd:integer). This became important so that the rules would work properly on property restrictions involving datatypes.
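To give an idea of what such a lexical to value conversion involves, here is a small sketch restricted to a handful of XSD integer types. It is not the code that sits in the package: the table, `to_value`, and `supertypes` are names made up for this example, and only the value ranges and the derivation chain (byte, short, int, long, integer) come from the XSD specification.

```python
import re

# datatype: (lower bound, upper bound, parent type); None means "none".
INTEGER_TYPES = {
    "xsd:integer": (None, None, None),
    "xsd:long": (-2**63, 2**63 - 1, "xsd:integer"),
    "xsd:int": (-2**31, 2**31 - 1, "xsd:long"),
    "xsd:short": (-2**15, 2**15 - 1, "xsd:int"),
    "xsd:byte": (-2**7, 2**7 - 1, "xsd:short"),
}

# Lexical space of the integer types: optional sign, then ASCII digits.
_INT_LEXICAL = re.compile(r"[+-]?[0-9]+")

def to_value(lexical, datatype):
    """Convert a lexical form to a Python int, or raise ValueError."""
    low, high, _ = INTEGER_TYPES[datatype]
    if not _INT_LEXICAL.fullmatch(lexical):
        raise ValueError(f"{lexical!r} is not a valid {datatype} lexical form")
    value = int(lexical)
    if (low is not None and value < low) or (high is not None and value > high):
        raise ValueError(f"{lexical!r} is out of range for {datatype}")
    return value

def supertypes(datatype):
    """List the datatypes that subsume the given one, nearest first."""
    chain = []
    parent = INTEGER_TYPES[datatype][2]
    while parent is not None:
        chain.append(parent)
        parent = INTEGER_TYPES[parent][2]
    return chain
```

With the surrogate bnode approach, subsumption then boils down to adding one `rdf:type` triple per entry of `supertypes(...)` to the bnode that stands for the literal.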

Bottom line: even for a simple implementation, literals, mainly literals with datatypes, are the biggest headache. The rest is really easy. (This is hardly the discovery of the year, but is nevertheless good to remember…)

I was, actually, carried away a bit once I got a hold on how to handle datatypes, so I also implemented a small “extension” to OWL 2 RL by adding datatype restrictions (one of the really nice new features of OWL 2 but which is not mandated for OWL 2 RL). Imagine you have the following vocabulary item:

ex:RE a owl:Restriction ;
    owl:onProperty ex:p ;
    owl:someValuesFrom [
      a rdfs:Datatype ;
      owl:onDatatype xsd:integer ;
      owl:withRestrictions (
          [ xsd:minInclusive "1"^^xsd:integer ]
          [ xsd:maxInclusive "6"^^xsd:integer ]
      )
   ] .

which defines a restriction on the property ex:p so that some of its values should be integers in the [1,6] interval. This means that

ex:q ex:p "2"^^xsd:integer.

yields

ex:q rdf:type ex:RE .

And this could be done by a slight extension of OWL 2 RL; no new rules, just adding the datatype restrictions to the datatypes. Nifty…
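In code, the check behind this extension is tiny. The sketch below is a simplification and not the way the package is structured: Python ints stand in for xsd:integer literals, the facets are given as a plain list of pairs rather than read from the graph, and the test on the facets is folded together with the someValuesFrom step into one hypothetical function.

```python
# Facet name -> test of a value against the facet's bound.
FACET_CHECKS = {
    "xsd:minInclusive": lambda value, bound: value >= bound,
    "xsd:maxInclusive": lambda value, bound: value <= bound,
    "xsd:minExclusive": lambda value, bound: value > bound,
    "xsd:maxExclusive": lambda value, bound: value < bound,
}

def satisfies(value, facets):
    """True if the value passes every (facet, bound) pair."""
    return all(FACET_CHECKS[facet](value, bound) for facet, bound in facets)

def some_values_from(graph, restriction, prop, facets):
    """Derive (?s rdf:type restriction) for each (?s prop value) whose
    integer value lies in the restricted datatype."""
    return {
        (s, "rdf:type", restriction)
        for s, p, o in graph
        if p == prop and isinstance(o, int) and satisfies(o, facets)
    }

# The [1,6] interval of the example above.
ONE_TO_SIX = [("xsd:minInclusive", 1), ("xsd:maxInclusive", 6)]
```

For the graph containing `("ex:q", "ex:p", 2)`, calling `some_values_from(graph, "ex:RE", "ex:p", ONE_TO_SIX)` gives back the `ex:q rdf:type ex:RE` triple, while a value of 7 would yield nothing.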

That is it. I had fun, and maybe it will be useful to others. The package can also be downloaded and used with RDFLib, by the way…

July 4, 2009

Dagstuhl Workshop on Semantic Web

I have just come back from the Workshop “Semantic Web: Reflections and Future Directions”, held in Dagstuhl, Germany. Organized by John Domingue, Rudi Studer, Jim Hendler, and Dieter Fensel, the workshop positioned itself as the “second release” of a similar workshop that was held at the same place 10 years ago.

The first two days of the workshop were more traditional, in the sense that it was a series of presentations and panels. This was the “reflection” part of the workshop: looking back at 10 years of history as well as a peek into the current state of the art. It was interesting but, for my taste, a bit too long; the programme of the two days could have been compressed into one or, say, one and a half days. That would have given more time to the “future directions” part, ie, discussions in break out groups on various topics. I enjoyed those a lot: free flowing discussions on various topics, helping to exchange ideas, experiences, pointers at other works and results, and crystallizing possible future R&D issues. These discussions took place in a very pleasant, relaxed atmosphere among people who mostly knew one another already, ie, we could really concentrate on issues. Each group formulated a number of research goals for the years to come; some groups also came up with more practical steps and goals.

As far as I know, the workshop organizers plan to collect all those research issues in some more coherent form, so we should watch this space. In what follows I just collect some issues that I took away from the workshop without the goal of being exhaustive; indeed, there were 6-7 parallel break out groups.

Issues around Web scale. This is clearly one of the major topics of the day. What happens when one has to deal with data containing billions of triples, when the data (ie, the triples) are “dirty”, ie, inconsistent, faulty, etc? Think of the Linked Open Data cloud, of data coming from sensor networks, mobiles, etc. Do we have to re-think all the notions that the Semantic Web inherited from the logic world, ie, completeness, meaning and consequences of consistency, what it means to get results for a query, etc? This is one area where opinions tend to diverge a lot. Some would prefer to completely put aside the traditional logic approaches (rules, description logic, ontologies, OWL, etc), while others may argue that the advances in computing, in reasoning engines and methods are (and are expected to be) such that these methods should still be just as usable as before. As always, I hate any black-and-white statements… I do not think dismissing an area of technology is the right way but, also, other avenues or new viewpoints should be explored, too (e.g., how to react to inconsistencies, trying to get possibly incomplete results but whatever can be obtained within, say, 2 minutes, that sort of thing). Which approach would be used is very much dependent on the application. Anyway… Web scale is a major issue, everybody agrees on that!

Interaction. This is one of the break out groups that I did not attend, unfortunately. And obviously a hugely important direction of future R&D. Many Semantic Web applications today are such that their user interface is just standard because all Semantic Web related work happens behind the scenes, usually on the server side. However, in the long term, there is a clear need for programs that could somehow directly show the data in some friendly way, programs that adapt themselves to the nature of the data. Not only for experts, but also for laypeople. Such environments may not only include extensions of current browsers but, eg, full desktop environments. Sort of intelligent, data-oriented user interfaces. A major research problem (user interface methodology is always a major problem, whether related to Semantic Web or not…), but also a hugely exciting research and development opportunity!

Vocabularies. There was a separate group on the management of vocabularies, which has identified a number of R&D issues: how does one describe a vocabulary, its interdependence with other vocabularies, how does one rank vocabularies… These are all fundamental questions to solve to be able to find vocabularies for a specific purpose, to do specialized searches. There are also issues around archiving, providing stable URI-s; last but not least (and this goes way beyond vocabularies only) major legal issues on what type of attribution, copyright or other legal machinery are to be used with vocabularies (it was good to have Tom Heath, who could tell us a bit about the datacommons’ approach). As an example of the many technological problems arising, the break-out groups coined the term “cherry picking of terms”. Although OWL has a mechanism for import, the practice of the RDF world is to use (ie, “cherry pick”) vocabulary terms (predicates, classes, etc) from various different vocabularies without necessarily taking the whole vocabulary, and certainly without using the owl:imports predicate (think of routine usage of dc:title without importing the full Dublin Core vocabulary). How would a reasoner treat those? It may be a little bit easier to use a more rule based approach (like OWL RL) although it is not obvious how to cherry pick just the right amount of information on a, say, predicate. But Ian Horrocks also drew my attention to formal ontology modularization work that might be very relevant here; item added to my “to-be-read” list…

Provenance (and trust). One of the issues that popped up in all other break out groups; in consequence a separate one was formed on the second day of discussions. It is indeed one of the questions that anyone who talks about Semantic Web gets; in my personal view, having a clear “story” to tell about provenance is essential for the further deployment of this technology. The discussion in the group was really interesting because this issue raises a number of other questions, like the overall relationship of cryptographic techniques and the Semantic Web, what it means to have trust in context, what are the relationships to temporal or uncertainty reasoning, etc, etc, etc. It was also interesting for me to hear about other works, like the Open Provenance Model, albeit some of these were not necessarily done by Semantic Web people (eg, by the database community). We agreed that a Wiki page will be created (probably at RPI, set up by Deb McGuinness) to collect information on this subject, and forming a W3C Incubator Group might also be in the cards to provide a more thorough state-of-the-art. A long list of additional items to my “to-be-read” pile is coming…

And, of course, it was also good to meet a bunch of people, discuss things at lunch or dinner. This type of interaction is really fruitful. And there was also intensive twittering going on (using the #swdag2009 tag, pointing to a bunch of other resources) although this time I did not twitter too much because I had problems with my wireless card:-(

It was a good meeting; thanks to the organizers. Would be good not to wait another 10 years for the next incarnation of this event…
