Ivan’s private site

March 1, 2013

RDFa 1.1, microdata, and turtle-in-HTML now in the core distribution of RDFLib

This has been in the works for a while, but it is done now: the latest version (3.4.0) of the Python RDFLib library has just been released, and it includes RDFa 1.1, microdata, and turtle-in-HTML parsers. In other words, the user can add structured data to an HTML file, and that will be parsed into RDF and added to an RDFLib Graph structure. This is a significant step, and thanks are due to Gunnar Aastrand Grimnes, who helped me add those parsers to the main distribution.

I wrote a blog post last summer on some of the technical details of those parsers; although there have been updates since then, essentially following the minor changes that the RDFa Working Group has defined for RDFa, as well as changes/updates to the microdata->RDF algorithm, the general approach described in that post remains valid, and it is not necessary to repeat it here. For further details on these different formats, some of the useful links are:

Enjoy!

August 31, 2012

RDFa, microdata, turtle-in-HTML, and RDFLib

For those of us programming in Python, RDFLib is certainly one of the RDF packages of choice. Several years ago, when I developed a distiller for RDFa 1.0, some good souls picked the code up and added it to RDFLib as one of the parser formats. However, years have gone by, and they have seen the development of RDFa 1.1 and of microdata, as well as the specification for directly embedding Turtle into HTML. It is time to bring all these into RDFLib…

Some time ago I developed both a new version of the RDFa distiller, adapted to the RDFa 1.1 standard, and a microdata-to-RDF distiller, based on the Interest Group note on converting microdata to RDF. Both of these were packages and applications on top of RDFLib, which is fine because they can be used with the deployed RDFLib installations out there. But, ideally, these should be retrofitted into the core of RDFLib; I used the last few quiet days of the vacation period in August to do just that (thanks to Niklas Lindström and Gunnar Aastrand Grimnes for some email discussion and for helping me through the hoopla of RDFLib-on-github). The results are in a separate branch of the RDFLib github repository, under the name structured_data_parsers. Using these parsers, here is what one can do:

from rdflib import Graph

g = Graph()
# parse an SVG+RDFa 1.1 file and store the results in 'g':
g.parse(URI_of_SVG_file, format="rdfa1.1")
# parse an HTML+microdata file and store the results in 'g':
g.parse(URI_of_HTML_file, format="microdata")
# parse an HTML file for any structured content and store the results in 'g':
g.parse(URI_of_HTML_file, format="html")

The third option is interesting (thanks to Dan Brickley who suggested it): this will parse an HTML file for any structured data, be it in microdata, in RDFa 1.1, or in Turtle embedded in a <script type="text/turtle">...</script> tag.
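
As a small illustration, here is a sketch of what that looks like in practice; the HTML content and the example.org URIs below are made up, and the snippet assumes the structured_data_parsers branch is installed so that the "html" format is available:

from rdflib import Graph

# A tiny HTML page with a Turtle snippet embedded in a script element;
# the content and URIs are made up for illustration only.
page = """
<html>
  <body>
    <p>Some ordinary HTML content…</p>
    <script type="text/turtle">
      @prefix dc: <http://purl.org/dc/terms/> .
      <http://www.example.org/doc> dc:title "An example document" .
    </script>
  </body>
</html>
"""

g = Graph()
# the "universal" extractor looks for RDFa 1.1, microdata, and embedded Turtle
g.parse(data=page, format="html", publicID="http://www.example.org/doc")

for s, p, o in g:
    print(s, p, o)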

The core of the RDFa 1.1 parser has gone through very thorough testing, using the extensive test suite on rdfa.info. This is less true for microdata, because there is no extensive test suite for that one yet (but the code is also simpler). On the other hand, any restructuring like this may introduce some extra bugs. I would very much appreciate it if interested geeks in the community could install and test it, and forward me the bugs that are still undeniably there… Note that the microdata->RDF mapping specification may still undergo some changes in the coming few weeks/months (primarily catching up with some developments around schema.org); I hope to adapt the code to the changes quickly.

I have also made some arbitrary decisions here, which are minor, but arbitrary nevertheless. Any feedback on those is welcome:

  • I decided not to remove the old, 1.0 parser from this branch. Although the new version of the RDFa parser can switch into 1.0 mode if the necessary switches are in the content (e.g., @version or an RDFa 1.0-specific DTD), in the absence of those 1.1 will be used. As, unfortunately, 1.1 is not 100% backward compatible with 1.0, this may create some issues with deployed applications. This also means that the format="rdfa" argument will refer to 1.0 and not to 1.1. Am I too cautious here?
  • The format argument in parse can also hold media types. Some of those are fairly obvious: application/svg+xml, for example, will map to the new RDFa 1.1 parser. But what should be the default mapping for text/html? At present, it maps to the “universal” extractor (i.e., extracting everything).

Of course, at some point, this branch will be merged with the main branch of RDFLib, meaning that, eventually, this will be part of the core distribution. I cannot say at this point when this will happen, as I am not involved in the day-to-day management of the RDFLib development.

I hope this will be useful…

June 7, 2012

The long journey to RDFa 1.1

RDFa 1.1 Core, RDFa 1.1 Lite, and XHTML+RDFa 1.1 have just been published as Web Standards, i.e., W3C Recommendations, accompanied by a new edition of the RDFa Primer. Although it is “merely” an update of the previous RDFa 1.0 standard (published in 2008), it is a significant milestone nevertheless. RDFa 1.1 has restructured RDFa 1.0 in terms of the host languages it can be used with, and has also added some important features.

It has been a long journey. The development of RDFa (and I include RDFa 1.0 in this) was slowed down more by “social” than by technical issues. Indeed, RDFa is at the crossroads of two different communities which, alas!, had very little interaction before. As its name suggests, RDFa is of course closely related to RDF, i.e., to the communities around the Semantic Web, Linked Data, RDF, etc. On the other hand, the very goal of RDFa is to add structured data to markup languages (primarily the HTML family, of course, but also SVG, Atom, etc.). This means that RDFa is also relevant to all these communities, often loosely referred to as the “Web Application” community. The interaction between these communities was not always easy, and was often characterized by misunderstandings, different engineering patterns, and different concerns. To make things even more difficult, RDFa was also caught in the middle of the XHTML2 vs. HTML5 controversy: after all, the first drafts of RDFa were developed alongside XHTML2 and, although the current RDFa has long moved away from this heritage, the image of being part of XHTML2 stayed.

But all this is behind us now, and should be relegated to history. In my view the result, RDFa 1.1, reflects a good balance between the concerns and usage patterns of these communities; and that is what really counts. RDFa 1.1 allows the usage of prefixed abbreviations for URIs (so-called CURIEs) that the RDF community has been using and has got used to for many years but, in contrast to RDFa 1.0, their usage is now optional: authors may choose to use full URIs wherever and whenever they wish. By the way, prefixes for CURIEs are no longer defined through the @xmlns mechanism inherited from XML (this was probably the single biggest stumbling block around RDFa 1.0): instead, the usage of @xmlns is deprecated in favour of a dedicated @prefix attribute. Also, a number of well-known vocabularies have predefined prefixes; authors are not required to define prefixes for, say, the Dublin Core, FOAF, Schema.org, or Facebook’s Open Graph Protocol terms; they are automatically recognized. Finally, beyond these facilities with prefixed terms, RDFa 1.1 authors also have the possibility to define a vocabulary for a markup fragment (via the @vocab attribute) and forget about URIs and prefixes altogether: simple terms in property names or types will automatically be assigned URIs in that vocabulary. This is particularly important when RDFa is used with a single vocabulary (Schema.org or OGP usage comes to mind again).
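
A small, made-up illustration of these possibilities (the URIs are examples only, and the snippet assumes an RDFa 1.1 parser is available to RDFLib, e.g., the one described in the more recent posts above):

from rdflib import Graph

# Made-up XHTML+RDFa 1.1 fragment showing @vocab, @prefix, and a predefined prefix.
snippet = """
<html xmlns="http://www.w3.org/1999/xhtml">
 <body prefix="foaf: http://xmlns.com/foaf/0.1/" vocab="http://schema.org/">
  <p about="http://www.example.org/me" typeof="Person">
    <!-- 'name' is resolved against the @vocab value, i.e., schema.org -->
    My name is <span property="name">Ivan</span>;
    <!-- foaf: was declared with @prefix, no @xmlns needed -->
    my nick is <span property="foaf:nick">ivan</span>;
    <!-- dc: is one of the predefined prefixes, no declaration required -->
    this page is about <span property="dc:subject">RDFa 1.1</span>.
  </p>
 </body>
</html>
"""

g = Graph()
g.parse(data=snippet, format="rdfa1.1", publicID="http://www.example.org/doc")
print(g.serialize(format="turtle"))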

The behaviour of @property has been made richer, which means that in many (most?) situations the structured data can be expressed with @property alone, without the usage of @rel or @rev (although the usage of these latter is still possible). This increased simplicity is important for authors who are new to this world and may not, initially, grasp the difference between the classical usage of @property (i.e., literal objects) and @rel (i.e., URI references as objects). (Unfortunately, this change has created some corner-case backward incompatibilities with RDFa 1.0.)

There are also some other, though maybe less significant, improvements. For example, authors can also express (RDF) lists succinctly; this means that RDFa 1.1 can be used to describe, e.g., author lists for an article (where order counts a lot) or an OWL vocabulary. Also, an awkwardness in RDFa 1.0, related to XML Literals, has been removed.
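
For instance, an ordered author list can be written with the new @inlist attribute; here is a made-up sketch (the ex: vocabulary is hypothetical, and the same RDFLib assumption as above applies):

from rdflib import Graph

# Hypothetical fragment: the repeated @inlist/@property pairs are collected
# into a single, ordered RDF list that becomes the object of ex:author.
article = """
<html xmlns="http://www.w3.org/1999/xhtml">
 <body prefix="ex: http://www.example.org/vocab#">
  <p about="http://www.example.org/article">
    Written by
    <span inlist="" property="ex:author">Alice</span> and
    <span inlist="" property="ex:author">Bob</span>, in this order.
  </p>
 </body>
</html>
"""

g = Graph()
g.parse(data=article, format="rdfa1.1")
# expected shape: <http://www.example.org/article> ex:author ("Alice" "Bob") .
print(g.serialize(format="turtle"))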

The structure of RDFa has also changed. Whereas the definition of RDFa 1.0 was closely intertwined with XHTML, RDFa 1.1 separates the core definition from what it calls “Host Languages”. This means that RDFa is defined in a way that it can be adapted to all types of XML languages as well as HTML5. There are separate specifications on how RDFa 1.1 applies to XHTML1 and for HTML5, as well as for XML in general; this means that RDFa 1.1 can also be used with SVG, Atom, or MathML, because those languages automatically inherit from the XML definitions.

Last but not least: the Working Group has also defined a separate “subset” language, called RDFa 1.1 Lite. This is not a separate RDFa 1.1 dialect, just an authoring subset of RDFa 1.1: one that makes it easy for authors to step into this world without being forced to use all the possibilities of RDFa 1.1 (i.e., RDF). It can be expected that a large percentage of RDFa usage can be covered by this subset, but it also provides a good stepping stone when more complex structures (mixtures of many different vocabularies, datatypes, more complex graph structures, etc.) are required.

As I said, it has been a long journey. Many people were involved in the work, both in the Working Group and through comments coming from the public and from major potential users. But now that the result is there, I can safely say: it was worth the effort. Recent figures on the adoption of structured data on the Web (see, for example, the reports published at the LDOW 2012 Workshop recently by Peter Mika and Tim Potter, as well as by Hannes Mühleisen and Christian Bizer) can be summarized by a simple statement: structured data in Web pages is now mainstream, thanks to its adoption by search engines (i.e., Schema.org) and companies like Facebook. And RDFa 1.1 has a major role to play in this evolution.

If you are new to RDFa: the RDFa Primer is of course a good starting point, but it is well worth checking out (and possibly contributing to!) the rdfa.info web site, which contains references to tools and documents; you can also try out small RDFa snippets there. Enjoy!

April 17, 2012

Linked Data on the Web Workshop, Lyon

(See the Workshop’s home page for details.)

The LDOW20** series have become more than workshops; they are really small conferences. I did not count the number of participants (the meeting room had a fairly odd shape which made it a bit difficult) but I think it was well over a hundred. Nice to see…

The usual caveat applies for my notes below: I am selective here with some papers, which is no judgement on any other paper at the workshop. These are just some of my thoughts jotted down…

Giuseppe Rizzo made a presentation related to all the tools we now have to tag texts and thereby be able to use these resources in linked data (“NERD meets NIF: Lifting NLP Extraction Results to the Linked Data Cloud”), i.e., the Zemanta or Open Calais services of this world. As these services become more and more important, having a clear view of what they can do, how one can use them individually or together, etc., is essential. Their project, called NERD, will become an important source for this community, bookmark that page:-)

Jun Zhao made a presentation (“Towards Interoperable Provenance Publication on the Linked Data Web”), essentially on the work of the W3C Provenance Working Group. I was pleased to see and listen to this presentation: I believe the outcome of that group is very important for this community and, having played a role in the creation of that group, I am anxious to see it succeed. B.t.w., a new round of publications coming from that group should happen very soon; watch the news…

Another presentation, namely Arnaud Le Hors’ on “Using read/write Linked Data for Application Integration — Towards a Linked Data Basic Profile”, was also closely related to W3C work. Arnaud and his colleagues (at IBM) came to this community after a long journey working on application integration; think, e.g., of systems managing software updates and error management. These systems are fundamentally data oriented, and IBM has embarked on a Linked Data based approach (after having tried others). The particularity of this approach is to stay very “low” level, insofar as they use only the basic HTTP protocol for reading and writing RDF data. This approach seems to strike a chord at a number of other companies (Elsevier, EMC, Oracle, Nokia), and their work forms the basis of a new W3C Working Group that should be started this coming summer. This work may become a significant element of the palette of technologies around Linked Data.

Luca Costabello talked about Access Control, Linked Data, and Mobile (“Linked Data Access Goes Mobile: Context-Aware Authorization for Graph Stores”). Although Luca emphasized that their solution is not a complete solution for Linked Data access control issues in general, it may become an important contribution in that area nevertheless. Their approach is to modify SPARQL queries “on-the-fly” by including access control clauses; for that purpose, an access control ontology (S4AC) has been developed and used. One issue is: how would that work with a purely HTTP level read/write Linked Data Web, like the one Arnaud is talking about? Answer: we do not know yet:-)

Igor Popov concentrated on user interface issues (“Interacting with the Web of Data through a Web of Inter-connected Lenses”): how to develop a framework whereby data-oriented applications can cooperate quickly, so that lambda users could explore data, switching easily to applications that are well adapted to a particular dataset, without being forced to use complicated programming or too “geeky” tools. This is still an alpha level work, but their site-in-development, called Mashpoint, is a place to watch. There is (still) not enough work on user-facing data exploration tools, so I was pleased to see this one…

What is the dynamics of Linked Data? How does it change? This is the question Tobias Käfer and his friends try to answer in the future (“Towards a Dynamic Linked Data Observatory”). For that, data is necessary, and Tobias’ presentation was on how to determine what collection of resources to regularly watch and measure. The plan is to produce a snapshot of the data once a week for a year; the hope is that, based on this collected data, we will learn more about the overall evolution of linked data. I am really curious to see the results of that. One more reason to be at LDOW2013:-)

Tobias’ presentation has an important connection to the last presentation of the day, made by Axel Polleres (“OWL: Yet to arrive on the Web of Data?”), insofar as what he presented was based on the analysis of the Linked Data out there. The issue has been around, with lots of controversy, for a while: what level of OWL should/could be used for Linked Data? OWL 2 as a whole seems to be too complex for the amount of data we are talking about, both in terms of program efficiency and in terms of conceptual complexity for end users. OWL 2 has defined a much simpler profile, called OWL 2 RL, which does have some traction but may still be too complex, e.g., for implementations. Axel and his friends analyzed the usage of OWL statements out there, and also established some criteria on what type of rules should be used to make OWL processing really efficient; their result is another profile called OWL LD. It is largely a subset of OWL 2 RL, though it does adopt some datatypes that OWL 2 RL does not have.

There are some features left out of OWL LD that I am not fully convinced about; after all, their measurement was based on data from 2011, and it is difficult to say how much time it takes for new OWL 2 features to really catch on. I think that keys and property chains should/could be really useful for Linked Data, and can be managed by rule engines, too. So the jury is still out on this, but it would be good to find a way to stabilize this at some point and see the LD crowd look at OWL (i.e., the subset of OWL) more positively. Of course, another approach would be to concentrate on an easy way to encode Rules in RDF, which might make this discussion moot in a certain sense; one of the things we have not yet succeeded in doing:-(

The day ended with a panel, in which I also participated; I would let others judge whether the panel was good or not. However, the panel was preceded by a presentation by Chris on the current deployment of RDFa and microdata which was really interesting. (His slides will be on the workshop’s page soon.) The deployment of RDFa, microdata, and microformats has become really strong now; structured data in HTML is a well established approach out there. RDFa and microdata now cover half of the cases, the other half being microformats, which seems to indicate a clear shift towards RDFa/microdata, i.e., a more syntax oriented approach (with a clear mapping to RDF). Microdata is used almost exclusively with schema.org vocabularies (which is to be expected) whereas RDFa makes use of a larger palette of various other vocabularies. All these were to be expected, but it is nice to see them reflected in collected data.

It was a great event. Chris, Tim, and Tom: thanks!

December 16, 2011

Where are we with RDFa 1.1?

There has been a flurry of activity around RDFa 1.1 in the past few months. Although a number of blogs and news items have been published on the changes, all those have become “officialized” only in the past few days with the publication of the latest drafts, as well as with the publication of RDFa 1.1 Lite. It may be worth looking back at the past few months to have a clearer idea of what happened. I make references to a number of other blogs that were published in the past few months; the interested reader should consult those for details.

The latest official drafts for RDFa 1.1 were published in Spring 2011. However, a lot has happened since. First of all, the RDFWA Working Group, working on this specification, has received a significant amount of comments. Some of those were rooted in implementations and the difficulties encountered therein; some came from potential authors who asked for further simplifications. Also, the announcement of schema.org had an important effect: indeed, this initiative drew attention to the importance of structured data in Web pages, which also raised further questions on the usability of RDFa for that usage pattern. This came to the fore even more forcefully at the workshop organized by the stakeholders of schema.org in Mountain View. A new task force on the relationships of RDFa and microdata has been set up at W3C; beyond looking at the relationship of these two syntaxes, that task force also raised a number of issues on RDFa 1.1. These issues have been, by and large, accepted and handled by the Working Group (and reflected in the new drafts).

What does this mean for the new drafts? The bottom line: there have been some fundamental changes in RDFa 1.1. For example, profiles, introduced in earlier releases of RDFa 1.1, have been removed due to implementation challenges; however, the management of vocabularies has acquired an optional feature that helps vocabulary authors to “bind” their vocabularies to other vocabularies, without introducing an extra burden on authors (see another blog for more details). Another long-standing issue was whether RDFa should include a syntax for ordered lists; this has been done now (see the same blog for further details).

A more recent important change concerns the usage of @property and @rel. Although the usage of these attributes was never a real problem for RDF-savvy authors (the former is for the creation of literal objects, whereas the latter is for URI references), they have proven to be a major obstacle for ‘lambda’ HTML authors. This issue came up quite forcefully at the schema.org workshop in Mountain View, too. After a long technical discussion in the group, the new version reduces the usage difference between the two significantly. Essentially, if, on the same element, @property is present together with, say, @href or @resource, and @rel or @rev is not present, a URI reference is generated as the object of the triple. I.e., when used on, say, a <link> or <a> element, @property behaves exactly like @rel. It turns out that this usage pattern is so widespread that it covers most of the important use cases for authors. The new version of the RDFa 1.1 Primer (as well as the RDFa 1.1 Core, actually) has a number of examples that show these. There are also some other changes related to the behaviour of @typeof in relation to @property; please consult the specification for these.
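
A small, made-up illustration of this behaviour (foaf is used only as a familiar example vocabulary; as before, the snippet assumes an RDFa 1.1 parser is available to RDFLib):

from rdflib import Graph

# Hypothetical fragment: on the <a> element @property is accompanied by @href
# (and there is no @rel), so the object of the generated triple is a URI, not a literal.
snippet = """
<html xmlns="http://www.w3.org/1999/xhtml">
 <body prefix="foaf: http://xmlns.com/foaf/0.1/" about="http://www.example.org/me">
  <!-- literal object, as in RDFa 1.0: -->
  <span property="foaf:name">Ivan</span>
  <!-- URI reference object; @property behaves here like @rel: -->
  <a property="foaf:homepage" href="http://www.example.org/">my home page</a>
 </body>
</html>
"""

g = Graph()
g.parse(data=snippet, format="rdfa1.1")
# expected:
#   <http://www.example.org/me> foaf:name "Ivan" ;
#                               foaf:homepage <http://www.example.org/> .
print(g.serialize(format="turtle"))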

The publication of RDFa 1.1 Lite was also a very important step. This defines a “sub-set” of the RDFa attributes that can serve as a guideline for HTML authors to express simple structured data in HTML without bothering about more complex features. This is the subset of RDFa that schema.org will “accept”, as an alternative to microdata, as a possible syntax for schema.org vocabularies. (There are some examples of how some schema.org markup looks in RDFa 1.1 Lite in a different blog.) In some sense, RDFa 1.1 Lite can be considered the equivalent of microdata, except that it leaves the door open for more complex vocabulary usage, mixture with different vocabularies, etc. (The HTML Task Force will publish a more detailed comparison of the different syntaxes soon.)
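
For example, a typical schema.org description restricted to the RDFa 1.1 Lite attributes (@vocab, @typeof, @property, @resource, @prefix) could look like the made-up snippet below (same RDFLib assumption as above):

from rdflib import Graph

# Made-up schema.org example using only the RDFa 1.1 Lite attributes.
snippet = """
<html xmlns="http://www.w3.org/1999/xhtml">
 <body vocab="http://schema.org/">
  <div typeof="Person" resource="http://www.example.org/me">
    <span property="name">Ivan</span> works for
    <span property="affiliation" typeof="Organization">
      <span property="name">W3C</span>
    </span>.
  </div>
 </body>
</html>
"""

g = Graph()
g.parse(data=snippet, format="rdfa1.1")
print(g.serialize(format="turtle"))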

So here is, roughly, where we are today. The recent publications by the W3C RDFWA Working Group have, as I said, “officialized” all the changes that have been discussed since spring. The group decided not to publish a Last Call Working Draft, because the last few weeks of work in the HTML Task Force may reveal some new requirements; if not, the last round of publications will follow soon.

And what about implementations? Well, my “shadow” implementation of the RDFa distiller (which also includes a separate “validator” service) incorporates all the latest changes. I also added a new feature a few weeks ago, namely the possibility to serialize the output in JSON-LD (although this became outdated a few days ago, due to some changes in JSON-LD…). I am not sure of the exact status of Gregg Kellogg’s RDF Distiller, but, knowing him, it is either already in line with the latest drafts or it is only a matter of a few days to be so. And there are surely more around that I do not know about.

This last series of publications have provided a nice closure for a busy RDFa year. I guess the only thing now is to wish everyone a Merry Christmas, a peaceful and happy Hanukkah, or other festivities you honor at this time of the year.  In any case, a very happy New Year!


April 18, 2011

Open data from Fukushima

This is just an extended tweet… Masahide Kanzaki has just posted an announcement on the LOD mailing list on releasing some data he collected on the radioactivity levels at different places in Japan, enriched with metadata (e.g., geo data or time). Though the original data were in PDF, the results are integrated in RDF with a SPARQL endpoint. He also added a visualization endpoint that gives a simple visualization of the SPARQL query results:

Visualization results for radioactivity data for Tokyo and Fukushima, using integrated datasets and SPARQL query

Simple but effective, and makes the point on the usage of open data in RDF… Thanks!

November 23, 2010

My first mapping from RDB to RDF using a direct mapping, cont.

A few days ago I posted a blog on how the RDB to RDF direct mapping could be used for a simple example. I do not want to repeat the whole blog: the essence of it was that database tables were mapped onto a simple RDF Graph (this is what the direct mapping does) and the resulting graph was transformed into the “target” graph using the following SPARQL 1.1 construct:

CONSTRUCT {
  ?id a:title ?title ;
    a:year  ?year ;
    a:author _:x .
  _:x a:name ?name ;
    a:homepage ?hp .
}
WHERE {
  SELECT (IRI(fn:concat("http://...",?isbn)) AS ?id)
          ?title ?year ?name
         (IRI(?homepage) AS ?hp)
  {
    ?book a <Book> ;
       <Book#ISBN>   ?isbn ;
       <Book#Title>  ?title ;
       <Book#Year>   ?year ;
       <Book#Author> ?author .
    ?author a <Author> ;
       <Author#Name>     ?name ;
       <Author#Homepage> ?homepage .
  }
}

where the trick was to use a nested SELECT whose main job was to create URI references from strings. I realized that if one uses the latest editors’ version of SPARQL 1.1 (i.e., the version that is much closer to what SPARQL 1.1 will eventually be), then the solution is actually simpler, due to the variable assignment facility (BIND) that makes the nested SELECT unnecessary:

CONSTRUCT {
  ?id a:title ?title ;
    a:year  ?year ;
    a:author _:x .
  _:x a:name ?name ;
    a:homepage ?hp .
}
WHERE {
  ?book a <Book> ;
     <Book#ISBN>   ?isbn ;
     <Book#Title>  ?title ;
     <Book#Year>   ?year ;
     <Book#Author> ?author .
  ?author a <Author> ;
     <Author#Name>     ?name ;
     <Author#Homepage> ?homepage .
  BIND (IRI(fn:concat("http://...",?isbn)) AS ?id)
  BIND (IRI(?homepage) AS ?hp)
}

which makes, at least in my view, the mapping even clearer.

But SPARQL is not the only way to transform the graph. Another possibility is to use RIF Core. Essentially the same transformation can indeed be expressed using the RIF Presentation syntax. Here it is (with a little help from Sandro Hawke and Harold Boley):

Forall ?book ?title ?author ?isbn ?year ?id (
  ?id[a:year->?year a:title->?title a:author->?author] :-
    And(
      ?book[rdf:type-> <Book>
             a:isbn->?isbn
             a:title->?title
             a:year->?year
             a:author->?author]
      External(pred:iri-string(?id External( func:concat("http://..." ?isbn ) )))
    )
)
Forall ?author ?name ?hp ?homepage (
 ?author[a:name->?name a:homepage->?hp] :-
   And(
        ?author[rdf:type-> <Author>
                a:name->?name
                a:homepage->?homepage]
        External(pred:iri-string(?hp ?homepage))
  )
)

(As I did in the earlier examples, I did not put the prefix declarations and other syntactic stuff into the code above.)

The only difference between the two is that I retained the URI for the author, because generating a blank node on the fly in RIF Core does not seem to be possible. A better solution would be, probably, to mint a URI from the ?author variable just like I did for the ISBN value. Other than that, the two solutions are pretty much identical…

November 19, 2010

My first mapping from RDB to RDF using a direct mapping

A few weeks ago I wrote a blog post on my first RDB to RDF mapping using R2RML; the W3C RDB2RDF Working Group had just published a first public Working Draft for R2RML. That mapping was based on a specific mapping language (i.e., R2RML). R2RML relies on R2RML processing done by, for example, the database system: interpreting the language, using some SQL constructs, etc. The R2RML processing depends on the specific schema of the database, which guides the mapping.

As I already mentioned in that blog, a “direct” mapping was also in preparation by the Working Group; well, the first public Working Draft of that mapping has just been published. That mapping does not depend on the schema of the database: it defines a general mapping of any relational database structure into RDF; only a base URI has to be specified for the database, everything else is generated automatically. The resulting RDF graph is of course much coarser than the one generated by R2RML; whereas the result of an R2RML mapping may be a graph using well-specified vocabularies, for example, this is not the case for the output of the direct mapping. But that is not really a problem: after all, we have SPARQL or RIF to make transformations on graphs! I.e., the two approaches are really complementary.

What I will do in this blog is to show how the very same example as in my previous blog can be handled by a direct mapping. As a reminder: the toy example I use comes from my  generic Semantic Web tutorial. Here is the (toy) table:

which is then converted into an RDF Graph:

(Just as in the previous case I will ignore the part of the graph dealing with the publisher, which has the same structure as the author part. I will also ignore the prefix definitions.)

The direct mapping of the first and second tables is pretty straightforward. The URI-s are a bit ugly but, well, this is what you get when you use a generic solution. So here it is:

@base <http://book.example/> .
<Book/ID=0006511409X#_> a <Book> ;
  <Book#ISBN> "0006511409X" ;
  <Book#Title> "The Glass Palace" ;
  <Book#Year>  "2000" ;
  <Book#Author> <Author/ID=id_xyz#_> .

<Author/ID=id_xyz#_> a <Author> ;
  <Author#ID> "id_xyz" ;
  <Author#Name> "Ghosh, Amitav" ;
  <Author#Homepage> "http://www.amitavghosh.com" .

Simple, isn’t it?

The result is fairly close to what we want, but not exactly. First of all, we want to use different vocabulary terms (like a:name). Also, note that the direct mapping produces literal objects most of the time, except when there is a “jump” from one table to another. Finally, the resulting graph should use a blank node for the author, which is not the case in the generated graph.

Fortunately, we have tools in the Semantic Web domain to transform RDF graphs. RIF is one possible solution; another is SPARQL, using the CONSTRUCT form. Using SPARQL is an attractive solution because, in practice, the output of the direct mapping may not even be materialized; instead, one would expect a SPARQL engine attached to a particular relational database, mapping the SPARQL queries to the tables on the fly. I will use SPARQL 1.1 below because it gives nice facilities to generate RDF URI Resources from strings, i.e., to have “bridges” from literals to URI-s. Here is a possible SPARQL 1.1 query/construct that could be used to achieve what we want:

CONSTRUCT {
  ?id a:title ?title ;
    a:year  ?year ;
    a:author _:x .
  _:x a:name ?name ;
    a:homepage ?hp .
}
WHERE {
  SELECT (IRI(fn:concat("http://...",?isbn)) AS ?id)
          ?title ?year ?name
         (IRI(?homepage) AS ?hp)
  {
    ?book a <Book> ;
      <Book#ISBN> ?isbn ;
      <Book#Title> ?title ;
      <Book#Year>  ?year ;
      <Book#Author> ?author .
    ?author a <Author> ;
      <Author#Name> ?name ;
      <Author#Homepage> ?homepage .
  }
}

Note the usage of a nested query; this is used to create new variables representing the URI references to be used by the outer query. The key is the IRI operator. (Both the nesting and the AS in the SELECT are SPARQL 1.1 features.)
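
For what it is worth, here is a sketch of how the transformation could be tried end-to-end with RDFLib, assuming a SPARQL 1.1-capable setup. Note the hedges: I use the built-in CONCAT instead of fn:concat, and the a: vocabulary URI and the base for the minted book URIs are made up here, because the prefix declarations were left out of the query above.

from rdflib import Graph

# The direct-mapping output from above, verbatim (publisher part omitted).
direct_output = """
@base <http://book.example/> .
<Book/ID=0006511409X#_> a <Book> ;
  <Book#ISBN> "0006511409X" ;
  <Book#Title> "The Glass Palace" ;
  <Book#Year>  "2000" ;
  <Book#Author> <Author/ID=id_xyz#_> .

<Author/ID=id_xyz#_> a <Author> ;
  <Author#ID> "id_xyz" ;
  <Author#Name> "Ghosh, Amitav" ;
  <Author#Homepage> "http://www.amitavghosh.com" .
"""

g = Graph()
g.parse(data=direct_output, format="turtle")

# The query from above, with a hypothetical a: prefix and base URI added,
# and CONCAT used instead of fn:concat.
query = """
BASE <http://book.example/>
PREFIX a: <http://book.example/vocab#>
CONSTRUCT {
  ?id a:title ?title ;
      a:year  ?year ;
      a:author _:x .
  _:x a:name ?name ;
      a:homepage ?hp .
}
WHERE {
  SELECT (IRI(CONCAT("http://book.example/isbn/", ?isbn)) AS ?id)
         ?title ?year ?name
         (IRI(?homepage) AS ?hp)
  {
    ?book a <Book> ;
      <Book#ISBN>   ?isbn ;
      <Book#Title>  ?title ;
      <Book#Year>   ?year ;
      <Book#Author> ?author .
    ?author a <Author> ;
      <Author#Name>     ?name ;
      <Author#Homepage> ?homepage .
  }
}
"""
target = Graph()
for triple in g.query(query):   # iterating a CONSTRUCT result yields the triples
    target.add(triple)
print(target.serialize(format="turtle"))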

That is it. Of course, the question does arise: which one would one use? The direct mapping or R2RML? Apart from the possible restriction that the local database system may implement only the direct mapping, it also becomes a question of taste. The heavy tool in R2RML is, in fact, the embedded SQL query; if one is comfortable with SQL then that is fine. But if the user is more comfortable with Semantic Web tools (e.g., SPARQL or RIF) then the direct mapping might be handier.

(Note that these are evolving documents still. I already know that my previous blog is wrong in the sense that it is not in line with the next version of R2RML. Oh well…)

November 2, 2010

My first mapping from RDB to RDF using R2RML

The W3C RDB2RDF Working Group has just published a first public Working Draft for the standardized RDB->RDF mapping language called R2RML. I decided that the only way to understand a specification like that is to try to use it for an example. Caveat: this is a “First Public Working Draft” for R2RML, so many things still have to happen and there will be changes.

For several years now I have used a simple example in my generic Semantic Web tutorial (see, e.g., the one at SemTech). It is an artificial example referring to an imaginary bookshop’s table:

which is then converted into an RDF Graph:

(And the tutorial story is how this graph can be merged with a graph coming from another bookshop’s data.) Up until now I always glossed over how this mapping is done. Well, so how could that be done with R2RML?

R2RML defines mappings that describe how an RDB table is mapped onto triples. (R2RML is itself in RDF, b.t.w.) Simply put, in R2RML, each row of a table is mapped to an RDF subject; the individual cells, with the column names, provide the objects and the predicates, respectively.

If we look at the middle table in the example, it corresponds to the lower right hand part of the graph. The R2RML mapping has to specify that the homepage column should actually produce an RDF Resource and not a string literal. Furthermore, the first column should become a blank node; that has to be specified, too. Here is the way this is all specified:

:Table2 rdf:type rr:TriplesMap ;
    rr:logicalTable "Select ('_:' || ID) AS pid, Name, ('<' || Homepage || '>') AS Home from person_table" ;
    rr:subjectMap [ a rr:BlankNodeMap ; rr:column "pid" ; ] ;
    rr:propertyObjectMap [ rr:property a:name; rr:column "Name" ] ;
    rr:propertyObjectMap [ a rr:IRIMap ; rr:property a:homepage; rr:column "Home" ] .

What happens here is:

  1. a mapping is defined that turns the original table into a virtual, “logical” table using SQL. The goal here is to generate a blank node ID on the fly, and a URI in NTriple syntax (note, however, that I am not sure it is o.k. to use that approach in the spec!);
  2. the subject for the triples is chosen to be a cell in a specific column (“pid”, generated by the SQL transform of the previous point), and it is also specified that this is a blank node;
  3. the other two properties are specified (for the same subject); the one for the home page also specifies that the object must be a URI resource (as opposed to a Literal).

That is it. Mapping of the bottom table to the lower left hand corner of the graph is also quite similar; I will not go into this here.

But we still need the “root”, so to say, i.e., the node in the upper right hand corner, the top portion of the graph (with the title and the year) and, mainly, we also have to relate the root to the portion of the graph that is generated from the middle table.

First, the following R2RML part does the job of generating the top part of the graph:

:Table1 rdf:type rr:TriplesMap ;
    rr:logicalTable "Select ('<http:..isbn/' || ISBN || '>') AS isbn, 
                     Author, Title, Publisher, Year from book_table";
    rr:subjectMap [ rdf:type rr:IRIMap ; rr:column "isbn" ] ;
    rr:propertyObjectMap [ rr:property a:title ; rr:column "Title" ; ] ;
    rr:propertyObjectMap [ rr:property a:year ; rr:column "Year" ; ] ;

The only role of the mapping to a logical table is to generate a URI from the ISBN; all the other cells are, conceptually, simply copied on the logical table. The rest is fairly straightforward.

The missing trick is to combine, i.e., to “join”, the two tables on the graph. R2RML has a separate construction for that, referred to as “mapping” the foreign keys. The following additional statements should be added to :Table1:

    rr:foreignKeyMap [ 
       rr:key a:author ; 
       rr:parentTriplesMap :Table2 ; rr:joinCondition "{child}.Author = {parent}.pid"
    ] .

Which combines the nodes defined by :Table1 with those of :Table2. And voilà! We’re done: the R2RML document is ready, i.e., an R2RML engine would turn my example table into my example graph.

Of course, there are more complicated possibilities. Triples, or whole rows, can be explicitly stored in a specific named graph, for example. Or a column defining a predicate could, actually, use a cell in another column as an object. Etc. And, to be honest, I am not even 100% sure that the above is correct; I may have misunderstood some details. But the “melody” is still clear.

Note the role the SQL-based mapping of the original table to the logical table plays. For SQL experts, most of the work can be done there, i.e., the resulting RDF graph can be ready for further usage by an application, to be linked into the LOD, to be used with the right attributes, namespaces, etc. Which is very powerful indeed, provided… the user has the necessary SQL expertise. And, while that is obviously true for database managers, it is not necessarily true for RDF experts. For those, a slightly different model seems to be more appropriate: they would prefer to get an RDF graph ASAP, so to say, without any fancy transformation, and would then use RIF, SWRL, SPARQL’s CONSTRUCT, etc., to turn it into the RDF graph they eventually want to have. In other words, they may not need the concept of a logical table. That is what is referred to by the group as the “default” mapping. I.e., what graph does one get if nothing is specified? If that is properly defined then, say, RIF experts can use their expertise instead of SQL. This default mapping is not yet fully specified by the group, but it is on its way; it will be published shortly, and will complete the R2RML picture. So watch that space…

June 29, 2010

SemTech2010 & co.

I am on my way home from a long trip in the US (writing these lines on the plane, to be posted from home). A few days in Seattle, SemTech 2010 in San Francisco, and finally the “RDF Next Steps” workshop in Palo Alto (i.e., Stanford). I do not want to write about the last one now, simply because we hope to have a more extended public report available within 10-15 days. I.e., more about that later.

Seattle consisted of a number of company visits, but it also included a talk at the SemWeb Meetup in Seattle. I gave a presentation on what happened at W3C over the last year which, I think, was well received. (Although one is never sure about these things.) I had a bunch of discussions and chats after the presentation; it was pleasant, relaxing… I, and mainly my colleague from W3C, Eric Prud’hommeaux, also had a long discussion with two developers from Microsoft who are involved in the oData work; that was really interesting because we reached the conclusion of possibly outlining together a plan whereby we could write down how to “export” oData into RDF, and publish that, e.g., as a W3C Note (note that there are already systems doing something like that out there, but I am not knowledgeable enough to judge how complete those solutions are). I think it would be good for the community if this happens. It is important for a general Web of Data to include, well, all the data on the Web…

Semtech… it was big. Bigger than last year (I heard and read a figure of a 30% increase in attendance). This industry is lively indeed! The only problem that it was almost too big; it was the conference of eternal frustration:-( Indeed, there were so many things in parallel that one always had the feeling to have missed something because another, parallel session may have been more interesting! I heard presentations from Facebook, from Google, saw stunning visualizations of RDF graphs, or heard about plans on ontology hosting and management. There was a report on the US and UK governmental data work (this stuff still amazes me, though it is not the first time I hear about it), there was a presentation of BestBuy (alas! I missed that one). There was a separate track on the publication world as a separate “vertical” area (and we also had some great discussions with the people from the New York Times with whom we outlined a possible first step in gathering that community). Lots of hallway conversation with companies and institutions and, of course the social life, chatting with David, and Ian, and the other Ian, and Eric, and the other David, and Christine, and Jeremy, and Jim, and Fabien, and Sandro, and Jenni, and… I should stop and not even try to list everybody because it is simply impossible! I also gave an introductory Semantic Web Tutorial (quite a lot of people in the audience, and I think it went well), we had a panel on the W3C RDB2RDF work and another one on SPARQL 1.1. As a nice little touch, I could announce the publication of the W3C RIF Recommendation as a primeur during the tutorial when as I was talking about RIF (the publication itself happened while I was talking…)

There were, as every year, some “buzz” topics. My impression is that the linked open governmental data effort was a buzz and was still new information for many. Facebook’s keynote on the Open Graph Protocol created another buzz. More generally, RDFa was definitely a buzz (big time!). I.e., as I said, this industry is lively and continues to be exciting.

But there are of course challenges. The way I feel it, the biggest challenge is not technical. Yes, of course, there are technical issues, but those will be solved, eventually. The issue is outreach: to get to those new communities who may understand the value of a Web of Data in general but do not have enough guidance on how to start doing something. How to publish the data, how to link it to other data, how to consume it, use it, mash it up… How to talk to “C-level” people, how to reach out to them. There are books, of course, but not enough; there are tutorials and guides, of course, but not enough; there are experts around but definitely not enough. As one of our discussion partners put it: if I go to any better bookshop, there are rows of books on, say, XML (good or bad, but they are there). But books on RDF, on Linked Data, on SPARQL, on SKOS, on OWL: only a few here and there (comparatively, that is), and some of them are actually quite old. Let alone the problem of trying to hire experts who could do the job. I really feel that this is the biggest challenge our community faces. I say “community” and not only a single organization like W3C or any other; the challenge is too great to be solved by one group only. We have been fighting with this issue for a while now, but it is still a challenge… And a challenge for us all who care about this stuff!

It was a good week!
