Ivan’s private site

January 4, 2014

Data vs. Publishing: my change of responsibilities…

Fairy Lake Botanical Garden, Shenzhen, China

There was an official announcement, as well as some commentary, on the fact that the structure of data-related work has changed at W3C. A new activity has been created, called the “Data Activity”, that subsumes what used to be called the Semantic Web Activity. “Subsumes” is an important term here: W3C does not abandon the Semantic Web work (I emphasize that because I did get such reactions); instead, the existing and possible future work simply continues within a new structure. The renaming is simply a sign that W3C also has to pay attention to the fact that there are many different data formats used on the Web, not all of which follow the principles and technologies of the Semantic Web, and those other formats and approaches also have technological and standardization needs that W3C might be in a position to help with. It is not the purpose of this blog, however, to look at the details; the interested reader may consult the official announcements (or consider Tim Finin’s formula: Data Activity ⊃ Semantic Web ∪ eGovernment :-)

There is a much less important but more personal aspect of the change, though: I will not be the leader of this new Data Activity (my colleague and friend, Phil Archer, will do that). Before anybody tries to find some complicated explanation (e.g., that I was fired): the reason is much simpler. About a year ago I got interested in a fairly different area, namely Digital Publishing. What used to be, back then, a so-called “headlight” project at W3C, i.e., an exploration into a new area, turned into an Activity of its own last summer, with me as the lead. There is a good reason for that: after all, digital publishing (e.g., e-books) may represent one of the largest usage areas of the core W3C technologies (HTML5, CSS, SVG, etc.) right after browsers; indeed, for those of you who do not realize it (I did not know it a year and a half ago either…), an e-book is “just” a frozen and packaged Web site, using many of the technologies defined by W3C. A major user area, then, but one whose requirements may be special and are not yet properly represented at W3C. Hence the new Activity.

However, this development at W3C had its price for me: I had to choose. Heading both the Digital Publishing and the Data Activities was not an option. I have led W3C’s Semantic Web Activity for about 7 years; 7 years that were rich in events and results (the forward march of Linked Open Data, a much more general presence and acceptance of the technology, specifications like OWL 2, RDFa, RDB2RDF, PROV, SKOS, SPARQL 1.1, with RDF 1.1 just around the corner now…). I had my role in many of these, although I was merely a coordinator for the work done by other amazing individuals. But I had to choose, and I decided to go towards new horizons (in view of my age, probably for the last time in my professional life); hence my choice of Digital Publishing. As simple as that…

But this does not mean I am completely “out”. First of all, I will still actively participate in some of the Data Activity groups (e.g., in the “CSV on the Web” WG), and I have a continuing interest in many of the issues there. But, maybe more importantly, there are some major overlapping areas between Digital Publishing and Data on the Web. For example, publishing also means scientific, scholarly publishing, and this particular area is increasingly aware of the fact that publishing data, as part of reporting on a particular scientific endeavor, is becoming as important as publishing a traditional paper. And this raises tons of issues on data formats, linked data, metadata, access, provenance, etc. Another example: the traditional publishing industry makes increasingly heavy use of metadata. There is a recognition among publishers that well chosen and well curated metadata for books is a major business asset that may make a publication win or lose. There are many (overlapping…) vocabularies, and relationships to libraries, archival facilities, etc., come to the fore. Via this metadata the world of publishing may become a major player in the Linked Data cloud. A final example may be annotation: while many aspects of the annotation work are inherently bound to the Semantic Web (see, e.g., the work of the W3C Community Group on Annotation), annotation is also considered to be one of the most important areas for future development in, say, educational publishing.

I can, hopefully, contribute to these overlapping areas with my experience from the Semantic Web. So no, I am not entirely gone, I have just changed hats! Or, as in the picture, I am acting (also) as a bridge…

March 16, 2013

Multilingual Linked Open Data?

Filed under: Semantic Web,Work Related — Ivan Herman @ 14:13

Experts developing Web sites for various cultures and languages know that it is way better to build support for those languages and cultures into Web pages at the start, i.e., at the time of the core design, rather than to “add” it once the site is done. What is valid for Web sites is also valid for data deployed on the Web, and that is especially true for Linked Data, whose mantra is to combine data and datasets from all over the place.

Why do I say all this? I had the pleasure to participate, earlier this week, in the MultilingualWeb Workshop in Rome, Italy. One of the topics of the workshop was Linked (Open) Data and its multilingual (and, also, multicultural) aspects. There were a number of presentations in a dedicated session (the presentations are online, linked from the Workshop Page; just scroll down and look for a session entitled “Machines”), and there was also a separate break-out session (the slides are not yet online, but they should be soon). There are also a number of interesting projects and issues in this area beyond those presented at the event; the lemon model or the (related) Monnet EU project, for example.

All these projects are great. However, the overall situation in the Linked Data world is, in this respect, not that great, at least in my view. If one looks at the various Linked Data (or Semantic Web) related mailing lists, discussion fora, workshops, etc., multilingual or multicultural issues are almost never discussed. I have not made any systematic analysis of the various datasets on the LOD cloud, but I have the impression that only a few of them are prepared for multilingual use (e.g., by providing alternative labels and other metadata in different languages). URI-s are defined in English, and most of the vocabularies we use are documented in only one language, which may make them hard to use for non-English speakers. Worse, vocabularies may not even be properly prepared for multicultural use (just consider the complexity of personal names, which is hardly ever properly reflected in vocabularies). And this is where we hit the same problem as for Web sites: with all its successes, we are still at the beginning of the deployment of Linked Data; our community should have much more frequent discussions on how to handle this issue now, because after a while it may be too late.
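To make this concrete, here is a minimal Turtle sketch of what “prepared for multilingual use” can mean in practice; the ex: namespace is made up for illustration, and the choice of rdfs:label vs. skos:prefLabel is just one possible convention:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:   <http://example.org/id/> .   # hypothetical namespace

# The same resource described with language-tagged labels, so that an
# application can pick the label matching the user's language.
ex:Hungary
    rdfs:label     "Hungary"@en ,
                   "Magyarország"@hu ,
                   "Hongrie"@fr ;
    skos:prefLabel "Hungary"@en ,
                   "Magyarország"@hu .

Nothing more exotic than language tags is needed for this first step; the point above is that few datasets and vocabularies even do that.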

B.t.w., one of the outcomes of the break-out session at the Workshop was that a W3C Community Group should be created soon to produce some best practices for Multilingual Linked Open Data. There is already some work done in the area; look at the page set up by José Emilio Labra Gayo, Dimitris Kontokostas, and Sören Auer, which may very well be the starting point. Watch this space!

It is hard. But it will be harder if we miss this particular boat.

January 24, 2012

Nice reading on Semantic Search

I had a great time reading a paper on Semantic Search[1]. Although the paper is on the details of a specific Semantic Web search engine (DERI’s SWSE), I was reading it as somebody not really familiar with all the intricate details of such a search engine’s setup and operation (i.e., I would not dare to give an opinion on whether the choices taken by this group are better or worse than the ones taken by the developers of other engines) and wanting to gain a good picture of what is happening in general. And, for that purpose, this paper was really interesting and instructive. It is long (ca. 50 pages), so I did not even try to understand everything on my first reading, but it did give a great overall impression of what is going on.

One of the “associations” I had, maybe somewhat surprisingly, was with another paper I read recently, namely a report on basic profiles for Linked Data[2]. In that paper Nally et al. look at what “subsets” of current Semantic Web specifications could be defined, as “profiles”, for the purpose of publishing and using Linked Data. This was also a general topic at a W3C Workshop on Linked Data Patterns at the end of last year (see also the final report of the event), and it is not a secret that W3C is considering setting up a relevant Working Group in the near future. Well, the experience of an engine like SWSE might come in very handy here. For example, SWSE uses a subset of the OWL 2 RL Profile for inferencing; that may be a good input for a possible Linked Data profile (although the differences are really minor, if one looks at the appendix of the paper that lists the rule sets the engine uses). The idea of “Authoritative Reasoning” is also interesting and possibly relevant; that approach makes a lot of pragmatic sense, and I wonder whether it should be, somehow, documented for general use. And I am sure there is more: in general, analyzing the experience of major Semantic Web search engines in handling Linked Data might provide great input for such pragmatic work.

I was also wondering about a very different issue. A great deal of work had to be done in SWSE on the proper handling of owl:sameAs. On the other hand, one of the recurring discussions on various mailing lists and elsewhere is whether the usage of this property is semantically o.k. or not (see, e.g., [3]). A possible alternative would be to define (beyond owl:sameAs) a set of properties borrowed from the SKOS Recommendation, like closeMatch, exactMatch, broadMatch, etc. It is almost trivial to generalize these SKOS properties for the general case but, reading this paper, I was wondering: what effect would such predicates have on search? Would they make it more complicated or, in fact, would such predicates make the life of search engines easier by providing “hints” that could be used for the user interface? Or both? Or is it already too late, because the usage of owl:sameAs is so prevalent that it is not worth touching that stuff? I do not have a clear answer at this moment…
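To illustrate the difference in “strength” I have in mind, here is a small Turtle sketch; the ex: and other: namespaces are made up, and rel: stands for a hypothetical “generalized SKOS” mapping vocabulary of the kind alluded to above:

@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rel: <http://example.org/match#> .      # hypothetical generalized mapping vocabulary
@prefix ex:    <http://example.org/id/> .       # hypothetical dataset A
@prefix other: <http://data.example.net/id/> .  # hypothetical dataset B

# owl:sameAs is a strong statement: both URIs denote the very same resource,
# so a consumer may merge ("smush") all their descriptions.
ex:Amsterdam  owl:sameAs     other:Amsterdam .

# Generalized counterparts of the SKOS mapping properties would only provide
# weaker hints, which a search engine could exploit differently (e.g., for
# ranking, or for grouping results in the UI) without merging descriptions.
ex:Amsterdam  rel:exactMatch other:CityOfAmsterdam .
ex:Amsterdam  rel:closeMatch other:AmsterdamMetropolitanArea .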

Thanks to the authors!

  1. A. Hogan, et al., “Searching and Browsing Linked Data with SWSE: the Semantic Web Search Engine”, Journal of Web Semantics, vol. 9, no. 4, pp. 365-401, December 2011.
  2. M. Nally and S. Speicher, “Toward a Basic Profile for Linked Data”, IBM developerWorks, 2011.
  3. H. Halpin, et al., “When owl:sameAs Isn’t the Same: An Analysis of Identity in Linked Data”, Proceedings of the International Semantic Web Conference, pp. 305-320, 2010.


November 2, 2011

Some notes on ISWC2011…

The 10th International Semantic Web Conference (ISWC2011) took place in Bonn last week. Others have already blogged about the conference in a more systematic way (see, for example, Juan Sequeda’s series on semanticweb.com); there is no reason to repeat that. So here are just a few personal impressions, with the obvious caveat that I may have missed interesting papers or presentations, and that the ones I picked here are also the result of my personal bias… In no particular order:

Zhishi.me is the outcome of the work of a group from the APEX lab in Shanghai and Southeast University: it is, in some ways, the Chinese DBpedia. “In some ways” because it is actually a mixture of three different Chinese, community-driven encyclopedias, namely the Chinese Wikipedia, Baidu Baike, and Hudong Baike. I am not sure of the exact numbers, but the combined dataset is probably a bit bigger than DBpedia. The goal of Zhishi.me is to act as a “seed” and a hub for Chinese linked open data contributions, just like DBpedia did and does for the LOD in general.

It is great stuff indeed. I do have one concern (which, hopefully, is only a matter of presentation, i.e., may be a misunderstanding on my side). Although zhishi.me is linked to non-Chinese datasets (DBpedia and others), the paper talks about a “Chinese Linked Open Data (COLD)”, as if this were something different, something separate. As a non-English speaker myself I can fully appreciate the issues of language and culture differences, but I would nevertheless hate to see the Chinese community develop a parallel LOD, instead of being an integral part of the LOD as a whole. Again, I hope this is just a misunderstanding!

There were a number of ontology or RDF graph visualization presentations, for example from the University of Southampton team (“Connecting the Dots”), on the first results of an exploration done by Magnus Stuhr and his friends in Norway, called LODWheel (the latter was actually at the COLD2011 Workshop), or another one from a mixed team, led by Enrico Motta, on a visualization plugin to the NeOn toolkit called KC-Viz. I have downloaded the latter and have played a bit with it already, but I have not had the time to form a really informed opinion of it yet. Nevertheless, KC-Viz was interesting for me for a different reason. The basic idea of the tool is to use some sort of an importance metric attached to each node in the class hierarchy and direct the visualization based on that metric. It was reminiscent of some work I did in my previous life on graph visualization; the metric was different, the graph was only a tree, and the visualization approach was different, but nevertheless, there was a similar feel to it… Gosh, that was a long time ago!

The paper of John Howse et al. on visualizing ontologies was also interesting. Interesting because different: the idea is a systematic usage of Euler diagrams to visualize class hierarchies, combined with some sort of a visual language for the presentation of property restrictions. In my experience property restrictions are a very difficult (maybe the most difficult?) OWL concept to understand without a logic background; any tool, visual or otherwise, that helps teaching and explaining them can be very important. Whether John’s visual language is the right one I am not sure yet, but it may well be. I will consider using it the next time I give a tutorial…

I was impressed by the paper of Gong Cheng and his friends from Nanjing, “Empirical Study of Vocabulary Relatedness…”. Analyzing the results of a search engine (in this case Falcons) to draw conclusions on the nature, the usage, the mutual relationships, etc., of vocabularies is very important indeed. We need empirical results, bound to real-life usage. This is not the first work in this direction (see, for example, the work of Ghazvinian et al. from ISWC2009), but there is still much to do. Which reminds me of some much smaller scale work Giovanni, Péter and I did on determining the top vocabulary prefixes for the purpose of the RDFa 1.1 initial context (we used to call it default profile back then). I should probably try to talk to the Nanjing team to merge that with their results!

I think the vision paper of Marcus Cobden and his friends (again at the COLD2011 Workshop) on a “Research Agenda for Linked Closed Data” is worth noting. Although not necessarily earthshaking, the fact that we can and should speak about Linked Closed Data alongside Linked Open Data is important if we want the Semantic Web to be adopted and used by the enterprise world as well. One of the main issues, which is not really addressed frequently enough (although there have been some papers published here and there), is access control. Who has the right to access data? Who has the right to access a particular ontology or rule set that may lead to the deduction of new relationships? What are the licensing requirements, and how do we express them? I do not think our community has a full answer to these. B.t.w., W3C is organizing a Workshop concentrating on the enterprise usage of Linked Data in December…

Speaking about research agendas… I really liked Frank van Harmelen’s keynote on the second day of the conference. His approach was fresh, and the question he asked was different: essentially, after 10 or more years of research in the Semantic Web area, can we derive some “higher level” laws that describe and govern this area of research? I will not repeat all the laws that he proposed; it is better to look at his Web site with the HTML version of his slides. The ones that are worth repeating again and again are that “Factual knowledge is a graph”, “Terminological knowledge is a hierarchy”, and “Terminological knowledge is much smaller than the factual knowledge”. Why are these important? To quote from his keynote slides:

  1. traditionally, KR has focussed on small and very intricate sets of axioms: a bunch of universally quantified complex sentences
  2. but now it turns out that much of our knowledge comes in the form of very large but shallow sets of axioms.
  3. lots of the knowledge is in the ground facts, (not in the quantified formula’s)

Which is important to remember when planning future work and activities. “Reasoning”, usually, happens on a huge set of ground facts in a graph, with a shallow hierarchy of terminology…
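In RDF terms, a tiny, made-up illustration of that asymmetry (the ex: and d: namespaces are hypothetical):

@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/vocab#> .   # hypothetical vocabulary
@prefix d:    <http://example.org/data/> .    # hypothetical data

# Terminological knowledge: a small, shallow hierarchy…
ex:Novel  rdfs:subClassOf ex:Book .
ex:Book   rdfs:subClassOf ex:Publication .

# …while the bulk of the knowledge is a large graph of ground facts.
d:book1  rdf:type ex:Novel ;  ex:author d:author1 ;  ex:year "2000" .
d:book2  rdf:type ex:Novel ;  ex:author d:author2 ;  ex:year "1980" .
d:book3  rdf:type ex:Book  ;  ex:author d:author3 ;  ex:year "1999" .

In a real dataset the first part stays a handful of triples while the second part runs into the millions; that is exactly the asymmetry those “laws” capture.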

I was a little bit disappointed by the Linked Science Workshop; probably because I had wrong expectations. I was expecting a workshop looking at how Linked Data in general can help in the renewal of the scientific publication process as a whole (a bit along the lines of the Force11 work on improving the future of scholarly communication). Instead, the workshop was more on how different scientific fields use linked data for their work. Somehow the event was unfocussed for me…

As in some previous years, I was again part of the jury for the Semantic Web Challenge. It was interesting to see how our own expectations have changed over the years. What was really a “wow!” a few years ago has become so natural that we are not excited any more. That is of course a good thing, as it shows that the field is maturing further, but we may need some sort of a Semantic Web Super-Challenge to be really excited again. That being said, the winners of the challenge really did impressive work, and I do not want to give the impression of being negative about them… It is just that I was missing that “wow”.

Finally, I was at one session of the industrial track, which was a bit disappointing. If we wanted to show the research community that Semantic Web technologies are really used by industry, then the session did not really do a good job of that. With one exception, and a huge one at that: the presentation of Yahoo! (beware, the link is to a PowerPoint slide deck). It seems that Yahoo! is building an internal infrastructure based on what they call a “Web of Objects”, regrouping pieces of knowledge in a graph-like fashion. By using internal vocabularies (a superset of schema.org) and the underlying graph infrastructure, they aim at regrouping similar or identical pieces of knowledge harvested on the Web. I am sure we will hear more about this.

Yes, it was a full week…


April 20, 2011

RDFa 1.1 Primer (draft)

Filed under: Semantic Web,Work Related — Ivan Herman @ 10:21

I have had several posts in the past on the new features of RDFa 1.1 and on where it adds functionality to RDFa 1.0. The Working Group has just published a first draft of an RDFa 1.1 Primer, which gives an introduction to RDFa. We did have such a primer for RDFa already, but the new version has been updated in the spirit of RDFa 1.1… Check it out if you are interested in RDFa!

March 29, 2011

LDOW2011 Workshop

Filed under: Semantic Web,Work Related — Ivan Herman @ 15:29

The Linked Data on the Web workshop series (LDOW20XX) has become an integral part of the yearly WWW conferences, and this year was no exception, under the unsurprising name of LDOW2011. And, as always, it was an enjoyable, pleasant event. The organizers (Chris Bizer, Tom Heath, Michael Hausenblas, and Tim Berners-Lee) made the choice of accepting slightly fewer papers to leave room for more discussion. That was a good choice; the workshop was really more of a workshop rather than just a series of presentations, with nice discussions and lots of comments… and that was great.

It is very difficult to summarize a whole day, and I do not want to go through and comment on each individual paper. The papers (and, I believe, soon the presentation slides) are on the Web, of course; it is worth glancing at each of them. For me, and this is obviously very personal, maybe the most important takeaway is actually close to the blog I wrote yesterday on the empirical study of SPARQL queries. And this is the general fact that we are at the point where the size and complexity of the linked open data cloud is such that we can begin to make meaningful measurements, experimental data analyses, empirical studies, etc., to understand how the data is really used out there, what the shape and behavior of the beast is, and how these affect the tools and specifications we develop.

The workshop started with an overview by Chris (I hope his slides will be on the Web at some point) doing exactly that. He looked at the evolution of the LOD cloud and tried to analyze its content. There were some nice figures: the growth in 2010, in terms of the number of triples, was 300%, with some spectacular application areas coming into the game, like a 955% growth of library-related data, or the appearance of governmental data, from nothing in 2009 to about 11B triples in November 2010. Although Danny Vrandecic made the remark at the end of the Workshop that we should stop measuring the LOD cloud in terms of pure number of triples (and I can agree with that), those numbers are nice nevertheless. Some figures were less satisfactory: linkage among datasets is relatively low (90 out of the 200 datasets have only around 1000 links to the outside, and the majority interlink with only one other dataset); only around 9% of the datasets publish machine-readable licenses (although 31% publish machine-readable provenance data, which is a bit nicer). Some of the common vocabularies are widely reused (31% use Dublin Core terms, for example), but way too many dataset publishers define their own vocabulary even when that is not strictly necessary, and only about 7% publish mapping relationships from their own vocabulary to others.

Beyond the numbers themselves, I believe the important point is that somebody does collect and publish these data regularly, so that we understand where we should put the emphasis in the future. For example (and this came up during the discussion), work should be done on simple (in my view, rule based, i.e., RIF or N3 based) mappings among vocabularies, and those mappings should be published for others to use; that figure of 7% is really too low. Work on helping data providers to create additional links easily is another area of necessary improvement (and there were, in fact, several papers on that very topic during the day).
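To give an idea of what even the simplest published mappings could look like, here is a Turtle sketch; the vocab: namespace is made up, and a rule language like RIF or N3 can of course express much richer mappings than these one-liners:

@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:     <http://www.w3.org/2002/07/owl#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .
@prefix vocab:   <http://example.org/myvocab#> .   # hypothetical home-grown vocabulary

# Declarative mapping statements a publisher could ship alongside a
# home-grown vocabulary, so consumers can fall back on widely used terms.
vocab:title    rdfs:subPropertyOf     dcterms:title .
vocab:creator  owl:equivalentProperty dcterms:creator .
vocab:Person   rdfs:subClassOf        foaf:Person .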

I do not know whether it was a coincidence or whether the organizers did it on purpose, but the day ended with a similar paper, but on vocabularies. A group from DERI collected some specific datasets to see how a particular vocabulary (in this case the GoodRelations vocabulary) is being used on the Web of Data, what the usage patterns are, how it can be used for specific possible use cases, etc. The issue here is not the GoodRelations ontology as such (you can see the details of the results in the paper) but rather the methodology: we are at the point where we can measure what we have, and we can therefore come up with empirical data that will help us concentrate on what is essential. I hope this approach will come to the fore more and more in the future. We need it.

It was a good day.

November 23, 2010

My first mapping from RDB to RDF using a direct mapping, cont.

A few days ago I posted a blog on how the RDB to RDF direct mapping could be used for a simple example. I do not want to repeat the whole blog: the essence of it was that database tables were mapped onto a simple RDF Graph (this is what the direct mapping does) and the resulting graph was transformed into the “target” graph using the following SPARQL 1.1 construct:

CONSTRUCT {
  ?id a:title ?title ;
    a:year  ?year ;
    a:author _:x .
  _:x a:name ?name ;
    a:homepage ?hp .
}
WHERE {
  SELECT (IRI(fn:concat("http://...",?isbn)) AS ?id)
          ?title ?year ?name
         (IRI(?homepage) AS ?hp)
  {
    ?book a <Book> ;
       <Book#ISBN>  ?isbn ;
       <Book#Title> ?title ;
       <Book#Year>  ?year ;
       <Book#Author> ?author .
    ?author a <Author> ;
       <Author#Name> ?name ;
       <Author#Homepage> ?homepage .
  }
}

where the trick was to use a nested SELECT whose main job was to create URI references from strings. I realized that if one uses the latest editors’ version of SPARQL 1.1 (i.e., the version that is much closer to what SPARQL 1.1 will eventually be) then the solution is actually simpler, thanks to the variable assignment (BIND) facility that makes the nested SELECT unnecessary:

CONSTRUCT {
  ?id a:title ?title ;
    a:year  ?year ;
    a:author _:x .
  _:x a:name ?name ;
    a:homepage ?hp .
}
WHERE {
  ?book a <Book> ;
     <Book#ISBN>  ?isbn ;
     <Book#Title> ?title ;
     <Book#Year>  ?year ;
     <Book#Author> ?author .
  ?author a <Author> ;
     <Author#Name> ?name ;
     <Author#Homepage> ?homepage .
  BIND (IRI(fn:concat("http://...",?isbn)) AS ?id)
  BIND (IRI(?homepage) AS ?hp)
}

which makes, at least in my view, the mapping even clearer.

But SPARQL is not the only way to transform the graph. Another possibility is to use RIF Core. Essentially the same transformation can indeed be expressed using the RIF Presentation syntax. Here it is (with a little help from Sandro Hawke and Harold Boley):

Forall ?book ?title ?author ?isbn ?year ?id (
  ?id[a:year->?year a:title->?title a:author->?author] :-
    And(
      ?book[rdf:type-> <Book>
             a:isbn->?isbn
             a:title->?title
             a:year->?year
             a:author->?author]
      External(pred:iri-string(?id External( func:concat("http://..." ?isbn ) )))
    )
)
Forall ?author ?name ?hp ?homepage (
 ?author[a:name->?name a:homepage->?hp] :-
   And(
        ?author[rdf:type-> <Author>
                a:name->?name
                a:homepage->?homepage]
        External(pred:iri-string(?hp ?homepage))
  )
)

(As in the earlier examples, I did not put the prefix declarations and other syntactic elements into the code above.)

The only difference between the two is that I retained the URI for the author, because generating a blank node on the fly in RIF Core does not seem to be possible. A better solution would be, probably, to mint a URI from the ?author variable just like I did for the ISBN value. Other than that, the two solutions are pretty much identical…

November 19, 2010

My first mapping from RDB to RDF using a direct mapping

A few weeks ago I wrote a blog on my first RDB to RDF mapping using R2RML; the W3C RDB2RDF Working Group had just published a first public Working Draft for R2RML. That mapping was based on a specific mapping language (i.e., R2RML), whose processing is done by, for example, the database system: interpreting the language, using some SQL constructions, etc. The R2RML processing depends on the specific schema of the database, which guides the mapping.

As I already mentioned in that blog, a “direct” mapping was also in preparation by the Working Group; well, the first public Working Draft of that mapping has just been published. That mapping does not depend on the schema of the database: it defines a general mapping of any relational database structure into RDF; only a base URI has to be specified for the database, and everything else is generated automatically. The resulting RDF graph is of course much coarser than the one generated by R2RML: whereas the result of an R2RML mapping may be a graph using well-specified vocabularies, for example, this is not the case for the output of the direct mapping. But that is not really a problem: after all, we have SPARQL or RIF to transform graphs! I.e., the two approaches are really complementary.

What I will do in this blog is to show how the very same example as in my previous blog can be handled by a direct mapping. As a reminder: the toy example I use comes from my  generic Semantic Web tutorial. Here is the (toy) table:

which is then converted into an RDF Graph:

(Just as in the previous case I will ignore the part of the graph dealing with the publisher, which has the same structure as the author part. I will also ignore the prefix definitions.)

The direct mapping of the first and second tables is pretty straightforward. The URI-s are a bit ugly but, well, this is what you get when you use a generic solution. So here it is:

@base <http://book.example/> .
<Book/ID=0006511409X#_> a <Book> ;
  <Book#ISBN> "0006511409X" ;
  <Book#Title> "The Glass Palace" ;
  <Book#Year>  "2000" ;
  <Book#Author> <Author/ID=id_xyz#_> .

<Author/ID=id_xyz#_> a <Author> ;
  <Author#ID> "id_xyz" ;
  <Author#Name> "Ghosh, Amitav" ;
  <Author#Homepage> "http://www.amitavghosh.com" .

Simple, isn’t it?

The result is fairly close to what we want, but not exactly. First of all, we want to use different vocabulary terms (like a:name). Also, note that the direct mapping produces literal objects most of the time, except when there is a “jump” from one table to another. Finally, the resulting graph should use a blank node for the author, which is not the case in the generated graph.
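For reference, here is, roughly, the kind of target graph we are after, written in Turtle; it is reconstructed from the direct-mapping output above and the query below, with the a: prefix standing in for the tutorial’s vocabulary and the base of the minted book URI elided just as in the query:

@prefix a: <http://example.org/book-vocab#> .   # placeholder for the tutorial's vocabulary URI

<http://...0006511409X>
    a:title  "The Glass Palace" ;
    a:year   "2000" ;
    a:author [
        a:name     "Ghosh, Amitav" ;
        a:homepage <http://www.amitavghosh.com>
    ] .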

Fortunately, we have tools in the Semantic Web domain to transform RDF graphs. RIF is one possible solution; another is SPARQL, using the CONSTRUCT form. Using SPARQL is an attractive solution because, in practice, the output of the direct mapping may not even be materialized; instead, one would expect a SPARQL engine attached to a particular relational database, mapping the SPARQL queries to the table on the fly. I will use SPARQL 1.1 below because that gives nice facilities to generate RDF URI Resources from strings, i.e., to have “bridges” from literals to URI-s. Here is a possible SPARQL 1.1 query/construct that could be used to achieve what we want:

CONSTRUCT {
  ?id a:title ?title ;
    a:year  ?year ;
    a:author _:x .
  _:x a:name ?name ;
    a:homepage ?hp .
}
WHERE {
  SELECT (IRI(fn:concat("http://...",?isbn)) AS ?id)
          ?title ?year ?name
         (IRI(?homepage) AS ?hp)
  {
    ?book a <Book> ;
      <Book#ISBN> ?isbn ;
      <Book#Title> ?title ;
      <Book#Year>  ?year ;
      <Book#Author> ?author .
    ?author a <Author> ;
      <Author#Name> ?name ;
      <Author#Homepage> ?homepage .
  }
}

Note the usage of a nested query; this is used to create new variables representing the URI references to be used by the outer query. The key is the IRI operator. (Both the nesting and the AS in the SELECT are SPARQL 1.1 features.)

That is it. Of course, the question does arise: which one should one use? The direct mapping or R2RML? Apart from the possible restriction that the local database system may implement only the direct mapping, it also becomes a question of taste. The heavy tool in R2RML is, in fact, the embedded SQL query; if one is comfortable with SQL then that is fine. But if the user is more comfortable with Semantic Web tools (e.g., SPARQL or RIF) then the direct mapping might be handier.

(Note that these are still evolving documents. I already know that my previous blog is wrong in the sense that it is not in line with the next version of R2RML. Oh well…)

November 2, 2010

My first mapping from RDB to RDF using R2RML

The W3C RDB2RDF Working Group has just published a first public Working Draft for the standardized RDB->RDF mapping language called R2RML. I decided that the only way to understand a specification like that is to try to use it for an example. Caveat: this is a “First Public Working Draft” for R2RML, so many things still have to happen and there will be changes.

For several years now I have used a simple example in my generic Semantic Web tutorial (see, e.g., the one at SemTech). It is an artificial example referring to an imaginary bookshop’s table:

which is then converted into an RDF Graph:

(And the tutorial story is how this graph can be merged with a graph coming from another bookshop’s data.) Up until now I have always glossed over how this mapping is done. Well, how could it be done with R2RML?

R2RML defines mappings that describe how an RDB table is mapped to triples. (R2RML is itself in RDF, b.t.w.) Simply put, in R2RML, each row of a table is mapped to an RDF subject; the individual cells and their column names provide the objects and the predicates, respectively.

If we look at the middle table in the example, it corresponds to the lower right hand part of the graph. The R2RML mapping has to specify that the homepage column should actually produce an RDF resource (a URI reference) and not a string literal. Furthermore, the first column should become a blank node; that has to be specified, too. Here is the way this is all specified:

:Table2 rdf:type rr:TriplesMap ;
    rr:logicalTable "Select  ("_:" || ID) AS pid, Name, ("<" || Homepage || ">) AS Home from person_table";
    rr:subjectMap [ a rr:BlankNodeMap ; rr:column "pid" ; ] ;
    rr:propertyObjectMap [ rr:property a:name; rr:column "Name" ] ;
    rr:propertyObjectMap [ a rr:IRIMap ; rr:property a:homepage; rr:column "Home" ] .

What happens here is:

  1. a mapping is defined that turns the original table into a virtual, “logical” table using SQL. The goal here is to generate a blank node ID on the fly, and a URI in NTriple syntax (note, however, that I am not sure it is o.k. to use that approach in the spec!);
  2. the subject for the triples is chosen to be a cell in a specific column (“pid”, generated by the SQL transform of the previous point), and it is also specified that this is a blank node;
  3. the other two properties are specified (for the same subject); the one for the home page also specifies that the object must be a URI resource (as opposed to a Literal).
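Just to visualize what this buys us: for the middle table, the mapping above would produce, roughly, the following fragment (the author row is the same toy data used in the direct-mapping posts earlier on this page; the a: prefix again stands in for the tutorial’s vocabulary):

@prefix a: <http://example.org/book-vocab#> .   # placeholder for the tutorial's vocabulary URI

# One row of person_table becomes a blank-node subject, with the homepage
# turned into a URI resource rather than a plain string.
_:id_xyz  a:name     "Ghosh, Amitav" ;
          a:homepage <http://www.amitavghosh.com> .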

That is it. The mapping of the bottom table to the lower left hand corner of the graph is quite similar; I will not go into it here.

But we still need the “root”, so to speak, i.e., the node in the upper right hand corner and the top portion of the graph (with the title and the year); and, mainly, we also have to relate that root to the portion of the graph that is generated from the middle table.

First, the following R2RML part does the job of generating the top part of the graph:

:Table1 rdf:type rr:TriplesMap ;
    rr:logicalTable "Select ('<http:..isbn/' || ISBN || '>') AS isbn, 
                     Author, Title, Publisher, Year from book_table";
    rr:subjectMap [ rdf:type rr:IRIMap ; rr:column "isbn" ] ;
    rr:propertyObjectMap [ rr:property a:title ; rr:column "Title" ; ] ;
    rr:propertyObjectMap [ rr:property a:year ; rr:column "Year" ; ] ;

The only role of the mapping to a logical table is to generate a URI from the ISBN; all the other cells are, conceptually, simply copied into the logical table. The rest is fairly straightforward.

The missing trick is to combine, i.e., to “join”, the two tables in the graph. R2RML has a separate construction for that, referred to as “mapping” the foreign keys. The following additional statements should be added to :Table1:

    rr:foreignKeyMap [ 
       rr:key a:author ; 
       rr:parentTriplesMap :Table2 ; rr:joinCondition "{child}.Author = {parent}.pid"
    ] .

This combines the nodes defined by :Table1 with those of :Table2. And voilà! We’re done: the R2RML document is ready, i.e., an R2RML engine would turn my example tables into my example graph.

Of course, there are more complicated possibilities. Triples, or whole rows, can be explicitly stored in a specific named graph, for example. Or a column defining a predicate could, actually, use a cell in another column as an object. Etc. And, to be honest, I am not even 100% sure that the above is correct; I may have misunderstood some details. But the “melody” is still clear.

Note the role played by the SQL-based mapping of the original table to the logical table. For SQL experts, most of the work can be done there, i.e., the resulting RDF graph can be ready for further usage by an application, to be linked into the LOD, to be used with the right attributes, namespaces, etc. Which is very powerful indeed, provided… the user has the necessary SQL expertise. And, while that is obviously true for database managers, it is not necessarily true for RDF experts. For those, a slightly different model seems to be more appropriate: they would prefer to get an RDF graph ASAP, so to speak, without any fancy transformation, and would then use RIF, SWRL, SPARQL’s CONSTRUCT, etc., to turn it into the RDF graph they eventually want to have. In other words, they may not need the concept of a logical table. That is what is referred to by the group as the “default” mapping. I.e., what graph does one get if nothing is specified? If that is properly defined then, say, RIF experts can use their expertise instead of SQL. This default mapping is not yet fully specified by the group, but it is on its way; it will be published shortly, and will complete the R2RML picture. So watch that space…
