Ivan’s private site

July 4, 2009

Dagstuhl Workshop on Semantic Web

Dagstuhl castleI have just come back from the Workshop “Semantic Web: Reflections and Future Directions”, held in Dagstuhl, Germany. Organized by John Domingue, Rudi Studer, Jim Hendler, and Dieter Fensel, the workshop positioned itself as the “second release” of a similar workshop that was held at the same place 10 years ago.

The first two days of the workshop were more traditional, in the sense that it was series of presentations and panels. This was the “reflection” part of the workshop: looking back to 10 years’ of history as well a peek into the current state of the art. It was interesting but, for my taste, a bit too long; the programme of the two days could have been compressed into one or, say, one and a half days. That would have given more time to the “future directions” part, ie, discussions in break out groups on various topics. I enjoyed those a lot: free flowing discussions on various topics, helping to exchange ideas, experiences, pointers at other works and results, and crystallizing possible future R&D issues. These discussions took place in a very pleasant, relaxed atmosphere among people who mostly knew one another already, ie, we could really concentrate on issues. Each group formulated a number of research goals for the years to come; some group also came up with more practical steps and goals.

As far as I know, the workshop organizers plan to collect all those research issues in some more coherent form, so we should watch this space. In what follows I just collect some issues that I took away from the workshop without the goal of being exhaustive; indeed, there were 6-7 parallel break out groups.

Issues around Web scale. This is clearly one of the major topics of the day. What happens when one has to deal with data containing billions of triples, when the data (ie, the triples) are “dirty”, ie, inconsistent, faulty, etc. Think of the Linked Open Data cloud, of data coming from sensor networks, mobiles, etc. Do we have to re-think all the notions that the Semantic Web inherited from the logic world, ie, completeness, meaning and consequences of consistency, what it means to get results for a query, etc? This is one area where opinions tend to diverge a lot. Some would prefer to completely put aside the traditional logic approaches (rules, descriptions logic, ontologies, OWL, etc), while others may argue that the advances in computing, in reasoning engines and methods are (and are expected to be) such that these methods should still be just as usable as before. As always, I hate any black-and-white statements… I do not think dismissing an area of technology is the right way but, also, other avenues, or new viewpoints should to be explored, too (e.g., how to react on inconsistencies, trying to get possibly incomplete results but whatever can be obtained within, say, 2 minutes, that sort of things). What approach would be used is very much dependent of the application. Anyway… Web scale is a major issue, everybody agrees on that!

Interaction. This is one of the break out groups that I did not attend, unfortunately. And obviously a hugely important direction of future R&D. Many Semantic Web applications today are such that their user interface is just standard because all Semantic Web related work happens behind the scenes, usually on the server side. However, on long term, there is a clear need for programs that could somehow directly show the data in some friendly way, programs that self-adapt themselves to the nature of the data. Not only for experts, but also for laypeople. Such environments may not only include extensions of current browsers but, eg, full desktop environments. Sort of intelligent, data-oriented user interfaces. A major research problem (user interface methodology is always a major problem, whether related to Semantic Web or not…), but also a hugely exciting research and development opportunities!

Vocabularies. There was a separate group on the management of vocabularies, which has identified a number of R&D issues: how does one describe a vocabulary, its interdependence with other vocabularies, how does one rank vocabularies… These are all fundamental question to solve to be able to find vocabularies for a specific purpose, to make specialized search. There are also issues around archiving, providing stable URI-s; last but not least (and this goes way beyond vocabularies only) major legal issues on what type of attribution, copyright or other legal machinery are to be used with vocabularies (it was good to have Tom Heath, who could tell us a bit about the datacommons’ approach). As an example of the many technological problems arising, the break-out groups coined the term “cherry picking of terms”. Although OWL has a mechanism for import, the practice of the RDF world is to use (ie, “cherry pick”) vocabulary terms (predicates, classes, etc) from various different vocabularies without necessarily taking the whole vocabulary, and certainly without using the owl:import predicates (think of routine usage of dc:title without importing the full Dublin Core vocabulary). How would a reasoner treat those? It may be a little bit easier to use a more rule based approach (like OWL RL) although it is not obvious how to cherry pick just the right amount of information on a, say, predicate. But Ian Horrocks also drew my attention on formal ontology modularization work that might be very relevant here; item added to my “to-be-read” list…

Provenance (and trust). One of the issues that popped up in all other break out groups; in consequence a separate one was formed on the second day of discussions. It is indeed one of the questions that anyone who talks about Semantic Web gets; in my personal view, having a clear “story” to tell about provenance is essential for a further deployment of this technology. The discussion in the group was really interesting because this issue raises a number of other questions, like the overall relationship of cryptographic techniques and the Semantic Web, what it means to have trust in context, what are the relationships to temporal or uncertainty reasoning, etc, etc, etc. It was also interesting for me to hear about other works, like the Open Provenance Model, albeit some of these were not necessarily done by Semantic Web people (eg, by the database community). We agreed that a Wiki page will be created (probably at RPI, set up by Deb McGuinnis) to collect information on this subject, and forming a W3C Incubator Group might also be in the books to provide a more thorough state-of-the-art. A long list of additional items to my “to-be-read” pile is coming…

And, of course, it was also good to meet a bunch of people, discuss things at lunch or dinner. This type of interaction is really fruitful. And there was also intensive twittering going on (using the #swdag2009 tag, pointing to a bunch of other reseources) although this time I did not twitter too much because I had problems with my wireless card:-(

It was a good meeting; thanks for the organizers. Would be good not to wait another 10 years for the next incarnation of this event…

June 19, 2009

SemTech2009 impressions

The first and possibly most important aspect of SemTech 2009 is that… it happened! I must admit that back in April-May, when the conference’s Web Site did not include any news of the program yet, I was a bit concerned that the general economic malaise would kill this year’s conference. O.k., I might have been paranoiac, but I think some level of concern was indeed legitimate. And… not only did the conference happen as planned, but the numbers were essentially the same as last year’s (over 1000). I think that by itself is an important sign of the interest in Semantic Technologies. Kudos to the organizers!

A general trend that was reaffirmed this year: by now, Semantic Web technologies are the obvious reference points for almost all presentations, products, etc, that were presented at the event. RDF(S), RDFa, OWL, SPARQL, etc, have become household names; newer specs like SKOS or POWDER may not have been as widely referred to yet, but I am sure that will come, too. Linked Data (and, more specifically, the Linked Open Data cloud) were almost ubiquitous this year while I do not believe that it was even mentioned last year. That is a huge change (although I still miss real “user facing” applications of LOD to show up; some, like Talis’ system deployed at UK universities, were presented but not as part of the regular conference). All that being said, I somehow seem to have missed more sessions than last year, which make my impressions more patchy. There were several journal interviews that I could not escape, hallway discussions that were great but made me miss a presentation here and there… I guess this is what happens when you have such a number of people around!

Tom Tague (from Open Calais) gave a very nice opening keynote. His talk was actually not on Open Calais (he did that in 2008), but rather on his experience in talking to different people who tried to start up new ventures in the Semantic Web area (a quote from his talk: “in 80% of the discussions I did not understand what the vendors wanted, and I walked away with my cheque book intact… Simplify!”). The main areas that he looked at were tools, social, advertising, search, publishing, user interface. One of the remarks I liked was on search: in his view (and I think I agree with that) Semantic Technologies may not be really interesting for general search (where the statistical, i.e., brute force methods work well) but for specialized, area-specific search tools (things like GoPubMed or applications deployed at, eg, Eli Lilly or experimented with at Elsevier come to my mind as good examples). Similarly, these technologies are not necessarily of interest for general, “robotic” publication tools like Google’s news, but for high quality publishing, with possible editorial oversight (reducing costs and difficulties).

(He also had a nice text on one of his slides: “Web2.0: Take Web 1.0, add a liberal dash of social, generous amounts of user generated content, atomize your content assets and stir until fully confused”:-)

Tom Gruber talked about his newest project: SIRI. A super-duper personal assistant running on an iPhone with conversational (voice directed) interface. The group behind it integrates a bunch of info on the Web (the “usual” stuffs like restaurants and travel sites), categorize them, and hide the complexities behind a sexy user interface. The problem I have is that I just do not see how this would scale. I see one of the major promises of the Semantic Web getting data in RDF out there so that such, essentially mash-up applications would become much easier to create and maintain. Until then, it is really tedious… On a more personal note, I am not sure I would like the voice conversational interface. I know that I have never used the voice commands on my phone for example; I do not feel comfortable with it. But, well, that is probably only me…

Chime Ogbuji made a really nice presentation on the system they have developed at the Cleveland Clinic. Great combination of RDF, OWL, and SPARQL. The interesting aspect (for me) was that usage of a medical expert system called Cyc, which is used to convert the doctor’s question in natural language (insofar as a question full of medical jargon can be considered as “natural”:-) into, essentially, a SPARQL query. The medical ontologies are used to direct this conversion process, and then the triple store could be queried through the generated query. Impressive work. (Part of it was documented in a W3C use case, but this presentation had a different emphasis.)

Unfortunately, I had to skip Peter Mika’s presentation on the SearchMonkey experiences, I will have to look at his slides… But, as a last minute addition to the program, the organizers succeeded in getting Othar Hansson and Kavi Goel to talk about Google’s rich sniplets. I have already blogged on this a few weeks ago but this presentation made the goal of the project way more understandable. Essentially, by recognizing specific microformat or RDFa vocabularies, they can improve the user experience by adding extra information on the search result. It is interesting to observe the difference between Yahoo! and Google in this respect: both of them use microformats/RDFa for the same general goal but, whereas Yahoo! relies on the community providing applications and on users personalizing their own search result page, Google controls the output in a generic way that does not require further user actions. It will be interesting to see how these differences influence people’s usage patterns. There were some discussion on the Google’s choice on vocabularies; the presenters made it quite clear that they are perfectly happy using other vocabularies (eg, vCard or FOAF) if they become pervasive, and this is a discussion that Google plans to engage with the community. There is of course a chicken-and-egg issue there (if a vocabulary is known by Google, then it will be more widely used, too), and this is cleary an area to discuss further. But these are details. The very fact that both Yahoo! and Google look at microformats and RDFa is what counts! Who would have thought just about a year ago?

I was not particularl impressed by the Semantic Search panel. I had the impression that the participants did not really know what they should say and talk about:-(

Nice presentation by Jeffrey Smitz from Boeing on a system called SPARQL Server pages. Essentially: the user can use similar structures like, say, a PHP page, ie, a mixture of HTML tags and server “calls”, except that this “calls” refer to SPARQL queries against a triple store on the server. Their system also includes some rule based OWL reasoning on the server side, although I am not sure I got all the details. All in all, the system seemed a bit complex, but the general approach is interesting! And it is nice to see that a company like Boeing seems to make good use of RDF+OWL+SPARQL; it would be good to know more…

I missed Zepheira’s presentation on freemix which is a shame, but, well, it happens. But I did play with freemix before travelling to San Jose;  I called it “Exhibit for the masses”. And this, I think, is a fair characterization. David Huynh’s exhibit is a really nice tool, but it is not easy to use it. On the other hand, it took me about 2 minutes to make a visualization of a json data set I used for an exhibit page elsewhere…

Andraz Tori talked about Common tag, a small vocabulary that, for example, can be used when marking up texts with tags (something that engines like Zemanta or Open Calais do). Bringing the RDF and the tagging worlds together is really important; I am very curious how successful this initiative will be…

The keynote on the last day was from the New York Times (by Evan Sandhaus and Robert Larson). It was quite interesting to see how a reputable journal like the NYT has developed a tradition of indexing, abstracting, cataloging articles, how these are archived and searched. Impressive. It is also great that the NYT Annotated Corpus has been released to the Research community. I did not know about that and, I presume, this must be a great resource for a lot of people active in the are of, say, natural language processing. Finally they announced their intention to release their thesaurus in a Semantic Web format, to add a “blob” to the Linked Data Cloud. They still have to work out the details (and expect feedback from the community) and I would hope they would publish a SKOS thesaurus and might even annotate the news items on their web site using this thesaurus in RDFa. But something in this space will happen, that is for sure! Other reputable newspapers, like Le Monde, the Guardian, NRC Handelsblatt,  el Pais, will you follow?

I also had my share of talking: gave an intro tutorial to SW, gave an overview of what is happening at W3C (quite a lot this year, including the finalization of POWDER, OWL 2, and SKOS!) and participated at an OWL 2 panel (with Mike Smith, Zhe Wu, Deb McGuinnis, and Ian Horrocks). I was quite happy with the tutorial and the way the panel went; the audience for the talk could have been a bit larger. But, well…

It was a long week, long trips, not much sleep… but well worth it!

Reblog this post [with Zemanta]

May 1, 2009

Library of Congress Subject Headings in SKOS on line

We may remember the experimental lcsh.info site that published the LoC vocabularies on line in SKOS. Well, while lcsh.info was closed down at some point, the official version is now up and running at http://id.loc.gov/authorities/.

This is great news. This means that all items in the LoC subject heading have now a stable, dereferancable URI; the URI refers to a (SKOS) Concept and is linked to broader and narrower terms. The implementation follows the LOD principles; the URI can be dereferenced in an HTML browser providing an RDFa annotated HTML page. To take an example, the subject heading for “Semantic Web” has the URI: http://id.loc.gov/authorities/sh2002000569#concept; dereferencing it leads to an XHTML+RDFa page. The RDF content can be accessed describing the concept in SKOS, providing a link to, eg, the concept of World Wide Web. This vocabulary gives a stable target to characterize the subject of various entities using, eg, the Dublin Core “subject” term. The site provides a search page and the whole dataset can also be downloaded in RDF/XML or n triples.

This is a huge service of the Library of Congress to the Semantic Web community. Thanks to Ed Summers and anybody else who took an active role in it!

April 27, 2009

Simple OWL 2 RL service

The W3C OWL Working group has published a number of OWL 2 documents last week. This included an updated version of the OWL 2 RL profile. I have already blogged about this profile (“Bridge Between SW communities: OWL RL”) when the previous release was published; there are no radical changes in this release, so there is no reason to repeat what was said there.

I have been playing with a simple and naive implementation of OWL 2 RL for a while; I have now decided to live dangerously;-) and release the software and the corresponding service. So… you can go to the OWL 2 RL generator service, give an RDF graph, and see what RDF triples an OWL 2 RL system should generate. It should give you some ideas of what OWL 2 RL is all about.

I cannot emphasize enough that this is not a production level tool. Beyond the bugs that I have not yet found, a proper implementation would, for example, optimize the owl:sameAs triples and, instead of storing them in the graph, would generate those on the fly when, say, a SPARQL request is issued. But my goal was not to produce something optimal; instead, I wanted to see whether OWL 2 RL can be implemented without any sophisticated tool or not. The answer is: yes it can. This also means that if I could do it, anybody with a basic knowledge of the underlying RDF environment and programming language (RDFLib and Python in this case) can do it, too. No need to be familiar with any complex algorithms, rule language implementation tricks, complicated external tools, description logic concepts, whatever…

February 4, 2009

Hillary Clinton’s foaf entry…

Talking about RDF getting into the mainstream! John Musser just reported on the US Congress SpaceBook system that also includes the foaf data of, eg, senators. And he explicitly quotes the foaf data of Hillary Clinton (see that article for the details)…

Hm, hm. The foaf file is actually included in the page for Hillary Clinton as an… XML comment. Oops. Maybe its time to put this into RDFa… Also, as far as I know (I am not in the US so my knowledge of US regulation is not that good), now that she is Secretary of State, she has to give up here senator’s seat, right? If so, that URI will disappear…

Anyway, let us not grumble too much. It is a start!

December 9, 2008

Zemanta and the Linked Data Cloud…

A few weeks ago I blogged on Open Calais and the Linked Data Cloud. I just received a comment on that blog:

Hi Herman,

yes, today Zemanta’s API officially stopped flying low on the radar and was released to wider public.

It does support RDF output, links to Linking Open Data entities and has properly defined namespace:

http://www.zemanta.com/api/

which is great! B.t.w., just as a pure coincidence, a new SW Use Case was published yesterday on Faviki, one of Zemanta’s user…

December 7, 2008

RDFa as an RDF serialization syntax

Just a small thing, really… On a blog Sebastian Heath announced that the Greek, Roman and Byzantine Pottery at Ilion (GRPBIlion) database exports its content as Linked Data using RDFa: see the HTML page itself or its RDF representation via RDFa Distiller. Mark Birbeck already blogged about this on RDFa’s blog site.

What caught my attention, however, is a question on the original blog by Sergey Chemyshev who asked:

Is there any reason why you used XHTML+RDFa instead of RDF/XML or N3 for this? It looks like http://classics.uc.edu/troy/grbpottery/database.html doesn’t have any HTML for presentation purpose, just to produce RDF triples – what was your reasoning for embedding them into HTML?

Of course, this is a valid question but… isn’t the answer “so what” (no offense to Sergey!)? What I mean is: for most of the RDF data around RDFa is a valid serialization just as RDF/XML or Turtle are. Of course, the main message around RDFa is that it gives you tools to add RDF data to an (X)HTML page. But we should realize that one can look at it in a different way, too: it is a tool to serialize your RDF graph so that it can be displayed in a human readable form easily via a browser. The HTML side of is not necessarily fancy; it can be, actually, very simple as indeed Sebastian’s page is. But it is readable for humans, certainly more readable than RDF/XML or even Turtle would be. And I think that may be enough of a reason to use RDFa in such circumstances, too!

December 3, 2008

Bridge between SW communities: OWL RL

The W3C OWL Working Group has just published a series of documents for the new version of OWL, most of them being so-called Last Call Working Drafts (which, in the W3C jargon, means that the design is done; after this, it will only change in response to new problems showing up).

There are many aspects of the new OWL 2 that are of a great interest; I would concentrate here on only one of those, namely the so-called OWL RL Profile. OWL 2 defines several “profiles”, which are subsets of the full OWL 2; subsets that have some good properties, e.g., in terms of implementability. OWL RL is one of those. “RL” stands for “Rule Language” and what this means is that OWL RL is simple enough to be implemented by a traditional (say, Prolog-like) rule engine or can be easily programmed directly in just about any programming language. There is of course a price: the possibilities offered by OWL RL are restricted in terms of building a vocabulary, so there is a delicate balance here. Such rule oriented versions of OWL have also precedences: Herman ter Horst published, some years ago, a profile calld pD*; a number of triple store vendors have a similar, restricted versions of OWL implemented in their systems already, referred to as RDFS++, OWLPrime, or OWLIM; and there has been some more theoretical work done by the research community in this direction, too, usually referred to by “DLP”. The goal was common to all of these: find a subset of OWL that is helpful to build simple vocabularies, and that can be implemented (relatively) easily. Such subsets are also widely seen as more easily understandable and usable by communities that work with RDF(S) and need only a “little bit of OWL” for their applications (instead of building more rigorous and complex ontologies which requires extra skills they may not have). Well, this is  the niche of OWL RL.

OWL RL is defined in terms of a functional, abstract syntax (defining a subset of DL) as well as a set of rules of the sort “if that and that triple pattern exists in the RDF Graph then add these and these triples”. The rule set itself is oblivious to the DL restrictions in the sense that it can be used on any RDF graphs, albeit with a possible loss of completeness. (There is a theorem in the document that describes the exact situation if you are interested.)

The number of rules is fairly high (74 in total), which seems to deceive the goal of simplicity. But this is misleading. Indeed, one has to realize that, for example, these rules subsume most of RDFS (e.g., what the meaning of domain, range, or subproperty is). Around 50 out of the 74 rules simply codify such RDFS definitions or their close equivalents in OWL (what it means to be “same as”, to have equivalent/disjoint properties or classes, that sort of things). All of these are simple, obvious, albeit necessary rules. There are only around 20 rules that bring real extra functionality compared to RDFS for building simple vocabularies. Some of these functionalities are:

  • Characterization of properties as being (a)symmetric, functional, inverse functional, inverse, transitive,…
  • Property chains, ie, defining the composition of two or more properties as being the sub property of another one. (Remember the classic “uncle” relationship that cannot be expressed in terms of OWL 1? Well, by chaining “brother” and “parent” one can say that the chain is a subproperty of “uncle” and that is it…)
  • Intersection and union of classes
  • Limited form of cardinality (only maximum cardinality and only with values 0 and 1) and  property restrictions
  • An “easy key” functionality, i.e., deducing the equivalence of two resources if a list of predefined properties have identical values for them (e.g., if two persons have the same name, same email address, and the same home page URI, then the two persons should be regarded as identical)

Some of these features are new in OWL 2 (property chaining, easy keys), others are already been present in OWL 1.

Quick and dirty implementations of OWL RL can be done fairly easily. Either one uses an existing rule engine (say, Jena rules) and lets the rule engine take its course or one encodes the rules directly on top of an RDF environment like Sesame, RDFLib, or Redland, and uses a simple forward chaining cycle. Of course, this is quick and dirty, i.e., not necessary efficient, because it will generate many extra triples. But if the rule engine can be combined with the query system (SPARQL or other), which is the case for most triple store implementations, the actual generation of some of those extra triples (e.g., <r owl:sameAs r> for all resources) may be avoided. Actually, some of the current triple stores already do such tricks with the OWL profiles they implement. (And, well, when I see the incredible evolution on the size and efficiency of triple stores these days, I wonder whether this is really an issue on long term for a large family of applications.) I actually did such quick and dirty implementation in Python; if you are curious what triples are generated via OWL RL for a specific graph, you can try out a small service I’ve set up. (Caveat: it has not been really thoroughly tested yet, i.e., there are bugs. Neither it is particularly efficient. Do not use it for anything remotely serious!).

So what is the possible role of OWL RL in developing SW applications? I think it will become very important. I usually look at OWL RL as some sort of a “bridge” that allows some RDF/SW applications to evolve in different directions. Such as:

  • Some applications may be perfectly happy with OWL RL as is (usually combined with a SPARQL engine to query the resulting, expanded graph), and they do not really need more in term of vocabulary expressiveness. I actually foresee a very large family of applications in this category.
  • Some applications may want to combine OWL RL with some extra, application specific rules. They can rely on a rule engine fed with the OWL RL rules plus the extra application rules. B.t.w., although the details are still to be fleshed out, the goal is that a RIF implementation would accept OWL RL rules and produce what has to be produced. I.e., RIF compatible implementation would provide a nice environment for these types of applications.
  • Some applications may hit, during their evolution, the limitations of OWL RL in terms of vocabulary building (e.g., they might need more precise cardinality restrictions or the full power of property restrictions). In which case they can try expand their vocabulary towards more complex and formal ontologies using, e.g., OWL DL. They may have to accept some more restrictions because they enter the world of DL, and they would require more complex reasoning engines, but that is the price they might be willing to pay. While developers of applications in the other categories would not necessarily care about that, the fact that the language is also defined in terms of a functional syntax makes (i.e., that that version of OWL RL is integral part of OWL 2) this evolution path easier.

Of course, at the moment, OWL RL is still a Draft, albeit in Last Call. Feedbacks and comments of the community as well as the experience of implementers is vital to finalize it. Comments to the Working Group can be sent to public-owl-comments@w3.org (with public archives).

November 14, 2008

Calais Release 4 and the Linking Data cloud…

Just got to this news via Yves’s blog: Reuters’ Open Calais service comes with a new release in January, and this will bind to the Linked Data cloud. To quote the official blog of Reuters:

Release 4 of Calais will be a big deal. In that release we’ll go beyond the ability to extract semantic data from your content. We will link that extracted semantic data to datasets from dozens of other information sources, from Wikipedia to Freebase to the CIA World Fact Book. In short – instead of being limited to the contents of the document you’re processing, you’ll be able to develop solutions that leverage a large and rapidly growing information asset: the Linked Data Cloud.

Ie: when analyzing a text, Open Calais will return URIs into DBPedia, Freebase, Musicbrainz… Thereby opening up the possibility for various of applications that would not be possible (or would be fairly complicated) without. One more step to make it possible to reuse all those data on the Web… Yey!

B.t.w.: I write these lines using WordPress and I have Zemanta’s Firefox plugin running to generate the tags. However, as far as I know (I may be wrong!), the Zemanta service does not provide those URI-s yet (they do provide some URI-s in their return format, but I am not sure those are LOD URIs). Maybe some day?

(Thanks to Yves for drawing my attention on this…)

(Note after the original publication of the blog: it seems I was wrong and Zemanta does have a similar feature, see Andraz’ comment.)

November 9, 2008

Open Archive Initiative’s aggregation vocabulary

A few days ago I read through the OAI’s aggregation vocabulary (officially: “Object Reuse and Exchange”). One of those very targeted, relatively small RDF vocabularies that may, nevertheless, become important in practice. Instead of getting into a description in my own words, let me be lazy :-) and copy from the introduction of the ORE Primer:

In the physical world we create, use, and refer to aggregations of things all the time. We collect pictures in a photo album, read journals that are collections of articles, and burn CDs of our favorite songs. In this physical world these aggregations are frequently tangible – we can hold the photo album, journal, and CD […].

This practice of aggregating extends to the Web. We accumulate URL’s in bookmarks or favorites lists in our browser, collect photos into sets in popular sites like Flickr […]. Despite our frequent use of these aggregations, their existence on the Web is quite ephemeral. One reason for this is that there is no standard way to identify an aggregation. We often use the URI of one page of an aggregation to identify the whole aggregation. For example, we use the URI of the first page of a multi-page Web document to identify the whole document, or we use the URI of the HTML page that provides access to a Flickr set to identify the entire set of images. But those URIs really just identify those specific pages, and not the union of pages that makes up the whole document, or the union of all images in a Flickr set, respectively. In essence, the problem is that there is no standard way to describe the constituents or boundary of an aggregation, and this is what OAI-ORE aims to provide.

The ORE introduces therefore a strict separation between an Aggregation (which is supposed to be a non Informational Resource, in TAG speak) and a separate Resource Map which is an informational resource describing a specific Aggregation. Ie, the core idea of the ORE spec is to follow the Linked Data principles (and to be in line with the Cool URIs for the Semantic Web note) to describe abstract aggregations.

I tried out ORE in practice. A typical example is my talks. When I make a presentation (like the one I gave in Ghent a few months ago) I publish the slides in different formats (ODP, PDF, and HTML in this case). In this sense the talk itself, which is itself an “abstract” thing, is also an aggregation of the three specific slide sets. The following statements describe the situation in the ORE sense:

<http://www.w3.org/2008/Talks/0822-Ghent-IH/> a ore:ResourceMap ;
     dc:title "Detailed introduction into RDF and the Semantic Web" ;
     ore:describes  <http://www.w3.org/2008/Talks/0822-Ghent-IH/#talk>.

<http://www.w3.org/2008/Talks/0822-Ghent-IH/#talk> a cc:Work, ore:Aggregation ;
     cc:license  a ore:AggregatedResource ;
     ...
     ore:aggregates <http://www.w3.org/2008/Talks/0822-Ghent-IH/HTML/>,
                    <http://www.w3.org/2008/Talks/0822-Ghent-IH/Slides.odp>,
                    <http://www.w3.org/2008/Talks/0822-Ghent-IH/Slides.pdf> . 

<http://www.w3.org/2008/Talks/0822-Ghent-IH/Slides.pdf> a ore:AggregatedResource ;
     dc:format "application/pdf" .
...

The …08-22-Ghent/#talk is the URI for the “abstract” thing, ie, the talk. When dereferenced, that URI yields ...08-22-Ghent/ which is the “resource map” in ORE talk (and is an informational resource that returns, depending on the required format, HTML or RDF). The RDF above is actually encoded using RDFa, with Apache set up to deliver the different formats. The resource map portion is conveniently put into the header of the HTML file, and the body describes the real aggregation, ie, the talk. (The full RDF content encoded in the HTML file can be accessed directly either in RDF/XML or in Turtle; it contains additional information on the talk, not directly relevant for this blog.)

Time will tell whether this vocabulary will catch up; some of its design decisions can also lead to further discussions. But it certainly is an interesting and potentially important addition to the overall vocabulary landscape!

P.S.: the reason I used the example of my presentation in Ghent is because that is where I first heard about this vocabulary in more details thanks to a short tutorial given by Herbert Van de Sompel, from Los Alamos National Laboratory, one of the co-editors of the ORE spec.

« Previous PageNext Page »

Theme: Rubric. Blog at WordPress.com.

Follow

Get every new post delivered to your Inbox.

Join 2,545 other followers