Ivan’s private site

March 24, 2010

Twitter search results in RDF

Filed under: Semantic Web,Work Related — Ivan Herman @ 23:58

This is cool: with the Shredded Tweet site one can search a twitter tag and get back the result in RDFa. Ie, the result can be run through an RDFa tool to get the data in RDF, ready to be mashed up with other stuff. For example, one can search the #w3c tag to get to the RDFa page, then run it through the RDFa distiller to yield RDF data:

<http://twitter.com/Eyalsela/statuses/10981280232> a <sioc:Post> ;
     dcterms:created "2010-03-24T14:23:06+00:00" ;
     sioc:content "At our second meetup: Mobile applications and websites development
        #w3cdf #W3C [English:http://j.mp/bS8qLU Hebrew: http://j.mp/a3i8Tv"@en ;
     sioc:has_creator <http://semantictweet.com/Eyalsela#me> ;
     sioc:has_topic <http://search.twitter.com/search?q=%23W3C>,  ;
     sioc:links_to <http://j.mp/a3i8Tv> . 

<http://twitter.com/JimThijssen/statuses/10979370533> a <sioc:Post> ;
     dcterms:created "2010-03-24T13:40:04+00:00" ;
     sioc:content "Yeahh #www.stagematcher.nl is volledig #html #w3c #certified B-)"@en ;
     sioc:has_creator <http://semantictweet.com/JimThijssen#me> ;
     sioc:has_topic <http://search.twitter.com/search?q=%23certified>,
       <http://search.twitter.com/search?q=%23html>,
       <http://search.twitter.com/search?q=%23w3c> .
...

The RDF content can also be used directly, eg, in a SPARQL call, via the combined URI of the distiller service and the original data; an ugly URI, but one that can be minted programmatically fairly easily. This is cool…
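
As a rough illustration of how such a combined URI could be minted, here is a minimal Python sketch. The distiller endpoint and the `uri` parameter name are placeholders (assumptions, not the real service's interface); the only point is that the address of the RDFa page gets percent-encoded and appended to the address of the service.

```python
from urllib.parse import urlencode

# Hypothetical values: neither the endpoint nor the parameter name comes from the post.
DISTILLER_ENDPOINT = "http://example.org/rdfa-distiller"
PAGE_PARAM = "uri"


def mint_distilled_uri(page_uri, endpoint=DISTILLER_ENDPOINT, param=PAGE_PARAM):
    """Return a single URI asking the (assumed) distiller to convert page_uri to RDF."""
    # urlencode percent-encodes the page address so it survives as one query value.
    return f"{endpoint}?{urlencode({param: page_uri})}"


rdf_uri = mint_distilled_uri("http://example.org/search?q=%23w3c")
print(rdf_uri)
```

The resulting string is what one would hand to a SPARQL processor as the data source; note that the `%` of the already-encoded `#` gets encoded a second time, which is part of why the URI looks ugly.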


October 30, 2009

ISWC2009 4-5

Filed under: Semantic Web,Work Related — Ivan Herman @ 0:59

Fourth day

Shame on me, but I missed the morning keynote… I was a bit late arriving to the conference site and I got stuck in a conversation at breakfast. Things happen…

The most notable event in the morning, at least for me, was the SPARQL WG panel. All members of the Working Group (me included) were on the panel and the room was full. I mean, full: people were standing in the back. And I regard that as a success by itself; it shows not only the overall importance of SPARQL, but the real interest around the new version, ie, SPARQL 1.1 (in case you have missed it, the first working draft was published just a few days ago). Lee Feigenbaum (co-chair of the group) gave a quick overview of the new features and then questions came.

The difficulty of the SPARQL 1.1 work is that it has to find a balance between what is realistic to standardize in a relatively short time frame and what would be good to see in a new query language. As a consequence, there are features that the community has discussed but that have not made it into the document, or only in a simple form. That came up during the discussion but I had the impression that the audience, by and large, understood this balance. Actually, for some, the set of new features was even too much for an efficient implementation. I have the feeling that the WG will have to publish a separate conformance document (a bit like OWL 2 has), because there is a certain confusion on whether a conforming SPARQL implementation will have to implement, say, update or inference regimes or not. That clearly came up through the questions. Anyway, remember one email address (yes, it is a bit of a mouthful): public-rdf-dawg-comments@w3.org. This is where comments on SPARQL 1.1 have to be sent!

I chaired a session on the use track in the afternoon. The paper of Daniel Elenius et al on reasoning about resources (for military exercises) was interesting to me because it was based on reasoning with relatively large OWL ontologies plus rules. The OWL ‘side’ was not very complex (Daniel referred to DLP, today I would probably say OWL 2 RL) but extended with extra rules. What this shows is that once RIF is finished and published, the combination of OWL with RIF may become very important for tons of practical applications. (As an aside, a nice little joke from Daniel: what is the system used by the military today when planning for exercises? The system is called BOGSAT. It stands for ‘Bunch Of Guys Sitting Around a Table’…)

Roland Stuhmer gave a very different style presentation on how user events (clicks, combination of clicks, etc) can be collected, categorized, and integrated into an application, then analyzed with some rules for, eg, targeted ads. The system is based on harvesting not only the structure of the Web page, but also annotations appearing in the Web page via RDFa. The result is an RDF structure describing the events that can be sent to a server, analyzed locally, distributed, etc. Nice usage of RDFa, but it also shows how important it is to have a Javascript API that can retrieve the RDF triples from the RDFa structure attached to a specific node. (B.t.w., the old graphics standards of the 80’s and 90’s, called GKS or PHIGS, had notions of combined event structures with different event types. I do not remember all the details any more, but it may be worth looking at those again in a modern setting.)

Personally, the highlight of the day was the presentation of the semantic web challenge finalists. I was a member of the jury, which meant that I had to review the submissions in advance and we had two very enjoyable discussions with the rest of the jury on the submissions. We had the first selection the day before, and this time all finalists gave their presentations and demos. And it was a tough task to choose (that is why we had such long discussions:-) because, well, the submissions were great overall. I do not really want to analyze each of the entries; I do not think it would be appropriate for me in this position. But the winning entry for the challenge, namely TrialX, really made a great impression on me. In short, the application is a consumer-centric tool through which patients can find matching clinical trials in which they want to participate; it also helps those who organize those trials, etc. It is some sort of a matchmaking tool using all kinds of medical ontologies and vocabularies, public health record data and the like. We should realize the importance of this: here is a great Semantic Web application, winner of the challenge, which is really an application, not only a demonstration, already deployed on the Web (soon as an iPhone app, too), and which, to be a bit dramatic, may save (and possibly already has saved) lives. What else do we want as proof that this technology is not only an academic exercise any more?

Fifth day

Only a partial day for me, as far as the conference goes, because I had to fly out before the end… But I could listen to the last keynote of the conference, ie, that of Nova Spivack.

Not surprisingly, Nova talked about Twine-2, a.k.a. T2. I did not really know what T2 was to be, I only heard that Twine, ie, T1, is moribund. As Nova acknowledged, it is too complicated, it is too hard for users to really figure it out; in fact, most of the users used it for search. Which is not the strongest feature of T1 in the first place.

So T2 is (well, will be) all about semantically backed search. It semantically indexes the Web, with an attempt to extract semantic information from the pages. The user interface would then be some sort of, essentially, faceted interface that would automatically classify the search hit results into different tabs; the user can use these tabs, drill down along other categories, etc. So far nothing radically new, though the user interface Nova showed was indeed very clean and nice. All this is done, internally, via vocabularies/ontologies, using RDF, RDFS, or OWL.

The interesting aspect of T2 (at least as far as I am concerned) is the incorporation of collective knowledge. First of all, T2 will include a system whereby users can add vocabularies that T2 will use in categorization. Users can get back those ontologies in OWL/RDF, they can improve them, etc. The other tool they will provide is a means to help semantically index pages that are, by themselves, not semantically annotated. This can be done via a Firefox extension; users can identify parts of the web pages (I presume, essentially, the DOM nodes) and associate these with classes of specific ontologies. The extension produces an XSLT transformation that can be sent back to the T2 system. Some social mechanism should of course be set up (eg, webmasters annotating their own pages should get a higher priority than third party annotators) but, essentially, it is some sort of a GRDDL transformation by proxy: T2 will have information on how to find transformation to semantically index specific pages without requiring the modification of the pages themselves (in contrast to GRDDL where such transformation is to be referred to from the page itself).

Of course, the system was a bit controversial in this community; indeed, it was not clear whether T2 would make use of the semantic information that does exist in pages already (microformats, RDFa, …) let alone the Linked Open Data information that is already out there. When asked, Nova did not seem to give a clear answer though, to be fair, he did not specifically say no and he also said that the semantic index might be put back to the public in the form of linked data. To be decided. It is also not fully clear whether those proxy-GRDDL transformations would be available for the community at large (hopefully the answer is yes…). It will be interesting to see how it plays out (T2 comes out in beta sometime in early 2010). Certainly a project to keep an eye on.

From a slightly more general point of view it is also interesting to note that two out of the three Semantic Challenge winners are also semantic search engines with different user interfaces (though sig.ma and VisiNav definitely do use the LOD cloud, no question there…). Definitely an area on the move!

I had the time and, frankly, the energy to really listen to only one more paper in the regular track, namely the paper on functions of RDF language elements, by Bernhard Schandl. A nice idea: imagine a traditional spreadsheet, where each cell is a collection of resources from an RDF Graph, or a function that can manipulate those resources (extract information, produce a new set of resources, etc). Just like in a spreadsheet, if you modify the underlying graph, ie, the resources in a cell, everything is automatically recalculated. Because, just like in a spreadsheet, a function can refer to the result of another function in another cell, one can do fairly complicated transformation and information extraction quite easily. Neat idea, to be tried out from their site.
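
To make the spreadsheet analogy a bit more concrete, here is a tiny Python sketch. It is not the system presented in the paper: the `Sheet` class, the cell names, and the use of plain tuples for triples are all assumptions made for illustration, and "recalculation" is simply done by re-evaluating a cell each time it is read.

```python
class Sheet:
    """A toy 'RDF spreadsheet': cells are functions evaluated over a shared graph."""

    def __init__(self, graph):
        self.graph = set(graph)   # triples as (subject, predicate, object) tuples
        self.cells = {}           # cell name -> function taking the sheet

    def define(self, name, func):
        self.cells[name] = func

    def value(self, name):
        # Cells are re-evaluated on every read, so graph changes show up immediately.
        return self.cells[name](self)


sheet = Sheet({("ex:ann", "rdf:type", "foaf:Person"), ("ex:ann", "foaf:name", "Ann")})

# One cell selects resources; a second cell refers to the result of the first one.
sheet.define("people", lambda s: {subj for (subj, p, o) in s.graph
                                  if p == "rdf:type" and o == "foaf:Person"})
sheet.define("names", lambda s: {o for (subj, p, o) in s.graph
                                 if p == "foaf:name" and subj in s.value("people")})

before = sheet.value("names")
sheet.graph |= {("ex:bob", "rdf:type", "foaf:Person"), ("ex:bob", "foaf:name", "Bob")}
after = sheet.value("names")
print(before, after)
```

A real implementation would presumably track dependencies between cells instead of recomputing everything, but the chaining of cells is the part that makes the idea attractive.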

That is it for ISWC2009. I obviously missed a lot of papers, partly because social life and hallway conversations sometimes had the upper hand, and sometimes simply because there were too many parallel sessions. But it was definitely an enriching week… See you all, hopefully, at ISWC2010, in Shanghai!

October 28, 2009

ISWC2009 2-3

Second day

In fact, there is much less to say… In the morning I was at two workshops; I was at the Uncertainty Reasoning on the SW one for a while, but then I was asked to participate in a panel at the Semantics for the Rest of Us one, so I had to switch. This was a bit unfortunate, because I could not really ‘dive in’ to either of the two. And my afternoon was taken up by ‘networking’, catching up with some people on many many issues that are not worth blogging (yet?).

I listened to Kathryn Laskey’s presentation on how to combine probability theory in the mathematical sense (the good old Kolmogorov axiomatic theory on probability that I learned at university in a distant past…) with first order logic. I cannot claim to have really understood all the details but it made me curious enough to put reading her paper on my to do list…

As for the panel “Little vs Large Semantics: What’s next for the Semantic Web languages?”, with Leigh, Kendall and Ora on the panel besides me… it was not that exciting, I must admit. Maybe the main message I take away from it was the passionate request of Chris Welty to re-open RDF (see also Pat’s keynote below!).

Third day (well, first real conference day)

Preamble: I would have wanted to add links to papers. And I couldn’t: I have not found the papers on the Web. Neither on Springer’s site nor elsewhere. I may have missed a reference somewhere, if somebody knows then tell me. But if the papers are not available, I think it is a shame…

The conference began with a keynote of Pat Hayes. Entertaining and also thought provoking; Pat is a great speaker. What really interested me is his talk on ‘RDF Redux’; I was actually anxious to listen to that one at SemTech last June but he had to call this off back then. So he repeated it here. This is typically the kind of talk that needs more thinking afterwards to understand it (and Pat has promised to write it down!), but he essentially proposed to re-think and re-do some of the fundamentals of RDF semantics. Instead of set-based model theory which we have today, and which makes the treatment of b-nodes, shall we say, a bit complicated (some would use harsher words:-) we should consider RDF graphs as ‘things’ on a ‘surface’ (think of it as a real surface on a sheet of paper) and b-nodes are just ‘scratches on that  surface’. (A bit like ‘context’ of a graph?) Because these surfaces are different from one graph to the other, when a merge occurs then in fact a new surface is created where the unified graph is put, and the issue of b-nodes becomes natural (instead of the ‘renaming’ procedure that the current semantics document describes). Pat claims that the whole semantics could be re-written that way and none of the current RDF implementations would change. But one can go one step further: there may be different kinds of surfaces (eg, negations) and surfaces can have a name (a bit like named graphs) and all can be put together to provide a powerful semantics for these entities. His further claim was that such an extended semantics of RDF could be powerful enough to describe, conceptually, RDFS or even OWL, ie, the semantics should not be layered any more.

No way I would accept all this argumentation at face value:-), so I have to think about this and, mainly, read whatever Pat may want to write down to understand it. In the meantime, I may have to look into the concepts of conceptual graphs, and the Peircean notation of logic that Pat referred to as inspiration…

A more general take away (see also Chris’ remark above): maybe it is time to look into RDF again? A scary thought. Touching something that is fundamental to the SW has to be done with extreme care… We will see.

There were two papers in the same session that were very close in subject and topic: one of Jesse Weaver and Jim Hendler on the parallel materialization of RDFS graphs and the one of Jacopo Urbani et al on using MapReduce for RDFS reasoning. (Sigh…, this is where I would like to put a reference!) Both aimed at similar challenges, namely the materialization of RDFS inference results of a graph using parallel computing methods. And there was one more similarity: both had some sort of a classification of the rules in the rule set described in the RDF Semantics document to help improve the processing. (Eg, to analyze which rules should be duplicated among processing nodes and which ones can be handled without, or which ones need a special treatment for a map-reduce pair.) It seems that it would be worthwhile to see if some of these classifications (‘ontology rules’ and the like) could be extended to OWL 2 RL (Jesse Weaver told me afterwards that they want to look into this). But, to put things into perspective: we are at the point where billions of triples can be expanded with relative ease. Who would have thought a few years ago? There was also a remark on one of Jesse’s slides (I do not remember the exact wording) which said that RDFS is insanely parallelizable:-) It was a really interesting session.

The SW in use session included a paper from Landong Zuo et al, “Supporting multi-view network analysis to understand company value chains”. The system integrates a bunch of UK data on companies in an RDF store, and lets users get information on the ‘value chains’, ie, how companies relate to one another as producers/consumers. Technically, the interesting point was the fact that users had the possibility to interactively add new relationships and new classifications to the system, essentially new rules that could be evaluated. The whole system seemed to be a really cool, well engineered and well functioning machinery. As the speaker put it, all conclusions drawn from the system could also be found by the users by analyzing databases, but it would take weeks to do what this system can give them in a few minutes. This is exactly the kind of message we need for the outside world about the usefulness of Semantic Web technologies.

In another session Martin Szomszor presented an experiment they conducted at the ESWC conference, combining RFID-based personal badges with an underlying SW system. The resulting system could be used to show personal contacts among delegates, could help people find others with similar interests, could retrace later whom one met at what point (“I remember talking to that chap, but I do not remember his name!”), etc. Lots of privacy issues, for example, but I would have liked to see that in practice, that is for sure!

Stéphane Corlosquet’s presentation on SW and Drupal was really exciting. I already knew about the plans of Drupal 7 to incorporate RDF management from the start, and that all Drupal 7 pages will be annotated via RDFa. The RDFa community has been fairly excited about that for a while now. But the work done by Stéphane and others provides some additional modules that make it easy to add a SPARQL endpoint to a Drupal based site, to import other RDF content, or to manage the vocabularies used on the pages and the like. They already have such a system running with the current Drupal, but these modules will become part of the standard Drupal 7 module set that one can download from the drupal site. And that is cool. It significantly lowers the barrier to build Web sites that are prepared to be part of the Linked Data cloud, even if the system administrators are not SW experts. I expect this to open up quite a lot of possibilities…

Off to the next day! More papers and the presentation of the Semantic Web challenge finalists…

October 26, 2009

ISWC2009 I.

This year’s ISWC is held in Chantilly, Virginia, in a nice conference building in a beautiful park with autumn colours that, for reasons I do not really know, are always much more striking and amazing in America than in Europe. It is a bit of a pity that it is so far from Washington but, well, you can’t have it all…

First day: tutorials.

(For me, because there were also a bunch of workshops.) In the morning I was at the tutorial on how to consume Linked Open Data, by Juan Sequeda, Jammie Taylor, Patrick Sinclair, and Olaf Hartig; in the afternoon I went to the one on legal and social frameworks for sharing data on the Web, by Leigh Dodds, Jordan Hatcher, Tom Heath, and Kaitlin Thaney.

Juan and his friends actually had a difficult task, and that became clear right at the start during the intro of Juan: part of the audience did not really know what LOD was all about, whereas there were also others who were, shall we say, old timers on the subject. I think the speakers did a really good job in navigating through these constraints, making short introductions to what LOD is all about but talking about issues and showing examples that were interesting for all of us. Kudos to that. Issues were raised by the audience that were really to-the-point (who should create sameAs links, how trustworthy are they, how to choose vocabularies and how they map to one another, etc) and, in his closing slides, Juan actually gave a list of the open R&D issues in LOD. Worth looking at those (and no reason to repeat the list here…). B.t.w., the slides of the tutorial are online.

One very interesting technology I heard about and that, shame on me, I did not know was SQUIN, a tool based on a traversal based execution scheme for SPARQL. Olaf did a presentation on that. What essentially happens is as follows. At the beginning the default graph of the SPARQL query is empty. However, the system systematically fetches RDF triples by dereferencing URI-s in the query pattern, adding those to the default graph. The query is matched against it, and some variables will match, thereby ‘adding’ new URIs to the pattern. And the process starts again, possibly yielding a complete solution (or more) to the original query. At the end of the process, solutions will be found on the Web, even if the system itself does not have any ‘real’ data behind it at the start. Of course, no one can guarantee that all solutions will be found, and you need some ‘seed’ URI-s in the original query pattern, but it nevertheless looks like a very powerful tool to explore, say, the LOD. Very interesting!
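
The traversal idea can be sketched in a few lines of Python. This is only a toy under stated assumptions, not the tool's actual algorithm: the "Web" is a hypothetical in-memory dictionary (`WEB`), triples are plain tuples, variables are strings starting with `?`, and the loop simply dereferences every term it discovers rather than following the query plan.

```python
# Hypothetical stand-in for the Web: dereferencing a URI returns the triples published there.
WEB = {
    "ex:alice": {("ex:alice", "foaf:knows", "ex:bob")},
    "ex:bob": {("ex:bob", "foaf:name", "Bob")},
}


def is_var(term):
    return term.startswith("?")


def match(pattern, triple, binding):
    """Extend binding so that pattern equals triple; return None if impossible."""
    new = dict(binding)
    for p, t in zip(pattern, triple):
        if is_var(p):
            if new.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return new


def solve(patterns, graph, binding=None):
    """Yield every binding that satisfies all patterns against graph."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    for triple in graph:
        extended = match(patterns[0], triple, binding)
        if extended is not None:
            yield from solve(patterns[1:], graph, extended)


def traverse_query(patterns, dereference):
    graph, seen = set(), set()
    todo = {t for p in patterns for t in p if not is_var(t)}   # the 'seed' terms
    while todo:
        uri = todo.pop()
        seen.add(uri)
        graph |= dereference(uri)                 # grow the default graph
        for pattern in patterns:                  # partial matches reveal new terms
            for triple in graph:
                found = match(pattern, triple, {})
                if found is not None:
                    todo |= set(found.values()) - seen
    return list(solve(patterns, graph))


query = [("ex:alice", "foaf:knows", "?x"), ("?x", "foaf:name", "?n")]
solutions = traverse_query(query, lambda uri: WEB.get(uri, set()))
print(solutions)
```

The query starts with an empty graph and only the seed `ex:alice`; the name of Bob is found because matching the first pattern reveals `ex:bob`, which is then dereferenced in turn.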

Then there were some examples on how LOD is used. Jammie talked about Freebase, and how Freebase is, in fact, a way for everybody to easily add information to the LOD (after all, Freebase works like a wiki, and all the data is reflected on the LOD). He also had a very important message that is worth repeating (go to his slides for the rest): it takes very little effort to add a republishing capability to your triple store based application, thereby extending the general LOD. So… do it! This is how the system evolves…

Patrick described a quite geeky system that the BBC folks have developed (hopefully it will become public soon): take the BBC’s musical data in RDF (which is available), plus the LOD cloud, plus… an IRC bot. What you get is an IRC channel which will pick up data on music, including the sound tracks, photos, etc, and display it on the machine. I presume you can give orders and preferences through the IRC. Obviously geeky stuff, not for the masses:-) but it shows what you can do…

The afternoon tutorial on the Legal and Social frameworks was of course very different. I think one of the many, but maybe the most important, aspects of this tutorial is that… it took place! This may sound a bit strange but it is important for all our community to realize that we will have issues around copyright, licensing, waivers, etc, when it comes to the Web of Data, whether we like these issues or not. Tutorials like this, written notes and information, etc, are essential. Let us face it: most of us do not understand the details of the legal issues. So I was simply listening and trying to absorb what I heard…

I do not want to repeat the details of what I heard here; one thing I learned over the years is that I should leave legal argumentations and descriptions to those who really understand them. Ie, look at the slides. It is worth it. But just to show the complexities: I did not know or fully realize that there are major differences among countries in what can or cannot be copyrighted: for example, a phone book cannot be copyrighted in the US or Europe, but can in Australia. Or that the seemingly simple notion of ‘attribution’ can, in fact, become an endless pit when it comes to data and the queries thereof (eg, if I have a filter in a query that results in data, should I give an attribution to the facts that were, in fact, filtered out?). Etc.

There is also a takeaway message for me (though it may be quite trivial) among the things I learned. Tom showed some practical examples on how one can add, say, licensing information to data by adding some RDF triples. However, for a larger data set the licensing may be different within the dataset. Eg, if you retrieve data from somewhere, and you enrich it with additional metadata, the metadata itself may have a different licensing (it is yours) than the data that you use (which may have its own licence). What this means is that when you organize your data internally, you should think about the licensing information you will add well in advance: organize your URI-s accordingly, for example. If you don’t, and you want to add a license at the end, you might find yourself in trouble! Sounds like a simple message, but it is important. (Reminds me of what accessibility people always say: if you take accessibility issues into account right at the beginning when you build up a Web site, it is not complicated; but if you have to add accessibility features after the fact, it may become hell…)
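
A small Python sketch of why organizing URI-s in advance helps. Everything here is an assumption for illustration (the example.org URI layout, the licence URIs, and the choice of the Dublin Core `license` property); it is not one of the examples shown in the tutorial. If the retrieved data and your own enrichment live under different URI prefixes, attaching the right licence triple becomes a simple prefix lookup.

```python
DCTERMS_LICENSE = "http://purl.org/dc/terms/license"

# Hypothetical layout: one URI prefix per part of the dataset, each with its own licence.
LICENSES_BY_PREFIX = {
    "http://example.org/data/source/": "http://example.org/licences/source-licence",
    "http://example.org/data/enrichment/": "http://example.org/licences/my-licence",
}


def license_triples(resource_uris, licenses_by_prefix=LICENSES_BY_PREFIX):
    """Return N-Triples lines attaching a licence to each resource with a known prefix."""
    lines = []
    for uri in resource_uris:
        for prefix, licence in licenses_by_prefix.items():
            if uri.startswith(prefix):
                lines.append(f"<{uri}> <{DCTERMS_LICENSE}> <{licence}> .")
                break   # one licence per resource in this sketch
    return lines


triples = license_triples([
    "http://example.org/data/source/record1",
    "http://example.org/data/enrichment/note1",
])
print("\n".join(triples))
```

Without such a layout the same function would need a per-resource table, which is exactly the kind of after-the-fact work the paragraph above warns about.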

By the way, Leigh has made a kind of an overview of the current ‘blobs’ on the LOD cloud to see whether any kind of licensing information is available or not. He has an overview of the results in his slides. The main fact is: the majority of data sets have no information whatsoever (or, at least, nothing that can be found in about 10 minutes)…

It was a good day. Looking forward to the rest.

September 29, 2009

OWL 2 RL closure

OWL 2 has just been published as a Proposed Recommendation (yay!) which means, in layman’s terms, that the technical work is done, and it is up to the membership of W3C to accept it as a full blown Recommendation.

As I already blogged before, I did some implementation work on a specific piece of OWL 2, namely the OWL 2 RL Profile. (I have also blogged about OWL 2 RL and its importance before, nothing to repeat here.) The implementation itself is not really optimized, and it would probably not stand a chance for any large scale deployment (the reader may want to look at the OWL 2 implementation report for other alternatives). But I can hope that the resulting service can be useful in getting a feel for what OWL 2 RL can give you: by just adding a few triples into the text box you can see what OWL 2 RL means. This is, by the way, an implementation of the OWL 2 RL rule set, which means that it can also accept triples that are not mandated by the Direct Semantics of OWL 2 (a.k.a. OWL 2 DL). Put another way, it is an implementation of a small portion of OWL 2 Full.

The core of my implementation turned out to be really straightforward: a forward chaining structure directly encoded in Python. I use RDFLib to handle the RDF triples and the triple store. Each triple in the RDF Graph is considered and compared to the premises of the rules; if there is a match then new triples are added to the Graph. (Well, most of the rules contain several triples to match with, and the usual approach is to pick one and explore the Graph deeper to check for additional matches. Which one to pick is important, though; it may affect the overall speed.) If, through such a cycle, no additional triples are added to the Graph then we are done, the “deductive closure” of the Graph has been calculated. The rules of OWL 2 RL have been carefully chosen so that no new resources are added to the Graph (only new triples), ie, this process eventually stops.
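
The forward chaining cycle can be sketched as follows. This is a stripped-down illustration, not the RDFLib-based implementation itself: triples are plain Python tuples in a set, prefixed names are plain strings, and only two rules of the OWL 2 RL rule set (cax-sco and scm-sco) are encoded; the function names are made up for the example.

```python
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"


def rule_cax_sco(graph):
    """cax-sco: c1 subClassOf c2, x type c1  =>  x type c2."""
    for (c1, p, c2) in graph:
        if p == SUBCLASS:
            for (x, q, c) in graph:
                if q == RDF_TYPE and c == c1:
                    yield (x, RDF_TYPE, c2)


def rule_scm_sco(graph):
    """scm-sco: c1 subClassOf c2, c2 subClassOf c3  =>  c1 subClassOf c3."""
    for (c1, p, c2) in graph:
        if p == SUBCLASS:
            for (s, q, c3) in graph:
                if q == SUBCLASS and s == c2:
                    yield (c1, SUBCLASS, c3)


def closure(graph, rules):
    """Apply all rules repeatedly until a full cycle adds no new triple."""
    closed = set(graph)
    while True:
        new = {t for rule in rules for t in rule(closed)} - closed
        if not new:
            return closed     # deductive closure reached
        closed |= new


graph = {
    ("ex:A", SUBCLASS, "ex:B"),
    ("ex:B", SUBCLASS, "ex:C"),
    ("ex:x", RDF_TYPE, "ex:A"),
}
closed = closure(graph, [rule_cax_sco, rule_scm_sco])
print(sorted(closed - graph))
```

Since the rules only combine terms that are already in the graph, the set of possible triples is finite and the loop has to stop, which is the termination argument made above.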

The rules themselves are usually simple. Although it is possible and probably more efficient to encode the whole process using some sort of a rule engine (I know of implementations based on, eg, Jena’s rules or Jess), one can simply encode the rules using the usual conditional constructs of the programming language. The number of rules is relatively high but nothing that a good screen editor would not manage with copy-paste. There were only a few rules that required a somewhat more careful coding (usually to take care of lists) or many searches through the graph like, for example, the rule for property chains (see rule prp-spo2 in the rule set). It is also important to note that the higher number of rules does not really affect the efficiency of the final system; if no triple matches a rule then, well, it just does not fire. No side effect of the mere existence of an unused rule.

So is it all easy and rosy? Not quite. First of all, this implementation is of course simplistic in so far as it generates all possible deduced triples, and these include a number of trivial triples (like ?x owl:sameAs ?x for all possible resources). That means that the resulting graph becomes fairly big even if the (optional) axiomatic triples are not added. If the OWL 2 RL process is bound to a query engine (eg, the new version of SPARQL will, hopefully, give a precise specification of what it means to have OWL 2 RL reasoning on the data set prior to a SPARQL query) then many of these trivial triples could be generated at query time only, thereby avoiding an extra load on the database. Well, that is one place where a proof-of-concept and simple implementation like mine loses against a more professional one:-)

The second issue was the contrast between RDF triples and “generalized” RDF triples, ie, triples where literals can appear in subject positions and bnodes can appear as properties. OWL 2 explicitly says that it works with generalized triples and the OWL 2 RL rule set also shows why that is necessary. Indeed, consider the following set of triples:

ex:X rdfs:subClassOf [
  a owl:Restriction;
  owl:onProperty [ owl:inverseOf ex:p ];
  owl:allValuesFrom ex:A
].

This is a fairly standard “idiom” even for simple ontologies; one wants to restrict, so to say, the subjects instead of the objects using an OWL property restriction. In other words that restriction combined with

ex:x rdf:type ex:X .
ex:y ex:p ex:x .

should yield

ex:y rdf:type ex:A .

Well, this deduction would not occur through the rule set if non-generalized RDF triples were used. Indeed, the inverse of ex:p is a blank node, ie, using it in a triple is not legal; but using that blank node to denote a property is necessary for the full chain of deductions. In other words, to get that deduction to work properly using RDF and rules, the author of the vocabulary would have to give an explicit URI to the inverse of ex:p. Possible, but slightly unnatural. If generalized triples are used, then the OWL 2 RL rules yield the proper result.
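
The chain of deductions can be traced with a small Python sketch. Again, this is an illustration under assumptions rather than the actual implementation: triples are plain tuples, the two blank nodes are written `_:r` (the restriction) and `_:inv` (the inverse of ex:p), and only the three rules needed here (cax-sco, prp-inv2, cls-avf) are encoded. The `allow_generalized` flag, a name made up for the example, drops any deduced triple whose predicate is a blank node.

```python
TYPE, SUBCLASS = "rdf:type", "rdfs:subClassOf"
INVERSE, ON_PROP, ALL_VALUES = "owl:inverseOf", "owl:onProperty", "owl:allValuesFrom"

graph = {
    ("ex:X", SUBCLASS, "_:r"),
    ("_:r", TYPE, "owl:Restriction"),
    ("_:r", ON_PROP, "_:inv"),
    ("_:inv", INVERSE, "ex:p"),
    ("_:r", ALL_VALUES, "ex:A"),
    ("ex:x", TYPE, "ex:X"),
    ("ex:y", "ex:p", "ex:x"),
}


def step(g, allow_generalized):
    new = set()
    for (s, p, o) in g:
        if p == SUBCLASS:      # cax-sco: x type s  =>  x type o
            new |= {(x, TYPE, o) for (x, q, c) in g if q == TYPE and c == s}
        if p == INVERSE:       # prp-inv2: s inverseOf o, x o y  =>  y s x
            new |= {(y, s, x) for (x, q, y) in g if q == o}
        if p == ALL_VALUES:    # cls-avf: u type s, u prop v  =>  v type o
            for (r, q, prop) in g:
                if r == s and q == ON_PROP:
                    for (u, t, cls) in g:
                        if t == TYPE and cls == s:
                            new |= {(v, TYPE, o) for (u2, p2, v) in g
                                    if u2 == u and p2 == prop}
    if not allow_generalized:
        # Plain RDF: a blank node may not be used as a predicate.
        new = {t for t in new if not t[1].startswith("_:")}
    return new - g


def closure(g, allow_generalized=True):
    g = set(g)
    while True:
        new = step(g, allow_generalized)
        if not new:
            return g
        g |= new


generalized = closure(graph, allow_generalized=True)
plain = closure(graph, allow_generalized=False)
print(("ex:y", TYPE, "ex:A") in generalized, ("ex:y", TYPE, "ex:A") in plain)
```

With generalized triples the intermediate triple `ex:x _:inv ex:y` is produced by prp-inv2 and then consumed by cls-avf; without them that intermediate step is discarded and `ex:y rdf:type ex:A` is never reached.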

It turns out that, in my case, having bnodes as properties was not really an issue, because RDFLib could handle that directly (is that a bug in RDFLib?). But similar, though slightly more complex or even pathological examples can be constructed involving literals in subject positions, and that was a problem because RDFLib refused to handle those triples. What I had to do was to exchange all literals in the graph against a new bnode, perform all the deductions using those, and exchange the bnodes “back” against their original literals at the end. (This mechanism is not my invention; it is actually described by the RDF Semantics document, in the section on Datatype entailment rules.) B.t.w., the triples returned by the system are all “legal” triples, generalized triples play a role during the deduction only (and illegal triples are filtered out at output).
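
The literal/bnode exchange can be sketched like this in Python. The `Literal` type, the `_:lit` naming scheme, and the helper names are assumptions made for the example (the real code works on RDFLib terms): literals are swapped for fresh blank nodes before the deduction, swapped back afterwards, and triples that would only be legal as generalized triples are filtered out.

```python
from typing import NamedTuple


class Literal(NamedTuple):
    lexical: str
    datatype: str


def literals_to_bnodes(graph):
    """Replace every literal by a fresh bnode; return the new graph and the mapping."""
    mapping = {}

    def swap(term):
        if isinstance(term, Literal):
            # Same literal, same bnode; a new literal gets the next free name.
            return mapping.setdefault(term, f"_:lit{len(mapping)}")
        return term

    return {tuple(swap(t) for t in triple) for triple in graph}, mapping


def bnodes_to_literals(graph, mapping):
    """Put the literals back and drop triples with a literal in subject position."""
    back = {bnode: literal for literal, bnode in mapping.items()}
    restored = {tuple(back.get(t, t) for t in triple) for triple in graph}
    return {t for t in restored if not isinstance(t[0], Literal)}


two = Literal("2", "xsd:integer")
graph = {("ex:q", "ex:p", two)}

working, mapping = literals_to_bnodes(graph)
# Pretend the deduction step produced a triple with the literal's bnode as subject.
working |= {(mapping[two], "rdf:type", "xsd:integer")}
result = bnodes_to_literals(working, mapping)
print(result)
```

During the deduction the triple about `_:lit0` is perfectly usable by the rules; it only disappears at output, because with the literal restored it would have a literal as its subject.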

Literals with datatypes were also a source of problems. This is probably where I spent most of my implementation time (I must thank Michael Schneider who, while developing the test cases for OWL 2 RDF Based Semantics, was constantly pushing me to handle those damn datatypes properly…). Indeed, the underlying RDFLib system is fairly lax on checking the typed literals against their definition by the XSD specification (eg, issues like minimum or maximum values were not checked…). As a consequence, I had to re-implement the lexical to value conversion for all datatypes. Once I found out how to do that (I had to dive a bit into the internals of RDFLib but, luckily, Python is an interpreted language…) it became a relatively straightforward, repetitive, and slightly time consuming work. Actually, using bnodes instead of “real” literals made it easier to implement datatype subsumptions, too (eg, the fact that, say, an xsd:byte is also an xsd:integer). This became important so that the rules would work properly on property restrictions involving datatypes.
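
A minimal sketch of what a stricter lexical to value conversion and a datatype subsumption check might look like, restricted to a few integer datatypes. The function names and tables are made up for the example, datatype names are plain strings, and whitespace normalization is ignored; the value ranges are the ones defined by XML Schema.

```python
import re

_INTEGER_LEXICAL = re.compile(r"[+-]?[0-9]+")

# Value ranges per datatype; None means unbounded.
RANGES = {
    "xsd:integer": (None, None),
    "xsd:long": (-2**63, 2**63 - 1),
    "xsd:int": (-2**31, 2**31 - 1),
    "xsd:short": (-2**15, 2**15 - 1),
    "xsd:byte": (-2**7, 2**7 - 1),
}

# Each datatype is derived from its parent: byte < short < int < long < integer.
PARENT = {
    "xsd:byte": "xsd:short",
    "xsd:short": "xsd:int",
    "xsd:int": "xsd:long",
    "xsd:long": "xsd:integer",
}


def to_value(lexical, datatype):
    """Convert a lexical form to a Python int, rejecting ill-formed or out-of-range values."""
    if not _INTEGER_LEXICAL.fullmatch(lexical):
        raise ValueError(f"{lexical!r} is not a valid integer lexical form")
    value = int(lexical)
    low, high = RANGES[datatype]
    if (low is not None and value < low) or (high is not None and value > high):
        raise ValueError(f"{value} is out of range for {datatype}")
    return value


def is_subtype(datatype, ancestor):
    """True if every value of datatype is also a value of ancestor."""
    while datatype is not None:
        if datatype == ancestor:
            return True
        datatype = PARENT.get(datatype)
    return False


print(to_value("127", "xsd:byte"), is_subtype("xsd:byte", "xsd:integer"))
```

The range check is the part a lax parser skips: `"128"` is a fine xsd:integer but not an xsd:byte, so `to_value("128", "xsd:byte")` raises an error here.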

Bottom line: even for a simple implementation, literals, mainly literals with datatypes, are the biggest headache. The rest is really easy. (This is hardly the discovery of the year, but is nevertheless good to remember…)

I was, actually, carried away a bit once I got a hold on how to handle datatypes, so I also implemented a small “extension” to OWL 2 RL by adding datatype restrictions (one of the really nice new features of OWL 2 but which is not mandated for OWL 2 RL). Imagine you have the following vocabulary item:

ex:RE a owl:Restriction ;
    owl:onProperty ex:p ;
    owl:someValuesFrom [
      a rdfs:Datatype ;
      owl:onDatatype xsd:integer ;
      owl:withRestrictions (
          [ xsd:minInclusive "1"^^xsd:integer ]
          [ xsd:maxInclusive "6"^^xsd:integer ]
      )
   ] .

which defines a restriction on the property ex:p so that some of its values should be integers in the [1,6] interval. This means that

ex:q ex:p "2"^^xsd:integer.

yields

ex:q rdf:type ex:RE .

And this could be done by a slight extension of OWL 2 RL; no new rules, just adding the datatype restrictions to the datatypes. Nifty…
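The effect of such a datatype restriction can be sketched in a few lines. This is a simplified stand-in, not the service's code: triples are plain tuples, literal values are assumed to be already converted to Python integers, and the function names are hypothetical.

```python
RDF_TYPE = "rdf:type"

def satisfies_facets(value, min_inclusive=None, max_inclusive=None):
    """True if the value lies within the (optional) inclusive bounds."""
    if min_inclusive is not None and value < min_inclusive:
        return False
    if max_inclusive is not None and value > max_inclusive:
        return False
    return True

def members_of_restriction(triples, restriction, on_property, **facets):
    """owl:someValuesFrom over a restricted datatype: a subject belongs to
    the restriction class if at least one value of on_property fits."""
    inferred = set()
    for s, p, o in triples:
        if p == on_property and isinstance(o, int) and satisfies_facets(o, **facets):
            inferred.add((s, RDF_TYPE, restriction))
    return inferred

# The example from the text: the [1,6] interval on ex:p
data = {("ex:q", "ex:p", 2), ("ex:r", "ex:p", 7)}
result = members_of_restriction(data, "ex:RE", "ex:p",
                                min_inclusive=1, max_inclusive=6)
```

Here result contains the triple for ex:q only, because 7 falls outside the interval.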

That is it. I had fun, and maybe it will be useful to others. The package can also be downloaded and used with RDFLib, by the way…

April 27, 2009

Simple OWL 2 RL service

The W3C OWL Working group published a number of OWL 2 documents last week. This included an updated version of the OWL 2 RL profile. I have already blogged about this profile (“Bridge Between SW communities: OWL RL”) when the previous release was published; there are no radical changes in this release, so there is no reason to repeat what was said there.

I have been playing with a simple and naive implementation of OWL 2 RL for a while; I have now decided to live dangerously;-) and release the software and the corresponding service. So… you can go to the OWL 2 RL generator service, give an RDF graph, and see what RDF triples an OWL 2 RL system should generate. It should give you some ideas of what OWL 2 RL is all about.

I cannot emphasize enough that this is not a production level tool. Beyond the bugs that I have not yet found, a proper implementation would, for example, optimize the owl:sameAs triples and, instead of storing them in the graph, would generate those on the fly when, say, a SPARQL request is issued. But my goal was not to produce something optimal; instead, I wanted to see whether OWL 2 RL can be implemented without any sophisticated tool or not. The answer is: yes it can. This also means that if I could do it, anybody with a basic knowledge of the underlying RDF environment and programming language (RDFLib and Python in this case) can do it, too. No need to be familiar with any complex algorithms, rule language implementation tricks, complicated external tools, description logic concepts, whatever…
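One possible way to avoid storing all the owl:sameAs triples is to keep the equivalence classes in a separate index and answer “same as” questions from there. The sketch below uses a classic union-find structure; it is an assumption about how a more optimized system could work, not something the service does, and the class name is made up.

```python
class SameAsIndex:
    """Union-find over resources; owl:sameAs is answered on the fly."""

    def __init__(self):
        self._parent = {}

    def _find(self, node):
        self._parent.setdefault(node, node)
        root = node
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every visited node directly to the root
        while self._parent[node] != root:
            self._parent[node], node = root, self._parent[node]
        return root

    def add(self, a, b):
        """Record the assertion <a owl:sameAs b>."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_a] = root_b

    def same(self, a, b):
        """Would <a owl:sameAs b> be in the fully expanded graph?"""
        if a == b:
            return True
        if a not in self._parent or b not in self._parent:
            return False
        return self._find(a) == self._find(b)

    def equivalents(self, node):
        """All resources known to be the same as the given one."""
        if node not in self._parent:
            return {node}
        root = self._find(node)
        return {n for n in list(self._parent) if self._find(n) == root}
```

Two add() calls linking ex:a to ex:b and ex:b to ex:c are enough for same("ex:a", "ex:c") to hold, without materializing the nine triples of the full equivalence class.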

December 3, 2008

Bridge between SW communities: OWL RL

The W3C OWL Working Group has just published a series of documents for the new version of OWL, most of them being so-called Last Call Working Drafts (which, in the W3C jargon, means that the design is done; after this, it will only change in response to new problems showing up).

There are many aspects of the new OWL 2 that are of a great interest; I would concentrate here on only one of those, namely the so-called OWL RL Profile. OWL 2 defines several “profiles”, which are subsets of the full OWL 2; subsets that have some good properties, e.g., in terms of implementability. OWL RL is one of those. “RL” stands for “Rule Language” and what this means is that OWL RL is simple enough to be implemented by a traditional (say, Prolog-like) rule engine or can be easily programmed directly in just about any programming language. There is of course a price: the possibilities offered by OWL RL are restricted in terms of building a vocabulary, so there is a delicate balance here. Such rule oriented versions of OWL also have precedents: Herman ter Horst published, some years ago, a profile called pD*; a number of triple store vendors have similar, restricted versions of OWL implemented in their systems already, referred to as RDFS++, OWLPrime, or OWLIM; and there has been some more theoretical work done by the research community in this direction, too, usually referred to by “DLP”. The goal was common to all of these: find a subset of OWL that is helpful to build simple vocabularies, and that can be implemented (relatively) easily. Such subsets are also widely seen as more easily understandable and usable by communities that work with RDF(S) and need only a “little bit of OWL” for their applications (instead of building more rigorous and complex ontologies which requires extra skills they may not have). Well, this is the niche of OWL RL.

OWL RL is defined in terms of a functional, abstract syntax (defining a subset of DL) as well as a set of rules of the sort “if that and that triple pattern exists in the RDF Graph then add these and these triples”. The rule set itself is oblivious to the DL restrictions in the sense that it can be used on any RDF graphs, albeit with a possible loss of completeness. (There is a theorem in the document that describes the exact situation if you are interested.)

The number of rules is fairly high (74 in total), which seems to defeat the goal of simplicity. But this is misleading. Indeed, one has to realize that, for example, these rules subsume most of RDFS (e.g., what the meaning of domain, range, or subproperty is). Around 50 out of the 74 rules simply codify such RDFS definitions or their close equivalents in OWL (what it means to be “same as”, to have equivalent/disjoint properties or classes, that sort of thing). All of these are simple, obvious, albeit necessary rules. There are only around 20 rules that bring real extra functionality compared to RDFS for building simple vocabularies. Some of these functionalities are:

  • Characterization of properties as being (a)symmetric, functional, inverse functional, inverse, transitive,…
  • Property chains, ie, defining the composition of two or more properties as being the sub property of another one. (Remember the classic “uncle” relationship that cannot be expressed in terms of OWL 1? Well, by chaining “brother” and “parent” one can say that the chain is a subproperty of “uncle” and that is it…)
  • Intersection and union of classes
  • Limited form of cardinality (only maximum cardinality and only with values 0 and 1) and property restrictions
  • An “easy key” functionality, i.e., deducing the equivalence of two resources if a list of predefined properties have identical values for them (e.g., if two persons have the same name, same email address, and the same home page URI, then the two persons should be regarded as identical)

Some of these features are new in OWL 2 (property chaining, easy keys), others were already present in OWL 1.
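The property chain feature listed above can be sketched as follows, using the “uncle” example. Triples are plain tuples and the property names (ex:parent, ex:brother, ex:uncle) as well as the function name are assumptions made for the illustration.

```python
def apply_property_chain(triples, chain, super_property):
    """For every path x p1 ... pn y along the chain, infer (x, super_property, y)."""
    # Start with all pairs linked by the first property of the chain
    pairs = {(s, o) for (s, p, o) in triples if p == chain[0]}
    for prop in chain[1:]:
        # Index the next property by subject to extend the paths one step
        step = {}
        for s, p, o in triples:
            if p == prop:
                step.setdefault(s, set()).add(o)
        pairs = {(x, z) for (x, y) in pairs for z in step.get(y, ())}
    return {(x, super_property, y) for (x, y) in pairs}

family = {
    ("ex:alice", "ex:parent", "ex:bob"),
    ("ex:bob", "ex:brother", "ex:carl"),
}
# The brother of a parent is an uncle
uncles = apply_property_chain(family, ["ex:parent", "ex:brother"], "ex:uncle")
```

Here uncles holds the single triple stating that ex:carl is the uncle of ex:alice.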

Quick and dirty implementations of OWL RL can be done fairly easily. Either one uses an existing rule engine (say, Jena rules) and lets the rule engine take its course or one encodes the rules directly on top of an RDF environment like Sesame, RDFLib, or Redland, and uses a simple forward chaining cycle. Of course, this is quick and dirty, i.e., not necessarily efficient, because it will generate many extra triples. But if the rule engine can be combined with the query system (SPARQL or other), which is the case for most triple store implementations, the actual generation of some of those extra triples (e.g., <r owl:sameAs r> for all resources) may be avoided. Actually, some of the current triple stores already do such tricks with the OWL profiles they implement. (And, well, when I see the incredible evolution on the size and efficiency of triple stores these days, I wonder whether this is really an issue in the long term for a large family of applications.) I actually did such a quick and dirty implementation in Python; if you are curious what triples are generated via OWL RL for a specific graph, you can try out a small service I’ve set up. (Caveat: it has not been really thoroughly tested yet, i.e., there are bugs. Nor is it particularly efficient. Do not use it for anything remotely serious!).
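A simple forward chaining cycle of this kind might look like the sketch below. It is a toy version under assumptions: the graph is a set of plain tuples instead of an RDFLib graph, and only three rules are encoded (symmetric properties, transitive properties, and subproperties, i.e., prp-symp, prp-trp, and prp-spo1 in the naming of the OWL 2 Profiles document).

```python
RDF_TYPE = "rdf:type"
SUBPROPERTY = "rdfs:subPropertyOf"
SYMMETRIC = "owl:SymmetricProperty"
TRANSITIVE = "owl:TransitiveProperty"

def rule_symmetric(graph):
    """If p is symmetric and (x p y) holds, then (y p x) holds."""
    for p, t, c in graph:
        if t == RDF_TYPE and c == SYMMETRIC:
            for x, q, y in graph:
                if q == p:
                    yield (y, p, x)

def rule_transitive(graph):
    """If p is transitive and (x p y), (y p z) hold, then (x p z) holds."""
    for p, t, c in graph:
        if t == RDF_TYPE and c == TRANSITIVE:
            links = [(x, y) for (x, q, y) in graph if q == p]
            for x, y in links:
                for y2, z in links:
                    if y == y2:
                        yield (x, p, z)

def rule_subproperty(graph):
    """If p1 is a subproperty of p2 and (x p1 y) holds, then (x p2 y) holds."""
    for p1, t, p2 in graph:
        if t == SUBPROPERTY:
            for x, q, y in graph:
                if q == p1:
                    yield (x, p2, y)

def forward_chain(triples, rules):
    """Apply all rules again and again until no new triple appears."""
    graph = set(triples)
    while True:
        new = set()
        for rule in rules:
            new.update(rule(graph))
        new -= graph
        if not new:
            return graph
        graph |= new
```

The cycle always terminates because the rules never create new terms, so only finitely many triples can be generated; that is also why the approach is wasteful for large graphs.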

So what is the possible role of OWL RL in developing SW applications? I think it will become very important. I usually look at OWL RL as some sort of a “bridge” that allows some RDF/SW applications to evolve in different directions. Such as:

  • Some applications may be perfectly happy with OWL RL as is (usually combined with a SPARQL engine to query the resulting, expanded graph), and they do not really need more in terms of vocabulary expressiveness. I actually foresee a very large family of applications in this category.
  • Some applications may want to combine OWL RL with some extra, application specific rules. They can rely on a rule engine fed with the OWL RL rules plus the extra application rules. B.t.w., although the details are still to be fleshed out, the goal is that a RIF implementation would accept OWL RL rules and produce what has to be produced. I.e., a RIF compatible implementation would provide a nice environment for these types of applications.
  • Some applications may hit, during their evolution, the limitations of OWL RL in terms of vocabulary building (e.g., they might need more precise cardinality restrictions or the full power of property restrictions). In which case they can try to expand their vocabulary towards more complex and formal ontologies using, e.g., OWL DL. They may have to accept some more restrictions because they enter the world of DL, and they would require more complex reasoning engines, but that is the price they might be willing to pay. While developers of applications in the other categories would not necessarily care about that, the fact that the language is also defined in terms of a functional syntax (i.e., that that version of OWL RL is an integral part of OWL 2) makes this evolution path easier.

Of course, at the moment, OWL RL is still a Draft, albeit in Last Call. Feedback and comments of the community as well as the experience of implementers are vital to finalize it. Comments to the Working Group can be sent to public-owl-comments@w3.org (with public archives).

October 31, 2008

ISWC2008, Karlsruhe

ISWC2008 has just finished (I am still at the hotel, leaving for home in a few hours). As usual, it is very difficult to give an exhaustive overview of the whole conference, not only because there were way too many parallel things going on, but also because everyone’s interests are different… These are just a few impressions. I still have to find time to read through some of the papers in more detail.

Great keynote by John Giannandrea from Metaweb, ie, freebase. Freebase has always been an exciting project but the great news from the Semantic Web community’s point of view is that freebase has opened its database to the rest of the World in RDF, too. As such, freebase will soon become part of the Linking Open Data cloud (I guess there are still some details to be ironed out, and I saw John and Chris Bizer starting to discuss these). Actually, it was also interesting to hear again and again from John that the internal structure of freebase is based on a directed, labeled graph model, because that was the only viable option for them to build up what they needed. Sounds familiar?

An interesting point of the keynote was when John was wondering whether Metaweb is therefore a Semantic Web company or not. He thought that yes, it is, because the internal structure is compatible with RDF, it relies on identifiers with URIs, and is Web based. But he also thought that, well, it is not because… no description logic is in use, nor ontologies. Sigh… This still reflects the erroneous view that one must use description logic to be on the Semantic Web. Wrong! So I went up to the mike and welcomed Metaweb in the growing club of Semantic Web companies…

Among the many papers I was interested in, let me refer to the one of Eyal Oren et al., “Anytime Query Answering in RDF through evolutionary algorithms” and, actually, a related submission from the same research group to the Billion Triple Challenge, called MaRVIN. In both cases the issue is that while handling very large datasets one might not necessarily want or be interested in all solutions to a given query (or inferences, in case of MaRVIN) but, rather, whatever can be reached within a reasonable time. Ie, essentially, trading completeness for responsiveness. Whether genetic algorithms are the answer, as explored by Eyal and friends, or some other techniques, nobody knows; as Eyal clearly acknowledged, these are first attempts and we have to wait a few more years and further results to get a feeling where it will lead. But the direction is really interesting.

This actually leads to what was, for me, the highlight of the conference, namely the SW Challenge, both the traditional Open Call as well as the new Billion Triple Challenge (there are more details on both on the challenge’s web site). The entries were really impressive. As Peter Mika said in his closing comments on the challenge, long gone are the days when a challenge was some techie keyboard manipulation; the entries all had great user interface design, with real regard for non-expert end users who may or may not know (and probably do not care) that the underlying technology is Semantic Web.

Among the finalists in the Open Call, Chris Bizer presented DBPedia Mobile (see also their site), ie, a system to access the full power of DBPedia (and, actually, the LOD cloud in general) from an iPhone via a proxy somewhere on the Web. The proxy is actually a hugely powerful environment, making use of Falcon and Sindice, and a bunch of query engines distributed over the network, all peeking into the LOD cloud and, actually, adding items to it, eg, photos taken on the iPhone. A few years ago all this would have had a SciFi edge to it, and now it was running at the conference…

Eero Hyvönen showed their HealthFinland portal (see also their site), soon to be deployed by the Finnish health authorities. Half of the system is, shall we say, more “traditional” (hm, well, what this means is that it would have been revolutionary two years ago:-), a number of serious ontologies governing health related data integration and search into the data. However, what I found exciting is the other half. Indeed, Eero and friends realized that search facets derived from serious ontologies are not really ideal for everyday end users. Therefore, they made a survey among users, derived a number of terms to be used on the user interface level, and bound these terms internally to the ontology. The result is a much more friendly system that still has the power offered by ontology directed search.

Actually, having Eero’s and Chris’ systems presented side by side was also interesting from another point of view, namely to show that there are cases when using serious ontologies is important and there are cases when it isn’t. When I use an iPhone to navigate in a city and get information about, say, historical buildings then a bit of scruffiness is really not a problem. Speed, interaction, richness of data are more important. However, when it comes to, e.g., health issues, I must admit that I am prepared to wait a bit if I am sure that the results go through the rigorous inference and checking processes that one can achieve through the usage of formal ontologies. This is not the place where one should tolerate scruffiness. The stack (or, to quote Eric Miller, the “menu”) of Semantic Web technologies is rich enough to allow for both; choose what you need! All those discussions of description logic vs. Semantic Web in general are futile in my view…

And then came benji’s paggr system (which actually won the Challenge in the Open Call track). Are you a user of netvibes, iGoogle, or the new Yahoo user interface? Then you know what it means to quickly build up a Web page using small widgets accessing RSS feeds, stock quotes, clocks, etc. Now imagine that each of these widgets is in fact a small SPARQL query with some wrapper to present the result properly. Package that into a nice user interface that benji has always been a master of, and you get paggr. Not yet public, but I already signed up to play with it as soon as it is… This will really be cool!

As for the Billion Triples challenge: I already referred to MaRVIN, but there were a bunch of others like SearchWebDB or SemaPlorer, or SAOR. In some cases massively parallel storage approaches, not only offering near real time (federated) SPARQL query possibilities, but, in some cases, preprocessing it with a lower level RDFS or OWL fragment inferencing. All that done starting with millions of triples integrating all kinds of public datasets, yielding storages going beyond the 1 Billion triple mark. And let us not forget that this mark had already been reached by companies such as Talis or OpenLink, so these new architectures just add to the lot… These were also particularly interesting with an eye to the new OWL RL profile that is being defined in the W3C OWL Working group and which aims at exactly such setups.

Let me finish with another remarkable entry, although this one did not win a prize. i-MoCo created a small navigation system over a triple store containing “only” 250 million triples. So what is the big deal, you might say? Well, all the triples were stored on… an iPhone! So the next challenge will probably be to get, say, 10 billion triples or more on your phone. Just wait a few years…

February 14, 2008

New version of the SPARQL Python wrapper

Filed under: Code,Python,Semantic Web,Work Related — Ivan Herman @ 13:49

About half a year ago I announced the availability of a SPARQL endpoint interface to Python. It was really a beta release back then (ie, last July), but recently it went through a more thorough testing and improvement cycle. This was not my merit; all praise should go to Sergio Fernàndez and Carlos Tejo (both from CTIC Foundation, Spain) who decided to use the package in one of their internal projects. They revealed some problems (of course…), and we then worked together to prepare a proper 1.0 release. It is my pleasure to consider them as co-authors of this small package!

As before, the code is available from my site; the API documentation is included in the distribution (and is also available online). However, the project has also been moved to sourceforge, and is now available there, too (including the on-line documentation).

July 6, 2007

SPARQL Endpoint interface to Python

Filed under: Code,Python,Semantic Web,Work Related — Ivan Herman @ 12:43

I played with SPARQL on my local machine, and I also got inspired by Lee’s SPARQL library for Javascript. But, well, I prefer Python… So I made a set of utility classes first for myself, but then I decided to package it more properly. Maybe others can find it useful, too.

The goal is to give some help in turning a SPARQL query into the corresponding HTTP GET request, sending it to a SPARQL endpoint somewhere on the Web, and doing something with the results. The simplest usage is something like:

from SPARQL import SPARQLWrapper
queryString = "SELECT * WHERE { ?s ?p ?o. }"
sparql = SPARQLWrapper("http://localhost:2020/sparql")
# add a default graph, though that can also be done in the query string
sparql.addDefaultGraph("http://www.example.com/data.rdf")
sparql.setQuery(queryString)
try:
    ret = sparql.query()  # ret is a file-like object streaming the results in XML
except Exception:
    deal_with_the_exception()  # eg, syntax error in the query
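For the curious, the HTTP GET request behind such a call can be sketched with the standard library alone. This is an illustration in present-day Python, not the wrapper's code: the function name is made up, while the parameter names query and default-graph-uri are the ones defined by the SPARQL Protocol.

```python
from urllib.parse import urlencode

def build_sparql_get_url(endpoint, query, default_graphs=()):
    """Encode a SPARQL query (and optional default graphs) into a GET URL."""
    params = [("query", query)]
    # The protocol allows the default graph parameter to be repeated
    params.extend(("default-graph-uri", graph) for graph in default_graphs)
    separator = "&" if "?" in endpoint else "?"
    return endpoint + separator + urlencode(params)

url = build_sparql_get_url(
    "http://localhost:2020/sparql",
    "SELECT * WHERE { ?s ?p ?o. }",
    default_graphs=["http://www.example.com/data.rdf"],
)
```

The resulting URL can be handed to any HTTP client; the endpoint address is the same local example as above.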

To make it even easier to use, conversions to more Python-friendly formats can also be done on the results: eg, turn it into a proper DOM tree if the result is XML, use Bob Ippolito’s simplejson module to convert a return format in JSON into a Python dictionary, or parse it with RDFLib and return an RDFLib Graph in case the return is in RDF/XML. Ie, one could have done:

import SPARQL  # needed for the SPARQL.JSON constant
try:
    sparql.setReturnFormat(SPARQL.JSON)
    ret = sparql.query()
    dict = ret.convert()
except Exception:
    deal_with_the_exception()

where “dict” is a Python dictionary. There are some more tricks in the library, but that is essentially it…
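Once the result is a dictionary, getting to the actual values is easy. The sketch below assumes that the endpoint returns the SPARQL Query Results JSON format (a head with the variable names and a list of bindings); it uses the standard json module of current Python versions, and the helper name is hypothetical.

```python
import json

def bindings_to_rows(json_text):
    """Turn a SPARQL JSON result into a list of {variable: value} dictionaries."""
    data = json.loads(json_text)
    variables = data["head"]["vars"]
    rows = []
    for binding in data["results"]["bindings"]:
        # A variable may be unbound in a given solution, hence the test
        rows.append({var: binding[var]["value"]
                     for var in variables if var in binding})
    return rows

sample = """
{"head": {"vars": ["s", "o"]},
 "results": {"bindings": [
   {"s": {"type": "uri", "value": "http://www.example.com/a"},
    "o": {"type": "literal", "value": "hello"}},
   {"s": {"type": "uri", "value": "http://www.example.com/b"}}
 ]}}
"""
rows = bindings_to_rows(sample)
```

This drops the type and datatype information of each binding, which is fine for a quick look at the data but not for anything that needs to tell URIs from literals.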

The code is available from my site; the API documentation is included in the distribution (and is also available online).

It is an early release. There are some problems, and I expect some more. I have primarily tested it with two different SPARQL endpoints running on my local machine (joseki3 and virtuoso) and also with some public SPARQL endpoints. There are some differences on the return media type for, eg, JSON or N3, the non-standard arguments (eg, setting the return format) still diverge a bit, etc. But I would expect these to converge over time. However, I am sure that my code will have problems with some of the endpoints at least on those grounds (or others)…
