Ivan’s private site

November 2, 2011

Some notes on ISWC2011…

The 10th International Semantic Web Conference (ISWC2011) took place in Bonn last week. Others have already blogged about the conference in a more systematic way (see, for example, Juan Sequeda’s series on semanticweb.com); there is no reason to repeat that. Just a few personal impressions, with the obvious caveat that I may have missed interesting papers or presentations, and that the ones I picked here also reflect my personal bias… So, in no particular order:

Zhishi.me is the outcome of the work of a group from the APEX lab in Shanghai and Southeast University: it is, in some ways, the Chinese DBpedia. “In some ways” because it is actually a mixture of three different Chinese, community-driven encyclopedias, namely the Chinese Wikipedia, Baidu Baike, and Hudong Baike. I am not sure of the exact numbers, but the combined dataset is probably a bit bigger than DBpedia. The goal of Zhishi.me is to act as a “seed” and a hub for Chinese linked open data contributions, just like DBpedia did and does for the LOD in general.

It is great stuff indeed. I do have one concern, which, hopefully, is only a matter of presentation, i.e., a misunderstanding on my side. Although zhishi.me is linked to non-Chinese datasets (DBpedia and others), the paper talks about a “Chinese Linked Open Data (COLD)”, as if this were something different, something separate. As a non-English speaker myself I can fully appreciate the issues of language and culture differences, but I would nevertheless hate to see the Chinese community develop a parallel LOD instead of being an integral part of the LOD as a whole. Again, I hope this is just a misunderstanding!

There were a number of ontology or RDF graph visualization presentations, for example from the University of Southampton team (“Connecting the Dots”), on the first results of an exploration done by Magnus Stuhr and his friends in Norway, called LODWheel (the latter was actually at the COLD2011 Workshop), or another one from a mixed team, led by Enrico Motta, on a visualization plugin to the NeOn toolkit called KC-Viz. I have downloaded the latter and have played a bit with it already, but I have not had the time to form a really informed opinion of it yet. Nevertheless, KC-Viz was interesting for me for a different reason. The basic idea of the tool is to attach some sort of an importance metric to each node in the class hierarchy and direct the visualization based on that metric. It was reminiscent of some work I did in my previous life on graph visualization; the metric was different, the graph was only a tree, and the visualization approach was different, but there was nevertheless a similar feel to it… Gosh, that was a long time ago!

The paper of John Howse et al. on visualizing ontologies was also interesting. Interesting because different: the idea is a systematic usage of Euler diagrams to visualize class hierarchies, combined with some sort of a visual language for the presentation of property restrictions. In my experience, property restrictions are perhaps the most difficult OWL concept to understand without a logic background; any tool, visual or otherwise, that helps in teaching and explaining them can be very important. Whether John’s visual language is the right one I am not sure yet, but it may well be. I will consider using it the next time I give a tutorial…

I was impressed by the paper of Gong Cheng and his friends from Nanjing, “Empirical Study of Vocabulary Relatedness…”. Analyzing the results of a search engine (in this case Falcons) to draw conclusions on the nature, the usage, the mutual relationships, etc., of vocabularies is very important indeed. We need empirical results, bound to real-life usage. This is not the first work in this direction (see, for example, the work of Ghazvinia et al. from ISWC2009), but there is still much to do. Which reminds me of some much smaller scale work Giovanni, Péter, and I did on determining the top vocabulary prefixes for the purpose of the RDFa 1.1 initial context (we used to call it default profile back then). I should probably try to talk to the Nanjing team to merge our results with theirs!
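Just to give a feel for that kind of counting (this is only a toy sketch, with made-up input documents, and not the script we actually used), tallying which namespaces show up as predicates in a handful of RDF files takes only a few lines of Python with rdflib:

# Toy sketch: rank namespaces by how often they appear as predicates
# in a set of RDF documents. The document URLs below are made up.
from collections import Counter
from rdflib import Graph

DOCUMENTS = [
    "http://example.org/data1.rdf",
    "http://example.org/data2.ttl",
]

def namespace_of(uri):
    # crude namespace split: cut after the last '#' or '/'
    uri = str(uri)
    pos = max(uri.rfind("#"), uri.rfind("/"))
    return uri[:pos + 1]

counts = Counter()
for doc in DOCUMENTS:
    g = Graph()
    g.parse(doc)  # rdflib tries to guess the serialization; pass format=... if needed
    for _, p, _ in g:
        counts[namespace_of(p)] += 1

for ns, n in counts.most_common(20):
    print(n, ns)

The real exercise was, of course, based on actual usage data rather than a fixed list of documents, but the principle is the same.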

I think the vision paper of Marcus Cobden and his friends (again at the COLD2011 Workshop) on a “Research Agenda for Linked Closed Data” is worth noting. Although not necessarily earthshaking, the fact that we can and should speak about Linked Closed Data alongside Linked Open Data is important if we want the Semantic Web to be adopted and used by the enterprise world as well. One of the main issues, which is not really addressed frequently enough (although there have been some papers published here and there), is access control. Who has the right to access data? Who has the right to access a particular ontology or rule set that may lead to the deduction of new relationships? What are the licensing requirements, and how do we express them? I do not think our community has a full answer to these. B.t.w., W3C is organizing a Workshop concentrating on the enterprise usage of Linked Data in December…

Speaking about research agendas… I really liked Frank van Harmelen’s keynote on the second day of the conference. His approach was fresh, and the question he asked was different: essentially, after 10 or more years of research in the Semantic Web area, can we derive some “higher level” laws that describe and govern this area of research? I will not repeat all the laws that he proposed; it is better to look at his Web site with the HTML version of his slides. The ones that are worth repeating again and again are that “Factual knowledge is a graph”, “Terminological knowledge is a hierarchy”, and “Terminological knowledge is much smaller than the factual knowledge”. Why are these important? To quote from his keynote slides:

  1. traditionally, KR has focussed on small and very intricate sets of axioms: a bunch of universally quantified complex sentences
  2. but now it turns out that much of our knowledge comes in the form of very large but shallow sets of axioms.
  3. lots of the knowledge is in the ground facts, (not in the quantified formula’s)

Which is important to remember when planning future work and activities. “Reasoning”, usually, happens on a huge set of ground facts in a graph, with a shallow hierarchy of terminology…
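To make that concrete with a deliberately simplistic toy (all names below are made up, and this is not any particular reasoner): a couple of terminological statements, a few ground facts, and a naive forward-chaining loop already show where the derived triples come from.

# Toy illustration: a shallow terminology, ground facts, and naive
# forward chaining of subclass knowledge over rdf:type statements.
subclass_of = {
    ("ex:Cat", "ex:Mammal"),
    ("ex:Mammal", "ex:Animal"),
}

facts = {
    ("ex:felix", "rdf:type", "ex:Cat"),
    ("ex:felix", "ex:livesIn", "ex:Amsterdam"),
}

changed = True
while changed:
    changed = False
    for s, p, o in list(facts):
        if p != "rdf:type":
            continue
        for sub, sup in subclass_of:
            if o == sub and (s, "rdf:type", sup) not in facts:
                facts.add((s, "rdf:type", sup))
                changed = True

print(sorted(facts))
# ex:felix ends up typed as ex:Cat, ex:Mammal, and ex:Animal

In practice the set of ground facts is, of course, many orders of magnitude larger than the terminology; the point is that the derivation is driven by that large, shallow bulk.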

I was a little bit disappointed by the Linked Science Workshop; probably because I had the wrong expectations. I was expecting a workshop looking at how Linked Data in general can help in the renewal of the scientific publication process as a whole (a bit along the lines of the Force11 work on improving the future of scholarly communication). Instead, the workshop was more about how different scientific fields use linked data for their work. Somehow the event was unfocussed for me…

As in some previous years, I was again part of the jury for the Semantic Web Challenge. It was interesting to see how our own expectations have changed over the years. What was really a “wow!” a few years ago has become so natural that we are not excited any more. Which is of course a good thing, it shows that the field is maturing further, but we may need some sort of a Semantic Web Super-Challenge to be really excited again. That being said, the winners of the challenge really did impressive work, and I do not want to give the impression of being negative about them… It is just that I was missing that “wow”.

Finally, I was at one session of the industrial track, which was a bit disappointing. If we wanted to show the research community that Semantic Web technologies are really used by industry, then the session did not really do a good job of it. With one exception, and a huge one at that: the presentation of Yahoo! (beware, the link is to a PowerPoint slide deck). It seems that Yahoo! is building an internal infrastructure based on what they call a “Web of Objects”, regrouping pieces of knowledge in a graph-like fashion. By using internal vocabularies (a superset of schema.org) and the underlying graph infrastructure, they aim at merging similar or identical knowledge pieces harvested on the Web. I am sure we will hear more about this.

Yes, it was a full week…

May 17, 2011

HTTP Protocol for RDF Stores

Filed under: Semantic Web,Work Related — Ivan Herman @ 9:43

Last week the W3C SPARQL Working Group published a number of Last Call Working Drafts for SPARQL 1.1. Much has already been said on various fora about the new features of SPARQL 1.1, like update, entailment regimes, or property paths; I will not repeat that here. But I think it is worthwhile calling attention to one of the documents that may not be seen as a “core” SPARQL query language document, namely the Graph Store HTTP Protocol.

Indeed, this document stands a little bit apart. Instead of adding to the query (and now also update) language, it concentrates on how the HTTP protocol should be used in conjunction with graph stores. I.e., what is the meaning of the well-known HTTP verbs like PUT, GET, POST, or DELETE for graph stores, what should the response codes be, etc. It is important to emphasize that this HTTP behaviour is not bound to SPARQL endpoints; instead, it is valid for any Web site that serves as a graph store. This could include, for example, a Web site simply storing a number of RDF graphs with minimal services to get or change their content. (In this respect, this document is closer to, e.g., the Atom Publishing Protocol, which includes similar features for Atom data, and which also plays an important role for technologies like, for example, OData.) Because such setups, i.e., “just” stores of RDF graphs without a SPARQL endpoint, are fairly frequent, it is important to have these HTTP details settled. So… it is worth looking at this document and sending feedback to the Working Group! (Use the public-sparql-dev@w3.org mailing list for comments.)
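To give a feel for what this looks like in practice, here is a rough sketch (with a made-up store URL, using the indirect ?graph=… addressing style; the draft itself has the authoritative rules on request bodies, status codes, and direct addressing):

# Rough sketch of the Graph Store HTTP Protocol, using the "indirect"
# addressing style (?graph=<iri>). The store URL below is made up.
import requests

STORE = "http://example.org/rdf-graph-store"
GRAPH = "http://example.org/graphs/bookmarks"

turtle = """
@prefix dc: <http://purl.org/dc/terms/> .
<http://example.org/doc> dc:title "A document" .
"""

# PUT: create the graph, or replace its content
requests.put(STORE, params={"graph": GRAPH}, data=turtle,
             headers={"Content-Type": "text/turtle"})

# GET: retrieve a serialization of the graph
r = requests.get(STORE, params={"graph": GRAPH},
                 headers={"Accept": "text/turtle"})
print(r.status_code)
print(r.text)

# POST would merge additional triples into the graph;
# DELETE removes the graph altogether
requests.delete(STORE, params={"graph": GRAPH})

Note that there is nothing SPARQL-specific in there, which is exactly the point.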

April 20, 2011

RDFa 1.1 Primer (draft)

Filed under: Semantic Web,Work Related — Ivan Herman @ 10:21

I have had several posts in the past on the new features of RDFa 1.1 and on where it adds functionality to RDFa 1.0. The Working Group has just published a first draft of an RDFa 1.1 Primer, which gives an introduction to RDFa. We did have such a primer for RDFa already, but the new version has been updated in the spirit of RDFa 1.1… Check it out if you are interested in RDFa!

April 18, 2011

Open data from Fukushima

This is just an extended tweet… Masahide Kanzaki has just posted an announcement on the LOD mailing list on releasing some data he collected on the radioactivity levels at different places in Japan, enriched with metadata (e.g., geo data or time). Though the original data were in PDF, the results are integrated in RDF, with a SPARQL endpoint. He also added a visualization endpoint that gives a simple rendering of the SPARQL query results:

Visualization results for radioactivity data for Tokyo and Fukushima, using integrated datasets and SPARQL query

Simple but effective, and it makes the point about the usage of open data in RDF… Thanks!

April 9, 2011

Announcement on rNews

Filed under: Semantic Web,Work Related — Ivan Herman @ 6:38

A few days ago IPTC published a press release on rNews: “Standard draft for embedding metadata in online news”. This is, potentially, a huge thing for Linked Data and the Semantic Web. Without going into too much technical detail (no reason to repeat what is on the IPTC pages on rNews, you can look it up there), what this means is that, potentially, all major online news services on the globe, from the Associated Press to the AFP, or from the New York Times to the Süddeutsche Zeitung, will have their news items enriched with metadata, and this metadata will be expressed in RDFa. In other words, the news items will be usable, by extracting the RDF, as part of any Semantic Web application, can easily be mashed up with other types of data, etc. In short, news items will become a major part of the Semantic Web landscape, with the extra specificity of being an extremely dynamic set of data that is renewed every day. That is exciting!

Of course, it will take some time to get there, but we should realize that IPTC is the major standard-setting body in the news publishing world. I.e., rNews has a major chance of being widely adopted. It is time for the Semantic Web community to pay attention…

April 1, 2011

2nd Last Call for RDFa 1.1

Filed under: Semantic Web,Work Related — Ivan Herman @ 2:58

The W3C RDFa Working Group published a “Last Call” for RDFa 1.1 back at the end of October last year. This was meant to be a “feature freeze” version and was asking for public comments. Well, the group received quite a number of those. Lots of small things, requiring changes to the documents in many places to make them more precise even in various corner cases, and some more significant ones. In some ways, it shows that the W3C process works, giving the community quite an influence on the final shape of the documents. Because of the many changes, the group decided to re-issue a Last Call (yes, the jargon is a bit misleading here…), aimed at a last check before the document goes to its next phase on the road to becoming a standard. Almost all the changes are minor for users, though important for, e.g., implementers to ensure interoperability. “Almost all”, because there is one new and, I believe, very important though controversial feature, namely the so-called default profiles.

I have already blogged about profiles when they were first published back in April last year. In short, profile documents provide an indirection mechanism to define prefixes and terms for an RDFa source: publishers may collect all the prefixes they deem important for a specific application, and authors, instead of being required to define a whole set of prefixes in the RDFa file itself, can just refer to the profile file to have them all at their disposal. I think the profile feature was the one stirring the biggest interest in the RDFa 1.1 work: profiles are undeniably useful, and undeniably controversial… Indeed, in theory at least, profiles represent yet another HTTP round trip when extracting RDF from an RDFa file, which is never a good thing. But a good caching mechanism or other implementation tricks can greatly alleviate the pain… (B.t.w., the group has also created some guidelines for profile publishers to help implementers.)
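Conceptually (and this is only my own sketch, not the processing algorithm of the specification), what a processor does with profile references boils down to something like the following; a small cache is what makes the extra round trip(s) tolerable:

# Conceptual sketch (not the spec's algorithm): resolve RDFa profile
# documents into prefix/term mappings, with a naive in-memory cache so
# that the extra HTTP round trip is not repeated for every page.
_profile_cache = {}

def load_profile(url, fetch):
    # fetch(url) is assumed to return the profile's mappings, e.g.
    # ({"dc": "http://purl.org/dc/terms/"}, {"license": "..."}).
    if url not in _profile_cache:
        _profile_cache[url] = fetch(url)
    return _profile_cache[url]

def effective_context(profile_urls, local_prefixes, fetch):
    # merge the profiles' mappings; prefixes declared locally in the
    # RDFa source are given precedence here
    prefixes, terms = {}, {}
    for url in profile_urls:
        p, t = load_profile(url, fetch)
        prefixes.update(p)
        terms.update(t)
    prefixes.update(local_prefixes)
    return prefixes, terms

A real processor has to handle quite a bit more (unreachable profiles, the precise precedence rules, terms vs. prefixes, etc.); see the drafts for the details.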

This draft goes one step further by introducing default profiles. These are profiles just like any other, but they are defined with fixed URIs (namely http://www.w3.org/profile/rdfa-1.1 for RDFa 1.1 in general and, additionally, http://www.w3.org/profile/html-rdfa-1.1 for the various HTML variants), and the user does not have to declare them in an RDFa source. Which means that a very simple HTML+RDFa file of the sort:

<html>
  <body>
    <p about="xsd:maxExclusive" rel="rdf:type" resource="owl:DatatypeProperty">
      An OWL Axiom: "xsd:maxExclusive" is a Datatype Property in OWL.
    </p>
  </body>
</html>

(note the missing prefix declarations!) will produce the RDF triple that you might expect, namely that xsd:maxExclusive is typed as an owl:DatatypeProperty. Can’t be simpler, can it?

Why? Why was it necessary to introduce this? Well, experience shows that many HTML+RDFa authors forget to declare the prefixes. One can look, for example, at the pages that include Facebook’s Open Graph Protocol RDFa statements: although I do not have exact numbers, I would suspect that around 50% of these pages do not have them. That means that, strictly speaking, those statements cannot be interpreted as RDF triples. The Semantic Web community may ask, try to convince, beg, etc., the HTML authors (or the underlying tools) to do “the right thing”, and we certainly should continue doing so, but we also have to accept this reality. A default profile mechanism can alleviate that, thereby greatly extending the number of triples that can become part of a Web of Data. And even for seasoned RDF(a) users, not having to declare anything for some of the common prefixes is a plus.

Of course, the big, nay, the BIG issue is: what prefixes and terms would those default profiles declare? What is the decision procedure? At this time, we do not have a final answer yet. It is quite obvious that all the vocabularies defined by W3C Recommendations and official Notes and that have a fixed prefix (most of them do) should be part of the list. We may want to add Member Submissions to this list. If you look at the default profile, these are already there in the first table (i.e., the code example above is safe). The HTML variant would add all the traditional @rel values, like license, next, previous, etc.

But what else? At the moment, the profiles include a set of prefixes and terms that are just there for testing purposes (although they do indicate a tendency), so do not take the default profile as the final content. For the HTML @rel values, we would, most probably, rely on whatever policy the HTML5 Working Group eventually defines; the role of the HTML default profile will simply be to reflect those. That seems quite straightforward. However, the issue of default prefixes is clearly different. For those, the Working Group is contemplating two different approaches:

  1. Set up some sort of a registration mechanism, not unlike the xpointer registry. This would also include some accompanying mailing lists where objections can be raised against the inclusion of a specific prefix, etc.
  2. Try to get some information from search engines on the Semantic Web (Sindice, Yahoo!, anyone else?) that may provide us with a list of, say, the top 20 prefixes as used on the Semantic Web. Such a list would reflect the real usage of vocabularies and prefixes. (We still have to see whether this is information these engines can provide or not.)

At this moment it is not yet clear which way is realistic. Personally, I am more in favour of the second approach (if technically feasible), but the end result may be different; this is a policy that W3C will have to set up.

Apart from the content, another issue is the mode and frequency of change of the default profiles. First of all, the set of default prefixes can only grow. I.e., once a prefix has made it onto the default profile, it has to stay there with an unchanged URI. That is obviously important to ensure stability. I.e., new prefixes coming to the fore by virtue of being used by the community can be added to the set, but no prefix can be removed. As for the frequency: a balance has to be found between stability, i.e., that RDFa processors can rely (e.g., for caching) on a not-too-frequent change of the default profiles, and relevance, i.e., that new vocabularies can find their way into the set of default prefixes. Again, my personal feeling is that an update of the profiles once every 6 months, or even once a year, might strike a good balance here. To be decided.

As before, all comments are welcome but, again as before, I would prefer if you sent those comments to the RDFa WG’s mailing list rather than commenting on this blog: public-rdfa-wg@w3.org (see also the archives).

Finally: I have worked on a new version of my RDFa distiller to include all the 1.1 features. This version of the distiller is now public, so you can try out the different new features. Of course, it is still not a final release, there are bugs, so…

March 29, 2011

LDOW2011 Workshop

Filed under: Semantic Web,Work Related — Ivan Herman @ 15:29

The Linked Open Data Workshop (LDOW20XX) has become an integral part of the yearly WWW conferences, and this year was no exception, under the unsurprising name of LDOW2011. And, as always, it was an enjoyable, pleasant event. The organizers (Chris Bizer, Tom Heath, Michael Hausenblas, and Tim Berners-Lee) made the choice of accepting slightly fewer papers to leave room for more discussion. That was a good choice; the workshop was really a workshop rather than just a series of presentations, there were nice discussions, lots of comments… and that was great.

It is very difficult to summarize a whole day, and I do not want to go through and comment on each individual paper. The papers (and, I believe, soon the presentation slides) are on the Web, of course; it is worth glancing at each of them. For me, and that is obviously very personal, maybe the most important takeaway is actually close to the blog post I wrote yesterday on the empirical study of SPARQL queries. And this is the general fact that we are at the point where the size and complexity of the Linked Open Data cloud are such that we can begin to make meaningful measurements, experimental data analyses, empirical studies, etc., to understand how the data is really used out there, what the shape and behavior of the beast are, and how these affect the tools and specifications we develop.

The workshop started with an overview by Chris (I hope his slides will be on the Web at some point) doing exactly that. He looked at the evolution of the LOD cloud and tried to analyze its content. There were some nice figures: the growth in 2010, in terms of the number of triples, was 300%, with some spectacular application areas coming into the game, like a 955% growth of library-related data, or the appearance of governmental data, from nothing in 2009 to about 11B triples in November 2010. Although Danny Vrandecic made the remark at the end of the Workshop that we should stop measuring the LOD cloud in terms of the pure number of triples (and I can agree with that), those numbers are nice nevertheless. Some figures were less satisfactory: the number of links among datasets is relatively low (90 out of the 200 datasets have only around 1000 links to the outside, and the majority interlink with only one other dataset); only around 9% of the datasets publish machine-readable licenses (although 31% publish machine-readable provenance data, which is a bit nicer). Some of the common vocabularies are widely reused (31% use Dublin Core terms, for example), but way too many dataset publishers define their own vocabulary even if that is not strictly necessary, and only about 7% publish mapping relationships from their own vocabulary to others.

Beyond the numbers themselves, I believe the important point is that somebody collects and publishes these data regularly, so that we understand where we should put the emphasis in future. For example (and this came up during the discussion), work should be done on simple (in my view rule-based, i.e., RIF or N3) mappings among vocabularies, and those should be published for others to use; that figure of 7% is really too low. Work on helping data providers to create additional links easily is another area of necessary improvement (and there were, in fact, several papers on that very topic during the day).

I do not know whether it was a coincidence or whether the organizers did it on purpose, but the day ended with a similar paper, but on vocabularies. A group from DERI collected some specific datasets to see how a particular vocabulary (in this case the GoodRelations vocabulary) is being used on the Web of Data, what the usage patterns are, how it can be used for specific possible use cases, etc. The issue here is not the GoodRelations ontology as such (you can see the details of the results in the paper) but rather the methodology: we are at the point where we can measure what we have, and we can therefore come up with empirical data that will help us concentrate on what is essential. I hope this approach will come to the fore more and more in future. We need it.

It was a good day.

March 28, 2011

Empirical study of real-world SPARQL queries

Filed under: Semantic Web,Work Related — Ivan Herman @ 12:21

A nice paper I just heard at the USEWOD2011 Workshop at the WWW2011 conference: “Empirical study of real-world SPARQL queries”, by M.A. Gallego and his friends from the Univ. of Valladolid, in Spain. What they did was to analyse the SPARQL queries issued by various clients to the DBpedia and Semantic Web Dogfood datasets, to see if some general features appear that RDF triple stores and SPARQL implementers can take into account. This is a workshop paper, i.e., work in progress, so the results must be taken with a pinch of salt. E.g., it seems that DESCRIBE and CONSTRUCT queries are very rarely used (not a big surprise), that OPTIONAL and UNION are used quite a lot, so their optimization is important, and that most of the queries are dead simple, but around half of them rely on FILTER (albeit with one variable only), etc.
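The flavour of such an analysis is easy to reproduce on a small scale. The sketch below is only a toy (with a hypothetical log file and a crude keyword match instead of real SPARQL parsing), but it shows the general idea of tallying feature usage over a set of query strings:

# Toy sketch of a query-log analysis: count how often some SPARQL
# features occur in a list of query strings. The log file name is
# hypothetical; real logs would also need URL-decoding, etc.
import re
from collections import Counter

FEATURES = ["SELECT", "CONSTRUCT", "DESCRIBE", "ASK",
            "OPTIONAL", "UNION", "FILTER"]

def feature_counts(queries):
    counts = Counter()
    for q in queries:
        for f in FEATURES:
            if re.search(r"\b" + f + r"\b", q, re.IGNORECASE):
                counts[f] += 1
    return counts

with open("sparql-queries.log") as log:  # one query per line, assumed
    queries = [line.strip() for line in log if line.strip()]

counts = feature_counts(queries)
total = len(queries)
for f in FEATURES:
    share = 100.0 * counts[f] / total if total else 0.0
    print("%-10s %6d  %5.1f%%" % (f, counts[f], share))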

The interesting point for me is, however, that some of these data were radically different between the two datasets. E.g., 16% of the queries used OPTIONAL for DBpedia, whereas only 0.41% did for the Dogfood dataset. What this tells me is that it is extremely difficult to optimise data stores in general. I.e., the characteristics of the dataset, and indeed the application area (e.g., I would expect SPARQL queries to be much more complicated in the health care domain), have to play an important role. What the dimensions of optimization are is not clear, but the type of research Gallego and his friends are doing might shed some light… Kudos for having started this discussion!

March 13, 2011

An example of the power of open data…

Earthquakes around the globe on the week of the 11th of March

I wish I did not have to use this example… But I just hit it this morning via a tweet from Jim Hendler. RPI has an example of how one can combine public government data (in this case a Data.gov dataset on earthquakes), its RDF version with a SPARQL query, and a visualization tool like Exhibit. The result is an interactive map of the earthquakes of the last week. Running the demo today reveals an incredible number (over 160) of events on the coast of Honshu, Japan, which led to the earthquake and tsunami disaster on the 11th of March. I do not know how much time it took Li Ding to prepare the original demo, but I suspect it was not a big deal once the tools were in place.
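Just to give an idea of the plumbing involved (everything below is made up for illustration: the endpoint URL and the property names are hypothetical, and the real demo relies on the RPI conversion of the Data.gov dataset plus Exhibit rather than hand-written code like this), one could ask a SPARQL endpoint for the week’s quakes and count those near Japan roughly like so:

# Illustrative sketch only: query a (hypothetical) SPARQL endpoint for
# earthquake events and count those in a rough bounding box around Japan.
import requests

ENDPOINT = "http://example.org/sparql"  # hypothetical endpoint
QUERY = """
PREFIX eq: <http://example.org/earthquake#>
SELECT ?quake ?lat ?long WHERE {
  ?quake eq:latitude ?lat ;
         eq:longitude ?long .
}
"""

r = requests.get(ENDPOINT, params={"query": QUERY},
                 headers={"Accept": "application/sparql-results+json"})
bindings = r.json()["results"]["bindings"]

near_japan = [
    b for b in bindings
    if 24 <= float(b["lat"]["value"]) <= 46
    and 122 <= float(b["long"]["value"]) <= 148
]
print(len(near_japan), "of", len(bindings), "quakes near Japan")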

The demo is dynamic, in the sense that in a week it will probably show different data than today. So I have made a screen dump as a memento (I hope that is all right with Jim and Ding). If you are looking at it now, it is worth zooming into the area around Japan to gain some more insight into the sheer dimensions of the disaster: there were 325 quakes (out of 411 around the globe) in that area during the week! I must admit I did not know that…

I have the, hopefully not too naïve, belief that tools like this may not only increase our factual knowledge, but would also, in future, help those who are now struggling to cope with the aftermath of this disaster. Yes, having open data, and tools to handle and integrate them, is really important.
