Ivan’s private site

October 12, 2007

Wikipedia URI-s as reliable identifiers for the Semantic Web?

Filed under: Semantic Web,Work Related — Ivan Herman @ 15:05

Martin Hepp drew my attention to one of his upcoming publications[1]; some related thoughts…

The issues around URI-s come up regularly on the various SW-related mailing lists and discussion fora, sometimes leading to passionate discussions. That is all right; it is indeed a complicated issue. But, somehow, the question of where to find the URI-s for various concepts does not always get enough attention (at least in my view). I remember a while ago, when Frederick, Yves, and some others were working on the Music Ontology, my question was: all that is fine, but what is the authoritative URI for, say, Beethoven’s 7th symphony?

Of course, in some areas, communities are working on such naming schemes for their own constituencies. LSID-s are a prime example in the Life Science domain. Online catalogs of digital libraries (see my earlier blog referring to RDA-s, for example) might provide us with another rich source of stable URI-s. As yet another example, the lingvoj site of Bernard Vatant (just updated a few days ago) might establish itself as a set of stable URI-s for spoken languages (i.e., the URI http://www.lingvoj.org/lang/hu might become the URI for Hungarian). A number of similar datasets are appearing, for example, through the Open Linked Data project, and these could eventually play similar roles.

Yes but, in the meantime, what happens to the vast number of other “things”? What is the answer to my 7th symphony question? An idea I heard before: why not use the Wikipedia URI-s for that purpose? And that sounds like a good idea indeed. However, for that to work, a number of questions should be answered. E.g., how stable are those URI-s? How reliable are they? And this is where Martin et al.’s paper comes in. They do a series of statistical measurements and analyses on the evolution of Wikipedia entries (they rely on data from this year). Their measurements indicate, for example, that the Wikipedia URI-s are indeed stable enough. To be more precise, their measurements show that this year around 93% of the URI-s on Wikipedia had a stable meaning (i.e., the text of the corresponding article may have changed in some details, but the URI can still be considered as referring to the same notion). Given the large number of articles, this seems pretty o.k. to me… There are also some other statistical details in the paper (on the subject of the articles, for example), as well as further references, but, succinctly, that is probably the most important result. I am sure that further analysis on Wikipedia is still necessary (and I am also sure it will happen); this paper is certainly an interesting one among those!

So, should we rely on Wikipedia for the 7th symphony? Almost. If we go in this direction, my choice would be to use DBpedia instead. DBpedia being a dump of Wikipedia, it inherits all the stability results that Martin et al. describe. Also, the current DBpedia setup makes a clear distinction between a non-information resource URI and its RDF representation (an issue raised as a problem in [1] for Wikipedia URI-s). Last, but certainly not least, the RDF graphs in DBpedia are linked to an increasing number of other data sets via the Open Linked Data setup, which applications may also exploit. I.e., a suitable URI for the 7th symphony might be:

http://dbpedia.org/resource/Symphony_No._7_%28Beethoven%29

derived from the corresponding Wikipedia URI.
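
As a rough illustration of that derivation, here is a minimal Python sketch. It assumes that the mapping is a plain prefix swap, from the English Wikipedia article prefix to the DBpedia resource prefix, with the article title carried over unchanged; the function name `dbpedia_uri` is made up for this example, and DBpedia itself may normalize some titles differently.

```python
from urllib.parse import urlsplit

# Assumed prefixes: the English Wikipedia article path and the
# DBpedia resource namespace used in the URI shown above.
WIKIPEDIA_HOST = "en.wikipedia.org"
WIKIPEDIA_PATH_PREFIX = "/wiki/"
DBPEDIA_RESOURCE_PREFIX = "http://dbpedia.org/resource/"


def dbpedia_uri(wikipedia_url: str) -> str:
    """Derive a DBpedia resource URI from an English Wikipedia article URL.

    Hypothetical helper: it only swaps the prefix and keeps the
    (percent-encoded) article title exactly as it appears in the URL.
    """
    parts = urlsplit(wikipedia_url)
    if parts.netloc != WIKIPEDIA_HOST:
        raise ValueError(f"not an English Wikipedia URL: {wikipedia_url}")
    if not parts.path.startswith(WIKIPEDIA_PATH_PREFIX):
        raise ValueError(f"not an article URL: {wikipedia_url}")
    title = parts.path[len(WIKIPEDIA_PATH_PREFIX):]
    if not title:
        raise ValueError(f"missing article title: {wikipedia_url}")
    return DBPEDIA_RESOURCE_PREFIX + title


if __name__ == "__main__":
    print(dbpedia_uri(
        "http://en.wikipedia.org/wiki/Symphony_No._7_%28Beethoven%29"
    ))
```

Note that the sketch produces the URI of the "thing" only; which document a client finally receives when it dereferences that URI is up to the DBpedia server.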

Of course, this is not a silver bullet. One can criticize the choice of topics that Wikipedia covers (or does not cover). To continue my example, the list of Beethoven’s works is fairly well covered by Wikipedia articles, but this is less true for, say, Robert Schumann. New, more systematic vocabularies might appear, in which case we may have URI aliases on our hands. Etc. However… do we have another, existing choice for today? I would be curious to hear…

(Note that the URI alias issue might be solved by automatically adding owl:sameAs predicates wherever appropriate. For example, the lingvoj data already includes such a link for each language, linking to… the corresponding DBpedia URI.)
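
To make the alias idea a bit more concrete, here is a small Python sketch of how an application might collapse URI aliases once such owl:sameAs links are available. It treats triples as plain tuples of strings and groups aliases with a union-find structure; the function name `canonical_uris`, the choice of the lexicographically smallest URI as the representative, and the two example URIs other than the lingvoj one are assumptions made only for this illustration.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def canonical_uris(triples):
    """Map every URI that occurs in an owl:sameAs triple to one representative.

    `triples` is an iterable of (subject, predicate, object) string tuples.
    The representative of a group of aliases is, by assumption, the
    lexicographically smallest URI of that group.
    """
    parent = {}

    def find(uri):
        # Follow parent links up to the root, shortening the path on the way.
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]
            uri = parent[uri]
        return uri

    for subject, predicate, obj in triples:
        if predicate != OWL_SAME_AS:
            continue  # only alias statements matter here
        root_s, root_o = find(subject), find(obj)
        if root_s != root_o:
            keep, drop = sorted((root_s, root_o))
            parent[drop] = keep

    return {uri: find(uri) for uri in list(parent)}


if __name__ == "__main__":
    triples = [
        # Illustrative link only; the DBpedia URI is an assumption.
        ("http://www.lingvoj.org/lang/hu", OWL_SAME_AS,
         "http://dbpedia.org/resource/Hungarian_language"),
        # Hypothetical third vocabulary that aliases the lingvoj URI.
        ("http://example.org/lang/hungarian", OWL_SAME_AS,
         "http://www.lingvoj.org/lang/hu"),
    ]
    for uri, representative in sorted(canonical_uris(triples).items()):
        print(uri, "->", representative)
```

A real application would more likely delegate this to an RDF store or a reasoner; the point is only that aliases linked this way can be merged mechanically.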

[1] M. Hepp, K. Siorpaes, and D. Bachlechner, “Harvesting Wiki Consensus Using Wikipedia Entries as Vocabulary for Knowledge Management,” IEEE Internet Computing, vol. 11, pp. 54-65, 2007. Also available on-line at Martin’s site.


8 Comments

  1. I have to say, that I could not agree with you in 100% regarding o.us poetry, but it’s just my opinion, which could be wrong :)

    Comment by Daniel — October 12, 2007 @ 17:02

  2. I also think a lot more work has to be put into providing accepted “standard” URIs for different domains.
    You already run into difficulties when you want to use ISO codes to describe things that are identified by them, for example ISO 639 for languages, ISO 15924 for writing systems, ISO 3166 for countries and areas or ISO 4217 for currencies. Of course you could use an ISO code as a literal, but then you need schemas that say what these literals mean (such as DCMI partly provides them, but they don’t seem to want to go that way any more). I think the optimum would be to use URNs; but defining namespaces for that is something ISO has to do. So far there’s only a draft for a namespace for the ISO specifications (http://www.ietf.org/internet-drafts/draft-goodwin-iso-urn-02.txt) but nothing to encode the actual codes defined in some of them as URNs. Going by that draft it could look like urn:iso:code:639-3:eng.
    Another thing you could use URNs for is identifying products. Fortunately there’s already a draft for that: http://www.ietf.org/internet-drafts/draft-mealling-epc-urn-02.txt – so you can identify every product that is registered with a standard barcode!
    Regarding the music, a year ago I probably would have told you that the answer to this is MusicBrainz. Unfortunately they’re going to drop RDF support in favour of the XML webservice they’re running now. MusicOntology aims to provide a mapping to MusicBrainz I think, so in that case you would just have to go somewhere else to get the RDF descriptions. But it would be nice if the authority for identifying the resources would stay at MusicBrainz. This would be a separation between URIs identifying the resources (which IMHO shouldn’t contain any versioning info or query parameters like the current MusicBrainz RDF export has it, e.g. http://musicbrainz.org/mm-2.1/track/57ee036f-6b2c-4725-9b9e-f6151ff3c18b/4) and URLs at which you find the RDF descriptions about these resources (with query parameters, versioning etc.). So a clean URI would be http://musicbrainz.org/track/57ee036f-6b2c-4725-9b9e-f6151ff3c18b (indeed if you append “.html” to that you get the HTML representation of the track, it also redirects to that view). Note though that this is not explicitly declared by MusicBrainz as the canonical URI for identifying that track, it would just be a convention to use that form. The only problem I can see here is that clients retrieving RDF descriptions from MusicOntology’s mapping would assume that those URIs are also URLs which hold additional information. So they would have to do something like to link back to their service.
    If you want to identify works like Beethoven’s 7th symphony that way, you still have to wait a bit, the MusicBrainz database schema is planned to evolve to also include that. So far they only list versions of it which appear on releases.

    Comment by Simon Reinhardt — October 12, 2007 @ 20:08

  3. Simon,

    thanks for your feedback.

    As for MusicBrainz: obviously, I thought of that. However, my overall experience with MusicBrainz (just like with a number of other on-line facilities and sites for music or, for that matter, with the whole iTune/iPod world) is not good. I am indeed part of a small minority in the sense that my interest is in classical music (hence my examples in the blog); the structure, the terms used, etc, of all those sites are mostly inadequate for that world… But this is obviously a very specific case.

    Comment by Ivan Herman — October 13, 2007 @ 8:46

  4. I am unconvinced that DBpedia provides better URIs than the Wikipedia pages themselves. The page/entity ambiguity is a deceptive problem, I think, since the complexity involved in attempting a correct formal handling is a deadly blow to any hope of pushing the SW out there into daily use, e.g. for annotations.
    E.g. if one is not to use Wikipedia URIs directly, you have to arbitrarily select another provider (first choice to make!). OK, you go for DBpedia, which offers a real “resource” and has an IFP pointing at Wikipedia, but try to call that URI to see what it is and... you get a 303 which changes your browser URL to yet something else (and something else again if you want the RDF description). How do you explain all this to the end user exactly (who just wants to send an email or pass a concept over IM)?
    Any sensible SW tool, I am deeply convinced, will have to understand http://en.wikipedia.org/wiki/Semantic_Web as the URI for semantic web. (I’d prefer TDB:http://en.wikipedia.org/wiki/Semantic_Web but non-http URI schemes are so unhip these days) :-)

    Comment by Giovanni Tummarello — October 13, 2007 @ 9:57

  5. Hmm it stripped the XML from my comment, one of the last sentences should read:

    “So they would have to do something like rdf:Description rdf:about="http://musicbrainz.org/track/57ee036f-6b2c-4725-9b9e-f6151ff3c18b" rdf:seeAlso="http://musicontology.com/…57ee036f-6b2c-4725-9b9e-f6151ff3c18b…"/ to link back to their service.”

    with the brackets around the element.

    Anyway, you’re right that MusicBrainz is primarily a database for Pop music at the moment. I was involved in trying to come up with a more detailed database schema for it which, among other things, would cover classical music much better (and I don’t think it’s too specific a case). I got frustrated by some things there and my interest moved away from it. But, even though it will still take a lot of time (it’s a FOSS project after all), I think they’re going in the right direction. The goal of MusicBrainz is to be a database about everything music related – and with the next server release they’ll include tags so they’re even moving away from covering factual data only.

    Comment by Simon Reinhardt — October 13, 2007 @ 21:05

  6. Wikipedia URIs provided by DBpedia are one of the only intuitive, widely known and correctly implemented examples that we have at the moment – use them.

    Btw, “The issues around URI-s come up regularly on the various SW related mailing list and discussion fora leading, sometimes, to passionate discussions. That is all right, it is indeed a complicated.”

    This is a good tutorial about how to do it:
    http://www.dfki.uni-kl.de/~sauermann/2006/11/cooluris/

    (soon-to-be released as SWEO note)

    Comment by leo — October 15, 2007 @ 9:20

  7. Giovanni Tummarello wrote:

    “(I’d prefer TDB:http://en.wikipedia.org/wiki/Semantic_Web but non http URI schemas are so unhip these days)”.

    Me too. So write:

    http://t-d-b.org?http://en.wikipedia.org/wiki/Semantic_Web

    which works as a 303 redirect to the Wikipedia article (as does the equivalent http://thing-described-by.org URI). See either root page for more information.

    edavies@bill:~$ wget -S http://t-d-b.org?http://en.wikipedia.org/wiki/Semantic_Web
    --21:55:22-- http://t-d-b.org/?http://en.wikipedia.org/wiki/Semantic_Web
    => `index.html?http:%2F%2Fen.wikipedia.org%2Fwiki%2FSemantic_Web'
    Resolving t-d-b.org... 209.68.9.56
    Connecting to t-d-b.org|209.68.9.56|:80... connected.
    HTTP request sent, awaiting response...
    HTTP/1.1 303 See Other
    Date: Fri, 26 Oct 2007 20:55:24 GMT
    Server: Apache/2.2.4
    Location: http://en.wikipedia.org/wiki/Semantic_Web
    Content-Length: 248
    Keep-Alive: timeout=5, max=100
    Connection: Keep-Alive
    Content-Type: text/html; charset=iso-8859-1
    Location: http://en.wikipedia.org/wiki/Semantic_Web [following]
    --21:55:23-- http://en.wikipedia.org/wiki/Semantic_Web
    => `Semantic_Web.1'
    Resolving en.wikipedia.org... 145.97.39.155
    Connecting to en.wikipedia.org|145.97.39.155|:80... connected.
    HTTP request sent, awaiting response...
    HTTP/1.0 200 OK
    Date: Fri, 26 Oct 2007 14:49:18 GMT
    Server: Apache
    X-Powered-By: PHP/5.1.2
    Content-Language: en
    Vary: Accept-Encoding,Cookie
    Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
    Last-Modified: Fri, 26 Oct 2007 14:45:14 GMT
    Content-Length: 81798
    Content-Type: text/html; charset=utf-8
    X-Cache: HIT from sq29.wikimedia.org
    X-Cache-Lookup: HIT from sq29.wikimedia.org:3128
    Age: 21550
    X-Cache: HIT from knsq1.knams.wikimedia.org
    X-Cache-Lookup: HIT from knsq1.knams.wikimedia.org:3128
    X-Cache: MISS from knsq2.knams.wikimedia.org
    X-Cache-Lookup: MISS from knsq2.knams.wikimedia.org:80
    Via: 1.0 sq29.wikimedia.org:3128 (squid/2.6.STABLE13), 1.0 knsq1.knams.wikimedia.org:3128 (squid/2.6.STABLE13), 1.0 knsq2.knams.wikimedia.org:80 (squid/2.6.STABLE12)
    Connection: keep-alive
    Length: 81,798 (80K) [text/html]

    Comment by Ed Davies — October 26, 2007 @ 22:58

  8. Perhaps we are not addressing the problem at the right level. Indeed, many content providers (like Wikipedia or DBpedia) may claim that their URIs must be preferred, and each of them may have good reasons for this claim. But this approach has four serious drawbacks:

    (1) the first is how we decide which is the “authoritative” source among all candidates;
    (2) second, most entities will never make their way into these portals;
    (3) third, the URI would bring with itself a description (e.g. a Wikipedia article) which may not be the desired one;
    (4) finally, and even more important from the architectural standpoint, these portals are not designed to be a service which returns identifiers for applications which need them.

    My proposal is that URIs for non-information objects should be treated at an infrastructural level. A good analogy is the way URLs are resolved on the Web. As the “authority” part of URLs is resolved by the DNS, we can imagine a service which sees local URIs (i.e. URIs minted by any given application, like an ontology editor) as “symbolic names” for non-information resources (e.g. people, locations, events, etc.), and maps them to a canonical identifier which would be the analogue of the IP number for a server. I like to see all this as an Entity Naming System (ENS), whose architecture should be fully distributed and decentralized. And suitable protocols should allow any application that creates content (structured, semi-structured or unstructured) to interact with the ENS.

    Realizing such a service poses several challenges, including: an entity matching service (how do we know that such and such URI is the right one for the entity we want to talk about?), a large-scale repository service (well, there are billions of things which people may want to talk about …), general interfaces for interaction with highly heterogeneous applications, bootstrap & population of the service, etc. Addressing these issues is the goal of the new OKKAM EU FP7 funded project, which will start on January 1st, 2008. In the preliminary project portal (http://www.okkam.org) you can already find two toy examples (actually, more proof of concept, though they work) of how we think the idea should work for a FOAF editor and for Protege. Any feedback is more than welcome.

    Comment by Paolo Bouquet — November 22, 2007 @ 18:35

