Ivan’s private site

January 24, 2012

Nice reading on Semantic Search

I had a great time reading a paper on Semantic Search[1]. Although the paper covers the details of a specific Semantic Web search engine (DERI’s SWSE), I read it as somebody who is not really familiar with the intricate details of how such a search engine is set up and operated (i.e., I would not dare to give an opinion on whether the choices made by this group are better or worse than those made by the developers of other engines) and who wants to get a good picture of what is happening in general. For that purpose, this paper was really interesting and instructive. It is long (ca. 50 pages), so I did not even try to understand everything on my first reading, but it did give a great overall impression of what is going on.

One of the “associations” I had, maybe somewhat surprisingly, is with another paper I read lately, namely a report on basic profiles for Linked Data[2]. In that paper Nally et al. look at what “subsets” of current Semantic Web specifications could be defined, as “profiles”, for the purpose of publishing and using Linked Data. This was also a general topic at a W3C Workshop on Linked Data Patterns at the end of last year (see also the final report of the event), and it is no secret that W3C is considering setting up a relevant Working Group in the near future. Well, the experiences of an engine like SWSE might come in very handy here. For example, SWSE uses a subset of the OWL 2 RL Profile for inferencing; that may be a good input for a possible Linked Data profile (although the differences are really minor, if one looks at the appendix of the paper that lists the rule sets the engine uses). The idea of “Authoritative Reasoning” is also interesting and possibly relevant; that approach makes a lot of pragmatic sense, and I wonder whether it should somehow be documented for general use. And I am sure there are more: in general, analyzing the experiences of major Semantic Web search engines on handling Linked Data might provide a great set of input for such pragmatic work.

I was also wondering about a very different issue. A great deal of work had to be done in SWSE on the proper handling of owl:sameAs. On the other hand, one of the recurring discussions on various mailing lists and elsewhere is on whether the usage of this property is semantically o.k. or not (see, e.g., [3]). A possible alternative would be to define (beyond owl:sameAs) a set of properties borrowed from the SKOS Recommendation, like closeMatch, exactMatch, broadMatch, etc. It is almost trivial to generalize these SKOS properties for the general case but, reading this paper, I was wondering: what effect would such predicates have on search? Would it make it more complicated or, in fact, would such predicates make the life of search engines easier by providing “hints” that could be used for the user interface? Or both? Or is it already too late, because the ubiquitous usage of owl:sameAs is already so prevalent that it is not worth touching that stuff? I do not have a clear answer at this moment…

Thanks to the authors!

  1. A. Hogan, et al., “Searching and Browsing Linked Data with SWSE: the Semantic Web Search Engine”, Journal of Web Semantics, vol. 9, no. 4, pp. 365-401, December 2011.
  2. M. Nally and S. Speicher, “Toward a Basic Profile for Linked Data”, IBM developerWorks, 2011.
  3. H. Halpin, et al., “When owl:sameAs Isn’t the Same: An Analysis of Identity in Linked Data”, Proceedings of the International Semantic Web Conference, pp. 305-320, 2010.

6 Comments

  1. Others are much more qualified than I am to speak to the core questions you raise in this post. Just this much:

    > Or is it already too late, because the ubiquitous usage of owl:sameAs is already so prevalent that it is not worth touching that stuff?

    Being relatively new to the party, I don’t have any history in this discussion. What I think I can safely say is that in the grand scheme of things (i.e., the Internet as a whole), neither owl:sameAs nor any of the SKOS properties is really “prevalent” in the classic definition of that word. So why not tackle this task as you suggest (provided these alternatives actually *do* provide better data for search)?

    We should never rule out scrapping years (or even decades) of work in favor of something new — if it promises to be better.

    Comment by Andreas Gebhard — January 24, 2012 @ 21:07

  2. [...] Nice reading on Semantic Search (ivan-herman.name) comente! [...]

    Pingback by caos! blog » Semantic Web – Zemanta Experiment — January 25, 2012 @ 4:26

  3. > A possible alternative would be to define (beyond owl:sameAs) a set of properties borrowed from the SKOS Recommendation, like closeMatch, exactMatch, broadMatch, etc.

    I have always recommended that registered info should correspond as much as possible to reality, i.e., if two things ARE not the same, owl:sameAs should not be used. If, during this phase, the transformation from reality to the formalization deviates too much from reality, you lose information, and you will never be able to get it back without changing the created formalization.

    If a user wants (at exploitation level) results to include broader, narrower, similar, etc. relationships, local query expansion for the predicates can easily be done.
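
    A minimal sketch of what such local query expansion could look like, written in plain Python over in-memory triples. The expansion table, the `ex:`/`other:` identifiers and the function names are all hypothetical, chosen only for illustration; a real engine would apply the same idea inside its own query layer.

```python
# Hypothetical, consumer-side expansion table: for each predicate a user may
# ask for, the predicates the consumer is willing to accept as well.
# Which predicates belong here is a local policy decision, not part of SKOS.
EXPANSIONS = {
    "skos:exactMatch": {"skos:exactMatch", "skos:closeMatch"},
}

# Illustrative data only; the identifiers are made up.
TRIPLES = [
    ("ex:red", "skos:exactMatch", "other:red"),
    ("ex:red", "skos:closeMatch", "other:crimson"),
    ("ex:red", "skos:broadMatch", "other:warm-colours"),
]


def expand_predicate(predicate, expansions=EXPANSIONS):
    """Return the set of predicates to match; unknown predicates map to themselves."""
    return expansions.get(predicate, {predicate})


def query(triples, subject, predicate, expand=False):
    """Return the sorted objects for (subject, predicate), optionally expanded."""
    predicates = expand_predicate(predicate) if expand else {predicate}
    return sorted(o for s, p, o in triples if s == subject and p in predicates)


strict = query(TRIPLES, "ex:red", "skos:exactMatch")
relaxed = query(TRIPLES, "ex:red", "skos:exactMatch", expand=True)
```

    The published data keeps the precise predicate; only the consumer decides, per query, how loosely to read it.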

    Of course, there are numerous cases where there is no clear conceptual border between two or more concepts (e.g., colors), but we can use “cluster” concepts that group all variants (e.g., “red” or “red colors”). These clusters might contain personal flavors that contradict each other (e.g., a reddish-green and a greenish-red might be the same color in physical (RGB) terms, yet be considered “green” by one group of people and “red” by another).

    Roughly speaking: let predicates proliferate, and let search engines do the heavy work of expanding the queries. I did some work on similarity recommendation (Pamela – Personal agent for mapping elements that look alike), and this can be done.

    Regards,

    Ronald Poell

    Comment by Ronald Poell — January 25, 2012 @ 10:58

  4. Many thanks for reading our (lengthy) paper. The positive feedback is really greatly appreciated!

    I had planned on a short comment, but almost all of the questions you’ve raised have been at the forefront of our minds over the past while, so . . .

    > that may be a good input for a possible Linked Data profile (although the differences are really minor, if one looks at the appendix of the paper that lists the rule sets the engine uses)

    We’ve recently been looking into what kinds of RDFS and OWL primitives are being used in popular Linked Data vocabularies. Our (somewhat coarse) high-level observation is that RDFS/OWL features not requiring blank-nodes to express in the RDF mapping (e.g., sub-class, inverse-of, TransitiveProperty, etc.) are much more prominently used than those requiring blank-nodes (e.g., those requiring lists like owl:intersectionOf, or restrictions like owl:allValuesFrom, etc.). The pattern is quite evident, but as to what it means in a practical sense, we’re not sure. Hopefully we’ll have a paper on this soon.

    > the experiences of major Semantic Web search engines on handling Linked Data might provide a great set of input for such pragmatic work.

    We feel that our primary contribution in SWSE is not the systems or the techniques, but rather a better understanding of how Linked Data works: empirically “putting it through its paces”, so to speak. I’m sure that others working on similar engines also have a good feel for the inner workings of Linked Data. We’ve spent many hours/days/weeks debugging Web datasets to explain strange crawling behaviour, or strange consolidation results, or strange reasoning results, etc. This was part of the inspiration of the “Pedantic Web Group”, although the mailing lists have been quiet of late. It seems that more feedback loops are needed between the consumers and producers of Linked Data.

    > A great deal of work had to be done in SWSE on the proper handling of owl:sameAs. … On the other hand, one of the recurring discussions on various mailing lists and elsewhere is on whether the usage of this property is semantically o.k. or not (see, e.g., [3]).

    Yep, the owl:sameAs issue is a priority issue for engines like SWSE, FalconS, Watson, Swoogle, Sindice, FactForge, etc.: if you can’t merge (or at least link) data on a specific entity from multiple sources with a high degree of confidence, then the rest of the system design becomes academic. In an immediate sense, owl:sameAs reasoning far surpasses the importance of generic reasoning (e.g., more complete OWL 2 RL) for search. Some of the guys from the FalconS group have been looking into this in more detail, particularly generating owl:sameAs links using methods beyond deductive reasoning:

    http://www.springerlink.com/content/c21t3w21870t38k7/

    http://dl.acm.org/citation.cfm?doid=1963405.1963421

    The latter paper is interesting for applying inductive methods to the problem, which seems promising. We’ve also just wrapped up a more detailed (and lengthy) paper on owl:sameAs using the SWSE results as a baseline:

    http://aidanhogan.com/docs/entcons_jws_final.pdf

    Related to your comment, some conclusions from the paper were that the published owl:sameAs relations sampled from our dataset were perhaps not as bad as Halpin et al. found. We applied a much simpler (linear) sampling and judging methodology (we crunched through 1,000 relations with one opinion each) than Halpin’s paper, but found that ~97% of sampled owl:sameAs were deemed “not to be problematic”. We were probably not as picky as Halpin’s judges: our core criterion was essentially, “to our trained eyes, would the merged entity look strange in SWSE?”. Some of the inspection results are up here:

    http://aidanhogan.com/econs/bl/

    (Many of the results are trivial given that one or the other URI has insufficient data to actually cause a problem when merging.)

    Also, we found that using OWL 2 RL rules to *infer* owl:sameAs doesn’t really buy you all that much. Primarily, owl:InverseFunctionalProperty gives a lot of raw equivalences, but almost all are between blank-nodes on the same domain as per the old FOAF “smushing” practice (you need a long black-list of nonsense values for homepages and email-addresses to make this feasible). owl:FunctionalProperty ekes out a few more new owl:sameAs relations. Again though, as opposed to explicit owl:sameAs triples, which give rich relations between URIs across domains, the additional OWL 2 RL inferences are almost entirely composed of owl:sameAs relations between blank-nodes *on the same site*, which are simply not as interesting. Similar results are echoed by the FalconS guys in their work.
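
    To illustrate the inverse-functional case described above, here is a rough sketch (not the SWSE implementation) of deriving owl:sameAs pairs from shared values of an inverse-functional property, in the spirit of the OWL 2 RL `prp-ifp` rule, with a black-list of nonsense values. The triples, the black-list entries and the function name are assumptions made up for this example.

```python
from collections import defaultdict

# Illustrative data only: blank nodes sharing an e-mail address or a homepage.
TRIPLES = [
    ("_:a1", "foaf:mbox", "mailto:alice@example.org"),
    ("_:a2", "foaf:mbox", "mailto:alice@example.org"),
    ("_:b1", "foaf:homepage", "http://"),
    ("_:c1", "foaf:homepage", "http://"),
    ("_:a1", "foaf:name", "Alice"),
]

# Properties assumed to be declared owl:InverseFunctionalProperty.
INVERSE_FUNCTIONAL = {"foaf:mbox", "foaf:homepage"}

# Hypothetical black-list of placeholder values that must not trigger merges.
BLACKLIST = {"http://", "mailto:"}


def infer_same_as(triples, inverse_functional, blacklist):
    """Return owl:sameAs pairs for subjects sharing an inverse-functional value."""
    subjects_by_value = defaultdict(set)
    for s, p, o in triples:
        if p in inverse_functional and o not in blacklist:
            subjects_by_value[(p, o)].add(s)
    pairs = set()
    for subjects in subjects_by_value.values():
        ordered = sorted(subjects)
        # Emit each unordered pair once, in sorted order.
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                pairs.add((first, second))
    return pairs


filtered = infer_same_as(TRIPLES, INVERSE_FUNCTIONAL, BLACKLIST)
unfiltered = infer_same_as(TRIPLES, INVERSE_FUNCTIONAL, set())
```

    Without the black-list, the two unrelated nodes that both carry the placeholder homepage `http://` would be declared equal, which is the kind of spurious merge the black-list is meant to prevent.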

    We also looked at using inconsistency rules from OWL 2 RL to detect and repair cases where owl:sameAs goes awry, with limited success: current vocabularies are unfortunately not very detailed in terms of axiomatising things they consider to be nonsense, like disjoint classes, properties, etc. Which is a real pity. Linked Data needs more inconsistency!!

    > A possible alternative would be to define (beyond owl:sameAs) a set of properties borrowed from the SKOS Recommendation, like closeMatch, exactMatch, broadMatch, etc.

    I’m not entirely convinced that this is *needed* (yet).

    Our current sketchy idea on robust use of owl:sameAs for Web data is to have different views for each individual entity, where for owl:sameAs, we would allow “transitive claims”, but forego “symmetric claims” and only merge entities where reciprocal, authoritative links exist. This will imply different views for a given entity, a bit like considering owl:sameAs as a transitive “imports” property. So if we find the triple “A sameAs B” in a location authoritative for A, the entity-view for A will include data for B, but not vice-versa. If we also find “B sameAs C” in a document authoritative for B, the view for A and the view for B will include the data for C also. If we also find the triple “B sameAs A”, we can merge A and B such that they become one view. The core principle here is that a publisher has complete control over its local entities: they can opt in or opt out of what data its entities are viewed/merged with by dropping the pertinent owl:sameAs relation(s) in the local document. It would not be difficult to then support selecting and traversing these different views for a given entity in a UI.
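
    One possible reading of the scheme above, sketched in Python under stated assumptions: a claim is a `(subject, object, source)` tuple, and a hypothetical `authority` mapping says which entities each source document is authoritative for. The view for an entity is whatever can be reached by following accepted claims transitively, without ever adding the symmetric edge; reciprocal claims then yield identical views. The names and data are illustrative and not taken from SWSE.

```python
def build_views(claims, authority):
    """Map each entity to the frozenset of entities whose data its view includes.

    claims:    iterable of (subject, object, source) for "subject sameAs object"
    authority: dict mapping a source document to the entities it is authoritative for
    """
    # Keep a claim only if its source is authoritative for the subject.
    edges = {}
    for subject, obj, source in claims:
        if subject in authority.get(source, set()):
            edges.setdefault(subject, set()).add(obj)

    entities = set(edges) | {o for objs in edges.values() for o in objs}
    views = {}
    for entity in entities:
        # Transitive, non-symmetric closure: follow outgoing edges only.
        seen, stack = {entity}, [entity]
        while stack:
            current = stack.pop()
            for nxt in edges.get(current, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        views[entity] = frozenset(seen)
    return views


AUTHORITY = {"docA": {"A"}, "docB": {"B"}}

# "A sameAs B" in docA and "B sameAs C" in docB are accepted;
# "C sameAs A" comes from a document that is not authoritative for C.
one_way = build_views(
    [("A", "B", "docA"), ("B", "C", "docB"), ("C", "A", "docX")], AUTHORITY
)

# Adding the reciprocal, authoritative "B sameAs A" merges the views of A and B.
reciprocal = build_views(
    [("A", "B", "docA"), ("B", "C", "docB"), ("B", "A", "docB")], AUTHORITY
)
```

    In the first case the view for A covers A, B and C, the view for B covers B and C, and C stands alone; in the second, A and B share a single view.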

    Importantly, this doesn’t require a new property or vocabulary for publishers, but rather a refined/selective/more robust interpretation of owl:sameAs by consumers. In general, when reasoning over Web data, it’s more a case of disentangling who says what and where, as opposed to interpreting everything you see as formal truth. Obviously the burden is not (only) on publishers to get it right, but also on consumers to not implode when they get it wrong.

    > Thanks to the authors!

    And thanks for the encouraging feedback and interesting discussion!

    Comment by Aidan Hogan — January 25, 2012 @ 21:58

    • Wow! This is a blog post by itself, not just a comment:-) But thanks!

      I think the only area where either we are in disagreement or I am not sure I understand what you say is your last section, referring to the handling of owl:sameAs. Isn’t it correct that if you forego symmetric claims then, formally, you do not follow the precise owl:sameAs semantics? Of course it is perfectly fine for you to do that but, at least for my taste, this is not entirely ok. In other words, it is exactly for the possibility of expressing non-symmetric identifications that new predicates might be needed, to allow you to do that while staying within the definitions…

      Of course, as I say, the issue might be that this train has already left, and there are too many owl:sameAs out there.

      Comment by Ivan Herman — January 26, 2012 @ 12:03

      • > Wow! This is a blog post by itself, not just a comment:-) But thanks!

        If I had more time, I would’ve made it shorter. :)

        > Isn’t it correct that if you forego symmetric claims then, formally, you do not follow the precise owl:sameAs semantics?

        I guess the distinction between soundness and completeness is important here. By dropping the symmetry of claims, you do follow the precise owl:sameAs semantics, but only partially: as a consumer, you only use the part that you can trust, e.g., based on authoritative considerations.

        > In other words, it is exactly for the possibility of expressing non-symmetric identifications that new predicates might be needed, to allow you to do that while staying within the definitions…

        Reminds me of an analogous discussion about the use of owl:equivalentProperty and owl:equivalentClass relations given by FOAF between foaf:maker/dct:creator and foaf:Agent/dct:Agent. The discussion is very much related to the symmetry of claims for owl:sameAs:

        http://groups.google.com/group/pedantic-web/browse_thread/thread/ed643ec5f391e053?pli=1

        At the time, I didn’t particularly like that FOAF was mandating new inferences for DC data… e.g., importing all instances of dct:Agent into foaf:Agent. I was arguing that this should be changed, on principle, to the non-symmetric “foaf:Agent rdfs:subClassOf dct:Agent”, with a reciprocal link from DC if desired. Dan pointed out that from his point of view, dct:Agent and foaf:Agent were equivalent classes, and why shouldn’t he be allowed to make that claim in his vocabulary? This was really difficult to argue against (although I tried). I came away with the belief that publishers should be allowed to claim what they want (they will anyways). It’s largely up to consumers then to disentangle these claims (e.g., using authoritative reasoning or similar).

        Of course, having a vocabulary of owl:sameAs with symmetry|transitivity|reflexivity turned off can’t be a bad thing: publishers have more properties to make claims with. Towards official support and evangelism, the open question I have is what *claim* is being made with, e.g., a non-symmetric owl:sameAs? From an inference/trust/imperative point of view, I can see the effect (turn off the “eq-ref”/“eq-sym” rules for this property). From a declarative point of view, what does this claim actually *mean* though, and how would it be described in English? For example, what two real-world entities would the non-symmetric claim hold against (that full blown equality wouldn’t hold for)? And, more importantly, does this actually solve the root problem? (For example, what if publishers ignore the properties and continue to use industrial-strength owl:sameAs?)

        In summary, I still think it’s first a consumer-side problem, second a publisher-side problem.

        Comment by Aidan Hogan — January 26, 2012 @ 16:40

