Ivan’s private site

April 26, 2009

WWW2009 Impressions

As usual, notes from a conference like WWW2009, in Madrid, give only a partial view. This is all the more true for a conference of the size of WWW2009, with around 1000 attendees and 5-6 parallel tracks. I must admit that I usually have difficulties with so many tracks at the same time; I inevitably miss some of the events, which is a source of unavoidable frustration. With this caveat, here are just some of the topics that I will probably remember…

The power of Twitter. Although this was not a “topic” of the conference, this was the first WWW conference where Twitter was king. Twitter was everywhere: the #www2009 topic was getting several new entries per second (it even got spammed:-(), and other Twitter tags were used for some of the specialized events (like #w3ctrack or #ldow2009). One could get a glimpse of what was happening elsewhere just by following these topics. In fact, this report is much more sketchy than usual simply because my own tweets from the conference or, of course, all tweets of the #www2009 topic can very well replace some of the notes I wrote in blogs in earlier years.

Social networks. Going beyond Twitter, the ubiquitous presence of social networks and their effect on just about anything is still a major topic, as shown by the continuous flow of papers trying, eg, to extract semantics from tag clouds (eg, the paper of Benjamin Markines et al) or by the Googles and Yahoo!-s of this world trying to exploit these tags to improve their search results. (Yahoo’s experimental tag explorer is a good example trying to exploit these further.) Nothing radically new here, but progress is reported at all conferences, and this one was no exception. One of the keynotes, by Pablo Rodriguez from Telefonica, actually claimed that the needs of social networks in terms of network infrastructure are so different that they are bound to require changes on the hardware/firmware level of networks. Posting, for example, a video on a social site may create a sudden peak of high volume access (for example if posted by a “celebrity”), which is very different from the more steady flow of data that more traditional sites provide and require. For example, local caching in routers might be needed. I am no expert in this at all (anything that is close to hardware is sort of a black box to me) so I cannot judge these statements, but it was interesting to hear. Another interesting point he made was that “celebrities” of a specific network may (not necessarily intentionally) start a DoS attack against a site: think of the amount of HTTP requests flowing to a site mentioned by one of these social network stars!

Web Science. There was a panel (organized by Nigel Shadbolt, with Tim Berners-Lee, Ricardo Baeza-Yates, and Mike Brodie). The whole topic is still fairly open (at least for me): what exactly is Web Science and where are the boundaries? What types of research belong to WS, and what is better kept outside to be handled by other disciplines? What type of abstractions would be necessary to study the Web as a whole (just as chemistry can be seen as a set of abstractions on top of physics)? What type of interdisciplinary research groups should be established? As far as I am concerned, I do not have a response to any of these questions:-( What I could see happening is that under the banner “Web Science” many different sub-disciplines will appear very soon and gain independent life without much relationship among themselves. Personally, I would be more interested in the relationship between the Web and society at large than in the technical aspects, but that is only me. An interesting practical point for the future is that there are plans to combine (eg, co-locate) future WWW conferences with Web Science events; that would really be a gain for both event series in my view.

Computing cloud. Yep, this comes up more and more often. Obviously a big deal in the keynote of Alfred Spector, from Google, but it came up elsewhere, too. The mini-tutorial on Hadoop, MapReduce, and Hive, given by Tom White as part of the Developers’ track, was really interesting and instructive for me. We know that the computing cloud is of great interest for the Semantic Web community; it may indeed be a tool to handle the significant amount of data out there. The LOD data is already available on the Amazon services (thanks to OpenLink), Chris Bizer and friends’ Mobile DBpedia makes use of cloud facilities, and the LarKC project also makes use of massively parallel computing (I am not sure they use the cloud). Something to keep an eye on, that is for sure; I am sure the topic will gain more importance in future conferences. (And one more technology I should familiarize myself with…)

Power of data. Issues around search have become the dominating theme of the WWW conferences, and this one was no exception. Many research efforts try to exploit the sheer amount and variety of data that has been accumulated by the big search engines, for example. I have heard several talks over the years coming from Google’s R&D lab (including a keynote at this conference). I must admit the overall impression I get from these is that a more or less straightforward exploitation of a huge amount of data is used like a sledgehammer for all problems. (I am probably unfair.) Ricardo Baeza-Yates (from Yahoo!) also reported some work in his keynote on, eg, analyzing the search queries themselves, ie, the paths of different searches performed by users between the time they begin some search and the time they find what they were looking for. (Interesting stuff! By the way, there is also a conference on weblogs and social media, ICWSM; one more conference coming up around Web technologies.) I also listened to a presentation on Yahoo!’s Boss by Ted Drake (again on the Developers’ track): what is interesting is that one can access (a part of) Yahoo!’s accumulated indexes to build, eg, one’s own search engines but, I presume, one could also use this data for other types of research exploiting the data. Power of data for the masses? (I have heard of Boss before and I would have welcomed more technical details at the presentation but, well…)

Web of data, a.k.a. Semantic Web. The conference started with a great workshop on Linked Data. I again rely on my twitter notes and the general twitter notes for more details, no need to repeat them here. Suffice it to say that, beyond the individual papers, there was a general “buzz” in the air, a general enthusiasm that was reflected by the high number of participants (over 100). For anybody interested, it is worth looking at all the papers, they were good! Having said that, what I am really waiting for is to see many real applications of the LOD (and not only experimental, university usage) but that takes its time; there was no really breathtaking news on that at the workshop.

But, of course, the workshop was for the converted; what was more interesting was to see that the Linked Data concept, and the Semantic Web in general, created more and more interest at the conference proper, and not only among the long time Semantic Web adepts. Jim Hendler did a surprise presentation at the Developers’ track (surprise, because an announced speaker could not come, so he took his place) talking to non-Semantic Web developers about what can be done already today with this technology, about the excitement that is out there, about the companies that have already picked up this technology. It was good to get these messages out there again and again. Georgi Kobilarov also did a great presentation on DBpedia at the track; there were several people I talked to afterward who were really carried away by the possibilities opened up by having access to a huge amount of data through the unifying abstraction of RDF, RDFS, and possibly (a little bit of:-) OWL.

I also went to the Semantic Web refereed paper track, obviously. I must admit I was a little bit disappointed because lots of colleagues whom I would typically see at such an event were not around. I presume ISWC has now become a major competitor to WWW in this area and, when money is tight, people have to make a choice. In earlier years ISWC was considered to be much more theoretical while WWW had more practical papers, but the last few ISWC’s I attended seemed to indicate that this is changing. I think any of the WWW papers could have been presented at ISWC without any problems. As a consequence, I guess many people decided that ISWC is a better place to be. It will be interesting to see how things will evolve in the future; it is not impossible that Semantic Web, as a topic, will gradually move away from WWW to ISWC. (I would expect specifically Linked Data papers to appear at ISWC very soon!)

That being said: it was nice to see a paper on DERI Pipes (by Danh Le-Phuoc et al) or on Triplify (by Sören Auer et al). This is not the first time I heard about these but it is good to have them more widely published. There was a paper on a rule system benchmark (by Senlin Liang et al); although I am no expert on this, with the advancement of RIF it will be good to have such benchmarks being put forward. The paper of Philippe Cudré-Mauroux et al on the disambiguation of ID-s in linked data caught my attention: with the advancement of linked data we enter (as the presenter put it) an “ID Jungle” with tons of URI-s referring, more or less, to the same concept (eg, a specific person), and a simple owl:sameAs is not an ideal solution to handle this. The idMesh system provides a means to analyze relationships among those ID-s. I must admit I did not follow all details of the paper but it is certainly one of the papers I will have to study in more detail when I get to it!
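To make the “ID Jungle” remark a bit more concrete, here is a minimal sketch of the naive approach: treating owl:sameAs as a plain equivalence relation and grouping URI-s with a union-find structure. This is not how idMesh works (I did not follow the paper closely enough to reproduce it); the function name and all URI-s below are made up for illustration only.

```python
# Hypothetical example URIs; none of them come from the idMesh paper.
SAME_AS = [
    ("http://example.org/people/alice", "http://example.com/staff/a.smith"),
    ("http://example.com/staff/a.smith", "http://example.net/authors/42"),
    ("http://example.org/people/bob", "http://example.net/authors/7"),
]


def same_as_clusters(pairs):
    """Group URIs into equivalence classes under owl:sameAs (union-find)."""
    parent = {}

    def find(uri):
        parent.setdefault(uri, uri)
        # Walk up to the representative, halving the path as we go.
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]
            uri = parent[uri]
        return uri

    for left, right in pairs:
        parent[find(left)] = find(right)

    clusters = {}
    for uri in list(parent):
        clusters.setdefault(find(uri), set()).add(uri)
    # Sort for a stable, readable result.
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])


clusters = same_as_clusters(SAME_AS)
for cluster in clusters:
    print(cluster)
```

The weakness is easy to see: a single wrong owl:sameAs statement between the two groups would silently merge them into one “person”, which is presumably why a more careful analysis of the relationships among ID-s is needed.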

W3C’s “camps”. W3C tried another model this year, replacing the more traditional W3C tracks by two ‘camps’ on mobile web and on social web. But… this is where the large number of parallel tracks backfired: I could not go to any of them:-( There were all kinds of overlaps with other presentations (eg, the social web camp fully coincided with the Semantic Web paper track). Pity, because the feedback I heard from participants was very positive. Sigh. Well, actually, courtesy of Fabien Gandon, I was present at the social web camp virtually, witness this slide.

It was a slightly exhausting but good week!

November 14, 2008

Calais Release 4 and the Linking Data cloud…

Just got to this news via Yves’s blog: Reuters’ Open Calais service comes with a new release in January, and this will bind to the Linked Data cloud. To quote the official blog of Reuters:

Release 4 of Calais will be a big deal. In that release we’ll go beyond the ability to extract semantic data from your content. We will link that extracted semantic data to datasets from dozens of other information sources, from Wikipedia to Freebase to the CIA World Fact Book. In short – instead of being limited to the contents of the document you’re processing, you’ll be able to develop solutions that leverage a large and rapidly growing information asset: the Linked Data Cloud.

Ie: when analyzing a text, Open Calais will return URIs into DBPedia, Freebase, Musicbrainz… thereby opening up the possibility for a variety of applications that would not be possible (or would be fairly complicated) without. One more step to make it possible to reuse all those data on the Web… Yey!

B.t.w.: I write these lines using WordPress and I have Zemanta’s Firefox plugin running to generate the tags. However, as far as I know (I may be wrong!), the Zemanta service does not provide those URI-s yet (they do provide some URI-s in their return format, but I am not sure those are LOD URIs). Maybe some day?

(Thanks to Yves for drawing my attention to this…)

(Note after the original publication of the blog: it seems I was wrong and Zemanta does have a similar feature, see Andraz’ comment.)

October 31, 2008

ISWC2008, Karlsruhe

ISWC2008 has just finished (I am still at the hotel, leaving for home in a few hours). As usual, it is very difficult to give an exhaustive overview of the whole conference, not only because there were way too many parallel things going on, but also because everyone’s interests are different… These are just a few impressions. I still have to find time to read through some of the papers in more detail.

Great keynote by John Giannandrea from Metaweb, ie, freebase. Freebase has always been an exciting project but the great news from the Semantic Web community’s point of view is that freebase has opened its database to the rest of the world in RDF, too. As such, freebase will soon become part of the Linking Open Data cloud (I guess there are still some details to be ironed out, and I saw John and Chris Bizer starting to discuss these). Actually, it was also interesting to hear again and again from John that the internal structure of freebase is based on a directed, labeled graph model, because that was the only viable option for them to build up what they needed. Sounds familiar?

An interesting point of the keynote was when John was wondering whether Metaweb is therefore a Semantic Web company or not. He thought that yes, it is, because the internal structure is compatible with RDF, it relies on URIs as identifiers, and is Web based. But he also thought that, well, it is not because… no description logic is in use, nor ontologies. Sigh… This still reflects the erroneous view that one must use description logic to be on the Semantic Web. Wrong! So I went up to the mike and welcomed Metaweb in the growing club of Semantic Web companies…

Among the many papers I was interested in, let me refer to the one of Eyal Oren et al., “Anytime Query Answering in RDF through evolutionary algorithms” and, actually, a related submission from the same research group to the Billion Triple Challenge, called MaRVIN. In both cases the issue is that, while handling very large datasets, one might not necessarily want, or be interested in, _all_ solutions to a given query (or inferences, in case of MaRVIN) but, rather, whatever can be reached within a reasonable time. Ie, essentially, trading completeness for responsiveness. Whether genetic algorithms are the answer, as explored by Eyal and friends, or some other techniques, nobody knows; as Eyal clearly acknowledged, these are first attempts and we have to wait a few more years and further results to get a feeling where it will lead. But the direction is really interesting.
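Just to illustrate the “trading completeness for responsiveness” idea (and explicitly not the evolutionary algorithm of the paper), here is a minimal sketch of an anytime triple pattern match. It is my own toy under stated assumptions: the function name is made up, and the budget is counted in examined triples as a deterministic stand-in for wall-clock time.

```python
def anytime_match(triples, pattern, budget):
    """Return matches found after examining at most `budget` triples.

    `pattern` is an (s, p, o) tuple in which None acts as a wildcard.
    The second return value says whether the whole dataset was scanned.
    """
    results = []
    examined = 0
    for triple in triples:
        if examined >= budget:
            return results, False  # budget exhausted: partial answer
        examined += 1
        if all(want is None or want == got for want, got in zip(pattern, triple)):
            results.append(triple)
    return results, True


# Hypothetical toy data.
TRIPLES = [
    ("ex:a", "rdf:type", "ex:Paper"),
    ("ex:b", "rdf:type", "ex:Person"),
    ("ex:c", "rdf:type", "ex:Paper"),
    ("ex:d", "rdf:type", "ex:Paper"),
]

PATTERN = (None, "rdf:type", "ex:Paper")
partial, complete = anytime_match(TRIPLES, PATTERN, budget=2)
print(partial, complete)
full, complete_full = anytime_match(TRIPLES, PATTERN, budget=10)
print(len(full), complete_full)
```

The caller always gets something back, together with a flag telling whether the answer is complete; a smarter strategy would, of course, decide where to look first instead of scanning in storage order.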

This actually leads to what was, for me, the highlight of the conference, namely the SW Challenge, both the traditional Open Call as well as the new Billion Triple Challenge (there are more details on both on the challenge’s web site). The entries were really impressive. As Peter Mika said in his closing comments on the challenge, long gone are the days when a challenge was some techie keyboard manipulation; the entries all had great user interface design, with real regard for non-expert end users who may or may not know (and probably do not care) that the underlying technology is Semantic Web.

Among the finalists in the open call Chris Bizer presented DBPedia Mobile, (see also their site) ie, a system to access the full power of DBPedia (and, actually, the LOD cloud in general) from an iPhone via a proxy somewhere on the Web. The proxy is actually a hugely powerful environment, making use of Falcon and Sindice, and a bunch of query engines distributed over the network, all peeking into the LOD cloud and, actually, adding items to it, eg, photos taken on the iPhone. A few years ago all this would have had a SciFi edge to it, and now it was running at the conference…

Eero Hyvönen showed their HealthFinland portal (see also their site), soon to be deployed by the Finnish health authorities. Half of the system is, shall we say, more “traditional” (hm, well, what this means is that it would have been revolutionary two years ago:-), a number of serious ontologies governing health related data integration and search into the data. However, what I found exciting is the other half. Indeed, Eero and friends realized that search facets derived from serious ontologies are not really ideal for everyday end users. Therefore, they made a survey among users, derived a number of terms to be used on the user interface level, and bound these terms internally to the ontology. The result is a much more friendly system that still has the power offered by ontology directed search.
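The binding between user-level terms and the ontology can be pictured with a very small sketch. This is only my reading of the idea, not the HealthFinland implementation; the terms, concept identifiers, and document annotations below are all hypothetical.

```python
# Hypothetical mapping from everyday words to ontology concept identifiers.
USER_TERM_TO_CONCEPT = {
    "heart attack": "onto:MyocardialInfarction",
    "high blood pressure": "onto:Hypertension",
}

# Hypothetical documents annotated with ontology concepts.
ANNOTATIONS = {
    "doc1": {"onto:MyocardialInfarction"},
    "doc2": {"onto:Hypertension", "onto:MyocardialInfarction"},
    "doc3": {"onto:Influenza"},
}


def search_by_user_term(term):
    """Translate a user-facing facet into a concept, then search by concept."""
    concept = USER_TERM_TO_CONCEPT.get(term.lower())
    if concept is None:
        return []
    return sorted(doc for doc, concepts in ANNOTATIONS.items() if concept in concepts)


hits = search_by_user_term("Heart attack")
print(hits)
```

The user only ever sees the left-hand side of the mapping; the search itself still runs on the ontology concepts.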

Actually, having Eero’s and Chris’ system presented side by side was also interesting from another point of view, namely to show that there are cases when using serious ontologies is important and there are cases when it isn’t. When I use an iPhone to navigate in a city and get information about, say, historical buildings then a bit of scruffiness is really not a problem. Speed, interaction, richness of data are more important. However, when it comes to, e.g., health issues, I must admit that I am prepared to wait a bit if I am sure that the results go through the rigorous inference and checking processes that one can achieve through the usage of formal ontologies. This is not the place where one should tolerate scruffiness. The stack (or, to quote Eric Miller, the “menu”) of Semantic Web technologies is rich enough to allow for both; choose what you need! All those discussions of description logic vs. Semantic Web in general are futile in my view…

And then came benji’s paggr system (which actually won the Challenge in the Open Call track). Are you a user of netvibes, iGoogle, or the new Yahoo user interface? Then you know what it means to quickly build up a Web page using small widgets accessing RSS feeds, stock quotes, clocks, etc. Now imagine that each of these widgets is in fact a small SPARQL query with some wrapper to present the result properly. Package that into a nice user interface that benji has always been a master of, and you get paggr. Not yet public, but I already signed up to play with it as soon as it is… This will really be cool!
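Since paggr is not public yet, I can only guess at its internals; but the “widget = small SPARQL query + wrapper” idea could be sketched roughly as follows. Everything here is an assumption for illustration: the Widget class, the example.org URI, and the fake endpoint that returns canned rows instead of talking to a real SPARQL service.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, str]


@dataclass
class Widget:
    """A widget is a query plus a function that renders the result rows."""
    title: str
    sparql: str
    render: Callable[[List[Row]], str]

    def show(self, run_query: Callable[[str], List[Row]]) -> str:
        rows = run_query(self.sparql)
        return f"[{self.title}]\n{self.render(rows)}"


def fake_endpoint(query: str) -> List[Row]:
    """Stand-in for a SPARQL endpoint; returns canned rows, ignores the query."""
    return [{"label": "Madrid"}, {"label": "Karlsruhe"}]


cities = Widget(
    title="Conference cities",
    sparql="SELECT ?label WHERE { ?city a <http://example.org/City> ; "
           "<http://www.w3.org/2000/01/rdf-schema#label> ?label }",
    render=lambda rows: "\n".join("- " + row["label"] for row in rows),
)

output = cities.show(fake_endpoint)
print(output)
```

Swapping fake_endpoint for a function that sends the query to a real endpoint would leave the widget itself untouched, which is the attractive part of the design.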

As for the Billion Triples challenge: I already referred to MaRVIN, but there were a bunch of others like SearchWebDB, SemaPlorer, or SAOR. Several of them used massively parallel storage approaches, not only offering near real time (federated) SPARQL query possibilities but, in some cases, also preprocessing the data with lower level RDFS or OWL fragment inferencing. All that was done starting with millions of triples integrating all kinds of public datasets, yielding stores going beyond the 1 billion triple mark. And let us not forget that this mark had already been reached by companies such as Talis or OpenLink, so these new architectures just add to the lot… These were also particularly interesting with an eye to the new OWL RL profile that is being defined in the W3C OWL Working Group and which aims at exactly such setups.
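For readers who wonder what “preprocessing with lower level RDFS inferencing” means in practice, here is a minimal sketch of forward chaining with two RDFS rules (rdfs9 and rdfs11, both about rdfs:subClassOf). It is a naive toy of my own, not what any of the challenge entries did; at a billion triples one would obviously need something far smarter than these nested loops. The ex: names are hypothetical.

```python
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"


def materialize(triples):
    """Apply RDFS rules rdfs9 and rdfs11 until no new triple appears."""
    closure = set(triples)
    changed = True
    while changed:
        changed = False
        subclass_pairs = [(s, o) for s, p, o in closure if p == SUBCLASS]
        new = set()
        for sub, sup in subclass_pairs:
            for s, p, o in closure:
                if p == SUBCLASS and s == sup:  # rdfs11: subClassOf is transitive
                    new.add((sub, SUBCLASS, o))
                if p == TYPE and o == sub:      # rdfs9: instances inherit superclasses
                    new.add((s, TYPE, sup))
        if not new <= closure:
            closure |= new
            changed = True
    return closure


# Hypothetical toy data.
DATA = [
    ("ex:Keynote", SUBCLASS, "ex:Talk"),
    ("ex:Talk", SUBCLASS, "ex:Event"),
    ("ex:k1", TYPE, "ex:Keynote"),
]

inferred = materialize(DATA)
for triple in sorted(inferred):
    print(triple)
```

Once the inferred triples are stored next to the original ones, a query for all ex:Event instances finds ex:k1 without any reasoning at query time.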

Let me finish with another remarkable entry, although this one did not win a prize. i-MoCo created a small navigation system over a triple store containing “only” 250 million triples. So what is the big deal, you might say? Well, all the triples were stored on… an iPhone! So the next challenge will probably be to get, say, 10 billion triples or more on your phone. Just wait a few years…
