Ivan’s private site

July 4, 2009

Dagstuhl Workshop on Semantic Web

I have just come back from the Workshop “Semantic Web: Reflections and Future Directions”, held in Dagstuhl, Germany. Organized by John Domingue, Rudi Studer, Jim Hendler, and Dieter Fensel, the workshop positioned itself as the “second release” of a similar workshop that was held at the same place 10 years ago.

The first two days of the workshop were more traditional, in the sense that they consisted of a series of presentations and panels. This was the “reflection” part of the workshop: looking back at 10 years of history as well as taking a peek at the current state of the art. It was interesting but, for my taste, a bit too long; the programme of the two days could have been compressed into one or, say, one and a half days. That would have given more time to the “future directions” part, ie, discussions in break out groups on various topics. I enjoyed those a lot: free flowing discussions on various topics, helping to exchange ideas, experiences, pointers to other works and results, and crystallizing possible future R&D issues. These discussions took place in a very pleasant, relaxed atmosphere among people who mostly knew one another already, ie, we could really concentrate on issues. Each group formulated a number of research goals for the years to come; some groups also came up with more practical steps and goals.

As far as I know, the workshop organizers plan to collect all those research issues in some more coherent form, so we should watch this space. In what follows I just collect some issues that I took away from the workshop without the goal of being exhaustive; indeed, there were 6-7 parallel break out groups.

Issues around Web scale. This is clearly one of the major topics of the day. What happens when one has to deal with data containing billions of triples, when the data (ie, the triples) are “dirty”, ie, inconsistent, faulty, etc? Think of the Linked Open Data cloud, of data coming from sensor networks, mobiles, etc. Do we have to re-think all the notions that the Semantic Web inherited from the logic world, ie, completeness, meaning and consequences of consistency, what it means to get results for a query, etc? This is one area where opinions tend to diverge a lot. Some would prefer to completely put aside the traditional logic approaches (rules, description logics, ontologies, OWL, etc), while others may argue that the advances in computing, in reasoning engines and methods are (and are expected to be) such that these methods should still be just as usable as before. As always, I hate any black-and-white statements… I do not think dismissing an area of technology is the right way but, also, other avenues or new viewpoints should be explored, too (e.g., how to react to inconsistencies, or trying to get possibly incomplete results but whatever can be obtained within, say, 2 minutes, that sort of thing). Which approach to use is very much dependent on the application. Anyway… Web scale is a major issue, everybody agrees on that!
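To make the “whatever can be obtained within, say, 2 minutes” idea a bit more tangible, here is a minimal sketch in Python. It is only an illustration, not something that was presented at the workshop: the function name `match_within_budget`, the tuple-based triples, and the sample data are all assumptions made up for the example.

```python
import time

def match_within_budget(triples, pattern, budget_seconds):
    """Return (matches, complete) for a triple pattern, stopping at the deadline.

    `pattern` is an (s, p, o) tuple where None acts as a wildcard.
    `complete` is False when the scan was cut short, i.e. the answer
    may be missing results.
    """
    deadline = time.monotonic() + budget_seconds
    matches = []
    for triple in triples:
        if time.monotonic() > deadline:
            return matches, False  # partial answer, flagged as such
        if all(want is None or want == got for want, got in zip(pattern, triple)):
            matches.append(triple)
    return matches, True

# Hypothetical data: a small, finite set of triples.
data = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob", "foaf:knows", "ex:carol"),
    ("ex:alice", "dc:title", "Dr"),
]
found, complete = match_within_budget(data, ("ex:alice", None, None), budget_seconds=120)
```

The point of the design is that the caller always learns whether the answer is complete; what to do with an incomplete answer remains, as said above, application dependent.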

Interaction. This is one of the break out groups that I did not attend, unfortunately. And it is obviously a hugely important direction of future R&D. Many Semantic Web applications today are such that their user interface is just standard because all Semantic Web related work happens behind the scenes, usually on the server side. However, in the long term, there is a clear need for programs that could somehow directly show the data in some friendly way, programs that adapt themselves to the nature of the data. Not only for experts, but also for laypeople. Such environments may not only include extensions of current browsers but, eg, full desktop environments. Sort of intelligent, data-oriented user interfaces. A major research problem (user interface methodology is always a major problem, whether related to Semantic Web or not…), but also a hugely exciting research and development opportunity!

Vocabularies. There was a separate group on the management of vocabularies, which has identified a number of R&D issues: how does one describe a vocabulary, its interdependence with other vocabularies, how does one rank vocabularies… These are all fundamental questions to solve to be able to find vocabularies for a specific purpose, to make specialized searches. There are also issues around archiving, providing stable URI-s; last but not least (and this goes way beyond vocabularies only) major legal issues on what type of attribution, copyright or other legal machinery are to be used with vocabularies (it was good to have Tom Heath, who could tell us a bit about the datacommons’ approach). As an example of the many technological problems arising, the break-out groups coined the term “cherry picking of terms”. Although OWL has a mechanism for import, the practice of the RDF world is to use (ie, “cherry pick”) vocabulary terms (predicates, classes, etc) from various different vocabularies without necessarily taking the whole vocabulary, and certainly without using the owl:imports predicate (think of routine usage of dc:title without importing the full Dublin Core vocabulary). How would a reasoner treat those? It may be a little bit easier to use a more rule based approach (like OWL RL) although it is not obvious how to cherry pick just the right amount of information on a, say, predicate. But Ian Horrocks also drew my attention to formal ontology modularization work that might be very relevant here; item added to my “to-be-read” list…
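A small sketch may help to show what “cherry picking” looks like from a tool’s point of view. The Python below simply lists which terms a set of triples borrows from which vocabulary; the helper names (`split_uri`, `picked_terms`) and the sample data are assumptions made up for this illustration, and splitting a URI at the last “#” or “/” is only a common heuristic, not a rule of RDF.

```python
from collections import defaultdict

def split_uri(uri):
    """Split a URI into (namespace, local name) at the last '#' or '/'."""
    cut = max(uri.rfind("#"), uri.rfind("/")) + 1
    return uri[:cut], uri[cut:]

def picked_terms(triples):
    """Map each vocabulary namespace to the set of predicates used from it."""
    used = defaultdict(set)
    for _subject, predicate, _object in triples:
        namespace, local = split_uri(predicate)
        used[namespace].add(local)
    return dict(used)

DC = "http://purl.org/dc/elements/1.1/"
FOAF = "http://xmlns.com/foaf/0.1/"

# Hypothetical data: one document described with terms from two vocabularies,
# neither of which is imported as a whole.
graph = [
    ("http://example.org/doc", DC + "title", "Trip report"),
    ("http://example.org/doc", DC + "creator", "http://example.org/ivan"),
    ("http://example.org/ivan", FOAF + "name", "Ivan"),
]
vocabularies = picked_terms(graph)
```

Such a listing only tells which terms were picked; how much of each vocabulary’s definition a reasoner should pull in for those terms is exactly the open question.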

Provenance (and trust). One of the issues that popped up in all other break out groups; in consequence a separate one was formed on the second day of discussions. It is indeed one of the questions that anyone who talks about Semantic Web gets; in my personal view, having a clear “story” to tell about provenance is essential for the further deployment of this technology. The discussion in the group was really interesting because this issue raises a number of other questions, like the overall relationship of cryptographic techniques and the Semantic Web, what it means to have trust in context, what are the relationships to temporal or uncertainty reasoning, etc, etc, etc. It was also interesting for me to hear about other works, like the Open Provenance Model, albeit some of these were not necessarily done by Semantic Web people (but, eg, by the database community). We agreed that a Wiki page will be created (probably at RPI, set up by Deb McGuinness) to collect information on this subject, and forming a W3C Incubator Group might also be on the cards to provide a more thorough state-of-the-art. A long list of additional items to my “to-be-read” pile is coming…

And, of course, it was also good to meet a bunch of people, discuss things at lunch or dinner. This type of interaction is really fruitful. And there was also intensive twittering going on (using the #swdag2009 tag, pointing to a bunch of other resources) although this time I did not twitter too much because I had problems with my wireless card:-(

It was a good meeting; thanks to the organizers. It would be good not to wait another 10 years for the next incarnation of this event…

June 28, 2009

“Because a country using only one language and having only one custom is weak and frail”

Filed under: General, Hungary, Private — Ivan Herman @ 12:27

I have already blogged a few weeks ago on the sad success of right wing extremist parties in Europe. One of the toughest and, in my view, most frightening among those is the Hungarian “Jobbik” party, with its openly racist, anti-Semitic message. Being Hungarian, I feel embarrassed and saddened by their success… However, a new Facebook group, set up recently to fight against this Hungarian phenomenon, has made me realize a sad irony, too.

One of the historical figures of Hungary is St Stephen I of Hungary, the first king of Hungary. He established the Kingdom of Hungary more than 1000 years ago, ensuring the future of his nation. As such, he has become, among others, the reference point for all nationalist and, of course, racist movements in Hungary.

St Stephen had a son, Prince Emeric (Imre); and St Stephen wrote a text to prepare his son to play his role as a king. This old text, known as “Saint Stephen’s admonitions to his son Emeric”, is available on the Web thanks to the National Library of Hungary (sorry, only in Hungarian, I could not find an English translation). It consists of 10 general admonitions, the 6th being on the role of foreigners (the text actually uses the word “guests”) in the country. It would be a bit long to translate, but the title of this blog may be the most important sentence of the paragraph:

Because a country using only one language and having only one custom is weak and frail

(If you are interested in the original: “Mert az egy nyelvű és egy szokású ország gyenge és esendő.”)

Wise words coming from the “dark” middle ages! Worth thinking about for a number of people, from the Netherlands to Hungary…

June 20, 2009

SemTech2009 impressions (addendum)

I wrote a blog yesterday on my SemTech impressions; I realized this morning that I forgot to add an item although I intended to. Peter Deitz did indeed give a presentation on a site called “social actions”: essentially a specialized index and search engine on various social, non-governmental actions around the World that one might want to join, contribute to, etc. (Eg, the search on climate change will point you to a number of corresponding actions around the globe.) The interesting aspect, from the Semantic Web point of view, is that Peter would like to integrate the data, the access, etc, into the rest of the SW, essentially into the LOD (although he did not use this term), but he needs (and asks for) help from the community. Beyond the clear value of this particular dataset this is becoming a pattern (the NYT example in my blog yesterday is similar): people realize the value of publishing their data in a Linked Data format, but it is difficult to make the first steps. Even more tutorials, descriptions, and mainly community help are needed. That is essential for the success of Linked Data!

June 19, 2009

SemTech2009 impressions

The first and possibly most important aspect of SemTech 2009 is that… it happened! I must admit that back in April-May, when the conference’s Web site did not include any news of the program yet, I was a bit concerned that the general economic malaise would kill this year’s conference. O.k., I might have been paranoid, but I think some level of concern was indeed legitimate. And… not only did the conference happen as planned, but the numbers were essentially the same as last year’s (over 1000). I think that by itself is an important sign of the interest in Semantic Technologies. Kudos to the organizers!

A general trend that was reaffirmed this year: by now, Semantic Web technologies are the obvious reference points for almost all presentations, products, etc, that were presented at the event. RDF(S), RDFa, OWL, SPARQL, etc, have become household names; newer specs like SKOS or POWDER may not have been as widely referred to yet, but I am sure that will come, too. Linked Data (and, more specifically, the Linked Open Data cloud) was almost ubiquitous this year while I do not believe that it was even mentioned last year. That is a huge change (although I still miss real “user facing” applications of LOD; some, like Talis’ system deployed at UK universities, were presented but not as part of the regular conference). All that being said, I somehow seem to have missed more sessions than last year, which makes my impressions more patchy. There were several journal interviews that I could not escape, hallway discussions that were great but made me miss a presentation here and there… I guess this is what happens when you have such a number of people around!

Tom Tague (from Open Calais) gave a very nice opening keynote. His talk was actually not on Open Calais (he did that in 2008), but rather on his experience in talking to different people who tried to start up new ventures in the Semantic Web area (a quote from his talk: “in 80% of the discussions I did not understand what the vendors wanted, and I walked away with my cheque book intact… Simplify!”). The main areas that he looked at were tools, social, advertising, search, publishing, user interface. One of the remarks I liked was on search: in his view (and I think I agree with that) Semantic Technologies may not be really interesting for general search (where the statistical, i.e., brute force methods work well) but for specialized, area-specific search tools (things like GoPubMed or applications deployed at, eg, Eli Lilly or experimented with at Elsevier come to my mind as good examples). Similarly, these technologies are not necessarily of interest for general, “robotic” publication tools like Google’s news, but for high quality publishing, with possible editorial oversight (reducing costs and difficulties).

(He also had a nice text on one of his slides: “Web2.0: Take Web 1.0, add a liberal dash of social, generous amounts of user generated content, atomize your content assets and stir until fully confused”:-)

Tom Gruber talked about his newest project: SIRI. A super-duper personal assistant running on an iPhone with a conversational (voice directed) interface. The group behind it integrates a bunch of info on the Web (the “usual” stuff like restaurants and travel sites), categorizes it, and hides the complexities behind a sexy user interface. The problem I have is that I just do not see how this would scale. I see one of the major promises of the Semantic Web as getting data in RDF out there so that such, essentially mash-up, applications would become much easier to create and maintain. Until then, it is really tedious… On a more personal note, I am not sure I would like the voice conversational interface. I know that I have never used the voice commands on my phone for example; I do not feel comfortable with it. But, well, that is probably only me…

Chime Ogbuji made a really nice presentation on the system they have developed at the Cleveland Clinic. Great combination of RDF, OWL, and SPARQL. The interesting aspect (for me) was the usage of Cyc, a knowledge base and reasoning system, which is used to convert the doctor’s question in natural language (insofar as a question full of medical jargon can be considered as “natural”:-) into, essentially, a SPARQL query. The medical ontologies are used to direct this conversion process, and then the triple store can be queried through the generated query. Impressive work. (Part of it was documented in a W3C use case, but this presentation had a different emphasis.)

Unfortunately, I had to skip Peter Mika’s presentation on the SearchMonkey experiences; I will have to look at his slides… But, as a last minute addition to the program, the organizers succeeded in getting Othar Hansson and Kavi Goel to talk about Google’s rich snippets. I have already blogged on this a few weeks ago but this presentation made the goal of the project way more understandable. Essentially, by recognizing specific microformat or RDFa vocabularies, they can improve the user experience by adding extra information to the search result. It is interesting to observe the difference between Yahoo! and Google in this respect: both of them use microformats/RDFa for the same general goal but, whereas Yahoo! relies on the community providing applications and on users personalizing their own search result page, Google controls the output in a generic way that does not require further user actions. It will be interesting to see how these differences influence people’s usage patterns. There was some discussion on Google’s choice of vocabularies; the presenters made it quite clear that they are perfectly happy using other vocabularies (eg, vCard or FOAF) if they become pervasive, and this is a discussion that Google plans to engage in with the community. There is of course a chicken-and-egg issue there (if a vocabulary is known by Google, then it will be more widely used, too), and this is clearly an area to discuss further. But these are details. The very fact that both Yahoo! and Google look at microformats and RDFa is what counts! Who would have thought just about a year ago?

I was not particularly impressed by the Semantic Search panel. I had the impression that the participants did not really know what they should say and talk about:-(

Nice presentation by Jeffrey Smitz from Boeing on a system called SPARQL Server Pages. Essentially: the user can use structures similar to, say, a PHP page, ie, a mixture of HTML tags and server “calls”, except that these “calls” refer to SPARQL queries against a triple store on the server. Their system also includes some rule based OWL reasoning on the server side, although I am not sure I got all the details. All in all, the system seemed a bit complex, but the general approach is interesting! And it is nice to see that a company like Boeing seems to make good use of RDF+OWL+SPARQL; it would be good to know more…
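To give a feeling for the general approach (a page mixing HTML with embedded queries), here is a toy sketch in Python. I stress that this is not the Boeing system: the `<?query … ?>` syntax, the `render` and `run_query` functions, and the in-memory “store” are all invented for the illustration, and a single predicate lookup stands in for a real SPARQL query sent to a triple store.

```python
import re
from html import escape

# Hypothetical in-memory "triple store"; a real system would send SPARQL to a server.
STORE = [
    ("ex:doc1", "ex:title", "Trip report"),
    ("ex:doc2", "ex:title", "Meeting notes"),
    ("ex:doc1", "ex:pages", "12"),
]

def run_query(predicate):
    """Stand-in for a SPARQL SELECT: all objects of the given predicate."""
    return [o for _s, p, o in STORE if p == predicate]

def render(page):
    """Replace each <?query PREDICATE ?> block with an HTML list of results."""
    def expand(match):
        items = "".join(f"<li>{escape(value)}</li>" for value in run_query(match.group(1)))
        return f"<ul>{items}</ul>"
    return re.sub(r"<\?query\s+(\S+)\s*\?>", expand, page)

html = render("<h1>Documents</h1><?query ex:title ?>")
```

The page author only writes HTML plus the embedded query; everything RDF related stays on the server, which is the part of the idea I found attractive.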

I missed Zepheira’s presentation on freemix which is a shame, but, well, it happens. But I did play with freemix before travelling to San Jose; I called it “Exhibit for the masses”. And this, I think, is a fair characterization. David Huynh’s Exhibit is a really nice tool, but it is not easy to use. On the other hand, it took me about 2 minutes to make a visualization of a JSON data set I used for an Exhibit page elsewhere…

Andraz Tori talked about Common Tag, a small vocabulary that, for example, can be used when marking up texts with tags (something that engines like Zemanta or Open Calais do). Bringing the RDF and the tagging worlds together is really important; I am very curious how successful this initiative will be…

The keynote on the last day was from the New York Times (by Evan Sandhaus and Robert Larson). It was quite interesting to see how a reputable newspaper like the NYT has developed a tradition of indexing, abstracting, cataloging articles, how these are archived and searched. Impressive. It is also great that the NYT Annotated Corpus has been released to the research community. I did not know about that and, I presume, this must be a great resource for a lot of people active in the area of, say, natural language processing. Finally they announced their intention to release their thesaurus in a Semantic Web format, to add a “blob” to the Linked Data Cloud. They still have to work out the details (and expect feedback from the community) and I would hope they would publish a SKOS thesaurus and might even annotate the news items on their web site using this thesaurus in RDFa. But something in this space will happen, that is for sure! Other reputable newspapers, like Le Monde, the Guardian, NRC Handelsblad, El País, will you follow?

I also had my share of talking: gave an intro tutorial to SW, gave an overview of what is happening at W3C (quite a lot this year, including the finalization of POWDER, OWL 2, and SKOS!) and participated in an OWL 2 panel (with Mike Smith, Zhe Wu, Deb McGuinness, and Ian Horrocks). I was quite happy with the tutorial and the way the panel went; the audience for the talk could have been a bit larger. But, well…

It was a long week, long trips, not much sleep… but well worth it!

June 8, 2009

The right-wing extremists on the move…

The right-wing extremists on the move: in Austria, Hungary, Romania, Bulgaria, Greece, Denmark, the Netherlands. It is a shame for the large people’s parties and for the voters who did not participate in the elections.

This is a quote from a blog published on the Web site of the German ZDF television (see the German original). I couldn’t agree more. I live in the Netherlands, where Geert Wilders’ racist party became one of the strongest parties in the country; I carry a French passport and the Front National will still send representatives to the EU Parliament in the name of France; and I also carry a Hungarian passport and the local right wing “Jobbik” party of Hungary made a breakthrough yesterday evening. This is not a good day…

(Some “nuggets” from the declarations of the EU representative of Jobbik: “I would be glad if the so-called proud Hungarian Jews would go back to playing with their tiny little circumcised tail rather than vilifying me”,…“We had a dream that we would not become a second Palestine. This dream has just come true…”. Wonderful…)

One can of course be optimistic: these movements come and go, they are in a minority. And I hope optimism is still o.k. But when I see these people marching on the streets of Budapest where I grew up, or when, as a foreigner in the Netherlands, I am indirectly accused by an official party of stealing the job of locals, then, well, it is not easy to keep up my optimism…

June 1, 2009

PWC report on Semantic Web

There have already been a number of blogs and tweets on PricewaterhouseCoopers’ Spring ’09 Technology Forecast on Semantic Web, but it may still be worth writing about it. The document can be downloaded from the Web free of charge in return for a registration. It includes some of PWC’s own overview of the technology, plus interviews with Tom Scott (BBC), Uche Ogbuji (Zepheira), Lynn Vogel (University of Texas M.D. Anderson Cancer Center), and Frank Chum (Chevron).

The document is clearly not aimed at technologists of the Semantic Web. But there are a number of well chosen wordings and quotes that might help us to talk to people around us who have to be convinced about the value of Linked Data/Semantic Web. Just a few of those:

PricewaterhouseCoopers believes a Web of data will develop that fully augments the document Web of today. You’ll be able to find and take pieces of data sets from different places, aggregate them without warehousing, and analyze them in a more straightforward, powerful way than you can now.

[…]

Let’s say your agency represents musicians, and you want to develop your own ontology […]. You might create your own ontology to keep better tabs on what’s current in the music world […]. You also can link your ontology to someone else’s and take advantage of their data in conjunction with yours. Contrast this scenario with how data rationalization occurs in the relational data world. Each time, for each point of data integration, humans must figure out the semantics for the data element and verify through time consuming activities that a field with a specific label […] is actually useful, maintained, and defined to mean what the label implies. Although an ontology-based approach requires more front-end effort than a traditional data integration program, ultimately the ontological approach to data classification is more scalable […]. It’s more scalable precisely because the semantics of any data being integrated is being managed in a collaborative, standard, reusable way.

[…]

With the Semantic Web, you don’t have to reinvent the wheel with your own ontology, because others […] have already created ontologies and made them available on the Web. As long as they’re public and useful, you can use those. Where your context differs from theirs, you make yours specific, but where there’s commonality, you use what they have created and leave it in place. Ideally, you make public the non-sensitive elements of your business-specific ontology that are consistent with your business model, so others can make use of them. All of these are linked over the Web, so you have both the benefits and the risks of these interdependencies. Once you link, you can browse and query across all the domains you’re linked to.

[…]

Traditional data integration methods have fallen short because enterprises have been left to their own devices to develop and maintain all the metadata needed to integrate silos of unconnected data. As a result, most data remain beyond the reach of enterprises, because they run out of integration time and money after accomplishing a fraction of the integration they need.[…] The most basic lesson is that data integration must be rethought as data linking—a decentralized, federated approach that uses ontology-mediated links to leave the data at their sources. The philosophy behind this approach embraces different information contexts, rather than insisting on one version of the truth, to get around the old-style data integration obstacles.

Yeah, we all know that, right? But can we really put it in succinct terms for outsiders? That is not that easy… Ie, it is worth reading the report (and thanks to PWC!).

May 28, 2009

3D Media and the Semantic Web

Filed under: Semantic Web, Work Related — Ivan Herman @ 16:08

It is always interesting to find out about new communities using Semantic Web technologies. So here is one more: 3D Media. Although I already knew about some work going on around this subject, a paper published by Michela Spagnuolo and Bianca Falcidieno[1] just drew my attention to it again. Unfortunately, the paper, being published by IEEE, is not publicly available:-( but if you can get hold of it, it is worth looking at!

Using specific vocabularies it should be possible to provide semantic annotation of 3D models, shapes, surfaces, etc, regardless of how you present or construct the 3D virtual object itself. The paper gives the example of a 3D surface described in terms of a triangular mesh; in some cases, for example to gain in display efficiency, the mesh may be converted into a coarser mesh, but the corresponding semantic description should stay unchanged. How to exactly do that, how to store the metadata, what is the exact identification of the surface itself (i.e., what are the URI-s to use), what type of vocabularies should be developed by the community, etc, are all R&D issues: there is now a community growing up around these questions. If such semantic descriptions are available then better organizations of shapes, better search facilities, better segmentation tools for 3D objects, etc, can be developed. (It is not unlike what happens with images and their possible semantic characterizations, except that moving from 2D to 3D significantly increases complexity.) It is also possible to create and share repositories of 3D shapes, models, etc, and possibly link them to other datasets around the Web. Such activities are already happening, eg in the AIM@SHAPE project, or with the publication of some of the relevant OWL ontologies but, at least to my knowledge, there have not been too many connections between the, shall we say, “core” Semantic Web community (whatever that means:-) and this 3D Model activity. That is why I was pleased to see the paper of Michela and Bianca; maybe more connections can be made in the years to come!

(Full disclosure: I am biased. Indeed, in my previous life, I was active in Computer Graphics and I actually had the pleasure of co-organizing a workshop with Bianca Falcidieno[2] a long time ago; I also met Michela at that Workshop. It was one of the nicest events I participated in in the 90’s…)

  1. Spagnuolo, Michela and Falcidieno, Bianca (2009) “3D Media and the Semantic Web.” IEEE Intelligent Systems, 24(2), pp. 90-96. Available from: http://www2.computer.org/portal/web/csdl/doi/10.1109/MIS.2009.20
  2. Falcidieno, Bianca, Herman, Ivan and Pienovi, Caterina (eds.) (1992) Computer Graphics and Mathematics, Heidelberg, Springer Verlag.

May 13, 2009

RDFa, Google

Filed under: Semantic Web, Work Related — Ivan Herman @ 8:31

Imagine that you have a review of a restaurant on your page. In your HTML, you show the name of the restaurant, the address and phone number, the number of users who have provided reviews, and the average rating. People can read and understand this information, but to a computer it is nothing but strings of unstructured text. With microformats or RDFa, you can label each piece of text to make it clear that it represents a certain type of data: for example, a restaurant name, an address, or a rating. This is done by providing additional HTML tags that computers understand.

No, this is not an extract of a page at W3C explaining the role of RDFa, or some academics’ paper on the role of RDF and RDFa. This is an extract of a Google page written for webmasters and explaining why adding such information to a page is useful. It is followed by:

These don’t affect the appearance of your pages, but Google and any other services that look at the HTML can use the tags to better understand your information, and display it in useful ways—for example, in search results.

This was published while I was sleeping; by the time I woke up in Brisbane, twitter, mailing lists, news sites, and the blogosphere were already talking about the news: RDFa is adopted by Google! Ie, this blog is hardly prime news any more:-( But this is such good news that I still felt compelled to write. After Yahoo’s SearchMonkey announcement last year (gosh, what news that was, too!), and after social sites like SlideShare, systems like Drupal, or public thesauri sites like the Library of Congress’ Subject Heading page have all adopted RDFa, this Google announcement means that RDFa is in the mainstream now, that all the work that people have put into this technology is now paying off at last. I must admit, it is a good feeling…

Of course, a lot is still to be done. The quoted Google page refers to some specific vocabularies that, I presume, will be indexed explicitly at first (reviews, people, products, and business and organizations). I would expect other vocabularies will follow in due time, developed by various communities around the globe. (As a first step, it would be great if Google adopted some of the existing vocabularies like SIOC, DOAP, FOAF, DC…). Developing the right vocabularies for the right communities is still a challenge that the Semantic Web community has to work on. And what would make me even happier would be to welcome Google’s developers to participate in the definition of those, together with the rest of the community, in line with the decentralized nature of vocabulary definitions.
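To illustrate what “labelling each piece of text” means for a program, here is a minimal sketch in Python that pulls the labelled values out of a page. The markup in `SAMPLE` is a made-up fragment in the spirit of the restaurant review example (it is not copied from the Google page, and the `v:` terms are used only for illustration), and the `PropertyExtractor` class is a hypothetical helper that looks at the `property` attribute only; a real RDFa processor follows many more rules.

```python
from html.parser import HTMLParser

class PropertyExtractor(HTMLParser):
    """Collect (property, text) pairs from elements carrying a `property` attribute."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._current = None  # property name of the element being read, if any

    def handle_starttag(self, tag, attrs):
        prop = dict(attrs).get("property")
        if prop:
            self._current = prop

    def handle_data(self, data):
        # Attach the first non-empty text after the start tag to the property.
        if self._current and data.strip():
            self.pairs.append((self._current, data.strip()))
            self._current = None

# Hypothetical markup, for illustration only.
SAMPLE = """
<div xmlns:v="http://rdf.data-vocabulary.org/#" typeof="v:Review">
  <span property="v:itemreviewed">Chez Example</span>
  rated <span property="v:rating">4.5</span>
</div>
"""
parser = PropertyExtractor()
parser.feed(SAMPLE)
```

The human reader still sees “Chez Example rated 4.5”; the program gets the restaurant name and the rating as separate, named values.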

But this is for tomorrow and the day after tomorrow. Today, I am just happy to have this news, and worry about the next step later…

May 1, 2009

Library of Congress Subject Headings in SKOS on line

We may remember the experimental lcsh.info site that published the LoC vocabularies on line in SKOS. Well, while lcsh.info was closed down at some point, the official version is now up and running at http://id.loc.gov/authorities/.

This is great news. This means that all items in the LoC subject headings now have a stable, dereferenceable URI; the URI refers to a (SKOS) Concept and is linked to broader and narrower terms. The implementation follows the LOD principles; the URI can be dereferenced in an HTML browser providing an RDFa annotated HTML page. To take an example, the subject heading for “Semantic Web” has the URI: http://id.loc.gov/authorities/sh2002000569#concept; dereferencing it leads to an XHTML+RDFa page. The RDF content describing the concept in SKOS can be accessed, providing a link to, eg, the concept of World Wide Web. This vocabulary gives a stable target to characterize the subject of various entities using, eg, the Dublin Core “subject” term. The site provides a search page and the whole dataset can also be downloaded in RDF/XML or N-Triples.
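As a small illustration of what one can do with the downloaded data, here is a sketch in Python that looks for the skos:broader links of a concept in N-Triples text. The two sample lines are made up: the “Semantic Web” URI is the one quoted above, but the target URI is a placeholder (not a real LoC identifier), I am only assuming for the example that the link is a skos:broader one, and the regular expression only handles triples whose three terms are all URIs.

```python
import re

SKOS_BROADER = "http://www.w3.org/2004/02/skos/core#broader"
# Matches only lines of the form <s> <p> <o> . (no literals, no blank nodes).
TRIPLE = re.compile(r"<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.")

def broader_concepts(ntriples, concept):
    """Return the URIs that `concept` points to via skos:broader."""
    found = []
    for line in ntriples.splitlines():
        match = TRIPLE.match(line.strip())
        if match and match.group(1) == concept and match.group(2) == SKOS_BROADER:
            found.append(match.group(3))
    return found

SEMWEB = "http://id.loc.gov/authorities/sh2002000569#concept"
# Hypothetical excerpt: the object URI is a placeholder, not a real LoC identifier.
SAMPLE = f"""
<{SEMWEB}> <{SKOS_BROADER}> <http://example.org/placeholder/world-wide-web#concept> .
<{SEMWEB}> <http://www.w3.org/2004/02/skos/core#prefLabel> "Semantic Web"@en .
"""
result = broader_concepts(SAMPLE, SEMWEB)
```

For anything beyond such a quick look, a proper RDF library and a real N-Triples parser would be the way to go.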

This is a huge service of the Library of Congress to the Semantic Web community. Thanks to Ed Summers and anybody else who took an active role in it!

April 29, 2009

Drawing consequences on large corpus of data…

Filed under: Social aspects, Work Related — Ivan Herman @ 18:15

I spent some time today reading through the WWW2009 paper on “Mapping the World’s Photos”, from David Crandall et al[1] . The paper reports on a work analyzing a large number (35 million) of photographs extracted from Flickr, including their metadata. The interesting point of the paper is that they combine various analysis tools: they analyse the users’ tags, the geo location in the photos’ metadata, timing information of a series of photos from the same user, and image processing analysis of the photos’ content. The combination of many different types of information leads to a better clustering of the photo data: photos can be organized in terms of location (either on large scale, ie, on the level of, say, a city, or on a much smaller scale, ie, on the level of a landmark like the Eiffel Tower). This clustering can be done without a priori knowledge of the image contents themselves.

There is the technology/algorithmic side of the paper that I cannot really comment on; I am not familiar enough with the clustering algorithms they used. However, at least for me, the more interesting aspect of the paper is the “social” one. As the authors say:

As researchers discovered a decade ago with large-scale collections of Web pages, studying the connective structure of a corpus at a global level exposes a fascinating picture of what the world is paying attention to. In the case of global photo collections, it means that we can discover, through collective behavior, what people consider to be the most significant landmarks both in the world and within specific cities […]; which cities are most photographed […] which cities have the highest and lowest proportions of attention-drawing landmarks […]; which views of these landmarks are the most characteristic […]; and how people move through cities and regions as they visit different locations within them […]. These resulting views of the data add to an emerging theme in which planetary-scale datasets provide insight into different kinds of human activity — in this case those based on images[…].

And this, of course, is really fascinating. But… it can also be dangerous if not done with care, because it is way too easy to jump to false conclusions. Indeed, the study of such a corpus cannot and should not be done, at least in my view, without a careful consideration of social, cultural, and economic issues. (This is of course no criticism of the authors at all, who concentrated on the algorithmic aspect only and did great work at that!)

Let me take one example from the paper: the clustering algorithm produces a table “showing the most photographed places on Earth ranked by number of distinct photographers”. The first 15 cities on the list are: New York, London, San Francisco, Paris, Los Angeles, Chicago, Washington, Seattle, Rome, Amsterdam, Boston, Barcelona, San Diego, Berlin, and Las Vegas. 9 cities from the US, 6 from Western Europe. None from Canada, Asia, Africa, Australia, Latin America… Fascinating (and highly photogenic!) cities like Kyoto, Beijing, Rio de Janeiro, or Istanbul are missing. This is not the fault of the authors: this is what this particular data set, ie, Flickr, gives you. However, can we, should we say that the World is not paying attention to these cities? I do not think so. To really draw conclusions, one would have to look at the demography of Flickr users, at economic issues, whether different communities use Flickr or some other photo site elsewhere in the World… The lack of Japanese cities in the list (knowing that Japanese take tons of pictures everywhere they go!) seems to indicate that their attitude towards social sites like Flickr might be different from what we are used to in “the West”. People going to Cairo may not have the same type of sophisticated cameras and easy Internet access to produce Flickr-quality pictures. And there may be many other different aspects that I do not even think of at this moment…
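Just to illustrate what “ranked by number of distinct photographers” means, here is a small sketch in Python. This is not the authors’ algorithm (the paper derives the places by clustering geotagged photos); I simply assume that each photo already comes with a city name, and the user names and records below are entirely made up.

```python
from collections import defaultdict

# Made-up (photographer, city) pairs, one per photo; not real Flickr data.
photos = [
    ("alice", "Paris"), ("alice", "Paris"), ("alice", "Paris"),
    ("bob", "Paris"),
    ("alice", "London"), ("bob", "London"), ("carol", "London"),
    ("dave", "Kyoto"),
]


def rank_by_distinct_photographers(records):
    """Rank cities by how many different users photographed them."""
    photographers = defaultdict(set)
    for user, city in records:
        photographers[city].add(user)  # a set ignores repeat photos by one user
    counts = [(city, len(users)) for city, users in photographers.items()]
    # Highest count first; ties are broken alphabetically by city name.
    return sorted(counts, key=lambda item: (-item[1], item[0]))


print(rank_by_distinct_photographers(photos))
```

In this toy data Paris has the most photos (four) but London is ranked first, because three different people photographed it. Whatever the ranking, it can only reflect the people who happen to be in the dataset.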

This is indeed an exciting line of research. But we, computer scientists, should be modest enough to realize that drawing social conclusions from such data requires us to work with experts in other disciplines. We could then come up with defensible conclusions that would be interesting to explore and exploit. In other words, the future, in this respect, lies in interdisciplinary work.

  1. Crandall, David, Backstrom, Lars, Huttenlocher, Daniel and Kleinberg, Jon (2009) ‘Mapping the World’s Photos’, In Maarek, Y. and Nejdl, W. (eds.), Proceedings of the 18th International Conference on World Wide Web, Madrid, Spain, ACM Press, pp. 761-770. Available online.