I found a Force11 reference on Google+ to a NYT article. The article is entitled “How to Share Scientific Data”, a topic I am interested in both privately and professionally as part of my W3C work (publishing scientific data on the Web is coming to the fore as a major area of Data on the Web). The NYT article is, actually, “just” a short overview of a more detailed paper, written by F. Berman and V. Cerf, on “Who Will Pay for Public Access of Research Data”, published in Science. Because the NYT duly put in a reference, I followed it. That is only a reference with the abstract; there is a separate link to the full text. But then… I am asked to subscribe to Science to access the paper, which is about $100 for an annual subscription! In effect: one would have to pay $100 to access a paper (o.k., possibly others, but that is the one I am interested in right now!) that looks at public access of data: isn’t this ironic? Sigh…
August 15, 2013
August 11, 2013
This is just a nice little example which might be worth noting for those who do not know Open Street Map (I am also a relatively new user of it).
I had a nice walk in Marseille yesterday, which included going down from the big cathedral on the top of the hill (“Notre Dame de La Garde”) to the seaside. There is a not-very-well-known path behind the church that one can take which is, for my taste anyway, a gorgeous way of doing it.
The path of course appears on Google’s Map: look at the small path going from the church to the “Rue du Bois Sacré”. However: look at the same area using Open Street Map: not only is the path there, but it gives a bunch of details. Indeed, it is not really a simple path: it is a long series of steps, i.e., do not try to drive or even bike there:-( And because it is a hot city, it is also good to know that there is a small public fountain along the path (and, indeed, it is there and it works!)…
It is not really Google’s fault. They probably got the material from some sort of an official mapping system (they could not get their camera cars or bikes up there…) and there is no way a company, even as huge as Google, can cover such details. But a community-driven site can: people can add such details easily. (Actually, there was part of the path that was missing, and I will add it soon using my GPS readings.) Therein lies the power:-)
March 16, 2013
March 1, 2013
Tags: HTML, microdata, Python, RDF, RDFa, RDFLib, Resource Description Framework, Turtle
This has been in the works for a while, but it is done now: the latest (3.4.0) version of the Python RDFLib library has just been released, and it includes RDFa 1.1, microdata, and Turtle-in-HTML parsers. In other words, the user can add structured data to an HTML file, and that will be parsed into RDF and added to an RDFLib Graph structure. This is a significant step; thanks are due to Gunnar Aastrand Grimnes, who helped me add those parsers to the main distribution.
I wrote a blog post last summer on some of the technical details of those parsers; although there have been updates since then, essentially following the minor changes that the RDFa Working Group has defined for RDFa, as well as changes/updates to the microdata->RDF algorithm, the general approach described in that blog remains valid, and it is not necessary to repeat it here. For further details on these different formats, here are some useful links:
- For RDFa, there is a new version of an RDFa 1.1 Primer in preparation. It is probably worth keeping an eye on the editor’s draft of the primer. The primer has the links to the official recommendations if one wants to look up the gory details. Alternatively, look at the RDFa community page!
- For microdata, the official specification is of course available; the conversion to RDF is the subject of a separate W3C Note.
- For turtle-in-HTML, you can look at the latest version of the Turtle spec.
February 22, 2013
Tags: E-book, eBook, EPUB, Open Web Platform
My last week was all around digital publishing: first, I was at the W3C Workshop on eBooks and the Open Web Platform, which I helped to organize. If I extrapolate from the discussions at the W3C Workshop, there are good prospects that this topic will become more important at the W3C, and that it will also keep us busy (in addition to my role on the Semantic Web). By the way, the minutes of the W3C Workshop (both for the 1st and the 2nd days) and the presentations are public; a somewhat more detailed workshop report should also be available soon.
The Workshop was followed by O’Reilly’s Tools of Change (TOC) conference: a first time for me. And it was extremely interesting to find myself in a new environment I had never been in before. I saw some great keynotes (e.g., Mark Waid’s on “Reinventing Comics And Graphic Novels For Digital”, or Maria Popova’s, from Brain Pickings), and learned a lot at some of the sessions (for example, at Bill Rosenblatt’s session on some of the legal aspects surrounding eBooks).
My interest in this whole area is, primarily, in how digital publishing in general, and electronic books in particular, relate to technologies developed at W3C. For those of you who may not realize it: if an electronic book uses the ePUB standard (and more and more books do) then the book is, in fact, a “frozen” Web site (depending on the ePUB version, based either on XHTML1 or on HTML5). Technically, it is a zip file containing all the files necessary to render the content, plus some ePUB-specific files to manage the table of contents, to help readers display the content more quickly, etc. Actually, as far as I know, most ePUB readers are based on the same core technology as many of the Web browsers, namely WebKit. The strong relationship between the Web and publishing in general, and eBooks in particular, was emphasized several times at the conference, especially in the keynote of Jeff Jaffe, the CEO of W3C.
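The zip-file claim above is easy to verify. Here is a small sketch; the function names are my own, and only the archive layout follows the EPUB packaging rules (a mimetype entry containing application/epub+zip, and META-INF/container.xml pointing at the package document):

```python
import zipfile

def list_epub_contents(path):
    """Return the entries of an EPUB file: it is just a zip archive
    holding the (X)HTML content plus the EPUB bookkeeping files."""
    with zipfile.ZipFile(path) as z:
        return z.namelist()

def is_epub(path):
    """Rough check: does the archive declare the EPUB media type
    in its 'mimetype' entry?"""
    with zipfile.ZipFile(path) as z:
        return z.read("mimetype").decode() == "application/epub+zip"
```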
But then… if so, why do we need separate eBook readers, either in hardware or in software? (Let us put aside for now the issues of DRM, vendor lock-in, etc.; these are of course reasons, but let us hope the business will evolve towards a more open environment where those issues will be less relevant.) Do we really need separate ePUB reader software on, say, my iPad, or should we simply rely on the browsers taking care of ePUB files, either directly or through some extensions? (There is, for example, a project called Readium to add such capabilities to Chrome.) The answer is not obvious; there are proponents of both approaches. My 2 cents here: it is not a core technology issue, but a user experience and interface one. Reading a book, electronic or otherwise, is a different intellectual activity than reading an average Web page. Here are some differences that I feel are important, and I am sure there are more, much more:
- A book must be available off-line; this is, actually, its natural state. This difference is obvious, but worth noting: for example, the user interface for books has to be able to list what is and what is not available at a given moment (all readers have some sort of an imitation of a traditional bookshelf).
- The amount of “information” you want to absorb is different. A typical Web page is not terribly long; even the more detailed Wikipedia articles, when printed, are rarely longer than 4-5 pages. Compare that to an average book that may be hundreds or even thousands of pages. What this means in practice is that, whereas a Web page is usually read, understood, “absorbed” in one go, reading a single book may take several days or weeks. This has all kinds of consequences on how one navigates, uses traditional bookmarks (not the ones browsers usually provide, i.e., to store URL-s, but what used to be bookmarks in the past), tables of content, indexes, glossaries, etc. These features are essential for books but much less so for an average Web page.
- Modern Web pages have more and more interactive features, and they are related to various social sites like Twitter or Facebook; very often these pages are Web applications with very complex features (think of gmail, for example). Obviously, browsers have to be prepared for a high level of interactivity and have to be optimized to offer an optimal user experience. Books are much less interactive. Newer generations of books may include some level of interactivity, and these are important for, say, the educational book market or for children’s books, but it is a far cry from what Web sites do. Also, some readers (like Kobo’s) try to include some level of Social Web facilities (sharing information about books with friends, that sort of thing); to be honest, I never found those social features interesting or important (o.k., I may just be old-skool). Reading a book for me remains a linear activity, whether it is fiction, poetry, history, or politics. I want my eBook reader to optimize for that, and avoid distractions.
- There are some features that a good eBook reader should offer and that browsers traditionally do not. A prime example is annotation facilities. Many people like to scribble on their books, underline full sentences, highlight words; I still have not found any tool to do that properly in a Web browser, although all the eBook readers that I have tested so far have such functionality. This is a typical user interface difference that comes from different demands. (Another example that comes to mind is quick access to a dictionary, an encyclopedia, etc.)
- Some sort of a payment/rights management system must be part of the reader. I personally consider the current DRM system, as used in the eBook world, fundamentally broken insofar as it may drive people away from this market. However, I recognise that something should be available that allows authors of books to get some reward for their work. Whether that is some sort of watermarking, social DRM, or whatever, I do not know, but something is needed, and the reader environment has to handle it.
I realize, of course, that this is a continuum: with ePUB3 we have the ability to make eBooks much more interactive, possibly with scripts, multimedia, etc.; in effect, electronic books are becoming more and more like Web applications. I.e., some of these differences may disappear or become less important. Nevertheless, I believe there will always be a difference in user expectations, in the emphasis that a piece of software (or hardware) may have. eBook readers are not browsers, although electronic books are, in fact, part of the Web just like other types of Web content. Is it a sign that we may need a more diverse landscape for accessing the Web than we have today?
December 24, 2012
Tags: Apple, iPad, iPhone, Mountain Lion, Operating system
It is December and, just as last year, it is the time for an upgrade of OS X. Last year it was Lion (and I did write down my experiences back then); this time it is Mountain Lion. I decided to make a short note of my experiences because, maybe, by sharing those I will save some time and energy to somebody else. In general, I have not hit any major issues, I must say, just nuisances, but it did take me some time to get around those…
1. The installation process itself was fairly straightforward except that… it was nerve-wracking at times. While installing, the screen duly had a progress bar with a text underneath, saying something like “the remaining time is 25 minutes”, “the remaining time is 5 minutes”, “the remaining time is less than a minute”, and then… it got stuck. Stuck for a long time. Nothing moved, the progress bar was full. And then an even stranger thing happened: it said something like “the remaining time is -20 minutes”. WTF? Because I have experienced quite a few crashes in the 30 years that I have been in this business, of course I got nervous. Should I reboot? What will happen then? Is my disk fully destroyed now?
Luckily, I had the instinct not to do anything but take my iPad and look it up on the Web. And sure enough: there are reports elsewhere saying that the progress bar implementation of the installer, including the time estimate, is buggy, and that I should just wait and things would turn out all right. And they did indeed, after around 30 extra minutes. Phew!
2. Everything installed, I got to login… and it seems that there is still some installation and/or file adaptation to do at that point, because it took about 4-5 minutes after I typed in my password before any of my windows showed up. Again, WTF? I had become wiser by now, just waited, and things got back to normal. Note that, since then, everything is fine when I wake up the machine, although I have not rebooted it yet to see if a login would again lead to such a delay.
3. I knew that, in Mountain Lion, Apple had decided to remove the simple system preference flag to start up a local apache automatically (having the local apache running is essential for me: I have a partial copy of a Web site on my machine to test pages before they go public). Although I never understood why this decision was taken, I was prepared; there are a number of sites giving advice on what to do (e.g., the one I looked at), as well as a small extra preference pane that one can install.
What I did not count on is that the installation would wipe out the old apache configuration file (i.e., /etc/apache2/httpd.conf). (I do not think the Lion installation did that; at least I do not remember.) To make things even more difficult, that directory is not accessible through Time Machine (why?), so I had to reproduce my changes. It took me quite some time, because I had adapted that file for my needs three years ago and had, of course, forgotten all about it. Advice: make a copy of that file before upgrading!
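Following my own advice, something as small as this sketch would have saved me the trouble (the default path is the standard OS X location; the function name is my own invention):

```python
import shutil
from pathlib import Path

def backup_conf(src="/etc/apache2/httpd.conf", dest_dir="."):
    """Copy a configuration file into dest_dir with a '.backup' suffix,
    preserving timestamps; run this before an OS upgrade."""
    src = Path(src)
    dest = Path(dest_dir) / (src.name + ".backup")
    shutil.copy2(src, dest)  # copy2 keeps metadata, unlike copy
    return dest
```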
4. I need some command line tools like cvs. That means I had to install a new version of Xcode; I counted on that. However… cvs was still not there after installation. Sigh… did they remove cvs as an obsolete tool? But no, gcc was not available either.
As usual, the Web and Google are your friends; I found a note with an explanation. It turns out that Apple no longer installs the “developer” command line tools by default. That includes compilers, cvs, and the like. You have to install them explicitly: start up Xcode, then look for Xcode→Preferences→Downloads→Components and click the install button next to the command line tools. (Again the same question: why this arbitrary decision?)
5. I was pleased to see that the Note application is now available, and is supposed to synchronise with the note application on my iPhone and iPad. I knew that, and I was looking forward to that. On Lion, the notes were bound to the email accounts and appeared in the Mail application; I always found that setup odd.
But… things are not that simple, because Apple again made some unexplainable decisions. On Lion, I could assign notes to the various email accounts I had, I could do the same on, say, my iPhone, and things worked properly. Not so in Mountain Lion; indeed (as I understood after some google-ing…), Apple has discontinued this synchronisation except through iCloud. I.e., you have to regroup all your notes under the iCloud account (if you have one, that is) to achieve a smooth synchronisation with your mobile devices. It is not that bad in the end, because you can define folders for notes and use those for your own categorisation; but, until I realised all that and got everything running, I again lost quite some time, hit some dead ends, etc. Sigh…
6. I also had some small woes with the latest Safari. For reasons that, again, I do not understand, there is no longer a preference in Safari to set the right font size. The only way to do that is through a CSS style sheet (see also a relevant note I found). Although my personal problem was that the default character size was way too big for my taste, as the author of the note rightfully said, not being able to adapt the size easily can be a major accessibility issue for some.
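For those hunting for the same workaround: a user style sheet is just a plain CSS file selected under Safari → Preferences → Advanced → Style sheet. A minimal sketch (the size value is merely my own taste):

```css
/* Force a smaller base font size everywhere; !important is needed
   so that page styles do not override it. */
body { font-size: 14px !important; }
```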
Frankly… I love my Mac, and I still find it vastly superior in usability than other machines. It is, nevertheless, disappointing to see Apple making such arbitrary decisions and making the transition to a new system unnecessarily tedious. This should not happen.
(By the way, this just reinforced me in my selfish decision not to upgrade to a new system right away. Having waited half a year meant that all my issues were solved relatively easily by looking at notes published by others…)
November 26, 2012
November 6, 2012
Tags: microdata, Python, RDFa, RDFLib, Turtle
A while ago I wrote about the fact that I had adapted my RDFa and microdata distillers to RDFLib. Although I have done some work on them since then, nothing really spectacular happened (e.g., I updated the microdata part to the latest version of the microdata->RDF conversion note, and I also went through the tedious exercise of making the modules usable for Python 3).
Nevertheless, a significant milestone has now been reached, though not by me but by Gunnar Aastrand Grimnes, who maintains RDFLib: the separate branch for RDFa and microdata has now been merged with the master branch of RDFLib on GitHub. So here we are; whenever the next official release of RDFLib comes, these parsers will be part of it…
October 12, 2012
Tags: European Union, France, Germany, Nobel Peace Prize
First World War, somewhere in France or Germany, two brothers are on the front line. The unusual fact is, though, that they are facing one another: one is enrolled in the French army, the other in the German one. Luckily, they both survive the War and do not have to kill one another.
About 25 years later, one of the brothers is enrolled, again, into the German army to defend the Reich on the Rhine; his son joins the French resistance movement. Father and son are many miles apart, luckily, but in opposing armies nevertheless.
Jump ahead again about 35 years. The former French partisan lives in France, works for the local subsidiary of a German company, travels back and forth between the two countries; he believes (and says) that a new war between France and Germany is now unthinkable.
Unrealistic story? Far from it. The two brothers had a third brother, who happened to be my grandfather. They lived in small villages in the North-East of France, in a region called Lorraine; part of this region (together with another one called Alsace) has changed hands between France and Germany four times in a century. The tragedy was that the two brothers happened to live on different sides of the artificial border, hence were enrolled in opposing armies.
This was Europe for a long time. It was also a Europe with borders, with an iron curtain (which also played a significant role in my life), with latent and dangerous tensions that could have led to new conflicts. But all this is history. Our children, in many ways, do not even understand this past; stories like the one above seem unbelievable and unrealistic to them. And this is the main achievement of the EU. It is not perfect (far from it), it currently has economic problems and tensions to solve; but every time I pass a border without even noticing it on my way from Amsterdam to Budapest or Paris, I should (and I often do) remember the ordeals my own grandfather’s generation went through. It is therefore more than fitting that the EU, as an organization, has just received the Nobel Peace Prize. A war-torn, suffering continent closed a terrible period by creating it; as one of my colleagues, Phil Archer, said on Twitter: we can be proud of being European.
August 31, 2012
Tags: HTML, microdata, Python, RDFa, RDFLib, Resource Description Framework, Turtle
For those of us programming in Python, RDFLib is certainly one of the RDF packages of choice. Several years ago, when I developed a distiller for RDFa 1.0, some good souls picked the code up and added it to RDFLib as one of the parser formats. However, years have gone by, and have seen the development of RDFa 1.1, of microdata, and also the specification of directly embedding Turtle into HTML. It is time to bring all these into RDFLib…
Some time ago I developed both a new version of the RDFa distiller, adapted to the RDFa 1.1 standard, as well as a microdata-to-RDF distiller, based on the Interest Group Note on converting microdata to RDF. Both of these were packages and applications on top of RDFLib, which is fine because they can be used with the deployed RDFLib installations out there. Ideally, though, these should be retrofitted into the core of RDFLib; I used the last few quiet days of the vacation period in August to do just that (thanks to Niklas Lindström and Gunnar Grimnes for some email discussion and for helping me through the hoops of RDFLib-on-GitHub). The results are in a separate branch of the RDFLib GitHub repository, under the name structured_data_parsers. Using these parsers, here is what one can do:
```python
g = Graph()
# parse an SVG+RDFa 1.1 file and store the results in 'g':
g.parse(URI_of_SVG_file, format="rdfa1.1")
# parse an HTML+microdata file and store the results in 'g':
g.parse(URI_of_HTML_file, format="microdata")
# parse an HTML file for any structured content and store the results in 'g':
g.parse(URI_of_HTML_file, format="html")
```
The third option is interesting (thanks to Dan Brickley, who suggested it): this will parse an HTML file for any structured data, let that be in microdata, RDFa 1.1, or in Turtle embedded in a <script type="text/turtle">...</script> tag.
The core of the RDFa 1.1 parser has gone through very thorough testing, using the extensive test suite on rdfa.info. This is less true for microdata, because there is no extensive test suite for that one yet (but the code is also simpler). On the other hand, any restructuring like this may introduce some extra bugs. I would very much appreciate it if interested geeks in the community could install and test it, and forward me the bugs that are undeniably still there… Note that the microdata->RDF mapping specification may still undergo some changes in the coming weeks/months (primarily catching up with some developments around schema.org); I hope to adapt the code to any changes quickly.
I have also made some arbitrary decisions here, which are minor, but arbitrary nevertheless. Any feedback on those is welcome:
- I decided not to remove the old, 1.0 parser from this branch. Although the new version of the RDFa 1.1 parser can switch into 1.0 mode if the necessary switches are in the code (e.g., @version or an RDFa 1.0 specific DTD), in the absence of those 1.1 will be used. As, unfortunately, 1.1 is not 100% backward compatible with 1.0, this may create some issues with deployed applications. This also means that the format="rdfa" argument will refer to 1.0 and not to 1.1. Am I too cautious here?
- The format argument in parse can also hold media types. Some of those are fairly obvious: application/svg+xml, for example, will map to the new parser with RDFa 1.1. But what should be the default mapping for text/html? At present, it maps to the “universal” extractor (i.e., extracting everything).
Of course, at some point, this branch will be merged with the main branch of RDFLib, meaning that, eventually, this will be part of the core distribution. I cannot say at this point when that will happen; I am not involved in the day-to-day management of RDFLib development.
I hope this will be useful…