Ivan’s private site

March 1, 2013

RDFa 1.1, microdata, and turtle-in-HTML now in the core distribution of RDFLib

This has been in the works for a while, but it is done now: the latest (3.4.0 version) of the python RDFLib library has just been released, and it includes and RDFa 1.1, microdata, and turtle-in-HTML parser. In other words, the user can add structured data to an HTML file, and that will be parsed into RDF and added to an RDFLib Graph structure. This is a significant step, and thanks to Gunnar Aastrand Grimnes, who helped me adding those parsers into the main distribution.

I have written a blog last summer on some of the technical details of those parsers; although there has been updates since then, essentially following the minor changes that the RDFa Working has defined for RDFa, as well as changes/updates on the microdata->RDF algorithm, the general approach described in that blog remains valid, and it is not necessary to repeat it here. For further details on these different formats, some of the useful links are:

Enjoy!

November 6, 2012

RDFa 1.1 and microdata now part of the main branch of RDFLib

Filed under: Code,Python,Semantic Web,Work Related — Ivan Herman @ 21:34
Tags: , , , ,

A while ago I wrote of the fact that I have adapted my RDFa and microdata to RDFlib. Although I did some work on it since then, nothing really spectacular happened (e.g., I have updated the microdata part to the latest version of the microdata->RDF conversion note, and I have also gone through the tedious exercise to make the modules usable for Python3).

Nevertheless, a significant milestone has been reached now, but this was not done by me but rather by Gunnar Aastrand Grimnes, who “maintains” RDFlib: the separate branch for RDFa and microdata has now been merged with the master branch of RDFLib on github. So here we are; whenever the next official release of RDFLib comes, these parsers will be part of it…

August 31, 2012

RDFa, microdata, turtle-in-HTML, and RDFLib

For those of us programming in Python, RDFLib is certainly one of the RDF packages of choice. Several years ago, when I developed a distiller for RDFa 1.0, some good souls picked the code up and added it to RDFLib as one of the parser formats. However, years have gone by, and have seen the development of RDFa 1.1, of microdata, and also the specification of directly embedding Turtle into HTML. It is time to bring all these into RDFLib…

Some times ago I have developed both a new version of the RDFa distiller, adapted for the  1.1 RDFa standard, as well as a microdata to RDF distiller, based on the Interest Group note on converting microdata to RDF. Both of these were packages and applications on top of RDFLib. Which is fine because they can be used with the deployed RDFLib installations out there. But, ideally, these should be retrofitted into the core of RDFLib; I have used the last few quiet days of the vacation period in August to do just that (thanks to Niklas Lindström and Gunnar Grimes for some email discussion and helping me through the hooplas of RDFLib-on-github). The results are in a separate branch of the RDFLib github repository, under the name structured_data_parsers. Using these parsers here is what one can do:

g = Graph()
# parse an SVG+RDF 1.1 file an store the results in 'g':
g.parse(URI_of_SVG_file, format="rdfa1.1") 
# parse an HTML+microdata file an store the results in 'g':
g.parse(URI_of_HTML_file, format="microdata")
# parse an HTML file for any structured conent an store the results in 'g':
g.parse(URI_of_HTML_file, format="html")

The third option is interesting (thanks to Dan Brickley who suggested it): this will parse an HTML file for any structured data, let that be in microdata, RDFa 1.1, or in Turtle embedded in a <script type="text/turtle">...</script> tag.

The core of the RDFa 1.1 has gone through a very thorough testing, using the extensive test suite on rdfa.info. This is less true for microdata, because there is not yet an extensive test suite for that one yet (but the code is also simpler). On the other hand, any restructuring like that may introduce some extra bugs. I would very much appreciate if interested geeks in the community could install and test it, and forward me the bugs that are still undeniably there… Note that the microdata->RDF mapping specification may still undergo some changes in the coming few weeks/months (primarily catching up with some development around schema.org); I hope to adapt the code to the changes quickly.

I have also made some arbitrary decisions here, which are minor, but arbitrary nevertheless. Any feedback on those is welcome:

  • I decided not to remove the old, 1.0 parser from this branch. Although the new version of the RDFa 1.1 parser can switch into 1.0 mode if the necessary switches are in the code (e.g., @version or a RDFa 1.0 specific DTD), in the absence of those 1.1 will be used. As, unfortunately, 1.1 is not 100% backward compatible with 1.0, this may create some issues with deployed applications. This also means that the format="rdfa" argument will refer to 1.0 and not to 1.1. Am I too cautious here?
  • The format argument in parse can also hold media types. Some of those are fairly obvious: e.g., application/svg+xml will map on the new parser with RDFa 1.1, for example. But what should be the default mapping for text/html? At present, it maps to the “universal” extractor (i.e., extracting everything).

Of course, at some point, this branch will be merged with the main branch of RDFLib meaning that, eventually, this will be part of the core distribution. I cannot say at this point when this will happen, I am not involved in the day-to-day management of the RDFLib development.

I hope this will be useful…

The Rubric Theme. Create a free website or blog at WordPress.com.

Follow

Get every new post delivered to your Inbox.

Join 3,616 other followers