SemTech 2011 – O'Reilly on RDF in eBooks

Instead of a flood of tweets I thought I'd go a bit old school and do some live blogging from the SemTech 2011 session Discovering and Using RDF for Books at O'Reilly Media this morning. My own interest in this session is how we might apply this to texts coming from our local repository, in particular those related to our Yellowbacks Project, which we hope to enhance soon. We also have a body of texts sitting on our servers in TEI format, and we haven't landed on a way to comfortably leverage that in our infrastructure. My own comments here appear in parentheses (like so).

O'Reilly took their first stab at modeling information about their books in straight XML, in a bit of a "tag soup" approach. This proved way too heavyweight for them, and they ended up delivering products late because of the time it took to modify and extend their XML approach. They then moved on to ONIX as an internal format, but it was aging, and writing XPath against it was a bit nightmarish because of standards drift, among other reasons. In the end it was just not extensible and not friendly toward being agile. That led them to take a stab at creating their own schema, which also proved too heavyweight and slow. At last they washed up on the shores of Dublin Core, specifically DC Terms, and this introduced them to the world of RDF.
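
To make the DC Terms idea concrete, here's a minimal sketch of my own (not O'Reilly's actual model) of describing a book as RDF using Python's rdflib; the URI and values are invented:

```python
# A made-up book described with DC Terms; not O'Reilly's real data model.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCTERMS

g = Graph()
book = URIRef("http://example.org/books/9780596000000")  # hypothetical URI

g.add((book, DCTERMS.title, Literal("Some O'Reilly Title")))
g.add((book, DCTERMS.creator, Literal("Jane Author")))
g.add((book, DCTERMS.issued, Literal("2011-06-01")))
g.add((book, DCTERMS.identifier, Literal("9780596000000")))

print(g.serialize(format="turtle"))
```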

The extensibility of RDF, starting with DC, seemed pretty cool and useful to them, and they kept adding FOAF, BIBLIO and more. While this made things more useful for the company, the problem at the end of the day was that they were still thinking in XML terms. (Implying they should have been thinking in RDF and triples terms instead, I suppose.)
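
That extensibility looks something like this in practice: the same book resource simply gains FOAF statements alongside the DC Terms ones, with no schema change required. Again a sketch with invented names, not their data:

```python
# Mixing vocabularies in one graph: DC Terms for the book, FOAF for the
# person. All URIs and values here are illustrative only.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCTERMS, FOAF

g = Graph()
book = URIRef("http://example.org/books/9780596000000")
author = URIRef("http://example.org/people/jane-author")

g.add((book, DCTERMS.title, Literal("Some O'Reilly Title")))
g.add((book, DCTERMS.creator, author))               # DC Terms links to the author
g.add((author, FOAF.name, Literal("Jane Author")))   # FOAF describes the person
g.add((author, FOAF.mbox, URIRef("mailto:jane@example.org")))

print(g.serialize(format="turtle"))
```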

Early shopping cart systems pushed too much data back and forth, and keeping the ordering and purchasing systems in sync was difficult and failed too often. They moved to RESTful services between systems pushing triples, and this improved the service dramatically, as well as transforming their business approach to ePubs by allowing for a "user library" model. Moving to consolidated business logic and a central metadata store improved the quality of the service significantly. This got them close enough to the structure needed for stable URIs and web services that they could sustain and improve it for some time.
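
I picture the pattern they describe looking roughly like this: a RESTful endpoint that hands back the triples for one book at its stable URI. This is my own minimal Flask sketch, not their code, and the URI scheme is hypothetical:

```python
# A toy RESTful endpoint serving a book's triples as N-Triples, so other
# systems exchange small RDF payloads rather than bulk data dumps.
from flask import Flask, Response
from rdflib import Graph, URIRef

app = Flask(__name__)
store = Graph()  # stand-in for a central metadata store

@app.route("/books/<isbn>")
def book(isbn):
    uri = URIRef(f"http://example.org/books/{isbn}")  # hypothetical URI scheme
    g = Graph()
    for triple in store.triples((uri, None, None)):  # everything about this book
        g.add(triple)
    return Response(g.serialize(format="nt"), mimetype="application/n-triples")

if __name__ == "__main__":
    app.run()
```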

The emergence of cloud services changed the landscape for them when service requests became too much for the old system. They were able to transition to cloud services because their stable URI structure allowed easy porting of the service without changing references to the resources. A major failing of their system was still that they were trying to parse RDF with XPath or XQuery; they still hadn't really taken the full RDF triplestore approach.
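
To illustrate why XPath over RDF is brittle (my example, not theirs): RDF/XML can serialize the very same statement in different XML shapes, so a fixed path breaks, while a graph parser normalizes them all:

```python
# Two RDF/XML serializations of the identical triple. An XPath like
# //dcterms:title matches the element in doc_a but misses the property
# attribute in doc_b; parsed as RDF, both yield the same single triple.
from rdflib import Graph

doc_a = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                    xmlns:dcterms="http://purl.org/dc/terms/">
  <rdf:Description rdf:about="http://example.org/books/1">
    <dcterms:title>Same Book</dcterms:title>
  </rdf:Description>
</rdf:RDF>"""

doc_b = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                    xmlns:dcterms="http://purl.org/dc/terms/">
  <rdf:Description rdf:about="http://example.org/books/1"
                   dcterms:title="Same Book"/>
</rdf:RDF>"""

for doc in (doc_a, doc_b):
    g = Graph()
    g.parse(data=doc, format="xml")
    print(list(g))  # the same triple both times
```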

Moving their queries to SPARQL significantly improved services and made the system much more extensible by leveraging systems like Jena. Underlying it all were N-Triples, and this put them in a position to adapt very quickly to expanding needs and to evolve their services. The ease of querying and the performance improvements led to another problem: they could too easily push way too much information at the user, and contextualizing that information in a more meaningful way became a business priority.
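
They query Jena, but the shape of a SPARQL query is the same anywhere; here's a trivial rdflib version against invented data just to show what the move buys you:

```python
# A minimal SPARQL SELECT over a two-triple graph; data and URIs invented.
from rdflib import Graph

g = Graph()
g.parse(data="""
<http://example.org/books/1> <http://purl.org/dc/terms/title> "Book One" .
<http://example.org/books/2> <http://purl.org/dc/terms/title> "Book Two" .
""", format="nt")

results = g.query("""
    PREFIX dcterms: <http://purl.org/dc/terms/>
    SELECT ?book ?title
    WHERE { ?book dcterms:title ?title }
""")
for book, title in results:
    print(book, title)
```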

An interesting note on training staff to use SPARQL: people completely new to query languages picked it up faster than those who already knew SQL. "How do I do a JOIN?" was too common a point of confusion with the latter crowd.
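
For anyone coming from the SQL side, the point of confusion looks like this (my example, invented data): the join never gets spelled out, a shared variable does the work:

```python
# SQL version of the same question:
#   SELECT b.title, a.name
#   FROM books b JOIN authors a ON b.author_id = a.id;
from rdflib import Graph

g = Graph()
g.parse(data="""
<http://example.org/books/1> <http://purl.org/dc/terms/title> "Book One" .
<http://example.org/books/1> <http://purl.org/dc/terms/creator> <http://example.org/people/1> .
<http://example.org/people/1> <http://xmlns.com/foaf/0.1/name> "Jane Author" .
""", format="nt")

# ?author appearing in both patterns *is* the join:
results = g.query("""
    PREFIX dcterms: <http://purl.org/dc/terms/>
    PREFIX foaf:    <http://xmlns.com/foaf/0.1/>
    SELECT ?title ?name
    WHERE {
        ?book   dcterms:title   ?title ;
                dcterms:creator ?author .
        ?author foaf:name       ?name .
    }
""")
for title, name in results:
    print(title, name)
```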

On the tools used to manage their system: they took RDF graphs, transformed them with pymantic into Python object graphs, and used these both on the user presentation end and on the server side. They had the predictable problems with multi-threaded processing in Python but avoided them with either brute force or by managing their information as flat documents.
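
I haven't used pymantic, so this is only a toy sketch of the general idea of object views over a graph, not its actual API:

```python
# The rest of the code sees plain Python objects and attributes instead of
# triples. Class and data here are hypothetical.
from rdflib import Graph, URIRef
from rdflib.namespace import DCTERMS

class BookView:
    """Read-only object view over the triples about one book."""
    def __init__(self, graph, uri):
        self.graph = graph
        self.uri = uri

    @property
    def title(self):
        return self.graph.value(self.uri, DCTERMS.title)

g = Graph()
g.parse(data='<http://example.org/books/1> '
             '<http://purl.org/dc/terms/title> "Book One" .',
        format="nt")
book = BookView(g, URIRef("http://example.org/books/1"))
print(book.title)  # -> Book One
```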

The culmination of this approach was the ability to adapt very quickly to business needs; they cited the Microsoft Press project as an example of significantly expanding their services in as little as eight weeks.

They're using Jena primarily as their storage and server of triples. They noted that quads didn't perform as well as triples in their case.
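
For those who haven't met quads: a quad is just a triple plus a fourth term naming the graph it belongs to, which presumably is part of what the store pays for on every lookup. Illustrative data below, not theirs:

```python
# The same statement as an N-Triples line and as an N-Quads line; the
# fourth term on the quad names the graph that contains the statement.
triple = ('<http://example.org/books/1> '
          '<http://purl.org/dc/terms/title> "Book One" .')
quad = ('<http://example.org/books/1> '
        '<http://purl.org/dc/terms/title> "Book One" '
        '<http://example.org/graphs/catalog> .')
```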

My Own Comments follow

Overall, a great demo. My head is swimming a bit because I'm trying to reconcile in my own head how we manage information and objects in a repository with the success cases I'm hearing about. I believe it's likely we need to move in a direction of managing the objects and editing the metadata through Fedora, while serving and linking the information by producing RDF from that, and coming up with a management system to link all the triples produced from that RDF. This isn't really all that big a step, but it's something I need to sit down and talk over with everyone.