SemTech 2011 Redux

The SemTech 2011 conference delivered a lot to attendees, and I thought I'd jot down a few of my thoughts and note some highlights as it draws to a close.

By and large I have to say the technology has definitely arrived: we're capable of some exciting advances in linking data and in having the web begin to fulfill its promise as a real knowledge base. I just hope Skynet appreciates all the work we're doing on its behalf when it finally becomes self-aware.

What had the structured data crowd buzzing the most was last week's announcement at schema.org of Microdata format support by Google, Microsoft and Yahoo. Annoyance aside at Microsoft putting up schema.org as if it were some small independent standards board, the Microdata format seems just fine to me. Essentially a competitor to RDFa, it targets easy markup of information in a web page and is a bit leaner and easier than the current RDFa 1.0 standard. The crowd here being a bit biased toward RDFa, there wasn't a lot of positive talk about schema.org, but I find I can't care too much one way or the other. What we need to develop is a community of practice, and the technology should be secondary to that as long as it's not a barrier. To me Microdata and RDFa are both fine standards, and the only real argument I'd make for preferring one over the other is that schema.org's aim is to mark up information for better searching while RDFa is aimed at marking up knowledge. It may seem a subtle difference, but misaligned motivations like this can be the cause of big directional shifts, and I think it's important for us as a community not to lose sight of our aims here. We're not trying to create a more searchable web, we're trying to make a more sensible web.

This makes a nice segue into what I think was one of the better presentations at the conference, Ivan Herman's presentation on RDFa 1.1. The specification should be released fairly soon, and 1.1 greatly simplifies an already simple standard. In particular, stacked up against the material at schema.org, I think RDFa 1.1 is just as simple and easy to use, and I'm more comfortable with its extensibility than with Microdata's. I'm particularly interested in applying RDFa to TEI with some of our local collections; if that works out like I hope, it could be a fantastic step forward in presenting our TEI-tagged collection.

On a side note I have to say Mr. Herman gets my vote for presenter I most want to be my instructor at Hogwarts.  The guy is just an awesome presenter and tremendously knowledgeable.  We need more advocates like him out there.

Much ado was given across several presentations to OWL and SKOS, as well as automatic tagging and named entity recognition. While services like Open Calais seem great for contemporary content and news, they fall short in dealing with academic information or literature. Fuzzy as it is, I think this kind of automatic marking and tagging has quite a bit of value in speeding up the cataloging process and in standardizing tags and ontologies. Coupling auto-tagging and entity recognition with full-text indexing can be very powerful for enhancing the overall user experience, and I want to put that squarely in the "let's do more of that" category. I didn't see a particular presentation that stood out on this topic, but it came up often enough to be worth mentioning.
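To sketch what I mean by that coupling (my own toy illustration, not from any session): index the raw full text of each document alongside the normalized entity names an auto-tagger returns, so a search on a canonical name still hits documents that only use a variant. The document and entity names here are made up.

from collections import defaultdict

# term -> set of document ids
index = defaultdict(set)

def index_document(doc_id, text, entity_names):
    # Index the raw words from the full text...
    for word in text.lower().split():
        index[word].add(doc_id)
    # ...and the normalized entity names alongside them, so a search
    # for the canonical "mac os x" finds a document that only says "OS X".
    for name in entity_names:
        index[name.lower()].add(doc_id)

index_document(1, "Apple unveiled OS X Lion on Monday.", ["Mac OS X", "Apple Inc."])
print sorted(index['mac os x'])  # -> [1]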

As I mentioned in a previous post, the folks over at O'Reilly did a great presentation on the evolution of their own RDF and Linked Data strategies. In particular they highlighted that this is an evolution for everyone, and that just plain doing something trumps thinking about it any day. They also highlighted how powerful working with triple stores (and N-triple stores) can be in joining distributed data and systems, and made a great case for why RESTful services make life better for everyone.

Bernadette Hyland gave a great Linked Data Cookbook presentation covering what goes into producing a sound Linked Data strategy. Throughout the conference she was also a strong advocate for a community-driven effort around semantic practices and the benefits we can all derive from it.

Zepheira and MIT gave an interesting presentation on Exhibit 3, a pure-JavaScript update of Exhibit, a framework for the display of rich data. Zepheira also demonstrated Recollection, a Django-based open source offering on GitHub for easily creating custom interactive displays for data collections on the web. I think it's going down the wrong path to think of any software as a solution, but viewed as tools in a bigger toolbox, these have a lot of promise for curators frustrated by not being able to get their data out there. The use of Django/Python in particular lowers the bar to deployment and positions Recollection well to be enhanced, extended and grown.

There were other highlights, and I will try to link or at least tweet links to the slides as they come online.

I appreciated meeting and talking with everyone at the conference and look forward to exploring this more.

Dead Simple Python Calls to Open Calais API

I was amazed at how easy Open Calais makes it for anyone to call its API via REST and get back suggested tags and entity recognition for any text. The native Python libraries urllib(2) and httplib provide effective ways to connect and make simple REST calls to the Calais Web Services API, but the httplib2 library makes it easier still.

Start off by installing httplib2 via pip

pip install httplib2

From there you just need to get an API key at the Calais site, set some headers, define a bit of text you want to pass to the API for tagging and entity recognition and then reap the benefit.

You can see this in the simple code snippet below…

import httplib2
import json

# Some local values needed for the call
LOCAL_API_KEY = 'PUT_YOUR_KEY_HERE' # Acquire this by registering at the Calais site
CALAIS_TAG_API = 'http://api.opencalais.com/tag/rs/enrich'

# Some sample text from a news story to pass to Calais for analysis
test_body = """
Some huge announcements were made at Apple's Worldwide Developer's Conference Monday, including the new mobile operating system iOS 5, PC software OS X Lion, and the unveiling of the iCloud.
"""

# Header information needed by Calais.
# For more info see http://www.opencalais.com/documentation/calais-web-service-api/api-invocation/rest
headers = {
    'x-calais-licenseID': LOCAL_API_KEY,
    'content-type': 'text/raw',
    'accept': 'application/json',
}

# Create your http object
http = httplib2.Http()
# Make the http post request, passing the body and headers as needed.
response, content = http.request(CALAIS_TAG_API, 'POST', headers=headers, body=test_body)

jcontent = json.loads(content) # Parse the json return into a python dict
print json.dumps(jcontent, indent=4) # Pretty print the resulting dictionary returned.

The server parses the body sent as part of the HTTP request and returns the results; in this example it returns a JSON string because that is the format I requested in the 'accept' header. The API accepts a number of input formats and can return a number of output formats as well; see the API documentation for more information.
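For instance, switching the return format should just be a header change. I've only tested the JSON path above, so treat the exact 'text/xml' value here as an assumption and check the API documentation linked earlier for the formats the service actually supports.

headers['accept'] = 'text/xml'  # assumed value; see the API docs for supported formats
response, content = http.request(CALAIS_TAG_API, 'POST', headers=headers, body=test_body)
# content should now be an XML string rather than JSON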

The takeaway here is how simple it is to make a call to the Calais API, and it wouldn't take much more to expand this into something useful in your own Python application.

For reference and completeness, here is the output of the code above. Enjoy.

{
    "doc": {
        "info": {
            "docId": "http://d.opencalais.com/dochash-1/f48cc5c8-84a2-3440-844e-bfc29c3ba8e4",
            "docDate": "2011-06-08 14:14:46.594",
            "docTitle": "",
            "document": "Some huge announcements were made at Apple's Worldwide Developer's Conference Monday, including the new mobile operating system  iOS 5 , PC software OS X Lion, and the unveiling of the iCloud.",
            "calaisRequestID": "4306b4a8-cc7f-7c04-1307-076b7d5f8d35",
            "id": "http://id.opencalais.com/lhr1fFqpi2tMJjQZ3NyIJA"
        },
        "meta": {
            "submitterCode": "8fba6b3e-fef5-76ec-d7dc-ec60686110a4",
            "contentType": "text/html",
            "language": "English",
            "emVer": "7.1.1103.5",
            "messages": [],
            "processingVer": "CalaisJob01",
            "submitionDate": "2011-06-08 14:14:46.485",
            "signature": "digestalg-1|N0M3Ia9fmkexMBwN7kSL4thKM4g=|f8uTykbIPicGbu6y0962n658qv1PwewuM5jh5Gs0hJ79dC+vpurpmA==",
            "langIdVer": "DefaultLangId"
        }
    },
    "http://d.opencalais.com/genericHasher-1/e0feb730-9fc5-3365-9862-a384af633ecc": {
        "_typeReference": "http://s.opencalais.com/1/type/em/e/OperatingSystem",
        "_type": "OperatingSystem",
        "name": "Mac OS X",
        "_typeGroup": "entities",
        "instances": [
            {
                "suffix": " Lion, and the unveiling of the",
                "prefix": " new mobile operating system  iOS 5 , PC software ",
                "detection": "[ new mobile operating system  iOS 5 , PC software ]OS X[ Lion, and the unveiling of the]",
                "length": 4,
                "offset": 159,
                "exact": "OS X"
            }
        ],
        "relevance": 0.714
    },
    "http://d.opencalais.com/dochash-1/f48cc5c8-84a2-3440-844e-bfc29c3ba8e4/cat/1": {
        "category": "http://d.opencalais.com/cat/Calais/TechnologyInternet",
        "score": 1,
        "classifierName": "Calais",
        "categoryName": "Technology_Internet",
        "_typeGroup": "topics"
    },
    "http://d.opencalais.com/genericHasher-1/05ddd98f-097f-3701-a737-6fd2555f411c": {
        "_typeReference": "http://s.opencalais.com/1/type/em/e/Technology",
        "_type": "Technology",
        "name": "operating system",
        "_typeGroup": "entities",
        "instances": [
            {
                "suffix": "  iOS 5 , PC software OS X Lion, and the",
                "prefix": "Conference Monday, including the new mobile ",
                "detection": "[Conference Monday, including the new mobile ]operating system[  iOS 5 , PC software OS X Lion, and the]",
                "length": 16,
                "offset": 121,
                "exact": "operating system"
            }
        ],
        "relevance": 0.714
    }
}
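Since I'm claiming it wouldn't take much to build on this, here's a minimal sketch of one next step: walking the parsed response and pulling out just the named entities. The extract_entities helper is my own illustrative code, not part of any Calais library; it relies only on the response structure shown above, where entity records carry a "_typeGroup" of "entities" along with "_type", "name" and "relevance".

def extract_entities(calais_response):
    # Entity records in the response are keyed by hash URIs; pick out
    # the ones whose _typeGroup marks them as named entities.
    entities = []
    for key, value in calais_response.items():
        if isinstance(value, dict) and value.get('_typeGroup') == 'entities':
            entities.append((value.get('_type'), value.get('name'), value.get('relevance')))
    return entities

# Using the jcontent dict from the snippet above:
for etype, name, relevance in extract_entities(jcontent):
    print etype, name, relevance  # e.g. OperatingSystem Mac OS X 0.714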

SemTech 2011 – O'Reilly on RDF in eBooks

Instead of a flood of tweets I thought I'd go a bit old school and do some live blogging from the SemTech 2011 session Discovering and Using RDF for Books at O'Reilly Media this morning. My own interest in this session is how we might apply this to texts coming from our local repository, in particular our Yellowbacks Project, which we hope to enhance soon. We also have a body of texts sitting on our servers in TEI format, and we haven't landed on a way to comfortably leverage them in our infrastructure. My own comments here appear in parentheses (like so).

O'Reilly took their first stab at modeling information about their books in straight XML, in a bit of a "tag soup" approach. This proved far too heavyweight for them, and they ended up delivering products late because of the time it took to modify and extend their XML approach. They then moved to ONIX as an internal format, but it was old, and writing XPath against it was a bit nightmarish because of the standards drift involved, among other reasons. In the end it was just not extensible and not friendly toward being agile. That led them to take a stab at creating their own schema, which also proved too heavyweight and slow. At last they washed up on the shores of Dublin Core, specifically DC Terms, and this introduced them to the world of RDF.

The extensibility of RDF, starting with DC, seemed pretty cool and useful to them, and they kept adding FOAF, BIBLIO and more. However useful this was for the company, the problem at the end of the day was that they were still thinking in XML terms. (Implying, I suppose, that they should have been thinking in terms of RDF and triples instead.)

Early shopping cart systems pushed too much data back and forth, and keeping the ordering and purchasing systems in sync was difficult and failed too often. They moved to RESTful services between systems pushing triples, and this improved the service dramatically as well as transforming their business approach to ePubs, allowing for a "user library" approach. Consolidating the business logic and moving to a central metadata store improved the quality of the service significantly. This got them close enough to the structure needed for stable URIs and web services that they could sustain and improve it for some time.

The emergence of cloud services changed the landscape for them when service requests became too much for the old system. They were able to transition to the cloud because their stable URI structure allowed easy porting of the service without changing references to the resources. A major failing of the system was still that they were trying to parse RDF with XPath and XQuery; they hadn't yet taken the full RDF triplestore approach.

Moving their queries to SPARQL significantly improved services and made the system much more extensible by leveraging systems like Jena. Underlying it all were N-triples, and this put them in a position to adapt very quickly to expanding needs and to evolve their services. The ease of querying and the performance improvements led to another problem: they could too easily push way too much information at the user, and contextualizing that information in a more meaningful way became a business priority.

An interesting note on training staff to use SPARQL: people completely new to query languages picked it up faster than those who already knew SQL. "How do I do a JOIN?" was too common a point of confusion with the latter crowd.
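For anyone who hasn't seen it, here's a rough sketch of what a SPARQL query looks like from Python using the SPARQLWrapper library. The endpoint and data are hypothetical (O'Reilly's internal Jena service obviously isn't public), and this isn't their code, just an illustration of why the JOIN question comes up: where SQL joins tables on keys, SPARQL simply reuses a variable across triple patterns.

from SPARQLWrapper import SPARQLWrapper, JSON

# Hypothetical endpoint, for illustration only.
sparql = SPARQLWrapper('http://example.org/sparql')
# Where SQL would JOIN a books table to an authors table, SPARQL just
# reuses the ?book variable across two triple patterns.
sparql.setQuery("""
    PREFIX dc: <http://purl.org/dc/terms/>
    SELECT ?title ?creator
    WHERE {
        ?book dc:title ?title .
        ?book dc:creator ?creator .
    }
    LIMIT 10
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for row in results['results']['bindings']:
    print row['title']['value'], '-', row['creator']['value']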

On the tools used to manage their system: they took RDF graphs, transformed them with pymantic into Python object graphs, and used both on the user-facing presentation end and on the server side. They had the predictable multi-threading problems with Python but avoided them with either brute force or by managing their information as flat documents.

The culmination of this approach was the ability to adapt very quickly to business needs; they cited the Microsoft Press project as an example of significantly expanding their services in as little as 8 weeks.

They're using Jena primarily as their storage and server of triples. They noted that quads didn't perform as well as triples in their case.

My own comments follow.

Overall, a great demo. My head is swimming a bit because I'm trying to reconcile how we manage information and objects in a repository with the success cases I'm hearing about. I believe it's likely we need to move in the direction of managing the objects and editing the metadata through Fedora, while serving and linking the information by producing RDF from that, with a management system to link all the triples produced from that RDF. This isn't really all that big a step, but it's something I need to sit down and talk over with everyone.

Some Antics for the Week

I'm off this week to the SemTech 2011 conference in San Francisco, so content may be a bit light. I hope to have some interesting things to say when I come back, once I clear away the fog of depression from being unhappy with my information architecture and service architecture, and no doubt feeling like my content is worthless because it isn't backed by OWL.

Sigh.

On a side note, I could use a bit of a break from "I've invented the intelligent web" quotes from just about every CEO or manager I meet here. I wish them luck of course, and I love to see competition in the field, but this is a tech conference (mostly), not really a marketing one, so people are going to need more information than that.

On another note, meeting up with some folks from Yale, the Mayo Clinic and the Library of Congress has proven very interesting. They're doing some great stuff that I think has some application for us. I'm looking forward to getting back.