New Rise of the Apes European Trailer

There's a new trailer out for the Planet of the Apes reboot called "Rise of the Apes" and it really looks fantastic. It's amazing how much story they can convey in a simple trailer, and I have hopes the movie has a lot of depth to it. There seems to be a lot of potential for commentary about the nature of intelligence and self, and about respect for life and compassion, or what a lack of them brings. Seeing the "acting" of Caesar with the John Lithgow character, who seems to have Alzheimer's, brought a tear to my eye as the chimp expresses compassion for the human who is obviously struggling. This could be the start of a great movie franchise; at least I hope it is.

Django Libraries for XML and eXist DB

We use XML heavily at the Academic Libraries and decided to ease the job of connecting our XML- and repository-based work to the Django framework by building a central set of libraries. We'll be continuing to build these libraries out, and we recently released the code as open source projects on GitHub.

EULxml provides XPath parsing in Python, mappings from XML documents to Pythonic objects, and support for exposing simple XML objects as Django Forms. The code is available on GitHub, with documentation and examples up on Read the Docs.
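To give a sense of what the XML-to-Python mapping looks like in practice, here's a minimal sketch using the xmlmap module. The field classes and loader below follow the library's conventions as I recall them, so treat the exact names and signatures as illustrative rather than authoritative.

from eulxml import xmlmap

class Book(xmlmap.XmlObject):
    # XPath expressions are evaluated relative to the document root
    title = xmlmap.StringField('title')
    authors = xmlmap.StringListField('author')

xml = '<book><title>Example</title><author>A. Writer</author></book>'
book = xmlmap.load_xmlobject_from_string(xml, Book)
print book.title     # 'Example'
print book.authors   # ['A. Writer']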

EULexistdb provides connections and XQuery capability for eXist DB, along with Django QuerySet-like objects for rich interaction between Django and XML data stored in eXist DB. Combined with the XML Django forms from EULxml (on which it depends), it has enabled us to do a lot with our library collections. This library is also available on GitHub, with documentation and examples on Read the Docs.
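As a rough sketch of that QuerySet-style access, something like the following is the shape of it; the XmlModel and Manager names here are from memory of the library's documentation, so treat them as assumptions rather than a verified API reference.

from eulxml import xmlmap
from eulexistdb.manager import Manager
from eulexistdb.models import XmlModel

class Book(XmlModel):
    ROOT_NAME = 'book'
    title = xmlmap.StringField('title')

    # The manager takes an XPath/XQuery expression identifying the nodes
    # to query in the configured eXist collection.
    objects = Manager('//book')

# Django-QuerySet-like filtering and iteration against eXist DB
for book in Book.objects.filter(title__contains='History'):
    print book.title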

We're excited about the possibilities of leveraging the power of Django with our XML databases and repositories. We're open sourcing these libraries in hopes that others may find them useful and may want to contribute as well.

They Shoot URLs Don’t They?

I've had a rather lengthy and interesting blogging life these last nine years, though I stopped over the last few for a number of reasons. A backlog of 2400+ posts, however, has given me a rather interesting dataset for testing URL persistence, and as I go over old posts I've found an example of URL persistence that seems very backwards to me.

I used to run a funny little site for Gamespy called Paragon City Hall, just a community-based site for an as-then-unreleased game called City of Heroes, and I posted about that site back in 2002.

In another post around the same time I referenced my depression over a news story from Reuters that made me want to kill myself. Overdramatic, yes, but hey, it was 9 years ago, so give me a break.

The amazing thing to me is that the Reuters story results in a dead link: nothing, no forward, no search suggestion, nothing. The Paragon City Hall link, however, STILL WORKS, even though I shut the site down 8 years ago and it's unlinked by the network.

What kind of world do we live in when RPGPlanet has better URL persistence than Reuters?

Although I came to this realization later in my career than I should have, Content on the web is a Social Contract. Tim Berners-Lee made this case quite eloquently in an article I cite quite frequently, and while I might not expect Gamespy to understand it, of all agencies Reuters should get it. They should have gotten it perhaps before TBL even posted anything about it in 1998.

A particular fear creeps over me when an agency like Reuters is letting links expire like that, and it offends me as an adopted digital librarian. (They found me floating down a bitstream in a wicker basket and took me in as their own.)

Let's hope these agencies are better about it 10 years from now than they were 10 years ago, but honestly I'm not that confident even today. Reuters seems better judging from their current URL structure, though that hashed ID in the slug makes me a bit concerned.

I think I might sit down this weekend and write a quick script to see how many of the links referenced in my blog over the last 9 years are still resolvable. Could be interesting.
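If I do, a minimal sketch of it might look like the following; the URLs here are placeholders standing in for whatever gets scraped out of the old posts.

import urllib2

# Assume the links have already been extracted from the blog posts into a list.
urls = [
    'http://www.reuters.com/some/old/story',    # placeholder
    'http://www.example.com/another/old/link',  # placeholder
]

for url in urls:
    try:
        response = urllib2.urlopen(url, timeout=10)
        print '%s -> %s' % (url, response.getcode())
    except Exception, e:
        print '%s -> DEAD (%s)' % (url, e)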

Django AuthenticationForm For User Login

Django already makes it insanely easy to log a user in and out via its generic views. Engineers often want to create their own login view to provide some flexibility, say an Ajax login or some other spin on the standard login. Django gives a number of examples for that as well, and as with most of the framework, it's a snap too. A convenient feature that doesn't make it into many of the examples I've seen is the AuthenticationForm, which provides a Django form with the associated logic to render a login form, validate the input, throw errors if the user does something like forget to supply a password, and perform the basic authentication check.

The form provides all of that for you; all you really need to do in your view is read the user-submitted data, validate the form, and take the final step of logging the user in.

This is just one form in a group of seven or so that provide all kinds of convenience features like password changes and user registration. Not only do they give a developer very easy access to common functions, but they can be extended or subclassed like any other Python class to add or override functionality.
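As a quick illustration of that, here's a hedged sketch of subclassing AuthenticationForm to add an extra field; wiring the new field up to session expiry in the view is left out for brevity.

from django import forms
from django.contrib.auth.forms import AuthenticationForm

class RememberMeAuthenticationForm(AuthenticationForm):
    """Standard login form plus an optional 'remember me' checkbox."""
    remember_me = forms.BooleanField(required=False, initial=False,
                                     label='Keep me logged in')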

Here's an example of a simple view method using the AuthenticationForm. Something of a 'gotcha' for developers who normally use Django forms is that the POST values are passed as the second argument to the form. The request object can be passed as the first argument, but that is normally only done to check for authentication cookies. See the source for more info on the form.

from django.contrib.auth import login
from django.contrib.auth.forms import AuthenticationForm
from django.shortcuts import render
from django.http import HttpResponseRedirect
from django.core.urlresolvers import reverse

def authenticate_user(request):
    """Logs a user into the application."""

    if request.user.is_authenticated():
        return HttpResponseRedirect(reverse('account:index'))

    # Initialize the form either fresh (unbound) or bound to the submitted POST data.
    auth_form = AuthenticationForm(None, request.POST or None)

    # Ye Olde 'next' param so common in login views; default to their profile view.
    # Check POST first so the hidden 'next' field in the form round-trips, then fall back to GET.
    nextpage = request.POST.get('next', request.GET.get('next', reverse('account:index')))

    # The form itself handles authentication and checks that things like a username and password were supplied.
    if auth_form.is_valid():
        login(request, auth_form.get_user())
        return HttpResponseRedirect(nextpage)

    return render(request, 'account/login.xhtml', {
        'auth_form': auth_form,
        'title': 'User Login',
        'next': nextpage,
    })

The associated template code for this renders the form and any errors as needed. Note that the form may have individual field errors in the case of a blank username or password, and raises a ValidationError if the credentials provided were invalid; that error is displayed to the user via the 'non_field_errors' attribute if present.

{% if auth_form.non_field_errors %}
    {% block message %}
        <div id="error_msg">
            {{ auth_form.non_field_errors }}
        </div>
    {% endblock %}
{% endif %}


{% block content-body %}
    <form action="{% url account:login-form %}" method="POST">
        {% csrf_token %}
        {% for field in auth_form %}
            <div class="fieldWrapper">
                {{ field.errors }}
                {{ field.label_tag }} {{ field }}
            </div>
        {% endfor %}
        <input type="submit" value="Login" />
        <input type="hidden" name="next" value="{{ next }}" />
    </form>


{% endblock %}

Django Template Tag for Gravatar Images

Gravatar images seem to be growing in popularity across a number of sites, and the service already makes it incredibly simple to grab a profile picture via URL. The Gravatar site itself has a number of examples of how to grab an image off the service, as well as more detailed examples of retrieving more information.

They provide examples for grabbing an image via Python, and even a Django example that renders the image as a template node. For displays like this I generally prefer an inclusion tag, since I can render the image in a template rather than having to build the markup each time on my own.

The template tag itself is just:

from django import template
import urllib, hashlib

from django.conf import settings

# Provide Default settings so users only need to provide them in settings.py if they want to override.
GRAVATAR_BASEURL = getattr(settings, "GRAVATAR_BASEURL", "http://www.gravatar.com/avatar/")
GRAVATAR_DEFAULT_IMAGE = getattr(settings, "GRAVATAR_DEFAULT_IMAGE", "")
GRAVATAR_SIZE = getattr(settings, "GRAVATAR_SIZE", 40)

register = template.Library()

def gravatar_url(email, size):
    """
    Builds a Gravatar Image URL based on the provided email.

    :param email: Email address to query for a gravatar image.
    :param size:  Size to request and render the image in pixels.
    """

    attrs = {
        'd': GRAVATAR_DEFAULT_IMAGE,
        's': size
    }

    # Build the image URL from the base URL, the email hash and the query parameters.
    url = "%s%s?%s" % (GRAVATAR_BASEURL,
                       hashlib.md5(email.lower()).hexdigest(),
                       urllib.urlencode(attrs))

    return {'gravatar': {'url': url, 'size': size}}

@register.inclusion_tag('account/snippets/gravatar.xhtml')
def gravatar_for_email(email, size=GRAVATAR_SIZE):
    """
    Renders a gravatar image for user with the specified email via a template.

    {% gravatar_for_email "user@email.com" 40 %}

    :param email:  String representing the users email.
    :param size: Size of gravatar to use in pixels.  OPTIONAL
    
    """
    email = "%s" % email
    size = int(size)
    return gravatar_url(email, size)

This approach also has the advantage of being extensible: it's easy enough to build additional template tags such as 'gravatar_for_user' that take a user instance instead, pull the user's email, and call the `gravatar_url` method.
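Here's a quick sketch of what that might look like; it assumes the user object exposes an email attribute the way Django's auth User does.

@register.inclusion_tag('account/snippets/gravatar.xhtml')
def gravatar_for_user(user, size=GRAVATAR_SIZE):
    """
    Renders a gravatar image for the given user instance.

    {% gravatar_for_user request.user 40 %}
    """
    # Assumes user.email exists, as it does on django.contrib.auth User objects.
    return gravatar_url("%s" % user.email, int(size))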

Lastly, you only need to provide a template to render the returned context. In this example the tag uses the template 'account/snippets/gravatar.xhtml', which is simply:

{% if gravatar.url %}
    <img src="{{ gravatar.url }}" height="{{ gravatar.size }}" width="{{ gravatar.size }}" alt="Gravatar image" class="gravatar" />
{% endif %}

You can store the template with your other templates, but I suggest keeping it in a templates directory within the app where you create the template tag. If your settings.TEMPLATE_LOADERS includes the app directories loader ('django.template.loaders.app_directories.Loader' in current versions of Django), templates directories inside installed apps are picked up by default.

Gaming Wiki Back Online

I brought my Gaming Wiki back online on the site here after several months of being down. I apologize for that; I don't have any better excuse than not really taking the time to do it. I had some difficulties with my previous web host, and all I was ever able to get was a *.tar.gz download of the wiki database; the service would time out every time I tried to download a gzipped directory of the MediaWiki install itself. I kept hoping I'd find a backup copy on a CD somewhere, but no joy.

So it languished in 404 hell for a bit while I came to terms with the fact that I'd have to take the DB dump from an unknown older version of MediaWiki and try to upgrade it to work with a modern download.  

All in all I have to give it to the MediaWiki folks: I was essentially able to just run the update scripts and only had to make a few settings changes that took a bit of looking up. More or less the whole thing came back up.

Because I couldn't get a backup of the files stored in the wiki, though, some of the file links and thumbnails won't work until I come up with a plan to rebuild them.

Thanks to everyone for your patience.

SemTech 2011 Redux

The SemTech 2011 conference delivered a lot to attendees, and as it draws to a close I thought I'd jot down a few thoughts and note some highlights.

By and large I have to say that the technology has definitely arrived and we're capable of some exciting advances in linking data and having the web begin to fulfill some of its promise as a real knowledge base. I just hope Skynet appreciates all the work we're doing on its behalf when it finally becomes self-aware.

What had the structured data crowd buzzing the most was last week's announcement of Microdata format support by Google, Microsoft and Yahoo at schema.org. Annoyance aside at Microsoft trying to put up schema.org as if it were some small independent standards board, the Microdata format seems just fine to me. Essentially a competitor to RDFa, it targets easy markup of information in a web page and is a bit leaner and easier than the current RDFa 1.0 standard. The crowd here being a bit biased toward RDFa, there wasn't a lot of positive talk about schema.org, but I find I can't really care too much one way or the other. What we need to develop is a community of practice, and the technology should be secondary to that as long as it's not a barrier. To me Microdata and RDFa are both fine standards, and the only real argument I would make for preferring one over the other is that schema.org's aim is to mark up information for better searching, while RDFa is aimed at marking up knowledge. It may seem a subtle difference, but misaligned motivations like this can be the cause of big directional shifts, and I think it's important for us as a community not to lose sight of our aims here. We're not trying to create a more searchable web, we're trying to make a more sensible web.

This makes a nice segue into what I think was one of the better presentations at the conference, Ivan Herman's presentation on RDFa 1.1. The specification should be released fairly soon, and 1.1 greatly simplifies the already simple standard. In particular, when stacked up against the stuff at schema.org, I think RDFa 1.1 is just as simple and easy to use, and I feel more comfortable with its extensibility over microdata formats. I am particularly interested in applying RDFa and TEI to some of our local collections; if it works out like I hope, it could be a fantastic step forward in presenting our TEI-tagged collections.

On a side note I have to say Mr. Herman gets my vote for presenter I most want to be my instructor at Hogwarts.  The guy is just an awesome presenter and tremendously knowledgeable.  We need more advocates like him out there.

Much ado was made across several presentations about OWL and SKOS, as well as automatic tagging and named entity recognition. While services like Open Calais seem great for contemporary content and news, they fall short in dealing with academic information or literature. Fuzzy as it is, I think this kind of automatic marking and tagging does have quite a bit of value in speeding up the cataloging process and in standardizing tagging and ontologies. Coupling auto-tagging and entity recognition with full-text indexing can be very powerful for enhancing the overall user experience, and I just want to put that into the "let's do more of that" category. I didn't see a particular presentation that stood out on this topic, but it came up enough to be worth mentioning.

As I mentioned in a previous post, the folks over at O'Reilly did a great presentation on the evolution of their own RDF and Linked Data strategies. In particular they highlighted that this is an evolution for everyone and that just plain doing something trumps merely thinking about it. They also highlighted how powerful working with triplestores (and N-stores) can be in joining distributed data and systems, and made a great case for why RESTful services make life better for everyone.

Bernadette Hyland had a great Linked Data Cookbook presentation covering what goes into the mix of producing a great Linked Data Strategy.  Throughout the conference she was also a great advocate for the need for a community driven effort for semantic practices and the benefits we can all derive from it.  

Zepheira and MIT gave an interesting presentation on Exhibit 3, a pure-JavaScript update of Exhibit, a framework for the display of rich data. Zepheira also demonstrated Recollection, a Django-based open source offering on GitHub for easily creating custom interactive displays for data collections on the web. I think it's going down the wrong path to think of any software as a solution, but viewed as tools in a bigger toolbox, these have a lot of promise for curators frustrated by not being able to get their data out there. The use of Django/Python in particular keeps the bar to deployment low and positions them well to be enhanced, extended and grown.

There were other highlights, and I will try to link, or at least tweet links to, the slides as they come online.

I appreciated meeting and talking with everyone at the conference and look forward to exploring this more.

San Fran Pic-So

The conference ended by mid-day here, so I decided to take an open-top bus tour around San Francisco. Great experience, and you could get on and off all day, so I got to see more of the city in 4 hours than I have on most of my previous trips. I posted pics to Picasa Web and am linking them below.

Dead Simple Python Calls to Open Calais API

I was amazed at how easy Open Calais makes it for anyone to make calls to its API via REST and get back suggested tags and entity recognition for any text. The native Python libraries urllib(2) and httplib provide effective ways to connect and make simple REST calls to the Calais Web Services API, but the httplib2 library makes it easier still.

Start off by installing httplib2 via pip:

pip install httplib2

From there you just need to get an API key at the Calais site, set some headers, define a bit of text you want to pass to the API for tagging and entity recognition and then reap the benefit.

You can see this in the simple code snippet below…

import httplib2
import json

# Some local values needed for the call
LOCAL_API_KEY = 'PUT_YOUR_KEY_HERE' # Acquire this by registering at the Calais site
CALAIS_TAG_API = 'http://api.opencalais.com/tag/rs/enrich'

# Some sample text from a news story to pass to Calais for analysis
test_body = """
Some huge announcements were made at Apple's Worldwide Developer's Conference Monday, including the new mobile operating system iOS 5, PC software OS X Lion, and the unveiling of the iCloud.
"""

# Header information needed by Calais.
# For more info see http://www.opencalais.com/documentation/calais-web-service-api/api-invocation/rest
headers = {
    'x-calais-licenseID': LOCAL_API_KEY,
    'content-type': 'text/raw',
    'accept': 'application/json',
}

# Create your http object
http = httplib2.Http()
# Make the http post request, passing the body and headers as needed.
response, content = http.request(CALAIS_TAG_API, 'POST', headers=headers, body=test_body)

jcontent = json.loads(content) # Parse the json return into a python dict
print json.dumps(jcontent, indent=4) # Pretty print the resulting dictionary returned.

The server parses the body sent as part of the HTTP request and, in this example, returns a JSON string with the results because that is the format I requested in the 'accept' header. The API accepts a number of input formats and can return a number of formats as well; see the API documentation for more information.

The takeaway here is how simple it is to make a call to the Calais API; it wouldn't take much more to expand this into something useful in your own Python application.
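As a small example of that, the snippet below pulls just the detected entities and topic categories out of the parsed response; the key names come straight from the sample output that follows, so take them as illustrative of the structure rather than a complete treatment of the Calais schema.

# Walk the parsed response and print the recognized entities and topics.
for key, value in jcontent.items():
    if not isinstance(value, dict):
        continue
    if value.get('_typeGroup') == 'entities':
        print '%s: %s (relevance %s)' % (value.get('_type'), value.get('name'), value.get('relevance'))
    elif value.get('_typeGroup') == 'topics':
        print 'Topic: %s' % value.get('categoryName')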

For reference and completeness, here is the output of the code above. Enjoy.

    "doc": {
        "info": {
            "docId": "http://d.opencalais.com/dochash-1/f48cc5c8-84a2-3440-844e-bfc29c3ba8e4",
            "docDate": "2011-06-08 14:14:46.594",
            "docTitle": "",
            "document": "Some huge announcements were made at Apple's Worldwide Developer's Conference Monday, including the new mobile operating system  iOS 5 , PC software OS X Lion, and the unveiling of the iCloud.",
            "calaisRequestID": "4306b4a8-cc7f-7c04-1307-076b7d5f8d35",
            "id": "http://id.opencalais.com/lhr1fFqpi2tMJjQZ3NyIJA"
        },
        "meta": {
            "submitterCode": "8fba6b3e-fef5-76ec-d7dc-ec60686110a4",
            "contentType": "text/html",
            "language": "English",
            "emVer": "7.1.1103.5",
            "messages": [],
            "processingVer": "CalaisJob01",
            "submitionDate": "2011-06-08 14:14:46.485",
            "signature": "digestalg-1|N0M3Ia9fmkexMBwN7kSL4thKM4g=|f8uTykbIPicGbu6y0962n658qv1PwewuM5jh5Gs0hJ79dC+vpurpmA==",
            "langIdVer": "DefaultLangId"
        }
    },
    "http://d.opencalais.com/genericHasher-1/e0feb730-9fc5-3365-9862-a384af633ecc": {
        "_typeReference": "http://s.opencalais.com/1/type/em/e/OperatingSystem",
        "_type": "OperatingSystem",
        "name": "Mac OS X",
        "_typeGroup": "entities",
        "instances": [
            {
                "suffix": " Lion, and the unveiling of the",
                "prefix": " new mobile operating system  iOS 5 , PC software ",
                "detection": "[ new mobile operating system  iOS 5 , PC software ]OS X[ Lion, and the unveiling of the]",
                "length": 4,
                "offset": 159,
                "exact": "OS X"
            }
        ],
        "relevance": 0.714
    },
    "http://d.opencalais.com/dochash-1/f48cc5c8-84a2-3440-844e-bfc29c3ba8e4/cat/1": {
        "category": "http://d.opencalais.com/cat/Calais/TechnologyInternet",
        "score": 1,
        "classifierName": "Calais",
        "categoryName": "Technology_Internet",
        "_typeGroup": "topics"
    },
    "http://d.opencalais.com/genericHasher-1/05ddd98f-097f-3701-a737-6fd2555f411c": {
        "_typeReference": "http://s.opencalais.com/1/type/em/e/Technology",
        "_type": "Technology",
        "name": "operating system",
        "_typeGroup": "entities",
        "instances": [
            {
                "suffix": "  iOS 5 , PC software OS X Lion, and the",
                "prefix": "Conference Monday, including the new mobile ",
                "detection": "[Conference Monday, including the new mobile ]operating system[  iOS 5 , PC software OS X Lion, and the]",
                "length": 16,
                "offset": 121,
                "exact": "operating system"
            }
        ],
        "relevance": 0.714
    }
}

SemTech 2011 – O'Reilly on RDF in eBooks

Instead of a flood of tweets I thought I'd go a bit old school and do some live blogging from the SemTech 2011 session Discovering and Using RDF for Books at O'Reilly Media this morning. My own interest in this session is how we might apply this to texts coming from our local repository, in particular to our Yellowbacks Project, which we hope to enhance soon. We also have a body of texts sitting on our servers in TEI format, and we haven't landed on a way to comfortably leverage them in our infrastructure. My own comments here appear in parentheses (like so).

O'Reilly took their first stab at modeling information about their books in straight XML with a bit of a "tag soup" approach. This proved way too heavyweight for them, and they ended up being late in delivering products because of the time it took to modify and extend their XML approach. They then moved on to ONIX as an internal format, but it was old, and writing XPath against it was a bit nightmarish because of the standards drift involved, among other reasons. In the end it was just not extensible and not friendly toward being agile. That led them to take a stab at creating their own schema, which also proved too heavyweight and slow. Eventually they washed up on the shores of Dublin Core, specifically DC Terms, and this introduced them to the world of RDF.

The extensibility of RDF, starting with DC, seemed pretty cool and useful to them, and they kept adding FOAF, BIBLIO and more. While this was more useful for the company, the problem at the end of the day was that they were still thinking in XML terms. (Implying, I suppose, that they should have been thinking in terms of RDF and triples instead.)

Early shopping cart systems pushed too much data back and forth, and keeping the ordering and purchasing systems in sync was difficult and failed too often. They moved to RESTful services between systems pushing triples, and this improved the service dramatically as well as transforming their business approach to ePubs, allowing for a "user library" model. Consolidating business logic and moving to a central metadata store improved the quality of the service significantly. This got them close enough to the structure needed for stable URIs and web services that they could sustain and improve it for some time.

The emergence of cloud services changed the landscape for them when service requests became too much for the old system. They were able to transition to cloud services because their stable URI structure allowed easy porting of the service without changing references to the resources. A major failing of their system was still that they were trying to parse RDF with XPath or XQuery; they still hadn't really taken the full RDF triplestore approach.

Moving their queries to SPARQL significantly improved services and made the system much more extensible by leveraging systems like Jena. Underlying it all were N-Triples, and this put them in a position to adapt very quickly to expanding needs and to evolve their services. The ease of querying and the performance improvements led to another problem: they could too easily push way too much information at the user, and contextualizing that information in a more meaningful way became a business priority.

An interesting note on training staff to use SPARQL was that people completely new to query languages picked it up faster than those who already knew SQL. "How do I do a JOIN?" was too common a point of confusion with the latter crowd.

On the tools used to manage their system: they took RDF graphs, transformed them with pymantic into Python object graphs, and used these both on the user presentation end and on the server side. They had the predictable problems with multi-threaded processing in Python but avoided them either with brute force or by managing their information as flat documents.

The culmination of this approach was the ability to adapt very quickly to business needs, citing the Microsoft Press project as an example of significantly expanding the services in as little as 8 weeks.

They're using Jena primarily for storing and serving triples. They noted that quads didn't perform as well as triples in their case.

My Own Comments follow

Overall, a great demo. My head is swimming a bit because I'm trying to reconcile how we manage information and objects in a repository with the success cases I'm hearing about. I believe it's likely we need to move in the direction of managing the objects and editing the metadata through Fedora, while serving and linking the information by producing RDF from that and coming up with a management system to link all the triples it produces. This isn't really all that big a step, but it's something I need to sit down and talk over with everyone.