Opportunities of benign-neglect

wednesday, february 24th, 2010 7:37am

Cathy Marshall of Microsoft Research gave a keynote at the wonderful code4lib 2010 conference that provided a useful nudge to my thinking about repository layers.

I've suggested elsewhere that university libraries contemplating a repository should consider developing policies around repository 'layers'. This notion involves both an inner long-term, high-guarantee archival layer -- and an outer services-oriented work-space layer. Reasons for the archival layer are obvious. Perhaps less so are reasons for and benefits of the work-space layer: it fulfills a library mission to further scholarly work; it strengthens the library's position as a central part of the academic campus community; it creates opportunities for valuable work to be moved easily into the archival layer.

Though my Library doesn't conceptualize our repository in this way, it's compelling enough that I think about this layered approach regularly. Given some exciting video initiatives at my University, much of my recent work-space layer thinking has focused on how to avoid having precious library disk space overwhelmed with lower-quality services-layer materials. Strategies I've considered to deal with this concern are combinations of limiting the size of an entity's (person/department) work-space, and/or limiting the number of years items may remain in the work-space. Given my strong belief in providing useful and friendly user-services, in this 'limiting' scenario, we would provide terrific charts and notifications that would allow work-space users to easily monitor their usage of this temporary, useful space -- and provide tools and Library staff assistance to easily move appropriate items into the archival layer.

But regardless of the intention to have this work-space be used productively, the more control we give users over their Library work-space, the more likely it is that a significant portion of it would fill up with materials that exist simply because it's more of a hassle to delete things than it is to neglect them -- one of Marshall's key points.

While Marshall specifically noted the problems of benign-neglect as a user-strategy for handling materials, she also noted that benign neglect offers opportunities. This was the nudge. I'm finding this notion of opportunities fascinating to reflect upon; it offers new realms for thinking about interesting services that could be built for this work-space layer.

The simple accretion of data from benign neglect suggests the now-common mining strategy associated with usage-data, popularized by Amazon: "you may also be interested in this". An acquaintance recently told me about 'mallet', software that can mine texts to discern topics. It would be a worthy experiment to use such a tool to offer repository users an optional discovery service based on their text-based work-space materials.
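As a minimal sketch of the "you may also be interested in this" idea: assume the work-space keeps a record of which items each user has viewed. The log structure, item names, and function name below are all hypothetical. Suggestions then come from counting which items most often share a user with the item in question.

```python
from collections import Counter


def also_interested(view_log, item, limit=3):
    """Suggest items that most often share a user with `item`.

    view_log maps a user id to the set of items that user viewed
    (a hypothetical structure, not taken from any real system).
    """
    co_counts = Counter()
    for items in view_log.values():
        if item in items:
            for other in items:
                if other != item:
                    co_counts[other] += 1
    # Sort by count (descending), then by name so ties are stable.
    ranked = sorted(co_counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return [name for name, _ in ranked[:limit]]


view_log = {
    "user-a": {"dataset-1", "paper-7", "video-3"},
    "user-b": {"dataset-1", "paper-7"},
    "user-c": {"dataset-1", "image-9"},
}
print(also_interested(view_log, "dataset-1"))
```

Plain co-occurrence counting is the crudest version of this strategy; it is shown here only because the data it needs accumulates as a side-effect of neglect.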

Two additions to Apple's iPhoto application in the last year or so suggest other possibilities. 'Faces' scans a user's iPhoto library, using pattern-recognition routines to create groupings of people. 'Places' scans the library and extracts geo-location coordinates if available, and, if I recall correctly, timestamp data, to create a map view over time of photo-locations.

Other scans could be run on work-space data, looking for patterns of government data-sets or citations. And combinations of embedded metadata such as geo-location and mime-type and date could be gathered, so that if, for example, a pattern of images taken at a certain location on a certain date was detected, not only could auto-grouping of those items be presented, but external sources such as flickr could be queried as well, offering the user the ability to see other external views of this 'event'.
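A rough sketch of the location-and-date grouping described above, assuming each work-space item has already had its embedded metadata extracted into a simple record. The field names and sample values are illustrative assumptions, and rounding coordinates is only a crude stand-in for real proximity clustering.

```python
from collections import defaultdict
from datetime import date


def group_by_place_and_day(items, precision=2):
    """Group image records taken at roughly the same place on the same day.

    Each record is a dict with 'name', 'mime', 'lat', 'lon', and 'taken'
    (a datetime.date); these field names are assumptions for this sketch.
    """
    groups = defaultdict(list)
    for item in items:
        if not item["mime"].startswith("image/"):
            continue  # only images are considered here
        if item.get("lat") is None or item.get("lon") is None:
            continue  # no geo-location embedded
        key = (
            round(item["lat"], precision),
            round(item["lon"], precision),
            item["taken"],
        )
        groups[key].append(item["name"])
    # A 'pattern' here simply means more than one item shares a key.
    return {key: names for key, names in groups.items() if len(names) > 1}


items = [
    {"name": "a.jpg", "mime": "image/jpeg", "lat": 41.8268, "lon": -71.4025, "taken": date(2010, 2, 20)},
    {"name": "b.jpg", "mime": "image/jpeg", "lat": 41.8271, "lon": -71.4031, "taken": date(2010, 2, 20)},
    {"name": "c.jpg", "mime": "image/jpeg", "lat": 41.8268, "lon": -71.4025, "taken": date(2010, 2, 21)},
    {"name": "d.pdf", "mime": "application/pdf", "lat": 41.8268, "lon": -71.4025, "taken": date(2010, 2, 20)},
]
print(group_by_place_and_day(items))
```

Each resulting key (rounded latitude, rounded longitude, date) could then serve as the basis of a query to an external source.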

Many of these scan/mining ideas would also be useful to apply to the repository as a whole. Such scans could offer both automated randomized general-discovery displays, as well as offer researchers additional focused discovery-views to permitted items. But to the extent that such services enhance the quality of users' work-space experience, it might help to keep the materials in the work-space more relevant: using benign-neglect to minimize benign-neglect.

Fedora / Shibboleth authorization solution

saturday, january 16th, 2010 8:06am

I don't work directly on Fedora (the repository software), but am very familiar with it due to my work with a programmer who does, and because I've worked on a django front-end for ingestion of items into fedora, as well as fedora-apis. My role in fedora work is more akin to the 'corner-man' in boxing. Together the boxer and I strategize about the opponent, his defenses and threats to our plans, and devise approaches to deal with evolving challenges. We cross ourselves, the fedora programmer goes into the ring, and between rounds I provide moral support, bandage wounds, and, because of my distance from actual battle, sometimes have useful ideas for the next round. This analogy's negativity toward the software is appropriate; to use another: We've bought a car that, in hindsight, I wouldn't recommend to others, but that we're committed to getting some terrific mileage out of.

So, it's been a tough fight, but our boxer is quick, has impressive endurance, and we believe we'll come out on top.

Fedora authorization is one round in which we think we've scored well.

Fedora comes bundled with an authorization piece called XACML. I don't know if it's due to xacml, or fedora's implementation of it, but from what I gather indirectly, it's terrifying enough that few use it, and it is, in fact, scheduled to be augmented in a future release with a new Great Hope: FeSL.

But if you want to go into production now, what to do? The dearth of published authentication/authorization 'live' solutions is why, as I understand it, so many fedora installations are either completely open (all objects public), or completely locked down for internal use.

We've assumed we would use some sort of wrapper around fedora, to authenticate against Shibboleth, with which our university is slowly moving forward. Shib's lack of logout capability, and the resultant assumption that users will happily quit their browser to logout, would seem quaintly amusing if it weren't true -- but that is another topic entirely, and single sign-on is certainly convenient. Not long ago we began to tackle how, specifically, to implement shib/fedora authorization.

Recently someone described to me an authorization approach the muradora folk took. I haven't looked at any documentation myself, but I was told that they wrote a servlet filter that takes a submitted name and password, and passes them to a non-centralized custom ldap server that exists only for the purpose of allowing fedora's built-in ldap-xacml code to handle authentication. (For those unfamiliar: a java servlet filter acts as a front layer of a java webapp; incoming requests and outgoing responses must pass through it, and it can inspect or modify them.)

A few of us heard this and had divergent reactions. It sounded like a hack, which caused some to dismiss the approach. Personally, being quite partial to hacks that work around monolithic software obstacles, I thought the hack smacked of ingenious creativity and was worth further examination. I was indulged; the result: our corner has devised an approach that initial testing indicates will work well.

First, some necessary background info...

  • Our University shib implementation is integrated with Grouper. I think grouper is, or at least historically has been, a separate project from shibboleth, but they work together brilliantly. Upon shib-login, a list of the groups to which the user belongs is accessible to the server via the shib 'is-member-of' header field.

  • Our implementation of fedora item ingestion involves creating a METS record that contains a bunch of item-info -- including a rights segment. The rights segment contains a series of entries, each one listing an identity (a shib is-member-of group) and a permission. Example (content, not format): identity='chemistry-department' & permission='view_item'

  • The mets record is handed to an ingester that converts the mets xml to FOXML, then fedora grabs the object (we're using the 'managed' option at the moment), and the java messaging built into fedora fires off a message to a listener that indexes (via Solr) parts of the foxml record, including the rights information.

So, our approach: create a fedora servlet filter that reads the shib groups/identities, then does a solr search to see if the object being requested has a 'view' permission for any of the identities in the request's shib is-member-of header. If so, the request is allowed through; if not, it is blocked. If no shib-identity is found, the servlet filter will only yield objects with 'public' view_item permissions.
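The decision the filter makes can be sketched in a few lines. The real filter is a java servlet filter; this Python version only illustrates the logic. It assumes the is-member-of header arrives as a ';'-separated string, and that the solr lookup has already returned the object's (identity, permission) pairs. It also assumes logged-in users may see 'public' objects. All of these are assumptions, as is the function name.

```python
def allowed_to_view(member_of_header, object_rights, separator=";"):
    """Decide whether a request may view an object.

    member_of_header: raw value of the shib 'is-member-of' header,
        or None when no shib identity is present.
    object_rights: (identity, permission) pairs for the requested object,
        standing in for what the solr search would return.
    separator: assumed delimiter between group names in the header.
    """
    # Everyone, logged in or not, is treated as holding the 'public' identity.
    identities = {"public"}
    if member_of_header:
        identities.update(
            part.strip() for part in member_of_header.split(separator) if part.strip()
        )
    return any(
        permission == "view_item" and identity in identities
        for identity, permission in object_rights
    )


rights = [("chemistry-department", "view_item")]
print(allowed_to_view("chemistry-department;staff", rights))
print(allowed_to_view(None, rights))
```

Keeping the decision in one small function like this is what would make finer-grained permissions a matter of a bit of extra programming rather than a redesign.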

The beauty of this is that fedora-access can be fully open to the internet while still allowing authorized access for those objects that require it. Further, this solution offers reasonable hope that it will survive fedora upgrades, since the servlet, though a part of the fedora webapp, is somewhat of a separate layer in front of the app. Further, by adding more granular permissions (at the moment permissions are at the object-level; they could be at the data-stream level) -- or simply by a bit of extra programming in the servlet-filter -- we could allow, say, the public to access low and medium-resolution images, but allow, say, faculty to access high-resolution images.

I'll keep this paragraph updated... Our intrepid programmer has figured out where to insert the custom servlet filter, has worked with our systems person to hook up an initial apache/tomcat connection so as to allow the shib installation on apache to pass its headers through to tomcat, and confirmed the filter's detection of the shib identity header information. A nice side-effect of installing shib on apache rather than tomcat directly is that we can allow programmatic access to port 8080.

(some technical info and some code: here)

The bell has rung; the next round begins. We cross our fingers, and the programmer heads into the ring once again.

the wave and the repository

wednesday, november 11th, 2009 8:54pm

I've been playing with Google Wave recently and am deeply impressed.

I would not be surprised if within a few years, a year for many, waves will largely replace emails. Not just for youth, whose primary forms of non-voice digital communication are sms-texts or facebook-posts, but also for those of us for whom email is currently an absolutely essential daily form of communication.

I believe it will be that significant.

For those not familiar with Google Wave, here are some links:

My mind-wheels have been spinning, envisioning what this new form of communication could impact.

Library digital repository

At the Access 2007 conference I saw an inspiring talk by Mark Leggott of the University of Prince Edward Island. He spoke about the Virtual Research Environment that his group had created, which successfully addressed a thorny issue: Libraries which had expended significant resources to build digital repository systems were having a terrible time getting campus entities to contribute content.

What was so compelling about Leggott's approach was his team's shift in perspective from expecting users to meet Library requirements -- to the Library meeting users' needs. I was still fairly new to the Library world at that time, but my sense was that institutions had built their digital repository to meet Library needs for thorough meta-data -- without much regard to user-experience or needs. The result: the new digital repository felt irrelevant to campus users. With onerous submission processes and requirements, little material was submitted. Leggott's team instead focused directly on useful services to key campus constituents such as faculty -- allowing them to more easily do the work they already did. As I recall, one simple example was that his group provided storage space for research data-sets -- but I believe the offer wasn't just for final data-sets, but data-sets under active development, revision, and analysis. The result: campus digital work worthy of inclusion into the repository was already within the UPEI Library system, which made final repository ingestion of appropriate material architecturally easy.

Layering

Now that I have one work-foot in the digital repository world, I've been wondering about an issue stemming from my recollection of the Leggott team's approach: how to design a compelling suite of storage and easy-to-use services, without the Library repository being filled with vacation and pet pictures?

My thought: the Library could develop clear, simple policies for layers of repository-usage. The outer layer would be more flexible, more transactional, and would require a lower level of metadata quality for submitted items -- tags would be fine. We already have a mission to support transactional, often non-archival work: supporting research. For items to be accepted into the inner, more archival layer, more and higher-quality metadata would be required, in exchange for the guarantee of permanence, multiple-channel data exposure, and format data-migration. The benefits of this layered approach: the Library can play an increasingly central role in the creative work of the campus, and ensure access to quality data from across campus that would flow into the repository.

From this layering idea arises the question of how to architecturally separate the layers. Brainstorming, I've imagined that campus users could be allotted x00 GB of outer-layer 'work' storage-space -- with more inner-layer repository-space just a click away for items deemed appropriate. Or the outer-layer work space could be limited by time-frame: all files in the outer-layer workspace could have, say, a two-year lifespan, with a nice status-report system so no one would lose work unexpectedly. That'd help encourage worthy materials to be migrated into the inner layer.
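To make the two limits concrete, here is a minimal sketch of the kind of status-report such a system might compute. The function name, the tuple layout, the 60-day warning window, and the sample files are all hypothetical; 730 days stands in for the two-year lifespan.

```python
from datetime import date, timedelta


def workspace_report(files, today, quota_bytes, lifespan_days=730, warn_days=60):
    """Summarize one user's outer-layer usage.

    files: list of (name, size_in_bytes, date_added) tuples.
    Returns the bytes used, whether the quota is exceeded, and the files
    whose lifespan ends within `warn_days`, with the days each has left.
    """
    used = sum(size for _, size, _ in files)
    expiring = []
    for name, _, added in files:
        days_left = (added + timedelta(days=lifespan_days) - today).days
        if days_left <= warn_days:
            expiring.append((name, days_left))
    return {"used": used, "over_quota": used > quota_bytes, "expiring": expiring}


files = [
    ("lecture.mov", 4_000_000_000, date(2007, 12, 1)),
    ("notes.txt", 2_000, date(2009, 6, 1)),
]
report = workspace_report(files, today=date(2009, 11, 11), quota_bytes=5 * 10**9)
print(report)
```

The 'expiring' list is what would feed the notifications, and each entry is a natural place to offer the one-click move into the inner layer.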

Leveraging

Recently, my thinking is shifting in a different direction. I still like the idea of an outer transactional work-layer, and an inner repository layer with richer metadata and higher archival guarantees. But I question whether we need to build all the outer-layer services. An alternate approach would be to facilitate the use of existing third-party services and tools, and build Library services, plugins, and widgets to streamline the ingestion of appropriate items from those external third-party work-layer services and tools.

My recent experimentation with Google Wave was a catalyst for this shift, especially given its collaborative strengths, and its ability to easily handle files and images via drag & drop.

Vision

One of the use-cases we've envisioned for use of our digital repository is a professor organizing images for a class presentation. Imagine the professor is working with teaching assistants (TAs) to refine the presentation and associated points. With Google Wave, the professor could set up a wave to prepare for the class session. She could invite her two TAs into the wave; each could simply drag pictures into the wave, tagging them. The professor could also set up a bullet-point list in the wave, encourage the TAs to contribute to the bullet-list, and note issues for them to research in preparation for the class session.

Imagine if, when the session-material preparation is complete, the professor could then apply a Library repository-gadget to the wave which would, after a campus authentication process, ingest all the pictures and associated titles (and, optionally, the wave itself), and redirect the prof to a repository web-page to enter a bit more metadata. Upon adding this extra information, the data would be officially ingested into the repository. Because Google Wave is an open-source project, the Library or campus IT folk could, if desired, install a wave server to facilitate branding and make it all the easier for Library services to be integrated seamlessly into users' work flow.

Google Wave comes with an Extensions Gallery that provides inspiration for imagining the varied kinds of services that can be applied to a wave, and tutorials abound on how to program extensions. The same approach could be applied to flickr and facebook: Library programmers could build widgets and mini-apps to enable users to use friendly tools and services they're already comfortable with -- but to still be able to shift their works easily into the official repository. It's part of the idea of meeting users where they are, as opposed to requiring that they come to us. That this approach offers new and exciting realms for Library programmers is just delicious gravy.

The dashboard initiative

monday, september 22nd, 2008 12:08am

I've been putting some productive time into something I'm calling "The Dashboard Initiative". Most of this time to date has been outside of normal work hours due to a few other priorities, but in time I expect to add this to the list of on-going work projects.

Inspired by work done by Brown's Office of Institutional Research, the concept of the dashboard is to provide useful trend information about the operation of different facets of the Library. The analogy to a car dashboard is good: whereas 'instruments' make up a car's dashboard, what I'm calling 'widgets' make up the Library's dashboard.

As shown on this dashboard information page, a dashboard widget consists of three counts (baseline, trend, and current), a trend indicator, and a 'more-info' button that itself is a miniature graph. My visions for possible future dashboard usage within the Library and across campus are grand, but it is important to remember that the dashboard idea is intended to serve a rather specific data-display purpose: to usefully display trend information. Data that lends itself to pie-chart breakdowns can be important to an organization and can be an integral part of an organization's data-farm, but is somewhat outside the scope of the dashboard focus on trends.

One of the reasons I find the dashboard concept so compelling is that it provides a kind of 'template' for data-tracking feeds. Increasingly we've been building into more projects the ability to stream out statistic counts, but to date there hasn't been a clear standardized vision of how this statistical data might be presented. The dashboard offers that standard.

If we were to rebuild from scratch the easyBorrow system, we could from the start automate count-flows that could populate widgets representing trend-usage for Josiah redirects, BorrowDirect, VirtualCatalog, InRhode, and Iliad. This of course applies to all new systems, and over time I expect we'll retrofit many of our existing ad-hoc statistical counts to flow into widgets.

I have a vision of the creation, over time, of a plethora of widgets representing useful trend-information on checkouts, interlibrary-loan usage, new-title additions, collections-web-access usage, requests for offsite materials, and physical library attendance, to name just a few. This then raises the question of how to manage all these widgets.

I envision a 'MyWidgets' page where, based on cookies and login, a user could view a listing of all Library widgets, filter by tag, and select those she finds useful for a personalized widget page. As part of my work I may pay particular attention to the flow of easyBorrow requests to our different borrowing partners, and scan other widgets tracking workbench file uploads to our in-development repository. Other folk in the library might be particularly interested in widgets that track numbers of books sent to our offsite Annex facility, as well as widgets that track the number of requests for those materials, and widgets that track how many requests for offsite materials are still made when the user is offered a link to a Google Book scan of the requested title. Our French scholarly resources librarian might choose for her page widgets tracking French new-title additions, as well as checkouts of French-language items.

Thinking even more broadly -- campus-wide -- it's easy to imagine how, if other departments adopt the dashboard idea, a facility could even be developed for, say, the chair of the French department to 'subscribe' to a Library 'French New-Titles' widget, a Library 'French-Language Checkouts' widget, as well as a Registrar widget representing the numbers of freshmen enrolled in French 1, and another Registrar widget representing numbers of French concentrators.

Along these lines, I expect to one day add an rss and html parameter-segment to a widget-url to facilitate such cross-campus usage.

For now, starting on a small-scale, I've created via Django's default admin a simple form allowing non-technical end-users to create a widget simply by typing or pasting into the form a list of key-value data-pairs.

widget entry form

Upon submitting the form, the data-points are parsed and made into the discrete data-elements comprising a widget. This is in place now. In fact, the widget on the dashboard information page was created (and can easily be updated) via this form. Further, this weekend I implemented the ability to view detail line-chart information using Google's chart API. So changing a label or data-point via the simple form now changes not just the widget but also the detail chart, on the fly. Though I expect to automate many data flows used to create dashboard widgets, the form will allow non-technical folk to take data they already create via manual processes and easily make that data much more visible to others.
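A rough sketch of that parsing step, under stated assumptions: the pasted text is one 'label: value' pair per line, the baseline is the first value, the current count is the last, and the trend count is the one just before the current. None of this is confirmed by the dashboard code itself, and the function names are made up for illustration.

```python
def parse_data_points(text):
    """Turn pasted 'label: value' lines into an ordered list of pairs."""
    points = []
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines in pasted text
        label, _, value = line.partition(":")
        points.append((label.strip(), int(value.strip())))
    return points


def build_widget(points):
    """Derive the three counts and a trend indicator (needs 2+ points)."""
    values = [value for _, value in points]
    baseline, trend, current = values[0], values[-2], values[-1]
    if current > trend:
        direction = "up"
    elif current < trend:
        direction = "down"
    else:
        direction = "flat"
    return {
        "baseline": baseline,
        "trend": trend,
        "current": current,
        "direction": direction,
    }


pasted = """
2008-06: 120
2008-07: 135
2008-08: 150
"""
print(build_widget(parse_data_points(pasted)))
```

The same list of pairs that builds the widget could also be handed to whatever draws the detail chart, which is why one form edit can update both.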

We'll see how this all unfolds. The potential is exciting.

[ Update: I presented on the dashboard at the 2009 code4lib conference. Good feedback (DC, ELM). Code released. ]

Passwordless logins

friday, may 23rd, 2008 5:48am

[These are notes from a project I worked on in grad-school in 2003-2004. As part of a 'voting' project, I wanted to automate the backup of a postgres database to an offsite location via a dump and rsync. In order to script the backup, my server needed to be able to log in to the backup server automatically. A fellow student, J.E., and I worked on this piece together.

Recently a co-worker described a need to do something different, but similar in some ways, so I dug up these notes and pasted 'em in here, fairly raw. Note to hackers: the servers mentioned are long offline.]


Instructions

  • Generate the key...

    [toolbox:~/Desktop] birkin% 
    [toolbox:~/Desktop] birkin% ssh-keygen -t rsa
    Generating public/private rsa key pair.
    Enter file in which to save the key (/Users/birkin/.ssh/id_rsa): 
    /Users/birkin/.ssh/id_rsa already exists.
    Overwrite (y/n)? y
    Enter passphrase (empty for no passphrase): 
    Enter same passphrase again: 
    Your identification has been saved in /Users/birkin/.ssh/id_rsa.
    Your public key has been saved in /Users/birkin/.ssh/id_rsa.pub.
    The key fingerprint is:
    71:04:1a:69:d4:ee:4a:d5:a8:b6:77:65:20:68:12:df birkin@toolbox.local
    [toolbox:~/Desktop] birkin%
    

    The '-t rsa' flag requests an RSA key for use with the ssh 2 protocol

  • Examine the created keys...

    [toolbox:~/.ssh] birkin% 
    [toolbox:~/.ssh] birkin% ls -alF
    total 48
    drwx------ 7 birkin staff 238 16 Jun 21:15 ./
    drwxr-xr-x 57 birkin staff 1938 16 Jun 17:06 ../
    -rw------- 1 birkin staff 883 17 Jun 08:05 id_rsa
    -rw-r--r-- 1 birkin staff 230 17 Jun 08:05 id_rsa.pub
    -rw------- 1 birkin staff 535 16 Jun 20:39 identity
    -rw-r--r-- 1 birkin staff 339 16 Jun 20:39 identity.pub
    -rw-r--r-- 1 birkin staff 5351 16 Jun 20:56 known_hosts
    [toolbox:~/.ssh] birkin%
    

    The 'identity' files listed were generated when I was initially trying 'ssh-keygen -t rsa1', the ssh 1 protocol, and I believe can be ignored.

    [toolbox:~/.ssh] birkin% 
    [toolbox:~/.ssh] birkin% cat id_rsa.pub 
    ssh-rsa
    AAAAB3NzaC1yc2EAAAABIwAAAIEA0xmINQ6w3KGgxEexNJeb5bRDhOyp3R5zWfL6L5ghb8TqWDoF/x1e4KxoVp3NEMd594QISQzb4w74ZNkdGKnIqOEHs1Uy3zbutijs+PQhWqXvZ40AMbOpOjawLAcrTWUfqmBcC7MW54cOiu2FIzvlHJhYVOBCyy1nBVduGJUPF5s=
    birkin@toolbox.local
    [toolbox:~/.ssh] birkin%
    
  • Copy the public key to a file titled 'authorized_keys' which will be transferred to the remote computer(s) that I want to connect to.

    [toolbox:~/.ssh] birkin% 
    [toolbox:~/.ssh] birkin% cat id_rsa.pub > ~/Desktop/authorized_keys
    [toolbox:~/.ssh] birkin%
    
  • Let's take a look to make sure it looks right...

    [toolbox:~/.ssh] birkin% 
    [toolbox:~/.ssh] birkin% cd ~/Desktop/
    [toolbox:~/Desktop] birkin% 
    [toolbox:~/Desktop] birkin% ls -alF
    total 64
    drwxr-xr-x 7 birkin staff 238 17 Jun 08:21 ./
    drwxr-xr-x 57 birkin staff 1938 16 Jun 17:06 ../
    -rwxr-xr-x 1 birkin staff 21508 17 Jun 08:20 .DS_Store*
    -rw-r--r-- 1 birkin staff 253 2 Nov 2003 .bash_profile
    -rw-r--r-- 1 birkin staff 0 20 Apr 2003 .localized
    -rw-r--r-- 1 birkin staff 230 17 Jun 08:21 authorized_keys
    drwxr-xr-x 42 birkin staff 1428 17 Jun 08:20 envelope/
    [toolbox:~/Desktop] birkin%         
    [toolbox:~/Desktop] birkin% cat authorized_keys 
    ssh-rsa
    AAAAB3NzaC1yc2EAAAABIwAAAIEA0xmINQ6w3KGgxEexNJeb5bRDhOyp3R5zWfL6L5ghb8TqWDoF/x1e4KxoVp3NEMd594QISQzb4w74ZNkdGKnIqOEHs1Uy3zbutijs+PQhWqXvZ40AMbOpOjawLAcrTWUfqmBcC7MW54cOiu2FIzvlHJhYVOBCyy1nBVduGJUPF5s=
    birkin@toolbox.local
    [toolbox:~/Desktop] birkin%
    

    Looks good.

  • Transfer the 'authorized_keys' file from my OS X laptop to the remote computer...

    [toolbox:~/Desktop] birkin% 
    [toolbox:~/Desktop] birkin% rsync -v -e /usr/bin/ssh ~/Desktop/authorized_keys birkinbackup@harmonicas.msie.marlboro.edu:/home/birkinbackup/authorized_keys
    birkinbackup@harmonicas.msie.marlboro.edu's password: 
    authorized_keys
    wrote 316 bytes read 42 bytes 31.13 bytes/sec
    total size is 230 speedup is 0.64
    [toolbox:~/Desktop] birkin%
    
  • Make sure it looks right on the remote computer...

    [toolbox:~/Desktop] birkin% 
    [toolbox:~/Desktop] birkin% ssh birkinbackup@harmonicas.msie.marlboro.edu
    birkinbackup@harmonicas.msie.marlboro.edu's password: 
    [birkinbackup@harmonicas birkinbackup]$ 
    [birkinbackup@harmonicas birkinbackup]$ ls -alF
    total 64
    drwx------ 3 birkinbackup birkinbackup 4096 Jun 17 08:27 ./
    drwxr-xr-x 6 root root 4096 Jun 12 15:01 ../
    -rw-r--r-- 1 birkinbackup birkinbackup 230 Jun 17 08:27 authorized_keys
    -rw------- 1 birkinbackup birkinbackup 6306 Jun 17 08:22 .bash_history
    -rw-r--r-- 1 birkinbackup birkinbackup 24 Jun 12 15:01 .bash_logout
    -rw-r--r-- 1 birkinbackup birkinbackup 191 Jun 12 15:01 .bash_profile
    -rw-r--r-- 1 birkinbackup birkinbackup 124 Jun 12 15:01 .bashrc
    -rw-r--r-- 1 birkinbackup birkinbackup 29 Jun 17 08:29 datecrontest
    -rw-r--r-- 1 birkinbackup birkinbackup 847 Jun 12 15:01 .emacs
    -rw-r--r-- 1 birkinbackup birkinbackup 120 Jun 12 15:01 .gtkrc
    drwx------ 2 birkinbackup birkinbackup 4096 Jun 17 00:20 .ssh/
    -rw-rw-r-- 1 birkinbackup birkinbackup 14220 Jun 12 18:59 testdump
    [birkinbackup@harmonicas birkinbackup]$ 
    [birkinbackup@harmonicas birkinbackup]$ cat authorized_keys 
    ssh-rsa
    AAAAB3NzaC1yc2EAAAABIwAAAIEA0xmINQ6w3KGgxEexNJeb5bRDhOyp3R5zWfL6L5ghb8TqWDoF/x1e4KxoVp3NEMd594QISQzb4w74ZNkdGKnIqOEHs1Uy3zbutijs+PQhWqXvZ40AMbOpOjawLAcrTWUfqmBcC7MW54cOiu2FIzvlHJhYVOBCyy1nBVduGJUPF5s=
    birkin@toolbox.local
    [birkinbackup@harmonicas birkinbackup]$
    

    Looks good.

  • Move the file to the right place on the remote computer...

    [birkinbackup@harmonicas birkinbackup]$ 
    [birkinbackup@harmonicas birkinbackup]$ cat authorized_keys >> .ssh/authorized_keys 
    [birkinbackup@harmonicas birkinbackup]$
    

    The double angle-brackets ('>>') append instead of overwrite. Also, I've checked this out -- the append is correct for our purposes in that it appends the new string on the following line. Actually, what would be nicer for inspection is this...

    [birkinbackup@harmonicas birkinbackup]$ 
    [birkinbackup@harmonicas birkinbackup]$ echo "" >> .ssh/authorized_keys 
    [birkinbackup@harmonicas birkinbackup]$ 
    [birkinbackup@harmonicas birkinbackup]$ cat authorized_keys >> .ssh/authorized_keys 
    [birkinbackup@harmonicas birkinbackup]$
    

    Let's check out the 'real' authorized_keys file (I should name the transfer file something else in the future to avoid any confusion)...

    [birkinbackup@harmonicas birkinbackup]$ 
    [birkinbackup@harmonicas birkinbackup]$ cd .ssh/
    [birkinbackup@harmonicas .ssh]$ 
    [birkinbackup@harmonicas .ssh]$ ls -alF
    total 24
    drwx------ 2 birkinbackup birkinbackup 4096 Jun 17 00:20 ./
    drwx------ 3 birkinbackup birkinbackup 4096 Jun 17 08:27 ../
    -rw-r--r-- 1 birkinbackup birkinbackup 975 Jun 17 08:42 authorized_keys
    -rw------- 1 birkinbackup birkinbackup 887 Jun 17 07:24 id_rsa
    -rw-r--r-- 1 birkinbackup birkinbackup 251 Jun 17 07:24 id_rsa.pub
    -rw-r--r-- 1 birkinbackup birkinbackup 603 Jun 16 17:34 known_hosts
    [birkinbackup@harmonicas .ssh]$ 
    [birkinbackup@harmonicas .ssh]$ cat authorized_keys 
    ssh-rsa
    AAAAB3NzaC1yc2EAAAABIwAAAIEAu4tdcJlZldiAAnfviR3vXWGjwWa4For/kbi/FvBTeTEtctxsS72/ppn5vFydv4V5iLDVdfWKrnTIwfn8BHinq2yvdX9OLsEyjzBqbu+ZIZCi7UefJxEWCdOGtDd0YWiJbQJkyuoHs4ShwF5YcuMcnmiEjOUWJ7B5N9QkXeD3wc0= birkinbackup@harmonicas.msie.marlboro.edu
    
    authorized_keys
    ssh-rsa
    AAAAB3NzaC1yc2EAAAABIwAAAIEA3+PWa9l6hu6sY43u5FASYr26AhRrUQDqcjT5VO+wePg2OaQyTedcNkRIGG6tVquFC+AXH5BOkI+EJAfSCJG2AE0YxSrM16rMgPM1wADJBlmhumiY5wuX5ROOc0azPpvLyjZwwFsSxgqpdtNtvwUCQEl94y3H5qqOvXtR+IVtp30= birkin@toolbox.local
    authorized_keys
    ssh-rsa
    AAAAB3NzaC1yc2EAAAABIwAAAIEA0xmINQ6w3KGgxEexNJeb5bRDhOyp3R5zWfL6L5ghb8TqWDoF/x1e4KxoVp3NEMd594QISQzb4w74ZNkdGKnIqOEHs1Uy3zbutijs+PQhWqXvZ40AMbOpOjawLAcrTWUfqmBcC7MW54cOiu2FIzvlHJhYVOBCyy1nBVduGJUPF5s= birkin@toolbox.local
    
    ssh-rsa
    AAAAB3NzaC1yc2EAAAABIwAAAIEA0xmINQ6w3KGgxEexNJeb5bRDhOyp3R5zWfL6L5ghb8TqWDoF/x1e4KxoVp3NEMd594QISQzb4w74ZNkdGKnIqOEHs1Uy3zbutijs+PQhWqXvZ40AMbOpOjawLAcrTWUfqmBcC7MW54cOiu2FIzvlHJhYVOBCyy1nBVduGJUPF5s= birkin@toolbox.local
    [birkinbackup@harmonicas .ssh]$
    

    The last entry is the one we most recently created; the blank line preceding it is the result of the echo command; the lines 'authorized_keys' are mistakes from issuing echo in my experimentation instead of cat. I'm leaving these in to illustrate that there is tolerance for non-matching entries.

  • Try connecting...

    [birkinbackup@harmonicas .ssh]$ 
    [birkinbackup@harmonicas .ssh]$ exit
    logout
    Connection to harmonicas.msie.marlboro.edu closed.
    [toolbox:~/Desktop] birkin% 
    [toolbox:~/Desktop] birkin% ssh birkinbackup@harmonicas.msie.marlboro.edu
    [birkinbackup@harmonicas birkinbackup]$
    

    No password-prompt: success!
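With passwordless login working, the backup that motivated these notes can be scripted. What follows is a minimal sketch rather than the script we actually used: the database name, dump path, and remote target are placeholders, and it assumes `pg_dump` and `rsync` are on the path.

```python
import subprocess


def backup_commands(db_name, dump_path, remote):
    """Build the two commands for a dump-then-copy backup.

    db_name, dump_path, and remote are placeholders supplied by the caller.
    """
    dump = ["pg_dump", "-f", dump_path, db_name]
    copy = ["rsync", "-v", "-e", "ssh", dump_path, remote]
    return [dump, copy]


def run_backup(db_name, dump_path, remote):
    """Run each command in turn; check=True stops at the first failure."""
    for command in backup_commands(db_name, dump_path, remote):
        subprocess.run(command, check=True)


# Show what would run, without running it.
for command in backup_commands(
    "votes", "/tmp/votes.sql", "backupuser@backup.example.org:votes.sql"
):
    print(" ".join(command))
```

Because nothing prompts for a password, `run_backup` can be called from cron without anyone at the keyboard.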

Possible 'gotchas'

  • Before actually trying a connection-script, run a manual ssh first; you may need to manually ok, once, the host-key message you normally see on a first-time ssh.

  • If things aren't working, it could be a permissions issue...

    J.E. sent me a link [2008 note: this was in 2004] to http://kimmo.suominen.com/ssh/ and pointed out the caution to check file and directory permissions if connections still aren't working right after configuring everything.

    This site shows permissions to the ~/.ssh/ directory that allow writing by 'group', even though the text says only the 'owner' should have write permissions to that directory. On my account on the remote-computer, my ~/.ssh/ directory initially allowed group-write permissions, and passwordless login was not working. Changing those to...

    drwx------ 2 birkinbackup birkinbackup 4096 Jun 17 16:18 .ssh/
    

    ...allowed passwordless login to work. The beauteous text...

    [toolbox:~] birkin% 
    [toolbox:~] birkin% ssh birkinbackup@harmonicas.msie.marlboro.edu
    [birkinbackup@harmonicas birkinbackup]$ 
    [birkinbackup@harmonicas birkinbackup]$ ssh birkinbackup@play.msie.marlboro.edu
    [birkinbackup@play birkinbackup]$
    

    No password required. Sweet.

Nice, lightweight SOA implementation

sunday, may 18th, 2008 4:17pm

I've evangelized service-oriented architecture (SOA) before.

To review, briefly and roughly: SOA promotes decoupled services. For example, a Fahrenheit-to-Celsius converter would likely be implemented as a web-service, instead of as a function/method embedded/tied into some bigger program. The benefits of this are multiple: 1) The service can be written in any programming language, and accessed by other services written in different languages. 2) SOA makes the idealized promise of code-reuse a reality.
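A minimal sketch of that converter as a web-service, using only Python's standard-library WSGI conventions. The url layout, the JSON keys, and the function names are assumptions made for illustration; they don't describe any particular service mentioned in these notes.

```python
import json


def fahrenheit_to_celsius(fahrenheit):
    """Convert a Fahrenheit temperature to Celsius."""
    return (fahrenheit - 32) * 5.0 / 9.0


def app(environ, start_response):
    """WSGI app: a GET to /<fahrenheit>/ returns the Celsius value as JSON."""
    raw = environ.get('PATH_INFO', '').strip('/')
    try:
        fahrenheit = float(raw)
    except ValueError:
        status = '400 Bad Request'
        body = {'error': 'expected a number in the url, e.g. /212/'}
    else:
        status = '200 OK'
        body = {'fahrenheit': fahrenheit,
                'celsius': round(fahrenheit_to_celsius(fahrenheit), 2)}
    payload = json.dumps(body).encode('utf-8')
    start_response(status, [('Content-Type', 'application/json'),
                            ('Content-Length', str(len(payload)))])
    return [payload]
```

Handing `app` to `wsgiref.simple_server.make_server('localhost', 8000, app).serve_forever()` would serve it locally, at which point a php, java, or shell client can call the converter with a plain http request.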

I have a programmer friend who works for a large corporation who is familiar with implementing SOA using industrial-scale best-practices; I'm familiar with implementing it in a lightweight, seat-of-the-pants fashion.

Over the past year+ I've created well over a dozen SOA web-services for different projects. But I recently implemented one that I put some best-practice effort into, and it'll be a model for my future SOA work. Some links:

What I like about this one...

  • The api urls offer 'discovery' by embedding contact and documentation information in the returned data. Having just one of these pieces of info would be great; having both is particularly nice because web urls and staff change over time. Why is this useful? If someone is looking at the code that calls this service 5 years from now, and if I'm not around, the documentation will provide info on some extra features of the service that otherwise wouldn't be apparent if, say, the web-service just returned the word 'English'.

  • The api urls are 'hackable', another way of enhancing discovery. One can intuitively try entering a code other than 'enk' to see what comes up (like 'tlh'). Also, reasonably appropriate things happen if one lops off increasing sections of the url (in this case, redirects to documentation pages).

  • The api urls are versioned. Key:value pairs can be added to this api -- but the existing key:value pairs must never be changed. The reason is that post-release, I don't know who's using it for what, thus I have to assume any changes could break someone's app. So if I want to change the label 'response' to 'language', and deliver it in xml, I can leave the existing one as is, and label the new one 'api_v2'.

  • All these urls utilize server-caching. This is an implementation rather than a design feature, but worth mentioning. Django offers a flexible and easy-to-use caching feature; I have it set so that the list and api urls only have to hit the database once a day, no matter how many times the urls are hit. Further, django's caching is intelligent: its response includes 'Cache-Control', 'Etag', and 'Expires' http-headers so that a browser or well-designed code doesn't even have to call the web-service again to redisplay the data. Nice. This would be particularly important and useful for something like RSS feeds.
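The discovery and versioning points above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual service: the documentation url, the contact address, the JSON output, and every function name are hypothetical, and the tiny dictionary stands in for the database lookup.

```python
import json

# Hypothetical values; the real service's urls and contact are not shown here.
DOCS_URL = 'http://example.org/language_translator/docs/'
CONTACT = 'library-web-services@example.org'
LANGUAGES = {'enk': 'English'}  # tiny stand-in for the database lookup


def api_v1(code):
    """Version 1: once released, these keys are never renamed or removed."""
    return {
        'response': LANGUAGES.get(code, 'code_not_found'),
        'documentation': DOCS_URL,
        'contact': CONTACT,
    }


def api_v2(code):
    """Version 2: relabels 'response' as 'language' without touching v1."""
    v1 = api_v1(code)
    return {
        'language': v1['response'],
        'documentation': v1['documentation'],
        'contact': v1['contact'],
    }


ROUTES = {'v1': api_v1, 'v2': api_v2}  # version segment of the url -> handler


def handle(version, code):
    """Return the body for a url such as /api_v1/enk/ (JSON here for brevity)."""
    return json.dumps(ROUTES[version](code), sort_keys=True)
```

Because the version is part of the url, an old caller that asks for 'v1' keeps getting the 'response' key for as long as the service lives, while new callers can opt into 'v2'.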

Good info...

  • A terrific, hands-on review-resource on http-headers: The web-services chapter of Mark Pilgrim's 'Dive Into Python' website & book.

  • Many of the features of this language_translator web-service were informed by the book 'RESTful Web Services', by Richardson & Ruby. Some parts are a bit dense, but it's chock-full of terrific detailed info and food for thought. I came across it after having written a half-dozen or so SOA web-services, each one a little different and better, and it directly addressed many issues I had begun to think about or saw referenced via web-research.

[Acknowledgements to Peter Murray's article and Richard Akerman's Access_2006 presentation that first inspired my SOA thinking.]

Discovery Tools and Standards Trends

tuesday, april 1st, 2008 5:57am

[Got a nice little blog-recognition email a couple weeks ago by a reader asking if I would write up a report on this NISO conference for possible inclusion in a newsletter. Here's the web-version.]

Thirteen presentations were given at the NISO Forum 'Next Generation Discovery: New Tools, Aging Standards', held on March 27 and 28, 2008, in Chapel Hill, North Carolina. They covered three main areas: current user-expectations, discovery tools attempting to meet those expectations, and architectures to facilitate the development and adoption of discovery tools.

Speakers reinforced that users want searching to be easy and fast. Dinah Sanders of III noted how this process of finding information has become increasingly iterative, with users expecting to refine their queries. Vinod Chachra of VTLS explained that this is why search results must be returned quickly: it allows humans to be seamlessly involved in the discernment process, scanning and instantaneously determining relevancy.

Robert Sandusky of the University of Illinois at Chicago, Cameilia Csora of 2collab.com, Karen Hawkins of scitopio.org, Dinah, and Peter Murray of OhioLINK, all showed tools that incorporate at least some now-common web elements to meet these user-expectations, including faceted results, tagging, tag-clouds, and feeds.

Many new discovery tools offer truly laudable interface-improvements over previous displays of information, but suffer from a significant architectural limitation: If a tool assumes the only end-user experience is the tool's default web-page, opportunities for discovery are drastically limited. For example, if we at Brown were to license the III Encore tool, and a user were to land on one of our 'Napoleonic Satires' collection pages, it would be wonderful to be able to query an Encore API on 'Napoleon' to display in a sidebar the terrific discovery data this tool can access. But this and many other discovery tools offer no such API. Fortunately this is changing, with some vendors updating their business models to meet current library discovery needs both deeply and broadly. Ex Libris' XServer licensing is a prime example, offering API functionality to its federated-searching tool, dramatically broadening discovery possibilities beyond the default MetaLib web interface.

The presentations that touched on system-design and architectural issues to improve discovery were particularly inspiring.

Richard Akerman of NRC CISTI noted that we should utilize our power as information-producers to produce data that more easily lends itself to machine-harvesting. This can be done by encoding data where possible utilizing existing standards such as OpenURL and COinS, as well as emerging de-facto standards such as microformats. He showed as one example information displayed in a useful time-line format by a site that had queried data from a harvester site -- possible because of standardized structured date-fields.

Mike Teets of OCLC gave a presentation on OCLC's emerging WorldCat Grid services, offering possibilities for cross-referencing standard identifiers such as ISBNs and OCLC numbers, and emerging identifiers such as 'identities' -- truly a developer's dream.

Vinod offered numerous useful suggestions for designing systems to minimize user confusion and maximize utility. He showed an example of a site that had both a facet category of LC Subject Headings, and another facet category of Dewey Subject Headings, and the ease with which a user could be confused by this explicit display of overlapping information, hindering rather than helping discovery.

Michael Winkler of the University of Pennsylvania discussed how PennTags is emblematic of an architectural approach the UPenn Library has found successful. His talk brought together multiple threads of this NISO Forum by showing how PennTags offers discovery possibilities in multiple different settings because it is designed as what he called a 'horizontal' service, as opposed to a 'vertical' service. The 'vertical' service paradigm in Winkler's view is exactly the one I described earlier as architecturally limited: much useful information is gathered together and funneled down into one website with no possibilities for alternate exposure of the gathered and massaged data. PennTags, he noted, is an example of the 'horizontal' service paradigm that he sees as the future of UPenn Library discovery software and good discovery software in general. It is not tied to any particular existing service: not the catalog, not electronic resources, not their course-software -- and yet can be used with any of these services -- and in any given context can expose interesting data from another context. Each PennTag entry is exposed as an RSS feed which illustrates the power of simple standards to enhance discovery. In fact, this shift to a horizontal paradigm is so central to the UPenn Library's current and future work, that Winkler noted he toyed with re-titling his presentation to something like 'Not PennTags, but Why'.

John Mark Ockerbloom, also of the University of Pennsylvania, followed with an update on the DLF's 'ILS Discovery Interface Task Force', of which he is the Chair. The task force is set to soon release a standard for ILS discovery services -- in other words, it will set a lightweight API standard for the OPAC layer of the ILS. In the context of this forum, this standard should help foster the shift of the OPAC from a vertical silo-service to a more flexible horizontal one, increasing opportunities for discovery of the underlying OPAC data.

I hope to see the next Forum in this 'Discovery' series showcase more tools that utilize Winkler's horizontal-service architecture concept that increases discovery possibilities. Kudos to NISO for providing another thought-provoking Forum packed with inspirational examples and ideas and conversation.

Links to presentations should be available soon from the NISO event website.

Better logging

saturday, february 23rd, 2008 5:22pm

I'm entranced with a new practice: logging to a database instead of a file.

Long ago I got into a habit of logging to files as a way of monitoring the workings of my programs. For shell scripts I piped the standard output to be appended to a file, and then just sort of stuck with that model as I learned other languages.

Though java and python each have a robust logging library built into the language, I didn't use those, instead focusing my language-learning on features that more directly enabled me to tackle whatever the task at hand. The result is that over time my shell and php and python and java programs ended up with log files that grew ever larger, requiring occasional manual paring.

Given an interest in best-practices, I've begun learning about and experimenting with built-in loggers when available, but on a current project have met my logging needs via a self-rolled approach that offers real benefits.

Problem -- atomized logging

Our easyBorrow project consists of a lightweight php web interface that quickly dumps the incoming request into a database queue, where a python controller takes over, calling a series of independent web tunnelers & other web services. The whole system consists of around a dozen independent web-services of varying degrees of complexity, each with a nicely scoped focus. Most of them also write to a separate log file, which in a way makes sense, but given that the majority of these web-services serve a single goal -- to move the user's request-processing along, the atomized nature of the logs can end up being a hassle.

If something goes wrong with a request, a 'history' table does give an indication of where to start tracking down the issue -- but then I may have to look into as many as half-a-dozen separate log files to see what exactly happened. This is one of those situations where problems don't arise often enough to tackle improving the existing architecture, but just enough to make the existing one annoying at times.

Problem -- data not exposed

Keeping this background in mind, I want to note another issue that happens maybe once every three months that had a role in this new architecture with which I'm so taken. Some two years ago I implemented an automated export of requests from our iii ILS for items held in an offsite location. Those requests get exported, then parsed, and then moved to a location where a different vendor's inventory-control software takes over and presents the workers at the offsite facility a list of items that need to be retrieved.

Occasionally, very occasionally, requests don't show up for the offsite staff and I'm asked if I can confirm that the requests actually got exported and parsed and handed off to the inventory-control software. So I look in my documentation to see where the server application log files are located -- grab them and let the folk know that yes, in fact, my part of the flow worked. When this happened last month, a co-worker noted that it would be terrific if they could view the information that I'm looking up so I wouldn't have to be bothered. Unfortunately, given the existing model, that would require folk having passwords to unix servers and isn't workable. But I've ruminated upon this, and given my current evangelism of APIs and exposing data, I've thought that if I had to do that logging over, I'd expose it via a web interface.

Solution

Now I'm working on a new project, or rather tackling one that's been on the back-burner far too long: exporting newly-cataloged item information from our closed and unfriendly iii ILS into a database where we can present users with useful new-item info and feeds. Like more and more projects these days, this one has many pieces, each of which, had I done this a year-plus ago, would have logged its inner workings to a separate file.

Now though, I'm logging the export script info, the posting script info, and the parsing script info to a single database table. And because one of the scripts lives on a server that doesn't have a library set up to interface with mysql, I'm 'writing' to the db by POSTing that script's log-entry info to a url which then saves it to the db. The log-table consists of (in addition to an unseen auto-incrementing id) a datestamp, an identifier, and the log message. The 'identifier' in this case is a simple number that allows me to group the entries from different sources together in the log. When I eventually apply this beauteous system back to easyBorrow, the identifier will be the request-number the system assigns early on in the process. The function/method in each separate script that writes to the log also takes a detail-level parameter, allowing me to specify a high level of logging detail in development code, and a low level in the ongoing in-production code.
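For the script that can't talk to mysql directly, the POST might look something like this minimal sketch. The endpoint url and the form-field names are assumptions; only the idea of POSTing an identifier and a message comes from the description above.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

LOG_URL = 'http://example.org/logger/add/'  # hypothetical endpoint


def build_log_request(identifier, message, url=LOG_URL):
    """Build a POST request carrying one log entry as form fields."""
    data = urlencode({'identifier': identifier, 'message': message})
    return Request(url, data=data.encode('utf-8'), method='POST')


def post_log_entry(identifier, message):
    """Send the entry; the receiving url is expected to save it to the log table."""
    with urlopen(build_log_request(identifier, message), timeout=10) as response:
        return response.status
```

Splitting the request-building from the sending keeps the first half easy to check without a network connection.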

This system is sweet. It means that I have only to look in one place to monitor the flow of all three scripts. So if the export cron job fires off at 2am, and the POST cron job fires off at 3am, and the parser cron job fires off at 4am, I can see the whole flow in one view.

Though all developers can write to a database in their sleep, since I'm writing to a django-managed table, it is and feels even easier. For those who haven't yet drunk framework kool-aid:

log_entry = Log()
log_entry.identifier = 'the_identifier'
log_entry.message = 'the_message'
log_entry.save()

Wrapping a function around this allows my log entries to look like:

updateLog( detail='low_detail', identifier='the_identifier', message='the_message' )
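A rough sketch of what such a wrapper could look like, including the detail-level check. Assumptions: sqlite3 stands in for the django-managed table so the example runs on its own, and the level names, the CONFIGURED_DETAIL setting, and the True/False return value are invented for illustration.

```python
import sqlite3
from datetime import datetime

DETAIL_LEVELS = {'low_detail': 1, 'high_detail': 2}
CONFIGURED_DETAIL = 'low_detail'  # would be 'high_detail' in development

# In-memory stand-in for the log table: id, datestamp, identifier, message.
connection = sqlite3.connect(':memory:')
connection.execute(
    'CREATE TABLE log (id INTEGER PRIMARY KEY AUTOINCREMENT, '
    'datestamp TEXT, identifier TEXT, message TEXT)'
)


def updateLog(detail, identifier, message):
    """Write the entry only if its detail level is within the configured level."""
    if DETAIL_LEVELS[detail] > DETAIL_LEVELS[CONFIGURED_DETAIL]:
        return False  # too detailed for this environment; skip it
    connection.execute(
        'INSERT INTO log (datestamp, identifier, message) VALUES (?, ?, ?)',
        (datetime.now().isoformat(), identifier, message),
    )
    connection.commit()
    return True
```

With the configured level at 'low_detail', calls tagged 'high_detail' are silently dropped, so the same scripts can run chatty in development and quiet in production.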

But wait, in true Ronco spirit, it gets better... Since I'm writing to a django-managed table, I automatically, without writing extra code, have a complete, useful, sortable and searchable web-interface -- with built-in authentication -- which means that not only can I view the flow of processing, I can easily allow anyone else to view the flow of processing by supplying a url.

The final sprinkle of luscious magic is that django makes it very easy to override the built-in save method of its objects. So I've added a bit of logic to the Log object's save method to delete entries older than X days (a configurable number I've put in a settings file). There's a bit of a hack in this solution. The absolute simplest code to write in this save method is just to query for all log-entries older than X days and delete them, which is what I've initially done. But this is unnecessarily expensive database access for every single log-entry, though mitigated by the fact that for this project, the scripts run only once a day and, in production, log lightly. A better approach would be to have a separate job run once a day or week and perform the deletions, and I may implement that, though I've been mentally toying with an oddly enjoyable interim hack: to have the save method come up with a random number such that it would have, say, a 1-in-100 chance of running the delete code. Bottom line, though, is that auto-deletion is taken care of right up front.
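The 1-in-100 idea can be sketched without django at all. In this illustration a plain list stands in for the log table, and the class name, the 30-day value, and the injectable random-number source are all assumptions; a real version would put the same check inside the model's overridden save method.

```python
import random
from datetime import datetime, timedelta

DAYS_TO_KEEP = 30   # the configurable 'X days'; 30 is an arbitrary example
PRUNE_ODDS = 100    # prune on roughly 1 save in 100


class LogStore:
    """In-memory stand-in for the log table."""

    def __init__(self, rng=random):
        self.entries = []  # list of (datestamp, identifier, message) tuples
        self.rng = rng     # anything with a randrange() method

    def prune(self, now):
        """Delete entries older than DAYS_TO_KEEP days."""
        cutoff = now - timedelta(days=DAYS_TO_KEEP)
        self.entries = [entry for entry in self.entries if entry[0] >= cutoff]

    def save(self, identifier, message, now=None):
        """Append an entry, occasionally pruning old entries first."""
        now = now or datetime.now()
        if self.rng.randrange(PRUNE_ODDS) == 0:
            self.prune(now)
        self.entries.append((now, identifier, message))
```

The table can briefly hold entries older than the cutoff, but the expensive delete query runs on about one save in a hundred instead of on every save.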

Put all of these improvements together, and the new system offers more useful, more accessible, and better-sized logging.

Practical campus APIs & feeds

saturday, february 16th, 2008 10:42am

For a while now I've evangelized APIs & feeds, encouraging folk (and reminding myself) to expose 'web-page' data by presenting it in some alternate structured format. That's partly for the purpose of making code-reuse a reality but even more so for the purpose of making possible new and interesting uses of data.

At the Library, we've truly moved into the realm of moving code onto the network. The web-services we've created have, not surprisingly, been library-related:

  • An isbn converter.
  • A 'cleaner' for data output from an ILS API.
  • Many 'tunnelers' into consortial borrowing services returning results of searches, with the order number, if applicable.
  • A reprocessor of OCLC xISBN data that returns only those OCLC xISBNs that have the same format and are in the same language as the submitted ISBN.
  • An OCLC to ISBN converter that will take an OCLC number and see if there are versions of that item available with ISBNs.
  • An OPAC status & location checker.
  • etc. etc. etc.

I've wondered recently what APIs the library could offer that would be of value campus-wide. More specifically, what APIs we might develop for our own needs that would be useful to the campus as a whole. Of course, many of our APIs do currently benefit the wider campus community in that students, staff, and faculty across campus use services of ours that are made possible via our behind-the-scenes use of APIs. I'm thinking more of APIs that developers in other departments might find directly useful.

When considering APIs that would be useful for developers across the campus, I naturally think of our Computing and Information Services department (CIS). I've had good conversations, and hope to have many more in the future, with CIS folk about having them develop and evangelize campus-wide APIs. My thinking has been that over time, developing such APIs could save them an enormous amount of time as well as enhance good will from departmental developers.

An example: for one of our Library projects, we need a listing of faculty and course information. I'm not directly involved in this project, but my understanding is that we periodically request a list of faculty and courses from CIS; they produce the list; and we update some db tables for web-apps that make use of this information. My sense is that if certain Banner APIs could be enabled -- obviously with appropriate security implemented -- we could get this information directly from a feed / API call, simplifying our workflow and lightening the workflow of the CIS folk who produce the list for us.

I'm encouraged from my conversations that there are folk in CIS who share this perspective and are working to realize it. While good discussions and planning proceed, I find myself gravitating to what we in the Library could do now along these lines. Three ideas...

Cafeteria menu

As part of an idea that deserves its own post (the idea sounds a bit silly without context, but indulge me), I've thought that it would be very useful on a particular Library web page to be able to display the next upcoming meal at the main campus cafeterias. I spent about ten minutes exploring the availability of that information, and found two web-pages and a downloadable excel spreadsheet. None of these are ideal sources of information to automate, but it could be done, and I wouldn't be at all surprised if the resulting structured feed would be of use to others, from individual students to the campus newspaper.

SafeRide arrival time

We have a campus shuttle system comprised of about seven vans. A couple of these have GPS receivers, and a vendor website displays on a map, via quite gnarly javascript, the current location of the GPS-enabled vans. That's nice, but the experience could be significantly improved.

I've thought it would be extremely useful to be able to display on a Library web-page (if the student is accessing the page from within the Library) a simple line like "The next SafeRide shuttle should arrive here in about 10 minutes." Simple and seriously, wonderfully useful information, that doesn't get in the way of the task at hand. That same Library web-page, if accessed from outside of the Library, simply wouldn't have that line displayed.

The API we could create, from parsing the javascript on the vendor web-page, could at a minimum return location information for the GPS-enabled shuttles, which could be interpreted by our own server-side logic to approximate arrival times. But even better, the logic of determining arrival times could be embedded in the API itself. The API could take a location-parameter and return the expected arrival time for the submitted location. We at the library might only implement logic that focuses on the arrival times at the Library. But by opening up the arrival-assessment code, we could allow BioMed developers to add arrival-time logic for shuttle-stop-locations close to BioMed buildings, and students to add arrival-time logic for shuttle-stop-locations close to particular dorms.

Since developers can determine the IP address of an incoming request for information, and since developers and computer-knowledgeable students know the IP-address ranges of buildings in their purview, we really can do this.
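A minimal sketch of that IP-address check, using Python's standard ipaddress module. The address range below is a block reserved for documentation, not a real campus range, and the function names are made up for illustration.

```python
import ipaddress

# Made-up range (a block reserved for documentation), not the Library's own.
LIBRARY_NETWORKS = [ipaddress.ip_network('192.0.2.0/24')]


def is_in_library(remote_addr):
    """Return True if the request's IP address falls in a Library range."""
    address = ipaddress.ip_address(remote_addr)
    return any(address in network for network in LIBRARY_NETWORKS)


def shuttle_line(remote_addr, minutes_until_arrival):
    """Return the shuttle line for in-Library visitors, or '' for everyone else."""
    if not is_in_library(remote_addr):
        return ''
    return ('The next SafeRide shuttle should arrive here in about '
            '%d minutes.' % minutes_until_arrival)
```

A department adding its own buildings would only need to supply its own list of networks; the arrival-time estimate itself is assumed to come from elsewhere.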

Public computer availability

Imagine you're a student. You need to get some good work done and know if you stay in your dorm room this evening you won't get that work done -- there are just too many distractions. So you want to go to the Library. You have a desktop computer, or maybe just don't feel like lugging your laptop in the rain, and you know the Libraries have public clusters. Problem is -- it's getting close to midterms and sometimes the clusters get pretty full. Wouldn't it be great, I mean, really, really great, to be able to access a web-page that shows public cluster availability across campus?

I've talked with some CIS folk about this and found individuals who are working hard to realize this goal. They do have software that can detect the 'in-use' status of each terminal in the clusters, and last I checked (in November, I think) had noted that the software had upgraded its web-display capability, with which they were experimenting. However, public web display of cluster availability is as of this writing only accessible... from cluster machines. But the hope is that this information will eventually be made more public. That's great, but I'd like to take the data a step further, and create an API to the data. The reason is that if the data were also exposed via an API in addition to a web-page, I could solve more specific problems in a targeted way. For instance, one of our Libraries has 15 floors, with public computers available on multiple floors. Wouldn't it be terrific if a student entering that Library could glance at a display screen and see the relevant computer availability (with floor numbers listed instead of generic cluster IDs) for just that building? An API would allow that.
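To make the per-building display concrete, here is a small sketch of what could be done with such API data once it exists. The records and field names are invented sample data; the actual cluster software's output is not known here.

```python
from collections import defaultdict

# Invented sample records; a real feed's field names would likely differ.
TERMINALS = [
    {'building': 'sciences_library', 'floor': 1, 'in_use': True},
    {'building': 'sciences_library', 'floor': 1, 'in_use': False},
    {'building': 'sciences_library', 'floor': 3, 'in_use': False},
    {'building': 'main_library', 'floor': 2, 'in_use': False},
]


def availability_by_floor(terminals, building):
    """Return {floor: free_terminal_count} for a single building."""
    free = defaultdict(int)
    for terminal in terminals:
        if terminal['building'] == building:
            free[terminal['floor']] += 0 if terminal['in_use'] else 1
    return dict(free)
```

A lobby display for one building would call this with that building's name and show floor numbers instead of generic cluster IDs.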

I have other ideas as well, but this gives a good flavor of how in the future, as we meet Library needs, we might be able to offer very useful API data to developers across campus.

To close, an exhortation... In each of these three situations, I speak of creating an API from existing publicly available electronic data. My excitement about creating and then utilizing these APIs for user-services is evident. But really, I should not have to create the APIs; I should be able to spend my time building the useful services for the Library and our campus that the APIs allow. So to all: if you know anyone creating any web-information -- encourage that person to expose their data not only via a 'regular' web-page, but also in a predictable structured way that can make its re-use easy. And to anyone purchasing any vendor-service that offers electronic information, demand that the service offers an API to the data.

Feed interfaces & urls

sunday, february 10th, 2008 2:46pm

I'm going to add feeds to the site, which django makes easy to do. But I want to think about the urls for the notes feeds. I had been thinking I'd manually create a feed that can flow to planet.code4lib.org that would have been a combination of tags (at the moment 'user-context' and 'api') so that things irrelevant to code4lib don't show up. But it'd be nice to generalize this, so that anyone coming to the site can easily select multiple tag-categories and get a feed from that.

So I want to think beyond an obvious simple elegant system for single tags. (Before the idea occurred for multiple-tag feeds the url for tag 'user-context' would have been 'http://bspace.us/notes/feeds/user-context' -- but that's not extensible.)

Tripod new titles list

The tripod newtitles list has a great interface for selecting multiple categories. Selecting a few options returns a fully-parameterized url, nicely explicit if a bit busy.

For the notes feeds, I'd like to offer combined-option feeds in two ways: as parameterized, but also in a tinyurl fashion, like:

http://bspace.us/notes/feeds/arzq

If that url returned (in addition to feed info, obviously) a documentation link, the documentation could note that the link

http://bspace.us/notes/feeds/arzq/info

...would return, among other things, the explicit url.
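One way the two url styles could relate, as a sketch only: the mapping of 'arzq' to these two tags, the 'tag' query parameter, and the shape of the info data are all assumptions made for illustration.

```python
from urllib.parse import urlencode

BASE = 'http://bspace.us/notes/feeds/'
# Hypothetical lookup table: short code -> the tags it stands for.
SHORT_CODES = {'arzq': ['user-context', 'api']}


def explicit_url(tags):
    """Return the fully-parameterized feed url for a list of tags."""
    return BASE + '?' + urlencode([('tag', tag) for tag in sorted(tags)])


def info(short_code):
    """Return what the /info url might report for a short code."""
    tags = SHORT_CODES[short_code]
    return {
        'short_url': BASE + short_code,
        'tags': tags,
        'explicit_url': explicit_url(tags),
    }
```

Sorting the tags before building the explicit url means the same combination always yields the same url, whatever order the tags were selected in.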

I'll think more on this and look for other examples of interestingly-crafted feed interfaces and urls.

Moving code onto the network

saturday, february 9th, 2008 10:46am

In 2004, while in my masters program, deeply immersed in java object-oriented programming, I saw the potential benefits of code re-use that classes offer. I envisioned over time building up libraries of class-objects; by accessing them in future projects, I expected to be more and more productive.

Code-reuse never quite worked out that way, though. What I've tended to do for new projects has been to copy a similar class from a previous project, paste it into the new project, weed out unnecessary attributes and methods, and add new code. In a way this makes sense: though I lose out on 'pure' code-reuse, I gain by having all code for a project together. That's nice for version-control and portability, and isolation of concerns in that I don't have to worry that a change in a class in one project will have unintended consequences in another project.

But after reading a while back about service-oriented architecture, and shortly thereafter needing to code a couple of lines in python that I had just coded in php a day or two earlier, the benefits of moving code into RESTful web-services (that is, moving it onto the network) became apparent.

I do that all the time now. Just last week I needed to convert between 10- and 13-digit isbns for the second time in a recent project, so rather than coding the conversion directly in the program at hand, I put it into a webservice.

http://sisko.services.brown.edu/easyborrow/isbn_converter/0688052304/
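The conversion such a service performs is the standard ISBN check-digit arithmetic. A minimal sketch follows; it is not the code behind the url above, and it assumes input that has already been stripped of hyphens and validated.

```python
def isbn10_to_isbn13(isbn10):
    """Convert a 10-digit isbn to its 13-digit form (978 prefix)."""
    core = '978' + isbn10[:9]  # drop the old check digit
    # ISBN-13 weights alternate 1, 3, 1, 3, ... across the first 12 digits.
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(core))
    return core + str((10 - total % 10) % 10)


def isbn13_to_isbn10(isbn13):
    """Convert a 978-prefixed 13-digit isbn to its 10-digit form."""
    if not isbn13.startswith('978'):
        raise ValueError('only 978-prefixed isbns have a 10-digit form')
    core = isbn13[3:12]
    # ISBN-10 weights run 10, 9, ..., 2 across the first 9 digits.
    total = sum(int(d) * (10 - i) for i, d in enumerate(core))
    check = (11 - total % 11) % 11
    return core + ('X' if check == 10 else str(check))
```

For the isbn in the url above, `isbn10_to_isbn13('0688052304')` gives '9780688052300', and converting that back returns the original 10-digit form.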

In this shift, I've finally realized that goal of code reuse, while still being able to maintain the version-control and isolation of concerns benefits of focusing on my specific project at hand.

The book 'RESTful Web Services' by Richardson and Ruby, while a bit dense, offers good insights on creating web-services (example: versioning). At some point, I'd like to come up with standards for Brown Library (and/or campus-wide) web-services. Examples: specifying versioning in the url, a documentation url in the returned data, and a url in that documentation of all APIs/web-services the department offers.

For now, though, the simple shift toward moving code out of individual projects and onto the network has been extremely rewarding.

User context

saturday, january 26th, 2008 6:39pm

I recently organized a meeting of some forward-thinking folk to brainstorm about what kinds of cool library things we could do if we knew more about a user's context. I also put up a wiki-page to help the brainstorming process.

This stems from a requirement for a project: I had to be able to access a particular barcode related to a user. Turned out the only way of getting at this barcode was to first get another piece of information, and then use that first piece of information to call an API that returns a bunch of info about a user, including the barcode. Fine; did that; got the barcode and used it for a tunneler I built. I then went to the Access 2007 conference, a terrific library programming/technology conference, where, among other terrific presentations, I heard Mark Leggot speak about the repository he set up at the University of Prince Edward Island (UPEI). He mentioned the importance of understanding a user's 'context'. Something clicked, and the general implications of what our team had achieved by being able to tie a user's log-in to this API-info about the user became clear.

This is nothing new in the internet business world; Amazon has been trailblazing this path for years. But given that authentication has mostly been a simple 'boolean' system in our Library webapps of just determining whether or not a user is permitted to access a site -- this opens up worlds of exciting possibilities. Already we've implemented a proof-of-concept 'drop box' that determines, from login, the user's 'type' (faculty/staff/student) and department and uses this data to customize the page displayed after login. Exciting stuff!

Vendor API Manifesto

thursday, december 6th, 2007 11:02pm

[I wrote this early in the Fall of 2007 and circulated it to folk in the Library who were attending meetings at which vendors were advertising their wares.]

Software products are created, understandably, primarily to meet existing needs. There are varied bodies of thought as to how much a software product should be designed to meet 'future possible needs'.

At certain points in recent history, it may have been reasonable to design the sole interface to a system assuming that the user of the interface would be a person using a web-browser.

Though APIs (application programming interfaces) have been around for ages, the trend toward programmers wanting to access internal and external systems via APIs has accelerated tremendously over the last few years.[1] As a programmer for a creative web-services department in a creative Library, I'm part of this trend. Our team's need to be able to programmatically access systems has increased dramatically. Fortunately, a few vendors such as Ex Libris understand this and have built possibilities for programmatic access into their products. But many closed systems remain.

To managers and directors making purchasing decisions, I urge that a top-level purchasing consideration be whether the vendor's product offers an API to the information it provides (in addition to any built-in web interface). The simple reason is that a web presentation of information is designed for a single purpose: for a user to interact with the system via a browser. An API allows the system's data to be accessed in any way we see fit, now or in the future.

A concrete example for any reader not familiar with the notion of an API...

Our team is currently developing a system to simplify the process of obtaining a book through interlibrary-loan services. In order to do this we have been able to automate the process of searching a consortial web-catalog for an item, and requesting that item. But the only method of doing this involves creating a program which essentially mimics a browser, automatically simulating clicked buttons and links and reading the resulting HTML of the consortial catalog's web interface.

This works, but is terribly fragile: if the design of a web-page changes, our program may no longer work until it is reconfigured to understand the new design.
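A minimal sketch of that fragility, not our actual program: the two HTML snippets and the pattern below are invented for illustration. The scraper is tied to one specific piece of markup, so a purely cosmetic redesign makes it silently return nothing.

```python
import re

# Hypothetical catalog page before a redesign.
OLD_PAGE = '<table><tr><td class="title">Example Title</td></tr></table>'

# The same record after a hypothetical redesign: same data, new markup.
NEW_PAGE = '<div class="record"><span class="item-title">Example Title</span></div>'

# The scraper depends on the exact tag and class name used by the old design.
TITLE_PATTERN = re.compile(r'<td class="title">(.*?)</td>')


def scrape_title(html):
    """Return the title found in the page, or None if the markup is not recognized."""
    match = TITLE_PATTERN.search(html)
    return match.group(1) if match else None


print(scrape_title(OLD_PAGE))  # finds the title in the old design
print(scrape_title(NEW_PAGE))  # finds nothing: the data is there, the markup changed
```

Nothing about the underlying record changed between the two pages; only the presentation did, and that alone is enough to break the program.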

What we absolutely need (in addition to the existing web interface) is a catalog-service (the API) which would accept a defined HTTP request sent to a URL in order to perform a search, request an item, etc. (That HTTP request would come from a program our team has written -- instead of from a user sitting at a browser.) Each request to this API would return predictable, documented, structured information (XML is one standard; there are others). Our team's program would then be able to automatically process this information.
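To make the idea concrete, here is a rough sketch of what using such a catalog-service might look like from our side. Everything specific in it is an assumption for illustration: the URL, the 'isbn' and 'format' parameters, and the XML element names are made up, and a real vendor would document its own. The sample response is embedded so the sketch runs without any network access.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Hypothetical endpoint; a real vendor would publish its own URL.
API_BASE = "https://catalog.example.org/api/search"


def build_search_url(isbn):
    """Build the search request URL (parameter names are assumptions)."""
    return API_BASE + "?" + urlencode({"isbn": isbn, "format": "xml"})


# An invented example of the structured data such an API might return.
SAMPLE_RESPONSE = """<results>
  <item>
    <id>b1234567</id>
    <title>Example Title</title>
    <available>true</available>
  </item>
  <item>
    <id>b7654321</id>
    <title>Another Title</title>
    <available>false</available>
  </item>
</results>"""


def parse_search_response(xml_text):
    """Turn the XML response into a list of plain dictionaries."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.findall("item"):
        items.append({
            "id": item.findtext("id"),
            "title": item.findtext("title"),
            "available": item.findtext("available") == "true",
        })
    return items


def requestable_ids(items):
    """Return the ids of items that could be requested right now."""
    return [item["id"] for item in items if item["available"]]


items = parse_search_response(SAMPLE_RESPONSE)
print(build_search_url("0131103628"))
print(requestable_ids(items))
```

The program reads named fields from a documented structure instead of guessing at page layout, so the vendor can redesign the web interface freely without breaking it.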

It is worth emphasizing that I am not asking for a 'whole new program' from the catalog vendor. A system's existing internal program logic that produces the information for the regular web data-stream is applicable to production of the alternate API data-stream. Yes, it takes thought and work to create a good and secure API and document it -- but an API, essentially, presents the same data as a web interface, in a simpler format. The mind-shift in offering an API is often larger than the work-shift.

Finally, about interacting with vendors regarding this issue... Vendor sales people aren't the developers, and it may sound like I am asking for something that vendor developers would be more knowledgeable about. But I've seen different vendor sales representatives at workshops and conferences, and the representatives for products that provide APIs have, without exception, clearly understood the importance of this issue. Thus if a product representative does not seem to understand this important feature, I would have significant concerns about the product.

--

Notes

[1] Key aspects of this trend are articulated in this seminal article:
http://www.oreillynet.com/pub/a/oreilly/tim/news/2005/09/30/what-is-web-20.html