Showing posts with label opensource. Show all posts

Saturday, July 02, 2011

OR11: Misc notes

The state of Texas
I like going to conferences alone; it's much easier to meet new people from all over the world than when you're with a group, since groups tend to cling together. With a multi-track conference like OR11, however, the downside is that there's so much to miss, especially since I like to check out sessions from fields I'm not familiar with. At OR11, I wanted to take the pulse of DSpace and EPrints, and not just faithfully stick with the Fedora talks.

In this entry, I focus on bits and bobs I found noteworthy rather than give a complete description. I skip over sessions that were excellent but have already been widely covered elsewhere (for instance at library jester), such as Clifford Lynch's closing plenary.


“Sheer Curation” of Experiments: data, process and provenance, Mark Hedges 
slides [pdf]

"Sheer curation" is meant to be lightweight, with curation quietly integrated into the normal workflow. The scientific process is complex, with many intermediate steps that are discarded; the deposit-at-the-end approach misses these. The goal of this JISC project is to capture provenance and experimental structure. It follows up on Scarp (2007-2009).

I really liked the pragmatic approach (I've written this sentence often - I really like pragmatism!). As the researchers tend to work on a single machine and make heavy use of the file system hierarchy, they wrote a program that runs as a background process on the scientists' computer. Quite a lot of metadata can be captured from log files, headers and filenames. Notably, it also helps that much work on metadata and vocabulary has already been done in the field in the form of limited practices and standards.

Being pragmatic also means discarding nice-to-haves such as persistent identifiers. That would require the researchers to standardise beyond their own computer and that’s asking too much.

The final lesson learned sounded familiar: it took more, much more time than anticipated to find out what it is the researchers really want.


SWORD v2

SWORD2: looks promising and useful, and actually rather simple. Keeping the S (for Simple) was a design constraint. Hey, otherwise we'd end up with WORD, and one of those is more than enough!

Version 2 will do full Create/Read/Update/Delete (CRUD), though a service can always be configured to deny certain actions. It's modelled on Google's GData and makes elegant use of Resource Maps and dedicated action URLs.
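To give a flavour of the protocol: a deposit lifecycle might look roughly like the sketch below. This is my own illustration, not taken from the spec - the endpoint URL and the Atom entry are made up, and a real client would first read the server's service document to discover collection URIs.

```python
from urllib.request import Request

# Hypothetical endpoint -- a real SWORD2 server advertises its
# collection URIs in a service document.
COLLECTION_URI = "https://repo.example.org/sword2/collection/articles"

ATOM_ENTRY = """<?xml version="1.0" encoding="utf-8"?>
<entry xmlns="http://www.w3.org/2005/Atom"
       xmlns:dcterms="http://purl.org/dc/terms/">
  <title>An example deposit</title>
  <dcterms:abstract>A metadata-only deposit, for the sketch.</dcterms:abstract>
</entry>"""

def create(collection_uri, entry_xml):
    """POST an Atom entry to a collection URI. The server answers with a
    deposit receipt containing an edit URI for later updates."""
    return Request(collection_uri,
                   data=entry_xml.encode("utf-8"),
                   headers={"Content-Type": "application/atom+xml;type=entry",
                            "In-Progress": "true"},
                   method="POST")

def update(edit_uri, entry_xml):
    """PUT a replacement entry to the edit URI from the deposit receipt."""
    return Request(edit_uri,
                   data=entry_xml.encode("utf-8"),
                   headers={"Content-Type": "application/atom+xml;type=entry"},
                   method="PUT")

def delete(edit_uri):
    """DELETE the object entirely -- a server may be configured to refuse."""
    return Request(edit_uri, method="DELETE")

# urllib.request.urlopen(create(COLLECTION_URI, ATOM_ENTRY)) would send it.
```

The nice part is that this is all plain HTTP plus Atom, so any language with an HTTP client can play.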

CottageLabs, one of the design partners, made a nice introductory video about SWORD v2, demonstrating how it works:





It looks really useful and indeed still easy (as per Einstein's famous quip: as simple as possible, but not simpler). If you're a techie, dive into SwordApp.org. If you're not, just add SWORD compliance to your project requirements!


Ethicshare & Humbox, two sessions on community building

Two examples of successful subject-oriented communities that feature a repository, each with some good ideas to nick.

Ethicshare is a community repository for bioethics that aggregates content and adds social features:

  • one of the project partners is a computer scientist who studies social communities. Because of this mutual interest (for the programmer it’s more than just a job) they have had the resources to fine tune the site.
  • the field has a strong professional society that they closely work with.
  • glitches at beginning were a strong deterrent to success - so yes, release early and often, but not with crippling bugs!
  • the most popular feature is a folder for gathering links, and many people choose to make them public (it’s private by default).
  • before offering it to the whole site, new features are tried out on a small, active group of around 30 testers.
  • for the next grant phase they needed more users quickly, so they bought ads. $300 in Facebook ads yielded 500 clickthroughs; $2000 in Google ads, 5000. This (likely) contributed to the number of unique visitors rising from 4k to 20k per month. Tentative conclusion: these ads cost relatively little and are effective for such a specialized subject, since the targeting is really quite good.


Lessons from the UK based Humbox project:

  • approach: analyse what scientists were doing already in real life, in paper and file cabinets, mimic it and extend it.
  • "the repository is not about research papers, it is about the people who write them": the profile page is the heart, putting the user at the centre. Like Facebook's, it has two distinct views: an outside version about you (to show off) and an internal version for you (with your interests). This reminds me of the success of the original, pre-Yahoo Delicious, which also cleverly put self-interest first, with social sharing as a side-effect.
  • Find a need that's not covered by existing systems: Humbox fills a need to share stuff not just with students - for that, the LCMS is the natural place to go - but with colleagues, since the course-centric nature of LCMSs tends to lock colleagues out.
  • Most feedback came from community workshops. Participants often became local evangelists.
  • Comments often were corrections. 60% of the authors changed a resource after a comment - and the 40% of comments not leading to a correction also included positive remarks, so the attitude towards criticism was quite positive.
  • over 50% of users modified or augmented material from others, sometimes reuploading it to the site.
  • Humbox only takes Creative Commons licenses, with an educational side-effect: some users indicated they also started looking in other places (such as flickr) for cc material as a result.


The Learning Registry: “Social Networking for Metadata”
slides [google docs]

I just want to mention this for the sheer scope and size of this initiative. It's [expletive] ambitious.

The aim is to gather all social networking metadata! To limit the scope, they won't do normalising or offer search or a query API; that's all left to the users of the gathered dataset. And by all, they mean everything on the net: data, metadata and paradata (by which I understand they mean the relationships with other data).

Agreements are in the works with major partners (see the last slide). The big elephant in the room was Facebook (no surprise, sigh), which wasn't mentioned at all. (As I'm writing this, Google+ has just been announced; there is some hope after all of the slightly creepy evil eventually triumphing over the even more evil.)

They call their approach a do-ocracy. Very agile design principles. Real-time everything in the open: all code and specs are written directly in Google Docs (the table of contents is a Google spreadsheet). NoSQL master-master storage system, well thought-out architecture; production will run on EC2. Everything will be open, except data harvested from commercial partners.

Something to keep an eye on: www.learningregistry.org.




Finally...


MODS is the new DC. In recent projects, MODS seems to have replaced Dublin Core as the baseline standard for metadata exchange. Interesting development.
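For anyone who hasn't run into MODS yet, a minimal record can be sketched and parsed like this. The record below is hand-written for illustration (the element names come from the MODS schema; the title and author are made up):

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"

# A minimal hand-written record: a title, a personal name, a resource type.
MODS_RECORD = f"""<mods xmlns="{MODS_NS}">
  <titleInfo><title>An example thesis</title></titleInfo>
  <name type="personal"><namePart>Jane Doe</namePart></name>
  <typeOfResource>text</typeOfResource>
</mods>"""

root = ET.fromstring(MODS_RECORD)
# ElementTree wants the namespace in braces when navigating.
title = root.findtext(f"{{{MODS_NS}}}titleInfo/{{{MODS_NS}}}title")
print(title)  # An example thesis
```

Compared to flat Dublin Core, the nesting (titleInfo, name/namePart) is exactly what makes MODS richer as an exchange baseline.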

Wednesday, June 22, 2011

OR11: opening plenary

See also: OR11 overview

The opening session by Jim Jagielski, President of the Apache Software Foundation, focussed on how to make an open source development project viable, whether it produces code or concepts. As El Reg reports today, doing open source is hard. The ASF has unique experience in running open projects (see also: is apache open by rule). Much nodding in agreement all around, as what he said made good sense, but it is hard to put into practice. Some choice advice:

Communication is all-important. Despite all the new media that come and go, the mailing list is still king. Any communication that happens elsewhere - wikis, IRC, blogs, twitter, FB, etc. - needs to be (re)posted to the list before it officially exists and can be considered. A mailing list is a communication channel that is asynchronous and that participants control themselves, meaning they read or skip it at a time of their choosing, not a time mandated by the medium. A searchable archive of the list is a must.

Software development needs a meritocracy. Merit is built up over time. It's important that merit never expires, as many open source committers are volunteers who need to be able to take time off when life gets in the way (babies, a job change, etc).

You need at least three active committers. Why three? So they can take a vote without getting stuck. You also need 'enough eyeballs' to go over a patch or proposal. A vote at the ASF needs at least three positive votes and no negatives.
To create a community, you also need a 'shepherd', someone who is knowledgeable yet approachable by newbies. It's vital to keep a community open, so as not to let the talent pool become too small. To stay attractive, you need to find out what 'itch' your audience wants to scratch.

The more 'idealistic' software licenses (GPL and all) are "a boon firstmost to lawyers", because terms like 'share alike' and 'commercial use' are not (yet) clear in a juridical context. Choosing an idealistic license can limit the size of the community for projects where companies play a major role. A commenter added that this mirrors the problems of the Creative Commons licenses. In a way, the Apache license mirrors CC0, which CC created to tackle those problems.

Friday, June 05, 2009

OR09: Four approaches to implementing Fedora

Open repositories 2009, day three, afternoon.

So far, the conference had not been disappointing, but now it got really interesting. The sessions I followed in the afternoon each highlighted a specific approach to the problem that IMHO has been standing in the way of wider Fedora acceptance: middleware.

What these four have in common is that they all leverage an existing OSS product and adapt it to use Fedora as a datastore.


1. Facilitating Wiki/Repository Communication with Metadata - Laura M. Bartolo

Summary: an interesting approach - a traditional Fez spiced up with MediaWiki. With minimal coding, a relatively seamless integration.
For this to work, contributors need to know MediaWiki markup, and to really integrate, they must learn the Fez-specific search markup. Also, I'm not sure how well this can be scaled up to true compound objects, given Fez's limitations.

Notes:
Goal: dissemination of research resources. Specific sites for specific science fields, e.g. the soft matter wiki and the materials failure case studies.
MatDL: has a repository (Fedora+Fez), wants to open up two-way communication. Example: the soft matter expert community, set up with MediaWiki. "MediaWiki hugely lowers the barrier for participating": familiarity gives a low learning curve.

The question: how to integrate the repository with the wiki two-way.

Thinking from a user-centric approach: accommodate the user; support complex objects (more useful for research & teaching) and thus describe their parts as individual objects.

Components:
- Wiki2Fedora
Batch run. Finds wiki upload files, converts referencing wiki pages to DC metadata for ingest in the repository. (The wiki has comment, rights and author sections -> very doable.) Manual post-processing (the Fez Admin review area function).
- Search results plug-in for the wiki: displays repository results in wiki search. Adds to the MediaWiki markup, to enable writing standard Fez queries in the content.

Sites: Repository - Wiki


2. Fedora and Django for an image repository: a new front-end - Peter Herndon (Memorial Sloan-Kettering Cancer Center)


Summary: using Django as a CMS, internally developed adapters to Fedora 3.1.

My gut feeling: a specific use case, images only, so rather limited in scope. Despite choosing the 'hook up with a mainstream package' strategy, it is effectively still NIH-style rolling your own. That makes the issues even more instructive.

Notes:
Adapting a CMS that expects SQL underneath is challenging - the plugin needs to be a full object-to-relational database mapper.
Also, Fedora 'delete' caused 'undesired results'; 'inactive' should be used instead.
Further, some more unexpected oddities: they had to write their own LDAP plugin to make it work, and Django has tagging, but again a plugin was needed to limit it to controlled vocabularies. Performance was not a problem.
Interesting: the repository is for images only, so EXIF and the like can be used - tags are added using Adobe Bridge! The tested, successful strategy: make use of what is already familiar.
In the Q&A the question came up: why use Fedora in this case anyway? Indeed, the only reason would be preservation; otherwise it would have saved a lot of trouble to use Django Blobstore.

The django-fedora plugins are available at bitbucket.org.



3. Islandora: a Drupal/Fedora Repository System - Mark A Leggott (University of PEI)

Summary:
Islandora looks *very* promising. I noted before (UPEI's Drupal VRE strategy) that UPEI is a place to watch - they are making radical choices with impressive outcomes.

Notes:
UPEI's culture is opensource friendly. They use Moodle and Evergreen (apparently, they were the first Evergreen site in production).

Rationale: opensourcing an in-house system reinforces good behaviour: full documentation, quality code.

As noted before, UPEI's repositories are hidden behind VREs (see [link]). VREs are geared towards the researchers. An example of the approach: the first thing people do when they set up a VRE is create a webpage. That's what a project needs, and so it's used as a hook to reel people in; they're up and running within a few hours.

The VRE is Drupal; Fedora is for data assets, metadata, policies.
Base Islandora consists of three plugins: Drupal-Fedora connection plugin, xacml filter, rule engine for searches.

This 'rule engine' is indeed very cool.
In a later private conversation with Mark Leggott, he clarified that Islandora indeed uses an atomistic complex object model for research data; the rule engine declares how these can be searched from within Drupal. Example: a dataset consisting of a number of measuring points, each with a set of instruments, is stored atomistically in Fedora; it can be queried as 'all the results from a specific measuring point', 'all the results from instrument x', 'instrument x in a specific period', etc.
We haven't reached Nirvana yet: to make the deconstruction of the data objects possible, they have to adhere to a specific format (XML). But it's impressive nevertheless.


Other Drupal plugins add functionality for specific data. An impressive example: the Drupal FCK editor used as a TEI editor; after editing, it automatically adds a version to the datastream. Very cool and 'Just Works' (cheery tweet).

Marine Natural Products Lab: the best example of the VRE setup, which includes an extensive repository (searchable within the critter XML).

Previous versions used Drupal 5/Fedora 2 and are not maintained; the current version is Drupal 6/Fedora 3.1.

Q: did you replace the Drupal storage layer, or do you sync?
A: sometimes it's saved in the Drupal layer, when it doesn't need to go into Fedora (temporary data, while we build the content model). The Drupal filesystem is a potential bottleneck with large data blobs.

Q: are you bound to content models?
A: standard Fedora CMs; you can build them yourself or change the delivered ones. The models are exposed, so you can see how it works. We first installed Fez to see how Fedora worked.


4. Project Hydra: Designing & Building a Reusable Framework for Multipurpose, Multifunction, Multi-institutional Repository-Powered Solutions - Tom Cramer (Stanford University), Richard Green (University of Hull), Bess Sadler (University of Virginia) et al.



Summary:
I'm even more excited about Hydra than about Islandora. It takes a different approach: create "a lego set of services" - in other words, a toolkit for the common parts of applications.
It all looks really good. Two gotchas, though. Firstly, it is still a work in progress - can we afford to wait? Secondly, there are issues with the Unicode support of Ruby on Rails.

For more info: D-Lib.

Notes:
Modelled after the 12+ current use cases of repositories in use at partner institutions, both institutional and personal.
It needs generic templates - which sometimes may do the job by themselves - otherwise it won't get off the ground.
Hydra will have common content models and datastream names, but ultimately they want Hydra to be able to cope with almost anything. A MODS datastream will always have to be there, but not necessarily as the primary one (so it can be provided via a disseminator).

Four multifunctional sections:
  • deposit
  • manage (edit objects, set access)
  • search & browse
  • deliver
  • plus plumbing: authentication, authorization, complex workflow
Using Rails with ActiveFedora. It turns out Rails lives up to its reputation: they are way ahead of their initial roadmap and now expect a full production app by fall.

Specs are 3/4 ready, coding 1.5/4.
Demo: http://hydra-dev.stanford.edu/etds

The presentation layer builds on top of the Blacklight OPAC. Virginia already has a beta version of their catalogue up using Blacklight.

Saturday, May 30, 2009

OR09: Repository workflows: LoC's KISS approach to workflow

Open Repositories 2009, day 2, session 6b.

Leslie Johnston (Library of Congress)

My summary:

A practical approach to dealing with data from varying sources: keep it as simple as possible, but not simpler.
The ingest tools look very useful for any type of digitization project, especially when working with an external party (such as a specialized scanning company).
The inventory tool may be even more useful, as lifecycle events are generally not well covered by traditional systems, be it a CMS or an ILS.

Background

LoC acts as a durable storage deposit target for widely varying projects and institutions. Data transfers for archiving range from a USB stick in the mail to 2 TB transferred straight over the network. The answer to dealing with this: simple protocols, developed together with uc digilib (see also John Kunze).

Combined, this is not yet a full repository, but it covers many aspects of ingest and archive functionality. The rest will come. Aim: provide persistent access at the file level.

Simple file format: BagIt

The submitter is asked to describe the files in BagIt format.

BagIt is a standard for packaging files; METS files will fit in there, too. However, BagIt was created because we needed something much, much, much simpler. It's not as detailed; the description is a manifest, and it may omit relationships, individual descriptions, etc. It is very lightweight (actually too light: we've started creating further profiles for certain types of content).

LoC will support BagIt similarly and simultaneously to MODS & METS.

Simple tools

Simple tools for ingest:
- parallel receiver (can handle network transfers over rsync, ftp, http, https)
- validator (checks file formats)
- verifyit (checksums files)
These tools are supplied as a Java library, a Java desktop application, and the LocDrop webapp (a prototype for SWORD ingest).

Integration between transfer and inventory is very important: trying to retrieve the correct information later is very hard.

After receiving, the inventory tool records lifecycle events.
Why a standardized tool? 80% of the workflow overlaps between projects.


All tools are available open source [sourceforge]. What's currently missing will be added soon.

Wednesday, May 27, 2009

OR09: Panel session - Insights from Leaders of Open Source Repository Organizations

Open repositories 2009, day 1, session 4.

A panel with the big three open source players (DSpace's Michelle Kimpton and Fedora Commons' Sandy Payette, freshly merged into DuraSpace, and EPrints' Les Carr) and Lee Dirks from Microsoft, whose Zentity 1.0 was officially announced at this conference. That brings up lots of good questions; unfortunately it didn't come to an interesting exchange of ideas.

I’ll concentrate on Microsoft, as they were the elephant in the room. Warning: opinions ahead.

Microsoft is walking a thin line, their stance has been very defensive. Dirks started out quipping that “We wanted to announce Microsoft merging with ePrints, we got together yesterday, but we couldn’t agree on who was going to take over who.”

He went on to stress that this is Microsoft Research and that they're not required to make a profit. Putting on a philanthropist's guise, he continued that their goal is to offer an open source repository solution to organizations that already have campus licenses: "How can we help you use software that you already paid for but maybe don't use?" They claim they don't want to pull people away from open source solutions.

The most interesting parts were what he was *not* saying. Which open source does MS not want to pull us away from - Java? MySQL? Eclipse? Or did he only mean open source repository packages?
Yeah right… getting Visual Studio, IIS, SQL Server and, most dangerous of all, SharePoint a foot in the door.

An audience question nailed the central issue: "The question will be lock-in. Commitment in other parts of the lifecycle is therefore more important. Zentity hooks you up everywhere in the MS stack."
Dirks responded with "Everything we've done is built on open APIs, be it SharePoint or Office or whatever. You could reconstruct it all yourself."

Well, with all respect to the Mono and Wine efforts, I wouldn't call SharePoint and Office APIs you could easily replace. The data will still be in a black box, especially if you want to make any use of the collaboration facilities. Having open APIs on the outside is fine and dandy, but one thing we've learned so far with repositories is that it is hard to create an exchange (metadata) format that is neither too limited nor so complicated that it hinders adoption.

On an audience question about his stance on data preservation, Dirks initially replied that ODF would solve this, including provenance metadata. No mention of the controversy around this file format - what use is an XML format that cannot be understood? - or of file types outside the Office universe.

When this debate stalled, Sandy Payette turned the mood around by mentioning that MS has contributed much to interoperability issues. It is indeed good to keep in mind that MS is not just big and bad - they aren't. A company that employs Accordionguy can't be all that bad. The trouble is, you have to stay aware and awake, for they aren't all that good, either. Imagine an Office-style lock-in for collaboratories.