Following on from part 1 which reviewed Jorum’s requirements for RSS-based deposit, this section synthesises the comments and feedback emerging in response to it.
In response to the requirements and position papers a number of feeds where submitted for testing and there has been some thoughtful reflection on the issues in the blogs and by email. This is a brief summary of responses to the key issues:
Issues around generating the RSS
Although most platforms in use can easily create RSS feeds and some can create a feed from any search result, it has become clear that, the RSS profile that is created is frequently fixed and does not match the profile requested by Jorum (which is very similar to the profile suggested by OCWC).
Irrespective of repository software of other OER management ‘platforms’ in use, adjusting the RSS output profile has proved to be a non-trivial task. Emerging issues in adjusting the RSS outputs include:
- Users of commercial platforms may have to rely on the company’s developers and development schedule.
- Open source platforms may require additional local coding or at the least will require adjusting an XLST.
- It is likely that the RSS output of web 2.0 platforms will simply not be editable.
In all three cases there may also be possible solutions that utilise independent tools, such as Yahoo Pipes, to process the feed after production or create a feed from another interface. However, such an approach to adjust the RSS profile is either still reliant on the information present in the original source feed or is dependent on adding standard profile information or extracting additional information and creating a new feed. See for example http://repositorynews.wordpress.com/2010/01/07/really-not-so-simple-syndication/.
Xpert note that of the 60 feeds they harvest, 5 aren’t valid xml and 20 aren’t valid RSS (see comment on Lorna’s post). It is worth noting that aggregators can and do deal with poorly formed data, however, in the timescale of the programme the kind of manual effort involved in dealing with poorly formed feeds (and the quality of the item metadata they would generate) is not likely to be supportable.
With the exception of some web 2.0 platforms and (potentially) some commercial repositories a revised or reprocessed RSS feed to meet Jorum’s requirements is, in theory, a possibility. However, few projects currently produce RSS meeting Jorum’s required RSS profile. From the project trials thus far, one project has so far been able to conform sufficiently to permit successful harvest. In the context of UKOER, adjusting the RSS profile requires the right congruence of platform(s), skills, and time within the project team. Hence it is unlikely that this solution will work for all projects who need a bulk upload option. In terms of the longer term feasibility of using RSS to facilitate bulk deposit, this which may change over time, particularly if the OCWC profile is adopted more widely.
Issues around metadata
The following issues were noted about the feed content:
- There is tentative agreement that it would be good if RSS feeds used a DC namespace where possible and ideally supporting the OCWC profile.
- The addition of custom elements (for example for the purposes of tracking OER currency) is not regarded as a good idea.
- It cannot be taken for granted that the item identifier in a feed is the same as the identifier of the OER within the platform.
- It cannot currently be assumed that the item identifier in a feed is either unique or persistent. This is a critical issue for processing the feed.
- There may be multiple feed identifiers for a given OER.
- Feeds may contain more than one namespace and / or feed formatting
- In a number of repository platforms the identifier supplied frequently points to the splash page rather than the OER itself. This is an issue if the resource itself is to be harvested.
- Few feeds have rights information for the items or for the feed itself. Including this information is regarded as good practice.
- Feeds that use one of the variants of Creative Commons encoding may allow aggregators and Jorum to provide enhanced services.
- Projects should clearly license their feeds and underlying items.
- In the last resort projects should state clearly on their site or by telling Jorum what rights and licensing exists in connection to their feeds.
- Metadata quality (including completeness) in feeds is variable.
Issues around processing the RSS
There are a number of issues about feed size, currency, updating, identifiers and OER deletion but these depend on whether the service is collecting information to help point to current OERs (like an aggregator) or whether it is seeking to provide a central collection of OERs (a library – even if some of those OERs are actually elsewhere). These distinctions are not clearly made in much of the discussion.
Pushing everything into one classification seems to negate the classification work done by projects and many projects will be producing OERs with more than one JACS code. This could be addressed by having multiple feeds per platform (either from the platform or by subsequently dividing the feed) but there are potential duplication management issues with multiple feeds.
News feeds are the most common use of RSS or Atom, they typically are limited to a fixed number of most recent items (this does not preclude multiple subject based feeds from a repository as mentioned above). They can, however, contain the entire contents of the repository or all the results for the search term.
If the feed contains the entire contents of the repository (or search) it inevitably becomes very large. Large feeds tend to time out in browsers and can be difficult to ingest (as outlined in Xpert’s paper). However, many aggregator services prefer this approach as it provides a straightforward way to maintain currency, avoid duplication, and not have to consider partial deletion. This is because each time the feeds are polled/ gathered the previous index created by the aggregator is deleted. Only the content currently in the feed is indexed.
Magazine type feeds are the most common form of RSS and are more likely to be the default feed produced by repositories or other platforms ; they are usually small. However, to build an aggregation service or collection based upon them would require items from feeds to be stored in an incrementally built index (i.e. new items from feeds are added to a persistent index that retains their information even after they are no longer present in the feed). This works if there are unique and persistent identifiers for feed items or OERs included in the feed record and OERS do not end up with multiple feed identifiers.
I’d suggest that the discussion about how to tell if a given OER has updated is a management and policy question to do with versioning and should be out of scope for this discussion. If a uri/url is provided for an OER, I think subsequent versions of the OER should have different urls as they are different things! There is a difference between an academic’s view of an OER as constantly in flux and a digital asset management perspective which needs a clear notion of the persistence or fixity of an released OER.
The discussion of how often feeds should be polled to check for new items is something which has to be agreed. It will impact on a number of issues and is affected by the type of feeds being consumed (magazine feeds will need to be polled more frequently) and will impact on the performance of the index.
Jorum have currently indicated that uploading OERs via RSS is out of scope. Upload would probably require some form of persistent and locally unique identifier for each OER to be included in the feed.
There are wider questions in connection to deletion from Jorum, but in the context of RSS link deposit, deletion is only an issue if Jorum opts for some form of incremental built index. RSS is not designed to manage the deletion of items.
Overview of combinations of RSS options
I’ve created this table to try to pull together some of the interdependent issues relating to feed processing.
|Feed of all OERS||Feed of all OERS||Subject feed of OERS||Subject feed of OERS||Magazine feed of OERS||Magazine feed of OERS|
|Feed size||Very Big||Very Big||‘Medium’||‘Medium’||Small||Small|
|Update||Replace||Incremental addition||Update||Incremental addition||Incremental addition||Update|
|Coverage||Whole current collection||Whole cumulative collection||Current subject collection||Cumulative subject collection||Whole collection (gradually)||Transient snapshot of collection|
|Deletion||occurs as a feed is replaced||does not occur automatically||occurs as a feed is replaced||does not occur automatically||does not occur automatically||occurs as a feed is replaced|
|OER Deduplication?||Not significant||Issue||Minor issue||Issue||Issue||Not significant|
As a precursor, JorumOpen does not currently act as an OAI harvester so this is a somewhat moot point (Note: the software required for OAI-PMH consumption is distinct from the software needed for harvesting).
Within the programme not all OER producers are using repositories and of those that are not all repositories have OAI-PMH enabled. So, although there is some established practice of harvesting metadata via OAI-PMH it would be at best a partial solution.
There are pieces of software which can add support for OAI-PMH export but adapting and implementing them creates an additional development task for projects.
OAI-PMH harvesting has some built in support for resumption (incremental harvesting of metadata from large repositories) and has some support for record deletion but this is not always well supported.
OAI-PMH harvesting services have a mixed record – a key point of note is that they invariably need time to set up.
OAI-PMH harvesting will face many similar issues to RSS harvesting – in that identifiers will point to splash pages and that resources themselves are not harvested.
As a precursor: JorumOpen does not support SRU based harvesting so this is another moot point.
Considering SRU would require this functionality in both contributing repositories other platforms and into Jorum. It is, however, not yet widely supported or used. Though some commercial repositories do implement it and there are open source clients to bolt-on to repositories or other platforms. This is requires developer time and is likely to be a partial solution only.
The current deposit tool is based on third party software MRCUTE which runs through Moodle – as such it cannot easily be adapted by Jorum to provide an API.
Jorum are however, exploring the addition of a SWORD deposit endpoint. Suitable SWORD deposit tools would need to be identified (ie those that can handle the right metadata and cope with something that isn’t a research paper – given the research focus of SWORD tool development these are likely to be some of the less-developed tools ).