Tuesday 27 May 2014

What's the best way to harvest an OAI-PMH data provider?



The OAI-PMH protocol defines several verbs which can be used in requests to an OAI-PMH data provider. For harvesting, the most obvious are:
  • ListIdentifiers, which returns a list of record identifiers, in combination with
  • GetRecord, which returns the record for a specified identifier
  • ListRecords, which returns a set of records
Which verb to use?

Some harvesting solutions do a ListIdentifiers and then a GetRecord for each identifier. Others harvest with the ListRecords verb. Both methods yield the same content (ignoring the OAI envelope header and focusing on the record), but the number of HTTP requests obviously differs. Does this have an impact on performance?
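To get a feel for the difference in request counts, here is a rough sketch in Python. The page size of 100 is an assumption for illustration only; each data provider decides how many identifiers or records it returns per response.

```python
import math


def requests_needed(total_records: int, page_size: int) -> dict:
    """Count the HTTP requests each harvesting strategy needs."""
    # One list request per page of results (the last page may be partial).
    pages = math.ceil(total_records / page_size)
    return {
        # Every page of identifiers, plus one GetRecord per identifier.
        "ListIdentifiers+GetRecord": pages + total_records,
        # Records arrive inside the list responses themselves.
        "ListRecords": pages,
    }


# 10,000 records with an assumed page size of 100.
print(requests_needed(total_records=10_000, page_size=100))
# {'ListIdentifiers+GetRecord': 10100, 'ListRecords': 100}
```

Under these assumptions the ListIdentifiers/GetRecord route needs roughly a hundred times more requests, which is why the verb choice is worth measuring.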

Which connection method to use?

Because OAI-PMH runs over HTTP, there are also alternative connection methods to consider, such as HTTP compression (Accept-Encoding: gzip) and connection keep-alive. What is the impact of these connection methods on the performance of OAI-PMH harvesting?
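As an illustration of what these two options look like on the client side, here is a minimal Python sketch using only the standard library. It is not the code used in the test; the `host` and `paths` arguments are placeholders you would fill in for a real data provider.

```python
import gzip
import http.client
from typing import List, Optional


def decode_body(body: bytes, content_encoding: Optional[str]) -> bytes:
    """Return the plain body, undoing gzip if the server applied it."""
    if content_encoding and "gzip" in content_encoding.lower():
        return gzip.decompress(body)
    return body


def fetch_many(host: str, paths: List[str], use_gzip: bool = True) -> List[bytes]:
    """Fetch several paths over a single reusable connection (keep-alive)."""
    headers = {"Connection": "keep-alive"}
    if use_gzip:
        # Ask for compression; the server is free to ignore this.
        headers["Accept-Encoding"] = "gzip"
    conn = http.client.HTTPConnection(host, timeout=30)
    results = []
    try:
        for path in paths:
            conn.request("GET", path, headers=headers)
            response = conn.getresponse()
            # The body must be read completely before the connection is reused.
            body = response.read()
            results.append(decode_body(body, response.getheader("Content-Encoding")))
    finally:
        conn.close()
    return results
```

Whether the connection is really kept open and whether the response is really compressed depends on the server, so it is worth checking the response headers rather than assuming.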

Test method

To answer these questions the following test was conducted. Four (Dutch) OAI-PMH data providers were selected. For each data provider a harvest of about 10,000 records was done with both the ListIdentifiers/GetRecord method and the ListRecords method. For each of these tests the standard connection was timed, as well as a gzipped, a gzipped+keep-alive and a keep-alive connection. These 16 tests were carried out twice and the average of the elapsed times was analyzed.
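The metric derived from those timings is simple arithmetic, sketched below in Python. The timings in the example are made up for illustration and are not measurements from this test.

```python
def records_per_minute(record_count, elapsed_seconds_runs):
    """Average the elapsed times of repeated runs and convert to records/minute."""
    mean_seconds = sum(elapsed_seconds_runs) / len(elapsed_seconds_runs)
    return record_count / (mean_seconds / 60.0)


# Hypothetical example: 10,000 records harvested twice, in 290 s and 310 s.
print(records_per_minute(10_000, [290.0, 310.0]))  # 2000.0
```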

Test results

The graph below shows the results of these tests, expressed as the number of records per minute for each connection type (more = better).
The complete results as well as the graph are available in a Google Spreadsheet. WebPageTest was used to determine whether a connection type was supported by a data provider. The tests were carried out with 2 Perl scripts and a Bash file which, together with the resulting output, can be downloaded here.

Conclusions

  • Clearly, you get better performance (a higher number of records per minute in a harvest) when you use the ListRecords method. Fewer HTTP requests result in faster harvests (about 2.5 to 11 times faster!)
  • The benefit of keep-alive and gzip varies per OAI-PMH data provider. In general: if a data provider supports keep-alive and/or gzip, you'd better use it, as it improves performance. Your mileage may vary per data provider, so test which combination works best.

Final notes
  • Although this test was conceived to show the difference in verb usage and connection type, it also shows that some data providers perform better than others. Room for improvement...
  • Those who have inspected the Perl scripts might wonder why the "Beeld en Geluid" and "Open Beelden" data providers receive other parameters. These two data providers do not work when you send the metadataPrefix and the resumptionToken together. That is in fact what the OAI-PMH version 2.0 standard prescribes: metadataPrefix is required on the first ListIdentifiers or ListRecords request, but resumptionToken is an exclusive argument, so a follow-up request should carry only the verb and the token. The other data providers appear to tolerate the extra argument.
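
One way to build requests that also work with data providers that reject metadataPrefix and resumptionToken together is to send metadataPrefix only on the first request and only the verb and the token afterwards. Below is a minimal Python sketch of that idea; the function names and the default prefix `oai_dc` are choices made for this example, not taken from the Perl scripts.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespace of OAI-PMH 2.0 responses.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def build_query(verb, metadata_prefix="oai_dc", resumption_token=None):
    """Build the query string for a first or a follow-up list request."""
    if resumption_token:
        # Follow-up request: only the verb and the token.
        params = {"verb": verb, "resumptionToken": resumption_token}
    else:
        # First request: the verb and the metadata format.
        params = {"verb": verb, "metadataPrefix": metadata_prefix}
    return urlencode(params)


def next_token(response_xml):
    """Return the resumptionToken in a response, or None when the list is complete."""
    root = ET.fromstring(response_xml)
    element = root.find(f".//{OAI_NS}resumptionToken")
    if element is None or not (element.text or "").strip():
        # A missing or empty token means there are no more pages.
        return None
    return element.text.strip()


print(build_query("ListRecords"))
# verb=ListRecords&metadataPrefix=oai_dc
print(build_query("ListRecords", resumption_token="abc123"))
# verb=ListRecords&resumptionToken=abc123
```

A harvest loop would append the query string to the provider's base URL, fetch the page, and repeat with `next_token(...)` until it returns `None`.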