Editor’s note: This is the second guest post by Kevin Hawkins. His first talked about the process of starting a library publishing program; this one covers options for delivering back content of journals (or other materials) when you become the publisher. This second post is also part of our series on accessibility.
If your library publishing program will absorb any established journals or series of books or technical reports, you will probably want to make previously published content available online in addition to any new content you publish. So what are your options? Below I’ll use the term “back issues” as if we’re talking about a journal, but everything I say could as easily apply to books or technical reports.
The easiest solution would be to create a searchable collection in HathiTrust of these back issues (see this example). Note that you don’t have to be affiliated with a HathiTrust member institution to create a collection; individuals can create collections using guest accounts through the University of Michigan. If the back issues aren’t yet in HathiTrust, you may be able to find a partner institution willing to digitize and submit the issues for you. Your back issues are probably still protected by copyright, so you’ll need to have the rightsholder fill out a permissions agreement authorizing HathiTrust to make the full text viewable. Then you can link to the collection of back issues from the website where you’ll publish new content.
There are, however, some disadvantages to doing this. First, you won’t be able to allow users to search across back issues and new issues in a single interface. Second, HathiTrust’s search functionality uses the OCR text created after scanning, which can contain errors in recognition of the original and which is difficult for visually impaired users to read using assistive technology. Third, a HathiTrust collection provides no index of authors or article titles; a user would need to look at each issue’s table of contents if a fulltext search is insufficient.
There are many vendors that will scan printed matter and deliver those scans as individual page images or multipage PDF files (as you choose). More libraries are equipped to deliver PDF files to users, and that format has the advantage of allowing OCR text to be embedded to allow for fulltext searching. While Adobe Acrobat has built-in OCR capability, many vendors will not only create the OCR for you but even offer OCR correction as an additional service. And some can even create PDF/UA (“universal accessibility”) files, ensuring they can be read with assistive technology.
Unfortunately, none of this provides an automatic index of authors or article titles. A vendor might be able to create this for you as well, or you could create one on your own as a byproduct of XML encoding of the content. Who would do the XML encoding? Probably a vendor, though you might try creating your own using the PKP XML Parsing Service. This could be especially useful if you use the XML Galley Plugin (a standard component of OJS) for new issues, since you could potentially have old and new issues all in the same XML format.
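To make the "index as a byproduct of XML encoding" idea concrete, here is a minimal Python sketch. It assumes a made-up, simplified XML layout (the `issue`, `article`, `title`, and `author` element names and the sample records are all hypothetical); the XML you get from a vendor or from a conversion tool will almost certainly use a different schema, so the element names would need to be adjusted.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample data: element names and records are invented
# for illustration and do not reflect any particular XML schema.
SAMPLE = """<issue volume="12" number="3">
  <article>
    <title>Cataloging in Practice</title>
    <author>Rivera, Ana</author>
    <author>Chen, Li</author>
  </article>
  <article>
    <title>Preserving Newsletters</title>
    <author>Chen, Li</author>
  </article>
</issue>"""


def build_author_index(xml_text):
    """Map each author name to a sorted list of their article titles."""
    root = ET.fromstring(xml_text)
    index = defaultdict(list)
    for article in root.iter("article"):
        title = (article.findtext("title") or "").strip()
        for author in article.findall("author"):
            name = (author.text or "").strip()
            if name:  # skip empty author elements
                index[name].append(title)
    # Sort authors alphabetically and titles within each author.
    return {name: sorted(titles) for name, titles in sorted(index.items())}


if __name__ == "__main__":
    for name, titles in build_author_index(SAMPLE).items():
        print(name + ": " + "; ".join(titles))
```

The same loop could be pointed at one XML file per back issue and the results merged, which would give you a single author index covering both old and new content.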
In any case, when working with a vendor, it is best for your contract to specify the quality standard you expect, such as acceptable error rate on scanning pages, acceptable error rate on OCR, and file validation. But if you do this, you’ll want to actually evaluate the digitized content that you receive (likely by sampling random pages) and be prepared to reject content not meeting the standard for the vendor to fix without charge.
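The sampling step can be sketched in a few lines of Python. This is only an illustration: the function names, the sample size, and the 5% threshold used in the example are assumptions, and the actual figures should come from whatever your contract specifies.

```python
import random


def sample_pages(total_pages, sample_size, seed=None):
    """Pick distinct page numbers (1-based) to inspect by hand."""
    rng = random.Random(seed)  # a fixed seed makes the sample repeatable
    size = min(sample_size, total_pages)
    return sorted(rng.sample(range(1, total_pages + 1), size))


def meets_standard(pages_checked, pages_with_errors, max_error_rate):
    """Return True if the observed error rate is within the agreed limit."""
    if pages_checked <= 0:
        raise ValueError("pages_checked must be positive")
    return pages_with_errors / pages_checked <= max_error_rate


if __name__ == "__main__":
    # Hypothetical delivery: 200 scanned pages, inspect 20 of them.
    to_inspect = sample_pages(200, 20, seed=1)
    print("Pages to inspect:", to_inspect)
    # Suppose the reviewer found problems on 2 of the 20 sampled pages
    # and the contract allows at most 5% (an assumed threshold).
    print("Accept delivery:", meets_standard(20, 2, 0.05))
```

Recording the seed alongside the list of sampled pages lets you show the vendor exactly which pages were checked if you end up rejecting a delivery.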
As you can see, there are a number of ways you might proceed. Perhaps you have other ideas? If so, please add comments!