16 July 2012

Crowd sourcing

By Christoph Albers

A new buzzword has emerged in the context of digitizing millions of newspaper pages: "crowd sourcing". To learn more about it, and about why libraries should take a serious look at this method, see the ABC news video.

Crowd sourcing for the California Digital Newspaper Collection (CDNC)

The California Digital Newspaper Collection (CDNC) is pleased to announce the implementation of User Text Correction in its archive.

The CDNC is the largest freely accessible archive of California newspapers. The collection contains nearly 475,000 pages and is still growing, with issues ranging from 1846 to the present. It is available for searching at http://cdnc.ucr.edu. The project is managed and hosted by the Center for Bibliographical Studies and Research (CBSR) at the University of California, Riverside.

User Text Correction (UTC) allows individual users of the CDNC to correct computer-generated text. When newspapers are processed from microfilm or paper originals, optical character recognition (OCR) software is used to generate searchable text. This OCR text, however, is often imperfect, particularly for older newspapers. By correcting it, users improve the CDNC by making more of the text searchable for other users. To our knowledge, the CDNC is the first digital newspaper archive in the US to offer user text correction. To learn more, see the help section of the CDNC.

The CDNC is over six years old and has been supported in part both by the National Digital Newspaper Program, a joint effort by the National Endowment for the Humanities and the Library of Congress, and by the Institute of Museum and Library Services under the provisions of the Library Services and Technology Act, administered in California by the State Librarian. The CDNC has also worked with local institutions around the state to digitize their newspapers, and has started a project to collect current PDFs from California publishers. Please contact us at cbsrinfo@ucr.edu for more information on both projects.

The User Text Correction tool is part of the Veridian software used to host the CDNC. Veridian is developed by DL Consulting and is used by a number of prominent libraries around the world, including the National Library of New Zealand, the Singapore National Library, Princeton, and Cornell. The CDNC is the first archive to make UTC available and has worked closely with DL Consulting on its development, including beta testing by a number of CDNC users.

CDNC users have already corrected thousands of lines of text. Help us make the CDNC a better archive for all, and experience for yourself how fun, and addictive, correcting OCR text can be.

Source: press release from the Center for Bibliographical Studies and Research (CBSR) at the University of California, Riverside.

News Media, Digitisation, Australia, Trove
