Acquiring and Preparing a Corpus of Texts (Digital Humanities Workshop Series: Text/Data)

This session focuses on the technical dimensions of corpus development. Using an array of printed matter, from digital facsimiles of incunabula to modern letterpress and offset books, we will explore the risks and benefits of optical character recognition (OCR); file formatting and naming issues; organization strategies for large corpora; and problems of data cleaning and preparation. We will also look at some common sources of textual research data, such as Project Gutenberg, the Internet Archive, and Google Books, and discuss common legal concerns around the use of textual corpora.

** This workshop is offered for RCR credit as GS712.15. Participants who plan to receive RCR credit (as indicated on the registration form) will receive priority registration.