Text/Data (RCR Days): Acquiring and Preparing a Corpus of Texts

This workshop focuses on the technical dimensions of corpus development. Using an array of printed matter, from digital facsimiles of incunabula to modern letterpress and offset books, we will explore the risks and benefits of optical character recognition (OCR); file formatting and naming issues; organization strategies for large corpora; and problems of data cleaning and preparation. We will also look at some common sources of textual research data, such as Project Gutenberg, the Internet Archive, and Google Books. While this session will not examine legal issues in detail, we will discuss some common legal concerns around the use of textual corpora.