Digital Humanities Text/Data: Acquiring and Preparing a Corpus of Texts

This session focuses on the technical dimensions of corpus development. Using an array of printed matter -- from digital facsimiles of incunabula to modern letterpress and offset books -- we will explore the risks and benefits of optical character recognition (OCR), file formatting and naming issues, organization strategies for large corpora, and problems of data cleaning and preparation. We will look at some sources for textual research data, such as Project Gutenberg, the Internet Archive, and Google Books, and discuss common legal concerns around the use of textual corpora.

Learning Outcomes: Participants will learn how (and where) to assemble a body of texts for analysis, what characteristics those texts should exhibit, and what potential pitfalls -- legal and technical -- exist in the process of corpus acquisition.