Text/Data (RCR Days): Acquiring and Preparing a Corpus of Texts

This workshop focuses on the technical dimensions of corpus development. Using an array of printed matter -- from digital facsimiles of incunabula to modern letterpress and offset books -- we will explore the risks and benefits of optical character recognition (OCR); file formatting and naming issues; organization strategies for large corpora; and problems of data cleaning and preparation. We will also look at some common sources of textual research data, such as Project Gutenberg, the Internet Archive, and Google Books. While this session will not examine legal issues in detail, we will discuss some common legal concerns around the use of textual corpora.

Tuesday, October 9, 2018
10:00am - 12:00pm
Bostock 121 (Murthy Digital Studio)
West Campus
Digital Scholarship  
Registration has closed.

Event Organizer

Will Shaw

Digital Humanities Consultant, Duke University Libraries