
OpenRefine to clean inconsistent categorical data

[Online] Categorical data in spreadsheets, such as state names or subject tags, often need to be cleaned before visualization or analysis so that capitalization, spelling, and abbreviations are consistent. Doing this by hand is painstaking and prone to human error. OpenRefine is a free, open-source tool for cleaning and transforming data that can make this process quick, easy, and reproducible. It has many sophisticated clustering methods built in to help you match similar chunks of text and combine them into consistent data entries. In this workshop, I'll lead you through a quick example of cleaning up bibliographic records, but the techniques are applicable across many datasets. This event is open to non-Duke participants.
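
To give a rough sense of what clustering similar text means, here is a minimal Python sketch of a key-based grouping, loosely modeled on the fingerprint idea OpenRefine offers among its clustering methods. It is a simplified illustration and not OpenRefine's own code; the function names `fingerprint` and `cluster` and the sample values are assumptions made up for this example.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Build a simplified key that ignores case, accents, punctuation, and word order."""
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", value)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Trim, lowercase, and remove ASCII punctuation.
    text = text.strip().lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace, drop duplicate tokens, and sort them.
    tokens = sorted(set(text.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values that share a key; keep only groups with more than one distinct spelling."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


# Hypothetical sample data: four spellings of one state plus an unrelated entry.
states = ["North Carolina", "north carolina", "Carolina, North", "NORTH CAROLINA ", "Virginia"]
for group in cluster(states):
    print(group)
```

A key like this only catches differences in case, punctuation, accents, and word order. Abbreviations such as "NC" or genuine misspellings would need a different method, such as one based on edit distance.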

Software: To play along with the example exercise you'll need to install a recent version of OpenRefine (https://openrefine.org). Click the Download button in the upper left. Click on the Zip file download next to your operating system (Windows or Mac). If you have any trouble, there are detailed installation instructions. Note that for Windows there are two versions – one that includes a built-in version of Java, for those who don’t have Java already installed on their machines, and one that doesn’t. Try launching OpenRefine. The interface will open in your default web browser. (On Windows there may be an extra terminal window that opens – just leave it open.)

A Zoom link will be sent via email to registered participants to join the workshop.

The content of the workshop may be recorded. If you are uncomfortable with a recording being published, please contact the instructor at any time prior to the conclusion of the workshop.

Data Science

Date:
Wednesday, November 13, 2024
Time:
10:00am - 11:00am
Campus:
n/a
Categories:
Data and Visualization  

Registration is required. There are 36 seats available.

Event Organizer

Eric Monson
Center for Data and Visualization Sciences