Pre-Processing Digitized Texts

We often underestimate our ability to make sense of orthographic errors and alternative spellings such as thcn or shew. Machines are less capable of making these inferences, so OCR text output must often be corrected before computational methods can work with it.

In this module, we’ll use several approaches to correct errors in OCR text output, introduce the concepts of initial data analysis (IDA) and data provenance, and explore how some techniques for correcting OCR errors extend to pre-processing born-digital texts.

Access the online module.

Format

Text Guide

Level

Beginner

Topic

Data Preparation, Digital Humanities, Digitization, Textual Analysis

Workshop/Event Series

Do More with Digital Scholarship

Software

OpenRefine