SCDS Student Subyeta Haque on Building a Digital Edition

Eloisa: Or, a Series of Original Letters by Jean-Jacques Rousseau was the best-known and most scandalous novel of its time. Published in 1761, the novel follows an upper-class lady and her love affair with her tutor, St. Preux, a man of lower social standing. At the time, this was a very controversial premise, and the book attracted both criticism and praise.

18th century engraving of three European people wearing courtly dress in a garden.
Fig. 1. “Le premier baiser de l’amour.” An illustration from the French edition of the novel. Eloisa kisses St. Preux before fainting on her cousin, Clara.

Despite its past infamy, the novel remains unknown to many, and most people come across it only in academic settings. This can be attributed to the lack of a modern edition of the first English translation of the book.

In the summer of 2022, we began making a digital and accessible version of the book at the Lewis and Ruth Sherman Centre for Digital Scholarship. Dr. Veronica Litt is running this project, and I am in charge of editing this novel.

The Editing Process

The novel’s first English translation is available only as scanned copies of eighteenth- and nineteenth-century editions of the book. Because these editions are over a hundred years old, the scans usually have errors such as missing lines. They are also full of splotches and letters that look different from their modern counterparts.

Scan of a page from the 1761 edition of Eloisa.
Fig. 2. An extract from the 1761 edition of Volume I of Eloisa.

When these pages are processed by an Optical Character Recognition (OCR) program, the resulting text file is full of errors. They range from additional punctuation to changed letters to missing words.

Screenshot of the OCR file generated. Shows basic text with plenty of red underlines signifying errors in the conversion.
Fig. 3. An extract from the OCR text file of Volume I of the novel.

My responsibilities include editing this code-like text file into something more readable, something that people can understand.

Correctly formatted version of the novel's title page.
Fig. 4. The edited version of the extract from Volume I, as shown in the figure above.
Correctly formatted version of the novel's first page of text.
Fig. 5. The edited version of the extract from Volume I, as shown in the figure above.

As shown, the changes are considerable, and if you were to compare the digital copy to a second edition of the novel, you would find many similarities. This is because, when deciding how to approach the process, I chose to keep the modern version as close to the original as possible.

I believe this is best for two reasons: a) it preserves the look of the original novel, and b) it preserves any meanings that may be present. For example, I keep the spelling of words exactly the same: ‘republick’ remains as such and is not changed to the modern ‘republic’, in order to preserve the eighteenth-century writing style. I also keep the original punctuation and capitalization, such as leaving ‘Thus’ capitalized in the middle of a sentence rather than changing it to ‘thus’, to preserve the grammatical conventions of the time.

This approach also includes maintaining a similar structure, such as adding a line at the end of each letter to divide the page and indicate the beginning of a new letter, and preserving font sizes for aesthetic purposes. These features will appear in one of the versions; the other will be an accessible version that follows modern standards.

Common Errors

Table 1. Examples of common errors made by OCR software.

The OCR text file contains many inaccuracies. Some examples are shown in the table above. These mistakes often happen because the software is not accustomed to eighteenth-century typefaces. If you take a look at the second example, you will notice that ‘1’ and ‘I’ have very similar shapes, which makes the machine recognize the ‘I’ as ‘1’.

The first example, as you can see, has the ‘s’ turning into ‘f’; this happens very often: ‘wife’ should be ‘wise’ and ‘susser’ should be ‘suffer’. This is because of the ‘long s’ (ſ). In eighteenth-century books, the letter ‘s’ was often printed in this form, which closely resembles an ‘f’.

Scan of long S from an eighteenth century text.
Fig. 6. The long S (Image Credit: Shutterstock)

Sometimes the OCR does the reverse, and an ‘f’ turns into an ‘s’. Other tendencies include adding a comma at the start of every new paragraph.

These errors repeat, and after editing two volumes, I already know what the correction will be without looking at the book! This has led me to be more efficient.
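Because these errors follow patterns, a short script could flag likely cases for an editor to check against the scan. The following is only a minimal sketch in Python, not the method used on this project: the function names and the tiny word list are hypothetical, and a real run would need a much larger word list that includes period spellings such as ‘republick’. It only flags suspects and never changes the text, so every decision stays with the editor.

```python
import re
from itertools import product

# Tiny illustrative word list (an assumption for this sketch); a real one
# would be far larger and would include period spellings like "republick".
KNOWN_WORDS = {"wise", "wife", "suffer", "i", "am", "to", "and", "the", "republick"}


def f_s_variants(word):
    """Yield every spelling made by swapping one or more 'f'/'s' letters."""
    options = [("f", "s") if ch in "fs" else (ch,) for ch in word]
    for combo in product(*options):
        candidate = "".join(combo)
        if candidate != word:
            yield candidate


def flag_suspects(text, known_words=KNOWN_WORDS):
    """Return (line_number, token, note) tuples for a human editor to review."""
    flags = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Stray comma at the start of a paragraph.
        if line.lstrip().startswith(","):
            flags.append((number, ",", "stray comma at start of paragraph"))
        for token in re.findall(r"[A-Za-z0-9]+", line):
            # A lone '1' is often a misread capital 'I'.
            if token == "1":
                flags.append((number, token, "possibly 'I'"))
                continue
            lower = token.lower()
            matches = sorted(v for v in f_s_variants(lower) if v in known_words)
            if not matches:
                continue
            if lower in known_words:
                # Both readings are real words, e.g. 'wife' and 'wise'.
                note = "valid word, but could also be: " + ", ".join(matches)
            else:
                note = "possibly: " + ", ".join(matches)
            flags.append((number, token, note))
    return flags


if __name__ == "__main__":
    sample = ", 1 am wife\nto susser"
    for flag in flag_suspects(sample):
        print(flag)
```

Words such as ‘wife’ are valid either way, so the sketch can only mark them as ambiguous; the scan of the original page remains the final authority.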

To All the Editors Reading This

After editing hundreds of pages, I have learned many lessons that I would like to share. Firstly, I would recommend that you take a few days to play around and see how different methods work for you. Test out the sizes of tabs when working on a single screen, and if you have multiple screens, see if that works for you as well. Try taking frequent breaks to do something you like; it can be anything from cooking to dancing to taking a nap.

And most importantly, do not lose patience. This kind of work requires a lot of focus, so take your time and do things at your own pace; if necessary, ask for more time.
