Nothing pleases the soul of the empirical researcher more than new data. Credible data is a remarkable tool of persuasion: with it you can dazzle your audience, spark exciting new debates or even rekindle old ones. Of course, for all empirical researchers, acquiring this data is challenging. For economic historians, it often involves sitting in an archive all day, spluttering every time you open a dusty tome and massaging a cramped neck every few minutes as your hands ache from transcribing row after row of data.
Admittedly, many historical documents have now been scanned and digitised, but even sitting sequestered at home or in an office sifting through hundreds of online PDFs is highly unappealing. It's an unglamorous and somewhat puzzling way to work, especially since Optical Character Recognition (OCR) has been knocking around for a very long time now. You may already be familiar with OCR tools like Transkribus, Adobe Acrobat and FineReader. For this blog, I'll use the tesseract package for R.
The Transcriber's Problem

We will use the example of potato acreage in Ireland in 1891. This series covers all of Ireland, broken into over 160 Poor Law Union (PLU) areas, from at least 1887 to 1914. In total, this amounts to 54 different tables containing 146,800 data points. This would take me around 50 or so hours to manually transcribe and validate. For a researcher under a lot of time pressure, the opportunity costs of transcription might be too high to pay. OCR may be exactly the machine learning tool you need to quickly feed in lots of tables and transcribe them automatically.
Of course, OCR comes with very high sunk costs. It is simply an alternative research tool, so my intention is not to force some form of Damascene conversion. Certainly, there’s a lot to be excited about, but like manual transcription, it comes with its own set of challenges.
Image Quality Is King!

Source: House of Commons Parliamentary Papers Online, 2005
It is always a good first step to take a close look at the table you intend to OCR. To you and me, this looks like a standard table: we can easily distinguish different rows and columns of data pertaining to different Irish geographies.
But to an OCR engine like Tesseract, there is a lot of additional information that will play havoc with character recognition and even line segmentation. We have lots of thick black borders, some column headers are vertical, a lot of the table is misaligned, and thick dots are used both for spacing and to indicate zero.
In this sense, image quality is king. It is probably the single most important factor in determining whether OCR is a feasible transcription tool. In our case, the image is not bad, but it's not good either. If your image is poor and you are short on time, then it's probably not a good idea to proceed with OCR.
Step 1: Load packages, the PDF and the table
We will use four packages in this tutorial: tidyverse for data manipulation, pdftools to convert our PDFs to images, magick for image processing and tesseract for OCR. I have omitted the installation details, but all four packages are available on CRAN.
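If you don't already have them, a single install.packages() call covers all four:
#install the four packages from CRAN (one-off)
install.packages(c("tidyverse", "pdftools", "magick", "tesseract"))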
#load required packages
library(tidyverse)
library(pdftools)
library(magick)
library(tesseract)
#Set the Working Directory (WD)
setwd("C:/Users/PATH TO WORKING DIRECTORY HERE")
#load pdf
pdf1891 <- "1891.pdf"
#convert pdf to high resolution image. This will output to your WD.
pdf_convert(pdf1891, dpi=600, pages=73)
#load image into R. Check your WD for file name
png1891 <- "1891_73.png"
Step 2: Create Magick object and resize
Image quality is everything. If I tried to run Tesseract right now, it would return a useless garble of character strings. We need to do some image processing first using the magick package.
#pass png to magick
pdf1891_magick <- image_read(png1891)
#take a first look at the image
pdf1891_magick
#it's huge!
#resize image
pdf1891_magick <- pdf1891_magick %>%
image_scale("2000") %>% #resize & retain aspect ratio
image_convert(type = "grayscale") %>% #greyscale
image_background("white", flatten = TRUE) %>% #white background
image_deskew(threshold = 40) #de-skew
In this step, we resize the image and make some initial edits. First, I rescale the image while retaining the original aspect ratio. Secondly, I ensure that the text is black and the background is white; colours and gradients are a real problem for Tesseract. Finally, I de-skew the image to rectify some of the table alignment problems. A threshold of 40 is usually sufficient to correct most alignment issues.
Step 3: Crop out all irrelevant information
It’s a good idea to simplify the table as much as possible. We don’t need any of the table description at the top of the page; we don’t even need those brutally printed column names. These can all be added manually later.
You will need to adjust this code for your own table. In my case, I instruct R to crop the image to a width of 1800 pixels and a height of 2620 pixels. I then fine-tune this by cropping 70 pixels from the left and 610 pixels from the top to get rid of at least two pesky borders.
pdf1891_magick <- pdf1891_magick %>%
image_crop("1800x2620+70+610") #width x height + left offset + top offset
#take a look at the image.
pdf1891_magick

Source: House of Commons Parliamentary Papers Online, 2005
Step 4: (Optional) Remove Borders
The reason this step is optional is that removing borders is not always feasible. It requires really high image quality, particularly for Tesseract to correctly segment the different lines on the page. Because my table is slightly misaligned, it will not be possible to run OCR on the entire table without a lot more work, since different rows will end up fusing into each other.
I’ll go through the code anyway in case you do have a good image! First, I will create a new object which will identify all the borders in my image. Then I integrate this “border image” into my table and negate all the borders.
pdf1891_lines <- pdf1891_magick %>%
image_morphology("Close", "Rectangle:2,27,0") %>% #find the long border strokes
image_negate() #invert so the borders become white
#Take a look at the borders
pdf1891_lines

Now we can merge this border object with our original table and “negate” the borders. As you can see below, the process is not foolproof, but it does a good job at removing the worst borders from the table.
pdf1891_magick <- image_composite(pdf1891_magick, pdf1891_lines,
operator = "Add")
#view "borderless" table
pdf1891_magick

Source: House of Commons Parliamentary Papers Online, 2005
Step 5: Final image processing
This is a very important part of image processing and may involve a lot of small tweaks. In general, we want to give Tesseract as much information as possible around each character to increase its accuracy. Otherwise, it may fail to recognise a character, or assign one incorrectly, especially if the character is poorly printed.
Most of these steps involve adding random “noise” to the image. If you are old enough to remember analogue TV fuzz, then imagine you are adding some of that to the image. You want to add just enough fuzz to improve Tesseract’s accuracy, but not so much that you end up with a bunch of characters and numbers that aren’t actually there.
pdf1891_magick <- pdf1891_magick %>%
image_background("white", flatten = TRUE) %>%
image_noise(noisetype = "Gaussian") %>%
image_enhance() %>%
image_normalize() %>%
image_trim(fuzz = 50) %>%
image_contrast(sharpen = 1)
pdf1891_magick
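Because this step involves trial and error, it helps to have a number to compare runs against. tesseract's ocr_data() returns word-level confidence scores, so you can check whether a given tweak actually improved recognition. A minimal sketch:
#OCR with word-level confidence scores to benchmark image-processing tweaks
conf_check <- ocr_data(pdf1891_magick)
mean(conf_check$confidence) #a higher mean after a tweak suggests it helped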
Step 6: OCR Row Names
If you are dealing with a table like this one, it’s a good idea to break up your OCR method into two broad categories. First, OCR the row names, then OCR the data points.
To do this, we will define two Tesseract engines which whitelist certain characters. This will improve Tesseract’s results enormously! It will stop your row names from containing random punctuation or numbers, and stop your data points from containing random words. After this, we will crop our table to retain only the row names, OCR them with our bespoke Tesseract engine, and throw the results into a new tibble.
#Create string tesseract engine
str_only <- tesseract::tesseract(
options = list(tessedit_char_whitelist = c("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ- ")))
#Isolate and OCR row names
plu <- pdf1891_magick %>%
image_crop(geometry_area(220, 0, 0, 0)) %>% #Customise based on table!
ocr(engine = str_only) #OCR
#Output row names as a tibble
plu_tibble <- plu %>%
str_split(pattern = "\n") %>%
unlist() %>%
tibble(data = .)

Hey now, our results are not that bad! Sure, Gorey has become “iorey lg” and Dunshaughlin has become “Danshnughiia j”, but Tesseract has managed to either get the PLU name spot on or come back with a recognisable union name.
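If you want to fix these misreads programmatically, one option is to snap each OCR'd name to its nearest neighbour in a reference list of real union names using edit distance. Here is a minimal sketch with base R's adist(); the reference_plu vector is a hypothetical three-name subset that you would replace with the full list:
#hypothetical reference list - swap in the full set of PLU names
reference_plu <- c("Gorey", "Dunshaughlin", "Rathdrum")
#replace each OCR'd name with its closest reference name by edit distance
plu_tibble <- plu_tibble %>%
mutate(plu_clean = reference_plu[apply(adist(data, reference_plu, ignore.case = TRUE), 1, which.min)])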
Step 7: OCR Data Points
Now, if you are one of the lucky ducks with an amazing image to work with, you can OCR all the data points together. Unfortunately, I must do this column by column. The code generalises to both scenarios, however!
First, I’m going to update my Tesseract engine and whitelist only numbers, as well as commas and full stops. Then I will isolate the first data column, OCR it and merge with the PLU names. Of course, if you can OCR all the values in one go, simply update the crop parameters to retain all columns of interest.
#white-list numbers only
num_only <- tesseract::tesseract(
options = list(tessedit_char_whitelist = c("0123456789,. ")))
#crop for first column
spuds <- pdf1891_magick %>%
image_crop(geometry_area(100, 0, 250, 0)) %>% #customise based on table!
ocr(engine = num_only) #OCR
#Output values as tibble
spud_tibble <- spuds %>%
str_split(pattern = "\n") %>%
unlist() %>%
tibble(data = .)
#Merge row names and values
spudyields <- plu_tibble %>%
bind_cols(spud_tibble) %>% #merge tibbles
rename(plu = data...1, spud_total = data...2) #correct column names
I repeated this process three times and, to my shock, it worked pretty well!
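If you find yourself repeating it much more than that, the crop-and-OCR step can be wrapped in a small helper. This is a sketch under my own assumptions: the column width and x-offsets below are illustrative and would need tuning against your table.
#hypothetical helper: crop one column by width and x-offset, OCR it, return a tibble
ocr_column <- function(img, width, x_offset, engine) {
img %>%
image_crop(geometry_area(width, 0, x_offset, 0)) %>%
ocr(engine = engine) %>%
str_split(pattern = "\n") %>%
unlist() %>%
tibble(data = .)
}
#illustrative x-offsets for three data columns - adjust for your own table
offsets <- c(250, 360, 470)
columns <- lapply(offsets, function(x) ocr_column(pdf1891_magick, 100, x, num_only))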

Worth the Effort?
While I much prefer coding to manual transcription, it has quickly become clear that Tesseract isn’t the right tool for me. It’s been fun to get acquainted with the process, but it requires an incredible amount of time to get into a “workable” state. I could easily spend several days continuing to configure my code until I maximised Tesseract’s accuracy.
A trained data wrangler could easily transcribe an entire year’s worth of potato data with (more or less) 100% accuracy in under an hour. With Tesseract, it has taken 4 hours to transcribe a sixth of that. Indeed, it’s possible that my code won’t generalise to the rest of the potato data series and will need to be tailored every time.
The most critical Tesseract issue for me is its failure to recognise full stops. This means I can only OCR the first three columns. For example, in column 4, Tesseract identifies only 68 rows of data when there should be 80. The error becomes worse once we get to rarer varieties of potato, like column 5, where Tesseract recognises only 32 rows of data.
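A quick sanity check makes this kind of silent failure visible: compare each OCR'd column's row count against the number of unions on the page. A minimal sketch, assuming 80 expected rows:
#sanity check: an OCR'd column should have one row per union (80 assumed here)
expected_rows <- 80
sum(spud_tibble$data != "") == expected_rows #FALSE means Tesseract dropped or fused lines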
Final Verdict?
If the manual costs of transcription are too high, then maybe, just maybe, this is the way to go. A single researcher transcribing millions of observations would really benefit from investing their time into OCR. However, if it would only take a few hours or days to transcribe your data, then manual transcription is your best bet.
This technology is still in its “infancy”, but as we get more accustomed to the role of Machine Learning in research, OCR will undoubtedly have a role to play in the future. Lots of new research is being carried out, and new, more advanced OCR tools are becoming increasingly available.
Lots to be excited about, but always bear in mind the costs versus the benefits!