This article is by Ken Black and originally appeared on the Alteryx Data Science Blog here: https://community.alteryx.com/t5/Data-Science-Blog/Use-Python-Alteryx-and-OCR-AI-to-Extract-Data-From-Images/ba-p/594106
In this article, I explain in more detail an experiment that I am conducting related to the Coronavirus pandemic. When I decide to do experiments, I have been known to take my time. For example, it took me over 2.5 years to write a couple hundred technical articles on Tableau and Alteryx usage in a detailed and unique process-improvement-inspired blogging experiment (click here to learn more).
That sure is a loaded sentence! The included hyperlink points to a collection of articles that house a large number of insights and techniques available for you to peruse. If you are a blogger, there is a lot to learn from that collection. It is my hope that my new experiment will teach us a lot about the Coronavirus pandemic.
The Coronavirus Experiment Description
My ongoing experiment has been running since the Coronavirus pandemic exploded into the United States in March 2020. My Coronavirus experiment includes topics such as data visualization, data quality, data variation, data restatements, and most importantly, helping me to comprehend the nature, extent, and time-series history of a global pandemic.
My experiment takes me about 2 minutes per day, and I try to do this work seven days a week. I don’t know how long the experiment will run, but I am committed to it because I believe I will learn some valuable lessons from this unique event of our lifetime. I am sure that I will write several articles about the results of this experiment.
A couple of months ago, I published the first article in this series. For that work, I manually typed-in the data captured from a series of dashboard images that were included in the article. At the time of publication, the data table was complete through 40 captured dashboard images, or just enough time for 1,000,000th global confirmed cases of the Coronavirus to have occurred.
I am now approaching 300 dashboard images because I now capture three dashboard images instead of just one. This means that the amount of data being captured in the images is growing quickly and my data table has been languishing far behind. As most of us know, manually typing in data isn’t too much fun! To keep my experiment on track, I decided I had to do something to automatically capture the data stored in the dashboard images.
Being a computer programmer, I also know that it is OK to be lazy at times. This is one of those times. Since I don’t want to type in the data from hundreds of dashboard images, I have decided to let the computer do that mundane work for me.
In the next two sections, I describe how I capture the data in image form and how I extract the data from those images. Therefore, as you look at the Figures below, recognize the amount of data that they contain. Also realize that the data contained in these images reflect the state of knowledge at the time the data was assembled. In effect, this collection of images is a Type II database that holds a lot of visual and textural information.
The Data Collection Method – Part 1
As the experiment has matured, so has my data collection methods via capturing dashboard images. Each day I capture three dashboard images (Figures 1-3) from the Johns Hopkins website.
Even though it is a simple procedure for me to capture these images, sometimes I might make a mistake by having something highlighted on the dashboard when I capture the image. An example of that case is shown in Figure 4.
If that happens, those images might not yield perfect data for me, so it is important for you to know that the methods I’ll be showing are not foolproof but they save a huge amount of time! Additionally, the dashboard images themselves have changed over time as the researchers decided to change their layouts to show different types of data as the pandemic developed over time.
The images I capture are produced with the Windows snipping tool and a 27 inch, 4K LG monitor (Model 27UD69P – W). The typical dashboard image size is shown in Figure 5.
The data that I am interested in having the computer capture is in the upper left corner (Figure 6) and lower-left corner (Figure 7) of the Figure 1 dashboard.
Now let’s learn how to tell the computer to extract the target data from the images.
The Data Collection Method – Part 2
Once the images are captured, I implemented a few tricks to make the computer extract the data for me from the images. First, I wrote a very short python code that loops through the dashboard images and uses the tesseract algorithm to extract text data. This algorithm* combines optical character recognition (OCR) with a little dash of artificial intelligence (AI) to extract text from these images. The Python program is shown in Figure 8.
Before you can get this to run, however, you have to load the tesseract data sets. In your Python environment, you will also have to load the Python imaging library (PIL), and pytesseract. The PIL is key to loading and displaying images in Python, so this is worth learning about. The pytesseract part of this is where the OCR/AI magic happens to extract the data.
As with many things Python, this isn’t as straight-forward as it should be and might require that you iterate a few times to get things working. I could spend many paragraphs depicting all the things I had to do to make this work, but in the end it wouldn’t help you too much. You can try this link if you are having problems, but ultimately you just need to debug your Python deployment until this works.
The Python code returns a bunch of information (about 43,000 lines of text from 118 images), with much of it not very usable. Figure 9 shows what I mean by that statement. You will notice little errors like a decimal digit in line 644 instead of a comma. There are many lines that have data that is incomplete or erroneous. The good news, however, is that the data I want to capture is pretty solid and properly identified.
If you are patient (like I am) and you have a secret weapon called Alteryx in your data arsenal (like I do), this random hodgepodge of bits and bytes can make your life a little easier and save you some time. I guarantee you that if I tried to write the Python code to parse these results, it would have taken me way longer than the 10 minutes it took me to write and test the Alteryx workflow I needed to complete the analysis.
As shown in Figure 9, lines 627 and 629 contain key information that I am trying to extract. Line 629 contains the total confirmed cases that is one of my target variables (previously shown in Figure 6). In fact, by knowing that the data I want to extract comes two lines after the words “Total Confirmed”, makes it very easy for me to extract the data. Similarly, the date and time data comes two lines after the word “Updated.”
Parsing the Results in Alteryx
As shown in Figure 10, a quick workflow was all that was needed to read the ugly unstructured data returned from pytesseract. You can see that that dates and times of the last update field came through fairly clean. Record 20 had a problem (as did a few others later on), but I just had to cut and paste these results in a matter of seconds compared to sitting for a couple of hours typing in the data. This is what I referred to earlier as me being “lazy.”
I’ll now explain some of the techniques I used to make this workflow extract the requested data.
Figure 11 shows the configuration used to read the unstructured data that is returned from pytesseract. Two key settings are needed here. First, the file is a flat ascii file that is delimited. Second, you might have to turn on the setting to “Allow long lines for records”. Alteryx will tell you if you need to change that setting.
The key to finding the targeted data is two-fold. First, a record identifier is used to number the lines of the file and a simple search function is used to find two key words: “updated” (Figure 12) and “total confirmed” (not shown, but it is in the second container of the workflow). If the search term is not found in a line, Alteryx returns -1. If it is found, the starting text position is returned (5 in this case because the term found is “last updated”).
Once a filter is used to remove the -1 search results, what is left are the records that matched the search term. Since the data occurs two records after this position, a new record locator is computed as shown in Figure 13.
Now a simple join is needed to extract the data (Figure 14). Notice that the right side joins back to the complete data set to be able to match up the new record locator with the original data.
After the workflow finished in 1 second!, all I had to do was cut and paste the results into my Excel workbook and fix and few missing records. I did that by looking at the dashboard images to get accurate data. Overall, this workflow saved me a few boring hours of typing in the data that is stored in those images.
We now have many fancy terms like OCR, AI, Machine Learning, deep learning and other buzzwords in the field of analytics. Companies like HP and Google have done some amazing research and given us great tools to use, like the tesseract-ocr engine I used in this example.
Having access to these tools allows normal people like me to use these techniques to get work done. This is powerful stuff and as this article shows, it is easy to use. I encourage others to try to do things like this to expand your repertoire of data analytic tricks and techniques! Thanks for reading.
The Tesseract engine was originally developed as proprietary software at Hewlett Packard labs in Bristol, England and Greeley, Colorado between 1985 and 1994, with some more changes made in 1996 to port to Windows, and some migration from C to C++ in 1998. A lot of the code was written in C, and then some more was written in C++. Since then all the code has been converted to at least compile with a C++ compiler. Very little work was done in the following decade. It was then released as open source in 2005 by Hewlett Packard and the University of Nevada, Las Vegas (UNLV). Tesseract development has been sponsored by Google since 2006.