9. Introducing the OCR Example

Data

The full data set consists of 9 million synthetically generated images of 90k English words that can be used to test Optical Character Recognition (OCR) algorithms. The data is open-sourced and can be downloaded from here. In this course, we will work with a subset of the data.

The data is located in a shared scratch location on the yens. In addition to making it easy to share access to the data, scratch has faster reads and writes than a file system location such as /zfs/projects, so it is a good place to keep a copy of the data while you are working with it. However, scratch is purged intermittently, so keep the original data somewhere else as well and move your results off of scratch if you write there.

Let’s look at how the data is organized. The subset of the data is located in /scratch/darc/intermediate-yens/data/.

$ cd /scratch/darc/intermediate-yens 
$ tree data | head

You should see:

data
├── 100_Preventative_59994.jpg
├── 100_PULPWOOD_61196.jpg
├── 101_Bess_7167.jpg
├── 101_Stupid_75438.jpg
├── 102_Ethopian_26696.jpg
├── 102_Summonsing_76059.jpg
├── 103_smileys_71988.jpg
├── 103_WACKOS_85159.jpg
├── 104_ANTHOLOGIST_3112.jpg

In the data directory, there are 1000 JPG images that we want to OCR. Our task is to OCR all of the images and collect the results in a CSV table with one row per image.

Jupyter Notebook Data Exploration

Jupyter notebooks are great for data exploration because they support plotting and interactive cells. First, we will look at a few example images from the dataset. Open the data-explore.ipynb notebook in JupyterHub; you can navigate to /zfs/gsb/intermediate-yens/$USER/scripts by double-clicking folders in the left side panel.
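For example, a notebook cell along the following lines displays a few sample images together with the word encoded in each file name. This is a minimal sketch; it assumes the Pillow and matplotlib packages are available to your JupyterHub kernel (for example, via the ocr conda environment).

import glob
from PIL import Image
import matplotlib.pyplot as plt

data_path = '/scratch/darc/intermediate-yens/data/'

# grab the first three images (sorted for reproducibility)
sample_paths = sorted(glob.glob(data_path + '*.jpg'))[:3]

fig, axes = plt.subplots(1, len(sample_paths), figsize=(12, 3))
for ax, img_path in zip(axes, sample_paths):
    ax.imshow(Image.open(img_path))
    # the true word is the middle part of the file name, e.g. 100_Preventative_59994.jpg
    ax.set_title(img_path.split('/')[-1].split('_')[1])
    ax.axis('off')
plt.show()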

Serial Script

The Python script loops over the images, converts each one to text with the pytesseract package, and appends the OCR results to a data frame, which we write out as a CSV file at the end.
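At its core, converting one image to text is a single pytesseract call. Here is a minimal sketch on a single image from the data directory (any of the file names listed above will work):

from PIL import Image
import pytesseract

img_path = '/scratch/darc/intermediate-yens/data/101_Bess_7167.jpg'

# run Tesseract OCR on one image and strip surrounding whitespace
text = pytesseract.image_to_string(Image.open(img_path)).strip()
print(text)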

Because we will potentially want to run this script on millions of images, it is important to save intermediate results while the script is running, so that you do not have to start over if you lose your internet connection or your connection to the yens times out.

Let’s look at the serial script called ocr-serial.py in the scripts folder.

##############################################################
### Activate conda env and run the script
#     module load anaconda3
#     source activate /zfs/gsb/intermediate-yens/conda/ocr
#     python ocr-serial.py
##############################################################
import numpy as np
import pandas as pd
from PIL import Image
import pytesseract # OCR with Tesseract
import glob, os, shutil, csv, json, time
from os import makedirs
from os.path import exists, isdir
from datetime import datetime
from tqdm import tqdm   # progress bar

# start the timer
start_time = time.time()

# store OCR results
out_path = 'ocr_serial_out/'

# remove output directory if it already exists
if exists(out_path) and isdir(out_path):
    shutil.rmtree(out_path)

# make empty dir for OCR results
makedirs(out_path)

# make a master df -> CSV file to keep track of what we have already processed

# check if file exists, if it doesn't, make a new CSV file
if not exists(out_path + 'processed_images.csv'):
    # create a new csv file
    with open(out_path + 'processed_images.csv', 'w') as csvfile:
        filewriter = csv.writer(csvfile, delimiter=',', quoting=csv.QUOTE_NONE)
        # write headers
        filewriter.writerow(['image_path', 'image_name', 'ocr_text', 'date'])


# read in the master csv (it is guaranteed to exist at this point)
df = pd.read_csv(out_path + 'processed_images.csv')

# path to the data
data_path = '/scratch/darc/intermediate-yens/data/'

# make a list of image paths to OCR
image_paths = glob.glob(data_path + '*.jpg')

tot_images = len(image_paths)

# drop images we already processed
already_processed = set(df['image_path'])
image_paths = [img for img in image_paths if img not in already_processed]

print('image number to process: ', len(image_paths))

# write intermediate results after every 100 images
intermediateFileNum = 100
# number of cores this serial script uses
ncore = 1

# number of image batches
if len(image_paths) % intermediateFileNum == 0:
    nbatches = int(len(image_paths) / intermediateFileNum)
else:
    nbatches = int(len(image_paths) / intermediateFileNum) + 1

# loop over batches of images with a progress bar
for i in tqdm(range(nbatches)):
    # i is the batch number; intermediate results are written out after each batch
    # make a temporary df for this batch's results

    df_tmp = pd.DataFrame(columns=['image_path', 'image_name', 'ocr_text', 'date'])
    start = i * intermediateFileNum

    # the last batch may contain fewer than intermediateFileNum images
    if i == int(len(image_paths) / intermediateFileNum):
        end = len(image_paths)
    else:
        end = i * intermediateFileNum + intermediateFileNum

    image_paths_batch = image_paths[start: end]

    image_names = []
    text_from_ocr = []

    # loop over the images in this batch and convert each one to text
    for img in image_paths_batch:
        # JSON-encode the OCR output so newlines and special characters survive in the CSV
        text_from_ocr.append(json.dumps(pytesseract.image_to_string(Image.open(img)).strip()))
        # drop the '/scratch/darc/intermediate-yens/' prefix (32 characters) to get the image name
        image_names.append(img[32:])

    # make a pd series for image paths
    df_tmp['image_path'] = pd.Series(image_paths_batch, dtype='object')

    # make a pd series for image names
    df_tmp['image_name'] = pd.Series(image_names, dtype='object')

    # make a pd series for OCR results
    df_tmp['ocr_text'] = pd.Series(text_from_ocr, dtype='object')

    # make a pd series for date
    date_list = [datetime.today().strftime("%m/%d/%Y") for _ in range(len(image_paths_batch))]
    df_tmp['date'] = pd.Series(date_list, dtype='object')

    # append this batch's results to the master data frame
    df = pd.concat([df, df_tmp], ignore_index=True)

    # checkpoint: overwrite the master csv with everything processed so far
    df.to_csv(out_path + 'processed_images.csv', index=False)
    print('processed', len(df), 'images out of', tot_images)

print('running with %d core(s) took: %s seconds' % (ncore, str(time.time() - start_time)))


Run Serial Script Interactively

Let’s make sure our Python program runs on the interactive yen command line. If you haven’t already, load the anaconda module and activate the conda environment:

$ ml anaconda3
$ source activate /zfs/gsb/intermediate-yens/conda/ocr

Then run the script:

$ python ocr-serial.py

While the job is running, open a new terminal window and log in to the same yen machine. Then, monitor your CPU and memory usage with:

$ watch userload

You should see something like:

nrapstin         | 0.08 Cores | 0.00% Mem on yen4

which means you are using less than one core and consuming almost no memory.

Back in the terminal where the script is running, you should see output similar to:

image number to process:  1000
  0%|                                                                                                    | 0/10 [00:00<?, ?it/s]processed 100 images out of 1000
 10%|█████████▏                                                                                  | 1/10 [00:18<02:44, 18.25s/it]processed 200 images out of 1000
 20%|██████████████████▍                                                                         | 2/10 [00:36<02:25, 18.18s/it]processed 300 images out of 1000
 30%|███████████████████████████▌                                                                | 3/10 [00:54<02:07, 18.21s/it]processed 400 images out of 1000
 40%|████████████████████████████████████▊                                                       | 4/10 [01:12<01:49, 18.23s/it]processed 500 images out of 1000
 50%|██████████████████████████████████████████████                                              | 5/10 [01:31<01:31, 18.20s/it]processed 600 images out of 1000
 60%|███████████████████████████████████████████████████████▏                                    | 6/10 [01:49<01:12, 18.23s/it]processed 700 images out of 1000
 70%|████████████████████████████████████████████████████████████████▍                           | 7/10 [02:07<00:54, 18.26s/it]processed 800 images out of 1000
 80%|█████████████████████████████████████████████████████████████████████████▌                  | 8/10 [02:26<00:36, 18.29s/it]processed 900 images out of 1000
 90%|██████████████████████████████████████████████████████████████████████████████████▊         | 9/10 [02:44<00:18, 18.37s/it]processed 1000 images out of 1000
100%|███████████████████████████████████████████████████████████████████████████████████████████| 10/10 [03:02<00:00, 18.30s/it]
running with 1 core(s) took: 183.11304306983948 seconds

See the OCR result:

$ head ocr_serial_out/processed_images.csv

You should see:

image_path,image_name,ocr_text,date
/scratch/darc/intermediate-yens/data/136_glassily_32676.jpg,data/136_glassily_32676.jpg,"""glassily""",07/13/2022
/scratch/darc/intermediate-yens/data/396_conflicts_15964.jpg,data/396_conflicts_15964.jpg,"""Ccon\u00a3lict\u00ae""",07/13/2022
/scratch/darc/intermediate-yens/data/337_LORDSHIPS_45284.jpg,data/337_LORDSHIPS_45284.jpg,"""LORDSHIPS""",07/13/2022
/scratch/darc/intermediate-yens/data/359_Salmonellae_67510.jpg,data/359_Salmonellae_67510.jpg,"""""",07/13/2022
/scratch/darc/intermediate-yens/data/158_NONACTIVES_51881.jpg,data/158_NONACTIVES_51881.jpg,"""""",07/13/2022
/scratch/darc/intermediate-yens/data/429_Militia_48419.jpg,data/429_Militia_48419.jpg,"""""",07/13/2022
/scratch/darc/intermediate-yens/data/138_IRRIGATING_40884.jpg,data/138_IRRIGATING_40884.jpg,"""g\nmec\ufb02 TN""",07/13/2022
/scratch/darc/intermediate-yens/data/489_overmodest_54571.jpg,data/489_overmodest_54571.jpg,"""overmodest""",07/13/2022
/scratch/darc/intermediate-yens/data/25_holiness_36493.jpg,data/25_holiness_36493.jpg,"""holiness""",07/13/2022

Make sure you got one row per image in the resulting table: if we processed 1000 images, the CSV table should have 1001 lines (header plus one row per OCR result).

$ wc -l ocr_serial_out/processed_images.csv

You should see the output:

1001 ocr_serial_out/processed_images.csv
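You can also run this check from Python and undo the JSON encoding of the OCR text at the same time. A minimal sketch (run from the same directory that contains ocr_serial_out/):

import json
import pandas as pd

df = pd.read_csv('ocr_serial_out/processed_images.csv')

# one row per image: should print 1000
print(len(df))

# ocr_text was written with json.dumps, so decode it back to plain text
df['ocr_text'] = df['ocr_text'].apply(json.loads)
print(df[['image_name', 'ocr_text']].head())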