Intermediate Yens
9. Introducing OCR Example
Data
The full data set consists of 9 million synthetically generated images of 90k English words that can be used to test Optical Character Recognition (OCR) algorithms. The data is open-sourced and can be downloaded from here. In this course, we will work with a subset of the data.
The data is located in a shared scratch location on the yens. In addition to making it easy to share access to the data, scratch has faster reads and writes than a file system location such as /zfs/projects. So scratch is a good place to put a copy of the data while you are working with it. However, scratch is purged intermittently, so it is important to keep the original data somewhere else as well, and to move your results off of scratch if you write there.
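This stage-in/stage-out workflow can be sketched in Python. The function names and paths below are illustrative placeholders, not part of the course scripts:

```python
import shutil
from pathlib import Path

def stage_to_scratch(src: str, scratch_dir: str) -> Path:
    """Copy an input directory onto scratch for fast reads while working."""
    dest = Path(scratch_dir) / Path(src).name
    shutil.copytree(src, dest, dirs_exist_ok=True)
    return dest

def save_results(scratch_out: str, durable_dir: str) -> Path:
    """Copy results off scratch to durable storage before scratch is purged."""
    dest = Path(durable_dir) / Path(scratch_out).name
    shutil.copytree(scratch_out, dest, dirs_exist_ok=True)
    return dest
```

In practice you might run the stage-in once before a job and the stage-out at the end of the job, so an intermittent purge never costs you original data or results.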
Let’s look at how the data is organized. The subset of the data is located in /scratch/darc/intermediate-yens/data/.
$ cd /scratch/darc/intermediate-yens
$ tree data | head
You should see:
data
├── 100_Preventative_59994.jpg
├── 100_PULPWOOD_61196.jpg
├── 101_Bess_7167.jpg
├── 101_Stupid_75438.jpg
├── 102_Ethopian_26696.jpg
├── 102_Summonsing_76059.jpg
├── 103_smileys_71988.jpg
├── 103_WACKOS_85159.jpg
├── 104_ANTHOLOGIST_3112.jpg
In the data directory, there are 1000 jpg images that we want to OCR. Our task is to OCR all of the images and collect the results in a csv table with one row per image.
Jupyter Notebook Data Exploration
Jupyter notebooks are great for data exploration as they support plotting and interactive cells to explore the data.
First, we will look at example images from the dataset. Open up a data-explore.ipynb notebook in JupyterHub. You can navigate to /zfs/gsb/intermediate-yens/$USER/scripts by double clicking on folders in the left side panel.
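One quick exploration, before running any OCR, is to list the images and recover the ground-truth word from each filename, since files are named `<index>_<word>_<id>.jpg`. The snippet below is a sketch using only the standard library; `label_from_filename` is an illustrative helper, not part of the course scripts:

```python
import glob
import os

data_path = '/scratch/darc/intermediate-yens/data/'

# list all jpg images in the data directory
image_paths = sorted(glob.glob(data_path + '*.jpg'))
print(len(image_paths), 'images found')

def label_from_filename(path):
    """Files are named <index>_<word>_<id>.jpg, so the ground-truth
    word is the middle component of the base filename."""
    return os.path.basename(path).split('_')[1]

print(label_from_filename('100_PULPWOOD_61196.jpg'))  # -> PULPWOOD
```

Having the true word in the filename makes it easy to later check how well the OCR output matches.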
Serial Script
The Python script loops over the images, converts each to text using the pytesseract package, and appends the OCR results to a data frame, which we write out as a csv file at the end. Because we will potentially want to run this script on millions of images, it is important to save intermediate results while the script is running: you do not want to rerun everything if you lose your internet connection or your connection to the yens times out.
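The checkpoint-and-resume idea can be sketched independently of OCR: log each item as it is processed and, on startup, skip anything already in the log. The names below are illustrative, and `it.upper()` stands in for the real work:

```python
import csv
import os

def already_done(log_path):
    """Return the set of items recorded in the checkpoint CSV (first column)."""
    if not os.path.exists(log_path):
        return set()
    with open(log_path) as f:
        return {row[0] for row in csv.reader(f) if row}

def process(items, log_path):
    """Process only items not yet logged, checkpointing after each one."""
    done = already_done(log_path)
    todo = [it for it in items if it not in done]
    with open(log_path, 'a', newline='') as f:
        writer = csv.writer(f)
        for it in todo:
            result = it.upper()            # stand-in for real work (e.g. OCR)
            writer.writerow([it, result])  # checkpoint this item
    return todo
```

If the script is killed halfway through, rerunning it picks up where the log left off instead of starting over.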
Let’s look at the serial script called ocr-serial.py in the scripts folder.
##############################################################
### Activate conda env and run the script
# module load anaconda3
# source activate /zfs/gsb/intermediate-yens/conda/ocr
# python ocr-serial.py
##############################################################

import numpy as np
import pandas as pd
from PIL import Image
import pytesseract                # OCR with Tesseract
import glob, os, shutil, csv, json, time
from os import makedirs
from os.path import exists, isdir
from datetime import datetime
from tqdm import tqdm             # progress bar

# start the timer
tmp = time.time()

# store OCR results
out_path = 'ocr_serial_out/'

# remove output directory if it already exists
if exists(out_path) and isdir(out_path):
    shutil.rmtree(out_path)

# make empty dir for OCR results
makedirs(out_path)

# make a master df -> CSV file to keep track of what we have already processed
# check if file exists; if it doesn't, make a new CSV file
if not exists(out_path + 'processed_images.csv'):
    # create a new csv file
    with open(out_path + 'processed_images.csv', 'w') as csvfile:
        filewriter = csv.writer(csvfile, delimiter=',', quoting=csv.QUOTE_NONE)
        # write headers
        filewriter.writerow(['image_path', 'image_name', 'ocr_text', 'date'])

# csv exists now, so read it in
df = pd.read_csv(out_path + 'processed_images.csv')

# path to the data
data_path = '/scratch/darc/intermediate-yens/data/'

# make a list of image paths to OCR
image_paths = glob.glob(data_path + '*.jpg')
tot_images = len(image_paths)

# drop images we already processed
processed = set(df['image_path'])
image_paths = [img for img in image_paths if img not in processed]
print('image number to process: ', len(image_paths))

intermediateFileNum = 100
ncore = 1

# number of image batches
if len(image_paths) % intermediateFileNum == 0:
    nbatches = int(len(image_paths) / intermediateFileNum)
else:
    nbatches = int(len(image_paths) / intermediateFileNum) + 1

# make progress bar for the for loop over batches of images
for i in tqdm(range(nbatches)):
    # i is the batch number
    # write intermediate results while processing all of the files
    # make a temp df to append to the result
    df_tmp = pd.DataFrame(columns=['image_path', 'image_name', 'ocr_text', 'date'])
    start = i * intermediateFileNum
    if i == int(len(image_paths) / intermediateFileNum):
        end = len(image_paths)
    else:
        end = i * intermediateFileNum + intermediateFileNum
    image_paths_batch = image_paths[start:end]
    image_names = []
    text_from_ocr = []
    # loop over images in this batch and convert each to text
    for img in image_paths_batch:
        text_from_ocr.append(json.dumps(pytesseract.image_to_string(Image.open(img)).strip()))
        # strip the '/scratch/darc/intermediate-yens/' prefix (32 characters)
        image_names.append(img[32:])
    # make a pd series for image paths
    df_tmp['image_path'] = pd.Series(image_paths_batch, dtype='object')
    # make a pd series for image names
    df_tmp['image_name'] = pd.Series(image_names, dtype='object')
    # make a pd series for OCR results
    df_tmp['ocr_text'] = pd.Series(text_from_ocr, dtype='object')
    # make a pd series for date
    date_list = [datetime.today().strftime("%m/%d/%Y") for _ in range(len(image_paths_batch))]
    df_tmp['date'] = pd.Series(date_list, dtype='object')
    # add this batch's data to df
    df = pd.concat([df, df_tmp], ignore_index=True)
    # save intermediate results after each batch
    df.to_csv(out_path + 'processed_images.csv', index=False)
    print('processed', len(df), 'images out of', tot_images)

print('running with %d core(s) took: %s seconds' % (ncore, str(time.time() - tmp)))
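The if/else that computes nbatches is just ceiling division: the number of images divided by the batch size, rounded up. A compact equivalent (a sketch, not part of the course script) is:

```python
import math

def num_batches(n_images: int, batch_size: int) -> int:
    """Ceiling division: how many batches of batch_size cover n_images."""
    return math.ceil(n_images / batch_size)

print(num_batches(1000, 100))  # 1000 images in batches of 100 -> 10
print(num_batches(1001, 100))  # one leftover image adds a final partial batch -> 11
```

Either form works; the point is that a final partial batch still needs its own loop iteration.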
Run Serial Script Interactively
Let’s make sure our Python program runs on the interactive Yen command line.
If you haven’t already, load the anaconda module and activate the conda environment.
$ ml anaconda3
$ source activate /zfs/gsb/intermediate-yens/conda/ocr
Then run the script:
$ python ocr-serial.py
While the job is running, open a new terminal window and login to the same yen machine. Then, monitor your CPU and memory usage with:
$ watch userload
You should see something like:
nrapstin | 0.08 Cores | 0.00% Mem on yen4
which means you are using less than one core and consuming almost no memory.
You should see a similar output:
image number to process: 1000
0%| | 0/10 [00:00<?, ?it/s]processed 100 images out of 1000
10%|█████████▏ | 1/10 [00:18<02:44, 18.25s/it]processed 200 images out of 1000
20%|██████████████████▍ | 2/10 [00:36<02:25, 18.18s/it]processed 300 images out of 1000
30%|███████████████████████████▌ | 3/10 [00:54<02:07, 18.21s/it]processed 400 images out of 1000
40%|████████████████████████████████████▊ | 4/10 [01:12<01:49, 18.23s/it]processed 500 images out of 1000
50%|██████████████████████████████████████████████ | 5/10 [01:31<01:31, 18.20s/it]processed 600 images out of 1000
60%|███████████████████████████████████████████████████████▏ | 6/10 [01:49<01:12, 18.23s/it]processed 700 images out of 1000
70%|████████████████████████████████████████████████████████████████▍ | 7/10 [02:07<00:54, 18.26s/it]processed 800 images out of 1000
80%|█████████████████████████████████████████████████████████████████████████▌ | 8/10 [02:26<00:36, 18.29s/it]processed 900 images out of 1000
90%|██████████████████████████████████████████████████████████████████████████████████▊ | 9/10 [02:44<00:18, 18.37s/it]processed 1000 images out of 1000
100%|███████████████████████████████████████████████████████████████████████████████████████████| 10/10 [03:02<00:00, 18.30s/it]
running with 1 core(s) took: 183.11304306983948 seconds
See the OCR result:
$ head ocr_serial_out/processed_images.csv
You should see:
image_path,image_name,ocr_text,date
/scratch/darc/intermediate-yens/data/136_glassily_32676.jpg,data/136_glassily_32676.jpg,"""glassily""",07/13/2022
/scratch/darc/intermediate-yens/data/396_conflicts_15964.jpg,data/396_conflicts_15964.jpg,"""Ccon\u00a3lict\u00ae""",07/13/2022
/scratch/darc/intermediate-yens/data/337_LORDSHIPS_45284.jpg,data/337_LORDSHIPS_45284.jpg,"""LORDSHIPS""",07/13/2022
/scratch/darc/intermediate-yens/data/359_Salmonellae_67510.jpg,data/359_Salmonellae_67510.jpg,"""""",07/13/2022
/scratch/darc/intermediate-yens/data/158_NONACTIVES_51881.jpg,data/158_NONACTIVES_51881.jpg,"""""",07/13/2022
/scratch/darc/intermediate-yens/data/429_Militia_48419.jpg,data/429_Militia_48419.jpg,"""""",07/13/2022
/scratch/darc/intermediate-yens/data/138_IRRIGATING_40884.jpg,data/138_IRRIGATING_40884.jpg,"""g\nmec\ufb02 TN""",07/13/2022
/scratch/darc/intermediate-yens/data/489_overmodest_54571.jpg,data/489_overmodest_54571.jpg,"""overmodest""",07/13/2022
/scratch/darc/intermediate-yens/data/25_holiness_36493.jpg,data/25_holiness_36493.jpg,"""holiness""",07/13/2022
Make sure you got one row per image in the resulting table: if we processed 1000 images, the csv table should have 1001 lines (header + OCR results).
$ wc -l ocr_serial_out/processed_images.csv
You should see the output:
1001 ocr_serial_out/processed_images.csv
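The same check can be done from Python, which is handy inside a notebook. The helper below is an illustrative stand-in for `wc -l`:

```python
def count_lines(path):
    """Count lines in a text file, like `wc -l`."""
    with open(path) as f:
        return sum(1 for _ in f)

# on the yens, this should print 1001 (1 header line + 1000 OCR rows):
# print(count_lines('ocr_serial_out/processed_images.csv'))
```

If the count comes up short, rerunning ocr-serial.py will process only the missing images, thanks to the processed_images.csv checkpoint.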