6. Installing Software on the Yens

Throughout this course we will run and build on a single Python example. We start with a serial script that can be run from the command line, in a Jupyter notebook, and through the Slurm scheduler. We then parallelize the script and run it both from the command line and through Slurm.

The Python example depends on the pytesseract package, which is a Python wrapper for Tesseract. Tesseract is a free, open-source optical character recognition (OCR) engine, long developed under Google's sponsorship, that recognizes text embedded in images. The yens have a default version of Tesseract installed, but we can install a newer version.
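To make the wrapper relationship concrete, here is a minimal sketch of how pytesseract hands an image to Tesseract. It is not the course script: the function name `ocr_image` and the file name in the usage note are hypothetical, and it assumes the pytesseract and Pillow packages are installed in your Python environment.

```python
def ocr_image(image_path):
    """Return the text that Tesseract recognizes in one image file."""
    # Imported inside the function so this file still loads on machines
    # where Pillow or pytesseract are not installed (an assumption of
    # this sketch, not a requirement of either package).
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as img:
        # pytesseract runs the tesseract executable found on PATH
        # and returns the recognized text as a string.
        return pytesseract.image_to_string(img)
```

Calling `ocr_image("scan.png")` on a hypothetical image file would return the recognized text. Because pytesseract runs the tesseract executable it finds on your PATH, the version of Tesseract that the shell resolves is the version your Python code uses.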

See this guide for details on how to install software in your home directory or any location where you have permissions (such as a shared project space).

Check the default Tesseract version already installed on the yens:

$ tesseract --version

You should see:

tesseract 4.0.0-beta.1
 leptonica-1.75.3
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0

 Found AVX512BW
 Found AVX512F
 Found AVX2
 Found AVX
 Found SSE

Following the guide above, a newer version of Tesseract (5.2.0) was installed in a shared project space, /zfs/gsb/intermediate-yens/software/tesseract-5.2.0. All we need to do is make sure the shell can find the new Tesseract binary.

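Append two export lines to your bash profile. The first puts the new bin directory at the front of your PATH so that it takes precedence over the default installation. The second points TESSDATA_PREFIX at the data directory of the new installation.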

$ echo 'export PATH=/zfs/gsb/intermediate-yens/software/tesseract-5.2.0/bin:$PATH' >> ~/.bash_profile
$ echo 'export TESSDATA_PREFIX=/zfs/gsb/intermediate-yens/software/tesseract-5.2.0/tessdata' >> ~/.bash_profile
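The reason the new directory goes at the front of PATH is that the shell searches PATH from left to right and uses the first match. The sketch below illustrates that rule on a POSIX system with two throwaway directories that each hold an empty stand-in file named tesseract. The helper names `make_fake_tool` and `resolve` are hypothetical, and nothing here touches the real installation.

```python
import os
import shutil
import stat
import tempfile
from pathlib import Path


def make_fake_tool(directory, name="tesseract"):
    """Create a tiny executable stand-in called `name` in `directory`."""
    tool = Path(directory) / name
    tool.write_text("#!/bin/sh\n")
    tool.chmod(tool.stat().st_mode | stat.S_IXUSR)
    return str(tool)


def resolve(name, directories):
    """Return the executable that a PATH made of `directories` would pick."""
    return shutil.which(name, path=os.pathsep.join(directories))


with tempfile.TemporaryDirectory() as system_bin, \
        tempfile.TemporaryDirectory() as project_bin:
    old_tool = make_fake_tool(system_bin)
    new_tool = make_fake_tool(project_bin)

    # Only the "system" directory is on the search path: the default wins.
    before = resolve("tesseract", [system_bin])

    # The "project" directory is prepended, as in the export PATH line above,
    # so its copy is found first even though the default still exists.
    after = resolve("tesseract", [project_bin, system_bin])

    found_default_first = before == old_tool
    found_project_after_prepend = after == new_tool
```

Both flags end up true: the same command name resolves to a different file once the project directory is listed first.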

Source the bash profile so that the two added export commands take effect in your current shell. After that, the new tesseract executable can be called from anywhere on the yens' file system.

$ source ~/.bash_profile

Check the new tesseract version:

$ tesseract --version

You should see the updated version:

tesseract 5.2.0
 leptonica-1.82.0
  libgif 5.1.9 : libjpeg 8d (libjpeg-turbo 2.1.1) : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.2 : libopenjp2 2.4.0
 Found AVX
 Found SSE4.1
 Found OpenMP 201511
 Found libarchive 3.6.0 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.4.8
 Found libcurl/7.81.0 OpenSSL/3.0.2 zlib/1.2.11 brotli/1.0.9 zstd/1.4.8 libidn2/2.3.2 libpsl/0.21.0 (+libidn2/2.3.2) libssh/0.9.6/openssl/zlib nghttp2/1.43.0 librtmp/2.3 OpenLDAP/2.5.13
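Since the course example calls Tesseract from Python, it can be useful to run the same check from inside a script. The following is a minimal sketch using only the standard library. The function names `parse_version` and `installed_tesseract` are hypothetical, and the parser assumes the first line of the banner has the form shown above.

```python
import re
import shutil
import subprocess


def parse_version(output):
    """Extract the version string from the first line of the version banner."""
    match = re.match(r"tesseract\s+(\S+)", output)
    return match.group(1) if match else None


def installed_tesseract():
    """Return (path, version) for the tesseract on PATH, or None if absent."""
    path = shutil.which("tesseract")
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Fall back to stderr in case the banner is not written to stdout.
    return path, parse_version(result.stdout or result.stderr)
```

Calling `installed_tesseract()` in a Python session started after sourcing the bash profile should report a path under the shared project bin directory. If it still reports the default location, the session was probably started before PATH was updated.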