Installing Software on the Yens

As a yen user, you can install your own custom software in your home directory or any location where you have permissions (such as a shared project space). If you are working with other researchers on a shared project, it is a good idea to have a dedicated shared software directory where you can install required software.

As an example, we will install a newer version of Tesseract, a free, open-source optical character recognition (OCR) tool whose development was sponsored by Google for many years. The Python package pytesseract is a wrapper around Tesseract. The yens have a default version of Tesseract installed, but we can install a newer one alongside it.

Check the default Tesseract version already installed on the yens:

$ tesseract --version
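
It can also help to know which executable the shell actually runs, not only its version. The helper below is a minimal sketch; `tool_path` is a hypothetical name chosen for illustration, not a utility provided on the yens.

```shell
# Print the full path of the first match for a command on PATH,
# or a short message on stderr (and a non-zero status) if it is not found.
tool_path() {
    command -v "$1" || { echo "$1 not found on PATH" >&2; return 1; }
}

# Show where the default tesseract lives, if one is installed.
tool_path tesseract || true
```

Running the same helper again after the install is finished should print a path under your own install directory instead of the system location.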

Go to the releases page of the Tesseract repository on GitHub and copy the link to the source code tarball (tar.gz file).

Then we download the source code to the yens by passing that link to the wget command. This example uses the shared directory /zfs/gsb/intermediate-yens/software; choose the shared location where you want the software installed and substitute its path in the commands that follow.

First, make the directory where you want the binary to be installed (this should be in a shared project space if you want other researchers to be able to use the software):

$ mkdir /zfs/gsb/intermediate-yens/software/tesseract-5.2.0

Navigate to the directory where you want to download the source tarball (you can also download it to your home directory):

$ cd /zfs/gsb/intermediate-yens/software

Download the source code:

$ wget https://github.com/tesseract-ocr/tesseract/archive/refs/tags/5.2.0.tar.gz 

Untar the source code:

$ tar -zxvf 5.2.0.tar.gz
$ cd tesseract-5.2.0/
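
If you are unsure which directory a tarball will unpack into, you can list its contents without extracting anything. The function below is a small sketch of that check; `tar_top_level` is a hypothetical helper name.

```shell
# Print the unique top-level entries of a gzipped tarball without extracting it.
tar_top_level() {
    tar -tzf "$1" | cut -d/ -f1 | sort -u
}

# Example (run in the directory that holds the download):
#   tar_top_level 5.2.0.tar.gz
```

For the tarball used in this walkthrough, the single entry printed should match the tesseract-5.2.0 directory that the cd command above enters.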

Specify the install location, such as /zfs/<project-dir>/software, with the --prefix argument to configure (use the directory path you made above), then build and install:

$ ./autogen.sh
$ ./configure --prefix=/zfs/gsb/intermediate-yens/software/tesseract-5.2.0
$ make
$ make install
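
Each of these four steps depends on the previous one succeeding, so it can be convenient to chain them so that the first failure stops the sequence. The following is a sketch only: `build_and_install` is a hypothetical function name, and the prefix is the example path from this walkthrough, which you should replace with your own.

```shell
# Install location used in this walkthrough; replace with your own path.
PREFIX="/zfs/gsb/intermediate-yens/software/tesseract-5.2.0"

# Run the four build steps in order; && stops the chain at the first failure.
build_and_install() {
    ./autogen.sh &&
    ./configure --prefix="$PREFIX" &&
    make &&
    make install
}

# Call it from inside the unpacked source directory, for example:
#   build_and_install || echo "build failed; see the output above" >&2
```

The function is only defined here, not run, so nothing is built until you call it yourself.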

If everything completes successfully, add the new bin directory to your PATH in your bash profile:

$ echo 'export PATH=/zfs/gsb/intermediate-yens/software/tesseract-5.2.0/bin:$PATH' >> ~/.bash_profile
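
Because >> appends unconditionally, running the echo command more than once leaves duplicate export lines in the profile. One way to guard against that is sketched below; `append_once` is a hypothetical helper name, not a standard command.

```shell
# Append a line to a file only if that exact line is not already present.
append_once() {
    local line=$1 file=$2
    touch "$file"
    # -x matches whole lines, -F treats the text literally (no regex).
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example, using the path from this walkthrough (single quotes keep $PATH
# unexpanded so it is evaluated each time the profile is sourced):
#   append_once 'export PATH=/zfs/gsb/intermediate-yens/software/tesseract-5.2.0/bin:$PATH' ~/.bash_profile
```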

Tesseract also needs a trained language data file, such as the English one (or the trained data for another language), in order to use its language models, as described in the Tesseract documentation. Download the eng.traineddata file and copy it to the tesseract-5.2.0/tessdata directory.
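
A quick way to confirm the copy worked is to check that the file exists and is not empty. This is a sketch under the assumption that the data lives in the tessdata directory named above; `has_traineddata` is a hypothetical helper.

```shell
# Succeed if <lang>.traineddata exists and is non-empty in the given directory.
has_traineddata() {
    local lang=$1 dir=$2
    [ -s "$dir/$lang.traineddata" ]
}

# Example, using the path from this walkthrough:
#   has_traineddata eng /zfs/gsb/intermediate-yens/software/tesseract-5.2.0/tessdata \
#       || echo "eng.traineddata is missing" >&2
```

Tesseract itself can also report the languages it finds with tesseract --list-langs. If it does not see the file, the TESSDATA_PREFIX environment variable can point it at the directory that holds the data.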

Once your bash profile has been sourced, you can call the new tesseract executable from anywhere on the yens' file system.

Source the bash profile to execute the added export PATH command:

$ source ~/.bash_profile

Check the new tesseract version:

$ cd
$ tesseract --version

You should see the updated version:

tesseract 5.2.0
 leptonica-1.82.0
  libgif 5.1.9 : libjpeg 8d (libjpeg-turbo 2.1.1) : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.2 : libopenjp2 2.4.0
 Found AVX
 Found SSE4.1
 Found OpenMP 201511
 Found libarchive 3.6.0 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.4.8
 Found libcurl/7.81.0 OpenSSL/3.0.2 zlib/1.2.11 brotli/1.0.9 zstd/1.4.8 libidn2/2.3.2 libpsl/0.21.0 (+libidn2/2.3.2) libssh/0.9.6/openssl/zlib nghttp2/1.43.0 librtmp/2.3 OpenLDAP/2.5.13
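
If you want to check the version from a script rather than by eye, the number can be pulled from the first line of that output. The sketch below assumes the first line has the form "tesseract <version>" as shown above; `reported_version` is a hypothetical helper name.

```shell
# Read --version output on stdin and print the second field of the first line.
reported_version() {
    awk 'NR == 1 { print $2 }'
}

# Demonstration on the first two lines of the output shown above:
printf 'tesseract 5.2.0\n leptonica-1.82.0\n' | reported_version   # prints 5.2.0

# With the real executable (2>&1 also captures versions that print to stderr):
#   tesseract --version 2>&1 | reported_version
```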

Finally, clean up by removing the source tarball. This is especially important if you downloaded the tarball to your home directory, which should stay tidy and under the 50 GB space limit:

$ cd /zfs/gsb/intermediate-yens/software/
$ rm 5.2.0.tar.gz