Working with Large Zip Files in Python

How to work with large zip files without unzipping them, using Python's zipfile library


Problem

You want to access some data in a set of zip files, but you do not want to copy the zip files over to your home/project directory and unzip them. How would you go about accessing this data?

Our current working directory contains two notebooks, a sample zip file, and the contents of that zip file extracted into tmp.

$ tree
.
├── 2022_04_notes.zip
├── edgar-xbrl.ipynb
├── Jeff_test.ipynb
└── tmp
    ├── cal.tsv
    ├── dim.tsv
    ├── notes-metadata.json
    ├── num.tsv
    ├── pre.tsv
    ├── readme.htm
    ├── ren.tsv
    ├── sub.tsv
    ├── tag.tsv
    └── txt.tsv

Let us first look at the size differences. The du command shows how much disk space each one is using.

$ du -sh 2022_04_notes.zip
189M    2022_04_notes.zip


$ du -sh tmp
367M    tmp

Comparing the disk usage of the unzipped directory with that of the zip file, we can see that the difference is a little under 2X. Usually, this difference would be much more significant, but ZFS automatically compresses files to a certain degree, and du reports the space actually used on disk. Furthermore, extracting an entire zip file with an extensive directory of files takes time. If you only need data from one or a handful of the files in the archive, it is an inefficient use of time and disk space to extract everything. Below, we illustrate how you can use the zipfile Python package to grab info from any zip file without extracting all the contents. Let us look at an example where we want to pull out a small amount of data from some EDGAR zipped reports.

Using ZipFile

The zipfile module lets you work efficiently with zip files, reading just the files you need without unzipping the whole archive. It is part of the Python standard library, so there is nothing to install with pip; it can be imported directly.

First, we can use zipfile to check what the zip file contains without unzipping it:

import pandas as pd
import os
from zipfile import ZipFile

file_name='2022_04_notes.zip'
with ZipFile(file_name, 'r') as edgar:
    edgar.printdir()
File Name                                             Modified             Size
sub.tsv                                        2022-05-01 15:35:42      2177048
tag.tsv                                        2022-05-01 15:35:42     59022625
dim.tsv                                        2022-05-01 15:35:44     25304351
ren.tsv                                        2022-05-01 15:35:44     33408100
cal.tsv                                        2022-05-01 15:35:46     27885932
pre.tsv                                        2022-05-01 15:35:46    204826805
num.tsv                                        2022-05-01 15:35:52    325109088
txt.tsv                                        2022-05-01 15:36:08    325066949
readme.htm                                     2022-05-01 15:36:22       267323
notes-metadata.json                            2022-05-01 15:36:22        67978
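
The Size column above is the uncompressed size of each member. If you want the same information programmatically, together with the space each member takes inside the archive, the ZipInfo objects returned by infolist() carry both numbers. The following is a minimal sketch, not part of the original notebook: the helper name member_sizes and the small demo.zip archive it builds are made up for illustration, so that the example runs without the EDGAR data.

```python
import os
import tempfile
from zipfile import ZipFile, ZIP_DEFLATED


def member_sizes(zip_path):
    """Return {member name: (uncompressed bytes, compressed bytes)}."""
    with ZipFile(zip_path, 'r') as archive:
        return {info.filename: (info.file_size, info.compress_size)
                for info in archive.infolist()}


# Build a tiny stand-in archive so the sketch runs anywhere
# (demo.zip and its contents are hypothetical, not the EDGAR data).
tmp_dir = tempfile.mkdtemp()
demo_zip = os.path.join(tmp_dir, 'demo.zip')
with ZipFile(demo_zip, 'w', compression=ZIP_DEFLATED) as archive:
    archive.writestr('txt.tsv', 'adsh\ttag\tvalue\n' + 'a\tb\tc\n' * 1000)
    archive.writestr('readme.htm', '<html></html>')

sizes = member_sizes(demo_zip)
for name, (raw, packed) in sizes.items():
    print(f'{name}: {raw} bytes uncompressed, {packed} bytes in the archive')
```

Pointing member_sizes at 2022_04_notes.zip instead of the demo archive should report the same uncompressed sizes as the printdir() listing.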

Now imagine you want to pull the first data row (the line right after the header) from the txt.tsv file inside every zip file in the following directory.

#This path is representative of whichever data directory you would like to read from
print(os.listdir('/zfs/data/Edgar_xbrl/'))
['2014q3_notes.zip', '2014q2_notes.zip', '2016q4_notes.zip', '2019q4_notes.zip', '2014q1_notes.zip', '2009q4_notes.zip', '2009q3_notes.zip', '2019q2_notes.zip', '2022_08_notes.zip', '2016q2_notes.zip', '2009q2_notes.zip', '2016q3_notes.zip', '2022_09_notes.zip', '2019q3_notes.zip', '2021_08_notes.zip', '2009q1_notes.zip', '2021_09_notes.zip', '2014q4_notes.zip', '2019q1_notes.zip', '2016q1_notes.zip', '2012q4_notes.zip', '2021_07_notes.zip', '2020q3_notes.zip', '2022_05_notes.zip', '2010q1_notes.zip', '2021_06_notes.zip', '2020q2_notes.zip', '2022_04_notes.zip', '2020q1_notes.zip', '2021_05_notes.zip', '2010q3_notes.zip', '2022_07_notes.zip', '2021_04_notes.zip', '2010q2_notes.zip', '2022_06_notes.zip', '2022_01_notes.zip', '2021_03_notes.zip', '2010q4_notes.zip', '2012q1_notes.zip', '2021_02_notes.zip', '2022_03_notes.zip', '2021_01_notes.zip', '2012q2_notes.zip', '2022_02_notes.zip', '2012q3_notes.zip', '2011q2_notes.zip', '2021_12_notes.zip', '2011q3_notes.zip', '2020_12_notes.zip', '2021_10_notes.zip', '2020_11_notes.zip', '2021_11_notes.zip', '2011q1_notes.zip', '2013q4_notes.zip', '2020_10_notes.zip', '2013q3_notes.zip', '2013q2_notes.zip', '2013q1_notes.zip', '2011q4_notes.zip', '2015q1_notes.zip', '2018q4_notes.zip', '2017q4_notes.zip', '2015q2_notes.zip', '2015q3_notes.zip', '2017q1_notes.zip', '2018q1_notes.zip', '2015q4_notes.zip', '2018q3_notes.zip', '2017q3_notes.zip', '2017q2_notes.zip', '2018q2_notes.zip']
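data_dir='/zfs/data/Edgar_xbrl/'
firstline=[]
for i in os.listdir(data_dir):
    # Open each archive in the directory in turn
    with ZipFile(os.path.join(data_dir, i), 'r') as edgar:
        with edgar.open('txt.tsv','r') as tab_file:
            # Line 1 is the header; line 2 is the first data row
            first=tab_file.readline().decode('utf-8').rstrip('\r\n').split('\t')
            second=tab_file.readline().decode('utf-8').rstrip('\r\n').split('\t')
            firstline.append(second)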

data=pd.DataFrame(firstline,columns=first)
data.head()

[Image: output of data.head()]
As you can see, with only a few lines of code and a short amount of time, we can pull any data from this directory of zip files. The data is read straight from the archives into memory (RAM), so no extracted copies are written to disk.
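
If you need more than one row, the same idea extends to streaming a member line by line. The sketch below is one possible approach under stated assumptions, not the method used in the notebook above: it wraps the byte stream in io.TextIOWrapper and parses it with the standard csv module, assuming UTF-8 text and tab-separated fields with no quoting (mirroring the split on '\t' used earlier). The helper name head_of_member and the demo archive are hypothetical.

```python
import csv
import io
import os
import tempfile
from itertools import islice
from zipfile import ZipFile


def head_of_member(zip_path, member, n_rows=3):
    """Return the header and the first n_rows data rows of one member."""
    with ZipFile(zip_path, 'r') as archive:
        with archive.open(member, 'r') as raw:
            # TextIOWrapper decodes the compressed byte stream lazily,
            # so only the lines we actually consume are decompressed.
            text = io.TextIOWrapper(raw, encoding='utf-8', newline='')
            # QUOTE_NONE treats quote characters as ordinary text,
            # like a plain split on tabs would.
            reader = csv.reader(text, delimiter='\t', quoting=csv.QUOTE_NONE)
            header = next(reader)
            rows = list(islice(reader, n_rows))
    return header, rows


# Hypothetical stand-in archive so the sketch runs without the EDGAR data.
tmp_dir = tempfile.mkdtemp()
demo_zip = os.path.join(tmp_dir, 'demo.zip')
lines = ['adsh\ttag\tvalue'] + [f'{n}\tt{n}\tv{n}' for n in range(5)]
with ZipFile(demo_zip, 'w') as archive:
    archive.writestr('txt.tsv', '\n'.join(lines) + '\n')

header, rows = head_of_member(demo_zip, 'txt.tsv', n_rows=2)
print(header)
print(rows)
```

Because only the requested lines are consumed, this stays cheap even for members as large as txt.tsv.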