Working with Large Zip Files in Python
How to work with large zip files without unzipping them, using Python's zipfile library
Problem
You want to access some data in a set of zip files, but you do not want to copy the zip files to your home/project directory and unzip them. How would you go about accessing this data?
This is our current working directory. It contains two notebooks, a sample zip file, and the contents of that zip file unzipped in tmp.
$ tree
.
├── 2022_04_notes.zip
├── edgar-xbrl.ipynb
├── Jeff_test.ipynb
└── tmp
    ├── cal.tsv
    ├── dim.tsv
    ├── notes-metadata.json
    ├── num.tsv
    ├── pre.tsv
    ├── readme.htm
    ├── ren.tsv
    ├── sub.tsv
    ├── tag.tsv
    └── txt.tsv
Let us first look at the size difference. The du command shows how much disk space the zip file and the unzipped directory each use.
$ du -sh 2022_04_notes.zip
189M 2022_04_notes.zip
$ du -sh tmp
367M tmp
When we compare the disk usage of the unzipped directory with that of the zip file, the difference is a little under 2x. Usually this difference would be much larger, but ZFS automatically compresses files to some degree. Furthermore, extracting an entire archive that contains many files takes time. If you only need data from one or a handful of the files in the archive, extracting everything is an inefficient use of time and disk space. Below, we illustrate how you can use Python's zipfile module to grab data from any zip file without extracting all of its contents. Let us look at an example where we want to pull a small amount of data from some zipped EDGAR reports.
Using ZipFile
The zipfile module lets you work efficiently with zip files, reading only the members you need without unzipping the whole archive. It is part of Python's standard library, so it ships with Python and does not need to be installed with pip.
First, we can use zipfile to check what the zip file contains without unzipping it:
import pandas as pd
import os
from zipfile import ZipFile

file_name = '2022_04_notes.zip'
# List the archive's members without extracting anything
with ZipFile(file_name, 'r') as edgar:
    edgar.printdir()
File Name Modified Size
sub.tsv 2022-05-01 15:35:42 2177048
tag.tsv 2022-05-01 15:35:42 59022625
dim.tsv 2022-05-01 15:35:44 25304351
ren.tsv 2022-05-01 15:35:44 33408100
cal.tsv 2022-05-01 15:35:46 27885932
pre.tsv 2022-05-01 15:35:46 204826805
num.tsv 2022-05-01 15:35:52 325109088
txt.tsv 2022-05-01 15:36:08 325066949
readme.htm 2022-05-01 15:36:22 267323
notes-metadata.json 2022-05-01 15:36:22 67978
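The same listing is also available programmatically, which helps when you want to decide which members are worth reading. The following is a minimal sketch built on ZipFile.infolist(); the helper name member_sizes is hypothetical and not part of the original notebook.

```python
from zipfile import ZipFile


def member_sizes(zip_path):
    """Return (name, uncompressed_bytes, compressed_bytes) for each member.

    Only the archive's directory listing is read; no member is decompressed.
    member_sizes is a hypothetical helper name used for illustration.
    """
    with ZipFile(zip_path, 'r') as archive:
        return [(info.filename, info.file_size, info.compress_size)
                for info in archive.infolist()]


# Example call (assumes the sample archive is in the working directory):
# for name, size, packed in member_sizes('2022_04_notes.zip'):
#     print(name, size, packed)
```

Comparing the uncompressed and compressed sizes of each member gives a quick sense of how much space a full extraction would need before you commit to one.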
Now imagine you want to pull the first data row of the txt.tsv file inside every archive in the following directory and collect those rows in one table.
#This path is representative of whichever data directory you would like to read from
print(os.listdir('/zfs/data/Edgar_xbrl/'))
['2014q3_notes.zip', '2014q2_notes.zip', '2016q4_notes.zip', '2019q4_notes.zip', '2014q1_notes.zip', '2009q4_notes.zip', '2009q3_notes.zip', '2019q2_notes.zip', '2022_08_notes.zip', '2016q2_notes.zip', '2009q2_notes.zip', '2016q3_notes.zip', '2022_09_notes.zip', '2019q3_notes.zip', '2021_08_notes.zip', '2009q1_notes.zip', '2021_09_notes.zip', '2014q4_notes.zip', '2019q1_notes.zip', '2016q1_notes.zip', '2012q4_notes.zip', '2021_07_notes.zip', '2020q3_notes.zip', '2022_05_notes.zip', '2010q1_notes.zip', '2021_06_notes.zip', '2020q2_notes.zip', '2022_04_notes.zip', '2020q1_notes.zip', '2021_05_notes.zip', '2010q3_notes.zip', '2022_07_notes.zip', '2021_04_notes.zip', '2010q2_notes.zip', '2022_06_notes.zip', '2022_01_notes.zip', '2021_03_notes.zip', '2010q4_notes.zip', '2012q1_notes.zip', '2021_02_notes.zip', '2022_03_notes.zip', '2021_01_notes.zip', '2012q2_notes.zip', '2022_02_notes.zip', '2012q3_notes.zip', '2011q2_notes.zip', '2021_12_notes.zip', '2011q3_notes.zip', '2020_12_notes.zip', '2021_10_notes.zip', '2020_11_notes.zip', '2021_11_notes.zip', '2011q1_notes.zip', '2013q4_notes.zip', '2020_10_notes.zip', '2013q3_notes.zip', '2013q2_notes.zip', '2013q1_notes.zip', '2011q4_notes.zip', '2015q1_notes.zip', '2018q4_notes.zip', '2017q4_notes.zip', '2015q2_notes.zip', '2015q3_notes.zip', '2017q1_notes.zip', '2018q1_notes.zip', '2015q4_notes.zip', '2018q3_notes.zip', '2017q3_notes.zip', '2017q2_notes.zip', '2018q2_notes.zip']
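data_dir = '/zfs/data/Edgar_xbrl/'
firstline = []
for name in os.listdir(data_dir):
    # Open each archive in the directory, not just the sample file
    with ZipFile(os.path.join(data_dir, name), 'r') as edgar:
        with edgar.open('txt.tsv', 'r') as tab_file:
            # Header row first, then the first data row
            first = tab_file.readline().decode('utf-8').rstrip('\r\n').split('\t')
            second = tab_file.readline().decode('utf-8').rstrip('\r\n').split('\t')
    firstline.append(second)
# Assumes every archive's txt.tsv has the same columns; the last header read is used
data = pd.DataFrame(firstline, columns=first)
data.head()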
As you can see, a few lines of code and very little time are enough to pull data from a whole directory of zip files. Because each member is decompressed in memory as it is read, nothing has to be extracted to disk.
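When a member such as txt.tsv is hundreds of megabytes uncompressed, it can be convenient to read it as a text stream rather than decoding each line by hand. The sketch below wraps the handle returned by ZipFile.open in io.TextIOWrapper and stops after the requested number of rows. The helper name read_first_rows and its parameters are hypothetical, and it assumes the member is UTF-8 encoded and tab-separated, as in the example above.

```python
import io
from itertools import islice
from zipfile import ZipFile


def read_first_rows(zip_path, member, n_rows=1, encoding='utf-8'):
    """Return (header, rows): the header and the first n_rows data rows.

    read_first_rows is a hypothetical helper for illustration. The member is
    decompressed incrementally, so only the start of the file is read.
    """
    with ZipFile(zip_path, 'r') as archive:
        with archive.open(member, 'r') as raw:
            # Decode bytes to text lazily instead of calling .decode() per line
            text = io.TextIOWrapper(raw, encoding=encoding)
            lines = [line.rstrip('\r\n').split('\t')
                     for line in islice(text, n_rows + 1)]
    if not lines:
        # Empty member: no header and no rows
        return [], []
    return lines[0], lines[1:]


# Example call (assumes the sample archive is in the working directory):
# header, rows = read_first_rows('2022_04_notes.zip', 'txt.tsv', n_rows=5)
```

The header and rows it returns can be passed to pd.DataFrame(rows, columns=header) in the same way as the loop above.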