Python unzip file

3/16/2023

Handling decompression and saving as separate steps might offer a benefit if we wish to explore possible speed-ups by separating the two elements of unzipping each file.

Next, let's start to explore the use of threads to speed up file unzipping and saving.

The ThreadPoolExecutor provides a pool of worker threads that we can use.

Each file can be submitted as a task to the thread pool, and worker threads can perform the task of decompressing the file and saving it to disk.

First, we can create the thread pool with 100 worker threads. We will use the context manager to ensure the thread pool is closed once we are finished with it.

We can also try to unzip files concurrently with processes instead of threads. It is unclear whether processes can offer a speed benefit in this case. Given that we can get a benefit using threads, we know it is possible. Nevertheless, using processes requires the data for each task to be serialized, which introduces additional overhead that might negate any speed-up from performing file operations with true parallelism via processes.

We can explore using processes to unzip files concurrently using the ProcessPoolExecutor. Unlike with the ThreadPoolExecutor, we cannot simply send the handle to the open zip file to each worker process, as ZipFile objects cannot be serialized. Instead, each process must open the zip archive and extract files separately.

First, we can update the unzip_file() function to take the name of the zip file instead of the file handle. The updated version then opens the zip file before unzipping a single file to the destination directory.

Unzip Files Concurrently with Processes and Threads in Batch

It is possible to separate the CPU-bound and IO-bound effort of each file unzip task.
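The thread-based and process-based approaches described here can be sketched roughly as follows. This is a minimal illustration, not the original listing: the helper names unzip_member(), unzip_with_threads() and unzip_with_processes() are hypothetical, the process pool size of 4 is an arbitrary choice, and the sketch assumes a flat archive (no nested directories) and a destination directory that already exists.

```python
import zipfile
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def unzip_member(handle, member, dest):
    """Thread task (hypothetical name): extract one member via a shared open ZipFile."""
    handle.extract(member, dest)


def unzip_file(zip_path, member, dest):
    """Process task: open the archive by name, then extract a single member."""
    with zipfile.ZipFile(zip_path, "r") as handle:
        handle.extract(member, dest)


def unzip_with_threads(zip_path, dest, n_workers=100):
    """Submit one extraction task per member to a thread pool."""
    with zipfile.ZipFile(zip_path, "r") as handle:
        # The context manager closes the pool once all tasks are done.
        with ThreadPoolExecutor(n_workers) as pool:
            futures = [
                pool.submit(unzip_member, handle, member, dest)
                for member in handle.namelist()
            ]
            for future in futures:
                future.result()  # re-raise any error from a worker


def unzip_with_processes(zip_path, dest, n_workers=4):
    """Submit one task per member to a process pool.

    Only picklable values (strings) are sent to the workers, because
    an open ZipFile cannot be pickled.
    """
    with zipfile.ZipFile(zip_path, "r") as handle:
        members = handle.namelist()
    with ProcessPoolExecutor(n_workers) as pool:
        futures = [
            pool.submit(unzip_file, zip_path, member, dest)
            for member in members
        ]
        for future in futures:
            future.result()
```

When trying the process-based version, call unzip_with_processes() from inside an `if __name__ == "__main__":` guard, since some platforms start worker processes by re-importing the main module.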

Intuitively, we would expect that the IO-bound part of the task is slower than the CPU-bound part of the task. That is, we expect to spend more time waiting for the hard drive than waiting for the CPU.

One approach would be to call the ZipFile.extract() function directly, which first decompresses the data into memory and then saves the data to disk.

An alternate approach might be to first decompress the data into memory as bytes using the ZipFile.read() function, then create a path and save the file to disk manually using Python file IO functions.
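The two approaches can be compared side by side in a small sketch. The function names unzip_with_extract() and unzip_with_read() are hypothetical, chosen only for illustration. Note that the manual version below does not sanitize member names the way ZipFile.extract() does, so it is only suitable for archives you trust.

```python
import os
import zipfile


def unzip_with_extract(handle, member, dest):
    """Decompress and save in a single call."""
    handle.extract(member, dest)


def unzip_with_read(handle, member, dest):
    """Decompress to bytes first, then save to disk manually."""
    out_path = os.path.join(dest, member)
    if member.endswith("/"):
        # Directory entry: nothing to decompress, just create it.
        os.makedirs(out_path, exist_ok=True)
        return
    data = handle.read(member)  # decompression step, returns bytes
    parent = os.path.dirname(out_path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(out_path, "wb") as out_file:  # saving step
        out_file.write(data)
```

Keeping the read and write steps apart like this is what makes it possible to time them, or schedule them, independently.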

How long does the serial version take to run on your machine?

Next, let's explore how we might use multithreading to speed up the unzipping process. We can adapt the program to be multithreaded with very few changes.

Decompressing data in memory is purely algorithmic, and intuitively we might think it is CPU-bound. Saving the decompressed data to file is IO-bound, as it is limited by the speed at which we can move data from main memory onto the hard drive.
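To get a baseline for the serial version on your own machine, a timing harness along these lines may help. This is a sketch under assumptions: the function names unzip_serial() and time_unzip() are hypothetical, and the serial version is assumed to extract one member at a time in a simple loop.

```python
import time
import zipfile


def unzip_serial(zip_path, dest):
    """Extract every member one after another; return how many were extracted."""
    with zipfile.ZipFile(zip_path, "r") as handle:
        members = handle.namelist()
        for member in members:
            handle.extract(member, dest)
    return len(members)


def time_unzip(zip_path, dest):
    """Return (member count, elapsed seconds) for one serial run."""
    start = time.perf_counter()
    count = unzip_serial(zip_path, dest)
    elapsed = time.perf_counter() - start
    return count, elapsed
```

Running time_unzip() a few times and comparing against a concurrent version gives a rough sense of whether the extra machinery pays off for a given archive.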
