Get file names inside a zip file on an FTP server without downloading the whole archive

Question

I have a lot of zip archives on a remote FTP server, and their sizes go up to 20 TB. I just need the file names inside those zip archives, so that I can plug them into my Python scripts.

Is there any way to get just the file names without actually downloading the files and extracting them on my local machine? If so, can someone direct me to the right library/package?

Recommended answer

You can implement a file-like object that reads data from FTP instead of from a local file, and pass that object to the ZipFile constructor instead of a (local) file name.

A trivial implementation could look like this:

from ftplib import FTP, all_errors
from ssl import SSLSocket

class FtpFile:
    """Minimal read-only, seekable file-like object backed by a file on FTP."""

    def __init__(self, ftp, name):
        self.ftp = ftp
        self.name = name
        self.size = ftp.size(name)
        self.pos = 0

    def seekable(self):
        return True

    def seek(self, offset, whence=0):
        if whence == 0:
            self.pos = offset
        elif whence == 1:
            self.pos += offset
        elif whence == 2:
            self.pos = self.size + offset
        return self.pos

    def tell(self):
        return self.pos

    def read(self, size=None):
        if size is None or size < 0:
            size = self.size - self.pos
        data = b""

        # Based on FTP.retrbinary
        # (but allows stopping after a certain number of bytes read)
        # An alternative implementation is at
        # https://stackoverflow.com/q/58819210/850848#58819362
        self.ftp.voidcmd('TYPE I')
        cmd = "RETR {}".format(self.name)
        # Start the transfer at the current position (REST offset)
        conn = self.ftp.transfercmd(cmd, self.pos)
        try:
            while len(data) < size:
                buf = conn.recv(min(size - len(data), 8192))
                if not buf:
                    break
                data += buf
            # Shut down the SSL layer (can be removed if not using TLS/SSL)
            if isinstance(conn, SSLSocket):
                conn.unwrap()
        finally:
            conn.close()
        try:
            self.ftp.voidresp()
        except all_errors:
            # The server may report an error because the transfer was cut short
            pass
        self.pos += len(data)
        return data

And then you can use it like this:

import zipfile

ftp = FTP(host, user, passwd)
ftp.cwd(path)

ftpfile = FtpFile(ftp, "archive.zip")
archive = zipfile.ZipFile(ftpfile)
print(archive.namelist())


The above implementation is rather trivial and inefficient. It starts numerous (three at minimum) downloads of small chunks of data to retrieve the list of contained files. It can be optimized by reading and caching larger chunks, but it should give you the idea.

In particular, you can make use of the fact that you are going to read the listing only. The listing is located at the end of a ZIP archive. So you can download just the last (about) 10 KB of data up front, and you will be able to fulfill all read calls out of that cache.
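The caching idea could be sketched roughly as follows. This is only an illustration under assumptions: `TailCachedFile`, `fetch`, and `tail_size` are hypothetical names, not part of ftplib or zipfile, and `fetch(offset, length)` stands for whatever code returns bytes of the remote file starting at `offset` (over FTP, a RETR with a restart offset, as in the read method above).

```python
import io
import zipfile


class TailCachedFile:
    """Read-only file-like object that serves reads near the end from a cache.

    `fetch(offset, length)` is a hypothetical callable returning up to
    `length` bytes of the remote file starting at `offset`.
    """

    def __init__(self, fetch, size, tail_size=10 * 1024):
        self.fetch = fetch
        self.size = size
        self.pos = 0
        # Download the tail once; the ZIP listing lives there.
        self.tail_start = max(0, size - tail_size)
        self.tail = fetch(self.tail_start, size - self.tail_start)

    def seekable(self):
        return True

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self.pos = offset
        elif whence == io.SEEK_CUR:
            self.pos += offset
        elif whence == io.SEEK_END:
            self.pos = self.size + offset
        self.pos = max(0, self.pos)
        return self.pos

    def tell(self):
        return self.pos

    def read(self, size=-1):
        remaining = max(0, self.size - self.pos)
        if size is None or size < 0 or size > remaining:
            size = remaining
        if size == 0:
            return b""
        if self.pos >= self.tail_start:
            # Cache hit: no network round trip.
            start = self.pos - self.tail_start
            data = self.tail[start:start + size]
        else:
            # Cache miss: fall back to a remote read.
            data = self.fetch(self.pos, size)
        self.pos += len(data)
        return data
```

When the whole listing fits in the cached tail, `zipfile.ZipFile(TailCachedFile(fetch, size)).namelist()` should need only the single download made in the constructor.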

Knowing that, you can actually do a small hack. As the listing is at the end of the archive, you can download only the end of the archive. While the downloaded ZIP will be broken, it can still be listed. This way, you won't need the FtpFile class. You can even download the listing to memory (BytesIO).

from io import BytesIO
import zipfile

zipdata = BytesIO()
name = "archive.zip"
size = ftp.size(name)
# Download only the last 10 KB (or the whole file if it is smaller)
ftp.retrbinary("RETR " + name, zipdata.write, rest=max(0, size - 10*1024))

archive = zipfile.ZipFile(zipdata)

print(archive.namelist())

If you get a BadZipFile exception because 10 KB is too small to contain the whole listing, you can retry the code with a larger chunk.
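One way to automate that retry is sketched below. It is not a definitive implementation: `list_names_from_tail` and `fetch_tail` are hypothetical names, and `fetch_tail(offset)` is assumed to return the bytes of the remote file from `offset` to the end (for FTP, a retrbinary call with `rest=offset` writing into a buffer).

```python
import io
import zipfile


def list_names_from_tail(fetch_tail, size, chunk=10 * 1024):
    """Return ZIP member names, growing the downloaded tail until it parses.

    `fetch_tail(offset)` is a hypothetical callable returning the bytes of
    the remote file from `offset` to the end.
    """
    while True:
        chunk = min(chunk, size)
        tail = io.BytesIO(fetch_tail(size - chunk))
        try:
            with zipfile.ZipFile(tail) as archive:
                return archive.namelist()
        except (zipfile.BadZipFile, ValueError):
            # Assumption: some Python versions report a truncated listing as
            # ValueError (negative seek) rather than BadZipFile.
            if chunk >= size:
                raise  # the whole file was tried; it is not a valid ZIP
            chunk *= 2
```

Doubling the chunk keeps the number of attempts small even for archives with very long listings, at the cost of downloading some data more than once.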
