Problem Description
I have a huge XML file, and I'm a bit at a loss on how to handle it. It's 60 GB, and I need to read it.
I was wondering if there is a way to use the multiprocessing module to read the file in Python?
Does anyone have any samples of doing this that they could point me to?
Thanks
Recommended Answer
Save memory parsing very large XML files
You could use this code, which is a bit newer than the effbot.org one; it might save you more memory: Using Python Iterparse For Large XML Files
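Below is a minimal sketch of that iterparse pattern, assuming the 60 GB file is a flat sequence of <record> elements directly under the root (the file name and tag are placeholders for your actual data):

import xml.etree.ElementTree as ET

def iter_records(path, tag="record"):
    # iterparse streams the file instead of building the whole tree in memory.
    context = ET.iterparse(path, events=("start", "end"))
    _, root = next(context)  # the first "start" event hands us the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield elem
            # Clear already-processed children from the root, otherwise
            # references pile up and memory still grows with file size.
            root.clear()

for record in iter_records("huge.xml"):  # "huge.xml" is a placeholder
    ...  # handle one record at a time here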
Multiprocessing / Multithreading
If I remember correctly, you cannot easily use multiprocessing to speed up the loading/parsing of the XML. If this were an easy option, everyone would probably already do it by default. Python in general uses a global interpreter lock (GIL), which causes Python to run within one process, bound to one core of your CPU. When threads are used, they run in the context of the main Python process, which is still bound to only one core. Using threads in Python can even lead to a performance decrease due to context switching. Running multiple Python processes on multiple cores does bring the expected additional performance, but those processes do not share memory, so you need inter-process communication (IPC) to have the processes work together (you can use multiprocessing with a pool; the workers synchronize when the work is done, which is mostly useful for finite, not-too-small tasks). Sharing memory is required, I would assume, since every task is working on the same big XML. lxml, however, has some way to work around the GIL, but it only improves performance under certain conditions.
Threading in lxml
For introducing threading in lxml, there is a section in the FAQ that talks about this: http://lxml.de/FAQ.html#id1
Can I use threads to concurrently access the lxml API?
Short answer: yes, if you use lxml 2.2 and later.
Since version 1.1, lxml frees the GIL (Python's global interpreter lock) internally when parsing from disk and memory, as long as you use either the default parser (which is replicated for each thread) or create a parser for each thread yourself. lxml also allows concurrency during validation (RelaxNG and XMLSchema) and XSL transformation. You can share RelaxNG, XMLSchema and XSLT objects between threads.
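As a rough illustration of that FAQ point (not code from the answer): if the big file were first split into independent chunks, each thread could parse its own chunk with its own parser, and lxml would release the GIL inside the parse calls. The chunk files here are assumed to exist, and lxml must be installed:

import threading
from lxml import etree

def parse_chunk(path, results, index):
    parser = etree.XMLParser()        # one parser per thread, as the FAQ requires
    tree = etree.parse(path, parser)  # lxml releases the GIL inside this call
    results[index] = len(tree.getroot())  # stand-in for real per-chunk work

chunk_paths = ["chunk0.xml", "chunk1.xml"]  # hypothetical pre-split pieces
results = [None] * len(chunk_paths)
threads = [threading.Thread(target=parse_chunk, args=(p, results, i))
           for i, p in enumerate(chunk_paths)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)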
Will my program run faster if I use threads?
Depends. The best way to answer this is timing and profiling.
The global interpreter lock (GIL) in Python serializes access to the interpreter, so if the majority of your processing is done in Python code (walking trees, modifying elements, etc.), your gain will be close to zero. The more of your XML processing moves into lxml, however, the higher your gain. If your application is bound by XML parsing and serialisation, or by very selective XPath expressions and complex XSLTs, your speedup on multi-processor machines can be substantial.
See the question above to learn which operations free the GIL to support multi-threading.
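To make the "move work into lxml" advice concrete, here is a small hedged comparison: the same query written as a Python-level tree walk and as a compiled XPath expression that runs in lxml's C code. The record/status/id names are invented for the example, and every matching record is assumed to carry both attributes:

from lxml import etree

tree = etree.parse("chunk0.xml")  # hypothetical file

# Python-level walk: every element visit pays interpreter overhead
# and holds the GIL the whole time.
ids_py = [e.get("id") for e in tree.iter("record")
          if e.get("status") == "ok"]

# The same query pushed into lxml's C code via a compiled XPath.
find_ok = etree.XPath('//record[@status="ok"]/@id')
ids_c = find_ok(tree)

assert list(ids_py) == list(ids_c)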
Additional tips for optimizing the performance of parsing large XML: https://www.ibm.com/developerworks/library/x-hiperfparse/