Problem description
I have a web crawler script that spawns at most 500 threads. Each thread requests data from a remote server, and each server's reply differs from the others in content and size.
I'm setting the thread stack size to 756 KiB:
import threading
threading.stack_size(756 * 1024)  # applies to threads created after this call
This lets me run the number of threads I need and complete most of the jobs and requests. But some servers' responses are bigger than others, and when a thread gets one of those responses, the script dies with SIGSEGV.
A stack size larger than 756 KiB makes it impossible to have the required number of threads running at the same time.
Any suggestions on how I can keep the given stack size without crashes? And how can I get the stack size currently in use by any given thread?
Recommended answer
Why on earth are you spawning 500 threads? That seems like a terrible idea!
Remove threading completely and use an event loop to do the crawling. Your program will be faster, simpler, and easier to maintain.
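As one possible illustration of crawling without threads, here is a minimal sketch that uses the standard library's asyncio (assuming Python 3.7 or later) as the event loop. The function names fetch and crawl, the one-line request, and the limit of 500 are assumptions made for the example, not part of the original script. Every connection is a coroutine on a single thread, so there is no per-connection thread stack to size.

```python
import asyncio


async def fetch(host, port, request=b"GET\n"):
    """Open one connection, send a request, and read the reply until EOF."""
    reader, writer = await asyncio.open_connection(host, port)
    writer.write(request)
    await writer.drain()
    body = await reader.read()  # read until the server closes the connection
    writer.close()
    await writer.wait_closed()
    return body


async def crawl(targets, limit=500):
    """Fetch every (host, port) target with at most `limit` connections open."""
    gate = asyncio.Semaphore(limit)

    async def bounded(target):
        async with gate:
            return await fetch(*target)

    # gather() returns the replies in the same order as `targets`.
    return await asyncio.gather(*(bounded(t) for t in targets))
```

It would be driven with something like asyncio.run(crawl(list_of_targets)). Large replies end up in ordinary heap-allocated bytes objects, whatever their size.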
Lots of threads waiting on the network won't make your program wait any faster. Instead, collect all open sockets in a list and run a loop that checks whether any of them has data available.
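The loop described above can be sketched with the standard library's selectors module. This is only an illustration under stated assumptions: the helper name drain_sockets and the 4096-byte chunk size are made up for the example, and a real crawler would also need to send requests and handle errors and timeouts.

```python
import selectors
import socket


def drain_sockets(socks, chunk_size=4096):
    """Read every socket to EOF from a single thread; return {socket: bytes}."""
    sel = selectors.DefaultSelector()
    buffers = {}
    for s in socks:
        s.setblocking(False)
        buffers[s] = bytearray()
        sel.register(s, selectors.EVENT_READ)

    pending = len(buffers)
    while pending:
        # Block until at least one registered socket is readable.
        for key, _events in sel.select():
            s = key.fileobj
            try:
                data = s.recv(chunk_size)
            except BlockingIOError:
                continue  # spurious wakeup, try again on the next pass
            if data:
                buffers[s] += data
            else:
                # An empty read means the peer closed the connection.
                sel.unregister(s)
                pending -= 1
    sel.close()
    return {s: bytes(buf) for s, buf in buffers.items()}
```

Each reply accumulates in its own bytearray, so a large response only grows that buffer and never touches a thread stack.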
I recommend using Twisted, an event-driven networking engine. It is very flexible, secure, scalable, and very stable (no segfaults).
You could also take a look at Scrapy, a web crawling and screen scraping framework written in Python on top of Twisted. It is still under heavy development, but maybe you can take some ideas from it.