问题描述
我对 python 和一般编程很陌生,但我正在尝试对包含大约 700 万行 python 的制表符分隔的 .txt 文件运行滑动窗口"计算.我所说的滑动窗口的意思是,它将运行一个计算,比如 50,000 行,报告数字,然后向上移动说 10,000 行,并在另外 50,000 行上执行相同的计算.我的计算和滑动窗口"工作正常,如果我在一小部分数据上测试它,它运行良好.但是,如果我尝试在我的整个数据集上运行该程序,它会非常慢(我现在已经运行了大约 40 个小时).数学很简单,所以我认为它不应该花这么长时间.
I am quite new to python and programming in general, but I am trying to run a "sliding window" calculation over a tab delimited .txt file that contains about 7 million lines with python. What I mean by sliding window is that it will run a calculation over say 50,000 lines, report the number and then move up say 10,000 lines and perform the same calculation over another 50,000 lines. I have the calculation and the "sliding window" working correctly and it runs well if I test it on a a small subset of my data. However, if i try to run the program over my entire data set it is incredibly slow (i've had it running now for about 40 hours). The math is quite simple so I don't think it should be taking this long.
我现在阅读 .txt 文件的方式是使用 csv.DictReader 模块.我的代码如下:
The way I am reading my .txt file right now is with the csv.DictReader module. My code is as follows:
file1='/Users/Shared/SmallSetbee.txt'
newfile=open(file1, 'rb')
reader=csv.DictReader((line.replace(' ','') for line in newfile), delimiter=" ")
我相信这是一次从所有 700 万行中制作一本字典,我认为这可能是它对于较大文件的速度如此之慢的原因.
I believe that this is making a dictionary out of all 7 million lines at once, which I'm thinking could be the reason it slows down so much for the larger file.
由于我只对一次对块"或窗口"数据运行计算感兴趣,有没有更有效的方法来一次只读取指定的行,执行计算,然后重复指定行的新指定块"或窗口"?
Since I am only interested in running my calculation over "chunks" or "windows" of data at a time, is there a more efficient way to read in only specified lines at a time, perform the calculation and then repeat with a new specified "chunk" or "window" of specified lines?
推荐答案
collections.deque
是一个有序的项目集合,可以采用最大大小.当您将一个项目添加到一端时,一个项目会从另一端落下.这意味着要遍历 csv 上的窗口",您只需要继续向 deque
添加行,它就会处理丢弃完整的行.
A collections.deque
is an ordered collection of items which can take a maximum size. When you add an item to one end, one falls of the other end. This means that to iterate over a "window" on your csv, you just need to keep adding rows to the deque
and it will handle throwing away complete ones already.
dq = collections.deque(maxlen=50000)
with open(...) as csv_file:
reader = csv.DictReader((line.replace(" ", "") for line in csv_file), delimiter=" ")
# initial fill
for _ in range(50000):
dq.append(reader.next())
# repeated compute
try:
while 1:
compute(dq)
for _ in range(10000):
dq.append(reader.next())
except StopIteration:
compute(dq)
这篇关于在 python 中有效地处理一个大的 .txt 文件的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持跟版网!