问题描述
这是一个看似简单的问题:给定一个生成升序整数序列的迭代器列表,编写一个只生成每个序列中出现的整数的简洁生成器.
Here is a seemingly simple problem: given a list of iterators that yield sequences of integers in ascending order, write a concise generator that yields only the integers that appear in every sequence.
昨晚阅读了几篇论文后,我决定用 Python 编写一个完全最小的全文索引器,在这里看到(虽然那个版本现在很老了).
After reading a few papers last night, I decided to hack up a completely minimal full text indexer in Python, as seen here (though that version is quite old now).
我的问题在于 search()
函数,它必须遍历每个发布列表并仅生成每个列表中出现的文档 ID.从上面的链接可以看出,我当前的非递归工作"尝试很糟糕.
My problem is with the search()
function, which must iterate over each posting list and yield only the document IDs that appear on every list. As you can see from the link above, my current non-recursive 'working' attempt is terrible.
示例:
postings = [[1, 100, 142, 322, 12312],
[2, 100, 101, 322, 1221],
[100, 142, 322, 956, 1222]]
应该让步:
[100, 322]
至少有一个优雅的递归函数解决方案,但如果可能的话,我想避免这种情况.但是,涉及嵌套生成器表达式、itertools
滥用或任何其他类型的代码高尔夫的解决方案非常受欢迎.:-)
There is at least one elegant recursive function solution to this, but I'd like to avoid that if possible. However, a solution involving nested generator expressions, itertools
abuse, or any other kind of code golf is more than welcome. :-)
应该可以将函数安排为只需要与最小列表中的项目一样多的步骤,而无需将整组整数吸入内存.将来,这些列表可能会从磁盘中读取,并且大于可用的 RAM.
It should be possible to arrange for the function to only require as many steps as there are items in the smallest list, and without sucking the entire set of integers into memory. In future, these lists may be read from disk, and larger than available RAM.
在过去的 30 分钟里,我在舌尖有了一个想法,但我无法将其完全融入代码中.请记住,这只是为了好玩!
For the past 30 minutes I've had an idea on the tip of my tongue, but I can't quite get it into code. Remember, this is just for fun!
推荐答案
import heapq, itertools
def intersect(*its):
for key, values in itertools.groupby(heapq.merge(*its)):
if len(list(values)) == len(its):
yield key
>>> list(intersect(*postings))
[100, 322]
这篇关于加入一组产生 Python 迭代器的有序整数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持跟版网!