        Iterate over 36 million items in a list of tuples in Python efficiently

                  This article walks through a question about iterating over 36 million items in a list of tuples in Python efficiently and faster, along with a recommended answer; it may be a useful reference for anyone facing the same problem.

                  Problem Description

                  Firstly, before anyone marks it as a duplicate, please read below. I am unsure if the delay in the iteration is due to the huge size or my logic. I have a use case where I have to iterate over 36 million items in a list of tuples. My main requirement is speed and efficiency. Sample list:

                  [
                      ('how are you', 'I am fine'),
                      ('how are you', 'I am not fine'),
                      ...36 million items...
                  ]
                  

                  What I have done so far:

                  for query_question in combined:
                      # word_tokenize returns a list; str() it so literal_eval can parse it back
                      query = "{}".format(word_tokenize(query_question[0]))
                      question = "{}".format(word_tokenize(query_question[1]))
                  
                      # the function uses a naive doc2vec extension of GLOVE word vectors
                      vec1 = np.mean([
                          word_vector_dict[word]
                          for word in literal_eval(query)
                          if word in word_vector_dict
                      ], axis=0)
                  
                      vec2 = np.mean([
                          word_vector_dict[word]
                          for word in literal_eval(question)
                          if word in word_vector_dict
                      ], axis=0)
                  
                      similarity_score = 1 - distance.cosine(vec1, vec2)
                      # list.append mutates in place and returns None, so do not reassign it
                      store_question_score.append(
                          (query_question[1], similarity_score)
                      )
                      count += 1
                  
                      if count == len(data_list):
                          # list.sort also returns None; sorted() returns the new list
                          store_question_score_descending = sorted(
                              store_question_score, key=itemgetter(1), reverse=True
                          )
                          result_dict[query_question[0]] = store_question_score_descending[:5]
                          store_question_score = []
                          count = 1
                  

                  The above logic aims to calculate the similarity scores between questions and perform a text similarity algorithm. I suspect the delay in the iteration comes from the calculation of vec1 and vec2. If so, how can I do this better? I am looking for ways to speed up the process.

                  There are plenty of other questions about iterating over huge lists, but I could not find any that solved my problem.

                  I really appreciate any help you can provide.

                  Recommended Answer

                  Try caching:

                  from ast import literal_eval
                  from functools import lru_cache
                  
                  import numpy as np
                  
                  @lru_cache(maxsize=None)
                  def compute_vector(s):
                      # s is the str() of a token list; literal_eval turns it back into a list
                      return np.mean([
                          word_vector_dict[word]
                          for word in literal_eval(s)
                          if word in word_vector_dict
                      ], axis=0)
                  

                  Then use this instead:

                  vec1 = compute_vector(query)
                  vec2 = compute_vector(question)
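
The effect of `lru_cache` can be seen in a minimal, self-contained sketch. The toy `word_vector_dict` below is a stand-in for the real GloVe dictionary, and the pre-tokenized string stands in for `"{}".format(word_tokenize(...))` from the question:

```python
from ast import literal_eval
from functools import lru_cache

import numpy as np

# toy stand-in for the real GloVe-based word_vector_dict
word_vector_dict = {
    "how": np.array([1.0, 0.0]),
    "are": np.array([0.0, 1.0]),
    "you": np.array([1.0, 1.0]),
}

@lru_cache(maxsize=None)
def compute_vector(s):
    # s is the str() of a token list, e.g. "['how', 'are', 'you']"
    return np.mean([
        word_vector_dict[word]
        for word in literal_eval(s)
        if word in word_vector_dict
    ], axis=0)

v = compute_vector("['how', 'are', 'you']")
compute_vector("['how', 'are', 'you']")  # repeated call: served from the cache
hits = compute_vector.cache_info().hits  # number of cache hits so far
```

With 36 million rows but far fewer distinct questions, most iterations become a dictionary lookup instead of a `literal_eval` plus a NumPy mean.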
                  


                  If the size of the vectors is fixed, you can do even better by caching to a numpy array of shape (num_unique_keys, len(vec1)), where in your case num_unique_keys = 370000 + 100:

                  class VectorCache:
                      def __init__(self, func, num_keys, item_size):
                          self.func = func
                          self.cache = np.empty((num_keys, item_size), dtype=float)
                          self.keys = {}
                  
                      def __getitem__(self, key):
                          if key in self.keys:
                              return self.cache[self.keys[key]]
                          self.keys[key] = len(self.keys)
                          item = self.func(key)
                          self.cache[self.keys[key]] = item
                          return item
                  
                  
                  def compute_vector(s):
                      return np.mean([
                          word_vector_dict[word]
                          for word in literal_eval(s)
                          if word in word_vector_dict
                      ], axis=0)
                  
                  
                  # num_keys and item_size must be known up front,
                  # e.g. num_keys = 370000 + 100 and item_size = len(vec1)
                  vector_cache = VectorCache(compute_vector, num_keys, item_size)
                  

                  Then:

                  vec1 = vector_cache[query]
                  vec2 = vector_cache[question]
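
The preallocated-array idea can be exercised end to end with a toy vectorizing function. The class is repeated here so the sketch runs on its own, and `fake_vector` is a hypothetical stand-in for `compute_vector`:

```python
import numpy as np

class VectorCache:
    def __init__(self, func, num_keys, item_size):
        self.func = func
        # one preallocated row per distinct key: no per-item allocations later
        self.cache = np.empty((num_keys, item_size), dtype=float)
        self.keys = {}

    def __getitem__(self, key):
        if key in self.keys:
            return self.cache[self.keys[key]]
        self.keys[key] = len(self.keys)
        item = self.func(key)
        self.cache[self.keys[key]] = item
        return item

calls = []

def fake_vector(s):
    calls.append(s)                    # record every real computation
    return np.full(2, float(len(s)))   # stand-in for the GloVe mean vector

vector_cache = VectorCache(fake_vector, num_keys=10, item_size=2)
v1 = vector_cache["how are you"]
v2 = vector_cache["how are you"]       # second access: read from the array
```

Here `fake_vector` runs only once even though the key is requested twice; with `num_keys = 370000 + 100`, the same holds for the real workload.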
                  


                  Using a similar technique, you can also cache the cosine distances:

                  @lru_cache(maxsize=None)
                  def cosine_distance(query, question):
                      return distance.cosine(vector_cache[query], vector_cache[question])
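
Because `lru_cache` keys on the argument tuple, caching the distance means each distinct `(query, question)` pair is computed only once. A dependency-free sketch, using a plain NumPy cosine instead of `scipy.spatial.distance` and toy vectors in place of `vector_cache`:

```python
from functools import lru_cache

import numpy as np

# toy stand-in for vector_cache
vectors = {
    "q1": np.array([1.0, 0.0]),
    "q2": np.array([1.0, 1.0]),
}

@lru_cache(maxsize=None)
def cosine_distance(query, question):
    a, b = vectors[query], vectors[question]
    # cosine distance = 1 - cosine similarity
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

d = cosine_distance("q1", "q2")
cosine_distance("q1", "q2")  # repeated pair: served from the cache
hits = cosine_distance.cache_info().hits
```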
                  


