                  Why does concatenation of DataFrames get exponentially slower?

                  Question

                  I have a function that processes a DataFrame, largely to sort data into buckets and to create a binary matrix of features for a particular column using pd.get_dummies(df[col]).

                  To avoid processing all of my data with this function at once (which runs out of memory and causes IPython to crash), I have broken the large DataFrame into chunks using:

                  chunks = len(df) // 10000 + 1  # integer division keeps the chunk count an int
                  df_list = np.array_split(df, chunks)
                  

                  pd.get_dummies(df) automatically creates new columns based on the contents of df[col], and these are likely to differ for each df in df_list.
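To illustrate why the columns can differ per chunk, here is a minimal sketch on made-up toy data (the column name `col` and the two-chunk split are assumptions for illustration, not the asker's data). Each chunk only sees its own categories, and `pd.concat` then takes the union of the columns and fills the gaps with NaN.

```python
import pandas as pd

# Hypothetical toy data: one categorical column split into two chunks.
df = pd.DataFrame({"col": ["a", "b", "a", "c"]})
chunk_1, chunk_2 = df.iloc[:2], df.iloc[2:]

# Each chunk only sees its own categories, so the dummy columns differ.
dummies_1 = pd.get_dummies(chunk_1["col"], dtype=int)  # columns: a, b
dummies_2 = pd.get_dummies(chunk_2["col"], dtype=int)  # columns: a, c

# pd.concat uses the union of the columns and inserts NaN where a chunk
# had no such category; those gaps are replaced with 0 afterwards.
combined = pd.concat([dummies_1, dummies_2], axis=0).fillna(0).astype(int)
print(combined)
```

Filling the gaps with 0 is one reasonable choice for indicator columns; whether it suits a given dataset is a judgment call.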

                  After processing, I concatenate the DataFrames back together using:

                  # super_x and super_y are assumed to be initialized before this loop
                  for i, df_chunk in enumerate(df_list):
                      print("chunk", i)
                      x, y = preprocess_data(df_chunk)
                      super_x = pd.concat([super_x, x], axis=0)
                      super_y = pd.concat([super_y, y], axis=0)
                      print(datetime.datetime.utcnow())
                  

                  The processing time of the first chunk is perfectly acceptable, but it grows with every chunk. This has nothing to do with preprocess_data(df_chunk), as there is no reason for its runtime to increase. Is the increase caused by the call to pd.concat()?

                  Please see the log below:

                  chunks 6
                  chunk 0
                  2016-04-08 00:22:17.728849
                  chunk 1
                  2016-04-08 00:22:42.387693 
                  chunk 2
                  2016-04-08 00:23:43.124381
                  chunk 3
                  2016-04-08 00:25:30.249369
                  chunk 4
                  2016-04-08 00:28:11.922305
                  chunk 5
                  2016-04-08 00:32:00.357365
                  

                  Is there a workaround to speed this up? I have 2900 chunks to process, so any help is appreciated!

                  Open to any other suggestions in Python!

                  Recommended answer

                  Never call DataFrame.append or pd.concat inside a for-loop. It leads to quadratic copying.

                  pd.concat returns a new DataFrame. Space has to be allocated for the new DataFrame, and the data from the old DataFrames has to be copied into it. Consider the amount of copying required by this line inside the for-loop (assuming each x has size 1):

                  super_x = pd.concat([super_x, x], axis=0)
                  
                  | iteration | size of old super_x | size of x | copying required |
                  |         0 |                   0 |         1 |                1 |
                  |         1 |                   1 |         1 |                2 |
                  |         2 |                   2 |         1 |                3 |
                  |       ... |                     |           |                  |
                  |       N-1 |                 N-1 |         1 |                N |
                  

                  1 + 2 + 3 + ... + N = N(N+1)/2. So O(N**2) copies are required to complete the loop.
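The sum can be checked with a small counting sketch. It does not call pandas at all; the helper name `rows_copied_concat_in_loop` is made up for illustration, and it simply tallies how many chunk-sized units would be written if every iteration rebuilt the accumulated frame, using the 2900 chunks mentioned in the question.

```python
def rows_copied_concat_in_loop(n_chunks, chunk_size=1):
    """Count units copied when the accumulator is rebuilt once per chunk."""
    copied = 0
    acc_size = 0
    for _ in range(n_chunks):
        acc_size += chunk_size  # the new frame holds the old rows plus the chunk
        copied += acc_size      # every row of the new frame is written once
    return copied


n = 2900  # number of chunks from the question
total = rows_copied_concat_in_loop(n)
print(total)                  # 4206450
print(total == n * (n + 1) // 2)  # True
```

Under this simplified model, the loop copies about 4.2 million chunk-sized units, compared with 2900 for a single concatenation at the end.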

                  Now consider

                  super_x = []
                  for i, df_chunk in enumerate(df_list):
                      [x, y] = preprocess_data(df_chunk)
                      super_x.append(x)
                  super_x = pd.concat(super_x, axis=0)
                  

                  Appending to a list is an O(1) operation and does not require copying. Once the loop is done, there is a single call to pd.concat. That call requires N copies, since super_x contains N DataFrames of size 1. So when constructed this way, super_x requires only O(N) copies.
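The same pattern extends to both outputs from the question. The following is a sketch under stated assumptions: `preprocess_data` here is a hypothetical stand-in (the real function is not shown in the question), and the toy columns `col` and `target` are invented for illustration.

```python
import pandas as pd


def preprocess_data(df_chunk):
    # Hypothetical stand-in for the question's function:
    # returns a dummy-encoded feature frame and a target frame.
    x = pd.get_dummies(df_chunk["col"], dtype=int)
    y = df_chunk[["target"]]
    return x, y


# Toy data split into three chunks of two rows each.
df = pd.DataFrame({"col": list("abcabc"), "target": range(6)})
df_list = [df.iloc[i:i + 2] for i in range(0, len(df), 2)]

# Collect the per-chunk results in plain lists (cheap appends, no copying).
xs, ys = [], []
for df_chunk in df_list:
    x, y = preprocess_data(df_chunk)
    xs.append(x)
    ys.append(y)

# Concatenate once after the loop; missing dummy columns become NaN,
# which are filled with 0 here.
super_x = pd.concat(xs, axis=0).fillna(0).astype(int)
super_y = pd.concat(ys, axis=0)
```

Using separate list names (`xs`, `ys`) rather than reusing `super_x` for both the list and the final DataFrame is a stylistic choice that keeps the two roles distinct.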


