Fastest way to run through a 50k-row Excel file in openpyxl

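I'm using openpyxl in Python, and I'm trying to run through 50k rows, grab data from each row, and place it into a file. However, I'm finding that it gets slower and slower the further I get into it. The first 1k rows go very fast, in less than a minute, but after that each subsequent 1k rows takes longer and longer.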

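I'm opening an .xlsx file. Would it be faster to open a .txt file as CSV or something similar, or to read a JSON file? Or should I convert the data to some format that reads faster?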

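I have 20 unique values in a given column, and the values associated with each of them are random. I'm trying to build, for each unique value, a string of all the values that go with it.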

Value1: 1243,345,34,124, Value2: 1243,345,34,124, etc.

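I'm running through the list of "values" and checking whether a file with that name exists. If it does, the code accesses that file and appends the new value; if the file doesn't exist, it creates the file and then reopens it in append mode. I keep a dictionary of all the open "append" file objects, so any time I want to write something, the code takes the name, looks up the file object in the dictionary, and writes to it. That way it doesn't keep opening new files on every pass.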

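The first 1k took less than a minute. Now I'm on records 4k to 5k and it has already been running for 5 minutes. It seems to take longer as the record count goes up, and I don't know how to speed it up. It isn't printing anything else to the console.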

    writeFile = 1
    theDict = {}

    for row in ws.iter_rows(rowRange):
        for cell in row:
            #grabbing the value
            theStringValueLocation = "B" + str(counter)
            theValue = ws[theStringValueLocation].value
            theName = cell.value
            textfilename = theName + ".txt"

            if os.path.isfile(textfilename):
                listToAddTo = theDict[theName]
                listToAddTo.write("," + theValue)
                if counter == 1000:
                    print "1000"
                    st = datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S')

            else:
                writeFileName = open(textfilename, 'w')
                writeFileName.write(theValue)
                writeFileName = open(textfilename, 'a')
                theDict[theName] = writeFileName
            counter = counter + 1

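I added some timestamps to the code above. They aren't all shown there, but you can see the output below. The problem is that each run of 1k rows takes longer than the last: 2 minutes for the first, then 3 minutes, then 5 minutes, then 7 minutes. By the time it reaches 50k, I'm worried it will take an hour or more, which is far too long.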

    1000
    2016-02-25 15:15:08
    20002016-02-25 15:17:07
    30002016-02-25 15:20:52
    2016-02-25 15:25:28
    4000
    2016-02-25 15:32:00
    5000
    2016-02-25 15:40:02
    6000
    2016-02-25 15:51:34
    7000
    2016-02-25 16:03:29
    8000
    2016-02-25 16:18:52
    9000
    2016-02-25 16:35:30
    10000

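Some things I should make clear: I don't know the names of the values ahead of time. Maybe I should run through the sheet and collect them in a separate Python script to make this faster?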

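Second, I need a string of all the values separated by commas, which is why I put them into a text file to grab later. I was thinking of using a list instead, as was suggested to me, but I wonder whether that would have the same problem. I think the problem has to do with reading from Excel. In any case, if there is another way to get a comma-separated string out, I can do it that way.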

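Or maybe I could use try/except instead of checking for the file every time, and if there is an error, assume I need to create a new file? Maybe checking whether the file exists on every row is what makes it so slow?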

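This question is a continuation of my original one here, and I took some suggestions from there: What is the fastest performance tuple for large data sets in python?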

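I think what you're trying to do is get a key from column B of each row and use that key to build the name of the file to append to. Let's speed it up: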

    from collections import defaultdict

    Value_entries = defaultdict(list)  # dict of lists of row data

    for row in ws.iter_rows(rowRange):
        key = row[1].value
        Value_entries[key].extend([cell.value for cell in row])

    # All done. Now write files:
    for key in Value_entries.keys():
        with open(key + '.txt', 'w') as f:
            f.write(','.join(Value_entries[key]))

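It looks like you only want the cells from column B. In that case you can use ws.get_squared_range() to restrict the number of cells to look at. In current openpyxl releases, ws.iter_rows() accepts the same min_col, max_col, min_row and max_row arguments.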

    for row in ws.get_squared_range(min_col=2, max_col=2, min_row=1, max_row=ws.max_row):
        for cell in row:  # each row is always a sequence
            filename = cell.value
            if os.path.isfile(filename):
                …

It's not clear what is happening in the else branch of your code, but you should close any files you open as soon as you are done with them.
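As a minimal sketch of closing files promptly, each write can open the file in append mode inside a with block, so the handle is released as soon as the write finishes. The helper name append_value and the directory argument are hypothetical and are not part of the original code.

```python
import os
import tempfile


def append_value(directory, name, value):
    """Append one value to <directory>/<name>.txt, comma-separating entries."""
    path = os.path.join(directory, name + ".txt")
    # Only write a leading comma when the file already has content.
    needs_comma = os.path.exists(path) and os.path.getsize(path) > 0
    # The with block closes the file as soon as the write is done.
    with open(path, "a") as f:
        f.write(("," if needs_comma else "") + str(value))
    return path


if __name__ == "__main__":
    demo_dir = tempfile.mkdtemp()  # throwaway directory for the demo
    for v in ("1243", "345", "34"):
        path = append_value(demo_dir, "Value1", v)
    with open(path) as f:
        print(f.read())
```

Reopening the file for every row has its own cost, so this is mainly useful when the data really has to be written incrementally; collecting the values in memory and writing once at the end avoids the repeated opens.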

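Based on the other question you linked and the code above, it appears you have a spreadsheet of name-value pairs: the name is in column A and the value is in column B. A name can appear multiple times in column A, and each time there can be a different value in column B. The goal is to create a list of all the values that show up for each name.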

First, a few observations on the code above:

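counter is never initialized. Presumably it is initialized to 1.
open(textfilename, ...) is called twice without closing the file in between. Calling open allocates some memory to hold data related to operating on the file. The memory allocated for the first open call may not be freed until much later, perhaps not until the program ends. It is better practice to close files when you are done with them (see using open as a context manager).
The looping logic isn't correct. Consider: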

First iteration of the inner loop:

    for cell in row:                         # cell refers to A1
        valueLocation = "B" + str(counter)   # valueLocation is "B1"
        value = ws[valueLocation].value      # value gets contents of cell B1
        name = cell.value                    # name gets contents of cell A1
        textfilename = name + ".txt"
        ...
        opens file with name based on contents of cell A1, and
        writes value from cell B1 to the file
        ...
        counter = counter + 1                # counter = 2

But each row has at least two cells, so on the second iteration of the inner loop:

    for cell in row:                         # cell now refers to cell B1
        valueLocation = "B" + str(counter)   # valueLocation is "B2"
        value = ws[valueLocation].value      # value gets contents of cell B2
        name = cell.value                    # name gets contents of cell B1
        textfilename = name + ".txt"
        ...
        opens file with name based on contents of cell "B1"  <<<< wrong file
        writes the value of cell "B2" to the file            <<<< wrong value
        ...
        counter = counter + 1        # counter = 3 when cell B1 is processed

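Repeat for each of the 50K rows. Depending on how many unique values there are in column B, the program could be trying to have hundreds or thousands of files open at once (based on the contents of cells A1, B1, A2, B2, ...) ==>> very slow, or the program crashes.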
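To make the mismatch concrete, here is a small simulation of the loop. It is only a sketch: trace_buggy_loop is a hypothetical helper that works on cell labels rather than a real worksheet, and it assumes counter starts at 1 and every row has exactly two cells.

```python
def trace_buggy_loop(n_rows, n_cols=2):
    """Return the (name_cell, value_cell) pairs the original nested loop would use."""
    counter = 1  # assumed starting value; the original never initializes it
    pairs = []
    for row_number in range(1, n_rows + 1):
        for col_index in range(n_cols):
            # The cell the inner loop is currently visiting: A1, B1, A2, B2, ...
            name_cell = chr(ord("A") + col_index) + str(row_number)
            # The cell the original code reads the value from.
            value_cell = "B" + str(counter)
            pairs.append((name_cell, value_cell))
            counter += 1  # advances once per cell, not once per row
    return pairs


if __name__ == "__main__":
    for name_cell, value_cell in trace_buggy_loop(3):
        print(name_cell, "->", value_cell)
```

Only the first pair (A1 with B1) is the intended one. After that the value cell runs ahead of the row being processed, and every second "name" is actually taken from column B.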

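iter_rows() yields each row as a tuple of its cells.
As people suggested in the other question, use a dictionary and lists to store the values, and write them all out at the end. Like this (I'm using Python 3.5, so you may need to adjust it if you're using 2.7):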

Here is a straightforward solution:

    from collections import defaultdict

    data = defaultdict(list)

    # gather the values into lists associated with each name
    # data will look like { 'name1':['value1', 'value42', ...],
    #                       'name2':['value7', 'value23', ...],
    #                       ...}
    for row in ws.iter_rows():
        name = row[0].value
        value = row[1].value
        data[name].append(value)

    for key, valuelist in data.items():
        # turn list of strings in to a long comma-separated string
        # eg, ['value1', 'value42', ...] => 'value1,value42, ...'
        value = ",".join(valuelist)

        with open(key + ".txt", "w") as f:
            f.write(value)
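One caveat: ",".join() only accepts strings, so it raises a TypeError if some cells hold numbers or are empty. A hedged variant of the grouping step is sketched below. The function name group_values is hypothetical, plain (name, value) tuples stand in for the worksheet rows, and skipping blank cells is an assumption about the desired behavior.

```python
from collections import defaultdict


def group_values(rows):
    """Map each name to a comma-separated string of its values."""
    data = defaultdict(list)
    for name, value in rows:
        if name is None or value is None:
            continue  # skip blank cells (assumed to be the desired behavior)
        data[str(name)].append(str(value))  # str() so numeric cells can be joined
    return {name: ",".join(values) for name, values in data.items()}


if __name__ == "__main__":
    # Hypothetical sample rows standing in for (column A, column B) values.
    sample = [("Value1", 1243), ("Value2", 345), ("Value1", 34), (None, None)]
    print(group_values(sample))
```

With a real worksheet, the same function could be fed (row[0].value, row[1].value) pairs from the loop above.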
