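Python: parse XML, count occurrences of a string, then output to Excel

So here is my conundrum!

I have 100+ XML files that I need to parse to find a string by tag name (or by regular expression).

Once I have found the string/tag value, I need to count how many times it occurs (or find the highest value for that string).

Example: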

    <content styleCode="Bold">Value 1</content>
    <content styleCode="Bold">Value 2</content>
    <content styleCode="Bold">Value 3</content>
    <content styleCode="Bold">Another Value 1</content>
    <content styleCode="Bold">Another Value 2</content>
    <content styleCode="Bold">Another Value 3</content>
    <content styleCode="Bold">Another Value 4</content>

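So basically I want to parse the XML, find the tags listed above, and output the highest value found to an Excel spreadsheet. The spreadsheet already has headers, so only the numeric values need to be written to the Excel file.

So the output in Excel would be: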

    Value    Another Value
    3        4

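Each file would be output to a separate row.

I'm not sure how your XML files are named. For the simple case, let's assume they follow this pattern:

file1.xml, file2.xml, … and that they are stored in the same folder as the Python script.

Then you can use the following code to do the job: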

    import re
    import xml.etree.ElementTree as ElementTree

    from xlrd import open_workbook
    from xlutils.copy import copy

    def process():
        for i in range(1, 100):  # loop from file1.xml to file99.xml
            resultDict = {}
            xml = ElementTree.parse('file%d.xml' % i)
            root = xml.getroot()
            for child in root:
                # Split e.g. "Another Value 4" into the key "Another Value" and the number 4.
                digits = re.search(r'\d+', child.text).group()
                key = child.text[:-(1 + len(digits))]
                value = int(digits)  # compare as integers, not as strings
                resultDict[key] = max(value, resultDict.get(key, value))

            # Copy the existing workbook (it already has the headers) and add one row per file.
            rb = open_workbook("names.xls")
            wb = copy(rb)
            s = wb.get_sheet(0)
            for index, value in enumerate(resultDict.values()):
                s.write(i, index, value)
            wb.save('names.xls')

    if __name__ == '__main__':
        process()
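To look at the label/number split and the per-label maximum in isolation, here is a minimal sketch that works on an in-memory XML string and leaves out the Excel step. The `<root>` wrapper element, the sample values, and the helper name `max_per_label` are assumptions for illustration; the question does not show the enclosing element.

```python
import re
import xml.etree.ElementTree as ElementTree

# Hypothetical sample: the <root> wrapper is an assumption, since the
# question only shows the <content> elements.
SAMPLE = """<root>
<content styleCode="Bold">Value 1</content>
<content styleCode="Bold">Value 3</content>
<content styleCode="Bold">Another Value 2</content>
<content styleCode="Bold">Another Value 10</content>
</root>"""


def max_per_label(xml_text):
    """Return {label: highest trailing number} over all <content> elements."""
    result = {}
    root = ElementTree.fromstring(xml_text)
    for element in root.iter('content'):
        match = re.fullmatch(r'(.*?)\s+(\d+)', (element.text or '').strip())
        if match is None:
            continue  # this element has no trailing number
        label, number = match.group(1), int(match.group(2))
        # Compare as integers so that 10 beats 2 (as strings, '2' > '10').
        result[label] = max(number, result.get(label, number))
    return result


print(max_per_label(SAMPLE))
```

Converting to `int` before comparing matters as soon as a number has more than one digit.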

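So this problem has two main parts: (1) find the maximum of each value pair in each file, and (2) write those maxima to an Excel workbook. One thing I always advocate is writing reusable code. Here you just have to put all the XML files in one folder, run the main method, and get the results.

Now, there are several options for writing to Excel. The simplest is to create a tab- or comma-separated file (CSV) and import it into Excel manually. xlwt is a widely used third-party library for writing .xls files. openpyxl is another library, and it makes creating Excel files simpler, with fewer lines of code.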
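As a rough sketch of the CSV option, the standard-library `csv` module can write one row per file, and Excel can then open or import the result. The function name `write_csv` and the sample rows are made up for illustration.

```python
import csv


def write_csv(rows, output_path):
    """Write a header plus one (value, another_value) row per XML file."""
    # newline='' lets the csv module control line endings itself.
    with open(output_path, 'w', newline='') as handle:
        writer = csv.writer(handle)
        writer.writerow(['Value', 'Another Value'])
        writer.writerows(rows)
```

Calling `write_csv([(3, 4), (7, 2)], 'results.csv')` would produce a header line followed by one line per file.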

Make sure to import the required libraries and modules at the beginning of the file.

    import re
    import os
    import openpyxl

While reading the XML files, we use regular expressions to extract the required values.

    # Raw strings keep \s and \d from being treated as string escapes.
    regexPatternValue = r">Value\s+(\d+)</content>"
    regexPatternAnotherValue = r">Another Value\s+(\d+)</content>"
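A quick way to check the two patterns is to run them against a line shaped like the sample in the question. This sketch redefines the patterns so that it runs on its own; the helper name `extract_numbers` and the sample line are assumptions.

```python
import re

regexPatternValue = r">Value\s+(\d+)</content>"
regexPatternAnotherValue = r">Another Value\s+(\d+)</content>"

# Hypothetical line with both kinds of tag on it.
sample = (
    '<content styleCode="Bold">Value 3</content> '
    '<content styleCode="Bold">Another Value 4</content>'
)


def extract_numbers(pattern, text):
    """Return every number captured by pattern in text, as integers."""
    return [int(number) for number in re.findall(pattern, text)]


value_numbers = extract_numbers(regexPatternValue, sample)
another_numbers = extract_numbers(regexPatternAnotherValue, sample)
print(value_numbers, another_numbers)
```

The leading `>` in the first pattern is what keeps it from also matching the "Value 4" inside "Another Value 4".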

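To keep things a bit more modular, create a method that goes through each line of a given XML file, looks for the regex patterns, extracts all the values, and returns the maximum. In the method below, I return a tuple of two elements (Value, Another), each being the highest number of that type seen in the file.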

    def get_values(filepath):
        values = []
        another = []
        with open(filepath) as xml_file:
            for line in xml_file:
                # findall picks up several tags on one line, not just the first.
                values.extend(int(n) for n in re.findall(regexPatternValue, line))
                another.extend(int(n) for n in re.findall(regexPatternAnotherValue, line))
        # Now we want the highest number in each list.
        # default='' handles the case where there are NO values at all.
        maxVal = max(values, default='')
        maxAnother = max(another, default='')
        return maxVal, maxAnother

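Now keep your XML files in one folder, iterate over them, and extract the value for each regex pattern. In the code below, I append the extracted values to a list named writable_lines. Finally, after all the files have been parsed, create a workbook and add the extracted values row by row.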

    def process_folder(folder, output_xls_path):
        # Pick up the XML files in the folder, in a stable order.
        files = [os.path.join(folder, f) for f in sorted(os.listdir(folder)) if f.endswith(".xml")]
        writable_lines = []
        writable_lines.append(("Value", "Another Value"))  # Header in the Excel sheet
        for file in files:
            values = get_values(file)
            writable_lines.append((str(values[0]), str(values[1])))
        wb = openpyxl.Workbook()
        sheet = wb.active
        for i in range(len(writable_lines)):
            sheet['A' + str(i+1)].value = writable_lines[i][0]
            sheet['B' + str(i+1)].value = writable_lines[i][1]
        wb.save(output_xls_path)

In the lower for loop, we tell openpyxl to write the values into cells addressed in the usual Excel notation, such as sheet["A3"], sheet["B3"], and so on.

Ready to go…

    if __name__ == '__main__':
        # openpyxl writes the .xlsx format, so use that extension.
        process_folder("xmls", "try.xlsx")
