Parsing XML with a varying number of tags to produce lists of equal length: openpyxl and BeautifulSoup

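I have an XML file containing book records with tags such as author, publication date, label, and so on. I want to parse this file to create 3 lists: one with the book titles, one with the authors, and a third with the labels. Later I will use openpyxl to write these lists to Excel columns. The problem is that some book records have no label tag. Using the usual Beautiful Soup parsing techniques yields the first two lists with the same length, but the label list ends up shorter.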

I have three questions:

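1- How can I create all three lists with the same length (with an empty entry for books that have no label tag)?

2- The label list looks like this: ['Energy; Green Buildings; High Performance Buildings', 'Computing', 'Computing; Design; Green Buildings', …….]. I have created another 15 column headers whose names are my labels, e.g. "Computing" and "Design". Is there any way to use openpyxl to put an X mark or a colored cell at each book/label combination? For example, if the book titled "Architecture" in row 5 has the "Design" label, I want an X mark or a colored cell at (row 5, column "Design").

3- Is there an easier way to accomplish this task (parsing the XML file and writing it to Excel efficiently)?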

Below are a snapshot of the XML file and the code I wrote (the XML file and the Python file can also be downloaded from here: http://www.ranialabib.com/#!python/icfwa):

    <?xml version="1.0" encoding="UTF-8"?>
    <xml>
      <records>
        <record>
          <database name="My Collection.enl" path="My Collection.enl">My Collection.enl</database>
          <ref-type name="Book">1</ref-type>
          <contributors>
            <authors>
              <author>AIA Research Corporation</author>
            </authors>
          </contributors>
          <titles>
            <title>Regional guidelines for building passive energy conserving homes</title>
          </titles>
          <periodical/>
          <keywords/>
          <dates>
            <year>1978</year>
          </dates>
          <publisher>Dept. of Housing and Urban Development, Office of Policy Development and Research : for sale by the Supt. of Docs., US Govt. Print. Off.</publisher>
          <urls/>
          <label>Energy;Green Buildings;High Performance Buildings</label>
        </record>
        <record>
          <database name="My Collection.enl" path="My Collection.enl">My Collection.enl</database>
          <ref-type name="Book">1</ref-type>
          <contributors>
            <authors>
              <author>Akinci, Burcu</author>
              <author>Ph, D</author>
            </authors>
          </contributors>
          <titles>
            <title>Computing in Civil Engineering</title>
          </titles>
          <periodical/>
          <pages>692-699</pages>
          <keywords/>
          <dates>
            <year>2007</year>
          </dates>
          <publisher>American Society of Civil Engineers</publisher>
          <isbn>9780784409374</isbn>
          <electronic-resource-num>ISBN 978-0-7844-1302-9</electronic-resource-num>
          <urls>
            <web-urls>
              <url>http://books.google.com/books?id=QigBgc-qgdoC</url>
            </web-urls>
          </urls>
        </record>

    import xml.etree.ElementTree as ET

    fhand = open('My_Collection.xml')
    data = fhand.read()

    Title=list()
    Year=list()
    Label=list()

    tree = ET.fromstring(data)
    titles = tree.findall('.//title')
    years = tree.findall('.//year')
    labels = tree.findall('.//label')

    for t in titles :
        Title.append(str(t.text))
    print 'Titles: ', len(Title)
    print Title

    for y in years :
        Year.append(str(y.text))
    print 'years: ', len(Year)
    print Year

    for l in labels :
        Label.append(str(l.text))
    print 'Labels: ', len(Label)
    print Label

    from openpyxl import Workbook
    wb = Workbook()
    ws = wb.active
    for row in zip(Title, Year, Label):
        ws.append(row)
    wb.save("Test2.xlsx")

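Here is the code I wrote based on Charlie's suggestion; it did not work. I get an error message saying: "TypeError: 'NoneType' object is not iterable". I don't know what the problem is. How can I get the text of all 3 tags (title, year, label) for each record into one list, and how easy is it to write that many lists (200 lists for 200 books) to Excel using openpyxl?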

    import xml.etree.ElementTree as ET

    fhand = open('My_Collection.xml')
    data = fhand.read()
    Label_lst=list()
    for record in tree.find("records/record") :
        label = record.find("label")
        for l in label:
            if label is not None:
                label = label_lst.append(label.text)
            else:
                label = label_lst.append(' ')
    print label_lst

If you want to preserve the record structure, you should parse record by record rather than just building lists of attributes. You can loop over the records and extract the relevant fields (note `findall`, which returns every matching record; `find` returns only the first one):

    for record in parsed_xml.findall("records/record"):
        label = record.find("label")
        if label is not None:
            label = label.text

You can then write the rows directly to Excel without having to zip the columns.
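A minimal sketch of that record-by-record approach, assuming the file layout shown in the question (an `<xml>` root containing `<records>`). The names `SAMPLE`, `extract_rows`, and `write_rows` are made up for illustration, and the inline sample is a trimmed copy of the two records from the question. `findtext` with a default avoids the `None` check, so a record without a `<label>` still produces a row.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the two records from the question (assumption: same layout).
SAMPLE = """<xml><records>
<record>
  <titles><title>Regional guidelines for building passive energy conserving homes</title></titles>
  <dates><year>1978</year></dates>
  <label>Energy;Green Buildings;High Performance Buildings</label>
</record>
<record>
  <titles><title>Computing in Civil Engineering</title></titles>
  <dates><year>2007</year></dates>
</record>
</records></xml>"""


def extract_rows(root):
    """Return one (title, year, label) tuple per <record>; missing fields become ''."""
    rows = []
    for record in root.findall("records/record"):
        # findtext returns the default when the element is absent.
        title = record.findtext(".//title", default="")
        year = record.findtext(".//year", default="")
        label = record.findtext("label", default="")
        rows.append((title, year, label))
    return rows


def write_rows(rows, path):
    """Write a header row plus one row per record (requires openpyxl)."""
    # Imported here so the parsing part can be used without openpyxl installed.
    from openpyxl import Workbook

    wb = Workbook()
    ws = wb.active
    ws.append(("Title", "Year", "Label"))
    for row in rows:
        ws.append(row)
    wb.save(path)


rows = extract_rows(ET.fromstring(SAMPLE))
# write_rows(rows, "Test2.xlsx")  # uncomment to produce the workbook
```

Because every record yields exactly one tuple, the columns cannot drift out of step, and 200 books are simply 200 calls to `ws.append`.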

I just figured it out. I am still using columns.

    from openpyxl import Workbook
    import xml.etree.ElementTree as ET

    # Parse the file once; root is the <xml> element.
    root = ET.parse('My_Collection.xml').getroot()

    # Each list starts with its column header.
    title_list = ['Title']
    year_list = ['Year']
    author_list = ['Author']
    label_list = ['Label']

    # root -> <records> -> <record>; 'N' marks a missing field.
    for child in root:
        for children in child:
            if children.find('.//title') is None:
                t = 'N'
            else:
                t = children.find('.//title').text
            title_list.append(t)
    print(title_list)
    print(len(title_list))

    for child in root:
        for children in child:
            if children.find('.//year') is None:
                y = 'N'
            else:
                y = children.find('.//year').text
            year_list.append(y)
    print(year_list)
    print(len(year_list))

    # Only the first <author> of each record is kept.
    for child in root:
        for children in child:
            if children.find('.//author') is None:
                a = 'N'
            else:
                a = children.find('.//author').text
            author_list.append(a)
    print(author_list)
    print(len(author_list))

    for child in root:
        for children in child:
            if children.find('label') is None:
                l = 'N'
            else:
                l = children.find('label').text
            label_list.append(l)
    print(label_list)
    print(len(label_list))

    # All four lists now have the same length, so zip keeps every record.
    wb = Workbook()
    ws = wb.active
    for row in zip(title_list, year_list, author_list, label_list):
        ws.append(row)
    wb.save("Test3.xlsx")
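The second question (an X mark or colored cell per book/label combination) is not covered by the code above. One possible sketch, under these assumptions: labels are separated by `;` as in the sample, the column names are whatever label names you choose, and the helper names `split_labels`, `build_matrix`, and `write_matrix` as well as the yellow fill color are made up for illustration.

```python
def split_labels(label_text):
    """Split 'Energy;Green Buildings' into ['Energy', 'Green Buildings']."""
    if not label_text:
        return []
    return [part.strip() for part in label_text.split(";") if part.strip()]


def build_matrix(records, label_columns):
    """records is a list of (title, label_text) pairs.

    Returns a header row followed by one row per book, with 'X' in every
    label column the book carries and '' elsewhere.
    """
    table = [["Title"] + list(label_columns)]
    for title, label_text in records:
        present = set(split_labels(label_text))
        marks = ["X" if name in present else "" for name in label_columns]
        table.append([title] + marks)
    return table


def write_matrix(table, path):
    """Write the table to Excel and color every cell holding an X (requires openpyxl)."""
    # Imported here so the matrix logic can be used without openpyxl installed.
    from openpyxl import Workbook
    from openpyxl.styles import PatternFill

    fill = PatternFill(start_color="FFFF00", end_color="FFFF00", fill_type="solid")
    wb = Workbook()
    ws = wb.active
    for row in table:
        ws.append(row)
    # Skip the header row and the title column.
    for row in ws.iter_rows(min_row=2, min_col=2):
        for cell in row:
            if cell.value == "X":
                cell.fill = fill
    wb.save(path)


# Hypothetical usage with label strings taken from the question.
books = [
    ("Architecture", "Computing; Design; Green Buildings"),
    ("Book without labels", None),
]
table = build_matrix(books, ["Computing", "Design", "Energy"])
# write_matrix(table, "Labels.xlsx")  # uncomment to produce the workbook
```

Building the table as plain lists first keeps the label matching separate from the Excel formatting, so either part can be changed on its own.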