Python: parsing structured text into Excel

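I need to convert a large number of files in a structured text format to Excel (CSV would work too) so that I can merge them with some other data. Here is a sample of the text: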

    FILER:
    COMPANY DATA:
    COMPANY CONFORMED NAME: NORTHQUEST CAPITAL FUND INC
    CENTRAL INDEX KEY: 0001142728
    IRS NUMBER: 223772454
    STATE OF INCORPORATION: NJ
    FISCAL YEAR END: 1231
    FILING VALUES:
    FORM TYPE: NSAR-A
    SEC ACT: 1940 Act
    SEC FILE NUMBER: 811-10419
    FILM NUMBER: 03805344
    BUSINESS ADDRESS:
    STREET 1: 16 RIMWOOD LANE
    CITY: COLTS NECK
    STATE: NJ
    ZIP: 07722
    BUSINESS PHONE: 7328423504
    FORMER COMPANY:
    FORMER CONFORMED NAME: NORTHPOINT CAPITAL FUND INC
    DATE OF NAME CHANGE: 20010615
    </SEC-HEADER>
    <DOCUMENT>
    <TYPE>NSAR-A
    <SEQUENCE>1
    <FILENAME>answer.fil
    <DESCRIPTION>ANSWER.FIL
    <TEXT>
    <PAGE> PAGE 1
    000 A000000 06/30/2003
    000 C000000 0001142728
    000 D000000 N
    000 E000000 NF
    000 F000000 Y
    000 G000000 N
    000 H000000 N
    000 I000000 6.1
    000 J000000 A
    001 A000000 NORTHQUEST CAPITAL FUND, INC.
    001 B000000 811-10493
    001 C000000 7328921057
    002 A000000 16 RIMWOOD LANE
    002 B000000 COLTS NECK
    002 C000000 NJ
    002 D010000 07722
    003 000000 N
    004 000000 N
    005 000000 N
    006 000000 N
    007 A000000 N
    007 B000000 0
    007 C010100 1
    007 C010200 2
    007 C010300 3
    007 C010400 4
    007 C010500 5
    007 C010600 6
    007 C010700 7
    007 C010800 8
    007 C010900 9
    007 C011000 10
    008 A000001 EMERALD RESEARCH CORP.
    008 B000001 A
    008 C000001 801-60455
    008 D010001 BRICK
    008 D020001 NJ
    008 D030001 08724
    013 A000001 SANVILLE & COMPANY
    013 B010001 ABINGTON
    013 B020001 PA
    013 B030001 19001
    015 A000001 FLEET BANK
    015 B000001 C
    015 C010001 POINT PLEASANT BEACH
    015 C020001 NJ
    015 C030001 08742
    015 E030001 X
    018 000000 Y
    019 A000000 N
    019 B000000 0
    <PAGE> PAGE 2
    020 A000001 SCHWAB
    020 B000001 94-1737782
    020 C000001 0
    020 A000002 BESTVEST BROOKERAGE
    020 B000002 23-1452837
    020 C000002 0

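It continues in the same structure through page 8. The company information from the header should go into its own columns. For the rest, the first two values on each line should together form the column name, and the third value should be the cell value for that row.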

I tried to solve this with pyparsing, but without success. Any comments on the approach would be helpful.

The way you describe it, each file is essentially a set of key-value pairs. I would handle the parsing part like this:

    import csv
    import re
    import sys

    # "KEY: VALUE" lines from the SEC header
    colonseperated = re.compile(r' *(.+) *: *(.+) *')
    # "NNN XXXXXXX value" lines from the answer file
    fixedfields = re.compile(r'(\d{3} \w{7}) +(.*)')
    matchers = [colonseperated, fixedfields]

    # newline='' lets the csv module control line endings (Python 3)
    with open('out.csv', 'w', newline='') as out:
        outfile = csv.writer(out)
        outfile.writerow(['Filename', 'Key', 'Value'])
        for filename in sys.argv[1:]:
            with open(filename) as infile:
                for line in infile:
                    line = line.strip()
                    for matcher in matchers:
                        match = matcher.match(line)
                        if match:
                            outfile.writerow([filename] + list(match.groups()))

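You can save it as parser.py and call it with python parser.py *.infile, or whatever your file naming convention is. It will create a CSV file with three columns: a filename, a key, and a value. You can open that file in Excel and then use a pivot table to get the data into the right layout.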
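To see what ends up in the Key and Value columns, here is a small sketch that applies the same two patterns to a few lines taken from the sample. The names `colon_separated`, `fixed_fields`, and `parse_line` are only illustrative and not part of the script itself.

```python
import re

# The same two patterns as in the parsing script
colon_separated = re.compile(r' *(.+) *: *(.+) *')
fixed_fields = re.compile(r'(\d{3} \w{7}) +(.*)')


def parse_line(line):
    """Return (key, value) for a recognised line, or None otherwise."""
    line = line.strip()
    for pattern in (colon_separated, fixed_fields):
        match = pattern.match(line)
        if match:
            return match.groups()
    return None


samples = [
    "COMPANY CONFORMED NAME: NORTHQUEST CAPITAL FUND INC",
    "001 A000000 NORTHQUEST CAPITAL FUND, INC.",
    "003 000000 N",   # six-character code: not matched by \w{7}
    "<TYPE>NSAR-A",   # neither pattern applies
    "FILER:",         # section heading without a value
]
for sample in samples:
    print(repr(sample), "->", parse_line(sample))
```

Note that lines such as `003 000000 N`, as printed in the sample, have no letter prefix in the second field and so are skipped by `\d{3} \w{7}`. If those answers are needed, the pattern would probably have to be loosened, for example to `\d{3} +\w{6,7}`, but that depends on how the real files are spaced.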

Alternatively, you can flatten it with a second script:

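    import csv

    headers = []    # keys in first-seen order; these become the columns
    rows = {}       # filename -> {key: value}
    filenames = []  # filenames in first-seen order

    with open('out.csv', newline='') as src:
        infile = csv.reader(src)
        next(infile)  # skip the Filename/Key/Value header row
        for filename, key, value in infile:
            if filename not in rows:
                rows[filename] = {}
                filenames.append(filename)
            if key not in headers:
                headers.append(key)
            rows[filename][key] = value

    with open('flat.csv', 'w', newline='') as dst:
        outfile = csv.writer(dst)
        outfile.writerow(headers)
        # One row per input file; keys missing from a file become empty cells
        for filename in filenames:
            outfile.writerow([rows[filename].get(header, '') for header in headers])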
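One thing the flattened file does not contain is the filename itself, which may be needed when merging with other data. A possible variant, sketched here with `csv.DictWriter`, keeps it as a first column. The function name `pivot` and the example triples (including the second file and its value) are made up for illustration, and the sketch assumes no key is literally called `Filename`.

```python
import csv
import io


def pivot(triples, out):
    """Write one row per filename and one column per key to file-like `out`."""
    headers = ["Filename"]
    rows = {}  # filename -> {column: value}; dicts keep insertion order
    for filename, key, value in triples:
        row = rows.setdefault(filename, {"Filename": filename})
        if key not in headers:
            headers.append(key)
        row[key] = value
    # restval fills cells for keys that a given file does not have
    writer = csv.DictWriter(out, fieldnames=headers, restval="",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows.values())


# Hypothetical (filename, key, value) triples, shaped like the rows of out.csv
triples = [
    ("a.txt", "CENTRAL INDEX KEY", "0001142728"),
    ("a.txt", "001 B000000", "811-10493"),
    ("b.txt", "CENTRAL INDEX KEY", "0000000001"),
]
buffer = io.StringIO()
pivot(triples, buffer)
print(buffer.getvalue())
```

To use it on real data, the triples could come from `csv.reader` over out.csv (after skipping its header row) and `out` could be a file opened with `newline=''`.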