Writing srt data to Excel with the Python xlsxwriter module

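This time I am trying to write the data from an .srt file into Excel using Python's xlsxwriter module.

The subtitle file looks like this in Sublime Text:

But I want to write the data into an Excel file so that it looks like this:

This is my first time writing Python for something like this, so I am still at the trial-and-error stage… I tried writing the code below

But I don't think it makes sense

I will keep trying, but if you know how to do this, please let me know. I will read your code and try to understand it! Thanks! 🙂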

The following breaks the problem down into several parts:

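  • Parsing the input file. parse_subtitles is a generator that takes a source of lines and yields a sequence of records of the form {'index': 'N', 'timestamp': 'NN:NN:NN,NNN --> NN:NN:NN,NNN', 'subtitles': ['TEXT', ...]}. The approach I took is to track which of three distinct states we are in:
    1. seeking to next entry, when we are looking for the next index number, which should match the regular expression ^\d*$ (just a bunch of digits);
    2. looking for timestamp, when an index has been found and we expect the timestamp on the next line, which should match the regular expression ^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$ (HH:MM:SS,mmm --> HH:MM:SS,mmm); and
    3. reading subtitles, while consuming the actual subtitle text, with a blank line or EOF interpreted as the end of the subtitle.
  • Writing one of the above records to a row in a worksheet. write_dict_to_worksheet accepts a row and a worksheet, plus a record and a dictionary defining the zero-based Excel column number for each of the record's keys, and writes the data accordingly.
  • Organizing the overall conversion. convert accepts an input filename (e.g. 'Wildlife.srt'), which is opened and passed to the parse_subtitles function, and an output filename (e.g. 'Subtitle.xlsx'), which is created with xlsxwriter. It writes a header row and then, for each record parsed from the input file, writes that record to the XLSX file.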
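As a quick check of the state-machine idea, the sketch below classifies individual lines with the same three patterns used in the answer's code. The classify helper and the sample lines are made up for illustration. Note that ^\d*$ also matches an empty string, so the helper tests for a blank line first.

```python
import re

# Same patterns as in the answer's code, written as raw strings.
LINE_INDEX = re.compile(r'^\d*$')
LINE_TIMESTAMP = re.compile(
    r'^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$')
LINE_SEPARATOR = re.compile(r'^\s*$')


def classify(line):
    """Hypothetical helper: name the kind of line we are looking at."""
    line = line.strip('\n')
    if LINE_SEPARATOR.match(line):
        return 'separator'
    if LINE_INDEX.match(line):
        return 'index'
    if LINE_TIMESTAMP.match(line):
        return 'timestamp'
    return 'text'


sample = [
    '1\n',
    '00:00:01,000 --> 00:00:04,000\n',
    'Hello there\n',
    '\n',
    '00;00:05,000 --> 00:00:08,000\n',  # mistyped ';' is not a timestamp
]
kinds = [classify(line) for line in sample]
print(kinds)
```

The last sample line shows why an unrecognized timestamp is worth logging: a single wrong character makes it fall through to plain text.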

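The logging is left in for self-documentation purposes, and because, while reproducing your input file, I mistyped a : as a ; in one timestamp, which made it unrecognizable; having the error pop up was handy for debugging!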
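The debug and info messages stay hidden by default, because the root logger only shows warnings and above. A minimal sketch of turning them on before running the conversion follows; the format string is just one possible choice, and the two sample messages only imitate what the parser would log.

```python
import logging

# Show DEBUG and above. Call this once, before convert() runs.
# force=True (Python 3.8+) replaces any handlers configured earlier.
logging.basicConfig(level=logging.DEBUG,
                    format='%(levelname)s: %(message)s',
                    force=True)

logging.debug('Found index: {i}'.format(i='1'))
logging.error('HUH: Expected to find a timestamp, but instead found: [{d}]'
              .format(d='00;00:05,000 --> 00:00:08,000'))
```

Once the file parses cleanly, raising the level to logging.WARNING keeps only the error messages.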

In this Gist, I have put a text version of your source file together with the code below.

    import logging
    import re

    import xlsxwriter


    def parse_subtitles(lines):
        # Raw strings keep the backslashes intact for the regex engine.
        line_index = re.compile(r'^\d*$')
        line_timestamp = re.compile(r'^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$')
        line_separator = re.compile(r'^\s*$')

        current_record = {'index': None, 'timestamp': None, 'subtitles': []}
        state = 'seeking to next entry'

        for line in lines:
            line = line.strip('\n')
            if state == 'seeking to next entry':
                if line_index.match(line):
                    logging.debug('Found index: {i}'.format(i=line))
                    current_record['index'] = line
                    state = 'looking for timestamp'
                else:
                    logging.error('HUH: Expected to find an index, but instead found: [{d}]'.format(d=line))

            elif state == 'looking for timestamp':
                if line_timestamp.match(line):
                    logging.debug('Found timestamp: {t}'.format(t=line))
                    current_record['timestamp'] = line
                    state = 'reading subtitles'
                else:
                    logging.error('HUH: Expected to find a timestamp, but instead found: [{d}]'.format(d=line))

            elif state == 'reading subtitles':
                if line_separator.match(line):
                    logging.info('Blank line reached, yielding record: {r}'.format(r=current_record))
                    yield current_record
                    state = 'seeking to next entry'
                    current_record = {'index': None, 'timestamp': None, 'subtitles': []}
                else:
                    logging.debug('Appending to subtitle: {s}'.format(s=line))
                    current_record['subtitles'].append(line)

            else:
                logging.error('HUH: Fell into an unknown state: `{s}`'.format(s=state))

        if state == 'reading subtitles':
            # We must have finished the file without encountering a blank line.
            # Dump the last record.
            yield current_record


    def write_dict_to_worksheet(columns_for_keys, keyed_data, worksheet, row):
        """
        Write a subtitle record to a worksheet.
        Return the row number after those that were written
        (since this may write multiple rows).
        """
        current_row = row
        # First, horizontally write the entry and timecode
        for (colname, colindex) in columns_for_keys.items():
            if colname != 'subtitles':
                worksheet.write(current_row, colindex, keyed_data[colname])

        # Next, vertically write the subtitle data
        subtitle_column = columns_for_keys['subtitles']
        for morelines in keyed_data['subtitles']:
            worksheet.write(current_row, subtitle_column, morelines)
            current_row += 1

        return current_row


    def convert(input_filename, output_filename):
        workbook = xlsxwriter.Workbook(output_filename)
        worksheet = workbook.add_worksheet('subtitles')
        columns = {'index': 0, 'timestamp': 1, 'subtitles': 2}

        next_available_row = 0
        records_processed = 0
        # The header row is written like any other record.
        headings = {'index': "Entries", 'timestamp': "Timecodes", 'subtitles': ["Subtitles"]}
        next_available_row = write_dict_to_worksheet(columns, headings, worksheet, next_available_row)

        with open(input_filename) as textfile:
            for record in parse_subtitles(textfile):
                next_available_row = write_dict_to_worksheet(columns, record, worksheet, next_available_row)
                records_processed += 1

        print('Done converting {inp} to {outp}. {n} subtitle entries found. {m} rows written'.format(
            inp=input_filename, outp=output_filename, n=records_processed, m=next_available_row))
        workbook.close()


    convert(input_filename='Wildlife.srt', output_filename='Subtitle.xlsx')

Edit: updated to split multi-line subtitles across multiple rows in the output.
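To make the multi-line layout concrete, here is a sketch of which cells one record ends up in. It mirrors the logic of write_dict_to_worksheet but collects (row, column, value) tuples in a list instead of calling xlsxwriter; cells_for_record and the sample record are hypothetical and used only for this illustration.

```python
def cells_for_record(columns_for_keys, record, row):
    """Return (cells, next_row) for one record starting at `row`."""
    cells = []
    current_row = row
    # Index and timestamp go on the record's first row.
    for colname, colindex in columns_for_keys.items():
        if colname != 'subtitles':
            cells.append((current_row, colindex, record[colname]))
    # Each subtitle line gets its own row in the subtitle column.
    subtitle_column = columns_for_keys['subtitles']
    for text in record['subtitles']:
        cells.append((current_row, subtitle_column, text))
        current_row += 1
    return cells, current_row


columns = {'index': 0, 'timestamp': 1, 'subtitles': 2}
record = {'index': '1',
          'timestamp': '00:00:01,000 --> 00:00:04,000',
          'subtitles': ['First line', 'Second line']}
cells, next_row = cells_for_record(columns, record, row=1)
for cell in cells:
    print(cell)
print('next free row:', next_row)
```

With two subtitle lines the record occupies two rows, and the second row has a value only in the subtitle column.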