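Extracting table column data from a local HTML file to a .csv file with Python

I have a task: using Python, extract one column from each table in a .docx file into an .xls or .csv file. The tables look like this: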

Table 4-1. Bite_main.c

    CHECK          Function   Line   Colum   Detail
    =======================================================
    ##overflow.2   xxxxxxx    xxx    xxx     xxx
    ##overflow.5   xxxxxxx    xxx    xxx     xxx
    ##overflow.8   xxxxxxx    xxx    xxx     xxx
    ##overflow.12  xxxxxxx    xxx    xxx     xxx

Table 4-2. Bite_Engine.c

    CHECK          Function   Line   Colum   Detail
    overflow.4     xxxxxxx    xxx    xxx     xxx
    overflow.9     xxxxxxx    xxx    xxx     xxx
    overflow.8     xxxxxxx    xxx    xxx     xxx
    overflow.10    xxxxxxx    xxx    xxx     xxx

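To start, I used the "mammoth" library to convert the .docx file to an .html file (on the many sites I checked, everyone converts the .docx file to HTML first to make the data easier to process).

Now I need to extract the "CHECK" column of each named table (e.g. "Table 4-1. Bite_main.c") from the converted HTML file into an .xls or .csv sheet. In the xls sheet it should look like this:

    1. Bite_main.c overflow.2,overflow.5,overflow.8,overflow.12
    2. Bite_Engine.c overflow.4,overflow.9,overflow.8,overflow.10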
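
Once the values have been collected, a layout like the one above could be written with Python's standard `csv` module. This is only a minimal sketch, assuming the per-table values are already held in a dict; the helper name `write_check_csv`, the variable `checks_by_table`, and the output file name `checks.csv` are hypothetical and not part of the original post.

```python
import csv


def write_check_csv(checks_by_table, csv_path):
    """Write one row per table: index, source file name, comma-joined CHECK values."""
    # newline="" lets the csv module control the line endings itself.
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for index, (name, checks) in enumerate(checks_by_table.items(), start=1):
            writer.writerow([index, name, ",".join(checks)])


# Hypothetical input mirroring the two tables in the question.
checks_by_table = {
    "Bite_main.c": ["overflow.2", "overflow.5", "overflow.8", "overflow.12"],
    "Bite_Engine.c": ["overflow.4", "overflow.9", "overflow.8", "overflow.10"],
}
write_check_csv(checks_by_table, "checks.csv")
```

Because the joined values contain commas, `csv.writer` quotes that cell, so a spreadsheet application should show the whole list in a single cell.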

—

I used the code below to convert the file to HTML:

    import mammoth

    # Raw strings keep the backslash in the path literal.
    with open(r"\input.docx", "rb") as docx_file, \
            open(r"\out_file.html", "w", encoding="utf-8") as myfile:
        result = mammoth.convert_to_html(docx_file, include_default_style_map=False)
        html = result.value
        # One issue here: all of the file's data ends up on a single line of the HTML file.
        myfile.write(html)

After the conversion I tried to extract the tables, but I could not work out how to proceed:

    from bs4 import BeautifulSoup

    with open(r"\out_file.html", "r", encoding="utf-8") as html_file:
        raw_html = html_file.read()
    soup = BeautifulSoup(raw_html, "html.parser")
    tables = soup.find_all("table")
    table_list = []
    for table in tables:
        table_dict = {}
        rows = table.find_all("tr")
        count = 0
        for row in rows:
            value_list = []
            entries = row.find_all("td")

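What I don't know is how to detect a caption such as "Table 4-1. Bite_main.c" and then extract only that table's "CHECK" column into a new xls sheet. I need to repeat the same thing for every "Table 4.x. xxx.x".

I am very new to Python. Please suggest logic for implementing the above, or a better way to approach the problem. Thanks in advance to anyone who answers.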
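
One possible way to pair each caption with its table's "CHECK" column is sketched below. It rests on several assumptions that are not confirmed by the post: that the converted HTML places each caption in a `<p>` (or heading) directly before its `<table>`, that the header row contains a cell reading exactly `CHECK`, and that the `=====` line and the leading `##` are noise to be dropped. To keep the sketch runnable without extra installs it uses the standard library's `html.parser` in place of BeautifulSoup; the class name `CheckColumnParser`, the pattern `CAPTION_RE`, and `sample_html` are all hypothetical.

```python
import re
from html.parser import HTMLParser

# Assumed caption shape: "Table 4-1. Bite_main.c" (the file name is captured).
CAPTION_RE = re.compile(r"Table\s+\d+[-.]\d+\.?\s*(\S+)")
TEXT_BLOCKS = {"p", "h1", "h2", "h3", "h4", "h5", "h6"}


class CheckColumnParser(HTMLParser):
    """Collect the CHECK column of every table, keyed by the caption before it."""

    def __init__(self):
        super().__init__()
        self.results = {}      # file name from caption -> list of CHECK values
        self.caption = ""      # last non-empty text block seen outside a table
        self.in_table = False
        self.rows = []         # rows of the table currently being read
        self.buf = None        # text pieces of the cell or paragraph being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.in_table, self.rows = True, []
        elif self.in_table and tag == "tr":
            self.rows.append([])
        elif self.in_table and tag in ("td", "th"):
            self.buf = []
        elif not self.in_table and tag in TEXT_BLOCKS:
            self.buf = []

    def handle_data(self, data):
        if self.buf is not None:
            self.buf.append(data)

    def handle_endtag(self, tag):
        if self.in_table and tag in ("td", "th") and self.buf is not None:
            if not self.rows:
                self.rows.append([])
            self.rows[-1].append("".join(self.buf).strip())
            self.buf = None
        elif not self.in_table and tag in TEXT_BLOCKS and self.buf is not None:
            text = "".join(self.buf).strip()
            if text:
                self.caption = text
            self.buf = None
        elif tag == "table" and self.in_table:
            self.in_table = False
            self._store_table()

    def _store_table(self):
        # Find the header row and the position of the CHECK column in it.
        header = next((row for row in self.rows if "CHECK" in row), None)
        if header is None:
            return
        col = header.index("CHECK")
        values = []
        for row in self.rows[self.rows.index(header) + 1:]:
            if col >= len(row):
                continue
            value = row[col].lstrip("#").strip()
            # Skip empty cells and separator rows made only of "=".
            if value and set(value) != {"="}:
                values.append(value)
        match = CAPTION_RE.search(self.caption)
        name = match.group(1) if match else self.caption
        self.results[name] = values


# Made-up HTML imitating the assumed converter output for the two tables.
sample_html = (
    "<p>Table 4-1. Bite_main.c</p><table>"
    "<tr><td><p>CHECK</p></td><td><p>Function</p></td></tr>"
    "<tr><td><p>=====</p></td></tr>"
    "<tr><td><p>##overflow.2</p></td><td><p>xxxxxxx</p></td></tr>"
    "<tr><td><p>##overflow.5</p></td><td><p>xxxxxxx</p></td></tr>"
    "</table><p>Table 4-2. Bite_Engine.c</p><table>"
    "<tr><td><p>CHECK</p></td><td><p>Function</p></td></tr>"
    "<tr><td><p>overflow.4</p></td><td><p>xxxxxxx</p></td></tr>"
    "</table>"
)
parser = CheckColumnParser()
parser.feed(sample_html)
parser.close()
print(parser.results)
```

The resulting dict maps each file name to its CHECK values. If the real HTML places captions after the tables, or inside them, the caption tracking would need to change accordingly.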
