Is there a master list of the tags used in MHTML files and their meanings?

I am trying to read and extract data from xls files that are actually Single File Web Pages. Each file contains the following note:

This document is a Single File Web Page, also known as a Web Archive file. 

I am trying to find out what all of the tags mean so that I can be sure I am parsing them correctly with lxml.

For example, here is one of the tags:

  <th class=3Dtl colspan=3D1 rowspan=3D2 

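My code works on the few files I have been experimenting with, but I want to find out whether I am making assumptions that will come back to haunt me later. A list of these tags and their meanings would therefore be very helpful.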

If the MHTML was generated by Microsoft Word, it is probably a combination of WordprocessingML and HTML4 markup.
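Before interpreting individual tags, note that the `=3D` in the sample tag is not part of the markup. MHTML is a MIME container, and `=3D` is the quoted-printable encoding of `=`. The sketch below shows one way to decode the HTML part before parsing it, using only the Python standard library. `SAMPLE_MHTML`, `extract_html`, and `header_cells` are hypothetical names made up for illustration, and the sample is a hand-written stand-in for a real file.

```python
import email
from html.parser import HTMLParser

# Hypothetical, hand-written stand-in for a Single File Web Page.
SAMPLE_MHTML = b"""MIME-Version: 1.0
Content-Type: multipart/related; boundary="DEMO_BOUNDARY"

--DEMO_BOUNDARY
Content-Type: text/html; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable

<html><body><table>
<tr><th class=3Dtl colspan=3D1 rowspan=3D2>Name</th></tr>
</table></body></html>

--DEMO_BOUNDARY--
"""


def extract_html(raw_bytes):
    """Return the first text/html part with its transfer encoding undone."""
    message = email.message_from_bytes(raw_bytes)
    for part in message.walk():
        if part.get_content_type() == "text/html":
            charset = part.get_content_charset() or "utf-8"
            # decode=True reverses quoted-printable (or base64) encoding.
            return part.get_payload(decode=True).decode(charset)
    raise ValueError("no text/html part found")


class HeaderCellCollector(HTMLParser):
    """Collect the attributes of every <th> start tag."""

    def __init__(self):
        super().__init__()
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "th":
            self.cells.append(dict(attrs))


def header_cells(html_text):
    collector = HeaderCellCollector()
    collector.feed(html_text)
    return collector.cells


decoded_html = extract_html(SAMPLE_MHTML)
cells = header_cells(decoded_html)
```

After decoding, the tag reads `<th class=tl colspan=1 rowspan=2>`, which is ordinary HTML. The string returned by `extract_html` could equally be handed to lxml instead of the standard-library parser used here.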

The top-level elements in a WordprocessingML document are:

- SmartTagType element describes a Smart Tag type used in the document.
- DocumentProperties element contains Office Document Properties.
- CustomDocumentProperties element contains Custom Office Document Properties.
- schemaLibrary element defines a collection of schemas that comprise a document's schema library.
- fonts element (wordDocumentElt complexType) contains font information.
- frameset element (wordDocumentElt complexType) contains HTML Frameset definitions.
- styles element (wordDocumentElt complexType) contains style definitions.
- divs element contains HTML DIV information.
- shapeDefaults element contains drawing defaults.
- docOleData element contains supplemental data containing storages for OLE objects.
- docSuppData element contains supplemental data containing toolbar customizations, envelope data, and the Microsoft Visual Basic project.
- docPr element contains document options.
- shapeDefaults element contains the wrapper representing the shape defaults.
- bgPict element contains background picture information.
- body element contains the document body.

However, the simplest WordprocessingML document contains only five elements (and a single namespace). The five elements are:

- wordDocument element: The root element for a WordprocessingML document.
- body element: The container for the displayable text.
- p element: A paragraph.
- r element: A contiguous set of WordprocessingML components with a consistent set of properties.
- t element: A piece of text.
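As a rough illustration of how those five elements nest, the sketch below parses a hand-written minimal document and pulls out the text of each paragraph. The namespace URI is assumed to be the Word 2003 WordprocessingML one (the text above does not name it), and `MINIMAL_DOC` and `paragraph_texts` are hypothetical names. It uses the standard library's ElementTree; lxml's `etree` accepts the same `findall` calls with a namespace map.

```python
import xml.etree.ElementTree as ET

# Assumption: the Word 2003 WordprocessingML namespace.
W_NS = "http://schemas.microsoft.com/office/word/2003/wordml"
NS = {"w": W_NS}

# Hand-written example using only the five elements listed above.
MINIMAL_DOC = """<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:body>
    <w:p>
      <w:r>
        <w:t>Hello, World.</w:t>
      </w:r>
    </w:p>
  </w:body>
</w:wordDocument>
"""


def paragraph_texts(xml_text):
    """Return one string per w:p, joining the w:t text of its runs."""
    root = ET.fromstring(xml_text)
    paragraphs = []
    for paragraph in root.findall("./w:body/w:p", NS):
        text_nodes = paragraph.findall("./w:r/w:t", NS)
        paragraphs.append("".join(node.text or "" for node in text_nodes))
    return paragraphs


texts = paragraph_texts(MINIMAL_DOC)
```

Looking elements up by qualified name, rather than by position, means that extra elements such as `styles` or `docPr` in a real file would simply be skipped.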