Scraping text from HTML tags in a file

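I have a file that I want to extract dates from. It is an HTML source file, so it is full of code and phrases I don't need. I need to extract every instance of a date that is wrapped in this specific HTML tag: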

    <abbr title="((this is the text I need))" data-utime="

What is the simplest way to do this?

If you are using Excel VBA, set a reference (Tools – References) to the MSHTML library (it is listed in the References dialog as Microsoft HTML Object Library).

    Sub ScrapeDateAbbr()

        Dim hDoc As MSHTML.HTMLDocument
        Dim hElem As MSHTML.HTMLGenericElement
        Dim sFile As String, lFile As Long
        Dim sHtml As String

        'read in the file
        lFile = FreeFile
        sFile = "C:/Users/dick/Documents/My Dropbox/Excel/Testabbr.html"
        Open sFile For Input As lFile
        sHtml = Input$(LOF(lFile), lFile)
        'release the file handle once the text is in memory
        Close lFile

        'put into an htmldocument object
        Set hDoc = New MSHTML.HTMLDocument
        hDoc.body.innerHTML = sHtml

        'loop through abbr tags
        For Each hElem In hDoc.getElementsByTagName("abbr")
            'only those that have a data-utime attribute
            If Len(hElem.getAttribute("data-utime")) > 0 Then
                'get the title attribute
                Debug.Print hElem.getAttribute("title")
            End If
        Next hElem

    End Sub

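I assumed that the file you call the source file is local. If you need to download it first, you will also need a reference to MSXML and this code: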

    Sub ScrapeDateAbbrDownload()

        Dim xHttp As MSXML2.XMLHTTP
        Dim hDoc As MSHTML.HTMLDocument
        Dim hElem As MSHTML.HTMLGenericElement

        Set xHttp = New MSXML2.XMLHTTP
        xHttp.Open "GET", "file:///C:/Users/dick/Documents/My%20Dropbox/Excel/Testabbr.html"
        xHttp.send

        'wait until the request has completed
        Do
            DoEvents
        Loop Until xHttp.readyState = 4

        'put into an htmldocument object
        Set hDoc = New MSHTML.HTMLDocument
        hDoc.body.innerHTML = xHttp.responseText

        'loop through abbr tags
        For Each hElem In hDoc.getElementsByTagName("abbr")
            'only those that have a data-utime attribute
            If Len(hElem.getAttribute("data-utime")) > 0 Then
                'get the title attribute
                Debug.Print hElem.getAttribute("title")
            End If
        Next hElem

    End Sub

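If you are using Java, you can use Jsoup. The question is still unclear, though; please explain in more detail exactly what you are trying to do.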
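For what it's worth, here is a rough Java sketch of the same filter the VBA answers apply (abbr tags that carry a data-utime attribute, print the title). To keep it runnable without any dependency it uses java.util.regex rather than Jsoup, so treat it as a stand-in: a real HTML parser such as Jsoup will cope far better with messy markup. The class name `AbbrTitleScraper`, the method `extractTitles` and the sample HTML string are made up for illustration, and the sketch assumes double-quoted attribute values with no `>` inside them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class AbbrTitleScraper {

    // One opening <abbr ...> tag; group 1 is the raw attribute text.
    private static final Pattern ABBR_TAG =
            Pattern.compile("<abbr\\b([^>]*)>", Pattern.CASE_INSENSITIVE);

    // A double-quoted title attribute; group 1 is its value.
    // The lookbehind keeps names such as data-title from matching.
    private static final Pattern TITLE_ATTR =
            Pattern.compile("(?<![\\w-])title\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Presence check for a data-utime attribute.
    private static final Pattern UTIME_ATTR =
            Pattern.compile("(?<![\\w-])data-utime\\s*=", Pattern.CASE_INSENSITIVE);

    // Returns the title of every abbr tag that also has data-utime, in document order.
    static List<String> extractTitles(String html) {
        List<String> titles = new ArrayList<>();
        Matcher tag = ABBR_TAG.matcher(html);
        while (tag.find()) {
            String attrs = tag.group(1);
            if (!UTIME_ATTR.matcher(attrs).find()) {
                continue; // only tags that have a data-utime attribute
            }
            Matcher title = TITLE_ATTR.matcher(attrs);
            if (title.find()) {
                titles.add(title.group(1));
            }
        }
        return titles;
    }

    public static void main(String[] args) {
        // Made-up sample input; in practice read the file into a String first.
        String html = "<p><abbr title=\"Monday, January 3, 2011\" data-utime=\"1294070400\">Jan 3</abbr>"
                + " and <abbr title=\"no timestamp here\">n/a</abbr></p>";
        for (String t : extractTitles(html)) {
            System.out.println(t);
        }
    }
}
```

To run it against a real file, read the file contents into a String (for example with `Files.readString`) and pass that to `extractTitles`.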