VBA, getElementsByClassName: the double quotes in the HTML source have disappeared

I scrape some websites for fun, using VBA as my tool. I use XMLHTTP and HTMLDocument (because that is faster than InternetExplorer.Application).

    Public Sub XMLhtmlDocumentHTMLSourceScraper()

        Dim XMLHTTPReq As Object
        Dim htmlDoc As HTMLDocument
        Dim postURL As String

        postURL = "http://foodffs.tumblr.com/archive/2015/11"

        Set XMLHTTPReq = New MSXML2.XMLHTTP
        With XMLHTTPReq
            .Open "GET", postURL, False
            .Send
        End With

        Set htmlDoc = New HTMLDocument
        With htmlDoc
            .body.innerHTML = XMLHTTPReq.responseText
        End With

        i = 0
        Set varTemp = htmlDoc.getElementsByClassName("post_glass post_micro_glass")

        For Each vr In varTemp
            ''''the next line is important to solve this issue *1
            Cells(1, 1) = vr.outerHTML
            Set varTemp2 = vr.getElementsByTagName("SPAN class=post_date")
            Cells(i + 1, 3) = varTemp2.Item(0).innerText
            ''''the next line occur 438Error''''
            Set varTemp2 = vr.getElementsByClassName("hover_inner")
            Cells(i + 1, 4) = varTemp2.innerText
            i = i + 1
        Next vr

    End Sub

Here is what I know about the problem: for the line marked *1, Cells(1, 1) shows me the following:

 <DIV class="post_glass post_micro_glass" title=""><A class=hover title="" href="http://foodffs.tumblr.com/post/134291668251/sugar-free-low-carb-coffee-ricotta-mousse-really" target=_blank> <DIV class=hover_inner><SPAN class=post_date>............... 

All of the class attributes have lost their double quotes. Only the class on the first element still has them, and I really don't know why this happens.

// I can parse this with getElementsByTagName("span"), but I would prefer to use the class attribute.
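On the missing quotes themselves: the pattern in the output above (quotes kept where the value is empty or contains a space, dropped otherwise) suggests that outerHTML is a fresh serialization of the parsed DOM rather than a copy of the downloaded text. The following is a minimal sketch for checking that the class values survive regardless of how they are printed. It assumes a reference to the Microsoft HTML Object Library; the sub name ShowQuoteSerialization and the sample markup are made up for illustration.

```vba
' Requires a reference to "Microsoft HTML Object Library".
' ShowQuoteSerialization is a hypothetical name used only for this sketch.
Public Sub ShowQuoteSerialization()
    Dim doc As New HTMLDocument
    Dim el As Object

    ' Every attribute value in this sample markup is quoted
    doc.body.innerHTML = "<div class=""outer box"">" & _
                         "<span class=""post_date"">Nov 2015</span></div>"

    Set el = doc.getElementsByTagName("span").Item(0)

    ' outerHTML is rebuilt from the DOM, so its quoting may differ from the input
    Debug.Print el.outerHTML

    ' className is read from the DOM node, not from the serialized text
    Debug.Print el.className
End Sub
```

If className still prints post_date in the Immediate window, the unquoted attributes in outerHTML are only a display matter and should not be what causes error 438.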

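The getElementsByClassName method is not treated as a method of an element itself, only of the parent HTMLDocument. If you want to use it to locate elements inside a DIV element, you need to create a sub-HTMLDocument made up of the .outerHTML of that specific DIV element.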

    Public Sub XMLhtmlDocumentHTMLSourceScraper()
        Dim xmlHTTPReq As New MSXML2.XMLHTTP
        Dim htmlDOC As New HTMLDocument, divSUBDOC As New HTMLDocument
        Dim iDIV As Long, iSPN As Long, iEL As Long
        Dim postURL As String, nr As Long, i As Long

        postURL = "http://foodffs.tumblr.com/archive/2015/11"

        With xmlHTTPReq
            .Open "GET", postURL, False
            .Send
        End With

        'Set htmlDOC = New HTMLDocument
        With htmlDOC
            .body.innerHTML = xmlHTTPReq.responseText
        End With

        i = 0

        With htmlDOC
            For iDIV = 0 To .getElementsByClassName("post_glass post_micro_glass").Length - 1
                nr = Sheet1.Cells(Rows.Count, 3).End(xlUp).Offset(1, 0).Row
                With .getElementsByClassName("post_glass post_micro_glass")(iDIV)

                    'method 1 - run through multiples in a collection
                    For iSPN = 0 To .getElementsByTagName("span").Length - 1
                        With .getElementsByTagName("span")(iSPN)
                            Select Case LCase(.className)
                                Case "post_date"
                                    Cells(nr, 3) = .innerText
                                Case "post_notes"
                                    Cells(nr, 4) = .innerText
                                Case Else
                                    'do nothing
                            End Select
                        End With
                    Next iSPN

                    'method 2 - create a sub-HTML doc to facilitate getting els by classname
                    divSUBDOC.body.innerHTML = .outerHTML  'only the HTML from this DIV
                    With divSUBDOC
                        If CBool(.getElementsByClassName("hover_inner").Length) Then 'there is at least 1
                            'use the first
                            Cells(nr, 5) = .getElementsByClassName("hover_inner")(0).innerText
                        End If
                    End With

                End With
            Next iDIV
        End With

    End Sub

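While the other .getElementsByXXXX methods can easily retrieve a collection from inside another element, getElementsByClassName needs to be working on what it believes is a whole HTMLDocument, even if you have only fooled it into thinking so.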
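The sub-document trick can be wrapped in a small helper so that the calling loop stays short. This is only a sketch under the same assumptions as the answer (a reference to the Microsoft HTML Object Library, and an element that exposes .outerHTML); the function name FirstTextByClass and its parameters are hypothetical, not part of any library.

```vba
' Requires a reference to "Microsoft HTML Object Library".
' FirstTextByClass is a hypothetical helper name used only for this sketch.
' Returns the innerText of the first element with the given class found
' in parentEl's markup, or an empty string when nothing matches.
Public Function FirstTextByClass(ByVal parentEl As Object, _
                                 ByVal cssClass As String) As String
    Dim subDoc As New HTMLDocument
    Dim matches As Object

    ' Copy only this element's markup into a throwaway document
    subDoc.body.innerHTML = parentEl.outerHTML

    ' The class lookup now runs against a whole HTMLDocument
    Set matches = subDoc.getElementsByClassName(cssClass)

    If matches.Length > 0 Then
        FirstTextByClass = matches.Item(0).innerText
    End If
End Function
```

Inside the loop this might be called as Cells(nr, 5) = FirstTextByClass(divEl, "hover_inner"), where divEl stands for the current DIV element. Because the sub-document is built from outerHTML, the parent element is part of the search, so a parent that carries the requested class would be the first match.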

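Here is another way to do it. It is very similar to the original code but uses querySelectorAll to select the relevant span elements. One important point about this approach is that vr must be declared as a specific element type, not as an IHTMLElement or a generic Object: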

    Option Explicit

    Public Sub XMLhtmlDocumentHTMLSourceScraper()

        ' Changed from generic Object to specific type - not
        ' strictly necessary to do this
        Dim XMLHTTPReq As MSXML2.XMLHTTP60
        Dim htmlDoc As HTMLDocument

        ' These declarations weren't included in the original code
        Dim i As Integer
        Dim varTemp As Object
        ' IMPORTANT: vr must be declared as a specific element type and not
        ' as an IHTMLElement or generic Object
        Dim vr As HTMLDivElement
        Dim varTemp2 As Object

        Dim postURL As String

        postURL = "http://foodffs.tumblr.com/archive/2015/11"

        ' Changed from XMLHTTP to XMLHTTP60 as XMLHTTP is equivalent
        ' to the older XMLHTTP30
        Set XMLHTTPReq = New MSXML2.XMLHTTP60
        With XMLHTTPReq
            .Open "GET", postURL, False
            .Send
        End With

        Set htmlDoc = New HTMLDocument
        With htmlDoc
            .body.innerHTML = XMLHTTPReq.responseText
        End With

        i = 0
        Set varTemp = htmlDoc.getElementsByClassName("post_glass post_micro_glass")

        For Each vr In varTemp
            ''''the next line is important to solve this issue *1
            Cells(1, 1) = vr.outerHTML
            Set varTemp2 = vr.querySelectorAll("span.post_date")
            Cells(i + 1, 3) = varTemp2.Item(0).innerText
            Set varTemp2 = vr.getElementsByClassName("hover_inner")
            ' incorporating correction from Jeeped's comment (#56349646)
            Cells(i + 1, 4) = varTemp2.Item(0).innerText
            i = i + 1
        Next vr

    End Sub

Notes:

  • XMLHTTP is equivalent to the older XMLHTTP30
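  • A specific element type apparently has to be declared in this case; note, however, that unlike getElementsByClassName, querySelectorAll does not exist in any version of the IHTMLElement interface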