Extracting span values from a class with VBA

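After a lot of searching, I am still struggling to get data out of the HTML below with VBA. Specifically, I am trying to pull the values "DATA ONE" and "DATA THREE" from each class="_Xnb _QJ" element in the following HTML: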

    <div class="results">
      <div class="_s2 _wPc">
        <div class="_fW _QJ">
          <div class="_Xnb _QJ _Z9b">
            <div class="_Xnb _QJ">
              <div class="_Xnb _QJ">
                <div class="_Xnb _QJ">
                  <a href="//Extracted URL//">
                    <span class="_fbb">
                      <img id="uid_3" //Extracted// >
                    </span>
                    <span class="_PHb">
                      <span class="_MHb">DATA ONE</span>
                    </span>
                    <span class="_B6e">
                      <span class="_x2">DATA TWO</span>
                      <span class="_Fs"> DATA THREE </span>

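I have been trying to use getElementsByClassName to get a collection of the "_Xnb _QJ" elements, and then, for each of them, getElementsByTagName to look for the "_MHb" and "_Fs" spans. I can't pick the children by position, because the order varies between the "_Xnb .." elements, but the data I need always carries the same class labels (_MHb / _Fs).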

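I am a complete novice at VBA and HTML, so this code has largely been assembled by editing examples found elsewhere on Stack Overflow. Could the reason I can't extract the right data be that the classes I need sit under the "href" link rather than directly beneath the _Xnb element?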

The relevant part of my VBA code is below. When I run it, the code appears to run fine, but no data is collected.

    Dim RowNumber As Long
    Dim DataOne As String
    Dim DataThree As String
    Dim QuestionList As IHTMLElementCollection
    Dim Question As IHTMLElement
    Dim QuestionFields As IHTMLElementCollection
    Dim QuestionField As IHTMLElement

    RowNumber = 1
    Set QuestionList = html.getElementsByClassName("_Xnb _QJ")

    For Each Question In QuestionList
        Set QuestionFields = Question.getElementsByTagName("SPAN")
        For Each QuestionField In QuestionFields
            If QuestionField.className = "_MHb" Then
                DataOne = QuestionField.innerText
                Cells(RowNumber, 1).Value = DataOne
            End If
            If QuestionField.className = "_Fs" Then
                DataThree = QuestionField.innerText
                Cells(RowNumber, 2).Value = DataThree
            End If
        Next QuestionField
        RowNumber = RowNumber + 1
    Next

    Set html = Nothing
    MsgBox "Done!"
    End Sub

Any help would be much appreciated.

Many thanks

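I suggest you look into XPath, a standards-based query language for working with XML documents. You can also use it on well-formed HTML documents. It is a bit cryptic, but super useful, and it can be used from VBA.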

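Your sample HTML looks a little complicated because you have multiple <div> tags with the same class. It is also not valid XML, because of the //Extracted// inside the <img> tag, and the sample has no closing tags. Anyway, I have tidied it up in the code sample below.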
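Because MSXML only accepts well-formed input, it helps to see why a fragment is rejected rather than just that it was. The following is a minimal sketch, assuming MSXML 6.0 is installed; the procedure name ShowParseError and the short fragment are made up for illustration and are not part of the question's page.

```vba
Option Explicit

' Sketch: report why MSXML rejects a fragment instead of failing silently.
Sub ShowParseError()
    Dim objXml As Object
    Dim strHtml As String

    ' Late binding, so no library reference is needed for this sketch.
    Set objXml = CreateObject("MSXML2.DOMDocument.6.0")
    objXml.async = False

    ' Hypothetical fragment: the <img> tag is never closed, so this is not XML.
    strHtml = "<div class=""results""><img id=""uid_3""></div>"

    If Not objXml.LoadXML(strHtml) Then
        ' parseError describes the first problem the parser ran into.
        With objXml.parseError
            Debug.Print "Parse failed: " & .reason
            Debug.Print "Line " & .Line & ", position " & .linepos
        End With
    Else
        Debug.Print "Parsed OK"
    End If
End Sub
```

Real-world HTML will often fail this check, in which case it has to be cleaned up before XPath can be applied to it.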

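I have looked at your question and interpreted it like this:

Extract any text from a <span> tag whose class is _MHb or _Fs, where that tag sits inside a <div> tag of class _Xnb _QJ.

If so, your XPath query can be built up in three parts: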

 //div[@class='_Xnb _QJ'] 

Meaning: get any <div> tag whose class attribute is exactly _Xnb _QJ.

 (//div[@class='_Xnb _QJ'])[last()] 

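Meaning: take only the last item of that first set, which here is the innermost one (remember that you have multiple nested <div> tags with the same class).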

 (//div[@class='_Xnb _QJ'])[last()]//span[@class='_MHb' or @class='_Fs'] 

Meaning: within that innermost <div>, keep only the <span> tags whose class is _MHb or _Fs.
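To see how each part narrows the result, one option is to count the matches at every step. This is only a sketch, assuming MSXML 6.0 and a cut-down version of the sample markup; the procedure name CountXPathMatches is hypothetical. It also shows that @class='_Xnb _QJ' is an exact string comparison, so the outer <div> with class _Xnb _QJ _Z9b is not matched unless a token-style test is used instead.

```vba
Option Explicit

' Sketch: count how many nodes each part of the XPath query selects.
Sub CountXPathMatches()
    Dim objXml As Object
    Dim strXml As String

    Set objXml = CreateObject("MSXML2.DOMDocument.6.0")
    objXml.async = False

    ' Cut-down, well-formed version of the sample markup (an assumption).
    strXml = "<div class=""_Xnb _QJ _Z9b"">" & _
             "<div class=""_Xnb _QJ"">" & _
             "<div class=""_Xnb _QJ"">" & _
             "<span class=""_MHb"">DATA ONE</span>" & _
             "<span class=""_x2"">DATA TWO</span>" & _
             "<span class=""_Fs"">DATA THREE</span>" & _
             "</div></div></div>"

    If Not objXml.LoadXML(strXml) Then
        Debug.Print "Not parsed: " & objXml.parseError.reason
        Exit Sub
    End If

    ' Part 1: exact match on the class attribute (prints 2 for this sample).
    Debug.Print objXml.SelectNodes("//div[@class='_Xnb _QJ']").Length

    ' Alternative: match _Xnb as one class token among several (prints 3).
    Debug.Print objXml.SelectNodes( _
        "//div[contains(concat(' ', normalize-space(@class), ' '), ' _Xnb ')]").Length

    ' Part 2: only the last of the exact matches (prints 1).
    Debug.Print objXml.SelectNodes("(//div[@class='_Xnb _QJ'])[last()]").Length

    ' Part 3: the two wanted spans inside that last div (prints 2).
    Debug.Print objXml.SelectNodes( _
        "(//div[@class='_Xnb _QJ'])[last()]//span[@class='_MHb' or @class='_Fs']").Length
End Sub
```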

So, if you include a reference to the MSXML library (which I think you already have), you can use XPath in VBA. The code looks like this:

    Option Explicit

    Sub Test()

        Dim strXml As String
        Dim objXml As New DOMDocument60
        Dim strXPath As String
        Dim objXmlNodeList As IXMLDOMNodeList
        Dim objXmlNode As IXMLDOMNode

        'get the sample XML
        strXml = GetXml

        'load xml to document
        If Not objXml.LoadXML(strXml) Then
            Debug.Print "Not parsed"
            Exit Sub
        End If

        'apply XPath
        'first just let's get the last <div> tag of class _Xnb _QJ
        strXPath = "(//div[@class='_Xnb _QJ'])[last()]"

        'test that query
        Set objXmlNodeList = objXml.SelectNodes(strXPath)
        For Each objXmlNode In objXmlNodeList
            Debug.Print objXmlNode.XML
        Next objXmlNode

        'now lets append a filter to only get the <span> texts
        strXPath = strXPath & "//span[@class='_MHb' or @class='_Fs']"

        'get output nodes by applying query to xml
        Set objXmlNodeList = objXml.SelectNodes(strXPath)
        For Each objXmlNode In objXmlNodeList
            Debug.Print objXmlNode.Text
        Next objXmlNode

    End Sub

    Function GetXml() As String

        Dim strXml As String

        strXml = ""
        strXml = strXml & "<div class=""results"">"
        strXml = strXml & " <div class=""_s2 _wPc"">"
        strXml = strXml & " <div class=""_fW _QJ"">"
        strXml = strXml & " <div class=""_Xnb _QJ _Z9b"">"
        strXml = strXml & " <div class=""_Xnb _QJ"">"
        strXml = strXml & " <div class=""_Xnb _QJ"">"
        strXml = strXml & " <div class=""_Xnb _QJ"">"
        strXml = strXml & " <a href=""//Extracted URL//"">"
        strXml = strXml & " <span class=""_fbb"">"
        strXml = strXml & " <img id=""uid_3"" />"
        strXml = strXml & " </span>"
        strXml = strXml & " <span class=""_PHb"">"
        strXml = strXml & " <span class=""_MHb"">DATA ONE</span>"
        strXml = strXml & " </span>"
        strXml = strXml & " <span class=""_B6e"">"
        strXml = strXml & " <span class=""_x2"">DATA TWO</span>"
        strXml = strXml & " <span class=""_Fs""> DATA THREE </span>"
        strXml = strXml & " </span>"
        strXml = strXml & " </a>"
        strXml = strXml & " </div>"
        strXml = strXml & " </div>"
        strXml = strXml & " </div>"
        strXml = strXml & " </div>"
        strXml = strXml & " </div>"
        strXml = strXml & " </div>"
        strXml = strXml & "</div>"

        GetXml = strXml

    End Function

The debug output looks like this:

    <div class="_Xnb _QJ">
      <a href="//Extracted URL//">
        <span class="_fbb">
          <img id="uid_3"/>
        </span>
        <span class="_PHb">
          <span class="_MHb">DATA ONE</span>
        </span>
        <span class="_B6e">
          <span class="_x2">DATA TWO</span>
          <span class="_Fs"> DATA THREE </span>
        </span>
      </a>
    </div>
    DATA ONE
    DATA THREE
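If the page holds several results and the values should land in worksheet cells, as in the question's loop, the same idea might be extended roughly as follows. This is a sketch under assumptions, not a drop-in fix: it needs Excel and MSXML 6.0, the two-result markup and the values DATA FOUR to DATA SIX are invented for illustration, and the procedure name WriteResultsToSheet is hypothetical. It selects every _Xnb _QJ <div> that has no nested <div> of the same class, then looks up each span relative to that <div>.

```vba
Option Explicit

' Sketch: one worksheet row per innermost "_Xnb _QJ" div.
Sub WriteResultsToSheet()
    Dim objXml As Object, objDivs As Object
    Dim objDiv As Object, objSpan As Object
    Dim strXml As String
    Dim RowNumber As Long

    Set objXml = CreateObject("MSXML2.DOMDocument.6.0")
    objXml.async = False

    ' Hypothetical, well-formed markup with two results.
    strXml = "<div class=""results"">" & _
             "<div class=""_Xnb _QJ""><div class=""_Xnb _QJ""><a href=""#"">" & _
             "<span class=""_MHb"">DATA ONE</span>" & _
             "<span class=""_x2"">DATA TWO</span>" & _
             "<span class=""_Fs""> DATA THREE </span>" & _
             "</a></div></div>" & _
             "<div class=""_Xnb _QJ""><div class=""_Xnb _QJ""><a href=""#"">" & _
             "<span class=""_MHb"">DATA FOUR</span>" & _
             "<span class=""_x2"">DATA FIVE</span>" & _
             "<span class=""_Fs""> DATA SIX </span>" & _
             "</a></div></div>" & _
             "</div>"

    If Not objXml.LoadXML(strXml) Then
        Debug.Print "Not parsed: " & objXml.parseError.reason
        Exit Sub
    End If

    ' Innermost divs only: those without a nested div of the same class.
    Set objDivs = objXml.SelectNodes( _
        "//div[@class='_Xnb _QJ'][not(.//div[@class='_Xnb _QJ'])]")

    RowNumber = 1
    For Each objDiv In objDivs
        ' Leading "." keeps the search inside the current div.
        Set objSpan = objDiv.SelectSingleNode(".//span[@class='_MHb']")
        If Not objSpan Is Nothing Then
            ActiveSheet.Cells(RowNumber, 1).Value = Trim$(objSpan.Text)
        End If

        Set objSpan = objDiv.SelectSingleNode(".//span[@class='_Fs']")
        If Not objSpan Is Nothing Then
            ActiveSheet.Cells(RowNumber, 2).Value = Trim$(objSpan.Text)
        End If

        RowNumber = RowNumber + 1
    Next objDiv
End Sub
```

Filtering on the innermost <div> avoids writing the same spans once per nesting level, which is what happens when every nested <div> of the class is visited in turn.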

This all looks a bit complicated, but once you have experimented with it a little you will be fine.