How to get data from the child elements of a specific parent HTML element in VBA

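Once again I find myself with a very specific problem. I'm fairly new to VBA, and especially to HTML, so please bear with me. I've built a working web scraper in VBA, but there are a few specific things I'd like to accomplish that I can't figure out.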

Here is a sample of the HTML my question concerns.

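I've replaced everything unimportant with ellipses. The important part I want to scrape is in the "a" tag: its short-name data attribute (or simply its innerText). This is a website name. There are at most five of them, but not always five. This is also only one of two sections that list up to five websites. The section shown here is a child of <div class="referralsSites referring">, and the other is a child of <div class="referralsSites destination">.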

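For each website in the "referring" section, I want to assign its name to an "Up" variable: the first site goes to "Up1", the second to "Up2", and so on, but only for as many sites as the "referring" section actually contains. I want to do the same for the "destination" section, assigning to "Down" variables (Down1, Down2, etc.) according to how many destination sites there are.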

If I simply use getElementsByClassName("websitePage-listItemLink js-tooltipTarget"), for example, I won't be able to tell the referring sites apart from the destination sites.

Here is my code so far:

    Sub GetSimilarWebData()

        Dim appIE As InternetExplorer
        Dim HTML As HTMLDocument
        Dim ieWindow As SHDocVw.InternetExplorer
        Dim URL As String
        Dim Rankings As IHTMLElementCollection, Traffic As IHTMLElementCollection, _
            ReferSites As IHTMLElementCollection, DestSites As IHTMLElementCollection, _
            rSite As IHTMLElement, rSiteNo As Long, dSite As IHTMLElement, dSiteNo As Long, _
            GlobalRank As String, CountryName As String, CountryRank As String, _
            Visits As String, Direct As String, Refer As String, Search As String, _
            Social As String, Display As String, _
            Up1 As String, Up2 As String, Up3 As String, Up4 As String, Up5 As String, _
            D1 As String, D2 As String, D3 As String, D4 As String, D5 As String
        Dim FraudLast As Long

        CheckLast = Worksheets("Sheet1").Range("I1").End(xlDown).Offset(1).Row
        webStr = Worksheets("Sheet1").Range("A" & CheckLast).Value

        Set appIE = New InternetExplorer
        appIE.Visible = False
        appIE.navigate "https://www.similarweb.com/website/" & webStr

        Do While appIE.readyState <> READYSTATE_COMPLETE
            Application.StatusBar = "Connecting to SimilarWeb..."
            DoEvents
        Loop

        Set HTML = appIE.document
        Set appIE = Nothing
        Application.StatusBar = ""

        Set Rankings = HTML.getElementsByClassName("rankingItem-value")
        GlobalRank = Rankings(0).innerText

        If GlobalRank = "N/A" Then
            GlobalRank = "null"
            CountryName = "null"
            CountryRank = "null"
        Else
            CountryName = HTML.getElementsByClassName("rankingItem-subTitle")(1).innerText
            CountryRank = Rankings(1).innerText
        End If

        Visits = HTML.getElementsByClassName("engagementInfo-value engagementInfo-value--large u-text-ellipsis")(0).innerText

        If InStr(Visits, "M") <> 0 Then
            Visits = Replace(Visits, ".", "")
            Visits = Replace(Visits, "M", "00000")
        ElseIf InStr(Visits, "K") <> 0 Then
            Visits = Replace(Visits, ".", "")
            Visits = Replace(Visits, "K", "00")
        ElseIf InStr(Visits, "B") <> 0 Then
            Visits = Replace(Visits, ".", "")
            Visits = Replace(Visits, "B", "00000000")
        End If

        Set Traffic = HTML.getElementsByClassName("trafficSourcesChart-value")
        Direct = Traffic(0).innerText
        Refer = Traffic(1).innerText
        Search = Traffic(2).innerText
        Social = Traffic(3).innerText
        Display = Traffic(4).innerText

        'Here's what I've started off with:
        Set ReferSite = HTML.getElementsByClassName("referralsSites referring")
        rSiteNo = ReferSite.Length
        Set DestSite = HTML.getElementsByClassName("referralsSites destination")
        dSiteNo = DestSite.Length

        'For Each rSite In ReferSite

    End Sub

I'm not sure how to approach this. Everything else in my code works fine, but of course any suggestions for improving its speed are welcome too.

All of this refers to data on similarweb.com.

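The getElementsByClassName method can be called on IHTMLElement objects as well as on HTMLDocument objects. This means you can get separate lists of referring and destination sites in two "hops".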

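First, get the <div>s with the class referralsSites referring or referralsSites destination. The getElementsByClassName method returns an IHTMLElementCollection, which is a collection of IHTMLElement objects. So you take the 0th element of that collection (assuming there is only one such <div>), and then get the <a>s with the class websitePage-listItemLink inside that <div> by calling getElementsByClassName again on the IHTMLElement.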

Here is an example for stackoverflow.com. I just use Debug.Print for the output, but you may want to assign the site names to an array, a Collection, or something similar.

    Option Explicit

    Sub Test()

        'references required:
        'Microsoft HTML Object Library
        'Microsoft Internet Controls

        Dim strUrl As String
        Dim objIe As InternetExplorer
        Dim objHtml As HTMLDocument
        Dim strHtml As String
        Dim objDivs As IHTMLElementCollection
        Dim objAnchors As IHTMLElementCollection
        Dim intCounter As Integer

        'set target to scrape
        strUrl = "https://www.similarweb.com/website/stackoverflow.com"

        'get html from page
        Set objIe = New InternetExplorer
        objIe.Visible = False
        objIe.navigate strUrl
        While objIe.readyState <> READYSTATE_COMPLETE
            DoEvents
        Wend

        'assign html to DOM document
        Set objHtml = New HTMLDocument
        Set objHtml = objIe.document

        'get referrals
        Set objDivs = objHtml.getElementsByClassName("referralsSites referring")
        If objDivs.Length > 0 Then
            Set objAnchors = objDivs(0).getElementsByClassName("websitePage-listItemLink")
            Debug.Print "Referrers:"
            If objAnchors.Length > 0 Then
                For intCounter = 0 To objAnchors.Length - 1
                    Debug.Print objAnchors(intCounter).innerText
                Next intCounter
            End If
        End If

        'get destinations
        Set objDivs = objHtml.getElementsByClassName("referralsSites destination")
        If objDivs.Length > 0 Then
            Set objAnchors = objDivs(0).getElementsByClassName("websitePage-listItemLink")
            Debug.Print "Destinations:"
            If objAnchors.Length > 0 Then
                For intCounter = 0 To objAnchors.Length - 1
                    Debug.Print objAnchors(intCounter).innerText
                Next intCounter
            End If
        End If

        'clean up
        Set objHtml = Nothing
        objIe.Quit
        Set objIe = Nothing

    End Sub

This gives the following output:

    Referrers:
    news.ycombinator.com
    qwant.com
    github.com
    remoteok.io
    serverfault.com
    Destinations:
    jsfiddle.net
    youtube.com
    github.com
    i.stack.imgur.com
    w3schools.com
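
If you would rather keep the names than print them, the same two hops can be wrapped in a small helper that returns a Collection. This is only a sketch: GetSiteNames and ShowSites are hypothetical names that are not part of the example above, and it assumes the page still uses the class names mentioned in the question and that the Microsoft HTML Object Library is referenced.

```vba
Option Explicit

' Hypothetical helper (an illustration, not part of the original example).
' Returns the innerText of every anchor with class "websitePage-listItemLink"
' found inside the first element whose class matches divClass.
Public Function GetSiteNames(ByVal doc As HTMLDocument, _
                             ByVal divClass As String) As Collection
    Dim result As New Collection
    Dim divs As IHTMLElementCollection
    Dim anchors As Object   ' late bound, like objDivs(0) in the example above
    Dim i As Long

    ' Hop 1: find the parent <div> for this section.
    Set divs = doc.getElementsByClassName(divClass)
    If divs.Length > 0 Then
        ' Hop 2: search for anchors only inside that <div>.
        Set anchors = divs(0).getElementsByClassName("websitePage-listItemLink")
        For i = 0 To anchors.Length - 1
            result.Add anchors(i).innerText
        Next i
    End If

    ' An empty Collection is returned when the section is missing.
    Set GetSiteNames = result
End Function

' Hypothetical usage: doc is a document that has already finished loading.
Public Sub ShowSites(ByVal doc As HTMLDocument)
    Dim ups As Collection, downs As Collection
    Dim i As Long

    Set ups = GetSiteNames(doc, "referralsSites referring")
    Set downs = GetSiteNames(doc, "referralsSites destination")

    ' Collections are 1-based, so ups(1) plays the role of "Up1".
    For i = 1 To ups.Count
        Debug.Print "Up" & i & ": " & ups(i)
    Next i
    For i = 1 To downs.Count
        Debug.Print "Down" & i & ": " & downs(i)
    Next i
End Sub
```

Using a Collection means the code never has to know in advance whether a section holds five sites or fewer; checking Count replaces the need for separate Up1 to Up5 and Down1 to Down5 variables.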