Scraping seasons and episodes from Wikipedia (VBA Excel)

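I am trying to create a program using Excel 2013 and Visual Basic. As part of the program, I will have a list box listing a number of TV shows. I want to be able to double-click one of them and have another form open, with a list box containing all the seasons and episodes for that show.

The best way I have found to do this is to go to Wikipedia.org. I think it is one of the only websites that will have this information in roughly the same format for every show. I also plan to do this with books.

I originally read about scraping on this site: http://www.wiseowl.co.uk/blog/s393/scrape-website-html.htm

However, I have never done anything with getElementBy*, so I am not sure how these methods work. Any help would be greatly appreciated. After scouring the internet, below is the best code I could piece together: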

    Private Sub cmdTest_Click()

        'to refer to the running copy of Internet Explorer
        Dim ie As InternetExplorer
        'to refer to the HTML document returned
        Dim html As HTMLDocument
        'for iteration
        Dim i As Integer
        Dim j As Integer

        'open Internet Explorer in memory, and go to website
        Set ie = New InternetExplorer
        ie.Visible = False
        ie.navigate "http://en.wikipedia.org/wiki/List_of_Archer_episodes"
        'ie.navigate "http://en.wikipedia.org/wiki/List_of_The_Simpsons_episodes"

        'Wait until IE is done loading page
        Do While ie.READYSTATE <> READYSTATE_COMPLETE
            Application.StatusBar = "Trying to go to Episodes ..."
            DoEvents
        Loop

        'show text of HTML document returned
        Set html = ie.document

        'close down IE and reset status bar
        Set ie = Nothing
        Application.StatusBar = ""

        'clear old data out and put titles in
        Cells.Clear

        'put heading across the top of row 3
        Range("A3").Value = "Season"
        Range("B3").Value = "Episode"

        i = 4
        For Each ele In html.getElementsByClassName("summary")
            Sheets("Wiki2").Range("B" & i).Value = ele.innerText
            i = i + 1
        Next

        i = 4
        For Each ele In html.getElementsByClassName("mw-headline")
            Sheets("Wiki2").Range("A" & i).Value = Left(ele.innerText, 8)
            i = i + 1
        Next

    End Sub

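The first part seems to be a fairly generic way of getting the source of a given page. I am now trying to pull the episodes of the TV show "Archer". The following code: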

    i = 4
    For Each ele In html.getElementsByClassName("summary")
        Sheets("Wiki2").Range("B" & i).Value = ele.innerText
        i = i + 1
    Next

    i = 4
    For Each ele In html.getElementsByClassName("mw-headline")
        Sheets("Wiki2").Range("A" & i).Value = Left(ele.innerText, 8)
        i = i + 1
    Next

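is what I use to pull the text I am looking for. What I need help with is combining the two loops. I need to loop through every instance of "mw-headline" and, each time one is found, look for the class name "summary". If a summary is found, it should put the headline's (a.k.a. season's) inner text in a cell in column A and the summary (a.k.a. episode name) in the adjacent column B cell.

These currently work independently. If you run this code, you get all the episodes starting in cell B4, and a list of the "mw-headline" entries starting in A4. The problem is that "mw-headline" is used not only for seasons but for a few other things as well, so the code needs to check whether a headline has a "summary" class under it. That would also get rid of the case where a headline announces a new season of the show but the page only says that the season is coming. Without a "summary" tag, the headline should not be listed. I want the season to show in column A next to each episode in column B, so if each season has 10 episodes, column A would have 10 instances of "Season 1", then 10 of "Season 2", and so on.

Thanks for your help. For anyone who finds this question later and does not know: you need to put the following code at the top of the code window: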

    Enum READYSTATE
        READYSTATE_UNINITIALIZED = 0
        READYSTATE_LOADING = 1
        READYSTATE_LOADED = 2
        READYSTATE_INTERACTIVE = 3
        READYSTATE_COMPLETE = 4
    End Enum

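PS: In the code "For Each ele In *", is ele an undefined variable, or is it a word VBA recognizes as standing for "element"? I got this from copy-and-paste work and I don't understand it. Thanks.

Here is a possible solution. I looked at the HTML of this particular page, and it makes correlating seasons and episodes quite a challenge. I took a step back and figured that, since the seasons are presumably listed in numerical order, we don't need to scrape anything for the season number. On the page you presented, the episodes of each season sit in the same table, so I just grab every episode from one table and assume it is season 1, treat all the episodes in the next table as season 2, and so on.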

    Private Sub cmdTest_Click()

        'to refer to the running copy of Internet Explorer
        Dim ie As InternetExplorer
        'to refer to the HTML document returned
        Dim html As HTMLDocument
        'for iteration
        Dim i As Integer
        Dim j As Integer

        'open Internet Explorer in memory, and go to website
        Set ie = New InternetExplorer
        ie.Visible = False
        ie.navigate "http://en.wikipedia.org/wiki/List_of_Archer_episodes"
        'ie.navigate "http://en.wikipedia.org/wiki/List_of_The_Simpsons_episodes"

        'Wait until IE is done loading page
        Do While ie.READYSTATE <> READYSTATE_COMPLETE
            Application.StatusBar = "Trying to go to Episodes ..."
            DoEvents
        Loop

        'show text of HTML document returned
        Set html = ie.document

        'close down IE and reset status bar
        Set ie = Nothing
        Application.StatusBar = ""

        'clear old data out and put titles in
        Cells.Clear

        'put heading across the top of row 3
        Range("A3").Value = "Season"
        Range("B3").Value = "Episode"

        i = 4
        Dim season As Integer: season = 1

        For Each tableTag In html.getElementsByTagName("table")
            'look through each table for "summary" (you could change this to be something a bit more discriminating!)
            If (InStr(1, tableTag.innerHTML, "summary")) Then
                Sheets(1).Cells(i, 1) = "Season " & season
                For Each objEpisode In tableTag.getElementsByClassName("summary")
                    Sheets(1).Range("B" & i).Value = objEpisode.innerText
                    i = i + 1
                Next
                season = season + 1
            End If
        Next

    End Sub
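For comparison, here is a rough sketch of the combined loop the question describes, as an alternative to counting tables. It walks every element in document order, remembers the most recent "mw-headline" text, and writes a row only when a "summary" cell turns up, so headings with no episodes under them never reach the sheet. The names `WriteEpisodesByHeading` and `HasClass` are made up for illustration, and the sketch assumes the page marks season headings with "mw-headline" and episode titles with "summary". It also declares its loop variable: `ele` is not a VBA keyword, just an ordinary variable that becomes an implicit Variant when `Option Explicit` is off.

```vba
Option Explicit

' Hypothetical helper: True when sClassName contains sWanted as a whole class token.
Private Function HasClass(ByVal sClassName As String, ByVal sWanted As String) As Boolean
    HasClass = InStr(1, " " & sClassName & " ", " " & sWanted & " ", vbTextCompare) > 0
End Function

' Hypothetical procedure: html is the loaded document, ws the target worksheet.
' Assumes every element's className is a plain string (true for ordinary HTML elements).
Sub WriteEpisodesByHeading(ByVal html As Object, ByVal ws As Worksheet)
    Dim ele As Object          ' an ordinary loop variable, not a reserved word
    Dim sHeading As String     ' text of the most recent heading seen
    Dim rowNum As Long

    rowNum = 4
    For Each ele In html.all   ' html.all lists the elements in document order
        If HasClass(ele.className, "mw-headline") Then
            sHeading = ele.innerText
        ElseIf HasClass(ele.className, "summary") And Len(sHeading) > 0 Then
            ' A summary cell belongs to the last heading that preceded it
            ws.Cells(rowNum, 1).Value = sHeading
            ws.Cells(rowNum, 2).Value = ele.innerText
            rowNum = rowNum + 1
        End If
    Next ele
End Sub
```

It could be called as `WriteEpisodesByHeading html, Sheets("Wiki2")` once the page has loaded. The full heading text is kept rather than trimmed with `Left(..., 8)`, since a fixed length would cut off two-digit season numbers.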

I would also suggest trying IMDb. Here is code showing how to scrape seasons and episodes from both IMDb and Wikipedia via HTTP requests.

    Option Explicit

    Sub ExtractDataWikipedia()

        Dim y, sUrl, sRespText, arrMatchSeasons, arrSeason, arrMatchEpisodes, arrEpisode

        sUrl = "https://en.wikipedia.org/wiki/List_of_Archer_episodes"
        ' sUrl = "https://en.wikipedia.org/wiki/List_of_The_Simpsons_episodes"
        ' sUrl = "https://en.wikipedia.org/wiki/List_of_DuckTales_episodes"
        XmlHttpRequest "GET", sUrl, "", "", "", sRespText
        ParseToArray "<span class=""mw-headline"" id=""Season[\s\S]*?>.*?(Season.*?)<[\s\S]*?(<table[\s\S]*?</table>)", sRespText, arrMatchSeasons
        y = 1
        For Each arrSeason In arrMatchSeasons
            ParseToArray "(<td class=""summary""[\s\S]*?</td>)", arrSeason(1), arrMatchEpisodes
            For Each arrEpisode In arrMatchEpisodes
                Cells(y, 1).Value = arrSeason(0)
                Cells(y, 2).Value = GetInnerText(arrEpisode(0))
                y = y + 1
            Next
        Next

    End Sub

    Sub ExtractDataIMDb()

        Dim y, sUrl, sRespText, arrData, arrMatchSeasons, arrSeason, sUrlEp, arrMatchEpisodes, arrEpisode

        sUrl = "http://www.imdb.com/title/tt1486217/episodes" ' Archer
        ' sUrl = "http://www.imdb.com/title/tt0096697/episodes" ' The Simpsons
        ' sUrl = "http://www.imdb.com/title/tt0092345/episodes" ' DuckTales
        XmlHttpRequest "GET", sUrl, "", "", "", sRespText
        ParseToArray "(<select id=""bySeason""[\s\S]*?</select>)", sRespText, arrData
        ParseToArray "<option[\s\S]*?value=""([\d]*)"">", arrData(0)(0), arrMatchSeasons
        y = 1
        For Each arrSeason In arrMatchSeasons
            DoEvents
            sUrlEp = sUrl & "?season=" & arrSeason(0)
            XmlHttpRequest "GET", sUrlEp, "", "", "", sRespText
            ParseToArray "itemprop=""episodes""[\s\S]*?itemprop=""name""[\s\S]*?>([\s\S]*?)</a>", sRespText, arrMatchEpisodes
            For Each arrEpisode In arrMatchEpisodes
                Cells(y, 3).Value = "Season " & arrSeason(0)
                Cells(y, 4).Value = arrEpisode(0)
                y = y + 1
            Next
        Next
        MsgBox "Completed"

    End Sub

    Sub XmlHttpRequest(sMethod, sUrl, arrSetHeaders, sFormData, sRespHeaders, sRespText)

        Dim arrHeader

        With CreateObject("Msxml2.ServerXMLHTTP.3.0")
            .SetOption 2, 13056 ' SXH_SERVER_CERT_IGNORE_ALL_SERVER_ERRORS
            .Open sMethod, sUrl, False
            If IsArray(arrSetHeaders) Then
                For Each arrHeader In arrSetHeaders
                    .SetRequestHeader arrHeader(0), arrHeader(1)
                Next
            End If
            .Send sFormData
            sRespHeaders = .GetAllResponseHeaders
            sRespText = .ResponseText
        End With

    End Sub

    Sub ParseToArray(sPattern, sResponse, arrMatches)

        Dim oMatch, arrSMatches, sSubMatch

        arrMatches = Array()
        With CreateObject("VBScript.RegExp")
            .Global = True
            .MultiLine = True
            .Pattern = sPattern
            For Each oMatch In .Execute(sResponse)
                arrSMatches = Array()
                For Each sSubMatch In oMatch.SubMatches
                    PushItem arrSMatches, sSubMatch
                Next
                PushItem arrMatches, arrSMatches
            Next
        End With

    End Sub

    Sub PushItem(arrList, varItem)

        ReDim Preserve arrList(UBound(arrList) + 1)
        arrList(UBound(arrList)) = varItem

    End Sub

    Function GetInnerText(sText)

        With CreateObject("htmlfile")
            .Write ("<body>" & sText & "</body>")
            GetInnerText = .DocumentElement.Document.GetElementsByTagName("body")(0).InnerText
        End With

    End Function

Regarding parsing HTML with RegExp: here are a disclaimer and the alternatives.
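One DOM-based alternative to the regular expressions could look like the sketch below. It loads the response text into a late-bound "htmlfile" document, the same object `GetInnerText` already uses, and collects the text of every `td` whose class list includes "summary". The function name `SummaryCellTexts` is made up for illustration. The cells are filtered by tag name and `className` on the assumption that such a document may run in a legacy mode where `getElementsByClassName` is not available.

```vba
Option Explicit

' Hypothetical helper: returns the inner text of every <td> whose class
' list contains "summary", in document order.
Function SummaryCellTexts(ByVal sHtml As String) As Collection
    Dim oDoc As Object
    Dim oCell As Object
    Dim colResult As New Collection

    Set oDoc = CreateObject("htmlfile")
    oDoc.Write sHtml           ' let the HTML parser build the DOM
    oDoc.Close

    For Each oCell In oDoc.getElementsByTagName("td")
        ' Pad with spaces so "summary" matches only as a whole class token
        If InStr(1, " " & oCell.className & " ", " summary ", vbTextCompare) > 0 Then
            colResult.Add oCell.innerText
        End If
    Next oCell

    Set SummaryCellTexts = colResult
End Function
```

With the code above, it could be fed the `sRespText` returned by `XmlHttpRequest`, or a single season's table such as `arrSeason(1)`.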

UPDATE

As an example, the code below retrieves a table with Season, Episode, Title and Air date from IMDb:

    Option Explicit

    Sub ExtractDataIMDB()

        Dim i As Long
        Dim sURL As String
        Dim sRespText As String
        Dim aData
        Dim aMatchSeasons
        Dim aSeason
        Dim sUrlEp As String
        Dim aMatchEpisodes
        Dim aEpisode
        Dim aResult() As String
        Dim aCells

        ReDim aResult(1 To 4, 1 To 1)
        aResult(1, 1) = "Season"
        aResult(2, 1) = "Episode"
        aResult(3, 1) = "Title"
        aResult(4, 1) = "Air date"
        sURL = "http://www.imdb.com/title/tt1486217/episodes" ' Archer
        ' sUrl = "http://www.imdb.com/title/tt0096697/episodes" ' The Simpsons
        ' sUrl = "http://www.imdb.com/title/tt0092345/episodes" ' DuckTales
        XmlHttpRequest "GET", sURL, "", "", "", sRespText
        ParseToArray "(<select id=""bySeason""[\s\S]*?</select>)", sRespText, aData
        ParseToArray "<option[\s\S]*?value=""([\d]*)"">", aData(0)(0), aMatchSeasons
        i = 2
        For Each aSeason In aMatchSeasons
            DoEvents
            sUrlEp = sURL & "?season=" & aSeason(0)
            XmlHttpRequest "GET", sUrlEp, "", "", "", sRespText
            ParseToArray "itemprop=""episodes""[\s\S]*?itemprop=""episodeNumber"" content=""(.*?)""[\s\S]*?<div class=""airdate"">[\r\n\s]*([\s\S]*?)[\r\n\s]*</div>[\s\S]*?itemprop=""name""[\s\S]*?>([\s\S]*?)</a>", sRespText, aMatchEpisodes
            For Each aEpisode In aMatchEpisodes
                ReDim Preserve aResult(1 To 4, 1 To i)
                aResult(1, i) = aSeason(0)
                aResult(2, i) = aEpisode(0)
                aResult(3, i) = aEpisode(2)
                aResult(4, i) = aEpisode(1)
                i = i + 1
            Next
        Next
        aCells = WorksheetFunction.Transpose(aResult)
        Cells.Delete
        Output Cells(1, 1), aCells
        MsgBox "Completed"

    End Sub

    Sub XmlHttpRequest(sMethod, sURL, aSetHeaders, sFormData, sRespHeaders, sRespText)

        Dim aHeader

        With CreateObject("Msxml2.ServerXMLHTTP.3.0")
            .SetOption 2, 13056 ' SXH_SERVER_CERT_IGNORE_ALL_SERVER_ERRORS
            .Open sMethod, sURL, False
            If IsArray(aSetHeaders) Then
                For Each aHeader In aSetHeaders
                    .SetRequestHeader aHeader(0), aHeader(1)
                Next
            End If
            .Send sFormData
            sRespHeaders = .GetAllResponseHeaders
            sRespText = .ResponseText
        End With

    End Sub

    Sub ParseToArray(sPattern, sResponse, aMatches)

        Dim oMatch, aSubMatches, sSubMatch

        aMatches = Array()
        With CreateObject("VBScript.RegExp")
            .Global = True
            .MultiLine = True
            .Pattern = sPattern
            For Each oMatch In .Execute(sResponse)
                aSubMatches = Array()
                For Each sSubMatch In oMatch.SubMatches
                    PushItem aSubMatches, sSubMatch
                Next
                PushItem aMatches, aSubMatches
            Next
        End With

    End Sub

    Sub PushItem(aArray, vElement)

        ReDim Preserve aArray(UBound(aArray) + 1)
        aArray(UBound(aArray)) = vElement

    End Sub

    Sub Output(oDstRng As Range, aCells As Variant)

        With oDstRng
            .Parent.Select
            With .Resize( _
                    UBound(aCells, 1) - LBound(aCells, 1) + 1, _
                    UBound(aCells, 2) - LBound(aCells, 2) + 1 _
                )
                .NumberFormat = "@"
                .Value = aCells
                .Columns.AutoFit
            End With
        End With

    End Sub
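A note on the indices: the entries of `aEpisode` follow the order of the capture groups in the pattern, not the order of the output columns, which is why the title is read from `aEpisode(2)` and the air date from `aEpisode(1)`. The sketch below shows the mapping on a made-up pattern and sample string (not IMDb's real markup); `GroupsOfFirstMatch` and `DemoCaptureGroupOrder` are hypothetical names.

```vba
Option Explicit

' Hypothetical helper: returns the capture groups of the first match as a
' zero-based array, or an empty array when nothing matches.
Function GroupsOfFirstMatch(ByVal sPattern As String, ByVal sText As String) As Variant
    Dim oMatches As Object
    Dim aGroups() As Variant
    Dim k As Long

    GroupsOfFirstMatch = Array()
    With CreateObject("VBScript.RegExp")
        .Global = False
        .Pattern = sPattern
        Set oMatches = .Execute(sText)
    End With
    If oMatches.Count = 0 Then Exit Function
    If oMatches(0).SubMatches.Count = 0 Then Exit Function

    ReDim aGroups(0 To oMatches(0).SubMatches.Count - 1)
    For k = 0 To UBound(aGroups)
        aGroups(k) = oMatches(0).SubMatches(k)   ' group k+1 of the pattern
    Next k
    GroupsOfFirstMatch = aGroups
End Function

Sub DemoCaptureGroupOrder()
    Dim aEpisode As Variant

    ' Made-up markup: number first, then date, then title, as in the pattern above
    aEpisode = GroupsOfFirstMatch( _
        "n=""(.*?)""[\s\S]*?<d>([\s\S]*?)</d>[\s\S]*?<a>([\s\S]*?)</a>", _
        "<ep n=""3""><d>1 Jan. 2000</d><a>Sample title</a></ep>")

    Debug.Print "Episode:  " & aEpisode(0)
    Debug.Print "Air date: " & aEpisode(1)
    Debug.Print "Title:    " & aEpisode(2)
End Sub
```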