Excelmacros从网站绘制线程注释到单元格

我试图在Excel电子表格中存储Reddit线程注释,但是我试图找出如何执行此操作时遇到了麻烦。 我没有太多的使用macros从网页获取数据的经验,所以我一直在发现很难弄清楚如何从指定的Reddit线程抽取每个注释并将其放置在一个单元格中,以及它是否是可能做到。

这是我迄今为止:

Sub getRedditData() Dim x As Long, y As Long Dim htm As Object Set htm = CreateObject("htmlFile") With CreateObject("msxml2.xmlhttp") .Open "GET", "https://www.reddit.com/r/AskReddit/comments/4p7qsx/what_are_the_most_common_modern_day_scams/", False .send htm.body.innerhtml = .responsetext End With With htm.getelementbyid("comments") Set cellrangex = .Rows(x).Cells.Length - 1 Set cellrangey = .Rows(x).Cells.Length - 1 Set cellrange1 = Sheets(1).Cells(x + 1, y + 1).Value Set cellrange2 = .Rows(x).Cells(y).innertext For x = 0 To cellrangex For y = 0 To cellrangey cellrange = cellrange2 Next y Next x End With End Sub 

你真的需要分析你正在用一个体面的HTML编辑器刮的网页的内容。 我build议浏览到铬的问题页面,并使用F12打开它的开发工具。 在“元素”选项卡中,可以快速查看哪个HTML正在生成页面的哪一部分(同时打开页面和开发人员工具旁边的对象)。

您会注意到,每个注释的文本都在<p>标签内,每个<p>标签位于<div> 。 我们正在寻找模式,所以这是一个好的开始。

您还会注意到,每个<div>标签都有一个md class 。 所以…让我们把所有的页面<div>标签加载到一个对象中,然后寻找那些包含"md"className

Sub getRedditData()

 Dim x As Long, y As Long Dim htm As Object Set htm = CreateObject("htmlFile") With CreateObject("msxml2.xmlhttp") .Open "GET", "https://www.reddit.com/r/AskReddit/comments/4p7qsx/what_are_the_most_common_modern_day_scams/", False .send htm.body.innerhtml = .responsetext End With Set Divelements = htm.getElementsByTagName("div") For Each DivElement In Divelements If InStr(1, DivElement.ClassName, "md") Then 'print contents to the Immediate window for debugging View>>Immediate Window to insure it's up in your VBE Debug.Print DivElement.InnerText End If Next 

结束小组

这样你就可以看到立即窗口中的所有注释(转到查看>>立即窗口),所以你可以看到这个debugging输出。


在跳过节点之后,看起来可以导航一些元素,然后返回树中以获取用户名:

 Sub getRedditData() Dim x As Long, y As Long Dim htm As Object Set htm = CreateObject("htmlFile") With CreateObject("msxml2.xmlhttp") .Open "GET", "https://www.reddit.com/r/AskReddit/comments/4p7qsx/what_are_the_most_common_modern_day_scams/", False .send htm.body.innerhtml = .responsetext End With Set Divelements = htm.getElementsByTagName("div") On Error Resume Next For Each divElement In Divelements If InStr(1, divElement.className, "md") And Not InStr(1, divElement.className, "md-container") Then Set commentEntry = divElement.ParentNode.ParentNode.ParentNode 'Print the name and the comment Debug.Print commentEntry.FirstChild.FirstChild.NextSibling.InnerText & ":", divElement.InnerText End If Next End Sub 

要打印出来,只需指向一个单元而不是debug.print立即窗口。 就像是:

 Sub getRedditData() Dim x As Long, y As Long Dim htm As Object Dim ws As Worksheet, wsCell As Integer 'set the worksheet to print to and the first row to start printing. Set ws = Sheets("Sheet1") wsCell = 1 Set htm = CreateObject("htmlFile") With CreateObject("msxml2.xmlhttp") .Open "GET", "https://www.reddit.com/r/AskReddit/comments/4p7qsx/what_are_the_most_common_modern_day_scams/", False .send htm.body.innerhtml = .responsetext End With Set Divelements = htm.getElementsByTagName("div") On Error Resume Next For Each divElement In Divelements If InStr(1, divElement.className, "md") And Not InStr(1, divElement.className, "md-container") Then Set commentEntry = divElement.ParentNode.ParentNode.ParentNode 'Print the name and the comment to ws sheet columns 1 and 2 ws.Cells(wsCell, 1).Value = commentEntry.FirstChild.FirstChild.NextSibling.InnerText ws.Cells(wsCell, 2).Value = divElement.InnerText 'iterate to the next row wsCell = wsCell + 1 End If Next End Sub