如何parsingHTML,而无需在vba中创buildInternet Explorer的对象?

我没有在任何工作的计算机上的Internet Explorer,因此创buildInternet Explorer的对象,并使用ie.navigateparsingHTML和search标签是不可能的。 我的问题是, 我怎样才能使用标签自动从框架源到我的电子表格不使用IE浏览器的某些数据? 答案中的代码示例将非常有用:)谢谢

您可以使用XMLHTTP来检索网页的HTML源代码:

Function GetHTML(url As String) As String With CreateObject("MSXML2.XMLHTTP") .Open "GET", url, False .Send GetHTML = .ResponseText End With End Function 

我不会build议将其用作工作表函数,否则每次重新计算工作表时都会重新查询站点URL。 一些网站有逻辑检测通过频繁,重复的呼叫,并且您的IP可能被禁止 ,暂时或永久取决于站点的抓取。

一旦你有源HTMLstring(最好存储在一个variables,以避免不必要的重复调用),您可以使用基本的文本函数parsingstring来search您的标记。

这个基本的函数将返回<tag></tag>之间的值:

 Public Function getTag(url As String, tag As String, Optional occurNum As Integer) As String Dim html As String, pStart As Long, pEnd As Long, o As Integer html = GetHTML(url) 'remove <> if they exist so we can add our own If Left(tag, 1) = "<" And Right(tag, 1) = ">" Then tag = Left(Right(tag, Len(tag) - 1), Len(Right(tag, Len(tag) - 1)) - 1) End If ' default to Occurrence #1 If occurNum = 0 Then occurNum = 1 pEnd = 1 For o = 1 To occurNum ' find start <tag> beginning at 1 (or after previous Occurence) pStart = InStr(pEnd, html, "<" & tag & ">", vbTextCompare) If pStart = 0 Then getTag = "{Not Found}" Exit Function End If pStart = pStart + Len("<" & tag & ">") ' find first end </tag> after start <tag> pEnd = InStr(pStart, html, "</" & tag & ">", vbTextCompare) Next o 'return string between start <tag> & end </tag> getTag = Mid(html, pStart, pEnd - pStart) End Function 

这只会find基本的<tag> ,但您可以添加/删除/更改文本function,以满足您的需求。

用法示例:

 Sub findTagExample() Const testURL = "https://en.wikipedia.org/wiki/Web_scraping" 'search for 2nd occurence of tag: <h2> which is "Contents" : Debug.Print getTag(testURL, "<h2>", 2) '...this returns the 8th occurence, "Navigation Menu" : Debug.Print getTag(testURL, "<h2>", 8) '...and this returns an HTML <span> containing a title for the 'Legal Issues' section: Debug.Print getTag("https://en.wikipedia.org/wiki/Web_scraping", "<h2>", 4) End Sub 

任何做过networking抓取的人都会熟悉创buildInternet Explorer(IE)实例并导航到Web地址,然后在页面准备就绪后,使用“Microsoft HTML Object Library”(MSHTML)types开始导航DOM图书馆。 问题是否IE不可用做什么。 我在运行Windows 10的情况下处于相同的情况。

我曾怀疑有可能创build一个独立于IE的MSHTML.HTMLDocument实例,但其创build并不明显。 感谢提问者提出这个问题。 答案在于MSHTML.IHTMLDocument4.createDocumentFromUrl方法。 一个需要一个本地文件工作(编辑:实际上可以把一个webbyurl以及!),但我们有一个很好的Windows API函数称为URLDownloadToFile下载文件。

此代码在运行Microsoft Edge的Windows 10框上运行,而不是Internet Explorer。 这是一个重要的发现,并感谢提问者。

 Option Explicit '* Tools->Refernces Microsoft HTML Object Library '* MSDN - URLDownloadToFile function - https://msdn.microsoft.com/en-us/library/ms775123(v=vs.85).aspx Private Declare PtrSafe Function URLDownloadToFile Lib "urlmon" Alias "URLDownloadToFileA" _ (ByVal pCaller As Long, ByVal szURL As String, ByVal szFileName As String, _ ByVal dwReserved As Long, ByVal lpfnCB As Long) As Long Sub Test() Dim fso As Object Set fso = CreateObject("Scripting.FileSystemObject") Dim sLocalFilename As String sLocalFilename = Environ$("TMP") & "\urlmon.html" Dim sURL As String sURL = "https://stackoverflow.com/users/3607273/s-meaden" Dim bOk As Boolean bOk = (URLDownloadToFile(0, sURL, sLocalFilename, 0, 0) = 0) If bOk Then If fso.FileExists(sLocalFilename) Then '* Tools->Refernces Microsoft HTML Object Library Dim oHtml4 As MSHTML.IHTMLDocument4 Set oHtml4 = New MSHTML.HTMLDocument Dim oHtml As MSHTML.HTMLDocument Set oHtml = Nothing '* IHTMLDocument4.createDocumentFromUrl '* MSDN - IHTMLDocument4 createDocumentFromUrl method - https://msdn.microsoft.com/en-us/library/aa752523(v=vs.85).aspx Set oHtml = oHtml4.createDocumentFromUrl(sLocalFilename, "") '* need to wait a little whilst the document parses '* because it is multithreaded While oHtml.readyState <> "complete" DoEvents '* do not comment this out it is required to break into the code if in infinite loop Wend Debug.Assert oHtml.readyState = "complete" Dim sTest As String sTest = Left$(oHtml.body.outerHTML, 100) Debug.Assert Len(Trim(sTest)) > 50 '* just testing we got a substantial block of text, feel free to delete '* page specific logic goes here Dim htmlAnswers As Object 'MSHTML.DispHTMLElementCollection Set htmlAnswers = oHtml.getElementsByClassName("answer-hyperlink") Dim lAnswerLoop As Long For lAnswerLoop = 0 To htmlAnswers.Length - 1 Dim vAnswerLoop Set vAnswerLoop = htmlAnswers.Item(lAnswerLoop) Debug.Print vAnswerLoop.outerText Next End If End If End Sub 

感谢您提出这个问题。

PS我已经使用TaskList来validation这个代码运行时不会在引擎盖下创buildIExplore.exe。

PPS如果你喜欢这个,那么在我的Excel开发平台博客上看到更多