3

I am trying to pull data from some 500 urls of a website. All the pages are same in structure. I am facing a problem with understanding the HTML of this particular site

https://www.coworker.com/s-f/6033/united-states_hawaii_honolulu_impact-hub-honolulu 

I want to extract Name, Address, Tel and website. My current code:

Sub GetData() Dim wsSheet As Worksheet, Rows As Long, links As Variant, IE As Object, link As Variant Set wb = ThisWorkbook Set wsSheet = wb.Sheets("Sheet1") Set IE = New InternetExplorer Rows = wsSheet.Cells(wsSheet.Rows.Count, "A").End(xlUp).Row links = wsSheet.Range("A1:A" & Rows) With IE .Visible = True For Each link In links .navigate (link) While .Busy Or .readyState <> 4: DoEvents: Wend Next End With End Sub 
2
  • Could you please show exactly what info you are after? Which Name, Address, Tel and Website? Also, you are using a loop... are there more pages to scrape? Can you show another one as well please? Commented Aug 1, 2018 at 6:55
  • @QHarr1 , hi thanks for quick response. I am after ( Impact Hub Honolulu, 1050 Queen Street #100, Honolulu, United States, +8086643306 and impacthubhnl.com/?ref=coworker) these 4 bits of info. Yes, there are similar more urls and all have same html. coworker.com/s-f/7254/… And i want this loop to continue after getting info from 1st one Commented Aug 1, 2018 at 7:01

1 Answer 1

2

Here you go. Without more links to test with this is very fragile. It relies heavily on consistent styling across pages.


XHR Looping link list:

Option Explicit Public Sub GetInfo() Dim wsSheet As Worksheet, Rows As Long, links(), link As Long, wb As Workbook, html As HTMLDocument Set wb = ThisWorkbook: Set wsSheet = wb.Sheets("Sheet1") Application.ScreenUpdating = False With wsSheet Rows = .Cells(.Rows.Count, "A").End(xlUp).Row If Rows = 1 Then ReDim links(1 To 1, 1 To 1) links(1, 1) = wsSheet.Range("A1") Else links = wsSheet.Range("A1:A" & Rows).Value End If Dim r As Long For link = LBound(links, 1) To UBound(links, 1) r = r + 1 Set html = GetHTML(links(link, 1)) On Error Resume Next Dim aNodeList As Object: Set aNodeList = html.querySelectorAll(".col-xs-12.pade_none.muchroom_mail") .Cells(r, 2) = "Name: " & html.querySelector(".col-sm-9.col-md-9.col-xs-12.pade_none.kohub_space_headings h2").innerText .Cells(r, 3) = "Address: " & aNodeList.item(0).innerText .Cells(r, 4) = "Tel: " & aNodeList.item(1).innerText .Cells(r, 5) = "Website: " & html.querySelector(".website-link-text a[href]").getAttribute("href") On Error GoTo 0 Next link End With Application.ScreenUpdating = True End Sub Public Function GetHTML(ByVal url As String) As HTMLDocument Dim sResponse As String, html As New HTMLDocument With CreateObject("MSXML2.XMLHTTP") .Open "GET", url, False .send sResponse = StrConv(.responseBody, vbUnicode) End With sResponse = Mid$(sResponse, InStr(1, sResponse, "<!DOCTYPE ")) With html .body.innerHTML = sResponse End With Set GetHTML = html End Function 

Output:

output


References (VBE>Tools>References):

  1. HTML object Library

Internet Explorer:

Option Explicit Public Sub GetInfo() Dim wsSheet As Worksheet, Rows As Long, links(), link As Long, wb As Workbook, html As HTMLDocument, ie As InternetExplorer Set wb = ThisWorkbook: Set wsSheet = wb.Sheets("Sheet1") Application.ScreenUpdating = False With wsSheet Rows = .Cells(.Rows.Count, "A").End(xlUp).Row If Rows = 1 Then ReDim links(1, 1) links(1, 1) = wsSheet.Range("A1") Else links = wsSheet.Range("A1:A" & Rows).Value End If Dim r As Long Set ie = New InternetExplorer ie.Visible = True For link = LBound(links, 1) To UBound(links, 1) ie.navigate links(link, 1) While ie.Busy Or ie.readyState < 4: DoEvents: Wend ' Application.Wait Now + TimeSerial(0, 0, 10) On Error Resume Next r = r + 1: Set html = ie.document .Cells(r, 2) = "Name: " & html.querySelector(".col-sm-9.col-md-9.col-xs-12.pade_none.kohub_space_headings h2").innerText .Cells(r, 3) = "Address: " & html.querySelector(".col-xs-12.pade_none.muchroom_mail").innerText .Cells(r, 4) = "Tel: " & html.querySelector(".fa.fa-phone.fa-rotate-270 ~ a").innerText .Cells(r, 5) = "Website: " & html.querySelector(".website-link-text a[href]").getAttribute("href") On Error GoTo 0 Next link ie.Quit End With Application.ScreenUpdating = True End Sub 

References (VBE>Tools>References):

  1. HTML object Library
  2. Microsoft Internet Controls

Sign up to request clarification or add additional context in comments.

7 Comments

Bro, this is giving me out of Memory Error here sResponse = StrConv(.responseBody, vbUnicode)
That shouldn't be down to the script above. Have you been opening lots of IE instances and not quitting them?
I updated with an Internet Explorer browser based version.
Don't know if i am doing something wrong but none are working . Tried both codes. I am using office 365 64 bit with windows 10 64 bit
|

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.