1
\$\begingroup\$

I have built an Excel application, which gets all the titles of books from Amazon.com, which it is asked to do and scrapes the following data out of them:

  • Book Title
  • Author
  • Price

What do you need to run the app?

3 worksheets in Excel, named as in the picture:

enter image description here

Then in tblInput, make sure to write some keywords in range A2:A4 and execute the Main function. No dependencies or additional Excel libraries to be added.

The code is also in GitHub here. I wrote a blog post about it - http://www.vitoshacademy.com/vba-data-scraping-from-internet-with-excel-part-2/

AmazonInternet

Public Function PageWithResultsExists(appIE As Object, keyword As String) As Boolean On Error GoTo PageWithResultsExists_Error Dim allData As Object Set allData = appIE.document.getElementById("s-results-list-atf") PageWithResultsExists = True IeErrors = 0 On Error GoTo 0 Exit Function PageWithResultsExists_Error: WaitSomeMilliseconds IeErrors = IeErrors + 1 Select Case Err.Number Case 424 If IeErrors > MAX_IE_ERRORS Then PageWithResultsExists = False IeErrors = 0 Else LogMe "PageWithResultsExists", IeErrors, keyword, IeErrors PageWithResultsExists appIE, keyword End If Case Else Err.Raise Err.Number, Err.Source, Err.Description, Err.HelpFile, Err.HelpContext End Select End Function Public Function MakeUrl(i As Long, keyword As String) As String MakeUrl = "https://www.amazon.com/s/ref=sr_pg_" & i & "?rh=i%3Aaps%2Ck%3A" & keyword & "&page=" & i & "&keywords=" & keyword End Function Public Sub Navigate(i As Long, appIE As Object, keyword As String) Do While appIE.Busy DoEvents Loop With appIE .Navigate MakeUrl(i, keyword) .Visible = False End With Do While appIE.Busy DoEvents Loop End Sub 

ConstValues

Public IeErrors As Long Public Const MAX_IE_ERRORS = 10 Public Const IN_PRODUCTION = False 

ExcelRelated

Public Function GetNextKeyWord() As String With tblInput Dim lastRowB As Long lastRowB = lastRow(.Name, 2) + 1 GetNextKeyWord = Trim(.Cells(lastRowB, 1)) If Len(GetNextKeyWord) <> 0 Then .Cells(lastRowB, 2) = Now End With End Function Public Sub WriteFormulas() Dim i As Long With tblInput For i = lastRow(.Name) To 2 Step -1 .Cells(i, 3).FormulaR1C1 = "=COUNTIF(Summary!C[1],Input!RC[-2])" .Cells(i, 4).FormulaArray = "=MAX(IF(Summary!C=RC[-3],Summary!C[-1]))" FormatUSD .Cells(i, 4) .Cells(i, 5).FormulaArray = "=AVERAGE(IF(Summary!C[-1]=Input!RC[-4],Summary!C[-2]))" FormatUSD .Cells(i, 5) Next i End With End Sub Public Sub FixWorksheets() OnStart With tblInput .Range("B1") = "Start Time" .Range("C1") = "Count" .Range("D1") = "Max" .Range("E1") = "Average" End With With tblSummary .Range("A1") = "Title" .Range("B1") = "Author" .Range("C1") = "Price" .Range("D1") = "Keyword" End With Dim ws As Worksheet For Each ws In Worksheets ws.Columns.AutoFit Next ws OnEnd End Sub Public Sub FormatUSD(myRange As Range) myRange.NumberFormat = "_-[$$-409]* #,##0.00_ ;_-[$$-409]* -#,##0.00 ;_-[$$-409]* ""-""??_ ;_-@_ " End Sub Public Sub CleanWorksheets() tblRawData.Cells.Delete tblSummary.Cells.Delete tblInput.Columns("B:F").Delete End Sub Public Function GetNthString(n As Long, myRange As Range) As String Dim i As Long Dim myVar As Variant myVar = Split(myRange, vbCrLf) For i = LBound(myVar) To UBound(myVar) If Len(myVar(i)) > 0 And n = 0 Then GetNthString = myVar(i) Exit Function ElseIf Len(myVar(i)) > 0 Then n = n - 1 End If Next i End Function Public Function GetPrice(myRange As Range) As String Dim i As Long Dim myVar As Variant myVar = Split(myRange, "$") If UBound(myVar) > 0 Then GetPrice = Mid(myVar(1), 1, InStr(1, myVar(1), " ")) Else GetPrice = "" End If End Function Public Sub WriteToExcel(appIE As Object, keyword As String) If IN_PRODUCTION Then On Error GoTo WriteToExcel_Error Dim allData As Object Set allData = appIE.document.getElementById("s-results-list-atf") Dim book As Object Dim myRow As Long For Each book In allData.getElementsByClassName("a-fixed-left-grid-inner") With tblRawData myRow = lastRow(.Name) + 1 On Error Resume Next .Cells(myRow, 1) = book.innertext .Cells(myRow, 2) = keyword On Error GoTo 0 End With Next IeErrors = 0 On Error GoTo 0 Exit Sub WriteToExcel_Error: IeErrors = IeErrors + 1 If IeErrors > MAX_IE_ERRORS Then Debug.Print "Error " & Err.Number & " (" & Err.Description & ") in procedure WriteToExcel, line " & Erl & "." Else LogMe "WriteToExcel", IeErrors, keyword, IeErrors WriteToExcel appIE, keyword End If End Sub Public Sub RawDataToStructured(keyword As String, firstRow As Long) Dim i As Long For i = firstRow To lastRow(tblRawData.Name) With tblRawData If InStr(1, .Cells(i, 1), "Sponsored ") < 1 Then Dim title As String title = GetNthString(0, .Cells(i, 1)) Dim author As String author = GetNthString(1, .Cells(i, 1)) Dim price As String price = GetPrice(.Cells(i, 1)) If Not IsNumeric(price) Or price = "0" Then price = "" Dim currentRow As String: currentRow = lastRow(tblSummary.Name) + 1 With tblSummary .Cells(currentRow, 1) = title .Cells(currentRow, 2) = author .Cells(currentRow, 3) = price .Cells(currentRow, 4) = keyword End With End If End With Next i End Sub Public Function lastRow(wsName As String, Optional columnToCheck As Long = 1) As Long Dim ws As Worksheet Set ws = Worksheets(wsName) lastRow = ws.Cells(ws.Rows.Count, columnToCheck).End(xlUp).Row End Function 

General

Public Declare PtrSafe Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As LongPtr) Public Sub OnEnd() Application.ScreenUpdating = True Application.EnableEvents = True Application.AskToUpdateLinks = True Application.DisplayAlerts = True Application.Calculation = xlAutomatic ThisWorkbook.Date1904 = False Application.StatusBar = False End Sub Public Sub OnStart() Application.ScreenUpdating = False Application.EnableEvents = False Application.AskToUpdateLinks = False Application.DisplayAlerts = False Application.Calculation = xlAutomatic ThisWorkbook.Date1904 = False ActiveWindow.View = xlNormalView End Sub Public Sub LogMe(ParamArray arg() As Variant) Debug.Print Join(arg, "--") End Sub Public Sub PrintMeUsefulFormula() Dim strFormula As String Dim strParenth As String strParenth = """" strFormula = Selection.FormulaR1C1 strFormula = Replace(strFormula, """", """""") strFormula = strParenth & strFormula & strParenth Debug.Print strFormula End Sub Public Sub WaitSomeMilliseconds(Optional Milliseconds As Long = 1000) Sleep Milliseconds End Sub 

StartUp

Public Sub Main() If IN_PRODUCTION Then On Error GoTo Main_Error CleanWorksheets Dim keyword As String: keyword = GetNextKeyWord While keyword <> "" Dim appIE As Object Set appIE = CreateObject("InternetExplorer.Application") LogMe keyword Dim nextPageExists As Boolean: nextPageExists = True Dim i As Long: i = 1 Dim firstRow As Long: firstRow = lastRow(tblRawData.Name) + 1 While nextPageExists WaitSomeMilliseconds Navigate i, appIE, keyword nextPageExists = PageWithResultsExists(appIE, keyword) If nextPageExists Then WriteToExcel appIE, keyword i = i + 1 Wend LogMe Time, keyword, "RawDataToStructured" RawDataToStructured keyword, firstRow keyword = GetNextKeyWord WaitSomeMilliseconds 4000 appIE.Quit Wend FixWorksheets WriteFormulas LogMe "Program has ended!" On Error GoTo 0 Exit Sub Main_Error: MsgBox "Error " & Err.Number & " (" & Err.Description & ") in procedure Main, line " & Erl & "." End Sub 

In general, I probably could have done it with some OOP/classes, but I have considered not to.

Thanks!

\$\endgroup\$

1 Answer 1

3
\$\begingroup\$
  • OnStart and OnEnd should be called from Main.

  • Although, I am a proponent of using single letter iterates in simple For loops, I think that i should have a more descriptive name like pageIndex.

    Navigate pageIndex, appIE, keyword 
  • Sleep - A common subroutine used across many programming languages. I see no reason to wrap it in WaitSomeMilliseconds(). You could even give it a default value.
    Public Declare PtrSafe Sub Sleep Lib "kernel32" (Optional ByVal dwMilliseconds As LongPtr = 1000) 
  • I don't see any reason to use Sleep in the first place. I would use appIE.readyState <> READYSTATE_COMPLETE instead because the only thing that you are waiting on is the Page to load. There is no lazy loading or controls to be clicked.
    Const READYSTATE_COMPLETE = 4 While appIE.readyState <> READYSTATE_COMPLETE 
  • PageWithResultsExists() - Probably the biggest reason for the slow code. The way the code is structured it runs until a page is called that has no data and checks that page for 10 seconds. The last link in the Paginator class name is pagnDisabled. You can get the last page number by checking its innerText.
  • WriteToExcel() - Why? This just adds an extra layer of complexity and slows down the code. Simply process the data in memory.
  • MakeURL() - I know getters are pretty boring but I would still use getURL(). No big deal though.
  • WriteFormulas() - You should write all the formulas at once after all the data is processed.
  • GetPrice() - There is no distinction made between Paperback, Hardcover or Kindle. I would expand the dataset to include all the categories; so that you are not comparing apples to oranges.

  • Microsoft HTML Object Library - This library is very convenient when working with HTML. Since there is only one version of the library, I would take advantage of early binding and intellisense by setting a reference to it.

Microsoft HTML Object Library

I only use Internet Explorer which I need to process events, I prefer XMLHTTP.

Sample Userform

The Userform should have a single Textbox with multiline set to true. When ran the code parses 20 pages of results asynchronously in under 12 seconds. The code is unrefined. It is just a proof of concept.

Option Explicit Const READYSTATE_COMPLETE = 4 Private Sub UserForm_Initialize() Dim t As Double: t = Timer TextBox1.Text = Join(getBooks("VBA").ToArray, vbNewLine) Debug.Print Round(Timer - t, 2) End Sub Function getDocument(URL As String) As MSHTML.HTMLDocument Dim document As MSHTML.HTMLDocument With CreateObject("MSXML2.XMLHTTP") 'open(bstrMethod As String, bstrUrl As String, [varAsync], [bstrUser], [bstrPassword]) .Open bstrMethod:="GET", bstrUrl:=URL, varAsync:=False .send If .readyState = READYSTATE_COMPLETE And .Status = 200 Then Set document = New MSHTML.HTMLDocument document.body.innerHTML = .responseText Set getDocument = document Else MsgBox "URL: " & vbCrLf & "Ready state: " & .readyState & vbCrLf & "HTTP request status: " & .Status, vbInformation, "URL Not Responding" End If End With End Function Function getBooks(keyword As String) As Object Dim server As Object, servers As Object Dim document As MSHTML.HTMLDocument, documents As Object Set servers = CreateObject("System.Collections.ArrayList") Set documents = CreateObject("System.Collections.ArrayList") Dim URL As String URL = MakeUrl(1, keyword) Set document = getDocument(URL) documents.Add document Dim pageindex As Long For pageindex = 2 To getPageCount(document) URL = MakeUrl(pageindex, keyword) Set server = CreateObject("MSXML2.XMLHTTP") server.Open bstrMethod:="GET", bstrUrl:=URL, varAsync:=True server.send servers.Add server Next For Each server In servers While server.readyState <> READYSTATE_COMPLETE DoEvents Wend If server.Status = 200 Then Set document = New MSHTML.HTMLDocument document.body.innerHTML = server.responseText documents.Add document End If Next Dim books As Object Set books = CreateObject("System.Collections.ArrayList") Dim ul As HTMLUListElement Dim li As HTMLLIElement For Each document In documents Set ul = document.getElementById("s-results-list-atf") If Not ul Is Nothing Then For Each li In ul.getElementsByTagName("LI") books.Add li.innerText Next End If Next Set getBooks = books End Function Function getPageCount(document As HTMLDocument) As Long Dim element As HTMLGenericElement Set element = document.querySelector(".pagnDisabled") If Not element Is Nothing Then getPageCount = CInt(element.innerText) End Function Public Function MakeUrl(i As Long, keyword As String) As String MakeUrl = "https://www.amazon.com/s/ref=sr_pg_" & i & "?rh=i%3Aaps%2Ck%3A" & keyword & "&page=" & i & "&keywords=" & keyword End Function 
\$\endgroup\$
12
  • \$\begingroup\$ Thanks for the feedback! :) The library is a good idea indeed, I was trying to make the code portable through copy and paste, thus I was not using early binding. On your code I get an error here .Open bstrMethod:="GET", bstrUrl:=URL, varAsync:=False - "448 - named arbugment is not found". I call it like this - Debug.Print Join(getBooks("VBA").ToArray, vbNewLine) \$\endgroup\$ Commented Dec 27, 2018 at 10:03
  • \$\begingroup\$ @Vityata I got that error using a different version of the library. CreateObject("MSXML2.XMLHTTP.6.0") gave me the error but CreateObject("MSXML2.XMLHTTP") worked. download amazon-scraper.xlsb \$\endgroup\$ Commented Dec 27, 2018 at 11:01
  • \$\begingroup\$ I have just tried the amazon-scraper.xlsb and still got the same error on the same place. Currently with Excel 2010, 64 bits. \$\endgroup\$ Commented Dec 27, 2018 at 11:32
  • 1
    \$\begingroup\$ @Vityata very strange. Try removing the parameter names. \$\endgroup\$ Commented Dec 27, 2018 at 11:34
  • \$\begingroup\$ I am impressed by your remote-debugging skills! It worked! :) \$\endgroup\$ Commented Dec 27, 2018 at 11:37

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.