Tuesday, May 3, 2022

Scrape Web Data Excel VBA Macro (Part 2)

In the previous post we have seen how to send an HTTP request to get web data as text (Part 1). In this post we see how to process the response into the body of a html document to target the data of interest (Part 2). We need to add the Microsoft HTML Object Library to the references in order to manipulate the response as a html document. Then we need to target and get the elements with the information we are after. In this particular example, that’s the job title, company, location, etc. Despite the macro is entirely written with Excel VBA, basic HTML/web programming skills are required to understand the whole process. 

Macro/VBA code:

 
  Sub ProcessHTTPresponse(response)
 
      'Add Microsoft HTML Object Library
      Dim html As HTMLDocument, r As Integer
      Set html = CreateObject("htmlfile")
      html.body.innerHTML = response
 
      For Each divElement In html.getElementsByClassName("job_seen_beacon")
          r = r + 1
          Set divCollection = divElement.all
          
          For Each elem In divCollection
              If InStr(elem.className, "jobTitle") > 0 Then
                  If elem.Children.Length > 1 Then
                      Range("A" & r + 1).Value = elem.Children(1).innerText
                  Else
                      Range("A" & r + 1).Value = elem.Children(0).innerText
                  End If
              End If
              If elem.className = "companyName" Then _
              Range("B" & r + 1).Value = elem.innerText
              If elem.className = "companyLocation" Then _
              Range("C" & r + 1).Value = elem.innerText
          Next elem
      Next divElement
 
  End Sub
 

 

Macro explained:

  • First of all, we need to add the Microsoft HTML Object Library to the references in the VBA editor under Tools->References. This library allows to create a HTML file and work with the DOM. Note that this macro, and the macro in the previous post work together, so the Microsoft XML library should also be added (check the previous post). For that reason, the procedure has the parameter “response”, which brings the HTTP request response as text from the previous macro. You can also put the two macros together in one procedure.
  • We need to declare a variable (html) to store the HTML file document, and later create and assign an HTML file to it (late binding). We declare a row index variable, and may declare other variables as Variant (not included, Excel assigns Variant type by default).
  • Then we define the innerHTML of the body of the HTML document as the HTTP response text. Now, we can navigate the HTML body within the html variable to target and get elements of interest. There are basically three ways to get HTML elements:
    • getElementByID
    • getElementsByClassName
    • getElementsByTagName
  • We need to inspect the web page inner HTML to decide which option and which elements we need to target. In most browsers, that’s achieved selecting any element and right-clicking to select inspect document. In this example, we could get elements containing the job info that have a particular class, or also get it by tag (table and tr), as the info happens to be inside a table (or two tables to be precise), for each job advert. In the code above, we are using the class name though.
  • First, we have a look through elements with that particular class name (job_seen_beacon). Then we put those elements into a collection and start a second loop.
  • The second loop goes through each element in that collection and checks if it’s the element we are looking for by looking at the class name. The first item is the job title. In this case, it happens to have an additional tag for new jobs. In order to avoid that text, we check whether the job is new or not. The next elements of interest are the company name and location, which can easily be target looking at the class names including those terms.
  • We could get any other element using the same method. For that, we need to inspect the web page as explained earlier.


This is how we process the HTML response to scrape web data in Excel with VBA macros.


Other examples:

 

1 comment:

Popular Posts