
I am trying to scrape https://data.bls.gov/cgi-bin/surveymost?bls and have figured out how to crawl through the necessary clicks to reach a table.

The selection I am practicing on: check the checkbox associated with "Employment Cost Index (ECI) Civilian (Unadjusted) - CIU1010000000000A" under Compensation, then select "Retrieve data".

Once those two steps are processed, a table appears. This is the table I am trying to scrape.

Below is the code that I have as of right now.

Note that you have to put your own path for your browser driver where I have put <browser driver>.

from bs4 import BeautifulSoup
from urllib.request import urlopen
import pandas as pd
import numpy as np
import requests
import lxml.html as lh
from selenium import webdriver

url = "https://data.bls.gov/cgi-bin/surveymost?bls"
ChromeSource = r"<browser driver>"

# Open up a Chrome browser and navigate to the web page.
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
options.add_argument('--headless')  # will run without opening a browser window.
driver = webdriver.Chrome(ChromeSource, chrome_options=options)
driver.get(url)

driver.find_element_by_xpath("//input[@type='checkbox' and @value = 'CIU1010000000000A']").click()
driver.find_element_by_xpath("//input[@type='Submit' and @value = 'Retrieve data']").click()

i = 2

def myTEST(i):
    xpath = '//*[@id="col' + str(i) + '"]'
    TEST = driver.find_elements_by_xpath(xpath)
    num_page_items = len(TEST)
    for i in range(num_page_items):
        print(TEST[i].text)

myTEST(i)

# Clean up (close the browser once the task is completed).
driver.close()

Right now this only looks at the headers. I would like to get the table content as well.

With i = 0 it produces "Year", and with i = 1 it produces "Period". But with i = 2 I get two elements, because "Estimated Value" and "Standard Error" share the same col2 id.

I tried to think of a way to work around this and can't seem to get anything that I have researched to work.

In essence, it would be better to start at the point where I am done clicking and am at the table of interest, then look at the xpath of the header row and pull in the text for all of its child <th>'s.

<tr> == $0
  <th id="col0"> Year </th>
  <th id="col1"> Period </th>
  <th id="col2">Estimated Value</th>
  <th id="col2">Standard Error</th>
</tr>

I am not sure how to do that. I also tried to loop over i, but two headers sharing the same id obviously causes an issue.
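For illustration, the duplicate-id problem goes away if the header cells are selected by tag rather than by id (in Selenium that would be an XPath like //table//th). A stdlib-only sketch against the header row pasted above — the class name and inline snippet are just for demonstration:

```python
from html.parser import HTMLParser

# The header row from the BLS results table; note that both
# "Estimated Value" and "Standard Error" share id="col2".
HEADER_ROW = """
<tr>
  <th id="col0"> Year </th>
  <th id="col1"> Period </th>
  <th id="col2">Estimated Value</th>
  <th id="col2">Standard Error</th>
</tr>
"""

class HeaderCollector(HTMLParser):
    """Collect the text of every <th>, ignoring the (duplicated) ids."""
    def __init__(self):
        super().__init__()
        self.in_th = False
        self.headers = []

    def handle_starttag(self, tag, attrs):
        if tag == "th":
            self.in_th = True

    def handle_endtag(self, tag):
        if tag == "th":
            self.in_th = False

    def handle_data(self, data):
        # Only keep text that sits inside a <th> cell.
        if self.in_th and data.strip():
            self.headers.append(data.strip())

parser = HeaderCollector()
parser.feed(HEADER_ROW)
print(parser.headers)  # ['Year', 'Period', 'Estimated Value', 'Standard Error']
```

Selecting by tag yields all four headers in order, duplicated ids and all; the same idea applies whether the cells come from Selenium, bs4, or a raw parser.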

Once I am able to get the header, I want to get the contents. I could use some insight on whether I am on the right path, overthinking it, or whether there is a simpler way to do all of this. I am learning, and this is my first attempt using the selenium library for clicks. I just want to get it working so I can try it again on a different table and make it as automated and reusable (with tweaking) as possible.


1 Answer


Actually, you don't need selenium. You can just track the POST form data and send the same fields in your own POST request.

Then you can load the table using Pandas easily.

import requests
import pandas as pd

data = {
    "series_id": "CIU1010000000000A",
    "survey": "bls"
}

def main(url):
    r = requests.post(url, data=data)
    df = pd.read_html(r.content)[1]
    print(df)

main("https://data.bls.gov/cgi-bin/surveymost")

Explanation:

  • Open the site.
  • Select Employment Cost Index (ECI) Civilian (Unadjusted) - CIU1010000000000A.
  • Open your browser's Developer Tools and navigate to the Network Monitor section, e.g. press Ctrl + Shift + E (Command + Option + E on a Mac).
  • Now you will find that a POST request has been made.


  • Navigate to the Params tab.


  • Now you can replicate the POST request. Since the table is present in the HTML source and is not loaded via JavaScript, you can parse it with bs4 or read it in a nice format using pandas.read_html().
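To see what the browser sends (and what requests.post reproduces), here is a small stdlib-only sketch of the form body; the field names are the same two shown in the Params tab, and no network access is needed:

```python
from urllib.parse import urlencode

# The two fields visible in the Params tab for this request; the same
# payload the answer's code passes to requests.post(url, data=data).
data = {"series_id": "CIU1010000000000A", "survey": "bls"}

# requests.post(url, data=data) sends this urlencoded string as the
# request body, with Content-Type: application/x-www-form-urlencoded.
body = urlencode(data)
print(body)  # series_id=CIU1010000000000A&survey=bls
```

This is all the "form tracking" amounts to: copy the key/value pairs from DevTools into a dict and let requests do the encoding.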

Note: You can read the table this way as long as it's not loaded via JavaScript. Otherwise, you can try to track the XHR request (Check previous answer), or use selenium or requests_html to render the JS, since requests is an HTTP library and can't render it for you.
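The bs4 route mentioned above can be sketched against a miniature stand-in for the response HTML (the real page holds several tables, which is why the answer indexes [1]; the table contents here are invented for demonstration, assuming bs4 is installed):

```python
from bs4 import BeautifulSoup

# Miniature stand-in for the BLS response: a navigation table followed
# by the data table of interest (note the duplicated col2 ids).
html = """
<table><tr><td>menu</td></tr></table>
<table>
  <tr><th id="col0">Year</th><th id="col1">Period</th>
      <th id="col2">Estimated Value</th><th id="col2">Standard Error</th></tr>
  <tr><td>2020</td><td>Qtr1</td><td>140.2</td><td>0.3</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find_all("table")[1]  # second table, as in the answer

# Headers: every <th> by tag, so duplicate ids don't matter.
headers = [th.get_text(strip=True) for th in table.find_all("th")]

# Body: one list per <tr> that actually contains <td> cells.
rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find_all("tr") if tr.find("td")]

print(headers)  # ['Year', 'Period', 'Estimated Value', 'Standard Error']
print(rows)     # [['2020', 'Qtr1', '140.2', '0.3']]
```

pandas.read_html() does essentially this for every table on the page in one call; drop down to bs4 when you need finer control over which cells you keep.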


6 Comments

Wow!!! Not only did that work, but it is so dynamic that it works with the other tables as well! I was definitely overthinking it but then again, I don't fully understand how this works. I will need to digest it some more to figure that part out but thank you so much!
@AndrewHicks You're welcome. Let me know if you find anything unclear so I can explain.
Yeah...do you know of any "literature" that might cover the concept of what you did here? My background is in analytics (python, r and sql) and not html. Is the information in the data = {} different for every website? Can I use this on, let's say, yahoo finance or any other page that has a table? I would assume that some tweaks would be needed (aside from the url and the data variables, like the series_id and survey you put in). Thanks again. I definitely want to learn this.
@AndrewHicks well let me explain that within the answer. hold on
So a follow-up. First off, it looks like this method is not appropriate for all websites, which is alright for now. Second, I want to tweak the code so that it doesn't just go from 2010-2020, but rather 1939-2020. On the website this requires selecting 1939 from the dropdown at the top and then selecting "Go". I tried going about it your way, but it only errors out. Any ideas? data = "from_year" : "1939", & "to_year" : "2020". Also, I take it that the Params tab exists in Firefox. Do you know if there is something similar in Chrome?
