I am trying to scrape https://data.bls.gov/cgi-bin/surveymost?bls and have figured out how to click through the page with Selenium to reach a table.
The selection I am practicing on is what you get after ticking the checkbox for "Employment Cost Index (ECI) Civilian (Unadjusted) - CIU1010000000000A" under Compensation and then clicking "Retrieve data".
Once those two clicks are processed, a table appears. This is the table I am trying to scrape.
Below is the code that I have as of right now.
Note that you have to put the path to your own browser driver where I have written <browser driver>.

    from bs4 import BeautifulSoup
    from urllib.request import urlopen
    import pandas as pd
    import numpy as np
    import requests
    import lxml.html as lh
    from selenium import webdriver

    url = "https://data.bls.gov/cgi-bin/surveymost?bls"
    ChromeSource = r"<browser driver>"

    # Open up a Chrome browser and navigate to web page.
    options = webdriver.ChromeOptions()
    options.add_argument('--ignore-certificate-errors')
    options.add_argument('--incognito')
    options.add_argument('--headless')  # will run without opening browser.
    driver = webdriver.Chrome(ChromeSource, chrome_options=options)
    driver.get(url)

    driver.find_element_by_xpath("//input[@type='checkbox' and @value = 'CIU1010000000000A']").click()
    driver.find_element_by_xpath("//input[@type='Submit' and @value = 'Retrieve data']").click()

    i = 2

    def myTEST(i):
        xpath = '//*[@id="col' + str(i) + '"]'
        TEST = driver.find_elements_by_xpath(xpath)
        num_page_items = len(TEST)
        for i in range(num_page_items):
            print(TEST[i].text)

    myTEST(i)

    # Clean up (close browser once completed task).
    driver.close()

Right now this only looks at the headers. I would also like to get the table content.
If I set i = 0, it prints "Year"; with i = 1, it prints "Period". But with i = 2 I get two headers, "Estimated Value" and "Standard Error", because both share the same col2 id.
I tried to think of a way to work around this and can't seem to get anything that I have researched to work.
In essence, it would be better to start from the point where the clicking is done and the table of interest is showing, then find the header row by its XPath and pull in the text of all of its child <th> elements.

    <tr>
      <th id="col0"> Year </th>
      <th id="col1"> Period </th>
      <th id="col2">Estimated Value</th>
      <th id="col2">Standard Error</th>
    </tr>

I am not sure how to do that. I also tried looping over i, but two headers sharing the same id obviously causes an issue.
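To make the idea concrete, here is a minimal sketch of what I think "read the headers by position instead of by id" could look like. It assumes the page HTML is already available as a string (with Selenium that would presumably be `driver.page_source`) and uses only the standard library's `html.parser`. `SAMPLE_HEADER`, `HeaderCollector`, and `extract_headers` are names I made up for illustration, and the sample markup just mirrors the header row above.

```python
from html.parser import HTMLParser

# Made-up sample mirroring the header row above (note the duplicated id="col2").
SAMPLE_HEADER = """
<tr>
  <th id="col0"> Year </th>
  <th id="col1"> Period </th>
  <th id="col2">Estimated Value</th>
  <th id="col2">Standard Error</th>
</tr>
"""


class HeaderCollector(HTMLParser):
    """Collect the text of every <th> in document order, ignoring ids."""

    def __init__(self):
        super().__init__()
        self.headers = []
        self._in_th = False

    def handle_starttag(self, tag, attrs):
        if tag == "th":
            self._in_th = True
            self.headers.append("")  # start a new header cell

    def handle_endtag(self, tag):
        if tag == "th":
            self._in_th = False

    def handle_data(self, data):
        if self._in_th:
            self.headers[-1] += data


def extract_headers(html):
    """Return the stripped text of each <th> found in the HTML string."""
    parser = HeaderCollector()
    parser.feed(html)
    return [h.strip() for h in parser.headers]


print(extract_headers(SAMPLE_HEADER))
# ['Year', 'Period', 'Estimated Value', 'Standard Error']
```

Because the cells are taken in order rather than looked up by id, the two col2 headers come out as separate entries.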
Once I am able to get the header, I want to get the contents. I could use some insight on whether I am on the right path, overthinking it, or whether there is a simpler way to do all of this. I am learning, and this is my first attempt at using the selenium library for clicks. I just want to get it to work so I can try it again on a different table and make it as automated and reusable (with tweaking) as possible.
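For the contents, the rough shape I am imagining is something like the sketch below: walk every row of the table, treat the first row as the header, and pair each later row with it. This is only a guess at a reusable approach, not something I have working against the real page. It assumes the HTML string contains a single table whose first non-empty row is the header; `SAMPLE_TABLE`, `TableCollector`, and `table_to_records` are hypothetical names, and the numbers in the sample are placeholders, not real BLS figures.

```python
from html.parser import HTMLParser

# Made-up sample; the values are placeholders, not real BLS data.
SAMPLE_TABLE = """
<table>
  <thead>
    <tr><th id="col0">Year</th><th id="col1">Period</th>
        <th id="col2">Estimated Value</th><th id="col2">Standard Error</th></tr>
  </thead>
  <tbody>
    <tr><th>2001</th><td>Qtr1</td><td>1.0</td><td>0.1</td></tr>
    <tr><th>2001</th><td>Qtr2</td><td>2.0</td><td>0.2</td></tr>
  </tbody>
</table>
"""


class TableCollector(HTMLParser):
    """Collect every <tr> as a list of cell texts (<th> or <td>)."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("th", "td") and self.rows:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def table_to_records(html):
    """Return one dict per data row, keyed by the header row's text."""
    parser = TableCollector()
    parser.feed(html)
    rows = [row for row in parser.rows if row]
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


records = table_to_records(SAMPLE_TABLE)
for record in records:
    print(record)
```

If this works, the list of dicts could presumably be handed to `pd.DataFrame(records)` to get a DataFrame, and the same function could be pointed at a different table as long as its first row is the header.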

