Issue Description:
I am trying to automate a process that visits a website, hovers over the menu navigation bar, clicks each category option in the tier-1 dropdown, visits that page, scrapes the product details of the top 20 products on it, and writes them to an Excel file. If a page contains no products, the script keeps scrolling until it reaches the end of the page, and if no product div is found, it goes back to the top of the page and clicks the next category in the navigation panel.
I am working with Selenium (with Python) for this. I have attached my code below.
The scroll_and_click_view_more function scrolls down the page, prod_vitals scrapes the product details specific to each page, and prod_count extracts the total count of products on each page and builds a summary across all pages.
Error Description:
When I run the code below, every function works fine except in one case. The first page this code scrolls down contains no product details. The script scrolls down the entire page, prints that no product tiles were found on it, and is then supposed to click on the next category, but for some reason it cannot. It throws a TimeoutException and then clicks on the category after that, which works fine again. This website has two categories whose pages contain no product tiles, and on both of these pages the script is unable to click on the next available category. I am attaching a screenshot of the error.
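To narrow down why the click is being rejected, I am planning to add a debug step that logs which element actually sits at the link's centre point when the timeout fires (a rough sketch, untested on this site; element1 is the nav anchor I already locate in my scrape method):

what_covers_link = """
    const r = arguments[0].getBoundingClientRect();
    const el = document.elementFromPoint(r.left + r.width / 2, r.top + r.height / 2);
    return el ? el.outerHTML.slice(0, 200) : null;
"""
# If the returned element is not the <a> itself, something is overlaying the link
print("Element at the link's centre:", self.driver.execute_script(what_covers_link, element1))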
Output of my code:
['/feature/unlock-your-courage.html', '/shop/new/women', '/shop/women', '/shop/men/bags', '/shop/collection', '/shop/gift/women/bestseller', '/shop/coachworld', '/shop/coachreloved/coach-reloved']
Reached the end of the page and no product tiles were found:  /feature/unlock-your-courage.html
Element with href /shop/new/women not clickable
Link: /shop/women
Link: /shop/men/bags
Link: /shop/collection
Link: /shop/gift/women/bestseller
Reached the end of the page and no product tiles were found:  /shop/coachworld
Element with href /shop/coachreloved/coach-reloved not clickable

If you look at the output, the first line prints all the navigation categories available on the site. After that, the script visits the URLs in that array and is able to click all of them except the second and the eighth. FYI, the first and seventh categories do not contain any product tiles on their pages; all the remaining links are clickable. Clicking each category and iterating over the loop is handled inside the WebScraper class.
Resolution Steps:
I have tried adding time.sleep() between the actions, but this still doesn't work. I also added a step that takes a screenshot when the TimeoutException happens; I can see the category is visible on screen, yet it is still not clickable.
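One more thing I am considering, but have not wired into the script yet, is falling back to a JavaScript click inside the TimeoutException handler, since execute_script can click an element that Selenium refuses to interact with (rough sketch below; href is the loop variable from my scrape method):

try:
    WebDriverWait(self.driver, 30).until(
        EC.element_to_be_clickable((By.CSS_SELECTOR, f'a[href="{href}"]'))).click()
except TimeoutException:
    # Fallback: a JavaScript click does not require the element to be
    # scrolled into view or free of overlapping elements
    link = self.driver.find_element(By.CSS_SELECTOR, f'a[href="{href}"]')
    self.driver.execute_script("arguments[0].click();", link)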
I am attaching a screenshot of the terminal output.
I am attaching my code below:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import pandas as pd
import time
import re
import os
import shutil
import datetime
import openpyxl
import chromedriver_autoinstaller
from openpyxl import Workbook
from openpyxl.styles import PatternFill
from openpyxl.utils.dataframe import dataframe_to_rows

#custom_path = r"c:\Users\DELL\Documents\Self_Project"  # Custom path where ChromeDriver should be installed
#temp_path = chromedriver_autoinstaller.install()  # Installs ChromeDriver to a temporary directory and returns that path
#print("Temporary path", temp_path)
#final_path = os.path.join(custom_path, "chromedriver.exe")  # Full path to the ChromeDriver executable in the custom directory
#shutil.move(temp_path, final_path)  # Move the executable from the temporary directory to the custom one
#print("ChromeDriver installed at:", final_path)

date_time = datetime.datetime.now().strftime("%m%d%Y_%H%M%S")
file_name = f'CRTL_JP_staging_products_data_{date_time}.xlsx'
products_summary = []
max_count_of_products = 20


def scroll_and_click_view_more(driver, href):
    flag = False
    last_height = driver.execute_script("return window.pageYOffset + window.innerHeight")
    while True:
        try:
            driver.execute_script("window.scrollBy(0, 800);")
            time.sleep(4)
            new_height1 = driver.execute_script("return window.pageYOffset + window.innerHeight")
            try:
                WebDriverWait(driver, 10).until(
                    EC.presence_of_element_located((By.CSS_SELECTOR, 'div.product-tile')))
            except Exception as e:
                new_height = driver.execute_script("return window.pageYOffset + window.innerHeight")
                if new_height1 == last_height and flag == False:
                    print("Reached the end of the page and no product tiles were found: ", href)
                    return "No product tiles found"
                else:
                    last_height = new_height
                    continue
            div_count = 0
            flag = True
            #while div_count >= 0:
            response = driver.page_source
            soup = BeautifulSoup(response, 'html.parser')
            div_elements = soup.find_all('div', class_='product-tile')
            div_count = len(div_elements)
            if div_count > max_count_of_products:
                return driver.page_source
            driver.execute_script("window.scrollBy(0, 300);")
            time.sleep(3)
            new_height = driver.execute_script("return window.pageYOffset + window.innerHeight")
            #print(new_height)
            if new_height == last_height:
                print("Reached the end of the page: ", href)
                return "Reached the end of the page."
            else:
                last_height = new_height
        except Exception as e:
            print(e)
            break


def prod_vitals(soup, title, url):
    count_of_items = 1
    products_data = []  # List to store all product data for the Excel sheet
    for div in soup.find_all('div', class_='product-tile'):  # Iterate over each individual product-tile div tag
        if count_of_items <= max_count_of_products:
            list_price = 0     # Variable to store the list price
            sale_price = 0     # Variable to store the sale price
            discount1 = 0      # Variable to store the discount% displayed on the site
            discount2 = 0      # Variable to store the discount% calculated manually
            res = "Incorrect"  # Result of discount1 == discount2; initialized with "Incorrect"
            count_of_items = count_of_items + 1
            #pro_code = div.select('div.css-1fg6eq7 img')[0]['id']
            pro_name = div.select('div.product-name a.css-avqw6d p.css-1d5mpur')[0].get_text()
            pdpurl = div.select('div.css-grdrdu a.css-avqw6d')[0]['href']
            pdpurl = url + pdpurl
            # Extract the salesPrice span elements inside the salePriceWrapper div (ideally only one is present), e.g.:
            # <span class="chakra-text salesPrice false css-1gi2nbo" data-qa="m_plp_txt_pt_price_upper_rl">¥179000 </span>
            element = div.select('div.salePriceWrapper span.salesPrice')
            if element:  # If a sale price exists
                # Take the text of the first element (the price including the currency sign),
                # strip the sign and thousands separators, and convert the result to a float
                sale_price = float(element[0].get_text().replace('¥', '').replace(',', ''))
                res = "Correct"
            element = div.select('div.comparablePriceWrapper span.css-l96gil')  # Similarly extract the list price
            if element:
                list_price = float(element[0].get_text().replace('¥', '').replace(',', ''))
            percent_off = div.select('div.salePriceWrapper span.css-181q1zt')  # Similarly extract the DR% off text
            if percent_off:
                percent_off = percent_off[0].get_text()
                discount1 = re.search(r'\d+', percent_off).group()  # Keep only the digits of the DR% text; re.search returns a string
                discount1 = int(discount1)  # Convert the DR% characters into an integer
            else:
                percent_off = 0
            discount2 = round(((list_price - sale_price) / list_price) * 100)  # Calculate the expected DR% manually from list price and sale price
            if discount1 == discount2:  # Check whether the DR% on the site matches the expected DR%
                res = "Correct"
            else:
                res = "Incorrect"
            # Append the extracted data to the list
            products_data.append({'Product Name': pro_name,
                                  'Product URL': pdpurl,
                                  'Sale Price': '¥' + format(sale_price, '.2f'),
                                  'List Price': '¥' + format(list_price, '.2f'),
                                  'Discount on site': str(discount1) + '%',
                                  'Actual Discount': str(discount2) + '%',
                                  'Result': res})
        else:
            break
    time.sleep(5)
    # Convert the list, with explicit column names, to a pandas DataFrame (a two-dimensional labeled data structure)
    df = pd.DataFrame(products_data, columns=['Product Name', 'Product URL', 'Sale Price', 'List Price',
                                              'Discount on site', 'Actual Discount', 'Result'])
    if os.path.exists(file_name):
        book = openpyxl.load_workbook(file_name)
    else:
        book = Workbook()
        default_sheet = book.active
        book.remove(default_sheet)
    sheet = book.create_sheet(title)
    for row in dataframe_to_rows(df, index=False, header=True):
        sheet.append(row)
    yellow_fill = PatternFill(start_color='FFFF00', end_color='FFFF00', fill_type='solid')
    green_fill = PatternFill(start_color='00FF00', end_color='00FF00', fill_type='solid')
    for row in range(2, sheet.max_row + 1):
        cell = sheet.cell(row=row, column=7)  # 'Result' is the 7th column
        if cell.value == "Correct":
            cell.fill = green_fill
        else:
            cell.fill = yellow_fill
    book.save(file_name)


def prod_count(soup, title):
    product_count_element = soup.find('p', {'class': 'chakra-text total-count css-120gdxl',
                                            'data-qa': 'plp_txt_resultcount'})
    if product_count_element:
        pro_count_text = product_count_element.get_text()
        pro_count_text = pro_count_text.replace(',', '')
        pro_count = re.search(r'\d+', pro_count_text).group()
        products_summary.append({'Category': title,
                                 'Total products available': pro_count,
                                 'Total products scraped': max_count_of_products})


class WebScraper:
    def __init__(self):
        self.url = "https://staging1-japan.coach.com/?auto=true"
        self.reloved_url = "https://staging1-japan.coach.com/shop/coachreloved/coach-reloved"
        self.driver = webdriver.Chrome()
        #options = Options()
        #options.add_argument("--lang=en")
        #self.driver = webdriver.Chrome(service=Service(r"c:\Users\DELL\Documents\Self_Project\chromedriver.exe"), options=options)

    def scrape(self):
        self.driver.get(self.url)
        self.driver.maximize_window()
        time.sleep(5)
        nav_count = 0
        soup = BeautifulSoup(self.driver.page_source, 'html.parser')
        links = soup.find('div', {'class': 'css-wnawyw'}).find_all('a', {'class': 'css-ipxypz'})
        hrefs = [link.get('href') for link in links]
        print(hrefs)
        for i, href in enumerate(hrefs):
            try:
                #print(href)
                element1 = WebDriverWait(self.driver, 30).until(
                    EC.presence_of_element_located((By.CSS_SELECTOR, f'a[href="{href}"]')))
                #self.driver.execute_script("arguments[0].scrollIntoView(true);", element1)
                self.driver.execute_script(
                    "window.scrollTo(0, arguments[0].getBoundingClientRect().top + window.scrollY - 100);", element1)
                time.sleep(10)
                is_visible = self.driver.execute_script(
                    "return arguments[0].offsetParent !== null"
                    " && arguments[0].getBoundingClientRect().top >= 0"
                    " && arguments[0].getBoundingClientRect().left >= 0"
                    " && arguments[0].getBoundingClientRect().bottom <= (window.innerHeight || document.documentElement.clientHeight)"
                    " && arguments[0].getBoundingClientRect().right <= (window.innerWidth || document.documentElement.clientWidth);",
                    element1)
                #print("Displayed: {element1.is_displayed()}, Visible: {is_visible}")
                WebDriverWait(self.driver, 30).until(
                    EC.element_to_be_clickable((By.CSS_SELECTOR, f'a[href="{href}"]'))).click()
                time.sleep(3)
                response = scroll_and_click_view_more(self.driver, href)
                time.sleep(3)
                if response != "No product tiles found" and response != "Reached the end of the page.":
                    print("Link: \n", href)
                    soup = BeautifulSoup(response, 'html.parser')
                    PLP_title = links[nav_count].get('title')
                    prod_vitals(soup, PLP_title, self.url)
                    time.sleep(5)
                    prod_count(soup, PLP_title)
                    self.driver.execute_script("window.scrollBy(0, -500);")
                else:
                    self.driver.execute_script("window.scrollTo(0,0);")
                    #element2 = WebDriverWait(self.driver, 15).until(EC.presence_of_element_located((By.CSS_SELECTOR, f'a[href="{hrefs[i+1]}"]')))
                    #self.driver.execute_script("window.scrollTo(0, arguments[0].getBoundingClientRect().top + window.scrollY - 100);", element2)
                    #time.sleep(3)
                    #is_visible = self.driver.execute_script("return arguments[0].offsetParent !== null && arguments[0].getBoundingClientRect().top >= 0 && arguments[0].getBoundingClientRect().left >= 0 && arguments[0].getBoundingClientRect().bottom <= (window.innerHeight || document.documentElement.clientHeight) && arguments[0].getBoundingClientRect().right <= (window.innerWidth || document.documentElement.clientWidth);", element2)
                    #print(f"Element href: {hrefs[i+1]}, Displayed: {element2.is_displayed()}, Visible: {is_visible}")
                time.sleep(3)
                continue
            except TimeoutException:
                print(f"Element with href {href} not clickable")
                self.driver.save_screenshot('timeout_exception.png')
            except Exception as e:
                print(f"An error occurred: {e}")
            nav_count += 1
        df = pd.DataFrame(products_summary, columns=['Category', 'Total products available', 'Total products scraped'])
        book = openpyxl.load_workbook(file_name)
        sheet = book.create_sheet('Summary')
        for row in dataframe_to_rows(df, index=False, header=True):
            sheet.append(row)
        book.save(file_name)


scraper = WebScraper()
scraper.scrape()
time.sleep(5)
scraper.driver.quit()

Please find my updated code below, as per @mehdi-ahmadi's comment, along with the output and the issues I am facing now.
I initially tried your first option, but that was not working, so I changed the logic instead and went with your second option of getting the anchors from the nav each time. With this logic, the second link ('/shop/new/women') is clickable now. However, the last link (/shop/coachreloved/coach-reloved) is again getting a TimeoutException and cannot be clicked.
Please find the output below:
0 /feature/unlock-your-courage.html
Reached the end of the page and no product tiles were found:  /feature/unlock-your-courage.html
nav_count 1
1 /shop/new/women
nav_count 2
2 /shop/women
nav_count 3
3 /shop/men/bags
nav_count 4
4 /shop/collection
nav_count 5
5 /shop/gift/women/bestseller
nav_count 6
6 /shop/coachworld
Reached the end of the page and no product tiles were found:  /shop/coachworld
nav_count 7
Element with href /shop/coachreloved/coach-reloved not clickable

I am attaching my updated class below as well. Can you please help?
def scrape(self):
    self.driver.get(self.url)
    self.driver.maximize_window()
    time.sleep(5)
    nav_count = 0
    while True:
        try:
            # Refresh the page source and parse it
            soup = BeautifulSoup(self.driver.page_source, 'html.parser')
            links = soup.find('div', {'class': 'css-wnawyw'}).find_all('a', {'class': 'css-ipxypz'})
            hrefs = [link.get('href') for link in links]
            # Check if nav_count is within the range of hrefs
            if nav_count < len(hrefs):
                href = hrefs[nav_count]
                time.sleep(2)
                element = WebDriverWait(self.driver, 30).until(
                    EC.presence_of_element_located((By.CSS_SELECTOR, f'a[href="{href}"]')))
                self.driver.execute_script("arguments[0].scrollIntoView(true);", element)
                time.sleep(3)
                WebDriverWait(self.driver, 30).until(
                    EC.element_to_be_clickable((By.CSS_SELECTOR, f'a[href="{href}"]'))).click()
                time.sleep(3)
                print(nav_count, href)
                response = scroll_and_click_view_more(self.driver, href)
                time.sleep(3)
                if response != "No product tiles found" and response != "Reached the end of the page.":
                    #print("Link: \n", href)
                    soup = BeautifulSoup(response, 'html.parser')
                    PLP_title = links[nav_count].get('title')
                    prod_vitals(soup, PLP_title, self.url)
                    time.sleep(5)
                    prod_count(soup, PLP_title)
                    self.driver.execute_script("window.scrollBy(0, -500);")
                    time.sleep(2)
                else:
                    self.driver.get(self.url)
                    time.sleep(5)
                    continue
            else:
                break
        except TimeoutException:
            print(f"Element with href {href} not clickable")
            self.driver.save_screenshot('timeout_exception.png')
        except Exception as e:
            print(f"An error occurred: {e}")
        finally:
            nav_count += 1
            print("nav_count", nav_count)
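One idea I still want to try for that last link is to hover over the nav anchor with ActionChains before clicking, in case the dropdown has closed again by the time the click happens (a rough sketch, untested; it would replace the scroll-and-click block inside the loop):

from selenium.webdriver.common.action_chains import ActionChains

# Rough sketch (untested): scroll back to the top, re-locate the anchor fresh,
# and hover over it so the dropdown is actually open before attempting the click
self.driver.execute_script("window.scrollTo(0, 0);")
time.sleep(2)
element = WebDriverWait(self.driver, 30).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, f'a[href="{href}"]')))
ActionChains(self.driver).move_to_element(element).perform()
WebDriverWait(self.driver, 10).until(
    EC.element_to_be_clickable((By.CSS_SELECTOR, f'a[href="{href}"]'))).click()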