
I'm a newbie getting into web scrapers. I've made something that works, but it takes 3.2 hours to complete the job, and about 10 lines randomly come back blank each time I run it. Help is much appreciated!

import sys
import pandas as pd
from selenium import webdriver
import time
from datetime import datetime
from bs4 import BeautifulSoup
from selenium.webdriver.support.select import Select
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver import ActionChains
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from webdriver_manager.chrome import ChromeDriverManager


def getBrowser():
    options = Options()
    options.add_argument("--incognito")
    global browser
    options.add_argument("start-maximized")
    s = Service('''C:\\Users\\rajes\\yogita\\drivers\\chromedriver.exe''')
    browser = webdriver.Chrome('''C:\\Users\\rajes\\yogita\\drivers\\chromedriver.exe''')
    return browser


def getISINUrls(browser):
    url = 'http://www.nasdaqomxnordic.com/bonds/denmark/'
    browser.get(url)
    browser.maximize_window()
    time.sleep(1)
    bonds = {}
    try:
        getUrls(browser, bonds)
        pg_down = browser.find_element(By.CSS_SELECTOR, "#bondsSearchDKOutput > div:nth-child(1) > table > tbody > tr > td.pgDown")
        browser.execute_script("arguments[0].click();", pg_down)
        time.sleep(1)
        while (True):
            # pages = browser.find_element(By.ID, 'bondsSearchDKOutput')
            getUrls(browser, bonds)
            pg_down = browser.find_element(By.CSS_SELECTOR, "#bondsSearchDKOutput > div:nth-child(1) > table > tbody > tr > td.pgDown")
            browser.execute_script("arguments[0].click();", pg_down)
            time.sleep(1)
    except NoSuchElementException as e:
        pass
    return bonds


def getUrls(browser, bonds):
    hrefs_in_table = browser.find_elements(By.XPATH, '//a[@href]')
    count = 0
    for element in hrefs_in_table:
        href = element.get_attribute('href')
        if 'microsite?Instrumen' in href:
            bonds[element.text] = href
            count += 1


def saveURLs(bond):
    filename = "linkstoday.txt"
    fo = open(filename, "w")
    for k, v in bonds.items():
        fo.write(str(v) + '\n')
    fo.close()


def getSleepTime(count):
    first = 1
    res = 1
    i = 0;
    while i < count:
        i += 1
        temp = res
        res = temp + first
        first = temp
    return res


def getISINData(browser2):
    with open("linkstoday.txt", "r") as a_file:
        denmark_drawing = []
        for line in a_file:
            result_found = False
            count = 2
            Isin_code = str()
            short_name = str()
            Volume_circulating = str()
            Repayment_date = str()
            Drawing_percent = str()
            wait_time = getSleepTime(0) + 1
            while not result_found and count < 5:
                stripped_line = line.strip()
                browser2.get(stripped_line)
                browser2.maximize_window()
                time.sleep(getSleepTime(count) + 1)
                WebDriverWait(browser2, 1).until(
                    EC.element_to_be_clickable((By.CSS_SELECTOR, '#ui-id-3 > span'))).click()
                time.sleep(getSleepTime(count))
                Isin_code = browser2.find_element(By.CSS_SELECTOR, '#db-f-isin').text
                short_name = browser2.find_element(By.CSS_SELECTOR, '#db-f-nm').text
                Volume_circulating = browser2.find_element(By.CSS_SELECTOR, '#db-f-oa').text
                Repayment_date = browser2.find_element(By.CSS_SELECTOR, '#db-f-drd').text
                Drawing_percent = browser2.find_element(By.CSS_SELECTOR, '#db-f-dp').text
                if Isin_code == " ":
                    count += 1
                else:
                    result_found = True
            temp_data = [Isin_code, short_name, Volume_circulating, Repayment_date, Drawing_percent]
            denmark_drawing.append(temp_data)
    # Writing data to dataframe
    df3 = pd.DataFrame(denmark_drawing, columns=['ISIN', 'Shortname', 'OutstandingVolume', 'Repaymentdate', 'Drawingpercent'])
    df3.to_csv('Denamrkscrapedsata_20220121.csv', index=False)


if __name__ == "__main__":
    browser = getBrowser()
    print(f'''Call to getISINUrls start at: {datetime.now()}''')
    bonds = getISINUrls(browser)
    print(f'''Call to getISINUrls ends at : {datetime.now()}''')
    print(f'''total records: {len(bonds)}''')
    browser.close()
    browser2 = getBrowser()
    print(f'''Call to getISINData start at: {datetime.now()}''')
    getISINData(browser2)
    print(f'''Call to getISINData ends at : {datetime.now()}''')
    saveURLs(bonds)
    browser2.close()
    sys.exit(0)
  • Scraping will always be slow. If you're serious about this, get access to Genium/INET. Commented Jan 24, 2022 at 21:55
  • Welcome to Code Review! I changed the title so that it describes what the code does per site goals: "State what your code does in your title, not your main concerns about it." Please check that I haven't misrepresented your code, and correct it if I have. Commented Jan 25, 2022 at 8:01

1 Answer


While it would be nice to use an official API, I don't see one.

Scraping is morally ambiguous, and in each case you need to think about the impact on the service. In this case I don't feel too bad about doing it.

If you were to keep using Selenium, don't also use BeautifulSoup: your browser already has a DOM, so you don't want a second library re-parsing the HTML. But much more importantly, don't use Selenium at all, and don't even hit the HTML URLs themselves. Invest some time in reverse engineering and you'll find that the data actually come from a (strange, inconsistent, poorly designed) API that is exposed, unauthenticated, to the internet. You can watch the traffic to this API in the developer tools of your favourite browser.
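To get a feel for what you'll see there, here is a minimal sketch of that kind of request. The endpoint, headers, and form field are the ones the page itself sends (and that the suggested code below uses); the xmlquery payload is a placeholder for whatever query you capture from the Network tab.

import requests

# Placeholder: paste a <post>...</post> query captured from the browser's
# Network tab here. With this dummy value the server just answers
# "Invalid Request", which is itself a useful sanity check.
xml_query = '<post>...</post>'

resp = requests.post(
    'http://www.nasdaqomxnordic.com/webproxy/DataFeedProxy.aspx',
    headers={'X-Requested-With': 'XMLHttpRequest'},
    data={'xmlquery': xml_query},
    timeout=5,
)
print(resp.text)  # raw XML, or JSON if the query asks for it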

Don't call sleep. Even if you were to keep Selenium, there are better ways to wait for conditions to be met.
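For completeness, if you did stay with Selenium, here is a sketch of an explicit wait, reusing the #db-f-isin selector and the browser2 driver from your own code; the ten-second timeout is an arbitrary choice.

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Block until the ISIN field is actually visible (up to 10 s) instead of
# sleeping for a guessed duration and hoping the page has rendered.
wait = WebDriverWait(browser2, timeout=10)
isin_element = wait.until(
    EC.visibility_of_element_located((By.CSS_SELECTOR, '#db-f-isin'))
)
isin_code = isin_element.text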

Don't save the URLs: instead, just save the instrument IDs.

Don't use Pandas, since you're just writing to a CSV file; use the built-in CSV support.

Don't exit(0) at the end; that's redundant.

Suggested

import csv
from typing import Iterator, Iterable, Literal
from xml.etree import ElementTree

from requests import Session, Response


class APIError(Exception):
    pass


def fetch_data(
    session: Session,
    xml_query: str,
    bond_type: Literal[
        'doMortgageCreditAndSpecialInstitutions',
        'doGovernment',
        'doStructuredBonds',
    ] = 'doMortgageCreditAndSpecialInstitutions',
) -> Response:
    with session.post(
        url='http://www.nasdaqomxnordic.com/webproxy/DataFeedProxy.aspx',
        headers={
            'Accept': '*/*',
            'Content-Type': 'application/x-www-form-urlencoded',
            'X-Requested-With': 'XMLHttpRequest',
        },
        cookies={'bonds_dk_search_view': bond_type},
        data={'xmlquery': xml_query},
        timeout=5,
    ) as resp:
        resp.raise_for_status()
        if resp.text == 'Invalid Request':
            raise APIError()
        return resp


def get_isin_ids(session: Session) -> Iterator[str]:
    xml_query = '''<post>
    <param name="Exchange" value="NMF"/>
    <param name="SubSystem" value="Prices"/>
    <param name="Market" value="GITS:CO:CPHCB"/>
    <param name="Action" value="GetMarket"/>
    <param name="inst__an" value="ed,itid"/>
    <param name="XPath" value="//inst[@ed!='' and (@itid='2' or @itid='3')]"/>
    <param name="RecursiveMarketElement" value="True"/>
    <param name="inst__e" value="7"/>
    <param name="app" value="/bonds/denmark/"/>
</post>'''

    xml_response = fetch_data(session, xml_query).text
    doc = ElementTree.fromstring(xml_response)
    for institution in doc.findall('./inst'):
        yield institution.attrib['id']


def save_ids(ids: Iterable[str]) -> None:
    with open('linkstoday.txt', 'w') as fo:
        for id in ids:
            fo.write(id + '\n')


def get_isin_data(session: Session, ids: Iterable[str]) -> Iterator[dict[str, str]]:
    for id in ids:
        xml_query = f'''<post>
    <param name="Exchange" value="NMF"/>
    <param name="SubSystem" value="Prices"/>
    <param name="Action" value="GetInstrument"/>
    <param name="inst__a" value="0,1,2,5,21,23"/>
    <param name="Exception" value="false"/>
    <param name="ext_xslt" value="/nordicV3/inst_table.xsl"/>
    <param name="Instrument" value="{id}"/>
    <param name="inst__an" value="id,nm,oa,dp,drd"/>
    <param name="inst__e" value="1,3,6,7,8"/>
    <param name="trd__a" value="7,8"/>
    <param name="t__a" value="1,2,10,7,8,18,31"/>
    <param name="json" value="1"/>
    <param name="app" value="/bonds/denmark/microsite"/>
</post>'''

        doc = fetch_data(session, xml_query).json()['inst']
        record = {
            'ISIN': doc['@id'],
            'ShortName': doc['@nm'],
            'OutstandingVolume': doc['@oa'],
            'RepaymentDate': doc['@drd'],
            'DrawingPercent': doc['@dp'],
        }
        yield record


def save_csv(records: Iterable[dict[str, str]]) -> None:
    with open('DenmarkScrapeData.csv', 'w', newline='') as f:
        writer = csv.DictWriter(
            f=f,
            fieldnames=(
                'ISIN',
                'ShortName',
                'OutstandingVolume',
                'RepaymentDate',
                'DrawingPercent',
            ))
        writer.writeheader()
        writer.writerows(records)


def main() -> None:
    with Session() as session:
        session.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:96.0) '
                          'Gecko/20100101 '
                          'Firefox/96.0',
        }
        ids = tuple(get_isin_ids(session))
        print(f'total records: {len(ids)}')
        save_ids(ids)

        records = get_isin_data(session, ids)
        save_csv(records)


if __name__ == '__main__':
    main()

Output

This output comes from letting the program run for a few seconds and then cancelling it:

ISIN,ShortName,OutstandingVolume,RepaymentDate,DrawingPercent
XCSE-0:5_111.E.33,"-0,5 111.E.33",84493151,2022-01-01,2.8112571826
XCSE-0:5_PCT_111.E_2030,"-0,5 pct 111.E 2030",665797291,2022-01-01,3.2633533803
XCSE-0:5RDS20S33,"-0,5RDS20S33",451515514,2022-01-01,2.9740738964
XCSE-05NYK01EA30,-05NYK01EA30,878951074,2022-01-01,3.3091375583
XCSE-05NYK01EA33,-05NYK01EA33,881317489,2022-01-01,2.8425761917
XCSE0_111.E.33,0 111.E.33,653907754,2022-01-01,2.4804408498
XCSE0_111.E.38,0 111.E.38,564600489,2022-01-01,1.855852439
XCSE0_111.E.43,0 111.E.43,176772175,2022-01-01,1.3165659355
XCSE0_PCT_111.E.30,0 pct 111.E.30,1029518700,2022-01-01,5.9398552916
XCSE0_PCT_111.E.40,0 pct 111.E.40,1977182808,2022-01-01,1.554107146
XCSE0:0_42A_B_2040,"0,0 42A B 2040",209526702,2022-01-01,1.7771871286
XCSE0:0_ANN_2040,"0,0 Ann 2040",56394289,2022-01-01,1.5425031762
XCSE0:00NDASDRO30,"0,00NDASDRO30",1224109240,2022-01-01,3.8723465773
XCSE0:0NDASDRO33,"0,0NDASDRO33",860502381,2022-01-01,2.2338528235
XCSE0:0NDASDRO40,"0,0NDASDRO40",1459255905,2022-01-01,1.4938027118
XCSE0:0NDASDRO43,"0,0NDASDRO43",236599645,2022-01-01,1.3792804762
XCSE0:0RDSD20S33,"0,0RDSD20S33",718629368,2022-01-01,2.4815379285
XCSE0:0RDSD21S35,"0,0RDSD21S35",3462720875,2022-01-01,2.0717949411
XCSE0:0RDSD21S38,"0,0RDSD21S38",2251018344,2022-01-01,1.7993154025
XCSE0:0RDSD22S40,"0,0RDSD22S40",2011886347,2022-01-01,1.4705608857
XCSE0:0RDSD22S43,"0,0RDSD22S43",184303670,2022-01-01,0.8792686726
XCSE0:5_111.E.38,"0,5 111.E.38",338880761,2022-01-01,1.3969909037
XCSE0:5_411.E.OA.53,"0,5 411.E.OA.53",1393252566,2022-01-01,0.0591983882
XCSE0:5_42A_B_2040,"0,5 42A B 2040",8131539033,2022-01-01,1.3671120294
XCSE0:5_ANN_2040,"0,5 Ann 2040",451401810,2022-01-01,2.4906658405
XCSE0:5_ANN_2050,"0,5 Ann 2050",653730646,2022-01-01,0.8698109475
XCSE0:5_B_2043,"0,5 B 2043",3606287111,2022-01-01,1.2922558113
XCSE0:5_B_2053,"0,5 B 2053",1642437254,2022-01-01,0.8121455742
XCSE0:5_OA_2050,"0,5 OA 2050",99003000,2022-01-01,0.0
XCSE0:5_OA_43A_B_2050,"0,5 OA 43A B 2050",482048871,2022-01-01,0.0
XCSE0:5_OA_B_2053,"0,5 OA B 2053",219089000,2022-01-01,0.0
XCSE0:5_PCT_111.E.27,"0,5 pct 111.E.27",283502208,2022-01-01,8.3610097198
XCSE0:5_PCT_111.E.35,"0,5 pct 111.E.35",1583753306,2022-01-01,2.5446722633
XCSE0:5_PCT_111.E.40,"0,5 pct 111.E.40",6407716782,2022-01-01,1.4303071825
XCSE0:5_PCT_111.E.50,"0,5 pct 111.E.50",10350998334,2022-01-01,0.9164192286
XCSE0:5_PCT_411.E.OA.50,"0,5 pct 411.E.OA.50",3087755141,2022-01-01,0.0671974779
XCSE0:50_43_A_B_2050,"0,50 43 A B 2050",1343369361,2022-01-01,0.8540709785
XCSE0:5NDASDRO27,"0,5NDASDRO27",481074018,2022-01-01,8.803025784
XCSE0:5NDASDRO30,"0,5NDASDRO30",881168727,2022-01-01,7.5241365836
XCSE0:5NDASDRO40,"0,5NDASDRO40",11220224408,2022-01-01,1.5304442894
XCSE0:5NDASDRO43,"0,5NDASDRO43",5986915933,2022-01-01,1.3334597191
XCSE0:5NDASDRO50,"0,5NDASDRO50",9229180944,2022-01-01,0.8712084834
XCSE0:5NDASDRO53,"0,5NDASDRO53",6247426527,2022-01-01,0.8375194931
XCSE0:5NDASDROOA50,"0,5NDASDROOA50",5461281793,2022-01-01,0.0083519478
XCSE0:5NDASDROOA53,"0,5NDASDROOA53",3175404000,2022-01-01,0.0
XCSE0:5NYK01EA27,"0,5NYK01EA27",1269241062,2022-01-01,8.7144274945
XCSE0:5NYK01EA30,"0,5NYK01EA30",1753360650,2022-01-01,6.3398581018
XCSE0:5NYK01EA35,"0,5NYK01EA35",3850470214,2022-01-01,2.2289106154
XCSE0:5NYK01EA38,"0,5NYK01EA38",2292707137,2022-01-01,1.6186901466
XCSE0:5NYK01EA40,"0,5NYK01EA40",25051403176,2022-01-01,1.410155416
XCSE0:5NYK01EA50,"0,5NYK01EA50",23594969156,2022-01-01,0.8777441213
XCSE0:5NYK01EDA50,"0,5NYK01EDA50",9778359698,2022-01-01,0.010495042
XCSE0:5NYK01EDA53,"0,5NYK01EDA53",7720300441,2022-01-01,0.0015823625
XCSE0:5NYK01IA43,"0,5NYK01IA43",152247394,2022-01-01,1.2410533255
  • I am using Python 3.8 and getting an error: def get_isin_data(session: Session, ids: Iterable[str]) -> Iterator[dict[str, str]]: TypeError: 'type' object is not subscriptable. I have copied your code with no changes. Could you please help me with this? Thanks. Commented Jan 26, 2022 at 13:12
  • The code I've shown is for Python 3.10, to which you should upgrade. The reason for the failure is that the built-ins (dict, tuple, etc.) cannot be subscripted as type hints in 3.8. Commented Jan 26, 2022 at 14:29
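For reference, a sketch of two ways that signature can be written if you do need to stay on 3.8: rely on postponed evaluation of annotations, or fall back to the typing module's aliases. The function bodies are elided, and get_isin_data_38 is just an illustrative name.

from __future__ import annotations  # makes dict[str, str] legal in annotations on 3.8

from typing import Dict, Iterable, Iterator

from requests import Session


# Option 1: keep the built-in generics and rely on the __future__ import above.
def get_isin_data(session: Session, ids: Iterable[str]) -> Iterator[dict[str, str]]:
    ...


# Option 2: use the typing aliases, which need no __future__ import on 3.8.
def get_isin_data_38(session: Session, ids: Iterable[str]) -> Iterator[Dict[str, str]]:
    ...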
