How do I scrape data from multiple webpages with BeautifulSoup?

Question

I am new to this and have a problem with the following code. I want to append each string in the FullPage list to the base URL, then visit each resulting page and scrape some data from it. So far it works, but I do not know how to make it visit the other links in the list.

The output only contains the data from one page, but I need the data from 30 pages. How can I make this program go over each link?

The URL follows a pattern: the first part is 'http://arduinopak.com/Prd.aspx?Cat_Name=' and the second part is the product category name.

    import urllib2
    from bs4 import BeautifulSoup

    FullPage = ['New-Arrivals-2017-6', 'Big-Sales-click-here', 'Arduino-Development-boards', 'Robotics-and-Copters']
    urlp1 = "http://www.arduinopak.com/Prd.aspx?Cat_Name="
    URL = urlp1 + FullPage[0]

    for n in FullPage:
        URL = urlp1 + n
        page = urllib2.urlopen(URL)

    bsObj = BeautifulSoup(page, "html.parser")
    descList = bsObj.findAll('div', attrs={"class": "panel-default"})
    for desc in descList:
        print(desc.getText(separator=u' '))

Amey Kumar Samala · Accepted Answer · 2017-07-11 07:55:16Z

    import urllib2
    from bs4 import BeautifulSoup

    FullPage = ['New-Arrivals-2017-6', 'Big-Sales-click-here', 'Arduino-Development-boards', 'Robotics-and-Copters']
    urlp1 = "http://www.arduinopak.com/Prd.aspx?Cat_Name="
    URL = urlp1 + FullPage[0]

    for n in FullPage:
        URL = urlp1 + n
        page = urllib2.urlopen(URL)
        # Parse and print inside the loop so every page is processed
        bsObj = BeautifulSoup(page, "html.parser")
        descList = bsObj.findAll('div', attrs={"class": "panel-default"})
        for desc in descList:
            print(desc.getText(separator=u' '))

If you want to scrape each link, moving the parsing and printing lines at the end of your code into the loop will do it.

Was that all? Oh my, I am such a beginner. Thanks so much bro!

LiquidLemon · Accepted Answer · 2017-07-11 07:58:53Z

Your current code fetches every link, but it keeps only one BeautifulSoup object reference, so only one page is ever parsed. You could instead store them all in a list, or process each page before visiting the next URL (as shown below).

    for n in FullPage:
        URL = urlp1 + n
        page = urllib2.urlopen(URL)
        bsObj = BeautifulSoup(page, "html.parser")
        descList = bsObj.findAll('div', attrs={"class": "panel-default"})
        for desc in descList:
            print(desc.getText(separator=u' '))
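For the other option mentioned above (storing everything instead of printing as you go), here is a minimal sketch. It assumes Python 3, where `urllib2.urlopen` became `urllib.request.urlopen`. The names `fetch_html`, `scrape_categories` and the `fetch` parameter are made up for illustration and are not part of the original code.

```python
from urllib.request import urlopen  # Python 3 counterpart of urllib2.urlopen

from bs4 import BeautifulSoup

BASE_URL = "http://www.arduinopak.com/Prd.aspx?Cat_Name="
CATEGORIES = [
    "New-Arrivals-2017-6",
    "Big-Sales-click-here",
    "Arduino-Development-boards",
    "Robotics-and-Copters",
]


def fetch_html(url):
    """Download one page and return its raw bytes."""
    with urlopen(url) as response:
        return response.read()


def scrape_categories(categories, fetch=fetch_html):
    """Return a dict mapping each category name to the text of its panels.

    `fetch` is a hypothetical hook (an assumption, not in the original
    code) so the function can be tried out without network access.
    """
    results = {}
    for name in categories:
        # Build the soup inside the loop so every page gets parsed.
        soup = BeautifulSoup(fetch(BASE_URL + name), "html.parser")
        panels = soup.find_all("div", attrs={"class": "panel-default"})
        results[name] = [panel.get_text(separator=" ") for panel in panels]
    return results
```

Calling `scrape_categories(CATEGORIES)` should then give one list of panel texts per category, which you can print or save after all pages have been visited.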

Also, note that names using PascalCase are by convention reserved for classes. FullPage would usually be written as full_page (PEP 8 style; fullPage in a camelCase codebase), or FULL_PAGE if it's meant to be a constant.
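To illustrate the naming point, here is a small sketch that applies PEP 8 style to the names from the question. The `CategoryPage` class and the `page_urls` variable are hypothetical and only there to show each naming style side by side.

```python
# Module-level constant: UPPER_CASE_WITH_UNDERSCORES
BASE_URL = "http://www.arduinopak.com/Prd.aspx?Cat_Name="


class CategoryPage:  # class name: PascalCase (hypothetical class, for illustration)
    def __init__(self, category_name):
        self.category_name = category_name  # attribute: snake_case

    def url(self):
        """Return the full URL for this category."""
        return BASE_URL + self.category_name


# Ordinary variables: snake_case
full_page = ["New-Arrivals-2017-6", "Big-Sales-click-here"]
page_urls = [CategoryPage(name).url() for name in full_page]
```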
