I'm new to this and have a problem with the following code. I want to append each string in the FullPage list to the base URL, visit the resulting pages, and scrape some data from them. It works so far, but I don't know how to make it visit the other links in the list.

The output only gives me the data from one page, but I need the data from 30 pages. How can I make this program go through each link?

The URL has a pattern: the first part is 'http://arduinopak.com/Prd.aspx?Cat_Name=' and the second part is the product category name.

    import urllib2
    from bs4 import BeautifulSoup

    FullPage = ['New-Arrivals-2017-6', 'Big-Sales-click-here', 'Arduino-Development-boards', 'Robotics-and-Copters']

    urlp1 = "http://www.arduinopak.com/Prd.aspx?Cat_Name="
    URL = urlp1 + FullPage[0]

    for n in FullPage:
        URL = urlp1 + n
        page = urllib2.urlopen(URL)

    bsObj = BeautifulSoup(page, "html.parser")
    descList = bsObj.findAll('div', attrs={"class": "panel-default"})
    for desc in descList:
        print(desc.getText(separator=u' '))

2 Answers

1
    import urllib2
    from bs4 import BeautifulSoup

    FullPage = ['New-Arrivals-2017-6', 'Big-Sales-click-here', 'Arduino-Development-boards', 'Robotics-and-Copters']

    urlp1 = "http://www.arduinopak.com/Prd.aspx?Cat_Name="
    URL = urlp1 + FullPage[0]

    for n in FullPage:
        # Build the URL for this category and download the page.
        URL = urlp1 + n
        page = urllib2.urlopen(URL)
        # Parse and print inside the loop so every page is handled, not just the last one.
        bsObj = BeautifulSoup(page, "html.parser")
        descList = bsObj.findAll('div', attrs={"class": "panel-default"})
        for desc in descList:
            print(desc.getText(separator=u' '))

If you want to scrape every link, move the last three statements of your code (the BeautifulSoup call, the findAll call, and the print loop) into the for loop.

2 Comments

Was that all? Oh my, I am such a beginner. Thanks so much bro!
I am glad that it was helpful. Just accept the answer
1

Your current code fetches all the links, but it keeps a reference to only one BeautifulSoup object. You could instead store them all in a list, or process each page before visiting the next URL (as shown below).

    for n in FullPage:
        URL = urlp1 + n
        page = urllib2.urlopen(URL)
        # Process this page before moving on to the next URL.
        bsObj = BeautifulSoup(page, "html.parser")
        descList = bsObj.findAll('div', attrs={"class": "panel-default"})
        for desc in descList:
            print(desc.getText(separator=u' '))
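The other option mentioned above, keeping every parsed page instead of printing as you go, could be sketched roughly like this. It assumes Python 3, where `urllib2.urlopen` became `urllib.request.urlopen`; the helper names `fetch_html`, `parse_html` and `collect_soups` are made up for illustration and are not part of the original code.

```python
from urllib.request import urlopen  # Python 3 counterpart of urllib2.urlopen

urlp1 = "http://www.arduinopak.com/Prd.aspx?Cat_Name="


def fetch_html(url):
    """Download one page and return its raw bytes (hypothetical helper)."""
    with urlopen(url) as response:
        return response.read()


def parse_html(html):
    """Parse raw HTML into a BeautifulSoup object (hypothetical helper)."""
    # Imported here so the rest of the sketch can be tried without bs4 installed.
    from bs4 import BeautifulSoup
    return BeautifulSoup(html, "html.parser")


def collect_soups(categories, fetch=fetch_html, parse=parse_html):
    """Return a dict mapping each category name to its parsed page."""
    soups = {}
    for name in categories:
        # One entry per category, so no page overwrites another.
        soups[name] = parse(fetch(urlp1 + name))
    return soups
```

Calling `collect_soups(FullPage)` would then give one parsed page per category, and you can loop over the dict's values and run the same `findAll` and print steps on each. Passing `fetch` and `parse` as arguments is just a design choice that lets you try the loop without hitting the site.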

Also, note that by convention (PEP 8) PascalCase names are reserved for classes. FullPage would usually be written as full_page, or FULL_PAGE if it's meant to be a constant.
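As a small illustration of that naming point, here is how the setup part of the question might look with PEP 8 style names. The values are the ones from the question; the names `URL_PREFIX` and `category_urls` are my own suggestions, not anything required by the libraries.

```python
# Module-level constants: UPPER_CASE with underscores.
URL_PREFIX = "http://www.arduinopak.com/Prd.aspx?Cat_Name="
FULL_PAGE = [
    'New-Arrivals-2017-6',
    'Big-Sales-click-here',
    'Arduino-Development-boards',
    'Robotics-and-Copters',
]


# Functions and ordinary variables: lowercase with underscores.
def category_urls(prefix, names):
    """Build one full URL per category name."""
    return [prefix + name for name in names]


urls = category_urls(URL_PREFIX, FULL_PAGE)
```

The resulting `urls` list can be iterated directly in the scraping loop instead of rebuilding the URL on each pass.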
