
I am scraping a website with multiple pages and would very much appreciate your help with the following problem:

I have built a loop around the URL of the web page. However, when I look for the tags in the HTML, only information from page one appears. It seems like the loop is not actually advancing through the pages. Unfortunately, I cannot find my mistake in the following code:

import requests
from bs4 import BeautifulSoup

for pagenumber in range(1, 50):
    url = "http://suchen.mobile.de/fahrzeuge/auto/search.html?zipcodeRadius=100&scopeId=C&ambitCountry=DE&makeModelVariant1.makeId=3500&makeModelVariant1.modelId=115%2C98%2C80%2C99%2C102%2C81%2C100%2C83%2C105%2C82%2C101%2C120%2C121&makeModelVariant1.modelGroupId=53&isSearchRequest=true&pageNumber + str(pageNumber)"
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "lxml")  # parsing the data from the webpage
    carTypeTemp = []
    carTypeWeb = soup.find_all("span", {"class": "h3"})
    # writing the car type/description in a list
    for i in range(0, len(carTypeWeb), 2):
        carTypeTemp.extend(carTypeWeb[i])
  • pagenumber is not the same as pageNumber and your final double quote should come before the plus sign. Commented Jun 6, 2016 at 21:28

3 Answers


In your for loop you are doing:

url = "* + str(pageNumber)" 

This is literally what the URL will be; the page number isn't being concatenated onto the string as you think it is.

>>> "a url + str(pageNumber)" "a url + str(pageNumber)" 

You want:

url = "*" + str(pagenumber) 

Or you could use string formatters, whatever you prefer.
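For example, a minimal sketch using str.format (the long query string is shortened here for readability; substitute your full search URL):

base = "http://suchen.mobile.de/fahrzeuge/auto/search.html?isSearchRequest=true&pageNumber={page}"
for pagenumber in range(1, 50):
    url = base.format(page=pagenumber)  # the page number is substituted into the URL

This avoids quoting mistakes entirely, since the placeholder lives inside the string and the value is passed in separately.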

Edit: I didn't catch the difference in capitalization between the names, as noted in the comment.

You want pagenumber not pageNumber. pageNumber doesn't exist.


1 Comment

Many thanks for your help! This was indeed wrong. Are these two lines correct: r = requests.get(url) and soup = BeautifulSoup(r.content, "lxml")? I still face the problem that my find_all call does not extract data for pages > 1. Does my soup variable save the HTML from all 50 pages? Again, many thanks for your help!
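For what it's worth, soup only ever holds the page fetched in the current iteration, because it is reassigned each time through the loop. A minimal sketch that collects results from every page into one list, assuming the same requests/BeautifulSoup setup as in the question and the shortened base URL from the sketch above:

allCarTypes = []  # lives outside the loop, so it accumulates across all pages
for pagenumber in range(1, 50):
    r = requests.get(base.format(page=pagenumber))
    soup = BeautifulSoup(r.content, "lxml")  # holds only this page's HTML
    for span in soup.find_all("span", {"class": "h3"}):
        allCarTypes.append(span.get_text())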

Try changing the first two lines in your code to this:

for pagenumber in range(1, 50):
    url = "http://suchen.mobile.de/fahrzeuge/auto/search.html?zipcodeRadius=100&scopeId=C&ambitCountry=DE&makeModelVariant1.makeId=3500&makeModelVariant1.modelId=115%2C98%2C80%2C99%2C102%2C81%2C100%2C83%2C105%2C82%2C101%2C120%2C121&makeModelVariant1.modelGroupId=53&isSearchRequest=true&pageNumber={pagenumber}".format(pagenumber=pagenumber)

Right now you're not sending a GET request with a proper URL.
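One quick sanity check, for instance, is to print the first few generated URLs and confirm that the pageNumber parameter actually changes (query string shortened here for the example):

template = "http://suchen.mobile.de/fahrzeuge/auto/search.html?pageNumber={pagenumber}"
for pagenumber in range(1, 4):
    print(template.format(pagenumber=pagenumber))  # ends in pageNumber=1, 2, 3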


It seems like you forgot the capital "N": the loop variable is pagenumber, but str(pageNumber) uses a capital "N". Use one spelling consistently, and change

 url = "https://.................. + str(pageNumber)" 

to

url = ("http://suchen.mobile.de/fahrzeuge..... " + str(pageNumber)) 

With this change, successive iterations of the loop give me

['BMW 430d xDrive Coupé M Sportpaket Head-Up ACC LED', 'BMW 425d Gran Coupé M-Sportpaket Sport-Aut. Navi Pro', 'BMW 420d xDrive Coupé M Sportpaket Navi Apps PDC'] 

and

['BMW 435i xDrive Gran Coupé M Sportpaket Navi Prof. A', 'BMW 420 Gran Coupé M Sportpaket NEUES MODELL Nav LED', 'BMW 435i Coupé Sport Line GSD Navi Speed Limit Info'] 
