
I'm writing a Python script to automatically check dog re-homing sites for dogs that we might be able to adopt as they become available. However, I'm stuck submitting the form data on this site and can't figure out why.

The form attributes state it should have a post method and I've gone through all of the inputs for the form and created a payload.

I expect the search results page to be returned so I can scrape its HTML and start processing it, but the scrape only ever contains the form page, never the results.

I've tried using .get with the payload as params, requesting the URL with the payload appended, and using the requests-html library to render any JavaScript elements, all without success.

If you paste url_w_payload into a browser, the page loads but says one of the fields is empty. If you then press Enter in the address bar to reload the page without modifying the URL, the results load. Something to do with cookies, maybe?

    import requests
    from requests_html import HTMLSession

    session = HTMLSession()

    form_url = "https://www.rspca.org.uk/findapet?p_p_id=petSearch2016_WAR_ptlPetRehomingPortlets&p_p_lifecycle=1&p_p_state=normal&p_p_mode=view&_petSearch2016_WAR_ptlPetRehomingPortlets_action=search"
    url_w_payload = "https://www.rspca.org.uk/findapet?p_p_id=petSearch2016_WAR_ptlPetRehomingPortlets&p_p_lifecycle=1&p_p_state=normal&p_p_mode=view&_petSearch2016_WAR_ptlPetRehomingPortlets_action=search&noPageView=false&animalType=DOG&freshSearch=false&arrivalSort=false&previousAnimalType=&location=WC2N5DU&previousLocation=&prevSearchedPostcode=&postcode=WC2N5DU&searchedLongitude=-0.1282688&searchedLatitude=51.5072106"

    payload = {
        'noPageView': 'false',
        'animalType': 'DOG',
        'freshSearch': 'false',
        'arrivalSort': 'false',
        'previousAnimalType': '',
        'location': 'WC2N5DU',
        'previousLocation': '',
        'prevSearchedPostcode': '',
        'postcode': 'WC2N5DU',
        'searchedLongitude': '-0.1282688',
        'searchedLatitude': '51.5072106',
    }

    #req = requests.post(form_url, data = payload)
    #with open("requests_output.txt", "w") as f:
    #    f.write(req.text)

    ses = session.post(form_url, data = payload)
    ses.html.render()

    with open("session_output.txt", "w") as f:
        f.write(ses.text)

    print("Done")

1 Answer


There are a few hoops to jump through with cookies and headers, but once you get those right, you'll get the proper response.

Here's how to do it:

    import time
    from urllib.parse import urlencode

    import requests
    from bs4 import BeautifulSoup

    query_string = {
        "p_p_id": "petSearch2016_WAR_ptlPetRehomingPortlets",
        "p_p_lifecycle": 1,
        "p_p_state": "normal",
        "p_p_mode": "view",
        "_petSearch2016_WAR_ptlPetRehomingPortlets_action": "search",
    }

    payload = {
        'noPageView': 'false',
        'animalType': 'DOG',
        'freshSearch': 'false',
        'arrivalSort': 'false',
        'previousAnimalType': '',
        'location': 'WC2N5DU',
        'previousLocation': '',
        'prevSearchedPostcode': '',
        'postcode': 'WC2N5DU',
        'searchedLongitude': '-0.1282688',
        'searchedLatitude': '51.5072106',
    }


    def make_cookies(cookie_dict: dict) -> str:
        return "; ".join(f"{k}={v}" for k, v in cookie_dict.items())


    with requests.Session() as connection:
        main_url = "https://www.rspca.org.uk"
        connection.headers["User-Agent"] = "Mozilla/5.0 (X11; Linux x86_64) " \
                                           "AppleWebKit/537.36 (KHTML, like Gecko) " \
                                           "Chrome/90.0.4430.212 Safari/537.36"
        r = connection.get(main_url)

        cookies = make_cookies(r.cookies.get_dict())
        additional_string = f"; cb-enabled=enabled; " \
                            f"LFR_SESSION_STATE_10110={int(time.time())}"

        post_url = f"https://www.rspca.org.uk/findapet?{urlencode(query_string)}"
        connection.headers.update(
            {
                "cookie": cookies + additional_string,
                "referer": post_url,
                "content-type": "application/x-www-form-urlencoded",
            }
        )
        response = connection.post(post_url, data=urlencode(payload)).text
        dogs = BeautifulSoup(response, "lxml").find_all("a", class_="detailLink")
        print("\n".join(f"{main_url}{dog['href']}" for dog in dogs))

Output (shortened for brevity; there's no need to paginate, as all dogs come back in the one response):

    https://www.rspca.org.uk/findapet/details/-/Animal/JAY_JAY/ref/217747/rehome/
    https://www.rspca.org.uk/findapet/details/-/Animal/STORM/ref/217054/rehome/
    https://www.rspca.org.uk/findapet/details/-/Animal/DASHER/ref/205702/rehome/
    https://www.rspca.org.uk/findapet/details/-/Animal/EVE/ref/205701/rehome/
    https://www.rspca.org.uk/findapet/details/-/Animal/SEBASTIAN/ref/178975/rehome/
    https://www.rspca.org.uk/findapet/details/-/Animal/FIJI/ref/169578/rehome/
    https://www.rspca.org.uk/findapet/details/-/Animal/ELLA/ref/154419/rehome/
    https://www.rspca.org.uk/findapet/details/-/Animal/BEN/ref/217605/rehome/
    https://www.rspca.org.uk/findapet/details/-/Animal/SNOWY/ref/214416/rehome/
    https://www.rspca.org.uk/findapet/details/-/Animal/BENSON/ref/215141/rehome/
    https://www.rspca.org.uk/findapet/details/-/Animal/BELLA/ref/207716/rehome/
    and much more ...
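Since the goal is to spot dogs as they become available, it may help to pull the name and reference number out of each detail link and compare them against what you saw last time. Here's a rough sketch of that, not something the site documents: the `parse_detail_url` helper and the `seen_refs` set are made-up names for illustration, and it assumes the links keep the `/Animal/<NAME>/ref/<NUMBER>/` shape shown in the output.

```python
from urllib.parse import urlparse


def parse_detail_url(url: str) -> dict:
    """Split a detail link into a readable name and a numeric reference.

    Hypothetical helper: assumes the path looks like
    /findapet/details/-/Animal/<NAME>/ref/<NUMBER>/rehome/
    """
    parts = [p for p in urlparse(url).path.split("/") if p]
    name = parts[parts.index("Animal") + 1]
    ref = parts[parts.index("ref") + 1]
    return {"name": name.replace("_", " ").title(), "ref": int(ref), "url": url}


urls = [
    "https://www.rspca.org.uk/findapet/details/-/Animal/JAY_JAY/ref/217747/rehome/",
    "https://www.rspca.org.uk/findapet/details/-/Animal/STORM/ref/217054/rehome/",
]

# Hypothetical: references recorded on a previous run of the script.
seen_refs = {217054}

# Keep only the dogs whose reference has not been seen before.
new_dogs = [d for d in map(parse_detail_url, urls) if d["ref"] not in seen_refs]
for dog in new_dogs:
    print(dog["name"], dog["ref"])
```

In a real script you'd probably persist `seen_refs` between runs (a small JSON file would do) and feed in the links scraped from the response instead of a hard-coded list.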

PS. I really enjoyed this challenge as I have two dogs from a shelter. Keep it up, man!


3 Comments

This looks amazing, thanks! Can't wait to try it out...
Worked great and I was able to put it in a function and call from my main script. Thanks for your help!
My pleasure, @Steve. Keep up the good work.
