
I want to scrape data from a webpage with a dynamic table. The table contains information on train rides.

This is the website: https://www.laerm-monitoring.de/zug/?mp=3/

I tried to request the data with a simple mounted request session, but I only got basic HTML data without the data from the table.

    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry


    def requests_retry_session(
        retries=3,
        backoff_factor=0.3,
        status_forcelist=(500, 502, 504, 429),
        session=None,
    ):
        # Reuse a given session or create a new one, then attach a retry policy.
        session = session or requests.Session()
        retry = Retry(
            total=retries,
            read=retries,
            connect=retries,
            backoff_factor=backoff_factor,
            status_forcelist=status_forcelist,
        )
        adapter = HTTPAdapter(max_retries=retry)
        session.mount('http://', adapter)
        session.mount('https://', adapter)
        return session


    session = requests_retry_session()
    response = session.get('https://www.laerm-monitoring.de/zug/?mp=3/')
    response.content

How can I do this correctly?

3 Answers

3

The data is loaded dynamically from a different URL. Here is an example of how to load it with just requests and BeautifulSoup:

    import json

    import requests
    from bs4 import BeautifulSoup

    data = {
        "sort": "Einfahrtzeit-desc",
        "page": "1",
        "pageSize": "10",
        "group": "",
        "filter": "",
        "__RequestVerificationToken": "",
        "locid": "1",
    }

    headers = {"X-Requested-With": "XMLHttpRequest"}

    url = "https://www.laerm-monitoring.de/zug/"
    api_url = "https://www.laerm-monitoring.de/zug/train_read"

    with requests.Session() as s:
        soup = BeautifulSoup(s.get(url).content, "html.parser")
        data["__RequestVerificationToken"] = soup.select_one(
            '[name="__RequestVerificationToken"]'
        )["value"]
        data = s.post(api_url, data=data, headers=headers).json()

        # pretty print the data
        print(json.dumps(data, indent=4))

Prints:

    {
        "Data": [
            {
                "id": 2536954,
                "Einfahrtzeit": "2021-04-24T20:56:26.1703+02:00",
                "Gleis": 1,
                "Richtung": "Kiel",
                "Category": "PZ",
                "Zugkategorie": "Personenzug",
                "Vorbeifahrtdauer": 7.3,
                "Zugl\u00e4nge": 181.85884,
                "Geschwindigkeit": 115.57797,
                "Maximalpegel": 88.611084,
                "Vorbeifahrtpegel": 85.421326,
                "G\u00fcltig": "OK"
            },
            {
                "id": 2536944,
                "Einfahrtzeit": "2021-04-24T20:52:25.1703+02:00",
                "Gleis": 2,
                "Richtung": "Hamburg",
                "Category": "PZ",
                "Zugkategorie": "Personenzug",
                "Vorbeifahrtdauer": 6.3,
                "Zugl\u00e4nge": 211.10226,
                "Geschwindigkeit": 152.60104,
                "Maximalpegel": 91.81743,
                "Vorbeifahrtpegel": 87.95224,
                "G\u00fcltig": "OK"
            },
            {
                "id": 2536929,
                "Einfahrtzeit": "2021-04-24T20:44:31.4703+02:00",
                "Gleis": 1,
                "Richtung": "Kiel",
                "Category": "PZ",
                "Zugkategorie": "Personenzug",
                "Vorbeifahrtdauer": 5.3,
                "Zugl\u00e4nge": 104.69964,
                "Geschwindigkeit": 110.10052,
                "Maximalpegel": 82.100815,
                "Vorbeifahrtpegel": 79.98168,
                "G\u00fcltig": "OK"
            },
            {
                "id": 2536924,
                "Einfahrtzeit": "2021-04-24T20:42:30.3703+02:00",
                "Gleis": 1,
                "Richtung": "Kiel",
                "Category": "PZ",
                "Zugkategorie": "Personenzug",
                "Vorbeifahrtdauer": 2.9,
                "Zugl\u00e4nge": 49.305683,
                "Geschwindigkeit": 125.18,
                "Maximalpegel": 98.63289,
                "Vorbeifahrtpegel": 97.25019,
                "G\u00fcltig": "OK"
            },
            {
                "id": 2536925,
                "Einfahrtzeit": "2021-04-24T20:42:20.5703+02:00",
                "Gleis": 2,
                "Richtung": "Hamburg",
                "Category": "PZ",
                "Zugkategorie": "Personenzug",
                "Vorbeifahrtdauer": 0.0,
                "Zugl\u00e4nge": 0.0,
                "Geschwindigkeit": 0.0,
                "Maximalpegel": 0.0,
                "Vorbeifahrtpegel": 0.0,
                "G\u00fcltig": "-"
            },
            {
                "id": 2536911,
                "Einfahrtzeit": "2021-04-24T20:35:19.3703+02:00",
                "Gleis": 1,
                "Richtung": "Kiel",
                "Category": "PZ",
                "Zugkategorie": "Personenzug",
                "Vorbeifahrtdauer": 4.1,
                "Zugl\u00e4nge": 103.97647,
                "Geschwindigkeit": 132.2034,
                "Maximalpegel": 87.111984,
                "Vorbeifahrtpegel": 85.6776,
                "G\u00fcltig": "OK"
            },
            {
                "id": 2536907,
                "Einfahrtzeit": "2021-04-24T20:33:31.2703+02:00",
                "Gleis": 2,
                "Richtung": "Hamburg",
                "Category": "GZ",
                "Zugkategorie": "G\u00fcterzug",
                "Vorbeifahrtdauer": 23.8,
                "Zugl\u00e4nge": 583.19586,
                "Geschwindigkeit": 95.63598,
                "Maximalpegel": 88.02967,
                "Vorbeifahrtpegel": 85.02115,
                "G\u00fcltig": "OK"
            },
            {
                "id": 2536890,
                "Einfahrtzeit": "2021-04-24T20:25:36.1703+02:00",
                "Gleis": 2,
                "Richtung": "Hamburg",
                "Category": "PZ",
                "Zugkategorie": "Personenzug",
                "Vorbeifahrtdauer": 3.5,
                "Zugl\u00e4nge": 104.63446,
                "Geschwindigkeit": 160.47487,
                "Maximalpegel": 88.60612,
                "Vorbeifahrtpegel": 86.46721,
                "G\u00fcltig": "OK"
            },
            {
                "id": 2536882,
                "Einfahrtzeit": "2021-04-24T20:22:05.8703+02:00",
                "Gleis": 2,
                "Richtung": "Hamburg",
                "Category": "GZ",
                "Zugkategorie": "G\u00fcterzug",
                "Vorbeifahrtdauer": 26.6,
                "Zugl\u00e4nge": 653.52515,
                "Geschwindigkeit": 94.59859,
                "Maximalpegel": 91.9396,
                "Vorbeifahrtpegel": 85.50632,
                "G\u00fcltig": "OK"
            },
            {
                "id": 2536869,
                "Einfahrtzeit": "2021-04-24T20:16:24.3703+02:00",
                "Gleis": 1,
                "Richtung": "Kiel",
                "Category": "PZ",
                "Zugkategorie": "Personenzug",
                "Vorbeifahrtdauer": 3.3,
                "Zugl\u00e4nge": 87.8222,
                "Geschwindigkeit": 160.01207,
                "Maximalpegel": 91.3928,
                "Vorbeifahrtpegel": 89.54336,
                "G\u00fcltig": "OK"
            }
        ],
        "Total": 8657,
        "AggregateResults": null,
        "Errors": null
    }
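The records under "Data" in the response above are flat dictionaries, so they map naturally onto table rows. Below is a minimal sketch, using only the standard library, of turning such a response into CSV text. The function name `records_to_csv` and the shortened two-record sample are illustrative assumptions (the sample values are copied from the printed output); nothing here is part of the original answer.

```python
import csv
import io
import json

# Shortened sample of the response shown above (two records, a few fields each).
SAMPLE = """
{
  "Data": [
    {"id": 2536954, "Einfahrtzeit": "2021-04-24T20:56:26.1703+02:00",
     "Gleis": 1, "Richtung": "Kiel", "Zugkategorie": "Personenzug",
     "Geschwindigkeit": 115.57797},
    {"id": 2536944, "Einfahrtzeit": "2021-04-24T20:52:25.1703+02:00",
     "Gleis": 2, "Richtung": "Hamburg", "Zugkategorie": "Personenzug",
     "Geschwindigkeit": 152.60104}
  ],
  "Total": 8657
}
"""


def records_to_csv(payload: dict) -> str:
    """Return the records under payload["Data"] as CSV text (header + rows)."""
    records = payload["Data"]
    if not records:
        return ""
    buffer = io.StringIO()
    # Column order follows the keys of the first record.
    writer = csv.DictWriter(
        buffer, fieldnames=list(records[0]), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


if __name__ == "__main__":
    print(records_to_csv(json.loads(SAMPLE)))
```

In the answer's code, the `data` dictionary returned by `.json()` could presumably be passed to `records_to_csv` directly, since it has the same "Data" layout.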

2 Comments

Great answer! A few questions: How did you know what to write into the data dictionary, and especially, where did you get the __RequestVerificationToken? Where did you get the api_url from? Why do you need a POST request here? If these questions are too much, can you recommend a good resource to read up on these things? Thank you!
@gython I saw all of this information in the Firefox developer tools, in the Network tab (it lists all requests the page makes, among them the one with this data). The __RequestVerificationToken was contained in the original page (I pressed Ctrl+F and marked the place where it is).
3

With a simple GET request you can retrieve the HTML of the landing page.

    import requests

    response = requests.get('https://www.laerm-monitoring.de/zug/')  # even without query-parameters: ?mp=3/
    print(response.content)

Analyze the dynamic requests (browser)

This can also be done in any browser. In the source view (on Windows/Linux: Ctrl + U; on Mac: Cmd + U) you will find the token needed for all subsequent requests against the REST API: __RequestVerificationToken.

It's inside a hidden <input> form field on this page:

<input name="__RequestVerificationToken" type="hidden" value="CfDJ8B_eKmsiQC9Esc7ZjyC063dp6MzAtP3Sawnrfz3SCqxOMoPCYMV4sjDbrhDbuOsPcLnOiElgqQWTdMxCgfmhNVx1eC6oR81kZT3os2z3DJxtu6H9V7fKt9z9bdSJwB1ACYSSYWHsmPzt-AMWvSk4eYU" /> 

When the page loads in your browser this token will be used to load the data dynamically (as you already assumed) via JavaScript XMLHttpRequests (XHR).
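To make the token lookup concrete, here is a minimal sketch that pulls the value out of the page HTML using only the standard library's `html.parser`. The names `TokenParser` and `extract_token` are hypothetical helpers, and the markup in the demo uses a placeholder value rather than a real token.

```python
from html.parser import HTMLParser
from typing import Optional


class TokenParser(HTMLParser):
    """Collects the value of the hidden __RequestVerificationToken input."""

    def __init__(self) -> None:
        super().__init__()
        self.token: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        attributes = dict(attrs)
        if tag == "input" and attributes.get("name") == "__RequestVerificationToken":
            # Keep the first match only.
            if self.token is None:
                self.token = attributes.get("value")


def extract_token(html: str) -> Optional[str]:
    """Return the token value, or None if the input field is not present."""
    parser = TokenParser()
    parser.feed(html)
    return parser.token


if __name__ == "__main__":
    # Placeholder markup; the real token is much longer.
    page = (
        '<form><input name="__RequestVerificationToken" '
        'type="hidden" value="CfDJ8-example" /></form>'
    )
    print(extract_token(page))
```

The string passed to `extract_token` would be the text of the landing page, for example `response.text` from the GET request shown earlier.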

To view these XHR requests open the Network tab of your browser's developer tools window (shortcut F12):

[Screenshot: the browser dev tools' Network tab shows 2 XHR requests]

Both requests fetch the measured data as JSON. For security reasons the called web API requires a token, which is sent using a POST request. It's submitted in the body as application/x-www-form-urlencoded, along with the pagination parameters.

See the following example from the command line via cURL:

    curl -vi 'https://www.laerm-monitoring.de/zug/train_read' \
      -H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' \
      --data-raw 'sort=Einfahrtzeit-desc&page=1&pageSize=10&group=&filter=&__RequestVerificationToken=CfDJ8...'

(The token was shortened for illustration purposes.)

Hint: in the browser's Network tab you can usually right-click on the request to copy it as a cURL command.
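For those working in Python rather than on the command line, here is a rough sketch of how the same form-encoded body could be built, and how many pages a full download would need. `build_form` and `page_count` are hypothetical helper names. The sort value `Einfahrtzeit-desc` follows the form used in the requests/BeautifulSoup answer on this page, and the total of 8657 comes from that answer's sample output. Whether the server accepts other page sizes is an untested assumption.

```python
import math
from urllib.parse import urlencode


def build_form(token: str, page: int = 1, page_size: int = 10) -> dict:
    """Form fields as seen in the captured request.

    Assumption: only page and pageSize need to change between requests.
    """
    return {
        "sort": "Einfahrtzeit-desc",
        "page": str(page),
        "pageSize": str(page_size),
        "group": "",
        "filter": "",
        "__RequestVerificationToken": token,
    }


def page_count(total: int, page_size: int) -> int:
    """Number of pages needed to cover `total` records."""
    return math.ceil(total / page_size)


if __name__ == "__main__":
    # "TOKEN" is a placeholder for the value read from the hidden input.
    print(urlencode(build_form("TOKEN", page=2)))
    # A "Total" of 8657 with 10 records per page gives 866 pages.
    print(page_count(8657, 10))
```

The dictionary from `build_form` could be passed as `data=` to a `requests` POST, which form-encodes it in the same way `urlencode` does here.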


1

I have used Selenium to do something similar with Python. Not sure if that works for you. Basically, open the website, right-click on the table, and choose Inspect Element. Then go to the div that the table belongs to and right-click to copy its full XPath. Once you have the XPath, you can scrape it using Selenium. See this answer.

The only problem is that Selenium actually opens a browser window and doesn't run in the background by default. I think you can run it silently (headless), but I have never done it.

Another thing is that websites can block you if repeated automated requests come from a single IP. You can use Tor to make each request from a new IP. I have done something like that with Twitter here.
