
I'm trying to get the table at this URL: https://www.agenas.gov.it/covid19/web/index.php?r=site%2Ftab2 . I tried reading it with requests and BeautifulSoup:

from bs4 import BeautifulSoup as bs
import requests

s = requests.session()
req = s.get(
    'https://www.agenas.gov.it/covid19/web/index.php?r=site%2Ftab2',
    headers={
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
    },
)
soup = bs(req.content, 'html.parser')
table = soup.find('table')

However, I only get the headers of the table.

<table class="table">
  <caption class="pl8">Ricoverati e posti letto in area non critica e terapia intensiva.</caption>
  <thead>
    <tr>
      <th class="cella-tabella-sm align-middle text-center" scope="col">Regioni</th>
      <th class="cella-tabella-sm bg-blu align-middle text-center" scope="col">Ricoverati in Area Non Critica</th>
      <th class="cella-tabella-sm bg-blu align-middle text-center" scope="col">PL in Area Non Critica</th>
      <th class="cella-tabella-sm bg-blu align-middle text-center" scope="col">Ricoverati in Terapia intensiva</th>
      <th class="cella-tabella-sm bg-blu align-middle text-center" scope="col">PL in Terapia Intensiva</th>
      <th class="cella-tabella-sm bg-blu align-middle text-center" scope="col">PL Terapia Intensiva attivabili</th>
    </tr>
  </thead>
  <tbody id="tab2_body">
  </tbody>
</table>
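As a sanity check, feeding that markup back through BeautifulSoup shows the `<tbody>` really is empty, i.e. the rows must be injected by JavaScript after page load, so requests alone never sees them. A minimal sketch (the HTML is an abridged, hard-coded copy of the markup above):

```python
from bs4 import BeautifulSoup

# Abridged copy of the server-rendered markup -- note the empty <tbody>.
html = """
<table class="table">
  <thead><tr><th scope="col">Regioni</th></tr></thead>
  <tbody id="tab2_body"></tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find("tbody", id="tab2_body").find_all("tr")
print(len(rows))  # 0 -- no data rows in the raw HTML
```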

So I tried the URL where I think the table data actually comes from: https://Agenas:tab2-19@www.agenas.gov.it/covid19/web/index.php?r=json%2Ftab2 . But in this case I always get a 401 status code, even when adding the username and password to the headers, as in the previous request. For example:

requests.get(
    'https://Agenas:tab2-19@www.agenas.gov.it/covid19/web/index.php?r=json%2Ftab2',
    headers={
        'username': 'Agenas',
        'password': 'tab2-19',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36',
    },
)
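For what it's worth, requests never reads credentials out of custom 'username'/'password' headers; if the endpoint really used HTTP Basic auth, the credentials would go in the auth= argument, which requests encodes into an Authorization header. A small offline sketch (nothing is actually sent over the network):

```python
import requests

# Prepare (but don't send) a request with Basic-auth credentials.
# requests base64-encodes "user:password" into the Authorization header;
# custom 'username'/'password' headers would simply be ignored by the server.
req = requests.Request(
    "GET",
    "https://www.agenas.gov.it/covid19/web/index.php?r=json%2Ftab2",
    auth=("Agenas", "tab2-19"),
).prepare()
print(req.headers["Authorization"])  # Basic QWdlbmFzOnRhYjItMTk=
```

(As the answer below shows, this endpoint doesn't use Basic auth at all, which is why the 401 persists either way.)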

Any idea on how to solve this? Thank you.

  • If the data in the table is dynamically loaded with javascript you might have to use selenium. Commented Mar 19, 2021 at 14:50
  • @Jortega just FYI, this can be done without the heavy guns of selenium. Commented Mar 19, 2021 at 15:27
  • @Panzerotto remember to mark the answer that solves your issue. See stackoverflow.com/help/someone-answers Commented Mar 19, 2021 at 15:47

1 Answer


Those "secrets" needed for the headers are actually embedded in a <script> tag. So you can fish them out, parse them into JSON, and use them in the request headers.
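To make that extraction step concrete, here's a stdlib-only sketch on a made-up stand-in for that <script> tag (the real tag's contents differ, but the regex-plus-json.loads idea is the same):

```python
import json
import re

# Made-up stand-in for the page's <script> tag: an AJAX call whose
# `headers:` object carries the credentials the JSON endpoint expects.
script = """
$.ajax({
    url: "index.php?r=json%2Ftab2",
    headers: {"X-Secret": "abc123", "X-Token": "xyz"},
    success: render
});
"""

# Capture the {...} object after `headers:` and parse it as JSON.
match = re.search(r"headers:\s*({.*?}),", script, re.S)
secrets = json.loads(match.group(1))
print(secrets)  # {'X-Secret': 'abc123', 'X-Token': 'xyz'}
```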

Here's how:

import json
import re

import requests
from bs4 import BeautifulSoup

headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/89.0.4389.90 Safari/537.36",
    "x-requested-with": "XMLHttpRequest",
}

with requests.Session() as s:
    end_point = "https://Agenas:tab2-19@www.agenas.gov.it/covid19/web/index.php?r=json%2Ftab2"
    regular_page = "https://www.agenas.gov.it/covid19/web/index.php?r=site%2Ftab2"
    html = s.get(regular_page, headers=headers).text
    soup = BeautifulSoup(html, "html.parser").find_all("script")[-1].string
    hacked_payload = json.loads(
        re.search(r"headers:\s({.*}),", soup, re.S).group(1).strip()
    )
    headers.update(hacked_payload)
    print(json.dumps(s.get(end_point, headers=headers).json(), indent=2))

Output:

[
  {
    "regione": "Abruzzo",
    "dato1": "667",
    "dato2": "1495",
    "dato3": "89",
    "dato4": "215",
    "dato5": "0"
  },
  {
    "regione": "Basilicata",
    "dato1": "164",
    "dato2": "426",
    "dato3": "12",
    "dato4": "88",
    "dato5": "13"
  },
  and so on ...
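Note that the numeric fields arrive as strings, so cast them before doing any arithmetic. A small stdlib-only sketch, using an abridged, hard-coded copy of the response above:

```python
import json

# Abridged copy of the endpoint's response: a list of per-region records.
data = json.loads("""
[
  {"regione": "Abruzzo", "dato1": "667", "dato2": "1495", "dato3": "89", "dato4": "215", "dato5": "0"},
  {"regione": "Basilicata", "dato1": "164", "dato2": "426", "dato3": "12", "dato4": "88", "dato5": "13"}
]
""")

# The values are strings ("667", not 667), so cast before aggregating.
ricoverati_non_critica = sum(int(row["dato1"]) for row in data)
print(ricoverati_non_critica)  # 831
```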