
I'm trying to get the table at this URL: https://www.agenas.gov.it/covid19/web/index.php?r=site%2Ftab2 . I tried reading it with requests and BeautifulSoup:

from bs4 import BeautifulSoup as bs
import requests

s = requests.session()
req = s.get(
    'https://www.agenas.gov.it/covid19/web/index.php?r=site%2Ftab2',
    headers={
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
    },
)
soup = bs(req.content, 'html.parser')
table = soup.find('table')

However, I only get the headers of the table.

<table class="table">
  <caption class="pl8">Ricoverati e posti letto in area non critica e terapia intensiva.</caption>
  <thead>
    <tr>
      <th class="cella-tabella-sm align-middle text-center" scope="col">Regioni</th>
      <th class="cella-tabella-sm bg-blu align-middle text-center" scope="col">Ricoverati in Area Non Critica</th>
      <th class="cella-tabella-sm bg-blu align-middle text-center" scope="col">PL in Area Non Critica</th>
      <th class="cella-tabella-sm bg-blu align-middle text-center" scope="col">Ricoverati in Terapia intensiva</th>
      <th class="cella-tabella-sm bg-blu align-middle text-center" scope="col">PL in Terapia Intensiva</th>
      <th class="cella-tabella-sm bg-blu align-middle text-center" scope="col">PL Terapia Intensiva attivabili</th>
    </tr>
  </thead>
  <tbody id="tab2_body">
  </tbody>
</table>
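As a sanity check, feeding that markup back through BeautifulSoup shows the `<tbody>` really is empty, i.e. the rows must be injected by JavaScript after page load, so requests alone never sees them. A minimal sketch (the HTML is an abridged, hard-coded copy of the markup above):

```python
from bs4 import BeautifulSoup

# Abridged copy of the server-rendered markup -- note the empty <tbody>.
html = """
<table class="table">
  <thead><tr><th scope="col">Regioni</th></tr></thead>
  <tbody id="tab2_body"></tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find("tbody", id="tab2_body").find_all("tr")
print(len(rows))  # 0 -- no data rows in the raw HTML
```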

So I tried the URL where I think the table data actually comes from: https://Agenas:tab2-19@www.agenas.gov.it/covid19/web/index.php?r=json%2Ftab2 . But in this case I always get a 401 status code, even when adding the username and password to the headers, as in the previous request. For example:

requests.get(
    'https://Agenas:tab2-19@www.agenas.gov.it/covid19/web/index.php?r=json%2Ftab2',
    headers={
        'username': 'Agenas',
        'password': 'tab2-19',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36',
    },
)
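For what it's worth, requests never reads credentials out of custom 'username'/'password' headers; if the endpoint really used HTTP Basic auth, the credentials would go in the auth= argument, which requests encodes into an Authorization header. A small offline sketch (nothing is actually sent over the network):

```python
import requests

# Prepare (but don't send) a request with Basic-auth credentials.
# requests base64-encodes "user:password" into the Authorization header;
# custom 'username'/'password' headers would simply be ignored by the server.
req = requests.Request(
    "GET",
    "https://www.agenas.gov.it/covid19/web/index.php?r=json%2Ftab2",
    auth=("Agenas", "tab2-19"),
).prepare()
print(req.headers["Authorization"])  # Basic QWdlbmFzOnRhYjItMTk=
```

(As the answer below shows, this endpoint doesn't use Basic auth at all, which is why the 401 persists either way.)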

Any idea on how to solve this? Thank you.

  • If the data in the table is dynamically loaded with javascript you might have to use selenium. Commented Mar 19, 2021 at 14:50
  • @Jortega just FYI, this can be done without the heavy guns of selenium. Commented Mar 19, 2021 at 15:27
  • @Panzerotto remember to mark the answer that solves your issue. See stackoverflow.com/help/someone-answers Commented Mar 19, 2021 at 15:47

1 Answer


Those "secrets" needed for the headers are actually embedded in a <script> tag. So you can fish them out, parse them into JSON, and use them in the request headers.
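To make that extraction step concrete, here's a stdlib-only sketch on a made-up stand-in for that <script> tag (the real tag's contents differ, but the regex-plus-json.loads idea is the same):

```python
import json
import re

# Made-up stand-in for the page's <script> tag: an AJAX call whose
# `headers:` object carries the credentials the JSON endpoint expects.
script = """
$.ajax({
    url: "index.php?r=json%2Ftab2",
    headers: {"X-Secret": "abc123", "X-Token": "xyz"},
    success: render
});
"""

# Capture the {...} object after `headers:` and parse it as JSON.
match = re.search(r"headers:\s*({.*?}),", script, re.S)
secrets = json.loads(match.group(1))
print(secrets)  # {'X-Secret': 'abc123', 'X-Token': 'xyz'}
```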

Here's how:

import json
import re

import requests
from bs4 import BeautifulSoup

headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/89.0.4389.90 Safari/537.36",
    "x-requested-with": "XMLHttpRequest",
}

with requests.Session() as s:
    end_point = "https://Agenas:tab2-19@www.agenas.gov.it/covid19/web/index.php?r=json%2Ftab2"
    regular_page = "https://www.agenas.gov.it/covid19/web/index.php?r=site%2Ftab2"
    html = s.get(regular_page, headers=headers).text
    soup = BeautifulSoup(html, "html.parser").find_all("script")[-1].string
    hacked_payload = json.loads(
        re.search(r"headers:\s({.*}),", soup, re.S).group(1).strip()
    )
    headers.update(hacked_payload)
    print(json.dumps(s.get(end_point, headers=headers).json(), indent=2))

Output:

[
  {
    "regione": "Abruzzo",
    "dato1": "667",
    "dato2": "1495",
    "dato3": "89",
    "dato4": "215",
    "dato5": "0"
  },
  {
    "regione": "Basilicata",
    "dato1": "164",
    "dato2": "426",
    "dato3": "12",
    "dato4": "88",
    "dato5": "13"
  },
  and so on ...
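Note that the numeric fields arrive as strings, so cast them before doing any arithmetic. A small stdlib-only sketch, using an abridged, hard-coded copy of the response above:

```python
import json

# Abridged copy of the endpoint's response: a list of per-region records.
data = json.loads("""
[
  {"regione": "Abruzzo", "dato1": "667", "dato2": "1495", "dato3": "89", "dato4": "215", "dato5": "0"},
  {"regione": "Basilicata", "dato1": "164", "dato2": "426", "dato3": "12", "dato4": "88", "dato5": "13"}
]
""")

# The values are strings ("667", not 667), so cast before aggregating.
ricoverati_non_critica = sum(int(row["dato1"]) for row in data)
print(ricoverati_non_critica)  # 831
```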