
In Python 3, I need to scrape a site that has search options in a bar menu: http://www.cnj.jus.br/bnmp/#/pesquisar

I just need to select the item "Estado" and, within it, the option "Rio de Janeiro" (a state of Brazil with several cities), then click "Pesquisar".

The site then generates a screen with the items I need to store in a DataFrame afterwards (spread across multiple pages, with a table on each) - 53,022 items such as:

Numero: "0002274-09.2012.8.19.0002.0001"
Nome: "Bruno Da Silva"
Situacao: "Aguardando Cumprimento"
Data: "23/01/2012"
Orgao: "TJRJ"

...

And so on for the following rows and pages.

With inspect element, in the Network tab I tried to find under XHR the request carrying the JSON I want, but I only found a link with the cities (municipios) of the State of Rio de Janeiro:

import requests
import pandas as pd

url = 'http://www.cnj.jus.br/bnmp/rest/pesquisarMunicipios/RJ'
response = requests.get(url)
print(response.json())

which prints:

{'sucesso': True, 'mensagem': None, 'municipios': ['ANGRA DOS REIS', 'APERIBE', 'ARARUAMA', 'ARMACAO DOS BUZIOS', 'ARRAIAL DO CABO', 'BARRA DO PIRAI', 'BARRA MANSA', 'BELFORD ROXO', 'BOM JARDIM', 'BOM JESUS DO ITABAPOANA', 'CABO FRIO', 'CACHOEIRAS DE MACACU', 'CAMBUCI', 'CAMPOS DOS GOYTACAZES', 'CANTAGALO', 'CARAPEBUS', 'CARDOSO MOREIRA', 'CARMO', 'CASIMIRO DE ABREU', 'CONCEICAO DE MACABU', 'CORDEIRO', 'DUAS BARRAS', 'DUQUE DE CAXIAS', 'ENGENHEIRO PAULO DE FRONTIN', 'GUAPIMIRIM', 'IGUABA GRANDE', 'ITABORAI', 'ITAGUAI', 'ITALVA', 'ITAOCARA', 'ITAPERUNA', 'ITATIAIA', 'JAPERI', 'LAJE DO MURIAE', 'MACAE', 'MAGE', 'MANGARATIBA', 'MARICA', 'MENDES', 'MESQUITA', 'MIGUEL PEREIRA', 'MIRACEMA', 'NATIVIDADE', 'NILOPOLIS', 'NITEROI', 'NOVA FRIBURGO', 'NOVA IGUACU', 'PARACAMBI', 'PARAIBA DO SUL', 'PARATI', 'PATY DO ALFERES', 'PETROPOLIS', 'PINHEIRAL', 'PIRAI', 'PORCIUNCULA', 'PORTO REAL', 'QUEIMADOS', 'QUISSAMA', 'RESENDE', 'RIO BONITO', 'RIO CLARO', 'RIO DAS FLORES', 'RIO DAS OSTRAS', 'RIO DE JANEIRO', 'SANTA MARIA MADALENA', 'SANTO ANTONIO DE PADUA', 'SAO FIDELIS', 'SAO FRANCISCO DE ITABAPOANA', 'SAO GONCALO', 'SAO JOAO DA BARRA', 'SAO JOAO DE MERITI', 'SAO JOSE DO VALE DO RIO PRETO', 'SAO PEDRO DA ALDEIA', 'SAO SEBASTIAO DO ALTO', 'SAPUCAIA', 'SAQUAREMA', 'SEROPEDICA', 'SILVA JARDIM', 'SUMIDOURO', 'TERESOPOLIS', 'TRAJANO DE MORAES', 'TRAJANO DE MORAIS', 'TRES RIOS', 'VALENCA', 'VARRE-SAI', 'VASSOURAS', 'VOLTA REDONDA']}

Please, is there any way to find the created JSON of the items I want to scrape?

Or is there a better scraping strategy?

1 Answer


I was able to see the right XHR request in the Network tab of Chrome's developer tools. I had the "Preserve log" option checked, which may be why I could see it when you couldn't.

I found it by starting at http://www.cnj.jus.br/bnmp/#/pesquisar, then selecting an estado, clicking Pesquisar and then checking the network logs.

It looks like you need to make a POST request to http://www.cnj.jus.br/bnmp/rest/pesquisar. You'll also need to edit the payload to include the state and the page you need.

So it should look like this:

payload = {
    "criterio": {
        "orgaoJulgador": {"uf": "AC", "municipio": "", "descricao": ""},
        "orgaoJTR": {},
        "parte": {"documentos": [{"identificacao": ""}]}
    },
    "paginador": {"paginaAtual": 2},
    "fonetica": "true",
    "ordenacao": {"porNome": False, "porData": False}
}

url = 'http://www.cnj.jus.br/bnmp/rest/pesquisar'
r = requests.post(url, json=payload)
print(r.status_code)
print(r.json())
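Building on that, the pagination could be driven by a small helper that fills in the `uf` and the page number. This is only a sketch assuming the endpoint and payload shape shown above; the helper names (`build_payload`, `fetch_page`) are mine, not part of the site's API, and the response's field names need to be checked against the real JSON:

```python
import requests

URL = 'http://www.cnj.jus.br/bnmp/rest/pesquisar'

def build_payload(uf, page):
    """Search payload for one state (uf) and one results page."""
    return {
        "criterio": {
            "orgaoJulgador": {"uf": uf, "municipio": "", "descricao": ""},
            "orgaoJTR": {},
            "parte": {"documentos": [{"identificacao": ""}]},
        },
        "paginador": {"paginaAtual": page},
        "fonetica": "true",
        "ordenacao": {"porNome": False, "porData": False},
    }

def fetch_page(uf, page):
    """POST one page of search results and return the parsed JSON."""
    r = requests.post(URL, json=build_payload(uf, page))
    r.raise_for_status()  # surface HTTP errors instead of parsing an error page
    return r.json()

# Example: print(fetch_page('RJ', 1)) -- inspect one page first to learn
# the response's field names before looping over all pages.
```

Once the field names in the response are known, each page's records can be appended to a list and turned into a DataFrame with `pd.DataFrame(records)` at the end.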

4 Comments

Thank you, it worked. I did an iteration to get the data from all the pages, but at page 304 this connection error appeared: ConnectionError: HTTPConnectionPool(host='www.cnj.jus.br', port=80): Max retries exceeded with url: /bnmp/rest/pesquisar (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fc05f9c6da0>: Failed to establish a new connection: [Errno -2] Name or service not known',))
Do I need to set a stop interval?
Yeah, I'd set one. I usually use a random number generator for the intervals. You'll have to play around with it a bit to figure out what works for that site.
Thank you. I put a question here about this: stackoverflow.com/questions/49806225/…
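The random-interval suggestion from the comments could be sketched like this. The retry count and wait bounds are arbitrary assumptions to tune against the site; the helper catches `OSError` because requests' `ConnectionError` derives from it:

```python
import random
import time

def call_with_backoff(fn, retries=5, min_wait=1.0, max_wait=4.0):
    """Call fn(), retrying on OSError (which covers requests'
    ConnectionError) with a random pause between attempts."""
    for attempt in range(retries):
        try:
            return fn()
        except OSError:
            if attempt == retries - 1:
                raise  # give up after the last attempt
            # Random interval so requests don't hit the server at a fixed rate.
            time.sleep(random.uniform(min_wait, max_wait))
```

Each page fetch would then be wrapped as `call_with_backoff(lambda: requests.post(url, json=payload))`.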
