I am trying to get the title and the link of every article from this site.

The data of interest is loaded with javascript after some time in json response.

    var ltcom = 'TEFURVJDRVJB';
    var ltpapaer = 'TFRQQVBFUg==';
    var bender = new Canela.tool.Bender('searchBox', ltcom, {
        replaceImg: 'http://resource.latercera.com/2015/css/img/bx_loader.gif',
        objectId: 'contentId',
        hl: 'abstract',
        taxonomyId: '24',
        ajaxTpl: true,
        targets: {
            rowsContainer: 'result',
            pageContainer: 'pages',
            resumeContainer: 'resume'
        },
        parameters: {
            type: 'CONTENT',
            fq: 'taxonomyId:24 AND status:2 AND launchDate:[2008-05-31T23:59:59.999Z TO NOW]',
            sort: 'launchDate desc',
            rows: 15
        },
        templates: {
            rowTpl: '/index/tpl/rowTpl.html',
            rowContainerTpl: '/index/tpl/rowContainerTpl.html',
            pageTpl: '/index/tpl/pageTpl.html',
            pageContainerTpl: '/index/tpl/pageContainerTpl.html',
            resumeTpl: '/index/tpl/resumeTpl.html'
        }

I tried a Selenium approach, but with no success.

Current code:

    import requests

    url = "http://www.latercera.com/app/application"
    data = {
        'action': 'searchSolr',
        'type': 'CONTENT',
        'siteCode': 'TEFURVJDRVJB',
        'fq': 'taxonomyId:24 AND status:2 AND launchDate:[2008-05-31T23:59:59.999Z TO NOW]',
        'indent': 'on',
        'wt': 'json',
        'qt': 'default',
        'sort': 'launchDate desc',
        'start': '0',
        'rows': '15',
        'q': 'enersis'
    }
    print(requests.get(url, data=data).text)

requests.get(url, data=data) spits out 200.

Is there a need to use some header info? How should I move forward with this? Thanks in advance!

  • If "The data of interest is loaded with javascript after some time in json response", then couldn't you directly get the data from there? I don't know what you're after, but usually if a public website can access it, you can, too. Open Google Chrome developer tools (F12) and navigate to the Network tab. You can see all requests and their responses and contents. You can get the URL and params from there. (Not an answer because I don't know if your data is publicly accessible or not. If it helps, I can make it into an answer.) Commented Sep 25, 2016 at 22:04
  • I am trying to get the title and the link of every article. Take a look here: latercera.com/resultadoBusqueda.html?q=enersis. 'print (requests.post(url, data=data).text)' doesn't return anything either. Commented Sep 25, 2016 at 22:42

1 Answer

You need a Referer header:

    import requests

    headers = {"Referer": "http://www.latercera.com/resultadoBusqueda.html?q=enersis"}
    data = {
        "type": 'CONTENT',
        "fq": 'taxonomyId:24 AND status:2 AND launchDate:[2008-05-31T23:59:59.999Z TO NOW]',
        "sort": 'launchDate desc',
        "rows": 15,
        "siteCode": 'TEFURVJDRVJB',
        "q": "enersis",
        "action": "searchSolr"
    }
    # POST the search form data together with the Referer header.
    r = requests.post("http://www.latercera.com/app/application", data=data, headers=headers)
    print(r)
    print(r.content)

That gives you all the data in XML format, which we can parse with bs4:

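    from bs4 import BeautifulSoup

    # The "xml" parser requires lxml to be installed.
    soup = BeautifulSoup(r.content, "xml")
    # Pair each title with the text of the url element under the same parent.
    print([(s.text, s.parent.select_one("arr[name=url]").text)
           for s in soup.select("arr[name=n_title]")])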

That returns:

    [(u'Las advertencias de la SEC al proceso de fusi\xf3n que Enel impulsa en Enersis', u'/noticia/negocios/2016/09/655-697979-9-las-advertencias-de-la-sec-al-proceso-de-fusion-que-enel-impulsa-en-enersis.shtml'),
     (u'Enersis Am\xe9ricas aclar\xf3 mejor\xeda en precio de OPA', u'/noticia/negocios/2016/09/655-695035-9-enersis-americas-aclaro-mejoria-en-precio-de-opa.shtml'),
     (u'Enersis Am\xe9ricas mejora precio de OPA y fija fecha en proceso de fusi\xf3n de activos', u'/noticia/negocios/2016/09/655-694868-9-enersis-americas-mejora-precio-de-opa-y-fija-fecha-en-proceso-de-fusion-de.shtml/noticia/portada/2016/09/653-694868-9-enersis-americas-mejora-precio-de-opa-y-fija-fecha-en-proceso-de-fusion-de.shtml'),
     (u'Nicola Cotugno, gerente general Enersis Chile: "queremos mantener el liderazgo, #7;pero no s\xf3lo con nueva capacidad"', u'/noticia/negocios/2016/08/655-694297-9-nicola-cotugno-gerente-general-enersis-chile-queremos-mantener-el-liderazgo-pero.shtml'),
     (u'El buen momento de las firmas chilenas en la bolsa de EE.UU.', u'/noticia/negocios/2016/08/655-692187-9-el-buen-momento-de-las-firmas-chilenas-en-la-bolsa-de--eeuu.shtml/noticia/portada/2016/08/653-692187-9-el-buen-momento-de-las-firmas-chilenas-en-la-bolsa-de--eeuu.shtml'),
     (u'Estiman en US$ 145 millones el costo en que incurrir\xe1 Enersis Am\xe9ricas por fusi\xf3n', u'/noticia/negocios/2016/08/655-691649-9-estiman-en-us-145-millones-el-costo-en-que-incurrira-enersis-americas-por-fusion.shtml/noticia/portada/2016/08/653-691649-9-estiman-en-us-145-millones-el-costo-en-que-incurrira-enersis-americas-por-fusion.shtml'),
     (u'Los pasos que seguir\xe1 el cambio de imagen de Enersis', u'/noticia/negocios/2016/08/655-691378-9-los-pasos-que-seguira-el-cambio-de-imagen-de-enersis.shtml/noticia/portada/2016/08/653-691378-9-los-pasos-que-seguira-el-cambio-de-imagen-de-enersis.shtml'),
     (u'Enersis, Endesa y Chilectra cambiar\xedan de nombre para unificarse bajo marca Enel', u'/noticia/negocios/2016/08/655-691270-9-enersis-endesa-y-chilectra-cambiarian-de-nombre-para-unificarse-bajo-marca-enel.shtml'),
     (u'El nuevo dilema de Enel: sepultar #7;las marcas Enersis, Endesa y Chilectra', u'/noticia/negocios/2016/07/655-690896-9-el-nuevo-dilema-de-enel-sepultar-las-marcas-enersis-endesa-y-chilectra.shtml'),
     (u'SVS niega m\xe1s plazo a Enersis Am\xe9ricas para fusi\xf3n y complica a el\xe9ctrica en EE.UU.', u'/noticia/negocios/2016/07/655-687234-9-svs-niega-mas-plazo-a-enersis-americas-para-fusion-y-complica-a-electrica-en.shtml/noticia/portada/2016/07/653-687234-9-svs-niega-mas-plazo-a-enersis-americas-para-fusion-y-complica-a-electrica-en.shtml'),
     (u'Rafael Fern\xe1ndez: CMPC #7;tiene un buen compliance y colusi\xf3n fue un "accidente"', u'/noticia/negocios/2016/06/655-685371-9-rafael-fernandez-cmpc-tiene-un-buen-compliance-y-colusion-fue-un-accidente.shtml/noticia/portada/2016/06/653-685371-9-rafael-fernandez-cmpc-tiene-un-buen-compliance-y-colusion-fue-un-accidente.shtml'),
     (u'The Panama Papers: las sociedades que Mossack Fonseca cre\xf3 para los protagonistas del "Caso Chispas"', u'/noticia/nacional/2016/05/680-679986-9-the-panama-papers-las-sociedades-que-mossack-fonseca-creo-para-los-protagonistas.shtml/noticia/portada/2016/05/653-679986-9-the-panama-papers-las-sociedades-que-mossack-fonseca-creo-para-los-protagonistas.shtml/noticia/despliegue/canal/epigrafe-destacado-rojo/2016/05/3032-679986-9-the-panama-papers-las-sociedades-que-mossack-fonseca-creo-para-los-protagonistas.shtml/noticia/despliegue/home/epigrafe-destacado-rojo/2016/05/3038-679986-9-the-panama-papers-las-sociedades-que-mossack-fonseca-creo-para-los-protagonistas.shtml'),
     (u'Enersis Am\xe9ricas inicia proceso de fusi\xf3n con Endesa Am\xe9ricas y Chilectra Am\xe9ricas', u'/noticia/negocios/2016/05/655-679596-9-enersis-americas-inicia-proceso-de-fusion-con-endesa-americas-y-chilectra.shtml'),
     (u'Herman Chadwick Pi\xf1era es elegido como nuevo presidente de Enersis Chile', u'/noticia/negocios/2016/04/655-678650-9-herman-chadwick-pinera-es-elegido-como-nuevo-presidente-de-enersis-chile.shtml/noticia/portada/2016/04/653-678650-9-herman-chadwick-pinera-es-elegido-como-nuevo-presidente-de-enersis-chile.shtml'),
     (u'Daniel Fern\xe1ndez deja Enersis tras fin de primera etapa en proceso de reestructuraci\xf3n', u'/noticia/negocios/2016/04/655-678440-9-daniel-fernandez-deja-el-cargo-de-country-manager-de-enersis.shtml/noticia/tamano-contenedor/home/col9/2016/04/3057-678440-9-daniel-fernandez-deja-el-cargo-de-country-manager-de-enersis.shtml/noticia/portada/2016/04/653-678440-9-daniel-fernandez-deja-el-cargo-de-country-manager-de-enersis.shtml')]
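
Several of the links in that output are multiple paths run together, which suggests the url element holds more than one child and .text concatenates them. Below is a rough Python 3 sketch of one way to keep only the first link per article and make it absolute. It swaps bs4 for the standard library's xml.etree.ElementTree, and it assumes a Solr-style layout (doc elements containing arr elements with str children). That layout, the SAMPLE_XML data, the function names and the idea that the endpoint honors 'start' for paging (it appears in the question's payload) are all assumptions, not something confirmed against the live site.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

BASE_URL = "http://www.latercera.com"

# Form fields taken from the answer's request; 'q', 'start' and 'rows' are added per page.
SEARCH_PAYLOAD = {
    "type": "CONTENT",
    "fq": "taxonomyId:24 AND status:2 AND launchDate:[2008-05-31T23:59:59.999Z TO NOW]",
    "sort": "launchDate desc",
    "siteCode": "TEFURVJDRVJB",
    "action": "searchSolr",
}

# Hypothetical sample that mirrors the ASSUMED structure of the real response.
SAMPLE_XML = """<response>
  <result name="response" numFound="2" start="0">
    <doc>
      <arr name="n_title"><str>Title one</str></arr>
      <arr name="url">
        <str>/noticia/negocios/one.shtml</str>
        <str>/noticia/portada/one.shtml</str>
      </arr>
    </doc>
    <doc>
      <arr name="n_title"><str>Title two</str></arr>
      <arr name="url"><str>/noticia/negocios/two.shtml</str></arr>
    </doc>
  </result>
</response>"""


def extract_articles(xml_text, base_url=BASE_URL):
    """Return (title, absolute_url) pairs, keeping only the first URL of each doc."""
    root = ET.fromstring(xml_text)
    articles = []
    for doc in root.iter("doc"):
        # find() returns the first matching <str>, so extra URLs are ignored.
        title = doc.find("arr[@name='n_title']/str")
        url = doc.find("arr[@name='url']/str")
        if title is None or url is None:
            continue  # skip docs that lack a title or a link
        articles.append(((title.text or "").strip(),
                         urljoin(base_url, (url.text or "").strip())))
    return articles


def page_payload(query, page, rows=15):
    """Build form data for a zero-based page (ASSUMES the endpoint honors 'start')."""
    payload = dict(SEARCH_PAYLOAD)
    payload.update({"q": query, "start": page * rows, "rows": rows})
    return payload


if __name__ == "__main__":
    for title, link in extract_articles(SAMPLE_XML):
        print(title, link)
```

With the real response you would call extract_articles(r.content) instead of passing the sample, and, if paging works as assumed, post page_payload("enersis", n) with the same Referer header for n = 0, 1, 2, ... until a page comes back empty.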