LXML unable to retrieve webpage with error "failed to load HTTP resource"

Question

Hi, I tried opening the link below in a browser and it works, but it does not work from my code. The URL is built from a news site's base address plus the article path, which is read from another file, url.txt. I tried the code with a normal website (www.google.com) and it works perfectly.

import sys
import MySQLdb
from mechanize import Browser
from bs4 import BeautifulSoup, SoupStrainer
from nltk import word_tokenize
from nltk.tokenize import *
import urllib2
import nltk, re, pprint
import mechanize  # html form filling
import lxml.html

with open("url.txt", "r") as f:
    first_line = f.readline()
#print first_line
url = "http://channelnewsasia.com/&s" + (first_line)
t = lxml.html.parse(url)
print t.find(".//title").text

And this is the error I am getting.

And this is the content of url.txt

/news/asiapacific/australia-to-send-armed/1284790.html

alecxe · Accepted Answer · 2014-07-29 03:01:44Z


This is because of the &s part of the URL, which is not needed:

url = "http://channelnewsasia.com" + first_line

Also, URL parts are better joined using urljoin():

from urlparse import urljoin
import lxml.html

BASE_URL = "http://channelnewsasia.com"

with open("url.txt") as f:
    first_line = f.readline()

url = urljoin(BASE_URL, first_line)
t = lxml.html.parse(url)
print t.find(".//title").text

prints:

Australia to send armed personnel to MH17 site - Channel NewsAsia
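
To make the difference concrete, here is a minimal offline sketch (no network access) comparing the original string concatenation with urljoin(). The path is the one from url.txt; the try/except import is an addition so that the snippet runs on both Python 2 and Python 3, whereas the code above targets Python 2 only.

```python
try:
    from urllib.parse import urljoin  # Python 3 location
except ImportError:
    from urlparse import urljoin      # Python 2 location, as used in the answer

BASE_URL = "http://channelnewsasia.com"
# Path as stored in url.txt (without the trailing newline).
path = "/news/asiapacific/australia-to-send-armed/1284790.html"

# Original approach: the stray "/&s" ends up inside the URL path.
broken = "http://channelnewsasia.com/&s" + path

# urljoin() resolves the path against the base URL.
fixed = urljoin(BASE_URL, path)

print(broken)
print(fixed)
# second line printed:
# http://channelnewsasia.com/news/asiapacific/australia-to-send-armed/1284790.html
```

Printing both values before calling lxml.html.parse() is a cheap way to check what is actually being requested.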

5 Comments

Jmo Over a year ago

Hi thanks man! However I'm still getting this error with your codes. tinypic.com/r/al2l5e/8 Is there smth wrong on my side?

alecxe Over a year ago

@Jmo just a simple check, does replacing first_line = f.readline() with first_line = f.readline().strip() help? If not, could you print out the url variable before calling parse() so that I can see the actual URL being retrieved and parsed? Thanks.
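
The reason strip() is worth trying: readline() returns the line together with its trailing newline, so the newline becomes part of the URL. A small sketch of this, using io.StringIO as a stand-in for url.txt (the in-memory file is an assumption for illustration only; the real code reads from disk):

```python
import io

try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2

BASE_URL = "http://channelnewsasia.com"

# Stand-in for url.txt: one line ending with a newline, as a text editor would save it.
fake_file = io.StringIO(u"/news/asiapacific/australia-to-send-armed/1284790.html\n")

raw_line = fake_file.readline()   # keeps the trailing "\n"
clean_line = raw_line.strip()     # removes leading/trailing whitespace

print(repr(raw_line))             # repr() makes the hidden newline visible
print(repr(clean_line))

# Build the URL from the cleaned line only.
url = urljoin(BASE_URL, clean_line)
print(url)
```

Using repr() when printing is the important part here, since a plain print would hide the trailing newline.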

Jmo Over a year ago

Yup it works now, thanks! Is there any way I can get the full content of the article also?

alecxe Over a year ago

@Jmo sure, print '\n'.join(t.xpath('.//div[@class="news_detail"]/div/p/text()')).
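
For anyone who wants to try that XPath expression without hitting the site, here is a rough sketch run against a made-up HTML snippet. The markup below is hypothetical: only the news_detail class name comes from the comment above, and the real page layout may differ or have changed since.

```python
import lxml.html

# Hypothetical markup imitating the structure the XPath expects:
# div.news_detail > div > p
SAMPLE_HTML = u"""
<html><head><title>Sample article</title></head>
<body>
  <div class="news_detail">
    <div>
      <p>First paragraph.</p>
      <p>Second paragraph.</p>
    </div>
  </div>
</body></html>
"""

# fromstring() parses an in-memory string; parse() in the answer takes a URL or file.
doc = lxml.html.fromstring(SAMPLE_HTML)

# text() selects the text nodes that are direct children of each <p>.
paragraphs = doc.xpath('.//div[@class="news_detail"]/div/p/text()')
article_text = "\n".join(paragraphs)
print(article_text)
```

Note that text() only returns text sitting directly inside each p element; text wrapped in nested tags such as links would be skipped, in which case calling text_content() on each p element may be a better fit.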

Jmo Over a year ago

Btw man is there a way of text analysis whereby I'm able to isolate the time/location from the article?
