0

Hi so I tried opening the link below in a browser and it works but not in the code. The link is actually a combination of a news site and then the extension of the article called from another file url.txt. I tried the code with a normal website (www.google.com) and it works perfectly.

import sys import MySQLdb from mechanize import Browser from bs4 import BeautifulSoup, SoupStrainer from nltk import word_tokenize from nltk.tokenize import * import urllib2 import nltk, re, pprint import mechanize #html form filling import lxml.html with open("url.txt","r") as f: first_line = f.readline() #print first_line url = "http://channelnewsasia.com/&s" + (first_line) t = lxml.html.parse(url) print t.find(".//title").text 

And this is the error I am getting.

enter image description here

And this is the content of url.txt

/news/asiapacific/australia-to-send-armed/1284790.html

1 Answer 1

1

This is because of the &s part of the url - it is definitely not needed:

url = "http://channelnewsasia.com" + first_line 

Also, url parts are better be joined using urljoin():

from urlparse import urljoin import lxml.html BASE_URL = "http://channelnewsasia.com" with open("url.txt") as f: first_line = f.readline() url = urljoin(BASE_URL, first_line) t = lxml.html.parse(url) print t.find(".//title").text 

prints:

Australia to send armed personnel to MH17 site - Channel NewsAsia 
Sign up to request clarification or add additional context in comments.

5 Comments

Hi thanks man! However I'm still getting this error with your codes. tinypic.com/r/al2l5e/8 Is there smth wrong on my side?
@Jmo just a simple check, does replacing first_line = f.readline() with first_line = f.readline().strip() help? If not, could you print out url variable value before calling parse() so that I can see what the actual url is being retrieved and parsed? Thanks.
Yup it works now thanks! Is there anyway I can get the full content of the article also?
@Jmo sure, print '\n'.join(t.xpath('.//div[@class="news_detail"]/div/p/text()')).
Btw man is there a way of text analysis whereby I'm able to isolate the time/location from the article?

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.