Failed to crawl element of specific website with scrapy spider

Question

I want to get the website addresses of some jobs, so I wrote a Scrapy spider. I want to get all of the values matching the XPath //article/dl/dd/h2/a[@class="job-title"]/@href, but when I run the spider with the command:

scrapy crawl auseek -a addsthreshold=3

the variable "urls", which should hold those values, is empty. Can someone help me figure out why?

here is my code:

from scrapy.contrib.spiders import CrawlSpider,Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.conf import settings
from scrapy.mail import MailSender
from scrapy.xlib.pydispatch import dispatcher
from scrapy.exceptions import CloseSpider
from scrapy import log
from scrapy import signals
from myProj.items import ADItem
import time

class AuSeekSpider(CrawlSpider):
    name = "auseek"
    result_address = []
    addressCount = int(0)
    addressThresh = int(0)
    allowed_domains = ["seek.com.au"]
    start_urls = [
        "http://www.seek.com.au/jobs/in-australia/"
    ]

    def __init__(self,**kwargs):
        super(AuSeekSpider, self).__init__()
        self.addressThresh = int(kwargs.get('addsthreshold'))
        print 'init finished...'

    def parse_start_url(self,response):
        print 'This is start url function'
        log.msg("Pipeline.spider_opened called", level=log.INFO)
        hxs = Selector(response)
        urls = hxs.xpath('//article/dl/dd/h2/a[@class="job-title"]/@href').extract()
        print 'urls is:',urls
        print 'test element:',urls[0].encode("ascii")
        for url in urls:
            postfix = url.getAttribute('href')
            print 'postfix:',postfix
            url = urlparse.urljoin(response.url,postfix)
            yield Request(url, callback = self.parse_ad)
        return

    def parse_ad(self, response):
        print 'this is parse_ad function'
        hxs = Selector(response)
        item = ADItem()
        log.msg("Pipeline.parse_ad called", level=log.INFO)
        item['name'] = str(self.name)
        item['picNum'] = str(6)
        item['link'] = response.url
        item['date'] = time.strftime('%Y%m%d',time.localtime(time.time()))
        self.addressCount = self.addressCount + 1
        if self.addressCount > self.addressThresh:
            raise CloseSpider('Get enough website address')
        return item

The problem is:

urls = hxs.xpath('//article/dl/dd/h2/a[@class="job-title"]/@href').extract()

urls is empty when I print it out. I can't figure out why it doesn't work or how to correct it. Thanks for your help.

Community · Accepted Answer · 2017-05-23 12:29:26Z

Here is a working example that uses Selenium with the headless PhantomJS webdriver in a downloader middleware.

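from scrapy.http import HtmlResponse
from selenium import webdriver

class JsDownload(object):

    @check_spider_middleware
    def process_request(self, request, spider):
        # Render the page in a headless browser so JavaScript-generated markup is present.
        driver = webdriver.PhantomJS(executable_path=r'D:\phantomjs.exe')
        try:
            driver.get(request.url)
            body = driver.page_source.encode('utf-8')
        finally:
            # Shut the browser process down so it does not leak on every request.
            driver.quit()
        return HtmlResponse(request.url, encoding='utf-8', body=body)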

I wanted the ability to tell different spiders which middleware to use, so I implemented this wrapper:

import functools
from scrapy import log

def check_spider_middleware(method):
    @functools.wraps(method)
    def wrapper(self, request, spider):
        msg = '%%s %s middleware step' % (self.__class__.__name__,)
        # Run the middleware only for spiders that list its class.
        if self.__class__ in spider.middleware:
            spider.log(msg % 'executing', level=log.DEBUG)
            return method(self, request, spider)
        else:
            spider.log(msg % 'skipping', level=log.DEBUG)
            return None
    return wrapper

settings.py:

DOWNLOADER_MIDDLEWARES = {'MyProj.middleware.MiddleWareModule.MiddleWareClass': 500}

For the wrapper to work, all spiders must have at minimum:

middleware = set([])

To include a middleware:

middleware = set([MyProj.middleware.ModuleName.ClassName])

You could have implemented this in a request callback (in the spider), but then the HTTP request would happen twice. This isn't a foolproof solution, but it works for content that loads on .ready(). If you spend some time reading up on Selenium, you can wait for specific events to trigger before saving the page source.
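As a rough illustration of waiting before saving the page source, here is a minimal sketch in Python 3. It is not part of the middleware above. The helper name wait_for_page_source and its parameters are made up for this example, and it only assumes a driver object that exposes find_elements(by, value) and page_source, as Selenium WebDriver instances do. Selenium also ships its own WebDriverWait helper for this purpose, which is probably the better choice in real code.

```python
import time


def wait_for_page_source(driver, xpath, timeout=10.0, poll_interval=0.5):
    """Poll until `xpath` matches at least one element, then return the page source.

    Hypothetical helper: `driver` is assumed to be a Selenium WebDriver, or any
    object exposing `find_elements(by, value)` and `page_source`.
    """
    deadline = time.time() + timeout
    while True:
        # "xpath" is the locator strategy string behind selenium's By.XPATH.
        if driver.find_elements("xpath", xpath):
            return driver.page_source
        if time.time() >= deadline:
            raise RuntimeError("timed out waiting for %r" % xpath)
        time.sleep(poll_interval)
```

Inside process_request, one could call wait_for_page_source(driver, '//a[@class="job-title"]') after driver.get(request.url) instead of reading driver.page_source immediately.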

Another example: https://github.com/scrapinghub/scrapyjs

More info: What's the best way of scraping data from a website?

Cheers!

Fabricator · Accepted Answer · 2014-06-26 06:21:22Z

Scrapy does not evaluate JavaScript. If you run the following command, you will see that the raw HTML does not contain the anchors you are looking for.

curl http://www.seek.com.au/jobs/in-australia/ | grep job-title

You should try PhantomJS or Selenium instead.
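The curl check above can also be sketched in Python 3 against any HTML string, which makes the difference between static and rendered markup easy to see. This is only an illustration: JobLinkFinder and find_job_links are hypothetical names, and both sample pages are invented markup, not copies of the real site.

```python
from html.parser import HTMLParser


class JobLinkFinder(HTMLParser):
    """Collect href values of <a> tags whose class list contains "job-title"."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Unlike the XPath @class="job-title", this also accepts extra classes.
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "job-title" in classes and attrs.get("href"):
            self.links.append(attrs["href"])


def find_job_links(html):
    finder = JobLinkFinder()
    finder.feed(html)
    return finder.links


# Invented markup: what a JavaScript-driven page might look like before and after rendering.
static_page = '<html><body><div id="results"></div><script src="app.js"></script></body></html>'
rendered_page = '<article><dl><dd><h2><a class="job-title" href="/job/1">Dev</a></h2></dd></dl></article>'

print(find_job_links(static_page))    # prints []
print(find_job_links(rendered_page))  # prints ['/job/1']
```

An empty result on the raw response body is the same symptom as the empty urls list in the question.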

After examining the network requests in Chrome, the job listings appear to come from this JSONP request. It should be easy to retrieve whatever you need from it.

edited Jun 26, 2014 at 6:21

answered Jun 26, 2014 at 6:15

4 Comments

eric Over a year ago

Yes, this JSONP request includes the IDs I need, but how can I get the address of the request from within my spider code?

Fabricator Over a year ago

@eric, can't you just replace the original url with the new one?

eric Over a year ago

Yes, I can just copy the address, but I found that the request address changes each time I refresh the page, so maybe it's just a temporary address that will expire at some point. I need to crawl the website every day, so I need a permanent address or a specific method to find the address of the JSON file. Looking forward to your help :)

Fabricator Over a year ago

@eric, among the JSONP request parameters, callback can be any random value, usersessionid can also be any random value, and _ holds the request timestamp; everything else can be changed to fit your query.
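To make the last comment concrete, here is a minimal Python 3 sketch of building such a request URL and unwrapping the JSONP response. Only the parameter names callback, usersessionid and _ come from the comment. The function names, the endpoint passed in, any other query parameters, and the choice of milliseconds for the timestamp are assumptions for illustration; the real endpoint has to be copied from the browser's network tab.

```python
import json
import random
import re
import time
from urllib.parse import urlencode


def build_jsonp_url(endpoint, query):
    """Append freshly generated callback, usersessionid and _ values to `query`."""
    params = dict(query)
    params["callback"] = "cb%d" % random.randint(0, 10 ** 9)      # any random value
    params["usersessionid"] = "%032x" % random.getrandbits(128)   # any random value
    params["_"] = str(int(time.time() * 1000))                    # timestamp, assumed to be in ms
    return endpoint + "?" + urlencode(params)


def parse_jsonp(text):
    """Strip the callback wrapper, e.g. cb123({...});, and decode the JSON payload."""
    match = re.match(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", text, re.DOTALL)
    if match is None:
        raise ValueError("not a JSONP response")
    return json.loads(match.group(1))
```

In a spider, the URL returned by build_jsonp_url could be used as the start URL on each run, with parse_jsonp(response.body.decode("utf-8")) called in the callback, so nothing depends on an address copied from an earlier browser session.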
