Optimise scraping and requesting web pages

Question

How should I optimise the time spent making requests?

    link = [
        'http://youtube.com/watch?v=JfLt7ia_mLg',
        'http://youtube.com/watch?v=RiYRxPWQnbE',
        'http://youtube.com/watch?v=tC7pBOPgqic',
        'http://youtube.com/watch?v=3EXl9xl8yOk',
        'http://youtube.com/watch?v=3vb1yIBXjlM',
        'http://youtube.com/watch?v=8UBY0N9fWtk',
        'http://youtube.com/watch?v=uRPf9uDplD8',
        'http://youtube.com/watch?v=Coattwt5iyg',
        'http://youtube.com/watch?v=WaprDDYFpjE',
        'http://youtube.com/watch?v=Pm5B-iRlZfI',
        'http://youtube.com/watch?v=op3hW7tSYCE',
        'http://youtube.com/watch?v=ogYN9bbU8bs',
        'http://youtube.com/watch?v=ObF8Wz4X4Jg',
        'http://youtube.com/watch?v=x1el0wiePt4',
        'http://youtube.com/watch?v=kkeMYeAIcXg',
        'http://youtube.com/watch?v=zUdfNvqmTOY',
        'http://youtube.com/watch?v=0ONtIsEaTGE',
        'http://youtube.com/watch?v=7QedW6FcHgQ',
        'http://youtube.com/watch?v=Sb33c9e1XbY',
    ]

I have a list of 15-20 links from the first page of YouTube search results. The task is to get the likes, dislikes and view count from each video URL, and this is what I have done:

    import threading
    import time

    import bs4
    import requests

    def parse(url, i, arr):
        req = requests.get(url)
        soup = bs4.BeautifulSoup(req.text, "lxml")  # , 'html5lib')
        try:
            likes = int(soup.find("button", attrs={"title": "I like this"}).getText().__str__().replace(",", ""))
        except Exception:  # element missing or text not numeric
            likes = 0
        try:
            dislikes = int(soup.find("button", attrs={"title": "I dislike this"}).getText().__str__().replace(",", ""))
        except Exception:
            dislikes = 0
        try:
            view = int(soup.find("div", attrs={"class": "watch-view-count"}).getText().__str__().split()[0].replace(",", ""))
        except Exception:
            view = 0
        arr[i] = (likes, dislikes, view, url)
        time.sleep(0.3)

    def parse_list(link):
        arr = len(link) * [0]
        threadarr = len(link) * [0]
        a = time.perf_counter()  # wall-clock timer; time.clock() was removed in Python 3.8
        # Start one thread per URL, then wait for all of them.
        for i in range(len(link)):
            threadarr[i] = threading.Thread(target=parse, args=(link[i], i, arr))
            threadarr[i].start()
        for i in range(len(link)):
            threadarr[i].join()
        print(time.perf_counter() - a)
        return arr

    arr = parse_list(link)

Now I am getting the populated result array in about 6 seconds. Is there a faster way to get my array (arr), so that it takes noticeably less than 6 seconds?

The first 4 elements of my array look like this, to give you a rough idea:

    [(105, 11, 2836, 'http://youtube.com/watch?v=JfLt7ia_mLg'),
     (32, 18, 5420, 'http://youtube.com/watch?v=RiYRxPWQnbE'),
     (45, 3, 7988, 'http://youtube.com/watch?v=tC7pBOPgqic'),
     (106, 38, 4968, 'http://youtube.com/watch?v=3EXl9xl8yOk')]

Thanks in advance :)

If your code works but you're looking for improvements, you should ask your question on CodeReview.
– Andersson, Commented Aug 25, 2017 at 5:47

Philippe Oger · Accepted Answer · 2017-08-25 13:55:27Z

I would use a multiprocessing Pool object for that particular case.

    import requests
    import bs4
    from multiprocessing import Pool, cpu_count

    links = [
        'http://youtube.com/watch?v=JfLt7ia_mLg',
        'http://youtube.com/watch?v=RiYRxPWQnbE',
        'http://youtube.com/watch?v=tC7pBOPgqic',
        'http://youtube.com/watch?v=3EXl9xl8yOk'
    ]

    def parse_url(url):
        req = requests.get(url)
        soup = bs4.BeautifulSoup(req.text, "lxml")  # , 'html5lib')
        try:
            likes = int(soup.find("button", attrs={"title": "I like this"}).getText().__str__().replace(",", ""))
        except Exception:  # element missing or text not numeric
            likes = 0
        try:
            dislikes = int(soup.find("button", attrs={"title": "I dislike this"}).getText().__str__().replace(",", ""))
        except Exception:
            dislikes = 0
        try:
            view = int(soup.find("div", attrs={"class": "watch-view-count"}).getText().__str__().split()[0].replace(",", ""))
        except Exception:
            view = 0
        return (likes, dislikes, view, url)

    if __name__ == "__main__":  # guard needed where worker processes are spawned (e.g. Windows)
        # cpu_count must be called; passing the function object itself raises a TypeError
        with Pool(cpu_count()) as pool:  # number of processes
            data = pool.map(parse_url, links)  # this is where your results are

This is cleaner as you only have one function to write and you end up with exactly the same results.
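Because the work here is mostly waiting on the network, a thread pool is another option worth trying. The following is only a sketch and not part of the original answer: `fetch_all` is a hypothetical helper name, and `fetch` stands for any one-argument function such as `parse_url` above. It relies on `ThreadPoolExecutor.map` returning results in the same order as the input.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=20):
    """Call fetch(url) for every URL concurrently; results keep the order of urls."""
    if not urls:
        return []
    # Never start more threads than there are URLs to process.
    workers = min(max_workers, len(urls))
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(fetch, urls))


if __name__ == "__main__":
    # Stand-in for parse_url so the sketch runs without network access.
    def fake_parse(url):
        return (0, 0, 0, url)

    print(fetch_all(["http://example.com/a", "http://example.com/b"], fake_parse))
```

With the code from this answer, `data = fetch_all(links, parse_url)` would be the equivalent call. Whether it beats the process pool depends on how much of the time goes into HTML parsing rather than waiting for responses, so it is worth timing both.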

Error: TypeError: '<' not supported between instances of 'method' and 'int'

SIM · Accepted Answer · 2017-08-25 17:50:17Z

This is not a workaround, but it spares your script the try/except blocks, which play some part in slowing the operation down.

    import requests
    from bs4 import BeautifulSoup

    # links: the list of video URLs from the question
    for url in links:
        response = requests.get(url).text
        soup = BeautifulSoup(response, "html.parser")
        for item in soup.select("div#watch-header"):
            # [title~='like'] matches a title containing the whole word "like"
            view = item.select("div.watch-view-count")[0].text
            likes = item.select("button[title~='like'] span.yt-uix-button-content")[0].text
            dislikes = item.select("button[title~='dislike'] span.yt-uix-button-content")[0].text
            print(view, likes, dislikes)

try/except is somewhat necessary in my program, since some videos have likes and dislikes hidden, etc.
But the links you provided above work fine without them. I tested it.
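One way to handle videos with hidden counts without try/except is to convert the text through a small helper that falls back to a default. This is a sketch under assumptions: `count_from_text` is a hypothetical name, and it assumes the count appears as the first run of digits (with optional commas) in the element text, as in "2,836 views".

```python
import re

# First run of digits, optionally containing thousands separators.
_NUMBER = re.compile(r"\d[\d,]*")


def count_from_text(text, default=0):
    """Return the first integer found in text (commas allowed), else default."""
    if not text:
        return default
    match = _NUMBER.search(text)
    if match is None:
        return default
    return int(match.group().replace(",", ""))
```

With BeautifulSoup this could be combined with `select_one`, which returns `None` when nothing matches: `node = soup.select_one("div.watch-view-count")` followed by `count_from_text(node.get_text() if node else None)`.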
