I am trying to parallelize a scraper. Unfortunately, when I execute this code it runs unusually long, until I stop it, and no output is generated either. Is there something I missed here? Is the problem that I use `os.system`?
First I define the function, then I load the data, and then I pass it to the multiprocessing pool.
All in all, what I want to do looks like this:
```python
import multiprocessing as mp

def cube(x):
    return x**3

pool = mp.Pool(processes=2)
results = pool.map(cube, range(1, 7))
print(results)
```

But this small calculation has now been running for more than 5 minutes, so I think the error is not in the code itself, but rather in how I understand multiprocessing.
```python
from multiprocessing import Pool
import os
import json
import datetime
from dateutil.relativedelta import relativedelta
import re
import time

os.chdir(r'C:\Users\final_tweets_de')

p = Pool(5)

def get_id(data_tweets):
    for i in range(len(data_tweets)):
        account = data_tweets[i]['user_screen_name']
        created = datetime.datetime.strptime(data_tweets[i]['date'], '%Y-%m-%d').date()
        until = created + relativedelta(days=10)
        id = data_tweets[i]['id']
        filename = re.search(r'(.*).json', file).group(1) + '_' + 'tweet_id_' + str(id) + '_' + 'user_id_' + str(data_tweets[i]['user_id'])
        os.system('snscrape twitter-search "(to:' + account + ') since:' + created.strftime("%Y-%m-%d") + ' until:' + until.strftime("%Y-%m-%d") + ' filter:replies" >C:\\Users\\test_' + filename)

directory = r'C:\Users\final_tweets_de'
path = r'C:\Users\final_tweets_de'

for file in os.listdir(directory):
    fh = open(os.path.join(path, file), 'r')
    print(file)
    with open(file, 'r', encoding='utf-8') as json_file:
        data_tweets = json.load(json_file)
        data_tweets = data_tweets[0:5]
        start = time.time()
        print("start")
        p.map(get_id, data_tweets)
        p.terminate()
        p.join()
    end = time.time()
    print(end - start)
```

Update
The reason the code did not run is, firstly, the problem addressed by @Booboo. The other is that on Windows the script has to be started via cmd when using multiprocessing.
Like here: Python multiprocessing example not working
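For reference, this is how I understand the guarded version of the small cube example should look (a minimal sketch based on the linked answer; run from cmd, not from an IDE):

```python
import multiprocessing as mp

def cube(x):
    return x**3

# On Windows, multiprocessing starts child processes by re-importing this
# script, so the pool creation must be guarded; otherwise every child
# tries to create its own pool and the program hangs.
if __name__ == '__main__':
    pool = mp.Pool(processes=2)
    results = pool.map(cube, range(1, 7))
    print(results)  # [1, 8, 27, 64, 125, 216]
```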
Now I get `KeyError: 0` when I run the code.
```python
import multiprocessing as mp
import os
import json
import datetime
from dateutil.relativedelta import relativedelta
import re
import time

os.chdir(r'C:\Users\Paul\Documents\Uni\Masterarbeit\Datengewinnung\final_tweets_de')

def get_id(data_tweets):
    for i in range(len(data_tweets)):
        print(i)
        account = data_tweets[i]['user_screen_name']
        created = datetime.datetime.strptime(data_tweets[i]['date'], '%Y-%m-%d').date()
        until = created + relativedelta(days=10)
        id = data_tweets[i]['id']
        filename = re.search(r'(.*).json', file).group(1) + '_' + 'tweet_id_' + str(id) + '_' + 'user_id_' + str(data_tweets[i]['user_id'])
        try:
            os.system('snscrape twitter-search "(to:' + account + ') since:' + created.strftime("%Y-%m-%d") + ' until:' + until.strftime("%Y-%m-%d") + ' filter:replies" >C:\\Users\\Paul\\Documents\\Uni\\Masterarbeit\\Datengewinnung\\Tweets_antworten\\Antworten\\test_' + filename)
        except:
            continue

directory = r'C:\Users\Paul\Documents\Uni\Masterarbeit\Datengewinnung\final_tweets_de'
path = r'C:\Users\Paul\Documents\Uni\Masterarbeit\Datengewinnung\final_tweets_de'

for file in os.listdir(directory):
    fh = open(os.path.join(path, file), 'r')
    print(file)
    with open(file, 'r', encoding='utf-8') as json_file:
        data_tweets = json.load(json_file)
        data_tweets = data_tweets[0:2]
        start = time.time()
        print("start")
        if __name__ == '__main__':
            pool = mp.Pool(processes=2)
            pool.map(get_id, data_tweets)
        end = time.time()
        print(end - start)
        del(data_tweets)
```

Output:
```
(NLP 2) C:\Users\Paul\Documents\Uni\Masterarbeit\Datengewinnung\Tweets_antworten>python scrape_id_antworten_parallel.py
corona.json
start
corona.json
corona.json
start
0.0009980201721191406
coronavirus.json
start
0.0
coronavirus.json
start
0.0
covid.json
start
0.0
SARS_CoV.json
start
0.0
0
0
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "C:\Users\Paul\Anaconda3\envs\NLP 2\lib\multiprocessing\pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "C:\Users\Paul\Anaconda3\envs\NLP 2\lib\multiprocessing\pool.py", line 44, in mapstar
    return list(map(*args))
  File "C:\Users\Paul\Documents\Uni\Masterarbeit\Datengewinnung\Tweets_antworten\scrape_id_antworten_parallel.py", line 25, in get_id
    account = data_tweets[i]['user_screen_name']
KeyError: 0
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "scrape_id_antworten_parallel.py", line 60, in <module>
    pool.map(get_id, data_tweets)
  File "C:\Users\Paul\Anaconda3\envs\NLP 2\lib\multiprocessing\pool.py", line 268, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "C:\Users\Paul\Anaconda3\envs\NLP 2\lib\multiprocessing\pool.py", line 657, in get
    raise self._value
KeyError: 0
```
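To narrow the error down, I tried to reproduce it outside the scraper. As far as I understand, `pool.map(get_id, data_tweets)` calls `get_id` once per element of `data_tweets`, so each call receives a single tweet dict instead of the whole list. A minimal sketch (the tweet fields below are made up for illustration):

```python
# A single tweet dict, as one element of data_tweets would look
# (field values are invented for this illustration):
tweet = {'user_screen_name': 'some_account', 'date': '2020-03-01',
         'id': 1234567890, 'user_id': 42}

# Inside get_id, data_tweets is then this dict, not a list of dicts:
# len(tweet) counts its 4 keys, and tweet[0] looks up the key 0,
# which does not exist.
for i in range(len(tweet)):
    account = tweet[i]['user_screen_name']  # raises KeyError: 0
```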