I tried to use threading with python, there were issues with the Global Interpreter Lock. Ported code to use the multiprocessing module and it worked with 2x speedup, internally hadoop assigns as many mappers as there are cores in the cluster, hence multiprocessing is not the way to go if you need speed up. Multithreading if performed right might give some speedup