Ciro Santilli OurBigBook.com

CPython 3.13: the GIL became optional with the --disable-gil option

I'd still need to read up on this properly, but it is worth knowing about:

Note however that this option is not yet widely available as of May 2025, as it requires different Python build options and most distros haven't enabled it yet, e.g. Ubuntu 25.04, which ships Python 3.13. The following question asks how to get it: How to disable the GIL in python3.13?
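One way to check at runtime which build you are on (a sketch: `sysconfig.get_config_var("Py_GIL_DISABLED")` is set by the build, and `sys._is_gil_enabled` only exists on 3.13+):

```python
import sys
import sysconfig

# Py_GIL_DISABLED is 1 only on free-threaded builds of CPython 3.13+
# (the ones built with --disable-gil); 0 or None otherwise.
free_threaded = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
print("free-threaded build:", free_threaded)

# On 3.13+, sys._is_gil_enabled() additionally reports whether the GIL
# is active right now (it can be re-enabled at runtime, e.g. via PYTHON_GIL=1).
if hasattr(sys, "_is_gil_enabled"):
    print("GIL currently enabled:", sys._is_gil_enabled())
```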


Meaning of each section:

  • "CPU Bound": how long it takes for all threads/processes to finish a fixed amount of CPU-bound work for each thread. The amount of work per thread is fixed, so e.g. when 2 threads were used, 2x total work is done.

    Interpretation: threads were always slower because they were fighting for the CPU lock (the GIL).

  • "CPU bound / threads": the above graph divided by the number of threads. This gives the average time it took to finish each unit of work. We observe that:

    • for threads, this is constant: 2x work takes 2x time to finish. Therefore, it didn't parallelize at all.

    • for processes, this decreases until 4x, and then remains constant. Therefore it parallelized well up to 4x and was able to run things faster, but didn't scale beyond that.

      I would have expected scaling up to 8x since I'm on a 4 core 8 hyperthread machine.

      Contrast that with a C POSIX CPU-bound work which reaches the expected 8x speedup: What do 'real', 'user' and 'sys' mean in the output of time(1)?

      TODO: I don't know the reason for this, there must be other Python inefficiencies coming into play.

  • "Thread / Process ratio": ratio of the two above lines. This shows us the 4x speedup limit very clearly.

  • "IO bound": same as "CPU bound" but with an IO bound task


Python documentation quotes

I've highlighted the key Python documentation quotes about Process vs Threads and the GIL at: What is the global interpreter lock (GIL) in CPython?

Process vs thread experiments

I did a bit of benchmarking in order to show the difference more concretely.

In the benchmark, I timed CPU and IO bound work for various numbers of threads on an 8 hyperthread CPU. The work supplied per thread is always the same, such that more threads means more total work supplied.

The results were:

[Plot: benchmark results]

Plot data.

Conclusions:

  • for CPU bound work, multiprocessing is always faster, presumably due to the GIL

  • for IO bound work, both are exactly the same speed

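The IO-bound tie is easy to reproduce on its own: `time.sleep` releases the GIL, so sleeping threads overlap just like processes do. A minimal sketch:

```python
import threading
import time

def io_task():
    # time.sleep releases the GIL while blocked, so all threads can
    # "run" (sleep) concurrently despite the lock.
    time.sleep(0.2)

start = time.monotonic()
threads = [threading.Thread(target=io_task) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.monotonic() - start

# 8 sleeps of 0.2 s overlap almost completely: ~0.2 s total, not 1.6 s.
print(f"{elapsed:.2f}s")
```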

Test code:

```python
#!/usr/bin/env python3

import multiprocessing
import threading
import time
import sys

def cpu_func(result, niters):
    '''
    A useless CPU bound function.
    '''
    for i in range(niters):
        result = (result * result * i + 2 * result * i * i + 3) % 10000000
    return result

class CpuThread(threading.Thread):
    def __init__(self, niters):
        super().__init__()
        self.niters = niters
        self.result = 1
    def run(self):
        self.result = cpu_func(self.result, self.niters)

class CpuProcess(multiprocessing.Process):
    def __init__(self, niters):
        super().__init__()
        self.niters = niters
        self.result = 1
    def run(self):
        self.result = cpu_func(self.result, self.niters)

class IoThread(threading.Thread):
    def __init__(self, sleep):
        super().__init__()
        self.sleep = sleep
        self.result = self.sleep
    def run(self):
        time.sleep(self.sleep)

class IoProcess(multiprocessing.Process):
    def __init__(self, sleep):
        super().__init__()
        self.sleep = sleep
        self.result = self.sleep
    def run(self):
        time.sleep(self.sleep)

if __name__ == '__main__':
    cpu_n_iters = int(sys.argv[1])
    sleep = 1
    cpu_count = multiprocessing.cpu_count()
    input_params = [
        (CpuThread, cpu_n_iters),
        (CpuProcess, cpu_n_iters),
        (IoThread, sleep),
        (IoProcess, sleep),
    ]
    header = ['nthreads']
    for thread_class, _ in input_params:
        header.append(thread_class.__name__)
    print(' '.join(header))
    for nthreads in range(1, 2 * cpu_count):
        results = [nthreads]
        for thread_class, work_size in input_params:
            start_time = time.time()
            threads = []
            for i in range(nthreads):
                thread = thread_class(work_size)
                threads.append(thread)
                thread.start()
            for i, thread in enumerate(threads):
                thread.join()
            results.append(time.time() - start_time)
        print(' '.join('{:.6e}'.format(result) for result in results))
```

GitHub upstream + plotting code on same directory.
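The script prints a plain whitespace-separated table, one timing column per class. A minimal parsing sketch for plotting it yourself (the sample numbers below are made up, not the measured data):

```python
# Hypothetical captured output of the benchmark script above.
sample_output = """\
nthreads CpuThread CpuProcess IoThread IoProcess
1 1.0e+00 1.1e+00 1.0e+00 1.0e+00
2 2.0e+00 1.1e+00 1.0e+00 1.0e+00
"""

lines = sample_output.strip().splitlines()
header = lines[0].split()                      # column names
rows = [[float(x) for x in line.split()] for line in lines[1:]]
# Transpose rows into one list per named column.
columns = {name: [row[i] for row in rows] for i, name in enumerate(header)}
print(columns["CpuThread"])  # [1.0, 2.0]
```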

Tested on Ubuntu 18.10, Python 3.6.7, in a Lenovo ThinkPad P51 laptop with CPU: Intel Core i7-7820HQ CPU (4 cores / 8 threads), RAM: 2x Samsung M471A2K43BB1-CRC (2x 16GiB), SSD: Samsung MZVLB512HAJQ-000L7 (3,000 MB/s).

Visualize which threads are running at a given time

This post https://rohanvarma.me/GIL/ taught me that you can run a callback whenever a thread is scheduled by passing it via the target= argument of threading.Thread, and the same for multiprocessing.Process.

This allows us to view exactly which thread runs at each time. When this is done, we would see something like (I made this particular graph up):

```
            +--------------------------------------+
            + Active threads / processes           +
+-----------+--------------------------------------+
|Thread   1 |********     ************             |
|         2 |        *****            *************|
+-----------+--------------------------------------+
|Process  1 |***  ************** ******  ****      |
|         2 |** **** ****** ** ********* **********|
+-----------+--------------------------------------+
            + Time -->                             +
            +--------------------------------------+
```

which would show that:

  • threads are fully serialized by the GIL
  • processes can run in parallel
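
A minimal sketch of the logging idea (illustrative, not the linked post's exact code): the function passed via target= runs inside the new thread, so appending timestamps from it records when each thread actually got to execute.

```python
import threading
import time

timeline = []  # (thread name, timestamp) samples appended as the work runs

def work(name, niters):
    # list.append is atomic under the GIL, so concurrent appends are safe.
    for _ in range(niters):
        timeline.append((name, time.monotonic()))

threads = [
    threading.Thread(target=work, args=(f"thread-{i}", 1000))
    for i in range(2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(timeline))                           # 2000
print(sorted({name for name, _ in timeline}))  # ['thread-0', 'thread-1']
```

Sorting `timeline` by timestamp and bucketing by name gives exactly the kind of occupancy chart drawn above.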