How to use Numba to optimize stochastic Runge-Kutta?

Question

I am trying to optimize a stochastic differential equation. What it means for me is that I have to solve a set of Mmax independent differential equations. For that task I use Numba, but I have the feeling I am doing it wrong, as even if I do not get any errors and I get some time-reduction, it is not as much as I expected.

My code looks like

 @njit(fastmath=True) def Liouvillian(rho,t_list,alpha1,alpha1p): C = np.zeros((len(rho0)),dtype=np.complex64) B = L0+alpha1p*L1m+alpha1*L1p for j in range(len(rho0)): for i in range(len(rho0)): C[j] += B[j, i]*rho[i] return C @njit(parallel=True, fastmath=True) def RK4(rho0,t_list,alpha1,alpha1p): rho=np.zeros((len(rho0),len(t_list),Mmax),dtype=np.complex64) for m in prange(Mmax): rho[:,0,m]=rho0 n=0 for t in t_list[0:len(t_list)-1]: k1=Liouvillian(rho[:,n,m],t,alpha1[st*n,m],alpha1p[st*n,m]) k2=Liouvillian(rho[:,n,m]+dt*k1/2,t+dt/2,alpha1[st*n,m],alpha1p[st*n,m]) k3=Liouvillian(rho[:,n,m]+dt*k2/2,t+dt/2,alpha1[st*n,m],alpha1p[st*n,m]) k4=Liouvillian(rho[:,n,m]+dt*k3,t+dt,alpha1[st*n,m],alpha1p[st*n,m]) rho[:,n+1,m]=rho[:,n,m]+dt*(k1+2*k2+2*k3+k4)/6.0 n=n+1 return rho

where alpha1p and alpha are 2D arrays with elements [t,m] and L0,Lp,Lm are just arrays.

Specifically taking Mmax=10000, without using Numba it takes 16s, while using Numba (both parallel and fastmath), it reduces to 14s. While I see some reduction, this is not what I expected from the parallelization. Is there something wrong in my code or this is what I should get?

Edit: Here I attach the code I use to generate the random arrays I later use on the ODE solver.

def A_matrix(num_modes,kappa,E_0): A_modes=np.zeros(((2*num_modes),(2*num_modes)),dtype=np.complex64) for i in range(np.shape(A_modes)[0]): for j in range(np.shape(A_modes)[0]): if i==j: A_modes[i,j]=-kappa/4 if i==0: A_modes[i,i+num_modes+1]=E_0*kappa/2 A_modes[i,-1]=E_0*kappa/2 elif 0<i<num_modes-1: A_modes[i,i+num_modes+1]=E_0*kappa/2 A_modes[i,i+num_modes-1]=E_0*kappa/2 elif i==num_modes: A_modes[i-1,i]=E_0*kappa/2 A_modes[i-1,-2]=E_0*kappa/2 return (A_modes+A_modes.transpose()) def D_mat(num_modes): D_matrix=np.zeros(((num_modes),(num_modes)),dtype=np.complex64) for i in range(np.shape(D_matrix)[0]): for j in range(np.shape(D_matrix)[0]): if i==j: if i==0: D_matrix[i,i+1]=1 D_matrix[i,-1]=1 elif 0<i<num_modes-1: D_matrix[i,i+1]=1 return (D_matrix+D_matrix.transpose()) def a_part(alpha,t_list): M=A_matrix(num_modes,kappa,E_0) return M.dot(alpha) def b_part(w,t_list): D_1=D_mat(num_modes) Zero_modes=np.zeros(((num_modes),(num_modes)),dtype=np.complex64) D=np.bmat([[D_1, Zero_modes], [Zero_modes, D_1]]) B=sqrtm(D)*np.sqrt(E_0*kappa/2) return B.dot(w) def SDE_Param_Euler_Mauyrama(Mmax): alpha=np.zeros((2*num_modes,Smax+1,Mmax),dtype=np.complex64) n=0 alpha[:,n,:]=0.0+1j*0.0 for s in s_list[0:len(s_list)-1]: alpha[:,n+1,:]=alpha[:,n,:]+ds*a_part(alpha[:,n,:],s)+b_part(w[:,n,:],s) n=n+1 return (alpha) #Parameters E_0=0.5 kappa=10. gamma=1. num_modes=2 ## number of modes Mmax=10000 #number of samples Tmax=20 ##max value for time dt=1/(2*kappa) st=10 ds=dt/st Nmax=int(Tmax/dt) ##number of steps Smax=int(Tmax/ds) ##number of steps t_list=np.arange(0,Tmax+dt/2,dt) s_list=np.arange(0,Tmax+ds/2,ds) w = np.random.randn(2*num_modes,Smax+1,Mmax)*np.sqrt(ds) (alpha1,alpha2,alpha1p,alpha2p)=SDE_Param_Euler_Mauyrama(Mmax) from qutip import * Delta_a=0 num_qubits=1 ##Atom 1 sm_1=sigmam() sp_1=sigmap() sx_1=sigmax() sy_1=sigmay() sz_1=sigmaz() state0=np.zeros(2**num_qubits,dtype=np.complex64) state0[-1]=1 rho0=np.kron(state0,np.conjugate(np.transpose(state0))) Ha=Delta_a*(sz_1)/2 #Deterministic Liouvillian L_ops=[np.sqrt(gamma)*sm_1] L0=np.array(liouvillian(Ha,L_ops),dtype=np.complex64) L1m=-np.array(np.sqrt(kappa*gamma)*1j*liouvillian(sm_1),dtype=np.complex64) L1p=np.array(np.sqrt(kappa*gamma)*1j*liouvillian(sp_1),dtype=np.complex64) rho_1=RK4(rho0,t_list,alpha1,alpha1p).reshape(2**num_qubits,2**num_qubits,len(t_list),Mmax)

Are you sure you are solving SDE here? I see nothing involving randomness. Also, if you have to solve many initial conditions in parallel, there are usually other ways to speed up things depending on why you want to do this. — Wrzlprmft
– Wrzlprmft, Commented Apr 27, 2023 at 5:21
@Wrzlprmft I edited my post showing how I get the random elements. — J.Agusti
– J.Agusti, Commented Apr 27, 2023 at 10:11

Jérôme Richard · Accepted Answer · 2023-04-28 16:25:20Z

I expect the main bottleneck to be temporary arrays, especially since the code is parallel and the amount of computation is small compared to the number of temporary arrays and their size (bigger arrays are slower to allocate). Thus, the key is to avoid temporary arrays by allocating them once and reusing them. You can also use loops to avoid creating temporary arrays implicitly like Numpy does. Allocating temporary array is an expensive operation and filling a newly allocated array is slow because of page faults. For example, k1+2*k2+2*k3+k4 creates many temporary arrays that are not needed. You can write a loop directly writing in rho based on k1, k2, k3 and k4. You could even avoid creating those four arrays in the first place by accumulating values in rho. Memory operations tends to be slower than computations on modern hardware so doing operations in-place and avoiding copy and temporary arrays is often a good idea for performance, especially in parallel codes since the L3 cache and the DRAM are shared between cores.

Besides, there is another issue: the memory access pattern is inefficient. Indeed, rho[:,n,m] means you access items with a stride of n*m*16. Strided memory accesses are far less efficient than contiguous because the hardware is optimized for the later. One solution to avoid repeated strided access is to physically transpose the input/output arrays (note arr.T do not physically transpose an array but returns a non-contiguous view on it, while arr.T.copy() does return contiguous physically transposed array, though it is not very efficiently implemented yet). In your case, the best is certainly to change the layout of rho. The transposition trick is only worth it if other algorithms operating on it before/after this one also need to operate on it contiguously. You can transpose the array once between each algorithm rather than doing twice here (and possibly in other algorithms).

Here is an example for the second point:

@njit(parallel=True, fastmath=True) def RK4(rho0,t_list,alpha1,alpha1p): rho=np.zeros((len(t_list), Mmax, len(rho0)),dtype=np.complex64) for m in prange(Mmax): rho[0,m,:]=rho0 n=0 for t in t_list[0:len(t_list)-1]: k1=Liouvillian(rho[n,m,:],t,alpha1[st*n,m],alpha1p[st*n,m]) k2=Liouvillian(rho[n,m,:]+dt*k1/2,t+dt/2,alpha1[st*n,m],alpha1p[st*n,m]) k3=Liouvillian(rho[n,m,:]+dt*k2/2,t+dt/2,alpha1[st*n,m],alpha1p[st*n,m]) k4=Liouvillian(rho[n,m,:]+dt*k3,t+dt,alpha1[st*n,m],alpha1p[st*n,m]) rho[n+1,m,:]=rho[n,m,:]+dt*(k1+2*k2+2*k3+k4)/6.0 n=n+1 return rho

The result is transposed (and so different from the initial function). The computation is about twice faster on my machine thanks to the transposition. Applying the first point should result is an even more significant speed up.

Thank you for your detailed answer, much appreciated. While I understand the first point, I do not seem to understand your second idea. Could you please provide a toy model of what you mean? Important to notice that the first element of rho[:,n,m], as I show now in my edited post, is small but in general the first label will be an array of length 1048576, which is quite demanding.
I provided an example of code for the second point. Doing the same for the n-axis (ie. now the first one) may also speed this up even more.
ah, I was not aware of this problem of the stride memory! Thanks! However, when implementing your answer in my laptop, I actually get the same computational time as before. Can I ask you the size of the matrix you have tested?
Strange. I juste copy pasted your code for to get the input matrix with no changes. The bottleneck can change from one machine to another but I expect the performance to be at least slightly better. Anyway, the key point is probably the first for performance.
Then I do have one question: I tried to implement your first point and I actually got it a slower performance... Let me see if I understood it: The idea is to avoid creating the k_i variables. So instead we just do something like rho[:,n+1,m]=rho[:,n,m]+dt*(k1+2*k2+2*k3+k4)/6.0 what I should do it rho[:,n+1,m]=rho[:,n,m]+dt/6*(Liouvillian(rho[:,n,m],t,alpha1[stn,m],alpha1p[stn,m])+2*Liouvillian(rho[n,m,:]+dtLiouvillian(rho[:,n,m],t,alpha1[stn,m],alpha1p[stn,m])/2,t+dt/2,alpha1[stn,m],alpha1p[st*n,m]))+etc.. where we are using that k_2 depends on k_1, etc..

Collectives™ on Stack Overflow

How to use Numba to optimize stochastic Runge-Kutta?

1 Answer 1

7 Comments

Hot Network Questions