In addition to @Douglas post, here is a very short (and I think: efficient) implementation in Pari/GP - however, I didn't run it to high N myself: [Update] I added an improved computation method at the end[/update]
MaxN=101 \\ set some upper limit for the computation S=[1,1] \\ contains current factorial and current sum for(k=3,MaxN, S *= [k-1,k-1;0,1]; if(isprime(k),print([k,S[1] % k,S[2]%k]))) Result :
p | (p-1)! %p | S %p --+-----------+----- [3, 2, 0] [5, 4, 3] [7, 6, 5] [11, 10, 0] [13, 12, 9] [17, 16, 12] [19, 18, 8] [23, 22, 20] [29, 28, 16] [31, 30, 1] [37, 36, 4] [41, 40, 3] ... Up to the 1000'th prime MaxN=7919
MaxN = prime(1000) ... ... [7919, 7918, 2882] this needed 281 msec.
A Version using the modulo-computation is the following
gettime() {for(j=1,1000,p=prime(j); S=[1,1]; for(k=3,p, S*=[k-1,k-1;0,1]; S=S % p ); print([p,S[1] % p,S[2]%p]) )} gettime() %1124 = 9219 \\ this is msecs [update] The following is only relevant, if we want a routine, which produces the list over all primes up to some limit.
It has then some advantage to do a compromise between the first method where we need all factorials and sums only once but which needs many bits for the representation of the numbers and the complete modulo-computation, which operates on small numbers but must do the whole run of computing factorials (residuals) for each prime such that we have a double loop.
If we compute and store the required factorials and partial sums in chunks of, say, $t=10$ terms, such that we have a list of precomputed values $$ \small \begin{array} {r|l|l|} & f & S \\ \hline \\ c_1=& 10! & \sum_{k=1}^{10} k! \\ c_2=& {20!\over 10!} & {\sum_{k=11}^{20} k! \over 10!} \\ c_3=& {30!\over 20!} & {\sum_{k=21}^{30} k! \over 20!} \\ \ldots \end{array} $$ then we can compose our results for each prime by modular multiplication of that constants and a short remainder. Finetuning the parameter $t$ for higher allows then to find a compromise between storage/size of the stored numbers and number of steps in the inner loop, which contains the modular computation per prime.
Using some empirical data I have done an estimation with a quadratic curve in Excel, which says, for the list of primes up to 1000000 I need 90 min with blocksize $t=20$ and, 30 min with $t=100$ and 15 min with $t=500$ .
(If it is wished I can append the Pari/GP code here)