Pyth, 41 [48*.8=]38.441 bytes
I<Q4.q;#Km#Km.SQQI.AmqQl{d+.TK.Tm,@@Kdd@@Kt-QddQB;K I<Q4.q; # if input < 4, quit # ; # while(True) Km.SQQ # K = random QxQ 2d list I ; # if ... .A # all of... m Q # map(range(Q))... + # concat .TK # transpose K .Tm,@@Kdd@@Kt-Qdd # diagonals of K m d # map(range(d)) , # 2-elem list of... @@Kdd # K[n][n] @@Kt-Qd # and K[len(K)-n-1][n] .T # transpose qQl{d # subarrays have no dups... B; # ... then, break K # output final result llama@llama:~$ time echo 4 | pyth <(echo 'I<Q4.q#Km'#Km.SQQI.AmqQl{d+.TK.Tm,@@Kdd@@Kt-QddQB;K') [[2, 1, 0, 3], [0, 3, 2, 1], [3, 0, 1, 2], [1, 2, 3, 0]] echo 4 0.00s user 0.00s system 0% cpu 0.001 total pyth <(echo 'I<Q4.q#Km'#Km.SQQI.AmqQl{d+.TK.Tm,@@Kdd@@Kt-QddQB;K') 0.38s user 0.00s system 96% cpu 0.397 total I've yet to get a timing on 5x5 (it ran to completion once, but that was in a REPL and I wasn't timing it).
Note that the rule for the time limit was only edited into the question after this answer was posted.