Regarding negative steps I would stop making my live hard and simply swap t_end and t_start. That way you can use the same code path for both cases without worrying about signs etc.
If you t_end-t_start is not a multiple of your dt you need to evaluate again with a smaller time step at the end of your computation.
You are stuck with float or double for h. The steps are non integral so ...
RK is a multistep methods, where you calculate the n-th step using the result of the n-1 th step. So it is not possible to parallelize the computation of a single step. What you can parallelize is the computation of the different ODEs.
You need to really step back and consider whether your code is doing the right thing and then you should ask your self how long it took you to come to that conclusion. RK is a really simple method and your code makes it just so incredible complicated
temp[func_i] = temp[func_i] + (*((*(currentAs_1_it + s_2)).begin()))*(*std::next(it_fContainer, func_i))(y_K_L[s_1], tk_L[s_2]);
There are 3 dereferences and one multiplication here and the braces are just hard to read. What makes it really bad is that it is completely unnecessary.
*(currentAs_1_it + s_2)
This is just the iterator to the s_2 elementh of the array. Why on earth do you mix index based loops with iterator loops? Either you use
for (size_t index = 0; index < container.size(); ++index) { auto element = container[index]; }
or
for (auto it = container.begin(); it != container.end(); ++it) { auto element = *it; }
However, if you really want to simplify your code, you should use range based loops:
for (auto&& element : container) { }
This is most likely responsible for some of the performance problems you have as you are always calculating pointers on the fly. As an example:
for (size_t j = 0; j < L; j++) { tk_L[j] = t + butcher_tableau.getC()[j]*h; }
Now you see that you should also create a getC() member that take a const size_t index and returns the appropriate element.
I dont see where you reserve the memory for your operations. From a performance point of view this is catastrophic as you are forced to permanently reallocated huge chunks of memory
trajectory_u.push_back(std::vector<initContainer::value_type>(u0.size())); std::copy(u1.begin(), u1.end(), (trajectory_u[i]).begin()); trajectory_t.push_back(t);
I dont even know what you are trying to do here, but there is no possible reason you want to copy all that memory around. Also you know in advance the number of steps you have, but I do not see a single reserve()
So what you should do is:
trajectory_u.reserve(numberOfSteps); trajectory_t.reserve(numberOfSteps);
And later:
trajectory_u.push_back(u1);
You have so many typedefs, but not one for the most used ones
std::vector<initContainer::value_type>