I don't have much to say about numeric stability of code I can't run, as there's usually a lot of experimentation needed.
broadcast
    for i in range(1, nx-1):
        a[i] = -k_left[i] / dx**2
        b[i] = ...
        c[i] = -k_right[i] / dx**2

I can say that the interpreted loop looks slow, and an NDArray broadcast would be more appropriate.
And even if you don't do that, at least hoist that dx ** 2 squaring operation into a local variable computed before the loop begins.
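A sketch of the broadcast form, assuming a, b, c, k_left and k_right are 1-D NumPy arrays of length nx (the elided b expression stays elided here):

    dx2 = dx ** 2                          # hoisted: squared once, not once per iteration
    interior = slice(1, nx - 1)            # the points the original loop visits
    a[interior] = -k_left[interior] / dx2
    # b[interior] = ...                    # vectorise the elided b expression the same way
    c[interior] = -k_right[interior] / dx2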
Similarly you would prefer to broadcast the subsequent loop of

    for j, i in enumerate(range(1, nx-1)):
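The body isn't shown, so this is only the general shape: replace the paired loop counters with whole index arrays and assign slices in one go (rhs and f below are hypothetical stand-ins for whatever the body computes).

    import numpy as np

    i = np.arange(1, nx - 1)    # the grid indices the loop visits
    j = i - 1                   # the enumerate() counter: 0, 1, ..., nx-3
    # each per-element assignment in the loop body, e.g. rhs[j] = f(..., i),
    # then becomes a single whole-array statement over these index vectors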
helpers
The "step N" comments are helpful. But they suggest extracting a few helper functions. Minimally, "step 1" should be a function call.
roundoff
"observe small oscillations"
Presumably you observe them through the print(A_sparse) statement; it's unclear what the magnitude of the errors is. You pass matrices into several scipy solvers. It would be useful to compute their condition numbers, to better understand whether we're solving an ill-conditioned ("stiff") system.
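For a 1-D grid the matrix is probably small enough to densify, so a quick check might look like this (assuming A_sparse is a scipy.sparse matrix):

    import numpy as np

    # 2-norm condition number; fine while the matrix is small enough to densify
    kappa = np.linalg.cond(A_sparse.toarray())
    # rule of thumb: expect to lose roughly log10(kappa) decimal digits to roundoff
    print(f"cond(A) ~ {kappa:.3e}")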
The only numeric analysis detail I noticed that leapt off the page was the computation of a simple two-point average of adjacent values.
    Te_face = (Te_k[:-1] + Te_k[1:]) / 2

Roughly, the per-element expression is (first + second) / 2.
Standard advice in numeric analysis texts is to prefer this expression:
    first + (second - first) / 2
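Applied to the face average above, that would read:

    Te_face = Te_k[:-1] + (Te_k[1:] - Te_k[:-1]) / 2

Same broadcast, just reordered arithmetic.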
Similarly for the various flux face calculations, where we divide by dx rather than by 2.
With positive integers this avoids the spectre of overflowing uint32_t or similar. With floats, this lets us perform the division on a numerator of smaller magnitude. And then we're performing a "big + little" addition, which gives us a better chance at properly rounding the result.
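The integer half of that claim is easy to demonstrate with NumPy's fixed-width types (the values here are purely illustrative):

    import numpy as np

    lo = np.array([4_000_000_000], dtype=np.uint32)
    hi = np.array([4_000_000_100], dtype=np.uint32)
    naive = (lo + hi) // 2         # lo + hi silently wraps past 2**32
    safe = lo + (hi - lo) // 2     # intermediate stays in range
    print(naive[0], safe[0])       # 1852516402 vs. 4000000050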
This only gets you about one more ULP, one more bit of precision per iteration. I can't imagine it's significant, compared to the effects of the various sophisticated solvers you're calling into.