There is a lot to digest in this question. There appear to be some misconceptions about unit tests, what exactly constitutes a "regression", and how much unit test code you should expect to change when you change the System Under Test. Understanding how unit tests prevent regressions requires a change in perspective first.
I would like to define what "unit test" means to me. I think it has some similarities with your definition:
- A unit test executes fast — blazing fast! Sub-millisecond, please.
- A unit test can be executed concurrently without affecting other tests.
- A unit test should not utilize resources from the outside world.
  - No file system access.
  - No web service calls.
  - No e-mails.
  - No cross-thread communication.
  - No cross-process communication.
  - No SQL and no database connectivity.
- A unit test verifies that the public behavior of the unit conforms to a requirement.
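To make that concrete, here is a minimal sketch in Python (with a hypothetical `split_name` function) of a test that satisfies all four rules: it runs in microseconds, touches nothing outside the process, and asserts only on publicly observable behavior.

```python
# Hypothetical pure function under test: no I/O, no shared state.
def split_name(full_name: str) -> tuple[str, str]:
    first, _, last = full_name.partition(" ")
    return first, last

# The unit test: sub-millisecond, isolated, asserts public behavior only.
def test_split_name_separates_first_and_last():
    assert split_name("Ada Lovelace") == ("Ada", "Lovelace")
```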
Beyond that, mocking has nothing to do with unit tests, except for rule #3 above. I think this is the most important thing to remember:
**Unit tests do not need to mock every dependency.**
The "overhead" you describe in writing unit tests is the overhead of mocking dependencies. If these dependencies don't have side effects outside of the current test (see rule 3 above), then you don't need to mock that dependency. Use the real thing. The goal here is to reduce the very overhead that frustrates you.
> Finally, many bugs stem from misunderstandings of other pieces of code I'm working with, or misunderstandings of the contracts of 3rd party APIs. Errors, therefore, lie not in any particular method or class, but rather in the way methods or classes interact.
This is very, very, very true. The interactions between objects can be quite complex, especially when the behavior involves side effects like database calls, file system access, etc. In my opinion, this code is not suitable for unit tests. What you are calling "integration tests" are a better strategy here, which is something you've already noted.
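As a sketch of where that line falls (hypothetical `save_report` function, pytest's real `tmp_path` fixture): because this test exercises the actual file system, it is an integration test by the rules above. It is slower, but it verifies the interaction itself rather than a mock's script.

```python
import json
from pathlib import Path

def save_report(directory: Path, name: str, data: dict) -> Path:
    """Hypothetical code under test: writes a JSON report to disk."""
    path = directory / f"{name}.json"
    path.write_text(json.dumps(data))
    return path

def test_save_report_round_trips(tmp_path):
    # Touches the real file system, so by rule #3 above this is an
    # integration test, not a unit test.
    path = save_report(tmp_path, "sales", {"total": 42})
    assert json.loads(path.read_text()) == {"total": 42}
```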
> Firstly, integration tests don't force me to write mocks, meaning they incur less overhead. Secondly, integration tests, by their nature, test the way code behaves, rather than the way it is written.
Oh boy, do we have a lot to dissect here. The first sentence mentions overhead again. The overhead of mocking is paid when you write the test. An integration test might be faster to write, but it is orders of magnitude slower to run! The overhead isn't eliminated; it is shifted down the line, from writing the test to executing it. Humans end up waiting minutes or even hours for integration test suites to run, and that is overhead too. Mock the side-effecting dependencies and you get a sub-millisecond test instead of one that takes seconds or minutes to run.
The second sentence is very interesting too. You say integration tests verify the behavior, but unit tests verify the way code is written. This is a big red flag that the unit test is either not written properly, or not valuable (see rule #4 above). The unit test should verify public behavior which corresponds to a requirement. If you have unit tests that mock a bunch of dependencies only to assert that some method got called on that dependency, then I would say you haven't written a good unit test. What is the outward behavior the rest of the world should see? That's what you should test.
To be very clear, unit tests should test how code behaves, not how it is written.
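A sketch of that difference, using hypothetical names: the first test pins down how `total()` is written and shatters under refactoring; the second pins down only the behavior the rest of the world sees.

```python
from unittest.mock import Mock

class Cart:
    """Hypothetical unit under test."""
    def __init__(self, pricing):
        self.pricing = pricing
        self.items = []

    def add(self, item: str) -> None:
        self.items.append(item)

    def total(self) -> float:
        return sum(self.pricing.price_of(item) for item in self.items)

# Brittle: asserts HOW the code is written. Refactor total() to price
# items through a different method and this breaks with behavior intact.
def test_total_calls_price_of():
    pricing = Mock()
    pricing.price_of.return_value = 5.0
    cart = Cart(pricing)
    cart.add("apple")
    cart.total()
    pricing.price_of.assert_called_once_with("apple")

# Better: asserts WHAT the outside world observes, namely the total.
def test_total_sums_item_prices():
    class DictPricing:  # a real, side-effect-free dependency (rule #3)
        def price_of(self, item):
            return {"apple": 5.0, "pear": 3.0}[item]
    cart = Cart(DictPricing())
    cart.add("apple")
    cart.add("pear")
    assert cart.total() == 8.0
```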
> Ideally, an integration test will only change if business requirements do.
Until management gets a good deal on a different cloud provider, and now you need to lift-and-shift your entire infrastructure. What used to be a MySQL database is now a SQL Server database. What used to be MongoDB is now Couchbase. What used to be an SMTP e-mail server is now a "notification service". The infrastructure of your ecosystem churns a lot more than you might think, especially when multiple teams are building out microservices, so be careful about baking this assumption into your test suite. This brings us to the issue of "regressions".
A software regression is a defect in something that used to work but no longer does. It can be triggered by a code change or by an infrastructure change within your ecosystem. Unit tests cannot guard against regressions caused by the outside world; they guard against the code changes your team controls.
> Assume I have a unit of code and a test for it. Now time comes to refactor. Trivially this test can never guard against regressions in any other place of my code, because all dependencies are mocked. But it also can't guard against regressions in this very piece of code it is associated with! Whenever I change this unit of code, or maybe even remove it completely, I will also have to rewrite or even completely remove the test that guards it. The test, therefore, gets removed right before it could get useful. It was, therefore, useless and writing it was a wasted effort.
>
> ... Code cannot be refactored without modifying the tests as well. Refactoring is, thus, rendered difficult.
Don't conflate "regression" with "change in business requirements." If business requirements change, the System Under Test changes, and guess what? You need to update tests. You are also correct in saying that you cannot guard against regressions in code that changes because requirements change. That's not a regression. That's a change in requirements.
Regressions occur in other parts of the application that are not directly related to the requirements being changed. All of the other unit tests (that should each be running sub-millisecond, by the way) guard against regressions — accidental changes in behavior that violate previously functional implementations of unchanged requirements. So the test you are changing doesn't guard against regressions. The tests you aren't changing are guarding against regressions. All of those other Systems Under Test should continue functioning as they did before.
Remember how each unit test should execute sub-millisecond? This is where test execution time becomes important. The faster a test runs, the more likely you are to run it, which means it becomes more likely to catch regressions earlier in the development lifecycle, when they are quicker and easier to fix.
That's how unit tests prevent regressions.