OK, this sounds like the standard "My query was faster yesterday" question. I hope it's not.
I have a query that usually runs within seconds (about 4 seconds). Last night, it ran for 12 hours before we killed it and reran it. The rerun did not help, and we eventually killed it again.
Since we are using Informatica (an ETL tool) to generate the query, it adds the date clause to the query as a literal. I am not happy about it, since I get a new query every day, which makes comparing performance over time practically impossible. However, I think that rules out parameter sniffing.
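To make the daily variants comparable anyway, one option would be to fingerprint the query text with the date literal masked out. This is only a sketch in Python: the `'YYYY-MM-DD...'` literal format, the function name `normalized_fingerprint`, and the sample table and column names are my assumptions, not what Informatica actually emits.

```python
import hashlib
import re

# Assumption: the generated date literal is a quoted string starting with YYYY-MM-DD.
DATE_LITERAL = re.compile(r"'\d{4}-\d{2}-\d{2}[^']*'")


def normalized_fingerprint(sql_text: str) -> str:
    """Fingerprint that ignores the date literal and whitespace differences."""
    masked = DATE_LITERAL.sub("'<date>'", sql_text)
    collapsed = " ".join(masked.split())
    return hashlib.sha256(collapsed.encode("utf-8")).hexdigest()[:16]


# Hypothetical query text; table and column names are made up.
monday = "SELECT * FROM dbo.Orders WHERE LoadDate >= '2023-05-01 00:00:00'"
tuesday = "SELECT * FROM dbo.Orders WHERE LoadDate >= '2023-05-02 00:00:00'"

print(normalized_fingerprint(monday) == normalized_fingerprint(tuesday))
```

Grouping runtimes by such a fingerprint would let each day's run be compared against the previous ones even though the raw text changes.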
We ran the same query through a DB client and, as expected, it didn't finish. Rerunning the query with extra spaces from the DB client gave us a new plan and our fast execution.
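My understanding is that ad hoc plans are matched on the exact statement text, which would explain why a few extra spaces got us a fresh compile. A toy Python model of that behaviour follows. The dictionary cache and the names `ToyPlanCache` and `get_plan` are purely illustrative, and SHA-256 is a stand-in rather than whatever SQL Server uses internally.

```python
import hashlib


class ToyPlanCache:
    """Toy model: one cached 'plan' per exact statement text."""

    def __init__(self):
        self._plans = {}
        self.compilations = 0

    def get_plan(self, sql_text: str) -> str:
        # Key on the raw bytes of the text, so whitespace changes the key.
        key = hashlib.sha256(sql_text.encode("utf-8")).hexdigest()
        if key not in self._plans:
            self.compilations += 1
            self._plans[key] = f"plan #{self.compilations}"
        return self._plans[key]


cache = ToyPlanCache()
# Hypothetical query text; table and column names are made up.
original = "SELECT * FROM dbo.Orders WHERE LoadDate >= '2023-05-01'"
padded = original.replace("WHERE", "WHERE  ")  # same query, one extra space

first = cache.get_plan(original)
again = cache.get_plan(original)   # reuses the cached entry
fresh = cache.get_plan(padded)     # different text, so a new entry
print(first, again, fresh, cache.compilations)
```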
I checked statistics and they were all updated just before the original slow run of the query, so rebuilding statistics wouldn't have helped here. I cleared the plan cache (yes, I could have just deleted the one plan but that is not the point here) and performance of the original query was back to just a few seconds.
We compared the execution plans, and they were different. No surprises there. However, the statistics used by both execution plans were identical, except for the modification counts, which were higher for the newer plan but not high enough to trigger a statistics update.
We looked at the CPU of the machine and it was reported as being over 90% for the execution time with the old (bad) plan. So, it seems that the query was responsible for it. When looking at the wait times in SolarWinds Database Performance Analyzer (DPA), I get 60 min for a 1-hour interval, so I assume we are using about 1 CPU full time when the bad plan is active.
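The arithmetic behind that estimate, as a small Python sketch. It assumes that all of the wait time DPA reports belongs to this one query and is CPU time. The function names are mine.

```python
def average_active_sessions(wait_minutes: float, interval_minutes: float) -> float:
    """Wait time accumulated per unit of wall-clock time."""
    return wait_minutes / interval_minutes


def share_of_total_cpu(busy_cpus: float, cpu_count: int) -> float:
    """Percentage of total CPU capacity used by `busy_cpus` fully busy cores."""
    return 100.0 * busy_cpus / cpu_count


busy = average_active_sessions(wait_minutes=60, interval_minutes=60)
print(busy)                          # 1.0
print(share_of_total_cpu(busy, 16))  # 6.25
```

If those assumptions hold, one busy core is only about 6% of a 16-CPU server, so the 1-CPU estimate from DPA and the 90%+ machine CPU reading may not be measuring the same thing (for example, because of a parallel plan or other concurrent load).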
I see both plans (good and bad) for the same query in DPA, which means that the SQL text between the runs did not change.
The question is: what else influences the creation of the execution plan that we haven't looked at yet? We can't really recreate the issue, so we need to do some forensics in the hope of preventing bad query plans in the future, or at least of having a better answer than "It happens once in a while" for our managers.
- SQL Server 2017 Enterprise
- DB part of an Availability Group
- 16 CPUs
- 118 GB maximum server memory