Timeline for Is it reliable to process millions of records with INSERT INTO SELECT directly in the database?
Current License: CC BY-SA 4.0
34 events
| when | what | action | by | license | comment |
|---|---|---|---|---|---|
| Dec 28, 2023 at 18:25 | comment | added | Ewan | I'm giving you the benefit of the doubt; maybe you know some clever DB tricks to avoid these problems, but if it's just that you haven't come across them yet… | |
| Dec 28, 2023 at 18:24 | comment | added | Ewan | Your assumption here that you just wouldn't have to deal with these problems is exactly why my answer is correct. If you ignore them, your application WILL hit a scalability limit and crash disastrously at some point. This is the "performance" I'm talking about. If you follow my general advice your app will scale, and you will not be getting calls at Xmas asking why the accounts run has killed the webshop during peak sale hours, etc. | |
| Dec 28, 2023 at 18:19 | comment | added | Ewan | I only bring the transaction up because it's a problem with the database, not the process as a whole. If you need a transaction as part of the operation, you would obviously have to deal with that in some way. I don't see it as a problem; there are many approaches you could use depending on the details of the calculation. | |
| Dec 28, 2023 at 18:17 | comment | added | Ewan | OK. 1. we know for sure that a simple one liner will fail at some critical number of affected rows. At some point it either eats all your db memory with an open transaction, locks a table for longer than the other operations timeout or just takes too long for your connection timeout. This gets more likely as the calculation complexity, number of affected rows and general other requests per second on your db increase. | |
| Dec 28, 2023 at 13:41 | comment | added | Steve | Continuing (4/3)!... Those three IFs are however very contingent, and if the transaction is the problem, it does not necessarily follow that the operation can simply be made non-transactional (which requires expert analysis). There may instead need to be a bespoke algorithm implemented to guarantee the necessary consistency whilst lifting the data out of the database (which requires expert development), or if the initial reading of the table itself is too slow to even get a copy out, then there may be no alternatives but hardware reinforcement or a systemic redesign. | |
| Dec 28, 2023 at 13:34 | comment | added | Steve | IF we've established that the transaction is a problem, and IF we don't need both the read and write in the same transaction, and IF we can read the table fast enough to return raw data to a client but not to make a transactional summary and insert of the same data, then the standard solution would be simple: not batching the aggregation, but reading the raw data into a temp table and then treating that temp table as the source (and from there onwards, just adapting the same simple one-liner we began with to draw from the temp table instead of the master table). (3/3) | |
| Dec 28, 2023 at 13:27 | comment | added | Steve | None of us so far have any reason to believe that a simple INSERT INTO ... SELECT FROM ... GROUP BY one-liner wouldn't fit the bill, nor, if it didn't fit the bill for some performance reason, why the same performance problem wouldn't apply to a basic SELECT FROM which returns all necessary raw data to the client. You ask me how I would "avoid large transactions and table locking". By default, I wouldn't be trying to avoid a transaction, because it is an important (and often necessary) safety guarantee which the database engine provides simply and by default. (2/3) | |
| Dec 28, 2023 at 13:27 | comment | added | Steve | @Ewan, the reason I won't be drawn on a formal answer which addresses and counters all your canards is because I anticipate the extraordinary complexity of making such an answer. As for implementing batching, I'd have no problem at all giving you a code example of batching (although I'd have to move off my mobile phone and onto the desktop to draft it), but I'm unclear exactly what impediment such batching is supposed to be working around (and which approaches couldn't be used), and I'm unclear what consistency constraints apply (and which approaches couldn't be used) (1/3) | |
| Dec 28, 2023 at 12:20 | comment | added | Ewan | A naive approach to this problem, even with a simple "select sum(order.cost), customerid into ..", will land you in hot water at some point. Show us how you would implement this with batches on the database. Show us how you prevent large transactions and table locking, etc. My feeling is that once you account for anything above the most trivial, the complexity of the two solutions makes the comparison obvious. | |
| Dec 28, 2023 at 12:14 | comment | added | Ewan | @Steve I think you need to write an answer to that effect. I contend with a client based approach there is no need for a deep analysis, you use the DB methods you already have for setting and retrieving and do the calculation in unit tested code. You are using cheap, unshared resources to perform the work in a scalable and rate limitable way. You avoid the well known pitfalls of overloading your central shared, probably transactional, resource with large batch calculations. | |
| Dec 28, 2023 at 12:12 | comment | added | Steve | But I don't think it's reasonable to infer that the OP is working under that supervision, or that his level of expertise with databases is where he could be expected implicitly to perform the analysis that is necessary to determine whether this task can be moved to the client-side without subtle defects in its consistency when the database is under load performing concurrent writes. I suspect your awareness may be hazy, because most of us express howls of anguish at the idea of losing transactional protection provided by the engine, and there's no reason here to move out of that realm. (3/3) | |
| Dec 28, 2023 at 12:01 | comment | added | Steve | The overwhelming risk of your client-first approach is that manually analysing the transactional consistency needs, evaluating whether there is a solution for those needs whilst moving data back and forth to the client side, and then implementing a correct algorithm, is an expert-level task. If you're exclusively a client-side developer having your database-altering work overseen by a database developer, you'd be fine to rely on these simple rules, knowing that an expert will monitor and intervene as appropriate. (2/3) | |
| Dec 28, 2023 at 12:00 | comment | added | Steve | @Ewan, for an ETL-style process where the source and destination are the same database engine, doing all execution server-side will always be the simpler starting approach (from an analysis and coding perspective), and quite probably the most correct and performant. This is because there is less heaving of data in and out, and because the transaction manager can oversee the entire operation. It doesn't have to be a sproc if that doesn't suit your approach to source code control or testing - the SQL code can be submitted from the client-side, but with execution all server-side. (1/3) | |
| Dec 27, 2023 at 14:57 | comment | added | Ewan | Yeah, I mean you could optimise it slightly to "ask if there is a problem with the simple approach first and then try it if not" | |
| Dec 27, 2023 at 14:12 | comment | added | Doc Brown | @Ewan: with simple, general rules, it is a little bit tricky. My answer also contains a simple, general rule for creating high-performance, problem-free software: "If one has to make a decision between two approaches, try out the simpler one first and use a more complicated one only when the simpler one proves itself not to be sufficient". | |
| Dec 27, 2023 at 10:13 | comment | added | Ewan | simple general rules make for good helpful answers to simple low detail questions | |
| Dec 27, 2023 at 10:10 | comment | added | Ewan | Disagreeing. It's a simple general rule to avoid doing processing on the database. Following it will result in high-performance, problem-free software. | |
| Dec 26, 2023 at 14:58 | comment | added | Steve | But you're not being specific about which circumstances your answer would apply to. In many circumstances your answer wouldn't be right. I'm only cautioning against sweeping statements. Whether to execute server-side or client-side, and whether the solution to a found problem lies in changing the site of execution (rather than hardware reinforcement, say), is an expert professional judgment requiring a large amount of contextual information. There is no simple rule for high performance or problem-free operation, like "always the client". Unsure whether you're misunderstanding or disagreeing. | |
| Dec 26, 2023 at 13:54 | comment | added | Ewan | OK, well my answer is for the case where it's non-trivial. | |
| Dec 26, 2023 at 12:44 | comment | added | Steve | The OP's case sounds trivial enough to me that in normal circumstances it could be handled as a one-liner. He only asked abstractly which approach is better - he wasn't saying he already had a problem with a simple aggregating query. If a query was a problem, I'd need far more information about why, about the context, and about the latitude for a solution. There are all different kinds of "complicated logic" - the database is the best place for any complications involving consistency, but not necessarily the best place for all complications. | |
| Dec 26, 2023 at 11:21 | comment | added | Ewan | Why don't you write an answer with a sample sproc showing how you would deal with updating each row as "done" or "errored", allowing for batching up into, say, 1000-row groups per transaction to avoid memory consumption and dealing with a connection timeout? Would you run it on a separate reporting box? I want to know if you just think it's so trivial you won't run into these issues, or if having complex logic in the DB is preferable in general? | |
| Dec 25, 2023 at 20:11 | comment | added | Steve | I agree tribalism is not a justification. And if you already have a client application, then that makes putting the logic there more reasonable than it would be otherwise. But it's very difficult to make sweeping statements about performance, reliability, or any other factor. Circumstances are so various that the context always counts. I'm reluctant myself to be more specific in the rebuttal, because of the complexity of setting up the context and covering my own backside! | |
| Dec 25, 2023 at 18:36 | comment | added | Ewan | I mean, if you don't have the option to add code somewhere, you have to do it in SQL or SSIS or a job or something. Spinning up a machine and writing and deploying code is a big deal. If you have a bunch of code already, running on multiple machines, then writing a new worker and running it on some cheap load-balanced machine is nothing and the obvious choice. You don't have a tribal allegiance to the database being the best solution. | |
| Dec 25, 2023 at 17:00 | comment | added | Steve | You mean, if you're someone who is paid specifically to understand databases and their performance? 😂 Honestly it's a very complicated area; but it's your answer. | |
| Dec 25, 2023 at 16:25 | comment | added | Ewan | Maybe if you are a DBA. | |
| Dec 25, 2023 at 16:20 | comment | added | Steve | I'd say it's very controversial. | |
| Dec 25, 2023 at 15:26 | comment | added | Ewan | It's considered bad practice; obviously if it works it works, but it's a code smell. Using your database resources to run logic is expensive. You can do it more cheaply, scalably and reliably by writing code which runs on some other box. I don't think this is a controversial opinion in 2023. You would only try to do this in a DB if you have no other option. | |
| Dec 25, 2023 at 13:24 | comment | added | Steve | The complexity of any calculations may be a reason to palm it off to a different language. But honestly, looping in its own right is not a bad practice if it is (a) still a correct solution, and (b) done specifically to fall back from a correct single-statement solution which is too demanding for the context. The reason loops are suspect in SQL is because new practitioners either (a) use them as a first resort (inefficiently, and often incorrectly), or (b) use them incorrectly in a way that doesn't preserve an appropriate transactional consistency. | |
| Dec 25, 2023 at 12:16 | comment | added | Ewan | Hmm, I have to disagree with you here: if you have complex "business logic" or ETL procedures in SQL, that's bad practice. A loop is a sure sign you are doing something wrong. | |
| Dec 25, 2023 at 12:13 | comment | added | Steve | Loops are considered a bad practice in SQL when a single-statement alternative exists, but if your whole argument rests on saying that a single statement isn't appropriate for this case (for example because of the duration of the locks taken, or the resources consumed for transactional consistency of the entire batch), then obviously looping in SQL then becomes a legitimate solution. I certainly agree it's a complicated judgment. | |
| Dec 25, 2023 at 11:58 | comment | added | Ewan | In my defence: A. the question at face value implies a single SQL statement, and B. if you go further, with a loop in the SQL (itself considered bad practice) and sub-transactions, you can see that it's still not trivial to solve the issues of this long-running process and knowing which rows have been processed. If we compare a 'full' pure-SQL solution which handles all these problems vs a code solution, the apparent benefits of doing it the "easy" one-line-of-SQL way disappear, and it's not about simple vs complex anymore; it's complex vs complex, a much fairer comparison. | |
| Dec 25, 2023 at 11:48 | comment | added | Steve | These are all very questionable assertions indeed. Certainly (2) is completely unfounded, since there is no significant obstacle in SQL to breaking up a batch of work into individual transactions, and the transactional guarantees the database offers will prevent inconsistency under all circumstances (or if intractable performance problems emerge, it will demonstrate the contradictions of your design). | |
| Dec 25, 2023 at 11:36 | history | edited | Ewan | CC BY-SA 4.0 | added 153 characters in body |
| Dec 25, 2023 at 11:11 | history | answered | Ewan | CC BY-SA 4.0 |
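The "simple one-liner" that the thread keeps coming back to is a single server-side `INSERT INTO ... SELECT ... GROUP BY` executed in one transaction. A minimal sketch, using SQLite via Python's `sqlite3` for illustration (the `orders`/`customer_totals` schema and column names are assumptions, not from the question):

```python
import sqlite3

# Hypothetical schema: orders(customer_id, cost) -> customer_totals(customer_id, total).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id INTEGER, cost REAL);
    CREATE TABLE customer_totals (customer_id INTEGER PRIMARY KEY, total REAL);
""")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 10.0), (1, 5.0), (2, 7.5), (3, 2.5)])
conn.commit()

# The one-liner: aggregation and insert happen entirely server-side,
# inside a single transaction managed by the engine.
with conn:
    conn.execute("""
        INSERT INTO customer_totals (customer_id, total)
        SELECT customer_id, SUM(cost) FROM orders GROUP BY customer_id
    """)

print(conn.execute("SELECT * FROM customer_totals ORDER BY customer_id").fetchall())
# -> [(1, 15.0), (2, 7.5), (3, 2.5)]
```

The trade-off debated above is that with millions of source rows this single transaction grows with the table, which is where the lock-duration and memory concerns come in.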
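The batched alternative raised in the comments (small groups of rows per transaction, each source row marked "done" or "errored" so a crashed run can resume) might be sketched as follows. This is only an illustration in SQLite: the batch size, the `status` column, and the table names are all assumptions, and the comments rightly note that a production version must also handle locking and timeouts:

```python
import sqlite3

BATCH = 2  # the comments suggest e.g. 1000; kept tiny here for illustration

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER,
                         cost REAL, status TEXT DEFAULT 'pending');
    CREATE TABLE customer_totals (customer_id INTEGER PRIMARY KEY, total REAL);
""")
conn.executemany("INSERT INTO orders (customer_id, cost) VALUES (?, ?)",
                 [(1, 10.0), (1, 5.0), (2, 7.5), (3, 2.5), (3, 1.0)])
conn.commit()

while True:
    # Pick the next small batch of unprocessed rows; each batch is its own
    # short transaction, so no single transaction grows with the table.
    rows = conn.execute("""
        SELECT id, customer_id, cost FROM orders
        WHERE status = 'pending' ORDER BY id LIMIT ?
    """, (BATCH,)).fetchall()
    if not rows:
        break
    try:
        with conn:  # commits the batch, or rolls it back on error
            for row_id, cust, cost in rows:
                conn.execute("""
                    INSERT INTO customer_totals (customer_id, total) VALUES (?, ?)
                    ON CONFLICT(customer_id) DO UPDATE SET total = total + excluded.total
                """, (cust, cost))
                conn.execute("UPDATE orders SET status = 'done' WHERE id = ?",
                             (row_id,))
    except sqlite3.Error:
        # Mark the failed batch so it can be inspected or retried later.
        conn.executemany("UPDATE orders SET status = 'errored' WHERE id = ?",
                         [(r[0],) for r in rows])
        conn.commit()
```

The same loop could equally live in client code or a stored procedure; the structural point is the per-batch commit plus a durable progress marker.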
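Steve's temp-table variant (a quick read lifts the raw rows into a copy, then the same one-liner aggregates from the copy, keeping any lock on the master table brief) could look like this. Again a sketch under assumed table names, shown in SQLite:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id INTEGER, cost REAL);
    CREATE TABLE customer_totals (customer_id INTEGER PRIMARY KEY, total REAL);
""")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 10.0), (2, 7.5), (1, 5.0)])
conn.commit()

# Step 1: a cheap, short read copies the raw rows into a temp table,
# so the master table is touched only briefly.
conn.execute("CREATE TEMP TABLE staging AS SELECT customer_id, cost FROM orders")

# Step 2: the same aggregating one-liner, now drawing from the copy.
with conn:
    conn.execute("""
        INSERT INTO customer_totals (customer_id, total)
        SELECT customer_id, SUM(cost) FROM staging GROUP BY customer_id
    """)
```

Note the caveat from the thread: the copy is only consistent as of step 1, so this only applies when the read and the write do not need to share one transaction.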