
In my job I often have to create a SQL Server procedure that processes millions of rows, saves them into a temp table (a staging table), and finally saves them into one or more tables in one or more databases.

I am not looking for alternative solutions, such as SSIS.

I cannot disable indexes or constraints, take the database offline, change the recovery model, etc.

We have looked into scheduling this process to run when the systems are less busy, but we operate in a 24/7/365 online retail environment.

There is a very similar question: Performance Inserting and Updating Millions of rows into a table

This question is also relevant: What is the fastest way to insert large numbers of rows?

Example One:

CREATE PROCEDURE [dbo].[udpstaging_page_import_fromFilter]
     @sourceDesc nvarchar(50)             -- e.g. 'Coremetrics'
    ,@feedDesc nvarchar(50)               -- e.g. 'Daily Exports'
    ,@process_job_task_logID bigint
AS
BEGIN
    SET NOCOUNT ON;

    BEGIN TRY
        --truncate table prior INSERT
        exec dbo.udpstaging_page_truncateTable;

        declare @source_feedID int;

        exec crm_admin.dbo.udpsource_feedID_select
             @sourceDesc
            ,@feedDesc
            ,@source_feedID = @source_feedID OUTPUT;

        -- drop temp tables
        if exists (select *
                   from tempdb.dbo.sysobjects o
                   where o.xtype in ('U')
                     and o.id = object_id(N'tempdb..#pageImport'))
            drop table #pageImport;

        -- create temp tables
        create table #pageImport(
             pageImportID [bigint] identity(1,1) NOT NULL
            ,pageCode [varbinary](16) NOT NULL
        );

        insert into #pageImport(
            pageCode
        )
        select pageCode
        from Coremetrics.PageView
        group by pageCode;

        -- add indexes to temp table
        CREATE CLUSTERED INDEX IDX_pageImport_pageImportID ON #pageImport(pageImportID);
        CREATE INDEX IDX_pageImport_pageCode ON #pageImport(pageCode);

        declare @recordCount bigint
               ,@counter int
               ,@maxCounter int
               ,@updateRowCount int;

        select @counter = MIN(pageImportID)
              ,@recordCount = MAX(pageImportID)
        from #pageImport;

        set @updateRowCount = 1000000;

        while @counter <= @recordCount
        begin
            set @maxCounter = (@counter + @updateRowCount - 1);

            with pageImport as (
                select pv.pageCode
                      ,pv.websiteCode as 'pageCIV'
                      ,dbo.udfDerivePageName(pv.PAGE_ID, pv.CONTENT_CATEGORY_ID) as 'pageName'
                      ,dbo.udfDerivePageName(pv.PAGE_ID, pv.CONTENT_CATEGORY_ID) as 'pageDesc'
                      ,pv.[TIMESTAMP] as 'pageCreateDate'
                      ,pv.pageTypeCode
                      ,'' as 'pageTypeCIV'
                      ,pv.websiteCode
                      ,pv.marketID
                      ,@source_feedID as 'source_feedID'
                      ,@process_job_task_logID as 'process_job_task_logID'
                      ,GETDATE() as 'createdDate'
                      ,SUSER_NAME() as 'createdBy'
                      ,GETDATE() as 'modifiedDate'
                      ,SUSER_NAME() as 'modifiedBy'
                      ,ROW_NUMBER() over (
                           PARTITION BY [pi].pageCode
                           ORDER BY pv.[TIMESTAMP]
                       ) as 'is_masterPageImport'
                from #pageImport [pi]
                inner join Coremetrics.PageView pv
                    on pv.pageCode = [pi].pageCode
                    and [pi].pageImportID between @counter and @maxCounter
            )
            insert into staging.[page](
                 pageCode
                ,pageCIV
                ,pageName
                ,pageDesc
                ,pageCreateDate
                ,pageTypeCode
                ,pageTypeCIV
                ,websiteCode
                ,marketID
                ,source_feedID
                ,process_job_task_logID
                ,createdDate
                ,createdBy
                ,modifiedDate
                ,modifiedBy
            )
            select pageCode
                  ,pageCIV
                  ,pageName
                  ,pageDesc
                  ,pageCreateDate
                  ,pageTypeCode
                  ,pageTypeCIV
                  ,websiteCode
                  ,marketID
                  ,source_feedID
                  ,process_job_task_logID
                  ,createdDate
                  ,createdBy
                  ,modifiedDate
                  ,modifiedBy
            from pageImport
            where 1 = 1
              and is_masterPageImport = 1;

            set @counter = @counter + @updateRowCount;
        end;

        SET NOCOUNT OFF;
        RETURN 0;
    END TRY
    BEGIN CATCH
        print N'inner catch: ' + error_message();
        SET NOCOUNT OFF;
        RETURN -10;
    END CATCH
END;

Example Two:

This is just part of a stored procedure that is too big to post here.

IF OBJECT_ID('tempdb.dbo.#ztblOrgProductStockView', 'U') IS NOT NULL
    DROP TABLE #ztblOrgProductStockView;

CREATE TABLE #ztblOrgProductStockView (
    [lngID] [int] NOT NULL IDENTITY PRIMARY KEY,
    [sintMarketId] [smallint] NOT NULL,
    [sintChannelId] [smallint] NOT NULL,
    [strOrgVwName] [varchar](50) COLLATE Latin1_General_CI_AS NOT NULL,
    [tintSequence] [tinyint] NOT NULL,
    [tintOrgGrpId] [tinyint] NOT NULL,
    [strTier1] [varchar](20) COLLATE Latin1_General_CI_AS NOT NULL,
    [strTier2] [varchar](20) COLLATE Latin1_General_CI_AS NOT NULL,
    [strTier3] [varchar](20) COLLATE Latin1_General_CI_AS NOT NULL,
    [strTier4] [varchar](20) COLLATE Latin1_General_CI_AS NOT NULL,
    [strTier5] [varchar](20) COLLATE Latin1_General_CI_AS NOT NULL,
    [strTier6] [varchar](20) COLLATE Latin1_General_CI_AS NOT NULL,
    [strItemNo] [varchar](20) COLLATE Latin1_General_CI_AS NOT NULL,
    [strStockTypeName] [varchar](50) COLLATE Latin1_General_CI_AS NOT NULL,
    [tintStockTypeId] [tinyint] NOT NULL,
    [sintDueWithinDays] [tinyint] NOT NULL,
    [bitOverSellingAllowed] [bit] NOT NULL,
    [dtmStartDate] [datetime] NULL,
    [dtmEndDate] [datetime] NULL,
    [dtmExpected] [datetime] NULL,
    [blnIsLocalToMarket] [bit] NULL,
    [blnPremiunDelvAllowed] [bit] NULL,
    [strStdDeliveryDaysCode] [varchar](20)
)

INSERT into #ztblOrgProductStockView (
     sintMarketId
    ,sintChannelId
    ,strOrgVwName
    ,tintSequence
    ,tintOrgGrpId
    ,strTier1
    ,strTier2
    ,strTier3
    ,strTier4
    ,strTier5
    ,strTier6
    ,strItemNo
    ,strStockTypeName
    ,tintStockTypeId
    ,sintDueWithinDays
    ,bitOverSellingAllowed
    ,dtmStartDate
    ,dtmEndDate
    ,dtmExpected
    ,blnIsLocalToMarket
    ,blnPremiunDelvAllowed
    ,strStdDeliveryDaysCode
)
select   rv.sintMarketId
        ,rv.sintChannelId
        ,rv.strOrgVwName
        ,tintSequence
        ,tintOrgGrpId
        ,ISNULL(rv.pnTier1,'ÿ')
        ,ISNULL(rv.pnTier2,'ÿ')
        ,ISNULL(rv.pnTier3,'ÿ')
        ,ISNULL(rv.strTier4,'ÿ')
        ,ISNULL(rv.strTier5,'ÿ')
        ,ISNULL(rv.strTier6,'ÿ')
        ,rv.strItemNo
        ,strStockTypeName
        ,tintStockTypeId
        ,sintDueWithinDays
        ,bitOverSellingAllowed
        ,dtmStartDate
        ,dtmEndDate
        ,dtmExpected
        ,blnIsLocalToMarket
        ,blnPremiunDelvAllowed
        ,strStdDeliveryDaysCode
from #ztblOrgProductRangeView_1 rv
inner join #ztblOrgProductSeqView_1 sv
    on rv.strItemNo = sv.strItemNo
    and rv.lngOrgVwId = sv.lngOrgVwId
--order by rv.sintMarketId, rv.sintChannelId, sv.tintOrgGrpId, rv.strItemNo, sv.tintStockTypeId

--set @DebugDate = convert(nvarchar(10),getdate(),108)
--raiserror('%s [%s]', 0, 1, N'Populated #ztblOrgProductStockView', @DebugDate) with nowait

--select [sintMarketId], [sintChannelId], [tintOrgGrpId], [strItemNo], [tintStockTypeId], count(*)
--from [#ztblOrgProductStockView]
--group by [sintMarketId], [sintChannelId], [tintOrgGrpId], [strItemNo], [tintStockTypeId]
--having count(*) > 1

set @lngRowcount = @@ROWCOUNT

set nocount on;

While @lngRowcount > 0
Begin
    Set @lngMinID = @lngMaxID
    Set @lngMaxID = @lngMaxID + 5000

    INSERT INTO [ztblOrgProductStockView] (
         [sintActiveView]
        ,[sintMarketId]
        ,[sintChannelId]
        ,[strOrgVwName]
        ,[tintSequence]
        ,[tintOrgGrpId]
        ,[strTier1]
        ,[strTier2]
        ,[strTier3]
        ,[strTier4]
        ,[strTier5]
        ,[strTier6]
        ,[strItemNo]
        ,[strStockTypeName]
        ,[tintStockTypeId]
        ,[sintDueWithinDays]
        ,[bitOverSellingAllowed]
        ,[dtmStartDate]
        ,[dtmEndDate]
        ,[dtmExpected]
        ,[blnIsLocalToMarket]
        ,[blnPremiunDelvAllowed]
        ,[strStdDeliveryDaysCode])
    Select   @sintActiveView_new
            ,[sintMarketId]
            ,[sintChannelId]
            ,[strOrgVwName]
            ,[tintSequence]
            ,[tintOrgGrpId]
            ,[strTier1]
            ,[strTier2]
            ,[strTier3]
            ,[strTier4]
            ,[strTier5]
            ,[strTier6]
            ,[strItemNo]
            ,[strStockTypeName]
            ,[tintStockTypeId]
            ,[sintDueWithinDays]
            ,[bitOverSellingAllowed]
            ,[dtmStartDate]
            ,[dtmEndDate]
            ,[dtmExpected]
            ,[blnIsLocalToMarket]
            ,[blnPremiunDelvAllowed]
            ,[strStdDeliveryDaysCode]
    From #ztblOrgProductStockView
    Where lngID >= @lngMinID
      And lngID < @lngMaxID

    set @lngRowcount = @@ROWCOUNT
End

Questions

Please note that opinion-based answers are not the most popular here; please try to give evidence when possible.

1) How do I decide the best batch size? For example, in Example Two it is 5,000.

2) Would performance generally be more likely to improve if I issued BEGIN TRANSACTION and COMMIT TRANSACTION within the WHILE loop, one transaction per batch?

3) If I want to change the batch size, what should I monitor to decide whether I can increase it or whether I am causing I/O latency?

I currently measure I/O latency using the script below:

-- How to identify I/O latency issues
-- Below SQL code helps in identifying the I/O latency issues in a SQL Server system on a per-file basis.
-- http://sqlserverdbknowledge.wordpress.com/2011/11/08/how-to-identify-io-latency-issues/
-- http://www.sqlskills.com/blogs/paul/how-to-examine-io-subsystem-latencies-from-within-sql-server/
-- MARCELO MIORELLI 26-JULY-2013
SELECT
    --- virtual file latency
    ReadLatency = CASE WHEN num_of_reads = 0
        THEN 0 ELSE (io_stall_read_ms / num_of_reads) END,
    WriteLatency = CASE WHEN num_of_writes = 0
        THEN 0 ELSE (io_stall_write_ms / num_of_writes) END,
    Latency = CASE WHEN (num_of_reads = 0 AND num_of_writes = 0)
        THEN 0 ELSE (io_stall / (num_of_reads + num_of_writes)) END,
    --- avg bytes per IOP
    AvgBPerRead = CASE WHEN num_of_reads = 0
        THEN 0 ELSE (num_of_bytes_read / num_of_reads) END,
    AvgBPerWrite = CASE WHEN num_of_writes = 0
        THEN 0 ELSE (num_of_bytes_written / num_of_writes) END,
    AvgBPerTransfer = CASE WHEN (num_of_reads = 0 AND num_of_writes = 0)
        THEN 0 ELSE ((num_of_bytes_read + num_of_bytes_written) / (num_of_reads + num_of_writes)) END,
    LEFT (mf.physical_name, 2) AS Drive,
    DB_NAME (vfs.database_id) AS DB,
    --vfs.*,
    mf.physical_name
FROM sys.dm_io_virtual_file_stats (NULL, NULL) AS vfs
JOIN sys.master_files AS mf
    ON vfs.database_id = mf.database_id
    AND vfs.file_id = mf.file_id
--WHERE vfs.file_id = 2 -- log files
--ORDER BY Latency DESC
--ORDER BY ReadLatency DESC
ORDER BY WriteLatency DESC;
GO
  • Mostly I am running on SQL Server 2014, but unfortunately I still have some core systems running on SQL Server 2005, and these are not likely to be upgraded until late 2016. Commented Oct 28, 2015 at 16:27
  • Here's an article on the transaction log that might guide you on whether or not to use multiple batch-sized transactions. For example, if you use smaller transactions, your log might perform better. msdn.microsoft.com/en-us/library/ms190925%28v=sql.120%29.aspx Commented Oct 28, 2015 at 16:35

1 Answer


opinion-based answers are not the most popular here; please try to give evidence when possible.

Well, that is not entirely fair given that, ultimately, the "evidence" for what works best will come from your system ;-). Your hardware, your data, your system load, etc. will determine what works best. There are a lot of variables in both how to approach things and how your system works, so what works best in one system might not be so great in another.

1) How do I decide the best batch size? For example, in Example Two it is 5,000.

This is mostly a matter of trial-and-error to see what works best. However, it is important to keep in mind that lock escalation typically occurs at 5000 locks. Depending on how your data is organized, 5000 changes could be 5000 row locks or a few page locks. This is on a per-object basis since the ordering of the rows can be different across different indexes. The point is, changes made to tables that need to be usable by other processes during these operations should try to avoid table locks (i.e. the result of lock escalation). But tables such as the staging table and temp tables would benefit from table locks since it is a single operation and there should be no contention, hence the TABLOCK hint when doing a "bulk" INSERT.
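To illustrate the TABLOCK point, here is a minimal sketch of a staging load taken under a single table lock. The table and column names are hypothetical (they are not from your examples); the idea is simply that one table lock on a private staging table is cheaper than thousands of row or page locks:

-- Hypothetical staging load: a single table lock instead of thousands
-- of row/page locks, on a table nothing else is reading from.
INSERT INTO staging.PageLoad WITH (TABLOCK)
       (pageCode, pageName, createdDate)
SELECT src.pageCode,
       src.pageName,
       GETDATE()
FROM   dbo.SourceFeed AS src;   -- hypothetical source table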

2) Would performance generally be more likely to improve if I issued BEGIN TRANSACTION and COMMIT TRANSACTION within the WHILE loop, one transaction per batch?

Wrapping multiple DML operations into an explicit transaction greatly helps performance when those operations are row-by-row. From the code you posted here, you are already doing set-based operations so there is only minor timing benefit from combining anything into a transaction. Also, in your WHILE loops in both examples you are doing a single INSERT operation, which is its own transaction, hence adding a transaction inside of the WHILE loop gains nothing anyway. And, adding an explicit transaction around the WHILE loop would put the entire set into a single transaction, which again might help a little on the timing, but you would then also have a huge transaction that would increase chances for blocking as well as contribute to LOG file growth since this transaction would be active for longer.
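For reference only, the "one transaction per batch" pattern from your question 2 would look roughly like the sketch below. It uses a simplified, hypothetical staging/target pair (#Staging and dbo.TargetTable) rather than your actual tables, and it assumes the staging table has a dense, contiguous identity column, as in your examples:

DECLARE @MinID int,
        @BatchSize int,
        @Rows int;

SET @MinID = 1;
SET @BatchSize = 5000;
SET @Rows = 1;

WHILE @Rows > 0
BEGIN
    BEGIN TRANSACTION;

        -- one batch per transaction: keeps each transaction short, which
        -- reduces blocking and lets log space be reused sooner
        INSERT INTO dbo.TargetTable (ID, Payload)
        SELECT ID, Payload
        FROM   #Staging
        WHERE  ID >= @MinID
          AND  ID <  @MinID + @BatchSize;

        SET @Rows = @@ROWCOUNT;

    COMMIT TRANSACTION;

    SET @MinID = @MinID + @BatchSize;
END;

As noted above, with a single INSERT per iteration this mostly just makes the batch boundary explicit; each INSERT is already an atomic transaction on its own.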

3) If I want to change the batch size, what should I monitor to decide whether I can increase it or whether I am causing I/O latency?

Monitor whether the process runs faster or slower. Try several different batch sizes and let each one run for several iterations of the process. Keep track of how long the process runs for each batch size, and you will find what works best.
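One low-tech way to gather that evidence is to log the duration of each batch for every batch size you test. The sketch below is only an illustration; the log table and its columns are made up, and @updateRowCount is the batch-size variable from Example One:

-- Hypothetical timing log, created once
CREATE TABLE dbo.BatchTimingLog (
    LogID        int IDENTITY(1,1) PRIMARY KEY,
    BatchSize    int      NOT NULL,
    RowsInserted int      NOT NULL,
    DurationMs   int      NOT NULL,
    LoggedAt     datetime NOT NULL DEFAULT GETDATE()
);

-- Inside the WHILE loop, wrapped around each batch INSERT
DECLARE @t0 datetime, @rows int;
SET @t0 = GETDATE();

-- ... the batch INSERT from the loop goes here ...

SET @rows = @@ROWCOUNT;   -- capture immediately after the INSERT

INSERT INTO dbo.BatchTimingLog (BatchSize, RowsInserted, DurationMs)
VALUES (@updateRowCount, @rows, DATEDIFF(millisecond, @t0, GETDATE()));

Comparing average DurationMs (and total elapsed time) per BatchSize across a few runs, alongside the per-file latency numbers from the script you posted, gives you something concrete to base the batch size on.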


Along these lines, I would at least try reducing the million row batch size in Example 1.

I would also convert the scalar UDF dbo.udfDerivePageName into an Inline TVF and incorporate that into the query (Example 1) using CROSS APPLY or OUTER APPLY. And, considering that both calls to the UDF pass in the same two parameters, you would just reference the returned field twice (once as pageName and once as pageDesc) rather than having two calls to the iTVF.
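Since the body of dbo.udfDerivePageName isn't posted, the sketch below only shows the general shape of that conversion; the expression inside the TVF is a placeholder, and the parameter types are assumed to be int:

-- Hypothetical inline TVF; the expression is a stand-in for whatever
-- dbo.udfDerivePageName actually computes.
CREATE FUNCTION dbo.tvfDerivePageName (@PAGE_ID int, @CONTENT_CATEGORY_ID int)
RETURNS TABLE
AS
RETURN
    SELECT CAST(@PAGE_ID AS nvarchar(20)) + N'-'
         + CAST(@CONTENT_CATEGORY_ID AS nvarchar(20)) AS pageName;
GO

-- In the CTE of Example One, the two scalar calls become one APPLY:
select pv.pageCode
      ,pv.websiteCode as 'pageCIV'
      ,dpn.pageName as 'pageName'
      ,dpn.pageName as 'pageDesc'
      -- ...remaining columns as before...
from #pageImport [pi]
inner join Coremetrics.PageView pv
    on pv.pageCode = [pi].pageCode
    and [pi].pageImportID between @counter and @maxCounter
cross apply dbo.tvfDerivePageName(pv.PAGE_ID, pv.CONTENT_CATEGORY_ID) as dpn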

Another option to reduce contention on the destination table (if this is merely inserting new rows and not updating existing rows) is to use table partitioning. This would allow you to stage the data as you are currently doing, but then, rather than inserting that new data into the live table, you would SWITCH the new partition in, which is a rather quick operation. This won't help with the time or I/O it takes to stage the data in the first place, but it could eliminate the time and contention taken by "merging" the staged data into the live table. It's something to look into.
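A minimal sketch of what the SWITCH step could look like, assuming the live table is partitioned on a suitable key and the staged rows are loaded into a table with an identical structure on the same filegroup. The partition function, table names, and boundary value below are all hypothetical:

-- The target partition must be empty, and staging.PageLoad must carry a
-- CHECK constraint matching the target partition's boundaries.
DECLARE @PartitionNumber int;

SELECT @PartitionNumber = $PARTITION.pfPageLoadDate('20151028');

ALTER TABLE staging.PageLoad
    SWITCH TO dbo.Page PARTITION @PartitionNumber;

Note that on the versions you mention, table partitioning is an Enterprise Edition feature, so that is worth checking first.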
