
I'm working on quite a complex problem for which I need your advice. I need to copy data from one table to another, and I know there are quite a few solutions for this, such as:

If both tables have the same schema:

INSERT INTO newTable SELECT * FROM oldTable 

If the tables have different schemas:

INSERT INTO newTable (col1, col2, col3) SELECT column1, column2, column3 FROM oldTable 

A SQL cursor can also be used, and many more variations exist.
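For completeness, a cursor-based copy over the placeholder tables from the snippets above might look like the sketch below (the column types are assumptions). It works, but it processes one row at a time, which is why it tends to be the slowest option.

-- Sketch only: cursor variant of the copy, using the placeholder
-- tables and columns from the snippets above; row-by-row, hence slow.
DECLARE @c1 nvarchar(100), @c2 nvarchar(100), @c3 nvarchar(100);

DECLARE copy_cursor CURSOR FOR
    SELECT column1, column2, column3 FROM oldTable;

OPEN copy_cursor;
FETCH NEXT FROM copy_cursor INTO @c1, @c2, @c3;

WHILE @@FETCH_STATUS = 0
BEGIN
    INSERT INTO newTable (col1, col2, col3) VALUES (@c1, @c2, @c3);
    FETCH NEXT FROM copy_cursor INTO @c1, @c2, @c3;
END

CLOSE copy_cursor;
DEALLOCATE copy_cursor;

But let me summarize the problem: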

The problem

I have a CSV file which contains approx. 1.5 million records. Each record has 6 fields that are being imported.

Now, to insert the data from the CSV file into SQL Server, I'm using C# in combination with Entity Framework.

For performance, I first insert all those records into a temporary (staging) table. This is the schema of the temporary table:

CREATE TABLE [dbo].[TEMP_GENERIC_ARTICLE](
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [GlnCode] [nvarchar](100) NULL,
    [Description] [nvarchar](max) NULL,
    [VendorId] [nvarchar](100) NULL,
    [VendorName] [nvarchar](100) NULL,
    [ItemNumber] [nvarchar](100) NULL,
    [ItemUOM] [nvarchar](max) NULL,
    [DateCreatedInternal] [datetime] NOT NULL,
    [DateUpdatedInternal] [datetime] NOT NULL,
    CONSTRAINT [PK_dbo.TEMP_GENERIC_ARTICLE] PRIMARY KEY CLUSTERED ([Id] ASC)
)

Then I have a table which another application will consume, called T_GENERIC_ARTICLE, whose schema is:

CREATE TABLE [dbo].[T_GENERIC_ARTICLE](
    [GlnCode] [nvarchar](100) NOT NULL,
    [Description] [nvarchar](max) NULL,
    [VendorId] [nvarchar](100) NOT NULL,
    [VendorName] [nvarchar](100) NULL,
    [ItemNumber] [nvarchar](100) NOT NULL,
    [ItemUOM] [nvarchar](128) NOT NULL,
    CONSTRAINT [PK_dbo.T_GENERIC_ARTICLE] PRIMARY KEY CLUSTERED
        ([GlnCode] ASC, [VendorId] ASC, [ItemNumber] ASC, [ItemUOM] ASC)
)

So the real table no longer has the Id field, and it has a primary key which spans 4 columns.

Now, what I would like to do:

As soon as the data is stored in the temp table, or after every 1000 records for example, I need to run a SQL stored procedure which will copy the data from the temp table to the destination table.

During the copy, I need to check whether a record with that primary key already exists. If it does, I would like to update the record; otherwise I want to insert a new record.

After the copy has fully completed, I would like to remove all the records from the temp table.
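To make that flow concrete, a minimal sketch of such a stored procedure is below. The procedure name is invented, and the upsert step anticipates the MERGE shown in the answer below:

-- Hypothetical wrapper procedure for the flow described above.
CREATE PROCEDURE [dbo].[usp_CopyGenericArticles]
AS
BEGIN
    SET NOCOUNT ON;
    BEGIN TRANSACTION;

    -- Upsert the staged rows into the destination table
    -- (the MERGE in the answer below is exactly this step).
    MERGE [dbo].[T_GENERIC_ARTICLE] AS t
    USING [dbo].[TEMP_GENERIC_ARTICLE] AS s
        ON  s.GlnCode    = t.GlnCode
        AND s.VendorId   = t.VendorId
        AND s.ItemNumber = t.ItemNumber
        AND s.ItemUOM    = t.ItemUOM
    WHEN MATCHED THEN
        UPDATE SET Description = s.Description, VendorName = s.VendorName
    WHEN NOT MATCHED THEN
        INSERT (GlnCode, Description, VendorId, VendorName, ItemNumber, ItemUOM)
        VALUES (s.GlnCode, s.Description, s.VendorId, s.VendorName, s.ItemNumber, s.ItemUOM);

    -- Empty the temp table once the copy has completed.
    TRUNCATE TABLE [dbo].[TEMP_GENERIC_ARTICLE];

    COMMIT TRANSACTION;
END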

The question

What's the best approach for working with such a large data set (1.5 million records) in order to transfer the records from the temp table to the destination table as quickly and efficiently as possible?

I've never worked with such large data sets, so I really need some advice on this one.

Kind regards

  • Why don't you import the data directly from the CSV to the target table? Commented Apr 16, 2015 at 8:03
  • Probably should be done through an application. Commented Apr 16, 2015 at 8:04
  • But you can use SSIS as well, just for your info. Commented Apr 16, 2015 at 8:04
  • SSIS is perfectly capable of handling data from CSV, and 1.5 million records is hardly breakfast for SQL Server. Even if this process is done repeatedly, I don't see a reason to write C# code and use another table in the process. Commented Apr 16, 2015 at 8:08
  • 1.5 million records is not that large for this task, all the more so when processing it in one batch. This task can be done with the MERGE command in one step, in reasonable time I suppose. As said in a comment before, it is easier to merge the data directly from the CSV: with SSIS if you do this regularly, or with BULK INSERT and MERGE if it is a one-off (see the sketch after these comments). stackoverflow.com/questions/23026501/… Commented Apr 16, 2015 at 8:11
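As a side note on the BULK INSERT route mentioned in that comment, a minimal sketch is below. The file path, delimiters, and header row are assumptions about the CSV; and because the staging table above has extra columns (Id, DateCreatedInternal, DateUpdatedInternal) that are not in the 6-field file, a real load would need a format file or a view exposing only the six CSV columns.

-- Hypothetical BULK INSERT of the CSV into the staging table.
-- Path, delimiters, and FIRSTROW are assumptions about the file's format.
BULK INSERT [dbo].[TEMP_GENERIC_ARTICLE]
FROM 'C:\import\generic_articles.csv'
WITH (
    FIELDTERMINATOR = ',',   -- assumed column delimiter
    ROWTERMINATOR   = '\n',  -- assumed row delimiter
    FIRSTROW        = 2,     -- assumed header row to skip
    TABLOCK                  -- table lock for a faster bulk load
);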

1 Answer


Use MERGE to copy the data from the staging table to the main table; there should be no problem handling 1.5 million rows:

MERGE [dbo].[T_GENERIC_ARTICLE] AS t
USING [TEMP_GENERIC_ARTICLE] AS s
    ON  s.GlnCode    = t.GlnCode
    AND s.VendorId   = t.VendorId
    AND s.ItemNumber = t.ItemNumber
    AND s.ItemUOM    = t.ItemUOM
WHEN MATCHED THEN
    UPDATE SET Description = s.Description,
               VendorName  = s.VendorName
WHEN NOT MATCHED THEN
    INSERT (GlnCode, Description, VendorId, VendorName, ItemNumber, ItemUOM)
    VALUES (s.GlnCode, s.Description, s.VendorId, s.VendorName, s.ItemNumber, s.ItemUOM);

SSIS is good for the insert, but it does not handle the upsert very well: the OLE DB Command task operates row by row (RBAR), so it is much more efficient to insert into a staging table and use MERGE than it is to try to upsert directly into the main table.


EDIT

To answer the question in the comment:

Does it do a check before inserting every record, or does it do the check only the first time? I mean, if the merge command inserts a record first, will it then, for the following record, check whether it already exists in the destination table (since a new record has been added by the previous merge statement)?

It does the check up front: the source is matched against the target as it was at the start of the statement, so a row inserted by the MERGE itself is not matched against again. But that also means that if you have two rows in your source table matching the same target key, the MERGE will fail with the error message:

The MERGE statement attempted to UPDATE or DELETE the same row more than once. This happens when a target row matches more than one source row. A MERGE statement cannot UPDATE/DELETE the same row of the target table multiple times. Refine the ON clause to ensure a target row matches at most one source row, or use the GROUP BY clause to group the source rows.
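To make that concrete, here is a minimal repro with made-up single-key temp tables; both source rows match the same target row, so the statement fails with exactly the error quoted above:

-- Minimal illustration with hypothetical tables.
CREATE TABLE #target (k int PRIMARY KEY, v int);
CREATE TABLE #source (k int, v int);

INSERT INTO #target VALUES (1, 0);
INSERT INTO #source VALUES (1, 10), (1, 20);  -- duplicate key in the source

-- Both source rows match target row k = 1, so this raises error 8672.
MERGE #target AS t
USING #source AS s ON s.k = t.k
WHEN MATCHED THEN UPDATE SET v = s.v
WHEN NOT MATCHED THEN INSERT (k, v) VALUES (s.k, s.v);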

If you have duplicate rows in your temp table, you will need to de-dupe them before you perform the merge. So if, for example, you only want the latest record from the temp table, you could use NOT EXISTS to exclude records where a newer one exists for the same key:

WITH LatestTempGenericArticle AS (
    SELECT GlnCode, Description, VendorId, VendorName, ItemNumber, ItemUOM
    FROM dbo.[TEMP_GENERIC_ARTICLE] AS t
    WHERE NOT EXISTS (
        SELECT 1
        FROM dbo.[TEMP_GENERIC_ARTICLE] AS t2
        WHERE t2.GlnCode    = t.GlnCode
          AND t2.VendorId   = t.VendorId
          AND t2.ItemNumber = t.ItemNumber
          AND t2.ItemUOM    = t.ItemUOM
          AND t2.DateCreatedInternal > t.DateCreatedInternal
    )
)
MERGE [dbo].[T_GENERIC_ARTICLE] AS t
USING LatestTempGenericArticle AS s
    ON  s.GlnCode    = t.GlnCode
    AND s.VendorId   = t.VendorId
    AND s.ItemNumber = t.ItemNumber
    AND s.ItemUOM    = t.ItemUOM
WHEN MATCHED THEN
    UPDATE SET Description = s.Description,
               VendorName  = s.VendorName
WHEN NOT MATCHED THEN
    INSERT (GlnCode, Description, VendorId, VendorName, ItemNumber, ItemUOM)
    VALUES (s.GlnCode, s.Description, s.VendorId, s.VendorName, s.ItemNumber, s.ItemUOM);

3 Comments

I do believe that this is what I need. Give me some time to test the implementation, and if everything is working, I'll accept the answer. Otherwise, you'll see a comment on this post from me :-)
I do have a small question regarding the solution that you proposed. When you're using the MERGE command, does it do a check before inserting every record, or does it do the check only the first time? I mean, if the merge command inserts a record first, will it then, for the following record, check whether it already exists in the destination table (since a new record has been added by the previous merge statement)?
Thanks for your very valuable feedback (the update to your answer). I've opted for the solution of changing my temp table to use the same primary key as my destination table. That way I'm always sure that the merge command will succeed.
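For reference, the change described in that last comment would look something like the sketch below (not the poster's actual DDL). Note that ItemUOM has to shrink from nvarchar(max) to nvarchar(128) to match the destination, since max types cannot be index key columns; with this key in place, duplicates are rejected at insert time, so the MERGE can never hit the multiple-match error.

-- Hypothetical reworked staging table with the same composite key
-- as the destination table, so duplicate keys cannot be staged.
CREATE TABLE [dbo].[TEMP_GENERIC_ARTICLE](
    [GlnCode]             [nvarchar](100) NOT NULL,
    [Description]         [nvarchar](max) NULL,
    [VendorId]            [nvarchar](100) NOT NULL,
    [VendorName]          [nvarchar](100) NULL,
    [ItemNumber]          [nvarchar](100) NOT NULL,
    [ItemUOM]             [nvarchar](128) NOT NULL,  -- was nvarchar(max); max types cannot be key columns
    [DateCreatedInternal] [datetime]      NOT NULL,
    [DateUpdatedInternal] [datetime]      NOT NULL,
    CONSTRAINT [PK_dbo.TEMP_GENERIC_ARTICLE] PRIMARY KEY CLUSTERED
        ([GlnCode] ASC, [VendorId] ASC, [ItemNumber] ASC, [ItemUOM] ASC)
)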
