
At the risk of over-explaining my question, I'm going to err on the side of too much information.

I am creating a bulk upload process that inserts data into two tables. The two tables look roughly as follows. Parts is a self-referencing table that allows N levels of nesting.

Parts (self-referencing table)
--------
PartId (PK Int Non-Auto-Incrementing)
DescriptionId (FK)
ParentPartId
HierarchyNode (HierarchyId)
SourcePartId (VARCHAR(500), a unique Part Id from the source)
(other columns)

Description
--------
DescriptionId (PK Int Non-Auto-Incrementing)
Language (PK, either 'EN' or 'JA')
DescriptionText (varchar(max))
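
For reference, a minimal sketch of those two tables in T-SQL; the exact data types and constraint names are assumptions where they aren't spelled out above:

-- Sketch only: types and constraint names are assumptions
CREATE TABLE Description (
    DescriptionId   INT          NOT NULL,  -- non-auto-incrementing, assigned by the upload process
    Language        CHAR(2)      NOT NULL,  -- 'EN' or 'JA'
    DescriptionText VARCHAR(MAX) NULL,
    CONSTRAINT PK_Description PRIMARY KEY (DescriptionId, Language)
);

CREATE TABLE Parts (
    PartId        INT          NOT NULL,    -- non-auto-incrementing, assigned by the upload process
    DescriptionId INT          NOT NULL,    -- logically references Description (composite PK there, so no simple FK)
    ParentPartId  INT          NULL,        -- self-reference; NULL for root parts
    HierarchyNode HIERARCHYID  NULL,
    SourcePartId  VARCHAR(500) NOT NULL,    -- unique part id from the source
    -- (other columns)
    CONSTRAINT PK_Parts PRIMARY KEY (PartId),
    CONSTRAINT FK_Parts_Parent FOREIGN KEY (ParentPartId) REFERENCES Parts (PartId)
);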

(I should note too that there are other tables that will reference our PartID that I'm leaving out of this for now.)

In Description, the combination of DescriptionText and Language will be unique, but a given DescriptionId will always appear in at least two rows (one per language).

Now, for the bulk upload process, I created two staging tables that look a lot like Parts and Description but don't have any PK's, Indexes, etc. They are Parts_Staging and Description_Staging.

In Parts_Staging there is an extra column that contains a Hierarchy Node String, which is the HierarchyNode in this kind of format: /1/2/3/ etc. Then when data is copied from the _Staging table to the actual table, I use a CAST(Source.Column AS hierarchyid).
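
(To illustrate the format, the cast works directly on that string representation, e.g.:)

-- '/1/2/3/' text round-trips through hierarchyid
SELECT CAST('/1/2/3/' AS hierarchyid)            AS Node,
       CAST('/1/2/3/' AS hierarchyid).ToString() AS NodeText;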

Because of the complexity of the IDs shared across the two tables, the self-referencing IDs and the hierarchyid in Parts, and the number of rows to be inserted (possibly in the 100,000s), I decided to compile 100% of the data in a C# model first, including the PK IDs. So the process looks like this in C#:

  1. Query the two tables for the MAX IDs (see the sketch after this list)
  2. Using those MAX IDs, compile a complete model of all the data for both tables (including the hierarchyid /1/2/3/)
  3. Do a bulk insert into both _Staging Tables
  4. Trigger a SP that copies non-duplicate data from the two _Staging tables into the actual tables. (This is where the CAST(Source.Column AS hierarchyid) happens).
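
For concreteness, step 1 amounts to a pair of scalar queries along these lines (a sketch; ISNULL covers the empty-table case):

-- Step 1 sketch: seed values for the IDs that will be assigned in C#
SELECT ISNULL(MAX(PartId), 0)        AS MaxPartId        FROM Parts;
SELECT ISNULL(MAX(DescriptionId), 0) AS MaxDescriptionId FROM Description;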

We are importing lots of parts books, and a single part may be replicated across multiple books. We need to remove the duplicates. In step 4, duplicates are weeded out by checking the SourcePartId in the Parts table and the DescriptionText in the Description table.
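
The step 4 copy could look something like the sketch below; the staging column holding the node string (HierarchyNodeString here) is just an illustrative name:

-- Step 4 sketch: copy only rows whose natural key is not already present
INSERT INTO Parts (PartId, DescriptionId, ParentPartId, HierarchyNode, SourcePartId)
SELECT s.PartId, s.DescriptionId, s.ParentPartId,
       CAST(s.HierarchyNodeString AS hierarchyid), s.SourcePartId
FROM Parts_Staging AS s
WHERE NOT EXISTS (SELECT 1 FROM Parts p WHERE p.SourcePartId = s.SourcePartId);

INSERT INTO Description (DescriptionId, Language, DescriptionText)
SELECT s.DescriptionId, s.Language, s.DescriptionText
FROM Description_Staging AS s
WHERE NOT EXISTS (SELECT 1 FROM Description d
                  WHERE d.DescriptionText = s.DescriptionText
                    AND d.Language        = s.Language);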

That entire process works beautifully! And best of all, it's really fast. But if you are reading this carefully (and I thank you if you are), then you have already noticed one glaring, obvious problem.

If multiple processes are happening at the same time (and that absolutely WILL happen!) then there is a very real risk of getting the ID's mixed up and the data becoming really corrupted. Process1 could do the GET MAX ID query and before it manages to finish, Process2 could also do a GET MAX ID query, and because Process1 hasn't actually written to the tables yet, it would get the same ID's.

My original thought was to use a SEQUENCE object. And at first, that plan seemed to be brilliant. But it fell apart in testing because it's entirely possible that the same data will be processed more than once and eventually ignored when the copy happens from the _Staging tables to the final tables. And in that case, the SEQUENCE numbers will already be claimed and used, resulting in giant gaps in the ID's. Not that this is a fatal flaw, but it's an issue we would rather avoid.
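
For concreteness, the SEQUENCE idea amounts to something like the sketch below (the sequence names are just for illustration):

-- SEQUENCE sketch (names are illustrative)
CREATE SEQUENCE dbo.PartIdSeq        AS INT START WITH 1 INCREMENT BY 1;
CREATE SEQUENCE dbo.DescriptionIdSeq AS INT START WITH 1 INCREMENT BY 1;

-- Every call hands out a number for good, even if the row is later discarded
-- as a duplicate, which is where the gaps come from.
SELECT NEXT VALUE FOR dbo.PartIdSeq;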

So... that was a LOT of background info to ask this actual question. What I'm thinking of doing is this:

  1. Lock both of the tables in question
  2. Steps 1-4 as outlined above
  3. Unlock both of the tables.

The lock would need to block reads (which I think means an Exclusive lock?) so that if another process attempts to do the GET MAX ID query, it will have to wait.

My question is: 1) Is this the best approach? And 2) How does one place an Exclusive lock on a table?

Thanks!

  • The process doesn't work well at all. Moving data processing to the client adds delays, it doesn't reduce them. Using MAX(ID) to calculate IDs introduces the risk of duplicate IDs, not only if there are multiple processes but also if some rows get deleted. In that case there won't be any errors, but some new values will receive the IDs that were used by deleted entries and become related to the wrong records. Commented Mar 7, 2019 at 15:41
  • Any time you think you need to lock tables you should consider whether the entire process is broken too. Locks are meant for short-lived operations. They aren't a check-in/check-out mechanism. Commented Mar 7, 2019 at 15:42
  • Ok thanks Panagiotis. I hear you. Can I ask how you would do this entire process? Commented Mar 7, 2019 at 15:43
  • In this particular case you can add primary keys to the staging tables that get their values from SEQUENCE object through a DEFAULT constraint. This way there won't be any risk of duplicates and no need for locking. Bulk inserts will assign unique, incrementing values just as if they were added through an IDENTITY function. The difference is that IDENTITY can work with only one table at a time while SEQUENCE can be used by multiple tables, eg if you insert multiple types of similar records that need a common key once they reach their final destination Commented Mar 7, 2019 at 15:46
  • The short version: This is a classic case of inserting multiple parents and multiple children in the same transaction. The long version: Read my answer to a similar case, draw your similarities, change the proposed solution to fit your exact details, and go make yourself a cup of well-earned coffee. Commented Mar 7, 2019 at 16:21

1 Answer


I'm not sure what the best approach is, but in terms of placing an 'exclusive' lock on a table, simply using WITH (TABLOCKX) in your query will put one on the table.
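
For example, something along these lines would take exclusive table locks and hold them for the duration of the transaction, so concurrent GET MAX ID queries have to wait (a minimal sketch using the tables from the question):

-- Sketch: exclusive table locks held until COMMIT
BEGIN TRANSACTION;

SELECT MAX(PartId)        FROM Parts       WITH (TABLOCKX, HOLDLOCK);
SELECT MAX(DescriptionId) FROM Description WITH (TABLOCKX, HOLDLOCK);

-- ... bulk insert into the _Staging tables and run the copy procedure here ...

COMMIT TRANSACTION;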

If you wish to learn more about it:

https://msdn.microsoft.com/en-GB/library/ms187373.aspx
