Handling Production Seed Data

Question

We have a requirement for a new feature which requires some seed data to be present in the database (basically some default values) for the feature to function correctly. We have this in 2 different scenarios at the moment, and the best method of generation / insertion of the seed data seems to vary depending on the data store we use. I'm not talking about seeding data for test purposes here.

For example, some features require tables to be present in SQL Server. We're using manual migrations between versions (basically diffing the schemas), so it would make sense to insert the seed data in the same SQL script that updates the schema. Some ORMs instead handle this with a Seed() method (or equivalent) in the initialisation code, which creates the data.

A different feature is using Azure Table Storage (ATS) as its data store. Being schema-less, there's no script to create the table here, but the application does check for the existence of the tables on first run, and creates them if they don't exist. This means that we normally don't create the tables before the deployment goes ahead. To seed data in ATS we could either pre-populate the environment (which would require some code to be written and executed somewhere) or we could make the component that checks for the existence of the table insert the data as the table is created.

Is there a long term disadvantage to having the seed data in classes in the code, and if that's the best place to put it, is it going to make more sense to keep the data together (eg having a Seed class with the data in it that's run on app startup) or should the Repository have the responsibility for making sure that the base data exists before issuing any queries?

JimmyJames · Accepted Answer · 2016-07-22 21:33:51Z

I'd say no, don't put this in your code if putting it in your code means that when the application starts up it detects that it needs to seed the DB and does it. There are a few reasons for this (some will be similar or the same as what Alpha has provided):

As I understand this, it needs to run exactly once in production. Having this procedure in your code has no value after that first run.
Over time, no one will remember what this code is for or why it was there. New people (or even those who knew about it at one time) will see it and think "WTF is this?" The best case scenario is someone removes it from source control.
If for some reason something goes wrong and the app comes up in a state where it detects a fresh install, it will execute this again and go happily on its way. This is not desirable. Fail fast instead. It could be a real mess to clean up if your application runs in that state for a while.

This should be part of the deployment of the code, not part of the code. This makes the creation of the seed deliberate.
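A minimal sketch of that split, assuming a relational store (SQLite stands in here for whatever database is actually used). The `settings` table, the `SEED_ROWS` values, and the function names are all hypothetical. `seed()` is meant to be invoked by the deployment process, while the application only calls `verify_seed()` at startup and fails fast instead of repairing anything.

```python
import sqlite3

# Hypothetical default values the feature needs; illustrative only.
SEED_ROWS = [("default_currency", "USD"), ("max_retries", "3")]


class MissingSeedDataError(RuntimeError):
    """Raised at startup when required seed rows are absent."""


def seed(conn: sqlite3.Connection) -> None:
    """Deployment step: run deliberately by the release process, not by the app."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS settings "
        "(key TEXT PRIMARY KEY, value TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO settings (key, value) VALUES (?, ?)", SEED_ROWS)
    conn.commit()


def verify_seed(conn: sqlite3.Connection) -> None:
    """Application startup: check only, never create or repair."""
    try:
        present = {row[0] for row in conn.execute("SELECT key FROM settings")}
    except sqlite3.OperationalError as exc:  # e.g. the table does not exist
        raise MissingSeedDataError("settings table is missing") from exc
    missing = [key for key, _ in SEED_ROWS if key not in present]
    if missing:
        raise MissingSeedDataError(f"missing seed rows: {missing}")
```

Because `seed()` uses a plain `INSERT`, running it a second time against an already seeded database raises an integrity error rather than silently re-seeding, which matches the "deliberate, exactly once" intent.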

k3b · Accepted Answer · 2016-09-21 10:42:05Z

You actually need two things:

A mechanism that decides that some action is needed (i.e. seed data must be added)
The seed data and some code that transfers the seed data into the target storage (e.g. an SQL script, or CSV data plus a CSV interpreter)

I would solve the problem like this:

Add a version-number into your target-storage
Define a naming convention for version-change scripts
- Example version-update-1-2.sql would update from Version 1 to Version 2
- Example version-update-1-2.exe (if the process is not scriptable, i.e. cannot be run through an interpreter)
The last step in version-update-* would set the new version number.
At program startup the code checks if there is an update available and executes it.
Update scripts can be safely removed once they were successfully executed.

If you add a feature to your product

increase the data-version-number
supply an update-script that matches the naming-convention.
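The versioning scheme above can be sketched roughly as follows. This is an illustration under assumptions, not the answer's actual implementation: SQLite stands in for the target storage, the update steps are Python functions rather than `.sql` or `.exe` files, and the `schema_version` table, `settings` table, and function names are hypothetical. One deviation to note: the runner records the new version number itself, instead of each update script doing so as its last step.

```python
import sqlite3
from typing import Callable, Dict

Migration = Callable[[sqlite3.Connection], None]


def update_0_1(conn: sqlite3.Connection) -> None:
    """Version 0 -> 1: create the table the feature needs."""
    conn.execute("CREATE TABLE settings (key TEXT PRIMARY KEY, value TEXT NOT NULL)")


def update_1_2(conn: sqlite3.Connection) -> None:
    """Version 1 -> 2: insert the seed data."""
    conn.execute("INSERT INTO settings (key, value) VALUES ('max_retries', '3')")


# Keyed by the version each step upgrades FROM, mirroring version-update-1-2.sql.
UPDATES: Dict[int, Migration] = {0: update_0_1, 1: update_1_2}


def current_version(conn: sqlite3.Connection) -> int:
    """Read the stored data version, initialising it to 0 on a fresh store."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER NOT NULL)")
    row = conn.execute("SELECT version FROM schema_version").fetchone()
    if row is None:
        conn.execute("INSERT INTO schema_version (version) VALUES (0)")
        conn.commit()
        return 0
    return row[0]


def upgrade(conn: sqlite3.Connection) -> int:
    """Apply every pending update in order and return the final version."""
    version = current_version(conn)
    while version in UPDATES:
        UPDATES[version](conn)
        version += 1
        # Record the new version right after the step that produced it.
        conn.execute("UPDATE schema_version SET version = ?", (version,))
        conn.commit()
    return version
```

Calling `upgrade()` on every startup is harmless once the store is current: no key in `UPDATES` matches the stored version, so nothing runs. Adding a feature then means adding one function and one entry to `UPDATES`.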

Is there a long term disadvantage to having the seed data in classes in the code, and if that's the best place to put it

Pro: the code is self-contained. Chances are high that the update still works after a few years.
Contra: it is more difficult to remove update steps. Over time the code becomes more and more cluttered.
Contra: old update steps may break if the system changes.

Alpha · Accepted Answer · 2016-06-22 11:34:17Z

Is there a long term disadvantage to having the seed data in classes in the code, and if that's the best place to put it

I would argue that, on the contrary, this is the best place to have this kind of initialization. Since your code depends directly on this data, its shape and contents, this initialization being directly tied to your code is actually the best thing that could have happened to you.

However, nobody usually does that. Usually, you'd create a script, or a deployment step, for a couple of reasons:

The initialization code does not need to run all the time. Coding this for a particular one-time scenario is not usually very fruitful. Remember that incorporating it into your code base means it needs to be maintained and tested too.
Once the initialization in real environments is done, you're locked to a particular set of contents unless you have a way of managing changes to the data, and that is something simple initialization code cannot do on its own.

An example of this being taken care of in the codebase is Entity Framework's or Rails' migrations. They initialize, they update, and they seed and transform data.

The actual storage does not matter really. Azure Tables may be schemaless but your code will require a certain schema to be present in order to work with that data.

Having said this, my recommendations are:

If possible, include this seed logic in your code. Prepare to maintain that data and changes to it. Prepare for old environment data. Prepare to migrate back and forth.
If you can delegate to a known framework, do it. It'll take a lot of bloat from your code.
Do not include this logic in your repositories. They should deal with data access, not data source initialization.
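One way to picture the last recommendation, as a sketch only: the repository does data access and nothing else, while a separate seeder owns initialisation. SQLite stands in for the real store, and `SettingsRepository`, `SettingsSeeder`, the `settings` table, and the `DEFAULTS` values are hypothetical names chosen for illustration. The seeder is written to be idempotent so that it tolerates old environment data.

```python
import sqlite3
from typing import Optional

# Hypothetical default values; illustrative only.
DEFAULTS = {"default_currency": "USD", "max_retries": "3"}


class SettingsRepository:
    """Data access only: no table creation, no seeding."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn

    def get(self, key: str) -> Optional[str]:
        row = self._conn.execute(
            "SELECT value FROM settings WHERE key = ?", (key,)
        ).fetchone()
        return None if row is None else row[0]


class SettingsSeeder:
    """Owns data source initialisation; safe to run repeatedly."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn

    def run(self) -> None:
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS settings "
            "(key TEXT PRIMARY KEY, value TEXT NOT NULL)"
        )
        # INSERT OR IGNORE leaves rows that already exist untouched,
        # so values changed in a live environment are not overwritten.
        self._conn.executemany(
            "INSERT OR IGNORE INTO settings (key, value) VALUES (?, ?)",
            list(DEFAULTS.items()),
        )
        self._conn.commit()
```

Whether `SettingsSeeder.run()` is triggered from application startup or from a deployment step is a separate decision; the point of the sketch is only that the repository never has to know about it.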
