Having spent quite a lot of time working in software companies that deal in medium-to-large data sets, I've seen that a choke point in their processes is often prepping data files for loading into databases: hunting down formatting errors, peculiar whitespace or line-ending characters, wrangling with Unicode encodings, that sort of thing.
Everyone I've ever known deals with this in a bespoke manner because the requirements are so varied and often unique to the companies involved.
I normally just fiddle about in a combination of hex editors, Excel, PowerShell, and SQL to get the job done. But this is such a perennial problem that I find it hard to believe there aren't already some standards in place to take the basic grunt-work out of the process.
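To give a concrete sense of the grunt-work, here's a minimal sketch of the sort of checks I end up re-implementing every time (in Python rather than PowerShell purely for brevity; `input.csv` is just a placeholder path):

```python
import sys

def scan(path: str) -> None:
    """Report the usual pre-load nuisances in a raw data file."""
    with open(path, "rb") as f:
        raw = f.read()

    # Byte-order mark that Excel likes to prepend to "UTF-8" exports
    if raw.startswith(b"\xef\xbb\xbf"):
        print("UTF-8 BOM present")

    # Bytes that won't decode as UTF-8 at all
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as e:
        print(f"non-UTF-8 byte at offset {e.start}: {raw[e.start]:#04x}")

    # Mixed line endings (CRLF vs bare LF vs bare CR)
    crlf = raw.count(b"\r\n")
    lf = raw.count(b"\n") - crlf
    cr = raw.count(b"\r") - crlf
    if sum(x > 0 for x in (crlf, lf, cr)) > 1:
        print(f"mixed line endings: CRLF={crlf}, LF={lf}, CR={cr}")

    # Peculiar whitespace: trailing blanks, non-breaking spaces
    for i, line in enumerate(raw.splitlines(), start=1):
        if line != line.rstrip():
            print(f"line {i}: trailing whitespace")
        if b"\xc2\xa0" in line:  # U+00A0 non-breaking space, UTF-8 encoded
            print(f"line {i}: non-breaking space")

if __name__ == "__main__":
    scan(sys.argv[1] if len(sys.argv) > 1 else "input.csv")
```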
Is there a standard technique tailored to this kind of data-file cleaning?