I'm dealing with cleaning relatively large (30ish lines) blocks of text. Here's an excerpt:

PID|1||06225401^^^PA0^MR||PATIENT^FAKE R|||F

PV1|1|I|||||025631^DoctorZ^^^^^^^PA0^^^^DRH|DRH||||...

ORC|RE||CYT-09-06645^AP||||||200912110333|INTERFACE07

OBR|1||CYT09-06645|8104^^L|||20090602|||||||200906030000[conditio...

OBX|1|TX|8104|1|SOURCE OF SPECIMEN:[source]||||||F|||200912110333|CYT ...

I currently have a script that takes out illegal characters or terms. Here's an example.

    infile = open(thisFile,'r')
    m = infile.read()
    #remove junk headers
    m = m.replace("4þPATHþ", "")
    m = m.replace("10þALLþ", "")

My goal is to modify this script so that I can add 4 digits to the end of one of the fields: specifically, the date field ("20090602") in the OBR line. The finished script should work with any file that follows this same format. Is this possible with the way I currently handle the file input, or do I have to use different logic?

  • The question is not completely clear. Can you provide sample input and output? Commented Aug 7, 2009 at 0:22
  • Those junk headers might be better expressed as e.g. "4\xfePATH\xfe" to make it quite plain that the so-called "illegal characters" are not characters (neither Icelandic lower-case thorn nor Latin lower-case p, which is what they look like at first glance) but just binary rubbish. Commented Aug 7, 2009 at 1:03

2 Answers

Here's an outline (untested) ... basically you do it a line at a time

    for line in infile:
        data = line.rstrip("\n").split("|")
        kind = data[0]
        # start of changes
        if kind == "OBR":
            data[7] += "0000" # check that 7 is correct!
        # end of changes
        outrecord = "|".join(data)
        outfile.write(outrecord + "\n")

The above assumes that you are selecting fix-up targets by line-type (example: "OBR") and column index (example: 7). If there are only a few such targets, just add more similar fix statements. If there are many targets, you could specify them like this:

    fix_targets = {
        "OBR": [7],
        "XYZ": [1, 42],
        }

and the fix_up code would look like this:

    if kind in fix_targets:
        for col_index in fix_targets[kind]:
            data[col_index] += "0000"

In any case, you may want to add code to check that data[col_index] really is a date in YYYYMMDD format before changing it.
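A minimal sketch of such a check, assuming the date is always exactly eight digits in YYYYMMDD form. The names `FIX_TARGETS`, `is_yyyymmdd` and `fix_line` are made up for illustration; `datetime.strptime` is used only to reject impossible dates such as month 13.

```python
from datetime import datetime

# Hypothetical mapping: line type -> column indexes holding YYYYMMDD dates
FIX_TARGETS = {"OBR": [7]}


def is_yyyymmdd(value):
    """Return True if value is exactly 8 digits forming a real calendar date."""
    if len(value) != 8 or not value.isdigit():
        return False
    try:
        datetime.strptime(value, "%Y%m%d")
    except ValueError:
        return False
    return True


def fix_line(line, suffix="0000"):
    """Append suffix to the targeted date fields of one pipe-delimited line."""
    data = line.rstrip("\n").split("|")
    for col_index in FIX_TARGETS.get(data[0], []):
        # Leave short lines and fields that are not plain dates untouched
        if col_index < len(data) and is_yyyymmdd(data[col_index]):
            data[col_index] += suffix
    return "|".join(data)
```

Because a field that already carries the extra digits is 12 characters long, it fails the check and is left alone, so running the fix twice over the same file should not append the suffix twice.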

None of the above addresses removing the unwanted headers, because you didn't show example data. I guess that applying your replacements to each line (and avoiding writing the line if it became only whitespace after replacements) would do the trick.
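As a rough sketch of that guess, using the two junk markers from the question and assuming they sit on or at the start of ordinary lines (the real layout was not shown). The names `JUNK_HEADERS` and `clean_lines` are hypothetical, and "\xfe" is written out as suggested in the comments on the question.

```python
# Junk markers copied from the question; "\xfe" is the character shown as "þ".
# That the file decodes this way (e.g. as Latin-1) is an assumption.
JUNK_HEADERS = ["4\xfePATH\xfe", "10\xfeALL\xfe"]


def clean_lines(lines):
    """Yield each line with junk headers removed, skipping lines left blank."""
    for line in lines:
        for junk in JUNK_HEADERS:
            line = line.replace(junk, "")
        # Do not pass on a line that became only whitespace
        if line.strip():
            yield line.rstrip("\n")
```

The cleaned lines can then be fed to the per-line loop above in place of `infile`, with the `"\n"` added back when writing.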

3 Comments

Perfect. I figured it might be something like this, but I couldn't really visualize it. Thanks for the answer.
Good. outfile.write(outrecord + '\n') has a built-in Python equivalent: print >> outfile, outrecord, which is simpler.
@EOL - except that the "print chevron" syntax, which is not often used, is also deprecated.
You may find the answers here helpful.

Iterative find/replace from a list of tuples in Python

1 Comment

Thanks. I don't think this is exactly what I'm looking for, but it's good to know nonetheless.
