Optimal way to replace characters in large string with Python?

Question

I'm dealing with cleaning relatively large (30ish lines) blocks of text. Here's an excerpt:

PID|1||06225401^^^PA0^MR||PATIENT^FAKE R|||F

PV1|1|I|||||025631^DoctorZ^^^^^^^PA0^^^^DRH|DRH||||...

ORC|RE||CYT-09-06645^AP||||||200912110333|INTERFACE07

OBR|1||CYT09-06645|8104^^L|||20090602|||||||200906030000[conditio...

OBX|1|TX|8104|1|SOURCE OF SPECIMEN:[source]||||||F|||200912110333|CYT ...

I currently have a script that takes out illegal characters or terms. Here's an example.

 infile = open(thisFile,'r') m = infile.read() #remove junk headers m = m.replace("4þPATHþ", "") m = m.replace("10þALLþ", "")

My goal is to modify this script so that I can add 4 digits to the end of one of the fields. In specific, the date field ("20090602") in the OBR line. The finished script will be able to work with any file that follows this same format. Is this possible with the way I currently handle the file input or do I have to use some different logic?

The question is not completely clear. Can you provide sample input and output? — Sridhar Ratnakumar
– Sridhar Ratnakumar, Commented Aug 7, 2009 at 0:22
Those junk headers might be better expressed as e.g. "4\xfePATH\xfe" to make it quite plain that the so-called "illegal characters" are not characters (neither Icelandic lower-case thorn nor Latin lowercase p (which is what they look like at first glance)) but just binary rubbish. — John Machin
– John Machin, Commented Aug 7, 2009 at 1:03

John Machin · Accepted Answer · 2009-08-07 00:45:24Z

Here's an outline (untested) ... basically you do it a line at a time

for line in infile: data = line.rstrip("\n").split("|") kind = data[0] # start of changes if kind == "OBR": data[7] += "0000" # check that 7 is correct! # end of changes outrecord = "|".join(data) outfile.write(outrecord + "\n")

The above assumes that you are selecting fix-up targets by line-type (example: "OBR") and column index (example: 7). If there are only a few such targets, just add more similar fix statements. If there are many targets, you could specify them like this:

fix_targets = { "OBR": [7], "XYZ": [1, 42], }

and the fix_up code would look like this:

if kind in fix_targets: for col_index in fix_targets[kind]: data[col_index] += "0000"

You may like in any case to add code to check that data[col_index] really is a date in YYYYMMDD format before changing it.

None of the above addresses removing the unwanted headers, because you didn't show example data. I guess that applying your replacements to each line (and avoiding writing the line if it became only whitespace after replacements) would do the trick.

Perfect. I figured it might be something like this, but I couldn't really visualize it. Thanks for the answer.
Good. outfile.write(outrecord + '\n') has a built-in Python equivalent: print >> outfile, outrecord, which is simpler.
@EOL - except that the "print chevron" syntax, which is not often used, is also deprecated.

Community · Accepted Answer · 2017-05-23 12:19:35Z

2

You may find the answers here helpful.

Iterative find/replace from a list of tuples in Python

edited May 23, 2017 at 12:19

CommunityBot

11 silver badge

answered Aug 7, 2009 at 0:18

Unknown

47k29 gold badges142 silver badges184 bronze badges

1 Comment

Sean Over a year ago

Thanks. I don't think this is exactly what I'm looking for, but it's good to know nonetheless.

Collectives™ on Stack Overflow

Optimal way to replace characters in large string with Python?

2 Answers 2

3 Comments

1 Comment

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

3 Comments

1 Comment

Linked

Related