92

I am trying to find an efficient way of parsing files that hold fixed-width lines. For example, the first 20 characters represent a column, characters 21 to 30 another one, and so on.

Assuming that the line holds 100 characters, what would be an efficient way to parse a line into several components?

I could use string slicing per line, but it's a little bit ugly if the line is big. Are there any other fast methods?
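
(For reference, a minimal sketch of the plain-slicing approach the question describes; the column boundaries, field names, and sample line here are made up:)

# Hypothetical layout: characters 1-20, 21-30 and 31-50 are three columns.
line = 'John Smith          1980-01-01New York            '
name = line[0:20].rstrip()
birth_date = line[20:30].rstrip()
city = line[30:50].rstrip()
print(name, birth_date, city)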

11 Answers

80

Using the Python standard library's struct module would be fairly easy as well as fairly fast since it's written in C. The code below shows how to use it. It also allows columns of characters to be skipped by specifying negative values for the number of characters in the field.

import struct

fieldwidths = (2, -10, 24)
fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                     for fw in fieldwidths)

# Convert Unicode input to bytes and the result back to Unicode string.
unpack = struct.Struct(fmtstring).unpack_from  # Alias.
parse = lambda line: tuple(s.decode() for s in unpack(line.encode()))

print('fmtstring: {!r}, record size: {} chars'.format(fmtstring, struct.calcsize(fmtstring)))

line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
fields = parse(line)
print('fields: {}'.format(fields))

Output:

fmtstring: '2s 10x 24s', record size: 36 chars
fields: ('AB', 'MNOPQRSTUVWXYZ0123456789')

Here's a way to do it with string slices, as you were considering but were concerned might get too ugly. It is kind of complicated, and speed-wise it's about the same as the version based on the struct module, although I have an idea about how it could be sped up (which might make the extra complexity worthwhile). See the update below on that topic.

from itertools import zip_longest
from itertools import accumulate

def make_parser(fieldwidths):
    cuts = tuple(cut for cut in accumulate(abs(fw) for fw in fieldwidths))
    pads = tuple(fw < 0 for fw in fieldwidths)  # bool values for padding fields
    flds = tuple(zip_longest(pads, (0,)+cuts, cuts))[:-1]  # ignore final one
    parse = lambda line: tuple(line[i:j] for pad, i, j in flds if not pad)
    # Optional informational function attributes.
    parse.size = sum(abs(fw) for fw in fieldwidths)
    parse.fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                               for fw in fieldwidths)
    return parse

line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
fieldwidths = (2, -10, 24)  # negative widths represent ignored padding fields
parse = make_parser(fieldwidths)
fields = parse(line)
print('format: {!r}, rec size: {} chars'.format(parse.fmtstring, parse.size))
print('fields: {}'.format(fields))

Output:

format: '2s 10x 24s', rec size: 36 chars
fields: ('AB', 'MNOPQRSTUVWXYZ0123456789')

Update

As I suspected, there is a way of making the string-slicing version of the code faster, which in Python 2.7 makes it about the same speed as the version using struct, but in Python 3.x makes it 233% faster than the struct version (as well as than the un-optimized version of itself, which is about the same speed as the struct version).

What the version presented above does is define a lambda function that's primarily a comprehension that generates the limits of a bunch of slices at runtime.

parse = lambda line: tuple(line[i:j] for pad, i, j in flds if not pad) 

Which is equivalent, depending on the values of i and j in the for loop, to a statement looking like this:

parse = lambda line: (line[0:2], line[12:36], line[36:51], ...) 

However, the latter executes more than twice as fast since the slice boundaries are all constants.

Fortunately, it's relatively easy to convert and "compile" the former into the latter using the built-in eval() function:

def make_parser(fieldwidths):
    cuts = tuple(cut for cut in accumulate(abs(fw) for fw in fieldwidths))
    pads = tuple(fw < 0 for fw in fieldwidths)  # bool flags for padding fields
    flds = tuple(zip_longest(pads, (0,)+cuts, cuts))[:-1]  # ignore final one
    slcs = ', '.join('line[{}:{}]'.format(i, j) for pad, i, j in flds if not pad)
    parse = eval('lambda line: ({})\n'.format(slcs))  # Create and compile source code.
    # Optional informational function attributes.
    parse.size = sum(abs(fw) for fw in fieldwidths)
    parse.fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                               for fw in fieldwidths)
    return parse

20 Comments

How would that work with unicode? Or, a utf-8 encoded string? struct.unpack seems to operate on binary data. I can't get this working.
@Reiner Gerecke: The struct module is designed to operate on binary data. Files with fixed-width fields are legacy jobs which are also highly likely to pre-date UTF-8 (in mind set, if not in chronology). Bytes read from files are binary data. You don't have unicode in files. You need to decode bytes to get unicode.
@Reiner Gerecke: Clarification: In those legacy file formats, each field is a fixed number of bytes, not a fixed number of characters. Although unlikely to be encoded in UTF-8, they can be encoded in an encoding that has a variable number of bytes per character e.g. gbk, big5, euc-jp, shift-jis, etc. If you wish to work in unicode, you can't decode the whole record at once; you need to decode each field.
@John Machin Thank you, I understood what you meant. For some reason I haven't seen this as an approach to legacy file formats, hence me asking about multibyte characters. (And yeah, thinking of unicode was dumb)
This breaks down entirely when you try to apply this for Unicode values (like in Python 3) with text outside the ASCII character set and where 'fixed width' means 'fixed number of characters', not bytes.
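
(A minimal sketch of that per-field decoding idea from the comments above; the encoding, widths, and sample data here are made up for illustration:)

# Hypothetical layout: field widths are in *bytes*, data is GBK-encoded.
fieldwidths_bytes = (6, 4)                       # assumed layout
raw = '中文AB'.encode('gbk') + 'CD  '.encode('gbk')  # 6 + 4 bytes
fields = []
pos = 0
for width in fieldwidths_bytes:
    chunk = raw[pos:pos + width]                 # slice bytes, not characters
    fields.append(chunk.decode('gbk').rstrip())  # decode each field separately
    pos += width
print(fields)  # ['中文AB', 'CD']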
75

I'm not really sure if this is efficient, but it should be readable (as opposed to doing the slicing manually). I defined a function slices that gets a string and column lengths, and returns the substrings. I made it a generator, so for really long lines, it doesn't build a temporary list of substrings.

def slices(s, *args):
    position = 0
    for length in args:
        yield s[position:position + length]
        position += length

Example

In [32]: list(slices('abcdefghijklmnopqrstuvwxyz0123456789', 2))
Out[32]: ['ab']

In [33]: list(slices('abcdefghijklmnopqrstuvwxyz0123456789', 2, 10, 50))
Out[33]: ['ab', 'cdefghijkl', 'mnopqrstuvwxyz0123456789']

In [51]: d, c, h = slices('dogcathouse', 3, 3, 5)

In [52]: d, c, h
Out[52]: ('dog', 'cat', 'house')

But I think the advantage of a generator is lost if you need all columns at once. Where one could benefit is when you want to process columns one by one, say in a loop.
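
(For example, a small sketch of that column-by-column use, with made-up widths:)

# Hypothetical widths; each column is processed as it is produced,
# without materialising all of them first.
line = 'ab' + 'cdefghijkl' + 'mnopqrstuvwxyz0123456789'
for column in slices(line, 2, 10, 24):
    print(column.strip())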

1 Comment

AFAICT, this method is slower than struct, but it is readable and easier to handle. I've done some tests using your slices function, struct module and also re module and it turns out for large files, struct is the fastest, re comes second (1.5x slower) and slices third (2x slower). There is however a small overhead using struct so your slices function can be faster on smaller files.
36

Two more options that are easier and prettier than already mentioned solutions:

The first is using pandas:

import pandas as pd

path = 'filename.txt'

# inferred - as suggested in the comments by James Paul Mason
data = pd.read_fwf(path, colspecs='infer')

# Or using Pandas with a column specification
col_specification = [(0, 20), (21, 30), (31, 50), (51, 100)]
data = pd.read_fwf(path, colspecs=col_specification)

And the second option using numpy.loadtxt:

import numpy as np

# Using NumPy and letting it figure it out automagically
data_also = np.loadtxt(path)

It really depends on how you want to use your data.
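
(If it helps, a small sketch of the width-based variant with named columns; the widths and column names here are assumptions, not from the question:)

import pandas as pd

# Hypothetical layout: 20-, 10-, 20- and 50-character columns, no header row.
data = pd.read_fwf(
    'filename.txt',
    widths=[20, 10, 20, 50],                    # alternative to colspecs
    names=['name', 'date', 'city', 'comment'],  # made-up column names
    header=None,
)
print(data.head())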

6 Comments

Is this competitive with the accepted answer in terms of speed?
Haven't tested it, but it should be a lot faster than the accepted answer.
Pandas can do automagic detection on its own if you set colspecs='infer' pandas.pydata.org/pandas-docs/stable/generated/…
Thanks, don't think that was an option when I wrote the answer, will update it with that.
The question doesn't specify an efficiency metric, and this answer is certainly the most efficient in terms of human processing, i.e., readability and code complexity.
15

The code below gives a sketch of what you might want to do if you have some serious fixed-column-width file handling to do.

"Serious" = multiple record types in each of multiple file types, records up to 1000 bytes, the layout-definer and "opposing" producer/consumer is a government department with attitude, layout changes result in unused columns, up to a million records in a file, ...

Features:

  • Precompiles the struct formats.
  • Ignores unwanted columns.
  • Converts input strings to required data types (sketch omits error handling).
  • Converts records to object instances (or dicts, or named tuples if you prefer).

Code:

import struct, datetime, cStringIO, pprint

# functions for converting input fields to usable data
cnv_text = str.rstrip
cnv_int = int
cnv_date_dmy = lambda s: datetime.datetime.strptime(s, "%d%m%Y")  # ddmmyyyy
# etc

# field specs (field name, start pos (1-relative), len, converter func)
fieldspecs = [
    ('surname', 11, 20, cnv_text),
    ('given_names', 31, 20, cnv_text),
    ('birth_date', 51, 8, cnv_date_dmy),
    ('start_date', 71, 8, cnv_date_dmy),
    ]
fieldspecs.sort(key=lambda x: x[1])  # just in case

# build the format for struct.unpack
unpack_len = 0
unpack_fmt = ""
for fieldspec in fieldspecs:
    start = fieldspec[1] - 1
    end = start + fieldspec[2]
    if start > unpack_len:
        unpack_fmt += str(start - unpack_len) + "x"
    unpack_fmt += str(end - start) + "s"
    unpack_len = end
field_indices = range(len(fieldspecs))
print unpack_len, unpack_fmt
unpacker = struct.Struct(unpack_fmt).unpack_from

class Record(object):
    pass
    # or use named tuples

raw_data = """\
....v....1....v....2....v....3....v....4....v....5....v....6....v....7....v....8
          Featherstonehaugh   Algernon Marmaduke  31121969            01012005XX
"""

f = cStringIO.StringIO(raw_data)
headings = f.next()
for line in f:
    # The guts of this loop would of course be hidden away in a function/method
    # and could be made less ugly
    raw_fields = unpacker(line)
    r = Record()
    for x in field_indices:
        setattr(r, fieldspecs[x][0], fieldspecs[x][3](raw_fields[x]))
    pprint.pprint(r.__dict__)
    print "Customer name:", r.given_names, r.surname

Output:

78 10x20s20s8s12x8s
{'birth_date': datetime.datetime(1969, 12, 31, 0, 0),
 'given_names': 'Algernon Marmaduke',
 'start_date': datetime.datetime(2005, 1, 1, 0, 0),
 'surname': 'Featherstonehaugh'}
Customer name: Algernon Marmaduke Featherstonehaugh

1 Comment

How would one update this code to parse records greater than 1000 bytes? I'm running into this error: struct.error: unpack_from requires a buffer of at least 1157 bytes
4
> str = '1234567890'
> w = [0, 2, 5, 7, 10]
> [ str[ w[i-1] : w[i] ] for i in range(1, len(w)) ]
['12', '345', '67', '890']


4

This is how I solved it with a dictionary that contains where fields start and end. Giving start and end points also helped me to manage changes in the length of a column.

# fixed length
#       '---------- ------- ----------- -----------'
line  = '20.06.2019 myname  active      mydevice   '
SLICES = {'date_start': 0, 'date_end': 10,
          'name_start': 11, 'name_end': 18,
          'status_start': 19, 'status_end': 30,
          'device_start': 31, 'device_end': 42}

def get_values_as_dict(line, SLICES):
    values = {}
    key_list = {key.split("_")[0] for key in SLICES.keys()}
    for key in key_list:
        values[key] = line[SLICES[key+"_start"]:SLICES[key+"_end"]].strip()
    return values

>>> print(get_values_as_dict(line, SLICES))
{'status': 'active', 'name': 'myname', 'date': '20.06.2019', 'device': 'mydevice'}


2

Here's a simple module for Python 3, based on John Machin's answer - adapt as needed :)

""" fixedwidth Parse and iterate through a fixedwidth text file, returning record objects. Adapted from https://stackoverflow.com/a/4916375/243392 USAGE import fixedwidth, pprint # define the fixed width fields we want # fieldspecs is a list of [name, description, start, width, type] arrays. fieldspecs = [ ["FILEID", "File Identification", 1, 6, "A/N"], ["STUSAB", "State/U.S. Abbreviation (USPS)", 7, 2, "A"], ["SUMLEV", "Summary Level", 9, 3, "A/N"], ["LOGRECNO", "Logical Record Number", 19, 7, "N"], ["POP100", "Population Count (100%)", 30, 9, "N"], ] # define the fieldtype conversion functions fieldtype_fns = { 'A': str.rstrip, 'A/N': str.rstrip, 'N': int, } # iterate over record objects in the file with open(f, 'rb'): for record in fixedwidth.reader(f, fieldspecs, fieldtype_fns): pprint.pprint(record.__dict__) # output: {'FILEID': 'SF1ST', 'LOGRECNO': 2, 'POP100': 1, 'STUSAB': 'TX', 'SUMLEV': '040'} {'FILEID': 'SF1ST', 'LOGRECNO': 3, 'POP100': 2, 'STUSAB': 'TX', 'SUMLEV': '040'} ... """ import struct, io # fieldspec columns iName, iDescription, iStart, iWidth, iType = range(5) def get_struct_unpacker(fieldspecs): """ Build the format string for struct.unpack to use, based on the fieldspecs. fieldspecs is a list of [name, description, start, width, type] arrays. Returns a string like "6s2s3s7x7s4x9s". """ unpack_len = 0 unpack_fmt = "" for fieldspec in fieldspecs: start = fieldspec[iStart] - 1 end = start + fieldspec[iWidth] if start > unpack_len: unpack_fmt += str(start - unpack_len) + "x" unpack_fmt += str(end - start) + "s" unpack_len = end struct_unpacker = struct.Struct(unpack_fmt).unpack_from return struct_unpacker class Record(object): pass # or use named tuples def reader(f, fieldspecs, fieldtype_fns): """ Wrap a fixedwidth file and return records according to the given fieldspecs. fieldspecs is a list of [name, description, start, width, type] arrays. fieldtype_fns is a dictionary of functions used to transform the raw string values, one for each type. """ # make sure fieldspecs are sorted properly fieldspecs.sort(key=lambda fieldspec: fieldspec[iStart]) struct_unpacker = get_struct_unpacker(fieldspecs) field_indices = range(len(fieldspecs)) for line in f: raw_fields = struct_unpacker(line) # split line into field values record = Record() for i in field_indices: fieldspec = fieldspecs[i] fieldname = fieldspec[iName] s = raw_fields[i].decode() # convert raw bytes to a string fn = fieldtype_fns[fieldspec[iType]] # get conversion function value = fn(s) # convert string to value (eg to an int) setattr(record, fieldname, value) yield record if __name__=='__main__': # test module import pprint, io # define the fields we want # fieldspecs are [name, description, start, width, type] fieldspecs = [ ["FILEID", "File Identification", 1, 6, "A/N"], ["STUSAB", "State/U.S. Abbreviation (USPS)", 7, 2, "A"], ["SUMLEV", "Summary Level", 9, 3, "A/N"], ["LOGRECNO", "Logical Record Number", 19, 7, "N"], ["POP100", "Population Count (100%)", 30, 9, "N"], ] # define a conversion function for integers def to_int(s): """ Convert a numeric string to an integer. Allows a leading ! as an indicator of missing or uncertain data. Returns None if no data. """ try: return int(s) except: try: return int(s[1:]) # ignore a leading ! except: return None # assume has a leading ! 
and no value # define the conversion fns fieldtype_fns = { 'A': str.rstrip, 'A/N': str.rstrip, 'N': to_int, # 'N': int, # 'D': lambda s: datetime.datetime.strptime(s, "%d%m%Y"), # ddmmyyyy # etc } # define a fixedwidth sample sample = """\ SF1ST TX04089000 00000023748 1 SF1ST TX04090000 00000033748! 2 SF1ST TX04091000 00000043748! """ sample_data = sample.encode() # convert string to bytes file_like = io.BytesIO(sample_data) # create a file-like wrapper around bytes # iterate over record objects in the file for record in reader(file_like, fieldspecs, fieldtype_fns): # print(record) pprint.pprint(record.__dict__) 


2

Here is what NumPy uses under the hood (much much simplified, but still - this code is found in the LineSplitter class within the _iotools module):

import numpy as np

DELIMITER = (20, 10, 10, 20, 10, 10, 20)
idx = np.cumsum([0] + list(DELIMITER))
slices = [slice(i, j) for (i, j) in zip(idx[:-1], idx[1:])]

def parse(line):
    return [line[s] for s in slices]

It does not handle negative delimiters for ignoring columns, so it is not as versatile as struct, but it is faster.
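
(If you needed the skip behaviour too, here is a small sketch of one way to bolt it on, reusing the negative-width convention from the accepted answer; this is an assumption on my part, not what NumPy itself does:)

import numpy as np

def make_slices(widths):
    # Negative widths mark columns to ignore, as in the struct-based answer.
    idx = np.cumsum([0] + [abs(w) for w in widths])
    return [slice(i, j) for w, i, j in zip(widths, idx[:-1], idx[1:]) if w > 0]

slices = make_slices((2, -10, 24))
line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'
print([line[s] for s in slices])  # ['AB', 'MNOPQRSTUVWXYZ0123456789']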


1

Because my old work often involved handling 1 million lines of fixed-width data, I did some research on this issue when I started using Python.

There are 2 types of FixedWidth

  1. ASCII FixedWidth (ascii character length = 1, double-byte encoded character length = 2)
  2. Unicode FixedWidth (ascii character & double-byte encoded character length = 1)

If the resource string is all composed of ascii characters, then ASCII FixedWidth = Unicode FixedWidth

Fortunately, strings and bytes are different types in Python 3, which reduces a lot of confusion when dealing with double-byte encoded characters (e.g. gbk, big5, euc-jp, shift-jis, etc.).
For the processing of "ASCII FixedWidth", the string is usually converted to bytes and then split.
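
(A small illustration of that character-count vs byte-count difference, using a made-up GBK example:)

s = "AB中文"                 # 4 characters
b = s.encode("gbk")          # 6 bytes: each Chinese character is 2 bytes in GBK
print(len(s), len(b))        # 4 6
print(s[2:4])                # character slicing: '中文'
print(b[2:6].decode("gbk"))  # byte slicing of the same field: '中文'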

Without importing third-party modules, with totalLineCount = 1 million, lineLength = 800 bytes, and FixedWidthArgs = (10, 25, 4, ...), I split the line in about 5 ways and reached the following conclusions:

  1. struct is the fastest (1x)
  2. A plain loop, without pre-processing FixedWidthArgs, is the slowest (5x+)
  3. slice(bytes) is faster than slice(string)
  4. When the source string is bytes, the test results are: struct (1x), operator.itemgetter (1.7x), precompiled slice objects & list comprehensions (2.8x), re.pattern object (2.9x)

When dealing with large files, we often use with open(file, "rb") as f:.
Traversing one of the above files this way takes about 2.4 seconds.
So I think an appropriate handler, one that processes 1 million rows of data and splits each row into 20 fields, should take less than 2.4 seconds.

I found that only struct and itemgetter meet that requirement.

ps: For normal display, I converted unicode str to bytes. If you are in a double-byte environment, you don't need to do this.

from itertools import accumulate
from operator import itemgetter

def oprt_parser(sArgs):
    sum_arg = tuple(accumulate(abs(i) for i in sArgs))
    # Negative parameter field index
    cuts = tuple(i for i, num in enumerate(sArgs) if num < 0)
    # Get slice args and ignore fields of negative length
    ig_Args = tuple(item for i, item in enumerate(zip((0,) + sum_arg, sum_arg)) if i not in cuts)
    # Generate `operator.itemgetter` object
    oprtObj = itemgetter(*[slice(s, e) for s, e in ig_Args])
    return oprtObj

lineb = b'abcdefghijklmnopqrstuvwxyz\xb0\xa1\xb2\xbb\xb4\xd3\xb5\xc4\xb6\xee\xb7\xa2\xb8\xf6\xba\xcd0123456789'
line = lineb.decode("GBK")

# Unicode Fixed Width
fieldwidthsU = (13, -13, 4, -4, 5, -5)  # Negative width fields are ignored
# ASCII Fixed Width
fieldwidths = (13, -13, 8, -8, 5, -5)   # Negative width fields are ignored

# Unicode FixedWidth processing
parse = oprt_parser(fieldwidthsU)
fields = parse(line)
print('Unicode FixedWidth', 'fields: {}'.format(tuple(map(lambda s: s.encode("GBK"), fields))))

# ASCII FixedWidth processing
parse = oprt_parser(fieldwidths)
fields = parse(lineb)
print('ASCII FixedWidth', 'fields: {}'.format(fields))

line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
fieldwidths = (2, -10, 24)
parse = oprt_parser(fieldwidths)
fields = parse(line)
print(f"fields: {fields}")

Output:

Unicode FixedWidth fields: (b'abcdefghijklm', b'\xb0\xa1\xb2\xbb\xb4\xd3\xb5\xc4', b'01234')
ASCII FixedWidth fields: (b'abcdefghijklm', b'\xb0\xa1\xb2\xbb\xb4\xd3\xb5\xc4', b'01234')
fields: ('AB', 'MNOPQRSTUVWXYZ0123456789')

oprt_parser is 4x faster than make_parser (list comprehensions + slice).


During this research, I found that the faster the CPU is, the more the relative efficiency of the re method seems to improve.
Since I don't have more and better computers to test on, here is my test code; if anyone is interested, you can test it with a faster computer.

Run Environment:

  • OS: win10
  • Python: 3.7.2
  • CPU: amd athlon x3 450
  • HD: seagate 1T
import timeit
import time
import re
from itertools import accumulate
from operator import itemgetter

def eff2(stmt, onlyNum=False, showResult=False):
    '''test function'''
    if onlyNum:
        rl = timeit.repeat(stmt=stmt, repeat=roundI, number=timesI, globals=globals())
        avg = sum(rl) / len(rl)
        return f"{avg * (10 ** 6)/timesI:0.4f}"
    else:
        rl = timeit.repeat(stmt=stmt, repeat=10, number=1000, globals=globals())
        avg = sum(rl) / len(rl)
        print(f"【{stmt}】")
        print(f"\tquick avg = {avg * (10 ** 6)/1000:0.4f} s/million")
        if showResult:
            print(f"\t Result = {eval(stmt)}\n\t timelist = {rl}\n")
        else:
            print("")

def upDouble(argList, argRate):
    return [c * argRate for c in argList]

tbStr = "000000001111000002222真2233333333000000004444444QAZ55555555000000006666666ABC这些事中文字abcdefghijk"
tbBytes = tbStr.encode("GBK")
a20 = (4, 4, 2, 2, 2, 3, 2, 2, 2, 2, 8, 8, 7, 3, 8, 8, 7, 3, 12, 11)
a20U = (4, 4, 2, 2, 2, 3, 2, 2, 1, 2, 8, 8, 7, 3, 8, 8, 7, 3, 6, 11)
Slng = 800
rateS = Slng // 100
tStr = "".join(upDouble(tbStr, rateS))
tBytes = tStr.encode("GBK")
spltArgs = upDouble(a20, rateS)
spltArgsU = upDouble(a20U, rateS)

testList = []
timesI = 100000
roundI = 5
print(f"test round = {roundI} timesI = {timesI} sourceLng = {len(tStr)} argFieldCount = {len(spltArgs)}")

print(f"pure str \n{''.ljust(60,'-')}")
# ==========================================
def str_parser(sArgs):
    def prsr(oStr):
        r = []
        r_ap = r.append
        stt = 0
        for lng in sArgs:
            end = stt + lng
            r_ap(oStr[stt:end])
            stt = end
        return tuple(r)
    return prsr

Str_P = str_parser(spltArgsU)
# eff2("Str_P(tStr)")
testList.append("Str_P(tStr)")

print(f"pure bytes \n{''.ljust(60,'-')}")
# ==========================================
def byte_parser(sArgs):
    def prsr(oBytes):
        r, stt = [], 0
        r_ap = r.append
        for lng in sArgs:
            end = stt + lng
            r_ap(oBytes[stt:end])
            stt = end
        return r
    return prsr

Byte_P = byte_parser(spltArgs)
# eff2("Byte_P(tBytes)")
testList.append("Byte_P(tBytes)")

# re, bytes
print(f"re compile object \n{''.ljust(60,'-')}")
# ==========================================
def rebc_parser(sArgs, otype="b"):
    re_Args = "".join([f"(.{{{n}}})" for n in sArgs])
    if otype == "b":
        rebc_Args = re.compile(re_Args.encode("GBK"))
    else:
        rebc_Args = re.compile(re_Args)
    def prsr(oBS):
        return rebc_Args.match(oBS).groups()
    return prsr

Rebc_P = rebc_parser(spltArgs)
# eff2("Rebc_P(tBytes)")
testList.append("Rebc_P(tBytes)")

Rebc_Ps = rebc_parser(spltArgsU, "s")
# eff2("Rebc_Ps(tStr)")
testList.append("Rebc_Ps(tStr)")

print(f"struct \n{''.ljust(60,'-')}")
# ==========================================
import struct
def struct_parser(sArgs):
    struct_Args = " ".join(map(lambda x: str(x) + "s", sArgs))
    def prsr(oBytes):
        return struct.unpack(struct_Args, oBytes)
    return prsr

Struct_P = struct_parser(spltArgs)
# eff2("Struct_P(tBytes)")
testList.append("Struct_P(tBytes)")

print(f"List Comprehensions + slice \n{''.ljust(60,'-')}")
# ==========================================
import itertools
def slice_parser(sArgs):
    tl = tuple(itertools.accumulate(sArgs))
    slice_Args = tuple(zip((0,) + tl, tl))
    def prsr(oBytes):
        return [oBytes[s:e] for s, e in slice_Args]
    return prsr

Slice_P = slice_parser(spltArgs)
# eff2("Slice_P(tBytes)")
testList.append("Slice_P(tBytes)")

def sliceObj_parser(sArgs):
    tl = tuple(itertools.accumulate(sArgs))
    tl2 = tuple(zip((0,) + tl, tl))
    sliceObj_Args = tuple(slice(s, e) for s, e in tl2)
    def prsr(oBytes):
        return [oBytes[so] for so in sliceObj_Args]
    return prsr

SliceObj_P = sliceObj_parser(spltArgs)
# eff2("SliceObj_P(tBytes)")
testList.append("SliceObj_P(tBytes)")

SliceObj_Ps = sliceObj_parser(spltArgsU)
# eff2("SliceObj_Ps(tStr)")
testList.append("SliceObj_Ps(tStr)")

print(f"operator.itemgetter + slice object \n{''.ljust(60,'-')}")
# ==========================================
def oprt_parser(sArgs):
    sum_arg = tuple(accumulate(abs(i) for i in sArgs))
    cuts = tuple(i for i, num in enumerate(sArgs) if num < 0)
    ig_Args = tuple(item for i, item in enumerate(zip((0,) + sum_arg, sum_arg)) if i not in cuts)
    oprtObj = itemgetter(*[slice(s, e) for s, e in ig_Args])
    return oprtObj

Oprt_P = oprt_parser(spltArgs)
# eff2("Oprt_P(tBytes)")
testList.append("Oprt_P(tBytes)")

Oprt_Ps = oprt_parser(spltArgsU)
# eff2("Oprt_Ps(tStr)")
testList.append("Oprt_Ps(tStr)")

print("|".join([s.split("(")[0].center(11, " ") for s in testList]))
print("|".join(["".center(11, "-") for s in testList]))
print("|".join([eff2(s, True).rjust(11, " ") for s in testList]))

Output:

test round = 5 timesI = 100000 sourceLng = 744 argFieldCount = 20
...
...
   Str_P   |  Byte_P   |  Rebc_P   |  Rebc_Ps  | Struct_P  |  Slice_P  | SliceObj_P|SliceObj_Ps|  Oprt_P   |  Oprt_Ps
-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------
     9.6315|     7.5952|     4.4187|     5.6867|     1.5123|     5.2915|     4.2673|     5.7121|     2.4713|     3.9051

1 Comment

@MartijnPieters More efficient function
0

String slicing doesn't have to be ugly as long as you keep it organized. Consider storing your field widths in a dictionary and then using the associated names to create an object:

from collections import OrderedDict

class Entry:
    def __init__(self, line):
        name2width = OrderedDict()
        name2width['foo'] = 2
        name2width['bar'] = 3
        name2width['baz'] = 2

        pos = 0
        for name, width in name2width.items():
            val = line[pos : pos + width]
            if len(val) != width:
                raise ValueError("not enough characters: \'{}\'".format(line))
            setattr(self, name, val)
            pos += width

file = "ab789yz\ncd987wx\nef555uv"

entry = []
for line in file.split('\n'):
    entry.append(Entry(line))

print(entry[1].bar)  # output: 987


0

I like to process text files containing fixed width fields using regular expressions. More specifically, using named capture groups. It's fast, does not require importing large libraries and is quite descriptive and convenient (in my opinion).

I also like the fact that the named capture groups basically auto-document the data format, acting as a sort of data specification, since each capture group can be written to define each field's name, data type and length.

Here's a simple example...

import re

data = [
    "1234ABCDEFGHIJ5",
    "6789KLMNOPQRST0"
]

record_regex = (
    r"^"
    r"(?P<firstnumbers>[0-9]{4})"
    r"(?P<middletext>[a-zA-Z0-9_\-\s]{10})"
    r"(?P<lastnumber>[0-9]{1})"
    r"$"
)

records = []
for line in data:
    match = re.match(record_regex, line)
    if match:
        records.append(match.groupdict())

print(records)

...that yields a convenient dictionary of each record:

[
    {'firstnumbers': '1234', 'lastnumber': '5', 'middletext': 'ABCDEFGHIJ'},
    {'firstnumbers': '6789', 'lastnumber': '0', 'middletext': 'KLMNOPQRST'}
]

Helpful tools, like the online regex tester and debugger, are available if you are not familiar (or comfortable) with Python regular expressions or named capture groups.

