Get newline stats for a text file in Python

Question

I had a nasty CRLF / LF conflict in a git file that was probably committed from a Windows machine. Is there a cross-platform way (preferably in Python) to detect which type of newline is dominant throughout the file?

I've got this code (based on an idea from https://stackoverflow.com/a/10562258/239247):

    import sys

    if not sys.argv[1:]:
        sys.exit('usage: %s <filename>' % sys.argv[0])

    with open(sys.argv[1], "rb") as f:
        d = f.read()

    # bytes literals, because the file is read in binary mode
    crlf, lfcr = d.count(b'\r\n'), d.count(b'\n\r')
    cr, lf = d.count(b'\r'), d.count(b'\n')

    print('crlf: %s' % crlf)
    print('lfcr: %s' % lfcr)
    print('cr: %s' % cr)
    print('lf: %s' % lf)
    print('\ncr-crlf-lfcr: %s' % (cr - crlf - lfcr))
    print('lf-crlf-lfcr: %s' % (lf - crlf - lfcr))
    print('\ntotal (lf+cr-2*crlf-2*lfcr): %s\n' % (lf + cr - 2*crlf - 2*lfcr))

But it gives the wrong stats (for this file):

    crlf: 1123
    lfcr: 58
    cr: 1123
    lf: 1123

    cr-crlf-lfcr: -58
    lf-crlf-lfcr: -58

    total (lf+cr-2*crlf-2*lfcr): -116

Like sorrat, I get 1123 crlf pairs for that file, with 0 for the 3 other EOL markers.
– PM 2Ring, Commented Apr 17, 2015 at 11:30
@PM2Ring I need a better test file. I thought that this one actually contained mixed linefeeds.
– anatoly techtonik, Commented Apr 17, 2015 at 12:55

sorrat · Accepted Answer · 2020-05-01 20:22:34Z

    import sys


    def calculate_line_endings(path):
        # order matters!
        endings = [
            b'\r\n',
            b'\n\r',
            b'\n',
            b'\r',
        ]
        counts = dict.fromkeys(endings, 0)

        with open(path, 'rb') as fp:
            # iterating a binary file yields chunks split on b'\n' only
            for line in fp:
                for x in endings:
                    if line.endswith(x):
                        counts[x] += 1
                        break
        print(counts)


    if __name__ == '__main__':
        if len(sys.argv) == 2:
            calculate_line_endings(sys.argv[1])
        else:
            # only report usage when the argument is missing
            sys.exit('usage: %s <filepath>' % sys.argv[0])

This gives the following output for your file:

    crlf: 1123
    lfcr: 0
    cr: 0
    lf: 0

Is it enough?

This one is good. Do you know how the line in open(filename, "rb"): detects the lines correctly? Just to know about corner cases.
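On the corner cases asked about in the comment: as far as I know, iterating over a file opened in binary mode splits only after each b'\n'. If so, a bare \r in the middle of the data never ends a "line", and a \n\r pair is split between two lines. Here is a minimal sketch that shows this, using io.BytesIO as a stand-in for a real file; the sample bytes are made up for illustration.

```python
import io

# Stand-in for open(path, 'rb'); the contents are invented test data that
# mixes CRLF, bare CR, LFCR, LF and a trailing bare CR.
sample = b'one\r\ntwo\rthree\n\rfour\nfive\r'

# Binary iteration yields one chunk per b'\n' (plus any unterminated tail).
lines = list(io.BytesIO(sample))

for line in lines:
    print(repr(line))
```

With endswith() applied to these chunks, the bare \r after "two" is never seen, the \n\r pair is counted as a plain \n (its \r starts the next chunk), and a \r can only be counted when it is the very last byte of the file.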

PM 2Ring · Accepted Answer · 2015-04-17 11:35:36Z

The posted code doesn't work properly because Counter is counting characters in the file; it doesn't look for character pairs like \r\n and \n\r.

Here's some Python 2.6 code that finds each occurrence of the 4 EOL markers \r\n, \n\r, \r and \n using a regex. The trick is to look for the \r\n and \n\r pairs before looking for the single char EOL markers.

For testing purposes it creates some random text data; I wrote this before I noticed your link to a test file.

    #!/usr/bin/env python

    ''' Find and count various line ending character combinations

        From http://stackoverflow.com/q/29695861/4014959
        Written by PM 2Ring 2015.04.17
    '''

    import random
    import re
    from itertools import groupby

    random.seed(42)

    #Make a random text string containing various EOL combinations
    tokens = list(2*'ABCDEFGHIJK ' + '\r\n') + ['\r\n', '\n\r']
    datasize = 300
    data = ''.join([random.choice(tokens) for _ in range(datasize)])
    print repr(data), '\n'

    #regex to find various EOL combinations
    pat = re.compile(r'\r\n|\n\r|\r|\n')

    eols = pat.findall(data)
    print eols, '\n'

    grouped = [(len(list(group)), key) for key, group in groupby(sorted(eols))]
    print sorted(grouped, reverse=True)

output

    'FAHGIG\rC AGCAFGDGEKAKHJE\r\nJCC EKID\n\rKD F\rEHBGICGCHFKKFH\r\nGFEIEK\n\rFDH JGAIHF\r\n\rIG \nAHGDHE\n G\n\rCCBDFK BK\n\rC\n\r\rAIHDHFDAA\r\n\rHCF\n\rIFFEJDJCAJA\r\n\r IB\r\r\nCBBJJDBDH\r FDIFI\n\rGACDGJEGGBFG\n\rBGGFD\r\nDBJKFCA BIG\n\rC J\rGFA HG\nA\rDB\n\r \n\r\n EBF BK\n\rHJA \r\n\n\rDIEI\n\rEDIBEC E\r\nCFEGGD\rGEF EC\r\nFIG GIIJCA\n\r\n\rCFH\r\n\r\rKE HF\n\rGAKIG\r\nDDCDHEIFFHB\n C HAJFHID AC\r'

    ['\r', '\r\n', '\n\r', '\r', '\r\n', '\n\r', '\r\n', '\r', '\n', '\n', '\n\r', '\n\r', '\n\r', '\r', '\r\n', '\r', '\n\r', '\r\n', '\r', '\r', '\r\n', '\r', '\n\r', '\n\r', '\r\n', '\n\r', '\r', '\n', '\r', '\n\r', '\n\r', '\n', '\n\r', '\r\n', '\n\r', '\n\r', '\r\n', '\r', '\r\n', '\n\r', '\n\r', '\r\n', '\r', '\r', '\n\r', '\r\n', '\n', '\r']

    [(17, '\n\r'), (14, '\r'), (12, '\r\n'), (5, '\n')]

Here's a version that reads the data from a named file, following the pattern of the code in the question.

    import re
    from itertools import groupby
    import sys

    if not sys.argv[1:]:
        exit('usage: %s <filename>' % sys.argv[0])

    with open(sys.argv[1], 'rb') as f:
        data = f.read()

    print repr(data), '\n'

    #regex to find various EOL combinations
    pat = re.compile(r'\r\n|\n\r|\r|\n')

    eols = pat.findall(data)
    print eols, '\n'

    grouped = [(len(list(group)), key) for key, group in groupby(sorted(eols))]
    print sorted(grouped, reverse=True)
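Both scripts above use Python 2 print statements. For Python 3, the same idea might look roughly like the sketch below. It is not part of the original answer: the names EOL_PATTERN, count_eols and count_eols_in_file are made up for illustration, and collections.Counter is swapped in for the sorted/groupby step. The pattern is a bytes regex because the file is read in binary mode.

```python
import re
from collections import Counter

# Two-character markers come first in the alternation, so b'\r\n' is matched
# as one marker instead of as b'\r' followed by b'\n'.
EOL_PATTERN = re.compile(rb'\r\n|\n\r|\r|\n')


def count_eols(data):
    """Return a Counter mapping each EOL marker found in data to its count."""
    return Counter(EOL_PATTERN.findall(data))


def count_eols_in_file(path):
    """Read the whole file in binary mode and count its EOL markers."""
    with open(path, 'rb') as f:
        return count_eols(f.read())


# Small invented sample containing all four markers.
sample = b'a\r\nb\nc\rd\n\re\r\n'
for eol, n in count_eols(sample).most_common():
    print(repr(eol), n)
```

Because findall() returns non-overlapping matches scanned left to right, each byte belongs to at most one marker, so the counts cannot overlap the way separate count() calls do.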

Nice approach. Especially cool that it has test data to compare.

Mykhaylo Kopytonenko · Accepted Answer · 2015-04-17 11:23:05Z

The best way to deal with line endings in git is to use git configuration. You can define exactly what must be done to line endings globally, in a particular repository, or for specific files. In a .gitattributes file, you can specify that certain files must be converted to your system's native line endings on each checkout and converted back on check-in. See GitHub line endings help for a detailed description.

I don't want to convert anything. Can git just leave my files as-is by default?

go2 · Accepted Answer · 2015-04-17 11:27:25Z

From what I see, I would recommend checking whether you have the following case: \r\n\r\n\r\n. Your code will count it as follows:

    crlf: 3 -- [\r\n][\r\n][\r\n]
    lfcr: 2 -- \r[\n\r][\n\r]\n
    cr: 3 -- [\r]\n[\r]\n[\r]\n
    lf: 3 -- \r[\n]\r[\n]\r[\n]
    cr-crlf-lfcr: -2
    lf-crlf-lfcr: -2
    total (lf+cr-2*crlf-2*lfcr): -4

As you can see, some \n's and some \r's are counted twice, once for crlf and once for lfcr. Instead, you can just read line by line and check each line ending with line.endswith(). To get exact statistics for cr and lf, you can then count each \r\n and \n\r as cr+1 and lf+1.
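To make the double counting concrete, here is a small sketch that reproduces the numbers for the \r\n\r\n\r\n case and then applies the line-by-line idea. It uses io.BytesIO in place of a real file, and the variable names are made up for illustration. Note the assumption that binary iteration only splits on \n, so this simple version does not see a bare \r or a \n\r pair.

```python
import io

data = b'\r\n\r\n\r\n'  # the three-CRLF case discussed above

# Each count() call scans the whole buffer independently, so the same byte
# can be counted under several markers at once.
overlapping = {
    'crlf': data.count(b'\r\n'),
    'lfcr': data.count(b'\n\r'),
    'cr': data.count(b'\r'),
    'lf': data.count(b'\n'),
}

# Line-by-line alternative: every line is classified exactly once.
per_line = {'crlf': 0, 'lf': 0}
for line in io.BytesIO(data):
    if line.endswith(b'\r\n'):
        per_line['crlf'] += 1
    elif line.endswith(b'\n'):
        per_line['lf'] += 1

# Each CRLF contributes one CR and one LF to the per-character totals.
cr_total = per_line['crlf']
lf_total = per_line['crlf'] + per_line['lf']

print(overlapping)
print(per_line, cr_total, lf_total)
```

The first dictionary shows the spurious lfcr count of 2, while the per-line pass reports three CRLF endings and nothing else.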
