5

I had a nasty CRLF / LF conflict in a git file that was probably committed from a Windows machine. Is there a cross-platform way (preferably in Python) to detect which type of newline is dominant in a file?

I've got this code (based on the idea from https://stackoverflow.com/a/10562258/239247):

    import sys

    if not sys.argv[1:]:
        sys.exit('usage: %s <filename>' % sys.argv[0])

    # Read the whole file as bytes so no newline translation takes place
    with open(sys.argv[1], "rb") as f:
        d = f.read()
        crlf, lfcr = d.count(b'\r\n'), d.count(b'\n\r')
        cr, lf = d.count(b'\r'), d.count(b'\n')

    print('crlf: %s' % crlf)
    print('lfcr: %s' % lfcr)
    print('cr: %s' % cr)
    print('lf: %s' % lf)
    print('\ncr-crlf-lfcr: %s' % (cr - crlf - lfcr))
    print('lf-crlf-lfcr: %s' % (lf - crlf - lfcr))
    print('\ntotal (lf+cr-2*crlf-2*lfcr): %s\n' % (lf + cr - 2*crlf - 2*lfcr))

But it gives the wrong stats (for this file):

    crlf: 1123
    lfcr: 58
    cr: 1123
    lf: 1123

    cr-crlf-lfcr: -58
    lf-crlf-lfcr: -58

    total (lf+cr-2*crlf-2*lfcr): -116

2
  • Like sorrat, I get 1123 crlf pairs for that file, with 0 for the 3 other EOL markers. Commented Apr 17, 2015 at 11:30
  • @PM2Ring I need a better test file. I thought that this one actually contained mixed linefeeds. Commented Apr 17, 2015 at 12:55

4 Answers

10
    import sys


    def calculate_line_endings(path):
        # order matters! Two-byte endings are checked before one-byte ones.
        endings = [
            b'\r\n',
            b'\n\r',
            b'\n',
            b'\r',
        ]
        counts = dict.fromkeys(endings, 0)

        with open(path, 'rb') as fp:
            for line in fp:
                # Count only the first (longest) ending that matches this line
                for x in endings:
                    if line.endswith(x):
                        counts[x] += 1
                        break
        print(counts)


    if __name__ == '__main__':
        if len(sys.argv) == 2:
            calculate_line_endings(sys.argv[1])
        else:
            sys.exit('usage: %s <filepath>' % sys.argv[0])

This gives the following output for your file:

    crlf: 1123
    lfcr: 0
    cr: 0
    lf: 0

Is it enough?

2 Comments

This one is good. Do you know how for line in open(filename, "rb"): detects the lines correctly? I'd just like to know about corner cases.
Sorry, I don't know. Maybe the reason is in PEP 278.
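
On the corner cases asked about in the comment above: the io documentation states that for binary files the line terminator is always b'\n', so iterating over a file opened with "rb" splits only after LF. A lone \r therefore never ends a line, and \n\r is never seen at the end of a chunk. A minimal sketch, using an in-memory io.BytesIO in place of a real file and a made-up sample:

```python
import io

# Hypothetical sample mixing CRLF, a lone CR, LFCR and no final newline.
sample = b"a\r\nb\rc\n\rd"

# Binary-mode iteration splits only after b"\n".
chunks = list(io.BytesIO(sample))
print(chunks)
```

With this input the lone CR stays inside the second chunk, and the CR of the LFCR pair becomes the first byte of the third chunk, so an endswith() check on each line would not count either of them.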
2

The posted code doesn't work properly because Counter is counting characters in the file; it doesn't look for character pairs like \r\n and \n\r.

Here's some Python 2.6 code that finds each occurrence of the 4 EOL markers \r\n, \n\r, \r and \n using a regex. The trick is to look for the \r\n and \n\r pairs before looking for the single char EOL markers.

For testing purposes it creates some random text data; I wrote this before I noticed your link to a test file.

    #!/usr/bin/env python

    ''' Find and count various line ending character combinations

        From http://stackoverflow.com/q/29695861/4014959
        Written by PM 2Ring 2015.04.17
    '''

    import random
    import re
    from itertools import groupby

    random.seed(42)

    #Make a random text string containing various EOL combinations
    tokens = list(2*'ABCDEFGHIJK ' + '\r\n') + ['\r\n', '\n\r']
    datasize = 300
    data = ''.join([random.choice(tokens) for _ in range(datasize)])
    print repr(data), '\n'

    #regex to find various EOL combinations
    pat = re.compile(r'\r\n|\n\r|\r|\n')

    eols = pat.findall(data)
    print eols, '\n'

    grouped = [(len(list(group)), key) for key, group in groupby(sorted(eols))]
    print sorted(grouped, reverse=True)

output

    'FAHGIG\rC AGCAFGDGEKAKHJE\r\nJCC EKID\n\rKD F\rEHBGICGCHFKKFH\r\nGFEIEK\n\rFDH JGAIHF\r\n\rIG \nAHGDHE\n G\n\rCCBDFK BK\n\rC\n\r\rAIHDHFDAA\r\n\rHCF\n\rIFFEJDJCAJA\r\n\r IB\r\r\nCBBJJDBDH\r FDIFI\n\rGACDGJEGGBFG\n\rBGGFD\r\nDBJKFCA BIG\n\rC J\rGFA HG\nA\rDB\n\r \n\r\n EBF BK\n\rHJA \r\n\n\rDIEI\n\rEDIBEC E\r\nCFEGGD\rGEF EC\r\nFIG GIIJCA\n\r\n\rCFH\r\n\r\rKE HF\n\rGAKIG\r\nDDCDHEIFFHB\n C HAJFHID AC\r'

    ['\r', '\r\n', '\n\r', '\r', '\r\n', '\n\r', '\r\n', '\r', '\n', '\n', '\n\r', '\n\r', '\n\r', '\r', '\r\n', '\r', '\n\r', '\r\n', '\r', '\r', '\r\n', '\r', '\n\r', '\n\r', '\r\n', '\n\r', '\r', '\n', '\r', '\n\r', '\n\r', '\n', '\n\r', '\r\n', '\n\r', '\n\r', '\r\n', '\r', '\r\n', '\n\r', '\n\r', '\r\n', '\r', '\r', '\n\r', '\r\n', '\n', '\r']

    [(17, '\n\r'), (14, '\r'), (12, '\r\n'), (5, '\n')]

Here's a version that reads the data from a named file, following the pattern of the code in the question.

    import re
    from itertools import groupby
    import sys

    if not sys.argv[1:]:
        exit('usage: %s <filename>' % sys.argv[0])

    with open(sys.argv[1], 'rb') as f:
        data = f.read()

    print repr(data), '\n'

    #regex to find various EOL combinations
    pat = re.compile(r'\r\n|\n\r|\r|\n')

    eols = pat.findall(data)
    print eols, '\n'

    grouped = [(len(list(group)), key) for key, group in groupby(sorted(eols))]
    print sorted(grouped, reverse=True)
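
For readers on Python 3, the same regex idea might look roughly like the sketch below. It is not the code from this answer: it assumes the data is a bytes object, swaps groupby for collections.Counter, and the names EOL_PATTERN, count_eols, dominant_eol and the sample bytes are made up for illustration.

```python
import re
from collections import Counter

# Pairs come first in the alternation so that b"\r\n" is matched as one
# marker instead of a lone CR followed by a lone LF.
EOL_PATTERN = re.compile(rb"\r\n|\n\r|\r|\n")


def count_eols(data):
    """Count non-overlapping EOL markers in a bytes object."""
    return Counter(EOL_PATTERN.findall(data))


def dominant_eol(data):
    """Return the most common EOL marker, or None if there is none."""
    counts = count_eols(data)
    return counts.most_common(1)[0][0] if counts else None


# Hypothetical sample: two CRLF, one LF, one trailing CR.
sample = b"one\r\ntwo\r\nthree\nfour\r"
print(count_eols(sample))
print(dominant_eol(sample))
```

Because the scan runs left to right and never revisits a byte, an ambiguous run such as \n\r\n is read as \n\r followed by \n; whether that is the intended reading depends on the file.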

1 Comment

Nice approach. Especially cool that it has test data to compare.
1

The best way to deal with line endings in git is to use git configuration. You can define exactly what must be done to line endings globally, in a particular repository, or for specific files. In a .gitattributes file, you can specify that certain files must be converted to the native line endings of your system on each checkout and converted back on check-in. See the GitHub line endings help for a detailed description.

1 Comment

I don't want to convert anything. Can git just leave my files as-is by default?
1

From what I see, I would recommend checking whether you have the following case: \r\n\r\n\r\n. Your code will count it as follows:

    crlf: 3 -- [\r\n][\r\n][\r\n]
    lfcr: 2 -- \r[\n\r][\n\r]\n
    cr: 3 -- [\r]\n[\r]\n[\r]\n
    lf: 3 -- \r[\n]\r[\n]\r[\n]

    cr-crlf-lfcr: -2
    lf-crlf-lfcr: -2

    total (lf+cr-2*crlf-2*lfcr): -4

As you can see, some \n's and some \r's are counted twice, once for crlf and once for lfcr. Instead, you can just read the file line by line and check each line's ending with line.endswith(). If you then want exact statistics for cr and lf, count each \r\n and \n\r as cr+1 and lf+1.
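
The double counting on the \r\n\r\n\r\n case can be reproduced with a few lines of Python 3. This is only a sketch: it uses an in-memory io.BytesIO instead of a real file, and the variable names are made up for illustration.

```python
import io

data = b"\r\n\r\n\r\n"

# Independent substring counts: every count() call scans the data on its
# own, so the same byte contributes to several totals.
crlf = data.count(b"\r\n")
lfcr = data.count(b"\n\r")  # these matches straddle neighbouring CRLF pairs
cr = data.count(b"\r")
lf = data.count(b"\n")
print(crlf, lfcr, cr, lf)

# Reading line by line and checking how each line ends assigns every
# line to exactly one category.
crlf_lines = lf_lines = 0
for line in io.BytesIO(data):
    if line.endswith(b"\r\n"):
        crlf_lines += 1
    elif line.endswith(b"\n"):
        lf_lines += 1
print(crlf_lines, lf_lines)
```

The first print shows the overlapping totals from the table above (3, 2, 3, 3), while the line-by-line pass reports three CRLF lines and no bare LF lines.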
