How to get the line count of a large file cheaply in Python

Question

How do I get a line count of a large file in the most memory- and time-efficient manner?

def file_len(filename): with open(filename) as f: for i, _ in enumerate(f): pass return i + 1

Do you need exact line count or will an approximation suffice? — pico
– pico, Commented May 11, 2009 at 20:14
I would add i=-1 before for loop, since this code doesn't work for empty files. — Maciek Sawicki
– Maciek Sawicki, Commented Dec 27, 2011 at 16:13
@Legend: I bet pico is thinking, get the file size (with seek(0,2) or equiv), divide by approximate line length. You could read a few lines at the beginning to guess the average line length. — Anne
– Anne, Commented Feb 7, 2012 at 17:02
@IanMackinnon Works for empty files, but you have to initialize i to 0 before the for-loop. — scai
– scai, Commented Aug 13, 2013 at 16:29

Jean-Francois T. · Accepted Answer · 2023-11-03 02:12:46Z

830

One line, faster than the for loop of the OP (although not the fastest) and very concise:

num_lines = sum(1 for _ in open('myfile.txt'))

You can also boost the speed (and robustness) by using rbU mode and include it in a with block to close the file:

with open("myfile.txt", "rbU") as f: num_lines = sum(1 for _ in f)

Note: The U in rbU mode is deprecated since Python 3.3 and above, so iwe should use rb instead of rbU (and it has been removed in Python 3.11).

edited Nov 3, 2023 at 2:12

Jean-Francois T.

13.3k7 gold badges83 silver badges118 bronze badges

answered Jun 19, 2009 at 19:07

Kyle

8,8231 gold badge20 silver badges10 bronze badges

Sign up to request clarification or add additional context in comments.

7 Comments

James Sapam Over a year ago

its similar to sum(sequence of 1) every line is counting as 1. >>> [ 1 for line in range(10) ] [1, 1, 1, 1, 1, 1, 1, 1, 1, 1] >>> sum( 1 for line in range(10) ) 10 >>>

Honghe.Wu Over a year ago

num_lines = sum(1 for line in open('myfile.txt') if line.rstrip()) for filter empty lines

Mannaggia Over a year ago

as we open a file, will this be closed automatically once we iterate over all the elements? Is it required to 'close()'? I think we cannot use 'with open()' in this short statement, right?

dfrancis Over a year ago

A slight lint improvement: num_lines = sum(1 for _ in open('myfile.txt'))

Nico Schlömer Over a year ago

It's not any faster than the other solutions, see stackoverflow.com/a/68385697/353337.

|

Jean-Francois T. · Accepted Answer · 2023-05-08 05:59:28Z

431

You can't get any better than that.

After all, any solution will have to read the entire file, figure out how many \n you have, and return that result.

Do you have a better way of doing that without reading the entire file? Not sure... The best solution will always be I/O-bound, best you can do is make sure you don't use unnecessary memory, but it looks like you have that covered.

[Edit May 2023]

As commented in many other answers, in Python 3 there are better alternatives. The for loop is not the most efficient. For example, using mmap or buffers is more efficient.

edited May 8, 2023 at 5:59

Jean-Francois T.

13.3k7 gold badges83 silver badges118 bronze badges

answered May 10, 2009 at 10:37

Yuval Adam

166k95 gold badges318 silver badges406 bronze badges

14 Comments

Ólafur Waage Over a year ago

Exactly, even WC is reading through the file, but in C and it's probably pretty optimized.

Tomalak Over a year ago

As far as I understand the Python file IO is done through C as well. docs.python.org/library/stdtypes.html#file-objects

bobpoekert Over a year ago

@Tomalak That's a red herring. While python and wc might be issuing the same syscalls, python has opcode dispatch overhead that wc doesn't have.

Erik Aronesty Over a year ago

You can approximate a line count by sampling. It can be thousands of times faster. See: documentroot.com/2011/02/…

Skippy le Grand Gourou Over a year ago

Other answers seem to indicate this categorical answer is wrong, and should therefore be deleted rather than kept as accepted.

|

Peter Mortensen · Accepted Answer · 2023-11-02 04:43:06Z

I believe that a memory mapped file will be the fastest solution. I tried four functions: the function posted by the OP (opcount); a simple iteration over the lines in the file (simplecount); readline with a memory-mapped filed (mmap) (mapcount); and the buffer read solution offered by Mykola Kharechko (bufcount).

I ran each function five times, and calculated the average run-time for a 1.2 million-line text file.

Windows XP, Python 2.5, 2 GB RAM, 2 GHz AMD processor

Here are my results:

mapcount : 0.465599966049 simplecount : 0.756399965286 bufcount : 0.546800041199 opcount : 0.718600034714

Numbers for Python 2.6:

mapcount : 0.471799945831 simplecount : 0.634400033951 bufcount : 0.468800067902 opcount : 0.602999973297

So the buffer read strategy seems to be the fastest for Windows/Python 2.6

Here is the code:

from __future__ import with_statement import time import mmap import random from collections import defaultdict def mapcount(filename): with open(filename, "r+") as f: buf = mmap.mmap(f.fileno(), 0) lines = 0 readline = buf.readline while readline(): lines += 1 return lines def simplecount(filename): lines = 0 for line in open(filename): lines += 1 return lines def bufcount(filename): f = open(filename) lines = 0 buf_size = 1024 * 1024 read_f = f.read # loop optimization buf = read_f(buf_size) while buf: lines += buf.count('\n') buf = read_f(buf_size) return lines def opcount(fname): with open(fname) as f: for i, l in enumerate(f): pass return i + 1 counts = defaultdict(list) for i in range(5): for func in [mapcount, simplecount, bufcount, opcount]: start_time = time.time() assert func("big_file.txt") == 1209138 counts[func].append(time.time() - start_time) for key, vals in counts.items(): print key.__name__, ":", sum(vals) / float(len(vals))

It seems that wccount() is the fastest gist.github.com/0ac760859e614cd03652
The buffered read is the fastest solution, not mmap or wccount. See stackoverflow.com/a/68385697/353337.
@NicoSchlömer it depends the characteristics of your file. See stackoverflow.com/a/76197308/1603480 for a comparison of both on different files.
For me, it needs a check whether the last byte is b'\n' and the file non-empty to count a last line without newline. This also needs to work when the file size is a multiple of buf_size, so I think this needs an extra variable to remember whether the last non-empty buf ended with a newline.

Peter Mortensen · Accepted Answer · 2023-11-02 14:40:02Z

All of these solutions ignore one way to make this run considerably faster, namely by using the unbuffered (raw) interface, using bytearrays, and doing your own buffering. (This only applies in Python 3. In Python 2, the raw interface may or may not be used by default, but in Python 3, you'll default into Unicode.)

Using a modified version of the timing tool, I believe the following code is faster (and marginally more Pythonic) than any of the solutions offered:

def rawcount(filename): f = open(filename, 'rb') lines = 0 buf_size = 1024 * 1024 read_f = f.raw.read buf = read_f(buf_size) while buf: lines += buf.count(b'\n') buf = read_f(buf_size) return lines

Using a separate generator function, this runs a smidge faster:

def _make_gen(reader): b = reader(1024 * 1024) while b: yield b b = reader(1024*1024) def rawgencount(filename): f = open(filename, 'rb') f_gen = _make_gen(f.raw.read) return sum(buf.count(b'\n') for buf in f_gen)

This can be done completely with generators expressions in-line using itertools, but it gets pretty weird looking:

from itertools import (takewhile, repeat) def rawincount(filename): f = open(filename, 'rb') bufgen = takewhile(lambda x: x, (f.raw.read(1024*1024) for _ in repeat(None))) return sum(buf.count(b'\n') for buf in bufgen)

Here are my timings:

function average, s min, s ratio rawincount 0.0043 0.0041 1.00 rawgencount 0.0044 0.0042 1.01 rawcount 0.0048 0.0045 1.09 bufcount 0.008 0.0068 1.64 wccount 0.01 0.0097 2.35 itercount 0.014 0.014 3.41 opcount 0.02 0.02 4.83 kylecount 0.021 0.021 5.05 simplecount 0.022 0.022 5.25 mapcount 0.037 0.031 7.46

I am working with 100Gb+ files, and your rawgencounts is the only feasible solution I have seen so far. Thanks!
is wccount in this table for the subprocess shell wc tool?
Thanks @michael-bacon, it's a really nice solution. You can make the rawincount solution less weird looking by using bufgen = iter(partial(f.raw.read, 1024*1024), b'') instead of combining takewhile and repeat.
Oh, partial function, yeah, that's a nice little tweak. Also, I assumed that the 1024*1024 would get merged by the interpreter and treated as a constant but that was on hunch not documentation.
@MichaelBacon, would it be faster to open the file with buffering=0 and then calling read instead of just opening the file as "rb" and calling raw.read, or will that be optimized to the same thing?

nosklo · Accepted Answer · 2009-05-11 12:23:02Z

107

You could execute a subprocess and run wc -l filename

import subprocess def file_len(fname): p = subprocess.Popen(['wc', '-l', fname], stdout=subprocess.PIPE, stderr=subprocess.PIPE) result, err = p.communicate() if p.returncode != 0: raise IOError(err) return int(result.strip().split()[0])

edited May 11, 2009 at 12:23

nosklo

224k58 gold badges300 silver badges299 bronze badges

answered May 10, 2009 at 10:28

Ólafur Waage

70.3k22 gold badges147 silver badges199 bronze badges

7 Comments

SilentGhost Over a year ago

what would be the windows version of this?

Ólafur Waage Over a year ago

You can refer to this SO question regarding that. stackoverflow.com/questions/247234/…

bendin Over a year ago

Indeed, in my case (Mac OS X) this takes 0.13s versus 0.5s for counting the number of lines "for x in file(...)" produces, versus 1.0s counting repeated calls to str.find or mmap.find. (The file I used to test this has 1.3 million lines.)

nosklo Over a year ago

No need to involve the shell on that. edited answer and added example code;

e-info128 Over a year ago

Is not cross platform.

|

Nico Schlömer · Accepted Answer · 2021-12-03 08:23:16Z

After a perfplot analysis, one has to recommend the buffered read solution

def buf_count_newlines_gen(fname): def _make_gen(reader): while True: b = reader(2 ** 16) if not b: break yield b with open(fname, "rb") as f: count = sum(buf.count(b"\n") for buf in _make_gen(f.raw.read)) return count

It's fast and memory-efficient. Most other solutions are about 20 times slower.

Code to reproduce the plot:

import mmap import subprocess from functools import partial import perfplot def setup(n): fname = "t.txt" with open(fname, "w") as f: for i in range(n): f.write(str(i) + "\n") return fname def for_enumerate(fname): i = 0 with open(fname) as f: for i, _ in enumerate(f): pass return i + 1 def sum1(fname): return sum(1 for _ in open(fname)) def mmap_count(fname): with open(fname, "r+") as f: buf = mmap.mmap(f.fileno(), 0) lines = 0 while buf.readline(): lines += 1 return lines def for_open(fname): lines = 0 for _ in open(fname): lines += 1 return lines def buf_count_newlines(fname): lines = 0 buf_size = 2 ** 16 with open(fname) as f: buf = f.read(buf_size) while buf: lines += buf.count("\n") buf = f.read(buf_size) return lines def buf_count_newlines_gen(fname): def _make_gen(reader): b = reader(2 ** 16) while b: yield b b = reader(2 ** 16) with open(fname, "rb") as f: count = sum(buf.count(b"\n") for buf in _make_gen(f.raw.read)) return count def wc_l(fname): return int(subprocess.check_output(["wc", "-l", fname]).split()[0]) def sum_partial(fname): with open(fname) as f: count = sum(x.count("\n") for x in iter(partial(f.read, 2 ** 16), "")) return count def read_count(fname): return open(fname).read().count("\n") b = perfplot.bench( setup=setup, kernels=[ for_enumerate, sum1, mmap_count, for_open, wc_l, buf_count_newlines, buf_count_newlines_gen, sum_partial, read_count, ], n_range=[2 ** k for k in range(27)], xlabel="num lines", ) b.save("out.png") b.show()

I hae very long lines in my file; i'm thinking the buffer should be allocated only once using readinto
Great graph: thanks for the code. But actually, this overlooks the case where a line is more than just than 10 characters. For long lines, mmap tends to be more efficient than buf_count_newlines_gen: see answer stackoverflow.com/a/76197308/1603480

Peter Mortensen · Accepted Answer · 2023-11-02 04:56:52Z

53

A one-line Bash solution similar to this answer, using the modern subprocess.check_output function:

def line_count(filename): return int(subprocess.check_output(['wc', '-l', filename]).split()[0])

edited Nov 2, 2023 at 4:56

Peter Mortensen

31.4k22 gold badges110 silver badges134 bronze badges

answered Apr 3, 2017 at 7:48

1''

27.2k32 gold badges150 silver badges204 bronze badges

2 Comments

Shan Dou Over a year ago

This answer should be voted up to a higher spot in this thread for Linux/Unix users. Despite the majority preferences in a cross-platform solution, this is a superb way on Linux/Unix. For a 184-million-line csv file I have to sample data from, it provides the best runtime. Other pure python solutions take on average 100+ seconds whereas subprocess call of wc -l takes ~ 5 seconds.

Alexey Vazhnov Over a year ago

shell=True is bad for security, it is better to avoid it.

Peter Mortensen · Accepted Answer · 2023-11-02 14:14:04Z

Here is a Python program to use the multiprocessing library to distribute the line counting across machines/cores. My test improves counting a 20 million line file from 26 seconds to 7 seconds using an 8-core Windows 64-bit server. Note: not using memory mapping makes things much slower.

import multiprocessing, sys, time, os, mmap import logging, logging.handlers def init_logger(pid): console_format = 'P{0} %(levelname)s %(message)s'.format(pid) logger = logging.getLogger() # New logger at root level logger.setLevel(logging.INFO) logger.handlers.append(logging.StreamHandler()) logger.handlers[0].setFormatter(logging.Formatter(console_format, '%d/%m/%y %H:%M:%S')) def getFileLineCount(queues, pid, processes, file1): init_logger(pid) logging.info('start') physical_file = open(file1, "r") # mmap.mmap(fileno, length[, tagname[, access[, offset]]] m1 = mmap.mmap(physical_file.fileno(), 0, access=mmap.ACCESS_READ) # Work out file size to divide up line counting fSize = os.stat(file1).st_size chunk = (fSize / processes) + 1 lines = 0 # Get where I start and stop _seedStart = chunk * (pid) _seekEnd = chunk * (pid+1) seekStart = int(_seedStart) seekEnd = int(_seekEnd) if seekEnd < int(_seekEnd + 1): seekEnd += 1 if _seedStart < int(seekStart + 1): seekStart += 1 if seekEnd > fSize: seekEnd = fSize # Find where to start if pid > 0: m1.seek(seekStart) # Read next line l1 = m1.readline() # Need to use readline with memory mapped files seekStart = m1.tell() # Tell previous rank my seek start to make their seek end if pid > 0: queues[pid-1].put(seekStart) if pid < processes-1: seekEnd = queues[pid].get() m1.seek(seekStart) l1 = m1.readline() while len(l1) > 0: lines += 1 l1 = m1.readline() if m1.tell() > seekEnd or len(l1) == 0: break logging.info('done') # Add up the results if pid == 0: for p in range(1, processes): lines += queues[0].get() queues[0].put(lines) # The total lines counted else: queues[0].put(lines) m1.close() physical_file.close() if __name__ == '__main__': init_logger('main') if len(sys.argv) > 1: file_name = sys.argv[1] else: logging.fatal('parameters required: file-name [processes]') exit() t = time.time() processes = multiprocessing.cpu_count() if len(sys.argv) > 2: processes = int(sys.argv[2]) queues = [] # A queue for each process for pid in range(processes): queues.append(multiprocessing.Queue()) jobs = [] prev_pipe = 0 for pid in range(processes): p = multiprocessing.Process(target = getFileLineCount, args=(queues, pid, processes, file_name,)) p.start() jobs.append(p) jobs[0].join() # Wait for counting to finish lines = queues[0].get() logging.info('finished {} Lines:{}'.format( time.time() - t, lines))

How does this work with files much bigger than main memory? for instance a 20GB file on a system with 4GB RAM and 2 cores
Hard to test now, but I presume it would page the file in and out.
This is pretty neat code. I was surprised to find that it is faster to use multiple processors. I figured that the IO would be the bottleneck. In older Python versions, line 21 needs int() like chunk = int((fSize / processes)) + 1
do it load all the file into the memory? what about a bigger fire where the size is bigger then the ram on the computer?
Would you mind if I formatted the answer with black? black.vercel.app

Daniel Lee · Accepted Answer · 2013-10-08 12:46:12Z

18

I would use Python's file object method readlines, as follows:

with open(input_file) as foo: lines = len(foo.readlines())

This opens the file, creates a list of lines in the file, counts the length of the list, saves that to a variable and closes the file again.

answered Oct 8, 2013 at 12:46

Daniel Lee

2,0701 gold badge23 silver badges30 bronze badges

4 Comments

Steen Schütt Over a year ago

While this is one of the first ways that comes to mind, it probably isn't very memory efficient, especially if counting lines in files up to 10 GB (Like I do), which is a noteworthy disadvantage.

robert Over a year ago

@TimeSheep Is this an issue for files with many (say, billions) of small lines, or files which have extremely long lines (say, Gigabytes per line)?

robert Over a year ago

The reason I ask is, it would seem that the compiler should be able to optimize this away by not creating an intermediate list.

Kumba Over a year ago

@dmityugov Per Python docs, xreadlines has been deprecated since 2.3, as it just returns an iterator. for line in file is the stated replacement. See: docs.python.org/2/library/stdtypes.html#file.xreadlines

Peter Mortensen · Accepted Answer · 2023-11-02 04:46:01Z

This is the fastest thing I have found using pure Python.

You can use whatever amount of memory you want by setting buffer, though 2**16 appears to be a sweet spot on my computer.

from functools import partial buffer=2**16 with open(myfile) as f: print sum(x.count('\n') for x in iter(partial(f.read,buffer), ''))

I found the answer here Why is reading lines from stdin much slower in C++ than Python? and tweaked it just a tiny bit. It’s a very good read to understand how to count lines quickly, though wc -l is still about 75% faster than anything else.

pkit · Accepted Answer · 2009-05-10 10:33:41Z

13

def file_len(full_path): """ Count number of lines in a file.""" f = open(full_path) nr_of_lines = sum(1 for line in f) f.close() return nr_of_lines

answered May 10, 2009 at 10:33

pkit

8,3798 gold badges38 silver badges37 bronze badges

2 Comments

Ente Fetz Over a year ago

The command "sum(1 for line in f)" seems to delete the content of the file. The command "f.readline()" returns null if I put it after that line.

C.Nivs Over a year ago

@EnteFetz that's because the file handle is exhausted, so there are no more lines to read. If you do f.seek(0); f.readline() this problem won't persist

Peter Mortensen · Accepted Answer · 2023-11-02 05:20:14Z

12

Here is what I use, and it seems pretty clean:

import subprocess def count_file_lines(file_path): """ Counts the number of lines in a file using wc utility. :param file_path: path to file :return: int, no of lines """ num = subprocess.check_output(['wc', '-l', file_path]) num = num.split(' ') return int(num[0])

This is marginally faster than using pure Python, but at the cost of memory usage. Subprocess will fork a new process with the same memory footprint as the parent process while it executes your command.

edited Nov 2, 2023 at 5:20

Peter Mortensen

31.4k22 gold badges110 silver badges134 bronze badges

answered Jul 26, 2017 at 18:13

radtek

36.6k13 gold badges149 silver badges114 bronze badges

5 Comments

Bram Vanroy Over a year ago

Just as a side note, this won't work on Windows of course.

radtek Over a year ago

core utils apparently provides "wc" for windows stackoverflow.com/questions/247234/…. You can also use a linux VM in your windows box if your code will end up running in linux in prod.

Bram Vanroy Over a year ago

Or WSL, highly advised over any VM if stuff like this is the only thing you do. :-)

radtek Over a year ago

Yeah that works. I'm not a windows guy but from goolging I learned WSL = Windows Subsystem for Linux =)

Alexey Alexeenka Over a year ago

python3.7: subprocess return bytes, so code looks like this: int(subprocess.check_output(['wc', '-l', file_path]).decode("utf-8").lstrip().split(" ")[0])

Peter Mortensen · Accepted Answer · 2023-11-02 04:55:18Z

10

One line solution:

import os os.system("wc -l filename")

My snippet:

>>> os.system('wc -l *.txt')

Output:

0 bar.txt 1000 command.txt 3 test_file.txt 1003 total

edited Nov 2, 2023 at 4:55

Peter Mortensen

31.4k22 gold badges110 silver badges134 bronze badges

answered Jan 6, 2017 at 8:29

TheExorcist

2,0341 gold badge22 silver badges25 bronze badges

8 Comments

Kim Over a year ago

Good idea, unfortunately this does not work on Windows though.

TheExorcist Over a year ago

if you want to be surfer of python , say good bye to windows.Believe me you will thank me one day .

Kim Over a year ago

I just considered it noteworthy that this will only work on windows. I prefer working on a linux/unix stack myself, but when writing software IMHO one should consider the side effects a program could have when run under different OSes. As the OP did not mention his platform and in case anyone pops on this solution via google and copies it (unaware of the limitations a Windows system might have), I wanted to add the note.

An Se Over a year ago

You can't save output of os.system() to variable and post-process it anyhow.

TheExorcist Over a year ago

@AnSe you are correct but question is not asked whether it saves or not.I guess you are understanding the context.

|

Peter Mortensen · Accepted Answer · 2023-11-02 14:32:55Z

Kyle's answer

num_lines = sum(1 for line in open('my_file.txt'))

is probably best. An alternative for this is:

num_lines = len(open('my_file.txt').read().splitlines())

Here is the comparison of performance of both:

In [20]: timeit sum(1 for line in open('Charts.ipynb')) 100000 loops, best of 3: 9.79 µs per loop In [21]: timeit len(open('Charts.ipynb').read().splitlines()) 100000 loops, best of 3: 12 µs per loop

Peter Mortensen · Accepted Answer · 2023-11-02 14:16:27Z

7

I got a small (4-8%) improvement with this version which reuses a constant buffer, so it should avoid any memory or GC overhead:

lines = 0 buffer = bytearray(2048) with open(filename) as f: while f.readinto(buffer) > 0: lines += buffer.count('\n')

You can play around with the buffer size and maybe see a little improvement.

edited Nov 2, 2023 at 14:16

Peter Mortensen

31.4k22 gold badges110 silver badges134 bronze badges

answered Feb 25, 2013 at 19:31

Scott Persinger

3,5643 gold badges23 silver badges12 bronze badges

4 Comments

ryuusenshi Over a year ago

Nice. To account for files that don't end in \n, add 1 outside of loop if buffer and buffer[-1]!='\n'

Jay Over a year ago

A bug: buffer in the last round might not be clean.

pelos Over a year ago

what if in between buffers one portion ends with \ and the other portion starts with n? that will miss one new line in there, I would sudgest to variables to store the end and the start of each chunk, but that might add more time to the script =(

Peter Mortensen Over a year ago

What file size? And other context, like the underlying hardware (disk type and speed. RAM, incl. speed. L1, L2, and L3 cache sizes. Etc.)

SilentGhost · Accepted Answer · 2011-01-30 16:43:40Z

5

As for me this variant will be the fastest:

#!/usr/bin/env python def main(): f = open('filename') lines = 0 buf_size = 1024 * 1024 read_f = f.read # loop optimization buf = read_f(buf_size) while buf: lines += buf.count('\n') buf = read_f(buf_size) print lines if __name__ == '__main__': main()

reasons: buffering faster than reading line by line and string.count is also very fast

edited Jan 30, 2011 at 16:43

SilentGhost

322k67 gold badges312 silver badges294 bronze badges

answered May 10, 2009 at 11:29

Mykola Kharechko

3,2475 gold badges34 silver badges41 bronze badges

6 Comments

dF. Over a year ago

But is it? At least on OSX/python2.5 the OP's version is still about 10% faster according to timeit.py.

tzot Over a year ago

What if the last line does not end in '\n'?

SilentGhost Over a year ago

I don't know how you tested it, dF, but on my machine it's ~2.5 times slower than any other option.

Ólafur Waage Over a year ago

You state that it will be the fastest and then state that you haven't tested it. Not very scientific eh? :)

SherylHohman Over a year ago

See solution and stats provided by Ryan Ginstrom answer below. Also check out JF Sebastian's comment and link on the same answer.

|

Texom512 · Accepted Answer · 2015-02-23 18:38:13Z

5

This code is shorter and clearer. It's probably the best way:

num_lines = open('yourfile.ext').read().count('\n')

answered Feb 23, 2015 at 18:38

Texom512

4,9213 gold badges19 silver badges16 bronze badges

2 Comments

user3672754 Over a year ago

You should also close the file.

Ivelin Over a year ago

It will load the whole file into memory.

Peter Mortensen · Accepted Answer · 2023-11-02 14:02:18Z

Just to complete the methods in previous answers, I tried a variant with the fileinput module:

import fileinput as fi def filecount(fname): for line in fi.input(fname): pass return fi.lineno()

And passed a 60-million-lines file to all the stated methods in previous answers:

mapcount: 6.13 simplecount: 4.59 opcount: 4.43 filecount: 43.3 bufcount: 0.171

It's a little surprise to me that fileinput is that bad and scales far worse than all the other methods...

Dummy · Accepted Answer · 2011-11-25 14:55:52Z

I have modified the buffer case like this:

def CountLines(filename): f = open(filename) try: lines = 1 buf_size = 1024 * 1024 read_f = f.read # loop optimization buf = read_f(buf_size) # Empty file if not buf: return 0 while buf: lines += buf.count('\n') buf = read_f(buf_size) return lines finally: f.close()

Now also empty files and the last line (without \n) are counted.

Maybe also explain (or add in comment in the code) what you changed and what for ;). Might give people some more inside in your code much easier (rather than "parsing" the code in the brain).
The loop optimization I think allows Python to do a local variable lookup at read_f, python.org/doc/essays/list2str

Jean-Francois T. · Accepted Answer · 2023-11-03 02:08:14Z

There are already so many answers with great timing comparison, but I believe they are just looking at number of lines to measure performance (e.g., the great graph from Nico Schlömer).

To be accurate while measuring performance, we should look at:

the number of lines
the average size of the lines
... the resulting total size of the file (which might impact memory)

First of all, the function of the OP (with a for) and the function sum(1 for line in f) are not performing that well...

Good contenders are with mmap or buffer.

To summarize: based on my analysis (Python 3.9 on Windows with SSD):

For big files with relatively short lines (within 100 characters): use function with a buffer buf_count_newlines_gen

def buf_count_newlines_gen(fname: str) -> int: """Count the number of lines in a file""" def _make_gen(reader): b = reader(1024 * 1024) while b: yield b b = reader(1024 * 1024) with open(fname, "rb") as f: count = sum(buf.count(b"\n") for buf in _make_gen(f.raw.read)) return count

For files with potentially longer lines (up to 2000 characters), disregarding the number of lines: use function with mmap: count_nb_lines_mmap

def count_nb_lines_mmap(file: Path) -> int: """Count the number of lines in a file""" with open(file, mode="rb") as f: mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) nb_lines = 0 while mm.readline(): nb_lines += 1 mm.close() return nb_lines

For a short code with very good performance (especially for files of size up to medium size):

def itercount(filename: str) -> int: """Count the number of lines in a file""" with open(filename, 'rb') as f: return sum(1 for _ in f)

Here is a summary of the different metrics (average time with timeit on 7 runs with 10 loops each):

Function	Small file, short lines	Small file, long lines	Big file, short lines	Big file, long lines	Bigger file, short lines
... size ...	0.04 MB	1.16 MB	318 MB	17 MB	328 MB
... nb lines ...	915 lines < 100 chars	915 lines < 2000 chars	389000 lines < 100 chars	389,000 lines < 2000 chars	9.8 millions lines < 100 chars
`count_nb_lines_blocks`	0.183 ms	1.718 ms	36.799 ms	415.393 ms	517.920 ms
`count_nb_lines_mmap`	0.185 ms	0.582 ms	44.801 ms	185.461 ms	691.637 ms
`buf_count_newlines_gen`	0.665 ms	1.032 ms	15.620 ms	213.458 ms	318.939 ms
`itercount`	0.135 ms	0.817 ms	31.292 ms	223.120 ms	628.760 ms

Note: I have also compared count_nb_lines_mmap and buf_count_newlines_gen on a file of 8 GB, with 9.7 million lines of more than 800 characters. We got an average of 5.39 seconds for buf_count_newlines_gen vs. 4.2 seconds for count_nb_lines_mmap, so this latter function seems indeed better for files with longer lines.

Here is the code I have used:

import mmap from pathlib import Path def count_nb_lines_blocks(file: Path) -> int: """Count the number of lines in a file""" def blocks(files, size=65536): while True: b = files.read(size) if not b: break yield b with open(file, encoding="utf-8", errors="ignore") as f: return sum(bl.count("\n") for bl in blocks(f)) def count_nb_lines_mmap(file: Path) -> int: """Count the number of lines in a file""" with open(file, mode="rb") as f: mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) nb_lines = 0 while mm.readline(): nb_lines += 1 mm.close() return nb_lines def count_nb_lines_sum(file: Path) -> int: """Count the number of lines in a file""" with open(file, "r", encoding="utf-8", errors="ignore") as f: return sum(1 for line in f) def count_nb_lines_for(file: Path) -> int: """Count the number of lines in a file""" i = 0 with open(file) as f: for i, _ in enumerate(f, start=1): pass return i def buf_count_newlines_gen(fname: str) -> int: """Count the number of lines in a file""" def _make_gen(reader): b = reader(1024 * 1024) while b: yield b b = reader(1024 * 1024) with open(fname, "rb") as f: count = sum(buf.count(b"\n") for buf in _make_gen(f.raw.read)) return count def itercount(filename: str) -> int: """Count the number of lines in a file""" with open(filename, 'rbU') as f: return sum(1 for _ in f) files = [small_file, big_file, small_file_shorter, big_file_shorter, small_file_shorter_sim_size, big_file_shorter_sim_size] for file in files: print(f"File: {file.name} (size: {file.stat().st_size / 1024 ** 2:.2f} MB)") for func in [ count_nb_lines_blocks, count_nb_lines_mmap, count_nb_lines_sum, count_nb_lines_for, buf_count_newlines_gen, itercount, ]: result = func(file) time = Timer(lambda: func(file)).repeat(7, 10) print(f" * {func.__name__}: {result} lines in {mean(time) / 10 * 1000:.3f} ms") print()

File: small_file.ndjson (size: 1.16 MB) * count_nb_lines_blocks: 915 lines in 1.718 ms * count_nb_lines_mmap: 915 lines in 0.582 ms * count_nb_lines_sum: 915 lines in 1.993 ms * count_nb_lines_for: 915 lines in 3.876 ms * buf_count_newlines_gen: 915 lines in 1.032 ms * itercount: 915 lines in 0.817 ms File: big_file.ndjson (size: 317.99 MB) * count_nb_lines_blocks: 389000 lines in 415.393 ms * count_nb_lines_mmap: 389000 lines in 185.461 ms * count_nb_lines_sum: 389000 lines in 485.370 ms * count_nb_lines_for: 389000 lines in 967.075 ms * buf_count_newlines_gen: 389000 lines in 213.458 ms * itercount: 389000 lines in 223.120 ms File: small_file__shorter.ndjson (size: 0.04 MB) * count_nb_lines_blocks: 915 lines in 0.183 ms * count_nb_lines_mmap: 915 lines in 0.185 ms * count_nb_lines_sum: 915 lines in 0.251 ms * count_nb_lines_for: 915 lines in 0.244 ms * buf_count_newlines_gen: 915 lines in 0.665 ms * itercount: 915 lines in 0.135 ms File: big_file__shorter.ndjson (size: 17.42 MB) * count_nb_lines_blocks: 389000 lines in 36.799 ms * count_nb_lines_mmap: 389000 lines in 44.801 ms * count_nb_lines_sum: 389000 lines in 59.068 ms * count_nb_lines_for: 389000 lines in 81.387 ms * buf_count_newlines_gen: 389000 lines in 15.620 ms * itercount: 389000 lines in 31.292 ms File: small_file__shorter_sim_size.ndjson (size: 1.21 MB) * count_nb_lines_blocks: 36457 lines in 1.920 ms * count_nb_lines_mmap: 36457 lines in 2.615 ms * count_nb_lines_sum: 36457 lines in 3.993 ms * count_nb_lines_for: 36457 lines in 6.011 ms * buf_count_newlines_gen: 36457 lines in 1.363 ms * itercount: 36457 lines in 2.147 ms File: big_file__shorter_sim_size.ndjson (size: 328.19 MB) * count_nb_lines_blocks: 9834248 lines in 517.920 ms * count_nb_lines_mmap: 9834248 lines in 691.637 ms * count_nb_lines_sum: 9834248 lines in 1109.669 ms * count_nb_lines_for: 9834248 lines in 1683.859 ms * buf_count_newlines_gen: 9834248 lines in 318.939 ms * itercount: 9834248 lines in 628.760 ms

Nice comparison BUT Big file, long lines with a 17mb size is not that big !!
@Enissay. It's not very big indeed, but still 389,000 lines so no short :) I did some tests on 8 GB text files: this was big ;)

Lerner Zhang · Accepted Answer · 2014-08-28 09:25:44Z

If one wants to get the line count cheaply in Python in Linux, I recommend this method:

import os print os.popen("wc -l file_path").readline().split()[0]

file_path can be both abstract file path or relative path. Hope this may help.

Peter Mortensen · Accepted Answer · 2023-11-02 15:11:57Z

This is a meta-comment on some of the other answers.

The line-reading and buffered \n-counting techniques won't return the same answer for every file, because some text files have no newline at the end of the last line. You can work around this by checking the last byte of the last nonempty buffer and adding 1 if it's not b'\n'.
In Python 3, opening the file in text mode and in binary mode can yield different results, because text mode by default recognizes CR, LF, and CRLF as line endings (converting them all to '\n'), while in binary mode only LF and CRLF will be counted if you count b'\n'. This applies whether you read by lines or into a fixed-size buffer. The classic Mac OS used CR as a line ending; I don't know how common those files are these days.
The buffer-reading approach uses a bounded amount of RAM independent of file size, while the line-reading approach could read the entire file into RAM at once in the worst case (especially if the file uses CR line endings). In the worst case it may use substantially more RAM than the file size, because of overhead from dynamic resizing of the line buffer and (if you opened in text mode) Unicode decoding and storage.
You can improve the memory usage, and probably the speed, of the buffered approach by pre-allocating a bytearray and using readinto instead of read. One of the existing answers (with few votes) does this, but it's buggy (it double-counts some bytes).
The top buffer-reading answer uses a large buffer (1 MiB). Using a smaller buffer can actually be faster because of OS readahead. If you read 32K or 64K at a time, the OS will probably start reading the next 32K/64K into the cache before you ask for it, and each trip to the kernel will return almost immediately. If you read 1 MiB at a time, the OS is unlikely to speculatively read a whole megabyte. It may preread a smaller amount but you will still spend a significant amount of time sitting in the kernel waiting for the disk to return the rest of the data.

gaborous · Accepted Answer · 2023-11-03 21:19:20Z

There are a lot of answers already, but unfortunately most of them are just tiny economies on a barely optimizable problem...

I worked on several projects where line count was the core function of the software, and working as fast as possible with a huge number of files was of paramount importance.

The main bottleneck with line count is I/O access, as you need to read each line in order to detect the line return character, there is simply no way around. The second potential bottleneck is memory management: the more you load at once, the faster you can process, but this bottleneck is negligible compared to the first.

Hence, there are three major ways to reduce the processing time of a line count function, apart from tiny optimizations such as disabling GC collection and other micro-managing tricks:

Hardware solution: the major and most obvious way is non-programmatic: buy a very fast SSD/flash hard drive. By far, this is how you can get the biggest speed boosts.
Data preprocessing and lines parallelization: if you generate or can modify how the files you process are generated, or if it's acceptable that you can preprocess them. First convert the line return to Unix style (\n) as this will save 1 character compared to Windows (not a big save, but it's an easy gain), and secondly and most importantly, you can potentially write lines of fixed length. If you need variable length, you can pad smaller lines if the length variability is not that big. This way, you can calculate instantly the number of lines from the total file size, which is much faster to access. Also, by having fixed length lines, not only can you generally pre-allocate memory which will speed up processing, but also you can process lines in parallel! Of course, parallelization works better with a flash/SSD disk that has much faster random access I/O than HDDs.. Often, the best solution to a problem is to preprocess it so that it better fits your end purpose.
Disks parallelization + hardware solution: if you can buy multiple hard disks (and if possible SSD flash disks), then you can even go beyond the speed of one disk by leveraging parallelization, by storing your files in a balanced way (easiest is to balance by total size) among disks, and then read in parallel from all those disks. Then, you can expect to get a multiplier boost in proportion with the number of disks you have. If buying multiple disks is not an option for you, then parallelization likely won't help (except if your disk has multiple reading headers like some professional-grade disks, but even then the disk's internal cache memory and PCB circuitry will likely be a bottleneck and prevent you from fully using all heads in parallel, plus you have to devise a specific code for this hard drive you'll use because you need to know the exact cluster mapping so that you store your files on clusters under different heads, and so that you can read them with different heads after). Indeed, it's commonly known that sequential reading is almost always faster than random reading, and parallelization on a single disk will have a performance more similar to random reading than sequential reading (you can test your hard drive speed in both aspects using CrystalDiskMark for example).

If none of those are an option, then you can only rely on micromanaging tricks to improve by a few percents the speed of your line counting function, but don't expect anything really significant. Rather, you can expect the time you'll spend tweaking will be disproportionate compared to the returns in speed improvement you'll see.

I don't think macOS uses two characters (only Windows). Classic Mac OS certainly didn't (just 13 (decimal) instead of 10 (decimal)).

Andrés Torres · Accepted Answer · 2014-03-21 06:10:30Z

2

print open('file.txt', 'r').read().count("\n") + 1

answered Mar 21, 2014 at 6:10

Andrés Torres

7335 silver badges17 bronze badges

Comments

Peter Mortensen · Accepted Answer · 2023-11-02 05:28:29Z

Using Numba

We can use Numba to JIT (Just in time) compile our function to machine code. def numbacountparallel(fname) runs 2.8x faster than def file_len(fname) from the question.

Notes:

The OS had already cached the file to memory before the benchmarks were run as I don't see much disk activity on my PC. The time would be much slower when reading the file for the first time making the time advantage of using Numba insignificant.

The JIT compilation takes extra time the first time the function is called.

This would be useful if we were doing more than just counting lines.

Cython is another option.

Conclusion

As counting lines will be I/O bound, use the def file_len(fname) from the question unless you want to do more than just count lines.

import timeit from numba import jit, prange import numpy as np from itertools import (takewhile,repeat) FILE = '../data/us_confirmed.csv' # 40.6MB, 371755 line file CR = ord('\n') # Copied from the question above. Used as a benchmark def file_len(fname): with open(fname) as f: for i, l in enumerate(f): pass return i + 1 # Copied from another answer. Used as a benchmark def rawincount(filename): f = open(filename, 'rb') bufgen = takewhile(lambda x: x, (f.read(1024*1024*10) for _ in repeat(None))) return sum( buf.count(b'\n') for buf in bufgen ) # Single thread @jit(nopython=True) def numbacountsingle_chunk(bs): c = 0 for i in range(len(bs)): if bs[i] == CR: c += 1 return c def numbacountsingle(filename): f = open(filename, "rb") total = 0 while True: chunk = f.read(1024*1024*10) lines = numbacountsingle_chunk(chunk) total += lines if not chunk: break return total # Multi thread @jit(nopython=True, parallel=True) def numbacountparallel_chunk(bs): c = 0 for i in prange(len(bs)): if bs[i] == CR: c += 1 return c def numbacountparallel(filename): f = open(filename, "rb") total = 0 while True: chunk = f.read(1024*1024*10) lines = numbacountparallel_chunk(np.frombuffer(chunk, dtype=np.uint8)) total += lines if not chunk: break return total print('numbacountparallel') print(numbacountparallel(FILE)) # This allows Numba to compile and cache the function without adding to the time. print(timeit.Timer(lambda: numbacountparallel(FILE)).timeit(number=100)) print('\nnumbacountsingle') print(numbacountsingle(FILE)) print(timeit.Timer(lambda: numbacountsingle(FILE)).timeit(number=100)) print('\nfile_len') print(file_len(FILE)) print(timeit.Timer(lambda: rawincount(FILE)).timeit(number=100)) print('\nrawincount') print(rawincount(FILE)) print(timeit.Timer(lambda: rawincount(FILE)).timeit(number=100))

Time in seconds for 100 calls to each function

numbacountparallel 371755 2.8007332000000003 numbacountsingle 371755 3.1508585999999994 file_len 371755 6.7945494 rawincount 371755 6.815438

Peter Mortensen · Accepted Answer · 2023-11-02 14:49:44Z

2

Simple methods:

Method 1

>>> f = len(open("myfile.txt").readlines()) >>> f

Output:

Method 2

>>> f = open("myfile.txt").read().count('\n') >>> f

Output:

Method 3

num_lines = len(list(open('myfile.txt')))

edited Nov 2, 2023 at 14:49

Peter Mortensen

31.4k22 gold badges110 silver badges134 bronze badges

answered Sep 17, 2018 at 10:27

Mohideen bin Mohammed

20.4k11 gold badges121 silver badges131 bronze badges

3 Comments

Maciej M Over a year ago

In this example file is not closed.

Georg Plaz Over a year ago

why did you give 3 options? how are they different? what are the benefits and drawbacks to each of these?

Peter Mortensen Over a year ago

Please add an explanation. Thanks in advance.

jciloa · Accepted Answer · 2017-12-17 14:50:35Z

1

def count_text_file_lines(path): with open(path, 'rt') as file: line_count = sum(1 for _line in file) return line_count

answered Dec 17, 2017 at 14:50

jciloa

1,1572 gold badges12 silver badges23 bronze badges

3 Comments

jciloa Over a year ago

Could you please explain what is wrong with it if you think it is wrong? It worked for me. Thanks!

cessor Over a year ago

I would be interested in why this answer was downvoted, too. It iterates over the file by lines and sums them up. I like it, it is short and to the point, what's wrong with it?

Jean-Francois T. Over a year ago

I believe some people like to have at least a bit text and not pure raw code. @jciloa Just add something to explain why you use rt as open mode and why this implementation performs better than the one in the original question.

blackbrandt · Accepted Answer · 2021-07-22 16:06:54Z

1

An alternative for big files is using xreadlines():

count = 0 for line in open(thefilepath).xreadlines( ): count += 1

For Python 3 please see: What substitutes xreadlines() in Python 3?

edited Jul 22, 2021 at 16:06

blackbrandt

2,1251 gold badge19 silver badges37 bronze badges

answered Nov 18, 2019 at 12:47

alexisdevarennes

5,6604 gold badges28 silver badges39 bronze badges

1 Comment

Jean-Francois T. Over a year ago

Is it really more efficient than the initial algorithm in the question?

Peter Mortensen · Accepted Answer · 2023-11-02 04:38:41Z

1

The result of opening a file is an iterator, which can be converted to a sequence, which has a length:

with open(filename) as f: return len(list(f))

This is more concise than your explicit loop, and avoids the enumerate.

edited Nov 2, 2023 at 4:38

Peter Mortensen

31.4k22 gold badges110 silver badges134 bronze badges

answered May 10, 2009 at 11:35

Andrew Jaffe

27.2k4 gold badges54 silver badges59 bronze badges

3 Comments

SilentGhost Over a year ago

which means that 100 Mb file will need to be read into the memory.

Andrew Jaffe Over a year ago

yep, good point, although I wonder about the speed (as opposed to memory) difference. It's probably possible to create an iterator that does this, but I think it would be equivalent to your solution.

orip Over a year ago

-1, it's not just the memory, but having to construct the list in memory.

Peter Mortensen · Accepted Answer · 2023-11-02 14:05:43Z

1

This could work:

import fileinput import sys counter = 0 for line in fileinput.input([sys.argv[1]]): counter += 1 fileinput.close() print counter

edited Nov 2, 2023 at 14:05

Peter Mortensen

31.4k22 gold badges110 silver badges134 bronze badges

answered Jul 19, 2011 at 15:55

leba-lev

2,91611 gold badges34 silver badges44 bronze badges

Collectives™ on Stack Overflow

How to get the line count of a large file cheaply in Python

44 Answers 44

7 Comments

14 Comments

4 Comments

14 Comments

7 Comments

2 Comments

2 Comments

7 Comments

4 Comments

Comments

2 Comments

5 Comments

8 Comments

Comments

4 Comments

6 Comments

2 Comments

Comments

2 Comments

2 Comments

Comments

Comments

1 Comment

Comments

Using Numba

Notes:

Conclusion

Comments

3 Comments

3 Comments

1 Comment

3 Comments

Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

44 Answers 44

7 Comments

14 Comments

4 Comments

14 Comments

7 Comments

2 Comments

2 Comments

7 Comments

4 Comments

Comments

2 Comments

5 Comments

8 Comments

Comments

4 Comments

6 Comments

2 Comments

Comments

2 Comments

2 Comments

Comments

Comments

1 Comment

Comments

Using Numba

Notes:

Conclusion

Comments

3 Comments

3 Comments

1 Comment

3 Comments

Comments

Linked

Related