Why is reading lines from stdin much slower in C++ than Python?

Question

I wanted to compare reading lines of string input from stdin using Python and C++ and was shocked to see my C++ code run an order of magnitude slower than the equivalent Python code. Since my C++ is rusty and I'm not yet an expert Pythonista, please tell me if I'm doing something wrong or if I'm misunderstanding something.

(TLDR answer: include the statement: cin.sync_with_stdio(false) or just use fgets instead.

TLDR results: scroll all the way down to the bottom of my question and look at the table.)

C++ code:

#include <iostream> #include <time.h> using namespace std; int main() { string input_line; long line_count = 0; time_t start = time(NULL); int sec; int lps; while (cin) { getline(cin, input_line); if (!cin.eof()) line_count++; }; sec = (int) time(NULL) - start; cerr << "Read " << line_count << " lines in " << sec << " seconds."; if (sec > 0) { lps = line_count / sec; cerr << " LPS: " << lps << endl; } else cerr << endl; return 0; } // Compiled with: // g++ -O3 -o readline_test_cpp foo.cpp

Python Equivalent:

#!/usr/bin/env python import time import sys count = 0 start = time.time() for line in sys.stdin: count += 1 delta_sec = int(time.time() - start_time) if delta_sec >= 0: lines_per_sec = int(round(count/delta_sec)) print("Read {0} lines in {1} seconds. LPS: {2}".format(count, delta_sec, lines_per_sec))

Here are my results:

$ cat test_lines | ./readline_test_cpp Read 5570000 lines in 9 seconds. LPS: 618889 $ cat test_lines | ./readline_test.py Read 5570000 lines in 1 seconds. LPS: 5570000

I should note that I tried this both under Mac OS X v10.6.8 (Snow Leopard) and Linux 2.6.32 (Red Hat Linux 6.2). The former is a MacBook Pro, and the latter is a very beefy server, not that this is too pertinent.

$ for i in {1..5}; do echo "Test run $i at `date`"; echo -n "CPP:"; cat test_lines | ./readline_test_cpp ; echo -n "Python:"; cat test_lines | ./readline_test.py ; done

Test run 1 at Mon Feb 20 21:29:28 EST 2012 CPP: Read 5570001 lines in 9 seconds. LPS: 618889 Python:Read 5570000 lines in 1 seconds. LPS: 5570000 Test run 2 at Mon Feb 20 21:29:39 EST 2012 CPP: Read 5570001 lines in 9 seconds. LPS: 618889 Python:Read 5570000 lines in 1 seconds. LPS: 5570000 Test run 3 at Mon Feb 20 21:29:50 EST 2012 CPP: Read 5570001 lines in 9 seconds. LPS: 618889 Python:Read 5570000 lines in 1 seconds. LPS: 5570000 Test run 4 at Mon Feb 20 21:30:01 EST 2012 CPP: Read 5570001 lines in 9 seconds. LPS: 618889 Python:Read 5570000 lines in 1 seconds. LPS: 5570000 Test run 5 at Mon Feb 20 21:30:11 EST 2012 CPP: Read 5570001 lines in 10 seconds. LPS: 557000 Python:Read 5570000 lines in 1 seconds. LPS: 5570000

Tiny benchmark addendum and recap

For completeness, I thought I'd update the read speed for the same file on the same box with the original (synced) C++ code. Again, this is for a 100M line file on a fast disk. Here's the comparison, with several solutions/approaches:

Implementation	Lines per second
python (default)	3,571,428
cin (default/naive)	819,672
cin (no sync)	12,500,000
fgets	14,285,714
wc (not fair comparison)	54,644,808

Since nobody seems to have mentioned why you get an extra line with C++: Do not test against cin.eof()!! Put the getline call into the 'if` statement. — Xeo
– Xeo, Commented Feb 21, 2012 at 18:29
wc -l is fast because it reads the stream more than one line at a time (it might be fread(stdin)/memchr('\n') combination). Python results are in the same order of magnitude e.g., wc-l.py — jfs
– jfs, Commented Feb 27, 2012 at 0:21
If you ever need high-resolution timestamps for testing smaller sample sizes, see here for C, here for C++, and here for Python. — Gabriel Staples
– Gabriel Staples, Commented Feb 4, 2022 at 19:29
There's no need to guess about how wc -l gets results: the code reveals that in coreutils 9.0, wc has two implementations: One does buffered reads 16KiB at a time and uses simple string walking for short lines, rawmemchr() for longer lines (>= 15 chars/line average). The second is AVX2-based, and uses parallel __mm256i accumulators that it populates using _mm256_cmpeq_epi8() and _mm256_sub_epi8(), then sums with _mm256_sad_epu8() and extracts the counts from using _mm256_extract_epi16(). Yeah, it's built to be fast. — FeRD
– FeRD, Commented Dec 8, 2022 at 7:12

Gabriel Staples · Accepted Answer · 2022-02-04 18:43:49Z

tl;dr: Because of different default settings in C++ requiring more system calls.

By default, cin is synchronized with stdio, which causes it to avoid any input buffering. If you add this to the top of your main, you should see much better performance:

std::ios_base::sync_with_stdio(false);

Normally, when an input stream is buffered, instead of reading one character at a time, the stream will be read in larger chunks. This reduces the number of system calls, which are typically relatively expensive. However, since the FILE* based stdio and iostreams often have separate implementations and therefore separate buffers, this could lead to a problem if both were used together. For example:

int myvalue1; cin >> myvalue1; int myvalue2; scanf("%d",&myvalue2);

If more input was read by cin than it actually needed, then the second integer value wouldn't be available for the scanf function, which has its own independent buffer. This would lead to unexpected results.

To avoid this, by default, streams are synchronized with stdio. One common way to achieve this is to have cin read each character one at a time as needed using stdio functions. Unfortunately, this introduces a lot of overhead. For small amounts of input, this isn't a big problem, but when you are reading millions of lines, the performance penalty is significant.

Fortunately, the library designers decided that you should also be able to disable this feature to get improved performance if you knew what you were doing, so they provided the sync_with_stdio method. From this link (emphasis added):

If the synchronization is turned off, the C++ standard streams are allowed to buffer their I/O independently, which may be considerably faster in some cases.

This should be at the top. It is almost certainly correct. The answer cannot lie in replacing the read with an fscanf call, because that quite simply doesn't do as much work as Python does. Python must allocate memory for the string, possibly multiple times as the existing allocation is deemed inadequate - exactly like the C++ approach with std::string. This task is almost certainly I/O bound and there is way too much FUD going around about the cost of creating std::string objects in C++ or using <iostream> in and of itself.
Yes, adding this line immediately above my original while loop sped the code up to surpass even python. I'm about to post the results as the final edit. Thanks again!
Note that sync_with_stdio() is a static member function, and a call to this function on any stream object (e.g. cin) toggles on or off synchronization for all standard iostream objects.
Note that it may be possible to call std::cin.rdbuf().pubsetbuf(_my_large_buffer_, _some_large_number_) to reduce the number of IO calls. But pubsetbuf does nothing in the base class basic_streambuf, and although std::cin necessarily uses a derived streambuf, it is of an opaque type that is completely platform-dependent, and we can't guarantee that the library implementer added the behavior to make it useful. YMMV.

Peter Mortensen · Accepted Answer · 2014-04-02 18:27:57Z

Just out of curiosity I've taken a look at what happens under the hood, and I've used dtruss/strace on each test.

C++

./a.out < in Saw 6512403 lines in 8 seconds. Crunch speed: 814050

syscalls sudo dtruss -c ./a.out < in

CALL COUNT __mac_syscall 1 <snip> open 6 pread 8 mprotect 17 mmap 22 stat64 30 read_nocancel 25958

Python

./a.py < in Read 6512402 lines in 1 seconds. LPS: 6512402

syscalls sudo dtruss -c ./a.py < in

CALL COUNT __mac_syscall 1 <snip> open 5 pread 8 mprotect 17 mmap 21 stat64 29

7 revs, 6 users 85% · Accepted Answer · 2020-06-02 18:52:35Z

I'm a few years behind here, but:

In 'Edit 4/5/6' of the original post, you are using the construction:

$ /usr/bin/time cat big_file | program_to_benchmark

This is wrong in a couple of different ways:

You're actually timing the execution of cat, not your benchmark. The 'user' and 'sys' CPU usage displayed by time are those of cat, not your benchmarked program. Even worse, the 'real' time is also not necessarily accurate. Depending on the implementation of cat and of pipelines in your local OS, it is possible that cat writes a final giant buffer and exits long before the reader process finishes its work.
Use of cat is unnecessary and in fact counterproductive; you're adding moving parts. If you were on a sufficiently old system (i.e. with a single CPU and -- in certain generations of computers -- I/O faster than CPU) -- the mere fact that cat was running could substantially color the results. You are also subject to whatever input and output buffering and other processing cat may do. (This would likely earn you a 'Useless Use Of Cat' award if I were Randal Schwartz.

A better construction would be:

$ /usr/bin/time program_to_benchmark < big_file

In this statement it is the shell which opens big_file, passing it to your program (well, actually to time which then executes your program as a subprocess) as an already-open file descriptor. 100% of the file reading is strictly the responsibility of the program you're trying to benchmark. This gets you a real reading of its performance without spurious complications.

I will mention two possible, but actually wrong, 'fixes' which could also be considered (but I 'number' them differently as these are not things which were wrong in the original post):

A. You could 'fix' this by timing only your program:

$ cat big_file | /usr/bin/time program_to_benchmark

B. or by timing the entire pipeline:

$ /usr/bin/time sh -c 'cat big_file | program_to_benchmark'

These are wrong for the same reasons as #2: they're still using cat unnecessarily. I mention them for a few reasons:

they're more 'natural' for people who aren't entirely comfortable with the I/O redirection facilities of the POSIX shell
there may be cases where cat is needed (e.g.: the file to be read requires some sort of privilege to access, and you do not want to grant that privilege to the program to be benchmarked: sudo cat /dev/sda | /usr/bin/time my_compression_test --no-output)
in practice, on modern machines, the added cat in the pipeline is probably of no real consequence.

But I say that last thing with some hesitation. If we examine the last result in 'Edit 5' --

$ /usr/bin/time cat temp_big_file | wc -l 0.01user 1.34system 0:01.83elapsed 74%CPU ...

-- this claims that cat consumed 74% of the CPU during the test; and indeed 1.34/1.83 is approximately 74%. Perhaps a run of:

$ /usr/bin/time wc -l < temp_big_file

would have taken only the remaining .49 seconds! Probably not: cat here had to pay for the read() system calls (or equivalent) which transferred the file from 'disk' (actually buffer cache), as well as the pipe writes to deliver them to wc. The correct test would still have had to do those read() calls; only the write-to-pipe and read-from-pipe calls would have been saved, and those should be pretty cheap.

Still, I predict you would be able to measure the difference between cat file | wc -l and wc -l < file and find a noticeable (2-digit percentage) difference. Each of the slower tests will have paid a similar penalty in absolute time; which would however amount to a smaller fraction of its larger total time.

In fact I did some quick tests with a 1.5 gigabyte file of garbage, on a Linux 3.13 (Ubuntu 14.04) system, obtaining these results (these are actually 'best of 3' results; after priming the cache, of course):

$ time wc -l < /tmp/junk real 0.280s user 0.156s sys 0.124s (total cpu 0.280s) $ time cat /tmp/junk | wc -l real 0.407s user 0.157s sys 0.618s (total cpu 0.775s) $ time sh -c 'cat /tmp/junk | wc -l' real 0.411s user 0.118s sys 0.660s (total cpu 0.778s)

Notice that the two pipeline results claim to have taken more CPU time (user+sys) than real wall-clock time. This is because I'm using the shell (bash)'s built-in 'time' command, which is cognizant of the pipeline; and I'm on a multi-core machine where separate processes in a pipeline can use separate cores, accumulating CPU time faster than realtime. Using /usr/bin/time I see smaller CPU time than realtime -- showing that it can only time the single pipeline element passed to it on its command line. Also, the shell's output gives milliseconds while /usr/bin/time only gives hundredths of a second.

So at the efficiency level of wc -l, the cat makes a huge difference: 409 / 283 = 1.453 or 45.3% more realtime, and 775 / 280 = 2.768, or a whopping 177% more CPU used! On my random it-was-there-at-the-time test box.

I should add that there is at least one other significant difference between these styles of testing, and I can't say whether it is a benefit or fault; you have to decide this yourself:

When you run cat big_file | /usr/bin/time my_program, your program is receiving input from a pipe, at precisely the pace sent by cat, and in chunks no larger than written by cat.

When you run /usr/bin/time my_program < big_file, your program receives an open file descriptor to the actual file. Your program -- or in many cases the I/O libraries of the language in which it was written -- may take different actions when presented with a file descriptor referencing a regular file. It may use mmap(2) to map the input file into its address space, instead of using explicit read(2) system calls. These differences could have a far larger effect on your benchmark results than the small cost of running the cat binary.

Of course it is an interesting benchmark result if the same program performs significantly differently between the two cases. It shows that, indeed, the program or its I/O libraries are doing something interesting, like using mmap(). So in practice it might be good to run the benchmarks both ways; perhaps discounting the cat result by some small factor to "forgive" the cost of running cat itself.

Wow, that was quite insightful! While I've been aware that cat is unnecessary for feeding input to stdin of programs and that the < shell redirect is preferred, I've generally stuck to cat due to the left-to-right flow of data that the former method preserves visually when I reason about pipelines. Performance differences in such cases I've found to be negligible. But, I do appreciate your educating us, Bela.
Redirection is parsed out of the shell command line at an early stage, which allows you to do one of these, if it gives a more pleasing appearance of left-to-right flow: $ < big_file time my_program $ time < big_file my_program This should work in any POSIX shell (i.e. not `csh` and I'm not sure about exotica like `rc` : )

Ted Klein Bergman · Accepted Answer · 2022-02-05 17:25:56Z

112

I reproduced the original result on my computer using g++ on a Mac.

Adding the following statements to the C++ version just before the while loop brings it inline with the Python version:

std::ios_base::sync_with_stdio(false); char buffer[1048576]; std::cin.rdbuf()->pubsetbuf(buffer, sizeof(buffer));

sync_with_stdio improved speed to 2 seconds, and setting a larger buffer brought it down to 1 second.

edited Feb 5, 2022 at 17:25

Ted Klein Bergman

9,8075 gold badges35 silver badges61 bronze badges

answered Feb 21, 2012 at 3:33

karunski

4,1102 gold badges20 silver badges10 bronze badges

3 Comments

Matthieu M. Over a year ago

I would also avoid setting up a 1MB buffer on the stack. It can lead to stackoverflow (though I guess it's a good place to debate about it!)

SEK Over a year ago

Matthieu, Mac uses a 8MB process stack by default. Linux uses 4MB per thread default, IIRC. 1MB isn't that much of an issue for a program that transforms input with relatively shallow stack depth. More importantly, though, std::cin will trash the stack if the buffer goes out of scope.

Étienne Over a year ago

@SEK Windows default Stack size is 1MB.

Gaius · Accepted Answer · 2018-05-28 17:55:09Z

getline, stream operators, scanf, can be convenient if you don't care about file loading time or if you are loading small text files. But, if the performance is something you care about, you should really just buffer the entire file into memory (assuming it will fit).

Here's an example:

//open file in binary mode std::fstream file( filename, std::ios::in|::std::ios::binary ); if( !file ) return NULL; //read the size... file.seekg(0, std::ios::end); size_t length = (size_t)file.tellg(); file.seekg(0, std::ios::beg); //read into memory buffer, then close it. char *filebuf = new char[length+1]; file.read(filebuf, length); filebuf[length] = '\0'; //make it null-terminated file.close();

If you want, you can wrap a stream around that buffer for more convenient access like this:

std::istrstream header(&filebuf[0], length);

Also, if you are in control of the file, consider using a flat binary data format instead of text. It's more reliable to read and write because you don't have to deal with all the ambiguities of whitespace. It's also smaller and much faster to parse.

even though I'm not confident enough in my knowledge to write code for it here, I just want to mention memory mapping as another alternative to reading files into memory. see mmap for the linux-specific api.

Petter · Accepted Answer · 2014-04-23 14:56:24Z

The following code was faster for me than the other code posted here so far: (Visual Studio 2013, 64-bit, 500 MB file with line length uniformly in [0, 1000)).

const int buffer_size = 500 * 1024; // Too large/small buffer is not good. std::vector<char> buffer(buffer_size); int size; while ((size = fread(buffer.data(), sizeof(char), buffer_size, stdin)) > 0) { line_count += count_if(buffer.begin(), buffer.begin() + size, [](char ch) { return ch == '\n'; }); }

It beats all my Python attempts by more than a factor 2.

Gregg · Accepted Answer · 2012-03-12 04:04:01Z

By the way, the reason the line count for the C++ version is one greater than the count for the Python version is that the eof flag only gets set when an attempt is made to read beyond eof. So the correct loop would be:

while (cin) { getline(cin, input_line); if (!cin.eof()) line_count++; };

The really correct loop would be: while (getline(cin, input_line)) line_count++;

Waqar · Accepted Answer · 2020-07-06 06:03:13Z

In your second example (with scanf()) reason why this is still slower might be because scanf("%s") parses string and looks for any space char (space, tab, newline).

Also, yes, CPython does some caching to avoid harddisk reads.

Peter Mortensen · Accepted Answer · 2014-04-02 18:25:14Z

A first element of an answer: <iostream> is slow. Damn slow. I get a huge performance boost with scanf as in the below, but it is still two times slower than Python.

#include <iostream> #include <time.h> #include <cstdio> using namespace std; int main() { char buffer[10000]; long line_count = 0; time_t start = time(NULL); int sec; int lps; int read = 1; while(read > 0) { read = scanf("%s", buffer); line_count++; }; sec = (int) time(NULL) - start; line_count--; cerr << "Saw " << line_count << " lines in " << sec << " seconds." ; if (sec > 0) { lps = line_count / sec; cerr << " Crunch speed: " << lps << endl; } else cerr << endl; return 0; }

Waqar · Accepted Answer · 2020-07-06 06:02:28Z

14

Well, I see that in your second solution you switched from cin to scanf, which was the first suggestion I was going to make you (cin is sloooooooooooow). Now, if you switch from scanf to fgets, you would see another boost in performance: fgets is the fastest C++ function for string input.

BTW, didn't know about that sync thing, nice. But you should still try fgets.

edited Jul 6, 2020 at 6:02

Waqar

9,5396 gold badges39 silver badges47 bronze badges

answered Feb 22, 2012 at 2:33

José Ernesto Lara Rodríguez

1,3971 gold badge13 silver badges26 bronze badges

1 Comment

ShadowRanger Over a year ago

fgets has the issue of needing to pre-allocate the buffer. And if a line from the file is longer than the buffer, the only way to tell is to either linearly scan to find the length of the data read (fgets returns a pointer to the buffer, not something useful like the number of characters read), or ensure that every fgets is preceded by explicitly NUL-ing out that final character so you can check if it's been replaced and if it was replaced with a newline or something else. It may be faster, but giving up modern C++ features for uglier, less convenient, less secure C APIs is unpleasant.

Mark F. Sanderson · Accepted Answer · 2024-09-23 13:19:46Z

-2

I am way late to this game, but I thought I'd put my two cents in:

The python line:

for line in sys.stdin: count += 1

Does NOT read data from the stream. It merely counts the number lines that the stream encounters - nothing more.

Peter Mortensen was on to something with his dtruss analysis: https://stackoverflow.com/a/9657502/1043530

Note the

read_nocancel 25958

that is not done with the python. A python interpreter is just like any other program, if it does I/O it will show it via strace or dtruss.

Redo this 'super fast' python program with actual reads and I believe you'll see some changes in the dtruss output. Yeah, I know I'm late . . . but this one caught my eye because of how folks are talking disabling this and that, fgets, vs blah whatever . . . If program 'A' does not do disk I/O but program 'B' does, in many/most/all cases that explain why program 'A' is faster.

TLDR: False equivalence between programs as proven by Peter Mortensen dtruss run and his post should be marked as the answer.

-Mark

answered Sep 23, 2024 at 13:19

community wiki

Mark F. Sanderson

8 Comments

Dan Mašek Over a year ago

"Peter Mortensen was on to something" -- Peter edited a few spelling mistakes there, the author is shown in the bottom right corner, and it's 2mia.

Dan Mašek Over a year ago

"Does NOT read data from the stream." -- I'm really curious about how you count lines without doing any reading. sys.stdin is an instance of TextIOWrapper, and the iterator there calls readline, returing a str. And looking at disassembly, I don't see anything being optimized out (not that I expected it in the first place). So, please explain.

Peter Cordes Over a year ago

@DanMašek: Mainstream OSes such as macOS, Linux, and Windows don't store metadata like line-counts for text files. There's no way to get a line count without reading the data. Files are just byte arrays with a length (and potentially holes you can seek to, sparse files and lseek(2) with SEEK_DATA and SEEK_HOLE). Unlike on VMS (on Vax hardware) where you did have record-based text files so there was structure, not just a byte stream to read.

Peter Cordes Over a year ago

Anyway, on modern OSes and hardware, there isn't a way to ask the OS to count matches for a byte like '\n' and have it do it without getting the data to the CPU in user-space. (Theoretically you could have a system where you could send a command to an SSD and its firmware / controller would count matches for a byte in some range of sectors and it would send back the result, but in practice there's no such support. I forget if I read somewhere about some research or experiments or even special SSDs sold commercially with special drivers for that, but it's not mainstream.)

Peter Cordes Over a year ago

@DanMašek: Python might do a few huge read_nocancel or pread system calls, or it might mmap the data into user-space since 2mia's answer redirects stdin from a regular file, not a pipe or something. We see both mmap and pread (like read but taking an offset to read from) in the dtruss output. So even in the set of system calls we see from dtruss in that answer, it's plausible it could have read the whole file. A bit surprising the counts were that low, but plausible. This answer's needs much stronger evidence for such a surprising claim.

|

Collectives™ on Stack Overflow

Why is reading lines from stdin much slower in C++ than Python?

11 Answers 11

tl;dr: Because of different default settings in C++ requiring more system calls.

4 Comments

Comments

2 Comments

3 Comments

1 Comment

Comments

1 Comment

Comments

Comments

1 Comment

8 Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

11 Answers 11

tl;dr: Because of different default settings in C++ requiring more system calls.

4 Comments

Comments

2 Comments

3 Comments

1 Comment

Comments

1 Comment

Comments

Comments

1 Comment

8 Comments

Linked

Related