There are a lot of answers already, but unfortunately most of them are just tiny savings on a problem that is barely optimizable at the code level...
I worked on several projects where line counting was a core function of the software, and working as fast as possible with a huge number of files was of paramount importance.
The main bottleneck of line counting is I/O access: you have to read through the whole file to detect the line-return characters, and there is simply no way around that. The second potential bottleneck is memory management: the more you load at once, the faster you can process, but this bottleneck is negligible compared to the first.
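For reference, here is a minimal chunked counter in Python that illustrates the point (the 1 MiB buffer size is an arbitrary choice): the work is essentially one sequential pass over the raw bytes looking for newlines, so the disk, not the code, sets the pace.

```python
def count_lines(path, buf_size=1024 * 1024):
    """Count newline bytes by reading large raw chunks, so that I/O
    dominates instead of Python-level per-line overhead.
    Note: a final line without a trailing '\n' is not counted."""
    count = 0
    with open(path, "rb") as f:
        chunk = f.read(buf_size)
        while chunk:
            count += chunk.count(b"\n")
            chunk = f.read(buf_size)
    return count
```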
Hence, there are three major ways to reduce the processing time of a line-counting function, apart from tiny optimizations such as disabling garbage collection and other micro-managing tricks:
Hardware solution: the major and most obvious way is non-programmatic: buy a very fast SSD/flash hard drive. By far, this is how you can get the biggest speed boosts.
Data preparation solution (preprocessing and lines parallelization): if you generate the files you process, can modify how they are generated, or can acceptably preprocess them, then first convert the line endings to Unix style (\n), since this saves one character per line compared to the Windows style \r\n (not a big saving, but it's an easy gain); secondly, and most importantly, write lines of fixed length if you can. If you need variable length, you can pad shorter lines, as long as the length variability is not too big. This way you can calculate the number of lines instantly from the total file size, which is much faster to access (this calculation is sketched below). Also, with fixed-length lines you can generally pre-allocate memory, which speeds up processing, and you can process lines in parallel. Of course, parallelization works better with an SSD/flash disk, which has much faster random-access I/O than an HDD. Often, the best solution to a problem is to preprocess it so that it better fits your end purpose.
Parallelization (multiple disks + hardware solution): if you can buy multiple hard disks (ideally SSDs), you can go beyond the speed of a single disk by storing your files in a balanced way across the disks (easiest is to balance by total size) and reading from all of them in parallel (a parallel-reading sketch is also given below). You can then expect a speed boost roughly proportional to the number of disks. If buying multiple disks is not an option, parallelization likely won't help (except if your disk has multiple read heads, as some professional-grade disks do; but even then the disk's internal cache memory and PCB circuitry will likely be a bottleneck and prevent you from using all heads fully in parallel, and you would also have to write code specific to that drive, because you need to know the exact cluster mapping to store your files on clusters under different heads and read them back with different heads). Indeed, sequential reading is almost always faster than random reading, and parallelizing on a single disk behaves more like random reading than sequential reading (you can test your hard drive in both modes with CrystalDiskMark, for example).
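Here is a sketch of the fixed-length idea; LINE_LEN is a made-up record size, assuming every line has been padded to exactly that many bytes, including the trailing newline:

```python
import os

LINE_LEN = 128  # hypothetical fixed record length in bytes, including the trailing '\n'

def count_fixed_length_lines(path, line_len=LINE_LEN):
    """With every line padded to exactly line_len bytes, the count is
    just file size // line length: no file contents are read at all."""
    size = os.path.getsize(path)
    if size % line_len:
        raise ValueError("file size is not a multiple of the fixed line length")
    return size // line_len
```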
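And a sketch of counting several files in parallel, one worker process per file. The paths are placeholders; the speed-up only materializes when the files really sit on separate physical disks, or on an SSD with good random I/O:

```python
from concurrent.futures import ProcessPoolExecutor

def _count_one(path, buf_size=1024 * 1024):
    # Same chunked newline counting as the earlier sketch.
    count = 0
    with open(path, "rb") as f:
        chunk = f.read(buf_size)
        while chunk:
            count += chunk.count(b"\n")
            chunk = f.read(buf_size)
    return count

def count_lines_parallel(paths, max_workers=None):
    """One worker process per file; worthwhile mainly when the files
    sit on different physical disks or on a fast SSD."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(paths, pool.map(_count_one, paths)))

if __name__ == "__main__":
    # Hypothetical paths, ideally spread over different physical disks.
    print(count_lines_parallel(["/disk1/a.log", "/disk2/b.log"]))
```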
If none of those are an option, then you can only rely on micromanaging tricks to improve the speed of your line-counting function by a few percent, but don't expect anything really significant. Rather, expect the time you spend tweaking to be disproportionate compared to the speed improvement you'll see.
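For completeness, here is what one such micro-tweak looks like, the garbage-collection trick mentioned earlier, as a sketch (the wrapped call in the comment is hypothetical and refers to the earlier counting sketch):

```python
import gc
from contextlib import contextmanager

@contextmanager
def gc_paused():
    """Temporarily disable the garbage collector around a hot loop;
    for line counting this is usually worth a few percent at best."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()

# Hypothetical usage, wrapping one of the counting sketches above:
# with gc_paused():
#     total = count_lines("big_file.txt")
```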