
Say I have a huge text file (>2GB) and I just want to cat the lines X to Y (e.g. 57890000 to 57890010).

From what I understand I can do this by piping head into tail or vice versa, i.e.

head -A /path/to/file | tail -B 

or alternatively

tail -C /path/to/file | head -D 

where A,B,C and D can be computed from the number of lines in the file, X and Y.

But there are two problems with this approach:

  1. You have to compute A,B,C and D.
  2. The commands could pipe to each other many more lines than I am interested in reading (e.g. if I am reading just a few lines in the middle of a huge file)

Is there a way to have the shell output just the lines I want, given only X and Y?

  • FYI, actual speed test comparison of 6 methods added to my answer. Commented Sep 8, 2012 at 2:41
  • See also What's the best way to take a segment out of a text file? Commented Jul 17, 2015 at 23:02
  • You can consider using the split command too! Commented Nov 14, 2019 at 16:56

8 Answers


I suggest the sed solution, but for the sake of completeness,

awk 'NR >= 57890000 && NR <= 57890010' /path/to/file 

To stop reading the file after the last wanted line:

awk 'NR < 57890000 { next } { print } NR == 57890010 { exit }' /path/to/file 
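A small-scale sanity check of the early-exit variant (demo.in is just a scratch file for illustration, not from the question):

```shell
# build a small stand-in for the huge file
seq 100 > demo.in
# print lines 40 through 45, then stop reading the rest of the file
awk 'NR < 40 { next } { print } NR == 45 { exit }' demo.in
# prints 40 41 42 43 44 45, one per line
rm -f demo.in
```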

Speed test (here on macOS, YMMV on other systems):

  • 100,000,000-line file generated by seq 100000000 > test.in
  • Reading lines 50,000,000-50,000,010
  • Tests in no particular order
  • real time as reported by bash's builtin time
  4.373  4.418  4.395  tail -n+50000000 test.in | head -n10
  5.210  5.179  6.181  sed -n '50000000,50000010p;50000010q' test.in
  5.525  5.475  5.488  head -n50000010 test.in | tail -n10
  8.497  8.352  8.438  sed -n '50000000,50000010p' test.in
 22.826 23.154 23.195  tail -n50000001 test.in | head -n10
 25.694 25.908 27.638  ed -s test.in <<<"50000000,50000010p"
 31.348 28.140 30.574  awk 'NR<50000000{next}1;NR==50000010{exit}' test.in
 51.359 50.919 51.127  awk 'NR >= 50000000 && NR <= 50000010' test.in

These are by no means precise benchmarks, but the difference is clear and repeatable enough* to give a good sense of the relative speed of each of these commands.

*: Except between the first two, sed -n p;q and head|tail, which seem to be essentially the same.

  • Out of curiosity: how have you flushed the disk cache between tests? Commented Sep 8, 2012 at 8:08
  • What about tail -n +50000000 test.in | head -n10, which unlike tail -n-50000000 test.in | head -n10 would give the correct result? Commented Sep 8, 2012 at 10:55
  • Ok, I went and did some benchmarks. tail|head is way faster than sed; the difference is a lot more than I expected. Commented Sep 8, 2012 at 11:30
  • @Gilles you're right, my bad. tail+|head is faster by 10-15% than sed; I've added that benchmark. Commented Sep 8, 2012 at 13:50
  • I realize that the question asks for lines, but if you use -c to skip characters, tail+|head is instantaneous. Of course, you can't say "50000000" and may have to manually search out the start of the section you're looking for. Commented Apr 25, 2014 at 16:26

If you want lines X to Y inclusive (starting the numbering at 1), use

tail -n "+$X" /path/to/file | head -n "$((Y-X+1))" 

tail will read and discard the first X-1 lines (there's no way around that), then read and print the following lines. head will read and print the requested number of lines, then exit. When head exits, tail receives a SIGPIPE signal and dies, so it won't have read more than a buffer size's worth (typically a few kilobytes) of lines from the input file.
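The recipe above generalizes to a tiny function; print_range is a hypothetical name for this sketch, not a standard tool:

```shell
# sketch: print lines X..Y (inclusive, 1-based) of a file;
# 'print_range' is a made-up helper name
print_range() {
    local X=$1 Y=$2 file=$3
    tail -n "+$X" -- "$file" | head -n "$((Y - X + 1))"
}

# usage (numbers from the question): print_range 57890000 57890010 /path/to/file
```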

Alternatively, as gorkypl suggested, use sed:

sed -n -e "$X,$Y p" -e "$Y q" /path/to/file 

The sed solution is significantly slower though (at least for GNU utilities and BusyBox utilities; sed might be more competitive if you extract a large part of the file on an OS where piping is slow and sed is fast). Here are quick benchmarks under Linux; the data was generated by seq 100000000 >/tmp/a, the environment is Linux/amd64, /tmp is tmpfs and the machine is otherwise idle and not swapping.

 real   user   sys   command
  0.47   0.32  0.12  </tmp/a tail -n +50000001 | head -n 10 #GNU
  0.86   0.64  0.21  </tmp/a tail -n +50000001 | head -n 10 #BusyBox
  3.57   3.41  0.14  sed -n -e '50000000,50000010 p' -e '50000010q' /tmp/a #GNU
 11.91  11.68  0.14  sed -n -e '50000000,50000010 p' -e '50000010q' /tmp/a #BusyBox
  1.04   0.60  0.46  </tmp/a tail -n +50000001 | head -n 40000001 >/dev/null #GNU
  7.12   6.58  0.55  </tmp/a tail -n +50000001 | head -n 40000001 >/dev/null #BusyBox
  9.95   9.54  0.28  sed -n -e '50000000,90000000 p' -e '90000000q' /tmp/a >/dev/null #GNU
 23.76  23.13  0.31  sed -n -e '50000000,90000000 p' -e '90000000q' /tmp/a >/dev/null #BusyBox

If you know the byte range you want to work with, you can extract it faster by skipping directly to the start position. But for lines, you have to read from the beginning and count newlines. To extract blocks from x inclusive to y exclusive starting at 0, with a block size of b:

dd bs="$b" skip="$x" count="$((y-x))" </path/to/file 
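If a byte range is acceptable, tail -c / head -c achieve the same skip without picking a block size. A small sketch (bytes.in is a scratch file for the demo):

```shell
# sketch: byte-range extraction with tail -c / head -c;
# tail -c +N starts at byte N, counting from 1
printf 'abcdefghij' > bytes.in   # scratch file for the demo
start=2 len=3                    # 0-based offset 2, take 3 bytes
tail -c "+$((start + 1))" bytes.in | head -c "$len"
# → cde
rm -f bytes.in
```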
  • Are you sure that there is no caching in between? The differences between tail|head and sed seem too big to me. Commented Sep 8, 2012 at 12:03
  • @gorkypl I took several measurements and the times were comparable. As I wrote, this is all happening in RAM (everything is in the cache). Commented Sep 8, 2012 at 12:06
  • @Gilles "tail will read and discard the first X-1 lines" seems to be avoided when the number of lines is given from the end; in that case, tail seems to read backwards from the end, judging by the execution times. Please read: http://unix.stackexchange.com/a/216614/79743. Commented Jul 17, 2015 at 4:52
  • @BinaryZebra Yes, if the input is a regular file, some implementations of tail (including GNU tail) have heuristics to read from the end. That improves the tail | head solution compared to other methods. Commented Jul 17, 2015 at 7:08

The head | tail approach is one of the best and most "idiomatic" ways to do this:

X=57890000 Y=57890010
< infile.txt head -n "$Y" | tail -n +"$X" 

As pointed out by Gilles in the comments, a faster way is

< infile.txt tail -n +"$X" | head -n "$((Y - X + 1))" 

This is faster because the first X - 1 lines don't need to go through the pipe, unlike with the head | tail approach.

Your question as phrased is a bit misleading and probably explains some of your unfounded misgivings towards this approach.

  • You say you have to calculate A, B, C, D but as you can see, the line count of the file is not needed and at most 1 calculation is necessary, which the shell can do for you anyways.

  • You worry that piping will read more lines than necessary. In fact this is not true: tail | head is about as efficient as you can get in terms of file I/O. First, consider the minimum amount of work necessary: to find the Xth line in a file, the only general way is to read every byte and stop when you have counted X newlines, as there is no way to divine the file offset of the Xth line. Once you reach the Xth line, you have to read all the lines up to the Yth in order to print them. Thus no approach can get away with reading fewer than Y lines. Now, head -n "$Y" reads no more than Y lines (rounded to the nearest buffer unit, but buffers, used correctly, improve performance, so that overhead isn't worth worrying about). In addition, tail will not read any more than head, so head | tail reads the fewest lines possible (again, plus some negligible buffering that we are ignoring). The only efficiency advantage of a single-tool, pipeless approach is fewer processes (and thus less overhead).
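The early-exit behaviour this argument relies on is easy to observe: the producer dies of SIGPIPE as soon as the consumer exits, so the following returns immediately instead of generating a billion lines:

```shell
# head exits after 3 lines; seq is then killed by SIGPIPE,
# so this returns instantly rather than printing a billion numbers
seq 1000000000 | head -n 3
# → 1 2 3, one per line
```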

  • Never seen the redirection go first on the line before. Cool, it makes the pipe flow clearer. Commented Apr 27, 2016 at 3:42

The most orthodox way (but not the fastest, as noted by Gilles above) would be to use sed.

In your case:

X=57890000 Y=57890010
sed -n -e "$X,$Y p" -e "$Y q" filename 

The -n option suppresses sed's default output, so only the explicitly printed lines reach stdout.

The p after the line range prints the lines in that range. The q in the second expression saves time by quitting instead of reading the remainder of the file.
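The same sed invocation can be wrapped with the range parameterized; sed_range is a hypothetical helper name for this sketch:

```shell
# sketch: sed-based range extraction; 'sed_range' is a made-up name
sed_range() {
    local X=$1 Y=$2 file=$3
    # print lines X..Y, then quit at line Y instead of reading to EOF
    sed -n -e "$X,$Y p" -e "$Y q" "$file"
}

# usage: sed_range 57890000 57890010 filename
```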

  • I expected sed and tail | head to be about on par, but it turns out that tail | head is significantly faster (see my answer). Commented Sep 8, 2012 at 11:31
  • I dunno; from what I've read, tail/head are considered more "orthodox", since trimming either end of a file is precisely what they're made for. In those materials, sed only seems to enter the picture when substitutions are required, and is quickly pushed out of the picture when anything more complex starts to happen, since its syntax for complex tasks is so much worse than AWK's, which then takes over. Commented Oct 8, 2016 at 11:40

If we know the range to select, from the first line lStart to the last line lEnd, we can calculate:

lCount="$((lEnd-lStart+1))" 

If we know the total number of lines in the file, lAll, we can also calculate the distance from lStart to the end of the file:

toEnd="$((lAll-lStart+1))" 

Then we will know both:

"how far from the start" ($lStart) and "how far from the end of the file" ($toEnd). 

Choosing the smaller of the two as tailnumber:

tailnumber="$toEnd"; (( toEnd > lStart )) && tailnumber="+$lStart" 

allows us to use the consistently fastest form of the command:

tail -n"${tailnumber}" "${thefile}" | head -n"${lCount}" 

Please note the additional plus ("+") sign when $lStart is selected.

The only caveat is that we need the total count of lines, and that may take some additional time to find, as is usual with:

lAll="$(wc -l < "$thefile")" 
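Put together, the pick-the-nearer-end idea might look like this sketch (smart_tail is a made-up name; assumes bash):

```shell
# sketch: choose tail's counting direction by whichever end of the
# file is closer to the wanted range; 'smart_tail' is a made-up name
smart_tail() {
    local file=$1 lStart=$2 lEnd=$3
    local lCount=$((lEnd - lStart + 1))
    local lAll; lAll=$(wc -l < "$file")
    local toEnd=$((lAll - lStart + 1))
    if (( toEnd > lStart )); then
        # closer to the start: count forward from line 1
        tail -n "+$lStart" "$file" | head -n "$lCount"
    else
        # closer to the end: count backward from the last line
        tail -n "$toEnd" "$file" | head -n "$lCount"
    fi
}

# usage: smart_tail test.in 50000000 50000010
```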

Some times measured are:

lStart |500| lEnd |500| lCount |11|
  real   user   sys    frac
 0.002  0.000  0.000    0.00 | command == tail -n"+500" test.in | head -n1
 0.002  0.000  0.000    0.00 | command == tail -n+500 test.in | head -n1
 3.230  2.520  0.700   99.68 | command == tail -n99999501 test.in | head -n1
 0.001  0.000  0.000    0.00 | command == head -n500 test.in | tail -n1
 0.001  0.000  0.000    0.00 | command == sed -n -e "500,500p;500q" test.in
 0.002  0.000  0.000    0.00 | command == awk 'NR<'500'{next}1;NR=='500'{exit}' test.in

lStart |50000000| lEnd |50000010| lCount |11|
  real   user   sys    frac
 0.977  0.644  0.328   99.50 | command == tail -n"+50000000" test.in | head -n11
 1.069  0.756  0.308   99.58 | command == tail -n+50000000 test.in | head -n11
 1.823  1.512  0.308   99.85 | command == tail -n50000001 test.in | head -n11
 1.950  2.396  1.284  188.77 | command == head -n50000010 test.in | tail -n11
 5.477  5.116  0.348   99.76 | command == sed -n -e "50000000,50000010p;50000010q" test.in
10.124  9.669  0.448   99.92 | command == awk 'NR<'50000000'{next}1;NR=='50000010'{exit}' test.in

lStart |99999000| lEnd |99999010| lCount |11|
  real   user   sys    frac
 0.001  0.000  0.000    0.00 | command == tail -n"1001" test.in | head -n11
 1.960  1.292  0.660   99.61 | command == tail -n+99999000 test.in | head -n11
 0.001  0.000  0.000    0.00 | command == tail -n1001 test.in | head -n11
 4.043  4.704  2.704  183.25 | command == head -n99999010 test.in | tail -n11
10.346  9.641  0.692   99.88 | command == sed -n -e "99999000,99999010p;99999010q" test.in
21.653 20.873  0.744   99.83 | command == awk 'NR<'99999000'{next}1;NR=='99999010'{exit}' test.in

Note that times change drastically depending on whether the selected lines are near the start or near the end. A command which appears to work nicely at one end of the file may be extremely slow at the other end.


I do this often enough that I wrote this script. I don't need to find the line numbers; the script does it all.

#!/bin/bash
# $1: start time
# $2: end time
# $3: log file to read
# $4: output file
# i.e. log_slice.sh 18:33 19:40 /var/log/my.log /var/log/myslice.log

if [[ $# != 4 ]] ; then
    echo 'usage: log_slice.sh <start time> <end time> <log file> <output file>'
    exit 1
fi

if [ ! -f "$3" ] ; then
    echo "'$3' doesn't seem to exist."
    echo 'exiting.'
    exit 1
fi

# line number of the first occurrence of the start time
sline=$(grep -n " ${1}" "$3" | head -1 | cut -d: -f1)
# line number of the first occurrence of the end time
eline=$(grep -n " ${2}" "$3" | head -1 | cut -d: -f1)

linediff="$((eline-sline))"
tail -n+"${sline}" "$3" | head -n"$linediff" > "$4"
  • You're answering a question that wasn't asked. Your answer is 10% tail|head, which has been discussed extensively in the question and the other answers, and 90% determining the line numbers where specified strings/patterns appear, which wasn't part of the question. P.S. you should always quote your shell parameters and variables; e.g., "$3" and "$4". Commented Oct 8, 2014 at 22:51

Even the fastest tail + head combo is only about 1.3% faster than awk:

__='147654389' # extracting rows 147,654,389 - 147,654,399 

( time ( pvE0 < "$_____" | mawk2 -v __=$__ 'BEGIN {_=(__=+__)+10}NR<__{next}_<NR{exit}_' ))

 in0: 7.17GiB 0:00:05 [1.33GiB/s] [1.33GiB/s] [=====>    ] 94%
( pvE 0.1 in0 < "$_____" | mawk2 -v __=$__ ; )
 4.65s user 1.79s system 118% cpu 5.424 total
 02de381a4ea9c6d101c1935ae75cf565  stdin

 in0: 7.17GiB 0:00:05 [1.34GiB/s] [1.34GiB/s] [=====>    ] 94%
( pvE 0.1 in0 < "$_____" | gtail -n"+$__" | ghead -n11; )
 2.50s user 3.96s system 120% cpu 5.355 total
 02de381a4ea9c6d101c1935ae75cf565  stdin

Ironically, GNU tail is actually slower when using its own I/O mechanism to read the file than when reading it through the pipe.


If you cat the data you will want to use tail first and then head.

cat file.name | tail -n +"3" | head -n -"1" 
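With GNU head, a negative count drops lines from the end, so the pipeline above prints from line 3 through the second-to-last line (the cat is unnecessary). A small demo:

```shell
seq 6 > f.txt
# from line 3 to the last-but-one line; head -n -1 needs GNU head
tail -n +3 f.txt | head -n -1
# → 3 4 5, one per line
rm -f f.txt
```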
  • Pretty much every other answer has at least mentioned, if not recommended, tail piped into head. Adding a cat to the pipeline just adds noise and overhead, but doesn't add value. You might as well say "If you're wearing gloves when you type the command, you will want to use tail first and then head." It doesn't matter. Commented Jul 28, 2020 at 5:34
  • This is the solution that worked for me. Commented Jul 29, 2020 at 20:59
  • You say "This is the solution that worked for me." [emphasis added] I say it is a solution that works. Did you try tail -n +3 file.name | head -n -1? What happened? Commented Jul 30, 2020 at 0:06
