909

Is there a "canonical" way of doing that? I've been using head -n | tail -1 which does the trick, but I've been wondering if there's a Bash tool that specifically extracts a line (or a range of lines) from a file.

By "canonical" I mean a program whose main function is doing that.

9
  • 13
    The "Unix way" is to chain tools that do their respective job well. So I think you already found a very suitable method. Other methods include awk and sed and I'm sure someone can come up with a Perl one-liner or so as well ;) Commented May 16, 2011 at 19:35
  • 4
    The double-command suggests that the head | tail solution is sub-optimal. Other more nearly optimal solutions have been suggested. Commented May 16, 2011 at 19:57
  • 8
    Benchmarks (for a range) at cat line X to line Y on a huge file on Unix & Linux. (cc @Marcin, in case you're still wondering after two+ years) Commented Aug 8, 2013 at 14:13
  • 15
    The head | tail solution does not work, if you query a line that does not exist in the input: it will print the last line. Commented Mar 1, 2016 at 0:24
  • 1
    head -n$NN file | tail -1 may take more time, but this NIX logic can also solve the reverse problem: tail -n$NN file | head -1 gives you the NN-th line counting from the back of the file, while using sed or awk needs some arithmetics, too. Commented Jul 15, 2022 at 7:15

24 Answers

1167

head piped into tail will be slow for a huge file. I would suggest sed like this:

sed 'NUMq;d' file 

Where NUM is the number of the line you want to print; so, for example, sed '10q;d' file to print the 10th line of file.

Explanation:

NUMq will quit immediately when the line number is NUM.

d will delete each line, 1 through NUM-1, instead of printing it; this is not executed on line NUM because the q causes the rest of the script to be skipped when quitting.

If you have NUM in a variable, you will want to use double quotes instead of single:

sed "${NUM}q;d" file 
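If you use this often, it can be handy to wrap it in a tiny function; a minimal sketch (the function name getnth is just an illustration, not part of the answer):

getnth() {
    # print line $1 of file $2; prints nothing if the file has fewer than $1 lines
    sed "${1}q;d" "$2"
}

Usage: getnth 10 file.txt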

17 Comments

For those wondering, this solution seems about 6 to 9 times faster than the sed -n 'NUMp' and sed 'NUM!d' solutions proposed below.
I think tail -n+NUM file | head -n1 is likely to be just as fast or faster. At least, it was (significantly) faster on my system when I tried it with NUM being 250000 on a file with half a million lines. YMMV, but I don't really see why it would.
No, it is not. Without q it will process the full file.
Try running sed "${NUM}q" file and you will understand better why ;d is also needed
@LeeMeador: Output will be nothing for that case.
429
sed -n '2p' < file.txt 

will print 2nd line

sed -n '2011p' < file.txt 

2011th line

sed -n '10,33p' < file.txt 

line 10 through line 33 (i.e., line 33 is included in the output)

sed -n '1p;3p' < file.txt 

1st and 3rd line

and so on...

For adding lines with sed, you can check this:

sed: insert a line in a certain position
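One refinement worth noting (my addition, not part of the original answer): when you only need a range near the top of a very large file, appending a q command makes sed stop reading once the range is done:

# print lines 10 through 33, then quit instead of scanning the rest of the file
sed -n '10,33p;33q' < file.txt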

9 Comments

Why is the '<' necessary in this case? Wouldn't I achieve the same output without it?
@RafaelBarbosa the < in this case is not necessary. It is simply my preference to use redirects, because I often use redirects like sed -n '100p' < <(some_command) - so, universal syntax :). It is NOT less efficient, because the redirection is done by the shell when forking, so... it is only a preference... (and yes, it is one character longer) :)
@jm666 Actually it's 2 characters longer, since you would normally put the '<' as well as an extra space ' ' after it, as opposed to just one space if you hadn't used the '<' :)
@rasen58 the space is a character too? :) /okay, just kidding - you're right/ :)
This is about 5 times slower than the tail / head combination when reading a file with 50M rows
137

I have a unique situation where I can benchmark the solutions proposed on this page, and so I'm writing this answer as a consolidation of the proposed solutions with included run times for each.

Set Up

I have a 3.261 gigabyte ASCII text data file with one key-value pair per row. The file contains 3,339,550,320 rows in total and defies opening in any editor I have tried, including my go-to Vim. I need to subset this file in order to investigate some of the values that I've discovered only start around row ~500,000,000.

Because the file has so many rows:

  • I need to extract only a subset of the rows to do anything useful with the data.
  • Reading through every row leading up to the values I care about is going to take a long time.
  • If the solution reads past the rows I care about and continues reading the rest of the file it will waste time reading almost 3 billion irrelevant rows and take 6x longer than necessary.

My best-case-scenario is a solution that extracts only a single line from the file without reading any of the other rows in the file, but I can't think of how I would accomplish this in Bash.

For the purposes of my sanity I'm not going to be trying to read the full 500,000,000 lines I'd need for my own problem. Instead I'll be trying to extract row 50,000,000 out of 3,339,550,320 (which means reading the full file will take 60x longer than necessary).

I will be using the time built-in to benchmark each command.

Baseline

First, let's see how the head | tail solution performs:

$ time head -50000000 myfile.ascii | tail -1
pgm_icnt = 0

real    1m15.321s

The baseline for row 50 million is 00:01:15.321; if I'd gone straight for row 500 million it'd probably be ~12.5 minutes.

cut

I'm dubious of this one, but it's worth a shot:

$ time cut -f50000000 -d$'\n' myfile.ascii
pgm_icnt = 0

real    5m12.156s

This one took 00:05:12.156 to run, which is much slower than the baseline! I'm not sure whether it read through the entire file or just up to line 50 million before stopping, but regardless this doesn't seem like a viable solution to the problem.

AWK

I only ran the solution with the exit because I wasn't going to wait for the full file to run:

$ time awk 'NR == 50000000 {print; exit}' myfile.ascii
pgm_icnt = 0

real    1m16.583s

This code ran in 00:01:16.583, which is only ~1 second slower, but still not an improvement on the baseline. At this rate if the exit command had been excluded it would have probably taken around ~76 minutes to read the entire file!

Perl

I ran the existing Perl solution as well:

$ time perl -wnl -e '$.== 50000000 && print && exit;' myfile.ascii
pgm_icnt = 0

real    1m13.146s

This code ran in 00:01:13.146, which is ~2 seconds faster than the baseline. If I'd run it on the full 500,000,000 it would probably take ~12 minutes.

sed

The top answer on the board, here's my result:

$ time sed "50000000q;d" myfile.ascii
pgm_icnt = 0

real    1m12.705s

This code ran in 00:01:12.705, which is 3 seconds faster than the baseline, and ~0.4 seconds faster than Perl. If I'd run it on the full 500,000,000 rows it would have probably taken ~12 minutes.

mapfile

I have bash 3.1 and therefore cannot test the mapfile solution.

Conclusion

It looks like, for the most part, it's difficult to improve upon the head tail solution. At best the sed solution provides a ~3% increase in efficiency.

(percentages calculated with the formula % = (runtime/baseline - 1) * 100)
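For example, for sed at row 50,000,000: (72.705/75.321 - 1) * 100 ≈ -3.47%, which matches the table below.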

Row 50,000,000

  1. 00:01:12.705 (-00:00:02.616 = -3.47%) sed
  2. 00:01:13.146 (-00:00:02.175 = -2.89%) perl
  3. 00:01:15.321 (+00:00:00.000 = +0.00%) head|tail
  4. 00:01:16.583 (+00:00:01.262 = +1.68%) awk
  5. 00:05:12.156 (+00:03:56.835 = +314.43%) cut

Row 500,000,000

  1. 00:12:07.050 (-00:00:26.160) sed
  2. 00:12:11.460 (-00:00:21.750) perl
  3. 00:12:33.210 (+00:00:00.000) head|tail
  4. 00:12:45.830 (+00:00:12.620) awk
  5. 00:52:01.560 (+00:40:31.650) cut

Row 3,338,559,320

  1. 01:20:54.599 (-00:03:05.327) sed
  2. 01:21:24.045 (-00:02:25.227) perl
  3. 01:23:49.273 (+00:00:00.000) head|tail
  4. 01:25:13.548 (+00:02:35.735) awk
  5. 05:47:23.026 (+04:24:26.246) cut

5 Comments

I wonder how long just cat'ting the entire file into /dev/null would take. (What if this was only a hard disk benchmark?)
I feel a perverse urge to bow at your ownership of a 3+ gig text file dictionary. Whatever the rationale, this so embraces textuality :)
The overhead of running two processes with head + tail will be negligible for a single file, but starts to show when you do this on many files.
How can your file have more rows than bytes?
With that input, it may well be worth writing a special-purpose utility in C (either mmap() the file and search for the (n-1)th newline, or for more portability, just loop over fgetc() until you reach the nth line).
65

With awk it is pretty fast:

awk 'NR == num_line' file 

When this is true, the default behaviour of awk is performed: {print $0}.


Alternative versions

If your file happens to be huge, you'd better exit after reading the required line. This way you save CPU time. See the time comparison at the end of the answer.

awk 'NR == num_line {print; exit}' file 

If you want to give the line number from a bash variable you can use:

awk 'NR == n' n=$num file
awk -v n=$num 'NR == n' file   # equivalent
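For a large file you can combine the variable with the early exit from above; a minimal sketch, assuming num holds the line number:

awk -v n="$num" 'NR == n {print; exit}' file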

See how much time is saved by using exit, especially if the line happens to be in the first part of the file:

# Let's create a 10M lines file
for ((i=0; i<100000; i++)); do echo "bla bla"; done > 100Klines
for ((i=0; i<100; i++)); do cat 100Klines; done > 10Mlines

$ time awk 'NR == 1234567 {print}' 10Mlines
bla bla

real    0m1.303s
user    0m1.246s
sys     0m0.042s

$ time awk 'NR == 1234567 {print; exit}' 10Mlines
bla bla

real    0m0.198s
user    0m0.178s
sys     0m0.013s

So the difference is 0.198s vs 1.303s, around 6 times faster.

11 Comments

This method is always going to be slower because awk attempts to do field splitting. The overhead of field splitting can be reduced by awk 'BEGIN{FS=RS}(NR == num_line) {print; exit}' file
The real power of awk in this method comes forth when you want to concatenate line n1 of file1, n2 of file2, n3 of file3 ... awk 'FNR==n' n=10 file1 n=30 file2 n=60 file3. With GNU awk this can be sped up using awk 'FNR==n{print;nextfile}' n=10 file1 n=30 file2 n=60 file3.
@kvantour indeed, GNU awk's nextfile is great for such things. How come FS=RS avoids field splitting?
FS=RS does not avoid field splitting, but it only parses the $0 ones and only assigns one field because there is no RS in $0
@fedorqui i'm happy for you
52

According to my tests, in terms of performance and readability my recommendation is:

tail -n+N | head -1

N is the line number that you want. For example, tail -n+7 input.txt | head -1 will print the 7th line of the file.

tail -n+N will print everything starting from line N, and head -1 will make it stop after one line.
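The same idea extends to a range (my addition, not part of the original answer): to print lines 7 through 12, start at line 7 and take 12 - 7 + 1 = 6 lines:

tail -n +7 input.txt | head -n 6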


The alternative head -N | tail -1 is perhaps slightly more readable. For example, this will print the 7th line:

head -7 input.txt | tail -1

When it comes to performance, there is not much difference for smaller sizes, but head | tail will be outperformed by tail | head (from above) when the files become huge.

The top-voted sed 'NUMq;d' is interesting to know, but I would argue that it will be understood by fewer people out of the box than the head/tail solution and it is also slower than tail/head.

In my tests, both tail/head versions outperformed sed 'NUMq;d' consistently. That is in line with the other benchmarks that were posted. It is hard to find a case where tail/head was really bad. It is also not surprising, as these are operations that you would expect to be heavily optimized in a modern Unix system.

To get an idea about the performance differences, these are the numbers I get for a huge file (9.3G):

  • tail -n+N | head -1: 3.7 sec
  • head -N | tail -1: 4.6 sec
  • sed Nq;d: 18.8 sec

Results may differ, but the performance of head | tail and tail | head is, in general, comparable for smaller inputs, and sed is always slower by a significant factor (around 5x or so).

To reproduce my benchmark, you can try the following, but be warned that it will create a 9.3G file in the current working directory:

#!/bin/bash
readonly file=tmp-input.txt
readonly size=1000000000
readonly pos=500000000
readonly retries=3

seq 1 $size > $file

echo "*** head -N | tail -1 ***"
for i in $(seq 1 $retries) ; do
    time head "-$pos" $file | tail -1
done
echo "-------------------------"
echo

echo "*** tail -n+N | head -1 ***"
echo
seq 1 $size > $file
ls -alhg $file
for i in $(seq 1 $retries) ; do
    time tail -n+$pos $file | head -1
done
echo "-------------------------"
echo

echo "*** sed Nq;d ***"
echo
seq 1 $size > $file
ls -alhg $file
for i in $(seq 1 $retries) ; do
    time sed $pos'q;d' $file
done

/bin/rm $file

Here is the output of a run on my machine (ThinkPad X1 Carbon with an SSD and 16G of memory). I assume in the final run everything will come from the cache, not from disk:

*** head -N | tail -1 ***
500000000

real    0m9,800s
user    0m7,328s
sys     0m4,081s
500000000

real    0m4,231s
user    0m5,415s
sys     0m2,789s
500000000

real    0m4,636s
user    0m5,935s
sys     0m2,684s
-------------------------

*** tail -n+N | head -1 ***

-rw-r--r-- 1 phil 9,3G Jan 19 19:49 tmp-input.txt
500000000

real    0m6,452s
user    0m3,367s
sys     0m1,498s
500000000

real    0m3,890s
user    0m2,921s
sys     0m0,952s
500000000

real    0m3,763s
user    0m3,004s
sys     0m0,760s
-------------------------

*** sed Nq;d ***

-rw-r--r-- 1 phil 9,3G Jan 19 19:50 tmp-input.txt
500000000

real    0m23,675s
user    0m21,557s
sys     0m1,523s
500000000

real    0m20,328s
user    0m18,971s
sys     0m1,308s
500000000

real    0m19,835s
user    0m18,830s
sys     0m1,004s

6 Comments

Is performance different between head | tail vs tail | head? Or does it depend on which line is being printed (beginning of file vs end of file)?
@wisbucky I have no hard figures, but one disadvantage of first using tail followed by a "head -1" is that you need to know the total length in advance. If you do not know it, you would have to count it first, which will be a loss performance-wise. Another disadvantage is that it is less intuitive to use. For instance, if you have the number 1 to 10 and you want to get the 3rd line, you would have to use "tail -8 | head -1". That is more error prone than "head -3 | tail -1".
sorry, I should have included an example to be clear. head -5 | tail -1 vs tail -n+5 | head -1. Actually, I found another answer that did a test comparison and found tail | head to be faster. stackoverflow.com/a/48189289
@wisbucky Thank you for mentioning it! I did some tests and have to agree that it was always slightly faster, independent of the position of the line from what I saw. Given that, I changed my answer and also included the benchmark in case someone wants to reproduce it.
Is there a simple way to extend this solution to multiple files at once? e.g. head -7 -q input*.txt | tail -1 to get the 7th line from several files input*.txt? Currently this will just obtain the 7th line from the first file listed in input*.txt.
37

Save two keystrokes and print the Nth line without using brackets:

sed -n Np <fileName>
    ^  ^
     \  \___ 'p' for printing
      \______ '-n' for not printing by default

For example, to print 100th line:

sed -n 100p foo.txt 

1 Comment

Note that the first line has N = 1 instead of zero.
28

Wow, all the possibilities!

Try this:

sed -n "${lineNum}p" $file 

or one of these depending upon your version of Awk:

awk -vlineNum=$lineNum 'NR == lineNum {print $0}' $file
awk -v lineNum=4 '{if (NR == lineNum) {print $0}}' $file
awk '{if (NR == lineNum) {print $0}}' lineNum=$lineNum $file

(You may have to try the nawk or gawk command).

Is there a tool whose only job is to print that particular line? Not one of the standard tools. However, sed is probably the closest and simplest to use.

Comments

24

This question being tagged Bash, here's the Bash (≥4) way of doing it: use mapfile with the -s (skip) and -n (count) options.

If you need to get the 42nd line of a file file:

mapfile -s 41 -n 1 ary < file 

At this point, you'll have an array ary whose fields contain the lines of file (including the trailing newline), where we have skipped the first 41 lines (-s 41) and stopped after reading one line (-n 1). So that's really the 42nd line. To print it out:

printf '%s' "${ary[0]}" 

If you need a range of lines, say the range 42–666 (inclusive), and say you don't want to do the math yourself, and print them on stdout:

mapfile -s $((42-1)) -n $((666-42+1)) ary < file
printf '%s' "${ary[@]}"

If you need to process these lines too, it's not really convenient to store the trailing newline. In this case use the -t option (trim):

mapfile -t -s $((42-1)) -n $((666-42+1)) ary < file
# do stuff
printf '%s\n' "${ary[@]}"

You can have a function do that for you:

print_file_range() {
    # $1-$2 is the range of file $3 to be printed to stdout
    local ary
    mapfile -s $(($1-1)) -n $(($2-$1+1)) ary < "$3"
    printf '%s' "${ary[@]}"
}
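A quick (hypothetical) invocation of the function above, printing lines 42 through 666 of file:

print_file_range 42 666 file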

No external commands, only Bash builtins!

Comments

13

You may also use sed's print and quit:

sed -n '10{p;q;}' file # print line 10 
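If the line number lives in a shell variable, the same pattern works with double quotes; a minimal sketch, assuming NUM holds the line number:

sed -n "${NUM}{p;q;}" file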

2 Comments

The -n option disables the default action to print every line, as surely you would have found out by a quick glance at the man page.
In GNU sed all the sed answers are about the same speed. Therefore (for GNU sed) this is the best sed answer, since it would save time for large files and small nth line values.
9

You can also use Perl for this:

perl -wnl -e '$.== NUM && print && exit;' some.file 
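If the line number is in a shell variable, one way to pass it in (a sketch of my own, reading it through the environment rather than hard-coding it) is:

NUM=2000000 perl -wnl -e '$. == $ENV{NUM} && print && exit;' some.file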

1 Comment

While testing on a file with 6,000,000 lines, and retrieving arbitrary line #2,000,000, this command was almost instantaneous and much faster than the sed answers.
7

As a followup to CaffeineConnoisseur's very helpful benchmarking answer... I was curious as to how fast the 'mapfile' method was compared to others (as that wasn't tested), so I tried a quick-and-dirty speed comparison myself as I do have bash 4 handy. Threw in a test of the "tail | head" method (rather than head | tail) mentioned in one of the comments on the top answer while I was at it, as folks are singing its praises. I don't have anything nearly the size of the testfile used; the best I could find on short notice was a 14M pedigree file (long lines that are whitespace-separated, just under 12000 lines).

Short version: mapfile appears faster than the cut method, but slower than everything else, so I'd call it a dud. tail | head, OTOH, looks like it could be the fastest, although with a file this size the difference is not all that substantial compared to sed.

$ time head -11000 [filename] | tail -1
[output redacted]

real    0m0.117s

$ time cut -f11000 -d$'\n' [filename]
[output redacted]

real    0m1.081s

$ time awk 'NR == 11000 {print; exit}' [filename]
[output redacted]

real    0m0.058s

$ time perl -wnl -e '$.== 11000 && print && exit;' [filename]
[output redacted]

real    0m0.085s

$ time sed "11000q;d" [filename]
[output redacted]

real    0m0.031s

$ time (mapfile -s 11000 -n 1 ary < [filename]; echo ${ary[0]})
[output redacted]

real    0m0.309s

$ time tail -n+11000 [filename] | head -n1
[output redacted]

real    0m0.028s

Hope this helps!

Comments

6

The fastest solution for big files is always tail|head, provided that the two distances:

  • from the start of the file to the starting line. Let's call it S
  • the distance from the starting line to the end of the file. Call it E

are known. Then, we could use this:

mycount="$E"; (( E > S )) && mycount="+$S"
howmany="$(( endline - startline + 1 ))"
tail -n "$mycount" | head -n "$howmany"

howmany is just the count of lines required.
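A worked example may help (my own sketch, assuming S is the starting line number and E is the number of lines from the starting line to the end of the file). For a 1000-line file:

# range 990-995: S=990, E=11, howmany=6; E < S, so count from the end
tail -n 11 file | head -n 6
# range 5-10: S=5, E=996, howmany=6; E > S, so count from the start
tail -n +5 file | head -n 6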

Some more detail in https://unix.stackexchange.com/a/216614/79743

3 Comments

Please clarify the units of S and E, (i.e. bytes, chars, or lines).
@agc The units are lines.
i don't see how this is a good solution at all because it requires distance till end of file as an input - by the time you've calculated distance to end you've already scanned the whole file at least once, so what's the point ?
6

All the above answers directly answer the question. Here's a less direct solution, but a potentially more important idea, to provoke thought.

Since line lengths are arbitrary, all the bytes of the file before the nth line need to be read. If you have a huge file or need to repeat this task many times, and this process is time-consuming, then you should seriously think about whether you should be storing your data in a different way in the first place.

The real solution is to have an index, e.g. at the start of the file, indicating the positions where the lines begin. You could use a database format, or just add a table at the start of the file. Alternatively create a separate index file to accompany your large text file.

e.g. you might create a list of character positions for newlines:

awk 'BEGIN{c=0;print(c)}{c+=length()+1;print(c+1)}' file.txt > file.idx 

then read with tail, which actually seeks directly to the appropriate point in the file!

e.g. to get line 1000:

tail -c +$(awk 'NR==1000' file.idx) file.txt | head -1 
  • This may not work with 2-byte / multibyte characters, since awk is "character-aware" but tail is not.
  • I haven't tested this against a large file.
  • Also see this answer.
  • Alternatively - split your file into smaller files!

Comments

5

tl;dr

gawk, sed, ruby fastest.

head | tail and bsd awk slowest.

Testing on MacOS 14.1


Well, here are a few methods not otherwise mentioned here, and a better benchmark to compare them.

First, let's make a 10 million line test file:

awk -v n=10000000 'BEGIN{for(i=1;i<=n;i++) print "Line " i}' >/tmp/tst/10m.txt
$ ls -lh 10m.txt
-rw-r--r--@ 1 andrew wheel 123M Nov 3 10:50 10m.txt
% wc -l 10m.txt
 10000000 10m.txt

Here is a Bash file with several different methods to access a single given line number:

#!/bin/bash
cd /tmp || exit

find_line=8999999
file="/tmp/tst/10m.txt"

case "$1" in
    sed)
        echo "sed"
        sed "${find_line}q;d" "${file}"
        ;;
    ruby)
        echo "ruby"
        ruby -e "n=${find_line}.to_i" -e '$<.each_line{|line| if $.==n then puts line; exit(0) end}' "${file}"
        ;;
    perl)
        echo "perl"
        perl -e "\$n=${find_line};" -lnE 'if ($.==$n) {say $_; exit;}' "${file}"
        ;;
    awk)
        echo "awk"
        awk -v n="${find_line}" 'FNR==n{print; exit}' "${file}"
        ;;
    gawk)
        echo "gawk"
        gawk -v n="${find_line}" 'FNR==n{print; exit}' "${file}"
        ;;
    head)
        echo "head"
        head -n "+${find_line}" "${file}" | tail -1
        ;;
esac

Now use the Perl benchmark module to print a 'pretty' comparison between them:

#!/usr/bin/perl

use strict;
use warnings;
use Benchmark qw(:all);
use Benchmark ':hireswallclock';

cmpthese(6, {
    'awk'  => sub { `/tmp/tst/routines.sh awk`  },
    'gawk' => sub { `/tmp/tst/routines.sh gawk` },
    'ruby' => sub { `/tmp/tst/routines.sh ruby` },
    'perl' => sub { `/tmp/tst/routines.sh perl` },
    'head' => sub { `/tmp/tst/routines.sh head` },
    'sed'  => sub { `/tmp/tst/routines.sh sed`  },
});

Prints:

       Rate  awk head perl ruby  sed gawk
awk  0.342/s   -- -17% -67% -71% -72% -77%
head 0.414/s  21%   -- -60% -65% -66% -72%
perl  1.02/s 199% 147%   -- -13% -17% -31%
ruby  1.18/s 244% 184%  15%   --  -4% -20%
sed   1.23/s 260% 197%  20%   5%   -- -17%
gawk  1.48/s 332% 257%  44%  25%  20%   --

Comments

4

If you have multiple lines delimited by \n (normally a newline), you can use cut as well:

echo "$data" | cut -f2 -d$'\n' 

You will get the 2nd line from the file. -f3 gives you the 3rd line.
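The same works reading straight from a file, and the field list accepts ranges; a small sketch (the file name is hypothetical, and the behavior matches GNU cut as used in the benchmarks above):

cut -f2 -d$'\n' myfile.txt     # 2nd line
cut -f2-5 -d$'\n' myfile.txt   # lines 2 through 5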

1 Comment

Can be also used to display multiple lines: cat FILE | cut -f2,5 -d$'\n' will display lines 2 and 5 of the FILE. (But it will not preserve the order.)
4

Using what others mentioned, I wanted this to be a quick & dandy function in my bash shell.

Create a file: ~/.functions

Add to it the contents:

getline() {
   line=$1
   sed $line'q;d' $2
}

Then add this to your ~/.bash_profile:

source ~/.functions

Now when you open a new bash window, you can just call the function as so:

getline 441 myfile.txt

2 Comments

There is no need to assign $1 to another variable before using it, and you are clobbering any other global line. In Bash, use local for function variables; but here, as stated already, probably just do sed "$1q;d" "$2". (Notice also the quoting of "$2".)
Correct, but it could be helpful to have self-documented code.
3

Lots of good answers already. I personally go with awk. For convenience, if you use bash, just add the below to your ~/.bash_profile. And, the next time you log in (or if you source your .bash_profile after this update), you will have a new nifty "nth" function available to pipe your files through.

Execute this or put it in your ~/.bash_profile (if using bash) and reopen bash (or execute source ~/.bash_profile)

# print just the nth piped-in line
nth () { awk -vlnum=${1} 'NR==lnum {print; exit}'; }

Then, to use it, simply pipe through it. E.g.,:

$ yes line | cat -n | nth 5
     5  line

Comments

2

After taking a look at the top answer and the benchmark, I've implemented a tiny helper function:

function nth {
    if (( ${#} < 1 || ${#} > 2 )); then
        echo -e "usage: $0 \e[4mline\e[0m [\e[4mfile\e[0m]"
        return 1
    fi
    if (( ${#} > 1 )); then
        sed "$1q;d" $2
    else
        sed "$1q;d"
    fi
}

Basically you can use it in two fashions:

nth 42 myfile.txt do_stuff | nth 42 

Comments

1

To print the nth line using sed with a variable as the line number:

a=4 sed -e $a'q:d' file 

Here the -e flag adds the script to the commands to be executed.

1 Comment

The colon is a syntax error, and should be a semicolon.
1

I've put some of the above answers into a short bash script that you can put into a file called get.sh and link to /usr/local/bin/get (or whatever other name you prefer).

#!/bin/bash
if [ "${1}" == "" ]; then
    echo "error: blank line number"; exit 1
fi
re='^[0-9]+$'
if ! [[ $1 =~ $re ]] ; then
    echo "error: line number arg not a number"; exit 1
fi
if [ "${2}" == "" ]; then
    echo "error: blank file name"; exit 1
fi
sed "${1}q;d" $2
exit 0

Ensure it's executable with

$ chmod +x get 

Link it to make it available on the PATH with

$ ln -s get.sh /usr/local/bin/get 

Comments

1

UPDATE 1: found a much faster method in awk

  • just 5.353 secs to obtain a row above 133.6 million:
rownum='133668997'; ( time ( pvE0 < ~/master_primelist_18a.txt | LC_ALL=C mawk2 -F'^$' -v \_="${rownum}" -- '!_{exit}!--_' ) ) 
in0: 5.45GiB 0:00:05 [1.02GiB/s] [1.02GiB/s] [======> ] 71%
( pvE 0.1 in0 < ~/master_primelist_18a.txt | LC_ALL=C mawk2 -F'^$' -v -- ; )
5.01s user 1.21s system 116% cpu 5.353 total

77.37219=195591955519519519559551=0x296B0FA7D668C4A64F7F=

===============================================

I'd like to contest the notion that perl is faster than awk:

So while my test file doesn't have nearly as many rows, it's also twice the size, at 7.58 GB.

I even gave perl some built-in advantages, like hard-coding the row number and also going second, thus gaining any potential speedups from the OS caching mechanism, if any:

f="$( grealpath -ePq ~/master_primelist_18a.txt )"
rownum='133668997'
fg;fg; pv < "${f}" | gwc -lcm
echo; sleep 2; echo;
( time ( pv -i 0.1 -cN in0 < "${f}" |
         LC_ALL=C mawk2 '_{exit}_=NR==+__' FS='^$' __="${rownum}"
       ) ) | mawk 'BEGIN { print } END { print _ } NR'
sleep 2
( time ( pv -i 0.1 -cN in0 < "${f}" |
         LC_ALL=C perl -wnl -e '$.== 133668997 && print && exit;'
       ) ) | mawk 'BEGIN { print } END { print _ } NR' ;

fg: no current job
fg: no current job
7.58GiB 0:00:28 [ 275MiB/s] [============>] 100%
 148,110,134  8,134,435,629  8,134,435,629   <<<< rows, chars, and bytes count as reported by gnu-wc

in0: 5.45GiB 0:00:07 [ 701MiB/s] [=> ] 71%
( pv -i 0.1 -cN in0 < "${f}" | LC_ALL=C mawk2 '_{exit}_=NR==+__' FS='^$' ; )
6.22s user 2.56s system 110% cpu 7.966 total
77.37219=195591955519519519559551=0x296B0FA7D668C4A64F7F=

in0: 5.45GiB 0:00:17 [ 328MiB/s] [=> ] 71%
( pv -i 0.1 -cN in0 < "${f}" | LC_ALL=C perl -wnl -e ; )
14.22s user 3.31s system 103% cpu 17.014 total
77.37219=195591955519519519559551=0x296B0FA7D668C4A64F7F=

I can re-run the test with perl 5.36 or even perl-6 if you think it's going to make a difference (I haven't installed either), but a gap of

7.966 secs (mawk2) vs. 17.014 secs (perl 5.34)

between the two, with the latter more than double the former, makes it clear which one is indeed meaningfully faster at fetching a single row deep inside ASCII files.

This is perl 5, version 34, subversion 0 (v5.34.0) built for darwin-thread-multi-2level
Copyright 1987-2021, Larry Wall

mawk 1.9.9.6, 21 Aug 2016, Copyright Michael D. Brennan

2 Comments

A comparison with head|tail and with gawk using the BEGIN{FS=RS} optimization would be very welcome.
@AmitNaidu : head/tail's major speed improvements over others are their ability to do file sector seeking. When coming in over the pipe, they aren't much better, especially for tail, since it has absolutely no idea when the end of stream arrives. As for gawk, setting -b flag (byte-mode) is far more useful than FS = RS in terms of speed gains. setting LC_ALL=C achieves the same effect.
1

sed '123q;d' is REALLY slow next to awk, for either BSD sed or GNU sed:

in0: 7.17GiB 0:00:15 [ 471MiB/s] [ 471MiB/s] [=====> ] 94%
( pvE 0.1 in0 < ~/ABC.txt | bsd-sed '147654389q;d' ; )
15.18s user 1.84s system 109% cpu 15.586 total

1 900.07259=
8888888888888888888888888888888888888888888888888888888888
8888888888888888888888888888888888888888888888888888888886
6666666666666666666666666666666666666666666666666666666666
6666666666666666666666666666666666666666666666666888888888
888888888888888888888888888888888888881 =
0x10D35C8F2B57FA0034318EC78AB989C94B63BDAAB1803BE1881D435E
604C5477626F1C07D9FDF39432A808C3C1E46D10CDC6FEB3D3B0C853
83F2A19D1F3F2BB2E998DC1F4598A49AC74AEE73ACA279F90F52090F
E35F2972CEFAA597F1B19774ABCA47E86893357CB4552AE38E38E38E31 =

in0: 7.17GiB 0:00:11 [ 661MiB/s] [ 661MiB/s] [=====> ] 94%
( pvE 0.1 in0 < ~/ABC.txt | gnu-sed '147654389q;d'; )
10.92s user 1.74s system 113% cpu 11.122 total

1 900.07259=88888……

in0: 7.17GiB 0:00:08 [ 858MiB/s] [ 858MiB/s] [=====> ] 94%
( pvE 0.1 in0 < "$___" | mawk -v __=$__ 'BEGIN{__=+__}__<NR{exit}__==NR'; )
7.72s user 2.21s system 115% cpu 8.575 total

1 900.07259=88888……

in0: 7.17GiB 0:00:05 [1.22GiB/s] [1.22GiB/s] [=====> ] 94%
( pvE 0.1 in0 < "$___" | mawk2 -v __=$__ 'BEGIN{__=+__}__<NR{exit}__==NR'; )
5.18s user 1.78s system 118% cpu 5.885 total

1 900.07259=88888……

Comments

1

This is not a Bash solution, but I found that the top choices didn't satisfy my needs. E.g.,

sed 'NUMq;d' file 

was fast enough, but it kept hanging for hours without reporting any progress. I suggest compiling this C++ program and using it to find the row you want. You can compile it with g++ main.cpp, where main.cpp is the file with the content below. I got a.out and executed it with ./a.out

#include <iostream>
#include <string>
#include <fstream>

using namespace std;

int main() {
    string filename;
    cout << "Enter filename ";
    cin >> filename;

    int needed_row_number;
    cout << "Enter row number ";
    cin >> needed_row_number;

    int progress_line_count;
    cout << "Enter at which every number of rows to monitor progress ";
    cin >> progress_line_count;

    char ch;
    int row_counter = 1;
    fstream fin(filename, fstream::in);

    // Read character by character, printing only the needed row and
    // reporting progress every progress_line_count rows.
    while (fin >> noskipws >> ch) {
        int ch_int = (int) ch;
        if (row_counter == needed_row_number) {
            cout << ch;
        }
        if (ch_int == 10) {  // newline: the current row is finished
            if (row_counter == needed_row_number) {
                return 0;
            }
            row_counter++;
            if (row_counter % progress_line_count == 0) {
                cout << "Progress: line " << row_counter << endl;
            }
        }
    }
    return 0;
}

Comments

1

To get an nth line (single line)

If you want something that you can later customize without having to deal with Bash, you can compile this C program and drop the binary in your custom binaries directory. This assumes that you know how to edit the .bashrc file accordingly (only if you want to edit your path variable). If you don't know, this is a helpful link.

To run this code use (assuming you named the binary "line").

line [target line] [target file] 

example

line 2 somefile.txt 

The code:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main(int argc, char* argv[]) {
    if (argc != 3) {
        fprintf(stderr, "line needs a line number and a file name\n");
        exit(0);
    }

    int lineNumber = atoi(argv[1]);
    int counter = 0;
    char *fileName = argv[2];

    FILE *fileReader = fopen(fileName, "r");
    if (fileReader == NULL) {
        fprintf(stderr, "Failed to open file\n");
        exit(0);
    }

    size_t lineSize = 0;
    char *line = NULL;

    /* skip the lines before the target line */
    while (counter < lineNumber - 1) {
        getline(&line, &lineSize, fileReader);
        counter++;
    }

    /* read and print the target line (getline keeps the trailing newline) */
    getline(&line, &lineSize, fileReader);
    printf("%s", line);

    free(line);
    fclose(fileReader);
    return 0;
}

EDIT: removed the fseek and replaced it with a while loop

Comments
