I want to shuffle the lines of a text file randomly and create a new file. The file may have several thousand lines.
How can I do that with cat, awk, cut, etc?
You can use shuf, at least on some systems (it doesn't appear to be in POSIX).
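For example (the file names are placeholders):

shuf myfile.txt > myfile_shuffled.txt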
As jleedev pointed out, sort -R might also be an option - on some systems at least; well, you get the picture. It has been pointed out that sort -R doesn't really shuffle but instead sorts items according to their hash value.
[Editor's note: sort -R almost shuffles, except that duplicate lines / sort keys always end up next to each other. In other words: only with unique input lines / keys is it a true shuffle. While it's true that the output order is determined by hash values, the randomness comes from choosing a random hash function; see the manual.]
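A quick way to see the difference (a minimal sketch; requires GNU sort and shuf):

printf 'a\nb\na\nb\n' | sort -R   # the two a's (and the two b's) always end up adjacent
printf 'a\nb\na\nb\n' | shuf      # any ordering can appear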
shuf and sort -R differ slightly: sort -R randomly orders the elements according to their hash, which means sort -R will put repeated elements together, while shuf shuffles all the elements randomly.

On macOS: brew install coreutils, then use gshuf.

sort -R and shuf should be seen as completely different. sort -R is deterministic. If you call it twice at different times on the same input you will get the same answer. shuf, on the other hand, produces randomized output, so it will most likely give different output on the same input.

A Perl one-liner would be a simple version of Maxim's solution:
perl -MList::Util=shuffle -e 'print shuffle(<STDIN>);' < myfile

Yes, that trailing \n must be present on the last input line - and it typically is - otherwise you'll get the problem you describe.

Replace <STDIN> with <>, so the solution works with input from files too.

This answer complements the many great existing answers in the following ways:
The existing answers are packaged into flexible shell functions:

- The functions take stdin input, but alternatively also filename arguments.
- The functions handle SIGPIPE in the usual way (quiet termination with exit code 141), as opposed to breaking noisily. This is important when piping the function output to a pipe that is closed early, such as when piping to head.

A performance comparison is made (see below).
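For instance, with any of the shuf functions below defined, both invocation styles work, and piping into head terminates quietly (file names are placeholders):

shuf myfile.txt            # filename argument
shuf < myfile.txt          # stdin
shuf myfile.txt | head -5  # pipe closed early: quiet termination, exit code 141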
A POSIX-compliant function based on awk, sort, and cut, adapted from the OP's own answer:

shuf() { awk 'BEGIN {srand(); OFMT="%.17f"} {print rand(), $0}' "$@" | sort -k1,1n | cut -d ' ' -f2-; }

A Perl-based function:

shuf() { perl -MList::Util=shuffle -e 'print shuffle(<>);' "$@"; }

A Python-based function:

shuf() { python -c '
import sys, random, fileinput
from signal import signal, SIGPIPE, SIG_DFL
signal(SIGPIPE, SIG_DFL)
lines = [line for line in fileinput.input()]
random.shuffle(lines)
sys.stdout.write("".join(lines))
' "$@"; }

See the bottom section for a Windows version of this function.
A Ruby-based function:

shuf() { ruby -e 'Signal.trap("SIGPIPE", "SYSTEM_DEFAULT"); puts ARGF.readlines.shuffle' "$@"; }

Performance comparison:
Note: These numbers were obtained on a late-2012 iMac with 3.2 GHz Intel Core i5 and a Fusion Drive, running OSX 10.10.3. While timings will vary with the OS, machine specs, and the awk implementation used (e.g., the BSD awk version used on OSX is usually slower than GNU awk and especially mawk), this should provide a general sense of relative performance.
The input file is a 1-million-line file produced with seq -f 'line %.0f' 1000000.
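A sketch of how such timings can be reproduced, assuming one of the shuf functions above is defined in the current shell:

seq -f 'line %.0f' 1000000 > input.txt   # generate the 1-million-line test file
time shuf input.txt > /dev/null          # measure, discarding the output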
Times are listed in ascending order (fastest first):
shuf: 0.090s
Ruby: 0.289s
Perl: 0.589s
Python: 1.342s with Python 2.7.6; 2.407s(!) with Python 3.4.2
awk + sort + cut: 3.003s with BSD awk; 2.388s with GNU awk (4.1.1); 1.811s with mawk (1.3.4)

For further comparison, the solutions not packaged as functions above:
sort -R (not a true shuffle if there are duplicate input lines): 10.661s - allocating more memory doesn't seem to make a difference
Scala: 24.229s
bash loops + sort: 32.593s

Conclusions:
- Use shuf, if you can - it's the fastest by far.
- Use the awk + sort + cut combo as a last resort; which awk implementation you use matters (mawk is faster than GNU awk, BSD awk is slowest).
- Stay away from sort -R, bash loops, and Scala.

Windows versions of the Python solution (the Python code is identical, except for variations in quoting and the removal of the signal-related statements, which aren't supported on Windows):
PowerShell (you may have to adjust $OutputEncoding if you want to send non-ASCII characters via the pipeline):

# Call as `shuf someFile.txt` or `Get-Content someFile.txt | shuf`
function shuf {
  $Input | python -c @'
import sys, random, fileinput
lines = [line for line in fileinput.input()]
random.shuffle(lines)
sys.stdout.write(''.join(lines))
'@ $args
}

Note that PowerShell can natively shuffle via its Get-Random cmdlet (though performance may be a problem); e.g.:
Get-Content someFile.txt | Get-Random -Count ([int]::MaxValue)
cmd.exe (a batch file): save it to a file, shuf.cmd for instance:
@echo off
python -c "import sys, random, fileinput; lines=[line for line in fileinput.input()]; random.shuffle(lines); sys.stdout.write(''.join(lines))" %*

A stdin-only alternative suggested in the comments:

python -c "import sys, random; lines = [x for x in sys.stdin.read().splitlines()] ; random.shuffle(lines); print(\"\n\".join([line for line in lines]));"

Simply removing from signal import signal, SIGPIPE, SIG_DFL; signal(SIGPIPE, SIG_DFL); from the original solution is sufficient, and retains the flexibility of also being able to pass filename arguments - no need to change anything else (except for quoting) - please see the new section I've added at the bottom.

I use a tiny perl script, which I call "unsort":
#!/usr/bin/perl
use List::Util 'shuffle';
@list = <STDIN>;
print shuffle(@list);

I've also got a NULL-delimited version, called "unsort0" ... handy for use with find -print0 and so on.
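A minimal sketch of what such a NULL-delimited variant can look like; this is a guess, not the author's actual unsort0 (perl -0 sets the input record separator to NUL, and the file pattern is a placeholder):

find . -name '*.mp3' -print0 | perl -0 -MList::Util=shuffle -e 'print shuffle(<STDIN>);' | xargs -0 -n 1 echo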
PS: Voted up 'shuf' too - I had no idea that was there in coreutils these days ... the above may still be useful if your system doesn't have 'shuf'.
Replace <STDIN> with <> in order to make the solution work with input from files too.

Here is a first try that's easy on the coder but hard on the CPU: prepend a random number to each line, sort them, and then strip the random number from each line. In effect, the lines are sorted randomly:
cat myfile | awk 'BEGIN{srand();}{print rand()"\t"$0}' | sort -k1 -n | cut -f2- > myfile.shuffled

While developing this I test with head myfile | awk .... Then I just change it to cat; that's why it was left there.

You don't strictly need -k1 -n for sort, since the output of awk's rand() is a decimal between 0 and 1 and all that matters is that it gets reordered somehow. -k1 might help speed it up by ignoring the rest of the line, though the output of rand() should be unique enough to short-circuit the comparison.

It's easier to remember cat filename | (or < filename |) than to remember how each single program takes file input (or not).

Here's an awk script:
awk 'BEGIN{ srand() }
     { lines[++d] = $0 }
     END {
         while (1) {
             if (e == d) { break }
             RANDOM = int(1 + rand() * d)
             if (RANDOM in lines) {
                 print lines[RANDOM]
                 delete lines[RANDOM]
                 ++e
             }
         }
     }' file
Output:

$ cat file
1
2
3
4
5
6
7
8
9
10
$ ./shell.sh
7
5
10
9
6
8
2
1
3
4

Note that this is slower than combining awk with sort and cut. For no more than several thousand lines it doesn't make much of a difference, but with higher line counts it matters (the threshold depends on the awk implementation used). A slight simplification would be to replace the lines while (1) { and if (e == d) { break } with while (e < d).
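For reference, the simplified variant just mentioned (only the loop condition changes; otherwise identical to the script above):

awk 'BEGIN{ srand() }
     { lines[++d] = $0 }
     END {
         while (e < d) {
             RANDOM = int(1 + rand() * d)
             if (RANDOM in lines) {
                 print lines[RANDOM]
                 delete lines[RANDOM]
                 ++e
             }
         }
     }' file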
python -c "import random, sys; lines = open(sys.argv[1]).readlines(); random.shuffle(lines); print ''.join(lines)," myFile And for printing just a single random line:
python -c "import random, sys; print random.choice(open(sys.argv[1]).readlines())," myFile But see this post for the drawbacks of python's random.shuffle(). It won't work well with many (more than 2080) elements.
If that limitation matters, use the OS's randomness source, which /dev/urandom does provide. To utilize it from Python: random.SystemRandom().shuffle(L).

Also note that .readlines() returns the lines with a trailing newline, which is why no separator is needed in the join.
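Put together as a stdin-based one-liner in the style of the answers above (a sketch, not from the original thread):

python -c "import random, sys; lines = sys.stdin.readlines(); random.SystemRandom().shuffle(lines); sys.stdout.write(''.join(lines))" < myFile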
A simple awk-based function will do the job:

shuffle() {
    awk 'BEGIN{srand();} {printf "%06d %s\n", rand()*1000000, $0;}' | sort -n | cut -c8-
}

Usage:
any_command | shuffle

This should work on almost any UNIX. Tested on Linux, Solaris and HP-UX.
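For instance, a quick sanity check (assuming the shuffle function above has been defined):

seq 10 | shuffle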
Update:
Note that the leading zeros (%06d) and the rand() multiplication make it work properly also on systems where sort does not understand numbers: the keys can then be sorted in lexicographical order (i.e., a normal string compare).
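To illustrate, zero-padded keys compare the same lexicographically as they do numerically (a tiny sketch):

printf '%s\n' '731205 b' '042309 a' '000017 c' | sort | cut -c8-   # prints c, a, b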
"$@", it'll also work with files as input. There is no reason to multiply rand(), because sort -n is capable of sorting decimal fractions. It is, however, a good idea to control awk's output format, because with the default format, %.6g, rand() will output the occasional number in exponential notation. While shuffling up to 1 million lines is arguably enough in practice, it's easy to support more lines without paying much of a performance penalty; e.g. %.17f.sort should be able to handle decimal fractions (even with thousands separators, as I've just noticed).A simple and intuitive way would be to use shuf.
Example:
Assume words.txt as:
the
an
linux
ubuntu
life
good
breeze

To shuffle the lines, do:
$ shuf words.txt

which writes the shuffled lines to standard output; so you have to redirect it to an output file, like:
$ shuf words.txt > shuffled_words.txt

One such shuffle run could yield:
breeze
the
linux
an
ubuntu
good
life

Ruby FTW:
ls | ruby -e 'puts STDIN.readlines.shuffle'

A one-liner for Python based on scai's answer, but a) it takes stdin, b) it makes the result repeatable with a seed, and c) it picks out only 200 of all lines:
$ cat file | python -c "import random, sys; random.seed(100); print ''.join(random.sample(sys.stdin.readlines(), 200))," \
  > 200lines.txt

If, like me, you came here to look for an alternative to shuf for macOS, then use randomize-lines.
Install the randomize-lines (Homebrew) package, which provides an rl command with functionality similar to shuf:
brew install randomize-lines
Usage: rl [OPTION]... [FILE]...
Randomize the lines of a file (or stdin).

  -c, --count=N          select N lines from the file
  -r, --reselect         lines may be selected multiple times
  -o, --output=FILE      send output to file
  -d, --delimiter=DELIM  specify line delimiter (one character)
  -0, --null             set line delimiter to null character
                         (useful with find -print0)
  -n, --line-number      print line number with output lines
  -q, --quiet, --silent  do not output any errors or warnings
  -h, --help             display this help and exit
  -V, --version          output version information and exit

Alternatively, brew install coreutils provides the shuf binary as gshuf.
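Going by the help text above, usage mirrors shuf (file names are placeholders):

rl words.txt > shuffled_words.txt   # shuffle a whole file
rl -c 1 words.txt                   # select one random line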
This is a Python script that I saved as rand.py in my home folder:

#!/bin/python

import sys
import random

if __name__ == '__main__':
    with open(sys.argv[1], 'r') as f:
        flist = f.readlines()
        random.shuffle(flist)

        for line in flist:
            print line.strip()

On Mac OSX, sort -R and shuf are not available, so you can alias this in your bash_profile as:
alias shuf='python rand.py'

If you have Scala installed, here's a one-liner to shuffle the input:
ls -1 | scala -e 'for (l <- util.Random.shuffle(io.Source.stdin.getLines.toList)) println(l)'

This bash function has minimal dependencies (only sort and bash):
shuf() {
  while read -r x; do
    echo $RANDOM$'\x1f'$x
  done |
  sort |
  while IFS=$'\x1f' read -r x y; do
    echo $y
  done
}

This is similar in approach to the OP's awk-assisted solution, but performance will be a problem with larger input; your use of a single $RANDOM value shuffles correctly only up to 32,768 input lines; while you could extend that range, it's probably not worth it: for instance, on my machine, running your script on 32,768 short input lines takes about 1 second, which is about 150 times as long as running shuf takes, and about 10-15 times as long as the OP's own awk-assisted solution takes. If you can rely on sort being present, awk should be there as well.
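If you did want to extend the key range past 32,768 lines, one hedged sketch (still only bash and sort) is to concatenate two zero-padded $RANDOM draws into a fixed-width key that sorts correctly as a plain string:

shuf() {
  while IFS= read -r line; do
    # two draws give roughly 30 bits of key space; %05d keeps the key fixed-width
    printf '%05d%05d\x1f%s\n' "$RANDOM" "$RANDOM" "$line"
  done |
  sort |
  while IFS=$'\x1f' read -r key rest; do
    printf '%s\n' "$rest"
  done
}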
On Windows, you may try this batch file to help you shuffle your data.txt. The usage of the batch code is:

C:\> type list.txt | shuffle.bat > maclist_temp.txt

After issuing this command, maclist_temp.txt will contain a randomized list of lines.
Hope this helps.
Not mentioned as of yet:
The unsort util. Syntax (somewhat playlist oriented):
unsort [-hvrpncmMsz0l] [--help] [--version] [--random] [--heuristic]
       [--identity] [--filenames[=profile]] [--separator sep] [--concatenate]
       [--merge] [--merge-random] [--seed integer] [--zero-terminated] [--null]
       [--linefeed] [file ...]

msort can shuffle by line, but it's usually overkill:
seq 10 | msort -jq -b -l -n 1 -c r