As another option for reading a random line from a file (and assigning it to a variable), consider a simplified reservoir sampling method, converted from thrig's perl implementation to awk, with Peter.O's seeding improvement.
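The approach can be sketched roughly as follows. Treat it as a sketch under assumptions rather than the exact one-liner: the function name `random_line` and its optional seed argument are hypothetical, while the selection test `rand() * FNR < 1`, the `$RANDOM` seed, and the `/usr/share/dict/words` input are the ones discussed in this answer.

```shell
#!/bin/bash
# random_line FILE [SEED]: print one line chosen at random from FILE.
# The function name and the optional SEED argument are illustrative
# assumptions; by default the seed comes from bash's $RANDOM.
random_line() {
    awk -v seed="${2:-$RANDOM}" '
        BEGIN { srand(seed) }             # seed once, before any input is read
        rand() * FNR < 1 { line = $0 }    # keep line FNR with probability 1/FNR
        END { print line }                # the last line kept is the sample
    ' "$1"
}

# Assign the chosen line to a variable, if the word list is available.
if [ -r /usr/share/dict/words ]; then
    word=$(random_line /usr/share/dict/words)
    printf '%s\n' "$word"
fi
```

Only one candidate line is held in memory at a time, and the file is read exactly once.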
Because of the way awk's srand() works, you would get the same value if you run this script within the same second, unless you seed it with something else random; here I've passed in bash's $RANDOM as the seed. I'm selecting words from /usr/share/dict/words, just as a source of text.
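This method does not need to know how many lines are in the file ahead of time (my local copy has 479,828 lines): it reads the file in a single pass and keeps only the current candidate line, so it should be pretty flexible.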
To see the program's math in action, I wrote up a wrapper script that iterates through different line numbers and probabilities:
demo.sh

    #!/bin/sh

    for lineno in 1 2 3 4 5 20 100
    do
      echo "0 .. 0.99999 < ( 1 / FNR == " $(printf 'scale=2\n1 / %d\n' "$lineno" | bc) ")"
      for r in 0 0.01 0.25 0.5 0.99
      do
        result=$(printf '%f * %d\n' "$r" "$lineno" | bc)
        case $result in
          (0*|\.*) echo "Line $lineno: Result of probability $r * line $lineno is $result and is < 1, choosing line" ;;
          (*)      echo "Line $lineno: Result of probability $r * line $lineno is $result and is >= 1, not choosing line" ;;
        esac
      done
      echo
    done

The results are:
    0 .. 0.99999 < ( 1 / FNR == 1.00 )
    Line 1: Result of probability 0 * line 1 is 0 and is < 1, choosing line
    Line 1: Result of probability 0.01 * line 1 is .010000 and is < 1, choosing line
    Line 1: Result of probability 0.25 * line 1 is .250000 and is < 1, choosing line
    Line 1: Result of probability 0.5 * line 1 is .500000 and is < 1, choosing line
    Line 1: Result of probability 0.99 * line 1 is .990000 and is < 1, choosing line

    0 .. 0.99999 < ( 1 / FNR == .50 )
    Line 2: Result of probability 0 * line 2 is 0 and is < 1, choosing line
    Line 2: Result of probability 0.01 * line 2 is .020000 and is < 1, choosing line
    Line 2: Result of probability 0.25 * line 2 is .500000 and is < 1, choosing line
    Line 2: Result of probability 0.5 * line 2 is 1.000000 and is >= 1, not choosing line
    Line 2: Result of probability 0.99 * line 2 is 1.980000 and is >= 1, not choosing line

    0 .. 0.99999 < ( 1 / FNR == .33 )
    Line 3: Result of probability 0 * line 3 is 0 and is < 1, choosing line
    Line 3: Result of probability 0.01 * line 3 is .030000 and is < 1, choosing line
    Line 3: Result of probability 0.25 * line 3 is .750000 and is < 1, choosing line
    Line 3: Result of probability 0.5 * line 3 is 1.500000 and is >= 1, not choosing line
    Line 3: Result of probability 0.99 * line 3 is 2.970000 and is >= 1, not choosing line

    0 .. 0.99999 < ( 1 / FNR == .25 )
    Line 4: Result of probability 0 * line 4 is 0 and is < 1, choosing line
    Line 4: Result of probability 0.01 * line 4 is .040000 and is < 1, choosing line
    Line 4: Result of probability 0.25 * line 4 is 1.000000 and is >= 1, not choosing line
    Line 4: Result of probability 0.5 * line 4 is 2.000000 and is >= 1, not choosing line
    Line 4: Result of probability 0.99 * line 4 is 3.960000 and is >= 1, not choosing line

    0 .. 0.99999 < ( 1 / FNR == .20 )
    Line 5: Result of probability 0 * line 5 is 0 and is < 1, choosing line
    Line 5: Result of probability 0.01 * line 5 is .050000 and is < 1, choosing line
    Line 5: Result of probability 0.25 * line 5 is 1.250000 and is >= 1, not choosing line
    Line 5: Result of probability 0.5 * line 5 is 2.500000 and is >= 1, not choosing line
    Line 5: Result of probability 0.99 * line 5 is 4.950000 and is >= 1, not choosing line

    0 .. 0.99999 < ( 1 / FNR == .05 )
    Line 20: Result of probability 0 * line 20 is 0 and is < 1, choosing line
    Line 20: Result of probability 0.01 * line 20 is .200000 and is < 1, choosing line
    Line 20: Result of probability 0.25 * line 20 is 5.000000 and is >= 1, not choosing line
    Line 20: Result of probability 0.5 * line 20 is 10.000000 and is >= 1, not choosing line
    Line 20: Result of probability 0.99 * line 20 is 19.800000 and is >= 1, not choosing line

    0 .. 0.99999 < ( 1 / FNR == .01 )
    Line 100: Result of probability 0 * line 100 is 0 and is < 1, choosing line
    Line 100: Result of probability 0.01 * line 100 is 1.000000 and is >= 1, not choosing line
    Line 100: Result of probability 0.25 * line 100 is 25.000000 and is >= 1, not choosing line
    Line 100: Result of probability 0.5 * line 100 is 50.000000 and is >= 1, not choosing line
    Line 100: Result of probability 0.99 * line 100 is 99.000000 and is >= 1, not choosing line

The original formula:
    rand() * FNR < 1

can be mathematically rewritten (dividing both sides by FNR, which is always positive) as:

    rand() < 1 / FNR

... which is more intuitive to me, as it demonstrates the decreasing values on the right-hand side as the line numbers go up. As the values on the right side of the inequality go down, there's a smaller and smaller chance that the rand() function will return a value that's less than the right-hand side.
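The shrinking chance for later lines does not bias the result toward early lines, because an early pick also has to survive every later test. As a short worked derivation (the symbols $k$ for a line number and $N$ for the total number of lines are introduced here for illustration), the probability that line $k$ is the one printed at the end is:

```latex
\begin{aligned}
P(\text{line } k \text{ is the final pick})
  &= \underbrace{\frac{1}{k}}_{\text{line } k \text{ chosen}}
     \;\prod_{j=k+1}^{N} \underbrace{\left(1 - \frac{1}{j}\right)}_{\text{line } j \text{ not chosen}} \\
  &= \frac{1}{k} \cdot \frac{k}{k+1} \cdot \frac{k+1}{k+2} \cdots \frac{N-1}{N} \\
  &= \frac{1}{N}
\end{aligned}
```

The product telescopes, so every line ends up with the same 1/N chance regardless of its position.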
For each line number, I print a representation of the formula that will be tested: the range of rand()'s output and "1 divided by the line number". I then iterate through some sample random values to see whether the line would be chosen given that random value.
A few sample cases are interesting to look at:
- on line 1, since rand() generates values in the range 0 <= rand() < 1, the result will always be less than (1 / 1 == 1), so line 1 will always be chosen.
- on line 2, you can see that the random value needs to be less than 0.50, indicating a 50% chance of choosing line 2.
- on line 100, rand() now needs to generate a value less than 0.01 in order for the line to be chosen.
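The per-line chances above can also be checked empirically. The following is a small simulation sketch, not part of the original scripts: the function name `simulate_picks`, the fixed seed, the 5-line "file", and the trial count are all assumptions chosen for illustration. It applies the same `rand() * n < 1` test to line numbers 1 through 5, repeats that many times, and counts how often each line number ends up as the final pick.

```shell
#!/bin/sh
# simulate_picks SEED LINES TRIALS
# Hypothetical helper: replays the selection rule for a pretend file of
# LINES lines, TRIALS times, and reports how often each line wins.
simulate_picks() {
    awk -v seed="$1" -v lines="$2" -v trials="$3" '
    BEGIN {
        srand(seed)
        for (t = 1; t <= trials; t++) {
            pick = 0
            for (n = 1; n <= lines; n++)   # n plays the role of FNR
                if (rand() * n < 1)
                    pick = n               # a later success replaces the earlier pick
            count[pick]++
        }
        for (n = 1; n <= lines; n++)
            printf "line %d: %d picks (%.1f%%)\n", n, count[n], 100 * count[n] / trials
    }'
}

simulate_picks 12345 5 10000
```

If the rule behaves as described, each of the five line numbers should come out as the final pick in roughly a fifth of the trials, even though line 5 is only tested with a 20% chance and line 1 is always chosen at first.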