AWK: Insert in random selected lines target terms after source terms from a dictionary

Question

Note: I have already asked a similar question in AWK: Quick way to insert target words after a source term and I am at the beginner level of AWK.

This question is about inserting multiple target terms after their source terms in a number of randomly selected lines.

With this AWK code snippet

awk '(NR==FNR){a[$1];next} FNR in a { gsub(/\<source term\>/,"& target term") } 1 ' <(shuf -n 5 -i 1-$(wc -l < file)) file

I want to insert a target term after the source term in 5 random lines of the file.

For example: I have a bilingual dictionary dict which contains the source terms on the left and the target terms on the right like

apple : Apfel
banana : Banane
raspberry : Himbeere

My file consists of these lines:

I love the Raspberry Pi.
The monkey loves eating a banana.
Who wants an apple pi?
Apple pen... pineapple pen... pen-pineapple-apple-pen!
The banana is tasty and healthy.
An apple a day keeps the doctor away.
Which fruit is tastes better: raspberry or strawberry?

Assume that for the first word, apple, the random lines 1, 3, 5, 4, 7 are selected. The output for apple will then look like this:

I love the Raspberry Pi.
The monkey loves eating a banana.
Who wants an apple Apfel pi?
Apple Apfel pen... pineapple pen... pen-pineapple-apple-pen!
The banana is tasty and healthy.
An apple a day keeps the doctor away.
Which fruit is tastes better: raspberry or strawberry?

Then another 5 random lines (3, 3, 5, 6, 7) are selected for the word banana:

I love the Raspberry Pi.
The monkey loves eating a banana.
Who wants an apple Apfel pi?
Apple Apfel pen... pineapple pen... pen-pineapple-apple-pen!
The banana Banane is tasty and healthy.
An apple a day keeps the doctor away.
Which fruit is tastes better: raspberry or strawberry?

And the same goes on with all the other entries in dict until the last entry is matched.

I want to choose 5 random lines. If these lines contain a whole source term like apple, I only want to match apple as a whole word (terms like "pineapple" should be ignored). If a line contains a source term more than once, I want to insert the target term after every occurrence. Matches should be case-insensitive, so that both apple and Apple are matched.

My question: How can I rewrite the code snippet above so that it uses a dictionary dict, selects random lines in file, and inserts the target terms after the source terms?

You should start more simple. Do you know how to do this without the randomization? — aviro
– aviro, Commented Jan 31, 2022 at 12:45
Thanks for the clarifications, but next time please edit the question instead of deleting and reposting. — terdon
– terdon ♦, Commented Jan 31, 2022 at 12:55
@aviro I have no idea how to do this without randomization in AWK. — Ramón Wilhelm
– Ramón Wilhelm, Commented Jan 31, 2022 at 13:03
So why don't you start step by step? First do it without randomization, and then continue. You're trying to build a second floor in your building without even having a first floor. Also, please try first to do this yourself (without randomization), and tell us where you got stuck. You already got an answer to an earlier similar question. Did you understand the answer? Did you try to learn what exactly it's doing? If you really understand it, try to "expand" to your current need (again, without randomization). If you didn't understand the previous answer, ask for clarification so you could learn. — aviro
– aviro, Commented Jan 31, 2022 at 14:03
@RamónWilhelm regarding "I have no idea how to do this without randomization in AWK" - that's what my answer to your previous question does, see unix.stackexchange.com/a/688458/133219. That would be a much better starting point than the code currently in your question. — Ed Morton
– Ed Morton, Commented Jan 31, 2022 at 23:04

Ed Morton · Accepted Answer · 2022-02-01 20:07:50Z

Here's how to select 5 line numbers at random from an input file using awk (and wc for the first pass to just count the lines):

$ awk -v numLines="$(wc -l < file)" 'BEGIN{srand(); for (i=1; i<=5; i++) print int(1+rand()*numLines)}'
7
2
88
13
18

Now all you have to do is take my previous answer and, for every "old" string read in the ARGIND==1 block, generate 5 line numbers as shown above and populate an array that maps each generated line number to the "old" strings associated with it. Then, when reading the final input file, check whether the current line number is in the array and, if so, loop through the "old"s stored in the array for that line number, doing the gsub() shown in my previous answer.

Using GNU awk for ARGIND, IGNORECASE, word boundaries, arrays of arrays and \s shorthand for [[:space:]]:

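$ cat tst.sh
#!/usr/bin/env bash

awk -v numLines=$(wc -l < file) '
    BEGIN {
        FS = "\\s*:\\s*"        # split dict lines on ":" with optional surrounding whitespace
        IGNORECASE = 1          # match apple, Apple, APPLE, ...
        srand()
    }
    ARGIND == 1 {               # first file argument: dict
        old = "\\<" $1 "\\>"    # regexp matching the source term as a whole word
        new = "& " $2           # replacement: the matched text followed by the target term
        for (i=1; i<=5; i++) {
            lineNr = int(1+rand()*numLines)
            map[lineNr][old] = new   # a repeated line number just overwrites the same entry
        }
        next
    }
    FNR in map {                # second file argument: file, only on the selected lines
        for ( old in map[FNR] ) {
            new = map[FNR][old]
            gsub(old,new)
        }
    }
    { print }
' dict file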

$ ./tst.sh
I love the Raspberry Pi.
The monkey loves eating a banana Banane.
Who wants an apple Apfel pi?
Apple Apfel pen... pineapple pen... pen-pineapple-apple Apfel-pen!
The banana Banane is tasty and healthy.
An apple a day keeps the doctor away.
Which fruit is tastes better: raspberry Himbeere or strawberry?
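
A side note on the srand() call in the BEGIN block: with no argument it seeds from the time of day, so the selected lines differ between runs and a given result cannot be reproduced. When debugging, it can help to pass a fixed seed instead. The following is only a minimal sketch; pick_lines and its seed argument are hypothetical names that are not part of the answer's script, and the resulting sequence may differ between awk implementations.

```shell
# pick_lines is a hypothetical helper, not part of the answer's script.
# Usage: pick_lines NUM_LINES COUNT SEED
pick_lines() {
    awk -v numLines="$1" -v count="$2" -v seed="$3" 'BEGIN {
        srand(seed)                       # fixed seed: same sequence on every run
        for (i = 1; i <= count; i++)
            print int(1 + rand() * numLines)
    }'
}

pick_lines 7 5 42   # five numbers between 1 and 7; repeats are possible
```

Calling pick_lines twice with the same seed prints the same five numbers, which makes it possible to check by hand whether the chosen lines actually contain a source term.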

I've tried this script, but the file wasn't edited. For some reason, the -i inplace is refused. — Ramón Wilhelm
– Ramón Wilhelm, Commented Feb 1, 2022 at 19:35
I'm not using -i inplace in my script and you haven't mentioned that in your question but if you want to use that you'll have to get a version of GNU awk that supports it. We already know you're using GNU awk so you must be on a very old pre-inplace version of it. Personally I wouldn't use -i inplace as it'll just complicate things because then you'd need to rewrite BOTH input files. Just do awk '...' dict file > tmp && mv tmp file if you really want to overwrite file with the output. — Ed Morton
– Ed Morton, Commented Feb 1, 2022 at 19:36
I already did awk '...' dict file > tmp && mv tmp file, but nothing has changed. — Ramón Wilhelm
– Ramón Wilhelm, Commented Feb 1, 2022 at 19:49
That's a different problem to "-i inplace is refused". Then you're doing something wrong as the script in my answer WILL produce output that's different from the input if some of the lines that contain "apple", for example, occur on some of the random line numbers generated. idk what you're doing wrong though because I can't see your code. I just updated my answer to show my script running against the sample input you provided and producing output that's different from the input. — Ed Morton
– Ed Morton, Commented Feb 1, 2022 at 19:54
Maybe you have a file that's pretty long and none of the 5 line numbers generated for each of the "old" strings in dict actually have those strings on any of those 5 line numbers associated with each string? That's the only thing that comes to mind that would cause the output to be unchanged. Do you see changes given the input in your question and it's some other input it "fails" with? — Ed Morton
– Ed Morton, Commented Feb 1, 2022 at 20:01
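
The hypothesis in the last comment can be put into numbers. Since each of the 5 line numbers is drawn independently, the chance that none of them lands on a line containing a given term is (1 - hits/numLines)^5, where hits is the number of lines containing that term. A rough sketch follows; miss_probability is a hypothetical helper, and the 1000-line file with 10 matching lines is an illustrative assumption, not data from the question.

```shell
# miss_probability is a hypothetical helper; all arguments are illustrative.
# Usage: miss_probability NUM_LINES LINES_WITH_TERM DRAWS
miss_probability() {
    awk -v numLines="$1" -v hits="$2" -v draws="$3" 'BEGIN {
        # each draw independently misses with probability 1 - hits/numLines
        printf "%.4f\n", (1 - hits / numLines) ^ draws
    }'
}

miss_probability 1000 10 5   # prints 0.9510
miss_probability 7 3 5       # prints 0.0609 (sample file: apple is on 3 of 7 lines)
```

So for a term that appears on 1% of the lines of a long file, roughly 95% of runs would leave that term untouched, while on the 7-line sample input a change for apple is very likely.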

guest_7 · Accepted Answer · 2022-02-01 05:27:13Z

GNU sed with extended regex mode (-E) and the (/e) modifier of the s/// command:

n=$(< file wc -l)
sed -E '/\n/ba
  s#^(\S+)\s*:\s*(\S+)$#s/\\<\1\\>/\& \2/Ig#;h'"
  s/.*/shuf -n 5 -i '1-$n'/e;G
  :a
  s/^([0-9]+)(\n.*\n(.*))/\1 \3\2/
  /\n.*\n/!s/\n/ /
  P;D
" dict | sed -f /dev/stdin file

Generate the GNU sed commands from the contents of the dict file.
Store the command in the hold space.
Roll the dice and generate 5 random numbers in the range from 1 to the number of lines in the input file.
Append the hold space to the pattern space and generate sed commands that run on these particular lines only.
Apply the generated commands to the input file.
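
To make the intermediate result easier to picture, here is a rough sketch that produces the same kind of generated script (one "LINE s/\<source\>/& target/Ig" command per random draw), but uses awk for the generation step. It is not the sed script above: gen_sed_script and its seed argument are hypothetical, it draws with rand() rather than shuf (so a line number can repeat), and it assumes the terms contain no "/" or regex metacharacters.

```shell
# gen_sed_script is a hypothetical helper, not the answer's sed script.
# It reads "source : target" pairs on stdin and prints one GNU sed command
# per random draw, in the form: LINE s/\<source\>/& target/Ig
# Usage: gen_sed_script NUM_LINES SEED < dict
gen_sed_script() {
    awk -F '[ \t]*:[ \t]*' -v numLines="$1" -v seed="$2" '
        BEGIN { srand(seed) }
        NF == 2 {
            for (i = 1; i <= 5; i++) {
                lineNr = int(1 + rand() * numLines)
                printf "%d s/\\<%s\\>/& %s/Ig\n", lineNr, $1, $2
            }
        }'
}

# Print the generated commands for the three sample dictionary entries.
printf '%s\n' 'apple : Apfel' 'banana : Banane' 'raspberry : Himbeere' |
    gen_sed_script 7 42
```

Piping this output into sed -f /dev/stdin file, as the answer does, would apply the commands to the selected lines. The I flag and the \< \> word boundaries are GNU sed extensions.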

I was curious what the first sed script would output so I tried it and got an apparently infinite stream of 4 <number> s/\<apple\>/& Apfel/Ig lines (always just apple, no other substitution) and a number on a line of its own. I'm using sed (GNU sed) 4.4 in bash on cygwin. But then when I piped it to the 2nd sed it did actually terminate and made changes for other fruits too so it seems like the combination actually worked. Could you edit your answer to explain what's going on with that interaction between the 2 scripts? — Ed Morton
– Ed Morton, Commented Feb 1, 2022 at 13:54
This behavior was not observed at my end. The first sed terminates after generating 5 * 3 = 15 sed substitution commands — guest_7
– guest_7, Commented Feb 3, 2022 at 3:27
Well, that's weird, I don't get that behavior today either but I tried it multiple times yesterday just to make sure of what I was seeing. Beats me - magic... — Ed Morton
– Ed Morton, Commented Feb 3, 2022 at 13:25
