How to create a new column and add a random identifier to it with miller

Question

I want to add a column with a randomly created "case number" to my csv file. The first 2 letters of the casenumber must be any letter from A-Z in capitals. followed by 5 random numbers.

input:

COMPANY,NAME,STREET,ZIP,CITY,IBAN Test Ltd,John,Big Ben 343,4343,London,UK2348020384 Test Ltd,Kate,Big Ben 343,4343,London,UK4389223892 Test Ltd,Jake,Big Ben 343,4343,London,UK3892898999

output

COMPANY,NAME,STREET,ZIP,CITY,IBAN,CASENUMBER Test Ltd,John,Big Ben 343,4343,London,UK2348020384,IN84903 Test Ltd,Kate,Big Ben 343,4343,London,UK4389223892,TY93842 Test Ltd,Jake,Big Ben 343,4343,London,UK3892898999,OL34307

How to do this with miller? I have the following command ready

mlr -I --csv put '${CASENUMBER}=xxx' then \ reorder -f COMPANY,NAME,STREET,ZIP,CITY,IBAN,CASENUMBER input/input.csv

What to add to the above command exactly?

You may just be doing this for fun, but I would heartily recommend creating case numbers that are self-verifying, eg make the last digit derived from the rest of the id. See for example the isbn. — meuh
– meuh, Commented Jan 29, 2024 at 12:55
Can the same case number be used multiple times or does each case number have to be unique? Do the case numbers REALLY need to be random or just unique? — Ed Morton
– Ed Morton, Commented Jan 29, 2024 at 18:45

terdon · Accepted Answer · 2024-01-29 11:59:44Z

This does not use Miller (mainly because I couldn't find a convenient way to do the random characters) but GNU Awk:

awk -i ord -v OFS=, ' function randint(n) { return int(n*rand()) } BEGIN { srand(); A = ord("A"); Z = ord("Z") } NR == 1 { printf("%s,CASENUMBER\n", $0); next } { printf "%s,%c%c%.5d\n", $0, A + randint(Z-A+1), A + randint(Z-A+1), randint(100000) }' file.csv

This GNU Awk command pulls in the ord() function from the ord.awk library (which is distributed together with GNU Awk) to be able to convert between a character and its ASCII representation. It then initializes the random number generator with srand() and precomputes the ASCII values for the letters A and Z for convenience.

If the current record is the first record (i.e., it's the CSV header), it outputs it with the string ,CASENUMBER appended.

For all other records, it outputs the original record with an appended string after a comma. The string is computed as two characters and a zero-filled number. The two characters are picked from the range [A,Z], and the number is picked from the range [0,100000). The random picking of integers from a range is done using the randint() function, which I'm using unmodified from the GNU Awk manual.

An example output of running this on the given data:

COMPANY,NAME,STREET,ZIP,CITY,IBAN,CASENUMBER Test Ltd,John,Big Ben 343,4343,London,UK2348020384,HP88271 Test Ltd,Kate,Big Ben 343,4343,London,UK4389223892,XS17910 Test Ltd,Jake,Big Ben 343,4343,London,UK3892898999,UX00409

Note that as long as we're assuming that no field contains embedded newlines, we don't need to actually parse the input data. With that assumption, it's enough to just append new data to the end of each line.

terdon pointed out in comments that it may be a good idea to ensure that the computed case IDs are unique. This does this by keeping track of the already generated IDs in an associative array called seen:

awk -i ord -v OFS=, ' function randint(n) { return int(n*rand()) } function randid() { return sprintf("%c%c%.5d", A + randint(Z-A+1), A + randint(Z-A+1), randint(100000)) } BEGIN { srand(); A = ord("A"); Z = ord("Z") } NR == 1 { printf("%s,CASENUMBER\n", $0); next } { id = randid() while (seen[id]++ > 0) id = randid() printf "%s,%s\n", $0, id }' file.csv

I also moved the creation of the random case ID into its own function, as we need to call it twice from the main part of the code.

Nice! One nitpick: isn't there an (admittedly tiny) chance that one of the random strings will be repeated here? Would it be worth keeping track of them and discarding any doubles? — terdon
– terdon ♦, Commented Jan 29, 2024 at 11:48
@terdon There was, strictly speaking, no requirement for uniqueness in the question, but I will give an alternative command in this answer to do that, as it's fairly simple to do. Thanks! — Kusalananda
– Kusalananda ♦, Commented Jan 29, 2024 at 11:50
Nice use of ord(). You might want to change while (seen[id]++ > 0) to cnt=0; while ( (seen[id]++ > 0) && (++cnt < 100) ) or similar to avoid an infinite or unusably long loop if most combinations are used. — Ed Morton
– Ed Morton, Commented Jan 29, 2024 at 22:58

Stéphane Chazelas · Accepted Answer · 2024-01-29 18:40:16Z

With perl:

perl -lpe '$_ .= $. == 1 ? ",CASENUMBER" : sprintf ",%s%s%05d", ("A".."Z")[rand 26,rand 26], rand 1e5'

That assumes CSV fields don't contain newline characters. If they may, you could use perl's Text::CSV module to do proper CSV parsing and formatting like miller does.

aborruso · Accepted Answer · 2024-01-29 18:53:20Z

A way to do it in Miller.

alphabet="[\"A\", \"B\", \"C\", \"D\", \"E\", \"F\", \"G\", \"H\", \"I\", \"J\", \"K\", \"L\", \"M\", \"N\", \"O\", \"P\", \"Q\", \"R\", \"S\", \"T\", \"U\", \"V\", \"W\", \"X\", \"Y\", \"Z\"]" mlr --csv put '$CASENUMBER=urandelement('"$alphabet"').urandelement('"$alphabet"').substr0(urand32(),0,5)' input.csv

You will have

COMPANY	NAME	STREET	ZIP	CITY	IBAN	CASENUMBER
Test Ltd	John	Big Ben 343	4343	London	UK2348020384	BV134526
Test Ltd	Kate	Big Ben 343	4343	London	UK4389223892	ZF321045
Test Ltd	Jake	Big Ben 343	4343	London	UK3892898999	VQ302039

If you use single quotes around that long string, then you don't need to escape each double quote. — Kusalananda
– Kusalananda ♦, Commented Jan 29, 2024 at 15:27
Hi @Kusalananda if I use only single quote and double string, I have mlr error: mlr DSL: syntax error at "[" ... — aborruso
– aborruso, Commented Jan 29, 2024 at 16:11
@stéphane-chazelas I brought it back, because with the syntax you proposed mlr gives me an error mlr DSL: syntax error at "[" ... — aborruso
– aborruso, Commented Jan 29, 2024 at 18:54
That can't make a difference, you'll find that with both syntaxes, $alphabet contains the exact same value which you can confirm with typeset -p alphabet — Stéphane Chazelas
– Stéphane Chazelas, Commented Jan 29, 2024 at 21:04

steeldriver · Accepted Answer · 2024-01-30 13:18:51Z

You could use urandint to write a quick'n'dirty random character generator using string slicing:

mlr --csv put ' begin{@chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";} func urandchars(n) { x = ""; while (n > 0) { i = urandint(1,strlen(@chars)); x = x . @chars[i:i]; n = n - 1; } return x; } ${CASENUMBER} = urandchars(2) . fmtnum(urandint(0,99999),"%05d") ' input.csv COMPANY,NAME,STREET,ZIP,CITY,IBAN,CASENUMBER Test Ltd,John,Big Ben 343,4343,London,UK2348020384,MC47966 Test Ltd,Kate,Big Ben 343,4343,London,UK4389223892,TM78575 Test Ltd,Jake,Big Ben 343,4343,London,UK3892898999,WI58624

If you want to generate the character list externally, you could pass it in via the environment:

CHARS="$(printf '%c' {A..Z})" mlr --csv put ' begin {@chars = ENV["CHARS"]} func urandchars(n) { x = ""; while (n > 0) { i = urandint(1,strlen(@chars)); x = x . @chars[i:i]; n = n - 1; } return x; } ${CASENUMBER}=urandchars(2) . fmtnum(urandint(0,99999),"%05d") ' input.csv

Ed Morton · Accepted Answer · 2024-01-29 21:59:29Z

Using any awk:

$ cat tst.sh #!/usr/bin/env bash [[ -s /dev/urandom ]] && seed="$(od -An -N4 -tu4 /dev/urandom)" awk -v seed="$seed" ' BEGIN { if (seed) srand(seed); else srand() FS = OFS = "," chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ" numChars = length(chars) for ( i=1; i<=numChars; i++ ) { for ( j=1; j<=numChars; j++ ) { strs[++numStrs] = substr(chars,i,1) substr(chars,j,1) } } numDigitCols = 5 numNumbers = length("0123456789") ^ numDigitCols maxUnqAttempts = 100 } { for ( attempts=1; attempts<=maxUnqAttempts; attempts++ ) { id = sprintf( "%s%05g", strs[int(1+rand()*numStrs)], int(rand()*numNumbers) ) if ( !seen[id]++ ) { break } } if ( attempts > maxUnqAttempts ) { printf "%s[%d]: Failed to generate a unique ID after %d attempts, reusing %s\n", FILENAME, NR, maxUnqAttempts, id | "cat>&2" } print $0, (NR>1 ? id : "CASENUMBER") } ' "${@:--}"

$ ./tst.sh input.csv COMPANY,NAME,STREET,ZIP,CITY,IBAN,CASENUMBER Test Ltd,John,Big Ben 343,4343,London,UK2348020384,TT93580 Test Ltd,Kate,Big Ben 343,4343,London,UK4389223892,MP70836 Test Ltd,Jake,Big Ben 343,4343,London,UK3892898999,TM77788

Stack Exchange Network

How to create a new column and add a random identifier to it with miller

5 Answers 5

You must log in to answer this question.

Hot Network Questions

How to create a new column and add a random identifier to it with miller

5 Answers 5

You must log in to answer this question.

Related

Hot Network Questions