3

I want to add a column with a randomly created "case number" to my csv file. The first 2 letters of the casenumber must be any letter from A-Z in capitals. followed by 5 random numbers.

input:

COMPANY,NAME,STREET,ZIP,CITY,IBAN Test Ltd,John,Big Ben 343,4343,London,UK2348020384 Test Ltd,Kate,Big Ben 343,4343,London,UK4389223892 Test Ltd,Jake,Big Ben 343,4343,London,UK3892898999 

output

COMPANY,NAME,STREET,ZIP,CITY,IBAN,CASENUMBER Test Ltd,John,Big Ben 343,4343,London,UK2348020384,IN84903 Test Ltd,Kate,Big Ben 343,4343,London,UK4389223892,TY93842 Test Ltd,Jake,Big Ben 343,4343,London,UK3892898999,OL34307 

How to do this with miller? I have the following command ready

mlr -I --csv put '${CASENUMBER}=xxx' then \ reorder -f COMPANY,NAME,STREET,ZIP,CITY,IBAN,CASENUMBER input/input.csv 

What to add to the above command exactly?

2
  • You may just be doing this for fun, but I would heartily recommend creating case numbers that are self-verifying, eg make the last digit derived from the rest of the id. See for example the isbn. Commented Jan 29, 2024 at 12:55
  • Can the same case number be used multiple times or does each case number have to be unique? Do the case numbers REALLY need to be random or just unique? Commented Jan 29, 2024 at 18:45

5 Answers 5

5

This does not use Miller (mainly because I couldn't find a convenient way to do the random characters) but GNU Awk:

awk -i ord -v OFS=, ' function randint(n) { return int(n*rand()) } BEGIN { srand(); A = ord("A"); Z = ord("Z") } NR == 1 { printf("%s,CASENUMBER\n", $0); next } { printf "%s,%c%c%.5d\n", $0, A + randint(Z-A+1), A + randint(Z-A+1), randint(100000) }' file.csv 

This GNU Awk command pulls in the ord() function from the ord.awk library (which is distributed together with GNU Awk) to be able to convert between a character and its ASCII representation. It then initializes the random number generator with srand() and precomputes the ASCII values for the letters A and Z for convenience.

If the current record is the first record (i.e., it's the CSV header), it outputs it with the string ,CASENUMBER appended.

For all other records, it outputs the original record with an appended string after a comma. The string is computed as two characters and a zero-filled number. The two characters are picked from the range [A,Z], and the number is picked from the range [0,100000). The random picking of integers from a range is done using the randint() function, which I'm using unmodified from the GNU Awk manual.

An example output of running this on the given data:

COMPANY,NAME,STREET,ZIP,CITY,IBAN,CASENUMBER Test Ltd,John,Big Ben 343,4343,London,UK2348020384,HP88271 Test Ltd,Kate,Big Ben 343,4343,London,UK4389223892,XS17910 Test Ltd,Jake,Big Ben 343,4343,London,UK3892898999,UX00409 

Note that as long as we're assuming that no field contains embedded newlines, we don't need to actually parse the input data. With that assumption, it's enough to just append new data to the end of each line.


terdon pointed out in comments that it may be a good idea to ensure that the computed case IDs are unique. This does this by keeping track of the already generated IDs in an associative array called seen:

awk -i ord -v OFS=, ' function randint(n) { return int(n*rand()) } function randid() { return sprintf("%c%c%.5d", A + randint(Z-A+1), A + randint(Z-A+1), randint(100000)) } BEGIN { srand(); A = ord("A"); Z = ord("Z") } NR == 1 { printf("%s,CASENUMBER\n", $0); next } { id = randid() while (seen[id]++ > 0) id = randid() printf "%s,%s\n", $0, id }' file.csv 

I also moved the creation of the random case ID into its own function, as we need to call it twice from the main part of the code.

3
  • 1
    Nice! One nitpick: isn't there an (admittedly tiny) chance that one of the random strings will be repeated here? Would it be worth keeping track of them and discarding any doubles? Commented Jan 29, 2024 at 11:48
  • @terdon There was, strictly speaking, no requirement for uniqueness in the question, but I will give an alternative command in this answer to do that, as it's fairly simple to do. Thanks! Commented Jan 29, 2024 at 11:50
  • Nice use of ord(). You might want to change while (seen[id]++ > 0) to cnt=0; while ( (seen[id]++ > 0) && (++cnt < 100) ) or similar to avoid an infinite or unusably long loop if most combinations are used. Commented Jan 29, 2024 at 22:58
3

With perl:

perl -lpe '$_ .= $. == 1 ? ",CASENUMBER" : sprintf ",%s%s%05d", ("A".."Z")[rand 26,rand 26], rand 1e5' 

That assumes CSV fields don't contain newline characters. If they may, you could use perl's Text::CSV module to do proper CSV parsing and formatting like miller does.

3

A way to do it in Miller.

alphabet="[\"A\", \"B\", \"C\", \"D\", \"E\", \"F\", \"G\", \"H\", \"I\", \"J\", \"K\", \"L\", \"M\", \"N\", \"O\", \"P\", \"Q\", \"R\", \"S\", \"T\", \"U\", \"V\", \"W\", \"X\", \"Y\", \"Z\"]" mlr --csv put '$CASENUMBER=urandelement('"$alphabet"').urandelement('"$alphabet"').substr0(urand32(),0,5)' input.csv 

You will have

COMPANY NAME STREET ZIP CITY IBAN CASENUMBER
Test Ltd John Big Ben 343 4343 London UK2348020384 BV134526
Test Ltd Kate Big Ben 343 4343 London UK4389223892 ZF321045
Test Ltd Jake Big Ben 343 4343 London UK3892898999 VQ302039
4
  • 1
    If you use single quotes around that long string, then you don't need to escape each double quote. Commented Jan 29, 2024 at 15:27
  • Hi @Kusalananda if I use only single quote and double string, I have mlr error: mlr DSL: syntax error at "[" ... Commented Jan 29, 2024 at 16:11
  • @stéphane-chazelas I brought it back, because with the syntax you proposed mlr gives me an error mlr DSL: syntax error at "[" ... Commented Jan 29, 2024 at 18:54
  • 1
    That can't make a difference, you'll find that with both syntaxes, $alphabet contains the exact same value which you can confirm with typeset -p alphabet Commented Jan 29, 2024 at 21:04
3

You could use urandint to write a quick'n'dirty random character generator using string slicing:

mlr --csv put ' begin{@chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";} func urandchars(n) { x = ""; while (n > 0) { i = urandint(1,strlen(@chars)); x = x . @chars[i:i]; n = n - 1; } return x; } ${CASENUMBER} = urandchars(2) . fmtnum(urandint(0,99999),"%05d") ' input.csv COMPANY,NAME,STREET,ZIP,CITY,IBAN,CASENUMBER Test Ltd,John,Big Ben 343,4343,London,UK2348020384,MC47966 Test Ltd,Kate,Big Ben 343,4343,London,UK4389223892,TM78575 Test Ltd,Jake,Big Ben 343,4343,London,UK3892898999,WI58624 

If you want to generate the character list externally, you could pass it in via the environment:

CHARS="$(printf '%c' {A..Z})" mlr --csv put ' begin {@chars = ENV["CHARS"]} func urandchars(n) { x = ""; while (n > 0) { i = urandint(1,strlen(@chars)); x = x . @chars[i:i]; n = n - 1; } return x; } ${CASENUMBER}=urandchars(2) . fmtnum(urandint(0,99999),"%05d") ' input.csv 
2

Using any awk:

$ cat tst.sh #!/usr/bin/env bash [[ -s /dev/urandom ]] && seed="$(od -An -N4 -tu4 /dev/urandom)" awk -v seed="$seed" ' BEGIN { if (seed) srand(seed); else srand() FS = OFS = "," chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ" numChars = length(chars) for ( i=1; i<=numChars; i++ ) { for ( j=1; j<=numChars; j++ ) { strs[++numStrs] = substr(chars,i,1) substr(chars,j,1) } } numDigitCols = 5 numNumbers = length("0123456789") ^ numDigitCols maxUnqAttempts = 100 } { for ( attempts=1; attempts<=maxUnqAttempts; attempts++ ) { id = sprintf( "%s%05g", strs[int(1+rand()*numStrs)], int(rand()*numNumbers) ) if ( !seen[id]++ ) { break } } if ( attempts > maxUnqAttempts ) { printf "%s[%d]: Failed to generate a unique ID after %d attempts, reusing %s\n", FILENAME, NR, maxUnqAttempts, id | "cat>&2" } print $0, (NR>1 ? id : "CASENUMBER") } ' "${@:--}" 

$ ./tst.sh input.csv COMPANY,NAME,STREET,ZIP,CITY,IBAN,CASENUMBER Test Ltd,John,Big Ben 343,4343,London,UK2348020384,TT93580 Test Ltd,Kate,Big Ben 343,4343,London,UK4389223892,MP70836 Test Ltd,Jake,Big Ben 343,4343,London,UK3892898999,TM77788 

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.