linux command to shuffle a large csv file to alternate rows according to a pattern

Question

I have a 40M+ csv file. One of the columns is a binary indicator (-1,1). I'd like to know if there is a linux command to create a new file that alternates rows with -1 and 1.

Old:

    1,x,y
    -1,t,r
    -1,e,t
    1,r,t

New:

    1,x,y
    -1,t,r
    1,r,t
    -1,e,t

It doesn't have to follow any particular logic about how the -1 and 1 rows are shuffled (it could be random), as long as the output alternates one row of each. I'm on Ubuntu 12.04.

+1 for sample data and required output, but -1 for no code :-(. What have you tried? Also note the low number of followers for many of your tags. Consider switching one of them to a programming language (awk/python/perl) or shell (bash/ksh). Good luck.
– shellter, Commented Sep 6, 2014 at 16:41
Also, does the real data have an even distribution of -1 vs (+)1 records? Good luck.
– shellter, Commented Sep 6, 2014 at 18:40

shellter · Accepted Answer · 2014-09-07 18:47:08Z

Here is a shell/awk solution. It is not the most efficient, but given the speed of modern machines that shouldn't be an issue.

First, split the data into positive and negative values.

    awk '/^-/{print}' minus1Pos1data.txt > negsData.txt
    awk '/^[^-]/{print}' minus1Pos1data.txt > posData.txt

Now merge the two files, using an awk array to hold the first file. You can swap the order of the files if you want a negative row as the first record.

    awk 'pass==1{pos[FNR]=$0} pass==2{print pos[FNR]; print}' pass=1 posData.txt pass=2 negsData.txt > alternateRows.txt

    cat alternateRows.txt
    1,x,y
    -1,t,r
    1,r,t
    -1,e,t

awk processes the command-line variable assignments (pass=1, pass=2) as it reaches them between the file names, so pass is 1 while posData.txt is read and 2 while negsData.txt is read. Inside the awk code, the tests pass==1 and pass==2 select which block runs for the current line. Note that pass=1 is an assignment statement, while pass==1 is an equality test.
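To see the command-line assignments at work, here is a small sketch. The files first.txt and second.txt and their contents are made up for illustration; each output line shows the value of pass, the per-file record number FNR, and the line itself.

```shell
#!/bin/sh
# Illustration only: first.txt and second.txt are hypothetical sample files.
demo_dir=$(mktemp -d)
printf '%s\n' 'a' 'b' > "$demo_dir/first.txt"
printf '%s\n' 'c'     > "$demo_dir/second.txt"

# pass=1 takes effect before first.txt is read, pass=2 before second.txt.
tagged=$(awk '{ print pass ":" FNR ":" $0 }' \
    pass=1 "$demo_dir/first.txt" pass=2 "$demo_dir/second.txt")

printf '%s\n' "$tagged"
# prints:
# 1:1:a
# 1:2:b
# 2:1:c
rm -rf "$demo_dir"
```

FNR restarts at 1 for the second file, which is what lets the merge pair up rows by position.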

First pass loads the first file into an array pos with the current file's record-number (FNR) as the key.

The second pass uses its own record number (FNR) to look up the matching record in the pos array and prints it. The bare print is equivalent to print $0, which prints the current line (from the pass=2 file).
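The array-based merge keeps every row of the first file in memory. If that becomes a problem on a very large file, one possible variation (a sketch, not part of the answer above) reads the second file with getline while streaming the first. The sample rows and the variable names negfile and negrow are illustrative. This version stops as soon as the negative file runs out, so the output stays strictly alternating even when the counts differ.

```shell
#!/bin/sh
# Sketch: interleave two files without loading either into an awk array.
# File names and sample rows are illustrative only.
workdir=$(mktemp -d)
printf '%s\n' '1,x,y' '1,r,t'   > "$workdir/posData.txt"
printf '%s\n' '-1,t,r' '-1,e,t' > "$workdir/negsData.txt"

# For each row of posData.txt, fetch the next row of negsData.txt with
# getline, then print the pair. getline returns 0 at end of file and -1 on
# error; in both cases we stop instead of printing an unpaired row.
merged=$(awk -v negfile="$workdir/negsData.txt" '
    {
        if ((getline negrow < negfile) <= 0) exit
        print
        print negrow
    }' "$workdir/posData.txt")

printf '%s\n' "$merged"
# prints:
# 1,x,y
# -1,t,r
# 1,r,t
# -1,e,t
rm -rf "$workdir"
```

Rows left over in the longer file are dropped here; whether that is acceptable depends on the data.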

IHTH.

David L. · Accepted Answer · 2014-09-06 19:46:13Z

Here is another solution using the grep, shuf and paste commands:

shuffle1-1.sh

    #!/usr/bin/env bash
    input=$1

    if [ $# -eq 0 ]
    then
        echo "must provide a file as 1st parameter..."
        exit 1
    fi

    # split data between pos and neg values and shuffle them
    # in temporary files
    # (note: the pattern matches "-1" anywhere in the line)
    grep -v "\-1" "$input" | shuf > tmp_subset1
    grep "\-1" "$input" | shuf > tmp_subsetm1

    # alternate 1 and -1 line
    paste -d"\n" tmp_subset1 tmp_subsetm1

    # cleanup
    rm tmp_subset1
    rm tmp_subsetm1

output

    # ./shuffle1-1.sh test.data
    1,x,y
    -1,t,r
    1,r,t
    -1,e,t
    # ./shuffle1-1.sh test.data
    1,x,y
    -1,e,t
    1,r,t
    -1,t,r
    # cat test.data
    1,x,y
    -1,t,r
    -1,e,t
    1,r,t

If your file does not have the same number of lines with 1 and -1, adding | grep 1 at the end should get rid of the blank lines:

    # ./shuffle1-1.sh test.data2
    1,z,z
    -1,e,t
    1,x,y
    -1,t,r
    1,r,t

    1,Z,Z

    # ./shuffle1-1.sh test.data2 | grep 1
    1,r,t
    -1,t,r
    1,x,y
    -1,e,t
    1,z,z
    1,Z,Z
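Filtering out the blank lines still leaves the surplus rows of the larger class next to each other at the end. If strict alternation matters more than keeping every row, one option is to cut both subsets to the size of the smaller one before pasting. The following is only a sketch under some assumptions: the indicator is the first column, the sample rows are invented, and shuffling is left out so that the output is reproducible (shuf could be added to the two grep pipelines as in the script above).

```shell
#!/bin/sh
# Sketch: keep strict alternation when the two classes differ in size.
# Assumes the indicator is the first column; sample rows are illustrative.
workdir=$(mktemp -d)
printf '%s\n' '1,x,y' '-1,t,r' '-1,e,t' '1,r,t' '1,z,z' '1,Z,Z' \
    > "$workdir/test.data2"

grep '^1,'  "$workdir/test.data2" > "$workdir/pos"
grep '^-1,' "$workdir/test.data2" > "$workdir/neg"

# Row count of the smaller subset.
npos=$(( $(wc -l < "$workdir/pos") ))
nneg=$(( $(wc -l < "$workdir/neg") ))
if [ "$npos" -lt "$nneg" ]; then n=$npos; else n=$nneg; fi

# Truncate both subsets to n rows, then interleave them line by line.
head -n "$n" "$workdir/pos" > "$workdir/pos.cut"
head -n "$n" "$workdir/neg" > "$workdir/neg.cut"
balanced=$(paste -d '\n' "$workdir/pos.cut" "$workdir/neg.cut")

printf '%s\n' "$balanced"
# prints:
# 1,x,y
# -1,t,r
# 1,r,t
# -1,e,t
rm -rf "$workdir"
```

The two surplus rows (1,z,z and 1,Z,Z) are discarded in this variant.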

Good stuff, I didn't know about `paste -d"\n"`. Good luck to all.

Aaron Okano · Accepted Answer · 2014-09-06 19:54:30Z

Here's a one-liner:

    paste -d"\n" <( grep '^1,' test.txt ) <( grep '^-1,' test.txt )
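The patterns '^1,' and '^-1,' assume the indicator is the first column. The question only says it is "one of the columns", so here is a hedged variant that selects rows by field number with awk instead. The column number col=3 and the sample rows are assumptions made for illustration, and temporary files replace the process substitution so that the sketch also runs in a plain POSIX shell.

```shell
#!/bin/sh
# Sketch: split on a chosen column instead of the start of the line.
# col=3 and the sample rows are hypothetical.
workdir=$(mktemp -d)
printf '%s\n' 'x,y,1' 't,r,-1' 'e,t,-1' 'r,t,1' > "$workdir/test.txt"
col=3

# -F, splits each line on commas; $c is the field holding the indicator.
awk -F, -v c="$col" '$c == 1'  "$workdir/test.txt" > "$workdir/pos"
awk -F, -v c="$col" '$c == -1' "$workdir/test.txt" > "$workdir/neg"

# Interleave: one row from pos, then one row from neg.
bycol=$(paste -d '\n' "$workdir/pos" "$workdir/neg")

printf '%s\n' "$bycol"
# prints:
# x,y,1
# t,r,-1
# r,t,1
# e,t,-1
rm -rf "$workdir"
```

A plain comma split like this does not handle quoted fields that contain commas; such data would need a real CSV parser.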
