
I have a 40M+ csv file. One of the columns is a binary indicator (-1,1). I'd like to know if there is a linux command to create a new file that alternates rows with -1 and 1.

Old:

    1,x,y
    -1,t,r
    -1,e,t
    1,r,t

New:

    1,x,y
    -1,t,r
    1,r,t
    -1,e,t

It doesn't have to follow any particular logic about how the -1 and 1 rows are shuffled (it could be random), as long as it alternates one row of each. I'm on Ubuntu 12.04.

  • +1 for sample data and required output, but -1 for no code :-( What have you tried? Also note the low number of followers for many of your tags. Consider switching one of them to a programming language (awk/python/perl) or a shell (bash/ksh). Good luck. Commented Sep 6, 2014 at 16:41
  • Also, does the real data have an even distribution of -1 vs. (+)1 records? Good luck. Commented Sep 6, 2014 at 18:40

3 Answers


Here is a shell/awk solution. Not the most efficient, but given the speed of modern machines, shouldn't be an issue.

First, split the data into positive and negative values.

    awk '/^-/{print}' minus1Pos1data.txt > negsData.txt
    awk '/^[^-]/{print}' minus1Pos1data.txt > posData.txt

Now merge the two files, using an awk array to hold the first file. You can swap the file order if you want a negative number as the first record.

    awk 'pass==1{pos[FNR]=$0} pass==2{print pos[FNR]; print}' pass=1 posData.txt pass=2 negsData.txt > alternateRows.txt
    cat alternateRows.txt
    1,x,y
    -1,t,r
    1,r,t
    -1,e,t

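awk evaluates the variable assignments on the command line (pass=1, pass=2) as it reaches them in the argument list, so pass is 1 while the first file is read and 2 while the second file is read. Inside the awk code, the pass==1 and pass==2 tests select which block runs; only the block whose test is true is performed. Note that pass=1 is an assignment statement, while pass==1 is an equality test.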

First pass loads the first file into an array pos with the current file's record-number (FNR) as the key.

The second pass uses its current record number (FNR) to fetch the matching record from the pos array and print it. The bare print that follows could also be written print $0, which means print the current line (from the pass=2 file).
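To see the pass variable and FNR in action, here is a minimal sketch that only traces what awk sees for each line. The scratch directory and the two tiny input files are made up for the demonstration; they are not part of the answer above.

```shell
#!/usr/bin/env bash
# Hypothetical scratch files, created only for this demonstration.
workdir=$(mktemp -d)
printf '%s\n' '1,x,y' '1,r,t'   > "$workdir/posData.txt"
printf '%s\n' '-1,t,r' '-1,e,t' > "$workdir/negsData.txt"

# awk applies each var=value operand when it reaches it in the argument list,
# so "pass" is 1 while posData.txt is read and 2 while negsData.txt is read.
# FNR restarts at 1 for each file; NR keeps counting across both files.
trace=$(awk '{ print "pass=" pass, "NR=" NR, "FNR=" FNR, $0 }' \
    pass=1 "$workdir/posData.txt" pass=2 "$workdir/negsData.txt")
printf '%s\n' "$trace"

rm -r "$workdir"
```

Because FNR restarts for the second file, record 1 of negsData.txt lines up with pos[1], record 2 with pos[2], and so on, which is what makes the merge alternate.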

IHTH.


Here is another solution using the grep, shuf and paste commands:

shuffle1-1.sh

    #!/usr/bin/env bash
    input=$1
    if [ $# -eq 0 ]
    then
        echo "must provide a file as 1st parameter..."
        exit 1
    fi

    # split data between pos and neg values and shuffle them
    # in temporary files
    grep -v "\-1" "$input" | shuf > tmp_subset1
    grep "\-1" "$input" | shuf > tmp_subsetm1

    # alternate 1 and -1 line
    paste -d"\n" tmp_subset1 tmp_subsetm1

    # cleanup
    rm tmp_subset1
    rm tmp_subsetm1

output

    # ./shuffle1-1.sh test.data
    1,x,y
    -1,t,r
    1,r,t
    -1,e,t
    # ./shuffle1-1.sh test.data
    1,x,y
    -1,e,t
    1,r,t
    -1,t,r
    # cat test.data
    1,x,y
    -1,t,r
    -1,e,t
    1,r,t

If your file does not have the same number of lines with 1 and -1, adding | grep 1 at the end should get rid of the blank lines:

    # ./shuffle1-1.sh test.data2
    1,z,z
    -1,e,t
    1,x,y
    -1,t,r
    1,r,t

    1,Z,Z

    # ./shuffle1-1.sh test.data2 | grep 1
    1,r,t
    -1,t,r
    1,x,y
    -1,e,t
    1,z,z
    1,Z,Z
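Note that filtering out the blank lines leaves the surplus rows of the larger group next to each other, so the tail of the output no longer alternates. If strict alternation matters more than keeping every row, one option is to cut both groups down to the size of the smaller one before pasting. The following is only a sketch under assumptions: the indicator is the first column (as in the sample), the file names are invented, and shuf is left out so the result is repeatable (it could be added before each head).

```shell
#!/usr/bin/env bash
# Hypothetical sample with four "1" rows and two "-1" rows.
workdir=$(mktemp -d)
printf '%s\n' '1,x,y' '-1,t,r' '-1,e,t' '1,r,t' '1,z,z' '1,Z,Z' > "$workdir/test.data2"

# Split on the first column (assumed to be the indicator).
grep '^1,'  "$workdir/test.data2" > "$workdir/pos"
grep '^-1,' "$workdir/test.data2" > "$workdir/neg"

# Keep only as many rows from each side as the smaller side has.
npos=$(wc -l < "$workdir/pos")
nneg=$(wc -l < "$workdir/neg")
n=$(( npos < nneg ? npos : nneg ))
head -n "$n" "$workdir/pos" > "$workdir/pos.trim"
head -n "$n" "$workdir/neg" > "$workdir/neg.trim"

# Interleave the two equally sized files line by line.
balanced=$(paste -d '\n' "$workdir/pos.trim" "$workdir/neg.trim")
printf '%s\n' "$balanced"

rm -r "$workdir"
```

With this sample the two extra "1" rows are dropped, which may or may not be acceptable for the real data.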

1 Comment

Good stuff, I didn't know about `paste -d"\n"`. Good luck to all.

Here's a one-liner:

paste -d"\n" <( grep '^1,' test.txt ) <( grep '^-1,' test.txt ) 
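Since the question allows a random order, the same one-liner can be combined with shuf from the previous answer. This is a sketch rather than a tested recipe for a 40M+ file: it assumes the indicator is the first column and that both values occur equally often, and the sample file and the violations check are added purely for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sample file matching the question's layout.
workdir=$(mktemp -d)
printf '%s\n' '1,x,y' '-1,t,r' '-1,e,t' '1,r,t' > "$workdir/test.txt"

# Same one-liner, with each stream passed through shuf so the pairing is random.
# <( ... ) is process substitution, which needs bash (or zsh/ksh), not plain sh.
shuffled=$(paste -d '\n' \
    <(grep '^1,'  "$workdir/test.txt" | shuf) \
    <(grep '^-1,' "$workdir/test.txt" | shuf))
printf '%s\n' "$shuffled"

# Count rows that break the alternation: odd rows must be 1, even rows -1.
violations=$(printf '%s\n' "$shuffled" |
    awk -F, '(NR % 2 == 1 && $1 != 1) || (NR % 2 == 0 && $1 != -1)' | wc -l)

rm -r "$workdir"
```

A violations count of 0 means every odd row carries 1 and every even row carries -1; the same awk check can be pointed at the real output file as a sanity test.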
