Use awk to select only lines with one field smaller than a threshold from a CSV file

I am dealing with the post-processing of a multi-column CSV file containing many (10000+) lines:

ID(Prot), ID(lig), ID(cluster), dG(rescored), dG(before), POP(before)
9000, lig662, 1, 0.421573, -7.8400, 153
10V2, lig807, 1, 0.42692, -8.0300, 149
3000, lig158, 1, 0.427342, -8.1900, 147
3001, lig158, 1, 0.427342, -8.1900, 147
10V2, lig342, 1, 0.432943, -9.4200, 137
10V1, lig807, 1, 0.434338, -8.0300, 147
4000, lig236, 1, 0.440377, -7.3200, 156
10V1, lig342, 1, 0.441205, -9.4200, 135
4000, lig497, 1, 0.442088, -7.7900, 148
9000, lig28, 1, 0.442239, -7.5200, 152
3001, lig296, 1, 0.444512, -7.8900, 146
10V2, lig166, 1, 0.447681, -7.1500, 157
....
4000, lig612, 1, 0.452904, -7.0200, 158
9000, lig123, 1, 0.461601, -6.8000, 160
10V1, lig166, 1, 0.463963, -7.1500, 152
10V1, lig369, 1, 0.465029, -7.3600, 148

What I did so far

I am using the following awk code, integrated into a bash function, which takes the top 1% of lines from the CSV and saves them as a new CSV with a correspondingly reduced number of lines:

take_top44 () {
    # Take the top lines from the initial CSV
    awk -v lines="$(wc -l < original.csv)" '
      BEGIN { top = int(lines/100) }
      FNR > top { exit }
      1
    ' original.csv >> csv_with_top_lines.csv
}

What I want to do now

How could I modify my awk code to apply a more selective filter to the original CSV? For example, how could I filter the data based on the value (a float) of the 4th column, dG(rescored)?

For example, I need to use the lowest value (which is always on the second line, minForth = 0.421573) as the reference and print all lines from the CSV where $4 is smaller than a selected threshold (say, 20% above minForth):

$4 <= (1+0.2)*min
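A minimal sketch of how this could look, assuming the separator ", " as in the sample above and that the lowest dG(rescored) really is on the second line; the function name filter_by_dG and the output file name csv_filtered_by_dG.csv are only placeholders, not part of the original script:

filter_by_dG () {
    # Keep the header plus every line whose 4th field is within 20% above the minimum
    awk -F ', ' '
      FNR == 1 { print; next }      # keep the header line
      FNR == 2 { min = $4 }         # lowest dG(rescored) is assumed to be on line 2
      $4 <= (1 + 0.2) * min         # print lines at or below the threshold
    ' original.csv > csv_filtered_by_dG.csv
}

Unlike take_top44, this selects by value rather than by line count, so the number of lines kept would depend on the data.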
