I have a dataset with 100 columns and 100k rows. For each row, how can I print the maximum value together with its row and column names if that maximum (e.g. 20.17 for g1) is more than twice the median of the remaining values (0.21 and 0.57)? The check should be performed separately for each row, and the median should be calculated from the remaining values only, excluding the maximum itself.

FYI: This has been answered before, but only for a small dataset with a few columns and rows.

sample input

    name  s1     s2    s3
    g1    20.17  0.21  0.57
    g2    0.19   0.19  94.0
    g3    0.15   0.21  0.26
    g4    0.09   0.19  0.16
    g5    0.019  0.19  0
    g7    2.28   0     0

sample output

    g1  s1  20.17
    g2  s3  94.0
    g7  s1  2.28

  • Your output doesn't seem to match your description. You say you only want to print the row name but also show a value. Should that be the value which is higher than the median? Should the median be calculated including this maximum value? Why are you changing 20.17 to 20? Is that a typo or do you want some sort of transformation? Please edit your question and clarify. Commented Jun 22, 2017 at 16:07
  • Yes, it's a typo. Sorry for the error. Commented Jun 22, 2017 at 17:26
  • No worries, we all make typos. But please edit your question and answer the other questions I asked as well. Commented Jun 22, 2017 at 17:27
  • Done. Please let me know if that's not clear. Thanks. Commented Jun 22, 2017 at 17:52

1 Answer

The question is tagged awk, but hopefully a Python solution will be useful as well.

Code:

    #!/usr/bin/env python
    import operator
    import sys

    # Plain text mode; the 'rU' flag is deprecated and was removed in Python 3.11.
    with open(sys.argv[1]) as f:
        header = next(f).split()
        for line in f:
            data = line.split()
            numbers = [float(i) for i in data[1:]]
            # Find the row maximum and its position, then drop it from the list.
            max_index, max_value = max(
                enumerate(numbers), key=operator.itemgetter(1))
            del numbers[max_index]
            # Median of the remaining values.
            half = len(numbers) >> 1
            numbers.sort()
            if len(numbers) % 2:
                median = numbers[half]
            else:
                median = sum(numbers[half-1:half+1]) / 2.0
            if max_value > median * 2:
                # header[0] is the "name" column, hence the +1 offset.
                print('{}\t{}\t{}'.format(
                    data[0], header[max_index+1], max_value))

Results:

    g1  s1  20.17
    g2  s3  94.0
    g5  s2  0.19
    g7  s1  2.28
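
As a cross-check, the same rule can be written with the standard library's `statistics.median`. This is only a minimal sketch and not part of the original answer: the function name `dominant_max`, the `factor` parameter and the inline sample rows are illustrative assumptions. It also shows why g5 is reported even though the question's sample output omits it: its maximum, 0.19, is more than twice the median of the remaining values (0.019 and 0), which is 0.0095.

```python
import statistics


def dominant_max(values, factor=2.0):
    """Return (index, value) of the row maximum if it exceeds `factor`
    times the median of the remaining values, otherwise None.

    `dominant_max` and `factor` are hypothetical names for this sketch.
    """
    if len(values) < 2:
        return None  # nothing left to take a median of
    # Index of the first occurrence of the largest value.
    max_index = max(range(len(values)), key=values.__getitem__)
    max_value = values[max_index]
    # All values except the maximum itself.
    rest = values[:max_index] + values[max_index + 1:]
    if max_value > factor * statistics.median(rest):
        return max_index, max_value
    return None


# Sample data copied from the question.
header = ["s1", "s2", "s3"]
rows = {
    "g1": [20.17, 0.21, 0.57],
    "g2": [0.19, 0.19, 94.0],
    "g3": [0.15, 0.21, 0.26],
    "g4": [0.09, 0.19, 0.16],
    "g5": [0.019, 0.19, 0.0],
    "g7": [2.28, 0.0, 0.0],
}

hits = []
for name, values in rows.items():
    hit = dominant_max(values)
    if hit is not None:
        index, value = hit
        hits.append((name, header[index], value))
        print("{}\t{}\t{}".format(name, header[index], value))
```

Keeping the test in a function makes it easy to try a different threshold by passing another `factor`.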

  • Thank you very much. What if I want to use the mean instead of the median? Would it be sum(numbers/2)? Commented Jun 23, 2017 at 10:14
  • mean = sum(numbers) / len(numbers) Commented Jun 23, 2017 at 13:44
  • Thanks a lot! One last thing: how can I change the condition to compare the maximum value with the next maximum value (20.17 > 0.57) instead of the median of the rest (0.21 and 0.57)? Commented Jun 23, 2017 at 14:49
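
The variant asked about in the last comment, comparing the maximum with the next-largest value instead of the median of the rest, could be sketched as follows. This is an assumption-laden illustration rather than the answerer's code: the helper name `max_vs_runner_up` is hypothetical, and the default `factor=2.0` assumes the "twice as high" rule still applies. Pass `factor=1.0` for a plain greater-than comparison.

```python
import heapq
import operator


def max_vs_runner_up(values, factor=2.0):
    """Return (index, value) of the maximum if it exceeds `factor` times
    the second-largest value, otherwise None.

    Hypothetical helper; the factor of 2 is an assumption carried over
    from the original question.
    """
    if len(values) < 2:
        return None  # no runner-up to compare against
    # The two largest (index, value) pairs, largest first.
    (max_index, max_value), (_, runner_up) = heapq.nlargest(
        2, enumerate(values), key=operator.itemgetter(1))
    if max_value > factor * runner_up:
        return max_index, max_value
    return None


print(max_vs_runner_up([20.17, 0.21, 0.57]))  # (0, 20.17)
print(max_vs_runner_up([0.15, 0.21, 0.26]))   # None
```

In the answer's loop, this check would replace the median calculation: call it on `numbers` before the maximum is deleted, and use the returned index with `header[index+1]` as before.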
