I have a bash script that computes the maximum TPS (transactions per second) from an application's log file. The script works, but it takes several hours to run on files with millions of entries. The log entries follow this pattern:
    2015-11-01 14:34:20,969 TRACE [Thread-2868] [TrafficLogger] service transaction data
    2015-11-01 14:34:20,987 TRACE [Thread-2868] [TrafficLogger] service transaction data

The script loops over every possible hour:minute:second combination, greps for each one to count the matches, and compares each count against the previous highest to track the peak TPS:
    peak_tps=0
    for h in {00..23}; do
        for m in {00..59}; do
            for s in {00..59}; do
                tps=$(grep -c "${h}:${m}:${s}" $log_file)
                if [ "$tps" -gt "$peak_tps" ]; then
                    peak_tps=$tps
                fi
            done
        done
    done

This is the straightforward way to compute the max TPS, but I'm wondering if there's a way to optimize it, maybe using some heuristics about the input: (1) the input file is sorted by the timestamp; (2) it only contains entries for one day (i.e. the first column is constant).
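For scale, the inner grep runs once for every possible second of the day, and each run scans the entire multi-million-line file:

    # number of full passes over the log file made by the triple loop
    echo $((24 * 60 * 60))    # 86400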
I've tried a couple of things: (1) adding --mmap to grep; (2) pre-extracting the distinct timestamps and only searching for those:
    for timestamp in $(awk '{print $2}' $log_file | cut -d \, -f 1 | sort -u); do
        tps=$(grep --mmap -c "$timestamp" $log_file)
        ...
    done

Neither has yielded much improvement. I'm sure this is a classic test question, but I can't seem to find the answer. Can you guys help?
Regards!
Is this what you're looking for?

    cut -c12-19 $log_file | uniq -c | sort -rn | head -1

A commenter questioned the order of `uniq` and `sort` in that pipeline, since `uniq` only combines (and counts) matching adjacent lines, and suggested putting `sort` before `uniq`, as in `cut -c12-19 $log_file | sort -rn | uniq -c | head -1`. The reply: normally you would put `sort` before `uniq`, but this relies on the OP's claim that the input is already sorted, so `uniq` before `sort` works in this case.

The OP confirmed: "Ah, `uniq` before `sort` in this case ... and yes, that solution works for me. I'm used to using `sort -u` and didn't know about the `-c` option for `uniq`. I tried it and it now takes minutes vs. hours previously! Thanks so much! If you post the answer I'll select it."
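For reference, an equivalent single pass that does not depend on the input being sorted can be sketched in awk (this assumes, as in the sample entries above, that the timestamp is the second whitespace-separated field):

    # count lines per HH:MM:SS and keep the running maximum in a single pass
    awk '{
        sec = substr($2, 1, 8)            # "14:34:20,969" -> "14:34:20"
        if (++count[sec] > max) max = count[sec]
    } END { print max + 0 }' $log_file

Like the `cut | uniq -c | sort -rn` pipeline above, this reads the file exactly once, which is where the hours-to-minutes speedup comes from; the array of per-second counts stays small because one day has at most 86,400 distinct seconds.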