Use case: You've got a multi-GB log file covering a whole week, and you need to search it (for example with grep) for something that happened on Saturday. Making an educated guess, you assume that starting the search from the middle of the file will more than halve the processing time (since it's definitely not going to have to process the whole of the rest of the file) while not skipping any relevant data. Is this possible?
5 Answers
Assuming your data is in chronological order:

1. Get the size of the file by seeking to the end and doing ftell();
2. Divide that result by 2;
3. Use fseek() to seek to that location;
4. Seek to the beginning of the next line by calling getline() once;
5. Use strptime() to find out what date you're currently at;
6. Do a binary search, repeating steps 4 and 5 until you find the line you want (see the sketch below).
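This answer describes the loop in terms of C stdio calls; as a hedged illustration only, here is a rough bash equivalent of the same binary search, using dd to seek and GNU date -d to parse timestamps. The log format (a parseable timestamp as the first whitespace-separated field) and the 4 KiB read window are assumptions, not part of the original answer.

    #!/usr/bin/env bash
    # Sketch: binary-search a chronologically ordered log for the first
    # line at or after a target date, e.g. ./logseek.sh week.log 2013-10-05
    file=$1
    target=$(date -d "$2" +%s)

    lo=0
    hi=$(stat -c %s "$file")             # file size, as in steps 1-2

    while (( hi - lo > 4096 )); do       # stop once the window is small
        mid=$(( (lo + hi) / 2 ))
        # Read a window at the midpoint; drop the (likely partial) first
        # line and keep the next complete one, as in steps 4 and 5.
        line=$(dd if="$file" bs=1 skip="$mid" count=4096 2>/dev/null | sed -n 2p)
        [ -n "$line" ] || break          # window too small for a full line
        stamp=$(date -d "${line%% *}" +%s 2>/dev/null) || break
        if (( stamp < target )); then
            lo=$mid                      # date too early: keep second half
        else
            hi=$mid                      # date at/after target: first half
        fi
    done

    tail -c +"$((lo + 1))" "$file"       # print from byte offset lo onward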
- Don't forget the usual: compile, fix most of the most blatant errors in your code, compile, use. – peterph, Oct 9, 2013 at 13:00
You could use dd along the lines of:
    dd if=log skip=xK bs=1M

which would skip x * 1024 blocks of size 1M (2^20 bytes). See dd(1) for details on its handling of units.

If you'd like to automate the binary search, assuming your log has the usual format <date> [data], you can pipe the output to head -n 2, check the date at the beginning of the second line (which, under the reasonable assumption of "normally" long lines, will be complete) and decide which half you want.
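For instance, one manual bisection step might look like this (the file name and the skip value are placeholders, not from the original answer):

    dd if=log skip=2K bs=1M 2>/dev/null | head -n 2
    # The first line is probably cut off mid-line; read the date at the
    # start of the second, complete line, then repeat with a smaller or
    # larger skip depending on which half that date tells you to keep.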
- The question asks about checking that no "relevant data" has been missed, which requires also checking the current data to see whether you should seek backwards. – Chris Down, Oct 9, 2013 at 8:51
- @ChrisDown I don't think so: "assume that starting the search from the middle ... will more than halve the processing time (...) while not skipping any relevant data". The assumption is that throwing away the first half is OK. – peterph, Oct 9, 2013 at 8:55
- @peterph You're right, this is for exploration rather than automated processing. – l0b0, Oct 9, 2013 at 10:50
- I'm sure someone can find an easier way to express this, but to expand the dd usage: dd if=testfile skip=`du testfile | awk '{print $1/2}' | cut -d "." -f 1` | grep "error string" – sambler, Oct 9, 2013 at 11:52
- @sambler du measures the disk usage (which, for example, is affected by block size), not the file size. If you want to look at the file size, look at st_size. – Chris Down, Oct 10, 2013 at 3:38
Get the file size and divide by 2. Divide that by 1024 to get KiB. (Or by 1024 * 1024 to get MiB, etc.)

    ((fs = $(stat -c %s logfile) / 2 / 1024))

Skip and search:

    dd if=logfile bs=1024 skip=$fs | grep blahblah

You could expand on this further, if the logfile is very consistent in the amount of data per day, by adding a count= value to dd:

    ((cnt = $(stat -c %s logfile) / 5 / 1024))
    dd if=logfile bs=1024 skip=$fs count=$cnt | grep blahblah

That would pipe cnt * 1024 bytes of data starting at offset fs * 1024 bytes.
Wrap it all up in a script and do the piping outside the script, to grep, a temporary file or whatever you want.
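A minimal version of such a wrapper might look like this (the script name and the fixed halving are assumptions for illustration):

    #!/usr/bin/env bash
    # secondhalf.sh - write (roughly) the second half of a file to stdout.
    # Usage: ./secondhalf.sh logfile | grep blahblah
    ((fs = $(stat -c %s "$1") / 2 / 1024))
    dd if="$1" bs=1024 skip="$fs" 2>/dev/null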
Consider tail over less. tail has an option to offset by bytes. For instance, to retrieve the last 1000 KiB of data from the domainnames file and pipe it to grep, limiting the output to 50 matches:

    tail --bytes=1000K domainnames | grep --max-count=50 --fixed-strings --line-number 'yoga'

It's not very clear what exactly you want to do and what you mean by "process". For big files my favourite interactive program is less. It handles large files without problems. It can also skip to a particular percentage, e.g. using 30%. Furthermore, you can search using / and ?.
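For example (assuming GNU less; the 30% figure is arbitrary), you can open the file directly at a percentage and search from there:

    less +30p logfile    # 'p' jumps to the given percentage of the file;
                         # /pattern then searches forward, ?pattern backward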
- Searching in less takes an enormous amount of time, much slower than doing a grep for the exact same string (I tried). – l0b0, Oct 9, 2013 at 10:51
- If you know what you're looking for then grep is definitely the better choice. less, on the other hand, is interactive and it's easy to peek into the file without loading it entirely. – Marco, Oct 9, 2013 at 11:06