
Use case: You've got a multi-GB log file covering a whole week, and you need to search for something that happened on Saturday using, for example, grep. Making an educated guess, you assume that starting the search from the middle of the file will more than halve the processing time (since the search certainly won't have to scan the whole remainder of the file) while not skipping any relevant data. Is this possible?

5 Answers


Assuming your data is in chronological order:

  1. Get the size of the file by seeking to the end and calling ftell();
  2. Divide that result by 2;
  3. Use fseek() to seek to that location;
  4. Advance to the beginning of the next line by calling getline() once;
  5. Use strptime() to parse the date of the line you are now on;
  6. Do a binary search, repeating steps 4 and 5 until you find the line you want.
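The steps above can be sketched in shell as well, using dd in place of fseek() and plain string comparison in place of strptime(). This is only a sketch under assumptions the answer doesn't state: the log is sorted, one record per line, and each line starts with an ISO date ("YYYY-MM-DD ..."); the demo log and the 256-byte window are made up.

```shell
#!/usr/bin/env bash
# logseek: print an approximate byte offset just before the first line
# whose date is >= $2 in the sorted, date-prefixed log file $1.
logseek() {
    local logfile=$1 target=$2 lo=0 hi mid d
    hi=$(stat -c %s "$logfile")            # step 1: file size in bytes
    while (( hi - lo > 256 )); do          # stop once the window is small
        mid=$(( (lo + hi) / 2 ))           # steps 2-3: jump to the middle
        # Step 4: read a chunk there, drop the (likely partial) first
        # line, and take the date field of the next full line (step 5).
        d=$(dd if="$logfile" bs=1 skip="$mid" count=256 2>/dev/null |
            sed -n '2p' | cut -d' ' -f1)
        if [[ $d < $target ]]; then        # step 6: keep halving
            lo=$mid                        # date too early: drop first half
        else
            hi=$mid                        # date not too early: drop second half
        fi
    done
    echo "$lo"
}

# Demo on a throwaway week-long log:
tmp=$(mktemp)
for day in 01 02 03 04 05 06 07; do
    for i in $(seq 1 50); do
        printf '2013-10-%s 12:00:00 event %03d\n' "$day" "$i"
    done
done > "$tmp"
off=$(logseek "$tmp" 2013-10-05)
echo "grep can start at byte $off"
```

You could then run something like dd if="$tmp" bs=1 skip="$off" | grep pattern, touching only the tail of the file.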
  • Don't forget the usual: compile, fix the most blatant errors in your code, compile, use. Commented Oct 9, 2013 at 13:00

You could use dd along the lines of:

dd if=log skip=xK bs=1M 

which would skip x * 1024 blocks of size 1M (2^20). See dd(1) for details on its handling of units.

If you'd like to automate the binary search, assuming your log has the usual format <date> [data], you can pipe the output to head -n 2, check the date at the beginning of the second line (which, under the reasonable assumption of "normally" long lines, will be complete) and decide which half you want.
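A concrete toy probe of that kind might look like this; the skip and count values are made up for the tiny demo file, whereas on a real log you would keep bs=1M and halve the skip count at each step:

```shell
# Build a toy two-day log, then probe the date near its middle.
printf '2013-10-04 a\n2013-10-04 b\n2013-10-05 c\n2013-10-05 d\n' > log
# skip=30 lands mid-line, so the first line of the output is partial;
# the line printed by head -n 2 | tail -n 1 is guaranteed complete.
dd if=log bs=1 skip=30 count=64 2>/dev/null | head -n 2 | tail -n 1
```

Since the probed date here is 2013-10-05, a search for a 2013-10-04 entry would continue in the first half.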

  • The question asks about checking that no "relevant data" has been missed, which requires also checking the current data to see whether you should seek backwards. Commented Oct 9, 2013 at 8:51
  • @ChrisDown I don't think so: assume that starting the search from the middle ... will more than halve the processing time (...) while not skipping any relevant data. The assumption is that throwing away the first half is ok. Commented Oct 9, 2013 at 8:55
  • @peterph You're right, this is for exploration rather than automated processing. Commented Oct 9, 2013 at 10:50
  • I'm sure someone can find an easier way to express this but to expand the dd usage dd if=testfile skip=`du testfile | awk '{print $1/2}' | cut -d "." -f 1` | grep "error string" Commented Oct 9, 2013 at 11:52
  • @sambler du measures the disk usage (which, for example, is affected by block size), not the filesize. If you want to look at the file size, look at st_size. Commented Oct 10, 2013 at 3:38

Get the file size and divide it by 2. Divide that by 1024 to get KiB (or by 1024*1024 to get MiB, etc.)

((fs = $(stat -c %s logfile) / 2 / 1024)) 

Then skip and search:

dd if=logfile bs=1024 skip=$fs | grep blahblah 

You could expand on this further, if the logfile is very consistent in the amount of data per day, by adding a count= value to dd.

((cnt = $(stat -c %s logfile) / 5 / 1024))
dd if=logfile bs=1024 skip=$fs count=$cnt | grep blahblah 

That would pipe cnt * 1024 bytes of data starting at byte offset fs * 1024.

Wrap it all up in a script and do the piping outside the script, to grep, a temporary file, or whatever you want.
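A minimal version of such a wrapper might look like the following; the function name and the demo file are made up, and dd's stderr is silenced so only grep output reaches the pipe:

```shell
#!/usr/bin/env bash
# grep_half: run grep on only the second half of a (large) file.
# The name and usage are illustrative, not from the answer above.
grep_half() {
    local logfile=$1 pattern=$2 fs
    (( fs = $(stat -c %s "$logfile") / 2 / 1024 ))   # half the size, in KiB
    dd if="$logfile" bs=1024 skip="$fs" 2>/dev/null | grep -- "$pattern"
}

# Demo: 200 filler lines (~2.2 KiB) followed by the line we want.
tmp=$(mktemp)
{ yes 'early line' | head -n 200; echo 'needle in the second half'; } > "$tmp"
grep_half "$tmp" needle
```

Note that for files smaller than 2 KiB the computed skip is 0 and the whole file is searched, which is the safe fallback.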


Consider tail instead of less. tail has an option to offset by a number of bytes. For instance, to retrieve the last 1000 KiB of the domainnames file and pipe it to grep, limiting the output to 50 matches:

tail --bytes=1000K domainnames | grep --max-count=50 --fixed-strings --line-number 'yoga' 

It's not very clear what exactly you want to do and what you mean by "process". For big files my favourite interactive program is less. It handles large files without problems. It can also skip to a particular percentage, e.g. using 30%. Furthermore, you can search using / and ?.

  • Searching in less takes an enormous amount of time, much slower than doing a grep for the exact same string (I tried). Commented Oct 9, 2013 at 10:51
  • If you know what you're looking for then grep is definitely the better choice. less on the other hand is interactive and it's easy to peek into the file without loading it entirely. Commented Oct 9, 2013 at 11:06
