Use case: You've got a multi-GB log file covering a whole week, and you need to search it (for example with grep) for something that happened on Saturday. Making an educated guess, you assume that starting the search from the middle of the file will more than halve the processing time (since it's definitely not going to have to process the whole of the rest of the file) while not skipping any relevant data. Is this possible?
5 Answers
Assuming your data is in chronological order:

1. Get the size of the file by seeking to the end and doing ftell();
2. Divide that result by 2;
3. Use fseek() to seek to that location;
4. Seek to the beginning of the next line by calling getline() once;
5. Use strptime() to find out what date you're currently at;
6. Do a binary search, repeating steps 4 and 5 until you find the line you want (see the sketch below).
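This answer describes the loop in terms of C stdio calls; as a hedged illustration only, here is a rough bash equivalent of the same binary search, using dd to seek and GNU date -d to parse timestamps. The log format (a parseable timestamp as the first whitespace-separated field) and the 4 KiB read window are assumptions, not part of the original answer.

    #!/usr/bin/env bash
    # Sketch: binary-search a chronologically ordered log for the first
    # line at or after a target date, e.g. ./logseek.sh week.log 2013-10-05
    file=$1
    target=$(date -d "$2" +%s)

    lo=0
    hi=$(stat -c %s "$file")             # file size, as in steps 1-2

    while (( hi - lo > 4096 )); do       # stop once the window is small
        mid=$(( (lo + hi) / 2 ))
        # Read a window at the midpoint; drop the (likely partial) first
        # line and keep the next complete one, as in steps 4 and 5.
        line=$(dd if="$file" bs=1 skip="$mid" count=4096 2>/dev/null | sed -n 2p)
        [ -n "$line" ] || break          # window too small for a full line
        stamp=$(date -d "${line%% *}" +%s 2>/dev/null) || break
        if (( stamp < target )); then
            lo=$mid                      # date too early: keep second half
        else
            hi=$mid                      # date at/after target: first half
        fi
    done

    tail -c +"$((lo + 1))" "$file"       # print from byte offset lo onward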
- Don't forget the usual: compile, fix most of the most blatant errors in your code, compile, use. – peterph, Oct 9, 2013 at 13:00
You could use dd along the lines of:
    dd if=log skip=xK bs=1M

which would skip x * 1024 blocks of size 1M (2^20 bytes). See dd(1) for details on its handling of units.

If you'd like to automate the binary search, assuming your log has the usual format <date> [data], you can pipe the output to head -n 2, check the date at the beginning of the second line (which, under the reasonable assumption of "normally" long lines, will be complete) and decide which half you want.
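For instance, one manual bisection step might look like this (the file name and the skip value are placeholders, not from the original answer):

    dd if=log skip=2K bs=1M 2>/dev/null | head -n 2
    # The first line is probably cut off mid-line; read the date at the
    # start of the second, complete line, then repeat with a smaller or
    # larger skip depending on which half that date tells you to keep.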
- The question asks about checking that no "relevant data" has been missed, which requires also checking the current data to see whether you should seek backwards. – Chris Down, Oct 9, 2013 at 8:51
- @ChrisDown I don't think so: "assume that starting the search from the middle ... will more than halve the processing time (...) while not skipping any relevant data". The assumption is that throwing away the first half is OK. – peterph, Oct 9, 2013 at 8:55
- @peterph You're right, this is for exploration rather than automated processing. – l0b0, Oct 9, 2013 at 10:50
- I'm sure someone can find an easier way to express this, but to expand the dd usage: dd if=testfile skip=`du testfile | awk '{print $1/2}' | cut -d "." -f 1` | grep "error string" – sambler, Oct 9, 2013 at 11:52
- @sambler du measures the disk usage (which, for example, is affected by block size), not the file size. If you want to look at the file size, look at st_size. – Chris Down, Oct 10, 2013 at 3:38
Get the file size and divide by 2. Divide that by 1024 to get KiB. (Or by 1024 * 1024 to get MiB, etc.)

    ((fs = $(stat -c %s logfile) / 2 / 1024))

Skip and search:

    dd if=logfile bs=1024 skip=$fs | grep blahblah

You could expand on this further, if the logfile is very consistent in the amount of data per day, by adding a count= value to dd:

    ((cnt = $(stat -c %s logfile) / 5 / 1024))
    dd if=logfile bs=1024 skip=$fs count=$cnt | grep blahblah

That would pipe cnt * 1024 bytes of data starting at offset fs * 1024 bytes.
Wrap it all up in a script and do the piping outside the script, to grep, a temporary file or whatever you want.
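A minimal version of such a wrapper might look like this (the script name and the fixed halving are assumptions for illustration):

    #!/usr/bin/env bash
    # secondhalf.sh - write (roughly) the second half of a file to stdout.
    # Usage: ./secondhalf.sh logfile | grep blahblah
    ((fs = $(stat -c %s "$1") / 2 / 1024))
    dd if="$1" bs=1024 skip="$fs" 2>/dev/null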
Consider tail over less. tail has an option to offset by bytes. For instance, to retrieve the last 1000 KiB of data from the domainnames file and pipe it to grep, limiting the output to 50 matches:

    tail --bytes=1000K domainnames | grep --max-count=50 --fixed-strings --line-number 'yoga'

It's not very clear what exactly you want to do and what you mean by "process". For big files my favourite interactive program is less. It handles large files without problems. It can also skip to a particular percentage, e.g. using 30%. Furthermore, you can search using / and ?.
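For example (assuming GNU less; the 30% figure is arbitrary), you can open the file directly at a percentage and search from there:

    less +30p logfile    # 'p' jumps to the given percentage of the file;
                         # /pattern then searches forward, ?pattern backward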
- Searching in less takes an enormous amount of time, much slower than doing a grep for the exact same string (I tried). – l0b0, Oct 9, 2013 at 10:51
- If you know what you're looking for then grep is definitely the better choice. less, on the other hand, is interactive and it's easy to peek into the file without loading it entirely. – Marco, Oct 9, 2013 at 11:06