53

I have an email dump of around 400 MB. I want to split it into .txt files, with one mail per file. Every e-mail starts with the standard HTML header specifying the doctype.

This means I will have to split the file based on that header. How do I go about it in Linux?

2 Comments
  • Is that really an email dump? You mean you have no mail headers at all? And what do you call the "standard HTML header specifying the doctype"? Commented Dec 17, 2011 at 10:52
  • "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\"><html><head> <xmeta content=\"text/html;charset=ISO-8859-1\" http-equiv=\"Content-Type\"> This is followed by the entire e-mail! Commented Dec 17, 2011 at 10:53

5 Answers

91

If you have a mail.txt

$ cat mail.txt
<html>
mail A
</html>
<html>
mail B
</html>
<html>
mail C
</html>

run csplit to split by <html>

$ csplit mail.txt '/^<html>$/' '{*}'

- mail.txt    => input file
- /^<html>$/  => pattern match every `<html>` line
- {*}         => repeat the previous pattern as many times as possible

check output

$ ls
mail.txt  xx00  xx01  xx02  xx03

If you want to do it in awk:

$ awk '/<html>/{filename=NR".txt"}; {print >filename}' mail.txt
$ ls
1.txt  5.txt  9.txt  mail.txt

4 Comments

I'm afraid I did the same, and $ ls gave me just mail.txt and xx00; obviously mail.txt was the same as xx00. Any fixes?
@Ramprakash My csplit version is 8.5. Maybe yours doesn't have {*}, which repeats the pattern; please check the manpage. I've just added an awk solution, you can try it.
@Greenhorn My version of csplit also didn’t support {*}, but this worked: csplit -n 6 -f 'mail-' -k mail.txt '/^<html>$/' '{5000}'
To prevent an awk error if the first line doesn't match the pattern (for gawk at least), do: awk 'BEGIN {filename="0.txt"} /...'
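Putting that fix together with the awk one-liner from the answer, the full command would look roughly like this (an untested sketch; output naming follows the same NR-based scheme):

$ awk 'BEGIN {filename="0.txt"} /<html>/{filename=NR".txt"}; {print >filename}' mail.txt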
3

csplit is the best solution to this problem. Just thought I'd post a bash-solution to show that there is no need to go perl on this task:

#!/usr/bin/bash
MAIL='mail' # path to huge mail-file

# get line numbers for all headers
line_no=$(grep -n html $MAIL | cut -d: -f1)
read -a LINES <<< $line_no

file=0
for i in $(seq 0 2 ${#LINES[@]}); do
    start=${LINES[i]}
    end=$((${LINES[i+1]}-1))
    echo $start, $end
    sed -n "${start},${end}p" $MAIL > ${MAIL}${file}.txt
    file=$((file+1))
done

1 Comment

In the seq command, I don't know why a step-width of 2 was chosen. I changed it to 1 in order to work for me.
2

The csplit program solves your problem elegantly:

csplit '/<!DOCTYPE.*/' $FILE 

1 Comment

The arguments are in the wrong order, and the repetition needed to actually do as intended is missing.
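A working invocation along the lines of the accepted answer (a sketch, not from this answer; $FILE is the dump, and '{*}' repeats the pattern for every DOCTYPE line):

csplit $FILE '/<!DOCTYPE/' '{*}'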
1

I agree with fge: with perl it would be a lot simpler. You can try something like this:

#!/usr/bin/perl
undef $/;
$_ = <>;
$n = 0;

for $match (split(/(?=HEADER_FORMAT)/)) {
    open(O, '>mail' . ++$n);
    print O $match;
    close(O);
}

Replace HEADER_FORMAT with your header type.

1 Comment

Yep, a positive lookahead would work nicely, especially since here the header does not contain any metacharacter. You could even use qr// to build the split regex.
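For illustration, a rough sketch of that qr// approach using the DOCTYPE line quoted in the question (file names and variable names here are made up, and the code is untested):

#!/usr/bin/perl
use strict;
use warnings;

undef $/;          # slurp the whole dump
my $dump = <>;

# Quote the header literally and split on a positive lookahead,
# so each piece still begins with its header.
my $doctype = '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">';
my $header  = qr/(?=\Q$doctype\E)/;

my $n = 0;
for my $match (split $header, $dump) {
    next unless length $match;                 # nothing before the first header
    open my $out, '>', 'mail' . ++$n . '.txt' or die $!;
    print $out $match;
    close $out;
}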
1

It is doable with some perl "magic"... Many people would call this ugly but here goes.

The trick is to replace $/ with what you want and read your input, as such:

#!/usr/bin/perl -W
use strict;

my $i = 1;
$/ = <<EOF;
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head> <xmeta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
EOF

open INPUT, "/path/to/inputfile" or die;
while (my $mail = <INPUT>) {
    $mail = substr($mail, 0, index($mail, $/));
    open OUTPUT, ">/path/to/emailfile." . $i . ".txt" or die;
    $i++;
    print OUTPUT $mail;
    close OUTPUT;
}

edit: fixed, I always forget that $/ is included in the input. Also, the first file will always be empty, but then it can be easily handled.
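One simple way to handle that always-empty first file (my own sketch, not part of the original answer) is to skip empty records right after stripping the separator:

$mail = substr($mail, 0, index($mail, $/));
next unless length $mail;   # skip the empty record before the first header

That way no zero-length output file gets created for the (possibly empty) text before the first header.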

