53

I have an email dump of around 400 MB. I want to split it into .txt files, with one mail per file. Every e-mail starts with the standard HTML header specifying the doctype.

This means I will have to split the file based on that header. How do I go about it in Linux?

2 Comments
  • Is that really an email dump? You mean you have no mail headers at all? And what do you call the "standard HTML header specifying the doctype"? Commented Dec 17, 2011 at 10:52
  • "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\"><html><head> <xmeta content=\"text/html;charset=ISO-8859-1\" http-equiv=\"Content-Type\"> This is followed by the entire e-mail! Commented Dec 17, 2011 at 10:53

5 Answers

91

If you have a mail.txt

$ cat mail.txt
<html>
mail A
</html>
<html>
mail B
</html>
<html>
mail C
</html>

run csplit to split by <html>

$ csplit mail.txt '/^<html>$/' '{*}'

- mail.txt    => input file
- /^<html>$/  => pattern match every `<html>` line
- {*}         => repeat the previous pattern as many times as possible

check output

$ ls
mail.txt  xx00  xx01  xx02  xx03

If you want to do it in awk:

$ awk '/<html>/{filename=NR".txt"}; {print >filename}' mail.txt
$ ls
1.txt  5.txt  9.txt  mail.txt

4 Comments

I'm afraid I did the same, and $ ls gave me just mail.txt and xx00; obviously mail.txt was the same as xx00. Any fixes?
@Ramprakash My csplit version is 8.5. Maybe yours doesn't have {*}, which repeats the pattern; please check the manpage. I've just added an awk solution, you can try it.
@Greenhorn My version of csplit also didn’t support {*}, but this worked: csplit -n 6 -f 'mail-' -k mail.txt '/^<html>$/' '{5000}'
To prevent an awk error if the first line doesn't match the pattern (for gawk at least), do: awk 'BEGIN {filename="0.txt"} /...'
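Putting that fix together with the awk one-liner from the answer, the full command would look roughly like this (an untested sketch; output naming follows the same NR-based scheme):

$ awk 'BEGIN {filename="0.txt"} /<html>/{filename=NR".txt"}; {print >filename}' mail.txt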
3

csplit is the best solution to this problem. Just thought I'd post a bash-solution to show that there is no need to go perl on this task:

#!/usr/bin/bash
MAIL='mail' # path to huge mail-file

# get line numbers for all headers
line_no=$(grep -n html $MAIL | cut -d: -f1)
read -a LINES <<< $line_no

file=0
for i in $(seq 0 2 ${#LINES[@]}); do
    start=${LINES[i]}
    end=$((${LINES[i+1]}-1))
    echo $start, $end
    sed -n "${start},${end}p" $MAIL > ${MAIL}${file}.txt
    file=$((file+1))
done

1 Comment

In the seq command, I don't know why a step-width of 2 was chosen. I changed it to 1 in order to work for me.
2

The csplit program solves your problem elegantly:

csplit '/<!DOCTYPE.*/' $FILE 

1 Comment

The arguments are in the wrong order, and the repetition needed to actually do as intended is missing.
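A working invocation along the lines of the accepted answer (a sketch, not from this answer; $FILE is the dump, and '{*}' repeats the pattern for every DOCTYPE line):

csplit $FILE '/<!DOCTYPE/' '{*}'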
1

I agree with fge: with perl it would be a lot simpler. You can try something like this:

#!/usr/bin/perl
undef $/;
$_ = <>;
$n = 0;

for $match (split(/(?=HEADER_FORMAT)/)) {
    open(O, '>mail' . ++$n);
    print O $match;
    close(O);
}

Replace HEADER_FORMAT with your header type.

1 Comment

Yep, a positive lookahead would work nicely, especially since here the header does not contain any metacharacter. You could even use qr// to build the split regex.
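For illustration, a rough sketch of that qr// approach using the DOCTYPE line quoted in the question (file names and variable names here are made up, and the code is untested):

#!/usr/bin/perl
use strict;
use warnings;

undef $/;          # slurp the whole dump
my $dump = <>;

# Quote the header literally and split on a positive lookahead,
# so each piece still begins with its header.
my $doctype = '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">';
my $header  = qr/(?=\Q$doctype\E)/;

my $n = 0;
for my $match (split $header, $dump) {
    next unless length $match;                 # nothing before the first header
    open my $out, '>', 'mail' . ++$n . '.txt' or die $!;
    print $out $match;
    close $out;
}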
1

It is doable with some perl "magic"... Many people would call this ugly but here goes.

The trick is to replace $/ with what you want and read your input, as such:

#!/usr/bin/perl -W
use strict;

my $i = 1;
$/ = <<EOF;
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head> <xmeta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
EOF

open INPUT, "/path/to/inputfile" or die;
while (my $mail = <INPUT>) {
    $mail = substr($mail, 0, index($mail, $/));
    open OUTPUT, ">/path/to/emailfile." . $i . ".txt" or die;
    $i++;
    print OUTPUT $mail;
    close OUTPUT;
}

edit: fixed, I always forget that $/ is included in the input. Also, the first file will always be empty, but then it can be easily handled.
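One simple way to handle that always-empty first file (my own sketch, not part of the original answer) is to skip empty records right after stripping the separator:

$mail = substr($mail, 0, index($mail, $/));
next unless length $mail;   # skip the empty record before the first header

That way no zero-length output file gets created for the (possibly empty) text before the first header.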

