How do I get just the text content from a multipart email?

Question

    #!/usr/bin/php -q
    <?php
    $savefile = "savehere.txt";
    $sf = fopen($savefile, 'a') or die("can't open file");
    ob_start();

    // read from stdin
    $fd = fopen("php://stdin", "r");
    $email = "";
    while (!feof($fd)) {
        $email .= fread($fd, 1024);
    }
    fclose($fd);

    // handle email
    $lines = explode("\n", $email);

    // empty vars
    $from = "";
    $subject = "";
    $headers = "";
    $message = "";
    $splittingheaders = true;

    for ($i=0; $i < count($lines); $i++) {
        if ($splittingheaders) {
            // this is a header
            $headers .= $lines[$i]."\n";

            // look out for special headers
            if (preg_match("/^Subject: (.*)/", $lines[$i], $matches)) {
                $subject = $matches[1];
            }
            if (preg_match("/^From: (.*)/", $lines[$i], $matches)) {
                $from = $matches[1];
            }
            if (preg_match("/^To: (.*)/", $lines[$i], $matches)) {
                $to = $matches[1];
            }
        } else {
            // not a header, but message
            $message .= $lines[$i]."\n";
        }

        if (trim($lines[$i])=="") {
            // empty line, header section has ended
            $splittingheaders = false;
        }
    }

    /*$headers is ONLY included in the result at the last section of my question here*/
    fwrite($sf,"$message");
    ob_end_clean();
    fclose($sf);
    ?>

That is an example of my attempt. The problem is I am getting too much in the file. Here is what is being written to the file: (I just sent a bunch of garbage to it as you can see)

    From xxxxxxxxxxxxx Tue Sep 07 16:26:51 2010
    Received: from xxxxxxxxxxxxxxx ([xxxxxxxxxxx]:3184 helo=xxxxxxxxxxx) by xxxxxxxxxxxxx with esmtpa (Exim 4.69) (envelope-from <xxxxxxxxxxxxxxxx>) id 1Ot4kj-000115-SP for xxxxxxxxxxxxxxxxxxx; Tue, 07 Sep 2010 16:26:50 -0400
    Message-ID: <EE3B7E26298140BE8700D9AE77CB339D@xxxxxxxxxxx>
    From: "xxxxxxxxxxxxx" <xxxxxxxxxxxxxx>
    To: <xxxxxxxxxxxxxxxxxxxxx>
    Subject: stackoverflow is helping me
    Date: Tue, 7 Sep 2010 16:26:46 -0400
    MIME-Version: 1.0
    Content-Type: multipart/alternative; boundary="----=_NextPart_000_0169_01CB4EA9.773DF5E0"
    X-Priority: 3
    X-MSMail-Priority: Normal
    Importance: Normal
    X-Mailer: Microsoft Windows Live Mail 14.0.8089.726
    X-MIMEOLE: Produced By Microsoft MimeOLE V14.0.8089.726

    This is a multi-part message in MIME format.

    ------=_NextPart_000_0169_01CB4EA9.773DF5E0
    Content-Type: text/plain; charset="iso-8859-1"
    Content-Transfer-Encoding: quoted-printable

    111
    222
    333
    444

    ------=_NextPart_000_0169_01CB4EA9.773DF5E0
    Content-Type: text/html; charset="iso-8859-1"
    Content-Transfer-Encoding: quoted-printable

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
    <HTML><HEAD>
    <META content=3Dtext/html;charset=3Diso-8859-1 =
    http-equiv=3DContent-Type>
    <META name=3DGENERATOR content=3D"MSHTML 8.00.6001.18939"></HEAD>
    <BODY style=3D"PADDING-LEFT: 10px; PADDING-RIGHT: 10px; PADDING-TOP: =
    15px"=20
    id=3DMailContainerBody leftMargin=3D0 topMargin=3D0 =
    CanvasTabStop=3D"true"=20
    name=3D"Compose message area">
    <DIV><FONT face=3DCalibri>111</FONT></DIV>
    <DIV><FONT face=3DCalibri>222</FONT></DIV>
    <DIV><FONT face=3DCalibri>333</FONT></DIV>
    <DIV><FONT face=3DCalibri>444</FONT></DIV></BODY></HTML>

    ------=_NextPart_000_0169_01CB4EA9.773DF5E0--

I found this while searching around, but I have no idea how to implement it, where to insert it in my code, or whether it would work.

    preg_match("/boundary=\".*?\"/i", $headers, $boundary);
    $boundaryfulltext = $boundary[0];

    if ($boundaryfulltext!="") {
        $find = array("/boundary=\"/i", "/\"/i");
        $boundarytext = preg_replace($find, "", $boundaryfulltext);
        $splitmessage = explode("--" . $boundarytext, $message);
        $fullmessage = ltrim($splitmessage[1]);
        preg_match('/\n\n(.*)/is', $fullmessage, $splitmore);

        if (substr(ltrim($splitmore[0]), 0, 2)=="--") {
            $actualmessage = $splitmore[0];
        } else {
            $actualmessage = ltrim($splitmore[0]);
        }
    } else {
        $actualmessage = ltrim($message);
    }

    $clean = array("/\n--.*/is", "/=3D\n.*/s");
    $cleanmessage = trim(preg_replace($clean, "", $actualmessage));

So, how can I get just the plain text area of the email into my file or script for further handling?

Thanks in advance. stackoverflow is great!

Is that the full email? It's missing the Content-Type: multipart/mixed header, which should specify what the boundary string is (which the code you found needs).
– Daniel Vandersluis, Commented Sep 7, 2010 at 20:15
That is just the part of the email that is saved to the file. That is as stripped down as I could get it using the first code example.
– Jimbo, Commented Sep 7, 2010 at 20:18
The boundary header is important to be able to parse your email as it specifies where each part of the email begins and ends. Without it, all you can do is guess, and you know what they say about assuming... ;) For instance, for your quoted email, there should be a header like: Content-Type: multipart/mixed; boundary="----=_NextPart_000_0163_01CB4EA5.46466520"
– Daniel Vandersluis, Commented Sep 7, 2010 at 20:19
Would the boundaries be the same coming from different PC-based email clients or the popular free email accounts?
– Jimbo, Commented Sep 7, 2010 at 20:24
I added the headers var to the file write and edited my question to add that info for you guys/gals...
– Jimbo, Commented Sep 7, 2010 at 20:32

Daniel Vandersluis · Accepted Answer · 2010-09-07 21:47:25Z

There are four steps that you will have to take in order to isolate the plain text part of your email body:

1. Get the MIME boundary string

We can use a regular expression to search your headers (let's assume they're in a separate variable, $headers):

    $matches = array();
    preg_match('#Content-Type: multipart\/[^;]+;\s*boundary="([^"]+)"#i', $headers, $matches);
    list(, $boundary) = $matches;

The regular expression will search for the Content-Type header that contains the boundary string, and then capture it into the first capture group. We then copy that capture group into variable $boundary.
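
The pattern above expects the boundary value to be wrapped in double quotes. MIME also permits an unquoted boundary token, so a slightly more tolerant variant might look like the sketch below. The function name find_mime_boundary is made up for this illustration and is not part of the original answer; it assumes PHP 7.1 or later (for the nullable return type) and returns null when no boundary is found.

```php
<?php
// Hypothetical helper (an assumption, not from the answer above): returns the
// MIME boundary string found in a block of headers, or null if there is none.
function find_mime_boundary(string $headers): ?string
{
    // Accept both boundary="quoted value" and boundary=bare-token forms.
    // \s* after the semicolon also covers a header folded onto the next line.
    $pattern = '#Content-Type:\s*multipart/[^;]+;\s*boundary=(?:"([^"]+)"|([^\s;]+))#i';

    if (preg_match($pattern, $headers, $m)) {
        // Group 1 is filled for the quoted form, group 2 for the bare form.
        return $m[1] !== '' ? $m[1] : $m[2];
    }

    return null;
}
```

Checking the result against null before splitting avoids the undefined-index notice that list(, $boundary) = $matches raises when the pattern does not match.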

2. Split the email body into segments

Once we have the boundary, we can split the body into its various parts (in your message body, the boundary will be prefaced by -- each time it appears). According to the MIME spec, everything before the first boundary should be ignored.

    $email_segments = explode('--' . $boundary, $message);
    array_shift($email_segments); // drop everything before the first boundary

This will leave us with an array containing all the segments, with everything before the first boundary ignored.
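
One detail the split leaves behind: the closing delimiter is the boundary followed by another --, so the final array element starts with -- and holds only whatever trails the message. A minimal sketch that drops it as well is shown here; split_mime_segments is a hypothetical name, and the code assumes the boundary string does not also occur inside a part's content.

```php
<?php
// Hypothetical helper: splits a multipart body into its parts, dropping the
// preamble and everything from the closing "--boundary--" delimiter onward.
function split_mime_segments(string $body, string $boundary): array
{
    $pieces = explode('--' . $boundary, $body);
    array_shift($pieces); // the preamble before the first boundary is ignored

    $segments = [];
    foreach ($pieces as $piece) {
        if (strncmp($piece, '--', 2) === 0) {
            break; // closing delimiter reached; the rest is epilogue
        }
        $segments[] = $piece;
    }

    return $segments;
}
```

For the sample email in the question this would yield two segments, the text/plain part and the text/html part.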

3. Determine which segment is plain text.

The segment that is plain text will have a Content-Type header with the MIME-type text/plain. We can now search each segment for the first segment with that header:

    foreach ($email_segments as $segment) {
        if (stristr($segment, "Content-Type: text/plain") !== false) {
            // We found the segment we're looking for!
        }
    }

Since what we're looking for is a constant, we can use stristr (which finds the first instance of a substring in a string, case insensitively) instead of a regular expression. If the Content-Type header is found, we've got our segment.
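
A substring search over the whole segment can also hit the phrase when it merely appears in a part's content. If that matters, one option is to look only at the segment's header block, which ends at the first blank line. The sketch below assumes segments shaped like the ones produced in step 2; find_plain_text_segment is a hypothetical name, and the nullable return type needs PHP 7.1 or later.

```php
<?php
// Hypothetical helper: returns the first segment whose own headers declare
// text/plain, or null when no such segment exists.
function find_plain_text_segment(array $segments): ?string
{
    foreach ($segments as $segment) {
        // Strip the line break that follows the boundary, then cut the
        // segment at the first blank line (CRLF or bare LF).
        $parts = preg_split('/\r?\n\r?\n/', ltrim($segment, "\r\n"), 2);

        // Only the header block ($parts[0]) is inspected, not the content.
        if (preg_match('/^Content-Type:\s*text\/plain/im', $parts[0])) {
            return $segment;
        }
    }

    return null;
}
```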

4. Remove any headers from the segment

Now we need to remove any headers from the segment we found, as we only want the actual message content. There are four MIME headers that can appear here: Content-Type as we saw before, Content-ID, Content-Disposition and Content-Transfer-Encoding. Headers are terminated by \r\n so we can use that to determine the end of the headers:

    $text = preg_replace('/Content-(Type|ID|Disposition|Transfer-Encoding):.*?\r\n/is', "", $segment);

The s modifier at the end of the regular expression makes the dot match newlines as well. .*? will collect as few characters as possible (i.e. everything up to the first \r\n); the ? is a lazy modifier on .*.
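
An alternative to stripping the headers one by one is to cut the segment at the first blank line, which is where a part's headers end. That also copes with bare \n line endings and with headers folded across lines. The sample email in the question declares Content-Transfer-Encoding: quoted-printable, so the sketch below decodes the body with PHP's built-in quoted_printable_decode() too. The name extract_part_body is hypothetical, and this is a rough illustration rather than a complete MIME parser (character sets, for instance, are not converted).

```php
<?php
// Hypothetical helper: returns the decoded content of a single MIME part.
function extract_part_body(string $segment): string
{
    // Headers and content are separated by the first blank line
    // (CRLF or bare LF).
    $parts = preg_split('/\r?\n\r?\n/', ltrim($segment, "\r\n"), 2);
    $headers = $parts[0];
    $body = isset($parts[1]) ? $parts[1] : '';

    // Undo the transfer encoding declared in this part's headers, if any.
    if (preg_match('/^Content-Transfer-Encoding:\s*quoted-printable/im', $headers)) {
        $body = quoted_printable_decode($body);
    } elseif (preg_match('/^Content-Transfer-Encoding:\s*base64/im', $headers)) {
        $body = base64_decode($body);
    }

    return trim($body);
}
```

With quoted-printable decoding applied, sequences such as =3D in the content come back as a plain = sign.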

And after this point, $text will contain your email message content.

So to put it all together with your code:

    <?php
    // read from stdin
    $fd = fopen("php://stdin", "r");
    $email = "";
    while (!feof($fd)) {
        $email .= fread($fd, 1024);
    }
    fclose($fd);

    $matches = array();
    preg_match('#Content-Type: multipart\/[^;]+;\s*boundary="([^"]+)"#i', $email, $matches);
    list(, $boundary) = $matches;

    $text = "";
    if (isset($boundary) && !empty($boundary)) // did we find a boundary?
    {
        $email_segments = explode('--' . $boundary, $email);

        foreach ($email_segments as $segment)
        {
            if (stristr($segment, "Content-Type: text/plain") !== false)
            {
                $text = trim(preg_replace('/Content-(Type|ID|Disposition|Transfer-Encoding):.*?\r\n/is', "", $segment));
                break;
            }
        }
    }

    // At this point, $text will either contain your plain text body,
    // or be an empty string if a plain text body couldn't be found.

    $savefile = "savehere.txt";
    $sf = fopen($savefile, 'a') or die("can't open file");
    fwrite($sf, $text);
    fclose($sf);
    ?>

I am beginning to understand, I think. So, to test, would I replace everything after // empty vars?
Not exactly. It depends on what you want to do (for instance you might want to continue splitting headers or collecting the "special" headers). My code expects that you'll have one block of text for headers and one for the message, but you could just replace $headers and $message in my code with $email which as per your code should contain the whole email.
AAAH, I don't understand! How can I implement this in my code example above so I can test it? Would I put your snippet before the file write? Then write $text instead of $message? I really appreciate your help AND PATIENCE with this beginner here.
I updated my code to read in the email (as per your code) and process it. My code snippet should work the way you want without having to make any modifications. If you want to do anything else with the email, I'll leave that to you (or you can ask another question here for further help).
Old post, but I thought I'd add a quick update from a bug I found. In step 4, I found that the regex would not match the multipart headers because they don't always have a carriage return after them. If you remove the '\r' in that preg, I believe it works for all cases (because if there is one, it will be caught by the '.*?'). So the new one looks like $text = trim(preg_replace('/Content-(Type|ID|Disposition|Transfer-Encoding):.*?\n/is', "", $segment));

Mladen · Answer · answered 2012-12-21 23:42, edited by Lance Roberts 2012-12-22 00:01:05Z

There is one answer here:

You need only to change these 2 lines:

    require_once('/path/to/class/rfc822_addresses.php');
    require_once('/path/to/class/mime_parser.php');

answered Dec 21, 2012 at 23:42 by Mladen
edited Dec 22, 2012 at 0:01 by Lance Roberts

1 Comment

Finlay Roelofs Over a year ago

@james.garriss not anymore (at the time of writing this comment)
