Most open source datasets are well formatted, i.e. each email message is cleanly separated, as in the Enron email dataset. In the real world, however, it is very hard to separate the top email message from the rest of an email thread.
For example, consider the message below.
Hi,
Can you offer me a better discount.
Thanks,
Mr.X
Customer Relations.
---- On Wed, 10 May 2017 04:05:16 -0700 [email protected] wrote ------
Hello Mr.X,
Does the below work out. Do let us know your thoughts.
Thanks,
Mr.Y
Sales Manager.

The reason we want to split the emails is that we want to run sentiment analysis on each message. If we fail to split the thread, the quoted reply is scored together with the top message and the results will be wrong.
I searched around and found this very comprehensive research paper. I also found an implementation by Mailgun called Talon. Unfortunately, it does not work well for certain kinds of patterns.
For example, it struggles when the second message in the email thread is introduced by

---------- Forwarded message ----------

instead of the separator shown above:

---- On Wed, 10 May 2017 04:05:16 -0700 [email protected] wrote ------

My question: many people doing this kind of work must have run into the same problem, yet the area still seems poorly covered. Is there a solid implementation of the paper, or anything else, that splits email threads reliably?
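For context, here is a minimal sketch of the naive regex approach I would like to improve on. It assumes only the two separator styles mentioned above, and the names `SEPARATORS` and `top_message` are made up for illustration. It is not how Talon or the paper does it, and it will miss other styles, such as a plain "On ... wrote:" line without dashes.

```python
import re

# Hypothetical separator patterns: only the two styles from this question.
SEPARATORS = [
    # e.g. "---- On Wed, 10 May 2017 04:05:16 -0700 someone wrote ------"
    re.compile(r"-{2,}\s*On\s.+?\swrote\s*-{2,}", re.IGNORECASE),
    # e.g. "---------- Forwarded message ----------"
    re.compile(r"-{2,}\s*Forwarded message\s*-{2,}", re.IGNORECASE),
]


def top_message(body: str) -> str:
    """Return the text that precedes the earliest known separator.

    If no separator is found, the whole body is returned unchanged
    (apart from surrounding whitespace).
    """
    cut = len(body)
    for pattern in SEPARATORS:
        match = pattern.search(body)
        if match:
            # Keep the earliest separator position seen so far.
            cut = min(cut, match.start())
    return body[:cut].strip()


if __name__ == "__main__":
    # Placeholder address; the real one is not part of the example.
    thread = (
        "Hi, Can you offer me a better discount. Thanks, Mr.X "
        "Customer Relations. ---- On Wed, 10 May 2017 04:05:16 -0700 "
        "sender@example.com wrote ------ Hello Mr.X, Does the below work out."
    )
    print(top_message(thread))
```

Every new mail client format needs another hand-written pattern here, which is exactly why I am looking for something more robust.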