I have a large email dataset in .txt format and want to feed it to LLMs (such as Gemini and ChatGPT) so they can answer questions based on the email content.
The token count for my email data is very high (~1M tokens for 1K emails), exceeding LLM context limits. Even after stripping the basic headers, a lot of low-information tokens remain, such as quoted email threads inside the message bodies.
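For context, my current header preprocessing looks roughly like this, using only Python's stdlib `email` module (the sample message is made up, and the ~4 characters/token figure is just a rough heuristic for English text):

```python
import email
from email import policy

raw = """From: alice@example.com
To: bob@example.com
Subject: Quarterly report
Date: Mon, 01 Jan 2024 10:00:00 +0000
X-Mailer: SomeClient 1.0
Message-ID: <abc123@example.com>

Hi Bob,

Please find the report attached.
"""

msg = email.message_from_string(raw, policy=policy.default)

# Keep only the informative headers; drop routing/metadata headers.
keep = ("From", "Subject", "Date")
slim = "\n".join(f"{h}: {msg[h]}" for h in keep if msg[h])
body = msg.get_body(preferencelist=("plain",)).get_content()
cleaned = slim + "\n\n" + body

# Rough token estimate (~4 characters per token).
est_tokens = len(cleaned) // 4
print(cleaned)
print("estimated tokens:", est_tokens)
```

This already removes a fair amount, but the bodies still carry quoted threads and signatures.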
I'm considering the following models/approaches:
- Signature removal: this Hugging Face BERT project does a good job, even though the model is trained on French.
- Quoted Text Identification (for earlier messages quoted inside an email thread)
- Stop Word Removal
- BERT-Based Transformers for Focused Preprocessing
- Summarization Transformers
- Extractive Summarization
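For the quoted-text part, I have so far only tried a simple heuristic along these lines (the regex patterns for reply headers are illustrative, not exhaustive, and would need per-language variants):

```python
import re

# Heuristic quoted-text stripper: drops ">"-quoted lines and common
# reply-header lines in English, French, and Portuguese.
REPLY_HEADER = re.compile(
    r"^(On .+ wrote:|Le .+ a écrit\s?:|Em .+ escreveu:)\s*$",
    re.IGNORECASE,
)

def strip_quoted(text: str) -> str:
    kept = []
    for line in text.splitlines():
        if line.lstrip().startswith(">"):      # quoted line from an earlier message
            continue
        if REPLY_HEADER.match(line.strip()):   # "On <date>, <name> wrote:" style line
            continue
        kept.append(line)
    return "\n".join(kept).strip()

msg = """Thanks, I'll review it today.

On Mon, Jan 1, 2024 Alice wrote:
> Please find the report attached.
> Let me know if anything is missing."""
print(strip_quoted(msg))  # -> "Thanks, I'll review it today."
```

This obviously breaks on less common quoting styles, which is why I am looking at model-based approaches instead.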
The challenge is doing all of this on mixed-language emails (parts in French, English, and Portuguese). Since this is a very common problem (for emails or any text), is there a Hugging Face project that addresses most of these points?
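To illustrate the mixed-language issue with stop word removal: each language needs its own list, so something per-language like the sketch below is needed (the tiny word sets here are illustrative only; in practice full lists, e.g. from NLTK's stopwords corpus, would be loaded for the detected language):

```python
# Minimal per-language stop-word filter. The word sets are deliberately
# tiny and illustrative; real lists come from NLTK, spaCy, etc.
STOP = {
    "en": {"the", "a", "is", "of", "and", "to"},
    "fr": {"le", "la", "est", "de", "et", "un"},
    "pt": {"o", "a", "é", "de", "e", "um"},
}

def remove_stopwords(text: str, lang: str) -> str:
    stop = STOP.get(lang, set())
    return " ".join(w for w in text.split() if w.lower() not in stop)

print(remove_stopwords("the report is ready", "en"))  # -> "report ready"
print(remove_stopwords("le rapport est prêt", "fr"))  # -> "rapport prêt"
```

This still requires detecting the language of each fragment first, which is exactly the part I would prefer an existing project to handle.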