I have a large email dataset in .txt format and want to feed it to LLMs (such as Gemini and ChatGPT) so they can answer questions based on the email content.
The token count for my email data is very high (~1M tokens for 1K emails), exceeding LLM context limits. Even after stripping the basic headers, a lot of low-information tokens remain, such as quoted email threads inside the message bodies.
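For context, my current header preprocessing looks roughly like this, using only Python's stdlib `email` module (the sample message is made up, and the ~4 characters/token figure is just a rough heuristic for English text):

```python
import email
from email import policy

raw = """From: alice@example.com
To: bob@example.com
Subject: Quarterly report
Date: Mon, 01 Jan 2024 10:00:00 +0000
X-Mailer: SomeClient 1.0
Message-ID: <abc123@example.com>

Hi Bob,

Please find the report attached.
"""

msg = email.message_from_string(raw, policy=policy.default)

# Keep only the informative headers; drop routing/metadata headers.
keep = ("From", "Subject", "Date")
slim = "\n".join(f"{h}: {msg[h]}" for h in keep if msg[h])
body = msg.get_body(preferencelist=("plain",)).get_content()
cleaned = slim + "\n\n" + body

# Rough token estimate (~4 characters per token).
est_tokens = len(cleaned) // 4
print(cleaned)
print("estimated tokens:", est_tokens)
```

This already removes a fair amount, but the bodies still carry quoted threads and signatures.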
I'm considering the following models/approaches:
- Signature removal: this Hugging Face BERT project does a good job, even though the model is trained on French.
- Quoted Text Identification (for earlier messages quoted inside an email thread)
- Stop Word Removal
- BERT-Based Transformers for Focused Preprocessing
- Summarization Transformers
- Extractive Summarization
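For the quoted-text part, I have so far only tried a simple heuristic along these lines (the regex patterns for reply headers are illustrative, not exhaustive, and would need per-language variants):

```python
import re

# Heuristic quoted-text stripper: drops ">"-quoted lines and common
# reply-header lines in English, French, and Portuguese.
REPLY_HEADER = re.compile(
    r"^(On .+ wrote:|Le .+ a écrit\s?:|Em .+ escreveu:)\s*$",
    re.IGNORECASE,
)

def strip_quoted(text: str) -> str:
    kept = []
    for line in text.splitlines():
        if line.lstrip().startswith(">"):      # quoted line from an earlier message
            continue
        if REPLY_HEADER.match(line.strip()):   # "On <date>, <name> wrote:" style line
            continue
        kept.append(line)
    return "\n".join(kept).strip()

msg = """Thanks, I'll review it today.

On Mon, Jan 1, 2024 Alice wrote:
> Please find the report attached.
> Let me know if anything is missing."""
print(strip_quoted(msg))  # -> "Thanks, I'll review it today."
```

This obviously breaks on less common quoting styles, which is why I am looking at model-based approaches instead.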
The challenge is doing all of this on mixed-language emails (parts in French, English, and Portuguese). Since this is a very common problem (for emails or any text), is there a Hugging Face project that addresses most of these points?
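To illustrate the mixed-language issue with stop word removal: each language needs its own list, so something per-language like the sketch below is needed (the tiny word sets here are illustrative only; in practice full lists, e.g. from NLTK's stopwords corpus, would be loaded for the detected language):

```python
# Minimal per-language stop-word filter. The word sets are deliberately
# tiny and illustrative; real lists come from NLTK, spaCy, etc.
STOP = {
    "en": {"the", "a", "is", "of", "and", "to"},
    "fr": {"le", "la", "est", "de", "et", "un"},
    "pt": {"o", "a", "é", "de", "e", "um"},
}

def remove_stopwords(text: str, lang: str) -> str:
    stop = STOP.get(lang, set())
    return " ".join(w for w in text.split() if w.lower() not in stop)

print(remove_stopwords("the report is ready", "en"))  # -> "report ready"
print(remove_stopwords("le rapport est prêt", "fr"))  # -> "rapport prêt"
```

This still requires detecting the language of each fragment first, which is exactly the part I would prefer an existing project to handle.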