MarufZak/sher-search

Notes

LLM

The communication protocol with an LLM is simple. We provide the entire history of messages exchanged between the two parties: user messages and assistant (LLM) messages.

llm

There is also a system prompt for customization. LLMs are trained to follow system prompts more strongly than user messages, but they don't always comply.

llm

Different LLM providers, like OpenAI or Anthropic, have different communication schemas. The AI SDK we use in this project normalizes requests and responses according to each provider's spec.

```js
[
  { role: "system", content: "you are web searcher!" },
  { role: "user", content: "hello, what is your name?" },
  { role: "assistant", content: "my name is GPT 4o" },
];
```

LLMs also have reasoning capability: thinking before responding. The tokens consumed by this are called reasoning tokens.

llm

Because of reasoning, the communication schema becomes more complex, as there are multiple output parts from the LLM.

```js
[
  { role: "system", content: "you are web searcher!" },
  { role: "user", content: "hello, what is your name?" },
  {
    role: "assistant",
    content: [
      {
        type: "reasoning",
        text: "i don't have specific name, but my model is GPT 4o",
      },
      { type: "text", text: "my name is GPT 4o" },
    ],
  },
];
```

This schema also has upsides: it lets us send different inputs in a structured manner, for example a file as one part and summarization text as another.

tokens

Tokens are the currency of LLMs. There are input and output tokens, and output tokens are usually priced higher. We provide text, and it's broken into tokens by the encoding process, done by a tokenizer.

A tokenizer has a dictionary of text chunks known to it, and each chunk has a number assigned. So it's about compressing a large number of characters into a small number of integers.

```json
{ "Hello": 13225, "Sher": 25391, "Search": 10497, "!!!": 10880 }
```

tokenizer

The decoder just does the reverse job.
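As a toy sketch of encoding and decoding with the tiny dictionary above (real tokenizers like BPE use a learned dictionary with ~100k entries and a merge algorithm; this is purely illustrative):

```typescript
// Toy vocabulary from the example above; illustration only.
const vocab: Record<string, number> = {
  "Hello": 13225,
  "Sher": 25391,
  "Search": 10497,
  "!!!": 10880,
};

// Encode by greedily matching the longest known chunk at each position.
function encode(text: string): number[] {
  const tokens: number[] = [];
  let i = 0;
  while (i < text.length) {
    let match = "";
    for (const chunk of Object.keys(vocab)) {
      if (text.startsWith(chunk, i) && chunk.length > match.length) {
        match = chunk;
      }
    }
    if (match === "") throw new Error(`no token for input at position ${i}`);
    tokens.push(vocab[match]);
    i += match.length;
  }
  return tokens;
}

// Decoding is just the reverse lookup over an inverted dictionary.
const reverseVocab: Record<number, string> = Object.fromEntries(
  Object.entries(vocab).map(([chunk, id]) => [Number(id), chunk]),
);

function decode(tokens: number[]): string {
  return tokens.map((id) => reverseVocab[id]).join("");
}
```

Here `encode("HelloSher")` compresses 9 characters into just 2 numbers, which is exactly the point.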

tokenizer

Tokenizers are trained on text, just like LLMs. Given a sentence, the process splits it into characters and assigns numbers to them (1, ..., 10). The larger the token dictionary, the better the tokenization, because encoding splits text into fewer chunks. More tokens -> more space -> slower LLM. Common character groups also get tokenized together, for example "Hello Henry" has the common group "He". Unusual words take more tokens, obviously.

context window

An LLM's limit is its context window: the number of input and output tokens the LLM can see at a time. It's basically the developers' estimate of how far the LLM can perform well. It's good to switch threads to 'reset' the context window.

As stated in a Cornell University (arXiv) article, LLMs perform well with input tokens at the start and end of the context, and poorly in the middle. This problem is called "lost in the middle".

tools

Tools give the LLM real hands to act in the real world. A tool is extra information in the system prompt: a description and a schema for communicating with the tool, e.g. a JSON schema for its input and output. When the LLM sees a need for the tool, it emits a message requesting its invocation, and the invocation happens on behalf of the app. Afterwards, a success/error message is sent back to the LLM as part of the message history.
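A minimal sketch of this protocol, with a stubbed `searchWeb` tool (real SDKs, e.g. the AI SDK's `tool()` helper, wrap this same shape; the field names here are my own):

```typescript
// A tool as seen by the app: description + JSON schema + handler.
type Tool = {
  description: string; // injected into the system prompt
  inputSchema: object; // JSON schema the LLM's tool-call input must follow
  execute: (input: any) => Promise<unknown>; // runs on behalf of the app
};

const tools: Record<string, Tool> = {
  searchWeb: {
    description: "Search the web and return the top results.",
    inputSchema: {
      type: "object",
      properties: { query: { type: "string" } },
      required: ["query"],
    },
    // Stubbed here; the real implementation would call a search API.
    execute: async ({ query }) => [{ title: `results for ${query}` }],
  },
};

// When the LLM emits a tool-call message, the app dispatches it and
// appends the success/error result back into the message history.
async function handleToolCall(name: string, input: any) {
  const tool = tools[name];
  if (!tool) return { role: "tool", content: `error: unknown tool ${name}` };
  try {
    const result = await tool.execute(input);
    return { role: "tool", content: JSON.stringify(result) };
  } catch (err) {
    return { role: "tool", content: `error: ${err}` };
  }
}
```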

In fact, having too many tools decreases LLM performance, because the context window fills up with tool definitions and the lost-in-the-middle problem occurs. Six tools or fewer is a good rule of thumb.

agents

An agent is when the LLM decides which tools to invoke, and when to stop. It's a loop, in fact.
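The loop can be sketched like this, with `callLLM` and `runTool` as stand-ins for a real model call and tool dispatch (names are mine, not from any SDK):

```typescript
// Each step, the LLM either requests a tool call or produces the answer.
type Step =
  | { action: "tool"; name: string; input: unknown }
  | { action: "answer"; text: string };

async function runAgent(
  callLLM: (history: string[]) => Promise<Step>,
  runTool: (name: string, input: unknown) => Promise<string>,
  question: string,
  maxSteps = 5, // hard cap, like the 5-iteration limit used later in these notes
): Promise<string> {
  const history = [question];
  for (let step = 0; step < maxSteps; step++) {
    const next = await callLLM(history);
    if (next.action === "answer") return next.text; // the LLM chose to stop
    const result = await runTool(next.name, next.input);
    history.push(result); // tool output feeds the next iteration
  }
  return "step limit reached"; // forced stop when the LLM never answers
}
```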

agentic loop

There are also workflows. In comparison to agents, the next step is not determined by the LLM but is predetermined. Workflows are written in code.

workflows

This reminds me of a recent feature in some products that first defines a clear todo list and strictly follows it. It's about decomposing a task into smaller tasks that LLMs are better at, rather than doing everything in a single step.

Workflows are efficient when the path to solving the problem is known. Agents are better when the path is unknown and different approaches need to be tried.

Building effective agents

First of all, the architecture should be simple, and complexity should be added only when needed. Start with simple LLM calls and increase complexity according to your needs. Most of the time, the most effective agents are simple ones built from composable patterns.

The building block for an agent is the augmented LLM: an LLM with retrieval, tools, and memory. These can be wired up with MCP.

augmented llm

The first pattern is a workflow: prompt chaining. The next LLM call's input is the previous LLM call's output. It's also possible to add gates for programmatic checks. This is a good pattern when a task can be decomposed into multiple easier subtasks. The trade-off is latency for effectiveness.
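Prompt chaining with gates can be sketched as follows (`llm` is a stand-in for a real model call; the shape is mine):

```typescript
// A stage turns the previous output into the next prompt; an optional
// gate runs a programmatic check on the stage's output.
type ChainStage = {
  prompt: (input: string) => string;
  gate?: (output: string) => boolean;
};

async function runChain(
  llm: (prompt: string) => Promise<string>,
  stages: ChainStage[],
  input: string,
): Promise<string> {
  let current = input;
  for (const stage of stages) {
    current = await llm(stage.prompt(current));
    if (stage.gate && !stage.gate(current)) {
      throw new Error("gate check failed"); // stop the chain early
    }
  }
  return current; // the last stage's output is the final answer
}
```

Each extra stage adds one more LLM round-trip, which is exactly the latency-for-effectiveness trade-off.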

prompt chaining

The second pattern is also a workflow: routing. When a single prompt dictates everything to the LLM, it can be good for one kind of input while hurting others. Routing solves this by classifying the input and dispatching it to a dedicated handler. So when some tasks are better handled separately, this pattern is a good fit. Classification can be done by an LLM or by an algorithm. It's also a good idea to route questions into costly or cheap processes, each with its own handler.

prompt routing

The next pattern is parallelization: processing a single query with multiple LLM calls in parallel. This can be used when diverse outputs are needed, for speed, or for confidence in the results. It divides into two groups:

  1. Sectioning - breaking a single task into multiple tasks, processing them independently, then aggregating the results. A good example is complex tasks, where LLMs tend to do better with several easier subtasks. Guardrails, for instance: one LLM call does the query processing while another checks guardrails. Or automating evals, where different LLM calls evaluate different aspects of the response to the query.

  2. Voting - running the same task with multiple LLM calls. This is good for guardrails, for example: several LLM calls check aspects of the prompt, and the query is rejected if enough of them raise a red flag. It can also be used to evaluate a piece of code for different vulnerabilities, or to judge whether a prompt is on-topic. We can set thresholds to control false positives/negatives.
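The voting variant with a reject threshold can be sketched like this (checker functions stand in for parallel guardrail LLM calls):

```typescript
// Run all guardrail checkers in parallel; each returns true for a red flag.
// The threshold controls the false-positive / false-negative trade-off.
async function isQueryAllowed(
  checkers: Array<(query: string) => Promise<boolean>>,
  query: string,
  threshold = 2, // how many red flags it takes to reject
): Promise<boolean> {
  const flags = await Promise.all(checkers.map((check) => check(query)));
  const redFlags = flags.filter(Boolean).length;
  return redFlags < threshold;
}
```

Raising the threshold tolerates more disagreement among checkers (fewer false rejections); lowering it makes the guardrail stricter.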

parallel workflow

The next pattern is orchestrator-workers: an orchestrator LLM distributes tasks to worker LLMs. It's similar to parallelization, but the subtasks aren't predefined; the orchestrator defines them at runtime. Afterwards, all worker responses are synthesized into a single output. This is good for cases where the subtasks can't be determined in advance, such as coding products (already seen in some), and sher-search is also a good fit.

synthesizer workflow

For example coding products like Roo Code use this pattern.

synthesizer workflow example

The next pattern is evaluator-optimizer. It's a good fit when there are clear evaluation criteria and multiple iterations lead to better responses. Response quality can be measured by human or LLM feedback. It's a good fit for sher-search: an evaluator decides whether the response is acceptable and triggers another search if not.

evaluator optimizer

Evals

We no longer work with deterministic systems, where one input maps to one output, but with probabilistic systems, which are unpredictable. Writing tests for such a system is key to a good product. In deterministic systems it's straightforward to write a test: you provide the input and assert the expected output. But AI systems are very complex; it's more like predicting tomorrow's weather. Any change, even a small one, can make the system behave very differently. Regular tests are mandatory to know whether the app is improving or regressing.

Evals take their roots from search engines, where assertions weren't good enough because of diverse user inputs. That's when engineers understood that evals are better than assertions.

Manually testing AI systems is not a good approach: something might seem fine for one kind of input but break for another, and any change can affect the entire system. The way to go is automation with evals. Some testing criteria are: factuality (the output is true, based on facts), writing style, and prompt fidelity (whether the output corresponds to what the user requested). There is no binary 'true'/'false' output; quality is measured as a score (10%, 50%, ...).

There are different types of evals:

  1. Deterministic. You write assertions, e.g. that the output contains 'Okay, i will do it'.

  2. Human evaluation. When no data exists, a human evaluates the output manually. This is expensive and quite possibly inaccurate: for our domain we'd need to hire doctors to judge the outputs, and it's not certain they know the truth either if the prompt leads to deep research. It also takes a long time.

  3. LLM as a judge. An LLM judges the output produced by another LLM call. The judge gets the output along with some ground truth, because relying on the judge's training data alone is not reliable. It's still a costly approach. A simple classification model is enough, because it's just comparing pieces of text.

    llm as judge.
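A sketch of how a judge prompt can be assembled, pairing the model output with a ground truth so the judge only compares two texts (the wording is illustrative, not a specific library's template):

```typescript
// Build the prompt sent to the judge model. Supplying the ground truth
// means the judge doesn't need to rely on its own training data.
function buildJudgePrompt(
  question: string,
  output: string,
  groundTruth: string,
): string {
  return [
    "You are grading an answer against a ground truth.",
    `Question: ${question}`,
    `Submitted answer: ${output}`,
    `Ground truth: ${groundTruth}`,
    "Reply with a score from 0 to 1 and a one-sentence reason.",
  ].join("\n");
}
```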

At the end of the day, the real judges are the users, so a feedback loop is required, like like/dislike buttons. This is called the data flywheel, and it lets us build better evals for our system. There are different ways of collecting feedback: explicit feedback, or implicit feedback (if the user constantly rephrases the query, for example).

An AI-native data flywheel with a feedback loop looks like this: the user makes a query, the AI replies, and the user dislikes it and indicates so (or constantly rephrases the input). We are notified (or just inspect the traces), write an eval for this specific prompt, and test it. Evals are at the center of our application's quality: better evals improve the product, a better product brings more users, more users bring more data, more data brings more traces, more traces bring more observability and user feedback, and bad user feedback produces new evals. As a new system pattern emerges, or a new model is released, we can test it against our evals. So to make the product great, we need to put it in front of users.

data flywheel

from article

To keep evals effective, data quality is important. Trash in, trash out: high-quality data is needed.

Prompts

Prompt engineering is very important in our application, because the system prompt gives the constraints and boundaries for our system.

The first pattern is about planning. We should have a prompt that makes the LLM plan its work, in other words output its thoughts. This pattern is called 'thought generation'. One approach is chain-of-thought prompting: a study shows LLMs tackle complex reasoning tasks much better when a few chain-of-thought demonstrations are provided. A variant that provides no demonstrations is the zero-shot pattern: simply adding "Let's think step by step" also dramatically improves the LLM's output (article). Though this didn't outperform humans :D (article)

Another technique for improving the output is providing a list of valid websites for our system, simply because they are more reliable.

There is, in fact, a GitHub repo with leaked prompts of various systems. Link.

Changelog

Right now we have a chat interface to communicate with the LLM. The first step was to bind the interface to the LLM with simple text generation. This turned out to be bad UX, because the LLM can take some time to generate output and the user has to wait, so it's better to stream the response to the interface.

The next step was to get information from the web. There is an API for Google search results, called Serper, so we decided to use it. We get these results as structured JSON, but we need to feed them to the LLM. Luckily, LLMs support tool calls. We created a tool called searchWeb and told the LLM to always search the web. To keep token costs down, we limited the number of web results to 10, which is enough for now. There isn't just one iteration, though, because our AI is an agent, so we need to know when to stop the agentic loop. For now, to keep things simple, the number of iterations is capped at 5.

We decided to go with a medical deep-search implementation and a corresponding system prompt. With this in hand, it takes the LLM several steps to retrieve information and come back with links and a summary. The summary is for now only the snippet, not actual data processing. The UX also feels laggy, because tool calls take time.

Integrated tool calls with the UI; now it's possible to see the state of each tool call and its results, much better UX.

Now chat titles are no longer static but dynamic, taken from the user's initial message. Chats are persisted.

It's possible to offload the searchWeb tool call to search grounding, which is basically the same feature provided built-in by some LLMs, for example some of Google's models. The downside is that it's an abstraction we can't control or customize, which doesn't suit our application. It's better to do the searching ourselves.

Now messages are persisted in the database, with a relation to their chat.

As our application grows, it's harder to track everything, so we decided to add observability, which is a must-have for a production app anyway. We went with Langfuse, which integrates seamlessly with OTel. Now we can observe the application, including inputs, outputs, tool calls, token usage, query cost, and more, all grouped by chatId, which is basically a session.

There is a problem right now: if the user asks something that doesn't require a web search, the searchWeb tool is called anyway. This is wasteful and leads to bad UX. Our application isn't meant for generic questions like "How are you", but it's better to handle them anyway. Editing the system prompt worked, but the prompt now includes classifying the query into categories, which LLMs are not great at; this needs to be fixed in the future.

Here is the current prompt. We prioritize some websites, but it's not yet clear whether that's a good choice.

```
You are a Medical Deep Search Expert. Follow all rules:
1. First, classify the user's message:
   - If it is a medical or biomedical research question → you MUST execute a two-step tool chain:
     (1) Call 'searchWeb'.
     (2) Immediately call 'scrapePages' using 3–5 links returned from 'searchWeb'.
   - If it is NOT medical → DO NOT call any tools and reply that you only answer medical research questions.
2. Tool usage rules for medical questions:
   - Step 1: Call 'searchWeb' with the user's query.
   - Step 2: From the 'searchWeb' results, pick 3–5 links and call 'scrapePages' with those exact URLs.
   - You MUST perform both tool calls before producing any answer.
3. Your final medical answer must be based ONLY on:
   - the scraped content from 'scrapePages'
   - metadata from 'searchWeb'
   and must include all source links.
4. Structure medical outputs exactly as: Answer → Evidence Summary → Source Links.
5. Prioritize authoritative sources: PubMed, PMC, ClinicalTrials.gov, NIH, WHO, CDC, NICE.
6. If evidence is incomplete, unclear, or contradictory: Perform additional 'searchWeb' queries and then 'scrapePages' again. Never guess.
7. Never fabricate studies, numbers, mechanisms, or links. If no evidence exists, state it explicitly.
Non-medical questions must be answered normally and WITHOUT using any tools.
```

Sometimes the LLM responds that the question is not related to medicine even though it is. This was fixed by changing the model; gpt-4.1 nano doesn't seem to be a good fit.

There is now another tool available to the LLM, scrapePages, which scrapes pages and returns the results in markdown format for easy consumption by the LLM.

We switched to gpt-4.1 and wrote a prompt. A single user query cost ~$0.5, with ~181k input tokens alone. Because the model's context window was exceeded, an error was thrown and no output was generated. The scrapePages tool needs adjustments so it doesn't blow up the context window with so many tokens.

Because there is a new system prompt, more iterations (steps) are needed to complete the task efficiently, so the step count was increased to 10.

The LLM is now date-aware, because we include the date in the system prompt. The new prompt, with some other adjustments, looks like this:

```
You are a Medical Deep Search Expert. Today's date is {iso}.
You know that each 'searchWeb' result may include an optional 'date' field.
Your job is to always use the most recent, authoritative, and relevant medical evidence.
1. Classification:
   - If the message involves diseases, symptoms, drugs, treatments, diagnostics, biology, physiology, public health, or medical research → treat it as a MEDICAL QUESTION.
   - Only classify as non-medical if it is clearly unrelated.
   - When uncertain, assume it IS a medical question.
2. Mandatory tool chain for medical questions:
   - First call 'searchWeb' with the user's query.
   - Then call 'scrapePages' with 3–5 of the most relevant links returned from 'searchWeb'.
   - Do NOT produce the final answer until both tool calls have completed.
3. Date handling:
   - When ranking search results, prefer items with a 'date' field that is more recent.
   - If the user mentions "recent", "current", "latest", or similar, explicitly prioritize items from the last few years based on today's date (${iso}).
   - If a result has no 'date' field, treat it as usable but lower priority than dated results.
4. Evidence rules:
   - Base your medical answer ONLY on the scraped content and metadata from the tools.
   - Every medical answer must include direct URLs from 'searchWeb'.
5. Output format (strict):
   Answer: <concise conclusion>
   Evidence summary:
   - <bullet 1>
   - <bullet 2>
   Source links:
   - <URL 1>
   - <URL 2>
6. Source quality + honesty:
   - Prefer authoritative sources (PubMed, PMC, ClinicalTrials.gov, NIH, WHO, CDC, NICE).
   - Never invent studies, numbers, mechanisms, or links.
   - If evidence is weak / missing / contradictory, say so.
For non-medical questions, answer briefly without using tools and mention you specialize in medical research.
```

So our agentic loop looks like this:

agentic loop

Back to the scraper. It's a naive tool that extracts everything and treats it as main content, which it is not. There is another problem: because of the complexity of modern web development, the scraper may fail since it doesn't execute JS, which many websites rely on for loading, so an empty page is returned instead. We already hit the context-window limit because of this scraper problem. It might be better to integrate an external service like Jina AI, which specializes in this field.

We integrated Jina AI, but it's too slow; it even takes a couple of minutes sometimes, which is not acceptable...

Regarding patterns, some are already implemented in sher-search. For example:

  1. Augmented LLM: the LLM has access to tools such as searchWeb and scrapePages.
  2. Prompt chaining: the input to each LLM call is the output of the previous LLM call or a tool call.
  3. Prompt routing: there is already minimal routing, done by the LLM itself, classifying the question as medical or non-medical. For non-medical questions, the answer is discarded and a general answer is given.

routing

It's a bit difficult to decide right now among the other patterns like parallelization, orchestrator-workers, or evaluator-optimizer.

The first step toward evals is making a feedback loop accessible to the user. We implemented it with the ability to like or dislike a response, with an optional reason.

We integrated Firecrawl, which seems better than Jina AI: faster and more reliable. However, with the built-in LLM summarizer it takes about 20 seconds per crawl; without it, it is 2-3x faster. We decided to exclude links and images and use it without the built-in AI summarizer. The integration is complete, but the token size is still too big; some AI summarizer is indeed needed, maybe implemented ourselves.

We made an initial iteration of the Usage feature.

usage

Also made an iteration for rate-limit errors.

rate limited

It's time to write tests for our system. Even small changes can ripple through and change everything, including the output; that's why testing is essential. For a probabilistic system, evals are the right tool.

Now working on evals with a custom Factuality scorer, which is basically LLM as a judge.

Now that we have observability of the chats, we should choose success criteria to track how our system behaves over time. We need trackable metrics measured over time; our evals reflect the quality of our application. In the deep-search case, the criteria can be factuality, relevance (not answering a different question), speed, freshness, and inclusion of sources. These metrics shape the design of the system, and the scores become harder and harder to improve at higher levels. The metrics should have goals (for example, factuality should be 90%), though not necessarily when just starting the project.

Right now we set up the factuality score with a call to an LLM (for now gpt-5.1-nano), did some refactoring so our system is testable, and set up one test. Factuality currently turns out to be 60%.

eval

The reason for the score is also provided.

eval reason

Prompt for factuality check is taken from https://github.com/braintrustdata/autoevals/blob/5aa20a0a9eb8fc9e07e9e5722ebf71c68d082f32/templates/factuality.yaml.

We always want the LLM to emit the sources it got the data from, so this is another evaluation criterion. We made another scorer for this.

links scorer

We can also run multiple models and compare their outputs, so we added a Gemini model in parallel, a kind of A/B testing.

AB testing

One problem now is that LLM providers have rate limits, and once running tests hits the rate limit, users suffer the consequences.

At this point we can work on prompt engineering. I added a thinking step, because there are studies (referenced above) saying a zero-shot thinking step makes LLMs perform much better. We also can't run many evals, because requests get rate-limited :( And the context window fills up too quickly.

As the system grows, so does its complexity. As the system gets better, each further improvement becomes incrementally harder. It's relatively easy to achieve scores of 20-40%, but beyond that it gets much more difficult and expensive.

complexity staircase

Now that we have evals, providing longer ground truths seems to make them better: with a short ground truth the LLM judge picks a 60% match even when the answer closely matches the ground truth.

But for a production-ready app we need many more evals and a much better dataset. We can run 10-20 evals in dev, 20-200 in CI, and 500+ in regression to see how the app behaves. It's better to put critical and hard evals in dev, and once they pass, move them to CI or regression. Writing evals for our medical application requires experts from the medical field. We could use synthetic data from LLMs, but LLMs might rely on outdated or hallucinated information. One way to write good evals is multi-hop questions, where the LLM needs multiple hops to obtain the result, for example "What is the difference between observations of coronavirus in 2020 and 2025?", or "What are the similarities between coronavirus and the common cold?". Another way is dogfooding: asking questions about newly available products in dev.

We also made another scorer, RelevancyScorer, which tracks the relevancy of the answer to the question. It's taken from Mastra, an open-source TypeScript framework for building agents. It's interesting how they structure things there; the prompts are taken from the Mastra codebase.

The first step is to divide the output into statements.

statements

Then the statements are fed into another LLM call alongside the question, to test each statement's relevancy to the question. The verdicts are "yes", "no", and "unsure", mapped to 1, 0, and 0.5 respectively. The final score is the average.
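The verdict-to-score mapping and averaging step can be sketched as:

```typescript
// Per-statement verdicts from the judge LLM and their numeric values.
type Verdict = "yes" | "no" | "unsure";

const verdictScore: Record<Verdict, number> = { yes: 1, no: 0, unsure: 0.5 };

// Final relevancy score is the average over all statements.
function relevancyScore(verdicts: Verdict[]): number {
  if (verdicts.length === 0) return 0;
  const total = verdicts.reduce((sum, v) => sum + verdictScore[v], 0);
  return total / verdicts.length;
}
```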

relevancy

We can see that right now the Gemini model scores better on relevancy than GPT.

relevancy

If we think about our system right now, it's a bit off, because a single system prompt guides the entire application. If we want the application to behave differently in one place, for example the strategy for choosing URLs to scrape, changing the prompt affects the entire system. Not good.

system

Another problem is that our LLM decides whether to stop, unless max steps are reached (currently 10), and this single system prompt gets called again and again. Instead, we are going to move toward workflows. The strategy is to define max steps and make a workflow for the LLM to follow. Each action becomes a separate LLM call with its own system prompt, creating isolation between actions. If max steps is reached, we make one more LLM call that forces an answer to the question.

The benefits are the following, though some of them come later rather than instantly:

  1. Isolation. By isolating actions, we make them testable separately: we can create unit evals for each action, or end-to-end evals for the whole system. We can also choose the best model per action.
  2. Smart stop condition. By moving toward a workflow, WE regulate when to stop. We can add smart stop conditions like a token budget or a time limit, or both. This may also solve the overly large context window in our system.
  3. Context. We can customize the context passed to the next LLM call in the loop, for example by summarizing the history and feeding that in. Another way to solve the large-context problem.
  4. Parallelism. We can run actions in parallel and decrease the system's response time.
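The "smart stop condition" from point 2 can be sketched as a small budget tracker that halts the loop on whichever limit is hit first (the shape and names are mine):

```typescript
// Combined step, token, and wall-clock budget for the workflow loop.
type Budget = { maxSteps: number; maxTokens: number; maxMillis: number };

// Returns a checker to call once per step; true means "stop now".
function makeStopCheck(budget: Budget) {
  const startedAt = Date.now();
  let steps = 0;
  let tokensUsed = 0;
  return (tokensThisStep: number): boolean => {
    steps += 1;
    tokensUsed += tokensThisStep;
    return (
      steps >= budget.maxSteps ||
      tokensUsed >= budget.maxTokens ||
      Date.now() - startedAt >= budget.maxMillis
    );
  };
}
```

When the checker returns true, the workflow makes the final forced-answer LLM call instead of another action.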

The first step is to build a custom context where we make all the decisions and build the history of steps, scrapes, and queries. A good optimization is storing everything not as JSON but as markdown: JSON has many characters that consume extra tokens, and markdown is generally better for LLM output quality. Recommended by OpenAI and Anthropic.

```markdown
## Query: medical search query

2025-12-01 - Medical research article title
Medical research article snippet
https://example.com
```

For scraped pages we used XML tags, because the scraped page itself could be markdown (for example from Firecrawl, or some other fetched document).

```markdown
## Scrape: https://example.com

<scrape_result>
Long scraped page...
</scrape_result>
```

We added more system prompts and a function to call the LLM for choosing between the search, scrape, and answer actions. The model needs to be quite good for this, because it's the core of the loop we're building. Based on the action it chooses, the LLM returns URLs to scrape, a query to search for, or nothing alongside the answer action type. We also feed the query history and scrape history into the context, and made a small optimization so prompt tokens are cached and reused, by moving the dynamic parts (query history and scrape history) to the end.

Our agent is now implemented and looks like the following, where each action has its own prompt and system prompt. Only the 'choose next action' and 'answer' actions have access to the full context, which minimizes context usage for the other LLM calls, great! Now it needs to be bound to the frontend.

agent loop

From the user's perspective there are some changes. The user no longer sees all the agent's tool calls, which is actually good, because we might not want to show all the scrape results, for example. But the user has to wait until the agent loop completes, because streaming each action inside the loop isn't possible yet; this needs a solution. So the current iteration is like the following. Also good news: the context window problem is now resolved!!!

Messages take too long to arrive, bad UX. It would be much better if we could see what's going on. We could include chunks of custom messages in the stream and show them in the UI, but that approach means saving those parts in the DB, associated with the whole chat. What we want is annotations attached only to the message. ChatGPT does the same, for example.

We implemented chain of thought, but there are problems with streaming: two chunks of chain of thought are sometimes sent as separate messages, which produces several chain-of-thought UI items. Needs to be fixed...

chain of thought

It also turns out LLMs can return invalid JSON, so we added error handling for this. Structured outputs from LLMs don't work reliably.
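A sketch of the kind of tolerant parsing this implies: strip markdown code fences (models often wrap JSON in them), try to parse, and fall back to null so the caller can retry instead of crashing (this is an illustrative shape, not our exact implementation):

```typescript
// The markdown code-fence marker, built indirectly to keep this block
// embeddable in markdown itself.
const FENCE = "`".repeat(3);

// Returns the parsed value, or null when the model output isn't valid
// JSON, signalling the caller to retry or re-prompt.
function parseLLMJson<T>(raw: string): T | null {
  const cleaned = raw
    .split(FENCE).join("")      // drop code fences
    .replace(/^json\s*/, "")    // drop a leading "json" language tag
    .trim();
  try {
    return JSON.parse(cleaned) as T;
  } catch {
    return null;
  }
}
```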

We now support an AI-generated summary of the question for the chat title, instead of truncating the question itself.

We also support suggestion generation. If there is chat history, it's fed to the LLM; if not, the LLM generates some random medical research questions.

suggestions

A big problem: the LLM doesn't go for scraping at all; every step is taken as a search, and the step count exceeds 10.

We solved the problem of the LLM never scraping by adjusting the prompts and reducing the step count to 5. If the step count is exceeded, we run a workflow: gather the URLs that haven't been scraped yet, feed them to another model that selects the most relevant ones, and scrape those. After that we generate the final response. More workflow than agent.

agent loop workflow

Right now we have another problem: the next prompt the user enters has no context of the previous one. We should pass that context to the LLM on the next call, but we can't just put all of it there, because it would consume a vast amount of tokens and fill the context window after a few prompts. A good solution might be summarizing everything.

So agents versus workflows is not binary, but a dial we can turn: more agent and less workflow, or more workflow and less agent. Workflow makes the system more deterministic and predictable; it's about taking control away from the LLM.

The first step is scraping the URLs without relying on the LLM deciding to scrape them. That's done now, and a summarization step was added to the scrape action. The combined results are then pushed into the system context. Instead of raw HTML pages we now have summaries: fewer tokens used! The summarizer needs a big context window and no reasoning, so we picked Gemini 2.0 Flash.

Refactoring has been done, and the models have been adjusted; the system is now fully testable with custom models. Among the following three model sets, the third won. Tests were done with regular scraping. It should be noted that the GPT models, over 14 evals, cost $0.5.

```ts
const judgeModelSet1 = {
  actionPicker: openai("gpt-4o-mini"),
  answerQuestion: openai("gpt-4o"),
  questionSummary: openai("gpt-4o-mini"),
  relevantLinks: openai("gpt-4o-mini"),
  suggestions: google("gemini-2.0-flash"),
  summary: google("gemini-2.0-flash"),
};

const judgeModelSet2 = {
  actionPicker: google("gemini-2.0-flash"),
  answerQuestion: google("gemini-2.0-flash"),
  questionSummary: google("gemini-2.0-flash"),
  relevantLinks: google("gemini-2.0-flash"),
  suggestions: openai("gpt-4o-mini"),
  summary: openai("gpt-4o-mini"),
};

const judgeModelSet3 = {
  actionPicker: openai("gpt-4o"),
  answerQuestion: google("gemini-2.0-flash"),
  questionSummary: openai("gpt-4o-mini"),
  relevantLinks: google("gemini-2.0-flash"),
  suggestions: openai("gpt-4o"),
  summary: google("gemini-2.0-flash"),
};
```

evals 2

New evals were created for the question-summary and summary LLM calls, plus a new word-limit scorer for the question-summary evals.

evals 3

It seems that answerQuestion does too much: it returns the action to take ('search' or 'answer') and the query to search for. To make the system better, it's probably best to split the job into separate pieces, one of which is rewriting the query. This also fixes the 1-loop bug in our system, because we now control the flow and generate the URLs to scrape, which ARE going to be scraped. It also makes complex queries easier to tackle, because it's a planning approach: when a complex query comes in, there are multiple queries to search for. The agentic loop did worst here; it would take one query, search for it, and answer with that. If lucky, it got the information needed for the complex query, but in most cases it didn't. The rewriting happens before the getNextAction step, because otherwise the LLM could hallucinate and choose to answer immediately, with improper links and information. This is now complete, and the flow looks like the following. So we are moving further toward workflows.

more workflows

There is another problem: no guardrails, which causes unnecessary plan, search, and scrape steps.

no guardrails

We now use the evaluator-optimizer pattern. The pattern includes a feedback loop, but we weren't providing feedback; now the query rewriter uses the feedback from the action decider. We also added support for favicons...

Now guardrail is also supported.

guardrails

Now we have also clarification step!

clarification

Now evals for question clarification are supported; gemini-2.5-flash does the best job, in fact.

clarification eval

Now evals for action picker are also supported. Gemini 2.5 pro did the best job.

action picker eval

Now evals for query rewriter are also supported. Gemini 2.5 flash did the best job.

query rewriter

Ideas

  1. Traces to all LLM actions and calls.
  2. Fix total usage

About

Deep-research implementation for the med
