Add elastic agent alerting rule templates #15572

MichelLosier · 2025-10-06T20:27:41Z

Proposed commit message

Adds an initial set of alerting rule templates to the Elastic Agent package.

Extended description

Here is an initial exploration of alerting rule templates for monitoring Elastic Agent health. This PR can include just the ones we feel most confident about and defer the others for further refinement and exploration.

Install the rules

How to install the rules:

Pull this PR locally: Add elastic agent alerting rule templates #15572
Go to the Elastic Agent package directory: your-local-dir/integrations/packages/elastic_agent
- If on a remote cluster: change the version in packages/elastic_agent/manifest.yml from 2.6.4 to 2.6.3
  - This is just so you don't miss the actual 2.6.4 release later
- Build the package with elastic-package build --skip-validation. Run this in the elastic_agent package directory
  - This should build the package in build/packages/elastic_agent-2.6.3.zip (or elastic_agent-2.6.4.zip if you left the version unchanged)
Install the package in your cluster:
- Upload the package through the Integrations UI
  - Click the Create new integration CTA at the top right
  - Click the upload it as a .zip link, and upload the zip you built
- Once complete, check the Rules management UI for the created rules
  - All rule titles should start with [Elastic Agent], and the rules are tagged with Elastic Agent for filtering.
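
For the optional version change in the steps above, a minimal Python sketch of rewriting only the top-level version key is shown here. The function name and the trimmed-down manifest text are hypothetical, and the real manifest.yml contains many more keys; editing the file by hand works just as well.

```python
import re


def set_package_version(manifest_text: str, new_version: str) -> str:
    """Return manifest_text with the top-level `version:` value replaced.

    Only an unindented `version:` key is touched, so `format_version:` and
    nested `version:` keys are left alone.
    """
    pattern = re.compile(r'^(version:[ \t]*)(["\']?)[^"\'\s#]+\2', re.MULTILINE)
    updated, count = pattern.subn(
        lambda m: f"{m.group(1)}{m.group(2)}{new_version}{m.group(2)}",
        manifest_text,
        count=1,
    )
    if count == 0:
        raise ValueError("no top-level version key found")
    return updated


# Hypothetical, trimmed-down manifest used only for illustration.
sample_manifest = (
    "format_version: 3.4.0\n"
    "name: elastic_agent\n"
    "version: 2.6.4\n"
)

patched_manifest = set_package_version(sample_manifest, "2.6.3")
print(patched_manifest)
```

To apply it to a checkout, read packages/elastic_agent/manifest.yml into a string, pass it through the function, and write the result back before running the build.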

Rule templates:

To make the ES|QL easier to follow, here is a summary of each rule's definition.

Resource Utilization

[Elastic Agent] CPU usage spike
- Checks whether individual processes launched from a directory like *elastic*agent* use 80% or more of total CPU. Calculates the max for 1-minute buckets and checks whether at least 5 buckets are at or above the threshold when looking back 7 minutes. Rows are distinct by agent id and process name.

FROM metrics-*, *:metrics-*
| WHERE process.executable RLIKE ".*[Ee]lastic.*[Aa]gent.*" AND agent.name NOT LIKE "*agentless*"
| STATS cpu_process_pct = MAX(system.process.cpu.total.pct) * 100
    BY elastic_agent.id, process.name,
       time_bucket = BUCKET(@timestamp, 1 minute)
// Count the 1-minute time buckets that are at or above 80% by process and agent
| WHERE cpu_process_pct >= 80
| STATS count_above_threshold = COUNT(*)
    BY elastic_agent.id, process.name
// Alert if there are 5 or more occurrences
| WHERE count_above_threshold >= 5
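
The two-stage aggregation in this rule (a per-minute max, then a count of the buckets at or above the threshold) can be sketched in plain Python. This only illustrates the logic and is not how the rule is evaluated; the function name, tuple layout, and sample data are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def cpu_spike_alerts(samples, threshold_pct=80.0, min_buckets=5):
    """samples: iterable of (timestamp, agent_id, process_name, cpu_fraction)."""
    # Stage 1: max CPU per (agent, process, 1-minute bucket), as a percentage.
    bucket_max = defaultdict(float)
    for ts, agent_id, process_name, cpu_fraction in samples:
        minute = ts.replace(second=0, microsecond=0)
        key = (agent_id, process_name, minute)
        bucket_max[key] = max(bucket_max[key], cpu_fraction * 100)

    # Stage 2: count the buckets at or above the threshold per (agent, process).
    hot_buckets = defaultdict(int)
    for (agent_id, process_name, _minute), pct in bucket_max.items():
        if pct >= threshold_pct:
            hot_buckets[(agent_id, process_name)] += 1

    # Alert rows: groups with at least `min_buckets` hot buckets.
    return {key: n for key, n in hot_buckets.items() if n >= min_buckets}


start = datetime(2025, 10, 6, 12, 0, 0)
# Hypothetical data: "agentbeat" is hot in 6 of 7 minutes, "filebeat" in 2.
cpu_samples = []
for i in range(7):
    ts = start + timedelta(minutes=i, seconds=10)
    cpu_samples.append((ts, "agent-1", "agentbeat", 0.10 if i == 3 else 0.95))
    cpu_samples.append((ts, "agent-1", "filebeat", 0.90 if i < 2 else 0.05))

cpu_alerts = cpu_spike_alerts(cpu_samples)
print(cpu_alerts)  # {('agent-1', 'agentbeat'): 6}
```

Counting buckets rather than alerting on a single max means a short burst does not fire the rule, while a process that stays hot for most of the lookback window does.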

[Elastic Agent] Excessive memory usage
- Checks whether the sum of the individual processes launched from a directory like *elastic*agent* exceeds 50% of total memory. Rows are distinct by agent id.
- Mileage may vary on this one, and it may need fine-tuning. The assumption here is that agent processes should not exceed 50% of memory on the node.

FROM metrics-*, *:metrics-*
| WHERE process.executable RLIKE ".*[Ee]lastic.*[Aa]gent.*" AND agent.name NOT LIKE "*agentless*"
| STATS max_memory_per_process = MAX(system.process.memory.rss.pct * 100)
    BY agent.id, process.name
| STATS total_memory_usage = SUM(max_memory_per_process)
    BY agent.id
| WHERE total_memory_usage > 50
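
As a rough illustration of the aggregation in this rule, the sketch below takes the peak RSS percentage per process and then sums those peaks per agent. The function name, tuple layout, and sample values are hypothetical.

```python
from collections import defaultdict


def memory_alerts(samples, threshold_pct=50.0):
    """samples: iterable of (agent_id, process_name, rss_fraction)."""
    # Peak RSS percentage seen for each (agent, process) pair.
    per_process_max = defaultdict(float)
    for agent_id, process_name, rss_fraction in samples:
        key = (agent_id, process_name)
        per_process_max[key] = max(per_process_max[key], rss_fraction * 100)

    # Sum of the per-process peaks for each agent.
    totals = defaultdict(float)
    for (agent_id, _process_name), pct in per_process_max.items():
        totals[agent_id] += pct

    return {agent_id: total for agent_id, total in totals.items() if total > threshold_pct}


# Hypothetical samples: two readings per process.
memory_samples = [
    ("agent-1", "elastic-agent", 0.0625),
    ("agent-1", "elastic-agent", 0.125),
    ("agent-1", "agentbeat", 0.25),
    ("agent-1", "agentbeat", 0.5),
    ("agent-2", "elastic-agent", 0.25),
    ("agent-2", "agentbeat", 0.125),
]

memory_result = memory_alerts(memory_samples)
print(memory_result)  # {'agent-1': 62.5}
```

Because the peaks of different processes may come from different moments in the lookback window, the summed value can overstate the memory in use at any single instant, which is one reason the threshold may need tuning.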

Beats Pipelines and Queues

[Elastic Agent] High pipeline queue
- Checks if the max of beat.stats.libbeat.pipeline.queue.filled.pct reaches or exceeds 90%. Rows are distinct by agent id and component id.

FROM metrics-elastic_agent.*beat-default, *:metrics-elastic_agent.*beat-default*
| WHERE data_stream.dataset LIKE "elastic_agent.*beat" AND agent.name NOT LIKE "*agentless*"
| STATS pipeline_queue_pct = MAX(beat.stats.libbeat.pipeline.queue.filled.pct) * 100
    BY elastic_agent.id, component.id
| WHERE pipeline_queue_pct >= 90

[Elastic Agent] Dropped events
- Checks if the ratio of dropped events to acked events from the pipeline is greater than or equal to 5%. Rows are distinct by agent id and component id.

FROM metrics-elastic_agent.*beat-default, *:metrics-elastic_agent.*beat-default*
| WHERE data_stream.dataset LIKE "elastic_agent.*beat" AND agent.name NOT LIKE "*agentless*"
| STATS events_dropped_max = max(to_long(beat.stats.libbeat.pipeline.events.dropped)),
        events_dropped_min = min(to_long(beat.stats.libbeat.pipeline.events.dropped)),
        pipeline_acked_max = max(to_long(beat.stats.libbeat.pipeline.queue.acked)),
        pipeline_acked_min = min(to_long(beat.stats.libbeat.pipeline.queue.acked))
    BY time_bucket = DATE_TRUNC(1 minute, @timestamp), elastic_agent.id, component.id
| EVAL events_dropped = events_dropped_max - events_dropped_min,
       events_acked = pipeline_acked_max - pipeline_acked_min
// Cast to double so the ratio is not truncated by integer division
| EVAL drop_pct = CASE(events_acked > 0, to_double(events_dropped) / events_acked, 0.0)
| WHERE drop_pct >= 0.05
| STATS MAX(drop_pct) BY elastic_agent.id, component.id
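
The counters in this rule are cumulative, so the per-minute activity is the difference between the max and min seen in each bucket. A small Python sketch of that arithmetic follows; the function name and the numbers are hypothetical. It also shows why the ratio needs a floating-point division: ES|QL division of two integer values yields an integer, much like Python's `//`, so a ratio below 1 would collapse to 0.

```python
def drop_ratio(dropped_min, dropped_max, acked_min, acked_max):
    """Ratio of dropped to acked events within one time bucket."""
    dropped = dropped_max - dropped_min  # events dropped during the bucket
    acked = acked_max - acked_min        # events acked during the bucket
    if acked <= 0:
        return 0.0                       # mirrors the CASE fallback in the query
    return dropped / acked               # true (floating-point) division


# Hypothetical counter readings for one minute: 30 dropped, 500 acked.
ratio = drop_ratio(dropped_min=100, dropped_max=130, acked_min=5000, acked_max=5500)
print(ratio)          # 0.06
print(ratio >= 0.05)  # True

# The same deltas with integer division never reach a 0.05 threshold.
truncated = (130 - 100) // (5500 - 5000)
print(truncated)      # 0
```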

[Elastic Agent] Output errors
- Checks if the errors per minute from an agent component are greater than 5. Rows are distinct by agent id and component id.

FROM metrics-elastic_agent.*beat-default*, *:metrics-elastic_agent.*beat-default*
| WHERE data_stream.dataset LIKE "elastic_agent.*beat" AND agent.name NOT LIKE "*agentless*"
| STATS max_errors = MAX(TO_LONG(beat.stats.libbeat.output.write.errors)),
        min_errors = MIN(TO_LONG(beat.stats.libbeat.output.write.errors))
    BY time_bucket = DATE_TRUNC(1 minute, @timestamp), elastic_agent.id, component.id
| EVAL errors_count = max_errors - min_errors
| WHERE errors_count > 5
| STATS MAX(errors_count) BY elastic_agent.id, component.id

Agent Stability

[Elastic Agent] Excessive restarts
- Checks if there are more than 10 distinct startup timestamps from an agent or component process in a 5-minute window. Rows are distinct by host name and process name.

FROM metrics-*, *:metrics-*
| WHERE process.executable RLIKE ".*[Ee]lastic.*[Aa]gent.*" AND agent.name NOT LIKE "*agentless*"
| STATS restart_count = COUNT_DISTINCT(process.cpu.start_time)
    BY host.name, process.name, bucket(@timestamp,5 minute)
| WHERE restart_count > 10
| STATS MAX(restart_count) BY host.name, process.name
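
The idea behind this rule is that every restart gives a process a new start time, so the number of distinct start times in a window approximates the number of restarts. A minimal Python sketch of that counting is below; the function name, tuple layout, and sample data are hypothetical, and an exact set stands in for COUNT_DISTINCT.

```python
from collections import defaultdict
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def restart_alerts(samples, window=timedelta(minutes=5), threshold=10):
    """samples: iterable of (timestamp, host, process_name, start_time)."""
    # Distinct process start times per (host, process, window index).
    starts = defaultdict(set)
    for ts, host, process_name, start_time in samples:
        bucket = (ts - EPOCH) // window  # timedelta // timedelta gives an int
        starts[(host, process_name, bucket)].add(start_time)

    # Keep the worst window per (host, process) when it exceeds the threshold.
    worst = {}
    for (host, process_name, _bucket), seen in starts.items():
        if len(seen) > threshold:
            key = (host, process_name)
            worst[key] = max(worst.get(key, 0), len(seen))
    return worst


base = datetime(2025, 10, 6, 12, 0, 0)
restart_samples = []
for i in range(12):
    ts = base + timedelta(seconds=20 * i)
    # Hypothetical crash loop: "filebeat" reports a new start time on every sample.
    restart_samples.append((ts, "host-1", "filebeat", ts))
    # A stable process keeps reporting the same start time.
    restart_samples.append((ts, "host-1", "elastic-agent", base))

restart_result = restart_alerts(restart_samples)
print(restart_result)  # {('host-1', 'filebeat'): 12}
```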

[Elastic Agent] Unhealthy status
- Checks for log occurrences of an agent status change to "error" using the new elastic_agent.status_change data stream.

FROM logs-elastic_agent.status_change-default, *:logs-elastic_agent.status_change-default
| WHERE data_stream.dataset == "elastic_agent.status_change" AND agentless == false AND status == "error"

Checklist

I have reviewed tips for building integrations and this pull request is aligned with them.
I have verified that all data streams collect metrics or logs.
I have added an entry to my package's changelog.yml file.
I have verified that Kibana version constraints are current according to guidelines.
I have verified that any added dashboard complies with Kibana's Dashboard good practices

Author's Checklist

[ ]

How to test this PR locally

Build and install the elastic_agent package locally:

# In the elastic_agent package directory:
elastic-package build
elastic-package install --zip /dir/to/integrations/build/packages/elastic_agent-2.6.4.zip

Related issues

Relates: https://github.com/elastic/ingest-dev/issues/6092

Screenshots

elasticmachine · 2025-10-07T12:01:03Z

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

MichelLosier · 2025-10-08T17:12:53Z

Putting this back in draft temporarily to avoid an accidental merge. We want to validate these more against running agents, but this is still open for config review.


elasticmachine · 2025-10-14T18:31:22Z

💚 Build Succeeded

Buildkite Build
Commit: 44c37fb

History

💚 Build #32725 succeeded 4df8fa5
💚 Build #32645 succeeded f5ab615
💚 Build #32521 succeeded 7a5f6c1
💚 Build #32464 succeeded a0a0b78
💚 Build #32453 succeeded bbb7b3c
💚 Build #32440 succeeded 6d0c58a

nchaulet

LGTM 🚀

elastic-vault-github-plugin-prod · 2025-10-23T15:26:42Z

Package elastic_agent - 2.6.4 containing this change is available at https://epr.elastic.co/package/elastic_agent/2.6.4/

Add alerting rule templates to the Elastic Agent package:
* CPU usage spike
* Excessive memory usage
* High pipeline queue
* Dropped events
* Output errors
* Excessive restarts
* Unhealthy status

MichelLosier added 4 commits September 30, 2025 14:27

Add cpu and memory usage rule templates for elastic_agent

b204e0d

Fix metric name in memory usage rule

faa0b41

Add pipeline and queues alerting rule templates for elastic_agent

bb795de

Add various template rules for elastic_agent

8c29b35

MichelLosier requested a review from a team as a code owner October 6, 2025 20:27

MichelLosier added the enhancement New feature or request label Oct 6, 2025

MichelLosier added 2 commits October 6, 2025 14:18

Add changelog entry

41e3d2c

Update elastic agent package version and format_version

49c500f

MichelLosier requested a review from a team October 6, 2025 21:45

Fix alerting rule id

7329451

andrewkroh added Integration:elastic_agent Elastic Agent Team:Elastic-Agent Platform - Ingest - Agent [elastic/elastic-agent] labels Oct 7, 2025

MichelLosier added 3 commits October 7, 2025 12:17

Convert metric threshold rules to ESQL

6d0c58a

Use Max-Rate instead of Sum-Rate for some alerting rules

bbb7b3c

Remove storage related alerting rule templates

a0a0b78

pierrehilbert approved these changes Oct 8, 2025

View reviewed changes

MichelLosier marked this pull request as draft October 8, 2025 17:13

MichelLosier and others added 5 commits October 8, 2025 11:55

Have dropped events rule group by row

7a5f6c1

Use FROM instead of TS source commands

f5ab615

Update template rules to support CCS, exclude agentless, and broader agent path matching

5752c1f

Merge branch 'main' into add-elastic-agent-alerting-rule-templates

4df8fa5

Fix output errors threshold

44c37fb

MichelLosier marked this pull request as ready for review October 21, 2025 00:47

nchaulet approved these changes Oct 21, 2025

View reviewed changes

MichelLosier merged commit e65e5d7 into elastic:main Oct 23, 2025
7 checks passed
