Add elastic agent alerting rule templates #15572
Merged
Proposed commit message
Extended description
Here is an initial exploration of alerting rule templates for monitoring elastic agent health. This PR can just include the ones we feel the most confident about, and defer others for further refinement and exploration.
Install the rules
How to install the rules:
1. Go to `your-local-dir/integrations/packages/elastic_agent`.
2. Change the version in `packages/elastic_agent/manifest.yml` from 2.6.4 to 2.6.3.
3. Run `elastic-package build --skip-validation` in the `elastic_agent` package directory.
4. The built package is `build/packages/elastic_agent-2.6.3.zip`.
5. Click the `Create new integration` CTA at the top right.
6. Click the `upload it as a .zip` link, and upload the zip you built.
7. Use `Elastic Agent` for filtering.

Rule templates:
So that the ESQL is clear, here is a summary of their definitions.
Resource Utilization
- Processes matching `*elastic*agent*` are above 80% of total CPU utilization. Calculate the max for 1 minute buckets and check if there are 5 occurrences when looking back 7 minutes. Rows are distinct by agent id and process name.

```
FROM metrics-*, *:metrics-*
| WHERE process.executable RLIKE ".*[Ee]lastic.*[Aa]gent.*" AND agent.name NOT LIKE "agentless"
// Max CPU (as a percentage) per agent and process in 1 minute buckets
| STATS cpu_process_pct = MAX(system.process.cpu.total.pct) * 100
    BY elastic_agent.id, process.name,
       time_bucket = BUCKET(@timestamp, 1 minute)
// Count the 1 minute time buckets that are above 80% by process and agent
| WHERE cpu_process_pct >= 80
| STATS count_above_threshold = COUNT(*)
    BY elastic_agent.id, process.name
// Alert if there are 5 or more occurrences
| WHERE count_above_threshold >= 5
```
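To make the bucket-and-count logic of the CPU rule easier to review, here is a minimal Python sketch of the same steps: take the max per 1 minute bucket, keep the buckets at or above 80%, and flag any agent and process pair with 5 or more such buckets. The function name `agents_to_alert`, the constants, and the tuple layout of `samples` are illustrative assumptions, not part of the rule template, and the sketch assumes the samples were already limited to the 7 minute look-back window.

```python
from collections import defaultdict
from datetime import datetime, timedelta

CPU_THRESHOLD_PCT = 80   # same threshold as the rule
MIN_BUCKETS_ABOVE = 5    # same occurrence count as the rule


def agents_to_alert(samples):
    """Return {(agent_id, process_name): buckets_above_threshold} for alerting pairs.

    `samples` is a hypothetical iterable of
    (timestamp, agent_id, process_name, cpu_fraction) tuples, where
    cpu_fraction is a 0..1 value like system.process.cpu.total.pct.
    """
    # Step 1: max CPU percentage per (agent, process, 1 minute bucket).
    bucket_max = {}
    for ts, agent_id, process_name, cpu_fraction in samples:
        bucket = ts.replace(second=0, microsecond=0)
        key = (agent_id, process_name, bucket)
        pct = cpu_fraction * 100
        if pct > bucket_max.get(key, float("-inf")):
            bucket_max[key] = pct

    # Step 2: count the buckets at or above the threshold per (agent, process).
    counts = defaultdict(int)
    for (agent_id, process_name, _bucket), pct in bucket_max.items():
        if pct >= CPU_THRESHOLD_PCT:
            counts[(agent_id, process_name)] += 1

    # Step 3: keep only the pairs with enough buckets above the threshold.
    return {key: n for key, n in counts.items() if n >= MIN_BUCKETS_ABOVE}


if __name__ == "__main__":
    start = datetime(2025, 1, 1, 12, 0, 0)
    demo = [
        (start + timedelta(minutes=i), "agent-1", "elastic-agent", 0.92)
        for i in range(6)
    ]
    print(agents_to_alert(demo))
```

Because the count is per bucket rather than per sample, several high readings inside the same minute still count as a single occurrence.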
- Processes matching `*elastic*agent*` are above 50% of total memory usage. Rows are distinct by agent id.

Beats Pipelines and Queues
- `beat.stats.libbeat.pipeline.queue.filled.pct` exceeds 90%. Rows are distinct by agent id and component id.

Agent Stability
- `elastic_agent.status_change` datastream

Checklist
- `changelog.yml` file.

Author's Checklist
How to test this PR locally
Build and install the Elastic Agent package locally:
Related issues
Screenshots