Elasticsearch: Timed Data Analyses
By Alaa Elhadba (@aelhadba)
Table of Contents
- Hot-Cold Architecture
- Data High Availability
- Data Design at Large Scale
- Search Execution
- Time Framed Indices
- Aggregations
Hot-Cold Architecture
Hot-Cold Architecture

Hot Data Nodes
- Perform indexing
- Hold the most recent data
- Use SSD storage, since writing is an IO-intensive operation

Cold Data Nodes
- Handle read-only operations
- Can use large spinning disks
Hot-Cold Configuration
- Hot node: node.box_type: hot (in elasticsearch.yml)
- Cold node: node.box_type: cold (in elasticsearch.yml)
(diagram: shards allocated across hot and cold nodes)
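As a sketch of how the box_type tag is used (the index name is illustrative): each node advertises its tier in elasticsearch.yml, and an index is pinned to a tier with allocation filtering, then moved by updating the setting.

```yaml
# elasticsearch.yml on a hot node
node.box_type: hot

# elasticsearch.yml on a cold node
node.box_type: cold
```

```json
PUT logs-21.06.2016/_settings
{
  "index.routing.allocation.require.box_type": "hot"
}
```

Changing `box_type` from `hot` to `cold` in the index settings later makes Elasticsearch relocate the shards to the cold nodes.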
Data Availability
(diagram: nodes spread across Availability Zone 1 and Availability Zone 2)
Data Availability: what happens on an availability zone or rack failure? Answer: Shard Allocation Awareness.
Shard Allocation Awareness
(diagram: shards 1, 2, 3 and their replicas distributed across Availability Zone 1 and Availability Zone 2)
Shard Allocation Awareness

node.rack_id: rack_1   # elasticsearch.yml, nodes in Availability Zone 1
node.rack_id: rack_2   # elasticsearch.yml, nodes in Availability Zone 2
cluster.routing.allocation.awareness.attributes: rack_id

● Data replication spans AZs
● No two copies of the same shard on the same rack
● Elasticsearch is fully aware of the shard distribution
● Awareness can be set at the cluster or index level
● Elasticsearch prefers using local shards
● Always balance your nodes across AZs
● Routing allocation awareness can be updated on a live cluster
● Use Forced Awareness to avoid the extra load of reallocating missing shards
● Make sure you can handle the load with fewer nodes!
Forced Awareness
● Forced awareness solves this problem by NEVER allowing copies of the same shard to be allocated to the same zone.
● Avoids the extra load of reallocating unassigned shards after a rack failure.
● Leaves no single point of failure in your system.
● Make sure you can handle the load with fewer nodes.

cluster.routing.allocation.awareness.attributes: zone
cluster.routing.allocation.awareness.force.zone.values: zone1,zone2
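A sketch of the full setup (the attribute name `zone` and the values `zone1`/`zone2` are illustrative): every node is tagged with its zone, and the cluster is told both to be aware of the attribute and to force-spread copies across the known zones, so replicas stay unassigned rather than piling onto the surviving zone.

```yaml
# elasticsearch.yml on every node in zone1 (zone2 nodes use node.zone: zone2)
node.zone: zone1

# cluster-wide settings (elasticsearch.yml, or updatable via the cluster settings API)
cluster.routing.allocation.awareness.attributes: zone
cluster.routing.allocation.awareness.force.zone.values: zone1,zone2
```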
Data design at large scale
Searching
(diagram: a query fans out from the coordinating node to Shards 1 to 4 on their nodes, and the results are merged)
How to avoid asking all shards? Routing: "I know my shards!"
Routing

PUT my_index/my_type/my_id?routing=shard1
GET my_index/_search?routing=shard1,shard2

● Avoid calling all shards
● Dedicated shards per purpose
● Talk to one dedicated shard
● Eliminates network traffic
● Better performance
● Handle sharding on your own

But: once in, never out. The routing value must always be specified.
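A sketch with illustrative names: the routing value is hashed to pick exactly one shard, so documents sharing a routing value land together, and a search that passes the same value touches only that shard.

```json
PUT my_index/order/1?routing=user_42
{
  "user": "user_42",
  "amount": 19.99
}

GET my_index/_search?routing=user_42
{
  "query": { "term": { "user": "user_42" } }
}
```

Note that the query itself must still filter on the user: routing narrows the shards searched, it does not filter documents.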
Routing
(diagram: daily indices 19.06.2016, 20.06.2016, 21.06.2016, each with shards 1 to 3)
The client "MUST KNOW EVERYTHING": which index to target and which routing value to pass.
Talking to data
Aliasing
(diagram: daily indices 19.06.2016 to 22.06.2016; the aliases today, yesterday and 3_days_ago are repointed to the matching indices as each new day's index is created and the oldest is dropped)
The client no longer has to know everything, yet keeps the better performance: it becomes a data problem!
Aliasing + Routing
(diagram: aliases such as today_returns and recent_returns combine an index pointer with a routing value)
It's a data problem!
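A sketch of such an alias (index, alias, field and routing names are illustrative): the _aliases API lets an alias carry a routing value and a filter, so clients searching `today_returns` hit one shard of today's index and only see return documents.

```json
POST /_aliases
{
  "actions": [
    {
      "add": {
        "index": "orders-22.06.2016",
        "alias": "today_returns",
        "routing": "returns",
        "filter": { "term": { "type": "return" } }
      }
    }
  ]
}
```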
Aliasing + Routing + Search
(diagram: an alias resolves to an index plus a routing value, so a search addresses a single shard: a slice of the data)
Search Execution Preference
By default, Elasticsearch targets shards and replicas in a round-robin manner; each shard copy is queried equally. The preference parameter overrides this:
● _primary: query only primary shards (latest info from the index, or optimizing for the write path)
● _primary_first: query primaries first, if available
● _replica: query only replica shards
● _replica_first: query replicas first, if available
● _local: query shards available on the current node
● _only_node:node_id: query a specific node
● _only_nodes:*: query only a set of nodes
● _prefer_node:node_id: query a preferred node
● _shards:1,3: query specific shards; combinable with a preference, e.g. _shards:1,3;_local

GET _search?preference=_replica
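A sketch (index name illustrative): keeping an analytics query on replicas so that heavy aggregations do not compete with the indexing load on the primaries.

```json
GET logs-21.06.2016/_search?preference=_replica
{
  "query": { "match_all": {} }
}
```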
Time Framed Indices
Data Flow: Hot → Cold → Closed → Backed up → Trashed (over time)
Closing/Opening an Index
➔ Closing an index
◆ Removes all shard allocations from the cluster
◆ But keeps the index data around
◆ Helps reduce the resources used on the cluster
◆ Consumes only disk space
➔ Opening an index
◆ Re-opens a closed index
◆ Note: these are not millisecond operations; opening an index can take a few seconds to a couple of minutes
◆ Flushing before closing will reduce the opening time
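A sketch of that lifecycle step (index name illustrative): flush first so that the translog is empty, then close; re-open later if the data is needed again.

```json
POST /logs-19.06.2016/_flush

POST /logs-19.06.2016/_close

POST /logs-19.06.2016/_open
```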
Index Templates
- Order allows you to override other templates
- Settings allow you to scale at any time
- Aliases can be defined on index creation
Index Templates
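A sketch of a template for time framed indices (template name, pattern, type and field names are illustrative): any index matching the pattern is created with these settings, mappings and aliases.

```json
PUT /_template/logs_template
{
  "template": "logs-*",
  "order": 1,
  "settings": {
    "number_of_shards": 3
  },
  "mappings": {
    "event": {
      "properties": {
        "timestamp": { "type": "date" }
      }
    }
  },
  "aliases": {
    "recent_logs": {}
  }
}
```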
Time Framed Indices Lifecycle
1. Use index templates to generate mappings for new indices
2. Use aliases to decouple your application from the data logic
3. Use hot nodes for fresh data
4. Move old data to cold nodes
5. Close old indices before deletion
6. Change your time frame at any point to scale (monthly, weekly, ...)
7. Use routing if you have too many shards in a big cluster
Aggregations
Aggregation Types: Buckets, Metrics, Pipeline
Nested Bucket Aggregations
Aggregation Query
(annotated query: a filter first for better caching, fetching only the relevant documents, then the first segmentation and a nested segmentation)
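A sketch of that shape (index, field names and values are illustrative): the query part narrows and caches well, the outer terms aggregation is the first segmentation, and the inner one is nested inside each bucket.

```json
GET orders/_search
{
  "size": 0,
  "query": {
    "range": { "timestamp": { "gte": "now-7d/d" } }
  },
  "aggs": {
    "by_country": {
      "terms": { "field": "country" },
      "aggs": {
        "by_status": {
          "terms": { "field": "status" }
        }
      }
    }
  }
}
```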
Doc Values
Why do we need them? Sorting, aggregations, some scripting.
- Built as a columnar-style data structure on disk
- Created at indexing time, stored as part of the segment
- Read like other pieces of the Lucene index
- Don't take up heap space; use the file system cache instead
- Default for not_analyzed string and numeric fields in 2.0+
Raw Fields
- Use customer_name.raw for aggregations
- Use customer_name for search
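A sketch of the multi-field mapping behind this (index and type names are illustrative, 2.x-era syntax): the analyzed field serves full-text search, while the not_analyzed `.raw` sub-field keeps the whole value as one term for aggregations.

```json
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "customer_name": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}
```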
Metrics Aggregations
- Avg Aggregation
- Cardinality Aggregation
- Extended Stats Aggregation
- Max Aggregation
- Min Aggregation
- Percentiles Aggregation
- Percentile Ranks Aggregation
- Scripted Metric Aggregation
- Stats Aggregation
- Sum Aggregation
- Top Hits Aggregation
- Value Count Aggregation
Extended Stats Aggregation
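A sketch (index and field names illustrative): extended_stats returns count, min, max, avg and sum plus variance, standard deviation and standard deviation bounds in one pass.

```json
GET orders/_search
{
  "size": 0,
  "aggs": {
    "amount_stats": {
      "extended_stats": { "field": "amount" }
    }
  }
}
```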
Aggregation Search
(diagram: the query fans out to Shards 1 to 4 on their nodes; each shard computes partial aggregation results, which are reduced into the final result)
Scripted Metric Aggregation
- init_script: executed first; allows initialization of variables
- map_script: executed once for each collected document
- combine_script: executed once on each shard after document collection is complete
- reduce_script: executed once on the coordinating node after all shards have returned their results
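A sketch of the four phases, close to the classic 2.x Groovy example (index and field names illustrative): each shard maps amounts into a list, combines them into a per-shard sum, and the coordinating node reduces the shard sums.

```json
GET orders/_search
{
  "size": 0,
  "aggs": {
    "total_amount": {
      "scripted_metric": {
        "init_script": "_agg.amounts = []",
        "map_script": "_agg.amounts.add(doc['amount'].value)",
        "combine_script": "sum = 0; for (a in _agg.amounts) { sum += a }; return sum",
        "reduce_script": "sum = 0; for (s in _aggs) { sum += s }; return sum"
      }
    }
  }
}
```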
Bucket Aggregations
- Children Aggregation
- Date Histogram Aggregation
- Date Range Aggregation
- Filter Aggregation
- Filters Aggregation
- Global Aggregation
- Histogram Aggregation
- Missing Aggregation
- Range Aggregation
- Reverse Nested Aggregation
- Sampler Aggregation
- Significant Terms Aggregation
- Terms Aggregation
Date Histogram Aggregation
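A sketch (index and field names illustrative): one bucket per day, with the document count in each.

```json
GET logs/_search
{
  "size": 0,
  "aggs": {
    "per_day": {
      "date_histogram": {
        "field": "timestamp",
        "interval": "day"
      }
    }
  }
}
```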
Date Range Aggregation. Don't forget: round your dates!
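A sketch (index and field names illustrative): rounding with `/d` matters because `now` changes every millisecond, while `now/d` is stable for a whole day and therefore cache-friendly.

```json
GET logs/_search
{
  "size": 0,
  "aggs": {
    "recent": {
      "date_range": {
        "field": "timestamp",
        "ranges": [
          { "from": "now-10d/d", "to": "now/d" }
        ]
      }
    }
  }
}
```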
Missing Aggregation
Range Aggregation
Histogram Aggregation
Pipeline Aggregations
Pipeline Aggregations
- Parent: computes new buckets or new aggregations from the output of its parent aggregation.
- Sibling: computes a new aggregation at the same level, from the output of a sibling aggregation.
Sibling Pipeline Aggregations
- min_bucket
- max_bucket
- sum_bucket
- avg_bucket
- stats_bucket
- extended_stats_bucket
- percentiles_bucket
Average Bucket Aggregation
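A sketch of a sibling pipeline (index and field names illustrative): avg_bucket sits next to the date_histogram, not inside it, and averages the daily sums via a buckets_path.

```json
GET orders/_search
{
  "size": 0,
  "aggs": {
    "sales_per_day": {
      "date_histogram": { "field": "timestamp", "interval": "day" },
      "aggs": {
        "daily_sales": { "sum": { "field": "amount" } }
      }
    },
    "avg_daily_sales": {
      "avg_bucket": { "buckets_path": "sales_per_day>daily_sales" }
    }
  }
}
```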
Parent Pipeline Aggregations
- moving_avg
- derivative
- cumulative_sum
- bucket_script
- bucket_selector
- serial_diff
Cumulative Sum Aggregation
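A sketch of a parent pipeline (index and field names illustrative): cumulative_sum lives inside the histogram and adds a running total to each daily bucket.

```json
GET orders/_search
{
  "size": 0,
  "aggs": {
    "sales_per_day": {
      "date_histogram": { "field": "timestamp", "interval": "day" },
      "aggs": {
        "daily_sales": { "sum": { "field": "amount" } },
        "running_total": {
          "cumulative_sum": { "buckets_path": "daily_sales" }
        }
      }
    }
  }
}
```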
Derivative Aggregation
Moving Average Aggregation
Moving Average Aggregation: Prediction
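A sketch (index, field names, window size and model are illustrative): moving_avg smooths the daily sums over a sliding window, and `predict` extrapolates extra buckets past the end of the series.

```json
GET orders/_search
{
  "size": 0,
  "aggs": {
    "sales_per_day": {
      "date_histogram": { "field": "timestamp", "interval": "day" },
      "aggs": {
        "daily_sales": { "sum": { "field": "amount" } },
        "smoothed_sales": {
          "moving_avg": {
            "buckets_path": "daily_sales",
            "window": 7,
            "model": "holt",
            "predict": 5
          }
        }
      }
    }
  }
}
```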
Bucket Selector Aggregation
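A sketch (index, field names and threshold are illustrative): bucket_selector evaluates a script per bucket and drops the buckets where it returns false, here keeping only days with more than 1000 in sales.

```json
GET orders/_search
{
  "size": 0,
  "aggs": {
    "sales_per_day": {
      "date_histogram": { "field": "timestamp", "interval": "day" },
      "aggs": {
        "daily_sales": { "sum": { "field": "amount" } },
        "big_days_only": {
          "bucket_selector": {
            "buckets_path": { "sales": "daily_sales" },
            "script": "sales > 1000"
          }
        }
      }
    }
  }
}
```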
Bucket Script Aggregation
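A sketch (index and field names are illustrative): bucket_script computes a new per-bucket value from existing metrics, here a return ratio from two sums in the same bucket.

```json
GET orders/_search
{
  "size": 0,
  "aggs": {
    "sales_per_day": {
      "date_histogram": { "field": "timestamp", "interval": "day" },
      "aggs": {
        "total": { "sum": { "field": "amount" } },
        "returns": { "sum": { "field": "returned_amount" } },
        "return_ratio": {
          "bucket_script": {
            "buckets_path": { "t": "total", "r": "returns" },
            "script": "r / t"
          }
        }
      }
    }
  }
}
```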
The End
