Notes on blocked ES force merge
Notes from third-line support
ES version: 7.17.5
Sometimes an ES ILM policy does not behave as expected, especially a policy that includes a forcemerge action.
For example, take the following ILM policy:
# GET /_ilm/policy/envoy_access_log-ilm
{
  "envoy_access_log-ilm" : {
    "version" : 9,
    "modified_date" : "2023-03-15T15:14:15.794Z",
    "policy" : {
      "phases" : {
        "warm" : {
          "min_age" : "50h",
          "actions" : {
            "allocate" : {
              "number_of_replicas" : 0,
              "include" : { },
              "exclude" : { },
              "require" : {
                "zone" : "dlocal-ssd"
              }
            },
            "forcemerge" : {
              "max_num_segments" : 1
            }
          }
        },
        "cold" : {
          "min_age" : "7d",
          "actions" : {
            "allocate" : {
              "number_of_replicas" : 0,
              "include" : { },
              "exclude" : { },
              "require" : {
                "zone" : "nfs"
              }
            },
            "readonly" : { }
          }
        },
        "hot" : {
          "min_age" : "0ms",
          "actions" : {
            "rollover" : {
              "max_size" : "50gb",
              "max_age" : "1d"
            }
          }
        },
        "delete" : {
          "min_age" : "32d",
          "actions" : {
            "delete" : {
              "delete_searchable_snapshot" : true
            }
          }
        }
      },
      "_meta" : {
        "description" : "default policy for the envoy access log indices,created by sanXian"
      }
    },
    "in_use_by" : {
      "indices" : [
        "envoy_access_log_xxxxxx",
        "envoy_access_log_yyyyyyyy"
      ],
      "data_streams" : [ ],
      "composable_templates" : [ ]
    }
  }
}
The policy moves an index to the warm phase at 50h, moves it to nfs (the cold phase) at 7 days, and deletes it at 32 days.
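The phase timeline can also be read off the policy mechanically. The following is a minimal sketch, not part of the original notes: the helper names `parse_min_age` and `phase_schedule` are hypothetical, the parser only handles the simple `<number><unit>` time values that appear in this policy, and the `phases` dict is a hand-copied subset of the response above.

```python
import re
from datetime import timedelta

# Seconds per unit for the time units handled by this sketch.
_UNIT_SECONDS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_min_age(value: str) -> timedelta:
    """Convert a time value such as '50h' or '7d' to a timedelta."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h|d)", value)
    if match is None:
        raise ValueError(f"unsupported time value: {value!r}")
    amount, unit = match.groups()
    return timedelta(seconds=int(amount) * _UNIT_SECONDS[unit])


def phase_schedule(phases: dict) -> list:
    """Return (phase name, min_age) pairs ordered by min_age."""
    schedule = [(name, parse_min_age(body["min_age"])) for name, body in phases.items()]
    return sorted(schedule, key=lambda item: item[1])


# min_age values copied from the policy above; the actions are omitted.
phases = {
    "warm": {"min_age": "50h"},
    "cold": {"min_age": "7d"},
    "hot": {"min_age": "0ms"},
    "delete": {"min_age": "32d"},
}

for name, age in phase_schedule(phases):
    print(f"{name:<6} min_age = {age}")
```

Sorting by `min_age` gives the order in which the index is expected to enter each phase, which is the baseline to compare against when an index looks late.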
Querying one index, however, shows that it is not where the policy says it should be:
# GET envoy_access_log_oi-2024.12.19-000292/_ilm/explain
{
  "indices" : {
    "envoy_access_log_oi-2024.12.19-000292" : {
      "index" : "envoy_access_log_oi-2024.12.19-000292",
      "managed" : true,
      "policy" : "envoy_access_log-ilm",
      "lifecycle_date_millis" : 1734665457979,
      "age" : "13.31d",
      "phase" : "warm",
      "phase_time_millis" : 1735237921065,
      "action" : "forcemerge",
      "action_time_millis" : 1734845462955,
      "step" : "forcemerge",
      "step_time_millis" : 1735237921065,
      "is_auto_retryable_error" : true,
      "failed_step_retry_count" : 1,
      "phase_execution" : {
        "policy" : "envoy_access_log-ilm",
        "phase_definition" : {
          "min_age" : "50h",
          "actions" : {
            "allocate" : {
              "number_of_replicas" : 0,
              "include" : { },
              "exclude" : { },
              "require" : {
                "zone" : "dlocal-ssd"
              }
            },
            "forcemerge" : {
              "max_num_segments" : 1
            }
          }
        },
        "version" : 9,
        "modified_date_in_millis" : 1678893255794
      }
    }
  }
}
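To put a number on how long the index has been sitting on its current step, the timestamps in the explain output can be compared. This is a rough sketch rather than an established procedure: the function names and the one-day threshold are hypothetical, and "now" is approximated as `lifecycle_date_millis` plus the reported `age` of 13.31d, which assumes that `age` is measured from the lifecycle date.

```python
from datetime import timedelta


def time_on_current_step(entry: dict, now_millis: int) -> timedelta:
    """How long the index has been on the step reported by _ilm/explain."""
    return timedelta(milliseconds=now_millis - entry["step_time_millis"])


def looks_stuck(entry: dict, now_millis: int, threshold: timedelta) -> bool:
    """Hypothetical heuristic: flag a step that has lasted longer than threshold."""
    return time_on_current_step(entry, now_millis) > threshold


# Fields copied from the explain output above.
entry = {
    "lifecycle_date_millis": 1734665457979,
    "phase": "warm",
    "action": "forcemerge",
    "step": "forcemerge",
    "step_time_millis": 1735237921065,
}

# Assumption: approximate "now" as lifecycle date + reported age (13.31 days).
now_millis = entry["lifecycle_date_millis"] + round(13.31 * 86_400_000)

print("time on step:", time_on_current_step(entry, now_millis))
print("looks stuck:", looks_stuck(entry, now_millis, threshold=timedelta(days=1)))
```

Under these assumptions the index has spent more than six days on the forcemerge step, which by itself is a sign that something is blocking it.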
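The index is 13.31d old, well past the 7d min_age of the cold phase, yet it is still stuck in the warm phase on the forcemerge action. There are several possible causes and ways to analyze them: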
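- Another index is running a force_merge, so this index is queued behind it
  - GET /_cat/thread_pool/force_merge?v&s=node_name shows the active and queued force_merge threads on each node
  - The exporter metric elasticsearch_thread_pool_queue_count can be used to observe the queue as well
  - GET /_tasks?group_by=parents&actions=*forcemerge*&detailed=true shows the force merge tasks currently running in the cluster
  - By default the maximum force_merge concurrency on a node may be only 1, and this setting (thread_pool.force_merge.size) can only be changed with a restart. The Elasticsearch documentation describes the pool as follows:
    For force merge operations. Thread pool type is fixed with a size of max(1, (# of allocated processors) / 8) and an unbounded queue size.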
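- The force_merge action on this index failed, and it is also blocking the force_merge action of other indices
  You can intervene by manually running an ILM move on the index suspected of blocking, so that it skips the action (manually triggering the force merge API is another option).
  For example: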
  POST _ilm/move/envoy_access_log-2024.12.13-000822
  {
    "current_step": {
      "phase": "warm",
      "action": "forcemerge",
      "name": "segment-count"
    },
    "next_step": {
      "phase": "warm",
      "action": "complete"
    }
  }
  or
  POST _ilm/move/envoy_access_log-2024.12.13-000822
  {
    "current_step": {
      "phase": "warm",
      "action": "forcemerge",
      "name": "forcemerge"
    },
    "next_step": {
      "phase": "warm",
      "action": "complete"
    }
  }
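The two requests differ only in the step name under current_step, which has to match the step the index is actually on. A small sketch of deriving that part from the explain output instead of typing it by hand is shown here. The helper `build_move_request` is hypothetical, the `entry` values are illustrative and mirror the second request above, and the sketch only builds the request; it does not send it.

```python
import json


def build_move_request(index: str, entry: dict, next_step: dict) -> tuple:
    """Return (path, body) for POST _ilm/move/<index>.

    current_step is taken from the phase/action/step fields of the
    index's _ilm/explain entry so that it matches the real current step.
    """
    current_step = {
        "phase": entry["phase"],
        "action": entry["action"],
        "name": entry["step"],
    }
    return f"_ilm/move/{index}", {"current_step": current_step, "next_step": next_step}


# Illustrative explain fields for the index; in practice these come from
# GET <index>/_ilm/explain.
entry = {"phase": "warm", "action": "forcemerge", "step": "forcemerge"}

path, body = build_move_request(
    "envoy_access_log-2024.12.13-000822",
    entry,
    next_step={"phase": "warm", "action": "complete"},
)

print("POST", path)
print(json.dumps(body, indent=2))
```

If explain reports the step as segment-count instead, the same helper produces the first request.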