Notes on ES force_merge blocking: difference between revisions
From 三线的随记
Links referenced in this revision:
- Thread pool settings (thread_pool.force_merge.size): https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-threadpool.html
- Forum thread on ILM stuck at forcemerge: https://discuss.elastic.co/t/elasticsearch-ilm-stuck-at-forcemerge/184506
- GitHub issue "Revisit ILM retry strategy for additional conditions": https://github.com/elastic/elasticsearch/issues/42824
Revision as of 17:12, 3 January 2025 (Fri)
ES version: 7.17.5
Sometimes an ES ILM policy does not behave as expected, especially a policy that includes a forcemerge action.
For example, take this ILM policy:
# GET /_ilm/policy/envoy_access_log-ilm
{
  "envoy_access_log-ilm" : {
    "version" : 9,
    "modified_date" : "2023-03-15T15:14:15.794Z",
    "policy" : {
      "phases" : {
        "warm" : {
          "min_age" : "50h",
          "actions" : {
            "allocate" : {
              "number_of_replicas" : 0,
              "include" : { },
              "exclude" : { },
              "require" : { "zone" : "dlocal-ssd" }
            },
            "forcemerge" : { "max_num_segments" : 1 }
          }
        },
        "cold" : {
          "min_age" : "7d",
          "actions" : {
            "allocate" : {
              "number_of_replicas" : 0,
              "include" : { },
              "exclude" : { },
              "require" : { "zone" : "nfs" }
            },
            "readonly" : { }
          }
        },
        "hot" : {
          "min_age" : "0ms",
          "actions" : {
            "rollover" : { "max_size" : "50gb", "max_age" : "1d" }
          }
        },
        "delete" : {
          "min_age" : "32d",
          "actions" : {
            "delete" : { "delete_searchable_snapshot" : true }
          }
        }
      },
      "_meta" : {
        "description" : "default policy for the envoy access log indices,created by sanXian"
      }
    },
    "in_use_by" : {
      "indices" : [ "envoy_access_log_xxxxxx", "envoy_access_log_yyyyyyyy" ],
      "data_streams" : [ ],
      "composable_templates" : [ ]
    }
  }
}
The policy enters the warm phase at 50h, moves the index to the nfs zone (cold phase) at 7 days, and deletes it at 32 days.
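To make the timeline concrete, here is a minimal Python sketch that works out which phase an index of a given age should have reached. The helper names `min_age_seconds` and `expected_phase` are hypothetical, only the time units used in this policy are handled, and the age is assumed to be the one ILM itself reports.

```python
import re

# Multipliers to seconds for the time units handled in this sketch.
_UNIT_SECONDS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600, "d": 86400}


def min_age_seconds(value):
    """Convert a min_age string such as "50h" or "0ms" to seconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h|d)", value)
    if match is None:
        raise ValueError(f"unsupported time value: {value!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def expected_phase(phases, age_seconds):
    """Return the latest phase whose min_age the index has already reached."""
    reached = [
        (min_age_seconds(body["min_age"]), name)
        for name, body in phases.items()
        if min_age_seconds(body["min_age"]) <= age_seconds
    ]
    return max(reached)[1] if reached else None


# min_age values copied from the policy above (actions omitted).
phases = {
    "hot": {"min_age": "0ms"},
    "warm": {"min_age": "50h"},
    "cold": {"min_age": "7d"},
    "delete": {"min_age": "32d"},
}
print(expected_phase(phases, 13.31 * 86400))  # prints "cold"
```

An index that is 13.31 days old should therefore already be in the cold phase under this policy.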
Yet querying one index shows it is not where the policy says it should be:
# GET envoy_access_log_oi-2024.12.19-000292/_ilm/explain
{
  "indices" : {
    "envoy_access_log_oi-2024.12.19-000292" : {
      "index" : "envoy_access_log_oi-2024.12.19-000292",
      "managed" : true,
      "policy" : "envoy_access_log-ilm",
      "lifecycle_date_millis" : 1734665457979,
      "age" : "13.31d",
      "phase" : "warm",
      "phase_time_millis" : 1735237921065,
      "action" : "forcemerge",
      "action_time_millis" : 1734845462955,
      "step" : "forcemerge",
      "step_time_millis" : 1735237921065,
      "is_auto_retryable_error" : true,
      "failed_step_retry_count" : 1,
      "phase_execution" : {
        "policy" : "envoy_access_log-ilm",
        "phase_definition" : {
          "min_age" : "50h",
          "actions" : {
            "allocate" : {
              "number_of_replicas" : 0,
              "include" : { },
              "exclude" : { },
              "require" : { "zone" : "dlocal-ssd" }
            },
            "forcemerge" : { "max_num_segments" : 1 }
          }
        },
        "version" : 9,
        "modified_date_in_millis" : 1678893255794
      }
    }
  }
}
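When many indices share a policy, reading explain output by hand gets tedious. The sketch below flags steps that have been running longer than a threshold, using only fields that appear in the explain response above. The function name `stuck_steps`, the 24-hour threshold, and the "now" timestamp are illustrative assumptions.

```python
def stuck_steps(explain, now_millis, max_hours):
    """List (index, phase, action, step, hours_in_step) for long-running steps."""
    found = []
    for name, info in explain["indices"].items():
        if not info.get("managed"):
            continue  # indices without an ILM policy carry no step timing
        hours = (now_millis - info["step_time_millis"]) / 3_600_000
        if hours > max_hours:
            found.append(
                (name, info["phase"], info["action"], info["step"], round(hours, 1))
            )
    return found


# Trimmed copy of the explain response above.
explain = {
    "indices": {
        "envoy_access_log_oi-2024.12.19-000292": {
            "managed": True,
            "phase": "warm",
            "action": "forcemerge",
            "step": "forcemerge",
            "step_time_millis": 1735237921065,
        }
    }
}
# Hypothetical "now": exactly 48 hours after the step started.
now = 1735237921065 + 48 * 3_600_000
print(stuck_steps(explain, now, max_hours=24))
```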
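The explain output shows the index stuck at the forcemerge step of the warm phase, although at 13.31d it is well past the 7d threshold for cold. Possible causes and ways to investigate:
- Another index is running a force_merge, so this index is queued behind it.
  1. GET /_cat/thread_pool/force_merge?v&s=node_name shows the force_merge thread pool on each node.
  2. The exporter metric elasticsearch_thread_pool_queue_count shows the queue length.
  3. GET /_tasks?group_by=parents&actions=*forcemerge*&detailed=true shows the force merge tasks currently running in the cluster.
  4. By default a node may run only 1 force merge at a time, and changing this setting (thread_pool.force_merge.size, see https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-threadpool.html) requires a restart. The documentation says: "For force merge operations. Thread pool type is fixed with a size of max(1, (# of allocated processors) / 8) and an unbounded queue size."
- The force_merge action on this index failed, and it is also blocking the force_merge actions of other indices.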
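The first cause can be checked with a few lines of Python. This sketch assumes the `_cat` request is sent with `format=json`, in which case each row is an object whose values are strings, and it assumes the quoted pool-size formula uses integer division. The helper names and the sample rows are hypothetical.

```python
def force_merge_pool_size(allocated_processors):
    """Default pool size per the quoted formula (integer division assumed)."""
    return max(1, allocated_processors // 8)


def busy_nodes(rows):
    """Return {node_name: (active, queue)} for nodes with merges running or queued."""
    busy = {}
    for row in rows:
        active, queue = int(row["active"]), int(row["queue"])
        if active > 0 or queue > 0:
            busy[row["node_name"]] = (active, queue)
    return busy


# Hypothetical rows, shaped like the JSON form of the _cat output.
rows = [
    {"node_name": "node-a", "name": "force_merge", "active": "1", "queue": "3", "rejected": "0"},
    {"node_name": "node-b", "name": "force_merge", "active": "0", "queue": "0", "rejected": "0"},
]
print(force_merge_pool_size(8), force_merge_pool_size(32))  # prints "1 4"
print(busy_nodes(rows))  # prints {'node-a': (1, 3)}
```

A node with one active merge and a non-empty queue is a likely place where later force merges are waiting.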
For an index suspected of blocking, you can intervene by manually running an ILM move to skip the action (or consider triggering the force merge API by hand).
For example, when the index is at the segment-count step:
POST _ilm/move/envoy_access_log-2024.12.13-000822
{
  "current_step": {
    "phase": "warm",
    "action": "forcemerge",
    "name": "segment-count"
  },
  "next_step": {
    "phase": "warm",
    "action": "complete"
  }
}
or, when it is at the forcemerge step:
POST _ilm/move/envoy_access_log-2024.12.13-000822
{
  "current_step": {
    "phase": "warm",
    "action": "forcemerge",
    "name": "forcemerge"
  },
  "next_step": {
    "phase": "warm",
    "action": "complete"
  }
}
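Since the two requests differ only in the name of the current step, the body can be derived from the index's own explain entry instead of being typed by hand. A minimal sketch, assuming the explain fields shown earlier; `build_move_request` is a hypothetical helper that only builds the path and body and does not send anything.

```python
import json


def build_move_request(index, explain_entry, next_step):
    """Return (path, body) for an ILM move, using the step the index reports."""
    current_step = {
        "phase": explain_entry["phase"],
        "action": explain_entry["action"],
        "name": explain_entry["step"],
    }
    body = {"current_step": current_step, "next_step": next_step}
    return f"_ilm/move/{index}", body


# Fields copied from the explain output earlier in this note.
entry = {"phase": "warm", "action": "forcemerge", "step": "forcemerge"}
path, body = build_move_request(
    "envoy_access_log_oi-2024.12.19-000292",
    entry,
    {"phase": "warm", "action": "complete"},
)
print("POST", path)
print(json.dumps(body, indent=2))
```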