Notes on blocked ES force merge
From 三线的随记
Revision as of 17:12, 3 January 2025 (Friday)
ES version: 7.17.5
Sometimes an ES ILM policy does not behave as expected, especially a policy that contains a forcemerge action.
For example, take the following ILM policy:
# GET /_ilm/policy/envoy_access_log-ilm
{
  "envoy_access_log-ilm" : {
    "version" : 9,
    "modified_date" : "2023-03-15T15:14:15.794Z",
    "policy" : {
      "phases" : {
        "warm" : {
          "min_age" : "50h",
          "actions" : {
            "allocate" : {
              "number_of_replicas" : 0,
              "include" : { },
              "exclude" : { },
              "require" : {
                "zone" : "dlocal-ssd"
              }
            },
            "forcemerge" : {
              "max_num_segments" : 1
            }
          }
        },
        "cold" : {
          "min_age" : "7d",
          "actions" : {
            "allocate" : {
              "number_of_replicas" : 0,
              "include" : { },
              "exclude" : { },
              "require" : {
                "zone" : "nfs"
              }
            },
            "readonly" : { }
          }
        },
        "hot" : {
          "min_age" : "0ms",
          "actions" : {
            "rollover" : {
              "max_size" : "50gb",
              "max_age" : "1d"
            }
          }
        },
        "delete" : {
          "min_age" : "32d",
          "actions" : {
            "delete" : {
              "delete_searchable_snapshot" : true
            }
          }
        }
      },
      "_meta" : {
        "description" : "default policy for the envoy access log indices,created by sanXian"
      }
    },
    "in_use_by" : {
      "indices" : [
        "envoy_access_log_xxxxxx",
        "envoy_access_log_yyyyyyyy"
      ],
      "data_streams" : [ ],
      "composable_templates" : [ ]
    }
  }
}
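This policy moves an index into the warm phase after 50h, moves it to the nfs zone (cold phase) after 7d, and deletes it after 32d.
Querying one index, however, shows that it is not where the policy says it should be:
# GET envoy_access_log_oi-2024.12.19-000292/_ilm/explain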
{
  "indices" : {
    "envoy_access_log_oi-2024.12.19-000292" : {
      "index" : "envoy_access_log_oi-2024.12.19-000292",
      "managed" : true,
      "policy" : "envoy_access_log-ilm",
      "lifecycle_date_millis" : 1734665457979,
      "age" : "13.31d",
      "phase" : "warm",
      "phase_time_millis" : 1735237921065,
      "action" : "forcemerge",
      "action_time_millis" : 1734845462955,
      "step" : "forcemerge",
      "step_time_millis" : 1735237921065,
      "is_auto_retryable_error" : true,
      "failed_step_retry_count" : 1,
      "phase_execution" : {
        "policy" : "envoy_access_log-ilm",
        "phase_definition" : {
          "min_age" : "50h",
          "actions" : {
            "allocate" : {
              "number_of_replicas" : 0,
              "include" : { },
              "exclude" : { },
              "require" : {
                "zone" : "dlocal-ssd"
              }
            },
            "forcemerge" : {
              "max_num_segments" : 1
            }
          }
        },
        "version" : 9,
        "modified_date_in_millis" : 1678893255794
      }
    }
  }
}
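The comparison done by eye here (age against each phase's min_age) can be sketched in Python. This is a minimal sketch and not part of the original notes: the helper names parse_age_days and expected_phase are hypothetical, only the d, h and ms units that appear in this policy are handled, and the values are copied from the outputs above rather than fetched from a cluster.

```python
import re

# min_age per phase, copied from the envoy_access_log-ilm policy above
PHASE_MIN_AGE = {"hot": "0ms", "warm": "50h", "cold": "7d", "delete": "32d"}

# Only the units used in this policy; real ES time values allow more units.
_UNIT_TO_DAYS = {"ms": 1 / 86_400_000, "h": 1 / 24, "d": 1.0}


def parse_age_days(value: str) -> float:
    """Convert a string such as '50h' or '13.31d' into days."""
    match = re.fullmatch(r"([0-9.]+)(ms|h|d)", value)
    if match is None:
        raise ValueError(f"unsupported time value: {value!r}")
    return float(match.group(1)) * _UNIT_TO_DAYS[match.group(2)]


def expected_phase(age: str, phase_min_age: dict) -> str:
    """Return the latest phase whose min_age the index has already reached."""
    age_days = parse_age_days(age)
    reached = [
        (parse_age_days(min_age), phase)
        for phase, min_age in phase_min_age.items()
        if parse_age_days(min_age) <= age_days
    ]
    return max(reached)[1]


# "age" and "phase" as reported by _ilm/explain above
actual_phase = "warm"
wanted_phase = expected_phase("13.31d", PHASE_MIN_AGE)
print(f"expected={wanted_phase} actual={actual_phase}")
```

For the index above this reports cold as the expected phase while the index is still in warm, which is the mismatch these notes investigate.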
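The index is 13.31d old, well past the 7d threshold for cold, yet it is still stuck in the warm phase at the forcemerge step. There are a few possible causes and ways to investigate: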
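- Another index is running force_merge, so this index is blocked behind it
  1. GET /_cat/thread_pool/force_merge?v&s=node_name shows the state of the force_merge thread pool on each node
  2. The exporter metric elasticsearch_thread_pool_queue_count shows the same queue
  3. GET /_tasks?group_by=parents&actions=*forcemerge*&detailed=true lists the force merge tasks currently running in the cluster
  4. By default a node may run only 1 force_merge at a time, and changing this parameter (thread_pool.force_merge.size) requires a restart. From https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-threadpool.html :
     For force merge operations. Thread pool type is fixed with a size of max(1, (# of allocated processors) / 8) and an unbounded queue size.
- The force_merge action of this index failed, and it is also blocking the force_merge action of other indices
  1. discuss.elastic.co/t/elasticsearch-ilm-stuck-at-forcemerge (https://discuss.elastic.co/t/elasticsearch-ilm-stuck-at-forcemerge/184506)
  2. Github - issue - Revisit ILM retry strategy for additional conditions (https://github.com/elastic/elasticsearch/issues/42824)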
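The thread pool sizing formula quoted above, together with a reading of the _cat/thread_pool output, can be worked through in Python. This is a sketch under assumptions: default_force_merge_threads and busy_nodes are hypothetical helpers, integer division is an assumption about how the quoted formula rounds, and the sample rows are made up for illustration.

```python
def default_force_merge_threads(allocated_processors: int) -> int:
    """max(1, (# of allocated processors) / 8), assuming integer division."""
    return max(1, allocated_processors // 8)


def busy_nodes(rows: list) -> list:
    """Return names of nodes whose force_merge pool is running or queueing work.

    `rows` is the parsed body of
    GET /_cat/thread_pool/force_merge?format=json&s=node_name
    where numeric values arrive as strings.
    """
    return [
        row["node_name"]
        for row in rows
        if int(row["active"]) > 0 or int(row["queue"]) > 0
    ]


for cpus in (4, 8, 16, 32):
    print(cpus, "processors ->", default_force_merge_threads(cpus), "thread(s)")

# Made-up rows for illustration only, not taken from a real cluster
sample_rows = [
    {"node_name": "node-a", "name": "force_merge", "active": "1", "queue": "3", "rejected": "0"},
    {"node_name": "node-b", "name": "force_merge", "active": "0", "queue": "0", "rejected": "0"},
]
print(busy_nodes(sample_rows))
```

Under that rounding assumption, any node with fewer than 16 allocated processors gets a single force_merge thread, so one long merge is enough to make every other index on that node wait in the queue.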
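To intervene, manually move the index suspected of blocking past this action with the ILM move API (triggering the force_merge API by hand is another option to consider).
For example, when the index is at the segment-count step: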
POST _ilm/move/envoy_access_log-2024.12.13-000822
{
  "current_step": {
    "phase": "warm",
    "action": "forcemerge",
    "name": "segment-count"
  },
  "next_step": {
    "phase": "warm",
    "action": "complete"
  }
}
Or, when the index is at the forcemerge step:
POST _ilm/move/envoy_access_log-2024.12.13-000822
{
  "current_step": {
    "phase": "warm",
    "action": "forcemerge",
    "name": "forcemerge"
  },
  "next_step": {
    "phase": "warm",
    "action": "complete"
  }
}
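The two requests above differ only in current_step.name, so the body can be derived from the _ilm/explain output instead of being typed by hand. A minimal sketch, assuming the same warm/complete target as the examples above; build_move_body is a hypothetical helper and nothing here talks to a cluster.

```python
import json


def build_move_body(explain_entry: dict) -> dict:
    """Build a POST _ilm/move/<index> body that skips to the end of the current phase.

    `explain_entry` is one value from the "indices" object returned by
    GET <index>/_ilm/explain.
    """
    phase = explain_entry["phase"]
    return {
        "current_step": {
            "phase": phase,
            "action": explain_entry["action"],
            "name": explain_entry["step"],
        },
        # Same target shape as the hand-written examples: action "complete"
        "next_step": {"phase": phase, "action": "complete"},
    }


# Fields copied from the _ilm/explain output shown earlier
entry = {
    "index": "envoy_access_log_oi-2024.12.19-000292",
    "phase": "warm",
    "action": "forcemerge",
    "step": "forcemerge",
}
print(f"POST _ilm/move/{entry['index']}")
print(json.dumps(build_move_body(entry), indent=2))
```

Taking current_step from the explain output keeps it in line with the step the index is actually on, which is the only thing that differs between the two hand-written variants.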