Es-force merge blocking notes: difference between revisions

Revision as of 17:12, 3 January 2025 (Fri)

ES version: 7.17.5

Sometimes an ES ILM policy does not behave as expected, especially a policy that includes a forcemerge action.

For example, take this ILM policy:

# GET /_ilm/policy/envoy_access_log-ilm 
{
  "envoy_access_log-ilm" : {
    "version" : 9,
    "modified_date" : "2023-03-15T15:14:15.794Z",
    "policy" : {
      "phases" : {
        "warm" : {
          "min_age" : "50h",
          "actions" : {
            "allocate" : {
              "number_of_replicas" : 0,
              "include" : { },
              "exclude" : { },
              "require" : {
                "zone" : "dlocal-ssd"
              }
            },
            "forcemerge" : {
              "max_num_segments" : 1
            }
          }
        },
        "cold" : {
          "min_age" : "7d",
          "actions" : {
            "allocate" : {
              "number_of_replicas" : 0,
              "include" : { },
              "exclude" : { },
              "require" : {
                "zone" : "nfs"
              }
            },
            "readonly" : { }
          }
        },
        "hot" : {
          "min_age" : "0ms",
          "actions" : {
            "rollover" : {
              "max_size" : "50gb",
              "max_age" : "1d"
            }
          }
        },
        "delete" : {
          "min_age" : "32d",
          "actions" : {
            "delete" : {
              "delete_searchable_snapshot" : true
            }
          }
        }
      },
      "_meta" : {
        "description" : "default policy for the envoy access log indices,created by sanXian"
      }
    },
    "in_use_by" : {
      "indices" : [
        "envoy_access_log_xxxxxx",
        "envoy_access_log_yyyyyyyy"
      ],
      "data_streams" : [ ],
      "composable_templates" : [ ]
    }
  }
}

This policy deletes an index at 32 days, moves it to the nfs zone (cold phase) at 7 days, and moves it into the warm phase at 50 hours.
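
To make that schedule concrete, the sketch below converts the min_age values into milliseconds and reports which phase an index of a given age would be in if min_age were the only constraint. It is a minimal illustration, not how ILM itself decides: the helper names min_age_to_ms and phase_by_age are hypothetical, only the ms/s/m/h/d units are handled, and a real index must also finish every action of its current phase before moving on.

```python
import re

# Milliseconds per unit, for the units that appear in this policy.
_UNIT_MS = {"ms": 1, "s": 1_000, "m": 60_000, "h": 3_600_000, "d": 86_400_000}

# min_age values copied from the envoy_access_log-ilm policy above.
PHASES = {
    "hot": {"min_age": "0ms"},
    "warm": {"min_age": "50h"},
    "cold": {"min_age": "7d"},
    "delete": {"min_age": "32d"},
}


def min_age_to_ms(value):
    """Convert a time value such as '50h' or '7d' into milliseconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h|d)", value)
    if match is None:
        raise ValueError(f"unsupported time value: {value!r}")
    return int(match.group(1)) * _UNIT_MS[match.group(2)]


def phase_by_age(phases, age_ms):
    """Return the phase with the largest min_age that age_ms has reached, or None."""
    reached = [
        (min_age_to_ms(spec["min_age"]), name)
        for name, spec in phases.items()
        if min_age_to_ms(spec["min_age"]) <= age_ms
    ]
    return max(reached)[1] if reached else None


if __name__ == "__main__":
    # Age of 13.31 days, expressed in milliseconds.
    print(phase_by_age(PHASES, int(13.31 * 86_400_000)))
```

By min_age alone, an index aged 13.31 days has passed the 7d threshold of the cold phase, so finding it in any earlier phase points to an action that has not completed.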

Querying one index, however, shows that it has not followed this schedule:

# GET envoy_access_log_oi-2024.12.19-000292/_ilm/explain
{
  "indices" : {
    "envoy_access_log_oi-2024.12.19-000292" : {
      "index" : "envoy_access_log_oi-2024.12.19-000292",
      "managed" : true,
      "policy" : "envoy_access_log-ilm",
      "lifecycle_date_millis" : 1734665457979,
      "age" : "13.31d",
      "phase" : "warm",
      "phase_time_millis" : 1735237921065,
      "action" : "forcemerge",
      "action_time_millis" : 1734845462955,
      "step" : "forcemerge",
      "step_time_millis" : 1735237921065,
      "is_auto_retryable_error" : true,
      "failed_step_retry_count" : 1,
      "phase_execution" : {
        "policy" : "envoy_access_log-ilm",
        "phase_definition" : {
          "min_age" : "50h",
          "actions" : {
            "allocate" : {
              "number_of_replicas" : 0,
              "include" : { },
              "exclude" : { },
              "require" : {
                "zone" : "dlocal-ssd"
              }
            },
            "forcemerge" : {
              "max_num_segments" : 1
            }
          }
        },
        "version" : 9,
        "modified_date_in_millis" : 1678893255794
      }
    }
  }
}
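
The timestamps in this output can be turned into durations to see how long an index has been sitting on its current step. The following is a minimal sketch, assuming the explain response has already been parsed into a dict; the function name step_durations and the shape of its return value are made up for illustration.

```python
import time

MS_PER_DAY = 86_400_000


def step_durations(explain_response, now_millis=None):
    """Summarize, per managed index, the current ILM step and the days spent on it."""
    if now_millis is None:
        now_millis = int(time.time() * 1000)
    summary = {}
    for name, info in explain_response["indices"].items():
        if not info.get("managed"):
            continue  # skip indices that ILM does not manage
        days = (now_millis - info["step_time_millis"]) / MS_PER_DAY
        summary[name] = {
            "phase": info["phase"],
            "action": info["action"],
            "step": info["step"],
            "days_on_step": round(days, 2),
            "retries": info.get("failed_step_retry_count", 0),
        }
    return summary
```

Sorting the result by days_on_step is one way to spot the indices that have waited longest on forcemerge.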

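The index is stuck in the warm phase: at an age of 13.31d it is well past the cold phase's 7d min_age, yet it is still on the forcemerge step. There are a few possible causes, each with ways to investigate: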

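Other indices are running force_merge operations, so this index is blocked behind them
1. GET /_cat/thread_pool/force_merge?v&s=node_name shows the active and queued force merges on each node
2. The exporter metric elasticsearch_thread_pool_queue_count exposes the same queue
3. GET /_tasks?group_by=parents&actions=*forcemerge*&detailed=true lists the force merge tasks currently in the cluster
4. By default a node may run only 1 force merge at a time, and changing this setting ( thread_pool.force_merge.size , https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-threadpool.html ) requires a node restart. The current reference describes the pool as: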
  For force merge operations. Thread pool type is fixed with a size of max(1, (# of allocated processors) / 8) and an unbounded queue size.
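
As a quick worked example of the quoted formula, the sketch below computes the pool size for a given processor count. It assumes the division is integer division, and the function name force_merge_pool_size is hypothetical. The quote comes from the current reference, so whether the same default applies to 7.17.5 should be checked against the reference for that version.

```python
def force_merge_pool_size(allocated_processors):
    """Evaluate max(1, allocated_processors / 8), assuming integer division."""
    if allocated_processors < 1:
        raise ValueError("allocated_processors must be at least 1")
    return max(1, allocated_processors // 8)


if __name__ == "__main__":
    for processors in (4, 8, 16, 32):
        print(processors, force_merge_pool_size(processors))
```

Under this reading, any node with fewer than 16 allocated processors gets a single force merge thread, which matches the observation that only one force merge runs per node at a time.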

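This index's force_merge action failed, and it is also blocking the force_merge actions of other indices
1. https://discuss.elastic.co/t/elasticsearch-ilm-stuck-at-forcemerge/184506
2. GitHub issue "Revisit ILM retry strategy for additional conditions": https://github.com/elastic/elasticsearch/issues/42824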

To intervene, use the ILM move API on the index suspected of blocking so that it skips this action (triggering the force merge API by hand is another option).

For example:

POST _ilm/move/envoy_access_log-2024.12.13-000822
{
  "current_step": { 
    "phase": "warm",
    "action": "forcemerge",
    "name": "segment-count"
  },
  "next_step": { 
    "phase": "warm",
    "action": "complete"
  }
}

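Or, when _ilm/explain reports the current step as forcemerge rather than segment-count (current_step must match the step the index is actually on):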

POST _ilm/move/envoy_access_log-2024.12.13-000822
{
  "current_step": { 
    "phase": "warm",
    "action": "forcemerge",
    "name": "forcemerge"
  },
  "next_step": { 
    "phase": "warm",
    "action": "complete"
  }
}
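
Because current_step has to mirror what _ilm/explain reports, the request body can be derived from the explain entry instead of typed by hand. The sketch below is one way to do that, assuming the per-index explain entry has been parsed into a dict; build_move_body is a hypothetical helper, and the next_step shape (phase plus action, no name) simply follows the requests above.

```python
import json


def build_move_body(index_info, next_phase, next_action):
    """Build an ILM move request body from one index's _ilm/explain entry."""
    return {
        # Copy the step exactly as explain reports it.
        "current_step": {
            "phase": index_info["phase"],
            "action": index_info["action"],
            "name": index_info["step"],
        },
        # Same shape as the hand-written requests in this note.
        "next_step": {"phase": next_phase, "action": next_action},
    }


if __name__ == "__main__":
    info = {"phase": "warm", "action": "forcemerge", "step": "forcemerge"}
    print(json.dumps(build_move_body(info, "warm", "complete"), indent=2))
```

The printed JSON can be pasted as the body of POST _ilm/move/<index>.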

