Loki Logging System Distributed Deployment in Practice (8): Troubleshooting

Notes

Is Loki, the trendy newcomer in logging, really as great as it sounds? After reading this troubleshooting write-up you may no longer think so.

  1. The GitHub community is not very responsive: most issues are never really resolved and simply get auto-closed.
  2. Error messages are quite vague and have no clear boundaries, so it is hard to tell which component a problem actually comes from.
  3. In a distributed deployment there are still a lot of components to deal with.
  4. When log volume is high the resource overhead is also very high, especially memory, and the resulting OOMs are hard to keep under control.
  5. It reuses the Cortex design based on a hash ring, which turns out to be another root of endless trouble.

Troubleshooting

Note: this troubleshooting log is somewhat messy, because many errors have no clear boundary and cannot be pinned to the real root cause; most were eventually worked around by trial and error, but a few hard ones remain unsolved.

Error 1:

# kubectl logs -f -n grafana promtail-hzvcw 
level=warn ts=2020-11-20T10:55:18.654550762Z caller=client.go:288 component=client host=loki:3100 msg="error sending batch, will retry" status=-1 error="Post \"http://loki:3100/loki/api/v1/push\": dial tcp: lookup loki on 100.100.2.138:53: no such host"
level=warn ts=2020-11-20T10:55:40.951543459Z caller=client.go:288 component=client host=loki:3100 msg="error sending batch, will retry" status=-1 error="Post \"http://loki:3100/loki/api/v1/push\": dial tcp: lookup loki on 100.100.2.138:53: no such host"

Solution:
The DaemonSet was switched to hostNetwork, so by default it used the host's DNS, which cannot resolve the in-cluster service name. Set the DNS policy accordingly:

# kubectl edit ds -n grafana promtail 
...
    spec:
      dnsPolicy: ClusterFirstWithHostNet
      hostNetwork: true

Error 2:

# kubectl logs -f -n grafana promtail-hzvcw 
level=warn ts=2020-11-20T11:02:38.873724616Z caller=client.go:288 component=client host=loki:3100 msg="error sending batch, will retry" status=429 error="server returned HTTP status 429 Too Many Requests (429): Ingestion rate limit exceeded (limit: 4194304 bytes/sec) while attempting to ingest '2470' lines totaling '1048456' bytes, reduce log volume or contact your Loki administrator to see if the limit can be increased"
level=warn ts=2020-11-20T11:02:39.570025453Z caller=client.go:288 component=client host=loki:3100 msg="error sending batch, will retry" status=429 error="server returned HTTP status 429 Too Many Requests (429): Ingestion rate limit exceeded (limit: 4194304 bytes/sec) while attempting to ingest '2470' lines totaling '1048456' bytes, reduce log volume or contact your Loki administrator to see if the limit can be increased"
level=warn ts=2020-11-20T11:02:41.165844784Z caller=client.go:288 component=client host=loki:3100 msg="error sending batch, will retry" status=429 error="server returned HTTP status 429 Too Many Requests (429): Ingestion rate limit exceeded (limit: 4194304 bytes/sec) while attempting to ingest '2470' lines totaling '1048456' bytes, reduce log volume or contact your Loki administrator to see if the limit can be increased"
level=warn ts=2020-11-20T11:02:44.446410234Z caller=client.go:288 component=client host=loki:3100 msg="error sending batch, will retry" status=429 error="server returned HTTP status 429 Too Many Requests (429): Ingestion rate limit exceeded (limit: 4194304 bytes/sec) while attempting to ingest '2462' lines totaling '1048494' bytes, reduce log volume or contact your Loki administrator to see if the limit can be increased"

Solution:
Reference: https://github.com/grafana/loki/issues/1923
This often happens right after the initial install, when collection first starts: a large backlog of logs has already accumulated but none of it has been shipped yet, so Promtail tries to push all of those huge log files at once.

The amount of logs being pushed exceeds Loki's ingestion limit, hence the 429. If you want to raise the limit, adjust the Loki configuration:
Note: do not set these values too high, or the ingesters may come under too much pressure.

config:
  limits_config:
    ingestion_rate_strategy: local
    # Per-tenant ingestion rate limit, in MB per second
    ingestion_rate_mb: 15
    # Per-tenant allowed ingestion burst size, in MB
    ingestion_burst_size_mb: 20

Error 3:

Loki reports the following right at startup:

Retention period should now be a multiple of periodic table duration

Solution:
The retention period must be a multiple of the periodic table duration. By default each table covers 168h, so the retention period would have to be a multiple of 168h (e.g. 168h x 4); with the 24h period configured below, any multiple of 24h such as 72h works.

schema_config:
  configs:
  - from: 2020-10-24
    # How the index tables are rotated and stored; default period is 168h
    index:
      prefix: index_
      period: 24h
    # How the chunk tables are rotated and stored; default period is 168h
    chunks:
      period: 24h
...
table_manager:
  retention_deletes_enabled: true
  # Must be a multiple of index.period and chunks.period above
  retention_period: 72h

Error 4:

Clicking Live in Grafana reports:

error: undefined
An unexpected error happened

Solution:
Check the Loki logs:

# kubectl logs -f -n grafana --tail=10 querier-loki-0
level=error ts=2020-11-27T11:02:51.783911277Z caller=http.go:217 org_id=fake traceID=26e4e30b17b6caf9 msg="Error in upgrading websocket" err="websocket: the client is not using the websocket protocol: 'upgrade' token not found in 'Connection' header"
level=error ts=2020-11-27T11:04:05.316230666Z caller=http.go:217 org_id=fake traceID=71a571b766390d4f msg="Error in upgrading websocket" err="websocket: the client is not using the websocket protocol: 'websocket' token not found in 'Upgrade' header"

# kubectl logs -f -n grafana --tail=10 frontend-loki-1
level=warn ts=2020-11-27T10:56:29.735085942Z caller=logging.go:60 traceID=2044a83a71a5274a msg="GET /loki/api/v1/tail?query=%7Bapp_kubernetes_io_managed_by%3D%22Helm%22%7D 23.923117ms, error: http: request method or response status code does not allow body ws: true; Accept-Encoding: gzip, deflate; Accept-Language: en,zh-CN;q=0.9,zh;q=0.8; Cache-Control: no-cache; Connection: Upgrade; Pragma: no-cache; Sec-Websocket-Extensions: permessage-deflate; client_max_window_bits; Sec-Websocket-Key: v+7oOwg9O8RTMZ4PrFLXVw==; Sec-Websocket-Version: 13; Upgrade: websocket; User-Agent: Grafana/7.3.2; X-Forwarded-For: 10.30.0.74, 10.41.131.193, 10.41.131.193; X-Forwarded-Server: pub-k8s-mgt-prd-05654-ecs; X-Real-Ip: 10.30.0.74; X-Server-Ip: 10.30.0.74; "

This problem typically appears when a reverse proxy or load balancer does not pass WebSocket requests through correctly. For example, with Nginx:

proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";

Reference: https://github.com/grafana/grafana/issues/22905
When you add Loki as a data source in Grafana, two custom headers need to be added to the Loki data source configuration:

# Indicates that Upgrade is a hop-by-hop header; this field is meant for the proxy
Connection: Upgrade

# Indicates that the browser wants to upgrade to the WebSocket protocol; this field is meant for the program that ultimately handles the request
# If only Upgrade: websocket is present, the proxy does not support the WebSocket upgrade and, per the standard, the request should be treated as a plain HTTP request
Upgrade: websocket
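
If the data source is managed through Grafana provisioning rather than the UI, the same two headers can be added roughly as follows; this is only a sketch for my environment, and the data source name and URL are placeholders you should adapt:

apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    # URL of the Loki query frontend in my setup (adjust to yours)
    url: http://frontend-loki.grafana:3100
    jsonData:
      httpHeaderName1: "Connection"
      httpHeaderName2: "Upgrade"
    secureJsonData:
      httpHeaderValue1: "Upgrade"
      httpHeaderValue2: "websocket"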

Reference: https://github.com/grafana/loki/issues/2878
In addition, the query frontend needs tail_proxy_url pointing at the querier so that tail (live) requests can be proxied:

  frontend:
    tail_proxy_url: "http://querier-loki:3100"

Error 5:

# kubectl logs -f -n grafana --tail=10 frontend-loki-0
2020-11-27 11:09:10.401591 I | http: proxy error: unsupported protocol scheme "querier-loki"

Solution:

  frontend:
    #tail_proxy_url: "querier-loki:3100"
    tail_proxy_url: "http://querier-loki:3100"

Error 6:

Grafana reports an error when the query Line limit is set to 20000:

max entries limit per query exceeded, limit > max_entries_limit (20000 > 5000)

Solution:
Reference: https://github.com/grafana/loki/issues/2226
Note: do not set this too high, or it will put extra pressure on queries.

limits_config:
  # Default is 5000
  max_entries_limit_per_query: 20000

Error 7:

# kubectl logs -f -n grafana ingester-loki-0
level=error ts=2020-11-23T10:14:42.241840832Z caller=client.go:294 component=client host=loki:3100 msg="final error sending batch" status=400 error="server returned HTTP status 400 Bad Request (400): entry for stream '{container=\"filebeat\", controller_revision_hash=\"9b74f8b55\", filename=\"/var/log/pods/elastic-system_filebeat-pkxdh_07dbab26-9c45-4133-be31-54f359e9a733/filebeat/0.log\", job=\"elastic-system/filebeat\", k8s_app=\"filebeat\", namespace=\"elastic-system\", pod=\"filebeat-pkxdh\", pod_template_generation=\"6\", restart=\"time04\", stream=\"stderr\"}' has timestamp too old: 2020-11-12 03:24:07.638861214 +0000 UTC"

Solution:

limits_config:
  # Disable the rejection of old samples
  reject_old_samples: false
  reject_old_samples_max_age: 168h

Error 8:

Loki's memory keeps growing until the pod is OOMKilled:

# kubectl describe pod -n grafana loki-0
...
Containers:
  loki:
    ...
    State:          Running
      Started:      Mon, 23 Nov 2020 19:54:33 +0800
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Mon, 23 Nov 2020 19:38:12 +0800
      Finished:     Mon, 23 Nov 2020 19:54:32 +0800
    Ready:          False
    Restart Count:  1
    Limits:
      cpu:     8
      memory:  40Gi
    Requests:
      cpu:        1
      memory:     20Gi

Solution:
After moving to the distributed deployment it became clear that the ingesters were consuming most of the memory; each of the 10 instances could grow to 40 GB and get OOMKilled.

config:
  ingester:
    # How long a chunk may sit in memory without updates before it is flushed, even if it has not reached the maximum chunk size. Half-full chunks are therefore still flushed after a while, as long as they see no further activity.
    chunk_idle_period: 3m
    # How long a chunk is kept in memory after it has been flushed
    chunk_retain_period: 1m
    chunk_encoding: gzip

Error 9:

# kubectl logs -f -n grafana distributor-loki-0
level=error ts=2020-11-24T10:22:04.66435891Z caller=pool.go:161 msg="error removing stale clients" err="empty ring"
level=error ts=2020-11-24T10:22:19.664371484Z caller=pool.go:161 msg="error removing stale clients" err="empty ring"

Solution:
Case 1:
Reference: https://github.com/grafana/loki/issues/2155
That issue blames a mismatch between replication_factor and the number of ingester instances. In my setup all replica counts are 1: the ingester replication_factor is 1, while Cassandra's keyspace replication_factor is 3.
Query the keyspace information:

# kubectl exec cassandra-cassandra-dc1-dc1-rack1-0 -c cassandra -n grafana -- cqlsh -e "SELECT * FROM system_schema.keyspaces;" cassandra-cassandra-dc1-dc1-nodes -ucassandra -pcassandra

 keyspace_name      | durable_writes | replication
--------------------+----------------+-------------------------------------------------------------------------------------
        system_auth |           True | {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '1'}
      system_schema |           True |                             {'class': 'org.apache.cassandra.locator.LocalStrategy'}
 system_distributed |           True | {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '3'}
             system |           True |                             {'class': 'org.apache.cassandra.locator.LocalStrategy'}
               loki |           True | {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '3'}
      system_traces |           True | {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '2'}

Drop the keyspace:

# kubectl exec cassandra-cassandra-dc1-dc1-rack1-0 -c cassandra -n grafana -- cqlsh -e "DROP KEYSPACE loki;" cassandra-cassandra-dc1-dc1-nodes -ucassandra -pcassandra

Recreate the keyspace:

# kubectl exec cassandra-cassandra-dc1-dc1-rack1-0 -c cassandra -n grafana -- cqlsh -e "CREATE KEYSPACE loki WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };" cassandra-cassandra-dc1-dc1-nodes -ucassandra -pcassandra

In my case the problem persisted.

Case 2:
Reference: https://github.com/grafana/loki/issues/2131
Note: collectors/ is the prefix Loki is configured to use in Consul.

# consul kv get -keys /
collectors/
# consul kv delete -recurse collectors

The same problem has been seen with etcd as the ring store; the equivalent operation has to be performed in etcd to get it working:

# ETCDCTL_API=3 etcdctl get --prefix collectors/ --keys-only
collectors/ring

# ETCDCTL_API=3 etcdctl del "" --from-key=true

Case 3:
This happens at initial startup: the hash ring simply does not contain any instances yet, so the error is unavoidable.
See the source: github.com/cortexproject/cortex/pkg/ring/client/pool.go

func (p *Pool) removeStaleClients() {
    // Only if service discovery has been configured.
    if p.discovery == nil {
        return
    }

    serviceAddrs, err := p.discovery()
    if err != nil {
        level.Error(util.Logger).Log("msg", "error removing stale clients", "err", err)
        return
    }

    for _, addr := range p.RegisteredAddresses() {
        if util.StringsContain(serviceAddrs, addr) {
            continue
        }
        level.Info(util.Logger).Log("msg", "removing stale client", "addr", addr)
        p.RemoveClientFor(addr)
    }
}

See the source: github.com/cortexproject/cortex/pkg/ring/ring.go

var (
    // ErrEmptyRing is the error returned when trying to get an element when nothing has been added to hash.
    ErrEmptyRing = errors.New("empty ring")

    // ErrInstanceNotFound is the error returned when trying to get information for an instance
    // not registered within the ring.
    ErrInstanceNotFound = errors.New("instance not found in the ring")
)

Judging from the source this is not a big deal: it happens when stale clients are being removed but the ring cannot be queried.
In my case it was actually caused by my own configuration: the config files used for the distributed components were incomplete. Deploying every component from a single complete configuration file finally resolved it.

Error 10:

# kubectl logs -f -n grafana ingester-loki-0
level=error ts=2020-11-25T12:05:41.624367027Z caller=transfer.go:200 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
level=error ts=2020-11-25T12:05:41.731088455Z caller=transfer.go:200 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"

Solution:
Reference: https://github.com/grafana/loki/issues/1159
At the moment the only way to get the instances healthy again is to delete the ring and redeploy.
This error shows up when all ingester instances are deleted and redeployed: the old ring still exists in Consul, so the first new ingester believes it has to hand its chunks over to other ingesters, none of which can be found.
Simply delete the ring from Consul so the ingesters treat this as a brand-new environment:

# kubectl exec -it -n grafana consul-server-0 -- consul kv get -keys /
collectors/

# kubectl exec -it -n grafana consul-server-0 -- consul kv delete -recurse collectors
Success! Deleted keys with prefix: collectors

It can also be deleted via the Consul API:

# curl -XDELETE localhost:8500/v1/kv/collectors/ring

Note: the same errors also appeared during the window in which an instance was killed shortly after auto-joining the ring on timeout:

# kubectl get pod -n grafana
NAME                                  READY   STATUS    RESTARTS   AGE
ingester-loki-0                       1/1     Running   3          28m
ingester-loki-1                       1/1     Running   0          21m
ingester-loki-2                       1/1     Running   0          20m
ingester-loki-3                       1/1     Running   0          15m
ingester-loki-4                       1/1     Running   1          14m
ingester-loki-5                       1/1     Running   1          10m

# kubectl logs -f -n grafana ingester-loki-5 --previous 
level=info ts=2020-11-25T12:22:18.844831919Z caller=loki.go:227 msg="Loki started"
level=info ts=2020-11-25T12:22:18.850195155Z caller=lifecycler.go:547 msg="instance not found in ring, adding with no tokens" ring=ingester
level=info ts=2020-11-25T12:22:18.881569904Z caller=lifecycler.go:394 msg="auto-joining cluster after timeout" ring=ingester
level=info ts=2020-11-25T12:23:16.687397103Z caller=signals.go:55 msg="=== received SIGINT/SIGTERM ===\n*** exiting"
level=info ts=2020-11-25T12:23:16.687736993Z caller=lifecycler.go:444 msg="lifecycler loop() exited gracefully" ring=ingester
level=info ts=2020-11-25T12:23:16.687775292Z caller=lifecycler.go:743 msg="changing instance state from" old_state=ACTIVE new_state=LEAVING ring=ingester
level=error ts=2020-11-25T12:23:17.095929216Z caller=transfer.go:200 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
level=error ts=2020-11-25T12:23:17.26296551Z caller=transfer.go:200 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
level=error ts=2020-11-25T12:23:17.57792403Z caller=transfer.go:200 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
level=error ts=2020-11-25T12:23:18.091885973Z caller=transfer.go:200 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
level=error ts=2020-11-25T12:23:19.43179565Z caller=transfer.go:200 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
level=error ts=2020-11-25T12:23:22.61721327Z caller=transfer.go:200 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
level=error ts=2020-11-25T12:23:26.915926188Z caller=transfer.go:200 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
level=error ts=2020-11-25T12:23:30.19295829Z caller=transfer.go:200 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
level=error ts=2020-11-25T12:23:33.96531916Z caller=transfer.go:200 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
level=error ts=2020-11-25T12:23:38.966422889Z caller=transfer.go:200 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
level=error ts=2020-11-25T12:23:38.9664856Z caller=lifecycler.go:788 msg="failed to transfer chunks to another instance" ring=ingester err="terminated after 10 retries"

# kubectl logs -f -n grafana ingester-loki-5 
level=info ts=2020-11-25T12:24:04.174795365Z caller=loki.go:227 msg="Loki started"
level=info ts=2020-11-25T12:24:04.181074205Z caller=lifecycler.go:578 msg="existing entry found in ring" state=ACTIVE tokens=128 ring=ingester

The join_after interval can be tuned; both too long and too short cause problems:

ingester:
  lifecycler:
    # How long to wait to claim tokens and chunks from a leaving member. Once this duration expires the instance auto-joins the ring.
    join_after: 30s

Error 11:

Under high concurrency the ingester cluster avalanches and can no longer start:

# kubectl get pod -n grafana |egrep ingester-loki       
ingester-loki-0                       0/1     CrashLoopBackOff   9          84m
ingester-loki-1                       0/1     CrashLoopBackOff   8          77m
ingester-loki-2                       0/1     CrashLoopBackOff   8          76m
ingester-loki-3                       1/1     Running            0          71m
ingester-loki-4                       0/1     CrashLoopBackOff   9          70m
ingester-loki-5                       0/1     CrashLoopBackOff   9          66m
ingester-loki-6                       0/1     CrashLoopBackOff   7          63m
ingester-loki-7                       0/1     CrashLoopBackOff   8          60m
ingester-loki-8                       0/1     CrashLoopBackOff   7          59m
ingester-loki-9                       0/1     Running            8          58m

# kubectl logs -f -n grafana ingester-loki-9
level=warn ts=2020-11-25T13:20:28.700549245Z caller=lifecycler.go:232 msg="found an existing instance(s) with a problem in the ring, this instance cannot become ready until this problem is resolved. The /ring http endpoint on the distributor (or single binary) provides visibility into the ring." ring=ingester err="instance ingester-loki-7 past heartbeat timeout"
level=warn ts=2020-11-25T13:20:32.375494375Z caller=lifecycler.go:232 msg="found an existing instance(s) with a problem in the ring, this instance cannot become ready until this problem is resolved. The /ring http endpoint on the distributor (or single binary) provides visibility into the ring." ring=ingester err="instance ingester-loki-6 in state LEAVING"

Solution:
Reference: https://github.com/cortexproject/cortex/issues/3040
This happens when ingester instances die abnormally, e.g. pods that are deleted and recreated: the old instance information stays in the ring (it appears to be keyed by pod name + IP, and on a restart the pod name stays the same while the IP changes). There does not seem to be a clean fix at the moment; the blunt workaround is to delete the ring data.
For Consul:

# kubectl exec -it -n grafana consul-server-0 -- consul kv delete -recurse collectors

For etcd:

# kubectl exec -it -n grafana etcd-0 /bin/sh
# ETCDCTL_API=3 etcdctl del "" --from-key=true
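
To confirm which instance is actually stuck (in LEAVING state or past its heartbeat timeout), the /ring endpoint mentioned in the warning can be inspected before or after wiping the keys; a rough check for my deployment might look like this:

# kubectl port-forward -n grafana svc/distributor-loki 3100:3100
# curl -s http://127.0.0.1:3100/ring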

Changing the Cassandra consistency level to ONE also resolves it:

  storage_config:
    cassandra:
      addresses: cassandra-cassandra-dc1-dc1-nodes
      port: 9042
      keyspace: loki
      #consistency: "QUORUM"
      consistency: "ONE"

Error 12:

# kubectl logs -f -n grafana ingester-loki-0
level=error ts=2020-11-24T10:39:58.544128896Z caller=lifecycler.go:788 msg="failed to transfer chunks to another instance" ring=ingester err="terminated after 10 retries"
level=info ts=2020-11-24T10:40:28.552357089Z caller=lifecycler.go:496 msg="instance removed from the KV store" ring=ingester
level=info ts=2020-11-24T10:40:28.552435782Z caller=module_service.go:90 msg="module stopped" module=ingester
level=info ts=2020-11-24T10:40:28.552470413Z caller=module_service.go:90 msg="module stopped" module=memberlist-kv
level=info ts=2020-11-24T10:40:28.555855919Z caller=module_service.go:90 msg="module stopped" module=store
level=info ts=2020-11-24T10:40:28.556037275Z caller=server_service.go:50 msg="server stopped"
level=info ts=2020-11-24T10:40:28.556059022Z caller=module_service.go:90 msg="module stopped" module=server
level=info ts=2020-11-24T10:40:28.556073269Z caller=loki.go:228 msg="Loki stopped"

Solution:
This error occurred on the ingester's very first start; after it restarted automatically everything was fine. Loki's retry/reconnect logic appears to be at fault here.

# kubectl logs -f -n grafana ingester-loki-0 --tail=100
level=info ts=2020-11-24T10:40:31.181716616Z caller=loki.go:227 msg="Loki started"
level=info ts=2020-11-24T10:40:31.186107519Z caller=lifecycler.go:547 msg="instance not found in ring, adding with no tokens" ring=ingester
level=info ts=2020-11-24T10:40:31.210151298Z caller=lifecycler.go:394 msg="auto-joining cluster after timeout" ring=ingester

Error 13:

Adding the data source in Grafana reports: Loki: Internal Server Error. 500. too many failed ingesters
Solution:

# kubectl logs -f -n grafana frontend-loki-0
level=error ts=2020-11-26T05:22:25.719644039Z caller=retry.go:71 msg="error processing request" try=0 err="rpc error: code = Code(500) desc = too many failed ingesters\n"
level=error ts=2020-11-26T05:22:25.720963727Z caller=retry.go:71 msg="error processing request" try=1 err="rpc error: code = Code(500) desc = too many failed ingesters\n"
level=error ts=2020-11-26T05:22:25.722093452Z caller=retry.go:71 msg="error processing request" try=2 err="rpc error: code = Code(500) desc = too many failed ingesters\n"
level=error ts=2020-11-26T05:22:25.722679543Z caller=retry.go:71 msg="error processing request" try=3 err="rpc error: code = Code(500) desc = too many failed ingesters\n"
level=error ts=2020-11-26T05:22:25.723216916Z caller=retry.go:71 msg="error processing request" try=4 err="rpc error: code = Code(500) desc = too many failed ingesters\n"
level=warn ts=2020-11-26T05:22:25.723320728Z caller=logging.go:71 traceID=75ef1fe1fdb05ce msg="GET /loki/api/v1/label?start=1606367545667000000 (500) 4.884289ms Response: \"too many failed ingesters\\n\" ws: false; Accept: application/json, text/plain, */*; Accept-Encoding: gzip, deflate; Accept-Language: zh-CN; Dnt: 1; User-Agent: Grafana/7.3.2; X-Forwarded-For: 10.30.0.73, 10.41.131.198, 10.41.131.198; X-Forwarded-Server: pub-k8s-mgt-prd-05667-ecs; X-Grafana-Nocache: true; X-Grafana-Org-Id: 1; X-Real-Ip: 10.30.0.73; X-Server-Ip: 10.30.0.73; "

The frontend_worker needs to be enabled:

  # Configure the querier worker, which picks up and runs queries queued by the query-frontend
  frontend_worker:
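
At a minimum the worker has to know the frontend's gRPC address; a minimal sketch of the block, using the gRPC Service created for the frontend in Error 17 below:

  frontend_worker:
    frontend_address: "frontend-loki-grpc:9095"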

Error 14:

# kubectl logs -f -n grafana distributor-loki-0 |grep error
level=warn ts=2020-11-24T17:02:35.222118895Z caller=logging.go:71 traceID=3c236f4a1c3df592 msg="POST /loki/api/v1/push (500) 12.95762ms Response: \"rpc error: code = ResourceExhausted desc = grpc: received message larger than max (5211272 vs. 4194304)\\n\" ws: false; Content-Length: 944036; Content-Type: application/x-protobuf; User-Agent: promtail/2.0.0; "
level=warn ts=2020-11-24T17:02:36.143885453Z caller=logging.go:71 traceID=610c85e39bf2205e msg="POST /loki/api/v1/push (500) 29.834383ms Response: \"rpc error: code = ResourceExhausted desc = grpc: received message larger than max (5211272 vs. 4194304)\\n\" ws: false; Content-Length: 944036; Content-Type: application/x-protobuf; User-Agent: promtail/2.0.0; "

Solution:

  server:
    grpc_server_max_recv_msg_size: 8388608
    grpc_server_max_send_msg_size: 8388608

Error 15:

# kubectl logs -f -n grafana ingester-loki-0 |egrep -v 'level=debug|level=info'
level=error ts=2020-11-25T10:47:03.021701842Z caller=flush.go:199 org_id=fake msg="failed to flush user" err="Operation timed out - received only 0 responses."
level=error ts=2020-11-25T10:47:03.874812809Z caller=flush.go:199 org_id=fake msg="failed to flush user" err="Operation timed out - received only 0 responses."
level=error ts=2020-11-25T10:47:04.119370803Z caller=flush.go:199 org_id=fake msg="failed to flush user" err="Operation timed out - received only 0 responses."
level=error ts=2020-11-25T10:47:04.289613481Z caller=flush.go:199 org_id=fake msg="failed to flush user" err="Operation timed out - received only 0 responses."

Solution:
This was hit when there was only one replica of the data; the keyspace replication factor needs to be increased.
With a replication factor of 1 you depend on a single node to answer queries for a given partition. Raising the RF to around 3 makes the application far more resilient to slow or failed nodes.

# ALTER KEYSPACE loki WITH replication = {'class':'SimpleStrategy', 'replication_factor' : 3};

Some people suggest increasing the following Cassandra parameters (not verified):

# vi cassandra.yaml
# How long the coordinator should wait for writes to complete
write_request_timeout_in_ms: 2000
# How long the coordinator should wait for counter writes to complete
counter_write_request_timeout_in_ms: 5000
commitlog_segment_size_in_mb: 32

The timeout on the Loki side can also be raised:

  storage_config:
    cassandra:
      timeout: 30s

Error 16:

# kubectl logs -f -n grafana ingester-loki-0 |egrep -v 'level=debug|level=info'
level=error ts=2020-11-24T07:41:08.234749133Z caller=flush.go:199 org_id=fake msg="failed to flush user" err="no chunk table found for time 1603287501.182"
level=error ts=2020-11-24T07:41:15.543694173Z caller=flush.go:199 org_id=fake msg="failed to flush user" err="context deadline exceeded"
level=error ts=2020-11-24T07:41:16.785829547Z caller=flush.go:199 org_id=fake msg="failed to flush user" err="context deadline exceeded"

Solution:
This is most likely an ingester performance problem; the ingesters were OOMing frequently.

During the period when the ingester cluster was running normally, the load on the nodes hosting Cassandra was very high:

# top
top - 19:34:13 up  9:30,  1 user,  load average: 19.11, 19.24, 17.57
Tasks: 352 total,   4 running, 348 sleeping,   0 stopped,   0 zombie
%Cpu(s): 48.1 us, 10.9 sy,  0.0 ni, 34.6 id,  2.8 wa,  0.0 hi,  3.7 si,  0.0 st
KiB Mem : 65806668 total, 32340232 free, 17603344 used, 15863092 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 47487824 avail Mem 

Context switches are high:

# vmstat 1 -w
procs -----------------------memory---------------------- ---swap-- -----io---- -system-- --------cpu--------
 r  b         swpd         free         buff        cache   si   so    bi    bo   in   cs  us  sy  id  wa  st
13  1            0     32187816       368908     15561596    0    0   713  2349   44   74  22   5  72   1   0
10  0            0     32181392       368916     15561204    0    0 46824 17544 134678 116836  46  13  41   1   0
38  0            0     32153804       368980     15559408    0    0 17008 117840 134737 86828  52  14  33   2   0
89  3            0     32139480       369004     15559560    0    0 36988 117944 139337 92005  52  16  31   1   0
19  0            0     32177728       369064     15577352    0    0 13740 105716 109264 100241  48  12  31   9   0
 8  2            0     32215968       369100     15568960    0    0 29956 109324 146574 96660  50  16  32   2   0
11  0            0     32237324       369156     15562756    0    0 34932 103524 107201 71884  50  13  34   4   0

Disk saturation and I/O wait are both high:

# iostat -x 1
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          45.00    0.00   14.85    3.87    0.00   36.28

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda               0.00    33.00    0.00   12.00     0.00   228.00    38.00     0.01    7.58    0.00    7.58   0.58   0.70
vdb               0.00     0.00    0.00    2.00     0.00     4.00     4.00     0.00    1.00    0.00    1.00   1.00   0.20
vdc               0.00     0.00    0.00    4.00     0.00     0.00     0.00     0.00    0.25    0.00    0.25   0.00   0.00
vdd               0.00    72.00  105.00  158.00  5400.00 61480.00   508.59    11.60   44.29    2.02   72.37   1.32  34.60
vde               1.00    64.00  195.00  215.00  8708.00 93776.00   499.92    36.01   90.11   20.66  153.10   1.71  70.20
vdf               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

Checking the disks confirms that those two devices do belong to Cassandra:
# lsblk 
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
vda    253:0    0  100G  0 disk 
└─vda1 253:1    0  100G  0 part /
vdb    253:16   0  200G  0 disk /data0
vdc    253:32   0  200G  0 disk /var/lib/container
vdd    253:48   0    2T  0 disk /var/lib/container/kubelet/pods/83e41036-6df9-4a81-8ac8-f9a606d7f029/volumes/kubernetes.io~csi/disk-da080df0-fda9-
vde    253:64   0    2T  0 disk /var/lib/container/kubelet/pods/8f49e222-9197-443a-9054-aa1996be9493/volumes/kubernetes.io~csi/disk-43627433-f76a-
vdf    253:80   0   20G  0 disk /var/lib/container/kubelet/pods/96f86124-0f31-477f-a764-b3eaebb0bc6d/volumes/kubernetes.io~csi/disk-ca32c719-3cfa-

So my guess is that these errors are caused by Cassandra not keeping up; the ingesters then flush slowly, memory keeps climbing, and that in turn is probably also what leads to the OOMs.

Error 17:

# kubectl logs -f -n grafana querier-loki-0 |egrep -v 'level=debug|level=info'
level=error ts=2020-11-24T17:20:54.847423531Z caller=worker_frontend_manager.go:96 msg="error contacting frontend" err="rpc error: code = Unavailable desc = connection closed"
level=error ts=2020-11-24T17:20:55.001467122Z caller=worker_frontend_manager.go:96 msg="error contacting frontend" err="rpc error: code = Unavailable desc = connection closed"
level=error ts=2020-11-24T17:20:55.109948605Z caller=worker_frontend_manager.go:96 msg="error contacting frontend" err="rpc error: code = Unavailable desc = connection closed"

Solution:
There is no Service exposing the gRPC port at all:

# kubectl get svc -n grafana |grep frontend-loki
frontend-loki                            ClusterIP   172.21.11.213   <none>        3100/TCP      5h3m
frontend-loki-headless                   ClusterIP   None            <none>        3100/TCP      5h3m

I originally wanted to define it through the chart's extraPorts and service values, but the service value does not support multiple ports, so I defined the Service manually:

# cat > frontend-loki-grpc.yaml <
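
The heredoc above got truncated; it was essentially just a plain Service exposing gRPC port 9095 of the frontend pods. A rough sketch of what it looked like (the selector labels are assumptions about the chart's labels and must match your frontend pods):

apiVersion: v1
kind: Service
metadata:
  name: frontend-loki-grpc
  namespace: grafana
spec:
  type: ClusterIP
  selector:
    app: loki          # assumed label, verify with: kubectl get pod -n grafana --show-labels
    release: frontend  # assumed label
  ports:
  - name: grpc
    port: 9095
    targetPort: 9095
    protocol: TCP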

Update the configuration accordingly:

  frontend_worker:
    frontend_address: "frontend-loki-grpc:9095"

Error 18:

# kubectl logs -f -n grafana distributor-loki-0 |egrep -v 'level=debug|level=info'
level=warn ts=2020-11-24T18:00:27.948766075Z caller=logging.go:71 traceID=570d6eabac64c40a msg="POST /loki/api/v1/push (500) 2.347177ms Response: \"at least 1 live replicas required, could only find 0\\n\" ws: false; Content-Length: 168137; Content-Type: application/x-protobuf; User-Agent: promtail/2.0.0; "
level=warn ts=2020-11-24T18:00:28.182532696Z caller=logging.go:71 traceID=40dda83791c46d22 msg="POST /loki/api/v1/push (500) 2.930596ms Response: \"at least 1 live replicas required, could only find 0\\n\" ws: false; Content-Length: 180980; Content-Type: application/x-protobuf; User-Agent: promtail/2.0.0; "
level=warn ts=2020-11-24T18:00:28.512289072Z caller=logging.go:71 traceID=4de6a07f956e0a26 msg="POST /loki/api/v1/push (500) 2.218879ms Response: \"at least 1 live replicas required, could only find 0\\n\" ws: false; Content-Length: 155860; Content-Type: application/x-protobuf; User-Agent: promtail/2.0.0; "
level=warn ts=2020-11-24T18:00:32.355800743Z caller=logging.go:71 traceID=23893b2884f630d6 msg="POST /loki/api/v1/push (500) 899.118µs Response: \"empty ring\\n\" ws: false; Content-Length: 42077; Content-Type: application/x-protobuf; User-Agent: promtail/2.0.0; "

# kubectl logs -f -n grafana promtail-zszhj
level=warn ts=2020-11-24T17:59:40.125688742Z caller=client.go:288 component=client host=distributor-loki:3100 msg="error sending batch, will retry" status=500 error="server returned HTTP status 500 Internal Server Error (500): at least 1 live replicas required, could only find 0"
level=warn ts=2020-11-24T17:59:49.18473538Z caller=client.go:288 component=client host=distributor-loki:3100 msg="error sending batch, will retry" status=500 error="server returned HTTP status 500 Internal Server Error (500): at least 1 live replicas required, could only find 0"
level=warn ts=2020-11-24T18:00:20.779127702Z caller=client.go:288 component=client host=distributor-loki:3100 msg="error sending batch, will retry" status=500 error="server returned HTTP status 500 Internal Server Error (500): at least 1 live replicas required, could only find 0"

# kubectl logs -f -n grafana promtail-zszhj
level=warn ts=2020-11-24T17:36:03.040901602Z caller=client.go:288 component=client host=distributor-loki:3100 msg="error sending batch, will retry" status=500 error="server returned HTTP status 500 Internal Server Error (500): rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 10.41.186.88:9095: connect: connection refused\""
level=warn ts=2020-11-24T17:36:04.316609825Z caller=client.go:288 component=client host=distributor-loki:3100 msg="error sending batch, will retry" status=500 error="server returned HTTP status 500 Internal Server Error (500): rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 10.41.186.88:9095: connect: connection refused\""

Solution:

  1. Promtail cannot reach the distributor, possibly because the distributor is down.
  2. The distributor cannot reach the ingesters; these errors show up whenever ingesters are OOM-killed.
  3. It happens whenever a Promtail send fails, but Promtail retries with backoff, so there is no need to worry too much.
    See the source: loki/pkg/promtail/client/client.go
    for backoff.Ongoing() {
        start := time.Now()
        status, err = c.send(ctx, tenantID, buf)
        requestDuration.WithLabelValues(strconv.Itoa(status), c.cfg.URL.Host).Observe(time.Since(start).Seconds())

        ...

        // Only retry 429s, 500s and connection-level errors.
        if status > 0 && status != 429 && status/100 != 5 {
            break
        }

        // The err here comes from c.send() above
        level.Warn(c.logger).Log("msg", "error sending batch, will retry", "status", status, "error", err)
        batchRetries.WithLabelValues(c.cfg.URL.Host).Inc()
        backoff.Wait()
    }


func (c *client) send(ctx context.Context, tenantID string, buf []byte) (int, error) {
    ctx, cancel := context.WithTimeout(ctx, c.cfg.Timeout)
    defer cancel()
    req, err := http.NewRequest("POST", c.cfg.URL.String(), bytes.NewReader(buf))
    if err != nil {
        return -1, err
    }
    req = req.WithContext(ctx)
    req.Header.Set("Content-Type", contentType)
    req.Header.Set("User-Agent", UserAgent)

    ...

    if resp.StatusCode/100 != 2 {
        scanner := bufio.NewScanner(io.LimitReader(resp.Body, maxErrMsgLen))
        line := ""
        if scanner.Scan() {
            line = scanner.Text()
        }
        err = fmt.Errorf("server returned HTTP status %s (%d): %s", resp.Status, resp.StatusCode, line)
    }
    return resp.StatusCode, err
}

Error 20:

Adding frontend-loki as a data source in Grafana reports: Loki: Internal Server Error. 500. unsupported protocol scheme "querier-loki"
Solution:

  frontend:
    # URL of the downstream querier; it must include http:// and the namespace
    # Note: the official docs describe this as a Prometheus address, which is wrong
    downstream_url: "http://querier-loki.grafana:3100"

Error 21:

Grafana queries report: unconfigured table index_18585
Solution:

  1. The query time range exceeds the data retention period; in my setup, querying more than 3 days of data triggers this error.
  table_manager:
    retention_deletes_enabled: true
    retention_period: 72h
  2. Caused by replication_factor being higher than the configured consistency level can satisfy:
  storage_config:
    cassandra:
      #consistency: "QUORUM"
      consistency: "ONE"
      # replication_factor is not compatible with the NetworkTopologyStrategy strategy
      replication_factor: 1

Error 22:

Grafana queries report: Cannot achieve consistency level QUORUM
Solution:
Although my Cassandra cluster has 3 nodes, the keyspace was created with replication_factor 1, so the error appeared as soon as a Cassandra node was OOM-killed. Recreate the keyspace with a higher replication factor:

# CREATE KEYSPACE loki WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 };

The quorum formula:

quorum = (sum_of_replication_factors / 2) + 1

where:

sum_of_replication_factors = datacenter1_RF + datacenter2_RF + … + datacentern_RF
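
For example, with a single data center and replication_factor = 3, quorum = (3 / 2) + 1 = 2, so QUORUM reads and writes still succeed with one node down; with replication_factor = 1, quorum = 1 and losing the only replica immediately makes QUORUM unachievable, which is exactly what happened here.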

Error 23:

Grafana queries report: gocql: no response received from cassandra within timeout period
Solution:
Increase the Cassandra timeouts on the Loki side:

  storage_config:
    cassandra:
      timeout: 30s
      connect_timeout: 30s

Error 24:

# kubectl logs -f -n grafana ingester-loki-0
level=info ts=2020-11-24T07:28:49.886973328Z caller=events.go:247 module=gocql client=table-manager msg=Session.handleNodeUp ip=10.41.178.155 port=9042
level=error ts=2020-11-24T07:28:49.996021839Z caller=connectionpool.go:523 module=gocql client=table-manager msg="failed to connect" address=10.41.178.155:9042 error="Keyspace 'loki' does not exist"

Solution:
This happens because the loki keyspace had not been created in Cassandra beforehand; it can be created manually in advance:

# CREATE KEYSPACE loki WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };

However, checking Cassandra showed the keyspace had in fact been created automatically, and after Loki restarted once the error no longer appeared, so once again Loki's retry logic seems to be the real problem.

Error 25:

# kubectl logs -f -n grafana querier-loki-0
level=info ts=2020-11-27T04:37:11.75172174Z caller=events.go:247 module=gocql client=index-write msg=Session.handleNodeUp ip=10.41.238.253 port=9042
level=info ts=2020-11-27T04:37:11.760460559Z caller=events.go:247 module=gocql client=chunks-write msg=Session.handleNodeUp ip=10.41.176.102 port=9042
level=info ts=2020-11-27T04:37:11.760530285Z caller=events.go:271 module=gocql client=chunks-write msg=Session.handleNodeDown ip=10.41.238.255 port=9042
level=info ts=2020-11-27T04:37:11.776492597Z caller=events.go:247 module=gocql client=chunks-read msg=Session.handleNodeUp ip=10.41.176.65 port=9042
level=info ts=2020-11-27T04:37:11.776559951Z caller=events.go:271 module=gocql client=chunks-read msg=Session.handleNodeDown ip=10.41.239.0 port=9042
level=error ts=2020-11-27T04:37:11.783500883Z caller=connectionpool.go:523 module=gocql client=index-read msg="failed to connect" address=10.41.176.65:9042 error="Keyspace 'loki' does not exist"
level=error ts=2020-11-27T04:37:11.847266845Z caller=connectionpool.go:523 module=gocql client=index-write msg="failed to connect" address=10.41.238.253:9042 error="Keyspace 'loki' does not exist"

Solution:
This error occurred after changing the Cassandra replication strategy from SimpleStrategy to NetworkTopologyStrategy; updating the configuration and restarting the querier clears it.
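
For reference, the strategy change itself was a statement along these lines; dc1 is the data center name that shows up in Error 40 below, and the factor is only an example:

# ALTER KEYSPACE loki WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};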

Error 26:

# kubectl logs -f -n grafana distributor-loki-0
level=error ts=2020-11-26T05:36:58.159259504Z caller=pool.go:161 msg="error removing stale clients" err="too many failed ingesters"
level=error ts=2020-11-26T05:37:13.159263304Z caller=pool.go:161 msg="error removing stale clients" err="too many failed ingesters"
level=error ts=2020-11-26T05:37:28.159244957Z caller=pool.go:161 msg="error removing stale clients" err="too many failed ingesters"

Solution:

  1. The ingester service is down.
  2. The ring stored in Consul has become extremely slow to read.
    Reference: https://github.com/hashicorp/consul/issues/3358
    Consul keys are limited to 512KB by default; once the value grows beyond that it can no longer be read.
  ingester:
    lifecycler:
      # Number of tokens registered on the hash ring, i.e. virtual nodes
      # Setting this to 512 made the ring in Consul extremely slow to load; other components kept reporting too many failed ingesters even though most ingester instances were basically fine
      num_tokens: 128
  3. Regarding the replication factor:
    See the source: https://github.com/grafana/loki/blob/master/vendor/github.com/cortexproject/cortex/pkg/ring/replication_strategy.go
func (s *DefaultReplicationStrategy) Filter(ingesters []IngesterDesc, op Operation, replicationFactor int, heartbeatTimeout time.Duration, zoneAwarenessEnabled bool) ([]IngesterDesc, int, error) {
    // We need a response from a quorum of ingesters, which is n/2 + 1.  In the
    // case of a node joining/leaving, the actual replica set might be bigger
    // than the replication factor, so use the bigger or the two.
    if len(ingesters) > replicationFactor {
        replicationFactor = len(ingesters)
    }

    minSuccess := (replicationFactor / 2) + 1

    // Note the explanation below:
    // Skip those that have not heartbeated in a while. NB these are still
    // included in the calculation of minSuccess, so if too many failed ingesters
    // will cause the whole write to fail.
    for i := 0; i < len(ingesters); {
        if ingesters[i].IsHealthy(op, heartbeatTimeout) {
            i++
        } else {
            ingesters = append(ingesters[:i], ingesters[i+1:]...)
        }
    }

    // This is just a shortcut - if there are not minSuccess available ingesters,
    // after filtering out dead ones, don't even bother trying.
    if len(ingesters) < minSuccess {
        var err error

        if zoneAwarenessEnabled {
            err = fmt.Errorf("at least %d live replicas required across different availability zones, could only find %d", minSuccess, len(ingesters))
        } else {
            err = fmt.Errorf("at least %d live replicas required, could only find %d", minSuccess, len(ingesters))
        }

        return nil, 0, err
    }

    return ingesters, len(ingesters) - minSuccess, nil
}

See the source: https://github.com/grafana/loki/blob/master/vendor/github.com/cortexproject/cortex/pkg/ring/ring.go
// GetAll returns all available ingesters in the ring.

func (r *Ring) GetAll(op Operation) (ReplicationSet, error) {
    r.mtx.RLock()
    defer r.mtx.RUnlock()

    if r.ringDesc == nil || len(r.ringTokens) == 0 {
        return ReplicationSet{}, ErrEmptyRing
    }

    // Calculate the number of required ingesters;
    // ensure we always require at least RF-1 when RF=3.
    numRequired := len(r.ringDesc.Ingesters)
    if numRequired < r.cfg.ReplicationFactor {
        numRequired = r.cfg.ReplicationFactor
    }
    maxUnavailable := r.cfg.ReplicationFactor / 2
    numRequired -= maxUnavailable

    ingesters := make([]IngesterDesc, 0, len(r.ringDesc.Ingesters))
    for _, ingester := range r.ringDesc.Ingesters {
        if r.IsHealthy(&ingester, op) {
            ingesters = append(ingesters, ingester)
        }
    }

    if len(ingesters) < numRequired {
        return ReplicationSet{}, fmt.Errorf("too many failed ingesters")
    }

    return ReplicationSet{
        Ingesters: ingesters,
        MaxErrors: len(ingesters) - numRequired,
    }, nil
}

Error 27:

# kubectl logs -f -n grafana ingester-loki-2
level=warn ts=2020-11-26T11:28:06.835845619Z caller=grpc_logging.go:38 method=/logproto.Pusher/Push duration=4.811872ms err="rpc error: code = Code(400) desc = entry with timestamp 2020-11-26 11:28:06.736090173 +0000 UTC ignored, reason: 'entry out of order' for stream: {...},\nentry with timestamp 2020-11-26 11:28:06.736272024 +0000 UTC ignored, reason: 'entry out of order' for stream: {...}, total ignored: 38 out of 38" msg="gRPC\n"

Solution:
The log entries for the same stream arrived out of chronological order; Loki rejects out-of-order entries within a stream, so the ingester emits this warning and drops them.

Error 28:

# kubectl logs -f -n grafana ingester-loki-0 --previous 
level=info ts=2020-11-26T09:41:07.828515989Z caller=main.go:128 msg="Starting Loki" version="(version=2.0.0, branch=HEAD, revision=6978ee5d7)"
level=info ts=2020-11-26T09:41:07.828765981Z caller=server.go:225 http=[::]:3100 grpc=[::]:9095 msg="server listening on addresses"
level=error ts=2020-11-26T09:41:17.867540854Z caller=session.go:286 module=gocql client=index-read msg="dns error" error="lookup cassandra-cassandra-dc1-dc1-nodes on 172.21.0.10:53: no such host"

Solution:

  storage_config:
    cassandra:
      addresses: cassandra-cassandra-dc1-dc1-nodes
      disable_initial_host_lookup: true

Error 29:

# kubectl logs -f -n grafana ingester-loki-10 
level=error ts=2020-11-27T03:05:52.561015003Z caller=main.go:85 msg="validating config" err="invalid schema config: the table period must be a multiple of 24h (1h for schema v1)"

Solution:
The table period must be an integer multiple of 24h:

  schema_config:
    configs:
    - from: 2020-10-24
      index:
        prefix: index_
        period: 24h
      chunks:
        prefix: chunks_
        period: 24h

Error 30:

# kubectl logs -f -n grafana querier-loki-0
level=error ts=2020-11-27T04:05:51.648649625Z caller=worker_frontend_manager.go:96 msg="error contacting frontend" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 172.21.0.34:9095: connect: connection refused\""
level=error ts=2020-11-27T04:05:51.648673039Z caller=worker_frontend_manager.go:96 msg="error contacting frontend" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 172.21.0.34:9095: connect: connection refused\""

Solution:

  1. The frontend service is down; bring it back up.
  2. The frontend service configuration is wrong and needs to be re-applied.

Error 31:

# kubectl logs -f -n grafana querier-loki-0
level=error ts=2020-11-27T04:18:55.308433009Z caller=connectionpool.go:523 module=gocql client=index-write msg="failed to connect" address=10.41.176.218:9042 error="gocql: no response received from cassandra within timeout period"
level=error ts=2020-11-27T04:18:55.308493581Z caller=connectionpool.go:523 module=gocql client=chunks-write msg="failed to connect" address=10.41.176.218:9042 error="gocql: no response received from cassandra within timeout period"
level=error ts=2020-11-27T04:18:55.308461356Z caller=connectionpool.go:523 module=gocql client=index-read msg="failed to connect" address=10.41.176.218:9042 error="gocql: no response to connection startup within timeout"
level=error ts=2020-11-27T04:18:55.308497652Z caller=connectionpool.go:523 module=gocql client=chunks-read msg="failed to connect" address=10.41.176.218:9042 error="gocql: no response received from cassandra within timeout period"

Solution:
Loki apparently cannot reach the Cassandra database:

# kubectl get pod -n grafana -o wide |grep 10.41.176.218
cassandra-cassandra-dc1-dc1-rack1-6   1/2     Running   0          15h     10.41.176.218   cn-hangzhou.10.41.128.145              

# kubectl logs -f -n grafana cassandra-cassandra-dc1-dc1-rack1-6 -c cassandra     
WARN  [MessagingService-Incoming-/10.41.176.122] IncomingTcpConnection.java:103 UnknownColumnFamilyException reading from socket; closing
org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find table for cfId 26534be0-3042-11eb-b160-ef05878ef351. If a table was just created, this is likely due to the schema not being fully propagated.  Please wait for schema agreement on table creation.
        at org.apache.cassandra.config.CFMetaData$Serializer.deserialize(CFMetaData.java:1578) ~[apache-cassandra-3.11.9.jar:3.11.9]
        at org.apache.cassandra.db.partitions.PartitionUpdate$PartitionUpdateSerializer.deserialize30(PartitionUpdate.java:900) ~[apache-cassandra-3.11.9.jar:3.11.9]
        at org.apache.cassandra.db.partitions.PartitionUpdate$PartitionUpdateSerializer.deserialize(PartitionUpdate.java:875) ~[apache-cassandra-3.11.9.jar:3.11.9]
        at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:415) ~[apache-cassandra-3.11.9.jar:3.11.9]
        at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:434) ~[apache-cassandra-3.11.9.jar:3.11.9]
        at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:371) ~[apache-cassandra-3.11.9.jar:3.11.9]
        at org.apache.cassandra.net.MessageIn.read(MessageIn.java:123) ~[apache-cassandra-3.11.9.jar:3.11.9]
        at org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpConnection.java:195) ~[apache-cassandra-3.11.9.jar:3.11.9]
        at org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(IncomingTcpConnection.java:183) ~[apache-cassandra-3.11.9.jar:3.11.9]
        at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:94) ~[apache-cassandra-3.11.9.jar:3.11.9]

Explanation:
When a new node is added to an existing cluster it has no schema yet and needs to sync the tables from the cluster.
If the node is added before its table schema has been created, requests routed to that node cannot be processed; wait for schema agreement.
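
A quick way to check whether the cluster has reached schema agreement is nodetool: all nodes should report the same schema version (pod name as in my deployment):

# kubectl exec -n grafana cassandra-cassandra-dc1-dc1-rack1-0 -c cassandra -- nodetool describecluster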

Error 33:

# kubectl logs -f -n grafana --tail=10 distributor-loki-0
level=warn ts=2020-11-27T05:51:48.000652876Z caller=logging.go:71 traceID=5b92b35f3d057623 msg="POST /loki/api/v1/push (500) 460.775µs Response: \"at least 1 live replicas required, could only find 0\\n\" ws: false; Content-Length: 78746; Content-Type: application/x-protobuf; User-Agent: promtail/2.0.0; "
level=warn ts=2020-11-27T05:51:48.000995397Z caller=logging.go:71 traceID=5095e3e5e161a426 msg="POST /loki/api/v1/push (500) 222.613µs Response: \"at least 1 live replicas required, could only find 0\\n\" ws: false; Content-Length: 6991; Content-Type: application/x-protobuf; User-Agent: promtail/2.0.0; "
...
level=error ts=2020-11-27T05:51:48.31756207Z caller=pool.go:161 msg="error removing stale clients" err="too many failed ingesters"
level=warn ts=2020-11-27T05:51:48.366711308Z caller=logging.go:71 traceID=79703c1097fb3890 msg="POST /loki/api/v1/push (500) 441.255µs Response: \"at least 1 live replicas required, could only find 0\\n\" ws: false; Content-Length: 11611; Content-Type: application/x-protobuf; User-Agent: promtail/2.0.0; "
level=warn ts=2020-11-27T05:52:18.000340865Z caller=logging.go:71 traceID=69210fec26d8a92b msg="POST /loki/api/v1/push (500) 157.989µs Response: \"at least 1 live replicas required, could only find 0\\n\" ws: false; Content-Length: 7214; Content-Type: application/x-protobuf; User-Agent: promtail/2.0.0; "
level=warn ts=2020-11-27T05:52:18.00061787Z caller=logging.go:71 traceID=1a19280a13626e4d msg="POST /loki/api/v1/push (500) 284.977µs Response: \"at least 1 live replicas required, could only find 0\\n\" ws: false; Content-Length: 2767; Content-Type: application/x-protobuf; User-Agent: promtail/2.0.0; "
...
level=error ts=2020-11-27T05:52:18.317576958Z caller=pool.go:161 msg="error removing stale clients" err="too many failed ingesters"
level=warn ts=2020-11-27T05:52:18.357480921Z caller=logging.go:71 traceID=2e144bed2da000e6 msg="POST /loki/api/v1/push (500) 438.836µs Response: \"at least 1 live replicas required, could only find 0\\n\" ws: false; Content-Length: 12040; Content-Type: application/x-protobuf; User-Agent: promtail/2.0.0; "
level=warn ts=2020-11-27T05:52:48.000666573Z caller=logging.go:71 traceID=7b072e8c150d335f msg="POST /loki/api/v1/push (500) 292.152µs Response: \"at least 1 live replicas required, could only find 0\\n\" ws: false; Content-Length: 7142; Content-Type: application/x-protobuf; User-Agent: promtail/2.0.0; "
...
level=error ts=2020-11-27T05:52:48.317561107Z caller=pool.go:161 msg="error removing stale clients" err="too many failed ingesters"
level=warn ts=2020-11-27T05:53:18.000228447Z caller=logging.go:71 traceID=4f484503331187aa msg="POST /loki/api/v1/push (500) 1.347378ms Response: \"at least 1 live replicas required, could only find 0\\n\" ws: false; Content-Length: 89191; Content-Type: application/x-protobuf; User-Agent: promtail/2.0.0; "
level=warn ts=2020-11-27T05:53:18.00071364Z caller=logging.go:71 traceID=73a47d220a07f209 msg="POST /loki/api/v1/push (500) 584.582µs Response: \"at least 1 live replicas required, could only find 0\\n\" ws: false; Content-Length: 8336; Content-Type: application/x-protobuf; User-Agent: promtail/2.0.0; "
...
level=error ts=2020-11-27T05:53:18.317557575Z caller=pool.go:161 msg="error removing stale clients" err="too many failed ingesters"
level=warn ts=2020-11-27T05:53:48.000718725Z caller=logging.go:71 traceID=6b8433e4ce976efe msg="POST /loki/api/v1/push (500) 674.648µs Response: \"at least 1 live replicas required, could only find 0\\n\" ws: false; Content-Length: 81500; Content-Type: application/x-protobuf; User-Agent: promtail/2.0.0; "
level=warn ts=2020-11-27T05:53:48.000751823Z caller=logging.go:71 traceID=3788174d48a7748c msg="POST /loki/api/v1/push (500) 278.135µs Response: \"at least 1 live replicas required, could only find 0\\n\" ws: false; Content-Length: 6658; Content-Type: application/x-protobuf; User-Agent: promtail/2.0.0; "
...
level=error ts=2020-11-27T05:53:48.317566199Z caller=pool.go:161 msg="error removing stale clients" err="too many failed ingesters"
level=warn ts=2020-11-27T05:54:18.000965033Z caller=logging.go:71 traceID=273b54ffd8b6f851 msg="POST /loki/api/v1/push (500) 170.859µs Response: \"at least 1 live replicas required, could only find 0\\n\" ws: false; Content-Length: 7161; Content-Type: application/x-protobuf; User-Agent: promtail/2.0.0; "
level=warn ts=2020-11-27T05:54:18.001483342Z caller=logging.go:71 traceID=66c98097e558e47d msg="POST /loki/api/v1/push (500) 373.922µs Response: \"at least 1 live replicas required, could only find 0\\n\" ws: false; Content-Length: 4954; Content-Type: application/x-protobuf; User-Agent: promtail/2.0.0; "

Solution:
There is a pattern here: the too many failed ingesters error fires roughly every 30s, so it is worth looking for configuration parameters that operate on a 30s-ish interval.
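
A sketch of the ring-heartbeat knobs I would check first; the values shown are what I believe the defaults to be, not tuning recommendations:

  ingester:
    lifecycler:
      # How often the ingester heartbeats its state into the ring (assumed default: 5s)
      heartbeat_period: 5s
      ring:
        # How long after the last heartbeat an ingester is considered unhealthy (assumed default: 1m)
        heartbeat_timeout: 1m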

Error 34:

# kubectl logs -f -n grafana --tail=10 querier-loki-1 
level=error ts=2020-11-27T06:59:46.02122327Z caller=worker_frontend_manager.go:102 msg="error processing requests" err="rpc error: code = Unavailable desc = transport is closing"
level=error ts=2020-11-27T06:59:46.021237567Z caller=worker_frontend_manager.go:102 msg="error processing requests" err="rpc error: code = Unavailable desc = transport is closing"
level=error ts=2020-11-27T06:59:46.021184495Z caller=worker_frontend_manager.go:102 msg="error processing requests" err="rpc error: code = Unavailable desc = transport is closing"

Solution:
This error appears when the frontend is redeployed, deleted, or otherwise unhealthy; it simply means the server side closed the connection.

Error 35:

Grafana queries fail with an Nginx error page:

504 Gateway Time-out

Solution:

# kubectl logs -f -n grafana --tail=10 frontend-loki-0
level=error ts=2020-11-27T08:12:31.178604776Z caller=retry.go:71 msg="error processing request" try=0 err=EOF
level=error ts=2020-11-27T08:12:31.685861905Z caller=retry.go:71 msg="error processing request" try=0 err=EOF
level=error ts=2020-11-27T08:12:31.782152574Z caller=retry.go:71 msg="error processing request" try=0 err=EOF
level=error ts=2020-11-27T08:13:00.340916358Z caller=retry.go:71 msg="error processing request" try=1 err="context canceled"
level=info ts=2020-11-27T08:13:00.340962799Z caller=frontend.go:220 org_id=fake traceID=349eb374b92cff3 msg="slow query detected" method=GET host=frontend-loki.grafana:3100 path=/loki/api/v1/query_range time_taken=59.995024399s param_query="{app=\"flog\"} |= \"/solutions/turn-key/facilitate\"" param_start=1606443120000000000 param_end=1606464721000000000 param_step=10 param_direction=BACKWARD param_limit=1000
level=info ts=2020-11-27T08:13:00.341110652Z caller=metrics.go:81 org_id=fake traceID=349eb374b92cff3 latency=fast query="{app=\"flog\"} |= \"/solutions/turn-key/facilitate\"" query_type=filter range_type=range length=6h0m1s step=10s duration=0s status=499 throughput_mb=0 total_bytes_mb=0

The query scanned too much data and timed out.
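
If such long-range queries are legitimate, the timeouts along the whole path have to be raised together: the Nginx/ingress in front of Grafana, Grafana's data source timeout, and Loki's own server and querier timeouts. On the Loki side, a sketch of the knobs involved (values are examples, not recommendations):

  server:
    # HTTP read/write timeouts of the embedded server (defaults are around 30s)
    http_server_read_timeout: 2m
    http_server_write_timeout: 2m
  querier:
    # Per-query timeout inside the querier
    query_timeout: 2m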

Error 36:

# kubectl logs -f -n grafana --tail=10 querier-loki-0
level=info ts=2020-11-28T14:24:04.047366237Z caller=metrics.go:81 org_id=fake traceID=51ab87b8886e7240 latency=fast query="{app=\"flog\"}" query_type=limited range_type=range length=1h0m1s step=2s duration=173.147382ms status=200 throughput_mb=8.287327 total_bytes_mb=1.434929
level=info ts=2020-11-28T14:24:05.03985635Z caller=metrics.go:81 org_id=fake traceID=46f7eb80b08eee92 latency=fast query="{app=\"flog\"}" query_type=limited range_type=range length=1h0m1s step=2s duration=180.095448ms status=200 throughput_mb=7.967602 total_bytes_mb=1.434929
level=error ts=2020-11-28T14:24:44.799142534Z caller=http.go:256 org_id=fake traceID=2ca9bf8b1ba97c08 msg="Error from client" err="websocket: close 1006 (abnormal closure): unexpected EOF"
level=error ts=2020-11-28T14:24:44.799151755Z caller=http.go:279 org_id=fake traceID=2ca9bf8b1ba97c08 msg="Error writing to websocket" err="writev tcp 10.41.182.22:3100->10.41.191.121:49894: writev: connection reset by peer"
level=error ts=2020-11-28T14:24:44.799310451Z caller=http.go:281 org_id=fake traceID=2ca9bf8b1ba97c08 msg="Error writing close message to websocket" err="writev tcp 10.41.182.22:3100->10.41.191.121:49894: writev: connection reset by peer

Solution:

# kubectl  get pod -n grafana -o wide |grep 10.41.182.22
querier-loki-0                        1/1     Running   0          11h     10.41.182.22    cn-hangzhou.10.41.131.196              

# kubectl  get pod -n grafana -o wide |grep 10.41.191.121
frontend-loki-0                       1/1     Running   0          11h     10.41.191.121   cn-hangzhou.10.41.131.200              

Another pattern: the websocket errors show up roughly every 30s, so this is probably timeout related as well.

Error 37:

Clicking around in Grafana reports: Cannot achieve consistency level ONE
Solution:

# kubectl logs -f -n grafana ingester-loki-1 --tail=1
level=error ts=2020-11-28T07:15:19.544408457Z caller=flush.go:199 org_id=fake msg="failed to flush user" err="Cannot achieve consistency level ONE"
level=error ts=2020-11-28T07:15:19.602324008Z caller=flush.go:199 org_id=fake msg="failed to flush user" err="Cannot achieve consistency level ONE"
level=error ts=2020-11-28T07:15:19.602912508Z caller=flush.go:199 org_id=fake msg="failed to flush user" err="Cannot achieve consistency level ONE"
  1. The Cassandra cluster is down.
  2. The Cassandra keyspace or replication strategy is misconfigured.

Error 38:

# kubectl logs -f -n grafana --tail=10 cassandra-cassandra-dc1-dc1-rack1-2 -c cassandra  
INFO  [CompactionExecutor:126] NoSpamLogger.java:91 Maximum memory usage reached (267386880), cannot allocate chunk of 1048576
INFO  [CompactionExecutor:128] NoSpamLogger.java:91 Maximum memory usage reached (267386880), cannot allocate chunk of 1048576
INFO  [CompactionExecutor:128] NoSpamLogger.java:91 Maximum memory usage reached (267386880), cannot allocate chunk of 1048576
INFO  [CompactionExecutor:128] NoSpamLogger.java:91 Maximum memory usage reached (267386880), cannot allocate chunk of 1048576
INFO  [CompactionExecutor:129] NoSpamLogger.java:91 Maximum memory usage reached (267386880), cannot allocate chunk of 1048576

Solution:

# vi cassandra.yaml
file_cache_size_in_mb: 2048 

Error 40:

# kubectl exec cassandra-cassandra-dc1-dc1-rack1-0 -c cassandra -n grafana -- cqlsh -e "desc KEYSPACE loki;" cassandra-cassandra-dc1-dc1-nodes -ucassandra -pcassandra 

CREATE KEYSPACE loki WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '1'}  AND durable_writes = true;
Solution:
The durable_writes option tells Cassandra whether to use the commitlog for updates to this keyspace; it is not mandatory and defaults to true.
Wait a moment and Cassandra will sync the keyspace to all nodes automatically.


Error 41:

# kubectl logs -f -n grafana ingester-loki-1 --tail=1
level=error ts=2020-11-28T07:46:07.802171225Z caller=flush.go:199 org_id=fake msg="failed to flush user" err="unconfigured table chunks_18594"
level=error ts=2020-11-28T07:46:07.805926054Z caller=flush.go:199 org_id=fake msg="failed to flush user" err="unconfigured table chunks_18594"
level=error ts=2020-11-28T07:46:07.813179527Z caller=flush.go:199 org_id=fake msg="failed to flush user" err="unconfigured table chunks_18594"
level=error ts=2020-11-28T07:46:07.814556323Z caller=flush.go:199 org_id=fake msg="failed to flush user" err="unconfigured table chunks_18594"

Solution:
Wait a moment; Loki creates the table automatically and Cassandra then syncs it to all nodes:

# kubectl exec cassandra-cassandra-dc1-dc1-rack1-0 -c cassandra -n grafana -- cqlsh -e "desc KEYSPACE loki;" cassandra-cassandra-dc1-dc1-nodes -ucassandra -pcassandra |grep chunks_18594

# kubectl exec cassandra-cassandra-dc1-dc1-rack1-0 -c cassandra -n grafana -- cqlsh -e "desc KEYSPACE loki;" cassandra-cassandra-dc1-dc1-nodes -ucassandra -pcassandra |grep chunks_18594
CREATE TABLE loki.chunks_18594 (

Error 42:

# kubectl logs -f -n grafana promtail-n6hvs 
level=error ts=2020-07-06T03:58:02.217480067Z caller=client.go:247 component=client host=192.179.11.1:3100 msg="final error sending batch" status=400 error="server returned HTTP status 400 Bad Request (400): entry for stream '{app=\"app_error\", filename=\"/error.log\", host=\"192.179.11.12\"}' has timestamp too new: 2020-07-06 03:58:01.175699907 +0000 UTC"

Fix:
The clocks on the two machines had drifted too far apart: the promtail node was not syncing with an NTP server, so Loki rejected its entries as "timestamp too new". Syncing the clocks fixes it.
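
On the promtail node, for example (a minimal sketch; it assumes ntpdate/chrony are available and the Alibaba Cloud NTP server is reachable), step the clock once, then keep it in sync:

# ntpdate ntp.aliyun.com
# systemctl enable --now chronyd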

Error 43:

# kubectl logs -f -n grafana --tail=1 distributor-loki-0
level=warn ts=2020-11-28T11:22:36.356296195Z caller=logging.go:71 traceID=368cb9d5bab3db0 msg="POST /loki/api/v1/push (500) 683.54µs Response: \"at least 1 live replicas required, could only find 0\\n\" ws: false; Content-Length: 11580; Content-Type: application/x-protobuf; User-Agent: promtail/2.0.0; "
level=warn ts=2020-11-28T11:22:36.448942924Z caller=logging.go:71 traceID=59e19c2fd604299 msg="POST /loki/api/v1/push (500) 391.358µs Response: \"at least 1 live replicas required, could only find 0\\n\" ws: false; Content-Length: 10026; Content-Type: application/x-protobuf; User-Agent: promtail/2.0.0; "
level=warn ts=2020-11-28T11:22:36.470361015Z caller=logging.go:71 traceID=152d6e5c824eb473 msg="POST /loki/api/v1/push (500) 873.049µs Response: \"at least 1 live replicas required, could only find 0\\n\" ws: false; Content-Length: 81116; Content-Type: application/x-protobuf; User-Agent: promtail/2.0.0; "

Fix:
The message means the distributor could not find a single healthy ingester in the hash ring to replicate the push request to, so the first thing to check is whether the ingester pods are up and registered as ACTIVE in the ring.
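
One way to inspect the ring state (a hedged check, assuming the default 3100 HTTP port on the distributor):

# kubectl port-forward -n grafana distributor-loki-0 3100:3100 &
# curl -s http://localhost:3100/ring | grep -oE 'ACTIVE|LEAVING|UNHEALTHY'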

Error 44:

# helm upgrade --install -f loki-config.yaml distributor --set config.target=distributor --set replicas=10 loki-2.0.2.tgz -n grafana
Error: UPGRADE FAILED: "distributor" has no deployed releases

Fix:
It looks like an earlier release was never cleaned up completely, yet nothing shows in the default release listing:

# helm list -n grafana 
NAME            NAMESPACE       REVISION        UPDATED                                 STATUS          CHART           APP VERSION
consul          grafana         1               2020-11-24 18:11:01.057473546 +0800 CST deployed        consul-0.26.0   1.8.5      
etcd            grafana         1               2020-11-27 10:37:34.954108464 +0800 CST deployed        etcd-5.2.1      3.4.14     
ingester        grafana         1               2020-11-30 21:10:33.961141159 +0800 CST deployed        loki-2.0.2      v2.0.0     
promtail        grafana         1               2020-11-28 21:24:12.542476902 +0800 CST deployed        promtail-2.0.1  v2.0.0     
redis           grafana         1               2020-11-30 21:00:49.567268068 +0800 CST deployed        redis-12.1.1    6.0.9      

A plain install fails as well:

# helm install -f loki-config.yaml distributor --set config.target=distributor --set replicas=10 loki-2.0.2.tgz -n grafana          
Error: cannot re-use a name that is still in use

Listing all releases, including undeployed ones, reveals one stuck in the uninstalling state:

# helm list -n grafana -a
NAME            NAMESPACE       REVISION        UPDATED                                 STATUS          CHART           APP VERSION
consul          grafana         1               2020-11-24 18:11:01.057473546 +0800 CST deployed        consul-0.26.0   1.8.5      
distributor     grafana         1               2020-11-30 18:58:58.082367639 +0800 CST uninstalling    loki-2.0.2      v2.0.0     
etcd            grafana         1               2020-11-27 10:37:34.954108464 +0800 CST deployed        etcd-5.2.1      3.4.14     
ingester        grafana         1               2020-11-30 21:10:33.961141159 +0800 CST deployed        loki-2.0.2      v2.0.0     
promtail        grafana         1               2020-11-28 21:24:12.542476902 +0800 CST deployed        promtail-2.0.1  v2.0.0     
redis           grafana         1               2020-11-30 21:00:49.567268068 +0800 CST deployed        redis-12.1.1    6.0.9  

# helm  uninstall -n grafana distributor --timeout 0s
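
If a stuck release still refuses to disappear, deleting its Helm release secret directly also clears it (Helm 3 stores release state in secrets named sh.helm.release.v1.<name>.v<revision>; the revision below is an assumption):

# kubectl get secret -n grafana |grep sh.helm.release.v1.distributor
# kubectl delete secret -n grafana sh.helm.release.v1.distributor.v1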

Error 45:

# kubectl logs -f -n grafana ingester-loki-4 --tail=100
level=error ts=2020-11-30T10:54:24.752039646Z caller=redis_cache.go:57 msg="failed to put to redis" name=store.index-cache-write.redis err="EXECABORT Transaction discarded because of previous errors."
level=error ts=2020-11-30T10:54:24.752183644Z caller=redis_cache.go:37 msg="failed to get from redis" name=chunksredis err="MOVED 3270 10.41.178.48:6379"

Fix:
The EXECABORT comes from Redis transactions: the cache writes are queued inside MULTI/EXEC, and if any queued command errors (here the MOVED redirect returned by Redis Cluster), Redis discards the whole transaction on EXEC, including the commands that were perfectly valid. The MOVED reply suggests the cache client was not following Redis Cluster redirects, so switching from Redis Cluster to a plain master/replica Redis resolved it.
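
To confirm which mode the cache endpoint is actually running in, check cluster_enabled (1 for Redis Cluster, 0 for a standalone or master/replica instance); host and password as used elsewhere in this doc:

# redis-cli -h redis-master -a kong62123 info cluster |grep cluster_enabled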

Error 46:

The Redis master and its two replicas kept climbing in memory until they died:
Surprisingly it was not a straightforward OOM kill; the events show the exec-based probes could no longer run and the container was killed (exit code 137), which suggests the Redis container had simply stopped responding.

# kubectl describe pod -n grafana redis-master-0
    State:          Running
      Started:      Tue, 01 Dec 2020 14:14:24 +0800
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Tue, 01 Dec 2020 13:38:26 +0800
      Finished:     Tue, 01 Dec 2020 14:14:23 +0800
    Ready:          True
    Restart Count:  1
    Liveness:       exec [sh -c /health/ping_liveness_local.sh 5] delay=5s timeout=6s period=5s #success=1 #failure=5
    Readiness:      exec [sh -c /health/ping_readiness_local.sh 1] delay=5s timeout=2s period=5s #success=1 #failure=5
...
Events:
  Type     Reason          Age                From                                Message
  ----     ------          ----               ----                                -------
  Warning  Unhealthy       53s (x2 over 53s)  kubelet, cn-hangzhou.10.41.131.206  Readiness probe failed: OCI runtime exec failed: exec failed: cannot exec a container that has stopped: unknown
  Warning  Unhealthy       50s (x2 over 52s)  kubelet, cn-hangzhou.10.41.131.206  Liveness probe failed: OCI runtime exec failed: exec failed: cannot exec a container that has stopped: unknown
  Normal   Started         49s (x2 over 36m)  kubelet, cn-hangzhou.10.41.131.206  Started container redis

Fix:
Reference: https://blog.csdn.net/chenleixing/article/details/50530419
Possible causes to rule out:

  1. A redis-cluster bug
  2. The client's hash(key) distributing keys unevenly
  3. Individual huge key-values, e.g. a set holding millions of members
  4. Problems with master/replica replication
  5. Other causes (the reference mentions a monitor process)

Scanning for big keys rules that out too:

# redis-cli -c -h redis-master -a kong62123 --bigkeys                  
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.

# Scanning the entire keyspace to find biggest keys as well as
# average sizes per key type.  You can use -i 0.1 to sleep 0.1 sec
# per 100 SCAN commands (not usually needed).

[00.00%] Biggest string found so far '"fake/b00fe6372e647097:1761d0599b8:1761d05d357:b8134b7f"' with 266984 bytes
[00.00%] Biggest string found so far '"fake/72c6b4617011c665:1761cfd2419:1761cfd2850:521c8196"' with 509820 bytes
[00.02%] Biggest string found so far '"fake/cfdd2f587dae9791:1761cfe0194:1761cfe106e:2c7958bb"' with 510839 bytes
[00.04%] Biggest string found so far '"fake/142dc7b3b33f73a5:1761d05ddd4:1761d061d58:69d40fe1"' with 900471 bytes
[00.04%] Biggest string found so far '"fake/232db4b7f88cd92:1761d0a461b:1761d0a76f7:3b53df38"' with 1096777 bytes
[01.63%] Biggest string found so far '"fake/e2f278b0b0f1ec0e:1761d0a817e:1761d0aceb2:e6235dde"' with 1285507 bytes
[05.45%] Biggest string found so far '"fake/2155e9e474220562:1761cfccdd6:1761cfd0975:3f0923bb"' with 1437657 bytes
[05.79%] Biggest string found so far '"fake/75acab73c87ad9e8:1761d06946e:1761d06e0e2:15dddbcc"' with 1485350 bytes
[06.83%] Biggest string found so far '"fake/fab0b46790906085:1761cfe7f43:1761cfea36c:c828e40e"' with 1519460 bytes
[09.30%] Biggest string found so far '"fake/1876208440d8eb69:1761d0a8523:1761d0ac7a1:8ad790b9"' with 1553344 bytes
[45.59%] Biggest string found so far '"fake/5aca699958e02b9e:1761d0a8496:1761d0ac7f7:5a2fa202"' with 1553464 bytes
[57.92%] Biggest string found so far '"fake/f4003cd6c5a6ae4:1761cff0111:1761cff406c:84d48291"' with 1896730 bytes
[83.41%] Biggest string found so far '"fake/31c1ada0c1213aeb:1761cff011c:1761cff4075:806fd546"' with 1896849 bytes

-------- summary -------

Sampled 184141 keys in the keyspace!
Total key length in bytes is 10414202 (avg len 56.56)

Biggest string found '"fake/31c1ada0c1213aeb:1761cff011c:1761cff4075:806fd546"' has 1896849 bytes

0 lists with 0 items (00.00% of keys, avg size 0.00)
0 hashs with 0 fields (00.00% of keys, avg size 0.00)
184141 strings with 20423715612 bytes (100.00% of keys, avg size 110913.46)
0 streams with 0 entries (00.00% of keys, avg size 0.00)
0 sets with 0 members (00.00% of keys, avg size 0.00)
0 zsets with 0 members (00.00% of keys, avg size 0.00)

Turning off redis-exporter did not change anything either.

That leaves the client side. No single key is huge (the largest string is under 2 MB), but the sampled strings add up to roughly 19 GB, so with the default 1h expiration the chunk cache simply outgrows the instance. Shortening the TTL keeps the working set bounded:

      redis:
        endpoint: redis-master:6379
        # Redis Sentinel master name. An empty string for Redis Server or Redis Cluster.
        #master_name: master
        timeout: 10s
        # Shorten the default 1h expiration. Note: do not set it too low, or Redis never gets a chance to serve hits
        expiration: 10m
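
On top of the shorter expiration, capping Redis itself and letting it evict old cache entries is a common safeguard for a pure cache workload (a hedged sketch, not part of the original setup; the 8gb figure is arbitrary):

# redis-cli -h redis-master -a kong62123 config set maxmemory 8gb
# redis-cli -h redis-master -a kong62123 config set maxmemory-policy allkeys-lru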

Integrating Loki with ClickHouse

In principle the latest code already supports a grpc-store; we implemented the grpc-store interface and can connect it to ClickHouse, but for now it fails to flush chunk data.

schema_config:
  configs:
    - from: 2020-12-02
      store: boltdb-shipper
      object_store: grpc-store
      schema: v11
      index:
        prefix: index_
        period: 24h
      chunks:
        prefix: chunks_
        period: 24h

storage_config:
  boltdb_shipper:
    active_index_directory: /tmp/loki/boltdb-shipper-active
    cache_location: /tmp/loki/boltdb-shipper-cache
    cache_ttl: 24h         # Can be increased for faster performance over longer query periods, uses more disk space
    shared_store: filesystem
  grpc_store:
    server_address: 172.16.49.14:50001
  filesystem:
    directory: /tmp/loki/chunks

Reference: https://github.com/grafana/loki/issues/745
The issue explains that ClickHouse is not suited to blob storage, whereas Loki and Cortex really do need a blob store for chunks.

Reference: https://clickhouse.tech/docs/zh/sql-reference/data-types/string/
A ClickHouse String can be of any length and can hold an arbitrary byte sequence, including null bytes, so String stands in for the VARCHAR, BLOB, CLOB and similar types of other DBMSs.
Which is the docs' polite way of saying there is no dedicated BLOB type.

What is a BLOB in a database?
A BLOB (binary large object) is a container for binary data, and it is the column type databases typically use to store binary files.
A typical BLOB is something large such as an image or an audio file; because of its size it needs special handling (uploading, downloading, or storing it in a database).
The core idea of BLOB handling is that the file handler (for example the database manager) does not care what the file is, only how to move and store it.
Treating large objects this way cuts both ways: oversized binary payloads can drag database performance down. Storing large multimedia objects in a database is the classic example of an application dealing with BLOBs.

There is, however, a Node.js implementation that puts a Loki-compatible API in front of ClickHouse:
Reference: https://github.com/lmangani/cLoki
Its architecture, shown below, replaces Loki entirely, so it is basically not an option here:

        Grafana
           |
           |
Agent -> Cloki -> Clickhouse
