Starting point
upgraded GitLab Omnibus from 12.1.6 to 12.2.4
afterwards; Prometheus no longer showed as up
[…]
down: prometheus: 1s, normally up, want up; run: log: (pid 1666) 2881s
[…]
At this time, last known good (for sure) was following the last upgrade (to 12.1.6) as I have output stashed
A restart (systemctl restart gitlab-runsvdir) didn’t fix it, neither did a reboot.
Spoilers
The upgrade had nothing to do with it (which was done in the afternoon on September 9th.)
The broad issue of empty meta.json files is a not entirely uncommon Prometheus thing, so this will be relevant to some extent for non GitLab implementations.
Lesson learned: I will add a gitlab-ctl status to my preflight checks for upgrading GitLab.
Update 2019-12-09: GitLab Omnibus upgraded from Prometheus 2.8.1 to 2.12.0 in release 12.3.0, so from then on, it’d have the fsync fix.
Logs and Processes
Gitlab Omnibus has a service management program which owns everything. Can see it tried …
root 1648 1642 0 14:24 ? 00:00:04 runsv prometheus
root 1666 1648 0 14:24 ? 00:00:00 svlogd -tt /var/log/gitlab/prometheus
This provides a useful clue to where the logs are; they are in an unusual format …
total 244000
-rw-r–r--. 1 root root 30 Oct 15 2017 config
-rw-------. 1 root root 0 Oct 15 2017 lock
-rwxr–r--. 1 root root 140120 Jan 1 2018 @400000005a4a67741657f2a4.s
-rw-r–r--. 1 root root 10908704 Jun 4 14:49 @400000005cf690de3828462c.u
-rw-r–r--. 1 root root 0 Jun 16 15:28 state
-rwxr–r--. 1 root root 149095 Jun 16 15:28 @400000005d0652191df66af4.s
drwxr-xr-x. 20 root root 4096 Jul 3 12:21 …
-rw-r–r--. 1 root root 626069 Jul 3 20:00 @400000005d2063a11e7ddebc.u
-rw-r–r--. 1 root root 593560 Sep 8 10:28 @400000005d76324435f19fac.u
drwx------. 2 gitlab-prometheus root 4096 Sep 9 17:11 .
-rw-r–r--. 1 root root 170182287 Sep 10 15:24 current
The current log shows that there are repeated and ongoing attempts to start prometheus. Most of it seems to be general chattiness, highlights:
caller=main.go:321 msg=“Starting Prometheus” version="(version=2.8.1, branch=master, revision=)
caller=main.go:322 build_context="(go=go1.12.7, user=GitLab-Omnibus, date=)"
caller=main.go:640 msg=“Starting TSDB …”
caller=repair.go:48 component=tsdb msg=“found healthy block” mint=1566547200000 maxt=1566583200000 ulid=01DK97NVZQTSK9H1SQJT2D153S
caller=repair.go:48 component=tsdb msg=“found healthy block” mint=1566842400000 maxt=1566856800000 ulid=01DKBKMTMEGGGA718A4RRKFH56
caller=repair.go:48 component=tsdb msg=“found healthy block” mint=1566892800000 maxt=1566914400000 ulid=01DKNW2N3JSDKCEA2JTME0WGPJ
caller=repair.go:48 component=tsdb msg=“found healthy block” mint=1566972000000 maxt=1566986400000 ulid=01DKPZQ90DGN0RW5H9R10NFVSN
caller=repair.go:48 component=tsdb msg=“found healthy block” mint=1567317600000 maxt=1567382400000 ulid=01DKQYMSP8RRJ95J8DPGTWHMBB
It then logs stopping scrape discovery manager, notify discovery manager, rule manager, notification manager and …
caller=main.go:717 err=“opening storage failed: block dir: “/var/opt/gitlab/prometheus/data/01DM83875Y7WH9AK7EFTXYVFEA”: unexpected end of JSON input”
Yay. Nice full path to where the problem is. It’s like it’s making it easy for me.
Broken data?
The error suggests there’a a broken JSON file to find.
There’s a bunch of directories in the data directory (which is actually what it says to expect in the error …)
01DK97NVZQTSK9H1SQJT2D153S
01DKBKMTMEGGGA718A4RRKFH56
01DKNW2N3JSDKCEA2JTME0WGPJ
01DKPZQ90DGN0RW5H9R10NFVSN
01DKQYMSP8RRJ95J8DPGTWHMBB
01DM83875Y7WH9AK7EFTXYVFEA
01DM838A0DY0VDJNFYGP3RW6CW
At this level, the datestamps aren’t very interesting; all were last updated on the day before I did the upgrade. Probably booting the machine up.
The logs indicate that the first five load ok, it stumbles on the penultimate one.
Last good:
#size;b
drwx------ 68 Sep 8 10:27 01DKQYMSP8RRJ95J8DPGTWHMBB
drwx------ 20 Sep 2 04:00 01DKQYMSP8RRJ95J8DPGTWHMBB/chunks
-rw------- 12867691 Sep 2 04:00 01DKQYMSP8RRJ95J8DPGTWHMBB/chunks/000001
-rw------- 1629968 Sep 2 04:00 01DKQYMSP8RRJ95J8DPGTWHMBB/index
-rw------- 9 Sep 2 04:00 01DKQYMSP8RRJ95J8DPGTWHMBB/tombstones
-rw------- 827 Sep 8 10:27 01DKQYMSP8RRJ95J8DPGTWHMBB/meta.json
Broken:
drwx------ 68 Sep 8 10:28 01DM83875Y7WH9AK7EFTXYVFEA
drwx------ 20 Sep 8 10:28 01DM83875Y7WH9AK7EFTXYVFEA/chunks
-rw------- 7424 Sep 8 10:28 01DM83875Y7WH9AK7EFTXYVFEA/chunks/000001
-rw------- 47747 Sep 8 10:28 01DM83875Y7WH9AK7EFTXYVFEA/index
-rw------- 0 Sep 8 10:28 01DM83875Y7WH9AK7EFTXYVFEA/tombstones
-rw------- 0 Sep 8 10:28 01DM83875Y7WH9AK7EFTXYVFEA/meta.json
A google for the error turned up:
Looks like the fix will be to remove the dodgy directory and restart prometheus. (prometheus issue 4058)
Looks like there’s an issue due to be fixed in 2.9.0; this occurred with the bundled 2.8.1. (prometheus issue 5457, referring to 4058)
If the file (meta.json) wasn’t empty, there’s a python incantation to validate it. (prometheus issue 4297)
It seems to be (have been) a thing on NFS which is not recommended.
Also seen on (I think) Openshift 4.1 running Prometheus 2.3.2 on Gluster (which is probably against recommendations, like NFS)
Fix
Prometheus keeps trying to restart, and I want it not to do this temporarily.
fail: alertmanager: runsv not running
fail: gitaly: runsv not running
fail: gitlab-monitor: runsv not running
fail: gitlab-workhorse: runsv not running
fail: grafana: runsv not running
fail: logrotate: runsv not running
fail: nginx: runsv not running
fail: node-exporter: runsv not running
fail: postgres-exporter: runsv not running
fail: postgresql: runsv not running
fail: prometheus: runsv not running
fail: redis: runsv not running
fail: redis-exporter: runsv not running
fail: sidekiq: runsv not running
fail: unicorn: runsv not running
meta.json is empty for the two most recent directories, so I’ll delete both.
680 Sep 8 10:27 /var/opt/gitlab/prometheus/data/01DK97NVZQTSK9H1SQJT2D153S/meta.json
571 Sep 8 10:27 /var/opt/gitlab/prometheus/data/01DKBKMTMEGGGA718A4RRKFH56/meta.json
612 Sep 8 10:27 /var/opt/gitlab/prometheus/data/01DKNW2N3JSDKCEA2JTME0WGPJ/meta.json
579 Sep 8 10:27 /var/opt/gitlab/prometheus/data/01DKPZQ90DGN0RW5H9R10NFVSN/meta.json
827 Sep 8 10:27 /var/opt/gitlab/prometheus/data/01DKQYMSP8RRJ95J8DPGTWHMBB/meta.json
0 Sep 8 10:28 /var/opt/gitlab/prometheus/data/01DM83875Y7WH9AK7EFTXYVFEA/meta.json
0 Sep 8 10:28 /var/opt/gitlab/prometheus/data/01DM838A0DY0VDJNFYGP3RW6CW/meta.json
/var/opt/gitlab/prometheus/data/01DM838A0DY0VDJNFYGP3RW6CW
Restart.
run: alertmanager: (pid 8115) 139s; run: log: (pid 8114) 139s
run: gitaly: (pid 8121) 139s; run: log: (pid 8120) 139s
run: gitlab-monitor: (pid 8123) 139s; run: log: (pid 8122) 139s
run: gitlab-workhorse: (pid 8133) 139s; run: log: (pid 8132) 139s
run: grafana: (pid 8113) 139s; run: log: (pid 8112) 139s
run: logrotate: (pid 8129) 139s; run: log: (pid 8128) 139s
run: nginx: (pid 8131) 139s; run: log: (pid 8130) 139s
run: node-exporter: (pid 8137) 139s; run: log: (pid 8136) 139s
run: postgres-exporter: (pid 8125) 139s; run: log: (pid 8124) 139s
run: postgresql: (pid 8135) 139s; run: log: (pid 8134) 139s
run: prometheus: (pid 8119) 139s; run: log: (pid 8118) 139s
run: redis: (pid 8111) 139s; run: log: (pid 8110) 139s
run: redis-exporter: (pid 8127) 139s; run: log: (pid 8126) 139s
run: sidekiq: (pid 8117) 139s; run: log: (pid 8116) 139s
run: unicorn: (pid 8139) 139s; run: log: (pid 8138) 139s
Hunt for root cause
Performed before fixing it.
There’s enough free space to write files in this location.
I don’t use NFS; the data is on XFS
The previous log (/var/log/gitlab/prometheus) is: @400000005d76324435f19fac.u
It contains a clean startup (8th; day before GitLab upgrade) with a lot more data directories than I see now …
2019-09-08_09:27:49.17776 level=info ts=2019-09-08T09:27:49.105611306Z
caller=main.go:321 msg=“Starting Prometheus” version="(version=2.8.1, branch=master, revision=)"
[…]
caller=main.go:640 msg=“Starting TSDB …”
caller=repair.go:48 component=tsdb msg=“found healthy block” mint=1566122400000 maxt=1566136800000 ulid=01DJMZD18ZDB9R83M6GKVBGDVP
caller=repair.go:48 component=tsdb msg=“found healthy block” mint=1566201600000 maxt=1566216000000 ulid=01DJQG461YW0R13VDE0KAABSKC
caller=repair.go:48 component=tsdb msg=“found healthy block” mint=1566216000000 maxt=1566230400000 ulid=01DJSQXBT99DX6KGQZKD75SZ61
caller=repair.go:48 component=tsdb msg=“found healthy block” mint=1566302400000 maxt=1566316800000 ulid=01DJT46FEVE95Q07ES7993K5Z3
caller=repair.go:48 component=tsdb msg=“found healthy block” mint=1566374400000 maxt=1566403200000 ulid=01DJYYDAPFM23M44SX4HYKS42W
caller=repair.go:48 component=tsdb msg=“found healthy block” mint=1566489600000 maxt=1566511200000 ulid=01DJZ8ZY26ZZ1QH9Y1X5Y41B5H
caller=repair.go:48 component=tsdb msg=“found healthy block” mint=1566547200000 maxt=1566583200000 ulid=01DK97NVZQTSK9H1SQJT2D153S
caller=repair.go:48 component=tsdb msg=“found healthy block” mint=1566842400000 maxt=1566856800000 ulid=01DKBKMTMEGGGA718A4RRKFH56
caller=repair.go:48 component=tsdb msg=“found healthy block” mint=1566892800000 maxt=1566914400000 ulid=01DKNW2N3JSDKCEA2JTME0WGPJ
caller=repair.go:48 component=tsdb msg=“found healthy block” mint=1566972000000 maxt=1566986400000 ulid=01DKPZQ90DGN0RW5H9R10NFVSN
caller=repair.go:48 component=tsdb msg=“found healthy block” mint=1567382400000 maxt=1567389600000 ulid=01DKQYMEBW70D41CFTY4GSH80E
caller=repair.go:48 component=tsdb msg=“found healthy block” mint=1567317600000 maxt=1567382400000 ulid=01DKQYMSP8RRJ95J8DPGTWHMBB
caller=repair.go:48 component=tsdb msg=“found healthy block” mint=1567389600000 maxt=1567396800000 ulid=01DKR5G5KQ2ZX0NXMYJA5X90YC
caller=repair.go:48 component=tsdb msg=“found healthy block” mint=1567396800000 maxt=1567404000000 ulid=01DM67W10QZKDTN1W2F22KBF1S
[…]
caller=main.go:655 msg=“TSDB started”
[…]
caller=main.go:724 msg=“Loading configuration file” filename=/var/opt/gitlab/prometheus/prometheus.yml
[…]
caller=main.go:609 msg=“Server is ready to receive web requests.”
It is then doing some compaction, I think (compact.go?)
2019-09-08_09:28:34.15589 level=info ts=2019-09-08T09:28:34.15526043Z caller=compact.go:496
component=tsdb msg=“write block” mint=1567872000000 maxt=1567879200000 ulid=01DM83875Y7WH9AK7EFTXYVFEA duration=428.768903ms
2019-09-08_09:28:37.50895 level=info ts=2019-09-08T09:28:37.498124475Z caller=compact.go:441
component=tsdb msg=“compact blocks” count=3 mint=1567382400000 maxt=1567404000000 ulid=01DM838A0DY0VDJNFYGP3RW6CW
sources="[01DKQYMEBW70D41CFTY4GSH80E 01DKR5G5KQ2ZX0NXMYJA5X90YC 01DM67W10QZKDTN1W2F22KBF1S]" duration=876.363746ms
And that’s the end of that log.
current log starts with:
2019-09-09_11:06:35.89163 level=info ts=2019-09-09T11:06:35.888309089Z caller=main.go:321
msg=“Starting Prometheus” version="(version=2.8.1, branch=master, revision=)"
And that fails with 01DM83875Y7WH9AK7EFTXYVFEA. I read that it was compacting this the previous day when the log stopped.
The Prometheus logs are in UTC, not system time, which is currently UTC+1.
09:28 is consistent with the first boot on Sept 8th.
Journald (I have persistent logging set up) has nothing after Sep 08 10:27:59 (ie: just before 09:28 prometheus time) and the reboot info is odd; the end time for three of the last five uptimes is apparently 19:24 on September 9th.
reboot system boot 3.10.0-957.27.2. Tue Sep 10 14:23 - 17:02 (02:38)
reboot system boot 3.10.0-957.27.2. Mon Sep 9 19:24 - 17:02 (21:38)
reboot system boot 3.10.0-957.27.2. Mon Sep 9 12:06 - 19:24 (07:18)
reboot system boot 3.10.0-957.27.2. Sun Sep 8 10:27 - 19:24 (1+08:56)
reboot system boot 3.10.0-957.27.2. Sat Sep 7 17:56 - 19:24 (2+01:27)
It’s a KVM guest; the host was up until 10:29 (just after 09:28 prometheus time) although the uptime info on there is negative.
$ last -5 reboot
reboot system boot 3.10.0-957.27.2. Tue Sep 10 15:23 - 17:07 (01:43)
reboot system boot 3.10.0-957.27.2. Mon Sep 9 13:05 - 17:07 (1+04:01)
reboot system boot 3.10.0-957.27.2. Sun Sep 8 11:26 - 10:29 (00:-57)
reboot system boot 3.10.0-957.27.2. Sat Sep 7 18:55 - 18:12 (00:-43)
reboot system boot 3.10.0-957.27.2. Sat Sep 7 18:09 - 18:12 (00:03)
I’m not going to go digging further.
Conclusion
I don’t think my guests are always going down cleanly at shutdown.
Prometheus was compacting it’s data when KVM guest went away.
Looking at issue 4058, it’s noted that the code doesn’t do a sync before closing the file; it does everywhere else, and I guess this is the fix that’ll be in 2.9.0. That might have helped, though it’s a case, here, of putting on a tourniquet before shooting oneself in a foot.
Share this: