[00:04:31] PROBLEM - puppet last run on mw2214 is CRITICAL: CRITICAL: puppet fail [00:16:42] PROBLEM - puppet last run on mw2116 is CRITICAL: CRITICAL: puppet fail [00:25:22] (03CR) 10RobH: [C: 032] robh on vacation, removing from paging [puppet] - 10https://gerrit.wikimedia.org/r/300591 (owner: 10RobH) [00:32:51] RECOVERY - puppet last run on mw2214 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [00:34:32] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [00:36:32] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [00:37:41] !log doing an emergency deploy of https://gerrit.wikimedia.org/r/#/c/300679 for T141160, creates dozens of new users per hour to be unattached on loginwiki which probably has weird consequences [00:37:42] T141160: Unattached accounts created on registration - https://phabricator.wikimedia.org/T141160 [00:37:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:38:02] (03PS1) 10Yuvipanda: k8s: Upgrade to 1.3.3 [puppet] - 10https://gerrit.wikimedia.org/r/300688 [00:38:12] (03PS3) 10Yuvipanda: tools: Alert when iowait info is missing as well [puppet] - 10https://gerrit.wikimedia.org/r/300573 (https://phabricator.wikimedia.org/T141017) [00:38:21] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Alert when iowait info is missing as well [puppet] - 10https://gerrit.wikimedia.org/r/300573 (https://phabricator.wikimedia.org/T141017) (owner: 10Yuvipanda) [00:38:47] (03PS2) 10Yuvipanda: k8s: Upgrade to 1.3.3 [puppet] - 10https://gerrit.wikimedia.org/r/300688 [00:38:56] (03CR) 10Yuvipanda: [C: 032 V: 032] k8s: Upgrade to 1.3.3 [puppet] - 10https://gerrit.wikimedia.org/r/300688 (owner: 10Yuvipanda) [00:39:42] Question: I just tried connecting to bast4001 for the first time, and the key fingerprint I got doesn't match the one on wikitech (https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/bast4001.wikimedia.org). Is wikitech outdated?
[00:40:10] This is what I got when I tried to SSH: `ECDSA key fingerprint is SHA256:UUsRMiUK9CkPg8yMiEPAKjs1PEhKxQPT+xhi4xRnjks.` [00:40:26] (03PS2) 10Ppchelko: Change-Prop: Updates to error hangling [puppet] - 10https://gerrit.wikimedia.org/r/300681 [00:44:52] RECOVERY - puppet last run on mw2116 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:46:47] (03PS3) 10Legoktm: Change-Prop: Updates to error handling [puppet] - 10https://gerrit.wikimedia.org/r/300681 (owner: 10Ppchelko) [00:53:42] PROBLEM - puppet last run on db2061 is CRITICAL: CRITICAL: puppet fail [01:00:46] !log tgr@tin Synchronized php-1.28.0-wmf.11/extensions/CentralAuth/includes/CentralAuthPrimaryAuthenticationProvider.php: T141160 (duration: 00m 28s) [01:00:47] T141160: Unattached accounts created on registration - https://phabricator.wikimedia.org/T141160 [01:00:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:01:25] !log tgr@tin Synchronized php-1.28.0-wmf.11/extensions/CentralAuth/includes/CentralAuthHooks.php: T141160 (duration: 00m 27s) [01:01:27] T141160: Unattached accounts created on registration - https://phabricator.wikimedia.org/T141160 [01:01:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:02:06] !log tgr@tin Synchronized php-1.28.0-wmf.11/extensions/CentralAuth/includes/CentralAuthPlugin.php: T141160 (duration: 00m 29s) [01:02:07] T141160: Unattached accounts created on registration - https://phabricator.wikimedia.org/T141160 [01:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:16:33] neilpquinn: looks like it (wikitech outdated) [01:19:53] RECOVERY - puppet last run on db2061 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [01:25:57] alex@alex-laptop:~$ ssh-keyscan -v -t rsa,ecdsa,ed25519 -p 22 bast4001.wikimedia.org [01:25:57] alex@alex-laptop:~$ [01:25:57] strange [01:27:43] okay I think it might just be my connection [01:32:13] of course all the servers I can successfully run it on don't have OpenSSH 6.8+, so no SHA256 fingerprints [01:37:45] neilpquinn, https://wikitech.wikimedia.org/w/index.php?title=Help:SSH_Fingerprints/bast4001.wikimedia.org&diff=781106&oldid=192162 [01:38:22] not sure why the host keys changed then [01:38:36] only thing in SAL is robh taking it down to add a new HDD [01:39:24] Thanks for investigating (and updating the docs), Krenair! [01:40:25] (03PS1) 10Chad: Gerrit: Remove googlebot from banned IPs. They ain't so bad [puppet] - 10https://gerrit.wikimedia.org/r/300692 [01:42:40] ori: Got a second to look at that? ^ [01:46:10] looks sane and I saw the context for this in the backlog on another channel, but mark wouldn't be very happy with me if I merged that -- I was asked not to merge things that have to do with security and not to merge on weekends, and this meets both criteria... sorry. [01:47:50] Okie dokie no worries. 
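For reference on the bast4001 fingerprint check above: when the local OpenSSH is older than 6.8 and only prints MD5 fingerprints, the SHA256 form listed on wikitech can still be reproduced from an `ssh-keyscan` line. A minimal sketch, assuming standard `host keytype base64-blob` keyscan output (the script and its filtering are illustrative, not an existing tool):

```python
# Compute OpenSSH-style SHA256 and MD5 fingerprints from host key lines,
# e.g. the output of `ssh-keyscan -t rsa,ecdsa,ed25519 bast4001.wikimedia.org`.
# Assumes each input line looks like: "hostname keytype base64-blob".
import base64
import hashlib
import sys

def fingerprints(blob):
    # SHA256 fingerprint: unpadded base64 of the SHA-256 digest of the key blob.
    sha256 = base64.b64encode(hashlib.sha256(blob).digest()).decode().rstrip('=')
    # MD5 fingerprint: colon-separated hex digest, as older OpenSSH prints it.
    md5hex = hashlib.md5(blob).hexdigest()
    md5 = ':'.join(md5hex[i:i + 2] for i in range(0, len(md5hex), 2))
    return 'SHA256:' + sha256, 'MD5:' + md5

if __name__ == '__main__':
    for line in sys.stdin:
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        host, keytype, key_b64 = line.split()[:3]
        sha256, md5 = fingerprints(base64.b64decode(key_b64))
        print(host, keytype, sha256, md5)
```

Fed the ECDSA keyscan line for bast4001, the SHA256 value should match the `SHA256:UUsRMiUK...` string quoted above if the host key is the one wikitech documents.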
[01:48:17] It's not a rush at all, just saw you around :) [01:48:45] In that case I'm gonna go start_weekend() [01:49:53] * ori waves [01:55:14] neilpquinn, turns out the whole thing was reinstalled too, which explains the change [01:55:42] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 236 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [01:56:05] neilpquinn, https://phabricator.wikimedia.org/T129180 [02:01:53] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 17 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [02:21:08] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.11) (duration: 08m 24s) [02:21:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:26:49] !log l10nupdate@tin ResourceLoader cache refresh completed at Sat Jul 23 02:26:49 UTC 2016 (duration 5m 41s) [02:26:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:54:01] RECOVERY - cassandra-c CQL 10.192.32.139:9042 on restbase2004 is OK: TCP OK - 0.038 second response time on port 9042 [03:23:51] PROBLEM - Labs LDAP on seaborgium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:24:11] PROBLEM - SSH on seaborgium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:24:21] PROBLEM - salt-minion processes on seaborgium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:24:21] PROBLEM - Disk space on seaborgium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:24:21] PROBLEM - LibreNMS HTTPS on netmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:24:32] PROBLEM - puppet last run on seaborgium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:24:42] PROBLEM - configured eth on seaborgium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:25:02] PROBLEM - DPKG on seaborgium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:25:02] PROBLEM - dhclient process on seaborgium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:25:03] PROBLEM - Check size of conntrack table on seaborgium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:25:11] PROBLEM - Tool Labs instance distribution on labcontrol1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:25:13] PROBLEM - Getent speed check on labstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:25:22] PROBLEM - Tool Labs instance distribution on labcontrol1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[03:27:23] PROBLEM - puppetmaster https on labcontrol1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:29:02] RECOVERY - Getent speed check on labstore1002 is OK: OK: getent group returns within a second [03:31:11] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: Puppet has 1 failures [03:32:31] RECOVERY - LibreNMS HTTPS on netmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8685 bytes in 7.327 second response time [03:34:12] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 341 bytes in 0.014 second response time [03:34:12] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 341 bytes in 0.021 second response time [03:35:03] PROBLEM - Puppet catalogue fetch on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/labs-puppetmaster/eqiad - 341 bytes in 0.011 second response time [03:35:12] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 341 bytes in 0.043 second response time [03:35:32] PROBLEM - All k8s etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/etcd/k8s - 341 bytes in 0.017 second response time [03:35:32] PROBLEM - All Flannel etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/etcd/flannel - 341 bytes in 0.039 second response time [03:38:57] PROBLEM - toolschecker service itself needs to return OK on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/self - 341 bytes in 0.049 second response time [03:39:30] is someone doing something w/ tools checker?^ [03:39:31] YuviPanda: ? [03:39:52] PROBLEM - puppet last run on labservices1002 is CRITICAL: CRITICAL: Puppet has 1 failures [03:40:26] PROBLEM - Test LDAP for query on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/ldap - 341 bytes in 0.009 second response time [03:40:26] PROBLEM - showmount succeeds on a labs instance on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/nfs/showmount - 341 bytes in 0.008 second response time [03:40:31] Is Gerrit down for anyone else? Is this planned? [03:41:09] I'm getting "server unavailable" messages after a long time of it sitting at "Working..." 
in the top bar [03:41:21] PROBLEM - Puppet catalogue fetch on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/labs-puppetmaster/eqiad - 341 bytes in 0.039 second response time [03:41:21] PROBLEM - Redis set/get on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/redis - 341 bytes in 0.038 second response time [03:41:21] PROBLEM - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/dumps - 341 bytes in 0.044 second response time [03:41:51] PROBLEM - All k8s etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/etcd/k8s - 341 bytes in 0.005 second response time [03:43:43] PROBLEM - All Flannel etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/etcd/flannel - 341 bytes in 0.027 second response time [03:43:43] uh [03:44:06] YuviPanda: I think seaborgium.wikimedia.org is down? [03:44:19] or is it [03:44:32] PROBLEM - check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 341 bytes in 0.032 second response time [03:44:32] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 341 bytes in 0.027 second response time [03:44:32] PROBLEM - Verify internal DNS from within Tools on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/labs-dns/private - 341 bytes in 0.045 second response time [03:44:32] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 341 bytes in 0.032 second response time [03:44:36] PROBLEM - NFS read/writeable on labs instances on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/nfs/home - 341 bytes in 0.048 second response time [03:45:16] PROBLEM - toolschecker service itself needs to return OK on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/self - 341 bytes in 0.033 second response time [03:45:20] scap on beta is failing with: [03:45:21] 03:36:29 sudo: ldap_start_tls_s(): Can't contact LDAP server [03:45:21] 03:36:29 sudo: a password is required [03:46:33] PROBLEM - LibreNMS HTTPS on netmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:46:43] chasemp yeah, that might be it - I can't ssh in [03:47:43] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 341 bytes in 0.018 second response time [03:47:59] can ssh to serpens [03:48:10] !log gnt-instance reboot seaborgium.wikimedia.org 
[03:48:26] ^ that is taking a long time w/ no status [03:49:10] failover is going to be useless since all of labs ssh is down anyway [03:49:18] or at least tools-checker-01 is [03:49:39] nope, I can ssh to tools-login [03:49:44] just both checker nodes are unsshable [03:49:50] yeah so [03:49:51] RECOVERY - Labs LDAP on seaborgium is OK: LDAP OK - 0.028 seconds response time [03:49:55] I rebooted seaborgium [03:49:59] and I can ssh in now [03:50:00] so [03:50:03] RECOVERY - SSH on seaborgium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0) [03:50:10] serpens is not much of an effective backup it seems? and I don't get why [03:50:23] RECOVERY - Disk space on seaborgium is OK: DISK OK [03:50:23] RECOVERY - salt-minion processes on seaborgium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:50:23] RECOVERY - LibreNMS HTTPS on netmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8684 bytes in 0.178 second response time [03:50:24] although I'm not positive there aren't other problems [03:50:32] RECOVERY - puppet last run on seaborgium is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures [03:50:42] RECOVERY - configured eth on seaborgium is OK: OK - interfaces up [03:50:51] RECOVERY - All Flannel etcd nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 56.432 second response time [03:50:51] RECOVERY - All k8s etcd nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 57.345 second response time [03:50:52] RECOVERY - showmount succeeds on a labs instance on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 31.245 second response time [03:50:56] RECOVERY - Test LDAP for query on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 31.587 second response time [03:50:56] RECOVERY - check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 16.269 second response time [03:50:56] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 17.039 second response time [03:50:56] RECOVERY - Verify internal DNS from within Tools on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 17.016 second response time [03:50:56] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 17.238 second response time [03:51:00] RECOVERY - NFS read/writeable on labs instances on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 17.084 second response time [03:51:12] RECOVERY - dhclient process on seaborgium is OK: PROCS OK: 0 processes with command name dhclient [03:51:12] RECOVERY - DPKG on seaborgium is OK: All packages OK [03:51:13] RECOVERY - Check size of conntrack table on seaborgium is OK: OK: nf_conntrack is 0 % full [03:51:21] RECOVERY - Tool Labs instance distribution on labcontrol1001 is OK: OK: All critical toollabs instances are spread out enough [03:51:26] RECOVERY - toolschecker service itself needs to return OK on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.047 second response time [03:51:26] RECOVERY - puppetmaster https on labcontrol1001 is OK: HTTP OK: Status line output matched 400 - 330 bytes in 0.058 second response time [03:51:27] RECOVERY - Puppet catalogue fetch on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 2.708 second response time 
[03:51:27] RECOVERY - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.743 second response time [03:51:27] RECOVERY - Redis set/get on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 2.693 second response time [03:51:31] RECOVERY - Tool Labs instance distribution on labcontrol1002 is OK: OK: All critical toollabs instances are spread out enough [03:51:51] 06Operations, 06Labs, 10Labs-Infrastructure: investigate slapd memory leak - https://phabricator.wikimedia.org/T130593#2489274 (10chasemp) I had to reboot seaborgium today as it froze up and took out ldap with it. > !log gnt-instance reboot seaborgium.wikimedia.org I would say...definitely something is sti... [03:52:01] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.126 second response time [03:53:01] well, it seems somewhat clear that was the deal [03:53:13] I left a note in https://phabricator.wikimedia.org/T130593 [03:53:17] (03PS1) 10Yuvipanda: tools: Run toolschecker in debug mode [puppet] - 10https://gerrit.wikimedia.org/r/300696 [03:53:41] chasemp ^ alll the toolschecker logs only said '503' and didn't actually mention what or why... [03:53:45] this should help if it happens again [03:53:49] ok [03:54:00] (03PS2) 10Yuvipanda: tools: Run toolschecker in debug mode [puppet] - 10https://gerrit.wikimedia.org/r/300696 [03:54:09] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Run toolschecker in debug mode [puppet] - 10https://gerrit.wikimedia.org/r/300696 (owner: 10Yuvipanda) [03:55:21] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [03:58:25] 06Operations, 13Patch-For-Review: Staging area for the next version of the transparency report - https://phabricator.wikimedia.org/T138197#2392917 (10Peachey88) @ori Are there any tasks remaining to keep this task open? [04:00:03] PROBLEM - NTP on seaborgium is CRITICAL: NTP CRITICAL: Offset unknown [04:02:14] RECOVERY - puppet last run on labservices1002 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [04:08:12] RECOVERY - NTP on seaborgium is OK: NTP OK: Offset -0.0009208917618 secs [04:24:12] PROBLEM - Unmerged changes on repository puppet on labcontrol1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:24:41] PROBLEM - Unmerged changes on repository puppet on labcontrol1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:25:13] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:25:21] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:25:21] PROBLEM - Unmerged changes on repository puppet on rhodium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:25:21] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:25:23] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:29:05] YuviPanda, chasemp: I think gerrit is down? [04:32:17] gerrit is down [04:32:36] legoktm: yup, it's down. I asked several friends [04:32:42] hey btw. 
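On the toolschecker debug-mode patch (300696) above: the problem being addressed is that the checker's logs only recorded bare 503s, with nothing about which check actually blew up. A rough sketch of the idea, assuming a Flask-style WSGI checker app (endpoint names and the failing check are illustrative, not toolschecker's actual code):

```python
# Illustrative sketch: log the traceback behind a failing check instead of
# returning an unexplained 503. Not the real toolschecker implementation.
import logging
from flask import Flask

app = Flask(__name__)
logging.basicConfig(level=logging.INFO)

def ldap_check():
    # Placeholder for a real check (LDAP bind, grid job submission, NFS write, ...).
    raise RuntimeError("LDAP server did not respond within 10s")

@app.route('/self')
def self_check():
    return "OK", 200

@app.route('/ldap')
def ldap_endpoint():
    try:
        ldap_check()
    except Exception:
        # This is what a bare 503 hides: record the traceback for the next outage.
        app.logger.exception("ldap check failed")
        return "NOT OK", 503
    return "OK", 200

if __name__ == '__main__':
    # Running with debug=True (what the patch enables) also surfaces tracebacks;
    # explicit logging keeps them even behind a production WSGI server.
    app.run(debug=True)
```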
[04:48:46] !log Users report Gerrit is down; on ytterbium java is occupying two cores at 100% [04:48:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:51:29] I'm going to dump core and restart it [04:51:35] the service, that is; not the host [04:52:52] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: puppet fail [04:56:44] !log Restarting Gerrit on ytterbium [04:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:57:20] it's back up [04:58:17] !log Gerrit is back up after service restart; was unavailable between ~ 04:29 - 04:57 UTC [04:58:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:58:22] PROBLEM - Unmerged changes on repository puppet on labcontrol1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:58:52] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:58:52] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [04:59:01] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [04:59:01] RECOVERY - Unmerged changes on repository puppet on rhodium is OK: No changes to merge. [04:59:02] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [04:59:39] legoktm: in the future please page ops for something like this [04:59:51] RECOVERY - Unmerged changes on repository puppet on labcontrol1001 is OK: No changes to merge. [05:00:02] Gerrit is not a high-traffic production wiki, but it is important infrastructure, and it can't just sit dead in the water [05:00:21] PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: puppet fail [05:00:21] RECOVERY - Unmerged changes on repository puppet on labcontrol1002 is OK: No changes to merge. [05:00:22] PROBLEM - puppet last run on bromine is CRITICAL: CRITICAL: Puppet has 3 failures [05:00:51] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [05:01:33] PROBLEM - puppet last run on stat1004 is CRITICAL: CRITICAL: Puppet has 1 failures [05:01:41] PROBLEM - puppet last run on kafka1002 is CRITICAL: CRITICAL: Puppet has 1 failures [05:02:02] PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Puppet has 1 failures [05:02:02] PROBLEM - puppet last run on stat1001 is CRITICAL: CRITICAL: Puppet has 1 failures [05:02:43] ori: ack, will do next time [05:04:01] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [05:08:16] legoktm: first user reports of gerrit being down are from 04:29, but I see suspicious log messages starting from 04:08. is it possible it became unresponsive around then, or were you working on gerrit right before 04:29? [05:08:50] ori: no, that's just when I tried opening a patch [05:09:12] [21:27:34] Wikipedia-Android-App-Backlog, Easy, Mobile-App-Android-Sprint-87-Francium, Patch-For-Review, Unplanned-Sprint-Work: Remove support for custom application ID - https://phabricator.wikimedia.org/T103903#2489281 (Niedzielski) I have a patch for this but Gerrit is down. Will post REAL soon. [05:09:23] [21:23:12] and is Gerrit down for everyone? 
(in #mediawiki) [05:10:10] niedzielski-afk: it's back up [05:14:21] PROBLEM - Disk space on lithium is CRITICAL: DISK CRITICAL - free space: /srv/syslog 14894 MB (3% inode=99%) [05:18:32] PROBLEM - puppet last run on maps2002 is CRITICAL: CRITICAL: puppet fail [05:19:42] RECOVERY - puppet last run on stat1001 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [05:20:03] RECOVERY - puppet last run on bromine is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [05:22:32] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [05:25:11] RECOVERY - puppet last run on stat1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [05:27:12] RECOVERY - puppet last run on kafka1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [05:31:29] 06Operations, 10MediaWiki-extensions-UrlShortener, 10Traffic: Strip query parameters from w.wiki domain - https://phabricator.wikimedia.org/T141170#2489295 (10Legoktm) [05:32:02] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [05:46:03] RECOVERY - puppet last run on maps2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:05:01] PROBLEM - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 10.65.0.24 [06:06:51] RECOVERY - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [06:29:12] PROBLEM - Disk space on lithium is CRITICAL: DISK CRITICAL - free space: /srv/syslog 14780 MB (3% inode=99%) [06:30:51] PROBLEM - puppet last run on mc2007 is CRITICAL: CRITICAL: puppet fail [06:31:22] PROBLEM - puppet last run on wtp2017 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:22] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:23] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:32] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:32] PROBLEM - puppet last run on mw2228 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:33] PROBLEM - puppet last run on es2013 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:33] PROBLEM - puppet last run on ms-be2026 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:23] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 2 failures [06:39:12] RECOVERY - Disk space on lithium is OK: DISK OK [06:56:32] RECOVERY - puppet last run on mc2007 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [06:57:02] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [06:57:02] RECOVERY - puppet last run on wtp2017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:03] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [06:57:11] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [06:57:12] RECOVERY - puppet last run on mw2228 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [06:57:12] RECOVERY - puppet last run on es2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:12] RECOVERY - puppet last run on ms-be2026 is OK: 
OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:02] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:38:03] PROBLEM - SSH on db1066 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:38:21] PROBLEM - MegaRAID on db1066 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:39:52] RECOVERY - SSH on db1066 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.7 (protocol 2.0) [07:40:03] RECOVERY - MegaRAID on db1066 is OK: OK: optimal, 1 logical, 2 physical [08:41:22] PROBLEM - All k8s etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/etcd/k8s - 185 bytes in 1.498 second response time [11:06:49] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06WMF-NDA-Requests: NDA-Request Jonas Kress - https://phabricator.wikimedia.org/T140911#2489499 (10Jonas) https://wikitech.wikimedia.org/wiki/User:Jonas_Kress_(WMDE) Thanks! [11:22:52] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [11:24:16] That can't be good [11:26:51] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [11:42:11] 06Operations, 06Discovery, 10netops, 03Discovery-Search-Sprint: deploy elasticsearch/plugins to relforge1001-1002 servers - https://phabricator.wikimedia.org/T141085#2489518 (10Aklapper) [12:40:22] PROBLEM - puppet last run on ms-be2018 is CRITICAL: CRITICAL: puppet fail [12:51:38] travelling back to home base, back online in about 4 hours I guess [13:06:03] RECOVERY - puppet last run on ms-be2018 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [13:46:31] 06Operations, 06Commons, 10MediaWiki-Page-deletion, 10media-storage, and 3 others: Unable to delete file pages on commons: MWException/LocalFileLockError: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2214216 (10Josve05a) I can't delete https://commons.wikimedia.org/wiki/File:%D8%B3%... 
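As an aside on the "MediaWiki exceptions and fatals per minute" alerts that keep flapping above: the check pulls a Graphite series and alerts when more than a set fraction of recent datapoints sits above a threshold (hence the "20.00% of data above the critical threshold [50.0]" wording). A rough sketch of that logic, assuming Graphite's JSON render API; the URL, metric name and limits here are illustrative, not the production check's actual parameters:

```python
# Rough sketch of a "percentage of datapoints above threshold" Graphite check.
# URL, target and limits are illustrative assumptions, not the real check config.
import requests

GRAPHITE = "https://graphite.example.org/render"
TARGET = "MediaWiki.errors.fatals_per_minute"   # hypothetical metric name
WARN, CRIT = 25.0, 50.0                          # thresholds quoted in the alerts
MAX_FRACTION = 0.01                              # alert if >1% of points exceed them

def check(window="-10min"):
    resp = requests.get(GRAPHITE, timeout=10, params={
        "target": TARGET, "from": window, "format": "json"})
    resp.raise_for_status()
    points = [v for series in resp.json()
              for v, _ts in series["datapoints"] if v is not None]
    if not points:
        return "UNKNOWN: no datapoints"
    above_crit = sum(v > CRIT for v in points) / float(len(points))
    above_warn = sum(v > WARN for v in points) / float(len(points))
    if above_crit > MAX_FRACTION:
        return "CRITICAL: %.2f%% of data above the critical threshold [%s]" % (
            100 * above_crit, CRIT)
    if above_warn > MAX_FRACTION:
        return "WARNING: %.2f%% of data above the threshold [%s]" % (
            100 * above_warn, WARN)
    return "OK: Less than %.2f%% above the threshold [%s]" % (
        100 * MAX_FRACTION, WARN)

if __name__ == "__main__":
    print(check())
```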
[14:08:42] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [14:10:33] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5082716 keys - replication_delay is 0 [15:24:31] PROBLEM - Disk space on lithium is CRITICAL: DISK CRITICAL - free space: /srv/syslog 14923 MB (3% inode=99%) [15:37:12] !log lithium sudo lvextend --size +10G -r /dev/mapper/lithium--vg-syslog [15:37:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:38:43] !log stop swift in esams test cluster, lots of logging from there [15:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:57:00] 06Operations, 06Discovery, 06Maps, 10Monitoring: Map caches metrics look broken - https://phabricator.wikimedia.org/T141186#2489781 (10MaxSem) [17:09:32] PROBLEM - puppet last run on mw2078 is CRITICAL: CRITICAL: puppet fail [17:38:31] RECOVERY - puppet last run on mw2078 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:46:42] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [18:52:42] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 18 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [19:13:23] 06Operations, 10Ops-Access-Requests: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Brentjoseph (bcohn) - https://phabricator.wikimedia.org/T140449#2490000 (10Brentjoseph) @Dzahn done! sorry for the delay: wikitech username: Brentjoseph shell account name: bc... 
[19:38:42] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [19:44:42] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 18 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [20:15:32] (03PS1) 10Yuvipanda: Add a logster for parsing out top level urls [debs/logster] - 10https://gerrit.wikimedia.org/r/300717 [20:18:31] (03PS2) 10Yuvipanda: Add a logster for parsing out top level urls [debs/logster] - 10https://gerrit.wikimedia.org/r/300717 [20:22:12] PROBLEM - puppet last run on cp4019 is CRITICAL: CRITICAL: Puppet has 1 failures [20:25:21] (03PS3) 10Yuvipanda: Add a logster for parsing out top level urls [debs/logster] - 10https://gerrit.wikimedia.org/r/300717 [20:25:53] (03PS4) 10Yuvipanda: Add a logster for parsing out top level urls [debs/logster] - 10https://gerrit.wikimedia.org/r/300717 [20:27:02] (03PS5) 10Yuvipanda: Add a logster for parsing out top level urls [debs/logster] - 10https://gerrit.wikimedia.org/r/300717 [20:28:18] (03PS6) 10Yuvipanda: Add a logster for parsing out top level urls [debs/logster] - 10https://gerrit.wikimedia.org/r/300717 [20:29:29] (03PS7) 10Yuvipanda: Add a logster for parsing out top level urls [debs/logster] - 10https://gerrit.wikimedia.org/r/300717 [20:33:01] PROBLEM - puppet last run on mw1177 is CRITICAL: CRITICAL: Puppet has 1 failures [20:33:32] (03PS8) 10Yuvipanda: Add a logster for parsing out top level urls [debs/logster] - 10https://gerrit.wikimedia.org/r/300717 [20:48:22] RECOVERY - puppet last run on cp4019 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [20:54:42] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [20:54:46] (03PS1) 10Yuvipanda: tools: Track per tool request stats in graphite [puppet] - 10https://gerrit.wikimedia.org/r/300737 (https://phabricator.wikimedia.org/T69880) [20:55:13] (03PS2) 10Yuvipanda: tools: Track per tool request stats in graphite [puppet] - 10https://gerrit.wikimedia.org/r/300737 (https://phabricator.wikimedia.org/T69880) [20:58:11] (03PS1) 10Hashar: contint: role for Android testing [puppet] - 10https://gerrit.wikimedia.org/r/300738 (https://bugzilla.wikimedia.org/139137) [20:58:52] RECOVERY - puppet last run on mw1177 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:00:51] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 18 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [21:01:09] (03PS3) 10Yuvipanda: tools: Track per tool request stats in graphite [puppet] - 10https://gerrit.wikimedia.org/r/300737 (https://phabricator.wikimedia.org/T69880) [21:02:01] (03CR) 10Paladox: "You misspelt the bug." 
[puppet] - 10https://gerrit.wikimedia.org/r/300738 (https://bugzilla.wikimedia.org/139137) (owner: 10Hashar) [21:04:22] PROBLEM - Host eeden is DOWN: PING CRITICAL - Packet loss = 100% [21:04:22] PROBLEM - Host ns2-v4 is DOWN: PING CRITICAL - Packet loss = 100% [21:04:31] (03PS2) 10Luke081515: contint: role for Android testing [puppet] - 10https://gerrit.wikimedia.org/r/300738 (https://phabricator.wikimedia.org/T139137) (owner: 10Hashar) [21:05:03] now it makes sense^^ [21:06:06] Thanks [21:18:12] PROBLEM - puppet last run on db2059 is CRITICAL: CRITICAL: puppet fail [21:30:43] RECOVERY - Host eeden is UP: PING OK - Packet loss = 0%, RTA = 84.27 ms [21:31:21] RECOVERY - Host ns2-v4 is UP: PING OK - Packet loss = 0%, RTA = 83.28 ms [21:31:31] (03CR) 10Hashar: "Thank you Paladox and Luke081515 :-]" [puppet] - 10https://gerrit.wikimedia.org/r/300738 (https://phabricator.wikimedia.org/T139137) (owner: 10Hashar) [21:31:57] (03CR) 10Paladox: "You're welcome :)." [puppet] - 10https://gerrit.wikimedia.org/r/300738 (https://phabricator.wikimedia.org/T139137) (owner: 10Hashar) [21:36:51] PROBLEM - puppet last run on eeden is CRITICAL: CRITICAL: Puppet has 3 failures [21:44:23] RECOVERY - puppet last run on db2059 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [21:45:12] RECOVERY - All k8s etcd nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.164 second response time [21:52:43] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [21:58:52] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 18 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [22:00:52] RECOVERY - puppet last run on eeden is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [23:32:41] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [23:38:29] (03PS1) 10Yuvipanda: tools: Add timeout to ldap connection for maintain-kubeusers [puppet] - 10https://gerrit.wikimedia.org/r/300741 (https://phabricator.wikimedia.org/T141203) [23:38:41] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [23:39:17] (03CR) 10Yuvipanda: [C: 032] tools: Track per tool request stats in graphite [puppet] - 10https://gerrit.wikimedia.org/r/300737 (https://phabricator.wikimedia.org/T69880) (owner: 10Yuvipanda) [23:39:54] (03Abandoned) 10Yuvipanda: Add a logster for parsing out top level urls [debs/logster] - 10https://gerrit.wikimedia.org/r/300717 (owner: 10Yuvipanda) [23:42:20] (03CR) 10Alex Monk: "Isn't this supposed to be given to the ldap3.Server constructor?" [puppet] - 10https://gerrit.wikimedia.org/r/300741 (https://phabricator.wikimedia.org/T141203) (owner: 10Yuvipanda)
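On the closing review question ("Isn't this supposed to be given to the ldap3.Server constructor?"): ldap3 does take the network-level timeout on the Server object (connect_timeout), with a separate receive_timeout available on the Connection. A minimal sketch of the pattern under discussion; the hosts, DNs and 10-second values are illustrative, not maintain-kubeusers' actual settings:

```python
# Minimal sketch of the timeout pattern discussed in the review above.
# Hosts, DNs and the 10-second values are illustrative placeholders.
import ldap3

servers = [
    # connect_timeout bounds the TCP connect, so a hung LDAP server
    # makes the script fail fast instead of blocking indefinitely.
    ldap3.Server(host, connect_timeout=10)
    for host in ("seaborgium.wikimedia.org", "serpens.wikimedia.org")
]
pool = ldap3.ServerPool(servers, ldap3.ROUND_ROBIN, active=True, exhaust=True)

conn = ldap3.Connection(
    pool,
    user="uid=example,ou=people,dc=wikimedia,dc=org",  # placeholder bind DN
    password="not-the-real-password",
    receive_timeout=10,   # also bound waiting for responses, not just the connect
    auto_bind=True,
)
conn.search("dc=wikimedia,dc=org", "(objectClass=posixAccount)",
            attributes=["uid"], size_limit=5)
print(conn.entries)
```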