[00:25:04] PROBLEM - puppet last run on labtestcontrol2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:28:44] RECOVERY - puppet last run on kafka1012 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [00:34:34] PROBLEM - puppet last run on cp4006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:53:04] RECOVERY - puppet last run on labtestcontrol2001 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [01:03:44] RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [01:53:54] PROBLEM - citoid endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:53:54] PROBLEM - zotero on sca2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:55:44] RECOVERY - zotero on sca2004 is OK: HTTP OK: HTTP/1.0 200 OK - 62 bytes in 0.079 second response time [01:55:54] RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy [02:32:44] PROBLEM - puppet last run on labservices1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [02:34:54] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [03:02:04] RECOVERY - puppet last run on labservices1002 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [03:02:54] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [03:21:44] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 661.00 seconds [03:24:45] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 165.14 seconds [03:24:54] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:25:48] is there an issue with Commons giving files? I am getting sporadic delivery of images (at enWS) [03:32:27] sDrewth: I would recommend filing a task since I don't see current tasks about it, and then offer Reedy a stroopwafel [03:33:34] okay p858snake|_ I am now getting an Error message telling me to come back in a few minutes [03:34:13] so it may be where I am pulling data from [03:34:42] can you copy the full message, there is generally some stuff down the bottom which should include some helpful tech details [03:35:49] what is the name of the cut and paste thingy we host at Labs? [03:36:27] https://phabricator.wikimedia.org/paste/ :) [03:36:47] no idea where the toolslabs one is [03:38:03] https://phabricator.wikimedia.org/P4687 [03:39:28] duh https://tools.wmflabs.org/paste ! [03:39:43] Reedy: still around? [03:40:23] p858snake|_: it comes and goes, so it looks to just be a part of the cluster [03:40:53] presumably I would pull from the Pacific servers [03:41:34] I wildly and un-educatedly suggest that cp4007 might be having issues [03:41:50] sDrewth: is it always cp4007 when you hit the error screens? [03:42:26] for commons images it says nothing, just fails to deliver. I will watch what happens for enWS [03:42:49] p858snake|_: I am...
But it is 03:42 [03:43:51] Reedy: Anyone else online that could possibly look at ^ [03:44:06] Reedy: I would offer a strong coffee, but I'm not in the same country [03:44:44] I have some nice East Timorese in the grinder, but even further away [03:45:36] got it again, and again cp4007 [03:46:00] cp4007 doesn't look out of the ordinary in ganglia [03:46:38] this has been off and on for 3/4 hour [03:47:11] now getting Error 405, Method not allowed at (timestamp) [03:52:54] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [03:52:56] now getting text okay, but serving images isn't happening [04:00:30] now back again [04:03:23] feel welcome to leave it Reedy and go and sleep, whatever it was seems to have passed [04:12:44] PROBLEM - puppet last run on elastic1026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:25:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [04:29:56] (03PS1) 10Tim Landscheidt: Tools: Use exported resources for ssh host keys [puppet] - 10https://gerrit.wikimedia.org/r/329382 (https://phabricator.wikimedia.org/T153163) [04:30:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.011 second response time [04:30:50] (03CR) 10Tim Landscheidt: [C: 04-1] "DO NOT SUBMIT. The Tools puppetmaster do not support exported resources yet (but the code works, and that's very encouraging)." [puppet] - 10https://gerrit.wikimedia.org/r/329382 (https://phabricator.wikimedia.org/T153163) (owner: 10Tim Landscheidt) [04:35:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.009 second response time [04:40:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.009 second response time [04:40:44] RECOVERY - puppet last run on elastic1026 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [04:43:14] PROBLEM - puppet last run on cp3034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:45:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [04:50:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. 
- string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.009 second response time [04:51:14] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [04:52:14] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 2927224 keys, up 57 days 20 hours - replication_delay is 0 [04:55:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [05:00:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [05:01:04] PROBLEM - puppet last run on mw1168 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/furl] [05:02:24] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 14 failures. Last run 2 minutes ago with 14 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [05:05:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.009 second response time [05:10:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.009 second response time [05:11:14] RECOVERY - puppet last run on cp3034 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [05:15:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [05:20:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [05:25:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [05:29:04] RECOVERY - puppet last run on mw1168 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [05:29:24] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [05:29:54] PROBLEM - puppet last run on cobalt is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:30:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. 
- string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [05:31:34] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=5095.70 Read Requests/Sec=2519.20 Write Requests/Sec=114.60 KBytes Read/Sec=24241.60 KBytes_Written/Sec=1954.40 [05:35:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.008 second response time [05:38:34] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=3.20 Read Requests/Sec=0.10 Write Requests/Sec=46.30 KBytes Read/Sec=0.40 KBytes_Written/Sec=330.80 [05:40:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.009 second response time [05:43:04] PROBLEM - puppet last run on labtestweb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:45:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [05:50:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [05:55:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [05:57:54] RECOVERY - puppet last run on cobalt is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [06:00:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [06:05:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [06:10:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [06:13:04] RECOVERY - puppet last run on labtestweb2001 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [06:15:04] PROBLEM - puppet last run on labvirt1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:15:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.011 second response time [06:20:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. 
- string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [06:25:14] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [06:30:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.011 second response time [06:35:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [06:38:14] PROBLEM - puppet last run on elastic1022 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[vim],Package[zsh-beta] [06:40:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.011 second response time [06:43:04] RECOVERY - puppet last run on labvirt1013 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [06:45:04] RECOVERY - Disk space on labtestnet2001 is OK: DISK OK [06:45:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [06:50:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.011 second response time [06:55:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [07:00:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [07:05:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.011 second response time [07:07:14] RECOVERY - puppet last run on elastic1022 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [07:10:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.012 second response time [07:15:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [07:20:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. 
- string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.011 second response time [07:25:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [07:30:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [07:35:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.009 second response time [07:36:31] !log added reedy to "integration" gerrit group per T154207 [07:36:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:35] T154207: Grant reedy C+2 on integration repos - https://phabricator.wikimedia.org/T154207 [07:40:14] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [07:45:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [07:50:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [07:55:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [08:00:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.009 second response time [08:05:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [08:10:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [08:15:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [08:20:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [08:25:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. 
- string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [08:30:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [08:31:00] <_joe_> grrrr [08:31:16] <_joe_> legoktm: do you know if anyone from FR is around? [08:35:05] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.011 second response time [08:40:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [08:45:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [08:45:05] (03PS1) 10Tim Landscheidt: puppetdb: Use tuning.conf only in production [puppet] - 10https://gerrit.wikimedia.org/r/329390 [08:47:32] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I would vastly prefer to split this role in smaller parts instead of adding if guards everywhere. I'm happy to help doing it, too. I've se" [puppet] - 10https://gerrit.wikimedia.org/r/329390 (owner: 10Tim Landscheidt) [08:50:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.009 second response time [08:52:11] Can I ask for someone to look at updating the interwiki.db stuff, it hasn't been done for a month [08:52:46] (03CR) 10Giuseppe Lavagetto: [C: 032] puppetdb: Do not hardcode puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/329330 (https://phabricator.wikimedia.org/T153577) (owner: 10Tim Landscheidt) [08:52:53] (03PS2) 10Giuseppe Lavagetto: puppetdb: Do not hardcode puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/329330 (https://phabricator.wikimedia.org/T153577) (owner: 10Tim Landscheidt) [08:53:07] sDrewth: we are in a deployment freeze due to the holidays, it won't be till early jan [08:53:17] can you file a task in phabricator for it please? [08:53:22] oh, that is deployment freeze too, okay [08:53:29] (03PS1) 10Tim Landscheidt: wmflib: Fix typo in cron_splay() [puppet] - 10https://gerrit.wikimedia.org/r/329392 [08:53:33] sure [08:54:14] they have trimmed the displayed log at SAL to a week :-( makes things harder to find [08:54:35] _joe_: you will probably need to text jeff, he was online earlier today fixing something else as well [08:55:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue.
- string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.011 second response time [08:55:05] (03CR) 10Tim Landscheidt: "(Theoretically, the puppetmaster needs to be restarted after applying this; but if this didn't trigger any errors so far, this is probably" [puppet] - 10https://gerrit.wikimedia.org/r/329392 (owner: 10Tim Landscheidt) [08:55:08] <_joe_> p858snake|_: yeah it's not something for jeff to fix [08:55:27] <_joe_> or, well. [08:55:40] <_joe_> I can't really comment more, sorry [08:57:27] meh, it is more effective to search SAL?action=history to find things than use the page itself! [08:57:37] <_joe_> sDrewth: what's up? [08:58:08] (03CR) 10Giuseppe Lavagetto: "I don't think that's the case, the puppetmaster should keep working just fine; the reason this wasn't fixed earlier is that apparently we'" [puppet] - 10https://gerrit.wikimedia.org/r/329392 (owner: 10Tim Landscheidt) [08:58:09] I was trying to find the last update of the interwiki map, but [[mw:SAL]] updates to frequently and only covers a week [08:58:26] ack wikitech:SAL [08:58:34] too [08:59:24] nuisance value, /me shrugs and goes to log the phabricator ticket [08:59:47] 06Operations, 10Ops-Access-Requests, 05Security: Give Katie Horn (K4-713) access to restricted Security tasks - https://phabricator.wikimedia.org/T154211#2904538 (10Peachey88) [09:00:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [09:05:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [09:06:59] (03CR) 10Tim Landscheidt: "@Joe: If there's a better layout, wonderful. But I have a number of patches yet to upload that enable PuppetDB for standalone puppetmaste" [puppet] - 10https://gerrit.wikimedia.org/r/329390 (owner: 10Tim Landscheidt) [09:07:21] <_joe_> sDrewth: have you tried http://tools.wmflabs.org/sal ? [09:09:14] <_joe_> there you will find a full history [09:09:46] thanks. I will update the meta link [09:10:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.009 second response time [09:10:53] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "> @Joe: If there's a better layout, wonderful. But I have a number" [puppet] - 10https://gerrit.wikimedia.org/r/329390 (owner: 10Tim Landscheidt) [09:11:36] (03CR) 10Giuseppe Lavagetto: [C: 032] wmflib: Fix typo in cron_splay() [puppet] - 10https://gerrit.wikimedia.org/r/329392 (owner: 10Tim Landscheidt) [09:11:42] (03PS2) 10Giuseppe Lavagetto: wmflib: Fix typo in cron_splay() [puppet] - 10https://gerrit.wikimedia.org/r/329392 (owner: 10Tim Landscheidt) [09:11:50] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] wmflib: Fix typo in cron_splay() [puppet] - 10https://gerrit.wikimedia.org/r/329392 (owner: 10Tim Landscheidt) [09:13:04] PROBLEM - puppet last run on labsdb1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:15:14] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. 
- string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [09:15:48] (03CR) 10Giuseppe Lavagetto: "@Tim: I think there is really no reason for using expand_path in labs, apart from the fact we use labs/private for both testing production" [puppet] - 10https://gerrit.wikimedia.org/r/329226 (owner: 10Tim Landscheidt) [09:20:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.011 second response time [09:25:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [09:30:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.009 second response time [09:35:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [09:40:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.009 second response time [09:42:04] RECOVERY - puppet last run on labsdb1004 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [09:45:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [09:49:29] 06Operations, 06Commons, 10TimedMediaHandler-Transcode, 10Wikimedia-Video, and 2 others: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2904613 (10elukey) @zhuyifei1999, the main issue seems to be a ton of jobs submitted to the queue, that we don'... [09:50:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [09:53:46] 06Operations, 06Commons, 10TimedMediaHandler-Transcode, 10Wikimedia-Video, and 2 others: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2904621 (10elukey) Or maybe this is the issue: https://gerrit.wikimedia.org/r/#/c/238819/3/TimedMediaHandler.php [09:55:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.009 second response time [09:57:46] 06Operations, 06Commons, 10TimedMediaHandler-Transcode, 10Wikimedia-Video, and 2 others: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2904624 (10Joe) Let me point out that the queue is hardly growing anymore, we're in the middle of the holiday f... 
[10:00:04] RECOVERY - check_listener_gc on thulium is OK: HTTP OK: Status line output matched HTTP/1.1 503 - 214 bytes in 0.009 second response time [10:11:14] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 652 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 2940502 keys, up 58 days 1 hours - replication_delay is 652 [10:18:14] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 2932798 keys, up 58 days 1 hours - replication_delay is 22 [10:46:54] PROBLEM - puppet last run on analytics1035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:05:14] PROBLEM - puppet last run on kraz is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:10:41] (03PS1) 10Revi: Revert $wgMFEEditorOptions['anonymousEditing'] = true for kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/329446 (https://phabricator.wikimedia.org/T119823) [11:15:14] RECOVERY - puppet last run on analytics1035 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [11:15:18] 06Operations, 06Commons, 10TimedMediaHandler-Transcode, 10Wikimedia-Video, and 2 others: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2904695 (10Joe) I just looked at the trends of job waiting times and they don't really make sense unless we're... [11:15:47] (03CR) 10Amire80: [C: 031] "Thank you, revi!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/329446 (https://phabricator.wikimedia.org/T119823) (owner: 10Revi) [11:28:30] (03PS1) 10Giuseppe Lavagetto: videoscaler: bump up the number of running transcodes (again) [puppet] - 10https://gerrit.wikimedia.org/r/329447 (https://phabricator.wikimedia.org/T153488) [11:33:00] 06Operations, 06Commons, 10TimedMediaHandler-Transcode, 10Wikimedia-Video, and 3 others: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2904765 (10zhuyifei1999) Something really curious about the queue is that [[https://commons.wikimedia.org/wiki/... [11:33:14] RECOVERY - puppet last run on kraz is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [11:35:47] (03CR) 10Giuseppe Lavagetto: [C: 032] videoscaler: bump up the number of running transcodes (again) [puppet] - 10https://gerrit.wikimedia.org/r/329447 (https://phabricator.wikimedia.org/T153488) (owner: 10Giuseppe Lavagetto) [11:40:04] _joe_: fyi, I asked Jason to not flood v2c https://commons.wikimedia.org/wiki/User_talk:Jasonanaggie#video2commons_flooding [11:40:12] <_joe_> thanks :) [11:40:33] <_joe_> zhuyifei1999_: yesterday matanya asked me if they could "get back to running more uploads" [11:40:43] <_joe_> and well, the answer for now is no :P [11:40:53] if needed I can reduce the workers even more [11:42:11] <_joe_> zhuyifei1999_: well, if we weren't in the middle of a freeze and of holidays (so it's basically me being around from ops during EU mornings, more or less) [11:42:21] <_joe_> I'd repurpose a couple more machines [11:44:02] <_joe_> but for a series of reasons it's not as easy as I'd like at the moment [11:44:21] <_joe_> mostly because we didn't have the time to fix ffmpeg issues on debian jessie for now [11:44:40] <_joe_> so converting means not just reconfiguring, but reinstalling the servers altogether [11:46:21] ffmpeg has issues on jessie?
[11:46:35] v2c is 100% jessie [11:48:35] hmm https://phabricator.wikimedia.org/T145742 [11:49:54] https://phabricator.wikimedia.org/T103335 <= "ffmpeg is not in Jessie" I don't think that's true now [11:50:50] <_joe_> yeah there was another issue IIRC [11:50:59] <_joe_> but honestly don't remember well [11:51:31] <_joe_> when everyone's back from vacation and then the developer summit, I'll take a look with moritz.m [11:51:41] ok [11:52:19] <_joe_> uhm I might have been a bit optimistic with the new load [11:52:30] <_joe_> let's see what happens in 2-3 hours about that [11:52:36] <_joe_> I'm monitoring the situation [11:55:41] (03PS1) 10Ladsgroup: Add badge for "digitaldocument" in Wikibase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/329453 (https://phabricator.wikimedia.org/T153186) [12:00:19] hmm yeah v2c is running ffmpeg from the backports [12:02:45] 147% load avg o.O [12:04:05] why is there so much kernel cpu time? [12:04:35] looks unnatural [12:06:44] <_joe_> not really [12:06:54] <_joe_> a lot of interrupts to manage [12:08:08] <_joe_> zhuyifei1999_: I am stepping out for a bit, I'll take a look at the status later [12:08:18] k [12:08:25] <_joe_> but I might downsize that a bit later [12:08:41] (03CR) 10Thiemo Mättig (WMDE): [C: 031] "Please wait for a final decision at T153186 before merging this." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/329453 (https://phabricator.wikimedia.org/T153186) (owner: 10Ladsgroup) [12:09:48] (well, unless VMs don't need to handle so many interrupts, https://tools.wmflabs.org/nagf/?project=video doesn't show much kernel CPU time, even under 200+% load) [12:21:47] zhuyifei1999_: where is the load in your metrics? [12:22:25] you mean v2c right now? barely any load [12:22:38] no I meant the 200% load.. [12:23:07] I think that the kernel time on an overloaded host is expected [12:23:27] that was thirty something yesterday [12:24:02] yeah so this was the risk that I mentioned in the task, namely ops trying to fine tune and getting into the overloaded area [12:24:22] there is a bigger problem in the queue's content [12:24:42] https://grafana-labs.wikimedia.org/dashboard/db/labs-project-board?var-project=video&var-server=All half-broken [12:24:56] but has the load of two hosts [12:25:53] thanks! [12:26:38] zhuyifei1999_: so this is one host - https://grafana.wikimedia.org/dashboard/file/server-board.json?var-server=mw1168&var-network=eth0 [12:27:26] (one of the more powerful) [12:27:59] this one https://grafana.wikimedia.org/dashboard/file/server-board.json?var-server=mw1259 is one of the "oldest" [12:28:19] 5-7% of system time is not a problem imho [12:28:53] (look also at the baseline before the load increase) [12:29:59] about doubled [12:30:55] so we'll reduce the load later on but it is a matter of fine tuning, _joe_ knows what he is doing :) [12:31:04] k [12:32:31] but again thanks a lot for taking care of the video transcode queue, I absolutely don't want to drive you away from your work (that is really appreciated), only stating that we are doing as much as we can :) [12:33:40] zhuyifei1999_: --^ [12:33:54] ok [12:39:14] PROBLEM - puppet last run on puppetmaster2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:48:04] PROBLEM - puppet last run on ms-be1015 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [13:08:14] RECOVERY - puppet last run on puppetmaster2002 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [13:16:04] RECOVERY - puppet last run on ms-be1015 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [13:19:08] !log mobrovac@tin Starting deploy [mathoid/deploy@79fdd56]: Add depooling and repooling to Mathoid T144602 [13:19:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:12] T144602: Depool and repool SCB services during deploys - https://phabricator.wikimedia.org/T144602 [13:21:39] !log mobrovac@tin Finished deploy [mathoid/deploy@79fdd56]: Add depooling and repooling to Mathoid T144602 (duration: 02m 31s) [13:21:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:36] <_joe_> ok so, it looks like the load on the scalers is ok [13:26:45] <_joe_> meaning, it's very high, but not "overloading" [13:26:58] <_joe_> we're at ~ 100% utiization of resources [13:27:04] <_joe_> I'd leave it like that for now [13:27:29] 100%? wow [13:31:02] !log mobrovac@tin Starting deploy [citoid/deploy@da96f4b]: Add depooling and repooling to Citoid T144602 [13:31:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:05] T144602: Depool and repool SCB services during deploys - https://phabricator.wikimedia.org/T144602 [13:33:01] !log mobrovac@tin Finished deploy [citoid/deploy@da96f4b]: Add depooling and repooling to Citoid T144602 (duration: 01m 59s) [13:33:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:34] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:40:39] !log mobrovac@tin Starting deploy [cxserver/deploy@0279029]: (no message) [13:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:03] !log mobrovac@tin Finished deploy [cxserver/deploy@0279029]: (no message) (duration: 00m 25s) [13:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:18] !log mobrovac@tin Starting deploy [cxserver/deploy@0279029]: Add depooling and repooling to CXServer T144602 [13:41:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:21] T144602: Depool and repool SCB services during deploys - https://phabricator.wikimedia.org/T144602 [13:44:41] !log mobrovac@tin Finished deploy [cxserver/deploy@0279029]: Add depooling and repooling to CXServer T144602 (duration: 03m 22s) [13:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:36] !log mobrovac@tin Starting deploy [graphoid/deploy@151f26c]: Add depooling and repooling to Graphoid T144602 [13:49:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:40] T144602: Depool and repool SCB services during deploys - https://phabricator.wikimedia.org/T144602 [13:51:14] !log mobrovac@tin Finished deploy [graphoid/deploy@151f26c]: Add depooling and repooling to Graphoid T144602 (duration: 01m 37s) [13:51:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:05] !log mobrovac@tin Starting deploy [electron-render/deploy@b2a820e]: Add depooling and repooling to PDFRender T144602 [13:59:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:08] T144602: Depool and repool SCB services during deploys - 
https://phabricator.wikimedia.org/T144602 [14:00:31] !log mobrovac@tin Finished deploy [electron-render/deploy@b2a820e]: Add depooling and repooling to PDFRender T144602 (duration: 01m 26s) [14:00:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:56] 06Operations, 10Gerrit, 06Release-Engineering-Team, 07Upstream: Gerrit: Restarting gerrit could lead to data loss - https://phabricator.wikimedia.org/T154205#2904959 (10Paladox) That thread says online reindex works until the next restart so that could be a workaround. But I doint know what other sides aff... [14:05:08] !log mobrovac@tin Starting deploy [mobileapps/deploy@ae22656]: Add depooling and repooling to MCS T144602 [14:05:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:11] T144602: Depool and repool SCB services during deploys - https://phabricator.wikimedia.org/T144602 [14:05:42] 06Operations, 10Gerrit, 06Release-Engineering-Team, 07Upstream: Gerrit: Restarting gerrit could lead to data loss - https://phabricator.wikimedia.org/T154205#2904963 (10Paladox) But the online reindex dosent seem to work for some users [14:07:34] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [14:07:41] !log mobrovac@tin Finished deploy [mobileapps/deploy@ae22656]: Add depooling and repooling to MCS T144602 (duration: 02m 34s) [14:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:09] !log mobrovac@tin Starting deploy [trending-edits/deploy@c5d239b]: Add depooling and repooling to Trending Edits T144602 [14:11:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:14] T144602: Depool and repool SCB services during deploys - https://phabricator.wikimedia.org/T144602 [14:12:30] !log mobrovac@tin Finished deploy [trending-edits/deploy@c5d239b]: Add depooling and repooling to Trending Edits T144602 (duration: 01m 21s) [14:12:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:48] 06Operations, 10DBA, 10Gerrit, 13Patch-For-Review, 07Upstream: Gerrit shows HTTP 500 error when pasting an emoji - https://phabricator.wikimedia.org/T145885#2904971 (10Paladox) http://gerrit-test.wmflabs.org/ has had its database converted to this for nearly a week now and still works currently. [14:22:52] (03PS1) 10Giuseppe Lavagetto: mediawiki::jobrunner: rotate log files weekly [puppet] - 10https://gerrit.wikimedia.org/r/329487 (https://phabricator.wikimedia.org/T153488) [14:25:38] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki::jobrunner: rotate log files weekly [puppet] - 10https://gerrit.wikimedia.org/r/329487 (https://phabricator.wikimedia.org/T153488) (owner: 10Giuseppe Lavagetto) [14:31:25] 06Operations, 10Gerrit, 06Release-Engineering-Team, 07Upstream: Gerrit: Restarting gerrit could lead to data loss - https://phabricator.wikimedia.org/T154205#2904990 (10Paladox) According to the thread the index for accounts is affected too. Though I have no idea if it affects ldap but it does store accoun... [14:31:39] 06Operations, 10Gerrit, 06Release-Engineering-Team, 07Upstream: Gerrit: Restarting gerrit could lead to data loss + accounts - https://phabricator.wikimedia.org/T154205#2904991 (10Paladox) [15:00:14] PROBLEM - puppet last run on elastic1038 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:18:51] (03CR) 10Florianschmidtwelzow: [C: 031] Revert $wgMFEEditorOptions['anonymousEditing'] = true for kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/329446 (https://phabricator.wikimedia.org/T119823) (owner: 10Revi) [15:25:59] heh [15:28:14] RECOVERY - puppet last run on elastic1038 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [15:29:25] (03PS1) 10Alexandros Kosiaris: kubernetes::worker: Allow access to kubelet from master [puppet] - 10https://gerrit.wikimedia.org/r/329494 [15:55:04] PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=46%) [15:57:14] PROBLEM - puppet last run on db1045 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:26:14] RECOVERY - puppet last run on db1045 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [16:30:54] PROBLEM - NTP on prometheus2003 is CRITICAL: NTP CRITICAL: Offset unknown [16:31:24] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [16:39:24] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [16:52:04] PROBLEM - puppet last run on snapshot1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:05:14] PROBLEM - puppet last run on analytics1027 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:21:04] RECOVERY - puppet last run on snapshot1005 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [17:24:14] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:34:14] RECOVERY - puppet last run on analytics1027 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [17:45:23] 06Operations, 10Gerrit, 06Release-Engineering-Team, 07Upstream: Gerrit: Restarting gerrit could lead to data loss + accounts - https://phabricator.wikimedia.org/T154205#2905165 (10Paladox) [17:46:31] 06Operations, 10Gerrit, 06Release-Engineering-Team, 07Upstream: Gerrit: Restarting gerrit could lead to data loss + accounts - https://phabricator.wikimedia.org/T154205#2904053 (10Paladox) [17:49:00] 06Operations, 10Gerrit, 06Release-Engineering-Team, 07Upstream: Gerrit: Restarting gerrit could lead to data loss + accounts - https://phabricator.wikimedia.org/T154205#2905168 (10Paladox) [17:51:45] (03PS1) 10Urbanecm: [throttle] New rules for Queen Mary University [mediawiki-config] - 10https://gerrit.wikimedia.org/r/329500 (https://phabricator.wikimedia.org/T154245) [17:52:14] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [17:52:24] (03CR) 10jerkins-bot: [V: 04-1] [throttle] New rules for Queen Mary University [mediawiki-config] - 10https://gerrit.wikimedia.org/r/329500 (https://phabricator.wikimedia.org/T154245) (owner: 10Urbanecm) [17:54:17] (03PS2) 10Urbanecm: [throttle] New rules for Queen Mary University [mediawiki-config] - 10https://gerrit.wikimedia.org/r/329500 (https://phabricator.wikimedia.org/T154245) [17:54:27] (03PS5) 10Paladox: Gerrit: Convert from utf8 to utf8mb4 [puppet] - 10https://gerrit.wikimedia.org/r/328571 (https://phabricator.wikimedia.org/T153899) [17:54:51] (03CR) 10jerkins-bot: [V: 04-1] [throttle] New rules for Queen Mary University [mediawiki-config] - 10https://gerrit.wikimedia.org/r/329500 (https://phabricator.wikimedia.org/T154245) (owner: 10Urbanecm) [17:56:27] (03PS3) 10Urbanecm: [throttle] New rules for Queen Mary University [mediawiki-config] - 10https://gerrit.wikimedia.org/r/329500 (https://phabricator.wikimedia.org/T154245) [17:57:14] PROBLEM - puppet last run on analytics1047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:01:44] (03PS4) 10Urbanecm: [throttle] New rules + remove obsolete rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/329500 (https://phabricator.wikimedia.org/T154245) [18:09:25] 06Operations, 10Gerrit, 06Release-Engineering-Team, 07Upstream: Gerrit: Restarting gerrit could lead to data loss + accounts - https://phabricator.wikimedia.org/T154205#2905202 (10Paladox) [18:11:55] 06Operations, 10Gerrit, 06Release-Engineering-Team, 05Security, 07Upstream: Gerrit: Restarting gerrit could lead to data loss + accounts - https://phabricator.wikimedia.org/T154205#2905203 (10Paladox) Adding #security as it affects accounts, though no one can steal accounts as far as i can tell, account... [18:25:14] RECOVERY - puppet last run on analytics1047 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [18:40:30] 06Operations, 10Gerrit, 06Release-Engineering-Team, 05Security, 07Upstream: Gerrit: Restarting gerrit could lead to data loss + accounts - https://phabricator.wikimedia.org/T154205#2905243 (10Paladox) This may be the reason why we had this T152640 problem as accounts are stored in the index now. All the... 
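The "Gerrit: Convert from utf8 to utf8mb4" patch referenced above (T153899, related to the emoji HTTP 500 in T145885) is about letting Gerrit's MySQL database store 4-byte UTF-8 characters. A minimal sketch of the kind of statements such a conversion involves follows; the database name and table list are placeholders rather than Gerrit's real schema, and the production change is applied through the Puppet patch under review, not by running this script:

    # Sketch only: emit the MySQL statements a utf8 -> utf8mb4 conversion needs.
    # "reviewdb" and the table names below are illustrative placeholders.
    TABLES = ["changes", "change_messages", "patch_comments"]

    def conversion_sql(database, tables):
        yield ("ALTER DATABASE `{}` CHARACTER SET utf8mb4 "
               "COLLATE utf8mb4_unicode_ci;".format(database))
        for table in tables:
            # CONVERT TO rewrites existing column charsets as well, which is what
            # lets 4-byte characters such as emoji be stored at all.
            yield ("ALTER TABLE `{}`.`{}` CONVERT TO CHARACTER SET utf8mb4 "
                   "COLLATE utf8mb4_unicode_ci;".format(database, table))

    if __name__ == "__main__":
        for statement in conversion_sql("reviewdb", TABLES):
            print(statement)

One general caveat, not specific to this deployment: on older MySQL/MariaDB versions, utf8mb4 can push indexed VARCHAR(255) columns past the 767-byte InnoDB key limit unless large index prefixes or a DYNAMIC row format are enabled, which is one reason a trial conversion on a copy (as reported for gerrit-test.wmflabs.org earlier in this log) is prudent.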
[19:03:56] (03PS3) 10Urbanecm: Enable subpages in NS0 for arbcom_cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327700 (https://phabricator.wikimedia.org/T154247) [19:04:10] (03CR) 10Urbanecm: "PS3: Changed T number to T154247." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327700 (https://phabricator.wikimedia.org/T154247) (owner: 10Urbanecm) [19:44:58] 06Operations, 06Commons, 10TimedMediaHandler-Transcode, 10Wikimedia-Video, and 3 others: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2905338 (10Joe) So, after I fine-tuned the video scalers load, we're now processing 73 transcodes at the same t... [19:48:10] <_joe_> Revent: I'm fairly sure someone re-transcoded all white house press briefings, and that is creating a huge queue [19:48:46] I don’t think that’s it... [19:49:13] The guy had dumped like 200 more into v2c the other day. [19:49:55] <_joe_> Revent: we're churning videos at 7x the rate we were doing it 1 week ago [19:50:06] (nods) [19:50:08] <_joe_> I can't realistically add much more capacity at this point [19:50:55] We’ve discussed it with the particular person who was uploading the briefings, hopefully he’ll actually stop for a while, and then go slower. [19:51:17] <_joe_> unless I convince the WMF to buy 10 videoscalers, which is hardly justified at this point from the historic load on the scalers [19:51:39] <_joe_> the right solution will be to move videoscaling to our "elastic" environment once it's ready [19:51:40] https://commons.wikimedia.org/w/index.php?search=intitle%3A%22White+House+Press+Briefing%22&title=Special:Search&go=Go&searchToken=9f39ggx3nmi0iefpwd84quf6n <- but…. nearly 600 videos, and probably 12x transcodes each.... [19:51:40] _joe_: i can reduce the load coming from v2c if that helps [19:51:44] <_joe_> but it will take some time [19:51:50] <_joe_> matanya: stop it maybe? [19:51:57] totally ? [19:52:24] matanya: Zhu had half of v2c off for a while, but the guy just backlogged it there. [19:52:55] <_joe_> matanya: well we're 12 days behind on transcodes atm [19:53:05] I am the project admin, i can scale down or completely shut it down [19:53:21] <_joe_> sorry I'm on a train, internet is good for travelling at 300 kmh but still not great [19:53:25] but i'd really like the project to stay alive if that is possible [19:53:41] <_joe_> matanya: yeah I'd love that too [19:54:13] <_joe_> but if someone pours 7500 transcodes of large files in the queue, there is not much I can do to speed it up [19:54:18] <_joe_> besides what I just did [19:54:33] _joe_: If it says anything, the ones that got ‘broken’ on the 19th (the ones that were trying to be run when the servers were reset) seem to be running now. [19:54:56] <_joe_> Revent: well it says we're 11 days behind [19:54:58] <_joe_> sigh [19:55:05] <_joe_> sorry, 9 [19:55:18] I got ‘most’ of them re-reset before they started going through. [19:55:26] _joe_: say the word and i'll shut it down [19:55:54] <_joe_> matanya: let's see tomorrow morning if you're around? [19:56:04] I will be [19:56:09] <_joe_> I would like to see if we can start catching up [19:56:11] EU times ? [19:56:17] <_joe_> now that I fine-tuned the scalers [19:56:27] <_joe_> yup, I'm in CET [19:56:43] ok, i'll be around [19:57:25] <_joe_> thanks :) [19:58:56] zhuyifei1999_: To clarify... [19:59:46] zhuyifei1999_: I ‘was’ deliberately running trancodes that where in the ‘very old broken transcodes’ list, before the current drama. 
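For a sense of scale, here is a back-of-envelope estimate of the drain time implied by the numbers quoted in the exchange above (roughly 7500 queued transcodes, 73 running at once, a backlog reported as 9-12 days). The average job duration is an assumed figure for illustration only, not a measured one:

    # Rough drain-time estimate for the backlog discussed above.
    backlog_jobs = 7500        # "7500 transcodes of large files in the queue"
    concurrent_slots = 73      # "processing 73 transcodes at the same time"
    avg_minutes_per_job = 40   # assumption; long videos can take far longer

    jobs_per_hour = concurrent_slots * 60 / avg_minutes_per_job
    drain_hours = backlog_jobs / jobs_per_hour
    print("throughput ~ {:.0f} jobs/hour".format(jobs_per_hour))
    print("drain time ~ {:.1f} hours ({:.1f} days), assuming no new uploads".format(
        drain_hours, drain_hours / 24))

With those assumptions the queue clears in a few days; a longer real average per job, or continued bulk uploads, stretches it toward the "12 days behind" figure mentioned in the channel.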
More than once, I had accidentally overloaded the scalers for short periods (a few hours) from the timeout bug… [20:01:11] Those would have gone into the ~5000-odd ones that were ‘both error and queued’ when the servers were reset, probably. [20:02:22] 06Operations, 06Commons, 10media-storage, 05MW-1.27-release-notes: Some files had disappeared from Commons after renaming - https://phabricator.wikimedia.org/T111838#2905352 (10Wieralee) It still happens... Not every day, but maybe once for a week... The files I was listing were fortunately uploaded again... [20:02:34] 06Operations, 10ops-eqiad: rack/setup/install/track new ms-fe1005-1008 - https://phabricator.wikimedia.org/T154250#2905353 (10RobH) [20:03:50] The ones that are ‘old uploads’ should be small in number, in comparison to the new uploads of ‘old stuff’ [20:06:34] <_joe_> Revent: the main problem, as I see it, is that video2commons is able to flood the production infrastructure [20:06:52] <_joe_> given for every video we upload there we have a 12x transcoding factor [20:07:07] (nods) [20:08:01] <_joe_> so I see two ways of solving it: one is to send video2commons uploads to a low-prio queue [20:08:46] <_joe_> the other is to rate-limit video2commons uploads [20:08:51] <_joe_> I'd prefer the former [20:10:05] The scalers being in some way able to prioritize ‘easy’ tasks would be very good. [20:13:17] I suspect even something as ‘crude’ as ‘run the transcodes of the smallest file next’ would cause the actual backlog count to drop a lot. [20:15:39] <_joe_> that's very very hard to do in our current jobqueue setting [20:15:54] <_joe_> it's a FIFO [20:16:00] <_joe_> more or less [20:16:09] <_joe_> you can have low-prio jobs though [20:16:49] <_joe_> luckily, I'll be in a room with other devs soonish and I can bug the relevant people in person :P [20:17:49] _joe_: When you say FIFO ‘more or less’… it's looking at transcode_time_addjob right? [20:18:18] <_joe_> no, it's looking at how things got inserted in a Redis queue :P [20:18:55] <_joe_> so whenever you add a video to commons, its transcodes get added to the tail of the transcodes queue [20:19:09] Ok, so that explains (kinda) how things get weird statuses in the DB. [20:19:25] <_joe_> and the scalers do the job of grabbing tasks from the head of such queue [20:19:55] It does, at least, explicitly not allow the same thing to be queued twice, right? [20:20:20] <_joe_> well, it depends on the type of job [20:20:30] <_joe_> but I think for transcodes that's the case [20:21:34] _joe_: Are you able to search the queue? [20:22:13] <_joe_> Revent: more or less [20:22:22] Kemi periodiska systemet.webm [20:22:28] <_joe_> Revent: as in now more or less [20:22:38] <_joe_> I'm on bad internet now [20:22:45] Ah... [20:23:24] <_joe_> tomorrow I'll be at home [20:23:46] <_joe_> I'm on a train atm, and well into my off hours :) [20:24:18] I'm just really hoping that resetting the ones ‘listed as queued, but with an error’ is not causing them to be run more than once. [20:24:41] <_joe_> Revent: let's wait for now? [20:24:52] <_joe_> it should not, but I might be more specific tomorrow [20:25:34] <_joe_> we're also thin on staff, I'm around but almost everyone is on a few days off [20:25:54] Thing is, if they ‘do’ get run, successfully, the error code doesn't get removed, and then they have a really messed up ‘status’. [20:26:32] https://commons.wikimedia.org/wiki/File:Expedition_42_Crew_Profile.webm <- see the “Error on 12:48, 2016 December 19” transcodes?
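A toy illustration of the queue behaviour described in the exchange above: jobs are pushed onto the tail of a Redis list, workers pop from the head, and a side set prevents the same transcode from being queued twice. This is a sketch of the pattern only, not MediaWiki's actual JobQueueRedis implementation; the key names and job format are invented for the example:

    # Toy FIFO job queue with de-duplication (not MediaWiki's JobQueueRedis).
    import json
    import redis  # pip install redis

    r = redis.StrictRedis(host="localhost", port=6379, db=0)

    QUEUE_KEY = "demo:transcode:queue"   # list used as a FIFO: tail push, head pop
    DEDUP_KEY = "demo:transcode:queued"  # set of job signatures currently queued

    def push_job(title, transcode_key):
        """Append a transcode job to the tail unless an identical one is already queued."""
        sig = "{}|{}".format(title, transcode_key)
        # SADD returns 1 only if the signature was not already present: cheap dedup.
        if r.sadd(DEDUP_KEY, sig):
            r.rpush(QUEUE_KEY, json.dumps({"title": title, "key": transcode_key}))
            return True
        return False

    def pop_job():
        """Take the oldest job from the head of the queue (what a scaler would do)."""
        raw = r.lpop(QUEUE_KEY)
        if raw is None:
            return None
        job = json.loads(raw)
        r.srem(DEDUP_KEY, "{}|{}".format(job["title"], job["key"]))
        return job

    if __name__ == "__main__":
        push_job("Expedition_42_Crew_Profile.webm", "720p.webm")
        push_job("Expedition_42_Crew_Profile.webm", "720p.webm")  # ignored: already queued
        print(pop_job())

In this model, the "low-prio jobs" mentioned above would map onto a second list that workers only poll when the main one is empty, while "smallest file first" would need something like a sorted set keyed on file size, which is why it does not fit a plain FIFO.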
[20:26:53] <_joe_> yeah we had a ton of failures on the 19th [20:26:54] They are in the list of ‘queued transcodes’ on the special page. [20:27:15] Yeah, about 5k, lol. [20:28:36] Those are the ones I was resetting, doing so doesn’t change the ‘queued’ count on the special page. [20:44:48] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch, 10procurement: rack/setup/install elastic2025-2036 - https://phabricator.wikimedia.org/T154251#2905389 (10RobH) [20:45:10] 06Operations, 10ops-codfw, 06Discovery, 06Discovery-Search, 10Elasticsearch: rack/setup/install elastic2025-2036 - https://phabricator.wikimedia.org/T154251#2905406 (10RobH) [21:00:53] 06Operations, 10Gerrit, 06Release-Engineering-Team, 05Security, 07Upstream: Gerrit: Restarting gerrit could lead to data loss + accounts - https://phabricator.wikimedia.org/T154205#2905414 (10Aklapper) How often has Wikimedia Gerrit been restarted in the last, say, 3 months? >>! In T154205#2905203, @Pal... [21:01:24] PROBLEM - puppet last run on mw1258 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:04:04] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:12:04] PROBLEM - puppet last run on dbproxy1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:29:24] RECOVERY - puppet last run on mw1258 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [21:31:24] PROBLEM - Check systemd state on labstore1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [21:31:34] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed [21:32:34] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [21:33:24] RECOVERY - Check systemd state on labstore1004 is OK: OK - running: The system is fully operational [21:33:34] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1004 is OK: OK - maintain-dbusers is active [21:39:02] (03PS2) 10Tim Landscheidt: puppetmaster: Clone repositories in Labs as root [puppet] - 10https://gerrit.wikimedia.org/r/324727 (https://phabricator.wikimedia.org/T152059) [21:40:34] RECOVERY - puppet last run on dbproxy1004 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [21:50:44] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 607 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 2952991 keys, up 58 days 13 hours - replication_delay is 607 [21:55:44] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 2935710 keys, up 58 days 13 hours - replication_delay is 53 [22:13:37] (03PS1) 10Urbanecm: Set celebration logo for hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/329516 (https://phabricator.wikimedia.org/T154254) [22:25:14] PROBLEM - puppet last run on thorium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:29:14] PROBLEM - puppet last run on analytics1038 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [22:53:14] RECOVERY - puppet last run on thorium is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [22:57:14] RECOVERY - puppet last run on analytics1038 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [23:20:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [23:25:14] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.011 second response time [23:30:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.011 second response time [23:31:34] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:35:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.011 second response time [23:40:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.011 second response time [23:45:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [23:50:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.009 second response time [23:55:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [23:59:34] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
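The check_listener_gc noise that dominates this log is a string-match HTTP check: it fetches the payments listener URL and goes CRITICAL when the response body does not contain the marker string "OK" (here because the listener answers 403). The following is a minimal Nagios-style sketch of that kind of check, not the actual icinga/check_http configuration used for thulium, and the URL is a placeholder:

    #!/usr/bin/env python3
    # Sketch of a string-match HTTP health check; exit 0 = OK, 2 = CRITICAL.
    import sys
    import urllib.request

    def check(url, expected="OK", timeout=10):
        try:
            body = urllib.request.urlopen(url, timeout=timeout).read()
        except Exception as exc:  # HTTP errors such as 403, timeouts, TLS failures
            print("CRITICAL: request failed: {}".format(exc))
            return 2
        if expected.encode() in body:
            print("OK: string {!r} found in {} bytes".format(expected, len(body)))
            return 0
        print("CRITICAL: string {!r} not found in {} bytes".format(expected, len(body)))
        return 2

    if __name__ == "__main__":
        # Placeholder target; the real check points at the payments listener service.
        sys.exit(check("https://payments-listener.example.org/globalcollect"))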