[00:25:04] PROBLEM - puppet last run on labtestcontrol2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:28:44] RECOVERY - puppet last run on kafka1012 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [00:34:34] PROBLEM - puppet last run on cp4006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:53:04] RECOVERY - puppet last run on labtestcontrol2001 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [01:03:44] RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [01:53:54] PROBLEM - citoid endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:53:54] PROBLEM - zotero on sca2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:55:44] RECOVERY - zotero on sca2004 is OK: HTTP OK: HTTP/1.0 200 OK - 62 bytes in 0.079 second response time [01:55:54] RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy [02:32:44] PROBLEM - puppet last run on labservices1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [02:34:54] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [03:02:04] RECOVERY - puppet last run on labservices1002 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [03:02:54] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [03:21:44] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 661.00 seconds [03:24:45] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 165.14 seconds [03:24:54] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:25:48] is there an issue with Commons giving files? I am getting sporadic delivery of images (at enWS) [03:32:27] sDrewth: I would recommend filing a task since I don't see current tasks about it, and then offer Reedy a stroopwafel [03:33:34] okay p858snake|_ I am now getting an Error message telling me to come back in a few minutes [03:34:13] so it may be where I am pulling data from [03:34:42] can you copy the full message, there is generally some stuff down the bottom which should include some helpful tech details [03:35:49] what is the name of the cut and paste thingy we host at Labs? [03:36:27] https://phabricator.wikimedia.org/paste/ :) [03:36:47] no idea where the toolslabs one is [03:38:03] https://phabricator.wikimedia.org/P4687 [03:39:28] duh https://tools.wmflabs.org/paste ! [03:39:43] Reedy: still around? [03:40:23] p858snake|_: it comes and goes, so it looks to just be a part of the cluster [03:40:53] presumably I would pull from the Pacific servers [03:41:34] I wildly and un-educatedly suggest that cp4007 might be having issues [03:41:50] sDrewth: is it always cp4007 when you hit the error screens? [03:42:26] for commons images it says nothing, just fails to deliver. I will watch what happens for enWS [03:42:49] p858snake|_: I am...
But it is 03:42 [03:43:51] Reedy: Anyone else online that could possibly look at ^ [03:44:06] Reedy: I would offer a strong coffee, but I'm not in the same country [03:44:44] I have some nice East Timorese in the grinder, but even further away [03:45:36] got it again, and again cp4007 [03:46:00] cp4007 doesn't look out of the ordinary in ganglia [03:46:38] this has been off and on for 3/4 hour [03:47:11] now getting Error 405, Method not allowed at (timestamp) [03:52:54] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [03:52:56] now getting text okay, but serving images isn't happening [04:00:30] now back again [04:03:23] feel welcome to leave it Reedy and go and sleep, whatever it was seems to have passed [04:12:44] PROBLEM - puppet last run on elastic1026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:25:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [04:29:56] (03PS1) 10Tim Landscheidt: Tools: Use exported resources for ssh host keys [puppet] - 10https://gerrit.wikimedia.org/r/329382 (https://phabricator.wikimedia.org/T153163) [04:30:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.011 second response time [04:30:50] (03CR) 10Tim Landscheidt: [C: 04-1] "DO NOT SUBMIT. The Tools puppetmaster do not support exported resources yet (but the code works, and that's very encouraging)." [puppet] - 10https://gerrit.wikimedia.org/r/329382 (https://phabricator.wikimedia.org/T153163) (owner: 10Tim Landscheidt) [04:35:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.009 second response time [04:40:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.009 second response time [04:40:44] RECOVERY - puppet last run on elastic1026 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [04:43:14] PROBLEM - puppet last run on cp3034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:45:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [04:50:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. 
- string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.009 second response time [04:51:14] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [04:52:14] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 2927224 keys, up 57 days 20 hours - replication_delay is 0 [04:55:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [05:00:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [05:01:04] PROBLEM - puppet last run on mw1168 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/furl] [05:02:24] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 14 failures. Last run 2 minutes ago with 14 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [05:05:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.009 second response time [05:10:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.009 second response time [05:11:14] RECOVERY - puppet last run on cp3034 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [05:15:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [05:20:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [05:25:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [05:29:04] RECOVERY - puppet last run on mw1168 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [05:29:24] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [05:29:54] PROBLEM - puppet last run on cobalt is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:30:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. 
- string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [05:31:34] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=5095.70 Read Requests/Sec=2519.20 Write Requests/Sec=114.60 KBytes Read/Sec=24241.60 KBytes_Written/Sec=1954.40 [05:35:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.008 second response time [05:38:34] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=3.20 Read Requests/Sec=0.10 Write Requests/Sec=46.30 KBytes Read/Sec=0.40 KBytes_Written/Sec=330.80 [05:40:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.009 second response time [05:43:04] PROBLEM - puppet last run on labtestweb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:45:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [05:50:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [05:55:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [05:57:54] RECOVERY - puppet last run on cobalt is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [06:00:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [06:05:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [06:10:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [06:13:04] RECOVERY - puppet last run on labtestweb2001 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [06:15:04] PROBLEM - puppet last run on labvirt1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:15:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.011 second response time [06:20:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. 
- string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [06:25:14] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [06:30:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.011 second response time [06:35:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [06:38:14] PROBLEM - puppet last run on elastic1022 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[vim],Package[zsh-beta] [06:40:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.011 second response time [06:43:04] RECOVERY - puppet last run on labvirt1013 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [06:45:04] RECOVERY - Disk space on labtestnet2001 is OK: DISK OK [06:45:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [06:50:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.011 second response time [06:55:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [07:00:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [07:05:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.011 second response time [07:07:14] RECOVERY - puppet last run on elastic1022 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [07:10:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.012 second response time [07:15:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [07:20:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. 
- string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.011 second response time [07:25:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [07:30:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [07:35:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.009 second response time [07:36:31] !log added reedy to "integration" gerrit group per T154207 [07:36:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:35] T154207: Grant reedy C+2 on integration repos - https://phabricator.wikimedia.org/T154207 [07:40:14] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [07:45:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [07:50:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [07:55:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [08:00:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.009 second response time [08:05:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [08:10:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [08:15:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [08:20:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [08:25:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. 
- string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [08:30:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [08:31:00] <_joe_> grrrr [08:31:16] <_joe_> legoktm: do you know if anyone from FR is around? [08:35:05] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.011 second response time [08:40:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [08:45:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [08:45:05] (03PS1) 10Tim Landscheidt: puppetdb: Use tuning.conf only in production [puppet] - 10https://gerrit.wikimedia.org/r/329390 [08:47:32] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I would vastly prefer to split this role in smaller parts instead of adding if guards everywhere. I'm happy to help doing it, too. I've se" [puppet] - 10https://gerrit.wikimedia.org/r/329390 (owner: 10Tim Landscheidt) [08:50:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.009 second response time [08:52:11] Can I ask for someone to look at updating the interwiki.db stuff, it hasn't been done for a month [08:52:46] (03CR) 10Giuseppe Lavagetto: [C: 032] puppetdb: Do not hardcode puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/329330 (https://phabricator.wikimedia.org/T153577) (owner: 10Tim Landscheidt) [08:52:53] (03PS2) 10Giuseppe Lavagetto: puppetdb: Do not hardcode puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/329330 (https://phabricator.wikimedia.org/T153577) (owner: 10Tim Landscheidt) [08:53:07] sDrewth: we are in a deployment freeze due to the holidays, it won't be till early jan [08:53:17] can you file a task in phabricator for it please? [08:53:22] oh, that is deployment freeze too, okay [08:53:29] (03PS1) 10Tim Landscheidt: wmflib: Fix typo in cron_splay() [puppet] - 10https://gerrit.wikimedia.org/r/329392 [08:53:33] sure [08:54:14] they have trimmed the displayed log at SAL to a week :-( makes things harder to find [08:54:35] _joe_: you will probably need to text jeff, he was online earlier today fixing something else as well [08:55:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue.
- string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.011 second response time [08:55:05] (03CR) 10Tim Landscheidt: "(Theoretically, the puppetmaster needs to be restarted after applying this; but if this didn't trigger any errors so far, this is probably" [puppet] - 10https://gerrit.wikimedia.org/r/329392 (owner: 10Tim Landscheidt) [08:55:08] <_joe_> p858snake|_: yeah it's not something for jeff to fix [08:55:27] <_joe_> or, well. [08:55:40] <_joe_> I can't really comment more, sorry [08:57:27] meh, it is more effective to search SAL?action=history to find things than use the page itself! [08:57:37] <_joe_> sDrewth: what's up? [08:58:08] (03CR) 10Giuseppe Lavagetto: "I don't think that's the case, the puppetmaster should keep working just fine; the reason this wasn't fixed earlier is that apparently we'" [puppet] - 10https://gerrit.wikimedia.org/r/329392 (owner: 10Tim Landscheidt) [08:58:09] I was trying to find the last update of the interwiki map, but [[mw:SAL]] updates to frequently and only covers a week [08:58:26] ack wikitech:SAL [08:58:34] too [08:59:24] nuisance value, /me shrugs and goes to log the phabricator ticket [08:59:47] 06Operations, 10Ops-Access-Requests, 05Security: Give Katie Horn (K4-713) access to restricted Security tasks - https://phabricator.wikimedia.org/T154211#2904538 (10Peachey88) [09:00:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [09:05:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [09:06:59] (03CR) 10Tim Landscheidt: "@Joe: If there's a better layout, wonderful. But I have a number of patches yet to upload that enable PuppetDB for standalone puppetmaste" [puppet] - 10https://gerrit.wikimedia.org/r/329390 (owner: 10Tim Landscheidt) [09:07:21] <_joe_> sDrewth: have you tried http://tools.wmflabs.org/sal ? [09:09:14] <_joe_> there you will find a full history [09:09:46] thanks. I will update the meta link [09:10:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.009 second response time [09:10:53] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "> @Joe: If there's a better layout, wonderful. But I have a number" [puppet] - 10https://gerrit.wikimedia.org/r/329390 (owner: 10Tim Landscheidt) [09:11:36] (03CR) 10Giuseppe Lavagetto: [C: 032] wmflib: Fix typo in cron_splay() [puppet] - 10https://gerrit.wikimedia.org/r/329392 (owner: 10Tim Landscheidt) [09:11:42] (03PS2) 10Giuseppe Lavagetto: wmflib: Fix typo in cron_splay() [puppet] - 10https://gerrit.wikimedia.org/r/329392 (owner: 10Tim Landscheidt) [09:11:50] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] wmflib: Fix typo in cron_splay() [puppet] - 10https://gerrit.wikimedia.org/r/329392 (owner: 10Tim Landscheidt) [09:13:04] PROBLEM - puppet last run on labsdb1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:15:14] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. 
- string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [09:15:48] (03CR) 10Giuseppe Lavagetto: "@Tim: I think there is really no reason for using expand_path in labs, apart from the fact we use labs/private for both testing production" [puppet] - 10https://gerrit.wikimedia.org/r/329226 (owner: 10Tim Landscheidt) [09:20:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.011 second response time [09:25:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [09:30:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.009 second response time [09:35:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [09:40:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.009 second response time [09:42:04] RECOVERY - puppet last run on labsdb1004 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [09:45:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [09:49:29] 06Operations, 06Commons, 10TimedMediaHandler-Transcode, 10Wikimedia-Video, and 2 others: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2904613 (10elukey) @zhuyifei1999, the main issue seems to be a ton of jobs submitted to the queue, that we don'... [09:50:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [09:53:46] 06Operations, 06Commons, 10TimedMediaHandler-Transcode, 10Wikimedia-Video, and 2 others: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2904621 (10elukey) Or maybe this is the issue: https://gerrit.wikimedia.org/r/#/c/238819/3/TimedMediaHandler.php [09:55:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.009 second response time [09:57:46] 06Operations, 06Commons, 10TimedMediaHandler-Transcode, 10Wikimedia-Video, and 2 others: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2904624 (10Joe) Let me point out that the queue is hardly growing anymore, we're in the middle of the holiday f... 
[10:00:04] RECOVERY - check_listener_gc on thulium is OK: HTTP OK: Status line output matched HTTP/1.1 503 - 214 bytes in 0.009 second response time [10:11:14] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 652 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 2940502 keys, up 58 days 1 hours - replication_delay is 652 [10:18:14] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 2932798 keys, up 58 days 1 hours - replication_delay is 22 [10:46:54] PROBLEM - puppet last run on analytics1035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:05:14] PROBLEM - puppet last run on kraz is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:10:41] (03PS1) 10Revi: Revert $wgMFEEditorOptions['anonymousEditing'] = true for kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/329446 (https://phabricator.wikimedia.org/T119823) [11:15:14] RECOVERY - puppet last run on analytics1035 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [11:15:18] 06Operations, 06Commons, 10TimedMediaHandler-Transcode, 10Wikimedia-Video, and 2 others: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2904695 (10Joe) I just looked at the trends of job waiting times and they don't really make sense unless we're... [11:15:47] (03CR) 10Amire80: [C: 031] "Thank you, revi!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/329446 (https://phabricator.wikimedia.org/T119823) (owner: 10Revi) [11:28:30] (03PS1) 10Giuseppe Lavagetto: videoscaler: bump up the number of running transcodes (again) [puppet] - 10https://gerrit.wikimedia.org/r/329447 (https://phabricator.wikimedia.org/T153488) [11:33:00] 06Operations, 06Commons, 10TimedMediaHandler-Transcode, 10Wikimedia-Video, and 3 others: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2904765 (10zhuyifei1999) Something really curious about the queue is that [[https://commons.wikimedia.org/wiki/... [11:33:14] RECOVERY - puppet last run on kraz is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [11:35:47] (03CR) 10Giuseppe Lavagetto: [C: 032] videoscaler: bump up the number of running transcodes (again) [puppet] - 10https://gerrit.wikimedia.org/r/329447 (https://phabricator.wikimedia.org/T153488) (owner: 10Giuseppe Lavagetto) [11:40:04] _joe_: fyi, I asked Jason to not flood v2c https://commons.wikimedia.org/wiki/User_talk:Jasonanaggie#video2commons_flooding [11:40:12] <_joe_> thanks :) [11:40:33] <_joe_> zhuyifei1999_: yesterday matanya asked me if they could "get back to running more uploads" [11:40:43] <_joe_> and well, the answer for now is no :P [11:40:53] if needed I can reduce the workers even more [11:42:11] <_joe_> zhuyifei1999_: well, if we weren't in the middle of a freeze and of holidays (so it's basically me being around from ops during EU mornings, more or less) [11:42:21] <_joe_> I'd repurpose a couple more machines [11:44:02] <_joe_> but for a series of reasons it's not as easy as I'd like at the moment [11:44:21] <_joe_> mostly because we didn't have the time to fix ffmpeg issues on debian jessie for now [11:44:40] <_joe_> so converting means not just reconfiguring, but reinstalling the servers altogether [11:46:21] ffmpeg has issues on jessie?
[11:46:35] v2c is 100% jessie [11:48:35] hmm https://phabricator.wikimedia.org/T145742 [11:49:54] https://phabricator.wikimedia.org/T103335 <= "ffmpeg is not in Jessie" I don't think that's true now [11:50:50] <_joe_> yeah there was another issue IIRC [11:50:59] <_joe_> but honestly don't remember well [11:51:31] <_joe_> when everyone's back from vacation and then the developer summit, I'll take a look with moritz.m [11:51:41] ok [11:52:19] <_joe_> uhm I might have been a bit optimistic with the new load [11:52:30] <_joe_> let's see what happens in 2-3 hours about that [11:52:36] <_joe_> I'm monitoring the situation [11:55:41] (03PS1) 10Ladsgroup: Add badge for "digitaldocument" in Wikibase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/329453 (https://phabricator.wikimedia.org/T153186) [12:00:19] hmm yeah v2c is running ffmpeg from the backports [12:02:45] 147% load avg o.O [12:04:05] why is there so much kernel cpu time? [12:04:35] looks unnatural [12:06:44] <_joe_> not really [12:06:54] <_joe_> a lot of interrupts to manage [12:08:08] <_joe_> zhuyifei1999_: I am stepping out for a bit, I'll take a look at the status later [12:08:18] k [12:08:25] <_joe_> but I might downsize that a bit later [12:08:41] (03CR) 10Thiemo Mättig (WMDE): [C: 031] "Please wait for a final decision at T153186 before merging this." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/329453 (https://phabricator.wikimedia.org/T153186) (owner: 10Ladsgroup) [12:09:48] (well, unless VMs don't need to handle so many interrupts, https://tools.wmflabs.org/nagf/?project=video doesn't show much kernel CPU time, even under 200+% load) [12:21:47] zhuyifei1999_: where is the load in your metrics? [12:22:25] you mean v2c right now? barely any load [12:22:38] no I meant the 200% load.. [12:23:07] I think that the kernel time on an overloaded host is expected [12:23:27] that was thirty something yesterday [12:24:02] yeah so this was the risk that I mentioned in the task, namely ops trying to fine tune and getting into the overloaded area [12:24:22] there is a bigger problem in the queue's content [12:24:42] https://grafana-labs.wikimedia.org/dashboard/db/labs-project-board?var-project=video&var-server=All half-broken [12:24:56] but has the load of two hosts [12:25:53] thanks! [12:26:38] zhuyifei1999_: so this is one host - https://grafana.wikimedia.org/dashboard/file/server-board.json?var-server=mw1168&var-network=eth0 [12:27:26] (one of the more powerful) [12:27:59] this one https://grafana.wikimedia.org/dashboard/file/server-board.json?var-server=mw1259 is one of the "oldest" [12:28:19] 5-7% of system time is not a problem imho [12:28:53] (look also at the baseline before the load increase) [12:29:59] about doubled [12:30:55] so we'll reduce the load later on but it is a matter of fine tuning, _joe_ knows what he is doing :) [12:31:04] k [12:32:31] but again thanks a lot for taking care of the video transcode queue, I absolutely don't want to drive you away from your work (that is really appreciated), only stating that we are doing as much as we can :) [12:33:40] zhuyifei1999_: --^ [12:33:54] ok [12:39:14] PROBLEM - puppet last run on puppetmaster2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:48:04] PROBLEM - puppet last run on ms-be1015 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [13:08:14] RECOVERY - puppet last run on puppetmaster2002 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [13:16:04] RECOVERY - puppet last run on ms-be1015 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [13:19:08] !log mobrovac@tin Starting deploy [mathoid/deploy@79fdd56]: Add depooling and repooling to Mathoid T144602 [13:19:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:12] T144602: Depool and repool SCB services during deploys - https://phabricator.wikimedia.org/T144602 [13:21:39] !log mobrovac@tin Finished deploy [mathoid/deploy@79fdd56]: Add depooling and repooling to Mathoid T144602 (duration: 02m 31s) [13:21:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:36] <_joe_> ok so, it looks like the load on the scalers is ok [13:26:45] <_joe_> meaning, it's very high, but not "overloading" [13:26:58] <_joe_> we're at ~ 100% utiization of resources [13:27:04] <_joe_> I'd leave it like that for now [13:27:29] 100%? wow [13:31:02] !log mobrovac@tin Starting deploy [citoid/deploy@da96f4b]: Add depooling and repooling to Citoid T144602 [13:31:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:05] T144602: Depool and repool SCB services during deploys - https://phabricator.wikimedia.org/T144602 [13:33:01] !log mobrovac@tin Finished deploy [citoid/deploy@da96f4b]: Add depooling and repooling to Citoid T144602 (duration: 01m 59s) [13:33:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:34] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:40:39] !log mobrovac@tin Starting deploy [cxserver/deploy@0279029]: (no message) [13:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:03] !log mobrovac@tin Finished deploy [cxserver/deploy@0279029]: (no message) (duration: 00m 25s) [13:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:18] !log mobrovac@tin Starting deploy [cxserver/deploy@0279029]: Add depooling and repooling to CXServer T144602 [13:41:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:21] T144602: Depool and repool SCB services during deploys - https://phabricator.wikimedia.org/T144602 [13:44:41] !log mobrovac@tin Finished deploy [cxserver/deploy@0279029]: Add depooling and repooling to CXServer T144602 (duration: 03m 22s) [13:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:36] !log mobrovac@tin Starting deploy [graphoid/deploy@151f26c]: Add depooling and repooling to Graphoid T144602 [13:49:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:40] T144602: Depool and repool SCB services during deploys - https://phabricator.wikimedia.org/T144602 [13:51:14] !log mobrovac@tin Finished deploy [graphoid/deploy@151f26c]: Add depooling and repooling to Graphoid T144602 (duration: 01m 37s) [13:51:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:05] !log mobrovac@tin Starting deploy [electron-render/deploy@b2a820e]: Add depooling and repooling to PDFRender T144602 [13:59:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:08] T144602: Depool and repool SCB services during deploys - 
https://phabricator.wikimedia.org/T144602 [14:00:31] !log mobrovac@tin Finished deploy [electron-render/deploy@b2a820e]: Add depooling and repooling to PDFRender T144602 (duration: 01m 26s) [14:00:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:56] 06Operations, 10Gerrit, 06Release-Engineering-Team, 07Upstream: Gerrit: Restarting gerrit could lead to data loss - https://phabricator.wikimedia.org/T154205#2904959 (10Paladox) That thread says online reindex works until the next restart so that could be a workaround. But I doint know what other sides aff... [14:05:08] !log mobrovac@tin Starting deploy [mobileapps/deploy@ae22656]: Add depooling and repooling to MCS T144602 [14:05:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:11] T144602: Depool and repool SCB services during deploys - https://phabricator.wikimedia.org/T144602 [14:05:42] 06Operations, 10Gerrit, 06Release-Engineering-Team, 07Upstream: Gerrit: Restarting gerrit could lead to data loss - https://phabricator.wikimedia.org/T154205#2904963 (10Paladox) But the online reindex dosent seem to work for some users [14:07:34] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [14:07:41] !log mobrovac@tin Finished deploy [mobileapps/deploy@ae22656]: Add depooling and repooling to MCS T144602 (duration: 02m 34s) [14:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:09] !log mobrovac@tin Starting deploy [trending-edits/deploy@c5d239b]: Add depooling and repooling to Trending Edits T144602 [14:11:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:14] T144602: Depool and repool SCB services during deploys - https://phabricator.wikimedia.org/T144602 [14:12:30] !log mobrovac@tin Finished deploy [trending-edits/deploy@c5d239b]: Add depooling and repooling to Trending Edits T144602 (duration: 01m 21s) [14:12:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:48] 06Operations, 10DBA, 10Gerrit, 13Patch-For-Review, 07Upstream: Gerrit shows HTTP 500 error when pasting an emoji - https://phabricator.wikimedia.org/T145885#2904971 (10Paladox) http://gerrit-test.wmflabs.org/ has had its database converted to this for nearly a week now and still works currently. [14:22:52] (03PS1) 10Giuseppe Lavagetto: mediawiki::jobrunner: rotate log files weekly [puppet] - 10https://gerrit.wikimedia.org/r/329487 (https://phabricator.wikimedia.org/T153488) [14:25:38] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki::jobrunner: rotate log files weekly [puppet] - 10https://gerrit.wikimedia.org/r/329487 (https://phabricator.wikimedia.org/T153488) (owner: 10Giuseppe Lavagetto) [14:31:25] 06Operations, 10Gerrit, 06Release-Engineering-Team, 07Upstream: Gerrit: Restarting gerrit could lead to data loss - https://phabricator.wikimedia.org/T154205#2904990 (10Paladox) According to the thread the index for accounts is affected too. Though I have no idea if it affects ldap but it does store accoun... [14:31:39] 06Operations, 10Gerrit, 06Release-Engineering-Team, 07Upstream: Gerrit: Restarting gerrit could lead to data loss + accounts - https://phabricator.wikimedia.org/T154205#2904991 (10Paladox) [15:00:14] PROBLEM - puppet last run on elastic1038 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:18:51] (03CR) 10Florianschmidtwelzow: [C: 031] Revert $wgMFEEditorOptions['anonymousEditing'] = true for kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/329446 (https://phabricator.wikimedia.org/T119823) (owner: 10Revi) [15:25:59] heh [15:28:14] RECOVERY - puppet last run on elastic1038 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [15:29:25] (03PS1) 10Alexandros Kosiaris: kubernetes::worker: Allow access to kubelet from master [puppet] - 10https://gerrit.wikimedia.org/r/329494 [15:55:04] PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=46%) [15:57:14] PROBLEM - puppet last run on db1045 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:26:14] RECOVERY - puppet last run on db1045 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [16:30:54] PROBLEM - NTP on prometheus2003 is CRITICAL: NTP CRITICAL: Offset unknown [16:31:24] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [16:39:24] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [16:52:04] PROBLEM - puppet last run on snapshot1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:05:14] PROBLEM - puppet last run on analytics1027 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:21:04] RECOVERY - puppet last run on snapshot1005 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [17:24:14] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:34:14] RECOVERY - puppet last run on analytics1027 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [17:45:23] 06Operations, 10Gerrit, 06Release-Engineering-Team, 07Upstream: Gerrit: Restarting gerrit could lead to data loss + accounts - https://phabricator.wikimedia.org/T154205#2905165 (10Paladox) [17:46:31] 06Operations, 10Gerrit, 06Release-Engineering-Team, 07Upstream: Gerrit: Restarting gerrit could lead to data loss + accounts - https://phabricator.wikimedia.org/T154205#2904053 (10Paladox) [17:49:00] 06Operations, 10Gerrit, 06Release-Engineering-Team, 07Upstream: Gerrit: Restarting gerrit could lead to data loss + accounts - https://phabricator.wikimedia.org/T154205#2905168 (10Paladox) [17:51:45] (03PS1) 10Urbanecm: [throttle] New rules for Queen Mary University [mediawiki-config] - 10https://gerrit.wikimedia.org/r/329500 (https://phabricator.wikimedia.org/T154245) [17:52:14] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [17:52:24] (03CR) 10jerkins-bot: [V: 04-1] [throttle] New rules for Queen Mary University [mediawiki-config] - 10https://gerrit.wikimedia.org/r/329500 (https://phabricator.wikimedia.org/T154245) (owner: 10Urbanecm) [17:54:17] (03PS2) 10Urbanecm: [throttle] New rules for Queen Mary University [mediawiki-config] - 10https://gerrit.wikimedia.org/r/329500 (https://phabricator.wikimedia.org/T154245) [17:54:27] (03PS5) 10Paladox: Gerrit: Convert from utf8 to utf8mb4 [puppet] - 10https://gerrit.wikimedia.org/r/328571 (https://phabricator.wikimedia.org/T153899) [17:54:51] (03CR) 10jerkins-bot: [V: 04-1] [throttle] New rules for Queen Mary University [mediawiki-config] - 10https://gerrit.wikimedia.org/r/329500 (https://phabricator.wikimedia.org/T154245) (owner: 10Urbanecm) [17:56:27] (03PS3) 10Urbanecm: [throttle] New rules for Queen Mary University [mediawiki-config] - 10https://gerrit.wikimedia.org/r/329500 (https://phabricator.wikimedia.org/T154245) [17:57:14] PROBLEM - puppet last run on analytics1047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:01:44] (03PS4) 10Urbanecm: [throttle] New rules + remove obsolete rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/329500 (https://phabricator.wikimedia.org/T154245) [18:09:25] 06Operations, 10Gerrit, 06Release-Engineering-Team, 07Upstream: Gerrit: Restarting gerrit could lead to data loss + accounts - https://phabricator.wikimedia.org/T154205#2905202 (10Paladox) [18:11:55] 06Operations, 10Gerrit, 06Release-Engineering-Team, 05Security, 07Upstream: Gerrit: Restarting gerrit could lead to data loss + accounts - https://phabricator.wikimedia.org/T154205#2905203 (10Paladox) Adding #security as it affects accounts, though no one can steal accounts as far as i can tell, account... [18:25:14] RECOVERY - puppet last run on analytics1047 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [18:40:30] 06Operations, 10Gerrit, 06Release-Engineering-Team, 05Security, 07Upstream: Gerrit: Restarting gerrit could lead to data loss + accounts - https://phabricator.wikimedia.org/T154205#2905243 (10Paladox) This may be the reason why we had this T152640 problem as accounts are stored in the index now. All the... 
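The "Gerrit: Convert from utf8 to utf8mb4" patch referenced above (T153899, related to the emoji HTTP 500 in T145885) is about letting Gerrit's MySQL database store 4-byte UTF-8 characters. A minimal sketch of the kind of statements such a conversion involves follows; the database name and table list are placeholders rather than Gerrit's real schema, and the production change is applied through the Puppet patch under review, not by running this script:

    # Sketch only: emit the MySQL statements a utf8 -> utf8mb4 conversion needs.
    # "reviewdb" and the table names below are illustrative placeholders.
    TABLES = ["changes", "change_messages", "patch_comments"]

    def conversion_sql(database, tables):
        yield ("ALTER DATABASE `{}` CHARACTER SET utf8mb4 "
               "COLLATE utf8mb4_unicode_ci;".format(database))
        for table in tables:
            # CONVERT TO rewrites existing column charsets as well, which is what
            # lets 4-byte characters such as emoji be stored at all.
            yield ("ALTER TABLE `{}`.`{}` CONVERT TO CHARACTER SET utf8mb4 "
                   "COLLATE utf8mb4_unicode_ci;".format(database, table))

    if __name__ == "__main__":
        for statement in conversion_sql("reviewdb", TABLES):
            print(statement)

One general caveat, not specific to this deployment: on older MySQL/MariaDB versions, utf8mb4 can push indexed VARCHAR(255) columns past the 767-byte InnoDB key limit unless large index prefixes or a DYNAMIC row format are enabled, which is one reason a trial conversion on a copy (as reported for gerrit-test.wmflabs.org earlier in this log) is prudent.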
[19:03:56] (03PS3) 10Urbanecm: Enable subpages in NS0 for arbcom_cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327700 (https://phabricator.wikimedia.org/T154247) [19:04:10] (03CR) 10Urbanecm: "PS3: Changed T number to T154247." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327700 (https://phabricator.wikimedia.org/T154247) (owner: 10Urbanecm) [19:44:58] 06Operations, 06Commons, 10TimedMediaHandler-Transcode, 10Wikimedia-Video, and 3 others: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2905338 (10Joe) So, after I fine-tuned the video scalers load, we're now processing 73 transcodes at the same t... [19:48:10] <_joe_> Revent: I'm fairly sure someone re-transcoded all white house press briefings, and that is creating a huge queue [19:48:46] I don’t think that’s it... [19:49:13] The guy had dumped like 200 more into v2c the other day. [19:49:55] <_joe_> Revent: we're churning videos at 7x the rate we were doing it 1 week ago [19:50:06] (nods) [19:50:08] <_joe_> I can't realistically add much more capacity at this point [19:50:55] We’ve discussed it with the particular person who was uploading the briefings, hopefully he’ll actually stop for a while, and then go slower. [19:51:17] <_joe_> unless I convince the WMF to buy 10 videoscalers, which is hardly justified at this point from the historic load on the scalers [19:51:39] <_joe_> the right solution will be to move videoscaling to our "elastic" environment once it's ready [19:51:40] https://commons.wikimedia.org/w/index.php?search=intitle%3A%22White+House+Press+Briefing%22&title=Special:Search&go=Go&searchToken=9f39ggx3nmi0iefpwd84quf6n <- but…. nearly 600 videos, and probably 12x transcodes each.... [19:51:40] _joe_: i can reduce the load coming from v2c if that helps [19:51:44] <_joe_> but it will take some time [19:51:50] <_joe_> matanya: stop it maybe? [19:51:57] totally ? [19:52:24] matanya: Zhu had half of v2c off for a while, but the guy just backlogged it there. [19:52:55] <_joe_> matanya: well we're 12 days behind on transcodes atm [19:53:05] I am the project admin, i can scale down or completely shut it down [19:53:21] <_joe_> sorry I'm on a train, internet is good for travelling at 300 kmh but still not great [19:53:25] but i'd really like the project to stay alive if that is possible [19:53:41] <_joe_> matanya: yeah I'd love that too [19:54:13] <_joe_> but if someone pours 7500 transcodes of large files in the queue, there is not much I can do to speed it up [19:54:18] <_joe_> besides what I just did [19:54:33] _joe_: If it says anything, the ones that got ‘broken’ on the 19th (the ones that were trying to be run when the servers were reset) seem to be running now. [19:54:56] <_joe_> Revent: well it says we're 11 days behind [19:54:58] <_joe_> sigh [19:55:05] <_joe_> sorry, 9 [19:55:18] I got ‘most’ of them re-reset before they started going through. [19:55:26] _joe_: say the word and i'll shut it down [19:55:54] <_joe_> matanya: let's see tomorrow morning if you're around? [19:56:04] I will be [19:56:09] <_joe_> I would like to see if we can start catching up [19:56:11] EU times ? [19:56:17] <_joe_> now that I fine-tuned the scalers [19:56:27] <_joe_> yup, I'm in CET [19:56:43] ok, i'll be around [19:57:25] <_joe_> thanks :) [19:58:56] zhuyifei1999_: To clarify... [19:59:46] zhuyifei1999_: I ‘was’ deliberately running trancodes that where in the ‘very old broken transcodes’ list, before the current drama. 
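For a sense of scale, here is a back-of-envelope estimate of the drain time implied by the numbers quoted in the exchange above (roughly 7500 queued transcodes, 73 running at once, a backlog reported as 9-12 days). The average job duration is an assumed figure for illustration only, not a measured one:

    # Rough drain-time estimate for the backlog discussed above.
    backlog_jobs = 7500        # "7500 transcodes of large files in the queue"
    concurrent_slots = 73      # "processing 73 transcodes at the same time"
    avg_minutes_per_job = 40   # assumption; long videos can take far longer

    jobs_per_hour = concurrent_slots * 60 / avg_minutes_per_job
    drain_hours = backlog_jobs / jobs_per_hour
    print("throughput ~ {:.0f} jobs/hour".format(jobs_per_hour))
    print("drain time ~ {:.1f} hours ({:.1f} days), assuming no new uploads".format(
        drain_hours, drain_hours / 24))

With those assumptions the queue clears in a few days; a longer real average per job, or continued bulk uploads, stretches it toward the "12 days behind" figure mentioned in the channel.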
More than once, I had accidentally overloaded the scalers for short periods (a few hours) from the timeout bug… [20:01:11] Those would have gone into the ~5000-odd ones that were ‘both error and queued’ when the servers were reset, probably. [20:02:22] 06Operations, 06Commons, 10media-storage, 05MW-1.27-release-notes: Some files had disappeared from Commons after renaming - https://phabricator.wikimedia.org/T111838#2905352 (10Wieralee) It still happens... Not every day, but maybe once for a week... The files I was listing were fortunately uploaded again... [20:02:34] 06Operations, 10ops-eqiad: rack/setup/install/track new ms-fe1005-1008 - https://phabricator.wikimedia.org/T154250#2905353 (10RobH) [20:03:50] The ones that are ‘old uploads’ should be small in number, in comparison to the new uploads of ‘old stuff’ [20:06:34] <_joe_> Revent: the main problem, as I see it, is that video2commons is able to flood the production infrastructure [20:06:52] <_joe_> given for every video we upload there we have a 12x transcoding factor [20:07:07] (nods) [20:08:01] <_joe_> so I see two ways of solving it: one is to send video2commons uploads to a low-prio queue [20:08:46] <_joe_> the other is to rate-limit video2commons uploads [20:08:51] <_joe_> I'd prefer the former [20:10:05] The scalers being in some way able to prioritize ‘easy’ tasks would be very good. [20:13:17] I suspect even something as ‘crude’ as ‘run the transcodes of the smallest file next’ would cause the actual backlog count to drop a lot. [20:15:39] <_joe_> that's very very hard to do in our current jobqueue setting [20:15:54] <_joe_> it's a FIFO [20:16:00] <_joe_> more or less [20:16:09] <_joe_> you can have low-prio jobs though [20:16:49] <_joe_> luckily, I'll be in a room with other devs soonish and I can bug the relevant people in person :P [20:17:49] _joe_: When you say FIFO ‘more or less’… it's looking at transcode_time_addjob right? [20:18:18] <_joe_> no, it's looking at how things got inserted in a Redis queue :P [20:18:55] <_joe_> so whenever you add a video to commons, its transcodes get added to the tail of the transcodes queue [20:19:09] Ok, so that explains (kinda) how things get weird statuses in the DB. [20:19:25] <_joe_> and the scalers do the job of grabbing tasks from the head of such queue [20:19:55] It does, at least, explicitly not allow the same thing to be queued twice, right? [20:20:20] <_joe_> well, it depends on the type of job [20:20:30] <_joe_> but I think for transcodes that's the case [20:21:34] _joe_: Are you able to search the queue? [20:22:13] <_joe_> Revent: more or less [20:22:22] Kemi periodiska systemet.webm [20:22:28] <_joe_> Revent: as in now more or less [20:22:38] <_joe_> I'm on bad internet now [20:22:45] Ah... [20:23:24] <_joe_> tomorrow I'll be at home [20:23:46] <_joe_> I'm on a train atm, and well into my off hours :) [20:24:18] I'm just really hoping that resetting the ones ‘listed as queued, but with an error’ is not causing them to be run more than once. [20:24:41] <_joe_> Revent: let's wait for now? [20:24:52] <_joe_> it should not, but I might be more specific tomorrow [20:25:34] <_joe_> we're also thin on staff, I'm around but almost everyone is on a few days off [20:25:54] Thing is, if they ‘do’ get run, successfully, the error code doesn't get removed, and then they have a really messed up ‘status’. [20:26:32] https://commons.wikimedia.org/wiki/File:Expedition_42_Crew_Profile.webm <- see the “Error on 12:48, 2016 December 19” transcodes?
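A toy illustration of the queue behaviour described in the exchange above: jobs are pushed onto the tail of a Redis list, workers pop from the head, and a side set prevents the same transcode from being queued twice. This is a sketch of the pattern only, not MediaWiki's actual JobQueueRedis implementation; the key names and job format are invented for the example:

    # Toy FIFO job queue with de-duplication (not MediaWiki's JobQueueRedis).
    import json
    import redis  # pip install redis

    r = redis.StrictRedis(host="localhost", port=6379, db=0)

    QUEUE_KEY = "demo:transcode:queue"   # list used as a FIFO: tail push, head pop
    DEDUP_KEY = "demo:transcode:queued"  # set of job signatures currently queued

    def push_job(title, transcode_key):
        """Append a transcode job to the tail unless an identical one is already queued."""
        sig = "{}|{}".format(title, transcode_key)
        # SADD returns 1 only if the signature was not already present: cheap dedup.
        if r.sadd(DEDUP_KEY, sig):
            r.rpush(QUEUE_KEY, json.dumps({"title": title, "key": transcode_key}))
            return True
        return False

    def pop_job():
        """Take the oldest job from the head of the queue (what a scaler would do)."""
        raw = r.lpop(QUEUE_KEY)
        if raw is None:
            return None
        job = json.loads(raw)
        r.srem(DEDUP_KEY, "{}|{}".format(job["title"], job["key"]))
        return job

    if __name__ == "__main__":
        push_job("Expedition_42_Crew_Profile.webm", "720p.webm")
        push_job("Expedition_42_Crew_Profile.webm", "720p.webm")  # ignored: already queued
        print(pop_job())

In this model, the "low-prio jobs" mentioned above would map onto a second list that workers only poll when the main one is empty, while "smallest file first" would need something like a sorted set keyed on file size, which is why it does not fit a plain FIFO.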
[20:26:53] <_joe_> yeah we had a ton of failures on the 19th [20:26:54] They are in the list of ‘queued transcodes’ on the special page. [20:27:15] Yeah, about 5k, lol. [20:28:36] Those are the ones I was resetting, doing so doesn’t change the ‘queued’ count on the special page. [20:44:48] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch, 10procurement: rack/setup/install elastic2025-2036 - https://phabricator.wikimedia.org/T154251#2905389 (10RobH) [20:45:10] 06Operations, 10ops-codfw, 06Discovery, 06Discovery-Search, 10Elasticsearch: rack/setup/install elastic2025-2036 - https://phabricator.wikimedia.org/T154251#2905406 (10RobH) [21:00:53] 06Operations, 10Gerrit, 06Release-Engineering-Team, 05Security, 07Upstream: Gerrit: Restarting gerrit could lead to data loss + accounts - https://phabricator.wikimedia.org/T154205#2905414 (10Aklapper) How often has Wikimedia Gerrit been restarted in the last, say, 3 months? >>! In T154205#2905203, @Pal... [21:01:24] PROBLEM - puppet last run on mw1258 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:04:04] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:12:04] PROBLEM - puppet last run on dbproxy1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:29:24] RECOVERY - puppet last run on mw1258 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [21:31:24] PROBLEM - Check systemd state on labstore1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [21:31:34] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed [21:32:34] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [21:33:24] RECOVERY - Check systemd state on labstore1004 is OK: OK - running: The system is fully operational [21:33:34] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1004 is OK: OK - maintain-dbusers is active [21:39:02] (03PS2) 10Tim Landscheidt: puppetmaster: Clone repositories in Labs as root [puppet] - 10https://gerrit.wikimedia.org/r/324727 (https://phabricator.wikimedia.org/T152059) [21:40:34] RECOVERY - puppet last run on dbproxy1004 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [21:50:44] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 607 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 2952991 keys, up 58 days 13 hours - replication_delay is 607 [21:55:44] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 2935710 keys, up 58 days 13 hours - replication_delay is 53 [22:13:37] (03PS1) 10Urbanecm: Set celebration logo for hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/329516 (https://phabricator.wikimedia.org/T154254) [22:25:14] PROBLEM - puppet last run on thorium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:29:14] PROBLEM - puppet last run on analytics1038 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [22:53:14] RECOVERY - puppet last run on thorium is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [22:57:14] RECOVERY - puppet last run on analytics1038 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [23:20:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [23:25:14] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.011 second response time [23:30:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.011 second response time [23:31:34] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:35:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.011 second response time [23:40:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.011 second response time [23:45:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [23:50:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.009 second response time [23:55:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time [23:59:34] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
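The check_listener_gc noise that dominates this log is a string-match HTTP check: it fetches the payments listener URL and goes CRITICAL when the response body does not contain the marker string "OK" (here because the listener answers 403). The following is a minimal Nagios-style sketch of that kind of check, not the actual icinga/check_http configuration used for thulium, and the URL is a placeholder:

    #!/usr/bin/env python3
    # Sketch of a string-match HTTP health check; exit 0 = OK, 2 = CRITICAL.
    import sys
    import urllib.request

    def check(url, expected="OK", timeout=10):
        try:
            body = urllib.request.urlopen(url, timeout=timeout).read()
        except Exception as exc:  # HTTP errors such as 403, timeouts, TLS failures
            print("CRITICAL: request failed: {}".format(exc))
            return 2
        if expected.encode() in body:
            print("OK: string {!r} found in {} bytes".format(expected, len(body)))
            return 0
        print("CRITICAL: string {!r} not found in {} bytes".format(expected, len(body)))
        return 2

    if __name__ == "__main__":
        # Placeholder target; the real check points at the payments listener service.
        sys.exit(check("https://payments-listener.example.org/globalcollect"))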