[01:00:01] (PS1) Alex Monk: Add wmflabsdotorg credentials to horizon config [puppet] - https://gerrit.wikimedia.org/r/278538 (https://phabricator.wikimedia.org/T129245)
[01:12:25] (PS1) Alex Monk: openstack: clean up a couple of trivial things in makedomain [puppet] - https://gerrit.wikimedia.org/r/278539
[02:19:51] (CR) Tim Landscheidt: "How will the relay to the mail server and the other bits and bobs in the toollabs class then be set up for static servers?" [puppet] - https://gerrit.wikimedia.org/r/278431 (https://phabricator.wikimedia.org/T128411) (owner: Yuvipanda)
[02:23:34] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.17) (duration: 10m 22s)
[02:23:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:32:04] !log l10nupdate@tin ResourceLoader cache refresh completed at Sun Mar 20 02:32:04 UTC 2016 (duration 8m 30s)
[02:32:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:23:47] RECOVERY - MariaDB Slave Lag: s5 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 89832.00 seconds
[05:12:36] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[05:49:19] (CR) Yuvipanda: "It won't, but I don't think that has any practical effects. All other classes that don't execute user code and/or are gridengine related a" [puppet] - https://gerrit.wikimedia.org/r/278431 (https://phabricator.wikimedia.org/T128411) (owner: Yuvipanda)
[06:30:17] PROBLEM - puppet last run on mw2081 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:57] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:07] PROBLEM - puppet last run on elastic2007 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:38] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:57] PROBLEM - puppet last run on eventlog2001 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:57] PROBLEM - puppet last run on mw2146 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:33:54] (PS1) Ori.livneh: Segment Navigation Timing data by continent [puppet] - https://gerrit.wikimedia.org/r/278546 (https://phabricator.wikimedia.org/T128709)
[06:35:24] (PS2) Ori.livneh: Segment Navigation Timing data by continent [puppet] - https://gerrit.wikimedia.org/r/278546 (https://phabricator.wikimedia.org/T128709)
[06:51:57] (PS3) Ori.livneh: Segment Navigation Timing data by continent [puppet] - https://gerrit.wikimedia.org/r/278546 (https://phabricator.wikimedia.org/T128709)
[06:52:36] (CR) Ori.livneh: [C: 2 V: 2] Segment Navigation Timing data by continent [puppet] - https://gerrit.wikimedia.org/r/278546 (https://phabricator.wikimedia.org/T128709) (owner: Ori.livneh)
[06:56:08] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[06:56:37] RECOVERY - puppet last run on mw2081 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[06:57:26] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[06:57:28] RECOVERY - puppet last run on elastic2007 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[06:58:17] RECOVERY - puppet last run on eventlog2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:59:17] RECOVERY - puppet last run on mw2146 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:53:12] Operations, Performance-Team, Traffic, Patch-For-Review: Segment Navigation Timing data by continent - https://phabricator.wikimedia.org/T128709#2137301 (ori) Open>Resolved @mark, yep; done. Initial dashboard at .
[07:57:48] RECOVERY - MariaDB Slave Lag: s4 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 89800.00 seconds
[08:25:22] Operations, Performance-Team, Release-Engineering-Team, Availability, and 2 others: Dig through logs from 15 Mar 2016 read-only test and file bugs - https://phabricator.wikimedia.org/T129973#2137313 (Nemo_bis)
[08:26:00] Operations, Performance-Team, Release-Engineering-Team, Availability, and 3 others: Dig through logs from 15 Mar 2016 read-only test and file bugs - https://phabricator.wikimedia.org/T129973#2121627 (Nemo_bis)
[08:44:47] PROBLEM - cassandra CQL 10.192.32.125:9042 on restbase2004 is CRITICAL: Connection refused
[08:45:18] PROBLEM - cassandra service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[09:01:07] RECOVERY - cassandra service on restbase2004 is OK: OK - cassandra is active
[09:02:27] RECOVERY - cassandra CQL 10.192.32.125:9042 on restbase2004 is OK: TCP OK - 0.037 second response time on port 9042
[11:23:58] (PS1) Giuseppe Lavagetto: Adding more unit tests [software/conftool] - https://gerrit.wikimedia.org/r/278550
[11:24:00] (PS1) Giuseppe Lavagetto: Print out the tags any conftool result line is referring to [software/conftool] - https://gerrit.wikimedia.org/r/278551 (https://phabricator.wikimedia.org/T128199)
[11:24:02] (PS1) Giuseppe Lavagetto: Add select mode, refactor conftool.cli.tool [software/conftool] - https://gerrit.wikimedia.org/r/278552 (https://phabricator.wikimedia.org/T128199)
[11:24:58] (CR) jenkins-bot: [V: -1] Adding more unit tests [software/conftool] - https://gerrit.wikimedia.org/r/278550 (owner: Giuseppe Lavagetto)
[11:25:15] (CR) jenkins-bot: [V: -1] Print out the tags any conftool result line is referring to [software/conftool] - https://gerrit.wikimedia.org/r/278551 (https://phabricator.wikimedia.org/T128199) (owner: Giuseppe Lavagetto)
[11:25:24] (CR) jenkins-bot: [V: -1] Add select mode, refactor conftool.cli.tool [software/conftool] - https://gerrit.wikimedia.org/r/278552 (https://phabricator.wikimedia.org/T128199) (owner: Giuseppe Lavagetto)
[11:25:28] <_joe_> yeah you boring jenkins
[12:11:12] (CR) Glaisher: "I'm not sure as I'm not really available during the SWAT windows nowadays (but might be on some days). It'd be nice if you could help. :)" [mediawiki-config] - https://gerrit.wikimedia.org/r/252627 (owner: Glaisher)
[12:13:28] (PS1) Sabya: Add support for running preached as a systemd unit [puppet] - https://gerrit.wikimedia.org/r/278555
[12:14:39] (CR) jenkins-bot: [V: -1] Add support for running preached as a systemd unit [puppet] - https://gerrit.wikimedia.org/r/278555 (owner: Sabya)
[12:58:47] PROBLEM - puppet last run on db2058 is CRITICAL: CRITICAL: puppet fail
[13:25:26] RECOVERY - puppet last run on db2058 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[13:53:28] PROBLEM - cassandra service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[13:53:57] PROBLEM - cassandra CQL 10.192.32.125:9042 on restbase2004 is CRITICAL: Connection refused
[14:02:31] RECOVERY - cassandra service on restbase2004 is OK: OK - cassandra is active
[14:02:57] RECOVERY - cassandra CQL 10.192.32.125:9042 on restbase2004 is OK: TCP OK - 0.036 second response time on port 9042
[14:52:52] (PS2) Giuseppe Lavagetto: Print out the tags any conftool result line is referring to [software/conftool] - https://gerrit.wikimedia.org/r/278551 (https://phabricator.wikimedia.org/T128199)
[14:52:54] (PS2) Giuseppe Lavagetto: Adding more unit tests [software/conftool] - https://gerrit.wikimedia.org/r/278550
[14:52:56] (PS2) Giuseppe Lavagetto: Add select mode, refactor conftool.cli.tool [software/conftool] - https://gerrit.wikimedia.org/r/278552 (https://phabricator.wikimedia.org/T128199)
[14:53:57] (CR) jenkins-bot: [V: -1] Print out the tags any conftool result line is referring to [software/conftool] - https://gerrit.wikimedia.org/r/278551 (https://phabricator.wikimedia.org/T128199) (owner: Giuseppe Lavagetto)
[14:54:09] (CR) jenkins-bot: [V: -1] Adding more unit tests [software/conftool] - https://gerrit.wikimedia.org/r/278550 (owner: Giuseppe Lavagetto)
[14:54:19] (CR) jenkins-bot: [V: -1] Add select mode, refactor conftool.cli.tool [software/conftool] - https://gerrit.wikimedia.org/r/278552 (https://phabricator.wikimedia.org/T128199) (owner: Giuseppe Lavagetto)
[15:13:18] (CR) Tim Landscheidt: [C: -1] "Who would receive errors from cron jobs then?" [puppet] - https://gerrit.wikimedia.org/r/278431 (https://phabricator.wikimedia.org/T128411) (owner: Yuvipanda)
[16:45:48] PROBLEM - cassandra service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[16:45:57] PROBLEM - cassandra CQL 10.192.32.125:9042 on restbase2004 is CRITICAL: Connection refused
[17:01:57] RECOVERY - cassandra service on restbase2004 is OK: OK - cassandra is active
[17:03:47] RECOVERY - cassandra CQL 10.192.32.125:9042 on restbase2004 is OK: TCP OK - 0.036 second response time on port 9042
[17:13:28] Operations, Availability, MW-1.27-release-notes, Patch-For-Review, and 4 others: Implement a replication strategy for Swift - https://phabricator.wikimedia.org/T91869#2137813 (Aklapper)
[17:13:30] Operations, media-storage: Unable to delete, restore/undelete, move or upload new versions of files on several wikis ("inconsistent state within the internal storage backends") - https://phabricator.wikimedia.org/T128096#2137809 (Aklapper) Resolved>Open Reopening due to T130487 on bnwiki.
[17:13:39] Operations, media-storage: Unable to delete, restore/undelete, move or upload new versions of files on several wikis ("inconsistent state within the internal storage backends") - https://phabricator.wikimedia.org/T128096#2137815 (Aklapper)
[17:22:03] (CR) Yuvipanda: "Whoever's been getting the cron mails for all of those other hosts (both in tools and outside tools) that don't have the toollabs base cla" [puppet] - https://gerrit.wikimedia.org/r/278431 (https://phabricator.wikimedia.org/T128411) (owner: Yuvipanda)
[18:07:28] PROBLEM - cassandra CQL 10.192.32.125:9042 on restbase2004 is CRITICAL: Connection refused
[18:07:37] PROBLEM - cassandra service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[18:32:17] RECOVERY - cassandra service on restbase2004 is OK: OK - cassandra is active
[18:33:56] RECOVERY - cassandra CQL 10.192.32.125:9042 on restbase2004 is OK: TCP OK - 0.036 second response time on port 9042
[19:20:57] PROBLEM - cassandra CQL 10.192.32.125:9042 on restbase2004 is CRITICAL: Connection refused
[19:21:17] PROBLEM - cassandra service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[19:31:57] RECOVERY - cassandra service on restbase2004 is OK: OK - cassandra is active
[19:33:26] RECOVERY - cassandra CQL 10.192.32.125:9042 on restbase2004 is OK: TCP OK - 0.036 second response time on port 9042
[19:40:28] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[19:40:48] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[19:51:17] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[20:14:17] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[20:42:26] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[21:14:06] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[21:31:47] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[21:38:08] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 65.52% of data above the critical threshold [5000000.0]
[21:55:36] RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[21:58:50] Operations, Wikimedia-Mailing-lists: Upgrade Mailman to version 3 - https://phabricator.wikimedia.org/T52864#2137966 (RobLa-WMF) >>! In T52864#1938756, @JanZerebecki wrote: > @Robla-WMF: Are you willing to resource this? @JanZerebecki - I don't have authority to resource this. I was hoping @mark or some...
[22:37:25] (PS1) Ori.livneh: Segment Navigation Timing data by country [puppet] - https://gerrit.wikimedia.org/r/278684 (https://phabricator.wikimedia.org/T128709)
[22:41:02] Puppet, Revision-Scoring-As-A-Service, ores, Patch-For-Review: Fix puppet webservice name to uwsgi-ores-web - https://phabricator.wikimedia.org/T124621#1960573 (Halfak) Confirmed. This seems to work now.
[22:41:51] (CR) Ori.livneh: [C: 2] Segment Navigation Timing data by country [puppet] - https://gerrit.wikimedia.org/r/278684 (https://phabricator.wikimedia.org/T128709) (owner: Ori.livneh)
[22:47:58] PROBLEM - cassandra service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[22:48:56] PROBLEM - cassandra CQL 10.192.32.125:9042 on restbase2004 is CRITICAL: Connection refused
[22:59:48] Operations: Grafana: Job Queue Health: Panel is displayed incorrectly - https://phabricator.wikimedia.org/T130512#2138003 (Luke081515)
[23:00:36] Why don't we have a project for grafana yet?
[23:02:06] RECOVERY - cassandra service on restbase2004 is OK: OK - cassandra is active
[23:02:57] RECOVERY - cassandra CQL 10.192.32.125:9042 on restbase2004 is OK: TCP OK - 0.036 second response time on port 9042
[23:03:56] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[23:46:15] Operations, Traffic, Design: Do something better than an "Unauthorized" error page at https://upload.wikimedia.org/ - https://phabricator.wikimedia.org/T130449#2136256 (Bawolff) >>! In T130449#2136927, @Krenair wrote: > I don't think upload.wikimedia.org has anything to do with apache. > > I tried `cu...