[01:00:01] <grrrit-wm>	 (PS1) Alex Monk: Add wmflabsdotorg credentials to horizon config [puppet] - https://gerrit.wikimedia.org/r/278538 (https://phabricator.wikimedia.org/T129245)
[01:12:25] <grrrit-wm>	 (PS1) Alex Monk: openstack: clean up a couple of trivial things in makedomain [puppet] - https://gerrit.wikimedia.org/r/278539
[02:19:51] <grrrit-wm>	 (CR) Tim Landscheidt: "How will the relay to the mail server and the other bits and bobs in the toollabs class then be set up for static servers?" [puppet] - https://gerrit.wikimedia.org/r/278431 (https://phabricator.wikimedia.org/T128411) (owner: Yuvipanda)
[02:23:34] <logmsgbot>	 !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.17) (duration: 10m 22s)
[02:23:39] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:32:04] <logmsgbot>	 !log l10nupdate@tin ResourceLoader cache refresh completed at Sun Mar 20 02:32:04 UTC 2016 (duration 8m 30s)
[02:32:10] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:23:47] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s5 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 89832.00 seconds
[05:12:36] <icinga-wm>	 RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[05:49:19] <grrrit-wm>	 (CR) Yuvipanda: "It won't, but I don't think that has any practical effects. All other classes that don't execute user code and/or are gridengine related a" [puppet] - https://gerrit.wikimedia.org/r/278431 (https://phabricator.wikimedia.org/T128411) (owner: Yuvipanda)
[06:30:17] <icinga-wm>	 PROBLEM - puppet last run on mw2081 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:57] <icinga-wm>	 PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:07] <icinga-wm>	 PROBLEM - puppet last run on elastic2007 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:38] <icinga-wm>	 PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:57] <icinga-wm>	 PROBLEM - puppet last run on eventlog2001 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:57] <icinga-wm>	 PROBLEM - puppet last run on mw2146 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:33:54] <grrrit-wm>	 (PS1) Ori.livneh: Segment Navigation Timing data by continent [puppet] - https://gerrit.wikimedia.org/r/278546 (https://phabricator.wikimedia.org/T128709)
[06:35:24] <grrrit-wm>	 (PS2) Ori.livneh: Segment Navigation Timing data by continent [puppet] - https://gerrit.wikimedia.org/r/278546 (https://phabricator.wikimedia.org/T128709)
[06:51:57] <grrrit-wm>	 (PS3) Ori.livneh: Segment Navigation Timing data by continent [puppet] - https://gerrit.wikimedia.org/r/278546 (https://phabricator.wikimedia.org/T128709)
[06:52:36] <grrrit-wm>	 (CR) Ori.livneh: [C: 2 V: 2] Segment Navigation Timing data by continent [puppet] - https://gerrit.wikimedia.org/r/278546 (https://phabricator.wikimedia.org/T128709) (owner: Ori.livneh)
[06:56:08] <icinga-wm>	 RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[06:56:37] <icinga-wm>	 RECOVERY - puppet last run on mw2081 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[06:57:26] <icinga-wm>	 RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[06:57:28] <icinga-wm>	 RECOVERY - puppet last run on elastic2007 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[06:58:17] <icinga-wm>	 RECOVERY - puppet last run on eventlog2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:59:17] <icinga-wm>	 RECOVERY - puppet last run on mw2146 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:53:12] <wikibugs>	 Operations, Performance-Team, Traffic, Patch-For-Review: Segment Navigation Timing data by continent - https://phabricator.wikimedia.org/T128709#2137301 (ori) Open>Resolved @mark, yep; done. Initial dashboard at <https://grafana.wikimedia.org/dashboard/db/navigation-timing-by-continent>.
[07:57:48] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s4 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 89800.00 seconds
[08:25:22] <wikibugs>	 Operations, Performance-Team, Release-Engineering-Team, Availability, and 2 others: Dig through logs from 15 Mar 2016 read-only test and file bugs - https://phabricator.wikimedia.org/T129973#2137313 (Nemo_bis)
[08:26:00] <wikibugs>	 Operations, Performance-Team, Release-Engineering-Team, Availability, and 3 others: Dig through logs from 15 Mar 2016 read-only test and file bugs - https://phabricator.wikimedia.org/T129973#2121627 (Nemo_bis)
[08:44:47] <icinga-wm>	 PROBLEM - cassandra CQL 10.192.32.125:9042 on restbase2004 is CRITICAL: Connection refused
[08:45:18] <icinga-wm>	 PROBLEM - cassandra service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[09:01:07] <icinga-wm>	 RECOVERY - cassandra service on restbase2004 is OK: OK - cassandra is active
[09:02:27] <icinga-wm>	 RECOVERY - cassandra CQL 10.192.32.125:9042 on restbase2004 is OK: TCP OK - 0.037 second response time on port 9042
[11:23:58] <grrrit-wm>	 (PS1) Giuseppe Lavagetto: Adding more unit tests [software/conftool] - https://gerrit.wikimedia.org/r/278550
[11:24:00] <grrrit-wm>	 (PS1) Giuseppe Lavagetto: Print out the tags any conftool result line is referring to [software/conftool] - https://gerrit.wikimedia.org/r/278551 (https://phabricator.wikimedia.org/T128199)
[11:24:02] <grrrit-wm>	 (PS1) Giuseppe Lavagetto: Add select mode, refactor conftool.cli.tool [software/conftool] - https://gerrit.wikimedia.org/r/278552 (https://phabricator.wikimedia.org/T128199)
[11:24:58] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] Adding more unit tests [software/conftool] - https://gerrit.wikimedia.org/r/278550 (owner: Giuseppe Lavagetto)
[11:25:15] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] Print out the tags any conftool result line is referring to [software/conftool] - https://gerrit.wikimedia.org/r/278551 (https://phabricator.wikimedia.org/T128199) (owner: Giuseppe Lavagetto)
[11:25:24] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] Add select mode, refactor conftool.cli.tool [software/conftool] - https://gerrit.wikimedia.org/r/278552 (https://phabricator.wikimedia.org/T128199) (owner: Giuseppe Lavagetto)
[11:25:28] <_joe_>	 yeah you boring jenkins
[12:11:12] <grrrit-wm>	 (CR) Glaisher: "I'm not sure as I'm not really available during the SWAT windows nowadays (but might be on some days). It'd be nice if you could help. :)" [mediawiki-config] - https://gerrit.wikimedia.org/r/252627 (owner: Glaisher)
[12:13:28] <grrrit-wm>	 (PS1) Sabya: Add support for running preached as a systemd unit [puppet] - https://gerrit.wikimedia.org/r/278555
[12:14:39] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] Add support for running preached as a systemd unit [puppet] - https://gerrit.wikimedia.org/r/278555 (owner: Sabya)
[12:58:47] <icinga-wm>	 PROBLEM - puppet last run on db2058 is CRITICAL: CRITICAL: puppet fail
[13:25:26] <icinga-wm>	 RECOVERY - puppet last run on db2058 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[13:53:28] <icinga-wm>	 PROBLEM - cassandra service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[13:53:57] <icinga-wm>	 PROBLEM - cassandra CQL 10.192.32.125:9042 on restbase2004 is CRITICAL: Connection refused
[14:02:31] <icinga-wm>	 RECOVERY - cassandra service on restbase2004 is OK: OK - cassandra is active
[14:02:57] <icinga-wm>	 RECOVERY - cassandra CQL 10.192.32.125:9042 on restbase2004 is OK: TCP OK - 0.036 second response time on port 9042
[14:52:52] <grrrit-wm>	 (PS2) Giuseppe Lavagetto: Print out the tags any conftool result line is referring to [software/conftool] - https://gerrit.wikimedia.org/r/278551 (https://phabricator.wikimedia.org/T128199)
[14:52:54] <grrrit-wm>	 (PS2) Giuseppe Lavagetto: Adding more unit tests [software/conftool] - https://gerrit.wikimedia.org/r/278550
[14:52:56] <grrrit-wm>	 (PS2) Giuseppe Lavagetto: Add select mode, refactor conftool.cli.tool [software/conftool] - https://gerrit.wikimedia.org/r/278552 (https://phabricator.wikimedia.org/T128199)
[14:53:57] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] Print out the tags any conftool result line is referring to [software/conftool] - https://gerrit.wikimedia.org/r/278551 (https://phabricator.wikimedia.org/T128199) (owner: Giuseppe Lavagetto)
[14:54:09] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] Adding more unit tests [software/conftool] - https://gerrit.wikimedia.org/r/278550 (owner: Giuseppe Lavagetto)
[14:54:19] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] Add select mode, refactor conftool.cli.tool [software/conftool] - https://gerrit.wikimedia.org/r/278552 (https://phabricator.wikimedia.org/T128199) (owner: Giuseppe Lavagetto)
[15:13:18] <grrrit-wm>	 (CR) Tim Landscheidt: [C: -1] "Who would receive errors from cron jobs then?" [puppet] - https://gerrit.wikimedia.org/r/278431 (https://phabricator.wikimedia.org/T128411) (owner: Yuvipanda)
[16:45:48] <icinga-wm>	 PROBLEM - cassandra service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[16:45:57] <icinga-wm>	 PROBLEM - cassandra CQL 10.192.32.125:9042 on restbase2004 is CRITICAL: Connection refused
[17:01:57] <icinga-wm>	 RECOVERY - cassandra service on restbase2004 is OK: OK - cassandra is active
[17:03:47] <icinga-wm>	 RECOVERY - cassandra CQL 10.192.32.125:9042 on restbase2004 is OK: TCP OK - 0.036 second response time on port 9042
[17:13:28] <wikibugs>	 Operations, Availability, MW-1.27-release-notes, Patch-For-Review, and 4 others: Implement a replication strategy for Swift - https://phabricator.wikimedia.org/T91869#2137813 (Aklapper)
[17:13:30] <wikibugs>	 Operations, media-storage: Unable to delete, restore/undelete, move or upload new versions of files on several wikis ("inconsistent state within the internal storage backends") - https://phabricator.wikimedia.org/T128096#2137809 (Aklapper) Resolved>Open Reopening due to T130487 on bnwiki.
[17:13:39] <wikibugs>	 Operations, media-storage: Unable to delete, restore/undelete, move or upload new versions of files on several wikis ("inconsistent state within the internal storage backends") - https://phabricator.wikimedia.org/T128096#2137815 (Aklapper)
[17:22:03] <grrrit-wm>	 (CR) Yuvipanda: "Whoever's been getting the cron mails for all of those other hosts (both in tools and outside tools) that don't have the toollabs base cla" [puppet] - https://gerrit.wikimedia.org/r/278431 (https://phabricator.wikimedia.org/T128411) (owner: Yuvipanda)
[18:07:28] <icinga-wm>	 PROBLEM - cassandra CQL 10.192.32.125:9042 on restbase2004 is CRITICAL: Connection refused
[18:07:37] <icinga-wm>	 PROBLEM - cassandra service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[18:32:17] <icinga-wm>	 RECOVERY - cassandra service on restbase2004 is OK: OK - cassandra is active
[18:33:56] <icinga-wm>	 RECOVERY - cassandra CQL 10.192.32.125:9042 on restbase2004 is OK: TCP OK - 0.036 second response time on port 9042
[19:20:57] <icinga-wm>	 PROBLEM - cassandra CQL 10.192.32.125:9042 on restbase2004 is CRITICAL: Connection refused
[19:21:17] <icinga-wm>	 PROBLEM - cassandra service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[19:31:57] <icinga-wm>	 RECOVERY - cassandra service on restbase2004 is OK: OK - cassandra is active
[19:33:26] <icinga-wm>	 RECOVERY - cassandra CQL 10.192.32.125:9042 on restbase2004 is OK: TCP OK - 0.036 second response time on port 9042
[19:40:28] <icinga-wm>	 PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[19:40:48] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[19:51:17] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[20:14:17] <icinga-wm>	 PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[20:42:26] <icinga-wm>	 PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[21:14:06] <icinga-wm>	 PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[21:31:47] <icinga-wm>	 PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[21:38:08] <icinga-wm>	 PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 65.52% of data above the critical threshold [5000000.0]
[21:55:36] <icinga-wm>	 RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[21:58:50] <wikibugs>	 Operations, Wikimedia-Mailing-lists: Upgrade Mailman to version 3 - https://phabricator.wikimedia.org/T52864#2137966 (RobLa-WMF) >>! In T52864#1938756, @JanZerebecki wrote: > @Robla-WMF: Are you willing to resource this?  @JanZerebecki - I don't have authority to resource this.  I was hoping @mark or some...
[22:37:25] <grrrit-wm>	 (PS1) Ori.livneh: Segment Navigation Timing data by country [puppet] - https://gerrit.wikimedia.org/r/278684 (https://phabricator.wikimedia.org/T128709)
[22:41:02] <wikibugs>	 Puppet, Revision-Scoring-As-A-Service, ores, Patch-For-Review: Fix puppet webservice name to uwsgi-ores-web - https://phabricator.wikimedia.org/T124621#1960573 (Halfak) Confirmed.  This seems to work now.
[22:41:51] <grrrit-wm>	 (CR) Ori.livneh: [C: 2] Segment Navigation Timing data by country [puppet] - https://gerrit.wikimedia.org/r/278684 (https://phabricator.wikimedia.org/T128709) (owner: Ori.livneh)
[22:47:58] <icinga-wm>	 PROBLEM - cassandra service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[22:48:56] <icinga-wm>	 PROBLEM - cassandra CQL 10.192.32.125:9042 on restbase2004 is CRITICAL: Connection refused
[22:59:48] <wikibugs>	 Operations: Grafana: Job Queue Health: Panel is displayed incorrectly - https://phabricator.wikimedia.org/T130512#2138003 (Luke081515)
[23:00:36] <Luke081515>	 Why don't we have a project for grafana yet?
[23:02:06] <icinga-wm>	 RECOVERY - cassandra service on restbase2004 is OK: OK - cassandra is active
[23:02:57] <icinga-wm>	 RECOVERY - cassandra CQL 10.192.32.125:9042 on restbase2004 is OK: TCP OK - 0.036 second response time on port 9042
[23:03:56] <icinga-wm>	 RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[23:46:15] <wikibugs>	 Operations, Traffic, Design: Do something better than an "Unauthorized" error page at https://upload.wikimedia.org/ - https://phabricator.wikimedia.org/T130449#2136256 (Bawolff) >>! In T130449#2136927, @Krenair wrote: > I don't think upload.wikimedia.org has anything to do with apache. >  > I tried `cu...