[00:23:08] <icinga-wm>	 PROBLEM - cassandra CQL 10.192.32.125:9042 on restbase2004 is CRITICAL: Connection refused
[00:24:17] <icinga-wm>	 PROBLEM - cassandra service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[00:29:37] <icinga-wm>	 RECOVERY - cassandra service on restbase2004 is OK: OK - cassandra is active
[00:30:17] <icinga-wm>	 RECOVERY - cassandra CQL 10.192.32.125:9042 on restbase2004 is OK: TCP OK - 0.036 second response time on port 9042
[01:19:33] <wikibugs>	 Operations, Wikimedia-Mailing-lists: Upgrade Mailman to version 3 - https://phabricator.wikimedia.org/T52864#2138089 (Tgr) >>! In T52864#2137966, @RobLa-WMF wrote: > @JanZerebecki - I don't have authority to resource this.  I was hoping @mark or someone from #operations would respond, but I believe that s...
[01:27:34] <wikibugs>	 Operations, Wikimedia-Mailing-lists: Upgrade Mailman to version 3 - https://phabricator.wikimedia.org/T52864#553889 (ori) >>! In T52864#2137966, @RobLa-WMF wrote: > I was hoping @mark or someone from #operations would respond  @faidon did, in T52864#954874 above.
[02:00:46] <icinga-wm>	 PROBLEM - cassandra service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[02:01:07] <icinga-wm>	 PROBLEM - cassandra CQL 10.192.32.125:9042 on restbase2004 is CRITICAL: Connection refused
[02:02:29] <icinga-wm>	 RECOVERY - cassandra service on restbase2004 is OK: OK - cassandra is active
[02:02:57] <icinga-wm>	 RECOVERY - cassandra CQL 10.192.32.125:9042 on restbase2004 is OK: TCP OK - 0.039 second response time on port 9042
[02:24:50] <logmsgbot>	 !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.17) (duration: 10m 57s)
[02:24:55] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:33:31] <logmsgbot>	 !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Mar 21 02:33:31 UTC 2016 (duration 8m 41s)
[02:33:35] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:45:22] <grrrit-wm>	 (PS2) Sabya: Add support for running preached as a systemd unit [puppet] - https://gerrit.wikimedia.org/r/278555
[02:47:00] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] Add support for running preached as a systemd unit [puppet] - https://gerrit.wikimedia.org/r/278555 (owner: Sabya)
[02:55:29] <wikibugs>	 Operations, Wikimedia-Mailing-lists: Upgrade Mailman to version 3 - https://phabricator.wikimedia.org/T52864#2138129 (RobLa-WMF) >>! In T52864#2138089, @Tgr wrote: >>>! In T52864#2137966, @RobLa-WMF wrote: >> @JanZerebecki - I don't have authority to resource this.  I was hoping @mark or someone from #ope...
[03:00:29] <grrrit-wm>	 (PS3) Sabya: Add support for running preached as a systemd unit [puppet] - https://gerrit.wikimedia.org/r/278555
[03:09:32] <grrrit-wm>	 (PS1) Ori.livneh: Add ten additional countries to NavTiming [puppet] - https://gerrit.wikimedia.org/r/278701
[03:40:37] <icinga-wm>	 PROBLEM - RAID on db1067 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded)
[03:52:58] <Niharika>	 Hello. Wikipedia seems down for me. The iOS app isn't working either. Error: Host with specified name could not be found.
[03:58:38] <grrrit-wm>	 (PS1) Yuvipanda: labs: Add support for custom cnames in labs recursor [puppet] - https://gerrit.wikimedia.org/r/278705 (https://phabricator.wikimedia.org/T118758)
[04:10:57] <icinga-wm>	 RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[04:25:14] <grrrit-wm>	 (CR) Ori.livneh: "Um, for each continent except Antarctica and Oceania, that is." [puppet] - https://gerrit.wikimedia.org/r/278701 (owner: Ori.livneh)
[04:26:44] <ori>	 Niharika: can you run traceroute to wikipedia.org? (If you're using Windows, it's "tracert")
[04:27:43] <ori>	 Actually, that may not work, if you're not able to resolve the name
[04:27:44] <Niharika>	 ori: Okay, let me get on my laptop and try that.
[04:29:40] <ori>	 If you get an "unknown host" error, try running 'nslookup en.wikipedia.org' and make note of the "Server:" line
[04:30:43] <Niharika>	 ori: Never mind, it seems to be back up now. 
[04:30:49] <ori>	 I fixed it! \o/
[04:30:55] <Niharika>	 :D 
[04:37:46] <yuvipanda>	 Niharika: when I was on Airtel, their DNS servers would fuck up like this now and then, with weird cache issues
[04:37:56] <yuvipanda>	 Niharika: I switched my router to use Google DNS, and the problems went away
[04:39:03] <Niharika>	 yuvipanda: That's probably it. I was on Google DNS but the stupid wifi at Bentley (All-hands) wouldn't let me use any other DNS except their own, and I forgot to switch back to Google ones when I got home. 
[04:46:35] <yuvipanda>	 Niharika: ah!
[04:59:06] <ori>	 captive portals that hijack dns are the worst
[05:22:08] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s7 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 88595.00 seconds
[05:35:59] <grrrit-wm>	 (PS1) Ori.livneh: Add Australia to NavTiming countries [puppet] - https://gerrit.wikimedia.org/r/278706
[05:43:46] <grrrit-wm>	 (CR) Ori.livneh: [C: 2] Add Australia to NavTiming countries [puppet] - https://gerrit.wikimedia.org/r/278706 (owner: Ori.livneh)
[06:12:16] <icinga-wm>	 PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 51.72% of data above the critical threshold [5000000.0]
[06:22:37] <icinga-wm>	 RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[06:30:47] <icinga-wm>	 PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:07] <icinga-wm>	 PROBLEM - puppet last run on mw1199 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:31:18] <icinga-wm>	 PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:37] <icinga-wm>	 PROBLEM - puppet last run on nobelium is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:47] <icinga-wm>	 PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:56] <icinga-wm>	 PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:27] <icinga-wm>	 PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:28] <icinga-wm>	 PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:57] <icinga-wm>	 PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:56] <icinga-wm>	 PROBLEM - puppet last run on wtp2004 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:56:47] <icinga-wm>	 RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[06:56:47] <icinga-wm>	 RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[06:56:48] <icinga-wm>	 RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[06:57:07] <icinga-wm>	 RECOVERY - puppet last run on mw1199 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[06:57:08] <icinga-wm>	 RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[06:57:26] <icinga-wm>	 RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[06:57:37] <icinga-wm>	 RECOVERY - puppet last run on nobelium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:46] <icinga-wm>	 PROBLEM - puppet last run on mw1116 is CRITICAL: CRITICAL: Puppet has 7 failures
[06:57:47] <icinga-wm>	 RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:57] <icinga-wm>	 RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:07] <icinga-wm>	 RECOVERY - puppet last run on wtp2004 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[07:23:48] <icinga-wm>	 RECOVERY - puppet last run on mw1116 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:59:05] <wikibugs>	 Operations, ops-eqiad: db1067 degraded RAID - https://phabricator.wikimedia.org/T130517#2138224 (jcrespo)
[08:00:13] <icinga-wm>	 ACKNOWLEDGEMENT - RAID on db1067 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) Jcrespo https://phabricator.wikimedia.org/T130517
[08:57:18] <icinga-wm>	 PROBLEM - puppet last run on mw1116 is CRITICAL: CRITICAL: Puppet has 69 failures
[09:09:13] <elukey>	 !log restarted hhvm on mw1116 
[09:09:17] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:14:58] <icinga-wm>	 RECOVERY - puppet last run on mw1116 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:26:04] <jynus>	 !log Altering change_tag engine to InnoDB on db1069:3313
[09:26:08] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:29:14] <grrrit-wm>	 (PS4) Mobrovac: Introducing changeprop role and puppet module [puppet] - https://gerrit.wikimedia.org/r/275772 (https://phabricator.wikimedia.org/T128463)
[09:30:40] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] Introducing changeprop role and puppet module [puppet] - https://gerrit.wikimedia.org/r/275772 (https://phabricator.wikimedia.org/T128463) (owner: Mobrovac)
[09:39:08] <jynus>	 I think change_tag was the main cause of lag on s3 for labs, but we will see if it pays off
[09:42:09] <grrrit-wm>	 (PS5) Mobrovac: Introducing changeprop role and puppet module [puppet] - https://gerrit.wikimedia.org/r/275772 (https://phabricator.wikimedia.org/T128463)
[09:48:59] <grrrit-wm>	 (PS3) Hashar: hiera_lookup: support 'labs' realm [puppet] - https://gerrit.wikimedia.org/r/276345 (https://phabricator.wikimedia.org/T129092)
[09:49:10] <grrrit-wm>	 (PS1) Mobrovac: Citoid: Switch to the Scap3 deployment method [puppet] - https://gerrit.wikimedia.org/r/278710 (https://phabricator.wikimedia.org/T116337)
[09:49:50] <grrrit-wm>	 (CR) Elukey: [C: 2] "Merging the change since it will be a bit difficult to test this code review on the main puppet repo. The next step is to file a code revi" [puppet/cdh] - https://gerrit.wikimedia.org/r/277984 (https://phabricator.wikimedia.org/T129838) (owner: Elukey)
[09:50:25] <revi>	 derp
[09:51:23] <revi>	 I always hate my nick when grrrit cuts the text at revi(ew)
[09:51:48] <grrrit-wm>	 (PS4) Hashar: hiera_lookup: recognize labs project and site [puppet] - https://gerrit.wikimedia.org/r/276346 (https://phabricator.wikimedia.org/T129092)
[09:51:59] <grrrit-wm>	 (CR) Hashar: "rebased" [puppet] - https://gerrit.wikimedia.org/r/276346 (https://phabricator.wikimedia.org/T129092) (owner: Hashar)
[10:05:57] <icinga-wm>	 PROBLEM - puppet last run on mw1133 is CRITICAL: CRITICAL: Puppet has 28 failures
[10:08:26] <icinga-wm>	 PROBLEM - puppet last run on mw1121 is CRITICAL: CRITICAL: Puppet has 77 failures
[10:14:23] <grrrit-wm>	 (PS1) Elukey: Update Analytics cdh submodule after https://gerrit.wikimedia.org/r/#/c/277984/ [puppet] - https://gerrit.wikimedia.org/r/278713 (https://phabricator.wikimedia.org/T129838)
[10:19:37] <wikibugs>	 Operations, Continuous-Integration-Config, Dumps-Generation, Patch-For-Review, WorkType-Maintenance: operations/dumps repo should pass flake8 - https://phabricator.wikimedia.org/T114249#2138340 (hashar) >>! In T114249#2106764, @ArielGlenn wrote: > Don't despair.  I have still on my roadmap to l...
[10:24:39] <jynus>	 !log Altering user_properties engine to InnoDB on db1069:3313
[10:24:43] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:32:46] <icinga-wm>	 RECOVERY - puppet last run on mw1133 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:52:08] <hashar>	 !log Live hacked puppet compiler on compiler02.puppet3-diffs.eqiad.wmflabs to debug it not processing submodules.  Reinstalled it from the last tag in the process
[10:52:12] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:04:48] <icinga-wm>	 PROBLEM - puppet last run on mw1133 is CRITICAL: CRITICAL: Puppet has 73 failures
[11:32:17] <icinga-wm>	 PROBLEM - HHVM rendering on mw1133 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.021 second response time
[11:32:58] <icinga-wm>	 PROBLEM - Apache HTTP on mw1133 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.007 second response time
[11:34:07] <icinga-wm>	 RECOVERY - HHVM rendering on mw1133 is OK: HTTP OK: HTTP/1.1 200 OK - 71682 bytes in 8.751 second response time
[11:34:46] <icinga-wm>	 RECOVERY - Apache HTTP on mw1133 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.060 second response time
[11:36:22] <grrrit-wm>	 (CR) Mobrovac: Introducing changeprop role and puppet module (1 comment) [puppet] - https://gerrit.wikimedia.org/r/275772 (https://phabricator.wikimedia.org/T128463) (owner: Mobrovac)
[11:55:56] <icinga-wm>	 PROBLEM - HHVM rendering on mw1121 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:57:16] <icinga-wm>	 PROBLEM - Apache HTTP on mw1121 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:01:48] <icinga-wm>	 RECOVERY - puppet last run on mw1133 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[12:02:55] <elukey>	 mw1121: Mar 21 11:54:18 mw1121 kernel: [428912.210401] Out of memory: Kill process 23236 (hhvm) score 951 or sacrifice child
[12:04:18] <icinga-wm>	 RECOVERY - puppet last run on mw1121 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[12:30:57] <wikibugs>	 Operations, Traffic, WMF-Communications, HTTPS, Security-Other: Server certificate is classified as invalid on government computers - https://phabricator.wikimedia.org/T128182#2138475 (BBlack) From a naive POV based on the screenshots alone: they're using an outdated set of Root certificates, inc...
[12:52:06] <icinga-wm>	 PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 53.33% of data above the critical threshold [5000000.0]
[13:06:27] <icinga-wm>	 RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[13:09:43] <grrrit-wm>	 (PS3) BBlack: move most of esams to standard layout [dns] - https://gerrit.wikimedia.org/r/270285
[13:23:39] <grrrit-wm>	 (PS1) BBlack: remove esams ORIGIN statement [dns] - https://gerrit.wikimedia.org/r/278721
[13:23:41] <grrrit-wm>	 (PS1) BBlack: remove corp ORIGIN statement [dns] - https://gerrit.wikimedia.org/r/278722
[13:23:43] <grrrit-wm>	 (PS1) BBlack: remove redundant wikimedia.org. trailers [dns] - https://gerrit.wikimedia.org/r/278723
[13:24:23] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] remove esams ORIGIN statement [dns] - https://gerrit.wikimedia.org/r/278721 (owner: BBlack)
[13:29:10] <grrrit-wm>	 (PS2) BBlack: remove esams ORIGIN statement [dns] - https://gerrit.wikimedia.org/r/278721
[13:29:12] <grrrit-wm>	 (PS2) BBlack: remove redundant wikimedia.org. trailers [dns] - https://gerrit.wikimedia.org/r/278723
[13:29:14] <grrrit-wm>	 (PS2) BBlack: remove corp ORIGIN statement [dns] - https://gerrit.wikimedia.org/r/278722
[13:33:02] <mobrovac>	 !log restbase deploy start of 26f9e90 on canary restbase1003
[13:33:06] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:58:26] <icinga-wm>	 PROBLEM - Apache HTTP on mw1119 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:59:47] <icinga-wm>	 PROBLEM - HHVM rendering on mw1119 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.005 second response time
[14:00:06] <icinga-wm>	 RECOVERY - Restbase root url on restbase1013 is OK: HTTP OK: HTTP/1.1 200 - 15253 bytes in 0.034 second response time
[14:00:08] <icinga-wm>	 RECOVERY - Restbase root url on restbase1012 is OK: HTTP OK: HTTP/1.1 200 - 15253 bytes in 0.045 second response time
[14:00:48] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy
[14:01:28] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy
[14:04:21] <grrrit-wm>	 (CR) Ottomata: [C: 1] "If you feel good about it, proceed!" [puppet] - https://gerrit.wikimedia.org/r/278713 (https://phabricator.wikimedia.org/T129838) (owner: Elukey)
[14:05:25] <mobrovac>	 !log restbase deploy end of 26f9e90
[14:05:30] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:07:21] <grrrit-wm>	 (CR) Ottomata: "Talked with Marko a bit about this in IRC." (1 comment) [puppet/kafka] - https://gerrit.wikimedia.org/r/278329 (https://phabricator.wikimedia.org/T130371) (owner: Mobrovac)
[14:23:38] <jynus>	 !log restarting labsdb1001 mysql
[14:23:42] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:26:16] <ottomata>	 !log altering kafka topics webrequest_text and webrequest_upload, increasing each from 12 partitions to 24 partitions
[14:26:20] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:28:07] <icinga-wm>	 PROBLEM - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed
[14:28:24] <grrrit-wm>	 (PS2) Ottomata: Increase number of map tasks for camus webrequest to 72 [puppet] - https://gerrit.wikimedia.org/r/278288 (https://phabricator.wikimedia.org/T127351)
[14:28:36] <grrrit-wm>	 (CR) Ottomata: [C: 2 V: 2] Increase number of map tasks for camus webrequest to 72 [puppet] - https://gerrit.wikimedia.org/r/278288 (https://phabricator.wikimedia.org/T127351) (owner: Ottomata)
[14:47:47] <icinga-wm>	 RECOVERY - Ensure mysql credential creation for tools users is running on labstore1001 is OK: OK - create-dbusers is active
[15:00:05] <jouncebot>	 anomie ostriches thcipriani marktraceur aude: Dear anthropoid, the time has come. Please deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160321T1500).
[15:00:05] <jouncebot>	 MatmaRex: A patch you scheduled for Morning SWAT(Max 8 patches) is about to be deployed. Please be available during the process.
[15:00:07] <icinga-wm>	 PROBLEM - check_puppetrun on betelgeuse is CRITICAL: CRITICAL: Puppet has 61 failures
[15:00:15] <MatmaRex>	 hello.
[15:01:28] <thcipriani>	 MatmaRex: Hiya, I can SWAT.
[15:01:33] <grrrit-wm>	 (CR) Nuria: "I am all for this change but do not know enough of mediawiki conventions to merge. +1" [mediawiki-config] - https://gerrit.wikimedia.org/r/250384 (https://phabricator.wikimedia.org/T130442) (owner: Nemo bis)
[15:05:07] <icinga-wm>	 PROBLEM - check_puppetrun on betelgeuse is CRITICAL: CRITICAL: Puppet has 61 failures
[15:10:07] <icinga-wm>	 PROBLEM - check_puppetrun on betelgeuse is CRITICAL: CRITICAL: Puppet has 61 failures
[15:11:38] <MatmaRex>	 that took a while to merge.
[15:13:43] <logmsgbot>	 !log thcipriani@tin Synchronized php-1.27.0-wmf.17/includes/upload/UploadBase.php: SWAT: UploadBase: Set mFileSize, if given, even if mTempPath is unknown [[gerrit:278724]] (duration: 00m 30s)
[15:13:48] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:13:49] <thcipriani>	 ^ MatmaRex check please
[15:14:00] <thcipriani>	 yeah, core changes ain't quick for jenkins
[15:15:07] <icinga-wm>	 RECOVERY - check_puppetrun on betelgeuse is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[15:15:29] <MatmaRex>	 thcipriani: thanks. i don't want to upload files in production to check this, but we have logs for the errors this fixes and i'll watch them.
[15:15:39] <thcipriani>	 MatmaRex: ack. Thanks.
[15:41:55] <hoo>	 thcipriani: Are you done with SWAT?
[15:42:07] <thcipriani>	 hoo: yes
[15:42:20] <grrrit-wm>	 (PS1) Hoo man: Bump $wgCacheEpoch on Wikidata after Property conversions [mediawiki-config] - https://gerrit.wikimedia.org/r/278736
[15:43:09] <grrrit-wm>	 (CR) Hoo man: [C: 2] Bump $wgCacheEpoch on Wikidata after Property conversions [mediawiki-config] - https://gerrit.wikimedia.org/r/278736 (owner: Hoo man)
[15:43:35] <grrrit-wm>	 (Merged) jenkins-bot: Bump $wgCacheEpoch on Wikidata after Property conversions [mediawiki-config] - https://gerrit.wikimedia.org/r/278736 (owner: Hoo man)
[15:44:31] <logmsgbot>	 !log hoo@tin Synchronized wmf-config/Wikibase.php: Bump $wgCacheEpoch on Wikidata after Property conversions (duration: 00m 28s)
[15:44:35] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:52:03] <wikibugs>	 Blocked-on-Operations, Operations, RESTBase, RESTBase-Cassandra, Patch-For-Review: Finish conversion to multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#2138908 (GWicke)
[15:55:10] <wikibugs>	 Puppet, Revision-Scoring-As-A-Service, ores, Patch-For-Review: Fix puppet webservice name to uwsgi-ores-web - https://phabricator.wikimedia.org/T124621#2138942 (Halfak) Open>Resolved a:Halfak
[16:17:06] <grrrit-wm>	 (PS3) Giuseppe Lavagetto: Add select mode [software/conftool] - https://gerrit.wikimedia.org/r/278552 (https://phabricator.wikimedia.org/T128199)
[16:21:46] <grrrit-wm>	 (Abandoned) Giuseppe Lavagetto: Adding more unit tests [software/conftool] - https://gerrit.wikimedia.org/r/278550 (owner: Giuseppe Lavagetto)
[16:22:11] <grrrit-wm>	 (Abandoned) Giuseppe Lavagetto: Print out the tags any conftool result line is referring to [software/conftool] - https://gerrit.wikimedia.org/r/278551 (https://phabricator.wikimedia.org/T128199) (owner: Giuseppe Lavagetto)
[16:26:06] <wikibugs>	 Operations, ops-eqiad, RESTBase-Cassandra: restbase1007.eqiad.wmnet CPU temperature? - https://phabricator.wikimedia.org/T130370#2134035 (GWicke) This is one of the three boxes (restbase1007-1009) where a second CPU was installed later.
[16:47:17] <icinga-wm>	 PROBLEM - cassandra CQL 10.192.32.125:9042 on restbase2004 is CRITICAL: Connection refused
[16:47:26] <icinga-wm>	 PROBLEM - cassandra service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[16:52:32] <mobrovac>	 on it ^
[16:54:27] <icinga-wm>	 RECOVERY - cassandra service on restbase2004 is OK: OK - cassandra is active
[16:56:07] <icinga-wm>	 RECOVERY - cassandra CQL 10.192.32.125:9042 on restbase2004 is OK: TCP OK - 0.038 second response time on port 9042
[17:00:07] <icinga-wm>	 PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures
[17:05:16] <icinga-wm>	 PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures
[17:10:06] <icinga-wm>	 RECOVERY - check_puppetrun on heka is OK: OK: Puppet is currently enabled, last run 205 seconds ago with 0 failures
[17:29:06] <grrrit-wm>	 (PS1) Elukey: HDFS Namenode automatic failover support - bug fixes. [puppet/cdh] - https://gerrit.wikimedia.org/r/278748 (https://phabricator.wikimedia.org/T129838)
[17:45:37] <icinga-wm>	 PROBLEM - torrus.wikimedia.org UI on netmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - string Torrus Top: Wikimedia not found on https://torrus.wikimedia.org:443/torrus - 1140 bytes in 0.038 second response time
[17:49:07] <icinga-wm>	 RECOVERY - torrus.wikimedia.org UI on netmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 2493 bytes in 0.120 second response time
[17:54:56] <grrrit-wm>	 (PS1) Elukey: Fix varnishkafka cronspam due to non existent rsyslog action. [puppet/varnishkafka] - https://gerrit.wikimedia.org/r/278750 (https://phabricator.wikimedia.org/T129344)
[18:00:12] <wikibugs>	 Operations, Ops-Access-Requests, Discovery, Maps, Patch-For-Review: Requesting maps-admins access for Eric Evans - https://phabricator.wikimedia.org/T130412#2135290 (akosiaris) So, this constitutes a sudo request, so per policy we need to get this approved in the ops meeting. FWIW, I support this
[18:15:39] <Amir1>	 if Ops are around, I have some simple patches for review: https://gerrit.wikimedia.org/r/278270
[18:15:46] <Amir1>	 https://gerrit.wikimedia.org/r/278271
[18:15:49] <grrrit-wm>	 (CR) Ottomata: HDFS Namenode automatic failover support - bug fixes. (1 comment) [puppet/cdh] - https://gerrit.wikimedia.org/r/278748 (https://phabricator.wikimedia.org/T129838) (owner: Elukey)
[18:20:32] <grrrit-wm>	 (CR) Ottomata: [C: 1] Fix varnishkafka cronspam due to non existent rsyslog action. [puppet/varnishkafka] - https://gerrit.wikimedia.org/r/278750 (https://phabricator.wikimedia.org/T129344) (owner: Elukey)
[18:42:30] <grrrit-wm>	 (PS1) Ottomata: Add DC named topics to event bus topic config [puppet] - https://gerrit.wikimedia.org/r/278752 (https://phabricator.wikimedia.org/T127718)
[18:43:30] <grrrit-wm>	 (PS2) Alexandros Kosiaris: Citoid: Switch to the Scap3 deployment method [puppet] - https://gerrit.wikimedia.org/r/278710 (https://phabricator.wikimedia.org/T116337) (owner: Mobrovac)
[18:54:21] <grrrit-wm>	 (CR) Alexandros Kosiaris: [C: 2 V: 2] Citoid: Switch to the Scap3 deployment method [puppet] - https://gerrit.wikimedia.org/r/278710 (https://phabricator.wikimedia.org/T116337) (owner: Mobrovac)
[18:58:20] <wikibugs>	 Operations, Wikimedia-Mailing-lists: Upgrade Mailman to version 3 - https://phabricator.wikimedia.org/T52864#2139275 (AdHuikeshoven) @RobLa-WMF , thanks for the kind words. The status of Discourse is a pilot a test and generates feedback about what people like and what people don't like. There are some st...
[18:59:57] <icinga-wm>	 PROBLEM - cassandra CQL 10.192.32.125:9042 on restbase2004 is CRITICAL: Connection refused
[19:00:17] <icinga-wm>	 PROBLEM - cassandra service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[19:02:16] <icinga-wm>	 RECOVERY - cassandra service on restbase2004 is OK: OK - cassandra is active
[19:03:37] <icinga-wm>	 RECOVERY - cassandra CQL 10.192.32.125:9042 on restbase2004 is OK: TCP OK - 0.037 second response time on port 9042
[19:10:57] <wikibugs>	 Operations, Wikimedia-Mailing-lists: Have a conversation about migrating from GNU Mailman 2.1 to GNU Mailman 3.0 - https://phabricator.wikimedia.org/T52864#2139292 (AdHuikeshoven)
[19:12:29] <grrrit-wm>	 (CR) Alexandros Kosiaris: [C: 2] Flake8 for labstore and wdqs [puppet] - https://gerrit.wikimedia.org/r/278270 (owner: Ladsgroup)
[19:12:34] <grrrit-wm>	 (PS3) Alexandros Kosiaris: Flake8 for labstore and wdqs [puppet] - https://gerrit.wikimedia.org/r/278270 (owner: Ladsgroup)
[19:14:52] <grrrit-wm>	 (CR) Alexandros Kosiaris: [V: 2] Flake8 for labstore and wdqs [puppet] - https://gerrit.wikimedia.org/r/278270 (owner: Ladsgroup)
[19:17:25] <grrrit-wm>	 (CR) Alexandros Kosiaris: [C: -1] "That's a stub class (apart from the base requirement). Let's populate it with something actually doing something useful :-)" [puppet] - https://gerrit.wikimedia.org/r/278455 (https://phabricator.wikimedia.org/T130461) (owner: Halfak)
[19:18:13] <wikibugs>	 Operations, Wikimedia-Mailing-lists: Have a conversation about migrating from GNU Mailman 2.1 to GNU Mailman 3.0 - https://phabricator.wikimedia.org/T52864#2139305 (AdHuikeshoven)
[19:18:17] <icinga-wm>	 PROBLEM - puppet last run on mw1142 is CRITICAL: CRITICAL: Puppet has 56 failures
[19:19:00] <grrrit-wm>	 (CR) Alexandros Kosiaris: "Hm... that's a 753 line patch. I know it's supposed to be NOOP, but got to figure out how to test it before breaking someone's workflow. T" [puppet] - https://gerrit.wikimedia.org/r/278271 (owner: Ladsgroup)
[19:31:44] <grrrit-wm>	 (CR) Ladsgroup: "and that's even the first pass, I will make several others just for LDAP. These changes are only cosmetic ones and they don't change outsi" [puppet] - https://gerrit.wikimedia.org/r/278271 (owner: Ladsgroup)
[19:32:06] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[19:32:48] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[19:36:26] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[19:36:46] <akosiaris>	 yuvipanda: https://gerrit.wikimedia.org/r/#/c/197409/ what have I missed ?
[19:37:13] <akosiaris>	 aka: why nodes/labs/integration under the top level puppet repo hierarchy ?
[19:37:17] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge.
[19:43:22] <grrrit-wm>	 (PS1) Ladsgroup: Flake8 for apt [puppet] - https://gerrit.wikimedia.org/r/278753
[19:44:36] <grrrit-wm>	 (CR) Alexandros Kosiaris: [C: 2 V: 2] Flake8 for apt [puppet] - https://gerrit.wikimedia.org/r/278753 (owner: Ladsgroup)
[19:45:20] <hashar>	 akosiaris: if you are in pep8/flake8 mood, I had a pending patch to switch the puppet repo to use  tox   to run pep8 :)
[19:46:06] <icinga-wm>	 PROBLEM - Apache HTTP on mw1142 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:46:16] <akosiaris>	 I am just in a react mood while handling some ORES redis things, not really in a flake8 mood :-(
[19:46:25] <hashar>	 :D
[19:46:57] <icinga-wm>	 PROBLEM - HHVM rendering on mw1142 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:47:47] <icinga-wm>	 PROBLEM - nutcracker process on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:47:56] <icinga-wm>	 PROBLEM - nutcracker port on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:48:06] <icinga-wm>	 PROBLEM - salt-minion processes on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:48:08] <icinga-wm>	 PROBLEM - Check size of conntrack table on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:48:16] <icinga-wm>	 PROBLEM - RAID on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:48:17] <icinga-wm>	 PROBLEM - DPKG on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:48:38] <icinga-wm>	 PROBLEM - dhclient process on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:48:47] <icinga-wm>	 PROBLEM - HHVM processes on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:48:56] <icinga-wm>	 PROBLEM - configured eth on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:49:16] <icinga-wm>	 PROBLEM - SSH on mw1142 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:49:17] <icinga-wm>	 PROBLEM - Disk space on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:49:46] <icinga-wm>	 RECOVERY - salt-minion processes on mw1142 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[19:49:48] <icinga-wm>	 RECOVERY - Check size of conntrack table on mw1142 is OK: OK: nf_conntrack is 0 % full
[19:49:56] <icinga-wm>	 RECOVERY - RAID on mw1142 is OK: OK: no RAID installed
[19:55:07] <icinga-wm>	 PROBLEM - salt-minion processes on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:55:08] <icinga-wm>	 PROBLEM - Check size of conntrack table on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:55:17] <icinga-wm>	 PROBLEM - RAID on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:58:52] <akosiaris>	 !log powercycle mw1142, console available but not ever prompting for the root password, stuck at username
[19:58:56] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:00:04] <jouncebot>	 gwicke cscott arlolra subbu bearND mdholloway: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160321T2000).
[20:00:57] <icinga-wm>	 RECOVERY - dhclient process on mw1142 is OK: PROCS OK: 0 processes with command name dhclient
[20:01:07] <icinga-wm>	 RECOVERY - HHVM processes on mw1142 is OK: PROCS OK: 6 processes with command name hhvm
[20:01:07] <icinga-wm>	 RECOVERY - HHVM rendering on mw1142 is OK: HTTP OK: HTTP/1.1 200 OK - 69920 bytes in 1.198 second response time
[20:01:16] <icinga-wm>	 RECOVERY - configured eth on mw1142 is OK: OK - interfaces up
[20:01:36] <icinga-wm>	 RECOVERY - SSH on mw1142 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0)
[20:01:46] <icinga-wm>	 RECOVERY - Disk space on mw1142 is OK: DISK OK
[20:01:57] <icinga-wm>	 RECOVERY - nutcracker process on mw1142 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[20:02:06] <icinga-wm>	 RECOVERY - Apache HTTP on mw1142 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.106 second response time
[20:02:07] <icinga-wm>	 RECOVERY - nutcracker port on mw1142 is OK: TCP OK - 0.000 second response time on port 11212
[20:02:16] <icinga-wm>	 RECOVERY - salt-minion processes on mw1142 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[20:02:17] <icinga-wm>	 RECOVERY - Check size of conntrack table on mw1142 is OK: OK: nf_conntrack is 9 % full
[20:02:27] <icinga-wm>	 RECOVERY - RAID on mw1142 is OK: OK: no RAID installed
[20:02:28] <icinga-wm>	 RECOVERY - DPKG on mw1142 is OK: All packages OK
[20:03:48] <icinga-wm>	 PROBLEM - torrus.wikimedia.org UI on netmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - string Torrus Top: Wikimedia not found on https://torrus.wikimedia.org:443/torrus - 1140 bytes in 0.044 second response time
[20:04:30] <grrrit-wm>	 (PS1) Ladsgroup: Flake8 and fix bug in phabricator [puppet] - https://gerrit.wikimedia.org/r/278754
[20:04:37] <icinga-wm>	 RECOVERY - puppet last run on mw1142 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[20:05:37] <icinga-wm>	 RECOVERY - torrus.wikimedia.org UI on netmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 2493 bytes in 0.110 second response time
[20:06:13] <grrrit-wm>	 (PS1) Ladsgroup: Flake8 for osm [puppet] - https://gerrit.wikimedia.org/r/278755
[20:12:27] <icinga-wm>	 PROBLEM - cassandra CQL 10.192.32.125:9042 on restbase2004 is CRITICAL: Connection refused
[20:12:46] <icinga-wm>	 PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 55.17% of data above the critical threshold [5000000.0]
[20:13:07] <icinga-wm>	 PROBLEM - cassandra service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[20:23:16] <icinga-wm>	 RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[20:30:06] <icinga-wm>	 PROBLEM - DPKG on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:30:07] <icinga-wm>	 PROBLEM - puppet last run on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:30:07] <icinga-wm>	 PROBLEM - RAID on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:31:47] <icinga-wm>	 RECOVERY - DPKG on analytics1047 is OK: All packages OK
[20:31:48] <icinga-wm>	 RECOVERY - puppet last run on analytics1047 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[20:31:48] <icinga-wm>	 RECOVERY - RAID on analytics1047 is OK: OK: optimal, 13 logical, 14 physical
[20:32:37] <icinga-wm>	 RECOVERY - cassandra service on restbase2004 is OK: OK - cassandra is active
[20:33:47] <icinga-wm>	 RECOVERY - cassandra CQL 10.192.32.125:9042 on restbase2004 is OK: TCP OK - 0.036 second response time on port 9042
[20:36:47] <grrrit-wm>	 (CR) Alexandros Kosiaris: [C: 2] Flake8 and fix bug in phabricator [puppet] - https://gerrit.wikimedia.org/r/278754 (owner: Ladsgroup)
[20:36:53] <grrrit-wm>	 (PS2) Alexandros Kosiaris: Flake8 and fix bug in phabricator [puppet] - https://gerrit.wikimedia.org/r/278754 (owner: Ladsgroup)
[20:37:06] <grrrit-wm>	 (CR) Alexandros Kosiaris: [C: 2] Flake8 for osm [puppet] - https://gerrit.wikimedia.org/r/278755 (owner: Ladsgroup)
[20:37:09] <grrrit-wm>	 (CR) Alexandros Kosiaris: [V: 2] Flake8 and fix bug in phabricator [puppet] - https://gerrit.wikimedia.org/r/278754 (owner: Ladsgroup)
[20:37:32] <grrrit-wm>	 (PS2) Alexandros Kosiaris: Flake8 for osm [puppet] - https://gerrit.wikimedia.org/r/278755 (owner: Ladsgroup)
[20:37:36] <grrrit-wm>	 (CR) Alexandros Kosiaris: [V: 2] Flake8 for osm [puppet] - https://gerrit.wikimedia.org/r/278755 (owner: Ladsgroup)
[20:39:11] <grrrit-wm>	 (PS1) Alexandros Kosiaris: Add the role::ores::redis class [puppet] - https://gerrit.wikimedia.org/r/278758
[20:39:13] <grrrit-wm>	 (PS1) Alexandros Kosiaris: Add the role::ores::redis class to oresdb100{1,2} [puppet] - https://gerrit.wikimedia.org/r/278759 (https://phabricator.wikimedia.org/T125562)
[20:46:42] <Amir1>	 akosiaris: thank you :)
[21:02:58] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:03:47] <icinga-wm>	 PROBLEM - RAID on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:07:57] <icinga-wm>	 PROBLEM - cassandra service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[21:08:58] <icinga-wm>	 PROBLEM - cassandra CQL 10.192.32.125:9042 on restbase2004 is CRITICAL: Connection refused
[21:09:57] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1047 is OK: OK: YARN NodeManager analytics1047.eqiad.wmnet:8041 Node-State: RUNNING
[21:10:37] <icinga-wm>	 RECOVERY - RAID on analytics1047 is OK: OK: optimal, 13 logical, 14 physical
[21:10:49] <grrrit-wm>	 (CR) Alexandros Kosiaris: "Not against this per se, but what is the rationale behind this ?" [puppet] - https://gerrit.wikimedia.org/r/278318 (owner: Ori.livneh)
[21:32:47] <icinga-wm>	 RECOVERY - cassandra service on restbase2004 is OK: OK - cassandra is active
[21:33:46] <icinga-wm>	 RECOVERY - cassandra CQL 10.192.32.125:9042 on restbase2004 is OK: TCP OK - 0.039 second response time on port 9042
[22:09:45] <grrrit-wm>	 (PS1) Ladsgroup: Flake8 on openstack, part I [puppet] - https://gerrit.wikimedia.org/r/278761
[22:12:28] <icinga-wm>	 PROBLEM - cassandra CQL 10.192.32.125:9042 on restbase2004 is CRITICAL: Connection refused
[22:13:27] <icinga-wm>	 PROBLEM - cassandra service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[22:13:47] <grrrit-wm>	 (PS1) Ladsgroup: Flake8 for HHVM [puppet] - https://gerrit.wikimedia.org/r/278762
[22:15:10] <grrrit-wm>	 (PS2) Ladsgroup: Flake8 for HHVM [puppet] - https://gerrit.wikimedia.org/r/278762
[22:25:07] <grrrit-wm>	 (PS1) Ladsgroup: flake8 on icinga [puppet] - https://gerrit.wikimedia.org/r/278763
[22:26:27] <grrrit-wm>	 (CR) Ori.livneh: "Still not sure I need it, to be honest. I was going to look into setting the 'Server: ' header to the app server hostname, instead of just" [puppet] - https://gerrit.wikimedia.org/r/278318 (owner: Ori.livneh)
[22:31:16] <icinga-wm>	 RECOVERY - cassandra service on restbase2004 is OK: OK - cassandra is active
[22:33:47] <icinga-wm>	 RECOVERY - cassandra CQL 10.192.32.125:9042 on restbase2004 is OK: TCP OK - 0.042 second response time on port 9042
[22:56:40] <grrrit-wm>	 (PS3) Alexandros Kosiaris: stdlib: import deep_merge function [puppet] - https://gerrit.wikimedia.org/r/278241
[22:56:42] <grrrit-wm>	 (PS2) Alexandros Kosiaris: Apply the role::ores::redis class to oresdb100{1,2} [puppet] - https://gerrit.wikimedia.org/r/278759 (https://phabricator.wikimedia.org/T125562)
[22:56:44] <grrrit-wm>	 (PS2) Alexandros Kosiaris: Add the role::ores::redis class [puppet] - https://gerrit.wikimedia.org/r/278758 (https://phabricator.wikimedia.org/T124200)
[22:56:46] <grrrit-wm>	 (PS1) Alexandros Kosiaris: ores: Collapse the redis configs into one stanza [puppet] - https://gerrit.wikimedia.org/r/278836 (https://phabricator.wikimedia.org/T124200)
[22:58:03] <grrrit-wm>	 (Abandoned) Alexandros Kosiaris: ores: Collapse the redis configs into one stanza [puppet] - https://gerrit.wikimedia.org/r/278242 (https://phabricator.wikimedia.org/T124200) (owner: Alexandros Kosiaris)
[22:58:21] <grrrit-wm>	 (Abandoned) Alexandros Kosiaris: ores: define slaveof as a parameter [puppet] - https://gerrit.wikimedia.org/r/278243 (https://phabricator.wikimedia.org/T124200) (owner: Alexandros Kosiaris)
[23:00:04] <jouncebot>	 RoanKattouw ostriches Krenair MaxSem: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160321T2300).
[23:03:30] <Krenair>	 no patches listed
[23:04:38] <grrrit-wm>	 (PS3) Alexandros Kosiaris: Apply the role::ores::redis class to oresdb100{1,2} [puppet] - https://gerrit.wikimedia.org/r/278759 (https://phabricator.wikimedia.org/T125562)
[23:15:55] <grrrit-wm>	 (PS4) Alexandros Kosiaris: Apply the role::ores::redis class to oresdb100{1,2} [puppet] - https://gerrit.wikimedia.org/r/278759 (https://phabricator.wikimedia.org/T125562)
[23:26:15] <grrrit-wm>	 (PS2) Alexandros Kosiaris: ores: Collapse the redis configs into one stanza [puppet] - https://gerrit.wikimedia.org/r/278836 (https://phabricator.wikimedia.org/T124200)
[23:26:17] <grrrit-wm>	 (PS5) Alexandros Kosiaris: Apply the role::ores::redis class to oresdb100{1,2} [puppet] - https://gerrit.wikimedia.org/r/278759 (https://phabricator.wikimedia.org/T125562)
[23:26:19] <grrrit-wm>	 (PS3) Alexandros Kosiaris: Add the role::ores::redis class [puppet] - https://gerrit.wikimedia.org/r/278758 (https://phabricator.wikimedia.org/T124200)
[23:28:16] <icinga-wm>	 PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[23:40:46] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[23:46:02] <grrrit-wm>	 (CR) Alexandros Kosiaris: [C: 2] flake8 on icinga [puppet] - https://gerrit.wikimedia.org/r/278763 (owner: Ladsgroup)
[23:47:47] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[23:51:57] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s3 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 89413.00 seconds