[00:08:02] (03PS2) 10Reedy: Add my pgp key to https://www.mediawiki.org/keys/keys.html [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391706 (owner: 10Brian Wolff)
[00:09:00] (03PS3) 10Reedy: Add my pgp key to https://www.mediawiki.org/keys/keys.html [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391706 (owner: 10Brian Wolff)
[00:09:04] (03CR) 10Reedy: [C: 032] Add my pgp key to https://www.mediawiki.org/keys/keys.html [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391706 (owner: 10Brian Wolff)
[00:10:25] (03Merged) 10jenkins-bot: Add my pgp key to https://www.mediawiki.org/keys/keys.html [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391706 (owner: 10Brian Wolff)
[00:12:05] !log reedy@tin Synchronized docroot/mediawiki/keys/: Add Brian Wolff's key (duration: 00m 45s)
[00:12:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:33:01] (03CR) 10Legoktm: "Nooo this worked against my changes to stick the current release managers to the top of keys.html, and label all the keys in keys.txt :/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391706 (owner: 10Brian Wolff)
[01:34:34] (03CR) 10Legoktm: "Uh yeah this got broken in the rebase :(" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391706 (owner: 10Brian Wolff)
[01:37:14] (03PS1) 10Legoktm: Fix up 611a3b6cba28342c [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393169
[01:37:57] (03CR) 10Legoktm: "Follow-up Change-Id: I11e64e9fe2b0d24f1fecd19209a29588f803166f" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391706 (owner: 10Brian Wolff)
[02:24:10] PROBLEM - cassandra-b service on restbase1014 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed
[02:24:20] PROBLEM - cassandra-c service on restbase1014 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed
[02:24:30] PROBLEM - cassandra-c CQL 10.64.48.137:9042 on restbase1014 is CRITICAL: connect to address 10.64.48.137 and port 9042: Connection refused
[02:24:30] PROBLEM - cassandra-c SSL 10.64.48.137:7001 on restbase1014 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[02:24:39] PROBLEM - Check systemd state on restbase1014 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:24:49] PROBLEM - cassandra-a service on restbase1014 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[03:07:20] ACKNOWLEDGEMENT - cassandra-c CQL 10.64.48.137:9042 on restbase1014 is CRITICAL: connect to address 10.64.48.137 and port 9042: Connection refused eevans not an outage scheduled maintenance expired
[03:07:20] ACKNOWLEDGEMENT - cassandra-c SSL 10.64.48.137:7001 on restbase1014 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused eevans not an outage scheduled maintenance expired
[03:09:10] ACKNOWLEDGEMENT - Check systemd state on restbase1014 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. eevans not an outage scheduled maintenance expired
[03:09:10] ACKNOWLEDGEMENT - cassandra-a service on restbase1014 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed eevans not an outage scheduled maintenance expired
[03:09:10] ACKNOWLEDGEMENT - cassandra-b service on restbase1014 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed eevans not an outage scheduled maintenance expired
[03:09:10] ACKNOWLEDGEMENT - cassandra-c service on restbase1014 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed eevans not an outage scheduled maintenance expired
[03:10:42] (03PS3) 10TerraCodes: Remove single editor tab for plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393121 (https://phabricator.wikimedia.org/T181045)
[03:10:56] (03PS11) 10TerraCodes: Add loginwiki and wikidata to $wgLocalVirtualHosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392999 (https://phabricator.wikimedia.org/T117302)
[03:25:09] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 606.25 seconds
[03:57:19] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 271.08 seconds
[03:58:19] PROBLEM - Host cp3048 is DOWN: PING CRITICAL - Packet loss = 100%
[04:00:09] RECOVERY - Host cp3048 is UP: PING OK - Packet loss = 0%, RTA = 83.82 ms
[05:25:49] PROBLEM - HP RAID on db2055 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:2 - OK: 1I:1:1, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Controller: OK - Battery/Capacitor: OK
[05:25:51] ACKNOWLEDGEMENT - HP RAID on db2055 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:2 - OK: 1I:1:1, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T181266
[05:25:55] 10Operations, 10ops-codfw: Degraded RAID on db2055 - https://phabricator.wikimedia.org/T181266#3784747 (10ops-monitoring-bot)
[06:19:26] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2055 - https://phabricator.wikimedia.org/T181266#3784765 (10Marostegui) a:03Papaul @Papaul can we get this replaced? Thanks!
[06:25:44] (03PS1) 10Marostegui: db-eqiad.php: Repool db1101:3318 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393173 (https://phabricator.wikimedia.org/T178359)
[06:27:56] (03PS1) 10Marostegui: s5.hosts: Update db1101 port [software] - 10https://gerrit.wikimedia.org/r/393174 (https://phabricator.wikimedia.org/T178359)
[06:28:40] (03CR) 10Marostegui: [C: 032] s5.hosts: Update db1101 port [software] - 10https://gerrit.wikimedia.org/r/393174 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui)
[06:29:21] (03Merged) 10jenkins-bot: s5.hosts: Update db1101 port [software] - 10https://gerrit.wikimedia.org/r/393174 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui)
[06:43:38] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1101:3318 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393173 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui)
[06:45:06] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1101:3318 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393173 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui)
[06:46:29] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Slowly pool db1101:3318 in s5 to warm it up - T178359 (duration: 00m 49s)
[06:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:38] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359
[06:47:22] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Slowly pool db1101:3318 in s5 to warm it up - T178359 (duration: 00m 45s)
[06:47:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:52:35] (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Pool db1097 in s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393175 (https://phabricator.wikimedia.org/T178359)
[06:55:31] (03PS1) 10Marostegui: s5.hosts: Add db1097:3315 [software] - 10https://gerrit.wikimedia.org/r/393176 (https://phabricator.wikimedia.org/T178359)
[06:56:29] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[06:57:05] (03CR) 10Marostegui: [C: 032] s5.hosts: Add db1097:3315 [software] - 10https://gerrit.wikimedia.org/r/393176 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui)
[06:57:19] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy
[06:57:46] (03Merged) 10jenkins-bot: s5.hosts: Add db1097:3315 [software] - 10https://gerrit.wikimedia.org/r/393176 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui)
[06:58:56] (03PS1) 10Marostegui: db1097.yaml: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/393177 (https://phabricator.wikimedia.org/T178359)
[07:03:06] (03CR) 10Marostegui: [C: 032] db1097.yaml: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/393177 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui)
[07:22:00] PROBLEM - puppet last run on lvs2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:22:49] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[07:23:40] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy
[07:48:32] (03CR) 10Hashar: [C: 031] Remove experimental component from contintcloud [puppet] - 10https://gerrit.wikimedia.org/r/393081 (owner: 10Muehlenhoff)
[07:52:02] RECOVERY - puppet last run on lvs2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:53:34] (03PS2) 10Muehlenhoff: Remove experimental component from contintcloud [puppet] - 10https://gerrit.wikimedia.org/r/393081
[07:55:28] (03CR) 10Muehlenhoff: [C: 032] Remove experimental component from contintcloud [puppet] - 10https://gerrit.wikimedia.org/r/393081 (owner: 10Muehlenhoff)
[08:07:22] !log re-enabling piwik on bohrium (only VM running on ganeti1006 atm) after mysql tables restore completed
[08:07:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:07] !log restarting jenkins on contint1001 for a java update
[08:14:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:32] PROBLEM - puppet last run on stat1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_analytics/reportupdater]
[08:15:59] !log installing java security updates on stat1004
[08:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:23] PROBLEM - jenkins_zmq_publisher on contint1001 is CRITICAL: connect to address 127.0.0.1 and port 8888: Connection refused
[08:17:04] grrr
[08:17:07] that is a valid alarm
[08:20:00] !log installing java security updates on meitnerium
[08:20:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:24] (03CR) 10Jcrespo: "I can only restart 1 server on my own unless I get help to do it at the same time. It can be done later- even you can help! :-)" [puppet] - 10https://gerrit.wikimedia.org/r/393102 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo)
[08:22:47] (03CR) 10Marostegui: [C: 031] "> I can only restart 1 server on my own unless I get help to do it at" [puppet] - 10https://gerrit.wikimedia.org/r/393102 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo)
[08:23:03] ACKNOWLEDGEMENT - jenkins_zmq_publisher on contint1001 is CRITICAL: connect to address 127.0.0.1 and port 8888: Connection refused amusso ZeroMQ is not loading for some reason
[08:24:17] (03CR) 10Jcrespo: "> > I can only restart 1 server on my own unless I get help to do it" [puppet] - 10https://gerrit.wikimedia.org/r/393102 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo)
[08:24:32] RECOVERY - jenkins_zmq_publisher on contint1001 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 8888
[08:28:15] (03PS1) 10Marostegui: db1092.yaml: Remove old socket location [puppet] - 10https://gerrit.wikimedia.org/r/393180
[08:28:41] (03CR) 10Marostegui: [C: 031] "> > > I can only restart 1 server on my own unless I get help to do" [puppet] - 10https://gerrit.wikimedia.org/r/393102 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo)
[08:31:08] (03CR) 10jenkins-bot: Add my pgp key to https://www.mediawiki.org/keys/keys.html [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391706 (owner: 10Brian Wolff)
[08:31:53] (03PS2) 10Jcrespo: mariadb: Move some (only the single-instance) s5 hosts to s8 [puppet] - 10https://gerrit.wikimedia.org/r/393102 (https://phabricator.wikimedia.org/T177208)
[08:34:03] (03CR) 10Jcrespo: [C: 031] db-eqiad,db-codfw.php: Pool db1097 in s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393175 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui)
[08:34:23] (03CR) 10Marostegui: [C: 032] db-eqiad,db-codfw.php: Pool db1097 in s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393175 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui)
[08:34:32] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:35:51] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Pool db1097 in s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393175 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui)
[08:36:15] (03PS1) 10Elukey: profile::mariadb::misc::el::replication: fix logrotate config [puppet] - 10https://gerrit.wikimedia.org/r/393182
[08:37:01] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Slowly pool db1097:3315 - T178359 (duration: 00m 45s)
[08:37:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:09] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359
[08:38:27] (03CR) 10Jcrespo: [C: 032] mariadb: Move some (only the single-instance) s5 hosts to s8 [puppet] - 10https://gerrit.wikimedia.org/r/393102 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo)
[08:38:41] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Slowly pool db1097:3315 - T178359 (duration: 00m 45s)
[08:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:52] (03CR) 10Elukey: [C: 032] profile::mariadb::misc::el::replication: fix logrotate config [puppet] - 10https://gerrit.wikimedia.org/r/393182 (owner: 10Elukey)
[08:38:58] (03PS2) 10Elukey: profile::mariadb::misc::el::replication: fix logrotate config [puppet] - 10https://gerrit.wikimedia.org/r/393182
[08:40:19] (03PS2) 10Marostegui: db1092.yaml: Remove old socket location [puppet] - 10https://gerrit.wikimedia.org/r/393180
[08:40:56] !log installing java security updates on notebook* hosts
[08:41:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:43] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1101:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393185 (https://phabricator.wikimedia.org/T177208)
[08:43:10] jynus: ^ I will also depool db1092
[08:43:56] (03CR) 10jenkins-bot: Remove wmgRelatedSitesPrefixes intermediatary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393166 (owner: 10Reedy)
[08:44:23] (03PS3) 10Marostegui: db1092.yaml: Remove old socket location [puppet] - 10https://gerrit.wikimedia.org/r/393180
[08:45:06] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic for db1101:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393185 (https://phabricator.wikimedia.org/T177208) (owner: 10Marostegui)
[08:45:32] this was a bad idea, the eqiad hosts will alert
[08:46:00] as they have not yet been topologically moved
[08:46:27] mmm what they would alert for?
[08:46:31] (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for db1101:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393185 (https://phabricator.wikimedia.org/T177208) (owner: 10Marostegui)
[08:47:52] actually, they won't alert, which is almost as bad
[08:48:03] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1101:3318 in s5 to warm it up and depool db1092 - T178359 T177208 (duration: 00m 45s)
[08:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:11] T177208: Provide dedicated database resources for wikidata - https://phabricator.wikimedia.org/T177208
[08:48:11] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359
[08:48:14] But why should they alert? nothing is down
[08:48:28] lag from s8
[08:48:36] Ah, the pt-heartbeat
[08:48:49] but when there is no row found, they fall back to show slave status
[08:48:58] so it kinda works as intended
[08:49:07] !log Stop MySQL on db1092
[08:49:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:27] but we should depool all s8 servers on eqiad
[08:50:23] Yeah, but if we are aiming to do the split on tuesday we need to warm them up at least on monday
[08:50:38] (03CR) 10Muehlenhoff: [C: 031] openldap: move firewall/standard to roles, use profile [puppet] - 10https://gerrit.wikimedia.org/r/391737 (owner: 10Dzahn)
[08:50:43] (03CR) 10Marostegui: [C: 032] db1092.yaml: Remove old socket location [puppet] - 10https://gerrit.wikimedia.org/r/393180 (owner: 10Marostegui)
[08:59:21] (03CR) 10Muehlenhoff: [C: 031] mediawiki:appserver:api: move firewall to role, use profile [puppet] - 10https://gerrit.wikimedia.org/r/391731 (owner: 10Dzahn)
[09:25:12] (03PS1) 10Marostegui: db-eqiad.php: Increase weight:db1101,db1092,db1097 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393193
[09:26:13] (03PS2) 10Marostegui: db-eqiad.php: Increase weight:db1101,db1092,db1097 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393193
[09:29:02] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase weight:db1101,db1092,db1097 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393193 (owner: 10Marostegui)
[09:30:25] (03Merged) 10jenkins-bot: db-eqiad.php: Increase weight:db1101,db1092,db1097 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393193 (owner: 10Marostegui)
[09:31:39] (03CR) 10Thiemo Mättig (WMDE): [C: 031] "Looks good to me. In comparison I do not see any difference other than the logo becoming black(ish). The smallest version also has a new r" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376150 (https://phabricator.wikimedia.org/T174603) (owner: 10Odder)
[09:32:23] (03CR) 10Hashar: [C: 032] Add .gitreview [software/conftool] - 10https://gerrit.wikimedia.org/r/392795 (owner: 10Hashar)
[09:32:31] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1101:3318 db1097:3315 and db1092 (duration: 00m 45s)
[09:32:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:58] (03CR) 10jenkins-bot: Remove GettingStarted intermediate variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393167 (owner: 10Reedy)
[09:33:00] (03CR) 10jenkins-bot: wgPageImagesExpandOpenSearchXml: drop intermediate $wmg setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393122 (owner: 10Chad)
[09:33:02] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1101:3318 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393173 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui)
[09:34:42] Hello!
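The lag alerts and the pt-heartbeat discussion above (a lag check reads the newest heartbeat row the replica has replayed and falls back to SHOW SLAVE STATUS when no matching row exists) can be illustrated with a small sketch. The thresholds below are assumptions chosen only to be consistent with the two dbstore1002 alerts in this log (606.25 s CRITICAL, 271.08 s OK); the production check's limits may differ.

```python
from datetime import datetime, timezone

def lag_from_heartbeat(now, heartbeat_ts):
    """Replication lag as the age of the newest heartbeat row
    (pt-heartbeat writes one on the master at a fixed interval)."""
    return (now - heartbeat_ts).total_seconds()

def classify_lag(lag_seconds, warn=300.0, crit=600.0):
    """Map a lag value to an Icinga-style status.
    warn/crit are hypothetical, not the real check's thresholds."""
    if lag_seconds >= crit:
        return "CRITICAL"
    if lag_seconds >= warn:
        return "WARNING"
    return "OK"
```

With these assumed limits, the 03:25 alert (606.25 s) classifies as CRITICAL and the 03:57 recovery (271.08 s) as OK, matching the log.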
[09:35:07] (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Pool db1097 in s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393175 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui)
[09:36:02] (03CR) 10Hashar: "recheck" [software/conftool] - 10https://gerrit.wikimedia.org/r/387544 (owner: 10Volans)
[09:36:05] (03CR) 10Hashar: "recheck" [software/conftool] - 10https://gerrit.wikimedia.org/r/392793 (owner: 10Hashar)
[09:36:07] (03CR) 10Hashar: [C: 032] "recheck" [software/conftool] - 10https://gerrit.wikimedia.org/r/392795 (owner: 10Hashar)
[09:37:05] Can I use jsub / jstop from a job which was started in the cloud by jsub? Background: For whatever reason some program hangs and I want to kill it and restart it automagically
[09:40:32] (03PS1) 10Ema: cache_misc: use grafana.w.o instead of git.w.o in VTC tests [puppet] - 10https://gerrit.wikimedia.org/r/393195
[09:49:49] (03PS1) 10Marostegui: db-eqiad.php: Increase weight for db1097, db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393197
[09:50:05] !log restarting db2045
[09:50:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:35] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase weight for db1097, db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393197 (owner: 10Marostegui)
[09:54:55] (03Merged) 10jenkins-bot: db-eqiad.php: Increase weight for db1097, db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393197 (owner: 10Marostegui)
[09:55:12] (03CR) 10Ema: [C: 032] cache_misc: use grafana.w.o instead of git.w.o in VTC tests [puppet] - 10https://gerrit.wikimedia.org/r/393195 (owner: 10Ema)
[09:56:34] (03CR) 10Muehlenhoff: kmod::blacklist: prevent manual install, update initramfs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/392644 (owner: 10BBlack)
[09:56:48] (03CR) 10Muehlenhoff: [C: 031] kmod::blacklist: prevent manual install, update initramfs [puppet] - 10https://gerrit.wikimedia.org/r/392644 (owner: 10BBlack)
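On the jsub/jstop question asked at 09:37:05 above: since jsub and jstop are ordinary command-line wrappers, a job submitted with jsub can itself invoke them to kill and resubmit a hung sibling job. A hedged sketch of a watchdog helper that only builds the two command lines; the `-N` and `-once` flags are the commonly documented ones, but verify against the current Toolforge documentation before relying on them.

```python
def restart_commands(job_name, command, once=True):
    """Build the jstop/jsub invocations a watchdog could run (e.g. via
    subprocess.run) to kill and resubmit a hung grid job. The jsub
    flags here are assumptions based on common usage, not a verified
    reference for the current Toolforge tooling."""
    stop = ["jstop", job_name]
    submit = ["jsub", "-N", job_name]
    if once:
        submit.append("-once")  # refuse to start a second copy
    submit.extend(command)
    return stop, submit
```

A watchdog job would run the `stop` command, wait for the grid to reap the old job, then run `submit`.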
[09:57:06] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1097:3315 and db1092 (duration: 00m 45s)
[09:57:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:21:12] (03PS1) 10Jcrespo: mariadb: move db2085:s5 to db2085:s8 [puppet] - 10https://gerrit.wikimedia.org/r/393203 (https://phabricator.wikimedia.org/T177208)
[10:21:48] (03PS1) 10Marostegui: db-eqiad.php: Depool all future s8 hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393204 (https://phabricator.wikimedia.org/T177208)
[10:22:37] !log installing ca-certificates updates on trusty hosts
[10:22:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:51] (03PS1) 10Jcrespo: mariadb: Move db2086:s5 to db2086:s8 [puppet] - 10https://gerrit.wikimedia.org/r/393205 (https://phabricator.wikimedia.org/T177208)
[10:26:20] (03PS2) 10Marostegui: db-eqiad.php: Depool all future s8 hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393204 (https://phabricator.wikimedia.org/T177208)
[10:29:25] I think that looks ok
[10:29:33] (03CR) 10Jcrespo: [C: 031] db-eqiad.php: Depool all future s8 hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393204 (https://phabricator.wikimedia.org/T177208) (owner: 10Marostegui)
[10:29:33] let me give it another final look
[10:30:45] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool all future s8 hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393204 (https://phabricator.wikimedia.org/T177208) (owner: 10Marostegui)
[10:31:28] (03CR) 10Marostegui: [C: 031] mariadb: move db2085:s5 to db2085:s8 [puppet] - 10https://gerrit.wikimedia.org/r/393203 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo)
[10:31:58] (03CR) 10Marostegui: [C: 031] mariadb: Move db2086:s5 to db2086:s8 [puppet] - 10https://gerrit.wikimedia.org/r/393205 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo)
[10:32:26] (03Merged) 10jenkins-bot: db-eqiad.php: Depool all future s8 hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393204 (https://phabricator.wikimedia.org/T177208) (owner: 10Marostegui)
[10:33:35] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool all future s8 slaves for a topology change - T177208 (duration: 00m 45s)
[10:33:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:43] T177208: Provide dedicated database resources for wikidata - https://phabricator.wikimedia.org/T177208
[10:35:59] !log restarting db2085 (including both s5 and s3 instances)
[10:36:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:19] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 5 others: Select candidate jobs for transferring to the new infrastucture - https://phabricator.wikimedia.org/T175210#3785194 (10Pchelolo) The `wikibase-UpdateUsagesForPage` job sounds like a perfect candidate to be the next one. It's ~220 jobs/s on...
[10:37:10] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 5 others: Select candidate jobs for transferring to the new infrastucture - https://phabricator.wikimedia.org/T175210#3785198 (10Pchelolo)
[10:37:49] !log cancelling db2085 restart, only doing mysql:s5
[10:37:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:38:38] (03CR) 10Jcrespo: [C: 032] mariadb: move db2085:s5 to db2085:s8 [puppet] - 10https://gerrit.wikimedia.org/r/393203 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo)
[10:42:42] PROBLEM - puppet last run on californium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[ca-certificates]
[10:44:12] jynus: you want me to take care of db2086?
[10:44:39] no, one at a time
[10:44:43] oki
[10:50:29] so I moved manually sqldata, tmp and removed the old s5.cfg then run puppet
[10:50:38] anything else I am missing?
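The db-eqiad.php edits throughout this log (repool with low weight, increase traffic, depool) all amount to changing a replica's read weight in a section's load map. A minimal model of that pattern; the dict shape loosely mirrors wmf-config's sectionLoads, but host names and values here are illustrative only.

```python
def set_weight(section_loads, section, host, weight):
    """Return a copy of a sectionLoads-style mapping with one host's
    read weight changed: 0 effectively depools a replica, a small
    value warms it up, larger values shift more traffic onto it.
    This models the db-eqiad.php edits above; it is not wmf-config code."""
    loads = dict(section_loads.get(section, {}))
    loads[host] = weight
    return {**section_loads, section: loads}
```

Warming up a freshly provisioned instance then means calling this repeatedly with increasing weights, exactly the sequence of commits seen above.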
[10:50:58] aside from removal of old nagios check files, which I never do
[10:51:10] I think that should be it
[10:51:19] I ran puppet manually on einsteinium
[10:51:23] but that is not really a big thing
[10:51:24] I already started it
[10:51:50] but I was worried there was some reference to shard on relay or binlog name
[10:52:04] Yeah, I was worried yesterday with that too when I moved db1101
[10:52:05] hehe
[10:52:12] ok
[10:52:29] I think it is ok I did it myself, so now both know about it
[10:52:40] you or me can do the other one, I think
[10:53:00] I will do it then
[10:53:07] replication caught up, and alerts should go away when einsteinium finishes
[10:53:27] great, I will take care of db2086
[10:54:12] ah, tendril change
[10:54:16] and of course the dblists
[10:54:22] *hosts
[10:54:33] I think prometheus was updated in advance
[10:54:44] and codfw.php
[10:55:01] eqiad, too
[10:55:39] !log Restart MySQL on db2086 to move s5 to s8
[10:55:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:57:27] (03CR) 10Marostegui: [C: 032] mariadb: Move db2086:s5 to db2086:s8 [puppet] - 10https://gerrit.wikimedia.org/r/393205 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo)
[11:01:53] (03PS1) 10Jcrespo: [WIP]mariadb: Change db208[56]:3315 to port 3318; repool db2038 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393208 (https://phabricator.wikimedia.org/T177208)
[11:02:19] db2086 is done
[11:03:03] (03PS2) 10Jcrespo: mariadb: Change db208[56]:3315 to port 3318; repool db2038 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393208 (https://phabricator.wikimedia.org/T177208)
[11:03:05] new patch^
[11:03:18] (03PS1) 10Ppchelko: Remove RESTBase jobs config [puppet] - 10https://gerrit.wikimedia.org/r/393209
[11:04:11] (03CR) 10Giuseppe Lavagetto: First version of the helm chart scaffolding for production services (0313 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/392619 (https://phabricator.wikimedia.org/T177397) (owner: 10Giuseppe Lavagetto)
[11:04:27] (03PS2) 10Giuseppe Lavagetto: First version of the helm chart scaffolding for production services [deployment-charts] - 10https://gerrit.wikimedia.org/r/392619 (https://phabricator.wikimedia.org/T177397)
[11:04:33] (03CR) 10Marostegui: mariadb: Change db208[56]:3315 to port 3318; repool db2038 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393208 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo)
[11:07:42] (03PS3) 10Jcrespo: mariadb: Change db208[56]:3315 to port 3318; repool db2038 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393208 (https://phabricator.wikimedia.org/T177208)
[11:09:06] (03CR) 10Marostegui: [C: 031] mariadb: Change db208[56]:3315 to port 3318; repool db2038 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393208 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo)
[11:09:27] (03CR) 10Ppchelko: "I was sure that was already removed, no idea why it got resurrected. These jobs are long long gone" [puppet] - 10https://gerrit.wikimedia.org/r/393209 (owner: 10Ppchelko)
[11:09:44] (03CR) 10Jcrespo: [C: 032] mariadb: Change db208[56]:3315 to port 3318; repool db2038 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393208 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo)
[11:10:22] (03CR) 10Giuseppe Lavagetto: [C: 032] Remove RESTBase jobs config [puppet] - 10https://gerrit.wikimedia.org/r/393209 (owner: 10Ppchelko)
[11:11:11] (03Merged) 10jenkins-bot: mariadb: Change db208[56]:3315 to port 3318; repool db2038 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393208 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo)
[11:12:18] (03PS1) 10Marostegui: s8.hosts: Change port for db2085,db2086 [software] - 10https://gerrit.wikimedia.org/r/393210
[11:12:44] RECOVERY - puppet last run on californium is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:15:09] (03CR) 10Marostegui: [C: 032] s8.hosts: Change port for db2085,db2086 [software] - 10https://gerrit.wikimedia.org/r/393210 (owner: 10Marostegui)
[11:15:50] (03Merged) 10jenkins-bot: s8.hosts: Change port for db2085,db2086 [software] - 10https://gerrit.wikimedia.org/r/393210 (owner: 10Marostegui)
[11:15:54] PROBLEM - Check systemd state on es2018 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:16:14] es2018 ?
[11:16:24] PROBLEM - Disk space on es2018 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:16:24] PROBLEM - Check size of conntrack table on es2018 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:16:28] uyuy
[11:16:36] is it a glitch, or is it crashing?
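The "Check systemd state ... degraded" and "Expecting active but unit cassandra-b is failed" alerts in this log reflect a simple mapping from systemd unit state to Nagios/Icinga status. A minimal sketch of that mapping, assumed for illustration; it is not the production check_systemd plugin.

```python
# Conventional Nagios plugin exit codes.
NAGIOS_EXIT = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}

def check_unit(unit, state, expected="active"):
    """Mirror the alert text above: a unit whose ActiveState is not
    the expected one is CRITICAL. A sketch, not the real check."""
    if state == expected:
        return "OK", f"unit {unit} is {state}"
    return "CRITICAL", f"Expecting {expected} but unit {unit} is {state}"
```

Feeding in the states seen at 02:24 reproduces the alert wording for the restbase1014 cassandra instances.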
[11:16:48] I cannot login [11:17:25] I think it is rebooting or something [11:17:40] that was the only one that didn't crash: T130702 [11:17:43] T130702: Several es20XX servers keep crashing (es2017, es2019, es2015, es2014) since 23 March - https://phabricator.wikimedia.org/T130702 [11:17:43] almost [11:18:08] the console is stuck on [OK ] which looks like the typical service OK from when a server is stopping or rebooting [11:18:14] PROBLEM - DPKG on es2018 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:18:17] let's wait a bit more to see if it does something [11:18:22] else, I will reboot it myself [11:18:24] PROBLEM - Check size of conntrack table on es2018 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:18:41] at least we can log in to the console [11:18:54] hehe yes [11:20:06] can you check hw logs while I finish deployment, so I do not block tin [11:20:12] yep [11:20:14] no worries [11:20:24] interesting: https://grafana.wikimedia.org/dashboard/db/server-board?refresh=1m&orgId=1&var-server=es2018&var-network=eth0 [11:21:11] storage crashed I just got a kernel message [11:21:13] on the console [11:21:14] load incresees on io blockage [11:21:18] [30996087.770298] megaraid_sas 0000:03:00.0: pending commands remain after waiting, will reset adapter scsi0. 
[11:21:21] [30996102.596599] megaraid_sas 0000:03:00.0: Init cmd success [11:21:24] and it is iowaity [11:21:24] RECOVERY - Disk space on es2018 is OK: DISK OK [11:21:24] RECOVERY - Check size of conntrack table on es2018 is OK: OK: nf_conntrack is 0 % full [11:21:33] so most likely RAID issue, as usual [11:21:39] yeah [11:21:50] it is accessible again [11:21:58] without rebooting [11:22:05] looks like some I/O hardware error, I' day [11:22:07] mysql will probably had crashed [11:22:14] RECOVERY - DPKG on es2018 is OK: All packages OK [11:22:37] kernel started to log hung jbd processes at 11:16 (with a 120 seconds interval) [11:23:42] mm [11:23:47] mysql is up [11:23:48] it says mysql is still up? [11:23:57] maybe io didn't crash [11:24:01] just hunged up? [11:24:14] !log jynus@tin Synchronized wmf-config/db-codfw.php: Pool db2038, db2085:3318, db2086:3318 (duration: 00m 45s) [11:24:15] check mysql logs, then stop it [11:24:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:26] i am doing that [11:24:27] I will finish the deploy and depool it [11:24:41] there's also several hung processes for mysql, i.e. mysql was probably unable to complete some writes or reads due to the hardware error [11:25:30] !log jynus@tin Synchronized wmf-config/db-eqiad.php: db2085:3318, db2086:3318 (duration: 00m 43s) [11:25:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:46] !log Restart mysql on es2018 [11:25:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:57] no, don't start it back yet [11:26:58] it is the master [11:27:02] we need a failover [11:28:03] probably it crashed and restarted? [11:28:07] what does the log say? 
[11:29:20] PROBLEM - MariaDB Slave IO: es3 on es1014 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@es2018.codfw.wmnet:3306 - retry-time: 60 retries: 86400 message: Cant connect to MySQL server on es2018.codfw.wmnet (111 Connection refused) [11:29:37] no [11:29:38] it never crashed [11:29:44] oh, ignore that page [11:29:50] it is the replication back [11:29:54] PROBLEM - MariaDB Slave IO: es3 on es2019 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@es2018.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on es2018.codfw.wmnet (111 Connection refused) [11:29:54] PROBLEM - MariaDB Slave IO: es3 on es2017 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@es2018.codfw.wmnet:3306 - retry-time: 60 retries: 86400 message: Cant connect to MySQL server on es2018.codfw.wmnet (111 Connection refused) [11:30:12] I would reboot the server and start mysql again [11:30:13] *forward [11:30:18] wait first [11:30:20] we have time [11:30:24] once we depool [11:31:48] (03PS1) 10Jcrespo: maridb: depool es2018 after crash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393211 (https://phabricator.wikimedia.org/T181293) [11:32:05] (03CR) 10Marostegui: [C: 031] maridb: depool es2018 after crash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393211 (https://phabricator.wikimedia.org/T181293) (owner: 10Jcrespo) [11:32:09] I created T181293 [11:32:09] T181293: es2018 crashed - https://phabricator.wikimedia.org/T181293 [11:32:18] then let's put there everything we know [11:32:25] sounds good [11:32:28] and let's not rush, there is no outage ongoing [11:33:04] let's get the kernel and hw logs there [11:33:12] (03CR) 10jerkins-bot: [V: 04-1] maridb: depool es2018 after crash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393211
(https://phabricator.wikimedia.org/T181293) (owner: 10Jcrespo) [11:33:47] (03PS2) 10Jcrespo: maridb: depool es2018 after crash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393211 (https://phabricator.wikimedia.org/T181293) [11:34:47] let's also get the binlog position of the 2 servers [11:35:32] I cannot see any HW logs on the idrac [11:36:04] 10Operations, 10ops-codfw, 10DBA: Several es20XX servers keep crashing (es2017, es2019, es2015, es2014) since 23 March - https://phabricator.wikimedia.org/T130702#3785349 (10Marostegui) [11:36:07] I am going to drop the replication to es1014 [11:36:18] meaning reset slave all on es1014 [11:36:23] ok [11:37:21] last logged position was es2018-bin.001481:3902469 [11:37:28] although I do not think we care about that [11:37:29] RECOVERY - MariaDB Slave IO: es3 on es1014 is OK: OK slave_io_state not a slave [11:38:50] so you suspect kernel crash rather than hw issue? [11:39:00] or no conclusion yet? [11:39:08] No, I think storage crashed but maybe not as badly [11:39:15] trying to get the syslog logs from lithium [11:39:20] to see if there is something extra there [11:39:25] that was not written to the OS [11:39:26] (03PS1) 10Elukey: [WIP] cdh hadoop defaults refactoring [puppet/cdh] - 10https://gerrit.wikimedia.org/r/393213 [11:39:42] "Controller encountered a fatal error and was reset" [11:39:47] yeah, that supports that [11:39:48] (03CR) 10jerkins-bot: [V: 04-1] [WIP] cdh hadoop defaults refactoring [puppet/cdh] - 10https://gerrit.wikimedia.org/r/393213 (owner: 10Elukey) [11:39:59] the kernel being able to log that [11:40:00] that on the webconsole? [11:40:20] is your own dmesg :-) [11:40:43] Ah yeah, but on the HW there is nothing :) [11:40:46] at least via idrac [11:42:52] let's reboot it? [11:43:12] one sec [11:43:30] let me try to grab the logs from the controller actually [11:44:25] Disk 0 in Backplane 1 of Integrated RAID Controller 1 is inserted.
[11:44:44] basically it is reiniting all disks [11:45:18] (03Abandoned) 10Elukey: [WIP] cdh hadoop defaults refactoring [puppet/cdh] - 10https://gerrit.wikimedia.org/r/393213 (owner: 10Elukey) [11:45:20] but no error is sent before [11:45:31] yeah, nothing on the controller log either [11:46:07] marostegui: let's upgrade it- after all, we have one host to test [11:46:15] both kernel and mariadb [11:46:21] and pool it as a replica [11:46:21] sounds good [11:46:35] *we have to have one host to test [11:46:35] I will upgrade it now [11:47:00] (03CR) 10Jcrespo: [C: 032] maridb: depool es2018 after crash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393211 (https://phabricator.wikimedia.org/T181293) (owner: 10Jcrespo) [11:47:54] so, as far as you can tell, it kept being up and did stop cleanly, right? [11:48:04] !log Reboot es2018 after full-upgrade - T181293 [11:48:06] indeed [11:48:07] it stopped fine [11:48:11] (03PS1) 10Elukey: Allow the configuration of the HDFS Journalnode's jvm settings [puppet/cdh] - 10https://gerrit.wikimedia.org/r/393214 [11:48:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:12] T181293: es2018 crashed - https://phabricator.wikimedia.org/T181293 [11:48:13] and I was able to get it fine [11:48:15] (03Merged) 10jenkins-bot: maridb: depool es2018 after crash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393211 (https://phabricator.wikimedia.org/T181293) (owner: 10Jcrespo) [11:48:16] when it was up [11:48:19] ok, let's restart [11:48:32] going to reboot now [11:48:36] we will put it up and use it to move the slaves [11:48:42] sure [11:48:45] but pool it as a replica [11:48:59] rebooting - I am monitoring also via idrac [11:49:05] its boot, to see if there are any errors [11:50:39] !log jynus@tin Synchronized wmf-config/db-codfw.php: depool es2018 T181293 (duration: 00m 45s) [11:50:43] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/8959/analytics1028.eqiad.wmnet/"
[puppet/cdh] - 10https://gerrit.wikimedia.org/r/393214 (owner: 10Elukey) [11:50:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:59] it booted up cleanly [11:51:05] (03Merged) 10jenkins-bot: Allow the configuration of the HDFS Journalnode's jvm settings [puppet/cdh] - 10https://gerrit.wikimedia.org/r/393214 (owner: 10Elukey) [11:51:14] RECOVERY - Check systemd state on es2018 is OK: OK - running: The system is fully operational [11:51:52] (03PS1) 10Hashar: diamond: skip DiskSpace for Docker containers [puppet] - 10https://gerrit.wikimedia.org/r/393215 (https://phabricator.wikimedia.org/T181295) [11:52:10] Everything looks fine [11:52:14] Going to start mysql and run mysql_upgrade [11:53:05] RECOVERY - MariaDB Slave IO: es3 on es2019 is OK: OK slave_io_state Slave_IO_Running: Yes [11:53:07] RECOVERY - MariaDB Slave IO: es3 on es2017 is OK: OK slave_io_state Slave_IO_Running: Yes [11:53:37] (03PS1) 10Elukey: modules::cdh: update to the latest SHA [puppet] - 10https://gerrit.wikimedia.org/r/393216 [11:53:46] let's not start replication unless you already did that [11:54:00] (03CR) 10jerkins-bot: [V: 04-1] modules::cdh: update to the latest SHA [puppet] - 10https://gerrit.wikimedia.org/r/393216 (owner: 10Elukey) [11:54:16] no [11:54:20] I started with skip-slave [11:54:24] cool [11:56:01] (03PS2) 10Elukey: cdh: update to the latest SHA [puppet] - 10https://gerrit.wikimedia.org/r/393216 [11:56:20] (03CR) 10jerkins-bot: [V: 04-1] cdh: update to the latest SHA [puppet] - 10https://gerrit.wikimedia.org/r/393216 (owner: 10Elukey) [11:56:25] it had an uptime of 587d [11:56:43] let's disable puppet and stop pt-heartbeat there [11:56:52] so it doesn't confuse the other hosts [11:57:03] !log Disable puppet on es2018 - T181293 [11:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:11] T181293: es2018 crashed - https://phabricator.wikimedia.org/T181293 [11:57:38] heartbeat stopped [11:57:45] cool 
[11:58:03] so we are at es2018-bin.001482:138357 everywhere [11:58:26] which host do you want to be the master? [11:58:31] that is es1014-bin.001887:819580774 [11:58:31] (I am doing the puppet patches) [11:58:44] the one that I randomly pooled as such on mediawiki :-) [11:58:48] haha [11:58:49] ok [11:59:03] did I deploy that already? [11:59:40] es2017 as master [11:59:45] you did [11:59:51] so I will move es2019 [12:00:00] cool [12:00:12] to es2017-bin.001476:843479464 [12:00:25] and es2017 to [12:00:31] es1014-bin.001887:819580774 [12:01:30] (03PS3) 10Elukey: cdh: update to the latest SHA [puppet] - 10https://gerrit.wikimedia.org/r/393216 [12:01:37] why is es2017 getting writes? [12:01:42] or its binlog advancing? [12:01:51] it is not [12:01:57] that I can see [12:02:06] ah [12:02:07] it was before, because pt-heartbeat [12:02:09] I logged in to db2017 [12:02:10] XDD [12:02:24] do you guys prefer to avoid any in-flight puppet changes or shall I proceed? [12:02:40] es2017-bin.001476:843479464 -> looks good [12:02:55] elukey: if you can hold a sec, yep, I am about to send a patch [12:03:12] sure [12:03:26] thanks [12:03:49] (03PS1) 10Marostegui: mariadb: Promote es2017 to master [puppet] - 10https://gerrit.wikimedia.org/r/393218 (https://phabricator.wikimedia.org/T181293) [12:04:04] marostegui: do not deploy yet [12:04:10] no [12:04:10] until I change the replication master [12:04:11] no worries [12:04:21] I will also stop mysql on es2018 to update its socket once it is moved [12:04:45] actually, let me amend the patch [12:05:20] (03PS2) 10Marostegui: mariadb: Promote es2017 to master [puppet] - 10https://gerrit.wikimedia.org/r/393218 (https://phabricator.wikimedia.org/T181293) [12:05:48] I will change master to es2018, too [12:05:53] same coords [12:05:56] yep [12:06:43] now you can deploy the new master heartbeat [12:06:48] good [12:06:57] while I repoint the new master itself [12:07:02] that is, es2017 [12:07:10] want me to do it manually or by merging puppet?
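The repoint discussed above is driven by binlog coordinates quoted as file:position pairs (e.g. es2017-bin.001476:843479464). A hedged sketch of splitting such a pair into the arguments a MariaDB CHANGE MASTER TO statement needs; the hostname is taken from the transcript, and the statement is only printed here, never executed. In a real failover you would run it through the mysql client, after STOP SLAVE and after verifying the coordinates match on every replica, as done above.

```shell
# Split a "binlog_file:position" pair into CHANGE MASTER TO arguments.
# The statement is printed, not executed.
coords='es2017-bin.001476:843479464'
file=${coords%:*}     # everything before the last colon: the binlog file
pos=${coords##*:}     # everything after the last colon: the position
printf "CHANGE MASTER TO MASTER_HOST='es2017.codfw.wmnet', MASTER_LOG_FILE='%s', MASTER_LOG_POS=%s;\n" "$file" "$pos"
```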
[12:07:11] up to you [12:07:19] just merge puppet [12:07:22] it should be ok [12:07:23] (03CR) 10Marostegui: "https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/8962/console" [puppet] - 10https://gerrit.wikimedia.org/r/393218 (https://phabricator.wikimedia.org/T181293) (owner: 10Marostegui) [12:07:28] (03CR) 10Marostegui: [C: 032] mariadb: Promote es2017 to master [puppet] - 10https://gerrit.wikimedia.org/r/393218 (https://phabricator.wikimedia.org/T181293) (owner: 10Marostegui) [12:07:56] running puppet on es2017 [12:08:44] heartbeat running [12:09:02] es2018 is catching up fine [12:09:09] with the heartbeat from es2017 [12:09:38] cool [12:09:51] I would do a sanity check of enwiki later [12:09:53] let me know when I can stop mysql on es2018, enable puppet and enable gtid [12:10:06] not a full table check, just to make sure no event has been lost [12:10:27] We also have ROW based, so that is a good sanity check itself too [12:10:29] stop replication? [12:10:39] or mysql? [12:10:41] mysql [12:10:46] to update the socket [12:10:48] ah! [12:10:48] now that it is depooled [12:10:56] ok, any time now [12:10:59] ok [12:11:00] doing it [12:11:13] I assume it is down on icinga already? [12:11:17] yep [12:11:26] we are good now [12:11:41] I will do a quick data check later, but other than that, everything is ok [12:11:45] cool [12:11:55] starting mysql again [12:11:58] I will enable gtid back [12:13:47] !log Enable GTID on es2018 - T181293 [12:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:54] T181293: es2018 crashed - https://phabricator.wikimedia.org/T181293 [12:13:59] all done [12:14:48] elukey: feel free to push anything you like [12:14:50] Thanks for waiting :) [12:15:39] I will do that on the other host, without a restart [12:15:42] *hosts [12:15:48] cool! [12:16:26] marostegui: ack!
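The "enable gtid back" step above switches a replica from the position-based CHANGE MASTER used during the failover back to GTID replication. A sketch of the MariaDB statements typically used for that; they are only printed here (to be fed to the mysql client in practice), and slave_pos is MariaDB's mode that resumes from the GTID the replica had already reached.

```shell
# Print the MariaDB statements for re-enabling GTID after a position-based
# repoint (sketch only; not executed against any server here).
printf '%s\n' \
  'STOP SLAVE;' \
  'CHANGE MASTER TO MASTER_USE_GTID = slave_pos;' \
  'START SLAVE;'
```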
[12:20:47] (03PS3) 10Jcrespo: mariadb: Leave reimaginable only the db latest servers [puppet] - 10https://gerrit.wikimedia.org/r/392400 (https://phabricator.wikimedia.org/T170662) [12:25:29] (03CR) 10Elukey: [C: 032] cdh: update to the latest SHA [puppet] - 10https://gerrit.wikimedia.org/r/393216 (owner: 10Elukey) [12:25:33] (03PS4) 10Elukey: cdh: update to the latest SHA [puppet] - 10https://gerrit.wikimedia.org/r/393216 [12:27:32] (03PS2) 10Reedy: Fix up 611a3b6cba28342c [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393169 (owner: 10Legoktm) [12:27:34] (03CR) 10Reedy: [C: 032] Fix up 611a3b6cba28342c [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393169 (owner: 10Legoktm) [12:28:59] (03Merged) 10jenkins-bot: Fix up 611a3b6cba28342c [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393169 (owner: 10Legoktm) [12:31:10] (03PS1) 10Elukey: Fix prometheus target for the Eventlogging mysql master db [puppet] - 10https://gerrit.wikimedia.org/r/393220 (https://phabricator.wikimedia.org/T177405) [12:32:01] !log reedy@tin Synchronized docroot/mediawiki/keys/: Fixup keys (duration: 00m 45s) [12:32:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:29] (03PS9) 10ArielGlenn: rsync all dumps status files to web servers and unpack them periodically [puppet] - 10https://gerrit.wikimedia.org/r/392875 (https://phabricator.wikimedia.org/T179857) [12:38:18] !log disable puppet on db1071 and stop local s5 heartbeat there [12:38:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:35] (03CR) 10ArielGlenn: [C: 032] rsync all dumps status files to web servers and unpack them periodically [puppet] - 10https://gerrit.wikimedia.org/r/392875 (https://phabricator.wikimedia.org/T179857) (owner: 10ArielGlenn) [12:40:13] !log setting up s8 topology on eqiad [12:40:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:24] ignore errors on dump* and such, fixing grrrr [12:47:57] 
(03PS1) 10ArielGlenn: use right path to dumps status file script [puppet] - 10https://gerrit.wikimedia.org/r/393225 [12:49:12] (03CR) 10ArielGlenn: [C: 032] use right path to dumps status file script [puppet] - 10https://gerrit.wikimedia.org/r/393225 (owner: 10ArielGlenn) [12:49:53] (03PS5) 10Mark Bergsma: Support per-service-IP BGP MED values [debs/pybal] - 10https://gerrit.wikimedia.org/r/393097 (https://phabricator.wikimedia.org/T165764) [12:50:26] !log resetting replication on es1011 for consistency with other replica sets [12:50:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:20] ok, that should be it, [12:54:29] !log reenabling puppet on db1071 [12:54:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:02] s8 is officially on eqiad and codfw, including replication control [12:56:24] I mean, it is not pooled, but the infrastructure is there [12:56:47] only labs filters and replication pending [12:57:50] \o/ [13:06:35] 10Operations, 10ops-codfw, 10DBA: Several es20XX servers keep crashing (es2017, es2019, es2015, es2014) since 23 March - https://phabricator.wikimedia.org/T130702#3785600 (10Marostegui) [13:17:04] !log Stop replication on db1097 to reimport and recompress commonswiki.watchlist [13:17:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:18] !log installing openjpeg2 updates (original security already got installed after initial release, but there was a binNMU for amd64) [13:24:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:07] (03CR) 10Addshore: [C: 031] diamond: skip DiskSpace for Docker containers [puppet] - 10https://gerrit.wikimedia.org/r/393215 (https://phabricator.wikimedia.org/T181295) (owner: 10Hashar) [13:52:37] !log removing git packages from jessie-wikimedia/experimental (replaced by component/git) [13:52:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:15] 
10Operations, 10Patch-For-Review: Restructure our internal repositories further - https://phabricator.wikimedia.org/T158583#3785688 (10MoritzMuehlenhoff) 05Open>03Resolved I'm closing this bug. The new structure is fully in effect for stretch-wikimedia and partly for jessie-wikimedia (component/foo also ex... [13:56:44] 10Operations: Collate jessie-wikimedia/backports into jessie-wikimedia/main - https://phabricator.wikimedia.org/T167292#3785690 (10MoritzMuehlenhoff) p:05High>03Low [13:57:53] 10Operations: Integrate jessie 8.7 point release - https://phabricator.wikimedia.org/T155401#3785691 (10MoritzMuehlenhoff) 05Open>03Resolved This is complete for a while now. [14:02:58] (03PS7) 10Ema: vcl: distinguish between hfp and hfm [puppet] - 10https://gerrit.wikimedia.org/r/391171 (https://phabricator.wikimedia.org/T180434) [14:03:00] (03PS1) 10Ema: WIP: cache: size-based cutoff for exp caching policy [puppet] - 10https://gerrit.wikimedia.org/r/393227 (https://phabricator.wikimedia.org/T144187) [14:04:13] (03PS2) 10Ema: WIP: vcl: size-based cutoff for exp caching policy [puppet] - 10https://gerrit.wikimedia.org/r/393227 (https://phabricator.wikimedia.org/T144187) [14:08:14] PROBLEM - puppet last run on restbase1015 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[set debconf flag seen for wireshark-common/install-setuid] [14:11:42] 10Operations: Integrate jessie 8.8 point release - https://phabricator.wikimedia.org/T164703#3785705 (10MoritzMuehlenhoff) These are fully rolled out: binutils libvirt ndisc6 [14:18:15] RECOVERY - puppet last run on restbase1015 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [14:35:34] (03PS1) 10Elukey: Drop the Eventlogging support for dbstore1002 [puppet] - 10https://gerrit.wikimedia.org/r/393238 (https://phabricator.wikimedia.org/T156844) [14:36:26] (03PS2) 10Elukey: Drop the Eventlogging support for dbstore1002 [puppet] - 10https://gerrit.wikimedia.org/r/393238 (https://phabricator.wikimedia.org/T156844) [14:37:30] (03CR) 10Elukey: [C: 04-2] "Waiting for Nov 28th to drop the log database manually from dbstore1002" [puppet] - 10https://gerrit.wikimedia.org/r/393238 (https://phabricator.wikimedia.org/T156844) (owner: 10Elukey) [14:47:36] (03PS1) 10Jdrewniak: [WIP] Replace portals submodule with portals/deploy submodule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393239 (https://phabricator.wikimedia.org/T180777) [14:47:56] 10Operations, 10CirrusSearch, 10Discovery, 10MediaWiki-JobQueue, and 5 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3785770 (10Aklapper) Is {T178840} a duplicate? [14:50:10] (03PS1) 10Muehlenhoff: Restrict access to ferm service on mwlog* hosts [puppet] - 10https://gerrit.wikimedia.org/r/393240 [14:51:28] 10Operations, 10CirrusSearch, 10Discovery, 10MediaWiki-JobQueue, and 5 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3785776 (10jcrespo) @Aklapper Probably, but I would close that one, as that should not be happening right now, unless you have reports saying it is again. 
[14:57:04] (03PS2) 10Jdrewniak: [WIP] Replace portals submodule with portals/deploy submodule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393239 (https://phabricator.wikimedia.org/T180777) [14:59:05] 10Operations, 10Analytics, 10hardware-requests: Refresh or replace oxygen - https://phabricator.wikimedia.org/T181264#3785806 (10elukey) As far as I know Analytics has no plans for oxygen, I thought that it was completely managed by ops :D +1 for the fast SSDs for occasional greps, even if recently Filippo... [15:01:30] (03PS1) 10Muehlenhoff: grafana_http: Restrict to CACHE_MISC [puppet] - 10https://gerrit.wikimedia.org/r/393244 [15:10:16] (03CR) 10Ema: [C: 031] grafana_http: Restrict to CACHE_MISC (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/393244 (owner: 10Muehlenhoff) [15:15:32] (03PS1) 10ArielGlenn: clean up old misc dump output files from cron jobs on dump hosts [puppet] - 10https://gerrit.wikimedia.org/r/393245 (https://phabricator.wikimedia.org/T179942) [15:15:57] (03CR) 10jerkins-bot: [V: 04-1] clean up old misc dump output files from cron jobs on dump hosts [puppet] - 10https://gerrit.wikimedia.org/r/393245 (https://phabricator.wikimedia.org/T179942) (owner: 10ArielGlenn) [15:16:34] PROBLEM - cxserver endpoints health on scb1002 is CRITICAL: /v1/page/{language}/{title}{/revision} (Fetch enwiki Oxygen page) timed out before a response was received [15:17:44] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) timed out before a response was received: /_info (retrieve service info) timed out before a response was received: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) timed out before a response was received [15:18:34] PROBLEM - cxserver endpoints health on scb1002 is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to 
target language wiki.) timed out before a response was received: /v1/page/{language}/{title}{/revision} (Fetch enwiki Oxygen page) timed out before a response was received: /v1/mt/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium.) timed o [15:18:34] e was received [15:18:44] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [15:21:44] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for Barack Obama) timed out before a response was received [15:23:35] PROBLEM - cxserver endpoints health on scb1002 is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed out before a response was received: /v1/mt/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium.) timed out before a response was received [15:23:54] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [15:24:17] 10Operations, 10Analytics-Kanban, 10DBA, 10Patch-For-Review, 10User-Elukey: Decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3785829 (10elukey) [15:24:42] 10Operations, 10DBA, 10MediaWiki-Configuration, 10Wikidata: Test moving testwikidatawiki database to s8 replica set on Wikimedia - https://phabricator.wikimedia.org/T180694#3785832 (10Addshore) Thanks! I only asked as the title of this ticket references testwikidatawiki not wikidatawiki [15:25:44] RECOVERY - cxserver endpoints health on scb1002 is OK: All endpoints are healthy [15:27:29] 10Operations, 10Analytics-Kanban, 10DBA, 10Patch-For-Review, 10User-Elukey: Decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3785843 (10elukey) a:03elukey [15:27:54] PROBLEM - Check systemd state on scb1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[15:28:30] that's me ^ [15:31:29] (03PS2) 10ArielGlenn: clean up old misc dump output files from cron jobs on dump hosts [puppet] - 10https://gerrit.wikimedia.org/r/393245 (https://phabricator.wikimedia.org/T179942) [15:31:54] (03CR) 10jerkins-bot: [V: 04-1] clean up old misc dump output files from cron jobs on dump hosts [puppet] - 10https://gerrit.wikimedia.org/r/393245 (https://phabricator.wikimedia.org/T179942) (owner: 10ArielGlenn) [15:33:29] (03PS3) 10ArielGlenn: clean up old misc dump output files from cron jobs on dump hosts [puppet] - 10https://gerrit.wikimedia.org/r/393245 (https://phabricator.wikimedia.org/T179942) [15:33:51] (03CR) 10jerkins-bot: [V: 04-1] clean up old misc dump output files from cron jobs on dump hosts [puppet] - 10https://gerrit.wikimedia.org/r/393245 (https://phabricator.wikimedia.org/T179942) (owner: 10ArielGlenn) [15:35:09] (03PS4) 10ArielGlenn: clean up old misc dump output files from cron jobs on dump hosts [puppet] - 10https://gerrit.wikimedia.org/r/393245 (https://phabricator.wikimedia.org/T179942) [15:39:55] RECOVERY - Check systemd state on scb1002 is OK: OK - running: The system is fully operational [15:39:56] (03PS2) 10Hashar: diamond: skip DiskSpace for Docker containers [puppet] - 10https://gerrit.wikimedia.org/r/393215 (https://phabricator.wikimedia.org/T177052) [15:40:08] (03PS1) 10Muehlenhoff: ntp: Restrict access [puppet] - 10https://gerrit.wikimedia.org/r/393247 [15:41:51] (03CR) 10Hashar: "Should probably be made more generic eg:" [puppet] - 10https://gerrit.wikimedia.org/r/393215 (https://phabricator.wikimedia.org/T177052) (owner: 10Hashar) [15:42:40] 10Operations, 10Cloud-Services, 10monitoring, 10Continuous-Integration-Infrastructure (shipyard), and 3 others: Grafana reports ALL docker mounts in a spammy way - https://phabricator.wikimedia.org/T177052#3785870 (10hashar) [15:45:28] 10Operations, 10Continuous-Integration-Infrastructure, 10Nodepool, 10Patch-For-Review: Clean up apt:pin of python modules used for 
Nodepool - https://phabricator.wikimedia.org/T137217#3785873 (10hashar) 05Open>03declined Nodepool is legacy. I am not going to bother upgrading the python modules. We will... [15:55:48] (03PS5) 10ArielGlenn: clean up old misc dump output files from cron jobs on dump hosts [puppet] - 10https://gerrit.wikimedia.org/r/393245 (https://phabricator.wikimedia.org/T179942) [15:59:34] (03PS1) 10Ema: vcl: add hostname/layer info to syntethic healthcheck response [puppet] - 10https://gerrit.wikimedia.org/r/393251 [16:08:10] (03PS1) 10Muehlenhoff: hue: Restrict to CACHE_MISC [puppet] - 10https://gerrit.wikimedia.org/r/393253 [16:09:03] (03Abandoned) 10Muehlenhoff: hue: Restrict to CACHE_MISC [puppet] - 10https://gerrit.wikimedia.org/r/393253 (owner: 10Muehlenhoff) [16:12:05] (03CR) 10Ema: "Looks reasonable in pcc https://puppet-compiler.wmflabs.org/compiler02/8966/cp1008.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/393251 (owner: 10Ema) [16:19:58] (03CR) 10Marostegui: [C: 031] Fix prometheus target for the Eventlogging mysql master db [puppet] - 10https://gerrit.wikimedia.org/r/393220 (https://phabricator.wikimedia.org/T177405) (owner: 10Elukey) [16:20:52] (03PS6) 10ArielGlenn: clean up old misc dump output files from cron jobs on dump hosts [puppet] - 10https://gerrit.wikimedia.org/r/393245 (https://phabricator.wikimedia.org/T179942) [16:21:28] 10Operations, 10DBA, 10MediaWiki-Configuration, 10Wikidata: Test moving testwikidatawiki database to s8 replica set on Wikimedia - https://phabricator.wikimedia.org/T180694#3785922 (10Marostegui) Yeah, we decided to go for wikidatawiki on codfw, as it is the passive DC :-) [16:34:45] (03PS7) 10ArielGlenn: clean up old misc dump output files from cron jobs on dump hosts [puppet] - 10https://gerrit.wikimedia.org/r/393245 (https://phabricator.wikimedia.org/T179942) [16:44:39] (03PS1) 10Elukey: role::analytics_cluster::database::meta::backup*: move to profiles [puppet] - 10https://gerrit.wikimedia.org/r/393257 
(https://phabricator.wikimedia.org/T167790) [16:46:27] (03PS2) 10Elukey: role::analytics_cluster::database::meta::backup*: move to profiles [puppet] - 10https://gerrit.wikimedia.org/r/393257 (https://phabricator.wikimedia.org/T167790) [16:51:37] (03PS3) 10Elukey: role::analytics_cluster::database::meta::backup*: move to profiles [puppet] - 10https://gerrit.wikimedia.org/r/393257 (https://phabricator.wikimedia.org/T167790) [16:55:05] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler03/8971/ - looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/393257 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [17:15:36] (03PS1) 10Giuseppe Lavagetto: [WiP] Move puppet CI to puppet 4.8.2 [puppet] - 10https://gerrit.wikimedia.org/r/393259 [17:16:08] (03CR) 10jerkins-bot: [V: 04-1] [WiP] Move puppet CI to puppet 4.8.2 [puppet] - 10https://gerrit.wikimedia.org/r/393259 (owner: 10Giuseppe Lavagetto) [17:24:09] (03CR) 10Muehlenhoff: "$wgFFmpeg2theoraLocation is now removed from wmf-config, so removing my earlier -1" [puppet] - 10https://gerrit.wikimedia.org/r/373733 (https://phabricator.wikimedia.org/T172445) (owner: 10Muehlenhoff) [17:24:17] (03PS3) 10Muehlenhoff: Remove ffmpeg2theora from package list [puppet] - 10https://gerrit.wikimedia.org/r/373733 (https://phabricator.wikimedia.org/T172445) [18:30:18] (03PS6) 10Mark Bergsma: Support per-service-IP BGP MED values [debs/pybal] - 10https://gerrit.wikimedia.org/r/393097 (https://phabricator.wikimedia.org/T165764) [18:31:07] (03CR) 10jerkins-bot: [V: 04-1] Support per-service-IP BGP MED values [debs/pybal] - 10https://gerrit.wikimedia.org/r/393097 (https://phabricator.wikimedia.org/T165764) (owner: 10Mark Bergsma) [18:33:30] (03PS7) 10Mark Bergsma: Support per-service-IP BGP MED values [debs/pybal] - 10https://gerrit.wikimedia.org/r/393097 (https://phabricator.wikimedia.org/T165764) [18:57:15] PROBLEM - eventstreams on scb1001 is CRITICAL: connect to address 10.64.0.16 and port 8092: Connection refused 
[18:57:54] PROBLEM - pdfrender on scb1001 is CRITICAL: connect to address 10.64.0.16 and port 5252: Connection refused [18:59:04] PROBLEM - cxserver endpoints health on scb1001 is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed out before a response was received: /v1/page/{language}/{title}{/revision} (Fetch enwiki Oxygen page) timed out before a response was received: /v1/mt/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium.) timed o [18:59:04] e was received [18:59:15] RECOVERY - eventstreams on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 929 bytes in 0.033 second response time [18:59:54] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.003 second response time [18:59:55] RECOVERY - cxserver endpoints health on scb1001 is OK: All endpoints are healthy [19:01:01] 10Puppet, 10Cloud-VPS, 10cloud-services-team (Kanban): role::puppetmaster::standalone has no firewall rule for port 8140 - https://phabricator.wikimedia.org/T154150#3786046 (10aborrero) More testing. I see that a patch like this just works, but the reporter @scfc seems to suggest this doesn't work: ``` diff... 
[19:46:14] PROBLEM - cassandra-c CQL 10.64.0.232:9042 on restbase1007 is CRITICAL: connect to address 10.64.0.232 and port 9042: Connection refused [19:46:15] PROBLEM - cassandra-a service on restbase1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [19:46:15] PROBLEM - cassandra-c service on restbase1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [19:46:24] PROBLEM - cassandra-c service on restbase1012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [19:46:24] PROBLEM - cassandra-a SSL 10.64.0.230:7001 on restbase1007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [19:46:25] PROBLEM - cassandra-c CQL 10.64.32.204:9042 on restbase1012 is CRITICAL: connect to address 10.64.32.204 and port 9042: Connection refused [19:46:44] PROBLEM - cassandra-b CQL 10.64.0.231:9042 on restbase1007 is CRITICAL: connect to address 10.64.0.231 and port 9042: Connection refused [19:46:44] PROBLEM - cassandra-b service on restbase1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [19:46:45] PROBLEM - cassandra-b CQL 10.64.32.203:9042 on restbase1012 is CRITICAL: connect to address 10.64.32.203 and port 9042: Connection refused [19:46:45] PROBLEM - cassandra-a service on restbase1012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [19:46:45] PROBLEM - cassandra-a CQL 10.64.0.230:9042 on restbase1007 is CRITICAL: connect to address 10.64.0.230 and port 9042: Connection refused [19:46:45] PROBLEM - Check systemd state on restbase1012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:46:54] PROBLEM - Check systemd state on restbase1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[19:46:55] PROBLEM - cassandra-b SSL 10.64.32.203:7001 on restbase1012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[19:46:55] PROBLEM - cassandra-a SSL 10.64.32.202:7001 on restbase1012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[19:46:55] PROBLEM - cassandra-b SSL 10.64.0.231:7001 on restbase1007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[19:46:55] PROBLEM - cassandra-a CQL 10.64.32.202:9042 on restbase1012 is CRITICAL: connect to address 10.64.32.202 and port 9042: Connection refused
[19:46:55] PROBLEM - cassandra-b service on restbase1012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed
[19:47:14] PROBLEM - cassandra-c SSL 10.64.32.204:7001 on restbase1012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[19:47:15] PROBLEM - cassandra-c SSL 10.64.0.232:7001 on restbase1007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[20:07:14] PROBLEM - Disk space on elastic1017 is CRITICAL: DISK CRITICAL - free space: /srv 61289 MB (12% inode=99%)
[20:08:35] PROBLEM - HHVM rendering on mw2126 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:09:25] RECOVERY - HHVM rendering on mw2126 is OK: HTTP OK: HTTP/1.1 200 OK - 78961 bytes in 0.304 second response time
[20:13:14] RECOVERY - Disk space on elastic1017 is OK: DISK OK
[20:25:27] 10Operations, 10Analytics-Kanban, 10DBA, 10Patch-For-Review, 10User-Elukey: Decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3786146 (10Capt_Swing) The `shawn` table belonged to Shawn Walker, a research intern in 2011. These tables can be safely deleted.
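The `cassandra-* CQL`/`SSL` alerts above are plain TCP connect probes: Icinga tries to open the port and reports `Connection refused` when nothing is listening. As a rough illustration only (this is a hypothetical helper, not Wikimedia's actual monitoring plugin), such a probe can be sketched in Python:

```python
import socket


def tcp_probe(host: str, port: int, timeout: float = 10.0) -> str:
    """Return an Icinga-style status line for a simple TCP connect check."""
    try:
        # A successful connect() is all these checks verify; the
        # connection is closed immediately without sending data.
        with socket.create_connection((host, port), timeout=timeout):
            return f"OK - connect to address {host} and port {port} succeeded"
    except socket.timeout:
        return f"CRITICAL - Socket timeout after {int(timeout)} seconds"
    except OSError as exc:
        # e.g. "Connection refused" when the unit is down, as in the log above
        return f"CRITICAL - connect to address {host} and port {port}: {exc.strerror}"
```

A refused connect (service stopped, as with the failed cassandra units here) fails fast; a firewalled or black-holed port instead hits the timeout branch, which is why the two alert texts differ.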
[21:33:58] (03Draft1) 10MarcoAurelio: [WIP] puppet: redirect several wikis per LangCom decission [puppet] - 10https://gerrit.wikimedia.org/r/393289 (https://phabricator.wikimedia.org/T169450)
[21:34:03] (03PS2) 10MarcoAurelio: [WIP] puppet: redirect several wikis per LangCom decission [puppet] - 10https://gerrit.wikimedia.org/r/393289 (https://phabricator.wikimedia.org/T169450)
[21:38:24] 10Operations, 10Continuous-Integration-Infrastructure, 10Jenkins: zuul/jenkins has jobs stuck in postmerge for 13 hours - https://phabricator.wikimedia.org/T181313#3786188 (10MarcoAurelio)
[21:38:53] 10Operations, 10Continuous-Integration-Infrastructure, 10Jenkins: zuul/jenkins has jobs stuck in postmerge for 13 hours - https://phabricator.wikimedia.org/T181313#3786200 (10MarcoAurelio) p:05Triage>03Unbreak! Temptatively UBN as this is not normal and has jobs stuck.
[21:41:00] 10Operations, 10Performance-Team, 10Traffic: load.php requests taking multiple minutes - https://phabricator.wikimedia.org/T181315#3786217 (10Tgr)
[21:57:19] 10Operations, 10Continuous-Integration-Infrastructure, 10Jenkins: zuul/jenkins has jobs stuck in postmerge for 13 hours - https://phabricator.wikimedia.org/T181313#3786259 (10Paladox) p:05Unbreak!>03High Changing to high as UBN means a site is down. Tests getting stuck in the post merge pipeline happened...
[21:57:29] 10Operations, 10Performance-Team, 10Traffic: load.php requests taking multiple minutes - https://phabricator.wikimedia.org/T181315#3786261 (10Tgr)
[21:58:58] 10Operations, 10Continuous-Integration-Infrastructure, 10Jenkins: zuul/jenkins has jobs stuck in postmerge for 13 hours - https://phabricator.wikimedia.org/T181313#3786262 (10MarcoAurelio) Okay. Would a restart of zuul help unlock those jobs?
[22:06:59] 10Operations, 10Continuous-Integration-Infrastructure, 10Jenkins: zuul/jenkins has jobs stuck in postmerge for 13 hours - https://phabricator.wikimedia.org/T181313#3786188 (10hashar) That happens from time to time and it is T72597.
There is no magic solution to remove the lock though ;(
[22:08:19] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1101:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393185 (https://phabricator.wikimedia.org/T177208) (owner: 10Marostegui)
[22:08:21] (03CR) 10jenkins-bot: db-eqiad.php: Increase weight:db1101,db1092,db1097 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393193 (owner: 10Marostegui)
[22:08:23] (03CR) 10jenkins-bot: db-eqiad.php: Increase weight for db1097, db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393197 (owner: 10Marostegui)
[22:08:25] (03CR) 10jenkins-bot: db-eqiad.php: Depool all future s8 hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393204 (https://phabricator.wikimedia.org/T177208) (owner: 10Marostegui)
[22:08:27] (03CR) 10jenkins-bot: mariadb: Change db208[56]:3315 to port 3318; repool db2038 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393208 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo)
[22:08:29] (03CR) 10jenkins-bot: maridb: depool es2018 after crash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393211 (https://phabricator.wikimedia.org/T181293) (owner: 10Jcrespo)
[22:08:31] (03CR) 10jenkins-bot: Fix up 611a3b6cba28342c [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393169 (owner: 10Legoktm)
[22:09:27] 10Operations, 10Continuous-Integration-Infrastructure, 10Jenkins: zuul/jenkins has jobs stuck in postmerge for 13 hours - https://phabricator.wikimedia.org/T181313#3786267 (10hashar) That happens from time to time and it is T72597. There is no magic solution to remove the lock though ;( I went to https://in...
[22:11:03] 10Operations, 10Continuous-Integration-Infrastructure, 10Jenkins: zuul/jenkins has jobs stuck in postmerge for 13 hours - https://phabricator.wikimedia.org/T181313#3786188 (10hashar) 05Open>03Resolved a:03hashar
[22:11:36] 10Operations, 10Continuous-Integration-Infrastructure, 10Jenkins: zuul/jenkins has jobs stuck in postmerge for 13 hours - https://phabricator.wikimedia.org/T181313#3786275 (10MarcoAurelio) Thank you!
[23:02:44] (03PS1) 10Reedy: Remove $wgStyleVersion appending in CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393353 (https://phabricator.wikimedia.org/T181318)
[23:14:34] (03PS8) 10ArielGlenn: clean up old misc dump output files from cron jobs on dump hosts [puppet] - 10https://gerrit.wikimedia.org/r/393245 (https://phabricator.wikimedia.org/T179942)
[23:20:15] PROBLEM - HHVM jobrunner on mw1307 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:21:05] PROBLEM - HHVM jobrunner on mw1318 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:25:35] PROBLEM - HHVM jobrunner on mw1260 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:36:14] PROBLEM - HHVM jobrunner on mw1259 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:37:24] PROBLEM - Apache HTTP on mw2130 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:38:15] RECOVERY - Apache HTTP on mw2130 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.120 second response time
[23:49:44] RECOVERY - HHVM jobrunner on mw1260 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time
[23:51:25] RECOVERY - HHVM jobrunner on mw1259 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 8.953 second response time
[23:51:25] RECOVERY - HHVM jobrunner on mw1318 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.005 second response time
[23:51:44] RECOVERY - HHVM jobrunner on mw1307 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time
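The `HHVM jobrunner` and `Apache HTTP` recoveries above follow the classic check_http output shape: `HTTP OK: HTTP/1.1 200 OK - N bytes in T second response time`, with a CRITICAL `Socket timeout after 10 seconds` when the request hangs. A minimal Python sketch of a probe emitting that format (an illustrative stand-in, not the actual Nagios/Icinga plugin) might look like:

```python
import socket
import time
import urllib.request
import urllib.error


def http_check(url: str, timeout: float = 10.0) -> str:
    """Emit a check_http-style status line for a single GET request."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read()
    except socket.timeout:
        # Matches the "Socket timeout after 10 seconds" CRITICALs in the log
        return f"CRITICAL - Socket timeout after {int(timeout)} seconds"
    except urllib.error.URLError as exc:
        return f"CRITICAL - {exc.reason}"
    elapsed = time.monotonic() - start
    # resp.version is 10 or 11, i.e. HTTP/1.0 or HTTP/1.1
    return (f"HTTP OK: HTTP/{resp.version / 10:.1f} {resp.status} {resp.reason}"
            f" - {len(body)} bytes in {elapsed:.3f} second response time")
```

The byte count covers the response body only, which is why the jobrunner health endpoint consistently reports the same 206 bytes while the response time varies.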