[00:08:02] (03PS2) 10Reedy: Add my pgp key to https://www.mediawiki.org/keys/keys.html [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391706 (owner: 10Brian Wolff)
[00:09:00] (03PS3) 10Reedy: Add my pgp key to https://www.mediawiki.org/keys/keys.html [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391706 (owner: 10Brian Wolff)
[00:09:04] (03CR) 10Reedy: [C: 032] Add my pgp key to https://www.mediawiki.org/keys/keys.html [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391706 (owner: 10Brian Wolff)
[00:10:25] (03Merged) 10jenkins-bot: Add my pgp key to https://www.mediawiki.org/keys/keys.html [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391706 (owner: 10Brian Wolff)
[00:12:05] !log reedy@tin Synchronized docroot/mediawiki/keys/: Add Brian Wolff's key (duration: 00m 45s)
[00:12:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:33:01] (03CR) 10Legoktm: "Nooo this worked against my changes to stick the current release managers to the top of keys.html, and label all the keys in keys.txt :/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391706 (owner: 10Brian Wolff)
[01:34:34] (03CR) 10Legoktm: "Uh yeah this got broken in the rebase :(" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391706 (owner: 10Brian Wolff)
[01:37:14] (03PS1) 10Legoktm: Fix up 611a3b6cba28342c [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393169
[01:37:57] (03CR) 10Legoktm: "Follow-up Change-Id: I11e64e9fe2b0d24f1fecd19209a29588f803166f" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391706 (owner: 10Brian Wolff)
[02:24:10] PROBLEM - cassandra-b service on restbase1014 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed
[02:24:20] PROBLEM - cassandra-c service on restbase1014 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed
[02:24:30] PROBLEM - cassandra-c CQL 10.64.48.137:9042 on restbase1014 is CRITICAL: connect to address 10.64.48.137 and port 9042: Connection refused
[02:24:30] PROBLEM - cassandra-c SSL 10.64.48.137:7001 on restbase1014 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[02:24:39] PROBLEM - Check systemd state on restbase1014 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:24:49] PROBLEM - cassandra-a service on restbase1014 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[03:07:20] ACKNOWLEDGEMENT - cassandra-c CQL 10.64.48.137:9042 on restbase1014 is CRITICAL: connect to address 10.64.48.137 and port 9042: Connection refused eevans not an outage scheduled maintenance expired
[03:07:20] ACKNOWLEDGEMENT - cassandra-c SSL 10.64.48.137:7001 on restbase1014 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused eevans not an outage scheduled maintenance expired
[03:09:10] ACKNOWLEDGEMENT - Check systemd state on restbase1014 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. eevans not an outage scheduled maintenance expired
[03:09:10] ACKNOWLEDGEMENT - cassandra-a service on restbase1014 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed eevans not an outage scheduled maintenance expired
[03:09:10] ACKNOWLEDGEMENT - cassandra-b service on restbase1014 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed eevans not an outage scheduled maintenance expired
[03:09:10] ACKNOWLEDGEMENT - cassandra-c service on restbase1014 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed eevans not an outage scheduled maintenance expired
[03:10:42] (03PS3) 10TerraCodes: Remove single editor tab for plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393121 (https://phabricator.wikimedia.org/T181045)
[03:10:56] (03PS11) 10TerraCodes: Add loginwiki and wikidata to $wgLocalVirtualHosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392999 (https://phabricator.wikimedia.org/T117302)
[03:25:09] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 606.25 seconds
[03:57:19] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 271.08 seconds
[03:58:19] PROBLEM - Host cp3048 is DOWN: PING CRITICAL - Packet loss = 100%
[04:00:09] RECOVERY - Host cp3048 is UP: PING OK - Packet loss = 0%, RTA = 83.82 ms
[05:25:49] PROBLEM - HP RAID on db2055 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:2 - OK: 1I:1:1, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Controller: OK - Battery/Capacitor: OK
[05:25:51] ACKNOWLEDGEMENT - HP RAID on db2055 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:2 - OK: 1I:1:1, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T181266
[05:25:55] 10Operations, 10ops-codfw: Degraded RAID on db2055 - https://phabricator.wikimedia.org/T181266#3784747 (10ops-monitoring-bot)
[06:19:26] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2055 - https://phabricator.wikimedia.org/T181266#3784765 (10Marostegui) a:03Papaul @Papaul can we get this replaced? Thanks!
[06:25:44] (03PS1) 10Marostegui: db-eqiad.php: Repool db1101:3318 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393173 (https://phabricator.wikimedia.org/T178359)
[06:27:56] (03PS1) 10Marostegui: s5.hosts: Update db1101 port [software] - 10https://gerrit.wikimedia.org/r/393174 (https://phabricator.wikimedia.org/T178359)
[06:28:40] (03CR) 10Marostegui: [C: 032] s5.hosts: Update db1101 port [software] - 10https://gerrit.wikimedia.org/r/393174 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui)
[06:29:21] (03Merged) 10jenkins-bot: s5.hosts: Update db1101 port [software] - 10https://gerrit.wikimedia.org/r/393174 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui)
[06:43:38] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1101:3318 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393173 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui)
[06:45:06] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1101:3318 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393173 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui)
[06:46:29] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Slowly pool db1101:3318 in s5 to warm it up - T178359 (duration: 00m 49s)
[06:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:38] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359
[06:47:22] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Slowly pool db1101:3318 in s5 to warm it up - T178359 (duration: 00m 45s)
[06:47:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:52:35] (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Pool db1097 in s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393175 (https://phabricator.wikimedia.org/T178359)
[06:55:31] (03PS1) 10Marostegui: s5.hosts: Add db1097:3315 [software] - 10https://gerrit.wikimedia.org/r/393176 (https://phabricator.wikimedia.org/T178359)
[06:56:29] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[06:57:05] (03CR) 10Marostegui: [C: 032] s5.hosts: Add db1097:3315 [software] - 10https://gerrit.wikimedia.org/r/393176 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui)
[06:57:19] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy
[06:57:46] (03Merged) 10jenkins-bot: s5.hosts: Add db1097:3315 [software] - 10https://gerrit.wikimedia.org/r/393176 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui)
[06:58:56] (03PS1) 10Marostegui: db1097.yaml: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/393177 (https://phabricator.wikimedia.org/T178359)
[07:03:06] (03CR) 10Marostegui: [C: 032] db1097.yaml: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/393177 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui)
[07:22:00] PROBLEM - puppet last run on lvs2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:22:49] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[07:23:40] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy
[07:48:32] (03CR) 10Hashar: [C: 031] Remove experimental component from contintcloud [puppet] - 10https://gerrit.wikimedia.org/r/393081 (owner: 10Muehlenhoff)
[07:52:02] RECOVERY - puppet last run on lvs2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:53:34] (03PS2) 10Muehlenhoff: Remove experimental component from contintcloud [puppet] - 10https://gerrit.wikimedia.org/r/393081
[07:55:28] (03CR) 10Muehlenhoff: [C: 032] Remove experimental component from contintcloud [puppet] - 10https://gerrit.wikimedia.org/r/393081 (owner: 10Muehlenhoff)
[08:07:22] !log re-enabling piwik on bohrium (only VM running on ganeti1006 atm) after mysql tables restore completed
[08:07:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:07] !log restarting jenkins on contint1001 for a java update
[08:14:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:32] PROBLEM - puppet last run on stat1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_analytics/reportupdater]
[08:15:59] !log installing java security updates on stat1004
[08:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:23] PROBLEM - jenkins_zmq_publisher on contint1001 is CRITICAL: connect to address 127.0.0.1 and port 8888: Connection refused
[08:17:04] grrr
[08:17:07] that is a valid alarm
[08:20:00] !log installing java security updates on meitnerium
[08:20:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:24] (03CR) 10Jcrespo: "I can only restart 1 server on my own unless I get help to do it at the same time. It can be done later- even you can help! :-)" [puppet] - 10https://gerrit.wikimedia.org/r/393102 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo)
[08:22:47] (03CR) 10Marostegui: [C: 031] "> I can only restart 1 server on my own unless I get help to do it at" [puppet] - 10https://gerrit.wikimedia.org/r/393102 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo)
[08:23:03] ACKNOWLEDGEMENT - jenkins_zmq_publisher on contint1001 is CRITICAL: connect to address 127.0.0.1 and port 8888: Connection refused amusso ZeroMQ is not loading for some reason
[08:24:17] (03CR) 10Jcrespo: "> > I can only restart 1 server on my own unless I get help to do it" [puppet] - 10https://gerrit.wikimedia.org/r/393102 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo)
[08:24:32] RECOVERY - jenkins_zmq_publisher on contint1001 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 8888
[08:28:15] (03PS1) 10Marostegui: db1092.yaml: Remove old socket location [puppet] - 10https://gerrit.wikimedia.org/r/393180
[08:28:41] (03CR) 10Marostegui: [C: 031] "> > > I can only restart 1 server on my own unless I get help to do" [puppet] - 10https://gerrit.wikimedia.org/r/393102 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo)
[08:31:08] (03CR) 10jenkins-bot: Add my pgp key to https://www.mediawiki.org/keys/keys.html [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391706 (owner: 10Brian Wolff)
[08:31:53] (03PS2) 10Jcrespo: mariadb: Move some (only the single-instance) s5 hosts to s8 [puppet] - 10https://gerrit.wikimedia.org/r/393102 (https://phabricator.wikimedia.org/T177208)
[08:34:03] (03CR) 10Jcrespo: [C: 031] db-eqiad,db-codfw.php: Pool db1097 in s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393175 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui)
[08:34:23] (03CR) 10Marostegui: [C: 032] db-eqiad,db-codfw.php: Pool db1097 in s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393175 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui)
[08:34:32] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:35:51] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Pool db1097 in s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393175 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui)
[08:36:15] (03PS1) 10Elukey: profile::mariadb::misc::el::replication: fix logrotate config [puppet] - 10https://gerrit.wikimedia.org/r/393182
[08:37:01] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Slowly pool db1097:3315 - T178359 (duration: 00m 45s)
[08:37:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:09] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359
[08:38:27] (03CR) 10Jcrespo: [C: 032] mariadb: Move some (only the single-instance) s5 hosts to s8 [puppet] - 10https://gerrit.wikimedia.org/r/393102 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo)
[08:38:41] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Slowly pool db1097:3315 - T178359 (duration: 00m 45s)
[08:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:52] (03CR) 10Elukey: [C: 032] profile::mariadb::misc::el::replication: fix logrotate config [puppet] - 10https://gerrit.wikimedia.org/r/393182 (owner: 10Elukey)
[08:38:58] (03PS2) 10Elukey: profile::mariadb::misc::el::replication: fix logrotate config [puppet] - 10https://gerrit.wikimedia.org/r/393182
[08:40:19] (03PS2) 10Marostegui: db1092.yaml: Remove old socket location [puppet] - 10https://gerrit.wikimedia.org/r/393180
[08:40:56] !log installing java security updates on notebook* hosts
[08:41:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:43] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1101:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393185 (https://phabricator.wikimedia.org/T177208)
[08:43:10] jynus: ^ I will also depool db1092
[08:43:56] (03CR) 10jenkins-bot: Remove wmgRelatedSitesPrefixes intermediatary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393166 (owner: 10Reedy)
[08:44:23] (03PS3) 10Marostegui: db1092.yaml: Remove old socket location [puppet] - 10https://gerrit.wikimedia.org/r/393180
[08:45:06] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic for db1101:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393185 (https://phabricator.wikimedia.org/T177208) (owner: 10Marostegui)
[08:45:32] this was a bad idea, the eqiad hosts will alert
[08:46:00] as they have not yet been topologically moved
[08:46:27] mmm what they would alert for?
[08:46:31] (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for db1101:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393185 (https://phabricator.wikimedia.org/T177208) (owner: 10Marostegui)
[08:47:52] actually, they won't alert, which is almost as bad
[08:48:03] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1101:3318 in s5 to warm it up and depool db1092 - T178359 T177208 (duration: 00m 45s)
[08:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:11] T177208: Provide dedicated database resources for wikidata - https://phabricator.wikimedia.org/T177208
[08:48:11] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359
[08:48:14] But why should they alert? nothing is down
[08:48:28] lag from s8
[08:48:36] Ah, the pt-heartbeat
[08:48:49] but when there is no row found, they fall back to show slave status
[08:48:58] so it kinda works as intended
[08:49:07] !log Stop MySQL on db1092
[08:49:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:27] but we should depool all s8 servers on eqiad
[08:50:23] Yeah, but if we are aiming to do the split on tuesday we need to warm them up at least on monday
[08:50:38] (03CR) 10Muehlenhoff: [C: 031] openldap: move firewall/standard to roles, use profile [puppet] - 10https://gerrit.wikimedia.org/r/391737 (owner: 10Dzahn)
[08:50:43] (03CR) 10Marostegui: [C: 032] db1092.yaml: Remove old socket location [puppet] - 10https://gerrit.wikimedia.org/r/393180 (owner: 10Marostegui)
[08:59:21] (03CR) 10Muehlenhoff: [C: 031] mediawiki:appserver:api: move firewall to role, use profile [puppet] - 10https://gerrit.wikimedia.org/r/391731 (owner: 10Dzahn)
[09:25:12] (03PS1) 10Marostegui: db-eqiad.php: Increase weight:db1101,db1092,db1097 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393193
[09:26:13] (03PS2) 10Marostegui: db-eqiad.php: Increase weight:db1101,db1092,db1097 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393193
[09:29:02] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase weight:db1101,db1092,db1097 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393193 (owner: 10Marostegui)
[09:30:25] (03Merged) 10jenkins-bot: db-eqiad.php: Increase weight:db1101,db1092,db1097 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393193 (owner: 10Marostegui)
[09:31:39] (03CR) 10Thiemo Mättig (WMDE): [C: 031] "Looks good to me. In comparison I do not see any difference other than the logo becoming black(ish). The smallest version also has a new r" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376150 (https://phabricator.wikimedia.org/T174603) (owner: 10Odder)
[09:32:23] (03CR) 10Hashar: [C: 032] Add .gitreview [software/conftool] - 10https://gerrit.wikimedia.org/r/392795 (owner: 10Hashar)
[09:32:31] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1101:3318 db1097:3315 and db1092 (duration: 00m 45s)
[09:32:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:58] (03CR) 10jenkins-bot: Remove GettingStarted intermediate variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393167 (owner: 10Reedy)
[09:33:00] (03CR) 10jenkins-bot: wgPageImagesExpandOpenSearchXml: drop intermediate $wmg setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393122 (owner: 10Chad)
[09:33:02] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1101:3318 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393173 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui)
[09:34:42] Hello!
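The lag alerts and the pt-heartbeat discussion above (a lag check reads the newest heartbeat row the replica has replayed and falls back to SHOW SLAVE STATUS when no matching row exists) can be illustrated with a small sketch. The thresholds below are assumptions chosen only to be consistent with the two dbstore1002 alerts in this log (606.25 s CRITICAL, 271.08 s OK); the production check's limits may differ.

```python
from datetime import datetime, timezone

def lag_from_heartbeat(now, heartbeat_ts):
    """Replication lag as the age of the newest heartbeat row
    (pt-heartbeat writes one on the master at a fixed interval)."""
    return (now - heartbeat_ts).total_seconds()

def classify_lag(lag_seconds, warn=300.0, crit=600.0):
    """Map a lag value to an Icinga-style status.
    warn/crit are hypothetical, not the real check's thresholds."""
    if lag_seconds >= crit:
        return "CRITICAL"
    if lag_seconds >= warn:
        return "WARNING"
    return "OK"
```

With these assumed limits, the 03:25 alert (606.25 s) classifies as CRITICAL and the 03:57 recovery (271.08 s) as OK, matching the log.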
[09:35:07] (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Pool db1097 in s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393175 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui)
[09:36:02] (03CR) 10Hashar: "recheck" [software/conftool] - 10https://gerrit.wikimedia.org/r/387544 (owner: 10Volans)
[09:36:05] (03CR) 10Hashar: "recheck" [software/conftool] - 10https://gerrit.wikimedia.org/r/392793 (owner: 10Hashar)
[09:36:07] (03CR) 10Hashar: [C: 032] "recheck" [software/conftool] - 10https://gerrit.wikimedia.org/r/392795 (owner: 10Hashar)
[09:37:05] Can I use jsub / jstop from a job which was started in the cloud by jsub? Background: For whatever reason some program hangs and I want to kill it and restart it automagically
[09:40:32] (03PS1) 10Ema: cache_misc: use grafana.w.o instead of git.w.o in VTC tests [puppet] - 10https://gerrit.wikimedia.org/r/393195
[09:49:49] (03PS1) 10Marostegui: db-eqiad.php: Increase weight for db1097, db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393197
[09:50:05] !log restarting db2045
[09:50:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:35] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase weight for db1097, db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393197 (owner: 10Marostegui)
[09:54:55] (03Merged) 10jenkins-bot: db-eqiad.php: Increase weight for db1097, db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393197 (owner: 10Marostegui)
[09:55:12] (03CR) 10Ema: [C: 032] cache_misc: use grafana.w.o instead of git.w.o in VTC tests [puppet] - 10https://gerrit.wikimedia.org/r/393195 (owner: 10Ema)
[09:56:34] (03CR) 10Muehlenhoff: kmod::blacklist: prevent manual install, update initramfs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/392644 (owner: 10BBlack)
[09:56:48] (03CR) 10Muehlenhoff: [C: 031] kmod::blacklist: prevent manual install, update initramfs [puppet] - 10https://gerrit.wikimedia.org/r/392644 (owner: 10BBlack)
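On the jsub/jstop question asked at 09:37:05 above: since jsub and jstop are ordinary command-line wrappers, a job submitted with jsub can itself invoke them to kill and resubmit a hung sibling job. A hedged sketch of a watchdog helper that only builds the two command lines; the `-N` and `-once` flags are the commonly documented ones, but verify against the current Toolforge documentation before relying on them.

```python
def restart_commands(job_name, command, once=True):
    """Build the jstop/jsub invocations a watchdog could run (e.g. via
    subprocess.run) to kill and resubmit a hung grid job. The jsub
    flags here are assumptions based on common usage, not a verified
    reference for the current Toolforge tooling."""
    stop = ["jstop", job_name]
    submit = ["jsub", "-N", job_name]
    if once:
        submit.append("-once")  # refuse to start a second copy
    submit.extend(command)
    return stop, submit
```

A watchdog job would run the `stop` command, wait for the grid to reap the old job, then run `submit`.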
[09:57:06] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1097:3315 and db1092 (duration: 00m 45s)
[09:57:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:21:12] (03PS1) 10Jcrespo: mariadb: move db2085:s5 to db2085:s8 [puppet] - 10https://gerrit.wikimedia.org/r/393203 (https://phabricator.wikimedia.org/T177208)
[10:21:48] (03PS1) 10Marostegui: db-eqiad.php: Depool all future s8 hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393204 (https://phabricator.wikimedia.org/T177208)
[10:22:37] !log installing ca-certificates updates on trusty hosts
[10:22:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:51] (03PS1) 10Jcrespo: mariadb: Move db2086:s5 to db2086:s8 [puppet] - 10https://gerrit.wikimedia.org/r/393205 (https://phabricator.wikimedia.org/T177208)
[10:26:20] (03PS2) 10Marostegui: db-eqiad.php: Depool all future s8 hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393204 (https://phabricator.wikimedia.org/T177208)
[10:29:25] I think that looks ok
[10:29:33] (03CR) 10Jcrespo: [C: 031] db-eqiad.php: Depool all future s8 hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393204 (https://phabricator.wikimedia.org/T177208) (owner: 10Marostegui)
[10:29:33] let me give it another final look
[10:30:45] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool all future s8 hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393204 (https://phabricator.wikimedia.org/T177208) (owner: 10Marostegui)
[10:31:28] (03CR) 10Marostegui: [C: 031] mariadb: move db2085:s5 to db2085:s8 [puppet] - 10https://gerrit.wikimedia.org/r/393203 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo)
[10:31:58] (03CR) 10Marostegui: [C: 031] mariadb: Move db2086:s5 to db2086:s8 [puppet] - 10https://gerrit.wikimedia.org/r/393205 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo)
[10:32:26] (03Merged) 10jenkins-bot: db-eqiad.php: Depool all future s8 hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393204 (https://phabricator.wikimedia.org/T177208) (owner: 10Marostegui)
[10:33:35] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool all future s8 slaves for a topology change - T177208 (duration: 00m 45s)
[10:33:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:43] T177208: Provide dedicated database resources for wikidata - https://phabricator.wikimedia.org/T177208
[10:35:59] !log restarting db2085 (including both s5 and s3 instances)
[10:36:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:19] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 5 others: Select candidate jobs for transferring to the new infrastucture - https://phabricator.wikimedia.org/T175210#3785194 (10Pchelolo) The `wikibase-UpdateUsagesForPage` job sounds like a perfect candidate to be the next one. It's ~220 jobs/s on...
[10:37:10] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 5 others: Select candidate jobs for transferring to the new infrastucture - https://phabricator.wikimedia.org/T175210#3785198 (10Pchelolo)
[10:37:49] !log cancelling db2085 restart, only doing mysql:s5
[10:37:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:38:38] (03CR) 10Jcrespo: [C: 032] mariadb: move db2085:s5 to db2085:s8 [puppet] - 10https://gerrit.wikimedia.org/r/393203 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo)
[10:42:42] PROBLEM - puppet last run on californium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[ca-certificates]
[10:44:12] jynus: you want me to take care of db2086?
[10:44:39] no, one at a time
[10:44:43] oki
[10:50:29] so I moved manually sqldata, tmp and removed the old s5.cfg then run puppet
[10:50:38] anything else I am missing?
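The db-eqiad.php edits throughout this log (repool with low weight, increase traffic, depool) all amount to changing a replica's read weight in a section's load map. A minimal model of that pattern; the dict shape loosely mirrors wmf-config's sectionLoads, but host names and values here are illustrative only.

```python
def set_weight(section_loads, section, host, weight):
    """Return a copy of a sectionLoads-style mapping with one host's
    read weight changed: 0 effectively depools a replica, a small
    value warms it up, larger values shift more traffic onto it.
    This models the db-eqiad.php edits above; it is not wmf-config code."""
    loads = dict(section_loads.get(section, {}))
    loads[host] = weight
    return {**section_loads, section: loads}
```

Warming up a freshly provisioned instance then means calling this repeatedly with increasing weights, exactly the sequence of commits seen above.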
[10:50:58] aside from removal of old nagios check files, which I never do
[10:51:10] I think that should be it
[10:51:19] I ran puppet manually on einsteinium
[10:51:23] but that is not really a big thing
[10:51:24] I already started it
[10:51:50] but I was worried there was some reference to shard on relay or binlog name
[10:52:04] Yeah, I was worried yesterday with that too when I moved db1101
[10:52:05] hehe
[10:52:12] ok
[10:52:29] I think it is ok I did it myself, so now both know about it
[10:52:40] you or me can do the other one, I think
[10:53:00] I will do it then
[10:53:07] replication caught up, and alerts should go away when einsteinium finishes
[10:53:27] great, I will take care of db2086
[10:54:12] ah, tendril change
[10:54:16] and of course the dblists
[10:54:22] *hosts
[10:54:33] I think prometheus was updated in advance
[10:54:44] and codfw.php
[10:55:01] eqiad, too
[10:55:39] !log Restart MySQL on db2086 to move s5 to s8
[10:55:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:57:27] (03CR) 10Marostegui: [C: 032] mariadb: Move db2086:s5 to db2086:s8 [puppet] - 10https://gerrit.wikimedia.org/r/393205 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo)
[11:01:53] (03PS1) 10Jcrespo: [WIP]mariadb: Change db208[56]:3315 to port 3318; repool db2038 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393208 (https://phabricator.wikimedia.org/T177208)
[11:02:19] db2086 is done
[11:03:03] (03PS2) 10Jcrespo: mariadb: Change db208[56]:3315 to port 3318; repool db2038 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393208 (https://phabricator.wikimedia.org/T177208)
[11:03:05] new patch^
[11:03:18] (03PS1) 10Ppchelko: Remove RESTBase jobs config [puppet] - 10https://gerrit.wikimedia.org/r/393209
[11:04:11] (03CR) 10Giuseppe Lavagetto: First version of the helm chart scaffolding for production services (0313 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/392619 (https://phabricator.wikimedia.org/T177397) (owner: 10Giuseppe Lavagetto)
[11:04:27] (03PS2) 10Giuseppe Lavagetto: First version of the helm chart scaffolding for production services [deployment-charts] - 10https://gerrit.wikimedia.org/r/392619 (https://phabricator.wikimedia.org/T177397)
[11:04:33] (03CR) 10Marostegui: mariadb: Change db208[56]:3315 to port 3318; repool db2038 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393208 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo)
[11:07:42] (03PS3) 10Jcrespo: mariadb: Change db208[56]:3315 to port 3318; repool db2038 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393208 (https://phabricator.wikimedia.org/T177208)
[11:09:06] (03CR) 10Marostegui: [C: 031] mariadb: Change db208[56]:3315 to port 3318; repool db2038 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393208 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo)
[11:09:27] (03CR) 10Ppchelko: "I was sure that was already removed, no idea why it got resurrected. These jobs are long long gone" [puppet] - 10https://gerrit.wikimedia.org/r/393209 (owner: 10Ppchelko)
[11:09:44] (03CR) 10Jcrespo: [C: 032] mariadb: Change db208[56]:3315 to port 3318; repool db2038 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393208 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo)
[11:10:22] (03CR) 10Giuseppe Lavagetto: [C: 032] Remove RESTBase jobs config [puppet] - 10https://gerrit.wikimedia.org/r/393209 (owner: 10Ppchelko)
[11:11:11] (03Merged) 10jenkins-bot: mariadb: Change db208[56]:3315 to port 3318; repool db2038 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393208 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo)
[11:12:18] (03PS1) 10Marostegui: s8.hosts: Change port for db2085,db2086 [software] - 10https://gerrit.wikimedia.org/r/393210
[11:12:44] RECOVERY - puppet last run on californium is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:15:09] (03CR) 10Marostegui: [C: 032] s8.hosts: Change port for db2085,db2086 [software] - 10https://gerrit.wikimedia.org/r/393210 (owner: 10Marostegui)
[11:15:50] (03Merged) 10jenkins-bot: s8.hosts: Change port for db2085,db2086 [software] - 10https://gerrit.wikimedia.org/r/393210 (owner: 10Marostegui)
[11:15:54] PROBLEM - Check systemd state on es2018 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:16:14] es2018 ?
[11:16:24] PROBLEM - Disk space on es2018 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:16:24] PROBLEM - Check size of conntrack table on es2018 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:16:28] uyuy
[11:16:36] is it a glitch, or is it crashing?
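The "Check systemd state ... degraded" and "Expecting active but unit cassandra-b is failed" alerts in this log reflect a simple mapping from systemd unit state to Nagios/Icinga status. A minimal sketch of that mapping, assumed for illustration; it is not the production check_systemd plugin.

```python
# Conventional Nagios plugin exit codes.
NAGIOS_EXIT = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}

def check_unit(unit, state, expected="active"):
    """Mirror the alert text above: a unit whose ActiveState is not
    the expected one is CRITICAL. A sketch, not the real check."""
    if state == expected:
        return "OK", f"unit {unit} is {state}"
    return "CRITICAL", f"Expecting {expected} but unit {unit} is {state}"
```

Feeding in the states seen at 02:24 reproduces the alert wording for the restbase1014 cassandra instances.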
[11:16:48] I cannot login [11:17:25] I think it is rebooting or something [11:17:40] that was the only one that didn't crash: T130702 [11:17:43] T130702: Several es20XX servers keep crashing (es2017, es2019, es2015, es2014) since 23 March - https://phabricator.wikimedia.org/T130702 [11:17:43] almost [11:18:08] the console is stuck on [OK ] which looks like the typical service OK from when a server is stopping or rebooting [11:18:14] PROBLEM - DPKG on es2018 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:18:17] let's wait a bit more to see if it does something [11:18:22] else, I will reboot it myself [11:18:24] PROBLEM - Check size of conntrack table on es2018 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:18:41] at least we can log in to the console [11:18:54] hehe yes [11:20:06] can you check hw logs while I finish deployment, so I do not block tin [11:20:12] yep [11:20:14] no worries [11:20:24] interesting: https://grafana.wikimedia.org/dashboard/db/server-board?refresh=1m&orgId=1&var-server=es2018&var-network=eth0 [11:21:11] storage crashed I just got a kernel message [11:21:13] on the console [11:21:14] load incresees on io blockage [11:21:18] [30996087.770298] megaraid_sas 0000:03:00.0: pending commands remain after waiting, will reset adapter scsi0. 
[11:21:21] [30996102.596599] megaraid_sas 0000:03:00.0: Init cmd success [11:21:24] and it is iowaity [11:21:24] RECOVERY - Disk space on es2018 is OK: DISK OK [11:21:24] RECOVERY - Check size of conntrack table on es2018 is OK: OK: nf_conntrack is 0 % full [11:21:33] so most likely RAID issue, as usual [11:21:39] yeah [11:21:50] it is accessible again [11:21:58] without rebooting [11:22:05] looks like some I/O hardware error, I' day [11:22:07] mysql will probably had crashed [11:22:14] RECOVERY - DPKG on es2018 is OK: All packages OK [11:22:37] kernel started to log hung jbd processes at 11:16 (with a 120 seconds interval) [11:23:42] mm [11:23:47] mysql is up [11:23:48] it says mysql is still up? [11:23:57] maybe io didn't crash [11:24:01] just hunged up? [11:24:14] !log jynus@tin Synchronized wmf-config/db-codfw.php: Pool db2038, db2085:3318, db2086:3318 (duration: 00m 45s) [11:24:15] check mysql logs, then stop it [11:24:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:26] i am doing that [11:24:27] I will finish the deploy and depool it [11:24:41] there's also several hung processes for mysql, i.e. mysql was probably unable to complete some writes or reads due to the hardware error [11:25:30] !log jynus@tin Synchronized wmf-config/db-eqiad.php: db2085:3318, db2086:3318 (duration: 00m 43s) [11:25:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:46] !log Restart mysql on es2018 [11:25:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:57] no, don't start it back yet [11:26:58] it is the master [11:27:02] we need a failover [11:28:03] probably it crashed and restarted? [11:28:07] what does the log say? 
[11:29:20] PROBLEM - MariaDB Slave IO: es3 on es1014 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@es2018.codfw.wmnet:3306 - retry-time: 60 retries: 86400 message: Cant connect to MySQL server on es2018.codfw.wmnet (111 Connection refused) [11:29:37] no [11:29:38] it never crashed [11:29:44] oh, ignore that page [11:29:50] it is the replication back [11:29:54] PROBLEM - MariaDB Slave IO: es3 on es2019 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@es2018.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on es2018.codfw.wmnet (111 Connection refused) [11:29:54] PROBLEM - MariaDB Slave IO: es3 on es2017 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@es2018.codfw.wmnet:3306 - retry-time: 60 retries: 86400 message: Cant connect to MySQL server on es2018.codfw.wmnet (111 Connection refused) [11:30:12] I would reboot the server and start mysql again [11:30:13] *forward [11:30:18] wait first [11:30:20] we have time [11:30:24] once we depool [11:31:48] (03PS1) 10Jcrespo: maridb: depool es2018 after crash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393211 (https://phabricator.wikimedia.org/T181293) [11:32:05] (03CR) 10Marostegui: [C: 031] maridb: depool es2018 after crash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393211 (https://phabricator.wikimedia.org/T181293) (owner: 10Jcrespo) [11:32:09] I created T181293 [11:32:09] T181293: es2018 crashed - https://phabricator.wikimedia.org/T181293 [11:32:18] then let's put there everything we know [11:32:25] sounds good [11:32:28] and let's not rush, there is no outage ongoing [11:33:04] let's get the kernel and hw logs there [11:33:12] (03CR) 10jerkins-bot: [V: 04-1] maridb: depool es2018 after crash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393211
(https://phabricator.wikimedia.org/T181293) (owner: 10Jcrespo) [11:33:47] (03PS2) 10Jcrespo: maridb: depool es2018 after crash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393211 (https://phabricator.wikimedia.org/T181293) [11:34:47] let's also get the binlog position of the 2 servers [11:35:32] I cannot see any HW logs on the idrac [11:36:04] 10Operations, 10ops-codfw, 10DBA: Several es20XX servers keep crashing (es2017, es2019, es2015, es2014) since 23 March - https://phabricator.wikimedia.org/T130702#3785349 (10Marostegui) [11:36:07] I am going to drop the replication to es1014 [11:36:18] meaning reset slave all on es1014 [11:36:23] ok [11:37:21] last logged position was es2018-bin.001481:3902469 [11:37:28] although I do not think we care about that [11:37:29] RECOVERY - MariaDB Slave IO: es3 on es1014 is OK: OK slave_io_state not a slave [11:38:50] so you suspect kernel crash rather than hw issue? [11:39:00] or no conclusion yet? [11:39:08] No, I think storage crashed but maybe not as badly [11:39:15] trying to get the syslog logs from lithium [11:39:20] to see if there is something extra there [11:39:25] that was not written to the OS [11:39:26] (03PS1) 10Elukey: [WIP] cdh hadoop defaults refactoring [puppet/cdh] - 10https://gerrit.wikimedia.org/r/393213 [11:39:42] "Controller encountered a fatal error and was reset" [11:39:47] yeah, that supports that [11:39:48] (03CR) 10jerkins-bot: [V: 04-1] [WIP] cdh hadoop defaults refactoring [puppet/cdh] - 10https://gerrit.wikimedia.org/r/393213 (owner: 10Elukey) [11:39:59] the kernel being able to log that [11:40:00] that on the webconsole? [11:40:20] is your own dmesg :-) [11:40:43] Ah yeah, but on the HW there is nothing :) [11:40:46] at least via idrac [11:42:52] let's reboot it? [11:43:12] one sec [11:43:30] let me try to grab the logs from the controller actually [11:44:25] Disk 0 in Backplane 1 of Integrated RAID Controller 1 is inserted.
[11:44:44] basically it is reiniting all disks [11:45:18] (03Abandoned) 10Elukey: [WIP] cdh hadoop defaults refactoring [puppet/cdh] - 10https://gerrit.wikimedia.org/r/393213 (owner: 10Elukey) [11:45:20] but no error is sent before [11:45:31] yeah, nothing on the controller log either [11:46:07] marostegui: let's upgrade it- after all, we have one host to test [11:46:15] both kernel and mariadb [11:46:21] and pool it as a replica [11:46:21] sounds good [11:46:35] *we have to have one host to test [11:46:35] I will upgrade it now [11:47:00] (03CR) 10Jcrespo: [C: 032] maridb: depool es2018 after crash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393211 (https://phabricator.wikimedia.org/T181293) (owner: 10Jcrespo) [11:47:54] so, as far as you can tell, it kept being up and did stop cleanly, right? [11:48:04] !log Reboot es2018 after full-upgrade - T181293 [11:48:06] indeed [11:48:07] it stopped fine [11:48:11] (03PS1) 10Elukey: Allow the configuration of the HDFS Journalnode's jvm settings [puppet/cdh] - 10https://gerrit.wikimedia.org/r/393214 [11:48:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:12] T181293: es2018 crashed - https://phabricator.wikimedia.org/T181293 [11:48:13] and I was able to get it fine [11:48:15] (03Merged) 10jenkins-bot: maridb: depool es2018 after crash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393211 (https://phabricator.wikimedia.org/T181293) (owner: 10Jcrespo) [11:48:16] when it was up [11:48:19] ok, let's restart [11:48:32] going to reboot now [11:48:36] we will put it up and use it to move the slaves [11:48:42] sure [11:48:45] but pool it as a replica [11:48:59] rebooting - I am monitoring also via idrac [11:49:05] its boot, to see if there are any errors [11:50:39] !log jynus@tin Synchronized wmf-config/db-codfw.php: depool es2018 T181293 (duration: 00m 45s) [11:50:43] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/8959/analytics1028.eqiad.wmnet/"
[puppet/cdh] - 10https://gerrit.wikimedia.org/r/393214 (owner: 10Elukey) [11:50:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:59] it booted up cleanly [11:51:05] (03Merged) 10jenkins-bot: Allow the configuration of the HDFS Journalnode's jvm settings [puppet/cdh] - 10https://gerrit.wikimedia.org/r/393214 (owner: 10Elukey) [11:51:14] RECOVERY - Check systemd state on es2018 is OK: OK - running: The system is fully operational [11:51:52] (03PS1) 10Hashar: diamond: skip DiskSpace for Docker containers [puppet] - 10https://gerrit.wikimedia.org/r/393215 (https://phabricator.wikimedia.org/T181295) [11:52:10] Everything looks fine [11:52:14] Going to start mysql and run mysql_upgrade [11:53:05] RECOVERY - MariaDB Slave IO: es3 on es2019 is OK: OK slave_io_state Slave_IO_Running: Yes [11:53:07] RECOVERY - MariaDB Slave IO: es3 on es2017 is OK: OK slave_io_state Slave_IO_Running: Yes [11:53:37] (03PS1) 10Elukey: modules::cdh: update to the latest SHA [puppet] - 10https://gerrit.wikimedia.org/r/393216 [11:53:46] let's not start replication unless you already did that [11:54:00] (03CR) 10jerkins-bot: [V: 04-1] modules::cdh: update to the latest SHA [puppet] - 10https://gerrit.wikimedia.org/r/393216 (owner: 10Elukey) [11:54:16] no [11:54:20] I started with skip-slave [11:54:24] cool [11:56:01] (03PS2) 10Elukey: cdh: update to the latest SHA [puppet] - 10https://gerrit.wikimedia.org/r/393216 [11:56:20] (03CR) 10jerkins-bot: [V: 04-1] cdh: update to the latest SHA [puppet] - 10https://gerrit.wikimedia.org/r/393216 (owner: 10Elukey) [11:56:25] it had an uptime of 587d [11:56:43] let's disable puppet and stop pt-heartbeat there [11:56:52] so it doesn't confuse the other hosts [11:57:03] !log Disable puppet on es2018 - T181293 [11:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:11] T181293: es2018 crashed - https://phabricator.wikimedia.org/T181293 [11:57:38] heartbeat stopped [11:57:45] cool 
[11:58:03] so we are at es2018-bin.001482:138357 everywhere [11:58:26] which host do you want to be the master? [11:58:31] that is es1014-bin.001887:819580774 [11:58:31] (I am doing the puppet patches) [11:58:44] the one that I randomly pooled as such on mediawiki :-) [11:58:48] haha [11:58:49] ok [11:59:03] did I deploy that already? [11:59:40] es2017 as master [11:59:45] you did [11:59:51] so I will move es2019 [12:00:00] cool [12:00:12] to es2017-bin.001476:843479464 [12:00:25] and es2017 to [12:00:31] es1014-bin.001887:819580774 [12:01:30] (03PS3) 10Elukey: cdh: update to the latest SHA [puppet] - 10https://gerrit.wikimedia.org/r/393216 [12:01:37] why is es2017 getting writes? [12:01:42] or its binlog advancing? [12:01:51] it is not [12:01:57] that I can see [12:02:06] ah [12:02:07] it was before, because pt-heartbeat [12:02:09] I logged in to db2017 [12:02:10] XDD [12:02:24] do you guys prefer to avoid any in-flight puppet changes or shall I proceed? [12:02:40] es2017-bin.001476:843479464 -> looks good [12:02:55] elukey: if you can hold a sec, yep, I am about to send a patch [12:03:12] sure [12:03:26] thanks [12:03:49] (03PS1) 10Marostegui: mariadb: Promote es2017 to master [puppet] - 10https://gerrit.wikimedia.org/r/393218 (https://phabricator.wikimedia.org/T181293) [12:04:04] marostegui: do not deploy yet [12:04:10] no [12:04:10] until I change the replication master [12:04:11] no worries [12:04:21] I will also stop mysql on es2018 to update its socket once it is moved [12:04:45] actually, let me amend the patch [12:05:20] (03PS2) 10Marostegui: mariadb: Promote es2017 to master [puppet] - 10https://gerrit.wikimedia.org/r/393218 (https://phabricator.wikimedia.org/T181293) [12:05:48] I will change master to es2018, too [12:05:53] same coords [12:05:56] yep [12:06:43] now you can deploy the new master heartbeat [12:06:48] good [12:06:57] while I repoint the new master itself [12:07:02] that is, es2017 [12:07:10] want me to do it manually or by merging puppet?
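The repoint discussed above is driven by binlog coordinates quoted as file:position pairs (e.g. es2017-bin.001476:843479464). A hedged sketch of splitting such a pair into the arguments a MariaDB CHANGE MASTER TO statement needs; the hostname is taken from the transcript, and the statement is only printed here, never executed. In a real failover you would run it through the mysql client, after STOP SLAVE and after verifying the coordinates match on every replica, as done above.

```shell
# Split a "binlog_file:position" pair into CHANGE MASTER TO arguments.
# The statement is printed, not executed.
coords='es2017-bin.001476:843479464'
file=${coords%:*}     # everything before the last colon: the binlog file
pos=${coords##*:}     # everything after the last colon: the position
printf "CHANGE MASTER TO MASTER_HOST='es2017.codfw.wmnet', MASTER_LOG_FILE='%s', MASTER_LOG_POS=%s;\n" "$file" "$pos"
```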
[12:07:11] up to you [12:07:19] just merge puppet [12:07:22] it should be ok [12:07:23] (03CR) 10Marostegui: "https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/8962/console" [puppet] - 10https://gerrit.wikimedia.org/r/393218 (https://phabricator.wikimedia.org/T181293) (owner: 10Marostegui) [12:07:28] (03CR) 10Marostegui: [C: 032] mariadb: Promote es2017 to master [puppet] - 10https://gerrit.wikimedia.org/r/393218 (https://phabricator.wikimedia.org/T181293) (owner: 10Marostegui) [12:07:56] running puppet on es2017 [12:08:44] heartbeat running [12:09:02] es2018 is catching up fine [12:09:09] with the heartbeat from es2017 [12:09:38] cool [12:09:51] I would do a sanity check of enwiki later [12:09:53] let me know when I can stop mysql on es2018, enable puppet and enable gtid [12:10:06] not a full table check, just to make sure no event has been lost [12:10:27] We also have ROW based, so that is a good sanity check itself too [12:10:29] stop replication? [12:10:39] or mysql? [12:10:41] mysql [12:10:46] to update the socket [12:10:48] ah! [12:10:48] now that it is depooled [12:10:56] ok, any time now [12:10:59] ok [12:11:00] doing it [12:11:13] I assume it is down on icinga already? [12:11:17] yep [12:11:26] we are good now [12:11:41] I will do a quick data check later, but other than that, everything is ok [12:11:45] cool [12:11:55] starting mysql again [12:11:58] I will enable gtid back [12:13:47] !log Enable GTID on es2018 - T181293 [12:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:54] T181293: es2018 crashed - https://phabricator.wikimedia.org/T181293 [12:13:59] all done [12:14:48] elukey: feel free to push anything you like [12:14:50] Thanks for waiting :) [12:15:39] I will do that on the other host, without a restart [12:15:42] *hosts [12:15:48] cool! [12:16:26] marostegui: ack!
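The "enable gtid back" step above switches a replica from the position-based CHANGE MASTER used during the failover back to GTID replication. A sketch of the MariaDB statements typically used for that; they are only printed here (to be fed to the mysql client in practice), and slave_pos is MariaDB's mode that resumes from the GTID the replica had already reached.

```shell
# Print the MariaDB statements for re-enabling GTID after a position-based
# repoint (sketch only; not executed against any server here).
printf '%s\n' \
  'STOP SLAVE;' \
  'CHANGE MASTER TO MASTER_USE_GTID = slave_pos;' \
  'START SLAVE;'
```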
[12:20:47] (03PS3) 10Jcrespo: mariadb: Leave reimaginable only the db latest servers [puppet] - 10https://gerrit.wikimedia.org/r/392400 (https://phabricator.wikimedia.org/T170662) [12:25:29] (03CR) 10Elukey: [C: 032] cdh: update to the latest SHA [puppet] - 10https://gerrit.wikimedia.org/r/393216 (owner: 10Elukey) [12:25:33] (03PS4) 10Elukey: cdh: update to the latest SHA [puppet] - 10https://gerrit.wikimedia.org/r/393216 [12:27:32] (03PS2) 10Reedy: Fix up 611a3b6cba28342c [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393169 (owner: 10Legoktm) [12:27:34] (03CR) 10Reedy: [C: 032] Fix up 611a3b6cba28342c [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393169 (owner: 10Legoktm) [12:28:59] (03Merged) 10jenkins-bot: Fix up 611a3b6cba28342c [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393169 (owner: 10Legoktm) [12:31:10] (03PS1) 10Elukey: Fix prometheus target for the Eventlogging mysql master db [puppet] - 10https://gerrit.wikimedia.org/r/393220 (https://phabricator.wikimedia.org/T177405) [12:32:01] !log reedy@tin Synchronized docroot/mediawiki/keys/: Fixup keys (duration: 00m 45s) [12:32:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:29] (03PS9) 10ArielGlenn: rsync all dumps status files to web servers and unpack them periodically [puppet] - 10https://gerrit.wikimedia.org/r/392875 (https://phabricator.wikimedia.org/T179857) [12:38:18] !log disable puppet on db1071 and stop local s5 heartbeat there [12:38:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:35] (03CR) 10ArielGlenn: [C: 032] rsync all dumps status files to web servers and unpack them periodically [puppet] - 10https://gerrit.wikimedia.org/r/392875 (https://phabricator.wikimedia.org/T179857) (owner: 10ArielGlenn) [12:40:13] !log setting up s8 topology on eqiad [12:40:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:24] ignore errors on dump* and such, fixing grrrr [12:47:57] 
(03PS1) 10ArielGlenn: use right path to dumps status file script [puppet] - 10https://gerrit.wikimedia.org/r/393225 [12:49:12] (03CR) 10ArielGlenn: [C: 032] use right path to dumps status file script [puppet] - 10https://gerrit.wikimedia.org/r/393225 (owner: 10ArielGlenn) [12:49:53] (03PS5) 10Mark Bergsma: Support per-service-IP BGP MED values [debs/pybal] - 10https://gerrit.wikimedia.org/r/393097 (https://phabricator.wikimedia.org/T165764) [12:50:26] !log resetting replication on es1011 for consistency with other replica sets [12:50:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:20] ok, that should be it, [12:54:29] !log reenabling puppet on db1071 [12:54:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:02] s8 is officially on eqiad and codfw, including replication control [12:56:24] I mean, it is not pooled, but the infrastructure is there [12:56:47] only labs filters and replication pending [12:57:50] \o/ [13:06:35] 10Operations, 10ops-codfw, 10DBA: Several es20XX servers keep crashing (es2017, es2019, es2015, es2014) since 23 March - https://phabricator.wikimedia.org/T130702#3785600 (10Marostegui) [13:17:04] !log Stop replication on db1097 to reimport and recompress commonswiki.watchlist [13:17:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:18] !log installing openjpeg2 updates (original security already got installed after initial release, but there was a binNMU for amd64) [13:24:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:07] (03CR) 10Addshore: [C: 031] diamond: skip DiskSpace for Docker containers [puppet] - 10https://gerrit.wikimedia.org/r/393215 (https://phabricator.wikimedia.org/T181295) (owner: 10Hashar) [13:52:37] !log removing git packages from jessie-wikimedia/experimental (replaced by component/git) [13:52:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:15] 
10Operations, 10Patch-For-Review: Restructure our internal repositories further - https://phabricator.wikimedia.org/T158583#3785688 (10MoritzMuehlenhoff) 05Open>03Resolved I'm closing this bug. The new structure is fully in effect for stretch-wikimedia and partly for jessie-wikimedia (component/foo also ex... [13:56:44] 10Operations: Collate jessie-wikimedia/backports into jessie-wikimedia/main - https://phabricator.wikimedia.org/T167292#3785690 (10MoritzMuehlenhoff) p:05High>03Low [13:57:53] 10Operations: Integrate jessie 8.7 point release - https://phabricator.wikimedia.org/T155401#3785691 (10MoritzMuehlenhoff) 05Open>03Resolved This is complete for a while now. [14:02:58] (03PS7) 10Ema: vcl: distinguish between hfp and hfm [puppet] - 10https://gerrit.wikimedia.org/r/391171 (https://phabricator.wikimedia.org/T180434) [14:03:00] (03PS1) 10Ema: WIP: cache: size-based cutoff for exp caching policy [puppet] - 10https://gerrit.wikimedia.org/r/393227 (https://phabricator.wikimedia.org/T144187) [14:04:13] (03PS2) 10Ema: WIP: vcl: size-based cutoff for exp caching policy [puppet] - 10https://gerrit.wikimedia.org/r/393227 (https://phabricator.wikimedia.org/T144187) [14:08:14] PROBLEM - puppet last run on restbase1015 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[set debconf flag seen for wireshark-common/install-setuid] [14:11:42] 10Operations: Integrate jessie 8.8 point release - https://phabricator.wikimedia.org/T164703#3785705 (10MoritzMuehlenhoff) These are fully rolled out: binutils libvirt ndisc6 [14:18:15] RECOVERY - puppet last run on restbase1015 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [14:35:34] (03PS1) 10Elukey: Drop the Eventlogging support for dbstore1002 [puppet] - 10https://gerrit.wikimedia.org/r/393238 (https://phabricator.wikimedia.org/T156844) [14:36:26] (03PS2) 10Elukey: Drop the Eventlogging support for dbstore1002 [puppet] - 10https://gerrit.wikimedia.org/r/393238 (https://phabricator.wikimedia.org/T156844) [14:37:30] (03CR) 10Elukey: [C: 04-2] "Waiting for Nov 28th to drop the log database manually from dbstore1002" [puppet] - 10https://gerrit.wikimedia.org/r/393238 (https://phabricator.wikimedia.org/T156844) (owner: 10Elukey) [14:47:36] (03PS1) 10Jdrewniak: [WIP] Replace portals submodule with portals/deploy submodule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393239 (https://phabricator.wikimedia.org/T180777) [14:47:56] 10Operations, 10CirrusSearch, 10Discovery, 10MediaWiki-JobQueue, and 5 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3785770 (10Aklapper) Is {T178840} a duplicate? [14:50:10] (03PS1) 10Muehlenhoff: Restrict access to ferm service on mwlog* hosts [puppet] - 10https://gerrit.wikimedia.org/r/393240 [14:51:28] 10Operations, 10CirrusSearch, 10Discovery, 10MediaWiki-JobQueue, and 5 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3785776 (10jcrespo) @Aklapper Probably, but I would close that one, as that should not be happening right now, unless you have reports saying it is again. 
[14:57:04] (03PS2) 10Jdrewniak: [WIP] Replace portals submodule with portals/deploy submodule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393239 (https://phabricator.wikimedia.org/T180777) [14:59:05] 10Operations, 10Analytics, 10hardware-requests: Refresh or replace oxygen - https://phabricator.wikimedia.org/T181264#3785806 (10elukey) As far as I know Analytics has no plans for oxygen, I thought that it was completely managed by ops :D +1 for the fast SSDs for occasional greps, even if recently Filippo... [15:01:30] (03PS1) 10Muehlenhoff: grafana_http: Restrict to CACHE_MISC [puppet] - 10https://gerrit.wikimedia.org/r/393244 [15:10:16] (03CR) 10Ema: [C: 031] grafana_http: Restrict to CACHE_MISC (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/393244 (owner: 10Muehlenhoff) [15:15:32] (03PS1) 10ArielGlenn: clean up old misc dump output files from cron jobs on dump hosts [puppet] - 10https://gerrit.wikimedia.org/r/393245 (https://phabricator.wikimedia.org/T179942) [15:15:57] (03CR) 10jerkins-bot: [V: 04-1] clean up old misc dump output files from cron jobs on dump hosts [puppet] - 10https://gerrit.wikimedia.org/r/393245 (https://phabricator.wikimedia.org/T179942) (owner: 10ArielGlenn) [15:16:34] PROBLEM - cxserver endpoints health on scb1002 is CRITICAL: /v1/page/{language}/{title}{/revision} (Fetch enwiki Oxygen page) timed out before a response was received [15:17:44] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) timed out before a response was received: /_info (retrieve service info) timed out before a response was received: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) timed out before a response was received [15:18:34] PROBLEM - cxserver endpoints health on scb1002 is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to 
target language wiki.) timed out before a response was received: /v1/page/{language}/{title}{/revision} (Fetch enwiki Oxygen page) timed out before a response was received: /v1/mt/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium.) timed o [15:18:34] e was received [15:18:44] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [15:21:44] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for Barack Obama) timed out before a response was received [15:23:35] PROBLEM - cxserver endpoints health on scb1002 is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed out before a response was received: /v1/mt/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium.) timed out before a response was received [15:23:54] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [15:24:17] 10Operations, 10Analytics-Kanban, 10DBA, 10Patch-For-Review, 10User-Elukey: Decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3785829 (10elukey) [15:24:42] 10Operations, 10DBA, 10MediaWiki-Configuration, 10Wikidata: Test moving testwikidatawiki database to s8 replica set on Wikimedia - https://phabricator.wikimedia.org/T180694#3785832 (10Addshore) Thanks! I only asked as the title of this ticket references testwikidatawiki not wikidatawiki [15:25:44] RECOVERY - cxserver endpoints health on scb1002 is OK: All endpoints are healthy [15:27:29] 10Operations, 10Analytics-Kanban, 10DBA, 10Patch-For-Review, 10User-Elukey: Decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3785843 (10elukey) a:03elukey [15:27:54] PROBLEM - Check systemd state on scb1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[15:28:30] that's me ^ [15:31:29] (03PS2) 10ArielGlenn: clean up old misc dump output files from cron jobs on dump hosts [puppet] - 10https://gerrit.wikimedia.org/r/393245 (https://phabricator.wikimedia.org/T179942) [15:31:54] (03CR) 10jerkins-bot: [V: 04-1] clean up old misc dump output files from cron jobs on dump hosts [puppet] - 10https://gerrit.wikimedia.org/r/393245 (https://phabricator.wikimedia.org/T179942) (owner: 10ArielGlenn) [15:33:29] (03PS3) 10ArielGlenn: clean up old misc dump output files from cron jobs on dump hosts [puppet] - 10https://gerrit.wikimedia.org/r/393245 (https://phabricator.wikimedia.org/T179942) [15:33:51] (03CR) 10jerkins-bot: [V: 04-1] clean up old misc dump output files from cron jobs on dump hosts [puppet] - 10https://gerrit.wikimedia.org/r/393245 (https://phabricator.wikimedia.org/T179942) (owner: 10ArielGlenn) [15:35:09] (03PS4) 10ArielGlenn: clean up old misc dump output files from cron jobs on dump hosts [puppet] - 10https://gerrit.wikimedia.org/r/393245 (https://phabricator.wikimedia.org/T179942) [15:39:55] RECOVERY - Check systemd state on scb1002 is OK: OK - running: The system is fully operational [15:39:56] (03PS2) 10Hashar: diamond: skip DiskSpace for Docker containers [puppet] - 10https://gerrit.wikimedia.org/r/393215 (https://phabricator.wikimedia.org/T177052) [15:40:08] (03PS1) 10Muehlenhoff: ntp: Restrict access [puppet] - 10https://gerrit.wikimedia.org/r/393247 [15:41:51] (03CR) 10Hashar: "Should probably be made more generic eg:" [puppet] - 10https://gerrit.wikimedia.org/r/393215 (https://phabricator.wikimedia.org/T177052) (owner: 10Hashar) [15:42:40] 10Operations, 10Cloud-Services, 10monitoring, 10Continuous-Integration-Infrastructure (shipyard), and 3 others: Grafana reports ALL docker mounts in a spammy way - https://phabricator.wikimedia.org/T177052#3785870 (10hashar) [15:45:28] 10Operations, 10Continuous-Integration-Infrastructure, 10Nodepool, 10Patch-For-Review: Clean up apt:pin of python modules used for 
Nodepool - https://phabricator.wikimedia.org/T137217#3785873 (10hashar) 05Open>03declined Nodepool is legacy. I am not going to bother upgrading the python modules. We will... [15:55:48] (03PS5) 10ArielGlenn: clean up old misc dump output files from cron jobs on dump hosts [puppet] - 10https://gerrit.wikimedia.org/r/393245 (https://phabricator.wikimedia.org/T179942) [15:59:34] (03PS1) 10Ema: vcl: add hostname/layer info to syntethic healthcheck response [puppet] - 10https://gerrit.wikimedia.org/r/393251 [16:08:10] (03PS1) 10Muehlenhoff: hue: Restrict to CACHE_MISC [puppet] - 10https://gerrit.wikimedia.org/r/393253 [16:09:03] (03Abandoned) 10Muehlenhoff: hue: Restrict to CACHE_MISC [puppet] - 10https://gerrit.wikimedia.org/r/393253 (owner: 10Muehlenhoff) [16:12:05] (03CR) 10Ema: "Looks reasonable in pcc https://puppet-compiler.wmflabs.org/compiler02/8966/cp1008.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/393251 (owner: 10Ema) [16:19:58] (03CR) 10Marostegui: [C: 031] Fix prometheus target for the Eventlogging mysql master db [puppet] - 10https://gerrit.wikimedia.org/r/393220 (https://phabricator.wikimedia.org/T177405) (owner: 10Elukey) [16:20:52] (03PS6) 10ArielGlenn: clean up old misc dump output files from cron jobs on dump hosts [puppet] - 10https://gerrit.wikimedia.org/r/393245 (https://phabricator.wikimedia.org/T179942) [16:21:28] 10Operations, 10DBA, 10MediaWiki-Configuration, 10Wikidata: Test moving testwikidatawiki database to s8 replica set on Wikimedia - https://phabricator.wikimedia.org/T180694#3785922 (10Marostegui) Yeah, we decided to go for wikidatawiki on codfw, as it is the passive DC :-) [16:34:45] (03PS7) 10ArielGlenn: clean up old misc dump output files from cron jobs on dump hosts [puppet] - 10https://gerrit.wikimedia.org/r/393245 (https://phabricator.wikimedia.org/T179942) [16:44:39] (03PS1) 10Elukey: role::analytics_cluster::database::meta::backup*: move to profiles [puppet] - 10https://gerrit.wikimedia.org/r/393257 
(https://phabricator.wikimedia.org/T167790) [16:46:27] (03PS2) 10Elukey: role::analytics_cluster::database::meta::backup*: move to profiles [puppet] - 10https://gerrit.wikimedia.org/r/393257 (https://phabricator.wikimedia.org/T167790) [16:51:37] (03PS3) 10Elukey: role::analytics_cluster::database::meta::backup*: move to profiles [puppet] - 10https://gerrit.wikimedia.org/r/393257 (https://phabricator.wikimedia.org/T167790) [16:55:05] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler03/8971/ - looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/393257 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [17:15:36] (03PS1) 10Giuseppe Lavagetto: [WiP] Move puppet CI to puppet 4.8.2 [puppet] - 10https://gerrit.wikimedia.org/r/393259 [17:16:08] (03CR) 10jerkins-bot: [V: 04-1] [WiP] Move puppet CI to puppet 4.8.2 [puppet] - 10https://gerrit.wikimedia.org/r/393259 (owner: 10Giuseppe Lavagetto) [17:24:09] (03CR) 10Muehlenhoff: "$wgFFmpeg2theoraLocation is now removed from wmf-config, so removing my earlier -1" [puppet] - 10https://gerrit.wikimedia.org/r/373733 (https://phabricator.wikimedia.org/T172445) (owner: 10Muehlenhoff) [17:24:17] (03PS3) 10Muehlenhoff: Remove ffmpeg2theora from package list [puppet] - 10https://gerrit.wikimedia.org/r/373733 (https://phabricator.wikimedia.org/T172445) [18:30:18] (03PS6) 10Mark Bergsma: Support per-service-IP BGP MED values [debs/pybal] - 10https://gerrit.wikimedia.org/r/393097 (https://phabricator.wikimedia.org/T165764) [18:31:07] (03CR) 10jerkins-bot: [V: 04-1] Support per-service-IP BGP MED values [debs/pybal] - 10https://gerrit.wikimedia.org/r/393097 (https://phabricator.wikimedia.org/T165764) (owner: 10Mark Bergsma) [18:33:30] (03PS7) 10Mark Bergsma: Support per-service-IP BGP MED values [debs/pybal] - 10https://gerrit.wikimedia.org/r/393097 (https://phabricator.wikimedia.org/T165764) [18:57:15] PROBLEM - eventstreams on scb1001 is CRITICAL: connect to address 10.64.0.16 and port 8092: Connection refused 
[18:57:54] PROBLEM - pdfrender on scb1001 is CRITICAL: connect to address 10.64.0.16 and port 5252: Connection refused [18:59:04] PROBLEM - cxserver endpoints health on scb1001 is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed out before a response was received: /v1/page/{language}/{title}{/revision} (Fetch enwiki Oxygen page) timed out before a response was received: /v1/mt/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium.) timed o [18:59:04] e was received [18:59:15] RECOVERY - eventstreams on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 929 bytes in 0.033 second response time [18:59:54] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.003 second response time [18:59:55] RECOVERY - cxserver endpoints health on scb1001 is OK: All endpoints are healthy [19:01:01] 10Puppet, 10Cloud-VPS, 10cloud-services-team (Kanban): role::puppetmaster::standalone has no firewall rule for port 8140 - https://phabricator.wikimedia.org/T154150#3786046 (10aborrero) More testing. I see that a patch like this just works, but the reporter @scfc seems to suggest this doesn't work: ``` diff... 
[19:46:14] PROBLEM - cassandra-c CQL 10.64.0.232:9042 on restbase1007 is CRITICAL: connect to address 10.64.0.232 and port 9042: Connection refused [19:46:15] PROBLEM - cassandra-a service on restbase1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [19:46:15] PROBLEM - cassandra-c service on restbase1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [19:46:24] PROBLEM - cassandra-c service on restbase1012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [19:46:24] PROBLEM - cassandra-a SSL 10.64.0.230:7001 on restbase1007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [19:46:25] PROBLEM - cassandra-c CQL 10.64.32.204:9042 on restbase1012 is CRITICAL: connect to address 10.64.32.204 and port 9042: Connection refused [19:46:44] PROBLEM - cassandra-b CQL 10.64.0.231:9042 on restbase1007 is CRITICAL: connect to address 10.64.0.231 and port 9042: Connection refused [19:46:44] PROBLEM - cassandra-b service on restbase1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [19:46:45] PROBLEM - cassandra-b CQL 10.64.32.203:9042 on restbase1012 is CRITICAL: connect to address 10.64.32.203 and port 9042: Connection refused [19:46:45] PROBLEM - cassandra-a service on restbase1012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [19:46:45] PROBLEM - cassandra-a CQL 10.64.0.230:9042 on restbase1007 is CRITICAL: connect to address 10.64.0.230 and port 9042: Connection refused [19:46:45] PROBLEM - Check systemd state on restbase1012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:46:54] PROBLEM - Check systemd state on restbase1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[19:46:55] PROBLEM - cassandra-b SSL 10.64.32.203:7001 on restbase1012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[19:46:55] PROBLEM - cassandra-a SSL 10.64.32.202:7001 on restbase1012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[19:46:55] PROBLEM - cassandra-b SSL 10.64.0.231:7001 on restbase1007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[19:46:55] PROBLEM - cassandra-a CQL 10.64.32.202:9042 on restbase1012 is CRITICAL: connect to address 10.64.32.202 and port 9042: Connection refused
[19:46:55] PROBLEM - cassandra-b service on restbase1012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed
[19:47:14] PROBLEM - cassandra-c SSL 10.64.32.204:7001 on restbase1012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[19:47:15] PROBLEM - cassandra-c SSL 10.64.0.232:7001 on restbase1007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[20:07:14] PROBLEM - Disk space on elastic1017 is CRITICAL: DISK CRITICAL - free space: /srv 61289 MB (12% inode=99%)
[20:08:35] PROBLEM - HHVM rendering on mw2126 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:09:25] RECOVERY - HHVM rendering on mw2126 is OK: HTTP OK: HTTP/1.1 200 OK - 78961 bytes in 0.304 second response time
[20:13:14] RECOVERY - Disk space on elastic1017 is OK: DISK OK
[20:25:27] 10Operations, 10Analytics-Kanban, 10DBA, 10Patch-For-Review, 10User-Elukey: Decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3786146 (10Capt_Swing) The `shawn` table belonged to Shawn Walker, a research intern in 2011. These tables can be safely deleted.
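The `cassandra-* CQL`/`SSL` alerts above are plain TCP connect probes: Icinga tries to open the port and reports `Connection refused` when nothing is listening. As a rough illustration only (this is a hypothetical helper, not Wikimedia's actual monitoring plugin), such a probe can be sketched in Python:

```python
import socket


def tcp_probe(host: str, port: int, timeout: float = 10.0) -> str:
    """Return an Icinga-style status line for a simple TCP connect check."""
    try:
        # A successful connect() is all these checks verify; the
        # connection is closed immediately without sending data.
        with socket.create_connection((host, port), timeout=timeout):
            return f"OK - connect to address {host} and port {port} succeeded"
    except socket.timeout:
        return f"CRITICAL - Socket timeout after {int(timeout)} seconds"
    except OSError as exc:
        # e.g. "Connection refused" when the unit is down, as in the log above
        return f"CRITICAL - connect to address {host} and port {port}: {exc.strerror}"
```

A refused connect (service stopped, as with the failed cassandra units here) fails fast; a firewalled or black-holed port instead hits the timeout branch, which is why the two alert texts differ.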
[21:33:58] (03Draft1) 10MarcoAurelio: [WIP] puppet: redirect several wikis per LangCom decission [puppet] - 10https://gerrit.wikimedia.org/r/393289 (https://phabricator.wikimedia.org/T169450)
[21:34:03] (03PS2) 10MarcoAurelio: [WIP] puppet: redirect several wikis per LangCom decission [puppet] - 10https://gerrit.wikimedia.org/r/393289 (https://phabricator.wikimedia.org/T169450)
[21:38:24] 10Operations, 10Continuous-Integration-Infrastructure, 10Jenkins: zuul/jenkins has jobs stuck in postmerge for 13 hours - https://phabricator.wikimedia.org/T181313#3786188 (10MarcoAurelio)
[21:38:53] 10Operations, 10Continuous-Integration-Infrastructure, 10Jenkins: zuul/jenkins has jobs stuck in postmerge for 13 hours - https://phabricator.wikimedia.org/T181313#3786200 (10MarcoAurelio) p:05Triage>03Unbreak! Temptatively UBN as this is not normal and has jobs stuck.
[21:41:00] 10Operations, 10Performance-Team, 10Traffic: load.php requests taking multiple minutes - https://phabricator.wikimedia.org/T181315#3786217 (10Tgr)
[21:57:19] 10Operations, 10Continuous-Integration-Infrastructure, 10Jenkins: zuul/jenkins has jobs stuck in postmerge for 13 hours - https://phabricator.wikimedia.org/T181313#3786259 (10Paladox) p:05Unbreak!>03High Changing to high as UBN means a site is down. Tests getting stuck in the post merge pipeline happened...
[21:57:29] 10Operations, 10Performance-Team, 10Traffic: load.php requests taking multiple minutes - https://phabricator.wikimedia.org/T181315#3786261 (10Tgr)
[21:58:58] 10Operations, 10Continuous-Integration-Infrastructure, 10Jenkins: zuul/jenkins has jobs stuck in postmerge for 13 hours - https://phabricator.wikimedia.org/T181313#3786262 (10MarcoAurelio) Okay. Would a restart of zuul help unlock those jobs?
[22:06:59] 10Operations, 10Continuous-Integration-Infrastructure, 10Jenkins: zuul/jenkins has jobs stuck in postmerge for 13 hours - https://phabricator.wikimedia.org/T181313#3786188 (10hashar) That happens from time to time and it is T72597.
There is no magic solution to remove the lock though ;(
[22:08:19] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1101:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393185 (https://phabricator.wikimedia.org/T177208) (owner: 10Marostegui)
[22:08:21] (03CR) 10jenkins-bot: db-eqiad.php: Increase weight:db1101,db1092,db1097 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393193 (owner: 10Marostegui)
[22:08:23] (03CR) 10jenkins-bot: db-eqiad.php: Increase weight for db1097, db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393197 (owner: 10Marostegui)
[22:08:25] (03CR) 10jenkins-bot: db-eqiad.php: Depool all future s8 hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393204 (https://phabricator.wikimedia.org/T177208) (owner: 10Marostegui)
[22:08:27] (03CR) 10jenkins-bot: mariadb: Change db208[56]:3315 to port 3318; repool db2038 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393208 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo)
[22:08:29] (03CR) 10jenkins-bot: maridb: depool es2018 after crash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393211 (https://phabricator.wikimedia.org/T181293) (owner: 10Jcrespo)
[22:08:31] (03CR) 10jenkins-bot: Fix up 611a3b6cba28342c [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393169 (owner: 10Legoktm)
[22:09:27] 10Operations, 10Continuous-Integration-Infrastructure, 10Jenkins: zuul/jenkins has jobs stuck in postmerge for 13 hours - https://phabricator.wikimedia.org/T181313#3786267 (10hashar) That happens from time to time and it is T72597. There is no magic solution to remove the lock though ;( I went to https://in...
[22:11:03] 10Operations, 10Continuous-Integration-Infrastructure, 10Jenkins: zuul/jenkins has jobs stuck in postmerge for 13 hours - https://phabricator.wikimedia.org/T181313#3786188 (10hashar) 05Open>03Resolved a:03hashar
[22:11:36] 10Operations, 10Continuous-Integration-Infrastructure, 10Jenkins: zuul/jenkins has jobs stuck in postmerge for 13 hours - https://phabricator.wikimedia.org/T181313#3786275 (10MarcoAurelio) Thank you!
[23:02:44] (03PS1) 10Reedy: Remove $wgStyleVersion appending in CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393353 (https://phabricator.wikimedia.org/T181318)
[23:14:34] (03PS8) 10ArielGlenn: clean up old misc dump output files from cron jobs on dump hosts [puppet] - 10https://gerrit.wikimedia.org/r/393245 (https://phabricator.wikimedia.org/T179942)
[23:20:15] PROBLEM - HHVM jobrunner on mw1307 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:21:05] PROBLEM - HHVM jobrunner on mw1318 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:25:35] PROBLEM - HHVM jobrunner on mw1260 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:36:14] PROBLEM - HHVM jobrunner on mw1259 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:37:24] PROBLEM - Apache HTTP on mw2130 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:38:15] RECOVERY - Apache HTTP on mw2130 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.120 second response time
[23:49:44] RECOVERY - HHVM jobrunner on mw1260 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time
[23:51:25] RECOVERY - HHVM jobrunner on mw1259 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 8.953 second response time
[23:51:25] RECOVERY - HHVM jobrunner on mw1318 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.005 second response time
[23:51:44] RECOVERY - HHVM jobrunner on mw1307 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time
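The `HHVM jobrunner` and `Apache HTTP` recoveries above follow the classic check_http output shape: `HTTP OK: HTTP/1.1 200 OK - N bytes in T second response time`, with a CRITICAL `Socket timeout after 10 seconds` when the request hangs. A minimal Python sketch of a probe emitting that format (an illustrative stand-in, not the actual Nagios/Icinga plugin) might look like:

```python
import socket
import time
import urllib.request
import urllib.error


def http_check(url: str, timeout: float = 10.0) -> str:
    """Emit a check_http-style status line for a single GET request."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read()
    except socket.timeout:
        # Matches the "Socket timeout after 10 seconds" CRITICALs in the log
        return f"CRITICAL - Socket timeout after {int(timeout)} seconds"
    except urllib.error.URLError as exc:
        return f"CRITICAL - {exc.reason}"
    elapsed = time.monotonic() - start
    # resp.version is 10 or 11, i.e. HTTP/1.0 or HTTP/1.1
    return (f"HTTP OK: HTTP/{resp.version / 10:.1f} {resp.status} {resp.reason}"
            f" - {len(body)} bytes in {elapsed:.3f} second response time")
```

The byte count covers the response body only, which is why the jobrunner health endpoint consistently reports the same 206 bytes while the response time varies.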