[00:20:21] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (bad URL) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/citoid
[00:20:21] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (bad URL) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/citoid
[00:20:27] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (bad URL) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/citoid
[00:20:29] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[00:21:33] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/citoid
[00:21:33] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[00:21:41] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (Scrapes sample page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/citoid
[00:22:39] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/citoid
[00:22:43] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/citoid
[00:22:47] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/citoid
[01:05:55] (CR) Ottomata: eventgate-analytics - Add statsd prometheus mappings for node-rdkafka-statsd metrics (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/496554 (https://phabricator.wikimedia.org/T218305) (owner: Ottomata)
[01:13:16] (CR) Ottomata: eventgate-analytics - Add statsd prometheus mappings for node-rdkafka-statsd metrics (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/496554 (https://phabricator.wikimedia.org/T218305) (owner: Ottomata)
[01:13:42] (PS5) Ottomata: eventgate-analytics - Add statsd prometheus mappings for node-rdkafka-statsd metrics [deployment-charts] - https://gerrit.wikimedia.org/r/496554 (https://phabricator.wikimedia.org/T218305)
[01:36:23] (PS2) Alex Monk: Re-apply "openstack::clientpackages::common: include python3 packages" [puppet] - https://gerrit.wikimedia.org/r/497009 (https://phabricator.wikimedia.org/T218423)
[02:00:57] !log Started manual run of unpublished ContentTranslation draft purge script (T218279)
[02:01:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:01:01] T218279: Run unpublished draft purge script for CX (Week of 03/17) - https://phabricator.wikimedia.org/T218279
[02:03:17] PROBLEM - puppet last run on ms-be1043 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[parted-/dev/sdk]
[02:14:53] Operations, Gerrit, Release-Engineering-Team: I'm refused to login to Gerrit - https://phabricator.wikimedia.org/T218507 (Paladox) @GChecker yes that was still required before.
[02:25:01] PROBLEM - Disk space on ms-be1043 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=89%)
[03:58:15] !log Finished manual run of unpublished ContentTranslation draft purge script (T218279)
[03:58:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:58:19] T218279: Run unpublished draft purge script for CX (Week of 03/17) - https://phabricator.wikimedia.org/T218279
[04:11:33] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[05:59:58] (PS1) Marostegui: db-eqiad.php: Depool db1121 [mediawiki-config] - https://gerrit.wikimedia.org/r/497227
[06:01:44] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1121 [mediawiki-config] - https://gerrit.wikimedia.org/r/497227 (owner: Marostegui)
[06:02:33] (Merged) jenkins-bot: db-eqiad.php: Depool db1121 [mediawiki-config] - https://gerrit.wikimedia.org/r/497227 (owner: Marostegui)
[06:03:52] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1121 (duration: 01m 04s)
[06:03:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:04:03] !log Deploy schema change on db1121 - lag will appear on labsdb:s4
[06:04:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:08:46] !log Deploy schema change on x1 master (db1069) with replication - T218397
[06:08:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:08:49] T218397: Add usc_deleted to urlshortcodes - https://phabricator.wikimedia.org/T218397
[06:19:51] (PS1) Marostegui: wikireplica_dns.yaml: Depool dbproxy1010 [puppet] - https://gerrit.wikimedia.org/r/497228
[06:20:25] (Abandoned) Marostegui: wmnet: Depool dbproxy1010 [dns] - https://gerrit.wikimedia.org/r/496410 (owner: Marostegui)
[06:27:38] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - https://gerrit.wikimedia.org/r/497229
[06:28:49] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - https://gerrit.wikimedia.org/r/497229 (owner: Marostegui)
[06:29:46] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - https://gerrit.wikimedia.org/r/497229 (owner: Marostegui)
[06:30:48] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1121 (duration: 00m 48s)
[06:30:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:33:43] !log Deploy schema change on s8 codfw master (db2045), this will generate lag on s8 codfw
[06:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:07] (PS9) Mathew.onipe: elasticsearch: add profile for icinga checks [puppet] - https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921)
[06:59:11] (PS1) Marostegui: db-eqiad.php: Depool db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/497231
[07:00:30] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/497231 (owner: Marostegui)
[07:01:34] (Merged) jenkins-bot: db-eqiad.php: Depool db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/497231 (owner: Marostegui)
[07:02:33] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1101 (duration: 00m 48s)
[07:02:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:39] !log Stop db1101 to upgrade mysql and kernel
[07:02:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:52] (PS10) Mathew.onipe: elasticsearch: add profile for icinga checks [puppet] - https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921)
[07:13:58] (PS1) Marostegui: db-eqiad.php: Slowly repool db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/497232
[07:15:02] (CR) Marostegui: [C: +2] db-eqiad.php: Slowly repool db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/497232 (owner: Marostegui)
[07:15:58] (Merged) jenkins-bot: db-eqiad.php: Slowly repool db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/497232 (owner: Marostegui)
[07:16:47] (PS11) Mathew.onipe: elasticsearch: add profile for icinga checks [puppet] - https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921)
[07:16:57] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1101 (duration: 00m 49s)
[07:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:38] (PS1) Marostegui: db-eqiad.php: Give more traffic to db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/497233
[07:22:33] (CR) Marostegui: [C: +2] db-eqiad.php: Give more traffic to db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/497233 (owner: Marostegui)
[07:23:31] (Merged) jenkins-bot: db-eqiad.php: Give more traffic to db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/497233 (owner: Marostegui)
[07:24:11] (PS12) Mathew.onipe: elasticsearch: add profile for icinga checks [puppet] - https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921)
[07:24:33] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1101 (duration: 00m 48s)
[07:24:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:20] (CR) Mathew.onipe: elasticsearch: add profile for icinga checks (2 comments) [puppet] - https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) (owner: Mathew.onipe)
[07:27:03] (CR) Mathew.onipe: elasticsearch: add profile for icinga checks (1 comment) [puppet] - https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) (owner: Mathew.onipe)
[07:31:10] (PS1) Marostegui: db-eqiad.php: More traffic to db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/497235
[07:33:21] (CR) Marostegui: [C: +2] db-eqiad.php: More traffic to db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/497235 (owner: Marostegui)
[07:34:19] (Merged) jenkins-bot: db-eqiad.php: More traffic to db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/497235 (owner: Marostegui)
[07:35:21] (PS1) Ammarpad: Enable logging of private filters on commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/497236 (https://phabricator.wikimedia.org/T218527)
[07:35:21] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1101 (duration: 00m 48s)
[07:35:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:32] (PS2) Muehlenhoff: Enable base::service_auto_restart for prometheus-statsd-exporter [puppet] - https://gerrit.wikimedia.org/r/496764 (https://phabricator.wikimedia.org/T135991)
[07:37:30] (PS1) Marostegui: db-eqiad.php: More traffic to db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/497237
[07:44:20] (CR) Muehlenhoff: [C: +2] Enable base::service_auto_restart for prometheus-statsd-exporter [puppet] - https://gerrit.wikimedia.org/r/496764 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[07:44:27] (CR) Jcrespo: "+1 although I would change what dbproxy1011 points to at the same time (eg. point to all hosts, set all host with the larger kill timeout)" [puppet] - https://gerrit.wikimedia.org/r/497228 (owner: Marostegui)
[07:46:57] (CR) Marostegui: [C: +2] db-eqiad.php: More traffic to db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/497237 (owner: Marostegui)
[07:47:53] (Merged) jenkins-bot: db-eqiad.php: More traffic to db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/497237 (owner: Marostegui)
[07:49:06] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1101 (duration: 00m 48s)
[07:49:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:05] (CR) Marostegui: "Not sure what you mean, I am doing the same thing that was done here: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/420055/" [puppet] - https://gerrit.wikimedia.org/r/497228 (owner: Marostegui)
[07:53:30] (CR) Mathew.onipe: elasticsearch: add profile for icinga checks (1 comment) [puppet] - https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) (owner: Mathew.onipe)
[07:56:52] (CR) Jcrespo: "> Not sure what you mean, I am doing the same thing that was done" [puppet] - https://gerrit.wikimedia.org/r/497228 (owner: Marostegui)
[08:01:57] (PS1) Muehlenhoff: Remove access for tbolliger [puppet] - https://gerrit.wikimedia.org/r/497240
[08:04:01] (CR) Muehlenhoff: [C: +2] Remove access for tbolliger [puppet] - https://gerrit.wikimedia.org/r/497240 (owner: Muehlenhoff)
[08:04:26] (PS13) Mathew.onipe: elasticsearch: add profile for icinga checks [puppet] - https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921)
[08:04:38] (CR) Jcrespo: "There is also broken replication on labsdb1010, I would wait for it get fixed and catch up." [puppet] - https://gerrit.wikimedia.org/r/497228 (owner: Marostegui)
[08:10:52] (CR) Marostegui: "> > Not sure what you mean, I am doing the same thing that was done" [puppet] - https://gerrit.wikimedia.org/r/497228 (owner: Marostegui)
[08:12:02] (CR) Daimona Eaytoy: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/497236 (https://phabricator.wikimedia.org/T218527) (owner: Ammarpad)
[08:14:05] (CR) Daimona Eaytoy: [C: +1] Enable logging of private filters on commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/497236 (https://phabricator.wikimedia.org/T218527) (owner: Ammarpad)
[08:16:58] (PS1) Vgutierrez: acme-chief: Ensure that the CN is part of the SNI list for certs config [software/acme-chief] - https://gerrit.wikimedia.org/r/497242 (https://phabricator.wikimedia.org/T218418)
[08:22:50] !log armed keyholder on neodymium
[08:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:09] RECOVERY - Keyholder SSH agent on neodymium is OK: OK: Keyholder is armed with all configured keys.
[08:24:04] (PS1) Gergő Tisza: Enable Flow for testing on huwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/497243 (https://phabricator.wikimedia.org/T119365)
[08:25:13] (PS1) Marostegui: db-eqiad.php: Fully repool db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/497244
[08:26:35] (CR) Marostegui: [C: +2] db-eqiad.php: Fully repool db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/497244 (owner: Marostegui)
[08:27:33] (Merged) jenkins-bot: db-eqiad.php: Fully repool db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/497244 (owner: Marostegui)
[08:28:53] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1101 (duration: 00m 48s)
[08:28:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:55] !log cp2002: repool varnish-fe to resume ATS testing T213263
[08:31:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:58] T213263: Partial cache_upload traffic switchover to ATS and switchback to Varnish - https://phabricator.wikimedia.org/T213263
[08:32:03] !log ema@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2002.codfw.wmnet,service=nginx
[08:32:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:04] !log ema@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2002.codfw.wmnet,service=varnish-fe
[08:32:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:06] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=prometheus2003.codfw.wmnet
[08:34:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:04] (PS1) Marostegui: db-eqiad.php: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/497245
[08:37:25] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/497245 (owner: Marostegui)
[08:38:22] (Merged) jenkins-bot: db-eqiad.php: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/497245 (owner: Marostegui)
[08:38:38] (PS31) Jcrespo: mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts [puppet] - https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292)
[08:38:40] (PS4) Jcrespo: mariadb-snapshots: Better error and logging handling [puppet] - https://gerrit.wikimedia.org/r/496746 (https://phabricator.wikimedia.org/T210292)
[08:39:24] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1104 (duration: 00m 48s)
[08:39:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:42] Operations, monitoring, Patch-For-Review: Serve >= 50% of production Prometheus systems with Prometheus v2 - https://phabricator.wikimedia.org/T187987 (fgiunchedi) I've just repooled prometheus2003 and things seems to be working as expected! The gap in metrics starts at March 14th ~14 UTC. The wides...
[08:42:13] (CR) Vgutierrez: [C: +2] admin: create user with analytics-privatedata access for sukhe [puppet] - https://gerrit.wikimedia.org/r/496379 (https://phabricator.wikimedia.org/T217438) (owner: Vgutierrez)
[08:42:23] (PS2) Vgutierrez: admin: create user with analytics-privatedata access for sukhe [puppet] - https://gerrit.wikimedia.org/r/496379 (https://phabricator.wikimedia.org/T217438)
[08:46:08] Operations, SRE-Access-Requests: Requesting access to stat1007 for sukhe - https://phabricator.wikimedia.org/T217438 (Vgutierrez) Open→Resolved
[08:47:06] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1104" [mediawiki-config] - https://gerrit.wikimedia.org/r/497246
[08:48:48] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool db1104" [mediawiki-config] - https://gerrit.wikimedia.org/r/497246 (owner: Marostegui)
[08:49:44] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1104" [mediawiki-config] - https://gerrit.wikimedia.org/r/497246 (owner: Marostegui)
[08:50:49] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1104 (duration: 00m 48s)
[08:50:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:49] RECOVERY - Check systemd state on sessionstore1001 is OK: OK - running: The system is fully operational
[08:51:49] RECOVERY - Check whether ferm is active by checking the default input chain on sessionstore1001 is OK: OK ferm input default policy is set
[08:52:06] !log restarting ferm on sessionstore, was stuck in resolving one of the -a records, which were only merged in a subsequent step (T215883)
[08:52:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:09] T215883: Create puppet role for session storage service - https://phabricator.wikimedia.org/T215883
[08:52:45] RECOVERY - Check systemd state on sessionstore1002 is OK: OK - running: The system is fully operational
[08:53:03] RECOVERY - Check whether ferm is active by checking the default input chain on sessionstore1003 is OK: OK ferm input default policy is set
[08:53:09] RECOVERY - Check whether ferm is active by checking the default input chain on sessionstore2001 is OK: OK ferm input default policy is set
[08:53:11] RECOVERY - Check whether ferm is active by checking the default input chain on sessionstore2002 is OK: OK ferm input default policy is set
[08:53:21] RECOVERY - Check systemd state on sessionstore2003 is OK: OK - running: The system is fully operational
[08:53:27] RECOVERY - Check whether ferm is active by checking the default input chain on sessionstore2003 is OK: OK ferm input default policy is set
[08:53:27] RECOVERY - Check whether ferm is active by checking the default input chain on sessionstore1002 is OK: OK ferm input default policy is set
[08:53:29] RECOVERY - Check systemd state on sessionstore1003 is OK: OK - running: The system is fully operational
[08:53:47] RECOVERY - Check systemd state on sessionstore2001 is OK: OK - running: The system is fully operational
[08:54:15] RECOVERY - Check systemd state on sessionstore2002 is OK: OK - running: The system is fully operational
[08:55:27] (PS14) Mathew.onipe: elasticsearch: add profile for icinga checks [puppet] - https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921)
[08:56:14] (CR) Ema: [C: -1] Add lvs to the read-only ldap replicas (2 comments) [puppet] - https://gerrit.wikimedia.org/r/496858 (https://phabricator.wikimedia.org/T218133) (owner: Andrew Bogott)
[08:58:03] RECOVERY - Disk space on ms-be1043 is OK: DISK OK
[08:58:14] !log uploaded acme-chief 0.11 to apt.wikimedia.org (buster) - T207295
[08:58:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:17] T207295: Expose not-yet-live certs to clients so they can handle OCSP stapling - https://phabricator.wikimedia.org/T207295
[08:59:21] (CR) Ema: [C: -1] Add lvs to the read-only ldap replicas (1 comment) [puppet] - https://gerrit.wikimedia.org/r/496858 (https://phabricator.wikimedia.org/T218133) (owner: Andrew Bogott)
[09:01:09] PROBLEM - Check whether ferm is active by checking the default input chain on db1101 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[09:01:31] PROBLEM - Check systemd state on db1101 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:01:39] ^ checking
[09:01:45] as I rebooted db1101 recently
[09:03:24] fixed
[09:03:37] RECOVERY - Check whether ferm is active by checking the default input chain on db1101 is OK: OK ferm input default policy is set
[09:03:57] RECOVERY - Check systemd state on db1101 is OK: OK - running: The system is fully operational
[09:12:49] (CR) Ema: "recheck" [debs/superior-cache-analyzer] - https://gerrit.wikimedia.org/r/496781 (https://phabricator.wikimedia.org/T213263) (owner: Ema)
[09:14:16] (CR) Gehel: [C: -1] "minor comment inline" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) (owner: Mathew.onipe)
[09:16:42] (CR) Muehlenhoff: [C: +1] "Two nits, but LGTM from a quick look." (2 comments) [debs/superior-cache-analyzer] - https://gerrit.wikimedia.org/r/496781 (https://phabricator.wikimedia.org/T213263) (owner: Ema)
[09:17:38] (PS1) Vgutierrez: acme-chief: Avoid unneeded calls to _push_live_certificate() [software/acme-chief] - https://gerrit.wikimedia.org/r/497249 (https://phabricator.wikimedia.org/T218543)
[09:17:55] Operations, ops-eqiad, Operations-Software-Development: ms-be1043 sdk failed - https://phabricator.wikimedia.org/T218544 (fgiunchedi)
[09:17:56] (CR) Hashar: "recheck" [debs/superior-cache-analyzer] - https://gerrit.wikimedia.org/r/496781 (https://phabricator.wikimedia.org/T213263) (owner: Ema)
[09:22:47] (PS3) Ema: Initial debianization [debs/superior-cache-analyzer] - https://gerrit.wikimedia.org/r/496781 (https://phabricator.wikimedia.org/T213263)
[09:23:17] PROBLEM - puppet last run on acmechief2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[prometheus-node-exporter]
[09:23:57] (PS15) Mathew.onipe: elasticsearch: add profile for icinga checks [puppet] - https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921)
[09:24:36] (CR) Gehel: [C: -1] Pass flag use_nodejs10 for maps services (1 comment) [puppet] - https://gerrit.wikimedia.org/r/495735 (https://phabricator.wikimedia.org/T215523) (owner: MSantos)
[09:24:52] (CR) jerkins-bot: [V: -1] elasticsearch: add profile for icinga checks [puppet] - https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) (owner: Mathew.onipe)
[09:26:07] (CR) Ema: Initial debianization (2 comments) [debs/superior-cache-analyzer] - https://gerrit.wikimedia.org/r/496781 (https://phabricator.wikimedia.org/T213263) (owner: Ema)
[09:26:09] (PS16) Mathew.onipe: elasticsearch: add profile for icinga checks [puppet] - https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921)
[09:27:25] (CR) Ema: "recheck" [debs/superior-cache-analyzer] - https://gerrit.wikimedia.org/r/496781 (https://phabricator.wikimedia.org/T213263) (owner: Ema)
[09:28:24] (PS3) Vgutierrez: acme-chief: Update certs-sync to mirror the new directory tree [puppet] - https://gerrit.wikimedia.org/r/495854 (https://phabricator.wikimedia.org/T207295)
[09:29:09] !log switch to mpm_event for prometheus apache before merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/496750
[09:29:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:43] (CR) Muehlenhoff: [C: +1] Initial debianization [debs/superior-cache-analyzer] - https://gerrit.wikimedia.org/r/496781 (https://phabricator.wikimedia.org/T213263) (owner: Ema)
[09:30:56] (CR) Giuseppe Lavagetto: [C: -1] Disable display_errors in FPM mode (1 comment) [puppet] - https://gerrit.wikimedia.org/r/497024 (https://phabricator.wikimedia.org/T218005) (owner: MaxSem)
[09:31:56] (CR) Ema: [C: +2] Initial debianization [debs/superior-cache-analyzer] - https://gerrit.wikimedia.org/r/496781 (https://phabricator.wikimedia.org/T213263) (owner: Ema)
[09:32:10] (CR) Vgutierrez: [C: +2] "acme-chief 0.11 has been installed in acmechief[12]001" [puppet] - https://gerrit.wikimedia.org/r/495854 (https://phabricator.wikimedia.org/T207295) (owner: Vgutierrez)
[09:32:55] PROBLEM - puppet last run on acmechief1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[prometheus-node-exporter]
[09:35:06] (PS4) Filippo Giunchedi: prometheus: maximum connections to proxypass [puppet] - https://gerrit.wikimedia.org/r/496750 (https://phabricator.wikimedia.org/T217715)
[09:35:22] (CR) Filippo Giunchedi: [C: +2] prometheus: maximum connections to proxypass [puppet] - https://gerrit.wikimedia.org/r/496750 (https://phabricator.wikimedia.org/T217715) (owner: Filippo Giunchedi)
[09:36:32] Operations, MediaWiki-Cache, Patch-For-Review, Performance-Team (Radar), User-Elukey: Consider raising Memcached MWObject cache memory size limit - https://phabricator.wikimedia.org/T217731 (elukey) I have read again the https://github.com/memcached/memcached/blob/master/doc/protocol.txt and...
[09:36:38] (PS3) Giuseppe Lavagetto: profile::mediawiki::php: disable display_errors in FPM mode [puppet] - https://gerrit.wikimedia.org/r/497024 (https://phabricator.wikimedia.org/T218005) (owner: MaxSem)
[09:37:21] PROBLEM - Ensure cert-sync script runs successfully in the active node on acmechief1001 is CRITICAL: FILE_AGE CRITICAL: File not found - /var/lib/acme-chief/certs/.rsync.done
[09:37:37] ^^ that's me && that's expected, sorry about the noise
[09:39:01] (PS1) Vgutierrez: acme_chief: Fix cert-sync.conf [puppet] - https://gerrit.wikimedia.org/r/497251 (https://phabricator.wikimedia.org/T207295)
[09:40:02] !log superior-cache-analyzer_3.3.7 uploaded to stretch-wikimedia T213263
[09:40:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:06] T213263: Partial cache_upload traffic switchover to ATS and switchback to Varnish - https://phabricator.wikimedia.org/T213263
[09:40:56] (CR) Vgutierrez: [C: +2] acme_chief: Fix cert-sync.conf [puppet] - https://gerrit.wikimedia.org/r/497251 (https://phabricator.wikimedia.org/T207295) (owner: Vgutierrez)
[09:41:16] (PS2) Vgutierrez: acme_chief: Fix cert-sync.conf [puppet] - https://gerrit.wikimedia.org/r/497251 (https://phabricator.wikimedia.org/T207295)
[09:42:57] (PS1) Muehlenhoff: Don't set a fixed package version for prometheus-node-exporter on buster [puppet] - https://gerrit.wikimedia.org/r/497252 (https://phabricator.wikimedia.org/T213708)
[09:43:27] (CR) Giuseppe Lavagetto: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/15184/mwdebug1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/497024 (https://phabricator.wikimedia.org/T218005) (owner: MaxSem)
[09:43:52] <_joe_> It's impossible to merge changes at a time where multiple people are merging
[09:44:03] (PS4) Giuseppe Lavagetto: profile::mediawiki::php: disable display_errors in FPM mode [puppet] - https://gerrit.wikimedia.org/r/497024 (https://phabricator.wikimedia.org/T218005) (owner: MaxSem)
[09:44:04] <_joe_> if you want CI to properly run, that is
[09:44:18] <_joe_> if you just don't care, it's probably easy
[09:44:32] <_joe_> but I can't think we lose minutes for every merge
[09:44:41] RECOVERY - Ensure cert-sync script runs successfully in the active node on acmechief1001 is OK: FILE_AGE OK: /var/lib/acme-chief/certs/.rsync.done is 61 seconds old and 0 bytes
[09:46:41] (PS1) Hashar: shinken: add basic spec [puppet] - https://gerrit.wikimedia.org/r/497253
[09:50:11] (Abandoned) Vgutierrez: acme-chief-api: Add support for puppet HTTP API search operation [software/acme-chief] - https://gerrit.wikimedia.org/r/494506 (https://phabricator.wikimedia.org/T207295) (owner: Vgutierrez)
[09:51:22] (CR) Muehlenhoff: "https://puppet-compiler.wmflabs.org/compiler1002/15185/" [puppet] - https://gerrit.wikimedia.org/r/497252 (https://phabricator.wikimedia.org/T213708) (owner: Muehlenhoff)
[09:54:49] PROBLEM - Check systemd state on mw1288 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:54:57] PROBLEM - php7.2-fpm service on mw2204 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed
[09:55:01] PROBLEM - PHP7 rendering on mw2204 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 1308 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[09:55:05] PROBLEM - PHP7 rendering on mw2257 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 1308 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[09:55:05] PROBLEM - php7.2-fpm service on mw1327 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed
[09:55:05] PROBLEM - PHP7 rendering on mw1327 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 1308 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[09:55:07] PROBLEM - PHP7 rendering on mw2242 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 1308 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[09:55:09] PROBLEM - Check systemd state on mw2242 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:55:13] PROBLEM - php7.2-fpm service on mw2267 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed
[09:55:27] PROBLEM - Check systemd state on mw2204 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:55:29] PROBLEM - PHP7 rendering on mw1246 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 1308 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[09:55:31] PROBLEM - PHP7 rendering on mw1288 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 1308 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[09:55:31] PROBLEM - Check systemd state on mw1246 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:55:35] PROBLEM - php7.2-fpm service on mw2242 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed
[09:55:35] PROBLEM - Check systemd state on mw2257 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:55:35] PROBLEM - PHP7 rendering on mw2267 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://10.192.16.68:9005/w/health-check.php - 473 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[09:55:39] PROBLEM - PHP7 rendering on mw2289 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 1308 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[09:55:39] (CR) Filippo Giunchedi: [C: +1] Don't set a fixed package version for prometheus-node-exporter on buster [puppet] - https://gerrit.wikimedia.org/r/497252 (https://phabricator.wikimedia.org/T213708) (owner: Muehlenhoff)
[09:55:45] PROBLEM - Check systemd state on mw2170 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:55:47] PROBLEM - php7.2-fpm service on mw1348 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed [09:55:49] PROBLEM - Check systemd state on mw1253 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:55:51] PROBLEM - PHP7 rendering on mw1340 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 1308 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:55:51] _joe_: ^^^ [09:55:56] related to your last change? [09:56:01] PROBLEM - PHP7 rendering on mw1348 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 1308 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:56:03] PROBLEM - php7.2-fpm service on mw1340 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed [09:56:03] PROBLEM - PHP7 rendering on mw2179 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 1308 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:56:07] PROBLEM - PHP7 rendering on mw2241 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. 
not found on http://en.wikipedia.org:80/wiki/Main_Page - 1308 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:56:07] <_joe_> it really shouldn't [09:56:08] (03CR) 10Vgutierrez: [C: 03+2] acme-chief: Avoid unneeded calls to _push_live_certificate() [software/acme-chief] - 10https://gerrit.wikimedia.org/r/497249 (https://phabricator.wikimedia.org/T218543) (owner: 10Vgutierrez) [09:56:09] PROBLEM - PHP7 rendering on mw2185 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 1308 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:56:11] PROBLEM - Check systemd state on mw2140 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:56:13] PROBLEM - PHP7 rendering on mw2155 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://10.192.32.43:9005/w/health-check.php - 473 bytes in 0.018 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:56:13] PROBLEM - PHP7 rendering on mw2160 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://10.192.32.48:9005/w/health-check.php - 473 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:56:15] PROBLEM - php7.2-fpm service on mw1246 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed [09:56:15] PROBLEM - Check systemd state on mw2142 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:56:17] PROBLEM - Check systemd state on mw1348 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:56:17] PROBLEM - PHP7 rendering on mw1230 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. 
not found on http://en.wikipedia.org:80/wiki/Main_Page - 1308 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:56:19] PROBLEM - Check systemd state on mw2289 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:56:19] PROBLEM - Check systemd state on mw2241 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:56:19] PROBLEM - php7.2-fpm service on mw2241 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed [09:56:27] PROBLEM - php7.2-fpm service on mw2170 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed [09:56:27] PROBLEM - php7.2-fpm service on mw1253 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed [09:56:27] PROBLEM - php7.2-fpm service on mw2138 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed [09:56:27] PROBLEM - php7.2-fpm service on mw2185 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed [09:56:27] PROBLEM - php7.2-fpm service on mw2177 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed [09:56:29] PROBLEM - Check systemd state on mw2159 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:56:29] PROBLEM - Check systemd state on mw2162 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[09:56:29] PROBLEM - php7.2-fpm service on mw2178 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed [09:56:29] PROBLEM - php7.2-fpm service on mw2173 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed [09:56:29] PROBLEM - php7.2-fpm service on mw2159 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed [09:56:30] PROBLEM - php7.2-fpm service on mw2172 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed [09:56:30] PROBLEM - php7.2-fpm service on mw2157 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed [09:56:31] PROBLEM - php7.2-fpm service on mw2152 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed [09:56:31] PROBLEM - php7.2-fpm service on mw2140 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed [09:56:32] PROBLEM - PHP7 rendering on mw2170 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 1308 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:56:33] PROBLEM - PHP7 rendering on mw2169 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 1308 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:56:33] PROBLEM - Check systemd state on mw2154 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:56:33] PROBLEM - Check systemd state on mw2157 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:56:34] PROBLEM - Check systemd state on mw2138 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[09:56:34] PROBLEM - php7.2-fpm service on mw2145 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed [09:56:35] PROBLEM - php7.2-fpm service on mw2161 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed [09:56:36] PROBLEM - php7.2-fpm service on mw2154 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed [09:56:36] PROBLEM - PHP7 rendering on mw2177 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 1308 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:56:36] PROBLEM - php7.2-fpm service on mw2142 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed [09:56:37] PROBLEM - PHP7 rendering on mw2138 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 1308 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:56:38] PROBLEM - PHP7 rendering on mw2146 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 1308 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:56:38] PROBLEM - PHP7 rendering on mw2142 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 1308 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:56:39] PROBLEM - PHP7 rendering on mw2171 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. 
not found on http://en.wikipedia.org:80/wiki/Main_Page - 1308 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:56:39] PROBLEM - PHP7 rendering on mw2137 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 1308 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:56:39] PROBLEM - PHP7 rendering on mw2178 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 1308 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:56:40] PROBLEM - PHP7 rendering on mw2186 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 1308 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:56:41] PROBLEM - php7.2-fpm service on mw2247 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed [09:56:41] PROBLEM - PHP7 rendering on mw2247 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. 
not found on http://10.192.0.73:9005/w/health-check.php - 473 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:56:42] PROBLEM - php7.2-fpm service on mw2141 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed [09:56:42] PROBLEM - php7.2-fpm service on mw2167 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed [09:56:42] PROBLEM - php7.2-fpm service on mw2175 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed [09:56:43] PROBLEM - php7.2-fpm service on mw2174 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed [09:56:43] PROBLEM - php7.2-fpm service on mw2168 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed [09:56:44] PROBLEM - php7.2-fpm service on mw2217 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed [09:56:45] PROBLEM - php7.2-fpm service on mw2183 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed [09:56:45] PROBLEM - php7.2-fpm service on mw2186 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed [09:56:46] PROBLEM - php7.2-fpm service on mw2153 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed [09:56:46] PROBLEM - php7.2-fpm service on mw2151 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed [09:56:46] PROBLEM - PHP7 rendering on mw2154 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. 
not found on http://10.192.32.42:9005/w/health-check.php - 473 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:56:46] PROBLEM - php7.2-fpm service on mw2137 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed [09:56:47] PROBLEM - php7.2-fpm service on mw2160 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed [09:56:47] PROBLEM - php7.2-fpm service on mw2180 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed [09:56:48] PROBLEM - PHP7 rendering on mw2153 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://10.192.32.41:9005/w/health-check.php - 473 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:56:48] PROBLEM - PHP7 rendering on mw2167 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 1308 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:56:49] PROBLEM - PHP7 rendering on mw2173 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 1308 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:56:50] PROBLEM - PHP7 rendering on mw2174 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 1308 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:56:51] PROBLEM - PHP7 rendering on mw2139 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. 
not found on http://en.wikipedia.org:80/wiki/Main_Page - 1308 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:56:51] PROBLEM - php7.2-fpm service on mw2136 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed [09:56:51] PROBLEM - php7.2-fpm service on mw2143 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed [09:56:52] PROBLEM - php7.2-fpm service on mw2179 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed [09:56:53] PROBLEM - php7.2-fpm service on mw2155 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed [09:56:53] PROBLEM - php7.2-fpm service on mw2135 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed [09:56:54] PROBLEM - php7.2-fpm service on mw2139 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed [09:56:54] PROBLEM - php7.2-fpm service on mw2166 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed [09:56:54] PROBLEM - PHP7 rendering on mw2175 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 1308 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:56:54] PROBLEM - Check systemd state on mw2183 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:56:55] <_joe_> sigh [09:56:56] PROBLEM - Check systemd state on mw2151 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:56:56] PROBLEM - Check systemd state on mw2167 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:56:57] PROBLEM - Check systemd state on mw2184 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[09:56:57] PROBLEM - PHP7 rendering on mw1280 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 1308 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:56:58] PROBLEM - Check systemd state on mw2153 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:56:58] PROBLEM - Check systemd state on mw2177 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:57:07] !log temporarily stop ircecho to avoid spam [09:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:09] (03CR) 10Arturo Borrero Gonzalez: [C: 04-2] Re-apply "openstack::clientpackages::common: include python3 packages" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/497009 (https://phabricator.wikimedia.org/T218423) (owner: 10Alex Monk) [09:57:16] <_joe_> volans: please revert while I investigate [09:57:30] ack [09:57:34] (03Merged) 10jenkins-bot: acme-chief: Avoid unneeded calls to _push_live_certificate() [software/acme-chief] - 10https://gerrit.wikimedia.org/r/497249 (https://phabricator.wikimedia.org/T218543) (owner: 10Vgutierrez) [09:57:56] (03PS1) 10Volans: Revert "profile::mediawiki::php: disable display_errors in FPM mode" [puppet] - 10https://gerrit.wikimedia.org/r/497255 [09:58:46] _joe_: depool a host and disable puppet there to keep it for investigation, I plan to run puppet on the failed ones [09:58:56] (03CR) 10Volans: [C: 03+2] Revert "profile::mediawiki::php: disable display_errors in FPM mode" [puppet] - 10https://gerrit.wikimedia.org/r/497255 (owner: 10Volans) [09:59:23] <_joe_> yeah it was inserted in the wrong place [09:59:26] <_joe_> stupid me [09:59:37] <_joe_> so lemme first test a quick fix [09:59:42] (03CR) 10Arturo Borrero Gonzalez: "This may need puppet cleanup from previous location? 
i.e, 'ensure => absent' of the old path" [puppet] - 10https://gerrit.wikimedia.org/r/497216 (https://phabricator.wikimedia.org/T218504) (owner: 10BryanDavis) [09:59:53] ok, the revert is merged, let me know when to kick the puppet runs [10:00:05] <_joe_> go whenever you want, actually [10:00:05] (03PS1) 10Vgutierrez: Release 0.12 [software/acme-chief] - 10https://gerrit.wikimedia.org/r/497256 (https://phabricator.wikimedia.org/T218543) [10:00:25] !log running puppet on failed hosts [10:00:25] ack [10:00:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:34] <_joe_> volans: define "failed" though [10:00:41] https://wikitech.wikimedia.org/wiki/Cumin#Run_Puppet_only_if_last_run_failed [10:00:45] ah right [10:00:48] puppet didn't fail? [10:00:49] <_joe_> that's for a failed puppet run [10:00:51] <_joe_> nope [10:00:53] (03CR) 10Vgutierrez: [C: 03+2] Release 0.12 [software/acme-chief] - 10https://gerrit.wikimedia.org/r/497256 (https://phabricator.wikimedia.org/T218543) (owner: 10Vgutierrez) [10:01:18] right, adjusting target [10:01:21] <_joe_> ok run it across the whole fleet where php::fpm is installed [10:02:11] !log running puppet on hosts matching 'C:php::fpm' to apply I004349ebfab34a4b7b5a65b9b10f78817e5cf193 [10:02:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:21] (03PS17) 10Mathew.onipe: elasticsearch: add profile for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) [10:03:01] (03Merged) 10jenkins-bot: Release 0.12 [software/acme-chief] - 10https://gerrit.wikimedia.org/r/497256 (https://phabricator.wikimedia.org/T218543) (owner: 10Vgutierrez) [10:04:43] <_joe_> !log hot-patching the error in php7.2-fpm config [10:04:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:03] <_joe_> I did fix the issue [10:06:22] <_joe_> sigh, sorry, I'm too tired to work, clearly [10:06:34] 10Puppet, 
10cloud-services-team, 10Patch-For-Review: Puppet failure emails sent to non-admin members of tools project causing user confusion - https://phabricator.wikimedia.org/T218009 (10aborrero) >>! In T218009#5028698, @Krenair wrote: > @aborrero I tracked down mysql packages in openstack clientpackages t... [10:06:39] (03CR) 10Marostegui: "Jaime and myself had a chat and we sync'ed." [puppet] - 10https://gerrit.wikimedia.org/r/497228 (owner: 10Marostegui) [10:06:49] (03PS1) 10Vgutierrez: acme-chief: Avoid unneeded calls to _push_live_certificate() [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/497257 (https://phabricator.wikimedia.org/T218543) [10:06:51] (03PS1) 10Vgutierrez: Release 0.12 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/497258 (https://phabricator.wikimedia.org/T218543) [10:06:54] (03PS1) 10Vgutierrez: debian: Add release 0.12 to changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/497259 (https://phabricator.wikimedia.org/T218543) [10:07:26] (03PS2) 10Muehlenhoff: Don't set a fixed package version for prometheus-node-exporter on buster [puppet] - 10https://gerrit.wikimedia.org/r/497252 (https://phabricator.wikimedia.org/T213708) [10:07:49] (03CR) 10Vgutierrez: [C: 03+2] acme-chief: Avoid unneeded calls to _push_live_certificate() [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/497257 (https://phabricator.wikimedia.org/T218543) (owner: 10Vgutierrez) [10:08:02] (03CR) 10Vgutierrez: [C: 03+2] Release 0.12 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/497258 (https://phabricator.wikimedia.org/T218543) (owner: 10Vgutierrez) [10:08:25] (03CR) 10Vgutierrez: [C: 03+2] debian: Add release 0.12 to changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/497259 (https://phabricator.wikimedia.org/T218543) (owner: 10Vgutierrez) [10:08:32] (03CR) 10Muehlenhoff: [C: 03+2] Don't set a fixed package version for prometheus-node-exporter on buster [puppet] - 
10https://gerrit.wikimedia.org/r/497252 (https://phabricator.wikimedia.org/T213708) (owner: 10Muehlenhoff) [10:09:20] (03Merged) 10jenkins-bot: acme-chief: Avoid unneeded calls to _push_live_certificate() [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/497257 (https://phabricator.wikimedia.org/T218543) (owner: 10Vgutierrez) [10:09:47] (03Merged) 10jenkins-bot: Release 0.12 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/497258 (https://phabricator.wikimedia.org/T218543) (owner: 10Vgutierrez) [10:09:54] (03Merged) 10jenkins-bot: debian: Add release 0.12 to changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/497259 (https://phabricator.wikimedia.org/T218543) (owner: 10Vgutierrez) [10:10:34] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] cloud-vps: Remove mysql packages from openstack::clientpackages::mitaka::* [puppet] - 10https://gerrit.wikimedia.org/r/497210 (https://phabricator.wikimedia.org/T218009) (owner: 10BryanDavis) [10:11:28] (03CR) 10Filippo Giunchedi: eventgate-analytics - Add statsd prometheus mappings for node-rdkafka-statsd metrics (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/496554 (https://phabricator.wikimedia.org/T218305) (owner: 10Ottomata) [10:12:13] !log restarted irc echo on icinga2001 [10:12:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:17] RECOVERY - puppet last run on mw1271 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:12:57] !log uploaded acme-chief 0.12 to apt.wikimedia.org (buster) - T218543 [10:12:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:00] T218543: acme-chief calls unnecessarily to ACMEChief._push_live_certificates() on daemon start - https://phabricator.wikimedia.org/T218543 [10:13:45] RECOVERY - puppet last run on mw2222 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:19:06] (03PS10) 10Vgutierrez: 
acme-chief-api: Add support for puppet HTTP API search operation [software/acme-chief] - 10https://gerrit.wikimedia.org/r/494957 (https://phabricator.wikimedia.org/T207295) [10:19:39] (03CR) 10Vgutierrez: acme-chief-api: Add support for puppet HTTP API search operation (031 comment) [software/acme-chief] - 10https://gerrit.wikimedia.org/r/494957 (https://phabricator.wikimedia.org/T207295) (owner: 10Vgutierrez) [10:20:01] RECOVERY - puppet last run on acmechief2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:24:35] 10Operations, 10LDAP: Update certificates on productions replicas of corp.wikimedia.org LDAP - https://phabricator.wikimedia.org/T168460 (10Krenair) was looking through some LDAP tasks, this one is for corp but was presumably completed as it was about a June 2017 expiry? [10:26:10] (03CR) 10Volans: [C: 03+1] "LGTM, although I might miss some low level context here." [software/acme-chief] - 10https://gerrit.wikimedia.org/r/494957 (https://phabricator.wikimedia.org/T207295) (owner: 10Vgutierrez) [10:26:15] (03CR) 10Marostegui: [C: 03+1] "Let's deploy and start checking what issues we face once this start being executed on a regular basis" [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [10:27:53] (03CR) 10Alex Monk: [C: 03+2] acme-chief: Avoid unneeded calls to _push_live_certificate() [software/acme-chief] - 10https://gerrit.wikimedia.org/r/497249 (https://phabricator.wikimedia.org/T218543) (owner: 10Vgutierrez) [10:28:26] Krenair: sorry! 
I wanted to fix that ASAP so I already deployed it [10:28:47] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497261 (https://phabricator.wikimedia.org/T128546) [10:28:57] I noticed this wasn't your first time self-+2ing either [10:29:09] (03PS4) 10Arturo Borrero Gonzalez: toolforge: Cleanup host_aliases and exim4 conf for Trusty grid [puppet] - 10https://gerrit.wikimedia.org/r/496680 (https://phabricator.wikimedia.org/T109485) (owner: 10BryanDavis) [10:29:37] RECOVERY - puppet last run on acmechief1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:29:59] (03PS32) 10Jcrespo: mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292) [10:30:04] jan_drewniak: #bothumor I � Unicode. All rise for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190318T1030). 
[10:30:10] https://gerrit.wikimedia.org/r/#/c/operations/software/acme-chief/+/494956/ did at least have a positive CR from a second person [10:30:12] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: Cleanup host_aliases and exim4 conf for Trusty grid [puppet] - 10https://gerrit.wikimedia.org/r/496680 (https://phabricator.wikimedia.org/T109485) (owner: 10BryanDavis) [10:30:24] and this one was minor enough [10:30:25] (03PS18) 10Mathew.onipe: elasticsearch: add profile for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) [10:30:28] (03PS1) 10Tarrow: WIP DNM: Introduce wikibase-termbox chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/497262 [10:31:42] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497261 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:31:45] Krenair: I didn't see any comments there holding the CR back on your side. I'm holding on https://gerrit.wikimedia.org/r/c/operations/software/acme-chief/+/494957 cause your concerns though [10:32:01] did you check my reply to your TZ comment? 
[10:32:46] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497261 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:34:45] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:497261| Bumping portals to master (T128546)]] (duration: 00m 49s) [10:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:48] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [10:35:22] (03CR) 10Alex Monk: acme-chief-api: Add support for puppet HTTP API search operation (031 comment) [software/acme-chief] - 10https://gerrit.wikimedia.org/r/494957 (https://phabricator.wikimedia.org/T207295) (owner: 10Vgutierrez) [10:35:26] (03PS1) 10ArielGlenn: die with usage message on bad filespec arg [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/497263 (https://phabricator.wikimedia.org/T218316) [10:35:34] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:497261| Bumping portals to master (T128546)]] (duration: 00m 48s) [10:35:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:29] 10Operations, 10ops-eqiad, 10Operations-Software-Development: ms-be1043 sdk failed - https://phabricator.wikimedia.org/T218544 (10Volans) @fgiunchedi agree that this is a new issue, and we need to fix two different scripts to have an automatic task created for this: 1) The `check_raid` script The current `c... [10:38:25] (03PS33) 10Jcrespo: mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292) [10:38:34] vgutierrez, I hadn't, I have no problem depending on python 3.7 if that's what the wikimedia deployment runs on [10:38:41] it's buster right? 
[10:38:53] indeed, we are using python 3.7.2 right now [10:39:06] this is the distro that ships both 2.7 and 3.7 right [10:39:25] 2.7.16 and 3.7.2 [10:39:44] heh [10:39:45] (03CR) 10Jcrespo: [C: 03+2] mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [10:39:59] Krenair: hmm besides mentioning this requirement in the README.. should I upgrade the classifiers on setup.py? [10:40:37] (03PS19) 10Mathew.onipe: elasticsearch: add profile for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) [10:40:44] hmm [10:41:48] vgutierrez, yes [10:42:05] so let's get rid of the python 3 one and add Programming Language :: Python :: 3.7? [10:42:06] vgutierrez, I think it should also go in debian/control right? [10:42:13] sure [10:43:56] (03PS1) 10Elukey: hue: add class parameters to configure Yarn/HDFS/MapRed TLS ports [puppet/cdh] - 10https://gerrit.wikimedia.org/r/497264 (https://phabricator.wikimedia.org/T217412) [10:44:27] elukey: wow, what a timing, i was about to ask something about "hue" and in that moment see your change [10:44:52] mutante: ahahhah [10:45:05] elukey: so i am on https://wikitech.wikimedia.org/wiki/Analytics/Systems looking for a page on Hue.. i figure it's "Hadoop User Experience" so that is just part of Hadoop ..so https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop [10:45:25] still just trying to find URLs for Icinga checks [10:45:39] PROBLEM - puppet last run on cumin1001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. 
Failed resources (up to 3 shown): File[/etc/mysql/backups.cnf] [10:45:54] (03PS2) 10ArielGlenn: die with usage message on bad filespec arg [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/497263 (https://phabricator.wikimedia.org/T218316) [10:45:56] (03PS5) 10Jcrespo: mariadb-snapshots: Better error and logging handling [puppet] - 10https://gerrit.wikimedia.org/r/496746 (https://phabricator.wikimedia.org/T210292) [10:45:58] (03PS1) 10Jcrespo: mariadb-backups: Make backups.cnf on managements hosts owned by root [puppet] - 10https://gerrit.wikimedia.org/r/497265 (https://phabricator.wikimedia.org/T210292) [10:46:03] mutante: ah yes I think that we don't have a dedicated page for it [10:46:13] I can ask my team and create one later on if you are not in a rush [10:46:28] elukey: not in a rush, no :) thank you [10:47:13] (03CR) 10Alex Monk: [C: 03+1] cloud-vps: Remove mysql packages from openstack::clientpackages::mitaka::* [puppet] - 10https://gerrit.wikimedia.org/r/497210 (https://phabricator.wikimedia.org/T218009) (owner: 10BryanDavis) [10:47:25] (03PS2) 10Jcrespo: mariadb-backups: Make backups.cnf on managements hosts owned by root [puppet] - 10https://gerrit.wikimedia.org/r/497265 (https://phabricator.wikimedia.org/T210292) [10:47:30] cumin1001 is me, working on it [10:48:29] PROBLEM - puppet last run on cumin2001 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 6 minutes ago with 3 failures. 
Failed resources (up to 3 shown): File[/etc/mysql/backups.cnf] [10:48:48] it is a technicality I am fixing immediately [10:48:52] (03PS2) 10Elukey: hue: add class parameters to configure Yarn/HDFS/MapRed TLS ports [puppet/cdh] - 10https://gerrit.wikimedia.org/r/497264 (https://phabricator.wikimedia.org/T217412) [10:48:55] 10Puppet, 10cloud-services-team, 10Patch-For-Review: Puppet failure emails sent to non-admin members of tools project causing user confusion - https://phabricator.wikimedia.org/T218009 (10Krenair) ack, sounds like we should try getting rid of those mysql packages from the manifest then. thanks. [10:49:20] (03PS20) 10Mathew.onipe: elasticsearch: add profile for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) [10:49:22] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Make backups.cnf on managements hosts owned by root [puppet] - 10https://gerrit.wikimedia.org/r/497265 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [10:49:28] Krenair: also check https://gerrit.wikimedia.org/r/c/operations/software/acme-chief/+/497242 and let's wrap both CRs in one single release [10:49:30] (03PS3) 10Jcrespo: mariadb-backups: Make backups.cnf on managements hosts owned by root [puppet] - 10https://gerrit.wikimedia.org/r/497265 (https://phabricator.wikimedia.org/T210292) [10:50:20] (03CR) 10Alex Monk: [C: 03+2] acme-chief: Ensure that the CN is part of the SNI list for certs config [software/acme-chief] - 10https://gerrit.wikimedia.org/r/497242 (https://phabricator.wikimedia.org/T218418) (owner: 10Vgutierrez) [10:50:38] (03PS11) 10Vgutierrez: acme-chief-api: Add support for puppet HTTP API search operation [software/acme-chief] - 10https://gerrit.wikimedia.org/r/494957 (https://phabricator.wikimedia.org/T207295) [10:50:56] (03CR) 10Elukey: [V: 03+2 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/15191/" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/497264 
(https://phabricator.wikimedia.org/T217412) (owner: 10Elukey) [10:50:57] 10Operations, 10ops-eqiad, 10Operations-Software-Development: ms-be1043 sdk failed - https://phabricator.wikimedia.org/T218544 (10Volans) Also worth mentioning that tracking the IDs for missing ones is not enough because if the last one fails we should know in advance how many are supposed to be there. [10:51:09] <_joe_> !log testing safety checks for php-fpm on mwdebug2001 [10:51:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:14] (03CR) 10Vgutierrez: acme-chief-api: Add support for puppet HTTP API search operation (031 comment) [software/acme-chief] - 10https://gerrit.wikimedia.org/r/494957 (https://phabricator.wikimedia.org/T207295) (owner: 10Vgutierrez) [10:51:25] PROBLEM - puppet last run on dbstore1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/recover_dump.py] [10:51:56] that is also me, should be fixed on head [10:52:51] PROBLEM - puppet last run on es2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/bin/recover_dump.py] [10:53:41] RECOVERY - puppet last run on cumin2001 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [10:53:44] (03PS21) 10Mathew.onipe: elasticsearch: add profile for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) [10:53:49] yep, recovering now [10:54:01] RECOVERY - puppet last run on es2001 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [10:54:01] RECOVERY - puppet last run on cumin1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:54:01] RECOVERY - puppet last run on dbstore1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:54:58] sorry about that, the deploy was safe it was related to templates (compiler won't catch it) and it was non-obvious, but completely harmless [10:55:05] arturo, I think you meant CR-1 [10:55:26] Krenair: ? 
[10:55:31] you gave CR-2 [10:55:34] on https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/497009/ [10:55:51] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] Re-apply "openstack::clientpackages::common: include python3 packages" [puppet] - 10https://gerrit.wikimedia.org/r/497009 (https://phabricator.wikimedia.org/T218423) (owner: 10Alex Monk) [10:56:01] thanks [10:56:07] 👍 [10:56:23] alright will look into the problems later, gotta run [10:57:09] (03PS1) 10Elukey: profile::hue: configure Yarn|HDFS|MapRed SSL ports for analytics1039 [puppet] - 10https://gerrit.wikimedia.org/r/497267 (https://phabricator.wikimedia.org/T217412) [10:58:44] (03PS22) 10Mathew.onipe: elasticsearch: add profile for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) [10:59:32] (03PS1) 10Jcrespo: mariadb-backups: Update recover script to handle compressed dumps [puppet] - 10https://gerrit.wikimedia.org/r/497269 (https://phabricator.wikimedia.org/T206203) [10:59:34] (03PS23) 10Mathew.onipe: elasticsearch: add profile for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) [11:00:05] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190318T1100). [11:00:05] edsanders: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:21] o/ [11:00:40] edsanders: you're a deployer, right? 
[11:00:43] * zeljkof checks [11:01:43] (03PS1) 10Elukey: hue: change name of the hdfs/yarn/mapred tls config [puppet/cdh] - 10https://gerrit.wikimedia.org/r/497270 (https://phabricator.wikimedia.org/T217412) [11:02:09] (03CR) 10Elukey: [V: 03+2 C: 03+2] hue: change name of the hdfs/yarn/mapred tls config [puppet/cdh] - 10https://gerrit.wikimedia.org/r/497270 (https://phabricator.wikimedia.org/T217412) (owner: 10Elukey) [11:02:14] (03CR) 10Mathew.onipe: "PCC is expected and Ok!" [puppet] - 10https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [11:02:42] edsanders: around for swat? [11:03:03] (03PS2) 10Elukey: profile::hue: configure Yarn|HDFS|MapRed SSL ports for analytics1039 [puppet] - 10https://gerrit.wikimedia.org/r/497267 (https://phabricator.wikimedia.org/T217412) [11:05:14] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/15196/" [puppet] - 10https://gerrit.wikimedia.org/r/497267 (https://phabricator.wikimedia.org/T217412) (owner: 10Elukey) [11:06:21] (03PS2) 10Vgutierrez: acme-chief: Ensure that the CN is part of the SNI list for certs config [software/acme-chief] - 10https://gerrit.wikimedia.org/r/497242 (https://phabricator.wikimedia.org/T218418) [11:09:25] (03CR) 10Gehel: [C: 04-1] elasticsearch: add profile for icinga checks (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [11:09:56] (03CR) 10jenkins-bot: acme-chief: Ensure that the CN is part of the SNI list for certs config [software/acme-chief] - 10https://gerrit.wikimedia.org/r/497242 (https://phabricator.wikimedia.org/T218418) (owner: 10Vgutierrez) [11:16:26] 10Operations, 10ops-codfw, 10DBA: rack/setup/deploy dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T218336 (10Marostegui) Hello @Papaul Jaime and myself discussed a few things. 
Hostname:** dbprov2001** **dbprov2002** **RAID0** for SSD **RAID6** for the SATA Disks partman r... [11:21:30] (03PS24) 10Mathew.onipe: elasticsearch: add profile for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) [11:21:33] (03CR) 10Mathew.onipe: elasticsearch: add profile for icinga checks (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [11:22:39] (03CR) 10Zfilipin: "This was scheduled for EU SWAT today, but it was not deployed because the developer was not in #wikimedia-operations." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496696 (https://phabricator.wikimedia.org/T218375) (owner: 10Esanders) [11:23:27] (03PS1) 10Dzahn: hadoop/hue/systemd: add Icinga notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/497273 [11:23:57] (03PS1) 10Muehlenhoff: Add option to skip analytics check in offboarding script [puppet] - 10https://gerrit.wikimedia.org/r/497274 [11:24:02] (03CR) 10Mathew.onipe: "PCC is still Ok: https://puppet-compiler.wmflabs.org/compiler1002/15197/" [puppet] - 10https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [11:24:48] zeljkof: is deployment over/not happening? I'm about to install new PHP packages on the deployment servers, but will wait until you're done [11:29:33] moritzm: go ahead, there was one patch from edsanders but he's not around, so no swat today [11:30:05] well, no eu swat today, there might be us swats later :) [11:30:14] ack, thanks [11:30:20] Hey [11:30:26] Sorry. I'm here.. [11:30:32] I'll wait, then :-) [11:31:07] (03PS2) 10Muehlenhoff: Add option to skip analytics check in offboarding script [puppet] - 10https://gerrit.wikimedia.org/r/497274 [11:31:18] edsanders: you're a deployer, right? are you deploying your patch? do you need help? 
[11:31:18] sorry, timezone change [11:31:36] DST FTW [11:31:37] I only did one several years ago with guidance [11:31:54] so, yeah I have no idea how to do it anymore [11:32:06] edsanders: it's nicely documented, with a lot of love :) https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers [11:32:15] if you want to try [11:32:22] I can deploy if you prefer [11:32:40] (03PS3) 10Jbond: facter: fix interface_primary under newer versions of facter [puppet] - 10https://gerrit.wikimedia.org/r/398120 (https://phabricator.wikimedia.org/T182819) (owner: 10Herron) [11:32:42] but if you want to try, I'm around to help (irc, meet...) [11:32:49] I'll let you if you don't mind [11:32:55] (03CR) 10Muehlenhoff: [C: 03+2] Add option to skip analytics check in offboarding script [puppet] - 10https://gerrit.wikimedia.org/r/497274 (owner: 10Muehlenhoff) [11:33:19] edsanders: sure, please stand by I'll let you know in a few minutes when it's at mwdebug ready for testing [11:33:46] our deploy is just a config change to enable a feature on a few wikis [11:34:06] can you test it at mwdebug1002? [11:34:12] (03CR) 10Dzahn: "any concerns about this?" [puppet] - 10https://gerrit.wikimedia.org/r/483520 (owner: 10Dzahn) [11:34:24] the majority of swat deploys are config changes [11:34:46] yeah I can [11:34:58] I can test with any domain on mwdebug then? 
[11:35:06] ok, I'll ping you in a few minutes when it's there [11:35:28] correct, docs if you need to refresh the memory https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug [11:36:00] yup, thanks - I have the plugin [11:36:05] (03CR) 10Jbond: "i just rebased this to test if it still works is there anything preventing this from getting merged as im also looking at facter3 upgrades" [puppet] - 10https://gerrit.wikimedia.org/r/398120 (https://phabricator.wikimedia.org/T182819) (owner: 10Herron) [11:37:36] (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496696 (https://phabricator.wikimedia.org/T218375) (owner: 10Esanders) [11:38:36] (03Merged) 10jenkins-bot: Enable mobile section editing on bnwiki, hewiki, zh_yuewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496696 (https://phabricator.wikimedia.org/T218375) (owner: 10Esanders) [11:38:46] (03CR) 10Dzahn: icinga/planet: add generic check_lastmod plugin and check planet updates (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/472713 (https://phabricator.wikimedia.org/T203208) (owner: 10Dzahn) [11:39:22] edsanders: the patch is at mwdebug1002, please test and let me know if I can deploy it [11:39:31] (03CR) 10Dzahn: icinga/planet: add generic check_lastmod plugin and check planet updates (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/472713 (https://phabricator.wikimedia.org/T203208) (owner: 10Dzahn) [11:39:38] thanks [11:40:43] looks good on bn.wiki, just checking the other two [11:41:17] (03PS1) 10Ema: ATS: disable max_doc_size [puppet] - 10https://gerrit.wikimedia.org/r/497277 (https://phabricator.wikimedia.org/T213263) [11:43:13] +2 LGTM [11:43:41] (@zelijko) [11:44:09] * zeljkof [11:44:17] edsanders: ok, deploying [11:45:19] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:496696|Enable mobile section editing on bnwiki, hewiki, zh_yuewiki (T218375)]] (duration: 00m 50s) [11:45:22] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:23] T218375: Deploy mobile section editing to he.wiki, bn.wiki, zh-yue.wiki - https://phabricator.wikimedia.org/T218375 [11:45:50] edsanders: it's deployed! please test in production :) and thanks for deploying with #releng! [11:45:54] !log EU SWAT finished [11:45:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:13] moritzm: swat done, you're next :) [11:46:42] how long before I should see the changes? [11:47:11] immediately? [11:47:37] there shouldn't be any delay, I think [11:47:41] not working [11:48:08] worked with mwdebug, doesn't work when you disable the extension? [11:48:16] yup [11:48:30] if you add debug=true to the url? [11:48:43] zeljkof: ack, gonna start in a bit [11:48:47] caching is the only thing that comes to my mind [11:48:49] so, it's working on zh-yue [11:48:51] not bn [11:48:55] will try debug [11:49:07] I have inspector open with debugging disabled [11:49:36] ok, has come through on bn now [11:49:44] must've been some form of caching [11:49:52] edsanders: all good? [11:50:02] all good [11:50:14] huh, got me scared for a minute :) [11:50:31] ok, swat officially done then, see you next time! [11:54:40] <_joe_> edsanders: are you in the php7 beta by any chance? 
[11:55:04] <_joe_> in those wikis where you saw a delay, I mean [11:56:29] (03PS1) 10Giuseppe Lavagetto: profile::mediawiki::php: only reload php-fpm if the config is valid [puppet] - 10https://gerrit.wikimedia.org/r/497281 [11:56:31] (03PS1) 10Giuseppe Lavagetto: profile::mediawiki::php: re-disable display_errors [puppet] - 10https://gerrit.wikimedia.org/r/497282 [12:01:00] (03PS6) 10Jcrespo: mariadb-snapshots: Better error and logging handling [puppet] - 10https://gerrit.wikimedia.org/r/496746 (https://phabricator.wikimedia.org/T210292) [12:01:02] (03PS2) 10Jcrespo: mariadb-backups: Update recover script to handle compressed dumps [puppet] - 10https://gerrit.wikimedia.org/r/497269 (https://phabricator.wikimedia.org/T206203) [12:01:35] (03PS3) 10Jcrespo: mariadb-backups: Update recover script to handle compressed dumps [puppet] - 10https://gerrit.wikimedia.org/r/497269 (https://phabricator.wikimedia.org/T206203) [12:02:33] (03PS7) 10Jcrespo: mariadb-snapshots: Better error and logging handling [puppet] - 10https://gerrit.wikimedia.org/r/496746 (https://phabricator.wikimedia.org/T210292) [12:04:57] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/497281 (owner: 10Giuseppe Lavagetto) [12:15:01] (03CR) 10ArielGlenn: [V: 03+2 C: 03+2] "tested with bad and good fspecs." 
[dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/497263 (https://phabricator.wikimedia.org/T218316) (owner: 10ArielGlenn) [12:15:30] (03PS13) 10Dzahn: icinga/planet: add generic check_lastmod plugin and check planet updates [puppet] - 10https://gerrit.wikimedia.org/r/472713 (https://phabricator.wikimedia.org/T203208) [12:16:32] (03CR) 10jerkins-bot: [V: 04-1] icinga/planet: add generic check_lastmod plugin and check planet updates [puppet] - 10https://gerrit.wikimedia.org/r/472713 (https://phabricator.wikimedia.org/T203208) (owner: 10Dzahn) [12:21:23] (03PS1) 10Jbond: Add empty file for buster [puppet] - 10https://gerrit.wikimedia.org/r/497286 (https://phabricator.wikimedia.org/T213546) [12:21:57] (03CR) 10jerkins-bot: [V: 04-1] Add empty file for buster [puppet] - 10https://gerrit.wikimedia.org/r/497286 (https://phabricator.wikimedia.org/T213546) (owner: 10Jbond) [12:22:13] (03PS14) 10Dzahn: icinga/planet: add generic check_lastmod plugin and check planet updates [puppet] - 10https://gerrit.wikimedia.org/r/472713 (https://phabricator.wikimedia.org/T203208) [12:23:13] (03CR) 10jerkins-bot: [V: 04-1] icinga/planet: add generic check_lastmod plugin and check planet updates [puppet] - 10https://gerrit.wikimedia.org/r/472713 (https://phabricator.wikimedia.org/T203208) (owner: 10Dzahn) [12:23:30] (03CR) 10Dzahn: ""not in autoload module layout" = class name does not match file name" [puppet] - 10https://gerrit.wikimedia.org/r/497286 (https://phabricator.wikimedia.org/T213546) (owner: 10Jbond) [12:24:02] (03PS2) 10Jbond: Add empty file for buster [puppet] - 10https://gerrit.wikimedia.org/r/497286 (https://phabricator.wikimedia.org/T213546) [12:24:43] _joe_ no... 
[12:25:12] <_joe_> edsanders: ok, thanks, I'm chasing a possible bug in php7 + scap [12:29:20] (03PS15) 10Dzahn: icinga/planet: add generic check_lastmod plugin and check planet updates [puppet] - 10https://gerrit.wikimedia.org/r/472713 (https://phabricator.wikimedia.org/T203208) [12:35:31] (03PS1) 10Volans: setup.py: revert commit 3d7ab9b [software/spicerack] - 10https://gerrit.wikimedia.org/r/497288 [12:35:33] (03PS1) 10Volans: tox: fix typo in environment name [software/spicerack] - 10https://gerrit.wikimedia.org/r/497289 [12:35:35] (03PS1) 10Volans: Add Python type hints and mypy check [software/spicerack] - 10https://gerrit.wikimedia.org/r/497290 [12:48:38] !log mvolz@deploy1001 scap-helm citoid upgrade staging -f citoid-values-staging.yaml stable/citoid [namespace: citoid, clusters: staging] [12:48:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:47] !log mvolz@deploy1001 scap-helm citoid upgrade staging -f citoid-staging-values.yaml stable/citoid [namespace: citoid, clusters: staging] [12:49:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:48] !log mvolz@deploy1001 scap-helm citoid cluster staging completed [12:49:48] !log mvolz@deploy1001 scap-helm citoid finished [12:49:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:03] (03CR) 10Volans: "I've added a bunch of reviewers, feel free to just review the modules you're more familiar with / have worked on." 
(033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/497290 (owner: 10Volans) [12:52:43] 10Operations, 10ops-codfw, 10decommission, 10cloud-services-team (Kanban): decommission: cloudnet2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T218025 (10aborrero) [12:53:39] 10Operations, 10ops-codfw, 10decommission, 10cloud-services-team (Kanban): decommission: cloudnet2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T218025 (10aborrero) [12:54:32] !log T218025 disable icinga checks for cloudnet2001-dev.codfw.wmnet [12:54:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:35] T218025: decommission: cloudnet2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T218025 [12:54:57] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=DELETE https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:55:23] PROBLEM - puppet last run on cloudnet2001-dev is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[tzdata],Package[prometheus-node-exporter] [12:56:09] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:00:23] (03PS1) 10Arturo Borrero Gonzalez: wmcs: decommision several codfw servers [puppet] - 10https://gerrit.wikimedia.org/r/497293 (https://phabricator.wikimedia.org/T218021) [13:01:17] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission: cloudnet2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T218025 (10aborrero) [13:03:01] (03CR) 10Dzahn: "could you add "openstack: " to the commit message? 
so far it doesn't say what context this is in" [puppet] - 10https://gerrit.wikimedia.org/r/497286 (https://phabricator.wikimedia.org/T213546) (owner: 10Jbond) [13:03:58] !log T218022 disable icinga checks for labtestservices2001.wikimedia.org [13:04:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:02] T218022: Hardware decommission: labtestservices2001.wikimedia.org - https://phabricator.wikimedia.org/T218022 [13:04:57] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission: labtestservices2001.wikimedia.org - https://phabricator.wikimedia.org/T218022 (10aborrero) [13:08:39] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good (that even seems like a suitable candidate for the service unit in Debian to be reported upstream)" [puppet] - 10https://gerrit.wikimedia.org/r/497281 (owner: 10Giuseppe Lavagetto) [13:10:10] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission: labtestvirt200[12].codfw.wmnet - https://phabricator.wikimedia.org/T218023 (10aborrero) [13:17:59] !log mvolz@deploy1001 scap-helm citoid upgrade production -f citoid-eqiad-values.yaml stable/citoid [namespace: citoid, clusters: eqiad] [13:18:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:00] !log mvolz@deploy1001 scap-helm citoid cluster eqiad completed [13:18:01] !log mvolz@deploy1001 scap-helm citoid finished [13:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:48] (03PS3) 10Jbond: Add empty file for buster [puppet] - 10https://gerrit.wikimedia.org/r/497286 (https://phabricator.wikimedia.org/T213546) [13:20:23] (03CR) 10jerkins-bot: [V: 04-1] Add empty file for buster [puppet] - 10https://gerrit.wikimedia.org/r/497286 (https://phabricator.wikimedia.org/T213546) (owner: 10Jbond) [13:21:12] (03CR) 
10Jbond: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/497286 (https://phabricator.wikimedia.org/T213546) (owner: 10Jbond) [13:22:47] (03PS1) 10Giuseppe Lavagetto: scap: use php7.2 in the "sql" script [puppet] - 10https://gerrit.wikimedia.org/r/497297 (https://phabricator.wikimedia.org/T217938) [13:22:57] !log mvolz@deploy1001 scap-helm citoid upgrade production -f citoid-codfw-values.yaml stable/citoid [namespace: citoid, clusters: codfw] [13:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:59] !log mvolz@deploy1001 scap-helm citoid cluster codfw completed [13:22:59] !log mvolz@deploy1001 scap-helm citoid finished [13:23:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:10] (03CR) 10ArielGlenn: [C: 03+2] gather (almost) all maxretries vars under one config setting [dumps] - 10https://gerrit.wikimedia.org/r/494991 (https://phabricator.wikimedia.org/T217744) (owner: 10ArielGlenn) [13:27:56] (03PS1) 10ArielGlenn: dumps: we use maxretries for multiple jobs, reflect that in config file [puppet] - 10https://gerrit.wikimedia.org/r/497298 [13:28:50] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission: labtestcontrol2001.wikimedia.org - https://phabricator.wikimedia.org/T218021 (10aborrero) [13:29:25] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission: labtestcontrol2001.wikimedia.org - https://phabricator.wikimedia.org/T218021 (10aborrero) 05Open→03Stalled This task is blocked on {T218569} [13:31:32] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): decommmision: labtestweb2001.wikimedia.org - https://phabricator.wikimedia.org/T218024 (10aborrero) [13:33:01] 10Operations, 10Traffic, 10Patch-For-Review: Partial cache_upload traffic switchover 
to ATS and switchback to Varnish - https://phabricator.wikimedia.org/T213263 (10ema) >>! In T213263#5027366, @ema wrote: > > I have disabled puppet on cp2015 and manually set `proxy.config.cache.ram_cache.size` to 1G to tes... [13:34:06] (03PS2) 10Ema: ATS: set RAM cache size [puppet] - 10https://gerrit.wikimedia.org/r/496829 (https://phabricator.wikimedia.org/T213263) [13:34:08] (03CR) 10Giuseppe Lavagetto: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/497281 (owner: 10Giuseppe Lavagetto) [13:34:11] (03CR) 10Giuseppe Lavagetto: [C: 03+2] profile::mediawiki::php: only reload php-fpm if the config is valid [puppet] - 10https://gerrit.wikimedia.org/r/497281 (owner: 10Giuseppe Lavagetto) [13:34:14] (03CR) 10Gehel: [C: 03+1] "LGTM, trivial enough" [software/spicerack] - 10https://gerrit.wikimedia.org/r/497289 (owner: 10Volans) [13:35:22] (03PS2) 10ArielGlenn: dumps: we use maxretries for multiple jobs, reflect that in config file [puppet] - 10https://gerrit.wikimedia.org/r/497298 [13:35:49] (03CR) 10Ema: [C: 03+2] ATS: set RAM cache size [puppet] - 10https://gerrit.wikimedia.org/r/496829 (https://phabricator.wikimedia.org/T213263) (owner: 10Ema) [13:36:13] (03PS5) 10Andrew Bogott: Add lvs to the read-only ldap replicas [puppet] - 10https://gerrit.wikimedia.org/r/496858 (https://phabricator.wikimedia.org/T218133) [13:36:47] moritzm: I see there's a commit of yours which has not been puppet-merged (c60a9f7078) [13:36:50] (03CR) 10Gehel: [C: 03+1] "LGTM. Would be nice to have a link to the upstream issue in the commit message, but feel free to merge as-is (probably no one is going to " [software/spicerack] - 10https://gerrit.wikimedia.org/r/497288 (owner: 10Volans) [13:37:10] (03PS2) 10Giuseppe Lavagetto: profile::mediawiki::php: only reload php-fpm if the config is valid [puppet] - 10https://gerrit.wikimedia.org/r/497281 [13:37:21] ema: ah, right. 
please merge along [13:37:24] moritzm: k [13:38:20] 10Operations, 10DBA, 10Gerrit, 10Release-Engineering-Team (Next): Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532 (10Dzahn) T218570 might unblock this [13:39:05] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission: labtestservices2001.wikimedia.org - https://phabricator.wikimedia.org/T218022 (10aborrero) 05Open→03Stalled >>! In T218022#5027290, @Andrew wrote: > right now labtestservices2001 is the only h... [13:39:20] 10Operations, 10Patch-For-Review: Provide cross-dc redundancy (active-active or active-passive) to all important misc services - https://phabricator.wikimedia.org/T156937 (10Dzahn) This could probably move forward once T218570 gets resolved. [13:39:30] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission: labtestservices2001.wikimedia.org - https://phabricator.wikimedia.org/T218022 (10aborrero) [13:41:28] (03CR) 10Effie Mouzeli: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/15202/mwmaint1002.eqiad.wmnet/ LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/497297 (https://phabricator.wikimedia.org/T217938) (owner: 10Giuseppe Lavagetto) [13:41:48] (03CR) 10Ottomata: eventgate-analytics - Add statsd prometheus mappings for node-rdkafka-statsd metrics (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/496554 (https://phabricator.wikimedia.org/T218305) (owner: 10Ottomata) [13:41:59] !log cp-ats rolling restart to apply proxy.config.cache.ram_cache.size config change T213263 [13:42:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:03] T213263: Partial cache_upload traffic switchover to ATS and switchback to Varnish - https://phabricator.wikimedia.org/T213263 [13:42:31] (03PS16) 10Dzahn: icinga/planet: add generic check_lastmod plugin and check planet 
updates [puppet] - 10https://gerrit.wikimedia.org/r/472713 (https://phabricator.wikimedia.org/T203208) [13:42:33] (03PS3) 10ArielGlenn: dumps: we use maxretries for multiple jobs, reflect that in config file [puppet] - 10https://gerrit.wikimedia.org/r/497298 [13:45:50] 10Operations, 10ops-codfw, 10decommission, 10cloud-services-team (Kanban): decommission: labtestcontrol2001.wikimedia.org - https://phabricator.wikimedia.org/T218021 (10aborrero) [13:46:09] 10Operations, 10ops-codfw, 10decommission, 10cloud-services-team (Kanban): decommission: labtestservices2001.wikimedia.org - https://phabricator.wikimedia.org/T218022 (10aborrero) [13:50:12] (03PS2) 10Arturo Borrero Gonzalez: wmcs: decommision several codfw servers [puppet] - 10https://gerrit.wikimedia.org/r/497293 (https://phabricator.wikimedia.org/T218023) [13:50:59] (03PS4) 10ArielGlenn: dumps: we use maxretries for multiple jobs, reflect that in config file [puppet] - 10https://gerrit.wikimedia.org/r/497298 [13:51:52] (03CR) 10Andrew Bogott: [C: 03+1] "I hate the phrase 'It has the downside of returning inconsistent information about group membership which may confuse some applications' b" [puppet] - 10https://gerrit.wikimedia.org/r/496991 (https://phabricator.wikimedia.org/T217280) (owner: 10BryanDavis) [13:52:48] (03CR) 10Andrew Bogott: [C: 03+1] "I'm all for this but it probably needs some accompanying mailing-list communication" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/443190 (https://phabricator.wikimedia.org/T154504) (owner: 10Nehajha) [13:54:26] 10Operations, 10Patch-For-Review: Provide cross-dc redundancy (active-active or active-passive) to all important misc services - https://phabricator.wikimedia.org/T156937 (10Marostegui) >>! In T156937#5032610, @Dzahn wrote: > This could probably move forward once T218570 gets resolved. I think those are diffe... 
[13:54:32] 10Operations, 10ops-codfw, 10decommission, 10cloud-services-team (Kanban): decommission: labtestservices2001.wikimedia.org - https://phabricator.wikimedia.org/T218022 (10Andrew) > I don't really know what that database is about. But perhaps we want to do it at the same time as {T218569}. Would you mind up... [13:56:02] (03CR) 10Mathew.onipe: "Nice work! few comments inline" (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/492375 (owner: 10Gehel) [13:57:00] (03PS2) 10Ema: ATS: disable max_doc_size [puppet] - 10https://gerrit.wikimedia.org/r/497277 (https://phabricator.wikimedia.org/T213263) [13:57:18] (03CR) 10Volans: [C: 03+2] "> Patch Set 1: Code-Review+1" [software/spicerack] - 10https://gerrit.wikimedia.org/r/497288 (owner: 10Volans) [13:57:32] 10Operations, 10ops-codfw, 10decommission, 10cloud-services-team (Kanban): decommission: labtestservices2001.wikimedia.org - https://phabricator.wikimedia.org/T218022 (10aborrero) >>! In T218022#5032691, @Andrew wrote: > >> I don't really know what that database is about. But perhaps we want to do it at t... [13:58:03] 10Operations, 10Cloud-VPS, 10Traffic, 10LDAP, and 2 others: Update openldap profile to use LE - https://phabricator.wikimedia.org/T218398 (10Andrew) 05Open→03Resolved a:03Andrew This is working with acme now. [13:59:41] 10Operations, 10Patch-For-Review: Prepare our base system layer for Debian buster - https://phabricator.wikimedia.org/T213527 (10fgiunchedi) [13:59:47] 10Operations, 10Puppet, 10monitoring: prometheus-node-exporter makes puppet fail because requiring a version that no longer exists on buster - https://phabricator.wikimedia.org/T218118 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Fixed by @MoritzMuehlenhoff in https://gerrit.wikimedia.org/r/c/operat... 
[14:00:30] 10Operations, 10ops-codfw, 10decommission, 10cloud-services-team (Kanban): decommission: labtestservices2001.wikimedia.org - https://phabricator.wikimedia.org/T218022 (10aborrero) [14:00:47] (03CR) 10Mathew.onipe: [C: 03+1] elasticsearch: add method to mock node info API (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/492385 (owner: 10Gehel) [14:01:55] (03Merged) 10jenkins-bot: setup.py: revert commit 3d7ab9b [software/spicerack] - 10https://gerrit.wikimedia.org/r/497288 (owner: 10Volans) [14:02:11] (03CR) 10Volans: [C: 03+2] tox: fix typo in environment name [software/spicerack] - 10https://gerrit.wikimedia.org/r/497289 (owner: 10Volans) [14:02:25] (03CR) 10Mathew.onipe: [C: 03+1] elasticsearch: upgrade to elastic 6.5.4 [puppet] - 10https://gerrit.wikimedia.org/r/495920 (https://phabricator.wikimedia.org/T218116) (owner: 10Gehel) [14:02:42] (03CR) 10Gehel: [C: 03+1] "Looks good for the elastic part. Minor comment / question inline" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/497290 (owner: 10Volans) [14:03:18] (03CR) 10Dzahn: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/497286 (https://phabricator.wikimedia.org/T213546) (owner: 10Jbond) [14:05:24] (03CR) 10Ema: [C: 03+2] ATS: disable max_doc_size [puppet] - 10https://gerrit.wikimedia.org/r/497277 (https://phabricator.wikimedia.org/T213263) (owner: 10Ema) [14:06:19] (03Merged) 10jenkins-bot: tox: fix typo in environment name [software/spicerack] - 10https://gerrit.wikimedia.org/r/497289 (owner: 10Volans) [14:06:24] (03CR) 10Volans: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/398120 (https://phabricator.wikimedia.org/T182819) (owner: 10Herron) [14:06:37] (03CR) 10Giuseppe Lavagetto: [C: 03+2] scap: use php7.2 in the "sql" script [puppet] - 10https://gerrit.wikimedia.org/r/497297 (https://phabricator.wikimedia.org/T217938) (owner: 10Giuseppe Lavagetto) [14:06:46] (03PS2) 10Giuseppe Lavagetto: scap: use php7.2 in the "sql" 
script [puppet] - 10https://gerrit.wikimedia.org/r/497297 (https://phabricator.wikimedia.org/T217938) [14:07:57] (03CR) 10CRusnov: "> Patch Set 14:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/493774 (https://phabricator.wikimedia.org/T215229) (owner: 10CRusnov) [14:08:31] (03PS3) 10Arturo Borrero Gonzalez: wmcs: decommision several codfw servers [puppet] - 10https://gerrit.wikimedia.org/r/497293 (https://phabricator.wikimedia.org/T218023) [14:11:25] (03CR) 1020after4: [C: 03+1] scap: use php7.2 in the "sql" script [puppet] - 10https://gerrit.wikimedia.org/r/497297 (https://phabricator.wikimedia.org/T217938) (owner: 10Giuseppe Lavagetto) [14:13:28] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=prometheus2003.codfw.wmnet [14:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:12] (03PS2) 10Giuseppe Lavagetto: profile::mediawiki::php: re-disable display_errors [puppet] - 10https://gerrit.wikimedia.org/r/497282 [14:17:44] (03CR) 10Mathew.onipe: Add Python type hints and mypy check (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/497290 (owner: 10Volans) [14:18:39] (03PS5) 10ArielGlenn: dumps: we use maxretries for multiple jobs, reflect that in config file [puppet] - 10https://gerrit.wikimedia.org/r/497298 [14:18:41] (03CR) 10Mathew.onipe: [C: 03+1] tox: fix typo in environment name [software/spicerack] - 10https://gerrit.wikimedia.org/r/497289 (owner: 10Volans) [14:19:04] 10Operations, 10monitoring: Update prometheus-node-exporter NTP metrics - https://phabricator.wikimedia.org/T208875 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi I'm resolving this, upgrade work is tracked in {T213708} [14:19:32] (03CR) 10Giuseppe Lavagetto: [C: 03+2] profile::mediawiki::php: re-disable display_errors [puppet] - 10https://gerrit.wikimedia.org/r/497282 (owner: 10Giuseppe Lavagetto) [14:19:34] (03CR) 10ArielGlenn: [C: 03+2] dumps: we use maxretries for multiple jobs, reflect that 
in config file [puppet] - 10https://gerrit.wikimedia.org/r/497298 (owner: 10ArielGlenn) [14:20:12] 10Operations, 10Cloud-VPS, 10Toolforge, 10LDAP, 10Patch-For-Review: LDAP server running out of memory frequently and disrupting Cloud VPS clients - https://phabricator.wikimedia.org/T217280 (10GTirloni) I'm not seeing any meaningful impact on LDAP operations from our latest changes. [14:20:14] (03CR) 10Mathew.onipe: [C: 03+1] elasticsearch: deploy elasticsearch config for ES6 [puppet] - 10https://gerrit.wikimedia.org/r/495921 (https://phabricator.wikimedia.org/T218116) (owner: 10Gehel) [14:20:46] (03PS3) 10Giuseppe Lavagetto: profile::mediawiki::php: re-disable display_errors [puppet] - 10https://gerrit.wikimedia.org/r/497282 [14:20:55] (03PS2) 10Gehel: elasticsearch: upgrade to elastic 6.5.4 [puppet] - 10https://gerrit.wikimedia.org/r/495920 (https://phabricator.wikimedia.org/T218116) [14:21:13] (03PS6) 10Ottomata: eventgate-analytics - Add statsd prometheus mappings for node-rdkafka-statsd metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/496554 (https://phabricator.wikimedia.org/T218305) [14:21:18] 10Operations, 10Analytics, 10EventBus, 10Core Platform Team (Modern Event Platform (TEC2)), and 3 others: Possibly expand Kafka main-{eqiad,codfw} clusters in Q4 2019. - https://phabricator.wikimedia.org/T217359 (10herron) According to netbox support for hosts `kafka[12]00[123]` expired in Dec 2018. After... [14:22:58] (03CR) 10Gehel: [C: 03+2] elasticsearch: upgrade to elastic 6.5.4 [puppet] - 10https://gerrit.wikimedia.org/r/495920 (https://phabricator.wikimedia.org/T218116) (owner: 10Gehel) [14:23:11] (03PS3) 10Gehel: elasticsearch: upgrade to elastic 6.5.4 [puppet] - 10https://gerrit.wikimedia.org/r/495920 (https://phabricator.wikimedia.org/T218116) [14:25:23] 10Operations, 10Analytics, 10EventBus, 10Core Platform Team (Modern Event Platform (TEC2)), and 3 others: Possibly expand Kafka main-{eqiad,codfw} clusters in Q4 2019. 
- https://phabricator.wikimedia.org/T217359 (10elukey) In the SRE spreadsheet I can see that the suggested replacement FY is 20/21, not the... [14:27:02] (03PS1) 10ArielGlenn: fix up a bunch of global config values for dumps [puppet] - 10https://gerrit.wikimedia.org/r/497305 [14:27:57] (03PS4) 10Jbond: Create an empty openstack::clientpackages::mitaka::buster manifest [puppet] - 10https://gerrit.wikimedia.org/r/497286 (https://phabricator.wikimedia.org/T213546) [14:28:52] 10Operations, 10ops-codfw, 10serviceops, 10User-jijiki: mw2206.codfw.wmnet memory issues - https://phabricator.wikimedia.org/T215415 (10Papaul) @jijiki the server log is not reporting any errors message since 3-13. I will go ahead and replace the memory with one of the memory from the decom servers and we... [14:28:55] (03CR) 10Jbond: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/497286 (https://phabricator.wikimedia.org/T213546) (owner: 10Jbond) [14:29:25] (03PS1) 10Ladsgroup: gitignore: Ignore all *.iml files [puppet] - 10https://gerrit.wikimedia.org/r/497306 [14:29:41] (03CR) 10Alex Monk: "should probably do the same for newton/buster.pp and ocata/buster.pp" [puppet] - 10https://gerrit.wikimedia.org/r/497286 (https://phabricator.wikimedia.org/T213546) (owner: 10Jbond) [14:31:17] Anyone for this quick patch: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/497306 ? Noop for prod [14:32:07] (03PS1) 10TheAnarcat: allow PuppetDB scheme to be changed [software/cumin] - 10https://gerrit.wikimedia.org/r/497309 [14:32:09] (03CR) 10Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! 
:) To learn how to get your code changes reviewed faster and more likely to get" [software/cumin] - 10https://gerrit.wikimedia.org/r/497309 (owner: 10TheAnarcat) [14:33:03] (03PS1) 10Alexandros Kosiaris: openldap: Enforce cn uniqueness [puppet] - 10https://gerrit.wikimedia.org/r/497310 [14:33:20] (03CR) 10Mathew.onipe: [C: 03+1] gitignore: Ignore all *.iml files [puppet] - 10https://gerrit.wikimedia.org/r/497306 (owner: 10Ladsgroup) [14:33:58] (03PS5) 10Jbond: Create an empty openstack::clientpackages::**::buster manifest [puppet] - 10https://gerrit.wikimedia.org/r/497286 (https://phabricator.wikimedia.org/T213546) [14:34:33] (03CR) 10Alex Monk: [C: 03+1] Create an empty openstack::clientpackages::**::buster manifest [puppet] - 10https://gerrit.wikimedia.org/r/497286 (https://phabricator.wikimedia.org/T213546) (owner: 10Jbond) [14:35:20] (03PS2) 10TheAnarcat: allow PuppetDB scheme to be changed [software/cumin] - 10https://gerrit.wikimedia.org/r/497309 [14:36:25] (03PS3) 10Herron: rsyslog: update syslog_json template with format jsonf [puppet] - 10https://gerrit.wikimedia.org/r/496806 (https://phabricator.wikimedia.org/T213899) [14:36:40] 10Operations, 10Analytics, 10Discovery, 10Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (10fgiunchedi) >>! In T213976#5028623, @Nuria wrote: > Ping @fgiunchedi about putting this as a commong goal next quar... [14:36:57] (03CR) 10Alex Monk: "shouldn't it be using a cert signed by your private CA? or even just a publicly-trusted cert (despite being unexposed)?" 
[software/cumin] - 10https://gerrit.wikimedia.org/r/497309 (owner: 10TheAnarcat) [14:37:23] (03PS1) 10TheAnarcat: allow running cumin as a regular user [software/cumin] - 10https://gerrit.wikimedia.org/r/497312 [14:40:06] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/497310 (owner: 10Alexandros Kosiaris) [14:40:27] (03CR) 10Alexandros Kosiaris: [C: 03+2] openldap: Enforce cn uniqueness [puppet] - 10https://gerrit.wikimedia.org/r/497310 (owner: 10Alexandros Kosiaris) [14:40:34] (03CR) 10Alexandros Kosiaris: [C: 03+2] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/497310 (owner: 10Alexandros Kosiaris) [14:42:59] (03CR) 10Jbond: "> Patch Set 3: -Code-Review" [puppet] - 10https://gerrit.wikimedia.org/r/398120 (https://phabricator.wikimedia.org/T182819) (owner: 10Herron) [14:44:32] (03CR) 10TheAnarcat: "maybe? having it unexposed makes it difficult to verify with Let's Encrypt, for example. and running a private CA is nice if you like that" [software/cumin] - 10https://gerrit.wikimedia.org/r/497309 (owner: 10TheAnarcat) [14:46:36] !log gehel@cumin1001 START - Cookbook sre.elasticsearch.e6-upgrade [14:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:52] * onimisionipe crosses fingers [14:47:27] (03CR) 10Alex Monk: "should be just as easy as exposed (assuming dns-01 challenges)." 
[software/cumin] - 10https://gerrit.wikimedia.org/r/497309 (owner: 10TheAnarcat) [14:47:59] !log gehel@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.e6-upgrade (exit_code=99) [14:48:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:03] 10Operations, 10ops-eqiad, 10Operations-Software-Development, 10monitoring: ms-be1043 sdk failed - https://phabricator.wikimedia.org/T218544 (10fgiunchedi) [14:49:34] 10Operations, 10ops-codfw, 10DBA: rack/setup/deploy dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T218336 (10Papaul) @Marostegui Thank you what is the stripe size to use? [14:51:36] 10Operations, 10ops-codfw, 10DBA: rack/setup/deploy dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T218336 (10Marostegui) 256KB [14:52:28] (03CR) 10TheAnarcat: "we do use dns-01 challenges, but that's not the common case so I'm trying to broaden the horizon here a bit. :) i must admit I haven't che" [software/cumin] - 10https://gerrit.wikimedia.org/r/497309 (owner: 10TheAnarcat) [14:58:24] 10Operations, 10Analytics, 10Discovery, 10Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (10Nuria) @fgiunchedi sounds good, will try to set up short 30 min meeting? [15:01:37] (03PS1) 10Ladsgroup: Deprecate statsd hiera config in favor of statsd_host and statsd_port [puppet] - 10https://gerrit.wikimedia.org/r/497316 [15:01:42] 10Operations, 10ops-codfw, 10serviceops, 10User-jijiki: mw2206.codfw.wmnet memory issues - https://phabricator.wikimedia.org/T215415 (10Papaul) Also the error I have here is not telling me which memory row or channel it refers to so it's difficult to tell which one to replace . The reason being maybe the m... 
[15:05:25] (03PS2) 10Ladsgroup: Deprecate statsd hiera config in favor of statsd_host and statsd_port [puppet] - 10https://gerrit.wikimedia.org/r/497316 [15:06:26] !log shutting down mw2206 for memtest [15:06:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:46] (03PS7) 10Ottomata: eventgate-analytics - Add statsd prometheus mappings for node-rdkafka-statsd metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/496554 (https://phabricator.wikimedia.org/T218305) [15:07:50] PROBLEM - Host mw2206 is DOWN: PING CRITICAL - Packet loss = 100% [15:08:07] 10Operations, 10Wikimedia-Logstash, 10service-runner, 10Core Platform Team Backlog (Later), and 2 others: Move service-runner to new logging infrastructure - https://phabricator.wikimedia.org/T211125 (10Pchelolo) After enabling logging over syslog for RESTBase in deployment-prep, we have identified a numbe... [15:09:51] (03PS4) 10Herron: rsyslog: update syslog_json template with format jsonf [puppet] - 10https://gerrit.wikimedia.org/r/496806 (https://phabricator.wikimedia.org/T213899) [15:10:52] (03CR) 10Herron: [C: 03+2] rsyslog: update syslog_json template with format jsonf [puppet] - 10https://gerrit.wikimedia.org/r/496806 (https://phabricator.wikimedia.org/T213899) (owner: 10Herron) [15:12:37] (03PS1) 10Tchanders: Enable partial blocks on Persian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497317 [15:14:15] (03PS2) 10Jayprakash12345: Enable $wgAllowCopyUploads for pawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495446 (https://phabricator.wikimedia.org/T217486) [15:14:18] 10Operations, 10SRE-Access-Requests, 10WMF-NDA-Requests: Volunteer NDA for Alex Monk - https://phabricator.wikimedia.org/T218448 (10EBjune) I appreciate the thoughtfulness that went into this assessment, this has my signoff. [15:17:06] (03CR) 10Jayprakash12345: "@Reedy Zfilipin clearly told me to take +1 from you or MarcoA. 
Please consider this if you have time" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495446 (https://phabricator.wikimedia.org/T217486) (owner: 10Jayprakash12345) [15:17:44] (03PS4) 10Jcrespo: mariadb-backups: Update recover script to handle compressed dumps [puppet] - 10https://gerrit.wikimedia.org/r/497269 (https://phabricator.wikimedia.org/T206203) [15:18:56] (03CR) 10Alex Monk: acme-chief-api: Add support for puppet HTTP API search operation (031 comment) [software/acme-chief] - 10https://gerrit.wikimedia.org/r/494957 (https://phabricator.wikimedia.org/T207295) (owner: 10Vgutierrez) [15:19:10] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Update recover script to handle compressed dumps [puppet] - 10https://gerrit.wikimedia.org/r/497269 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [15:21:09] 10Operations, 10SRE-Access-Requests, 10WMF-NDA-Requests: Volunteer NDA for Alex Monk - https://phabricator.wikimedia.org/T218448 (10MoritzMuehlenhoff) >>! In T218448#5028665, @bd808 wrote: > Adding #sre-access-requests tag as well because I'm not 100% certain if the NDA needed is just L2 or if the Cobbleston... 
[15:22:01] (03CR) 10Vgutierrez: acme-chief-api: Add support for puppet HTTP API search operation (031 comment) [software/acme-chief] - 10https://gerrit.wikimedia.org/r/494957 (https://phabricator.wikimedia.org/T207295) (owner: 10Vgutierrez) [15:23:22] (03PS2) 10Dbarratt: Enable partial blocks on Persian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497317 (https://phabricator.wikimedia.org/T218584) (owner: 10Tchanders) [15:25:37] 10Operations, 10ops-codfw, 10DBA: rack/setup/deploy dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T218336 (10Papaul) switch port information dbprov2001: asw-a4-codfw xe-4/0/18 dbprov2002: asw-b4-codfw xe-4/0/2 [15:26:29] (03CR) 10Alex Monk: acme-chief-api: Add support for puppet HTTP API search operation (031 comment) [software/acme-chief] - 10https://gerrit.wikimedia.org/r/494957 (https://phabricator.wikimedia.org/T207295) (owner: 10Vgutierrez) [15:27:49] 10Operations, 10ops-codfw, 10DBA: rack/setup/deploy dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T218336 (10Papaul) [15:27:57] (03CR) 10BryanDavis: "> I'm all for this but it probably needs some accompanying" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/443190 (https://phabricator.wikimedia.org/T154504) (owner: 10Nehajha) [15:30:16] (03PS1) 10Jbond: Ensure auto_restart is a dependency of auto_restart_service [puppet] - 10https://gerrit.wikimedia.org/r/497318 [15:30:32] (03PS3) 10Arturo Borrero Gonzalez: cloud-vps: Remove mysql packages from openstack::clientpackages::mitaka::* [puppet] - 10https://gerrit.wikimedia.org/r/497210 (https://phabricator.wikimedia.org/T218009) (owner: 10BryanDavis) [15:31:15] (03CR) 10jerkins-bot: [V: 04-1] Ensure auto_restart is a dependency of auto_restart_service [puppet] - 10https://gerrit.wikimedia.org/r/497318 (owner: 10Jbond) [15:31:29] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloud-vps: Remove mysql packages from 
openstack::clientpackages::mitaka::* [puppet] - 10https://gerrit.wikimedia.org/r/497210 (https://phabricator.wikimedia.org/T218009) (owner: 10BryanDavis) [15:32:10] (03PS1) 10Bstorm: dumps distribution: failing over the web only for network changes [puppet] - 10https://gerrit.wikimedia.org/r/497319 (https://phabricator.wikimedia.org/T187960) [15:34:00] (03PS1) 10Ppchelko: Map syslogseverity-text to severity_label. [puppet] - 10https://gerrit.wikimedia.org/r/497321 (https://phabricator.wikimedia.org/T211125) [15:35:03] 10Operations, 10Analytics, 10EventBus, 10Prod-Kubernetes, and 2 others: eventgate-analytics k8s pods occasionally can't produce to kafka - https://phabricator.wikimedia.org/T218268 (10Milimetric) p:05Triage→03High [15:35:27] 10Operations, 10Analytics, 10Analytics-Kanban, 10EventBus, and 3 others: eventgate-analytics k8s pods occasionally can't produce to kafka - https://phabricator.wikimedia.org/T218268 (10Milimetric) [15:36:19] (03PS3) 10Tchanders: Enable partial blocks on Persian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497317 (https://phabricator.wikimedia.org/T218584) [15:36:24] (03CR) 10Dzahn: icinga/planet: add generic check_lastmod plugin and check planet updates (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/472713 (https://phabricator.wikimedia.org/T203208) (owner: 10Dzahn) [15:38:48] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916 [15:38:53] (03CR) 10Dbarratt: [C: 03+1] Enable partial blocks on Persian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497317 (https://phabricator.wikimedia.org/T218584) (owner: 10Tchanders) [15:41:17] (03CR) 10Bstorm: [C: 03+1] "I can merge this and run the update script right away, if you want? It'll take a while. Otherwise, I can wait for the haproxy update. 
Wi" [puppet] - 10https://gerrit.wikimedia.org/r/497228 (owner: 10Marostegui) [15:42:05] (03CR) 10Marostegui: "Thanks - I will do that myself tomorrow. Labsdb1010 is lagging still a lot due to earlier issues, so I will do it tomorrow." [puppet] - 10https://gerrit.wikimedia.org/r/497228 (owner: 10Marostegui) [15:42:50] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916 [15:42:53] (03CR) 10Alexandros Kosiaris: eventgate-analytics - Add statsd prometheus mappings for node-rdkafka-statsd metrics (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/496554 (https://phabricator.wikimedia.org/T218305) (owner: 10Ottomata) [15:43:04] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro [15:45:34] !log Depool sbc* from serving cxserver on eqiad - T213195 [15:45:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:41] T213195: Migrate cxserver to kubernetes - https://phabricator.wikimedia.org/T213195 [15:46:28] (03PS12) 10Vgutierrez: acme-chief-api: Add support for puppet HTTP API search operation [software/acme-chief] - 10https://gerrit.wikimedia.org/r/494957 (https://phabricator.wikimedia.org/T207295) [15:47:48] RECOVERY - Host mw2206 is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms [15:48:03] (03CR) 10Vgutierrez: acme-chief-api: Add support for puppet HTTP API search operation (031 comment) [software/acme-chief] - 10https://gerrit.wikimedia.org/r/494957 (https://phabricator.wikimedia.org/T207295) (owner: 10Vgutierrez) [15:48:07] !log jiji@cumin1001 conftool action : 
set/pooled=no; selector: dc=eqiad,service=cxserver,cluster=scb,name=scb.* [15:48:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:13] 10Operations, 10monitoring: Update prometheus-node-exporter NTP metrics - https://phabricator.wikimedia.org/T208875 (10fgiunchedi) 05Resolved→03Open Reopening, once T213708 is done there's also followup work for this, e.g. updating dashboards. [15:48:55] 10Operations, 10ops-codfw, 10serviceops, 10User-jijiki: mw2206.codfw.wmnet memory issues - https://phabricator.wikimedia.org/T215415 (10Papaul) memtest complete with no errors [15:50:26] (03PS1) 10Gehel: elasticsearch: multiple commands should be unpacked [cookbooks] - 10https://gerrit.wikimedia.org/r/497326 (https://phabricator.wikimedia.org/T218116) [15:52:24] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 8.146 second response time https://phabricator.wikimedia.org/T174916 [15:52:37] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: multiple commands should be unpacked [cookbooks] - 10https://gerrit.wikimedia.org/r/497326 (https://phabricator.wikimedia.org/T218116) (owner: 10Gehel) [15:53:08] (03CR) 10Volans: elasticsearch: multiple commands should be unpacked (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/497326 (https://phabricator.wikimedia.org/T218116) (owner: 10Gehel) [15:53:14] (03PS2) 10Gehel: elasticsearch: multiple commands should be unpacked [cookbooks] - 10https://gerrit.wikimedia.org/r/497326 (https://phabricator.wikimedia.org/T218116) [15:58:36] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916 [16:03:26] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 276 bytes in 6.120 second response time https://phabricator.wikimedia.org/T174916 [16:07:10] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916 [16:08:40] RECOVERY - 
pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 3.734 second response time https://phabricator.wikimedia.org/T174916 [16:12:24] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916 [16:13:34] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 5.934 second response time https://phabricator.wikimedia.org/T174916 [16:15:40] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 4.483 second response time https://phabricator.wikimedia.org/T174916 [16:15:49] !log restarting pdfrender on scb1004 [16:15:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:20] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916 [16:17:34] !log restarting pdfrender on scb1003 [16:17:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:22] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.075 second response time https://phabricator.wikimedia.org/T174916 [16:18:48] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [16:19:56] RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [16:21:06] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [16:36:56] gerrit session expired? [16:37:00] ... 
this feels new [16:37:19] maybe I forgot to press the remember button last time [16:37:26] I got kicked out too just now [16:37:53] maybe someone flushed stuff [16:38:12] PROBLEM - MariaDB Slave Lag: m3 on db2042 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 481.73 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [16:38:44] PROBLEM - MariaDB Slave Lag: m3 on db2078 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 481.23 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [16:38:56] yeah, I also just got logged out [16:39:56] tarrow: Yeah, that was me [16:39:59] k [16:40:03] sorry for the inconvenience [16:40:43] bawolff: no problem :). Just thought we might all be on the lookout for odd gerrit behaviour for a bit [16:40:57] ^ [16:41:33] Aren't we all :) [16:41:49] gerrit is an easy system to log into, I don't mind having to redo it from time to time. not quite the same deal as horizon which logs us out all the damn time and needs 2fa [16:42:44] 🙏 [16:43:01] We should ask upstream to add 2FA to gerrit [16:43:09] probably [16:43:34] they might say use an auth provider which has it [16:43:36] I remember chad once suggested we should switch to the oauth backend, then we could piggy back off onwiki 2FA [16:43:43] I'm just glad I realized that Gerrit saying "All changes saved" but not removing the textbox meant it was lying, so I should copy my comment elsewhere, reload, and resubmit it. [16:45:15] "We should ask upstream to add 2FA to gerrit" yes, we really should [16:47:42] just to note it here [16:47:51] Gerrit does not provide the authentication system. [16:48:51] Gerrit technically supports 2fa if you use OAuth i think [16:49:17] is it just me that I cannot load meta? [16:49:34] meta loads for me. 
[16:49:36] revi: works for me [16:49:38] revi: i can load meta [16:49:38] revi: seems to be just you [16:49:40] hmmmmm [16:49:43] yeah [16:49:52] revi: Please don't scare us like that :P [16:49:55] damn network failing just for meta [16:50:00] lol [16:50:16] PROBLEM - PyBal backends health check on lvs1005 is CRITICAL: PYBAL CRITICAL - CRITICAL - git-ssh4_22: Servers phab1001-vcs.eqiad.wmnet are marked down but pooled: git-ssh6_22: Servers phab1001-vcs.eqiad.wmnet are marked down but pooled [16:50:19] works on ko, en, hi, my phone just can't load meta [16:50:29] gonna switch device [16:50:51] k my phone is ded [16:51:11] bawolff im glad we didn't use oauth back then :) (it's safe to use it now though). [16:53:14] hmm, why did phab go down at the same time as gerrit? [16:53:26] paladox: we are on it [16:53:30] PROBLEM - puppet last run on notebook1003 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config],Exec[git_pull_mediawiki/event-schemas] [16:53:36] yup, i know :) [16:53:43] Eek, surprise Gerrit maintenance. I hope the code review I was in the middle of is still there when it comes back up. [16:54:09] anomie if you used PolyGerrit then yes [16:54:09] sorry for confusing you people xD, /me back to laziness [16:54:14] as long as you did not press save [16:55:03] paladox: I had saved my comments and was going back to the top to submit them when it went down. [16:55:18] PROBLEM - PyBal IPVS diff check on lvs1002 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([phab1001-vcs.eqiad.wmnet]) [16:55:32] ah, ok. Saved commits will be in the repo then [16:55:42] *comments [16:56:55] Gerrit is not loading for me [16:57:06] PROBLEM - puppet last run on webperf2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[git_pull_operations/software/xhgui] [16:57:07] paladox: Can you try please? [16:57:09] xSavitar: See /topic [16:57:12] It's known [16:57:24] Oh, thanks [16:57:37] And I needed a change ID :( [16:57:42] Anyway, I'll wait [16:58:25] btw Gerrit also timed out [16:58:50] PROBLEM - puppet last run on kafka1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [16:59:15] AndyRussG: see topic [16:59:25] Cc: xSavitar --^ [16:59:47] volans: ah ok thanks [16:59:58] PROBLEM - puppet last run on kafka2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [17:00:04] gehel and onimisionipe: Your horoscope predicts another unfortunate Wikidata Query Service weekly deploy deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190318T1700). [17:00:23] errr..not now i guess [17:00:34] jouncebot: deployment canceled during gerrit maintenance [17:03:18] PROBLEM - puppet last run on an-coord1001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas],Exec[git_pull_operations/mediawiki-config] [17:03:28] ACKNOWLEDGEMENT - puppet last run on an-coord1001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas],Exec[git_pull_operations/mediawiki-config] cole_white gerrit maintenance [17:03:28] ACKNOWLEDGEMENT - puppet last run on db2094 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] cole_white gerrit maintenance [17:03:28] ACKNOWLEDGEMENT - puppet last run on kafka1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] cole_white gerrit maintenance [17:03:28] ACKNOWLEDGEMENT - puppet last run on kafka2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] cole_white gerrit maintenance [17:03:28] ACKNOWLEDGEMENT - puppet last run on notebook1003 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config],Exec[git_pull_mediawiki/event-schemas] cole_white gerrit maintenance [17:03:29] ACKNOWLEDGEMENT - puppet last run on stat1007 is CRITICAL: CRITICAL: Puppet has 7 failures. Last run 2 minutes ago with 7 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config],Exec[git_pull_wmde/scripts],Exec[git_pull_wmde/toolkit-analyzer-build],Exec[git_pull_mediawiki/event-schemas] cole_white gerrit maintenance [17:03:29] ACKNOWLEDGEMENT - puppet last run on webperf2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 10 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/xhgui] cole_white gerrit maintenance [17:05:16] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1002 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([phab1001-vcs.eqiad.wmnet]) cole_white phabricator maintenance [17:05:30] PROBLEM - puppet last run on sarin is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/cookbooks] [17:05:48] ACKNOWLEDGEMENT - puppet last run on sarin is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/cookbooks] cole_white gerrit maintenance [17:06:48] ACKNOWLEDGEMENT - puppet last run on eventlog1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] cole_white gerrit maintenance [17:07:50] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_jenkins CI slave scripts] [17:07:58] ACKNOWLEDGEMENT - puppet last run on gerrit2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_All-Avatars] cole_white gerrit maintenance [17:07:58] ACKNOWLEDGEMENT - puppet last run on labsdb1012 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] cole_white gerrit maintenance [17:07:58] ACKNOWLEDGEMENT - puppet last run on neodymium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/cookbooks] cole_white gerrit maintenance [17:09:18] ACKNOWLEDGEMENT - puppet last run on kafka2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] cole_white gerrit maintenance [17:09:18] ACKNOWLEDGEMENT - puppet last run on labsdb1011 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] cole_white gerrit maintenance [17:09:18] ACKNOWLEDGEMENT - puppet last run on tungsten is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[git_pull_operations/software/xhgui] cole_white gerrit maintenance [17:10:03] ACKNOWLEDGEMENT - puppet last run on contint2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_jenkins CI slave scripts] cole_white gerrit maintenance [17:10:03] ACKNOWLEDGEMENT - puppet last run on db2095 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] cole_white gerrit maintenance [17:13:13] ACKNOWLEDGEMENT - puppet last run on contint1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_jenkins CI slave scripts] cole_white gerrit maintenance [17:13:13] ACKNOWLEDGEMENT - puppet last run on kafka1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] cole_white gerrit maintenance [17:13:13] ACKNOWLEDGEMENT - puppet last run on labsdb1009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] cole_white gerrit maintenance [17:13:13] ACKNOWLEDGEMENT - puppet last run on labsdb1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] cole_white gerrit maintenance [17:14:48] ACKNOWLEDGEMENT - puppet last run on cumin1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/cookbooks] cole_white gerrit maintenance [17:14:48] ACKNOWLEDGEMENT - puppet last run on db1124 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] cole_white gerrit maintenance [17:14:48] ACKNOWLEDGEMENT - puppet last run on kafka2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] cole_white gerrit maintenance [17:14:48] ACKNOWLEDGEMENT - puppet last run on thorium is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_wikistats-v2],Exec[git_pull_analytics.wikimedia.org] cole_white gerrit maintenance [17:16:16] RECOVERY - MariaDB Slave Lag: m3 on db2042 is OK: OK slave_sql_lag Replication lag: 0.48 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [17:16:50] RECOVERY - MariaDB Slave Lag: m3 on db2078 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [17:16:54] ACKNOWLEDGEMENT - puppet last run on netmon1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/netbox-reports] cole_white gerrit maintenance [17:16:54] ACKNOWLEDGEMENT - puppet last run on netmon2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/netbox-reports] cole_white gerrit maintenance [17:16:54] ACKNOWLEDGEMENT - puppet last run on releases2001 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 3 minutes ago with 3 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/tools/release],Exec[git_pull_operations/deployment-charts],Exec[git_pull_jenkins CI Composer] cole_white gerrit maintenance [17:16:54] ACKNOWLEDGEMENT - puppet last run on webperf1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[git_pull_operations/software/xhgui] cole_white gerrit maintenance [17:17:53] ACKNOWLEDGEMENT - puppet last run on notebook1004 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config],Exec[git_pull_mediawiki/event-schemas] cole_white gerrit maintenance [17:17:53] ACKNOWLEDGEMENT - puppet last run on releases1001 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 1 minute ago with 3 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/tools/release],Exec[git_pull_operations/deployment-charts],Exec[git_pull_jenkins CI Composer] cole_white gerrit maintenance [17:19:06] ACKNOWLEDGEMENT - puppet last run on kafka1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] cole_white gerrit maintenance [17:19:06] ACKNOWLEDGEMENT - puppet last run on webperf1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_performance/docroot] cole_white gerrit maintenance [17:19:43] ACKNOWLEDGEMENT - puppet last run on cumin2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/cookbooks] cole_white gerrit maintenance [17:21:08] ACKNOWLEDGEMENT - puppet last run on webperf2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_performance/docroot] cole_white gerrit maintenance [17:22:23] ACKNOWLEDGEMENT - puppet last run on db1125 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] cole_white gerrit maintenance [17:22:23] ACKNOWLEDGEMENT - puppet last run on stat1006 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 46 seconds ago with 4 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config],Exec[git_pull_mediawiki/event-schemas],Exec[git_pull_statistics_mediawiki],Exec[git_pull_analytics/reportupdater] cole_white gerrit maintenance [17:23:53] ACKNOWLEDGEMENT - puppet last run on stat1004 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 6 minutes ago with 3 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config],Exec[git_pull_mediawiki/event-schemas],Exec[git_pull_statistics_mediawiki] cole_white gerrit maintenance [17:27:28] PROBLEM - Long running screen/tmux on acmechief1001 is CRITICAL: CRIT: Long running SCREEN process. (user: vgutierrez PID: 13272, 1729843s 1728000s). [17:34:38] ACKNOWLEDGEMENT - puppet last run on vega is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 4 minutes ago with 6 failures. Failed resources (up to 3 shown): Exec[git_pull_wikimedia/annualreport],Exec[git_pull_wikimedia/TransparencyReport],Exec[git_pull_wikimedia/TransparencyReport-private],Exec[git_pull_wikibase/wikiba.se-deploy] cole_white gerrit maintenance [17:38:18] PROBLEM - puppet last run on bromine is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 2 minutes ago with 6 failures. Failed resources (up to 3 shown): Exec[git_pull_wikimedia/annualreport],Exec[git_pull_wikimedia/TransparencyReport],Exec[git_pull_wikimedia/TransparencyReport-private],Exec[git_pull_wikibase/wikiba.se-deploy] [17:39:42] ACKNOWLEDGEMENT - puppet last run on bromine is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 2 minutes ago with 6 failures. 
Failed resources (up to 3 shown): Exec[git_pull_wikimedia/annualreport],Exec[git_pull_wikimedia/TransparencyReport],Exec[git_pull_wikimedia/TransparencyReport-private],Exec[git_pull_wikibase/wikiba.se-deploy] cole_white gerrit maintenance [17:46:50] Just a sanity check - I got this warning for ssh/mosh. Is this expected? [17:46:51] The ECDSA host key for login.tools.wmflabs.org has changed, [17:46:51] and the key for the corresponding IP address 185.15.56.48 [17:48:33] fuzheado: thats correct [17:48:42] fuzheado: the host may have changed since you last connected [17:49:02] Zppix: I last connected... perhaps 5-6 days ago [17:49:23] fuzheado: check wiki tech for the most recent signatures [17:49:45] ok thanks, I'll go ahead and let known_hosts get updated [17:49:50] tools can have frequent host changes if things happen [17:50:36] Normally they are fairly stable, but during transition periods like the move to stretch things happen [17:52:17] fuzheado: https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/login.tools.wmflabs.org [17:52:38] Betacommand: excellent, thanks! [17:52:38] fuzheado: last update was about 10 days ago [17:53:01] I probably had long-standing mosh sessions opened so may have never hit this issue until now [17:53:05] fuzheado: https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints provides the fingerprints for all of cloud [18:00:05] Deploy window Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190318T1800) [18:00:05] bmansurov and davidwbarratt: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:10] here! [18:00:12] here [18:00:23] is a SWAT still happening with gerrit being down? [18:01:44] No [18:02:07] okie dokie I'll reschedule [18:02:12] What's happened with them? Any information? [18:08:07] Gerrit down again? 
[18:08:19] * hauskatze sees the topic - ack [18:14:25] Reedy: Do you know if there's an ETA on when Gerrit will be back? [18:14:41] Niharika: no immediate ETA at this point [18:14:49] Unfortunately not. Don't expect it during this SWAT window [18:15:05] Okay. Thanks volans. We're wondering if we will get a SWAT window before the train. [18:15:55] jouncebot: next [18:15:55] In 0 hour(s) and 44 minute(s): Wikimania scholarships app update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190318T1900) [18:17:00] Niharika: Well, if gerrit is down, there's no train either [18:17:02] So... :) [18:17:28] Yep. :) [18:17:31] doesn't mean it won't come back up immediately before the train without a preceding SWAT though [18:17:49] also today is Monday... there should be no train right? [18:18:04] at least I cannot find it in the deployments page :) [18:18:18] volans: TZ [18:18:27] It might be the last swat for some people before the train runs [18:19:32] there is another swat in ~4h:40 [18:19:36] but yeah agree [18:24:22] Reedy: no train today https://wikitech.wikimedia.org/wiki/Deployments [18:42:40] gerrit and phab appear to be down is this known [18:42:51] Zppix: Look at the topic, ffs [18:43:16] Reedy: I completely skipped that over when I read it sorry [18:54:40] Yeah it's easy to miss in some clients, it's super subtle in mine. tiny grey text :D. [19:00:04] Niharika: Dear deployers, time to do the Wikimania scholarships app update deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190318T1900). [19:01:00] ^^^ Is that happening given phab/Gerrit? [19:01:20] I would assume not given we can't even click the link to the commit being deployed. [19:02:02] but the deployer may have a copy of the commit, and if they're allowed to do so then maybe [19:10:23] No, no deployments right now. [19:23:15] this can take the place of 'I'm compiling' as an excuse... 
for now [19:23:20] but don't get used to it :-P [19:24:26] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 55, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:32:19] hi- https://phabricator.wikimedia.org/ seemed to be down? [19:32:27] Look at the topic [19:32:58] Okay not a problem [19:34:04] He's on a WikiBreak, why does he need phab? [19:35:15] he can still use phab if he wants [19:35:59] No he can't [19:36:02] It's not accessible [19:36:12] zing! [19:36:28] wikibreaks are only as effective as you make them :P [19:36:29] "wikibreak" [19:37:20] I prefer breakwiki [19:37:26] xD [19:37:50] you get a tshirt for that ;P [19:37:51] if you break the wikis and fix them afterwards you get a sticker, jouncebot dixit [19:38:04] vgutierrez: I have my t-shirt, and it's framed [19:38:05] :P [19:38:14] * vgutierrez jealous [19:38:39] "Destruí los wikis y todo lo que me dieron fue esta camiseta -y una carta de despido-" [19:39:30] vgutierrez: The framing cost... a lot more than than the t-shirt cost :) [19:40:36] Reedy: is it really framed? [19:40:41] Yup [19:40:45] awesome [19:40:57] I'll take a photo and stick it on commons later [19:41:10] lol please [19:43:01] nice [19:46:09] Reedy: amazing [19:56:18] !log bawolff@deploy1001 Synchronized wmf-config/wikitech.php: Adjust ldap config (duration: 00m 48s) [19:56:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:51] !log bawolff@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable block disables login on wikitech (duration: 00m 48s) [19:58:25] Reedy: that shirt was priceless! (or about $15 plus some share of the luggage costs to get it to SF from my house) [19:58:40] :D [19:59:03] So heads up, I comitted directly on deploy1001 since gerrit is down. 
That needs to be fixed once it's back up [20:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190318T2000). [20:00:29] no deployment of any kind [20:02:07] Reedy: I got a sticker. I guess for budget reasons :/ [20:03:49] Ugh, what's the license of the WP logo? [20:03:52] Can we just stop the bot until it needs to be back online? [20:03:59] No [20:04:15] Can I upload a photo of a t-shirt of it on... as CC-BY-SA-4.0? [20:04:20] It's going to get /ignore applied soon. [20:04:28] Reedy, should be fine [20:06:49] https://commons.wikimedia.org/wiki/File:Framed_%22I_BROKE_WIKIPEDIA..._THEN_I_FIXED_IT!%22_T-shirt.jpg [20:09:48] ohh yes! [20:22:35] !!! [20:26:08] PROBLEM - Check systemd state on phab1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:27:02] PROBLEM - PHD should be running on phab1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args php ./phd-daemon, UID = 498 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [20:27:22] ^ being worked on afaik [20:27:33] yes, ignore please [20:27:36] yep, should probably be ack'd [20:27:54] ACKNOWLEDGEMENT - Check systemd state on phab1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Muehlenhoff maintenance [20:27:54] ACKNOWLEDGEMENT - PHD should be running on phab1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args php ./phd-daemon, UID = 498 (phd) Muehlenhoff maintenance https://wikitech.wikimedia.org/wiki/Phabricator [20:27:58] done:-) [20:28:05] thanks! 
[20:33:22] RECOVERY - PHD should be running on phab1001 is OK: PROCS OK: 1 process with regex args php ./phd-daemon, UID = 498 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [20:34:12] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 1 process with UID = 498 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [20:34:50] hush you [20:35:19] !log restore phabricator and gerrit services [20:36:56] what's up with stashbot? [20:37:12] !log test [20:37:14] PROBLEM - PHD should be running on phab1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args php ./phd-daemon, UID = 498 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [20:37:21] bd808, ^ it didn't log bawolff's InitialiseSettings sync earlier either by the looks of the chat? [20:38:38] Krenair: phab was down so stashbot could not connect to it [20:38:50] oh right because stashbot logs there too [20:38:59] but then why did it work at 19:56 [20:39:27] maybe it broke after the first [20:39:40] RECOVERY - PyBal IPVS diff check on lvs1002 is OK: OK: no difference between hosts in IPVS/PyBal [20:39:46] RECOVERY - PyBal backends health check on lvs1005 is OK: PYBAL OK - All pools are healthy [20:42:56] it might require a restart. I saw it quit at some point [20:43:10] stashbot left the room (quit: Ping timeout: 244 seconds). 
[20:44:00] * bawolff goes (re)deploys https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/497339 [20:48:14] RECOVERY - puppet last run on notebook1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [20:48:49] !log bawolff@deploy1001 Synchronized wmf-config/wikitech.php: redeploy [[gerrit:97339]] (duration: 00m 47s) [20:50:38] !log bawolff@deploy1001 Synchronized wmf-config/InitialiseSettings.php: redeploy [[gerrit:97339]] (duration: 00m 47s) [20:51:40] RECOVERY - puppet last run on webperf2002 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [20:53:26] RECOVERY - puppet last run on kafka1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:54:05] Krenair, akosiaris: stashbot is logging errors about 500 responses when trying to write to the wiki [20:54:11] I'll try to look in a bit [20:54:46] RECOVERY - puppet last run on kafka2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:55:39] I did just muck about how logins work on wikitech... [20:57:14] Bawolff - Whats happening regarding task@phabricator.wikimedia.org emails sent during outage. [20:57:30] RhinosF1: In theory, they should have worked [20:57:53] Not seeing mine [20:57:54] Could be in a mail queue waiting a retry [20:57:56] RECOVERY - puppet last run on an-coord1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [20:57:58] Or you emailed in wrong [20:58:21] Or they could have just been broken in general. I doubt anyone actually uses that feature [20:58:39] Poor rendering on tools.wmflabs.org Is the header - that's the email I used. Only did it as phab was down. [20:59:03] The config may be broken [20:59:07] Email was rhinosf1@gmail.com if it helps [20:59:28] Seeing as phabs config changed but I never tested mailing when changing it [21:00:04] bawolff and Reedy: #bothumor Q:Why did functions stop calling each other? 
A:They had arguments. Rise for Weekly Security deployment window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190318T2100). [21:00:14] RECOVERY - puppet last run on sarin is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [21:00:15] jouncebot: go away [21:00:19] Could be but yeah, don't want to resubmit until were certain it's not there. [21:00:20] lol [21:00:23] RhinosF1: that should take a while to be fully restore [21:00:26] restored* [21:00:32] Just in time for me to not be deploying things! [21:00:39] Had enough fun for one day [21:00:43] !log Testing stashbot's ability to log events to wikitech [21:00:44] bd808: Failed to log message to wiki. Somebody should check the error logs. [21:00:54] they will be processed at some point, but for now it's disabled, will be reenabled soon hopefully [21:01:00] jouncebot: next [21:01:00] In 1 hour(s) and 58 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190318T2300) [21:01:11] akosiaris, worth just redoing via web [21:02:22] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [21:06:52] RECOVERY - puppet last run on bromine is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [21:10:36] PROBLEM - puppet last run on phab1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Service[phd] [21:14:23] I thought it was back [21:14:44] akosiaris, Did you miss my last message? [21:14:47] phd is a background processing service [21:15:21] You lot have had a busy day reedy [21:23:03] thanks for the fixes, folks! [21:23:54] moritzm: What is the current SRE policy for buster servers? I'm pretty sure you recently told me to not worry about buster-on-VMs until the official release, but now I'm coming across existing prod servers running Buster already. 
[21:27:40] RECOVERY - Long running screen/tmux on acmechief1001 is OK: OK: No SCREEN or tmux processes detected. [21:54:16] andrewbogott: a few internal ones is fine, but it's not yet ready for general consumption in VPS or Toolforge [21:55:04] Ah, ok — no danger of them showing up in toolforge :) [21:55:19] But I might enable them for deployment-prep so they can keep up with prod [21:55:35] yeah so acme-chief is running on buster in prod [21:55:39] and is about to become dependent on it [21:55:56] which is a problem for me trying to test+dev on it in deployment-prep [21:56:34] the stretch package for it is out of date too as prod only needs buster [21:59:08] moritzm, ^ [21:59:44] PROBLEM - Device not healthy -SMART- on db2052 is CRITICAL: cluster=mysql device=cciss,9 instance=db2052:9100 job=node site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2052&var-datasource=codfw+prometheus/ops [22:02:40] Krenair: I think it's fine to have an image, but not make it generally available, similar to how trusty instances were not selectable in VPS (despite being available if WMCS admins intervened), then it can be assigned for special cases like the acmechief deployment instance [22:02:59] the integration is more or less done [22:03:18] there might be some WMCS-specific bits not covered by the prod testing that happened so fae [22:03:25] (03PS2) 10CRusnov: Add LimitCORE support for uwsgi units. [puppet] - 10https://gerrit.wikimedia.org/r/493294 [22:03:32] if you run into anything, ping me and we'll figure it out [22:04:26] (03CR) 10CRusnov: Add LimitCORE support for uwsgi units. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/493294 (owner: 10CRusnov) [22:15:46] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 4 processes with UID = 498 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [22:18:27] !log gjg@phab1001:~$ sudo /srv/phab/phabricator/bin/auth strip --all-types --user Barras # per request/verification from foks [22:18:27] greg-g: Failed to log message to wiki. Somebody should check the error logs. [22:18:31] oh right [22:19:26] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [22:19:42] You lot are doing great today! [22:19:46] greg-g: https://tools.wmflabs.org/sal/log/AWmS4nCUA1BDhGjCXi6- -- only the wikitech save is busted at the moment [22:21:36] bd808: ack [22:22:13] In theory I can fix the wikitech log later. In theory ;) [22:23:55] note that wikibugs is having issues with the Phab parser so you won't get updates from phab tasks here today [22:23:55] heh [22:25:24] (03CR) 10BryanDavis: "> I ran some tests and wasn't able to produce any change in behavior" [puppet] - 10https://gerrit.wikimedia.org/r/496991 (https://phabricator.wikimedia.org/T217280) (owner: 10BryanDavis) [22:28:00] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 6 processes with UID = 498 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [22:28:32] RECOVERY - PHD should be running on phab1001 is OK: PROCS OK: 1 process with regex args php ./phd-daemon, UID = 498 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [22:29:38] hauskatze: Seems to work in -dev. 
[22:30:11] Niharika: I know the job kept respawning so maybe if fixed itself [22:31:06] it's being reporting everywhere now [22:31:07] great [22:33:18] 10Operations, 10Core Platform Team (PHP7 (TEC4)), 10Core Platform Team Kanban (Doing), 10HHVM, and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10bd808) [22:34:03] andre__: blocked [22:34:44] (03PS1) 10Bstorm: dumps distribution: fail over dumps to labstore1007 [dns] - 10https://gerrit.wikimedia.org/r/497420 (https://phabricator.wikimedia.org/T187960) [22:37:23] (03PS6) 10Andrew Bogott: Create an empty openstack::clientpackages::**::buster manifest [puppet] - 10https://gerrit.wikimedia.org/r/497286 (https://phabricator.wikimedia.org/T213546) (owner: 10Jbond) [22:42:43] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 1 process with UID = 498 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [22:43:27] (03CR) 10Andrew Bogott: [C: 03+2] Create an empty openstack::clientpackages::**::buster manifest [puppet] - 10https://gerrit.wikimedia.org/r/497286 (https://phabricator.wikimedia.org/T213546) (owner: 10Jbond) [22:44:04] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 4 processes with UID = 498 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [22:45:45] (03CR) 10Bstorm: [C: 03+2] dumps distribution: fail over dumps to labstore1007 [dns] - 10https://gerrit.wikimedia.org/r/497420 (https://phabricator.wikimedia.org/T187960) (owner: 10Bstorm) [22:46:50] (03PS1) 10CRusnov: Add configuration file for netbox reports [puppet] - 10https://gerrit.wikimedia.org/r/497421 [22:51:58] (03PS2) 10Bstorm: dumps distribution: failing over the web only for network changes [puppet] - 10https://gerrit.wikimedia.org/r/497319 (https://phabricator.wikimedia.org/T187960) [22:52:12] (03PS1) 10BryanDavis: wikitech: Use cn:caseExactMatch: as account search filter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497423 
(https://phabricator.wikimedia.org/T165795) [22:55:33] (03CR) 10Bstorm: [C: 03+2] dumps distribution: failing over the web only for network changes [puppet] - 10https://gerrit.wikimedia.org/r/497319 (https://phabricator.wikimedia.org/T187960) (owner: 10Bstorm) [22:57:17] (03CR) 10BryanDavis: [C: 04-2] "Self-blocking merge until I do a bit more testing in a controlled environment." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497423 (https://phabricator.wikimedia.org/T165795) (owner: 10BryanDavis) [22:57:39] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 498 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [22:57:55] grar [22:58:32] not actually looking into that... [23:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: That opportune time is upon us again. Time for a Evening SWAT (Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190318T2300). [23:00:04] davidwbarratt, bmansurov, and MaxSem: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:09] here! [23:00:09] here [23:00:10] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 3 processes with UID = 498 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [23:00:16] I'll do the deed [23:00:21] MaxSem thanks! 
[23:01:13] can we modify that icinga check to only fail if phd == 0 [23:01:30] (03PS2) 10MaxSem: Add 'suggestChangeOnLogin' flag to password policies for privileged groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496515 (owner: 10Tchanders) [23:01:35] (03CR) 10MaxSem: [C: 03+2] Add 'suggestChangeOnLogin' flag to password policies for privileged groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496515 (owner: 10Tchanders) [23:02:37] (03Merged) 10jenkins-bot: Add 'suggestChangeOnLogin' flag to password policies for privileged groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496515 (owner: 10Tchanders) [23:02:39] (03PS2) 10CRusnov: Add configuration file for netbox reports [puppet] - 10https://gerrit.wikimedia.org/r/497421 [23:04:21] davidwbarratt: ^ is live on mwdebug1002, please test [23:04:30] testing now [23:05:27] MaxSem it is perfect! [23:06:27] 10Operations, 10netops: eqiad - eqord Telia link down - IC-314533 - https://phabricator.wikimedia.org/T218307 (10ayounsi) Telia did a loop test facing eqiad and our light levels didn't change. While Telia still don't receive light. The culprit seems to be an active element somewhere on the cr2-eqiad (in DC6)<-... [23:07:21] (03PS3) 10CRusnov: Add configuration file for netbox reports [puppet] - 10https://gerrit.wikimedia.org/r/497421 (https://phabricator.wikimedia.org/T212526) [23:07:51] !log maxsem@deploy1001 Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/496515/ (duration: 00m 48s) [23:07:51] maxsem@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [23:08:16] davidwbarratt: ^ [23:08:20] (03CR) 10jerkins-bot: [V: 04-1] Add configuration file for netbox reports [puppet] - 10https://gerrit.wikimedia.org/r/497421 (https://phabricator.wikimedia.org/T212526) (owner: 10CRusnov) [23:08:34] MaxSem what does that mean? 
[23:08:46] test whole prod now ;) [23:08:49] oh [23:09:31] MaxSem looks perfect on prod [23:11:04] (03PS4) 10MaxSem: Enable partial blocks on Persian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497317 (https://phabricator.wikimedia.org/T218584) (owner: 10Tchanders) [23:11:05] (03CR) 10MaxSem: [C: 03+2] Enable partial blocks on Persian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497317 (https://phabricator.wikimedia.org/T218584) (owner: 10Tchanders) [23:11:37] (03Merged) 10jenkins-bot: Enable partial blocks on Persian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497317 (https://phabricator.wikimedia.org/T218584) (owner: 10Tchanders) [23:12:09] (03PS4) 10CRusnov: Add configuration file for netbox reports [puppet] - 10https://gerrit.wikimedia.org/r/497421 (https://phabricator.wikimedia.org/T212526) [23:12:30] davidwbarratt: not sure if this one can be tested, but here it's live on 1002 [23:12:44] (03PS5) 10CRusnov: Add configuration file for netbox reports [puppet] - 10https://gerrit.wikimedia.org/r/497421 (https://phabricator.wikimedia.org/T212526) [23:12:46] MaxSem yes it can, one moment [23:13:10] MaxSem looks perfect! [23:14:24] !log maxsem@deploy1001 Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/497317/ (duration: 00m 49s) [23:14:24] maxsem@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [23:14:33] davidwbarratt: still ok?^ [23:14:58] (03CR) 10CRusnov: "Build looks good:" [puppet] - 10https://gerrit.wikimedia.org/r/497421 (https://phabricator.wikimedia.org/T212526) (owner: 10CRusnov) [23:15:15] MaxSem yeah I just noticed they are missing some translations, but it looks good [23:15:55] i.e. 
https://fa.wikipedia.org/wiki/%D9%88%DB%8C%DA%98%D9%87:%D8%A8%D8%B3%D8%AA%D9%86_%D9%86%D8%B4%D8%A7%D9%86%DB%8C_%D8%A2%DB%8C%E2%80%8C%D9%BE%DB%8C [23:16:37] I don't think TWN updates made it through over the weekend [23:16:53] ACKNOWLEDGEMENT - Check systemd state on phab1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 20after4 phd is currently (partially) disabled [23:16:53] ACKNOWLEDGEMENT - puppet last run on phab1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 13 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[phd] 20after4 phd is currently (partially) disabled [23:17:33] MaxSem technically the only thing essential that is missing is "Sitewide" but "Partial" is there [23:19:25] bmansurov: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/494551/ needs a rebase [23:19:31] 1 sec [23:20:15] MaxSem ok talked to mooeypoo and Niharika and it's not ideal, but we'll buy it! [23:20:24] (03PS2) 10Bmansurov: Enable logging for CitationUsage and CitationUsagePageLoad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496857 (https://phabricator.wikimedia.org/T213969) [23:20:31] MaxSem: done [23:20:51] oh wait [23:21:07] yeah, wrong commit [23:21:11] (03CR) 10jerkins-bot: [V: 04-1] Enable logging for CitationUsage and CitationUsagePageLoad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496857 (https://phabricator.wikimedia.org/T213969) (owner: 10Bmansurov) [23:23:14] !log renumber Telia transit in eqsin [23:23:14] XioNoX: Failed to log message to wiki. Somebody should check the error logs. 
[23:23:22] (PS2) Bmansurov: Enable reader trust survey v2 [mediawiki-config] - https://gerrit.wikimedia.org/r/494551 (https://phabricator.wikimedia.org/T217576) [23:23:42] MaxSem: done [23:23:46] (CR) jerkins-bot: [V: -1] Enable reader trust survey v2 [mediawiki-config] - https://gerrit.wikimedia.org/r/494551 (https://phabricator.wikimedia.org/T217576) (owner: Bmansurov) [23:24:34] (CR) MaxSem: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/494551 (https://phabricator.wikimedia.org/T217576) (owner: Bmansurov) [23:24:34] PROBLEM - puppet last run on webperf2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/xhgui] [23:24:57] (CR) jerkins-bot: [V: -1] Enable reader trust survey v2 [mediawiki-config] - https://gerrit.wikimedia.org/r/494551 (https://phabricator.wikimedia.org/T217576) (owner: Bmansurov) [23:25:31] !log running puppet on phab1001 to get out of degraded state [23:25:32] twentyafterfour: Failed to log message to wiki. Somebody should check the error logs. [23:25:56] RECOVERY - Check systemd state on phab1001 is OK: OK - running: The system is fully operational [23:26:02] RECOVERY - puppet last run on phab1001 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [23:26:06] PROBLEM - puppet last run on db2094 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures.
Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [23:26:39] (CR) Bmansurov: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/494551 (https://phabricator.wikimedia.org/T217576) (owner: Bmansurov) [23:27:03] (CR) jerkins-bot: [V: -1] Enable reader trust survey v2 [mediawiki-config] - https://gerrit.wikimedia.org/r/494551 (https://phabricator.wikimedia.org/T217576) (owner: Bmansurov) [23:27:18] MaxSem: seems unrelated [23:27:42] * MaxSem scratches head [23:29:31] hrm, I wonder if that's related to something I did just now :\ [23:29:33] * thcipriani reverts [23:29:34] I was about to ping you [23:29:35] Haha [23:29:40] PROBLEM - HHVM rendering on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:29:52] MaxSem thanks! [23:30:50] PROBLEM - puppet last run on an-coord1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [23:31:09] reverted now [23:31:36] (CR) Reedy: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/494551 (https://phabricator.wikimedia.org/T217576) (owner: Bmansurov) [23:31:44] RECOVERY - HHVM rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 80815 bytes in 0.301 second response time https://wikitech.wikimedia.org/wiki/Application_servers [23:32:34] PROBLEM - puppet last run on cobalt is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_All-Avatars] [23:32:45] thcipriani: ^ lol, breaking all of the things [23:32:50] PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/xhgui] [23:32:57] oh good.
[23:33:08] !log maxsem@deploy1001 Synchronized php-1.33.0-wmf.21/includes/EditPage.php: https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/497347/ (duration: 00m 49s) [23:33:09] maxsem@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [23:33:10] should recover next run at any rate :\ [23:33:28] PROBLEM - puppet last run on sarin is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/cookbooks] [23:33:52] * Reedy grins [23:34:32] PROBLEM - puppet last run on releases2001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/deployment-charts],Exec[git_pull_jenkins CI Composer] [23:35:16] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_jenkins CI slave scripts] [23:35:48] MaxSem: looks good to go [23:36:06] PROBLEM - puppet last run on vega is CRITICAL: CRITICAL: Puppet has 7 failures. Last run 6 minutes ago with 7 failures. Failed resources (up to 3 shown): Exec[git_pull_wikimedia/annualreport],Exec[git_pull_wikimedia/TransparencyReport],Exec[git_pull_wikibase/wikiba.se-deploy],Exec[git_pull_research/landing-page] [23:36:34] PROBLEM - puppet last run on labsdb1012 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [23:37:38] (CR) MaxSem: [C: +2] Enable reader trust survey v2 [mediawiki-config] - https://gerrit.wikimedia.org/r/494551 (https://phabricator.wikimedia.org/T217576) (owner: Bmansurov) [23:38:26] ohh [23:38:38] (Merged) jenkins-bot: Enable reader trust survey v2 [mediawiki-config] - https://gerrit.wikimedia.org/r/494551 (https://phabricator.wikimedia.org/T217576) (owner: Bmansurov) [23:39:58] * paladox knows the correct fix now [23:40:21] Anyone knows if we have a list of all the wikis on the beta cluster? [23:40:31] Niharika: https://noc.wikimedia.org/conf/highlight.php?file=dblists/all-labs.dblist [23:40:44] Reedy: Thank you! [23:41:28] MaxSem: Got much more to deploy? [23:42:08] Reedy: that's the last one [23:42:29] bmansurov: pulled on mwdebug1002 [23:42:30] cool, I'll start merging a core patch then [23:42:39] MaxSem: testing [23:43:58] MaxSem: looks good, please deploy everywhere [23:45:33] !log maxsem@deploy1001 Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/494551/ (duration: 00m 48s) [23:45:33] maxsem@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [23:45:41] bmansurov: ^ [23:45:53] MaxSem: thank you [23:47:03] Reedy: all yours [23:47:25] cheers [23:47:27] waiting on jerkins [23:48:47] Actually, I have another change in mind, can deploy it after yours or something [23:49:01] Is it a mw-config patch or something else?
[23:49:10] (PS2) MaxSem: Remove $wgMediaInTargetLanguage, matches the MW default now [mediawiki-config] - https://gerrit.wikimedia.org/r/495408 [23:49:14] ^^^ [23:49:25] CR+2 it, see which gets merged first :P [23:49:42] (CR) MaxSem: [C: +2] Remove $wgMediaInTargetLanguage, matches the MW default now [mediawiki-config] - https://gerrit.wikimedia.org/r/495408 (owner: MaxSem) [23:50:55] (Merged) jenkins-bot: Remove $wgMediaInTargetLanguage, matches the MW default now [mediawiki-config] - https://gerrit.wikimedia.org/r/495408 (owner: MaxSem) [23:51:06] I WON [23:53:16] * Reedy is still waiting [23:54:21] !log maxsem@deploy1001 Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/494551/ (duration: 00m 49s) [23:54:21] maxsem@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [23:54:28] Reedy, I'm done [23:54:38] jerkins is not [23:55:47] (CR) BryanDavis: "Works as expected in local testing." [mediawiki-config] - https://gerrit.wikimedia.org/r/497423 (https://phabricator.wikimedia.org/T165795) (owner: BryanDavis) [23:56:06] (CR) BryanDavis: [C: -2] "Putting my self -2 block back" [mediawiki-config] - https://gerrit.wikimedia.org/r/497423 (https://phabricator.wikimedia.org/T165795) (owner: BryanDavis) [23:56:13] RECOVERY - puppet last run on an-coord1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:56:37] RECOVERY - puppet last run on db2094 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [23:58:05] RECOVERY - puppet last run on cobalt is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:59:01] RECOVERY - puppet last run on sarin is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures