[00:04:36] Operations, Wikimedia-Mailing-lists: Create arbcom-ru@wikimedia.org - https://phabricator.wikimedia.org/T262525 (Dzahn) thanks @eross :) @Adamant.pwn After you mailed techsupport@ you will have created a new ticket for this in another system. Given that is it ok with you if we close the one over here?...
[00:05:16] Operations, Wikimedia-Mailing-lists: Create arbcom-ru@wikimedia.org - https://phabricator.wikimedia.org/T262525 (Dzahn) Open→Stalled p:Triage→Medium
[00:08:18] (PS6) CDanis: WIP: serve NEL headers on group0 [puppet] - https://gerrit.wikimedia.org/r/627629
[00:11:13] (PS7) CDanis: WIP: serve NEL headers on group0 [puppet] - https://gerrit.wikimedia.org/r/627629
[00:14:36] RECOVERY - MariaDB Replica Lag: m2 on db2133 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:25:39] (PS8) CDanis: WIP: serve NEL headers on group0 [puppet] - https://gerrit.wikimedia.org/r/627629
[00:49:54] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:59:56] PROBLEM - dump of m2 in eqiad on icinga1001 is CRITICAL: dump for m2 at eqiad taken more than 8 days ago: Most recent backup 2020-09-08 00:46:58 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[01:06:09] Operations, LDAP-Access-Requests: Add Bereket teshome to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T262921 (KFrancis) >>! In T262921#6462896, @RLazarus wrote: > Hello Bereket! I can handle the LDAP changes for you. > > @KFrancis Could you set up Bereket with an NDA please, the...
[02:07:22] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:08:42] PROBLEM - Check the last execution of package_builder_Clean_up_build_directory on deneb is CRITICAL: CRITICAL: Status of the systemd unit package_builder_Clean_up_build_directory https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:22:19] !log deneb - sudo systemctl start package_builder_Clean_up_build_directory to fix icinga alert after failed build attempts
[02:22:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:01:17] Operations, ops-eqiad, DC-Ops: Netbox report accounting icinga alert - https://phabricator.wikimedia.org/T250053 (wiki_willy) Thanks @ayounsi, much appreciated for creating T262898. Glad the Netbox graphs won't be too hard to generate. Also, just a follow-up from my action item earlier, I chatted wi...
[03:19:24] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:44:54] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:37:52] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 3 (etcd1001, ...), Fresh: 101 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[04:45:10] Operations, Analytics, Domains, Traffic, Wikimedia-General-or-Unknown: Blocking all third-party storage access requests - https://phabricator.wikimedia.org/T262996 (RolandUnger)
[04:46:37] Operations, Analytics, Domains, Traffic, Wikimedia-General-or-Unknown: WMF third-party cookies rejected - https://phabricator.wikimedia.org/T262882 (RolandUnger)
[04:57:25] Operations, ops-eqiad, DBA, DC-Ops: dbproxy1020 PS disconnected - https://phabricator.wikimedia.org/T262998 (Marostegui)
[04:57:42] Operations, ops-eqiad, DBA, DC-Ops: dbproxy1020 PS disconnected - https://phabricator.wikimedia.org/T262998 (Marostegui) p:Triage→Medium
[04:59:19] (PS1) Marostegui: Revert "dbproxy: Depool labsdb1010" [puppet] - https://gerrit.wikimedia.org/r/627610
[05:05:49] (CR) Marostegui: [C: +2] Revert "dbproxy: Depool labsdb1010" [puppet] - https://gerrit.wikimedia.org/r/627610 (owner: Marostegui)
[05:07:29] !log Repool labsdb1010
[05:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:09:06] Operations, ops-eqiad, DC-Ops: New Date - Tue, Sept 15 PDU Upgrade 12pm-4pm UTC- Racks C2 and C3 - https://phabricator.wikimedia.org/T261455 (Marostegui)
[05:11:12] Operations, ops-eqiad, DC-Ops: Tue, Sept 8 PDU Upgrade 12pm-4pm UTC- Racks D3 and D4 - https://phabricator.wikimedia.org/T261452 (Marostegui)
[05:19:01] (Abandoned) Giuseppe Lavagetto: mobileapps: use a non-retry, long-lasting restbase endpoint [deployment-charts] - https://gerrit.wikimedia.org/r/626159 (owner:
Giuseppe Lavagetto)
[05:22:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool es2011 and es2017 after cloning es2027 and es2028', diff saved to https://phabricator.wikimedia.org/P12592 and previous config saved to /var/cache/conftool/dbconfig/20200916-052241-marostegui.json
[05:22:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:23:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1087 into vslow', diff saved to https://phabricator.wikimedia.org/P12593 and previous config saved to /var/cache/conftool/dbconfig/20200916-052343-marostegui.json
[05:23:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:24:29] (CR) Marostegui: [C: +1] mariadb: Use labsdb mysql config group for both labsdb and clouddb hosts [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/627502 (owner: Jcrespo)
[05:26:07] (PS1) Marostegui: db1122: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/627642
[05:28:18] (CR) Marostegui: [C: +2] db1122: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/627642 (owner: Marostegui)
[05:32:17] (PS1) Marostegui: instances.yaml: Add es2027, es2028 [puppet] - https://gerrit.wikimedia.org/r/627643 (https://phabricator.wikimedia.org/T261717)
[05:35:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool es2011 and es2017 after cloning es2027 and es2028', diff saved to https://phabricator.wikimedia.org/P12594 and previous config saved to /var/cache/conftool/dbconfig/20200916-053507-marostegui.json
[05:35:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:35:13] (CR) Marostegui: [C: +2] instances.yaml: Add es2027, es2028 [puppet] - https://gerrit.wikimedia.org/r/627643 (https://phabricator.wikimedia.org/T261717) (owner: Marostegui)
[05:38:48] Operations, ops-codfw: ps1-a8-codfw WebUI unresponsive - https://phabricator.wikimedia.org/T263001 (ayounsi)
p:Triage→Medium
[05:39:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add es2027 and es2028 to dbctl T261717', diff saved to https://phabricator.wikimedia.org/P12595 and previous config saved to /var/cache/conftool/dbconfig/20200916-053918-marostegui.json
[05:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:39:26] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[05:40:19] (CR) Marostegui: [C: +1] "Thank you!" [dns] - https://gerrit.wikimedia.org/r/627518 (https://phabricator.wikimedia.org/T244153) (owner: Volans)
[05:40:20] ACKNOWLEDGEMENT - ps1-a8-codfw-infeed-load-tower-A-phase-X on ps1-a8-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi https://phabricator.wikimedia.org/T263001 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:40:20] ACKNOWLEDGEMENT - ps1-a8-codfw-infeed-load-tower-A-phase-Y on ps1-a8-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi https://phabricator.wikimedia.org/T263001 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:40:20] ACKNOWLEDGEMENT - ps1-a8-codfw-infeed-load-tower-A-phase-Z on ps1-a8-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi https://phabricator.wikimedia.org/T263001 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:40:20] ACKNOWLEDGEMENT - ps1-a8-codfw-infeed-load-tower-B-phase-X on ps1-a8-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi https://phabricator.wikimedia.org/T263001 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:40:20] ACKNOWLEDGEMENT - ps1-a8-codfw-infeed-load-tower-B-phase-Y on ps1-a8-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi https://phabricator.wikimedia.org/T263001
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:40:20] ACKNOWLEDGEMENT - ps1-a8-codfw-infeed-load-tower-B-phase-Z on ps1-a8-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi https://phabricator.wikimedia.org/T263001 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:43:22] (PS1) Marostegui: es2027,es2028: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/627644 (https://phabricator.wikimedia.org/T261717)
[05:43:57] (CR) Marostegui: [C: +2] es2027,es2028: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/627644 (https://phabricator.wikimedia.org/T261717) (owner: Marostegui)
[05:50:49] !log msw1-codfw> request system snapshot slice alternate - T262290
[05:50:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:50:55] T262290: Audit Juniper EX snapshots version - https://phabricator.wikimedia.org/T262290
[05:51:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool es2011 and es2017 after cloning es2027 and es2028', diff saved to https://phabricator.wikimedia.org/P12596 and previous config saved to /var/cache/conftool/dbconfig/20200916-055108-marostegui.json
[05:51:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:51:34] Operations, netops: Audit Juniper EX snapshots version - https://phabricator.wikimedia.org/T262290 (ayounsi)
[05:53:06] (PS5) KartikMistry: Update cxserver to 2020-08-30-011854-production [deployment-charts] - https://gerrit.wikimedia.org/r/623475 (https://phabricator.wikimedia.org/T253439)
[05:53:49] !log asw2-a-eqiad> request system snapshot slice alternate all-members - T262290
[05:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:55:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es2015 to clone es2031 T261717', diff saved to https://phabricator.wikimedia.org/P12597
and previous config saved to /var/cache/conftool/dbconfig/20200916-055535-marostegui.json
[05:55:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:55:43] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[05:57:51] (PS1) Marostegui: mariadb: Productionize es2031 [puppet] - https://gerrit.wikimedia.org/r/627645 (https://phabricator.wikimedia.org/T261717)
[05:58:55] (CR) Marostegui: [C: +2] mariadb: Productionize es2031 [puppet] - https://gerrit.wikimedia.org/r/627645 (https://phabricator.wikimedia.org/T261717) (owner: Marostegui)
[06:03:54] (PS2) Volans: wmcs: remove old unused records [dns] - https://gerrit.wikimedia.org/r/627442 (https://phabricator.wikimedia.org/T262863)
[06:04:20] I plan to update cxserver (minor changes).
[06:04:40] Operations, Wikidata, Wikimedia-Mailing-lists: Stop archiving the wikidata-bugs mailinglist in pipermail - https://phabricator.wikimedia.org/T262773 (Ladsgroup) What about pywikibugs-l? I assume it's also pretty big and people can just search in phabricator instead.
[06:05:10] (CR) KartikMistry: [C: +2] Update cxserver to 2020-08-30-011854-production [deployment-charts] - https://gerrit.wikimedia.org/r/623475 (https://phabricator.wikimedia.org/T253439) (owner: KartikMistry)
[06:06:20] (Merged) jenkins-bot: Update cxserver to 2020-08-30-011854-production [deployment-charts] - https://gerrit.wikimedia.org/r/623475 (https://phabricator.wikimedia.org/T253439) (owner: KartikMistry)
[06:07:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool es2011 and es2017 after cloning es2027 and es2028', diff saved to https://phabricator.wikimedia.org/P12598 and previous config saved to /var/cache/conftool/dbconfig/20200916-060717-marostegui.json
[06:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:08:25] !log kartik@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'staging' .
[06:08:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:10:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool es2027 and es2028 for the first time with minimum weight T261717', diff saved to https://phabricator.wikimedia.org/P12599 and previous config saved to /var/cache/conftool/dbconfig/20200916-061013-marostegui.json
[06:10:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:10:21] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[06:11:22] (CR) Volans: [C: +2] wmcs: remove old unused records [dns] - https://gerrit.wikimedia.org/r/627442 (https://phabricator.wikimedia.org/T262863) (owner: Volans)
[06:11:45] (PS2) Volans: databases: remove leftover records from old hosts [dns] - https://gerrit.wikimedia.org/r/627518 (https://phabricator.wikimedia.org/T244153)
[06:11:49] !log kartik@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'cxserver' for release 'production' .
[06:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:12:16] (CR) Volans: [C: +2] databases: remove leftover records from old hosts [dns] - https://gerrit.wikimedia.org/r/627518 (https://phabricator.wikimedia.org/T244153) (owner: Volans)
[06:12:41] (PS2) Volans: misc: remove leftover records for old hosts [dns] - https://gerrit.wikimedia.org/r/627523 (https://phabricator.wikimedia.org/T244153)
[06:13:15] (CR) Volans: [C: +2] misc: remove leftover records for old hosts [dns] - https://gerrit.wikimedia.org/r/627523 (https://phabricator.wikimedia.org/T244153) (owner: Volans)
[06:15:56] !log kartik@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'cxserver' for release 'production' .
[06:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:25] !log asw2-b-eqiad> request system snapshot slice alternate all-members - T262290
[06:20:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:31] T262290: Audit Juniper EX snapshots version - https://phabricator.wikimedia.org/T262290
[06:20:44] !log Updated cxserver to 2020-08-30-011854-production (T253439, T260557)
[06:20:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:51] T260557: Add lldwiki to cxserver - https://phabricator.wikimedia.org/T260557
[06:20:52] T253439: Eliminate the toil in WMF wiki creation - https://phabricator.wikimedia.org/T253439
[06:28:41] !log codfw-prod: bump weight for ms-be2057 - T261633
[06:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:28:49] T261633: Put ms-be2057 (Dell R740xd2) in service - https://phabricator.wikimedia.org/T261633
[06:35:12] (CR) Filippo Giunchedi: [C: +2] rsync: handle quickdatacopy cron cleanup when flipping source/dest [puppet] - https://gerrit.wikimedia.org/r/627531 (owner: Filippo Giunchedi)
[06:37:41] (PS2) Filippo Giunchedi: prometheus: use
prometheus-icinga-am to send alerts [puppet] - https://gerrit.wikimedia.org/r/627500 (https://phabricator.wikimedia.org/T258948)
[06:42:33] Operations, Traffic, netops: Wikimedia projects not reachable for some Telecom Italia users - https://phabricator.wikimedia.org/T262869 (Nemo_bis) >>! In T262869#6464839, @Andyrom75 wrote: > ~15min ago connection has been restored. I'll test it again tomorrow. [A blog](https://www.optimagazine.com/2...
[06:42:47] (CR) Filippo Giunchedi: [C: +2] prometheus: use prometheus-icinga-am to send alerts [puppet] - https://gerrit.wikimedia.org/r/627500 (https://phabricator.wikimedia.org/T258948) (owner: Filippo Giunchedi)
[06:46:19] (PS1) Jforrester: Fix failure of rebuildLocalisationCache.php due to RL hook [core] (wmf/1.36.0-wmf.9) - https://gerrit.wikimedia.org/r/627617 (https://phabricator.wikimedia.org/T262900)
[06:55:09] (CR) Volans: [C: -1] "Small error inline, and let's find a better solution for the urllib ignore warning" (4 comments) [cookbooks] - https://gerrit.wikimedia.org/r/627272 (owner: Jbond)
[06:55:42] (PS1) Jcrespo: backups: Ignore failures on backing up etcd1* hosts [puppet] - https://gerrit.wikimedia.org/r/627647 (https://phabricator.wikimedia.org/T239835)
[06:57:31] (CR) Jcrespo: [C: +2] backups: Ignore failures on backing up etcd1* hosts [puppet] - https://gerrit.wikimedia.org/r/627647 (https://phabricator.wikimedia.org/T239835) (owner: Jcrespo)
[06:58:50] Operations, ops-codfw, DBA, Patch-For-Review, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Marostegui) - 22nd Jul 2019 server delivered to the DC - 18th Aug 2020: server crashes with the following errors - server totally unresponsive...
[07:00:20] (PS1) Jcrespo: Revert "backups: Ignore failures on backing up etcd1* hosts" [puppet] - https://gerrit.wikimedia.org/r/627619 (https://phabricator.wikimedia.org/T239835)
[07:01:29] (CR) Alexandros Kosiaris: [V: +2 C: +2] Release 1.0.16. First version that only support 6.0.x [software/otrs] - https://gerrit.wikimedia.org/r/617381 (https://phabricator.wikimedia.org/T187984) (owner: Alexandros Kosiaris)
[07:02:11] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 101 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[07:02:30] ^yay
[07:02:36] !log T187984 migration script done. Config updates, rebuilds, package upgrades/reinstall and index rebuilds done
[07:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:44] T187984: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984
[07:02:56] let me check lag on other m2 instances
[07:03:02] !log T187984 validated that the OTRS installation is functional over SSH
[07:03:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:03:52] still high on db2078, which makes sense as it has to be applied to 2 places before changes reaching to it
[07:05:07] (PS1) Muehlenhoff: Remove LDAP access for kuz [puppet] - https://gerrit.wikimedia.org/r/627648
[07:07:01] (PS1) Filippo Giunchedi: prometheus: add port to alertmanager url [puppet] - https://gerrit.wikimedia.org/r/627677 (https://phabricator.wikimedia.org/T258948)
[07:08:02] (CR) jerkins-bot: [V: -1] prometheus: add port to alertmanager url [puppet] - https://gerrit.wikimedia.org/r/627677 (https://phabricator.wikimedia.org/T258948) (owner: Filippo Giunchedi)
[07:09:45] (PS2) Filippo Giunchedi: prometheus: add port to alertmanager url [puppet] - https://gerrit.wikimedia.org/r/627677 (https://phabricator.wikimedia.org/T258948)
[07:10:21] (CR) Muehlenhoff: [C: +2] Remove LDAP access for kuz [puppet] -
https://gerrit.wikimedia.org/r/627648 (owner: Muehlenhoff)
[07:11:43] (CR) Filippo Giunchedi: [C: +2] "PCC https://puppet-compiler.wmflabs.org/compiler1002/25113/icinga1001.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/627677 (https://phabricator.wikimedia.org/T258948) (owner: Filippo Giunchedi)
[07:12:09] moritzm: merged your change as well
[07:12:37] !log T187984 Disable gravatar in system configuration to avoid leaking agent PII through a 3rd party service
[07:12:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:43] T187984: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984
[07:13:36] ack, thx
[07:14:17] (PS1) Marostegui: mariadb: Disable notifications on masters [puppet] - https://gerrit.wikimedia.org/r/627737 (https://phabricator.wikimedia.org/T261454)
[07:15:23] (PS2) Marostegui: mariadb: Disable notifications on masters [puppet] - https://gerrit.wikimedia.org/r/627737 (https://phabricator.wikimedia.org/T261454)
[07:17:01] (CR) Marostegui: [C: +2] mariadb: Disable notifications on masters [puppet] - https://gerrit.wikimedia.org/r/627737 (https://phabricator.wikimedia.org/T261454) (owner: Marostegui)
[07:21:45] (CR) Hashar: [C: +1] profile::scap::dsh: replace hiera() with lookup() [puppet] - https://gerrit.wikimedia.org/r/627391 (owner: Dzahn)
[07:21:57] !log jayme@cumin1001 START - Cookbook sre.hosts.decommission
[07:22:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:27] !log jayme@cumin1001 END (ERROR) - Cookbook sre.hosts.decommission (exit_code=97)
[07:22:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:38] (CR) Hashar: "Sorry I have missed this change or I would have reused it :\" [puppet] - https://gerrit.wikimedia.org/r/626260 (https://phabricator.wikimedia.org/T262244) (owner: Jbond)
[07:22:46] !log jayme@cumin1001 START -
Cookbook sre.hosts.decommission
[07:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:45] (CR) Alexandros Kosiaris: [C: +2] exim: Switch OTRS exim to otrs1001 [puppet] - https://gerrit.wikimedia.org/r/626630 (https://phabricator.wikimedia.org/T187984) (owner: Alexandros Kosiaris)
[07:26:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool es2027 and es2028 with more weight T261717', diff saved to https://phabricator.wikimedia.org/P12600 and previous config saved to /var/cache/conftool/dbconfig/20200916-072614-marostegui.json
[07:26:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:21] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[07:26:22] !log T187984 Tested outbound email, switching inbound email configuration and performing tests
[07:26:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:27] T187984: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984
[07:26:35] (PS8) Muehlenhoff: Add CAS-enabled vhost for editors/admins [puppet] - https://gerrit.wikimedia.org/r/626627 (https://phabricator.wikimedia.org/T262512)
[07:28:38] liw: I am rolling the hotfix for the train blocker ( https://gerrit.wikimedia.org/r/c/mediawiki/core/+/627617 )
[07:29:11] (PS2) Jcrespo: Revert "mariadb-backups: Temporarilly disable logical backups of m2 on eqiad" [puppet] - https://gerrit.wikimedia.org/r/626669 (https://phabricator.wikimedia.org/T187984)
[07:29:17] (CR) Hashar: [C: +2] "Thank you so much! I will deploy this on the deployment server."
[core] (wmf/1.36.0-wmf.9) - https://gerrit.wikimedia.org/r/627617 (https://phabricator.wikimedia.org/T262900) (owner: Jforrester)
[07:29:41] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[07:29:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:58] (PS5) JMeybohm: Remove etcd100[123] hosts [puppet] - https://gerrit.wikimedia.org/r/626274 (https://phabricator.wikimedia.org/T239835)
[07:36:39] Operations, OTRS, serviceops, Patch-For-Review, User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (akosiaris) I 'll split this off in its own task, but worthy to point out in order not to forget it. Znuny's QuickClose package seems to be...
[07:37:19] !log T187984 Tested inbound email successfully
[07:37:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:26] T187984: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984
[07:37:46] (PS2) Alexandros Kosiaris: Switch ticket.discovery.wmnet to otrs1001 [dns] - https://gerrit.wikimedia.org/r/626629 (https://phabricator.wikimedia.org/T187984)
[07:38:03] (CR) Muehlenhoff: [C: +2] Add CAS-enabled vhost for editors/admins [puppet] - https://gerrit.wikimedia.org/r/626627 (https://phabricator.wikimedia.org/T262512) (owner: Muehlenhoff)
[07:40:46] !log asw2-c-eqiad> request system snapshot slice alternate all-members - T262290
[07:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:52] T262290: Audit Juniper EX snapshots version - https://phabricator.wikimedia.org/T262290
[07:41:07] (PS1) Giuseppe Lavagetto: Use a separate HELM_HOME for each helmfile run [deployment-charts] - https://gerrit.wikimedia.org/r/627741 (https://phabricator.wikimedia.org/T261313)
[07:42:18] (PS1) Muehlenhoff: Enable CAS on grafana2001 [puppet] - https://gerrit.wikimedia.org/r/627742
[07:44:02] !log jayme@cumin1001 START - Cookbook sre.dns.netbox
[07:44:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:18] (PS3) JMeybohm: Remove etcd100[123] hosts [dns] - https://gerrit.wikimedia.org/r/626337 (https://phabricator.wikimedia.org/T239835)
[07:48:50] (CR) Jcrespo: [C: +2] Revert "mariadb-backups: Temporarilly disable logical backups of m2 on eqiad" [puppet] - https://gerrit.wikimedia.org/r/626669 (https://phabricator.wikimedia.org/T187984) (owner: Jcrespo)
[07:48:58] !log jayme@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:22] (CR) Alexandros Kosiaris: [C: +2] Switch ticket.discovery.wmnet to otrs1001 [dns] - https://gerrit.wikimedia.org/r/626629 (https://phabricator.wikimedia.org/T187984) (owner: Alexandros Kosiaris)
[07:49:27] (Merged) jenkins-bot: Fix failure of rebuildLocalisationCache.php due to RL hook [core] (wmf/1.36.0-wmf.9) - https://gerrit.wikimedia.org/r/627617 (https://phabricator.wikimedia.org/T262900) (owner: Jforrester)
[07:49:56] !log T187984 Switch over ticket.discovery.wmnet to otrs1001
[07:50:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:02] T187984: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984
[07:50:34] (PS1) Nikerabbit: Enable Special:TranslationStats [mediawiki-config] - https://gerrit.wikimedia.org/r/627744 (https://phabricator.wikimedia.org/T263004)
[07:52:28] (PS4) JMeybohm: Remove etcd100[123] hosts [dns] - https://gerrit.wikimedia.org/r/626337 (https://phabricator.wikimedia.org/T239835)
[07:54:39] (CR) JMeybohm: [C: +2] Remove etcd100[123] hosts [puppet] - https://gerrit.wikimedia.org/r/626274 (https://phabricator.wikimedia.org/T239835) (owner: JMeybohm)
[07:54:46] (CR) JMeybohm: [C: +2] Revert "backups: Ignore failures on backing
up etcd1* hosts" [puppet] - https://gerrit.wikimedia.org/r/627619 (https://phabricator.wikimedia.org/T239835) (owner: Jcrespo)
[07:56:27] (CR) Volans: [C: +1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/626337 (https://phabricator.wikimedia.org/T239835) (owner: JMeybohm)
[07:57:16] (PS1) Jcrespo: mariadb: Incrase max_allowed_packet to 64MB on all generic misc dbs [puppet] - https://gerrit.wikimedia.org/r/627745 (https://phabricator.wikimedia.org/T187984)
[07:58:25] Operations, Gerrit, Release-Engineering-Team-TODO, Traffic, and 2 others: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (hashar) Open→Declined The summary is: We can't use the third party service gravatar.com since that leaks personal information to a third party....
[07:59:39] (CR) Jcrespo: "This should be deployed from replicas -> primary db in that order to prevent replication issues (I think). I think it doesn't require rest" [puppet] - https://gerrit.wikimedia.org/r/627745 (https://phabricator.wikimedia.org/T187984) (owner: Jcrespo)
[07:59:57] (CR) JMeybohm: [C: +2] Remove etcd100[123] hosts [dns] - https://gerrit.wikimedia.org/r/626337 (https://phabricator.wikimedia.org/T239835) (owner: JMeybohm)
[08:01:39] akosiaris: thanks for your work on the OTRS migration!
[08:02:07] Urbanecm: :-)
[08:02:11] !log asw2-d-eqiad> request system snapshot slice alternate all-members - T262290
[08:02:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:02:18] it looks like it has gone well, so I am happy
[08:02:18] T262290: Audit Juniper EX snapshots version - https://phabricator.wikimedia.org/T262290
[08:02:53] yup yup akosiaris :)
[08:03:10] I'm able to log in and read emails, which is good :-)
[08:04:01] !log T187984 Validated that ticket.wikimedia.org works, proceeding with a wider announcement
[08:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:07] T187984: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984
[08:05:50] akosiaris: congratulations!
[08:06:46] (PS2) Muehlenhoff: Enable CAS on grafana2001 [puppet] - https://gerrit.wikimedia.org/r/627742
[08:07:22] (CR) Kormat: mariadb: Use labsdb mysql config group for both labsdb and clouddb hosts (1 comment) [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/627502 (owner: Jcrespo)
[08:08:04] Operations, OTRS: Quick Close menu items appear multiple times in Ticket Zoom - https://phabricator.wikimedia.org/T263005 (akosiaris)
[08:08:18] Operations, Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (JMeybohm)
[08:08:20] Operations, OTRS: Quick Close menu items appear multiple times in Ticket Zoom - https://phabricator.wikimedia.org/T263005 (akosiaris) p:Triage→Medium
[08:08:38] (CR) Jcrespo: mariadb: Use labsdb mysql config group for both labsdb and clouddb hosts (1 comment) [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/627502 (owner: Jcrespo)
[08:09:19] (PS2) Jcrespo: mariadb: Use labsdb mysql config group for both labsdb and clouddb hosts [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/627502
[08:10:14] (CR) Jcrespo: "We can clean up and
move the group prefix to clouddb when there is no more labsdb hosts." [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/627502 (owner: Jcrespo)
[08:10:19] (CR) Giuseppe Lavagetto: prometheus: Scrape k8s etcd nodes (1 comment) [puppet] - https://gerrit.wikimedia.org/r/627439 (owner: Alexandros Kosiaris)
[08:10:35] (CR) Jcrespo: mariadb: Use labsdb mysql config group for both labsdb and clouddb hosts (1 comment) [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/627502 (owner: Jcrespo)
[08:10:41] (CR) Kormat: [C: +1] "LGTM" [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/627502 (owner: Jcrespo)
[08:10:45] (CR) Giuseppe Lavagetto: [C: +1] "+1 to effie's comment. Apart from that LGTM" [puppet] - https://gerrit.wikimedia.org/r/627439 (owner: Alexandros Kosiaris)
[08:14:18] (CR) Jcrespo: "Yep, this should be able to be rebased cleanly I think, just by pressing "rebase on top of HEAD" by yourself, doesn't it?" [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/627502 (owner: Jcrespo)
[08:15:01] (CR) Jcrespo: "Actually, there is a dependency on the others."
[software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/627502 (owner: Jcrespo)
[08:15:52] (CR) Jcrespo: "> Patch Set 2:" [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/627502 (owner: Jcrespo)
[08:16:48] (CR) Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/compiler1001/25117/dbmonitor1001.wikimedia.org/fulldiff.html does the right thing" [puppet] - https://gerrit.wikimedia.org/r/625583 (https://phabricator.wikimedia.org/T224589) (owner: Giuseppe Lavagetto)
[08:17:11] (CR) Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/25116/" [puppet] - https://gerrit.wikimedia.org/r/627742 (owner: Muehlenhoff)
[08:17:16] (CR) Muehlenhoff: [C: +2] Enable CAS on grafana2001 [puppet] - https://gerrit.wikimedia.org/r/627742 (owner: Muehlenhoff)
[08:17:52] !log asw-a-codfw> request system snapshot slice alternate all-members - T262290
[08:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:59] T262290: Audit Juniper EX snapshots version - https://phabricator.wikimedia.org/T262290
[08:19:23] (CR) Alexandros Kosiaris: [C: +1] "+1 from me, FWIW." [puppet] - https://gerrit.wikimedia.org/r/627745 (https://phabricator.wikimedia.org/T187984) (owner: Jcrespo)
[08:26:04] PROBLEM - Check systemd state on grafana2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:26:13] !log beginning security backport for https://phabricator.wikimedia.org/T262628 [08:26:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:46] PROBLEM - grafana.wikimedia.org on grafana2001 is CRITICAL: connect to address 10.192.0.160 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org [08:27:23] !log asw-b-codfw> request system snapshot slice alternate all-members - T262290 [08:27:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:29] T262290: Audit Juniper EX snapshots version - https://phabricator.wikimedia.org/T262290 [08:31:33] (CR) Giuseppe Lavagetto: [C: +1] lvs: Remove check_eventgate_analytics_http_cluster monitoring [puppet] - https://gerrit.wikimedia.org/r/627528 (https://phabricator.wikimedia.org/T255870) (owner: JMeybohm) [08:31:59] (CR) Giuseppe Lavagetto: [C: +1] lvs: Remove check_eventgate_main_http_cluster monitoring [puppet] - https://gerrit.wikimedia.org/r/627536 (https://phabricator.wikimedia.org/T255873) (owner: JMeybohm) [08:34:15] (Abandoned) ZPapierski: Switch between active W[D|C]QS indexes [puppet] - https://gerrit.wikimedia.org/r/627495 (https://phabricator.wikimedia.org/T262828) (owner: ZPapierski) [08:38:15] (PS1) Muehlenhoff: Add CAS settings to the grafana/CAS template [puppet] - https://gerrit.wikimedia.org/r/627746 [08:38:51] Operations, Acme-chief, Traffic: Let's Encrypt transitioning to ISRG's Root - https://phabricator.wikimedia.org/T263006 (Vgutierrez) [08:39:38] (CR) JMeybohm: [C: +2] lvs: Remove check_eventgate_analytics_http_cluster monitoring [puppet] - https://gerrit.wikimedia.org/r/627528 (https://phabricator.wikimedia.org/T255870) (owner: JMeybohm) [08:39:40] (CR) JMeybohm: [C: +2] lvs: Remove check_eventgate_main_http_cluster monitoring [puppet] - 
https://gerrit.wikimedia.org/r/627536 (https://phabricator.wikimedia.org/T255873) (owner: JMeybohm) [08:40:00] (PS2) JMeybohm: lvs: Remove check_eventgate_main_http_cluster monitoring [puppet] - https://gerrit.wikimedia.org/r/627536 (https://phabricator.wikimedia.org/T255873) [08:41:09] !log asw-c-codfw> request system snapshot slice alternate all-members - T262290 [08:41:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:15] T262290: Audit Juniper EX snapshots version - https://phabricator.wikimedia.org/T262290 [08:41:54] !log awight@deploy1001 Synchronized php-1.36.0-wmf.8/extensions/FileImporter/src/Services/ImportPlanValidator.php: Security patch for T262628 (duration: 00m 59s) [08:41:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:03] !log finished security backport for https://phabricator.wikimedia.org/T262628 [08:42:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:08] (CR) Gehel: [C: +2] "LGTM" [software/spicerack] - https://gerrit.wikimedia.org/r/626240 (https://phabricator.wikimedia.org/T261239) (owner: Ryan Kemper) [08:42:31] (CR) Marostegui: [C: +1] mariadb: Incrase max_allowed_packet to 64MB on all generic misc dbs [puppet] - https://gerrit.wikimedia.org/r/627745 (https://phabricator.wikimedia.org/T187984) (owner: Jcrespo) [08:44:36] (PS2) Jcrespo: mariadb: Increase max_allowed_packet to 64MB on all generic misc dbs [puppet] - https://gerrit.wikimedia.org/r/627745 (https://phabricator.wikimedia.org/T187984) [08:44:54] (CR) Muehlenhoff: [C: +2] Add CAS settings to the grafana/CAS template [puppet] - https://gerrit.wikimedia.org/r/627746 (owner: Muehlenhoff) [08:45:53] (CR) Jcrespo: [C: +2] mariadb: Increase max_allowed_packet to 64MB on all generic misc dbs [puppet] - https://gerrit.wikimedia.org/r/627745 (https://phabricator.wikimedia.org/T187984) (owner: Jcrespo) [08:45:58] (CR) 
Giuseppe Lavagetto: [C: +2] tendril::webserver: configure mpm [puppet] - https://gerrit.wikimedia.org/r/625583 (https://phabricator.wikimedia.org/T224589) (owner: Giuseppe Lavagetto) [08:46:26] (PS1) Filippo Giunchedi: am: use default alertmanager url when needed [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/627748 [08:46:28] (PS1) Filippo Giunchedi: Move service problem parsing/handling to am [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/627749 [08:46:30] (PS1) Filippo Giunchedi: am: default interval to 10s [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/627750 [08:46:32] (PS1) Filippo Giunchedi: Add test scaffolding for am [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/627751 [08:48:34] Operations, Acme-chief, Traffic: Let's Encrypt transitioning to ISRG's Root - https://phabricator.wikimedia.org/T263006 (ema) p:Triage→High [08:49:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool es2027 and es2028 with more weight T261717', diff saved to https://phabricator.wikimedia.org/P12601 and previous config saved to /var/cache/conftool/dbconfig/20200916-084916-marostegui.json [08:49:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:23] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 [08:50:35] !log deploy new max_allowed_packet configuration to m1, m2 and m5 dbs [08:50:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:17] jynus: no m3? 
[08:51:24] (PS3) Jbond: puppetmaster: (re)move hiera lookup for scripts to profiles [puppet] - https://gerrit.wikimedia.org/r/624335 (owner: Dzahn) [08:51:26] (PS1) Jbond: puppetmaster: drop workers parameter as this can be inferred from servers [puppet] - https://gerrit.wikimedia.org/r/627754 [08:51:30] (CR) ArielGlenn: [C: +1] "Seems ok to me, let's see what Brooke thinks." [puppet] - https://gerrit.wikimedia.org/r/624328 (owner: Dzahn) [08:51:34] well, we can do it, but that patch didn't touch m3 as it has its own config [08:51:44] it may be already larger, I can check [08:51:55] jynus: ok [08:51:56] thanks [08:52:12] (PS5) Jbond: puppetdb: (re)move hiera lookup for db pass to profile [puppet] - https://gerrit.wikimedia.org/r/624340 (owner: Dzahn) [08:52:24] !log asw-d-codfw> request system snapshot slice alternate all-members - T262290 [08:52:26] (PS5) Jbond: puppetmaster::backend: replace hiera with lookup, data types [puppet] - https://gerrit.wikimedia.org/r/624342 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn) [08:52:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:31] T262290: Audit Juniper EX snapshots version - https://phabricator.wikimedia.org/T262290 [08:52:37] (CR) Gehel: [C: +1] "marked all comments as resolved. 
This LGTM, but we need to wait for the latest version of spicerack to be deployed before merging this one" (6 comments) [cookbooks] - https://gerrit.wikimedia.org/r/603731 (https://phabricator.wikimedia.org/T261239) (owner: Ryan Kemper) [08:52:41] (CR) jerkins-bot: [V: -1] puppetmaster: drop workers parameter as this can be inferred from servers [puppet] - https://gerrit.wikimedia.org/r/627754 (owner: Jbond) [08:52:41] !log Stop mysql on db1121, db1123, db1093 and db1109 for PDU work T261454 T261457 [08:52:43] (CR) Jbond: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/624335 (owner: Dzahn) [08:52:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:48] T261454: New Date - Wed, Sept 16 PDU Upgrade 12pm-4pm UTC- Racks D7 and D8 - https://phabricator.wikimedia.org/T261454 [08:52:48] T261457: Wed, Sept 16 PDU Upgrade 12pm-4pm UTC- Racks C6 and C7 - https://phabricator.wikimedia.org/T261457 [08:53:21] (PS2) Jbond: puppetmaster: drop workers parameter as this can be inferred from servers [puppet] - https://gerrit.wikimedia.org/r/627754 [08:53:41] marostegui: https://phabricator.wikimedia.org/P12602 I can do a patch for m3 in no time too [08:53:51] (CR) Jbond: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/627754 (owner: Jbond) [08:54:00] (PS6) Jbond: puppetdb: (re)move hiera lookup for db pass to profile [puppet] - https://gerrit.wikimedia.org/r/624340 (owner: Dzahn) [08:54:06] (PS6) Jbond: puppetmaster::backend: replace hiera with lookup, data types [puppet] - https://gerrit.wikimedia.org/r/624342 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn) [08:55:34] Operations, ops-eqiad, DBA, DC-Ops: Wed, Sept 16 PDU Upgrade 12pm-4pm UTC- Racks C6 and C7 - https://phabricator.wikimedia.org/T261457 (Marostegui) mysql stopped on db1121 (sanitarium master) [08:55:49] Operations, ops-eqiad, DBA, DC-Ops: New Date - Wed, Sept 16 PDU Upgrade 
12pm-4pm UTC- Racks D7 and D8 - https://phabricator.wikimedia.org/T261454 (Marostegui) mysql stopped on db1123 (s3 master), db1093 (s6 master) and db1109 (s8 master) [09:00:28] RECOVERY - Check systemd state on grafana2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:02:25] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) is CRITICAL: Test Suggest source sections to translate returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX [09:03:48] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [09:08:03] !log fasw-c-eqiad> request system snapshot slice alternate member 1 - T262290 [09:08:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:12] T262290: Audit Juniper EX snapshots version - https://phabricator.wikimedia.org/T262290 [09:09:21] RECOVERY - grafana.wikimedia.org on grafana2001 is OK: HTTP OK: HTTP/1.1 200 OK - 58497 bytes in 0.164 second response time https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org [09:09:25] mmm gerrit is acting weirdly for me [09:09:36] moritzm: jayme ps -w is not working for me either [09:09:57] so I cannot debug if there is threads stalled or something [09:11:29] even if in https://grafana.wikimedia.org/d/Bw2mQ3iWz/gerrit-javamelody?orgId=1 I don't see anything strange [09:13:29] !log fasw-c-eqiad> request system snapshot slice alternate member 0 - T262290 [09:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:36] T262290: Audit Juniper EX snapshots version - https://phabricator.wikimedia.org/T262290 [09:15:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool es2027 and es2028 with more weight T261717', diff saved to https://phabricator.wikimedia.org/P12603 
and previous config saved to /var/cache/conftool/dbconfig/20200916-091535-marostegui.json [09:15:39] Operations, netops: Audit Juniper EX snapshots version - https://phabricator.wikimedia.org/T262290 (ayounsi) Open→Resolved All done! [09:15:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:43] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 [09:17:33] PROBLEM - SSH access on gerrit1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Gerrit [09:18:18] ^there it is elukey? [09:21:15] votes for restarting it? [09:21:34] +1 [09:22:38] * volans getting 500s on the UI too fwiw [09:22:47] even from the host overview I don't see anything standing out [09:22:52] !log restarting gerrit service on gerrit1001, unresponsive [09:22:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:22] uf, it is taking quite some time [09:23:39] now it finished [09:23:45] is it back? [09:23:59] seems so yes [09:24:14] is Gerrit having problems? [09:24:18] (PS4) Jbond: puppetmaster: (re)move hiera lookup for scripts to profiles [puppet] - https://gerrit.wikimedia.org/r/624335 (owner: Dzahn) [09:24:21] now the git ssh is responsive [09:24:30] I can ps and check the queue [09:24:35] it couldn't before [09:24:51] thanks jynus ! 
[09:25:07] RECOVERY - SSH access on gerrit1001 is OK: SSH OK - GerritCodeReview_3.2.3-1-g185bdc3a69 (APACHE-SSHD-2.4.0) (protocol 2.0) https://wikitech.wikimedia.org/wiki/Gerrit [09:25:22] well, using a hammer not sure if it is a great accomplishment [09:25:27] 0:-) [09:25:53] ah [09:26:04] some stacktrace would have been nice yeah ;D [09:26:16] !log moving train 1.36.0-wmf.9 to testwikis [09:26:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:35] hashar: I couldn't do admin commands, but it hadn't crashed [09:26:42] (CR) Jbond: [C: +1] "Also added profile::puppetmaster::frontend to the CR LGTM" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/624335 (owner: Dzahn) [09:26:53] (PS3) Jbond: puppetmaster: drop workers parameter as this can be inferred from servers [puppet] - https://gerrit.wikimedia.org/r/627754 [09:26:58] (PS1) Lars Wirzenius: testwikis wikis to 1.36.0-wmf.9 [mediawiki-config] - https://gerrit.wikimedia.org/r/627768 [09:27:00] (CR) Lars Wirzenius: [C: +2] testwikis wikis to 1.36.0-wmf.9 [mediawiki-config] - https://gerrit.wikimedia.org/r/627768 (owner: Lars Wirzenius) [09:27:50] oh, it actually did [09:27:50] (Merged) jenkins-bot: testwikis wikis to 1.36.0-wmf.9 [mediawiki-config] - https://gerrit.wikimedia.org/r/627768 (owner: Lars Wirzenius) [09:27:58] hashar: stacktrace is on logs [09:28:15] I can copy it to you if you want it [09:28:40] although I am gessing you have access? [09:28:43] *guessing [09:28:53] !log liw@deploy1001 Started scap: testwikis wikis to 1.36.0-wmf.9 [09:28:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:37] RECOVERY - MariaDB Replica Lag: m2 on db2078 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:29:48] (CR) Volans: "Comments inline on diffs with the generated records and possible approaches. 
We can totally split this in smaller patches if you feel more" (5 comments) [dns] - https://gerrit.wikimedia.org/r/627605 (https://phabricator.wikimedia.org/T258729) (owner: CRusnov) [09:29:57] (PS1) Muehlenhoff: Add grafana-rw for CAS-enabled vhost for editors/admins [dns] - https://gerrit.wikimedia.org/r/627769 (https://phabricator.wikimedia.org/T262512) [09:30:21] (PS1) Jcrespo: mariadb: Increase max_packet_size to 64M on phabricator instances, too [puppet] - https://gerrit.wikimedia.org/r/627770 [09:31:27] jynus: [2020-09-16 08:59:26,288] [HTTP-964151] INFO org.eclipse.jetty.io.ManagedSelector : Caught select() failure, trying to recover: java.lang.OutOfMemoryError: Java heap space [09:31:28] :\ [09:31:47] Operations, OTRS, serviceops, Patch-For-Review, User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (jcrespo) All m2 dbs are back to sync with primary server: https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=6&orgId=1&var-server=d... [09:32:04] (CR) Jbond: "PCC https://puppet-compiler.wmflabs.org/compiler1003/25122/" [puppet] - https://gerrit.wikimedia.org/r/627754 (owner: Jbond) [09:32:18] (CR) Ayounsi: Migrate ulsfo records to automated DNS. (4 comments) [dns] - https://gerrit.wikimedia.org/r/627605 (https://phabricator.wikimedia.org/T258729) (owner: CRusnov) [09:32:40] (PS1) Muehlenhoff: Add grafana-rw to cache config [puppet] - https://gerrit.wikimedia.org/r/627772 (https://phabricator.wikimedia.org/T262512) [09:33:29] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/624332 (owner: Dzahn) [09:34:25] jynus: can you push the trace log to https://phabricator.wikimedia.org/file/ please? 
[09:34:31] I have filed a dummy task for it https://phabricator.wikimedia.org/T263008 [09:34:46] sure, I can phaste it [09:34:50] :] [09:34:55] (PS1) Arturo Borrero Gonzalez: openstack: rocky/buster: use more modern netfilter components [puppet] - https://gerrit.wikimedia.org/r/627773 (https://phabricator.wikimedia.org/T262979) [09:35:10] I would love an utility to dump an histogram of Apache requests [09:35:19] let me double check there is not PII [09:35:42] (CR) Jbond: "LGTM types would be nice 😊" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/627391 (owner: Dzahn) [09:36:21] hashar: https://phabricator.wikimedia.org/P12604 [09:36:42] jbond42: `Enum ["Enthusiastic", "Accepting", "Begruding"] lgtm` [09:37:33] L10n cache building worked today. yay. [09:37:42] (CR) Jcrespo: "It was already applied to multiiinstance ones for m3. Better to keep all main misc in sync." [puppet] - https://gerrit.wikimedia.org/r/627770 (owner: Jcrespo) [09:38:09] (CR) Jcrespo: [C: +2] mariadb: Increase max_packet_size to 64M on phabricator instances, too [puppet] - https://gerrit.wikimedia.org/r/627770 (owner: Jcrespo) [09:38:39] kormat: lol :D [09:39:32] (PS1) Muehlenhoff: Add grafana-rw to CORS origins [puppet] - https://gerrit.wikimedia.org/r/627777 [09:41:30] (CR) Lucas Werkmeister (WMDE): [C: +1] Use `wmgWikibaseClientItemAndPropertySourceName` instead of `wmgWikibaseClientLocalEntitySourceName` in Wikibase.php (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/622993 (https://phabricator.wikimedia.org/T258060) (owner: Itamar Givon) [09:41:37] (CR) JMeybohm: [C: +1] "Great!" 
[deployment-charts] - https://gerrit.wikimedia.org/r/627741 (https://phabricator.wikimedia.org/T261313) (owner: Giuseppe Lavagetto) [09:43:27] !log deploying max_packet_size change to m3 instances, too [09:43:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:44] (PS2) JMeybohm: lvs: Remove blubberoid non-TLS endpoint from LVS 1/3 [puppet] - https://gerrit.wikimedia.org/r/627268 (https://phabricator.wikimedia.org/T236017) [09:46:09] (CR) JMeybohm: [C: +2] lvs: Remove blubberoid non-TLS endpoint from LVS 1/3 [puppet] - https://gerrit.wikimedia.org/r/627268 (https://phabricator.wikimedia.org/T236017) (owner: JMeybohm) [09:46:41] (CR) Jbond: [C: +1] "LGTN" [puppet] - https://gerrit.wikimedia.org/r/627557 (https://phabricator.wikimedia.org/T262642) (owner: Herron) [09:48:18] (CR) Filippo Giunchedi: [C: +2] Add grafana-rw for CAS-enabled vhost for editors/admins [dns] - https://gerrit.wikimedia.org/r/627769 (https://phabricator.wikimedia.org/T262512) (owner: Muehlenhoff) [09:48:24] (CR) Filippo Giunchedi: [C: +1] Add grafana-rw for CAS-enabled vhost for editors/admins [dns] - https://gerrit.wikimedia.org/r/627769 (https://phabricator.wikimedia.org/T262512) (owner: Muehlenhoff) [09:48:27] (CR) JMeybohm: [C: +2] "recheck" [puppet] - https://gerrit.wikimedia.org/r/627271 (https://phabricator.wikimedia.org/T255876) (owner: JMeybohm) [09:48:38] (CR) Filippo Giunchedi: [C: +1] Add grafana-rw to cache config [puppet] - https://gerrit.wikimedia.org/r/627772 (https://phabricator.wikimedia.org/T262512) (owner: Muehlenhoff) [09:50:15] (PS4) JMeybohm: lvs: Remove mobileapps non-TLS endpoint from LVS 1/2 [puppet] - https://gerrit.wikimedia.org/r/627271 (https://phabricator.wikimedia.org/T255876) [09:53:02] Operations, Acme-chief, Traffic: Let's Encrypt transitioning to ISRG's Root - https://phabricator.wikimedia.org/T263006 (ema) I have backported the patch to 
python-acme-0.31.0, here it is for review and future reference. ` Description: Backport of upstream patch "certbot: add --preferred-chain" Auth... [09:56:16] (CR) Muehlenhoff: [C: +2] Add grafana-rw for CAS-enabled vhost for editors/admins [dns] - https://gerrit.wikimedia.org/r/627769 (https://phabricator.wikimedia.org/T262512) (owner: Muehlenhoff) [10:01:29] !log T187984 Shutdown mendelevium. [10:01:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:35] T187984: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 [10:05:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool es2027 and es2028 with more weight T261717', diff saved to https://phabricator.wikimedia.org/P12605 and previous config saved to /var/cache/conftool/dbconfig/20200916-100548-marostegui.json [10:05:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:56] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 [10:08:23] (PS2) Arturo Borrero Gonzalez: openstack: rocky/buster: use more modern netfilter components [puppet] - https://gerrit.wikimedia.org/r/627773 (https://phabricator.wikimedia.org/T262979) [10:09:09] (PS2) Jforrester: Drop scap plugins, moved into scap proper [mediawiki-config] - https://gerrit.wikimedia.org/r/601388 (https://phabricator.wikimedia.org/T248490) [10:10:40] !log upload python-acme 0.31.0-2wm1 to buster-wikimedia T263006 [10:10:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:46] T263006: Let's Encrypt transitioning to ISRG's Root - https://phabricator.wikimedia.org/T263006 [10:11:15] (CR) Arturo Borrero Gonzalez: [C: +1] "the virt1009 naming predates me being around. It is really related to WMCS?" 
[dns] - https://gerrit.wikimedia.org/r/627519 (https://phabricator.wikimedia.org/T244153) (owner: Volans) [10:13:47] (CR) Muehlenhoff: "> Patch Set 1: Code-Review+1" [dns] - https://gerrit.wikimedia.org/r/627519 (https://phabricator.wikimedia.org/T244153) (owner: Volans) [10:14:49] !log liw@deploy1001 Finished scap: testwikis wikis to 1.36.0-wmf.9 (duration: 46m 07s) [10:14:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:34] (PS1) Lars Wirzenius: group0 wikis to 1.36.0-wmf.9 [mediawiki-config] - https://gerrit.wikimedia.org/r/627781 [10:16:36] (CR) Lars Wirzenius: [C: +2] group0 wikis to 1.36.0-wmf.9 [mediawiki-config] - https://gerrit.wikimedia.org/r/627781 (owner: Lars Wirzenius) [10:17:45] (Merged) jenkins-bot: group0 wikis to 1.36.0-wmf.9 [mediawiki-config] - https://gerrit.wikimedia.org/r/627781 (owner: Lars Wirzenius) [10:19:20] Operations, OTRS: Quick Close menu items appear multiple times in Ticket Zoom - https://phabricator.wikimedia.org/T263005 (akosiaris) Open→Resolved a:akosiaris This was probably an artifact from the migration. In the configuration I found 4 duplicate entries for `Ticket::Frontend::MenuModule#... 
[10:20:30] !log liw@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.9 [10:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:07] (CR) Ema: "The change looks fine, but I see that grafana2001.codfw.wmnet does not have grafana-rw in the certificate Subject Alternative Name:" [puppet] - https://gerrit.wikimedia.org/r/627772 (https://phabricator.wikimedia.org/T262512) (owner: Muehlenhoff) [10:26:14] (PS3) Alexandros Kosiaris: prometheus: Scrape k8s etcd nodes [puppet] - https://gerrit.wikimedia.org/r/627439 [10:26:18] (CR) Alexandros Kosiaris: prometheus: Scrape k8s etcd nodes (1 comment) [puppet] - https://gerrit.wikimedia.org/r/627439 (owner: Alexandros Kosiaris) [10:26:28] (CR) Alexandros Kosiaris: [C: +2] "thanks!" [puppet] - https://gerrit.wikimedia.org/r/627439 (owner: Alexandros Kosiaris) [10:26:51] (CR) Muehlenhoff: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/627772 (https://phabricator.wikimedia.org/T262512) (owner: Muehlenhoff) [10:29:13] (PS13) Elukey: Add basic Debian packaging [debs/hue] - https://gerrit.wikimedia.org/r/618728 (https://phabricator.wikimedia.org/T233073) [10:32:31] (CR) Muehlenhoff: [C: +1] "Looks good to me! 
(one comment inline, but let's simply add it later)" (1 comment) [debs/hue] - https://gerrit.wikimedia.org/r/618728 (https://phabricator.wikimedia.org/T233073) (owner: Elukey) [10:33:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully pool es2027 and es2028 T261717', diff saved to https://phabricator.wikimedia.org/P12606 and previous config saved to /var/cache/conftool/dbconfig/20200916-103324-marostegui.json [10:33:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:32] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 [10:35:32] (PS1) Filippo Giunchedi: am: handle missing scheduled_downtime_depth [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/627782 [10:41:01] (PS2) Volans: wmcs: remove leftover records for old hosts [dns] - https://gerrit.wikimedia.org/r/627519 (https://phabricator.wikimedia.org/T244153) [10:43:14] (CR) Volans: [C: +2] wmcs: remove leftover records for old hosts [dns] - https://gerrit.wikimedia.org/r/627519 (https://phabricator.wikimedia.org/T244153) (owner: Volans) [10:46:55] (PS1) Muehlenhoff: Make the servername configurable via Hiera and set to hue-next for an-tool1009 [puppet] - https://gerrit.wikimedia.org/r/627784 [10:48:09] (PS1) Jforrester: Revert "Remove support for (Archived|OldLocal)File::userCan without a user" [core] (wmf/1.36.0-wmf.9) - https://gerrit.wikimedia.org/r/627787 (https://phabricator.wikimedia.org/T263014) [10:48:49] (CR) Jforrester: [C: +2] Revert "Remove support for (Archived|OldLocal)File::userCan without a user" [core] (wmf/1.36.0-wmf.9) - https://gerrit.wikimedia.org/r/627787 (https://phabricator.wikimedia.org/T263014) (owner: Jforrester) [10:55:19] (CR) Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/25123/" [puppet] - https://gerrit.wikimedia.org/r/627784 (owner: Muehlenhoff) [10:57:41] PROBLEM - MediaWiki exceptions and fatals per 
minute on icinga1001 is CRITICAL: 119 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:59:35] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: (C)100 gt (W)50 gt 28 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: Time to snap out of that daydream and deploy European mid-day backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200916T1100). [11:00:04] No GERRIT patches in the queue for this window AFAICS. [11:02:14] (PS30) Jbond: cookbook sre.pdu: Fix reboot logic and other minor fixes [cookbooks] - https://gerrit.wikimedia.org/r/627272 (https://phabricator.wikimedia.org/T246890) [11:03:19] (CR) jerkins-bot: [V: -1] cookbook sre.pdu: Fix reboot logic and other minor fixes [cookbooks] - https://gerrit.wikimedia.org/r/627272 (https://phabricator.wikimedia.org/T246890) (owner: Jbond) [11:03:31] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: 170 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:09:40] (Merged) jenkins-bot: Revert "Remove support for (Archived|OldLocal)File::userCan without a user" [core] (wmf/1.36.0-wmf.9) - https://gerrit.wikimedia.org/r/627787 (https://phabricator.wikimedia.org/T263014) (owner: Jforrester) [11:12:39] !log jforrester@deploy1001 Synchronized php-1.36.0-wmf.9/includes/filerepo/file: T263014 Revert "Remove support for (Archived|OldLocal)File::userCan without a user" (duration: 01m 04s) [11:12:44] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:46] T263014: Argument 2 passed to File::userCan() must be an instance of User, null given, called in /srv/mediawiki/php-1.36.0-wmf.9/includes/filerepo/LocalRepo.php on line 275 - https://phabricator.wikimedia.org/T263014 [11:15:05] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: (C)100 gt (W)50 gt 12 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:16:53] (PS31) Jbond: cookbook sre.pdu: Fix reboot logic and other minor fixes [cookbooks] - https://gerrit.wikimedia.org/r/627272 (https://phabricator.wikimedia.org/T246890) [11:18:02] !log installing gnutls28 security updates on remaining stretch hosts [11:18:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:15] (PS32) Jbond: cookbook sre.pdu: Fix reboot logic and other minor fixes [cookbooks] - https://gerrit.wikimedia.org/r/627272 (https://phabricator.wikimedia.org/T246890) [11:24:00] jbond42: I'll wait for PS42, must be the right one :-P [11:24:20] (PS1) Vgutierrez: x509: Provide support for an alternative chain [software/acme-chief] - https://gerrit.wikimedia.org/r/627809 (https://phabricator.wikimedia.org/T263006) [11:24:28] volans: :D yes could be lol [11:24:54] (PS2) Vgutierrez: x509: Alternative chain support [software/acme-chief] - https://gerrit.wikimedia.org/r/627809 (https://phabricator.wikimedia.org/T263006) [11:28:07] (CR) jerkins-bot: [V: -1] x509: Alternative chain support [software/acme-chief] - https://gerrit.wikimedia.org/r/627809 (https://phabricator.wikimedia.org/T263006) (owner: Vgutierrez) [11:28:07] Critical Alert for device ps1-d7-codfw.mgmt.codfw.wmnet - Device rebooted [11:29:14] !log restarting slapd on LDAP replicas to pick up GNUTLS update [11:29:18] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:31] ^^ this reboot was me [11:29:38] (CR) Jbond: "I think i got everything and to pre-empt you i agree this really should be a spicerack module now (https://phabricator.wikimedia.org/T2630" (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/627272 (https://phabricator.wikimedia.org/T246890) (owner: Jbond) [11:33:08] C̶r̶i̶t̶i̶c̶a̶l Device ps1-d7-codfw.mgmt.codfw.wmnet recovered from Device rebooted [11:39:23] (PS3) Vgutierrez: x509: Alternative chain support [software/acme-chief] - https://gerrit.wikimedia.org/r/627809 (https://phabricator.wikimedia.org/T263006) [11:39:25] (PS1) Vgutierrez: Handle new pylint raise-missing-from [software/acme-chief] - https://gerrit.wikimedia.org/r/627813 [11:40:01] PROBLEM - Host stat1008 is DOWN: PING CRITICAL - Packet loss = 100% [11:40:12] elukey: ^^^ [11:42:38] (CR) jerkins-bot: [V: -1] Handle new pylint raise-missing-from [software/acme-chief] - https://gerrit.wikimedia.org/r/627813 (owner: Vgutierrez) [11:43:21] (CR) jerkins-bot: [V: -1] x509: Alternative chain support [software/acme-chief] - https://gerrit.wikimedia.org/r/627809 (https://phabricator.wikimedia.org/T263006) (owner: Vgutierrez) [11:49:41] volans: this is maintenance for updating ROCM kernels [11:50:10] ack [11:55:00] (PS2) Vgutierrez: Handle new pylint raise-missing-from [software/acme-chief] - https://gerrit.wikimedia.org/r/627813 [11:55:02] (PS4) Vgutierrez: x509: Alternative chain support [software/acme-chief] - https://gerrit.wikimedia.org/r/627809 (https://phabricator.wikimedia.org/T263006) [11:55:05] (PS3) Arturo Borrero Gonzalez: openstack: rocky/buster: use more modern netfilter components [puppet] - https://gerrit.wikimedia.org/r/627773 (https://phabricator.wikimedia.org/T262979) [11:57:41] (CR) Filippo Giunchedi: [C: +1] Add grafana-rw to CORS origins [puppet] - https://gerrit.wikimedia.org/r/627777 (owner: 
Muehlenhoff) [11:59:11] (CR) Volans: "Thx for the refactor this will simplify moving it to spicerack. Couple of last comments inline." (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/627272 (https://phabricator.wikimedia.org/T246890) (owner: Jbond) [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200916T1200) [12:01:47] RECOVERY - Host stat1008 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [12:08:05] PROBLEM - Host stat1008 is DOWN: PING CRITICAL - Packet loss = 100% [12:09:31] RECOVERY - Host stat1008 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [12:12:30] Sorry for forgetting to put in a maint window for stat1008. [12:12:44] *puts 5CHF in the end-of-year jar* [12:13:34] (PS1) Filippo Giunchedi: hieradata: enable remote syslog queues in codfw [puppet] - https://gerrit.wikimedia.org/r/627816 (https://phabricator.wikimedia.org/T226703) [12:14:21] hey, it didn't page, so there's that [12:16:57] looking for volunteers to +1 https://gerrit.wikimedia.org/r/c/operations/puppet/+/627816 (impactless so far) [12:18:28] (CR) Jbond: cookbook sre.pdu: Fix reboot logic and other minor fixes (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/627272 (https://phabricator.wikimedia.org/T246890) (owner: Jbond) [12:19:15] RECOVERY - dump of m2 in eqiad on icinga1001 is OK: Last dump for m2 at eqiad (db1117.eqiad.wmnet:3322) taken on 2020-09-16 07:31:19 (438 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [12:29:56] !log restarting exim on MXes to pick up GNUTLS update [12:30:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:51] PROBLEM - Host mw2256 is DOWN: PING CRITICAL - Packet loss = 100% [12:35:29] (PS1) DCausse: Remove kafka msearch and bulk daemon from cirrus and relforge [puppet] - https://gerrit.wikimedia.org/r/627818 [12:36:47] !log powercycling mw2256 (went down with overheated CPU) 
[12:36:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:41] (03CR) 10Filippo Giunchedi: [C: 03+1] "Not super familiar with how it is supposed to look, but LGTM. PCC run https://puppet-compiler.wmflabs.org/compiler1003/25124/" [puppet] - 10https://gerrit.wikimedia.org/r/627818 (owner: 10DCausse) [12:42:52] (03PS1) 10Filippo Giunchedi: base: remove obsolete enable_rsyslog_exporter [puppet] - 10https://gerrit.wikimedia.org/r/627821 (https://phabricator.wikimedia.org/T226703) [12:44:32] 10Operations, 10ops-codfw, 10serviceops: mw2256 went down with thermal issues / fail-safe voltage is out of range - https://phabricator.wikimedia.org/T263022 (10MoritzMuehlenhoff) [12:48:56] (03PS5) 10Vgutierrez: x509: Alternative chain support [software/acme-chief] - 10https://gerrit.wikimedia.org/r/627809 (https://phabricator.wikimedia.org/T263006) [12:48:58] (03PS1) 10Vgutierrez: requests: Fetch alternative chains [software/acme-chief] - 10https://gerrit.wikimedia.org/r/627823 (https://phabricator.wikimedia.org/T263006) [12:49:29] (03CR) 10Elukey: [C: 03+1] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/627784 (owner: 10Muehlenhoff) [12:49:50] !log start pdu swap in racks c6 and c7, d8 [12:49:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:37] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is CRITICAL: Test retrieve extended metadata for Video article on English Wikipedia returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) is CRITICAL: Test Get page content HTML for [12:51:37] ed the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [12:52:05] (03CR) 10Elukey: [C: 03+2] Make the servername configurable via Hiera and set to hue-next for an-tool1009 [puppet] - 10https://gerrit.wikimedia.org/r/627784 (owner: 10Muehlenhoff) [12:53:33] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [12:53:42] (03CR) 10Elukey: [C: 03+2] aptrepo: add rock-dkms in the list of packages for the rocm33 component [puppet] - 10https://gerrit.wikimedia.org/r/626112 (https://phabricator.wikimedia.org/T260442) (owner: 10Elukey) [12:56:05] !log cdanis@cumin1001 START - Cookbook sre.network.cf [12:56:06] !log cdanis@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [12:56:06] (03CR) 10Elukey: "> Patch Set 11:" (031 comment) [debs/hue] - 10https://gerrit.wikimedia.org/r/618728 (https://phabricator.wikimedia.org/T233073) (owner: 10Elukey) [12:56:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:28] (03CR) 10Elukey: [V: 03+2 C: 03+2] Upstream version 4.7.1 [debs/hue] - 10https://gerrit.wikimedia.org/r/627755 (owner: 10Elukey) 
[12:56:47] (03CR) 10Elukey: [C: 03+2] Add basic Debian packaging [debs/hue] - 10https://gerrit.wikimedia.org/r/618728 (https://phabricator.wikimedia.org/T233073) (owner: 10Elukey) [12:58:17] !log upload hue_4.7.1-1+deb10u1 to buster-wikimedia [12:58:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:04] liw and brennen: It is that lovely time of the day again! You are hereby commanded to deploy Mediawiki train - European+American Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200916T1300). [13:00:44] (03PS1) 10Lars Wirzenius: group1 wikis to 1.36.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627825 [13:00:46] (03CR) 10Lars Wirzenius: [C: 03+2] group1 wikis to 1.36.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627825 (owner: 10Lars Wirzenius) [13:01:39] (03Merged) 10jenkins-bot: group1 wikis to 1.36.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627825 (owner: 10Lars Wirzenius) [13:06:26] "connect to host mw2256.codfw.wmnet port 22: Connection timed out" hmm [13:07:49] liw: host broke half an hour ago, let me fix this [13:08:06] !log liw@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.36.0-wmf.9 [13:08:08] !log jmm@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=mw2256.codfw.wmnet [13:08:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:16] (03PS1) 10Muehlenhoff: Enable CAS on an-tool1009 [puppet] - 10https://gerrit.wikimedia.org/r/627826 [13:08:25] liw: can you try again? 
[13:08:37] for context: https://phabricator.wikimedia.org/T263022 [13:11:29] (03CR) 10Elukey: [C: 03+2] Remove kafka msearch and bulk daemon from cirrus and relforge [puppet] - 10https://gerrit.wikimedia.org/r/627818 (owner: 10DCausse) [13:11:29] moritzm, it was an error from scap (via deploy-promote), and it finished successfully anyway, so I'm not sure how to try again without doing a rollback [13:12:29] moritzm, I can do a rollback if it's important enough [13:16:09] no idea about scap internal, but it should be fine, when mw2256 has been fixed by our data centre ops people, it will be synced to the most current scap state anyway before being put back in action [13:16:29] ack, then I won't roll back for this [13:16:45] ack [13:18:56] (03PS1) 10Elukey: hue: add specific settings for version 4 [puppet] - 10https://gerrit.wikimedia.org/r/627830 (https://phabricator.wikimedia.org/T258768) [13:19:59] (03CR) 10jerkins-bot: [V: 04-1] hue: add specific settings for version 4 [puppet] - 10https://gerrit.wikimedia.org/r/627830 (https://phabricator.wikimedia.org/T258768) (owner: 10Elukey) [13:21:49] PROBLEM - Host ps1-c6-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [13:22:23] (03PS1) 10Elukey: Add fake kerberos keytab for hue on an-tool1009 [labs/private] - 10https://gerrit.wikimedia.org/r/627832 [13:24:08] (03PS1) 10Esanders: Enable DiscussionTools beta on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627833 (https://phabricator.wikimedia.org/T262984) [13:25:46] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add fake kerberos keytab for hue on an-tool1009 [labs/private] - 10https://gerrit.wikimedia.org/r/627832 (owner: 10Elukey) [13:26:22] (03PS1) 10Volans: sre.dns.netbox: improve the DNS automation [cookbooks] - 10https://gerrit.wikimedia.org/r/627834 (https://phabricator.wikimedia.org/T244153) [13:26:53] (03PS2) 10Elukey: hue: add specific settings for version 4 [puppet] - 10https://gerrit.wikimedia.org/r/627830 (https://phabricator.wikimedia.org/T258768) 
[13:26:55] (03PS1) 10Vgutierrez: acme_chief: Save alternative chains [software/acme-chief] - 10https://gerrit.wikimedia.org/r/627835 [13:27:57] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Use a separate HELM_HOME for each helmfile run [deployment-charts] - 10https://gerrit.wikimedia.org/r/627741 (https://phabricator.wikimedia.org/T261313) (owner: 10Giuseppe Lavagetto) [13:29:18] PROBLEM - Disk space on otrs1001 is CRITICAL: DISK CRITICAL - free space: / 11426 MB (57% inode=0%): /tmp 11426 MB (57% inode=0%): /var/tmp 11426 MB (57% inode=0%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=otrs1001&var-datasource=eqiad+prometheus/ops [13:30:19] akosiaris: --^ [13:30:25] (03Merged) 10jenkins-bot: Use a separate HELM_HOME for each helmfile run [deployment-charts] - 10https://gerrit.wikimedia.org/r/627741 (https://phabricator.wikimedia.org/T261313) (owner: 10Giuseppe Lavagetto) [13:30:59] should be the upgrade, also why we alert at 43% used? [13:31:53] * akosiaris looking [13:32:09] inodes? [13:32:12] what on earth [13:32:21] ah yeah [13:32:23] /dev/vda1 1376256 1376256 0 100% / [13:32:48] is there some infinite loop of creating inodes? [13:34:12] the daemon seems to be doing something... [13:34:28] I'm looking for dirs with more inodes [13:34:39] it's also quite possible it's in problems because of that [13:35:25] each opt/otrs-6.0.29/var/tmp/CacheFileStorable/Article/* dir has more or less 80k [13:35:49] akosiaris: [13:35:50] 1302172 opt/otrs-6.0.29/var/tmp/CacheFileStorable [13:36:07] that's the total for that dir and the parent [13:36:30] 10Operations, 10Traffic: Don't set cookies for api.wikimedia.org at the caching layer - https://phabricator.wikimedia.org/T260943 (10eprodromou) @Nuria we split the conversation here into T259296 and this ticket. @hnowlan is taking care of making sure the API server doesn't pass through cookies, so it's really... 
[13:36:50] 1376256 is the total number of inodes in that partition [13:36:51] it's caching all the articles? [13:36:54] what on earth [13:37:21] sudo -u otrs /opt/otrs/bin/otrs.Console.pl Maint::Cache::Delete [13:37:21] Deleting cache... [13:37:25] let's see if this fixes it [13:37:26] 10Operations, 10Traffic, 10Platform Team Initiatives (API Gateway), 10Platform Team Sprints Board (Sprint 1), and 2 others: Client Developer has a cookie-free API call - https://phabricator.wikimedia.org/T258748 (10eprodromou) I'm moving this to blocked until we work out how to get the cookies out of the A... [13:37:38] we could tweak the ext4 params or (I know you'll love it) use xfs :-P [13:37:47] PROBLEM - IPMI Sensor Status on mw1347 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [13:37:55] (03CR) 10Elukey: [C: 03+2] hue: add specific settings for version 4 [puppet] - 10https://gerrit.wikimedia.org/r/627830 (https://phabricator.wikimedia.org/T258768) (owner: 10Elukey) [13:38:08] or just tweak some setting so that it doesn't do that [13:38:11] ehehe [13:38:26] much better now [13:39:39] PROBLEM - IPMI Sensor Status on mw1346 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [13:40:33] volans: it seems it has happened before https://phabricator.wikimedia.org/T154841 [13:41:25] lol [13:41:27] well not that exactly [13:41:42] but googling for this, returns this phab task in like 4th place [13:43:09] volans: and guess what's the recommendation https://doc.otrs.com/doc/manual/admin/6.0/en/html/performance-tuning.html#performance-tuning-otrs-cache [13:43:20] * akosiaris feels all the way back to 2013 [13:44:41] you can purchase the memcached backend, that's so '98 :D 
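Note on the inode hunt above: the check that found opt/otrs-6.0.29/var/tmp/CacheFileStorable can be approximated with a short script. This is only a minimal sketch, not the commands actually run on otrs1001. The function names `inode_usage` and `entries_per_top_dir` are hypothetical, and it counts directory entries rather than unique inodes, so hard-linked files are counted once per link.

```python
import os
from collections import Counter


def inode_usage(path):
    """Return (total, used, free) inode counts for the filesystem holding path."""
    st = os.statvfs(path)
    return st.f_files, st.f_files - st.f_ffree, st.f_ffree


def entries_per_top_dir(root):
    """Count directory entries beneath each immediate subdirectory of root.

    Hypothetical helper: an approximation of per-directory inode use.
    Symlinked directories are skipped at the top level and never followed.
    """
    counts = Counter()
    for name in os.listdir(root):
        top = os.path.join(root, name)
        if os.path.islink(top) or not os.path.isdir(top):
            continue
        counts[name] += 1  # the subdirectory itself
        for _dirpath, dirnames, filenames in os.walk(top):
            counts[name] += len(dirnames) + len(filenames)
    return counts
```

Calling `inode_usage("/")` gives the same three figures as the inode line pasted at 13:32:23, and `entries_per_top_dir("/opt").most_common(5)` would list the subdirectories holding the most entries, assuming the caller has permission to read them.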
[13:45:41] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/25127/" [puppet] - 10https://gerrit.wikimedia.org/r/627826 (owner: 10Muehlenhoff) [13:46:06] this open core model (which now is a delayed release open core model) is starting to get on my nerves [13:46:24] !log Restarting CI Jenkins for T262827 [13:46:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:44] volans: how about I set up a cron to run the above command every 1h? [13:46:46] :P [13:47:14] lol [13:48:12] !log Start mysql on db1121 after PDU work [13:48:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:20] RECOVERY - Disk space on otrs1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=otrs1001&var-datasource=eqiad+prometheus/ops [13:49:22] (03PS1) 10Elukey: hue: use gunicorn with Hue 4 [puppet] - 10https://gerrit.wikimedia.org/r/627838 (https://phabricator.wikimedia.org/T258768) [13:50:19] (03PS2) 10Jforrester: Drop wgHiddenPrefs hack for VE Beta Feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620958 (https://phabricator.wikimedia.org/T254349) (owner: 10Esanders) [13:50:56] (03CR) 10Jforrester: [C: 03+1] "This no-op clean-up is good to go whenever." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/620958 (https://phabricator.wikimedia.org/T254349) (owner: 10Esanders) [13:52:53] (03PS2) 10Elukey: hue: use gunicorn with Hue 4 [puppet] - 10https://gerrit.wikimedia.org/r/627838 (https://phabricator.wikimedia.org/T258768) [13:53:16] (03CR) 10jerkins-bot: [V: 04-1] hue: use gunicorn with Hue 4 [puppet] - 10https://gerrit.wikimedia.org/r/627838 (https://phabricator.wikimedia.org/T258768) (owner: 10Elukey) [13:56:12] (03PS3) 10Elukey: hue: use gunicorn with Hue 4 [puppet] - 10https://gerrit.wikimedia.org/r/627838 (https://phabricator.wikimedia.org/T258768) [13:57:46] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:58:22] (03PS4) 10Elukey: hue: use gunicorn with Hue 4 [puppet] - 10https://gerrit.wikimedia.org/r/627838 (https://phabricator.wikimedia.org/T258768) [13:59:54] PROBLEM - Host wtp1042 is DOWN: PING CRITICAL - Packet loss = 100% [14:00:28] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=mjolnir site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:00:52] (03PS5) 10Elukey: hue: use gunicorn with Hue 4 [puppet] - 10https://gerrit.wikimedia.org/r/627838 (https://phabricator.wikimedia.org/T258768) [14:01:00] RECOVERY - Host wtp1042 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [14:01:46] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:02:03] 10Operations, 10observability: Missing 'notify' for some Icinga configuration files - https://phabricator.wikimedia.org/T263027 (10fgiunchedi) [14:02:08] !log Change email address of User:Oversight@enwiki to oversight-en-wp@wikipedia.org as OTRS is back up (T262733) [14:02:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:15] T262733: Cannot change password for role account that is not an attached global account - Assistance required from someone with shell access - https://phabricator.wikimedia.org/T262733 [14:03:47] (03CR) 10Elukey: [C: 03+2] hue: use gunicorn with Hue 4 [puppet] - 10https://gerrit.wikimedia.org/r/627838 (https://phabricator.wikimedia.org/T258768) (owner: 10Elukey) [14:04:38] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=mjolnir site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:04:50] PROBLEM - Check systemd state on an-tool1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:05:12] (03PS1) 10Jgiannelos: Enable prometheus monitoring on push-notifications staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/627840 [14:05:39] (03PS1) 10Kormat: bsection: Script for binary-searching log files. [puppet] - 10https://gerrit.wikimedia.org/r/627841 [14:06:08] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:06:20] RECOVERY - Check systemd state on an-tool1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:27] (03CR) 10jerkins-bot: [V: 04-1] bsection: Script for binary-searching log files. [puppet] - 10https://gerrit.wikimedia.org/r/627841 (owner: 10Kormat) [14:08:03] (03PS2) 10Kormat: bsection: Script for binary-searching log files. [puppet] - 10https://gerrit.wikimedia.org/r/627841 [14:08:09] (03PS1) 10Alexandros Kosiaris: otrs: Cleanup Cache every 1h [puppet] - 10https://gerrit.wikimedia.org/r/627842 [14:08:18] RECOVERY - IPMI Sensor Status on mw1347 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:10:18] RECOVERY - IPMI Sensor Status on mw1346 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:10:41] (03CR) 10Alexandros Kosiaris: [C: 03+2] otrs: Cleanup Cache every 1h [puppet] - 10https://gerrit.wikimedia.org/r/627842 (owner: 10Alexandros Kosiaris) [14:10:53] (03PS1) 10Elukey: Add hue-next.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/627843 [14:11:13] 10Operations, 10observability: Missing 'notify' for some Icinga configuration files - https://phabricator.wikimedia.org/T263027 (10fgiunchedi) [14:13:20] (03PS1) 10Alexandros Kosiaris: otrs: Redirect cache cleanup cron's stdout/stderr [puppet] - 10https://gerrit.wikimedia.org/r/627844 [14:14:14] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10Patch-For-Review: Move proton to use TLS only - https://phabricator.wikimedia.org/T255877 (10Joe) [14:17:47] (03CR) 10Alexandros Kosiaris: [C: 03+2] otrs: Redirect cache cleanup cron's stdout/stderr 
[puppet] - 10https://gerrit.wikimedia.org/r/627844 (owner: 10Alexandros Kosiaris) [14:20:12] PROBLEM - IPMI Sensor Status on cp1086 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:21:04] PROBLEM - IPMI Sensor Status on ms-be1042 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:21:12] PROBLEM - IPMI Sensor Status on dbprov1003 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:22:45] (03PS1) 10Effie Mouzeli: push-notifications: enable mmonitoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/627846 (https://phabricator.wikimedia.org/T256973) [14:22:47] (03CR) 10Hnowlan: [C: 03+2] "lgtm, happy to deploy and test" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/627588 (https://phabricator.wikimedia.org/T262396) (owner: 10Ppchelko) [14:23:18] PROBLEM - Host ps1-c7-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [14:23:33] (03PS2) 10Effie Mouzeli: push-notifications: enable monitoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/627846 (https://phabricator.wikimedia.org/T256973) [14:24:53] (03Merged) 10jenkins-bot: Api-gateway: Implement support for X-Wikimedia-Debug header [deployment-charts] - 10https://gerrit.wikimedia.org/r/627588 (https://phabricator.wikimedia.org/T262396) (owner: 10Ppchelko) [14:26:12] (03CR) 10JMeybohm: [C: 03+1] push-notifications: enable monitoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/627846 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [14:26:30] (03CR) 
10Volans: "reply inline" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/627272 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [14:26:44] (03Abandoned) 10Jgiannelos: Enable prometheus monitoring on push-notifications staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/627840 (owner: 10Jgiannelos) [14:26:48] 10Operations, 10observability: Missing 'notify' for some Icinga configuration files - https://phabricator.wikimedia.org/T263027 (10fgiunchedi) [14:28:34] (03PS1) 10Filippo Giunchedi: icinga: reload when nagios config changes [puppet] - 10https://gerrit.wikimedia.org/r/627849 (https://phabricator.wikimedia.org/T263027) [14:28:56] (03CR) 10Effie Mouzeli: [C: 03+2] push-notifications: enable monitoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/627846 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [14:29:58] (03PS1) 10Volans: CHANGELOG: add changelogs for release v0.0.43 [software/spicerack] - 10https://gerrit.wikimedia.org/r/627850 [14:31:02] (03CR) 10Filippo Giunchedi: "/etc/icinga/commands still missing, will need followup" [puppet] - 10https://gerrit.wikimedia.org/r/627849 (https://phabricator.wikimedia.org/T263027) (owner: 10Filippo Giunchedi) [14:31:12] (03Merged) 10jenkins-bot: push-notifications: enable monitoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/627846 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [14:32:52] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v0.0.43 [software/spicerack] - 10https://gerrit.wikimedia.org/r/627850 (owner: 10Volans) [14:34:31] !log jiji@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'push-notifications' for release 'main' . 
[14:34:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:58] (03PS1) 10Volans: Upstream release v0.0.43 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/627852 [14:36:18] (03PS33) 10Jbond: cookbook sre.pdu: Fix reboot logic and other minor fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/627272 (https://phabricator.wikimedia.org/T246890) [14:36:42] (03CR) 10Jbond: "updated" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/627272 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [14:37:41] (03PS1) 10BBlack: Update bblack ssh keys [puppet] - 10https://gerrit.wikimedia.org/r/627853 [14:38:12] (03CR) 10Volans: [C: 03+2] Upstream release v0.0.43 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/627852 (owner: 10Volans) [14:39:52] !log pdu swap rack d7-eqiad, missed this in earlier log entry [14:39:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:35] (03CR) 10Volans: "last nit" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/627272 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [14:41:18] (03CR) 10C. 
Scott Ananian: Revert "Remove support for (Archived|OldLocal)File::userCan without a user" (031 comment) [core] (wmf/1.36.0-wmf.9) - 10https://gerrit.wikimedia.org/r/627787 (https://phabricator.wikimedia.org/T263014) (owner: 10Jforrester) [14:41:30] RECOVERY - Host ps1-c7-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.22 ms [14:41:42] !log uploaded spicerack_0.0.43 to apt.wikimedia.org buster-wikimedia [14:41:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:07] (03CR) 10BBlack: [C: 03+2] Update bblack ssh keys [puppet] - 10https://gerrit.wikimedia.org/r/627853 (owner: 10BBlack) [14:43:41] (03PS1) 10Marostegui: db1121: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/627855 [14:43:53] gehel, ryankemper: spicerack_0.0.43 available on APT update at will the cumin hosts and perform related testing [14:44:00] (03PS34) 10Jbond: cookbook sre.pdu: Fix reboot logic and other minor fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/627272 (https://phabricator.wikimedia.org/T246890) [14:44:12] (03CR) 10Jbond: "thx, updated" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/627272 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [14:44:20] RECOVERY - Host ps1-c6-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.16 ms [14:45:00] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=kubetcd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:45:32] PROBLEM - ps1-c7-eqiad-infeed-load-tower-B-phase-Z on ps1-c7-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:45:48] PROBLEM - ps1-c7-eqiad-infeed-load-tower-A-phase-Z on ps1-c7-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:46:24] PROBLEM - ps1-c7-eqiad-infeed-load-tower-B-phase-X on ps1-c7-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:46:24] PROBLEM - ps1-c7-eqiad-infeed-load-tower-A-phase-Y on ps1-c7-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:46:28] akosiaris: ^ Prometheus jobs reduced availability? [14:46:52] PROBLEM - ps1-c7-eqiad-infeed-load-tower-A-phase-X on ps1-c7-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:46:56] PROBLEM - ps1-c7-eqiad-infeed-load-tower-B-phase-Y on ps1-c7-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:47:47] !log ryankemper@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-reload (exit_code=97) [14:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:02] jayme: /me looking [14:48:14] RECOVERY - IPMI Sensor Status on cp1086 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:48:14] RECOVERY - IPMI Sensor Status on dbprov1003 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:48:34] (03CR) 10Elukey: [C: 03+2] Add hue-next.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/627843 (owner: 10Elukey) [14:48:38] (03PS1) 10Elukey: Add hue-next.wikimedia.org settings for ATS [puppet] - 10https://gerrit.wikimedia.org/r/627856 
(https://phabricator.wikimedia.org/T258768) [14:48:39] ^I've forced a recheck to make sure monitoring is up to date [14:52:00] RECOVERY - IPMI Sensor Status on ms-be1042 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:53:07] 04Critical Alert for device ps1-c7-eqiad.mgmt.eqiad.wmnet - Device rebooted [14:53:14] 04Critical Alert for device ps1-c6-eqiad.mgmt.eqiad.wmnet - Device rebooted [14:54:24] PROBLEM - Host ps1-d8-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [14:55:21] (03PS2) 10Muehlenhoff: Enable CAS on an-tool1009 [puppet] - 10https://gerrit.wikimedia.org/r/627826 [14:55:30] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/627856 (https://phabricator.wikimedia.org/T258768) (owner: 10Elukey) [14:56:12] (03PS1) 10Giuseppe Lavagetto: proton-http: stop monitoring the endpoint [puppet] - 10https://gerrit.wikimedia.org/r/627857 (https://phabricator.wikimedia.org/T255877) [14:56:14] (03PS1) 10Giuseppe Lavagetto: proton: remove non-https endpoint [puppet] - 10https://gerrit.wikimedia.org/r/627858 (https://phabricator.wikimedia.org/T255877) [14:56:16] (03PS1) 10Giuseppe Lavagetto: proton: remove conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/627859 (https://phabricator.wikimedia.org/T255877) [14:56:18] (03PS1) 10Giuseppe Lavagetto: proton: remove the ganeti VMs from puppet [puppet] - 10https://gerrit.wikimedia.org/r/627860 (https://phabricator.wikimedia.org/T255877) [14:56:20] (03PS1) 10Giuseppe Lavagetto: proton: remove all puppet code, other references to the non-k8s service [puppet] - 10https://gerrit.wikimedia.org/r/627861 (https://phabricator.wikimedia.org/T255877) [14:56:57] (03CR) 10Muehlenhoff: [C: 03+2] Enable CAS on an-tool1009 [puppet] - 10https://gerrit.wikimedia.org/r/627826 (owner: 10Muehlenhoff) [14:57:03] (03PS1) 10Jbond: stdlib: backport datasize type [puppet] - 
10https://gerrit.wikimedia.org/r/627862 [14:58:08] 04̶C̶r̶i̶t̶i̶c̶a̶l Device ps1-c7-eqiad.mgmt.eqiad.wmnet recovered from Device rebooted [14:58:14] 04̶C̶r̶i̶t̶i̶c̶a̶l Device ps1-c6-eqiad.mgmt.eqiad.wmnet recovered from Device rebooted [15:00:54] (03CR) 10Jbond: "not sure if this is the best way to update stdlib?" [puppet] - 10https://gerrit.wikimedia.org/r/627862 (owner: 10Jbond) [15:02:30] (03CR) 10JMeybohm: "Duplicate of https://gerrit.wikimedia.org/r/c/operations/puppet/+/627541 ?" [puppet] - 10https://gerrit.wikimedia.org/r/627857 (https://phabricator.wikimedia.org/T255877) (owner: 10Giuseppe Lavagetto) [15:03:47] (03PS1) 10RobH: pdu upgrade in c6/c7 [puppet] - 10https://gerrit.wikimedia.org/r/627864 (https://phabricator.wikimedia.org/T261457) [15:04:41] liw: are you done with the train? I’d like to deploy a few hopefully-harmless config changes :) [15:04:57] (03CR) 10RobH: [C: 03+2] pdu upgrade in c6/c7 [puppet] - 10https://gerrit.wikimedia.org/r/627864 (https://phabricator.wikimedia.org/T261457) (owner: 10RobH) [15:06:11] (03PS9) 10Lucas Werkmeister (WMDE): Add `wmgWikibaseClientItemAndPropertySourceName` to InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622612 (https://phabricator.wikimedia.org/T258060) (owner: 10Itamar Givon) [15:06:22] (03PS1) 10Filippo Giunchedi: profile: add queues to rsyslog kafka output [puppet] - 10https://gerrit.wikimedia.org/r/627865 (https://phabricator.wikimedia.org/T226703) [15:06:36] (03PS4) 10Lucas Werkmeister (WMDE): Use `wmgWikibaseClientItemAndPropertySourceName` instead of `wmgWikibaseClientLocalEntitySourceName` in Wikibase.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622993 (https://phabricator.wikimedia.org/T258060) (owner: 10Itamar Givon) [15:06:53] (03PS4) 10Lucas Werkmeister (WMDE): Remove `wmgWikibaseClientLocalEntitySourceName` from InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622994 (https://phabricator.wikimedia.org/T258060) (owner: 
10Itamar Givon) [15:10:30] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops, 10Patch-For-Review: Wed, Sept 16 PDU Upgrade 12pm-4pm UTC- Racks C6 and C7 - https://phabricator.wikimedia.org/T261457 (10RobH) [15:12:34] (03CR) 10Mholloway: [C: 03+1] "Tested various requests in staging and all worked well." [deployment-charts] - 10https://gerrit.wikimedia.org/r/627517 (https://phabricator.wikimedia.org/T255878) (owner: 10Giuseppe Lavagetto) [15:13:17] hmmm. what host am I supposed to run `logspam-watch` on at the moment? [15:13:24] on mwlog2001 it says files don’t exist [15:14:36] hm, on mwlog1001 it seems to work… guess I’ll use that one [15:15:55] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Add `wmgWikibaseClientItemAndPropertySourceName` to InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622612 (https://phabricator.wikimedia.org/T258060) (owner: 10Itamar Givon) [15:16:56] (03Merged) 10jenkins-bot: Add `wmgWikibaseClientItemAndPropertySourceName` to InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622612 (https://phabricator.wikimedia.org/T258060) (owner: 10Itamar Givon) [15:17:36] RECOVERY - ps1-c7-eqiad-infeed-load-tower-A-phase-Y on ps1-c7-eqiad is OK: SNMP OK - ps1-c7-eqiad-infeed-load-tower-A-phase-Y 336 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:17:36] RECOVERY - ps1-c7-eqiad-infeed-load-tower-B-phase-X on ps1-c7-eqiad is OK: SNMP OK - ps1-c7-eqiad-infeed-load-tower-B-phase-X 389 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:19:42] PROBLEM - IPMI Sensor Status on mw1383 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:20:14] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1003/25137/" [puppet] - 
10https://gerrit.wikimedia.org/r/627865 (https://phabricator.wikimedia.org/T226703) (owner: 10Filippo Giunchedi) [15:20:34] (03CR) 10Marostegui: [C: 03+2] db1121: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/627855 (owner: 10Marostegui) [15:21:07] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:622612|Add `wmgWikibaseClientItemAndPropertySourceName` to InitialiseSettings.php (T258060)]] (duration: 01m 06s) [15:21:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:14] T258060: localEntitySourceName is a misleading setting in some cases - https://phabricator.wikimedia.org/T258060 [15:21:55] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Use `wmgWikibaseClientItemAndPropertySourceName` instead of `wmgWikibaseClientLocalEntitySourceName` in Wikibase.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622993 (https://phabricator.wikimedia.org/T258060) (owner: 10Itamar Givon) [15:22:42] (03Merged) 10jenkins-bot: Use `wmgWikibaseClientItemAndPropertySourceName` instead of `wmgWikibaseClientLocalEntitySourceName` in Wikibase.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622993 (https://phabricator.wikimedia.org/T258060) (owner: 10Itamar Givon) [15:25:43] (03PS1) 10JMeybohm: Use Kernel 4.19 on kubestage1002 [puppet] - 10https://gerrit.wikimedia.org/r/627867 (https://phabricator.wikimedia.org/T262527) [15:25:45] (03PS1) 10JMeybohm: Use Kernel 4.19 on staging cluster nodes [puppet] - 10https://gerrit.wikimedia.org/r/627868 (https://phabricator.wikimedia.org/T262527) [15:26:12] PROBLEM - IPMI Sensor Status on analytics1042 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:26:18] (03CR) 10jerkins-bot: [V: 04-1] Use Kernel 4.19 on staging cluster nodes [puppet] - 
10https://gerrit.wikimedia.org/r/627868 (https://phabricator.wikimedia.org/T262527) (owner: 10JMeybohm) [15:27:22] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/Wikibase.php: Config: [[gerrit:622993|Use `wmgWikibaseClientItemAndPropertySourceName` instead of `wmgWikibaseClientLocalEntitySourceName` in Wikibase.php (T258060)]] (duration: 01m 02s) [15:27:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:29] T258060: localEntitySourceName is a misleading setting in some cases - https://phabricator.wikimedia.org/T258060 [15:27:39] (03PS1) 10Vgutierrez: api: Allow acme-chief clients to fetch alt. chain cert versions [software/acme-chief] - 10https://gerrit.wikimedia.org/r/627869 (https://phabricator.wikimedia.org/T263006) [15:28:52] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Remove `wmgWikibaseClientLocalEntitySourceName` from InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622994 (https://phabricator.wikimedia.org/T258060) (owner: 10Itamar Givon) [15:29:06] (03PS1) 10Elukey: Update yarn.wikimedia.org's public cert [puppet] - 10https://gerrit.wikimedia.org/r/627870 [15:29:37] (03CR) 10Elukey: [C: 03+2] Update yarn.wikimedia.org's public cert [puppet] - 10https://gerrit.wikimedia.org/r/627870 (owner: 10Elukey) [15:29:46] (03Merged) 10jenkins-bot: Remove `wmgWikibaseClientLocalEntitySourceName` from InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622994 (https://phabricator.wikimedia.org/T258060) (owner: 10Itamar Givon) [15:30:15] (03PS1) 10Lucas Werkmeister (WMDE): Rename wmgWikibaseClientLocalEntitySourceName to wmgWikibaseClientItemAndPropertySourceName on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627871 [15:30:43] (03CR) 10Alexandros Kosiaris: [C: 04-2] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/627862 (owner: 10Jbond) [15:31:16] (03CR) 10Alexandros Kosiaris: [C: 03+1] Use Kernel 4.19 on kubestage1002 [puppet] - 
10https://gerrit.wikimedia.org/r/627867 (https://phabricator.wikimedia.org/T262527) (owner: 10JMeybohm) [15:32:16] Lucas_WMDE, yes, done with the train - sorry, was out for a bit [15:32:19] (03CR) 10Alexandros Kosiaris: [C: 03+1] "LGTM, but a commit message fix is in order" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/627868 (https://phabricator.wikimedia.org/T262527) (owner: 10JMeybohm) [15:32:23] ok thanks :) [15:32:54] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Rename wmgWikibaseClientLocalEntitySourceName to wmgWikibaseClientItemAndPropertySourceName on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627871 (owner: 10Lucas Werkmeister (WMDE)) [15:33:36] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:622994|Remove `wmgWikibaseClientLocalEntitySourceName` from InitialiseSettings.php (T258060)]] (duration: 01m 05s) [15:33:38] (03Merged) 10jenkins-bot: Rename wmgWikibaseClientLocalEntitySourceName to wmgWikibaseClientItemAndPropertySourceName on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627871 (owner: 10Lucas Werkmeister (WMDE)) [15:33:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:42] T258060: localEntitySourceName is a misleading setting in some cases - https://phabricator.wikimedia.org/T258060 [15:33:46] !log ryankemper@deploy1001 Started deploy [wdqs/wdqs@b7e2d0b]: 0.3.48 [15:33:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:36] (03PS2) 10JMeybohm: Use Kernel 4.19 on staging cluster nodes [puppet] - 10https://gerrit.wikimedia.org/r/627868 (https://phabricator.wikimedia.org/T262527) [15:35:10] (03CR) 10JMeybohm: Use Kernel 4.19 on staging cluster nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/627868 (https://phabricator.wikimedia.org/T262527) (owner: 10JMeybohm) [15:35:41] !log Canary `wdqs1003` query tests looks good, proceeding to wdqs deploy for rest of fleet 
[15:35:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:19] 10Operations, 10Dumps-Generation, 10SDC General, 10Wikidata: Capacity planning for Commons Structured Data - https://phabricator.wikimedia.org/T226093 (10ArielGlenn) Updated (ouch!) {F32352585} [15:37:18] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:627871|Rename wmgWikibaseClientLocalEntitySourceName to wmgWikibaseClientItemAndPropertySourceName on Beta (T258060)]] (production no-op) (duration: 01m 04s) [15:37:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:14] PROBLEM - Host ps1-d7-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [15:40:16] RECOVERY - Host ps1-d8-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.93 ms [15:41:12] (03PS2) 10Giuseppe Lavagetto: wikifeeds: use the service proxy everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/627517 (https://phabricator.wikimedia.org/T255878) [15:41:38] RECOVERY - ps1-c7-eqiad-infeed-load-tower-A-phase-X on ps1-c7-eqiad is OK: SNMP OK - ps1-c7-eqiad-infeed-load-tower-A-phase-X 369 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:43:48] RECOVERY - ps1-c7-eqiad-infeed-load-tower-A-phase-Z on ps1-c7-eqiad is OK: SNMP OK - ps1-c7-eqiad-infeed-load-tower-A-phase-Z 298 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:44:12] 10Operations, 10Dumps-Generation, 10SDC General, 10Wikidata: Capacity planning for Commons Structured Data - https://phabricator.wikimedia.org/T226093 (10Marostegui) Following my IRC chat with @ArielGlenn - `revision` and `slots` table on s4 (commonswiki) are still under reasonable sizes. We just decrease... 
[15:45:10] (03CR) 10Jbond: "> Patch Set 1: Code-Review-2" [puppet] - 10https://gerrit.wikimedia.org/r/627862 (owner: 10Jbond) [15:45:12] RECOVERY - IPMI Sensor Status on analytics1042 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:45:12] RECOVERY - IPMI Sensor Status on mw1383 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:46:02] RECOVERY - ps1-c7-eqiad-infeed-load-tower-B-phase-Y on ps1-c7-eqiad is OK: SNMP OK - ps1-c7-eqiad-infeed-load-tower-B-phase-Y 353 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:47:05] PROBLEM - IPMI Sensor Status on kafka-jumbo1009 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:47:05] PROBLEM - IPMI Sensor Status on an-worker1094 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:47:07] PROBLEM - IPMI Sensor Status on ms-be1039 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Power Supply 2 = Critical, Power Supplies = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:47:07] PROBLEM - IPMI Sensor Status on ms-be1037 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Power Supply 2 = Critical, Power Supplies = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:47:49] RECOVERY - ps1-c7-eqiad-infeed-load-tower-B-phase-Z on 
ps1-c7-eqiad is OK: SNMP OK - ps1-c7-eqiad-infeed-load-tower-B-phase-Z 310 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:48:26] !log ryankemper@deploy1001 Finished deploy [wdqs/wdqs@b7e2d0b]: 0.3.48 (duration: 14m 40s) [15:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:48] (03CR) 10Giuseppe Lavagetto: [C: 03+2] wikifeeds: use the service proxy everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/627517 (https://phabricator.wikimedia.org/T255878) (owner: 10Giuseppe Lavagetto) [15:49:07] !log `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'` [15:49:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:12] (03CR) 10Alexandros Kosiaris: [C: 04-2] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/627862 (owner: 10Jbond) [15:49:56] !log sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 60 && systemctl restart wdqs-categories && sleep 30 && pool'; sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories' [15:49:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:08] (03CR) 10Volans: [C: 03+1] "LGTM! See one last comment inline, no need for re-review if you decide to add it." (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/627272 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [15:50:57] (03Merged) 10jenkins-bot: wikifeeds: use the service proxy everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/627517 (https://phabricator.wikimedia.org/T255878) (owner: 10Giuseppe Lavagetto) [15:51:07] akosiaris: from your comments am i right in saying you think using puppet-module-puppetlabs-stdlib has more negatives than positives? (thats where im leaning but wanted to check) [15:52:06] jbond42: yeah, I think that's where I am leaning. 
[15:52:20] it's probably gonna cause more issues than solve problems [15:52:44] akosiaris: yes thats what im thinking as well thanks [15:53:46] (03Abandoned) 10Jbond: stdlib: backport datasize type [puppet] - 10https://gerrit.wikimedia.org/r/627862 (owner: 10Jbond) [15:53:54] (03PS3) 10Vgutierrez: requests: Fetch alternative chains [software/acme-chief] - 10https://gerrit.wikimedia.org/r/627823 (https://phabricator.wikimedia.org/T263006) [15:53:56] (03PS5) 10Vgutierrez: acme_chief: Save alternative chains [software/acme-chief] - 10https://gerrit.wikimedia.org/r/627835 [15:53:58] (03PS2) 10Vgutierrez: api: Allow acme-chief clients to fetch alt. chain cert versions [software/acme-chief] - 10https://gerrit.wikimedia.org/r/627869 (https://phabricator.wikimedia.org/T263006) [15:55:27] PROBLEM - hue-next.wikimedia.org requires authentication on an-tool1009 is CRITICAL: HTTP CRITICAL: Status line output matched HTTP/1.1 302 - header ocation: https://idp.wikim... not found on https://hue-next.wikimedia.org:443/ - 871 bytes in 0.014 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:56:58] (03PS1) 10Vgutierrez: Release 0.29 [software/acme-chief] - 10https://gerrit.wikimedia.org/r/627873 (https://phabricator.wikimedia.org/T263006) [15:57:15] PROBLEM - IPMI Sensor Status on mc-gp1003 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:57:15] PROBLEM - IPMI Sensor Status on kafka-jumbo1008 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:58:14] !log oblivian@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . 
[15:58:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:33] PROBLEM - Host ms-fe2008.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:00:49] !log oblivian@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [16:00:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:21] (03PS2) 10Herron: mx: remove spamhaus dnsbl lookups [puppet] - 10https://gerrit.wikimedia.org/r/627557 (https://phabricator.wikimedia.org/T262642) [16:04:25] (03PS35) 10Jbond: cookbook sre.pdu: Fix reboot logic and other minor fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/627272 (https://phabricator.wikimedia.org/T246890) [16:04:56] 10Operations, 10Acme-chief, 10Traffic, 10Patch-For-Review: Let's Encrypt transitioning to ISRG's Root - https://phabricator.wikimedia.org/T263006 (10Vgutierrez) So I've prepared a 0.29 release shipping https://gerrit.wikimedia.org/r/q/topic:%22T263006%22+(status:open%20OR%20status:merged) I've tested it m... 
[16:05:11] (03CR) 10Herron: [C: 03+2] mx: remove spamhaus dnsbl lookups [puppet] - 10https://gerrit.wikimedia.org/r/627557 (https://phabricator.wikimedia.org/T262642) (owner: 10Herron) [16:05:15] (03CR) 10Jbond: [C: 03+2] cookbook sre.pdu: Fix reboot logic and other minor fixes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/627272 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [16:05:23] (03CR) 10jerkins-bot: [V: 04-1] cookbook sre.pdu: Fix reboot logic and other minor fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/627272 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [16:06:05] RECOVERY - Host ms-fe2008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.82 ms [16:07:01] (03PS1) 10Klausman: root: Update .ruby-version to show 2.5.5 [puppet] - 10https://gerrit.wikimedia.org/r/627874 [16:07:38] (03CR) 10jerkins-bot: [V: 04-1] root: Update .ruby-version to show 2.5.5 [puppet] - 10https://gerrit.wikimedia.org/r/627874 (owner: 10Klausman) [16:08:29] (03PS36) 10Jbond: cookbook sre.pdu: Fix reboot logic and other minor fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/627272 (https://phabricator.wikimedia.org/T246890) [16:09:12] !log reinstall buster on an-tool1009 after a lot of tests (ganeti VM, so it is a manual work) [16:09:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:13] 10Operations, 10MediaWiki-General, 10serviceops, 10Patch-For-Review, 10Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (10Joe) [16:12:56] !log `wdqs` deploy complete, service is healthy [16:12:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:54] !log Start mysql on db1093, db1109 and db1123 after pdu work is done [16:13:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:13] 10Operations, 10ops-eqiad, 10DC-Ops: New Date - Wed, Sept 16 
PDU Upgrade 12pm-4pm UTC- Racks D7 and D8 - https://phabricator.wikimedia.org/T261454 (10Marostegui) [16:15:30] 10Operations, 10ops-eqiad, 10DC-Ops: New Date - Wed, Sept 16 PDU Upgrade 12pm-4pm UTC- Racks D7 and D8 - https://phabricator.wikimedia.org/T261454 (10Marostegui) >>! In T261454#6465531, @Marostegui wrote: > mysql stopped on db1123 (s3 master), db1093 (s6 master) and db1109 (s8 master) hosts restarted after... [16:16:05] (03PS1) 10Ottomata: refine.pp - use eventlogging_legacy job to refine Test schema events [puppet] - 10https://gerrit.wikimedia.org/r/627876 (https://phabricator.wikimedia.org/T251609) [16:16:09] (03PS2) 10Klausman: root: Update .ruby-version to show 2.5.5 [puppet] - 10https://gerrit.wikimedia.org/r/627874 [16:16:20] 10Operations, 10ops-eqiad, 10DC-Ops: Wed, Sept 16 PDU Upgrade 12pm-4pm UTC- Racks C6 and C7 - https://phabricator.wikimedia.org/T261457 (10Marostegui) >>! In T261457#6465530, @Marostegui wrote: > mysql stopped on db1121 (sanitarium master) host started after PDU work is done [16:17:14] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: Tue, Sept 15 PDU Upgrade 12pm-4pm UTC- Racks C4 and C5 - https://phabricator.wikimedia.org/T261456 (10Marostegui) The on-site work here was fully done or is there anything pending that requires power changes? 
:) [16:17:20] (03PS3) 10Klausman: root: Delete .ruby-version [puppet] - 10https://gerrit.wikimedia.org/r/627874 [16:18:53] RECOVERY - Host ps1-d7-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.29 ms [16:18:57] PROBLEM - ps1-d7-eqiad-infeed-load-tower-A-phase-Z on ps1-d7-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:19:08] (03CR) 10Giuseppe Lavagetto: [C: 03+1] root: Delete .ruby-version [puppet] - 10https://gerrit.wikimedia.org/r/627874 (owner: 10Klausman) [16:21:15] (03CR) 10Klausman: [C: 03+2] root: Delete .ruby-version [puppet] - 10https://gerrit.wikimedia.org/r/627874 (owner: 10Klausman) [16:22:22] (03CR) 10BryanDavis: [C: 03+1] toolforge grid: Remove some old scripts we don't use anymore [puppet] - 10https://gerrit.wikimedia.org/r/627628 (https://phabricator.wikimedia.org/T247364) (owner: 10Bstorm) [16:22:57] PROBLEM - ps1-d7-eqiad-infeed-load-tower-B-phase-Z on ps1-d7-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:23:13] PROBLEM - ps1-d7-eqiad-infeed-load-tower-A-phase-X on ps1-d7-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:23:15] PROBLEM - ps1-d7-eqiad-infeed-load-tower-B-phase-Y on ps1-d7-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:23:25] (03PS1) 10Razzi: Add types to profile::analytics classes [puppet] - 10https://gerrit.wikimedia.org/r/627878 (https://phabricator.wikimedia.org/T213741) [16:23:43] PROBLEM - ps1-d7-eqiad-infeed-load-tower-A-phase-Y on ps1-d7-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:24:00] 10Operations, 10ops-eqiad, 10DC-Ops: Wed, Sept 16 PDU Upgrade 12pm-4pm UTC- Racks C6 and C7 - https://phabricator.wikimedia.org/T261457 (10RobH) [16:24:02] 10Operations, 10ops-eqiad, 10DC-Ops: New Date - Wed, Sept 16 PDU Upgrade 12pm-4pm UTC- Racks D7 and D8 - https://phabricator.wikimedia.org/T261454 (10RobH) [16:24:11] PROBLEM - ps1-d7-eqiad-infeed-load-tower-B-phase-X on ps1-d7-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:24:19] expected for d7 and d8 pdu errors [16:24:26] you can ignore they'll go away in the next 10 minutes or so [16:28:32] 10Operations, 10Language-Team, 10Wikimedia-Mailing-lists: localisation-team mailing list to be archived and made read-only - https://phabricator.wikimedia.org/T262788 (10Arrbee) All good. Thanks for the quick response on this @RLazarus [16:28:37] RECOVERY - IPMI Sensor Status on kafka-jumbo1008 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:28:37] RECOVERY - IPMI Sensor Status on an-worker1094 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:28:39] RECOVERY - IPMI Sensor Status on mc-gp1003 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:28:39] RECOVERY - IPMI Sensor Status on kafka-jumbo1009 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:28:39] RECOVERY - IPMI Sensor Status on ms-be1039 is OK: Sensor Type(s) Temperature, Power_Supply 
Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:28:39] RECOVERY - IPMI Sensor Status on ms-be1037 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:29:37] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10serviceops, and 2 others: Deploy push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973 (10jijiki) [16:32:17] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [16:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:07] 10Operations, 10ops-eqiad, 10DC-Ops: New Date - Wed, Sept 16 PDU Upgrade 12pm-4pm UTC- Racks D7 and D8 - https://phabricator.wikimedia.org/T261454 (10RobH) [16:34:17] 10Operations, 10Acme-chief, 10Traffic, 10Patch-For-Review: Let's Encrypt transitioning to ISRG's Root - https://phabricator.wikimedia.org/T263006 (10BBlack) @Vgutierrez Yes, sounds good! 
[16:35:26] (03PS1) 10RobH: d7 d8 pdu upgrades [puppet] - 10https://gerrit.wikimedia.org/r/627880 (https://phabricator.wikimedia.org/T261454) [16:35:46] (03PS1) 10Jdlrobson: Check $coords matched some nodes before comparing contents [extensions/MobileFrontend] (wmf/1.36.0-wmf.9) - 10https://gerrit.wikimedia.org/r/627793 (https://phabricator.wikimedia.org/T263034) [16:36:01] (03CR) 10RobH: [C: 03+2] d7 d8 pdu upgrades [puppet] - 10https://gerrit.wikimedia.org/r/627880 (https://phabricator.wikimedia.org/T261454) (owner: 10RobH) [16:39:39] PROBLEM - Host kubernetes2013.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:39:39] PROBLEM - Host kubernetes2014.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:40:09] PROBLEM - Host ps1-d6-codfw is DOWN: PING CRITICAL - Packet loss = 100% [16:40:27] PROBLEM - Host db2101.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:40:43] !log hnowlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [16:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:49] PROBLEM - Host db2140.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:40:51] PROBLEM - Host dbproxy2004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:43:17] PROBLEM - Host db2074.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:43:21] PROBLEM - Host db2084.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:43:21] PROBLEM - Host db2130.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:43:39] PROBLEM - Host es2019.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:44:15] 10Operations, 10serviceops, 10User-jijiki: Reproduce opcache corruptions in production - https://phabricator.wikimedia.org/T261009 (10jijiki) [16:44:39] RECOVERY - ps1-d7-eqiad-infeed-load-tower-B-phase-Z on ps1-d7-eqiad is OK: SNMP OK - ps1-d7-eqiad-infeed-load-tower-B-phase-Z 472 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:44:55] RECOVERY - 
ps1-d7-eqiad-infeed-load-tower-A-phase-X on ps1-d7-eqiad is OK: SNMP OK - ps1-d7-eqiad-infeed-load-tower-A-phase-X 430 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:44:57] RECOVERY - ps1-d7-eqiad-infeed-load-tower-B-phase-Y on ps1-d7-eqiad is OK: SNMP OK - ps1-d7-eqiad-infeed-load-tower-B-phase-Y 285 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:45:29] RECOVERY - ps1-d7-eqiad-infeed-load-tower-A-phase-Z on ps1-d7-eqiad is OK: SNMP OK - ps1-d7-eqiad-infeed-load-tower-A-phase-Z 459 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:45:29] RECOVERY - ps1-d7-eqiad-infeed-load-tower-A-phase-Y on ps1-d7-eqiad is OK: SNMP OK - ps1-d7-eqiad-infeed-load-tower-A-phase-Y 256 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:45:29] RECOVERY - ps1-d7-eqiad-infeed-load-tower-B-phase-X on ps1-d7-eqiad is OK: SNMP OK - ps1-d7-eqiad-infeed-load-tower-B-phase-X 405 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:45:57] !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' . 
[16:46:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:56] 10Operations, 10ops-eqiad, 10DC-Ops: New Date - Wed, Sept 16 PDU Upgrade 12pm-4pm UTC- Racks D7 and D8 - https://phabricator.wikimedia.org/T261454 (10RobH) [16:47:59] 10Operations, 10ops-eqiad, 10DC-Ops: New Date - Wed, Sept 16 PDU Upgrade 12pm-4pm UTC- Racks D7 and D8 - https://phabricator.wikimedia.org/T261454 (10RobH) [16:48:51] RECOVERY - Host es2019.mgmt is UP: PING WARNING - Packet loss = 80%, RTA = 33.84 ms [16:48:53] RECOVERY - Host ps1-d6-codfw is UP: PING OK - Packet loss = 0%, RTA = 34.42 ms [16:49:15] RECOVERY - Host dbproxy2004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.08 ms [16:49:45] RECOVERY - Host db2130.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.11 ms [16:49:55] RECOVERY - Host db2140.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.15 ms [16:50:16] RECOVERY - Host kubernetes2013.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.10 ms [16:50:16] RECOVERY - Host kubernetes2014.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.10 ms [16:50:51] RECOVERY - Host db2101.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.44 ms [16:53:09] RECOVERY - Host db2074.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.85 ms [16:53:45] RECOVERY - Host db2084.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.85 ms [16:55:54] (03PS1) 10Hnowlan: api-gateway: allow prod instances to hit mwdebug [deployment-charts] - 10https://gerrit.wikimedia.org/r/627882 (https://phabricator.wikimedia.org/T262396) [16:56:24] jouncebot: now [16:56:24] No deployments scheduled for the next 1 hour(s) and 3 minute(s) [16:57:38] (03CR) 10CDanis: [C: 03+1] icinga: reload when nagios config changes [puppet] - 10https://gerrit.wikimedia.org/r/627849 (https://phabricator.wikimedia.org/T263027) (owner: 10Filippo Giunchedi) [17:00:09] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) 
is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed conten [17:00:09] 016 responds with unexpected value at path /mostread/articles = Expected 1 array elements, gotten 0 https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:02:22] (03CR) 10Ppchelko: Api-gateway: Implement support for X-Wikimedia-Debug header (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/627588 (https://phabricator.wikimedia.org/T262396) (owner: 10Ppchelko) [17:02:42] (03PS2) 10Volans: sre.dns.decommission: improve the DNS automation [cookbooks] - 10https://gerrit.wikimedia.org/r/627834 (https://phabricator.wikimedia.org/T244153) [17:03:30] !log volans@cumin1001 START - Cookbook sre.dns.netbox [17:03:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:22] (03CR) 10Ppchelko: [C: 03+1] "I knew I was forgetting something :) it could not be that easy" [deployment-charts] - 10https://gerrit.wikimedia.org/r/627882 (https://phabricator.wikimedia.org/T262396) (owner: 10Hnowlan) [17:06:55] (03CR) 10Nskaggs: [C: 03+1] "One question on specifics of how to remove, but +1" [puppet] - 10https://gerrit.wikimedia.org/r/627628 (https://phabricator.wikimedia.org/T247364) (owner: 10Bstorm) [17:07:24] (03CR) 10Andrew Bogott: [C: 03+1] toolforge grid: Remove some old scripts we don't use anymore [puppet] - 10https://gerrit.wikimedia.org/r/627628 (https://phabricator.wikimedia.org/T247364) (owner: 10Bstorm) [17:07:56] (03CR) 10Nskaggs: [C: 03+1] toolforge grid: Remove some old scripts we don't use anymore (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/627628 (https://phabricator.wikimedia.org/T247364) (owner: 10Bstorm) [17:08:31] (03PS1) 10Jdlrobson: Enable Vector search in header on officewiki [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/627884 (https://phabricator.wikimedia.org/T262207) [17:09:07] (03CR) 10Ottomata: [C: 03+2] refine.pp - use eventlogging_legacy job to refine Test schema events [puppet] - 10https://gerrit.wikimedia.org/r/627876 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [17:09:15] (03CR) 10Hnowlan: [C: 03+2] api-gateway: allow prod instances to hit mwdebug [deployment-charts] - 10https://gerrit.wikimedia.org/r/627882 (https://phabricator.wikimedia.org/T262396) (owner: 10Hnowlan) [17:09:56] (03PS2) 10Jdlrobson: Enable Vector search in header on officewiki and testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627884 (https://phabricator.wikimedia.org/T262207) [17:11:41] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:10] (03Merged) 10jenkins-bot: api-gateway: allow prod instances to hit mwdebug [deployment-charts] - 10https://gerrit.wikimedia.org/r/627882 (https://phabricator.wikimedia.org/T262396) (owner: 10Hnowlan) [17:13:29] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content f [17:13:29] ) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with unexpected value at path /mostread/articles = Expected 1 array elements, gotten 0 https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:15:07] !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' . 
[17:15:10] 10Operations, 10ops-codfw, 10netops: (Need by: ) codfw:rack/setup/new management switches - https://phabricator.wikimedia.org/T253154 (10Papaul) [17:15:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:50] 10Operations, 10Product-Infrastructure-Data, 10Epic, 10Goal, 10Patch-For-Review: automatically collect network error reports from users' browsers (Network Error Logging API) - https://phabricator.wikimedia.org/T257527 (10CDanis) [17:19:12] (03CR) 10Brennen Bearnes: [C: 03+2] Check $coords matched some nodes before comparing contents [extensions/MobileFrontend] (wmf/1.36.0-wmf.9) - 10https://gerrit.wikimedia.org/r/627793 (https://phabricator.wikimedia.org/T263034) (owner: 10Jdlrobson) [17:19:25] Jdlrobson: apologies for delay; juggling meeting stuff - we're good with just that patch and not the test case? ^ [17:19:33] 10Operations, 10MW-on-K8s, 10TechCom-RFC, 10serviceops, 10Patch-For-Review: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (10daniel) This RFC is up for public discussion today at 21:00 UTC (23:00 CEST, 2pm PDT). The discussion is taking place on IRC,... 
[17:19:40] (03CR) 10Cwhite: icinga: reload when nagios config changes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/627849 (https://phabricator.wikimedia.org/T263027) (owner: 10Filippo Giunchedi) [17:21:49] (03CR) 10Cwhite: [V: 03+2 C: 03+2] am: use default alertmanager url when needed [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/627748 (owner: 10Filippo Giunchedi) [17:22:57] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=205 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:23:45] (03CR) 10Cwhite: [V: 03+2 C: 03+2] am: default interval to 10s [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/627750 (owner: 10Filippo Giunchedi) [17:24:53] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:28:22] (03CR) 10Cwhite: [V: 03+2 C: 03+2] am: handle missing scheduled_downtime_depth [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/627782 (owner: 10Filippo Giunchedi) [17:28:23] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content fo [17:28:23] responds with unexpected value at path /mostread/articles = Expected 1 array elements, gotten 0 https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:28:44] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Wikifeeds, 10serviceops: [Bug] The feed/featured endpoint is broken - https://phabricator.wikimedia.org/T263043 (10Mholloway) [17:29:15] (03PS1) 10Elukey: hue: use 'sync' with gunicorn when hue 4 is selected [puppet] - 10https://gerrit.wikimedia.org/r/627886 [17:29:47] (03CR) 10Elukey: [C: 03+2] hue: use 'sync' with gunicorn when hue 4 is selected [puppet] - 10https://gerrit.wikimedia.org/r/627886 (owner: 10Elukey) [17:30:21] 10Operations, 10ops-eqiad, 10DC-Ops: Physically move db1131 from B5 to C8 - https://phabricator.wikimedia.org/T262901 (10Cmjohnson) @marostegui is going to schedule this down tomorrow 17 Sept to be relocated to C8 [17:32:31] (03CR) 10Cwhite: [V: 03+2 C: 03+2] Move service problem parsing/handling to am [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/627749 (owner: 10Filippo Giunchedi) [17:36:00] 
10Operations, 10Product-Infrastructure-Team-Backlog, 10Wikifeeds, 10serviceops: [Bug] The feed/featured endpoint is broken - https://phabricator.wikimedia.org/T263043 (10Mholloway) I suspect that this line is our culprit. Does it need to be updated to use https? https://gerrit.wikimedia.org/r/plugins/giti... [17:36:49] (03CR) 10Dduvall: [C: 03+1] "Looks right to me." [puppet] - 10https://gerrit.wikimedia.org/r/624972 (https://phabricator.wikimedia.org/T255835) (owner: 10Jeena Huneidi) [17:38:18] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Wikifeeds, 10serviceops: [Bug] The feed/featured endpoint is broken - https://phabricator.wikimedia.org/T263043 (10MSantos) >>! In T263043#6467333, @Mholloway wrote: > I suspect that this line is our culprit. Does it need to be updated to use https? h... [17:38:33] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Wikifeeds, 10serviceops: [Bug] The feed/featured endpoint is broken - https://phabricator.wikimedia.org/T263043 (10Joe) >>! In T263043#6467333, @Mholloway wrote: > I suspect that this line is our culprit. Does it need to be updated to use https? https... [17:40:52] (03Merged) 10jenkins-bot: Check $coords matched some nodes before comparing contents [extensions/MobileFrontend] (wmf/1.36.0-wmf.9) - 10https://gerrit.wikimedia.org/r/627793 (https://phabricator.wikimedia.org/T263034) (owner: 10Jdlrobson) [17:43:11] (03PS1) 10Cwhite: metrics static var to literal [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/627890 [17:46:23] Jdlrobson, edsanders: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MobileFrontend/+/627793/ is staged on mwdebug1002 if anyone would like to verify there. 
[17:48:13] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Wikifeeds, 10serviceops: [Bug] The feed/featured endpoint is broken - https://phabricator.wikimedia.org/T263043 (10Mholloway) For the record, the service doesn't seem to be having any trouble in staging: ` mholloway-shell@deploy1001:/srv/deployment-c... [17:48:37] (03PS1) 10Giuseppe Lavagetto: Revert "wikifeeds: use the service proxy everywhere" [deployment-charts] - 10https://gerrit.wikimedia.org/r/627796 [17:49:45] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Revert "wikifeeds: use the service proxy everywhere" [deployment-charts] - 10https://gerrit.wikimedia.org/r/627796 (owner: 10Giuseppe Lavagetto) [17:50:26] !log joal@deploy1001 Started deploy [analytics/refinery@07056b0]: Regular analytics weekly train [analytics/refinery@07056b0] [17:50:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:41] brennen: is there a way to see if an error is triggered, because I don't think there is a visible effect of the bug [17:51:07] !log oblivian@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . 
[17:51:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:19] edsanders: yeah, it will show up in logstash or in logspam-watch on mwlog1001 [17:51:35] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:51:35] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:51:39] PROBLEM - IPMI Sensor Status on kafka-jumbo1008 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [17:51:57] i can verify that a new error doesn't appear for a page load - was just trying to figure out how to hit mwdebug1002 while loading mobilefrontend [17:51:59] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:53:01] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Wikifeeds, 10serviceops: [Bug] The feed/featured endpoint is broken - https://phabricator.wikimedia.org/T263043 (10Joe) The configuration is exactly the same for staging and eqiad/codfw right now, so I'm not sure what's going on here. I'll have to dig... [17:53:04] 10Operations, 10Data-Services, 10SRE-Access-Requests, 10Patch-For-Review, 10cloud-services-team (Kanban): Enable access for wmcs-admins to run wmcs-prefixed cookbooks on cumin hosts - https://phabricator.wikimedia.org/T261145 (10faidon) Hey - this was brought to my attention, and we discussed it today at... 
[17:53:09] (03CR) 10Herron: [C: 03+1] "PCC looks ok https://puppet-compiler.wmflabs.org/compiler1003/25139/icinga1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/627849 (https://phabricator.wikimedia.org/T263027) (owner: 10Filippo Giunchedi) [17:53:32] sounds good then [17:56:34] edsanders: yeah, i think fix worked, at least based on loading a page that normally logs the notice with the firefox mobile emulation thingy in dev tools. deploying. [17:58:46] !log joal@deploy1001 Started deploy [analytics/refinery@07056b0] (thin): Regular analytics weekly train THIN [analytics/refinery@07056b0] [17:58:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:54] !log joal@deploy1001 Finished deploy [analytics/refinery@07056b0] (thin): Regular analytics weekly train THIN [analytics/refinery@07056b0] (duration: 00m 08s) [17:58:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:04] liw and brennen: That opportune time is upon us again. Time for a Train log triage with CPT deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200916T1800). [18:00:04] RoanKattouw, Niharika, and Urbanecm: How many deployers does it take to do Morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200916T1800). [18:00:04] Jdlrobson: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[18:00:09] !log brennen@deploy1001 Synchronized php-1.36.0-wmf.9/extensions/MobileFrontend: Backport: [[gerrit:627793|Check $coords matched some nodes before comparing contents (T263034)]] (duration: 01m 06s) [18:00:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:15] T263034: PHP Notice: Trying to get property 'textContent' of non-object - https://phabricator.wikimedia.org/T263034 [18:00:26] o/ here [18:00:35] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops: Telia IC-361191) patch - https://phabricator.wikimedia.org/T261791 (10RobH) I've been working with Chris on this today, and after testing, it was determined our optic in xe-3/3/7 was not outputting correctly. It measured a -1.x output from the module, but a p... [18:01:05] RoanKattouw / Niharika / Urbanecm: production should be clear, over to you. [18:01:47] I can deploy [18:02:42] (03CR) 10Catrope: [C: 03+2] Enable Vector search in header on officewiki and testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627884 (https://phabricator.wikimedia.org/T262207) (owner: 10Jdlrobson) [18:03:35] (03Merged) 10jenkins-bot: Enable Vector search in header on officewiki and testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627884 (https://phabricator.wikimedia.org/T262207) (owner: 10Jdlrobson) [18:04:43] 10Operations, 10MW-on-K8s, 10TechCom-RFC, 10serviceops, 10Patch-For-Review: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (10dr0ptp4kt) First of all, cool stuff! Second, I noticed the following: > I previously considered making the service be aware... [18:04:48] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops: Telia IC-361191) patch - https://phabricator.wikimedia.org/T261791 (10RobH) > Telia Implementation Team, > We ended up having to swap the optic on our end of the link, as its TX out was not acceptable.  Its now showing a much better TX light out (-5) at our dm... 
[18:04:57] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Wikifeeds, 10serviceops: [Bug] The feed/featured endpoint is broken - https://phabricator.wikimedia.org/T263043 (10MSantos) p:05Unbreak!→03High Keeping this alive until we can figure out the root cause. But this should not be broken anymore. [18:05:23] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Wikifeeds, 10serviceops: [Bug] The feed/featured endpoint is broken - https://phabricator.wikimedia.org/T263043 (10Joe) a:03Joe The revert solved the issue for now. Still, we need to figure out what was going wrong, most apparently in the wikifeeds -... [18:08:10] Jdlrobson: Your patch is on mwdebug2001, please test [18:12:09] RoanKattouw: almost done [18:12:39] RoanKattouw: we're good to sync! [18:14:12] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable Vector search in header on testwiki and officewiki (T262207) (duration: 01m 04s) [18:14:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:19] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-09-15) rack/setup/install db1150 (see note on hostname) - https://phabricator.wikimedia.org/T260817 (10Cmjohnson) [18:14:19] T262207: Deploy new search location and DOM order to officewiki and testwiki - https://phabricator.wikimedia.org/T262207 [18:14:21] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-09-15) rack/setup/install db1150 (see note on hostname) - https://phabricator.wikimedia.org/T260817 (10Cmjohnson) racked [18:14:27] 10Operations, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need By: 2020-09-30) rack/setup/install frmx1001 & frdata1002 - https://phabricator.wikimedia.org/T260181 (10Cmjohnson) [18:14:29] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Wikifeeds, 10serviceops: [Bug] The feed/featured endpoint is broken - https://phabricator.wikimedia.org/T263043 (10Joe) for the record, I just confirmed: eqiad gives a correct result as well: ` $ curl -s 
http://wikifeeds.svc.eqiad.wmnet:8889/en.wikipe... [18:14:34] 10Operations, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need By: 2020-09-30) rack/setup/install frmx1001 & frdata1002 - https://phabricator.wikimedia.org/T260181 (10Cmjohnson) racked [18:15:19] (03CR) 10Dzahn: [C: 03+2] ci/pipeline/builder.pp: Add ruamel package [puppet] - 10https://gerrit.wikimedia.org/r/624972 (https://phabricator.wikimedia.org/T255835) (owner: 10Jeena Huneidi) [18:16:05] Jdlrobson: All done [18:18:00] 10Operations, 10serviceops, 10User-jijiki: Reproduce opcache corruptions in production - https://phabricator.wikimedia.org/T261009 (10jijiki) 05Open→03Stalled TL;DR: we were unable to reproduce a corruption in this iterartion * I run the full set of URLs a few times using `opcache.protect_memory = 1`.... [18:18:18] 10Operations, 10serviceops, 10User-jijiki: Reproduce opcache corruptions in production - https://phabricator.wikimedia.org/T261009 (10jijiki) p:05Triage→03Low [18:18:26] RoanKattouw: thanks! and yay! [18:18:46] (03CR) 10Ottomata: "I'd like to merge this and deploy tomorrow my (US east coast) morning." [deployment-charts] - 10https://gerrit.wikimedia.org/r/619437 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [18:19:10] (03CR) 10Jeena Huneidi: "Thanks Dzahn!" 
[puppet] - 10https://gerrit.wikimedia.org/r/624972 (https://phabricator.wikimedia.org/T255835) (owner: 10Jeena Huneidi) [18:20:56] (03Abandoned) 10Dzahn: move 20 new codfw parsoid servers into production [puppet] - 10https://gerrit.wikimedia.org/r/579026 (https://phabricator.wikimedia.org/T247441) (owner: 10Dzahn) [18:21:32] (03PS7) 10Dzahn: labstore: add data types and some other style fixes [puppet] - 10https://gerrit.wikimedia.org/r/622666 [18:36:02] (03CR) 10BryanDavis: [C: 03+1] toolforge grid: Remove some old scripts we don't use anymore (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/627628 (https://phabricator.wikimedia.org/T247364) (owner: 10Bstorm) [18:37:00] 10Operations, 10MW-on-K8s, 10TechCom-RFC, 10serviceops, 10Patch-For-Review: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (10dpifke) >>! In T260330#6408193, @tstarling wrote: > Has anyone got an idea for giving the HMAC key to the server without allow... [18:37:13] (03CR) 10Herron: [C: 03+1] profile: add queues to rsyslog kafka output [puppet] - 10https://gerrit.wikimedia.org/r/627865 (https://phabricator.wikimedia.org/T226703) (owner: 10Filippo Giunchedi) [18:37:44] (03CR) 10Herron: [C: 03+1] hieradata: enable remote syslog queues in codfw [puppet] - 10https://gerrit.wikimedia.org/r/627816 (https://phabricator.wikimedia.org/T226703) (owner: 10Filippo Giunchedi) [18:39:08] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10wiki_willy) Thanks @Marostegui for the outline and @Papaul for the verbal context. The email's sent out to our account rep for escalation, so... 
[18:39:28] (03CR) 10Herron: [C: 03+1] base: remove obsolete enable_rsyslog_exporter [puppet] - 10https://gerrit.wikimedia.org/r/627821 (https://phabricator.wikimedia.org/T226703) (owner: 10Filippo Giunchedi) [18:44:39] (03PS1) 10Razzi: Add razzi to ops group [puppet] - 10https://gerrit.wikimedia.org/r/627895 (https://phabricator.wikimedia.org/T261443) [18:45:33] (03CR) 10Ottomata: [C: 03+2] Add razzi to ops group [puppet] - 10https://gerrit.wikimedia.org/r/627895 (https://phabricator.wikimedia.org/T261443) (owner: 10Razzi) [18:47:10] (03PS2) 10Razzi: Add types to profile::analytics classes [puppet] - 10https://gerrit.wikimedia.org/r/627878 (https://phabricator.wikimedia.org/T213741) [18:49:32] (03PS1) 10Volans: dns: make logging less noisy [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/627898 (https://phabricator.wikimedia.org/T244153) [18:49:34] (03PS1) 10Volans: dns: split public zones per DC [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/627899 (https://phabricator.wikimedia.org/T244153) [18:50:37] (03CR) 10CRusnov: [C: 03+2] sre.dns.decommission: improve the DNS automation [cookbooks] - 10https://gerrit.wikimedia.org/r/627834 (https://phabricator.wikimedia.org/T244153) (owner: 10Volans) [18:50:52] (03CR) 10CRusnov: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/627834 (https://phabricator.wikimedia.org/T244153) (owner: 10Volans) [18:51:59] (03CR) 10CRusnov: [C: 03+1] "looks good" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/627898 (https://phabricator.wikimedia.org/T244153) (owner: 10Volans) [18:54:05] 10Operations, 10Platform Engineering, 10Product-Infrastructure-Team-Backlog, 10RESTBase, and 3 others: High numbers of HTTP 429 errors - https://phabricator.wikimedia.org/T262691 (10holger.knust) [18:59:56] (03PS1) 10Razzi: Add types to profile::analytics classes [puppet] - 10https://gerrit.wikimedia.org/r/627903 (https://phabricator.wikimedia.org/T213741) [19:00:04] liw and brennen: Your horoscope 
predicts another unfortunate Mediawiki train - European+American Version (secondary timeslot) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200916T1900). [19:00:18] 10Operations, 10MW-on-K8s, 10TechCom-RFC, 10serviceops, 10Patch-For-Review: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (10daniel) >>! In T260330#6467549, @dpifke wrote: > The kernel keyring can have a key loaded onto it (via `keyctl`) which is usab... [19:02:02] (03PS3) 10Razzi: Add types to profile::analytics classes [puppet] - 10https://gerrit.wikimedia.org/r/627878 (https://phabricator.wikimedia.org/T213741) [19:03:10] train status: currently blocked, but at group1 as expected for today. nothing to do during this window unless a fix crops up for T263047. [19:03:10] T263047: Uncaught TypeError: Cannot read property 'node' of undefined - https://phabricator.wikimedia.org/T263047 [19:03:16] (03CR) 10Razzi: "Created this as a duplicate of https://gerrit.wikimedia.org/r/c/operations/puppet/+/627878" [puppet] - 10https://gerrit.wikimedia.org/r/627903 (https://phabricator.wikimedia.org/T213741) (owner: 10Razzi) [19:03:30] (03Abandoned) 10Razzi: Add types to profile::analytics classes [puppet] - 10https://gerrit.wikimedia.org/r/627903 (https://phabricator.wikimedia.org/T213741) (owner: 10Razzi) [19:06:10] 10Operations, 10MW-on-K8s, 10TechCom-RFC, 10serviceops, 10Patch-For-Review: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (10dpifke) >>! In T260330#6467753, @daniel wrote: > Can this be used from inside a docker container? I've used it with LXC conta... 
[19:11:03] (03CR) 10Ottomata: [C: 03+1] Add types to profile::analytics classes [puppet] - 10https://gerrit.wikimedia.org/r/627878 (https://phabricator.wikimedia.org/T213741) (owner: 10Razzi) [19:11:39] (03PS4) 10Razzi: Add types to profile::analytics classes [puppet] - 10https://gerrit.wikimedia.org/r/627878 (https://phabricator.wikimedia.org/T213741) [19:15:04] 10Operations, 10MediaWiki-Stakeholders-Group, 10TechCom-RFC, 10Traffic, and 3 others: RFC: API-driven web front-end - https://phabricator.wikimedia.org/T111588 (10Krinkle) [19:15:12] (03CR) 10Ottomata: [C: 03+2] Add types to profile::analytics classes [puppet] - 10https://gerrit.wikimedia.org/r/627878 (https://phabricator.wikimedia.org/T213741) (owner: 10Razzi) [19:17:08] 10Operations, 10MediaWiki-Stakeholders-Group, 10TechCom-RFC, 10Traffic, and 3 others: RFC: API-driven web front-end - https://phabricator.wikimedia.org/T111588 (10Krinkle) 05Open→03Declined Closing old RFC that is not yet on to [our 2020 process](https://www.mediawiki.org/wiki/Requests_for_comment) and... 
[19:17:26] (03CR) 10Volans: [C: 03+2] sre.dns.decommission: improve the DNS automation [cookbooks] - 10https://gerrit.wikimedia.org/r/627834 (https://phabricator.wikimedia.org/T244153) (owner: 10Volans) [19:17:46] (03CR) 10Volans: [C: 03+2] dns: make logging less noisy [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/627898 (https://phabricator.wikimedia.org/T244153) (owner: 10Volans) [19:18:23] (03Merged) 10jenkins-bot: sre.dns.decommission: improve the DNS automation [cookbooks] - 10https://gerrit.wikimedia.org/r/627834 (https://phabricator.wikimedia.org/T244153) (owner: 10Volans) [19:20:05] (03PS2) 10Dzahn: profile::scap::dsh: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/627391 [19:22:14] DannyS712: i think https://gerrit.wikimedia.org/r/c/mediawiki/core/+/627863 is waiting for you to upgrade James_F and Ammarpad's 2 C+1s into a C+2 [19:23:03] (and then i can cherry-pick to REL1_35 for Reedy, who loves changes to the release branch, I'm sure) [19:23:23] PROBLEM - SSH on ms-be2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:24:19] 10Puppet, 10Analytics, 10VPS-Projects: Puppet failing on wikistats.analytics.eqiad.wmflabs due to statistics::user - https://phabricator.wikimedia.org/T259307 (10razzi) 05Open→03Resolved This node was deleted. [19:27:05] RECOVERY - SSH on ms-be2017 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:00:04] halfak and accraze: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200916T2000). 
[20:01:42] 10Operations, 10OTRS, 10serviceops, 10Patch-For-Review, 10User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (10Framawiki) [20:04:09] PROBLEM - SSH on mw1346.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:04:47] !log robh@cumin1001 START - Cookbook sre.dns.netbox [20:04:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:29] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:06] !log cdanis@cumin1001 START - Cookbook sre.network.cf [20:19:06] !log cdanis@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [20:19:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:12] (03PS1) 10Volans: dns: correctly sort IPv6 PTR records [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/627909 (https://phabricator.wikimedia.org/T244153) [20:39:14] cscott reviewing now [20:42:58] +2'ed, will backport once it merges [20:48:32] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/25142/" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/627391 (owner: 10Dzahn) [20:49:03] cscott removed by +2, see patch [20:51:07] 10Operations, 10MW-on-K8s, 10TechCom-RFC, 10serviceops, 10Patch-For-Review: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (10Legoktm) >>! In T260330#6458637, @tstarling wrote: > An open question is what to do about shell pipelines. I didn't see any... [20:56:57] DannyS712: not in this patch! [20:57:24] As far as I understand it Reedy will make the "rc.4" section when he releases rc.4. 
[20:58:31] see d6ac54eb8c096bba5817481f692493e0120f0582 and b86370cca3c37494c98ec5dde375092db3eaf35f etc [20:58:45] cscott then it shouldn't be added at all? It shouldn't be listed twice, and we probably won't remember to remove it from lower down [20:58:51] (03CR) 10CRusnov: [C: 03+1] "lgtm i think :)" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/627909 (https://phabricator.wikimedia.org/T244153) (owner: 10Volans) [20:59:04] DannyS712: no, it's listed twice [20:59:15] it is? [20:59:23] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/621731/1/RELEASE-NOTES-1.35 [20:59:32] doesn't remove the stuff from further down. [21:00:00] the hard deprecation still gets listed in the "deprecation" section, where it belongs [21:00:22] the "changes from rc.3" might not even mention this patch, it's for reedy to list user-visible bugs fixed [21:00:37] okay [21:02:02] 10Operations, 10ops-eqiad, 10serviceops: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (10RobH) >>! In T262151#6462585, @Volans wrote: > The device is still active in Netbox, shouldn't be marked as failed? Yep, its not online so I'm putting it failed so the reports clear up in netbox.... [21:04:47] PROBLEM - Check systemd state on ms-be2017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:04:55] RECOVERY - SSH on mw1346.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:05:27] cscott for the backport should the release notes change also be backported? [21:05:57] DannyS712: yes, see https://gerrit.wikimedia.org/r/q/b86370cca3c37494c98ec5dde375092db3eaf35f [21:06:52] all the changes to RELEASE-NOTES-1.35 land first on master then get backported as-is to REL1_35 [21:07:06] How about you handle the backport? 
[21:07:13] will do :) [21:07:50] (03PS5) 10Dzahn: puppetmaster: (re)move hiera lookup for scripts to profiles [puppet] - 10https://gerrit.wikimedia.org/r/624335 [21:08:08] DannyS712: i'd like to have the hard deprecation ride the train next week before we re-apply the original 'remove support for $user=null' patch [21:08:24] okay [21:08:26] 10Operations, 10ops-eqiad, 10DC-Ops: Netbox report accounting icinga alert - https://phabricator.wikimedia.org/T250053 (10wiki_willy) 05Open→03Resolved Resolved the Netbox error alerts. Closing task. [21:08:30] that way we can watch the logs for deprecation notices for any corner cases we've missed [21:08:48] (which might have to be applied to REL1_35 as well, so i'm hoping we don't find any/many) [21:08:53] lol I removed my +2 and added it back in time for the original gated pipeline tests to be used instead of needing to run them again [21:09:09] fooling jenkins ftw [21:25:55] RECOVERY - Check systemd state on ms-be2017 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:27:31] 10Operations, 10Traffic, 10netops: Wikimedia projects not reachable for some Telecom Italia users - https://phabricator.wikimedia.org/T262869 (10Andyrom75) Today, all is fine. I would consider this issue closed and solved. [21:28:27] 10Operations, 10Traffic, 10netops: Wikimedia projects not reachable for some Telecom Italia users - https://phabricator.wikimedia.org/T262869 (10CDanis) >>! In T262869#6468336, @Andyrom75 wrote: > Today, all is fine. I would consider this issue closed and solved. Thanks Andy, but we are going to keep it ope... [21:29:44] 10Operations, 10DC-Ops, 10serviceops: mw2256 - CPU/board hardware issue - https://phabricator.wikimedia.org/T263065 (10Dzahn) [21:30:18] 10Operations, 10DC-Ops, 10serviceops: mw2256 - CPU/board hardware issue - https://phabricator.wikimedia.org/T263065 (10Dzahn) server is down and depooled. 
it can be worked on anytime [21:30:30] 10Operations, 10DC-Ops, 10serviceops: mw2256 - CPU/board hardware issue - https://phabricator.wikimedia.org/T263065 (10Dzahn) p:05Triage→03Medium [21:31:05] ACKNOWLEDGEMENT - Host mw2256 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T263065 [21:36:25] (03PS1) 10CDanis: depool esams [dns] - 10https://gerrit.wikimedia.org/r/627919 [21:37:24] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: mw2256 - CPU/board hardware issue - https://phabricator.wikimedia.org/T263065 (10Dzahn) [21:40:06] (03PS1) 10CDanis: prepend esams/knams [homer/public] - 10https://gerrit.wikimedia.org/r/627920 [21:43:29] (03PS1) 10Catrope: Homepage: Fix styling for mobile start module [extensions/GrowthExperiments] (wmf/1.36.0-wmf.9) - 10https://gerrit.wikimedia.org/r/627802 (https://phabricator.wikimedia.org/T258008) [21:45:05] (03PS1) 10CDanis: repool eqiad [dns] - 10https://gerrit.wikimedia.org/r/627922 [21:45:17] (03PS1) 10CDanis: prepend eqiad/eqord [homer/public] - 10https://gerrit.wikimedia.org/r/627923 [22:06:31] PROBLEM - Check systemd state on ms-be2017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:09:23] (03CR) 10Cwhite: [C: 03+1] base: remove obsolete enable_rsyslog_exporter [puppet] - 10https://gerrit.wikimedia.org/r/627821 (https://phabricator.wikimedia.org/T226703) (owner: 10Filippo Giunchedi) [22:09:40] (03CR) 10Cwhite: [C: 03+1] hieradata: enable remote syslog queues in codfw [puppet] - 10https://gerrit.wikimedia.org/r/627816 (https://phabricator.wikimedia.org/T226703) (owner: 10Filippo Giunchedi) [22:10:52] (03CR) 10Cwhite: [C: 03+1] profile: add queues to rsyslog kafka output [puppet] - 10https://gerrit.wikimedia.org/r/627865 (https://phabricator.wikimedia.org/T226703) (owner: 10Filippo Giunchedi) [22:12:01] (03CR) 10Cwhite: [V: 03+2 C: 03+2] Add test scaffolding for am [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/627751 (owner: 10Filippo Giunchedi) [22:12:08] (03CR) 10Cwhite: [V: 03+2 C: 03+2] metrics static var to literal [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/627890 (owner: 10Cwhite) [22:12:16] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1003/25144/puppetmaster2001.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/624335 (owner: 10Dzahn) [22:14:18] Hi. I'm trying to troubleshoot T262970. Would someone be willing to run `var_export( $wgAutopromoteOnce['onEdit'] );` on dewiki prod and post the result? 
[22:14:19] T262970: FlaggedRevs doesn't check the 'neverBlocked' / APCOND_FR_NEVERBLOCKED option when autopromoting - https://phabricator.wikimedia.org/T262970
[22:14:50] DannyS712: sure
[22:15:35] (CR) Dzahn: [C: +2] puppetmaster: (re)move hiera lookup for scripts to profiles [puppet] - https://gerrit.wikimedia.org/r/624335 (owner: Dzahn)
[22:16:08] DannyS712: https://phabricator.wikimedia.org/P12614
[22:16:20] done with [urbanecm@mwmaint2001 ~]$ echo 'var_export( $wgAutopromoteOnce['onEdit'] );' | mwscript eval.php --wiki=dewiki | phaste
[22:17:32] (CR) Dzahn: "noop on puppetmaster1001, 1002, 2001" [puppet] - https://gerrit.wikimedia.org/r/624335 (owner: Dzahn)
[22:18:51] (PS7) Dzahn: puppetdb: (re)move hiera lookup for db pass to profile [puppet] - https://gerrit.wikimedia.org/r/624340
[22:20:08] (CR) Dzahn: [C: +2] puppetdb: (re)move hiera lookup for db pass to profile [puppet] - https://gerrit.wikimedia.org/r/624340 (owner: Dzahn)
[22:22:28] (CR) Dzahn: "noop on puppetdb1002/2002" [puppet] - https://gerrit.wikimedia.org/r/624340 (owner: Dzahn)
[22:24:37] !log install prometheus-icinga-exporter 0.11 on icinga2001
[22:24:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:25:57] RECOVERY - Check systemd state on ms-be2017 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:31:03] (PS7) Dzahn: puppetmaster::backend: replace hiera with lookup, data types [puppet] - https://gerrit.wikimedia.org/r/624342 (https://phabricator.wikimedia.org/T209953)
[22:32:05] (Abandoned) Dzahn: puppetmaster::backend: replace hiera with lookup, data types [puppet] - https://gerrit.wikimedia.org/r/624342 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn)
[22:32:37] (PS4) Dzahn: puppetmaster: drop workers parameter as this can be inferred from servers [puppet] - https://gerrit.wikimedia.org/r/627754 (owner: Jbond)
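Note on the `mwscript eval.php` command logged at 22:16:20: the PHP snippet is wrapped in single quotes and itself contains `'onEdit'`, so the shell closes and reopens the quoting around `onEdit` and the inner quotes never reach PHP. The sketch below only approximates this with Python's `shlex` module, which follows POSIX-style quoting rules; it does not run `mwscript`, and the `'\''` variant is a suggested alternative, not something anyone in the channel ran. PHP versions before 8.0 treat an undefined bare constant as a string of the same name (with a warning), which may be why the command still returned the expected array.

```python
import shlex

# The echo part of the command as it appears in the log (up to the first pipe).
logged = "echo 'var_export( $wgAutopromoteOnce['onEdit'] );'"

# Hypothetical alternative using the '\'' idiom to keep the inner quotes.
escaped = r"echo 'var_export( $wgAutopromoteOnce['\''onEdit'\''] );'"


def first_argument(command):
    """Return the first argument of a command after POSIX-style word splitting."""
    return shlex.split(command, posix=True)[1]


logged_arg = first_argument(logged)
escaped_arg = first_argument(escaped)

print(logged_arg)   # inner quotes around onEdit are gone
print(escaped_arg)  # inner quotes around onEdit are preserved
```

A heredoc or a file passed to the script would avoid the nested quoting altogether.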
[22:35:59] PROBLEM - Disk space on graphite1004 is CRITICAL: DISK CRITICAL - free space: / 1444 MB (3% inode=97%): /tmp 1444 MB (3% inode=97%): /var/tmp 1444 MB (3% inode=97%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=graphite1004&var-datasource=eqiad+prometheus/ops
[22:39:37] (CR) Krinkle: Add performance settings for DPL and re-enable on ruwikinews (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/626919 (https://phabricator.wikimedia.org/T262240) (owner: Brian Wolff)
[22:47:26] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1003/25147/puppetmaster1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/627754 (owner: Jbond)
[22:48:18] Operations, Sentry: Procure hardware for Sentry - https://phabricator.wikimedia.org/T93138 (Jdlrobson) Can this and other sentry related tasks be declined now @tgr https://logstash.wikimedia.org/app/kibana#/dashboard/AXDBY8Qhh3Uj6x1zCF56
[22:51:02] (CR) Dzahn: "noop on puppetmaster1001,2001,2002..." [puppet] - https://gerrit.wikimedia.org/r/627754 (owner: Jbond)
[22:51:33] (PS3) Dzahn: ntp::daemon: replace hiera() with lookup(), lint [puppet] - https://gerrit.wikimedia.org/r/624332
[22:52:26] DannyS712: i guess we want to backport your patch? (whenever it's ready)
[23:00:04] RoanKattouw, Niharika, and Urbanecm: That opportune time is upon us again. Time for a Evening backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200916T2300).
[23:00:04] RoanKattouw and MatmaRex: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:00:17] I will deploy my own changes
[23:00:24] (I listed one, but two more are in CI)
[23:02:47] MatmaRex: You didn't list which patches you want to deploy, but I assume it's https://gerrit.wikimedia.org/r/c/mediawiki/extensions/FlaggedRevs/+/627461 and https://gerrit.wikimedia.org/r/c/mediawiki/extensions/FlaggedRevs/+/627462 ?
[23:03:04] hey
[23:03:08] one sec. it's not those
[23:03:31] ...which were deployed yesterday
[23:03:38] (CR) Catrope: [C: +2] Homepage: Fix styling for mobile start module [extensions/GrowthExperiments] (wmf/1.36.0-wmf.9) - https://gerrit.wikimedia.org/r/627802 (https://phabricator.wikimedia.org/T258008) (owner: Catrope)
[23:04:33] (PS1) Catrope: Homepage: Revert wider task card on desktop for now [extensions/GrowthExperiments] (wmf/1.36.0-wmf.9) - https://gerrit.wikimedia.org/r/627804 (https://phabricator.wikimedia.org/T263042)
[23:04:41] (CR) Catrope: [C: +2] Homepage: Revert wider task card on desktop for now [extensions/GrowthExperiments] (wmf/1.36.0-wmf.9) - https://gerrit.wikimedia.org/r/627804 (https://phabricator.wikimedia.org/T263042) (owner: Catrope)
[23:04:43] MatmaRex finished writing the patch, but I'd appreciate a full review before backporting given that everything I know about WANObjectCache I learned in the last hour
[23:04:52] i want to backport https://gerrit.wikimedia.org/r/c/mediawiki/extensions/FlaggedRevs/+/627803 but i need to review it first
[23:05:02] DannyS712: same
[23:05:31] DannyS712: i had a comment on the task: https://phabricator.wikimedia.org/T262970#6468577
[23:06:33] Current cached values are only `true` and `false`, which should all be ignored - not `=== 'priorBlock'` and fails `is_int` if I understand correctly
[23:09:13] MatmaRex V+2, ready for review
[23:12:17] DannyS712: thanks, i think that looks right.
although i think that $oldAsOf should be the same value as your timestamp, but it should be fine to be more explicit
[23:12:25] (Merged) jenkins-bot: Homepage: Fix styling for mobile start module [extensions/GrowthExperiments] (wmf/1.36.0-wmf.9) - https://gerrit.wikimedia.org/r/627802 (https://phabricator.wikimedia.org/T258008) (owner: Catrope)
[23:13:20] (Merged) jenkins-bot: Homepage: Revert wider task card on desktop for now [extensions/GrowthExperiments] (wmf/1.36.0-wmf.9) - https://gerrit.wikimedia.org/r/627804 (https://phabricator.wikimedia.org/T263042) (owner: Catrope)
[23:13:26] I don't know, I was just doing what used to work
[23:16:46] (PS1) Bartosz Dziewoński: Fix APCOND_FR_NEVERBLOCKED handling (part 2) [extensions/FlaggedRevs] (wmf/1.36.0-wmf.8) - https://gerrit.wikimedia.org/r/627805 (https://phabricator.wikimedia.org/T262970)
[23:16:55] (PS2) DannyS712: Fix APCOND_FR_NEVERBLOCKED handling (part 2) [extensions/FlaggedRevs] (wmf/1.36.0-wmf.8) - https://gerrit.wikimedia.org/r/627805 (https://phabricator.wikimedia.org/T262970) (owner: Bartosz Dziewoński)
[23:16:57] (PS1) Catrope: Fix width of sidebar modules in narrow mode in variant A [extensions/GrowthExperiments] (wmf/1.36.0-wmf.9) - https://gerrit.wikimedia.org/r/627946 (https://phabricator.wikimedia.org/T263068)
[23:17:01] (PS1) Bartosz Dziewoński: Fix APCOND_FR_NEVERBLOCKED handling (part 2) [extensions/FlaggedRevs] (wmf/1.36.0-wmf.9) - https://gerrit.wikimedia.org/r/627947 (https://phabricator.wikimedia.org/T262970)
[23:17:15] (CR) Catrope: [C: +2] Fix width of sidebar modules in narrow mode in variant A [extensions/GrowthExperiments] (wmf/1.36.0-wmf.9) - https://gerrit.wikimedia.org/r/627946 (https://phabricator.wikimedia.org/T263068) (owner: Catrope)
[23:17:17] (PS2) DannyS712: Fix APCOND_FR_NEVERBLOCKED handling (part 2) [extensions/FlaggedRevs] (wmf/1.36.0-wmf.9) - https://gerrit.wikimedia.org/r/627947
(https://phabricator.wikimedia.org/T262970) (owner: Bartosz Dziewoński)
[23:17:38] Wow CI has gotten a lot faster these days. Only 8 minutes to merge a wmf.9 patch
[23:17:55] ...I guess we hit cherry pick at the same time?
[23:18:05] looks like we did, heh
[23:18:08] Well he beat you by 16 seconds :)
[23:18:31] wait what happened here? how did the change numbers go from 627805 to 627947 in like fifteen seconds?
[23:18:44] anyway
[23:18:51] Hah yeah that's bizarre
[23:19:18] RoanKattouw: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/FlaggedRevs/+/627805/ and https://gerrit.wikimedia.org/r/c/mediawiki/extensions/FlaggedRevs/+/627947/ , if you're deploying, please
[23:20:10] Operations, ops-codfw, DBA, Patch-For-Review, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Papaul) @Marostegui Papaul, I understand that the last Bios update didn't fix the issue you were experiencing, however these recommendati...
[23:20:14] I am, will do
[23:20:26] 627805 seems to be the odd one out, the ones after it are all 12 hours old
[23:20:37] RoanKattouw MatmaRex for the record I have not tested these locally
[23:20:39] Maybe someone created a cherry-pick of that FR patch as a private draft or something?
[23:21:23] idk.
Turns out if I hit cherry pick after MatmaRex it uploads an identical PS2
[23:21:47] yeah, it'll just change some metadata
[23:22:42] once the patches work in prod will also need to be cherry picked to 1.35 and 1.34
[23:23:07] (CR) Catrope: [C: +2] Fix APCOND_FR_NEVERBLOCKED handling (part 2) [extensions/FlaggedRevs] (wmf/1.36.0-wmf.8) - https://gerrit.wikimedia.org/r/627805 (https://phabricator.wikimedia.org/T262970) (owner: Bartosz Dziewoński)
[23:23:11] (CR) Catrope: [C: +2] Fix APCOND_FR_NEVERBLOCKED handling (part 2) [extensions/FlaggedRevs] (wmf/1.36.0-wmf.9) - https://gerrit.wikimedia.org/r/627947 (https://phabricator.wikimedia.org/T262970) (owner: Bartosz Dziewoński)
[23:25:03] MatmaRex should this have incident documentation on wikitech?
[23:25:46] i'm not planning to write any
[23:26:40] (Merged) jenkins-bot: Fix width of sidebar modules in narrow mode in variant A [extensions/GrowthExperiments] (wmf/1.36.0-wmf.9) - https://gerrit.wikimedia.org/r/627946 (https://phabricator.wikimedia.org/T263068) (owner: Catrope)
[23:27:00] (Merged) jenkins-bot: Fix APCOND_FR_NEVERBLOCKED handling (part 2) [extensions/FlaggedRevs] (wmf/1.36.0-wmf.8) - https://gerrit.wikimedia.org/r/627805 (https://phabricator.wikimedia.org/T262970) (owner: Bartosz Dziewoński)
[23:28:01] (Merged) jenkins-bot: Fix APCOND_FR_NEVERBLOCKED handling (part 2) [extensions/FlaggedRevs] (wmf/1.36.0-wmf.9) - https://gerrit.wikimedia.org/r/627947 (https://phabricator.wikimedia.org/T262970) (owner: Bartosz Dziewoński)
[23:34:38] MatmaRex I have to go, but I'll check back later if there are more issues
[23:35:00] sure.
see you
[23:36:45] MatmaRex: The FR patches are now on mwdebug2001
[23:37:39] !log catrope@deploy1001 Synchronized php-1.36.0-wmf.9/extensions/GrowthExperiments/: Fix styling for mobile start module (T258008); Revert wider task card on desktop (T263042, T258704); Fix width of sidebar modules in narrow mode in variant A (T263068) (duration: 01m 09s)
[23:37:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:37:51] T258008: Variant C/D: smaller start module - https://phabricator.wikimedia.org/T258008
[23:37:52] T258704: Variant C/D: homepage Suggested edits module changes - Desktop - https://phabricator.wikimedia.org/T258704
[23:37:52] T263042: [wmf.9-regression] Homepage SE - suggested-edits-card-wrapper displayed wide - https://phabricator.wikimedia.org/T263042
[23:37:53] T263068: Mentor module and help module display broken on narrow desktop screens - https://phabricator.wikimedia.org/T263068
[23:38:03] RoanKattouw: i can't really test them, but the sites seem to be still working
[23:38:24] i'm planning to watch the autopromote logs later to see the effects
[23:38:24] OK, then I guess I'll deploy
[23:40:02] !log catrope@deploy1001 Synchronized php-1.36.0-wmf.9/extensions/FlaggedRevs: T262970 (duration: 01m 06s)
[23:40:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:40:08] T262970: FlaggedRevs doesn't check the 'neverBlocked' / APCOND_FR_NEVERBLOCKED option when autopromoting - https://phabricator.wikimedia.org/T262970
[23:41:08] !log catrope@deploy1001 Synchronized php-1.36.0-wmf.8/extensions/FlaggedRevs: T262970 (duration: 01m 06s)
[23:41:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:42:36] thanks RoanKattouw