[00:04:08] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01009 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [00:36:18] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 84115736 and 273 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:42:22] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 41634200 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:42:58] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 56025808 and 4 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:44:38] RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1388856 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:47:36] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 90731936 and 6 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:49:04] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 106375416 and 7 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:50:59] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 17888784 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:51:02] PROBLEM - Postgres Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 479745768 and 26 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:51:02] PROBLEM - Postgres Replication 
Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 386943816 and 20 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:51:02] PROBLEM - Postgres Replication Lag on maps2009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 32260544 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:51:18] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 382732456 and 19 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:52:36] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 35722488 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:54:02] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 754815632 and 37 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:54:04] PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 616951440 and 31 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:54:20] RECOVERY - Postgres Replication Lag on maps2005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 103944 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:54:36] RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 153288 and 18 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:55:42] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 3056 and 103 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:55:44] RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB 
template1 (host:localhost) 1688 and 86 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:55:56] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 4800 and 97 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:55:58] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 3832 and 99 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:56:02] RECOVERY - Postgres Replication Lag on maps2009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1936 and 103 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:56:02] RECOVERY - Postgres Replication Lag on maps2006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 89872 and 103 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:18:40] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access for Jan Jaquemot - https://phabricator.wikimedia.org/T267771 (10KFrancis) @Dzahn I am confirming the NDA is signed and complete! Thanks! [03:33:14] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01009 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [03:43:20] (03PS7) 10KartikMistry: Remove wgContentTranslationRESTBase config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634956 (https://phabricator.wikimedia.org/T266213) [03:43:36] (03PS2) 10KartikMistry: Update cxserver to 2020-11-23-050106-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/643139 (https://phabricator.wikimedia.org/T262253) [03:59:04] * kart_ updating cxserver.. 
[03:59:21] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2020-11-23-050106-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/643139 (https://phabricator.wikimedia.org/T262253) (owner: 10KartikMistry) [04:00:42] (03Merged) 10jenkins-bot: Update cxserver to 2020-11-23-050106-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/643139 (https://phabricator.wikimedia.org/T262253) (owner: 10KartikMistry) [04:11:10] !log kartik@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'staging' . [04:11:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:14:02] !log kartik@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'cxserver' for release 'production' . [04:14:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:18:35] !log kartik@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'cxserver' for release 'production' . [04:18:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:26:22] !log Updated cxserver to 2020-11-23-050106-production (T262253, T268410) [04:26:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:26:30] T268410: Create Wikipedia Saraiki - https://phabricator.wikimedia.org/T268410 [04:26:30] T262253: Improve MT support for Central Bikol with OpusMT - https://phabricator.wikimedia.org/T262253 [05:54:20] 10Operations, 10Gerrit-Privilege-Requests, 10LDAP-Access-Requests: Offboard Pablo-WMDE from WMF systems - https://phabricator.wikimedia.org/T268946 (10WMDE-leszek) >>! In T268946#6654407, @Aklapper wrote: > Thanks for filing this! I archived also https://phabricator.wikimedia.org/tag/user-pablo-wmde/ , wonde... 
[06:42:38] (03PS1) 10Marostegui: mariadb: Decommission es1015 [puppet] - 10https://gerrit.wikimedia.org/r/644083 (https://phabricator.wikimedia.org/T268810) [06:43:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission [06:43:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:47:24] (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission es1015 [puppet] - 10https://gerrit.wikimedia.org/r/644083 (https://phabricator.wikimedia.org/T268810) (owner: 10Marostegui) [06:47:38] !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [06:47:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:05:48] !log Stop mysql on db1124:3318 to clone clouddb1016:3318, lag will show up on wikireplicas on s8 T267090 [07:05:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:05:56] T267090: Productionize clouddb10[13-20] - https://phabricator.wikimedia.org/T267090 [07:09:22] PROBLEM - mysqld processes on db1124 is CRITICAL: PROCS CRITICAL: 3 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [07:09:40] PROBLEM - MariaDB read only s8 on db1124 is CRITICAL: Could not connect to localhost:3318 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [07:10:37] ^ me [07:10:40] but I thought I downtimed it [07:10:43] (03PS1) 10Marostegui: check_private_data: Add clouddb1016 and clouddb1020 [puppet] - 10https://gerrit.wikimedia.org/r/644084 (https://phabricator.wikimedia.org/T267090) [07:11:40] !log Deploy schema change on s1 codfw - T268004 [07:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:11:46] (03CR) 10Marostegui: [C: 03+2] check_private_data: Add clouddb1016 and clouddb1020 [puppet] - 10https://gerrit.wikimedia.org/r/644084 (https://phabricator.wikimedia.org/T267090) (owner: 10Marostegui) [07:11:47] T268004: Schema change for renaming 
namespace_title index on watchlist - https://phabricator.wikimedia.org/T268004 [07:18:34] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission [07:18:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:51] !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [07:25:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:41] (03PS2) 10Ayounsi: Run Homer during the decom cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/643979 [07:28:09] (03CR) 10jerkins-bot: [V: 04-1] Run Homer during the decom cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/643979 (owner: 10Ayounsi) [07:30:40] 10Operations, 10SRE-tools: Decommission cookbook failing to update DNS - https://phabricator.wikimedia.org/T268963 (10Marostegui) [07:31:51] (03PS3) 10Ayounsi: Run Homer during the decom cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/643979 [07:34:03] 10Operations, 10SRE-tools: Decommission cookbook failing to update DNS - https://phabricator.wikimedia.org/T268963 (10elukey) From `/var/log/spicerack/sre/hosts/decommission.log` on cumin1001: ` 2020-11-30 07:25:21,024 marostegui 6173 [INFO] Deploying the updated zonefiles on authdns[1001,2001].wikimedia.org,... [07:37:47] (03PS1) 10Marostegui: wmnet: Update esX-master cnames [dns] - 10https://gerrit.wikimedia.org/r/644086 (https://phabricator.wikimedia.org/T268963) [07:37:48] 10Operations, 10SRE-tools: Decommission cookbook failing to update DNS - https://phabricator.wikimedia.org/T268963 (10elukey) Ran the following from dns5001: ` root@dns5001:/srv/authdns/git# utils/deploy-check.py [..] error: CNAME 'es2-master.eqiad.wmnet.' points to known same-zone NXDOMAIN 'es1015.eqiad.wmne... [07:39:09] (03CR) 10Elukey: [C: 03+1] "Assuming that the hostnames are correct, looks good!" 
[dns] - 10https://gerrit.wikimedia.org/r/644086 (https://phabricator.wikimedia.org/T268963) (owner: 10Marostegui) [07:39:20] (03CR) 10Marostegui: [C: 03+2] wmnet: Update esX-master cnames [dns] - 10https://gerrit.wikimedia.org/r/644086 (https://phabricator.wikimedia.org/T268963) (owner: 10Marostegui) [07:41:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission [07:41:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:15] !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [07:45:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:23] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/630690 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [07:47:45] 10Operations, 10SRE-tools, 10Patch-For-Review: Decommission cookbook failing to update DNS - https://phabricator.wikimedia.org/T268963 (10Marostegui) After deploying the above changeset to address what @elukey found, I ran it again and even though it failed for other reasons, I think it has worked: ` $ sudo... 
[07:51:05] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission es1015.eqiad.wmnet - https://phabricator.wikimedia.org/T268810 (10Marostegui) [08:06:05] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, two nits inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/628436 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [08:24:17] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/630697 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [08:26:53] 10Operations: Better handling of memcached service - https://phabricator.wikimedia.org/T255132 (10jcrespo) > maybe we just hit a small/unfortunate time window while a systemd timer tried to execute the unit while it was reloaded or so Not the case- the timer didn't start on cumin2001 on the next iteration- it i... [08:36:44] !log Compare data between clouddb1016:3315 labsdb1012 T267090 [08:36:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:52] T267090: Productionize clouddb10[13-20] - https://phabricator.wikimedia.org/T267090 [08:41:08] !log swift eqiad-prod: add weight to ms-be106[0-3] - T268435 [08:41:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:15] T268435: Add ms-be106[0-3] to swift - https://phabricator.wikimedia.org/T268435 [08:48:40] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1022 - https://phabricator.wikimedia.org/T268856 (10fgiunchedi) [08:48:44] 10Operations, 10ops-eqiad, 10SRE-swift-storage: ms-be1022 smart storage battery failure; disk sdb possibly bad - https://phabricator.wikimedia.org/T267870 (10fgiunchedi) [08:51:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1089', diff saved to https://phabricator.wikimedia.org/P13463 and previous config saved to /var/cache/conftool/dbconfig/20201130-085101-marostegui.json [08:51:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[08:51:20] !log Deploy schema change on db1089 [08:51:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:26] 10Operations, 10Release-Engineering-Team-TODO, 10serviceops, 10Patch-For-Review, and 2 others: Upgrade MediaWiki appservers to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10MoritzMuehlenhoff) mwdebug1003 is working just fine in my tests, both for various page browsing and edits. [08:56:22] (03PS1) 10Joal: Correct analytics pageview_actor data purge [puppet] - 10https://gerrit.wikimedia.org/r/644172 (https://phabricator.wikimedia.org/T268382) [08:57:50] (03CR) 10jerkins-bot: [V: 04-1] Correct analytics pageview_actor data purge [puppet] - 10https://gerrit.wikimedia.org/r/644172 (https://phabricator.wikimedia.org/T268382) (owner: 10Joal) [09:04:18] (03PS2) 10Elukey: Correct analytics pageview_actor data purge [puppet] - 10https://gerrit.wikimedia.org/r/644172 (https://phabricator.wikimedia.org/T268382) (owner: 10Joal) [09:04:56] (03CR) 10Elukey: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/644172 (https://phabricator.wikimedia.org/T268382) (owner: 10Joal) [09:05:58] (03PS1) 10Jcrespo: mariadb: Change section command to include test-$section in its output [software] - 10https://gerrit.wikimedia.org/r/644173 [09:06:20] (03CR) 10Elukey: [C: 03+2] "Table name and hdfs path looks good, merging!" 
[puppet] - 10https://gerrit.wikimedia.org/r/644172 (https://phabricator.wikimedia.org/T268382) (owner: 10Joal) [09:06:55] (03CR) 10Marostegui: [C: 03+1] mariadb: Change section command to include test-$section in its output [software] - 10https://gerrit.wikimedia.org/r/644173 (owner: 10Jcrespo) [09:13:36] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={gerrit,gerrit-metrics} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:13:38] (03CR) 10Marostegui: [C: 03+2] mariadb: Change section command to include test-$section in its output [software] - 10https://gerrit.wikimedia.org/r/644173 (owner: 10Jcrespo) [09:15:14] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:21:39] (03PS1) 10Volans: remote: re-enable cumin's output [software/spicerack] - 10https://gerrit.wikimedia.org/r/644175 (https://phabricator.wikimedia.org/T212783) [09:21:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1089+ (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P13464 and previous config saved to /var/cache/conftool/dbconfig/20201130-092154-root.json [09:22:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:04] PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=- method=POST https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [09:23:40] RECOVERY - High average POST latency for mw requests on appserver in 
eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [09:25:37] (03CR) 10Elukey: [C: 03+1] remote: re-enable cumin's output [software/spicerack] - 10https://gerrit.wikimedia.org/r/644175 (https://phabricator.wikimedia.org/T212783) (owner: 10Volans) [09:26:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1089 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P13465 and previous config saved to /var/cache/conftool/dbconfig/20201130-092614-root.json [09:26:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:22] PROBLEM - Check systemd state on ms-be1060 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:28:35] (03CR) 10Ema: [C: 03+2] Systemd::Servicename: add '@' to Pattern [puppet] - 10https://gerrit.wikimedia.org/r/643934 (https://phabricator.wikimedia.org/T256467) (owner: 10Ema) [09:31:27] 10Operations, 10SRE-tools: systemd.timer not executing on cumin2001 after command was modified - https://phabricator.wikimedia.org/T268974 (10jcrespo) [09:31:48] 10Operations, 10SRE-tools: systemd.timer not executing on cumin2001 after command was modified - https://phabricator.wikimedia.org/T268974 (10jcrespo) p:05Triage→03Unbreak! 
[09:33:16] 10Operations, 10SRE-tools: systemd.timer not executing on cumin2001 after command was modified - https://phabricator.wikimedia.org/T268974 (10jcrespo) [09:34:34] RECOVERY - mysqld processes on db1124 is OK: PROCS OK: 4 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [09:34:39] 10Operations, 10SRE-tools: systemd.timer not executing on cumin2001 after command was modified - https://phabricator.wikimedia.org/T268974 (10jcrespo) [09:35:08] RECOVERY - MariaDB read only s8 on db1124 is OK: Version 10.1.44-MariaDB, Uptime 57s, read_only: True, event_scheduler: True, 3460.38 QPS, connection latency: 0.002224s, query latency: 0.000403s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [09:35:18] (03CR) 10Ema: [C: 03+2] atsmtail: make systemd unit depend on fifo-log-demux@ [puppet] - 10https://gerrit.wikimedia.org/r/643922 (https://phabricator.wikimedia.org/T256467) (owner: 10Ema) [09:35:26] (03PS1) 10Kormat: host-to-instance: Better error handling [software] - 10https://gerrit.wikimedia.org/r/644180 [09:35:41] 10Operations, 10SRE-tools: systemd.timer not executing on cumin2001 after command was modified - https://phabricator.wikimedia.org/T268974 (10jcrespo) [09:36:40] (03PS1) 10Elukey: admin: remove user nikerabbit from 'researchers' [puppet] - 10https://gerrit.wikimedia.org/r/644181 (https://phabricator.wikimedia.org/T268801) [09:37:03] (03CR) 10Elukey: [C: 03+2] admin: remove user nikerabbit from 'researchers' [puppet] - 10https://gerrit.wikimedia.org/r/644181 (https://phabricator.wikimedia.org/T268801) (owner: 10Elukey) [09:39:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1087 from s8 and pool db1092 instead temporarily on vslow T267090', diff saved to https://phabricator.wikimedia.org/P13466 and previous config saved to /var/cache/conftool/dbconfig/20201130-093909-marostegui.json [09:39:14] (03CR) 10Kormat: [C: 03+2] host-to-instance: Better error 
handling [software] - 10https://gerrit.wikimedia.org/r/644180 (owner: 10Kormat) [09:39:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:17] T267090: Productionize clouddb10[13-20] - https://phabricator.wikimedia.org/T267090 [09:39:51] (03PS1) 10Marostegui: db1087: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/644182 (https://phabricator.wikimedia.org/T267090) [09:39:57] (03Merged) 10jenkins-bot: host-to-instance: Better error handling [software] - 10https://gerrit.wikimedia.org/r/644180 (owner: 10Kormat) [09:40:39] !log Stop MySQL on db1087 to clone clouddb1016:3318 T267090) [09:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:49] (03CR) 10Marostegui: [C: 03+2] db1087: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/644182 (https://phabricator.wikimedia.org/T267090) (owner: 10Marostegui) [09:41:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1089 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P13467 and previous config saved to /var/cache/conftool/dbconfig/20201130-094117-root.json [09:41:19] Hey everyone, how can I know if a machine is serving production traffic? I want to see if maps10[05-10].eqiad.wmnet are serving production traffic [09:41:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:45] mateusbs17: if it is under lvs, maybe you can check pybal: https://config-master.wikimedia.org/pybal/ ? 
[09:43:40] mateusbs17: not sure if this is the right service, that I cannot say: https://config-master.wikimedia.org/pybal/eqiad/kartotherian [09:43:54] and https://config-master.wikimedia.org/pybal/eqiad/kartotherian-ssl [09:43:57] yep that's it [09:44:08] Thanks jynus [09:44:40] of course, an alternative answer would be "check the apache logs" if they are receiving requests [09:45:25] That's very informative, thanks [09:50:59] 10Operations, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Traffic, 10Epic: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 (10Nemo_bis) >>! In T261694#6652951, @Stephankn wrote: > Is it possible to allow wiki.openstreetmap.org? OpenStreetMap see... [09:52:18] RECOVERY - snapshot of x1 in codfw on alert1001 is OK: Last snapshot for x1 at codfw (db2101.codfw.wmnet:3320) taken on 2020-11-30 08:51:44 (244 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [09:55:12] RECOVERY - Check systemd state on ms-be1060 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:56:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1089 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P13468 and previous config saved to /var/cache/conftool/dbconfig/20201130-095621-root.json [09:56:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1134 for schema change', diff saved to https://phabricator.wikimedia.org/P13469 and previous config saved to /var/cache/conftool/dbconfig/20201130-095729-marostegui.json [09:57:33] (03CR) 10Volans: [C: 03+2] remote: re-enable cumin's output [software/spicerack] - 10https://gerrit.wikimedia.org/r/644175 (https://phabricator.wikimedia.org/T212783) (owner: 10Volans) [09:57:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
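(Editor's sketch for the 09:41-09:45 exchange about checking whether maps hosts serve production traffic via the pybal pool pages.) The check could be scripted roughly as below. This is only a sketch under assumptions the log does not confirm: that each line of a pool page is a Python-style dict literal with `host`, `weight` and `enabled` keys, and that a host counts as pooled when it is enabled with a non-zero weight. The function names and the sample data are hypothetical, and fetching the page text is left out.

```python
import ast


def parse_pool(text):
    """Parse pool-file text into a list of host entries.

    Assumption: one Python-style dict literal per line; blank lines and
    lines starting with '#' are skipped.
    """
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        entry = ast.literal_eval(line)  # safe evaluation of literals only
        if isinstance(entry, dict) and "host" in entry:
            entries.append(entry)
    return entries


def pooled_hosts(text):
    """Return the sorted hostnames that are enabled with a non-zero weight."""
    return sorted(
        e["host"]
        for e in parse_pool(text)
        if e.get("enabled") and e.get("weight", 0) > 0
    )


# Hypothetical sample, not real pool data.
SAMPLE = """\
# illustrative entries only
{ 'host': 'maps1005.eqiad.wmnet', 'weight': 10, 'enabled': True }
{ 'host': 'maps1006.eqiad.wmnet', 'weight': 10, 'enabled': False }
"""

print(pooled_hosts(SAMPLE))
```

With the sample above, only the first host is reported as pooled. As noted in the channel, request logs on the hosts themselves remain the other way to confirm that traffic is actually arriving.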
[10:01:04] (03Merged) 10jenkins-bot: remote: re-enable cumin's output [software/spicerack] - 10https://gerrit.wikimedia.org/r/644175 (https://phabricator.wikimedia.org/T212783) (owner: 10Volans) [10:01:44] 10Operations, 10SRE-tools: Decommission cookbook failing to update DNS - https://phabricator.wikimedia.org/T268963 (10Volans) 05Open→03Resolved p:05Triage→03High a:03Volans Just for the record, what "fixed" the deploy was the manual `authdns-update` run after merging the change with the updated CNAME... [10:03:42] 10Operations: systemd.timer not executing on cumin2001 after command was modified - https://phabricator.wikimedia.org/T268974 (10Volans) [removing the SRE-tools tag as this is not specific to SRE-tools but a general Puppet/Systemd issue] [10:03:44] 10Operations, 10puppet-compiler: String vs Binary issues while running the puppet compiler - https://phabricator.wikimedia.org/T268978 (10elukey) [10:06:04] !log installing NSS security updates [10:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:57] 10Operations, 10Puppet: systemd.timer not executing on cumin2001 after command was modified - https://phabricator.wikimedia.org/T268974 (10jcrespo) [10:11:18] !log cp4032: upgrade varnish to 6.0.7-1wm1 T268736 [10:11:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:25] T268736: Package and deploy varnish 6.0.7 - https://phabricator.wikimedia.org/T268736 [10:14:42] (03PS1) 10Filippo Giunchedi: alertmanager: fix cluster config out of sync alert [puppet] - 10https://gerrit.wikimedia.org/r/644184 (https://phabricator.wikimedia.org/T266017) [10:16:49] 10Operations, 10Gerrit-Privilege-Requests, 10LDAP-Access-Requests: Offboard Pablo-WMDE from WMF systems - https://phabricator.wikimedia.org/T268946 (10Urbanecm) I note the user is still member of the wikidata staff group. Any other member of that group should be able to remove it. 
[10:18:09] !log cp4031: reboot to test atsmtail/fifo-log-demux service dependencies -- https://gerrit.wikimedia.org/r/c/operations/puppet/+/643922 T256467 [10:18:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:16] !log ema@cumin1001 START - Cookbook sre.hosts.reboot-single [10:18:18] T256467: Make atsmtail-backend.service depend on fifo-log-demux - https://phabricator.wikimedia.org/T256467 [10:18:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:46] (03CR) 10Hashar: [V: 03+1] "That is merely a cleanup after we have moved to profile::java." [puppet] - 10https://gerrit.wikimedia.org/r/639272 (owner: 10Hashar) [10:22:56] (03PS1) 10Muehlenhoff: Remove LDAP access for pgrass [puppet] - 10https://gerrit.wikimedia.org/r/644186 (https://phabricator.wikimedia.org/T268946) [10:22:59] 10Operations, 10Gerrit-Privilege-Requests, 10LDAP-Access-Requests, 10Patch-For-Review: Offboard Pablo-WMDE from WMF systems - https://phabricator.wikimedia.org/T268946 (10WMDE-leszek) >>! In T268946#6655192, @Urbanecm wrote: > I note the user is still member of the wikidata staff group. Any other member of... [10:24:46] !log applying https://gerrit.wikimedia.org/r/q/topic:%22k8s_config%22 series of patches [10:24:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:08] (03CR) 10Alexandros Kosiaris: [C: 03+2] k8s::proxy: Define --logtostderr=true and --v=0 [puppet] - 10https://gerrit.wikimedia.org/r/643456 (owner: 10Alexandros Kosiaris) [10:25:37] (03CR) 10Muehlenhoff: [C: 03+2] Remove LDAP access for pgrass [puppet] - 10https://gerrit.wikimedia.org/r/644186 (https://phabricator.wikimedia.org/T268946) (owner: 10Muehlenhoff) [10:25:54] akosiaris: shall I merge your patch along? 
[10:28:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P13470 and previous config saved to /var/cache/conftool/dbconfig/20201130-102811-root.json [10:28:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:42] !log Compare data between clouddb1012:3312 clouddb1018:3312 labsdb1012 T267090 [10:29:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:48] T267090: Productionize clouddb10[13-20] - https://phabricator.wikimedia.org/T267090 [10:29:52] !log Compare data between clouddb1014:3312 clouddb1018:3312 labsdb1012 T267090 [10:29:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:58] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [10:30:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:00] 10Operations, 10Gerrit-Privilege-Requests, 10LDAP-Access-Requests, 10Patch-For-Review: Offboard Pablo-WMDE from WMF systems - https://phabricator.wikimedia.org/T268946 (10MoritzMuehlenhoff) [10:32:18] (03PS2) 10Jbond: disable-puppet: add username to disable message [puppet] - 10https://gerrit.wikimedia.org/r/643945 [10:32:24] (03CR) 10Jbond: disable-puppet: add username to disable message (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/643945 (owner: 10Jbond) [10:34:03] 10Operations, 10Traffic: Make atsmtail-backend.service depend on fifo-log-demux - https://phabricator.wikimedia.org/T256467 (10ema) 05Open→03Resolved a:03ema Unit ordering at boot time is now correct: ` root@cp4031:~# journalctl -u trafficserver-tls.service | grep Started Nov 30 10:22:11 cp4031 systemd[... 
[10:43:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P13471 and previous config saved to /var/cache/conftool/dbconfig/20201130-104314-root.json [10:43:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:50] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=gerrit site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:47:15] (03PS1) 10Joal: Import new tables to analytics datalake with sqoop [puppet] - 10https://gerrit.wikimedia.org/r/644189 (https://phabricator.wikimedia.org/T266077) [10:47:29] (03CR) 10Alexandros Kosiaris: [C: 03+2] k8s::kubelet: Define --logtostderr=true and --v=0 [puppet] - 10https://gerrit.wikimedia.org/r/643457 (owner: 10Alexandros Kosiaris) [10:48:33] (03PS2) 10Joal: Import new tables to analytics datalake with sqoop [puppet] - 10https://gerrit.wikimedia.org/r/644189 (https://phabricator.wikimedia.org/T266077) [10:50:06] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:52:06] (03PS1) 10Ema: fifo-log-tailer: gracefully handle missing unix socket [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/644191 (https://phabricator.wikimedia.org/T268883) [10:52:56] (03CR) 10Alexandros Kosiaris: [C: 03+2] k8s::scheduler: Define --logtostderr=true and --v=0 [puppet] - 10https://gerrit.wikimedia.org/r/643458 (owner: 10Alexandros Kosiaris) [10:56:40] (03CR) 10Alexandros Kosiaris: [C: 03+2] k8s::controller: Define --logtostderr=true and --v=0 [puppet] - 10https://gerrit.wikimedia.org/r/643459 (owner: 10Alexandros Kosiaris) [10:56:44] (03CR) 10Jbond: [C: 03+2] "> Patch Set 1: Code-Review+1" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/643741 (owner: 10Jbond) [10:57:33] (03Merged) 10jenkins-bot: puppet_compiler: drop python2 support [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/643741 (owner: 10Jbond) [10:57:51] (03PS2) 10Ema: fifo-log-tailer: gracefully handle missing unix socket [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/644191 (https://phabricator.wikimedia.org/T268883) [10:58:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P13472 and previous config saved to /var/cache/conftool/dbconfig/20201130-105818-root.json [10:58:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:31] !log bootstrapping maps1005 cassandra [11:00:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:34] (03CR) 10Alexandros Kosiaris: [C: 03+2] k8s::apiserver: Follow the other components config pattern [puppet] - 10https://gerrit.wikimedia.org/r/643490 (owner: 10Alexandros Kosiaris) [11:07:07] (03CR) 10Vgutierrez: [C: 03+1] fifo-log-tailer: gracefully handle missing unix socket [software/fifo-log-demux] - 
https://gerrit.wikimedia.org/r/644191 (https://phabricator.wikimedia.org/T268883) (owner: Ema)
[11:09:20] Operations, puppet-compiler: String vs Binary issues while running the puppet compiler - https://phabricator.wikimedia.org/T268978 (jbond) Open→Resolved a:jbond This looks like it was an artefact of the python2 -> python3 migration. I have pushed the [[ https://gerrit.wikimedia.org/r/c/opera...
[11:10:27] jbond42: you rock
[11:12:16] (CR) Elukey: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26758/console" [puppet] - https://gerrit.wikimedia.org/r/644002 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[11:13:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P13473 and previous config saved to /var/cache/conftool/dbconfig/20201130-111321-root.json
[11:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:14:39] (CR) Ema: [C: +2] fifo-log-tailer: gracefully handle missing unix socket [software/fifo-log-demux] - https://gerrit.wikimedia.org/r/644191 (https://phabricator.wikimedia.org/T268883) (owner: Ema)
[11:14:45] (CR) Elukey: [V: +1 C: +2] kafka: Migrate hiera() to lookup() and setting datatype in monitoring [puppet] - https://gerrit.wikimedia.org/r/644002 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[11:19:31] (PS2) TK-999: GeoDNS: Update entry for Wikia [dns] - https://gerrit.wikimedia.org/r/643983
[11:19:57] (CR) TK-999: GeoDNS: Update entry for Wikia (1 comment) [dns] - https://gerrit.wikimedia.org/r/643983 (owner: TK-999)
[11:20:02] RECOVERY - snapshot of s1 in codfw on alert1001 is OK: Last snapshot for s1 at codfw (db2097.codfw.wmnet:3311) taken on 2020-11-30 10:00:41 (986 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[11:20:28] (PS3) TK-999: GeoDNS:
Update entry for Wikia [dns] - https://gerrit.wikimedia.org/r/643983
[11:22:51] (CR) Faidon Liambotis: "Have you submitted this correction to https://support.maxmind.com/geoip-data-correction-request/correct-a-geoip-location/ ? In my experien" [dns] - https://gerrit.wikimedia.org/r/643983 (owner: TK-999)
[11:23:52] (PS2) Alexandros Kosiaris: k8s: Sort DAEMON_ARGS before joining [puppet] - https://gerrit.wikimedia.org/r/643702
[11:26:19] (CR) Alexandros Kosiaris: [C: +2] k8s: Sort DAEMON_ARGS before joining [puppet] - https://gerrit.wikimedia.org/r/643702 (owner: Alexandros Kosiaris)
[11:28:06] !log upload fifo-log-demux 0.6.2 to buster-wikimedia T268883
[11:28:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:13] T268883: fifo-log-tailer: gracefully handle missing unix socket - https://phabricator.wikimedia.org/T268883
[11:28:42] RECOVERY - Check systemd state on maps2010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:29:22] RECOVERY - tilerator on maps2010 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.493 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[11:30:04] jan_drewniak: #bothumor I � Unicode. All rise for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201130T1130).
[11:30:14] (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/644196 (https://phabricator.wikimedia.org/T128546)
[11:31:18] (CR) ArielGlenn: [C: +2] option for text pass fixup script to write files to specified directory [dumps] - https://gerrit.wikimedia.org/r/640420 (owner: ArielGlenn)
[11:31:46] (Merged) jenkins-bot: option for text pass fixup script to write files to specified directory [dumps] - https://gerrit.wikimedia.org/r/640420 (owner: ArielGlenn)
[11:31:53] (CR) Jdrewniak: [C: +2] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/644196 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[11:32:43] (Merged) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/644196 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[11:32:48] !log ariel@deploy1001 Started deploy [dumps/dumps@e8c6267]: allow page content fixup script to write output files to arbitrary dir
[11:32:53] !log ariel@deploy1001 Finished deploy [dumps/dumps@e8c6267]: allow page content fixup script to write output files to arbitrary dir (duration: 00m 04s)
[11:32:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:32:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:35:08] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:644196| Bumping portals to master (T128546)]] (duration: 01m 01s)
[11:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:35:15] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[11:36:05] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:644196| Bumping portals to master (T128546)]] (duration: 00m 57s)
[11:36:12] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:38:03] !log A:cp upgrade fifo-log-demux to 0.6.2 T268883
[11:38:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:38:09] T268883: fifo-log-tailer: gracefully handle missing unix socket - https://phabricator.wikimedia.org/T268883
[11:43:05] (CR) TK-999: "> Patch Set 3:" [dns] - https://gerrit.wikimedia.org/r/643983 (owner: TK-999)
[11:43:46] !log Sanitize clouddb1016:3318 - T267090
[11:43:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:43:53] T267090: Productionize clouddb10[13-20] - https://phabricator.wikimedia.org/T267090
[11:44:04] (PS1) Jbond: P:mariadb::backup::transfer: use full path in systemd unit [puppet] - https://gerrit.wikimedia.org/r/644197 (https://phabricator.wikimedia.org/T268974)
[11:44:42] (CR) Jbond: "Ready for review" [puppet] - https://gerrit.wikimedia.org/r/644197 (https://phabricator.wikimedia.org/T268974) (owner: Jbond)
[11:44:48] (CR) ArielGlenn: [V: +2 C: +2] make plain text output writer return the right thing [dumps/mwbzutils] - https://gerrit.wikimedia.org/r/643775 (owner: ArielGlenn)
[11:45:59] (CR) ArielGlenn: [V: +2 C: +2] make dumpbz2filefromoffset not write cruft before the first tag [dumps/mwbzutils] - https://gerrit.wikimedia.org/r/642903 (https://phabricator.wikimedia.org/T268416) (owner: ArielGlenn)
[11:47:16] (CR) ArielGlenn: [V: +2 C: +2] script to split up xml dump bz2 file into smaller ones [dumps/mwbzutils] - https://gerrit.wikimedia.org/r/642904 (https://phabricator.wikimedia.org/T268416) (owner: ArielGlenn)
[11:49:42] (CR) ArielGlenn: [V: +2 C: +2] new tool to generate bz2 output appendable to a truncated bz2 file [dumps/mwbzutils] - https://gerrit.wikimedia.org/r/642906 (https://phabricator.wikimedia.org/T268417) (owner: ArielGlenn)
[11:50:25] (CR) Jcrespo: [C: +1] "Even if this wasn't the underlying issue (which probably
is), this is the best shot at fixing it, because it would hopefully refresh the t" [puppet] - https://gerrit.wikimedia.org/r/644197 (https://phabricator.wikimedia.org/T268974) (owner: Jbond)
[11:50:58] RECOVERY - snapshot of s8 in codfw on alert1001 is OK: Last snapshot for s8 at codfw (db2100.codfw.wmnet:3318) taken on 2020-11-30 10:17:29 (1180 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[11:52:01] (CR) ArielGlenn: [V: +2 C: +2] Script to show offsets and crcs of byte-aligned blocks of bz2 files [dumps/mwbzutils] - https://gerrit.wikimedia.org/r/642907 (https://phabricator.wikimedia.org/T268417) (owner: ArielGlenn)
[11:52:50] Operations, Puppet, Patch-For-Review: systemd.timer not executing on cumin2001 after command was modified - https://phabricator.wikimedia.org/T268974 (jcrespo) Is there a different systemd version on cumin1001 vs cumin2001 (or active version, e.g. time since reboot)? That would explain why cumin1001...
[11:53:17] (CR) Jcrespo: [C: +2] P:mariadb::backup::transfer: use full path in systemd unit [puppet] - https://gerrit.wikimedia.org/r/644197 (https://phabricator.wikimedia.org/T268974) (owner: Jbond)
[11:55:16] (CR) ArielGlenn: [V: +2 C: +2] Tool to write out offset and crc information for all blocks in a bzip2 file [dumps/mwbzutils] - https://gerrit.wikimedia.org/r/490299 (https://phabricator.wikimedia.org/T216009) (owner: ArielGlenn)
[11:56:54] (CR) ArielGlenn: [V: +2 C: +2] remove an unused function in findpageidinbz2xml [dumps/mwbzutils] - https://gerrit.wikimedia.org/r/643778 (owner: ArielGlenn)
[11:58:07] (CR) ArielGlenn: [V: +2 C: +2] fix usage for getlastidinbz2xml and make man page build properly [dumps/mwbzutils] - https://gerrit.wikimedia.org/r/643779 (owner: ArielGlenn)
[11:58:20] Operations, Puppet, Patch-For-Review: systemd.timer not executing on cumin2001 after command was modified - https://phabricator.wikimedia.org/T268974
(jcrespo) p:Unbreak!→High Lowering priority, will wait for: * Checking next scheduled snapshot runs normally (I will take care of this) * Follo...
[11:59:58] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for European mid-day backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201130T1200).
[12:00:05] Lucas_WMDE and kart_: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[12:00:14] \o/
[12:00:18] * kart_ is here.
[12:00:31] (CR) ArielGlenn: [V: +2 C: +2] script to make block crc info from the showcrcs tool more useful [dumps/mwbzutils] - https://gerrit.wikimedia.org/r/642911 (https://phabricator.wikimedia.org/T268417) (owner: ArielGlenn)
[12:00:34] o/ I need 15 more minutes
[12:00:43] Urbanecm: can you deploy kart_’s change?
[12:00:46] Lucas_WMDE: I'll deploy kart's patch then
[12:00:50] sure, was just writing that :D
[12:00:58] thanks :)
[12:01:13] (CR) Urbanecm: [C: +2] Remove wgContentTranslationRESTBase config [mediawiki-config] - https://gerrit.wikimedia.org/r/634956 (https://phabricator.wikimedia.org/T266213) (owner: KartikMistry)
[12:01:19] Urbanecm: patch is only testable in Production, so it is likely that we need to revert it quickly too.
[12:01:22] Operations, Puppet, Patch-For-Review: systemd.timer not executing on cumin2001 after command was modified - https://phabricator.wikimedia.org/T268974 (jcrespo) >>! In T268974#6655413, @jcrespo wrote: > Is there a different systemd version on cumin1001 vs cumin2001 (or active version, e.g. time since...
[12:01:39] kart_: acknowledged. I'll let you know once it's at mwdebug, assuming mwdebug test is enough?
[12:01:50] Yes. mwdebug is enough.
[12:01:58] okay, thank you
[12:02:03] (Merged) jenkins-bot: Remove wgContentTranslationRESTBase config [mediawiki-config] - https://gerrit.wikimedia.org/r/634956 (https://phabricator.wikimedia.org/T266213) (owner: KartikMistry)
[12:02:28] kart_: available at mwdebug1001, please test
[12:03:31] (CR) ArielGlenn: [V: +2 C: +2] unify the .gitignore files, move to top level and add a few more executables [dumps/mwbzutils] - https://gerrit.wikimedia.org/r/643780 (owner: ArielGlenn)
[12:05:31] Urbanecm: doing some tests. Is it OK to take some more time to test?
[12:05:34] (CR) ArielGlenn: [V: +2 C: +2] update man pages for several utils [dumps/mwbzutils] - https://gerrit.wikimedia.org/r/643818 (owner: ArielGlenn)
[12:05:36] kart_: sure.
[12:06:11] Urbanecm: thanks.
[12:08:28] (CR) ArielGlenn: [V: +2 C: +2] make revsperpage test echo SUCCESS when it passes [dumps/mwbzutils] - https://gerrit.wikimedia.org/r/643819 (owner: ArielGlenn)
[12:09:57] (PS1) Jbond: gerrit: use fully qualified path [puppet] - https://gerrit.wikimedia.org/r/644199
[12:11:08] (CR) ArielGlenn: [V: +2 C: +2] tests for dumpbz2filefromoffset [dumps/mwbzutils] - https://gerrit.wikimedia.org/r/643820 (owner: ArielGlenn)
[12:11:19] Urbanecm: looks good for me. CX can load new pages, can load drafts etc.
[12:11:33] kart_: okay, trying to sync and let's hope :)
[12:11:38] Urbanecm: but, still - please be ready for revert :P
[12:11:46] sure, I'll prepare a revert just in case
[12:12:22] (CR) ArielGlenn: [V: +2 C: +2] tests for appendbz2 [dumps/mwbzutils] - https://gerrit.wikimedia.org/r/643822 (https://phabricator.wikimedia.org/T268417) (owner: ArielGlenn)
[12:12:56] !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: 2922abe7b810f1b53b446af783dc9d51e6585225: Remove wgContentTranslationRESTBase config (T266213) (duration: 00m 57s)
[12:13:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:13:06] T266213: Deprecate wgContentTranslationRESTBase config - https://phabricator.wikimedia.org/T266213
[12:13:25] kart_: synced, revert is now a matter of pressing a single enter should it be necessary.
[12:13:57] Urbanecm: cool. Let me do more tests.. ie Debug in Production.
[12:14:08] kart_: certainly.
[12:14:59] (CR) ArielGlenn: [V: +2 C: +2] tests for writeuptopageid [dumps/mwbzutils] - https://gerrit.wikimedia.org/r/643865 (owner: ArielGlenn)
[12:15:45] Lucas_WMDE: FYI: I'm still deploying (=waiting to make sure the patch didn't break anything).
[12:15:49] ok
[12:15:55] I’m also still in a meeting ^^
[12:16:16] (not sure if I’ll even have time for my deployment before the next meeting starts, we’ll see)
[12:16:21] (CR) ArielGlenn: [V: +2 C: +2] tests for recompressxml [dumps/mwbzutils] - https://gerrit.wikimedia.org/r/643866 (owner: ArielGlenn)
[12:16:44] ack :)
[12:18:17] (CR) ArielGlenn: [V: +2 C: +2] basic tests for split_bz2.py [dumps/mwbzutils] - https://gerrit.wikimedia.org/r/643867 (owner: ArielGlenn)
[12:19:24] kart_: so far, nothing suspicious on Logstash
[12:20:38] (CR) ArielGlenn: [V: +2 C: +2] tests for dumplastbz2block [dumps/mwbzutils] - https://gerrit.wikimedia.org/r/643869 (owner: ArielGlenn)
[12:21:04] (PS1) Arturo Borrero Gonzalez: cloud: kubeadm: refresh version defaults [puppet] - https://gerrit.wikimedia.org/r/644202 (https://phabricator.wikimedia.org/T268669)
[12:21:14] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no:weight=0; selector: dc=eqiad,cluster=maps,service=kartotherian,name=maps1005.eqiad.wmnet
[12:21:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:21:31] Urbanecm: yes. CX is all OK so far.
[12:21:38] wonderful
[12:22:23] (CR) Volans: [C: +1] "LGTM" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/643945 (owner: Jbond)
[12:22:47] (CR) ArielGlenn: [V: +2 C: +2] tests for findpageidinbz2xml [dumps/mwbzutils] - https://gerrit.wikimedia.org/r/643870 (owner: ArielGlenn)
[12:23:36] o/ I’m free now
[12:23:51] ack
[12:24:00] !log hnowlan@deploy1001 Started deploy [kartotherian/deploy@0a38bc5]: New eqiad maps hosts
[12:24:02] kart_: do you wish to test more, or can we hand over to Lucas_WMDE ?
[12:24:03] !log hnowlan@deploy1001 Finished deploy [kartotherian/deploy@0a38bc5]: New eqiad maps hosts (duration: 00m 03s)
[12:24:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:24:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:24:19] question: when I push a job on mwdebug, do I have a guarantee that the job will also run on the same host? or do I need to do a sync-file to ensure the job class exists on the host that will end up running the job?
[12:24:27] Urbanecm: I think you knew something about this ^ last time I asked?
[12:24:31] (CR) ArielGlenn: [V: +2 C: +2] tests for getlastidinbz2xml [dumps/mwbzutils] - https://gerrit.wikimedia.org/r/643872 (owner: ArielGlenn)
[12:25:13] (I’m not doing anything yet, waiting for handover confirmation)
[12:25:41] kart_: ping? "do you wish to test more, or can we hand over to Lucas?"
[12:26:14] Lucas_WMDE: good question. I'm not sure, I'd try it, and update docs for other deployers 🙂
[12:26:27] ok :)
[12:26:29] (CR) ArielGlenn: [V: +2 C: +2] script to run all the tests [dumps/mwbzutils] - https://gerrit.wikimedia.org/r/643873 (owner: ArielGlenn)
[12:27:01] https://wikitech.wikimedia.org/wiki/WikimediaDebug still mentions a showJobs.php script, which IIRC has been broken in prod for a while
[12:27:03] :|
[12:27:11] yeah, T221224
[12:27:11] T221224: showJobs.php maintenance script useless and misleading in production - https://phabricator.wikimedia.org/T221224
[12:27:16] Lucas_WMDE: that script doesn't work, yeah :(
[12:27:19] oh sorry Urbanecm
[12:27:24] !log hnowlan@deploy1001 Started deploy [tilerator/deploy@97575e4]: New eqiad maps hosts
[12:27:27] !log hnowlan@deploy1001 Finished deploy [tilerator/deploy@97575e4]: New eqiad maps hosts (duration: 00m 03s)
[12:27:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:27:30] Lucas_WMDE: Please go ahead.
[12:27:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:27:36] ok thanks! I’ll try
[12:27:38] not until I clean deployment please!
[12:27:41] Lucas_WMDE: ^
[12:27:46] ok!
[12:28:11] Lucas_WMDE: done. Please ping me once done if there's time, I've a few non-urgent config patches
[12:28:15] floor is yours :)
[12:28:18] alright thanks!
[12:28:25] Operations, MediaWiki-General, Platform Engineering: Allow easier ICU transitions in MediaWiki (change how sortkey collation is managed in the categorylinks table) - https://phabricator.wikimedia.org/T263437 (daniel)
[12:29:08] (CR) ArielGlenn: [V: +2 C: +2] bump version to 0.1.1 [dumps/mwbzutils] - https://gerrit.wikimedia.org/r/643876 (owner: ArielGlenn)
[12:29:24] (Abandoned) Vlad.shapik: Expiration date: OAuth 2.0 access tokens have effectively infinite expiration date [mediawiki-config] - https://gerrit.wikimedia.org/r/644056 (https://phabricator.wikimedia.org/T265075) (owner: Vlad.shapik)
[12:29:56] it… seems to have worked
[12:30:14] I’ll sync and then look at the logs in more detail afterwards
[12:30:41] great!
[12:30:44] (CR) ArielGlenn: [C: +2] version 0.1.1 [debs/mwbzutils] - https://gerrit.wikimedia.org/r/643909 (owner: ArielGlenn)
[12:31:12] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/643945 (owner: Jbond)
[12:32:10] (CR) Muehlenhoff: gerrit: use fully qualified path (1 comment) [puppet] - https://gerrit.wikimedia.org/r/644199 (owner: Jbond)
[12:32:46] !log Deployed patch for T260349
[12:32:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:33:17] !log hnowlan@deploy1001 Started deploy [tilerator/deploy@97575e4]: Newer codfw maps hosts
[12:33:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:33:58] committed in /srv/patches too
[12:34:08] !log hnowlan@deploy1001 Finished deploy [tilerator/deploy@97575e4]: Newer codfw maps hosts (duration: 00m 51s)
[12:34:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:34:17] Urbanecm: I think you can go ahead now
[12:34:22] thanks
[12:34:24] !log hnowlan@deploy1001 Started deploy [tilerator/deploy@97575e4]: Newer codfw maps hosts
[12:34:28] RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:34:28] RECOVERY - tilerator on maps2008 is OK: HTTP OK: HTTP/1.1 200 OK - 315 bytes in 0.087 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[12:34:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:34:42] (PS2) Urbanecm: Assign patrolmarks right to autoconfirmed users on itwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/643464 (https://phabricator.wikimedia.org/T268734)
[12:34:46] (CR) Urbanecm: [C: +2] Assign patrolmarks right to autoconfirmed users on itwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/643464 (https://phabricator.wikimedia.org/T268734) (owner: Urbanecm)
[12:34:48] RECOVERY - tilerator on maps2009 is OK: HTTP OK: HTTP/1.1 200 OK - 315 bytes in 0.084 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[12:34:48] !log hnowlan@deploy1001 Finished deploy [tilerator/deploy@97575e4]: Newer codfw maps hosts (duration: 00m 24s)
[12:34:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:35:10] RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:35:15] (PS1) Vlad.shapik: Expiration date: OAuth 2.0 access tokens have effectively infinite expiration date [mediawiki-config] - https://gerrit.wikimedia.org/r/644203 (https://phabricator.wikimedia.org/T265075)
[12:35:25] !log hnowlan@deploy1001 Started deploy [kartotherian/deploy@0a38bc5]: Newer codfw maps hosts
[12:35:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:35:34] (Merged) jenkins-bot: Assign patrolmarks right to autoconfirmed users on itwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/643464 (https://phabricator.wikimedia.org/T268734) (owner: Urbanecm)
[12:35:43] (PS2) Urbanecm: Grant enwikibooks reviewers suppressredirect and raise move rate limit to 100/60 [mediawiki-config] - https://gerrit.wikimedia.org/r/643771 (https://phabricator.wikimedia.org/T268849)
[12:35:47] (CR) Urbanecm: [C: +2] Grant enwikibooks reviewers suppressredirect and raise move rate limit to 100/60 [mediawiki-config] - https://gerrit.wikimedia.org/r/643771 (https://phabricator.wikimedia.org/T268849) (owner: Urbanecm)
[12:35:55] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: dc=eqiad,cluster=maps,service=kartotherian,name=maps1005.eqiad.wmnet
[12:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:36:40] (Merged) jenkins-bot: Grant enwikibooks reviewers suppressredirect and
raise move rate limit to 100/60 [mediawiki-config] - https://gerrit.wikimedia.org/r/643771 (https://phabricator.wikimedia.org/T268849) (owner: Urbanecm)
[12:36:59] (PS2) Urbanecm: Enable RelatedArticles on ptwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/644070 (https://phabricator.wikimedia.org/T268945)
[12:37:06] (CR) Urbanecm: [C: +2] Enable RelatedArticles on ptwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/644070 (https://phabricator.wikimedia.org/T268945) (owner: Urbanecm)
[12:37:12] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 9942d68c914a56073d2d192434ba24ff8cb921ba: Assign patrolmarks right to autoconfirmed users on itwiki (T268734) (duration: 00m 57s)
[12:37:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:37:19] T268734: Assign patrolmarks right to autoconfirmed users on itwiki - https://phabricator.wikimedia.org/T268734
[12:37:24] (PS2) ArielGlenn: Add link to Article Feedback dumps for downloaders [puppet] - https://gerrit.wikimedia.org/r/594961 (https://phabricator.wikimedia.org/T250715)
[12:37:30] !log hnowlan@deploy1001 Finished deploy [kartotherian/deploy@0a38bc5]: Newer codfw maps hosts (duration: 02m 05s)
[12:37:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:38:03] (Merged) jenkins-bot: Enable RelatedArticles on ptwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/644070 (https://phabricator.wikimedia.org/T268945) (owner: Urbanecm)
[12:38:43] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: ba6d0f8fd2a443e5c913a292365063f01f2d076b: Grant enwikibooks reviewers suppressredirect and raise move rate limit to 100/60 (T268849) (duration: 00m 57s)
[12:38:46] (CR) Arturo Borrero Gonzalez: [C: +2] cloud: kubeadm: refresh version defaults [puppet] - https://gerrit.wikimedia.org/r/644202 (https://phabricator.wikimedia.org/T268669) (owner: Arturo Borrero
Gonzalez)
[12:38:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:38:50] T268849: Add suppressredirect to reviewers on en.wb - https://phabricator.wikimedia.org/T268849
[12:39:24] PROBLEM - Postgres Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 265773960 and 311 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:39:30] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 289849688 and 317 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:40:00] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 44532040 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:40:49] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 5585fd79119a4e077705789d1c1928c9e9efa956: Enable RelatedArticles on ptwikinews (T268945) (duration: 00m 57s)
[12:40:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:40:56] T268945: Add mobile wordmark and RelatedArticles extension on ptwikinews - https://phabricator.wikimedia.org/T268945
[12:41:13] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: dc=eqiad,cluster=maps,service=kartotherian,name=maps1004.eqiad.wmnet
[12:41:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:41:38] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 78544056 and 444 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:41:56] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 73724256 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:42:06] PROBLEM - Postgres Replication Lag
on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 144853352 and 11 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:43:02] !log hnowlan@deploy1001 Started deploy [kartotherian/deploy@0a38bc5]: Redeploy to fix gelf traffic
[12:43:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:43:16] (PS1) Urbanecm: [enwikibooks] Follow-up for ba6d0f8f: The group is called "editor", not "reviewer" [mediawiki-config] - https://gerrit.wikimedia.org/r/644205 (https://phabricator.wikimedia.org/T268849)
[12:43:26] !log hnowlan@deploy1001 Finished deploy [kartotherian/deploy@0a38bc5]: Redeploy to fix gelf traffic (duration: 00m 24s)
[12:43:28] (CR) Urbanecm: [C: +2] [enwikibooks] Follow-up for ba6d0f8f: The group is called "editor", not "reviewer" [mediawiki-config] - https://gerrit.wikimedia.org/r/644205 (https://phabricator.wikimedia.org/T268849) (owner: Urbanecm)
[12:43:30] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 95241384 and 7 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:43:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:44:15] (Merged) jenkins-bot: [enwikibooks] Follow-up for ba6d0f8f: The group is called "editor", not "reviewer" [mediawiki-config] - https://gerrit.wikimedia.org/r/644205 (https://phabricator.wikimedia.org/T268849) (owner: Urbanecm)
[12:45:24] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 543816 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:45:56] PROBLEM - Postgres Replication Lag on maps2009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 22022440 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:46:16] PROBLEM - Postgres Replication Lag on
maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 165950552 and 9 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:46:23] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 3476644e4c27dd28339f7b10c8871be2e9455394: Grant enwikibooks reviewers suppressredirect and raise move rate limit to 100/60 (T268849; 2nd attempt) (duration: 00m 56s)
[12:46:24] PROBLEM - Postgres Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 204654208 and 11 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:46:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:46:32] T268849: Add suppressredirect to reviewers on en.wb - https://phabricator.wikimedia.org/T268849
[12:46:34] PROBLEM - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 398115720 and 13 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:46:58] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 104441424 and 8 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:47:31] hnowlan: there's a lot of map alerts, and I noticed you did some map stuff
[12:48:28] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:48:42] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 72600 and 15 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:48:49] (PS2) Jbond: gerrit: use fully qualified path [puppet] - https://gerrit.wikimedia.org/r/644199
[12:48:51] (PS1) Jbond: systemd::timer::job: Ensure all commands are fully qualified [puppet] - https://gerrit.wikimedia.org/r/644226
[12:50:01] (CR) Hashar: [C: +1] "Please do :]" [puppet] - https://gerrit.wikimedia.org/r/644199 (owner: Jbond)
[12:50:02] RECOVERY - Postgres Replication Lag on maps2006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 950896 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:50:03] !log EU B&C window done
[12:50:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:50:09] (PS3) Jbond: disable-puppet: add username to disable message [puppet] - https://gerrit.wikimedia.org/r/643945
[12:50:59] (CR) Jbond: disable-puppet: add username to disable message (2 comments) [puppet] - https://gerrit.wikimedia.org/r/643945 (owner: Jbond)
[12:51:08] PROBLEM - Postgres Replication Lag on maps2009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 236024320 and 8 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:51:14] (PS4) Jbond: disable-puppet: add username to disable message [puppet] - https://gerrit.wikimedia.org/r/643945
[12:51:36] RECOVERY - Postgres Replication Lag on maps2005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 11 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:51:42] RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 36816 and 18 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:52:02] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 89776 and 38 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:52:18] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/644199 (owner: Jbond)
[12:52:30] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 30016 and 66 seconds
https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:52:52] RECOVERY - Postgres Replication Lag on maps2009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 87 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:53:10] RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 21928 and 107 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:53:48] (PS2) Vlad.shapik: Expiration date: OAuth 2.0 access tokens have effectively infinite expiration date [mediawiki-config] - https://gerrit.wikimedia.org/r/644203 (https://phabricator.wikimedia.org/T265075) [12:56:13] Urbanecm: unfortunately unrelated [12:56:26] okay :( [12:57:20] (CR) Jbond: [C: +2] gerrit: use fully qualified path [puppet] - https://gerrit.wikimedia.org/r/644199 (owner: Jbond) [12:57:38] (CR) Muehlenhoff: [C: +2] Use CAS for Racktables [puppet] - https://gerrit.wikimedia.org/r/643711 (owner: Muehlenhoff) [12:57:56] (CR) Jbond: gerrit: use fully qualified path (1 comment) [puppet] - https://gerrit.wikimedia.org/r/644199 (owner: Jbond) [12:58:52] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:59:51] (PS11) Hashar: Explicitly mentions the repository in scap::sources [puppet] - https://gerrit.wikimedia.org/r/610254 (https://phabricator.wikimedia.org/T257413) [13:02:53] (PS2) Jbond: systemd::timer::job: Ensure all commands are fully qualified [puppet] - https://gerrit.wikimedia.org/r/644226 [13:03:16] (CR) Jbond: "PCC still running but looks like a noop" [puppet] - https://gerrit.wikimedia.org/r/644226 (owner: Jbond) [13:09:39] (PS1) Gilles: Add mwdebug1003 to list of debug servers [mediawiki-config] - https://gerrit.wikimedia.org/r/644229 (https://phabricator.wikimedia.org/T268167) [13:12:15] (CR) Gilles: [C: +2] Add mwdebug1003 to list of debug servers [mediawiki-config] - https://gerrit.wikimedia.org/r/644229 (https://phabricator.wikimedia.org/T268167) (owner: Gilles) [13:13:02] (Merged) jenkins-bot: Add mwdebug1003 to list of debug servers [mediawiki-config] - https://gerrit.wikimedia.org/r/644229 (https://phabricator.wikimedia.org/T268167) (owner: Gilles) [13:16:10] !log gilles@deploy1001 Synchronized debug.json: T268167 Add mwdebug1003 to list of debug servers (duration: 00m 56s) [13:16:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:17] T268167: Fetch mwdebug backend server list from noc.wikimedia.org - https://phabricator.wikimedia.org/T268167 [13:18:59] !log CAS enabled for racktables [13:19:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:31] !log depooling wdqs1004 (lag) [13:21:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:20] !log update zeromq on jessie hosts [13:23:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:29] (PS1) Gilles: Allow CORS requests to noc.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/644230 (https://phabricator.wikimedia.org/T268167) [13:25:10] 
(PS9) Hashar: scap::sources stop assuming mediawiki/services as a prefix [puppet] - https://gerrit.wikimedia.org/r/610267 (https://phabricator.wikimedia.org/T257413) [13:33:09] (CR) Muehlenhoff: [C: +1] "Looks good (but didn't check PCC, not sure if it's complete yet)" [puppet] - https://gerrit.wikimedia.org/r/644226 (owner: Jbond) [13:37:38] (PS1) Kormat: intpy [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/644231 [13:44:16] (CR) Jbond: [C: +2] systemd::timer::job: Ensure all commands are fully qualified [puppet] - https://gerrit.wikimedia.org/r/644226 (owner: Jbond) [13:45:13] (PS1) Vlad.shapik: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config into T265075-oauth-2-0-access-tokens-have-effectively-infinite-expiration-date Change-Id: I2dafc28b3411da8dc09925ad496e9a890660720f [mediawiki-config] - https://gerrit.wikimedia.org/r/644232 [13:45:44] (PS1) Marostegui: Revert "db1087: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/644207 [13:46:26] (CR) Marostegui: [C: +2] Revert "db1087: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/644207 (owner: Marostegui) [13:46:36] (Abandoned) Vlad.shapik: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config into T265075-oauth-2-0-access-tokens-have-effectively-infinite-expiration-date Change-Id: I2dafc28b3411da8dc09925ad496e9a890660720f [mediawiki-config] - https://gerrit.wikimedia.org/r/644232 (owner: Vlad.shapik) [13:47:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1087 (re)pooling @ 25%: After cloning clouddb1016:3318', diff saved to https://phabricator.wikimedia.org/P13474 and previous config saved to /var/cache/conftool/dbconfig/20201130-134722-root.json [13:47:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:57] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully 
operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:48:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1092', diff saved to https://phabricator.wikimedia.org/P13475 and previous config saved to /var/cache/conftool/dbconfig/20201130-134841-marostegui.json [13:48:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:24] (CR) Jbond: [C: +2] P:idp: drop support for json devicie storage (2 comments) [puppet] - https://gerrit.wikimedia.org/r/643538 (owner: Jbond) [13:51:30] (CR) Jbond: [V: +1 C: +2] P:idp: remove the concept of a primary and a secondary [puppet] - https://gerrit.wikimedia.org/r/643541 (owner: Jbond) [13:51:38] (CR) Jbond: [V: +1 C: +2] C:apereo_cas: drop support for managing the user [puppet] - https://gerrit.wikimedia.org/r/643713 (owner: Jbond) [13:52:55] RECOVERY - snapshot of s2 in codfw on alert1001 is OK: Last snapshot for s2 at codfw (db2098.codfw.wmnet:3312) taken on 2020-11-30 12:15:58 (846 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [13:58:08] !log varnish 6.0.7-1wm1 uploaded to apt.wikimedia.org component/varnish6 T268736 [13:58:12] (Abandoned) Vlad.shapik: Expiration date: OAuth 2.0 access tokens have effectively infinite expiration date [mediawiki-config] - https://gerrit.wikimedia.org/r/644203 (https://phabricator.wikimedia.org/T265075) (owner: Vlad.shapik) [13:58:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:15] T268736: Package and deploy varnish 6.0.7 - https://phabricator.wikimedia.org/T268736 [14:02:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1087 (re)pooling @ 50%: After cloning clouddb1016:3318', diff saved to https://phabricator.wikimedia.org/P13477 and previous config saved to /var/cache/conftool/dbconfig/20201130-140226-root.json [14:02:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:33] 
(PS1) Alexandros Kosiaris: k8s: Remove profile::kubernetes::master::storage_backend fully [puppet] - https://gerrit.wikimedia.org/r/644234 [14:03:35] (PS1) Alexandros Kosiaris: k8s codfw staging: Assign role to node [puppet] - https://gerrit.wikimedia.org/r/644235 [14:04:12] (CR) jerkins-bot: [V: -1] k8s codfw staging: Assign role to node [puppet] - https://gerrit.wikimedia.org/r/644235 (owner: Alexandros Kosiaris) [14:04:38] (PS1) Vlad.shapik: Expiration date: OAuth 2.0 access tokens have effectively infinite expiration date [mediawiki-config] - https://gerrit.wikimedia.org/r/644236 (https://phabricator.wikimedia.org/T265075) [14:05:00] Operations, Analytics: Backport kafkacat 1.6.0 from bullseye to buster-backports or buster-wikimedia - https://phabricator.wikimedia.org/T268936 (Ottomata) THAT IS AWESOME YES PLEASE! [14:05:05] PROBLEM - Check systemd state on idp-test2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:05:07] PROBLEM - Check systemd state on idp1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:27] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-ssl_443: Servers wdqs1005.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:07:15] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1005.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1005.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:09:00] ACKNOWLEDGEMENT - HP RAID on ms-be1030 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Permanently Disabled - Battery count: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T268997 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Inform [14:09:04] Operations, ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T268997 (ops-monitoring-bot) [14:11:39] (CR) Ottomata: Add Andrew Otto as approval contact for Hadoop and analytics groups (1 comment) [puppet] - https://gerrit.wikimedia.org/r/643875 (owner: Muehlenhoff) [14:11:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1092', diff saved to https://phabricator.wikimedia.org/P13478 and previous config saved to /var/cache/conftool/dbconfig/20201130-141146-marostegui.json [14:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:56] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:12:42] RECOVERY - PyBal backends health check on lvs1015 is OK: 
PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:15:35] Operations, Traffic, Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (ema) Varnish 6.0.7 is behaving well in terms of functionality on cp4032 (T268736). I've rebooted cp4031 shortly after the upgrade of cp4032... [14:17:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1087 (re)pooling @ 75%: After cloning clouddb1016:3318', diff saved to https://phabricator.wikimedia.org/P13479 and previous config saved to /var/cache/conftool/dbconfig/20201130-141729-root.json [14:17:34] Operations, observability: thanos: 404 error trying to fetch js library - https://phabricator.wikimedia.org/T269000 (jbond) p:Triage→Low [14:17:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:35] (CR) Muehlenhoff: Add Andrew Otto as approval contact for Hadoop and analytics groups (1 comment) [puppet] - https://gerrit.wikimedia.org/r/643875 (owner: Muehlenhoff) [14:19:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1092', diff saved to https://phabricator.wikimedia.org/P13480 and previous config saved to /var/cache/conftool/dbconfig/20201130-141953-marostegui.json [14:20:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:16] (CR) Ottomata: Add Andrew Otto as approval contact for Hadoop and analytics groups (1 comment) [puppet] - https://gerrit.wikimedia.org/r/643875 (owner: Muehlenhoff) [14:23:41] !log Deploy schema change on s3 codfw, lag will show up on s3 codfw T268004 [14:23:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:48] T268004: Schema change for renaming namespace_title index on watchlist - https://phabricator.wikimedia.org/T268004 [14:28:44] (PS2) Alexandros Kosiaris: k8s codfw staging: Assign role to node [puppet] - 
https://gerrit.wikimedia.org/r/644235 [14:28:46] (PS1) Alexandros Kosiaris: k8s::master: Remove redundant has_lvs hiera check [puppet] - https://gerrit.wikimedia.org/r/644237 [14:28:48] (PS1) Alexandros Kosiaris: k8s: Allow using cergen [puppet] - https://gerrit.wikimedia.org/r/644238 [14:29:20] (CR) jerkins-bot: [V: -1] k8s codfw staging: Assign role to node [puppet] - https://gerrit.wikimedia.org/r/644235 (owner: Alexandros Kosiaris) [14:32:07] (PS1) Elukey: profile::hive::client: make DBTokenStore the default [puppet] - https://gerrit.wikimedia.org/r/644239 (https://phabricator.wikimedia.org/T268028) [14:32:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1087 (re)pooling @ 100%: After cloning clouddb1016:3318', diff saved to https://phabricator.wikimedia.org/P13481 and previous config saved to /var/cache/conftool/dbconfig/20201130-143232-root.json [14:32:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:20] (CR) Elukey: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26761/console" [puppet] - https://gerrit.wikimedia.org/r/644239 (https://phabricator.wikimedia.org/T268028) (owner: Elukey) [14:37:39] (CR) Ppchelko: [C: +1] "Will deploy today." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/644236 (https://phabricator.wikimedia.org/T265075) (owner: Vlad.shapik) [14:37:53] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: dc=eqiad,cluster=maps,service=kartotherian,name=maps1006.eqiad.wmnet [14:37:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:55] (PS1) Ppchelko: group2: switch ParserCache to JSON [mediawiki-config] - https://gerrit.wikimedia.org/r/644243 (https://phabricator.wikimedia.org/T263579) [14:43:38] (CR) Vlad.shapik: "> Patch Set 1: Code-Review+1" [mediawiki-config] - https://gerrit.wikimedia.org/r/644236 (https://phabricator.wikimedia.org/T265075) (owner: Vlad.shapik) [14:44:18] (PS1) Hashar: ci: add prometheus exporter for Apache [puppet] - https://gerrit.wikimedia.org/r/644245 [14:45:22] (CR) Ppchelko: [C: +1] "> Patch Set 1:" [mediawiki-config] - https://gerrit.wikimedia.org/r/644236 (https://phabricator.wikimedia.org/T265075) (owner: Vlad.shapik) [14:45:41] !log joal@deploy1001 Started deploy [analytics/refinery@72ac883]: Analytics special deploy before first of month [analytics/refinery@72ac883] [14:45:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:56] (CR) Hashar: "Based after Chris addition to Gerrit." 
[puppet] - https://gerrit.wikimedia.org/r/644245 (owner: Hashar) [14:47:32] (CR) Vlad.shapik: "> Patch Set 1:" [mediawiki-config] - https://gerrit.wikimedia.org/r/644236 (https://phabricator.wikimedia.org/T265075) (owner: Vlad.shapik) [14:49:11] (PS1) Effie Mouzeli: hiera: enable icu 63 component on labweb hosts [puppet] - https://gerrit.wikimedia.org/r/644248 (https://phabricator.wikimedia.org/T264991) [14:50:45] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/644248 (https://phabricator.wikimedia.org/T264991) (owner: Effie Mouzeli) [14:51:23] (CR) Elukey: [C: +2] Import new tables to analytics datalake with sqoop [puppet] - https://gerrit.wikimedia.org/r/644189 (https://phabricator.wikimedia.org/T266077) (owner: Joal) [14:52:32] (PS3) Andrew Bogott: Roles and hiera for codfw1dev ceph cluster [puppet] - https://gerrit.wikimedia.org/r/643316 (https://phabricator.wikimedia.org/T265965) [14:54:52] (CR) Elukey: [V: +1 C: +2] profile::hive::client: make DBTokenStore the default [puppet] - https://gerrit.wikimedia.org/r/644239 (https://phabricator.wikimedia.org/T268028) (owner: Elukey) [14:55:07] !log joal@deploy1001 Finished deploy [analytics/refinery@72ac883]: Analytics special deploy before first of month [analytics/refinery@72ac883] (duration: 09m 26s) [14:55:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:29] !log joal@deploy1001 Started deploy [analytics/refinery@72ac883] (thin): Analytics special deploy before first of month -- THIN [analytics/refinery@72ac883] [14:55:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:37] !log joal@deploy1001 Finished deploy [analytics/refinery@72ac883] (thin): Analytics special deploy before first of month -- THIN [analytics/refinery@72ac883] (duration: 00m 08s) [14:55:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:19] (CR) CDanis: 
[C: +2] Allow CORS requests to noc.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/644230 (https://phabricator.wikimedia.org/T268167) (owner: Gilles) [14:57:20] Operations, MassMessage, WMF-JobQueue, Platform Team Workboards (Clinic Duty Team): Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (Lea_Lacroix_WMDE) FYI, it keeps happening, for example today with an announcement on Commons https://commons.wikimedia.org/... [15:00:13] (CR) Alexandros Kosiaris: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26763/console" [puppet] - https://gerrit.wikimedia.org/r/644235 (owner: Alexandros Kosiaris) [15:05:47] Operations, serviceops, cloud-services-team (Kanban): Upgrade labweb servers to buster - https://phabricator.wikimedia.org/T269004 (jijiki) [15:06:08] Operations, serviceops, cloud-services-team (Kanban): Upgrade labweb servers to buster - https://phabricator.wikimedia.org/T269004 (jijiki) [15:06:11] Operations, Release-Engineering-Team-TODO, serviceops, Patch-For-Review, and 2 others: Upgrade MediaWiki appservers to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (jijiki) [15:10:24] (PS1) Ottomata: Migrate 2 Anti-Harassment schemas to EventGate on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/644252 (https://phabricator.wikimedia.org/T268517) [15:13:09] (CR) Ottomata: [C: +2] Migrate 2 Anti-Harassment schemas to EventGate on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/644252 (https://phabricator.wikimedia.org/T268517) (owner: Ottomata) [15:13:22] Operations, Maps, Wikimedia-Logstash, observability, and 3 others: Move kartotherian/tilerator logging to new logging pipeline - https://phabricator.wikimedia.org/T222377 (hnowlan) I've redeployed the services emitting gelf traffic and it appears these messages have stopped as of 12:43 UTC. 
[15:15:05] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Migrate 2 Anti-Harassment schemas to EventGate on testwiki - T268517 (duration: 01m 16s) [15:15:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:12] T268517: Migrate Anti-Harassment EventLogging schemas to Event Platform - https://phabricator.wikimedia.org/T268517 [15:15:23] (PS1) Jbond: (WIP) ocsp [puppet] - https://gerrit.wikimedia.org/r/644254 [15:15:26] (Abandoned) Ppchelko: Add wgOldRevisionParserCacheType setting [mediawiki-config] - https://gerrit.wikimedia.org/r/643600 (https://phabricator.wikimedia.org/T268278) (owner: Ppchelko) [15:15:48] (CR) jerkins-bot: [V: -1] (WIP) ocsp [puppet] - https://gerrit.wikimedia.org/r/644254 (owner: Jbond) [15:17:17] (PS1) Jbond: P:base::labs: use fq path [puppet] - https://gerrit.wikimedia.org/r/644255 [15:19:08] (CR) Jbond: [C: +2] P:base::labs: use fq path [puppet] - https://gerrit.wikimedia.org/r/644255 (owner: Jbond) [15:21:34] (PS2) Volans: CHANGELOG: add changelogs for release v0.0.45 [software/spicerack] - https://gerrit.wikimedia.org/r/643982 [15:24:30] (PS1) Ottomata: Migrate 2 Anti-Harassment schemas to EventGate on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/644256 (https://phabricator.wikimedia.org/T268517) [15:25:09] (PS2) Ottomata: Migrate 2 Anti-Harassment schemas to EventGate on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/644256 (https://phabricator.wikimedia.org/T268517) [15:25:11] (CR) Volans: [C: +2] CHANGELOG: add changelogs for release v0.0.45 [software/spicerack] - https://gerrit.wikimedia.org/r/643982 (owner: Volans) [15:27:19] (CR) Ottomata: [C: +2] Migrate 2 Anti-Harassment schemas to EventGate on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/644256 (https://phabricator.wikimedia.org/T268517) (owner: Ottomata) [15:27:24] Operations, 
MediaWiki-extensions-Score, Security-Team, Wikimedia-General-or-Unknown, and 2 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (Ycrusoe) > This is not so simple. In Wikisource we convert music scores from images to make them edit... [15:28:11] (Merged) jenkins-bot: CHANGELOG: add changelogs for release v0.0.45 [software/spicerack] - https://gerrit.wikimedia.org/r/643982 (owner: Volans) [15:28:41] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Migrate 2 Anti-Harassment schemas to EventGate on all wikis - T268517 (duration: 00m 56s) [15:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:48] T268517: Migrate Anti-Harassment EventLogging schemas to Event Platform - https://phabricator.wikimedia.org/T268517 [15:29:32] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:29:45] (PS1) Andrew Bogott: Ceph/codfw: added some missing hiera values [puppet] - https://gerrit.wikimedia.org/r/644257 (https://phabricator.wikimedia.org/T265965) [15:31:06] (CR) Andrew Bogott: [C: +2] Ceph/codfw: added some missing hiera values [puppet] - https://gerrit.wikimedia.org/r/644257 (https://phabricator.wikimedia.org/T265965) (owner: Andrew Bogott) [15:34:09] (PS1) Ottomata: Refine 2 Anti-Harassment schemas using eventlogging_legacy job [puppet] - https://gerrit.wikimedia.org/r/644259 (https://phabricator.wikimedia.org/T268517) [15:34:13] !log cp3054: upgrade varnish to 6.0.7-1wm1 T268736 T264398 [15:34:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:22] T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 [15:34:22] T268736: Package and deploy varnish 6.0.7 - 
https://phabricator.wikimedia.org/T268736 [15:35:39] (CR) jerkins-bot: [V: -1] Refine 2 Anti-Harassment schemas using eventlogging_legacy job [puppet] - https://gerrit.wikimedia.org/r/644259 (https://phabricator.wikimedia.org/T268517) (owner: Ottomata) [15:37:19] (PS2) Ottomata: Refine 2 Anti-Harassment schemas using eventlogging_legacy job [puppet] - https://gerrit.wikimedia.org/r/644259 (https://phabricator.wikimedia.org/T268517) [15:38:00] (PS2) Lucas Werkmeister (WMDE): Add log channel Wikibase.IdGenerator [mediawiki-config] - https://gerrit.wikimedia.org/r/643874 (https://phabricator.wikimedia.org/T268625) [15:38:10] (CR) Lucas Werkmeister (WMDE): Add log channel Wikibase.IdGenerator (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/643874 (https://phabricator.wikimedia.org/T268625) (owner: Lucas Werkmeister (WMDE)) [15:38:48] (CR) jerkins-bot: [V: -1] Refine 2 Anti-Harassment schemas using eventlogging_legacy job [puppet] - https://gerrit.wikimedia.org/r/644259 (https://phabricator.wikimedia.org/T268517) (owner: Ottomata) [15:40:24] (CR) Reedy: [C: +1] Add link to Article Feedback dumps for downloaders [puppet] - https://gerrit.wikimedia.org/r/594961 (https://phabricator.wikimedia.org/T250715) (owner: ArielGlenn) [15:40:35] (PS3) Ottomata: Refine 2 Anti-Harassment schemas using eventlogging_legacy job [puppet] - https://gerrit.wikimedia.org/r/644259 (https://phabricator.wikimedia.org/T268517) [15:42:06] PROBLEM - Stale file for node-exporter textfile in eqiad on alert1001 is CRITICAL: cluster=mysql file=device_smart.prom instance=db1139 job=node site=eqiad https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile [15:43:11] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: dc=eqiad,cluster=maps,service=kartotherian,name=maps1007.eqiad.wmnet [15:43:17] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:18] (CR) Lucas Werkmeister (WMDE): Add log channel Wikibase.IdGenerator (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/643874 (https://phabricator.wikimedia.org/T268625) (owner: Lucas Werkmeister (WMDE)) [15:43:36] !log installing tomcat8 security updates [15:43:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:25] (PS1) Ppchelko: Update change-prop to 0.10.4 [deployment-charts] - https://gerrit.wikimedia.org/r/644261 [15:45:19] (PS2) Ssingh: admin: add ldap_only_user entry for janjaquemot [puppet] - https://gerrit.wikimedia.org/r/641509 (https://phabricator.wikimedia.org/T267771) (owner: Herron) [15:46:03] (CR) Ottomata: [C: +2] Refine 2 Anti-Harassment schemas using eventlogging_legacy job [puppet] - https://gerrit.wikimedia.org/r/644259 (https://phabricator.wikimedia.org/T268517) (owner: Ottomata) [15:47:03] (CR) Ssingh: "The user has signed the NDA as per the tracking sheet and confirmation from Katie (https://phabricator.wikimedia.org/T267771#6654534)." 
[puppet] - https://gerrit.wikimedia.org/r/641509 (https://phabricator.wikimedia.org/T267771) (owner: Herron) [15:48:13] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/641509 (https://phabricator.wikimedia.org/T267771) (owner: Herron) [15:48:16] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:49:45] (CR) Ssingh: [C: +2] admin: add ldap_only_user entry for janjaquemot [puppet] - https://gerrit.wikimedia.org/r/641509 (https://phabricator.wikimedia.org/T267771) (owner: Herron) [15:52:38] (PS2) Alexandros Kosiaris: k8s: Remove profile::kubernetes::master::storage_backend fully [puppet] - https://gerrit.wikimedia.org/r/644234 [15:52:40] (PS2) Alexandros Kosiaris: k8s::master: Remove redundant has_lvs hiera check [puppet] - https://gerrit.wikimedia.org/r/644237 [15:52:42] (PS2) Alexandros Kosiaris: k8s: Allow using cergen [puppet] - https://gerrit.wikimedia.org/r/644238 [15:52:44] (PS3) Alexandros Kosiaris: k8s codfw staging: Assign role to node [puppet] - https://gerrit.wikimedia.org/r/644235 [15:52:46] (PS1) Alexandros Kosiaris: prometheus::k8s: Support arbitrary clusters [puppet] - https://gerrit.wikimedia.org/r/644262 [15:53:35] (CR) jerkins-bot: [V: -1] k8s codfw staging: Assign role to node [puppet] - https://gerrit.wikimedia.org/r/644235 (owner: Alexandros Kosiaris) [15:54:02] RECOVERY - snapshot of s4 in codfw on alert1001 is OK: Last snapshot for s4 at codfw (db2099.codfw.wmnet:3314) taken on 2020-11-30 13:50:33 (1521 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [15:55:24] Operations, LDAP-Access-Requests, Patch-For-Review: LDAP access for Jan Jaquemot - https://phabricator.wikimedia.org/T267771 (ssingh) Open→Resolved Thanks for the confirmation Katie! Jan: The request has been merged and you should have access. 
Please feel free to reopen this task if you it d... [15:57:10] Operations, ops-codfw: Degraded RAID on ms-be2031 - https://phabricator.wikimedia.org/T268773 (Papaul) a:Papaul→fgiunchedi @fgiunchedi disk replaced. Note: we have no more 4TB spare to use . [16:00:14] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:00:48] RECOVERY - Check systemd state on idp1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:03:50] RECOVERY - HP RAID on ms-be2031 is OK: OK: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [16:05:28] RECOVERY - Stale file for node-exporter textfile in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile [16:05:56] PROBLEM - Check systemd state on idp1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:06:58] Operations, MediaWiki-extensions-Score, Security-Team, Wikimedia-General-or-Unknown, and 2 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (Ankry) >>! In T257066#6655978, @Ycrusoe wrote: >> This is not so simple. In Wikisource we convert mus... 
[16:07:32] (PS1) Volans: Upstream release v0.0.45 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/644264 [16:07:52] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@b46380d]: oozie: Repoint hive to analytics-hive.eqiad.wmnet [16:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:06] ebernhardson: thank youuuu [16:09:08] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@b46380d]: oozie: Repoint hive to analytics-hive.eqiad.wmnet (duration: 01m 15s) [16:09:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:26] Operations, MediaWiki-extensions-Score, Security-Team, Wikimedia-General-or-Unknown, and 2 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (Reedy) >>! In T257066#6656115, @Ankry wrote: > If lilypond cannot be executed on WMF servers, it may... [16:12:58] Operations, ops-codfw: Degraded RAID on restbase2009 - https://phabricator.wikimedia.org/T268622 (Papaul) a:Papaul→hnowlan @hnowlan disk replaced [16:15:59] (CR) Volans: [C: +2] Upstream release v0.0.45 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/644264 (owner: Volans) [16:19:21] (Merged) jenkins-bot: Upstream release v0.0.45 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/644264 (owner: Volans) [16:24:08] !log uploaded spicerack_0.0.45 to apt.wikimedia.org buster-wikimedia [16:24:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:27] Operations, Maps, Product-Infrastructure-Team-Backlog, Traffic, Epic: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 (bd808) >>! In T261694#6655085, @Nemo_bis wrote: >>>! In T261694#6652951, @Stephankn wrote: >> Is it possible to allow w... 
[16:30:06] Operations, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install logstash203[345] - https://phabricator.wikimedia.org/T267420 (Papaul) [16:30:41] Operations, observability: VictorOps ~5min delay from email received to incident paging - https://phabricator.wikimedia.org/T266800 (lmata) Open→Resolved a:lmata Closing for now we can reopen if we see another occurrence of this event happening. [16:32:22] Operations, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install logstash203[345] - https://phabricator.wikimedia.org/T267420 (Papaul) [16:32:28] Operations, observability: Two close pages for idle workers api + appserver didn't auto-resolve on recovery - https://phabricator.wikimedia.org/T266570 (lmata) a:herron [16:32:52] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:36:55] Operations, Security-Team: Offboard Chase Pettet from Security Team - https://phabricator.wikimedia.org/T265147 (sbassett) [16:38:18] Operations, ops-codfw, DBA, DC-Ops: (Need By: 2020-11-29) rack/setup/install db214[234] - https://phabricator.wikimedia.org/T267041 (Papaul) [16:41:29] (CR) Volans: "Small nit inline, looks good if the compiler is happy" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/643354 (https://phabricator.wikimedia.org/T266488) (owner: CRusnov) [16:43:27] Operations, Traffic: fifo-log-tailer: gracefully handle missing unix socket - https://phabricator.wikimedia.org/T268883 (ema) Open→Resolved a:ema Done! ` root@cp4028:~# fifo-log-tailer -socket this-does-not-exist-at-all.socket 2020/11/30 16:42:38 Unable to read from socket: dial unix this-do... 
[16:46:41] 10Operations, 10Domains, 10Traffic: URL to redirect to upcoming Wikipedia Birthday page on wikimediafoundation.org - https://phabricator.wikimedia.org/T264367 (10Dzahn) [16:46:56] (03PS2) 10Alexandros Kosiaris: prometheus::k8s: Support arbitrary clusters [puppet] - 10https://gerrit.wikimedia.org/r/644262 [16:48:08] (03CR) 10SBassett: [C: 03+1] Add link to Article Feedback dumps for downloaders [puppet] - 10https://gerrit.wikimedia.org/r/594961 (https://phabricator.wikimedia.org/T250715) (owner: 10ArielGlenn) [16:48:26] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:48:52] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26766/console" [puppet] - 10https://gerrit.wikimedia.org/r/644262 (owner: 10Alexandros Kosiaris) [16:49:24] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission es1016.eqiad.wmnet - https://phabricator.wikimedia.org/T268812 (10wiki_willy) a:05wiki_willy→03Cmjohnson [16:49:59] 10Operations, 10Puppet, 10puppet-compiler: Ensure Puppet checks types as part of the build - https://phabricator.wikimedia.org/T261693 (10Ottomata) p:05High→03Medium [16:50:05] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission es1015.eqiad.wmnet - https://phabricator.wikimedia.org/T268810 (10wiki_willy) a:05wiki_willy→03Cmjohnson [16:50:46] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: dc=eqiad,cluster=maps,service=kartotherian,name=maps1008.eqiad.wmnet [16:50:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:11] 10Operations: Segfault for systemd-sysusers.service on stat1007 - https://phabricator.wikimedia.org/T256098 (10Ottomata) [16:52:44] (03PS2) 10Jbond: cfss::ocsp: move ocsp servie to its own resource [puppet] - 
10https://gerrit.wikimedia.org/r/644254 (https://phabricator.wikimedia.org/T268882) [16:53:05] (03PS2) 10CRusnov: netbox: Adjust settings for supporting Netbox 2.9 series [puppet] - 10https://gerrit.wikimedia.org/r/643354 (https://phabricator.wikimedia.org/T266488) [16:53:07] (03CR) 10CRusnov: netbox: Adjust settings for supporting Netbox 2.9 series (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/643354 (https://phabricator.wikimedia.org/T266488) (owner: 10CRusnov) [16:53:10] (03CR) 10jerkins-bot: [V: 04-1] cfss::ocsp: move ocsp servie to its own resource [puppet] - 10https://gerrit.wikimedia.org/r/644254 (https://phabricator.wikimedia.org/T268882) (owner: 10Jbond) [16:54:43] RECOVERY - snapshot of s7 in codfw on alert1001 is OK: Last snapshot for s7 at codfw (db2100.codfw.wmnet:3317) taken on 2020-11-30 15:09:47 (1034 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [16:54:43] (03PS3) 10Jbond: cfss::ocsp: move ocsp service to its own resource [puppet] - 10https://gerrit.wikimedia.org/r/644254 (https://phabricator.wikimedia.org/T268882) [16:55:05] RECOVERY - Device not healthy -SMART- on ms-be2031 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2031&var-datasource=codfw+prometheus/ops [16:55:08] (03CR) 10jerkins-bot: [V: 04-1] cfss::ocsp: move ocsp service to its own resource [puppet] - 10https://gerrit.wikimedia.org/r/644254 (https://phabricator.wikimedia.org/T268882) (owner: 10Jbond) [16:55:28] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26769/console" [puppet] - 10https://gerrit.wikimedia.org/r/644254 (https://phabricator.wikimedia.org/T268882) (owner: 10Jbond) [16:55:55] (03CR) 10CRusnov: "compiler output:" [puppet] - 10https://gerrit.wikimedia.org/r/643354 (https://phabricator.wikimedia.org/T266488) (owner: 10CRusnov) [16:57:37] 10Operations, 10Security-Team: Offboard Chase Pettet from Security Team - https://phabricator.wikimedia.org/T265147 (10MoritzMuehlenhoff) @sbassett I think there's nothing open, can be closed? [16:59:07] (03PS1) 10Volans: Move import from wmflib [cookbooks] - 10https://gerrit.wikimedia.org/r/644284 [17:00:06] 10Operations, 10Security-Team: Offboard Chase Pettet from Security Team - https://phabricator.wikimedia.org/T265147 (10sbassett) >>! In T265147#6656404, @MoritzMuehlenhoff wrote: > @sbassett I think there's nothing open, can be closed? T265922 is still technically open, just waiting for SRE review. But that... 
[17:00:19] RECOVERY - Check systemd state on idp1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:35] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01009 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [17:02:10] was something changed on idp? the above board no longer seems to load the table of puppet failures: https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [17:02:32] oh, it may be me [17:02:35] (03PS1) 10Arturo Borrero Gonzalez: kubeadm: use calico/kube-controllers image from our internal docker registry [puppet] - 10https://gerrit.wikimedia.org/r/644286 (https://phabricator.wikimedia.org/T269016) [17:02:37] I may not be properly logged in [17:03:36] yeah, it was me [17:03:45] PROBLEM - Check systemd state on idp1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:22] (03PS3) 10Alexandros Kosiaris: prometheus::k8s: Support arbitrary clusters [puppet] - 10https://gerrit.wikimedia.org/r/644262 [17:05:24] 10Operations, 10Analytics-Clusters: Backport kafkacat 1.6.0 from bullseye to buster-backports or buster-wikimedia - https://phabricator.wikimedia.org/T268936 (10Ottomata) [17:05:43] 10Operations, 10ops-eqiad, 10Analytics-Clusters: an-presto1004 shows only the NIC in the boot list - https://phabricator.wikimedia.org/T268951 (10Ottomata) [17:05:51] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install logstash203[345] - https://phabricator.wikimedia.org/T267420 (10Papaul) [17:06:47] (03PS4) 10Jbond: cfss::ocsp: move ocsp service to its own resource [puppet] - 10https://gerrit.wikimedia.org/r/644254 (https://phabricator.wikimedia.org/T268882) [17:07:04] 10Operations, 10User-DannyS712: Access to #mediawiki_security IRC channel for DannyS712 -
https://phabricator.wikimedia.org/T267800 (10sbassett) a:05sbassett→03None [17:07:17] (03CR) 10jerkins-bot: [V: 04-1] cfss::ocsp: move ocsp service to its own resource [puppet] - 10https://gerrit.wikimedia.org/r/644254 (https://phabricator.wikimedia.org/T268882) (owner: 10Jbond) [17:07:28] !log reset failed (now obsolete) idp-u2f-sync/stunnel4 services on idp1001 [17:07:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:35] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26771/console" [puppet] - 10https://gerrit.wikimedia.org/r/644254 (https://phabricator.wikimedia.org/T268882) (owner: 10Jbond) [17:08:15] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26770/console" [puppet] - 10https://gerrit.wikimedia.org/r/644262 (owner: 10Alexandros Kosiaris) [17:08:51] RECOVERY - Check systemd state on idp1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:10:17] (03CR) 10Elukey: [C: 03+1] Move import from wmflib [cookbooks] - 10https://gerrit.wikimedia.org/r/644284 (owner: 10Volans) [17:10:58] (03PS5) 10Jbond: cfss::ocsp: move ocsp service to its own resource [puppet] - 10https://gerrit.wikimedia.org/r/644254 (https://phabricator.wikimedia.org/T268882) [17:11:09] (03CR) 10Bstorm: [C: 03+2] kubeadm: use calico/kube-controllers image from our internal docker registry [puppet] - 10https://gerrit.wikimedia.org/r/644286 (https://phabricator.wikimedia.org/T269016) (owner: 10Arturo Borrero Gonzalez) [17:11:34] (03CR) 10jerkins-bot: [V: 04-1] cfss::ocsp: move ocsp service to its own resource [puppet] - 10https://gerrit.wikimedia.org/r/644254 (https://phabricator.wikimedia.org/T268882) (owner: 10Jbond) [17:11:46] (03PS3) 10Alexandros Kosiaris: k8s: Remove profile::kubernetes::master::storage_backend fully [puppet] -
10https://gerrit.wikimedia.org/r/644234 [17:11:48] (03PS3) 10Alexandros Kosiaris: k8s::master: Remove redundant has_lvs hiera check [puppet] - 10https://gerrit.wikimedia.org/r/644237 [17:11:52] (03PS3) 10Alexandros Kosiaris: k8s: Allow using cergen [puppet] - 10https://gerrit.wikimedia.org/r/644238 [17:11:54] (03PS4) 10Alexandros Kosiaris: k8s codfw staging: Assign role to node [puppet] - 10https://gerrit.wikimedia.org/r/644235 [17:11:56] (03PS4) 10Alexandros Kosiaris: prometheus::k8s: Support arbitrary clusters [puppet] - 10https://gerrit.wikimedia.org/r/644262 [17:12:14] 10Operations, 10serviceops, 10MW-1.36-notes (1.36.0-wmf.18; 2020-11-17), 10Performance Issue, and 3 others: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (10Pchelolo) [17:12:49] (03CR) 10jerkins-bot: [V: 04-1] k8s codfw staging: Assign role to node [puppet] - 10https://gerrit.wikimedia.org/r/644235 (owner: 10Alexandros Kosiaris) [17:13:17] (03CR) 10Volans: [C: 03+2] Move import from wmflib [cookbooks] - 10https://gerrit.wikimedia.org/r/644284 (owner: 10Volans) [17:13:28] (03PS6) 10Jbond: cfss::ocsp: move ocsp service to its own resource [puppet] - 10https://gerrit.wikimedia.org/r/644254 (https://phabricator.wikimedia.org/T268882) [17:14:14] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26772/console" [puppet] - 10https://gerrit.wikimedia.org/r/644262 (owner: 10Alexandros Kosiaris) [17:14:21] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26773/console" [puppet] - 10https://gerrit.wikimedia.org/r/644254 (https://phabricator.wikimedia.org/T268882) (owner: 10Jbond) [17:15:20] (03CR) 10Jbond: [V: 03+1 C: 03+2] cfss::ocsp: move ocsp service to its own resource [puppet] - 10https://gerrit.wikimedia.org/r/644254 (https://phabricator.wikimedia.org/T268882) (owner: 10Jbond) 
[17:15:52] (03Merged) 10jenkins-bot: Move import from wmflib [cookbooks] - 10https://gerrit.wikimedia.org/r/644284 (owner: 10Volans) [17:17:15] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db214[234] - https://phabricator.wikimedia.org/T267041 (10Papaul) [17:20:17] (03PS4) 10Alexandros Kosiaris: k8s: Remove profile::kubernetes::master::storage_backend fully [puppet] - 10https://gerrit.wikimedia.org/r/644234 [17:20:19] (03PS4) 10Alexandros Kosiaris: k8s::master: Remove redundant has_lvs hiera check [puppet] - 10https://gerrit.wikimedia.org/r/644237 [17:20:21] (03PS4) 10Alexandros Kosiaris: k8s: Allow using cergen [puppet] - 10https://gerrit.wikimedia.org/r/644238 [17:20:23] (03PS5) 10Alexandros Kosiaris: k8s codfw staging: Assign role to node [puppet] - 10https://gerrit.wikimedia.org/r/644235 [17:20:25] (03PS5) 10Alexandros Kosiaris: prometheus::k8s: Support arbitrary clusters [puppet] - 10https://gerrit.wikimedia.org/r/644262 [17:21:20] (03CR) 10jerkins-bot: [V: 04-1] k8s codfw staging: Assign role to node [puppet] - 10https://gerrit.wikimedia.org/r/644235 (owner: 10Alexandros Kosiaris) [17:22:39] 10Operations, 10Technical-blog-posts, 10Traffic: 2nd part of blog post series: the evolution of Wikimedia's Content Delivery Network - https://phabricator.wikimedia.org/T266857 (10srodlund) 05Open→03Resolved a:03srodlund [17:24:42] RECOVERY - snapshot of s5 in codfw on alert1001 is OK: Last snapshot for s5 at codfw (db2099.codfw.wmnet:3315) taken on 2020-11-30 16:39:56 (677 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [17:25:47] (03PS1) 10Jbond: profile::pki::server: enable ocsp service [puppet] - 10https://gerrit.wikimedia.org/r/644291 (https://phabricator.wikimedia.org/T268882) [17:26:13] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26774/console" [puppet] - 10https://gerrit.wikimedia.org/r/644262 (owner: 10Alexandros 
Kosiaris) [17:27:18] (03PS2) 10Jbond: profile::pki::server: enable ocsp service [puppet] - 10https://gerrit.wikimedia.org/r/644291 (https://phabricator.wikimedia.org/T268882) [17:27:48] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC at https://puppet-compiler.wmflabs.org/compiler1001/26774/, changes as expected" [puppet] - 10https://gerrit.wikimedia.org/r/644262 (owner: 10Alexandros Kosiaris) [17:31:03] (03PS3) 10Jbond: profile::pki::server: enable ocsp service [puppet] - 10https://gerrit.wikimedia.org/r/644291 (https://phabricator.wikimedia.org/T268882) [17:31:56] !log joal@deploy1001 Started deploy [analytics/refinery@9db742d]: Analytics special deploy before first of month - Hotfix [analytics/refinery@9db742d] [17:31:58] (03CR) 10Dave Pifke: [C: 03+1] Allow CORS requests to noc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/644230 (https://phabricator.wikimedia.org/T268167) (owner: 10Gilles) [17:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:50] (03PS4) 10Jbond: profile::pki::server: enable ocsp service [puppet] - 10https://gerrit.wikimedia.org/r/644291 (https://phabricator.wikimedia.org/T268882) [17:37:21] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [17:37:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:28] (03PS5) 10Jbond: profile::pki::server: enable ocsp service [puppet] - 10https://gerrit.wikimedia.org/r/644291 (https://phabricator.wikimedia.org/T268882) [17:39:11] (03PS3) 10ArielGlenn: Add link to Article Feedback dumps for downloaders [puppet] - 10https://gerrit.wikimedia.org/r/594961 (https://phabricator.wikimedia.org/T250715) [17:39:27] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26779/console" [puppet] - 10https://gerrit.wikimedia.org/r/644291 (https://phabricator.wikimedia.org/T268882) (owner: 10Jbond) [17:40:24] 10Puppet, 10Beta-Cluster-Infrastructure, 10Developer Productivity, 
10Patch-For-Review: puppetdb on deployment-puppetdb03 keeps getting OOMKilled - https://phabricator.wikimedia.org/T248041 (10hashar) Based on `dmesg` the oom hasn't happened since beginning of October so that is an improvement. Some request... [17:40:27] (03CR) 10Jbond: [V: 03+1 C: 03+2] profile::pki::server: enable ocsp service [puppet] - 10https://gerrit.wikimedia.org/r/644291 (https://phabricator.wikimedia.org/T268882) (owner: 10Jbond) [17:41:04] (03CR) 10ArielGlenn: [C: 03+2] Add link to Article Feedback dumps for downloaders [puppet] - 10https://gerrit.wikimedia.org/r/594961 (https://phabricator.wikimedia.org/T250715) (owner: 10ArielGlenn) [17:43:18] (03PS1) 10Jbond: P:pki::server: Use correct host name [puppet] - 10https://gerrit.wikimedia.org/r/644294 [17:43:29] !log joal@deploy1001 Finished deploy [analytics/refinery@9db742d]: Analytics special deploy before first of month - Hotfix [analytics/refinery@9db742d] (duration: 11m 32s) [17:43:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:53] (03CR) 10Jbond: [C: 03+2] P:pki::server: Use correct host name [puppet] - 10https://gerrit.wikimedia.org/r/644294 (owner: 10Jbond) [17:44:56] (03CR) 10Dave Pifke: [C: 03+1] "LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/644001 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [17:47:38] !log joal@deploy1001 Started deploy [analytics/refinery@9db742d] (thin): Analytics special deploy before first of month - Hotfix -- THIN [analytics/refinery@9db742d] [17:47:43] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:47:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:46] !log joal@deploy1001 Finished deploy [analytics/refinery@9db742d] (thin): Analytics special deploy before first of month - Hotfix -- THIN [analytics/refinery@9db742d] (duration: 00m 08s) [17:47:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:44] (03PS1) 10Jbond: P:pki::server: use safe_title for the label [puppet] - 10https://gerrit.wikimedia.org/r/644295 (https://phabricator.wikimedia.org/T268882) [17:49:33] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26781/console" [puppet] - 10https://gerrit.wikimedia.org/r/644295 (https://phabricator.wikimedia.org/T268882) (owner: 10Jbond) [17:49:56] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:pki::server: use safe_title for the label [puppet] - 10https://gerrit.wikimedia.org/r/644295 (https://phabricator.wikimedia.org/T268882) (owner: 10Jbond) [17:57:28] PROBLEM - Check systemd state on pki2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:58:10] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Access to peek2001.codfw.wmnet - https://phabricator.wikimedia.org/T265922 (10Dzahn) 05Stalled→03Open brought up in today's SRE meeting for approval and approved [17:58:14] jbond42: cfssl failing ^^^ FYI [17:59:13] (03CR) 10Dzahn: "please also see https://gerrit.wikimedia.org/r/c/operations/puppet/+/643112" [puppet] - 10https://gerrit.wikimedia.org/r/644001 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [17:59:37] (03PS3) 10Dzahn: admin: create peek-admins group and apply on peek role [puppet] - 10https://gerrit.wikimedia.org/r/641570 (https://phabricator.wikimedia.org/T265922) [18:00:04] ryankemper: (Dis)respected human, time to deploy Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201130T1800). Please do the needful. [18:00:20] (03PS1) 10Jbond: P:pki::server: use the correct CA certificate and add ocsp_port [puppet] - 10https://gerrit.wikimedia.org/r/644298 (https://phabricator.wikimedia.org/T268882) [18:01:10] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26782/console" [puppet] - 10https://gerrit.wikimedia.org/r/644298 (https://phabricator.wikimedia.org/T268882) (owner: 10Jbond) [18:02:19] (03CR) 10Dzahn: [C: 03+2] "brought up in today's SRE meeting and was approved" [puppet] - 10https://gerrit.wikimedia.org/r/641570 (https://phabricator.wikimedia.org/T265922) (owner: 10Dzahn) [18:02:38] (03PS2) 10Jbond: P:pki::server: use the correct CA certificate and add ocsp_port [puppet] - 10https://gerrit.wikimedia.org/r/644298 (https://phabricator.wikimedia.org/T268882) [18:03:31] (03CR) 10Dzahn: [C: 04-1] "mistake in group name" [puppet] - 10https://gerrit.wikimedia.org/r/641570 (https://phabricator.wikimedia.org/T265922) (owner: 10Dzahn) [18:03:32] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS: 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26783/console" [puppet] - 10https://gerrit.wikimedia.org/r/644298 (https://phabricator.wikimedia.org/T268882) (owner: 10Jbond) [18:04:50] PROBLEM - Check systemd state on idp1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:04:55] (03PS3) 10Jbond: P:pki::server: use the correct CA certificate and add ocsp_port [puppet] - 10https://gerrit.wikimedia.org/r/644298 (https://phabricator.wikimedia.org/T268882) [18:05:45] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26784/console" [puppet] - 10https://gerrit.wikimedia.org/r/644298 (https://phabricator.wikimedia.org/T268882) (owner: 10Jbond) [18:06:31] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:pki::server: use the correct CA certificate and add ocsp_port [puppet] - 10https://gerrit.wikimedia.org/r/644298 (https://phabricator.wikimedia.org/T268882) (owner: 10Jbond) [18:09:24] (03CR) 10Volans: "LGTM code wise but I have a set of questions on the redis installation, see inline." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/643354 (https://phabricator.wikimedia.org/T266488) (owner: 10CRusnov) [18:10:36] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops, 10Patch-For-Review: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10mforns) @ayounsi Regarding netflow data size in Druid: We Analytics took a look at the data size after adding the new field... [18:27:15] (03PS6) 10Ryan Kemper: elasticsearch-cluster: support for cloudelastic [software/spicerack] - 10https://gerrit.wikimedia.org/r/643532 (https://phabricator.wikimedia.org/T268779) [18:28:15] (03CR) 10CRusnov: "Thanks for quick review." 
(033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/643354 (https://phabricator.wikimedia.org/T266488) (owner: 10CRusnov) [18:28:21] (03CR) 10Ryan Kemper: elasticsearch-cluster: support for cloudelastic (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/643532 (https://phabricator.wikimedia.org/T268779) (owner: 10Ryan Kemper) [18:28:44] (03PS7) 10Ryan Kemper: elasticsearch-cluster: support for cloudelastic [software/spicerack] - 10https://gerrit.wikimedia.org/r/643532 (https://phabricator.wikimedia.org/T268779) [18:31:38] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch-cluster: support for cloudelastic [software/spicerack] - 10https://gerrit.wikimedia.org/r/643532 (https://phabricator.wikimedia.org/T268779) (owner: 10Ryan Kemper) [18:32:50] (03PS5) 10Razzi: zookeeper-test: Give zookeeper-test1001 zookeeper role [puppet] - 10https://gerrit.wikimedia.org/r/642497 (https://phabricator.wikimedia.org/T268202) [18:37:35] (03PS6) 10Razzi: zookeeper-test: Give zookeeper-test1001 zookeeper role [puppet] - 10https://gerrit.wikimedia.org/r/642497 (https://phabricator.wikimedia.org/T268202) [18:39:23] (03CR) 10Razzi: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/26786/zookeeper-test1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/642497 (https://phabricator.wikimedia.org/T268202) (owner: 10Razzi) [18:39:31] (03PS7) 10Razzi: zookeeper-test: Give zookeeper-test1001 zookeeper role [puppet] - 10https://gerrit.wikimedia.org/r/642497 (https://phabricator.wikimedia.org/T268202) [18:41:28] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db214[234] - https://phabricator.wikimedia.org/T267041 (10Papaul) [18:42:44] (03PS4) 10Dzahn: admin: create peek-admins group and apply on peek role [puppet] - 10https://gerrit.wikimedia.org/r/641570 (https://phabricator.wikimedia.org/T265922) [18:42:58] (03PS1) 10Clarakosi: Enable WikimediaApiPortalOAuth on apiportalwiki [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/644305 (https://phabricator.wikimedia.org/T262495) [18:43:18] PROBLEM - MediaWiki memcached error rate on alert1001 is CRITICAL: 9904 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:46:30] RECOVERY - MediaWiki memcached error rate on alert1001 is OK: (C)5000 gt (W)1000 gt 9 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:47:22] (03CR) 10Dzahn: [C: 03+2] admin: create peek-admins group and apply on peek role [puppet] - 10https://gerrit.wikimedia.org/r/641570 (https://phabricator.wikimedia.org/T265922) (owner: 10Dzahn) [18:51:10] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Access to peek2001.codfw.wmnet - https://phabricator.wikimedia.org/T265922 (10Dzahn) 05Open→03Resolved [18:51:12] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Access to peek2001.codfw.wmnet - https://phabricator.wikimedia.org/T265922 (10Dzahn) @Reedy @sbassett This was now approved and added. on `peek2001.codfw.wmnet`: ` Notice: /Stage[main]/Admin/Admin::Groupmembers[peek-admins]/Exec[peek-admins_ensure... [18:53:48] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Access to peek2001.codfw.wmnet - https://phabricator.wikimedia.org/T265922 (10Dzahn) If you want to setup peek1001 in the future access is already dealt with now. The new admin group gets applied on the puppet role, not an individual host name, so this... 
[18:55:44] (03CR) 10Clarakosi: [C: 04-2] "We'll deploy this tomorrow as part of knowledge sharing session" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644305 (https://phabricator.wikimedia.org/T262495) (owner: 10Clarakosi) [18:57:38] (03CR) 10Dzahn: [C: 03+1] "looks consistent with existing uploader groups" [puppet] - 10https://gerrit.wikimedia.org/r/643512 (https://phabricator.wikimedia.org/T268818) (owner: 10Tobias Andersson) [18:58:27] (03CR) 10Dzahn: [C: 03+1] "please involve the person currently on clinic duty (operations channel topic) to handle this as an access request and check what kind of a" [puppet] - 10https://gerrit.wikimedia.org/r/643512 (https://phabricator.wikimedia.org/T268818) (owner: 10Tobias Andersson) [18:58:42] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db214[234] - https://phabricator.wikimedia.org/T267041 (10Papaul) [19:00:04] RoanKattouw, Niharika, and Urbanecm: Your horoscope predicts another unfortunate Morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201130T1900). [19:00:04] Pchelolo: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:10] RECOVERY - Check systemd state on idp1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:01:57] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db214[234] - https://phabricator.wikimedia.org/T267041 (10Papaul) [19:03:22] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Publish Wikibase tarball releases on releases.wikimedia.org - https://phabricator.wikimedia.org/T268818 (10Dzahn) Seems good to me. The group creation in gerrit looks consistent with our existing groups. 
As mentioned on Gerrit let the current clinic du... [19:05:24] PROBLEM - Check systemd state on idp1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:08:14] (03PS1) 10Jbond: P:analytics::cluster::packages::common: demo spec test [puppet] - 10https://gerrit.wikimedia.org/r/644312 (https://phabricator.wikimedia.org/T261693) [19:09:58] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [19:10:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:35] 10Operations, 10Data-Persistence-Backup, 10SRE-swift-storage, 10Goal, 10Patch-For-Review: Prepare a proof of concept of the minimum setup capable of backup and recover testwiki media files - https://phabricator.wikimedia.org/T264189 (10jcrespo) Downloading the largest file on commons (https://commons.wik... [19:13:53] (03PS2) 10Jbond: P:analytics::cluster::packages::common: demo spec test [puppet] - 10https://gerrit.wikimedia.org/r/644312 (https://phabricator.wikimedia.org/T261693) [19:14:20] (03PS2) 10Ppchelko: group2: switch ParserCache to JSON [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644243 (https://phabricator.wikimedia.org/T263579) [19:14:24] (03CR) 10Ppchelko: [C: 03+2] group2: switch ParserCache to JSON [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644243 (https://phabricator.wikimedia.org/T263579) (owner: 10Ppchelko) [19:14:44] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Publish Wikibase tarball releases on releases.wikimedia.org - https://phabricator.wikimedia.org/T268818 (10ssingh) Thanks @Dzahn! @thcipriani: Adding you to this task to see if you have any possible concerns about this request as it involves the use o... 
[19:14:55] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:15:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:16] (03Merged) 10jenkins-bot: group2: switch ParserCache to JSON [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644243 (https://phabricator.wikimedia.org/T263579) (owner: 10Ppchelko) [19:15:19] (03PS2) 10Jcrespo: [WIP] We continue with swift listing and download tests for media backups [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/643980 [19:15:52] (03CR) 10jerkins-bot: [V: 04-1] [WIP] We continue with swift listing and download tests for media backups [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/643980 (owner: 10Jcrespo) [19:17:13] !log ppchelko@deploy1001 Synchronized wmf-config/InitialiseSettings.php: gerrit:644243 group2: switch ParserCache to JSON (duration: 00m 58s) [19:17:15] (03CR) 10jerkins-bot: [V: 04-1] P:analytics::cluster::packages::common: demo spec test [puppet] - 10https://gerrit.wikimedia.org/r/644312 (https://phabricator.wikimedia.org/T261693) (owner: 10Jbond) [19:17:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:02] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [19:18:11] (03PS2) 10Ppchelko: Expiration date: OAuth 2.0 access tokens have effectively infinite expiration date [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644236 (https://phabricator.wikimedia.org/T265075) (owner: 10Vlad.shapik) [19:18:35] 10Operations, 10Puppet, 10puppet-compiler, 10Patch-For-Review: Ensure Puppet checks types as part of the build - https://phabricator.wikimedia.org/T261693 (10jbond) The main issue here is that both puppet-lint and puppet syntax validate are too light weight to pick up issues like this as they are designed... 
[19:18:37] (03CR) 10Ppchelko: [C: 03+2] Expiration date: OAuth 2.0 access tokens have effectively infinite expiration date [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644236 (https://phabricator.wikimedia.org/T265075) (owner: 10Vlad.shapik) [19:19:25] (03Merged) 10jenkins-bot: Expiration date: OAuth 2.0 access tokens have effectively infinite expiration date [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644236 (https://phabricator.wikimedia.org/T265075) (owner: 10Vlad.shapik) [19:21:32] 10Operations, 10Product-Infrastructure-Data, 10Epic, 10Goal, 10Patch-For-Review: automatically collect network error reports from users' browsers (Network Error Logging API) - https://phabricator.wikimedia.org/T257527 (10CDanis) [19:23:20] !log ppchelko@deploy1001 Synchronized wmf-config/CommonSettings.php: gerrit:644236 Decrease OAuth token expiration (duration: 00m 56s) [19:23:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:16] RECOVERY - snapshot of s6 in codfw on alert1001 is OK: Last snapshot for s6 at codfw (db2097.codfw.wmnet:3316) taken on 2020-11-30 18:13:49 (548 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [19:33:19] (03PS3) 10CRusnov: netbox: Adjust settings for supporting Netbox 2.9 series [puppet] - 10https://gerrit.wikimedia.org/r/643354 (https://phabricator.wikimedia.org/T266488) [19:34:50] (03CR) 10jerkins-bot: [V: 04-1] netbox: Adjust settings for supporting Netbox 2.9 series [puppet] - 10https://gerrit.wikimedia.org/r/643354 (https://phabricator.wikimedia.org/T266488) (owner: 10CRusnov) [19:35:10] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [19:35:10] (03PS17) 10ArielGlenn: per job batches file with locking and methods for claiming jobs [dumps] - 10https://gerrit.wikimedia.org/r/596504 (https://phabricator.wikimedia.org/T252396) [19:36:13] (03PS4) 
10CRusnov: netbox: Adjust settings for supporting Netbox 2.9 series [puppet] - 10https://gerrit.wikimedia.org/r/643354 (https://phabricator.wikimedia.org/T266488) [19:36:25] (03PS4) 10Bstorm: wikireplicas: Upgrade maintain-meta_p.py to python 3 [puppet] - 10https://gerrit.wikimedia.org/r/643363 (https://phabricator.wikimedia.org/T268312) [19:39:26] (03PS5) 10CRusnov: netbox: Adjust settings for supporting Netbox 2.9 series [puppet] - 10https://gerrit.wikimedia.org/r/643354 (https://phabricator.wikimedia.org/T266488) [19:40:39] 10Operations, 10ops-eqiad, 10DBA: db1139 memory errors on boot (issue continues after board change) 2020-08-27 - https://phabricator.wikimedia.org/T261405 (10Jclark-ctr) reseated dimm`s errors persisted following up with HP [19:41:35] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install logstash203[345] - https://phabricator.wikimedia.org/T267420 (10Papaul) [19:47:34] !log added Sukhbir to Ops vendor maintenance calendar permissions to make changes and share like all of SRE (T229860) [19:47:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:41] T229860: SRE Onboarding for Sukhbir Singh - https://phabricator.wikimedia.org/T229860 [19:48:24] (03PS6) 10CRusnov: netbox: Adjust settings for supporting Netbox 2.9 series [puppet] - 10https://gerrit.wikimedia.org/r/643354 (https://phabricator.wikimedia.org/T266488) [19:52:05] !log power down ms-be2059 for RAID re-configuration [19:52:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:14] PROBLEM - Host ms-be2059 is DOWN: PING CRITICAL - Packet loss = 100% [19:54:34] (03PS18) 10ArielGlenn: per job batches file with locking and methods for claiming jobs [dumps] - 10https://gerrit.wikimedia.org/r/596504 (https://phabricator.wikimedia.org/T252396) [19:55:19] (03PS2) 10Razzi: Add kafka-test virtual machine mac addresses [puppet] - 10https://gerrit.wikimedia.org/r/642563 (https://phabricator.wikimedia.org/T268202) 
[19:56:10] (03PS7) 10CRusnov: netbox: Adjust settings for supporting Netbox 2.9 series [puppet] - 10https://gerrit.wikimedia.org/r/643354 (https://phabricator.wikimedia.org/T266488) [19:57:41] (03CR) 10jerkins-bot: [V: 04-1] netbox: Adjust settings for supporting Netbox 2.9 series [puppet] - 10https://gerrit.wikimedia.org/r/643354 (https://phabricator.wikimedia.org/T266488) (owner: 10CRusnov) [19:57:43] (03CR) 10CRusnov: "pcc output https://puppet-compiler.wmflabs.org/compiler1003/26791/" [puppet] - 10https://gerrit.wikimedia.org/r/643354 (https://phabricator.wikimedia.org/T266488) (owner: 10CRusnov) [19:59:16] (03CR) 10ArielGlenn: [C: 03+2] per job batches file with locking and methods for claiming jobs [dumps] - 10https://gerrit.wikimedia.org/r/596504 (https://phabricator.wikimedia.org/T252396) (owner: 10ArielGlenn) [19:59:25] (03PS8) 10CRusnov: netbox: Adjust settings for supporting Netbox 2.9 series [puppet] - 10https://gerrit.wikimedia.org/r/643354 (https://phabricator.wikimedia.org/T266488) [19:59:43] (03Merged) 10jenkins-bot: per job batches file with locking and methods for claiming jobs [dumps] - 10https://gerrit.wikimedia.org/r/596504 (https://phabricator.wikimedia.org/T252396) (owner: 10ArielGlenn) [20:00:15] (03CR) 10Ottomata: zookeeper-test: Give zookeeper-test1001 zookeeper role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/642497 (https://phabricator.wikimedia.org/T268202) (owner: 10Razzi) [20:00:20] RECOVERY - Host ms-be2059 is UP: PING OK - Packet loss = 0%, RTA = 33.37 ms [20:01:21] (03PS1) 10Ppchelko: Remove wgParserCacheUseJson setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644317 (https://phabricator.wikimedia.org/T263579) [20:02:09] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` ms-be2059.codfw.wmnet ` The log can be found in 
`/var/log/wmf-auto-reim... [20:06:03] (PS1) ArielGlenn: enable page content batches for wikidata xml/sql dump run [puppet] - https://gerrit.wikimedia.org/r/644318 (https://phabricator.wikimedia.org/T252396) [20:06:51] (CR) ArielGlenn: [C: +2] enable page content batches for wikidata xml/sql dump run [puppet] - https://gerrit.wikimedia.org/r/644318 (https://phabricator.wikimedia.org/T252396) (owner: ArielGlenn) [20:10:32] !log ariel@deploy1001 Started deploy [dumps/dumps@2f4d931]: per job batches for page content. step one. [20:10:37] !log ariel@deploy1001 Finished deploy [dumps/dumps@2f4d931]: per job batches for page content. step one. (duration: 00m 04s) [20:10:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:46] (PS1) Dzahn: switch deploy1002/deploy2002 to use buster installer [puppet] - https://gerrit.wikimedia.org/r/644319 (https://phabricator.wikimedia.org/T265963) [20:13:53] (CR) Dzahn: "As suggested on ticket, switching new deployment server hardware to buster at this point."
[puppet] - 10https://gerrit.wikimedia.org/r/644319 (https://phabricator.wikimedia.org/T265963) (owner: 10Dzahn) [20:14:44] (03CR) 10Dzahn: [C: 03+2] switch deploy1002/deploy2002 to use buster installer [puppet] - 10https://gerrit.wikimedia.org/r/644319 (https://phabricator.wikimedia.org/T265963) (owner: 10Dzahn) [20:16:07] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db214[234] - https://phabricator.wikimedia.org/T267041 (10Papaul) [20:16:28] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install logstash203[345] - https://phabricator.wikimedia.org/T267420 (10Papaul) [20:18:08] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ms-be2059.codfw.wmnet'] ` Of which those **FAILED**: ` ['ms-be2059.codfw.wmnet'] ` [20:18:49] jouncebot: next [20:18:49] In 0 hour(s) and 41 minute(s): Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201130T2100) [20:25:38] (03PS1) 10Dzahn: scap: temp remove deploy1002 from dsh group (don't deploy mw to it) [puppet] - 10https://gerrit.wikimedia.org/r/644320 (https://phabricator.wikimedia.org/T265963) [20:26:01] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (10JAnstee_WMF) Excellent, @Dzahn -- thank you - I will reach out if I need further support! 
[20:26:52] (03CR) 10Dzahn: [C: 03+2] "safe to do - deploy1001 is still the active deployment server for all purposes - this one is not" [puppet] - 10https://gerrit.wikimedia.org/r/644320 (https://phabricator.wikimedia.org/T265963) (owner: 10Dzahn) [20:27:16] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` ms-be2059.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reim... [20:28:51] !log reimaging deploy1002 with buster - not the active deployment server, deploy1001 still is (T265963) [20:28:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:58] T265963: Replace production deployment servers and update them to Buster - https://phabricator.wikimedia.org/T265963 [20:32:48] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=parse2001.codfw.wmnet [20:32:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:17] !log depooling parse2001 to prepare for reimage T268524 [20:33:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:26] T268524: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 [20:35:04] (03CR) 10Volans: "Adding Effie/Reuven for the Redis config bit as I'm not familiar with it at all." [puppet] - 10https://gerrit.wikimedia.org/r/643354 (https://phabricator.wikimedia.org/T266488) (owner: 10CRusnov) [20:36:10] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=parse2001.codfw.wmnet [20:36:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:30] (03CR) 10Volans: "How would this Redis instance be monitored?" 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/643354 (https://phabricator.wikimedia.org/T266488) (owner: 10CRusnov) [20:38:53] 10Operations: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` parse2001.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202011302038_dzahn_11921_parse2001_codf... [20:39:32] !log reimaging parse2001 (parsoid canary) with buster (T268524) [20:39:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:39] T268524: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 [20:42:53] !log reimaging deploy2002 with buster (not active, deploy1001/2001 are) T265963 [20:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:00] T265963: Replace production deployment servers and update them to Buster - https://phabricator.wikimedia.org/T265963 [20:43:04] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [20:43:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:19] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [20:43:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:55] (03PS2) 10Dzahn: mediawiki:deployment_server/beta: replace hiera with lookup [puppet] - 10https://gerrit.wikimedia.org/r/643116 (https://phabricator.wikimedia.org/T209953) [20:45:05] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:45:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:56] (03CR) 10Dzahn: [C: 03+2] mediawiki:deployment_server/beta: replace hiera with lookup [puppet] - 10https://gerrit.wikimedia.org/r/643116 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [20:46:33] (03PS5) 10Dzahn: thumbor: move hiera lookup from role to profile, replace with lookup [puppet] 
- 10https://gerrit.wikimedia.org/r/643112 (https://phabricator.wikimedia.org/T209953) [20:47:03] !log pt1979@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:47:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:46] !log razzi@cumin1001 START - Cookbook sre.ganeti.makevm [20:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:24] 10Operations, 10Android-app-Bugs, 10Fundraising-Backlog, 10Thank-You-Page, and 4 others: Deal with donatewiki Thank You page launching in apps - https://phabricator.wikimedia.org/T259312 (10MattCleinman) @CDanis - No rush on making these changes. We have a workable solution for now, but would prefer an eve... [20:55:59] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [20:56:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:15] (03CR) 10Dzahn: gerrit: use proper hostname on replica hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/643919 (owner: 10Hashar) [20:57:16] mutante: that series of patch is a looong tale unfortunately :-\ [20:58:03] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:58:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:20] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [20:58:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:04] chrisalbon and accraze: (Dis)respected human, time to deploy Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201130T2100). Please do the needful. 
[21:00:15] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:00:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:13] (03CR) 10Dzahn: [C: 03+2] "cloud-only: https://openstack-browser.toolforge.org/puppetclass/role::thumbor::mediawiki" [puppet] - 10https://gerrit.wikimedia.org/r/643112 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [21:03:49] (03CR) 10Dzahn: "no.. not true.. and this is not the change I meant to merge at this point." [puppet] - 10https://gerrit.wikimedia.org/r/643112 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [21:04:24] (03PS2) 10Dzahn: thumbor: Migrate hiera to lookup [puppet] - 10https://gerrit.wikimedia.org/r/644001 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [21:04:55] (03PS9) 10CRusnov: netbox: Adjust settings for supporting Netbox 2.9 series [puppet] - 10https://gerrit.wikimedia.org/r/643354 (https://phabricator.wikimedia.org/T266488) [21:04:57] (03CR) 10CRusnov: netbox: Adjust settings for supporting Netbox 2.9 series (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/643354 (https://phabricator.wikimedia.org/T266488) (owner: 10CRusnov) [21:05:50] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/26792/" [puppet] - 10https://gerrit.wikimedia.org/r/644001 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [21:08:47] (03CR) 10CRusnov: "> Patch Set 8:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/643354 (https://phabricator.wikimedia.org/T266488) (owner: 10CRusnov) [21:09:29] (03CR) 10Dzahn: "I think we should probably move everything the other way around to eqiad/codfw and never use ./hosts/ at all.. 
but then it might be more c" [puppet] - 10https://gerrit.wikimedia.org/r/643918 (owner: 10Hashar) [21:11:41] (03CR) 10Dzahn: "noop on thumbor1001 in prod" [puppet] - 10https://gerrit.wikimedia.org/r/644001 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [21:12:53] (03CR) 10Dzahn: [C: 03+2] "I would prefer even more to never use ./hosts/ at all. but merging this anyway for consistency. https://puppet-compiler.wmflabs.org/compil" [puppet] - 10https://gerrit.wikimedia.org/r/643918 (owner: 10Hashar) [21:16:02] (03CR) 10Dzahn: "noop on gerrit1001/gerrit2001" [puppet] - 10https://gerrit.wikimedia.org/r/643918 (owner: 10Hashar) [21:16:59] 10Operations, 10vm-requests, 10Patch-For-Review: Eq: 5 VM request for kafka-test-eqiad cluster - https://phabricator.wikimedia.org/T268202 (10razzi) I originally created these virtual machines in the `analytics` vlan, but it should be in the default `private` network instead, so I'm decommissioning the nodes... [21:18:26] (03CR) 10Bstorm: "Finally got my test environment working again and validated that this should work as expected against a standard, existing replica as well" [puppet] - 10https://gerrit.wikimedia.org/r/642503 (https://phabricator.wikimedia.org/T268312) (owner: 10Bstorm) [21:18:44] (03CR) 10Bstorm: [C: 03+2] wikireplicas: modify views scripts to work on any replica style [puppet] - 10https://gerrit.wikimedia.org/r/642503 (https://phabricator.wikimedia.org/T268312) (owner: 10Bstorm) [21:22:07] 10Operations: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2001.codfw.wmnet'] ` and were **ALL** successful. 
[21:27:50] RECOVERY - snapshot of s3 in codfw on alert1001 is OK: Last snapshot for s3 at codfw (db2098.codfw.wmnet:3313) taken on 2020-11-30 18:13:08 (982 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [21:28:55] (PS1) Bstorm: wikireplicas: fiddle with the template spacing a bit [puppet] - https://gerrit.wikimedia.org/r/644328 (https://phabricator.wikimedia.org/T268312) [21:31:07] (PS2) Bstorm: wikireplicas: fiddle with the template spacing a bit [puppet] - https://gerrit.wikimedia.org/r/644328 (https://phabricator.wikimedia.org/T268312) [21:32:20] (PS3) Bstorm: wikireplicas: fiddle with the template spacing a bit [puppet] - https://gerrit.wikimedia.org/r/644328 (https://phabricator.wikimedia.org/T268312) [21:34:25] Operations, Security-Team: Offboard Chase Pettet from Security Team - https://phabricator.wikimedia.org/T265147 (sbassett) Open→Resolved p:Medium→Low [21:34:28] (CR) Bstorm: [C: +2] wikireplicas: fiddle with the template spacing a bit [puppet] - https://gerrit.wikimedia.org/r/644328 (https://phabricator.wikimedia.org/T268312) (owner: Bstorm) [21:36:22] (PS10) CRusnov: netbox: Adjust settings for supporting Netbox 2.9 series [puppet] - https://gerrit.wikimedia.org/r/643354 (https://phabricator.wikimedia.org/T266488) [21:38:02] (CR) jerkins-bot: [V: -1] netbox: Adjust settings for supporting Netbox 2.9 series [puppet] - https://gerrit.wikimedia.org/r/643354 (https://phabricator.wikimedia.org/T266488) (owner: CRusnov) [21:38:27] !log razzi@cumin1001 START - Cookbook sre.hosts.decommission [21:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:51] (PS6) Dzahn: thumbor: move thumbor mediawiki role to profile [puppet] - https://gerrit.wikimedia.org/r/643112 (https://phabricator.wikimedia.org/T209953) [21:43:14] (PS1) Dzahn: site: add deploy2002 and unify deployment server role regex [puppet] - https://gerrit.wikimedia.org/r/644333
(https://phabricator.wikimedia.org/T265963) [21:43:30] (03PS2) 10Dzahn: site: add deploy2002 and unify deployment server role regex [puppet] - 10https://gerrit.wikimedia.org/r/644333 (https://phabricator.wikimedia.org/T265963) [21:44:45] (03CR) 10Dzahn: "rebased on Amir's change, now just moving stuff around to profile and parameter" [puppet] - 10https://gerrit.wikimedia.org/r/643112 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [21:44:46] (03PS11) 10CRusnov: netbox: Adjust settings for supporting Netbox 2.9 series [puppet] - 10https://gerrit.wikimedia.org/r/643354 (https://phabricator.wikimedia.org/T266488) [21:45:29] !log razzi@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [21:45:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:34] 10Operations, 10vm-requests, 10Patch-For-Review: Eq: 5 VM request for kafka-test-eqiad cluster - https://phabricator.wikimedia.org/T268202 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by razzi@cumin1001 for hosts: `zookeeper-test1001.eqiad.wmnet` - zookeeper-test1001.eqiad.wmnet (**PASS**... [21:46:03] (03PS2) 10Dzahn: gerrit: daemon option in gerrit.config [puppet] - 10https://gerrit.wikimedia.org/r/643944 (owner: 10Hashar) [21:46:08] (03CR) 10jerkins-bot: [V: 04-1] netbox: Adjust settings for supporting Netbox 2.9 series [puppet] - 10https://gerrit.wikimedia.org/r/643354 (https://phabricator.wikimedia.org/T266488) (owner: 10CRusnov) [21:49:21] (03PS12) 10CRusnov: netbox: Adjust settings for supporting Netbox 2.9 series [puppet] - 10https://gerrit.wikimedia.org/r/643354 (https://phabricator.wikimedia.org/T266488) [21:50:31] (03CR) 10Dzahn: "unexpected changes in compiler output .. though only .erb is changed here... 
https://puppet-compiler.wmflabs.org/compiler1003/26796/gerrit" [puppet] - 10https://gerrit.wikimedia.org/r/643944 (owner: 10Hashar) [21:50:53] (03CR) 10jerkins-bot: [V: 04-1] netbox: Adjust settings for supporting Netbox 2.9 series [puppet] - 10https://gerrit.wikimedia.org/r/643354 (https://phabricator.wikimedia.org/T266488) (owner: 10CRusnov) [21:51:21] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=parse2001.codfw.wmnet [21:51:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:44] !log parse2001 - scap pull [21:51:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:53] (03PS13) 10CRusnov: netbox: Adjust settings for supporting Netbox 2.9 series [puppet] - 10https://gerrit.wikimedia.org/r/643354 (https://phabricator.wikimedia.org/T266488) [21:53:06] 10Operations: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10Dzahn) @jijiki parse2001 is now on buster. I noticed no puppet errors or warnings. It went surprisingly smooth. Icinga also looking good. It is currently back in "pooled=no" (from inactive). [21:54:53] 10Operations: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10Dzahn) [21:55:13] (03CR) 10CRusnov: "Okay I have discovered how to override settings in the redis module and switched to it." [puppet] - 10https://gerrit.wikimedia.org/r/643354 (https://phabricator.wikimedia.org/T266488) (owner: 10CRusnov) [21:55:57] 10Operations, 10Parsoid, 10serviceops: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10Dzahn) [21:56:00] 10Operations, 10Parsoid, 10serviceops: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10jijiki) That sounds lovely! Are there any sanity tests we could possibly do? Maybe @ssastry can give us an idea? [22:00:04] Reedy and sbassett: I seem to be stuck in Groundhog week. Sigh. 
Time for (yet another) Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201130T2200). [22:02:39] Operations, ops-codfw, netops: ripe-atlas-codfw is down - https://phabricator.wikimedia.org/T267714 (CDanis) Open→Resolved Filed T269046 [22:04:24] Operations, Parsoid, serviceops: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (Dzahn) So I used httpbb to test it and we still have this issue, which looks like firewalling..to my surprise. {P13485} [22:06:04] (PS14) CRusnov: netbox: Adjust settings for supporting Netbox 2.9 series [puppet] - https://gerrit.wikimedia.org/r/643354 (https://phabricator.wikimedia.org/T266488) [22:06:18] (PS1) CDanis: Also serve apple-app-site-assoc file from /.well-known/ [mediawiki-config] - https://gerrit.wikimedia.org/r/644343 (https://phabricator.wikimedia.org/T259312) [22:09:04] (CR) CRusnov: "> Patch Set 13:" [puppet] - https://gerrit.wikimedia.org/r/643354 (https://phabricator.wikimedia.org/T266488) (owner: CRusnov) [22:12:11] (CR) RLazarus: [C: +1] "Wow, gerrit sure doesn't have a friendly representation for symlinks."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/644343 (https://phabricator.wikimedia.org/T259312) (owner: 10CDanis) [22:12:32] (03PS1) 10Razzi: Configure zookeeper-test1002.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/644344 (https://phabricator.wikimedia.org/T268074) [22:14:34] !log parse2001 - systemctl restart ferm - had to restart ferm after reimaging (though there weren't any alerts about that) but it fixed running httpbb tests on it (T268524) [22:14:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:41] T268524: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 [22:14:48] !log razzi@cumin1001 START - Cookbook sre.ganeti.makevm [22:14:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:49] !log razzi@cumin1001 START - Cookbook sre.hosts.decommission [22:15:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:59] 10Operations, 10Parsoid, 10serviceops: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10Dzahn) @jijiki @rlazarus I had to manually restart ferm (meh! no alerts about that and should not happen) but that made the httpbb tests work now: ` [deploy1001:~] $ httpbb --hosts parse2... 
[22:18:43] (03CR) 10CDanis: [C: 03+2] "> Patch Set 1: Code-Review+1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644343 (https://phabricator.wikimedia.org/T259312) (owner: 10CDanis) [22:18:49] jouncebot: next [22:18:49] In 1 hour(s) and 41 minute(s): Evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201201T0000) [22:19:33] (03Merged) 10jenkins-bot: Also serve apple-app-site-assoc file from /.well-known/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644343 (https://phabricator.wikimedia.org/T259312) (owner: 10CDanis) [22:21:42] !log cdanis@deploy1001 Synchronized docroot/thankyou: Also serve apple-app-site-assoc file from /.well-known/ T259312 bc52d1481 (duration: 00m 57s) [22:21:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:53] T259312: Deal with donatewiki Thank You page launching in apps - https://phabricator.wikimedia.org/T259312 [22:22:13] 10Operations, 10Release-Engineering-Team-TODO, 10serviceops, 10Patch-For-Review, and 2 others: Upgrade MediaWiki appservers to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10Dzahn) I used `httpbb` to run all the tests for appservers we have on mwdebug1003 and it reported no issues... [22:23:52] (03CR) 10Cwhite: [C: 03+1] alertmanager: fix cluster config out of sync alert [puppet] - 10https://gerrit.wikimedia.org/r/644184 (https://phabricator.wikimedia.org/T266017) (owner: 10Filippo Giunchedi) [22:27:15] 10Operations, 10Android-app-Bugs, 10Fundraising-Backlog, 10Thank-You-Page, and 4 others: Deal with donatewiki Thank You page launching in apps - https://phabricator.wikimedia.org/T259312 (10CDanis) https://thankyou.wikipedia.org/.well-known/apple-app-site-association now live, please let me know if it help... 
[22:34:00] !log razzi@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [22:34:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:34:05] 10Operations, 10vm-requests, 10Patch-For-Review: Eq: 5 VM request for kafka-test-eqiad cluster - https://phabricator.wikimedia.org/T268202 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by razzi@cumin1001 for hosts: `kafka-test1001.eqiad.wmnet` - kafka-test1001.eqiad.wmnet (**WARN**) - **... [22:35:21] (03PS1) 10Razzi: Ensure /tmp/sqoop-jars/ is present [puppet] - 10https://gerrit.wikimedia.org/r/644347 (https://phabricator.wikimedia.org/T251788) [22:36:31] (03PS1) 10Bstorm: wikireplicas: extend, don't append when adding to your lists [puppet] - 10https://gerrit.wikimedia.org/r/644348 (https://phabricator.wikimedia.org/T268312) [22:37:33] jouncebot: now [22:37:33] For the next 1 hour(s) and 22 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201130T2200) [22:38:35] (03PS15) 10CRusnov: netbox: Adjust settings for supporting Netbox 2.9 series [puppet] - 10https://gerrit.wikimedia.org/r/643354 (https://phabricator.wikimedia.org/T266488) [22:38:38] (03PS2) 10Reedy: Remove REL1_34 from $wgExtDistSnapshotRefs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644045 (https://phabricator.wikimedia.org/T268931) [22:38:54] (03CR) 10Reedy: [C: 03+2] Remove REL1_34 from $wgExtDistSnapshotRefs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644045 (https://phabricator.wikimedia.org/T268931) (owner: 10Reedy) [22:39:40] (03Merged) 10jenkins-bot: Remove REL1_34 from $wgExtDistSnapshotRefs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644045 (https://phabricator.wikimedia.org/T268931) (owner: 10Reedy) [22:40:12] (03CR) 10Bstorm: [C: 03+2] wikireplicas: extend, don't append when adding to your lists [puppet] - 10https://gerrit.wikimedia.org/r/644348 (https://phabricator.wikimedia.org/T268312) (owner: 
10Bstorm) [22:42:21] !log reedy@deploy1001 Synchronized wmf-config/CommonSettings.php: Remove 1.34 from $wgExtDistSnapshotRefs T268931 (duration: 00m 57s) [22:42:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:28] T268931: Tidy up references to REL1_34 when it is EOL - https://phabricator.wikimedia.org/T268931 [22:43:13] (03CR) 10Razzi: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26801/console" [puppet] - 10https://gerrit.wikimedia.org/r/644347 (https://phabricator.wikimedia.org/T251788) (owner: 10Razzi) [22:43:45] (03CR) 10Razzi: "PCC https://puppet-compiler.wmflabs.org/compiler1001/26801/an-launcher1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/644347 (https://phabricator.wikimedia.org/T251788) (owner: 10Razzi) [22:45:40] !log razzi@cumin1001 START - Cookbook sre.hosts.decommission [22:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:02] (03PS4) 10CRusnov: base/phaste.py: Port to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/630697 (https://phabricator.wikimedia.org/T247364) [22:46:39] (03PS5) 10CRusnov: base/phaste.py: Port to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/630697 (https://phabricator.wikimedia.org/T247364) [22:46:51] (03CR) 10CRusnov: [C: 03+2] base/phaste.py: Port to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/630697 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [22:48:54] (03PS1) 10Dzahn: deployment::server: buster support, use mariadb-client, not mysql-client [puppet] - 10https://gerrit.wikimedia.org/r/644350 (https://phabricator.wikimedia.org/T265963) [22:49:34] (03CR) 10CRusnov: [C: 03+2] "> Patch Set 2:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/630697 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [22:50:33] (03CR) 10CRusnov: [C: 03+2] base/firewall/check_conntrack.py: Port to Python3 [puppet] - 
10https://gerrit.wikimedia.org/r/630690 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [22:51:44] (03CR) 10CRusnov: [C: 03+2] scripts/interface_automation.py: Clarify statusoverride flag [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/627567 (owner: 10CRusnov) [22:52:45] (03PS1) 10Bstorm: wikireplicas: fix the spacing for the index script template [puppet] - 10https://gerrit.wikimedia.org/r/644351 (https://phabricator.wikimedia.org/T268312) [22:57:07] (03PS1) 10Razzi: analytics: Replace an-coord1001 with analytics-hive [puppet] - 10https://gerrit.wikimedia.org/r/644353 (https://phabricator.wikimedia.org/T268028) [22:58:25] PROBLEM - PHP opcache health on parse2001 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [23:00:00] (03CR) 10Bstorm: [C: 03+2] wikireplicas: fix the spacing for the index script template [puppet] - 10https://gerrit.wikimedia.org/r/644351 (https://phabricator.wikimedia.org/T268312) (owner: 10Bstorm) [23:04:11] (03PS3) 10RLazarus: 00-warmup-caches: Repeat until execution time converges. [cookbooks] - 10https://gerrit.wikimedia.org/r/643596 [23:04:21] (03PS4) 10RLazarus: 00-warmup-caches: Repeat until execution time converges. [cookbooks] - 10https://gerrit.wikimedia.org/r/643596 [23:06:00] (03CR) 10RLazarus: [C: 03+2] "Thanks for the review! Merging this per your last, but happy to refine further." (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/643596 (owner: 10RLazarus) [23:07:10] (03Merged) 10jenkins-bot: 00-warmup-caches: Repeat until execution time converges. 
[cookbooks] - https://gerrit.wikimedia.org/r/643596 (owner: RLazarus) [23:08:27] !log sudo -i /usr/local/sbin/restart-php7.2-fpm [23:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:08:35] !log parse2001 - sudo -i /usr/local/sbin/restart-php7.2-fpm [23:08:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:16] Operations, ops-eqiad: mw1304.mgmt down - https://phabricator.wikimedia.org/T269050 (Dzahn) [23:12:51] !log razzi@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [23:12:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:13] ACKNOWLEDGEMENT - Host mw1304.mgmt is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T269050 [23:16:07] Operations, Release-Engineering-Team-TODO, serviceops, Continuous-Integration-Config, and 2 others: Add pcov PHP extension to wikimedia apt so it can be used in Wikimedia CI - https://phabricator.wikimedia.org/T243847 (Daimona) Open→Invalid We can pull pcov from sury-php, so this shouldn'... [23:18:47] (CR) Dzahn: [V: +1] "https://puppet-compiler.wmflabs.org/compiler1002/26800/" [puppet] - https://gerrit.wikimedia.org/r/644333 (https://phabricator.wikimedia.org/T265963) (owner: Dzahn) [23:27:54] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:27:56] (PS1) Dzahn: conftool: replace hiera() with lookup() [puppet] - https://gerrit.wikimedia.org/r/644357 (https://phabricator.wikimedia.org/T209953) [23:29:23] (CR) jerkins-bot: [V: -1] conftool: replace hiera() with lookup() [puppet] - https://gerrit.wikimedia.org/r/644357 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn) [23:39:24] (CR) CRusnov: "Just a note, I don't see this included in any other modules so this may be deprecated also." [puppet] - https://gerrit.wikimedia.org/r/644358 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov) [23:45:07] (PS3) CRusnov: Port drac.py to Python3 [puppet] - https://gerrit.wikimedia.org/r/644358 (https://phabricator.wikimedia.org/T247364) [23:48:14] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:48:15] (PS1) Dzahn: k8s: replace hiera with lookup [puppet] - https://gerrit.wikimedia.org/r/644363 [23:50:40] (CR) Dzahn: "pretty sure the entire module can be replaced. it's like 2 generations ago from pre-2013. after that we had a different script to change m" [puppet] - https://gerrit.wikimedia.org/r/644358 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov) [23:52:56] (PS1) Dzahn: delete the drac module [puppet] - https://gerrit.wikimedia.org/r/644364 [23:53:20] (CR) CRusnov: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/644358 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov) [23:58:58] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes