[00:00:04] RoanKattouw, Niharika, and Urbanecm: Dear deployers, time to do the Evening backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201208T0000).
[00:00:04] No GERRIT patches in the queue for this window AFAICS.
[00:01:23] Operations, ops-codfw: RMA failed codfw C7 switch - WMF6114 - https://phabricator.wikimedia.org/T267950 (Papaul) Dear Juniper Networks Customer, Thank you for returning your defective product in relation to your recently created RMA. This notification confirms that Juniper has received the following def...
[00:05:41] Operations, ops-codfw: codfw: ms-be2058 memory error - https://phabricator.wikimedia.org/T269266 (Papaul) This arrived today. Will be installing the DIMM tomorrow.
[00:13:12] Hello, does the blocking feature of AbuseFilter need to be enabled separately? I can find it on metawiki, but not on srwiki.
[00:17:13] Kizule: do you have the same user rights?
[00:17:29] ori: Which user rights?
[00:18:14] Do you mean blocking?
[00:21:05] I haven't used AbuseFilter much so I'm not sure which features exactly you're referring to. In general, extension configuration is made to vary by wiki through InitialiseSettings.php. This is the relevant section for AbuseFilter: https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/InitialiseSettings.php#L23689-L23714
[00:21:14] Ahh, I found that it needs to be enabled here https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/refs/heads/master/wmf-config/abusefilter.php
[00:21:19] per wiki
[00:22:53] oh, there we go
[00:23:04] yes, looks like you answered your own question :)
[00:23:24] Maybe. Thank you, ori, in any case.
:)
[00:33:34] PROBLEM - Postgres Replication Lag on maps1009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1791857296 and 175 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:35:52] PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 3211973752 and 312 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:36:12] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1027112016 and 333 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:36:14] PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 2824109504 and 334 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:36:28] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 71923912 and 4 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:36:36] PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 590757992 and 37 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:36:50] PROBLEM - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 216879784 and 10 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:38:02] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 883216 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:42:37] (PS10) Mstyles: Add new helm chart for rdf-streaming-updater [deployment-charts] - https://gerrit.wikimedia.org/r/640571 (https://phabricator.wikimedia.org/T265526)
[00:43:41] (CR) jerkins-bot: [V: -1] Add new helm
chart for rdf-streaming-updater [deployment-charts] - https://gerrit.wikimedia.org/r/640571 (https://phabricator.wikimedia.org/T265526) (owner: Mstyles)
[00:43:45] (CR) Mstyles: "> Patch Set 9: Code-Review+1" (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/640571 (https://phabricator.wikimedia.org/T265526) (owner: Mstyles)
[00:44:36] PROBLEM - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 199267736 and 11 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:45:43] (PS1) Dzahn: wikistats: use script and separate config to dump mysql from timer [puppet] - https://gerrit.wikimedia.org/r/646876
[00:45:50] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1722920360 and 84 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:45:58] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1130169000 and 52 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:46:25] (CR) jerkins-bot: [V: -1] wikistats: use script and separate config to dump mysql from timer [puppet] - https://gerrit.wikimedia.org/r/646876 (owner: Dzahn)
[00:47:32] RECOVERY - Postgres Replication Lag on maps1006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 95376 and 76 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:47:34] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 105200 and 77 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:47:44] RECOVERY - Postgres Replication Lag on maps1010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 60536 and 88 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:48:42] RECOVERY - Postgres
Replication Lag on maps1008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 149688 and 146 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:48:58] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 36120 and 162 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:49:03] (CR) Bmansurov: "Thanks, @Alexandros." [deployment-charts] - https://gerrit.wikimedia.org/r/602527 (https://phabricator.wikimedia.org/T241230) (owner: Bmansurov)
[00:49:14] RECOVERY - Postgres Replication Lag on maps1009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 139888 and 177 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:49:58] RECOVERY - Postgres Replication Lag on maps1005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 228984 and 221 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:50:20] RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 243 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:50:50] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 68063768 and 514 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:50:50] PROBLEM - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 107882672 and 514 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:50:50] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1232601392 and 514 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:51:08] PROBLEM - Postgres Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 21944904
and 533 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:51:08] PROBLEM - Postgres Replication Lag on maps2009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1062890296 and 533 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:51:18] PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 70514992 and 543 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:53:06] (PS2) Dzahn: wikistats: use script and separate config to dump mysql from timer [puppet] - https://gerrit.wikimedia.org/r/646876
[00:53:35] (CR) jerkins-bot: [V: -1] wikistats: use script and separate config to dump mysql from timer [puppet] - https://gerrit.wikimedia.org/r/646876 (owner: Dzahn)
[00:53:46] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 23743576 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:55:40] RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1991888 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:55:58] RECOVERY - Postgres Replication Lag on maps2009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1173872 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:55:58] RECOVERY - Postgres Replication Lag on maps2005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1100304 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:56:08] RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1121216 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:56:09] !log T269204 reimaging the following instances to debian buster: `eqiad public`->`wdqs1007`, `codfw
public`->`wdqs2007`, `codfw internal`->`wdqs2008`, `test`->`wdqs1010`
[00:56:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:56:19] T269204: Some wdqs metrics changed when switching to python3 - https://phabricator.wikimedia.org/T269204
[00:57:16] PROBLEM - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 69731584 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:58:36] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 93844216 and 4 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:59:08] (PS11) Mstyles: Add new helm chart for rdf-streaming-updater [deployment-charts] - https://gerrit.wikimedia.org/r/640571 (https://phabricator.wikimedia.org/T265526)
[01:01:07] (CR) Mstyles: Add new helm chart for rdf-streaming-updater (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/640571 (https://phabricator.wikimedia.org/T265526) (owner: Mstyles)
[01:02:10] RECOVERY - Postgres Replication Lag on maps2006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1963568 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:02:10] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1855864 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:02:12] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=blazegraph site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:04:13] (CR) Mstyles: "I tried doing a readiness port with the UI port (8081) and it's using the pod IP to hit the endpoint and it fails for me in minikube and t" (1 comment) [deployment-charts] -
https://gerrit.wikimedia.org/r/640571 (https://phabricator.wikimedia.org/T265526) (owner: Mstyles)
[01:05:06] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 178658040 and 10 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:05:50] PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 892480320 and 51 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:06:42] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 176 and 56 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:07:24] RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 26920 and 99 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:09:36] Operations, ops-eqiad: sdg1 failed on ms-be1054 - https://phabricator.wikimedia.org/T269556 (wiki_willy) a:Cmjohnson @Cmjohnson - looks like this one is still under warranty (installed in Aug 2019), so you should be good with submitting a RMA.
Thanks, Willy
[01:11:24] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime
[01:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:12:28] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime
[01:12:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:13:27] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[01:13:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:15:17] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[01:15:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:16:18] (PS1) PipelineBot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/646878
[01:18:37] !log ryankemper@cumin2001 START - Cookbook sre.hosts.downtime
[01:18:37] !log ryankemper@cumin2001 START - Cookbook sre.hosts.downtime
[01:18:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:18:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:21:56] !log ryankemper@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[01:22:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:23:37] !log ryankemper@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[01:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:29:13] PROBLEM - Host wdqs2008 is DOWN: PING CRITICAL - Packet loss = 100%
[01:31:19] RECOVERY - Host wdqs2008 is UP: PING OK - Packet loss = 0%, RTA = 33.53 ms
[01:35:25] PROBLEM - puppet last run on ms-be1032 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[01:36:21] (CR) CRusnov: "This change is ready for review."
[puppet] - https://gerrit.wikimedia.org/r/646879 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[01:43:37] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[01:43:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:43:44] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[01:43:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:43:51] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer
[01:43:55] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer
[01:43:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:44:31] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[01:44:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:48:59] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:49:04] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[01:49:04] !log ryankemper@cumin2001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[01:49:07] !log ryankemper@cumin2001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[01:49:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:49:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:50:23] !log ryankemper@mwmaint1002 conftool action : set/pooled=yes:weight=10; selector: name=wdqs1012.eqiad.wmnet|wdqs1013.eqiad.wmnet
[01:50:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:51:18] !log T246345 Brought two new public wdqs nodes `wdqs101[2,3]` into service: `sudo confctl select 'name=wdqs1012.eqiad.wmnet|wdqs1013.eqiad.wmnet' set/pooled=yes:weight=10`
[01:51:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:51:25] T246345: Service implementation on wdqs101[1-3].eqiad.wmnet - https://phabricator.wikimedia.org/T246345
[01:52:27] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[01:52:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:52:47] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=blazegraph site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:53:11] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[01:53:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:54:57] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer
[01:55:02] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:55:13] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer
[01:55:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:56:48] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[01:56:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:57:06] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[01:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:00:51] (CR) Ppchelko: [C: +1] Configure API Portal permissions for launch [mediawiki-config] - https://gerrit.wikimedia.org/r/646862 (https://phabricator.wikimedia.org/T267953) (owner: Cicalese)
[02:07:41] PROBLEM - Check systemd state on ms-be1054 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:13:55] (CR) CRusnov: purge-nagios-resources.py: Port to Python 3 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/646884 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[02:15:53] PROBLEM - Check systemd state on dbprov2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:23:07] RECOVERY - Check systemd state on ms-be1054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:28:45] (PS2) DannyS712: Branch commit for wmf/1.36.0-wmf.21 [core] (wmf/1.36.0-wmf.21) - https://gerrit.wikimedia.org/r/646887 (https://phabricator.wikimedia.org/T264801) (owner: TrainBranchBot)
[02:35:58] (CR) CRusnov: "This change is ready for review."
[puppet] - https://gerrit.wikimedia.org/r/646890 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[03:03:39] !log ryankemper@cumin2001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[03:03:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:05:34] !log ryankemper@cumin2001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[03:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:11:57] RECOVERY - Check systemd state on dbprov2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:16:54] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[03:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:25:35] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[03:25:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:29:25] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:32:25] PROBLEM - Check systemd state on dbprov2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:39:12] Operations, Collection, OfflineContentGenerator, Readers-Web-Backlog (Tracking), Services (watching): Remove deprecated features from book creator UI - https://phabricator.wikimedia.org/T150917 (Izno)
[03:41:04] Operations, OCG-General, Readers-Community-Engagement, Epic, and 2 others: [EPIC] (Proposal) Replicate core OCG features and sunset OCG service - https://phabricator.wikimedia.org/T150871 (Izno)
[03:42:32] Operations, Collection, OfflineContentGenerator, Readers-Web-Backlog (Tracking), Services (watching): Replace OCG in collection extension with Electron - https://phabricator.wikimedia.org/T150872 (Izno)
[03:48:15] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:27:33] RECOVERY - Check systemd state on dbprov2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:49:47] PROBLEM - Check systemd state on ms-be1062 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:53:03] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:57:49] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:03:27] PROBLEM - Check systemd state on ms-be1028 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:11:25] RECOVERY - Check systemd state on ms-be1028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:19:37] Operations, Beta-Cluster-Infrastructure, Release-Engineering-Team, User-DannyS712: Beta cluster logstash down - https://phabricator.wikimedia.org/T268200 (DannyS712) p:Triage→High a:herron→None Reset assignee since it was automatically set when @herron closed as resolved, but the...
[05:38:02] (PS1) C. Scott Ananian: Bump wikimedia/parsoid to 0.13.0-a19 [vendor] (wmf/1.36.0-wmf.21) - https://gerrit.wikimedia.org/r/646771 (https://phabricator.wikimedia.org/T184779)
[05:38:53] Pchelolo are you around? T269651 might be related to your recent work
[05:38:54] T269651: InvalidArgumentException on zhwiki MediaWiki:Conversiontable/zh-hk - "The supplied ParserOptions are not safe to cache. Use NO_CACHE." - https://phabricator.wikimedia.org/T269651
[05:39:09] PROBLEM - dump of es5 in codfw on alert1001 is CRITICAL: Last dump for es5 at codfw (es2025.codfw.wmnet) taken on 2020-12-08 00:00:01 is 1237 GB, but previous one was 1017 GB, a change of 21.7% https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[05:40:23] RECOVERY - Check systemd state on ms-be1062 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:41:50] (PS2) C. Scott Ananian: Bump wikimedia/parsoid to 0.13.0-a19 [vendor] (wmf/1.36.0-wmf.21) - https://gerrit.wikimedia.org/r/646771 (https://phabricator.wikimedia.org/T184779)
[05:42:53] (CR) C. Scott Ananian: [C: +1] Bump wikimedia/parsoid to 0.13.0-a19 [vendor] (wmf/1.36.0-wmf.21) - https://gerrit.wikimedia.org/r/646771 (https://phabricator.wikimedia.org/T184779) (owner: C.
Scott Ananian)
[05:55:47] PROBLEM - dump of es4 in codfw on alert1001 is CRITICAL: Last dump for es4 at codfw (es2022.codfw.wmnet) taken on 2020-12-08 00:00:01 is 1260 GB, but previous one was 1040 GB, a change of 21.2% https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[07:11:33] PROBLEM - Check systemd state on ms-be1022 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:54:47] PROBLEM - Check systemd state on dbprov1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:57:09] RECOVERY - Check systemd state on ms-be1022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:05:22] (CR) Muehlenhoff: [C: +1] "Looks good to me" [puppet] - https://gerrit.wikimedia.org/r/646884 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[08:07:27] (CR) Filippo Giunchedi: [C: +2] alertmanager: add custom email template [puppet] - https://gerrit.wikimedia.org/r/646680 (https://phabricator.wikimedia.org/T267018) (owner: Filippo Giunchedi)
[08:08:59] (CR) Hashar: [C: +2] Branch commit for wmf/1.36.0-wmf.21 [core] (wmf/1.36.0-wmf.21) - https://gerrit.wikimedia.org/r/646887 (https://phabricator.wikimedia.org/T264801) (owner: TrainBranchBot)
[08:09:18] (CR) Muehlenhoff: [C: +1] "Looks good (and yes, it's currently not yet used in production)" [puppet] - https://gerrit.wikimedia.org/r/645206 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn)
[08:09:39] !log power reset ms-be1032
[08:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:27] PROBLEM - Host ms-be1032 is DOWN: PING CRITICAL - Packet loss = 100%
[08:12:31] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] -
https://gerrit.wikimedia.org/r/646890 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[08:12:41] RECOVERY - Host ms-be1032 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[08:12:43] RECOVERY - MD RAID on ms-be1032 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[08:12:45] PROBLEM - Check systemd state on ms-be1032 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:14:19] RECOVERY - Check systemd state on ms-be1032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:16:31] !log upgrade hw raid firmware on ms-be1032 and reboot - T269624
[08:16:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:38] T269624: Degraded RAID on ms-be1032 - https://phabricator.wikimedia.org/T269624
[08:18:59] PROBLEM - Host ms-be1032 is DOWN: PING CRITICAL - Packet loss = 100%
[08:20:55] RECOVERY - Host ms-be1032 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms
[08:22:35] PROBLEM - Check systemd state on ms-be1034 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:24:55] RECOVERY - puppet last run on ms-be1032 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:24:58] Operations, ops-eqiad: Degraded RAID on ms-be1032 - https://phabricator.wikimedia.org/T269624 (fgiunchedi) Open→Resolved a:fgiunchedi I've upgraded the raid controller and on reboot we're back!
[08:28:23] (CR) Muehlenhoff: httpd: make it possible to configure server admin email address (1 comment) [puppet] - https://gerrit.wikimedia.org/r/592712 (https://phabricator.wikimedia.org/T251005) (owner: Reedy)
[08:30:46] (CR) Filippo Giunchedi: [C: +1] profile: add logstash ecs 1.7.0-1 template [puppet] - https://gerrit.wikimedia.org/r/645209 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[08:31:21] RECOVERY - Check systemd state on dbprov1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:31:39] (Merged) jenkins-bot: Branch commit for wmf/1.36.0-wmf.21 [core] (wmf/1.36.0-wmf.21) - https://gerrit.wikimedia.org/r/646887 (https://phabricator.wikimedia.org/T264801) (owner: TrainBranchBot)
[08:40:09] RECOVERY - Disk space on ms-be1032 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be1032&var-datasource=eqiad+prometheus/ops
[08:56:08] jouncebot: now
[08:56:09] No deployments scheduled for the next 3 hour(s) and 3 minute(s)
[09:01:42] (CR) Elukey: "This is really a great news! \o/" [puppet] - https://gerrit.wikimedia.org/r/646751 (https://phabricator.wikimedia.org/T213089) (owner: Effie Mouzeli)
[09:06:58] !log upgrade scap to 3.16 everywhere - T268634
[09:07:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:06] T268634: Deploy Scap version 3.16.0-1 - https://phabricator.wikimedia.org/T268634
[09:19:33] RECOVERY - Check systemd state on ms-be1034 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:20:05] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:32:43] PROBLEM - dump of es4 in eqiad on alert1001 is CRITICAL: Last dump for es4 at eqiad (es1022.eqiad.wmnet) taken on 2020-12-08 00:00:01 is 1260 GB, but previous one was 1040 GB, a change of 21.2% https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[09:33:03] (PS1) Ema: cache: downgrade Varnish on cp3054 to 5.2.1-1wm1 [puppet] - https://gerrit.wikimedia.org/r/646964 (https://phabricator.wikimedia.org/T264398)
[09:39:06] (CR) Ema: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/646964 (https://phabricator.wikimedia.org/T264398) (owner: Ema)
[09:41:29] (CR) DCausse: [C: -1] wdqs: Counters now must end in _total (1 comment) [puppet] - https://gerrit.wikimedia.org/r/646888 (https://phabricator.wikimedia.org/T269204) (owner: Ryan Kemper)
[09:46:53] (CR) Ema: [C: +2] cache: downgrade Varnish on cp3054 to 5.2.1-1wm1 [puppet] - https://gerrit.wikimedia.org/r/646964 (https://phabricator.wikimedia.org/T264398) (owner: Ema)
[09:47:05] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:48:50] (PS3) Cparle: Schedule maintenance script for prioritising images for CAT [puppet] - https://gerrit.wikimedia.org/r/643917 (https://phabricator.wikimedia.org/T262857)
[09:50:15] (CR) jerkins-bot: [V: -1] Schedule maintenance script for prioritising images for CAT [puppet] - https://gerrit.wikimedia.org/r/643917 (https://phabricator.wikimedia.org/T262857) (owner: Cparle)
[09:50:56] (CR) Cparle: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/643917 (https://phabricator.wikimedia.org/T262857) (owner: Cparle)
[09:52:00] (PS4) Cparle: Schedule maintenance script for prioritising images for CAT [puppet] - https://gerrit.wikimedia.org/r/643917 (https://phabricator.wikimedia.org/T262857)
[09:52:14] (PS2)
10Hashar: Add rename-project plugin @ a880148 [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/630548 (https://phabricator.wikimedia.org/T201953) [09:55:07] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:56:33] !log cp3054: downgrade varnish to 5.2.1-1wm1 T264398 [09:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:41] T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 [09:58:05] (03PS5) 10Kormat: int_cont [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/646634 [09:59:43] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:59:53] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:05:53] PROBLEM - dump of es5 in eqiad on alert1001 is CRITICAL: Last dump for es5 at eqiad (es1025.eqiad.wmnet) taken on 2020-12-08 00:00:01 is 1237 GB, but previous one was 1017 GB, a change of 21.7% https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [10:09:14] (03PS1) 10Ema: Revert "cache: downgrade Varnish on cp3054 to 5.2.1-1wm1" [puppet] - 10https://gerrit.wikimedia.org/r/646776 (https://phabricator.wikimedia.org/T264398) [10:10:27] (03CR) 10Ema: [C: 03+2] Revert "cache: downgrade Varnish on cp3054 to 5.2.1-1wm1" [puppet] - 10https://gerrit.wikimedia.org/r/646776 (https://phabricator.wikimedia.org/T264398) (owner: 10Ema) [10:10:58] (03PS1) 10Muehlenhoff: Remove "idp" backup file set and drop backup host profile from IDPs [puppet] - 10https://gerrit.wikimedia.org/r/646966 [10:11:13] (03PS2) 10Muehlenhoff: Remove "idp" backup file set and drop backup host profile from IDPs [puppet] - 10https://gerrit.wikimedia.org/r/646966 [10:13:50] !log cp3054: upgrade varnish back to 6.0.7-1wm1 T264398 [10:13:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:58] T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 [10:14:20] (03PS1) 10DCausse: [cirrus] setup perfield builder A/B test on spaceless languages [extensions/WikimediaEvents] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/646777 (https://phabricator.wikimedia.org/T266027) [10:16:13] PROBLEM - Too many messages in kafka logging-eqiad #o11y on alert1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group={logstash,logstash-codfw} instance=kafkamon1002 job=burrow partition={0,1,2,4,5} prometheus=ops site=eqiad topic={rsyslog-info,rsyslog-notice} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var [10:16:13] 
s&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [10:16:52] ACKNOWLEDGEMENT - dump of es4 in codfw on alert1001 is CRITICAL: Last dump for es4 at codfw (es2022.codfw.wmnet) taken on 2020-12-08 00:00:01 is 1260 GB, but previous one was 1040 GB, a change of 21.2% Kormat Unable to find data on this, but no point in it alerting in the meantime. https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [10:16:52] ACKNOWLEDGEMENT - dump of es4 in eqiad on alert1001 is CRITICAL: Last dump for es4 at eqiad (es1022.eqiad.wmnet) taken on 2020-12-08 00:00:01 is 1260 GB, but previous one was 1040 GB, a change of 21.2% Kormat Unable to find data on this, but no point in it alerting in the meantime. https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [10:16:52] ACKNOWLEDGEMENT - dump of es5 in codfw on alert1001 is CRITICAL: Last dump for es5 at codfw (es2025.codfw.wmnet) taken on 2020-12-08 00:00:01 is 1237 GB, but previous one was 1017 GB, a change of 21.7% Kormat Unable to find data on this, but no point in it alerting in the meantime. https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [10:16:52] ACKNOWLEDGEMENT - dump of es5 in eqiad on alert1001 is CRITICAL: Last dump for es5 at eqiad (es1025.eqiad.wmnet) taken on 2020-12-08 00:00:01 is 1237 GB, but previous one was 1017 GB, a change of 21.7% Kormat Unable to find data on this, but no point in it alerting in the meantime. https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [10:16:57] (03CR) 10Effie Mouzeli: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/646751 (https://phabricator.wikimedia.org/T213089) (owner: 10Effie Mouzeli) [10:17:47] RECOVERY - Too many messages in kafka logging-eqiad #o11y on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [10:18:56] 10Operations, 10Platform Engineering, 10Wikidata, 10serviceops, and 4 others: Upgrade memcached cluster to Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (10jijiki) [10:19:42] 10Operations, 10Platform Engineering, 10Wikidata, 10serviceops, and 4 others: Upgrade memcached cluster to Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (10jijiki) [10:27:57] 10Operations, 10Performance-Team, 10serviceops, 10Patch-For-Review, and 2 others: Reduce read pressure on mc* servers by adding a machine-local Memcached instance (on-host memcached) - https://phabricator.wikimedia.org/T244340 (10jijiki) [10:28:04] 10Operations, 10serviceops, 10Patch-For-Review: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (10jijiki) [10:28:06] 10Operations, 10Performance-Team, 10serviceops, 10Patch-For-Review, and 2 others: Reduce read pressure on mc* servers by adding a machine-local Memcached instance (on-host memcached) - https://phabricator.wikimedia.org/T244340 (10jijiki) 05Open→03Resolved [10:28:15] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10ema) I've downgraded Varnish to version 5.2.1-1wm1 on cp3054 and had to revert due to an issue with varnishlog. After t... 
[10:29:28] !log reboot ms-be1021 - T268435 [10:29:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:37] T268435: Add ms-be106[0-3] to swift - https://phabricator.wikimedia.org/T268435 [10:30:54] (03PS1) 10Effie Mouzeli: hiera: remove shard16 from redis, reimage mc1034/mc2034 to buster [puppet] - 10https://gerrit.wikimedia.org/r/646967 (https://phabricator.wikimedia.org/T213089) [10:30:57] PROBLEM - Host ms-be1021 is DOWN: PING CRITICAL - Packet loss = 100% [10:32:59] RECOVERY - Host ms-be1021 is UP: PING OK - Packet loss = 0%, RTA = 1.72 ms [10:33:48] !log cp4032: restart ats-{tls,be} [10:33:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:04] 10Operations, 10Platform Engineering, 10Wikidata, 10serviceops, and 4 others: Upgrade memcached cluster to Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (10jijiki) Looking at mc1035's behaviour after reimaging, it appears that its hit ratio reached 0.95 after ~10h, and it reached 28Mil... [10:39:42] (03CR) 10Gehel: [C: 04-1] "See comment inline, and ping me or David to discuss in more details as needed." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/646888 (https://phabricator.wikimedia.org/T269204) (owner: 10Ryan Kemper) [10:45:05] !log A:cp remove libvmod-tbf -- see https://gerrit.wikimedia.org/r/c/operations/puppet/+/623583/ [10:45:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:53] 10Operations, 10Platform Engineering, 10Wikidata, 10serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (10MoritzMuehlenhoff) [10:47:35] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:53:04] (03PS5) 10Hashar: gerrit: open link in new window [puppet] - 10https://gerrit.wikimedia.org/r/631237 (owner: 10Paladox) [10:53:55] (03CR) 10Hashar: [C: 03+1] "Rebased to fix the trivial conflict reported by Gerrit." [puppet] - 10https://gerrit.wikimedia.org/r/631237 (owner: 10Paladox) [10:55:11] (03CR) 10Elukey: [C: 03+1] "Looks good modulo the fact that it is acceptable to remove all the keys in a Redis shard (I am not sure the current status of the main sta" [puppet] - 10https://gerrit.wikimedia.org/r/646967 (https://phabricator.wikimedia.org/T213089) (owner: 10Effie Mouzeli) [10:58:13] (03PS3) 10Alexandros Kosiaris: apertium: Add a TLS enabled release [deployment-charts] - 10https://gerrit.wikimedia.org/r/646688 (https://phabricator.wikimedia.org/T255672) [10:58:15] (03PS1) 10Alexandros Kosiaris: apertium: Set default tls parameters correctly [deployment-charts] - 10https://gerrit.wikimedia.org/r/646970 [10:58:45] (03CR) 10Alexandros Kosiaris: [C: 03+2] Remove values-nontls.yaml for mobileapps [deployment-charts] - 10https://gerrit.wikimedia.org/r/646754 (owner: 10Alexandros Kosiaris) [10:58:53] (03CR) 10Hashar: [C: 03+1] "As far as CI is concerned, our base Stretch image already removes the backports config:" [puppet] - 10https://gerrit.wikimedia.org/r/610050 
(https://phabricator.wikimedia.org/T256877) (owner: 10Muehlenhoff) [10:58:57] (03CR) 10Ladsgroup: "I made a similar one but on another file in I2706a12f042afecf06bbb3adad521382d446a5a7 but PCC was failing on wdqs1006, we should double ch" [puppet] - 10https://gerrit.wikimedia.org/r/645203 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [10:59:36] (03CR) 10jerkins-bot: [V: 04-1] apertium: Add a TLS enabled release [deployment-charts] - 10https://gerrit.wikimedia.org/r/646688 (https://phabricator.wikimedia.org/T255672) (owner: 10Alexandros Kosiaris) [11:00:05] (03Merged) 10jenkins-bot: Remove values-nontls.yaml for mobileapps [deployment-charts] - 10https://gerrit.wikimedia.org/r/646754 (owner: 10Alexandros Kosiaris) [11:02:51] (03CR) 10Alexandros Kosiaris: [C: 03+2] apertium: Set default tls parameters correctly [deployment-charts] - 10https://gerrit.wikimedia.org/r/646970 (owner: 10Alexandros Kosiaris) [11:04:12] (03Merged) 10jenkins-bot: apertium: Set default tls parameters correctly [deployment-charts] - 10https://gerrit.wikimedia.org/r/646970 (owner: 10Alexandros Kosiaris) [11:16:22] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for purged [puppet] - 10https://gerrit.wikimedia.org/r/646971 (https://phabricator.wikimedia.org/T135991) [11:26:31] (03CR) 10Hashar: "Stevie can you review / approve this one please? :] I had the issue when running the integration tests locally." 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/624622 (https://phabricator.wikimedia.org/T261098) (owner: 10Hashar) [11:33:39] (03CR) 10Alexandros Kosiaris: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/646688 (https://phabricator.wikimedia.org/T255672) (owner: 10Alexandros Kosiaris) [11:36:08] (03CR) 10Alexandros Kosiaris: [C: 03+2] apertium: Add a TLS enabled release [deployment-charts] - 10https://gerrit.wikimedia.org/r/646688 (https://phabricator.wikimedia.org/T255672) (owner: 10Alexandros Kosiaris) [11:37:27] (03Merged) 10jenkins-bot: apertium: Add a TLS enabled release [deployment-charts] - 10https://gerrit.wikimedia.org/r/646688 (https://phabricator.wikimedia.org/T255672) (owner: 10Alexandros Kosiaris) [11:41:50] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'apertium' for release 'plain' . [11:41:50] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'apertium' for release 'staging' . [11:41:51] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'apertium' for release 'production' . [11:41:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:17] !log add a TLS enabled release for apertium [11:42:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:57] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'apertium' for release 'plain' . [11:42:57] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'apertium' for release 'production' . [11:42:57] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'apertium' for release 'staging' . 
[11:43:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:47] !log akosiaris@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'apertium' for release 'production' . [11:44:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:52] !log akosiaris@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'apertium' for release 'plain' . [11:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:02] (03PS5) 10Alexandros Kosiaris: recommendation-api: Cleanups [puppet] - 10https://gerrit.wikimedia.org/r/641939 (https://phabricator.wikimedia.org/T241230) [11:50:04] (03PS2) 10Alexandros Kosiaris: apertium: Add kubernetes as backend for traffic [puppet] - 10https://gerrit.wikimedia.org/r/646673 [11:51:48] !log uploaded jasper 1.900.1-debian1-2.4+deb8u6+wmf1 to apt.wikimedia.org / jessie-wikimedia [11:51:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:57] (03CR) 10Alexandros Kosiaris: [C: 03+2] recommendation-api: Cleanups [puppet] - 10https://gerrit.wikimedia.org/r/641939 (https://phabricator.wikimedia.org/T241230) (owner: 10Alexandros Kosiaris) [11:53:20] (03CR) 10Alexandros Kosiaris: [C: 03+2] apertium: Add kubernetes as backend for traffic [puppet] - 10https://gerrit.wikimedia.org/r/646673 (owner: 10Alexandros Kosiaris) [12:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor I � Unicode. All rise for European mid-day backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201208T1200). [12:00:05] dcausse: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[12:00:18] o/ [12:00:23] I can deploy [12:00:47] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/recommendation-api-notls on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/recommendation-api-notls is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [12:00:47] PROBLEM - Confd template for /srv/config-master/pybal/codfw/recommendation-api on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/recommendation-api is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [12:01:12] !log installing jasper security updates [12:01:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:27] PROBLEM - Confd template for /srv/config-master/pybal/codfw/recommendation-api-notls on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/recommendation-api-notls is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [12:01:27] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:01:29] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/recommendation-api-notls on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/recommendation-api-notls is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [12:01:33] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/recommendation-api on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/recommendation-api is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [12:01:33] PROBLEM - Confd template for /srv/config-master/pybal/codfw/recommendation-api on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/recommendation-api is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [12:01:41] (03CR) 10DCausse: [C: 03+2] [cirrus] setup perfield builder A/B test on spaceless languages [extensions/WikimediaEvents] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/646777 (https://phabricator.wikimedia.org/T266027) (owner: 10DCausse) [12:01:41] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/recommendation-api on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/recommendation-api is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [12:01:57] PROBLEM - Confd template for /srv/config-master/pybal/codfw/recommendation-api-notls on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/recommendation-api-notls is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [12:01:57] !log akosiaris@cumin1001 conftool action : set/weight=1; selector: service=apertium,name=kubernetes.* [12:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:44] kart_: Heads up, I am pooling the kubernetes service in traffic rotation for apertium. 
[12:03:31] Given the lack of good stats from this service, we are a bit blind as to what's going on during the migration, so keep an eye out for reports of problems [12:03:47] !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: service=apertium,name=kubernetes.* [12:03:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:29] (03Merged) 10jenkins-bot: [cirrus] setup perfield builder A/B test on spaceless languages [extensions/WikimediaEvents] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/646777 (https://phabricator.wikimedia.org/T266027) (owner: 10DCausse) [12:07:36] PROBLEM - LVS apertium codfw port 2737/tcp - Machine Translation service. apertium.svc.eqiad.wmnet IPv4 #page on apertium.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.11 and port 2737: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:08:19] sigh. That's me, looking [12:08:34] !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: service=apertium,name=kubernetes.* [12:08:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:49] !log rollback pooling of apertium kubernetes service. [12:08:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:08] RECOVERY - LVS apertium codfw port 2737/tcp - Machine Translation service. 
apertium.svc.eqiad.wmnet IPv4 #page on apertium.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 5996 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:11:37] (03PS3) 10Vlad.shapik: CommonSettings: OAuth 2.0 refresh tokens expire after 1 minute [mediawiki-config] - 10https://gerrit.wikimedia.org/r/645308 (https://phabricator.wikimedia.org/T269152) [12:13:28] (03PS1) 10Alexandros Kosiaris: lvs: Add apertium stanza to kubernetes nodes [puppet] - 10https://gerrit.wikimedia.org/r/646978 [12:14:26] (03CR) 10Alexandros Kosiaris: [C: 03+2] lvs: Add apertium stanza to kubernetes nodes [puppet] - 10https://gerrit.wikimedia.org/r/646978 (owner: 10Alexandros Kosiaris) [12:20:20] 10Operations, 10LDAP, 10Python3-Porting: Port prometheus-openldap-exporter to Python 3 - https://phabricator.wikimedia.org/T266147 (10MoritzMuehlenhoff) https://github.com/tomcz/openldap_exporter seems like a promising alternative. [12:23:40] !log pooling of apertium kubernetes service, take #2 [12:23:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:52] !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: service=apertium,name=kubernetes.* [12:23:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:03] !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: service=apertium,name=kubernetes.* [12:27:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:03] !log dcausse@deploy1001 Synchronized php-1.36.0-wmf.20/extensions/WikimediaEvents/: T266027: [cirrus] setup perfield builder A/B test on spaceless languages (duration: 01m 00s) [12:28:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:10] T266027: Test perfield_builder on spaceless languages - https://phabricator.wikimedia.org/T266027 [12:28:13] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [12:28:18] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:29] PROBLEM - Postgres Replication Lag on maps1009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 40668200 and 193 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:30:03] (03PS1) 10DCausse: [cirrus] setup perfield builder A/B test on spaceless languages [extensions/WikimediaEvents] (wmf/1.36.0-wmf.21) - 10https://gerrit.wikimedia.org/r/646779 (https://phabricator.wikimedia.org/T266027) [12:30:35] !log jmm@cumin2001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) [12:30:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:49] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 566326160 and 272 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:31:17] PROBLEM - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 299805896 and 300 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:31:21] PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1814835120 and 304 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:32:07] !log dcaro@cumin1001 START - Cookbook sre.hosts.downtime [12:32:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:13] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 120259336 and 356 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:32:13] PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 639045400 and 356 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:32:55] PROBLEM - Postgres Replication Lag on 
maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 118875672 and 8 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:32:59] PROBLEM - Postgres Replication Lag on maps1009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 122593696 and 9 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:33:01] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 171055176 and 12 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:34:11] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:34:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:43] !log EU B&C done [12:35:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:01] PROBLEM - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 35710928 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:36:41] (03CR) 10DCausse: [C: 03+2] [cirrus] setup perfield builder A/B test on spaceless languages [extensions/WikimediaEvents] (wmf/1.36.0-wmf.21) - 10https://gerrit.wikimedia.org/r/646779 (https://phabricator.wikimedia.org/T266027) (owner: 10DCausse) [12:37:09] PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 236908016 and 15 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:37:33] PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 39374872 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:38:11] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:38:27] 10Operations, 10Traffic, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10ema) >>! In T264398#6675887, @ema wrote: > It's likely that 5.2.1 is affected by T264074, we need to backport [[ https://gerrit.wikimedia.org... [12:39:25] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 89407520 and 321 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:39:25] PROBLEM - Postgres Replication Lag on maps2009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 599630528 and 321 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:39:26] PROBLEM - Postgres Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1849838440 and 321 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:39:26] PROBLEM - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 355765672 and 321 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:40:55] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 163467176 and 7 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:40:55] PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 279365560 and 10 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:40:55] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 269279592 and 10 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:41:23] (03Merged) 10jenkins-bot: [cirrus] setup 
perfield builder A/B test on spaceless languages [extensions/WikimediaEvents] (wmf/1.36.0-wmf.21) - 10https://gerrit.wikimedia.org/r/646779 (https://phabricator.wikimedia.org/T266027) (owner: 10DCausse) [12:43:11] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 14000 and 31 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:43:17] 10Operations: Traceback in icinga-status 'Host' object has no attribute 'downtime' - https://phabricator.wikimedia.org/T269672 (10MoritzMuehlenhoff) [12:43:32] !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: service=apertium,name=kubernetes1001* [12:43:33] RECOVERY - Postgres Replication Lag on maps1010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 76416 and 54 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:43:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:39] RECOVERY - Postgres Replication Lag on maps1005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 63792 and 60 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:44:09] RECOVERY - Postgres Replication Lag on maps1006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 66568 and 89 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:44:21] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1180394552 and 63 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:44:23] RECOVERY - Postgres Replication Lag on maps1008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 244368 and 104 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:44:23] RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 244368 and 104 seconds 
https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:45:11] RECOVERY - Postgres Replication Lag on maps2005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1379520 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:45:12] !log akosiaris@cumin1001 conftool action : set/pooled=inactive; selector: service=apertium,name=kubernetes.* [12:45:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:23] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:49:45] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 293688 and 425 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:52:57] RECOVERY - Postgres Replication Lag on maps2006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 98952 and 128 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:53:11] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1512 and 144 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:54:01] RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 92192 and 193 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:54:01] RECOVERY - Postgres Replication Lag on maps2009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 193 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:57:07] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'apertium' for release 'plain' . 
[12:57:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:37] (03Abandoned) 10Hashar: base: add basic spec for base::standard_packages [puppet] - 10https://gerrit.wikimedia.org/r/625846 (owner: 10Hashar) [12:58:07] RECOVERY - Postgres Replication Lag on maps1009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 16440 and 929 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:58:15] (03Abandoned) 10Hashar: zuul: in spec, use compile.with_all_deps [puppet] - 10https://gerrit.wikimedia.org/r/627325 (owner: 10Hashar) [13:01:54] (03PS3) 10Hashar: jenkins: support changing $JAVA_HOME [puppet] - 10https://gerrit.wikimedia.org/r/645075 (https://phabricator.wikimedia.org/T269354) [13:03:54] (03CR) 10Hashar: "Rebased to clear out a file conflict in hieradata/role/common/ci/master.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/645075 (https://phabricator.wikimedia.org/T269354) (owner: 10Hashar) [13:07:32] (03PS12) 10Hashar: Explicitly mentions the repository in scap::sources [puppet] - 10https://gerrit.wikimedia.org/r/610254 (https://phabricator.wikimedia.org/T257413) [13:08:25] (03CR) 10Hashar: "Rebased to take in account the removal of recommendation-api/deploy" [puppet] - 10https://gerrit.wikimedia.org/r/610254 (https://phabricator.wikimedia.org/T257413) (owner: 10Hashar) [13:09:49] (03PS10) 10Hashar: scap::sources stop assuming mediawiki/services as a prefix [puppet] - 10https://gerrit.wikimedia.org/r/610267 (https://phabricator.wikimedia.org/T257413) [13:19:51] !log installing xen security updates (client side libs only) [13:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:08] !log hnowlan@cumin1001 START - Cookbook sre.cassandra.roll-restart [13:28:09] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.cassandra.roll-restart (exit_code=99) [13:28:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:19] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:43] (03PS1) 10Muehlenhoff: Add library hint for xen [puppet] - 10https://gerrit.wikimedia.org/r/646988 [13:38:52] (03PS1) 10Alexandros Kosiaris: apertium: Bump memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/646989 [13:41:19] (03CR) 10Alexandros Kosiaris: [C: 03+2] apertium: Bump memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/646989 (owner: 10Alexandros Kosiaris) [13:41:33] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27008/console" [puppet] - 10https://gerrit.wikimedia.org/r/610267 (https://phabricator.wikimedia.org/T257413) (owner: 10Hashar) [13:42:13] !log disable puppet on mc1034/mc2034 and reimage to buster - T213089 [13:42:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:22] T213089: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 [13:42:27] (03CR) 10Alexandros Kosiaris: [V: 03+1 C: 03+2] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/610267 (https://phabricator.wikimedia.org/T257413) (owner: 10Hashar) [13:42:32] (03Merged) 10jenkins-bot: apertium: Bump memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/646989 (owner: 10Alexandros Kosiaris) [13:43:11] (03CR) 10Alexandros Kosiaris: [C: 03+2] Explicitly mentions the repository in scap::sources [puppet] - 10https://gerrit.wikimedia.org/r/610254 (https://phabricator.wikimedia.org/T257413) (owner: 10Hashar) [13:44:52] !log bounce logstash on logstash102 [13:44:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:58] !log bounce logstash on logstash1023 [13:45:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:24] (03CR) 10Effie Mouzeli: [C: 03+2] hiera: remove shard16 from redis, reimage mc1034/mc2034 to buster [puppet] - 10https://gerrit.wikimedia.org/r/646967 (https://phabricator.wikimedia.org/T213089) (owner: 10Effie Mouzeli) [13:47:55] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/recommendation-api on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:47:55] RECOVERY - Confd template for /srv/config-master/pybal/codfw/recommendation-api on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:48:37] RECOVERY - Confd template for /srv/config-master/pybal/codfw/recommendation-api on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:48:40] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'apertium' for release 'plain' . [13:48:40] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'apertium' for release 'production' . [13:48:40] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'apertium' for release 'staging' . 
[13:48:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:09] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/recommendation-api on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:50:10] !log akosiaris@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'apertium' for release 'staging' . [13:50:10] !log akosiaris@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'apertium' for release 'production' . [13:50:10] !log akosiaris@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'apertium' for release 'plain' . [13:50:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:45] 10Operations, 10Platform Engineering, 10Wikidata, 10serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` mc2034.codfw.wmnet ` The log can be... [13:50:59] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'apertium' for release 'staging' . [13:50:59] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'apertium' for release 'production' . [13:50:59] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'apertium' for release 'plain' . 
[13:51:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:04] !log pooling of apertium kubernetes service, take #3 [13:53:07] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-me [13:53:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:14] !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: service=apertium,name=kubernetes1001.* [13:54:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:45] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [13:56:14] (03CR) 10Ppchelko: [C: 03+1] "Makes sense. Will deploy at some point." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/645308 (https://phabricator.wikimedia.org/T269152) (owner: 10Vlad.shapik) [13:56:52] (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for xen [puppet] - 10https://gerrit.wikimedia.org/r/646988 (owner: 10Muehlenhoff) [13:56:56] !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: service=apertium,name=kubernetes.* [13:57:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:33] (03CR) 10Vlad.shapik: "> Patch Set 3: Code-Review+1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/645308 (https://phabricator.wikimedia.org/T269152) (owner: 10Vlad.shapik) [13:59:03] PROBLEM - Aggregate IPsec Tunnel Status eqiad on alert1001 is CRITICAL: instance=mc1034 site=eqiad tunnel=mc2034_v4 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [14:00:53] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:05:27] (03PS1) 10Muehlenhoff: Fix Cumin alias for cloudvirt-codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/646994 [14:05:41] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:05:53] !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: service=apertium,name=scb.* [14:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:05] !log switch over all apertium traffic to kubernetes [14:06:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:12] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [14:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:01] akosiaris: hi, thx for the scap::sources related changes :] [14:07:21] (03CR) 10Muehlenhoff: [C: 03+2] Fix Cumin alias for cloudvirt-codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/646994 (owner: 10Muehlenhoff) [14:07:26] (03PS5) 10Effie Mouzeli: redis: define redis version on buster [puppet] - 10https://gerrit.wikimedia.org/r/646638 (https://phabricator.wikimedia.org/T265643) [14:08:12] 10Operations, 10Scap, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO (2020-10-01 to 2020-12-31 (Q2)): scap configuration in puppet defaults to forge the git repo name with 'mediawiki/services/' - https://phabricator.wikimedia.org/T257413 (10hashar) ... 
[14:08:14] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:08:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:57] (03CR) 10jerkins-bot: [V: 04-1] redis: define redis version on buster [puppet] - 10https://gerrit.wikimedia.org/r/646638 (https://phabricator.wikimedia.org/T265643) (owner: 10Effie Mouzeli) [14:11:16] 10Operations, 10SRE-Access-Requests: Requesting access to Analytics Data for toan - https://phabricator.wikimedia.org/T269678 (10toan) [14:16:09] (03PS6) 10Effie Mouzeli: redis: define redis version on buster [puppet] - 10https://gerrit.wikimedia.org/r/646638 (https://phabricator.wikimedia.org/T265643) [14:16:24] 10Operations, 10Platform Engineering, 10Wikidata, 10serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mc2034.codfw.wmnet'] ` and were **ALL** successful. [14:17:41] (03CR) 10jerkins-bot: [V: 04-1] redis: define redis version on buster [puppet] - 10https://gerrit.wikimedia.org/r/646638 (https://phabricator.wikimedia.org/T265643) (owner: 10Effie Mouzeli) [14:18:40] !log installing puma security updates [14:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:21] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Hardware): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10dcaro) [14:23:25] 10Operations, 10SRE-Access-Requests: Requesting access to Analytics Data for toan - https://phabricator.wikimedia.org/T269678 (10WMDE-leszek) [14:23:37] 10Operations, 10SRE-Access-Requests: Requesting access to Analytics Data for toan - https://phabricator.wikimedia.org/T269678 (10WMDE-leszek) I approve this request on WMDE end. [14:27:33] 10Operations, 10SRE-Access-Requests: Requesting access to Analytics Data for toan - https://phabricator.wikimedia.org/T269678 (10Ottomata) APPROVED. 
@toan will need to be in either the wmf or nda ldap groups, and should also be given a Kerberos principal. [14:31:09] 10Operations, 10SRE-Access-Requests: Requesting access to Analytics Data for toan - https://phabricator.wikimedia.org/T269678 (10MoritzMuehlenhoff) That should be cn=nda, not cn=wmf. [14:32:35] (03CR) 10Alexandros Kosiaris: "> Patch Set 11:" (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/640571 (https://phabricator.wikimedia.org/T265526) (owner: 10Mstyles) [14:32:46] (03CR) 10JMeybohm: [C: 04-1] linkrecommendation: Add private config for DB write user (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/646658 (https://phabricator.wikimedia.org/T269573) (owner: 10Kosta Harlan) [14:34:58] (03CR) 10JMeybohm: [C: 04-1] "You probably want to set "DB_HOST", "DB_PORT", "DB_DATABASE" and maybe "DB_*USER" here as well." (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/646649 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [14:44:37] (03PS3) 10Kosta Harlan: linkrecommendation: Add private config for DB admin user [deployment-charts] - 10https://gerrit.wikimedia.org/r/646658 (https://phabricator.wikimedia.org/T269573) [14:45:04] (03CR) 10Kosta Harlan: linkrecommendation: Add private config for DB admin user (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/646658 (https://phabricator.wikimedia.org/T269573) (owner: 10Kosta Harlan) [14:48:16] (03PS4) 10Kosta Harlan: linkrecommendation: Add helmfile.d config [deployment-charts] - 10https://gerrit.wikimedia.org/r/646649 (https://phabricator.wikimedia.org/T265893) [14:48:35] (03CR) 10Kosta Harlan: "> Patch Set 3: Code-Review-1" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/646649 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [14:56:15] (03PS1) 10ArielGlenn: clean up handling of failed page content batches [dumps] - 10https://gerrit.wikimedia.org/r/646998 
(https://phabricator.wikimedia.org/T252396) [14:58:45] 10Operations, 10ops-codfw: Degraded RAID on logstash2022 - https://phabricator.wikimedia.org/T269552 (10Papaul) p:05Triage→03Medium [14:58:56] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:02:23] 10Operations: Integrate Buster 10.7 point update - https://phabricator.wikimedia.org/T269558 (10MoritzMuehlenhoff) [15:10:02] (03CR) 10RLazarus: [C: 03+2] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/643917 (https://phabricator.wikimedia.org/T262857) (owner: 10Cparle) [15:14:10] (03CR) 10Andrew Bogott: "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/646994 (owner: 10Muehlenhoff) [15:18:26] 10Operations, 10Release-Engineering-Team-TODO, 10serviceops, 10Patch-For-Review, and 2 others: Upgrade MediaWiki appservers to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10hashar) [15:20:38] (03PS1) 10Andrew Bogott: cloud-vps VM backups: exclude some more hostnames from backup [puppet] - 10https://gerrit.wikimedia.org/r/647003 (https://phabricator.wikimedia.org/T266180) [15:23:46] (03CR) 10Alexandros Kosiaris: [C: 03+1] Split out RBAC rules and service accounts for typa and CNI [deployment-charts] - 10https://gerrit.wikimedia.org/r/645317 (https://phabricator.wikimedia.org/T267653) (owner: 10JMeybohm) [15:25:00] (03PS1) 10Urbanecm: Add MessagesMad [core] (wmf/1.36.0-wmf.21) - 10https://gerrit.wikimedia.org/r/646781 (https://phabricator.wikimedia.org/T269585) [15:25:35] twentyafterfour: do you think we can sneak https://gerrit.wikimedia.org/r/c/mediawiki/core/+/646781 into the "train deployment"? 
[15:27:22] jouncebot: now [15:27:22] No deployments scheduled for the next 1 hour(s) and 32 minute(s) [15:27:25] jouncebot: next [15:27:25] In 1 hour(s) and 32 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201208T1700) [15:28:07] effie: scap started to throw the following warnings [15:28:11] https://www.irccloud.com/pastebin/xhufYz4X/ [15:28:32] liw: can you have a look? [15:28:45] !log urbanecm@deploy1001 Synchronized private/PrivateSettings.php: Update T250887 mitigations (duration: 01m 00s) [15:28:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:54] Urbanecm: is there something that looks fatal? [15:29:08] effie: no, it looks like a warning, scap completed normally [15:29:42] and the change was also synced properly, confirmed by ssh'ing to a random appserver [15:33:36] (03CR) 10Alexandros Kosiaris: [C: 04-1] linkrecommendation: Add helmfile.d config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/646649 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [15:35:11] (03PS5) 10Kosta Harlan: linkrecommendation: Add helmfile.d config [deployment-charts] - 10https://gerrit.wikimedia.org/r/646649 (https://phabricator.wikimedia.org/T265893) [15:35:24] (03CR) 10Kosta Harlan: linkrecommendation: Add helmfile.d config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/646649 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [15:36:38] (03CR) 10jerkins-bot: [V: 04-1] linkrecommendation: Add helmfile.d config [deployment-charts] - 10https://gerrit.wikimedia.org/r/646649 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [15:43:58] (03CR) 10Zfilipin: [C: 03+1] "@Mukunda Modell: can you merge this? It's the last commit to be merged for this task. 
😬" [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/639293 (https://phabricator.wikimedia.org/T265463) (owner: 10Harriet Ayugi) [15:44:30] (03PS2) 10Muehlenhoff: Add Tyler as approval contact for Gerrit/contint [puppet] - 10https://gerrit.wikimedia.org/r/644856 [15:44:45] (03CR) 10JMeybohm: [C: 04-1] linkrecommendation: Add helmfile.d config (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/646649 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [15:44:53] !log urbanecm@deploy1001 Synchronized private/PrivateSettings.php: Update T250887 mitigations (duration: 00m 58s) [15:44:59] 10Operations, 10Domains, 10Traffic: Okapi Domains - https://phabricator.wikimedia.org/T269686 (10RBrounley_WMF) [15:44:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:30] (03PS6) 10Kosta Harlan: linkrecommendation: Add helmfile.d config [deployment-charts] - 10https://gerrit.wikimedia.org/r/646649 (https://phabricator.wikimedia.org/T265893) [15:47:40] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:48:29] (03CR) 10Kosta Harlan: linkrecommendation: Add helmfile.d config (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/646649 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [15:48:39] (03CR) 10Kormat: [C: 03+1] "LGTM" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/624622 (https://phabricator.wikimedia.org/T261098) (owner: 10Hashar) [15:49:20] (03CR) 10jerkins-bot: [V: 04-1] linkrecommendation: Add helmfile.d config [deployment-charts] - 10https://gerrit.wikimedia.org/r/646649 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [15:49:22] 10Operations, 10Platform Engineering, 10Wikidata, 10serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 
(10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` mc1034.eqiad.wmnet ` The log can be... [15:49:52] RECOVERY - Aggregate IPsec Tunnel Status eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [15:51:48] PROBLEM - MediaWiki memcached error rate on alert1001 is CRITICAL: 5790 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:52:47] (03CR) 10Kosta Harlan: linkrecommendation: Add helmfile.d config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/646649 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [15:53:18] RECOVERY - MediaWiki memcached error rate on alert1001 is OK: (C)5000 gt (W)1000 gt 14 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:57:03] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Hardware): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Andrew) a:05Andrew→03dcaro [15:59:02] (03CR) 10Kormat: [C: 03+2] test: TestOnlineSchemaChanger missed analyze config [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/624622 (https://phabricator.wikimedia.org/T261098) (owner: 10Hashar) [15:59:44] (03CR) 10Subramanya Sastry: [C: 03+2] Bump wikimedia/parsoid to 0.13.0-a19 [vendor] (wmf/1.36.0-wmf.21) - 10https://gerrit.wikimedia.org/r/646771 (https://phabricator.wikimedia.org/T184779) (owner: 10C. 
Scott Ananian) [16:00:16] (03CR) 10Herron: [C: 03+1] profile: add logstash ecs 1.7.0-1 template [puppet] - 10https://gerrit.wikimedia.org/r/645209 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [16:01:16] (03Merged) 10jenkins-bot: test: TestOnlineSchemaChanger missed analyze config [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/624622 (https://phabricator.wikimedia.org/T261098) (owner: 10Hashar) [16:02:52] (03PS7) 10Effie Mouzeli: redis: define redis version on buster [puppet] - 10https://gerrit.wikimedia.org/r/646638 (https://phabricator.wikimedia.org/T265643) [16:03:13] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [16:03:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:22] (03CR) 10jerkins-bot: [V: 04-1] redis: define redis version on buster [puppet] - 10https://gerrit.wikimedia.org/r/646638 (https://phabricator.wikimedia.org/T265643) (owner: 10Effie Mouzeli) [16:05:14] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:05:16] effie, had a look (tldr: it's harmless) [16:05:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:16] liw: feel free to close the task then, but it needs to be fixed [16:06:23] because it will be annoying to deployers [16:06:28] yes [16:07:02] leaving open until mediawiki-config is changed [16:07:07] (03CR) 10Gmodena: [C: 03+1] similarusers: Create basic chart and service config [deployment-charts] - 10https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan) [16:08:47] (03PS8) 10Effie Mouzeli: redis: define redis version on buster [puppet] - 10https://gerrit.wikimedia.org/r/646638 (https://phabricator.wikimedia.org/T265643) [16:13:32] 10Operations, 10SRE-Access-Requests: Requesting access to Analytics Data for Michael Große (WMDE) - https://phabricator.wikimedia.org/T269610 (10Michael) Thank you all for the impressively fast fulfillment of this request. 
I appreciate it 🙏 [16:13:35] (03PS1) 10Alexandros Kosiaris: k8s_infrastructure_users: Amend to support groups, avoid uid conflicts [puppet] - 10https://gerrit.wikimedia.org/r/647011 (https://phabricator.wikimedia.org/T269461) [16:15:28] 10Operations, 10Platform Engineering, 10Wikidata, 10serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mc1034.eqiad.wmnet'] ` and were **ALL** successful. [16:16:41] (03CR) 10C. Scott Ananian: [C: 03+2] Bump wikimedia/parsoid to 0.13.0-a19 [vendor] (wmf/1.36.0-wmf.21) - 10https://gerrit.wikimedia.org/r/646771 (https://phabricator.wikimedia.org/T184779) (owner: 10C. Scott Ananian) [16:23:31] (03CR) 10JMeybohm: "Do we know what happens (if anything happens at all) to existing clusters when we switch the UID for "existing" users? It feels like they " [puppet] - 10https://gerrit.wikimedia.org/r/647011 (https://phabricator.wikimedia.org/T269461) (owner: 10Alexandros Kosiaris) [16:23:33] (03PS2) 10Alexandros Kosiaris: k8s_infrastructure_users: Amend to support groups, avoid uid conflicts [puppet] - 10https://gerrit.wikimedia.org/r/647011 (https://phabricator.wikimedia.org/T269461) [16:24:16] 10Operations, 10ops-eqiad, 10cloud-services-team (Hardware): Degraded RAID on labstore1006 - https://phabricator.wikimedia.org/T268281 (10RobH) [16:24:18] (03CR) 10JMeybohm: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/647011 (https://phabricator.wikimedia.org/T269461) (owner: 10Alexandros Kosiaris) [16:26:05] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27013/console" [puppet] - 10https://gerrit.wikimedia.org/r/647011 (https://phabricator.wikimedia.org/T269461) (owner: 10Alexandros Kosiaris) [16:27:03] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:28:27] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:29:52] !log dcaro@cumin1001 START - Cookbook sre.hosts.downtime [16:29:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:23] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:31:23] (03Merged) 10jenkins-bot: Bump wikimedia/parsoid to 0.13.0-a19 [vendor] (wmf/1.36.0-wmf.21) - 10https://gerrit.wikimedia.org/r/646771 (https://phabricator.wikimedia.org/T184779) (owner: 10C. Scott Ananian) [16:31:55] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:42] (03PS1) 10Lars Wirzenius: Drop Scap plugins now that 3.16.0 is installed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647013 (https://phabricator.wikimedia.org/T248490) [16:36:27] (03CR) 10Lars Wirzenius: "Scap bundles the plugins that are in mediawiki-config, leading to warnings about duplicates. Drop them from mediawiki-config." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/647013 (https://phabricator.wikimedia.org/T248490) (owner: 10Lars Wirzenius) [16:37:17] (03CR) 10Alexandros Kosiaris: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/647011 (https://phabricator.wikimedia.org/T269461) (owner: 10Alexandros Kosiaris) [16:40:12] !log poweroff ms-be2058 for DIMM replacement [16:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:02] PROBLEM - Host ms-be2058 is DOWN: PING CRITICAL - Packet loss = 100% [16:44:40] 10Operations, 10Domains, 10Okapi, 10Traffic: Okapi Domains - https://phabricator.wikimedia.org/T269686 (10Nintendofan885) [16:44:56] PROBLEM - Host ms-be2058.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:47:24] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:50:15] mutante: were you looking at the issues with mediawiki_job_wikidata-updateQueryServiceLag on mwmaint1002? it looks like the job just fails every time it runs, do you know if there's a task open? 
[16:51:13] cc addshore ^ in case you happen to know anything [16:52:56] (03CR) 10Thcipriani: [C: 03+2] "scap release success \o/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647013 (https://phabricator.wikimedia.org/T248490) (owner: 10Lars Wirzenius) [16:53:29] RECOVERY - Host ms-be2058.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.07 ms [16:53:51] (03Merged) 10jenkins-bot: Drop Scap plugins now that 3.16.0 is installed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647013 (https://phabricator.wikimedia.org/T248490) (owner: 10Lars Wirzenius) [16:53:57] afaict it's been failing consistently once per minute since yesterday, the icinga alert has been recovering sometimes but that's an artifact [16:55:05] RECOVERY - Host ms-be2058 is UP: PING OK - Packet loss = 0%, RTA = 33.34 ms [16:55:21] 10Operations, 10ops-codfw: codfw: ms-be2058 memory error - https://phabricator.wikimedia.org/T269266 (10Papaul) 05Open→03Resolved @fgiunchedi DIMM replaced . Server is back online with 512GB [16:55:32] liw: fetched your patch on the deployment host, I still see "scap prep" in --help output -- I assume that means it's working as intended? [16:56:05] PROBLEM - Check systemd state on ms-be2058 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:56:16] thcipriani, should be, yes [16:57:09] RECOVERY - Check systemd state on ms-be2058 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:57:14] ack, ok, I'll do a quick no-op sync to make sure both deployment servers agree about the world [16:57:51] thcipriani, ta [17:00:04] jbond42 and cdanis: Your horoscope predicts another unfortunate Puppet request window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201208T1700). 
[17:00:49] !log thcipriani@deploy1001 Synchronized README: no-op sync for [[gerrit:647013|Drop Scap plugins now that 3.16.0 is installed]] (duration: 00m 59s) [17:00:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:57] ^ liw all done! [17:01:02] kudos! [17:01:26] thcipriani, mostly thanks to James_F who did it originally [17:01:34] 10Operations, 10ops-codfw: Degraded RAID on logstash2022 - https://phabricator.wikimedia.org/T269552 (10Papaul) @wiki_willy the server is out of warranty and uses 4TB disks same as the ms-be. We do not have any spare 4TB disks on site. [17:01:47] 10Operations, 10Wikidata, 10Wikidata-Query-Service: mediawiki_job_wikidata-updateQueryServiceLag failing - https://phabricator.wikimedia.org/T269693 (10RLazarus) [17:03:11] effie, the duplicate plugins issue is solved, closing the task [17:05:08] liw: Ha, `git rm -rf` isn't that hard. All kudos to you for making scap releases happen on a plan with testing. :-) [17:05:24] (03CR) 10Urbanecm: [C: 03+2] Add MessagesMad [core] (wmf/1.36.0-wmf.21) - 10https://gerrit.wikimedia.org/r/646781 (https://phabricator.wikimedia.org/T269585) (owner: 10Urbanecm) [17:06:42] James_F, oh git has a command for that? it's not necessary to hand craft tree objects to remove files? TIL [17:07:12] liw: I mean, if you want to do `git filter-branch` and cry a lot, be my guest, but… ;-) [17:07:23] :) [17:09:40] ACKNOWLEDGEMENT - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
RLazarus https://phabricator.wikimedia.org/T269693 (mediawiki_job_wikidata-updateQueryServiceLag) https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:13:33] (03PS3) 10Cwhite: Generate Logstash ECS cleanup filter as part of regular build process [software/ecs] - 10https://gerrit.wikimedia.org/r/644638 [17:17:27] 10Operations, 10serviceops, 10cloud-services-team (Kanban): Upgrade labweb servers to buster - https://phabricator.wikimedia.org/T269004 (10Andrew) a:03Andrew [17:17:53] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ml-serve200[1-4] - https://phabricator.wikimedia.org/T267670 (10Papaul) @klausman what is the partman recipe to use for those servers and are we using 10G or 1G? Thanks [17:18:30] 10Operations, 10serviceops, 10cloud-services-team (Kanban): Upgrade labweb servers to buster - https://phabricator.wikimedia.org/T269004 (10Andrew) p:05Triage→03Medium [17:18:43] (03CR) 1020after4: [C: 03+2] Add tests/selenium/log to .gitignore [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/639293 (https://phabricator.wikimedia.org/T265463) (owner: 10Harriet Ayugi) [17:19:48] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ml-serve200[1-4] - https://phabricator.wikimedia.org/T267670 (10RobH) [17:34:08] (03Merged) 10jenkins-bot: Add MessagesMad [core] (wmf/1.36.0-wmf.21) - 10https://gerrit.wikimedia.org/r/646781 (https://phabricator.wikimedia.org/T269585) (owner: 10Urbanecm) [17:34:45] (03CR) 1020after4: [V: 03+2 C: 03+2] Add tests/selenium/log to .gitignore [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/639293 (https://phabricator.wikimedia.org/T265463) (owner: 10Harriet Ayugi) [17:38:09] (03PS1) 10Cwhite: profile: add ecs pre and post filters to pipeline [puppet] - 10https://gerrit.wikimedia.org/r/647028 (https://phabricator.wikimedia.org/T234565) [17:38:22] (03CR) 10jerkins-bot: [V: 04-1] profile: add ecs pre and post filters to pipeline [puppet] - 
10https://gerrit.wikimedia.org/r/647028 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [17:39:33] (03CR) 1020after4: "This file clearly shouldn't be in operations/puppet" [puppet] - 10https://gerrit.wikimedia.org/r/455271 (https://phabricator.wikimedia.org/T91428) (owner: 10Aklapper) [17:42:12] (03PS1) 10Cwhite: profile: add netdev grok patterns to ecs pipeline [puppet] - 10https://gerrit.wikimedia.org/r/647029 (https://phabricator.wikimedia.org/T234565) [17:42:34] (03CR) 10jerkins-bot: [V: 04-1] profile: add netdev grok patterns to ecs pipeline [puppet] - 10https://gerrit.wikimedia.org/r/647029 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [17:46:10] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ml-serve200[1-4] - https://phabricator.wikimedia.org/T267670 (10Papaul) [17:48:06] (03CR) 1020after4: [C: 03+1] "Were there any objections?" [puppet] - 10https://gerrit.wikimedia.org/r/569627 (https://phabricator.wikimedia.org/T215360) (owner: 10Zoranzoki21) [17:49:20] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ml-serve200[1-4] - https://phabricator.wikimedia.org/T267670 (10Papaul) [17:49:51] 10Operations, 10Domains, 10Okapi, 10Traffic: Okapi Domains - https://phabricator.wikimedia.org/T269686 (10Reedy) I'm presuming these aren't going to be MediaWiki wiks underneath etc? Where do these domains want to point? While they can be "parked", there's not a great deal to do until there's something to... [17:50:15] (03CR) 1020after4: [C: 03+1] "lol I guess it was my objection. 
I'm going to retract my objection even though I still don't want to encourage it this allows video to be" [puppet] - 10https://gerrit.wikimedia.org/r/569627 (https://phabricator.wikimedia.org/T215360) (owner: 10Zoranzoki21) [17:51:15] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.533 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [17:51:46] (03PS1) 10Dave Pifke: webperf: require_package -> ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/647031 (https://phabricator.wikimedia.org/T266479) [17:52:15] (03CR) 10Dave Pifke: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/647031 (https://phabricator.wikimedia.org/T266479) (owner: 10Dave Pifke) [17:53:16] (03CR) 10jerkins-bot: [V: 04-1] webperf: require_package -> ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/647031 (https://phabricator.wikimedia.org/T266479) (owner: 10Dave Pifke) [17:58:55] (03PS3) 10CRusnov: purge-nagios-resources.py: Port to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/646884 (https://phabricator.wikimedia.org/T247364) [17:59:31] sbassett I'm around for the next bit if you want to try again with the security patch [17:59:45] (03PS3) 10Dave Pifke: arclamp: add CORS header and clean up modules [puppet] - 10https://gerrit.wikimedia.org/r/636759 [18:00:04] chrisalbon and accraze: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201208T1800). 
[18:00:49] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.067 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [18:01:05] (03PS1) 10Cwhite: profile: update netdev rsyslog template to ecs 1.7.0 [puppet] - 10https://gerrit.wikimedia.org/r/647032 (https://phabricator.wikimedia.org/T234565) [18:01:59] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Hardware): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10dcaro) 05Open→03Resolved [18:03:30] (03CR) 10Dave Pifke: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/636759 (owner: 10Dave Pifke) [18:03:42] (03PS4) 10Cwhite: Generate Logstash ECS cleanup filter as part of regular build process [software/ecs] - 10https://gerrit.wikimedia.org/r/644638 [18:08:23] 10Operations, 10Performance-Team, 10serviceops, 10User-jijiki: Run latest Thumbor on Docker with Buster + Python 3 - https://phabricator.wikimedia.org/T267327 (10dduvall) [18:08:49] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.113 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [18:17:17] (03PS3) 10Cwhite: Add HTTP request and response headers fields as object[field:keyword] [software/ecs] - 10https://gerrit.wikimedia.org/r/636515 [18:20:15] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.271 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [18:21:01] (03PS3) 10Cwhite: Add CSP Report fields. 
[software/ecs] - 10https://gerrit.wikimedia.org/r/636516 [18:21:03] (03PS3) 10Cwhite: Enable search slowlog by default for ECS indices. [software/ecs] - 10https://gerrit.wikimedia.org/r/636685 [18:25:02] 10Operations, 10LDAP-Access-Requests: LDAP access to wmf group for Matt Cleinman - https://phabricator.wikimedia.org/T269696 (10MattCleinman) [18:25:13] (03CR) 10Cwhite: [C: 03+2] profile: add logstash ecs 1.7.0-1 template [puppet] - 10https://gerrit.wikimedia.org/r/645209 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [18:25:21] (03PS5) 10Cwhite: profile: add logstash ecs 1.7.0-1 template [puppet] - 10https://gerrit.wikimedia.org/r/645209 (https://phabricator.wikimedia.org/T234565) [18:32:10] 10Operations, 10ops-eqsin, 10DC-Ops: cr2-eqsin: fan failure - https://phabricator.wikimedia.org/T267544 (10RobH) Jin's received the new fan and we're scheduling for him to go on-site to swap on 2020*12-10 @ 0200 UTC / 2020-12-10 @ 0900 Singapore / 2020-12-09 @ 1800 Pacific. @ayonsi: Juniper requested that i... [18:34:41] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.042 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [18:34:59] (03CR) 10Alex Paskulin: [C: 03+1] "Looks good!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/646862 (https://phabricator.wikimedia.org/T267953) (owner: 10Cicalese) [18:35:59] 10Operations, 10ops-codfw: Degraded RAID on logstash2022 - https://phabricator.wikimedia.org/T269552 (10wiki_willy) Hi @herron - it looks like this server is due to be refreshed next next year. (around Nov 2021) Let me know if you want a replacement disk purchased for this in the mean time, or if you can go w... 
[18:36:59] PROBLEM - SSH on logstash2006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:37:45] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ml-serve200[1-4] - https://phabricator.wikimedia.org/T267670 (10Papaul) [18:38:26] (03PS1) 10MSantos: wikifeeds: bump to 2020-12-08-174707-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/647038 [18:39:59] (03CR) 10MSantos: [C: 03+2] wikifeeds: bump to 2020-12-08-174707-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/647038 (owner: 10MSantos) [18:40:05] RECOVERY - SSH on logstash2006 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:40:17] PROBLEM - ElasticSearch health check for shards on 9200 on logstash2006 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(requests.packages.urllib3.connection.HTTPConnection object at 0x7fab5818c518: Failed to establish a new connection: [Errno 111] Connection [18:40:17] ://wikitech.wikimedia.org/wiki/Search%23Administration [18:40:35] PROBLEM - logstash JSON linesTCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [18:41:13] (03Merged) 10jenkins-bot: wikifeeds: bump to 2020-12-08-174707-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/647038 (owner: 10MSantos) [18:41:41] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ml-serve200[1-4] - https://phabricator.wikimedia.org/T267670 (10Papaul) [18:41:51] PROBLEM - logstash syslog TCP port on logstash2005 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [18:41:55] RECOVERY - 
ElasticSearch health check for shards on 9200 on logstash2006 is OK: OK - elasticsearch status production-logstash-codfw: active_primary_shards: 456, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, task_max_waiting_in_queue_millis: 0, relocating_shards: 0, initializing_shards: 0, cluster_name: production-logstash-codfw, active_shards: 862, unassigned_shards: 0, number_of_data_nodes: 3, number_of_in_flight_fet [18:41:55] False, status: green, active_shards_percent_as_number: 100.0, number_of_nodes: 6 https://wikitech.wikimedia.org/wiki/Search%23Administration [18:42:15] RECOVERY - logstash JSON linesTCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [18:42:49] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.258 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [18:42:49] !log mbsantos@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' . [18:42:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:27] RECOVERY - logstash syslog TCP port on logstash2005 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [18:44:31] !log mbsantos@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [18:44:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:02] !log mbsantos@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . 
[18:47:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:44] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ml-serve200[1-4] - https://phabricator.wikimedia.org/T267670 (10Papaul) [18:58:01] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:59:25] (03CR) 10CRusnov: [C: 03+2] purge-nagios-resources.py: Port to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/646884 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [19:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201208T1900) [19:02:53] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:03:18] (03CR) 10Aklapper: "Mmm, where else? (Plus that might be a meta-discussion not directly related to this proposed change?)" [puppet] - 10https://gerrit.wikimedia.org/r/455271 (https://phabricator.wikimedia.org/T91428) (owner: 10Aklapper) [19:05:11] 10Operations, 10serviceops, 10cloud-services-team (Kanban): Upgrade labweb servers to buster - https://phabricator.wikimedia.org/T269004 (10Andrew) It's most useful if effort is directed towards completing T237773, which will render this issue moot. In theory the DBAs are going to work on the first step of t... 
[19:07:10] (03CR) 10Andrew Bogott: [C: 03+2] Initial cinder class and templates [puppet] - 10https://gerrit.wikimedia.org/r/645722 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott) [19:15:55] (03PS1) 10Andrew Bogott: Cinder: add some shared config settings to hiera [puppet] - 10https://gerrit.wikimedia.org/r/647043 (https://phabricator.wikimedia.org/T269511) [19:16:27] (03CR) 10Andrew Bogott: [C: 03+2] Cinder: add some shared config settings to hiera [puppet] - 10https://gerrit.wikimedia.org/r/647043 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott) [19:20:01] (03PS1) 10Andrew Bogott: Pass ceph_pool name to deployment-specific cinder profiles [puppet] - 10https://gerrit.wikimedia.org/r/647044 (https://phabricator.wikimedia.org/T269511) [19:21:39] (03CR) 10Andrew Bogott: [C: 03+2] Pass ceph_pool name to deployment-specific cinder profiles [puppet] - 10https://gerrit.wikimedia.org/r/647044 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott) [19:24:16] (03PS1) 10Andrew Bogott: Pass ceph_pool name from profile to module [puppet] - 10https://gerrit.wikimedia.org/r/647045 (https://phabricator.wikimedia.org/T269511) [19:25:58] (03CR) 10Andrew Bogott: [C: 03+2] Pass ceph_pool name from profile to module [puppet] - 10https://gerrit.wikimedia.org/r/647045 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott) [19:27:05] (03PS2) 10ArielGlenn: clean up handling of failed page content batches [dumps] - 10https://gerrit.wikimedia.org/r/646998 (https://phabricator.wikimedia.org/T252396) [19:27:31] (03CR) 10jerkins-bot: [V: 04-1] clean up handling of failed page content batches [dumps] - 10https://gerrit.wikimedia.org/r/646998 (https://phabricator.wikimedia.org/T252396) (owner: 10ArielGlenn) [19:27:50] (03PS1) 10C. 
Scott Ananian: Bump wikimedia/parsoid to v0.13.0-a19 [vendor] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/647046 (https://phabricator.wikimedia.org/T269685) [19:29:01] (03PS3) 10ArielGlenn: clean up handling of failed page content batches [dumps] - 10https://gerrit.wikimedia.org/r/646998 (https://phabricator.wikimedia.org/T252396) [19:51:32] (03PS1) 10Andrew Bogott: Cinder: create cinder system user before package install [puppet] - 10https://gerrit.wikimedia.org/r/647047 [19:51:55] (03CR) 10jerkins-bot: [V: 04-1] Cinder: create cinder system user before package install [puppet] - 10https://gerrit.wikimedia.org/r/647047 (owner: 10Andrew Bogott) [19:53:19] (03PS2) 10Andrew Bogott: Cinder: create cinder system user before package install [puppet] - 10https://gerrit.wikimedia.org/r/647047 (https://phabricator.wikimedia.org/T269511) [19:53:45] (03CR) 10jerkins-bot: [V: 04-1] Cinder: create cinder system user before package install [puppet] - 10https://gerrit.wikimedia.org/r/647047 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott) [19:54:57] (03PS3) 10Andrew Bogott: Cinder: create cinder system user before package install [puppet] - 10https://gerrit.wikimedia.org/r/647047 (https://phabricator.wikimedia.org/T269511) [19:55:41] (03CR) 10Andrew Bogott: [C: 03+2] Cinder: create cinder system user before package install [puppet] - 10https://gerrit.wikimedia.org/r/647047 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott) [19:57:57] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [20:00:05] twentyafterfour and marxarelli: May I have your attention please! Mediawiki train - American Version. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201208T2000) [20:01:34] twentyafterfour: o/ [20:02:00] (03PS1) 10Andrew Bogott: Cinder: create cinder group as well as user [puppet] - 10https://gerrit.wikimedia.org/r/647048 (https://phabricator.wikimedia.org/T269511) [20:02:04] marxarelli: hello, I'm just working out patch conflicts now [20:02:47] (03CR) 10Andrew Bogott: [C: 03+2] Cinder: create cinder group as well as user [puppet] - 10https://gerrit.wikimedia.org/r/647048 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott) [20:02:47] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [20:03:45] twentyafterfour: np. i'm around whenever you're ready [20:06:19] (03PS1) 10Andrew Bogott: Added rsyslog config for cinder [puppet] - 10https://gerrit.wikimedia.org/r/647049 (https://phabricator.wikimedia.org/T269511) [20:06:59] (03CR) 10Andrew Bogott: [C: 03+2] Added rsyslog config for cinder [puppet] - 10https://gerrit.wikimedia.org/r/647049 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott) [20:09:15] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [20:10:11] (03CR) 10Isaac Johnson: [C: 03+1] "Thanks -- looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/647003 (https://phabricator.wikimedia.org/T266180) (owner: 10Andrew Bogott) [20:10:18] 10Operations, 10ops-eqsin, 10DC-Ops: cr2-eqsin: fan failure - https://phabricator.wikimedia.org/T267544 (10ayounsi) > @ayonsi: Juniper requested that if this new fan also doesn't work, we test the new fan in our other router. Is this ok with you for us to try out? Sure. Make sure cr2-eqsin is up before messi... 
[20:16:22] !log syncing new branch 1.376.0-wmf.21 [20:16:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:34] er not 376 [20:20:27] :) [20:20:35] ugh something broke deploy-promote [20:20:39] manual runbook woes [20:23:41] (03PS5) 10Cwhite: Generate Logstash ECS cleanup filter as part of regular build process [software/ecs] - 10https://gerrit.wikimedia.org/r/644638 [20:24:50] (03PS1) 10Hashar: Update translations [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/647053 (https://phabricator.wikimedia.org/T269339) [20:25:54] (03PS1) 1020after4: testwikis wikis to 1.36.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647054 [20:25:56] (03CR) 1020after4: [C: 03+2] testwikis wikis to 1.36.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647054 (owner: 1020after4) [20:26:41] (03Merged) 10jenkins-bot: testwikis wikis to 1.36.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647054 (owner: 1020after4) [20:28:22] !log twentyafterfour@deploy1001 Started scap: testwikis wikis to 1.36.0-wmf.21 [20:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:15] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:05:44] !log twentyafterfour@deploy1001 Finished scap: testwikis wikis to 1.36.0-wmf.21 (duration: 39m 02s) [21:05:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:07] mwdebug1001 and mwdebug1002 have full disks :-/ [21:09:30] purge a couple of MW versions? 
[21:09:40] 6.7 GiB [##########] /php-1.36.0-wmf.20 [21:09:40] 6.7 GiB [######### ] /php-1.36.0-wmf.18 [21:09:40] 6.7 GiB [######### ] /php-1.36.0-wmf.16 [21:09:40] 6.7 GiB [######### ] /php-1.36.0-wmf.14 [21:09:40] 6.7 GiB [######### ] /php-1.36.0-wmf.13 [21:09:41] 3.9 GiB [##### ] /php-1.36.0-wmf.21 [21:10:22] That's ~36G out of ~46 [21:11:33] (03PS6) 10Zoranzoki21: Add .webm in files.viewable-mime-types of Phabricator [puppet] - 10https://gerrit.wikimedia.org/r/569627 (https://phabricator.wikimedia.org/T215360) [21:12:41] twentyafterfour: ^^ [21:12:47] purge at least 13 and 14 [21:14:32] Or at least the l10n cache [21:14:33] PROBLEM - Disk space on mwdebug1001 is CRITICAL: DISK CRITICAL - free space: / 1 MB (0% inode=76%): /tmp 1 MB (0% inode=76%): /var/tmp 1 MB (0% inode=76%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mwdebug1001&var-datasource=eqiad+prometheus/ops [21:14:43] I suspect we don't need 5 full versions of MW [21:14:50] time to broom the disks Reedy [21:14:58] mmm [21:15:05] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install an-tool1010.eqiad.wmnet - https://phabricator.wikimedia.org/T268146 (10Ottomata) Hmm, uh oh, I think this host needed to be placed in the Analytics VLAN. Ping @elukey @razzi @robh [21:16:20] 10Operations, 10ops-eqsin, 10DC-Ops: cr2-eqsin: fan failure - https://phabricator.wikimedia.org/T267544 (10RobH) >>! In T267544#6677262, @ayounsi wrote: >> @ayonsi: Juniper requested that if this new fan also doesn't work, we test the new fan in our other router. Is this ok with you for us to try out? > Sure... [21:16:51] twentyafterfour: sounds like we might need to remind folks to run `scap clean` at our next weekly [21:17:36] marxarelli: I wonder if it's fallen foul because of numerous skipped weeks [21:18:11] * Reedy runs `scap clean 1.36.0-wmf.13` [21:18:36] yeah, could be. 
it's also time consuming [21:21:39] !log reedy@deploy1001 Pruned MediaWiki: 1.36.0-wmf.13 [keeping static files] (duration: 03m 59s) [21:21:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:04] !log reedy@deploy1001 Pruned MediaWiki: 1.36.0-wmf.14 [keeping static files] (duration: 02m 44s) [21:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:38] 10Operations, 10ops-eqsin, 10DC-Ops: cr2-eqsin: fan failure - https://phabricator.wikimedia.org/T267544 (10ayounsi) I mean make sure there are no (other than the fan) alerts about cr2-eqsin before doing anything with cr3. As we saw, hot swap is fine, no need to depool. [21:26:02] 10Operations, 10ops-eqsin, 10DC-Ops: cr2-eqsin: fan failure - https://phabricator.wikimedia.org/T267544 (10RobH) Ahh understood! [21:28:51] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [21:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:52] 10Operations, 10DBA, 10Performance-Team, 10Platform Engineering Roadmap Decision Making, 10User-Kormat: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 (10Clarakosi) [21:33:27] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:35:05] RECOVERY - Disk space on mwdebug1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mwdebug1001&var-datasource=eqiad+prometheus/ops [21:36:03] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:36:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:39] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:40:22] we don't even need to keep that many around do we? probably just 20 and 21 at this point [21:41:19] Indeed not [21:41:29] But minimal solution was to kill off the oldest couple [21:43:18] (03PS1) 10Andrew Bogott: Add dummy passwords for cinder service user [labs/private] - 10https://gerrit.wikimedia.org/r/647060 (https://phabricator.wikimedia.org/T269511) [21:48:01] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:51:07] !log razzi@cumin1001 START - Cookbook sre.ganeti.makevm [21:51:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:36] (03PS2) 10Andrew Bogott: Add dummy passwords for cinder service user [labs/private] - 10https://gerrit.wikimedia.org/r/647060 (https://phabricator.wikimedia.org/T269511) [21:52:24] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Add dummy passwords for cinder service user [labs/private] - 10https://gerrit.wikimedia.org/r/647060 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott) [21:52:29] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ml-serve200[1-4] - https://phabricator.wikimedia.org/T267670 (10Papaul) [21:54:10] (03CR) 10Andrew Bogott: [C: 03+2] Cinder: correct some config settings [puppet] - 10https://gerrit.wikimedia.org/r/647061 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott) [21:57:21] PROBLEM - SSH on ms-be1030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:57:50] cleaning some more [21:58:49] RECOVERY - SSH on ms-be1030 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:58:56] !log twentyafterfour@deploy1001 Pruned MediaWiki: 1.36.0-wmf.16 (duration: 02m 43s) [21:59:01] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:46] twentyafterfour: fwiw, i'm pretty aggressive with the cleans, usually just keeping the past week's version as long as it was a stable rollout [22:00:38] s/past week's/current/ [22:08:44] (03CR) 10Ayounsi: ganeti-netbox-sync: Add post-sync PuppetDB import where necessary (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/645212 (https://phabricator.wikimedia.org/T263768) (owner: 10CRusnov) [22:09:12] RoanKattouw, twentyafterfour, marxarelli, Niharika, Urbanecm : we've got a parsoid bump on the evening backport window coming up, just so you know [22:09:18] (03PS1) 10Daimona Eaytoy: Explicitly set wgAbuseFilterAflFilterMigrationStage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647091 (https://phabricator.wikimedia.org/T269712) [22:10:42] !log twentyafterfour@deploy1001 Pruned MediaWiki: 1.36.0-wmf.18 (duration: 02m 00s) [22:10:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:11:29] cscott: since it's an UBN, it can be deployed outside of a window, fwiw [22:11:59] yeah, since it's an early deploy of Parsoid -a19 we wanted to give it a little bit of time on group0 to smoke test it before pushing it to group1 and group2 [22:12:11] (-a19 just got deployed to group0 with the train) [22:12:24] just gives us a little bit more confidence [22:12:31] ah, so it's already partially deployed [22:12:33] makes sense then [22:14:23] (03PS1) 1020after4: group0 wikis to 1.36.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647093 [22:14:25] (03CR) 1020after4: [C: 03+2] group0 wikis to 1.36.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647093 (owner: 1020after4) [22:15:19] (03Merged) 10jenkins-bot: group0 wikis to 1.36.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647093 (owner: 1020after4) [22:17:41] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.21 
[22:17:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:50] (03PS1) 10Bstorm: kubeadm-k8s: use cached calico container images [puppet] - 10https://gerrit.wikimedia.org/r/647094 (https://phabricator.wikimedia.org/T269016) [22:27:00] (03PS1) 10Ayounsi: Add Cloudcontrol term to labs-in4/6 [homer/public] - 10https://gerrit.wikimedia.org/r/647098 (https://phabricator.wikimedia.org/T269457) [22:28:13] (03CR) 10Andrew Bogott: [C: 03+2] cloud-vps VM backups: exclude some more hostnames from backup [puppet] - 10https://gerrit.wikimedia.org/r/647003 (https://phabricator.wikimedia.org/T266180) (owner: 10Andrew Bogott) [22:28:27] (03CR) 10Ayounsi: [C: 03+2] Add Cloudcontrol term to labs-in4/6 [homer/public] - 10https://gerrit.wikimedia.org/r/647098 (https://phabricator.wikimedia.org/T269457) (owner: 10Ayounsi) [22:28:50] (03PS1) 10Andrew Bogott: Cinder: add rabbit hostname/password to config [puppet] - 10https://gerrit.wikimedia.org/r/647099 (https://phabricator.wikimedia.org/T269511) [22:28:54] (03Merged) 10jenkins-bot: Add Cloudcontrol term to labs-in4/6 [homer/public] - 10https://gerrit.wikimedia.org/r/647098 (https://phabricator.wikimedia.org/T269457) (owner: 10Ayounsi) [22:33:01] (03CR) 10Andrew Bogott: [C: 03+2] Cinder: add rabbit hostname/password to config [puppet] - 10https://gerrit.wikimedia.org/r/647099 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott) [22:34:39] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:36:13] (03CR) 10Cwhite: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/647028 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [22:44:08] 10Operations, 10ops-eqiad, 10Data-Services, 10Epic, 10cloud-services-team (Hardware): Move labstore1004 and labstore1005 to 10G Ethernet - https://phabricator.wikimedia.org/T266198 (10Andrew) [22:46:43] !log razzi@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [22:46:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:13] (03PS1) 10Razzi: superset: add cache to staging superset [puppet] - 10https://gerrit.wikimedia.org/r/647106 (https://phabricator.wikimedia.org/T268784) [23:11:38] (03PS1) 10Andrew Bogott: Ceph: allow controller nodes to be specified per DC [puppet] - 10https://gerrit.wikimedia.org/r/647107 (https://phabricator.wikimedia.org/T265965) [23:13:09] (03CR) 10jerkins-bot: [V: 04-1] Ceph: allow controller nodes to be specified per DC [puppet] - 10https://gerrit.wikimedia.org/r/647107 (https://phabricator.wikimedia.org/T265965) (owner: 10Andrew Bogott) [23:13:34] (03PS2) 10Razzi: superset: add cache to staging superset [puppet] - 10https://gerrit.wikimedia.org/r/647106 (https://phabricator.wikimedia.org/T268784) [23:14:33] (03CR) 10Razzi: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27021/console" [puppet] - 10https://gerrit.wikimedia.org/r/647106 (https://phabricator.wikimedia.org/T268784) (owner: 10Razzi) [23:17:59] (03PS2) 10Andrew Bogott: Ceph: allow controller nodes to be specified per DC [puppet] - 10https://gerrit.wikimedia.org/r/647107 (https://phabricator.wikimedia.org/T265965) [23:18:32] (03PS1) 10Mholloway: WikimediaEvents: Enable SessionTick on group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647108 (https://phabricator.wikimedia.org/T248987) [23:19:27] (03CR) 10jerkins-bot: [V: 04-1] Ceph: allow controller nodes to be specified per DC [puppet] - 10https://gerrit.wikimedia.org/r/647107 
(https://phabricator.wikimedia.org/T265965) (owner: 10Andrew Bogott) [23:20:20] (03PS3) 10Andrew Bogott: Ceph: allow controller nodes to be specified per DC [puppet] - 10https://gerrit.wikimedia.org/r/647107 (https://phabricator.wikimedia.org/T265965) [23:20:22] (03CR) 10Mholloway: [C: 03+2] WikimediaEvents: Enable SessionTick on group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647108 (https://phabricator.wikimedia.org/T248987) (owner: 10Mholloway) [23:20:59] (03PS1) 10Razzi: Add kafka-test1007 virtual machine [puppet] - 10https://gerrit.wikimedia.org/r/647109 (https://phabricator.wikimedia.org/T268202) [23:21:20] (03Merged) 10jenkins-bot: WikimediaEvents: Enable SessionTick on group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647108 (https://phabricator.wikimedia.org/T248987) (owner: 10Mholloway) [23:21:45] (03CR) 10jerkins-bot: [V: 04-1] Add kafka-test1007 virtual machine [puppet] - 10https://gerrit.wikimedia.org/r/647109 (https://phabricator.wikimedia.org/T268202) (owner: 10Razzi) [23:22:11] (03CR) 10Andrew Bogott: [C: 03+2] Ceph: allow controller nodes to be specified per DC [puppet] - 10https://gerrit.wikimedia.org/r/647107 (https://phabricator.wikimedia.org/T265965) (owner: 10Andrew Bogott) [23:23:34] (03PS2) 10Razzi: Add kafka-test1007 virtual machine [puppet] - 10https://gerrit.wikimedia.org/r/647109 (https://phabricator.wikimedia.org/T268202) [23:24:33] !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings.php: WikimediaEvents: Enable SessionTick on group0 T248987 (duration: 02m 00s) [23:24:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:24:42] T248987: Session Length Metric. 
Web implementation - https://phabricator.wikimedia.org/T248987 [23:33:45] (03PS1) 10Andrew Bogott: Include glance/ceph image backup job in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/647112 [23:34:14] (03CR) 10Andrew Bogott: [C: 03+2] Include glance/ceph image backup job in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/647112 (owner: 10Andrew Bogott) [23:36:31] (03PS1) 10Andrew Bogott: Add glance backup times to codfw1dev control nodes [puppet] - 10https://gerrit.wikimedia.org/r/647113 [23:37:02] (03CR) 10Andrew Bogott: [C: 03+2] Add glance backup times to codfw1dev control nodes [puppet] - 10https://gerrit.wikimedia.org/r/647113 (owner: 10Andrew Bogott) [23:44:49] (03CR) 10Thcipriani: [C: 03+1] Add Tyler as approval contact for Gerrit/contint [puppet] - 10https://gerrit.wikimedia.org/r/644856 (owner: 10Muehlenhoff) [23:53:28] 10Operations, 10ops-eqiad, 10DBA: db1139 memory errors on boot (issue continues after board change) 2020-08-27 - https://phabricator.wikimedia.org/T261405 (10Jclark-ctr) HP sent 2 more Dimms requesting to be changed. just arrived will change tomorrow