[00:00:04] RoanKattouw, Niharika, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201210T0000).
[00:00:04] No GERRIT patches in the queue for this window AFAICS.
[00:01:56] Operations, SRE-Access-Requests: Requesting access to deployment for Kosta Harlan - https://phabricator.wikimedia.org/T269731 (thcipriani) >>! In T269731#6679835, @jbond wrote: > @thcipriani are you able to approve adding kostajh to the `deployment:` group Approved!
[00:03:45] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1003/27063/wikistats-wild-tiger.wikistats.eqiad.wmflabs/index.html" [puppet] - https://gerrit.wikimedia.org/r/646876 (owner: Dzahn)
[00:03:50] Operations, fundraising-tech-ops, netops: Manage frack switches with Netbox - https://phabricator.wikimedia.org/T268802 (Dwisehaupt) @jbond Thanks for the pointers. I have started testing this in our VM setup and it looks like getting lldp in place should be easy to do. I do have a question about th...
[00:04:55] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[00:22:30] (PS1) Dzahn: wikistats: fix file name of db dump script [puppet] - https://gerrit.wikimedia.org/r/647402
[00:23:20] (CR) Dzahn: [C: +2] wikistats: fix file name of db dump script [puppet] - https://gerrit.wikimedia.org/r/647402 (owner: Dzahn)
[00:23:26] (PS2) Dzahn: wikistats: fix file name of db dump script [puppet] - https://gerrit.wikimedia.org/r/647402
[00:26:20] !log cr2-eqsin bad fan being swapped via T267544
[00:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:26:24] T267544: cr2-eqsin: fan failure - https://phabricator.wikimedia.org/T267544
[00:32:17] PROBLEM - PHP opcache health on mw2243 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[00:33:31] ACKNOWLEDGEMENT - PHP opcache health on mw2243 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn reimaged and not pooled https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[00:38:01] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 6418400728 and 746 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:38:01] PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 2431911424 and 138 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:40:40] (PS1) Dzahn: wikistats: redirect output of mysqldump command properly [puppet] - https://gerrit.wikimedia.org/r/647408
[00:41:07] (CR) jerkins-bot: [V: -1] wikistats: redirect output of mysqldump command properly [puppet] - https://gerrit.wikimedia.org/r/647408 (owner: Dzahn)
[00:41:15] PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 292830432 and 20 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:41:19] (PS2) Dzahn: wikistats: redirect output of mysqldump command properly [puppet] - https://gerrit.wikimedia.org/r/647408
[00:41:47] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 2579873816 and 143 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:41:55] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 515024360 and 20 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:41:59] PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 7966246568 and 481 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:42:17] PROBLEM - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 75056672 and 16 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:42:23] PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 6678317568 and 411 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:42:53] RECOVERY - Postgres Replication Lag on maps1001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 11368 and 45 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:43:31] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 83 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:43:46] (CR) Dzahn: [C: +2] wikistats: redirect output of mysqldump command properly [puppet] - https://gerrit.wikimedia.org/r/647408 (owner: Dzahn)
[00:43:53] RECOVERY - Postgres Replication Lag on maps1010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 45160 and 105 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:44:31] RECOVERY - Postgres Replication Lag on maps1005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 187112 and 142 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:45:01] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 59152 and 174 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:47:15] RECOVERY - Postgres Replication Lag on maps1006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 11992 and 306 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:47:45] RECOVERY - Postgres Replication Lag on maps1008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 2144 and 336 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:48:25] RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1040 and 378 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:48:54] Operations, DBA, Performance-Team, Platform Engineering Roadmap Decision Making, User-Kormat: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 (nnikkhoui)
[00:50:37] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:52:42] Operations, ops-eqsin, DC-Ops: cr2-eqsin: fan failure - https://phabricator.wikimedia.org/T267544 (RobH) Summary update: * Jin installed the second replacement fan from Juniper into cr2-eqsin, the red led stayed red (didn't change to green) and software via ssh check by me still showed the fan in a...
[00:57:03] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:00:04] twentyafterfour: (Dis)respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201210T0100). Please do the needful.
[01:01:35] (PS1) Bstorm: wikireplicas: close all connections [puppet] - https://gerrit.wikimedia.org/r/647419 (https://phabricator.wikimedia.org/T269620)
[01:01:51] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:11:47] PROBLEM - very high load average likely xfs on ms-be2018 is CRITICAL: CRITICAL - load average: 106.17, 100.34, 97.61 https://wikitech.wikimedia.org/wiki/Swift
[01:18:13] PROBLEM - very high load average likely xfs on ms-be2018 is CRITICAL: CRITICAL - load average: 101.30, 100.36, 98.38 https://wikitech.wikimedia.org/wiki/Swift
[01:34:03] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:38:55] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:13:05] PROBLEM - very high load average likely xfs on ms-be2018 is CRITICAL: CRITICAL - load average: 106.54, 102.17, 98.53 https://wikitech.wikimedia.org/wiki/Swift
[02:16:03] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:17:04] (PS1) Catrope: RCFilters: Temporarily fix TagItemWidget remove button size [core] (wmf/1.36.0-wmf.21) - https://gerrit.wikimedia.org/r/647305 (https://phabricator.wikimedia.org/T269477)
[02:20:55] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:24:27] PROBLEM - very high load average likely xfs on ms-be2018 is CRITICAL: CRITICAL - load average: 101.64, 100.33, 99.40 https://wikitech.wikimedia.org/wiki/Swift
[02:55:19] RECOVERY - very high load average likely xfs on ms-be2018 is OK: OK - load average: 77.24, 75.89, 79.82 https://wikitech.wikimedia.org/wiki/Swift
[03:00:25] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:47:15] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:53:39] PROBLEM - very high load average likely xfs on ms-be2018 is CRITICAL: CRITICAL - load average: 106.22, 101.74, 97.43 https://wikitech.wikimedia.org/wiki/Swift
[04:03:31] Operations, Performance-Team, serviceops, Patch-For-Review, User-jijiki: Enable "/*/mw-with-onhost-tier/" route for MediaWiki where safe - https://phabricator.wikimedia.org/T264604 (jijiki) @Krinkle @aaron do you think we are ready to move this forward?
[04:29:24] Operations, MW-on-K8s, serviceops: Sandbox/limit child processes within a container runtime - https://phabricator.wikimedia.org/T252745 (tstarling)
[04:29:48] Operations, MW-on-K8s, serviceops, Patch-For-Review, and 2 others: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (tstarling) Resolved→Open Can the task stay open to track implementation? The RFC workboard has "Approved" and "Implemente...
[05:14:51] Operations, MW-on-K8s, serviceops, Patch-For-Review, and 2 others: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (Krinkle) Given the title and task description, I assumed it was a dedicated task, but I see it's used as tracking task indeed. So...
[06:21:28] * kart_ upgrading Apertium service. No major changes.
[06:21:58] (CR) KartikMistry: [C: +2] Update apertium to 2020-12-09-115733-production [deployment-charts] - https://gerrit.wikimedia.org/r/647220 (owner: KartikMistry)
[06:23:25] (Merged) jenkins-bot: Update apertium to 2020-12-09-115733-production [deployment-charts] - https://gerrit.wikimedia.org/r/647220 (owner: KartikMistry)
[06:24:37] RECOVERY - very high load average likely xfs on ms-be2018 is OK: OK - load average: 62.65, 68.04, 78.51 https://wikitech.wikimedia.org/wiki/Swift
[06:27:00] !log kartik@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'apertium' for release 'staging' .
[06:27:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:30:47] !log kartik@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'apertium' for release 'production' .
[06:30:47] !log kartik@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'apertium' for release 'plain' .
[06:30:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:30:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:35:00] !log kartik@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'apertium' for release 'plain' .
[06:35:00] !log kartik@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'apertium' for release 'production' .
[06:35:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:35:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:38:53] !log Upgraded Apertium to 2020-12-09-115733-production
[06:38:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:37] Operations, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-worker10[18-41] - https://phabricator.wikimedia.org/T260445 (elukey) Thanks a lot for all the work! To recap: +2 servers in A2 +2 servers in A4 +2 servers in B2 +2 servers in B4 +1 servers in B7 +2 servers i...
[07:35:28] (CR) Elukey: [V: +1 C: +2] Add a second Hive Metastore on an-coord1002 [puppet] - https://gerrit.wikimedia.org/r/647273 (https://phabricator.wikimedia.org/T268028) (owner: Elukey)
[07:58:53] (CR) David Caro: [C: -1] "I have a question 😊" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/647419 (https://phabricator.wikimedia.org/T269620) (owner: Bstorm)
[08:09:36] (PS1) Elukey: hive: fix wrong kerberos principal for the replicated metastore [puppet] - https://gerrit.wikimedia.org/r/647599
[08:10:12] (CR) Elukey: [C: +2] "And I was wondering why the hive server complained about kerberos auth :D" [puppet] - https://gerrit.wikimedia.org/r/647599 (owner: Elukey)
[08:11:28] (CR) Muehlenhoff: "> Patch Set 7:" [puppet] - https://gerrit.wikimedia.org/r/592712 (https://phabricator.wikimedia.org/T251005) (owner: Reedy)
[08:14:34] (CR) Muehlenhoff: [C: +1] "Looks good to me." [puppet] - https://gerrit.wikimedia.org/r/647369 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[08:16:12] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/645206 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn)
[08:18:04] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:22:44] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:22:55] !log swift codfw-prod: more weight to ms-be20[58-61] - T269337
[08:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:00] T269337: Add ms-be20[58-61] to swift - https://phabricator.wikimedia.org/T269337
[08:43:59] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/645206 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn)
[08:45:11] (PS7) Jbond: hiera: install redis on shard16 [puppet] - https://gerrit.wikimedia.org/r/647204 (https://phabricator.wikimedia.org/T265643) (owner: Effie Mouzeli)
[08:45:42] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/647204 (https://phabricator.wikimedia.org/T265643) (owner: Effie Mouzeli)
[08:51:36] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/647197 (https://phabricator.wikimedia.org/T265643) (owner: Effie Mouzeli)
[08:54:36] (PS3) Jbond: icinga: add support for downtimed and notifications_enabled parameters [software/spicerack] - https://gerrit.wikimedia.org/r/647245 (https://phabricator.wikimedia.org/T269672)
[08:54:41] (CR) Jbond: "updated" (3 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/647245 (https://phabricator.wikimedia.org/T269672) (owner: Jbond)
[08:55:06] volans: not sure if i have allready missed your spicerack release but ^^^ has been updated now
[08:56:28] (PS2) Jbond: icinga::raid_handler: add support for ssacli [puppet] - https://gerrit.wikimedia.org/r/647281 (https://phabricator.wikimedia.org/T269563)
[08:59:46] (CR) Effie Mouzeli: "PCC for all affected hosts looks ok: https://puppet-compiler.wmflabs.org/compiler1001/27064/" [puppet] - https://gerrit.wikimedia.org/r/647197 (https://phabricator.wikimedia.org/T265643) (owner: Effie Mouzeli)
[09:00:00] (CR) Effie Mouzeli: [C: +2] Upgrade to 2.9 [debs/python-thumbor-wikimedia] - https://gerrit.wikimedia.org/r/603876 (https://phabricator.wikimedia.org/T254845) (owner: Gilles)
[09:00:34] Operations, Traffic: Incorrect X-Cache-Status reported by deployment-prep caches - https://phabricator.wikimedia.org/T269825 (ema)
[09:00:40] Operations, Traffic: Incorrect X-Cache-Status reported by deployment-prep caches - https://phabricator.wikimedia.org/T269825 (ema) p:Triage→Lowest
[09:02:20] (CR) Elukey: "To keep archives happy - the change was not submitted, and the WMDE team fixed the schema, so we are good now (no need to revert etc..)." [puppet] - https://gerrit.wikimedia.org/r/647351 (owner: Milimetric)
[09:06:07] (PS3) Jbond: icinga::raid_handler: add support for ssacli [puppet] - https://gerrit.wikimedia.org/r/647281 (https://phabricator.wikimedia.org/T269563)
[09:06:16] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:06:35] Operations, ops-eqiad, DBA: db1139 memory errors on boot (issue continues after board change) 2020-08-27 - https://phabricator.wikimedia.org/T261405 (jcrespo) Can confirm from os command line: ` $ free -m Mem: 515690 ` Thank you very much!
[09:08:42] (CR) Jbond: icinga::raid_handler: add support for ssacli (2 comments) [puppet] - https://gerrit.wikimedia.org/r/647281 (https://phabricator.wikimedia.org/T269563) (owner: Jbond)
[09:10:48] (CR) Alexandros Kosiaris: [V: +1 C: +2] "Thanks! I 've missed that part, good catch!" [puppet] - https://gerrit.wikimedia.org/r/647210 (owner: Alexandros Kosiaris)
[09:11:12] !log disable puppet on all hosts running redis - T265643
[09:11:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:17] T265643: Upgrade MediaWiki's Redis cluster to Debian Buster - https://phabricator.wikimedia.org/T265643
[09:12:44] (CR) Effie Mouzeli: [C: +2] redis: define redis version on buster for multidc [puppet] - https://gerrit.wikimedia.org/r/647197 (https://phabricator.wikimedia.org/T265643) (owner: Effie Mouzeli)
[09:13:49] akosiaris: I thing I have your patch too
[09:13:55] Operations, Traffic: X-Cache-Status: distinguish between fresh and stale hits/misses - https://phabricator.wikimedia.org/T269828 (ema)
[09:14:02] jbond42: no you didn't
[09:14:02] Operations, Traffic: X-Cache-Status: distinguish between fresh and stale hits/misses - https://phabricator.wikimedia.org/T269828 (ema) p:Triage→Medium
[09:14:05] akosiaris: should I proceed?
[09:14:57] (CR) Volans: [C: +1] "LGTM" [software/spicerack] - https://gerrit.wikimedia.org/r/647245 (https://phabricator.wikimedia.org/T269672) (owner: Jbond)
[09:15:05] jbond42: go ahead and merge it at will
[09:15:31] effie: yup
[09:15:38] smile :D
[09:17:37] Operations, ops-eqiad, DBA: Degraded RAID on es1023 - https://phabricator.wikimedia.org/T268796 (jcrespo) This is still rebuilding: ` root@es1023:~$ megacli -PDRbld -ShowProg -PhysDrv \[32\:5\] -aALL Rebuild Progress on Device at Enclosure 32, Slot 5 Completed...
[09:18:54] (CR) Jbond: [C: +2] icinga::raid_handler: add support for ssacli [puppet] - https://gerrit.wikimedia.org/r/647281 (https://phabricator.wikimedia.org/T269563) (owner: Jbond)
[09:19:15] (CR) Jbond: [C: +2] icinga: add support for downtimed and notifications_enabled parameters [software/spicerack] - https://gerrit.wikimedia.org/r/647245 (https://phabricator.wikimedia.org/T269672) (owner: Jbond)
[09:19:47] ack volans the spicerack one is merge ping me when you do a release and ill merge the icinga_status CR, thx
[09:20:11] perfect, will be shortly
[09:20:45] (PS1) Ema: cache: downgrade Varnish on cp3054 to 6.0.0-1wm1 [puppet] - https://gerrit.wikimedia.org/r/647615 (https://phabricator.wikimedia.org/T264398)
[09:20:59] ack
[09:22:18] (CR) Ema: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/647615 (https://phabricator.wikimedia.org/T264398) (owner: Ema)
[09:22:51] (Merged) jenkins-bot: icinga: add support for downtimed and notifications_enabled parameters [software/spicerack] - https://gerrit.wikimedia.org/r/647245 (https://phabricator.wikimedia.org/T269672) (owner: Jbond)
[09:24:50] (CR) Ema: [C: +2] cache: downgrade Varnish on cp3054 to 6.0.0-1wm1 [puppet] - https://gerrit.wikimedia.org/r/647615 (https://phabricator.wikimedia.org/T264398) (owner: Ema)
[09:26:31] !log disable puppet on all mw* hosts for 647204 - T265643
[09:26:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:35] T265643: Upgrade MediaWiki's Redis cluster to Debian Buster - https://phabricator.wikimedia.org/T265643
[09:28:19] !log disable puppet on all hosts running nutcracker for 647204 - T265643
[09:28:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:28] (PS8) Effie Mouzeli: hiera: install redis on shard16 [puppet] - https://gerrit.wikimedia.org/r/647204 (https://phabricator.wikimedia.org/T265643)
[09:28:46] !log cp3054: downgrade varnish to 6.0.0-1wm1 T264398
[09:28:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:49] T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398
[09:34:40] PROBLEM - Check systemd state on cp3054 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:37:42] Operations, ops-eqiad, DBA: db1139 memory errors on boot (issue continues after board change) 2020-08-27 - https://phabricator.wikimedia.org/T261405 (jcrespo) Also no more errors on reboot: ` Installed System Memory: 512 GB, Available System Memory: 512 GB 2 Processor(s) detected, 8 total cores enab...
[09:37:56] RECOVERY - Check systemd state on cp3054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:39:15] (PS2) Jcrespo: Revert "mariadb: Reduce memory consumption of mariadb@s6 while hw degraded" [puppet] - https://gerrit.wikimedia.org/r/641498
[09:39:35] (PS1) Volans: CHANGELOG: add changelogs for release v0.0.46 [software/spicerack] - https://gerrit.wikimedia.org/r/647621
[09:40:00] (CR) JMeybohm: "Thanks for the review." (9 comments) [puppet] - https://gerrit.wikimedia.org/r/645417 (https://phabricator.wikimedia.org/T267653) (owner: JMeybohm)
[09:40:28] (PS8) JMeybohm: calico: Add support for calico 3.x with kubernetes datastore [puppet] - https://gerrit.wikimedia.org/r/645417 (https://phabricator.wikimedia.org/T267653)
[09:40:53] (CR) Jcrespo: [C: +2] Revert "mariadb: Reduce memory consumption of mariadb@s6 while hw degraded" [puppet] - https://gerrit.wikimedia.org/r/641498 (owner: Jcrespo)
[09:41:10] (CR) Kormat: [C: +2] alerting: Disable screen/tmux monitoring on orchestrator hosts [puppet] - https://gerrit.wikimedia.org/r/647319 (https://phabricator.wikimedia.org/T265990) (owner: Jcrespo)
[09:41:52] ups
[09:41:54] jynus: is it safe to puppet-merge your change?
[09:41:55] ok to merge?
[09:41:57] yeah
[09:41:59] (CR) Effie Mouzeli: "PCC https://puppet-compiler.wmflabs.org/compiler1001/27066/ and https://puppet-compiler.wmflabs.org/compiler1001/27067/" [puppet] - https://gerrit.wikimedia.org/r/647204 (https://phabricator.wikimedia.org/T265643) (owner: Effie Mouzeli)
[09:42:04] go for it :)
[09:42:22] jynus: i hit the wrong button on the screen-monitoring CR, so i decided i'd submit it
[09:42:33] ha ha
[09:42:52] so you only merged my suggestion because of an accident :-))))))
[09:43:04] (CR) jerkins-bot: [V: -1] CHANGELOG: add changelogs for release v0.0.46 [software/spicerack] - https://gerrit.wikimedia.org/r/647621 (owner: Volans)
[09:43:17] jynus: haha. i _meant_ to +1 it, but it's morning :)
[09:43:21] ah, ok
[09:43:38] nah, those patches where you have the last call, I am more than cool with you merging
[09:43:49] as you are more of the owner of them
[09:44:29] (PS2) Volans: CHANGELOG: add changelogs for release v0.0.46 [software/spicerack] - https://gerrit.wikimedia.org/r/647621
[09:44:33] (CR) JMeybohm: "> Patch Set 2: -Verified" [puppet] - https://gerrit.wikimedia.org/r/647011 (https://phabricator.wikimedia.org/T269461) (owner: Alexandros Kosiaris)
[09:46:48] RECOVERY - MariaDB read only s1 on db1139 is OK: Version 10.1.44-MariaDB, Uptime 56s, read_only: True, event_scheduler: True, 11.78 QPS, connection latency: 0.002330s, query latency: 0.000594s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[09:47:03] wait, what?
[09:47:09] ah
[09:47:15] jynus: that you? ^
[09:47:17] that is db1139 coming back from the dead
[09:47:20] grand :)
[09:47:21] I will disable notifications
[09:47:41] (CR) jerkins-bot: [V: -1] CHANGELOG: add changelogs for release v0.0.46 [software/spicerack] - https://gerrit.wikimedia.org/r/647621 (owner: Volans)
[09:47:45] it is the typical issue that downtime only disables new downs, not recoveries
[09:48:40] (PS1) Jcrespo: Revert "database backups: Move s1&s6 snapshots and logical dumps from db1139 to db1140" [puppet] - https://gerrit.wikimedia.org/r/647626
[09:48:51] (PS2) Jcrespo: Revert "database backups: Move s1&s6 snapshots and logical dumps from db1139 to db1140" [puppet] - https://gerrit.wikimedia.org/r/647626
[09:49:08] (CR) Volans: "recheck" [software/spicerack] - https://gerrit.wikimedia.org/r/647621 (owner: Volans)
[09:51:56] (CR) Jcrespo: [C: +2] Revert "database backups: Move s1&s6 snapshots and logical dumps from db1139 to db1140" [puppet] - https://gerrit.wikimedia.org/r/647626 (owner: Jcrespo)
[09:55:47] (CR) Volans: "recheck" [software/spicerack] - https://gerrit.wikimedia.org/r/647621 (owner: Volans)
[09:57:52] !log A:cp rolling ats-{tls,backend}-restart for openssl upgrades (CVE-2020-1971)
[09:57:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:38] (CR) Volans: [C: +2] CHANGELOG: add changelogs for release v0.0.46 [software/spicerack] - https://gerrit.wikimedia.org/r/647621 (owner: Volans)
[10:00:43] Operations, ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T268036 (fgiunchedi) >>! In T268036#6679944, @Cmjohnson wrote: > @fgiunchedi The bbu is on-site, please let me know when I can take this offline? I can do tomorrow 1500UTC 1500 UTC sounds good to me, please LMK on IRC...
[10:00:47] Operations, SRE-Access-Requests: Requesting access to deployment for Kosta Harlan - https://phabricator.wikimedia.org/T269731 (jbond) >>! In T269731#6679919, @marcella wrote: > @jbond I am Kosta's manager and I approve this request. Thank you! >>! In T269731#6680958, @kaldari wrote: > I approve as well...
[10:02:20] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/647204 (https://phabricator.wikimedia.org/T265643) (owner: Effie Mouzeli)
[10:02:33] (PS1) Volans: Upstream release v0.0.46 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/647649
[10:02:42] (CR) Effie Mouzeli: [C: +2] hiera: install redis on shard16 [puppet] - https://gerrit.wikimedia.org/r/647204 (https://phabricator.wikimedia.org/T265643) (owner: Effie Mouzeli)
[10:07:21] (CR) Volans: [C: +2] Upstream release v0.0.46 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/647649 (owner: Volans)
[10:10:14] (Merged) jenkins-bot: Upstream release v0.0.46 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/647649 (owner: Volans)
[10:16:38] !log uploaded spicerack_0.0.46 to apt.wikimedia.org buster-wikimedia
[10:16:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:43] (PS1) Jbond: admin: add kharlan to deployment group [puppet] - https://gerrit.wikimedia.org/r/647651 (https://phabricator.wikimedia.org/T269731)
[10:17:35] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for Kosta Harlan - https://phabricator.wikimedia.org/T269731 (jbond)
[10:17:36] RECOVERY - HP RAID on labstore1006 is OK: OK: Slot 1: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK --- Slot 3: OK: 1E:1:1, 1E:1:10, 1E:1:11, 1E:1:12, 1E:1:2, 1E:1:3, 1E:1:4, 1E:1:5, 1E:1:6, 1E:1:7, 1E:1:8, 1E:1:9, 1E:2:1, 1E:2:10, 1E:2:11, 1E:2:12, 1E:2:2, 1E:2:3, 1E:2:4, 1E:2:5, 1E:2:6, 1E:2:7, 1E:2:8, 1E:2:9 - Controll
[10:17:36] Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[10:17:46] (CR) Jbond: [C: +2] admin: add kharlan to deployment group [puppet] - https://gerrit.wikimedia.org/r/647651 (https://phabricator.wikimedia.org/T269731) (owner: Jbond)
[10:18:02] PROBLEM - Aggregate IPsec Tunnel Status codfw on alert1001 is CRITICAL: instance=mc2034 site=codfw tunnel=mc1034_v4 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[10:18:13] jbond42: new spicerack released, we can upgrade it on the cumin hosts whenver you're ready to merge
[10:23:13] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for Kosta Harlan - https://phabricator.wikimedia.org/T269731 (jbond) I have now added you to the deployment group. however there is currently on going work which means it may take a few hours for this change to propog...
[10:23:15] volans: i can deploy now
[10:24:14] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for Kosta Harlan - https://phabricator.wikimedia.org/T269731 (jbond) Open→Resolved p:Triage→Medium
[10:25:12] (PS4) Jbond: icinga_status: add downtimed and notifications_enabled to json [puppet] - https://gerrit.wikimedia.org/r/647084
[10:28:05] jbond42: thanks for your help!
[10:28:24] jbond42: ack, sorry got disconnected
[10:28:46] volans: no problem just ping me when its deployed to cumin then ill deploy my change
[10:28:56] jbond42: {done}
[10:29:14] !log upgraded spicearack to 0.0.46 on cumin[12]001
[10:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:40] (CR) Jbond: [C: +2] icinga_status: add downtimed and notifications_enabled to json [puppet] - https://gerrit.wikimedia.org/r/647084 (owner: Jbond)
[10:29:56] PROBLEM - Check systemd state on cp5007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:31:22] volans: ack deployed to icinga1001 and tested localy. will test the reboot cook book in a bit
[10:31:34] perfect, thanks a lot!
[10:31:40] np thx :)
[10:32:10] !log elukey@cumin1001 START - Cookbook sre.hosts.decommission
[10:32:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:22] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single
[10:33:23] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99)
[10:33:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:34] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single
[10:33:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:54] (PS1) Filippo Giunchedi: smokeping: force redirect to https [puppet] - https://gerrit.wikimedia.org/r/647654
[10:37:33] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99)
[10:37:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:37:44] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[10:37:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:37:49] Operations, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install an-tool1010.eqiad.wmnet - https://phabricator.wikimedia.org/T268146 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: `an-tool1010.eqiad.wmnet` - an-tool1010.eqiad.wmnet (**PASS**) - Downtim...
[10:38:28] !log uploading prometheus-redis-exporter_0.13-1 in component/redis2 for buster [10:38:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:38] (03PS1) 10Filippo Giunchedi: alertmanager: set karma poll interval to 10s [puppet] - 10https://gerrit.wikimedia.org/r/647655 (https://phabricator.wikimedia.org/T266017) [10:41:48] (03PS1) 10Volans: Testing CI [software/spicerack] - 10https://gerrit.wikimedia.org/r/647657 [10:42:51] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single [10:42:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:58] !log swift eqiad-prod: add weight to ms-be106[0-3] - T268435 [10:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:05] T268435: Add ms-be106[0-3] to swift - https://phabricator.wikimedia.org/T268435 [10:45:08] (03CR) 10jerkins-bot: [V: 04-1] Testing CI [software/spicerack] - 10https://gerrit.wikimedia.org/r/647657 (owner: 10Volans) [10:45:59] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [10:46:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:02] (03CR) 10Volans: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/647657 (owner: 10Volans) [10:47:16] 10Operations: Traceback in icinga-status 'Host' object has no attribute 'downtime' - https://phabricator.wikimedia.org/T269672 (10jbond) I fix has been applied to both spicerack and the icingas_status script. I have checked things work with the following and all looks good to me. please re-open if you still s... [10:47:23] 10Operations: Traceback in icinga-status 'Host' object has no attribute 'downtime' - https://phabricator.wikimedia.org/T269672 (10jbond) 05Resolved→03Open [10:47:42] volans: fyi looks like the spicerack release fixed the issue reported thanks ^^ [10:47:50] great! 
thanks a lot [10:50:47] (03CR) 10Volans: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/647657 (owner: 10Volans) [10:51:07] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [10:52:14] (03PS1) 10Alexandros Kosiaris: WIP: Move monitoring stanzas to shared templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/647660 [10:53:49] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is CRITICAL: 10.34 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [10:54:51] (03CR) 10Volans: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/647657 (owner: 10Volans) [10:56:31] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [10:58:19] 10Operations, 10Release-Engineering-Team-TODO, 10serviceops, 10Patch-For-Review, and 2 others: Upgrade MediaWiki appservers to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10MoritzMuehlenhoff) >>! In T245757#6680909, @Dzahn wrote: >>>! In T245757#6645352, @jijiki wrote: >> @Dzahn... [11:00:05] mvolz: Dear deployers, time to do the Services – Citoid / Zotero deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201210T1100). 
[11:00:41] (03CR) 10Volans: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/647657 (owner: 10Volans) [11:01:28] (03PS3) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/641725 (owner: 10PipelineBot) [11:02:19] RECOVERY - Aggregate IPsec Tunnel Status codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [11:02:41] (03CR) 10Mvolz: [C: 03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/641725 (owner: 10PipelineBot) [11:03:09] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [11:04:09] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/641725 (owner: 10PipelineBot) [11:04:23] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [11:04:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:01] !log rebooting failoid1001 [11:05:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:42] moritzm: it will fail [11:05:46] ...oid [11:06:24] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [11:06:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:28] !log elukey@cumin1001 START - Cookbook sre.dns.netbox [11:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:57] 10Operations: Traceback in icinga-status 'Host' object has no attribute 'downtime' - https://phabricator.wikimedia.org/T269672 (10MoritzMuehlenhoff) Works like a charm now. [11:08:21] it continues to fail, that's how I like it! 
[11:10:22] 10Operations, 10fundraising-tech-ops, 10netops: Manage frack switches with Netbox - https://phabricator.wikimedia.org/T268802 (10jbond) >>! In T268802#6681000, @Dwisehaupt wrote: > I do have a question about the use of facter though. In my testing with lldbctl I see multiple neighbors for an interface. Alth... [11:13:32] 10Operations, 10SRE-Access-Requests: Requesting access to releasers-wikibase for toan - https://phabricator.wikimedia.org/T269777 (10jbond) [11:13:57] !log elukey@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:13:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:21] PROBLEM - puppet last run on scb1003 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:14:33] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install an-tool1010.eqiad.wmnet - https://phabricator.wikimedia.org/T268146 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` an-tool1010.eqiad.wmnet ` The log can be found in `/var/log/wm... 
[11:19:30] (03PS1) 10Jbond: admin: add toan user [puppet] - 10https://gerrit.wikimedia.org/r/647662 (https://phabricator.wikimedia.org/T269777) [11:20:02] (03PS7) 10Jbond: Add group wikibase-releasers & folder [puppet] - 10https://gerrit.wikimedia.org/r/643512 (https://phabricator.wikimedia.org/T268818) (owner: 10Tobias Andersson) [11:20:57] !log installing apt security updates on buster/stretch [11:21:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:20] !log upload prometheus-redis-exporter_0.13-1 to buster-wikimedia main [11:21:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:25] (03PS8) 10Jbond: Add group wikibase-releasers & folder [puppet] - 10https://gerrit.wikimedia.org/r/643512 (https://phabricator.wikimedia.org/T268818) (owner: 10Tobias Andersson) [11:21:54] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to releasers-wikibase for toan - https://phabricator.wikimedia.org/T269777 (10jbond) [11:22:53] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to releasers-wikibase for toan - https://phabricator.wikimedia.org/T269777 (10jbond) @toan I have created the CR to add your shell account and will merge at the same time as the change to add [[ https://gerrit.wikimedia.org/r/c/opera... 
[11:24:59] RECOVERY - puppet last run on scb1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:39:09] 10Operations, 10Growth-Team, 10serviceops, 10Patch-For-Review, and 2 others: Reimage one memcached shard per DC to Buster - https://phabricator.wikimedia.org/T252391 (10jijiki) [11:39:19] 10Operations, 10Platform Engineering, 10serviceops, 10Patch-For-Review, 10User-jijiki: Upgrade MediaWiki's Redis cluster to Debian Buster - https://phabricator.wikimedia.org/T265643 (10jijiki) [11:39:24] 10Operations, 10Platform Engineering, 10Wikidata, 10serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (10jijiki) [11:40:11] 10Operations, 10Growth-Team, 10serviceops, 10Patch-For-Review, and 2 others: Reimage one memcached shard per DC to Buster - https://phabricator.wikimedia.org/T252391 (10jijiki) 05Open→03Resolved a:03jijiki [11:41:28] 10Operations, 10LDAP-Access-Requests: LDAP access to wmf group for Matt Cleinman - https://phabricator.wikimedia.org/T269696 (10jbond) p:05Triage→03Medium @MattCleinman granting access is not an issue however could you please provide information on the services you require access to for audit puposes @JoeW... [11:45:15] (03Abandoned) 10Effie Mouzeli: mcrouter: add gutter pool servers in configuration [puppet] - 10https://gerrit.wikimedia.org/r/569541 (https://phabricator.wikimedia.org/T213089) (owner: 10Effie Mouzeli) [11:45:42] (03Abandoned) 10Effie Mouzeli: mcrouter: enable gutter pool config on mwdebug1001 and mwdebug2001 [puppet] - 10https://gerrit.wikimedia.org/r/574200 (https://phabricator.wikimedia.org/T213089) (owner: 10Effie Mouzeli) [11:45:47] !log kharlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . 
[11:51:03] (03PS7) 10MSantos: start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) [11:52:17] 10Operations, 10Performance-Team, 10Platform Engineering, 10Goal: Decommission the "session redis" cluster - https://phabricator.wikimedia.org/T243520 (10jijiki) [11:52:37] (03CR) 10jerkins-bot: [V: 04-1] start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [11:53:35] (03PS1) 10Lucas Werkmeister (WMDE): Fix prev/next links on Special:WhatLinksHere [core] (wmf/1.36.0-wmf.21) - 10https://gerrit.wikimedia.org/r/647628 (https://phabricator.wikimedia.org/T269830) [11:54:40] jouncebot: recheck please [11:55:08] uh [11:55:10] i’m stupid [11:55:12] jouncebot: refresh please [11:55:13] I refreshed my knowledge about deployments. [11:55:17] thanks ^^ [11:56:06] (I’ll be in a meeting for the first half of the window, so the other config changes can be deployed ahead of that backport) [11:56:24] 10Operations, 10LDAP-Access-Requests: NDA for Superset Request from WMDE Employee - Mohammed Sadat - https://phabricator.wikimedia.org/T269843 (10Mohammed_Sadat_WMDE) [11:57:03] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: How many deployers does it take to do European mid-day backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201210T1200). [12:00:04] Bencemac, matthiasmullie, and Lucas_WMDE: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[12:00:54] I'm here, but it's also my first patch, so sorry in advance :) [12:00:54] I’m busy for the next 30 mins [12:00:59] I withdraw my patch - not to be deployed today [12:01:53] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:02:54] (03PS1) 10Ayounsi: Run Homer during the decom cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/647629 [12:03:09] I can deploy today! [12:03:27] so, it's only Bencemac 's patch (and Lucas's once he gets back?) [12:03:44] it looks like [12:03:46] (03PS2) 10Ayounsi: Run Homer during the decom cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/647629 [12:04:12] (03CR) 10Urbanecm: [C: 03+2] "B&C" [core] (wmf/1.36.0-wmf.21) - 10https://gerrit.wikimedia.org/r/647628 (https://phabricator.wikimedia.org/T269830) (owner: 10Lucas Werkmeister (WMDE)) [12:04:12] I have installed the gadget and ready to go [12:04:14] yes, just removed mine from deployments page [12:04:17] Bencemac: great [12:04:31] Urbanecm: you saw my message that I’m not available yet? [12:04:37] I think that +2 is premature… [12:04:51] Lucas_WMDE: yes, but also CI takes over 20 minutes to complete [12:05:03] yes, but I’ll be unavailable for more than over 20 minutes too… [12:05:15] I can cancel it if you wish, but that'll mean you'll have to wait 20 minutes instead just coming to a merged patch :) [12:05:42] Bencemac: is it intentional to also set wgKartographerEnableMapFrame to true? [12:06:36] (03PS1) 10Effie Mouzeli: hiera: upgrade mc1032, mc2032 to buster [puppet] - 10https://gerrit.wikimedia.org/r/647672 (https://phabricator.wikimedia.org/T213089) [12:07:06] Kartographer is not enabled @huwiki, so I'm not sure. But probably yes, because the FR settings affect it [12:07:37] Lucas_WMDE: anyway, I'm happy to deploy yours even w/o you, as it's simple enough 🙂 [12:08:05] or you could just wait? 
[12:08:10] and then I’ll be happy to deploy it… [12:08:33] as you wish, +2 removed :) [12:09:09] Bencemac: I don't understand it. The comment for wgKartographerEnableMapFrame says "// Disable for FlaggedRevs wikis with $wgFlaggedRevsOverride=true", and you're setting wgFlaggedRevsOverride to true? [12:09:37] PROBLEM - DPKG on ganeti2019 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:10:48] well, it's tgr's patch and I'm just here to learn how this works for the future [12:11:00] I also think that it should be false [12:11:40] Bencemac: remove the kartographer thing from the patch please [12:11:47] PROBLEM - DPKG on mwdebug2002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:12:00] this will actually enable it, and there's no reason to do it now AFAICS [12:12:28] ^ dpkg error will sort out soon, apt update [12:13:39] !log kharlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . [12:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:01] (03CR) 10Volans: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/647657 (owner: 10Volans) [12:14:12] Bencemac: just to confrim, did you see my message? :-) [12:14:27] yes, I am just trying [12:14:40] okay, ask if you have any questions :) [12:15:15] 10Operations, 10Mail: Bounces when sending mail to aliases of a specific WMF email address: 550 Previous (cached) callout verification failure - https://phabricator.wikimedia.org/T269725 (10jbond) I have tried to recreate this and every thing looks fine from an SMTP PoV ` lines=5 $ telnet mx1001.wikimedia.org... 
[12:15:27] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install an-tool1010.eqiad.wmnet - https://phabricator.wikimedia.org/T268146 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-tool1010.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-tool1010.eqiad.wmnet'] ` [12:16:07] PROBLEM - DPKG on puppetboard2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:17:29] (03PS7) 10Bencemac: [huwiki] Set wgFlaggedRevsOverride back to true per community vote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496205 (https://phabricator.wikimedia.org/T210224) (owner: 10Mahveotm) [12:17:37] o/ meeting over \o/ [12:17:49] (03PS8) 10Urbanecm: [huwiki] Set wgFlaggedRevsOverride back to true per community vote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496205 (https://phabricator.wikimedia.org/T210224) (owner: 10Mahveotm) [12:17:57] (03CR) 10Urbanecm: [C: 03+2] [huwiki] Set wgFlaggedRevsOverride back to true per community vote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496205 (https://phabricator.wikimedia.org/T210224) (owner: 10Mahveotm) [12:18:01] it's done [12:18:04] thanks Bencemac :) [12:18:09] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Fix prev/next links on Special:WhatLinksHere [core] (wmf/1.36.0-wmf.21) - 10https://gerrit.wikimedia.org/r/647628 (https://phabricator.wikimedia.org/T269830) (owner: 10Lucas Werkmeister (WMDE)) [12:18:10] will ping you once it's ready to be tested [12:18:39] readded the +2 to my backport, hopefully that means the old gate-and-submit will still go through [12:18:46] but I gather you’re not done yet so I’ll wait with the actual deploy [12:19:02] sorry, I'm just a bit nervous, I'm not so eperienced in this stuff :D [12:19:07] (03Merged) 10jenkins-bot: [huwiki] Set wgFlaggedRevsOverride back to true per community vote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496205 (https://phabricator.wikimedia.org/T210224) 
(owner: 10Mahveotm) [12:19:25] will wait here [12:19:27] Bencemac: no problem, I'll guide you through it :) [12:20:01] truly appreciated [12:20:56] Bencemac: I've pulled your change onto mwdebug1001. Can you test, please? Assuming you have the browser extension/gadget installed, you need to only enable it, pick mwdebug1001 there, and ensure indeed the stable version appears. [12:21:13] doing... [12:22:19] (03CR) 10Tobias Andersson: admin: add toan user (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/647662 (https://phabricator.wikimedia.org/T269777) (owner: 10Jbond) [12:24:17] RECOVERY - DPKG on ganeti2019 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:24:17] RECOVERY - DPKG on mwdebug2002 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:24:26] Urbanecm, it works perfectly [12:24:32] great, syncing then [12:26:11] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 042dd034ef5811923106e81dbb4ac129be1f1ba6: [huwiki] Set wgFlaggedRevsOverride back to true per community vote (T210224) (duration: 01m 07s) [12:26:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:15] T210224: Revert FlaggedRevs changes on the Hungarian Wikipedia - https://phabricator.wikimedia.org/T210224 [12:26:20] Bencemac: done :). Anything else? [12:26:53] nothing else, thank you very much! [12:26:59] no problem [12:27:07] Lucas_WMDE: in that case, it's yours :) [12:27:12] alright, thanks :) [12:27:16] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install an-tool1010.eqiad.wmnet - https://phabricator.wikimedia.org/T268146 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` an-tool1010.eqiad.wmnet ` The log can be found in `/var/log/wm... 
[12:27:25] RECOVERY - DPKG on puppetboard2001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:27:27] 10Operations, 10Platform Engineering, 10Wikidata, 10serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (10jijiki) @aaron In order to change the servers defined in the mediawiki-config (and use other redis instances), apart from roll change them... [12:27:31] (03Merged) 10jenkins-bot: Fix prev/next links on Special:WhatLinksHere [core] (wmf/1.36.0-wmf.21) - 10https://gerrit.wikimedia.org/r/647628 (https://phabricator.wikimedia.org/T269830) (owner: 10Lucas Werkmeister (WMDE)) [12:27:41] aaand right on time \o/ [12:28:01] PROBLEM - DPKG on db1079 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:28:29] testing on mwdebug1001 [12:28:49] PROBLEM - DPKG on ganeti1015 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:29:16] seems to work just fine, syncing [12:30:07] PROBLEM - DPKG on puppetboard1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:30:39] PROBLEM - DPKG on elastic2051 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:30:49] !log lucaswerkmeister-wmde@deploy1001 Synchronized php-1.36.0-wmf.21/includes/specials/SpecialWhatLinksHere.php: Backport: [[gerrit:647628|Fix prev/next links on Special:WhatLinksHere (T269830)]] (duration: 01m 04s) [12:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:53] T269830: Previous/next links on Special:WhatLinksHere are HTML-escaped on 1.36.0-wmf.21 - https://phabricator.wikimedia.org/T269830 [12:31:09] !log kharlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . 
[12:31:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:19] PROBLEM - DPKG on analytics1072 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:31:23] someone doing updates, there seems to be a few dpkg alerts? [12:31:41] jynus: I saw this a while ago: 13:12 ^ dpkg error will sort out soon, apt update [12:31:46] not sure if it applies to those alerts too [12:31:55] thanks, Urbanecm, that explains it [12:32:03] I didn't read it, too much scrollback [12:32:09] yeah, that's all going to recover soon and harmless [12:32:20] i was deploying at that time, so... 🙂 [12:32:23] RECOVERY - DPKG on puppetboard1001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:32:23] RECOVERY - DPKG on db1079 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:32:23] RECOVERY - DPKG on ganeti1015 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:33:32] any other backport/config changes? 
[12:33:55] RECOVERY - DPKG on analytics1072 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:33:55] RECOVERY - DPKG on elastic2051 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:33:59] not from me [12:33:59] !log EU backport+config window done [12:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:45] PROBLEM - DPKG on ganeti1021 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:36:27] PROBLEM - DPKG on analytics1073 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:37:27] PROBLEM - DPKG on db1123 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:40:33] PROBLEM - DPKG on an-worker1113 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:41:24] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime [12:41:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:59] PROBLEM - DPKG on elastic1054 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:42:03] RECOVERY - DPKG on db1123 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:42:03] RECOVERY - DPKG on ganeti1021 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:42:03] RECOVERY - DPKG on analytics1073 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:43:28] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:11] PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1975379752 and 107 seconds 
https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:44:23] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1881107584 and 109 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:44:37] PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1811215568 and 118 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:46:41] RECOVERY - Postgres Replication Lag on maps1005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 48992 and 142 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:47:15] RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 4320 and 176 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:47:25] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 214776 and 187 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:50:11] PROBLEM - DPKG on argon is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:50:21] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 294901824 and 12 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:51:24] 10Operations, 10Mail: Bounces when sending mail to aliases of a specific WMF email address: 550 Previous (cached) callout verification failure - https://phabricator.wikimedia.org/T269725 (10jbond) 05Open→03Resolved a:03jbond >>! In T269725#6680196, @JCabanero wrote: > Hi all, > > I sent a test email to... 
[12:51:59] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1116630440 and 64 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:52:59] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install an-tool1010.eqiad.wmnet - https://phabricator.wikimedia.org/T268146 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-tool1010.eqiad.wmnet'] ` and were **ALL** successful. [12:53:27] (03PS2) 10Jbond: admin: add toan user [puppet] - 10https://gerrit.wikimedia.org/r/647662 (https://phabricator.wikimedia.org/T269777) [12:53:38] (03CR) 10Jbond: admin: add toan user (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/647662 (https://phabricator.wikimedia.org/T269777) (owner: 10Jbond) [12:53:39] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 90336 and 65 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:54:01] (03PS9) 10Jbond: Add group wikibase-releasers & folder [puppet] - 10https://gerrit.wikimedia.org/r/643512 (https://phabricator.wikimedia.org/T268818) (owner: 10Tobias Andersson) [12:54:11] RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 33688 and 99 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:54:21] RECOVERY - DPKG on argon is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:54:21] RECOVERY - DPKG on elastic1054 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:54:21] RECOVERY - DPKG on an-worker1113 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:59:55] PROBLEM - Thanos compact has not run on alert1001 is CRITICAL: 4.466e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [13:01:17] RECOVERY - 
Thanos compact has not run on alert1001 is OK: (C)24 ge (W)12 ge 0.01697 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [13:01:35] PROBLEM - DPKG on elastic1032 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [13:05:23] (03PS1) 10Elukey: Add bigtop15 component for Analytics [puppet] - 10https://gerrit.wikimedia.org/r/647697 [13:09:09] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27068/console" [puppet] - 10https://gerrit.wikimedia.org/r/647697 (owner: 10Elukey) [13:12:08] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install an-tool1010.eqiad.wmnet - https://phabricator.wikimedia.org/T268146 (10elukey) All right the host is now up and running in the analytics vlan, this is the procedure that I followed: - ran the decom cookbook for an-tool1010 - manually rem... 
[13:12:48] (03CR) 10Elukey: Add bigtop15 component for Analytics [puppet] - 10https://gerrit.wikimedia.org/r/647697 (owner: 10Elukey) [13:20:15] (03PS20) 10Jbond: puppet-merge: add Repository class [puppet] - 10https://gerrit.wikimedia.org/r/544943 (https://phabricator.wikimedia.org/T254249) [13:20:37] (03CR) 10Jbond: "updated" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/544943 (https://phabricator.wikimedia.org/T254249) (owner: 10Jbond) [13:20:53] (03CR) 10jerkins-bot: [V: 04-1] puppet-merge: add Repository class [puppet] - 10https://gerrit.wikimedia.org/r/544943 (https://phabricator.wikimedia.org/T254249) (owner: 10Jbond) [13:21:56] (03PS21) 10Jbond: puppet-merge: add Repository class [puppet] - 10https://gerrit.wikimedia.org/r/544943 (https://phabricator.wikimedia.org/T254249) [13:26:56] !log disable puppet fleet wide to reboot puppet managment infrastructre [13:26:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:38] PROBLEM - Host puppetmaster2003 is DOWN: PING CRITICAL - Packet loss = 100% [13:31:46] ^^ me downtimeing now [13:32:38] RECOVERY - Host puppetmaster2003 is UP: PING OK - Packet loss = 0%, RTA = 31.89 ms [13:32:43] 10Operations, 10LDAP-Access-Requests: LDAP access to wmf group for Matt Cleinman - https://phabricator.wikimedia.org/T269696 (10Aklapper) @MattCleinman: See https://phabricator.wikimedia.org/project/profile/1564/ for required info; in case that you followed some onboarding docs you may want them to link to tha... [13:32:57] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single [13:32:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:25] (03PS1) 10Ppchelko: Configure $wgWikimediaApiPortalOAuthMetaApiURL in labs. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647708 [13:41:57] (03CR) 10Ppchelko: [C: 03+2] Configure $wgWikimediaApiPortalOAuthMetaApiURL in labs. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/647708 (owner: 10Ppchelko) [13:42:55] (03Merged) 10jenkins-bot: Configure $wgWikimediaApiPortalOAuthMetaApiURL in labs. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647708 (owner: 10Ppchelko) [13:44:36] (03CR) 10Alexandros Kosiaris: [C: 04-1] "> Patch Set 11:" (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/640571 (https://phabricator.wikimedia.org/T265526) (owner: 10Mstyles) [13:45:54] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [13:45:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:00] (03CR) 10Alexandros Kosiaris: [C: 04-1] "I 've also added you to https://gerrit.wikimedia.org/r/admin/groups/3fdcf8fd0d569e90a3e9b39788a29f2c50d33be9,members you should have +2 ri" [deployment-charts] - 10https://gerrit.wikimedia.org/r/640571 (https://phabricator.wikimedia.org/T265526) (owner: 10Mstyles) [13:46:45] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single [13:46:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:50] 10Operations, 10ops-codfw, 10SRE-swift-storage: audit / test / upgrade hp smartarray P840 firmware - https://phabricator.wikimedia.org/T141756 (10fgiunchedi) [13:50:22] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [13:50:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:29] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single [13:50:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:52] 10Operations, 10RESTBase: restbase2009 reimaging issues - https://phabricator.wikimedia.org/T269853 (10hnowlan) [14:05:49] (03PS1) 10Jbond: sre.puppet.renew-cert: convert to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/647712 [14:05:54] 10Operations, 10RESTBase: restbase2009 reimaging issues - https://phabricator.wikimedia.org/T269853 (10hnowlan) 
[14:06:06] (03PS2) 10Jbond: sre.puppet.renew-cert: convert to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/647712 [14:06:18] 10Operations, 10RESTBase: restbase2009 reimaging issues - https://phabricator.wikimedia.org/T269853 (10hnowlan) [14:07:13] (03CR) 10jerkins-bot: [V: 04-1] sre.puppet.renew-cert: convert to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/647712 (owner: 10Jbond) [14:07:52] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [14:07:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:02] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single [14:10:02] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) [14:10:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:12] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single [14:10:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:12] (03PS3) 10Jbond: sre.puppet.renew-cert: convert to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/647712 [14:14:20] (03CR) 10jerkins-bot: [V: 04-1] sre.puppet.renew-cert: convert to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/647712 (owner: 10Jbond) [14:16:31] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [14:16:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:47] (03Abandoned) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/641151 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [14:17:11] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single [14:17:11] (03PS4) 10Jbond: sre.puppet.renew-cert: convert to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/647712 [14:17:12] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:10] (03CR) 10BBlack: [C: 03+2] GeoDNS: Remove old hack for Wikia RES datacenter [dns] - 10https://gerrit.wikimedia.org/r/647253 (owner: 10TK-999) [14:21:20] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Publish Wikibase tarball releases on releases.wikimedia.org - https://phabricator.wikimedia.org/T268818 (10thcipriani) >>! In T268818#6656828, @ssingh wrote: > Thanks @Dzahn! > > @thcipriani: Adding you to this task to see if you have any possible con... [14:22:54] (03CR) 10Jbond: "ready for review" [cookbooks] - 10https://gerrit.wikimedia.org/r/647712 (owner: 10Jbond) [14:28:30] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to releasers-wikibase for toan - https://phabricator.wikimedia.org/T269777 (10jbond) @KFrancis Are you able to confirm NDA status for Tobias, thanks [14:30:22] (03PS10) 10Jbond: Add group wikibase-releasers & folder [puppet] - 10https://gerrit.wikimedia.org/r/643512 (https://phabricator.wikimedia.org/T268818) (owner: 10Tobias Andersson) [14:30:37] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27069/console" [puppet] - 10https://gerrit.wikimedia.org/r/645417 (https://phabricator.wikimedia.org/T267653) (owner: 10JMeybohm) [14:30:42] (03PS11) 10Jbond: Add group wikibase-releasers & folder [puppet] - 10https://gerrit.wikimedia.org/r/643512 (https://phabricator.wikimedia.org/T268818) (owner: 10Tobias Andersson) [14:32:50] (03CR) 10Jbond: [C: 03+2] Add group wikibase-releasers & folder [puppet] - 10https://gerrit.wikimedia.org/r/643512 (https://phabricator.wikimedia.org/T268818) (owner: 10Tobias Andersson) [14:32:52] RECOVERY - DPKG on elastic1032 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [14:33:44] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [14:33:46] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:26] (03PS3) 10Jbond: admin: add toan user and add to wikibase-releasers group [puppet] - 10https://gerrit.wikimedia.org/r/647662 (https://phabricator.wikimedia.org/T269777) [14:37:55] (03CR) 10Alexandros Kosiaris: [V: 03+1 C: 03+1] "PCC is happy too, so +1" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/645417 (https://phabricator.wikimedia.org/T267653) (owner: 10JMeybohm) [14:38:25] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Publish Wikibase tarball releases on releases.wikimedia.org - https://phabricator.wikimedia.org/T268818 (10jbond) 05Open→03Resolved a:03jbond The change has now been merged all users listed in the original post should have the required access. @... [14:38:34] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single [14:38:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:50] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install an-tool1010.eqiad.wmnet - https://phabricator.wikimedia.org/T268146 (10Ottomata) Thanks Luca! [14:40:55] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Publish Wikibase tarball releases on releases.wikimedia.org - https://phabricator.wikimedia.org/T268818 (10toan) >>! In T268818#6682278, @jbond wrote: > The change has now been merged all users listed in the original post should have the required acces... 
[14:42:06] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [14:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:10] !log re-enable puppet fleet wide [14:47:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:28] PROBLEM - puppet last run on miscweb1002 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:50:08] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single [14:50:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:51] (03CR) 10Bstorm: wikireplicas: close all connections (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/647419 (https://phabricator.wikimedia.org/T269620) (owner: 10Bstorm) [14:53:04] RECOVERY - puppet last run on miscweb1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:54:02] PROBLEM - Check systemd state on pki2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:54:49] (03CR) 10Bstorm: "> Patch Set 1: Code-Review-1" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/647419 (https://phabricator.wikimedia.org/T269620) (owner: 10Bstorm) [14:59:49] (03CR) 10VolkerE: [C: 03+1] "Have just +1 rights here" [core] (wmf/1.36.0-wmf.21) - 10https://gerrit.wikimedia.org/r/647305 (https://phabricator.wikimedia.org/T269477) (owner: 10Catrope) [15:03:32] (03CR) 10David Caro: wikireplicas: close all connections (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/647419 (https://phabricator.wikimedia.org/T269620) (owner: 10Bstorm) [15:03:33] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [15:03:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:54] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single [15:03:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:02] !log reboot deneb.codfw.wmnet [15:04:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:20] !log restarting slapd on ldap replicas to pick up OpenSSL updates [15:04:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:50] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:04:52] PROBLEM - Check systemd state on idp1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:07:49] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [15:07:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:38] (03CR) 10Volans: "LGTM, nit and a suggestion inline" (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/647712 (owner: 10Jbond) [15:11:20] RECOVERY - Check systemd state on idp1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:12:11] hi, anyone with Logstash access around that could lookup stack trace for X9I6DgpAIC4AAHI1i4kAAAAR T269857? thanks! [15:12:11] T269857: Fatal exception of type "TypeError" when viewing enwiki page "Draft:Richard_L._Greene" - https://phabricator.wikimedia.org/T269857 [15:12:18] !log restarting turnilo and hue to pick up OpenSSL security updates [15:12:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:06] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/647697 (owner: 10Elukey) [15:16:23] Hello. For some reason a specific page on enwiki is displayed to me in a different MW Skin than the one I use. Any ideas? 
[15:16:32] (03CR) 10Bstorm: wikireplicas: close all connections (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/647419 (https://phabricator.wikimedia.org/T269620) (owner: 10Bstorm) [15:17:28] 10Operations, 10serviceops, 10MW-1.36-notes (1.36.0-wmf.18; 2020-11-17), 10Performance Issue, and 3 others: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (10Pchelolo) [15:19:29] (03PS5) 10Jbond: sre.puppet.renew-cert: convert to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/647712 [15:19:32] (03CR) 10Elukey: [C: 03+2] Add bigtop15 component for Analytics [puppet] - 10https://gerrit.wikimedia.org/r/647697 (owner: 10Elukey) [15:19:42] (03CR) 10Jbond: [C: 03+1] "updated thanks" (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/647712 (owner: 10Jbond) [15:21:38] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Nice. Sorry it took so long to review this. I definitely removes some duplication" [deployment-charts] - 10https://gerrit.wikimedia.org/r/644787 (https://phabricator.wikimedia.org/T268434) (owner: 10JMeybohm) [15:22:32] (03CR) 10Bstorm: wikireplicas: close all connections (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/647419 (https://phabricator.wikimedia.org/T269620) (owner: 10Bstorm) [15:23:34] PROBLEM - Host ms-be1030 is DOWN: PING CRITICAL - Packet loss = 100% [15:26:19] 10Operations, 10Wikimedia-Mailing-lists: https://lists.wikimedia.org/pipermail/wikija-l/ has broken encoding - https://phabricator.wikimedia.org/T269301 (10jbond) Do you have any idea for when this may have broke? [15:28:38] RECOVERY - Host ms-be1030 is UP: PING OK - Packet loss = 0%, RTA = 0.15 ms [15:30:05] CindyCicaleseWMF and bpirkle: #bothumor I ♥ Unicode. All rise for Core Platform Team Deployment deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201210T1530). 
[15:30:05] CindyCicaleseWMF: A patch you scheduled for Core Platform Team Deployment is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [15:31:12] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [15:31:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:07] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to releasers-wikibase for toan - https://phabricator.wikimedia.org/T269777 (10jbond) p:05Triage→03Medium [15:32:43] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T268036 (10Cmjohnson) I swapped the bbu with a new one and powered the server up [15:33:52] bpirkle and I are getting ready to deploy a config change. Let us know if there is anything going on to prevent that now. [15:34:02] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [15:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:53] (03CR) 10Razzi: [C: 03+2] Add kafka-test1007 virtual machine [puppet] - 10https://gerrit.wikimedia.org/r/647109 (https://phabricator.wikimedia.org/T268202) (owner: 10Razzi) [15:35:34] RECOVERY - HP RAID on ms-be1030 is OK: OK: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [15:36:17] (03PS3) 10Alexandros Kosiaris: k8s_infrastructure_users: Amend to support groups, avoid uid conflicts [puppet] - 10https://gerrit.wikimedia.org/r/647011 (https://phabricator.wikimedia.org/T269461) [15:36:19] (03PS1) 10Alexandros Kosiaris: kubestage2*: Assign role [puppet] - 10https://gerrit.wikimedia.org/r/647728 (https://phabricator.wikimedia.org/T252185) [15:36:51] (03PS2) 10Cicalese: Configure API 
Portal permissions for launch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/646862 (https://phabricator.wikimedia.org/T267953) [15:37:07] (03PS4) 10Cicalese: CommonSettings: OAuth 2.0 refresh tokens expire after 1 minute [mediawiki-config] - 10https://gerrit.wikimedia.org/r/645308 (https://phabricator.wikimedia.org/T269152) (owner: 10Vlad.shapik) [15:37:10] (03CR) 10BPirkle: [C: 03+2] Configure API Portal permissions for launch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/646862 (https://phabricator.wikimedia.org/T267953) (owner: 10Cicalese) [15:38:31] (03Merged) 10jenkins-bot: Configure API Portal permissions for launch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/646862 (https://phabricator.wikimedia.org/T267953) (owner: 10Cicalese) [15:39:11] (03CR) 10JMeybohm: [C: 03+2] admin_ng: Generalization, prod values anf fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/644787 (https://phabricator.wikimedia.org/T268434) (owner: 10JMeybohm) [15:40:17] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for prometheus-es-exporter [puppet] - 10https://gerrit.wikimedia.org/r/647730 (https://phabricator.wikimedia.org/T135991) [15:40:50] 10Operations, 10LDAP-Access-Requests: NDA for Superset Request from WMDE Employee - Mohammed Sadat - https://phabricator.wikimedia.org/T269843 (10jbond) @KFrancis are you able to help with processin the NDA for Mohammed We will also need [[ https://wikitech.wikimedia.org/wiki/SRE_Clinic_Duty#wmde_access | app... 
[15:40:56] (03Merged) 10jenkins-bot: admin_ng: Generalization, prod values anf fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/644787 (https://phabricator.wikimedia.org/T268434) (owner: 10JMeybohm) [15:41:33] (03CR) 10jerkins-bot: [V: 04-1] Enable base::service_auto_restart for prometheus-es-exporter [puppet] - 10https://gerrit.wikimedia.org/r/647730 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:43:22] (03PS2) 10Muehlenhoff: Enable base::service_auto_restart for prometheus-es-exporter [puppet] - 10https://gerrit.wikimedia.org/r/647730 (https://phabricator.wikimedia.org/T135991) [15:44:05] (03PS3) 10Muehlenhoff: Enable base::service_auto_restart for prometheus-es-exporter [puppet] - 10https://gerrit.wikimedia.org/r/647730 (https://phabricator.wikimedia.org/T135991) [15:45:01] (03CR) 10JMeybohm: [C: 03+2] calico: Add support for calico 3.x with kubernetes datastore [puppet] - 10https://gerrit.wikimedia.org/r/645417 (https://phabricator.wikimedia.org/T267653) (owner: 10JMeybohm) [15:45:57] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [15:45:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:05] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/647730 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:47:39] (03PS1) 10Elukey: aptrepo: add key for bigtop 1.5 [puppet] - 10https://gerrit.wikimedia.org/r/647732 [15:48:59] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27070/console" [puppet] - 10https://gerrit.wikimedia.org/r/647732 (owner: 10Elukey) [15:49:52] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [15:49:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:14] !log cicalese@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 646862 Configure API Portal permissions for launch (duration: 01m 
03s) [15:50:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:40] (03CR) 10BPirkle: [C: 03+2] CommonSettings: OAuth 2.0 refresh tokens expire after 1 minute [mediawiki-config] - 10https://gerrit.wikimedia.org/r/645308 (https://phabricator.wikimedia.org/T269152) (owner: 10Vlad.shapik) [15:50:44] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2243.codfw.wmnet [15:50:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:30] (03Merged) 10jenkins-bot: CommonSettings: OAuth 2.0 refresh tokens expire after 1 minute [mediawiki-config] - 10https://gerrit.wikimedia.org/r/645308 (https://phabricator.wikimedia.org/T269152) (owner: 10Vlad.shapik) [15:51:59] (03CR) 10RLazarus: [C: 03+1] sre.hosts.downtime: convert to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/633484 (https://phabricator.wikimedia.org/T221212) (owner: 10Volans) [15:53:05] !log rebooting planet1002 (planet.wikimedia.org) for kernel update [15:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:11] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [15:53:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:31] (03PS7) 10Volans: sre.hosts.downtime: convert to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/633484 (https://phabricator.wikimedia.org/T221212) [15:54:26] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mwdebug1002.eqiad.wmnet [15:54:26] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10Patch-For-Review: Refactor calico deploy strategy - https://phabricator.wikimedia.org/T267653 (10JMeybohm) [puppet-private] (487bdca0) (jayme) Add calicoctl and calico-cni kubernetes users [15:54:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:32] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mwdebug1003.eqiad.wmnet [15:54:33] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:43] !log dzahn@cumin1001 conftool action : set/weight=10; selector: name=mwdebug1003.eqiad.wmnet [15:54:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:38] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [15:55:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:42] !log cicalese@deploy1001 Synchronized wmf-config/CommonSettings.php: 645308 CommonSettings: OAuth 2.0 refresh tokens expire after 1 minute (duration: 01m 02s) [15:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:13] (03CR) 10RLazarus: [C: 03+1] hiera: upgrade mc1032, mc2032 to buster [puppet] - 10https://gerrit.wikimedia.org/r/647672 (https://phabricator.wikimedia.org/T213089) (owner: 10Effie Mouzeli) [15:56:22] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T268036 (10fgiunchedi) 05Open→03Resolved We're back, thanks @Cmjohnson ! [15:56:35] 10Operations, 10Wikimedia-Mailing-lists: https://lists.wikimedia.org/pipermail/wikija-l/ has broken encoding - https://phabricator.wikimedia.org/T269301 (10Urbanecm) No, I'm sorry, that was my first time seeing the list. [15:56:38] 10Operations, 10Release-Engineering-Team-TODO, 10serviceops, 10Patch-For-Review, and 2 others: Upgrade MediaWiki appservers to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10jijiki) >>! In T245757#6681662, @MoritzMuehlenhoff wrote: > ffmpeg -i Wall_of_Death_-_Pitts_Todeswand_2017_... 
[15:57:01] (03CR) 10Volans: [C: 03+2] sre.hosts.downtime: convert to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/633484 (https://phabricator.wikimedia.org/T221212) (owner: 10Volans) [15:57:27] 10Operations, 10Release-Engineering-Team-TODO, 10serviceops, 10Patch-For-Review, and 2 others: Upgrade MediaWiki appservers to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10jijiki) [15:58:06] 10Operations, 10RESTBase: restbase2009 reimaging issues - https://phabricator.wikimedia.org/T269853 (10MoritzMuehlenhoff) "task md0_resync:20551 blocked for more than 120 seconds" smells like a hw issue. Best to open a DC ops ticket to get the controller and system firmware update and then retry to reimage, [15:58:41] !log mw2243 pooled - first jobrunner on buster [15:58:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:05] (03CR) 10Volans: [C: 03+1] "LGTM, possible improvement inline" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/647712 (owner: 10Jbond) [15:59:07] (03Merged) 10jenkins-bot: sre.hosts.downtime: convert to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/633484 (https://phabricator.wikimedia.org/T221212) (owner: 10Volans) [15:59:24] We're all done with our deployments! [15:59:36] (03CR) 10Elukey: [V: 03+1 C: 03+2] aptrepo: add key for bigtop 1.5 [puppet] - 10https://gerrit.wikimedia.org/r/647732 (owner: 10Elukey) [15:59:43] 10Operations, 10Parsoid, 10serviceops: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10jijiki) @ssastry is there someway we could check that parse2001, which is running on buster now, works as expected? 
[16:03:00] (03PS6) 10Jbond: sre.puppet.renew-cert: convert to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/647712 [16:03:12] (03CR) 10Jbond: sre.puppet.renew-cert: convert to class API (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/647712 (owner: 10Jbond) [16:03:17] (03PS7) 10Jbond: sre.puppet.renew-cert: convert to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/647712 [16:04:14] mutante: a simple redirect data funnel should suffice for that one [16:04:20] To preserve current behavior [16:04:56] (03CR) 10Jbond: [C: 03+2] sre.puppet.renew-cert: convert to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/647712 (owner: 10Jbond) [16:05:00] (03CR) 10Volans: [C: 03+1] "LGTM and thanks a lot to jump so quickly on the new API!" [cookbooks] - 10https://gerrit.wikimedia.org/r/647712 (owner: 10Jbond) [16:05:04] Oh long back scroll, I meant : https://gerrit.wikimedia.org/r/c/operations/puppet/+/524088 [16:06:49] (03CR) 10Dzahn: [V: 03+1] "here it is again what misled me: compiler on C:role::dnsbox comes up empty https://integration.wikimedia.org/ci/job/operations-puppet-cata" [puppet] - 10https://gerrit.wikimedia.org/r/645206 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [16:07:02] 10Operations, 10LDAP-Access-Requests: NDA for Superset Request from WMDE Employee - Mohammed Sadat - https://phabricator.wikimedia.org/T269843 (10jbond) p:05Triage→03Medium [16:09:10] 10Operations, 10DC-Ops, 10RESTBase: restbase2009 reimaging issues - https://phabricator.wikimedia.org/T269853 (10jbond) p:05Triage→03Medium [16:13:00] 10Operations, 10Parsoid, 10serviceops: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10ssastry) Run a few curl commands like these but while using parse2001 as a proxy. Here is the equivalent for scandium itself: ` curl -L -x http://scandium.eqiad.wmnet:80 http://en.wikipedia... 
[16:13:18] 10Operations: Update tor's apt gpg key - https://phabricator.wikimedia.org/T269861 (10elukey) [16:13:32] !log volans@cumin2001 START - Cookbook sre.hosts.downtime for 0:10:00 on cumin2001.codfw.wmnet with reason: volans's test [16:13:32] !log volans@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on cumin2001.codfw.wmnet with reason: volans's test [16:13:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:37] rzl: ^^^ [16:13:39] !log add thirdparty/bigtop15 packages to stretch-wikimedia [16:13:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:46] :o [16:13:50] wwoooooooooowwwwww [16:14:08] 10Operations, 10Maps: Requesting access to maps for mbsantos and jgiannelos - https://phabricator.wikimedia.org/T269357 (10jbond) Im going to drop the SRE-Access-Requests tage from this task as i dosn't look like there is an access request to action. please re add if i have missed something. [16:14:10] to be tweaked [16:14:19] this is really awesome [16:14:49] (03Abandoned) 10RLazarus: When starting a cookbook, also log the args to IRC. [software/spicerack] - 10https://gerrit.wikimedia.org/r/549879 (owner: 10RLazarus) [16:14:55] volans: ^ ;) [16:15:08] sorry it took soooooo long [16:15:28] ahaha it's all good [16:15:34] I'm really glad to see it [16:15:41] yep it is really great [16:15:54] 10Operations: Update tor's apt gpg key - https://phabricator.wikimedia.org/T269861 (10MoritzMuehlenhoff) I can simply be removed, we no longer import/use the Tor packages. It was probably mis-imported when we migrated from local storage on apt1001 to the current Puppet approach. [16:15:56] 10Operations: Update tor's apt gpg key - https://phabricator.wikimedia.org/T269861 (10Dzahn) Since torrelay1001 has been removed and from a glance at debmonitor.. 
I don't think we use the tor package anymore and can probably remove this component. [16:16:09] 10Operations, 10ops-eqiad, 10Analytics-Clusters: an-presto1004 shows only the NIC in the boot list - https://phabricator.wikimedia.org/T268951 (10Cmjohnson) This is scheduled for Monday 14Dec [16:18:11] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/27074/dns1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/645206 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [16:19:30] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 56.67 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [16:19:44] PROBLEM - Host ms-be1022 is DOWN: PING CRITICAL - Packet loss = 100% [16:20:20] 10Operations, 10Parsoid, 10serviceops: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10ssastry) Ah, because parse2001 and parse2002 are codfw, not eqiad. Anyway, here goes: ` ssastry@scandium:~$ curl -L -x http://parse2001.codfw.wmnet:80 http://en.wikipedia.org/w/rest.php/en.... [16:20:35] (03CR) 10Dzahn: "noop confirmed on dns1001, dns3001" [puppet] - 10https://gerrit.wikimedia.org/r/645206 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [16:21:42] PROBLEM - Check systemd state on wdqs1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:22:00] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. 
https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [16:22:06] PROBLEM - Host ms-be1022.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:22:15] 10Operations, 10Wikimedia-Mailing-lists: https://lists.wikimedia.org/pipermail/wikija-l/ has broken encoding - https://phabricator.wikimedia.org/T269301 (10jbond) ack thx [16:22:44] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: (C)60 le (W)70 le 77.75 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [16:23:38] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [16:24:10] 10Operations, 10ops-eqiad, 10SRE-swift-storage: ms-be1022 smart storage battery failure; disk sdb possibly bad - https://phabricator.wikimedia.org/T267870 (10Cmjohnson) @fgiunchedi The battery has been replaced. The SSD looks to be /dev/sda and is an SSD. What do you want to do about the failed disk? [16:25:26] (03PS1) 10Dzahn: poolcounter: require_package -> ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/647737 (https://phabricator.wikimedia.org/T266479) [16:25:52] RECOVERY - Host ms-be1022 is UP: PING OK - Packet loss = 0%, RTA = 240.38 ms [16:26:05] 10Operations, 10ops-eqiad, 10SRE-swift-storage: ms-be1022 smart storage battery failure; disk sdb possibly bad - https://phabricator.wikimedia.org/T267870 (10fgiunchedi) >>! In T267870#6682568, @Cmjohnson wrote: > @fgiunchedi The battery has been replaced. The SSD looks to be /dev/sda and is an SSD. What d... 
[16:26:55] 10Operations, 10ops-eqiad, 10Data-Services, 10cloud-services-team (Hardware): Move labstore1005 to 10Gbps rack and ethernet - https://phabricator.wikimedia.org/T266199 (10Cmjohnson) @bstorm I am sorry I confused which one was already in a 10G rack. I need to confirm that 1004 is in C2 and can stay and 1005... [16:27:17] (03PS1) 10Dzahn: otrs: require_package -> ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/647738 (https://phabricator.wikimedia.org/T266479) [16:27:41] 10Operations, 10ops-eqiad, 10Data-Services, 10cloud-services-team (Hardware): Move or recable labstore1004 to 10Gbps rack (if needed) and ethernet - https://phabricator.wikimedia.org/T266202 (10Cmjohnson) This server can stay in C2 and can be converted anytime. [16:27:46] RECOVERY - Host ms-be1022.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.83 ms [16:27:52] !log depooling wdqs1011, issues with categories endpoint [16:27:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:24] 10Operations, 10ops-eqiad, 10Data-Services, 10Epic, 10cloud-services-team (Hardware): Move labstore1004 and labstore1005 to 10G Ethernet - https://phabricator.wikimedia.org/T266198 (10Cmjohnson) 05Stalled→03Resolved This is now a duplicate task, we have a few for the same thing. I am resolving this o... 
[16:28:28] !log power reset ms-be1022 - stuck after boot - T267870 [16:28:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:31] T267870: ms-be1022 smart storage battery failure; disk sdb possibly bad - https://phabricator.wikimedia.org/T267870 [16:28:33] 10Operations, 10Gerrit-Privilege-Requests: Offboard Pablo-WMDE from WMF systems - https://phabricator.wikimedia.org/T268946 (10jbond) [16:28:55] (03PS1) 10Dzahn: wikimania_scholarships: require_package -> ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/647739 (https://phabricator.wikimedia.org/T266479) [16:29:00] PROBLEM - Check systemd state on ms-be2040 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:29:11] 10Operations, 10LDAP-Access-Requests: NDA for Superset Request from WMDE Employee - Mohammed Sadat - https://phabricator.wikimedia.org/T269843 (10WMDE-leszek) I approve this request on behalf of WMDE Engineering Managers. @Kris_Litson_WMDE is formally Mohammed's line manager (other branch in WMDE org chart tha... [16:29:32] PROBLEM - Host ms-be1022 is DOWN: PING CRITICAL - Packet loss = 100% [16:30:29] 10Operations, 10MediaWiki-General, 10Performance-Team, 10serviceops-radar, and 3 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (10jijiki) @Marostegui @LSobanski Where are we regarding the purchase? @Gilles @WDoranWMF Given that we are... [16:31:06] RECOVERY - Host ms-be1022 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [16:31:20] PROBLEM - Check systemd state on ms-be1022 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:31:45] ACKNOWLEDGEMENT - MD RAID on ms-be1022 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T269862 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [16:31:49] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1022 - https://phabricator.wikimedia.org/T269862 (10ops-monitoring-bot) [16:33:50] RECOVERY - HP RAID on ms-be1022 is OK: OK: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [16:34:36] 10Operations, 10ops-eqiad, 10SRE-swift-storage: ms-be1022 smart storage battery failure; disk sdb possibly bad - https://phabricator.wikimedia.org/T267870 (10Cmjohnson) The disk error did not come back [16:40:02] 10Operations, 10ops-eqiad, 10SRE-swift-storage: ms-be1022 smart storage battery failure; disk sdb possibly bad - https://phabricator.wikimedia.org/T267870 (10Cmjohnson) 05Open→03Resolved Resolving this, if the error returns please re-open [16:40:04] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1022 - https://phabricator.wikimedia.org/T269862 (10Cmjohnson) 05Open→03Resolved a:03Cmjohnson [16:47:38] PROBLEM - Check systemd state on wdqs2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:48:43] 10Operations, 10MediaWiki-General, 10Performance-Team, 10serviceops-radar, and 3 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (10WDoranWMF) @jijiki Ok, thank you. @Gilles may be we can chat it through? 
I'll try to find us a time. [16:50:40] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [16:50:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:05] 10Operations, 10LDAP-Access-Requests: NDA for Superset Request from WMDE Employee - Mohammed Sadat - https://phabricator.wikimedia.org/T269843 (10jbond) >>! In T269843#6682591, @WMDE-leszek wrote: > I approve this request on behalf of WMDE Engineering Managers. @Kris_Litson_WMDE is formally Mohammed's line man... [16:53:29] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [16:53:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:11] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ml-serve200[1-4] - https://phabricator.wikimedia.org/T267670 (10Papaul) [16:54:26] !log upgrade mc1032, mc2032 to buster - T213089 [16:54:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:29] T213089: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 [16:55:23] (03CR) 10Effie Mouzeli: [C: 03+2] hiera: upgrade mc1032, mc2032 to buster [puppet] - 10https://gerrit.wikimedia.org/r/647672 (https://phabricator.wikimedia.org/T213089) (owner: 10Effie Mouzeli) [16:57:34] 10Operations: Update tor's apt gpg key - https://phabricator.wikimedia.org/T269861 (10jbond) p:05Triage→03Medium [16:58:24] RECOVERY - Check systemd state on ms-be2040 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:58:31] 10Operations, 10Platform Engineering, 10Wikidata, 10serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` mc2032.codfw.wmnet ` The log can be... 
[16:59:19] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:59:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:26] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [16:59:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:04] jbond42 and cdanis: Your horoscope predicts another unfortunate Puppet request window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201210T1700). [17:01:04] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [17:01:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:28] PROBLEM - debmonitor.wikimedia.org requires authentication on debmonitor2002 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 400 Bad Request https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [17:01:46] PROBLEM - very high load average likely xfs on ms-be2019 is CRITICAL: CRITICAL - load average: 151.56, 108.64, 60.74 https://wikitech.wikimedia.org/wiki/Swift [17:03:12] PROBLEM - Check systemd state on ms-be2019 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:03:43] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:03:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:27] (03CR) 10Ahmon Dancy: [C: 03+2] Update Chart.yaml source references [deployment-charts] - 10https://gerrit.wikimedia.org/r/647354 (owner: 10Ahmon Dancy) [17:05:02] RECOVERY - very high load average likely xfs on ms-be2019 is OK: OK - load average: 28.78, 69.03, 54.24 https://wikitech.wikimedia.org/wiki/Swift [17:05:43] (03Merged) 10jenkins-bot: Update Chart.yaml source references [deployment-charts] - 10https://gerrit.wikimedia.org/r/647354 (owner: 10Ahmon Dancy) [17:07:30] PROBLEM - Aggregate IPsec Tunnel Status eqiad on alert1001 is CRITICAL: instance=mc1032 site=eqiad tunnel=mc2032_v4 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [17:10:06] ACKNOWLEDGEMENT - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
Effie Mouzeli Reported on T269693 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:13:57] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2032.codfw.wmnet with reason: REIMAGE [17:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:56] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2032.codfw.wmnet with reason: REIMAGE [17:15:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:03] volans: ^ I will think fondly of you every time someone reimages a host from now on <3 [17:17:23] ahahah [17:18:33] 10Operations: slapd fails to restart sometimes - https://phabricator.wikimedia.org/T269394 (10jbond) adding additional logs before they get rotated ` lines=5 Dec 3 14:20:32 serpens puppet-agent[4040]: Computing checksum on file /etc/acmecerts/ldap/cae12c858fa6417d8d999bfaef1c25ec/ec-prime256v1.ocsp Dec 3 14:2... [17:24:11] 10Operations, 10ops-eqiad: Interface errors on cr1-eqiad:xe-3/2/1 - https://phabricator.wikimedia.org/T267672 (10Cmjohnson) replaced the cable, gave it the same cable number, removed the old fiber. Cleared the interface statistics on cr1. [17:24:11] 10Operations, 10Platform Engineering, 10Wikidata, 10serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mc2032.codfw.wmnet'] ` and were **ALL** successful. [17:29:38] (03CR) 10Effie Mouzeli: [C: 03+2] hiera: install redis on mc1032,mc2032 [puppet] - 10https://gerrit.wikimedia.org/r/647750 (owner: 10Effie Mouzeli) [17:31:27] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission es1016.eqiad.wmnet - https://phabricator.wikimedia.org/T268812 (10Cmjohnson) 05Open→03Resolved removed from rack, updated netbox and ran the script, confirmed network ports were already removed. 
[17:33:30] RECOVERY - Aggregate IPsec Tunnel Status eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [17:37:39] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker10[18-41] - https://phabricator.wikimedia.org/T260445 (10Cmjohnson) @elukey This is what I have currently +2 servers in A2 +2 servers in A4 +2 servers in A7 ** this is new +2 servers in B2 +2 servers in... [17:40:47] 10Operations, 10Platform Engineering, 10Wikidata, 10serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` mc1032.eqiad.wmnet ` The log can be... [17:45:28] would anybody mind if I deploy some little mw config change now? [17:45:41] (03CR) 10Cicalese: [C: 03+1] "Looks good to me." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647751 (https://phabricator.wikimedia.org/T269809) (owner: 10Ppchelko) [17:46:46] 10Operations, 10ops-eqiad, 10Data-Services, 10cloud-services-team (Hardware): Move labstore1005 to 10Gbps rack and ethernet - https://phabricator.wikimedia.org/T266199 (10Cmjohnson) @bstorm I just found a space for labstore1005. Let's schedule a move for Monday if that works for you, It will go to C4 1004... [17:47:03] ACKNOWLEDGEMENT - Check systemd state on wdqs1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
Ryan Kemper https://phabricator.wikimedia.org/T269872 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:48:50] (03CR) 10Ppchelko: [C: 03+2] Enable wgRestAllowCrossOriginCookieAuth for meta in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647751 (https://phabricator.wikimedia.org/T269809) (owner: 10Ppchelko) [17:49:56] (03Merged) 10jenkins-bot: Enable wgRestAllowCrossOriginCookieAuth for meta in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647751 (https://phabricator.wikimedia.org/T269809) (owner: 10Ppchelko) [17:50:26] PROBLEM - Aggregate IPsec Tunnel Status codfw on alert1001 is CRITICAL: instance=mc2032 site=codfw tunnel=mc1032_v4 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [17:53:50] 10Operations, 10Analytics, 10Analytics-Kanban, 10Event-Platform, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (10fdans) p:05Triage→03Medium [17:54:20] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1032.eqiad.wmnet with reason: REIMAGE [17:54:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:42] PROBLEM - HP RAID on ms-be1022 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:2 - Failed: 2I:4:1 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [17:54:43] !log ppchelko@deploy1001 Synchronized wmf-config/InitialiseSettings.php: gerrit:647751 T269809 (duration: 01m 05s) [17:54:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:45] T269809: Clients not displaying in production - https://phabricator.wikimedia.org/T269809 [17:56:20] !log jiji@cumin1001 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1032.eqiad.wmnet with reason: REIMAGE [17:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:50] PROBLEM - Memcached on mwdebug1002 is CRITICAL: connect to address 10.64.0.46 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [17:58:33] 10Operations, 10LDAP-Access-Requests: NDA for Superset Request from WMDE Employee - Mohammed Sadat - https://phabricator.wikimedia.org/T269843 (10WMDE-leszek) Thanks for elaboration @jbond. This process was indeed established with @MoritzMuehlenhoff, and we (WMDE managers) had in mind engineering staff. For le... [17:59:10] RECOVERY - Aggregate IPsec Tunnel Status codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [18:00:04] chrisalbon and accraze: (Dis)respected human, time to deploy Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201210T1800). Please do the needful. [18:11:11] 10Operations, 10Platform Engineering, 10Wikidata, 10serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mc1032.eqiad.wmnet'] ` and were **ALL** successful. [18:13:02] RECOVERY - Check systemd state on ms-be2019 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:15:27] 10Operations, 10ops-eqiad, 10Data-Services, 10cloud-services-team (Hardware): Move labstore1005 to 10Gbps rack and ethernet - https://phabricator.wikimedia.org/T266199 (10Bstorm) I don't know if the re-image is ready at this time (haven't synced up with @Andrew on that), so today would probably not have wo... 
[18:33:02] (03PS2) 10Bstorm: wikireplicas: close all connections [puppet] - 10https://gerrit.wikimedia.org/r/647419 (https://phabricator.wikimedia.org/T269620) [18:34:34] (03CR) 10jerkins-bot: [V: 04-1] wikireplicas: close all connections [puppet] - 10https://gerrit.wikimedia.org/r/647419 (https://phabricator.wikimedia.org/T269620) (owner: 10Bstorm) [18:37:22] (03PS3) 10Bstorm: wikireplicas: close all connections [puppet] - 10https://gerrit.wikimedia.org/r/647419 (https://phabricator.wikimedia.org/T269620) [18:37:36] (03CR) 10Jforrester: [C: 03+1] Remove apache config for zero.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/524088 (https://phabricator.wikimedia.org/T187716) (owner: 10MaxSem) [18:40:50] (03PS1) 10Razzi: kafka: add kafka-test1007 to kafka-test cluster [puppet] - 10https://gerrit.wikimedia.org/r/647758 (https://phabricator.wikimedia.org/T268202) [18:41:05] PROBLEM - Check systemd state on wdqs2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:41:23] (03CR) 10Bstorm: "I think this corrects the confusion and makes it better https://gerrit.wikimedia.org/r/c/operations/puppet/+/647419/1..3/modules/profile/f" [puppet] - 10https://gerrit.wikimedia.org/r/647419 (https://phabricator.wikimedia.org/T269620) (owner: 10Bstorm) [19:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for Morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201210T1900). [19:00:04] RoanKattouw: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:01:00] I'll deploy it myself [19:01:05] And add a second one too [19:01:05] hashar: what a timing.. 
here I am [19:01:18] RoanKattouw: assuming you'll deploy, can you deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/645994 as well? [19:01:25] Urbanecm: Will do [19:01:29] thank you! [19:01:32] mutante: good morning ;) [19:01:37] (03CR) 10Catrope: [C: 03+2] RCFilters: Temporarily fix TagItemWidget remove button size [core] (wmf/1.36.0-wmf.21) - 10https://gerrit.wikimedia.org/r/647305 (https://phabricator.wikimedia.org/T269477) (owner: 10Catrope) [19:02:36] (03PS1) 10Catrope: Add banner module to the homepage [extensions/GrowthExperiments] (wmf/1.36.0-wmf.21) - 10https://gerrit.wikimedia.org/r/647635 (https://phabricator.wikimedia.org/T269804) [19:02:46] (03CR) 10Catrope: [C: 03+2] Add banner module to the homepage [extensions/GrowthExperiments] (wmf/1.36.0-wmf.21) - 10https://gerrit.wikimedia.org/r/647635 (https://phabricator.wikimedia.org/T269804) (owner: 10Catrope) [19:03:41] RoanKattouw: oh, that functionality is ready already? [19:04:07] Urbanecm: tgr works fast :) [19:04:17] yup :) [19:04:27] I merged it yesterday, I just forgot to create the backport and schedule it [19:04:43] cool :) [19:04:57] (03PS6) 10Dzahn: doc: switch to scap DocumentRoot [puppet] - 10https://gerrit.wikimedia.org/r/620368 (https://phabricator.wikimedia.org/T149924) (owner: 10Hashar) [19:05:19] (03CR) 10Catrope: [C: 03+2] Add PoolCounter settings for DPL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/645994 (https://phabricator.wikimedia.org/T263220) (owner: 10Brian Wolff) [19:05:22] (03CR) 10Dzahn: [C: 03+2] doc: switch to scap DocumentRoot [puppet] - 10https://gerrit.wikimedia.org/r/620368 (https://phabricator.wikimedia.org/T149924) (owner: 10Hashar) [19:06:29] (03Merged) 10jenkins-bot: Add PoolCounter settings for DPL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/645994 (https://phabricator.wikimedia.org/T263220) (owner: 10Brian Wolff) [19:07:56] !log doc1001 - restarted apache after docroot change [19:07:57] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:20] Urbanecm: Ready for testing on mwdebug1002 [19:08:22] (03PS4) 10Hashar: doc: relocate published documents to /srv/doc [puppet] - 10https://gerrit.wikimedia.org/r/625644 (https://phabricator.wikimedia.org/T149924) [19:09:01] RoanKattouw: trying to test [19:09:17] (I'm not 100% sure it can be tested, but I can at least verify DPL doesn't breek) [19:09:19] *break [19:09:32] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker10[18-41] - https://phabricator.wikimedia.org/T260445 (10elukey) I reviewed rack settings for hadoop, this is my proposal: >>! In T260445#6682850, @Cmjohnson wrote: > @elukey This is what I have currently... [19:11:50] RoanKattouw: DynamicPageList still does its work at ruwikinews, please sync [19:13:59] Thanks, syncing [19:14:23] (03CR) 10jerkins-bot: [V: 04-1] Add banner module to the homepage [extensions/GrowthExperiments] (wmf/1.36.0-wmf.21) - 10https://gerrit.wikimedia.org/r/647635 (https://phabricator.wikimedia.org/T269804) (owner: 10Catrope) [19:15:22] !log catrope@deploy1001 Synchronized wmf-config/PoolCounterSettings.php: Add PoolCounter settings for DPL (T263220) (duration: 01m 05s) [19:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:25] T263220: Limit concurrency of DPL queries - https://phabricator.wikimedia.org/T263220 [19:17:08] (03PS1) 10Dzahn: Revert "doc: switch to scap DocumentRoot" [puppet] - 10https://gerrit.wikimedia.org/r/647636 [19:17:47] (03CR) 10Dzahn: [C: 03+2] Revert "doc: switch to scap DocumentRoot" [puppet] - 10https://gerrit.wikimedia.org/r/647636 (owner: 10Dzahn) [19:18:25] RoanKattouw: can you revert please? I'm concerned about messages like `Pool key 'nowait:dpl-query:enwikinews' (DPL): Error reading from pool counter server 10.64.0.151. 
` that just started to appear in the logs [19:18:51] (03CR) 10Hashar: [C: 03+1] Revert "doc: switch to scap DocumentRoot" [puppet] - 10https://gerrit.wikimedia.org/r/647636 (owner: 10Dzahn) [19:18:51] OK, will do [19:19:48] (03PS1) 10Catrope: Revert "Add PoolCounter settings for DPL" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647637 [19:19:54] Hi, why CI isn't still reconfigured? [19:19:58] (03PS2) 10Catrope: Revert "Add PoolCounter settings for DPL" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647637 (https://phabricator.wikimedia.org/T263220) [19:20:06] (03CR) 10Catrope: [C: 03+2] Revert "Add PoolCounter settings for DPL" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647637 (https://phabricator.wikimedia.org/T263220) (owner: 10Catrope) [19:21:00] (03Merged) 10jenkins-bot: Revert "Add PoolCounter settings for DPL" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647637 (https://phabricator.wikimedia.org/T263220) (owner: 10Catrope) [19:21:28] ACKNOWLEDGEMENT - Check systemd state on wdqs2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Ryan Kemper https://phabricator.wikimedia.org/T269872 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:21:28] ACKNOWLEDGEMENT - Check systemd state on wdqs2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
Ryan Kemper https://phabricator.wikimedia.org/T269872 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:23:16] (03PS5) 10Hashar: doc: relocate published documents to /srv/doc [puppet] - 10https://gerrit.wikimedia.org/r/625644 (https://phabricator.wikimedia.org/T149924) [19:23:18] (03PS4) 10Hashar: doc: stop backup for old doc directory [puppet] - 10https://gerrit.wikimedia.org/r/625649 (https://phabricator.wikimedia.org/T149924) [19:23:20] (03PS4) 10Hashar: doc: remove legacy doc directory [puppet] - 10https://gerrit.wikimedia.org/r/625650 (https://phabricator.wikimedia.org/T149924) [19:23:22] (03PS1) 10Hashar: doc: switch to scap DocumentRoot [puppet] - 10https://gerrit.wikimedia.org/r/647763 (https://phabricator.wikimedia.org/T149924) [19:23:49] hashar? [19:24:12] (03PS1) 10Catrope: Guard more singleton() calls with globalArticleInstance() checks [extensions/FlaggedRevs] (wmf/1.36.0-wmf.21) - 10https://gerrit.wikimedia.org/r/647638 (https://phabricator.wikimedia.org/T269608) [19:24:39] (03PS2) 10Catrope: Add banner module to the homepage [extensions/GrowthExperiments] (wmf/1.36.0-wmf.21) - 10https://gerrit.wikimedia.org/r/647635 (https://phabricator.wikimedia.org/T269804) [19:24:43] !log catrope@deploy1001 Synchronized wmf-config/PoolCounterSettings.php: Revert PoolCounter settings for DPL (T263220) (duration: 01m 03s) [19:24:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:47] T263220: Limit concurrency of DPL queries - https://phabricator.wikimedia.org/T263220 [19:25:06] (03PS1) 10Andrew Bogott: nova-compute/cinder/ceph: add a cinder-specific ceph uuid [puppet] - 10https://gerrit.wikimedia.org/r/647764 (https://phabricator.wikimedia.org/T269511) [19:25:08] (03CR) 10Hashar: "That one is broken somehow, the documentation links were giving a 404 (ex: https://doc.wikimedia.org/mediawiki-core/master/php/ )." 
[puppet] - 10https://gerrit.wikimedia.org/r/647763 (https://phabricator.wikimedia.org/T149924) (owner: 10Hashar) [19:25:11] (03CR) 10Hashar: [C: 04-1] doc: switch to scap DocumentRoot [puppet] - 10https://gerrit.wikimedia.org/r/647763 (https://phabricator.wikimedia.org/T149924) (owner: 10Hashar) [19:25:38] thanks RoanKattouw [19:26:14] (03CR) 10jerkins-bot: [V: 04-1] doc: switch to scap DocumentRoot [puppet] - 10https://gerrit.wikimedia.org/r/647763 (https://phabricator.wikimedia.org/T149924) (owner: 10Hashar) [19:27:44] (03PS2) 10Andrew Bogott: nova-compute/cinder/ceph: add a cinder-specific ceph uuid [puppet] - 10https://gerrit.wikimedia.org/r/647764 (https://phabricator.wikimedia.org/T269511) [19:29:14] (03CR) 10jerkins-bot: [V: 04-1] nova-compute/cinder/ceph: add a cinder-specific ceph uuid [puppet] - 10https://gerrit.wikimedia.org/r/647764 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott) [19:30:39] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/27076/otrs1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/647738 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [19:30:50] (03PS2) 10Dzahn: otrs: require_package -> ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/647738 (https://phabricator.wikimedia.org/T266479) [19:33:15] (03PS5) 10Jeena Huneidi: 0.1.0: Add ENABLE_DEBUG_LOGGING setting [deployment-charts] - 10https://gerrit.wikimedia.org/r/647355 (owner: 10Ahmon Dancy) [19:33:48] (03Merged) 10jenkins-bot: RCFilters: Temporarily fix TagItemWidget remove button size [core] (wmf/1.36.0-wmf.21) - 10https://gerrit.wikimedia.org/r/647305 (https://phabricator.wikimedia.org/T269477) (owner: 10Catrope) [19:35:23] (03CR) 10Jeena Huneidi: [C: 03+2] "Thanks for adding this!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/647355 (owner: 10Ahmon Dancy) [19:36:51] (03CR) 10Dzahn: "noop on otrs1001" [puppet] - 10https://gerrit.wikimedia.org/r/647738 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [19:37:05] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/27078/miscweb1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/647739 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [19:37:10] (03Merged) 10jenkins-bot: 0.1.0: Add ENABLE_DEBUG_LOGGING setting [deployment-charts] - 10https://gerrit.wikimedia.org/r/647355 (owner: 10Ahmon Dancy) [19:37:14] (03PS2) 10Dzahn: wikimania_scholarships: require_package -> ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/647739 (https://phabricator.wikimedia.org/T266479) [19:38:36] (03PS2) 10Dzahn: poolcounter: require_package -> ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/647737 (https://phabricator.wikimedia.org/T266479) [19:39:15] (03CR) 10Dzahn: "noop on miscweb1002" [puppet] - 10https://gerrit.wikimedia.org/r/647739 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [19:40:31] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1001/27079/orespoolcounter2003.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/647737 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [19:40:58] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/27080/poolcounter1005.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/647737 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [19:41:54] !log catrope@deploy1001 Synchronized php-1.36.0-wmf.21/resources/src/mediawiki.rcfilters/styles/mw.rcfilters.ui.FilterTagMultiselectWidget.less: Work around OOUI bug breaking RCFilters UI (T269477) (duration: 01m 04s) [19:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:58] T269477: 
[wmf.21-regression] RC/Watchlist -misaligned close icon in oo-ui-tagMultiselectWidget-group - https://phabricator.wikimedia.org/T269477 [19:42:37] (03CR) 10Dzahn: "noop on poolcounter1005" [puppet] - 10https://gerrit.wikimedia.org/r/647737 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [19:42:52] (03CR) 10Catrope: [C: 03+2] Guard more singleton() calls with globalArticleInstance() checks [extensions/FlaggedRevs] (wmf/1.36.0-wmf.21) - 10https://gerrit.wikimedia.org/r/647638 (https://phabricator.wikimedia.org/T269608) (owner: 10Catrope) [19:44:56] (03PS2) 10Dave Pifke: webperf: require_package -> ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/647031 (https://phabricator.wikimedia.org/T266479) [19:47:11] (03CR) 10Dave Pifke: "Thanks for the re: trick!" [puppet] - 10https://gerrit.wikimedia.org/r/647031 (https://phabricator.wikimedia.org/T266479) (owner: 10Dave Pifke) [19:48:47] (03Merged) 10jenkins-bot: Guard more singleton() calls with globalArticleInstance() checks [extensions/FlaggedRevs] (wmf/1.36.0-wmf.21) - 10https://gerrit.wikimedia.org/r/647638 (https://phabricator.wikimedia.org/T269608) (owner: 10Catrope) [19:49:52] (03CR) 10Dzahn: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1001/27082/ want it merged now?" [puppet] - 10https://gerrit.wikimedia.org/r/647031 (https://phabricator.wikimedia.org/T266479) (owner: 10Dave Pifke) [19:51:14] (03PS3) 10Andrew Bogott: nova-compute/cinder/ceph: add a cinder-specific ceph uuid [puppet] - 10https://gerrit.wikimedia.org/r/647764 (https://phabricator.wikimedia.org/T269511) [19:53:01] RECOVERY - Check systemd state on wdqs2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:53:26] dpifke: wanna get that done with right now? [19:53:41] pretty confident it will be noop [19:54:00] I'm confused by "Resources only in the old catalog" in the PCC output. 
[19:54:11] (03Merged) 10jenkins-bot: Add banner module to the homepage [extensions/GrowthExperiments] (wmf/1.36.0-wmf.21) - 10https://gerrit.wikimedia.org/r/647635 (https://phabricator.wikimedia.org/T269804) (owner: 10Catrope) [19:54:32] That seems to imply to me that the packages would no longer be installed? [19:54:52] dpifke: it's because require_package creates a resource for each package that everything else is dependent on [19:55:01] it doesn't mean it will remove the packages [19:55:15] I know it looks weird but I just did the same thing for like 3 other places [19:55:23] Right, but will they be added if we ever try to deploy on a new host? [19:56:16] No objection to merging if you're confident it's correct. Mostly trying to understand for my own edification. :) [19:57:02] dpifke: yes, if you go to "change catalog" https://puppet-compiler.wmflabs.org/compiler1001/27082/webperf1001.eqiad.wmnet/change.webperf1001.eqiad.wmnet.pson [19:57:12] you can see there is still: [19:57:13] "type": "Package", [19:57:13] "title": "python3-tz", [19:57:15] for example [19:57:51] PROBLEM - Check systemd state on wdqs2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:57:52] so you could find each of the packages in the full catalog after the change if you wanted to [19:58:21] Ah-ha. Makes sense. [19:59:58] dpifke: the actual explanation is that require_package does this: [19:59:59] # Create class scope [19:59:59] 36 cls = Puppet::Parser::Resource.new( [19:59:59] 37 'class', class_name, :scope => compiler.topscope) [20:00:04] twentyafterfour and marxarelli: Dear deployers, time to do the Mediawiki train - American Version deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201210T2000). 
[20:00:05] I'm still doing the last syncs for the backport window, they're delayed due to a CI issue [20:00:08] ( twentyafterfour ) [20:00:08] so each package is a separate class [20:00:22] but with ensure_package it's a resource.. but not its own class for each package [20:00:30] RoanKattouw: ok [20:02:38] ]' [20:02:42] (03CR) 10Dzahn: [C: 03+2] webperf: require_package -> ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/647031 (https://phabricator.wikimedia.org/T266479) (owner: 10Dave Pifke) [20:03:31] twentyafterfour: is there some cache for phab subproject membership? After leaving a project I still remain as "member" of its subprojects [20:04:02] hauskatze: there shouldn't be [20:04:22] Thanks! If you're in a +2 mood, this is ready (and should be safe to merge whenever) as well: https://gerrit.wikimedia.org/r/c/operations/puppet/+/636759 :) [20:04:30] hauskatze: there may be a bug though [20:04:30] e.g. I removed myself from #wiki-setup but I still appear as member of its subprojects [20:04:32] smh [20:04:44] dpifke: hope that makes sense ^ and I merged and confirmed on both webperf1001 and webperf1002 nothing changed [20:04:46] hauskatze: I'll test a bit [20:04:47] twentyafterfour: random features :) [20:04:51] hauskatze: watcher vs member maybe? [20:04:58] none of them [20:05:11] Makes lots of sense, appreciate the explanation. 
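A minimal sketch of the catalog difference discussed above, assuming the model described in the conversation: require_package wraps each package in its own class, while ensure_packages declares only the Package resource. The function names and the `packages::<name>` class naming are hypothetical and used only for illustration. This is not the puppet-compiler's actual diff logic.

```python
def catalog_with_require_package(packages):
    """Model: one wrapper class per package, plus the Package resource itself."""
    resources = set()
    for name in packages:
        # Hypothetical class name; the real naming scheme is an assumption here.
        resources.add(("Class", "packages::" + name.replace("-", "_")))
        resources.add(("Package", name))
    return resources


def catalog_with_ensure_packages(packages):
    """Model: only the Package resources, with no per-package class."""
    return {("Package", name) for name in packages}


def only_in_old(old, new):
    """Resources present in the old catalog but absent from the new one."""
    return sorted(old - new)


old = catalog_with_require_package(["python3-tz"])
new = catalog_with_ensure_packages(["python3-tz"])
removed = only_in_old(old, new)

# Only the wrapper class disappears; the Package resource is in both catalogs.
print(removed)
print(("Package", "python3-tz") in new)
```

Under this model, "Resources only in the old catalog" lists the wrapper classes, while the packages themselves remain in the new catalog, which matches the noop result reported on the hosts.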
[20:05:43] (03CR) 10Dzahn: "confirmed noop on webperf1001 and webperf1002" [puppet] - 10https://gerrit.wikimedia.org/r/647031 (https://phabricator.wikimedia.org/T266479) (owner: 10Dave Pifke) [20:06:10] (03PS1) 10Ryan Kemper: categories: fix prom exporter's broken namespace [puppet] - 10https://gerrit.wikimedia.org/r/647774 (https://phabricator.wikimedia.org/T269872) [20:06:24] !log catrope@deploy1001 Synchronized php-1.36.0-wmf.21/extensions/FlaggedRevs/: Guard more singleton() calls with globalArticleInstance() checks (T269608, to unbreak CI in wmf.21) (duration: 01m 04s) [20:06:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:29] T269608: Several failing tests in Wikibase CI (CentralAuthApiSessionProviderTest, CentralAuthHeaderSessionProviderTest, EditEntityActionTest, ViewEntityActionTest, HtmlPageLinkRendererEndHookHandlerTest) - https://phabricator.wikimedia.org/T269608 [20:06:57] One more, and then I'll be done [20:07:02] dpifke: yea, i'll merge that as well. 
but let's see if that actually removes those modules, I think not without some manual action [20:07:52] !log catrope@deploy1001 Synchronized php-1.36.0-wmf.21/extensions/GrowthExperiments/: Add banner module to the homepage (T269804) (duration: 01m 03s) [20:07:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:57] T269804: Banner module on the Growth homepage - https://phabricator.wikimedia.org/T269804 [20:09:30] twentyafterfour: I'm done, it's all yours [20:10:12] (03CR) 10Hashar: gerrit: use proper hostname on replica hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/643919 (owner: 10Hashar) [20:10:37] (03PS4) 10Hashar: gerrit: use proper hostname on replica hosts [puppet] - 10https://gerrit.wikimedia.org/r/643919 [20:10:57] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/643919 (owner: 10Hashar) [20:13:44] mutante: I've got to step away for a few minutes to walk the dog and drop off a package, but I can do whatever cleanup is needed (e.g. running a2dismod) after that. [20:15:27] thanks RoanKattouw [20:15:33] (03CR) 10Dzahn: [C: 03+2] arclamp: add CORS header and clean up modules [puppet] - 10https://gerrit.wikimedia.org/r/636759 (owner: 10Dave Pifke) [20:15:44] dpifke: sounds good, yes, go ahead! I am making sure nothing breaks and leave the cleanup to you. [20:16:03] Thanks! 
[20:16:07] np [20:19:14] (03CR) 10Hashar: "https://puppet-compiler.wmflabs.org/compiler1003/654/" [puppet] - 10https://gerrit.wikimedia.org/r/643919 (owner: 10Hashar) [20:25:41] !log hashar@deploy1001 Started deploy [integration/docroot@fdf0917]: (no justification provided) [20:25:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:48] !log hashar@deploy1001 Finished deploy [integration/docroot@fdf0917]: (no justification provided) (duration: 00m 06s) [20:25:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:56] 10Operations, 10LDAP-Access-Requests: NDA for Superset Request from WMDE Employee - Mohammed Sadat - https://phabricator.wikimedia.org/T269843 (10KFrancis) >>! In T269843#6682457, @jbond wrote: > @KFrancis are you able to help with processin the NDA for Mohammed > > We will also need [[ https://wikitech.wikim... [20:27:29] (03CR) 10Dzahn: "new config snippet was added on webperf2002 and service got refreshed but for example the php7.0 apache module is still enabled. that wil" [puppet] - 10https://gerrit.wikimedia.org/r/636759 (owner: 10Dave Pifke) [20:31:58] (03PS1) 10Mforns: Migrate Growth schemas from EventLogging to EventGate on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647782 (https://phabricator.wikimedia.org/T267333) [20:32:47] twentyafterfour: o/ are you holding due to https://phabricator.wikimedia.org/T269477 ? [20:34:11] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to releasers-wikibase for toan - https://phabricator.wikimedia.org/T269777 (10KFrancis) >>! In T269777#6682253, @jbond wrote: > @KFrancis Are you able to confirm NDA status for Tobias, thanks @jbond I was not able to find an NDA on r... [20:36:52] 10Operations, 10Domains, 10Traffic: URL to redirect to upcoming Wikipedia Birthday page on wikimediafoundation.org - https://phabricator.wikimedia.org/T264367 (10hdothiduc) Awesome, thank you very much! 
The redirect (from wikimediafoundation.org/wikipedia20 to wikimediafoundation.org) happens too fast that ba... [20:39:05] marxarelli: yes [20:39:56] I thought the fix was deployed but Volker_E says it's still a blocker [20:40:17] twentyafterfour: marxarelli: I'm on it [20:40:58] the fix was only catching the most popular instance of the widget, not the several others [20:41:39] this was the misconception. Lukas Werkmeister captured the other instances last night, when I was already off after delivering the quick-fix [20:41:57] and I haven't had time for anything more until now [20:42:08] (03PS4) 10Andrew Bogott: nova-compute/cinder/ceph: add a cinder-specific ceph uuid [puppet] - 10https://gerrit.wikimedia.org/r/647764 (https://phabricator.wikimedia.org/T269511) [20:44:55] (03CR) 10Andrew Bogott: [C: 03+2] nova-compute/cinder/ceph: add a cinder-specific ceph uuid [puppet] - 10https://gerrit.wikimedia.org/r/647764 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott) [20:46:26] twentyafterfour: marxarelli i have a config change for testwiki i'd like to deploy; is the train clear? [20:46:34] thanks Volker_E, just let me know when a patch is ready. 
[20:46:40] ottomata: train is on hold so go ahead [20:46:48] k danke [20:47:09] (03CR) 10Ottomata: [C: 03+2] Migrate Growth schemas from EventLogging to EventGate on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647782 (https://phabricator.wikimedia.org/T267333) (owner: 10Mforns) [20:48:01] (03Merged) 10jenkins-bot: Migrate Growth schemas from EventLogging to EventGate on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647782 (https://phabricator.wikimedia.org/T267333) (owner: 10Mforns) [20:51:00] (03PS1) 10Ottomata: wgEventLoggingSchemas - remove SpecialMuteSubmit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647785 (https://phabricator.wikimedia.org/T268517) [20:52:18] (03PS1) 10Andrew Bogott: rbd_libvirt: fix installation of the cinder ceph secret [puppet] - 10https://gerrit.wikimedia.org/r/647786 (https://phabricator.wikimedia.org/T269511) [20:52:26] (03CR) 10Ottomata: [C: 03+2] wgEventLoggingSchemas - remove SpecialMuteSubmit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647785 (https://phabricator.wikimedia.org/T268517) (owner: 10Ottomata) [20:53:43] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01063 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [20:54:02] (03CR) 10Andrew Bogott: [C: 03+2] rbd_libvirt: fix installation of the cinder ceph secret [puppet] - 10https://gerrit.wikimedia.org/r/647786 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott) [20:54:35] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Migrate Growth EventLogging schemas to Event Platform on testwiki - T267333 (duration: 01m 03s) [20:54:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:39] T267333: Migrate Growth EventLogging schemas to Event Platform - https://phabricator.wikimedia.org/T267333 [20:56:54] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) 
rack/setup/install aqs101[0-5] - https://phabricator.wikimedia.org/T267414 (10wiki_willy) a:03Cmjohnson Hardware arrived Dec 3 [21:08:37] twentyafterfour: [21:08:51] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/647790 needs to be merged first [21:08:59] On it [21:09:53] I'm going to take the liberty of cherry-picking that immediately, without waiting for it to merge [21:10:07] (03PS1) 10Catrope: OOUI: Backport I18799e54ef46232a54d36e86e2b3d08c3ee0a3d5 [core] (wmf/1.36.0-wmf.21) - 10https://gerrit.wikimedia.org/r/647641 (https://phabricator.wikimedia.org/T269477) [21:10:15] (03CR) 10Catrope: [C: 03+2] OOUI: Backport I18799e54ef46232a54d36e86e2b3d08c3ee0a3d5 [core] (wmf/1.36.0-wmf.21) - 10https://gerrit.wikimedia.org/r/647641 (https://phabricator.wikimedia.org/T269477) (owner: 10Catrope) [21:10:54] That should speed up unblocking the train, because gate-and-submit for wmf.21 patches in core took ~25 minutes when I did my backport earlier today [21:19:20] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install aqs101[0-5] - https://phabricator.wikimedia.org/T267414 (10Cmjohnson) @Jclark-ctr where are these? [21:25:59] RoanKattouw: yeah unfortunately our test suite has gotten pretty slow. 
[21:26:22] I mean it's always been kinda slow as long as I can remember but seems to be trending in the direction of slow [21:27:21] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 313 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:28:08] whoa that's quite a spike in fatals [21:28:22] fixing https://phabricator.wikimedia.org/T155147 might help with the slow tests [21:28:59] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 3 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:33:44] seems like that one spikes every now and then. Looks like there's a patch for it on the task: https://phabricator.wikimedia.org/T249745 [21:36:13] (03PS1) 10Andrew Bogott: Cinder: set default quotas to be very low [puppet] - 10https://gerrit.wikimedia.org/r/647795 (https://phabricator.wikimedia.org/T269511) [21:36:53] twentyafterfour: fwiw I filed T269893 for docs :) [21:36:53] T269893: Phabricator keeps displaying my account as a "shadow" member of milestones after leaving parent project - https://phabricator.wikimedia.org/T269893 [21:37:09] title is crap, sorry; can't find the right words right now [21:37:27] (03CR) 10Andrew Bogott: [C: 03+2] Cinder: set default quotas to be very low [puppet] - 10https://gerrit.wikimedia.org/r/647795 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott) [21:38:18] hauskatze: I think this is a bug, I'll look into pushing it upstream or if it's easy I may patch it locally [21:39:09] twentyafterfour: as you think it's best. 
I'm really sorry to put yet-another-task in your backlog [21:39:11] :( [21:40:23] hauskatze: no problem, I think it's a legit bug in upstream phabricator [21:42:03] (03Merged) 10jenkins-bot: OOUI: Backport I18799e54ef46232a54d36e86e2b3d08c3ee0a3d5 [core] (wmf/1.36.0-wmf.21) - 10https://gerrit.wikimedia.org/r/647641 (https://phabricator.wikimedia.org/T269477) (owner: 10Catrope) [21:44:54] (03CR) 10VolkerE: [C: 03+1] OOUI: Backport I18799e54ef46232a54d36e86e2b3d08c3ee0a3d5 [core] (wmf/1.36.0-wmf.21) - 10https://gerrit.wikimedia.org/r/647641 (https://phabricator.wikimedia.org/T269477) (owner: 10Catrope) [21:46:04] Pchelolo is it possible to see the parser cache key that is being shown clientside? [21:46:36] yes DannyS712. every page HTML has HTML comment