[00:32:10] RECOVERY - mediawiki originals uploads -hourly- for eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[00:32:24] RECOVERY - mediawiki originals uploads -hourly- for codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[02:00:24] RECOVERY - Check systemd state on mwlog2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:01:38] RECOVERY - Check systemd state on mwlog1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:01:46] PROBLEM - Host cp1087 is DOWN: PING CRITICAL - Packet loss = 100%
[02:07:49] (PS1) TrainBranchBot: Branch commit for wmf/1.36.0-wmf.37 [core] (wmf/1.36.0-wmf.37) - https://gerrit.wikimedia.org/r/675625
[02:18:37] (PS2) DannyS712: Branch commit for wmf/1.36.0-wmf.37 [core] (wmf/1.36.0-wmf.37) - https://gerrit.wikimedia.org/r/675625 (https://phabricator.wikimedia.org/T278343) (owner: TrainBranchBot)
[02:20:04] PROBLEM - Check systemd state on mwlog1002 is CRITICAL: CRITICAL - degraded: The following units failed: mw-log-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:33:45] ops-eqiad: Eqiad: Ports with no description on cloudsw1-d5-eqiad - https://phabricator.wikimedia.org/T278726 (Papaul)
[03:12:45] (PS6) DharmrajRathod98: Improved: timestamp validation in cli/recover-dump [software/wmfbackups] - https://gerrit.wikimedia.org/r/673693 (https://phabricator.wikimedia.org/T277754)
[03:18:21] (CR) DharmrajRathod98: "let me know if any further changes required." [software/wmfbackups] - https://gerrit.wikimedia.org/r/673693 (https://phabricator.wikimedia.org/T277754) (owner: DharmrajRathod98)
[03:27:26] (PS7) DharmrajRathod98: Improved: timestamp validation in cli/recover-dump [software/wmfbackups] - https://gerrit.wikimedia.org/r/673693 (https://phabricator.wikimedia.org/T277754)
[03:42:28] PROBLEM - Long running screen/tmux on phab1001 is CRITICAL: CRIT: Long running SCREEN process. (user: root PID: 11318, 1736640s 1728000s). https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens
[04:17:20] PROBLEM - PHP7 jobrunner on mw1304 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[04:19:30] RECOVERY - PHP7 jobrunner on mw1304 is OK: HTTP OK: HTTP/1.1 200 OK - 331 bytes in 1.818 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[04:29:40] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 237, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:38:56] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:01:18] PROBLEM - PHP7 rendering on mw1304 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[05:03:36] RECOVERY - PHP7 rendering on mw1304 is OK: HTTP OK: HTTP/1.1 200 OK - 330 bytes in 2.777 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[05:50:26] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:04:26] !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp1087.eqiad.wmnet
[06:04:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:06:16] !log powercycle cp1087 (no ssh, no mgmt console tty)
[06:06:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:10:38] RECOVERY - Host cp1087 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms
[06:12:33] SRE, serviceops: Renew certs for mcrouter on all mw appservers - https://phabricator.wikimedia.org/T276029 (elukey) Halfway ping just to remember that a month is left before the certs expire :)
[06:13:56] (PS1) Majavah: changeprop: Update beta servers [deployment-charts] - https://gerrit.wikimedia.org/r/675657
[06:16:00] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:19:23] SRE, SRE-tools, IPv6, User-jbond: Some Foundation clusters do not appear to support IPv6 - https://phabricator.wikimedia.org/T271136 (elukey) We have a special setting in commons.yaml, `kafka_brokers_main`, that it is used IIRC to instruct zookeeper about what connections to accept, and I see tha...
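The "Check systemd state" alerts above fire whenever any unit on the host is in a failed state. A minimal triage sketch on the affected host, using standard systemd tooling and the unit name from the 02:20 alert as the example:

    # Confirm which units the alert is complaining about
    systemctl --failed
    # Inspect the failing unit and its recent log output
    systemctl status mw-log-cleanup.service
    journalctl -u mw-log-cleanup.service -n 50 --no-pager
    # After fixing the underlying problem, clear the failed state so the check recovers
    sudo systemctl reset-failed mw-log-cleanup.service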
[06:21:28] PROBLEM - PHP7 rendering on mw1304 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[06:23:36] RECOVERY - PHP7 rendering on mw1304 is OK: HTTP OK: HTTP/1.1 200 OK - 331 bytes in 1.302 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[06:25:46] PROBLEM - Disk space on backup2002 is CRITICAL: DISK CRITICAL - free space: /srv 2988468 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=backup2002&var-datasource=codfw+prometheus/ops
[06:34:52] (PS4) ArielGlenn: Only abort a fragment in a batch so many times before we fail it [dumps] - https://gerrit.wikimedia.org/r/675543 (https://phabricator.wikimedia.org/T252396)
[06:43:50] PROBLEM - MariaDB Replica Lag: m1 on db2078 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1288.64 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:53:02] RECOVERY - MariaDB Replica Lag: m1 on db2078 is OK: OK slave_sql_lag Replication lag: 0.47 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[07:02:48] good morning
[07:03:07] so yesterday was a no deploy day and thus I refrained from pushing wmf.36 further than group 1
[07:03:20] now it's time to push to all wikis ;D
[07:04:22] (CR) Hashar: [C: +2] Branch commit for wmf/1.36.0-wmf.37 [core] (wmf/1.36.0-wmf.37) - https://gerrit.wikimedia.org/r/675625 (https://phabricator.wikimedia.org/T278343) (owner: TrainBranchBot)
[07:04:36] ^ that one is for wmf.37
[07:06:19] (PS1) Hashar: all wikis to 1.36.0-wmf.36 [mediawiki-config] - https://gerrit.wikimedia.org/r/675709
[07:06:21] (CR) Hashar: [C: +2] all wikis to 1.36.0-wmf.36 [mediawiki-config] - https://gerrit.wikimedia.org/r/675709 (owner: Hashar)
[07:07:04] (Merged) jenkins-bot: all wikis to 1.36.0-wmf.36 [mediawiki-config] - https://gerrit.wikimedia.org/r/675709 (owner: Hashar)
[07:07:11] grblblbm
[07:11:58] SRE, Analytics-Radar, Patch-For-Review, Services (watching), User-herron: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (elukey) @herron ping :) Should we work on this in Q4? I can allocate some time to help, at...
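The "all wikis to 1.36.0-wmf.36" change above is a wikiversions edit in mediawiki-config; once merged it still has to be synced from the deployment host, which is what hashar's later "rebuilt and synchronized wikiversions files" log entry corresponds to. A hedged sketch of that step, using scap as deployed in this era (exact flags may differ):

    # On the deployment host, with the merged mediawiki-config change
    # already pulled into /srv/mediawiki-staging:
    scap sync-wikiversions 'all wikis to 1.36.0-wmf.36'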
[07:18:11] (CR) Kosta Harlan: [C: +1] Run GrowthExperiments listTaskCounts.php script every hour (1 comment) [puppet] - https://gerrit.wikimedia.org/r/675544 (https://phabricator.wikimedia.org/T278411) (owner: Gergő Tisza)
[07:19:09] SRE, ops-eqiad, cloud-services-team (Kanban): Eqiad: Ports with no description on cloudsw1-d5-eqiad - https://phabricator.wikimedia.org/T278726 (jijiki) p: Triage→Medium
[07:19:50] SRE, Traffic: cp1087 powercycled - https://phabricator.wikimedia.org/T278729 (jijiki) p: Triage→Medium
[07:20:04] SRE, Wikimedia-Mailing-lists: Use xapian search backend for mailman3 - https://phabricator.wikimedia.org/T278717 (jijiki) p: Triage→Medium
[07:20:48] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[07:20:52] (CR) Filippo Giunchedi: alertmanager: open tasks for librenms alerts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/675129 (https://phabricator.wikimedia.org/T225140) (owner: Filippo Giunchedi)
[07:21:12] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[07:21:23] SRE, Wikimedia-Mailing-lists, cloud-services-team (Kanban): auto-subscribe cloud-vps and/or toolforge users to cloud-announce - https://phabricator.wikimedia.org/T278361 (jijiki) p: Triage→Medium
[07:24:21] elukey: I have seen a few spikes of memcached timeout errors, should I file them as tasks? ;)
[07:24:59] currently mw1304 , "127.0.0.1:11213": A TIMEOUT OCCURRED
[07:25:16] (PS5) Filippo Giunchedi: alertmanager: open tasks for librenms alerts [puppet] - https://gerrit.wikimedia.org/r/675129 (https://phabricator.wikimedia.org/T225140)
[07:25:38] (Merged) jenkins-bot: Branch commit for wmf/1.36.0-wmf.37 [core] (wmf/1.36.0-wmf.37) - https://gerrit.wikimedia.org/r/675625 (https://phabricator.wikimedia.org/T278343) (owner: TrainBranchBot)
[07:26:33] hashar: hi! So in theory if it is a one-off it is probably not worth it, it's very difficult to figure out what happened after the fact.. if you see a sustained rate of timeouts, definitely
[07:27:12] (CR) Filippo Giunchedi: alertmanager: open tasks for librenms alerts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/675129 (https://phabricator.wikimedia.org/T225140) (owner: Filippo Giunchedi)
[07:27:15] that is still going on
[07:27:30] and mw1304 doesn't show up in the grafana host overview dashboard, but I guess that is a different issue ;)
[07:28:44] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.36.0-wmf.36 - T274940
[07:28:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:52] T274940: 1.36.0-wmf.36 deployment blockers - https://phabricator.wikimedia.org/T274940
[07:29:01] o/
[07:30:38] hashar: do you have a link to a graph etc.. ?
[07:31:09] mmm cannot ssh to mw1304, probably the host is in a weird state
[07:31:29] SRE: mw1304: Memcached error for key X on server 127.0.0.1:11213: A TIMEOUT OCCURRED - https://phabricator.wikimedia.org/T278734 (hashar)
[07:31:31] elukey: filed as https://phabricator.wikimedia.org/T278734
[07:31:43] also while deploying I had at least one server being super slow, but I don't know which one ;)
[07:32:11] SRE: mw1304: Memcached error for key X on server 127.0.0.1:11213: A TIMEOUT OCCURRED - https://phabricator.wikimedia.org/T278734 (hashar) It also does not show up in https://grafana.wikimedia.org/d/000000377/host-overview so maybe the host is broken somehow.
[07:32:21] hashar: something happened some hours ago, see https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=mw1304&var-datasource=eqiad%20prometheus%2Fops&var-cluster=jobrunner&from=now-24h&to=now
[07:32:23] Amir1: good morning :)
[07:32:32] this is why with the -3h you don't see metrics
[07:32:51] ah that explains it
[07:33:05] anyway I am going to look at the various metrics after I have pushed wmf.36 to all wikis ;)
[07:33:09] Thanks ^^ Waiting for errors not to happen
[07:34:00] /w/api.php InvalidArgumentException: The given PageIdentity does not represent a proper page
[07:34:02] grbmblbl
[07:34:21] haven't caught it, but that one has been heavily spamming since 22:20 UTC yesterday
[07:35:00] so the host is completely borked by ffmpeg
[07:35:06] (mw1304)
[07:37:21] !log restart-php7.2-fpm on mw1304, jobrunner completely overwhelmed by ffmpeg/transcode jobs (not publishing metrics, erroring out for memcached timeouts) - T278734
[07:37:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:29] T278734: mw1304: Memcached error for key X on server 127.0.0.1:11213: A TIMEOUT OCCURRED - https://phabricator.wikimedia.org/T278734
[07:38:58] and I have another unrelated one https://phabricator.wikimedia.org/T278735
[07:39:12] some unwatch action spamming errors since 22:20 :\
[07:39:44] hashar: mw1304 should be ok-ish now
[07:40:10] not sure why it would be full of ffmpeg jobs though
[07:41:43] hashar: it is a jobrunner no?
[07:41:48] SRE: mw1304: Memcached error for key X on server 127.0.0.1:11213: A TIMEOUT OCCURRED - https://phabricator.wikimedia.org/T278734 (hashar) The timeout errors have vanished. No idea why the job runner would over run video transcoding on a given host though.
[07:41:53] and yeah the timeouts have disappeared :]
[07:47:21] SRE, Analytics-Radar, Patch-For-Review, Services (watching), User-herron: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (elukey) Also FYI in T271136 Cas is going to add the IPv6 AAAA records for the codfw cluste...
[07:51:21] Amir1: everything looks fine to me ;)
[07:53:00] \o/
[07:53:56] SRE, Wikimedia-Mailing-lists: Mailman sends bounce notification messages to list-admins with a subject line in Chinese language - https://phabricator.wikimedia.org/T278574 (Aklapper) @legoktm: Time is better spent on mailman3. This is lowest priority; no functionality problems, just the subject line bein...
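A sketch of the kind of triage elukey describes on a jobrunner pinned by transcodes, using standard tools (restart-php7.2-fpm in the log is a WMF safe-restart wrapper; the bare systemctl restart below is only an approximation of it):

    # An overwhelmed jobrunner typically shows ffmpeg filling the top of the list
    ps -eo pid,etime,pcpu,comm --sort=-pcpu | head -20
    # How many transcode processes are running?
    pgrep -c ffmpeg
    # Flush the stuck php-fpm workers so the host responds again
    sudo systemctl restart php7.2-fpm.service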
[08:05:36] !log refreshing wdqs entities (T278693)
[08:05:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:45] T278693: Manually purge obsolete/outdated entites from WDQS (2021-03) - https://phabricator.wikimedia.org/T278693
[08:10:52] so yeah hmm looks all set Amir1 ;)
[08:11:03] and apparently nothing exploded on the javascript client side
[08:11:12] so I am going to do some laundry, it is sunny there
[08:11:17] be back in a few
[08:11:32] Thanks for checking
[08:26:40] SRE, Patch-For-Review, cloud-services-team (Kanban): [ceph] Test and upgrade to kernel ~15 - https://phabricator.wikimedia.org/T274565 (dcaro)
[08:28:00] SRE, Patch-For-Review, cloud-services-team (Kanban): [ceph] Test and upgrade to kernel ~15 - https://phabricator.wikimedia.org/T274565 (dcaro) Given that there's no noticeable improvement, will stick with the current kernel as the rest of the fleet. Might revisit once we have metrics for osd resource...
[08:30:26] PROBLEM - Long running screen/tmux on puppetmaster1001 is CRITICAL: CRIT: Long running tmux process. (user: ryankemper PID: 2120, 1740948s 1728000s). https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens
[08:33:26] RECOVERY - Disk space on backup2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=backup2002&var-datasource=codfw+prometheus/ops
[08:36:22] !log mariadb upgrade of all buster source backup hosts to 10.4.18 T250666
[08:36:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:32] T250666: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666
[08:39:52] (PS1) David Caro: Revert "wmcs.ceph.codfw: Upgrade to latest 5.X kernel" [puppet] - https://gerrit.wikimedia.org/r/675722
[08:40:00] PROBLEM - Ensure local MW versions match expected deployment on parse2001 is CRITICAL: CRITICAL: 318 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[08:40:20] (CR) jerkins-bot: [V: -1] Revert "wmcs.ceph.codfw: Upgrade to latest 5.X kernel" [puppet] - https://gerrit.wikimedia.org/r/675722 (owner: David Caro)
[08:41:03] (PS2) David Caro: Revert "wmcs.ceph.codfw: Upgrade to latest 5.X kernel" [puppet] - https://gerrit.wikimedia.org/r/675722 (https://phabricator.wikimedia.org/T274565)
[08:57:42] SRE, serviceops: Renew certs for mcrouter on all mw appservers - https://phabricator.wikimedia.org/T276029 (jijiki) yeap, thank you!
[09:01:05] (CR) Volans: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/675353 (owner: Legoktm)
[09:03:01] is parse2001 supposed to be in scap?
[09:03:59] (CR) Volans: Add network report (1 comment) [software/netbox-extras] - https://gerrit.wikimedia.org/r/674977 (https://phabricator.wikimedia.org/T222931) (owner: Ayounsi)
[09:04:55] !log hnowlan@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
[09:04:56] !log hnowlan@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
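The long-running screen/tmux check above trips once a session's elapsed time passes the threshold shown in the alert (1728000 s, i.e. 20 days). A hedged one-liner for finding such sessions by hand; the process names matched are an assumption about how detached screen/tmux sessions appear in ps:

    # Print screen/tmux processes older than 20 days (etimes = elapsed seconds)
    ps -eo pid,user,etimes,comm | awk '($4 ~ /SCREEN|tmux/) && $3 > 1728000'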
[09:05:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:26] Urbanecm: I have no idea :\
[09:17:37] wmf.36 on all wikis looks fine at least
[09:17:52] so I am going out for lunch with kids etc, will be back in a couple hours
[09:18:00] hashar: it's definitely not your fault
[09:18:21] (PS1) Jbond: wmflib: drop array_concat function [puppet] - https://gerrit.wikimedia.org/r/675750 (https://phabricator.wikimedia.org/T273743)
[09:18:50] it's a question for an SRE
[09:19:01] (PS1) Ladsgroup: Disable legacy javascript in group1 [mediawiki-config] - https://gerrit.wikimedia.org/r/675751 (https://phabricator.wikimedia.org/T72470)
[09:19:08] (CR) Jbond: [C: +2] wmflib: drop array_concat function [puppet] - https://gerrit.wikimedia.org/r/675750 (https://phabricator.wikimedia.org/T273743) (owner: Jbond)
[09:19:08] as it did not get your sync (so it is definitely not in scap)
[09:20:20] ok ;)
[09:20:29] anyway I have to go for lunch, be back in a couple hours
[09:21:54] Puppet, SRE, Patch-For-Review, User-jbond: Identify and upstream useful functions from wmflib - https://phabricator.wikimedia.org/T273743 (jbond)
[09:21:57] ttyl has
[09:22:01] too late
[09:22:48] Urbanecm: effie reimaged that server yesterday, ask them
[09:23:14] effie: is parse2001 supposed to be in scap? it did not receive the last deployment, see the icinga alert a few rows above
[09:23:56] Puppet, SRE, Patch-For-Review, User-jbond: Identify and upstream useful functions from wmflib - https://phabricator.wikimedia.org/T273743 (jbond)
[09:26:58] PROBLEM - Check nf_conntrack usage in neutron netns on cloudnet1003 is CRITICAL: CRITICAL: nf_conntrack usage over 90% in netns qrouter-d93771ba-2711-4f88-804a-8df6fd03978a https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:28:40] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:29:33] (CR) Ayounsi: Add network report (1 comment) [software/netbox-extras] - https://gerrit.wikimedia.org/r/674977 (https://phabricator.wikimedia.org/T222931) (owner: Ayounsi)
[09:32:28] (PS1) Kosta Harlan: [WIP] linkrecommendation: Use rest.php endpoint [deployment-charts] - https://gerrit.wikimedia.org/r/675755
[09:35:38] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The following units failed: wmde-toolkit-analyzer-build.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:35:46] !log hnowlan@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
[09:35:46] !log hnowlan@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
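The nf_conntrack alert above measures the connection-tracking table inside the neutron router's network namespace on the cloudnet host, not in the root namespace. A sketch of checking it manually, reusing the netns name from the alert:

    # List namespaces; the qrouter-* one is the neutron virtual router
    sudo ip netns list
    # Compare current conntrack entries against the maximum inside that namespace
    sudo ip netns exec qrouter-d93771ba-2711-4f88-804a-8df6fd03978a \
        sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max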
[09:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:18] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:39:50] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 52, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:41:57] !log hnowlan@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
[09:41:57] !log hnowlan@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
[09:42:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:42:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:43:28] (PS1) Filippo Giunchedi: rancid: parametrize MAILDOMAIN [puppet] - https://gerrit.wikimedia.org/r/675756
[09:45:10] (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28821/console" [puppet] - https://gerrit.wikimedia.org/r/675756 (owner: Filippo Giunchedi)
[09:46:09] Puppet, SRE, Patch-For-Review, User-jbond: Identify and upstream useful functions from wmflib - https://phabricator.wikimedia.org/T273743 (jbond)
[09:46:36] Puppet, SRE, Patch-For-Review, User-jbond: Identify and upstream useful functions from wmflib - https://phabricator.wikimedia.org/T273743 (jbond)
[09:47:21] (CR) Filippo Giunchedi: "This has been causing spam to rancid-admin-core@ when running in Pontoon" [puppet] - https://gerrit.wikimedia.org/r/675756 (owner: Filippo Giunchedi)
[09:49:08] (CR) Ayounsi: [C: +1] rancid: parametrize MAILDOMAIN [puppet] - https://gerrit.wikimedia.org/r/675756 (owner: Filippo Giunchedi)
[09:52:27] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudnet1003.eqiad.wmnet
[09:52:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:09] (PS1) Arturo Borrero Gonzalez: openstack: neutron: disable conntrackd [puppet] - https://gerrit.wikimedia.org/r/675760 (https://phabricator.wikimedia.org/T270704)
[09:57:52] (CR) Jbond: "> Patch Set 7:" [software/netbox] - https://gerrit.wikimedia.org/r/672831 (https://phabricator.wikimedia.org/T244849) (owner: CRusnov)
[09:58:10] RECOVERY - Check nf_conntrack usage in neutron netns on cloudnet1003 is OK: OK: everything is apparently fine https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:58:24] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudnet1003.eqiad.wmnet
[09:58:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:54] (CR) Jcrespo: "> Patch Set 6:" (3 comments) [software/wmfbackups] - https://gerrit.wikimedia.org/r/673693 (https://phabricator.wikimedia.org/T277754) (owner: DharmrajRathod98)
[10:01:29] (CR) Filippo Giunchedi: [C: +2] rancid: parametrize MAILDOMAIN [puppet] - https://gerrit.wikimedia.org/r/675756 (owner: Filippo Giunchedi)
[10:01:52] (PS1) Jbond: admin: remove olykalinichenko account [puppet] - https://gerrit.wikimedia.org/r/675762 (https://phabricator.wikimedia.org/T278475)
[10:04:19] (CR) Arturo Borrero Gonzalez: [C: +2] "PCC: https://puppet-compiler.wmflabs.org/compiler1001/28822/" [puppet] - https://gerrit.wikimedia.org/r/675760 (https://phabricator.wikimedia.org/T270704) (owner: Arturo Borrero Gonzalez)
[10:04:32] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:05:02] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 54, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:11:46] Urbanecm: sorry I was afk, so the reimaging didn't go very well, so the server is not pooled, I scap pulled yesterday, but it is not in service
[10:12:13] effie: so if scap sync-* doesn't affect that server, it is fine?
[10:12:27] if so, i guess we should ack the alert from icinga that sync did not work properly?
[10:13:30] I will ack them
[10:13:42] sorry for the bother, I didn't check icinga this morning
[10:16:23] (CR) Jbond: [C: +2] admin: remove olykalinichenko account [puppet] - https://gerrit.wikimedia.org/r/675762 (https://phabricator.wikimedia.org/T278475) (owner: Jbond)
[10:16:40] ACKNOWLEDGEMENT - Ensure local MW versions match expected deployment on parse2001 is CRITICAL: CRITICAL: 318 mismatched wikiversions Effie Mouzeli parse2001 was reimaged https://phabricator.wikimedia.org/T245757#6953720 https://wikitech.wikimedia.org/wiki/Application_servers
[10:16:40] ACKNOWLEDGEMENT - mediawiki-installation DSH group on parse2001 is CRITICAL: Host parse2001 is not in mediawiki-installation dsh group Effie Mouzeli parse2001 was reimaged https://phabricator.wikimedia.org/T245757#6953720 https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[10:16:40] ACKNOWLEDGEMENT - parsoid on parse2001 is CRITICAL: connect to address 10.192.0.182 and port 8000: Connection refused Effie Mouzeli parse2001 was reimaged https://phabricator.wikimedia.org/T245757#6953720 https://wikitech.wikimedia.org/wiki/Services/Monitoring/parsoid
[10:18:55] SRE, GitLab (Initialization), Patch-For-Review: Offboard Oly Kalinichenko (Speed & Function) - https://phabricator.wikimedia.org/T278475 (jbond) Open→Resolved a: jbond Thanks this account has been removed
[10:19:34] SRE, GitLab (Initialization), Patch-For-Review: Offboard Oly Kalinichenko (Speed & Function) - https://phabricator.wikimedia.org/T278475 (jbond)
[10:20:16] thanks effie. I was just worried that a deployment caused this alert, usually, it's not a good sign :)
[10:21:32] SRE, GitLab (Initialization), Patch-For-Review: Offboard Oly Kalinichenko (Speed & Function) - https://phabricator.wikimedia.org/T278475 (jbond) Resolved→Open Reopening, im not an admin on the gitlab cloud project. @thcipriani are you able to remove @OlyKalinichenkoSpeedAndFunction from the...
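The acknowledged alerts come down to parse2001 not being a scap sync target while also not being pooled. A hedged sketch of verifying both from a deployment or cumin host; the dsh group path is the conventional one and is an assumption here:

    # Is the host a scap/dsh sync target?
    grep parse2001 /etc/dsh/group/mediawiki-installation
    # What does conftool think about the host? ("get" prints the current state)
    sudo confctl select 'name=parse2001.codfw.wmnet' get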
[10:22:35] (PS3) Jbond: pki1001: move host into multirootca role [puppet] - https://gerrit.wikimedia.org/r/674915
[10:26:06] (PS5) Jbond: O:pki::multirootca: add multirootca role [puppet] - https://gerrit.wikimedia.org/r/674914
[10:28:46] (CR) Jbond: [C: +2] pki1001: move host into multirootca role [puppet] - https://gerrit.wikimedia.org/r/674915 (owner: Jbond)
[10:28:51] (PS4) Jbond: pki1001: move host into multirootca role [puppet] - https://gerrit.wikimedia.org/r/674915
[10:28:58] (CR) Jbond: [C: +2] O:pki::multirootca: add multirootca role [puppet] - https://gerrit.wikimedia.org/r/674914 (owner: Jbond)
[10:32:01] (CR) DharmrajRathod98: "Can you elaborate more about comment first part of the function line #42. As first part of the function will just split the file string." [software/wmfbackups] - https://gerrit.wikimedia.org/r/673693 (https://phabricator.wikimedia.org/T277754) (owner: DharmrajRathod98)
[10:35:19] (CR) DharmrajRathod98: "> Patch Set 7:" [software/wmfbackups] - https://gerrit.wikimedia.org/r/673693 (https://phabricator.wikimedia.org/T277754) (owner: DharmrajRathod98)
[10:38:51] (PS1) Jbond: P:pki:multirootca correct profile name [puppet] - https://gerrit.wikimedia.org/r/675770
[10:42:23] (CR) Jbond: [C: +2] P:pki:multirootca correct profile name [puppet] - https://gerrit.wikimedia.org/r/675770 (owner: Jbond)
[10:45:27] (CR) Jcrespo: "> Patch Set 7:" [software/wmfbackups] - https://gerrit.wikimedia.org/r/673693 (https://phabricator.wikimedia.org/T277754) (owner: DharmrajRathod98)
[10:47:49] (PS1) Jbond: P:pki::mul;tirootca: fix template path [puppet] - https://gerrit.wikimedia.org/r/675772
[10:49:02] (CR) Jbond: [C: +2] P:pki::mul;tirootca: fix template path [puppet] - https://gerrit.wikimedia.org/r/675772 (owner: Jbond)
[10:57:40] (CR) Arturo Borrero Gonzalez: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/28823/" [puppet] - https://gerrit.wikimedia.org/r/675556 (https://phabricator.wikimedia.org/T270704) (owner: Arturo Borrero Gonzalez)
[10:57:46] (PS2) Arturo Borrero Gonzalez: cloudgw: introduce eqiad1 service implementation [puppet] - https://gerrit.wikimedia.org/r/675556 (https://phabricator.wikimedia.org/T270704)
[10:57:59] (PS1) Jbond: P:pki::multiroot: split defaults and production hiera values [puppet] - https://gerrit.wikimedia.org/r/675774
[11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: That opportune time is upon us again. Time for a European mid-day backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210330T1100).
[11:00:05] Majavah, Zabe, and Amir1: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:14] here, I have a beta-only patch
[11:00:22] o/
[11:00:23] o/
[11:04:01] is anyone going to deploy?
[11:07:49] (CR) Jbond: [C: +2] P:pki::multiroot: split defaults and production hiera values [puppet] - https://gerrit.wikimedia.org/r/675774 (owner: Jbond)
[11:09:14] Majavah: I can deploy yours and mine. I need to check Zabe's
[11:11:47] (CR) Ladsgroup: [C: -1] "Link to community consensus is missing on the task." [mediawiki-config] - https://gerrit.wikimedia.org/r/675319 (https://phabricator.wikimedia.org/T278634) (owner: Zabe)
[11:12:09] (CR) Ladsgroup: [C: +2] beta: add deployment-parsoid12 [mediawiki-config] - https://gerrit.wikimedia.org/r/675560 (owner: Majavah)
[11:12:54] (Merged) jenkins-bot: beta: add deployment-parsoid12 [mediawiki-config] - https://gerrit.wikimedia.org/r/675560 (owner: Majavah)
[11:13:02] (PS2) Awight: parquet logging falls back to default file handler [puppet] - https://gerrit.wikimedia.org/r/666948 (https://phabricator.wikimedia.org/T275757)
[11:13:36] Majavah: yours is rebased, will be there automatically later
[11:13:55] ty Amir1
[11:14:00] let's see if anything breaks
[11:16:38] I'm not comfortable merging the hewikisource patch, I honestly think the requester from hewikisource doesn't know about rights and user groups. The request doesn't make much sense to me...
[11:17:19] Amir1: ok, still thx for your help
[11:17:35] (PS1) Jbond: wmnet: update cname for pki to pki1001 [dns] - https://gerrit.wikimedia.org/r/675777
[11:17:37] (CR) Ladsgroup: [C: +2] Disable legacy javascript in group1 [mediawiki-config] - https://gerrit.wikimedia.org/r/675751 (https://phabricator.wikimedia.org/T72470) (owner: Ladsgroup)
[11:17:51] I'd say let Martin review it
[11:18:16] (CR) jerkins-bot: [V: -1] wmnet: update cname for pki to pki1001 [dns] - https://gerrit.wikimedia.org/r/675777 (owner: Jbond)
[11:18:18] yeah
[11:18:31] (Merged) jenkins-bot: Disable legacy javascript in group1 [mediawiki-config] - https://gerrit.wikimedia.org/r/675751 (https://phabricator.wikimedia.org/T72470) (owner: Ladsgroup)
[11:19:15] (PS2) Jbond: wmnet: update cname for pki to pki1001 [dns] - https://gerrit.wikimedia.org/r/675777
[11:20:03] (CR) Jbond: [C: +2] wmnet: update cname for pki to pki1001 [dns] - https://gerrit.wikimedia.org/r/675777 (owner: Jbond)
[11:21:56] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:675751|Disable legacy javascript global variables in group1]], Some increase in client errors is expected (T72470) (duration: 01m 11s)
[11:22:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:22:05] T72470: Remove legacy javascript globals - https://phabricator.wikimedia.org/T72470
[11:30:10] (PS1) Arturo Borrero Gonzalez: Revert "Remove 'release' qsub label" [software/tools-webservice] - https://gerrit.wikimedia.org/r/675734 (https://phabricator.wikimedia.org/T278748)
[11:30:17] (CR) jerkins-bot: [V: -1] Revert "Remove 'release' qsub label" [software/tools-webservice] - https://gerrit.wikimedia.org/r/675734 (https://phabricator.wikimedia.org/T278748) (owner: Arturo Borrero Gonzalez)
[11:32:03] PROBLEM - Check systemd state on pki1001 is CRITICAL: CRITICAL - degraded: The following units failed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:33:43] (PS1) ArielGlenn: distinguish between "failed" and "maxfailed" job fragments for batch runs [dumps] - https://gerrit.wikimedia.org/r/675778 (https://phabricator.wikimedia.org/T252396)
[11:42:25] (CR) Elukey: "Adam, thanks a lot for the research, I somehow missed this code review for a bit." [puppet] - https://gerrit.wikimedia.org/r/666948 (https://phabricator.wikimedia.org/T275757) (owner: Awight)
[11:44:43] (PS1) Jbond: P:pki::muyltirootca: add puppet alt_dns [puppet] - https://gerrit.wikimedia.org/r/675780
[11:44:50] (CR) DharmrajRathod98: "> Patch Set 7:" [software/wmfbackups] - https://gerrit.wikimedia.org/r/673693 (https://phabricator.wikimedia.org/T277754) (owner: DharmrajRathod98)
[11:45:34] (CR) Jbond: [C: +2] P:pki::muyltirootca: add puppet alt_dns [puppet] - https://gerrit.wikimedia.org/r/675780 (owner: Jbond)
[11:48:40] (PS2) Arturo Borrero Gonzalez: Revert "Remove 'release' qsub label" [software/tools-webservice] - https://gerrit.wikimedia.org/r/675734 (https://phabricator.wikimedia.org/T278748)
[11:49:28] (CR) jerkins-bot: [V: -1] Revert "Remove 'release' qsub label" [software/tools-webservice] - https://gerrit.wikimedia.org/r/675734 (https://phabricator.wikimedia.org/T278748) (owner: Arturo Borrero Gonzalez)
[11:51:25] (PS3) Arturo Borrero Gonzalez: Revert "Remove 'release' qsub label" [software/tools-webservice] - https://gerrit.wikimedia.org/r/675734 (https://phabricator.wikimedia.org/T278748)
[11:52:20] (CR) jerkins-bot: [V: -1] Revert "Remove 'release' qsub label" [software/tools-webservice] - https://gerrit.wikimedia.org/r/675734 (https://phabricator.wikimedia.org/T278748) (owner: Arturo Borrero Gonzalez)
[11:55:26] (PS4) Arturo Borrero Gonzalez: Revert "Remove 'release' qsub label" [software/tools-webservice] - https://gerrit.wikimedia.org/r/675734 (https://phabricator.wikimedia.org/T278748)
[11:57:08] (PS5) Arturo Borrero Gonzalez: Revert "Remove 'release' qsub label" [software/tools-webservice] - https://gerrit.wikimedia.org/r/675734 (https://phabricator.wikimedia.org/T278748)
[11:57:33] (PS1) Jbond: sre.puppet.renew-cert: correct typo allow_dns_alt_names [cookbooks] - https://gerrit.wikimedia.org/r/675782
[11:57:50] (PS2) Jbond: sre.puppet.renew-cert: correct typo allow_dns_alt_names [cookbooks] - https://gerrit.wikimedia.org/r/675782
[12:05:43] (PS3) Jbond: sre.puppet.renew-cert: correct typo allow_dns_alt_names [cookbooks] - https://gerrit.wikimedia.org/r/675782
[12:14:15] !log mwmaint1002: Downloading multiple big files (total filesize estimated 150 GB, downloaded and processed in batches) for server-side uploads
[12:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:16:20] (PS1) Hnowlan: osm: use osmimporter to do expiry when using imposm3. [puppet] - https://gerrit.wikimedia.org/r/675787
[12:17:39] (PS2) Hnowlan: osm: use osmimporter to do expiry when using imposm3. [puppet] - https://gerrit.wikimedia.org/r/675787
[12:18:31] (PS1) Jbond: pki: move pki2001 to pki::multirootca role [puppet] - https://gerrit.wikimedia.org/r/675788
[12:18:33] (CR) Hnowlan: [V: +1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28825/console" [puppet] - https://gerrit.wikimedia.org/r/675787 (owner: Hnowlan)
[12:20:46] (CR) Jbond: [C: +2] pki: move pki2001 to pki::multirootca role [puppet] - https://gerrit.wikimedia.org/r/675788 (owner: Jbond)
[12:34:23] SRE, Security: pygments update review - https://phabricator.wikimedia.org/T278818 (jbond) p: Triage→Medium
[12:36:07] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=aqs1004.eqiad.wmnet
[12:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:38:55] !log update python(3)-pygments
[12:39:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:39:22] !log ssh -p 29418 gerrit.wikimedia.org replication start wikidata/query-builder --wait (T277060)
[12:39:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:39:30] T277060: Move the Query Builder repository to Gerrit - https://phabricator.wikimedia.org/T277060
[12:43:53] Amir1: good to know you can create repositories :)
[12:44:03] Recently got it
[12:44:28] i see
[12:44:58] SRE, Security: pygments update review - https://phabricator.wikimedia.org/T278818 (jbond) Open→Resolved The update looks as it should and has been rolled out
[12:45:17] SRE: mw1304: Memcached error for key X on server 127.0.0.1:11213: A TIMEOUT OCCURRED - https://phabricator.wikimedia.org/T278734 (hashar) Open→Resolved a: elukey Root cause is not addressed but flushing the stuck php transcode jobs has made the server responsive again.
[12:55:24] !log update spamassasin on lists,otrs and mx T278820
[12:55:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:55:34] T278820: Review Debian update: spamassasin - https://phabricator.wikimedia.org/T278820
[12:58:58] SRE, Security: Review Debian update: lxml - https://phabricator.wikimedia.org/T278822 (jbond) p: Triage→Medium
[13:00:01] SRE, Security: Review Debian update: lxml - https://phabricator.wikimedia.org/T278822 (jbond) update matches upstream patch
[13:02:12] PROBLEM - Check systemd state on pki2001 is CRITICAL: CRITICAL - degraded: The following units failed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:02:28] !log rollout lxml update T278822
[13:02:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:02:36] T278822: Review Debian update: lxml - https://phabricator.wikimedia.org/T278822
[13:05:00] SRE, GitLab (Initialization): Offboard Oly Kalinichenko (Speed & Function) - https://phabricator.wikimedia.org/T278475 (thcipriani)
[13:05:29] SRE, GitLab (Initialization): Offboard Oly Kalinichenko (Speed & Function) - https://phabricator.wikimedia.org/T278475 (thcipriani) Open→Resolved >>! In T278475#6955170, @jbond wrote: > Reopening, im not an admin on the gitlab cloud project. @thcipriani are you able to remove @OlyKalinichenkoSpe...
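Amir's 11:21 "Synchronized wmf-config/InitialiseSettings.php" entry above is the deployment-host half of a config backport: merge in gerrit, wait for jenkins-bot, then pull and sync. A minimal sketch of that half, assuming the change is already merged and using the standard deploy host layout:

    # On deploy1002: pull the merged mediawiki-config change into staging
    cd /srv/mediawiki-staging && git pull
    # Sync the single changed file to the fleet, with a log message
    scap sync-file wmf-config/InitialiseSettings.php \
        'Disable legacy javascript global variables in group1 (T72470)'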
[13:08:47] SRE, Security: Review Debian update: lxml - https://phabricator.wikimedia.org/T278822 (jbond) Open→Resolved This update has now been rolled out
[13:36:43] (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/675782 (owner: Jbond)
[13:43:28] (CR) Jbond: [C: +2] sre.puppet.renew-cert: correct typo allow_dns_alt_names [cookbooks] - https://gerrit.wikimedia.org/r/675782 (owner: Jbond)
[13:43:32] (PS4) Jbond: sre.puppet.renew-cert: correct typo allow_dns_alt_names [cookbooks] - https://gerrit.wikimedia.org/r/675782
[13:45:13] (PS6) Filippo Giunchedi: alertmanager: get librenms alerts for dcops to open tasks [puppet] - https://gerrit.wikimedia.org/r/675129 (https://phabricator.wikimedia.org/T225140)
[13:48:26] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:51:44] (CR) Alexandros Kosiaris: [C: +1] C:ssh::server: add support for multiple listen addresses [puppet] - https://gerrit.wikimedia.org/r/675131 (owner: Jbond)
[13:52:05] (PS1) Majavah: role::deployment_server: do not always use lvm on cloud [puppet] - https://gerrit.wikimedia.org/r/675802
[13:53:32] (CR) Alexandros Kosiaris: [C: +2] Run GrowthExperiments listTaskCounts.php script every hour [puppet] - https://gerrit.wikimedia.org/r/675544 (https://phabricator.wikimedia.org/T278411) (owner: Gergő Tisza)
[13:54:06] (CR) Ayounsi: [C: +1] "Discussed over IRC, LGTM" [puppet] - https://gerrit.wikimedia.org/r/675129 (https://phabricator.wikimedia.org/T225140) (owner: Filippo Giunchedi)
[13:56:07] SRE, Wikimedia-Mailing-lists, cloud-services-team (Kanban): auto-subscribe cloud-vps and/or toolforge users to cloud-announce - https://phabricator.wikimedia.org/T278361 (Andrew) It sounds like @Ladsgroup is hard at work getting mailman 3 up and running in prod; I've been thinking that this task shou...
[13:57:46] SRE, Analytics-Radar, Patch-For-Review, Services (watching), User-herron: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (herron) >>! In T225005#6954669, @elukey wrote: > Should we work on this in Q4? I can alloc...
[13:59:06] thanks akosiaris! FWIW, the script will produce errors for a few days, that's normal, it's ahead of the train. AFAIK it won't alert or otherwise inconvenience anyone.
[13:59:08] PROBLEM - Disk space on mwmaint1002 is CRITICAL: DISK CRITICAL - free space: / 4876 MB (3% inode=87%): /tmp 4876 MB (3% inode=87%): /var/tmp 4876 MB (3% inode=87%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mwmaint1002&var-datasource=eqiad+prometheus/ops
[13:59:31] (CR) Jbond: "Ready for review" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/649933 (https://phabricator.wikimedia.org/T268211) (owner: Jbond)
[13:59:45] ^^the icinga alert for disk space on mwmaint1002 is me^^
[14:03:19] SRE, Analytics-Radar, Patch-For-Review, Services (watching), User-herron: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (elukey) Nice! I have used Stevie's reuse-part partman script: ` kafka-jumbo100[1-9]) ech...
[14:07:22] tgr_: Cool. Thanks for the heads up regarding the errors
[14:12:38] (PS1) Jbond: O:cluster::managment: move monitoring from puppetdb to cumin host [puppet] - https://gerrit.wikimedia.org/r/675805
[14:13:55] (CR) Jbond: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28826/console" [puppet] - https://gerrit.wikimedia.org/r/675805 (owner: Jbond)
[14:14:02] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:18:34] (CR) Effie Mouzeli: "I really like this! Let's try this on a couple of mediawiki hosts, before deploying it everywhere." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/675237 (https://phabricator.wikimedia.org/T278220) (owner: Alexandros Kosiaris)
[14:20:14] RECOVERY - Disk space on mwmaint1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mwmaint1002&var-datasource=eqiad+prometheus/ops
[14:26:08] (PS1) Majavah: scap::sources: fix 3d2png/deploy on beta [puppet] - https://gerrit.wikimedia.org/r/675807
[14:26:39] (PS2) Majavah: scap::sources: fix 3d2png/deploy on beta [puppet] - https://gerrit.wikimedia.org/r/675807
[14:31:50] (PS1) David Caro: ceph: Add octopus repo entry [puppet] - https://gerrit.wikimedia.org/r/675812 (https://phabricator.wikimedia.org/T274566)
[14:32:01] !log manually start update-openstack-mirror.service on sodium (T278505)
[14:32:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:32:09] T278505: Prepare/import debian packages for openstack trove - https://phabricator.wikimedia.org/T278505
[14:37:00] (PS1) Majavah: scap::sources: beta: remove unused jobrunner and recommendationapi [puppet] - https://gerrit.wikimedia.org/r/675814
[14:38:58] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: 2021-03-31) rack/setup/install cloudgw100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T272403 (RobH) a: RobH→Cmjohnson >>! In T272403#6953877, @aborrero wrote: > Would you mind if I leave the ticket open until the fo...
[14:43:18] PROBLEM - Check systemd state on mw1309 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:44:34] (CR) Arturo Borrero Gonzalez: "I guess you discarded calling the component thirdparty/ceph-nautilus-buster. That's OK, but a bit confusing bc the previous had the debian" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/675812 (https://phabricator.wikimedia.org/T274566) (owner: David Caro)
[14:46:28] (PS1) Majavah: beta: add deployment-deploy03 [puppet] - https://gerrit.wikimedia.org/r/675815
[14:53:14] (CR) David Caro: "> Patch Set 1:" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/675812 (https://phabricator.wikimedia.org/T274566) (owner: David Caro)
[14:58:40] !log Move Help talk:Help talk:Getting started --> Help talk:Getting started via moveBatch.php on enwiki (T278350)
[14:58:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:58:48] T278350: "Lock wait timeout exceeded" moving a page back with ~18800 watchers on en.wp - https://phabricator.wikimedia.org/T278350
[14:59:31] !log disable puppet on mediawiki servers to deploy 663565
[14:59:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:03:00] (CR) CRusnov: [C: +1] Add network report (1 comment) [software/netbox-extras] - https://gerrit.wikimedia.org/r/674977 (https://phabricator.wikimedia.org/T222931) (owner: Ayounsi)
[15:04:40] PROBLEM - Check systemd state on mw1296 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:08:02] PROBLEM - Check systemd state on mw1309 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:10:54] (CR) BryanDavis: [C: -1] Revert "Remove 'release' qsub label" (1 comment) [software/tools-webservice] - https://gerrit.wikimedia.org/r/675734 (https://phabricator.wikimedia.org/T278748) (owner: Arturo Borrero Gonzalez)
[15:11:52] (PS3) Jbond: cumin: Add check_puppet_run_script so we can filter based on icinga status [puppet] - https://gerrit.wikimedia.org/r/649933 (https://phabricator.wikimedia.org/T268211)
[15:13:04] (CR) jerkins-bot: [V: -1] cumin: Add check_puppet_run_script so we can filter based on icinga status [puppet] - https://gerrit.wikimedia.org/r/649933 (https://phabricator.wikimedia.org/T268211) (owner: Jbond)
[15:14:11] (PS1) Arturo Borrero Gonzalez: toollabs-images: refresh toolforge repository URL [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/675823 (https://phabricator.wikimedia.org/T278436)
[15:15:40] (CR) Filippo Giunchedi: [C: +2] alertmanager: get librenms alerts for dcops to open tasks [puppet] - https://gerrit.wikimedia.org/r/675129 (https://phabricator.wikimedia.org/T225140) (owner: Filippo Giunchedi)
[15:18:04] I want to bring deployers' attention to a worrying connection pattern I am seeing: https://logstash.wikimedia.org/goto/8c440ea2592b0406e4483b1f01345ca9
[15:18:37] the error is not important, it is the database killing idle connections
[15:18:49] but the number of idle connections seems to be increasing
[15:22:04] (PS2) Jbond: O:cluster::managment: move monitoring from puppetdb to cumin host [puppet] - https://gerrit.wikimedia.org/r/675805
[15:22:14] PROBLEM - PHP7 jobrunner on mw1307 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[15:22:56] and indeed they are jobrunners
[15:23:48] (CR) Jbond: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28827/console" [puppet] - https://gerrit.wikimedia.org/r/675805 (owner: Jbond)
[15:25:23] (PS3) Jbond: O:cluster::managment: move monitoring from puppetdb to cumin host [puppet] - https://gerrit.wikimedia.org/r/675805
[15:26:02] PROBLEM - PHP7 rendering on mw1307 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[15:26:36] RECOVERY - PHP7 jobrunner on mw1307 is OK: HTTP OK: HTTP/1.1 200 OK - 330 bytes in 3.574 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[15:27:03] (CR) Jbond: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28828/console" [puppet] - https://gerrit.wikimedia.org/r/675805 (owner: Jbond)
[15:27:44] PROBLEM - PHP7 rendering on mw1299 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[15:28:50] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:29:14] PROBLEM - PHP7 jobrunner on mw1308 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[15:29:24] !log moving all test tables out of cassandra directories on aqs hosts
[15:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:29:36] we are having issues on the jobqueue, I think: https://grafana.wikimedia.org/d/LSeAShkGz/jobqueue?viewPanel=15&orgId=1&from=1617096543946&to=1617118143946&var-dc=eqiad%20prometheus%2Fk8s
[15:30:30] wikibase and cirrus jobs backlog growing, but could be effects, not causes
[15:31:19] nothing worrying compared to the last few weeks, but something to keep an eye on
[15:31:22] RECOVERY - PHP7 jobrunner on mw1308 is OK: HTTP OK: HTTP/1.1 200 OK - 332 bytes in 4.433 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[15:32:10] RECOVERY - PHP7 rendering on mw1299 is OK: HTTP OK: HTTP/1.1 200 OK - 330 bytes in 7.142 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[15:32:19] hm, I’m not seeing any big changes in Wikidata edits that might explain the Wikibase jobs https://grafana.wikimedia.org/d/000000170/wikidata-edits?orgId=1&refresh=1m&from=now-24h&to=now
[15:32:36] RECOVERY - PHP7 rendering on mw1307 is OK: HTTP OK: HTTP/1.1 200 OK - 329 bytes in 0.225 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[15:33:35] Lucas_WMDE, it is addUsagesForPage that is growing, but again, I cannot say if it is just regularly frequently and just affected by slowdown
[15:33:51] *a very frequent job
[15:34:20] PROBLEM - PHP7 rendering on mw1308 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[15:35:19] hm, nevermind, that job is scheduled whenever a Wikibase *client* page is edited (or otherwise re-rendered), I think
[15:35:37] so if there was a higher volume of those jobs, it could also be connected to activities on Wikipedias, not necessarily Wikidata
[15:35:46] jobrunners are very overloaded
[15:36:03] it’s also just a very common job in general, I think, so it might really just be a symptom
[15:36:05] mmm, ffmpeg
[15:36:08] could be?
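jynus's logstash link above shows the databases killing idle connections coming from the jobrunners. A hedged way to see the same pattern from a database host, using plain MariaDB and no WMF tooling (column names per information_schema):

    sudo mysql -e "
      SELECT user, SUBSTRING_INDEX(host, ':', 1) AS client,
             COUNT(*) AS idle_conns, MAX(time) AS max_idle_s
      FROM information_schema.processlist
      WHERE command = 'Sleep'
      GROUP BY user, client
      ORDER BY idle_conns DESC LIMIT 10;"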
[15:36:24] RECOVERY - PHP7 rendering on mw1308 is OK: HTTP OK: HTTP/1.1 200 OK - 329 bytes in 0.058 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[15:36:39] a mass videoscaling causing slowdown? shouldn't that be separate from regular job executions?
[15:37:11] Lucas_WMDE, yeah, I am convinced they are not the cause, they are just very frequent
[15:37:16] ok
[15:37:21] hi
[15:37:24] jynus: I'm currently uploading a lot of videos
[15:37:26] should i stop,
[15:37:27] ?
[15:37:47] I cannot say, but I am seeing overload on jobrunners
[15:37:55] we had an issue earlier this morning with some mw api server crippled by video related jobs
[15:37:56] PROBLEM - PHP7 jobrunner on mw1307 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[15:37:56] I'll pause the script
[15:38:14] PROBLEM - PHP7 rendering on mw1307 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[15:38:15] causing open connections to dbs and other infrastructure, not only video
[15:38:16] PROBLEM - Check systemd state on mw1296 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:38:18] https://phabricator.wikimedia.org/T278734
[15:38:31] PROBLEM - LVS jobrunner eqiad port 443/tcp - JobRunner LVS interface -https-. jobrunner.svc.eqiad.wmnet IPv4 #page on jobrunner.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[15:38:32] Urbanecm, if you can pause it to check if related, at least temporarily
[15:38:45] uhoh
[15:38:47] jynus: script paused
[15:38:55] * volans here
[15:39:06] see, I was predicting this 30 minutes ago :-)
[15:39:22] I was seeing the "signs" on prometheus
[15:39:34] summary: jobrunner overload
[15:39:44] due to video coding?
[15:39:46] potentially due to videoscaling?
[15:39:48] SRE, Wikimedia-Mailing-lists, cloud-services-team (Kanban): auto-subscribe cloud-vps and/or toolforge users to cloud-announce - https://phabricator.wikimedia.org/T278361 (Ladsgroup) Lego is doing most of the work these days (I'm more of a cheerleader/emotional support these days). Mailman3 will be ac...
[15:40:00] (unsure about that, but it is my best guess)
[15:40:15] what I did not get from this morning is why it happens in the first place (too many video scaling jobs piling up on the same mw app server)
[15:40:21] Urbanecm: how many videos are we talking about?
[15:40:34] legoktm: ~160 GBs
[15:40:41] RECOVERY - LVS jobrunner eqiad port 443/tcp - JobRunner LVS interface -https-. jobrunner.svc.eqiad.wmnet IPv4 #page on jobrunner.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 404 bytes in 1.737 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[15:41:21] ftr, it is a server-side upload requested by User:Lusccasdeutsch
[15:42:02] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=aqs1004.eqiad.wmnet
[15:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:42:28] https://commons.wikimedia.org/wiki/Special:Transcode_statistics (admin-only for some reason) says "422 running transcodes"
[15:42:39] SRE, Wikimedia-Mailing-lists, cloud-services-team (Kanban): auto-subscribe cloud-vps and/or toolforge users to cloud-announce - https://phabricator.wikimedia.org/T278361 (Andrew) >>! In T278361#6957054, @Ladsgroup wrote: > getting the mailman3's API to be exposed to the cloud might be complicated (s...
[15:44:30] PROBLEM - PHP7 rendering on mw1299 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[15:44:38] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:45:24] so, just to be clear, has the uploading process been paused?
[15:45:26] (CR) Andrew Bogott: [C: +1] "Seems fine with me although I'm not 100% sure it will do what we want (I've only used recurse to populate a directory from another directo" [puppet] - https://gerrit.wikimedia.org/r/675478 (owner: David Caro)
[15:45:41] akosiaris: _my_ uploads were paused
[15:45:42] akosiaris: urbanecm said he stopped it about 7 min ago
[15:45:49] but I am not sure whether they are the cause or not
[15:46:46] RECOVERY - PHP7 rendering on mw1299 is OK: HTTP OK: HTTP/1.1 200 OK - 329 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[15:46:46] Urbanecm: if you started them like on 12:30 they did. https://grafana.wikimedia.org/d/000000607/cluster-overview?viewPanel=87&orgId=1&var-site=eqiad&var-cluster=jobrunner&var-instance=All&var-datasource=thanos&from=now-6h&to=now
[15:47:04] I'm looking at mw1308 and it's alllll ffmpeg
[15:47:10] yeah
[15:47:17] we can depool the videoscalers from the jobqueue
[15:49:45] yeah, we should do that until we manage to stabilize the infra a bit
[15:49:52] all hosts are like at 100% cpu constantly
[15:50:35] (CR) David Caro: [C: +2] "Got it from https://ask.puppet.com/question/15753/how-can-i-chown-directories-recursivley/, so I'd expect so, from the comments it might b" [puppet] - https://gerrit.wikimedia.org/r/675478 (owner: David Caro)
[15:50:54] (CR) David Caro: [C: +2] nova: set recursive ownership for /var/lib/nova/instances (1 comment) [puppet] - https://gerrit.wikimedia.org/r/675478 (owner: David Caro)
[15:51:28] right now all job runners are also video scalers, we can have most pooled as just job runners and a few as just video scalers?
[15:51:37] akosiaris: yeah, something like 12:30 :/
[15:51:47] can we somehow kill the scaling jobs?
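legoktm's conftool entries below depool only the jobrunner service on the overloaded hosts, leaving the other services on the same host pooled. The equivalent confctl invocation, following the selector syntax that appears in this log:

    # Depool just the jobrunner service on one host
    sudo confctl select 'name=mw1307.eqiad.wmnet,service=jobrunner' set/pooled=no
    # And repool it once the host has recovered
    sudo confctl select 'name=mw1307.eqiad.wmnet,service=jobrunner' set/pooled=yes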
[15:51:59] legoktm: we used to have that and then we folded the 2 clusters into 1 [15:52:01] not sure how job queue behaves if we kill the ffmpeg processes [15:52:26] Urbanecm: last I checked it will restart them if they are not successful [15:52:47] yea, we can depool _just_ the videoscaler service on some [15:53:04] thanks akosiaris :/ [15:53:07] the trend on backlog is getting a bit better in the last minutes [15:53:14] we could roll depool them [15:53:32] PROBLEM - Check systemd state on mw1309 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:53:37] https://grafana.wikimedia.org/goto/jAGPYGlMk [15:53:43] are we maybe creating more different sizes than before? it's a lot of resizing, right? [15:54:16] RECOVERY - PHP7 jobrunner on mw1307 is OK: HTTP OK: HTTP/1.1 200 OK - 327 bytes in 7.123 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [15:54:26] RECOVERY - PHP7 rendering on mw1307 is OK: HTTP OK: HTTP/1.1 200 OK - 327 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:54:39] does seem like we are catching up now that new uploads were paused [15:54:54] (03CR) 10Cwhite: "> Patch Set 1:" [debs/grafana-loki] - 10https://gerrit.wikimedia.org/r/675211 (owner: 10Cwhite) [15:55:18] !log legoktm@deploy1002 conftool action : set/pooled=no; selector: name=mw1307.eqiad.wmnet,service=jobrunner [15:55:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:24] !log legoktm@deploy1002 conftool action : set/pooled=no; selector: name=mw1308.eqiad.wmnet,service=jobrunner [15:55:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:44] is it possible there are more versions we are transcoding to than in the past? https://usercontent.irccloud-cdn.com/file/ni3paF64/image.png [15:55:50] (03PS6) 10Arturo Borrero Gonzalez: Revert "Remove 'release' qsub label" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/675734 (https://phabricator.wikimedia.org/T278748) [15:55:57] (03CR) 10Arturo Borrero Gonzalez: Revert "Remove 'release' qsub label" (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/675734 (https://phabricator.wikimedia.org/T278748) (owner: 10Arturo Borrero Gonzalez) [15:56:02] legoktm: ah cool, I was doing that. Say 50% ? [15:56:12] ack [15:56:20] I started with the ones that were down in icinga [15:56:26] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:27] sudo confctl select 'dc=eqiad,cluster=videoscaler,name=mw12.*' set/pooled=false [15:56:34] that would quickly pick a few [15:57:04] eh, wait, didn't you just depool jobrunner and not videoscaler [15:57:14] PROBLEM - PHP7 jobrunner on mw1299 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [15:57:14] yeah wait [15:57:49] I was removing the overloaded hosts from jobrunner so they aren't affecting normal jobs...should I do it the other way around? [15:57:58] PROBLEM - Check systemd state on mw1309 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:58:33] I was thinking we should limit the videoscaler cluster size [15:58:44] why are we getting the ferm failures. 
looking at 1309 for ferm status [15:58:56] mutante: the machines are pegged for CPU [15:59:00] (03PS4) 10Jbond: cumin: Add check_puppet_run_script so we can filter based on icinga status [puppet] - 10https://gerrit.wikimedia.org/r/649933 (https://phabricator.wikimedia.org/T268211) [15:59:02] DNS query for 'prometheus1003.eqiad.wmnet' failed: [15:59:20] RECOVERY - PHP7 jobrunner on mw1299 is OK: HTTP OK: HTTP/1.1 200 OK - 329 bytes in 3.190 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [15:59:32] !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: dc=eqiad,cluster=videoscaler,name=mw12.* [15:59:33] (03PS4) 10Jbond: O:cluster::managment: move monitoring from puppetdb to cumin host [puppet] - 10https://gerrit.wikimedia.org/r/675805 [15:59:34] on mw1308: DNS query for 'prometheus1004.eqiad.wmnet' failed: query timed out [15:59:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:47] !log depool a number of hosts from videoscalers [15:59:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:56] can we catch up on what we know so far? ffmpeg is maxing CPU on videoscaler/jobrunner machines because of an aggressive upload script that waits for uploading to complete but not for transcoding, so the queue stacks up -- have I got that right? [15:59:59] (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/675805 (owner: 10Jbond) [15:59:59] I think I am just gonna go and kill all ffmpegs on the remaining hosts [16:00:04] jbond42 and cdanis: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210330T1600). [16:00:04] tgr: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:06] akosiaris: wait up [16:00:06] objections ? 
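What "kill all ffmpegs" amounts to on one host, as a rough sketch; the ferm restart mirrors the systemctl commands logged below, and per the note above the aborted transcodes should simply be retried by the job queue:

    # on an overloaded videoscaler (host chosen for illustration only)
    sudo pkill ffmpeg            # aborted transcode jobs get re-queued and retried
    sudo systemctl start ferm    # ferm had failed while the box was pegged
    pgrep -c ffmpeg              # sanity check that the runaways are gone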
[16:00:07] rzl: yes [16:00:14] till we are all on the same page [16:00:53] we have X videos pegging the CPUs of all jobrunners [16:00:57] jbond42, cdanis, tgr: please wait on the Puppet deploy until we get this sorted, should be unrelated but just the same [16:00:57] tgr_: will get to the puppet patch post incident (please ping me if i forget) [16:01:05] rzl: ^^ [16:01:05] thanks <3 [16:01:08] :) [16:01:32] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1005-cloudelastic-omega-eqiad on cloudelastic1005 is CRITICAL: 115.9 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-omega-eqiad&var-instance=cloudelastic1005&panelId=37 [16:01:33] can we get an IC please [16:01:34] killing ffmpeg will cause the jobs to abort, then they'll get retried, except on the smaller videoscaler cluster instead of affecting all job runners [16:01:36] jbond42: I think akosiaris already deployed it a while ago [16:01:45] so our plan is to depool eg 10% [16:01:46] legoktm: yes [16:01:51] tgr_: ahh yes thanks [16:02:08] ok one thing I would add would be [16:02:11] I can IC I guess [16:02:26] PROBLEM - Check systemd state on mw1309 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:02:28] to increase the weight of the jobrunners we keep pooled as jobrunners [16:02:42] legoktm: ill take IC i think you seem to be helping with the actual incident [16:02:44] so they will get some more normal job traffic [16:02:49] jbond42: thanks [16:02:56] asking for permission to kill ffmpeg processes on mw1309 so that ferm can start again [16:03:02] kill it [16:03:08] ^^ [16:03:13] Actually legoktm had a point, let's solve this quickly by splitting the 2 clusters. Let's pool the many hosts into jobrunners cluster and a few into the videoscalers [16:03:28] akosiaris: and | to increase the weight of the jobrunners we keep pooled as jobrunners [16:03:40] what does the weight have to do with anything? [16:03:41] I think this will help too [16:03:53] akosiaris: if we split it 50/50 it wont [16:04:01] if job execution is slow, users will notice rather quickly, if videoscalers are slow, no one will notice [16:04:09] if we depool a few from the videoscalers [16:04:15] we are not going to be sending ffmpeg related jobs to the jobrunners at all [16:04:39] so weight shouldn't be a factor, right ? [16:04:45] we have 24 hosts total fwiw [16:04:59] akosiaris: I proposed it to speed up the processing queue [16:05:09] which is way more than we have in codfw and not going to stay 24 [16:05:11] * jbond42 doc is here just going through and filling in the blanks now https://docs.google.com/document/d/1YdV2d64NY7mAppH9T6TR_Ou0B4k7SQOVqrdqZ6YNc88/edit# [16:05:11] arbitrarily picking mw12* as videoscalers and mw13* as jobrunners. It's 6 vs 18 [16:05:11] on the hosts that are not in the videoscaler cluster as well [16:05:24] akosiaris: ^ do it [16:05:33] the mw12* are slower though [16:05:36] effie: ah, use the videoscalers capacity for jobrunning as well?
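The weight idea above, in conftool terms: a sketch only, assuming weights are set with the same set/ syntax as pooled; the host name and value are made up for illustration:

    # give a host staying in the jobrunner pool a larger share of job traffic
    sudo confctl select 'name=mw1310.eqiad.wmnet,service=jobrunner' set/weight=30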
[16:05:41] yes [16:05:45] niah, it's gonna increase latency a lot for those jobs [16:05:49] going for it [16:05:58] I think we're better off keeping them independent, at least right now [16:06:03] +1 for that split [16:06:03] !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=videoscaler,name=mw12.* [16:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:11] !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: dc=eqiad,cluster=videoscaler,name=mw13.* [16:06:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:59] I wonder if we have any niceness set on those ffmpeg processes, I don't remember at all [16:07:09] !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: dc=eqiad,cluster=jobrunner,name=mw12.* [16:07:13] Urbanecm: can you please sum up what caused this? [16:07:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:29] !log mw1309 - systemctl start ferm [16:07:31] effie: sure. Here, or in a doc or something? [16:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:45] Urbanecm: write it here, we will transfer it to the doc [16:07:50] !log split jobrunners/videoscalers clusters in conftool. mw12* become videoscalers, mw13* become jobrunners, killing ffmpeg on mw13* [16:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:00] effie: okay. [16:08:05] 1309 - greatly reduced number of ffmpeg processes, php-fpm7.2 processes instead [16:08:32] already kind of usable again on the shell and ferm started [16:08:55] ok ok this looks better https://w.wiki/39Hf [16:09:06] RECOVERY - Check systemd state on mw1309 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:09:24] ok, so a pgrep ffmpeg on mw13* says just 3 processes, this is good [16:09:38] the rest of job serving should start picking up [16:10:06] !log mw1308 - started ferm [16:10:12] RECOVERY - Check systemd state on mw1296 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:10:14] I was asked by a Commons community member to upload 65 video files via the server-side upload process. I downloaded them to mwmaint, and started uploading via `importImages.php` [16:10:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:20] !log mw1296 - started ferm [16:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:37] ok, this is better https://grafana.wikimedia.org/d/000000607/cluster-overview?viewPanel=87&orgId=1&var-site=eqiad&var-cluster=jobrunner&var-instance=All&var-datasource=thanos&from=now-30m&to=now [16:10:38] (importImages.php is a mediawiki core script, https://github.com/wikimedia/mediawiki/blob/master/maintenance/importImages.php) [16:11:15] effie: is this enough, or should i elaborate on something? [16:11:33] that is great, thank you!
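Pulling the conftool actions above together, the emergency split reduces to three selectors (a reconstruction from the !log lines, not the literal shell history):

    # mw12* handle only videoscaling, mw13* handle only regular jobs
    sudo confctl select 'dc=eqiad,cluster=videoscaler,name=mw12.*' set/pooled=yes
    sudo confctl select 'dc=eqiad,cluster=videoscaler,name=mw13.*' set/pooled=no
    sudo confctl select 'dc=eqiad,cluster=jobrunner,name=mw12.*' set/pooled=no
    # mw13* were already pooled as jobrunners, so no fourth command is needed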
[16:11:56] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:12:48] we need to add some rate limits or sleeps in between uploads for that :/ [16:12:53] I am wondering if a single video upload became more expensive because there are more different versions here: https://usercontent.irccloud-cdn.com/file/ni3paF64/image.png [16:13:12] is the number of formats unchanged in a while .. or did it become longer [16:13:30] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:15:33] it already has "'Sleep between files. Useful mostly for debugging'," [16:15:42] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:16:21] yeah, the script has the feature [16:16:33] it should be trivial to set it to...20 seconds by default? or whatever? [16:16:37] heh [16:16:42] but isnt this after the upload script is done and the transcoding starts [16:17:54] ok, so for now crisis averted I guess? [16:18:05] see line 190 ff "# Batch "upload" operation" [16:18:05] it is, but maybe starting the jobs later could avoid processing them at the same time? [16:18:43] akosiaris: I believe so [16:18:45] akosiaris: Icinga thinks yes [16:19:06] I think the main difference is that a normal wiki user wouldn't have been able to DoS us because the network time to upload the video is probably enough room for transcodes to not get overloaded [16:19:18] well, except 1299 is still hard to reach [16:19:23] ok, let's see for now how this holds up. [16:19:23] legoktm: depends [16:19:28] since there are so many formats [16:19:45] I trust it is possible, and I think we had it a couple of months ago [16:19:48] mutante: yeah, those 6 boxes are going to be having issues now until the backlog is served [16:20:00] 1299 is special because it has disabled puppet [16:20:09] the others dont show up with alerts right now [16:20:10] mutante: they all have puppet disabled [16:20:20] because I did it before our meeting [16:20:20] ok [16:20:25] effie: ah it's the onhost memcached change? [16:20:35] it is a noop which I was planning to deploy [16:20:40] should we update conftool-data in ops/puppet.git to reflect the cluster split? [16:20:42] you are going to have a nice time enabling puppet on these 6 boxes [16:20:57] now I think I should abort since those 6 boxes will ignore me anyway [16:21:04] legoktm: no, I don't think so. It's an emergency measure [16:21:19] shouldnt be needed if it stays temporary, i think [16:21:27] ok [16:22:16] I am a bit bummed I don't have a very easy way to know those 6 boxes' state, but I 'll just craft a temp dashboard in grafana I guess [16:22:19] Urbanecm: I would have the script sleep with some proportionality to file size and resolution if possible, since that's what affects how long the scalers take on it [16:22:36] akosiaris: do we need an action item to look at how to better split the videoscalers and other jobrunners? [16:22:52] legoktm: can we update https://wikitech.wikimedia.org/wiki/Uploading_large_files with some guidelines for how long it should be sleeping? [16:23:13] jbond42: I don't think so.
we used to have them split and then we merged them as clusters into 1 just keeping that distinction at the LVS level [16:23:26] ack thanks [16:23:29] I wonder if we could use videoscalers in codfw to help transcoding [16:23:36] it's like 3 commands really. I can document it though [16:23:52] mutante: yeah, they are idling, it would be nice, wouldn't it ? [16:24:14] but the job output is written into the mediawiki db so .. :-( [16:24:18] yea, and I was wondering the other day exactly this question "how many jobrunners/videoscalers do we actually need" [16:24:23] now with the new hardware etc [16:24:45] 10SRE, 10Maps, 10Product-Infrastructure-Team-Backlog: Migrate maps to Buster - https://phabricator.wikimedia.org/T264292 (10MSantos) [16:25:00] ok, I am off for a while, I 'll be back later. Page me if any issue arises [16:25:07] ack, thanks [16:25:34] ack im going to officially close the incident [16:26:09] I'm sorry I summoned you this way :/ [16:27:04] it happens, don't worry too much :p responding to a self-dos is much easier than some mysterious malicious actor :) [16:27:09] nah, you can't be blamed for just using what is being used the same way on a regular basis. still begs the question .. why this time [16:28:06] or.. how often is it actually happening [16:28:10] some of the video titles had "4K" in their name [16:28:18] I'm not sure if they were _actually_ 4K [16:28:34] but if they were? maybe that makes a difference in scaler's performance [16:28:55] maybe the combination of 4K with transcoding to more formats and number of files uploaded in short time frame [16:29:03] Urbanecm: where did you upload the images from (i think i saw a suggestion that it was from some WMF network?) [16:29:05] if it's 4K it means there are more sizes to scale to [16:29:06] a little bit of all of that [16:29:39] jynus: from mwmaint1002 (via `mwscript importImages.php --wiki=commonswiki --comment-ext=txt --user='Lusccasdeutsch'`) [16:29:53] https://commons.wikimedia.org/wiki/File:Walking_in_BELFAST_-_Northern_Ireland_(UK)_-_4K_60fps_(UHD).webm isn't 4k but it's a 1h long 1080p video [16:29:55] Urbanecm: ack thanks [16:30:51] jbond42: np. https://wikitech.wikimedia.org/wiki/Uploading_large_files is the docs page for this process (and it's a pretty common one) [16:31:07] Urbanecm: ack thanks [16:31:14] np [16:31:15] https://commons.wikimedia.org/wiki/File:Walking_in_MADRID_-_Spain_-_Christmas_Lights_-_4K_60fps_(UHD).webm was the first video uploaded and only the smallest transcodes finished after 2h [16:33:06] honestly I think these videos are large enough that sleeping for an hour between uploads is the best action [16:33:55] would something similar happen if those videos were uploaded via the regular process? [16:34:37] if you had a fast enough upload speed to eqiad, probably [16:34:40] Urbanecm: my guess is it would still be slow but most users won't have a 1G connection to us [16:35:01] * jbond42 but definitely possible [16:35:07] true. [16:35:20] we could maybe increase resources/logic, hashar mentioned potential issues without server side upload earlier [16:35:38] we had some issue this morning yeah [16:35:56] there are 3rd parties who own machines/vms co-located near us in our core DCs, who are capable of uploading very fast with low latency from the "regular" process [16:35:57] that may be from regular uploads [16:36:12] when I prepared to promote wmf.36 at 7:00 UTC I noticed mw1304 had a bunch of memcached errors.
Turned out to be video jobs / ffmpeg overloading the machine [16:36:15] so probably it needs to be tackled from both ends [16:36:21] bblack: or WMCS users maybe? [16:36:32] legoktm, +1 [16:36:34] sure, but even non-wmcs can be just about as fast [16:36:34] (not sure if we limit throughput somehow) [16:36:48] mw1304 got rebooted and I have moved on to push wmf.36. We haven't investigated what caused the exact issue to happen [16:36:53] making videoscaling more resilient plus adjusting the script so we don't self-DoS [16:36:55] some 3rd parties are literally one cage over in the same DC, just a couple router hops and near zero latency [16:37:24] also, most people (assumption here) will only care that their uploads complete correctly, and not if they take a long time to render [16:37:38] youtube takes a long time too! [16:37:43] then even if one uploads Gigabytes of video, I would assume we queue those transcoding jobs [16:37:55] or at least limit the number of workers that can crunch a given server [16:37:59] "luckily" our inbound side scales differently than our outbound side, so there is a sharper bottleneck on upload bandwidth than download [16:38:06] (not a good one to hit, though) [16:38:21] please add potential followups to the doc [16:38:29] and we can create tickets afterwards [16:38:40] with all these discussions [16:39:39] also, thanks to everybody who attended to the issues [16:40:05] Urbanecm: you normally just run this in a screen/tmux in the background right? [16:40:25] legoktm: i normally run this in tmux, but i do monitor both -operations and script output [16:40:30] I will try to re-enable puppet, and abort what I was planning to deploy [16:40:52] !log enable puppet on mw* hosts [16:40:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:22] PROBLEM - PHP7 rendering on mw1299 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:41:34] mmm [16:41:52] that is a videoscaler, I guess it is expected [16:42:05] we could add a couple of 13* servers, they are newer and faster [16:43:09] Urbanecm: https://wikitech.wikimedia.org/w/index.php?title=Uploading_large_files&type=revision&diff=1905815&oldid=1830856 how's that? [16:44:22] I would also say that while this was probably slow to begin with, upgrading to Buster probably made it even slower [16:45:26] legoktm: nit: recommend adding it to the command below, which I'll 100% copy and paste without reading your explanation :P [16:45:42] the explanation LGTM though [16:45:50] RECOVERY - PHP7 rendering on mw1299 is OK: HTTP OK: HTTP/1.1 200 OK - 330 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:45:55] * legoktm does [16:46:30] 10Puppet, 10SRE, 10observability, 10Patch-For-Review, and 2 others: Puppet: get row/rack info from Netbox - https://phabricator.wikimedia.org/T229397 (10crusnov) A thing we discovered today that should also be imported from Netbox to puppet is the PDU list which is used to produce monitoring, stored in `mo... [16:47:13] do we have any dashboard to track job queues / running time etc?
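For the rate limiting being discussed, a sketch of the same server-side upload using the script's existing sleep option; the 600-second value is purely illustrative, and as noted above a realistic number comes from timing one video's transcodes first:

    # wait N seconds between files so one upload's transcode jobs can
    # drain before the next file lands (value is a placeholder)
    mwscript importImages.php --wiki=commonswiki --comment-ext=txt \
        --user='Lusccasdeutsch' --sleep=600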
[16:47:47] hashar, most interesting stuff is at https://grafana.wikimedia.org/d/LSeAShkGz/jobqueue [16:47:48] https://grafana.wikimedia.org/d/LSeAShkGz/jobqueue?orgId=1&from=now-6h&to=now&var-dc=eqiad%20prometheus%2Fk8s [16:47:50] 10SRE, 10DBA, 10Platform Engineering, 10Wikimedia-Incident: Appservers latency spike / parser cache growth 2021-03-28 - https://phabricator.wikimedia.org/T278655 (10jijiki) p:05Triage→03Medium [16:48:05] legoktm, stop copying me! :-P [16:48:19] xD [16:48:37] 10SRE: mw1304: Memcached error for key X on server 127.0.0.1:11213: A TIMEOUT OCCURRED - https://phabricator.wikimedia.org/T278734 (10hashar) The same issue happened later which is now tracked in an incident document. [16:48:58] yeah that dashboard [16:49:06] I just could not find the video transcoding jobs :\ [16:49:51] legoktm: I'm not sure how i would be actually supposed to determine the number of seconds. Can we add some examples? Ie. "video that's 2 GB => XY seconds" [16:50:36] hashar, webVideoTranscodePrioritized_0 & webVideoTranscode_0, I guess? [16:50:40] 10SRE, 10serviceops, 10Parsoid (Tracking): Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10jijiki) @ssastry do we still need parsoid JS running in the parsoid servers? This is a good opportunity to clean this up. I am running into this issue T245757#6953720 when I tried to r... [16:50:48] yeah looks like [16:51:00] the old dashboard had a filter at the top which was quite convenient ;) [16:51:32] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1005-cloudelastic-omega-eqiad on cloudelastic1005 is OK: (C)100 gt (W)80 gt 37.63 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-omega-eqiad&var-instance=cloudelastic1005&panelId=37 [16:51:43] Urbanecm: well it depends on a bunch of factors that I think are too hard to try to calculate. if it's not too much difficulty, I think uploading one video, seeing how long the transcodes take in reality, and sleeping based on that is going to give you the best number [16:51:59] hmm mw1294.eqiad.wmnet is completely unresponsive [16:53:03] 10SRE, 10serviceops, 10Parsoid (Tracking): Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10ssastry) No. [16:53:06] 10Puppet, 10SRE, 10observability, 10Patch-For-Review, and 2 others: Puppet: get data (row, rack, site, and other information) from Netbox - https://phabricator.wikimedia.org/T229397 (10crusnov) a:05crusnov→03jbond [16:53:36] (03CR) 10BryanDavis: [C: 03+1] "Looks good to me. I left one idea that might make maintenance easier as an inline comment." 
(031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/675734 (https://phabricator.wikimedia.org/T278748) (owner: 10Arturo Borrero Gonzalez) [16:53:38] (03PS1) 10MSantos: proton: bump to 2021-03-26-152830-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/675841 [16:54:03] Urbanecm: also as hardware/infra/software get better, transcodes will happen faster and we can reduce the sleep time [16:54:48] (03PS1) 10Andrew Bogott: Rough in OpenStack Trove module [puppet] - 10https://gerrit.wikimedia.org/r/675842 (https://phabricator.wikimedia.org/T212595) [16:55:59] (03CR) 10jerkins-bot: [V: 04-1] Rough in OpenStack Trove module [puppet] - 10https://gerrit.wikimedia.org/r/675842 (https://phabricator.wikimedia.org/T212595) (owner: 10Andrew Bogott) [16:56:44] (03CR) 10MSantos: [C: 03+2] proton: bump to 2021-03-26-152830-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/675841 (owner: 10MSantos) [16:57:39] legoktm: I'd really prefer a way to have at least a ballpark guess [16:57:44] (03PS1) 10C. Scott Ananian: Bump wikimedia/parsoid to 0.13.0-a30 [vendor] (wmf/1.36.0-wmf.37) - 10https://gerrit.wikimedia.org/r/675738 (https://phabricator.wikimedia.org/T30980) [16:57:51] also the videos in one batch could be different [16:58:00] it could be 10 videos that have 1 GB, and another 5 that have 4 GB [16:58:29] (03CR) 10C. Scott Ananian: [C: 03+2] Bump wikimedia/parsoid to 0.13.0-a30 [vendor] (wmf/1.36.0-wmf.37) - 10https://gerrit.wikimedia.org/r/675738 (https://phabricator.wikimedia.org/T30980) (owner: 10C. Scott Ananian) [16:59:19] (03CR) 10C. Scott Ananian: [C: 03+2] "Due to Monday holiday, the parsoid release was cherry-picked after wmf.37 was branched but before it was deployed via train." [vendor] (wmf/1.36.0-wmf.37) - 10https://gerrit.wikimedia.org/r/675738 (https://phabricator.wikimedia.org/T30980) (owner: 10C. Scott Ananian) [16:59:53] train is in two hours, right? [17:00:04] chrisalbon and accraze: #bothumor I <3 Unicode. All rise for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210330T1700). [17:00:35] ^ i'm merging the parsoid bump onto the wmf.37 but i don't need to swat/scap it anywhere because the train hasn't rolled yet, right? [17:00:42] (03PS1) 10MSantos: mobileapps: bump to 2021-03-29-160328-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/675843 [17:00:48] (03Merged) 10jenkins-bot: proton: bump to 2021-03-26-152830-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/675841 (owner: 10MSantos) [17:02:25] cscott-away: //theoretically// [17:02:41] i strongly recommend letting this week's train conductor know about this change [17:02:47] or just wait until the branch is live on deploy1002 [17:02:53] Urbanecm: well that's what i'm attempting to do here :) [17:02:59] !log mbsantos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'proton' for release 'production' . [17:02:59] Urbanecm: fair, let me make up some numbers [17:03:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:05] twentyafterfour / hashar are the train conductors? [17:04:04] i'm checking at least in part because [[wikitech:Deployments]] wasn't accurate last week [17:05:23] yeah [17:05:35] I have pushed wmf.36 to all wikis roughly 10 hours ago [17:05:42] !log mbsantos@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'proton' for release 'production' .
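For readers unfamiliar with these helmfile !log entries: the deploys roughly follow the deployment-charts workflow, sketched here assuming the usual per-service directory layout on the deploy host (the path and environment name are assumptions, not taken from this log):

    # on the deployment server (sketch only; path assumed)
    cd /srv/deployment-charts/helmfile.d/services/proton
    helmfile -e staging sync    # then codfw and eqiad, as in the entries above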
[17:05:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:59] wmf.37 should move to group 0 wiki during the usual american window. In a couple hours i think [17:07:19] (03CR) 10MSantos: [C: 03+2] mobileapps: bump to 2021-03-29-160328-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/675843 (owner: 10MSantos) [17:08:12] !log mbsantos@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'proton' for release 'production' . [17:08:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:38] (03Merged) 10jenkins-bot: mobileapps: bump to 2021-03-29-160328-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/675843 (owner: 10MSantos) [17:10:46] !log mbsantos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . [17:10:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:59] I got into mw1294 [17:12:10] !log mbsantos@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [17:12:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:26] legoktm and others: what should be the action plan to finish the upload process? Start it again with --sleep=7200 (ie. 2 hours)? [17:13:21] I would say not today, let's let everything recover, there's still a big transcode backlog [17:13:56] !log mbsantos@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [17:14:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:05] !log legoktm@deploy1002 conftool action : set/pooled=no; selector: cluster=videoscaler,name=mw1294.eqiad.wmnet [17:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:15] !log legoktm@deploy1002 conftool action : set/pooled=no; selector: cluster=videoscaler,name=mw1293.eqiad.wmnet [17:17:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:23] legoktm: okay, works for me :). [17:17:28] !log legoktm@deploy1002 conftool action : set/pooled=no; selector: cluster=videoscaler,name=mw1295.eqiad.wmnet [17:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:57] I'm moving mw[1293-1295] to jobrunners and mw[1300-1302] to videoscalers [17:19:07] !log killed all ffmpeg on mw1294 [17:19:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:02] I'm not killing on mw1293 and mw1295 since those are relatively under control and not overloading [17:20:30] Urbanecm: also do you know why Special:Transcode status is locked down to only admins now? it used to be public... 
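The per-host moves logged here and just below, condensed into loop form (a readability reconstruction, not the commands actually run):

    # slower hosts out of videoscaling, into jobrunning...
    for h in mw1293 mw1294 mw1295; do
      sudo confctl select "cluster=videoscaler,name=${h}.eqiad.wmnet" set/pooled=no
      sudo confctl select "cluster=jobrunner,name=${h}.eqiad.wmnet" set/pooled=yes
    done
    # ...and faster ones the other way
    for h in mw1300 mw1301 mw1302; do
      sudo confctl select "cluster=jobrunner,name=${h}.eqiad.wmnet" set/pooled=no
      sudo confctl select "cluster=videoscaler,name=${h}.eqiad.wmnet" set/pooled=yes
    done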
[17:20:43] legoktm: no idea, sorry [17:21:00] !log legoktm@deploy1002 conftool action : set/pooled=yes; selector: cluster=jobrunner,name=mw1293.eqiad.wmnet [17:21:06] !log legoktm@deploy1002 conftool action : set/pooled=yes; selector: cluster=jobrunner,name=mw1294.eqiad.wmnet [17:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:10] !log legoktm@deploy1002 conftool action : set/pooled=yes; selector: cluster=jobrunner,name=mw1295.eqiad.wmnet [17:21:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:28] !log legoktm@deploy1002 conftool action : set/pooled=no; selector: cluster=jobrunner,name=mw1300.eqiad.wmnet [17:21:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:37] !log legoktm@deploy1002 conftool action : set/pooled=no; selector: cluster=jobrunner,name=mw1301.eqiad.wmnet [17:21:41] !log legoktm@deploy1002 conftool action : set/pooled=no; selector: cluster=jobrunner,name=mw1302.eqiad.wmnet [17:21:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:45] legoktm: it just took 20 seconds for me to load it. Maybe that's why? [17:21:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:54] !log legoktm@deploy1002 conftool action : set/pooled=yes; selector: cluster=videoscaler,name=mw1300.eqiad.wmnet [17:21:58] !log legoktm@deploy1002 conftool action : set/pooled=yes; selector: cluster=videoscaler,name=mw1301.eqiad.wmnet [17:21:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:02] !log legoktm@deploy1002 conftool action : set/pooled=yes; selector: cluster=videoscaler,name=mw1302.eqiad.wmnet [17:22:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:29] !log moved mw[1293-1295] to jobrunners and mw[1300-1302] to videoscalers [17:22:33] Based on extension.json and the special page... That hasn't changed in a few years [17:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:41] unless something in wmf config has [17:24:30] at least I used to have permissions to see it :p [17:24:48] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephmon2004-dev - https://phabricator.wikimedia.org/T276509 (10Papaul) [17:31:27] 10SRE, 10Wikimedia-Mailing-lists: Mailman sends bounce notification messages to list-admins with a subject line in Chinese language - https://phabricator.wikimedia.org/T278574 (10Legoktm) p:05Medium→03Lowest [17:32:37] (03Merged) 10jenkins-bot: Bump wikimedia/parsoid to 0.13.0-a30 [vendor] (wmf/1.36.0-wmf.37) - 10https://gerrit.wikimedia.org/r/675738 (https://phabricator.wikimedia.org/T30980) (owner: 10C. 
Scott Ananian) [17:38:33] (03PS2) 10Andrew Bogott: Rough in OpenStack Trove module [puppet] - 10https://gerrit.wikimedia.org/r/675842 (https://phabricator.wikimedia.org/T212595) [17:40:06] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): (Need By: 2021-04-30) rack/setup/install wcqs200[123] - https://phabricator.wikimedia.org/T276647 (10Papaul) [17:40:34] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/675848 [17:43:57] (03CR) 10Dzahn: "Thanks Hashar, will merge" [puppet] - 10https://gerrit.wikimedia.org/r/675199 (https://phabricator.wikimedia.org/T272559) (owner: 10Dzahn) [17:44:02] (03CR) 10Dzahn: [C: 03+2] delete contint::website:labs and template (integration.wmflabs.org) [puppet] - 10https://gerrit.wikimedia.org/r/675199 (https://phabricator.wikimedia.org/T272559) (owner: 10Dzahn) [17:44:54] (03CR) 10Dzahn: "Thanks Ariel, will merge" [puppet] - 10https://gerrit.wikimedia.org/r/675216 (https://phabricator.wikimedia.org/T272559) (owner: 10Dzahn) [17:44:58] (03CR) 10Dzahn: [C: 03+2] delete profile::dumps::web::dumpsuser [puppet] - 10https://gerrit.wikimedia.org/r/675216 (https://phabricator.wikimedia.org/T272559) (owner: 10Dzahn) [17:45:45] (03CR) 10Dzahn: [C: 03+2] "thanks for attaching it to the right ticket" [puppet] - 10https://gerrit.wikimedia.org/r/675213 (https://phabricator.wikimedia.org/T249949) (owner: 10Dzahn) [17:47:47] (03CR) 10Dzahn: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/675221 (https://phabricator.wikimedia.org/T272559) (owner: 10Dzahn) [17:47:58] (03Abandoned) 10Dzahn: delete profile::parsoid::diffserver [puppet] - 10https://gerrit.wikimedia.org/r/675221 (https://phabricator.wikimedia.org/T272559) (owner: 10Dzahn) [17:49:18] Majavah: hey, I've merged all your restbase patches and trying to test deploy in beta, but I'm getting Permission denied (publickey) trying to deploy from deployment-deploy01 -> deployment-restbase03 [17:50:11] Pchelolo: hmmm, looking [17:50:25] ah, found the issue, a sec [17:50:42] that's like the fastest debugging I've ever seen [17:51:24] Pchelolo: try now? [17:51:48] it's going [17:51:55] and done [17:52:09] thank you for a very quick resolution [17:52:34] for some reason its ssh keyholder was not armed, so I just armed it [17:53:11] I've been setting up a buster based deployment server on beta today, so I had some idea on what could be wrong, otherwise could have taken much longer :P [17:53:23] dumb question: is there like a "Wikimedia infrastructure for dummies" that explains how puppet works / what restbase is / buster / everything? [17:53:42] actually a very good question [17:53:46] (03CR) 10Subramanya Sastry: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/675221 (https://phabricator.wikimedia.org/T272559) (owner: 10Dzahn) [17:54:33] wikitech documentation is usually fairly technical, so not sure if we have a "for dummies" guide [17:54:45] I haven't seen one [17:55:13] Buster is the codename of Debian's version 10, Debian is the Linux distribution we use, that's the easiest to answer of those you mentioned [17:55:25] DannyS712: that place is supposed to be the wikitech wiki [17:55:37] DannyS712: i would say there's not really one.
Wikimedia infrastructure is REALLY complex [17:55:54] and I don't think there's ever going to be a "tldr" version of it [17:56:41] it's a wiki, when you run into something you don't know, ask, and then document it :) [17:56:51] (03CR) 10Razzi: [C: 03+2] refinery: Rename --labsdb flag to be --clouddb [puppet] - 10https://gerrit.wikimedia.org/r/674097 (https://phabricator.wikimedia.org/T269211) (owner: 10Razzi) [17:57:15] there are occasional presentations for broader audiences but no all-in-one sort of thing, no [17:57:16] DannyS712: https://wikitech.wikimedia.org/wiki/User:Quiddity/How_does_it_all_work [17:57:18] I've got most of my knowledge from just trying to do things in deployment-prep and reading puppet manifests to do things in deployment-prep [17:57:46] (03PS1) 10Andrew Bogott: Add fake passwords for OpenStack Trove [labs/private] - 10https://gerrit.wikimedia.org/r/675851 (https://phabricator.wikimedia.org/T212595) [17:57:51] and so I guess what I should research is what exactly is puppet [17:58:02] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Add fake passwords for OpenStack Trove [labs/private] - 10https://gerrit.wikimedia.org/r/675851 (https://phabricator.wikimedia.org/T212595) (owner: 10Andrew Bogott) [17:58:15] I have a writeup on puppet for dumps folks but some of it is going to be irrelevant to you... [17:58:22] DannyS712: https://en.wikipedia.org/wiki/Puppet_(software) [17:58:27] DannyS712: it's a configuration management system for installing servers [17:58:32] because I understand how most of the code in mediawiki works, and can probably understand how mediawiki interacts with the database, etc., but there is so much more [17:58:39] https://www.mediawiki.org/wiki/SQL/XML_Dumps/Puppet_for_dumps_maintainers [17:59:08] so the examples are very dumps specific but the info is maybe general enough to be useful. [17:59:16] DannyS712: Restbase is tricky because it was designed for one use and then has evolved over time into other uses and is "in theory" being replaced but i'm not sure i believe it. [17:59:54] it was [18:00:03] was? [18:00:04] Deploy window Pre MediaWiki train break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210330T1800) [18:00:15] * apergos shuts up because my mother told me, if i can't say anything nice, not to say anything at all [18:00:46] yes, restbase is being replaced, work is being done towards that end, slowly but surely [18:00:56] DannyS712: the original idea of restbase was as a sort of push-based parser cache -- instead of waiting until an article is fetched/pulled and then parsing it, we'd push updates into restbase from parsoid on every edit, and on template edits push updates to affected articles via a ChangePropagation service. [18:01:17] (03PS3) 10Andrew Bogott: Rough in OpenStack Trove module [puppet] - 10https://gerrit.wikimedia.org/r/675842 (https://phabricator.wikimedia.org/T212595) [18:01:41] DannyS712: this helped with visualeditor startup, because the "editable version" was always available; the latency was pushed into "between two successive edits" which is more rare. [18:02:42] while there are so many people here, does anyone know who I should get to review and deploy https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/675657? [18:02:56] DannyS712: also once-upon-a-time the plan was to use this to provide "true" time-travel/archivability of wikipedia.
We'd store all the edited pages in all the revisions natively in HTML, and built on this new cassandra thing, and gradually replace the main DB for article storage. [18:03:48] DannyS712: and then over time it turned out that having a light weight storage system in not-PHP that didn't require going through the main MySQL database was very helpful, and analytics, research, machine learning, mobile, etc hopped on. [18:04:40] DannyS712: restbase has/had the benefit of a relatively clean modern REST API and modular structure. [18:05:16] (ie, you didn't have to package your ORES data inside a special page and use the action API to access it) [18:06:14] Majavah: maybe use 'git log'/'git blame' to find who touched the file before [18:06:22] that is the last resort [18:06:44] DannyS712: now parsoid is moving out of restbase and (in theory) the main parser cache will gain the ability to hold edits-in-progress and be preloaded like restbase. and i'll let apergos tell you what's happening with all the other users of restbase because i have no idea. [18:06:54] DannyS712: and that concludes the tl;dr of RESTBase. :) [18:07:09] <_joe_> DannyS712: or said otherwise, we put the cart (storage) in front of the ox (the application), which is a bad antipattern, and we've been working on getting rid of it for 5 years [18:07:14] <_joe_> also a tldr of restbase [18:07:38] well, there was also a big services push, so decoupling storage as a service made sense.... once. [18:07:53] (03PS4) 10Andrew Bogott: Rough in OpenStack Trove module [puppet] - 10https://gerrit.wikimedia.org/r/675842 (https://phabricator.wikimedia.org/T212595) [18:08:03] <_joe_> no it did not, and that was my position 5 years ago too, fwiw :) [18:08:19] hugh seems as likely as anyoe tbh, Majavah [18:08:26] Majavah: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/675657 sounds to be beta only [18:08:36] so likely anyone [18:08:41] i don't think research et al are really served by having all storage go through mediawiki-core [18:08:52] Urbanecm: I still need someone to merge [18:09:06] mobile is it's own issue, i'm not going to touch that. that should never have gone into restbase [18:09:21] except that parsoid was being stored there [18:09:24] (03CR) 10Legoktm: [C: 04-1] Rsync private mediawiki files to releases server (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/667747 (https://phabricator.wikimedia.org/T276145) (owner: 10Jeena Huneidi) [18:09:29] and I have no idea how to get that running on deployment-docker-changeprop01 [18:09:40] <_joe_> we don't use k8s in beta, so i'd wait for hugh to review [18:09:51] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [18:09:56] <_joe_> there is some hiera value I would think [18:09:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:16] Majavah: hnowlan and Pchelolo did some changeprop stuff in prod recently per https://github.com/wikimedia/operations-deployment-charts/commits/master/charts/changeprop [18:10:33] okay, I dumped some of this discussion at https://wikitech.wikimedia.org/wiki/User:DannyS712/sandbox to read through and understand later, thanks [18:11:02] * Pchelolo reading the backscroll [18:11:11] DannyS712: wrt "because I understand how most of the code in mediawiki works, and can probably undestand how mediawiki interacts with the database, etc., but there is so much more" I have to give you a fair warning that not many people in general know why mediawiki interacts with the databases. 
from etcd config mgmt to load balancer factory to the dbtree, to the edge cases (x1, pc, es), to many more aspects of it. [18:11:38] _joe_: I already flicked some hiera values that looked like related, but mediawiki-07 (which is getting removed) is still getting traffic from changeprop [18:11:40] Probably one or two people in total have a holistic view of how mw interacts with the databases [18:11:47] (03PS5) 10Andrew Bogott: Rough in OpenStack Trove module [puppet] - 10https://gerrit.wikimedia.org/r/675842 (https://phabricator.wikimedia.org/T212595) [18:11:51] Amir1: does "magically" count? [18:12:07] Pchelolo: TLDR Majavah wants someone to review https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/675657 [18:12:15] <_joe_> Majavah: yeah you might need to restart it, again ask hnowlan for guidance [18:12:56] DannyS712: I'm happy to (try to) explain things that I sort of understand to you if you want [18:12:57] Urbanecm: usually yes, when you have to debug something and pull your hair for 8 straight hours, i'll tell you :D [18:13:07] 10SRE, 10ops-eqsin, 10DC-Ops: (Need By: TBD) rack/setup/install cp501[3-6] - https://phabricator.wikimedia.org/T278182 (10RobH) [18:13:13] yeah [18:13:24] oh I just meant I understand MediaWikiServices::getInstance()->getDBLoadBalancer() gives a load balancer (whatever that is) and can be used to get a read only view of the database via DB_REPLICA, and a connection that can write to the database using DB_MASTER, and then the helper functions ->(insert|select|update|delete) etc. convert the arguments [18:13:24] to SQL text for whatever backend is used. I don't understand the higher level load balancer part, just enough to know how to use it [18:13:39] Majavah I might take you up on that at some point [18:13:42] _joe_: are they eu/us based? ie about which time should I ask them [18:14:00] <_joe_> Majavah: yes [18:14:07] which one? [18:14:11] oh yeah... change-prop in deployment-prep.. We need to ping hnowlan. [18:14:11] :P [18:14:16] hugh is in ireland [18:14:34] the patch is correct, but there's some additional steps there. [18:15:15] Majavah: https://wikitech.wikimedia.org/wiki/Changeprop#To_deployment-prep [18:15:26] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: (Need By: TBD) rack/setup/install conf200[456].codfw.wmnet - https://phabricator.wikimedia.org/T275637 (10Papaul) [18:15:38] chaomodus: me [18:15:49] (03PS6) 10Andrew Bogott: Rough in OpenStack Trove module [puppet] - 10https://gerrit.wikimedia.org/r/675842 (https://phabricator.wikimedia.org/T212595) [18:15:54] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:07] papaul: hah okay you can merge the snapshot1005 change too when you're ready [18:16:07] basically once your patch is merged, you need to generate the config and paste it to a docker volume. fancy! [18:16:12] thanks Pchelolo, I guess I'll try to ping them during EU working hours [18:16:21] yup. 
that's the best option [18:16:36] chaomodus: done [18:16:45] papaul: thanks :) [18:16:55] (03PS10) 10Jeena Huneidi: Rsync private mediawiki files to releases server [puppet] - 10https://gerrit.wikimedia.org/r/667747 (https://phabricator.wikimedia.org/T276145) [18:17:54] (03PS11) 10Jeena Huneidi: Rsync private mediawiki files to releases server [puppet] - 10https://gerrit.wikimedia.org/r/667747 (https://phabricator.wikimedia.org/T276145) [18:18:28] working on deployment-prep is kind of interesting, you get to learn fancy stuff but you get to bang your head against a wall when trying to fix things like logstash (which has been broken for a few weeks now and I have no idea how to fix it) [18:18:33] (03PS7) 10Andrew Bogott: Rough in OpenStack Trove module [puppet] - 10https://gerrit.wikimedia.org/r/675842 (https://phabricator.wikimedia.org/T212595) [18:18:47] (03CR) 10Jeena Huneidi: Rsync private mediawiki files to releases server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/667747 (https://phabricator.wikimedia.org/T276145) (owner: 10Jeena Huneidi) [18:19:08] Majavah: Just a general thank you for keeping beta cluster up and running <3 greatly appreciated [18:20:01] (03CR) 10Andrew Bogott: [C: 03+2] Rough in OpenStack Trove module [puppet] - 10https://gerrit.wikimedia.org/r/675842 (https://phabricator.wikimedia.org/T212595) (owner: 10Andrew Bogott) [18:20:22] +\infty to Amir1's comment of appreciation [18:20:24] thanks for your help [18:20:32] (03CR) 10Jeena Huneidi: Rsync private mediawiki files to releases server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/667747 (https://phabricator.wikimedia.org/T276145) (owner: 10Jeena Huneidi) [18:21:22] (03PS12) 10Legoktm: Rsync private mediawiki files to releases server [puppet] - 10https://gerrit.wikimedia.org/r/667747 (https://phabricator.wikimedia.org/T276145) (owner: 10Jeena Huneidi) [18:23:21] (03CR) 10Ppchelko: [C: 03+1] changeprop: Update beta servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/675657 (owner: 10Majavah) [18:23:30] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28835/console" [puppet] - 10https://gerrit.wikimedia.org/r/667747 (https://phabricator.wikimedia.org/T276145) (owner: 10Jeena Huneidi) [18:23:47] 10SRE, 10ops-eqsin, 10DC-Ops: (Need By: TBD) rack/setup/install cp501[3-6] - https://phabricator.wikimedia.org/T278182 (10RobH) [18:24:56] (03CR) 10Legoktm: [V: 03+1 C: 03+2] Rsync private mediawiki files to releases server [puppet] - 10https://gerrit.wikimedia.org/r/667747 (https://phabricator.wikimedia.org/T276145) (owner: 10Jeena Huneidi) [18:25:16] thank you Urbanecm and Amir1 [18:27:40] deployment-prep is mostly working these days, I'm somewhat worried what will happen to it when mw-on-k8s becomes reality [18:28:53] (03CR) 10Legoktm: [C: 03+1] "> Patch Set 2: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/632501 (owner: 10Reedy) [18:29:11] (03PS3) 10Legoktm: mirrors: Add trailing full stop to index.html [puppet] - 10https://gerrit.wikimedia.org/r/632501 (owner: 10Reedy) [18:29:18] (03PS4) 10Legoktm: mirrors: Add trailing full stop to index.html [puppet] - 10https://gerrit.wikimedia.org/r/632501 (owner: 10Reedy) [18:29:29] Majavah: in theory it would not be needed, because we'd have a staging environment in prod [18:29:40] at least that's what i heard from mutante i think :) [18:29:50] it's an open question [18:29:55] I have high doubts about that [18:29:59] (03CR) 10Legoktm: [C: 03+2] mirrors:
Add trailing full stop to index.html [puppet] - 10https://gerrit.wikimedia.org/r/632501 (owner: 10Reedy) [18:30:46] beta does so many things for so many people and means so many different things to so many people that I doubt a replacement in production is even remotely possible. [18:31:20] maybe I'm overly optimistic :D [18:32:06] (03PS1) 10Andrew Bogott: cloud-vps: include python3-troveclient [puppet] - 10https://gerrit.wikimedia.org/r/675858 (https://phabricator.wikimedia.org/T212595) [18:32:16] is mw-on-k8s still planned to be deployed like the current train? or will that change? [18:32:36] like for example we have lots of services on k8s but I haven't seen them use the staging env [18:32:39] legoktm: how is the situation regarding the videoscalers/jobrunners? I see a swap for some more powerful hosts? [18:32:52] is that panning out ok? [18:32:55] (03CR) 10Andrew Bogott: [C: 03+2] cloud-vps: include python3-troveclient [puppet] - 10https://gerrit.wikimedia.org/r/675858 (https://phabricator.wikimedia.org/T212595) (owner: 10Andrew Bogott) [18:33:08] (03CR) 10Dzahn: [C: 04-1] "I think you need to fix the calendar string. see inline." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/675308 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [18:33:53] akosiaris: I think so. the job queue is back to normal, video scalers are still chugging through the backlog when I looked a bit ago [18:34:11] nah, all I said is that there is an existing ticket about that question [18:34:15] Majavah: meaning the schedule? I think that's a good question, but it's safe to say it won't change at the beginning to keep the "new" factor low [18:34:42] i also won't continue until tomorrow, and i will monitor the transcoder backlog as well [18:34:58] later on, depending on how much better (or worse) things have become, the schedule might come up for revision. [18:35:12] legoktm: cool, that's good news [18:35:47] (03PS2) 10Legoktm: lists: Make exim4 config of the old mailman agnostic to the domain [puppet] - 10https://gerrit.wikimedia.org/r/675306 (https://phabricator.wikimedia.org/T278612) (owner: 10Ladsgroup) [18:36:14] speaking of backlog... legoktm: do you think just monitoring https://commons.wikimedia.org/wiki/special:Transcodestatistics is a good idea? Or should I monitor sth else as well/instead (maybe some Grafana graph)? [18:36:53] Amir1: all services in k8s do use the staging env, it's part of the process for them to get deployed. But the staging env is just a safety net for deployments, and that's about it. It's not really a demo/qa/dev env and it would be flawed to view it as such IMHO [18:37:29] akosiaris: exactly, the staging env is not a replacement for beta [18:37:37] definitely not [18:37:43] fully agreed on that [18:37:43] (03CR) 10Dzahn: [C: 03+2] gerrit: escape remarkup for Phabricator comments [2] [puppet] - 10https://gerrit.wikimedia.org/r/675479 (https://phabricator.wikimedia.org/T93331) (owner: 10Hashar) [18:39:41] (03Restored) 10Dzahn: gerrit: test if __version__ is rendered properly now [puppet] - 10https://gerrit.wikimedia.org/r/675187 (https://phabricator.wikimedia.org/T93331) (owner: 10Dzahn) [18:39:44] Urbanecm: probably. I was going to submit a patch shortly so I could see that page again [18:40:09] (03PS2) 10Dzahn: gerrit: test if __version__ is rendered properly now [puppet] - 10https://gerrit.wikimedia.org/r/675187 (https://phabricator.wikimedia.org/T93331) [18:40:30] legoktm: I'm really not sure if it's a good idea.
As I said, it takes a good amount of time to load [18:41:48] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS2914/IPv6: Active - NTT, AS2914/IPv4: Active - NTT https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:42:12] RECOVERY - mediawiki originals uploads -hourly- for codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [18:42:27] legoktm: https://drive.google.com/file/d/1_yarT48Weo188TvkWrw33j2LHa5Bc-ey/view?usp=sharing is what it says right now, if you're curious [18:42:32] RECOVERY - mediawiki originals uploads -hourly- for eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [18:42:39] (03CR) 10Dzahn: "I am not sure this works now. Please check. Maybe we can redirect some resources to other issues like the RSA key and the doc VMs? that wo" [puppet] - 10https://gerrit.wikimedia.org/r/675479 (https://phabricator.wikimedia.org/T93331) (owner: 10Hashar) [18:46:05] (03CR) 10Dzahn: "Maybe it would make the most sense if you find someone in Europe to merge it while you are also available to test it." [puppet] - 10https://gerrit.wikimedia.org/r/675515 (https://phabricator.wikimedia.org/T249268) (owner: 10Hashar) [18:47:56] Urbanecm: see https://phabricator.wikimedia.org/T278867 [18:50:38] (03CR) 10Legoktm: [C: 03+2] lists: Make exim4 config of the old mailman agnostic to the domain [puppet] - 10https://gerrit.wikimedia.org/r/675306 (https://phabricator.wikimedia.org/T278612) (owner: 10Ladsgroup) [18:52:25] (03PS6) 10Jeena Huneidi: Include private folder in restricted image [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674698 (https://phabricator.wikimedia.org/T276145) [18:53:01] > PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS2914/IPv6: Active - NTT, AS2914/IPv4: Active - NTT https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:53:12] BGP is always fun [18:56:33] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10Dzahn) >>! In T245757#6953720, @jijiki wrote: > I have reimaged parse2001 as a test, and it appears... [18:56:43] 10SRE, 10DBA, 10Platform Engineering, 10Wikimedia-Incident: Appservers latency spike / parser cache growth 2021-03-28 - https://phabricator.wikimedia.org/T278655 (10matmarex) Thanks for the ping, it doesn't seem like there's anything for me or @Esanders to do here at the moment? Let us know if there's some... [18:57:34] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: (Need By: TBD) rack/setup/install conf200[456].codfw.wmnet - https://phabricator.wikimedia.org/T275637 (10Papaul) [19:00:04] twentyafterfour and hashar: Time to snap out of that daydream and deploy Mediawiki train - American+European Version. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210330T1900). 
[19:01:56] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:02:02] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:02:43] mw1298 looks overloaded right now, I'm going to keep an eye on it while I have lunch and then switch the remaining video scalers to be mw13XX for the better hardware [19:02:50] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 79, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:07:20] (03CR) 10Legoktm: mailman3: Add rsync for mailman2 archives for importing (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/675354 (https://phabricator.wikimedia.org/T278609) (owner: 10Legoktm) [19:07:26] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:08:33] (03CR) 10Dduvall: [C: 04-1] Include private folder in restricted image (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674698 (https://phabricator.wikimedia.org/T276145) (owner: 10Jeena Huneidi) [19:08:46] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:08:52] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:10:47] (03PS1) 10Cwhite: update to 2.2.0 [debs/grafana-loki] - 10https://gerrit.wikimedia.org/r/675864 [19:11:34] legoktm: I think they are all overloaded (which is kind of to be expected). I guess now that the jobqueue is ok again, we can pool some more boxes to serve the ffmpeg transcoding backlog a bit faster [19:12:15] I see that pybal has depool_threshold: ".2" for that cluster, which is 20% so I would definitely stay south of 80% [19:12:18] (03PS2) 10Cwhite: update to 2.2.0 [debs/grafana-loki] - 10https://gerrit.wikimedia.org/r/675864 [19:12:28] (03PS1) 1020after4: testwikis wikis to 1.36.0-wmf.37 refs T278343 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/675886 [19:12:30] (03CR) 1020after4: [C: 03+2] testwikis wikis to 1.36.0-wmf.37 refs T278343 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/675886 (owner: 1020after4) [19:12:34] way souther... like 40%?
[19:12:52] it's at 25% right now box wise (not sure cpu wise, let me check) [19:13:14] (03Abandoned) 10Cwhite: Update to 2.0.1 [debs/grafana-loki] - 10https://gerrit.wikimedia.org/r/675211 (owner: 10Cwhite) [19:13:16] (03Merged) 10jenkins-bot: testwikis wikis to 1.36.0-wmf.37 refs T278343 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/675886 (owner: 1020after4) [19:13:36] (03PS2) 10Ladsgroup: tendril: Migrate crons to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/675308 (https://phabricator.wikimedia.org/T273673) [19:13:39] !log twentyafterfour@deploy1002 Started scap: testwikis wikis to 1.36.0-wmf.37 refs T278343 [19:13:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:50] T278343: 1.36.0-wmf.37 deployment blockers - https://phabricator.wikimedia.org/T278343 [19:14:54] akosiaris: I think we can add some more to the videoscaler pool but I think we should keep it independent from job runners still [19:15:42] yes definitely separate [19:16:08] if we add 1 single jobrunner to the videoscaler pool we are going to have jobs that take forever to finish [19:16:15] Mhm [19:16:32] RECOVERY - Ensure local MW versions match expected deployment on parse2001 is OK: OKAY: Not alerting due to fresh production wikiversions: 321 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:17:35] (03CR) 10Ladsgroup: tendril: Migrate crons to systemd timers (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/675308 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [19:19:37] total cpu count for that cluster is 1048, the boxes in the mw1293-mw1306 range have 40 cpu and the boxes in mw1307-1338 have 48 cores from what I see [19:20:26] (03PS1) 10Awight: beta: ReferencePreviews out of Beta Feature mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/675889 [19:20:28] we can bump to those first 10 boxes I think [19:21:31] 10SRE, 10Fundraising-Backlog, 10Thank-You-Page, 10Wikimedia-Apache-configuration, and 3 others: Deal with donatewiki Thank You page launching in apps - https://phabricator.wikimedia.org/T259312 (10Tsevener) @Pcoombe unfortunately no, the app is still taking over thankyou subdomains. More work and investiga... [19:44:42] RECOVERY - BGP status on cr2-eqsin is OK: BGP OK - up: 59, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:56:48] 10SRE, 10Traffic: cp1087 powercycled - https://phabricator.wikimedia.org/T278729 (10BBlack) Seems ok for the ~14h it's been back online so far. I'm going to re-pool this and tentatively resolve the ticket hoping it's a fluke event, but not clear the SEL. If we get a recurrence, we'll re-open and kick this ov... 
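For context on the pooling discussion above: pybal's depool_threshold of ".2" means at least 20% of the cluster must stay pooled, so pybal itself will refuse depools past that point, which is why the chat caps manual depooling well below 80%. The "conftool action : set/pooled=..." SAL entries later in this log are what confctl runs produce. A minimal sketch of such a pool change, assuming the confctl CLI syntax of the time and hostnames named in the discussion:

```bash
# Pool one of the larger mw13XX hosts into the videoscaler cluster
# (run on a cumin or deploy host with conftool credentials):
sudo confctl select 'cluster=videoscaler,name=mw1303.eqiad.wmnet' set/pooled=yes

# Inspect the whole cluster's pooled state before and after:
sudo confctl select 'cluster=videoscaler' get
```

Each confctl change is announced to the Server Admin Log automatically, which is where the conftool lines that follow in this log come from.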
[19:58:10] 10SRE, 10Traffic: cp1087 powercycled - https://phabricator.wikimedia.org/T278729 (10BBlack) 05Open→03Resolved a:03BBlack [19:58:28] !log repool cp1087 - T278729 [19:58:31] !log joal@deploy1002 Started deploy [analytics/refinery@1a53e9a]: Regular analytics weekly train [analytics/refinery@1a53e9a] [19:58:33] !log bblack@cumin1001 conftool action : set/pooled=yes; selector: name=cp1087.eqiad.wmnet [19:58:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:36] T278729: cp1087 powercycled - https://phabricator.wikimedia.org/T278729 [19:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:21] !log when syncing 1.36.0-wmf.37 promote to testwikis, one server failed: server mw1298.eqiad.wmnet and two more appear to be hung because scap is stuck at 2 left 99% without making any progress for a long time now. refs T278343 [20:02:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:30] T278343: 1.36.0-wmf.37 deployment blockers - https://phabricator.wikimedia.org/T278343 [20:07:04] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:11:19] mw1299 and mw1296 are severely overloaded (1299 appears to be a bit worse but both have load average over 100 and are very slow to respond) [20:11:33] this is causing scap's rsync to go very slowly [20:11:38] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:11:52] twentyafterfour: maybe that is the issue I noticed earlier [20:12:00] which in logstash I found via memcached having timeouts [20:12:21] it's ffmpeg [20:12:35] I guess I just wait it out...
it's running, it just isn't letting rsync get much cpu time [20:13:12] legoktm: ^ [20:13:40] https://grafana.wikimedia.org/d/000000377/host-overview?from=now-24h&var-server=mw1299 and https://grafana.wikimedia.org/d/000000377/host-overview?from=now-24h&var-server=mw1296 [20:13:44] yeah they look bad :-\ [20:13:45] twentyafterfour: yeah [20:13:53] it's known [20:14:38] I'll move those jobs over to slightly better hardware in like 10-15 min [20:14:42] and can run scap pull afterward [20:15:42] !log joal@deploy1002 Finished deploy [analytics/refinery@1a53e9a]: Regular analytics weekly train [analytics/refinery@1a53e9a] (duration: 17m 11s) [20:15:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:03] !log joal@deploy1002 Started deploy [analytics/refinery@1a53e9a] (thin): Regular analytics weekly train THIN [analytics/refinery@1a53e9a] [20:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:10] !log joal@deploy1002 Finished deploy [analytics/refinery@1a53e9a] (thin): Regular analytics weekly train THIN [analytics/refinery@1a53e9a] (duration: 00m 07s) [20:16:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:30] !log joal@deploy1002 Started deploy [analytics/refinery@1a53e9a] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@1a53e9a] [20:16:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:11] twentyafterfour: and that slowdown is the sole effect for the train afaik ;] [20:18:25] have a good train ride. I am off for some sleep (woke up at 4am this morning ouch) [20:18:40] goodnight hasharDinner [20:20:06] !log legoktm@deploy1002 conftool action : set/pooled=no; selector: cluster=videoscaler,name=mw1296.eqiad.wmnet [20:20:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:15] !log legoktm@deploy1002 conftool action : set/pooled=no; selector: cluster=videoscaler,name=mw1298.eqiad.wmnet [20:20:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:24] !log legoktm@deploy1002 conftool action : set/pooled=no; selector: cluster=videoscaler,name=mw1299.eqiad.wmnet [20:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:46] PROBLEM - Ensure local MW versions match expected deployment on mw1299 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [20:21:00] !log joal@deploy1002 Finished deploy [analytics/refinery@1a53e9a] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@1a53e9a] (duration: 04m 29s) [20:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:27] twentyafterfour: is it still attempting to rsync?
I just killed the jobs on mw1296 and mw1299 [20:23:01] legoktm: yeah it's still running as far as I can tell [20:23:07] one of them even appears to have completed [20:23:11] just one left [20:23:34] mw1299 [20:23:56] I'm going to deploy phatality in the meantime since this is just testwikis, I'll just wait it out hopefully it completes [20:25:05] (03PS1) 10Razzi: superset: add victorops contact to superset monitoring [puppet] - 10https://gerrit.wikimedia.org/r/675898 (https://phabricator.wikimedia.org/T273064) [20:25:09] (03PS1) 10Andrew Bogott: Rabbitmq: open firewall to traffic from the cloud in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/675899 (https://phabricator.wikimedia.org/T212595) [20:25:38] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:25:46] !log legoktm@deploy1002 conftool action : set/pooled=yes; selector: cluster=jobrunner,name=mw1299.eqiad.wmnet [20:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:52] !log legoktm@deploy1002 conftool action : set/pooled=yes; selector: cluster=jobrunner,name=mw1298.eqiad.wmnet [20:25:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:58] !log legoktm@deploy1002 conftool action : set/pooled=yes; selector: cluster=jobrunner,name=mw1296.eqiad.wmnet [20:26:01] twentyafterfour: phatality will return? cool! [20:26:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:26] Urbanecm: it's been back for a while thanks to shdubsh [20:26:33] cool [20:26:35] thanks shdubsh [20:26:36] but I'm deploying an update [20:26:41] (03CR) 10Andrew Bogott: [C: 03+2] Rabbitmq: open firewall to traffic from the cloud in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/675899 (https://phabricator.wikimedia.org/T212595) (owner: 10Andrew Bogott) [20:26:42] !log preparing to deploy phatality upgrade to kibana cluster [20:26:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:56] RECOVERY - Ensure local MW versions match expected deployment on mw1299 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [20:27:44] !log legoktm@deploy1002 conftool action : set/pooled=no; selector: cluster=jobrunner,name=mw1303.eqiad.wmnet [20:27:49] !log legoktm@deploy1002 conftool action : set/pooled=no; selector: cluster=jobrunner,name=mw1304.eqiad.wmnet [20:27:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:55] !log legoktm@deploy1002 conftool action : set/pooled=no; selector: cluster=jobrunner,name=mw1305.eqiad.wmnet [20:27:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:58] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:27:59] !log legoktm@deploy1002 conftool action : set/pooled=no; selector: cluster=jobrunner,name=mw1306.eqiad.wmnet [20:28:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:03] !log legoktm@deploy1002 conftool action : set/pooled=no; selector: cluster=jobrunner,name=mw1307.eqiad.wmnet [20:28:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:24] !log legoktm@deploy1002 conftool action : set/pooled=yes; selector: cluster=videoscaler,name=mw1303.eqiad.wmnet [20:28:28] !log legoktm@deploy1002 conftool action : set/pooled=yes; selector: cluster=videoscaler,name=mw1304.eqiad.wmnet [20:28:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:32] !log legoktm@deploy1002 conftool action : set/pooled=yes; selector: cluster=videoscaler,name=mw1305.eqiad.wmnet [20:28:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:37] !log legoktm@deploy1002 conftool action : set/pooled=yes; selector: cluster=videoscaler,name=mw1306.eqiad.wmnet [20:28:42] !log legoktm@deploy1002 conftool action : set/pooled=yes; selector: cluster=videoscaler,name=mw1307.eqiad.wmnet [20:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:08] !log twentyafterfour@deploy1002 Started deploy [releng/phatality@715d809]: (no justification provided) [20:29:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:53] shdubsh: systemctl restart kibana failed to sudo [20:29:58] !log twentyafterfour@deploy1002 Finished deploy [releng/phatality@715d809]: (no justification provided) (duration: 00m 49s) [20:30:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:32] hmm [20:30:37] (03PS1) 10Kosta Harlan: linkrecommendation: Bump version, adjust timeout, disable cron in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/675900 (https://phabricator.wikimedia.org/T277297) [20:31:31] the other parts worked, so either the sudo rule is missing or I have the command wrong in my scap/checks.yaml [20:31:37] (03CR) 10Gergő Tisza: [C: 03+2] linkrecommendation: Bump version, adjust timeout, disable cron in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/675900 (https://phabricator.wikimedia.org/T277297) (owner: 10Kosta Harlan) [20:31:47] it just does `sudo systemctl restart kibana` [20:32:12] PROBLEM - Ensure local MW versions match expected deployment on parse2001 is CRITICAL: CRITICAL: 321 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [20:32:35] might have to pass the full path to systemctl [20:32:52] (03Merged) 10jenkins-bot: linkrecommendation: Bump version, adjust timeout, disable cron in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/675900 (https://phabricator.wikimedia.org/T277297) (owner: 10Kosta Harlan) [20:32:55] hmm [20:32:58] /usr/bin/systemctl [20:33:49] !log twentyafterfour@deploy1002 Finished scap: testwikis wikis to 1.36.0-wmf.37 refs T278343 (duration: 80m 32s) [20:33:56] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:57] T278343: 1.36.0-wmf.37 deployment blockers - https://phabricator.wikimedia.org/T278343 [20:34:43] !log twentyafterfour@deploy1002 Started restart [releng/phatality@715d809]: (no justification provided) [20:34:45] !log kharlan@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . [20:34:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:16] hmm scap's built in --service-restart also failed [20:35:28] !log twentyafterfour@deploy1002 Started deploy [releng/phatality@715d809]: (no justification provided) [20:35:34] !log twentyafterfour@deploy1002 Finished deploy [releng/phatality@715d809]: (no justification provided) (duration: 00m 05s) [20:35:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:46] ah but that seems to have worked [20:36:54] !log twentyafterfour@deploy1002 Started deploy [releng/phatality@715d809]: (no justification provided) [20:37:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:08] nope I was wrong [20:37:25] !log twentyafterfour@deploy1002 Finished deploy [releng/phatality@715d809]: (no justification provided) (duration: 00m 31s) [20:37:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:55] !log kharlan@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . [20:37:55] !log kharlan@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [20:38:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:14] shdubsh: can you double check the sudoers file and see if the rule is defined? [20:39:17] twentyafterfour: I can't seem to find the restart_kibana check... [20:39:35] (03PS1) 10Gergő Tisza: [beta-only] Use local GrowthExperiments task suggester on more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/675901 (https://phabricator.wikimedia.org/T274198) [20:39:39] sudoers looks correct: "/usr/bin/systemctl restart kibana" [20:39:41] the check is in /srv/deploy on deploy1002 [20:39:44] hmm [20:40:03] maybe it's running the old check for some reason [20:40:20] but it shouldn't be ... I'll try committing the checks.yaml and see if that makes a difference [20:40:42] deploy1002: ls: cannot access '/srv/deploy': No such file or directory [20:40:54] !log kharlan@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [20:40:54] !log kharlan@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . 
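The "no tty present and no askpass program specified" error quoted above is sudo's failure mode when a command would need a password but there is no terminal to prompt on, which is exactly what a non-interactive scap check runs into. A minimal sketch of reproducing it, with the checks.yaml shape under discussion shown as comments; the check name, stage, and target host variable are assumptions for illustration, not copied from the real repo:

```bash
# Reproducing the failure: running a password-requiring sudo command
# over ssh without a tty yields the same "no askpass" error that the
# scap check hit ($KIBANA_HOST is a placeholder, not a real hostname):
ssh "$KIBANA_HOST" 'sudo /usr/bin/systemctl restart kibana'

# Hypothetical shape of the scap check being debugged
# (scap/checks.yaml in the deployed repo); names and stage assumed:
#
#   checks:
#     restart_kibana:
#       type: command
#       stage: promote
#       command: sudo /usr/bin/systemctl restart kibana
```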
[20:40:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:20] !log twentyafterfour@deploy1002 Started deploy [releng/phatality@715d809]: (no justification provided) [20:41:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:36] shdubsh: sorry it's /srv/deployment/releng/phatality [20:41:40] !log twentyafterfour@deploy1002 Finished deploy [releng/phatality@715d809]: (no justification provided) (duration: 00m 20s) [20:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:57] shdubsh: /srv/deployment/releng/phatality/scap/checks.yaml [20:42:08] got it [20:42:39] looks identical to what I see in puppet phatality.pp [20:43:17] hrm... fortunately I was able to reproduce locally [20:43:20] what I see: sudo: no tty present and no askpass program specified [20:43:27] (03PS3) 10Awight: Let hive use the default logging config path [puppet] - 10https://gerrit.wikimedia.org/r/666948 (https://phabricator.wikimedia.org/T275757) [20:43:29] (03CR) 10Kosta Harlan: "This doesn't seem to work:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/674562 (https://phabricator.wikimedia.org/T277297) (owner: 10Hnowlan) [20:45:23] we should update scap's built in service-restart command to support systemctl, it currently only does a `service restart $name` instead [20:46:17] I'm looking to see which sudoers incantation will allow a service restart. [20:46:33] k [20:47:11] ahh, I think I got it [20:47:59] (03PS1) 10Kosta Harlan: linkrecommendation: Set timeout to 15s [deployment-charts] - 10https://gerrit.wikimedia.org/r/675903 (https://phabricator.wikimedia.org/T277297) [20:48:37] (03PS1) 10Cwhite: kibana: execute systemctl restart as root [puppet] - 10https://gerrit.wikimedia.org/r/675904 [20:48:40] (03CR) 10Kosta Harlan: [C: 03+2] linkrecommendation: Set timeout to 15s [deployment-charts] - 10https://gerrit.wikimedia.org/r/675903 (https://phabricator.wikimedia.org/T277297) (owner: 10Kosta Harlan) [20:49:46] (03CR) 10Awight: "> What about something like: https://github.com/apache/hive/blob/branc-2.3/common/src/main/resources/parquet-logging.properties#L60, that " [puppet] - 10https://gerrit.wikimedia.org/r/666948 (https://phabricator.wikimedia.org/T275757) (owner: 10Awight) [20:50:00] (03Merged) 10jenkins-bot: linkrecommendation: Set timeout to 15s [deployment-charts] - 10https://gerrit.wikimedia.org/r/675903 (https://phabricator.wikimedia.org/T277297) (owner: 10Kosta Harlan) [20:51:29] (03CR) 10Cwhite: [C: 03+2] kibana: execute systemctl restart as root [puppet] - 10https://gerrit.wikimedia.org/r/675904 (owner: 10Cwhite) [20:51:47] ohhh wrong username [20:51:49] doh [20:52:39] !log kharlan@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [20:52:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:25] * shdubsh running puppet [20:53:39] !log kharlan@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . 
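The fix worked out above ("execute systemctl restart as root", then correcting the username) comes down to making the sudoers entry match both the invoking user and the exact command line scap runs. A hedged sketch of what such a rule and its verification could look like; the deploy-service account name is an assumption, not confirmed anywhere in this log:

```bash
# Hypothetical sudoers entry (edit with visudo -f /etc/sudoers.d/kibana);
# NOPASSWD matters because scap runs the check non-interactively:
#
#   deploy-service ALL = (root) NOPASSWD: /usr/bin/systemctl restart kibana
#
# List the rules that actually apply to the deploy user, to confirm
# the command string matches character for character:
sudo -l -U deploy-service
```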
[20:53:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:27] (03PS4) 10Awight: Stop logging parquet to the console [puppet] - 10https://gerrit.wikimedia.org/r/666948 (https://phabricator.wikimedia.org/T275757) [20:55:29] (03PS1) 10Awight: Let hive use the default logging config path [puppet] - 10https://gerrit.wikimedia.org/r/675907 (https://phabricator.wikimedia.org/T275757) [20:55:56] twentyafterfour: ok, I think we're set to try again. [20:56:01] ok trying it [20:56:08] !log twentyafterfour@deploy1002 Started deploy [releng/phatality@715d809]: (no justification provided) [20:56:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:20] !log twentyafterfour@deploy1002 Finished deploy [releng/phatality@715d809]: (no justification provided) (duration: 00m 12s) [20:56:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:26] hmm for whatever reason --force doesn't seem to try the checks [20:58:38] !log killed remaining ffmpeg on mw1298 [20:58:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:45] !log twentyafterfour@deploy1002 Started deploy [releng/phatality@715d809]: (no justification provided) [20:58:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:00] !log twentyafterfour@deploy1002 Finished deploy [releng/phatality@715d809]: (no justification provided) (duration: 00m 15s) [20:59:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:59] grrr it seems to have worked but I still see the same old version [21:00:08] looks like kibana was restarted [21:00:16] PROBLEM - Ensure local MW versions match expected deployment on mw1298 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [21:00:32] I think you have to force-refresh kibana to clear the old version from the browser [21:01:17] shdubsh: hmm maybe I didn't get the right update zip deployed lol [21:01:25] it does appear that everything else worked correctly [21:01:45] scap pulling on mw1298 [21:01:52] thanks legoktm [21:02:04] phatality tab crashes on ecs log: RangeError: Invalid time value [21:02:16] !log scap pulling on mw1298 [21:02:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:37] hmm. ReferenceError: makeAnonymousUrl is not defined [21:05:25] !log twentyafterfour@deploy1002 Started deploy [releng/phatality@fbca60c]: trying again with newly built zip [21:05:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:38] !log twentyafterfour@deploy1002 Finished deploy [releng/phatality@fbca60c]: trying again with newly built zip (duration: 00m 12s) [21:05:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:50] shdubsh: doh that doesn't sound good [21:06:26] shdubsh: I don't see it crashing for me? [21:06:32] RECOVERY - Ensure local MW versions match expected deployment on mw1298 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [21:06:39] am I still getting the cached one somehow? [21:07:26] ah ha cleared cache now I get it [21:07:29] possibly. I'm not sure what changed in Phatality [21:08:22] twentyafterfour: going to file a task? [21:08:29] possibly roll back?
[21:09:01] shdubsh: I'll roll back and fix it [21:09:11] (03CR) 10Dzahn: mailman3: Add rsync for mailman2 archives for importing (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/675354 (https://phabricator.wikimedia.org/T278609) (owner: 10Legoktm) [21:09:28] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refine_sanitize_eventlogging_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:10:05] (03PS1) 10Andrew Bogott: trove-guestagent.conf: Don't set log_config_append [puppet] - 10https://gerrit.wikimedia.org/r/675909 (https://phabricator.wikimedia.org/T212595) [21:11:18] !log twentyafterfour@deploy1002 Started deploy [releng/phatality@fbca60c]: rollback [21:11:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:30] !log twentyafterfour@deploy1002 Finished deploy [releng/phatality@fbca60c]: rollback (duration: 00m 12s) [21:11:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:06] (03CR) 10Andrew Bogott: [C: 03+2] trove-guestagent.conf: Don't set log_config_append [puppet] - 10https://gerrit.wikimedia.org/r/675909 (https://phabricator.wikimedia.org/T212595) (owner: 10Andrew Bogott) [21:12:17] shdubsh: seems ok now. I'll figure out what was wrong with the other build and I think I can deploy it on my own now!!! Thank you for all of your help [21:12:31] awesome :) [21:14:11] twentyafterfour: filed https://phabricator.wikimedia.org/T278891 for the tab crash for index patterns not in logstash-* [21:14:13] Commit No Longer Exists [21:14:14] This commit no longer exists in the repository. [21:14:50] Phab forgot about Gerrit commits that Gerrit still knows [21:15:35] mutante: hmm, weird? 
shdubsh: thanks [21:15:41] (03CR) 10Dzahn: mailman3: Add rsync for mailman2 archives for importing (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/675354 (https://phabricator.wikimedia.org/T278609) (owner: 10Legoktm) [21:16:46] twentyafterfour: example: https://phabricator.wikimedia.org/rOPUP6dcbacd92e01b303bbe58ce1d3fad5bfbb9558d5 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/237299 [21:16:58] second link from manually copying change-id from first link [21:17:02] which claimed it didn't exist [21:21:10] made https://phabricator.wikimedia.org/T278893 [21:24:22] (03CR) 10Dzahn: mailman3: Add rsync for mailman2 archives for importing (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/675354 (https://phabricator.wikimedia.org/T278609) (owner: 10Legoktm) [21:26:06] weird [21:26:07] mutante: thanks, I'll amend it shortly to have rsync::quickdatacopy on the mailman2 host too [21:26:17] I'll deploy a beta-only patch [21:26:42] legoktm: I think both ways work, sounds good [21:28:05] (03CR) 10Dzahn: [C: 03+1] "looking good to me https://puppet-compiler.wmflabs.org/compiler1002/28836/dbmonitor1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/675308 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [21:28:21] (03PS2) 10Gergő Tisza: [beta-only] Use local GrowthExperiments task suggester on more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/675901 (https://phabricator.wikimedia.org/T274198) [21:30:53] (03CR) 10Legoktm: [C: 03+2] mailman3: Add mailman-web wrapper [puppet] - 10https://gerrit.wikimedia.org/r/675271 (https://phabricator.wikimedia.org/T278404) (owner: 10Legoktm) [21:31:02] (03CR) 10Gergő Tisza: [C: 03+2] [beta-only] Use local GrowthExperiments task suggester on more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/675901 (https://phabricator.wikimedia.org/T274198) (owner: 10Gergő Tisza) [21:31:08] (03CR) 10Dzahn: C:ssh::server: refactor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/675124 (owner: 10Jbond) [21:31:34] PROBLEM - rpki grafana alert on alert1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: total VRPs alert, valid ROAs alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [21:36:17] (03CR) 10Dzahn: [C: 03+1] C:ssh::server: add support for multiple listen addresses [puppet] - 10https://gerrit.wikimedia.org/r/675131 (owner: 10Jbond) [21:38:40] RECOVERY - rpki grafana alert on alert1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting.
https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [21:48:16] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): (Need By: 2021-04-30) rack/setup/install wcqs200[123] - https://phabricator.wikimedia.org/T276647 (10Papaul) [21:48:35] (03CR) 10Razzi: "I chose superset for our first service to implement victorops alerting since I'm familiar with it :)" [puppet] - 10https://gerrit.wikimedia.org/r/675898 (https://phabricator.wikimedia.org/T273064) (owner: 10Razzi) [21:49:12] (03PS2) 10Legoktm: mailman3: Have root@ go the real root@wikimedia.org alias [puppet] - 10https://gerrit.wikimedia.org/r/675352 [21:49:14] (03PS4) 10Legoktm: mailman3: Use Stdlib::Fqdn type where possible [puppet] - 10https://gerrit.wikimedia.org/r/675355 [21:49:16] (03PS4) 10Legoktm: mailman3: Add rsync for mailman2 archives for importing [puppet] - 10https://gerrit.wikimedia.org/r/675354 (https://phabricator.wikimedia.org/T278609) [21:49:18] (03PS3) 10Legoktm: mailman3: Add documentation to classes, merge hyperkitty into web [puppet] - 10https://gerrit.wikimedia.org/r/675584 [21:49:20] (03PS3) 10Legoktm: mailman3: Explicitly don't use dbconfig-mysql system [puppet] - 10https://gerrit.wikimedia.org/r/675585 (https://phabricator.wikimedia.org/T278499) [21:49:22] (03PS4) 10Legoktm: mailman3: Add remove_from_lists helper [puppet] - 10https://gerrit.wikimedia.org/r/675353 [21:49:24] (03PS5) 10Legoktm: mailman3: Add discard_held_messages script and timer [puppet] - 10https://gerrit.wikimedia.org/r/675356 [21:49:50] (03CR) 10Legoktm: mailman3: Add rsync for mailman2 archives for importing (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/675354 (https://phabricator.wikimedia.org/T278609) (owner: 10Legoktm) [21:50:37] (03CR) 10Legoktm: [C: 03+2] mailman3: Have root@ go the real root@wikimedia.org alias [puppet] - 10https://gerrit.wikimedia.org/r/675352 (owner: 10Legoktm) [21:51:21] (03CR) 10Legoktm: [C: 03+2] mailman3: Use Stdlib::Fqdn type where possible [puppet] - 10https://gerrit.wikimedia.org/r/675355 (owner: 10Legoktm) [21:53:21] mutante: if you could double-check I added the rsync properly: https://gerrit.wikimedia.org/r/c/operations/puppet/+/675354/4 [21:55:32] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Create `mailman-web` helper alias - https://phabricator.wikimedia.org/T278404 (10Legoktm) 05Open→03Resolved [21:59:19] (03CR) 10Dzahn: [C: 03+1] "Looks good in compiler, it creates an rsyncd on lists1001 but not on lists1002:" [puppet] - 10https://gerrit.wikimedia.org/r/675354 (https://phabricator.wikimedia.org/T278609) (owner: 10Legoktm) [21:59:51] legoktm: usually I did it differently, by putting the code into some "foo::migration" class and then included it in both roles, but this works just as well, lgtm [21:59:57] compiled [22:01:10] after running puppet you should get the rsyncd only on the old host and then you can cat /usr/local/sbin/sync-var-lib-mailman on the new host and adjust that command before executing it [22:02:24] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:03:01] hmm that sounds nicer, I'll refactor it into something like that later [22:03:06] thanks :) [22:03:12] (03CR) 10Legoktm: [C: 03+2] mailman3: Add rsync for mailman2 archives for importing [puppet] - 10https://gerrit.wikimedia.org/r/675354 (https://phabricator.wikimedia.org/T278609) (owner: 10Legoktm) [22:07:03] (03CR) 10Jbond: 
C:ssh::server: refactor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/675124 (owner: 10Jbond) [22:08:52] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refine_sanitize_eventlogging_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:10:34] (03CR) 10Jbond: "fyi i" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/675124 (owner: 10Jbond) [22:15:01] 10SRE, 10Wikimedia-Mailing-lists: Import several public mailing lists archives from mailman2 to lists-next to measure database size - https://phabricator.wikimedia.org/T278609 (10Legoktm) rsync is set up, per https://docs.mailman3.org/en/latest/migration.html we need the list config and mbox for importing. `... [22:15:24] (03PS5) 10Krinkle: Use the new mediawiki logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668241 (https://phabricator.wikimedia.org/T268230) (owner: 10Ladsgroup) [22:16:19] (03CR) 10Dzahn: C:ssh::server: refactor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/675124 (owner: 10Jbond) [22:16:28] (03PS1) 10Legoktm: Allow autoconfirmed users to see Special:TranscodeStatistics by default [extensions/TimedMediaHandler] (wmf/1.36.0-wmf.36) - 10https://gerrit.wikimedia.org/r/675871 (https://phabricator.wikimedia.org/T278867) [22:16:42] (03PS1) 10Legoktm: Allow autoconfirmed users to see Special:TranscodeStatistics by default [extensions/TimedMediaHandler] (wmf/1.36.0-wmf.37) - 10https://gerrit.wikimedia.org/r/675873 (https://phabricator.wikimedia.org/T278867) [22:16:50] (03CR) 10Krinkle: "As requested, applied "Insane" settings of ImageOptim.app (best combo of any AvgPNG, PNGOUT, Pngcrush), in addition to the Zopfli "more" m" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668241 (https://phabricator.wikimedia.org/T268230) (owner: 10Ladsgroup) [22:25:19] 10SRE, 10Wikimedia-Mailing-lists: Have a regular cronjob which alerts about (potentially unadministrated) mailing lists with large (or aged?) moderation queues - https://phabricator.wikimedia.org/T270368 (10Legoktm) I think it would be easiest to have a script that generates data for prometheus and make it vis... [22:34:30] 10SRE, 10GitLab (Initialization), 10Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)), 10User-brennen: SSH Access of Git data in GitLab - https://phabricator.wikimedia.org/T276148 (10Dzahn) @thcipriani @Sergey.Trofimovsky.SF @wkandek exec summary for you: - VM has 2 public IPs now, one intend... [22:35:02] 10SRE, 10GitLab (Initialization), 10Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)), 10User-brennen: SSH Access of Git data in GitLab - https://phabricator.wikimedia.org/T276148 (10Dzahn) 05Open→03Stalled [22:35:09] 10SRE, 10Traffic, 10GitLab (Initialization), 10Patch-For-Review, and 2 others: open firewall ports on gitlab1001.wikimedia.org (was: Port map of how Gitlab is accessed) - https://phabricator.wikimedia.org/T276144 (10Dzahn) [22:51:08] (03PS7) 10Jeena Huneidi: Include private folder in restricted image [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674698 (https://phabricator.wikimedia.org/T276145) [22:59:40] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install moss-fe200[12].codfw.wmnet - https://phabricator.wikimedia.org/T275513 (10Papaul) [23:00:04] RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Evening backport window deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210330T2300). [23:00:04] legoktm: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:40] legoktm: i guess you'll self-service? [23:02:10] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:06:45] Urbanecm: sorry I was driving. Yeah, I can [23:06:50] cool [23:07:02] may i sneak in a quick config patch? [23:07:51] (03CR) 10Ladsgroup: [C: 03+1] "I've been thinking about it too." [puppet] - 10https://gerrit.wikimedia.org/r/675584 (owner: 10Legoktm) [23:08:44] (03PS1) 10Urbanecm: Growth features: bnwiki: Enable impact module [mediawiki-config] - 10https://gerrit.wikimedia.org/r/675924 [23:09:04] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refine_sanitize_eventlogging_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:09:18] (03PS2) 10Urbanecm: Growth features: bnwiki: Enable impact module [mediawiki-config] - 10https://gerrit.wikimedia.org/r/675924 (https://phabricator.wikimedia.org/T274793) [23:09:47] (03CR) 10Urbanecm: [C: 03+2] Growth features: bnwiki: Enable impact module [mediawiki-config] - 10https://gerrit.wikimedia.org/r/675924 (https://phabricator.wikimedia.org/T274793) (owner: 10Urbanecm) [23:09:50] syncing this one [23:10:38] (03Merged) 10jenkins-bot: Growth features: bnwiki: Enable impact module [mediawiki-config] - 10https://gerrit.wikimedia.org/r/675924 (https://phabricator.wikimedia.org/T274793) (owner: 10Urbanecm) [23:14:13] where is logmsgbot [23:14:43] syncing again [23:15:25] "where is logmsgbot" is a magic word that restarts it?:) [23:15:38] sounds so :) [23:15:43] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: ef306a35464f295f43b874301cf0170edcfa4d8c: Growth features: bnwiki: Enable impact module (T274793) (duration: 01m 07s) [23:15:44] lol, looked like it [23:15:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:15:52] T274793: Turn on Impact module for Bengali Wikipedia after conclusion of another impact stats experiment - https://phabricator.wikimedia.org/T274793 [23:16:01] anyway, /me done [23:16:21] Is there always going to be a "{"batchcomplete":"","continue":" line at the top of API results when asking for the block lists [23:16:29] or does that already tell me there is more [23:16:45] 10SRE, 10Wikimedia-Mailing-lists: Import several public mailing lists archives from mailman2 to lists-next to measure database size - https://phabricator.wikimedia.org/T278609 (10Ladsgroup) I tried and got this: ` ladsgroup@lists1002:/home/legoktm$ sudo mailman import21 discovery-alerts@lists-next.wikimedia.or... 
[23:17:10] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: (Need By: TBD) rack/setup/install conf200[456].codfw.wmnet - https://phabricator.wikimedia.org/T275637 (10Papaul) [23:17:11] I still see it even when I reduce my date window to one day [23:17:32] mutante: continue tells you there is something more [23:17:48] compare https://en.wikipedia.org/w/api.php?action=query&format=json&list=logevents&leaction=block/block&leend=2021-03-29T16:58:50.000Z&lestart=2021-03-01T16:58:50.000Z&ledir=newer&lelimit=max and https://cs.wikipedia.org/w/api.php?action=query&format=json&list=logevents&leaction=block/block&leend=2021-03-29T16:58:50.000Z&lestart=2021-03-01T16:58:50.000Z&ledir=newer&lelimit=max [23:19:31] Urbanecm: wow, that means even just a single day already has more block actions than you can list with a single request [23:19:46] yes :/ [23:20:00] yesterday, enwiki placed 26655 blocks [23:20:17] Amir1: try with sudo mailman-wrapper .... [23:20:42] (03CR) 10Legoktm: [C: 03+2] Allow autoconfirmed users to see Special:TranscodeStatistics by default [extensions/TimedMediaHandler] (wmf/1.36.0-wmf.37) - 10https://gerrit.wikimedia.org/r/675873 (https://phabricator.wikimedia.org/T278867) (owner: 10Legoktm) [23:20:43] legoktm: the same [23:20:45] (03CR) 10Legoktm: [C: 03+2] Allow autoconfirmed users to see Special:TranscodeStatistics by default [extensions/TimedMediaHandler] (wmf/1.36.0-wmf.36) - 10https://gerrit.wikimedia.org/r/675871 (https://phabricator.wikimedia.org/T278867) (owner: 10Legoktm) [23:20:55] https://lists.mailman3.org/archives/list/mailman-users@mailman3.org/thread/JEPMB3HW4FI57EUMOST4L7BD2ILIIS3P/#E3UXFXT27WLRPQIR62KULKFO5SBH5NLT [23:20:59] this seeems related [23:22:47] ahhh more fixes merged but not in buster :< [23:22:53] https://gitlab.com/mailman/mailman/commit/2a15437e911660ab87f960ac3a9eba131a2b7350 [23:23:02] I'll try a cherry-pick later [23:23:33] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [23:23:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:24:06] :(((( [23:25:15] Urbanecm: so... 26655 * 30 days / 500 limit = 1600 times "continue" ? :o [23:25:26] times number of wikis [23:25:33] maybe not :) [23:25:42] mutante: enwiki has the hugest number of blocks [23:25:49] oh right, of course [23:27:00] mutante: what about creating a service account for your tool, requesting high-API limits (it increases 500 to 5000) and displaying "5000+" if the number is higher than 5000? [23:27:32] I think enwiki will be one of the few wikis with such abnormal values [23:27:34] Urbanecm: that's a good idea, I will read up on the process [23:27:52] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:27:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:16] mutante: https://meta.wikimedia.org/wiki/API_high_limit_requestors is "docs" for the high limit group. There's no real process for determining which accounts are there, feel free to ping me once there's an account and I can flag it. 
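To make the continuation mechanics in the exchange above concrete: a top-level "continue" object in the response means more rows exist, and the client re-issues the same request with every key of that object merged in, looping until the object disappears; "batchcomplete" only says the current batch is complete. A minimal bash sketch with curl and jq that counts enwiki block actions over the same date window mutante queried; this illustrates the protocol only, without the rate/maxlag handling a real client should add:

```bash
#!/bin/bash
# Count block log entries by following MediaWiki API continuation
# until the response no longer carries a "continue" object.
api='https://en.wikipedia.org/w/api.php'
total=0
cont_args=()
while :; do
  resp=$(curl -sG "$api" \
    --data-urlencode 'action=query' --data-urlencode 'format=json' \
    --data-urlencode 'list=logevents' --data-urlencode 'leaction=block/block' \
    --data-urlencode 'ledir=newer' --data-urlencode 'lelimit=max' \
    --data-urlencode 'lestart=2021-03-01T16:58:50.000Z' \
    --data-urlencode 'leend=2021-03-29T16:58:50.000Z' \
    "${cont_args[@]}")
  total=$(( total + $(jq '.query.logevents | length' <<<"$resp") ))
  if jq -e '.continue' <<<"$resp" >/dev/null; then
    # More rows: feed every continuation key/value back verbatim.
    cont_args=()
    while IFS='=' read -r key value; do
      cont_args+=(--data-urlencode "$key=$value")
    done < <(jq -r '.continue | to_entries[] | "\(.key)=\(.value)"' <<<"$resp")
  else
    break
  fi
done
echo "$total block actions in the window"
```

At 500 rows per request (5000 with the high-limits right discussed above), the 26655-blocks-per-day figure is what produces the roughly 1600 requests per month estimated in the chat.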
[23:29:23] import archives works fine though [23:29:31] Urbanecm: thank you, ok [23:29:56] !log sudo django-admin hyperkitty_import -l discovery-alerts@lists-next.wikimedia.org discovery-alerts.mbox/discovery-alerts.mbox --pythonpath /usr/share/mailman3-web --settings settings (T278609) [23:30:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:04] T278609: Import several public mailing lists archives from mailman2 to lists-next to measure database size - https://phabricator.wikimedia.org/T278609 [23:30:35] hyperkitty? [23:30:44] https://lists-next.wikimedia.org/hyperkitty/list/discovery-alerts@lists-next.wikimedia.org/thread/55QKICJKH27BRWTNY6UMHOUQVHQW4FUR/ [23:30:58] Urbanecm: hyperkitty is the archiver of mailman3 [23:31:23] s/pipermail/hyperkitty/ basically [23:31:27] yup [23:31:30] I am seeing ads on Google for a "Wikipedia Page Writing Service". "paid editing"-much [23:31:53] Amir1: yay though :D halfway there [23:32:12] does it automatically create the list then since the config import failed? [23:32:40] no, creating the list is manual [23:32:49] https://docs.mailman3.org/en/latest/migration.html [23:33:07] First step: Create the list you are trying to migrate in Mailman 3, for the purposes of this guide, we will call it foo-list@example.com [23:33:32] it took a while to migrate 4448 emails [23:33:54] fully there and indexed https://lists-next.wikimedia.org/hyperkitty/list/discovery-alerts@lists-next.wikimedia.org/2021/3/ [23:34:13] I'm calling it a day [23:34:16] see you tomorrow [23:34:23] * legoktm hugs Amir1 <3 [23:34:33] enjoy the rest of your evening [23:34:44] Thanks. you too! [23:34:51] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install moss-fe200[12].codfw.wmnet - https://phabricator.wikimedia.org/T275513 (10Papaul)
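Pulling the migration steps from the exchange above into one sequence, per the linked docs.mailman3.org guide: the Mailman 3 list must be created manually first, `mailman import21` then loads the old list's configuration pickle (the step that hit the bug being cherry-picked), and `hyperkitty_import` loads the mbox into the archiver. A sketch as it would run on lists1002; the config.pck path is an assumption, since the log never shows where the rsynced mailman2 files landed:

```bash
# 1. Create the list in Mailman 3 (the guide's manual first step):
sudo mailman create discovery-alerts@lists-next.wikimedia.org

# 2. Import the Mailman 2 configuration pickle (path assumed):
sudo mailman import21 discovery-alerts@lists-next.wikimedia.org \
    /var/lib/mailman/lists/discovery-alerts/config.pck

# 3. Import the pipermail mbox into HyperKitty, as !logged at 23:29:56:
sudo django-admin hyperkitty_import -l discovery-alerts@lists-next.wikimedia.org \
    discovery-alerts.mbox/discovery-alerts.mbox \
    --pythonpath /usr/share/mailman3-web --settings settings
```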
[23:44:47] (03Merged) 10jenkins-bot: Allow autoconfirmed users to see Special:TranscodeStatistics by default [extensions/TimedMediaHandler] (wmf/1.36.0-wmf.37) - 10https://gerrit.wikimedia.org/r/675873 (https://phabricator.wikimedia.org/T278867) (owner: 10Legoktm) [23:44:49] (03Merged) 10jenkins-bot: Allow autoconfirmed users to see Special:TranscodeStatistics by default [extensions/TimedMediaHandler] (wmf/1.36.0-wmf.36) - 10https://gerrit.wikimedia.org/r/675871 (https://phabricator.wikimedia.org/T278867) (owner: 10Legoktm) [23:46:13] yay [23:53:43] !log legoktm@deploy1002 Synchronized php-1.36.0-wmf.36/extensions/TimedMediaHandler/extension.json: Allow autoconfirmed users to see Special:TranscodeStatistics by default (T278867) (duration: 01m 08s) [23:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:53:52] T278867: Special:Transcode statistics is only visible to sysops - https://phabricator.wikimedia.org/T278867 [23:55:58] !log legoktm@deploy1002 Synchronized php-1.36.0-wmf.37/extensions/TimedMediaHandler/extension.json: Allow autoconfirmed users to see Special:TranscodeStatistics by default (T278867) (duration: 01m 08s) [23:56:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:59:40] !log reindexing English wikis on elastic@eqiad, elastic@codfw, and cloudelastic (T274200) [23:59:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:59:48] T274200: Reindex English and Italian wikis to enable homoglyph plugin - https://phabricator.wikimedia.org/T274200