[00:01:50] (03PS8) 10Jeena Huneidi: Include private folder in restricted image [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674698 (https://phabricator.wikimedia.org/T276145) [00:03:16] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:06:28] Amir1 https://phabricator.wikimedia.org/T278904 [00:06:32] flagged revs issue [00:09:17] (03PS1) 10Ppchelko: Revert "Re-apply "Deprecate constructing revision with non-proper page"" [core] (wmf/1.36.0-wmf.37) - 10https://gerrit.wikimedia.org/r/675875 (https://phabricator.wikimedia.org/T278376) [00:10:14] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refine_sanitize_eventlogging_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:12:25] (03PS1) 10Papaul: Add MAC Address and partman recipe for wcqs200[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/675929 (https://phabricator.wikimedia.org/T276647) [00:13:31] (03CR) 10Papaul: [C: 03+2] Add MAC Address and partman recipe for wcqs200[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/675929 (https://phabricator.wikimedia.org/T276647) (owner: 10Papaul) [00:17:24] (03PS1) 10Papaul: Add wcqd200[1-3] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/675931 (https://phabricator.wikimedia.org/T276647) [00:18:22] (03CR) 10jerkins-bot: [V: 04-1] Add wcqd200[1-3] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/675931 (https://phabricator.wikimedia.org/T276647) (owner: 10Papaul) [00:22:40] (03PS2) 10Papaul: Add wcqs200[1-3] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/675931 (https://phabricator.wikimedia.org/T276647) [00:45:51] (03CR) 10Papaul: [C: 03+2] Add wcqs200[1-3] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/675931 (https://phabricator.wikimedia.org/T276647) (owner: 10Papaul) [00:48:26] 10SRE, 10ops-codfw, 
10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: (Need By: 2021-04-30) rack/setup/install wcqs200[123] - https://phabricator.wikimedia.org/T276647 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` wcqs2001.codfw... [01:05:39] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on wcqs2001.codfw.wmnet with reason: REIMAGE [01:05:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:07:37] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on wcqs2001.codfw.wmnet with reason: REIMAGE [01:07:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:10:05] jouncebot: now [01:10:05] No deployments scheduled for the next 9 hour(s) and 49 minute(s) [01:10:14] (03PS2) 10Urbanecm: Enable local uploads on Irish Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673699 (https://phabricator.wikimedia.org/T277723) (owner: 10Luke081515) [01:10:25] (03CR) 10Urbanecm: [C: 03+2] Enable local uploads on Irish Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673699 (https://phabricator.wikimedia.org/T277723) (owner: 10Luke081515) [01:11:15] (03Merged) 10jenkins-bot: Enable local uploads on Irish Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673699 (https://phabricator.wikimedia.org/T277723) (owner: 10Luke081515) [01:13:55] !log urbanecm@deploy1002 Synchronized dblists/commonsuploads.dblist: 3283ae59f25f02966a81ed2f0b51b964f733cf65: Enable local uploads on Irish Wikipedia (T277723) (duration: 01m 08s) [01:14:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:14:03] T277723: Enable local uploads on Irish Wikipedia - https://phabricator.wikimedia.org/T277723 [01:14:33] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): (Need By: 2021-04-30) rack/setup/install wcqs200[123] - https://phabricator.wikimedia.org/T276647 
(10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wcqs2001.codfw.wmnet'] ` and were **ALL** successful. [01:15:27] !log urbanecm@deploy1002 Synchronized wmf-config/config/gawiki.yaml: 3283ae59f25f02966a81ed2f0b51b964f733cf65: Enable local uploads on Irish Wikipedia (T277723) (duration: 01m 08s) [01:15:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:18:11] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): (Need By: 2021-04-30) rack/setup/install wcqs200[123] - https://phabricator.wikimedia.org/T276647 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` wcqs2002.codfw.wmnet ` The log can b... [01:35:40] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on wcqs2002.codfw.wmnet with reason: REIMAGE [01:35:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:37:34] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on wcqs2002.codfw.wmnet with reason: REIMAGE [01:37:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:46:09] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): (Need By: 2021-04-30) rack/setup/install wcqs200[123] - https://phabricator.wikimedia.org/T276647 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wcqs2002.codfw.wmnet'] ` and were **ALL** successful. [01:46:29] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): (Need By: 2021-04-30) rack/setup/install wcqs200[123] - https://phabricator.wikimedia.org/T276647 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` wcqs2003.codfw.wmnet ` The log can b... 
[02:00:01] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [02:00:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:00:12] RECOVERY - Check systemd state on mwlog1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:04:34] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on wcqs2003.codfw.wmnet with reason: REIMAGE [02:04:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:05:03] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [02:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:06:25] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on wcqs2003.codfw.wmnet with reason: REIMAGE [02:06:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:12:43] ACKNOWLEDGEMENT - MD RAID on logstash2022 is CRITICAL: CRITICAL: State: degraded, Active: 7, Working: 7, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T278908 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [02:12:47] 10SRE, 10ops-codfw: Degraded RAID on logstash2022 - https://phabricator.wikimedia.org/T278908 (10ops-monitoring-bot) [02:13:15] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): (Need By: 2021-04-30) rack/setup/install wcqs200[123] - https://phabricator.wikimedia.org/T276647 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wcqs2003.codfw.wmnet'] ` and were **ALL** successful. 
[02:13:42] RECOVERY - Check systemd state on pki1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:17:45] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [02:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:18:39] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): (Need By: 2021-04-30) rack/setup/install wcqs200[123] - https://phabricator.wikimedia.org/T276647 (10Papaul) [02:19:52] PROBLEM - Check systemd state on pki1001 is CRITICAL: CRITICAL - degraded: The following units failed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:20:27] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): (Need By: 2021-04-30) rack/setup/install wcqs200[123] - https://phabricator.wikimedia.org/T276647 (10Papaul) 05Open→03Resolved @Gehel This is ready [02:22:07] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [02:22:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:22:26] (03PS1) 10Razzi: refine: rename EventLoggingSanitization to RefineSanitize [puppet] - 10https://gerrit.wikimedia.org/r/675936 (https://phabricator.wikimedia.org/T273789) [02:33:15] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [02:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:36:12] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:37:41] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [02:37:46] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:37:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:49:04] (03PS1) 10Ottomata: Use versioned refinery-job.jar in refine sanitize job [puppet] - 10https://gerrit.wikimedia.org/r/675939 (https://phabricator.wikimedia.org/T273789) [02:49:45] (03CR) 10Ottomata: "Ah! We should do this soon but we'll need more than that to use RefineSanitize, lots of stuff has changed." [puppet] - 10https://gerrit.wikimedia.org/r/675936 (https://phabricator.wikimedia.org/T273789) (owner: 10Razzi) [02:51:52] (03CR) 10Ottomata: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/28840/an-launcher1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/675939 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [03:02:08] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:21:02] PROBLEM - WDQS SPARQL on wdqs1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:25:28] RECOVERY - WDQS SPARQL on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.555 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:20:04] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:21:44] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_sanitize_eventlogging_analytics_delayed.service,monitor_refine_sanitize_eventlogging_analytics_immediate.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:22:28] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:19:40] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:21:58] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:29:39] (03Abandoned) 10Hashar: gerrit: test if __version__ is rendered properly now [puppet] - 10https://gerrit.wikimedia.org/r/675187 (https://phabricator.wikimedia.org/T93331) (owner: 10Dzahn) [06:34:12] PROBLEM - OSPF status on cr3-knams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:34:12] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:34:20] PROBLEM - BFD status on cr3-knams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:37:18] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 3/5 UP : OSPFv3: 3/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:38:50] RECOVERY - OSPF status on cr3-knams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:38:50] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:39:00] RECOVERY - BFD status on cr3-knams is 
OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:39:15] mmm these are all GTT-related [06:39:35] and there is maintenance, all good [06:39:38] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:42:21] (03CR) 10Zabe: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670359 (https://phabricator.wikimedia.org/T266933) (owner: 10Waihorace) [07:10:08] PROBLEM - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:47:15] 10SRE, 10ops-codfw: Degraded RAID on logstash2022 - https://phabricator.wikimedia.org/T278908 (10fgiunchedi) Mhh `sdc2` got booted off `md0` but stayed in `md1`, I didn't see any obvious messages/failures about `sdc` in `dmesg` so I added the disk back, let's see what happens ` root@logstash2022:~# cat /proc/... [07:51:11] 10Puppet, 10SRE-tools, 10Patch-For-Review, 10Python3-Porting, and 3 others: Forward port Python2 files to Python3 in Puppet Repository - https://phabricator.wikimedia.org/T247364 (10fgiunchedi) Thank you for taking care of the Python 3 migration in Puppet ! I ran into this bytes-vs-strings problem with `r... [07:52:15] 10SRE, 10ops-codfw: Degraded RAID on logstash2022 - https://phabricator.wikimedia.org/T278908 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Tentatively resolving, will get reopened if it happens again. [07:57:49] (03CR) 10Filippo Giunchedi: [C: 03+1] "Indeed in other cases we had already a "paging" contactgroup to which we added the victorops contact. 
However in the Analytics case AFAIK " [puppet] - 10https://gerrit.wikimedia.org/r/675898 (https://phabricator.wikimedia.org/T273064) (owner: 10Razzi) [08:01:28] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, non-blocking comment inline" (031 comment) [debs/grafana-loki] - 10https://gerrit.wikimedia.org/r/675864 (owner: 10Cwhite) [08:13:55] (03PS1) 10Zabe: Disable RelatedArticles on Timeless skin on German Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/675993 (https://phabricator.wikimedia.org/T278611) [08:15:23] (03PS2) 10Zabe: Disable RelatedArticles on Timeless skin on German Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/675993 (https://phabricator.wikimedia.org/T278611) [08:15:40] (03CR) 10Kormat: [C: 03+1] "Sure, why not. :)" [puppet] - 10https://gerrit.wikimedia.org/r/675308 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [08:21:22] (03CR) 10Volans: [C: 04-1] "Couple of questions and one potential issue." (033 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/674977 (https://phabricator.wikimedia.org/T222931) (owner: 10Ayounsi) [08:33:44] (03PS1) 10Matthias Mullie: Reset namespace filter on cancel [extensions/WikibaseMediaInfo] (wmf/1.36.0-wmf.37) - 10https://gerrit.wikimedia.org/r/675881 (https://phabricator.wikimedia.org/T276261) [08:34:17] (03PS1) 10Matthias Mullie: Style change to mediasearch logged-in notice close [extensions/WikibaseMediaInfo] (wmf/1.36.0-wmf.37) - 10https://gerrit.wikimedia.org/r/675882 (https://phabricator.wikimedia.org/T274927) [08:34:25] (03PS1) 10Matthias Mullie: Suppress user notice on mobile [extensions/WikibaseMediaInfo] (wmf/1.36.0-wmf.37) - 10https://gerrit.wikimedia.org/r/675883 (https://phabricator.wikimedia.org/T274927) [08:38:22] !log contint2001: stopping Puppet for an Apache config live hack [08:38:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:05] (03PS2) 10David Caro: ceph: Add octopus repo entry [puppet] - 
10https://gerrit.wikimedia.org/r/675812 (https://phabricator.wikimedia.org/T274566) [09:03:21] !log contint2001: enable puppet again [09:03:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:54] (03PS2) 10Hashar: contint: serve compressed json as application/json [puppet] - 10https://gerrit.wikimedia.org/r/675515 (https://phabricator.wikimedia.org/T249268) [09:04:41] 10SRE, 10DBA, 10Platform Engineering, 10Wikimedia-Incident: Appservers latency spike / parser cache growth 2021-03-28 - https://phabricator.wikimedia.org/T278655 (10Kormat) Adding @Marostegui for visibility. [09:05:50] 10SRE, 10Wikimedia-Mailing-lists: Reconsider which mailman3 version we're running - https://phabricator.wikimedia.org/T278905 (10Ladsgroup) [09:09:32] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10ema) The [[ https://grafana.wikimedia.org/d/M7xQ_BeWk/response-time-by-host?viewPanel=5&orgId=1&from=1613483986317&var-dc=esam... [09:10:24] (03PS3) 10Hashar: contint: serve compressed json as application/json [puppet] - 10https://gerrit.wikimedia.org/r/675515 (https://phabricator.wikimedia.org/T249268) [09:12:02] (03CR) 10Hashar: "I have live hacked it on contint2001.wikimedia.org and that works. 
I originally made a mistake setting an "encoding" header instead of "co" [puppet] - 10https://gerrit.wikimedia.org/r/675515 (https://phabricator.wikimedia.org/T249268) (owner: 10Hashar) [09:15:12] 10SRE, 10OTRS, 10Wikimedia-Mailing-lists, 10Chinese-Sites: Mailman cannot correctly decode GB2312-superset mails labelled as GB2312 (non-standard behavior) - https://phabricator.wikimedia.org/T173894 (10Meno25) [09:16:40] (03PS2) 10Phuedx: vector: Disable WVUI search widget treatment A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/675509 (https://phabricator.wikimedia.org/T276917) [09:19:33] (03CR) 10Ayounsi: Add network report (034 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/674977 (https://phabricator.wikimedia.org/T222931) (owner: 10Ayounsi) [09:20:16] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10Gilles) T277769 needs to be completed for the dashboard to be restored. Essentially the host data has to go through a new pipe... [09:21:39] (03PS4) 10Ayounsi: Add network report [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/674977 (https://phabricator.wikimedia.org/T222931) [09:21:58] (03CR) 10Volans: "reply inline" (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/674977 (https://phabricator.wikimedia.org/T222931) (owner: 10Ayounsi) [09:27:57] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Allow WMDE intern Amrutha to access Superset - https://phabricator.wikimedia.org/T271725 (10amy_rc) 05Resolved→03Open [09:30:12] (03CR) 10Volans: [C: 03+1] "thanks for the fixes, LGTM. Your call if adding the v4 check here or in a separate patch." 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/674977 (https://phabricator.wikimedia.org/T222931) (owner: 10Ayounsi) [09:30:41] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Allow WMDE intern Amrutha to access Superset - https://phabricator.wikimedia.org/T271725 (10amy_rc) @jcrespo My internship has been extended till 31.05.2021. Could you please extend my access? @Lea_WMDE could you please approve for the same? Thanks!! [09:36:29] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10ema) >>! In T264398#6959847, @Gilles wrote: > T277769 needs to be completed for the dashboard to be restored. Essentially the... [09:37:07] 10SRE, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (CI & Testing services), 10Release-Engineering-Team-TODO: debian-glue jobs ignored error messages about libeatmydata.so in LD_PRELOAD - https://phabricator.wikimedia.org/T240430 (10hashar) That can be addressed by having the `eatmyd... [09:39:56] (03CR) 10Volans: "LGTM, just one nit inline to make it more future proof." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/675787 (owner: 10Hnowlan) [09:45:43] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM. Please collect a +1 from Moritz too." 
[puppet] - 10https://gerrit.wikimedia.org/r/675812 (https://phabricator.wikimedia.org/T274566) (owner: 10David Caro) [09:46:19] (03PS5) 10Ayounsi: Add network report [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/674977 (https://phabricator.wikimedia.org/T222931) [09:46:55] (03CR) 10Ayounsi: Add network report (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/674977 (https://phabricator.wikimedia.org/T222931) (owner: 10Ayounsi) [09:47:50] (03PS1) 10Hashar: package_builder: include eatmydata in base images [puppet] - 10https://gerrit.wikimedia.org/r/676008 (https://phabricator.wikimedia.org/T240430) [09:47:53] (03CR) 10Ayounsi: [C: 03+2] Add network report [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/674977 (https://phabricator.wikimedia.org/T222931) (owner: 10Ayounsi) [09:48:24] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/674977 (https://phabricator.wikimedia.org/T222931) (owner: 10Ayounsi) [09:48:42] (03PS1) 10Jbond: icinga: fix string encoding on raid_handler port py3 port [puppet] - 10https://gerrit.wikimedia.org/r/676009 (https://phabricator.wikimedia.org/T247364) [09:51:30] (03CR) 10Hashar: "That is cause CI does set EATMYDATA=yes, configures LD_PRELOAD and set EXTRAPACKAGES=eatmydata." [puppet] - 10https://gerrit.wikimedia.org/r/676008 (https://phabricator.wikimedia.org/T240430) (owner: 10Hashar) [09:51:53] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Allow WMDE intern Amrutha to access Superset - https://phabricator.wikimedia.org/T271725 (10jcrespo) 05Open→03Resolved Hey, @amy_rc Let's create a separate ticket for that, so you don't need to go over the trouble of a regular access request- just the ex... 
[09:52:08] 10SRE, 10Continuous-Integration-Infrastructure, 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services), and 2 others: debian-glue jobs ignored error messages about libeatmydata.so in LD_PRELOAD - https://phabricator.wikimedia.org/T240430 (10hashar) a:03hashar [09:53:13] (03CR) 10DharmrajRathod98: "> Patch Set 7:" [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673693 (https://phabricator.wikimedia.org/T277754) (owner: 10DharmrajRathod98) [09:53:50] I'm merging a beta-only config change. [09:54:27] (03PS2) 10Awight: beta: ReferencePreviews out of Beta Feature mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/675889 [09:54:34] (03CR) 10Awight: [C: 03+2] beta: ReferencePreviews out of Beta Feature mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/675889 (owner: 10Awight) [09:55:39] (03Merged) 10jenkins-bot: beta: ReferencePreviews out of Beta Feature mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/675889 (owner: 10Awight) [09:56:16] (03CR) 10Jcrespo: "One nitpick: Please settle on either CamelCase or snake_case. Given PEP8 I think recommends the latter, and it is the one already in use h" [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673693 (https://phabricator.wikimedia.org/T277754) (owner: 10DharmrajRathod98) [09:57:19] 10SRE, 10DNS, 10Traffic, 10Abstract Wikipedia team (Phase δ): Establish wikifunctions.org - https://phabricator.wikimedia.org/T275904 (10Aklapper) @dr0ptp4kt: Do you know who could answer Dzahn's questions, so this ticket isn't stuck anymore? Thanks! 
[10:00:31] (03PS1) 10Ayounsi: Small network report improvements [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/676010 [10:01:12] (03CR) 10Jbond: "> Patch Set 2: Code-Review+1" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/675812 (https://phabricator.wikimedia.org/T274566) (owner: 10David Caro) [10:01:33] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Allow WMDE intern Amrutha to access Superset - https://phabricator.wikimedia.org/T271725 (10Kormat) Adding the actual current clinic on-duty person, @jijiki :) (I've updated the stale info in #wikimedia-operations' topic) [10:02:12] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Allow WMDE intern Amrutha to access Superset - https://phabricator.wikimedia.org/T271725 (10jcrespo) Sorry about that :-(. I used the outdated topic. [10:02:51] (03CR) 10Ayounsi: [C: 03+2] Small network report improvements [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/676010 (owner: 10Ayounsi) [10:03:03] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Allow WMDE intern Amrutha to access Superset - https://phabricator.wikimedia.org/T271725 (10amy_rc) >>! In T271725#6959901, @jcrespo wrote: > Hey, @amy_rc Let's create a separate ticket for that, so you don't need to go over the trouble of a regular access r... [10:04:02] hnowlan: hi! could you take a look and merge+deploy https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/675657 at some point? 
ty in advance [10:04:14] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Revert "Remove 'release' qsub label" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/675734 (https://phabricator.wikimedia.org/T278748) (owner: 10Arturo Borrero Gonzalez) [10:04:15] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Extend access Superset - https://phabricator.wikimedia.org/T278929 (10amy_rc) [10:06:54] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Extend access Superset - https://phabricator.wikimedia.org/T278929 (10jcrespo) a:03jijiki Assigning to the right person (feel free to manage that in the best way for you). :-) [10:09:10] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Extend ldap access to Superset for amy-wmde - https://phabricator.wikimedia.org/T278929 (10jcrespo) [10:10:05] !log disable puppet on all mw* hosts [10:10:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:37] (03CR) 10Kormat: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/621100 (https://phabricator.wikimedia.org/T260640) (owner: 10Dave Pifke) [10:11:58] 10SRE, 10LDAP-Access-Requests: Extend ldap access to Superset for amy-wmde - https://phabricator.wikimedia.org/T278929 (10Aklapper) [10:13:27] (03PS9) 10Jbond: C:ssh::server: refactor [puppet] - 10https://gerrit.wikimedia.org/r/675124 [10:13:29] (03PS5) 10Jbond: C:ssh::server: add support for multiple listen addresses [puppet] - 10https://gerrit.wikimedia.org/r/675131 [10:13:31] (03PS5) 10Jbond: O:gitlab: restrict gitlab ssh to only listen on the primary ip addresses [puppet] - 10https://gerrit.wikimedia.org/r/675135 [10:15:03] (03PS3) 10Hnowlan: osm: use osmimporter to do expiry when using imposm3. 
[puppet] - 10https://gerrit.wikimedia.org/r/675787 [10:17:09] Majavah: sure, will do [10:17:55] (03PS10) 10Jbond: C:ssh::server: refactor [puppet] - 10https://gerrit.wikimedia.org/r/675124 [10:18:21] (03PS5) 10Jbond: C:ssh::server: update parameter types [puppet] - 10https://gerrit.wikimedia.org/r/675127 [10:18:37] (03PS6) 10Jbond: C:ssh::server: make authorized_keys_file an Arrauy[Stdlib::Unixpath] [puppet] - 10https://gerrit.wikimedia.org/r/675128 [10:19:50] (03PS6) 10Jbond: C:ssh::server: add support for multiple listen addresses [puppet] - 10https://gerrit.wikimedia.org/r/675131 [10:20:06] (03PS7) 10Jbond: C:ssh::server: add support for multiple listen addresses [puppet] - 10https://gerrit.wikimedia.org/r/675131 [10:20:27] (03PS6) 10Jbond: O:gitlab: restrict gitlab ssh to only listen on the primary ip addresses [puppet] - 10https://gerrit.wikimedia.org/r/675135 [10:21:15] (03CR) 10Jbond: C:ssh::server: refactor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/675124 (owner: 10Jbond) [10:23:44] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28841/console" [puppet] - 10https://gerrit.wikimedia.org/r/675787 (owner: 10Hnowlan) [10:25:02] (03CR) 10Hnowlan: [V: 03+1] osm: use osmimporter to do expiry when using imposm3. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/675787 (owner: 10Hnowlan) [10:26:44] (03CR) 10DharmrajRathod98: "> Patch Set 7:" [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673693 (https://phabricator.wikimedia.org/T277754) (owner: 10DharmrajRathod98) [10:26:47] (03CR) 10Hnowlan: [V: 03+1 C: 03+2] osm: use osmimporter to do expiry when using imposm3. 
[puppet] - 10https://gerrit.wikimedia.org/r/675787 (owner: 10Hnowlan) [10:29:40] (03PS1) 10Ayounsi: Small network report improvements, round 2 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/676013 [10:31:42] (03PS2) 10Ayounsi: Small network report improvements, round 2 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/676013 [10:32:09] (03CR) 10Volans: [C: 03+1] "LGTM, thx" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/676013 (owner: 10Ayounsi) [10:33:23] (03CR) 10Ayounsi: [C: 03+2] Small network report improvements, round 2 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/676013 (owner: 10Ayounsi) [10:36:52] (03CR) 10Jcrespo: "> Patch Set 7:" [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673693 (https://phabricator.wikimedia.org/T277754) (owner: 10DharmrajRathod98) [10:37:25] (03CR) 10Jcrespo: "recheck" [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673693 (https://phabricator.wikimedia.org/T277754) (owner: 10DharmrajRathod98) [10:37:58] (03CR) 10jerkins-bot: [V: 04-1] Improved: timestamp validation in cli/recover-dump [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673693 (https://phabricator.wikimedia.org/T277754) (owner: 10DharmrajRathod98) [10:40:07] (03CR) 10Jcrespo: "There are no tests currently for recover_dump.py (and no new ones were introduced), but check already the flake8 warnings about style." 
[software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673693 (https://phabricator.wikimedia.org/T277754) (owner: 10DharmrajRathod98) [10:40:51] (03CR) 10Jbond: [C: 04-1] package_builder: include eatmydata in base images (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/676008 (https://phabricator.wikimedia.org/T240430) (owner: 10Hashar) [10:41:06] (03PS16) 10Effie Mouzeli: mediawiki::mcrouter: add onhost memcached unix socket support [puppet] - 10https://gerrit.wikimedia.org/r/663565 (https://phabricator.wikimedia.org/T273115) [10:44:49] (03CR) 10Alexandros Kosiaris: [C: 04-1] package_builder: include eatmydata in base images (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/676008 (https://phabricator.wikimedia.org/T240430) (owner: 10Hashar) [10:45:13] (03CR) 10Effie Mouzeli: [C: 03+2] mediawiki::mcrouter: add onhost memcached unix socket support [puppet] - 10https://gerrit.wikimedia.org/r/663565 (https://phabricator.wikimedia.org/T273115) (owner: 10Effie Mouzeli) [10:45:25] (03PS1) 10ArielGlenn: clean up check_fragments_file argument handling in page content batches test [dumps] - 10https://gerrit.wikimedia.org/r/676032 [10:48:27] !log enable puppet on all mw* servers [10:48:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:54] (03CR) 10Effie Mouzeli: [C: 03+2] hieradata: enable memcached socket mwdebug1001, mwdebug2001 [puppet] - 10https://gerrit.wikimedia.org/r/663796 (https://phabricator.wikimedia.org/T273115) (owner: 10Effie Mouzeli) [10:51:11] (03PS11) 10Effie Mouzeli: hieradata: enable memcached socket mwdebug1001, mwdebug2001 [puppet] - 10https://gerrit.wikimedia.org/r/663796 (https://phabricator.wikimedia.org/T273115) [10:52:15] (03PS1) 10Arturo Borrero Gonzalez: gridengine: cleaner release default [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/676033 (https://phabricator.wikimedia.org/T278748) [10:54:29] (03CR) 10DharmrajRathod98: "> Patch Set 7:" [software/wmfbackups] - 
10https://gerrit.wikimedia.org/r/673693 (https://phabricator.wikimedia.org/T277754) (owner: 10DharmrajRathod98) [10:54:29] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [10:57:00] (03CR) 10Hnowlan: changeprop: Update beta servers (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/675657 (owner: 10Majavah) [10:58:00] (03CR) 10Majavah: changeprop: Update beta servers (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/675657 (owner: 10Majavah) [10:58:25] (03PS5) 10Alexandros Kosiaris: mediawiki: Enable CPUAccounting for various components [puppet] - 10https://gerrit.wikimedia.org/r/675237 (https://phabricator.wikimedia.org/T278220) [10:58:54] (03CR) 10Hnowlan: changeprop: Update beta servers (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/675657 (owner: 10Majavah) [10:58:56] (03CR) 10Hnowlan: [C: 03+2] changeprop: Update beta servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/675657 (owner: 10Majavah) [10:59:13] (03PS3) 10Phuedx: vector: Disable WVUI search widget treatment A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/675509 (https://phabricator.wikimedia.org/T276917) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for European mid-day backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210331T1100). [11:00:04] Seddon, matthiasmullie, and phuedx: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[11:00:12] o/ [11:00:25] o/ [11:00:43] I have to be afk, sorry [11:01:01] (03Merged) 10jenkins-bot: changeprop: Update beta servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/675657 (owner: 10Majavah) [11:01:38] I’m making lunch but could probably do the deployment on the side if no one else is available [11:02:26] * Lucas_WMDE looks at the WBMI backports [11:02:31] I can deploy :-) [11:02:38] ok! [11:02:43] PROBLEM - Memcached on mwdebug1001 is CRITICAL: connect to address 10.64.32.123 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [11:02:45] The 3 WikibaseMediaInfo patches don't need to be staged on mwdebug - they can't be tested [11:02:54] I can do them myself if that make anyone's life easier :p [11:03:03] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS (DIFF 9): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28842/console" [puppet] - 10https://gerrit.wikimedia.org/r/675237 (https://phabricator.wikimedia.org/T278220) (owner: 10Alexandros Kosiaris) [11:03:10] Hi de hi [11:03:17] matthiasmullie: Up to you, I'm happy to press the merges :-) [11:03:31] Lucas_WMDE: I'm also here, but I also need to make lunch :D [11:03:51] matthiasmullie: Here goes! [11:03:53] awight: I'll let you do it if you don't mind [11:03:58] ack [11:03:59] Then I can make lunch as well :p [11:04:46] Looks like we're backporting to group0. [11:05:51] yep; it's just prep before it rolls over to group1 [11:06:01] (03CR) 10Awight: [C: 03+2] "Backport window." [extensions/WikibaseMediaInfo] (wmf/1.36.0-wmf.37) - 10https://gerrit.wikimedia.org/r/675882 (https://phabricator.wikimedia.org/T274927) (owner: 10Matthias Mullie) [11:06:29] * awight leans forward at seeing .vue [11:06:59] (03CR) 10Awight: [C: 03+2] "Backport window." 
[extensions/WikibaseMediaInfo] (wmf/1.36.0-wmf.37) - 10https://gerrit.wikimedia.org/r/675883 (https://phabricator.wikimedia.org/T274927) (owner: 10Matthias Mullie) [11:07:50] matthiasmullie: minor thing, it's nicer to squash backports like this, when they're intended to be deployed together. [11:08:43] (03CR) 10Awight: [C: 03+2] "Backport window." [extensions/WikibaseMediaInfo] (wmf/1.36.0-wmf.37) - 10https://gerrit.wikimedia.org/r/675881 (https://phabricator.wikimedia.org/T276261) (owner: 10Matthias Mullie) [11:09:18] awight: gotcha, I'll remember that for next time [11:09:48] I'll kinda simulate that today by pushing all three patches together and leaving a messy scap message ;-) [11:11:54] that's exactly how I've been doing it all along :p [11:12:02] 8D [11:12:13] phuedx: I'll merge your config while CI churns... [11:12:29] awight: ta [11:13:14] (03CR) 10Awight: [C: 03+2] "Config window." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/675509 (https://phabricator.wikimedia.org/T276917) (owner: 10Phuedx) [11:14:07] (03Merged) 10jenkins-bot: vector: Disable WVUI search widget treatment A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/675509 (https://phabricator.wikimedia.org/T276917) (owner: 10Phuedx) [11:14:27] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS (NOOP 9 DIFF 9): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28843/console" [puppet] - 10https://gerrit.wikimedia.org/r/675237 (https://phabricator.wikimedia.org/T278220) (owner: 10Alexandros Kosiaris) [11:15:35] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [11:17:39] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/676009 (https://phabricator.wikimedia.org/T247364) (owner: 10Jbond) [11:17:57] phuedx: 
Ready to test, on mwdebug1002 [11:21:03] awight: Thanks! Olga Vasileva and I have kicked the proverbial tyres. It looks good to us [11:23:59] phuedx: Great, deploying now. [11:26:16] !log awight@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:675509|vector: Disable WVUI search widget treatment A/B test (T276917)]] (duration: 01m 08s) [11:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:24] T276917: Turn off search A/B test and go to 100% for all logged in users - https://phabricator.wikimedia.org/T276917 [11:28:17] Thanks, awight! [11:28:28] :-) thanks for testing! [11:33:24] (03Merged) 10jenkins-bot: Style change to mediasearch logged-in notice close [extensions/WikibaseMediaInfo] (wmf/1.36.0-wmf.37) - 10https://gerrit.wikimedia.org/r/675882 (https://phabricator.wikimedia.org/T274927) (owner: 10Matthias Mullie) [11:33:27] (03Merged) 10jenkins-bot: Suppress user notice on mobile [extensions/WikibaseMediaInfo] (wmf/1.36.0-wmf.37) - 10https://gerrit.wikimedia.org/r/675883 (https://phabricator.wikimedia.org/T274927) (owner: 10Matthias Mullie) [11:34:50] (03Merged) 10jenkins-bot: Reset namespace filter on cancel [extensions/WikibaseMediaInfo] (wmf/1.36.0-wmf.37) - 10https://gerrit.wikimedia.org/r/675881 (https://phabricator.wikimedia.org/T276261) (owner: 10Matthias Mullie) [11:35:01] matthiasmullie: Okay, deploying straight to group0 without testing. 
[11:35:10] perfect, thanks [11:38:26] !log awight@deploy1002 Synchronized php-1.36.0-wmf.37/extensions/WikibaseMediaInfo: Backport: [[gerrit:675882|Style change to mediasearch logged-in notice close (T274927)]] [[gerrit:675883|Suppress user notice on mobile (T274927)]] [[gerrit:675881|Reset namespace filter on cancel (T276261)]] (duration: 01m 08s) [11:38:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:38] T276261: [L] Build namespace filter feature in JS UI - https://phabricator.wikimedia.org/T276261 [11:38:38] T274927: [M] Add messaging to the MediaSearch page linking back to Special:Search and indicating that there is a preference - https://phabricator.wikimedia.org/T274927 [11:38:47] !log EU deployment complete [11:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:57] Thank you, awight ! [11:39:21] matthiasmullie: It was a vicarious thrill to see the Vue code fly by, best of luck with that! [11:39:32] Majavah: deployed [11:39:41] thanks! 
[11:39:57] (03PS1) 10Jbond: debian: add sid as valid codename [puppet] - 10https://gerrit.wikimedia.org/r/676034 [11:39:59] (03PS1) 10Jbond: package_builder: add ability to inject extra packages [puppet] - 10https://gerrit.wikimedia.org/r/676035 (https://phabricator.wikimedia.org/T240430) [11:40:01] (03PS1) 10Jbond: P:ci::package_builder: update profile to pass through extra packages [puppet] - 10https://gerrit.wikimedia.org/r/676036 (https://phabricator.wikimedia.org/T240430) [11:40:03] (03PS1) 10Jbond: cloud - hieradata: add extra_packages eatmydata to integrations package_builder [puppet] - 10https://gerrit.wikimedia.org/r/676037 (https://phabricator.wikimedia.org/T240430) [11:41:11] (03CR) 10jerkins-bot: [V: 04-1] package_builder: add ability to inject extra packages [puppet] - 10https://gerrit.wikimedia.org/r/676035 (https://phabricator.wikimedia.org/T240430) (owner: 10Jbond) [11:41:21] (03CR) 10Jbond: [C: 03+2] debian: add sid as valid codename [puppet] - 10https://gerrit.wikimedia.org/r/676034 (owner: 10Jbond) [11:41:42] (03PS2) 10Jbond: package_builder: add ability to inject extra packages [puppet] - 10https://gerrit.wikimedia.org/r/676035 (https://phabricator.wikimedia.org/T240430) [11:42:23] (03CR) 10jerkins-bot: [V: 04-1] package_builder: add ability to inject extra packages [puppet] - 10https://gerrit.wikimedia.org/r/676035 (https://phabricator.wikimedia.org/T240430) (owner: 10Jbond) [11:43:24] (03PS3) 10Jbond: package_builder: add ability to inject extra packages [puppet] - 10https://gerrit.wikimedia.org/r/676035 (https://phabricator.wikimedia.org/T240430) [11:43:40] (03PS2) 10Jbond: P:ci::package_builder: update profile to pass through extra packages [puppet] - 10https://gerrit.wikimedia.org/r/676036 (https://phabricator.wikimedia.org/T240430) [11:44:00] (03PS2) 10Jbond: cloud - hieradata: add extra_packages eatmydata to integrations package_builder [puppet] - 10https://gerrit.wikimedia.org/r/676037 (https://phabricator.wikimedia.org/T240430) 
[11:44:54] (03CR) 10Jbond: [C: 04-1] package_builder: include eatmydata in base images (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/676008 (https://phabricator.wikimedia.org/T240430) (owner: 10Hashar) [11:46:39] (03CR) 10Jbond: [C: 03+2] icinga: fix string encoding on raid_handler port py3 port [puppet] - 10https://gerrit.wikimedia.org/r/676009 (https://phabricator.wikimedia.org/T247364) (owner: 10Jbond) [11:57:09] PROBLEM - memcached socket on mwdebug2001 is CRITICAL: connect to file socket /run/memcached/memcached.sock: No such file or directory https://wikitech.wikimedia.org/wiki/Memcached [12:00:23] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:04:49] (03PS7) 10Kormat: mariadb: Convert pt-heartbeat to a systemd service. [puppet] - 10https://gerrit.wikimedia.org/r/665324 (https://phabricator.wikimedia.org/T252528) [12:07:03] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The following units failed: wmde-toolkit-analyzer-build.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:18:33] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3651 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [12:22:57] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.01587 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [12:24:25] RECOVERY - Check systemd state on pki2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:28:59] (03CR) 10Alexandros Kosiaris: [C: 03+1] package_builder: add ability to inject extra packages 
[puppet] - 10https://gerrit.wikimedia.org/r/676035 (https://phabricator.wikimedia.org/T240430) (owner: 10Jbond) [12:29:35] (03CR) 10Alexandros Kosiaris: [C: 03+1] P:ci::package_builder: update profile to pass through extra packages [puppet] - 10https://gerrit.wikimedia.org/r/676036 (https://phabricator.wikimedia.org/T240430) (owner: 10Jbond) [12:29:46] (03CR) 10Alexandros Kosiaris: [C: 03+1] cloud - hieradata: add extra_packages eatmydata to integrations package_builder [puppet] - 10https://gerrit.wikimedia.org/r/676037 (https://phabricator.wikimedia.org/T240430) (owner: 10Jbond) [12:31:07] PROBLEM - Check systemd state on pki2001 is CRITICAL: CRITICAL - degraded: The following units failed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:51:20] 10SRE, 10serviceops, 10Sustainability (Incident Followup): Add alerting for Memcached timeout errors - https://phabricator.wikimedia.org/T278946 (10jbond) p:05Triage→03Medium [12:56:29] 10SRE, 10Sustainability (Incident Followup): Update Runboook wikis for the application and LVS servers - https://phabricator.wikimedia.org/T278948 (10jbond) p:05Triage→03Medium [12:59:01] !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=jobrunner [12:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:17] !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=videoscaler [12:59:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:41] !log repool all jobrunners/videoscalers in the respective conftool clusters [12:59:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:05] twentyafterfour and hashar: I, the Bot under the Fountain, allow thee, The Deployer, to do Mediawiki train - American+European Version (secondary timeslot) deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210331T1300). [13:00:27] !log repool all jobrunners/videoscalers in the respective conftool clusters. The video transcoding backlog has been served we can return to "normal" [13:00:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:45] (03CR) 10Effie Mouzeli: [C: 03+1] "Well played" [puppet] - 10https://gerrit.wikimedia.org/r/675237 (https://phabricator.wikimedia.org/T278220) (owner: 10Alexandros Kosiaris) [13:20:27] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:22:49] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:24:01] 10SRE, 10Wikimedia-Mailing-lists: Import several public mailing lists archives from mailman2 to lists-next to measure database size - https://phabricator.wikimedia.org/T278609 (10Ladsgroup) If we need to have a workaround, we can just drop all bounce information from those mailing lists (also, probably we shou... 
[13:24:10] 10SRE, 10serviceops, 10Parsoid (Tracking): Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10jijiki) [13:28:08] (03PS1) 10Effie Mouzeli: profile::parsoid: remove parsoid class from parsoid profile [puppet] - 10https://gerrit.wikimedia.org/r/676068 (https://phabricator.wikimedia.org/T268524) [13:33:52] (03PS1) 10Effie Mouzeli: modules: remove parsoidJS puppet module [puppet] - 10https://gerrit.wikimedia.org/r/676071 (https://phabricator.wikimedia.org/T268524) [13:34:05] !log disabling puppet on role::mediawiki::appserver, role::mediawiki::appserver::api, role::mediawiki::maintenance, role::mediawiki::jobrunner, role::parsoid, role::parsoid::testing T278220 [13:34:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:14] T278220: Define the size of a pod for mediawiki in terms of resource usage - https://phabricator.wikimedia.org/T278220 [13:39:34] !log revert mw1412, mw1413, wtp1032, mw2305 to the previous state for T278220 [13:39:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:41] T278220: Define the size of a pod for mediawiki in terms of resource usage - https://phabricator.wikimedia.org/T278220 [13:40:27] PROBLEM - puppet last run on mw2305 is CRITICAL: CRITICAL: Puppet last ran 3 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:41:31] PROBLEM - puppet last run on wtp1032 is CRITICAL: CRITICAL: Puppet last ran 2 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:46:19] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:46:49] RECOVERY - puppet last run on mw2305 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:47:51] RECOVERY - puppet last run on wtp1032 is OK: OK: 
Puppet is currently disabled (Slow rollout of T278220), not alerting. Last run 6 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:47:54] (03CR) 10Alexandros Kosiaris: [V: 03+1 C: 03+2] mediawiki: Enable CPUAccounting for various components [puppet] - 10https://gerrit.wikimedia.org/r/675237 (https://phabricator.wikimedia.org/T278220) (owner: 10Alexandros Kosiaris) [13:48:24] !log retrying s3 snapshot on codfw [13:48:27] (03CR) 10Alexandros Kosiaris: [V: 03+1 C: 03+2] "Merging per the plan in the task (slightly varied to run it per role). Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/675237 (https://phabricator.wikimedia.org/T278220) (owner: 10Alexandros Kosiaris) [13:48:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:52] (03CR) 10Alexandros Kosiaris: mediawiki: Enable CPUAccounting for various components (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/675237 (https://phabricator.wikimedia.org/T278220) (owner: 10Alexandros Kosiaris) [13:49:13] RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:51:00] (03CR) 10Effie Mouzeli: "PCC https://puppet-compiler.wmflabs.org/compiler1003/28845/" [puppet] - 10https://gerrit.wikimedia.org/r/676068 (https://phabricator.wikimedia.org/T268524) (owner: 10Effie Mouzeli) [13:57:31] 10SRE, 10Wikimedia-Mailing-lists: Reconsider which mailman3 version we're running - https://phabricator.wikimedia.org/T278905 (10jijiki) p:05Triage→03Medium [14:02:24] !log Server side upload of two video files (T278961, T278960) [14:02:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:34] T278961: Server side upload for Sturm - https://phabricator.wikimedia.org/T278961 [14:02:34] T278960: Server side upload for Sturm - https://phabricator.wikimedia.org/T278960 [14:04:01] 10SRE, 10LDAP-Access-Requests: 
Extend ldap access to Superset for amy-wmde - https://phabricator.wikimedia.org/T278929 (10Lea_WMDE) 05Open→03Invalid Rights have already been extended by @MoritzMuehlenhoff, sorry @amy_rc I forgot to pass that info on [14:04:05] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Allow WMDE intern Amrutha to access Superset - https://phabricator.wikimedia.org/T271725 (10Lea_WMDE) [14:07:28] 10SRE, 10SRE-swift-storage: Problems generating thumbnails - https://phabricator.wikimedia.org/T278969 (10Peachey88) [14:08:55] 10SRE, 10SRE-swift-storage: Problems generating thumbnails - https://phabricator.wikimedia.org/T278969 (10Peachey88) [14:16:28] 10SRE, 10SRE-swift-storage: Problems generating thumbnails - https://phabricator.wikimedia.org/T278969 (10jcrespo) User Askeuhd is currently uploading files at a speed of ~>130 files per minute. I would guess it could be related to that. [14:17:13] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:17:52] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=aqs1007.eqiad.wmnet [14:17:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:33] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:27:23] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 59695456 and 10 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:29:47] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 590456 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:32:15] RECOVERY - mediawiki-installation DSH group on parse2001 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [14:32:41] RECOVERY - memcached socket on mwdebug2001 is OK: TCP OK - 0.000 second response time on socket /run/memcached/memcached.sock https://wikitech.wikimedia.org/wiki/Memcached [14:44:26] (03CR) 10Cwhite: update to 2.2.0 (031 comment) [debs/grafana-loki] - 10https://gerrit.wikimedia.org/r/675864 (owner: 10Cwhite) [14:50:06] (03CR) 10MSantos: [C: 03+1] postgres: use remote script on replica to resync [cookbooks] - 10https://gerrit.wikimedia.org/r/666113 (https://phabricator.wikimedia.org/T275381) (owner: 10Hnowlan) [14:53:49] (03CR) 10Volans: [C: 04-1] "Potential issue, see inline" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/666113 (https://phabricator.wikimedia.org/T275381) (owner: 10Hnowlan) [14:57:41] !log disconnecting ps1-d8-codfw for replacement [14:57:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:49] (03CR) 10Alexandros Kosiaris: "> Patch Set 5:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/674562 (https://phabricator.wikimedia.org/T277297) (owner: 10Hnowlan) [15:02:05] PROBLEM - Host ps1-d8-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:03:22] 10Puppet, 10SRE, 10SRE-tools: Private puppet commit hook checks current state of folder, not what is staged - 
https://phabricator.wikimedia.org/T278187 (10jbond) > Or ensure that when you're committing, you commit everything and don't leave a dirty state. I think i prefer this option then we could do somethi... [15:05:43] PROBLEM - Juniper alarms on asw-d-codfw is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [15:07:23] Juniper alarms is me, switch running on 1 power supply [15:09:38] !log hnowlan@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [15:09:38] !log hnowlan@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [15:09:42] 10SRE, 10Fundraising-Backlog, 10Traffic, 10fr-donorservices, and 2 others: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (10EWilfong_WMF) Noting that the example domain (links.e.cantwell.com) also receives a score of A on [[ https://www.ssllabs.com/ssltest/analyze.html?... 
[15:09:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:46] 10Puppet, 10SRE-tools, 10Patch-For-Review, 10Python3-Porting, and 3 others: Forward port Python2 files to Python3 in Puppet Repository - https://phabricator.wikimedia.org/T247364 (10crusnov) [15:09:48] 10Puppet, 10SRE, 10Continuous-Integration-Config, 10Patch-For-Review, and 2 others: Shell/Python/other scripts should not be generated by ERB files; dynamic parts should be a simple ERB config file - https://phabricator.wikimedia.org/T254480 (10crusnov) [15:09:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:39] RECOVERY - Juniper alarms on asw-d-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [15:19:43] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:22:01] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:26:36] (03CR) 10Filippo Giunchedi: [C: 03+1] update to 2.2.0 (031 comment) [debs/grafana-loki] - 10https://gerrit.wikimedia.org/r/675864 (owner: 10Cwhite) [15:38:37] 10SRE, 10SRE-tools, 10IPv6, 10User-jbond: Some Foundation clusters do not appear to support IPv6 - https://phabricator.wikimedia.org/T271136 (10crusnov) >>! In T271136#6954616, @elukey wrote: > We have a special setting in commons.yaml, `kafka_brokers_main`, that it is used IIRC to instruct zookeeper about... 
[15:45:36] 10SRE, 10SRE-tools, 10IPv6, 10User-jbond: Some Foundation clusters do not appear to support IPv6 - https://phabricator.wikimedia.org/T271136 (10elukey) @crusnov the eqiad ones have AAAA records afaics, so we should be good on that side. For the codfw ones, I'd pick one host (say kafka-main2001) and I'd add... [15:49:56] 10Puppet, 10SRE-tools, 10Patch-For-Review, 10Python3-Porting, and 3 others: Forward port Python2 files to Python3 in Puppet Repository - https://phabricator.wikimedia.org/T247364 (10crusnov) [15:49:58] 10Puppet, 10SRE-tools, 10Patch-For-Review, 10Python3-Porting, and 3 others: Forward port Python2 files to Python3 in Puppet Repository - https://phabricator.wikimedia.org/T247364 (10crusnov) - added my full working todo list for this project [15:56:22] (03CR) 10Krinkle: [C: 03+1] "What kind of glitch?" [puppet] - 10https://gerrit.wikimedia.org/r/675515 (https://phabricator.wikimedia.org/T249268) (owner: 10Hashar) [16:06:09] (03PS2) 10David Caro: WIP step_by_step: Added cli option to ask confirmation before each command [software/spicerack] - 10https://gerrit.wikimedia.org/r/667170 [16:13:08] (03CR) 10jerkins-bot: [V: 04-1] WIP step_by_step: Added cli option to ask confirmation before each command [software/spicerack] - 10https://gerrit.wikimedia.org/r/667170 (owner: 10David Caro) [16:22:27] (03PS1) 10Papaul: Add test pdu ps-test-d8-codfw [puppet] - 10https://gerrit.wikimedia.org/r/676098 (https://phabricator.wikimedia.org/T265435) [16:24:32] (03CR) 10Papaul: [C: 03+2] Add test pdu ps-test-d8-codfw [puppet] - 10https://gerrit.wikimedia.org/r/676098 (https://phabricator.wikimedia.org/T265435) (owner: 10Papaul) [16:30:30] (03CR) 10BryanDavis: [C: 03+1] "untested, but the changes look logically correct" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/676033 (https://phabricator.wikimedia.org/T278748) (owner: 10Arturo Borrero Gonzalez) [16:34:01] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: 
POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 185698208 and 9 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:36:17] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 723192 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:39:41] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [16:41:25] (03CR) 10Majavah: update to 2.2.0 (031 comment) [debs/grafana-loki] - 10https://gerrit.wikimedia.org/r/675864 (owner: 10Cwhite) [16:46:18] 10SRE, 10SRE-swift-storage: Problems generating thumbnails - https://phabricator.wikimedia.org/T278969 (10jcrespo) @Wilfredor is this still happening to you? I saw some contention at the time of your report, due to high upload rate, but not at the moment. It could be datacenter-dependent, though. [17:00:38] (03PS3) 10Cwhite: update to 2.2.0 [debs/grafana-loki] - 10https://gerrit.wikimedia.org/r/675864 [17:01:35] (03PS4) 10Cwhite: update to 2.2.0 [debs/grafana-loki] - 10https://gerrit.wikimedia.org/r/675864 [17:01:47] !log Server side upload of three video files (T278959, T278958, T278957) [17:02:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:02] T278958: Server side upload for Sturm - https://phabricator.wikimedia.org/T278958 [17:02:02] T278957: Server side upload for Sturm - https://phabricator.wikimedia.org/T278957 [17:02:02] T278959: Server side upload for Sturm - https://phabricator.wikimedia.org/T278959 [17:02:39] (03CR) 10Cwhite: update to 2.2.0 (032 comments) [debs/grafana-loki] - 10https://gerrit.wikimedia.org/r/675864 (owner: 10Cwhite) [17:06:30] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [debs/grafana-loki] - 10https://gerrit.wikimedia.org/r/675864 (owner: 10Cwhite) [17:15:59] (03CR) 10Subramanya Sastry: [C: 04-1] 
"parsood-testing.nginx.conf.erb is not production config. It is used on testreduce1001 for our test infrastructure and should not be remove" [puppet] - 10https://gerrit.wikimedia.org/r/676071 (https://phabricator.wikimedia.org/T268524) (owner: 10Effie Mouzeli) [17:19:58] (03CR) 10Subramanya Sastry: profile::parsoid: remove parsoid class from parsoid profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/676068 (https://phabricator.wikimedia.org/T268524) (owner: 10Effie Mouzeli) [17:23:06] (03CR) 10Cwhite: [V: 03+2 C: 03+2] update to 2.2.0 [debs/grafana-loki] - 10https://gerrit.wikimedia.org/r/675864 (owner: 10Cwhite) [17:34:26] /away [17:37:00] (03PS1) 10Jbond: P:debmonitor::server: serve static files directly from apache [puppet] - 10https://gerrit.wikimedia.org/r/676110 [17:38:58] (03PS2) 10Jbond: P:debmonitor::server: serve static files directly from apache [puppet] - 10https://gerrit.wikimedia.org/r/676110 [17:41:42] !log The train is now unblocked, promoting to group0 refs T278343 [17:41:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:50] T278343: 1.36.0-wmf.37 deployment blockers - https://phabricator.wikimedia.org/T278343 [17:43:26] (03PS3) 10Jbond: P:debmonitor::server: serve static files directly from apache [puppet] - 10https://gerrit.wikimedia.org/r/676110 [17:44:19] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28849/console" [puppet] - 10https://gerrit.wikimedia.org/r/676110 (owner: 10Jbond) [17:44:35] (03PS1) 1020after4: group0 wikis to 1.36.0-wmf.37 refs T278343 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676111 [17:44:37] (03CR) 1020after4: [C: 03+2] group0 wikis to 1.36.0-wmf.37 refs T278343 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676111 (owner: 1020after4) [17:45:21] (03Merged) 10jenkins-bot: group0 wikis to 1.36.0-wmf.37 refs T278343 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676111 
(owner: 1020after4) [17:45:53] (03CR) 10Volans: [C: 03+1] "LGTM, thanks!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/676110 (owner: 10Jbond) [17:47:23] (03PS4) 10Jbond: P:debmonitor::server: serve static files directly from apache [puppet] - 10https://gerrit.wikimedia.org/r/676110 [17:47:45] (03CR) 10Jbond: P:debmonitor::server: serve static files directly from apache (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/676110 (owner: 10Jbond) [17:47:57] (03CR) 1020after4: [C: 03+2] Revert "Re-apply "Deprecate constructing revision with non-proper page"" [core] (wmf/1.36.0-wmf.37) - 10https://gerrit.wikimedia.org/r/675875 (https://phabricator.wikimedia.org/T278376) (owner: 10Ppchelko) [17:49:02] !log twentyafterfour@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.37 refs T278343 [17:49:06] (03CR) 10Dzahn: [C: 03+2] "merging per "live-hacked on contint2001"" [puppet] - 10https://gerrit.wikimedia.org/r/675515 (https://phabricator.wikimedia.org/T249268) (owner: 10Hashar) [17:49:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:10] T278343: 1.36.0-wmf.37 deployment blockers - https://phabricator.wikimedia.org/T278343 [17:49:53] (03CR) 10Jbond: [C: 03+2] P:debmonitor::server: serve static files directly from apache [puppet] - 10https://gerrit.wikimedia.org/r/676110 (owner: 10Jbond) [17:50:01] (03CR) 10Dzahn: "fail. 
Failed to parse template contint/apache/proxy_jenkins.erb" [puppet] - 10https://gerrit.wikimedia.org/r/675515 (https://phabricator.wikimedia.org/T249268) (owner: 10Hashar) [17:50:13] (03PS1) 10Jbond: P:debmonitor: clean up absented files [puppet] - 10https://gerrit.wikimedia.org/r/676112 [17:50:29] (03PS1) 10Dzahn: Revert "contint: serve compressed json as application/json" [puppet] - 10https://gerrit.wikimedia.org/r/676053 [17:50:49] (03CR) 10Volans: [C: 03+1] "pre-emptive LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/676112 (owner: 10Jbond) [17:51:19] RECOVERY - Ensure local MW versions match expected deployment on parse2001 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [17:53:19] going back to 36 on group0 for the moment while I wait on CI For the above patch [17:53:34] (03PS1) 10Jbond: P:debmonitor::server: remove old include file [puppet] - 10https://gerrit.wikimedia.org/r/676114 [17:53:50] (03PS1) 1020after4: group0 wikis to 1.36.0-wmf.36 refs T278343 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676115 [17:53:52] (03CR) 1020after4: [C: 03+2] group0 wikis to 1.36.0-wmf.36 refs T278343 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676115 (owner: 1020after4) [17:54:11] (03PS1) 10Dzahn: contint: fix syntax in erb template for jenkins proxy [puppet] - 10https://gerrit.wikimedia.org/r/676118 (https://phabricator.wikimedia.org/T249268) [17:54:47] (03CR) 10Jbond: [C: 03+2] P:debmonitor::server: remove old include file [puppet] - 10https://gerrit.wikimedia.org/r/676114 (owner: 10Jbond) [17:55:20] (03Merged) 10jenkins-bot: group0 wikis to 1.36.0-wmf.36 refs T278343 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676115 (owner: 1020after4) [17:56:47] !log twentyafterfour@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.36 refs T278343 [17:56:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:55] T278343: 1.36.0-wmf.37 deployment 
blockers - https://phabricator.wikimedia.org/T278343 [17:57:17] (03CR) 10Dzahn: [C: 03+2] contint: fix syntax in erb template for jenkins proxy [puppet] - 10https://gerrit.wikimedia.org/r/676118 (https://phabricator.wikimedia.org/T249268) (owner: 10Dzahn) [17:58:09] (03CR) 10Dzahn: "fixed with https://gerrit.wikimedia.org/r/c/operations/puppet/+/676118" [puppet] - 10https://gerrit.wikimedia.org/r/675515 (https://phabricator.wikimedia.org/T249268) (owner: 10Hashar) [17:58:24] (03Abandoned) 10Dzahn: Revert "contint: serve compressed json as application/json" [puppet] - 10https://gerrit.wikimedia.org/r/676053 (owner: 10Dzahn) [17:58:44] mutante: ouch :/ Guess next time I should really use the ppc :/ [18:00:04] twentyafterfour and hashar: My dear minions, it's time we take the moon! Just kidding. Time for Train log triage with CPT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210331T1800). [18:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210331T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:08:17] yep, np [18:11:56] (03CR) 10Effie Mouzeli: "For the keys we emit to both DCs, wouldn't we want to read those keys from the one we are making the GET request from?" [puppet] - 10https://gerrit.wikimedia.org/r/654330 (owner: 10Aaron Schulz) [18:11:59] (03CR) 10Ori.livneh: "This change is ready for review." [extensions/Wikibase] (wmf/1.36.0-wmf.36) - 10https://gerrit.wikimedia.org/r/676124 (owner: 10Ori.livneh) [18:13:12] (03CR) 10Ori.livneh: "This change is ready for review." 
[extensions/Wikibase] (wmf/1.36.0-wmf.37) - 10https://gerrit.wikimedia.org/r/676125 (owner: 10Ori.livneh) [18:16:55] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:18:24] (03Abandoned) 10Hashar: package_builder: include eatmydata in base images [puppet] - 10https://gerrit.wikimedia.org/r/676008 (https://phabricator.wikimedia.org/T240430) (owner: 10Hashar) [18:18:59] (03Merged) 10jenkins-bot: Revert "Re-apply "Deprecate constructing revision with non-proper page"" [core] (wmf/1.36.0-wmf.37) - 10https://gerrit.wikimedia.org/r/675875 (https://phabricator.wikimedia.org/T278376) (owner: 10Ppchelko) [18:19:13] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:21:39] 10SRE, 10DNS, 10Traffic, 10Abstract Wikipedia team (Phase δ): Establish wikifunctions.org - https://phabricator.wikimedia.org/T275904 (10Dzahn) I can tell that about 3 hours ago MarkMonitor changed the name servers to the MarkMonitor servers for: wikifunctions.org wikilambda.com wikilamb... [18:22:42] (03CR) 10Hashar: [C: 03+1] "Looks good thank you!" 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/676035 (https://phabricator.wikimedia.org/T240430) (owner: 10Jbond) [18:23:18] (03CR) 10Hashar: [C: 03+1] P:ci::package_builder: update profile to pass through extra packages [puppet] - 10https://gerrit.wikimedia.org/r/676036 (https://phabricator.wikimedia.org/T240430) (owner: 10Jbond) [18:24:02] (03CR) 10Hashar: [C: 03+1] cloud - hieradata: add extra_packages eatmydata to integrations package_builder [puppet] - 10https://gerrit.wikimedia.org/r/676037 (https://phabricator.wikimedia.org/T240430) (owner: 10Jbond) [18:24:11] hashar: fyi https://puppet.com/docs/puppet/5.5/lang_data_string.html#heredocs [18:24:26] (03CR) 10Jbond: [C: 03+2] package_builder: add ability to inject extra packages [puppet] - 10https://gerrit.wikimedia.org/r/676035 (https://phabricator.wikimedia.org/T240430) (owner: 10Jbond) [18:24:32] (03CR) 10Jbond: [C: 03+2] P:ci::package_builder: update profile to pass through extra packages [puppet] - 10https://gerrit.wikimedia.org/r/676036 (https://phabricator.wikimedia.org/T240430) (owner: 10Jbond) [18:24:36] (03CR) 10Jbond: [C: 03+2] cloud - hieradata: add extra_packages eatmydata to integrations package_builder [puppet] - 10https://gerrit.wikimedia.org/r/676037 (https://phabricator.wikimedia.org/T240430) (owner: 10Jbond) [18:24:49] jbond42: ah so that is puppet specific right? ;) [18:25:14] jbond42: at least it is well documented thank you. [18:25:19] puppet specific format yes [18:26:03] the whole series of patch looks fine as well. Thanks ! 
[18:26:20] hashar: np, they have all just been merged now [18:26:25] \o/ [18:26:51] slicing one minor techdebt after the other [18:27:12] :) [18:34:25] jbond42: will check tomorrow if that did the trick and let you know [18:35:06] hashar: ack although im off for the rest of the week but mori.tzm is back in tomorrow [18:36:30] no worries ;) [18:38:07] and I discover systemd replaces cron nowadays \o/ [18:40:39] jbond42: the extra packages are not added to the cowbuilder update one though [18:41:07] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [18:42:32] (03PS1) 10Hashar: package_builder: also inject extra packages on update [puppet] - 10https://gerrit.wikimedia.org/r/676127 (https://phabricator.wikimedia.org/T240430) [18:45:05] hashar: couldn't parse that last comment, however the end result should be the same as your original PS except only for the profile::ci::package_builder host [18:45:28] 10SRE, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install payments100[5-8] - https://phabricator.wikimedia.org/T266481 (10Jgreen) [18:45:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need By: TBD) install second SSD into payments100[5-8] - https://phabricator.wikimedia.org/T278250 (10Jgreen) [18:48:47] jbond42: ah sorry my bad. I am doing too many things at the same time. I just noticed that the daily "cowbuilder --update" could use the extra packages parameter as well [18:49:05] I have sent a patch for it, but I can do that with moritz.m tomorrow [18:49:51] (03CR) 10jerkins-bot: [V: 04-1] package_builder: also inject extra packages on update [puppet] - 10https://gerrit.wikimedia.org/r/676127 (https://phabricator.wikimedia.org/T240430) (owner: 10Hashar) [18:49:58] bah [18:50:07] obviously needs more work.
Will continue tomorrow ;) [18:51:23] * Reedy assigns more tasks to hashar [18:57:39] (03PS1) 10Jbond: R:pbuilder_base: add extra packages to updates as well [puppet] - 10https://gerrit.wikimedia.org/r/676133 [18:58:03] hashar: yes i worked out i think ^^^ should fix it, but rspec is being a pain and not working on my laptop right now (which is why it took a while to respond) [18:59:37] ah I am not the only one ;] [19:00:05] twentyafterfour and hashar: Your horoscope predicts another unfortunate Mediawiki train - American+European Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210331T1900). [19:00:24] (03Abandoned) 10Hashar: package_builder: also inject extra packages on update [puppet] - 10https://gerrit.wikimedia.org/r/676127 (https://phabricator.wikimedia.org/T240430) (owner: 10Hashar) [19:00:39] jbond42: my related patch took 7 minutes and ended up being aborted :\ [19:01:38] hashar: ack could be an issue elsewhere with the gems system or something [19:01:44] yeah :-\ [19:01:58] I am out though it is late. Can continue with moritz tomorrow [19:03:52] !log twentyafterfour@deploy1002 Synchronized php-1.36.0-wmf.37/includes/Revision/RevisionRecord.php: sync https://gerrit.wikimedia.org/r/c/mediawiki/core/+/675875 to unblock train refs T278376 T278343 (duration: 00m 58s) [19:04:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:02] T278343: 1.36.0-wmf.37 deployment blockers - https://phabricator.wikimedia.org/T278343 [19:04:02] T278376: Constructing RevisionRecord for a page that can't exist: Special:MyLanguage/Main Page [Called from MediaWiki\Revision\MutableRevisionRecord::__construct] - https://phabricator.wikimedia.org/T278376 [19:04:21] jbond42: I am off.
And don't waste too much time on it, it is not urgent by any means :] [19:04:59] (03CR) 10jerkins-bot: [V: 04-1] R:pbuilder_base: add extra packages to updates as well [puppet] - 10https://gerrit.wikimedia.org/r/676133 (owner: 10Jbond) [19:05:26] ~. [19:05:29] yeah [19:05:34] same thing happened on my other patch [19:05:38] and the same happens on my local machine [19:05:44] maybe because of reusing @(COMMAND) [19:05:53] or some oddity when using extra_packages a second time [19:05:55] who knows really :-\ [19:06:12] hashar: ack im also calling it a day for now, hopefully it fixes itself :) [19:06:20] yeah same! [19:06:27] thank you for the series of patches and the rspec etc! [19:06:48] jbond42: have a good week-end! [19:06:54] !log robh@cumin1001 START - Cookbook sre.dns.netbox [19:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:08] hashar: thanks you too :) [19:08:44] (03PS1) 10Andrew Bogott: OpenStack Trove: use /dev/sdb instead of /dev/vdb [puppet] - 10https://gerrit.wikimedia.org/r/676135 [19:11:00] (03PS1) 1020after4: group0 wikis to 1.36.0-wmf.37 refs T278343 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676136 [19:11:02] (03CR) 1020after4: [C: 03+2] group0 wikis to 1.36.0-wmf.37 refs T278343 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676136 (owner: 1020after4) [19:11:51] (03Merged) 10jenkins-bot: group0 wikis to 1.36.0-wmf.37 refs T278343 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676136 (owner: 1020after4) [19:13:20] !log twentyafterfour@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.37 refs T278343 [19:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:28] T278343: 1.36.0-wmf.37 deployment blockers - https://phabricator.wikimedia.org/T278343 [19:16:01] (03PS2) 10Andrew Bogott: OpenStack Trove: use /dev/sdb instead of /dev/vdb [puppet] - 10https://gerrit.wikimedia.org/r/676135
(https://phabricator.wikimedia.org/T212595) [19:16:03] (03PS1) 10Andrew Bogott: Openstack Trove: Hack in a bugfix that's missing from the debian package [puppet] - 10https://gerrit.wikimedia.org/r/676137 (https://phabricator.wikimedia.org/T212595) [19:16:49] (03CR) 10jerkins-bot: [V: 04-1] Openstack Trove: Hack in a bugfix that's missing from the debian package [puppet] - 10https://gerrit.wikimedia.org/r/676137 (https://phabricator.wikimedia.org/T212595) (owner: 10Andrew Bogott) [19:16:57] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack Trove: use /dev/sdb instead of /dev/vdb [puppet] - 10https://gerrit.wikimedia.org/r/676135 (https://phabricator.wikimedia.org/T212595) (owner: 10Andrew Bogott) [19:18:13] (03PS1) 1020after4: group1 wikis to 1.36.0-wmf.37 refs T278343 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676138 [19:18:15] (03CR) 1020after4: [C: 03+2] group1 wikis to 1.36.0-wmf.37 refs T278343 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676138 (owner: 1020after4) [19:18:30] (03PS2) 10Andrew Bogott: Openstack Trove: Hack in a bugfix that's missing from the debian package [puppet] - 10https://gerrit.wikimedia.org/r/676137 (https://phabricator.wikimedia.org/T212595) [19:18:37] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:49] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [19:19:06] (03Merged) 10jenkins-bot: group1 wikis to 1.36.0-wmf.37 refs T278343 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676138 (owner: 1020after4) [19:20:31] !log twentyafterfour@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.36.0-wmf.37 refs T278343 [19:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:39] T278343: 1.36.0-wmf.37 deployment blockers - 
https://phabricator.wikimedia.org/T278343 [19:21:23] PROBLEM - Host cp1087 is DOWN: PING CRITICAL - Packet loss = 100% [19:21:40] !log twentyafterfour@deploy1002 Synchronized php: group1 wikis to 1.36.0-wmf.37 refs T278343 (duration: 01m 08s) [19:21:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:20] (03CR) 10Andrew Bogott: [C: 03+2] Openstack Trove: Hack in a bugfix that's missing from the debian package [puppet] - 10https://gerrit.wikimedia.org/r/676137 (https://phabricator.wikimedia.org/T212595) (owner: 10Andrew Bogott) [19:30:39] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 562 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:33:03] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 29 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:55:46] !log robh@cumin1001 START - Cookbook sre.dns.netbox [19:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:04] chrisalbon and accraze: (Dis)respected human, time to deploy Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210331T2000). Please do the needful. [20:05:35] 10SRE, 10serviceops: bring 35 new mediawiki appserver in codfw into production (mw2377 - mw2402) - https://phabricator.wikimedia.org/T278396 (10Dzahn) codfw: number of appservers ("apaches"): 49 number of API appservers ("api"): 54 number of jobrunners/videoscalers ("jobrunner"): 18 eqiad: number of app... 
[20:14:51] 10SRE, 10serviceops: bring 35 new mediawiki appserver in codfw into production, rack A3 (mw2377 - mw2402) - https://phabricator.wikimedia.org/T278396 (10Dzahn) [20:16:34] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:16:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:15] (03CR) 10Volans: [C: 04-1] "LGTM, -1 just because of leftover from debug. Consider it as a +1 without that ;)" (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/663536 (https://phabricator.wikimedia.org/T273865) (owner: 10Ayounsi) [20:47:52] 10SRE, 10SRE-swift-storage: Problems generating thumbnails - https://phabricator.wikimedia.org/T278969 (10Peachey88) I had a look at [[ https://commons.wikimedia.org/wiki/Special:NewFiles | Special:NewFiles on commons ]] which still has a number of broken thumbnails, but I also noticed the page header refers t... [20:50:13] 10SRE, 10ops-codfw, 10DC-Ops, 10observability: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (10Papaul) The test PDU is on online .its been monitor in Librenms https://librenms.wikimedia.org/device/203 The only thing left is the setup in icinga @fgiunchedi ^ [20:50:40] mutante: any thoughts on T279013 ? 
[20:50:40] T279013: Phabricator intermittently slow; db connection failures to m3-master.eqiad.wmnet with "Temporary failure in name resolution" - https://phabricator.wikimedia.org/T279013 [20:51:12] 10SRE, 10Commons, 10MediaWiki-File-management, 10StructuredDataOnCommons, and 2 others: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails - https://phabricator.wikimedia.org/T266155 (10Peachey88) [20:51:52] 10SRE, 10Sustainability (Incident Followup): Update Runboook wikis for the application and LVS servers - https://phabricator.wikimedia.org/T278948 (10Legoktm) I'm not sure if we should put jobrunner stuff on the LVS page, LVS was working fine, it just happened to be what paged because the entire jobrunner clus... [20:52:53] (sorry, that probably should have been in #-serviceops.) [20:54:36] DNS is failing? [20:56:50] evidently intermittently? seems to work fine from phab1001 at a quick check. [20:57:52] legoktm: I think it's the usual DNS resolution in PHP being flaky [20:58:01] Same reason we mostly avoid hostnames in MW config stuff [21:02:13] yeah, seems like [21:02:32] btw there's a ton of Permission denied (publickey) errors in phd/daemons.log [21:03:03] seems to have grown recently, based on /var/log/phd$ zgrep "Permission denied (publickey)" *.gz -c [21:15:37] (03PS1) 10Dzahn: site/conftool-data: add 12 more appserver and 8 more API servers [puppet] - 10https://gerrit.wikimedia.org/r/676153 (https://phabricator.wikimedia.org/T278396) [21:20:08] (03PS1) 10Dzahn: phabricator: use IP instead of host name for mysql host value [puppet] - 10https://gerrit.wikimedia.org/r/676154 (https://phabricator.wikimedia.org/T279013) [21:20:50] (03PS2) 10Dzahn: phabricator: use IP instead of host name for mysql host value [puppet] - 10https://gerrit.wikimedia.org/r/676154 (https://phabricator.wikimedia.org/T279013) [21:22:40] (03CR) 10Dzahn: [C: 04-1] "Unknown function: 'dns_a'." 
[puppet] - 10https://gerrit.wikimedia.org/r/676154 (https://phabricator.wikimedia.org/T279013) (owner: 10Dzahn) [21:24:52] question: how to solve recent edits not being in recent changes? [21:25:28] there's not enough context or information to even begin to answer that question [21:25:58] sorry. Pigsonthewing created a bunch of blank items on wikidata accidentally, and I can't nuke some of them because they don't appear to have corresponding entries in the recent changes table for nuke to find [21:27:54] a bunch? [21:27:58] 10? 100? 1000? 10000? [21:29:21] I already deleted 1169 [21:29:22] if you look at https://www.wikidata.org/wiki/Special:Contributions/Pigsonthewing?limit=300 shows 280 more blank items that need deletion, but searching the api for items they created since 31 March at 00:00 only shows a single item [21:29:39] https://www.wikidata.org/w/api.php?action=query&list=recentchanges&rcstart=2021-03-31T00%3A00%3A00.000Z&rcdir=newer&rcuser=Pigsonthewing&rctype=new [21:30:00] so the nuke extension can't find those pages to delete them [21:31:10] is there a maintenance script that can regenerate recent changes partially? For Wikidata it would be impractical to regenerate it all with rebuildrecentchanges, but I was hoping there was a way to just add these missing entries so I could delete them easier [21:32:13] you can rebuild rows between two date timestamps [21:32:17] but there'll still be a lot of rows [21:32:29] I guess you could hack the script to do extra filtering... [21:32:48] but by that point, it would have been easier to just delete them all manually/semi-automatedly [21:32:53] aren't recentchanges rows inserted by the job queue? 
[21:33:16] certainly does seem a bit silly to spend time inserting these rows just for the purpose of deleting them :shrug: [21:33:22] I only need ~45 minutes worth (19:52, 31 March 2021 to 20:37, 31 March 2021) [21:33:29] I'll try to write a script to delete them [21:33:40] I'd be more concerned why those rows didn't make it to the RC table in the first place [21:33:56] it doesn't seem too hard to get his contribs from the api, and then script the deletion [21:34:39] (03Abandoned) 10Dzahn: phabricator: use IP instead of host name for mysql host value [puppet] - 10https://gerrit.wikimedia.org/r/676154 (https://phabricator.wikimedia.org/T279013) (owner: 10Dzahn) [21:34:50] I'm guessing it might be related to it being completely blank [21:35:00] no identifying text [21:35:05] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28851/console" [puppet] - 10https://gerrit.wikimedia.org/r/675584 (owner: 10Legoktm) [21:35:45] found a faster way, I'm ready to delete them all but should I leave some behind to allow investigating? [21:35:45] strategy: copy text from contribs page for the relevant items, regex to convert to 'Qxxx' and use that as an array to loop through and delete [21:36:47] (03CR) 10Legoktm: [V: 03+1 C: 03+2] mailman3: Add documentation to classes, merge hyperkitty into web [puppet] - 10https://gerrit.wikimedia.org/r/675584 (owner: 10Legoktm) [21:36:55] if you want to file a bug under wikidata for it... [21:37:02] I'm sure someone can try to reproduce elsewhere [21:37:11] So little to no value in keeping them I imagine [21:37:18] A bug would be very nice, this should never happen [21:37:43] But for the deletions, I'd just scrape the ids in some way (regex, jquery, whatever) and delete them from there [21:39:03] how about going by the Special:Contributions by that user instead of RC ?
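(Editorial sketch, not part of the channel traffic.) The scraping strategy described above, pulling 'Qxxx' ids out of text copied from a contributions page and looping over them, might look roughly like the following. The sample text, the item ids, and the function name `extract_qids` are hypothetical illustrations, and the deletion step itself is deliberately left out since the log does not say how it was performed.

```python
import re

# Wikidata item ids are "Q" followed by digits; \b avoids matching inside longer tokens.
QID_PATTERN = re.compile(r"\bQ[1-9]\d*\b")


def extract_qids(contribs_text):
    """Return the unique item ids found in the text, in order of first appearance."""
    seen = set()
    qids = []
    for qid in QID_PATTERN.findall(contribs_text):
        if qid not in seen:
            seen.add(qid)
            qids.append(qid)
    return qids


# Hypothetical text pasted from a contributions page (ids are made up).
sample = (
    "20:37, 31 March 2021 (diff | hist) . . (+2) . . N Q106234001 (Created a new Item)\n"
    "20:36, 31 March 2021 (diff | hist) . . (+2) . . N Q106234000 (Created a new Item)\n"
    "20:36, 31 March 2021 (diff | hist) . . (+2) . . N Q106234000 (duplicate line)\n"
)

for qid in extract_qids(sample):
    # Only prints the candidates; the actual delete call is intentionally omitted.
    print(qid)
```

Deduplicating while preserving order keeps the list reviewable against the contributions page before anything is deleted.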
[21:39:37] mutante: The problem is Special:Nuke (which reads from RC) [21:39:46] (03PS4) 10Legoktm: mailman3: Explicitly don't use dbconfig-mysql system [puppet] - 10https://gerrit.wikimedia.org/r/675585 (https://phabricator.wikimedia.org/T278499) [21:39:57] mutante I'm using Special:Contributions to find the correct list of pages needing deletion, but the issue is that a bunch of those aren't showing up in nuke [21:40:53] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28852/console" [puppet] - 10https://gerrit.wikimedia.org/r/675585 (https://phabricator.wikimedia.org/T278499) (owner: 10Legoktm) [21:41:03] filed https://phabricator.wikimedia.org/T279018 [21:41:09] (03CR) 10Legoktm: [V: 03+1 C: 03+2] mailman3: Explicitly don't use dbconfig-mysql system [puppet] - 10https://gerrit.wikimedia.org/r/675585 (https://phabricator.wikimedia.org/T278499) (owner: 10Legoktm) [21:41:20] I see, well, I can't even read that special page [21:46:50] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for Clare Ming - https://phabricator.wikimedia.org/T278265 (10MarkTraceur) Sorry about the delay, luckily I was here anyway approving T279014. Approved as manager! 
[21:46:53] okay, all deleted [22:01:47] !log Server side upload of three video files (T279011, T278956, T278955) [22:02:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:04] T278956: Server side upload for Sturm - https://phabricator.wikimedia.org/T278956 [22:02:04] T278955: Server side upload for Sturm - https://phabricator.wikimedia.org/T278955 [22:02:04] T279011: Server side upload for Sturm - https://phabricator.wikimedia.org/T279011 [22:18:38] (03PS2) 10Dzahn: site/conftool-data: add 4 jobrunners, 12 appserver and 8 API servers [puppet] - 10https://gerrit.wikimedia.org/r/676153 (https://phabricator.wikimedia.org/T278396) [22:19:18] (03CR) 10Legoktm: [C: 04-1] "In mailman3 a user account can have multiple addresses associated with their account. Given an address, we should look up the correspondin" [puppet] - 10https://gerrit.wikimedia.org/r/675353 (owner: 10Legoktm) [22:23:30] (03CR) 10Legoktm: "Are we no longer doing the odd/even app/api split?" [puppet] - 10https://gerrit.wikimedia.org/r/676153 (https://phabricator.wikimedia.org/T278396) (owner: 10Dzahn) [22:23:41] (03PS3) 10Dzahn: site/conftool-data: add 4 jobrunners, 12 appserver and 8 API servers [puppet] - 10https://gerrit.wikimedia.org/r/676153 (https://phabricator.wikimedia.org/T278396) [22:24:43] (03CR) 10Dzahn: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/676153 (https://phabricator.wikimedia.org/T278396) (owner: 10Dzahn) [22:29:35] 10SRE, 10Commons, 10MediaWiki-File-management, 10SRE-swift-storage, 10Wikimedia-production-error: Error restoring file: "The file … is in an inconsistent state within the internal storage backends" - https://phabricator.wikimedia.org/T236246 (10Krinkle) [22:29:44] 10SRE, 10Commons, 10MediaWiki-File-management, 10SRE-swift-storage, and 4 others: Re-deleting a Commons file: "Error deleting file: The file "mwstore://local-multiwrite/local-deleted/..." is in an inconsistent state within the internal storage backends". 
- https://phabricator.wikimedia.org/T270994 (10Krinkl... [22:30:16] (03PS9) 10Jeena Huneidi: Include private folder in restricted image [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674698 (https://phabricator.wikimedia.org/T276145) [22:30:22] 10SRE, 10Commons, 10MediaWiki-File-management, 10SRE-swift-storage, and 4 others: Re-deleting a Commons file: "Error deleting file: The file "mwstore://local-multiwrite/local-deleted/..." is in an inconsistent state within the internal storage backends". - https://phabricator.wikimedia.org/T270994 (10Krinkl... [22:30:26] 10SRE, 10SRE-swift-storage, 10Documentation: Document how to handle 'inconsistent state within the internal storage backends' issues - https://phabricator.wikimedia.org/T135318 (10Krinkle) [22:30:29] 10SRE, 10Commons, 10MediaWiki-File-management, 10SRE-swift-storage, and 4 others: Re-deleting a Commons file: "Error deleting file: The file "mwstore://local-multiwrite/local-deleted/..." is in an inconsistent state within the internal storage backends". - https://phabricator.wikimedia.org/T270994 (10Krinkl... [22:30:39] 10SRE, 10Commons, 10MediaWiki-File-management, 10SRE-swift-storage, and 4 others: Re-deleting a Commons file: "Error deleting file: The file "mwstore://local-multiwrite/local-deleted/..." is in an inconsistent state within the internal storage backends". - https://phabricator.wikimedia.org/T270994 (10Krinkl... [22:34:28] (03CR) 10Legoktm: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/676153 (https://phabricator.wikimedia.org/T278396) (owner: 10Dzahn) [22:48:44] legoktm: I have two cherry-picks scheduled for the deployment window coming up in 10m. Do I +2 those myself or wait for the deployer? 
I'm a bit rusty :) [22:49:34] (03CR) 10Dzahn: "> It was nice that at a quick glance (like an icinga IRC alert) you could mostly tell whether it was an appserver or API server" [puppet] - 10https://gerrit.wikimedia.org/r/676153 (https://phabricator.wikimedia.org/T278396) (owner: 10Dzahn) [22:49:46] ori: wait for the deployer :) [22:49:53] cool, thanks [22:49:57] you could also choose to be the deployer yourself of course :p [22:50:48] nah [22:51:31] nice try legoktm ;) [22:52:31] (03CR) 10Dduvall: [C: 03+1] "I've verified the built image contents. Looks good!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674698 (https://phabricator.wikimedia.org/T276145) (owner: 10Jeena Huneidi) [22:54:18] (03CR) 10Ladsgroup: "legal cleared it." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668241 (https://phabricator.wikimedia.org/T268230) (owner: 10Ladsgroup) [22:54:22] (03PS6) 10Ladsgroup: Use the new mediawiki logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668241 (https://phabricator.wikimedia.org/T268230) [22:55:05] when are the new stickers being mailed out? :D [22:55:16] (03CR) 10Ladsgroup: [C: 03+2] "Deploying" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668241 (https://phabricator.wikimedia.org/T268230) (owner: 10Ladsgroup) [22:56:07] (03Merged) 10jenkins-bot: Use the new mediawiki logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668241 (https://phabricator.wikimedia.org/T268230) (owner: 10Ladsgroup) [22:56:40] legoktm: which stickers? [22:57:27] I have like 100-200 (old) MediaWiki stickers sitting in my room that I distribute at in-person events [22:58:27] !log Start server side upload for 3 files [22:58:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:59:47] I was expecting that Amir1 would mail us new ones ;-) [23:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Evening backport window . 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210331T2300). [23:00:04] ori: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:07] i see [23:00:13] Amir1: how's your deploy? [23:00:14] fundraising season [23:00:18] on my way [23:00:29] need to sync and clear varnish cache [23:00:32] coolio [23:00:41] ori: merging your backports [23:00:58] (03CR) 10Urbanecm: [C: 03+2] Eliminate another php.getSetting() from Lua code [extensions/Wikibase] (wmf/1.36.0-wmf.36) - 10https://gerrit.wikimedia.org/r/676124 (owner: 10Ori.livneh) [23:01:01] (03CR) 10Urbanecm: [C: 03+2] Eliminate another php.getSetting() from Lua code [extensions/Wikibase] (wmf/1.36.0-wmf.37) - 10https://gerrit.wikimedia.org/r/676125 (owner: 10Ori.livneh) [23:01:10] Urbanecm: thank you [23:01:13] np [23:01:19] will ping you once they can be tested [23:01:27] I think I added my patch to the deploy calendar incorrectly...since it didn't show up in the comment here [23:01:37] longma: i was just about to ping you manually :) [23:01:42] you added it correctly, but late [23:01:43] oh thanks :) [23:01:50] it's cached for a while [23:01:51] legoktm: If i do odd numbers/even numbers for apaches/API then I can only add the same number of each, but we currently are unbalanced with 49 vs 54 in codfw .. hrmmm [23:01:55] gotcha [23:01:59] if you add it on short notice you can do this [23:02:01] jouncebot: refresh [23:02:02] I refreshed my knowledge about deployments. 
[23:02:27] and any number could randomly be a jobrunner [23:03:04] !log ladsgroup@deploy1002 Synchronized static: [[gerrit:668241|Use the new mediawiki logos]], part I (T268230) (duration: 01m 09s) [23:03:05] mutante: might not be worth it then, yeah [23:03:08] to actually do it the proper way we'd have to change older servers [23:03:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:03:13] T268230: Rolling out the new logo of MediaWiki - https://phabricator.wikimedia.org/T268230 [23:03:24] thanks Urbanecm I didn't know about that! [23:03:30] any time :) [23:03:31] https://en.wikipedia.org/static/images/project-logos/mediawikiwiki.png?fg [23:03:37] ohoho [23:04:04] longma: i assume you'll want to self-deploy it? [23:04:15] sure I can do it [23:04:28] just let me know when [23:04:32] sure, will do [23:04:33] legoktm: I am kind of thinking that we should not call them all "mw" but have something like mwapi1001 and mwapp1001 and mwjob1001 .. but now people will say it's too late for that ..since k8s [23:04:53] yeah, probably not worth it [23:05:03] and also we sometimes change the purpose in conftool [23:05:05] also it gives us some flexibility in switching stuff around [23:05:26] that's right, sometimesvideoscalingandsometimesonlyjobs1001 [23:05:30] lol [23:05:35] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:668241|Use the new mediawiki logos]], part II (T268230) (duration: 01m 11s) [23:05:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:52] xD [23:06:50] okay, my change is pushed, need to clear edge caches now [23:07:04] if anyone has other changes, feel free [23:07:17] longma: go ahead, I'm waiting for CI. [23:07:25] okay, thanks! 
[23:07:43] (CR) Jeena Huneidi: [C: +2] Include private folder in restricted image [mediawiki-config] - https://gerrit.wikimedia.org/r/674698 (https://phabricator.wikimedia.org/T276145) (owner: Jeena Huneidi) [23:08:17] (PS1) Andrew Bogott: codfw1dev: turn off trove for now [puppet] - https://gerrit.wikimedia.org/r/676169 (https://phabricator.wikimedia.org/T212595) [23:08:26] (Merged) jenkins-bot: Include private folder in restricted image [mediawiki-config] - https://gerrit.wikimedia.org/r/674698 (https://phabricator.wikimedia.org/T276145) (owner: Jeena Huneidi) [23:11:41] new logo looks really nice [23:12:37] !log jhuneidi@deploy1002 Synchronized .pipeline/config.yaml: Config: [[gerrit:674698|Include private folder in restricted image (T276145)]] (duration: 01m 08s) [23:12:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:46] T276145: Ensure MW PrivateSettings exist on releases server for production MW builds - https://phabricator.wikimedia.org/T276145 [23:13:02] Urbanecm: done [23:13:08] thanks [23:13:12] still waiting on CI [23:24:37] 23 minutes, wow [23:24:55] (Merged) jenkins-bot: Eliminate another php.getSetting() from Lua code [extensions/Wikibase] (wmf/1.36.0-wmf.36) - https://gerrit.wikimedia.org/r/676124 (owner: Ori.livneh) [23:25:26] finally [23:25:27] (Merged) jenkins-bot: Eliminate another php.getSetting() from Lua code [extensions/Wikibase] (wmf/1.36.0-wmf.37) - https://gerrit.wikimedia.org/r/676125 (owner: Ori.livneh) [23:25:59] ori: not sure how much you're rusty, do you know how to test your patches on mwdebug1001? [23:26:02] (not there yet) [23:27:18] pretty sure ori was the one who invented X-Wikimedia-Debug :) [23:27:33] good to know :) [23:27:34] ori: pulled your patches to mwdebug1001, please test. [23:27:36] yes. not sure how to test it practically since it's a deeply internal change [23:27:42] i'll just make an edit on testwiki i guess [23:27:59] announcement sent!
[23:28:22] for new logo or stickers? [23:30:10] stickers are lego's specialty [23:30:17] * Amir1 passes the ball [23:30:33] :) [23:30:36] Urbanecm: looks ok [23:30:42] thanks, syncing [23:30:44] hehe let me ask around [23:32:46] !log urbanecm@deploy1002 Synchronized php-1.36.0-wmf.36/extensions/Wikibase/client/includes/DataAccess/Scribunto/: ad564a098f9174d76ff5c95adec20064ddde7bc9: Eliminate another php.getSetting() from Lua code (duration: 01m 10s) [23:32:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:20] !log urbanecm@deploy1002 Synchronized php-1.36.0-wmf.37/extensions/Wikibase/client/includes/DataAccess/Scribunto/: bfc8f55196f57e43c0abc8a16d81cb3b390ac94a: Eliminate another php.getSetting() from Lua code (duration: 01m 09s) [23:34:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:30] ori: should be live. Anything else? [23:35:18] that's it! děkuji! [23:36:11] any time :) [23:49:33] (CR) Andrew Bogott: [C: +2] codfw1dev: turn off trove for now [puppet] - https://gerrit.wikimedia.org/r/676169 (https://phabricator.wikimedia.org/T212595) (owner: Andrew Bogott) [23:54:54] (PS4) Dzahn: site/conftool-data: add 24 new codfw appservers with insetup role [puppet] - https://gerrit.wikimedia.org/r/676153 (https://phabricator.wikimedia.org/T278396) [23:55:20] (CR) jerkins-bot: [V: -1] site/conftool-data: add 24 new codfw appservers with insetup role [puppet] - https://gerrit.wikimedia.org/r/676153 (https://phabricator.wikimedia.org/T278396) (owner: Dzahn) [23:55:51] James_F: The portals is not updated yet https://www.wikipedia.org/ [23:59:19] Urbanecm: do you feel like updating https://www.wikiquote.org/? [23:59:46] SRE, ops-eqiad, cloud-services-team (Kanban): Eqiad: Ports with no description on cloudsw1-d5-eqiad - https://phabricator.wikimedia.org/T278726 (wiki_willy) a:Cmjohnson [23:59:57] https://meta.wikimedia.org/wiki/Www.wikiquote.org_template