[00:00:05] twentyafterfour: (Dis)respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210513T0000). Please do the needful.
[00:00:41] tgr_: sounds plausible. So, CI should be now fully working again?
[00:02:09] in theory, yes
[00:02:27] I don't know of a wmf.4 patch to test it on, though
[00:02:56] we can run recheck on the patch i merged earlier
[00:03:01] IIRC it should run the suite anyway
[00:03:07] good idea
[00:03:19] (test-wmf, not gate-and-submit, but iirc both were failing)
[00:03:42] (CR) Gergő Tisza: "recheck" [extensions/WikiEditor] (wmf/1.37.0-wmf.4) - https://gerrit.wikimedia.org/r/689901 (https://phabricator.wikimedia.org/T281409) (owner: DLynch)
[00:05:40] * Urbanecm goes off now, hoping that it did the trick
[00:05:44] see you tomorrow tgr_
[00:08:11] thanks for the help!
[00:09:12] any time :)
[00:14:33] ACKNOWLEDGEMENT - MD RAID on wdqs2007 is CRITICAL: CRITICAL: State: degraded, Active: 7, Working: 7, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T282758 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[00:14:38] SRE, ops-codfw: Degraded RAID on wdqs2007 - https://phabricator.wikimedia.org/T282758 (ops-monitoring-bot)
[00:21:38] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:24:00] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:37:47] (CR) Jforrester: [C: +1] "I like this, it makes this more self-documenting, agreed. CCing for info."
[mediawiki-config] - https://gerrit.wikimedia.org/r/689673 (owner: Hashar)
[01:07:32] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:09:58] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:20:55] !log ryankemper@cumin2001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[01:20:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:35:31] (CR) Razzi: "So much for oozie changes! On to airflow!" [puppet] - https://gerrit.wikimedia.org/r/640260 (https://phabricator.wikimedia.org/T262660) (owner: Razzi)
[02:35:53] (Abandoned) Razzi: oozie: Use admin groups for permissions [puppet] - https://gerrit.wikimedia.org/r/640260 (https://phabricator.wikimedia.org/T262660) (owner: Razzi)
[03:29:26] PROBLEM - SSH on logstash2020.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:38:34] SRE, DBA, Wikimedia-Mailing-lists, Schema-change, User-notice: Mailman3 schema change: change utf8 columns to utf8mb4 - https://phabricator.wikimedia.org/T282621 (Ladsgroup) I agree, it's just better safe than sorry. Maybe it errors out and DBAs need some time (=complications). Definitely not...
[03:39:36] SRE, Continuous-Integration-Infrastructure, observability, Goal, Release-Engineering-Team (Seen): Add Prometheus exporter to Jenkins instances - https://phabricator.wikimedia.org/T182759 (thcipriani) p:High→Low
[03:47:56] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:50:20] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:00:08] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_mobileapps_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:02:34] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:26:22] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_delayed.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:53:42] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:56:10] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:37:50] (CR) Giuseppe Lavagetto: "> Patch Set 2: Code-Review+1" [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/689945 (owner: Giuseppe Lavagetto)
[05:39:42] SRE, Wikimedia-Mailing-lists, User-Ladsgroup: " %(web_page_url)slistinfo%(cgiext)s/%(_internal_name)s" in recent Wikitech-l posts - https://phabricator.wikimedia.org/T282762 (Ladsgroup) Open→Resolved a:Ladsgroup That was my fault. The wikitech-l upgrade failed at the middle and while I fi...
[05:54:55] <_joe_> !log running docker image prune on contint1001, which has 722 unlinked images stored in its docker daemon
[05:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:02:49] SRE, Wikimedia-Mailing-lists, User-Ladsgroup: 'Held Unsubscriptions' keeps sending email notifications in Mailman3 - https://phabricator.wikimedia.org/T282319 (Ladsgroup) Open→Resolved a:Ladsgroup I think I finally removed it for real.
[06:15:26] (CR) Elukey: [C: +2] profile::hadoop::master: add alert to bump the heap size when needed [puppet] - https://gerrit.wikimedia.org/r/688778 (owner: Elukey)
[06:16:30] (CR) Elukey: "Ping Razzi + Ottomata - I just merged this, it links to documentation on wikitech so easy to follow up when it will fire in the future."
[puppet] - https://gerrit.wikimedia.org/r/688778 (owner: Elukey)
[06:20:30] RECOVERY - ElasticSearch unassigned shard check - 9243 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration
[06:36:53] SRE, observability, Patch-For-Review: Decommission old ELK5 Logstash cluster - https://phabricator.wikimedia.org/T281266 (elukey) @herron I acked some alerts related to logstash100[7-9]'s ES on port 9200 not responsive, IIUC we are waiting for https://gerrit.wikimedia.org/r/689977 for the clean up r...
[06:50:18] SRE, Wikimedia-Mailing-lists, User-Ladsgroup: " %(web_page_url)slistinfo%(cgiext)s/%(_internal_name)s" in recent Wikitech-l posts - https://phabricator.wikimedia.org/T282762 (Legoktm) >>! In T282762#7084326, @Krinkle wrote: > I'm not sure if it is related, but since May 10 there are also no new entri...
[06:57:12] (PS3) Giuseppe Lavagetto: Builder: use the full image tag, not just the name when pulling [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/689945
[06:57:14] (PS1) Giuseppe Lavagetto: Update the docker dependency [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/690316
[06:57:16] (PS1) Giuseppe Lavagetto: Release 3.0.3 [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/690317
[07:04:18] (CR) Giuseppe Lavagetto: [C: +2] Builder: use the full image tag, not just the name when pulling [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/689945 (owner: Giuseppe Lavagetto)
[07:06:05] (Merged) jenkins-bot: Builder: use the full image tag, not just the name when pulling [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/689945 (owner: Giuseppe Lavagetto)
[07:10:44] (CR) Giuseppe Lavagetto: [C: +2] Update the docker dependency [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/690316 (owner: Giuseppe Lavagetto)
[07:10:47] !log kevinbazira@deploy1002 Started deploy [ores/deploy@8fd23ed]:
Regular ORES Deployment T278723
[07:10:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:10:51] T278723: ORES deployment - Spring 2021 - https://phabricator.wikimedia.org/T278723
[07:12:35] (Merged) jenkins-bot: Update the docker dependency [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/690316 (owner: Giuseppe Lavagetto)
[07:21:30] (CR) Filippo Giunchedi: [C: +1] logstash: remove kibana and elasticsearch from role::logstash [puppet] - https://gerrit.wikimedia.org/r/689977 (https://phabricator.wikimedia.org/T281266) (owner: Herron)
[07:23:40] (CR) Filippo Giunchedi: "I see that other elk7 hosts have hiera per-host entries like the following, do we need the same?" [puppet] - https://gerrit.wikimedia.org/r/689994 (https://phabricator.wikimedia.org/T281266) (owner: Herron)
[07:23:50] (CR) Filippo Giunchedi: [C: +1] rsyslog: enable ecs_170 template and transition prometheus [puppet] - https://gerrit.wikimedia.org/r/689160 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[07:25:50] (CR) Giuseppe Lavagetto: [C: +2] Release 3.0.3 [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/690317 (owner: Giuseppe Lavagetto)
[07:28:00] (Merged) jenkins-bot: Release 3.0.3 [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/690317 (owner: Giuseppe Lavagetto)
[07:28:43] (PS1) Jcrespo: bacula: Do not ignore people2002 and ignore cloudmetrics1002 [puppet] - https://gerrit.wikimedia.org/r/690329 (https://phabricator.wikimedia.org/T281881)
[07:34:52] RECOVERY - SSH on logstash2020.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:35:01] (PS2) Jcrespo: dbbackups: Disable temporarily rw-backups, enable ro-backups [puppet] - https://gerrit.wikimedia.org/r/689672 (https://phabricator.wikimedia.org/T282249)
[07:36:43] (CR) Jcrespo: [C: +2] dbbackups: Disable
temporarily rw-backups, enable ro-backups [puppet] - https://gerrit.wikimedia.org/r/689672 (https://phabricator.wikimedia.org/T282249) (owner: Jcrespo)
[07:36:58] (PS2) Jcrespo: bacula: Do not ignore people2002 and ignore cloudmetrics1002 [puppet] - https://gerrit.wikimedia.org/r/690329 (https://phabricator.wikimedia.org/T281881)
[07:38:33] (CR) Majavah: [C: -1] toolforge: re-enable toolforge certificate monitor (1 comment) [puppet] - https://gerrit.wikimedia.org/r/690055 (https://phabricator.wikimedia.org/T282264) (owner: Bstorm)
[07:38:37] (CR) Jcrespo: [C: +2] bacula: Do not ignore people2002 and ignore cloudmetrics1002 [puppet] - https://gerrit.wikimedia.org/r/690329 (https://phabricator.wikimedia.org/T281881) (owner: Jcrespo)
[07:41:54] (PS2) Alexandros Kosiaris: linkrecommendation: Match gunicorn status code in statsd [deployment-charts] - https://gerrit.wikimedia.org/r/685788
[07:43:37] !log kevinbazira@deploy1002 Finished deploy [ores/deploy@8fd23ed]: Regular ORES Deployment T278723 (duration: 32m 50s)
[07:43:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:41] T278723: ORES deployment - Spring 2021 - https://phabricator.wikimedia.org/T278723
[07:47:47] (CR) Alexandros Kosiaris: [C: +2] "Thanks!"
[deployment-charts] - https://gerrit.wikimedia.org/r/685788 (owner: Alexandros Kosiaris)
[07:48:36] (CR) Volans: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/689786 (https://phabricator.wikimedia.org/T280382) (owner: Jbond)
[07:49:06] (Merged) jenkins-bot: linkrecommendation: Match gunicorn status code in statsd [deployment-charts] - https://gerrit.wikimedia.org/r/685788 (owner: Alexandros Kosiaris)
[07:51:50] SRE, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): hw troubleshooting: server hardlocking for cloudmetrics1002.eqiad.wmnet - https://phabricator.wikimedia.org/T281881 (jcrespo) ^I have paused monitoring of cloudmetrics1002 on bacula, so it doesn't alter unnecessarily du...
[07:51:54] (PS1) Jcrespo: bacula: Reenable read-write ES database backups, disable read-only [puppet] - https://gerrit.wikimedia.org/r/690338 (https://phabricator.wikimedia.org/T282249)
[08:11:05] (CR) Volans: [C: +2] various: use git -C instead of cd && git [cookbooks] - https://gerrit.wikimedia.org/r/689988 (owner: Volans)
[08:12:36] Victory! ORES deploy looks good thanks to elukey!
[08:13:22] and Kevin :)
[08:14:37] (Merged) jenkins-bot: various: use git -C instead of cd && git [cookbooks] - https://gerrit.wikimedia.org/r/689988 (owner: Volans)
[08:20:56] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:21:46] !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
[08:21:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:18] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:29:25] SRE, Wikimedia-Mailing-lists, Patch-For-Review: Implement static redirects from pipermail archives to hyperkitty archives - https://phabricator.wikimedia.org/T280731 (Legoktm) It's basically impossible for me to figure out why these December 2004 archives are messed up. Maybe someone broke the archiv...
[08:30:54] SRE, Wikimedia-Mailing-lists, User-Ladsgroup: 'Held Unsubscriptions' keeps sending email notifications in Mailman3 - https://phabricator.wikimedia.org/T282319 (Ciell) Yes, the reminder does not mention the unsubscription any more. Thanks!
[08:33:54] SRE, Wikimedia-Mailing-lists: The Great Clean Up of Mailman2 - https://phabricator.wikimedia.org/T282303 (Legoktm) @jcrespo before we embark upon this cleanup, can we mark one of the backups of `var-lib-mailman` to be kept long term? Per it seems the...
[08:36:07] (PS1) PipelineBot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/690351
[08:37:47] (PS2) Jbond: gitlab: minor variable value updates [gitlab-ansible] - https://gerrit.wikimedia.org/r/688998
[08:39:27] (Abandoned) Jbond: gitlab: minor variable value updates [gitlab-ansible] - https://gerrit.wikimedia.org/r/688998 (owner: Jbond)
[08:44:05] SRE, LDAP-Access-Requests, CommRel-Specialists-Support (Apr-Jun-2021): Grant access to LDAP/WMF for Sannita - https://phabricator.wikimedia.org/T282600 (Elitre) Confirming as his manager that he needs access. TY
[08:45:08] (PS1) Volans: doc: link to wikitech for list of cumin hosts [software/transferpy] - https://gerrit.wikimedia.org/r/690355
[08:45:50] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'internal' .
[08:45:51] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
[08:45:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:20] !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
[08:47:20] !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'internal' .
[08:47:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:46] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={atlas_exporter,swagger_check_citoid_cluster_eqiad} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:52:12] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:57:10] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:59:42] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:02:19] SRE, Analytics, LDAP-Access-Requests, CommRel-Specialists-Support (Apr-Jun-2021): Please grant CRS access to Superset/Turnilo - trizek - https://phabricator.wikimedia.org/T282772 (Elitre)
[09:03:03] SRE, Analytics, LDAP-Access-Requests, CommRel-Specialists-Support (Apr-Jun-2021): Please grant CRS access to Superset/Turnilo - Johan - https://phabricator.wikimedia.org/T282773 (Elitre)
[09:03:51] SRE, Analytics, LDAP-Access-Requests, CommRel-Specialists-Support (Apr-Jun-2021): Please grant CRS access to Superset/Turnilo - Keegan - https://phabricator.wikimedia.org/T282774 (Elitre)
[09:05:19] SRE, Analytics, LDAP-Access-Requests, CommRel-Specialists-Support (Apr-Jun-2021): Please grant CRS access to Superset/Turnilo (deadline EOD Monday 17) - https://phabricator.wikimedia.org/T282589 (Elitre)
[09:06:09] (PS1) Jbond: (WIP) admin::get_users: add function to get a list of configured users [puppet] - https://gerrit.wikimedia.org/r/690366
[09:06:11] (PS1) Jbond: (do not merge) Testing get useres function [puppet] - https://gerrit.wikimedia.org/r/690367
[09:06:57] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29540/console" [puppet] - https://gerrit.wikimedia.org/r/690367 (owner: Jbond)
[09:08:56] (PS3) Jbond: install_server: add new installer to test raid0 configuration: [puppet] - https://gerrit.wikimedia.org/r/689786 (https://phabricator.wikimedia.org/T280382)
[09:09:34] (CR) Jbond: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/689786 (https://phabricator.wikimedia.org/T280382) (owner: Jbond)
[09:10:00] PROBLEM - Stale file for node-exporter textfile in eqiad on alert1001 is CRITICAL: cluster=misc file=device_smart.prom instance=cloudmetrics1002 job=node site=eqiad
https://wikitech.wikimedia.org/wiki/Prometheus%23Stale_file_for_node-exporter_textfile https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile
[09:10:57] (PS21) Giuseppe Lavagetto: Helm chart to run MediaWiki [deployment-charts] - https://gerrit.wikimedia.org/r/670220 (https://phabricator.wikimedia.org/T265327)
[09:11:30] (PS2) Jbond: (do not merge) Testing get useres function [puppet] - https://gerrit.wikimedia.org/r/690367
[09:11:58] SRE, Analytics, LDAP-Access-Requests, CommRel-Specialists-Support (Apr-Jun-2021): Please grant CRS access to Superset/Turnilo (deadline EOD Monday 17) - https://phabricator.wikimedia.org/T282589 (Elitre)
[09:12:19] SRE, Analytics, LDAP-Access-Requests, CommRel-Specialists-Support (Apr-Jun-2021): Please grant CRS access to Superset/Turnilo - Elitre - https://phabricator.wikimedia.org/T282776 (Elitre)
[09:12:49] SRE, Analytics, LDAP-Access-Requests, CommRel-Specialists-Support (Apr-Jun-2021): Please grant CRS access to Superset/Turnilo (deadline EOD Monday 17) - https://phabricator.wikimedia.org/T282589 (elukey) Added `elitre` to the `wmf` LDAP group.
[09:14:18] SRE, LDAP-Access-Requests: Grant access to LDAP/WMF for SGrabarczuk - https://phabricator.wikimedia.org/T282475 (elukey) Open→Resolved a:elukey On LDAP the email points to @wikimedia.org, plus I followed up on slack with @Elitre. Added to `wmf`.
[09:14:22] SRE, Analytics, LDAP-Access-Requests, CommRel-Specialists-Support (Apr-Jun-2021): Please grant CRS access to Superset/Turnilo (deadline EOD Monday 17) - https://phabricator.wikimedia.org/T282589 (elukey)
[09:14:46] SRE, Wikimedia-Mailing-lists: The Great Clean Up of Mailman2 - https://phabricator.wikimedia.org/T282303 (jcrespo) We cannot mark existing backup to be kept long term. But we can generate new backups on the archive schedule/pool, which will be retained for 5 years. If it is an old backup, we can recover...
[09:14:55] SRE, Analytics, LDAP-Access-Requests, CommRel-Specialists-Support (Apr-Jun-2021): Please grant CRS access to Superset/Turnilo - Elitre - https://phabricator.wikimedia.org/T282776 (elukey) Open→Resolved a:elukey Followed up on slack, added `elitre` to `wmf`.
[09:14:57] SRE, Analytics, LDAP-Access-Requests, CommRel-Specialists-Support (Apr-Jun-2021): Please grant CRS access to Superset/Turnilo (deadline EOD Monday 17) - https://phabricator.wikimedia.org/T282589 (elukey)
[09:15:21] (PS2) Jbond: (WIP) admin::get_users: add function to get a list of configured users [puppet] - https://gerrit.wikimedia.org/r/690366
[09:15:32] (PS3) Jbond: (do not merge) Testing get useres function [puppet] - https://gerrit.wikimedia.org/r/690367
[09:16:50] SRE, Analytics, LDAP-Access-Requests, CommRel-Specialists-Support (Apr-Jun-2021): Please grant CRS access to Superset/Turnilo - Johan - https://phabricator.wikimedia.org/T282773 (elukey) Open→Resolved a:elukey Followed up with @Elitre on slack, also verified that user `johan` was assi...
[09:16:54] SRE, Analytics, LDAP-Access-Requests, CommRel-Specialists-Support (Apr-Jun-2021): Please grant CRS access to Superset/Turnilo (deadline EOD Monday 17) - https://phabricator.wikimedia.org/T282589 (elukey)
[09:17:42] (PS1) Mvolz: Update zotero and re-enable gzip [deployment-charts] - https://gerrit.wikimedia.org/r/690372
[09:18:41] (Abandoned) Mvolz: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/682235 (owner: PipelineBot)
[09:20:00] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29543/console" [puppet] - https://gerrit.wikimedia.org/r/690367 (owner: Jbond)
[09:21:21] SRE, Analytics, LDAP-Access-Requests, CommRel-Specialists-Support (Apr-Jun-2021): Please grant CRS access to Superset/Turnilo - Keegan - https://phabricator.wikimedia.org/T282774 (elukey) Open→Resolved a:elukey @wikimedia.org email for LDAP uid `keegan`, plus I have followed up with @...
[09:21:24] SRE, Analytics, LDAP-Access-Requests, CommRel-Specialists-Support (Apr-Jun-2021): Please grant CRS access to Superset/Turnilo (deadline EOD Monday 17) - https://phabricator.wikimedia.org/T282589 (elukey)
[09:27:03] SRE, Analytics, LDAP-Access-Requests, CommRel-Specialists-Support (Apr-Jun-2021): Please grant CRS access to Superset/Turnilo - trizek - https://phabricator.wikimedia.org/T282772 (elukey) Open→Resolved a:elukey Added `trizek` to `wmf`. The email associated with the account was not @wi...
[09:27:06] SRE, Analytics, LDAP-Access-Requests, CommRel-Specialists-Support (Apr-Jun-2021): Please grant CRS access to Superset/Turnilo (deadline EOD Monday 17) - https://phabricator.wikimedia.org/T282589 (elukey)
[09:32:26] SRE, LDAP-Access-Requests, CommRel-Specialists-Support (Apr-Jun-2021): Grant access to LDAP/WMF for Sannita - https://phabricator.wikimedia.org/T282600 (elukey) I am following https://wikitech.wikimedia.org/wiki/SRE_Clinic_Duty#Modify_LDAP_groups but I can't find Sannita's record on the spreadsheet.
[09:33:09] SRE, Wikimedia-Mailing-lists: The Great Clean Up of Mailman2 - https://phabricator.wikimedia.org/T282303 (Legoktm) Ack, we haven't deleted anything yet so creating a new backup should work. I'll ping you again once we're ready for that, thanks!
[09:33:58] SRE, Analytics, LDAP-Access-Requests, CommRel-Specialists-Support (Apr-Jun-2021): Please grant CRS access to Superset/Turnilo (deadline EOD Monday 17) - https://phabricator.wikimedia.org/T282589 (elukey) Note: the superset dashboards may not be all accessible if all users are also part of the `an...
[09:40:09] SRE, LDAP-Access-Requests, CommRel-Specialists-Support (Apr-Jun-2021): Grant access to LDAP/WMF for Sannita - https://phabricator.wikimedia.org/T282600 (elukey) @KFrancis hi! I can't find Sannita's NDA in the spreadsheet, could you please help with next steps? Thanks in advance :)
[09:43:06] SRE, serviceops: Publish wikimedia-bullseye base docker image - https://phabricator.wikimedia.org/T281596 (Joe) A possible procedure we can use is the following: - Automate building base images using debeurrotype and docker daily. These images will be bare minimum debian-slim images similar in all respe...
[09:43:59] SRE, LDAP-Access-Requests, CommRel-Specialists-Support (Apr-Jun-2021): Grant access to LDAP/WMF for Sannita - https://phabricator.wikimedia.org/T282600 (Elitre) (Specifying here as well that this is needed for a workshop our team has scheduled for next week.)
[09:47:09] SRE, Wikimedia-Hackathon-2021, Wikimedia-Mailing-lists, Upstream: Add OAuth login to mailman for accessing list memberships/archive viewing - https://phabricator.wikimedia.org/T249678 (Legoktm) If someone more knowledgeable about OAuth would like to work with me on this during the hackathon, I ca...
[09:58:36] SRE, LDAP-Access-Requests, CommRel-Specialists-Support (Apr-Jun-2021): Grant access to LDAP/WMF for Sannita - https://phabricator.wikimedia.org/T282600 (Sannita) >>! In T282600#7084689, @elukey wrote: > @KFrancis hi! I can't find Sannita's NDA in the spreadsheet, could you please help with next steps...
[10:00:04] mvolz: #bothumor My software never has bugs. It just develops random features. Rise for Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210513T1000).
[10:00:24] (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/687064 (owner: Amire80)
[10:05:44] (CR) Mvolz: [C: +2] Update zotero and re-enable gzip [deployment-charts] - https://gerrit.wikimedia.org/r/690372 (owner: Mvolz)
[10:06:41] SRE, GitLab (Initialization), Release-Engineering-Team (Doing), User-brennen: Define auth strategy for GitLab - https://phabricator.wikimedia.org/T274461 (jbond) > As CE was the outcome of the community consultation, it sounds like the "knock something up using the gitlab API" option is the one t...
[10:07:08] (Merged) jenkins-bot: Update zotero and re-enable gzip [deployment-charts] - https://gerrit.wikimedia.org/r/690372 (owner: Mvolz)
[10:10:18] (PS1) Legoktm: mailman3: Don't redirect pipermail messages with duplicate Message-IDs [puppet] - https://gerrit.wikimedia.org/r/690391 (https://phabricator.wikimedia.org/T280731)
[10:11:29] !log mvolz@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'zotero' for release 'staging' .
[10:11:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:14:35] SRE, LDAP-Access-Requests, CommRel-Specialists-Support (Apr-Jun-2021): Grant access to LDAP/WMF for Sannita - https://phabricator.wikimedia.org/T282600 (Urbanecm) >>! In T282600#7084689, @elukey wrote: > @KFrancis hi! I can't find Sannita's NDA in the spreadsheet, could you please help with next step...
[10:24:18] SRE, Wikimedia-Mailing-lists, Patch-For-Review: Implement static redirects from pipermail archives to hyperkitty archives - https://phabricator.wikimedia.org/T280731 (Legoktm) OK, per https://bugs.launchpad.net/mailman/+bug/558263 and https://bugs.launchpad.net/mailman/+bug/266377 it seems In-Reply-T...
[10:25:50] SRE, LDAP-Access-Requests, CommRel-Specialists-Support (Apr-Jun-2021): Grant access to LDAP/WMF for Sannita - https://phabricator.wikimedia.org/T282600 (elukey) @Urbanecm the task description is correct, but the title it is not, the request is for `nda` LDAP (you are totally right about `wmf`). Fixing.
[10:25:56] !log mvolz@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'zotero' for release 'production' .
[10:25:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:04] SRE, LDAP-Access-Requests, CommRel-Specialists-Support (Apr-Jun-2021): Grant access to LDAP nda for Sannita - https://phabricator.wikimedia.org/T282600 (elukey)
[10:26:16] PROBLEM - SSH on mw1303.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:26:48] SRE, LDAP-Access-Requests, CommRel-Specialists-Support (Apr-Jun-2021): Grant access to LDAP nda for Sannita - https://phabricator.wikimedia.org/T282600 (Urbanecm) >>! In T282600#7084748, @elukey wrote: > @Urbanecm the task description is correct, but the title it is not, the request is for `nda` LDAP...
[10:31:32] !log mvolz@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'zotero' for release 'production' .
[10:31:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:44] (CR) Mvolz: [C: +2] citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/690351 (owner: PipelineBot)
[10:37:09] (Abandoned) Mvolz: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/682183 (owner: PipelineBot)
[10:37:23] (Abandoned) Mvolz: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/677974 (owner: PipelineBot)
[10:38:15] (Merged) jenkins-bot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/690351 (owner: PipelineBot)
[10:39:10] (PS3) Jbond: (WIP) admin::get_users: add function to get a list of configured users [puppet] - https://gerrit.wikimedia.org/r/690366
[10:39:32] (PS4) Jbond: (do not merge) Testing get useres function [puppet] - https://gerrit.wikimedia.org/r/690367
[10:39:43] (CR) jerkins-bot: [V: -1] (WIP) admin::get_users: add function to get a list of configured users [puppet] -
https://gerrit.wikimedia.org/r/690366 (owner: Jbond)
[10:40:33] (PS1) Jcrespo: prometheus-mysqld-exporter: Update generator to remove multisource exception [puppet] - https://gerrit.wikimedia.org/r/690402 (https://phabricator.wikimedia.org/T282662)
[10:40:34] !log mvolz@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'citoid' for release 'staging' .
[10:40:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:41:02] (PS4) Jbond: (WIP) admin::get_users: add function to get a list of configured users [puppet] - https://gerrit.wikimedia.org/r/690366
[10:41:14] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29544/console" [puppet] - https://gerrit.wikimedia.org/r/690367 (owner: Jbond)
[10:42:32] (CR) jerkins-bot: [V: -1] prometheus-mysqld-exporter: Update generator to remove multisource exception [puppet] - https://gerrit.wikimedia.org/r/690402 (https://phabricator.wikimedia.org/T282662) (owner: Jcrespo)
[10:43:11] (PS2) Jcrespo: prometheus-mysqld-exporter: Update generator to remove multisource exception [puppet] - https://gerrit.wikimedia.org/r/690402 (https://phabricator.wikimedia.org/T282662)
[10:44:06] (PS3) Jcrespo: prometheus-mysqld-exporter: Update generator to remove multisource exception [puppet] - https://gerrit.wikimedia.org/r/690402 (https://phabricator.wikimedia.org/T282662)
[10:46:30] !log mvolz@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'citoid' for release 'production' .
[10:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:56] (03PS1) 10Gergő Tisza: AddLink: LinkSuggestionInteraction logger and instrumentation [extensions/GrowthExperiments] (wmf/1.37.0-wmf.5) - 10https://gerrit.wikimedia.org/r/690070 (https://phabricator.wikimedia.org/T278116) [10:53:58] (03PS1) 10Gergő Tisza: AddLink: Instrumentation for onboarding dialog [extensions/GrowthExperiments] (wmf/1.37.0-wmf.5) - 10https://gerrit.wikimedia.org/r/690071 (https://phabricator.wikimedia.org/T278111) [10:54:34] (03PS1) 10Gergő Tisza: AddLink: Instrumentation for skipall_dialog [extensions/GrowthExperiments] (wmf/1.37.0-wmf.5) - 10https://gerrit.wikimedia.org/r/690072 (https://phabricator.wikimedia.org/T278118) [10:55:33] (03PS2) 10Gergő Tisza: AddLink: Instrumentation for skipall_dialog [extensions/GrowthExperiments] (wmf/1.37.0-wmf.5) - 10https://gerrit.wikimedia.org/r/690072 (https://phabricator.wikimedia.org/T278118) [10:56:06] (03PS1) 10Gergő Tisza: AddLink: Instrumentation for edit summary dialog [extensions/GrowthExperiments] (wmf/1.37.0-wmf.5) - 10https://gerrit.wikimedia.org/r/690073 (https://phabricator.wikimedia.org/T278118) [11:00:04] Amir1, Lucas_WMDE, apergos, and duesen: May I have your attention please! EU Backport and Config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210513T1100) [11:00:04] tgr: A patch you scheduled for EU Backport and Config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:54] ooooh snuck in at the last minute I see :-) [11:00:55] that was scheduled in the last minute and the master version of the patches are still merging, so it will take a few minutes until it's ready. 
[11:00:56] !log deleting packages still referenced by jessie components: `sudo -i reprepro clearvanished --delete` [11:00:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:10] welp no one else is ready so I guess we can just wait [11:02:13] SonarQube bot whined about the first two [11:02:52] the patches all depend on a patch in schemas/event/secondary, which AIUI does not need to be backported [11:03:39] we usually ignore SonarQube, the errors are almost always about lack of code coverage [11:03:56] and the way it's calculated has little to do with the specific patch [11:04:13] good call [11:04:35] are you doing self-service today or do you prefer someone to deploy for you? [11:05:03] I can deploy it [11:05:29] (03PS1) 10Hnowlan: api-gateway: bump Envoy version [deployment-charts] - 10https://gerrit.wikimedia.org/r/690404 [11:05:36] note to any lurkers: no one has signed up for the deployment training that is scheduled for this window. I am in the google meet just in case someone shows up but after 30 minutes (or when these patches go, whichever is first), I'll leave. [11:05:49] self-serve it is, then :-) [11:07:07] (03PS1) 10Jbond: admin: drop christinedk do not merge before 14/05/2021 [puppet] - 10https://gerrit.wikimedia.org/r/690405 [11:07:36] 10SRE, 10Wikimedia-Hackathon-2021, 10Wikimedia-Mailing-lists, 10Upstream: Add OAuth login to mailman for accessing list memberships/archive viewing - https://phabricator.wikimedia.org/T249678 (10Tgr) How would this work exactly? You are logged in if you can identify as a Wikimedia users whose email address... [11:08:38] (03CR) 10Jbond: "question for moritz and Luca do we need to do anything special regarding kerberos?" 
[puppet] - 10https://gerrit.wikimedia.org/r/690405 (owner: 10Jbond) [11:21:42] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:23:36] FYI This morning I've upgraded cumin to the latest version on all prod hosts, see the email to ops-private for more details. [11:24:12] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:24:46] (03CR) 10Gergő Tisza: [C: 03+2] AddLink: LinkSuggestionInteraction logger and instrumentation [extensions/GrowthExperiments] (wmf/1.37.0-wmf.5) - 10https://gerrit.wikimedia.org/r/690070 (https://phabricator.wikimedia.org/T278116) (owner: 10Gergő Tisza) [11:24:51] (03CR) 10Gergő Tisza: [C: 03+2] AddLink: Instrumentation for onboarding dialog [extensions/GrowthExperiments] (wmf/1.37.0-wmf.5) - 10https://gerrit.wikimedia.org/r/690071 (https://phabricator.wikimedia.org/T278111) (owner: 10Gergő Tisza) [11:24:55] (03CR) 10Gergő Tisza: [C: 03+2] AddLink: Instrumentation for skipall_dialog [extensions/GrowthExperiments] (wmf/1.37.0-wmf.5) - 10https://gerrit.wikimedia.org/r/690072 (https://phabricator.wikimedia.org/T278118) (owner: 10Gergő Tisza) [11:25:08] (03CR) 10Gergő Tisza: [C: 03+2] AddLink: Instrumentation for edit summary dialog [extensions/GrowthExperiments] (wmf/1.37.0-wmf.5) - 10https://gerrit.wikimedia.org/r/690073 (https://phabricator.wikimedia.org/T278118) (owner: 10Gergő Tisza) [11:27:40] RECOVERY - SSH on mw1303.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:31:47] 30 minutes later, no trainees dropped by, so I closed the google meet tab. 
[11:47:00] (03Merged) 10jenkins-bot: AddLink: LinkSuggestionInteraction logger and instrumentation [extensions/GrowthExperiments] (wmf/1.37.0-wmf.5) - 10https://gerrit.wikimedia.org/r/690070 (https://phabricator.wikimedia.org/T278116) (owner: 10Gergő Tisza) [11:47:41] (03Merged) 10jenkins-bot: AddLink: Instrumentation for onboarding dialog [extensions/GrowthExperiments] (wmf/1.37.0-wmf.5) - 10https://gerrit.wikimedia.org/r/690071 (https://phabricator.wikimedia.org/T278111) (owner: 10Gergő Tisza) [11:48:28] (03CR) 10Jbond: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/678907 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [11:48:32] two done two to go... sigh [11:51:00] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:51:15] (03PS3) 10Jbond: systemd::resolved: start work on puppet module for systemd-resolved [puppet] - 10https://gerrit.wikimedia.org/r/678907 (https://phabricator.wikimedia.org/T171498) [11:53:09] (03CR) 10jerkins-bot: [V: 04-1] systemd::resolved: start work on puppet module for systemd-resolved [puppet] - 10https://gerrit.wikimedia.org/r/678907 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [11:53:28] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:54:36] (03PS4) 10Jbond: (WIP) systemd::resolved: start work on puppet module for systemd-resolved [puppet] - 10https://gerrit.wikimedia.org/r/678907 (https://phabricator.wikimedia.org/T171498) [12:06:05] for lurkers following along, there are 4 patches still pending from the backport deployment window, the +2 taking a little while to get through zuul for the last couple so we'll be a bit past the end of the window. [12:12:54] (03CR) 10Jbond: [C: 03+2] (WIP) systemd::resolved: start work on puppet module for systemd-resolved [puppet] - 10https://gerrit.wikimedia.org/r/678907 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [12:13:17] (03Merged) 10jenkins-bot: AddLink: Instrumentation for skipall_dialog [extensions/GrowthExperiments] (wmf/1.37.0-wmf.5) - 10https://gerrit.wikimedia.org/r/690072 (https://phabricator.wikimedia.org/T278118) (owner: 10Gergő Tisza) [12:13:20] (03Merged) 10jenkins-bot: AddLink: Instrumentation for edit summary dialog [extensions/GrowthExperiments] (wmf/1.37.0-wmf.5) - 10https://gerrit.wikimedia.org/r/690073 (https://phabricator.wikimedia.org/T278118) (owner: 10Gergő Tisza) [12:14:17] and all the merges are complete at last [12:15:19] tgr_: ^^ [12:15:37] thanks! that took a while. [12:15:45] indeed! [12:21:20] (03PS1) 10Jbond: P:idp::client::http: rename ssout to slo [puppet] - 10https://gerrit.wikimedia.org/r/690427 [12:28:27] what's it look like? 
[12:28:47] (03PS1) 10Jbond: P:idp::client::httpd: move simple sites to opt into SLO [puppet] - 10https://gerrit.wikimedia.org/r/690432 [12:29:02] (03CR) 10Jbond: [C: 03+2] P:idp::client::http: rename ssout to slo [puppet] - 10https://gerrit.wikimedia.org/r/690427 (owner: 10Jbond) [12:30:03] (03CR) 10Alexandros Kosiaris: [C: 03+1] safe-service-restart: only verify pooled services [puppet] - 10https://gerrit.wikimedia.org/r/684287 (https://phabricator.wikimedia.org/T279100) (owner: 10Giuseppe Lavagetto) [12:32:13] (03CR) 10Jbond: [C: 03+2] P:idp::client::httpd: move simple sites to opt into SLO [puppet] - 10https://gerrit.wikimedia.org/r/690432 (owner: 10Jbond) [12:35:56] tgr... ? how's the deployment coming along? [12:36:04] it's not working, but it's not breaking anything either, so I'll call that a win [12:36:19] hrm [12:36:31] I guess you're testing on mwdebug100x? [12:36:34] seems like there is configuration step for enabling logging that we aren't aware of [12:36:39] oops [12:37:07] yeah and the code is deployed (it's all client side), but event logs are not sent [12:37:32] meh [12:37:36] probably needs some config change, will figure it out later [12:38:17] I know nothing about logging from the js side, sadly [12:38:36] are you going to scap this around as is then? 
[12:39:42] yeah [12:39:53] I think the extension code is fine [12:40:38] ok, I guess just keep an eye on all the logstash tabs for a little bit, anyways [12:40:40] !log tgr@deploy1002 Synchronized php-1.37.0-wmf.5/extensions/GrowthExperiments: Backport: instrumentation patches ([[gerrit:690070|]] [[gerrit:690071|]] [[gerrit:690072|]] [[gerrit:690073|]]) (T278116 T278117 T278114 T278177 T278487 T278112 T278111 T278118) (duration: 01m 09s) [12:40:41] we'll probably need to update $wgEventStreams or something like that in mediawiki-config [12:40:42] I got 'em open over here too [12:40:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:56] T278487: Instrumentation: Number of phrases not found in document - https://phabricator.wikimedia.org/T278487 [12:40:56] T278116: Instrumentation: Link inspector - https://phabricator.wikimedia.org/T278116 [12:40:56] T278114: Instrumentation: Suggestions mode - https://phabricator.wikimedia.org/T278114 [12:40:56] T278117: Instrumentation: Rejection dialog - https://phabricator.wikimedia.org/T278117 [12:40:57] T278112: Instrumentation: dialog for no suggestions available - https://phabricator.wikimedia.org/T278112 [12:40:57] T278111: Instrumentation: Onboarding - https://phabricator.wikimedia.org/T278111 [12:40:57] T278177: Add a link: create new Schema - https://phabricator.wikimedia.org/T278177 [12:40:57] T278118: Instrumentation: Edit summary - https://phabricator.wikimedia.org/T278118 [12:49:38] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:54:36] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:06:15] for any lurkers following along, nothing new in logstash and the window is long since over :-D [13:08:08] (03PS1) 10Gergő Tisza: Enable structured_task/article/link_suggestion_interaction schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/690459 (https://phabricator.wikimedia.org/T278177) [13:11:50] (03CR) 10Gergő Tisza: "Not sure why Gerrit thinks this is in conflict with the entire universe..." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/690459 (https://phabricator.wikimedia.org/T278177) (owner: 10Gergő Tisza) [13:12:27] 10Puppet, 10SRE, 10Patch-For-Review, 10User-jbond: Extend Puppet CA Expiry date - https://phabricator.wikimedia.org/T236277 (10jbond) 05Open→03Resolved This has been completed [13:40:44] (03PS2) 10Herron: logstash: add logstash101[012] to elk7 cluster as ES backends [puppet] - 10https://gerrit.wikimedia.org/r/689994 (https://phabricator.wikimedia.org/T281266) [13:43:12] (03CR) 10Herron: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/689994 (https://phabricator.wikimedia.org/T281266) (owner: 10Herron) [13:52:15] 10SRE, 10SRE-Access-Requests, 10observability: New VictorOps user request - https://phabricator.wikimedia.org/T282784 (10Volans) p:05Triage→03Medium a:03Volans [13:54:24] 10SRE, 10SRE-Access-Requests, 10observability: New VictorOps user request - https://phabricator.wikimedia.org/T282784 (10Volans) @cmooney I've invited you and added you to the SRE team. Please follow the instructions at https://wikitech.wikimedia.org/wiki/VictorOps#Set_up_as_a_new_user Feel free to resolve... 
[13:57:55] (03PS13) 10Elukey: Add istio base images build support [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/688211 (https://phabricator.wikimedia.org/T278192) [14:01:19] (03CR) 10Herron: [C: 03+2] logstash: remove kibana and elasticsearch from role::logstash [puppet] - 10https://gerrit.wikimedia.org/r/689977 (https://phabricator.wikimedia.org/T281266) (owner: 10Herron) [14:07:14] !log Start server-side upload for 3 video files (T282558, T282556) [14:07:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:19] T282556: Server side upload for Butko - https://phabricator.wikimedia.org/T282556 [14:07:19] T282558: Server side upload for Butko - https://phabricator.wikimedia.org/T282558 [14:11:56] PROBLEM - ElasticSearch health check for shards on 9200 on logstash2006 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [14:12:24] (03PS1) 10Jbond: O:base::resolving: drop the domain keyword and use the domain fact [puppet] - 10https://gerrit.wikimedia.org/r/690515 [14:12:38] (03CR) 10Jgreen: [C: 03+2] Monitor services for new donor_prefs flow [puppet] - 10https://gerrit.wikimedia.org/r/690053 (https://phabricator.wikimedia.org/T125272) (owner: 10Dwisehaupt) [14:14:49] (03CR) 10jerkins-bot: [V: 04-1] O:base::resolving: drop the domain keyword and use the domain fact [puppet] - 10https://gerrit.wikimedia.org/r/690515 (owner: 10Jbond) [14:15:19] 10SRE, 10SRE-Access-Requests, 10observability: New VictorOps user request - https://phabricator.wikimedia.org/T282784 (10herron) Hi @cmooney I've added your VO account to the GMT+1 "batphone" rotation just now. If you'd like to adjust that please feel free to ping #observability any time. Thanks! 
[14:18:26] (03PS1) 10Jbond: (WIP) refactor resolvconf [puppet] - 10https://gerrit.wikimedia.org/r/690522 [14:19:24] (03CR) 10jerkins-bot: [V: 04-1] (WIP) refactor resolvconf [puppet] - 10https://gerrit.wikimedia.org/r/690522 (owner: 10Jbond) [14:21:23] (03CR) 10Herron: "> Patch Set 6:" [puppet] - 10https://gerrit.wikimedia.org/r/683695 (https://phabricator.wikimedia.org/T233134) (owner: 10Herron) [14:21:39] (03Abandoned) 10Herron: logstash: move kafka input configs to profile::logstash::kafka_inputs [puppet] - 10https://gerrit.wikimedia.org/r/683695 (https://phabricator.wikimedia.org/T233134) (owner: 10Herron) [14:22:03] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/689994 (https://phabricator.wikimedia.org/T281266) (owner: 10Herron) [14:22:43] (03PS2) 10Jbond: O:base::resolving: drop the domain keyword and use the domain fact [puppet] - 10https://gerrit.wikimedia.org/r/690515 [14:26:16] 10SRE, 10SRE-Access-Requests, 10observability: New VictorOps user request - https://phabricator.wikimedia.org/T282784 (10cmooney) 05Open→03Resolved Ahh thank you @herron, that makes sense! I was getting stuck following the guide which was telling me to add myself to it, but it looked like I already was... [14:29:12] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - kibana_80: Servers logstash2004.codfw.wmnet, logstash2005.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:29:30] hmm that's me [14:29:30] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - kibana_80: Servers logstash1007.eqiad.wmnet, logstash1009.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:29:39] retiring that kibana instance, no problem [14:29:50] didn't mean for it to alert though [14:31:07] (03CR) 10Dzahn: "Thank you, and fyi, it wasn't reimaged. It was just new, same as with people1002." 
[puppet] - 10https://gerrit.wikimedia.org/r/690329 (https://phabricator.wikimedia.org/T281881) (owner: 10Jcrespo) [14:31:23] I'm going to revert my patch and split this up so that can go back to OK asap [14:31:36] (03CR) 10Ppchelko: [C: 03+1] "ok, merge whenever. I have to note it's not 'latest' envoy, latest is 1.18.3" [deployment-charts] - 10https://gerrit.wikimedia.org/r/690404 (owner: 10Hnowlan) [14:32:22] (03PS1) 10Herron: Revert "logstash: remove kibana and elasticsearch from role::logstash" [puppet] - 10https://gerrit.wikimedia.org/r/690077 [14:32:58] (03CR) 10jerkins-bot: [V: 04-1] Revert "logstash: remove kibana and elasticsearch from role::logstash" [puppet] - 10https://gerrit.wikimedia.org/r/690077 (owner: 10Herron) [14:33:28] (03PS2) 10Herron: Revert "logstash: remove kibana and elasticsearch from role::logstash" [puppet] - 10https://gerrit.wikimedia.org/r/690077 [14:33:48] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - kibana_80: Servers logstash1007.eqiad.wmnet, logstash1009.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:34:19] (03CR) 10Herron: [C: 03+2] Revert "logstash: remove kibana and elasticsearch from role::logstash" [puppet] - 10https://gerrit.wikimedia.org/r/690077 (owner: 10Herron) [14:37:00] (03CR) 10Jcrespo: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/690329 (https://phabricator.wikimedia.org/T281881) (owner: 10Jcrespo) [14:37:35] (03CR) 10Elukey: "The code review should be at a good stage, let me know your thoughts :)" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/688211 (https://phabricator.wikimedia.org/T278192) (owner: 10Elukey) [14:38:25] (03PS1) 10Majavah: toolforge: Add separate role for Redis Sentinel [puppet] - 10https://gerrit.wikimedia.org/r/690528 (https://phabricator.wikimedia.org/T153810) [14:39:10] (03PS3) 10Effie Mouzeli: ProductionServices: poolcounter1004 will be rebooted for updates [mediawiki-config] 
- 10https://gerrit.wikimedia.org/r/688239 (https://phabricator.wikimedia.org/T273278) [14:40:17] (03PS1) 10Jbond: O:base::resolving: make nameservers mandatory [puppet] - 10https://gerrit.wikimedia.org/r/690529 [14:40:22] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:41:40] (03PS1) 10Effie Mouzeli: ProductionServices: add poolcounter1004 back to config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/690530 (https://phabricator.wikimedia.org/T273278) [14:41:52] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:42:32] (03CR) 10jerkins-bot: [V: 04-1] O:base::resolving: make nameservers mandatory [puppet] - 10https://gerrit.wikimedia.org/r/690529 (owner: 10Jbond) [14:42:39] (03PS1) 10Effie Mouzeli: ProductionServices: poolcounter1005 will be rebooted for updates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/690531 (https://phabricator.wikimedia.org/T273278) [14:42:50] (03CR) 10jerkins-bot: [V: 04-1] ProductionServices: add poolcounter1004 back to config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/690530 (https://phabricator.wikimedia.org/T273278) (owner: 10Effie Mouzeli) [14:44:03] (03CR) 10jerkins-bot: [V: 04-1] ProductionServices: poolcounter1005 will be rebooted for updates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/690531 (https://phabricator.wikimedia.org/T273278) (owner: 10Effie Mouzeli) [14:44:25] (03PS2) 10Effie Mouzeli: ProductionServices: add poolcounter1004 back to config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/690530 (https://phabricator.wikimedia.org/T273278) [14:44:40] PROBLEM - SSH on logstash2020.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:45:00] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - 
CRITICAL - kibana_80: Servers logstash2004.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:45:28] (03PS1) 10Herron: logstash: remove elasticsearch from role::logstash [puppet] - 10https://gerrit.wikimedia.org/r/690532 (https://phabricator.wikimedia.org/T281266) [14:46:56] (03PS2) 10Majavah: toolforge: Add separate role for Redis Sentinel [puppet] - 10https://gerrit.wikimedia.org/r/690528 (https://phabricator.wikimedia.org/T153810) [14:47:04] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:47:17] (03PS2) 10Effie Mouzeli: ProductionServices: poolcounter1005 will be rebooted for updates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/690531 (https://phabricator.wikimedia.org/T273278) [14:47:44] (03Abandoned) 10Effie Mouzeli: ProductionServices: poolcounter1005 will be rebooted for updates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/690531 (https://phabricator.wikimedia.org/T273278) (owner: 10Effie Mouzeli) [14:48:24] (03CR) 10Herron: [C: 03+2] "following up Ie91db270484908a53c9a2a8d98da2534a0c32f71 to remove ES separately from kibana" [puppet] - 10https://gerrit.wikimedia.org/r/690532 (https://phabricator.wikimedia.org/T281266) (owner: 10Herron) [14:48:40] (03CR) 10Cwhite: [C: 03+2] Add normalized object field [software/ecs] - 10https://gerrit.wikimedia.org/r/672805 (owner: 10Cwhite) [14:49:08] !log Start server-side upload for 1 video file (T282785) [14:49:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:13] T282785: Server side upload for Butko - https://phabricator.wikimedia.org/T282785 [14:49:15] (03Merged) 10jenkins-bot: Add normalized object field [software/ecs] - 10https://gerrit.wikimedia.org/r/672805 (owner: 10Cwhite) [14:49:44] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:50:42] 10SRE, 
10Traffic: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10Volans) Other random things that needs to be updated sooner or later. I hope you don't mind if I drop them here, feel free to move them to a dedicated task. Some should wait that we... [14:50:55] (03CR) 10Jbond: "https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29546" [puppet] - 10https://gerrit.wikimedia.org/r/690529 (owner: 10Jbond) [14:51:06] (03PS2) 10Jbond: O:base::resolving: make nameservers mandatory [puppet] - 10https://gerrit.wikimedia.org/r/690529 [14:53:21] (03CR) 10jerkins-bot: [V: 04-1] O:base::resolving: make nameservers mandatory [puppet] - 10https://gerrit.wikimedia.org/r/690529 (owner: 10Jbond) [14:55:46] (03CR) 10Ladsgroup: [C: 03+1] ProductionServices: poolcounter1004 will be rebooted for updates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/688239 (https://phabricator.wikimedia.org/T273278) (owner: 10Effie Mouzeli) [14:55:48] 10SRE, 10Traffic: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10BBlack) [14:56:33] (03PS1) 10Cwhite: logstash: update ES template to patch 2 [puppet] - 10https://gerrit.wikimedia.org/r/690538 [14:57:18] (03PS2) 10Cwhite: logstash: update ES template to patch 2 [puppet] - 10https://gerrit.wikimedia.org/r/690538 [14:58:55] (03CR) 10jerkins-bot: [V: 04-1] logstash: update ES template to patch 2 [puppet] - 10https://gerrit.wikimedia.org/r/690538 (owner: 10Cwhite) [15:00:12] (03PS2) 10Jbond: (WIP) refactor resolvconf [puppet] - 10https://gerrit.wikimedia.org/r/690522 [15:01:18] (03PS3) 10Jbond: (WIP) refactor resolvconf [puppet] - 10https://gerrit.wikimedia.org/r/690522 [15:02:16] (03PS4) 10Jbond: (WIP) refactor resolvconf [puppet] - 10https://gerrit.wikimedia.org/r/690522 [15:03:01] (03CR) 10Elukey: [C: 04-1] "Nope, not ready :)" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/688211 
(https://phabricator.wikimedia.org/T278192) (owner: 10Elukey) [15:03:59] (03PS1) 10Cwhite: logstash: clean up deprecated curator entries [puppet] - 10https://gerrit.wikimedia.org/r/690539 (https://phabricator.wikimedia.org/T274394) [15:04:27] (03CR) 10jerkins-bot: [V: 04-1] (WIP) refactor resolvconf [puppet] - 10https://gerrit.wikimedia.org/r/690522 (owner: 10Jbond) [15:05:34] (03PS1) 10Effie Mouzeli: mediawiki::alerts fix panelId for mediawiki exceptions alert [puppet] - 10https://gerrit.wikimedia.org/r/690540 [15:06:32] (03PS1) 10Cwhite: hiera: synchronize curator descriptions [puppet] - 10https://gerrit.wikimedia.org/r/690541 [15:08:08] (03CR) 10Herron: "so what is the process from here?" [homer/public] - 10https://gerrit.wikimedia.org/r/683050 (https://phabricator.wikimedia.org/T279342) (owner: 10Herron) [15:09:05] (03CR) 10Cwhite: [C: 03+2] "PCC checks out: https://puppet-compiler.wmflabs.org/compiler1001/29547/" [puppet] - 10https://gerrit.wikimedia.org/r/690539 (https://phabricator.wikimedia.org/T274394) (owner: 10Cwhite) [15:10:15] (03CR) 10Cwhite: [C: 03+2] hiera: synchronize curator descriptions [puppet] - 10https://gerrit.wikimedia.org/r/690541 (owner: 10Cwhite) [15:11:04] (03PS3) 10Cwhite: logstash: update ES template to patch 2 [puppet] - 10https://gerrit.wikimedia.org/r/690538 [15:12:41] (03CR) 10jerkins-bot: [V: 04-1] logstash: update ES template to patch 2 [puppet] - 10https://gerrit.wikimedia.org/r/690538 (owner: 10Cwhite) [15:14:16] (03CR) 10Volans: "> Patch Set 2:" [homer/public] - 10https://gerrit.wikimedia.org/r/683050 (https://phabricator.wikimedia.org/T279342) (owner: 10Herron) [15:14:19] (03PS4) 10Cwhite: logstash: update ES template to patch 2 [puppet] - 10https://gerrit.wikimedia.org/r/690538 [15:14:21] (03CR) 10Elukey: "Hey Keith! 
What I'd do (not sure if the best)" [homer/public] - 10https://gerrit.wikimedia.org/r/683050 (https://phabricator.wikimedia.org/T279342) (owner: 10Herron) [15:15:05] (03PS5) 10Jbond: O:base::resolver: unify resolv.con templates [puppet] - 10https://gerrit.wikimedia.org/r/690522 [15:16:17] (03CR) 10Herron: "got it! thank you both!" [homer/public] - 10https://gerrit.wikimedia.org/r/683050 (https://phabricator.wikimedia.org/T279342) (owner: 10Herron) [15:17:26] (03CR) 10Herron: [C: 03+2] add kafka-logging200[123] to kafka term [homer/public] - 10https://gerrit.wikimedia.org/r/683050 (https://phabricator.wikimedia.org/T279342) (owner: 10Herron) [15:17:28] (03CR) 10jerkins-bot: [V: 04-1] O:base::resolver: unify resolv.con templates [puppet] - 10https://gerrit.wikimedia.org/r/690522 (owner: 10Jbond) [15:17:43] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 503 (expecting: 200): / (spec from root) is CRITICAL: Test spec from root returned the unexpected status 503 (expecting: 200): /_info (retrieve service info) is CRITICAL: Test retrieve service info returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia [15:18:15] (03Merged) 10jenkins-bot: add kafka-logging200[123] to kafka term [homer/public] - 10https://gerrit.wikimedia.org/r/683050 (https://phabricator.wikimedia.org/T279342) (owner: 10Herron) [15:18:35] herron: I'm around if you need any help with the homer run [15:18:51] volans: ok thank you [15:19:27] PROBLEM - ElasticSearch health check for shards on 9200 on logstash2005 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [15:19:43] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [15:21:42] (03CR) 10Effie Mouzeli: [C: 03+2] ProductionServices: poolcounter1004 will be rebooted for updates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/688239 (https://phabricator.wikimedia.org/T273278) (owner: 10Effie Mouzeli) [15:22:47] (03Merged) 10jenkins-bot: ProductionServices: poolcounter1004 will be rebooted for updates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/688239 (https://phabricator.wikimedia.org/T273278) (owner: 10Effie Mouzeli) [15:23:47] (03PS2) 10Cwhite: logstash: update ingest errors to use dead letters gauge [puppet] - 10https://gerrit.wikimedia.org/r/672769 (https://phabricator.wikimedia.org/T277080) [15:24:24] (03PS1) 10Andrew Bogott: Trove: remove train and ussuri manifests [puppet] - 10https://gerrit.wikimedia.org/r/690544 [15:24:26] (03PS1) 10Andrew Bogott: Trove: hack in fixes to the dns integration [puppet] - 10https://gerrit.wikimedia.org/r/690545 [15:24:28] (03PS1) 10Andrew Bogott: OpenStack Trove: enable dns integration for DB access [puppet] - 10https://gerrit.wikimedia.org/r/690546 (https://phabricator.wikimedia.org/T212595) [15:25:22] (03CR) 10jerkins-bot: [V: 04-1] Trove: hack in fixes to the dns integration [puppet] - 10https://gerrit.wikimedia.org/r/690545 (owner: 10Andrew Bogott) [15:25:45] volans: I'm seeing diffs only on eqiad cr, but maybe that's normal? 
I would have thought codfw too though [15:26:07] (03CR) 10Andrew Bogott: [C: 03+2] Trove: remove train and ussuri manifests [puppet] - 10https://gerrit.wikimedia.org/r/690544 (owner: 10Andrew Bogott) [15:26:22] herron: checking [15:26:24] (03CR) 10Jbond: [C: 03+1] "lgtm, nbit inline but its personal style so feel free to ignore" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/690041 (owner: 10Volans) [15:26:37] (03CR) 10Cwhite: [C: 03+2] logstash: update ingest errors to use dead letters gauge [puppet] - 10https://gerrit.wikimedia.org/r/672769 (https://phabricator.wikimedia.org/T277080) (owner: 10Cwhite) [15:26:40] (03CR) 10jerkins-bot: [V: 04-1] OpenStack Trove: enable dns integration for DB access [puppet] - 10https://gerrit.wikimedia.org/r/690546 (https://phabricator.wikimedia.org/T212595) (owner: 10Andrew Bogott) [15:27:26] (03CR) 10Bstorm: toolforge: re-enable toolforge certificate monitor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/690055 (https://phabricator.wikimedia.org/T282264) (owner: 10Bstorm) [15:27:37] !log jiji@deploy1002 Synchronized wmf-config/ProductionServices.php: Config: [[gerrit:688239|ProductionServices: poolcounter1004 will be rebooted for updates (T273278)]] (duration: 01m 08s) [15:27:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:03] herron: no, seems correct, at line 438 there is {% if metadata['site'] == "eqiad" %} [15:28:07] and you are still inside that if [15:28:21] (03PS2) 10Jbond: admin: drop christinedk do not merge before 17/05/2021 [puppet] - 10https://gerrit.wikimedia.org/r/690405 [15:28:32] that closes at line 1145... a bit hard to find I agree [15:28:35] (03CR) 10Jbond: [C: 04-1] "-=1 untill the 17th" [puppet] - 10https://gerrit.wikimedia.org/r/690405 (owner: 10Jbond) [15:28:42] volans: ah! 
I see that now, makes sense [15:29:06] thanks for asking [15:30:20] (03PS2) 10Andrew Bogott: Trove: hack in fixes to the dns integration [puppet] - 10https://gerrit.wikimedia.org/r/690545 [15:30:22] (03PS2) 10Andrew Bogott: OpenStack Trove: enable dns integration for DB access [puppet] - 10https://gerrit.wikimedia.org/r/690546 (https://phabricator.wikimedia.org/T212595) [15:32:01] volans: thx for the help. have committed to eqiad just now, and finished successfully [15:32:37] (03CR) 10jerkins-bot: [V: 04-1] Trove: hack in fixes to the dns integration [puppet] - 10https://gerrit.wikimedia.org/r/690545 (owner: 10Andrew Bogott) [15:33:20] herron: great! thank you! [15:35:00] (03PS2) 10Cwhite: logstash: clean up mtail config [puppet] - 10https://gerrit.wikimedia.org/r/672771 (https://phabricator.wikimedia.org/T277080) [15:35:55] (03PS3) 10Cwhite: logstash: clean up mtail config [puppet] - 10https://gerrit.wikimedia.org/r/672771 (https://phabricator.wikimedia.org/T277080) [15:36:35] (03PS3) 10Volans: homer: disable diff checker on cumin2001 [puppet] - 10https://gerrit.wikimedia.org/r/690041 [15:36:56] 10SRE, 10Prod-Kubernetes, 10Pybal, 10Traffic, 10serviceops: Proposal: simplify set up of a new load-balanced service on kubernetes - https://phabricator.wikimedia.org/T238909 (10akosiaris) Upstream calico issue at https://github.com/projectcalico/calico/issues/4607 I am also working on a PR, I 'll post... 
[15:40:18] (03PS3) 10Andrew Bogott: Trove: hack in fixes to the dns integration [puppet] - 10https://gerrit.wikimedia.org/r/690545 [15:40:20] (03PS3) 10Andrew Bogott: OpenStack Trove: enable dns integration for DB access [puppet] - 10https://gerrit.wikimedia.org/r/690546 (https://phabricator.wikimedia.org/T212595) [15:42:13] (03CR) 10jerkins-bot: [V: 04-1] OpenStack Trove: enable dns integration for DB access [puppet] - 10https://gerrit.wikimedia.org/r/690546 (https://phabricator.wikimedia.org/T212595) (owner: 10Andrew Bogott) [15:42:40] (03CR) 10Bstorm: [C: 03+2] wikireplicas-dns: condense repeated nodes for better failover [puppet] - 10https://gerrit.wikimedia.org/r/688501 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [15:42:42] (03CR) 10jerkins-bot: [V: 04-1] Trove: hack in fixes to the dns integration [puppet] - 10https://gerrit.wikimedia.org/r/690545 (owner: 10Andrew Bogott) [15:42:48] (03CR) 10Volans: "Addresse comments" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/690041 (owner: 10Volans) [15:44:53] 10SRE, 10Analytics, 10LDAP-Access-Requests, 10CommRel-Specialists-Support (Apr-Jun-2021): Please grant CRS access to Superset/Turnilo (deadline EOD Monday 17) - https://phabricator.wikimedia.org/T282589 (10mpopov) Thanks so much @elukey you're the best! 
[15:45:39] (03CR) 10Jbond: homer: disable diff checker on cumin2001 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/690041 (owner: 10Volans) [15:45:50] (03PS4) 10Volans: homer: disable diff checker on cumin2001 [puppet] - 10https://gerrit.wikimedia.org/r/690041 [15:46:10] !log restarting poolcounter1004 [15:46:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:35] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host poolcounter1004.eqiad.wmnet [15:46:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:55] (03PS4) 10Andrew Bogott: Trove: hack in fixes to the dns integration [puppet] - 10https://gerrit.wikimedia.org/r/690545 [15:46:57] (03PS4) 10Andrew Bogott: OpenStack Trove: enable dns integration for DB access [puppet] - 10https://gerrit.wikimedia.org/r/690546 (https://phabricator.wikimedia.org/T212595) [15:46:59] (03PS1) 10Andrew Bogott: jypterhub: linter fix [puppet] - 10https://gerrit.wikimedia.org/r/690554 [15:48:01] (03PS5) 10Volans: homer: disable diff checker on cumin2001 [puppet] - 10https://gerrit.wikimedia.org/r/690041 [15:48:24] (03CR) 10Andrew Bogott: [C: 03+2] jypterhub: linter fix [puppet] - 10https://gerrit.wikimedia.org/r/690554 (owner: 10Andrew Bogott) [15:49:20] (03CR) 10jerkins-bot: [V: 04-1] Trove: hack in fixes to the dns integration [puppet] - 10https://gerrit.wikimedia.org/r/690545 (owner: 10Andrew Bogott) [15:49:37] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host poolcounter1004.eqiad.wmnet [15:49:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:11] (03PS6) 10Jbond: homer: disable diff checker on cumin2001 [puppet] - 10https://gerrit.wikimedia.org/r/690041 (owner: 10Volans) [15:50:40] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/690041 (owner: 10Volans) [15:54:00] (03PS5) 10Andrew Bogott: Trove: hack in fixes to the dns integration 
[puppet] - 10https://gerrit.wikimedia.org/r/690545 [15:54:02] (03PS5) 10Andrew Bogott: OpenStack Trove: enable dns integration for DB access [puppet] - 10https://gerrit.wikimedia.org/r/690546 (https://phabricator.wikimedia.org/T212595) [15:54:48] (03CR) 10Effie Mouzeli: [C: 03+2] ProductionServices: add poolcounter1004 back to config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/690530 (https://phabricator.wikimedia.org/T273278) (owner: 10Effie Mouzeli) [15:54:51] (03CR) 10Volans: [C: 03+2] homer: disable diff checker on cumin2001 [puppet] - 10https://gerrit.wikimedia.org/r/690041 (owner: 10Volans) [15:55:54] (03Merged) 10jenkins-bot: ProductionServices: add poolcounter1004 back to config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/690530 (https://phabricator.wikimedia.org/T273278) (owner: 10Effie Mouzeli) [15:56:39] (03CR) 10Andrew Bogott: [C: 03+2] Trove: hack in fixes to the dns integration [puppet] - 10https://gerrit.wikimedia.org/r/690545 (owner: 10Andrew Bogott) [15:56:50] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack Trove: enable dns integration for DB access [puppet] - 10https://gerrit.wikimedia.org/r/690546 (https://phabricator.wikimedia.org/T212595) (owner: 10Andrew Bogott) [15:56:59] (03PS6) 10Andrew Bogott: OpenStack Trove: enable dns integration for DB access [puppet] - 10https://gerrit.wikimedia.org/r/690546 (https://phabricator.wikimedia.org/T212595) [15:58:08] !log jiji@deploy1002 Synchronized wmf-config/ProductionServices.php: Config: [[gerrit:690530|ProductionServices: add poolcounter1004 back to config (T273278)]] (duration: 01m 07s) [15:58:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:14] (03PS2) 10Razzi: kerberos: require --email_address for create and reset-password [puppet] - 10https://gerrit.wikimedia.org/r/686766 (https://phabricator.wikimedia.org/T282185) [15:58:35] (03CR) 10Razzi: "Updated!" 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/686766 (https://phabricator.wikimedia.org/T282185) (owner: 10Razzi) [15:59:54] (03PS1) 10Effie Mouzeli: ProductionServices: poolcounter1005 will be rebooted for updates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/690557 (https://phabricator.wikimedia.org/T273278) [16:00:05] jbond42 and cdanis: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210513T1600). [16:00:05] Urbanecm: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:11] \o/ [16:01:05] (03PS1) 10Effie Mouzeli: ProductionServices: add poolcounter1005 back to config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/690558 (https://phabricator.wikimedia.org/T273278) [16:04:12] (03CR) 10Ladsgroup: [C: 03+1] "1004 is now fully pooled." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/690557 (https://phabricator.wikimedia.org/T273278) (owner: 10Effie Mouzeli) [16:04:13] jbond42: cdanis: hello, is any of you able to deploy my puppet patch please? 
🙂 [16:05:01] (03PS14) 10Elukey: Add istio base images build support [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/688211 (https://phabricator.wikimedia.org/T278192) [16:05:17] Urbanecm: looking [16:05:44] (03CR) 10Jbond: [C: 03+2] Add vrt-wiki.wikimedia.org to mediawiki.yaml [puppet] - 10https://gerrit.wikimedia.org/r/683000 (https://phabricator.wikimedia.org/T280400) (owner: 10Urbanecm) [16:06:20] Urbanecm: merged [16:06:24] (03CR) 10Effie Mouzeli: [C: 03+2] ProductionServices: poolcounter1005 will be rebooted for updates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/690557 (https://phabricator.wikimedia.org/T273278) (owner: 10Effie Mouzeli) [16:06:31] jbond42: thank you, appreciated [16:06:37] no probs [16:06:39] (03PS15) 10Elukey: Add istio base images build support [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/688211 (https://phabricator.wikimedia.org/T278192) [16:07:27] (03Merged) 10jenkins-bot: ProductionServices: poolcounter1005 will be rebooted for updates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/690557 (https://phabricator.wikimedia.org/T273278) (owner: 10Effie Mouzeli) [16:07:56] (03CR) 10Cwhite: "This change is ready for review." 
[software/ecs] - 10https://gerrit.wikimedia.org/r/636515 (owner: 10Cwhite) [16:09:24] !log jiji@deploy1002 Synchronized wmf-config/ProductionServices.php: Config: [[gerrit:690557|ProductionServices: poolcounter1005 will be rebooted for updates (T273278)]] (duration: 01m 07s) [16:09:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:09] ah thanks jbond42 [16:10:13] just back from grabbing some food :) [16:10:17] (03PS1) 10Zabe: Make sure mId exists [extensions/GeoData] (wmf/1.37.0-wmf.5) - 10https://gerrit.wikimedia.org/r/690078 (https://phabricator.wikimedia.org/T282735) [16:13:31] (03PS1) 10Andrew Bogott: Trove: fix up hacks for dns integration [puppet] - 10https://gerrit.wikimedia.org/r/690560 [16:14:10] (03CR) 10jerkins-bot: [V: 04-1] Trove: fix up hacks for dns integration [puppet] - 10https://gerrit.wikimedia.org/r/690560 (owner: 10Andrew Bogott) [16:14:53] (03PS2) 10Andrew Bogott: Trove: fix up hacks for dns integration [puppet] - 10https://gerrit.wikimedia.org/r/690560 [16:17:24] (03CR) 10Andrew Bogott: [C: 03+2] Trove: fix up hacks for dns integration [puppet] - 10https://gerrit.wikimedia.org/r/690560 (owner: 10Andrew Bogott) [16:21:04] (03PS3) 10Dwisehaupt: Add new payments hosts to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/682186 (https://phabricator.wikimedia.org/T266481) [16:22:54] (03CR) 10Jgreen: [C: 03+2] Add new payments hosts to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/682186 (https://phabricator.wikimedia.org/T266481) (owner: 10Dwisehaupt) [16:24:29] !log rebooting poolcounter1005 [16:24:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:47] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host poolcounter1005.eqiad.wmnet [16:24:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:34] (03PS1) 10Bstorm: labstore: Switch DRBD devices to using the 10Gb addresses [puppet] - 
10https://gerrit.wikimedia.org/r/690563 (https://phabricator.wikimedia.org/T266198) [16:26:27] (03PS2) 10Bstorm: labstore: Switch DRBD devices to using the 10Gb addresses [puppet] - 10https://gerrit.wikimedia.org/r/690563 (https://phabricator.wikimedia.org/T282754) [16:26:52] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host poolcounter1005.eqiad.wmnet [16:26:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:23] (03PS8) 10Thcipriani: Beta: Clean puppet cherry-picks [puppet] - 10https://gerrit.wikimedia.org/r/310719 (https://phabricator.wikimedia.org/T135427) [16:27:52] (03CR) 10jerkins-bot: [V: 04-1] Beta: Clean puppet cherry-picks [puppet] - 10https://gerrit.wikimedia.org/r/310719 (https://phabricator.wikimedia.org/T135427) (owner: 10Thcipriani) [16:29:30] (03PS9) 10Thcipriani: Beta: Clean puppet cherry-picks [puppet] - 10https://gerrit.wikimedia.org/r/310719 (https://phabricator.wikimedia.org/T135427) [16:30:04] (03CR) 10jerkins-bot: [V: 04-1] Beta: Clean puppet cherry-picks [puppet] - 10https://gerrit.wikimedia.org/r/310719 (https://phabricator.wikimedia.org/T135427) (owner: 10Thcipriani) [16:31:23] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: clean up mtail config [puppet] - 10https://gerrit.wikimedia.org/r/672771 (https://phabricator.wikimedia.org/T277080) (owner: 10Cwhite) [16:33:39] (03CR) 10Effie Mouzeli: [C: 03+2] ProductionServices: add poolcounter1005 back to config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/690558 (https://phabricator.wikimedia.org/T273278) (owner: 10Effie Mouzeli) [16:34:33] (03Merged) 10jenkins-bot: ProductionServices: add poolcounter1005 back to config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/690558 (https://phabricator.wikimedia.org/T273278) (owner: 10Effie Mouzeli) [16:36:33] !log jiji@deploy1002 Synchronized wmf-config/ProductionServices.php: Config: [[gerrit:690558|ProductionServices: add poolcounter1005 back to config 
(T273278)]] (duration: 01m 07s) [16:36:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:31] (03CR) 10Nettrom: [C: 03+1] "Looks good to me" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/690459 (https://phabricator.wikimedia.org/T278177) (owner: 10Gergő Tisza) [16:39:14] (03CR) 10Elukey: kerberos: require --email_address for create and reset-password (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/686766 (https://phabricator.wikimedia.org/T282185) (owner: 10Razzi) [16:43:55] (03PS3) 10Bstorm: labstore: Switch DRBD devices to using the 10Gb addresses [puppet] - 10https://gerrit.wikimedia.org/r/690563 (https://phabricator.wikimedia.org/T282754) [16:45:03] (03PS4) 10Bstorm: labstore: Switch DRBD devices to using the 10Gb addresses [puppet] - 10https://gerrit.wikimedia.org/r/690563 (https://phabricator.wikimedia.org/T282754) [16:47:34] (03CR) 10Bstorm: "PCC looks right: https://puppet-compiler.wmflabs.org/compiler1002/29557/" [puppet] - 10https://gerrit.wikimedia.org/r/690563 (https://phabricator.wikimedia.org/T282754) (owner: 10Bstorm) [16:48:38] (03CR) 10Bstorm: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/690563 (https://phabricator.wikimedia.org/T282754) (owner: 10Bstorm) [16:51:34] (03PS5) 10Bstorm: labstore: Switch DRBD devices to using the 10Gb addresses [puppet] - 10https://gerrit.wikimedia.org/r/690563 (https://phabricator.wikimedia.org/T282754) [16:53:15] (03CR) 10Bstorm: "That looks more right https://puppet-compiler.wmflabs.org/compiler1002/29558/" [puppet] - 10https://gerrit.wikimedia.org/r/690563 (https://phabricator.wikimedia.org/T282754) (owner: 10Bstorm) [16:57:14] 10SRE, 10Analytics, 10LDAP-Access-Requests, 10CommRel-Specialists-Support (Apr-Jun-2021): Please grant CRS access to Superset/Turnilo (deadline EOD Monday 17) - https://phabricator.wikimedia.org/T282589 (10Milimetric) p:05Triage→03High a:03elukey [16:58:07] (03CR) 10Bstorm: [C: 03+2] labstore: Switch DRBD devices to using 
the 10Gb addresses [puppet] - 10https://gerrit.wikimedia.org/r/690563 (https://phabricator.wikimedia.org/T282754) (owner: 10Bstorm) [17:00:04] chrisalbon and accraze: It is that lovely time of the day again! You are hereby commanded to deploy Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210513T1700). [17:16:04] !log andrew@deploy1002 Started deploy [horizon/deploy@3d160f6]: Adding Database dashboards [17:16:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:04] (03PS1) 10Bstorm: Revert "labstore: Switch DRBD devices to using the 10Gb addresses" [puppet] - 10https://gerrit.wikimedia.org/r/690079 [17:19:33] (03CR) 10MewOphaswongse: [C: 03+1] Enable structured_task/article/link_suggestion_interaction schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/690459 (https://phabricator.wikimedia.org/T278177) (owner: 10Gergő Tisza) [17:20:00] (03CR) 10Bstorm: [C: 03+2] Revert "labstore: Switch DRBD devices to using the 10Gb addresses" [puppet] - 10https://gerrit.wikimedia.org/r/690079 (owner: 10Bstorm) [17:20:12] !log andrew@deploy1002 Finished deploy [horizon/deploy@3d160f6]: Adding Database dashboards (duration: 04m 08s) [17:20:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:25] (03PS3) 10Majavah: Look for service.template in various code directories [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/636993 (https://phabricator.wikimedia.org/T266692) (owner: 10Legoktm) [17:24:38] (03CR) 10Majavah: [C: 03+2] "Tested and working, code looks good => ship it!" 
[software/tools-webservice] - 10https://gerrit.wikimedia.org/r/636993 (https://phabricator.wikimedia.org/T266692) (owner: 10Legoktm) [17:26:33] (03Merged) 10jenkins-bot: Look for service.template in various code directories [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/636993 (https://phabricator.wikimedia.org/T266692) (owner: 10Legoktm) [17:30:07] 10ops-eqiad, 10Data-Services, 10cloud-services-team (Hardware): labstore1004/5: buy a DAC 10Gb cable or adjust the current fiber cable for DAC/crossover - https://phabricator.wikimedia.org/T282799 (10Bstorm) @wiki_willy I worked with @Jclark-ctr yesterday (who probably will know the length when he's availabl... [17:31:05] 10SRE, 10ops-eqiad, 10Data-Services, 10Epic, 10cloud-services-team (Hardware): Move labstore1004 and labstore1005 to 10G Ethernet - https://phabricator.wikimedia.org/T266198 (10Bstorm) [17:39:54] (03PS1) 10Gergő Tisza: Suggested edits: Set footer color for topic filter dialog [extensions/GrowthExperiments] (wmf/1.37.0-wmf.5) - 10https://gerrit.wikimedia.org/r/690080 (https://phabricator.wikimedia.org/T282711) [17:40:51] 10SRE, 10LDAP-Access-Requests, 10CommRel-Specialists-Support (Apr-Jun-2021): Grant access to LDAP nda for Sannita - https://phabricator.wikimedia.org/T282600 (10KFrancis) @Urbanecm @elukey Hi all, just to confirm, if this person is currently a contractor with the WMF, they would be covered under their NDA wi... [17:41:47] (03CR) 10Ahmon Dancy: multiversion: enhance buildDBList output (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689673 (owner: 10Hashar) [17:43:05] (03CR) 10Dave Pifke: [C: 03+1] "Overall, this looks great. 
It definitely solves my use case of sometimes forgetting to un-cherry-pick an earlier draft after I get a patc" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/310719 (https://phabricator.wikimedia.org/T135427) (owner: 10Thcipriani) [17:47:25] (03PS1) 10Urbanecm: Growth features: Push elwiki and cawiki out of dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/690606 (https://phabricator.wikimedia.org/T280673) [17:47:50] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:50:06] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:52:27] (03CR) 10Ahmon Dancy: "I tried running `php ./multiversion/buildDBLists.php` in my environment but I get:" (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689673 (owner: 10Hashar) [17:58:16] (03PS6) 10Matthias Mullie: Enable Extension:MediaSearch on testcommons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682102 (https://phabricator.wikimedia.org/T265939) [18:00:05] RoanKattouw, Niharika, and Urbanecm: #bothumor I � Unicode. All rise for Morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210513T1800). [18:00:05] matthiasmullie, mewoph, tgr, and Urbanecm: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[18:00:10] I can deploy today [18:00:12] o/ [18:00:26] o/ [18:00:41] o/ [18:00:46] (03PS10) 10Thcipriani: Beta: Clean puppet cherry-picks [puppet] - 10https://gerrit.wikimedia.org/r/310719 (https://phabricator.wikimedia.org/T135427) [18:00:52] hello mewoph, tgr_ and matthiasmullie 🙂 [18:01:02] (03CR) 10Urbanecm: [C: 03+2] Add a link: select annotation view when acceptance changes on desktop [extensions/GrowthExperiments] (wmf/1.37.0-wmf.5) - 10https://gerrit.wikimedia.org/r/689898 (https://phabricator.wikimedia.org/T282175) (owner: 10Kosta Harlan) [18:01:06] (03CR) 10Urbanecm: [C: 03+2] Suggested edits: Set footer color for topic filter dialog [extensions/GrowthExperiments] (wmf/1.37.0-wmf.5) - 10https://gerrit.wikimedia.org/r/690080 (https://phabricator.wikimedia.org/T282711) (owner: 10Gergő Tisza) [18:01:21] (03CR) 10jerkins-bot: [V: 04-1] Beta: Clean puppet cherry-picks [puppet] - 10https://gerrit.wikimedia.org/r/310719 (https://phabricator.wikimedia.org/T135427) (owner: 10Thcipriani) [18:03:22] (03CR) 10Urbanecm: [C: 03+2] Enable Extension:MediaSearch on testcommons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682102 (https://phabricator.wikimedia.org/T265939) (owner: 10Matthias Mullie) [18:03:35] (03PS3) 10Urbanecm: Enable Extension:MediaSearch on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682105 (https://phabricator.wikimedia.org/T265939) (owner: 10Matthias Mullie) [18:03:38] (03CR) 10Urbanecm: [C: 03+2] Enable Extension:MediaSearch on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682105 (https://phabricator.wikimedia.org/T265939) (owner: 10Matthias Mullie) [18:04:21] (03Merged) 10jenkins-bot: Enable Extension:MediaSearch on testcommons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682102 (https://phabricator.wikimedia.org/T265939) (owner: 10Matthias Mullie) [18:04:50] (03Merged) 10jenkins-bot: Enable Extension:MediaSearch on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682105 
(https://phabricator.wikimedia.org/T265939) (owner: 10Matthias Mullie) [18:05:31] matthiasmullie: both patches are pulled to mwdebug1001, can you test, please? [18:05:53] doing [18:06:41] LGTM! [18:07:04] thanks [18:07:37] lgtm (https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/689898/) [18:08:08] PROBLEM - PHP opcache health on mwdebug1001 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [18:09:03] (03PS3) 10Matthias Mullie: Enable media change tags on wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674882 (https://phabricator.wikimedia.org/T266067) [18:09:28] (03CR) 10Urbanecm: [C: 03+2] Enable media change tags on wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674882 (https://phabricator.wikimedia.org/T266067) (owner: 10Matthias Mullie) [18:09:46] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: b3300c3: 59c8448: Enable Extension:MediaSearch on (test)commons (T265939) (duration: 01m 08s) [18:09:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:51] T265939: Split MediaSearch out into its own extension - https://phabricator.wikimedia.org/T265939 [18:09:53] first two patches are live matthiasmullie [18:10:02] thanks! [18:10:12] 10SRE, 10LDAP-Access-Requests, 10CommRel-Specialists-Support (Apr-Jun-2021): Grant access to LDAP nda for Sannita - https://phabricator.wikimedia.org/T282600 (10elukey) @KFrancis hi! Thanks a lot for the answer, if I got it correctly this should be the case. Could you please confirm if we can proceed? 
[18:10:17] (03Merged) 10jenkins-bot: Enable media change tags on wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674882 (https://phabricator.wikimedia.org/T266067) (owner: 10Matthias Mullie) [18:10:24] RECOVERY - PHP opcache health on mwdebug1001 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [18:10:25] mewoph: I did not yet pull https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/689898/ to the cluster 🙂 [18:10:36] whoops looking at wrong env sorry about that! [18:11:12] no problem. The names can be a bit confusing [18:11:52] matthiasmullie: Enable media change tags on wikipedias is pulled to mwdebug1001. Note that if it uses jobs, it probably won't be testable there. [18:12:14] (03CR) 10Urbanecm: [C: 03+2] Growth features: Push elwiki and cawiki out of dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/690606 (https://phabricator.wikimedia.org/T280673) (owner: 10Urbanecm) [18:12:19] (03PS2) 10Urbanecm: Growth features: Push elwiki and cawiki out of dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/690606 (https://phabricator.wikimedia.org/T280673) [18:12:22] (03CR) 10Urbanecm: [C: 03+2] Growth features: Push elwiki and cawiki out of dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/690606 (https://phabricator.wikimedia.org/T280673) (owner: 10Urbanecm) [18:13:02] Urbanecm: let's check whether it works [18:13:12] (03Merged) 10jenkins-bot: Growth features: Push elwiki and cawiki out of dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/690606 (https://phabricator.wikimedia.org/T280673) (owner: 10Urbanecm) [18:13:23] matthiasmullie: feel free to, it's at mwdebug1001 now :). [18:15:38] mewoph: tgr_: since your patches are for add a link, which is not yet user visible, is it okay if i just sync them once they merge? 
[18:16:25] 10SRE, 10Wikimedia-Hackathon-2021, 10Wikimedia-Mailing-lists, 10Upstream: Add OAuth login to mailman for accessing list memberships/archive viewing - https://phabricator.wikimedia.org/T249678 (10Legoktm) I think so, but I haven't really looked into it yet. Both https://mail.python.org/accounts/login/?next=... [18:17:20] Urbanecm: doesn't seem to be working, but that's not unexpected; I guess we can move forward since nothing broke :p [18:17:30] okay, syncing [18:18:43] Urbanecm: mine is not (or not exclusively) [18:19:11] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 04eb9d30b069e60004a42fcb128a958a24aee229: Enable media change tags on wikipedias (T266067) (duration: 01m 07s) [18:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:14] right [18:19:15] T266067: [L] Create edit tags to measure multimedia edits to Wikipedia articles - https://phabricator.wikimedia.org/T266067 [18:19:18] i'll fetch it then [18:19:22] (to mwdebug) [18:21:01] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 4cd6a782a946e121f5f8301e2649be8d338baaf8: Growth features: Push elwiki and cawiki out of dark mode (T280673; T280172) (duration: 01m 07s) [18:21:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:06] T280172: Deploy Growth features on Greek Wikipedia - https://phabricator.wikimedia.org/T280172 [18:21:06] T280673: Deploy Growth features on Catalan Wikipedia - https://phabricator.wikimedia.org/T280673 [18:22:33] (03PS11) 10Thcipriani: Beta: Clean puppet cherry-picks [puppet] - 10https://gerrit.wikimedia.org/r/310719 (https://phabricator.wikimedia.org/T135427) [18:22:54] !log Start server-side upload for 2 video files (T282643, T282644) [18:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:59] T282644: Server side upload for Sturm - https://phabricator.wikimedia.org/T282644 [18:22:59] T282643: Server side upload for 
Sturm - https://phabricator.wikimedia.org/T282643 [18:25:18] (03CR) 10Majavah: [C: 04-1] Beta: Clean puppet cherry-picks (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/310719 (https://phabricator.wikimedia.org/T135427) (owner: 10Thcipriani) [18:25:28] (03Merged) 10jenkins-bot: Add a link: select annotation view when acceptance changes on desktop [extensions/GrowthExperiments] (wmf/1.37.0-wmf.5) - 10https://gerrit.wikimedia.org/r/689898 (https://phabricator.wikimedia.org/T282175) (owner: 10Kosta Harlan) [18:25:38] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts people2001.codfw.wmnet [18:25:40] (03Merged) 10jenkins-bot: Suggested edits: Set footer color for topic filter dialog [extensions/GrowthExperiments] (wmf/1.37.0-wmf.5) - 10https://gerrit.wikimedia.org/r/690080 (https://phabricator.wikimedia.org/T282711) (owner: 10Gergő Tisza) [18:25:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:27] !log people2001 is going down - people1003 (eqiad) and people2002 (codfw) are your replacements on bullseye [18:26:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:26] mewoph: tgr_: your patches are at mwdebug1001, please test. [18:28:18] (03CR) 10Thcipriani: [C: 04-1] "Thanks for review Majavah! This patch may need a bit more dusting off" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/310719 (https://phabricator.wikimedia.org/T135427) (owner: 10Thcipriani) [18:31:16] hm, the task is not exactly clear about how to reproduce the footer bug [18:32:09] so not sure how to test. The page looks fine with the patch, in any case, so please go ahead. [18:32:18] verified add link change (when annotation is accepted, it should be highlighted) [18:32:40] thanks mewoph [18:32:46] tgr_: okay, thanks. 
syncing [18:32:50] for the footer change, it should be reproducible when the page is really short [18:33:10] (so the overlay would scroll) [18:34:02] (03PS2) 10Jcrespo: bacula: Reenable read-write ES database backups, disable read-only [puppet] - 10https://gerrit.wikimedia.org/r/690338 (https://phabricator.wikimedia.org/T282249) [18:34:03] doesn't happen for me, maybe it's browser-dependent? [18:34:18] the topic area scrolls, but it's not under the footer [18:34:47] anyway, it's not a risky backport [18:34:53] yup yup [18:34:55] already syncing [18:35:00] Urbanecm: do you have time for one more? [18:35:02] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.5/extensions/GrowthExperiments/: 0856ae1: ca52e78: GrowthExperiments backports (T282711, T282175) (duration: 01m 08s) [18:35:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:08] T282711: Homepage: footer of topic filter overlaps topics - https://phabricator.wikimedia.org/T282711 [18:35:09] T282175: Add link: an image placeholder flickers when Yes button clicked - https://phabricator.wikimedia.org/T282175 [18:35:09] tgr_: sure :) [18:35:14] link? 
[18:35:18] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/690459 [18:35:29] I'll add it to the wiki page [18:35:39] (03PS2) 10Urbanecm: Enable structured_task/article/link_suggestion_interaction schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/690459 (https://phabricator.wikimedia.org/T278177) (owner: 10Gergő Tisza) [18:35:48] (03CR) 10Urbanecm: [C: 03+2] Enable structured_task/article/link_suggestion_interaction schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/690459 (https://phabricator.wikimedia.org/T278177) (owner: 10Gergő Tisza) [18:35:55] (I hope I don't get arrested by the deploy police for adding a seventh patch) [18:36:21] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts people2001.codfw.wmnet [18:36:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:28] 10SRE, 10Patch-For-Review: try planet/people on bullseye / upgrade people.wikimedia.org backends to bullseye - https://phabricator.wikimedia.org/T280989 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `people2001.codfw.wmnet` - people2001.codfw.wmnet (**PASS**)... 
[18:36:41] (03Merged) 10jenkins-bot: Enable structured_task/article/link_suggestion_interaction schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/690459 (https://phabricator.wikimedia.org/T278177) (owner: 10Gergő Tisza) [18:36:44] tgr_: hehe, it's fine as long as we have time (and we have almost 25 minutes) [18:37:03] pulled onto mwdebug1001, in case you want to test it there [18:38:08] I do, first time we are using new-style event logging [18:38:47] in that case, mwdebug1001 is yours :) [18:39:56] it's used by a ResourceLoader callback, so I guess we'll have to wait a bit [18:41:26] or use ?debug=1 [18:41:41] I don't trust debug mode [18:45:30] (03PS1) 10Nray: Fix 'final_state: vector' bug in VectorPrefDiffInstrumentation [extensions/WikimediaEvents] (wmf/1.37.0-wmf.5) - 10https://gerrit.wikimedia.org/r/690081 (https://phabricator.wikimedia.org/T261842) [18:48:40] can't be this slow, can it? [18:49:27] there's a ResourceLoader callback (theoretically) putting that config field into a pseudo JSON file, and it's not there [18:49:31] (03CR) 10Andrew Bogott: "> Patch Set 2: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/689262 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [18:50:02] oh duh, the browser extension timer ran out [18:50:09] I hate it when that happens [18:50:27] that explains it [18:52:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1038: PCIe error - https://phabricator.wikimedia.org/T276922 (10Cmjohnson) The last Dell tech that came in identified the problem as a riser card, Oddly enough this was replaced already but maybe the second time is the charm. Dell is... [18:53:22] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:53:24] still not working though. 
[18:53:32] let me see [18:53:46] I think there is a genuine bug. Not in the config patch though. [18:54:11] it is definitely pulled to mwdebug1001 [18:54:15] which means a fix won't fit into this deploy window. [18:54:38] yeah, I can verify that the client code is updated. [18:54:53] (03CR) 10Andrew Bogott: [C: 04-1] "Thanks for working on this! I've noted a few missing services inline." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/689262 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [18:55:03] cool [18:55:26] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:55:59] hm, I can fix this in the config. [18:56:48] 10SRE, 10Services, 10Patch-For-Review, 10Service-deployment-requests: New Service Request miscweb - https://phabricator.wikimedia.org/T281538 (10Dzahn) a:03Dzahn requested new repo operations/miscweb at https://www.mediawiki.org/wiki/Gerrit/New_repositories/Requests to host the Blubber file and static H... [18:57:10] cool :) [18:57:37] (03PS1) 10Gergő Tisza: Fix link_suggestion_interaction stream name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/690661 (https://phabricator.wikimedia.org/T278177) [18:57:49] ohh... [18:57:58] do we have time to deploy that? ^ [18:58:04] sure [18:58:08] (03CR) 10Urbanecm: [C: 03+2] Fix link_suggestion_interaction stream name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/690661 (https://phabricator.wikimedia.org/T278177) (owner: 10Gergő Tisza) [18:59:03] (03Merged) 10jenkins-bot: Fix link_suggestion_interaction stream name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/690661 (https://phabricator.wikimedia.org/T278177) (owner: 10Gergő Tisza) [18:59:33] pulled to mwdebug1001, can you check tgr_ ? 
[18:59:52] !log Morning B&C is going to take few more minutes [18:59:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:05] dancy and brennen: My dear minions, it's time we take the moon! Just kidding. Time for MediaWiki train - American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210513T1900). [19:00:40] dancy: brennen: please hold train for a minute [19:00:59] ok [19:01:01] Urbanecm: ack. i believe we're still blocked anyhow, although probably not after a patch is deployed. [19:01:10] (03PS1) 10Dzahn: site: remove people1002 and people2001, update comments [puppet] - 10https://gerrit.wikimedia.org/r/690666 (https://phabricator.wikimedia.org/T280989) [19:01:27] thanks both :) [19:01:28] looks like there's one for T282735 [19:01:29] T282735: Wikimedia\Rdbms\DBQueryError: Error 1048: Column 'gt_page_id' cannot be null (db1138)Function: GeoData\Hooks::doLinksUpdateQuery: INSERT INTO `geo_tags` (gt_page_id,gt_id,gt_lat,gt_lon,gt_globe,gt_primary,gt_dim,gt_type,gt_name,gt_country,gt_region) VALUES (NULL,NULL,'45.811666666667','4.9194444444444','earth',1,1000,'camera',NULL,NULL,NULL) - https://phabricator.wikimedia.org/T282735 [19:03:34] Urbanecm: thanks! it's working now. [19:03:40] excellent. syncing it now. [19:05:30] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 80e5b9d: cd113a7: Enable structured_task/article/link_suggestion_interaction schema (T278177) (duration: 01m 06s) [19:05:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:36] T278177: Add a link: create new Schema - https://phabricator.wikimedia.org/T278177 [19:05:39] tgr_: should be live. [19:05:47] (03PS12) 10Thcipriani: Beta: Clean puppet cherry-picks [puppet] - 10https://gerrit.wikimedia.org/r/310719 (https://phabricator.wikimedia.org/T135427) [19:05:51] brennen: dancy: the floor is yours. Thanks! [19:05:59] Thanks Martin [19:06:25] also working. Thanks Urbanecm ! 
[19:06:31] any time :) [19:06:31] (03CR) 10Ahmon Dancy: [C: 03+2] Make sure mId exists [extensions/GeoData] (wmf/1.37.0-wmf.5) - 10https://gerrit.wikimedia.org/r/690078 (https://phabricator.wikimedia.org/T282735) (owner: 10Zabe) [19:15:18] (03PS1) 10QChris: Add .gitreview [container/miscweb] - 10https://gerrit.wikimedia.org/r/690668 [19:15:20] (03CR) 10QChris: [V: 03+2 C: 03+2] Add .gitreview [container/miscweb] - 10https://gerrit.wikimedia.org/r/690668 (owner: 10QChris) [19:20:58] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:23:12] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:23:45] cdanis: a very belated np :) [19:35:52] (03Merged) 10jenkins-bot: Make sure mId exists [extensions/GeoData] (wmf/1.37.0-wmf.5) - 10https://gerrit.wikimedia.org/r/690078 (https://phabricator.wikimedia.org/T282735) (owner: 10Zabe) [19:36:10] (03PS1) 10Dzahn: add initial README file [container/miscweb] - 10https://gerrit.wikimedia.org/r/690670 [19:37:04] (03CR) 10Dzahn: [V: 03+2 C: 03+2] add initial README file [container/miscweb] - 10https://gerrit.wikimedia.org/r/690670 (owner: 10Dzahn) [19:39:01] !log dancy@deploy1002 Synchronized php-1.37.0-wmf.5/extensions/GeoData/includes/Hooks.php: Backport: [[gerrit:690078|Make sure mId exists (T282735)]] (duration: 01m 08s) [19:39:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:06] T282735: Wikimedia\Rdbms\DBQueryError: Error 1048: Column 'gt_page_id' cannot be null (db1138)Function: GeoData\Hooks::doLinksUpdateQuery: INSERT INTO `geo_tags` 
(gt_page_id,gt_id,gt_lat,gt_lon,gt_globe,gt_primary,gt_dim,gt_type,gt_name,gt_country,gt_region) VALUES (NULL,NULL,'45.811666666667','4.9194444444444','earth',1,1000,'camera',NULL,NULL,NULL) - https://phabricator.wikimedia.org/T282735 [19:39:30] (03PS1) 10Ahmon Dancy: group1 wikis to 1.37.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/690676 [19:39:32] (03CR) 10Ahmon Dancy: [C: 03+2] group1 wikis to 1.37.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/690676 (owner: 10Ahmon Dancy) [19:40:22] (03Merged) 10jenkins-bot: group1 wikis to 1.37.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/690676 (owner: 10Ahmon Dancy) [19:42:07] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.37.0-wmf.5 [19:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:14] !log dancy@deploy1002 Synchronized php: group1 wikis to 1.37.0-wmf.5 (duration: 01m 06s) [19:43:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:32] RECOVERY - SSH on logstash2020.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:56:27] matthiasmullie: https://phabricator.wikimedia.org/T282822 [19:56:55] upps. that sounds to be a bug in the patch i synced earlier [19:56:58] ^ might be related to your config change [19:58:10] Urbanecm: the timing seems to match [19:58:27] and i just confirmed it via mwmaint too [19:58:34] (03PS1) 10Urbanecm: Revert "Enable media change tags on wikipedias" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/690082 (https://phabricator.wikimedia.org/T266067) [20:00:06] a merge strategy might fix it. 
dunno if that's a case for core variables though [20:04:37] (03PS1) 10Dzahn: add initial config stub for pipelinelib [container/miscweb] - 10https://gerrit.wikimedia.org/r/690678 (https://phabricator.wikimedia.org/T281538) [20:06:03] (03CR) 10Herron: [C: 03+2] eventgate-logging-external: add codfw kafka-logging hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/683047 (https://phabricator.wikimedia.org/T279342) (owner: 10Herron) [20:06:58] 10SRE, 10LDAP-Access-Requests, 10CommRel-Specialists-Support (Apr-Jun-2021): Grant access to LDAP nda for Sannita - https://phabricator.wikimedia.org/T282600 (10KFrancis) @elukey Yes, please proceed. Thanks! [20:07:21] (03Merged) 10jenkins-bot: eventgate-logging-external: add codfw kafka-logging hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/683047 (https://phabricator.wikimedia.org/T279342) (owner: 10Herron) [20:07:25] I'd go with "revert and figure out a proper solution later", since I'm not sure how it could be merged with the defaults [20:08:05] yeah, i feel so too. switching to wmg variable and merging in CS.php would fix that, not sure if that's the best solution though [20:08:37] !log herron@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [20:08:38] !log herron@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . [20:08:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:53] !log herron@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [20:09:53] !log herron@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . 
[20:09:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:28] (03PS1) 10Urbanecm: WIP: Move otrs-wiki.wikimedia.org to vrt-wiki.wikimedia.org (part 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/690680 (https://phabricator.wikimedia.org/T280400) [20:16:40] jouncebot: now [20:16:41] For the next 0 hour(s) and 43 minute(s): MediaWiki train - American Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210513T1900) [20:16:54] (03PS1) 10Urbanecm: Properly enable media change tags on Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/690691 (https://phabricator.wikimedia.org/T266067) [20:17:07] dancy: can i sync a quick patch for T266067 please (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/690082/)? [20:17:08] T266067: [L] Create edit tags to measure multimedia edits to Wikipedia articles - https://phabricator.wikimedia.org/T266067 [20:17:17] Urbanecm: Yep. Go for it. [20:17:22] thank you very much [20:17:26] 10SRE, 10Wikimedia-Hackathon-2021, 10Wikimedia-Mailing-lists, 10Upstream: Add OAuth login to mailman for accessing list memberships/archive viewing - https://phabricator.wikimedia.org/T249678 (10Tgr) Seems straightforward as long as there is no need to tie accounts to each other (and this handle user renam... 
[20:17:33] (03PS2) 10Urbanecm: Revert "Enable media change tags on wikipedias" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/690082 (https://phabricator.wikimedia.org/T266067) [20:17:35] (03CR) 10Urbanecm: [C: 03+2] Revert "Enable media change tags on wikipedias" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/690082 (https://phabricator.wikimedia.org/T266067) (owner: 10Urbanecm) [20:19:53] (03CR) 10Acamicamacaraca: [C: 03+1] "+" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/690691 (https://phabricator.wikimedia.org/T266067) (owner: 10Urbanecm) [20:19:58] (03Merged) 10jenkins-bot: Revert "Enable media change tags on wikipedias" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/690082 (https://phabricator.wikimedia.org/T266067) (owner: 10Urbanecm) [20:21:06] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:21:51] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: REVERT: 9dc74e45579c9b868571529171421c4bf7de41fa: Revert "Enable media change tags on wikipedias" (T266067, T282822) (duration: 01m 07s) [20:21:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:57] T282822: Certain tags are no longer activated by default - https://phabricator.wikimedia.org/T282822 [20:21:59] * Urbanecm done [20:23:20] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:24:24] * Urbanecm done [20:29:10] (03PS2) 10Dzahn: add initial config stub for pipelinelib [container/miscweb] - 10https://gerrit.wikimedia.org/r/690678 (https://phabricator.wikimedia.org/T281538) [21:03:31] (03CR) 10Jeena Huneidi: [C: 03+1] "LGTM" [container/miscweb] - 10https://gerrit.wikimedia.org/r/690678 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [21:04:20] (03CR) 10Dzahn: [C: 03+2] "thanks :)" [container/miscweb] - 10https://gerrit.wikimedia.org/r/690678 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [21:04:23] (03CR) 10Dzahn: [V: 03+2 C: 03+2] add initial config stub for pipelinelib [container/miscweb] - 10https://gerrit.wikimedia.org/r/690678 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [21:22:46] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:24:58] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:27:53] (03PS1) 10Ssingh: WIP: wikidough: update role to work towards anycast support [puppet] - 10https://gerrit.wikimedia.org/r/690698 [21:43:49] (03PS1) 10Bstorm: cloudstore: fix the sync path for the secondary cluster [puppet] - 10https://gerrit.wikimedia.org/r/690706 (https://phabricator.wikimedia.org/T224747) [22:01:52] (03PS2) 10Bstorm: cloudstore: fix the sync path for the secondary cluster [puppet] - 10https://gerrit.wikimedia.org/r/690706 (https://phabricator.wikimedia.org/T224747) [22:17:02] PROBLEM - HP RAID on ms-be1053 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:2 - OK: 1I:1:1, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 3I:3:1, 3I:3:2, 3I:3:3, 3I:3:4, 4I:5:1, 4I:5:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [22:17:05] ACKNOWLEDGEMENT - HP RAID on ms-be1053 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:2 - OK: 1I:1:1, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 3I:3:1, 3I:3:2, 3I:3:3, 3I:3:4, 4I:5:1, 4I:5:2 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T282839 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [22:17:09] 10SRE, 10ops-eqiad: Degraded RAID on ms-be1053 - https://phabricator.wikimedia.org/T282839 (10ops-monitoring-bot) [22:17:23] (03PS3) 10Bstorm: cloudstore: fix the sync path for the secondary cluster [puppet] - 10https://gerrit.wikimedia.org/r/690706 (https://phabricator.wikimedia.org/T224747) [22:23:47] (03CR) 10Bstorm: "PCC seems right: https://puppet-compiler.wmflabs.org/compiler1002/29562/" [puppet] - 10https://gerrit.wikimedia.org/r/690706 (https://phabricator.wikimedia.org/T224747) (owner: 10Bstorm) [22:28:05] (03PS10) 10Ahmon Dancy: 
WIP: Test emailing notification of security patch failure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679015 [22:33:48] (03PS2) 10Bstorm: toolforge: re-enable toolforge certificate monitor [puppet] - 10https://gerrit.wikimedia.org/r/690055 (https://phabricator.wikimedia.org/T282264) [22:38:39] (03CR) 10Bstorm: "PCC (which doesn't validate the config) https://puppet-compiler.wmflabs.org/compiler1003/29563/alert1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/690055 (https://phabricator.wikimedia.org/T282264) (owner: 10Bstorm) [22:39:51] (03CR) 10Bstorm: [C: 03+2] cloudstore: fix the sync path for the secondary cluster [puppet] - 10https://gerrit.wikimedia.org/r/690706 (https://phabricator.wikimedia.org/T224747) (owner: 10Bstorm) [22:40:24] (03PS1) 10Cwhite: logstash: add nodejs ecs migration config and tests [puppet] - 10https://gerrit.wikimedia.org/r/690759 (https://phabricator.wikimedia.org/T234565) [22:41:00] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:43:20] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:45:54] (03PS1) 10Cwhite: rsyslog: remove host parameter from syslog_cee template [puppet] - 10https://gerrit.wikimedia.org/r/690760 [22:47:35] (03CR) 10Bstorm: toolforge: re-enable toolforge certificate monitor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/690055 (https://phabricator.wikimedia.org/T282264) (owner: 10Bstorm) [22:50:26] (03CR) 10Cwhite: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/689262 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [22:53:25] (03CR) 10Ryan Kemper: "To Erik's point, we can totally bump this limit if we need to, but first let's look into why we have so many shards. Is there a phab ticke" [puppet] - 10https://gerrit.wikimedia.org/r/688309 (owner: 10ZPapierski) [22:53:54] (03CR) 10Ryan Kemper: [C: 03+1] install_server: add new installer to test raid0 configuration: [puppet] - 10https://gerrit.wikimedia.org/r/689786 (https://phabricator.wikimedia.org/T280382) (owner: 10Jbond) [22:54:24] (03CR) 10Ryan Kemper: [C: 03+2] install_server: add new installer to test raid0 configuration: [puppet] - 10https://gerrit.wikimedia.org/r/689786 (https://phabricator.wikimedia.org/T280382) (owner: 10Jbond) [23:00:05] brennen: How many deployers does it take to do US Backport and Config training deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210513T2300). [23:00:05] nray and Zabe: A patch you scheduled for US Backport and Config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:17] * thcipriani waves [23:00:36] o/ [23:00:58] howdy nray [23:01:10] hello! 
[23:01:16] o/ [23:01:52] (03CR) 10Thcipriani: [C: 03+2] "Backport" [extensions/WikimediaEvents] (wmf/1.37.0-wmf.5) - 10https://gerrit.wikimedia.org/r/690081 (https://phabricator.wikimedia.org/T261842) (owner: 10Nray) [23:02:00] hello Zabe [23:02:17] hi [23:02:27] (03PS5) 10Thcipriani: Enable WikiLove extension on tawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/686700 (https://phabricator.wikimedia.org/T280326) (owner: 10Neechalkaran) [23:02:42] hey backport training folk - trying to help debug a zuul issue, will join you shortly [23:03:13] take your time brennen [23:03:47] (03CR) 10Thcipriani: [C: 03+2] "Backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/686700 (https://phabricator.wikimedia.org/T280326) (owner: 10Neechalkaran) [23:04:13] I imagine ^ will land before nray 's patch but we'll see how Jenkins feels about it [23:04:23] (03PS4) 10Cwhite: logstash: replace ECS allow list with filter_on_template [puppet] - 10https://gerrit.wikimedia.org/r/674718 (https://phabricator.wikimedia.org/T234565) [23:04:37] (03Merged) 10jenkins-bot: Enable WikiLove extension on tawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/686700 (https://phabricator.wikimedia.org/T280326) (owner: 10Neechalkaran) [23:05:40] Zabe: your change is live on mwdebug1001, check please [23:06:26] (03Merged) 10jenkins-bot: Fix 'final_state: vector' bug in VectorPrefDiffInstrumentation [extensions/WikimediaEvents] (wmf/1.37.0-wmf.5) - 10https://gerrit.wikimedia.org/r/690081 (https://phabricator.wikimedia.org/T261842) (owner: 10Nray) [23:06:58] thcipriani: works [23:07:20] Zabe: great! 
going live [23:09:08] !log T280382 `sudo -i wmf-auto-reimage-host -p T280382 wdqs1003.eqiad.wmnet` on `ryankemper@cumin1001` tmux session `wdqs_reimage` [23:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:12] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382 [23:09:22] (03PS1) 10Dzahn: add initial Blubberfile and placeholders for prod and staging HTML [container/miscweb] - 10https://gerrit.wikimedia.org/r/690768 (https://phabricator.wikimedia.org/T281538) [23:09:45] !log T280382 `sudo -i wmf-auto-reimage-host -p T280382 wdqs2003.codfw.wmnet` on `ryankemper@cumin2001` tmux session `wdqs_reimage` [23:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:18] !log T280382 `sudo -i wmf-auto-reimage-host -p T280382 --new wdqs2003.codfw.wmnet` on `ryankemper@cumin2001` tmux session `wdqs_reimage` [23:10:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:19] !log thcipriani@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:686700|Enable WikiLove extension on tawiki (T280326)]] (duration: 01m 07s) [23:11:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:27] T280326: WikiLove Extension in Tamil Wikipedia - https://phabricator.wikimedia.org/T280326 [23:11:32] ^ Zabe live now! 
[23:12:28] thanks :) [23:12:36] yw :) [23:12:47] nray: you're up next [23:12:51] cool [23:16:22] nray: live on mwdebug1001, please check (if there's anything to check there) [23:16:38] k, checking now, thank you [23:18:05] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): hw troubleshooting: ssh unreachable for wdqs2007.codfw.wmnet - https://phabricator.wikimedia.org/T281437 (10RKemper) 05Resolved→03Open [23:19:01] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): hw troubleshooting: ssh unreachable for wdqs2007.codfw.wmnet - https://phabricator.wikimedia.org/T281437 (10RKemper) @Papaul This host is still in a bad state. It looks like one of the RAID drives is failing: ` ryankemper@wdqs2007:~$ sudo /s... [23:20:06] Looks good thcipriani , You can proceed! [23:20:26] nray: cool, going live! [23:22:30] !log thcipriani@deploy1002 Synchronized php-1.37.0-wmf.5/extensions/WikimediaEvents: Backport: [[gerrit:690081|Fix "final_state: vector" bug in VectorPrefDiffInstrumentation (T261842)]] (duration: 01m 07s) [23:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:34] T261842: Create schema to track users opting in/out of desktop improvements - https://phabricator.wikimedia.org/T261842 [23:22:47] ^ nray should be live now! [23:23:03] thcipriani: Thanks so much for your help! I appreciate it :) [23:23:14] nray: sure thing, thanks for the backport [23:24:11] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:24:17] (03PS11) 10Ahmon Dancy: WIP: Test emailing notification of security patch failure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679015 [23:25:13] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:32:21] 10SRE, 10ops-eqiad, 10Data-Services, 10cloud-services-team (Hardware): labstore1004/5: buy a DAC 10Gb cable or adjust the current fiber cable for DAC/crossover - https://phabricator.wikimedia.org/T282799 (10wiki_willy) a:03Jclark-ctr Hi @Bstorm - we can definitely order parts if needed. I think @Jclark-... [23:33:07] 10SRE, 10ops-eqiad, 10Data-Services, 10cloud-services-team (Hardware): labstore1004/5: buy a DAC 10Gb cable or adjust the current fiber cable for DAC/crossover - https://phabricator.wikimedia.org/T282799 (10Bstorm) Sounds good! [23:42:41] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:46:15] ^ I'll take a look [23:50:28] !log [sodium:~] $ sudo -u mirror /usr/local/sbin/update-ubuntu-mirror [23:50:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:53:31] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:53:39] !log [sodium:~] $ sudo systemctl start update-ubuntu-mirror.service [23:53:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:54:16] some hiccup or race that did not happen again on the next (manual) run. this runs every hour. fixed