[00:14:51] PROBLEM - mailman3_runners on lists1001 is CRITICAL: PROCS CRITICAL: 13 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:19:42] Again, sigh
[01:10:07] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:12:33] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:34:35] PROBLEM - mailman3_queue_size on lists1001 is CRITICAL: CRITICAL: 1 mailman3 queues above limits: bounces is 27 (limit: 25) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3
[02:48:05] ^ known, I'll ack the mailman3 alerts later
[02:48:46] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[02:53:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[03:22:23] PROBLEM - WDQS SPARQL on wdqs1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:34:11] RECOVERY - WDQS SPARQL on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.068 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:36:17] SRE, observability, Patch-For-Review: Migrate mwlog/udp2log servers to Buster - https://phabricator.wikimedia.org/T224565 (dpifke) ArcLamp (performance flamegraphs) stopped getting data on May 5, likely as a result of this change. Other things which might have also been missed: ` $ find . -type f |...
[03:39:01] (PS1) Dave Pifke: arclamp: switch to mwlog1002 [puppet] - https://gerrit.wikimedia.org/r/687928 (https://phabricator.wikimedia.org/T224565)
[04:10:09] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:12:31] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:03:43] PROBLEM - mailman3_runners on lists1001 is CRITICAL: PROCS CRITICAL: 13 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:06:43] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 54, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:07:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1129 for schema change', diff saved to https://phabricator.wikimedia.org/P15870 and previous config saved to /var/cache/conftool/dbconfig/20210510-050727-marostegui.json
[05:07:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:08:39] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:09:56] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[05:11:58] (PS1) Marostegui: instances.yaml: Remove db1082 from dbctl [puppet] - https://gerrit.wikimedia.org/r/687957
(https://phabricator.wikimedia.org/T281794)
[05:13:02] (CR) Marostegui: [C: +2] instances.yaml: Remove db1082 from dbctl [puppet] - https://gerrit.wikimedia.org/r/687957 (https://phabricator.wikimedia.org/T281794) (owner: Marostegui)
[05:13:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1082 from dbctl T281794', diff saved to https://phabricator.wikimedia.org/P15871 and previous config saved to /var/cache/conftool/dbconfig/20210510-051334-marostegui.json
[05:13:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:13:38] T281794: decommission db1082.eqiad.wmnet - https://phabricator.wikimedia.org/T281794
[05:19:07] SRE, DBA, Wikimedia-Mailing-lists: Delete lists-next.wikimedia.org - https://phabricator.wikimedia.org/T281548 (Marostegui) Sounds good, ping me whenever you want to get the databases deleted. Is it needed to take a final backup from these testing databases? I am off Thursday and Friday, but @Kormat...
[05:23:27] (PS1) Marostegui: install_server: Do not reimage db1177 [puppet] - https://gerrit.wikimedia.org/r/687962
[05:24:15] (CR) Marostegui: [C: +2] install_server: Do not reimage db1177 [puppet] - https://gerrit.wikimedia.org/r/687962 (owner: Marostegui)
[05:31:13] (PS1) Marostegui: install_server: Reimage db1121 to Buster [puppet] - https://gerrit.wikimedia.org/r/687965 (https://phabricator.wikimedia.org/T280492)
[05:32:05] (CR) Marostegui: [C: +2] install_server: Reimage db1121 to Buster [puppet] - https://gerrit.wikimedia.org/r/687965 (https://phabricator.wikimedia.org/T280492) (owner: Marostegui)
[05:46:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1129 (re)pooling @ 25%: Repool db1129', diff saved to https://phabricator.wikimedia.org/P15872 and previous config saved to /var/cache/conftool/dbconfig/20210510-054610-root.json
[05:46:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:48:47] (Processor usage over 85%)
firing: Processor usage over 85% - https://alerts.wikimedia.org
[05:53:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[06:01:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1129 (re)pooling @ 50%: Repool db1129', diff saved to https://phabricator.wikimedia.org/P15873 and previous config saved to /var/cache/conftool/dbconfig/20210510-060113-root.json
[06:01:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:13:31] PROBLEM - ElasticSearch unassigned shard check - 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - enwiki_content_1617305154[5](2021-05-07T00:40:48.738Z), enwiki_content_1617305154[10](2021-05-07T00:40:48.738Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[06:14:27] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[06:16:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1129 (re)pooling @ 75%: Repool db1129', diff saved to https://phabricator.wikimedia.org/P15874 and previous config saved to /var/cache/conftool/dbconfig/20210510-061617-root.json
[06:16:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:31:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1129 (re)pooling @ 100%: Repool db1129', diff saved to https://phabricator.wikimedia.org/P15875 and previous config saved to /var/cache/conftool/dbconfig/20210510-063121-root.json
[06:31:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:32:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1146:3312 for schema change', diff saved to https://phabricator.wikimedia.org/P15876 and previous config saved to /var/cache/conftool/dbconfig/20210510-063254-marostegui.json
[06:32:56]
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:37:49] !log apt-get clean on rpki1001 to free some space
[06:37:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:39:33] SRE, ops-codfw, DC-Ops, Discovery-Search: hw troubleshooting: failure to power up for elastic2043.codfw.wmnet - https://phabricator.wikimedia.org/T281327 (elukey) Resolved→Open Still reported down :( ` racadm>>racadm getsel Record: 1 Date/Time: 05/07/2021 00:43:42 Source: sys...
[06:42:08] ACKNOWLEDGEMENT - mailman3_queue_size on lists1001 is CRITICAL: CRITICAL: 1 mailman3 queues above limits: bounces is 42 (limit: 25) Legoktm T282348 https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3
[06:42:13] ACKNOWLEDGEMENT - mailman3_runners on lists1001 is CRITICAL: PROCS CRITICAL: 13 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner Legoktm T282348 https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:58:32] (CR) Muehlenhoff: [C: +2] Apply LDAP replica role to ldap-replica1003/1004/2006 [puppet] - https://gerrit.wikimedia.org/r/686434 (owner: Muehlenhoff)
[07:02:09] (PS1) Muehlenhoff: Add missing Hiera entries for new LDAP replicas [puppet] - https://gerrit.wikimedia.org/r/688167
[07:06:17] (CR) Muehlenhoff: [C: +2] Add missing Hiera entries for new LDAP replicas [puppet] - https://gerrit.wikimedia.org/r/688167 (owner: Muehlenhoff)
[07:06:23] (CR) Elukey: [C: +2] arclamp: switch to mwlog1002 [puppet] - https://gerrit.wikimedia.org/r/687928 (https://phabricator.wikimedia.org/T224565) (owner: Dave Pifke)
[07:18:54] SRE, DBA, Datacenter-Switchover: When switching DCs, update pc hosts in tendril - https://phabricator.wikimedia.org/T266723 (Marostegui) This is probably not worth the effort if we are expecting to drop tendril "soon".
We can update these manually for the next switch (and switch back) and hopefully f...
[07:22:01] (CR) JMeybohm: [C: -1] Add canary support in scaffolding (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/685748 (https://phabricator.wikimedia.org/T282148) (owner: Effie Mouzeli)
[07:34:59] SRE, Datasets-Archiving, Datasets-General-or-Unknown, Dumps-Generation: Image tarball dumps on your.org are not being generated - https://phabricator.wikimedia.org/T53001 (ArielGlenn) Whatever happened to media backups? Was an implementation decided on or even completed?
[07:35:31] oh wikitech is now being upgraded
[07:35:36] * Amir1 bites his nails
[07:38:17] !log Restarted CI Jenkins # T281737
[07:38:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:38:22] T281737: Zuul can't stop jobs or set the build description - https://phabricator.wikimedia.org/T281737
[07:41:45] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[07:48:31] (PS1) Muehlenhoff: Add ldap-replica1003/1004/2006 to conftool-data [puppet] - https://gerrit.wikimedia.org/r/688175
[07:51:57] (PS3) Tonina Zhelyazkova: Add tmpSerializeEmptyListsAsObjects Wikibase repo config [mediawiki-config] - https://gerrit.wikimedia.org/r/685776 (https://phabricator.wikimedia.org/T241422)
[07:55:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3312 (re)pooling @ 25%: Repool db1146:3312', diff saved to https://phabricator.wikimedia.org/P15877 and previous config saved to /var/cache/conftool/dbconfig/20210510-075529-root.json
[07:55:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:52] (CR) Muehlenhoff: [C: +2] Add ldap-replica1003/1004/2006 to conftool-data [puppet] -
https://gerrit.wikimedia.org/r/688175 (owner: Muehlenhoff)
[08:02:07] (CR) Volans: "reply inline" (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/685855 (owner: Volans)
[08:05:23] (PS1) Vgutierrez: trafficserver: Fine tune acme-chief cert warning [puppet] - https://gerrit.wikimedia.org/r/688176
[08:08:13] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29463/console" [puppet] - https://gerrit.wikimedia.org/r/688176 (owner: Vgutierrez)
[08:09:26] SRE: Decom failoid1001/failoid2001 - https://phabricator.wikimedia.org/T282405 (MoritzMuehlenhoff)
[08:10:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3312 (re)pooling @ 50%: Repool db1146:3312', diff saved to https://phabricator.wikimedia.org/P15878 and previous config saved to /var/cache/conftool/dbconfig/20210510-081033-root.json
[08:10:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:14] (CR) Elukey: [C: +1] trafficserver: Fine tune acme-chief cert warning [puppet] - https://gerrit.wikimedia.org/r/688176 (owner: Vgutierrez)
[08:13:41] (CR) Vgutierrez: [V: +1 C: +2] trafficserver: Fine tune acme-chief cert warning [puppet] - https://gerrit.wikimedia.org/r/688176 (owner: Vgutierrez)
[08:15:02] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts failoid2001.codfw.wmnet
[08:15:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:58] (PS1) Itamar Givon: Add P2671 and P4839 to deprecated properties list [mediawiki-config] - https://gerrit.wikimedia.org/r/688178 (https://phabricator.wikimedia.org/T280779)
[08:18:21] (PS4) Tonina Zhelyazkova: Add Add tmpSerializeEmptyListsAsObjects Wikibase repo config [mediawiki-config] - https://gerrit.wikimedia.org/r/685776 (https://phabricator.wikimedia.org/T241422)
[08:19:58] (PS2) Muehlenhoff: Remove Puppet refencences to old Buster
failoid hosts [puppet] - https://gerrit.wikimedia.org/r/683802
[08:21:23] (PS5) Tonina Zhelyazkova: Add tmpSerializeEmptyListsAsObjects Wikibase repo config [mediawiki-config] - https://gerrit.wikimedia.org/r/685776 (https://phabricator.wikimedia.org/T241422)
[08:21:33] (PS6) Tonina Zhelyazkova: Add tmpSerializeEmptyListsAsObjects Wikibase repo config [mediawiki-config] - https://gerrit.wikimedia.org/r/685776 (https://phabricator.wikimedia.org/T241422)
[08:24:29] !log push pfw policies - T282286
[08:24:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:35] T282286: Deploy pfw policy1620422079 for T268501 and T281320 - https://phabricator.wikimedia.org/T282286
[08:24:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts failoid2001.codfw.wmnet
[08:24:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:42] SRE: Decom failoid1001/failoid2001 - https://phabricator.wikimedia.org/T282405 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `failoid2001.codfw.wmnet` - failoid2001.codfw.wmnet (**PASS**) - Downtimed host on Icinga - Found Ganeti VM - VM shutdown - Started...
[08:25:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3312 (re)pooling @ 75%: Repool db1146:3312', diff saved to https://phabricator.wikimedia.org/P15879 and previous config saved to /var/cache/conftool/dbconfig/20210510-082536-root.json
[08:25:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:55] (CR) Tonina Zhelyazkova: "This change is ready for review."
[mediawiki-config] - https://gerrit.wikimedia.org/r/688184 (https://phabricator.wikimedia.org/T241422) (owner: Tonina Zhelyazkova)
[08:28:37] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts failoid1001.eqiad.wmnet
[08:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:55] (CR) Filippo Giunchedi: [C: +1] logstash: collect kaios_app.error stream into logstash clienterror input [puppet] - https://gerrit.wikimedia.org/r/686803 (https://phabricator.wikimedia.org/T281507) (owner: Cwhite)
[08:31:18] (CR) Muehlenhoff: [C: +2] Remove Puppet refencences to old Buster failoid hosts [puppet] - https://gerrit.wikimedia.org/r/683802 (owner: Muehlenhoff)
[08:32:20] (PS2) Samwilson: Enable Wikimedia OCR on Beta Wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/685643 (https://phabricator.wikimedia.org/T282080)
[08:37:13] (PS1) Vgutierrez: trafficserver: Clear outbound TLS cacert_path for ats@esams [puppet] - https://gerrit.wikimedia.org/r/688194 (https://phabricator.wikimedia.org/T281673)
[08:37:41] (PS1) Volans: firewall: add cumin2002 to the cumin term [homer/public] - https://gerrit.wikimedia.org/r/688195 (https://phabricator.wikimedia.org/T276589)
[08:39:12] (PS2) Volans: firewall: add cumin2002 to the cumin term [homer/public] - https://gerrit.wikimedia.org/r/688195 (https://phabricator.wikimedia.org/T276589)
[08:39:57] (CR) Ayounsi: [C: +1] firewall: add cumin2002 to the cumin term [homer/public] - https://gerrit.wikimedia.org/r/688195 (https://phabricator.wikimedia.org/T276589) (owner: Volans)
[08:40:31] (CR) Muehlenhoff: [C: +1] "Looks good!"
[homer/public] - https://gerrit.wikimedia.org/r/688195 (https://phabricator.wikimedia.org/T276589) (owner: Volans)
[08:40:37] SRE, fr-email-preference-center, fundraising-tech-ops, netops, WMF-NDA: Deploy pfw policy1620422079 for T268501 and T281320 - https://phabricator.wikimedia.org/T282286 (ayounsi) Open→Resolved
[08:40:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3312 (re)pooling @ 100%: Repool db1146:3312', diff saved to https://phabricator.wikimedia.org/P15880 and previous config saved to /var/cache/conftool/dbconfig/20210510-084040-root.json
[08:40:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts failoid1001.eqiad.wmnet
[08:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:53] SRE: Decom failoid1001/failoid2001 - https://phabricator.wikimedia.org/T282405 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `failoid1001.eqiad.wmnet` - failoid1001.eqiad.wmnet (**PASS**) - Downtimed host on Icinga - Found Ganeti VM - VM shutdown - Started...
[08:41:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1156 for schema change', diff saved to https://phabricator.wikimedia.org/P15881 and previous config saved to /var/cache/conftool/dbconfig/20210510-084102-marostegui.json
[08:41:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:25] (CR) Volans: [C: +2] firewall: add cumin2002 to the cumin term [homer/public] - https://gerrit.wikimedia.org/r/688195 (https://phabricator.wikimedia.org/T276589) (owner: Volans)
[08:42:06] (Merged) jenkins-bot: firewall: add cumin2002 to the cumin term [homer/public] - https://gerrit.wikimedia.org/r/688195 (https://phabricator.wikimedia.org/T276589) (owner: Volans)
[08:42:20] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29464/console" [puppet] - https://gerrit.wikimedia.org/r/688194 (https://phabricator.wikimedia.org/T281673) (owner: Vgutierrez)
[08:48:22] (CR) Vgutierrez: [V: +1 C: +2] trafficserver: Clear outbound TLS cacert_path for ats@esams [puppet] - https://gerrit.wikimedia.org/r/688194 (https://phabricator.wikimedia.org/T281673) (owner: Vgutierrez)
[08:48:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[08:48:56] !log Enforce Puppet Internal CA validation on trafficserver@esams - T281673
[08:48:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:37] SRE, DBA, Datacenter-Switchover: When switching DCs, update pc hosts in tendril - https://phabricator.wikimedia.org/T266723 (Kormat) I discussed this with @RLazarus back in october, and we agreed it's not worth the effort given the impending any-day-now™ tendril decomm. (I forgot to update the task w...
[08:52:35] !log installing bind9 security updates on stretch (client-side tools/libs only)
[08:52:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[08:54:01] (PS1) Urbanecm: DatabaseBlockStore: fetch correct ActorNormalization (part 1/2) [core] (wmf/1.37.0-wmf.4) - https://gerrit.wikimedia.org/r/688200 (https://phabricator.wikimedia.org/T281972)
[08:54:03] (PS1) Urbanecm: DatabaseBlockStore: fetch correct ActorNormalization (part 2/2) [core] (wmf/1.37.0-wmf.4) - https://gerrit.wikimedia.org/r/688201 (https://phabricator.wikimedia.org/T281972)
[08:54:25] SRE, DBA, Datacenter-Switchover: When switching DCs, update pc hosts in tendril - https://phabricator.wikimedia.org/T266723 (Marostegui) Open→Declined
[08:58:23] (CR) Itamar Givon: [C: +1] "Looks good, thanks for introducing this." [mediawiki-config] - https://gerrit.wikimedia.org/r/685776 (https://phabricator.wikimedia.org/T241422) (owner: Tonina Zhelyazkova)
[09:02:38] (CR) Tonina Zhelyazkova: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/688184 (https://phabricator.wikimedia.org/T241422) (owner: Tonina Zhelyazkova)
[09:03:11] (CR) Lucas Werkmeister (WMDE): [C: +1] Add tmpSerializeEmptyListsAsObjects Wikibase repo config [mediawiki-config] - https://gerrit.wikimedia.org/r/685776 (https://phabricator.wikimedia.org/T241422) (owner: Tonina Zhelyazkova)
[09:03:44] (CR) Lucas Werkmeister (WMDE): [C: +1] Add tmpSerializeEmptyListsAsObjects to Wikibase.php [mediawiki-config] - https://gerrit.wikimedia.org/r/688184 (https://phabricator.wikimedia.org/T241422) (owner: Tonina Zhelyazkova)
[09:03:47] (CR) jerkins-bot: [V: -1] DatabaseBlockStore: fetch correct ActorNormalization (part 2/2) [core] (wmf/1.37.0-wmf.4) - https://gerrit.wikimedia.org/r/688201 (https://phabricator.wikimedia.org/T281972) (owner:
Urbanecm)
[09:06:10] (Abandoned) Urbanecm: DatabaseBlockStore: fetch correct ActorNormalization (part 2/2) [core] (wmf/1.37.0-wmf.4) - https://gerrit.wikimedia.org/r/688201 (https://phabricator.wikimedia.org/T281972) (owner: Urbanecm)
[09:06:27] (PS2) Urbanecm: DatabaseBlockStore: fetch correct ActorNormalization [core] (wmf/1.37.0-wmf.4) - https://gerrit.wikimedia.org/r/688200 (https://phabricator.wikimedia.org/T281972)
[09:08:16] SRE, ops-eqiad, DC-Ops: scs-c1-eqiad CPU usage over 85% - https://phabricator.wikimedia.org/T238036 (ayounsi) Resolved→Open This is alerting again: https://librenms.wikimedia.org/device/device=158/tab=health/metric=processor/
[09:09:46] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[09:14:46] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[09:21:25] (Restored) Urbanecm: DatabaseBlockStore: fetch correct ActorNormalization (part 2/2) [core] (wmf/1.37.0-wmf.4) - https://gerrit.wikimedia.org/r/688201 (https://phabricator.wikimedia.org/T281972) (owner: Urbanecm)
[09:21:55] (PS2) Urbanecm: DatabaseBlockStore: fetch correct ActorNormalization (part 2/2) [core] (wmf/1.37.0-wmf.4) - https://gerrit.wikimedia.org/r/688201 (https://phabricator.wikimedia.org/T281972)
[09:25:16] (CR) Hashar: [C: +2] Merge 'upstream/stable-3.2' into wmf/stable-3.2 [software/gerrit] (wmf/stable-3.2) - https://gerrit.wikimedia.org/r/687117 (owner: Hashar)
[09:26:57] !log jmm@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ldap-replica2006.wikimedia.org
[09:26:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:30] !log jmm@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ldap-replica1003.wikimedia.org
[09:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:15] !log jmm@puppetmaster1001 conftool action : set/pooled=yes;
selector: name=ldap-replica1004.wikimedia.org
[09:28:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:18] (Merged) jenkins-bot: Merge 'upstream/stable-3.2' into wmf/stable-3.2 [software/gerrit] (wmf/stable-3.2) - https://gerrit.wikimedia.org/r/687117 (owner: Hashar)
[09:32:26] (CR) jerkins-bot: [V: -1] DatabaseBlockStore: fetch correct ActorNormalization [core] (wmf/1.37.0-wmf.4) - https://gerrit.wikimedia.org/r/688200 (https://phabricator.wikimedia.org/T281972) (owner: Urbanecm)
[09:37:04] (PS1) Vgutierrez: trafficserver: Clear outbound TLS cacert_path for ats@eqiad [puppet] - https://gerrit.wikimedia.org/r/688203 (https://phabricator.wikimedia.org/T281673)
[09:41:04] (PS3) Urbanecm: DatabaseBlockStore: fetch correct ActorNormalization (part 1/2) [core] (wmf/1.37.0-wmf.4) - https://gerrit.wikimedia.org/r/688200 (https://phabricator.wikimedia.org/T281972)
[09:41:23] (Abandoned) Urbanecm: DatabaseBlockStore: fetch correct ActorNormalization (part 2/2) [core] (wmf/1.37.0-wmf.4) - https://gerrit.wikimedia.org/r/688201 (https://phabricator.wikimedia.org/T281972) (owner: Urbanecm)
[09:41:58] (Restored) Urbanecm: DatabaseBlockStore: fetch correct ActorNormalization (part 2/2) [core] (wmf/1.37.0-wmf.4) - https://gerrit.wikimedia.org/r/688201 (https://phabricator.wikimedia.org/T281972) (owner: Urbanecm)
[09:42:15] (PS3) Urbanecm: DatabaseBlockStore: fetch correct ActorNormalization (part 2/2) [core] (wmf/1.37.0-wmf.4) - https://gerrit.wikimedia.org/r/688201 (https://phabricator.wikimedia.org/T281972)
[09:42:46] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (NOOP 8 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29465/console" [puppet] - https://gerrit.wikimedia.org/r/688203 (https://phabricator.wikimedia.org/T281673) (owner: Vgutierrez)
[09:44:51] (PS4) Urbanecm: DatabaseBlockStore: fetch correct ActorNormalization (part 1/2) [core]
(wmf/1.37.0-wmf.4) - https://gerrit.wikimedia.org/r/688200 (https://phabricator.wikimedia.org/T281972)
[09:45:03] (PS4) Urbanecm: DatabaseBlockStore: fetch correct ActorNormalization (part 2/2) [core] (wmf/1.37.0-wmf.4) - https://gerrit.wikimedia.org/r/688201 (https://phabricator.wikimedia.org/T281972)
[09:45:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1074 T281959', diff saved to https://phabricator.wikimedia.org/P15883 and previous config saved to /var/cache/conftool/dbconfig/20210510-094554-marostegui.json
[09:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:45:58] T281959: decommission db1074.eqiad.wmnet - https://phabricator.wikimedia.org/T281959
[09:48:16] (CR) Vgutierrez: [V: +1 C: +2] trafficserver: Clear outbound TLS cacert_path for ats@eqiad [puppet] - https://gerrit.wikimedia.org/r/688203 (https://phabricator.wikimedia.org/T281673) (owner: Vgutierrez)
[09:48:52] !log Enforce Puppet Internal CA validation on trafficserver@eqiad - T281673
[09:48:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:37] SRE, GitLab (Initialization), Release-Engineering-Team (Doing), User-brennen: Define auth strategy for GitLab - https://phabricator.wikimedia.org/T274461 (jbond) below is a summary of the current issues copied from slack > * We would like to use the e-mail and ssh public key provided by CAS, but...
[09:52:26] (PS1) Elukey: WIP - Add istio base images build support [docker-images/production-images] - https://gerrit.wikimedia.org/r/688211
[09:56:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1156 (re)pooling @ 25%: Repool db1156', diff saved to https://phabricator.wikimedia.org/P15884 and previous config saved to /var/cache/conftool/dbconfig/20210510-095608-root.json
[09:56:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:01:08] (CR) Daniel Kinzler: [C: +1] DatabaseBlockStore: fetch correct ActorNormalization (part 1/2) [core] (wmf/1.37.0-wmf.4) - https://gerrit.wikimedia.org/r/688200 (https://phabricator.wikimedia.org/T281972) (owner: Urbanecm)
[10:01:15] (CR) Daniel Kinzler: [C: +1] DatabaseBlockStore: fetch correct ActorNormalization (part 2/2) [core] (wmf/1.37.0-wmf.4) - https://gerrit.wikimedia.org/r/688201 (https://phabricator.wikimedia.org/T281972) (owner: Urbanecm)
[10:11:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1156 (re)pooling @ 50%: Repool db1156', diff saved to https://phabricator.wikimedia.org/P15885 and previous config saved to /var/cache/conftool/dbconfig/20210510-101112-root.json
[10:11:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:13:35] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on maps1005.eqiad.wmnet with reason: Resyncing database from master
[10:13:35] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on maps1005.eqiad.wmnet with reason: Resyncing database from master
[10:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:13:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:35] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps1004.eqiad.wmnet
[10:17:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:45] !log rolling
restart of ATS backend instances to clear spurious warnings
[10:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:28] (PS4) Giuseppe Lavagetto: Add diff tasks to rake [deployment-charts] - https://gerrit.wikimedia.org/r/685721
[10:22:09] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps1005.eqiad.wmnet
[10:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:09] (PS1) Arturo Borrero Gonzalez: prometheus-labmon: point to cloudmetrics1001 [dns] - https://gerrit.wikimedia.org/r/688213 (https://phabricator.wikimedia.org/T275605)
[10:26:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1156 (re)pooling @ 75%: Repool db1156', diff saved to https://phabricator.wikimedia.org/P15886 and previous config saved to /var/cache/conftool/dbconfig/20210510-102615-root.json
[10:26:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:49] (CR) Arturo Borrero Gonzalez: [C: +2] prometheus-labmon: point to cloudmetrics1001 [dns] - https://gerrit.wikimedia.org/r/688213 (https://phabricator.wikimedia.org/T275605) (owner: Arturo Borrero Gonzalez)
[10:27:06] (CR) Arturo Borrero Gonzalez: "Thanks Filippo for the heads up!" [dns] - https://gerrit.wikimedia.org/r/688213 (https://phabricator.wikimedia.org/T275605) (owner: Arturo Borrero Gonzalez)
[10:30:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor I � Unicode. All rise for European mid-day backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210510T1100).
[10:30:05] CFisch_WMDE, Tonina_WMDE, samwilson, jan_drewniak, and Zabe: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[10:30:20] o.O
[10:31:02] That's half an hour to early - right?
[10:31:17] !log installing openjdk-11 security updates
[10:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:31:35] yes it is :P
[10:31:39] jouncebot: you are wrong. it's portal time.
[10:32:09] didn't it announce the portal time yesterday?
[10:32:46] what?
[10:32:48] why would it
[10:33:02] (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/688214 (https://phabricator.wikimedia.org/T128546)
[10:34:25] (CR) Jdrewniak: [C: +2] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/688214 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[10:34:54] (CR) Urbanecm: [C: +2] "merging in advance of B&C, to give time for CI" [core] (wmf/1.37.0-wmf.4) - https://gerrit.wikimedia.org/r/688200 (https://phabricator.wikimedia.org/T281972) (owner: Urbanecm)
[10:34:58] (CR) Urbanecm: [C: +2] "merging in advance of B&C, to give time for CI" [core] (wmf/1.37.0-wmf.4) - https://gerrit.wikimedia.org/r/688201 (https://phabricator.wikimedia.org/T281972) (owner: Urbanecm)
[10:36:22] PROBLEM - Check systemd state on ms-be1038 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service,prometheus-debian-version-textfile.service,prometheus-nic-firmware-textfile.service,prometheus-node-exporter-apt.service,prometheus-node-exporter.service,prometheus-statsd-exporter.service,prometheus_puppet_agent_stats.service,swift-container-sync.service,swift-container-updater.service,swift-con
[10:36:22] ift-object-replicator.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:36:27] (Merged) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/688214 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[10:36:28] PROBLEM - very high load average likely xfs on ms-be1038 is CRITICAL: CRITICAL - load average: 28.45, 162.33, 143.46
https://wikitech.wikimedia.org/wiki/Swift [10:36:56] Urbanecm: I think it's always skipping the current even and announcing the next one, yesterday the no deploys one was replaced by portals and now portals was replaced by the backport window [10:38:07] weird [10:39:16] the only recent commit i see is https://gerrit.wikimedia.org/r/c/wikimedia/bots/jouncebot/+/680033 [10:39:54] !log jdrewniak@deploy1002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:688214| Bumping portals to master (T128546)]] (duration: 00m 59s) [10:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:00] previous one was data-utcstart [10:40:00] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [10:40:03] https://gerrit.wikimedia.org/r/c/wikimedia/bots/jouncebot/+/677332 [10:40:15] which uses utcstart rather than SF timezone, which might be the issue? [10:40:18] Majavah: what do you think? [10:40:53] !log jdrewniak@deploy1002 Synchronized portals: Wikimedia Portals Update: [[gerrit:688214| Bumping portals to master (T128546)]] (duration: 00m 58s) [10:40:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:12] possible [10:41:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1156 (re)pooling @ 100%: Repool db1156', diff saved to https://phabricator.wikimedia.org/P15887 and previous config saved to /var/cache/conftool/dbconfig/20210510-104119-root.json [10:41:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:21] 10SRE, 10CAS-SSO: Tomcat/CAS fails to start with OpenJDK 11.0.11 - https://phabricator.wikimedia.org/T281345 (10MoritzMuehlenhoff) I couldn't pin-point a single change responsible for it, but essentially we need to upgrade the java.security file to the new version from 11.0.11. I'll run some tests whether it w... 
[10:43:40] PROBLEM - MD RAID on ms-be1038 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 2, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [10:43:41] ACKNOWLEDGEMENT - MD RAID on ms-be1038 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 2, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T282434 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [10:43:44] 10SRE, 10ops-eqiad: Degraded RAID on ms-be1038 - https://phabricator.wikimedia.org/T282434 (10ops-monitoring-bot) [10:46:40] RECOVERY - very high load average likely xfs on ms-be1038 is OK: OK - load average: 2.20, 22.34, 74.74 https://wikitech.wikimedia.org/wiki/Swift [10:47:16] PROBLEM - Disk space on ms-be1038 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdb4 is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be1038&var-datasource=eqiad+prometheus/ops [10:51:38] (03PS5) 10Effie Mouzeli: Add canary support in scaffolding [deployment-charts] - 10https://gerrit.wikimedia.org/r/685748 (https://phabricator.wikimedia.org/T282148) [10:54:48] (03PS6) 10Effie Mouzeli: Add canary support in scaffolding [deployment-charts] - 10https://gerrit.wikimedia.org/r/685748 (https://phabricator.wikimedia.org/T282148) [10:56:57] (03PS2) 10WMDE-Fisch: Enable ReferencePreviews as full default on Marathi wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685820 (https://phabricator.wikimedia.org/T282147) [11:00:05] ryankemper: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210510T1700). 
[11:00:15] :( [11:00:15] (03Merged) 10jenkins-bot: DatabaseBlockStore: fetch correct ActorNormalization (part 1/2) [core] (wmf/1.37.0-wmf.4) - 10https://gerrit.wikimedia.org/r/688200 (https://phabricator.wikimedia.org/T281972) (owner: 10Urbanecm) [11:00:21] (03Merged) 10jenkins-bot: DatabaseBlockStore: fetch correct ActorNormalization (part 2/2) [core] (wmf/1.37.0-wmf.4) - 10https://gerrit.wikimedia.org/r/688201 (https://phabricator.wikimedia.org/T281972) (owner: 10Urbanecm) [11:00:26] 12:30 <+jouncebot> Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor I ♥ Unicode. All rise for European mid-day backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210510T1100). [11:00:27] 12:30 <+jouncebot> CFisch_WMDE, Tonina_WMDE, samwilson, jan_drewniak, and Zabe: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:29] I can deploy today [11:00:39] o/ [11:00:40] o/ [11:00:42] o/ [11:00:59] o/ [11:01:00] o/ [11:01:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1105:3312', diff saved to https://phabricator.wikimedia.org/P15888 and previous config saved to /var/cache/conftool/dbconfig/20210510-110125-marostegui.json [11:01:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:31] (03CR) 10Urbanecm: [C: 03+2] Enable ReferencePreviews as full default on Marathi wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685820 (https://phabricator.wikimedia.org/T282147) (owner: 10WMDE-Fisch) [11:01:39] Urbanecm: feel free :-) [11:03:35] (03Merged) 10jenkins-bot: Enable ReferencePreviews as full default on Marathi wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685820 (https://phabricator.wikimedia.org/T282147) (owner: 10WMDE-Fisch) [11:04:08] CFisch_WMDE: your patch is at mwdebug1001, please test [11:04:22] I'll do [11:05:26] Urbanecm: looks 
good, go ahead [11:05:31] will sync [11:05:56] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.4/includes/block/DatabaseBlockStore.php: 85dc711dee753ad8302a431369d7814efb2785d1: DatabaseBlockStore: fetch correct ActorNormalization (1/3; T281972) (duration: 00m 57s) [11:05:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:17] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.4/includes/ServiceWiring.php: 85dc711dee753ad8302a431369d7814efb2785d1: DatabaseBlockStore: fetch correct ActorNormalization (2/3; T281972) (duration: 00m 56s) [11:07:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:39] !log urbanecm@deploy1002 sync-file aborted: bd28391f807d6205875cad0d049760c0e606de24: DatabaseBlockStore: fetch correct ActorNormalization (T281972) (duration: 00m 04s) [11:08:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:42] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.4/includes/block/DatabaseBlockStore.php: bd28391f807d6205875cad0d049760c0e606de24: DatabaseBlockStore: fetch correct ActorNormalization (3/3; T281972) (duration: 00m 56s) [11:09:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:45] (03PS7) 10Urbanecm: Add tmpSerializeEmptyListsAsObjects Wikibase repo config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685776 (https://phabricator.wikimedia.org/T241422) (owner: 10Tonina Zhelyazkova) [11:10:49] (03CR) 10Urbanecm: [C: 03+2] Add tmpSerializeEmptyListsAsObjects Wikibase repo config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685776 (https://phabricator.wikimedia.org/T241422) (owner: 10Tonina Zhelyazkova) [11:11:00] (03PS3) 10Urbanecm: Add tmpSerializeEmptyListsAsObjects to Wikibase.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/688184 (https://phabricator.wikimedia.org/T241422) (owner: 10Tonina Zhelyazkova) [11:11:05] !log urbanecm@deploy1002 Synchronized 
wmf-config/InitialiseSettings.php: 23271ddb555b44c2c9659c32907fdeff2a768916: Enable ReferencePreviews as full default on Marathi wiki (T282147) (duration: 00m 57s) [11:11:06] (03CR) 10Urbanecm: [C: 03+2] Add tmpSerializeEmptyListsAsObjects to Wikibase.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/688184 (https://phabricator.wikimedia.org/T241422) (owner: 10Tonina Zhelyazkova) [11:11:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:09] T282147: Enable RefPreviews on Marathi Wikipedia - https://phabricator.wikimedia.org/T282147 [11:11:10] CFisch_WMDE: should be live [11:11:38] (03Merged) 10jenkins-bot: Add tmpSerializeEmptyListsAsObjects Wikibase repo config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685776 (https://phabricator.wikimedia.org/T241422) (owner: 10Tonina Zhelyazkova) [11:11:49] (03Merged) 10jenkins-bot: Add tmpSerializeEmptyListsAsObjects to Wikibase.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/688184 (https://phabricator.wikimedia.org/T241422) (owner: 10Tonina Zhelyazkova) [11:12:15] Tonina_WMDE: your patch is at mwdebug1001. Can you test? [11:12:25] (both of them, ftr) [11:12:34] Urbanecm: cool, looks good, works, thanks [11:12:38] :-) [11:12:43] hmm I think this is not testable, as we are only introducing the setting and not enabling it? [11:12:52] i pulled both of them [11:13:07] otherwise I'd have to ask addshore to test for me [11:13:56] but $wgWBRepoSettings['tmpSerializeEmptyListsAsObjects'] is false at wikidatawiki [11:14:04] and CFisch_WMDE says it works [11:14:06] so syncing :-) [11:14:16] Urbanecm: nope [11:14:23] sorry that's not my patch [11:14:38] sorry this was a replay to the full scap [11:14:40] oh, sorry [11:14:59] Tonina_WMDE: Might be testable on beta cluster [11:15:20] or maybe the patch that introduces the setting is not merged yet? [11:15:50] is this intended to prepare for a patch that's yet-to-arrive to cluster? 
[11:15:54] it's not rolled out [11:16:05] okay [11:16:05] ahh and it's just a $wmg var [11:16:24] not used in the CommonSettings [11:16:26] yet [11:16:45] CFisch_WMDE: i pulled https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/688184 as well :) [11:16:54] https://www.irccloud.com/pastebin/4omOMCAE/ [11:17:07] Ahh... get it [11:17:17] (03PS1) 10Effie Mouzeli: ProductionServices: poolcounter1004 will be rebooted for updates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/688239 (https://phabricator.wikimedia.org/T273278) [11:17:26] Tonina_WMDE: if `$wgWBRepoSettings['tmpSerializeEmptyListsAsObjects'] = false` is expected behavior on prod wikidatawiki, I'm happy to sync it. [11:17:36] sorry I should probably not interfere with the stuff other teams are doing :-D [11:17:43] yes, false is the expected behavior [11:17:52] excellent. syncing :) [11:17:54] * CFisch_WMDE sneaks away [11:18:19] see you later CFisch_WMDE :) [11:18:25] ;-) [11:19:05] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 6138c64e7c13fbc52ad084c0901bdd2ab30ad953: Add tmpSerializeEmptyListsAsObjects Wikibase repo config (T241422) (duration: 00m 57s) [11:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:09] T241422: Wikidata forms without statements use empty JSON array instead of empty JSON object - https://phabricator.wikimedia.org/T241422 [11:19:38] (03PS2) 10Effie Mouzeli: ProductionServices: poolcounter1004 will be rebooted for updates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/688239 (https://phabricator.wikimedia.org/T273278) [11:20:18] (03PS3) 10Urbanecm: Enable Wikimedia OCR on Beta Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685643 (https://phabricator.wikimedia.org/T282080) (owner: 10Samwilson) [11:20:27] (03CR) 10Urbanecm: [C: 03+2] "no-op for prod" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685643 (https://phabricator.wikimedia.org/T282080) (owner: 10Samwilson) 
[11:20:33] !log urbanecm@deploy1002 Synchronized wmf-config/Wikibase.php: 7f6f8497cdfba6d766e3e6974ee15a492f0518ac: Add tmpSerializeEmptyListsAsObjects to Wikibase.php (T241422) (duration: 01m 01s) [11:20:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:48] samwilson: your patch will appear on beta within 30 minutes. Let me know if it doesn't for any reason :) [11:20:56] Thanks Urbanecm ! [11:21:00] no problem Tonina_WMDE [11:21:09] Urbanecm: terrific, thanks. I'll test. [11:21:36] (03Merged) 10jenkins-bot: Enable Wikimedia OCR on Beta Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685643 (https://phabricator.wikimedia.org/T282080) (owner: 10Samwilson) [11:22:01] (03PS3) 10Urbanecm: Remove Vector language button from Commons, Wikidata, Mediawiki, Wikispecies [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685795 (https://phabricator.wikimedia.org/T281968) (owner: 10Jdrewniak) [11:22:16] (03CR) 10Urbanecm: [C: 03+2] Remove Vector language button from Commons, Wikidata, Mediawiki, Wikispecies [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685795 (https://phabricator.wikimedia.org/T281968) (owner: 10Jdrewniak) [11:23:42] (03Merged) 10jenkins-bot: Remove Vector language button from Commons, Wikidata, Mediawiki, Wikispecies [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685795 (https://phabricator.wikimedia.org/T281968) (owner: 10Jdrewniak) [11:24:08] jan_drewniak: your patch is on mwdebug1001. Can you test, please? [11:25:22] Urbanecm: ok, just checked it out. It's good to sync. [11:25:28] great, syncing [11:26:56] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 9209d96560777cf6747d57855c7b525e702664d7: Remove Vector language button from Commons, Wikidata, Mediawiki, Wikispecies (T281968) (duration: 00m 57s) [11:26:59] jan_drewniak: your patch should be live! 
[11:26:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:00] T281968: Remove language button from commons - https://phabricator.wikimedia.org/T281968 [11:27:07] (03PS3) 10Urbanecm: Change namespace name and aliases on jawikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/686437 (https://phabricator.wikimedia.org/T262155) (owner: 10Zabe) [11:27:10] Urbanecm: great, thanks! [11:27:14] no problem :) [11:27:16] (03CR) 10Urbanecm: [C: 03+2] Change namespace name and aliases on jawikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/686437 (https://phabricator.wikimedia.org/T262155) (owner: 10Zabe) [11:28:43] (03Merged) 10jenkins-bot: Change namespace name and aliases on jawikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/686437 (https://phabricator.wikimedia.org/T262155) (owner: 10Zabe) [11:29:08] 10SRE, 10GitLab (Initialization), 10Release-Engineering-Team (Doing), 10User-brennen: Define auth strategy for GitLab - https://phabricator.wikimedia.org/T274461 (10jbond) >> - Logout is still an issue > See above i dont think this is an issue I have just tested idp.wmfcloud.org configured with SingleSi... [11:30:19] Zabe: your patch is on mwdebug1001, please test. 
[11:31:59] Urbanecm: works the supposed way [11:32:19] thanks, syncing [11:33:33] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 068cd7e41e339acf72fb81d4fcc3b86292209fe3: Change namespace name and aliases on jawikivoyage (T262155) (duration: 00m 57s) [11:33:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:36] T262155: Request for settings about namespaces on ja.wikivoyage - https://phabricator.wikimedia.org/T262155 [11:33:44] !log [urbanecm@mwmaint1002 ~]$ mwscript namespaceDupes.php --wiki=jawikivoyage # T262155 [11:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:49] !log [urbanecm@mwmaint1002 ~]$ mwscript namespaceDupes.php --wiki=jawikivoyage --fix # T262155 [11:33:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:52] Zabe: should be live [11:34:11] (03PS2) 10Urbanecm: Add *.geograph.ie to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/687795 (https://phabricator.wikimedia.org/T282007) [11:34:15] (03CR) 10Urbanecm: [C: 03+2] Add *.geograph.ie to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/687795 (https://phabricator.wikimedia.org/T282007) (owner: 10Urbanecm) [11:35:01] (03Merged) 10jenkins-bot: Add *.geograph.ie to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/687795 (https://phabricator.wikimedia.org/T282007) (owner: 10Urbanecm) [11:37:20] thanks [11:37:37] no problem [11:37:46] (03PS7) 10Urbanecm: Disabling Education Program namespaces in Russian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685597 (https://phabricator.wikimedia.org/T282112) (owner: 10Rubin) [11:37:50] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 8bef11c3048683663e6edc38e21cd6d6d1192eb7: Add *.geograph.ie to the 
wgCopyUploadsDomains allowlist of Wikimedia Commons (T282007) (duration: 00m 57s) [11:37:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:53] (03CR) 10Urbanecm: [C: 03+2] Disabling Education Program namespaces in Russian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685597 (https://phabricator.wikimedia.org/T282112) (owner: 10Rubin) [11:37:54] T282007: Add *.geograph.ie to the wgCopyUploadsDomains allowlist of Wikimedia Commons - https://phabricator.wikimedia.org/T282007 [11:38:53] (03Merged) 10jenkins-bot: Disabling Education Program namespaces in Russian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685597 (https://phabricator.wikimedia.org/T282112) (owner: 10Rubin) [11:41:27] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 3418237fdbe3eaff409bb23bf97fbba51e60337a: Disabling Education Program namespaces in Russian Wikipedia (T282112) (duration: 00m 57s) [11:41:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:30] T282112: Delete "Education Program" and "Education Program talk" namespace from ruwiki - https://phabricator.wikimedia.org/T282112 [11:46:42] !log EU B&C window done [11:46:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:37] (03PS1) 10Muehlenhoff: Manage different templates per Java release branch [puppet] - 10https://gerrit.wikimedia.org/r/688246 (https://phabricator.wikimedia.org/T281345) [11:54:12] (03PS5) 10Majavah: toolforge: Add ingress-nginx Helm files [puppet] - 10https://gerrit.wikimedia.org/r/685715 (https://phabricator.wikimedia.org/T264221) [11:58:07] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/688246 (https://phabricator.wikimedia.org/T281345) (owner: 10Muehlenhoff) [12:02:48] Urbanecm: the beta cluster config patch from earlier seems to not yet be live; should it be already? 
[12:03:17] (03PS1) 10Marostegui: mariadb: Clarify that innodb_change_buffering is none [puppet] - 10https://gerrit.wikimedia.org/r/688248 (https://phabricator.wikimedia.org/T263443) [12:03:57] (03PS5) 10Giuseppe Lavagetto: Add diff tasks to rake [deployment-charts] - 10https://gerrit.wikimedia.org/r/685721 [12:03:59] (03PS6) 10Giuseppe Lavagetto: eventgate: add kafka egress policy stanza [deployment-charts] - 10https://gerrit.wikimedia.org/r/684855 (https://phabricator.wikimedia.org/T253058) [12:04:18] samwilson: I'll check within few minutes. [12:04:21] (03PS2) 10Marostegui: mariadb: Clarify that innodb_change_buffering is none [puppet] - 10https://gerrit.wikimedia.org/r/688248 (https://phabricator.wikimedia.org/T263443) [12:04:43] Urbanecm: thanks [12:04:43] (03PS1) 10Jbond: P:idp::client::httpd: add support for CASSSOEnabled [puppet] - 10https://gerrit.wikimedia.org/r/688249 [12:05:19] (03PS3) 10Marostegui: mariadb: Clarify that innodb_change_buffering is none [puppet] - 10https://gerrit.wikimedia.org/r/688248 (https://phabricator.wikimedia.org/T263443) [12:06:06] (03PS2) 10Jbond: P:idp::client::httpd: add support for CASSSOEnabled [puppet] - 10https://gerrit.wikimedia.org/r/688249 (https://phabricator.wikimedia.org/T233941) [12:06:36] (03PS6) 10Majavah: toolforge: Add ingress-nginx Helm files [puppet] - 10https://gerrit.wikimedia.org/r/685715 (https://phabricator.wikimedia.org/T264221) [12:06:52] Urbanecm: https://phabricator.wikimedia.org/T282206 [12:06:56] samwilson: ^ [12:06:57] (03PS2) 10Muehlenhoff: Manage different templates per Java release branch [puppet] - 10https://gerrit.wikimedia.org/r/688246 (https://phabricator.wikimedia.org/T281345) [12:07:10] (03CR) 10Marostegui: [C: 03+2] mariadb: Clarify that innodb_change_buffering is none [puppet] - 10https://gerrit.wikimedia.org/r/688248 (https://phabricator.wikimedia.org/T263443) (owner: 10Marostegui) [12:07:50] Majavah: ah, thanks [12:08:12] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 9): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29467/console" [puppet] - 10https://gerrit.wikimedia.org/r/688249 (https://phabricator.wikimedia.org/T233941) (owner: 10Jbond) [12:10:44] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/688246 (https://phabricator.wikimedia.org/T281345) (owner: 10Muehlenhoff) [12:12:08] (03PS7) 10Majavah: toolforge: Add ingress-nginx Helm files [puppet] - 10https://gerrit.wikimedia.org/r/685715 (https://phabricator.wikimedia.org/T264221) [12:12:48] 10SRE, 10ops-eqiad: Degraded RAID on ms-be1038 - https://phabricator.wikimedia.org/T282434 (10fgiunchedi) Looks like the host is busted, I'll try a reboot ` Debian GNU/Linux 9 auto-installed on Thu Jul 13 14:37:19 UTC 2017. -bash: /usr/bin/lesspipe: Input/output error -bash: /usr/bin/tput: Input/output error... [12:13:59] (03PS2) 10Volans: sre.deploy.python-code: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/685855 [12:16:28] PROBLEM - Host ms-be1038 is DOWN: PING CRITICAL - Packet loss = 100% [12:16:41] 10SRE, 10ops-eqiad: Degraded RAID on ms-be1038 - https://phabricator.wikimedia.org/T282434 (10fgiunchedi) Message at boot up ` Slot 3 Port 1 : Smart Array P840 Controller - (4096 MB, V4.52) 14 Logical Drive(s) - Operation Failed - 1719-Slot 3 Drive Array - A controller failure event occurred prior to thi... 
[12:17:50] RECOVERY - Host ms-be1038 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [12:18:16] RECOVERY - Check systemd state on ms-be1038 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:19:12] (03CR) 10Volans: [C: 03+2] sre.deploy.python-code: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/685855 (owner: 10Volans) [12:22:59] (03Merged) 10jenkins-bot: sre.deploy.python-code: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/685855 (owner: 10Volans) [12:24:33] (03PS3) 10Jbond: P:idp::client::httpd: add support for CASSSOEnabled [puppet] - 10https://gerrit.wikimedia.org/r/688249 (https://phabricator.wikimedia.org/T233941) [12:24:35] (03PS1) 10Filippo Giunchedi: prometheus: bump PDU SNMP scrape timeout [puppet] - 10https://gerrit.wikimedia.org/r/688262 [12:25:21] 10SRE, 10Platform Engineering, 10Services, 10Wikimedia-Mailing-lists: Decide on future of public services@ mailing list (which has no maintainers) - https://phabricator.wikimedia.org/T278516 (10Mvolz) I've now changed the settings to respond with the following message to all new posts to the list: ` This... 
[12:26:58] RECOVERY - MD RAID on ms-be1038 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [12:27:34] !log volans@cumin2002 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet with reason: Initial deploy to cumin2002 - volans@cumin2002 [12:27:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:07] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 12): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29468/console" [puppet] - 10https://gerrit.wikimedia.org/r/688249 (https://phabricator.wikimedia.org/T233941) (owner: 10Jbond) [12:28:10] (03PS7) 10Giuseppe Lavagetto: eventgate: add kafka egress policy stanza [deployment-charts] - 10https://gerrit.wikimedia.org/r/684855 (https://phabricator.wikimedia.org/T253058) [12:28:12] (03PS1) 10Giuseppe Lavagetto: Rakefile: split more of it into submodules [deployment-charts] - 10https://gerrit.wikimedia.org/r/688265 [12:29:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1105:3312 (re)pooling @ 25%: Repool db1105:3312', diff saved to https://phabricator.wikimedia.org/P15889 and previous config saved to /var/cache/conftool/dbconfig/20210510-122923-root.json [12:29:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:30] !log volans@cumin2002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet with reason: Initial deploy to cumin2002 - volans@cumin2002 [12:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:05] 10SRE, 10ops-eqiad: Degraded RAID on ms-be1038 - https://phabricator.wikimedia.org/T282434 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi RAID firmware upgraded and host rebooted 2x, we're back [12:33:46] (03PS1) 10Volans: python_deploy: use forward-only git pulls [puppet] - 
10https://gerrit.wikimedia.org/r/688273 [12:34:48] RECOVERY - Disk space on ms-be1038 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be1038&var-datasource=eqiad+prometheus/ops [12:36:00] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/688273 (owner: 10Volans) [12:38:11] (03CR) 10Volans: [C: 03+2] python_deploy: use forward-only git pulls [puppet] - 10https://gerrit.wikimedia.org/r/688273 (owner: 10Volans) [12:38:20] (03CR) 10Jbond: "lgtm comment inline, I also wonder if we should stick with one erb file with some if statements e.g." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/688246 (https://phabricator.wikimedia.org/T281345) (owner: 10Muehlenhoff) [12:39:16] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "We need to also change the selector for the tls_service under common_templates" [deployment-charts] - 10https://gerrit.wikimedia.org/r/685748 (https://phabricator.wikimedia.org/T282148) (owner: 10Effie Mouzeli) [12:39:34] 10SRE, 10GitLab (Initialization), 10Release-Engineering-Team (Doing), 10User-brennen: Define auth strategy for GitLab - https://phabricator.wikimedia.org/T274461 (10MoritzMuehlenhoff) >>! In T274461#7073817, @jbond wrote: > >> * User dangle issue. If a user is deleted in Apereo and then another user gets... [12:41:52] Majavah: right, that one again... 
[12:44:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1105:3312 (re)pooling @ 50%: Repool db1105:3312', diff saved to https://phabricator.wikimedia.org/P15890 and previous config saved to /var/cache/conftool/dbconfig/20210510-124427-root.json [12:44:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:43] (03CR) 10Muehlenhoff: "In the latest 11.0.9->11.0.11 update they added a few hundred lines with plenty of comments and many of those affect config lines used by " (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/688246 (https://phabricator.wikimedia.org/T281345) (owner: 10Muehlenhoff) [12:47:44] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:50:10] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:55:57] (03CR) 10Jbond: [C: 03+1] "> Patch Set 2:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/688246 (https://phabricator.wikimedia.org/T281345) (owner: 10Muehlenhoff) [12:58:24] (03CR) 10Jbond: [C: 03+2] O:nagios_common: Pass ssl expiry constraints to https checks [puppet] - 10https://gerrit.wikimedia.org/r/686495 (owner: 10Jbond) [12:58:31] (03PS3) 10Jbond: O:nagios_common: Pass ssl expiry constraints to https checks [puppet] - 10https://gerrit.wikimedia.org/r/686495 [12:59:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1105:3312 (re)pooling @ 75%: Repool db1105:3312', diff saved to https://phabricator.wikimedia.org/P15891 and previous config saved to /var/cache/conftool/dbconfig/20210510-125930-root.json [12:59:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:00] (03PS6) 10Giuseppe Lavagetto: eventgate-main: autogenerate egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/684856 [13:03:16] (03PS1) 10Jbond: hiera: enable SSOut for peopleweb and puppetboard [puppet] - 10https://gerrit.wikimedia.org/r/688277 [13:04:40] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 156884680 and 4 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:05:39] (03CR) 10Volans: [C: 04-1] "Nice! Almost ready, just one small error and a nit for the phab message." 
(035 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/684819 (owner: 10Muehlenhoff) [13:07:12] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 637864 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:14:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1105:3312 (re)pooling @ 100%: Repool db1105:3312', diff saved to https://phabricator.wikimedia.org/P15892 and previous config saved to /var/cache/conftool/dbconfig/20210510-131434-root.json [13:14:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:46] (03CR) 10Volans: [C: 03+1] "+1 to test it on puppetboard" [puppet] - 10https://gerrit.wikimedia.org/r/688277 (owner: 10Jbond) [13:17:39] (03PS1) 10Herron: arclamp/xenon: point codfw hosts to eqiad (mwlog1002) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/688281 (https://phabricator.wikimedia.org/T224565) [13:18:36] (03PS2) 10Herron: arclamp/xenon: point all hosts to eqiad (mwlog1002) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/688281 (https://phabricator.wikimedia.org/T224565) [13:18:42] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:19:49] (03PS2) 10Elukey: WIP - Add istio base images build support [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/688211 [13:19:58] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:20:04] RECOVERY - WDQS SPARQL on wdqs1012 is OK: OK - Certificate wdqs.discovery.wmnet will expire on Mon 09 Mar 2026 07:17:36 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [13:23:29] (03PS6) 10Giuseppe Lavagetto: safe-service-restart: only verify pooled services [puppet] - 10https://gerrit.wikimedia.org/r/684287 (https://phabricator.wikimedia.org/T279100) [13:24:30] (03CR) 10Jbond: [C: 04-1] thumbor/mwmaint: add periodic job to pull fc-list file (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/685914 (https://phabricator.wikimedia.org/T280718) (owner: 10Dzahn) [13:24:45] (03CR) 10Giuseppe Lavagetto: [C: 03+1] etcd::client::globalconfig: Remove python-etcd [puppet] - 10https://gerrit.wikimedia.org/r/685766 (owner: 10Muehlenhoff) [13:24:55] (03CR) 10jerkins-bot: [V: 04-1] safe-service-restart: only verify pooled services [puppet] - 10https://gerrit.wikimedia.org/r/684287 (https://phabricator.wikimedia.org/T279100) (owner: 10Giuseppe Lavagetto) [13:25:25] (03CR) 10Giuseppe Lavagetto: [C: 03+1] ProductionServices: poolcounter1004 will be rebooted for updates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/688239 (https://phabricator.wikimedia.org/T273278) (owner: 10Effie Mouzeli) [13:27:16] (03PS2) 10Jbond: nagios_common: add check_https_url_custom_ip [puppet] - 10https://gerrit.wikimedia.org/r/686622 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [13:27:20] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/686622 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [13:28:26] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/683837 (https://phabricator.wikimedia.org/T277990) (owner: 10Jbond) [13:29:58] (03PS1) 10ZPapierski: Push the limit for shads queried in relforge [puppet] - 10https://gerrit.wikimedia.org/r/688309 [13:34:33] 10SRE, 10Analytics: Switch kafka/Hadoop away from java::security - https://phabricator.wikimedia.org/T282454 (10MoritzMuehlenhoff) [13:34:45] 10SRE, 10Analytics: Switch kafka/Hadoop away from java::security - 
https://phabricator.wikimedia.org/T282454 (10MoritzMuehlenhoff) p:05Triage→03Medium [13:35:13] (03CR) 10Muehlenhoff: Manage different templates per Java release branch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/688246 (https://phabricator.wikimedia.org/T281345) (owner: 10Muehlenhoff) [13:36:43] (03PS1) 10WMDE-Fisch: [beta] Forward renamed config name for improved template search features [mediawiki-config] - 10https://gerrit.wikimedia.org/r/688310 [13:36:45] (03PS1) 10WMDE-Fisch: [beta] Improve comment around ReferencePreviews beta cluster default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/688311 [13:37:32] (03PS1) 10Jbond: P:services_proxy::envoy: drop the ensure parameter [puppet] - 10https://gerrit.wikimedia.org/r/688312 (https://phabricator.wikimedia.org/T277990) [13:39:21] (03CR) 10Vgutierrez: P:trafficserver::backend: Use a trusted CA file outside of /etc/ssl/certs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/685495 (https://phabricator.wikimedia.org/T281673) (owner: 10Jbond) [13:39:36] (03CR) 10jerkins-bot: [V: 04-1] P:services_proxy::envoy: drop the ensure parameter [puppet] - 10https://gerrit.wikimedia.org/r/688312 (https://phabricator.wikimedia.org/T277990) (owner: 10Jbond) [13:39:46] (03PS1) 10Jbond: (DO NOT MEREG): test disableing service proxy but providing an empty set [puppet] - 10https://gerrit.wikimedia.org/r/688313 [13:41:17] (03PS2) 10Jbond: P:services_proxy::envoy: drop the ensure parameter [puppet] - 10https://gerrit.wikimedia.org/r/688312 (https://phabricator.wikimedia.org/T277990) [13:43:19] (03Abandoned) 10Jbond: (DO NOT MEREG): test disableing service proxy but providing an empty set [puppet] - 10https://gerrit.wikimedia.org/r/688313 (owner: 10Jbond) [13:44:29] (03CR) 10Jbond: "alternate to https://gerrit.wikimedia.org/r/c/operations/puppet/+/683837" [puppet] - 10https://gerrit.wikimedia.org/r/688313 (owner: 10Jbond) [13:44:38] (03PS1) 10Giuseppe Lavagetto: profile::services_proxy::envoy: noop when fed 
an empty list of listeners [puppet] - 10https://gerrit.wikimedia.org/r/688315 (https://phabricator.wikimedia.org/T277990) [13:44:47] (03CR) 10Muehlenhoff: [C: 03+2] Manage different templates per Java release branch [puppet] - 10https://gerrit.wikimedia.org/r/688246 (https://phabricator.wikimedia.org/T281345) (owner: 10Muehlenhoff) [13:45:09] (03CR) 10Jbond: "alternate to https://gerrit.wikimedia.org/r/c/operations/puppet/+/683837" [puppet] - 10https://gerrit.wikimedia.org/r/688312 (https://phabricator.wikimedia.org/T277990) (owner: 10Jbond) [13:49:28] (03Abandoned) 10Jbond: P:services_proxy::envoy: drop the ensure parameter [puppet] - 10https://gerrit.wikimedia.org/r/688312 (https://phabricator.wikimedia.org/T277990) (owner: 10Jbond) [13:49:42] (03Abandoned) 10Jbond: P::envoy: allow users to run tlsproxy without service proxy [puppet] - 10https://gerrit.wikimedia.org/r/683837 (https://phabricator.wikimedia.org/T277990) (owner: 10Jbond) [13:50:52] RECOVERY - Postgres Replication Lag on maps1005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 47464 and 4184 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:56:44] (03PS1) 10Herron: scholarships: update default value to mwlog1002 [puppet] - 10https://gerrit.wikimedia.org/r/688317 (https://phabricator.wikimedia.org/T224565) [13:59:01] (03PS5) 10Jbond: P:trafficserver::backend: Use a trusted CA file outside of /etc/ssl/certs [puppet] - 10https://gerrit.wikimedia.org/r/685495 (https://phabricator.wikimedia.org/T281673) [13:59:44] (03PS7) 10Jbond: hiera - cp1077: test CA bundle with pki and puppet ROOT ca certs [puppet] - 10https://gerrit.wikimedia.org/r/685496 (https://phabricator.wikimedia.org/T281673) [13:59:49] (03PS8) 10Jbond: P:trafficserver::backend: update the source of the ATS trusted ca bundle [puppet] - 10https://gerrit.wikimedia.org/r/685497 (https://phabricator.wikimedia.org/T281673) [13:59:57] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29471/console" [puppet] - 10https://gerrit.wikimedia.org/r/685495 (https://phabricator.wikimedia.org/T281673) (owner: 10Jbond) [13:59:59] (03PS2) 10Jbond: O:debmonitor::server: Switch debmonitor.wikimedia.org ssl to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/685576 (https://phabricator.wikimedia.org/T281673) [14:01:18] FYI: Merging two minor beta cluster only config changes [14:01:49] (03CR) 10Jbond: [V: 03+1] P:trafficserver::backend: Use a trusted CA file outside of /etc/ssl/certs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/685495 (https://phabricator.wikimedia.org/T281673) (owner: 10Jbond) [14:02:15] 10SRE, 10observability, 10Patch-For-Review: Migrate mwlog/udp2log servers to Buster - https://phabricator.wikimedia.org/T224565 (10herron) Thanks for correcting the oversight on arclamp/xenon, TIL I've created https://wikitech.wikimedia.org/wiki/Mwlog just now to help clear up the services that are deployed... [14:02:32] (03CR) 10WMDE-Fisch: [C: 03+2] "Deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/688310 (owner: 10WMDE-Fisch) [14:03:02] (03CR) 10WMDE-Fisch: [C: 03+2] "Deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/688311 (owner: 10WMDE-Fisch) [14:03:17] (03CR) 10Herron: [C: 03+2] scholarships: update default value to mwlog1002 [puppet] - 10https://gerrit.wikimedia.org/r/688317 (https://phabricator.wikimedia.org/T224565) (owner: 10Herron) [14:03:31] (03Merged) 10jenkins-bot: [beta] Forward renamed config name for improved template search features [mediawiki-config] - 10https://gerrit.wikimedia.org/r/688310 (owner: 10WMDE-Fisch) [14:03:59] (03Merged) 10jenkins-bot: [beta] Improve comment around ReferencePreviews beta cluster default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/688311 (owner: 10WMDE-Fisch) [14:08:02] o/ I forget the protocol: Can BC-only changes be merged/synced outside of deployment windows? 
[14:08:58] (03CR) 10Andrew Bogott: "Because we don't have ipv4 on cloud-vps, this breaks puppet on some deployment-prep nodes. Can this be made conditional somehow?" [puppet] - 10https://gerrit.wikimedia.org/r/684814 (owner: 10Giuseppe Lavagetto) [14:21:50] (03CR) 10Ssingh: [C: 03+2] nagios_common: add check_https_url_custom_ip [puppet] - 10https://gerrit.wikimedia.org/r/686622 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [14:24:05] (03CR) 10Ssingh: "With https://gerrit.wikimedia.org/r/c/operations/puppet/+/686622/ merged, this is ready. Rebasing." [puppet] - 10https://gerrit.wikimedia.org/r/686625 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [14:24:44] (03CR) 10Filippo Giunchedi: "LGTM! See inline post-merge nit" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/686622 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [14:25:45] (03CR) 10Herron: "Should effectively be a noop https://puppet-compiler.wmflabs.org/compiler1003/29461/" [puppet] - 10https://gerrit.wikimedia.org/r/686633 (https://phabricator.wikimedia.org/T232343) (owner: 10Herron) [14:27:36] (03CR) 10Herron: [C: 03+1] prometheus: bump PDU SNMP scrape timeout [puppet] - 10https://gerrit.wikimedia.org/r/688262 (owner: 10Filippo Giunchedi) [14:27:40] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/686625 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [14:30:19] (03PS1) 10Muehlenhoff: Update java.security file for 11.0.11 [puppet] - 10https://gerrit.wikimedia.org/r/688327 (https://phabricator.wikimedia.org/T281345) [14:31:02] (03PS1) 10Ssingh: nagios_common: add -C option to check_https_url_custom_ip [puppet] - 10https://gerrit.wikimedia.org/r/688328 [14:31:48] 10SRE, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Set resource requests and limits for calico PODs - https://phabricator.wikimedia.org/T277877 (10JMeybohm) I've looked into typha and kube-controllers component as well as they shot a similar patterns 
(different magnitude, though). Unfortunately we lack... [14:32:42] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps1005.eqiad.wmnet [14:32:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:38] (03PS2) 10Ssingh: wikidough: use check_https_url_custom_ip for DoH check [puppet] - 10https://gerrit.wikimedia.org/r/686625 (https://phabricator.wikimedia.org/T252132) [14:37:05] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: bump PDU SNMP scrape timeout [puppet] - 10https://gerrit.wikimedia.org/r/688262 (owner: 10Filippo Giunchedi) [14:38:29] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29472/console" [puppet] - 10https://gerrit.wikimedia.org/r/686625 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [14:42:23] (03PS1) 10JMeybohm: calico: Remove CPU limit for calico-node, bump for typha and kube-controllers [deployment-charts] - 10https://gerrit.wikimedia.org/r/688332 (https://phabricator.wikimedia.org/T277877) [14:43:36] (03PS3) 10JMeybohm: prometheus: Clean up absent file resource [puppet] - 10https://gerrit.wikimedia.org/r/684801 (https://phabricator.wikimedia.org/T271573) [14:47:38] (03PS3) 10Elukey: WIP - Add istio base images build support [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/688211 (https://phabricator.wikimedia.org/T278192) [14:49:19] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/688328 (owner: 10Ssingh) [14:50:57] (03CR) 10Ssingh: [C: 03+2] nagios_common: add -C option to check_https_url_custom_ip [puppet] - 10https://gerrit.wikimedia.org/r/688328 (owner: 10Ssingh) [14:52:12] (03CR) 10Ssingh: [V: 03+1 C: 03+2] wikidough: use check_https_url_custom_ip for DoH check [puppet] - 10https://gerrit.wikimedia.org/r/686625 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [14:53:04] (03CR) 10Andrew Bogott: "> Patch Set 3:" 
[puppet] - 10https://gerrit.wikimedia.org/r/684814 (owner: 10Giuseppe Lavagetto) [14:53:42] (03PS5) 10Jbond: mail: move default mail relay config out of standard module [puppet] - 10https://gerrit.wikimedia.org/r/686633 (https://phabricator.wikimedia.org/T232343) (owner: 10Herron) [14:53:44] (03PS1) 10Jbond: exim: make exim class ensurable [puppet] - 10https://gerrit.wikimedia.org/r/688333 (https://phabricator.wikimedia.org/T232343) [14:56:48] (03PS1) 10Ssingh: wikidough: lookup domain and IP from hiera [puppet] - 10https://gerrit.wikimedia.org/r/688336 (https://phabricator.wikimedia.org/T252132) [14:57:50] (03CR) 10Jbond: "LGTM see inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/686633 (https://phabricator.wikimedia.org/T232343) (owner: 10Herron) [14:57:59] (03PS6) 10Jbond: mail: move default mail relay config out of standard module [puppet] - 10https://gerrit.wikimedia.org/r/686633 (https://phabricator.wikimedia.org/T232343) (owner: 10Herron) [14:58:55] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29473/console" [puppet] - 10https://gerrit.wikimedia.org/r/688336 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [14:59:26] (03Abandoned) 10Jforrester: [jawikiquote] Add custom logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576457 (https://phabricator.wikimedia.org/T150618) (owner: 10Jforrester) [14:59:42] (03Abandoned) 10Jforrester: [cywikiquote] Add custom logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/560386 (https://phabricator.wikimedia.org/T150618) (owner: 10Subscriptshoe9) [14:59:53] (03CR) 10Jbond: "The reason for initially drafting this where invalid, so there is no priority on this" [puppet] - 10https://gerrit.wikimedia.org/r/688333 (https://phabricator.wikimedia.org/T232343) (owner: 10Jbond) [15:00:00] (03CR) 10Alexandros Kosiaris: [C: 03+1] calico: Remove CPU limit for calico-node, bump for typha and kube-controllers 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/688332 (https://phabricator.wikimedia.org/T277877) (owner: 10JMeybohm) [15:01:02] (03Abandoned) 10Jforrester: [betawikiversity] Provide HD logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/577362 (owner: 10Jforrester) [15:01:08] (03Abandoned) 10Jforrester: [betawikiversity] Add HD logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556864 (https://phabricator.wikimedia.org/T150618) (owner: 10TechneSiyam) [15:01:58] (03Abandoned) 10Jforrester: Undeploy InterwikiSorting - I: Disable everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599064 (https://phabricator.wikimedia.org/T253764) (owner: 10Jforrester) [15:02:03] (03Abandoned) 10Jforrester: Undeploy InterwikiSorting - II: Drop loading ability [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599065 (https://phabricator.wikimedia.org/T253764) (owner: 10Jforrester) [15:02:08] (03Abandoned) 10Jforrester: Undeploy InterwikiSorting - III: Drop InterwikiSortOrders.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599066 (https://phabricator.wikimedia.org/T253764) (owner: 10Jforrester) [15:02:14] (03Abandoned) 10Jforrester: Provide wgULSCompactLinksPrepend as wgInterwikiSortingSortPrepend is going away [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599081 (owner: 10Jforrester) [15:02:22] (03Abandoned) 10Jforrester: Undeploy InterwikiSorting - IV: Drop all config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599067 (https://phabricator.wikimedia.org/T253764) (owner: 10Jforrester) [15:02:30] (03Abandoned) 10Jforrester: Undeploy InterwikiSorting - V: Stop loading i18n [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599068 (https://phabricator.wikimedia.org/T253764) (owner: 10Jforrester) [15:02:35] (03CR) 10Ssingh: [V: 03+1 C: 03+2] "PCC checks out; parameters updated and no change to Profile::Wikidough as expected." 
[puppet] - 10https://gerrit.wikimedia.org/r/688336 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [15:03:10] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/688327 (https://phabricator.wikimedia.org/T281345) (owner: 10Muehlenhoff) [15:08:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10Elasticsearch, 10Discovery-Search (Current work): (Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T281989 (10MPhamWMF) [15:13:58] (03PS3) 10Dave Pifke: arclamp/xenon: point all hosts to eqiad (mwlog1002) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/688281 (https://phabricator.wikimedia.org/T224565) (owner: 10Herron) [15:15:08] (03CR) 10Dave Pifke: [C: 03+1] "I took a stab at adding some comments to make this more obvious." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/688281 (https://phabricator.wikimedia.org/T224565) (owner: 10Herron) [15:15:29] (03CR) 10Herron: "> Patch Set 3: Code-Review+1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/688281 (https://phabricator.wikimedia.org/T224565) (owner: 10Herron) [15:16:07] PROBLEM - Disk space on rpki1001 is CRITICAL: DISK CRITICAL - free space: / 3365 MB (39% inode=3%): /tmp 3365 MB (39% inode=3%): /var/tmp 3365 MB (39% inode=3%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=rpki1001&var-datasource=eqiad+prometheus/ops [15:17:13] I did some clean up this morning for --^ [15:18:02] (03PS4) 10Alexandros Kosiaris: Remove rdb200{3,5} from netpols [deployment-charts] - 10https://gerrit.wikimedia.org/r/682912 (https://phabricator.wikimedia.org/T255250) [15:18:08] ahhh inodes for root [15:18:13] 10SRE, 10observability, 10Patch-For-Review: Migrate mwlog/udp2log servers to Buster - https://phabricator.wikimedia.org/T224565 (10dpifke) ArcLamp data is arriving again, and I'm working on fixing our monitoring for it. Thanks for taking a look at the others. 
[15:19:03] mostly rsyslogd [15:19:32] (03CR) 10Alexandros Kosiaris: [C: 03+2] Remove rdb200{3,5} from netpols [deployment-charts] - 10https://gerrit.wikimedia.org/r/682912 (https://phabricator.wikimedia.org/T255250) (owner: 10Alexandros Kosiaris) [15:20:20] !log restart rsyslog on rpki1001 [15:20:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:57] of course not related, pebcak [15:21:10] (03Merged) 10jenkins-bot: Remove rdb200{3,5} from netpols [deployment-charts] - 10https://gerrit.wikimedia.org/r/682912 (https://phabricator.wikimedia.org/T255250) (owner: 10Alexandros Kosiaris) [15:21:28] (03PS7) 10Muehlenhoff: Add a cookbook to delete hosts from debmonitor [cookbooks] - 10https://gerrit.wikimedia.org/r/684819 [15:21:34] (03CR) 10Muehlenhoff: Add a cookbook to delete hosts from debmonitor (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/684819 (owner: 10Muehlenhoff) [15:24:11] (03CR) 10jerkins-bot: [V: 04-1] Add a cookbook to delete hosts from debmonitor [cookbooks] - 10https://gerrit.wikimedia.org/r/684819 (owner: 10Muehlenhoff) [15:27:26] 10SRE, 10Analytics-Clusters: Switch kafka/Hadoop away from java::security - https://phabricator.wikimedia.org/T282454 (10Milimetric) [15:29:56] (03PS6) 10Jbond: P:trafficserver::backend: Use a trusted CA file outside of /etc/ssl/certs [puppet] - 10https://gerrit.wikimedia.org/r/685495 (https://phabricator.wikimedia.org/T281673) [15:30:04] XioNoX: around? 
[15:30:24] there seems to be a ton of files under /var on rpki1001, I am wondering if we can clean up some [15:31:50] mostly routinator-related files [15:32:59] (03PS8) 10Jbond: hiera - cp1077: test CA bundle with pki and puppet ROOT ca certs [puppet] - 10https://gerrit.wikimedia.org/r/685496 (https://phabricator.wikimedia.org/T281673) [15:34:13] /var/lib/routinator/repository/rrdp, a lot of dirs from 2020 [15:34:14] (03PS7) 10Jbond: P:trafficserver::backend: Use a trusted CA file outside of /etc/ssl/certs [puppet] - 10https://gerrit.wikimedia.org/r/685495 (https://phabricator.wikimedia.org/T281673) [15:34:19] no idea though if we can drop [15:34:55] (03PS9) 10Jbond: hiera - cp1077: test CA bundle with pki and puppet ROOT ca certs [puppet] - 10https://gerrit.wikimedia.org/r/685496 (https://phabricator.wikimedia.org/T281673) [15:35:03] cdanis: around? Do you have any idea? [15:36:31] they all seem past versions of the ROA data [15:37:54] ummm [15:37:56] (03PS8) 10Muehlenhoff: Add a cookbook to delete hosts from debmonitor [cookbooks] - 10https://gerrit.wikimedia.org/r/684819 [15:38:01] I do not know much about how routinator works [15:38:08] ah snap sorry [15:38:12] can I help? 
[15:38:35] elukey: looking [15:38:55] ciao volans, on rpki1001 the root's inodes are at 95% usage, and I see a ton of files under /var/lib/routinator/repository/rrdp [15:39:15] oh, not good indeed [15:39:24] some dirs are from 2020, so the temptation to clean up is big [15:40:32] I also assume that if rpki1001 goes down the worst side effect will be cr1/cr2 not getting updated ROAs etc.., so not a big deal for the moment [15:41:15] let's try to fix it if possible though :D [15:44:17] volans: sure I meant that we can wait for Arzhel :D [15:45:00] ~290k files in rrdp/ and ~75k in rsync/ [15:45:34] inodes used more because of the directory trees [15:46:24] yep [15:46:36] but that is old data, in theory [15:50:00] elukey: looking [15:50:05] <3 [15:50:27] elukey: rule of thumb is that you can nuke everything and restart routinator, and it will create whatever it needs [15:50:52] ah lovely, so we can drop say data older than 30d? [15:50:59] from that repository/rrdp dir [15:51:42] i dont think you can do that, the roas are i belive copied with the original creation data, and they could be older then 30 days [15:52:02] ahhh so it is not a snapshot of all of them every time [15:52:07] okok thanks jbond42 [15:52:13] then it is a problem :D [15:52:31] elukey: im not 100% but i think rrdp works wimlar to rsync in relation to time stamps [15:52:52] yeah that's what I'm looking for in the doc [15:53:20] happy to see RPKI is a thing... I doubt I can be of much help though :( [15:53:23] XioNoX: fyi i notice this in the most recennt https://github.com/NLnetLabs/routinator/pull/470 release [15:53:40] which could help with rebuilding (if we update) [15:53:42] jbond42: nice! 
[15:53:49] I was having a quick look at the docs but didn't find anything relevant in our version [15:54:07] also it could be that someone has just signed a couple of thousand prefixes or something that has pushed is over the edge [15:54:18] topranks: some context on our setup: https://wikitech.wikimedia.org/wiki/RPKI [15:54:29] XioNox: thanks! [15:54:51] elukey: is the solution to grow the storage on the host? [15:55:11] or use xfs :-P [15:55:12] * volans hides [15:55:20] they only have 10G each https://netbox.wikimedia.org/virtualization/virtual-machines/?q=rpki [15:56:17] I don't think that we can grow the root partition but only to add another virtual disk, and then mount a slice the fs on top of it (this is my understanding) [15:56:23] its worth noting according to clodflare there are currently 250k roa's so 290k files dosn;t seem that abnormal [15:56:30] https://rpki.cloudflare.com/ [15:57:50] oh, it's not disk space but inodes we're running low of [15:57:55] we could tweak the ext4 partition with -N number-of-inodes (I think it can be done only at format time) [15:57:59] how do we increase that? [15:58:16] XioNoX: the use xfs half-joke was for that ;) [15:58:34] was hoping alex was still around to get his reaction [15:59:02] if we double the partition though we get double the inodes I think [15:59:30] volans: how do we double the root partition though? [15:59:53] can it be done in ganeti with the VM stopped? 
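Note on the inode discussion above ([15:57:50] onward): a minimal Python sketch of how inode usage on a mount point could be checked, roughly what `df -i` reports. The helper names are hypothetical and not part of any tooling mentioned in the channel; `os.statvfs` is in the standard library and is Unix-only.

```python
import os


def inode_usage_percent(total_inodes: int, free_inodes: int) -> float:
    """Return the percentage of inodes in use.

    Filesystems that do not report an inode count (total of 0) yield 0.0.
    """
    if total_inodes <= 0:
        return 0.0
    return 100.0 * (total_inodes - free_inodes) / total_inodes


def inode_usage_for_path(path: str) -> float:
    """Read inode totals for the filesystem containing `path` (hypothetical helper)."""
    st = os.statvfs(path)
    # f_files is the total number of inodes, f_ffree the number still free.
    return inode_usage_percent(st.f_files, st.f_ffree)


if __name__ == "__main__":
    print(f"/ inode usage: {inode_usage_for_path('/'):.1f}%")
```

Run on a host like rpki1001, this would flag the condition described above: plenty of free bytes but almost no free inodes.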
[16:00:03] I know that we can grow the virtual disk [16:00:17] elukey: we create a new one [16:00:32] okok then my understanding is right [16:00:32] rpki doesn't have any source of truth data IIRC [16:00:36] 7elyes was just about to say we can add another virtual disk, format that specificly to have a high inode count and mount it to /var/lib/routinator/repository [16:00:38] and we mount on top of it say /var [16:00:39] it's all downloaded/derived [16:01:08] yes yes as I suggested above jbond42, +1 for me [16:02:07] volans: i would do it specificly in /var/lib/routinator/repository though and not /var. as /var/log,cache,etc can have some bigger files which will be less compatible with a disk formated for a hugh inode count [16:02:14] sorry that was for elukey [16:02:15] https://wikitech.wikimedia.org/wiki/Ganeti#Adding_a_disk [16:02:29] jbond42: yes yes +1 for more specific [16:03:19] how much space? 50G to be sure? [16:03:42] volans: re source of truth yes thats correct there is none, there is not even a trusted root anchor [16:04:02] elukey: in terms of GB it's small [16:04:03] XioNoX: I assume that stopping the VM + some downtime if needed is ok right? [16:04:11] elukey: yep [16:04:15] volans: small? [16:04:18] we have 8 now :D [16:04:27] 8.8G 5.1G 3.4G 61% [16:05:01] 1.9G /var/lib/routinator/repository [16:05:04] sure I can read but I am not getting your reasoning :) [16:05:19] that what we need is a partition with a lot of inodes, not a lot of space [16:05:27] if we're adding it [16:05:39] (03CR) 10Michael Große: "Looks good, but one of the lines slipped above the comment describing it" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/688178 (https://phabricator.wikimedia.org/T280779) (owner: 10Itamar Givon) [16:06:10] my simplest way to solve it is replace the VM with another just double the space. 
If instead you want to add a dedicated disk that's another solution too, but then I would add a small one with a tweaked -N option for ext4 to increase the number of inodes [16:06:36] probably 10GB, no more [16:07:14] sure we can use the -N option, but we need to document it etc.., I thought that having a 50G partition was simpler, but probably a waste [16:07:14] i would add a dedicated disk, the ROA db has doubled in the last year so imo keeping the repo on /root is just punting the problem to later [16:07:23] adding a new disk is better in my opinion [16:07:32] quicker and we don't need to create another vm [16:08:28] ack to that and yes we can expect the ROA db to grow over time given that more and more starts to block RPKI invalid [16:08:59] ok so things to decide [16:09:10] 1) size for the new disk [16:09:15] 2) who is going to add it :D [16:09:27] 3) who is going to format it with -N and with what value [16:09:32] :D [16:10:19] for 1) volans we can put 20G with extra -N power in my opinion, 10G seems tiny, just some room for growth in case bigger files will be store [16:10:34] my 2c otherwise people can downvote me :P [16:10:49] I know that Riccardo is more authoritative and shiny [16:10:58] :D [16:11:37] +1 to all, 20GB seems a good compromise [16:11:44] it mgith even work without -N [16:12:01] if with 8.8GB we have ~500k inodes we can expect 1M+ with 20GB [16:12:11] RECOVERY - Disk space on backup2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=backup2002&var-datasource=codfw+prometheus/ops [16:14:19] in relation to 3) the max file in that dir is 3 files at 61MB, ~20 file between 1 -> 20MB, every thing elses is < 1MB so i think its with the vast majority using 20-32k and lots of folder entries taking up 4 k each [16:15:33] hello, new clinic duty person this week, could you take over from me by updating the topic ? 
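Note on the estimate at [16:12:01]: ext4 fixes the inode count at format time from a bytes-per-inode ratio, so the count scales roughly linearly with partition size. The sketch below assumes a ratio of 16384 bytes per inode, the value commonly shipped as the default in mke2fs.conf (an assumption; the ratio actually used on rpki1001 was not stated in the channel), and uses 4096 as an illustrative denser ratio. Real mke2fs rounds per block group, so these figures are approximate.

```python
GIB = 1024 ** 3


def estimated_inodes(size_bytes: int, bytes_per_inode: int) -> int:
    """Approximate the inode count a filesystem of this size would get."""
    if bytes_per_inode <= 0:
        raise ValueError("bytes_per_inode must be positive")
    return size_bytes // bytes_per_inode


if __name__ == "__main__":
    # Assumed default ratio of 16384 bytes per inode on a 20 GiB disk.
    print(estimated_inodes(20 * GIB, 16384))  # 1310720
    # Illustrative denser ratio of 4096 bytes per inode on the same disk.
    print(estimated_inodes(20 * GIB, 4096))   # 5242880
```

Under the assumed default ratio, a 20 GiB disk lands at about 1.3M inodes, in line with the "1M+" guess; a denser ratio multiplies that without adding space.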
[16:16:18] gnt-instance modify --disk add:size=20g rpki1001.eqiad.wmnet - in theory this should create /dev/vdb right? [16:16:22] volans, jbond42 --^ [16:16:43] then a gnt-instance reboot should make it appear [16:16:51] RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:16:52] yes, that should work [16:16:58] ah hello mutante :) [16:17:05] you will have to create a filesystem on it and mount it yourself [16:17:11] yep yep [16:17:13] elukey: never done it but seems reasonable [16:17:28] elukey: one sec [16:17:37] (03CR) 10Herron: [C: 03+2] arclamp/xenon: point all hosts to eqiad (mwlog1002) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/688281 (https://phabricator.wikimedia.org/T224565) (owner: 10Herron) [16:17:49] I'm wondering why rpki2001 is only at 63% [16:17:52] of inodes used [16:18:28] (03Merged) 10jenkins-bot: arclamp/xenon: point all hosts to eqiad (mwlog1002) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/688281 (https://phabricator.wikimedia.org/T224565) (owner: 10Herron) [16:19:56] good point [16:20:47] elukey: in relation to formating the fs i think that `mkfs.ext4 -T small` should picks what look like good values blocksize = 1024, inode_size 128, inode_ratio = 4096` [16:21:09] nice [16:22:04] elukey: beware, if you create a new disk and then reboot the VM, it might not come back. 
If that happens the reason is that the devices have been "renumbered" and the older disk suddenly has a new device name :p [16:22:17] fix is to login at console and ~ "replaced ens5 with ens6 in /etc/network/interfaces" [16:22:31] and then all should be ok again [16:23:02] ah the interface comes back with a different name, lovely [16:23:17] afair it has always been ens5 -ens 6 [16:23:18] so to why the difference muy random guess is that rpki1001 had some corruption, and rpki2001 dosn;t, did anyone try XioNoX suggestion to rm the current rpo and let routanatir re download it [16:23:36] !log herron@deploy1002 Synchronized wmf-config/ProductionServices.php: Config: [[gerrit:688281|arclamp/xenon: point all hosts to eqiad (mwlog1002) (T224565)]] (duration: 00m 59s) [16:23:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:41] T224565: Migrate mwlog/udp2log servers to Buster - https://phabricator.wikimedia.org/T224565 [16:23:42] jbond42: we can surely try [16:24:27] jbond42: with current repo do you mean whatever it is under ../repository/.. ? [16:25:00] elukey: yes but it would make me more conftabl if XioNoX could confirm that :) [16:25:19] looking [16:25:53] /var/lib/routinator/repository yep, routinator should recreate it (it will take some time though) [16:26:11] as long as 1 of the two RPKI host is up, go for it [16:26:52] ack thanks ill go ahead [16:27:43] !log rm -r /var/lib/routinator/repository and rebuilding repo [16:27:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:43] thanks jbond42 [16:33:12] the inodes looks good now [16:33:26] (03CR) 10Volans: [C: 03+1] "LGTM!" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/684819 (owner: 10Muehlenhoff) [16:36:03] RECOVERY - rpki grafana alert on alert1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. 
https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [16:38:10] elukey: yes i think it has finished rebuilding its repo now as well, it is at 1.9 GB which is simlar enough to rpki2001 and also hasn't grown in some time as such i think its finished rebuilding [16:38:49] before deleteing the repo was at ~1.9 GB so it looks like some old data was removed. [16:39:34] nice! [16:40:57] RECOVERY - Disk space on rpki1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=rpki1001&var-datasource=eqiad+prometheus/ops [16:41:50] 10SRE, 10netops: routinator: creat gabage collection job - https://phabricator.wikimedia.org/T282469 (10jbond) p:05Triage→03Low [16:42:02] elukey: i have created ^^^ to explore some type of better GC job [16:43:13] thanks! [16:44:34] (03CR) 10Bstorm: wikireplicas: cut over the last IPs to the new cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/685947 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [16:44:44] (03PS1) 10Jcrespo: dbbackups: Split dbbackups::content into 2 roles, client and storage [puppet] - 10https://gerrit.wikimedia.org/r/688353 (https://phabricator.wikimedia.org/T282249) [16:48:52] (03PS2) 10Jcrespo: dbbackups: Split dbbackups::content into 2 roles, client and storage [puppet] - 10https://gerrit.wikimedia.org/r/688353 (https://phabricator.wikimedia.org/T282249) [16:54:25] 10SRE, 10GitLab (Initialization), 10Patch-For-Review, 10Release-Engineering-Team (Doing), 10User-brennen: SSH Access of Git data in GitLab - https://phabricator.wikimedia.org/T276148 (10Sergey.Trofimovsky.SF) These were all addressed (please review). Please note, that some options are overridden in `host... 
[16:54:29] (03PS1) 10Legoktm: mailman3: Enable debug runner logs [puppet] - 10https://gerrit.wikimedia.org/r/688354 [16:54:58] (03PS2) 10Legoktm: mailman3: Enable debug runner logs [puppet] - 10https://gerrit.wikimedia.org/r/688354 (https://phabricator.wikimedia.org/T282348) [16:56:20] (03PS3) 10Jcrespo: dbbackups: Split dbbackups::content into 2 roles, client and storage [puppet] - 10https://gerrit.wikimedia.org/r/688353 (https://phabricator.wikimedia.org/T282249) [16:56:31] (03CR) 10BryanDavis: wikireplicas: cut over the last IPs to the new cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/685947 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [17:00:04] RoanKattouw, Niharika, and Urbanecm: May I have your attention please! Morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210510T1800) [17:00:04] James_F: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [17:03:28] (03PS4) 10Jcrespo: dbbackups: Split dbbackups::content into 2 roles, client and storage [puppet] - 10https://gerrit.wikimedia.org/r/688353 (https://phabricator.wikimedia.org/T282249) [17:05:15] (03CR) 10Bstorm: [C: 03+2] "Ok, I'm going to merge this now so that if there's follow-up and fixing, there's time to do that." 
[puppet] - 10https://gerrit.wikimedia.org/r/685947 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [17:07:25] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:13:55] ^ happens with seemingly random hosts in codfw, only mgmt [17:14:03] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 216478680 and 4 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:14:05] will heal itself usually [17:14:10] suspects the mgmt switch [17:16:31] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 481232 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:17:35] (03PS1) 10Cathal Mooney: Added my user details to data.yaml and username to the ops group there. [puppet] - 10https://gerrit.wikimedia.org/r/688355 [17:17:55] mutante: I was curious if it was always the same rack, but from https://icinga.wikimedia.org/cgi-bin/icinga/history.cgi?servicegroup=mgmt it looks like not [17:18:04] (03CR) 10jerkins-bot: [V: 04-1] Added my user details to data.yaml and username to the ops group there. [puppet] - 10https://gerrit.wikimedia.org/r/688355 (owner: 10Cathal Mooney) [17:18:09] not even the same row [17:18:22] (03CR) 10Dzahn: "When you introduce new secrets please don't forget to also add fake secrets in labs/private. This broke Phab cloud testing setup and we di" [puppet] - 10https://gerrit.wikimedia.org/r/620822 (https://phabricator.wikimedia.org/T146055) (owner: 1020after4) [17:19:13] rzl: the only pattern I see is "some random mgmt host in codfw". 
never eqiad, always just mgmt [17:19:21] and then always comes back a short while later [17:19:45] nod [17:19:50] like the mgmt switch there is sometimes just forgetting a port and then remembers it [17:21:03] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Split dbbackups::content into 2 roles, client and storage [puppet] - 10https://gerrit.wikimedia.org/r/688353 (https://phabricator.wikimedia.org/T282249) (owner: 10Jcrespo) [17:27:41] (03PS1) 10Cathal Mooney: Added my user details to data.yaml, hopefully without the syntax error this time. [puppet] - 10https://gerrit.wikimedia.org/r/688357 [17:28:23] (03CR) 10jerkins-bot: [V: 04-1] Added my user details to data.yaml, hopefully without the syntax error this time. [puppet] - 10https://gerrit.wikimedia.org/r/688357 (owner: 10Cathal Mooney) [17:28:23] James_F: do you know who will be doing the actual deploy for the upcoming backport window? [17:29:23] I'm wondering whether it'd be OK to add another change to the window, to enable ChessBrowser on Beta. Are the config changes you're pushing out at all risky? [17:32:16] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] "Removing jenkins, as it is downvoting the change for stuff unrelated to this change (the tox file change triggers a tox run in all files, " [puppet] - 10https://gerrit.wikimedia.org/r/686457 (https://phabricator.wikimedia.org/T270704) (owner: 10Arturo Borrero Gonzalez) [17:32:23] I'll just move it to the evening window instead. 
[17:32:39] (03PS3) 10Arturo Borrero Gonzalez: openstack: cleanup neutron hacks [puppet] - 10https://gerrit.wikimedia.org/r/686457 (https://phabricator.wikimedia.org/T270704) [17:32:47] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] openstack: cleanup neutron hacks [puppet] - 10https://gerrit.wikimedia.org/r/686457 (https://phabricator.wikimedia.org/T270704) (owner: 10Arturo Borrero Gonzalez) [17:36:05] (03PS1) 10Nikki Nikkhoui: Initial image-suggestion-api helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/688358 (https://phabricator.wikimedia.org/T281257) [17:36:08] (03PS1) 10Arturo Borrero Gonzalez: hieradata: drop unused neutron configuration for dmz_cidr [puppet] - 10https://gerrit.wikimedia.org/r/688359 (https://phabricator.wikimedia.org/T270704) [17:36:22] (03Abandoned) 10Cathal Mooney: Added my user details to data.yaml and username to the ops group there. [puppet] - 10https://gerrit.wikimedia.org/r/688355 (owner: 10Cathal Mooney) [17:36:25] puppet failures on backup1001 is me working on T282249 [17:36:26] T282249: Setup backup1003 and backup2003 as the storage location for es bacula backups - https://phabricator.wikimedia.org/T282249 [17:37:23] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] hieradata: drop unused neutron configuration for dmz_cidr [puppet] - 10https://gerrit.wikimedia.org/r/688359 (https://phabricator.wikimedia.org/T270704) (owner: 10Arturo Borrero Gonzalez) [17:38:48] 10SRE, 10LDAP-Access-Requests: Grant access to LDAP/WMF for SGrabarczuk - https://phabricator.wikimedia.org/T282475 (10sgrabarczuk) [17:42:10] (03PS1) 10Majavah: toolforge: Allow passing host port for k8s ingress [puppet] - 10https://gerrit.wikimedia.org/r/688361 [17:43:36] (03CR) 10jerkins-bot: [V: 04-1] toolforge: Allow passing host port for k8s ingress [puppet] - 10https://gerrit.wikimedia.org/r/688361 (owner: 10Majavah) [17:44:09] (03PS2) 10Cathal Mooney: Added my user details to data.yaml, hopefully without the syntax error this time. 
[puppet] - 10https://gerrit.wikimedia.org/r/688357 [17:44:34] (03PS2) 10Majavah: toolforge: Allow passing host port for k8s ingress [puppet] - 10https://gerrit.wikimedia.org/r/688361 (https://phabricator.wikimedia.org/T264221) [17:44:51] (03CR) 10jerkins-bot: [V: 04-1] Added my user details to data.yaml, hopefully without the syntax error this time. [puppet] - 10https://gerrit.wikimedia.org/r/688357 (owner: 10Cathal Mooney) [17:47:10] (03PS3) 10Majavah: toolforge: Allow passing host port for k8s ingress [puppet] - 10https://gerrit.wikimedia.org/r/688361 (https://phabricator.wikimedia.org/T264221) [17:47:31] puppet on backup1001 should be ok now, no more errors expected [17:47:56] (except for some new backup jobs being empty) [17:48:30] (those will only create warnings) [17:49:48] (03PS3) 10Cathal Mooney: Added my user details to data.yaml, hopefully without the syntax error this time. [puppet] - 10https://gerrit.wikimedia.org/r/688357 [17:49:51] (03PS4) 10Majavah: toolforge: Allow passing host port for k8s ingress [puppet] - 10https://gerrit.wikimedia.org/r/688361 (https://phabricator.wikimedia.org/T264221) [17:51:31] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: factorize NAT template file into base profile [puppet] - 10https://gerrit.wikimedia.org/r/688365 (https://phabricator.wikimedia.org/T270704) [17:51:33] (03CR) 10jerkins-bot: [V: 04-1] toolforge: Allow passing host port for k8s ingress [puppet] - 10https://gerrit.wikimedia.org/r/688361 (https://phabricator.wikimedia.org/T264221) (owner: 10Majavah) [17:51:35] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: cleanup unused all_phy_nics parameter [puppet] - 10https://gerrit.wikimedia.org/r/688366 (https://phabricator.wikimedia.org/T270704) [17:51:37] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: don't use concatenation with CIDR [puppet] - 10https://gerrit.wikimedia.org/r/688367 (https://phabricator.wikimedia.org/T275129) [17:52:13] (03PS5) 10Majavah: toolforge: Allow passing host port for k8s ingress [puppet] - 
10https://gerrit.wikimedia.org/r/688361 (https://phabricator.wikimedia.org/T264221) [17:53:07] (03PS6) 10Majavah: toolforge: Allow passing host port for k8s ingress [puppet] - 10https://gerrit.wikimedia.org/r/688361 (https://phabricator.wikimedia.org/T264221) [17:54:18] (03PS4) 10Cathal Mooney: Added my user details. [puppet] - 10https://gerrit.wikimedia.org/r/688357 [17:57:28] (03PS1) 10Bstorm: wikireplicas: remove the old wikireplicas role from the proxy [puppet] - 10https://gerrit.wikimedia.org/r/688368 (https://phabricator.wikimedia.org/T260389) [17:57:35] (03CR) 10Ayounsi: [C: 03+2] Added my user details. [puppet] - 10https://gerrit.wikimedia.org/r/688357 (owner: 10Cathal Mooney) [17:58:51] (03PS2) 10Arturo Borrero Gonzalez: cloudgw: factorize NAT template file into base profile [puppet] - 10https://gerrit.wikimedia.org/r/688365 (https://phabricator.wikimedia.org/T270704) [17:58:53] (03PS2) 10Arturo Borrero Gonzalez: cloudgw: cleanup unused all_phy_nics parameter [puppet] - 10https://gerrit.wikimedia.org/r/688366 (https://phabricator.wikimedia.org/T270704) [17:58:55] (03PS2) 10Arturo Borrero Gonzalez: cloudgw: don't use concatenation with CIDR [puppet] - 10https://gerrit.wikimedia.org/r/688367 (https://phabricator.wikimedia.org/T275129) [18:00:05] chrisalbon and accraze: #bothumor I � Unicode. All rise for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210510T2000). [18:01:49] (03PS3) 10Arturo Borrero Gonzalez: cloudgw: factorize NAT template file into base profile [puppet] - 10https://gerrit.wikimedia.org/r/688365 (https://phabricator.wikimedia.org/T270704) [18:03:22] PROBLEM - Check systemd state on acmechief1001 is CRITICAL: CRITICAL - degraded: The following units failed: acme-chief.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:03:25] ori: I was going to do it. Want me to add yours? [18:04:17] Oh, huh, I thought that window was now. 
But jouncebot thinks it was an hour ago? [18:04:52] PROBLEM - Ensure acme-chief-backend is running only in the active node on acmechief1001 is CRITICAL: PROCS CRITICAL: 0 processes with args acme-chief-backend https://wikitech.wikimedia.org/wiki/Acme-chief [18:04:56] it announces the next window for some reason [18:05:08] Ah. Timezone issue? [18:05:20] maybe [18:05:41] OK, I'll do the config patches now. [18:05:49] (03PS2) 10Jforrester: FlaggedRevs: Stop setting wgFlaggedRevsWhitelist, now ignored [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673306 [18:05:53] jouncebot: now [18:05:53] For the next 0 hour(s) and 54 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210510T1700) [18:05:53] For the next 1 hour(s) and 54 minute(s): Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210510T1800) [18:06:00] but probably not. it seems to report the next-window [18:06:06] Oh. I thinks they overlap? [18:06:25] (03CR) 10Jforrester: [C: 03+2] FlaggedRevs: Stop setting wgFlaggedRevsWhitelist, now ignored [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673306 (owner: 10Jforrester) [18:06:31] Krinkle wrote the newest parser code, so maybe he knows? :D [18:06:37] no, the wdqs window was an hour ago [18:06:49] and the backport window stops in an hour, not two as it says [18:06:53] Indeed, it's all entirely wrong. [18:07:00] * James_F sighs in timezones. [18:07:53] I wrote parser code? 
[18:08:13] !log imported new mailman3, flufl.bounce packages to apt.wm.o [18:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:42] (03Merged) 10jenkins-bot: FlaggedRevs: Stop setting wgFlaggedRevsWhitelist, now ignored [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673306 (owner: 10Jforrester) [18:08:55] (03PS6) 10Jforrester: wgAbuseFilterAflFilterMigrationStage: Stop setting, COMPAT_NEW is default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657697 (https://phabricator.wikimedia.org/T269712) [18:08:59] (03CR) 10Jforrester: [C: 03+2] wgAbuseFilterAflFilterMigrationStage: Stop setting, COMPAT_NEW is default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657697 (https://phabricator.wikimedia.org/T269712) (owner: 10Jforrester) [18:10:00] (03PS4) 10Arturo Borrero Gonzalez: cloudgw: factorize NAT template file into base profile [puppet] - 10https://gerrit.wikimedia.org/r/688365 (https://phabricator.wikimedia.org/T270704) [18:10:07] !log jforrester@deploy1002 Synchronized wmf-config/flaggedrevs.php: Config: [[gerrit:673306|FlaggedRevs: Stop setting wgFlaggedRevsWhitelist, now ignored]] (duration: 00m 57s) [18:10:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:09] (03Merged) 10jenkins-bot: wgAbuseFilterAflFilterMigrationStage: Stop setting, COMPAT_NEW is default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657697 (https://phabricator.wikimedia.org/T269712) (owner: 10Jforrester) [18:13:00] !log jforrester@deploy1002 Synchronized wmf-config: Config: [[gerrit:657697|wgAbuseFilterAflFilterMigrationStage: Stop setting, COMPAT_NEW is default (T269712)]] (duration: 00m 57s) [18:13:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:05] T269712: Migrate afl_filter to afl_filter_id and afl_global - https://phabricator.wikimedia.org/T269712 [18:13:08] (03PS4) 10Jforrester: [wikitech] Enable VE desktop section edit links [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/679940 (https://phabricator.wikimedia.org/T280291) [18:13:17] (03CR) 10Jforrester: [C: 03+2] [wikitech] Enable VE desktop section edit links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679940 (https://phabricator.wikimedia.org/T280291) (owner: 10Jforrester) [18:14:16] (03Merged) 10jenkins-bot: [wikitech] Enable VE desktop section edit links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679940 (https://phabricator.wikimedia.org/T280291) (owner: 10Jforrester) [18:14:23] (03PS5) 10Arturo Borrero Gonzalez: cloudgw: factorize NAT template file into base profile [puppet] - 10https://gerrit.wikimedia.org/r/688365 (https://phabricator.wikimedia.org/T270704) [18:17:51] (03PS6) 10Arturo Borrero Gonzalez: cloudgw: factorize NAT template file into base profile [puppet] - 10https://gerrit.wikimedia.org/r/688365 (https://phabricator.wikimedia.org/T270704) [18:18:28] 10SRE, 10GitLab (Initialization), 10Release-Engineering-Team (Doing), 10User-brennen: Define auth strategy for GitLab - https://phabricator.wikimedia.org/T274461 (10Sergey.Trofimovsky.SF) >>! In T274461#7073817, @jbond wrote: >> - SSH keys cannot be imported from SSO, regardless of SSO type, CAS or SA... [18:18:41] 10SRE, 10GitLab (Initialization), 10Release-Engineering-Team (Doing), 10User-brennen: Define auth strategy for GitLab - https://phabricator.wikimedia.org/T274461 (10Sergey.Trofimovsky.SF) Anyway, given that the decision to stick with SSO has been made, we're going with CAS, having following limitations in... 
[18:18:43] !log jforrester@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:679940|[wikitech] Enable VE desktop section edit links (T280291)]] (duration: 00m 57s) [18:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:47] T280291: Enable VE desktop section edit links on wikitech - https://phabricator.wikimedia.org/T280291 [18:19:58] (03PS1) 10Ayounsi: Add cmooney [homer/public] - 10https://gerrit.wikimedia.org/r/688371 [18:20:46] (03PS3) 10Jforrester: Disable LocalisationUpdate, part I [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677325 (https://phabricator.wikimedia.org/T158360) [18:21:01] (03CR) 10Ayounsi: [C: 03+2] Add cmooney [homer/public] - 10https://gerrit.wikimedia.org/r/688371 (owner: 10Ayounsi) [18:21:30] (03CR) 10Jforrester: [C: 03+2] "OK, let's do this." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677325 (https://phabricator.wikimedia.org/T158360) (owner: 10Jforrester) [18:21:40] (03Merged) 10jenkins-bot: Add cmooney [homer/public] - 10https://gerrit.wikimedia.org/r/688371 (owner: 10Ayounsi) [18:22:18] (03Merged) 10jenkins-bot: Disable LocalisationUpdate, part I [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677325 (https://phabricator.wikimedia.org/T158360) (owner: 10Jforrester) [18:22:20] (03PS7) 10Arturo Borrero Gonzalez: cloudgw: factorize NAT template file into base profile [puppet] - 10https://gerrit.wikimedia.org/r/688365 (https://phabricator.wikimedia.org/T270704) [18:24:04] (03PS8) 10Arturo Borrero Gonzalez: cloudgw: factorize NAT template file into base profile [puppet] - 10https://gerrit.wikimedia.org/r/688365 (https://phabricator.wikimedia.org/T270704) [18:24:17] !log add cmooney to all network devices [18:24:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:49] 10SRE, 10DBA, 10Wikimedia-Mailing-lists: Delete lists-next.wikimedia.org - https://phabricator.wikimedia.org/T281548 (10LSobanski) Actually, @Kormat is out 
as well so it'll have to either happen by Wednesday or next week. [18:25:27] !log jforrester@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:677325|Disable LocalisationUpdate, part I (T158360)]] (duration: 00m 58s) [18:25:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:31] T158360: RFC: Reevaluate LocalisationUpdate extension for WMF - https://phabricator.wikimedia.org/T158360 [18:25:48] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1003/29481/" [puppet] - 10https://gerrit.wikimedia.org/r/688365 (https://phabricator.wikimedia.org/T270704) (owner: 10Arturo Borrero Gonzalez) [18:26:45] (03PS5) 10Jforrester: loginwiki: Allow users to mark Notifications as read [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632598 (https://phabricator.wikimedia.org/T264834) [18:29:28] (03CR) 10Jforrester: [C: 03+2] loginwiki: Allow users to mark Notifications as read [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632598 (https://phabricator.wikimedia.org/T264834) (owner: 10Jforrester) [18:30:29] (03Merged) 10jenkins-bot: loginwiki: Allow users to mark Notifications as read [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632598 (https://phabricator.wikimedia.org/T264834) (owner: 10Jforrester) [18:31:08] (03PS3) 10Arturo Borrero Gonzalez: cloudgw: cleanup unused all_phy_nics parameter [puppet] - 10https://gerrit.wikimedia.org/r/688366 (https://phabricator.wikimedia.org/T270704) [18:32:42] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1002/29482/" [puppet] - 10https://gerrit.wikimedia.org/r/688366 (https://phabricator.wikimedia.org/T270704) (owner: 10Arturo Borrero Gonzalez) [18:34:01] Does someone has time to deploy ptwikis 20th anniversary logo? 
[18:34:06] !log jforrester@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:632598|loginwiki: Allow users to mark Notifications as read (T264834)]] (duration: 00m 57s) [18:34:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:10] T264834: Users can't mark their Notifications from loginwiki as read because they don't have the `writeapi` permission - https://phabricator.wikimedia.org/T264834 [18:34:13] (03PS3) 10Arturo Borrero Gonzalez: cloudgw: don't use concatenation with CIDR [puppet] - 10https://gerrit.wikimedia.org/r/688367 (https://phabricator.wikimedia.org/T275129) [18:34:26] Zabe: I was going to schedule that for the later one tonight, as otherwise it'd be too early for Europeans. [18:34:54] okay [18:36:32] Zabe: Added to https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210510T2300 [18:36:38] (Deploy window closed.) [18:36:56] thanks [18:37:51] (03PS4) 10Arturo Borrero Gonzalez: cloudgw: don't use concatenation with CIDR [puppet] - 10https://gerrit.wikimedia.org/r/688367 (https://phabricator.wikimedia.org/T275129) [18:42:16] 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install pc2011-pc2014 - https://phabricator.wikimedia.org/T282482 (10RobH) [18:42:36] 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install pc2011-pc2014 - https://phabricator.wikimedia.org/T282482 (10RobH) [18:44:41] 10SRE, 10GitLab (Initialization), 10Release-Engineering-Team (Doing), 10User-brennen: Define auth strategy for GitLab - https://phabricator.wikimedia.org/T274461 (10Sergey.Trofimovsky.SF) Requesting some more information on the current LDAP schema, which attributes can and should be used for key mapping (h... 
[18:53:06] 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install pc1011-pc1014 - https://phabricator.wikimedia.org/T282484 (10RobH) [18:53:36] 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install pc1011-pc1014 - https://phabricator.wikimedia.org/T282484 (10RobH) [18:55:08] RECOVERY - Ensure acme-chief-backend is running only in the active node on acmechief1001 is OK: PROCS OK: 1 process with args acme-chief-backend https://wikitech.wikimedia.org/wiki/Acme-chief [18:55:14] RECOVERY - Check systemd state on acmechief1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:00:55] 10SRE, 10DBA, 10Wikimedia-Mailing-lists: Delete lists-next.wikimedia.org - https://phabricator.wikimedia.org/T281548 (10Legoktm) We don't need a backup, it's all just emails of people saying "Test" etc. :P, I'll delete the VM tomorrow (Pacific Time) so it should be ready for DBA deletion on Wednesday :) [19:02:16] PROBLEM - Ensure acme-chief-backend is running only in the active node on acmechief1001 is CRITICAL: PROCS CRITICAL: 0 processes with args acme-chief-backend https://wikitech.wikimedia.org/wiki/Acme-chief [19:02:22] PROBLEM - Check systemd state on acmechief1001 is CRITICAL: CRITICAL - degraded: The following units failed: acme-chief.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:02:25] (03PS1) 10Huji: Enable ShortDescriptions on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/688381 (https://phabricator.wikimedia.org/T282486) [19:02:32] (03PS1) 10Legoktm: mailman3: Install flufl.bounce from our component too [puppet] - 10https://gerrit.wikimedia.org/r/688382 [19:02:34] (03PS1) 10Legoktm: backup: Exclude /var/lib/mailman3/queue [puppet] - 10https://gerrit.wikimedia.org/r/688383 [19:03:18] * legoktm looks at acme-chief [19:04:30] https://letsencrypt.status.io/ "Planned Maintenance In Progress" [19:05:06] ah, good find [19:05:39] it will need a manual restart 
once fixed since it exceeded the auto restart count [19:07:22] https://community.letsencrypt.org/t/is-lets-encrypt-currently-down-i-cant-issue-a-lets-encrypt-cert/151440/14 says we can subscribe to their status page, maybe for maint-announce? [19:08:56] that would make sense to me, scheduled maintenance that affects us like others we put on that calendar [19:09:14] RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:09:42] also since the "notes_url" for the check is https://wikitech.wikimedia.org/wiki/Acme-chief maybe we should link the status page there [19:09:53] rzl: ^ btw, there is that recovery as expected [19:10:02] it's all happening by itself [19:10:56] James_F: oo I'd love that; is it too late? [19:11:14] (yes, the window closed. oh well.) [19:11:19] sorry I just saw this [19:12:06] (03PS7) 10Herron: mail: move default mail relay config out of standard module [puppet] - 10https://gerrit.wikimedia.org/r/686633 (https://phabricator.wikimedia.org/T232343) [19:13:04] (03CR) 10Herron: mail: move default mail relay config out of standard module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/686633 (https://phabricator.wikimedia.org/T232343) (owner: 10Herron) [19:16:04] mutante: yeah, weird [19:16:36] https://wikitech.wikimedia.org/w/index.php?title=Acme-chief&type=revision&diff=1911480&oldid=1867604 [19:19:38] cool, legoktm [19:19:42] so related question but a bit off-topic I guess: when does a critical notification from Icinga turn into a VictorOps page? [19:20:03] is there a length of time that needs to pass for the page to be active? [19:20:39] (03CR) 10Nskaggs: "Yay for eliminating technical debt. Thank you for everyone's work this past year in making this possible!" 
[puppet] - 10https://gerrit.wikimedia.org/r/686457 (https://phabricator.wikimedia.org/T270704) (owner: 10Arturo Borrero Gonzalez) [19:21:18] 10SRE, 10Traffic, 10netops: TATA SKY Broadband (AS134674) issues with connecting to upload.wikimedia.org - https://phabricator.wikimedia.org/T275234 (10ssingh) @ayounsi, @Aklapper: I forgot to update this ticket, apologies! Last I checked, all reports were in the green and there wasn't any debugging informat... [19:21:36] they go out at the same time sukhe [19:21:46] sukhe: should be when 2 requirements are met. first it has "critical => true" in the service check definition in puppet (should be called paging => true, not related to Icinga CRIT status) AND there is no do_paging: false in Hiera to turn it off again [19:22:03] herron: mutante: thank you! [19:24:39] (03PS1) 10Herron: profile::mail: add mta hiera option profile::mail::mta [puppet] - 10https://gerrit.wikimedia.org/r/688391 (https://phabricator.wikimedia.org/T232343) [19:26:34] (03CR) 10Herron: "mostly thinking about how to approach transition of host MTA from exim to postix with this, please lmk what you think" [puppet] - 10https://gerrit.wikimedia.org/r/688391 (https://phabricator.wikimedia.org/T232343) (owner: 10Herron) [19:32:31] cdanis: would you mind updating the topic, Clinic Duty from me to XioNoX [19:33:04] (03PS1) 10RLazarus: openldap: Only run cross-validate-accounts on weekdays. [puppet] - 10https://gerrit.wikimedia.org/r/688394 [19:34:59] (03CR) 10RLazarus: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29485/console" [puppet] - 10https://gerrit.wikimedia.org/r/688394 (owner: 10RLazarus) [19:35:17] mutante: {{done}} [19:35:38] cdanis: thank you :) [19:46:39] (03CR) 10Herron: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/688333 (https://phabricator.wikimedia.org/T232343) (owner: 10Jbond) [20:00:04] Reedy and sbassett: My dear minions, it's time we take the moon! Just kidding. 
Time for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210510T2100). [20:00:28] o_0 [20:00:35] Isn't that an hour earlier than usual? [20:00:52] Reedy: the bot is off by one [20:01:03] so yes then :P [20:01:18] it's using 2100 BST, not 2100 UTC (which is currently 2000) [20:03:40] Reedy: I mean, nothing else is happening so feel free… [20:08:42] 10SRE, 10Traffic, 10netops: TATA SKY Broadband (AS134674) issues with connecting to upload.wikimedia.org - https://phabricator.wikimedia.org/T275234 (10CDanis) There is evidence in our [[ https://wikitech.wikimedia.org/wiki/Network_Error_Logging | NEL ]] data that suggests this problem still exists. This pa... [20:12:46] (03CR) 10Legoktm: "I didn't include Mailman2 because we already migrated all the no-archive lists off it" [puppet] - 10https://gerrit.wikimedia.org/r/688383 (owner: 10Legoktm) [20:13:50] (03CR) 10Volans: [C: 03+1] "LGTM, bonus point if you skip also WMF holidays :-P Real bonus point if you migrate it to a systemd timer." [puppet] - 10https://gerrit.wikimedia.org/r/688394 (owner: 10RLazarus) [20:16:40] the LE maintenance is over but acme-chief is still crashing, I'll file a task [20:17:29] thx legoktm [20:18:09] I'll take care of that tomorrow EU morning [20:18:35] 10SRE, 10Acme-chief: acme-chief is down: ValueError: OCSP response status is not successful so the property has no value - https://phabricator.wikimedia.org/T282490 (10Legoktm) p:05Triage→03Unbreak! [20:18:54] Hmmm [20:19:05] should it be UBN then or just "high"? 
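The one-hour-early announcements explained at 20:01:18 (a calendar entry of 2100 read as BST rather than UTC) can be illustrated with a short Python sketch. This is not jouncebot's code: `window_start_utc` and the fixed-offset `BST` constant are hypothetical names made up for the illustration, and the only facts taken from the log are the 2100 calendar entry, the date in the deploycal item, and the BST/UTC mix-up.

```python
from datetime import date, datetime, timedelta, timezone

# Assumption for this sketch: BST is modelled as a fixed UTC+1 offset,
# which avoids depending on a system time zone database.
BST = timezone(timedelta(hours=1), "BST")


def window_start_utc(day: date, hhmm: str, tz: timezone) -> datetime:
    """Read a calendar entry like '2100' as wall-clock time in tz and return it in UTC."""
    hour, minute = int(hhmm[:2]), int(hhmm[2:])
    local = datetime(day.year, day.month, day.day, hour, minute, tzinfo=tz)
    return local.astimezone(timezone.utc)


day = date(2021, 5, 10)
intended = window_start_utc(day, "2100", timezone.utc)  # what the calendar means
misread = window_start_utc(day, "2100", BST)            # what the bot appears to use

print(intended.isoformat())  # 2021-05-10T21:00:00+00:00
print(misread.isoformat())   # 2021-05-10T20:00:00+00:00
print(intended - misread)    # 1:00:00
```

Under that assumption the misread start lands at 20:00 UTC, which matches the announcement arriving an hour before the real window.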
[20:19:20] (03CR) 10RLazarus: [V: 03+1 C: 03+2] "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/688394 (owner: 10RLazarus) [20:21:08] !log andrew@deploy1002 Started deploy [horizon/deploy@6dc83bd]: update horizon to fix T282489 [20:21:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:12] T282489: Horizon proxy dashboard: edit dialog shows wmcloud.org even if a wmflabs.org proxy is being edited - https://phabricator.wikimedia.org/T282489 [20:25:18] !log andrew@deploy1002 Finished deploy [horizon/deploy@6dc83bd]: update horizon to fix T282489 (duration: 04m 10s) [20:25:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:45] legoktm: so the issue is not related to the maintenance, for some reason acme-chief is failing to retrieve the OCSP response for gitlab.wikimedia.org ec-prime256v1 cert and that's messing with acme-chief [20:28:53] !log andrew@deploy1002 Started deploy [horizon/deploy@6dc83bd]: update horizon to fix T282489 [20:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:57] T282489: Horizon proxy dashboard: edit dialog shows wmcloud.org even if a wmflabs.org proxy is being edited - https://phabricator.wikimedia.org/T282489 [20:29:09] !log andrew@deploy1002 deploy aborted: update horizon to fix T282489 (duration: 00m 15s) [20:29:11] !log andrew@deploy1002 Started deploy [horizon/deploy@6dc83bd]: update horizon to fix T282489 [20:29:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:47] !log andrew@deploy1002 deploy aborted: update horizon to fix T282489 (duration: 00m 36s) [20:29:50] !log andrew@deploy1002 Started deploy [horizon/deploy@6dc83bd]: update horizon to fix T282489 [20:29:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:52] PROBLEM - SSH on mw1279.mgmt is 
CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:29:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:11] !log andrew@deploy1002 Finished deploy [horizon/deploy@6dc83bd]: update horizon to fix T282489 (duration: 01m 21s) [20:31:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:58] RECOVERY - Ensure acme-chief-backend is running only in the active node on acmechief1001 is OK: PROCS OK: 1 process with args acme-chief-backend https://wikitech.wikimedia.org/wiki/Acme-chief [20:32:08] RECOVERY - Check systemd state on acmechief1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:32:08] vgutierrez: I have a pretty narrow understanding of OCSP except that it's important :/ [20:33:14] so acme-chief triggered a renewal of the gitlab certificate during LE's maintenance window and LE issued a cert that was no longer considered valid afterwards [20:33:29] !log andrew@deploy1002 Started deploy [horizon/deploy@6dc83bd]: update horizon to fix T282489 [20:33:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:56] (03PS1) 10Cwhite: Build on the production builder host [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/688418 [20:34:14] ahh, did you just delete the cert then? 
[20:35:24] !log andrew@deploy1002 Finished deploy [horizon/deploy@6dc83bd]: update horizon to fix T282489 (duration: 01m 55s) [20:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:28] T282489: Horizon proxy dashboard: edit dialog shows wmcloud.org even if a wmflabs.org proxy is being edited - https://phabricator.wikimedia.org/T282489 [20:36:34] 10SRE, 10Acme-chief: acme-chief is down: ValueError: OCSP response status is not successful so the property has no value - https://phabricator.wikimedia.org/T282490 (10Vgutierrez) p:05Unbreak!→03High it looks like a horrible case of bad timing.. acme-chief triggered a certificate renewal for our gitlab cer... [20:37:06] legoktm: stored under gitlab.old just in case [20:37:15] legoktm: and yes, that made the trick [20:37:58] !log andrew@deploy1002 Started deploy [horizon/deploy@6dc83bd]: update horizon to fix T282489 [20:38:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:05] !log andrew@deploy1002 Finished deploy [horizon/deploy@6dc83bd]: update horizon to fix T282489 (duration: 02m 07s) [20:40:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:07] !log andrew@deploy1002 Started deploy [horizon/deploy@2604d7b]: more deployment fixes [20:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:51] !log andrew@deploy1002 Finished deploy [horizon/deploy@2604d7b]: more deployment fixes (duration: 03m 44s) [20:44:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:58] (03PS1) 10Ahmon Dancy: Manually include I18nUtils class [extensions/MediaSearch] (wmf/1.37.0-wmf.3) - 10https://gerrit.wikimedia.org/r/688294 (https://phabricator.wikimedia.org/T282206) [20:48:31] (03PS1) 10Ahmon Dancy: Manually include I18nUtils class [extensions/MediaSearch] (wmf/1.37.0-wmf.4) - 10https://gerrit.wikimedia.org/r/688295 (https://phabricator.wikimedia.org/T282206) [21:00:05] 
RoanKattouw, Niharika, and Urbanecm: #bothumor I � Unicode. All rise for Evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210510T2300). [21:00:05] ori, James_F, and Huji: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:01:19] jouncebot: Bad bot. That's in two hours' time. [21:03:15] (03PS1) 10RLazarus: openldap: Convert the weekday cross-validate-accounts from cron to systemd. [puppet] - 10https://gerrit.wikimedia.org/r/688423 [21:07:51] (03CR) 10RLazarus: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29486/console" [puppet] - 10https://gerrit.wikimedia.org/r/688423 (owner: 10RLazarus) [21:07:53] (03PS2) 10Cwhite: Build on the production builder host [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/688418 [21:16:02] Anyone mind if I deploy https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MediaSearch/+/688294 and https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MediaSearch/+/688295 now? [21:16:44] Go for it. [21:16:55] I noticed it in prod logs once or twice. [21:17:05] Awesome. 
[21:18:05] (03CR) 10Ahmon Dancy: [C: 03+2] Manually include I18nUtils class [extensions/MediaSearch] (wmf/1.37.0-wmf.3) - 10https://gerrit.wikimedia.org/r/688294 (https://phabricator.wikimedia.org/T282206) (owner: 10Ahmon Dancy) [21:20:42] PROBLEM - Stale file for node-exporter textfile in codfw on alert1001 is CRITICAL: cluster=misc file=puppet_agent.prom instance=mwlog2001 job=node site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Stale_file_for_node-exporter_textfile https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile [21:22:21] (03CR) 10Legoktm: [C: 03+2] mailman3: Enable debug runner logs [puppet] - 10https://gerrit.wikimedia.org/r/688354 (https://phabricator.wikimedia.org/T282348) (owner: 10Legoktm) [21:22:30] (03CR) 10Legoktm: [C: 03+2] mailman3: Install flufl.bounce from our component too [puppet] - 10https://gerrit.wikimedia.org/r/688382 (owner: 10Legoktm) [21:22:54] (03PS1) 10Bstorm: wikireplicas: disable notifications on the old replica cluster [puppet] - 10https://gerrit.wikimedia.org/r/688443 (https://phabricator.wikimedia.org/T260389) [21:24:56] RECOVERY - mailman3_runners on lists1001 is OK: PROCS OK: 14 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:26:38] !log upgraded flufl.bounce on lists1001 and restarted mailman3 T282348 [21:26:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:43] T282348: Mailman3 bounce runner is running very slowly - https://phabricator.wikimedia.org/T282348 [21:31:06] RECOVERY - SSH on mw1279.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:32:49] * dancy twiddles thumbs [21:33:35] (03CR) 10Ahmon Dancy: [C: 03+2] Manually include I18nUtils class [extensions/MediaSearch] (wmf/1.37.0-wmf.4) - 10https://gerrit.wikimedia.org/r/688295 (https://phabricator.wikimedia.org/T282206) (owner: 10Ahmon Dancy) 
[21:36:10] (03PS2) 10Bstorm: wikireplicas: remove the old wikireplicas role from the proxy [puppet] - 10https://gerrit.wikimedia.org/r/688368 (https://phabricator.wikimedia.org/T260389) [21:39:36] !log nvm, downgraded flufl.bounce on lists1001 [21:39:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:50] PROBLEM - mailman3_runners on lists1001 is CRITICAL: PROCS CRITICAL: 13 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:42:30] (03Merged) 10jenkins-bot: Manually include I18nUtils class [extensions/MediaSearch] (wmf/1.37.0-wmf.3) - 10https://gerrit.wikimedia.org/r/688294 (https://phabricator.wikimedia.org/T282206) (owner: 10Ahmon Dancy) [21:45:06] !log dancy@deploy1002 Synchronized php-1.37.0-wmf.3/extensions/MediaSearch/MediaSearch.i18n.php: Backport: [[gerrit:688294|Manually include I18nUtils class (T282206)]] (duration: 01m 01s) [21:45:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:10] T282206: scap sync fails on beta with Error: Class 'MediaWiki\Extension\MediaSearch\I18nUtils' not found - https://phabricator.wikimedia.org/T282206 [21:45:28] (03PS3) 10Bstorm: wikireplicas: remove the old wikireplicas role from the proxy [puppet] - 10https://gerrit.wikimedia.org/r/688368 (https://phabricator.wikimedia.org/T260389) [21:48:02] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Mailman3 bounce runner is running very slowly - https://phabricator.wikimedia.org/T282348 (10Legoktm) Current status: * I upgraded flufl.bounce which I thought might be better at parsing bounce messages, hopefully causing less crashes and then it unsubscr... 
[21:48:11] * legoktm stabs mailman3 [21:51:35] legoktm: i think we like its improved security model and features and everything 🙂 [21:51:38] oh no sorry to hear [21:51:42] straight to mm4 it is [21:54:01] I retract my stab [21:54:08] the listadmins@ list was misconfigured [21:57:07] (03CR) 10Bstorm: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/29487/" [puppet] - 10https://gerrit.wikimedia.org/r/688368 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [21:57:15] (03Merged) 10jenkins-bot: Manually include I18nUtils class [extensions/MediaSearch] (wmf/1.37.0-wmf.4) - 10https://gerrit.wikimedia.org/r/688295 (https://phabricator.wikimedia.org/T282206) (owner: 10Ahmon Dancy) [21:57:31] (03PS4) 10Bstorm: wikireplicas: remove the old wikireplicas profile from the proxy [puppet] - 10https://gerrit.wikimedia.org/r/688368 (https://phabricator.wikimedia.org/T260389) [21:57:49] 10SRE, 10GitLab (Initialization), 10Release-Engineering-Team (Doing), 10User-brennen: Define auth strategy for GitLab - https://phabricator.wikimedia.org/T274461 (10Sergey.Trofimovsky.SF) More good news: inspired by this issue (https://gitlab.com/gitlab-org/gitlab/-/issues/24510), @Sfigor was able to make... [21:59:48] !log dancy@deploy1002 Synchronized php-1.37.0-wmf.4/extensions/MediaSearch/MediaSearch.i18n.php: Backport: [[gerrit:688295|Manually include I18nUtils class (T282206)]] (duration: 00m 56s) [21:59:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:52] T282206: scap sync fails on beta with Error: Class 'MediaWiki\Extension\MediaSearch\I18nUtils' not found - https://phabricator.wikimedia.org/T282206 [21:59:58] I'm done deploying. Thanks! 
[22:08:06] (03PS4) 10Thcipriani: Fixed a few minor typos in README.md [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/682689 (owner: 10Ahmon Dancy) [22:11:07] (03CR) 10Thcipriani: [V: 03+2 C: 03+2] Fixed a few minor typos in README.md [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/682689 (owner: 10Ahmon Dancy) [22:18:27] (03CR) 10Bstorm: toolforge: Allow passing host port for k8s ingress (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/688361 (https://phabricator.wikimedia.org/T264221) (owner: 10Majavah) [22:25:53] (03PS1) 10Ssingh: nagios_common: add sukhe to sms contactgroup [puppet] - 10https://gerrit.wikimedia.org/r/688462 [22:28:36] sukhe: I don't think that's used anymore? it goes through victorops now [22:29:08] ha wow [22:30:50] ok I will do this tomorrow as I can't log in to VictorOps (sorry, "Splunk") as well [22:32:01] ok I take it back, it works for me now. I did the test SMS thing. that's interesting. [22:32:47] (Traffic bill over quota) firing: Traffic bill over quota - https://alerts.wikimedia.org [22:32:53] (03Abandoned) 10Ssingh: nagios_common: add sukhe to sms contactgroup [puppet] - 10https://gerrit.wikimedia.org/r/688462 (owner: 10Ssingh) [22:33:16] legoktm: I gave you credit in the abandonment :P [22:34:40] (03CR) 10Ori.livneh: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685035 (https://phabricator.wikimedia.org/T244075) (owner: 10Ori.livneh) [22:37:41] (03CR) 10Dzahn: [C: 03+1] openldap: Only run cross-validate-accounts on weekdays. [puppet] - 10https://gerrit.wikimedia.org/r/688394 (owner: 10RLazarus) [22:37:47] (Traffic bill over quota) firing: (3) Traffic bill over quota - https://alerts.wikimedia.org [22:40:03] (03CR) 10Dzahn: [C: 04-1] "existing file has been updated and there is ongoing work to keep it updated. 
let's keep it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681665 (owner: 10Alexandros Kosiaris) [22:42:47] (Traffic bill over quota) resolved: (2) Traffic bill over quota - https://alerts.wikimedia.org [22:49:28] sukhe: :) I have it installed as a mobile app and it pages me via push notifications [22:51:20] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:51:40] (03CR) 10Bstorm: "Since this is a helm values file, I think it would be good to add that to the already-long file name so those familiar with helm will know" [puppet] - 10https://gerrit.wikimedia.org/r/685715 (https://phabricator.wikimedia.org/T264221) (owner: 10Majavah) [22:52:33] (03PS3) 10Reedy: Update messages used for tech CoC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681834 (https://phabricator.wikimedia.org/T280886) [23:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: That opportune time is upon us again. Time for a European mid-day backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210511T1100). [23:00:05] No GERRIT patches in the queue for this window AFAICS. [23:00:14] wait, I have a patch [23:00:34] there are multiple patches [23:00:54] Jouncebot is one window ahead :( [23:01:05] jouncebot: reload [23:01:11] oh man. did I break it? [23:01:18] there was some command for that, wasnt there [23:01:18] jouncebot: refresh [23:01:18] I refreshed my knowledge about deployments. [23:01:21] that one :) [23:01:23] jouncebot: now [23:01:23] For the next 1 hour(s) and 58 minute(s): Evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210510T2300) [23:01:36] seems correct now [23:02:22] No it doesn't. The window is one hour long. 
[23:02:37] bd808: and yes, i think a recent commit broke it :( [23:03:04] bd808: it seemed to be treating BST as UTC [23:03:14] (03CR) 10Dzahn: [C: 03+2] icinga::ircbot: Send Wikidata notifications to #wikidata-feed [puppet] - 10https://gerrit.wikimedia.org/r/686697 (https://phabricator.wikimedia.org/T282301) (owner: 101997kB) [23:03:14] let's blame James_F for giving a +2 to https://gerrit.wikimedia.org/r/c/wikimedia/bots/jouncebot/+/680033 :) [23:03:19] Anyway, I can deploy today. [23:03:49] (03CR) 10Dzahn: [C: 03+2] "merging, my 2 cents are that "feed" channels tend to be ignored though" [puppet] - 10https://gerrit.wikimedia.org/r/686697 (https://phabricator.wikimedia.org/T282301) (owner: 101997kB) [23:03:51] Urbanecm: If you're doing SWAT, mind putting https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/681834 out please? [23:04:34] uh, BACON [23:04:46] bd808: Ah, right. [23:04:50] Reedy: No. [23:04:53] (03CR) 10Urbanecm: [C: 03+2] Add ptwiki 20th anniversary logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685557 (https://phabricator.wikimedia.org/T281925) (owner: 10Zabe) [23:04:54] It' [23:05:10] We're trying to get away from acronyms related to death. No meat puns, please. [23:05:17] (03PS1) 10BryanDavis: Revert "Reparse deploy page before announcing an event" [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/688297 [23:05:33] (03CR) 10Jforrester: "It seems like a good idea at the time." 
[wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/688297 (owner: 10BryanDavis) [23:05:46] (03Merged) 10jenkins-bot: Add ptwiki 20th anniversary logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685557 (https://phabricator.wikimedia.org/T281925) (owner: 10Zabe) [23:06:27] (03CR) 10BryanDavis: "> Patch Set 1:" [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/688297 (owner: 10BryanDavis) [23:06:40] (03CR) 10BryanDavis: [C: 03+2] Revert "Reparse deploy page before announcing an event" [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/688297 (owner: 10BryanDavis) [23:07:15] (03Merged) 10jenkins-bot: Revert "Reparse deploy page before announcing an event" [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/688297 (owner: 10BryanDavis) [23:07:28] (03PS2) 10Urbanecm: Use ptwiki 20th anniversary logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685563 (https://phabricator.wikimedia.org/T281925) (owner: 10Zabe) [23:07:45] 10SRE, 10Wikidata, 10observability, 10wdwb-tech, 10Patch-For-Review: Move icinga-wm from #wikidata to #wikidata-feed - https://phabricator.wikimedia.org/T282301 (10Dzahn) 05Open→03Resolved a:03Dzahn deployed! 
23:06 -!- icinga-wm [~icinga-wm@wikimedia/bot/icinga-wm] has joined #wikidata-feed [23:07:53] (03CR) 10Urbanecm: [C: 03+2] Use ptwiki 20th anniversary logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685563 (https://phabricator.wikimedia.org/T281925) (owner: 10Zabe) [23:08:02] !log urbanecm@deploy1002 Synchronized static/images/project-logos/: f2a76b1a6eb55749395e67d74c74a7fc5df52f1b: Add ptwiki 20th anniversary logos (T281925) (duration: 00m 58s) [23:08:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:08:07] T281925: Change ptwiki logo temporarily (celebration of 20 years) - https://phabricator.wikimedia.org/T281925 [23:08:42] (03Merged) 10jenkins-bot: Use ptwiki 20th anniversary logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685563 (https://phabricator.wikimedia.org/T281925) (owner: 10Zabe) [23:09:24] jouncebot: now [23:09:25] For the next 1 hour(s) and 50 minute(s): Evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210510T2300) [23:09:31] :-( [23:12:15] !log urbanecm@deploy1002 Synchronized wmf-config/logos.php: dd6fa6504350a90c9f14c218bc972558791f0a6d: Use ptwiki 20th anniversary logos (T281925) (duration: 00m 59s) [23:12:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:44] James_F: pushed it out. Works for me with old vector, but ptwiki is apparently pilot for the new one :( [23:13:17] Urbanecm: the off by one on the length is in the source on wikitech. which is probably lua magic? [23:13:50] `
` [23:14:10] toothpicks and superglue [23:14:43] probably [23:15:51] bd808: https://wikitech.wikimedia.org/wiki/Module:Deployment_schedule [23:18:18] (03CR) 10Urbanecm: [C: 04-2] "this needs further discussion, see T204136#4584681, T279829" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/688381 (https://phabricator.wikimedia.org/T282486) (owner: 10Huji) [23:18:47] looks like it defaults to two hours (line 49) [23:19:14] huji: sadly I'm unable to deploy your patch now. enabling SHORTDESC should be discussed with DannyH first (see the german Wikipedia task i linked in my -2) [23:19:38] PROBLEM - Check systemd state on analytics1075 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:20:14] Urbanecm: I will follow up on Phab. Thanks for your attention [23:20:29] No problem. I'll leave a note there too. [23:21:54] jouncebot: refresh [23:21:55] I refreshed my knowledge about deployments. [23:22:01] jouncebot: now [23:22:01] For the next 1 hour(s) and 37 minute(s): Evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210510T2300) [23:22:10] * bd808 grumbles [23:22:34] jouncebot: refresh [23:22:35] I refreshed my knowledge about deployments. [23:22:37] jouncebot: now [23:22:37] For the next 0 hour(s) and 37 minute(s): Evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210510T2300) [23:22:41] \o/ [23:22:49] James_F: could you help me review ori's patch please? Looks good, to me, but not 100% sure. [23:23:09] you did it bd808 ! thanks :) [23:23:14] if anyone cares, the fix in the lua was https://wikitech.wikimedia.org/w/index.php?title=Module:Deployment_schedule&diff=1911549&oldid=1907063 [23:23:35] * Urbanecm goes to send a thank [23:24:26] also, this is all way more complicated than it could be because of the wiki involvement. 
just saying [23:24:38] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/521233 is a recent-ish enable-extension-on-meta that could be used as a reference [23:25:01] on beta, not meta [23:25:11] i'm not sure if it needs to be present in wmf.3 as well, to be precise [23:25:13] (it's currently not) [23:25:57] If we aren't going to rollback to .3, it doesn't [23:26:12] if there's a chance, and we will need to run full scap, it does [23:26:39] even if i run scap sync-world (now)? i feel like that would update i18n cache for wmf.3 (with an extension that doesn't exist there) [23:27:34] why would it [23:27:53] because wmf.3 is currently on the deployment host (and presumably, on mw servers) [23:28:12] no [23:28:26] are you sure it's not on wmf.3? there's a wmf/1.37.0-wmf.3 that was auto-created on the extension's repository [23:28:40] branch even [23:28:47] scap only builds for currently active versions [23:29:03] i see [23:29:19] otherwise it'd be building it for 5+ MW versions at times [23:29:22] which makes no sense [23:29:22] (03CR) 10Urbanecm: [C: 03+2] Enable ChessBrowser on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685035 (https://phabricator.wikimedia.org/T244075) (owner: 10Ori.livneh) [23:29:34] +2'ed then. thanks Reedy. 
[23:29:45] reedy@deploy1002:~$ ls -al /srv/mediawiki-staging/php-1.37.0-wmf.3/extensions/Ch [23:29:45] CharInsert/ CheckUser/ [23:29:54] It's not there in .3, but doesn't need to be [23:30:08] (03Merged) 10jenkins-bot: Enable ChessBrowser on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685035 (https://phabricator.wikimedia.org/T244075) (owner: 10Ori.livneh) [23:30:21] ori: your patch should arrive to be automagically :) [23:30:28] See also T273334 [23:30:29] T273334: Re-imaged mw app servers can end up with missing l10n cache for old versions of MW needed for rollback - https://phabricator.wikimedia.org/T273334 [23:30:33] Which is kinda the same issue, but done in a different way [23:30:38] (and it's not fixed, so not applicable) [23:30:59] Which might've been what you were thinking of [23:31:52] i see [23:31:55] how often does beta update? [23:32:02] depends if jerkins is broken [23:32:12] and if the i18n build job is fixed indeed :) [23:32:14] https://integration.wikimedia.org/ci/view/Beta/ [23:32:29] !log urbanecm@deploy1002 Synchronized wmf-config/extension-list: ba8b786c7f3a290f0747a6859fd07502eb83108f: NO-OP: Enable ChessBrowser on beta (T244075) (duration: 00m 57s) [23:32:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:32:33] T244075: Deploy ChessBrowser extension to Beta Cluster - https://phabricator.wikimedia.org/T244075 [23:32:36] it's currently scapping your change [23:33:29] (03CR) 10Urbanecm: [C: 03+2] Update messages used for tech CoC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681834 (https://phabricator.wikimedia.org/T280886) (owner: 10Reedy) [23:34:26] Urbanecm: Don't run a scap world. It's just a beta deployment, no need to do anything at all. [23:34:39] Oh good, you didn't. [23:34:44] Scrollback is grand. [23:34:48] i'm not running it. 
I was asked "what if someone did" :) [23:34:55] *asking [23:35:00] (03PS1) 10Bstorm: wikireplicas-dns: condense repeated nodes for better failover [puppet] - 10https://gerrit.wikimedia.org/r/688501 (https://phabricator.wikimedia.org/T260389) [23:35:01] If they did it's probably break things. [23:35:06] (03Merged) 10jenkins-bot: Update messages used for tech CoC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681834 (https://phabricator.wikimedia.org/T280886) (owner: 10Reedy) [23:35:28] But wmf.5 gets cut in 25 minutes and deployed in a few hours. [23:35:31] So it's fine. [23:37:09] Reedy: syncing your patch [23:38:00] !log urbanecm@deploy1002 Synchronized wmf-config/CommonSettings.php: 779fb53bfd7a4d9b11f865df14f8a72adb97f33b: Update messages used for tech CoC (T280886) (duration: 00m 56s) [23:38:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:04] T280886: Add Code of Conduct link to the Universal Code of Conduct to all non technical wikis - https://phabricator.wikimedia.org/T280886 [23:38:06] and....done [23:38:21] Hurrah. [23:38:25] Thanks, Urbanecm. [23:38:33] any time [23:38:34] (03PS5) 10Razzi: netboot: add reuse-analytics-raid1-2dev.cfg recipe for an-master and an-coord [puppet] - 10https://gerrit.wikimedia.org/r/682785 (https://phabricator.wikimedia.org/T278423) [23:38:57] (03CR) 10BryanDavis: Reparse deploy page before announcing an event (031 comment) [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/680033 (https://phabricator.wikimedia.org/T243394) (owner: 10BryanDavis) [23:39:48] bd808: Alternatively you could adjust the timer to trigger 30s before the event? [23:39:53] (What me, a hack? Never.) [23:40:11] it's nasty hacks from top to bottom :) [23:40:20] We call it… MediaWiki. [23:40:22] Oh, wait. [23:40:27] * bd808 blames people who left >5 years ago [23:40:43] * James_F blames himself for not leaving 5 years ago and thus being a useful target for blame. 
[23:43:50] James_F: in the "its all hacks" vein, the timer goes off 5 seconds after the time in the wiki page, which I'm guessing was a hack itself :) [23:44:02] how is there so much variance in the amount of time it takes to run scap-cdb-rebuild on beta app servers [23:44:34] first one finished in ~3 seconds, last one is still running after 5m [23:44:56] ~3 seconds means it didn't find anything that needed to be rebuilt [23:45:11] a rebuild takes much much longer [23:46:08] that's the scary pile of file stat compares to php serialized timestamp stuff in the l10n build step [23:46:23] oh yeah, that [23:46:36] i've done my best to forget that that exists [23:46:40] also, by the point it does that at the end... [23:46:42] (03CR) 10Razzi: [C: 03+2] "Ok great, I'll merge this so the change will have plenty of time to propagate." [puppet] - 10https://gerrit.wikimedia.org/r/682785 (https://phabricator.wikimedia.org/T278423) (owner: 10Razzi) [23:46:42] and why do I remember this ~5 years after last touching i18n codes [23:46:45] the deployment host has already run it earlier [23:46:54] 00:31:18 23:31:18 Started l10n-update [23:46:54] 00:31:18 23:31:18 Updating ExtensionMessages-master.php [23:46:54] 00:31:19 23:31:19 Updating LocalisationCache for master using 6 thread(s) [23:46:54] 00:32:37 23:32:37 Generating JSON versions and md5 files [23:46:54] 00:32:51 23:32:51 Finished l10n-update (duration: 01m 32s) [23:47:29] woohoo, https://en.wikipedia.beta.wmflabs.org/wiki/User:Ori.livneh/Chess [23:47:40] Urbanecm, James_F, Reedy : thanks! [23:47:56] excellent! [23:48:51] looks like it needs some tweaking of the resources loading order/perf [23:49:11] Reedy: what are you seeing? 
[23:49:34] other than a pedestrian chasing a cyclist toward the yin-yang [23:49:35] many reflows/redraws/unstyled [23:50:23] (03PS1) 10Cwhite: rsyslog: add ecs_170 template [puppet] - 10https://gerrit.wikimedia.org/r/688502 (https://phabricator.wikimedia.org/T234565) [23:50:38] I'll take a look [23:50:58] put a video on https://phabricator.wikimedia.org/T282503 [23:53:07] thanks [23:57:30] (03PS1) 10BryanDavis: Reparse deploy page before announcing an event (v2) [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/688503 (https://phabricator.wikimedia.org/T243394) [23:59:35] (03CR) 10BryanDavis: [C: 03+2] Reparse deploy page before announcing an event (v2) [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/688503 (https://phabricator.wikimedia.org/T243394) (owner: 10BryanDavis)