[00:06:24] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:08:40] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:18:36] (CR) jerkins-bot: [V: -1] Revert "Move User::changeable(By)Groups methods to UserGroupManager" [core] (wmf/1.36.0-wmf.28) - https://gerrit.wikimedia.org/r/660649 (https://phabricator.wikimedia.org/T273296) (owner: Urbanecm)
[00:21:08] (PS5) Urbanecm: Revert "Move User::changeable(By)Groups methods to UserGroupManager" [core] (wmf/1.36.0-wmf.28) - https://gerrit.wikimedia.org/r/660649 (https://phabricator.wikimedia.org/T273296)
[01:02:07] PROBLEM - Postgres Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1130443328 and 83 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:02:13] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 834619096 and 61 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:08:25] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 2384 and 12 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:09:59] RECOVERY - Postgres Replication Lag on maps2005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 74168 and 104 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:18:04] PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:22:34] RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 4.084 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:29:28] PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:46:12] RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 7.864 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:53:18] PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[05:17:38] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:22:26] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:58:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1094', diff saved to https://phabricator.wikimedia.org/P14058 and previous config saved to /var/cache/conftool/dbconfig/20210201-055851-marostegui.json
[05:58:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:04:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1094 (re)pooling @ 10%: After fixing replication', diff saved to https://phabricator.wikimedia.org/P14059 and previous config saved to /var/cache/conftool/dbconfig/20210201-060415-root.json
[06:04:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:10:14] !log Upgrade db2071 and db2102 to 10.4.18
[06:10:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:32] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:18:04] (PS1) Marostegui: db1175: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/660661 (https://phabricator.wikimedia.org/T258361)
[06:18:44] (CR) Marostegui: [C: +2] db1175: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/660661 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui)
[06:19:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1094 (re)pooling @ 25%: After fixing replication', diff saved to https://phabricator.wikimedia.org/P14060 and previous config saved to /var/cache/conftool/dbconfig/20210201-061919-root.json
[06:19:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:02] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:21:03] (PS1) Marostegui: db1175: Add it to dbctl [puppet] - https://gerrit.wikimedia.org/r/660662 (https://phabricator.wikimedia.org/T258361)
[06:21:35] (CR) Marostegui: [C: +2] db1175: Add it to dbctl [puppet] - https://gerrit.wikimedia.org/r/660662 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui)
[06:23:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1175 to dbctl, depooled T258361', diff saved to https://phabricator.wikimedia.org/P14061 and previous config saved to /var/cache/conftool/dbconfig/20210201-062358-marostegui.json
[06:24:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:24:03] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[06:34:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1094 (re)pooling @ 50%: After fixing replication', diff saved to https://phabricator.wikimedia.org/P14062 and previous config saved to /var/cache/conftool/dbconfig/20210201-063422-root.json
[06:34:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:39:01] !log Run analyze table on db2071 and db2102
[06:39:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:49:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1094 (re)pooling @ 75%: After fixing replication', diff saved to https://phabricator.wikimedia.org/P14063 and previous config saved to /var/cache/conftool/dbconfig/20210201-064926-root.json
[06:49:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:00] (PS2) Giuseppe Lavagetto: kubernetes::deployment_server: add yaml to configure MediaWiki sites [puppet] - https://gerrit.wikimedia.org/r/659941 (https://phabricator.wikimedia.org/T272305)
[07:04:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1094 (re)pooling @ 100%: After fixing replication', diff saved to https://phabricator.wikimedia.org/P14064 and previous config saved to /var/cache/conftool/dbconfig/20210201-070429-root.json
[07:04:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:58] SRE, MW-on-K8s, observability, serviceops: Add support for scraping php applications to the kubernetes prometheus scraper - https://phabricator.wikimedia.org/T271822 (Joe) >>! In T271822#6773823, @lmata wrote: > Hi Joe, > > Let us know if there is any support you'd like from our team on this tas...
[07:35:02] SRE, MW-on-K8s, observability, serviceops: Logging options for apache httpd in k8s - https://phabricator.wikimedia.org/T265876 (Joe) @lmata we really need to set up a meeting to tackle the questions here and in T271822 pretty soon; we're at the point where not figuring out this stuff will harm o...
[07:36:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1175 with some more minimal weight T258361', diff saved to https://phabricator.wikimedia.org/P14065 and previous config saved to /var/cache/conftool/dbconfig/20210201-073603-marostegui.json
[07:36:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:08] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[07:43:57] (PS1) Elukey: Avoid disk space alarms for Hadoop backup master/standby [puppet] - https://gerrit.wikimedia.org/r/660665
[07:44:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 2%: Slowly pooling db1175 for the first time', diff saved to https://phabricator.wikimedia.org/P14066 and previous config saved to /var/cache/conftool/dbconfig/20210201-074422-root.json
[07:44:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:33] (CR) Elukey: [C: +2] Avoid disk space alarms for Hadoop backup master/standby [puppet] - https://gerrit.wikimedia.org/r/660665 (owner: Elukey)
[07:50:16] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:52:42] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:54:22] RECOVERY - Check systemd state on kafka-test1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:54:57] SRE, Analytics-Radar: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (elukey) Cleaned up with `sudo ip addr flush ens5; sudo systemctl restart ifup@ens5` in tmux kafka-test1010, kafka-test1009 and kafka-test1007
[07:55:08] RECOVERY - Check systemd state on kafka-test1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:55:22] RECOVERY - Check systemd state on kafka-test1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:56:54] RECOVERY - Disk space on an-worker1124 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1124&var-datasource=eqiad+prometheus/ops
[07:57:30] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27785/console" [puppet] - https://gerrit.wikimedia.org/r/659327 (https://phabricator.wikimedia.org/T272305) (owner: Giuseppe Lavagetto)
[07:59:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 3%: Slowly pooling db1175 for the first time', diff saved to https://phabricator.wikimedia.org/P14067 and previous config saved to /var/cache/conftool/dbconfig/20210201-075926-root.json
[07:59:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:57] (PS1) Marostegui: mariadb: db1166 added to dbctl and notifications enabled [puppet] - https://gerrit.wikimedia.org/r/660755 (https://phabricator.wikimedia.org/T258361)
[08:00:29] (PS3) Matthias Mullie: Add global to indicate that elastic LTR features are available [mediawiki-config] - https://gerrit.wikimedia.org/r/646663
[08:00:57] (CR) Matthias Mullie: [C: +1] "Ready for deployment" [mediawiki-config] - https://gerrit.wikimedia.org/r/658589 (https://phabricator.wikimedia.org/T271532) (owner: Matthias Mullie)
[08:01:40] (CR) Marostegui: [C: +2] mariadb: db1166 added to dbctl and notifications enabled [puppet] - https://gerrit.wikimedia.org/r/660755 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui)
[08:02:23] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[08:05:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1166 to dbctl, depooled T258361', diff saved to https://phabricator.wikimedia.org/P14068 and previous config saved to /var/cache/conftool/dbconfig/20210201-080520-marostegui.json
[08:05:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:25] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[08:06:02] (PS1) Elukey: role::analytics_test_cluster::presto::server: add new TLS configs [puppet] - https://gerrit.wikimedia.org/r/660756 (https://phabricator.wikimedia.org/T266640)
[08:06:24] (PS1) Giuseppe Lavagetto: httpbb: some fixes to deploy-apache-change [puppet] - https://gerrit.wikimedia.org/r/660757
[08:07:02] (CR) Elukey: [C: +2] role::analytics_test_cluster::presto::server: add new TLS configs [puppet] - https://gerrit.wikimedia.org/r/660756 (https://phabricator.wikimedia.org/T266640) (owner: Elukey)
[08:14:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 5%: Slowly pooling db1175 for the first time', diff saved to https://phabricator.wikimedia.org/P14069 and previous config saved to /var/cache/conftool/dbconfig/20210201-081429-root.json
[08:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1166 with minimal weight for the first time T258361', diff saved to https://phabricator.wikimedia.org/P14070 and previous config saved to /var/cache/conftool/dbconfig/20210201-081554-marostegui.json
[08:15:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:59] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[08:17:43] !log swift codfw-prod decrease HDD weight for ms-be20[16-27] - T272837
[08:17:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:47] T272837: Decom ms-be[2016-2027] from swift - https://phabricator.wikimedia.org/T272837
[08:19:13] (PS1) Marostegui: install_server: Do not reimage db1166,db1175 [puppet] - https://gerrit.wikimedia.org/r/660761
[08:19:25] SRE, Analytics: rsyslog segfault on an-test-presto1001 - https://phabricator.wikimedia.org/T273412 (elukey)
[08:20:06] (CR) Marostegui: [C: +2] install_server: Do not reimage db1166,db1175 [puppet] - https://gerrit.wikimedia.org/r/660761 (owner: Marostegui)
[08:20:49] SRE, Traffic, netops: cr4-ulsfo<>cr2-eqsin GRE tunnel flapping due to BFD timer expired - https://phabricator.wikimedia.org/T273328 (ayounsi) Open→Resolved a:ayounsi Thanks, looks like last flap was on Jan 29 22:06:55. As it's over the wild Internet, there is nobody to complain to and mos...
[08:23:39] SRE, Analytics: rsyslog segfault on an-test-presto1001 - https://phabricator.wikimedia.org/T273412 (elukey)
[08:27:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 2%: Slowly pooling db1166 for the first time', diff saved to https://phabricator.wikimedia.org/P14071 and previous config saved to /var/cache/conftool/dbconfig/20210201-082707-root.json
[08:27:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:37] SRE, Analytics: rsyslog segfault on an-test-presto1001 - https://phabricator.wikimedia.org/T273412 (elukey)
[08:28:20] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[08:29:06] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[08:29:19] (CR) Ayounsi: customscripts/interface_automation: skipp slaac addresses (1 comment) [software/netbox-extras] - https://gerrit.wikimedia.org/r/654439 (https://phabricator.wikimedia.org/T265904) (owner: Jbond)
[08:29:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 7%: Slowly pooling db1175 for the first time', diff saved to https://phabricator.wikimedia.org/P14072 and previous config saved to /var/cache/conftool/dbconfig/20210201-082933-root.json
[08:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:29] SRE, Analytics: rsyslog segfault on an-test-presto1001 - https://phabricator.wikimedia.org/T273412 (elukey) I was able to make rsyslog to start deleting the content of `/var/spool/rsyslog` on the host (looks similar to https://github.com/rsyslog/rsyslog/issues/2654). The old files were backed up in `/hom...
[08:38:04] (PS1) Marostegui: db1089: Remove from dbctl [puppet] - https://gerrit.wikimedia.org/r/660763 (https://phabricator.wikimedia.org/T273417)
[08:39:20] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[08:41:01] checking --^
[08:41:59] SRE, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (Marostegui) @wiki_willy db1111 can be moved somewhere else if needed. From our side our needs would be: - Choose a day/time so DBAs can depool the host in...
[08:42:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 4%: Slowly pooling db1166 for the first time', diff saved to https://phabricator.wikimedia.org/P14073 and previous config saved to /var/cache/conftool/dbconfig/20210201-084211-root.json
[08:42:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:52] (CR) Marostegui: [C: +2] db1089: Remove from dbctl [puppet] - https://gerrit.wikimedia.org/r/660763 (https://phabricator.wikimedia.org/T273417) (owner: Marostegui)
[08:45:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1089 from dbctl T273417', diff saved to https://phabricator.wikimedia.org/P14074 and previous config saved to /var/cache/conftool/dbconfig/20210201-084523-marostegui.json
[08:45:25] (CR) Hashar: "When reviewing 'gerrit logging ls' differences I have noticed a few differences:" [puppet] - https://gerrit.wikimedia.org/r/660030 (https://phabricator.wikimedia.org/T141324) (owner: Hashar)
[08:45:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:30] T273417: decommission db1089.eqiad.wmnet - https://phabricator.wikimedia.org/T273417
[08:45:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 10%: Slowly pooling db1175 for the first time', diff saved to https://phabricator.wikimedia.org/P14075 and previous config saved to /var/cache/conftool/dbconfig/20210201-084531-root.json
[08:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:03] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[08:46:30] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[08:49:21] SRE, MW-on-K8s, observability, serviceops: Add support for scraping php applications to the kubernetes prometheus scraper - https://phabricator.wikimedia.org/T271822 (JMeybohm) >>! In T271822#6790319, @Joe wrote: > - Is there a way to tell prometheus to read from multiple ports from the same pod...
[08:50:16] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:50:47] (CR) Hashar: gerrit: drop log4j custom config (1 comment) [puppet] - https://gerrit.wikimedia.org/r/660030 (https://phabricator.wikimedia.org/T141324) (owner: Hashar)
[08:51:28] (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27786/console" [puppet] - https://gerrit.wikimedia.org/r/659405 (https://phabricator.wikimedia.org/T266479) (owner: Dzahn)
[08:52:35] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:53:13] (PS4) ArielGlenn: use the platform-engineering group to add people to deployment [puppet] - https://gerrit.wikimedia.org/r/658552
[08:53:36] !log elukey@cumin1001 START - Cookbook sre.dns.netbox
[08:53:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:44] !log gilles@deploy1001 Started deploy [performance/navtiming@1e02d76]: T271208 Add more debug logging
[08:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:47] T271208: NavigationTiming Extension schemas Event Platform Migration - https://phabricator.wikimedia.org/T271208
[08:53:49] !log gilles@deploy1001 Finished deploy [performance/navtiming@1e02d76]: T271208 Add more debug logging (duration: 00m 05s)
[08:53:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:38] !log Stop MySQL on db1089 - T273417
[08:56:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:42] T273417: decommission db1089.eqiad.wmnet - https://phabricator.wikimedia.org/T273417
[08:57:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 3%: Slowly pooling db1166 for the first time', diff saved to https://phabricator.wikimedia.org/P14077 and previous config saved to /var/cache/conftool/dbconfig/20210201-085714-root.json
[08:57:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:31] (CR) Filippo Giunchedi: [V: +1 C: +1] nagios_common::commands: require_package -> ensure_packages, simplify [puppet] - https://gerrit.wikimedia.org/r/659405 (https://phabricator.wikimedia.org/T266479) (owner: Dzahn)
[09:00:03] !log elukey@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[09:00:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 12%: Slowly pooling db1175 for the first time', diff saved to https://phabricator.wikimedia.org/P14078 and previous config saved to /var/cache/conftool/dbconfig/20210201-090034-root.json
[09:00:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:41] (CR) ArielGlenn: "And I've taken nikkin and gmodena back out so they can go through the access request process as documented. At least it will now be a shor" [puppet] - https://gerrit.wikimedia.org/r/658552 (owner: ArielGlenn)
[09:02:00] (CR) ArielGlenn: [C: +2] use the platform-engineering group to add people to deployment [puppet] - https://gerrit.wikimedia.org/r/658552 (owner: ArielGlenn)
[09:03:26] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on ms-be1054.eqiad.wmnet with reason: reboot
[09:03:26] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ms-be1054.eqiad.wmnet with reason: reboot
[09:03:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:03:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:20] !log gilles@deploy1001 Started deploy [performance/navtiming@3215510]: T271208 browser_minor is needed for Mobile Safari allowlist
[09:04:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:24] T271208: NavigationTiming Extension schemas Event Platform Migration - https://phabricator.wikimedia.org/T271208
[09:04:25] !log gilles@deploy1001 Finished deploy [performance/navtiming@3215510]: T271208 browser_minor is needed for Mobile Safari allowlist (duration: 00m 05s)
[09:04:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 5%: Slowly pooling db1166 for the first time', diff saved to https://phabricator.wikimedia.org/P14079 and previous config saved to /var/cache/conftool/dbconfig/20210201-091218-root.json
[09:12:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 15%: Slowly pooling db1175 for the first time', diff saved to https://phabricator.wikimedia.org/P14080 and previous config saved to /var/cache/conftool/dbconfig/20210201-091538-root.json
[09:15:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:43] (PS1) Ayounsi: Fix IPv6 /64s includes [dns] - https://gerrit.wikimedia.org/r/660767
[09:17:45] PROBLEM - Check systemd state on ms-be2055 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:22:21] RECOVERY - Check systemd state on ms-be2055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:23:01] PROBLEM - Check systemd state on ms-be2028 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:24:10] !log renumber gr-3/3/0.1 local endpoint on cr1-eqiad
[09:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:55] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[09:27:13] !log restarting blazegraph on wdqs1013
[09:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 6%: Slowly pooling db1166 for the first time', diff saved to https://phabricator.wikimedia.org/P14081 and previous config saved to /var/cache/conftool/dbconfig/20210201-092722-root.json
[09:27:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:46] (PS2) Hashar: gerrit: drop log4j custom config [puppet] - https://gerrit.wikimedia.org/r/660030 (https://phabricator.wikimedia.org/T141324)
[09:29:46] (CR) Hashar: "PS2:" [puppet] - https://gerrit.wikimedia.org/r/660030 (https://phabricator.wikimedia.org/T141324) (owner: Hashar)
[09:29:53] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/660030 (https://phabricator.wikimedia.org/T141324) (owner: Hashar)
[09:30:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 20%: Slowly pooling db1175 for the first time', diff saved to https://phabricator.wikimedia.org/P14082 and previous config saved to /var/cache/conftool/dbconfig/20210201-093041-root.json
[09:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:45] RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.070 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[09:32:46] (PS1) Marostegui: db-eqiad.php: Depool es4 from writes [mediawiki-config] - https://gerrit.wikimedia.org/r/660772 (https://phabricator.wikimedia.org/T272568)
[09:32:59] (CR) Volans: [C: +1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/660767 (owner: Ayounsi)
[09:36:03] (CR) Ayounsi: [C: +2] Fix IPv6 /64s includes [dns] - https://gerrit.wikimedia.org/r/660767 (owner: Ayounsi)
[09:39:48] !log elukey@cumin1001 START - Cookbook sre.dns.netbox
[09:39:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:42:05] RECOVERY - Check systemd state on ms-be2028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:42:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 7%: Slowly pooling db1166 for the first time', diff saved to https://phabricator.wikimedia.org/P14083 and previous config saved to /var/cache/conftool/dbconfig/20210201-094226-root.json
[09:42:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:42:45] !log elukey@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:42:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:42:58] (CR) Kormat: [C: +1] db-eqiad.php: Depool es4 from writes [mediawiki-config] - https://gerrit.wikimedia.org/r/660772 (https://phabricator.wikimedia.org/T272568) (owner: Marostegui)
[09:43:22] (PS1) Hashar: scap: config for devtools on WMCS [software/gerrit] (deploy/wmf/stable-3.2) - https://gerrit.wikimedia.org/r/660773
[09:43:28] (CR) Ayounsi: [C: +1] "Cleaner indeed!" [homer/public] - https://gerrit.wikimedia.org/r/659235 (owner: Volans)
[09:43:42] (CR) Marostegui: [C: +2] db-eqiad.php: Depool es4 from writes [mediawiki-config] - https://gerrit.wikimedia.org/r/660772 (https://phabricator.wikimedia.org/T272568) (owner: Marostegui)
[09:43:47] PROBLEM - Check systemd state on ms-be2055 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:44:28] (Merged) jenkins-bot: db-eqiad.php: Depool es4 from writes [mediawiki-config] - https://gerrit.wikimedia.org/r/660772 (https://phabricator.wikimedia.org/T272568) (owner: Marostegui)
[09:45:33] (CR) Hashar: [C: -1] "Cant figure out the ssh config so ... not much to do ;)" [software/gerrit] (deploy/wmf/stable-3.2) - https://gerrit.wikimedia.org/r/660773 (owner: Hashar)
[09:45:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 25%: Slowly pooling db1175 for the first time', diff saved to https://phabricator.wikimedia.org/P14084 and previous config saved to /var/cache/conftool/dbconfig/20210201-094545-root.json
[09:45:47] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool es4 from writes T266483 (duration: 01m 04s)
[09:45:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:45:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:45:53] T266483: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483
[09:46:04] !log Restart mysql on es1021 T266483
[09:46:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:11] PROBLEM - Check systemd state on ms-be2039 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:49:22] (Abandoned) Hashar: scap: config for devtools on WMCS [software/gerrit] (deploy/wmf/stable-3.2) - https://gerrit.wikimedia.org/r/660773 (owner: Hashar)
[09:49:37] (PS1) Marostegui: Revert "db-eqiad.php: Depool es4 from writes" [mediawiki-config] - https://gerrit.wikimedia.org/r/660542
[09:50:01] PROBLEM - Check systemd state on ms-be2034 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:51:36] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool es4 from writes" [mediawiki-config] - https://gerrit.wikimedia.org/r/660542 (owner: Marostegui)
[09:52:29] (CR) Hashar: "I have cherry picked this change on the development platform which thus no more has any log4j The instance is running at https://gerrit.de" [puppet] - https://gerrit.wikimedia.org/r/660030 (https://phabricator.wikimedia.org/T141324) (owner: Hashar)
[09:52:47] (PS1) Rosalie Perside (WMDE): wikidata: post edit constrain jobs on 50% of edits [mediawiki-config] - https://gerrit.wikimedia.org/r/660774 (https://phabricator.wikimedia.org/T204031)
[09:52:52] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool es4 into writes T266483 (duration: 00m 56s)
[09:52:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:56] T266483: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483
[09:54:40] hashar: hey, I see you're train conductor for the upcoming week. The blocker for T271342 was successfully fixed, however, train is currently still at wmf.27. Should we go with wmf.28 today? Was already at all wikis with no log spam, this is "just" a major feature regression. I can deploy & test the backports, but I would need someone to move the wikis forward.
[09:54:41] T271342: 1.36.0-wmf.28 deployment blockers - https://phabricator.wikimedia.org/T271342 [09:55:08] ahrgh [09:55:26] havent caught up with the train status [09:55:29] but yeah guess I can push it [09:57:05] Urbanecm: when will you backport the patches? during the regular window? [09:57:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 8%: Slowly pooling db1166 for the first time', diff saved to https://phabricator.wikimedia.org/P14085 and previous config saved to /var/cache/conftool/dbconfig/20210201-095729-root.json [09:57:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:38] Majavah: probably earlier [09:58:00] (03CR) 10Volans: "Some post-merge comments" (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/657451 (https://phabricator.wikimedia.org/T269596) (owner: 10Razzi) [09:58:02] hashar: okay, I'll +2 the backports then and ping you once it's time to change wikiversions. Does that sound good to you? [09:58:40] not sure when we push the new version though [09:59:51] Urbanecm: ack, thanks. not sure if I'm available to help with testing but please ping me when backporting? 
[10:00:33] hashar: that's up to you i guess, I'm able to test it anyway [10:00:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 30%: Slowly pooling db1175 for the first time', diff saved to https://phabricator.wikimedia.org/P14086 and previous config saved to /var/cache/conftool/dbconfig/20210201-100048-root.json [10:00:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:14] Urbanecm: +1 : ] [10:01:21] (03CR) 10Urbanecm: [C: 03+2] "train blocker" [core] (wmf/1.36.0-wmf.28) - 10https://gerrit.wikimedia.org/r/660533 (https://phabricator.wikimedia.org/T273317) (owner: 10Urbanecm) [10:01:24] (03CR) 10Urbanecm: [C: 03+2] "train blocker" [core] (wmf/1.36.0-wmf.28) - 10https://gerrit.wikimedia.org/r/660649 (https://phabricator.wikimedia.org/T273296) (owner: 10Urbanecm) [10:01:46] (03CR) 10Hashar: [C: 03+1] Revert "Move User::changeable(By)Groups methods to UserGroupManager" [core] (wmf/1.36.0-wmf.28) - 10https://gerrit.wikimedia.org/r/660649 (https://phabricator.wikimedia.org/T273296) (owner: 10Urbanecm) [10:01:48] ah, you're doing it now [10:01:52] (03CR) 10Hashar: [C: 03+1] Revert "Revert "Revert "Remove usages and hard deprecate User::changeable(By)Group""" [core] (wmf/1.36.0-wmf.28) - 10https://gerrit.wikimedia.org/r/660533 (https://phabricator.wikimedia.org/T273317) (owner: 10Urbanecm) [10:02:08] Majavah: yeah, whenever it merges :). [10:02:37] hmm have we rolledback last week? [10:03:03] we rolled back late Friday [10:03:19] this week is going to be a verryyyyyyy loooong weeek :-\ [10:05:35] PROBLEM - Check systemd state on ms-be2034 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:06:41] RECOVERY - Check systemd state on ms-be2039 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:12:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 9%: Slowly pooling db1166 for the first time', diff saved to https://phabricator.wikimedia.org/P14087 and previous config saved to /var/cache/conftool/dbconfig/20210201-101233-root.json [10:12:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:19] hashar: I assume such trains (.27 > .29) aren't the most favorable type of train, right? [10:13:32] .27 > .28 I guess [10:13:36] then .29 tomorrow [10:13:58] Yeah, today just .28 to everywhere [10:14:44] I don't get why we move methods and immediately hard deprecate them [10:14:56] I thought he rule was to leave them as deprecated for a bit [10:15:03] Well in this case hard deprecation wasn't really the culprit [10:15:15] then it is not like I know anything about mediawiki processes nowadays [10:15:36] But yeah, it should've be left soft deprecated for a release [10:15:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 50%: Slowly pooling db1175 for the first time', diff saved to https://phabricator.wikimedia.org/P14088 and previous config saved to /var/cache/conftool/dbconfig/20210201-101552-root.json [10:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:53] (03PS1) 10Volans: cookbooks: force title to be one line [software/spicerack] - 10https://gerrit.wikimedia.org/r/660777 [10:18:25] (03CR) 10Volans: "I've sent I78cd5f65e74378261ea5c983c9f85fd8f4693589" [cookbooks] - 10https://gerrit.wikimedia.org/r/657451 (https://phabricator.wikimedia.org/T269596) (owner: 10Razzi) [10:19:43] (03PS2) 10Rosalie Perside (WMDE): wikidata: post edit constraint jobs on 40% of edits [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/660774 (https://phabricator.wikimedia.org/T204031) [10:20:07] (03CR) 10jerkins-bot: [V: 04-1] Revert "Move User::changeable(By)Groups methods to UserGroupManager" [core] (wmf/1.36.0-wmf.28) - 10https://gerrit.wikimedia.org/r/660649 (https://phabricator.wikimedia.org/T273296) (owner: 10Urbanecm) [10:20:15] heeelp [10:20:20] (looking) [10:20:41] fatal: unable to access 'https://gerrit.wikimedia.org/r/mediawiki/extensions/WikiEditor/': The requested URL returned error: 502 [10:20:53] ehh [10:21:06] (03CR) 10Urbanecm: [C: 03+2] "train blocker" [core] (wmf/1.36.0-wmf.28) - 10https://gerrit.wikimedia.org/r/660649 (https://phabricator.wikimedia.org/T273296) (owner: 10Urbanecm) [10:21:11] let's try again... [10:22:41] RECOVERY - Check systemd state on ms-be2055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:22:44] bah [10:23:08] Urbanecm: let me restart Gerrit [10:23:22] hashar: it's almost finished? 
[10:23:41] yeah [10:23:42] hmm [10:23:52] hashar: dunno if it's a good idea to restart gerrit when the patches are almost merged [10:24:07] probably not ;] [10:26:01] (03Merged) 10jenkins-bot: Revert "Revert "Revert "Remove usages and hard deprecate User::changeable(By)Group""" [core] (wmf/1.36.0-wmf.28) - 10https://gerrit.wikimedia.org/r/660533 (https://phabricator.wikimedia.org/T273317) (owner: 10Urbanecm) [10:26:09] (03Merged) 10jenkins-bot: Revert "Move User::changeable(By)Groups methods to UserGroupManager" [core] (wmf/1.36.0-wmf.28) - 10https://gerrit.wikimedia.org/r/660649 (https://phabricator.wikimedia.org/T273296) (owner: 10Urbanecm) [10:27:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 10%: Slowly pooling db1166 for the first time', diff saved to https://phabricator.wikimedia.org/P14089 and previous config saved to /var/cache/conftool/dbconfig/20210201-102736-root.json [10:27:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:43] PROBLEM - Check systemd state on ms-be2028 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:30:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 60%: Slowly pooling db1175 for the first time', diff saved to https://phabricator.wikimedia.org/P14090 and previous config saved to /var/cache/conftool/dbconfig/20210201-103055-root.json [10:30:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:03] (03CR) 10Ayounsi: [C: 03+1] profile: update netdev to output ECS-formatted logs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/647029 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [10:31:35] Majavah: should be testable at mwdebug1003 [10:32:18] Urbanecm: I'm unable to test for the next 10-15 mins, sorry [10:32:25] (eating) [10:32:28] np, I'll do it :) [10:33:00] (03PS1) 10ArielGlenn: add a proper media section to the deployment-prep dumps config file [puppet] - 10https://gerrit.wikimedia.org/r/660779 (https://phabricator.wikimedia.org/T269377) [10:33:04] 10SRE, 10MW-on-K8s, 10observability, 10serviceops: Add support for scraping php applications to the kubernetes prometheus scraper - https://phabricator.wikimedia.org/T271822 (10fgiunchedi) I dug a bit into upstream Prometheus issues and this is relevant: https://github.com/prometheus/prometheus/issues/3756... [10:34:22] Urbanecm: posted about the train to wikitech-l [10:34:30] group0 as soon as those patches are merged [10:34:41] will do group 1 this afternoon after the backport window [10:35:04] and the rest of the wikis later tonight (which I guess I will hand off to USA people) [10:35:41] patchers work for me, syncing [10:36:29] \o/ [10:37:00] (03CR) 10Lucas Werkmeister (WMDE): "That would be great, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/659945 (https://phabricator.wikimedia.org/T264883) (owner: 10Lucas Werkmeister (WMDE)) [10:37:22] (03CR) 10ArielGlenn: [C: 03+2] add a proper media section to the deployment-prep dumps config file [puppet] - 10https://gerrit.wikimedia.org/r/660779 (https://phabricator.wikimedia.org/T269377) (owner: 10ArielGlenn) [10:38:39] (03CR) 10Jbond: "updated, would also be good if someone (cas?) could show me the best way to test this and https://gerrit.wikimedia.org/r/c/operations/soft" (032 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/654439 (https://phabricator.wikimedia.org/T265904) (owner: 10Jbond) [10:39:40] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.28/includes/user//User.php: Fixing T273317 T273296 (duration: 00m 58s) [10:39:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:45] T273296: Some administrators can no longer add/remove users from the autochecked/editor user groups - https://phabricator.wikimedia.org/T273296 [10:39:45] T273317: some users with access are unable to configure pending changes - https://phabricator.wikimedia.org/T273317 [10:40:20] ^^will take more syncs^^ [10:41:04] !log urbanecm@deploy1001 sync-file aborted: Fixing T273317 T273296 (duration: 00m 12s) [10:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:20] actually, why I just don't sync it at once... 
[10:41:27] RECOVERY - Check systemd state on ms-be2028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:41:32] it's undeployed version, it's not hit by anything [10:42:07] PROBLEM - Docker registry HTTPS interface on registry2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker [10:42:09] {{doing}} [10:42:14] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.28/includes/: Fixing T273317 T273296 (duration: 01m 01s) [10:42:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:27] (03CR) 10Noa wmde: [C: 03+1] wikidata: post edit constraint jobs on 40% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660774 (https://phabricator.wikimedia.org/T204031) (owner: 10Rosalie Perside (WMDE)) [10:42:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 12%: Slowly pooling db1166 for the first time', diff saved to https://phabricator.wikimedia.org/P14091 and previous config saved to /var/cache/conftool/dbconfig/20210201-104240-root.json [10:42:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:55] PROBLEM - SSH on ms-be2041 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:43:23] hashar: patches synced. 
I think group0 can be finally promoted nwo [10:43:25] *now [10:43:46] awesome, doing it now [10:44:10] * Majavah back [10:44:31] RECOVERY - Docker registry HTTPS interface on registry2002 is OK: HTTP OK: HTTP/1.1 200 OK - 2581 bytes in 0.259 second response time https://wikitech.wikimedia.org/wiki/Docker [10:44:53] RECOVERY - Check systemd state on ms-be2034 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:44:58] thanks hashar [10:45:10] Promote group0 from 1.36.0-wmf.27 to 1.36.0-wmf.27 [y/N] n [10:45:11] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on ms-be1047.eqiad.wmnet with reason: reboot [10:45:12] ... [10:45:12] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ms-be1047.eqiad.wmnet with reason: reboot [10:45:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:21] RECOVERY - SSH on ms-be2041 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:45:44] (03PS1) 10Hashar: testwikis wikis to 1.36.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660780 [10:45:46] (03CR) 10Hashar: [C: 03+2] testwikis wikis to 1.36.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660780 (owner: 10Hashar) [10:45:49] gotta do testwikis first [10:45:53] thanks [10:46:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 75%: Slowly pooling db1175 for the first time', diff saved to https://phabricator.wikimedia.org/P14092 and previous config saved to /var/cache/conftool/dbconfig/20210201-104559-root.json [10:46:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:02] hashar: ping me once it's synced, so I can verify it works [10:46:21] (I tested already, but there's never enough of 
tests) [10:46:34] (03Merged) 10jenkins-bot: testwikis wikis to 1.36.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660780 (owner: 10Hashar) [10:46:55] !log hashar@deploy1001 Started scap: testwikis wikis to 1.36.0-wmf.28 [10:46:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:10] which runs a full sync bah [10:47:16] scap? [10:47:27] oh, is that automated by a script hashar ? [10:47:31] yeah [10:47:39] i see [10:48:14] shouldn't take _that_ long, considering hosts have already their cache builded, etc. [10:48:41] it also rebuilds the l10n cache [10:48:50] guess I should have run sync wikiversions instead bah [10:48:56] even if it's already there? [10:49:22] apparently yes [10:49:26] :( [10:49:29] it is not that long though [10:49:42] for the record, there's no non-testwiki group0 wiki that was affected by the bug [10:51:26] (03CR) 10Urbanecm: [C: 03+2] "no-op for prod" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658076 (https://phabricator.wikimedia.org/T272608) (owner: 10Urbanecm) [10:52:02] it took like 7 mins or so when running on Friday to fix the re-imaged servers [10:52:20] (03Merged) 10jenkins-bot: [beta] Initial configuration for votewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658076 (https://phabricator.wikimedia.org/T272608) (owner: 10Urbanecm) [10:53:52] (03PS1) 10ArielGlenn: Make media lists dump easily runnable in deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/660781 (https://phabricator.wikimedia.org/T269377) [10:54:25] !log hashar@deploy1001 Finished scap: testwikis wikis to 1.36.0-wmf.28 (duration: 07m 48s) [10:54:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:36] (03CR) 10ArielGlenn: [C: 03+2] "Already tested in deloyment-prep and works as advertised." 
[puppet] - 10https://gerrit.wikimedia.org/r/660781 (https://phabricator.wikimedia.org/T269377) (owner: 10ArielGlenn) [10:57:22] guess I can do group0 now [10:57:42] 10SRE, 10Dumps-Generation, 10Platform Engineering, 10serviceops, 10Patch-For-Review: Upgrade snapshot hosts to Buster - https://phabricator.wikimedia.org/T269377 (10ArielGlenn) I have tested in deployment-prep all of the "other" dumps (not xml/sql) except for the wikidata and adds-changes dumps. Those ar... [10:57:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 15%: Slowly pooling db1166 for the first time', diff saved to https://phabricator.wikimedia.org/P14093 and previous config saved to /var/cache/conftool/dbconfig/20210201-105743-root.json [10:57:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:10] is test2wiki in group0 or in group1? [10:59:50] Majavah: group0 [10:59:54] (03PS2) 10Ladsgroup: Add sources to specialSiteLinkGroups Wikibase setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655428 (https://phabricator.wikimedia.org/T138332) [11:00:10] doing group 0 [11:00:24] Majavah: at least it should be... [11:01:03] test2wiki is still on .27 [11:01:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 100%: Slowly pooling db1175 for the first time', diff saved to https://phabricator.wikimedia.org/P14094 and previous config saved to /var/cache/conftool/dbconfig/20210201-110102-root.json [11:01:05] but https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/config/test2wiki.yaml seems to disagree? [11:01:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:14] hashar: why is test2wiki in group1? 
[11:02:11] (03PS1) 10Hashar: group0 wikis to 1.36.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660782 [11:02:13] (03CR) 10Hashar: [C: 03+2] group0 wikis to 1.36.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660782 (owner: 10Hashar) [11:02:17] (03CR) 10Volans: customscripts/interface_automation: skipp slaac addresses (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/654439 (https://phabricator.wikimedia.org/T265904) (owner: 10Jbond) [11:02:18] Urbanecm: I have no idea :D [11:02:36] I was convinced all testwikis are supposed to be in group0 [11:02:39] but seemingly not? [11:02:44] maybe so that after testwikis get promoted we still have a test wiki using the N-1 version [11:02:45] `git blame` doesn't even help :-( [11:02:55] so at somepoint we have testwiki @ N and testwiki2 @ N-1 [11:03:03] (03PS1) 10Urbanecm: [beta] Add vote.wikimedia.beta.wmflabs.org to beta_sites [puppet] - 10https://gerrit.wikimedia.org/r/660783 (https://phabricator.wikimedia.org/T272608) [11:03:10] (03Merged) 10jenkins-bot: group0 wikis to 1.36.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660782 (owner: 10Hashar) [11:03:18] I dont remember the history of test2wiki [11:04:15] hashar: that kinda means I can't test it at all in group0 :( [11:04:32] (I mean, apart from I already did, manually touching wikiversions.php at mwdebug :D) [11:04:35] I can't find any history with a quick phab search [11:04:41] (03CR) 10Elukey: [C: 03+1] cookbooks: force title to be one line [software/spicerack] - 10https://gerrit.wikimedia.org/r/660777 (owner: 10Volans) [11:04:47] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.28 [11:04:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:53] Majavah: I feel like it broke with the migration to .yaml files [11:05:13] (03CR) 10Volans: [C: 03+2] cookbooks: force title to be one line [software/spicerack] - 
10https://gerrit.wikimedia.org/r/660777 (owner: 10Volans) [11:05:31] Urbanecm: correct [11:06:35] I'll fill a task :/ [11:07:30] (03PS1) 10JMeybohm: Add an check for numeric USER instruction in Dockerfile [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/660784 (https://phabricator.wikimedia.org/T228967) [11:08:42] Majavah: filled T273435 [11:08:43] T273435: test2wiki is in group1 rather than group0 - https://phabricator.wikimedia.org/T273435 [11:09:05] (03CR) 10JMeybohm: Add an check for numeric USER instruction in Dockerfile (031 comment) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/660784 (https://phabricator.wikimedia.org/T228967) (owner: 10JMeybohm) [11:09:12] (03CR) 10jerkins-bot: [V: 04-1] Add an check for numeric USER instruction in Dockerfile [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/660784 (https://phabricator.wikimedia.org/T228967) (owner: 10JMeybohm) [11:09:27] hashar: so, I guess we're done for now? [11:09:34] (03CR) 10Arturo Borrero Gonzalez: "I wasn't aware of this ip_lib.py hack. Do we have docs somewhere?" 
[puppet] - 10https://gerrit.wikimedia.org/r/660085 (https://phabricator.wikimedia.org/T261134) (owner: 10Andrew Bogott) [11:10:36] (03PS2) 10JMeybohm: Add an check for numeric USER instruction in Dockerfile [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/660784 (https://phabricator.wikimedia.org/T228967) [11:12:05] (03CR) 10jerkins-bot: [V: 04-1] Add an check for numeric USER instruction in Dockerfile [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/660784 (https://phabricator.wikimedia.org/T228967) (owner: 10JMeybohm) [11:12:21] (03PS3) 10JMeybohm: Add an check for numeric USER instruction in Dockerfile [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/660784 (https://phabricator.wikimedia.org/T228967) [11:12:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 20%: Slowly pooling db1166 for the first time', diff saved to https://phabricator.wikimedia.org/P14095 and previous config saved to /var/cache/conftool/dbconfig/20210201-111246-root.json [11:12:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:19] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] wikidata: post edit constraint jobs on 40% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660774 (https://phabricator.wikimedia.org/T204031) (owner: 10Rosalie Perside (WMDE)) [11:13:22] (03CR) 10Elukey: [C: 03+2] [beta] Add vote.wikimedia.beta.wmflabs.org to beta_sites [puppet] - 10https://gerrit.wikimedia.org/r/660783 (https://phabricator.wikimedia.org/T272608) (owner: 10Urbanecm) [11:13:44] (03CR) 10Ayounsi: [C: 03+1] "LGTM!" [homer/public] - 10https://gerrit.wikimedia.org/r/659312 (https://phabricator.wikimedia.org/T271476) (owner: 10Arturo Borrero Gonzalez) [11:13:53] Thanks elukey :) [11:14:07] np! let me know if you have issues [11:14:26] Urbanecm: yes! 
thx ;) [11:14:42] Urbanecm: will do group1 this afternoon after the backport window [11:14:46] (03Merged) 10jenkins-bot: cookbooks: force title to be one line [software/spicerack] - 10https://gerrit.wikimedia.org/r/660777 (owner: 10Volans) [11:14:58] hashar: okay, sounds good [11:16:26] Urbanecm: https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#editautopatrolprotected_level_of_protection was your change, can you reply there? [11:16:53] Majavah: will look shortly [11:16:59] thanks [11:17:05] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "Make sure @Jbond reviews this change! :-)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/641207 (owner: 10David Caro) [11:20:23] PROBLEM - Check systemd state on ms-be2034 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:20:51] Urbanecm: and thank you for all the patches! [11:20:53] PROBLEM - Check systemd state on ms-be2030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:21:03] hashar: happy to help :) [11:21:14] patches were the easiest part, knowing why the hell it happens was harder :D [11:24:52] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "See my comment on the default value. Otherwise, LGTM!" (031 comment) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/660784 (https://phabricator.wikimedia.org/T228967) (owner: 10JMeybohm) [11:25:13] PROBLEM - Check systemd state on ms-be2034 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:27:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 25%: Slowly pooling db1166 for the first time', diff saved to https://phabricator.wikimedia.org/P14096 and previous config saved to /var/cache/conftool/dbconfig/20210201-112750-root.json [11:27:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:55] (03PS4) 10JMeybohm: Add an check for numeric USER instruction in Dockerfile [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/660784 (https://phabricator.wikimedia.org/T228967) [11:28:45] !log push pfw policies - T272073 [11:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:04] jan_drewniak: #bothumor I � Unicode. All rise for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210201T1130). [11:32:08] (03PS1) 10Urbanecm: [beta] Add wg(Canonical)Server for votewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660806 (https://phabricator.wikimedia.org/T272608) [11:32:27] (03CR) 10Urbanecm: [C: 03+2] "no-op for prod" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660806 (https://phabricator.wikimedia.org/T272608) (owner: 10Urbanecm) [11:33:19] (03Merged) 10jenkins-bot: [beta] Add wg(Canonical)Server for votewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660806 (https://phabricator.wikimedia.org/T272608) (owner: 10Urbanecm) [11:33:43] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660807 (https://phabricator.wikimedia.org/T128546) [11:35:44] (03CR) 10BBlack: [C: 03+1] "The C bits look sane to me, and comparable in style/standards/risks to the existing geoip code. 
My mental parser doesn't see any new runt" [puppet] - 10https://gerrit.wikimedia.org/r/630316 (https://phabricator.wikimedia.org/T263496) (owner: 10CDanis) [11:37:06] (03CR) 10BBlack: [C: 03+1] geoip VCL: add a 'which' param to get_geo_xcip [puppet] - 10https://gerrit.wikimedia.org/r/630315 (https://phabricator.wikimedia.org/T263496) (owner: 10CDanis) [11:37:12] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "ship it!" (031 comment) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/660784 (https://phabricator.wikimedia.org/T228967) (owner: 10JMeybohm) [11:39:48] (03CR) 10BBlack: [C: 03+1] geoip VCL: init/free functions are now reusable [puppet] - 10https://gerrit.wikimedia.org/r/630314 (https://phabricator.wikimedia.org/T263496) (owner: 10CDanis) [11:41:44] PROBLEM - LVS swift-https codfw port 443/tcp - Swift/Ceph media storage IPv4 #page on ms-fe.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [11:42:07] * volans here [11:42:08] <_joe_> uh [11:42:10] checking [11:42:18] I'm here too [11:42:22] o/ [11:42:26] <_joe_> was anyone working on codfw's swift? [11:42:29] <_joe_> I guess not [11:42:31] not me [11:42:37] PROBLEM - SSH on ms-be2033 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:42:42] I launched a rebalance in codfw swift this morning, might be that [11:42:50] thumbor codfw also down? 
[11:42:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 30%: Slowly pooling db1166 for the first time', diff saved to https://phabricator.wikimedia.org/P14097 and previous config saved to /var/cache/conftool/dbconfig/20210201-114254-root.json [11:42:55] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660807 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [11:42:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:01] <_joe_> Feb 1 11:42:50 lvs2009 pybal[27522]: [swift_80] ERROR: Monitoring instance ProxyFetch reports server ms-fe2006.codfw.wmnet (enabled/up/pooled) down: Getting http://localhost/monitoring/backend took longer than 5 seconds. [11:43:07] <_joe_> yeah it's being very slow [11:43:20] hi [11:43:21] <_joe_> now, do we serve active/active from swift? I don't remember [11:43:29] _joe_, yes, normally [11:43:38] <_joe_> I woudl first of all change that [11:43:43] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660807 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [11:43:53] yeah depooling sounds good to me [11:44:04] I guess we get an early test [11:44:08] <_joe_> ok, this will be only a depool from traffic [11:44:16] RECOVERY - LVS swift-https codfw port 443/tcp - Swift/Ceph media storage IPv4 #page on ms-fe.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 396 bytes in 0.140 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [11:44:17] <_joe_> not from writing from mediawiki [11:44:27] recovery? [11:44:29] yeah [11:44:33] <_joe_> (the problem is still there, the recovery is just a byproduct) [11:44:45] yeah, checking if we have good latency graphs [11:44:45] <_joe_> Feb 1 11:44:40 lvs2009 pybal[27522]: [swift_80] ERROR: Could not depool server ms-fe2008.codfw.wmnet because of too many down! 
[11:45:40] object availability also got reduced [11:46:07] _joe_: depool from traffic you mean external dns or discovery ? [11:46:07] RECOVERY - Check systemd state on ms-be2034 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:46:25] load doubled, network went 7x [11:46:38] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:660807| Bumping portals to master (T128546)]] (duration: 01m 14s) [11:46:41] RECOVERY - Check systemd state on ms-be2030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:46:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:43] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [11:46:50] <_joe_> godog: I meant from ats [11:47:01] discovery [11:47:03] <_joe_> but I can't find it right now, hiera stuff got moved around it seems [11:47:14] <_joe_> bblack: ats is using discovery? [11:47:17] yes [11:47:21] for everything AFAIK [11:47:28] yes [11:47:42] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:660807| Bumping portals to master (T128546)]] (duration: 01m 04s) [11:47:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:56] target: http://upload.wikimedia.org [11:47:57] replacement: https://swift.discovery.wmnet [11:48:01] ^ from ats config [11:48:05] yep [11:49:00] +1 on depooling from discovery, if there's consensus I'll go ahead [11:49:27] +1 [11:49:31] +1 for me I can't think why not [11:49:35] !log oblivian@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=^swift,name=codfw [11:49:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:00] ah nevermind, _joe_ is on it [11:50:35] 300s ttl for all the atses to notice, too. 
[11:50:38] !log oblivian@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=swift-ro,name=codfw [11:50:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:00] <_joe_> yeah but still [11:51:05] <_joe_> better than the alternative :) [11:51:11] yeah [11:51:12] I would like to know later how lvs for ms-fes is configured, load graphs are weird [11:51:15] <_joe_> and ofc [11:51:18] <_joe_> the moment I do this [11:51:29] <_joe_> everything goes back up for 1 minute :D [11:53:07] <_joe_> jynus: right now we keep depooling/repooling servers, so the connections are quite unevenly distributed [11:53:25] yes, that is what I observed on the graphs [11:53:51] <_joe_> is anyone looking at swift itself? [11:54:25] I am, things are slower due to the rebalance, although it isn't the first time we rebalance this way [11:54:39] it looked as if ms-fe2006 was for some time the only available frontend [11:55:05] but load is spiking up and down [11:55:49] and now most of the load moved to ms-fe2008 [11:56:00] but yes a few slow to respond ms-be hosts as expected, looking at /var/log/swift/server.log that is [11:57:05] is load going down now everywhere because of depool? [11:57:11] <_joe_> godog: the issue is, fetching /monitoring/backend takes too long [11:57:18] <_joe_> jynus: it should! [11:57:35] <_joe_> by now it should mostly be depooled I think? 
[11:57:54] (03PS2) 10Hnowlan: profile::maps::tlsproxy: add_ecdhe_curve toggle [puppet] - 10https://gerrit.wikimedia.org/r/659950 (https://phabricator.wikimedia.org/T238753) [11:57:57] prometheus + grafana has a bit of lag, you know :-) [11:57:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 50%: Slowly pooling db1166 for the first time', diff saved to https://phabricator.wikimedia.org/P14098 and previous config saved to /var/cache/conftool/dbconfig/20210201-115757-root.json [11:58:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:03] almost fully drained yes [11:58:35] <_joe_> godog: so what is /monitoring/backend doing, and why it takes 5 seconds or more is our next question [11:58:37] _joe_: indeed, and the backend host for that is slow to respond, although it shouldn't rely on a single backend [11:58:59] <_joe_> because at this point I suspect we're monitoring the wrong url on the proxies :) [12:00:03] it is a difficult task to proxy-monitoring a proxy, when some of the inner backends may be unstable [12:00:04] <_joe_> and indeed, now it's instantaneous [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for European mid-day backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210201T1200). [12:00:04] matthiasmullie, Lucas_WMDE, and noa_wmde: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:25] <_joe_> jynus: we should be monitoring the proxy availability from pybal [12:00:32] \o/ [12:00:33] o/ [12:00:35] RECOVERY - SSH on ms-be2033 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:00:44] no deploys right now, right? 
[12:00:47] <_joe_> the proxy itself should then monitor the health of the backends independently [12:01:10] _joe_: may we start with the deploy window? :) [12:01:38] <_joe_> I'm not the only one calling the shots, but I don't see a reason to stop deployments. Anyone disagrees? [12:01:43] <_joe_> jynus bblack godog? [12:01:55] +1 to continue [12:01:56] yeah I think we're ok to go with the deploy window [12:01:58] +1 [12:01:59] ok [12:02:06] yeah +1 [12:02:10] (03PS2) 10Volans: templates: refactor macros and extends [homer/public] - 10https://gerrit.wikimedia.org/r/659235 [12:02:12] <_joe_> so my point is [12:02:12] (03PS1) 10Volans: *sw: move generic section at the top [homer/public] - 10https://gerrit.wikimedia.org/r/660814 [12:02:14] (03PS1) 10Volans: cr: fix indentation of generated config [homer/public] - 10https://gerrit.wikimedia.org/r/660815 [12:02:15] thanks :) [12:02:25] Lucas_WMDE: I assume you'll lead this window? :) [12:02:39] <_joe_> anyways, we can talk in #sre given we're out of the "incident" [12:02:50] (03PS2) 10ArielGlenn: Update minimum expected file size for lexeme JSON dumps [puppet] - 10https://gerrit.wikimedia.org/r/659945 (https://phabricator.wikimedia.org/T264883) (owner: 10Lucas Werkmeister (WMDE)) [12:03:31] (03CR) 10Jbond: "lgtm som minor nits" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/641207 (owner: 10David Caro) [12:03:33] (03CR) 10Ayounsi: [C: 03+1] *sw: move generic section at the top [homer/public] - 10https://gerrit.wikimedia.org/r/660814 (owner: 10Volans) [12:03:38] (03CR) 10Volans: "The same work should be done on the other templates too to make the generated configuration more readable and comparable with the one dump" [homer/public] - 10https://gerrit.wikimedia.org/r/660815 (owner: 10Volans) [12:03:44] should I not be merging in puppet right now? 
[12:03:55] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:04:52] Urbanecm: sure [12:05:02] thanks :) [12:05:07] matthiasmullie: can you self-deploy or should I do it? [12:05:33] Lucas_WMDE: I can do my own; shouldn't take long, they don't need any testing - they're just prep, setting (currently unused) vars ahead of time [12:05:59] I'll go ahead [12:06:13] (03PS2) 10Matthias Mullie: [WikibaseMediaInfo] MediaSearch: new set of heuristics for alternative implementation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658589 (https://phabricator.wikimedia.org/T271532) [12:06:24] (03CR) 10Matthias Mullie: [C: 03+2] [WikibaseMediaInfo] MediaSearch: new set of heuristics for alternative implementation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658589 (https://phabricator.wikimedia.org/T271532) (owner: 10Matthias Mullie) [12:06:35] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:07:05] okay! 
[12:07:17] (03Merged) 10jenkins-bot: [WikibaseMediaInfo] MediaSearch: new set of heuristics for alternative implementation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658589 (https://phabricator.wikimedia.org/T271532) (owner: 10Matthias Mullie) [12:07:26] (03PS4) 10Matthias Mullie: Add global to indicate that elastic LTR features are available [mediawiki-config] - 10https://gerrit.wikimedia.org/r/646663 [12:07:34] (03CR) 10Matthias Mullie: [C: 03+2] "Ready for deployment" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/646663 (owner: 10Matthias Mullie) [12:07:39] (03CR) 10ArielGlenn: [C: 03+2] Update minimum expected file size for lexeme JSON dumps [puppet] - 10https://gerrit.wikimedia.org/r/659945 (https://phabricator.wikimedia.org/T264883) (owner: 10Lucas Werkmeister (WMDE)) [12:07:51] \o/ [12:08:32] (03Merged) 10jenkins-bot: Add global to indicate that elastic LTR features are available [mediawiki-config] - 10https://gerrit.wikimedia.org/r/646663 (owner: 10Matthias Mullie) [12:11:49] syncing [12:12:30] !log mlitn@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 9836287e0, 424efdcdb: [WikibaseMediaInfo] Set wgMediaInfoMediaSearchHasLtrPlugin & wgMediaInfoMediaSearchConceptChipsSimpleHeuristics (duration: 01m 10s) [12:12:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:39] Lucas_WMDE: I'm done - want me to do yours as well, or will you do it yourself? 
[12:13:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 60%: Slowly pooling db1166 for the first time', diff saved to https://phabricator.wikimedia.org/P14099 and previous config saved to /var/cache/conftool/dbconfig/20210201-121301-root.json [12:13:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:28] matthiasmullie: I’ll do it myself, thanks :) [12:13:39] (03PS3) 10Lucas Werkmeister (WMDE): wikidata: post edit constraint jobs on 40% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660774 (https://phabricator.wikimedia.org/T204031) (owner: 10Rosalie Perside (WMDE)) [12:13:41] ok, the floor is yours then! [12:13:56] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] wikidata: post edit constraint jobs on 40% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660774 (https://phabricator.wikimedia.org/T204031) (owner: 10Rosalie Perside (WMDE)) [12:14:33] alright! first deployment I’m doing via YubiKey :) [12:14:58] (03Merged) 10jenkins-bot: wikidata: post edit constraint jobs on 40% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660774 (https://phabricator.wikimedia.org/T204031) (owner: 10Rosalie Perside (WMDE)) [12:15:26] quickly testing on mwdebug1001 that nothing blows up [12:15:31] (proper testing not possible) [12:16:18] seems fine, syncing [12:17:14] Lucas_WMDE: just out of curiosity, what's your OS? 
[12:17:20] Ubuntu [12:17:23] aha [12:17:25] 20.10 [12:17:45] I'm on windows with wsl and all my attempts to get yubikey actually function as a smart card failed so far :/ [12:18:04] damn :( [12:19:03] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:660774|wikidata: post edit constraint jobs on 40% of edits (T204031)]] (duration: 01m 03s) [12:19:05] yeah I haven't got my yubikey working with SSH on windows [12:19:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:08] T204031: Deploy regular running of wikidata constraint checks using the job queue - https://phabricator.wikimedia.org/T204031 [12:19:20] Majavah: at least I'm not the only one :D [12:19:20] (mine works as a smart card for Windows login) [12:19:41] * Urbanecm uses Microsoft account for signing in, so not compatible AFAIK [12:20:02] I think we’re done with the backport window? [12:20:06] I have local AD :D [12:20:13] Majavah: :D [12:20:17] :D [12:20:24] Lucas_WMDE: sounds so :) [12:20:34] !log EU backport&config window done [12:20:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:53] (03CR) 10Urbanecm: [C: 03+2] Publish logos.php at noc.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/659433 (https://phabricator.wikimedia.org/T273330) (owner: 10Urbanecm) [12:21:00] let's just get this beast out as well [12:21:27] I have one more config change I’d like to deploy soon, but that should wait until wmf.28 is rolled out [12:21:33] so probably later today [12:21:49] (03Merged) 10jenkins-bot: Publish logos.php at noc.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/659433 (https://phabricator.wikimedia.org/T273330) (owner: 10Urbanecm) [12:22:03] Lucas_WMDE: hash.ar said he'll roll out wmf.28 to group1 after B&C, and leave group2 to the US people [12:22:58] yup, I saw that [12:23:02] group1 should be good enough for me [12:23:16] assuming we won't have 
to rollback again :P [12:23:53] 🙏 [12:24:00] !log urbanecm@deploy1001 Synchronized docroot/noc/conf/logos.php.txt: ec5b6d221b50d0b3807242d7a8869f97e6cbdbef: Publish logos.php at noc.wikimedia.org (1/2; T273330) (duration: 01m 04s) [12:24:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:04] T273330: Publish logos.php at noc.wikimedia.org - https://phabricator.wikimedia.org/T273330 [12:24:37] (03CR) 10Hamish: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660795 (owner: 10Hamish) [12:25:18] (03CR) 10Urbanecm: [C: 03+2] logos: Update dewiki from Commons and recompress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660620 (owner: 10Legoktm) [12:25:21] !log urbanecm@deploy1001 Synchronized docroot/noc/createTxtFileSymlinks.sh: ec5b6d221b50d0b3807242d7a8869f97e6cbdbef: Publish logos.php at noc.wikimedia.org (2/2; T273330) (duration: 01m 05s) [12:25:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:57] (03PS3) 10Hamish: Allow sysop to add/remove transwiki for zhwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660795 (https://phabricator.wikimedia.org/T273405) [12:26:13] (03Merged) 10jenkins-bot: logos: Update dewiki from Commons and recompress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660620 (owner: 10Legoktm) [12:27:43] (03CR) 10Urbanecm: [C: 03+2] logos: Update frwiki from Commons and recompress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660621 (owner: 10Legoktm) [12:27:45] (03CR) 10Urbanecm: [C: 03+2] logos: Update plwiki from Commons and recompress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660622 (owner: 10Legoktm) [12:27:57] (03PS2) 10Urbanecm: logos: Update frwiki from Commons and recompress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660621 (owner: 10Legoktm) [12:28:05] (03CR) 10Urbanecm: [C: 03+2] logos: Update frwiki from Commons and recompress [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/660621 (owner: 10Legoktm) [12:28:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 75%: Slowly pooling db1166 for the first time', diff saved to https://phabricator.wikimedia.org/P14100 and previous config saved to /var/cache/conftool/dbconfig/20210201-122804-root.json [12:28:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:37] (03PS2) 10Urbanecm: logos: Update plwiki from Commons and recompress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660622 (owner: 10Legoktm) [12:28:44] (03CR) 10Urbanecm: [C: 03+2] logos: Update plwiki from Commons and recompress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660622 (owner: 10Legoktm) [12:28:50] (03PS2) 10Urbanecm: logos: Update itwiki from Commons and recompress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660623 (owner: 10Legoktm) [12:28:54] (03CR) 10Urbanecm: [C: 03+2] logos: Update itwiki from Commons and recompress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660623 (owner: 10Legoktm) [12:29:01] (03CR) 10jerkins-bot: [V: 04-1] logos: Update plwiki from Commons and recompress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660622 (owner: 10Legoktm) [12:30:45] (03PS2) 10Urbanecm: logos: Update jawiki from Commons and recompress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660624 (owner: 10Legoktm) [12:30:51] (03CR) 10Urbanecm: [C: 03+2] logos: Update jawiki from Commons and recompress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660624 (owner: 10Legoktm) [12:31:33] (03Merged) 10jenkins-bot: logos: Update plwiki from Commons and recompress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660622 (owner: 10Legoktm) [12:31:35] (03Merged) 10jenkins-bot: logos: Update itwiki from Commons and recompress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660623 (owner: 10Legoktm) [12:32:21] (03Merged) 10jenkins-bot: logos: Update jawiki from Commons and recompress 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/660624 (owner: 10Legoktm) [12:34:04] !log urbanecm@deploy1001 Synchronized logos/config.yaml: Regenerate a couple of logos from Commons (1/2) (duration: 01m 07s) [12:34:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:29] !log urbanecm@deploy1001 Synchronized static/images/project-logos/: Regenerate a couple of logos from Commons (2/2) (duration: 01m 08s) [12:35:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:45] (03PS2) 10Urbanecm: ombudsmenwiki: Set sitename to "Ombuds Commission" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660002 (https://phabricator.wikimedia.org/T273323) [12:35:50] (03CR) 10Urbanecm: [C: 03+2] ombudsmenwiki: Set sitename to "Ombuds Commission" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660002 (https://phabricator.wikimedia.org/T273323) (owner: 10Urbanecm) [12:36:08] (03PS2) 10Urbanecm: Update ombudsmenwiki logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660000 (https://phabricator.wikimedia.org/T273323) [12:36:12] (03CR) 10Urbanecm: [C: 03+2] Update ombudsmenwiki logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660000 (https://phabricator.wikimedia.org/T273323) (owner: 10Urbanecm) [12:36:45] (03Merged) 10jenkins-bot: ombudsmenwiki: Set sitename to "Ombuds Commission" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660002 (https://phabricator.wikimedia.org/T273323) (owner: 10Urbanecm) [12:37:00] (03Merged) 10jenkins-bot: Update ombudsmenwiki logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660000 (https://phabricator.wikimedia.org/T273323) (owner: 10Urbanecm) [12:37:32] (03PS2) 10Jbond: customscripts/interface_automation: skipp slaac addresses [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/654439 (https://phabricator.wikimedia.org/T265904) [12:38:35] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 
cf349361b392f1831fe5ebce8fb544b035a83835: ombudsmenwiki: Set sitename to "Ombuds Commission" (T273323) (duration: 01m 06s) [12:38:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:40] T273323: Rename private "ombudsmenwiki" to "ombudswiki" and change the logo - https://phabricator.wikimedia.org/T273323 [12:38:58] (03PS3) 10Jbond: customscripts/interface_automation: skipp slaac addresses [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/654439 (https://phabricator.wikimedia.org/T265904) [12:40:18] !log urbanecm@deploy1001 Synchronized static/images/project-logos/: d70e8ac549145872c9d251cc78e6e40355029fc7: Update ombudsmenwiki logo (1/3) (duration: 01m 05s) [12:40:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:34] !log urbanecm@deploy1001 Synchronized logos/config.yaml: d70e8ac549145872c9d251cc78e6e40355029fc7: Update ombudsmenwiki logo (2/3) (duration: 01m 04s) [12:41:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:44] !log Purge 'https://en.wikipedia.org/static/images/project-logos/ombudsmenwiki.png' (T273323) [12:42:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:52] (03CR) 10MSantos: [C: 03+1] profile::maps::tlsproxy: add_ecdhe_curve toggle [puppet] - 10https://gerrit.wikimedia.org/r/659950 (https://phabricator.wikimedia.org/T238753) (owner: 10Hnowlan) [12:42:53] !log urbanecm@deploy1001 Synchronized wmf-config/logos.php: d70e8ac549145872c9d251cc78e6e40355029fc7: Update ombudsmenwiki logo (3/3) (duration: 01m 05s) [12:42:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 100%: Slowly pooling db1166 for the first time', diff saved to https://phabricator.wikimedia.org/P14102 and previous config saved to /var/cache/conftool/dbconfig/20210201-124308-root.json [12:43:11] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:40] (03PS1) 10ArielGlenn: make adds-changes dumps easier to test in deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/660819 (https://phabricator.wikimedia.org/T269377) [12:52:56] (03PS4) 10Jbond: customscripts/interface_automation: skipp slaac addresses [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/654439 (https://phabricator.wikimedia.org/T265904) [12:54:48] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27788/console" [puppet] - 10https://gerrit.wikimedia.org/r/659950 (https://phabricator.wikimedia.org/T238753) (owner: 10Hnowlan) [12:56:07] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27789/console" [puppet] - 10https://gerrit.wikimedia.org/r/659950 (https://phabricator.wikimedia.org/T238753) (owner: 10Hnowlan) [12:56:34] (03CR) 10Hnowlan: [V: 03+1 C: 03+2] profile::maps::tlsproxy: add_ecdhe_curve toggle [puppet] - 10https://gerrit.wikimedia.org/r/659950 (https://phabricator.wikimedia.org/T238753) (owner: 10Hnowlan) [12:59:02] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) @jcrespo be aware that you can proceed replacing db1095 with db1171 anytime. [12:59:38] (03CR) 10Jbond: "https://phabricator.wikimedia.org/P14101 shows a before and after run indicating that the slaac addresses are ignored" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/654439 (https://phabricator.wikimedia.org/T265904) (owner: 10Jbond) [13:01:36] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10jcrespo) Thanks for the notice! 
[13:05:05] (03PS4) 10Jbond: interface_automation: update is_primary logic. [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/654435 (https://phabricator.wikimedia.org/T265904) [13:06:57] (03PS5) 10Jbond: interface_automation: update is_primary logic. [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/654435 (https://phabricator.wikimedia.org/T265904) [13:08:19] (03PS2) 10ArielGlenn: make adds-changes dumps easier to test in deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/660819 (https://phabricator.wikimedia.org/T269377) [13:08:36] Deployment window is empty, I'm going to deploy l10n update [13:09:11] (03PS1) 10Ladsgroup: Add Multilingual Wikisource to list of Wikidata's special sites [extensions/WikimediaMessages] (wmf/1.36.0-wmf.28) - 10https://gerrit.wikimedia.org/r/660796 (https://phabricator.wikimedia.org/T138332) [13:09:18] (03CR) 10Ladsgroup: [C: 03+2] Add Multilingual Wikisource to list of Wikidata's special sites [extensions/WikimediaMessages] (wmf/1.36.0-wmf.28) - 10https://gerrit.wikimedia.org/r/660796 (https://phabricator.wikimedia.org/T138332) (owner: 10Ladsgroup) [13:11:55] (03CR) 10Jbond: "See the following for an example of the output once the change is applied" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/654435 (https://phabricator.wikimedia.org/T265904) (owner: 10Jbond) [13:16:52] Amir1: are you aware wmf.28 is at group0 now? [13:17:15] yeah [13:17:34] okay, just making sure you don't intend to deploy it everywhere :) [13:19:36] hashar: hey, when are you planning to move wmf.28 to group1. Now? [13:20:20] Amir1: see email. After the backport window so 40 minutes from now ;) [13:20:27] jouncebot: now [13:20:28] No deployments scheduled for the next 4 hour(s) and 39 minute(s) [13:20:42] or is the window down ? ;) [13:20:43] hashar: B&C is already over for 20 mins? [13:20:44] the backport window finished twenty minutes ago [13:21:08] damn timezones [13:21:13] it's fine. 
I can use the time to push a change to wmf.28 :D [13:21:17] hehe [13:21:18] It's good for me actually [13:21:27] Amir1: 40 minutes might not be enough for full scap [13:21:28] well up to you [13:21:48] I can do the promote right now [13:21:52] or after your backport [13:22:10] Urbanecm: yeah but it's not a full scap. I found a new thing [13:22:25] I would personally do the promote first, as it's faster than rebuild of l10n cache - but I'll let Amir1 to decide :) [13:22:33] sync-l10n [13:22:42] or that [13:22:45] hashar: yeah, let's promote now [13:22:55] mine is sloooow [13:23:15] sold [13:23:24] (03PS3) 10ArielGlenn: make adds-changes dumps easier to test in deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/660819 (https://phabricator.wikimedia.org/T269377) [13:23:34] (03PS1) 10Hashar: group1 wikis to 1.36.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660822 [13:23:36] (03CR) 10Hashar: [C: 03+2] group1 wikis to 1.36.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660822 (owner: 10Hashar) [13:23:57] I mixed up the time of the beginning and end of window bah [13:24:43] happens :) [13:24:45] (03Merged) 10jenkins-bot: group1 wikis to 1.36.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660822 (owner: 10Hashar) [13:25:01] I've mixed days [13:25:10] these is fine :D [13:26:26] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.36.0-wmf.28 [13:26:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:08] 10SRE, 10SRE-swift-storage: ms-fe.svc.codfw.wmnet paged during Swift rebalance - https://phabricator.wikimedia.org/T273453 (10fgiunchedi) [13:27:23] Your account does not have permission to change the stable version configuration. 
on test2wiki [13:27:30] !log hashar@deploy1001 Synchronized php: group1 wikis to 1.36.0-wmf.28 (duration: 01m 03s) [13:27:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:53] Urbanecm: shouldn't I be able to stabilise things on test2wiki now in .28? [13:29:07] we'll see [13:29:21] I don't think it worked in wmf.27 ftr [13:29:25] Amir1: Urbanecm done! :) [13:29:28] thanks hashar [13:29:46] I confirm I'm still able to change user groups at test2wiki [13:29:51] Thanks! [13:30:16] it gives me the option to grant `editor`, but I can't stabilise pages [13:30:33] Majavah: I'm 99% sure that didn't work in .27 too [13:30:37] => different bug [13:31:21] (03CR) 10jerkins-bot: [V: 04-1] Add Multilingual Wikisource to list of Wikidata's special sites [extensions/WikimediaMessages] (wmf/1.36.0-wmf.28) - 10https://gerrit.wikimedia.org/r/660796 (https://phabricator.wikimedia.org/T138332) (owner: 10Ladsgroup) [13:31:44] any objections to closing T273296 as resolved? [13:31:44] T273296: Some administrators can no longer add/remove users from the autochecked/editor user groups - https://phabricator.wikimedia.org/T273296 [13:32:01] also T273317 is marked as a blocker for this week [13:32:02] T273317: some users with access are unable to configure pending changes - https://phabricator.wikimedia.org/T273317 [13:32:14] Majavah: yes, don't close the task [13:32:24] it should be used to fix it for real [13:32:27] we "just" reverted it [13:32:35] I'll deprioritize stuff [13:32:56] ah, true, forgot that :D [13:36:03] (03PS1) 10Kormat: orchestrator: Unbias lag score [puppet] - 10https://gerrit.wikimedia.org/r/660823 [13:37:29] (03CR) 10Kormat: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27793/console" [puppet] - 10https://gerrit.wikimedia.org/r/660823 (owner: 10Kormat) [13:38:04] (03CR) 10Kormat: "Inspired by a comment @jcrespo made on irc :)" [puppet] - 
10https://gerrit.wikimedia.org/r/660823 (owner: 10Kormat) [13:38:19] Urbanecm: actually I think that I am able to stabilise things, it just needs a page that was already reviewed and you need to manually grant editor to review pages [13:38:30] ah, good [13:39:03] actually or just that you need editor to stabilise things [13:39:06] not actually sure [13:39:18] hehe [13:39:28] I granted myself editor and not I can stabilize things [13:39:34] why does sysop not include editor? [13:39:56] Majavah: no idea [13:39:57] (03Merged) 10jenkins-bot: Add Multilingual Wikisource to list of Wikidata's special sites [extensions/WikimediaMessages] (wmf/1.36.0-wmf.28) - 10https://gerrit.wikimedia.org/r/660796 (https://phabricator.wikimedia.org/T138332) (owner: 10Ladsgroup) [13:40:15] Majavah: but yes, confirmed, when I granted myself +editor, https://test2.wikipedia.org/wiki/Special:Stabilization/Main_Page works [13:40:31] without it, it's greyed out [13:40:36] I'll remove that task as a blocker too [13:40:39] okay [13:41:16] 10SRE, 10SRE-swift-storage, 10Sustainability (Incident Followup): ms-fe.svc.codfw.wmnet paged during Swift rebalance - https://phabricator.wikimedia.org/T273453 (10fgiunchedi) [13:42:11] (03CR) 10Jbond: [C: 03+2] ldap::client: use ensure_resources to install ldap-utils [puppet] - 10https://gerrit.wikimedia.org/r/659902 (owner: 10Jbond) [13:42:23] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:43:06] 10SRE, 10SRE-swift-storage, 10Sustainability (Incident Followup): ms-fe.svc.codfw.wmnet paged during Swift rebalance - https://phabricator.wikimedia.org/T273453 (10fgiunchedi) [13:44:57] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:45:29] (03PS4) 10ArielGlenn: make adds-changes dumps easier to test in deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/660819 (https://phabricator.wikimedia.org/T269377) [13:46:20] (03CR) 10ArielGlenn: [C: 03+2] make adds-changes dumps easier to test in deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/660819 (https://phabricator.wikimedia.org/T269377) (owner: 10ArielGlenn) [13:47:05] !log ladsgroup@deploy1001 scap sync-l10n completed (1.36.0-wmf.28) (duration: 00m 58s) [13:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:22] it took a minute, interesting [13:48:07] (03CR) 10Marostegui: "Nice, have you tried the query?" [puppet] - 10https://gerrit.wikimedia.org/r/660823 (owner: 10Kormat) [13:48:39] marostegui: what kind of untrusting question is that? :( [13:48:46] haha [13:48:53] Normally you do mention it! [13:49:06] ok fair :) yes, i've tried it a bunch. 
[13:49:23] (03CR) 10Marostegui: [C: 03+1] "trusting this blindly 100%" [puppet] - 10https://gerrit.wikimedia.org/r/660823 (owner: 10Kormat) [13:49:24] XD [13:49:27] lol [13:49:44] (03CR) 10Jcrespo: [C: 03+1] switchover: Work-around isolation level issue [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/658580 (https://phabricator.wikimedia.org/T272954) (owner: 10Kormat) [13:49:46] (03CR) 10Kormat: [C: 03+2] orchestrator: Unbias lag score [puppet] - 10https://gerrit.wikimedia.org/r/660823 (owner: 10Kormat) [13:50:38] (03CR) 10Kormat: [C: 03+2] switchover: Work-around isolation level issue [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/658580 (https://phabricator.wikimedia.org/T272954) (owner: 10Kormat) [13:50:54] !log ladsgroup@deploy1001 Started scap: [[gerrit:660796|Add Multilingual Wikisource to list of Wikidata's special sites]] (T138332) [13:50:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:59] T138332: Interwiki links to/from Multilingual Wikisource - https://phabricator.wikimedia.org/T138332 [13:52:55] (03Merged) 10jenkins-bot: switchover: Work-around isolation level issue [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/658580 (https://phabricator.wikimedia.org/T272954) (owner: 10Kormat) [13:53:14] (03PS1) 10Gerrit maintenance bot: Add mni to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/660826 (https://phabricator.wikimedia.org/T273457) [13:54:14] (03PS1) 10Gerrit maintenance bot: Add mni to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/660830 (https://phabricator.wikimedia.org/T273456) [13:56:26] (03CR) 10Volans: [C: 03+1] "LGTM" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/659251 (owner: 10Kormat) [13:57:06] (03CR) 10Kormat: [C: 03+2] Ensure we don't use a broken version of pip. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/659251 (owner: 10Kormat) [13:59:47] (03Merged) 10jenkins-bot: Ensure we don't use a broken version of pip. 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/659251 (owner: 10Kormat) [14:01:49] 10SRE, 10Data-Services, 10Traffic, 10netops, 10cloud-services-team (Kanban): wikireplicas last-minute infra work to discuss / resolve - https://phabricator.wikimedia.org/T273248 (10Volans) @bblack I'm not sure what happened on Netbox [[ https://netbox.wikimedia.org/extras/changelog/?q=208.80.154.242 | he... [14:05:47] (03CR) 10Volans: [C: 03+2] templates: refactor macros and extends [homer/public] - 10https://gerrit.wikimedia.org/r/659235 (owner: 10Volans) [14:05:56] (03CR) 10Volans: [C: 03+2] *sw: move generic section at the top [homer/public] - 10https://gerrit.wikimedia.org/r/660814 (owner: 10Volans) [14:06:17] (03Merged) 10jenkins-bot: templates: refactor macros and extends [homer/public] - 10https://gerrit.wikimedia.org/r/659235 (owner: 10Volans) [14:06:27] (03Merged) 10jenkins-bot: *sw: move generic section at the top [homer/public] - 10https://gerrit.wikimedia.org/r/660814 (owner: 10Volans) [14:10:47] 10SRE, 10Data-Persistence-Backup: Revert OpenSSL min version configuration introduced for bacula compatibility - https://phabricator.wikimedia.org/T273182 (10jcrespo) I didn't get any answer here or on the other ticket, so this is my plan now: * Add a conditional so the above code only affects jessie host (co... [14:12:34] !log ladsgroup@deploy1001 Finished scap: [[gerrit:660796|Add Multilingual Wikisource to list of Wikidata's special sites]] (T138332) (duration: 21m 52s) [14:12:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:39] T138332: Interwiki links to/from Multilingual Wikisource - https://phabricator.wikimedia.org/T138332 [14:14:37] (03CR) 10Volans: [C: 03+1] "LGTM but please test also some other random hosts with different setups to ensure we don't have any regression." 
(1 comment) [software/netbox-extras] - https://gerrit.wikimedia.org/r/654435 (https://phabricator.wikimedia.org/T265904) (owner: Jbond) [14:15:53] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:16:07] (CR) Volans: [C: +1] "LGTM" (1 comment) [software/netbox-extras] - https://gerrit.wikimedia.org/r/654439 (https://phabricator.wikimedia.org/T265904) (owner: Jbond) [14:16:36] (CR) Alexandros Kosiaris: [C: +1] Add an check for numeric USER instruction in Dockerfile [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/660784 (https://phabricator.wikimedia.org/T228967) (owner: JMeybohm) [14:17:47] PROBLEM - Check systemd state on ms-be2056 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:21:11] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:25:53] SRE, Data-Services, Traffic, netops, cloud-services-team (Kanban): wikireplicas last-minute infra work to discuss / resolve - https://phabricator.wikimedia.org/T273248 (BBlack) It looks like the /27 is what I manually created, and then the /32 was probably patched in later from puppetdb after... [14:28:24] SRE, Data-Services, Traffic, netops, cloud-services-team (Kanban): wikireplicas last-minute infra work to discuss / resolve - https://phabricator.wikimedia.org/T273248 (BBlack) The more interesting Netbox question here, is what the correct way is to define a new tagged virtual interface that...
[14:29:06] (PS1) Marostegui: orchestrator.conf: Do not discover labsdb* hosts [puppet] - https://gerrit.wikimedia.org/r/660839 (https://phabricator.wikimedia.org/T266483) [14:31:20] (CR) Kormat: [C: -1] orchestrator.conf: Do not discover labsdb* hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/660839 (https://phabricator.wikimedia.org/T266483) (owner: Marostegui) [14:33:29] (PS2) Marostegui: orchestrator.conf: Do not discover labsdb* hosts [puppet] - https://gerrit.wikimedia.org/r/660839 (https://phabricator.wikimedia.org/T266483) [14:33:35] RECOVERY - Check systemd state on ms-be2056 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:33:45] (PS5) Jbond: customscripts/interface_automation: skipp slaac addresses [software/netbox-extras] - https://gerrit.wikimedia.org/r/654439 (https://phabricator.wikimedia.org/T265904) [14:34:01] (CR) Jbond: customscripts/interface_automation: skipp slaac addresses (1 comment) [software/netbox-extras] - https://gerrit.wikimedia.org/r/654439 (https://phabricator.wikimedia.org/T265904) (owner: Jbond) [14:34:12] (PS6) Jbond: interface_automation: update is_primary logic. [software/netbox-extras] - https://gerrit.wikimedia.org/r/654435 (https://phabricator.wikimedia.org/T265904) [14:34:46] SRE, SRE-swift-storage, Sustainability (Incident Followup), User-fgiunchedi: ms-fe.svc.codfw.wmnet paged during Swift rebalance - https://phabricator.wikimedia.org/T273453 (fgiunchedi) [14:34:52] (CR) Kormat: [C: +1] "Looks good. I guess we'll find out eventually if it works as we hope :)" [puppet] - https://gerrit.wikimedia.org/r/660839 (https://phabricator.wikimedia.org/T266483) (owner: Marostegui) [14:35:16] (CR) Marostegui: "As soon as I clean up s4 heartbeat table..........."
[puppet] - https://gerrit.wikimedia.org/r/660839 (https://phabricator.wikimedia.org/T266483) (owner: Marostegui) [14:35:24] (CR) Marostegui: [C: +2] orchestrator.conf: Do not discover labsdb* hosts [puppet] - https://gerrit.wikimedia.org/r/660839 (https://phabricator.wikimedia.org/T266483) (owner: Marostegui) [14:36:07] (CR) Jbond: "> Patch Set 5: Code-Review+1" (1 comment) [software/netbox-extras] - https://gerrit.wikimedia.org/r/654435 (https://phabricator.wikimedia.org/T265904) (owner: Jbond) [14:36:19] (PS7) Jbond: interface_automation: update is_primary logic. [software/netbox-extras] - https://gerrit.wikimedia.org/r/654435 (https://phabricator.wikimedia.org/T265904) [14:39:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1147 T266483', diff saved to https://phabricator.wikimedia.org/P14104 and previous config saved to /var/cache/conftool/dbconfig/20210201-143925-marostegui.json [14:39:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:33] T266483: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 [14:40:03] !log Restart mysql on db1147 T266483 [14:40:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1147 (re)pooling @ 10%: Repool db1147 after a restart', diff saved to https://phabricator.wikimedia.org/P14105 and previous config saved to /var/cache/conftool/dbconfig/20210201-144604-root.json [14:46:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:26] (PS5) Giuseppe Lavagetto: Add an check for numeric USER instruction in Dockerfile [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/660784 (https://phabricator.wikimedia.org/T228967) (owner: JMeybohm) [14:50:28] (PS1) Giuseppe Lavagetto: Add the 'uid' template helper [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/660851
(https://phabricator.wikimedia.org/T228967) [14:50:30] (PS1) Giuseppe Lavagetto: Remove the build image functionality [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/660852 [14:52:40] (CR) jerkins-bot: [V: -1] Remove the build image functionality [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/660852 (owner: Giuseppe Lavagetto) [14:53:05] PROBLEM - Check systemd state on ms-be2034 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:53:17] !log oblivian@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=swift-ro,name=codfw [14:53:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:24] !log oblivian@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=swift,name=codfw [14:53:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:43] SRE, SRE-swift-storage, Sustainability (Incident Followup), User-fgiunchedi: ms-fe.svc.codfw.wmnet paged during Swift rebalance - https://phabricator.wikimedia.org/T273453 (fgiunchedi) [14:57:10] SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install mwlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T267271 (herron) Hey @Cmjohnson, when do you estimate this one will be racked and installed?
[14:57:24] (CR) Alexandros Kosiaris: [C: +1] "TIL, +1" [puppet] - https://gerrit.wikimedia.org/r/660379 (https://phabricator.wikimedia.org/T228967) (owner: JMeybohm) [15:00:16] (Abandoned) Arturo Borrero Gonzalez: cr/firewall.conf: cloud-in4: allow new wiki replicas TCP ports [homer/public] - https://gerrit.wikimedia.org/r/659312 (https://phabricator.wikimedia.org/T271476) (owner: Arturo Borrero Gonzalez) [15:00:31] SRE, Data-Persistence-Backup: Revert OpenSSL min version configuration introduced for bacula compatibility - https://phabricator.wikimedia.org/T273182 (jcrespo) I'm silly, I was totally convinced that the revert applied to clients. It does not, only to storage hosts, which is easier to revert. That also... [15:00:40] SRE, Data-Persistence-Backup: Revert OpenSSL min version configuration introduced for bacula compatibility - https://phabricator.wikimedia.org/T273182 (jcrespo) [15:00:42] SRE, serviceops: upgrade conf2* servers to stretch - https://phabricator.wikimedia.org/T271573 (jcrespo) [15:01:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1147 (re)pooling @ 20%: Repool db1147 after a restart', diff saved to https://phabricator.wikimedia.org/P14106 and previous config saved to /var/cache/conftool/dbconfig/20210201-150107-root.json [15:01:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:52] SRE, serviceops: upgrade conf2* servers to stretch - https://phabricator.wikimedia.org/T271573 (jcrespo) I just realized, after closer inspection, that the blocker is indeed real, and we need these in stretch or higher to revert T273182. Is there something I can do to help?
[15:02:54] (CR) Alexandros Kosiaris: [C: +1] calico: Typha needs to get endpoints to discover it's instances [deployment-charts] - https://gerrit.wikimedia.org/r/660399 (https://phabricator.wikimedia.org/T267653) (owner: JMeybohm) [15:05:28] (PS1) Filippo Giunchedi: swift: limit rsync service memory [puppet] - https://gerrit.wikimedia.org/r/660854 (https://phabricator.wikimedia.org/T221904) [15:05:30] (PS1) Filippo Giunchedi: swift: limit rsync to 10% memory in codfw [puppet] - https://gerrit.wikimedia.org/r/660855 (https://phabricator.wikimedia.org/T221904) [15:05:46] (PS3) Ottomata: Finalize NavigationTiming extension Event Platform migration [puppet] - https://gerrit.wikimedia.org/r/659288 (https://phabricator.wikimedia.org/T271208) [15:07:15] (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27794/console" [puppet] - https://gerrit.wikimedia.org/r/660854 (https://phabricator.wikimedia.org/T221904) (owner: Filippo Giunchedi) [15:07:21] (CR) jerkins-bot: [V: -1] swift: limit rsync to 10% memory in codfw [puppet] - https://gerrit.wikimedia.org/r/660855 (https://phabricator.wikimedia.org/T221904) (owner: Filippo Giunchedi) [15:08:36] (CR) Ottomata: [C: +2] Finalize NavigationTiming extension Event Platform migration [puppet] - https://gerrit.wikimedia.org/r/659288 (https://phabricator.wikimedia.org/T271208) (owner: Ottomata) [15:08:58] (PS1) Jcrespo: jessie: Revert openssl conf on director/storage to package defaults [puppet] - https://gerrit.wikimedia.org/r/660856 (https://phabricator.wikimedia.org/T273182) [15:09:00] (PS1) Jcrespo: jessie: Remove old openssl override after revert to package version [puppet] - https://gerrit.wikimedia.org/r/660857 (https://phabricator.wikimedia.org/T273182) [15:09:25] (CR) JMeybohm: [C: +1] Add the 'uid' template helper [docker-images/docker-pkg] -
https://gerrit.wikimedia.org/r/660851 (https://phabricator.wikimedia.org/T228967) (owner: Giuseppe Lavagetto) [15:09:58] (CR) Jcrespo: "The openssl.cnf has been taken from the latest Buster package version." [puppet] - https://gerrit.wikimedia.org/r/660856 (https://phabricator.wikimedia.org/T273182) (owner: Jcrespo) [15:10:08] (CR) JMeybohm: [C: +2] k8s::kubelet: Ensure apparmor is installed [puppet] - https://gerrit.wikimedia.org/r/660379 (https://phabricator.wikimedia.org/T228967) (owner: JMeybohm) [15:10:36] (CR) Jcrespo: "To be deployed after 660856." [puppet] - https://gerrit.wikimedia.org/r/660857 (https://phabricator.wikimedia.org/T273182) (owner: Jcrespo) [15:10:53] (PS2) Filippo Giunchedi: swift: limit rsync to 10% memory in codfw [puppet] - https://gerrit.wikimedia.org/r/660855 (https://phabricator.wikimedia.org/T221904) [15:11:01] (CR) JMeybohm: [C: +2] calico: Typha needs to get endpoints to discover it's instances [deployment-charts] - https://gerrit.wikimedia.org/r/660399 (https://phabricator.wikimedia.org/T267653) (owner: JMeybohm) [15:12:25] (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27796/console" [puppet] - https://gerrit.wikimedia.org/r/660855 (https://phabricator.wikimedia.org/T221904) (owner: Filippo Giunchedi) [15:14:27] (PS2) Filippo Giunchedi: swift: limit rsync service memory [puppet] - https://gerrit.wikimedia.org/r/660854 (https://phabricator.wikimedia.org/T221904) [15:14:29] (PS3) Filippo Giunchedi: swift: limit rsync to 10% memory in codfw [puppet] - https://gerrit.wikimedia.org/r/660855 (https://phabricator.wikimedia.org/T221904) [15:16:02] (PS5) Jcrespo: Bacula: Create a new set of storage daemons dedicated to db ES backups [puppet] - https://gerrit.wikimedia.org/r/659952 (https://phabricator.wikimedia.org/T79922) [15:16:12] !log marostegui@cumin1001 dbctl commit (dc=all):
'db1147 (re)pooling @ 40%: Repool db1147 after a restart', diff saved to https://phabricator.wikimedia.org/P14107 and previous config saved to /var/cache/conftool/dbconfig/20210201-151611-root.json [15:16:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:24] (PS6) Jcrespo: Bacula: Create a new set of storage daemons dedicated to db ES backups [puppet] - https://gerrit.wikimedia.org/r/659952 (https://phabricator.wikimedia.org/T79922) [15:23:17] SRE, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (elukey) Had a chat with @hnowlan and maps1001 can be moved with some heads up time. I am available to work with John on moving the nodes with Manuel and Hu... [15:31:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1147 (re)pooling @ 60%: Repool db1147 after a restart', diff saved to https://phabricator.wikimedia.org/P14108 and previous config saved to /var/cache/conftool/dbconfig/20210201-153115-root.json [15:31:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:15] SRE, Data-Services, Traffic, netops, cloud-services-team (Kanban): wikireplicas last-minute infra work to discuss / resolve - https://phabricator.wikimedia.org/T273248 (ayounsi) Other than those duplicate IPs, Netbox/Homer is all good. As server interfaces IPs are not configured by Netbox (o...
[15:41:44] (PS1) Elukey: profile::analytics::refinery::job::test::camus: update EL dt field [puppet] - https://gerrit.wikimedia.org/r/660858 [15:42:04] SRE, serviceops: upgrade conf2* servers to stretch - https://phabricator.wikimedia.org/T271573 (jcrespo) See also related T224560 [15:42:08] (CR) Herron: [C: +1] profile: only set default partition if unset [puppet] - https://gerrit.wikimedia.org/r/659422 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite) [15:43:02] (PS2) Elukey: profile::analytics::refinery::job::test::camus: update EL dt field [puppet] - https://gerrit.wikimedia.org/r/660858 [15:44:25] (CR) Mforns: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/660858 (owner: Elukey) [15:45:09] RECOVERY - Check systemd state on ms-be2034 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1147 (re)pooling @ 80%: Repool db1147 after a restart', diff saved to https://phabricator.wikimedia.org/P14109 and previous config saved to /var/cache/conftool/dbconfig/20210201-154618-root.json [15:46:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:50] !log failover RG1 back to node0 on pfw3-eqiad - T263833 [15:46:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:42] (CR) Andrew Bogott: "It's this:" [puppet] - https://gerrit.wikimedia.org/r/660085 (https://phabricator.wikimedia.org/T261134) (owner: Andrew Bogott) [15:49:38] (PS1) Elukey: role::analytics_test_cluster::coordinator: update presto settings [puppet] - https://gerrit.wikimedia.org/r/660859 [15:50:29] (CR) Elukey: [C: +2] profile::analytics::refinery::job::test::camus: update EL dt field [puppet] - https://gerrit.wikimedia.org/r/660858 (owner: Elukey) [15:50:57] (CR) Elukey: [C: +2] role::analytics_test_cluster::coordinator: update
presto settings [puppet] - https://gerrit.wikimedia.org/r/660859 (owner: Elukey) [15:53:54] (PS2) Ottomata: Refine SuggestedTagsAction schema using eventlogging_legacy job [puppet] - https://gerrit.wikimedia.org/r/658419 (https://phabricator.wikimedia.org/T267351) (owner: Mforns) [15:56:43] (CR) Ottomata: [C: +2] Refine SuggestedTagsAction schema using eventlogging_legacy job [puppet] - https://gerrit.wikimedia.org/r/658419 (https://phabricator.wikimedia.org/T267351) (owner: Mforns) [15:59:55] !log install buster kernel update [15:59:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1147 (re)pooling @ 100%: Repool db1147 after a restart', diff saved to https://phabricator.wikimedia.org/P14110 and previous config saved to /var/cache/conftool/dbconfig/20210201-160122-root.json [16:01:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:07] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host puppetboard2002.codfw.wmnet [16:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:47] SRE, Data-Services, Traffic, netops, cloud-services-team (Kanban): wikireplicas last-minute infra work to discuss / resolve - https://phabricator.wikimedia.org/T273248 (ayounsi) I deleted the two extra /27 IPs with https://netbox.wikimedia.org/extras/changelog/?request_id=9ff6d398-9358-4668-b... [16:05:00] SRE, Data-Services, Traffic, netops, cloud-services-team (Kanban): wikireplicas last-minute infra work to discuss / resolve - https://phabricator.wikimedia.org/T273248 (ayounsi) Also note that the `cloud-support-XXX` vlans are being phased out, so no need to add a LVS leg in all of them if no...
[16:05:29] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2002.codfw.wmnet [16:05:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:42] (PS1) Bstorm: Revert "dumps: fail over dumps web" [dns] - https://gerrit.wikimedia.org/r/660798 [16:05:53] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host puppetboard1002.eqiad.wmnet [16:05:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:00] (PS1) Bstorm: Revert "dumps-dist: fail over labstore1006 to 1007" [puppet] - https://gerrit.wikimedia.org/r/660799 [16:07:14] (PS2) Bstorm: Revert "dumps-dist: fail over labstore1006 to 1007" [puppet] - https://gerrit.wikimedia.org/r/660799 [16:08:01] (PS2) Bstorm: Revert "dumps: fail over dumps web" [dns] - https://gerrit.wikimedia.org/r/660798 [16:09:01] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1002.eqiad.wmnet [16:09:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:52] !og disable puppet fleet wide to preform reboots [16:12:23] (that !log is missing an L) [16:12:30] !log disable puppet fleet wide to preform reboots [16:12:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:33] thank Lucas_WMDE :) [16:12:39] np :) [16:13:20] !log jbond@cumin2001 START - Cookbook sre.hosts.reboot-single for host puppetdb1002.eqiad.wmnet [16:13:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:14] SRE, ops-eqiad: Interface errors between pfw3a-eqiad and fasw-c1a-eqiad - https://phabricator.wikimedia.org/T271295 (ayounsi) Resolved→Open I'm still seeing errors on that link :( https://librenms.wikimedia.org/graphs/id=13618/type=port_errors Anything more that can be done? Please sync up with...
[16:14:21] !log jbond@cumin2001 START - Cookbook sre.hosts.reboot-single for host puppetmaster1003.eqiad.wmnet [16:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:32] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host puppetmaster1002.eqiad.wmnet [16:14:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:39] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host puppetmaster1001.eqiad.wmnet [16:14:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:48] !log fail-back RG1 back to node1 on pfw3-eqiad - T263833 [16:15:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:49] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:20:19] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:20:48] SRE, ops-esams, DC-Ops, Traffic: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167 (RobH) So there is currently an experiment going on with caching hosts in esams, and flashing firmware would interrupt that. T264398#6772586 When that is done, this ca...
[16:21:52] !log jbond@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetmaster1003.eqiad.wmnet [16:21:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:54] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetmaster1002.eqiad.wmnet [16:22:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:30] !log jbond@cumin2001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host puppetdb1002.eqiad.wmnet [16:26:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:01] !log jbond@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host puppetmaster1001.eqiad.wmnet [16:28:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:16] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host puppetmaster2001.codfw.wmnet [16:28:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:38] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host puppetmaster2002.codfw.wmnet [16:28:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:47] !log jbond@cumin2001 START - Cookbook sre.hosts.reboot-single for host puppetmaster2003.codfw.wmnet [16:28:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:56] !log jbond@cumin2001 START - Cookbook sre.hosts.reboot-single for host puppetdb2002.codfw.wmnet [16:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:03] !log jbond@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetdb2002.codfw.wmnet [16:33:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:18] (CR) Ayounsi: [C: +1] "NOOP and looks much cleaner."
[homer/public] - https://gerrit.wikimedia.org/r/660815 (owner: Volans) [16:33:45] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetmaster2002.codfw.wmnet [16:33:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:57] !log jbond@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host puppetmaster2001.codfw.wmnet [16:33:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:16] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host pki2001.codfw.wmnet [16:34:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:42] !log jbond@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetmaster2003.codfw.wmnet [16:34:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:49] !log jbond@cumin2001 START - Cookbook sre.hosts.reboot-single for host pki1001.eqiad.wmnet [16:34:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:21] !log jbond@cumin2001 START - Cookbook sre.hosts.reboot-single for host idp-test1001.wikimedia.org [16:35:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:26] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host idp-test2001.wikimedia.org [16:35:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:08] !log enable puppet fleet wide to post reboots [16:42:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:51] !log jbond@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp-test1001.wikimedia.org [16:43:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:22] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host pki2001.codfw.wmnet [16:44:25] Logged the message
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:27] !log jbond@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host pki1001.eqiad.wmnet [16:44:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:08] !log jbond@cumin2001 START - Cookbook sre.hosts.reboot-single for host apt2001.wikimedia.org [16:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:21] !log jbond@cumin2001 START - Cookbook sre.hosts.reboot-single for host idp2001.wikimedia.org [16:46:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:19] (PS1) Jbond: wikimedia.org: move cnames for apt and idp for reboot [dns] - https://gerrit.wikimedia.org/r/660862 [16:49:21] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:50:03] !log jbond@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2001.wikimedia.org [16:50:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:55] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:52:06] !log jbond@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp2001.wikimedia.org [16:52:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:02] (CR) Jbond: [C: +2] wikimedia.org: move cnames for apt and idp for reboot [dns] - https://gerrit.wikimedia.org/r/660862 (owner: Jbond) [16:53:33] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp-test2001.wikimedia.org [16:53:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:04] (PS1) Jbond: Revert "wikimedia.org: move cnames for apt and idp for reboot" [dns] - https://gerrit.wikimedia.org/r/660802 [16:59:42] Urbanecm: oops, I forgot to -1 those logo changes. no harm though, we just might be able to compress them even further, I just didn't finish that investigation [16:59:43] (PS3) Dzahn: decom francium.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/659346 (https://phabricator.wikimedia.org/T273142) [17:00:19] legoktm: ah, sorry for merging them :(. It looked good to me in their quality and size, so I just synced it. [17:00:31] fortunately we have the common keys, so re-doing is easy [17:01:00] no need to apologize, I didn't communicate properly :p [17:01:02] and yeah! [17:01:25] !log jbond@cumin2001 START - Cookbook sre.hosts.reboot-single for host apt1001.wikimedia.org [17:01:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:33] !log jbond@cumin2001 START - Cookbook sre.hosts.reboot-single for host idp1001.wikimedia.org [17:01:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:46] legoktm: speaking of your tool, I think we should have a command for identifying unused PNGs, and a command for re-compressing all or some of the PNGs.
What do you think? [17:02:42] sure [17:02:55] please file a task so we don't forget :) [17:03:10] sure. Do we have a project for the logo management thing, or should I just use the site requests one? [17:03:37] site requests [17:03:55] at least it's not called "shell request" anymore :) [17:03:57] !log jbond@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp1001.wikimedia.org [17:04:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:26] (CR) Volans: "> Patch Set 6:" (1 comment) [software/netbox-extras] - https://gerrit.wikimedia.org/r/654435 (https://phabricator.wikimedia.org/T265904) (owner: Jbond) [17:05:08] RIP the "shell" keyword in Bugzilla [17:05:37] I was also thinking of something to validate that the dimensions of the pngs were actually correct after finding all the 1.5x ones to be 2px too big [17:06:22] !log jbond@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt1001.wikimedia.org [17:06:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:29] (CR) Jbond: [C: +2] Revert "wikimedia.org: move cnames for apt and idp for reboot" [dns] - https://gerrit.wikimedia.org/r/660802 (owner: Jbond) [17:07:26] legoktm: heheh, yes, that one:) [17:07:46] !log jbond@cumin2001 START - Cookbook sre.hosts.reboot-single for host install1003.wikimedia.org [17:07:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:23] legoktm: do you happen to be a commons admin?
[17:08:32] nope [17:08:32] !log jbond@cumin2001 START - Cookbook sre.hosts.reboot-single for host install2003.wikimedia.org [17:08:34] :( [17:08:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:55] I mean, if we happen to be using common logos without much checking the result, we should probably protect it [17:09:04] so we don't deploy vandalism one day [17:09:10] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host install2003.wikimedia.org [17:09:11] !log jbond@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host install2003.wikimedia.org [17:09:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:33] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host install3001.wikimedia.org [17:09:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:07] !log upload dnsdist_1.5.1-3wm1 to apt.wm.o (buster) - T252132 [17:10:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:11] T252132: Deploy Wikidough: Experimental DNS-over-HTTPS (DoH) public resolver - https://phabricator.wikimedia.org/T252132 [17:10:30] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host install4001.wikimedia.org [17:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:11] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host install5001.wikimedia.org [17:12:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:16] (PS1) Legoktm: arclamp: Add excimer-{stretch,buster} pipelines [puppet] - https://gerrit.wikimedia.org/r/660863 (https://phabricator.wikimedia.org/T273312) [17:12:18] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install3001.wikimedia.org [17:12:21] Logged the
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:25] !log jbond@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install1003.wikimedia.org [17:12:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:59] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [17:12:59] Urbanecm: the logos are semi-protected, but also I manually inspected each diff (Gerrit also makes it easy) that the logos looked the same [17:13:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:11] !log jbond@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install2003.wikimedia.org [17:13:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:48] legoktm: yeah, I did it too (before I +2'ed the changes), but still, when we recompress 300 logos, such a mistake can be overlooked [17:14:07] !log decom'ing francium.eqiad.wmnet, formerly HTML dumps server, replaced by htmldumper1001 [17:14:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:11] mhm, I think doing them one by one like I did earlier is the best for that reason [17:14:41] I still think they should be fully protected tbh :D [17:14:46] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install4001.wikimedia.org [17:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:02] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [17:15:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:06] https://commons.wikimedia.org/wiki/Commons:Auto-protected_files [17:15:36] oh, that's interesting thing [17:15:41] !log jbond@cumin2001 START - Cookbook sre.hosts.reboot-single for host deneb.codfw.wmnet [17:15:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:58] !log jbond@cumin1001 END (PASS) - 
Cookbook sre.hosts.reboot-single (exit_code=0) for host install5001.wikimedia.org [17:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:01] !log jbond@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host deneb.codfw.wmnet [17:17:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:07] (03CR) 10Volans: [C: 03+2] cr: fix indentation of generated config [homer/public] - 10https://gerrit.wikimedia.org/r/660815 (owner: 10Volans) [17:17:26] I have access to the bot, I guess we could add a variant that looks at logos/config.yaml [17:17:37] (03Merged) 10jenkins-bot: cr: fix indentation of generated config [homer/public] - 10https://gerrit.wikimedia.org/r/660815 (owner: 10Volans) [17:17:39] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:17:57] legoktm: works for me. 
Filled the ideas I mentioned as T273486, T273490 and T273492 [17:17:58] T273492: Add a script to recompress logos to the logo management tool - https://phabricator.wikimedia.org/T273492 [17:17:58] T273486: Add a script to identify unused PNGs - https://phabricator.wikimedia.org/T273486 [17:17:58] T273490: Make sure logos used by the logo management are (semi)protected at Commons - https://phabricator.wikimedia.org/T273490 [17:18:15] 10SRE, 10Analytics-Clusters: rsyslog segfault on an-test-presto1001 - https://phabricator.wikimedia.org/T273412 (10fdans) [17:18:43] cool, I'll try to get those when I have some free time next [17:18:54] (03CR) 10Andrew Bogott: [C: 03+2] cloud-vps instances: add a helper script to format & mount a cinder volume (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/658452 (https://phabricator.wikimedia.org/T272114) (owner: 10Andrew Bogott) [17:19:33] thanks a ton :) [17:20:13] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:23:10] (03Abandoned) 10Gerrit maintenance bot: Add mni to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/660826 (https://phabricator.wikimedia.org/T273457) (owner: 10Gerrit maintenance bot) [17:23:24] Urbanecm: ^ duplicate or .... ? [17:23:29] mutante: yup, duplicate [17:23:37] ok, doing that in a minute [17:23:39] mutante: https://gerrit.wikimedia.org/r/c/operations/dns/+/660830 is still live [17:23:45] *not abandoned [17:24:16] (03CR) 10Urbanecm: [C: 03+1] Add mni to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/660830 (https://phabricator.wikimedia.org/T273456) (owner: 10Gerrit maintenance bot) [17:24:45] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:25:39] when running the decom script it asks me to review a homer switch change [17:25:50] mutante: yes [17:25:56] but how would I actually review that I dont know [17:26:25] mutante: now the part that was manual before of running homer to update the port description and close the swtich ports is automated [17:26:30] and the cookbooks runs it for you [17:26:36] it was also the part where it was handed to another team though? [17:26:46] it should mention just the diff for your host(s) [17:27:11] XioNoX , me and other are happy to help review it if you have any doubt [17:27:12] yea, I am being shown a diff and it wants me to say yes or no [17:27:18] just saying it is really a blind yes [17:27:27] (03CR) 10Dave Pifke: [C: 04-1] "-1 to renaming the existing pipeline." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/660863 (https://phabricator.wikimedia.org/T273312) (owner: 10Legoktm) [17:27:30] (03PS1) 10Legoktm: profiler: Send data to excimer-{stretch,buster} pipelines [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660865 (https://phabricator.wikimedia.org/T273312) [17:27:30] paste it somewhere and I can check it [17:27:54] because how would people know the switch port is correct [17:28:28] from the description? from the changelog that just cleared that up on netbox? [17:28:39] well, if the host name I expect shows up in the description, ok [17:28:55] mutante: hey is it ok to merge the DNS changes for francium? 
[17:29:03] I just thought it was on purpose that we let dcops do this part when they physically take it out [17:29:10] papaul: no idea [17:29:20] mutante, sorry, but your conversation this may be related to https://phabricator.wikimedia.org/T273275 [17:29:31] mutante: thanks [17:29:33] mutante: no, closing the port on decom is actually safer and what we aimed for from the start [17:29:38] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [17:29:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:45] with some ideas to polish the workflow at the end [17:29:48] mutante: can you past the diff anywhere and I can check it [17:29:54] the fact that papaul needs to ask me this in the exact moment seems to be an indication ? [17:30:03] *if you can post [17:30:17] https://phabricator.wikimedia.org/P14111 [17:30:29] mutante: that's a race, papaul run the sre.dns.netbox cookbook at the same time of the decom cookbook running the same stuff [17:30:34] you merged your change [17:30:43] so his was blocked automatically (see the FAIL above) [17:31:35] mutante: are yu decomming francium? [17:32:00] from https://netbox.wikimedia.org/dcim/devices/1956/ if you look at ge-7/0/33 it shows where it's attached (francium) and you can double check is correct [17:32:12] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10wiki_willy) Thanks @Marostegui, I appreciate it. We discussed this during my staff meeting a bit last week, and @Cmjohnson will work with you and the othe... [17:32:38] volans: yes, francium it is and got it, that is because we once had a host come back after decom ,right [17:32:53] jynus: thank you, but it seems to be an unrelated thing [17:32:58] mutante, it is [17:33:31] I just pointed it as a discussion to potentially improving the script [17:33:37] is Phatality still a thing in the new logstash/kibana version? 
I can’t find it [17:33:51] Lucas_WMDE: there's a task for that [17:34:02] so, not at the moment? [17:34:03] currently not enabled, but eventually will be, when it's...reworked to be compatible [17:34:09] ok thanks [17:34:12] good to know [17:34:14] volans: got it, I see the interface name in netbox now, that I can check, going ahead [17:34:17] thanks [17:34:24] np, thank you for asking [17:34:28] Lucas_WMDE: see T272655 for more info :) [17:34:28] T272655: Phatality doesn't work with Kibana 7 - https://phabricator.wikimedia.org/T272655 [17:34:33] just found it :) [17:34:40] cool :) [17:34:47] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [17:34:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:55] papaul: i just said "yes" in the running decom script and now it finished [17:35:02] mutante: thanks [17:36:50] purchase date: 2014 - good riddance [17:37:03] (03CR) 10Legoktm: "> Patch Set 1: Code-Review-1" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/660863 (https://phabricator.wikimedia.org/T273312) (owner: 10Legoktm) [17:37:08] (03CR) 10CDanis: "one nit but LGTM overall" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/660854 (https://phabricator.wikimedia.org/T221904) (owner: 10Filippo Giunchedi) [17:37:21] when the procurement ticket was RT you know it was time [17:37:22] (03CR) 10CDanis: [C: 03+1] swift: limit rsync to 10% memory in codfw [puppet] - 10https://gerrit.wikimedia.org/r/660855 (https://phabricator.wikimedia.org/T221904) (owner: 10Filippo Giunchedi) [17:38:30] (03CR) 10Dzahn: [C: 03+2] decom francium.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/659346 (https://phabricator.wikimedia.org/T273142) (owner: 10Dzahn) [17:39:38] (03PS1) 10Ssingh: wikidough: update description for role [puppet] - 10https://gerrit.wikimedia.org/r/660868 (https://phabricator.wikimedia.org/T252132) [17:40:53] (03CR) 10CRusnov: "This change is ready for review." 
[software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/660867 (https://phabricator.wikimedia.org/T265084) (owner: 10CRusnov) [17:40:58] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27797/console" [puppet] - 10https://gerrit.wikimedia.org/r/660868 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [17:41:34] (03CR) 10Volans: [C: 04-1] "Order issue" (031 comment) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/660867 (https://phabricator.wikimedia.org/T265084) (owner: 10CRusnov) [17:41:56] 10SRE, 10ops-codfw, 10Data-Persistence-Backup, 10decommission-hardware: decommission heze and heze-array1 - https://phabricator.wikimedia.org/T273051 (10Papaul) [17:42:58] mutante: fwiw the description of the iface on the paste you pasted had francium:eno1 too, I forgot to point it out earlier [17:43:09] (03CR) 10Ssingh: [V: 03+1 C: 03+2] wikidough: update description for role [puppet] - 10https://gerrit.wikimedia.org/r/660868 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [17:43:21] so usually the diff is self-explanatory (remove iface description, vlan and add it to the disabled range) [17:44:07] volans: makes sense, alright, yep [17:44:20] 10SRE, 10MW-on-K8s, 10observability, 10serviceops: Logging options for apache httpd in k8s - https://phabricator.wikimedia.org/T265876 (10lmata) noted @Joe! I'll reach out to you to coordinate a time to talk with the team. [17:44:40] andrewbogott: are you waiting to puppet-merge? 
[17:44:59] same question :) [17:45:02] mutante: my mistake, it still had a prompt [17:45:04] volans: decom script ended with exit 0 btw, all good [17:45:05] done now [17:45:10] andrewbogott: thanks, ack [17:45:19] mutante: please merge my change as well, thanks [17:45:29] sukhe: typing "multiple" right now, yes [17:45:46] role description change is easy :) [17:45:48] done [17:45:51] hah thank you [17:50:47] PROBLEM - Disk space on ping1001 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=68%): /tmp 0 MB (0% inode=68%): /var/tmp 0 MB (0% inode=68%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ping1001&var-datasource=eqiad+prometheus/ops [17:51:01] mmmm [17:51:46] I'll do the usual thing I do to get a few MB back.. apt-get clean [17:51:57] I was about to check [17:52:33] !log ping1001 - apt-get clean gets back 447M - it was out of disk completely, now 84% usage [17:52:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:49] that is a small / [17:52:56] just 3G total [17:52:56] maybe we can add a new disk for /var ? [17:53:26] (03PS2) 10Legoktm: arclamp: Add excimer-buster pipeline [puppet] - 10https://gerrit.wikimedia.org/r/660863 (https://phabricator.wikimedia.org/T273312) [17:53:29] (03PS2) 10Legoktm: profiler: Send data to excimer-buster pipeline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660865 (https://phabricator.wikimedia.org/T273312) [17:53:37] it's possible to add new virtual disks to VMs, yea [17:53:42] if needed though [17:54:01] I guess it is not like a db, where when it reaches 0 nothing works [17:54:06] so low prio [17:54:31] /var/log is not even the culprit for this one .. 
hmm [17:54:42] or we can add a cron doing what you say [17:54:49] (03CR) 10jerkins-bot: [V: 04-1] profiler: Send data to excimer-buster pipeline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660865 (https://phabricator.wikimedia.org/T273312) (owner: 10Legoktm) [17:54:52] (03PS1) 10Jbond: (WIP): add script to copy ldap entries to a local db [puppet] - 10https://gerrit.wikimedia.org/r/660869 [17:55:36] (03PS3) 10Legoktm: profiler: Send data to excimer-buster pipeline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660865 (https://phabricator.wikimedia.org/T273312) [17:55:45] (03PS1) 10Bartosz Dziewoński: Enable DiscussionTools as a beta feature on wikis with language variants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660870 (https://phabricator.wikimedia.org/T272639) [17:55:48] 1.5G ./lib [17:55:50] /usr/lib/modules is a third of it .. [17:56:09] let's make a ticket to decide what is the right move [17:56:25] (03CR) 10jerkins-bot: [V: 04-1] (WIP): add script to copy ldap entries to a local db [puppet] - 10https://gerrit.wikimedia.org/r/660869 (owner: 10Jbond) [17:56:49] mutante, I can also run apt autoremove ? 
[17:57:12] jynus: sure, it would not hurt [17:57:14] The following packages will be REMOVED: libblas3 libgfortran5 liblinear3 liblua5.3-0 libquadmath0 linux-image-4.19.0-11-amd64 nmap-common python3-debconf [17:57:38] yea, that should be fine [17:57:51] PROBLEM - DPKG on ping2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [17:57:57] PROBLEM - Disk space on ping2001 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=68%): /tmp 0 MB (0% inode=68%): /var/tmp 0 MB (0% inode=68%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ping2001&var-datasource=codfw+prometheus/ops [17:57:59] that should fix soon [17:58:10] oh, I was about to ask about codfw :-) [17:58:15] lol [17:58:19] meh [17:58:33] eqiad is down to 77% usage, good [17:58:41] I am about to leave, if you want to do the honers (apt autoclean && apt autoremove) [17:58:45] *honors [17:58:52] sure, ok [17:59:12] I will check the ticket tomorrow, feel free to add me [17:59:27] !log ping 2001 - apt-get clean; apt autoremove - was out of disk as well [17:59:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:30] ok [17:59:41] "Setting up linux-image-4.19.0-14-amd64 (4.19.171-2) ... [17:59:43] uhmm [18:00:01] (03PS2) 10CDanis: Add a textfile exporter for the machine's Debian version [puppet] - 10https://gerrit.wikimedia.org/r/659991 [18:00:03] well, ok then. that is also back to 77% and we finished the kernel upgrade :p [18:00:04] ryankemper: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Wikidata Query Service weekly deploy . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210201T1800). 
[18:00:33] PROBLEM - DPKG on ping3001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [18:00:48] gimme a break :p [18:00:50] lol [18:01:03] this is ongoing kernel upgrades right now? [18:01:06] cumin to the rescue :-) [18:01:32] ping2001.codfw.wmnet,ping1001.eqiad.wmnet,ping3001.esams.wmnet [18:01:50] only 3 as far as puppet concerns [18:03:22] !log ping3001 - apt-get clean; apt-get autoremove; let it finish kernel upgrade; was out of disk [18:03:25] update-initramfs: Generating /boot/initrd.img-4.19.0-14-amd64 [18:03:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:32] /dev/vda1 2.9G 2.1G 663M 77% / [18:03:34] done [18:03:43] RECOVERY - Disk space on ping1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ping1001&var-datasource=eqiad+prometheus/ops [18:03:43] RECOVERY - Disk space on ping2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ping2001&var-datasource=codfw+prometheus/ops [18:03:43] RECOVERY - DPKG on ping2001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [18:03:43] RECOVERY - DPKG on ping3001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [18:04:33] we should be good now, thanks mutante for taking care of it [18:05:28] 10SRE, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), and 3 others: Investigate possible performance degradation on mediawiki servers after Debian Buster upgrade - https://phabricator.wikimedia.org/T273312 (10Legoktm) Here's the list of most of the stuff that PHP links to that differs: curl,... 
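The disk-space exchange above quotes `/dev/vda1 2.9G 2.1G 663M 77% /` after the cleanup. As a rough sketch of how that percentage comes about, and of a check that could flag a small root filesystem before it fills, the following assumes that `df` reports Use% as used / (used + available), rounded up. The function names and the 90% threshold are hypothetical and not taken from any Wikimedia tooling; the megabyte figures in the comments are approximations of the rounded values in the log.

```python
import shutil


def use_percent(used: int, available: int) -> int:
    """Return a df-style Use%: used / (used + available), rounded up.

    Assumption: this mirrors how df computes the column; reserved
    blocks are excluded because they appear in neither argument.
    """
    total = used + available
    if total <= 0:
        raise ValueError("used + available must be positive")
    # Integer ceiling division avoids floating-point rounding surprises.
    return (used * 100 + total - 1) // total


def needs_cleanup(used: int, available: int, threshold: int = 90) -> bool:
    """Hypothetical policy: flag the filesystem at or above `threshold` percent."""
    return use_percent(used, available) >= threshold


if __name__ == "__main__":
    # Approximate figures from the log, in MB: 2.1G used, 663M available.
    print(use_percent(2150, 663))

    # Live check of the root filesystem on the machine running this sketch.
    usage = shutil.disk_usage("/")
    print(needs_cleanup(usage.used, usage.free))
```

A check like this would only report the condition; whether to respond with `apt-get clean`, `apt autoremove`, or a larger disk is the decision left to the ticket mentioned in the conversation.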
[18:05:43] np, run jynus and enjoy the rest of the night before there is the next thing [18:06:14] "hey people, I just setup a new host called ping4001" /jk [18:06:39] (03CR) 10Jbond: [C: 03+1] "LGTM comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/659991 (owner: 10CDanis) [18:09:31] (03CR) 10Jforrester: "These new logos are rather blurrier. Were they previously hand-optimised, or is this due to the config we're feeding our toolchain?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660620 (owner: 10Legoktm) [18:09:40] 10SRE: ping servers running out of disk - https://phabricator.wikimedia.org/T273509 (10Dzahn) [18:17:16] 10SRE, 10ops-codfw: codfw: add VC-links IDs to Netbox - https://phabricator.wikimedia.org/T268749 (10Papaul) Row C complete [18:18:51] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [18:19:01] (03CR) 10Legoktm: "> Patch Set 3:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660620 (owner: 10Legoktm) [18:19:19] (03PS3) 10CRusnov: Add execution of trace_paths to the post-install setup [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/660867 (https://phabricator.wikimedia.org/T265084) [18:19:43] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:19:58] (03CR) 10CRusnov: "> Patch Set 2: Code-Review-1" (031 comment) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/660867 (https://phabricator.wikimedia.org/T265084) (owner: 10CRusnov) [18:20:18] (03CR) 10Jforrester: "> Patch Set 3:" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/660620 (owner: 10Legoktm) [18:20:23] (03CR) 10CRusnov: Add execution of trace_paths to the post-install setup (031 comment) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/660867 (https://phabricator.wikimedia.org/T265084) (owner: 10CRusnov) [18:21:11] (03PS4) 10CRusnov: Add execution of trace_paths to the post-install setup [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/660867 (https://phabricator.wikimedia.org/T265084) [18:21:18] Urbanecm: did you clear the caches (using purgeList) when deploying the logo changes? [18:21:34] legoktm: not yet, but i will do it later today [18:21:41] (unless you want to) [18:21:43] Urbanecm: don't [18:21:52] okay, i won't do it then [18:21:57] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:22:07] see the comments on https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/660620 [18:22:29] (03CR) 10Volans: "> Patch Set 2:" (031 comment) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/660867 (https://phabricator.wikimedia.org/T265084) (owner: 10CRusnov) [18:23:10] legoktm: oh :( [18:23:12] 10SRE, 10ops-codfw, 10Data-Persistence-Backup, 10decommission-hardware: decommission heze and heze-array1 - https://phabricator.wikimedia.org/T273051 (10Papaul) [18:23:16] (03CR) 10Dduvall: releases: Provide remaining pipelinelib dependencies (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/659437 (https://phabricator.wikimedia.org/T271477) (owner: 10Dduvall) [18:23:24] (03CR) 10Dduvall: [C: 03+1] releases: Provide remaining pipelinelib dependencies [puppet] - 10https://gerrit.wikimedia.org/r/659437 (https://phabricator.wikimedia.org/T271477) (owner: 10Dduvall) [18:24:10] 10SRE, 10ops-codfw, 10Data-Persistence-Backup, 10decommission-hardware: 
decommission heze and heze-array1 - https://phabricator.wikimedia.org/T273051 (10Papaul) 05Open→03Resolved complete . @jcrespo thanks for getting this done. [18:24:36] legoktm: should i revert all the patches? [18:26:58] (03CR) 10CRusnov: "> Patch Set 4:" (031 comment) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/660867 (https://phabricator.wikimedia.org/T265084) (owner: 10CRusnov) [18:27:03] I'm trying to diff the pngs right now [18:27:45] okay [18:27:50] ping me if you want my hands [18:29:39] PROBLEM - Kafka Broker Replica Max Lag on kafka-jumbo1009 is CRITICAL: 5.723e+06 ge 5e+06 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1009 [18:29:49] PROBLEM - Kafka Broker Replica Max Lag on kafka-jumbo1001 is CRITICAL: 5.771e+06 ge 5e+06 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1001 [18:30:11] (03CR) 10CRusnov: "> Patch Set 4:" (031 comment) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/660867 (https://phabricator.wikimedia.org/T265084) (owner: 10CRusnov) [18:30:51] (03PS5) 10CRusnov: Add execution of trace_paths to the post-install setup [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/660867 (https://phabricator.wikimedia.org/T265084) [18:33:10] razzi is working on rebalancing topics on Kafka Jumbo, all expected, we'll downtime the alert later on :) [18:33:48] imagemagick says there's definitely some visible difference https://people.wikimedia.org/~legoktm/diff-dewiki-2x.png [18:35:13] PROBLEM - Kafka Broker Replica Max Lag on kafka-jumbo1005 is CRITICAL: 5.708e+06 ge 5e+06 https://wikitech.wikimedia.org/wiki/Kafka/Administration 
https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1005 [18:35:23] (03PS1) 10ArielGlenn: make wikidata rdf dumps easier to test in deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/660871 (https://phabricator.wikimedia.org/T269377) [18:35:44] legoktm: that's sad :( [18:35:52] (03CR) 10jerkins-bot: [V: 04-1] make wikidata rdf dumps easier to test in deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/660871 (https://phabricator.wikimedia.org/T269377) (owner: 10ArielGlenn) [18:36:20] 10SRE, 10ops-codfw, 10Analytics-Radar, 10decommission-hardware, 10serviceops: decommission kraz.wikimedia.org - https://phabricator.wikimedia.org/T245279 (10wiki_willy) [18:40:29] (03PS1) 10Bartosz Dziewoński: Enable DiscussionTools as a beta feature on 3 wikis per request [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660873 (https://phabricator.wikimedia.org/T258554) [18:40:48] 10SRE, 10serviceops: upgrade conf2* servers to stretch - https://phabricator.wikimedia.org/T271573 (10elukey) @jcrespo me and Giuseppe are discussing the problem, so your pings are not unseen, but the problem is complex since it requires a lot of clients to move to eqiad first (Pybals, etcd DNS configs, etc..)... 
[18:41:10] (03CR) 10Dzahn: [C: 03+2] Add mni to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/660830 (https://phabricator.wikimedia.org/T273456) (owner: 10Gerrit maintenance bot) [18:41:20] (03PS2) 10Dzahn: Add mni to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/660830 (https://phabricator.wikimedia.org/T273456) (owner: 10Gerrit maintenance bot) [18:44:54] (03PS1) 10Kosta Harlan: GrowthExperiments: Bump schema version for homepagemodule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660874 (https://phabricator.wikimedia.org/T273084) [18:44:58] !log new Wikimedia project language "mni" added - Meitei is a Sino-Tibetan language and the predominant language and lingua franca of the state of Manipur in northeastern India. [18:45:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:05] sukhe: ^ another one [18:45:28] Urbanecm, James_F: https://people.wikimedia.org/~legoktm/flip-dewiki-2x.html flips between the old and new logos every second, but I don't actually see any difference [18:45:50] me neither [18:46:58] jouncebot: refresh [18:46:59] I refreshed my knowledge about deployments. [18:47:01] jouncebot: next [18:47:02] In 0 hour(s) and 12 minute(s): Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210201T1900) [18:47:22] (03CR) 10CDanis: [C: 03+2] "thanks!" 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/659991 (owner: 10CDanis) [18:49:29] (03CR) 10Dzahn: [C: 03+2] gerrit: drop log4j custom config [puppet] - 10https://gerrit.wikimedia.org/r/660030 (https://phabricator.wikimedia.org/T141324) (owner: 10Hashar) [18:50:04] (03CR) 10Dzahn: [C: 03+2] gerrit: add link to online config doc [puppet] - 10https://gerrit.wikimedia.org/r/660029 (owner: 10Hashar) [18:50:59] (03PS3) 10Dzahn: gerrit: drop log4j custom config [puppet] - 10https://gerrit.wikimedia.org/r/660030 (https://phabricator.wikimedia.org/T141324) (owner: 10Hashar) [18:52:07] (03CR) 10Kosta Harlan: [C: 04-2] "Putting on hold for the moment." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660874 (https://phabricator.wikimedia.org/T273084) (owner: 10Kosta Harlan) [18:54:07] (03CR) 10Legoktm: "> Patch Set 3:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660620 (owner: 10Legoktm) [18:54:11] legoktm: Hmm. Ditto. But in the side-by-side on gerrit they're clearly different to me. [18:54:44] hmm [18:54:55] I wonder if Gerrit does changes [18:55:06] The 2x one isn't very different. [18:55:16] is a difference size worse? [18:55:26] But the 1x one looks obvious. [18:55:55] * legoktm updates testing page [18:56:02] jouncebot: now [18:56:02] No deployments scheduled for the next 0 hour(s) and 3 minute(s) [18:56:20] restarting gerrit within 3 minutes :p [18:57:15] !log restarting gerrit for change 660030 (no ticket) [18:57:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:45] https://people.wikimedia.org/~legoktm/flip-dewiki.html [18:57:48] it's super noticeable [18:58:12] Ack. [18:58:23] the letters are moving too ugh [18:58:38] * Urbanecm is curious why it happens [18:59:18] mutante: ha, unlike the last one I had actually heard about this language. 
interesting nevertheless [18:59:38] (03CR) 10Volans: [C: 03+1] "Ack, LGTM" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/660867 (https://phabricator.wikimedia.org/T265084) (owner: 10CRusnov) [19:00:05] RoanKattouw, Niharika, and Urbanecm: Dear deployers, time to do the Morning backport window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210201T1900). [19:00:05] MatmaRex: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:13] \o/ [19:00:18] sukhe: :) [19:00:25] (03CR) 10CRusnov: [V: 03+2 C: 03+2] Add execution of trace_paths to the post-install setup [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/660867 (https://phabricator.wikimedia.org/T265084) (owner: 10CRusnov) [19:00:57] Urbanecm: Almost certainly the original was hand-created ages ago in something and no-one noticed? [19:00:57] I can deploy today (if gerrit is all back up again) [19:01:04] (03CR) 10Bstorm: "https://phabricator.wikimedia.org/T268280" [puppet] - 10https://gerrit.wikimedia.org/r/660799 (owner: 10Bstorm) [19:01:29] (03CR) 10Dzahn: "deployed in prod and restarted gerrit." [puppet] - 10https://gerrit.wikimedia.org/r/660030 (https://phabricator.wikimedia.org/T141324) (owner: 10Hashar) [19:01:40] James_F: I mean, you said it is caused by the commons PNG issue. But...why does it generate bad PNGs? [19:01:45] MatmaRex: Hi, around? [19:01:51] Urbanecm: Because rsvg is a POC. [19:01:58] hi [19:02:01] Urbanecm: See the endless requests for us to improve/replace it. [19:02:06] well I don't think rsvg moved letters around [19:02:19] does that mean Proof of Concept James_F? [19:02:26] piece of crap lol [19:02:33] lol [19:02:48] Or Proof of Crap? 
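Earlier in the log there was a suggestion to validate that the logo PNGs have the correct dimensions, after the 1.5x variants were found to be 2px too big. A minimal sketch of such a check, using only the standard library, could read the width and height from the PNG `IHDR` chunk and compare them with the base size times the scale factor. The function names, the 100x100 base size, and the rounding policy are illustrative assumptions; this is not the logo management tool's actual code.

```python
import struct
from typing import Tuple

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions(data: bytes) -> Tuple[int, int]:
    """Return (width, height) from the IHDR chunk of PNG file contents."""
    if len(data) < 24 or data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    # After the 8-byte signature: 4-byte chunk length, 4-byte chunk type.
    _length, chunk_type = struct.unpack(">I4s", data[8:16])
    if chunk_type != b"IHDR":
        raise ValueError("first chunk is not IHDR")
    # IHDR data starts with big-endian 4-byte width and 4-byte height.
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def has_expected_size(data: bytes, base: Tuple[int, int], factor: float) -> bool:
    """Check that the PNG is exactly `base` scaled by `factor`.

    Assumption: scaled sizes are rounded with Python's round(); a real
    tool may need a different rounding rule for half-pixel results.
    """
    expected = (round(base[0] * factor), round(base[1] * factor))
    return png_dimensions(data) == expected


if __name__ == "__main__":
    # Hypothetical 1.5x variant of a 100x100 logo that came out 2px too big.
    header = PNG_SIGNATURE + struct.pack(">I4sII", 13, b"IHDR", 152, 152)
    print(png_dimensions(header))
    print(has_expected_size(header, (100, 100), 1.5))
```

Only the first 24 bytes of each file are needed, so a check like this could run over a few hundred logos cheaply. It would catch size mistakes but not the rendering differences discussed above, which need an actual pixel comparison.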
[19:03:05] MatmaRex: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/660870 seems to set wgDiscussionToolsEnable to false rather than true (ie. enabling)? Is it the right patch? [19:04:07] Urbanecm: yes, they should fall back to 'wikipedia', which is true [19:04:22] MatmaRex: aha, got it, thanks for the explanation [19:04:32] (03CR) 10Urbanecm: [C: 03+2] Enable DiscussionTools as a beta feature on wikis with language variants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660870 (https://phabricator.wikimedia.org/T272639) (owner: 10Bartosz Dziewoński) [19:04:34] let's start then [19:04:41] thanks for checking [19:06:20] (03Merged) 10jenkins-bot: Enable DiscussionTools as a beta feature on wikis with language variants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660870 (https://phabricator.wikimedia.org/T272639) (owner: 10Bartosz Dziewoński) [19:06:50] (03Abandoned) 10Kosta Harlan: GrowthExperiments: Bump schema version for homepagemodule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660874 (https://phabricator.wikimedia.org/T273084) (owner: 10Kosta Harlan) [19:08:25] MatmaRex: can you check your patch at mwdebug1001, please? [19:08:39] looking [19:09:26] https://people.wikimedia.org/~legoktm/flip-what.html is comparing the new dewiki.png to old dewiki-2x.png. I just see a change in quality, but nothing shifting around, which means the old logo was spaced incorrectly [19:09:36] seems good Urbanecm [19:09:40] thanks, syncing [19:09:47] (03CR) 10Urbanecm: [C: 03+2] Enable DiscussionTools as a beta feature on 3 wikis per request [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660873 (https://phabricator.wikimedia.org/T258554) (owner: 10Bartosz Dziewoński) [19:11:14] \o is it too late to add a patch for this window? 
[19:11:15] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: a98f08f9582215e8f12f9e9c43f79a1f2fc21a2f: Enable DiscussionTools as a beta feature on wikis with language variants (T272639) (duration: 01m 07s) [19:11:15] (03Merged) 10jenkins-bot: Enable DiscussionTools as a beta feature on 3 wikis per request [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660873 (https://phabricator.wikimedia.org/T258554) (owner: 10Bartosz Dziewoński) [19:11:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:24] T272639: Enable Reply Tool as Beta Feature on remaining Wikipedias with language variants - https://phabricator.wikimedia.org/T272639 [19:11:33] kostajh: not really :) [19:11:51] MatmaRex: please test the other patch at mwdebug1001, thanks! [19:13:08] Urbanecm: also looks good [19:13:11] thanks, syncing [19:13:21] kostajh: please add your patch to the calendar :) [19:13:45] Urbanecm: will do [19:14:00] cool :) [19:14:09] (03PS2) 10Krinkle: Reword wmfEtcdApplyDBConfig() comments to better match those in LBFactoryMulti [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658473 (owner: 10Aaron Schulz) [19:14:56] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 6360e7899b7dedc29941783b1cdf76df8db073d7: Enable DiscussionTools as a beta feature on 3 wikis per request (T258554; T265829; T273192) (duration: 01m 04s) [19:15:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:03] T273192: Enable DiscussionTools beta on ku.wikipedia & ku.wiktionary - https://phabricator.wikimedia.org/T273192 [19:15:03] T265829: Introduce the reply tool to Japanese Wikivoyage - https://phabricator.wikimedia.org/T265829 [19:15:03] T258554: Enable discussion tools on discussion pages on the Konkani Wiktionary - https://phabricator.wikimedia.org/T258554 [19:15:05] MatmaRex: all done! Anything else? [19:15:17] thanks. 
not from me [19:15:34] np :) [19:15:47] (03PS1) 10Kosta Harlan: Banner module: Switch to using activated/unactivated for state [extensions/GrowthExperiments] (wmf/1.36.0-wmf.28) - 10https://gerrit.wikimedia.org/r/660926 (https://phabricator.wikimedia.org/T273084) [19:16:28] (03PS2) 10ArielGlenn: make wikidata rdf dumps easier to test in deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/660871 (https://phabricator.wikimedia.org/T269377) [19:17:37] RECOVERY - Kafka Broker Replica Max Lag on kafka-jumbo1005 is OK: (C)5e+06 ge (W)1e+06 ge 9.678e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1005 [19:17:44] kostajh: is https://gerrit.wikimedia.org/r/660926 the patch? [19:17:56] Urbanecm: yes [19:18:00] okay [19:18:47] RECOVERY - Kafka Broker Replica Max Lag on kafka-jumbo1009 is OK: (C)5e+06 ge (W)1e+06 ge 7.071e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1009 [19:18:55] (03CR) 10Urbanecm: [C: 03+2] Banner module: Switch to using activated/unactivated for state [extensions/GrowthExperiments] (wmf/1.36.0-wmf.28) - 10https://gerrit.wikimedia.org/r/660926 (https://phabricator.wikimedia.org/T273084) (owner: 10Kosta Harlan) [19:18:56] kostajh: btw, I know you have deployer rights, so...wanna do it yourself? [19:19:18] Urbanecm: I haven't had training in it yet, so, not this time please [19:19:23] okay :) [19:19:34] I'll do it then [19:20:46] kostajh: before it merges...do you happen to know how disabled suggested edit module looked before at fr.wiktionary? I managed to disable it fully, but...then it looks like this . Is that the intended state? 
[19:22:30] Urbanecm: yeah, AFAIK that is correct :) [19:22:35] okay [19:22:50] kostajh: if I can ask you to +2 https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/660875/, I'll backport it as well [19:23:44] !log gerrit2001 - restarting gerrit (replica) [19:23:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:26] Urbanecm: nice find, thanks. +2'ed [19:24:32] thanks kostajh :) [19:25:01] (03PS1) 10Urbanecm: SpecialHomepage: Do not load start-startediting if SE aren't enabled [extensions/GrowthExperiments] (wmf/1.36.0-wmf.28) - 10https://gerrit.wikimedia.org/r/660928 (https://phabricator.wikimedia.org/T273243) [19:25:08] (03CR) 10Urbanecm: [C: 03+2] SpecialHomepage: Do not load start-startediting if SE aren't enabled [extensions/GrowthExperiments] (wmf/1.36.0-wmf.28) - 10https://gerrit.wikimedia.org/r/660928 (https://phabricator.wikimedia.org/T273243) (owner: 10Urbanecm) [19:28:50] James_F, Urbanecm: I borrowed my sister's laptop (she has a non-HDPI screen) and the new dewiki.png (1x) is indistinguishable from the *old* dewiki-2x.png. Then we compared the new 1x to the source SVG from Commons and again, identical. So I'm confident in that the old 1x logo was wrong, and the new one is consistent with the old HD logos and the source SVG. [19:29:13] good to know! [19:29:57] legoktm: Aha. OK, that's interesting. [19:32:51] my guess is that it does look more blurry on HDPI screens for whatever reason, but that doesn't matter because it looks correct on normal screens [19:33:00] Right. 
[19:33:57] (03CR) 10Legoktm: "Conclusion: I borrowed my sister's laptop (she has a non-HDPI screen) and the new dewiki.png (1x) is indistinguishable from the *old* dew" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660620 (owner: 10Legoktm) [19:34:08] (03Merged) 10jenkins-bot: Banner module: Switch to using activated/unactivated for state [extensions/GrowthExperiments] (wmf/1.36.0-wmf.28) - 10https://gerrit.wikimedia.org/r/660926 (https://phabricator.wikimedia.org/T273084) (owner: 10Kosta Harlan) [19:34:28] finally [19:34:53] kostajh: please test your patch at mwdebug1001 [19:34:59] Urbanecm: doing [19:35:38] ...after i actually pull it there [19:35:43] ...done [19:35:50] git submodule update extensions/GrowthExperiments is so easy to forget [19:37:10] Urbanecm: LGTM! [19:37:19] thanks, syncing [19:38:52] (03Merged) 10jenkins-bot: SpecialHomepage: Do not load start-startediting if SE aren't enabled [extensions/GrowthExperiments] (wmf/1.36.0-wmf.28) - 10https://gerrit.wikimedia.org/r/660928 (https://phabricator.wikimedia.org/T273243) (owner: 10Urbanecm) [19:38:59] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.28/extensions/GrowthExperiments/includes/HomepageModules/Banner.php: d39746aa3ed07dfa9173a98d253c61771d5592a1: Banner module: Switch to using activated/unactivated for state (T273084) (duration: 01m 05s) [19:39:00] just in time :) [19:39:03] kostajh: done [19:39:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:13] T273084: HomepageModule events with validation errors - https://phabricator.wikimedia.org/T273084 [19:39:43] 10ops-eqiad, 10decommission-hardware, 10serviceops: decommission francium.eqiad.wmnet - https://phabricator.wikimedia.org/T273142 (10Dzahn) [19:39:49] patch works, syncing [19:41:49] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.28/extensions/GrowthExperiments/includes/Specials/SpecialHomepage.php: 1acaba4b3650dfb757d29af5395cc7660c839756: SpecialHomepage: Do not load start-startediting 
if SE arent enabled (T273243) (duration: 01m 05s) [19:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:53] T273243: [wmf.28-regression] frwiktionary - SE module displayed on Homepage with errors - https://phabricator.wikimedia.org/T273243 [19:42:00] 10ops-eqiad, 10decommission-hardware, 10serviceops: decommission francium.eqiad.wmnet - https://phabricator.wikimedia.org/T273142 (10Dzahn) a:05Dzahn→03None The serviceops part of this is done. dcops can now continue. [19:42:12] should be all done [19:42:19] !log Morning B&C done [19:42:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:26] 10ops-eqiad, 10decommission-hardware, 10serviceops: decommission francium.eqiad.wmnet - https://phabricator.wikimedia.org/T273142 (10wiki_willy) a:03Cmjohnson [19:44:58] 10SRE, 10ops-codfw, 10Analytics-Radar, 10decommission-hardware, 10serviceops: decommission kraz.wikimedia.org - https://phabricator.wikimedia.org/T245279 (10wiki_willy) a:03Papaul [19:46:31] RECOVERY - Kafka Broker Replica Max Lag on kafka-jumbo1001 is OK: (C)5e+06 ge (W)1e+06 ge 9.927e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1001 [19:52:40] 10SRE, 10ops-codfw, 10Analytics-Radar, 10decommission-hardware, 10serviceops: decommission kraz.wikimedia.org - https://phabricator.wikimedia.org/T245279 (10Papaul) @wiki_willy this is a VM it doesn't go to me [19:55:25] 10SRE, 10Analytics-Radar, 10decommission-hardware, 10serviceops: decommission kraz.wikimedia.org - https://phabricator.wikimedia.org/T245279 (10wiki_willy) a:05Papaul→03None [19:59:18] 10SRE, 10Analytics-Radar, 10decommission-hardware, 10serviceops: decommission kraz.wikimedia.org - https://phabricator.wikimedia.org/T245279 (10Dzahn) a:03Dzahn [20:00:59] 10SRE, 10Analytics-Radar, 
10decommission-hardware, 10serviceops: decommission kraz.wikimedia.org - https://phabricator.wikimedia.org/T245279 (10Dzahn) a:05Dzahn→03None Oh, this ticket is actually not ready at all (VM or not), per previous comments. this is supposed to be stalled. [20:02:14] 10SRE, 10Analytics-Radar, 10decommission-hardware, 10serviceops: decommission kraz.wikimedia.org - https://phabricator.wikimedia.org/T245279 (10Dzahn) 05Stalled→03Invalid I'll just close this as invalid because the template would not match a VM and the system is still in production. [20:02:18] 10SRE, 10Analytics, 10serviceops, 10vm-requests, 10User-Elukey: Create a replacement for kraz.wikimedia.org - https://phabricator.wikimedia.org/T244719 (10Dzahn) [20:02:55] 10SRE, 10Analytics-Radar, 10decommission-hardware, 10serviceops: decommission kraz.wikimedia.org - https://phabricator.wikimedia.org/T245279 (10wiki_willy) Thanks @Dzahn - I saw it listed under the "pending onsite steps (codfw)" column, so it threw me off for a sec. >! In T245279#6793672, @Dzahn wrote: >... [20:07:57] (03PS1) 10CDanis: WIP use facter instead of /etc/debian_version [puppet] - 10https://gerrit.wikimedia.org/r/660917 [20:09:11] 10SRE, 10Analytics-Radar, 10decommission-hardware, 10serviceops: decommission kraz.wikimedia.org - https://phabricator.wikimedia.org/T245279 (10Dzahn) Yea, my bad, this is not fitting for a VM. But also when this was created we did not have the same decom cookbook and workflow yet. [20:10:43] 10SRE, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), and 3 others: Investigate possible performance degradation on mediawiki servers after Debian Buster upgrade - https://phabricator.wikimedia.org/T273312 (10Legoktm) >>! In T273312#6788663, @Legoktm wrote: > I started by profiling `api.php?... 
[20:20:23] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:22:41] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:27:06] !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=mw1277.eqiad.wmnet [20:27:20] !log depooling mw1277.eqiad.wmnet for perf testing [20:27:55] !log depooling mw1277.eqiad.wmnet for perf testing [20:27:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:53] (03PS6) 10Legoktm: openldap: Convert cross-validate-accounts to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/658455 [20:43:33] (03CR) 10Legoktm: [C: 03+2] openldap: Convert cross-validate-accounts to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/658455 (owner: 10Legoktm) [20:52:12] !log andrew@deploy1001 Started deploy [striker/deploy@b6441b8]: Striker hacked fix for T272410 [20:52:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:16] T272410: Fix missing static resource "autocomplete_light" in ToolsAdmin causing broken webpage - https://phabricator.wikimedia.org/T272410 [20:53:08] !log andrew@deploy1001 Finished deploy [striker/deploy@b6441b8]: Striker hacked fix for T272410 (duration: 00m 57s) [20:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:09] jouncebot now [20:57:10] No deployments scheduled for the next 0 hour(s) and 2 minute(s) [20:57:17] jouncebot next [20:57:17] In 0 hour(s) and 2 minute(s): Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210201T2100) [20:58:59] (03PS2) 10Jbond: (WIP): add script to copy 
ldap entries to a local db [puppet] - 10https://gerrit.wikimedia.org/r/660869 [21:00:05] chrisalbon and accraze: I, the Bot under the Fountain, allow thee, The Deployer, to do Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210201T2100). [21:00:14] (03CR) 10Krinkle: [C: 03+1] profiler: Send data to excimer-buster pipeline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660865 (https://phabricator.wikimedia.org/T273312) (owner: 10Legoktm) [21:01:08] (03PS3) 10Krinkle: arclamp: Add excimer-buster pipeline [puppet] - 10https://gerrit.wikimedia.org/r/660863 (https://phabricator.wikimedia.org/T273312) (owner: 10Legoktm) [21:01:48] (03CR) 10Krinkle: arclamp: Add excimer-buster pipeline (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/660863 (https://phabricator.wikimedia.org/T273312) (owner: 10Legoktm) [21:10:34] chrisalbon and accraze: Can I borrow your deployment window to roll the train back to .27 ? [21:17:45] Silence implies consent! [21:18:01] I guess I can wait another 5 minutes. 
:-) [21:24:30] Yeah go ahead [21:24:40] 👍🏾 [21:25:00] (03PS1) 10Ahmon Dancy: all wikis to 1.36.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660949 [21:25:02] (03CR) 10Ahmon Dancy: [C: 03+2] all wikis to 1.36.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660949 (owner: 10Ahmon Dancy) [21:27:39] (03Merged) 10jenkins-bot: all wikis to 1.36.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660949 (owner: 10Ahmon Dancy) [21:29:28] !log dancy@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.36.0-wmf.27 [21:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:45] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10Marostegui) Thanks @wiki_willy - I will be off a few days next week but @LSobanski and @Kormat are on this task in case this needs to happen while I am away. [21:49:03] dancy: why did we rollback back? [21:49:08] *again [21:50:42] Hi Martin. The accumulation of errors occurring in .28 became too much so we decided to roll back to .27 and get those issues fixed in .29 [21:50:53] aha 😞 [21:51:22] There are 4 blockers. They _look_ easy to remedy but I'm certain. Hopefully it's just a matter of coordination [21:51:42] I'm not certain, that is. [21:53:07] (03CR) 10Dzahn: [C: 03+2] nagios_common::commands: require_package -> ensure_packages, simplify [puppet] - 10https://gerrit.wikimedia.org/r/659405 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [21:53:52] dancy: I see. T271341 doesn't seem to have any blockers though, where would I find them? [21:53:53] T271341: 1.36.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T271341 [21:54:10] https://phabricator.wikimedia.org/T271343 [21:54:16] thanks dancy ! [21:54:19] will look if i can help with any [21:54:23] Sorry about that. I meant to provide a link in the first place!
[21:54:30] np :) [21:55:20] T273101 has the widest blast radius in terms of log noise. [21:55:20] T273101: Accessing WikiPage that cannot exist as a page: w:Help:Books/Book creator text. [Called from WikiPage::exists] - https://phabricator.wikimedia.org/T273101 [21:56:21] dancy if T273479 is occurring on .27, does it need to be a blocker? [21:56:21] T273479: ApiEchoUnreadNotificationPages.php PHP Notice: Undefined index: query - https://phabricator.wikimedia.org/T273479 [21:56:57] there are also no recent changes that I see that could cause it https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/Echo/+log/refs/heads/master/includes/api/ApiEchoUnreadNotificationPages.php [21:57:24] I made T273479 a blocker for .29 because it was new for .27 and remained unresolved. [21:57:48] hmm. [21:58:33] are you sure it was new? https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/Echo/+log shows mostly i18n changes [21:58:56] I'll do a fresh query to verify that. [21:58:59] (03CR) 10Dzahn: "noop on alert1001" [puppet] - 10https://gerrit.wikimedia.org/r/659405 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [21:59:00] there were only i18n changes between .25 and .28 as far as I can tell [22:00:05] Reedy and sbassett: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210201T2200). [22:01:04] ^ there's a couple of sec patches I'd like to deploy now, unless there are any objections. [22:02:44] DannyS712: https://logstash.wikimedia.org/app/dashboards#/view/mediawiki-errors?_g=h@d0ff167&_a=h@5c101f5 makes it appear now. I can't say what the cause is. [22:02:53] s/now/new/ [22:03:12] I don't think the logs in that system go back that far [22:03:17] "Unable to completely restore the URL, be sure to use the share functionality." 
[22:03:33] !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: name=mw1277.eqiad.wmnet [22:03:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:56] !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=mw1278.eqiad.wmnet [22:03:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:31] !log depooling mw1278.eqiad.wmnet for perf testing [22:04:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:52] DannyS712: see if this is better https://logstash.wikimedia.org/goto/cecaba2ac300bdb55fe0a73110847400 [22:05:04] sbassett: I don't think there are any issues, unless dancy has any objections [22:05:16] dancy: see sbassett's message from above :) [22:05:24] Ok, good, since I'm already deploying :) [22:05:28] sbassett: No objection [22:05:34] @dancy got it, thanks, {{looking}} [22:05:53] !log Deployed security patch for T270713 [22:05:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:26] (03PS9) 10Cwhite: profile: update netdev to output ECS-formatted logs [puppet] - 10https://gerrit.wikimedia.org/r/647029 (https://phabricator.wikimedia.org/T234565) [22:09:45] (03CR) 10jerkins-bot: [V: 04-1] profile: update netdev to output ECS-formatted logs [puppet] - 10https://gerrit.wikimedia.org/r/647029 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [22:09:47] (03PS10) 10Cwhite: profile: update netdev to output ECS-formatted logs [puppet] - 10https://gerrit.wikimedia.org/r/647029 (https://phabricator.wikimedia.org/T234565) [22:09:56] !log Deployed security patch for T272386 [22:09:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:35] okay, @dancy I have a guess - the issue for the echo failures could have been caused in CentralAuth, specifically https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/CentralAuth/+/9f79de43eb1ebb7ab6fea3ed83569dedd796ba14. 
Was there a rise in echo `warning` logs? [22:11:54] no wait, that wasn't actually merged until after .27 was cut [22:13:09] (03PS1) 10Dzahn: profile::docker::storage::loopback: hiera -> lookup [puppet] - 10https://gerrit.wikimedia.org/r/660951 (https://phabricator.wikimedia.org/T209953) [22:18:16] (03CR) 10Cwhite: profile: update netdev to output ECS-formatted logs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/647029 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [22:21:57] (03CR) 10Legoktm: arclamp: Add excimer-buster pipeline (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/660863 (https://phabricator.wikimedia.org/T273312) (owner: 10Legoktm) [22:23:36] (03PS4) 10Legoktm: arclamp: Add excimer-buster pipeline [puppet] - 10https://gerrit.wikimedia.org/r/660863 (https://phabricator.wikimedia.org/T273312) [22:25:06] (03PS1) 10Dzahn: labs_bootstrapvz: hiera -> lookup [puppet] - 10https://gerrit.wikimedia.org/r/660953 (https://phabricator.wikimedia.org/T209953) [22:26:12] dancy solved (ish) - it's not new to .27, it's just that the message changed due to https://gerrit.wikimedia.org/r/c/mediawiki/core/+/654953 . If you look at https://logstash.wikimedia.org/goto/9f1a764cdf47a6d21b9579dbe47fe500 the same issue was present back in wmf.22, so imo it shouldn't be considered a train blocker [22:26:19] (03CR) 10Legoktm: [V: 03+1] "PCC looks good: https://puppet-compiler.wmflabs.org/compiler1003/27798/" [puppet] - 10https://gerrit.wikimedia.org/r/660863 (https://phabricator.wikimedia.org/T273312) (owner: 10Legoktm) [22:28:19] Thanks for investigating Danny. Was the problem introduced in .22? I guess the answer to that question doesn't really matter. What matters to me is: How do we make it so we don't get that log message? [22:29:32] We use the error logs to determine the health of a deployment and when non-blocking log messages are allowed to accumulate it makes the logs less useful for that purpose. I'm trying to fight back against that.
[22:30:14] add it to the logs filtered out and ignored in `mediawiki-new-errors`? Not sure, but now that I know where to look I'll try to find the cause again (I checked all core, central auth, and echo patches for .27 before realizing it predated that) [22:33:40] I appreciate your investigation! [22:34:08] echo `debug` logs should be present on fluorine, if I'm reading the config correctly - was there an uptick in messages when .22 was deployed? Specifically looking for "Exception when fetching CentralAuth token" [22:34:27] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/647267/8/includes/CentralAuthHooks.php appears to have touched related code [22:34:51] either way, can I remove it as a blocker? [22:35:18] Yes in exchange for a promise to fix it this week. [22:35:58] I [22:35:58] sorry, no can do - I don't really understand the code involved [22:36:44] Can you give me more info about fluorine? That's a new reference for me. [22:38:32] I only know what I know from the beta cluster, but I believe in prod it is known as mwlog1001 (https://wikitech.wikimedia.org/wiki/Mwlog1001 https://wikitech.wikimedia.org/wiki/Fluorine and https://wikitech.wikimedia.org/wiki/Logs) [22:39:35] okay. I'm logged into mw1001 now so I'll see what I can find to answer your question. [22:44:52] DannyS712: No hits for `Exception when fetching CentralAuth token` in any Echo.log* file [22:45:29] dancy: fyi, fluorine used to be the prod server back in 2017 but since then it's gone and replaced. it's only showing up because beta is using old names [22:45:47] mwlog should be the right place [22:45:48] yup yup [22:45:49] aha! history [22:46:12] aha, sorry, my mistake. dancy thanks for letting me know - I need to leave on the hour, but I'll keep investigating [22:46:17] dancy: I'm pretty confident https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Collection/+/659074/ will remove the notice. Can we test it at mwdebug? [22:46:34] DannyS712: OK. Let's regroup tomorrow!
[22:46:50] Urbanecm: yes. Do you want me to deploy? [22:47:06] dancy: I can do it, just wanted to make sure you're fine with that :) [22:47:23] I'm fine with that. Looking forward to the results! [22:47:41] I'm going to step away for a bit as well. Will check back in later looking for happy news [22:48:21] (03PS1) 10Urbanecm: Remove unnecessary calls to WikiPage [extensions/Collection] (wmf/1.36.0-wmf.28) - 10https://gerrit.wikimedia.org/r/660934 (https://phabricator.wikimedia.org/T273101) [22:48:45] (03CR) 10Urbanecm: [C: 03+2] "let's test it at mwdebug" [extensions/Collection] (wmf/1.36.0-wmf.28) - 10https://gerrit.wikimedia.org/r/660934 (https://phabricator.wikimedia.org/T273101) (owner: 10Urbanecm) [22:48:53] let's see how long jenkins will take [22:50:07] went to go test stuff on beta cluster, it appears to be failing again - https://logstash-beta.wmflabs.org/ shows no results found [22:54:30] (03Merged) 10jenkins-bot: Remove unnecessary calls to WikiPage [extensions/Collection] (wmf/1.36.0-wmf.28) - 10https://gerrit.wikimedia.org/r/660934 (https://phabricator.wikimedia.org/T273101) (owner: 10Urbanecm) [22:54:45] \o [23:05:15] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.28/extensions/Collection/includes/Specials/SpecialCollection.php: 3c7864ca1d5aadc9cd251939c0e23f661faef5e9: Remove unnecessary calls to WikiPage (T273101) (duration: 00m 58s) [23:05:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:22] T273101: Accessing WikiPage that cannot exist as a page: w:Help:Books/Book creator text. 
[Called from WikiPage::exists] - https://phabricator.wikimedia.org/T273101 [23:06:10] dancy: worked fine, synced and +2'ed the master patch [23:14:10] !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: name=mw1278.eqiad.wmnet [23:14:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:14:41] !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=mw1403.eqiad.wmnet [23:14:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:14:52] !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=mw1405.eqiad.wmnet [23:14:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:15:11] !log depooling mw1403 and mw1405 for perf testing [23:15:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:32:27] (03CR) 10Dave Pifke: [C: 03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/660863 (https://phabricator.wikimedia.org/T273312) (owner: 10Legoktm) [23:33:48] dpifke, Krinkle: ok for me to merge & deploy now? I assume the puppet patches goes first, run it everywhere, then mw-config? [23:36:51] Thanks Urbanecm! [23:36:57] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) timed out before a response was received: /_info (retrieve service info) timed out before a response was received: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [23:37:39] (03CR) 10Dave Pifke: [C: 03+1] "LGTM, once the other patch to consume the Redis queue has landed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660865 (https://phabricator.wikimedia.org/T273312) (owner: 10Legoktm) [23:37:51] legoktm: not sure the order matters. wmf-config first would be fine too, and just means we lose the first few minutes that are are already not capturing today. 
might be better as that way we can first redisctl subscribe excimer-buster and see that it works, in case there is any doubt :P [23:38:00] I defer to dpifke :) [23:38:51] legoktm: Yup, go ahead. Once the Puppet patch is merged I can verify that the consumer is running. [23:39:15] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [23:39:22] (03PS5) 10Legoktm: arclamp: Add excimer-buster pipeline [puppet] - 10https://gerrit.wikimedia.org/r/660863 (https://phabricator.wikimedia.org/T273312) [23:40:20] (03CR) 10Legoktm: [C: 03+2] arclamp: Add excimer-buster pipeline [puppet] - 10https://gerrit.wikimedia.org/r/660863 (https://phabricator.wikimedia.org/T273312) (owner: 10Legoktm) [23:42:44] dpifke: finished the puppet run on both hosts [23:43:43] (03PS4) 10Legoktm: profiler: Send data to excimer-buster pipeline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660865 (https://phabricator.wikimedia.org/T273312) [23:44:01] jouncebot: next [23:44:01] In 0 hour(s) and 15 minute(s): Evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210202T0000) [23:44:24] (03CR) 10Legoktm: [C: 03+2] profiler: Send data to excimer-buster pipeline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660865 (https://phabricator.wikimedia.org/T273312) (owner: 10Legoktm) [23:45:15] (03Merged) 10jenkins-bot: profiler: Send data to excimer-buster pipeline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660865 (https://phabricator.wikimedia.org/T273312) (owner: 10Legoktm) [23:47:22] I see the excimer-buster-log service running [23:48:59] Yup, looks good on this end. [23:50:26] ok, the mw-config change is on mwdebug1003 now, which is a buster host [23:50:40] Once we start getting data from it, we should see /srv/arclamp/logs/{daily,hourly}/*.excimer-buster.* start appearing. Graphs for those will then be created starting 15-ish minutes from the end of the {hour,day}. 
[23:51:37] let me pick a host that's actually getting user traffic [23:52:48] legoktm@webperf1002:/srv/arclamp/logs/hourly$ ls | grep buster [23:52:48] 2021-02-01_23.excimer-buster.all.log [23:52:48] 2021-02-01_23.excimer-buster.index.log [23:53:18] awesome :) [23:54:04] syncing everywhere [23:54:45] !log legoktm@deploy1001 Synchronized wmf-config/profiler.php: profiler: Send data to excimer-buster pipeline (T273312) (duration: 00m 57s) [23:54:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:54:50] T273312: Investigate possible performance degradation on mediawiki servers after Debian Buster upgrade - https://phabricator.wikimedia.org/T273312 [23:55:21] dpifke: thanks for your help :) and Krinkle too [23:55:34] np! [23:56:37] I'll keep an eye on the logs from the SVG generator job, but it should just work.