[00:00:03] (CR) Volans: [C: +2] gen-zones: transliterate commit message to ASCII [dns] - https://gerrit.wikimedia.org/r/579586 (owner: Volans)
[00:01:10] (CR) CRusnov: [C: +2] Update Netbox to v2.7.10-wmf [software/netbox-deploy] - https://gerrit.wikimedia.org/r/579593 (owner: CRusnov)
[00:01:20] (CR) CRusnov: [V: +2 C: +2] Update Netbox to v2.7.10-wmf [software/netbox-deploy] - https://gerrit.wikimedia.org/r/579593 (owner: CRusnov)
[00:03:05] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:03:37] (PS4) Dzahn: misc_apps: add firewall rule to let envoy talk to port 80 [puppet] - https://gerrit.wikimedia.org/r/580479
[00:05:43] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1003/21471/miscweb1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/580479 (owner: Dzahn)
[00:06:31] !log crusnov@deploy1001 Started deploy [netbox/deploy@14256f9]: netbox 2.7.10 upgrade
[00:06:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:07:20] Operations, fundraising-tech-ops: rack/setup/install frnetmon1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T232137 (Dwisehaupt) Internal PCI scan and credentialed audit scan run as expected.
[00:07:49] !log crusnov@deploy1001 Finished deploy [netbox/deploy@14256f9]: netbox 2.7.10 upgrade (duration: 01m 17s)
[00:07:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:08:08] !log crusnov@deploy1001 Started deploy [netbox/deploy@14256f9]: netbox 2.7.10 upgrade
[00:08:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:10:37] !log crusnov@deploy1001 Finished deploy [netbox/deploy@14256f9]: netbox 2.7.10 upgrade (duration: 02m 29s)
[00:10:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:31:40] !log foreachwikiindblist medium deleteEqualMessages.php --delete (T247562)
[00:31:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:31:46] T247562: Warning: Memcached::setMulti(): failed to set key global:segment:... - https://phabricator.wikimedia.org/T247562
[00:42:11] (PS2) Ladsgroup: Read from the new term store everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/580416 (https://phabricator.wikimedia.org/T219123)
[00:48:19] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 76, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:49:11] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 64, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:52:03] PROBLEM - PHP opcache health on mwdebug1002 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[00:54:21] (PS3) Ladsgroup: Read from the new term store everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/580416 (https://phabricator.wikimedia.org/T219123)
[00:54:58] (PS4) Ladsgroup: Read from the new term store everywhere [mediawiki-config] -
https://gerrit.wikimedia.org/r/580416 (https://phabricator.wikimedia.org/T219123)
[01:21:49] RECOVERY - PHP opcache health on mwdebug1002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[02:23:57] PROBLEM - PHP opcache health on mw2170 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[02:34:01] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:36:07] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:44:45] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:46:53] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:50:01] Operations, Commons, MediaWiki-File-management, Multimedia, and 9 others: Picture from Commons not found from Singapore - https://phabricator.wikimedia.org/T231086 (Krinkle)
[03:02:16] (PS1) Ladsgroup: labs: Stop writing to the old term store for wikidatawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/580603 (https://phabricator.wikimedia.org/T219123)
[03:03:19] (CR) Ladsgroup: [C: +2] labs: Stop writing to the old term store for wikidatawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/580603 (https://phabricator.wikimedia.org/T219123) (owner: Ladsgroup)
[03:04:14] (Merged) jenkins-bot: labs: Stop writing to the old term store for wikidatawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/580603 (https://phabricator.wikimedia.org/T219123) (owner: Ladsgroup)
[03:04:48] ^ Rebased on deploy1001
[03:15:41] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 74, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:16:29] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 62, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:30:09] RECOVERY - PHP opcache health on mw2170 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:35:53] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:37:57] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within
thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:58:47] RECOVERY - BGP status on cr1-eqsin is OK: BGP OK - up: 266, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:18:37] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 64, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:19:55] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 76, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:22:15] Operations, DBA, Patch-For-Review, codfw-rollout: [RFC] improve parsercache replication, sharding and HA - https://phabricator.wikimedia.org/T133523 (Marostegui)
[06:25:01] Operations, MediaWiki-Cache, Wikimedia-Incident: Parsercache sudden increase of connections - https://phabricator.wikimedia.org/T247788 (Marostegui) Thanks @Anomie for the detailed explanation, like Jaime, I also had no idea about that double write factor, that explains things. I have added these lea...
[06:31:39] !log Reboot pc1008 to try to get its RAID redone - T247787
[06:31:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:31:46] T247787: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787
[06:35:09] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:37:17] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:47:18] (PS1) Marostegui: install_server: Reimage pc1008 [puppet] - https://gerrit.wikimedia.org/r/580697 (https://phabricator.wikimedia.org/T247787)
[06:47:46] Operations, DBA, Patch-For-Review, Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (Marostegui) >>! In T247787#5977227, @wiki_willy wrote: > Sure, that works for me @Marostegui . Feel free to shoot open...
[06:49:37] (CR) Marostegui: [C: +2] install_server: Reimage pc1008 [puppet] - https://gerrit.wikimedia.org/r/580697 (https://phabricator.wikimedia.org/T247787) (owner: Marostegui)
[06:54:43] Operations, DBA, Patch-For-Review, Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['pc1008.e...
[07:01:35] RECOVERY - PHP opcache health on mwdebug1001 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[07:05:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime
[07:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:08:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[07:08:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:43] Operations, Puppet, User-Joe: Disable hiera autolookups - https://phabricator.wikimedia.org/T181971 (Joe) I still hate automatic parameters lookups, but maybe we have to accept it's the way to go and adapt our guides accordingly? In particular, I would be ok with using autolookup of parameters if we...
[07:16:16] Operations, serviceops, Service-Architecture: Monitor envoy status where it's installed - https://phabricator.wikimedia.org/T247387 (Joe) Open→Resolved
[07:16:21] Operations, MediaWiki-General, serviceops, Patch-For-Review, Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (Joe)
[07:20:21] <_joe_> jouncebot: nexgt
[07:20:24] <_joe_> grrr
[07:20:27] <_joe_> jouncebot: next
[07:20:27] In 3 hour(s) and 39 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200318T1100)
[07:20:43] (PS2) Giuseppe Lavagetto: Switch eventgate-analytics to go through envoy everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/580286 (https://phabricator.wikimedia.org/T247484)
[07:25:49] ACKNOWLEDGEMENT - MegaRAID on pc1008 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T247920 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:25:52] Operations, ops-eqiad: Degraded RAID on pc1008 - https://phabricator.wikimedia.org/T247920 (ops-monitoring-bot)
[07:25:59] (CR) Giuseppe Lavagetto: [C: +2] Switch eventgate-analytics to go through envoy everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/580286 (https://phabricator.wikimedia.org/T247484) (owner: Giuseppe Lavagetto)
[07:27:07] (Merged) jenkins-bot: Switch eventgate-analytics to go through envoy everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/580286 (https://phabricator.wikimedia.org/T247484) (owner: Giuseppe Lavagetto)
[07:30:09] (CR) Muehlenhoff: [C: +1] base: relax interval for selected checks (1 comment) [puppet] - https://gerrit.wikimedia.org/r/580327 (https://phabricator.wikimedia.org/T247538) (owner: Filippo Giunchedi)
[07:33:34] (Abandoned) DCausse: [logstash] add debug_blob
field [puppet] - https://gerrit.wikimedia.org/r/392590 (https://phabricator.wikimedia.org/T180051) (owner: DCausse)
[07:33:47] (Abandoned) DCausse: [logstash] log all elastic queries [puppet] - https://gerrit.wikimedia.org/r/392603 (https://phabricator.wikimedia.org/T180051) (owner: DCausse)
[07:34:27] (CR) Muehlenhoff: debdeplot: add libGraphicsMagick-Q16 as a lib for graphicsmagick (1 comment) [puppet] - https://gerrit.wikimedia.org/r/580298 (owner: Jbond)
[07:40:24] !log oblivian@deploy1001 Synchronized wmf-config/ProductionServices.php: eventgate-analytics to use envoy everywhere (duration: 01m 10s)
[07:40:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:17] (PS3) Giuseppe Lavagetto: ProductionServices: switch eventgate-main to use envoy [mediawiki-config] - https://gerrit.wikimedia.org/r/576009 (https://phabricator.wikimedia.org/T244843)
[07:47:33] (CR) Elukey: [C: +2] jupyterhub: delete users from the database automatically [puppet] - https://gerrit.wikimedia.org/r/580345 (owner: Elukey)
[07:49:34] (PS1) Vgutierrez: ATS: Enable inbound TLSv1.3 for upload@ulsfo [puppet] - https://gerrit.wikimedia.org/r/580742 (https://phabricator.wikimedia.org/T170567)
[07:54:24] (CR) Elukey: [C: +2] "Config option `delete_invalid_users` not recognized by `VenvCreatingAuthenticator`" [puppet] - https://gerrit.wikimedia.org/r/580345 (owner: Elukey)
[07:54:35] (PS1) Elukey: Revert "jupyterhub: delete users from the database automatically" [puppet] - https://gerrit.wikimedia.org/r/580745
[07:55:10] !log installing remaining libxslt security updates
[07:55:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:31] (CR) Elukey: [C: +2] Revert "jupyterhub: delete users from the database automatically" [puppet] - https://gerrit.wikimedia.org/r/580745 (owner: Elukey)
[07:57:08] (PS2) Vgutierrez: ATS: Enable inbound TLSv1.3 for upload@ulsfo
[puppet] - https://gerrit.wikimedia.org/r/580742 (https://phabricator.wikimedia.org/T170567)
[07:59:44] (PS3) Vgutierrez: ATS: Enable inbound TLSv1.3 for upload@ulsfo [puppet] - https://gerrit.wikimedia.org/r/580742 (https://phabricator.wikimedia.org/T170567)
[08:01:39] Operations, ops-eqiad: Degraded RAID on pc1008 - https://phabricator.wikimedia.org/T247920 (Marostegui) Open→Invalid That's the raid initialization as part of the reimage happening at T247787
[08:02:21] (CR) Vgutierrez: "NOOP in text, expected changes in upload: https://puppet-compiler.wmflabs.org/compiler1002/21474/" [puppet] - https://gerrit.wikimedia.org/r/580742 (https://phabricator.wikimedia.org/T170567) (owner: Vgutierrez)
[08:14:32] !log upgrade ATS to 8.0.6-1wm3 in ulsfo - T170567
[08:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:38] T170567: Support TLSv1.3 - https://phabricator.wikimedia.org/T170567
[08:16:08] Operations, DBA, Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['pc1008.eqiad.wmnet'] ` and were **ALL** successful.
[08:22:06] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[08:24:14] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[08:44:15] !log Start replication pc1008 from pc1010 to get some of the new keys so it is not fully empty - T247787
[08:44:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:21] T247787: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787
[08:47:04] Operations: Is invite-wmfall@wikimedia.org a Mailman list? - https://phabricator.wikimedia.org/T247848 (Aklapper) Open→Resolved a:Dzahn Question answered, I guess :)
[08:48:07] Operations, Wikimedia-Mailing-lists: Request for new mailing list Deutschschweiz - https://phabricator.wikimedia.org/T247737 (ArielGlenn) Your list has been created; see https://lists.wikimedia.org/mailman/admin/deutschschweiz/ and https://lists.wikimedia.org/mailman/listinfo/deutschschweiz I have added...
[08:53:11] there is a cache issue going on which blocks wmf.23 from moving forward. Seems Timo had the root cause figured out and made some patches https://gerrit.wikimedia.org/r/580599 / https://gerrit.wikimedia.org/r/580598
[08:53:39] but I can't +2/deploy/confirm them. I got kids home schooled this morning. At best I am there at ~ 1pm UTC
[08:59:51] Operations, DBA, Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (Marostegui) @wiki_willy I was able to destroy and recreate the RAID myself. So no further needs are expected at this point. While the raid w...
[09:00:17] Operations, Wikimedia-Mailing-lists: Request new mailing list for Myanmar Wikimedia Community User Group - https://phabricator.wikimedia.org/T247647 (ArielGlenn) Your list has been created; see https://lists.wikimedia.org/mailman/listinfo/wikimedia-mm and https://lists.wikimedia.org/mailman/admin/wikimed...
[09:05:21] (PS4) Vgutierrez: ATS: Enable inbound TLSv1.3 for upload@ulsfo [puppet] - https://gerrit.wikimedia.org/r/580742 (https://phabricator.wikimedia.org/T170567)
[09:08:49] (CR) Ema: [C: +1] ATS: Enable inbound TLSv1.3 for upload@ulsfo [puppet] - https://gerrit.wikimedia.org/r/580742 (https://phabricator.wikimedia.org/T170567) (owner: Vgutierrez)
[09:10:45] Operations, DBA, Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (Marostegui) So, given that pc1008 looks ok from those tests my proposal is: - Let pc1008 (pc2008 replicates from pc1008) replicate for maybe...
[09:15:18] (CR) Elukey: [C: +1] atskafka: rdkafka configuration support (1 comment) [puppet] - https://gerrit.wikimedia.org/r/580364 (https://phabricator.wikimedia.org/T247497) (owner: Ema)
[09:16:09] (CR) Vgutierrez: [C: +2] ATS: Enable inbound TLSv1.3 for upload@ulsfo [puppet] - https://gerrit.wikimedia.org/r/580742 (https://phabricator.wikimedia.org/T170567) (owner: Vgutierrez)
[09:17:15] (PS1) ArielGlenn: add joewalsh to analytics-privatedata-users and remove from researchers [puppet] - https://gerrit.wikimedia.org/r/580853 (https://phabricator.wikimedia.org/T247636)
[09:17:30] (CR) Elukey: [C: +1] "> > Gehel / Herron - should we coordinate on deployment of this?
It" [puppet] - https://gerrit.wikimedia.org/r/579396 (https://phabricator.wikimedia.org/T246961) (owner: Mstyles)
[09:18:15] !log enabling inbound TLSv1.3 in cp4026 - T170567
[09:18:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:22] T170567: Support TLSv1.3 - https://phabricator.wikimedia.org/T170567
[09:18:57] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to event logging data in hive for joewalsh - https://phabricator.wikimedia.org/T247636 (ArielGlenn) Note that this patch removes joewalsh from the researchers group because according to the comments users should be in one or the oth...
[09:19:21] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to event logging data in hive for joewalsh - https://phabricator.wikimedia.org/T247636 (ArielGlenn)
[09:19:29] (CR) Gehel: [C: +1] "> Patch Set 5:" [puppet] - https://gerrit.wikimedia.org/r/579396 (https://phabricator.wikimedia.org/T246961) (owner: Mstyles)
[09:19:45] (CR) Elukey: [C: +1] atskafka: convert float64 configuration values to int [software/atskafka] - https://gerrit.wikimedia.org/r/580376 (https://phabricator.wikimedia.org/T237993) (owner: Ema)
[09:21:35] (CR) Elukey: [C: +1] "> > Patch Set 5:" [puppet] - https://gerrit.wikimedia.org/r/579396 (https://phabricator.wikimedia.org/T246961) (owner: Mstyles)
[09:31:20] (PS1) Vgutierrez: prometheus: Add TLSv1.3 ciphersuites on ATS exporter [puppet] - https://gerrit.wikimedia.org/r/580868 (https://phabricator.wikimedia.org/T170567)
[09:38:34] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/580853 (https://phabricator.wikimedia.org/T247636) (owner: ArielGlenn)
[09:39:06] (CR) Vgutierrez: [C: +2] prometheus: Add TLSv1.3 ciphersuites on ATS exporter [puppet] - https://gerrit.wikimedia.org/r/580868 (https://phabricator.wikimedia.org/T170567) (owner: Vgutierrez)
[09:39:42] (CR) Elukey:
[C: +1] add joewalsh to analytics-privatedata-users and remove from researchers [puppet] - https://gerrit.wikimedia.org/r/580853 (https://phabricator.wikimedia.org/T247636) (owner: ArielGlenn)
[09:40:36] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to event logging data in hive for joewalsh - https://phabricator.wikimedia.org/T247636 (elukey) @ArielGlenn yes I confirm thanks! The patch looks good, let's wait for @Nuria's approval and then I'd say that we can merge :)
[09:40:57] Operations, Epic: Migrate all of production metal to Buster or later - https://phabricator.wikimedia.org/T247045 (Marostegui)
[09:43:10] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:43:57] !log enabling inbound TLSv1.3 in upload@ulsfo - T170567
[09:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:04] T170567: Support TLSv1.3 - https://phabricator.wikimedia.org/T170567
[09:47:24] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:02:47] (PS3) Ayounsi: Juniper to Netbox import script [software/netbox-extras] - https://gerrit.wikimedia.org/r/566812
[10:03:09] (CR) jerkins-bot: [V: -1] Juniper to Netbox import script [software/netbox-extras] - https://gerrit.wikimedia.org/r/566812 (owner: Ayounsi)
[10:04:22] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:04:31] (PS1) Vgutierrez: ATS: Fix session_ticket_number config name [puppet] - https://gerrit.wikimedia.org/r/580872 (https://phabricator.wikimedia.org/T245616)
[10:06:12] (CR) Vgutierrez: [C: +2] ATS: Fix session_ticket_number config name [puppet] - https://gerrit.wikimedia.org/r/580872 (https://phabricator.wikimedia.org/T245616) (owner: Vgutierrez)
[10:06:30] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:06:44] (CR) Ladsgroup: [C: +2] Read from the new term store everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/580416 (https://phabricator.wikimedia.org/T219123) (owner: Ladsgroup)
[10:07:43] (Merged) jenkins-bot: Read from the new term store everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/580416 (https://phabricator.wikimedia.org/T219123) (owner: Ladsgroup)
[10:12:17] (PS2) Jbond: debdeplot: add libGraphicsMagick-Q16 as a lib for graphicsmagick [puppet] - https://gerrit.wikimedia.org/r/580298
[10:12:48] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:579925|Read from the new term store everywhere (T219123)]] (duration: 01m 08s)
[10:12:51] (PS3) Jbond: debdeploy: add libGraphicsMagick-Q16 as a lib for graphicsmagick [puppet] - https://gerrit.wikimedia.org/r/580298
[10:12:53] (CR) Jbond: "updated thanks" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/580298 (owner: Jbond)
[10:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:55] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123
[10:13:23] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/580298 (owner: Jbond)
[10:14:44] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:579925|Read from the new term store everywhere (T219123)]], take II (duration: 01m 07s)
[10:14:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:08] Operations, Puppet, User-Joe: Disable hiera autolookups - https://phabricator.wikimedia.org/T181971 (jbond) > In particular, I would be ok with using autolookup of parameters if we straight ban the definition of hiera variables for classes not in the profile
module outside of the common range. I thi...
[10:16:24] (CR) Jbond: [C: +2] debdeploy: add libGraphicsMagick-Q16 as a lib for graphicsmagick [puppet] - https://gerrit.wikimedia.org/r/580298 (owner: Jbond)
[10:31:32] marostegui: be ready to drop wb_terms from test soon ^_^
[10:31:54] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:579925|Read from the new term store everywhere (T219123)]] (duration: 01m 07s)
[10:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:31:58] and in a week or two from production
[10:31:59] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123
[10:32:18] ^ redeploying because I forgot to rebase :(
[10:33:11] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:579925|Read from the new term store everywhere (T219123)]], take II (duration: 01m 07s)
[10:33:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:34:44] (PS1) Ladsgroup: Stop writing to old term store in testwikidatawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/580876 (https://phabricator.wikimedia.org/T208425)
[10:42:29] (CR) Ladsgroup: [C: +2] Stop writing to old term store in testwikidatawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/580876 (https://phabricator.wikimedia.org/T208425) (owner: Ladsgroup)
[10:43:46] (Merged) jenkins-bot: Stop writing to old term store in testwikidatawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/580876 (https://phabricator.wikimedia.org/T208425) (owner: Ladsgroup)
[10:44:42] (PS4) Filippo Giunchedi: prometheus: add icinga average latency checks [puppet] - https://gerrit.wikimedia.org/r/580347 (https://phabricator.wikimedia.org/T247538)
[10:44:52] Operations, Analytics, Research, Traffic, and 2 others: Enable layered data-access and sharing for a new form of collaboration -
https://phabricator.wikimedia.org/T245833 (Miriam)
[10:45:44] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:579925|Stop writing to old term store in testwikidatawiki (T208425)]] (duration: 01m 07s)
[10:45:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:56] T208425: [EPIC] Kill the wb_terms table - https://phabricator.wikimedia.org/T208425
[10:47:40] I forgot to rebase, again
[10:48:31] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:579925|Stop writing to old term store in testwikidatawiki (T208425)]], take II (duration: 01m 07s)
[10:48:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:49:29] Amir1: nice!!!!!
[10:50:51] (PS1) Ladsgroup: Stop writing to old term store (wb_terms table) in wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/580881 (https://phabricator.wikimedia.org/T208425)
[10:51:23] marostegui: ^ I'm stopping writes on production now, you should see a large drop on master connection, writes, replicas, etc.
also wb_terms would stop growing
[10:51:26] (CR) Vgutierrez: [C: +1] atskafka: convert float64 configuration values to int [software/atskafka] - https://gerrit.wikimedia.org/r/580376 (https://phabricator.wikimedia.org/T237993) (owner: Ema)
[10:51:59] (CR) Ladsgroup: [C: +2] Stop writing to old term store (wb_terms table) in wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/580881 (https://phabricator.wikimedia.org/T208425) (owner: Ladsgroup)
[10:52:06] (CR) Ema: [C: +2] atskafka: convert float64 configuration values to int [software/atskafka] - https://gerrit.wikimedia.org/r/580376 (https://phabricator.wikimedia.org/T237993) (owner: Ema)
[10:52:18] <_joe_> !log setting num_retries=0, idle_timeout=5s on mw2223 for eventgate-analytics in envoy (T247484)
[10:52:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:23] T247484: Lots of "EventBus: Unable to deliver all events" - https://phabricator.wikimedia.org/T247484
[10:53:22] (Merged) jenkins-bot: Stop writing to old term store (wb_terms table) in wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/580881 (https://phabricator.wikimedia.org/T208425) (owner: Ladsgroup)
[10:55:39] Amir1: Epic!
monitoring as well
[10:55:47] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:579925|Stop writing to old term store (wb_terms table) in wikidata (T208425)]] (duration: 01m 08s)
[10:55:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:55:52] T208425: [EPIC] Kill the wb_terms table - https://phabricator.wikimedia.org/T208425
[10:56:42] marostegui: You can drop it soon after some communication to tool devs + monitoring if issues surface (it would be great if we monitor anything that tries to read or write on this table)
[10:58:04] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:579925|Stop writing to old term store (wb_terms table) in wikidata (T208425)]], take II (duration: 01m 06s)
[10:58:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:58:33] <_joe_> !log setting num_retries=0 on mw2224 for eventgate-analytics in envoy (T247484)
[10:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:58:38] T247484: Lots of "EventBus: Unable to deliver all events" - https://phabricator.wikimedia.org/T247484
[11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: May I have your attention please! European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200318T1100)
[11:00:04] kart_: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:28] * kart_ is here.
[11:01:05] will start with my patch.
[11:01:21] (03PS2) 10KartikMistry: Enable Content Translation in Malay, Azerbaijani and Estonian WPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579893 (https://phabricator.wikimedia.org/T246622) [11:01:37] 10Operations, 10netops, 10cloud-services-team (Kanban): CloudVPS: enable BGP in the neutron transport network - https://phabricator.wikimedia.org/T245606 (10ayounsi) BGP and firewall filter config removed from codfw's router. [11:02:48] (03CR) 10KartikMistry: [C: 03+2] Enable Content Translation in Malay, Azerbaijani and Estonian WPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579893 (https://phabricator.wikimedia.org/T246622) (owner: 10KartikMistry) [11:03:17] (03Abandoned) 10Ayounsi: Add cloud-out4 firewall filter [homer/public] - 10https://gerrit.wikimedia.org/r/577575 (https://phabricator.wikimedia.org/T246887) (owner: 10Ayounsi) [11:03:39] Amir1: We should probably rename it first [11:03:41] Before dropping [11:03:44] (03Merged) 10jenkins-bot: Enable Content Translation in Malay, Azerbaijani and Estonian WPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579893 (https://phabricator.wikimedia.org/T246622) (owner: 10KartikMistry) [11:03:48] marostegui: definitely [11:03:53] Amir1: So we can actually look for errors and everything [11:06:09] (03PS1) 10Filippo Giunchedi: prometheus: add mediawiki recording rules [puppet] - 10https://gerrit.wikimedia.org/r/580888 [11:07:44] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:08:23] (03CR) 10Filippo Giunchedi: "Meant to help with https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard panels loading too many metrics, likely som" [puppet] - 10https://gerrit.wikimedia.org/r/580888 (owner: 10Filippo Giunchedi) [11:09:25] in half an hour or so, I will deploy 
the hotfixes for the train blocker ( T247562 [11:09:25] ) [11:09:25] T247562: Warning: Memcached::setMulti(): failed to set key global:segment:... - https://phabricator.wikimedia.org/T247562 [11:09:41] hasharSchool: do you want me to deploy? What should I monitor? [11:09:50] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:09:56] 10Operations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): CloudVPS: introduce filtering for neutron BGP addresses - https://phabricator.wikimedia.org/T246887 (10aborrero) 05Open→03Resolved We decided to drop the BGP setup for now. [11:09:59] 10Operations, 10netops, 10cloud-services-team (Kanban): CloudVPS: enable BGP in the neutron transport network - https://phabricator.wikimedia.org/T245606 (10aborrero) [11:10:09] !log kartik@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit|579893|Enable ContentTranslation as a default tool in Malay, Azerbaijani and Estonian WPs (T246622, T246628, T246629)]] (duration: 01m 07s) [11:10:14] well last night, I found out that the MessageCache is invalidated due to a mismatched HASH [11:10:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:17] g the cache varies [11:10:19] T246622: Enable Content Translation in Malay Wikipedia as a default tool - https://phabricator.wikimedia.org/T246622 [11:10:20] T246629: Enable Content Translation in Estonian Wikipedia as a default tool - https://phabricator.wikimedia.org/T246629 [11:10:20] T246628: Enable Content Translation in Azerbaijani Wikipedia as a default tool - https://phabricator.wikimedia.org/T246628 [11:10:48] scap, take II is on. 
[11:11:00] and I could easily see in logstash looking at the MessageCache log bucket [11:11:22] 10Operations, 10netops, 10cloud-services-team (Kanban): CloudVPS: enable BGP in the neutron transport network - https://phabricator.wikimedia.org/T245606 (10aborrero) 05Open→03Declined We decided to drop the BGP project for now. We collected valuable information about the setup, how it works and what we... [11:11:30] Amir1: I guess I will handle it :] Kids are on a break then they will prepare lunch so I got some "spare" time hehe [11:11:42] !log kartik@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit|579893|Enable ContentTranslation as a default tool in Malay, Azerbaijani and Estonian WPs (T246622, T246628, T246629)]], take II (duration: 01m 07s) [11:11:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:20] sure ^_^ [11:13:13] I have spent a couple of hours trying to find the reproduction case but I had not identified any root cause (besides that the cache magically varies) [11:13:21] then Krinkle figured it out [11:13:55] and apparently the issue is for anything that relies on a local cache :/ [11:14:27] that's nasty, we really should have tests (at least regression tests) [11:15:27] Since there are no more patches in EU Mid-day SWAT.. [11:16:14] Amir1: do we log 'EU Mid-day SWAT is done' nowadays? [11:16:17] 10Operations, 10DBA, 10Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (10Marostegui) p:05Triage→03Medium a:03Marostegui [11:16:24] kart_: AFAIK yes [11:16:28] OK. 
[11:16:48] !log EU Mid-day SWAT done [11:16:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:15] !log upload atskafka 0.3 to buster-wikimedia T237993 [11:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:19] T237993: Create replacement for Varnishkafka - https://phabricator.wikimedia.org/T237993 [11:18:15] (03PS3) 10Giuseppe Lavagetto: mw1261: switch to envoy for TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/580343 [11:18:17] (03PS1) 10Giuseppe Lavagetto: services_proxy: allow setting a keepalive timeout [puppet] - 10https://gerrit.wikimedia.org/r/580893 (https://phabricator.wikimedia.org/T247484) [11:18:19] (03PS1) 10Giuseppe Lavagetto: services_proxy: switch from retries to shorter keepalive timeouts [puppet] - 10https://gerrit.wikimedia.org/r/580894 (https://phabricator.wikimedia.org/T247484) [11:19:10] (03PS4) 10Ema: atskafka: rdkafka configuration support [puppet] - 10https://gerrit.wikimedia.org/r/580364 (https://phabricator.wikimedia.org/T247497) [11:19:54] (03CR) 10Ema: atskafka: rdkafka configuration support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/580364 (https://phabricator.wikimedia.org/T247497) (owner: 10Ema) [11:23:16] (03CR) 10Ema: [C: 03+2] atskafka: rdkafka configuration support [puppet] - 10https://gerrit.wikimedia.org/r/580364 (https://phabricator.wikimedia.org/T247497) (owner: 10Ema) [11:23:56] 10Operations, 10netops: Fix LibreNMS alert "CDR bills over 75% used" - https://phabricator.wikimedia.org/T247949 (10ayounsi) p:05Triage→03Medium [11:25:37] 10Operations, 10fundraising-tech-ops, 10observability, 10Patch-For-Review: Icinga latency is skyrocketing and commands ignored - https://phabricator.wikimedia.org/T247538 (10fgiunchedi) The new baseline in eqiad for average check latency is ~70s, which isn't great IMHO but certainly better. Short of deploy... 
[11:29:30] (03PS1) 10Muehlenhoff: Bump CAS session length to a week [puppet] - 10https://gerrit.wikimedia.org/r/580902 [11:30:28] moritzm: ^ <3 [11:32:16] <_joe_> marostegui: he's specifically reducing the tendril one to 61 seconds tho [11:32:37] _joe_: what's tendril? I don't use that [11:32:41] :p [11:32:58] :-) [11:33:26] (03PS1) 10Giuseppe Lavagetto: Update envoy, add ability to define an idle timeout [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/580906 (https://phabricator.wikimedia.org/T247484) [11:34:16] (03PS1) 10Ema: atskafka: specify TLS settings [puppet] - 10https://gerrit.wikimedia.org/r/580908 (https://phabricator.wikimedia.org/T247497) [11:37:52] (03PS2) 10Ema: atskafka: specify TLS configuration [puppet] - 10https://gerrit.wikimedia.org/r/580908 (https://phabricator.wikimedia.org/T247497) [11:39:50] I am going to deploy the hotfixes for the train deployment [11:40:10] which should stop the MessageCache to vary based on the wiki id [11:40:28] 10Operations, 10netops, 10cloud-services-team (Kanban): New network request for CloudVPS CODFW instances transport - https://phabricator.wikimedia.org/T247633 (10aborrero) 05Open→03Declined hey @JHedden, in conversation with @ayounsi on IRC today, he asked me to prioritize this task vs {T245495}. He don'... [11:42:02] Amir1: I haven't noticed any pattern change on s8 master or replicas [11:42:37] 10Operations, 10Puppet, 10User-Joe: Ensure hiera only has profile:: qualified or global hiera keys - https://phabricator.wikimedia.org/T247956 (10jbond) [11:42:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Decrease db1087, vslow host weight in main, given that the CPU across s8 is now doing a lot better', diff saved to https://phabricator.wikimedia.org/P10715 and previous config saved to /var/cache/conftool/dbconfig/20200318-114259-marostegui.json [11:43:02] marostegui: maybe it's too small that it gets lost in the noise? 
[11:43:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:06] apergos: ^ hopefully this will help too [11:43:28] thanks, I hope so! [11:43:41] Amir1: I guess, definitely not something noticeable on the graphs [11:47:55] marostegui: the write graphs on the app server all went to zero [11:48:06] https://grafana.wikimedia.org/d/000000548/wikibase-sql-term-storage-was-wb_terms?orgId=1&from=now-6h&to=now&refresh=30s&fullscreen&panelId=2 [11:48:29] let me know when I can deploy the object cache patches for the train blocker. I don't want to interrupt your debug session ;] [11:48:38] Amir1: I am so happy to see that [11:48:55] hashar: it's not a big deal, please do. [11:49:34] okk [11:51:56] deploying to mwdebug1001 [11:53:02] (03PS1) 10Jbond: taskgen: add new CI check to ensure hiera keys are valid [puppet] - 10https://gerrit.wikimedia.org/r/580921 (https://phabricator.wikimedia.org/T247956) [11:53:21] (03PS3) 10Ema: atskafka: specify TLS configuration [puppet] - 10https://gerrit.wikimedia.org/r/580908 (https://phabricator.wikimedia.org/T247497) [11:54:02] (03CR) 10jerkins-bot: [V: 04-1] taskgen: add new CI check to ensure hiera keys are valid [puppet] - 10https://gerrit.wikimedia.org/r/580921 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [11:55:06] looks solved [11:56:18] (03CR) 10Jbond: [C: 03+1] Bump CAS session length to a week [puppet] - 10https://gerrit.wikimedia.org/r/580902 (owner: 10Muehlenhoff) [11:57:40] (03CR) 10Ema: [C: 03+2] atskafka: specify TLS configuration [puppet] - 10https://gerrit.wikimedia.org/r/580908 (https://phabricator.wikimedia.org/T247497) (owner: 10Ema) [11:57:54] !log hashar@deploy1001 Synchronized php-1.35.0-wmf.23/includes/objectcache/ObjectCache.php: objectcache: Restore keyspace for LocalServerCache service - T247562 (duration: 01m 10s) [11:57:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:59] T247562: Warning: Memcached::setMulti(): failed to set 
key global:segment:... - https://phabricator.wikimedia.org/T247562 [11:59:09] !log hashar@deploy1001 Synchronized php-1.35.0-wmf.24/includes/objectcache/ObjectCache.php: objectcache: Restore keyspace for LocalServerCache service - T247562 (duration: 01m 07s) [11:59:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200318T1200) [12:01:02] I don't see any changes [12:01:09] which is good news [12:06:42] (03CR) 10Muehlenhoff: [C: 03+2] Bump CAS session length to a week [puppet] - 10https://gerrit.wikimedia.org/r/580902 (owner: 10Muehlenhoff) [12:08:13] good news train blocker is off [12:08:20] I am going to move the train forward later after lunch [12:08:23] hopefully at 1pm UTC [12:18:16] logstash is quiet [12:18:31] grafana memcache looks fine as well, it has more evicted qps [12:18:40] and more get, though that one is going down [12:18:59] I am assuming that the message caches are being set once for good [12:21:27] 10Operations, 10DBA, 10MediaWiki-General, 10Wikidata, and 6 others: Investigate decrease in wikidata dispatch times due to eqiad -> codfw DC switch - https://phabricator.wikimedia.org/T205865 (10Marostegui) [12:22:51] mc1023:9100 going down from 430Mbps to less than 300Mbps ( https://grafana.wikimedia.org/d/000000316/memcache?orgId=1&from=now-1h&to=now&fullscreen&panelId=56 ) [12:24:02] (03PS1) 10Marostegui: Revert "pc1008: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/580929 [12:24:25] evicted items went from 11k qps to 13k qps; Not sure whether it is a concern, I don't even know what that represents [12:24:58] (03CR) 10Marostegui: [C: 03+2] Revert "pc1008: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/580929 (owner: 10Marostegui) [12:25:47] that is a rabbit hole anyway [12:25:50] i am off for actual lunch [12:26:08] 10Operations, 10observability: Move Prometheus off 
eqsin/ulsfo/esams bastions - https://phabricator.wikimedia.org/T243057 (10fgiunchedi) [12:33:30] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=idp site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:36:05] 10Operations, 10Epic: Migrate role::graphite::production to Buster - https://phabricator.wikimedia.org/T247963 (10fgiunchedi) [12:37:42] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:38:05] 10Operations, 10Epic: Migrate role::alerting_host to Buster - https://phabricator.wikimedia.org/T247966 (10fgiunchedi) [12:40:05] 10Operations, 10netops, 10observability, 10Epic: Migrate role::netmon to Buster - https://phabricator.wikimedia.org/T247967 (10fgiunchedi) [12:51:22] (03PS4) 10Ayounsi: Juniper to Netbox import script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/566812 [12:52:14] (03PS1) 10Arturo Borrero Gonzalez: codfw: openstack: drop unused br-external FQDNs for cloudnet servers [dns] - 10https://gerrit.wikimedia.org/r/580940 (https://phabricator.wikimedia.org/T245606) [12:52:51] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] codfw: openstack: drop unused br-external FQDNs for cloudnet servers [dns] - 10https://gerrit.wikimedia.org/r/580940 (https://phabricator.wikimedia.org/T245606) (owner: 10Arturo Borrero Gonzalez) [12:58:20] (03CR) 10Marostegui: WIP - Introduce profile::mariadb::misc::analytics (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/553742 (https://phabricator.wikimedia.org/T234826) (owner: 10Elukey) [13:00:04] hashar and twentyafterfour: My dear minions, it's time we take the moon! Just kidding. Time for Mediawiki train - European+American Version deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200318T1300). [13:17:40] (03PS1) 10Elukey: Add initial Prometheus monitoring config for Presto [puppet] - 10https://gerrit.wikimedia.org/r/580941 (https://phabricator.wikimedia.org/T247884) [13:27:46] (03PS2) 10Elukey: Add initial Prometheus monitoring config for Presto [puppet] - 10https://gerrit.wikimedia.org/r/580941 (https://phabricator.wikimedia.org/T247884) [13:33:11] 10Operations, 10DBA, 10OTRS, 10Recommendation-API, 10Research: Upgrade and restart m2 primary database master (db1132) - https://phabricator.wikimedia.org/T246098 (10Marostegui) m2 eqiad proxies that will require reload: dbproxy1015: active dbproxy1013: passive m2 codfw proxy requires no action. Hosts... [13:36:21] (03PS3) 10Elukey: Add initial Prometheus monitoring config for Presto [puppet] - 10https://gerrit.wikimedia.org/r/580941 (https://phabricator.wikimedia.org/T247884) [13:38:52] 10Operations, 10observability, 10Performance-Team (Radar): Decide on `service-runner` aggregated prometheus metrics and use of `service` label - https://phabricator.wikimedia.org/T247820 (10akosiaris) >>! In T247820#5977711, @colewhite wrote: > Good idea forking the original task. Thanks for that! > >> I '... [13:40:23] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/21483/ looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/580941 (https://phabricator.wikimedia.org/T247884) (owner: 10Elukey) [13:42:06] 10Operations, 10MediaWiki-Cache, 10Wikimedia-Incident: Parsercache sudden increase of connections - https://phabricator.wikimedia.org/T247788 (10Anomie) >>! In T247788#5978199, @Marostegui wrote: > Any objection on closing this task as this was clearly a consequence and not the cause? If you're asking me, I... [13:42:53] 10Operations, 10MediaWiki-Cache, 10Wikimedia-Incident: Parsercache sudden increase of connections - https://phabricator.wikimedia.org/T247788 (10Marostegui) 05Open→03Resolved a:03Anomie Resolving. 
Thanks @Anomie for the clarifications and explaining what was going on. [13:43:03] (03CR) 10Elukey: [C: 03+2] Add initial Prometheus monitoring config for Presto [puppet] - 10https://gerrit.wikimedia.org/r/580941 (https://phabricator.wikimedia.org/T247884) (owner: 10Elukey) [13:43:04] 10Operations, 10MediaWiki-Cache, 10Wikimedia-Incident: Parsercache sudden increase of connections - https://phabricator.wikimedia.org/T247788 (10Marostegui) I will be doing an IR and posting the link here btw [13:45:09] marostegui: for when you have time, can you check a replica and master for queries to read/write wb_terms? [13:45:26] Amir1: Yep! [13:45:28] I highly doubt anything would show up but life is full of surprises [13:45:36] Amir1: let me do a quick write check [13:46:07] Last write: -rw-rw---- 1 mysql mysql 529G Mar 18 12:19 wb_terms.ibd [13:46:54] (03CR) 10Elukey: "Thanks! will amend and resend :)" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/553742 (https://phabricator.wikimedia.org/T234826) (owner: 10Elukey) [13:47:40] Amir1: do you have a task to track the dropping? We can rename the table on a host in codfw; if there is a write, replication will break and we will know [13:47:57] Maybe 24h after that, we can go ahead and rename it on an eqiad host and closely monitor for errors [13:48:02] Depending on how confident you feel [13:48:43] marostegui: if it doesn't break replication to labs, it should be okay, let's start with codfw for a bit [13:48:51] (for write) [13:49:41] marostegui: do you have some numbers on how it's going to give us free space and such? [13:49:47] Amir1: wb_terms isn't replicated in labs [13:49:54] it definitely is [13:49:58] is it? 
[13:50:01] half of tool builders are using it [13:50:12] ah yes [13:50:13] (03PS1) 10Ema: atskafka: rotate statistics file [puppet] - 10https://gerrit.wikimedia.org/r/580943 (https://phabricator.wikimedia.org/T247497) [13:50:16] I was confused with another one [13:50:41] Amir1: About numbers, right now it is around 700GB (compressed) [13:51:09] what's the total and free space? [13:51:10] Amir1: And the whole wikidata dataset is 2TB [13:51:11] XDDD [13:51:30] (03PS1) 10Ssingh: Update `install_requires' in setup.py [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/580944 [13:51:34] nice [13:51:42] Right now the master is 70% used (2.5TB) [13:51:47] So we are going to get a nice chunk back [13:51:56] just "nice"? :P [13:52:00] (03CR) 10jerkins-bot: [V: 04-1] Update `install_requires' in setup.py [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/580944 (owner: 10Ssingh) [13:52:08] We only have 1.2T available, so we will back to 2TB! [13:52:11] Which makes me feel a lot better [13:53:21] That would buy us a couple of years [13:53:29] wb_terms is bigger than the next 3 biggest tables combined [13:53:32] it is crazy [13:53:40] (03PS5) 10Ayounsi: Juniper to Netbox import script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/566812 [13:54:03] if you don't count the replacement, I think it's bigger than all other tables combined multiplied by two [13:54:22] (also considering the space comment and actor would free up) [13:55:11] yeah [13:55:13] indeed [13:55:17] the replacement should be around ten times smaller [13:55:31] (but they are not compressed yet I guess) [13:55:49] replacement is wbt_* tables [13:56:13] 289G Mar 18 13:52 wbt_item_terms.ibd [13:56:57] Amir1: do you have a task for the drop? 
[13:57:51] marostegui: T208425 [13:57:52] T208425: [EPIC] Kill the wb_terms table - https://phabricator.wikimedia.org/T208425 [13:58:24] Ah, yes I am on that one, I meant more specific for the drop itself [13:58:45] Amir1: I assume we also have to drop it from s4, right? [13:59:02] it is empty anyways [13:59:12] on both commons and testcommons [14:00:47] may I process with the train or should I hold while you are finishing up? ;) [14:00:48] yeah, I think we can start testwikidata as well [14:01:14] hashar: oh this is not locking in any way, I move it to the database channel [14:01:18] *blocking [14:01:25] sorry for the noise [14:01:41] hashar: no no, not blocking anything here [14:01:52] Amir1: oh yeah, testwikidata indeed [14:02:19] okkk [14:02:23] going for the train [14:03:32] (03PS1) 10Hashar: all wikis to 1.35.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580948 [14:03:34] (03CR) 10Hashar: [C: 03+2] all wikis to 1.35.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580948 (owner: 10Hashar) [14:04:33] (03Merged) 10jenkins-bot: all wikis to 1.35.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580948 (owner: 10Hashar) [14:06:27] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.23 [14:06:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:05] PROBLEM - Apache HTTP on mw1283 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1310 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:10:41] PROBLEM - Nginx local proxy to apache on mw1283 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1310 bytes in 0.076 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:10:43] PROBLEM - PHP7 rendering on mw1283 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. 
not found on http://en.wikipedia.org:80/wiki/Main_Page - 1310 bytes in 0.081 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:10:50] bah [14:11:31] (03PS2) 10Jgreen: nsca_frack.cfg.erb - merge some groups, add fran1001, clean up format [puppet] - 10https://gerrit.wikimedia.org/r/580351 [14:14:03] (03PS3) 10Jgreen: nsca_frack.cfg.erb - merge some groups, add fran1001, clean up format [puppet] - 10https://gerrit.wikimedia.org/r/580351 [14:14:26] I have no idea what is wrong with mw1283 [14:16:23] RECOVERY - Apache HTTP on mw1283 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.033 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:16:57] RECOVERY - Nginx local proxy to apache on mw1283 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:17:01] RECOVERY - PHP7 rendering on mw1283 is OK: HTTP OK: HTTP/1.1 200 OK - 76136 bytes in 0.144 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:17:18] 10Operations, 10Traffic, 10observability, 10Patch-For-Review: some Prometheis not scraping the full set of targets - https://phabricator.wikimedia.org/T246860 (10CDanis) 05Open→03Resolved No current differences in targets scraped between the proms in each cluster, and no hits for `too many open files`... 
[14:17:58] !log Rename wb_terms on codfw hosts: s8 (wikidatawiki - db2081), s3 (testwikidatawiki - db2109), s4 (commonswiki, testcommonswiki - db2106) T208425 [14:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:04] T208425: [EPIC] Kill the wb_terms table - https://phabricator.wikimedia.org/T208425 [14:24:44] (03PS1) 10Vgutierrez: ATS: Disable TLS Session tickets in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/580951 (https://phabricator.wikimedia.org/T170567) [14:28:00] <_joe_> !log restarted php-fpm on mw1283, was throwing SIGILL [14:28:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:11] thanks _joe_ [14:29:13] !log add debug to icinga2001 - T247538 [14:29:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:21] T247538: Icinga latency is skyrocketing and commands ignored - https://phabricator.wikimedia.org/T247538 [14:29:26] looks like the train for wmf.23 on all wikis is all fine [14:31:05] (03CR) 10Vgutierrez: "pcc looks sane: https://puppet-compiler.wmflabs.org/compiler1001/21485/" [puppet] - 10https://gerrit.wikimedia.org/r/580951 (https://phabricator.wikimedia.org/T170567) (owner: 10Vgutierrez) [14:34:07] (03CR) 10Ema: [C: 03+1] ATS: Disable TLS Session tickets in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/580951 (https://phabricator.wikimedia.org/T170567) (owner: 10Vgutierrez) [14:36:58] (03CR) 10Vgutierrez: [C: 03+2] ATS: Disable TLS Session tickets in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/580951 (https://phabricator.wikimedia.org/T170567) (owner: 10Vgutierrez) [14:41:22] (03PS1) 10Jhedden: labstore: remove catgraph cloudvps project [puppet] - 10https://gerrit.wikimedia.org/r/580952 (https://phabricator.wikimedia.org/T247482) [14:41:32] !log disable TLS session tickets in ulsfo - T245616 T170567 [14:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:43] T245616: Provide a simple and 
automated SSL Ticket key generation system for ATS - https://phabricator.wikimedia.org/T245616 [14:41:43] T170567: Support TLSv1.3 - https://phabricator.wikimedia.org/T170567 [14:43:23] (03CR) 10Jhedden: [C: 03+2] labstore: remove catgraph cloudvps project [puppet] - 10https://gerrit.wikimedia.org/r/580952 (https://phabricator.wikimedia.org/T247482) (owner: 10Jhedden) [14:47:42] I am doing the group0 promotion [14:48:14] (03Abandoned) 10Hashar: Group0 to 1.35.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580349 (https://phabricator.wikimedia.org/T233872) (owner: 10Hashar) [14:48:56] (03Restored) 10Hashar: Group0 to 1.35.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580349 (https://phabricator.wikimedia.org/T233872) (owner: 10Hashar) [14:50:28] (03PS2) 10Hashar: Group0 to 1.35.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580349 (https://phabricator.wikimedia.org/T233872) [14:52:46] (03PS1) 10Volans: Remove host mgmt records for decommissioning hosts [dns] - 10https://gerrit.wikimedia.org/r/580954 (https://phabricator.wikimedia.org/T233183) [14:52:48] (03PS1) 10Volans: Remove all mgmt records for offline hosts [dns] - 10https://gerrit.wikimedia.org/r/580955 (https://phabricator.wikimedia.org/T233183) [14:52:50] (03PS1) 10Volans: Fix typos [dns] - 10https://gerrit.wikimedia.org/r/580956 (https://phabricator.wikimedia.org/T233183) [14:55:06] 10Operations, 10Analytics, 10Traffic, 10Patch-For-Review: Test atskafka deployment - https://phabricator.wikimedia.org/T247497 (10elukey) ` elukey@kafka-jumbo1001:~$ kafka acls --add --allow-principal User:CN=varnishkafka --producer --topic atskafka_test_webrequest_text kafka-acls --authorizer-properties z... 
[14:55:59] (03CR) 10Hashar: [C: 03+2] Group0 to 1.35.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580349 (https://phabricator.wikimedia.org/T233872) (owner: 10Hashar) [14:56:54] (03Merged) 10jenkins-bot: Group0 to 1.35.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580349 (https://phabricator.wikimedia.org/T233872) (owner: 10Hashar) [14:57:47] grlbblb [14:57:51] I forgot the testwiki step [14:58:34] 10Operations, 10netops, 10cloud-services-team (Kanban): CloudVPS: enable BGP in the neutron transport network - https://phabricator.wikimedia.org/T245606 (10faidon) 05Declined→03Open Reopening this per IRC, and given this is a prod/WMCS task affecting prod in major ways. First of all, it'd be great to h... [14:58:46] !log hashar@deploy1001 Started scap: testwiki to 1.35.0-wmf.24 and rebuild l10n cache - T233872 [14:58:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:51] T233872: 1.35.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T233872 [15:00:58] (03CR) 10Marostegui: [C: 03+1] "+1 for dbproxies" [dns] - 10https://gerrit.wikimedia.org/r/580954 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [15:03:32] (03PS1) 10Hashar: Revert "Group0 to 1.35.0-wmf.24" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580967 [15:03:38] (03CR) 10Hashar: [C: 03+2] Revert "Group0 to 1.35.0-wmf.24" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580967 (owner: 10Hashar) [15:04:36] (03Merged) 10jenkins-bot: Revert "Group0 to 1.35.0-wmf.24" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580967 (owner: 10Hashar) [15:09:19] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/580347 (https://phabricator.wikimedia.org/T247538) (owner: 10Filippo Giunchedi) [15:23:28] (03PS1) 10Alexandros Kosiaris: Prefix purely included template files with underscore [deployment-charts] - 10https://gerrit.wikimedia.org/r/580982 [15:23:42] (03CR) 10jerkins-bot: [V: 04-1] 
Prefix purely included template files with underscore [deployment-charts] - 10https://gerrit.wikimedia.org/r/580982 (owner: 10Alexandros Kosiaris) [15:26:01] 10Operations, 10Wikimedia-Mailing-lists: Request new mailing list for Myanmar Wikimedia Community User Group - https://phabricator.wikimedia.org/T247647 (10Ninjastrikers) I have access to make the changes. Thank you so much. [15:28:27] (03CR) 10CRusnov: [C: 03+2] netbox (hiera): Add coherence.Rack to alerted reports [puppet] - 10https://gerrit.wikimedia.org/r/578551 (https://phabricator.wikimedia.org/T239244) (owner: 10CRusnov) [15:28:38] (03PS2) 10CRusnov: netbox (hiera): Add coherence.Rack to alerted reports [puppet] - 10https://gerrit.wikimedia.org/r/578551 (https://phabricator.wikimedia.org/T239244) [15:30:44] 10Operations, 10fundraising-tech-ops, 10observability, 10Patch-For-Review: Icinga latency is skyrocketing and commands ignored - https://phabricator.wikimedia.org/T247538 (10fgiunchedi) Top 50 checks as of today, with a little longer time horizon than the previous audit `lines=9 root@icinga2001:/var/log/i... [15:35:48] 10Operations, 10fundraising-tech-ops, 10observability, 10Patch-For-Review: Icinga latency is skyrocketing and commands ignored - https://phabricator.wikimedia.org/T247538 (10MoritzMuehlenhoff) Another low-hanging fruit is to reduce the SSH check for the mgmts I think: It currently runs every minute, but th... 
[15:40:52] (03PS5) 10Filippo Giunchedi: prometheus: add icinga average latency checks [puppet] - 10https://gerrit.wikimedia.org/r/580347 (https://phabricator.wikimedia.org/T247538) [15:40:54] (03PS1) 10Filippo Giunchedi: icinga: relax check interval for selected checks [puppet] - 10https://gerrit.wikimedia.org/r/580985 (https://phabricator.wikimedia.org/T247538) [15:44:16] (03PS2) 10Alexandros Kosiaris: Prefix purely included template files with underscore [deployment-charts] - 10https://gerrit.wikimedia.org/r/580982 [15:44:54] (03CR) 10jerkins-bot: [V: 04-1] icinga: relax check interval for selected checks [puppet] - 10https://gerrit.wikimedia.org/r/580985 (https://phabricator.wikimedia.org/T247538) (owner: 10Filippo Giunchedi) [15:45:53] (03CR) 10Alexandros Kosiaris: [C: 03+2] Prefix purely included template files with underscore [deployment-charts] - 10https://gerrit.wikimedia.org/r/580982 (owner: 10Alexandros Kosiaris) [15:46:17] (03PS1) 10Ema: Add datetime and sequence number [software/atskafka] - 10https://gerrit.wikimedia.org/r/580986 (https://phabricator.wikimedia.org/T237993) [15:46:21] (03PS1) 10Ema: Do not append to stats file [software/atskafka] - 10https://gerrit.wikimedia.org/r/580987 (https://phabricator.wikimedia.org/T237993) [15:52:19] 10Operations, 10netops, 10cloud-services-team (Kanban): CloudVPS: enable BGP in the neutron transport network - https://phabricator.wikimedia.org/T245606 (10ayounsi) * Neutron BGP is outbound only, so we would still need to keep the VRRP VIP between cr1 and cr2 and a static route from cloud -> core * Neutron... 
[15:52:42] scap-db-rebuild going on [15:54:22] (03PS1) 10Herron: icinga: change check/retry interval of mgmt host check to 10/15 [puppet] - 10https://gerrit.wikimedia.org/r/580989 (https://phabricator.wikimedia.org/T247538) [15:54:43] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: add icinga average latency checks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/580347 (https://phabricator.wikimedia.org/T247538) (owner: 10Filippo Giunchedi) [15:56:56] (03CR) 10Dzahn: [C: 03+1] Fix typos [dns] - 10https://gerrit.wikimedia.org/r/580956 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [15:57:32] (03CR) 10jerkins-bot: [V: 04-1] icinga: change check/retry interval of mgmt host check to 10/15 [puppet] - 10https://gerrit.wikimedia.org/r/580989 (https://phabricator.wikimedia.org/T247538) (owner: 10Herron) [15:57:34] 10Operations, 10fundraising-tech-ops, 10observability, 10Patch-For-Review: Icinga latency is skyrocketing and commands ignored - https://phabricator.wikimedia.org/T247538 (10herron) >>! In T247538#5980045, @MoritzMuehlenhoff wrote: > Another low-hanging fruit is to reduce the SSH check for the mgmts I thin... 
[15:57:55] (03CR) 10Dzahn: "probably best to have this reviewed by dcops" [dns] - 10https://gerrit.wikimedia.org/r/580955 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [15:58:50] PROBLEM - Disk space on wtp1025 is CRITICAL: DISK CRITICAL - free space: / 1509 MB (3% inode=76%): /tmp 1509 MB (3% inode=76%): /var/tmp 1509 MB (3% inode=76%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=wtp1025&var-datasource=eqiad+prometheus/ops [16:00:09] !log hashar@deploy1001 Finished scap: testwiki to 1.35.0-wmf.24 and rebuild l10n cache - T233872 (duration: 61m 23s) [16:00:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:15] T233872: 1.35.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T233872 [16:00:45] Phab down: Request from 2a00:23c6:9200:9700:d59b:ddc3:e20e:60e9 via cp3052.esams.wmnet, ATS/8.0.6 [16:00:45] Error: 502, internal error - server connection terminated at 2020-03-18 16:00:21 GMT [16:00:56] Back, weird [16:01:12] 1.35.0-wmf.24 is on testwiki [16:03:10] I am out for an hour or so [16:03:25] 10Operations, 10Patch-For-Review, 10Security: envoyproxy: CVE-2020-8664 CVE-2020-8661 CVE-2020-8660 CVE-2020-8659 - https://phabricator.wikimedia.org/T246868 (10RLazarus) Deployed 1.13.1 to all hosts where we're using envoy as a TLS proxy, that is, `C:profile::tlsproxy::envoy`. Exception: `mendelevium.eqiad.... [16:07:04] (03CR) 10Dzahn: [C: 03+1] "You just changed that policy today though, right?" 
[dns] - 10https://gerrit.wikimedia.org/r/580954 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [16:07:22] (03CR) 10Dzahn: [C: 03+2] Fix typos [dns] - 10https://gerrit.wikimedia.org/r/580956 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [16:07:33] (03PS2) 10Ema: Do not append to stats file [software/atskafka] - 10https://gerrit.wikimedia.org/r/580987 (https://phabricator.wikimedia.org/T237993) [16:07:37] (03PS2) 10Dzahn: Fix typos [dns] - 10https://gerrit.wikimedia.org/r/580956 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [16:08:42] (03CR) 10jerkins-bot: [V: 04-1] Do not append to stats file [software/atskafka] - 10https://gerrit.wikimedia.org/r/580987 (https://phabricator.wikimedia.org/T237993) (owner: 10Ema) [16:09:45] (03PS3) 10Ema: Do not append to stats file [software/atskafka] - 10https://gerrit.wikimedia.org/r/580987 (https://phabricator.wikimedia.org/T237993) [16:10:55] (03PS2) 10Filippo Giunchedi: icinga: relax check interval for selected checks [puppet] - 10https://gerrit.wikimedia.org/r/580985 (https://phabricator.wikimedia.org/T247538) [16:11:20] (03CR) 10Dzahn: [C: 03+2] Remove host mgmt records for decommissioning hosts [dns] - 10https://gerrit.wikimedia.org/r/580954 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [16:11:24] (03PS2) 10Dzahn: Remove host mgmt records for decommissioning hosts [dns] - 10https://gerrit.wikimedia.org/r/580954 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [16:15:01] (03Abandoned) 10Ema: atskafka: rotate statistics file [puppet] - 10https://gerrit.wikimedia.org/r/580943 (https://phabricator.wikimedia.org/T247497) (owner: 10Ema) [16:19:01] (03CR) 10RLazarus: [C: 03+1] Update envoy, add ability to define an idle timeout [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/580906 (https://phabricator.wikimedia.org/T247484) (owner: 10Giuseppe Lavagetto) [16:22:09] 10Operations: Is invite-wmfall@wikimedia.org a Mailman list? 
- https://phabricator.wikimedia.org/T247848 (10bcampbell) Thanks @Aklapper and @Dzahn. You're right, it's just a Google Group with no LDAP entry behind it because invite-wmfall@ doesn't need to send or receive mail. This issue is resolved. [16:22:46] (03PS1) 10Alexandros Kosiaris: admin: Deduplicate coredns [deployment-charts] - 10https://gerrit.wikimedia.org/r/580996 [16:22:55] (03PS2) 10Giuseppe Lavagetto: services_proxy: allow setting a keepalive timeout [puppet] - 10https://gerrit.wikimedia.org/r/580893 (https://phabricator.wikimedia.org/T247484) [16:22:57] (03PS2) 10Giuseppe Lavagetto: services_proxy: switch from retries to shorter keepalive timeouts [puppet] - 10https://gerrit.wikimedia.org/r/580894 (https://phabricator.wikimedia.org/T247484) [16:22:59] (03PS4) 10Giuseppe Lavagetto: mw1261: switch to envoy for TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/580343 [16:23:33] (03PS1) 10Arturo Borrero Gonzalez: hiera: openstack: codfw1dev: revert to neutron complete hack [puppet] - 10https://gerrit.wikimedia.org/r/580997 (https://phabricator.wikimedia.org/T247505) [16:25:26] 10Operations: Setup an Offsite backup infrastructure - https://phabricator.wikimedia.org/T85278 (10akosiaris) a:05akosiaris→03None [16:25:55] 10Operations: Setup an Offsite backup infrastructure - https://phabricator.wikimedia.org/T85278 (10akosiaris) Adding @jcrespo in case he is interested. [16:26:21] 10Operations, 10Goal: Goal: Strengthen Incident monitoring infrastructure - https://phabricator.wikimedia.org/T118746 (10akosiaris) 05Open→03Invalid No longer valid, closing. [16:31:10] (03CR) 10Andrew Bogott: [C: 03+2] hiera: openstack: codfw1dev: revert to neutron complete hack [puppet] - 10https://gerrit.wikimedia.org/r/580997 (https://phabricator.wikimedia.org/T247505) (owner: 10Arturo Borrero Gonzalez) [16:42:28] 10Operations, 10Analytics, 10Traffic: Publishing project anomaly data for censorship researchers. 
Evaluate privacy threats - https://phabricator.wikimedia.org/T183990 (10Nuria) [16:43:34] !log wtp1025 - Icinga alerted it's running out of disk - 'apt-get clean' lowered disk usage from 97% to 91% [16:43:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:39] RECOVERY - Disk space on wtp1025 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=wtp1025&var-datasource=eqiad+prometheus/ops [16:50:32] (03PS1) 10Alexandros Kosiaris: admin: Deduplicate rbac more [deployment-charts] - 10https://gerrit.wikimedia.org/r/581006 [16:51:28] (03PS2) 10Dzahn: racktables: remove port 80 firewall hole [puppet] - 10https://gerrit.wikimedia.org/r/579675 [16:51:41] (03PS3) 10Dzahn: racktables: remove port 80 firewall hole [puppet] - 10https://gerrit.wikimedia.org/r/579675 [16:53:05] (03Abandoned) 10Jdlrobson: Enable PageImages on Commons categories [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579067 (https://phabricator.wikimedia.org/T198716) (owner: 10Jdlrobson) [16:53:33] (03CR) 10Muehlenhoff: "This should be a generic ferm rule in the Envoy profile, not duplicated across dozens of roles." [puppet] - 10https://gerrit.wikimedia.org/r/580479 (owner: 10Dzahn) [16:54:07] (03CR) 10RLazarus: [C: 03+1] "LGTM, thanks for this! I'm also interested in the results of the cumin discussion but I don't have anything brilliant to add to it -- I th" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/579579 (https://phabricator.wikimedia.org/T245288) (owner: 10Jbond) [16:57:38] (03Abandoned) 10Jgreen: nsca_frack.cfg.erb - merge some groups, add fran1001, clean up format [puppet] - 10https://gerrit.wikimedia.org/r/580351 (owner: 10Jgreen) [16:58:21] (03CR) 10Dzahn: "..unless envoy is changed eventually to use loopback to talk to apache locally? That's kind of why i brought it up in -serviceops. 
Also en" [puppet] - 10https://gerrit.wikimedia.org/r/580479 (owner: 10Dzahn) [17:01:30] (03PS3) 10Giuseppe Lavagetto: services_proxy: allow setting a keepalive timeout [puppet] - 10https://gerrit.wikimedia.org/r/580893 (https://phabricator.wikimedia.org/T247484) [17:01:32] (03PS3) 10Giuseppe Lavagetto: services_proxy: switch from retries to shorter keepalive timeouts [puppet] - 10https://gerrit.wikimedia.org/r/580894 (https://phabricator.wikimedia.org/T247484) [17:01:35] (03PS5) 10Giuseppe Lavagetto: mw1261: switch to envoy for TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/580343 [17:02:52] (03CR) 10Dzahn: [C: 03+2] racktables: remove port 80 firewall hole [puppet] - 10https://gerrit.wikimedia.org/r/579675 (owner: 10Dzahn) [17:09:33] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/21487/" [puppet] - 10https://gerrit.wikimedia.org/r/580893 (https://phabricator.wikimedia.org/T247484) (owner: 10Giuseppe Lavagetto) [17:13:15] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Decommission old and unused/spare servers in eqiad - https://phabricator.wikimedia.org/T187473 (10RobH) 05Open→03Resolved this is now tracked individually within netbox, this is very outdated task, closing [17:13:44] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission old and unused/spare servers in codfw - https://phabricator.wikimedia.org/T187474 (10RobH) 05Open→03Resolved [17:13:57] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission old and unused/spare servers in codfw - https://phabricator.wikimedia.org/T187474 (10RobH) this task is old, outdated, and these systems were done elsewhere. 
[17:15:44] 10Operations, 10Tracking-Neverending: Hardware Automation Workflow - Overall Tracking - https://phabricator.wikimedia.org/T116063 (10RobH) a:05RobH→03None [17:17:39] (03PS4) 10Giuseppe Lavagetto: services_proxy: switch from retries to shorter keepalive timeouts [puppet] - 10https://gerrit.wikimedia.org/r/580894 (https://phabricator.wikimedia.org/T247484) [17:19:52] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580373 (https://phabricator.wikimedia.org/T247416) (owner: 10Michael Große) [17:21:51] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "Scheduled for today’s Morning SWAT (starts in 40 minutes)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580373 (https://phabricator.wikimedia.org/T247416) (owner: 10Michael Große) [17:31:07] 10Operations, 10ChangeProp, 10Release Pipeline, 10Release-Engineering-Team-TODO, and 6 others: Migrate cpjobqueue to kubernetes - https://phabricator.wikimedia.org/T220399 (10CCicalese_WMF) [17:32:54] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/21488/mw1261.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/580894 (https://phabricator.wikimedia.org/T247484) (owner: 10Giuseppe Lavagetto) [17:41:25] (03PS2) 10Dzahn: planet: close port 80 for caching servers [puppet] - 10https://gerrit.wikimedia.org/r/579678 [17:44:34] (03CR) 10jerkins-bot: [V: 04-1] planet: close port 80 for caching servers [puppet] - 10https://gerrit.wikimedia.org/r/579678 (owner: 10Dzahn) [17:46:34] (03PS2) 10Volans: Remove all mgmt records for offline hosts [dns] - 10https://gerrit.wikimedia.org/r/580955 (https://phabricator.wikimedia.org/T233183) [17:46:40] 10Operations, 10Mail, 10Epic: Move most (all?) 
exim personal aliases to OIT - https://phabricator.wikimedia.org/T122144 (10Dzahn) [17:48:26] (03PS1) 10Dzahn: site: add CI ganeti nodes [puppet] - 10https://gerrit.wikimedia.org/r/581018 (https://phabricator.wikimedia.org/T228926) [17:48:41] (03CR) 10Herron: [C: 03+1] icinga: relax check interval for selected checks [puppet] - 10https://gerrit.wikimedia.org/r/580985 (https://phabricator.wikimedia.org/T247538) (owner: 10Filippo Giunchedi) [17:49:43] (03PS6) 10Herron: kibana: refactor kibana role to kibana profile [puppet] - 10https://gerrit.wikimedia.org/r/579396 (https://phabricator.wikimedia.org/T246961) (owner: 10Mstyles) [17:49:53] (03CR) 10jerkins-bot: [V: 04-1] site: add CI ganeti nodes [puppet] - 10https://gerrit.wikimedia.org/r/581018 (https://phabricator.wikimedia.org/T228926) (owner: 10Dzahn) [17:50:06] (03CR) 10Herron: [C: 03+1] "> Not sure what special team-permissions that would be :) If you can" [puppet] - 10https://gerrit.wikimedia.org/r/579396 (https://phabricator.wikimedia.org/T246961) (owner: 10Mstyles) [17:50:36] jouncebot: refresh [17:50:37] I refreshed my knowledge about deployments. [17:50:39] just in case :) [17:50:43] (03PS2) 10Dzahn: site: add CI ganeti nodes [puppet] - 10https://gerrit.wikimedia.org/r/581018 (https://phabricator.wikimedia.org/T228926) [17:51:35] (03CR) 10jerkins-bot: [V: 04-1] site: add CI ganeti nodes [puppet] - 10https://gerrit.wikimedia.org/r/581018 (https://phabricator.wikimedia.org/T228926) (owner: 10Dzahn) [17:51:48] what's up, jerkins [17:53:27] (03CR) 10Herron: [C: 03+2] kibana: refactor kibana role to kibana profile [puppet] - 10https://gerrit.wikimedia.org/r/579396 (https://phabricator.wikimedia.org/T246961) (owner: 10Mstyles) [17:55:15] (03CR) 10Mstyles: "thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/579396 (https://phabricator.wikimedia.org/T246961) (owner: 10Mstyles) [17:58:23] (03CR) 10CDanis: "I'll leave it to someone from serviceops to speak to this definitively, but I have some concern about increasing the check_interval on ser" [puppet] - 10https://gerrit.wikimedia.org/r/580985 (https://phabricator.wikimedia.org/T247538) (owner: 10Filippo Giunchedi) [17:58:27] (03PS3) 10Dzahn: site: add CI ganeti nodes [puppet] - 10https://gerrit.wikimedia.org/r/581018 (https://phabricator.wikimedia.org/T228926) [18:00:04] RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Morning SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200318T1800). [18:00:04] Lucas_WMDE: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:11] o/ [18:00:32] I’ll go ahead with my change if no one minds [18:00:42] I was just gonna ask if you were gonna do it :P [18:00:43] it should™ be beta-only but I’ll still test on mwdebug [18:01:35] 10Operations, 10Cloud-VPS (Project-requests): Request creation of SRE VPS project - https://phabricator.wikimedia.org/T247517 (10bd808) I'm going to start my response with an annoying quoting of the [[https://phabricator.wikimedia.org/project/view/2875/|guidance on project scope]]: > == Project scope == > >... 
[18:02:03] (03PS3) 10Lucas Werkmeister (WMDE): Add beta configuration for Wikibase reference formatting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580373 (https://phabricator.wikimedia.org/T247416) (owner: 10Michael Große) [18:02:30] (03CR) 10Dzahn: [C: 03+2] site: add CI ganeti nodes [puppet] - 10https://gerrit.wikimedia.org/r/581018 (https://phabricator.wikimedia.org/T228926) (owner: 10Dzahn) [18:02:38] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580373 (https://phabricator.wikimedia.org/T247416) (owner: 10Michael Große) [18:02:40] (03PS4) 10Dzahn: site: add CI ganeti nodes [puppet] - 10https://gerrit.wikimedia.org/r/581018 (https://phabricator.wikimedia.org/T228926) [18:04:07] (03Merged) 10jenkins-bot: Add beta configuration for Wikibase reference formatting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580373 (https://phabricator.wikimedia.org/T247416) (owner: 10Michael Große) [18:04:42] pulled to mwdebug1001, testing… [18:05:54] (03PS3) 10Dzahn: planet: close port 80 for caching servers [puppet] - 10https://gerrit.wikimedia.org/r/579678 [18:07:56] everything looks fine, going to sync [18:08:44] hm, how do I sync this, actually? [18:08:54] touches IS-labs.php and wmf-config/Wikibase.php [18:09:06] (03CR) 10jerkins-bot: [V: 04-1] planet: close port 80 for caching servers [puppet] - 10https://gerrit.wikimedia.org/r/579678 (owner: 10Dzahn) [18:09:12] syncing all of wmf-config/ would be bad IIRC [18:09:33] sync Wikibase.php first, I guess? and then IS-labs.php just so it’s up to date, even though it should be unused? 
[18:10:22] (03PS1) 10Giuseppe Lavagetto: services_proxy: lower idle timeout below the upstream value [puppet] - 10https://gerrit.wikimedia.org/r/581023 [18:11:25] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/Wikibase.php: SWAT: [[gerrit:580373|Add beta configuration for Wikibase reference formatting (T247416)]] (duration: 01m 07s) [18:11:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:31] T247416: Configure reference-related properties on Beta - https://phabricator.wikimedia.org/T247416 [18:11:36] (03CR) 10Giuseppe Lavagetto: [C: 03+2] services_proxy: lower idle timeout below the upstream value [puppet] - 10https://gerrit.wikimedia.org/r/581023 (owner: 10Giuseppe Lavagetto) [18:11:42] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'canary' . [18:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:00] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/Wikibase.php: SWAT: [[gerrit:580373|Add beta configuration for Wikibase reference formatting (T247416)]], take II (duration: 01m 07s) [18:13:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:35] I think I’ll skip the take II for IS-labs, though ^^ [18:13:42] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'canary' . [18:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:58] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: SWAT: [[gerrit:580373|Add beta configuration for Wikibase reference formatting (T247416)]] (duration: 01m 08s) [18:15:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:16] anything else to SWAT? [18:16:02] I think cscott had some stuff from yesterday that didn't get deployed. 
Dunno if he wants to get it done now [18:17:16] * Lucas_WMDE checks calendar [18:18:02] (03PS1) 10Ottomata: eventstreams - Reduce eventstreams helmfile upgrade timeout to 60 seconds [deployment-charts] - 10https://gerrit.wikimedia.org/r/581025 [18:18:02] ooooh, parsoid things [18:18:32] cscott: do you want to SWAT (some of) those changes now? [18:19:10] PROBLEM - Ensure local MW versions match expected deployment on mwdebug1001 is CRITICAL: CRITICAL: 1 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:19:36] that doesn't sound good [18:20:01] that sounds like somebody tests something on mwdebug1001 ? [18:20:13] hum [18:20:16] scap pull should reset it ? [18:20:22] I’m the only one online according to `w` [18:20:25] I can do a `scap pull` [18:20:32] Lucas_WMDE: I'd say https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/579018/ can definitely go as 09 is gone [18:20:50] !log scap pull on mwdebug1001, attempting to fix mismatched wikiversions alert [18:20:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:19] Reedy: can you test that one? [18:21:26] oh, it’s -labs only anyways [18:21:33] Yeah [18:21:34] That and [18:21:35] reedy@deployment-deploy01:~$ ping deployment-parsoid09 [18:21:35] ping: deployment-parsoid09: Name or service not known [18:21:35] reedy@deployment-deploy01:~$ ping 172.16.5.63 [18:21:35] PING 172.16.5.63 (172.16.5.63) 56(84) bytes of data. [18:21:38] From 172.16.4.18 icmp_seq=10 Destination Host Unreachable [18:21:40] vs [18:21:43] (03PS1) 10Dzahn: site: add cescout1001 [puppet] - 10https://gerrit.wikimedia.org/r/581028 (https://phabricator.wikimedia.org/T239250) [18:21:45] reedy@deployment-deploy01:~$ ping deployment-parsoid11 [18:21:45] PING deployment-parsoid11.deployment-prep.eqiad.wmflabs (172.16.1.115) 56(84) bytes of data. [18:21:46] (03CR) 10Herron: [C: 03+2] "Np! The deploy to Kibana hosts went smoothly. Thanks mstyles!" 
[puppet] - 10https://gerrit.wikimedia.org/r/579396 (https://phabricator.wikimedia.org/T246961) (owner: 10Mstyles) [18:21:47] 64 bytes from deployment-parsoid11.deployment-prep.eqiad1.wikimedia.cloud (172.16.1.115): icmp_seq=1 ttl=64 time=0.894 ms [18:21:58] ok I’ll add it to the calendar :) [18:22:01] thanks [18:22:37] (03CR) 10Ottomata: [C: 03+2] eventstreams - Reduce eventstreams helmfile upgrade timeout to 60 seconds [deployment-charts] - 10https://gerrit.wikimedia.org/r/581025 (owner: 10Ottomata) [18:22:54] (03PS6) 10Lucas Werkmeister (WMDE): Update linter whitelist w/ parsoid11's IP address [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579018 (https://phabricator.wikimedia.org/T246833) (owner: 10C. Scott Ananian) [18:23:06] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579018 (https://phabricator.wikimedia.org/T246833) (owner: 10C. Scott Ananian) [18:23:51] mutante: no recovery on that icinga alert yet… [18:24:12] (03Merged) 10jenkins-bot: Update linter whitelist w/ parsoid11's IP address [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579018 (https://phabricator.wikimedia.org/T246833) (owner: 10C. Scott Ananian) [18:25:03] 10Operations, 10Traffic, 10HTTPS: store.wikimedia.org HTTPS issues - https://phabricator.wikimedia.org/T128559 (10BBlack) Update: sometime since I last checked, they've changed the header to: `strict-transport-security: max-age=31557600` (~1 year, vs ~90 days before). Still missing the other attributes (`pr... [18:26:03] Lucas_WMDE: did scap pull look like it actually pulled files? 
[18:26:12] or like it was already up-to-date [18:26:22] 10Operations, 10Traffic, 10HTTPS: store.wikimedia.org HTTPS issues - https://phabricator.wikimedia.org/T128559 (10BBlack) [18:27:26] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: SWAT: [[gerrit:579018|Update linter whitelist w/ parsoid11's IP address (T246833)]] (beta-only) (duration: 01m 04s) [18:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:32] T246833: Parsoid/RESTbase seems to be unavailable in Beta - https://phabricator.wikimedia.org/T246833 [18:27:35] Lucas_WMDE: telling Icinga to recheck faster [18:27:45] I’m not sure tbh https://phabricator.wikimedia.org/F31690068 [18:28:09] the first scap pull is the one I tried for icinga, the second one was for this config change [18:28:28] 10Operations, 10Traffic, 10HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (10Krinkle) [18:28:37] 10Operations, 10Traffic, 10HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (10Krinkle) [18:29:33] Lucas_WMDE: using my suggestion from https://phabricator.wikimedia.org/T218412#5027492 to check what a mediawiki version is [18:29:40] [mwdebug1001:~] $ sha1sum /srv/mediawiki/php/cache/gitinfo/* | sha1sum [18:29:43] 027271b78ce7e5fc53b2c4083d4c1d59246de0d1 - [18:30:12] [mw1280:~] $ sha1sum /srv/mediawiki/php/cache/gitinfo/* | sha1sum [18:30:12] 027271b78ce7e5fc53b2c4083d4c1d59246de0d1 - [18:30:23] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' . [18:30:24] ^ should show that they have same MW ? [18:30:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:42] looks good to me [18:30:43] also told icinga to reschedule.. frm [18:30:51] oh wait.. 
i did not [18:30:58] because i dont have icinga permissions [18:31:05] because i used the new CAS to login [18:31:33] 10Operations, 10Reading-Admin, 10Traffic: TEST: redirect small portion of unauthenticated desktop users to mobile web - https://phabricator.wikimedia.org/T117826 (10dr0ptp4kt) 05Open→03Declined Not planning to do this. [18:32:18] 10Operations, 10Multimedia, 10RESTBase-API, 10Traffic, and 2 others: Thumb API: Varnish / CDN questions - https://phabricator.wikimedia.org/T150673 (10dr0ptp4kt) [18:32:27] (03Abandoned) 10Andrew Bogott: Revert "neutron: update l3_agent hacks for Queens" [puppet] - 10https://gerrit.wikimedia.org/r/578522 (owner: 10Andrew Bogott) [18:32:29] but I think we can close the SWAT already [18:32:34] (03Abandoned) 10Andrew Bogott: Neutron l3: update with files from Queens [puppet] - 10https://gerrit.wikimedia.org/r/578523 (owner: 10Andrew Bogott) [18:32:36] !log Morning SWAT done [18:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:40] (03Abandoned) 10Andrew Bogott: neutron: apply l3_agent hacks for Queens [puppet] - 10https://gerrit.wikimedia.org/r/578524 (owner: 10Andrew Bogott) [18:32:41] almost wrote EU SWAT out of habit ^^ [18:33:18] i am on the icinga alert.. 
everything is just slow for me [18:34:16] ok, thanks [18:34:42] 10Operations, 10Traffic, 10HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (10BBlack) [18:35:23] it's just "1 version" mismatch btw [18:35:42] the other day we had a host that was out of commission for quite some time due to hardware repair [18:36:07] and when that came back online that alert said it was hundreds of versions mismatched [18:36:22] and then recovered after scap [18:38:22] the command that is actually run by the check is: /usr/local/lib/nagios/plugins/check_mw_versions --deployhost deploy1001.eqiad.wmnet [18:38:39] Local version for testwiki is incorrect (local: php-1.35.0-wmf.23, official: php-1.35.0-wmf.24) [18:38:52] Reedy: ^ ? [18:39:15] o_0 [18:39:17] i wonder how when the checksums are identical on a random other appserver [18:39:20] Did scap pull not fix it? [18:39:40] i am running it again myself too [18:39:47] no, does NOT fix it [18:40:13] feels like something else is wrong on mwdebug1001 due to manual testing hack ? [18:40:55] well, scratch my question about the checksum when it's not about mw files [18:40:58] but the php version [18:41:17] (03CR) 10Andrew Bogott: [C: 03+2] Keystone: convert policy.json to yaml [puppet] - 10https://gerrit.wikimedia.org/r/580137 (https://phabricator.wikimedia.org/T247795) (owner: 10Andrew Bogott) [18:41:22] 10Operations, 10Discovery, 10Discovery-Search, 10Elasticsearch: Elasticsearch errors about BulkShardRequest - https://phabricator.wikimedia.org/T167091 (10dcausse) a:05dcausse→03None [18:41:56] I didn’t do any manual edits on mwdebug1001 [18:42:21] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decom silver/WMF3434 - https://phabricator.wikimedia.org/T191357 (10Aklapper) @Jclark-ctr: All patches merged; is there still more to do in this task or is this maybe resolved? (`IF DECOM: switch port configration removed from switch once system is unrac... 
[18:44:33] 10Operations, 10Traffic: Switch port 80 to nginx on primary clusters - https://phabricator.wikimedia.org/T107236 (10BBlack) 05Open→03Declined We're not using nginx software for this functionality anymore, and everything else related to these parts of the software stack have changed and are still evolving,... [18:44:36] 10Operations, 10Traffic, 10Patch-For-Review: Investigate TCP Fast Open for tlsproxy - https://phabricator.wikimedia.org/T108827 (10BBlack) [18:45:36] Wikidata beta is down [18:45:44] https://m.wikidata.beta.wmflabs.org/w/index.php?title=Special:Contributions/172.16.3.130&offset=20200317180336&limit=20&target=172.16.3.130 [18:45:47] Wrong channel [18:45:56] Getting NXDOMAIN? [18:45:56] Reedy: where’s best? [18:45:59] Reedy: Lucas_WMDE: everything is on wmf.23 but deploy1001 says .24 [18:46:06] #wikimedia-releng [18:46:18] [mwdebug1001:~] $ grep aawiki /srv/mediawiki/wikiversions.json [18:46:18] "aawiki": "php-1.35.0-wmf.23", [18:46:25] [mw1280:~] $ grep aawiki /srv/mediawiki/wikiversions.json [18:46:26] "aawiki": "php-1.35.0-wmf.23", [18:46:37] it compares that a file on deploy1001 [18:46:54] which is considered the source of what is the "official" version [18:47:31] 25 # Path on the deploy host to query for the production wikiversions [18:47:34] 26 DEPLOYMENTS_PATH = "/mediawiki/mediawiki/wikiversions.json" [18:47:37] 27 LOCAL_VERSIONS_FILE = "/srv/mediawiki/wikiversions.json" [18:47:41] 10Operations, 10Traffic: Switch port 80 to nginx on primary clusters - https://phabricator.wikimedia.org/T107236 (10BBlack) a:05BBlack→03None [18:48:13] reedy@deploy1001:/srv/mediawiki-staging$ scap wikiversions-inuse [18:48:13] 1.35.0-wmf.23 1.35.0-wmf.24 [18:48:22] what would that DEPLOYMENTS_PATH start with? 
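The comparison being discussed here, a host's local wikiversions.json checked against the copy on the deploy host, can be sketched roughly as below. This is not the actual check_mw_versions plugin: the function name find_mismatches and the sample data are made up for illustration, and only the message wording is modeled on the alert text quoted in the log.

```python
import json


def find_mismatches(local, official):
    """Return one message per wiki whose local version differs from the official one."""
    problems = []
    for wiki, official_version in sorted(official.items()):
        local_version = local.get(wiki)
        if local_version != official_version:
            problems.append(
                "Local version for {} is incorrect (local: {}, official: {})".format(
                    wiki, local_version, official_version
                )
            )
    return problems


# Illustrative data only, shaped like the wikiversions.json entries quoted in the log.
local_versions = json.loads(
    '{"aawiki": "php-1.35.0-wmf.23", "testwiki": "php-1.35.0-wmf.23"}'
)
official_versions = json.loads(
    '{"aawiki": "php-1.35.0-wmf.23", "testwiki": "php-1.35.0-wmf.24"}'
)

mismatches = find_mismatches(local_versions, official_versions)
if mismatches:
    # Hypothetical status line, loosely following the Icinga alert wording.
    print("CRITICAL: {} mismatched wikiversions".format(len(mismatches)))
    for line in mismatches:
        print(line)
else:
    print("OK")
```

With this sample data the sketch flags only testwiki, which is the situation in the log: the copy the check treats as official has testwiki on wmf.24 while the host's local file still says wmf.23.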
[18:48:27] it sure is not /mediawiki [18:48:46] mutante: I see the issue [18:48:50] i assume /srv/deployment [18:48:53] Reedy: ah [18:48:56] /srv/mediawiki-deployment says all .23 [18:49:04] /srv/mediawiki has one on .24 [18:49:07] yea, even on deploy1001 itself [18:49:07] hi [18:49:13] hashar: what did you do? :P [18:49:15] "testwiki" is on wmf.24 [18:49:15] Reedy: ACK, matches it [18:49:22] hashar: Not in mediawiki-staging it's not [18:49:40] reedy@deploy1001:/srv/mediawiki-staging$ grep 24 wikiversions.json [18:49:40] reedy@deploy1001:/srv/mediawiki-staging$ [18:49:53] reedy@deploy1001:/srv/mediawiki-staging$ grep 24 ../mediawiki/wikiversions.json [18:49:53] "testwiki": "php-1.35.0-wmf.24", [18:49:58] ah yeah because we drop it [18:50:08] Date: Wed Mar 18 15:03:29 2020 +0000 [18:50:08] Revert "Group0 to 1.35.0-wmf.24" [18:50:09] the doc says to manually update the testwiki entry and sync [18:50:10] then reset hard [18:50:17] Did it not get synced? [18:50:22] Apparently not. [18:50:32] https://test.wikipedia.org/wiki/Special:Version [18:50:35] That says .24 [18:50:52] that causes an alert that mwdebug does not have the correct version [18:50:55] lol at the lack of table [18:51:09] Yeah, that's a pretty impressive breakage of everything skin-related. [18:51:16] good that it does not alert for all hosts at once.. though that would be right, heh [18:51:23] I'm blaming the Desktop Improvements work. [18:51:39] Shall I just sync wikiversions to make it consistent? [18:52:01] hold a bit please.
I am in a call with Tyler for other stuff [18:52:29] (CR) Andrew Bogott: [C: +2] glance: move policy.json files to yaml [puppet] - https://gerrit.wikimedia.org/r/580138 (https://phabricator.wikimedia.org/T247795) (owner: Andrew Bogott) [18:52:52] so [18:53:02] wikiversions.json should have .23 in staging [18:53:12] and in deployment, .24 for testwiki [18:54:13] Operations, Traffic, HTTPS, Tracking-Neverending: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681 (BBlack) Open→Resolved Resolving this, since it has become an undead tracker for too long. There are still two trailing issues, but having this over-arch... [18:57:13] MatmaRex: when you commented on T247919, did you follow the mobile URL in the task? Trying to work out when mobile wikidata beta went down. [18:57:16] T247919: Contributions sometimes appear out of order - https://phabricator.wikimedia.org/T247919 [18:57:56] RhinosF1: yes, it worked for me just a moment ago. [18:58:09] RhinosF1: in fact it still works? [18:58:10] Hmm [18:58:23] MatmaRex: can you pop into -releng [18:58:52] RhinosF1: uhh actually. there is no mobile URL in that task? i didn't visit the mobile site [18:59:29] MatmaRex: I’m blind and forgetting the automagic that redirects to mobile [19:00:04] hashar and twentyafterfour: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Mediawiki train - European+American Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200318T1900).
[19:00:15] !log fdans@deploy1001 Started deploy [analytics/refinery@549f6a4]: deploying analytics refinery [19:00:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:00] TRAIN IS ON HOLD [19:01:04] too many blockers for group 0 [19:01:08] (03PS2) 10Dzahn: site: add cescout1001 [puppet] - 10https://gerrit.wikimedia.org/r/581028 (https://phabricator.wikimedia.org/T239250) [19:01:10] jouncebot: hold it [19:11:35] !log 1.35.0-wmf.24 is on hold: too many blockers [19:11:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:17] !log fdans@deploy1001 Finished deploy [analytics/refinery@549f6a4]: deploying analytics refinery (duration: 15m 02s) [19:15:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:43] (03PS1) 10Bstorm: toolforge: remove the entire toollabs module and all related roles [puppet] - 10https://gerrit.wikimedia.org/r/581056 (https://phabricator.wikimedia.org/T246689) [19:17:28] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3052 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:18:15] hashar: Are you going to fix the inconsistency? 
[19:19:23] usually no we dont but I can if needed [19:19:36] let me do the patch [19:19:58] o_0 [19:19:59] 10Operations, 10Puppet: compile/diff catalogs between puppetdb v2 (production) and puppetdb v4 - https://phabricator.wikimedia.org/T188544 (10Dzahn) https://gerrit.wikimedia.org/r/c/operations/puppet/+/579032 [19:20:03] I'm pretty sure we don't leave things in this state [19:20:09] You don't need a patch [19:20:14] You just need to run sync-wikiversions [19:20:18] 10Operations, 10Puppet: compile/diff catalogs between puppetdb v2 (production) and puppetdb v4 - https://phabricator.wikimedia.org/T188544 (10Dzahn) 05Resolved→03Open [19:20:20] 10Operations, 10Puppet: Upgrade PuppetDB to version 4.4 - https://phabricator.wikimedia.org/T177253 (10Dzahn) [19:20:23] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [19:20:25] There's a commit on mediawiki-staging with the revert [19:20:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:33] But the problem is that non staging, testwiki is still on .24 [19:20:40] yeah [19:20:46] so we manually bump "testwiki" to .24 [19:20:56] sync that on the cluster then reset on the deployment server [19:21:10] so there is an inconsistency which is the state we are in now [19:21:28] !log shutting down (decom cookbook) elnath.codfw.wmnet (T188544) [19:21:29] my mistake earlier today is that instead of syncing "testwiki" I did a patch to promote group0 directly, hence why I have reverted it [19:21:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:33] T188544: compile/diff catalogs between puppetdb v2 (production) and puppetdb v4 - https://phabricator.wikimedia.org/T188544 [19:21:46] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [19:21:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:50] 10Operations, 10Puppet: compile/diff catalogs between puppetdb v2 (production) and puppetdb v4 - 
https://phabricator.wikimedia.org/T188544 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `elnath.codfw.wmnet` - elnath.codfw.wmnet (**PASS**) - Downtimed host on I... [19:22:00] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:22:38] (03PS1) 10Hashar: testwiki is on 1.35.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/581058 [19:22:44] https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/581058 testwiki is on 1.35.0-wmf.24 [19:22:46] Reedy: ^^^ :) [19:23:11] (03CR) 10Reedy: [C: 03+1] testwiki is on 1.35.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/581058 (owner: 10Hashar) [19:23:25] (03PS1) 10Dzahn: remove elnath.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/581060 (https://phabricator.wikimedia.org/T188544) [19:23:30] Iguess that is due to our deployment doc [19:23:35] (03CR) 10Hashar: [C: 03+2] testwiki is on 1.35.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/581058 (owner: 10Hashar) [19:24:06] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:24:30] (03Merged) 10jenkins-bot: testwiki is on 1.35.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/581058 (owner: 10Hashar) [19:24:56] Reedy: done and deploy server updated! [19:25:39] [18:19:10] PROBLEM - Ensure local MW versions match expected deployment on mwdebug1001 is CRITICAL: CRITICAL: 1 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:25:46] Just need to wait for that to recover now... 
[19:25:46] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3052 is OK: HTTP OK: HTTP/1.0 200 OK - 22377 bytes in 0.279 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:26:30] (03PS2) 10Dzahn: remove elnath.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/581060 (https://phabricator.wikimedia.org/T188544) [19:27:05] [mwdebug1001:~] $ /usr/local/lib/nagios/plugins/check_mw_versions --deployhost deploy1001.eqiad.wmnet [19:27:08] Local wikiversions doesn't match production wikiversions [19:27:11] Local version for testwiki is incorrect (local: php-1.35.0-wmf.23, official: php-1.35.0-wmf.24) [19:27:14] CRITICAL: 1 mismatched wikiversions [19:28:58] Again? [19:29:13] Or still? [19:29:19] still [19:29:26] just saying it wont' recover yet [19:32:19] hmm [19:32:49] [mwdebug1001:~] $ grep aawiki /srv/mediawiki/wikiversions.json [19:32:50] "aawiki": "php-1.35.0-wmf.23", [19:33:09] scap pull [19:33:12] running it [19:33:19] we already did 2 or 3 times [19:33:26] RECOVERY - Ensure local MW versions match expected deployment on mwdebug1001 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:33:29] heh [19:33:33] BEFORE: [19:33:36] $ grep '"testwiki"' /srv/mediawiki/wikiversions.json [19:33:36] "testwiki": "php-1.35.0-wmf.23", [19:33:37] $ scap pull [19:33:43] $ grep '"testwiki"' /srv/mediawiki/wikiversions.json [19:33:43] "testwiki": "php-1.35.0-wmf.24", [19:33:52] I guess I can ran scap update-wikiversions fleet wide [19:33:57] Is it worth a sync-wikiversions to ensure consistency? 
[19:33:58] yep, it needed another pull AFTER your merge [19:33:59] just to be sure [19:34:04] but it should have been deployed everywhere [19:34:19] mwdebug1001 might have been left in a weird state due to random other reason [19:35:33] [mw1280:~] $ /usr/local/lib/nagios/plugins/check_mw_versions --deployhost deploy1001.eqiad.wmnet [19:35:35] Production wikiversions changed recently - assuming a recent deploy.Not alerting even if we see discrepancies. [19:35:38] OKAY: wikiversions in sync [19:35:52] I am updating it fleet wide [19:39:13] (03CR) 10Ssingh: [C: 03+1] "Looks good, insofar as my understanding of Puppet goes :)" [puppet] - 10https://gerrit.wikimedia.org/r/581028 (https://phabricator.wikimedia.org/T239250) (owner: 10Dzahn) [19:39:31] that takes ages [19:42:20] (03PS4) 10Dzahn: planet: close port 80 for caching servers [puppet] - 10https://gerrit.wikimedia.org/r/579678 [19:43:25] (03PS1) 10Dzahn: add IPv6 records for cescout1001 [dns] - 10https://gerrit.wikimedia.org/r/581068 (https://phabricator.wikimedia.org/T239250) [19:43:42] (03CR) 10Dzahn: [C: 03+1] site: add cescout1001 [puppet] - 10https://gerrit.wikimedia.org/r/581028 (https://phabricator.wikimedia.org/T239250) (owner: 10Dzahn) [19:43:51] (03CR) 10jerkins-bot: [V: 04-1] add IPv6 records for cescout1001 [dns] - 10https://gerrit.wikimedia.org/r/581068 (https://phabricator.wikimedia.org/T239250) (owner: 10Dzahn) [19:44:18] (03CR) 10Ssingh: [C: 03+2] site: add cescout1001 [puppet] - 10https://gerrit.wikimedia.org/r/581028 (https://phabricator.wikimedia.org/T239250) (owner: 10Dzahn) [19:46:52] (03PS3) 10Ssingh: site: add cescout1001 [puppet] - 10https://gerrit.wikimedia.org/r/581028 (https://phabricator.wikimedia.org/T239250) (owner: 10Dzahn) [19:49:10] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: Ensure fleet wide consistency [19:49:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:20] yeah hmm that has bene slow [19:49:27] Reedy: should all 
be in sync now [19:51:46] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:52:48] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:52:49] (03CR) 10Bstorm: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/21492/" [puppet] - 10https://gerrit.wikimedia.org/r/581056 (https://phabricator.wikimedia.org/T246689) (owner: 10Bstorm) [19:53:02] (03CR) 10Bstorm: [C: 04-1] toolforge: remove the entire toollabs module and all related roles [puppet] - 10https://gerrit.wikimedia.org/r/581056 (https://phabricator.wikimedia.org/T246689) (owner: 10Bstorm) [19:55:57] I am going off [19:56:00] there are a few blockers [19:56:25] brennen: Jdlrobson might need assistance for a hotfix https://phabricator.wikimedia.org/T248010 (the sidebar disappeared! :D ) [19:58:00] !log volans@cumin1001 START - Cookbook sre.dns.netbox [19:58:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:32] hashar: ack. [20:00:05] halfak and accraze: My dear minions, it's time we take the moon! Just kidding. Time for Services – Graphoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200318T2000). 
[20:00:22] and the other blocker might not be a blocker ;] [20:04:15] (03PS2) 10Dzahn: add IPv6 records for cescout1001 [dns] - 10https://gerrit.wikimedia.org/r/581068 (https://phabricator.wikimedia.org/T239250) [20:04:35] !log volans@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [20:04:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:43] aborted on purpose [20:04:49] (03CR) 10jerkins-bot: [V: 04-1] add IPv6 records for cescout1001 [dns] - 10https://gerrit.wikimedia.org/r/581068 (https://phabricator.wikimedia.org/T239250) (owner: 10Dzahn) [20:10:30] (03PS1) 10Jgreen: first stage nsca_frack.cfg.erb cleanup, add misc hostgroup, some reformatting [puppet] - 10https://gerrit.wikimedia.org/r/581076 (https://phabricator.wikimedia.org/T247855) [20:23:14] (03PS2) 10Jgreen: first stage nsca_frack.cfg.erb cleanup, add misc hostgroup, some reformatting [puppet] - 10https://gerrit.wikimedia.org/r/581076 (https://phabricator.wikimedia.org/T247855) [20:28:30] (03PS1) 10Ssingh: Introduce new role for cescout [puppet] - 10https://gerrit.wikimedia.org/r/581082 (https://phabricator.wikimedia.org/T247273) [20:29:56] (03CR) 10Dzahn: [C: 03+1] "looks good to me, and since it's not applied yet nothing can go wrong anyways" [puppet] - 10https://gerrit.wikimedia.org/r/581082 (https://phabricator.wikimedia.org/T247273) (owner: 10Ssingh) [20:31:07] (03CR) 10Ssingh: [C: 03+2] Introduce new role for cescout [puppet] - 10https://gerrit.wikimedia.org/r/581082 (https://phabricator.wikimedia.org/T247273) (owner: 10Ssingh) [20:31:32] (03CR) 10Ssingh: [V: 03+2 C: 03+2] Introduce new role for cescout [puppet] - 10https://gerrit.wikimedia.org/r/581082 (https://phabricator.wikimedia.org/T247273) (owner: 10Ssingh) [20:36:03] (03CR) 10Dwisehaupt: [C: 03+1] "looks ok to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/581076 (https://phabricator.wikimedia.org/T247855) (owner: 10Jgreen) [20:36:45] (03PS3) 10Dzahn: add IPv6 records for cescout1001 [dns] - 10https://gerrit.wikimedia.org/r/581068 (https://phabricator.wikimedia.org/T239250) [20:37:47] (03PS4) 10Dzahn: add IPv6 records for cescout1001 [dns] - 10https://gerrit.wikimedia.org/r/581068 (https://phabricator.wikimedia.org/T239250) [20:38:30] (03PS5) 10Dzahn: add IPv6 records for cescout1001 [dns] - 10https://gerrit.wikimedia.org/r/581068 (https://phabricator.wikimedia.org/T239250) [20:40:57] Jdlrobson / phuedx: did you want me to pull https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/581054 over to mwdebug1002 for a test or does that need further investigation? [20:43:13] brennen: please pull [20:43:16] i can verify the fix [20:43:31] kk, one moment. [20:44:02] brennen: I'm having a hard time reproducing the bug right now. Jdlrobson: Can you reproduce it? [20:44:22] phuedx: cannot reproduce but i've experienced this exact same issue before [20:44:49] if the patch works on mwdebug1002 we can be assured that caching is the issue and that's where to focus attention [20:45:41] (03CR) 10Dzahn: [C: 03+2] planet: close port 80 for caching servers [puppet] - 10https://gerrit.wikimedia.org/r/579678 (owner: 10Dzahn) [21:00:38] brennen: Did you pull that change onto mwdebug1002? [21:01:12] phuedx: just there now. [21:01:27] (good to test) [21:01:33] Ta [21:01:34] (03PS2) 10Bstorm: toolforge: remove the entire toollabs module and all related roles [puppet] - 10https://gerrit.wikimedia.org/r/581056 (https://phabricator.wikimedia.org/T246689) [21:02:55] (checking) [21:03:29] brennen: im seeing the menu again [21:03:32] phuedx: can you confirm? [21:04:17] Jdlrobson: As I said, I wasn't consistently reproducing the bug beforehand. 
I'd seen the menu prior to the change being pulled [21:04:18] interestingly also fixed outside mwdebug1002 [21:04:35] so im guessing some level of caching is shared brennen ? [21:05:24] Jdlrobson: that is definitely not a question i am equipped to answer. [21:05:35] NP. I'll follow up on ticket [21:05:44] so I think this patch is good to go to unblock the train [21:05:58] 10Operations, 10Traffic, 10HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (10Krenair) Do we have a document somewhere describing the requirements of hosts pointed to by records under the wikimedia.org zone? If not should one be made and a compliance requiremen... [21:05:59] we'll need to do some more work to understand it better for the train after that [21:07:01] kk, i'll go ahead and sync that file. [21:07:07] (03CR) 10Bstorm: "There, that moves the contents of toollabs::images into the profile itself instead." [puppet] - 10https://gerrit.wikimedia.org/r/581056 (https://phabricator.wikimedia.org/T246689) (owner: 10Bstorm) [21:07:28] (03CR) 10Bstorm: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/581056 (https://phabricator.wikimedia.org/T246689) (owner: 10Bstorm) [21:11:10] !log brennen@deploy1001 Synchronized php-1.35.0-wmf.23/skins/Vector/includes/templates/index.mustache: [[gerrit:581054|Change master template to force cache invalidation of partials]] (duration: 01m 15s) [21:11:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:12] brennen: It's 9 PM and my brain is grinding to a halt. Should that've been wmf.24? [21:14:34] phuedx: i don't even have the excuse that it's late, but my brain is already fried [21:14:47] it definitely should have been, one sec [21:15:29] good catch. 
[21:16:15] !log brennen@deploy1001 Synchronized php-1.35.0-wmf.24/skins/Vector/includes/templates/index.mustache: [[gerrit:581054|Change master template to force cache invalidation of partials]] (duration: 01m 06s) [21:16:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:05] Hrrm. I'm still seeing the issue intermittently. Jdlrobson? [21:18:14] still no sidebar here for me [21:18:15] https://test.wikipedia.org/w/index.php?title=Main_Page&uselang=en-test123 [21:18:18] (outside mwdebug) [21:20:09] also missing on mwdebug1001/1002 [21:21:30] 10Operations, 10fundraising-tech-ops: rack/setup/install frnetmon1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T232137 (10Dwisehaupt) Need to add to icinga once T247855 refactor is done. [21:22:08] (03PS1) 10CDanis: fix logging error [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/581098 [21:22:10] (03PS1) 10CDanis: export number of 'workunits': configured checks-per-second [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/581099 [21:23:22] (03PS4) 10Dzahn: releases: remove port 80 firewall hole [puppet] - 10https://gerrit.wikimedia.org/r/572353 (https://phabricator.wikimedia.org/T210411) [21:27:16] (03CR) 10Dzahn: "I can't actually review if these are the changes you want in fundraising. A nitpick would be that hostgroups should be in a separate file" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/581076 (https://phabricator.wikimedia.org/T247855) (owner: 10Jgreen) [21:28:49] Jdlrobson: I do see "" on every production response now [21:28:49] T248010: Vector sidebar missing on test.wikipedia.org and weird footer - https://phabricator.wikimedia.org/T248010 [21:28:55] Maybe that was meant to be a mustache comment ;-) [21:29:05] including on testwiki [21:29:13] so it has been revalidated, which rules that out [21:30:21] ah bah, no, got caught by a Firefox bug. The purge header isn't sent in Firefox on view-source. 
[21:30:33] Indeed when the bug is there, the comment isn't either. [21:31:26] 10Operations, 10vm-requests: eqiad: 1 VM request for OTRS - https://phabricator.wikimedia.org/T248028 (10Dzahn) [21:32:39] 10Operations, 10vm-requests: eqiad: 1 VM request for OTRS - https://phabricator.wikimedia.org/T248028 (10Dzahn) To install OTRS on a new buster system and eventually replace the jessie instance mendelevium. requirement for T224590 [21:32:46] Krinkle, Jdlrobson: Worryingly, feeding badly shaped data to a Mustache template results in no warnings and no output [21:32:48] 10Operations, 10vm-requests: eqiad: 1 VM request for OTRS - https://phabricator.wikimedia.org/T248028 (10Dzahn) [21:32:50] 10Operations, 10OTRS: Migrate mendelevium/OTRS host to Stretch/Buster - https://phabricator.wikimedia.org/T224590 (10Dzahn) [21:32:54] It seems like https://gerrit.wikimedia.org/g/mediawiki/core/+/a527bf77f13a9cc605055b9045240e4224f66244/includes/TemplateParser.php#132 is the check that's failing [21:33:40] 10Operations, 10Patch-For-Review, 10Security: envoyproxy: CVE-2020-8664 CVE-2020-8661 CVE-2020-8660 CVE-2020-8659 - https://phabricator.wikimedia.org/T246868 (10Dzahn) ACK, that's T224590 and i just made T248028 to keep that moving. [21:33:51] I thought perhaps it is whitewashing stuff because $compiledTemplate['files'] wasn't set in the old values. But I'm pretty sure that's not possible because 1) that would definitely cause a php warning, 2) the version string was bumped correctly, and 3) we wiped all APCu data earlier today. [21:34:52] Agreed on 1 and 2. I didn't know about 3 [21:36:33] So the files hash hasn't changed, which is indicative of the template not actually changing on disk? [21:37:10] Or the files list isn't being calculated properly [21:38:00] (03PS5) 10Dzahn: releases: close port 80 for caching servers. 
[puppet] - 10https://gerrit.wikimedia.org/r/572353 (https://phabricator.wikimedia.org/T210411) [21:38:01] @krinkle no it was meant to be a html comment [21:38:21] mustache comments are stripped from the resulting template [21:38:27] (03PS6) 10Dzahn: releases: close port 80 for caching servers. [puppet] - 10https://gerrit.wikimedia.org/r/572353 (https://phabricator.wikimedia.org/T210411) [21:38:40] We just need to bump the file hash right? [21:38:56] stripping happens on cache miss [21:39:07] https://github.com/wikimedia/mediawiki-skins-MinervaNeue/blob/master/includes/skins/minerva.mustache#L69 [21:39:52] I don't know enough about the internals but that was not happening in practice [21:40:18] ok I'm seeing something very strange right now [21:40:39] I can reproduce the issue on mw1385 (random server that I got routed to) [21:40:42] I'm on its shell and I see: [21:40:48] > $cache = ObjectCache::getLocalServerInstance( CACHE_ANYTHING ); [21:41:00] (03PS1) 10Andrew Bogott: Neutron metadata_agent.ini: use nova_metadata_host instead of nova_metadata_ip [puppet] - 10https://gerrit.wikimedia.org/r/581104 (https://phabricator.wikimedia.org/T242766) [21:41:03] > var_dump(get_class($cache)); [21:41:03] string(22) "MemcachedPeclBagOStuff" [21:41:13] How does server local end up with memcached [21:41:34] oh, I guess that's the "anything" fallback via MainCacheType [21:41:38] nvm, it's because I'm on CLI. [21:41:40] that happens for me locally [21:41:51] (template ends up in memcached) [21:42:01] yeah, if apcu isn't enabled on the server that happens [21:43:11] I'll try to get a glimpse of what its caching [21:43:49] (03CR) 10Andrew Bogott: [C: 03+2] Neutron metadata_agent.ini: use nova_metadata_host instead of nova_metadata_ip [puppet] - 10https://gerrit.wikimedia.org/r/581104 (https://phabricator.wikimedia.org/T242766) (owner: 10Andrew Bogott) [21:44:04] Krinkle: Thanks. 
Remind me to renew my access (if it's lapsed) so that I can help out with this kind of debugging [21:44:19] Krinkle: https://phabricator.wikimedia.org/T248010#5982033 may help if you need to compare the memory object with a local one [21:45:13] Jdlrobson: Which getTemplate() call do we suspect is faulty? The one for index.mustache right? [21:45:47] yup [21:45:57] for some reason it's not noticing the partial has changed [21:46:10] (the Sidebar partial) [21:46:12] oh.. [21:46:17] hello again :) [21:46:17] $tp->processTemplate( 'index', $params ); [21:46:22] The parameter is literally just the string "index" [21:46:27] that's not globally unique [21:46:37] the cache key appears not to consider the full file path? [21:46:56] I see it [21:47:08] It should call getTemplateFilename() earlier and use it in the cache key [21:47:15] As you wrote it. I saw it [21:47:18] Yup. That [21:47:30] Not sure if that explains it but certainly a bug in its own right [21:47:31] 10Operations, 10LDAP-Access-Requests: Request for a ldap account and be added to nda ldap group for PHPCC - https://phabricator.wikimedia.org/T247731 (10RStallman-legalteam) The NDAs are signed and on file. Feel free to proceed with next steps. Thank you! [21:48:05] were you able to reproduce the sidebar issue locally? [21:48:30] * Krinkle didn't try [21:49:38] hashar: I've yet to reproduce it locally [21:49:46] Krinkle: phuedx i'll reopen https://phabricator.wikimedia.org/T113095 with a note [21:49:49] Krinkle: I may have lucked out https://codesearch.wmflabs.org/search/?q=processTemplate%5C(%20%27index%27&i=nope&files=&repos= [21:49:55] I'm fixing it now [21:49:56] hashar: caching.. need I say more. 
[21:50:14] yeah :-\ [21:51:37] https://gist.github.com/Krinkle/d96e57019efabdc58b46b6dd25ffa037 [21:51:41] phuedx: cache key output fwiw [21:52:18] OK, I understand the problem now [21:52:24] @phuedx i seem to remember us wanting to scope these changes to Vector [21:52:27] And it is indeed due to the filepath issue [21:52:38] It is being populated by wmf.23 [21:52:44] and then used by wmf.24 [21:52:47] and vice versa [21:53:02] (03PS7) 10Dzahn: releases: close port 80 for caching servers. [puppet] - 10https://gerrit.wikimedia.org/r/572353 (https://phabricator.wikimedia.org/T210411) [21:53:05] I ran this on wmf.24, but 'files' contains wmf.23 file paths [21:53:17] (03CR) 10Cwhite: [C: 03+2] fix logging error [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/581098 (owner: 10CDanis) [21:53:24] (03CR) 10Cwhite: [V: 03+2 C: 03+2] fix logging error [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/581098 (owner: 10CDanis) [21:53:25] see toward the end of the gist [21:53:28] (03PS5) 10Alex Monk: profile::mariadb::cloudinfra: Allow overriding of hostnames [puppet] - 10https://gerrit.wikimedia.org/r/575744 (https://phabricator.wikimedia.org/T242607) [21:54:13] so 'index' not only is ambiguous between Vector and other things in prod, it is also ambiguous towards its alternate self [21:54:55] * Jdlrobson just wondering if Minerva or other template usages might be impacted. [21:55:59] (03PS1) 10Alex Monk: cloud eqiad1: Remove references to old cloud-puppetmaster stretch hosts [puppet] - 10https://gerrit.wikimedia.org/r/581105 [21:56:04] !log krinkle@mw1385: scap pull # clean up AdHoc debugging for T248010 [21:56:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:10] T248010: Vector sidebar missing on test.wikipedia.org and weird footer - https://phabricator.wikimedia.org/T248010 [21:57:55] Krinkle: So we need to inject MW_VERSION into the key name? 
[21:58:01] (03CR) 10Andrew Bogott: [C: 03+2] cloud eqiad1: Remove references to old cloud-puppetmaster stretch hosts [puppet] - 10https://gerrit.wikimedia.org/r/581105 (owner: 10Alex Monk) [21:58:31] James_F: version of template software is in there already (has its own constant) [21:58:39] but it wasn't expanding the file path [21:58:41] Ah. [21:59:01] actually, you bring up a good point. It is a little odd that the wmf.23 file path was able to make it in there. [21:59:07] Yeah. [21:59:11] (03CR) 10Andrew Bogott: cloud eqiad1: Remove references to old cloud-puppetmaster stretch hosts [puppet] - 10https://gerrit.wikimedia.org/r/581105 (owner: 10Alex Monk) [21:59:23] given that phuedx did increase the version number used by the cache key [21:59:44] I am off to bed, if you revert or get a fix please be bold. [21:59:53] hashar: On it [22:00:08] then poke twentyafterfour and he should be able to promote to group0 afterward [22:00:10] good luck! [22:00:18] phuedx: Did the TemplateParser bump go out the week prior? [22:00:23] if so then that explains it. [22:00:40] anyway, yeah I'll await phuedx 's fix :) [22:00:52] * twentyafterfour is here [22:01:10] Krinkle: The TemplateParser changes went out ~3 PM today [22:02:01] 3 PM UTC, sorry [22:02:23] so the partial awareness fix that set $cacheVersion = '2.1.0'; and the Vector changes to the sidebar both went out as part of wmf.24? If so then in theory that should have avoided the problem as that should have magically avoided such confusion for this one time. 
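(Editor's note) The collision discussed above, where a cache entry populated by wmf.23 is read back by wmf.24 because the key is derived from the bare name "index", can be sketched roughly as follows. This is a minimal Python illustration, not MediaWiki's actual TemplateParser code: the function names, the key layout, and the use of SHA-1 are assumptions made for the example, and the checkout directories are illustrative paths modelled on the ones that appear in the log. Only the template name, the `2.1.0` cache version string, and the idea of folding the resolved file path into the key come from the conversation.

```python
import hashlib

# Cache version string mentioned in the log; everything else here is hypothetical.
CACHE_VERSION = "2.1.0"


def ambiguous_key(template_name: str) -> str:
    """Key built from the bare template name only (the problem being discussed)."""
    return f"template:{template_name}:{CACHE_VERSION}"


def path_aware_key(template_dir: str, template_name: str) -> str:
    """Key that also hashes the resolved file path, so each checkout gets its own entry."""
    filename = f"{template_dir}/{template_name}.mustache"
    digest = hashlib.sha1(filename.encode("utf-8")).hexdigest()
    return f"template:{template_name}:{digest}:{CACHE_VERSION}"


# Illustrative checkout directories for the two branches deployed side by side.
WMF23 = "/srv/mediawiki/php-1.35.0-wmf.23/skins/Vector/includes/templates"
WMF24 = "/srv/mediawiki/php-1.35.0-wmf.24/skins/Vector/includes/templates"

# With the bare name, both branches (and any other skin using "index") share one key.
collides = ambiguous_key("index") == ambiguous_key("index")

# With the path folded in, the two branches no longer overwrite each other's entry.
distinct = path_aware_key(WMF23, "index") != path_aware_key(WMF24, "index")
```

Under these assumptions, `collides` is true and `distinct` is true, which matches the observation that the entry inspected on wmf.24 contained wmf.23 file paths.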
[22:03:36] Krinkle: https://gerrit.wikimedia.org/r/#/c/581107/ [22:04:25] "TemplateParser: Invalidate cache if partial changes" did make it into wmf.23 [22:04:31] so yeah, no mystery [22:04:49] ^ That [22:05:47] It's now late enough that I'm strongly considering another cup of coffee [22:18:40] !log volans@cumin1001 START - Cookbook sre.dns.netbox [22:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:01] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:22:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:23] (03CR) 10Cwhite: [V: 03+2 C: 03+2] "LGTM. Will cut a release." [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/581099 (owner: 10CDanis) [22:23:26] shdubsh: thanks! [22:24:10] no worries! thanks for sending the patches [22:26:41] (03PS1) 10Mstyles: kibana: add kibana to relforge [puppet] - 10https://gerrit.wikimedia.org/r/581111 (https://phabricator.wikimedia.org/T246961) [22:38:30] (03CR) 10Mstyles: "some failures here: https://puppet-compiler.wmflabs.org/compiler1001/21494/" [puppet] - 10https://gerrit.wikimedia.org/r/581111 (https://phabricator.wikimedia.org/T246961) (owner: 10Mstyles) [22:39:14] 10Operations, 10ops-codfw, 10Wikimedia-FR-Tech-Systems: Fix incongruences between Netbox and DNS repository - https://phabricator.wikimedia.org/T248035 (10Volans) p:05Triage→03Medium [22:39:40] 10Operations, 10SRE-tools, 10Traffic, 10Goal, and 3 others: Automate generation of Management DNS records from Netbox - https://phabricator.wikimedia.org/T233183 (10Volans) @BBlack @crusnov This is the script I use to compare the results P10716 both ways. These is the output checking that all ops/dns repo... 
[22:49:57] Krinkle, twentyafterfour: The cherry-picks look ready to go [22:50:45] wmf.23: https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/581114/ [22:50:55] wmf.24: https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/581115/ [22:52:01] (03PS1) 10Cwhite: Release 0.6 [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/581118 [22:52:38] (03CR) 10Cwhite: [V: 03+2 C: 03+2] Release 0.6 [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/581118 (owner: 10Cwhite) [22:53:47] phuedx: So what's the story then? These just need to be swatted and then we can roll wmf.24 forward? [22:54:13] if the bug is fixed, yes :) [22:55:09] twentyafterfour: SWAT 'em, verify that https://phabricator.wikimedia.org/T248010 is fixed, then onward! [22:55:14] so starting with wmf.23? [22:55:20] or 24? [22:58:19] Should be a no-op on wmf.23 but start there, I think. [22:59:06] Indeed. 23 should be fine. 24 is on mwdebug1002 and group0 wikis, which we can reproduce the bug on [23:00:05] RoanKattouw, Niharika, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy Evening SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200318T2300). [23:00:05] No GERRIT patches in the queue for this window AFAICS. [23:00:37] jouncebot: actually two patches are in the queue as of now [23:02:00] jouncebot: refresh [23:02:01] I refreshed my knowledge about deployments. [23:02:21] James_F: I didn't actually edit the deployment page on wiki [23:02:30] Oh, hah. [23:02:38] but I am taking over the swat window if nobody else is using it [23:02:41] Rank hath its privileges. [23:02:50] lol [23:03:12] If you hold the conch you rule the servers! [23:03:31] Yup. Rule, misrule, you've got it. 
[23:03:52] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:04:17] bd808: There's actually a conch isn't there? [23:04:48] we had a bot briefly that did that. :) [23:05:09] more symbolic now, like the British Monarchy [23:05:10] 10Operations, 10ops-eqsin: snag asset tags from ulsfo, ship some to eqsin - https://phabricator.wikimedia.org/T245056 (10RobH) 05Open→03Resolved confirmed jin received them [23:05:21] * bd808 ducks [23:05:56] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:13:21] ok CI is almost finished, patch merged in wmf.23 [23:13:33] so if it's a noop in wmf.23 why even make the change there? [23:13:54] should I still deploy wmf.23 first really? [23:14:20] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:15:25] ok patches merged [23:15:45] phuedx: you're the one testing? [23:15:59] twentyafterfour: For completeness, I'd deploy wmf.23. If we want to verify quickly, then wmf.24 and I'll test on mwdebug1002 and testwiki [23:16:12] twentyafterfour: Sure [23:16:26] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:16:45] so which one first? 
[23:18:19] wmf.24 because it's approaching midnight for me ;) [23:20:06] (Abandoned) Cwhite: profile: add restbase filter and coerce err key to string [puppet] - https://gerrit.wikimedia.org/r/576908 (https://phabricator.wikimedia.org/T239090) (owner: Cwhite) [23:20:21] (Abandoned) Cwhite: profile: coerce mediawiki user_id field to string in logstash [puppet] - https://gerrit.wikimedia.org/r/576910 (https://phabricator.wikimedia.org/T239458) (owner: Cwhite) [23:22:49] phuedx: ok mwdebug1002 is updated [23:24:52] I no longer see the bug (tested on the main page and a bunch o' random pages) [23:26:08] syncing https://gerrit.wikimedia.org/r/c/mediawiki/core/+/581115/ refs T248010 [23:26:09] T248010: Vector sidebar missing on test.wikipedia.org and weird footer - https://phabricator.wikimedia.org/T248010 [23:26:51] !log twentyafterfour@deploy1001 Synchronized php-1.35.0-wmf.24/includes/TemplateParser.php: sync https://gerrit.wikimedia.org/r/c/mediawiki/core/+/581115/ (duration: 01m 08s) [23:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:28:37] twentyafterfour: Thanks. The menu on testwiki is... y'know... there! [23:29:01] phuedx: thanks for testing. 
I'll sync wmf.23 now [23:31:11] !log twentyafterfour@deploy1001 Synchronized php-1.35.0-wmf.23/includes/TemplateParser.php: sync https://gerrit.wikimedia.org/r/c/mediawiki/core/+/581114/ refs T248010 (duration: 01m 07s) [23:31:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:31:16] T248010: Vector sidebar missing on test.wikipedia.org and weird footer - https://phabricator.wikimedia.org/T248010 [23:32:27] (PS1) 20after4: group0 wikis to 1.35.0-wmf.24 refs T233872 [mediawiki-config] - https://gerrit.wikimedia.org/r/581133 [23:32:29] (CR) 20after4: [C: +2] group0 wikis to 1.35.0-wmf.24 refs T233872 [mediawiki-config] - https://gerrit.wikimedia.org/r/581133 (owner: 20after4) [23:33:33] (CR) jerkins-bot: [V: -1] group0 wikis to 1.35.0-wmf.24 refs T233872 [mediawiki-config] - https://gerrit.wikimedia.org/r/581133 (owner: 20after4) [23:34:13] (CR) DannyS712: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/581133 (owner: 20after4) [23:35:14] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:35:37] wth I've never seen that before [23:36:05] Jenkins says `RuntimeError: failed to find interpreter for Builtin discover of python_spec='.tox/bin/python'` [23:36:15] yeah I don't get it [23:36:44] and yet https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/581058/ worked fine [23:37:01] (PS1) DannyS712: Testing [mediawiki-config] - https://gerrit.wikimedia.org/r/581137 [23:37:17] ^seeing if Jenkins will fail an empty commit [23:37:18] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:38:19] DannyS712: I don't think it's going to even test that, it's WIP [23:38:33] (CR) jerkins-bot: [V: -1] Testing [mediawiki-config] - https://gerrit.wikimedia.org/r/581137 (owner: DannyS712) [23:38:34] It'll get a -1 in a few seconds [23:38:52] Retrieved from "https://test.wikipedia.org/w/index.php?title=Main_Page&oldid=420348" the weird footer is still around on testwiki [23:39:56] Operations, User-DannyS712, ci-test-error: operations/mediawiki-config master branch failing tests - https://phabricator.wikimedia.org/T248040 (DannyS712) [23:40:08] twentyafterfour reported the failure [23:41:06] Operations, User-DannyS712, ci-test-error: operations/mediawiki-config master branch failing tests - https://phabricator.wikimedia.org/T248040 (DannyS712) [23:41:11] Operations, User-DannyS712, ci-test-error: operations/mediawiki-config master branch failing tests - https://phabricator.wikimedia.org/T248040 (DannyS712) p:Triage→Unbreak! [23:42:32] Melos weird footer still there for me too [23:44:19] what's the "weird footer" ? [23:44:24] Right. I'm out [23:44:43] Thanks for the deploy twentyafterfour. Thanks for debugging Krinkle [23:49:42] twentyafterfour: https://test.wikipedia.org/wiki/Yet_another_new_page "Retrieved from.... " at the end of every page [23:50:38] also, weird formatting at https://test.wikipedia.org/wiki/Special:Version - is it related? [23:51:07] Yes, it is [23:51:35] same as https://phabricator.wikimedia.org/T247566 ? i reported that last week [23:52:36] was this revert supposed to be deployed? https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/581116 [23:52:45] Krinkle: Jdlrobson? ^ [23:53:16] Why can't I give it a +1? It only shows "Post" if I hit reply... [23:53:44] DannyS712: ? 
[23:54:13] At https://gerrit.wikimedia.org/r/#/c/mediawiki/skins/Vector/+/581116/, I can't give it a +1, the option isn't available (no code review options are available) [23:59:14] https://test.wikipedia.org/wiki/Special:Version?useskin=monobook uhmm [23:59:40] Melos what is wrong with it?