[00:00:04] twentyafterfour: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Phabricator update . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200604T0000). [00:05:12] Sorry for nagging in the ops-channel, didn't notice I was in wrong channel. [00:06:29] @deployers please see T254417 - I've self-+2ed the revert but it needs to be backported [00:06:30] T254417: SpecialUserrights.php: Call to undefined method CentralAuthGroupMembershipProxy::canReceiveEmail() - https://phabricator.wikimedia.org/T254417 [00:24:53] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:26:01] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:36:01] !log reedy@deploy1001 Synchronized php-1.35.0-wmf.35/includes/specials/SpecialUserrights.php: T254417 T251534 (duration: 01m 06s) [00:36:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:36:06] T251534: Special:UserRights shows "send email" for user without valid email - https://phabricator.wikimedia.org/T251534 [00:36:06] T254417: SpecialUserrights.php: Call to undefined method CentralAuthGroupMembershipProxy::canReceiveEmail() - https://phabricator.wikimedia.org/T254417 [00:46:01] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 20027304 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:49:29] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 25152 and 55 seconds 
https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [02:26:29] PROBLEM - PHP opcache health on mwdebug1002 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [02:30:05] RECOVERY - PHP opcache health on mwdebug1002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [02:30:35] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:37:37] RECOVERY - Check the last execution of mediawiki_job_cirrus_build_completion_indices_eqiad on mwmaint1002 is OK: OK: Status of the systemd unit mediawiki_job_cirrus_build_completion_indices_eqiad https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [02:40:37] RECOVERY - Check the last execution of mediawiki_job_cirrus_build_completion_indices_codfw on mwmaint1002 is OK: OK: Status of the systemd unit mediawiki_job_cirrus_build_completion_indices_codfw https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [04:04:43] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 1.189e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [04:36:39] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:43:55] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:49:51] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 4768 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [04:50:30] <_joe_> there it is [05:10:22] 10Operations, 10WMF-Design, 10Design: Create sub-directory URL for Design blog (https://design.wikimedia.org/blog) - https://phabricator.wikimedia.org/T254118 (10Prtksxna) >>! In T254118#6188515, @Dzahn wrote: > @Prtksxna The site has been setup and the content has been cloned. But you'll have to adjust the... [05:28:44] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime [05:28:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:31:12] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [05:31:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:55:30] 10Operations, 10Traffic, 10conftool, 10Patch-For-Review, and 2 others: Figure out a security model for etcd - https://phabricator.wikimedia.org/T97972 (10Joe) 05Open→03Resolved [05:55:34] 10Operations, 10Traffic, 10discovery-system, 10services-tooling: Create a tool to sync static configuration from a repository to the consistent k/v store - https://phabricator.wikimedia.org/T97978 (10Joe) [05:59:59] <_joe_> !log fixing weights of cp2040 T245594 [06:00:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:03] T245594: Many objects in conftool have pooled=yes, weight=0 - https://phabricator.wikimedia.org/T245594 [06:01:51] !log oblivian@puppetmaster1001 conftool action : set/weight=10; selector: dc=codfw,cluster=elasticsearch,service=elasticsearch 
[06:01:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:02:44] !log oblivian@puppetmaster1001 conftool action : set/weight=10; selector: dc=codfw,cluster=elasticsearch,service=elasticsearch.* [06:02:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:04:03] !log oblivian@puppetmaster1001 conftool action : set/weight=10; selector: cluster=eventschemas,service=eventschemas [06:04:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:05:09] !log oblivian@puppetmaster1001 conftool action : set/weight=10; selector: name=logstash100.* [06:05:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:06:47] !log oblivian@puppetmaster1001 conftool action : set/weight=10; selector: name=logstash200.* [06:06:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:31:47] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=rpkicounter site={eqiad,ulsfo} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:37:41] PROBLEM - Check systemd state on netflow2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:38:30] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:39:29] RECOVERY - Check systemd state on netflow2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:44:15] PROBLEM - Check the last execution of mediawiki_job_cirrus_build_completion_indices_eqiad on mwmaint1002 is CRITICAL: CRITICAL: Status of the systemd unit mediawiki_job_cirrus_build_completion_indices_eqiad https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [06:51:28] 10Operations, 10netops, 10Patch-For-Review: netflow2001 kafkatee-webrequest restart loop - https://phabricator.wikimedia.org/T249176 (10ayounsi) 05Open→03Resolved a:03ayounsi Removed! [06:52:10] !log mwmaint1002 started mediawiki_job_cirrus_build_completion_indices_eqiad.service [06:52:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:52:45] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:54:51] 10Operations, 10MediaWiki-General, 10serviceops, 10Core Platform Team Workboards (Clinic Duty Team), and 2 others: Revisit timeouts, concurrency limits in remote HTTP calls from MediaWiki - https://phabricator.wikimedia.org/T245170 (10Physikerwelt) What do you think about implementing a bot that checks tha... [06:55:03] RECOVERY - Check the last execution of mediawiki_job_cirrus_build_completion_indices_eqiad on mwmaint1002 is OK: OK: Status of the systemd unit mediawiki_job_cirrus_build_completion_indices_eqiad https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [07:00:23] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:01:56] <_joe_> mutante: you should look at what's wrong with that execution [07:02:16] <_joe_> and possibly open a task [07:07:44] 10Operations: Integrate Buster 10.4 point update - https://phabricator.wikimedia.org/T252394 (10MoritzMuehlenhoff) 05Open→03Resolved This is complete [07:10:26] _joe_: i don't see anything besides that it exited with "status=123/n/a" and now it's running and retrying a "yellow index" but making progress. watching if it finishes this time [07:11:20] 10Operations, 10Analytics, 10Traffic: missing wmf_netflow data, 18:30-19:00 May 31 - https://phabricator.wikimedia.org/T254161 (10elukey) ` scala> spark.sql("select count(*) from wmf.netflow where year=2020 and month=05 and day=31 and hour=18").show(); 20/06/04 07:09:37 WARN Utils: Truncated the string repre... [07:16:56] 10Operations, 10WMF-Design, 10Design: Create sub-directory URL for Design blog (https://design.wikimedia.org/blog) - https://phabricator.wikimedia.org/T254118 (10Dzahn) Hi @Prtksxna I left some comments directly on the Gerrit change. The links to assets inside _site would have to start with a . as a relativ... [07:19:10] 10Operations, 10LDAP-Access-Requests, 10observability, 10serviceops, 10Patch-For-Review: Grant Access to Logstash to Peter(peter.ovchyn@speedandfunction.com) - https://phabricator.wikimedia.org/T249037 (10Dzahn) Alright, thanks @AMooney ! [07:19:31] 10Operations, 10CAS-SSO, 10Patch-For-Review, 10User-jbond: Replicated ticket registry - https://phabricator.wikimedia.org/T233933 (10MoritzMuehlenhoff) I'll let Luca comment what's best option-wise, but building (and maintaining with custom patches in case of security issues) seems like an acceptable opti... [07:28:28] no gerrit bot? 
hmm [07:29:56] 10Operations, 10netops: Homer: manage transit BGP sessions - https://phabricator.wikimedia.org/T250136 (10ayounsi) The changes/cleanup done with that CR: `name=everywhere, lang=diff [edit protocols bgp group Transit4 family inet] + unicast; - any; ` `any` includes unicast + multicast and we don't... [07:32:18] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: dc=eqiad,cluster=cloudceph,service=cloudceph [07:32:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:03] !log oblivian@puppetmaster1001 conftool action : set/weight=10; selector: dc=eqiad,cluster=labweb,service=labweb-ssl [07:33:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:50] 10Operations, 10Service-Architecture: Many objects in conftool have pooled=yes, weight=0 - https://phabricator.wikimedia.org/T245594 (10Joe) 05Open→03Resolved Resolving this as we have no more services with weight 0, and now "pool" should correctly refuse to pool a service if the weight is zero [07:35:06] 10Operations, 10Service-Architecture: Many objects in conftool have pooled=yes, weight=0 - https://phabricator.wikimedia.org/T245594 (10Joe) [07:39:58] PROBLEM - BGP status on cr1-eqsin is CRITICAL: Use of uninitialized value duration in numeric gt () at /usr/lib/nagios/plugins/check_bgp line 323. 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:41:06] RECOVERY - BGP status on cr1-eqsin is OK: BGP OK - up: 267, down: 1, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:43:04] hmmm XioNoX ^^ that looks like a bug on the icinga check [07:43:42] yup [07:44:26] or SNMP issue [07:45:23] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime [07:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:41] !log Depool labsdb1009 - T252219 [07:45:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:44] T252219: Drop MCR-obsoleted fields from the wiki replicas - https://phabricator.wikimedia.org/T252219 [07:49:16] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [07:49:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:16] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:50:22] PROBLEM - Check the last execution of mediawiki_job_cirrus_build_completion_indices_codfw on mwmaint1002 is CRITICAL: CRITICAL: Status of the systemd unit mediawiki_job_cirrus_build_completion_indices_codfw https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [07:51:04] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:51:22] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:53:09] 10Operations, 10netops: Peer with SFMIX at ulsfo (May 2020) - https://phabricator.wikimedia.org/T251536 (10faidon) This is now set up on SFMIX's end and up: > On your side please plumb 206.197.187.82/24 and 2001:504:30::ba01:4907:1/64. Usual sane BGP peering rules apply - no broadcast traffic (DHCP, CDP, etc),... [07:59:07] mutante, dcausse - should we open a task for mediawiki_job_cirrus_build_completion_indices_codfw ? [07:59:57] the exit code 123 should be returned by xargs, possibly mwscript fails and we don't know how? [08:00:28] elukey: I thought I fixed the problem... I created T254331 yesterday, fixed it but apparently it's not enough [08:00:29] T254331: Suspicious mismatch between psi and omega elastic cluster - https://phabricator.wikimedia.org/T254331 [08:00:49] maybe a different cause, will take a look and file a task [08:00:56] ah! [08:01:01] thanks :) [08:01:12] thanks for the ping! [08:01:36] np! so the only way to find what's happening is rerunning the specific script ? [08:01:38] elukey: dcausse: so.. earlier it was eqiad and not codfw [08:01:46] and i am still watching the eqiad run continue [08:02:13] mutante: but it ended up with a 123 right? [08:02:38] in global syslog it showed it ended with 123 [08:02:45] elukey: to be precise: "status=123/n/a" [08:02:51] that /n/a/ looks weird of course [08:02:58] yes I think it is "123 if any invocation of the command exited with status 1-125" from xargs' man [08:03:21] `n/a` might mean it wasn't terminated by a signal [08:03:28] that is not great since we don't know the status code of the mwscript [08:03:28] in the log specific to the timer it shows a bunch of "Index is yellow retrying" but it IS successful after all [08:03:36] it is still running and did not fail again ..so far [08:03:59] (eqiad) [08:05:10] yes... 
finding the errors in these logs is a bit of pain... :/ [08:14:54] !log Run sudo /usr/local/sbin/maintain-views --all-databases --replace-all on labsdb1009 - T252219 [08:14:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:58] T252219: Drop MCR-obsoleted fields from the wiki replicas - https://phabricator.wikimedia.org/T252219 [08:15:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1107 to clone db1091 on s1 T253217', diff saved to https://phabricator.wikimedia.org/P11392 and previous config saved to /var/cache/conftool/dbconfig/20200604-081545-marostegui.json [08:15:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:49] T253217: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217 [08:28:09] 10Operations, 10Readers-Web-Backlog, 10WMF-Legal, 10SEO: (Automate) adding wikinews language versions to the Google Publisher Center / Google News - https://phabricator.wikimedia.org/T254437 (10Dzahn) [08:28:29] 10Operations, 10Readers-Web-Backlog, 10WMF-Legal, 10SEO: (Automate) adding wikinews language versions to the Google Publisher Center / Google News - https://phabricator.wikimedia.org/T254437 (10Dzahn) [08:30:22] filed T254436 for the mediawiki_job_cirrus_build_completion_indices_codfw failure, I'm pretty sure it's cirrus maint script returning an error code on something that is expected. in other words it's not a real failure [08:30:23] T254436: Job mediawiki_job_cirrus_build_completion_indices_codfw fails - https://phabricator.wikimedia.org/T254436 [08:31:24] previously we appended '|| true' to the xargs command to force a success [08:35:58] twist: you can use : find $PATH -exec \; [08:36:43] if command returns non 0, the `-exec \;` predicate will be evaluated as false [08:36:56] when -exec is the last predicate ... find just continue processing [08:37:27] example which show all files: find . -print -exec /bin/false \; [08:37:47] dcausse: thanks! 
i still did not see another failure for the eqiad run, i will stop watching the log for now then [08:38:08] find . -print -exec /bin/true \; -print # the last -print is evaluated since -exec returns true [08:38:18] it is a trick to execute commands and ignore their status [08:38:41] xargs || true would swallow an error reported by xargs itself [08:39:45] Operations, Readers-Web-Backlog, WMF-Legal, SEO: (Automate) adding wikinews language versions to the Google Publisher Center / Google News - https://phabricator.wikimedia.org/T254437 (Dzahn) [08:42:21] !log restarting archiva to pick up Java security updates [08:42:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:41] xargs will process all the input unless one command returns 255; if one command fails it reports 123 at the end [08:47:25] Operations, serviceops, Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (ops-monitoring-bot) Script wmf-auto-reimage was launched by akosiaris on cumin1001.eqiad.wmnet for hosts: ` ['ganeti1009.eqiad.wmnet', 'ganeti1010.eqiad.wmnet',... [08:49:02] Operations, SRE-Access-Requests: Adding Italian Wikinews to Google Search Console to add it to Google News - https://phabricator.wikimedia.org/T253988 (Dzahn) Hi @Ferdi2005, I logged into the "Publisher Center". There were no existing domains or projects in it. This would be the first site ever to be...
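[editor's note] The exit-status behaviour discussed between 08:31 and 08:46 can be checked locally with a small sketch like the one below. It assumes GNU findutils (other xargs/find implementations may report different codes), uses `false` as a stand-in for the failing maintenance script, and the variable names and temporary directory are purely illustrative; it is not the actual timer command from the job.

```shell
#!/bin/sh
# Sketch only: assumes GNU xargs and GNU find. `false` stands in for a
# failing command; nothing here is taken from the real systemd unit.

# 1. Every invocation exits 1. xargs still consumes all of its input and
#    reports 123 once it is done.
printf 'a\nb\nc\n' | xargs -n 1 false
xargs_status=$?

# 2. Appending `|| true` forces success, but it would equally hide an
#    error raised by xargs itself.
printf 'a\nb\nc\n' | xargs -n 1 false || true
masked_status=$?

# 3. An invocation that exits 255 makes xargs stop immediately (124),
#    so only the first input item is processed.
ran_count=$(printf '1\n2\n3\n' | xargs -n 1 sh -c 'echo "ran $1"; exit 255' _ 2>/dev/null | wc -l)
printf '1\n2\n3\n' | xargs -n 1 sh -c 'exit 255' _ 2>/dev/null
abort_status=$?

# 4. With `-exec ... \;` a failing command only makes that predicate
#    false; find keeps going and its own exit status stays 0.
tmpdir=$(mktemp -d)
touch "$tmpdir/one" "$tmpdir/two"
find "$tmpdir" -type f -exec false \;
find_status=$?
rm -rf "$tmpdir"

echo "xargs=$xargs_status masked=$masked_status abort=$abort_status ran=$ran_count find=$find_status"
```

The trade-off is visible in the statuses: `|| true` and `find -exec \;` both yield 0, but only the `find` form leaves errors of the driver itself (for example an unreadable directory) visible in the exit status.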
[08:50:01] !log Repool labsdb1009 after running maintain-views T252219 [08:50:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:04] T252219: Drop MCR-obsoleted fields from the wiki replicas - https://phabricator.wikimedia.org/T252219 [08:51:52] Operations: rack/setup/instal (4) CI ganeti nodes - https://phabricator.wikimedia.org/T228926 (ops-monitoring-bot) Script wmf-auto-reimage was launched by akosiaris on cumin1001.eqiad.wmnet for hosts: ` ['ganeti1019.eqiad.wmnet', 'ganeti1020.eqiad.wmnet', 'ganeti1021.eqiad.wmnet', 'ganeti1022.eqiad.wmnet'] `... [08:51:55] Operations, serviceops, Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (Dzahn) Was another reimage needed? I already did these. Something wrong with RAID still? [08:52:28] Operations, Traffic: Switch blog.wikimedia.org to diff.wikimedia.org - https://phabricator.wikimedia.org/T254367 (Dzahn) [08:52:45] Operations, netops: Peer with SFMIX at ulsfo (May 2020) - https://phabricator.wikimedia.org/T251536 (ayounsi) Open→Resolved a:ayounsi Everything is done, and we're peering with the RS. Next is to send peering requests.
[08:53:13] 10Operations, 10Traffic: Switch blog.wikimedia.org to diff.wikimedia.org - https://phabricator.wikimedia.org/T254367 (10Dzahn) [08:54:05] 10Operations, 10Traffic: Switch blog.wikimedia.org to diff.wikimedia.org - https://phabricator.wikimedia.org/T254367 (10Dzahn) p:05Triage→03High [08:54:40] 10Operations, 10DNS, 10Domains, 10Traffic: Create diff.wikimedia.org subdomain - https://phabricator.wikimedia.org/T253807 (10Dzahn) p:05Triage→03High [08:55:15] 10Operations, 10SRE-Access-Requests: Adding Italian Wikinews to Google Search Console to add it to Google News - https://phabricator.wikimedia.org/T253988 (10Dzahn) 05Open→03Stalled stalled by T254437 [08:56:24] 10Operations, 10Analytics, 10Analytics-Kanban: Create a profile to standardize the deployment of JVM packages and configurations - https://phabricator.wikimedia.org/T253553 (10elukey) https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/602009/ [08:58:15] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [08:58:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:15] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [08:59:15] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [08:59:16] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [08:59:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:03] 10Operations, 10SRE-Access-Requests: Requesting access to Analytics Cluster for Yi-Ju,Lu - https://phabricator.wikimedia.org/T254130 (10Dzahn) Thanks! Confirmed signature on L3. Will upload a change using the key. 
[09:00:18] 10Operations, 10SRE-Access-Requests: Requesting access to Analytics Cluster for Yi-Ju,Lu - https://phabricator.wikimedia.org/T254130 (10Dzahn) a:05YiJuLu→03Dzahn [09:00:18] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [09:00:19] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [09:00:19] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [09:00:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:20] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [09:00:20] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [09:00:21] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [09:00:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:26] 10Operations, 10SRE-Access-Requests: Requesting access to Analytics Cluster for Yi-Ju,Lu - https://phabricator.wikimedia.org/T254130 (10Dzahn) 05Stalled→03Open [09:00:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:47] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:00:49] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [09:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:52] 10Operations, 10SRE-Access-Requests: Requesting access to Analytics Cluster for Yi-Ju,Lu - https://phabricator.wikimedia.org/T254130 (10Dzahn) [09:03:15] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:03:15] !log deploying Java security updates on elastic search nodes [09:03:17] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:43] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [09:03:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:45] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [09:03:45] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [09:03:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:15] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [09:04:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:44] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [09:04:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:18] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [09:05:19] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [09:05:19] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [09:05:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:20] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [09:05:21] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [09:05:21] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [09:05:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:30] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:47] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [09:05:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:14] that is super spammy ;D [09:08:14] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:08:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:43] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [09:08:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:13] elukey: this spam eventually provides the Ganeti capacity you need :-) [09:09:45] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [09:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:00] 10Operations, 10SRE-Access-Requests: Requesting access to Analytics Cluster for Yi-Ju,Lu - https://phabricator.wikimedia.org/T254130 (10Dzahn) Hi @diego and @YiJuLu next we will need to know which groups you are requesting specifically. "the analytics cluster" can mean different things. Please take a look a... [09:12:02] 10Operations, 10serviceops, 10Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti1011.eqiad.wmnet', 'ganeti1017.eqiad.wmnet', 'ganeti1012.eqiad.wmnet', 'ganeti1013.eqiad.wmnet', '... 
[09:12:31] * elukey dances [09:12:32] :D [09:12:51] 10Operations, 10SRE-Access-Requests: Adding Italian Wikinews to Google Search Console to add it to Google News - https://phabricator.wikimedia.org/T253988 (10Aklapper) [09:12:53] 10Operations, 10Readers-Web-Backlog, 10WMF-Legal, 10SEO: (Automate) adding wikinews language versions to the Google Publisher Center / Google News - https://phabricator.wikimedia.org/T254437 (10Aklapper) [09:13:17] 10Operations, 10Readers-Web-Backlog, 10WMF-Legal, 10SEO: (Automate) adding wikinews language versions to the Google Publisher Center / Google News - https://phabricator.wikimedia.org/T254437 (10Dzahn) p:05Triage→03Medium [09:13:19] 10Operations: rack/setup/instal (4) CI ganeti nodes - https://phabricator.wikimedia.org/T228926 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti1021.eqiad.wmnet', 'ganeti1019.eqiad.wmnet', 'ganeti1022.eqiad.wmnet', 'ganeti1020.eqiad.wmnet'] ` and were **ALL** successful. [09:13:46] 10Operations, 10ops-codfw: Degraded RAID on ms-be2018 - https://phabricator.wikimedia.org/T254392 (10Dzahn) p:05Triage→03Medium [09:14:22] 10Operations, 10Analytics, 10netops: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10Dzahn) p:05Triage→03Medium [09:14:42] 10Operations, 10ops-codfw, 10DC-Ops: Decomission oresrdb2002.codfw.wmnet - https://phabricator.wikimedia.org/T254240 (10Dzahn) p:05Triage→03Medium [09:14:44] 10Operations, 10LDAP-Access-Requests: NDA for superset access request from WMDE employee - https://phabricator.wikimedia.org/T254442 (10danshick-wmde) [09:14:55] 10Operations, 10serviceops, 10Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (10akosiaris) >>! In T228924#6191712, @Dzahn wrote: > Was another reimage needed? I already did these. Something wrong with RAID still? buster vs stretch. the curr... 
[09:16:46] Operations, LDAP-Access-Requests: NDA for superset access request from WMDE employee - https://phabricator.wikimedia.org/T254442 (Dzahn) p:Triage→Medium Hi @danshick-wmde please work with @KFrancis to get the NDA signed. After that we can continue here on the ticket to add you to the relevant g... [09:17:26] Operations, serviceops, Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (Dzahn) >>! In T228924#6191828, @akosiaris wrote: > buster vs stretch. the current clusters are stretch Oh yea, that makes a lot of sense. gotcha, thanks. [09:19:20] Operations, DC-Ops, Traffic: Fix recdns config on various hardware devices - https://phabricator.wikimedia.org/T254178 (Dzahn) p:Triage→Medium [09:19:34] Operations, vm-requests, Patch-For-Review: esams,ulsfo,eqsin: one VM request each for install_servers - https://phabricator.wikimedia.org/T254157 (Dzahn) p:Triage→Medium [09:19:41] Operations, User-MoritzMuehlenhoff: Investigate StorCLI - https://phabricator.wikimedia.org/T254019 (Dzahn) p:Triage→Medium [09:19:48] Operations: Why do we have 2 sets of squid proxies?
- https://phabricator.wikimedia.org/T254011 (Dzahn) p:Triage→Medium [09:20:45] Operations, observability, User-MoritzMuehlenhoff, Wikimedia-Incident: Alert on ECC warnings in SEL - https://phabricator.wikimedia.org/T253810 (Dzahn) p:Triage→Medium [09:20:55] Operations, Traffic, observability, Performance-Team (Radar), Sustainability (Incident Prevention): Document and/or improve navigation of the various HTTP frontend Grafana dashboards - https://phabricator.wikimedia.org/T253655 (Dzahn) p:Triage→Medium [09:21:34] Operations, Release-Engineering-Team, SRE-tools, Patch-For-Review: Support running puppet Beaker on CI - https://phabricator.wikimedia.org/T253635 (Dzahn) p:Triage→Medium [09:21:52] Operations, Pybal, Traffic: PyBal ProxyFetch failure when talking to Envoy in SNI-only mode - https://phabricator.wikimedia.org/T253527 (Dzahn) p:Triage→High [09:22:07] Operations, observability, Sustainability (Incident Prevention): add monitoring of sustained memcached TKO rates - https://phabricator.wikimedia.org/T253384 (Dzahn) p:Triage→High [09:22:20] Operations, LDAP-Access-Requests: NDA for superset access request from WMDE employee - https://phabricator.wikimedia.org/T254442 (Franziska_Heine) Approved [09:24:16] Operations, Analytics, Traffic, Readers-Web-Backlog (Tracking): Mobile redirects drop provenance parameters - https://phabricator.wikimedia.org/T252227 (Dzahn) p:Triage→Medium [09:24:28] Operations: Enable SSO for Kibana - https://phabricator.wikimedia.org/T246998 (Dzahn) p:Triage→Medium [09:26:10] !log rolling restart of cassandra on maps* to pick up Java security updates [09:26:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:21] Operations, SRE-Access-Requests: Requesting access to Analytics Cluster for Yi-Ju,Lu - https://phabricator.wikimedia.org/T254130 (YiJuLu) [09:28:04] Operations, LDAP-Access-Requests: NDA
for superset access request from WMDE employee danshick - https://phabricator.wikimedia.org/T254442 (10Aklapper) [09:34:03] 10Operations, 10ops-codfw: rack/setup/ codfw: ganeti2009 - ganeti201[0-8] - https://phabricator.wikimedia.org/T224603 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by akosiaris on cumin1001.eqiad.wmnet for hosts: ` ['ganeti2009.codfw.wmnet', 'ganeti2010.codfw.wmnet', 'ganeti2011.codfw.wmnet', 'ga... [09:35:29] 10Operations, 10ops-codfw, 10DC-Ops: (Need by: TBD) rack/setup/install ganeti20[19-24] - https://phabricator.wikimedia.org/T244783 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by akosiaris on cumin1001.eqiad.wmnet for hosts: ` ['ganeti2019.codfw.wmnet', 'ganeti2020.codfw.wmnet', 'ganeti2021.co... [09:41:57] !log jmm@cumin2001 START - Cookbook sre.cassandra.roll-restart [09:41:57] !log jmm@cumin2001 END (FAIL) - Cookbook sre.cassandra.roll-restart (exit_code=99) [09:41:58] 10Operations, 10Pybal, 10Traffic: PyBal ProxyFetch failure when talking to Envoy in SNI-only mode - https://phabricator.wikimedia.org/T253527 (10Dzahn) p:05High→03Medium [09:41:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:44] !log jmm@cumin2001 START - Cookbook sre.cassandra.roll-restart [09:42:45] !log jmm@cumin2001 END (FAIL) - Cookbook sre.cassandra.roll-restart (exit_code=99) [09:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:37] !log jayme@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'citoid' for release 'staging' . 
[09:46:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:05] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [09:48:06] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [09:48:06] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [09:48:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:07] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [09:48:09] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [09:48:09] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [09:48:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:02] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [09:50:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:07] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [09:50:07] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [09:50:08] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [09:50:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:36] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [09:50:36] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [09:50:37] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [09:50:38] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:38] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:50:39] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [09:50:39] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [09:50:39] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [09:50:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:40] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [09:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:02] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [09:51:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:08] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [09:51:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:55] !log jayme@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' . 
[09:57:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:04] RECOVERY - PHP opcache health on mwdebug1001 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:19:33] 10Operations, 10Continuous-Integration-Infrastructure, 10Doxygen, 10Release-Engineering-Team (CI & Testing services), 10Release-Engineering-Team-TODO (2020-04 to 2020-06 (Q4)): Update Doxygen to 1.8.18 - https://phabricator.wikimedia.org/T253793 (10Dzahn) `Cannot find file './doxygen-latex_1.8.18-1~exp1~... [10:23:45] wikibugs is not listening to gerrit patches - T254453 [10:23:45] T254453: wikibugs not listening to Gerrit - https://phabricator.wikimedia.org/T254453 [10:23:56] so you're aware for e.g. swat [10:31:04] hmm [10:32:13] 10Operations, 10Continuous-Integration-Infrastructure, 10Doxygen, 10Release-Engineering-Team (CI & Testing services), 10Release-Engineering-Team-TODO (2020-04 to 2020-06 (Q4)): Update Doxygen to 1.8.18 - https://phabricator.wikimedia.org/T253793 (10Dzahn) ` [apt1001:~] $ sudo -i reprepro ls doxygen doxyg... [10:35:54] 10Operations, 10ops-codfw, 10DC-Ops: (Need by: TBD) rack/setup/install ganeti20[19-24] - https://phabricator.wikimedia.org/T244783 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti2020.codfw.wmnet'] ` Of which those **FAILED**: ` ['ganeti2020.codfw.wmnet'] ` [10:35:56] oh hi legoktm [10:36:02] hello :) [10:36:05] I thought it was late for you so I did not ping [10:36:07] I'm stabbing wikibugs right now [10:36:37] great, I just emailed Merlijn [10:40:44] 10Operations, 10ops-codfw: rack/setup/ codfw: ganeti2009 - ganeti201[0-8] - https://phabricator.wikimedia.org/T224603 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti2016.codfw.wmnet', 'ganeti2010.codfw.wmnet'] ` Of which those **FAILED**: ` ['ganeti2016.codfw.wmnet', 'ganeti2010.codfw.wmnet... 
[10:41:08] !log deployed new version of puppet-merge revert is https://gerrit.wikimedia.org/r/c/operations/puppet/+/602329 [10:41:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:19] got it [10:41:25] yaml.scanner.ScannerError: while scanning a double-quoted scalar [10:41:26] in "/data/project/wikibugs/wikibugs2/gerrit-channels.yaml", line 169, column 15 [10:41:26] found unknown escape character '/' [10:41:26] in "/data/project/wikibugs/wikibugs2/gerrit-channels.yaml", line 169, column 21 [10:42:02] !log Deploy schema change on s3 (only testwiki) codfw - T238966 [10:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:05] T238966: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966 [10:45:29] hah [10:45:32] I fixed it [10:45:35] (03CR) 10Ayounsi: "see https://phabricator.wikimedia.org/T250136#6191430 as well." [homer/public] - 10https://gerrit.wikimedia.org/r/602119 (https://phabricator.wikimedia.org/T250136) (owner: 10Ayounsi) [10:46:56] !log Deploy schema change on s3 (only testwiki) eqiad - T238966 [10:46:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:00] thanks so much legoktm [10:47:25] (03Restored) 10MarcoAurelio: Test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602308 (owner: 10MarcoAurelio) [10:47:32] (03Abandoned) 10MarcoAurelio: Test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602308 (owner: 10MarcoAurelio) [10:47:36] 10Operations, 10Continuous-Integration-Infrastructure, 10Doxygen, 10Release-Engineering-Team (CI & Testing services), 10Release-Engineering-Team-TODO (2020-04 to 2020-06 (Q4)): Update Doxygen to 1.8.18 - https://phabricator.wikimedia.org/T253793 (10Dzahn) a:05Dzahn→03None ` [apt1001:~] $ sudo -E repr... 
[10:48:06] 10Operations, 10Continuous-Integration-Infrastructure, 10Doxygen, 10Release-Engineering-Team (CI & Testing services), 10Release-Engineering-Team-TODO (2020-04 to 2020-06 (Q4)): Update Doxygen to 1.8.18 - https://phabricator.wikimedia.org/T253793 (10Dzahn) a:03Dzahn [10:48:36] (03CR) 10Jbond: [V: 03+2 C: 03+2] whitespace change to test puppet-merge [labs/private] - 10https://gerrit.wikimedia.org/r/602330 (owner: 10Jbond) [10:48:53] (03CR) 10Jbond: [C: 03+2] whitespace change to test puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/602332 (owner: 10Jbond) [10:49:29] (03PS1) 10Filippo Giunchedi: conftool: fix confctl detection logic [puppet] - 10https://gerrit.wikimedia.org/r/602334 (https://phabricator.wikimedia.org/T253840) [10:49:36] yw :) [10:49:40] (03CR) 10Elukey: [C: 03+2] Prepare druid1004 for Debian Buster [puppet] - 10https://gerrit.wikimedia.org/r/602310 (https://phabricator.wikimedia.org/T253980) (owner: 10Elukey) [10:49:45] what a weird bug [10:49:49] 10Operations, 10Continuous-Integration-Infrastructure, 10Doxygen, 10Release-Engineering-Team (CI & Testing services), 10Release-Engineering-Team-TODO (2020-04 to 2020-06 (Q4)): Update Doxygen to 1.8.18 - https://phabricator.wikimedia.org/T253793 (10Dzahn) 05Open→03Resolved [10:50:29] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/602334 (https://phabricator.wikimedia.org/T253840) (owner: 10Filippo Giunchedi) [10:50:31] legoktm: indeed, last patch was like 7 days ago but started to fail today? 
[10:51:13] (03CR) 10Giuseppe Lavagetto: [C: 03+1] conftool: fix confctl detection logic [puppet] - 10https://gerrit.wikimedia.org/r/602334 (https://phabricator.wikimedia.org/T253840) (owner: 10Filippo Giunchedi) [10:51:25] (03CR) 10Filippo Giunchedi: [C: 03+2] conftool: fix confctl detection logic [puppet] - 10https://gerrit.wikimedia.org/r/602334 (https://phabricator.wikimedia.org/T253840) (owner: 10Filippo Giunchedi) [10:51:32] hauskatze: it was merged today or yesterday [10:51:52] (03CR) 10Jcrespo: "This should do it: https://gerrit.wikimedia.org/r/c/integration/config/+/602333" [software/transferpy] - 10https://gerrit.wikimedia.org/r/602323 (owner: 10Jcrespo) [10:52:49] PROBLEM - Check the last execution of mediawiki_job_cirrus_build_completion_indices_eqiad on mwmaint1002 is CRITICAL: CRITICAL: Status of the systemd unit mediawiki_job_cirrus_build_completion_indices_eqiad https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [10:53:39] !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=druid1004.eqiad.wmnet [10:53:39] !log jayme@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'citoid' for release 'production' . 
[10:53:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:55] legoktm: forgot to deploy https://gerrit.wikimedia.org/r/#/c/labs/tools/ldap/+/597911/ to the tool [10:55:09] I shut down the laptop I use for that [10:55:50] no worries, I'll take care of it [10:56:42] k tnx [10:57:07] (03PS1) 10Jbond: puppetmaster: remove monitoring::icinga::git_merge check on backend [puppet] - 10https://gerrit.wikimedia.org/r/602335 (https://phabricator.wikimedia.org/T251104) [10:57:07] {{done}} [10:58:22] Tooforge heh [10:58:31] (03CR) 10Jbond: [C: 03+2] puppetmaster: remove monitoring::icinga::git_merge check on backend [puppet] - 10https://gerrit.wikimedia.org/r/602335 (https://phabricator.wikimedia.org/T251104) (owner: 10Jbond) [10:59:40] !log jayme@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'cxserver' for release 'production' . [10:59:40] dunno how no one else ever noticed it :p [10:59:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200604T1100). [11:00:04] hauskatze: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:10] o/ [11:01:02] Urbanecm: are you doing th* [11:01:22] * RhinosF1 doesnt have access to debug as mobile [11:01:34] 10Operations, 10observability: Switch ELK7 to use the distro Java - https://phabricator.wikimedia.org/T252913 (10MoritzMuehlenhoff) p:05Medium→03High [11:04:03] !log jayme@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . 
[11:04:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:48] 10Operations, 10Continuous-Integration-Infrastructure, 10Doxygen, 10Release-Engineering-Team (CI & Testing services), 10Release-Engineering-Team-TODO (2020-04 to 2020-06 (Q4)): Update Doxygen to 1.8.18 - https://phabricator.wikimedia.org/T253793 (10hashar) 05Resolved→03Open [11:05:04] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 51 probes of 576 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:09:30] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 47 probes of 576 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:11:37] o/ [11:11:40] jouncebot: now [11:11:40] For the next 0 hour(s) and 48 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200604T1100) [11:11:49] I haven't done swat in a while [11:12:04] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime [11:12:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:52] oh man [11:14:40] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:50] hauskatze: will do it [11:15:03] thanks hashar [11:15:05] (03CR) 10Hashar: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602285 (https://phabricator.wikimedia.org/T254372) (owner: 10MarcoAurelio) [11:16:16] (03Merged) 10jenkins-bot: [metawiki] Add `centralauth-rename` to WMF OIT staff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602285 (https://phabricator.wikimedia.org/T254372) (owner: 
10MarcoAurelio) [11:19:29] hauskatze: it is on mwdebug1001 [11:19:37] checking [11:20:18] lgtm hashar [11:20:38] thank you :) [11:21:43] !log hashar@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [metawiki] Add `centralauth-rename` to WMF OIT staff - T254372 (duration: 01m 08s) [11:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:46] T254372: Add `centralauth-rename` to the `wmf-officeit` group - https://phabricator.wikimedia.org/T254372 [11:23:08] EU SWAT Completed :-) [11:24:17] 10Operations, 10Continuous-Integration-Infrastructure, 10Doxygen, 10Patch-For-Review, and 2 others: Update Doxygen to 1.8.18 - https://phabricator.wikimedia.org/T253793 (10hashar) 05Open→03Resolved a:05Dzahn→03hashar Container rebuild and I have switched the Jenkins jobs to Doxygen 1.8.18 Thank yo... [11:26:55] (03CR) 10Dzahn: "these don't have IPs yet" [puppet] - 10https://gerrit.wikimedia.org/r/599749 (https://phabricator.wikimedia.org/T241852) (owner: 10Dzahn) [11:27:31] (03PS1) 10Elukey: Set Debian Buster for druid100[4,5,6] [puppet] - 10https://gerrit.wikimedia.org/r/602340 (https://phabricator.wikimedia.org/T253980) [11:28:01] (03CR) 10Elukey: [C: 03+2] Set Debian Buster for druid100[4,5,6] [puppet] - 10https://gerrit.wikimedia.org/r/602340 (https://phabricator.wikimedia.org/T253980) (owner: 10Elukey) [11:29:07] !log Compress InnoDB on db1091 before pooling it as new slave on s1 - T254462 [11:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:11] T254462: Compress enwiki InnoDB tables - https://phabricator.wikimedia.org/T254462 [11:29:35] (03CR) 10Dzahn: "let's use https where we can, please" [puppet] - 10https://gerrit.wikimedia.org/r/602311 (owner: 10Amire80) [11:30:20] (03PS1) 10Jbond: gitclone: add labs/private checkout back to labs environment [puppet] - 10https://gerrit.wikimedia.org/r/602341 [11:31:28] (03PS2) 10Dzahn: Add wikipediapodden.se to North Germanic Planet [puppet] - 
10https://gerrit.wikimedia.org/r/602311 (owner: 10Amire80) [11:31:40] (03CR) 10Jbond: [C: 03+2] gitclone: add labs/private checkout back to labs environment [puppet] - 10https://gerrit.wikimedia.org/r/602341 (owner: 10Jbond) [11:32:17] (03CR) 10Dzahn: [C: 03+2] Add wikipediapodden.se to North Germanic Planet [puppet] - 10https://gerrit.wikimedia.org/r/602311 (owner: 10Amire80) [11:32:51] (03PS2) 10Dzahn: Add bunyk to the Ukrainian Planet [puppet] - 10https://gerrit.wikimedia.org/r/602319 (owner: 10Amire80) [11:34:27] (03CR) 10Dzahn: [C: 03+2] Add bunyk to the Ukrainian Planet [puppet] - 10https://gerrit.wikimedia.org/r/602319 (owner: 10Amire80) [11:36:47] 10Operations, 10observability: Switch ELK7 to use the distro Java - https://phabricator.wikimedia.org/T252913 (10MoritzMuehlenhoff) For amending the elasticsearch.service the following should do it: ` systemd::unit{'elasticsearch': override => true, restart => true, content => "[Service]\nEnvironment=... [11:39:31] (03CR) 10Urbanecm: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/599737 (https://phabricator.wikimedia.org/T253578) (owner: 10RhinosF1) [11:39:38] (03PS3) 10Urbanecm: Change $wgNamespaceRobotPolicies on Thai wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599737 (https://phabricator.wikimedia.org/T253578) (owner: 10RhinosF1) [11:39:46] (03CR) 10Urbanecm: [C: 03+2] Change $wgNamespaceRobotPolicies on Thai wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599737 (https://phabricator.wikimedia.org/T253578) (owner: 10RhinosF1) [11:39:53] (03PS3) 10Urbanecm: wgNamespaceRobotPolicies: Set several namespaces to noindex,nofollow for thwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598999 (https://phabricator.wikimedia.org/T253574) [11:39:59] (03CR) 10Urbanecm: [C: 03+2] wgNamespaceRobotPolicies: Set several namespaces to noindex,nofollow for thwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598999 (https://phabricator.wikimedia.org/T253574) (owner: 10Urbanecm) [11:40:12] (03CR) 10Urbanecm: [C: 03+2] "> Patch Set 2:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599737 (https://phabricator.wikimedia.org/T253578) (owner: 10RhinosF1) [11:40:14] * RhinosF1 waves [11:41:14] hey RhinosF1 [11:41:16] (03Merged) 10jenkins-bot: Change $wgNamespaceRobotPolicies on Thai wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599737 (https://phabricator.wikimedia.org/T253578) (owner: 10RhinosF1) [11:41:35] (03PS4) 10Urbanecm: wgNamespaceRobotPolicies: Set several namespaces to noindex,nofollow for thwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598999 (https://phabricator.wikimedia.org/T253574) [11:41:42] (03CR) 10Urbanecm: wgNamespaceRobotPolicies: Set several namespaces to noindex,nofollow for thwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598999 (https://phabricator.wikimedia.org/T253574) (owner: 10Urbanecm) [11:41:45] (03CR) 10Urbanecm: [C: 03+2] wgNamespaceRobotPolicies: Set several namespaces to 
noindex,nofollow for thwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598999 (https://phabricator.wikimedia.org/T253574) (owner: 10Urbanecm) [11:41:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1107', diff saved to https://phabricator.wikimedia.org/P11395 and previous config saved to /var/cache/conftool/dbconfig/20200604-114149-marostegui.json [11:41:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:50] (03Merged) 10jenkins-bot: wgNamespaceRobotPolicies: Set several namespaces to noindex,nofollow for thwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598999 (https://phabricator.wikimedia.org/T253574) (owner: 10Urbanecm) [11:46:43] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime [11:46:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:57] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 338cb90: 1ade16f: Change $wgNamespaceRobotPolicies on Thai wikis (T253578; T253577; T253576; T253575; T253574) (duration: 01m 07s) [11:49:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:05] T253576: Change $wgNamespaceRobotPolicies on thwikisource - https://phabricator.wikimedia.org/T253576 [11:49:06] T253575: Change $wgNamespaceRobotPolicies on thwikibooks - https://phabricator.wikimedia.org/T253575 [11:49:06] T253578: Change $wgNamespaceRobotPolicies on thwiktionary - https://phabricator.wikimedia.org/T253578 [11:49:06] T253577: Change $wgNamespaceRobotPolicies on thwikiquote - https://phabricator.wikimedia.org/T253577 [11:49:06] T253574: Change $wgNamespaceRobotPolicies on thwiki - https://phabricator.wikimedia.org/T253574 [11:49:33] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:49:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:30] (03PS1) 10Urbanecm: wgNamespaceRobotPolicies: thwiki: Add 100 NS to noindex 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/602344 (https://phabricator.wikimedia.org/T253574) [11:51:06] (03CR) 10Urbanecm: [C: 03+2] wgNamespaceRobotPolicies: thwiki: Add 100 NS to noindex [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602344 (https://phabricator.wikimedia.org/T253574) (owner: 10Urbanecm) [11:51:54] (03Merged) 10jenkins-bot: wgNamespaceRobotPolicies: thwiki: Add 100 NS to noindex [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602344 (https://phabricator.wikimedia.org/T253574) (owner: 10Urbanecm) [11:53:50] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: ec07467: wgNamespaceRobotPolicies: thwiki: Add 100 NS to noindex (T253574) (duration: 01m 15s) [11:53:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:47] 10Operations, 10Continuous-Integration-Infrastructure, 10Doxygen, 10Patch-For-Review, and 2 others: Update Doxygen to 1.8.18 - https://phabricator.wikimedia.org/T253793 (10Pablo-WMDE) FYI a 1.8.18 build just failed for us without a conclusive error message. https://integration.wikimedia.org/ci/job/mwext-... 
[11:55:49] RhinosF1: all done :) [11:55:54] ty [11:59:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1107', diff saved to https://phabricator.wikimedia.org/P11396 and previous config saved to /var/cache/conftool/dbconfig/20200604-115933-marostegui.json [11:59:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:35] !log upgrading mw1276 to PHP 7.2.31 [12:02:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:00] (03CR) 10Dzahn: [C: 03+2] jenkins: reindent spec to two spaces [puppet] - 10https://gerrit.wikimedia.org/r/602100 (owner: 10Hashar) [12:06:56] (03CR) 10Dzahn: [C: 03+2] jenkins: fix spec to use proper facts [puppet] - 10https://gerrit.wikimedia.org/r/602101 (owner: 10Hashar) [12:07:05] (03PS3) 10Dzahn: jenkins: fix spec to use proper facts [puppet] - 10https://gerrit.wikimedia.org/r/602101 (owner: 10Hashar) [12:08:48] (03PS7) 10Dzahn: ci: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/598990 (owner: 10Muehlenhoff) [12:09:04] (03CR) 10jerkins-bot: [V: 04-1] ci: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/598990 (owner: 10Muehlenhoff) [12:09:09] SIGH [12:10:25] (03PS1) 10Arturo Borrero Gonzalez: wmcs: kubeadm: refresh default version to 1.16.10 [puppet] - 10https://gerrit.wikimedia.org/r/602346 [12:10:55] (03CR) 10Jbond: "lgtm, questions and a comment" (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/602119 (https://phabricator.wikimedia.org/T250136) (owner: 10Ayounsi) [12:11:52] (03CR) 10Dzahn: "merged the ancestors..rebased and "08:08:59 docker: Error response from daemon: ttrpc: closed: unknown."" [puppet] - 10https://gerrit.wikimedia.org/r/598990 (owner: 10Muehlenhoff) [12:12:09] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/598990 (owner: 10Muehlenhoff) [12:12:25] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: kubeadm: refresh default version to 1.16.10 [puppet] - 
10https://gerrit.wikimedia.org/r/602346 (owner: 10Arturo Borrero Gonzalez) [12:14:52] !log jayme@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'citoid' for release 'production' . [12:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:17] (03CR) 10Dzahn: [C: 03+2] ci: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/598990 (owner: 10Muehlenhoff) [12:18:24] !log jayme@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'cxserver' for release 'production' . [12:18:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:56] (03PS1) 10Alexandros Kosiaris: ganeti: Add a ganeti_init.sh script [puppet] - 10https://gerrit.wikimedia.org/r/602350 (https://phabricator.wikimedia.org/T228924) [12:24:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1107', diff saved to https://phabricator.wikimedia.org/P11397 and previous config saved to /var/cache/conftool/dbconfig/20200604-122406-marostegui.json [12:32:01] (03CR) 10Muehlenhoff: "Sure, if anyone wants to test this setting with a few specific Cloud VPS projects, I'll hold this back for now. It's just a matter of sett" [puppet] - 10https://gerrit.wikimedia.org/r/602286 (https://phabricator.wikimedia.org/T235162) (owner: 10Muehlenhoff) [12:34:40] (03CR) 10Muehlenhoff: [C: 03+2] Rename reprepro definition for grafana [puppet] - 10https://gerrit.wikimedia.org/r/602108 (owner: 10Muehlenhoff) [12:36:09] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 56 probes of 576 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:38:01] Who manages https://sal.toolforge.org/? 
It seems to be empty :/ [12:38:10] (03PS2) 10Alexandros Kosiaris: ganeti: Add a ganeti_init.sh script [puppet] - 10https://gerrit.wikimedia.org/r/602350 (https://phabricator.wikimedia.org/T228924) [12:38:17] Yup, RIP toolforge at this moment Urbanecm [12:38:40] DutchTina: oh, it's a toolforge-side outage? [12:38:46] Yes [12:39:31] DutchTina: do you have a task number please? [12:39:51] I don't know if there is a task yet. [12:40:11] All bots left Urbanecm and other tools are half down too. [12:41:08] (03CR) 10Alexandros Kosiaris: "@dzahn here's a first draft of that ugly set of commands I was referring to. And this is already better than previously :-)" [puppet] - 10https://gerrit.wikimedia.org/r/602350 (https://phabricator.wikimedia.org/T228924) (owner: 10Alexandros Kosiaris) [12:41:43] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 49 probes of 576 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:42:30] (03PS10) 10DCausse: [WIP][wdqs] add a new streaming updater test role [puppet] - 10https://gerrit.wikimedia.org/r/597790 [12:42:32] (03PS1) 10DCausse: [wdqs] drop updater mode config [puppet] - 10https://gerrit.wikimedia.org/r/602353 [12:43:27] DutchTina: somehow (surprisingly), none of my tools have failed [12:43:44] RhinosF1: Lucky you haha [12:44:04] !log Drop trigger revision_insert and revision_update from sanitarium (on testwiki) T238966 [12:44:43] (03CR) 10jerkins-bot: [V: 04-1] [WIP][wdqs] add a new streaming updater test role [puppet] - 10https://gerrit.wikimedia.org/r/597790 (owner: 10DCausse) [12:44:51] DutchTina: it's puppetmaster, that's why. My tools are normally first to break! [12:45:08] RhinosF1: Aaah... 
[12:45:27] (03PS2) 10Jcrespo: test commit [software/transferpy] - 10https://gerrit.wikimedia.org/r/602323 [12:46:06] (03CR) 10Jcrespo: "It can take 20 minutes to propagate to production..." [software/transferpy] - 10https://gerrit.wikimedia.org/r/602323 (owner: 10Jcrespo) [12:46:34] !log jayme@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [12:47:02] (03CR) 10Hashar: "recheck" [software/transferpy] - 10https://gerrit.wikimedia.org/r/602323 (owner: 10Jcrespo) [12:47:10] (03PS1) 10Joal: Add parameters to the eventlogging_to_druid job [puppet] - 10https://gerrit.wikimedia.org/r/602354 [12:47:18] (03PS3) 10Jcrespo: test commit [software/transferpy] - 10https://gerrit.wikimedia.org/r/602323 [12:50:18] !log upgrading mw1277-mw1283 to PHP 7.2.31 [12:50:54] (03PS1) 10RhinosF1: Revert "wgNamespaceRobotPolicies: thwiki: Add 100 NS to noindex" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602355 [12:51:13] Urbanecm: ^ [12:51:19] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/22982/an-launcher1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/602354 (owner: 10Joal) [12:51:56] (03CR) 10Urbanecm: [C: 03+2] Revert "wgNamespaceRobotPolicies: thwiki: Add 100 NS to noindex" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602355 (owner: 10RhinosF1) [12:52:46] (03Merged) 10jenkins-bot: Revert "wgNamespaceRobotPolicies: thwiki: Add 100 NS to noindex" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602355 (owner: 10RhinosF1) [12:54:28] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: c06e720: Revert "wgNamespaceRobotPolicies: thwiki: Add 100 NS to noindex" (T253574) (duration: 01m 06s) [12:54:53] Urbanecm: no stashbot so don't know how you can log [12:55:02] (03CR) 10Jcrespo: "It worked because we hadn't touched the other files yet, we need a commit to delete the files unrelated to transfer.py and setup the right" [software/transferpy] - 
10https://gerrit.wikimedia.org/r/602323 (owner: 10Jcrespo) [12:55:11] (03Abandoned) 10Jcrespo: test commit [software/transferpy] - 10https://gerrit.wikimedia.org/r/602323 (owner: 10Jcrespo) [12:55:24] RhinosF1: realized too late, adding to Wikitech manually [12:55:24] ty [12:55:36] :) [12:56:52] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: c06e720: Revert "wgNamespaceRobotPolicies: thwiki: Add 100 NS to noindex" (T253574) (duration: 01m 06s) [12:56:55] ehh [12:57:47] Urbanecm: it quit straight away, must be flapping [12:58:19] yeah, probably didn't even boot - but thought it's worth trying :) [12:58:29] true [12:58:33] !log elukey@deploy1001 Started deploy [analytics/superset/deploy@e1d948f]: Use gevent in gunicorn [12:59:06] (03PS1) 10Joal: Update netflow-to-druid load job configuration [puppet] - 10https://gerrit.wikimedia.org/r/602356 [12:59:11] elukey: stashbot's down, your entry didn't get logged [12:59:26] ack thanks [12:59:31] Urbanecm: should that be put in the topic for a bit? [12:59:40] !log elukey@deploy1001 Finished deploy [analytics/superset/deploy@e1d948f]: Use gevent in gunicorn (duration: 01m 08s) [13:00:02] no oppose, though I'm not an op there - I guess SRE should decide that :) [13:00:06] *here [13:03:37] (03PS1) 10Elukey: profile::superset: move to gevent [puppet] - 10https://gerrit.wikimedia.org/r/602357 (https://phabricator.wikimedia.org/T253545) [13:03:52] (03PS1) 10Hnowlan: changeprop-jobqueue: fix rendering of ignore topics list. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/602358 (https://phabricator.wikimedia.org/T220399) [13:04:05] (03CR) 10Elukey: [C: 03+2] profile::superset: move to gevent [puppet] - 10https://gerrit.wikimedia.org/r/602357 (https://phabricator.wikimedia.org/T253545) (owner: 10Elukey) [13:04:54] (03PS1) 10Jcrespo: Transferer.py: Backport production fixes into HEAD (xtrabackup in path) [software/transferpy] - 10https://gerrit.wikimedia.org/r/602359 (https://phabricator.wikimedia.org/T250666) [13:08:28] (03PS1) 10Marostegui: db1104: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/602362 [13:08:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1107', diff saved to https://phabricator.wikimedia.org/P11398 and previous config saved to /var/cache/conftool/dbconfig/20200604-130839-marostegui.json [13:08:44] (03PS2) 10Hnowlan: changeprop-jobqueue: fix rendering of ignore topics list. [deployment-charts] - 10https://gerrit.wikimedia.org/r/602358 (https://phabricator.wikimedia.org/T220399) [13:09:08] marostegui: no stashbot so that won't be logged [13:09:38] RhinosF1: Sure [13:10:06] (03CR) 10Marostegui: [C: 03+2] db1104: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/602362 (owner: 10Marostegui) [13:10:18] (03CR) 10Privacybatm: "> Patch Set 3:" [software/transferpy] - 10https://gerrit.wikimedia.org/r/602323 (owner: 10Jcrespo) [13:10:27] (03PS2) 10Joal: Update netflow-to-druid load job configuration [puppet] - 10https://gerrit.wikimedia.org/r/602356 [13:11:59] (03PS3) 10Joal: Update netflow-to-druid load job configuration [puppet] - 10https://gerrit.wikimedia.org/r/602356 [13:12:05] (03PS11) 10DCausse: [WIP][wdqs] add a new streaming updater profile [puppet] - 10https://gerrit.wikimedia.org/r/597790 [13:12:28] (03CR) 10Vgutierrez: [C: 03+2] Add diff.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/602306 (https://phabricator.wikimedia.org/T253807) (owner: 10Vgutierrez) [13:13:53] (03PS3) 10Mholloway: 
Mobileapps: Add initial helmfile stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/602155 (https://phabricator.wikimedia.org/T218733) [13:13:56] (03CR) 10jerkins-bot: [V: 04-1] Update netflow-to-druid load job configuration [puppet] - 10https://gerrit.wikimedia.org/r/602356 (owner: 10Joal) [13:14:14] (03CR) 10Mholloway: Mobileapps: Add initial helmfile stanzas (037 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/602155 (https://phabricator.wikimedia.org/T218733) (owner: 10Mholloway) [13:14:16] (03CR) 10jerkins-bot: [V: 04-1] [WIP][wdqs] add a new streaming updater profile [puppet] - 10https://gerrit.wikimedia.org/r/597790 (owner: 10DCausse) [13:14:32] (03PS3) 10Muehlenhoff: Extend Cassandra cookbook to also cover maps [cookbooks] - 10https://gerrit.wikimedia.org/r/602318 [13:15:07] 10Operations, 10DNS, 10Domains, 10Traffic, 10Patch-For-Review: Create diff.wikimedia.org subdomain - https://phabricator.wikimedia.org/T253807 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez `willikins:dns vgutierrez$ dig diff.wikimedia.org. ; <<>> DiG 9.10.6 <<>> diff.wikimedia.org. ;; global opt... 
[13:16:36] (03CR) 10jerkins-bot: [V: 04-1] Extend Cassandra cookbook to also cover maps [cookbooks] - 10https://gerrit.wikimedia.org/r/602318 (owner: 10Muehlenhoff) [13:16:58] (03PS3) 10Alexandros Kosiaris: ganeti: Add a ganeti_init.sh script [puppet] - 10https://gerrit.wikimedia.org/r/602350 (https://phabricator.wikimedia.org/T228924) [13:17:00] (03PS1) 10Alexandros Kosiaris: Assign role::ganeti to new ganeti expansion hosts [puppet] - 10https://gerrit.wikimedia.org/r/602364 (https://phabricator.wikimedia.org/T228924) [13:17:48] (03PS2) 10Alexandros Kosiaris: Assign role::ganeti to new ganeti expansion hosts [puppet] - 10https://gerrit.wikimedia.org/r/602364 (https://phabricator.wikimedia.org/T228924) [13:17:50] (03PS4) 10Alexandros Kosiaris: ganeti: Add a ganeti_init.sh script [puppet] - 10https://gerrit.wikimedia.org/r/602350 (https://phabricator.wikimedia.org/T228924) [13:18:57] (03CR) 10Alexandros Kosiaris: [C: 03+2] Assign role::ganeti to new ganeti expansion hosts [puppet] - 10https://gerrit.wikimedia.org/r/602364 (https://phabricator.wikimedia.org/T228924) (owner: 10Alexandros Kosiaris) [13:22:06] (03CR) 10Mforns: "LGTM! except for typo." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/602356 (owner: 10Joal) [13:25:16] PROBLEM - Widespread puppet agent failures on icinga1001 is CRITICAL: 0.01723 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:26:22] (03PS2) 10Mholloway: Chromium-render: Add initial helmfile stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/602164 (https://phabricator.wikimedia.org/T225680) [13:26:30] (03CR) 10Mholloway: Chromium-render: Add initial helmfile stanzas (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/602164 (https://phabricator.wikimedia.org/T225680) (owner: 10Mholloway) [13:26:39] (03CR) 10Jcrespo: "For Batm:" [software/transferpy] - 10https://gerrit.wikimedia.org/r/602359 (https://phabricator.wikimedia.org/T250666) (owner: 10Jcrespo) [13:27:23] (03PS4) 10Elukey: Update netflow-to-druid load job configuration [puppet] - 10https://gerrit.wikimedia.org/r/602356 (owner: 10Joal) [13:27:52] PROBLEM - Check systemd state on ms-be1023 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:29:31] akosiaris: are you seeing the puppet failures? not positive it's your change but it looks plausible [13:29:46] rzl: /me looking [13:31:49] rzl: yeah, transient, they 'll get autocorrected on the next puppet run. 
It seems like we need a require for that file resource [13:31:58] 👍 [13:32:10] (03CR) 10Elukey: [C: 03+2] Update netflow-to-druid load job configuration [puppet] - 10https://gerrit.wikimedia.org/r/602356 (owner: 10Joal) [13:35:27] !log installing exim security updates on jessie (stretch/buster already done) [13:35:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:58] (03PS5) 10Alexandros Kosiaris: ganeti: Add a ganeti_init.sh script [puppet] - 10https://gerrit.wikimedia.org/r/602350 (https://phabricator.wikimedia.org/T228924) [13:37:00] (03PS1) 10Alexandros Kosiaris: ganeti: Deduplicate /var/lib/ganeti/rapi/users [puppet] - 10https://gerrit.wikimedia.org/r/602368 [13:39:34] (03PS4) 10JMeybohm: Readd wmf.chartid (.metadata.labels.chart) to all resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/598076 [13:39:46] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1023 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:40:30] (03CR) 10Kormat: [C: 04-1] "Minor english fixes." 
(033 comments) [software/transferpy] - 10https://gerrit.wikimedia.org/r/602359 (https://phabricator.wikimedia.org/T250666) (owner: 10Jcrespo) [13:40:36] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 53 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:41:42] RECOVERY - Check systemd state on ms-be1023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:41:44] !log bounced ferm on ms-be1023 [13:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:18] 10Operations, 10observability: Switch ELK7 to use the distro Java - https://phabricator.wikimedia.org/T252913 (10herron) I'm afraid we won't be able to remove the openjdk-8 dependency yet, as we will be moving the kafka-logging brokers that are co-located with the logging ES data nodes to ELK7. After some IRC... [13:44:33] (03CR) 10Alexandros Kosiaris: [C: 03+2] ganeti: Deduplicate /var/lib/ganeti/rapi/users [puppet] - 10https://gerrit.wikimedia.org/r/602368 (owner: 10Alexandros Kosiaris) [13:46:10] RECOVERY - Widespread puppet agent failures on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.003829 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:50:41] (03PS1) 10Elukey: Add Turnilo to the staging environment on an-tool1007 [puppet] - 10https://gerrit.wikimedia.org/r/602371 (https://phabricator.wikimedia.org/T253294) [13:51:17] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: decom 36 old appservers in eqiad (onsite, dcops) - https://phabricator.wikimedia.org/T253856 (10Jclark-ctr) @Cmjohnson Host have been removed from racks and netbox has been updated for removing from rack. 
[13:51:18] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 47 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:51:33] 10Operations, 10observability: Switch ELK7 to use the distro Java - https://phabricator.wikimedia.org/T252913 (10MoritzMuehlenhoff) Indeed, that's a built-in feature of profile::java, all we need is the following in Hiera for the ELK7 hosts: ` profile::java::java_packages: - version: 8 - variant: jdk -... [13:52:06] (03CR) 10jerkins-bot: [V: 04-1] Add Turnilo to the staging environment on an-tool1007 [puppet] - 10https://gerrit.wikimedia.org/r/602371 (https://phabricator.wikimedia.org/T253294) (owner: 10Elukey) [13:52:54] (03CR) 10Ppchelko: [C: 03+1] changeprop-jobqueue: fix rendering of ignore topics list. [deployment-charts] - 10https://gerrit.wikimedia.org/r/602358 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [13:55:34] 10Operations, 10User-MoritzMuehlenhoff: Ferm sometimes (rarely) fails to reload - https://phabricator.wikimedia.org/T254477 (10MoritzMuehlenhoff) [13:58:39] (03PS5) 10JMeybohm: Readd wmf.chartid (.metadata.labels.chart) to all resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/598076 [14:00:26] (03CR) 10JMeybohm: Readd wmf.chartid (.metadata.labels.chart) to all resources (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/598076 (owner: 10JMeybohm) [14:00:29] !log Stopping puppet on gerrit1002 (gerrit-test) to run tests for Gerrit upgrade [14:00:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:28] 10Operations, 10Citoid, 10Wikimedia-Logstash, 10observability, and 3 others: Move citoid logging to new logging pipeline - https://phabricator.wikimedia.org/T219919 (10Pchelolo) The patch above doesn't change anything in production. 
In general, having 'config.prod.yaml' in citoid source repo is misleading... [14:08:53] !log installing clamav security updates on mendelevium (ticket.wikimedia.org) [14:08:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:58] (03PS12) 10DCausse: [WIP][wdqs] add a new streaming updater profile [puppet] - 10https://gerrit.wikimedia.org/r/597790 [14:10:18] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1023 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:12:08] (03CR) 10jerkins-bot: [V: 04-1] [WIP][wdqs] add a new streaming updater profile [puppet] - 10https://gerrit.wikimedia.org/r/597790 (owner: 10DCausse) [14:12:24] 10Operations, 10observability, 10User-MoritzMuehlenhoff: Switch ELK7 to use the distro Java - https://phabricator.wikimedia.org/T252913 (10MoritzMuehlenhoff) [14:14:21] 10Operations: Onboarding Stephen Shirley - https://phabricator.wikimedia.org/T250134 (10Kormat) [14:14:27] (03PS1) 10Arturo Borrero Gonzalez: hieradata: tools: drop legacy kubernetes tokens [labs/private] - 10https://gerrit.wikimedia.org/r/602374 [14:14:58] 10Operations: Onboarding Stephen Shirley - https://phabricator.wikimedia.org/T250134 (10Kormat) 05Open→03Resolved Got u2f key, enrolled in both idp and google apps, and tested both. 
We're done, folks :) [14:15:50] (03PS6) 10JMeybohm: Readd wmf.chartid (.metadata.labels.chart) to all resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/598076 (https://phabricator.wikimedia.org/T254479) [14:17:09] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] hieradata: tools: drop legacy kubernetes tokens [labs/private] - 10https://gerrit.wikimedia.org/r/602374 (owner: 10Arturo Borrero Gonzalez) [14:18:16] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:20:04] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:20:30] 10Operations, 10Puppet: automated linting/analysis/other CI of Python/shell scripts generated by ERB - https://phabricator.wikimedia.org/T254480 (10CDanis) [14:21:25] 10Operations: Onboarding Stephen Shirley - https://phabricator.wikimedia.org/T250134 (10Marostegui) Nice!!!! [14:30:06] 10Operations, 10ops-codfw, 10DC-Ops: (Need by: TBD) rack/setup/install ganeti20[19-24] - https://phabricator.wikimedia.org/T244783 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by akosiaris on cumin1001.eqiad.wmnet for hosts: ` ['ganeti2010.codfw.wmnet', 'ganeti2020.codfw.wmnet'] ` The log can... [14:30:17] 10Operations, 10ops-codfw, 10DC-Ops: (Need by: TBD) rack/setup/install ganeti20[19-24] - https://phabricator.wikimedia.org/T244783 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti2010.codfw.wmnet', 'ganeti2020.codfw.wmnet'] ` Of which those **FAILED**: ` ['ganeti2010.codfw.wmnet', 'ganeti2... [14:34:24] (03CR) 10Hnowlan: [C: 03+2] changeprop-jobqueue: fix rendering of ignore topics list. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/602358 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [14:34:54] (03Merged) 10jenkins-bot: changeprop-jobqueue: fix rendering of ignore topics list. [deployment-charts] - 10https://gerrit.wikimedia.org/r/602358 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [14:35:22] (03PS6) 10Alexandros Kosiaris: ganeti: Add a ganeti_init.sh script [puppet] - 10https://gerrit.wikimedia.org/r/602350 (https://phabricator.wikimedia.org/T228924) [14:35:24] (03PS1) 10Alexandros Kosiaris: ganeti: ganeti[12]0{09..24}.eqiad|codfw.wmnet to hieradata [puppet] - 10https://gerrit.wikimedia.org/r/602379 (https://phabricator.wikimedia.org/T228924) [14:36:44] !log installing libexif security updates on jessie [14:36:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:16] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . [14:37:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:57] (03CR) 10Alexandros Kosiaris: [C: 03+2] ganeti: ganeti[12]0{09..24}.eqiad|codfw.wmnet to hieradata [puppet] - 10https://gerrit.wikimedia.org/r/602379 (https://phabricator.wikimedia.org/T228924) (owner: 10Alexandros Kosiaris) [14:42:31] 10Operations, 10ops-codfw: Degraded RAID on ms-be2018 - https://phabricator.wikimedia.org/T254392 (10fgiunchedi) @Papaul looks like this is also a failed BBU, similar to {T252851}. Please replace once the new BBUs come in, thank you! Host is good to be taken down at any time after a clean `poweroff`, ping me o... 
[14:46:02] (03PS13) 10DCausse: [WIP][wdqs] add a new streaming updater profile [puppet] - 10https://gerrit.wikimedia.org/r/597790 [14:46:36] (03PS1) 10Arturo Borrero Gonzalez: hieradata: labs: tools: add dummy password placeholder for k8s encryption key [labs/private] - 10https://gerrit.wikimedia.org/r/602384 [14:47:29] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] hieradata: labs: tools: add dummy password placeholder for k8s encryption key [labs/private] - 10https://gerrit.wikimedia.org/r/602384 (owner: 10Arturo Borrero Gonzalez) [14:48:06] (03CR) 10jerkins-bot: [V: 04-1] [WIP][wdqs] add a new streaming updater profile [puppet] - 10https://gerrit.wikimedia.org/r/597790 (owner: 10DCausse) [14:49:16] 10Operations, 10Analytics, 10Traffic: missing wmf_netflow data, 18:30-19:00 May 31 - https://phabricator.wikimedia.org/T254161 (10elukey) The hole is now gone, but we discovered a major problem in T254383 :( [14:49:34] 10Operations, 10Analytics, 10Traffic: missing wmf_netflow data, 18:30-19:00 May 31 - https://phabricator.wikimedia.org/T254161 (10elukey) 05Open→03Resolved [14:49:44] (03PS14) 10DCausse: [WIP][wdqs] add a new streaming updater profile [puppet] - 10https://gerrit.wikimedia.org/r/597790 [14:50:19] (03CR) 10jerkins-bot: [V: 04-1] [WIP][wdqs] add a new streaming updater profile [puppet] - 10https://gerrit.wikimedia.org/r/597790 (owner: 10DCausse) [14:51:54] (03PS3) 10Krinkle: Lossy optimisation of Wikipedia logos static PNGs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599284 (https://phabricator.wikimedia.org/T252108) (owner: 10Gilles) [14:52:11] 10Operations, 10ops-codfw, 10DC-Ops: (Need by: TBD) rack/setup/install ganeti20[19-24] - https://phabricator.wikimedia.org/T244783 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by akosiaris on cumin1001.eqiad.wmnet for hosts: ` ['ganeti2010.codfw.wmnet', 'ganeti2020.codfw.wmnet'] ` The log can... 
[14:52:14] (03PS1) 10Ottomata: Default PYSPARK_PYTHON to exact versioned python executable used on driver. [debs/spark2] (debian) - 10https://gerrit.wikimedia.org/r/602386 (https://phabricator.wikimedia.org/T229347) [14:52:51] (03CR) 10DCausse: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/597790 (owner: 10DCausse) [14:53:19] (03CR) 10Krinkle: "The 1x and 1.5x look identical to me. The 2x variants have become blurry for me, e.g. enwiki 2x old/new. The old one was crisp, but the ne" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599284 (https://phabricator.wikimedia.org/T252108) (owner: 10Gilles) [14:54:20] (03PS2) 10Ottomata: Default PYSPARK_PYTHON to exact versioned python executable used on driver. [debs/spark2] (debian) - 10https://gerrit.wikimedia.org/r/602386 (https://phabricator.wikimedia.org/T229347) [14:55:50] (03CR) 10Krinkle: [C: 03+1] "The 2x ones look fine when viewed in a Safari, Firefox or Chrome tab. The blurry 2x rendering happens only when opening the new image in ma" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599284 (https://phabricator.wikimedia.org/T252108) (owner: 10Gilles) [14:58:58] (03CR) 10Muehlenhoff: "This looks really great" [puppet] - 10https://gerrit.wikimedia.org/r/601761 (https://phabricator.wikimedia.org/T252027) (owner: 10Kormat) [15:00:26] (03CR) 10CDanis: [C: 03+1] "I don't feel competent to review the bigger-picture things here, but the syntax and small decisions here all LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/601761 (https://phabricator.wikimedia.org/T252027) (owner: 10Kormat) [15:01:10] (03PS1) 10MSantos: WIP: charts for push-notification service [deployment-charts] - 10https://gerrit.wikimedia.org/r/602390 (https://phabricator.wikimedia.org/T250493) [15:01:12] (03CR) 10EBernhardson: Default PYSPARK_PYTHON to exact versioned python executable used on driver. 
(031 comment) [debs/spark2] (debian) - 10https://gerrit.wikimedia.org/r/602386 (https://phabricator.wikimedia.org/T229347) (owner: 10Ottomata) [15:02:11] (03PS1) 10Joal: Bump AQS druid snapshot to 2020-05 [puppet] - 10https://gerrit.wikimedia.org/r/602391 [15:04:51] (03PS15) 10DCausse: [WIP][wdqs] add a new streaming updater profile [puppet] - 10https://gerrit.wikimedia.org/r/597790 [15:06:02] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [15:06:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:08] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [15:07:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:32] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:08:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:21] !log hnowlan@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [15:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:08] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:18] 10Puppet, 10Cloud-VPS, 10cloud-services-team (Kanban): Puppet labs/private.git data loss incident affecting some projects - https://phabricator.wikimedia.org/T254491 (10bd808) [15:12:22] !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=druid1004.eqiad.wmnet [15:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:47] 10Puppet, 10Cloud-VPS, 10cloud-services-team (Kanban): Puppet labs/private.git data loss incident affecting some projects - https://phabricator.wikimedia.org/T254491 (10bd808) p:05Triage→03Unbreak! 
[15:16:10] (03PS2) 10Elukey: Add Turnilo to the staging environment on an-tool1007 [puppet] - 10https://gerrit.wikimedia.org/r/602371 (https://phabricator.wikimedia.org/T253294) [15:19:20] 10Puppet, 10Cloud-VPS, 10cloud-services-team (Kanban): Puppet labs/private.git data loss incident affecting some projects - https://phabricator.wikimedia.org/T254491 (10bd808) Announced to community on cloud-announce: https://lists.wikimedia.org/pipermail/cloud-announce/2020-June/000291.html [15:20:16] 10Puppet, 10Cloud-VPS, 10cloud-services-team (Kanban): Puppet labs/private.git data loss incident affecting some projects - https://phabricator.wikimedia.org/T254491 (10bd808) a:03bd808 [15:21:38] (03PS10) 10Kormat: install_server: Allow reuse of partitions during reimage. [WIP] [puppet] - 10https://gerrit.wikimedia.org/r/601761 (https://phabricator.wikimedia.org/T252027) [15:21:52] 10Puppet, 10Cloud-VPS, 10cloud-services-team (Kanban): Puppet labs/private.git data loss incident affecting some projects - https://phabricator.wikimedia.org/T254491 (10bd808) [15:23:04] (03PS11) 10Kormat: install_server: Allow reuse of partitions during reimage. [puppet] - 10https://gerrit.wikimedia.org/r/601761 (https://phabricator.wikimedia.org/T252027) [15:24:48] (03CR) 10BearND: Mobileapps: Add initial helmfile stanzas (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/602155 (https://phabricator.wikimedia.org/T218733) (owner: 10Mholloway) [15:30:07] 10Operations, 10ops-codfw, 10DC-Ops: (Need by: TBD) rack/setup/install ganeti20[19-24] - https://phabricator.wikimedia.org/T244783 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti2020.codfw.wmnet', 'ganeti2010.codfw.wmnet'] ` and were **ALL** successful. [15:30:21] (03CR) 10Jcrespo: "Thanks for the catch, will amend. I am happy if those are the only issues!" 
[software/transferpy] - 10https://gerrit.wikimedia.org/r/602359 (https://phabricator.wikimedia.org/T250666) (owner: 10Jcrespo) [15:32:05] (03PS7) 10Filippo Giunchedi: thanos: add alerts for Thanos components [puppet] - 10https://gerrit.wikimedia.org/r/602082 (https://phabricator.wikimedia.org/T252186) [15:32:07] (03PS1) 10Filippo Giunchedi: prometheus: move services instance to profile [puppet] - 10https://gerrit.wikimedia.org/r/602398 (https://phabricator.wikimedia.org/T252186) [15:35:14] (03PS2) 10Filippo Giunchedi: prometheus: move services instance to profile [puppet] - 10https://gerrit.wikimedia.org/r/602398 (https://phabricator.wikimedia.org/T252186) [15:39:13] (03PS1) 10Ssingh: add fake console key for dnsdist::wikidough [labs/private] - 10https://gerrit.wikimedia.org/r/602400 [15:40:08] (03PS1) 10Filippo Giunchedi: prometheus: move global instance to profile [puppet] - 10https://gerrit.wikimedia.org/r/602401 (https://phabricator.wikimedia.org/T252186) [15:40:56] (03CR) 10Dzahn: [C: 03+1] add fake console key for dnsdist::wikidough [labs/private] - 10https://gerrit.wikimedia.org/r/602400 (owner: 10Ssingh) [15:41:44] (03CR) 10Ssingh: [V: 03+2 C: 03+2] add fake console key for dnsdist::wikidough [labs/private] - 10https://gerrit.wikimedia.org/r/602400 (owner: 10Ssingh) [15:45:10] (03PS1) 10Ssingh: dnsdist: allow access to control socket [puppet] - 10https://gerrit.wikimedia.org/r/602406 (https://phabricator.wikimedia.org/T252132) [15:47:07] (03CR) 10jerkins-bot: [V: 04-1] dnsdist: allow access to control socket [puppet] - 10https://gerrit.wikimedia.org/r/602406 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [15:47:13] 10Operations, 10User-MoritzMuehlenhoff: planned upstream deprecation of the ssh-rsa signing algorithm (RSA with SHA-1) - https://phabricator.wikimedia.org/T253824 (10MoritzMuehlenhoff) I've backported the patch to the version in stretch and deployed a test package on cp2009, which seems to work well. I'll do a... 
[15:48:18] (03PS1) 10Bartosz Dziewoński: Set wmgVisualEditorDisableForAnons to false on enwiki and eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602408 (https://phabricator.wikimedia.org/T253941) [15:51:38] (03PS2) 10Ssingh: dnsdist: allow access to control socket [puppet] - 10https://gerrit.wikimedia.org/r/602406 (https://phabricator.wikimedia.org/T252132) [15:52:08] (03PS1) 10Filippo Giunchedi: prometheus: merge ops instance role into profile [puppet] - 10https://gerrit.wikimedia.org/r/602409 (https://phabricator.wikimedia.org/T252186) [15:54:08] (03CR) 10jerkins-bot: [V: 04-1] prometheus: merge ops instance role into profile [puppet] - 10https://gerrit.wikimedia.org/r/602409 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [15:54:49] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1003/22990/" [puppet] - 10https://gerrit.wikimedia.org/r/602406 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [16:00:04] godog and _joe_: Your horoscope predicts another unfortunate Puppet SWAT(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200604T1600). [16:00:04] brennen: A patch you scheduled for Puppet SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:12] * brennen here. 
[16:00:29] (03CR) 10Dzahn: [C: 03+1] dnsdist: allow access to control socket [puppet] - 10https://gerrit.wikimedia.org/r/602406 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [16:02:22] (03CR) 10Ssingh: [C: 03+2] dnsdist: allow access to control socket [puppet] - 10https://gerrit.wikimedia.org/r/602406 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [16:02:55] brennen: I'll take a look [16:03:04] godog: thanks [16:04:28] (03CR) 10Filippo Giunchedi: [C: 03+2] logspam-watch: add time & sortable columns, improve formatting [puppet] - 10https://gerrit.wikimedia.org/r/593936 (https://phabricator.wikimedia.org/T242882) (owner: 10Brennen Bearnes) [16:07:05] brennen: {{done}} and puppet has run [16:07:39] godog: thanks much. confirmed that the change works as expected on mwlog1001. [16:08:00] neat! you're welcome brennen [16:13:07] (03PS2) 10Filippo Giunchedi: prometheus: merge ops instance role into profile [puppet] - 10https://gerrit.wikimedia.org/r/602409 (https://phabricator.wikimedia.org/T252186) [16:13:09] 10Operations, 10Analytics, 10Analytics-Kanban: Increase memory available for an-launcher1001 - https://phabricator.wikimedia.org/T254125 (10Milimetric) Can we re-enable reportupdater on the machine now? [16:13:13] (03Abandoned) 10Lucas Werkmeister (WMDE): logspam-watch: exec watch [puppet] - 10https://gerrit.wikimedia.org/r/499761 (owner: 10Lucas Werkmeister (WMDE)) [16:15:42] 10Puppet, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO (2020-04 to 2020-06 (Q4)), 10User-brennen: logspam-watch: Add interactive sorting / filtering - https://phabricator.wikimedia.org/T242882 (10brennen) 05Open→03Resolved Tested and generally available on mwlog1001. [16:18:07] 10Operations, 10observability, 10User-MoritzMuehlenhoff, 10Wikimedia-Incident: Alert on ECC warnings in SEL - https://phabricator.wikimedia.org/T253810 (10CDanis) I hesitate to ask, but, did anyone check if mcelog recorded anything for these events? 
[16:18:28] 10Operations, 10Analytics, 10netops: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10Milimetric) p:05Medium→03High [16:18:35] (03CR) 10Filippo Giunchedi: "PCC is effectively a noop https://puppet-compiler.wmflabs.org/compiler1003/22993/prometheus1003.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/602398 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [16:20:58] (03CR) 10Filippo Giunchedi: "PCC effectively a noop https://puppet-compiler.wmflabs.org/compiler1003/22995/prometheus1003.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/602401 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [16:21:24] (03PS1) 10Hamish: Change the Traditional Chinese logo for Chinese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602421 (https://phabricator.wikimedia.org/T254467) [16:21:34] (03CR) 10CDanis: "friendly ping :)" [puppet] - 10https://gerrit.wikimedia.org/r/601460 (owner: 10CDanis) [16:23:39] (03PS3) 10Filippo Giunchedi: prometheus: merge ops instance role into profile [puppet] - 10https://gerrit.wikimedia.org/r/602409 (https://phabricator.wikimedia.org/T252186) [16:28:01] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 2.09e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:28:39] 👀 [16:29:01] mc1030 is saturating [16:29:20] also, big increase in traffic ~45 minutes ago at 15:45 [16:29:47] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 193 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:30:26] (03CR) 10Filippo Giunchedi: "PCC's happy (and a noop in practice) 
https://puppet-compiler.wmflabs.org/compiler1001/22997/" [puppet] - 10https://gerrit.wikimedia.org/r/602409 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [16:30:41] rzl: elukey: how to tell when a gutter pool server is being used? [16:31:01] (03CR) 10Gilles: "Hmm yeah, in my experience Preview.app has a tendency to blow small images up randomly when you open them, especially when using the space" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599284 (https://phabricator.wikimedia.org/T252108) (owner: 10Gilles) [16:31:14] I guess https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=memcached_gutter&var-instance=All kind of tells a story [16:31:42] cdanis: not sure offhand but they're mc-gp####, check network usage? [16:32:11] yeah! that's the stuff [16:32:19] gutterpools are live already? [16:32:23] Krinkle: for a while now [16:32:38] Krinkle: the pending work is the machine-local memcached with very brief ttl [16:32:43] I lost track of the conversation regarding not breaking tombstones and purge requirements [16:32:45] s/machine/appserver/ [16:33:04] did we end up agreeing on a solution for memc/wancache that upholds its needs? [16:33:35] cdanis: so there are multiple ways [16:33:43] e.g. where to replay traffic to/from, ttl, etc. lack of coordinated start/end [16:33:58] 1) memcached grafana dashboard -> gutter and check traffic graphs etc.. [16:34:17] 2) mcrouter metrics, if you see TKOs then it is probably a sign that a failover happened [16:34:43] For example, even a 10s ttl (which is shorter than the tombstone's 11s ttl) can be problematic for local-memc depending on how it is populated. If it is populated lazily, do we prevent a tombstone that is nearly expired from being renewed for another 10s etc.? 
[16:34:44] https://grafana.wikimedia.org/d/000000316/memcache?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=memcached_gutter&var-instance=All [16:35:23] Krinkle: what we do is set a max TTL for keys ending up in the gutter, IIRC 10 mins [16:35:26] if gutter is only populated on set (e.g. all sets go to primary and gutter), then a 10s ttl there might be fine, not sure. [16:37:13] also cdanis keep in mind that TKOs keep being logged even after the failover happened to the gutter [16:37:43] elukey: is set/delete traffic etc always sent to the gutters continuously? [16:37:51] so if we see a spike, it is probably a quick failover, a sustained TKO stream is instead a longer failover [16:37:57] elukey: my guess is that mcrouter has to see continued TKOs to decide to keep using the gutter? [16:38:05] (and then capped to 10min there?) [16:38:34] Krinkle: correct, for all the keys related to the shard marked as TKO of course [16:38:47] there is also an async log for deletes, to replay when failback occurs IIRC [16:39:11] elukey: I mean even outside TKO, when all servers are fine, the mcrouter needs to send sets/deletes to gutters as well. [16:39:22] cdanis: yes I see a datapoint for a TKO as "this request was for a shard that is marked as TKO" [16:39:26] elukey: ack [16:39:32] otherwise its state is meaningless I think, given that TKO is not coordinated. [16:40:07] Krinkle: it is a fresh cache basically, we don't propagate anything to the gutter pool if we are green [16:40:26] e.g. mw1 will see mcA as TKO, and use gp1 instead. mw2 still sees it as fine an issues a purge for a key to mcA - this purge is not seen thus losing 10min of consistency which is pretty big [16:40:45] fresh as in - we wipe it completely whenever an app server's mcrouter starts using it? 
[16:40:51] 10Operations, 10Continuous-Integration-Infrastructure, 10serviceops: replace backends for releases.wikimedia.org with buster VMs - https://phabricator.wikimedia.org/T247652 (10Jdforrester-WMF) [16:42:18] conversation is at https://phabricator.wikimedia.org/T240684 - looks like there is something to be synced up there, sounds potentially problematic as there'd be no way to know that something is more stale than the expected tolerance of ~ 10s. [16:43:24] Krinkle: how can a purge from mw2x happen if we are not active/active? [16:44:15] 10Operations, 10Puppet, 10User-jbond: automated linting/analysis/other CI of Python/shell scripts generated by ERB - https://phabricator.wikimedia.org/T254480 (10jbond) p:05Triage→03Medium [16:44:29] elukey mw1 and mw2 were intended as two servers in the same DC [16:44:36] mcrouter is not coordinated, TKO is not coordinated. [16:44:36] ah okok [16:45:25] yes correct but we do failover only when we see 10 timeous of 1s in a row [16:45:55] so something problematic is happening on the shard, either network congestion or downtime etc.. [16:46:55] there is also one point to make - we are possibly keeping too much state on memcached, something that shouldn't happen [16:48:07] it does seem like something more than a cache [16:50:37] (03CR) 10Elukey: [C: 03+2] Bump AQS druid snapshot to 2020-05 [puppet] - 10https://gerrit.wikimedia.org/r/602391 (owner: 10Joal) [16:58:32] 10Operations, 10serviceops, 10Performance-Team (Radar), 10Sustainability (Incident Prevention): Test gutter pool failover in production and memcached 1.5.x - https://phabricator.wikimedia.org/T240684 (10Krinkle) Update - The gutterpools are live. The conversation here does not look finished though, so it'... 
[16:59:41] (03PS6) 10EBernhardson: query_service: Move shared config into common file [puppet] - 10https://gerrit.wikimedia.org/r/599145 [16:59:43] (03PS9) 10EBernhardson: Consolidate query_service profile duplication [puppet] - 10https://gerrit.wikimedia.org/r/599146 [16:59:45] (03PS2) 10EBernhardson: Revert "Revert "Role for SDoC WDQS"" [puppet] - 10https://gerrit.wikimedia.org/r/602171 [17:00:04] halfak and accraze: #bothumor I � Unicode. All rise for Services – Graphoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200604T1700). [17:00:34] cdanis: the compromise we made was to consider it as much as possible a cache, meaning no expectation of writes being immediate, and not even needed to be eventually consistent (that is, each local-dc memc cluster is expected to be independent and lazily populated). Before the multi-dc effort it was not uncommon for memc keys to be populated from master DB data during POST requests. Now we always populate from replica DBs, lazily, via [17:00:34] the getWithSet idiom. [17:00:49] This compromise however requires that we do coordinate one thing: purges (aka tombstones). [17:01:25] and from what I can tell, the gutter pool, given no coordination or central proxy, is essentially just a mini DC. so we would always need to broadcast purges to it, I'm not sure how else it could work reliably.
[17:01:30] (03CR) 10jerkins-bot: [V: 04-1] query_service: Move shared config into common file [puppet] - 10https://gerrit.wikimedia.org/r/599145 (owner: 10EBernhardson) [17:01:37] (03CR) 10jerkins-bot: [V: 04-1] Consolidate query_service profile duplication [puppet] - 10https://gerrit.wikimedia.org/r/599146 (owner: 10EBernhardson) [17:02:08] (03CR) 10jerkins-bot: [V: 04-1] Revert "Revert "Role for SDoC WDQS"" [puppet] - 10https://gerrit.wikimedia.org/r/602171 (owner: 10EBernhardson) [17:02:40] I could be wrong, but I have the vague sense that TKOs are often false negatives that are only perceived by a small subset of app servers, e.g. possibly due to specific keys or commands resulting in "bad" responses, or due to very intermittent congestion. If that is the case, then that also makes it more important to consider the gutter pools always "live". [17:02:54] We also use memc for dc-local mutex locks. [17:04:07] That's probably where TKOs have been the most damaging, because it means everyone is denied the lock for a few seconds. So gutter pools will help there to gain back the behaviour we had in the past I think with nutcracker, which jsut randomly rehashed to another server I think? that had its own issues, but at least it meant bette uptime for ADD/lock operations. [17:04:37] Krinkle: so, I don't understand the contract expected here, or if such a thing is even documented anywhere, but, I'm worried about these sorts of expectations in general, especially when I think only a very small number of people understand them deeply [17:05:08] (03PS1) 10Hnowlan: changeprop-jobqueue: enable partitioned jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/602430 (https://phabricator.wikimedia.org/T220399) [17:05:09] +1 [17:05:55] Aaron;'s message has been pretty consistent since 2015. But conversations do need to take place for that knowledge to stick and transfer. 
[17:06:49] The risks of this were pointed out at https://phabricator.wikimedia.org/T240684, where I assume you asked our input to make sure we get it right. [17:07:03] Maybe that ticket was forgotten? [17:07:35] (03CR) 10Herron: [C: 03+2] icinga: increase "rsyslog failing to deliver messages" check threshold [puppet] - 10https://gerrit.wikimedia.org/r/602153 (owner: 10Herron) [17:09:52] 10Operations, 10serviceops, 10Performance-Team (Radar), 10Sustainability (Incident Prevention): Test gutter pool failover in production and memcached 1.5.x - https://phabricator.wikimedia.org/T240684 (10elukey) Trying to answer :) >>! In T240684#6193525, @Krinkle wrote: > Update - The gutterpools are liv... [17:12:34] 10Operations, 10Citoid, 10Wikimedia-Logstash, 10observability, and 3 others: Move citoid logging to new logging pipeline - https://phabricator.wikimedia.org/T219919 (10Mvolz) >>! In T219919#6192799, @Pchelolo wrote: > The patch above doesn't change anything in production. In general, having 'config.prod.ya... [17:12:36] Krinkle: from our point of view, wanobjectcache is very complex and difficult to understand, the documentation of the code is not enough (speaking for me, I found it really difficult to follow). We don't have a clear idea about all the use cases in detail, like locking/tombstones/replication/etc.., but at the same time we have the need of reliability and maintenance [17:13:07] so the task was not forgotten, but some trade-off needed to be made to avoid a lot of engineering pain [17:13:35] I think that results are good, there is the need for some tuning (as expected) but we can do it together [17:14:14] One thing that would be really useful, in my opinion, would be a quick talk about the object cache use cases and constraints [17:14:24] for SRE I mean, so we can start from the same page [17:14:27] does it make sense? 
[17:15:53] also let's keep in mind that a TKO state should be an exceptional event, not a regular one [17:19:02] (03PS3) 10Elukey: Add Turnilo to the staging environment on an-tool1007 [puppet] - 10https://gerrit.wikimedia.org/r/602371 (https://phabricator.wikimedia.org/T253294) [17:19:22] elukey: for the entrypoint I look after most (load.php) it seems to be regular and basically the only source of errors. E.g. https://logstash.wikimedia.org/goto/8901f1dea3dfe11483b1c581b684a586 [17:19:30] I don't know of those are all TKO-induced though [17:19:42] I guess with gutters being live, and them still there, maybe not? [17:23:25] you can see TKOs on the mcrouter console in grafana Krinkle [17:24:33] https://grafana.wikimedia.org/d/000000549/mcrouter?orgId=1 [17:24:41] there is a panel for TKOs etc.. [17:25:01] (03CR) 10Ppchelko: [C: 03+1] changeprop-jobqueue: enable partitioned jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/602430 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [17:26:20] (03PS1) 10BryanDavis: toolsdb: add 3rd temporary filter for replication [puppet] - 10https://gerrit.wikimedia.org/r/602433 (https://phabricator.wikimedia.org/T253738) [17:27:09] (03PS4) 10Elukey: Add Turnilo to the staging environment on an-tool1007 [puppet] - 10https://gerrit.wikimedia.org/r/602371 (https://phabricator.wikimedia.org/T253294) [17:27:57] elukey: Yeah, I think the Doxyen docs are quite good (I don't know if those are the ones you used), but they are written for developers wanting to understand the MW side of things. They don't do well to tell you the contract it needs toward its backend. [17:28:19] (03CR) 10Elukey: Default PYSPARK_PYTHON to exact versioned python executable used on driver. 
(031 comment) [debs/spark2] (debian) - 10https://gerrit.wikimedia.org/r/602386 (https://phabricator.wikimedia.org/T229347) (owner: 10Ottomata) [17:29:05] It'd be good I think to separate these better so that you don't have the need/want to even know what MW does with tombstones, e.g. more around "we need X to do Y" and less "we need to be able to do XYZ, somehow" which only opens a rabithole of more questions, and that's great, but I think it woudl generally be good if SRE can do what they need without knowing MW too well :) [17:29:20] and likewise for us to improve WANCache without knowing mcrouter and gutterpool details. [17:30:47] Krinkle: some understanding of MW is acceptable on our side, but as you mentioned doxygen is more for devs (rightfully) and less for use cases that we have to keep in mind.. so if we could have a list of things to agree upon it would be a good start :) [17:31:38] source code comments written by one person do not an SLO make :) [17:31:51] Krinkle: for example, another thing that we haven't solved yet is how to handle TKOs for the mw2xxx mcrouter proxies [17:31:57] (03CR) 10BryanDavis: "Puppet is disabled on this host right now due to T254491. I will hand apply the config and do the service restart while that disable is in" [puppet] - 10https://gerrit.wikimedia.org/r/602433 (https://phabricator.wikimedia.org/T253738) (owner: 10BryanDavis) [17:32:09] because from the mcrouters in eqiad, they are like mc shards, and can go TKOL [17:32:12] *TKO [17:32:38] Yes. the same issue applies in eqiad as well. [17:33:09] even if we broadcast the mw-wan subset to all servers all the time, this does us no good for servers that are in TKO. [17:33:36] so after the TKO is over and we switch back, the original memc server will have missed 10 minutes of purges that it will never recover from? 
[17:34:33] Krinkle: if purges are DELETE commands, I think that the mcrouter async log should replay them to the shard that was in TKO, but this needs to be verified [17:34:56] As I understand it the original 2015 proposal was for purge broadcasts to not (ab)use memcached at all but use something separate, e.g. redis pubsub (originally) or kafka (later). It sounds like a Purged-like daemon here might not be so overkill as we originally thought. [17:35:18] elukey: they are not, we don't use DELETE commands, only SETs that place tombstones. [17:35:39] ah then no, they are not replayed [17:35:51] when a value is known to have changed we have to ignore cache sets for 10 seconds (max lag), as we would otherwise end up lazily repopulating the old stale value from a replica DB [17:36:59] if that "downtime" log can be used for all mw-wan/* events (which are only the DELETE-like SETs, nothing else), that'd be cool [17:38:09] we absolutely need to come up with a list of use cases + shared terminology between Performance and SRE [17:38:26] and then work together on tuning the gutter [17:38:30] +1 [17:38:36] and later on the local memc, that rzl is working on [17:39:07] (03PS2) 10BryanDavis: toolsdb: add 3rd temporary filter for replication [puppet] - 10https://gerrit.wikimedia.org/r/602433 (https://phabricator.wikimedia.org/T253738) [17:39:45] (03PS7) 10EBernhardson: query_service: Move shared config into common file [puppet] - 10https://gerrit.wikimedia.org/r/599145 [17:39:47] (03PS10) 10EBernhardson: Consolidate query_service profile duplication [puppet] - 10https://gerrit.wikimedia.org/r/599146 [17:39:49] (03PS3) 10EBernhardson: Revert "Revert "Role for SDoC WDQS"" [puppet] - 10https://gerrit.wikimedia.org/r/602171 [17:39:56] elukey: where should we put the blocking concerns for local memc?
[17:40:26] https://phabricator.wikimedia.org/T244340 [17:40:34] you're cc'd ;) [17:40:36] 10Operations, 10Performance-Team, 10serviceops, 10Sustainability (Incident Prevention): Test gutter pool failover in production and memcached 1.5.x - https://phabricator.wikimedia.org/T240684 (10Krinkle) [17:41:50] (03CR) 10jerkins-bot: [V: 04-1] Consolidate query_service profile duplication [puppet] - 10https://gerrit.wikimedia.org/r/599146 (owner: 10EBernhardson) [17:42:49] (03PS2) 10Aaron Schulz: Enable "coalesceKeys"="non-global" for WANCache on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598854 [17:48:12] 10Operations, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar): Reduce read pressure on memcached servers by adding a machine-local Memcache instance - https://phabricator.wikimedia.org/T244340 (10Krinkle) `name=From IRC I wonder how the backfill logic would work...would it be get... [17:48:26] 10Operations, 10Performance-Team, 10serviceops, 10Patch-For-Review: Reduce read pressure on memcached servers by adding a machine-local Memcache instance - https://phabricator.wikimedia.org/T244340 (10Krinkle) [17:56:04] 10Operations, 10Puppet, 10User-jbond: automated linting/analysis/other CI of Python/shell scripts generated by ERB - https://phabricator.wikimedia.org/T254480 (10jbond) This is a great idea, i think we may be able to do it in a rake task by adding something to task gen, the biggest issues with adding stuff l... [17:57:55] (03CR) 10Ottomata: Default PYSPARK_PYTHON to exact versioned python executable used on driver. 
(031 comment) [debs/spark2] (debian) - 10https://gerrit.wikimedia.org/r/602386 (https://phabricator.wikimedia.org/T229347) (owner: 10Ottomata) [17:57:59] (03PS11) 10EBernhardson: Consolidate query_service profile duplication [puppet] - 10https://gerrit.wikimedia.org/r/599146 [17:58:01] (03PS4) 10EBernhardson: Revert "Revert "Role for SDoC WDQS"" [puppet] - 10https://gerrit.wikimedia.org/r/602171 [17:58:34] 10Operations, 10Puppet, 10User-jbond: automated linting/analysis/other CI of Python/shell scripts generated by ERB - https://phabricator.wikimedia.org/T254480 (10jbond) quick glance suggest most are SC2086 & SC2006 [18:00:04] RoanKattouw, Niharika, and Urbanecm: How many deployers does it take to do Morning SWAT(Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200604T1800). [18:00:04] MatmaRex: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:25] hi [18:01:38] Hi MatmaRex - are you gonna deploy your patch yourself? [18:01:49] no, i can't [18:02:23] Okay. I haven't deployed in a bit so let me see if anything's changed. [18:02:59] :o thanks [18:03:06] (03CR) 10Niharika29: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602408 (https://phabricator.wikimedia.org/T253941) (owner: 10Bartosz Dziewoński) [18:03:36] Krinkle: is it ok to ask to you or Aaron to come up with a list of use cases during the next days? 
[18:03:44] (just to set some actionables) [18:04:05] (03PS5) 10Elukey: Add Turnilo to the staging environment on an-tool1007 [puppet] - 10https://gerrit.wikimedia.org/r/602371 (https://phabricator.wikimedia.org/T253294) [18:04:07] (03PS1) 10Elukey: turnilo: move functionalities to the proxy profile [puppet] - 10https://gerrit.wikimedia.org/r/602440 (https://phabricator.wikimedia.org/T253294) [18:04:09] (03Merged) 10jenkins-bot: Set wmgVisualEditorDisableForAnons to false on enwiki and eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602408 (https://phabricator.wikimedia.org/T253941) (owner: 10Bartosz Dziewoński) [18:04:40] (bbiab, will read later) [18:05:00] MatmaRex: Niharika: I'm around, but Niharika feel free to do the SWAT :) [18:06:04] (03CR) 10jerkins-bot: [V: 04-1] turnilo: move functionalities to the proxy profile [puppet] - 10https://gerrit.wikimedia.org/r/602440 (https://phabricator.wikimedia.org/T253294) (owner: 10Elukey) [18:06:53] MatmaRex: Your change is on mwdebug1001. [18:06:57] Thanks Urbanecm! [18:07:03] I think I got this. [18:07:18] nice :) [18:08:03] looking [18:12:55] MatmaRex: Any luck? [18:13:08] Niharika: sorry, i think we should revert that… it has a different effect on eswiki than i expected and i'm not sure if it's correct [18:13:23] Niharika: is it okay by you if i submit another patch for enwiki only? [18:13:31] Okay. I'll create a revert. Yes, sure. 
[18:13:59] (03PS1) 10Niharika29: Revert "Set wmgVisualEditorDisableForAnons to false on enwiki and eswiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602442 [18:14:49] 10Operations, 10netbox, 10netops, 10Patch-For-Review: Evaluate NetBox as a Racktables replacement & IPAM - https://phabricator.wikimedia.org/T170144 (10ayounsi) [18:15:34] (03CR) 10Niharika29: [C: 03+2] Revert "Set wmgVisualEditorDisableForAnons to false on enwiki and eswiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602442 (owner: 10Niharika29) [18:16:38] (03Merged) 10jenkins-bot: Revert "Set wmgVisualEditorDisableForAnons to false on enwiki and eswiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602442 (owner: 10Niharika29) [18:17:04] (03PS1) 10Bartosz Dziewoński: Set wmgVisualEditorDisableForAnons to false on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602445 (https://phabricator.wikimedia.org/T253941) [18:17:28] Niharika: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/602445 is the corrected version [18:17:46] Gotcha. I'll wait for the CI run. [18:18:07] (03CR) 10Niharika29: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602445 (https://phabricator.wikimedia.org/T253941) (owner: 10Bartosz Dziewoński) [18:18:59] (03Merged) 10jenkins-bot: Set wmgVisualEditorDisableForAnons to false on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602445 (https://phabricator.wikimedia.org/T253941) (owner: 10Bartosz Dziewoński) [18:20:39] MatmaRex: It's back on mwdebug1001. [18:21:09] looking [18:23:36] Niharika: actually, that also doesn't work right. looks like i'm a moron [18:23:55] Revert this too? 
[18:23:57] Niharika: so please revert that one as well, and i'm sorry for wasting your time [18:24:07] i'll document the problems on the task in a minute [18:24:27] (03PS1) 10Niharika29: Revert "Set wmgVisualEditorDisableForAnons to false on enwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602446 [18:26:14] (03CR) 10Niharika29: [C: 03+2] Revert "Set wmgVisualEditorDisableForAnons to false on enwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602446 (owner: 10Niharika29) [18:27:05] (03Merged) 10jenkins-bot: Revert "Set wmgVisualEditorDisableForAnons to false on enwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602446 (owner: 10Niharika29) [18:30:29] Niharika: thank you. fyi, https://phabricator.wikimedia.org/T253941#6193744 [18:33:25] You're welcome MatmaRex. I've cleaned up master. [18:34:51] elukey: cdanis: https://wikitech.wikimedia.org/wiki/Memcached#WANObjectCache (work in progress) [18:36:40] (03CR) 10VulpesVulpes825: [C: 03+1] "LGTM. Let's ride the Evening SWAT of 2020-06-04" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602421 (https://phabricator.wikimedia.org/T254467) (owner: 10Hamish) [18:47:21] 10Operations, 10Puppet, 10User-jbond: automated linting/analysis/other CI of Python/shell scripts generated by ERB - https://phabricator.wikimedia.org/T254480 (10jbond) this is a bit better ` $ find modules -path modules/admin/files/home -prune -o -name \*.sh -exec shellcheck -f gcc {} \; | grep -v note | a... 
[18:47:28] (03CR) 10Herron: [C: 03+1] prometheus: merge ops instance role into profile [puppet] - 10https://gerrit.wikimedia.org/r/602409 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [18:47:42] (03CR) 10Herron: [C: 03+1] prometheus: move global instance to profile [puppet] - 10https://gerrit.wikimedia.org/r/602401 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [18:48:13] (03CR) 10Herron: [C: 03+1] prometheus: move services instance to profile [puppet] - 10https://gerrit.wikimedia.org/r/602398 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [18:55:04] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/602459 [18:57:43] 10Operations, 10Maps, 10Wikimedia-Logstash, 10observability, and 4 others: Move kartotherian/tilerator logging to new logging pipeline - https://phabricator.wikimedia.org/T222377 (10Mholloway) a:03Mholloway [18:59:55] Okie-dokie, train time. [19:00:05] James_F and longma: How many deployers does it take to do Mediawiki train - American Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200604T1900). [19:00:15] yay [19:00:24] (03PS1) 10Jforrester: all wikis to 1.35.0-wmf.35 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602465 [19:00:26] (03CR) 10Jforrester: [C: 03+2] all wikis to 1.35.0-wmf.35 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602465 (owner: 10Jforrester) [19:02:10] (03Merged) 10jenkins-bot: all wikis to 1.35.0-wmf.35 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602465 (owner: 10Jforrester) [19:02:25] (03PS2) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/602459 [19:04:19] Debug looks OK. [19:04:24] !log jforrester@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.35 [19:04:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:30] looks okay on my side [19:04:43] * James_F crosses fingers. 
[19:08:21] OK, we're five minutes in and the world hasn't ended. [19:08:30] I'm going to declare it launched. [19:08:42] 👍 [19:11:37] 10Operations, 10observability, 10serviceops, 10Sustainability (Incident Prevention): add monitoring of sustained memcached TKO rates - https://phabricator.wikimedia.org/T253384 (10CDanis) [19:11:42] @James_F - sorry to rain on the parade [19:11:47] unable to edit https://en.wikipedia.org/wiki/Template:Admin_tasks?action=edit [19:11:52] (03PS3) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/602459 [19:12:13] no submit option, text appears read only [19:12:37] In the 2010 editor? [19:12:45] Confirmed to be reproducible at https://en.wikipedia.org/wiki/Template:Admin_tasks?action=edit&safemode=1 [19:12:48] It works fine for me in both 2010 and 2017 editors. [19:12:51] The new editor [19:12:56] Old editor at https://en.wikipedia.org/wiki/Template:Admin_tasks?action=submit&safemode=1 works [19:13:28] https://www.irccloud.com/pastebin/OdGJkp9u/ [19:13:54] DannyS712: Do you lack the rights to edit it? [19:14:15] It works in my +sysop and +sysadmin accounts. [19:14:25] nope - I created that template and can edit it (based on protection) [19:15:02] hmm, now it appears to work - caching for the new ve module? [19:15:18] I can even load and edit as an IP. [19:15:27] (03PS2) 10Elukey: turnilo: move functionalities to the proxy profile [puppet] - 10https://gerrit.wikimedia.org/r/602440 (https://phabricator.wikimedia.org/T253294) [19:15:29] (03PS6) 10Elukey: Add Turnilo to the staging environment on an-tool1007 [puppet] - 10https://gerrit.wikimedia.org/r/602371 (https://phabricator.wikimedia.org/T253294) [19:15:30] Possibly an RL cache gremlin, yeah. [19:15:42] (03PS1) 10Ryan Kemper: cloudelastic: bring new nodes into service [puppet] - 10https://gerrit.wikimedia.org/r/602469 (https://phabricator.wikimedia.org/T249062) [19:15:53] Oh, right, yes, shouldShowEducationPopups is Roan's new config thing. [19:15:55] Should be working fine. 
[19:16:13] (03PS1) 10Herron: centrallog: update mtail syslog file locations [puppet] - 10https://gerrit.wikimedia.org/r/602470 [19:16:36] okay, confirmed to be able to edit, appears to have fixed itself [19:17:16] OK, so just a transient bug while deploying? [19:17:46] (03PS2) 10Ryan Kemper: cloudelastic: bring new nodes into service [puppet] - 10https://gerrit.wikimedia.org/r/602469 (https://phabricator.wikimedia.org/T249062) [19:18:01] Yeah, but RL cache shouldn't ever be inconsistent. [19:18:12] Yeah that looks like a new version of the main VE module with an old version of the init module [19:18:23] If the VE code module knows to ask for the shouldShowEducationPopups config, the VE init module should be live too. [19:18:39] Sorry, didn't really mean to ping you, Roan. [19:18:46] Yeah. But mid-deploy weirdness happens [19:18:49] No worries [19:18:55] (03PS3) 10BryanDavis: toolsdb: add more temporary filters for replication [puppet] - 10https://gerrit.wikimedia.org/r/602433 (https://phabricator.wikimedia.org/T253738) [19:20:21] (03PS3) 10Ryan Kemper: cloudelastic: bring new nodes into service [puppet] - 10https://gerrit.wikimedia.org/r/602469 (https://phabricator.wikimedia.org/T249062) [19:23:53] 10Operations, 10serviceops, 10PHP 7.2 support, 10PHP 7.3 support, 10Patch-For-Review: PHP 7.2 is very slow on an allocation-intensive benchmark - https://phabricator.wikimedia.org/T230861 (10cscott) [19:29:17] (03PS4) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/602459 [19:34:37] 10Puppet, 10Cloud-VPS, 10cloud-services-team (Kanban): Puppet labs/private.git data loss incident affecting some projects - https://phabricator.wikimedia.org/T254491 (10bd808) `lang=irc [19:21] One investigative tool is: 1) edit crontab to comment out the periodic puppet run 2) enable puppet... 
[19:41:07] (03PS4) 10Ryan Kemper: cloudelastic: bring new nodes into service [puppet] - 10https://gerrit.wikimedia.org/r/602469 (https://phabricator.wikimedia.org/T249062) [19:43:10] (03PS5) 10Ryan Kemper: cloudelastic: bring new nodes into service [puppet] - 10https://gerrit.wikimedia.org/r/602469 (https://phabricator.wikimedia.org/T249062) [19:46:25] (03CR) 10Mholloway: "Hmm. The pattern I see for other services doesn't seem to be only including values that override the default. Maybe best to wait on furthe" [deployment-charts] - 10https://gerrit.wikimedia.org/r/602155 (https://phabricator.wikimedia.org/T218733) (owner: 10Mholloway) [19:47:19] (03CR) 10Gehel: [C: 04-1] "minor comment" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/602469 (https://phabricator.wikimedia.org/T249062) (owner: 10Ryan Kemper) [19:49:59] (03PS6) 10Ryan Kemper: cloudelastic: bring new nodes into service [puppet] - 10https://gerrit.wikimedia.org/r/602469 (https://phabricator.wikimedia.org/T254353) [19:51:43] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/602469 (https://phabricator.wikimedia.org/T254353) (owner: 10Ryan Kemper) [19:52:35] (03PS5) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/602459 [20:15:17] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10serviceops, 10Release-Engineering-Team (Development services): Deploy multi-site plugin to gerrit1001 and gerrit2001 - https://phabricator.wikimedia.org/T217174 (10Jdforrester-WMF) [20:16:53] 10Operations, 10SRE-tools, 10Patch-For-Review: New tool to track package updates/status for hosts and images (debmonitor) - https://phabricator.wikimedia.org/T167504 (10hashar) I have just found out that Debmonitor now also crawls docker-registry.wikimedia.org . An example is the `doxygen` package https://de... 
[20:17:37] (CR) Gehel: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/602469 (https://phabricator.wikimedia.org/T254353) (owner: Ryan Kemper)
[20:20:31] Operations, Gerrit, serviceops, Patch-For-Review: Convert Gerrit to use H2 as the database - https://phabricator.wikimedia.org/T211139 (Paladox) Stalled→Declined Declining as we're going straight to 3.1 so we won't be needing a db from that release.
[20:30:30] Operations, Gerrit, Release-Engineering-Team-TODO, serviceops, Release-Engineering-Team (Development services): Deploy multi-site plugin to gerrit1001 and gerrit2001 - https://phabricator.wikimedia.org/T217174 (Paladox)
[20:30:41] Operations, Gerrit, Release-Engineering-Team-TODO, serviceops, Release-Engineering-Team (Development services): Deploy multi-site plugin to gerrit1001 and gerrit2001 - https://phabricator.wikimedia.org/T217174 (Paladox) p:Medium→Low
[20:34:48] Operations, Gerrit, serviceops, Patch-For-Review: Convert Gerrit to use H2 as the database - https://phabricator.wikimedia.org/T211139 (Paladox)
[20:39:16] !log disabled puppet on `cloudelastic100[5,6]` which are two racked nodes that we are now bringing into service. Will re-enable after successful puppet-merge / elasticsearch cluster join
[20:39:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:42:34] (CR) Ryan Kemper: [C: +2] cloudelastic: bring new nodes into service [puppet] - https://gerrit.wikimedia.org/r/602469 (https://phabricator.wikimedia.org/T254353) (owner: Ryan Kemper)
[20:43:51] (PS1) Cwhite: profile: add loki output support to the logstash pipeline [puppet] - https://gerrit.wikimedia.org/r/602490 (https://phabricator.wikimedia.org/T222826)
[20:45:43] (CR) jerkins-bot: [V: -1] profile: add loki output support to the logstash pipeline [puppet] - https://gerrit.wikimedia.org/r/602490 (https://phabricator.wikimedia.org/T222826) (owner: Cwhite)
[20:47:36] (PS2) Cwhite: profile: add loki output support to the logstash pipeline [puppet] - https://gerrit.wikimedia.org/r/602490 (https://phabricator.wikimedia.org/T222826)
[20:49:23] (CR) jerkins-bot: [V: -1] profile: add loki output support to the logstash pipeline [puppet] - https://gerrit.wikimedia.org/r/602490 (https://phabricator.wikimedia.org/T222826) (owner: Cwhite)
[20:54:28] (PS3) Cwhite: profile: add loki output support to the logstash pipeline [puppet] - https://gerrit.wikimedia.org/r/602490 (https://phabricator.wikimedia.org/T222826)
[20:56:20] (CR) jerkins-bot: [V: -1] profile: add loki output support to the logstash pipeline [puppet] - https://gerrit.wikimedia.org/r/602490 (https://phabricator.wikimedia.org/T222826) (owner: Cwhite)
[20:56:22] !log enabled puppet on `cloudelastic1005` in order to kick off a puppet run and verify that this new node joins the ES cluster properly
[20:56:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:06:59] (PS1) Cmjohnson: Removing asset tag wmf4725 associated w/decom host cobalt [dns] - https://gerrit.wikimedia.org/r/602496 (https://phabricator.wikimedia.org/T236187)
[21:12:52] (CR) Cmjohnson: [C: +2] Removing asset tag wmf4725 associated w/decom host cobalt [dns] - https://gerrit.wikimedia.org/r/602496 (https://phabricator.wikimedia.org/T236187) (owner: Cmjohnson)
[21:12:54] (PS2) Cmjohnson: Removing asset tag wmf4725 associated w/decom host cobalt [dns] - https://gerrit.wikimedia.org/r/602496 (https://phabricator.wikimedia.org/T236187)
[21:15:04] (CR) Cmjohnson: [V: +2 C: +2] Removing asset tag wmf4725 associated w/decom host cobalt [dns] - https://gerrit.wikimedia.org/r/602496 (https://phabricator.wikimedia.org/T236187) (owner: Cmjohnson)
[21:17:59] PROBLEM - Check systemd state on cloudelastic1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:20:24] ^ This is a known problem, `nginx` failed to start following the puppet run due to some SSL issues. I'm setting a maintenance window on `cloudelastic1005`
[21:21:37] PROBLEM - Elasticsearch HTTPS for cloudelastic-psi-eqiad on cloudelastic1005 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search
[21:22:36] I've disabled checks on the host and its corresponding services. Sorry for the noise
[21:27:07] (PS2) Ladsgroup: beta: Add Persian to suggested edits [mediawiki-config] - https://gerrit.wikimedia.org/r/600335 (https://phabricator.wikimedia.org/T253291)
[21:27:20] (CR) Ladsgroup: [C: +2] "noop for production" [mediawiki-config] - https://gerrit.wikimedia.org/r/600335 (https://phabricator.wikimedia.org/T253291) (owner: Ladsgroup)
[21:28:12] (Merged) jenkins-bot: beta: Add Persian to suggested edits [mediawiki-config] - https://gerrit.wikimedia.org/r/600335 (https://phabricator.wikimedia.org/T253291) (owner: Ladsgroup)
[21:28:59] Operations, ops-eqiad, DC-Ops, Patch-For-Review: decom cobalt - https://phabricator.wikimedia.org/T236187 (Cmjohnson)
[21:29:29] Operations, ops-eqiad, DC-Ops, Patch-For-Review: decom cobalt - https://phabricator.wikimedia.org/T236187 (Cmjohnson) Open→Resolved removed from rack, netbox updated, switch updated, removed dns (asset tag)
[21:29:33] Operations, Gerrit, Release-Engineering-Team-TODO, Release-Engineering-Team (Development services): Reimage gerrit1001 and gerrit2001 as buster - https://phabricator.wikimedia.org/T176774 (Cmjohnson)
[21:29:36] Operations, Gerrit, Release-Engineering-Team-TODO, serviceops, and 2 others: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 (Cmjohnson)
[21:40:46] Puppet, Cloud-VPS, cloud-services-team (Kanban): Puppet labs/private.git data loss incident affecting some projects - https://phabricator.wikimedia.org/T254491 (bd808)
[21:47:09] Puppet, Cloud-VPS, cloud-services-team (Kanban): Puppet labs/private.git data loss incident affecting some projects - https://phabricator.wikimedia.org/T254491 (bd808) p:Unbreak!→High Lowering priority from UBN to High. @andrew, @aborrero, @jbond, @hashar, @Krenair, and @bd808 have worked th...
[21:50:58] Operations, Performance-Team, serviceops, Patch-For-Review: Reduce read pressure on memcached servers by adding a machine-local Memcache instance - https://phabricator.wikimedia.org/T244340 (aaron) I suspect that the keys that cause trouble are big text/JSON blobs and ParserOutput objects, all of...
[22:02:51] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:04:13] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 91, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:15:18] (PS1) Jayprakash12345: Set guwiktionary timezone to Asia/Kolkata [mediawiki-config] - https://gerrit.wikimedia.org/r/602513
[22:18:43] (PS2) Jayprakash12345: Set guwiktionary timezone to Asia/Kolkata [mediawiki-config] - https://gerrit.wikimedia.org/r/602513 (https://phabricator.wikimedia.org/T253827)
[22:21:47] (CR) MarcoAurelio: [C: +1] "LGTM." [mediawiki-config] - https://gerrit.wikimedia.org/r/602513 (https://phabricator.wikimedia.org/T253827) (owner: Jayprakash12345)
[22:30:23] (PS1) Ryan Kemper: elasticsearch: need dhparam.pem for nginx ssl [puppet] - https://gerrit.wikimedia.org/r/602520
[22:32:19] (CR) jerkins-bot: [V: -1] elasticsearch: need dhparam.pem for nginx ssl [puppet] - https://gerrit.wikimedia.org/r/602520 (owner: Ryan Kemper)
[22:34:06] (PS2) Ryan Kemper: elasticsearch: need dhparam.pem for nginx ssl [puppet] - https://gerrit.wikimedia.org/r/602520
[22:41:31] (CR) EBernhardson: [C: +1] "Looks reasonable in PCC: https://puppet-compiler.wmflabs.org/compiler1001/23012/" [puppet] - https://gerrit.wikimedia.org/r/602520 (owner: Ryan Kemper)
[22:42:01] (PS1) Jdlrobson: Enable site notices on wikivoyage projects [mediawiki-config] - https://gerrit.wikimedia.org/r/602524 (https://phabricator.wikimedia.org/T254391)
[22:48:41] (PS2) Jdlrobson: Enable site notices on wikivoyage projects [mediawiki-config] - https://gerrit.wikimedia.org/r/602524 (https://phabricator.wikimedia.org/T254391)
[22:49:32] (CR) Ryan Kemper: [C: +2] elasticsearch: need dhparam.pem for nginx ssl [puppet] - https://gerrit.wikimedia.org/r/602520 (owner: Ryan Kemper)
[22:51:59] RECOVERY - Elasticsearch HTTPS for cloudelastic-psi-eqiad on cloudelastic1005 is OK: SSL OK - Certificate cloudelastic.wikimedia.org valid until 2020-09-02 19:55:16 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/Search
[22:56:18] !log re-enabled puppet on `cloudelastic1006`. All `cloudelastic` instances now have puppet enabled and are in sync
[22:56:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:56:35] Puppet, Cloud-VPS, cloud-services-team (Kanban): Puppet labs/private.git data loss incident affecting some projects - https://phabricator.wikimedia.org/T254491 (bd808) a:bd808→None
[23:00:05] RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Evening SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200604T2300).
[23:00:05] VulpesVulpes825, Jayprakash12345, and Jdlrobson: A patch you scheduled for Evening SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:00:21] o/ here
[23:00:40] Here
[23:00:47] Here
[23:02:00] (PS4) Cwhite: profile: add loki output support to the logstash pipeline [puppet] - https://gerrit.wikimedia.org/r/602490 (https://phabricator.wikimedia.org/T222826)
[23:03:52] (CR) jerkins-bot: [V: -1] profile: add loki output support to the logstash pipeline [puppet] - https://gerrit.wikimedia.org/r/602490 (https://phabricator.wikimedia.org/T222826) (owner: Cwhite)
[23:04:07] (CR) Cwhite: [C: +1] centrallog: update mtail syslog file locations [puppet] - https://gerrit.wikimedia.org/r/602470 (owner: Herron)
[23:04:24] (CR) Cwhite: [C: +1] prometheus: move services instance to profile [puppet] - https://gerrit.wikimedia.org/r/602398 (https://phabricator.wikimedia.org/T252186) (owner: Filippo Giunchedi)
[23:04:44] (CR) Cwhite: [C: +1] prometheus: move global instance to profile [puppet] - https://gerrit.wikimedia.org/r/602401 (https://phabricator.wikimedia.org/T252186) (owner: Filippo Giunchedi)
[23:05:44] (CR) Cwhite: [C: +1] prometheus: merge ops instance role into profile [puppet] - https://gerrit.wikimedia.org/r/602409 (https://phabricator.wikimedia.org/T252186) (owner: Filippo Giunchedi)
[23:07:13] RECOVERY - Check systemd state on cloudelastic1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:07:42] I can do the SWAT, sorry for being late
[23:08:27] (CR) Cwhite: "Logstash-filter-verifier will fail because /etc/logstash/filter_scripts is not available. It may need to be mounted on the docker contain" [puppet] - https://gerrit.wikimedia.org/r/602490 (https://phabricator.wikimedia.org/T222826) (owner: Cwhite)
[23:09:21] (CR) Catrope: [C: +2] Change the Traditional Chinese logo for Chinese Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/602421 (https://phabricator.wikimedia.org/T254467) (owner: Hamish)
[23:09:38] (CR) Catrope: [C: +2] Set guwiktionary timezone to Asia/Kolkata [mediawiki-config] - https://gerrit.wikimedia.org/r/602513 (https://phabricator.wikimedia.org/T253827) (owner: Jayprakash12345)
[23:10:08] (Merged) jenkins-bot: Change the Traditional Chinese logo for Chinese Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/602421 (https://phabricator.wikimedia.org/T254467) (owner: Hamish)
[23:10:33] (Merged) jenkins-bot: Set guwiktionary timezone to Asia/Kolkata [mediawiki-config] - https://gerrit.wikimedia.org/r/602513 (https://phabricator.wikimedia.org/T253827) (owner: Jayprakash12345)
[23:10:49] Thank you.
[23:12:50] (PS3) Jdlrobson: Enable site notices on wikivoyage projects [mediawiki-config] - https://gerrit.wikimedia.org/r/602524 (https://phabricator.wikimedia.org/T254391)
[23:13:26] Asia/Kolkata is an interesting timezone definition, I thought the tzinfo library was written a bit more recently than that...
[23:14:17] Argh VulpesVulpes825 left early
[23:14:24] They were supposed to test first
[23:14:32] Jayprakash12345: Your change is on mwdebug1002, please test
[23:14:38] Ok
[23:14:49] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 51 probes of 576 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[23:15:33] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 93, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:16:48] RoanKattouw: was that the timezone?
[23:17:03] !log catrope@deploy1001 Synchronized static/images/: Change logo for zhwiki (T254467) (duration: 01m 00s)
[23:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:17:06] Well Kolkata hasn't been the capital of India for quite a while, is what I mean
[23:17:07] T254467: Change the Traditional Chinese logo of Chinese Wikipedia - https://phabricator.wikimedia.org/T254467
[23:17:18] You'd think the timezone might be called Asia/Delhi
[23:18:00] RoanKattouw It looks good to me. It is showing Indian time.
[23:18:07] RoanKattouw: yep me too
[23:18:07] https://gu.wiktionary.org/w/index.php?title=%E0%AA%B5%E0%AA%BF%E0%AA%B6%E0%AB%87%E0%AA%B7:%E0%AA%AA%E0%AA%B8%E0%AA%82%E0%AA%A6&uselang=en#mw-prefsection-rendering
[23:18:08] Cool, deploying
[23:18:29] RoanKattouw: mind doing https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/598854/ if you are doing all the swat patches?
[23:18:55] RoanKattouw: i can check the zh logo if you want
[23:19:12] I can bug my Singaporean parents who are on the phone right now to check it looks okay
[23:19:16] (PS4) Catrope: Enable site notices on wikivoyage projects [mediawiki-config] - https://gerrit.wikimedia.org/r/602524 (https://phabricator.wikimedia.org/T254391) (owner: Jdlrobson)
[23:19:22] (CR) Catrope: [C: +2] Enable site notices on wikivoyage projects [mediawiki-config] - https://gerrit.wikimedia.org/r/602524 (https://phabricator.wikimedia.org/T254391) (owner: Jdlrobson)
[23:19:35] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Set guwiki timezone to Asia/Kolkata (T253827) (duration: 00m 57s)
[23:19:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:19:39] T253827: Set guwiktionary timezone to Asia/Kolkata - https://phabricator.wikimedia.org/T253827
[23:19:46] Jdlrobson: That would be great. It's already deployed but I still need to do a cache purge
[23:19:50] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 48 probes of 576 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[23:20:06] RoanKattouw: deployed where?
[23:20:10] (Merged) jenkins-bot: Enable site notices on wikivoyage projects [mediawiki-config] - https://gerrit.wikimedia.org/r/602524 (https://phabricator.wikimedia.org/T254391) (owner: Jdlrobson)
[23:20:10] RoanKattouw Thank you :)
[23:20:12] debug1002 or synced etc?
[23:20:56] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 51 probes of 576 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[23:21:51] Jdlrobson: zhwiki production. Purge is done, they should show up now
[23:22:00] (Well it's one logo, but with 1x, 1.5x and 2x versions)
[23:22:39] Jdlrobson: Meanwhile, please also test the site notice wikivoyage patch on mwdebug1002
[23:23:15] (PS3) Catrope: Enable "coalesceKeys"="non-global" for WANCache on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/598854 (owner: Aaron Schulz)
[23:23:21] (CR) Catrope: [C: +2] Enable "coalesceKeys"="non-global" for WANCache on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/598854 (owner: Aaron Schulz)
[23:23:28] RoanKattouw: okay apparently it looks good
[23:23:35] OK cool, thanks
[23:23:38] (zh logo)
[23:24:15] (Merged) jenkins-bot: Enable "coalesceKeys"="non-global" for WANCache on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/598854 (owner: Aaron Schulz)
[23:24:20] RoanKattouw: tx
[23:25:04] OK, once Jdlrobson confirms that the site notice wikivoyage change looks good, I'll sync that one, then pull AaronSchulz's change onto mwdebug1002
[23:26:06] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 49 probes of 576 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[23:26:29] RoanKattouw: testing
[23:27:03] RoanKattouw: oh shoot i can't use debug2 on https://en.m.wikivoyage.beta.wmflabs.org/wiki/Bannerda
[23:27:13] so i think you have to sync this one and i'll test it there shortly after
[23:27:17] (not in production yet)
[23:27:19] beta syncing happens automatically
[23:27:30] ahh
[23:27:31] okay
[23:27:33] just needed a purge
[23:27:35] But I can check whether it's synced there yet, hold on
[23:27:36] good to go then! :)
[23:27:38] Or is it working?
[23:27:44] yep working fine thanks!
[23:27:46] OK great
[23:29:33] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable Minerva site notices on Wikivoyage wikis (T254391) (duration: 00m 58s)
[23:29:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:29:36] T254391: Restore banners to skins (Vector) - https://phabricator.wikimedia.org/T254391
[23:34:18] AaronSchulz: Your patch is next. Does it make sense to test on mwdebug1002 or should I just sync it to all appservers right away?
[23:34:40] thanks RoanKattouw :)
[23:34:41] a little debug time won't hurt
[23:35:31] I'd just poke around a bit with x-wmf-debug (similar to the other deploys)
[23:36:07] OK, it's on mwdebug1002, enjoy. Please ping me by name when you're ready for it to go live
[23:37:44] Operations, Performance-Team, serviceops, Patch-For-Review: Reduce read pressure on memcached servers by adding a machine-local Memcache instance - https://phabricator.wikimedia.org/T244340 (Krinkle) > Some keys are super hot - take for instance `WANCache:v:global:CacheAwarePropertyInfoStore:wiki...
[23:41:10] ok
[23:42:40] RoanKattouw: lgtm
[23:45:09] !log catrope@deploy1001 Synchronized wmf-config/mc.php: Set coalesceKeys=non-global for WANCache on enwiki (duration: 00m 59s)
[23:45:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:46:22] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:48:10] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 8.267 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[23:50:16] (PS1) Bmansurov: Add recommendation-api helmfile stanzas [deployment-charts] - https://gerrit.wikimedia.org/r/602527 (https://phabricator.wikimedia.org/T241230)