[02:05:44] PROBLEM - Check systemd state on logstash1012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:08:06] RECOVERY - Check systemd state on logstash1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:02:32] PROBLEM - mediawiki originals uploads -hourly- for eqiad on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005:9112 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [03:03:02] PROBLEM - mediawiki originals uploads -hourly- for codfw on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005:9112 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [04:01:18] RECOVERY - mediawiki originals uploads -hourly- for eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [04:01:50] RECOVERY - mediawiki originals uploads -hourly- for codfw on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [04:06:25] (03PS1) 10BryanDavis: Introduce jinja2 templating [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/578165 [04:06:27] (03PS1) 10BryanDavis: rebuild_all: Allow overriding python used and additional args [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/578166 [04:06:29] (03PS1) 10BryanDavis: Introduce jinja2 macros [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/578167 [04:06:31] (03PS1) 10BryanDavis: Introduce macros for installing composer and npm [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/578168 [04:24:24] (03PS2) 10BryanDavis: Introduce jinja2 macros [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/578167 [04:24:26] (03PS2) 10BryanDavis: Introduce macros for installing composer and npm [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/578168 [04:27:36] (03PS3) 10BryanDavis: Introduce macros for installing composer and npm [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/578168 [04:41:39] (03CR) 10BryanDavis: kubernetes: Set php7.3 as the default type [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/496564 (owner: 10BryanDavis) [04:52:52] (03CR) 10BryanDavis: Make Kubernetes the default backend and warn when guessing [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/443190 (https://phabricator.wikimedia.org/T154504) (owner: 10Nehajha) [05:34:16] !log restart ats-tls, ats-be and varnish-fe on cp3053 to clean up daemon restart alerts - T247195 [05:34:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:34:22] T247195: OOM killer killed varnihsd cache-main on cp3053 - https://phabricator.wikimedia.org/T247195 [05:37:18] RECOVERY - traffic_server tls process restarted on cp3053 is OK: (C)2 ge (W)2 ge 1 
https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=esams+prometheus/ops&var-instance=cp3053&var-layer=tls [05:37:38] RECOVERY - Varnish frontend child restarted on cp3053 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3053&var-datasource=esams+prometheus/ops [05:38:42] RECOVERY - traffic_server backend process restarted on cp3053 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=esams+prometheus/ops&var-instance=cp3053&var-layer=backend [06:46:28] 10Operations, 10Traffic: OOM killer killed varnihsd cache-main on cp3053 - https://phabricator.wikimedia.org/T247195 (10Vgutierrez) p:05Triage→03Medium [06:51:48] (03CR) 10Giuseppe Lavagetto: [C: 03+2] admin: add a function to my shell [puppet] - 10https://gerrit.wikimedia.org/r/577581 (owner: 10Giuseppe Lavagetto) [06:52:04] (03CR) 10Giuseppe Lavagetto: [C: 03+2] envoy: purge undeclared listeners and clusters definitions [puppet] - 10https://gerrit.wikimedia.org/r/577580 (owner: 10Giuseppe Lavagetto) [06:58:07] (03PS1) 10Marostegui: install_server: Allow reimage db2114 [puppet] - 10https://gerrit.wikimedia.org/r/578172 (https://phabricator.wikimedia.org/T246604) [07:03:54] (03CR) 10Marostegui: [C: 03+2] install_server: Allow reimage db2114 [puppet] - 10https://gerrit.wikimedia.org/r/578172 (https://phabricator.wikimedia.org/T246604) (owner: 10Marostegui) [07:06:30] (03CR) 10Elukey: [C: 03+2] profile::swap: use nodejs 10 by default [puppet] - 10https://gerrit.wikimedia.org/r/577742 (https://phabricator.wikimedia.org/T247055) (owner: 10Elukey) [07:09:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2114 for reimage to buster - T246604', diff saved to 
https://phabricator.wikimedia.org/P10654 and previous config saved to /var/cache/conftool/dbconfig/20200309-070937-marostegui.json [07:09:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:44] T246604: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 [07:10:45] (03PS1) 10Marostegui: install_server: Reimage db2114 to buster [puppet] - 10https://gerrit.wikimedia.org/r/578178 (https://phabricator.wikimedia.org/T246604) [07:12:14] (03CR) 10Marostegui: [C: 03+2] install_server: Reimage db2114 to buster [puppet] - 10https://gerrit.wikimedia.org/r/578178 (https://phabricator.wikimedia.org/T246604) (owner: 10Marostegui) [07:13:34] (03PS13) 10Vgutierrez: ATS: Support TLS Session tickets [puppet] - 10https://gerrit.wikimedia.org/r/573977 (https://phabricator.wikimedia.org/T245616) [07:13:49] !log Stop MySQL on db2114 to upgrade to buster [07:13:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:15] (03CR) 10jerkins-bot: [V: 04-1] ATS: Support TLS Session tickets [puppet] - 10https://gerrit.wikimedia.org/r/573977 (https://phabricator.wikimedia.org/T245616) (owner: 10Vgutierrez) [07:17:42] (03PS1) 10Vgutierrez: ATS: Clean libhwloc5 pin [puppet] - 10https://gerrit.wikimedia.org/r/578179 [07:17:44] (03PS1) 10Vgutierrez: ATS: Remove libhwloc5 pin [puppet] - 10https://gerrit.wikimedia.org/r/578180 [07:24:13] (03PS1) 10Marostegui: Revert "install_server: Allow reimage db2114" [puppet] - 10https://gerrit.wikimedia.org/r/578181 [07:25:21] (03PS1) 10Vgutierrez: ATS: Disable parent proxies globally [puppet] - 10https://gerrit.wikimedia.org/r/578182 (https://phabricator.wikimedia.org/T244464) [07:28:14] (03CR) 10Vgutierrez: "pcc seems healthy: https://puppet-compiler.wmflabs.org/compiler1002/21339/" [puppet] - 10https://gerrit.wikimedia.org/r/578182 (https://phabricator.wikimedia.org/T244464) (owner: 10Vgutierrez) [07:29:28] !log marostegui@cumin1001 START - Cookbook 
sre.hosts.downtime [07:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:55] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [07:31:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:26] (03CR) 10Marostegui: [C: 03+2] Revert "install_server: Allow reimage db2114" [puppet] - 10https://gerrit.wikimedia.org/r/578181 (owner: 10Marostegui) [07:46:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2114 after reimage to buster - T246604', diff saved to https://phabricator.wikimedia.org/P10655 and previous config saved to /var/cache/conftool/dbconfig/20200309-074629-marostegui.json [07:46:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:35] T246604: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 [07:51:38] (03PS1) 10Brian Wolff: Add wikidata.beta.wmflabs.org to beta csp [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578183 [07:58:41] (03PS1) 10Elukey: role::statistics::explore: add profile::swap [puppet] - 10https://gerrit.wikimedia.org/r/578271 (https://phabricator.wikimedia.org/T245179) [08:00:06] (03PS2) 10Elukey: role::statistics::explore: add profile::swap [puppet] - 10https://gerrit.wikimedia.org/r/578271 (https://phabricator.wikimedia.org/T245179) [08:03:36] (03PS1) 10Giuseppe Lavagetto: prometheus: fix file sd path for envoy [puppet] - 10https://gerrit.wikimedia.org/r/578278 [08:05:20] (03PS3) 10Elukey: role::statistics::explore: add profile::swap [puppet] - 10https://gerrit.wikimedia.org/r/578271 (https://phabricator.wikimedia.org/T245179) [08:07:05] (03CR) 10Giuseppe Lavagetto: [C: 03+2] prometheus: fix file sd path for envoy [puppet] - 10https://gerrit.wikimedia.org/r/578278 (owner: 10Giuseppe Lavagetto) [08:11:30] (03PS1) 10Giuseppe Lavagetto: Revert "Revert "ProductionServices: use the local proxy for sessionstore"" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/578279 [08:19:15] <_joe_> jouncebot: next [08:19:16] In 0 hour(s) and 40 minute(s): SRE deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200309T0900) [08:19:17] (03PS1) 10Marostegui: install_server: Allow reimage of db2126 [puppet] - 10https://gerrit.wikimedia.org/r/578280 (https://phabricator.wikimedia.org/T246604) [08:20:42] (03CR) 10Marostegui: [C: 03+2] install_server: Allow reimage of db2126 [puppet] - 10https://gerrit.wikimedia.org/r/578280 (https://phabricator.wikimedia.org/T246604) (owner: 10Marostegui) [08:21:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2126 for reimage to buster - T246604', diff saved to https://phabricator.wikimedia.org/P10656 and previous config saved to /var/cache/conftool/dbconfig/20200309-082118-marostegui.json [08:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:24] T246604: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 [08:23:11] (03PS1) 10Marostegui: install_server: Reimage db2126 to buster [puppet] - 10https://gerrit.wikimedia.org/r/578281 (https://phabricator.wikimedia.org/T246604) [08:25:07] (03CR) 10Marostegui: [C: 03+2] install_server: Reimage db2126 to buster [puppet] - 10https://gerrit.wikimedia.org/r/578281 (https://phabricator.wikimedia.org/T246604) (owner: 10Marostegui) [08:28:01] PROBLEM - MariaDB Slave IO: s2 on db2095 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2126.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2126.codfw.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [08:29:22] me ^ [08:29:58] (03PS1) 10Marostegui: Revert "install_server: Reimage db2126 to buster" [puppet] - 10https://gerrit.wikimedia.org/r/578283 [08:30:15] (03PS1) 10Marostegui: Revert "install_server: 
Allow reimage of db2126" [puppet] - 10https://gerrit.wikimedia.org/r/578284 [08:31:19] (03CR) 10Marostegui: [C: 03+2] Revert "install_server: Reimage db2126 to buster" [puppet] - 10https://gerrit.wikimedia.org/r/578283 (owner: 10Marostegui) [08:31:36] (03CR) 10Marostegui: [C: 03+2] Revert "install_server: Allow reimage of db2126" [puppet] - 10https://gerrit.wikimedia.org/r/578284 (owner: 10Marostegui) [08:32:39] RECOVERY - MariaDB Slave IO: s2 on db2095 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [08:35:44] (03PS4) 10Elukey: role::statistics::explore: add profile::swap [puppet] - 10https://gerrit.wikimedia.org/r/578271 (https://phabricator.wikimedia.org/T245179) [08:36:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2126', diff saved to https://phabricator.wikimedia.org/P10657 and previous config saved to /var/cache/conftool/dbconfig/20200309-083653-marostegui.json [08:36:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:58] (03PS1) 10Marostegui: install_server: Allow reimage of db2125 [puppet] - 10https://gerrit.wikimedia.org/r/578285 (https://phabricator.wikimedia.org/T246604) [08:37:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2125 for reimage to buster - T246604', diff saved to https://phabricator.wikimedia.org/P10658 and previous config saved to /var/cache/conftool/dbconfig/20200309-083711-marostegui.json [08:37:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:16] T246604: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 [08:38:20] (03CR) 10Marostegui: [C: 03+2] install_server: Allow reimage of db2125 [puppet] - 10https://gerrit.wikimedia.org/r/578285 (https://phabricator.wikimedia.org/T246604) (owner: 10Marostegui) [08:40:15] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1003/21344/" [puppet] - 
10https://gerrit.wikimedia.org/r/578271 (https://phabricator.wikimedia.org/T245179) (owner: 10Elukey) [08:40:37] (03PS1) 10Marostegui: install_server: Reimage db2125 to buster [puppet] - 10https://gerrit.wikimedia.org/r/578287 (https://phabricator.wikimedia.org/T246604) [08:42:35] (03CR) 10Marostegui: [C: 03+2] install_server: Reimage db2125 to buster [puppet] - 10https://gerrit.wikimedia.org/r/578287 (https://phabricator.wikimedia.org/T246604) (owner: 10Marostegui) [08:48:50] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/569684 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [08:59:17] (03PS2) 10Gehel: cirrus: initial configuration of elastic20[55-60] [puppet] - 10https://gerrit.wikimedia.org/r/577250 (https://phabricator.wikimedia.org/T246975) [09:00:04] _joe_: #bothumor My software never has bugs. It just develops random features. Rise for SRE deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200309T0900). 
[09:00:42] (03CR) 10Gehel: [C: 03+2] cirrus: initial configuration of elastic20[55-60] [puppet] - 10https://gerrit.wikimedia.org/r/577250 (https://phabricator.wikimedia.org/T246975) (owner: 10Gehel) [09:00:42] <_joe_> helloo [09:01:32] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Revert "Revert "ProductionServices: use the local proxy for sessionstore"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578279 (owner: 10Giuseppe Lavagetto) [09:02:32] (03Merged) 10jenkins-bot: Revert "Revert "ProductionServices: use the local proxy for sessionstore"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578279 (owner: 10Giuseppe Lavagetto) [09:03:55] (03CR) 10Elukey: [C: 03+2] role::statistics::explore: add profile::swap [puppet] - 10https://gerrit.wikimedia.org/r/578271 (https://phabricator.wikimedia.org/T245179) (owner: 10Elukey) [09:04:55] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime [09:04:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:24] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:15] !log oblivian@deploy1001 Synchronized wmf-config/ProductionServices.php: switch sessionstore to use envoy (duration: 01m 00s) [09:08:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:31] (03CR) 10Muehlenhoff: "Squid on the new systems doesn't seem to work yet? 
"http_proxy=install1002.wikimedia.org wget http://apt.wikimedia.org/wikimedia/pool/thir" [dns] - 10https://gerrit.wikimedia.org/r/569680 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [09:12:39] PROBLEM - Sessionstore eqiad on sessionstore.svc.eqiad.wmnet is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None)) after connection broken by NewConnectionError(urllib3.connection.VerifiedHTTPSConnection object at 0x7fe8e0211198: Failed to establish a new connection: [Errno 111] Connection refused,): /openapi https://www.mediawiki.org/wiki/Kask [09:13:30] <_joe_> sigh [09:13:32] <_joe_> ok [09:13:35] <_joe_> rollback [09:14:33] RECOVERY - Sessionstore eqiad on sessionstore.svc.eqiad.wmnet is OK: WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None)) after connection broken by ConnectTimeoutError(urllib3.connection.VerifiedHTTPSConnection object at 0x7fb0c39c7198, Connection to sessionstore.svc.eqiad.wmnet timed out. 
(connect timeout=15)): /openapi https://www.mediawiki.org/wiki/Kask [09:14:52] (03PS1) 10Gehel: elasticsearch: add racking info for new servers elastic20[55-60] [puppet] - 10https://gerrit.wikimedia.org/r/578292 (https://phabricator.wikimedia.org/T246975) [09:14:55] !log oblivian@deploy1001 Synchronized wmf-config/ProductionServices.php: revert switch sessionstore to use envoy (duration: 00m 58s) [09:14:57] <_joe_> I guess my timeout was too aggressive there [09:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:10] <_joe_> let's see if I'm right [09:15:11] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/577701 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [09:17:52] (03PS1) 10Giuseppe Lavagetto: services_proxy: make timeouts larger for now [puppet] - 10https://gerrit.wikimedia.org/r/578294 [09:18:01] <_joe_> I'm waiting before doing a full rollback for now [09:18:25] <_joe_> but I will do a gerrit revert and abort the deployment unless I confirm my fix does fix the situation [09:19:16] (03CR) 10Giuseppe Lavagetto: [C: 03+2] services_proxy: make timeouts larger for now [puppet] - 10https://gerrit.wikimedia.org/r/578294 (owner: 10Giuseppe Lavagetto) [09:22:35] (03CR) 10DCausse: [C: 03+1] elasticsearch: add racking info for new servers elastic20[55-60] [puppet] - 10https://gerrit.wikimedia.org/r/578292 (https://phabricator.wikimedia.org/T246975) (owner: 10Gehel) [09:23:30] (03CR) 10Elukey: [C: 03+1] elasticsearch: add racking info for new servers elastic20[55-60] [puppet] - 10https://gerrit.wikimedia.org/r/578292 (https://phabricator.wikimedia.org/T246975) (owner: 10Gehel) [09:23:56] (03CR) 10Gehel: [C: 03+2] elasticsearch: add racking info for new servers elastic20[55-60] [puppet] - 10https://gerrit.wikimedia.org/r/578292 (https://phabricator.wikimedia.org/T246975) (owner: 10Gehel) [09:24:45] (03PS2) 10Marostegui: db-eqiad,db-codfw.php: Add es5 as new ES, for initial 
testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/577185 (https://phabricator.wikimedia.org/T246072) [09:28:23] <_joe_> jouncebot: next [09:28:24] In 1 hour(s) and 1 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200309T1030) [09:28:29] <_joe_> ok, I have time [09:31:59] 10Operations, 10MediaWiki-ResourceLoader, 10Performance-Team, 10Traffic, and 2 others: load.php?modules=startup miss rate tripled on 2020-02-05 - https://phabricator.wikimedia.org/T247020 (10ema) >>! In T247020#5948063, @ema wrote: > I suspect this is due to the fact that we are unsetting `Accept-Encoding... [09:37:52] <_joe_> ok I'm trying to redeploy [09:39:45] !log oblivian@deploy1001 Synchronized wmf-config/ProductionServices.php: re-try: switch sessionstore to use envoy (duration: 00m 58s) [09:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:06] (03CR) 10Vgutierrez: [C: 03+1] ATS: unset client req Accept-Encoding on ats-be [puppet] - 10https://gerrit.wikimedia.org/r/577551 (https://phabricator.wikimedia.org/T247020) (owner: 10Ema) [09:47:08] <_joe_> sigh [09:47:13] <_joe_> it's happening again apparently [09:47:31] <_joe_> yeah lemme revert one last time [09:48:38] !log oblivian@deploy1001 Synchronized wmf-config/ProductionServices.php: re-revert: switch sessionstore to use envoy (duration: 00m 35s) [09:48:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:38] (03PS1) 10Giuseppe Lavagetto: Revert "Revert "Revert "ProductionServices: use the local proxy for sessionstore""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578296 [09:50:49] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Revert "Revert "Revert "ProductionServices: use the local proxy for sessionstore""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578296 (owner: 10Giuseppe Lavagetto) [09:51:04] <_joe_> this is just reproducing the current situation in production [09:51:23] !log pooling 
new elastic20[55-60] servers - T246975 [09:51:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:29] T246975: service implementation for elastic20{55,56,57,58,59,60}.codfw.wmnet - https://phabricator.wikimedia.org/T246975 [09:51:47] (03Merged) 10jenkins-bot: Revert "Revert "Revert "ProductionServices: use the local proxy for sessionstore""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578296 (owner: 10Giuseppe Lavagetto) [09:56:03] (03CR) 10Ema: [C: 03+1] ATS: Disable parent proxies globally [puppet] - 10https://gerrit.wikimedia.org/r/578182 (https://phabricator.wikimedia.org/T244464) (owner: 10Vgutierrez) [09:58:46] (03CR) 10Gehel: [C: 03+1] "Nice! Looks like we forgot that one when moving to multi-cluster." [puppet] - 10https://gerrit.wikimedia.org/r/577746 (owner: 10Elukey) [09:58:51] (03CR) 10Ema: [C: 03+2] ATS: unset client req Accept-Encoding on ats-be [puppet] - 10https://gerrit.wikimedia.org/r/577551 (https://phabricator.wikimedia.org/T247020) (owner: 10Ema) [10:00:01] !log installing php5 security updates [10:00:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:55] (03CR) 10Elukey: [C: 03+2] icinga::monitor::elasticsearch::old_jvm_gc_checks: fix grafana URL [puppet] - 10https://gerrit.wikimedia.org/r/577746 (owner: 10Elukey) [10:02:55] (03CR) 10Vgutierrez: [C: 03+2] ATS: Disable parent proxies globally [puppet] - 10https://gerrit.wikimedia.org/r/578182 (https://phabricator.wikimedia.org/T244464) (owner: 10Vgutierrez) [10:04:55] !log disable parent proxies globally on ats-tls - T244464 [10:04:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:00] T244464: Investigate side-effects of enabling KA between ats-tls and varnish-fe - https://phabricator.wikimedia.org/T244464 [10:08:12] (03CR) 10Jbond: [C: 03+2] systemd::syslog: ensure log dir is removed if resource is absent [puppet] - 10https://gerrit.wikimedia.org/r/576364 
(https://phabricator.wikimedia.org/T242910) (owner: 10Jbond) [10:12:11] !log installing openjdk-7 security updates [10:12:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:16] 10Operations, 10DBA, 10OTRS, 10Recommendation-API, 10Research: Upgrade and restart m2 primary database master (db1132) - https://phabricator.wikimedia.org/T246098 (10akosiaris) As far as OTRS goes I can be around and help with restarts/verifying behavior and all that jazz. Pick dates that suit you and le... [10:17:50] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] toolforge-clush: correct the classifications and remove legacy k8s [puppet] - 10https://gerrit.wikimedia.org/r/577279 (https://phabricator.wikimedia.org/T246689) (owner: 10Bstorm) [10:18:59] (03PS1) 10Alexandros Kosiaris: sessionstore: Bump memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/578301 (https://phabricator.wikimedia.org/T244843) [10:19:05] (03Abandoned) 10Giuseppe Lavagetto: depool eqsin [dns] - 10https://gerrit.wikimedia.org/r/576621 (owner: 10Giuseppe Lavagetto) [10:19:14] 10Operations, 10DBA, 10OTRS, 10Recommendation-API, 10Research: Upgrade and restart m2 primary database master (db1132) - https://phabricator.wikimedia.org/T246098 (10Marostegui) >>! In T246098#5953169, @akosiaris wrote: > As far as OTRS goes I can be around and help with restarts/verifying behavior and a... [10:20:47] (03CR) 10Alexandros Kosiaris: [C: 03+2] sessionstore: Bump memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/578301 (https://phabricator.wikimedia.org/T244843) (owner: 10Alexandros Kosiaris) [10:20:51] 10Operations, 10MediaWiki-ResourceLoader, 10Performance-Team, 10Traffic, 10Wikimedia-Incident: load.php?modules=startup miss rate tripled on 2020-02-05 - https://phabricator.wikimedia.org/T247020 (10ema) 05Open→03Resolved a:03ema The hitrate is now recovering after applying the patch: {F31672603}... 
[10:21:05] (03Merged) 10jenkins-bot: sessionstore: Bump memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/578301 (https://phabricator.wikimedia.org/T244843) (owner: 10Alexandros Kosiaris) [10:21:19] (03CR) 10Giuseppe Lavagetto: [C: 03+1] sessionstore: Bump memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/578301 (https://phabricator.wikimedia.org/T244843) (owner: 10Alexandros Kosiaris) [10:24:19] <_joe_> akosiaris: should I deploy? [10:24:21] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM, but typo" (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/577575 (https://phabricator.wikimedia.org/T246887) (owner: 10Ayounsi) [10:24:38] _joe_: I am right now [10:24:43] <_joe_> ack [10:24:50] actually [10:24:50] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] Add cloud-out4 firewall filter [homer/public] - 10https://gerrit.wikimedia.org/r/577575 (https://phabricator.wikimedia.org/T246887) (owner: 10Ayounsi) [10:24:52] wanna do it? [10:25:06] it feels weird that I do all of that, and it's probably better that others start doing them [10:25:17] it's merged and ready for helmfile apply [10:25:20] <_joe_> oh sure I'm used to though [10:26:27] !log oblivian@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'sessionstore' for release 'production' . [10:26:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:32] 10Operations, 10Traffic, 10Patch-For-Review: Investigate side-effects of enabling KA between ats-tls and varnish-fe - https://phabricator.wikimedia.org/T244464 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez KA between ats-tls and varnish-fe is working successfully and enabled globally in the caching c... 
[10:26:34] 10Operations, 10Traffic: ulsfo varnish-fe vcache processes overflow on FDs - https://phabricator.wikimedia.org/T243634 (10Vgutierrez) [10:28:26] <_joe_> akosiaris: so it's deployed [10:28:38] (03PS1) 10Elukey: jupyterhub: refactor user authentication for posix groups [puppet] - 10https://gerrit.wikimedia.org/r/578303 (https://phabricator.wikimedia.org/T245179) [10:29:16] _joe_: cool. So.. wanna try again ? [10:29:30] <_joe_> but we can't do more now as people now need to do deployments AIUI [10:29:44] <_joe_> jouncebot: next [10:29:44] In 0 hour(s) and 0 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200309T1030) [10:29:46] <_joe_> heh [10:30:00] <_joe_> so let's wait for that to be over [10:30:04] jan_drewniak: That opportune time is upon us again. Time for a Wikimedia Portals Update deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200309T1030). [10:32:06] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] keystone: Backport the Rocky version of ldap integration [puppet] - 10https://gerrit.wikimedia.org/r/577602 (https://phabricator.wikimedia.org/T247050) (owner: 10Andrew Bogott) [10:32:08] !log install spamassassin security updates on mendelevium/ticket.wikimedia.org [10:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:13] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Queens keystone: add a hack for utf8 decoding in already-hacked ldap handler [puppet] - 10https://gerrit.wikimedia.org/r/577669 (https://phabricator.wikimedia.org/T247050) (owner: 10Andrew Bogott) [10:32:15] (03CR) 10Elukey: [C: 03+2] jupyterhub: refactor user authentication for posix groups [puppet] - 10https://gerrit.wikimedia.org/r/578303 (https://phabricator.wikimedia.org/T245179) (owner: 10Elukey) [10:33:38] <_joe_> jan_drewniak: are you deploying anything? 
[10:34:17] !log install spamassassin security updates on fermium/lists.wikimedia.org [10:34:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:44] _joe_: ah! daylight saving caught me by surprise [10:35:34] PROBLEM - traffic_server backend process restarted on cp4032 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=ulsfo+prometheus/ops&var-instance=cp4032&var-layer=backend [10:35:44] I'm preparing the patch, so yeah, I guess I will in 5-10 minutes [10:35:56] uh.... checking that [10:36:36] <_joe_> jan_drewniak: ok thanks [10:36:52] <_joe_> akosiaris: so debugging will happen in the afternoon I guess [10:37:04] <_joe_> because after this comes SWAT [10:40:48] :( [10:41:57] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578305 (https://phabricator.wikimedia.org/T128546) [10:42:50] _joe_: looking at https://wikitech.wikimedia.org/wiki/Deployments [10:43:00] there is something weird going on timezone wise [10:43:15] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578305 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:43:16] nope, I take that back [10:44:10] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578305 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:44:29] 10Operations, 10Traffic: lua related crash on ats-be @ cp4032 - https://phabricator.wikimedia.org/T247232 (10Vgutierrez) [10:44:57] 10Operations, 10Traffic: lua related crash on ats-be @ cp4032 - https://phabricator.wikimedia.org/T247232 (10Vgutierrez) p:05Triage→03High [10:45:54] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:578305| Bumping portals to master (563985)]] (duration: 00m 58s) 
[10:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:53] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:578305| Bumping portals to master (563985)]] (duration: 00m 58s) [10:46:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:59] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:50:27] 10Operations, 10Epic, 10Maps (Kartotherian), 10Patch-For-Review: Move Kartotherian and Tilerator to Kubernetes - https://phabricator.wikimedia.org/T216826 (10akosiaris) [10:54:22] (03PS2) 10KartikMistry: apertium: Update dependency and fix conflict [debs/contenttranslation/apertium] - 10https://gerrit.wikimedia.org/r/577861 [10:58:40] !log upload pystemd 0.7.0-1wm1 to apt.wm.o (buster) - T245616 [10:58:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:45] T245616: Provide a simple and automated SSL Ticket key generation system for ATS - https://phabricator.wikimedia.org/T245616 [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: May I have your attention please! European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200309T1100) [11:00:04] Jhs and samwilson: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:08] * Jhs is here [11:01:17] I'm here too [11:01:35] o/ I guess I can do it [11:01:50] o/ [11:02:01] samwilson1: does beta cluster have the tables? 
[11:02:39] yep, that happened a couple of weeks ago [11:03:15] (03CR) 10Ladsgroup: [C: 03+2] Enable watchlist expiry on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/577033 (https://phabricator.wikimedia.org/T246849) (owner: 10Samwilson) [11:03:41] samwilson1: okay, it's merged, it'll go live automatically in (at most) half an hour. It's out of my control [11:03:55] Amir1: thanks! [11:04:06] (03PS2) 10Ladsgroup: Add `fkv` Kven to $wmgExtraLanguageNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/577671 (https://phabricator.wikimedia.org/T167259) (owner: 10Jon Harald Søby) [11:04:10] (03CR) 10Ladsgroup: [C: 03+2] Add `fkv` Kven to $wmgExtraLanguageNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/577671 (https://phabricator.wikimedia.org/T167259) (owner: 10Jon Harald Søby) [11:04:18] (03Merged) 10jenkins-bot: Enable watchlist expiry on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/577033 (https://phabricator.wikimedia.org/T246849) (owner: 10Samwilson) [11:05:14] (03Merged) 10jenkins-bot: Add `fkv` Kven to $wmgExtraLanguageNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/577671 (https://phabricator.wikimedia.org/T167259) (owner: 10Jon Harald Søby) [11:05:54] Jhs: it's live in mwdebug1001 [11:06:43] Amir1, works like a charm 👍 https://www.wikidata.org/w/index.php?title=Q4115189&type=revision&diff=1132631489&oldid=1132006182 [11:07:10] coool [11:07:12] deploying [11:08:44] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:577671|Add `fkv` Kven to $wmgExtraLanguageNames (T167259)]] (duration: 00m 59s) [11:08:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:51] T167259: Add [fkv] Kven to $wmgExtraLanguageNames for Wikidata - https://phabricator.wikimedia.org/T167259 [11:09:51] (03PS14) 10Vgutierrez: ATS: Support TLS Session tickets [puppet] - 10https://gerrit.wikimedia.org/r/573977 (https://phabricator.wikimedia.org/T245616) 
[11:09:56] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:577671|Add `fkv` Kven to $wmgExtraLanguageNames (T167259)]], take II (duration: 00m 57s) [11:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:34] !log EU SWAT is done [11:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:48] (03CR) 10jerkins-bot: [V: 04-1] ATS: Support TLS Session tickets [puppet] - 10https://gerrit.wikimedia.org/r/573977 (https://phabricator.wikimedia.org/T245616) (owner: 10Vgutierrez) [11:17:21] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:19:12] <_joe_> ok akosiaris let's run a test? I can just do a further re-revert or just push the change for a few minutes [11:19:17] <_joe_> I'd do the latter tbh [11:21:00] (03PS15) 10Vgutierrez: ATS: Support TLS Session tickets [puppet] - 10https://gerrit.wikimedia.org/r/573977 (https://phabricator.wikimedia.org/T245616) [11:22:09] _joe_: sure [11:22:45] <_joe_> ok let's go then [11:22:50] <_joe_> Amir1: you're done right? [11:23:12] _joe_: yup [11:23:57] <_joe_> akosiaris: ok so let's run this test [11:24:10] <_joe_> let's say we run it for 10 minutes tops? 
[11:24:35] yeah, why not [11:25:09] push the change for 10mins, revert and let's evaluate [11:25:15] <_joe_> yep [11:25:21] <_joe_> the change is being deployed [11:25:36] !log oblivian@deploy1001 Synchronized wmf-config/ProductionServices.php: test: switch sessionstore to use envoy again (duration: 00m 57s) [11:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:21] <_joe_> memory usage is spiking again [11:27:15] <_joe_> yes, keeps going up [11:28:20] <_joe_> interestingly cpu usage is going down [11:28:47] (03PS1) 10Elukey: jupyterhub: fix link to frozen-requirements.txt for user creation [puppet] - 10https://gerrit.wikimedia.org/r/578311 (https://phabricator.wikimedia.org/T245179) [11:29:02] (03PS2) 10Muehlenhoff: Enable the sso endpoint [puppet] - 10https://gerrit.wikimedia.org/r/577571 [11:29:03] that... [11:29:07] going down? [11:29:11] how? [11:29:12] <_joe_> akosiaris: that seems to confirm my theory [11:29:18] <_joe_> we do less tls negotiations [11:29:21] aaahhh [11:29:25] yes that makes sense [11:29:30] (03CR) 10jerkins-bot: [V: 04-1] Enable the sso endpoint [puppet] - 10https://gerrit.wikimedia.org/r/577571 (owner: 10Muehlenhoff) [11:29:33] <_joe_> network's going down a lot too [11:29:44] memory is closer to the old limits [11:29:50] <_joe_> and look at the latency buckets [11:30:00] so there is a case that kask was being starved of memory before? [11:30:06] just barely perhaps ... [11:30:14] <_joe_> well no memory usage is at 350M [11:30:26] that's total for all pods [11:30:29] pods are 4 btw [11:30:31] <_joe_> oh right [11:30:38] and limit was 100Mi [11:30:41] <_joe_> so yes [11:30:48] <_joe_> probably it can get over 100Mi [11:30:59] just barely I guess, but enough to cause the issues? [11:31:04] <_joe_> now the point is understanding if this will just keep growing [11:31:07] another 4 minutes in [11:31:09] <_joe_> akosiaris: very possible [11:31:18] maybe we should bump it to 15m of test? 
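[editor's note] The memory figures quoted above (about 350M in total, 4 pods, a 100Mi limit) can be sanity-checked with a few lines of arithmetic. This is only a sketch under assumptions that the log does not state: that the load is spread evenly across the pods, that "350M" means 350 * 10^6 bytes, and that the 100Mi limit applies per pod. The function names are hypothetical.

```python
# Rough per-pod memory check for the figures quoted in the discussion.
# Assumptions (not confirmed by the log): even split across pods,
# "350M" = 350 * 10**6 bytes, and the 100Mi limit is per pod.

MIB = 1024 ** 2


def per_pod_usage_bytes(total_bytes: float, pod_count: int) -> float:
    """Average memory per pod, assuming an even split."""
    if pod_count <= 0:
        raise ValueError("pod_count must be positive")
    return total_bytes / pod_count


def headroom_fraction(usage_bytes: float, limit_bytes: float) -> float:
    """Fraction of the limit still unused (negative means over the limit)."""
    return 1.0 - usage_bytes / limit_bytes


total = 350 * 10**6   # total across all pods, as read off the dashboard
pods = 4
limit = 100 * MIB     # per-pod limit mentioned in the chat

avg = per_pod_usage_bytes(total, pods)
headroom = headroom_fraction(avg, limit)

print(f"average per pod: {avg / MIB:.1f} MiB of a {limit // MIB} MiB limit")
print(f"headroom: {headroom:.1%}")
```

Under these assumptions the average pod sits at roughly 83 MiB, so an uneven split or a short burst could plausibly push a single pod past 100Mi, which fits the "just barely" reading in the conversation.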
[11:31:24] <_joe_> yeah [11:31:41] ok [11:32:24] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/21345/" [puppet] - 10https://gerrit.wikimedia.org/r/578311 (https://phabricator.wikimedia.org/T245179) (owner: 10Elukey) [11:32:50] (03CR) 10Hnowlan: "> See also these two puppet patches, which seem to be doing a similar thing. Merge/close as needed?" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/576301 (https://phabricator.wikimedia.org/T243096) (owner: 10Hnowlan) [11:34:21] _joe_: https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=now-30m&to=now&fullscreen&panelId=44 [11:34:29] look at the percentages of fast requests [11:34:32] (03PS3) 10Muehlenhoff: Enable the sso endpoint [puppet] - 10https://gerrit.wikimedia.org/r/577571 [11:34:34] <_joe_> sigh [11:34:47] it just increased from 21% to 26% [11:34:53] <_joe_> we gained a 5% of fast requests just like that [11:34:58] yes! [11:34:59] <_joe_> let [11:35:07] <_joe_> s see what happens with sessionstore [11:35:46] we are at the 10m mark [11:35:55] <_joe_> let's go another 5 [11:36:00] memory usage is reaching 400MB but it looks ok still [11:36:06] in fact everything looks better than before [11:36:27] network is done, CPU is down, latencies are down [11:36:31] I mean... nice! 
[11:36:43] <_joe_> yeah, apart from memory that seems to keep growing [11:36:51] <_joe_> slowly but surely [11:37:02] it should plateau at some point [11:37:05] <_joe_> so I want to keep the test going to see if there is some memleak [11:37:11] <_joe_> heh, not sure about that [11:37:33] if it doesn't it's a memleak for sure [11:37:41] (03PS1) 10Jbond: puppetdb: monitor agent runs [puppet] - 10https://gerrit.wikimedia.org/r/578312 (https://phabricator.wikimedia.org/T242910) [11:37:49] <_joe_> we have some detailed stats about memory IIRC [11:37:58] (03PS1) 10Elukey: jupyterhub: fix reference to distro variable in jupyterhub_config.py [puppet] - 10https://gerrit.wikimedia.org/r/578313 (https://phabricator.wikimedia.org/T245179) [11:38:04] but it's logarithmic curve up to now ? [11:38:21] _joe_: yes, it's a go app, we probably have very fine grained ones [11:38:32] I even have generic go dashboards IIRC [11:38:47] never though I would use them for sessionstore, but I can try and see [11:40:03] 15m mark and I still don't feel bad about the change. Let's say we extend it to 30m ? [11:40:30] <_joe_> yes [11:40:34] (03CR) 10Elukey: [C: 03+2] jupyterhub: fix reference to distro variable in jupyterhub_config.py [puppet] - 10https://gerrit.wikimedia.org/r/578313 (https://phabricator.wikimedia.org/T245179) (owner: 10Elukey) [11:40:35] if memory keeps increase (it's starting to become linear btw) I 'll copy paste dashboards for go mem internals [11:40:41] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:40:42] lemme make sure kask exports them [11:40:54] <_joe_> akosiaris: it does export a few [11:40:59] stat1004 is me [11:41:13] <_joe_> curl -k https://10.64.64.171:8081/metrics will tell you which [11:41:46] yup it's the golang standard ones, great [11:41:52] * akosiaris adding graphs then [11:42:36] aye that's the prometheus golang client exporting those iirc [11:42:43] <_joe_> I'm looking at the number of objects in the heap [11:42:52] (03PS2) 10Jbond: puppetdb: monitor agent runs [puppet] - 10https://gerrit.wikimedia.org/r/578312 (https://phabricator.wikimedia.org/T242910) [11:42:53] <_joe_> and it's not running away on any server [11:44:39] (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/21349/" [puppet] - 10https://gerrit.wikimedia.org/r/578312 (https://phabricator.wikimedia.org/T242910) (owner: 10Jbond) [11:47:02] (03CR) 10Jbond: [C: 03+1] Enable the sso endpoint [puppet] - 10https://gerrit.wikimedia.org/r/577571 (owner: 10Muehlenhoff) [11:47:23] (03PS1) 10Filippo Giunchedi: logstash: adjust client error topics for eqiad/codfw [puppet] - 10https://gerrit.wikimedia.org/r/578315 (https://phabricator.wikimedia.org/T226986) [11:47:39] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:48:08] <_joe_> akosiaris: I'm tempted to persist the test at this point [11:50:13] _joe_: me too [11:50:29] give me 5 mins to finish the golang dashboard row and let's see if we get something [11:50:42] <_joe_> roger [11:50:54] <_joe_> I will anyways prepare the re-re-re-revert [11:53:56] 10Operations, 10Page Content Service, 10Wikimedia-Logstash, 10observability, and 4 others: Move mobileapps logging to new logging pipeline - https://phabricator.wikimedia.org/T219924 (10fgiunchedi) 05Open→03Stalled >>! 
In T219924#5949120, @Mholloway wrote: > @fgiunchedi This should be finished by April... [11:54:02] 10Operations, 10Wikimedia-Logstash, 10observability, 10service-runner, and 2 others: Move service-runner to new logging infrastructure - https://phabricator.wikimedia.org/T211125 (10fgiunchedi) [11:54:43] 10Operations, 10Proton, 10Wikimedia-Logstash, 10observability, and 4 others: Move proton logging to new logging pipeline - https://phabricator.wikimedia.org/T219925 (10fgiunchedi) 05Open→03Stalled Stalling since we'll be piggybacking on Proton (and mobileapps) moving to k8s, and thus the logging pipeli... [11:54:48] 10Operations, 10Wikimedia-Logstash, 10observability, 10service-runner, and 2 others: Move service-runner to new logging infrastructure - https://phabricator.wikimedia.org/T211125 (10fgiunchedi) [11:56:25] _joe_: https://grafana.wikimedia.org/d/000001590/sessionstore?orgId=1&from=now-1h&to=now [11:56:28] (03CR) 10Muehlenhoff: [C: 03+2] Enable the sso endpoint [puppet] - 10https://gerrit.wikimedia.org/r/577571 (owner: 10Muehlenhoff) [11:56:30] (03PS1) 10Giuseppe Lavagetto: ProductionServices: use the local proxy for sessionstore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578316 [11:56:31] last row is "golang internals" [11:56:57] memory is indeed plateauing now btw [11:57:00] at the 500MB mark [11:57:01] <_joe_> akosiaris: yes [11:57:17] <_joe_> so apparently look at the number of frees and mallocs [11:57:18] look at the "memory use" graph [11:57:22] <_joe_> it more than halved [11:57:48] heap jumped to 500MB total (across all pods) and it's now stable [11:57:57] dammit I love golang and prometheus [11:58:02] <_joe_> akosiaris: https://gerrit.wikimedia.org/r/578316 care to +1? 
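[editor's note] The comparison of mallocs, frees and heap figures discussed above can be approximated offline from the text served by a /metrics endpoint. The sketch below assumes the usual `go_memstats_*` metric names exported by the Prometheus Go client's default collector. The sample text and every value in it are invented for illustration and are not output from the real service. The parser is hypothetical and deliberately minimal: it skips labelled series.

```python
from typing import Dict

# Invented sample in Prometheus text exposition format (NOT real service output).
SAMPLE = """\
# HELP go_memstats_heap_objects Number of allocated objects.
# TYPE go_memstats_heap_objects gauge
go_memstats_heap_objects 41234
# TYPE go_memstats_mallocs_total counter
go_memstats_mallocs_total 9.0e+06
# TYPE go_memstats_frees_total counter
go_memstats_frees_total 8.95e+06
go_memstats_heap_idle_bytes 3.2e+07
go_memstats_heap_inuse_bytes 9.1e+07
"""


def parse_metrics(text: str) -> Dict[str, float]:
    """Parse unlabelled samples; comments and labelled series are skipped."""
    samples: Dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        name = parts[0]
        if "{" in name or len(parts) < 2:
            continue  # labelled or malformed line: out of scope for this sketch
        samples[name] = float(parts[1])
    return samples


metrics = parse_metrics(SAMPLE)

# Cumulative mallocs minus frees approximates the number of live heap objects.
live = metrics["go_memstats_mallocs_total"] - metrics["go_memstats_frees_total"]
idle_vs_inuse = (
    metrics["go_memstats_heap_idle_bytes"] / metrics["go_memstats_heap_inuse_bytes"]
)

print(f"heap objects: {metrics['go_memstats_heap_objects']:.0f}")
print(f"mallocs - frees: {live:.0f}")
print(f"idle / in-use heap ratio: {idle_vs_inuse:.2f}")
```

Sampling these values twice, a few minutes apart, and comparing them is one simple way to tell a plateau from steady growth. A dashboard does the same thing continuously.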
[11:58:12] the amount of insights it gives you is unparalled [11:58:25] <_joe_> ahem the jvm gives you much more [11:58:36] <_joe_> actually so much more it's divination :P [11:58:36] half of it means nothing to me [11:59:00] you have to go around and figure out the generations, read up on garbage collectors [11:59:03] and so on [11:59:12] it's a pain. This is more intuitive [11:59:35] and extremely easy to get it working and on a dashboard [12:00:30] <_joe_> akosiaris: so, let's merge the switch? [12:00:40] <_joe_> and persist it [12:00:51] ah you pushed it locally from deploy1001 ? [12:01:05] yes, go for it [12:01:07] (03CR) 10Alexandros Kosiaris: [C: 03+1] ProductionServices: use the local proxy for sessionstore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578316 (owner: 10Giuseppe Lavagetto) [12:01:30] <_joe_> akosiaris: given it was a test, yes [12:01:50] (03CR) 10Giuseppe Lavagetto: [C: 03+2] ProductionServices: use the local proxy for sessionstore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578316 (owner: 10Giuseppe Lavagetto) [12:02:35] I like the divergent patterns of idle heap vs allocated and in use heap [12:02:40] https://grafana.wikimedia.org/d/000001590/sessionstore?orgId=1&from=now-1h&to=now&fullscreen&panelId=58 [12:03:07] (03Merged) 10jenkins-bot: ProductionServices: use the local proxy for sessionstore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578316 (owner: 10Giuseppe Lavagetto) [12:06:39] !log oblivian@deploy1001 Synchronized wmf-config/ProductionServices.php: switch sessionstore to use envoy permanently (duration: 00m 59s) [12:06:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:12] (03PS1) 10Giuseppe Lavagetto: sessionstore: bump memory limits in production [deployment-charts] - 10https://gerrit.wikimedia.org/r/578318 (https://phabricator.wikimedia.org/T244843) [12:17:51] <_joe_> akosiaris: I raised the request a bit too [12:18:03] <_joe_> to the level we would expect to be 
reached more or less [12:20:27] (03PS1) 10Muehlenhoff: Enable ssoSessions endpoint [puppet] - 10https://gerrit.wikimedia.org/r/578319 [12:21:48] (03PS3) 10Jcrespo: mariadb-backups: Increase snapshot frequency and retain those on bacula [puppet] - 10https://gerrit.wikimedia.org/r/577462 (https://phabricator.wikimedia.org/T138562) [12:22:14] (03PS16) 10Vgutierrez: ATS: Support TLS Session tickets [puppet] - 10https://gerrit.wikimedia.org/r/573977 (https://phabricator.wikimedia.org/T245616) [12:22:47] (03CR) 10jerkins-bot: [V: 04-1] Enable ssoSessions endpoint [puppet] - 10https://gerrit.wikimedia.org/r/578319 (owner: 10Muehlenhoff) [12:24:57] (03CR) 10jerkins-bot: [V: 04-1] ATS: Support TLS Session tickets [puppet] - 10https://gerrit.wikimedia.org/r/573977 (https://phabricator.wikimedia.org/T245616) (owner: 10Vgutierrez) [12:25:57] (03PS4) 10Jcrespo: mariadb-backups: Increase snapshot frequency and retain those on bacula [puppet] - 10https://gerrit.wikimedia.org/r/577462 (https://phabricator.wikimedia.org/T138562) [12:30:17] !log upload apertium 3.6.1, cg3 1.3.1, lttoolbox 3.5.1, apertium-lex-tools 0.2.3 to apt.wikimedia.org/jessie-wikimedia main. 
T234182 [12:30:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:22] T234182: Update to new cg3, lttoolbox, apertium, apertium-lex-tools and apertium-separable packages - https://phabricator.wikimedia.org/T234182 [12:30:41] (03PS17) 10Vgutierrez: ATS: Support TLS Session tickets [puppet] - 10https://gerrit.wikimedia.org/r/573977 (https://phabricator.wikimedia.org/T245616) [12:30:45] (03CR) 10Alexandros Kosiaris: [C: 03+2] apertium-lex-tools: Update to new upstream release 0.2.3 [debs/contenttranslation/apertium-lex-tools] - 10https://gerrit.wikimedia.org/r/577045 (https://phabricator.wikimedia.org/T234182) (owner: 10KartikMistry) [12:31:24] (03PS2) 10Muehlenhoff: Enable ssoSessions endpoint [puppet] - 10https://gerrit.wikimedia.org/r/578319 [12:33:35] (03CR) 10jerkins-bot: [V: 04-1] Enable ssoSessions endpoint [puppet] - 10https://gerrit.wikimedia.org/r/578319 (owner: 10Muehlenhoff) [12:36:00] (03PS18) 10Vgutierrez: ATS: Support TLS Session tickets [puppet] - 10https://gerrit.wikimedia.org/r/573977 (https://phabricator.wikimedia.org/T245616) [12:47:27] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/21353/" [puppet] - 10https://gerrit.wikimedia.org/r/578319 (owner: 10Muehlenhoff) [12:52:02] 10Operations, 10Performance-Team, 10SRE-swift-storage, 10Traffic, and 2 others: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10Gilles) @dpifke this is something I might be interested in handing off to you and that you might want to consider for Q4. Let me k... 
[12:52:45] (03CR) 10Alexandros Kosiaris: [C: 03+2] apertium: Update dependency and fix conflict [debs/contenttranslation/apertium] - 10https://gerrit.wikimedia.org/r/577861 (owner: 10KartikMistry) [12:53:09] (03CR) 10Muehlenhoff: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/578319 (owner: 10Muehlenhoff) [12:54:13] 10Operations, 10User-jbond: Upgrade CAS to 6.1.0 - https://phabricator.wikimedia.org/T236815 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff We're running 6.1.5 by now [12:54:16] 10Operations, 10Security-Team, 10User-jbond: Further steps for CAS/web SSO - https://phabricator.wikimedia.org/T233921 (10MoritzMuehlenhoff) [12:55:04] (03PS19) 10Vgutierrez: ATS: Support TLS Session tickets [puppet] - 10https://gerrit.wikimedia.org/r/573977 (https://phabricator.wikimedia.org/T245616) [12:58:03] 10Operations, 10User-jbond: Integrate CAS into backup infrastructure - https://phabricator.wikimedia.org/T233936 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff I'm closing this, the U2F meta data is backed up and the rest comes from Puppet or is volatile. [12:58:05] 10Operations, 10Security-Team, 10User-jbond: Further steps for CAS/web SSO - https://phabricator.wikimedia.org/T233921 (10MoritzMuehlenhoff) [13:07:11] (03PS20) 10Vgutierrez: ATS: Support TLS Session tickets [puppet] - 10https://gerrit.wikimedia.org/r/573977 (https://phabricator.wikimedia.org/T245616) [13:18:41] (03PS21) 10Vgutierrez: ATS: Support TLS Session tickets [puppet] - 10https://gerrit.wikimedia.org/r/573977 (https://phabricator.wikimedia.org/T245616) [13:46:15] (03CR) 10Alexandros Kosiaris: [C: 03+1] sessionstore: bump memory limits in production [deployment-charts] - 10https://gerrit.wikimedia.org/r/578318 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [13:47:07] (03CR) 10Ottomata: [C: 03+1] "Do both collectors run in both DCs? 
If so, it might be a little better to use MirrorMaker to replicate between the Kafka clusters and the" [puppet] - 10https://gerrit.wikimedia.org/r/578315 (https://phabricator.wikimedia.org/T226986) (owner: 10Filippo Giunchedi) [13:47:15] (03CR) 10Giuseppe Lavagetto: [C: 03+2] sessionstore: bump memory limits in production [deployment-charts] - 10https://gerrit.wikimedia.org/r/578318 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [13:49:44] !log oblivian@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'sessionstore' for release 'production' . [13:49:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:20] (03CR) 10Ottomata: "Great, I think we should keep this as a non-temporary change. :)" [puppet] - 10https://gerrit.wikimedia.org/r/578311 (https://phabricator.wikimedia.org/T245179) (owner: 10Elukey) [13:51:59] 10Operations, 10Traffic: lua related crash on ats-be @ cp4032 - https://phabricator.wikimedia.org/T247232 (10ema) [13:52:02] <_joe_> awight: going with eqiad now [13:52:05] <_joe_> err akosiaris [13:52:09] <_joe_> sorry awight :P [13:52:28] !log oblivian@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'sessionstore' for release 'production' . [13:52:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:48] * akosiaris looking into mw1266's logs [13:53:32] 10Operations, 10Traffic: traffic_server crash upon Lua reload: attempt to concatenate a table value - https://phabricator.wikimedia.org/T242952 (10ema) [13:53:33] got 2 503s [13:53:34] 10Operations, 10Traffic: lua related crash on ats-be @ cp4032 - https://phabricator.wikimedia.org/T247232 (10ema) [13:55:10] 10Operations, 10Traffic: traffic_server crash upon Lua reload: attempt to concatenate a table value - https://phabricator.wikimedia.org/T242952 (10ema) T247232 is another example of this bug, with a different Lua error (string length overflow). 
[13:56:27] (03CR) 10Vgutierrez: "pcc seems happy: https://puppet-compiler.wmflabs.org/compiler1001/21354/" [puppet] - 10https://gerrit.wikimedia.org/r/573977 (https://phabricator.wikimedia.org/T245616) (owner: 10Vgutierrez) [13:57:26] (03Abandoned) 10Ema: varnish: remove grafana-labs-admin [puppet] - 10https://gerrit.wikimedia.org/r/576427 (owner: 10Dzahn) [13:59:50] (03PS3) 10Marostegui: prometheus-mysqld-exporter: Add es3 to the list of standalone sections [puppet] - 10https://gerrit.wikimedia.org/r/576655 (https://phabricator.wikimedia.org/T246072) (owner: 10Jcrespo) [14:00:38] (03PS22) 10Vgutierrez: ATS: Support TLS Session tickets [puppet] - 10https://gerrit.wikimedia.org/r/573977 (https://phabricator.wikimedia.org/T245616) [14:01:21] (03CR) 10Ema: [C: 03+1] "+1, let's test this in labs first though!" [puppet] - 10https://gerrit.wikimedia.org/r/532348 (owner: 10Vgutierrez) [14:02:27] (03CR) 10Vgutierrez: "pcc is still happy: https://puppet-compiler.wmflabs.org/compiler1003/21355/" [puppet] - 10https://gerrit.wikimedia.org/r/573977 (https://phabricator.wikimedia.org/T245616) (owner: 10Vgutierrez) [14:07:27] (03PS23) 10Vgutierrez: ATS: Support TLS Session tickets [puppet] - 10https://gerrit.wikimedia.org/r/573977 (https://phabricator.wikimedia.org/T245616) [14:11:18] (03PS3) 10Muehlenhoff: Enable ssoSessions endpoint [puppet] - 10https://gerrit.wikimedia.org/r/578319 [14:13:16] (03CR) 10jerkins-bot: [V: 04-1] Enable ssoSessions endpoint [puppet] - 10https://gerrit.wikimedia.org/r/578319 (owner: 10Muehlenhoff) [14:16:38] 10Operations, 10Security-Team, 10Stewards-and-global-tools, 10Security, 10User-revi: Security Issue Access Request for 2020 Stewards - https://phabricator.wikimedia.org/T246449 (10revi) [14:20:11] (03CR) 10Volans: "I think that the approach in the check could be drastically simplified, see inline and feel free to ping me if you want to discuss details" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/578312 
(https://phabricator.wikimedia.org/T242910) (owner: 10Jbond) [14:29:51] (03CR) 10Ema: [C: 03+1] "One nit, looks good!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/573977 (https://phabricator.wikimedia.org/T245616) (owner: 10Vgutierrez) [14:31:14] (03PS2) 10Giuseppe Lavagetto: ProductionServices: use envoy to connect to mathoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576007 (https://phabricator.wikimedia.org/T244843) [14:31:16] (03PS2) 10Giuseppe Lavagetto: ProductionServices:switch eventgate-analytics to use envoy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576008 (https://phabricator.wikimedia.org/T244843) [14:31:18] (03PS2) 10Giuseppe Lavagetto: ProductionServices: switch eventgate-main to use envoy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576009 (https://phabricator.wikimedia.org/T244843) [14:32:48] 10Operations, 10Analytics, 10ContentTranslation, 10SRE-Access-Requests, 10Language-Team (Language-2020-January-March): Request for access for stats machines for Santhosh - https://phabricator.wikimedia.org/T247246 (10Nuria) [14:35:05] (03PS1) 10Elukey: jupyterhub: use http_proxy pid file for Buster [puppet] - 10https://gerrit.wikimedia.org/r/578325 (https://phabricator.wikimedia.org/T245179) [14:36:04] (03CR) 10Andrew Bogott: [C: 03+2] keystone: Backport the Rocky version of ldap integration [puppet] - 10https://gerrit.wikimedia.org/r/577602 (https://phabricator.wikimedia.org/T247050) (owner: 10Andrew Bogott) [14:36:18] (03CR) 10Andrew Bogott: [C: 03+2] Queens keystone: add a hack for utf8 decoding in already-hacked ldap handler [puppet] - 10https://gerrit.wikimedia.org/r/577669 (https://phabricator.wikimedia.org/T247050) (owner: 10Andrew Bogott) [14:36:27] (03PS2) 10Andrew Bogott: Queens keystone: add a hack for utf8 decoding in already-hacked ldap handler [puppet] - 10https://gerrit.wikimedia.org/r/577669 (https://phabricator.wikimedia.org/T247050) [14:37:44] (03CR) 10Vgutierrez: [C: 03+2] ATS: Support TLS Session 
tickets [puppet] - 10https://gerrit.wikimedia.org/r/573977 (https://phabricator.wikimedia.org/T245616) (owner: 10Vgutierrez) [14:38:03] (03PS2) 10Elukey: jupyterhub: use http_proxy pid file for Buster [puppet] - 10https://gerrit.wikimedia.org/r/578325 (https://phabricator.wikimedia.org/T245179) [14:38:26] (03CR) 10Filippo Giunchedi: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/578315 (https://phabricator.wikimedia.org/T226986) (owner: 10Filippo Giunchedi) [14:38:44] (03CR) 10Vgutierrez: "builds in boron as expected" [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/577569 (https://phabricator.wikimedia.org/T245616) (owner: 10Vgutierrez) [14:38:58] jouncebot: next [14:38:58] In 2 hour(s) and 21 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200309T1700) [14:40:48] (03CR) 10Elukey: [C: 03+2] jupyterhub: use http_proxy pid file for Buster [puppet] - 10https://gerrit.wikimedia.org/r/578325 (https://phabricator.wikimedia.org/T245179) (owner: 10Elukey) [14:41:26] (03CR) 10Filippo Giunchedi: [C: 03+2] logstash: adjust client error topics for eqiad/codfw [puppet] - 10https://gerrit.wikimedia.org/r/578315 (https://phabricator.wikimedia.org/T226986) (owner: 10Filippo Giunchedi) [14:41:44] !log roll restart logstash in codfw / eqiad - T226986 [14:41:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:50] T226986: Client side error logging production launch - https://phabricator.wikimedia.org/T226986 [14:47:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1121 T239791', diff saved to https://phabricator.wikimedia.org/P10662 and previous config saved to /var/cache/conftool/dbconfig/20200309-144752-marostegui.json [14:47:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:58] T239791: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 
[14:48:24] !log Restart and upgrade mysql on db1121 T239791 [14:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:18] 10Operations, 10Puppet, 10DBA, 10User-jbond: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (10Marostegui) [14:52:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1121 T239791', diff saved to https://phabricator.wikimedia.org/P10663 and previous config saved to /var/cache/conftool/dbconfig/20200309-145232-marostegui.json [14:52:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:26] (03PS1) 10Vgutierrez: ATS: Turn on TLS Session tickets on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/578327 (https://phabricator.wikimedia.org/T245616) [14:55:56] (03PS2) 10Vgutierrez: ATS: Turn on TLS Session tickets on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/578327 (https://phabricator.wikimedia.org/T245616) [14:56:41] !log Updated the Wikidata property suggester with data from the 2020-03-02 JSON dump and applied the T132839 workarounds [14:56:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:46] T132839: [RfC] Property suggester suggests human properties for non-human items - https://phabricator.wikimedia.org/T132839 [14:57:02] (03PS1) 10Ema: Use confluent-kafka-go instead of segmentio/kafka-go [software/atskafka] - 10https://gerrit.wikimedia.org/r/578328 (https://phabricator.wikimedia.org/T237993) [14:57:09] !log Restart mysql for upgrade [14:57:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:16] 10Operations, 10Analytics, 10ContentTranslation, 10SRE-Access-Requests, 10Language-Team (Language-2020-January-March): Request for access for stats machines for Santhosh - https://phabricator.wikimedia.org/T247246 (10Nuria) @santhosh What is your LDAP user? 
[15:01:46] (03PS4) 10Alexandros Kosiaris: admin: Add redis databases for changeprop [deployment-charts] - 10https://gerrit.wikimedia.org/r/577239 (https://phabricator.wikimedia.org/T213193) [15:04:53] 10Operations, 10Puppet, 10DBA, 10User-jbond: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (10Marostegui) [15:06:55] !log Restart mysql on db1116 (the previous one was db1102) for upgrade [15:06:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:24] (03CR) 10Alexandros Kosiaris: [C: 03+1] ProductionServices: use envoy to connect to mathoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576007 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [15:09:50] (03CR) 10Alexandros Kosiaris: [C: 03+2] admin: Add redis databases for changeprop [deployment-charts] - 10https://gerrit.wikimedia.org/r/577239 (https://phabricator.wikimedia.org/T213193) (owner: 10Alexandros Kosiaris) [15:10:09] (03Merged) 10jenkins-bot: admin: Add redis databases for changeprop [deployment-charts] - 10https://gerrit.wikimedia.org/r/577239 (https://phabricator.wikimedia.org/T213193) (owner: 10Alexandros Kosiaris) [15:12:17] <_joe_> jouncebot: next [15:12:17] In 1 hour(s) and 47 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200309T1700) [15:12:25] (03CR) 10Giuseppe Lavagetto: [C: 03+2] ProductionServices: use envoy to connect to mathoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576007 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [15:12:46] <_joe_> ouch I merged myself out of habit [15:12:49] <_joe_> well w/e [15:12:55] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'kube-system' for release 'calico-policy-controller' . 
[15:12:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:00] <_joe_> it's a one-line change [15:13:04] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [15:13:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote es1016 to es1 master, this is a NOOP T239791', diff saved to https://phabricator.wikimedia.org/P10664 and previous config saved to /var/cache/conftool/dbconfig/20200309-151310-marostegui.json [15:13:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:15] T239791: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 [15:13:19] cdanis: ^ :) [15:14:20] marostegui: assuming the next thing you're doing is depooling es1012, would be nice to have that scripted indeed [15:14:23] (03PS4) 10Jbond: Enable ssoSessions endpoint [puppet] - 10https://gerrit.wikimedia.org/r/578319 (owner: 10Muehlenhoff) [15:14:31] yeah, I will be depooling that one [15:14:51] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . 
[15:14:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:24] 10Operations, 10Puppet, 10DBA, 10User-jbond: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (10Marostegui) [15:17:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1012 T239791', diff saved to https://phabricator.wikimedia.org/P10665 and previous config saved to /var/cache/conftool/dbconfig/20200309-151751-marostegui.json [15:17:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:51] !log oblivian@deploy1001 Synchronized wmf-config/ProductionServices.php: switch mathoid to use envoy (duration: 00m 59s) [15:18:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:04] (03PS5) 10Jbond: Enable ssoSessions endpoint [puppet] - 10https://gerrit.wikimedia.org/r/578319 (owner: 10Muehlenhoff) [15:23:30] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/21359/" [puppet] - 10https://gerrit.wikimedia.org/r/578319 (owner: 10Muehlenhoff) [15:24:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2125 - T246604', diff saved to https://phabricator.wikimedia.org/P10666 and previous config saved to /var/cache/conftool/dbconfig/20200309-152427-marostegui.json [15:24:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:33] T246604: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 [15:25:46] ottomata: it is working! 
:) https://logstash.wikimedia.org/goto/edf04ab8ff11b50a69ecf9988337b7e1 [15:26:24] OOoOOooOO [15:27:02] godog: if you do meta.stream:"mediawiki.client.error" it should match both topics [15:29:22] !log Upgrade mysql on es1012 T239791 [15:29:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:28] T239791: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 [15:31:45] 10Operations, 10Puppet, 10DBA, 10User-jbond: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (10Marostegui) [15:32:33] ottomata: indeed! via that tag above should work too afaict, on the logstash side that is [15:33:08] OH irght [15:33:13] it wildcards it cool! [15:35:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool es1012', diff saved to https://phabricator.wikimedia.org/P10667 and previous config saved to /var/cache/conftool/dbconfig/20200309-153515-marostegui.json [15:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:58] ottomata: correct! same tag across all consumers [15:38:14] (03CR) 10Ottomata: [C: 03+1] ProductionServices:switch eventgate-analytics to use envoy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576008 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [15:40:19] is beta cluster down? 
[15:40:32] this does not open https://en.wikipedia.beta.wmflabs.org/ [15:41:12] Looks like it might be [15:41:32] 10Operations, 10Thumbor, 10serviceops, 10User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (10fgiunchedi) Not sure if there's a more specific python3 + Thumbor but the alpha version of Thumbor ships with Python 3 support: https://github.com/thumbor/thumbor/releases/tag/7.0.0a2 [15:42:13] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/574661 (https://phabricator.wikimedia.org/T156955) (owner: 10Filippo Giunchedi) [15:44:22] (03CR) 10Filippo Giunchedi: [C: 03+2] install_server: hwraid-1dev partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/574661 (https://phabricator.wikimedia.org/T156955) (owner: 10Filippo Giunchedi) [15:44:55] zeljkof: Seems ok now.. [15:45:36] Reedy: it's slow but it does work now, thanks [15:46:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool es1012', diff saved to https://phabricator.wikimedia.org/P10668 and previous config saved to /var/cache/conftool/dbconfig/20200309-154627-marostegui.json [15:46:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:37] uh oh, but now the title is wrong, but maybe it always was [15:46:38] https://en.wikipedia.beta.wmflabs.org/wiki/Main_Page [15:47:01] Wikinews, the free news source [15:48:11] (03CR) 10Filippo Giunchedi: "+Ariel, FYI for snapshot hosts that use hardware raid on /dev/sda" [puppet] - 10https://gerrit.wikimedia.org/r/574661 (https://phabricator.wikimedia.org/T156955) (owner: 10Filippo Giunchedi) [15:49:21] (03PS1) 10Giuseppe Lavagetto: services_proxy: raise timeout for eventgate-analytics [puppet] - 10https://gerrit.wikimedia.org/r/578341 [15:49:27] Says commons for me :D [15:49:42] <_joe_> zeljkof: please don't talk about beta in this channel [15:49:55] <_joe_> I almost fainted thinking something bad was ongoing in prod :P [15:50:00] Haha [15:50:04] 
_joe_: oops, sorry :) [15:50:06] I was going to ask if it was PTSD or... [15:50:27] <_joe_> no just I had to scroll back to confirm it was beta and not the actuall prod wikis [15:51:48] (03CR) 10Giuseppe Lavagetto: [C: 03+2] services_proxy: raise timeout for eventgate-analytics [puppet] - 10https://gerrit.wikimedia.org/r/578341 (owner: 10Giuseppe Lavagetto) [15:52:09] 10Operations, 10Analytics, 10Wikidata, 10Wikidata-Query-Service: Deployment strategy and hardware requirement for new Flink based WDQS updater - https://phabricator.wikimedia.org/T247058 (10Nuria) I think it will be very helpful to have a design document for this service so we are all in the same page of w... [15:58:36] (03PS3) 10Jbond: puppetdb: monitor agent runs [puppet] - 10https://gerrit.wikimedia.org/r/578312 (https://phabricator.wikimedia.org/T242910) [15:59:15] (03CR) 10Jbond: "updated thanks" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/578312 (https://phabricator.wikimedia.org/T242910) (owner: 10Jbond) [15:59:57] PROBLEM - Check systemd state on notebook1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:02:42] (03CR) 10Marostegui: [C: 03+1] mariadb-backups: Increase snapshot frequency and retain those on bacula [puppet] - 10https://gerrit.wikimedia.org/r/577462 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [16:02:58] <_joe_> jouncebot: next [16:02:59] In 0 hour(s) and 57 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200309T1700) [16:03:07] <_joe_> gehel: are you deploying something? 
[16:03:27] _joe_: nope, nothing today [16:03:34] <_joe_> ack thanks [16:04:37] RECOVERY - Check systemd state on notebook1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:04:43] PROBLEM - Check systemd state on notebook1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:04:55] my bad for notebooks [16:04:58] I am doing a cleanup [16:05:10] (03CR) 10Giuseppe Lavagetto: [C: 03+2] ProductionServices:switch eventgate-analytics to use envoy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576008 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [16:06:08] (03PS4) 10Jbond: puppetdb: monitor agent runs [puppet] - 10https://gerrit.wikimedia.org/r/578312 (https://phabricator.wikimedia.org/T242910) [16:07:03] RECOVERY - Check systemd state on notebook1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:08:21] (03Merged) 10jenkins-bot: ProductionServices:switch eventgate-analytics to use envoy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576008 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [16:09:29] (03CR) 10Jbond: "LGTM, with optional nit" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/578319 (owner: 10Muehlenhoff) [16:09:42] (03PS1) 10Elukey: admin: add kerberos flag for user mholloway [puppet] - 10https://gerrit.wikimedia.org/r/578352 (https://phabricator.wikimedia.org/T246834) [16:10:06] (03CR) 10Muehlenhoff: Enable ssoSessions endpoint (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/578319 (owner: 10Muehlenhoff) [16:11:12] (03CR) 10Elukey: [C: 03+2] admin: add kerberos flag for user mholloway [puppet] - 10https://gerrit.wikimedia.org/r/578352 (https://phabricator.wikimedia.org/T246834) (owner: 10Elukey) [16:11:14] !log 
akosiaris@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [16:11:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:12] !log oblivian@deploy1001 Synchronized wmf-config/ProductionServices.php: switch eventgate-analytics to use envoy (duration: 01m 05s) [16:15:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:30] 10Operations, 10Security-Team, 10Stewards-and-global-tools, 10Security, 10User-revi: Security Issue Access Request for 2020 Stewards - https://phabricator.wikimedia.org/T246449 (10revi) @chasemp Feel free to consider current list final for the time being. One stewards' confirmation is not closed but I th... [16:16:13] <_joe_> akosiaris: seeing some failures [16:16:55] (03CR) 10Volans: "Thanks for the quick refactor! Looks good to me, couple of minor things inline." (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/578312 (https://phabricator.wikimedia.org/T242910) (owner: 10Jbond) [16:17:14] <_joe_> yes [16:17:20] <_joe_> I see throttled spiking [16:17:26] <_joe_> I guess because of TLS? [16:18:48] <_joe_> yeah gonna stop the deployment now [16:18:57] what are you seeing exactly? [16:19:08] most of red dashboard is ok [16:19:13] I see some error in mcrouter ofc [16:19:48] <_joe_> no I mean here https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?orgId=1&refresh=1m&var-dc=eqiad%20prometheus%2Fk8s&var-service=eventgate-analytics&var-kafka_topic=All&var-kafka_broker=All&var-kafka_producer_type=All&from=now-30m&to=now [16:20:16] <_joe_> and I see the errors in envoy [16:20:21] !log oblivian@deploy1001 Synchronized wmf-config/ProductionServices.php: revert: switch eventgate-analytics to use envoy (duration: 00m 59s) [16:20:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:25] <_joe_> I guess we need more cpu there? [16:20:27] yeah, not a good sign? 
[16:20:35] yes, exactly that [16:20:37] revert [16:20:43] <_joe_> already done [16:20:45] https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?orgId=1&refresh=1m&var-dc=eqiad%20prometheus%2Fk8s&var-service=eventgate-analytics&var-kafka_topic=All&var-kafka_broker=All&var-kafka_producer_type=All&fullscreen&panelId=28 [16:20:46] <_joe_> see above [16:21:04] <_joe_> that's because now it needs to do tls negotiation I guess [16:21:12] how on earth did we get that spike? [16:21:16] envoy the sidecar? [16:21:23] <_joe_> I guess so [16:21:30] <_joe_> that's the only thing we're really changing [16:21:34] 9min... that's a lie [16:21:41] <_joe_> ofc [16:21:52] lemme review those numbers [16:21:56] <_joe_> sure [16:22:07] <_joe_> I'm taking a break before of persisting the rollback [16:28:03] _joe_: https://w.wiki/K4G, yup it's the TLS proxy [16:28:52] <_joe_> heh [16:28:57] <_joe_> so let's add some cpu? [16:29:16] <_joe_> now I suspect that's what happened even when andrew tried to switch [16:29:17] yeah, trying to calculate it [16:29:25] yes, same idea over here [16:29:27] <_joe_> but thanks to envoy no outage [16:29:54] <_joe_> envoy on the mw side [16:30:56] min/max/avg of that metric per tls-proxy container are at 25secs [16:30:59] that's still too much [16:31:14] I remember reading that metric can be highly misleading [16:35:52] (03PS3) 10RhinosF1: Create Define/Define Talk: namespace on scowiki with CapitalLinkOverrides true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/577802 (https://phabricator.wikimedia.org/T247172) [16:38:23] (03PS4) 10RhinosF1: Create Define/Define Talk: namespace on scowiki with CapitalLinkOverrides true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/577802 (https://phabricator.wikimedia.org/T247172) [16:38:27] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash7-eqiad instance=kafkamon1001:9501 job=burrow partition={0,1,2,3,4,5} site=eqiad 
topic=udp_localhost-err https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&va [16:38:27] -eqiad&var-topic=All&var-consumer_group=All [16:38:51] <_joe_> uhm [16:39:35] (03PS9) 10RhinosF1: Remove expired throttle config from throttle.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/577346 [16:41:47] 10Operations, 10Performance Issue: Investigate CAS performance - https://phabricator.wikimedia.org/T246010 (10MoritzMuehlenhoff) [16:41:49] 10Operations, 10Security-Team, 10User-jbond: Further steps for CAS/web SSO - https://phabricator.wikimedia.org/T233921 (10MoritzMuehlenhoff) [16:42:05] * RhinosF1 is here from now for SWAT (yes, I'm early) [16:42:07] <_joe_> akosiaris: so wanna try to bump the CPU? [16:42:12] <_joe_> jouncebot: next [16:42:12] In 0 hour(s) and 17 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200309T1700) [16:42:54] <_joe_> I guess I'll just revert and call it a day [16:42:56] (03CR) 10Elukey: [C: 03+1] "<3" [software/atskafka] - 10https://gerrit.wikimedia.org/r/578328 (https://phabricator.wikimedia.org/T237993) (owner: 10Ema) [16:42:59] _joe_: yeah, calculating numbers still [16:43:11] I think I am gonna go for a 4 [16:43:17] it's the most easy right now [16:44:05] (03PS1) 10Filippo Giunchedi: install_server: switch snapshot and sodium to standard partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/578356 (https://phabricator.wikimedia.org/T156955) [16:44:20] _joe_: actually, scratch that [16:44:29] it will take some more time. We need to update the chart and all [16:44:36] let's do it tomorrow without pressure [16:45:27] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [16:46:07] <_joe_> akosiaris: ack [16:46:47] (03PS1) 10Giuseppe Lavagetto: Revert "ProductionServices:switch eventgate-analytics to use envoy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578357 [16:46:57] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Revert "ProductionServices:switch eventgate-analytics to use envoy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578357 (owner: 10Giuseppe Lavagetto) [16:48:14] (03Merged) 10jenkins-bot: Revert "ProductionServices:switch eventgate-analytics to use envoy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578357 (owner: 10Giuseppe Lavagetto) [16:48:45] 10Operations, 10Performance Issue: Investigate CAS performance - https://phabricator.wikimedia.org/T246010 (10MoritzMuehlenhoff) I dug into this a little further, but no luck yet. Some further findings: Opening the login page took ~ 20s from Firefox after a restart of CAS. There's two significant, long gaps (... 
[16:50:39] (PS1) Andrew Bogott: wmfnovamiddleware: adjust string encoding for python3 [puppet] - https://gerrit.wikimedia.org/r/578359 (https://phabricator.wikimedia.org/T242766) [16:51:30] (CR) jerkins-bot: [V: -1] wmfnovamiddleware: adjust string encoding for python3 [puppet] - https://gerrit.wikimedia.org/r/578359 (https://phabricator.wikimedia.org/T242766) (owner: Andrew Bogott) [16:53:37] (PS2) Andrew Bogott: wmfnovamiddleware: adjust string encoding for python3 [puppet] - https://gerrit.wikimedia.org/r/578359 (https://phabricator.wikimedia.org/T242766) [16:55:51] (CR) Andrew Bogott: [C: +2] wmfnovamiddleware: adjust string encoding for python3 [puppet] - https://gerrit.wikimedia.org/r/578359 (https://phabricator.wikimedia.org/T242766) (owner: Andrew Bogott) [17:00:04] gehel and onimisionipe: How many deployers does it take to do Wikidata Query Service weekly deploy deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200309T1700). [17:02:20] (CR) RhinosF1: [C: +1] "All now expired" [mediawiki-config] - https://gerrit.wikimedia.org/r/577346 (owner: RhinosF1) [17:02:24] jouncebot: no deploy today [17:02:47] jouncebot: next [17:02:47] In 0 hour(s) and 57 minute(s): Morning SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200309T1800) [17:03:54] (PS5) Jbond: puppetdb: monitor agent runs [puppet] - https://gerrit.wikimedia.org/r/578312 (https://phabricator.wikimedia.org/T242910) [17:08:02] (PS6) Jbond: puppetdb: monitor agent runs [puppet] - https://gerrit.wikimedia.org/r/578312 (https://phabricator.wikimedia.org/T242910) [17:10:32] (CR) Jbond: puppetdb: monitor agent runs (4 comments) [puppet] - https://gerrit.wikimedia.org/r/578312 (https://phabricator.wikimedia.org/T242910) (owner: Jbond) [17:15:10] (CR) Volans: [C: +1] "LGTM, thanks for adding this check!"
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/578312 (https://phabricator.wikimedia.org/T242910) (owner: Jbond) [17:15:34] ottomata: hi, the "to delete" eventgate clusters are scheduled to be gone soon? I'm asking because they are currently triggering the swagger checks on icinga "Prometheus jobs reduced availability" [17:17:44] OH! i thought they were gone i must have missed the check [17:18:25] godog: i think they are gone... [17:18:37] at least from puppet [17:21:42] mmhh ok thanks ottomata, I'll put on a hardhat and do some digging [17:28:23] (PS1) Filippo Giunchedi: lvs: delete legacy eventgate monitors [puppet] - https://gerrit.wikimedia.org/r/578365 (https://phabricator.wikimedia.org/T245203) [17:29:15] (CR) Filippo Giunchedi: "Please double check this is correct, I did a blind cleanup to resolve icinga alerts" [puppet] - https://gerrit.wikimedia.org/r/578365 (https://phabricator.wikimedia.org/T245203) (owner: Filippo Giunchedi) [17:30:22] AHHHHH godog sorry i was grepping puppet for to-delete [17:30:26] with hyphen so didn't see it just now [17:30:35] (CR) Ottomata: [C: +1] lvs: delete legacy eventgate monitors [puppet] - https://gerrit.wikimedia.org/r/578365 (https://phabricator.wikimedia.org/T245203) (owner: Filippo Giunchedi) [17:31:20] Operations, DBA, OTRS, Recommendation-API, Research: Upgrade and restart m2 primary database master (db1132) - https://phabricator.wikimedia.org/T246098 (leila) @Marostegui please go ahead. We can handle a few sec potential down for recommendationapi. (@bmansurov FYI) [17:31:21] sweet!
thanks ottomata, merging [17:31:25] (03CR) 10Jbond: "updated thanks" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/578312 (https://phabricator.wikimedia.org/T242910) (owner: 10Jbond) [17:31:48] (03CR) 10Filippo Giunchedi: [C: 03+2] lvs: delete legacy eventgate monitors [puppet] - 10https://gerrit.wikimedia.org/r/578365 (https://phabricator.wikimedia.org/T245203) (owner: 10Filippo Giunchedi) [17:32:03] (03CR) 10Jbond: [C: 03+2] puppetdb: monitor agent runs [puppet] - 10https://gerrit.wikimedia.org/r/578312 (https://phabricator.wikimedia.org/T242910) (owner: 10Jbond) [17:38:51] (03PS1) 10Jbond: monitoring_agentrun: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/578366 [17:42:13] (03CR) 10Jbond: [C: 03+2] monitoring_agentrun: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/578366 (owner: 10Jbond) [17:44:34] ok I can't quite figure out now why the check is still alerting, will take a deeper look tomorrow [17:47:56] 10Operations, 10Traffic, 10observability: prometheus2004 not scraping lvs2007 & lvs2008 - https://phabricator.wikimedia.org/T246860 (10elukey) I am investigating a similar problem for some new elasticsearch nodes (like elastic2055). If I create a ssh tunnel for prometheus2003 and 2004, I can see the metrics... [17:56:19] Can I sneak in first for SWAT? [17:56:31] Urbanecm, MatmaRex: [18:00:05] RoanKattouw, Niharika, and Urbanecm: #bothumor I � Unicode. All rise for Morning SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200309T1800). [18:00:05] RhinosF1 and MatmaRex: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:45] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:00:54] * RhinosF1 ready [18:01:14] hello [18:01:18] sure [18:05:25] Mine is just merge and sync file, no test needed. I will be eating so not watching soon. [18:05:36] Just go ahead though [18:06:10] hmm, so, is anyone deploying? [18:06:18] Not sure [18:06:27] Urbanecm normally does it [18:06:41] RoanKattouw, Niharika: ? [18:07:12] Reedy: you around? want to save the day? [18:08:54] MatmaRex: I’ve got a patch for the 11pm window. Shall I just move to then? [18:09:52] probably, yeah [18:10:17] If someone turns up, feel free to just merge. If not, I’ll go then [18:12:38] jouncebot: now [18:12:38] For the next 0 hour(s) and 47 minute(s): Morning SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200309T1800) [18:12:43] jouncebot: next [18:12:44] In 1 hour(s) and 47 minute(s): Services – Graphoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200309T2000) [18:17:03] 10Operations, 10Analytics, 10Traffic, 10Performance-Team (Radar): Send X-Analytics information from Varnish to Hadoop with VCL_Log - https://phabricator.wikimedia.org/T196558 (10Krinkle) 05Open→03Resolved [18:17:06] 10Operations, 10SRE-swift-storage, 10Traffic, 10Performance-Team (Radar): Reduce amount of headers sent from web responses - https://phabricator.wikimedia.org/T194814 (10Krinkle) [18:19:12] 10Operations, 10SRE-swift-storage, 10Traffic, 10Performance-Team (Radar): Reduce amount of headers sent from web responses - https://phabricator.wikimedia.org/T194814 (10Krinkle) [18:40:23] I’m back if anyone wants to go [18:41:07] Urbanecm, RoanKattouw, Niharika: if not, can you confirm the 11pm slot so I’m not up late for nothing? [19:00:09] hmm, I see a SWAT was missed? [19:00:13] Is there anything urgent? 
[19:00:32] Urbanecm: not really, I can go 11pm if you’re around [19:00:37] MatmaRex: ^ [19:00:57] i rescheduled mine for tomorrow [19:01:00] Urbanecm: we’ll have both the throttle cleanup and scowiki to do though [19:01:06] Sound good? [19:01:10] okay [19:01:17] RhinosF1: yes, the window limit is 6 :-) no problem [19:02:55] Cool [19:41:02] (03PS1) 10Andrew Bogott: Openstack queens packages: absent some python2 packages [puppet] - 10https://gerrit.wikimedia.org/r/578378 (https://phabricator.wikimedia.org/T242766) [20:00:04] halfak and accraze: That opportune time is upon us again. Time for a Services – Graphoid / Citoid / ORES deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200309T2000). [20:09:17] (03CR) 10Andrew Bogott: [C: 03+2] Openstack queens packages: absent some python2 packages [puppet] - 10https://gerrit.wikimedia.org/r/578378 (https://phabricator.wikimedia.org/T242766) (owner: 10Andrew Bogott) [20:38:09] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:40:31] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:51:26] jouncebot: next [20:51:26] In 0 hour(s) and 8 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200309T2100) [21:00:04] Reedy and sbassett: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Weekly Security deployment window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200309T2100). 
[21:04:49] I love that #bothumor jouncebot [21:12:41] PROBLEM - Old JVM GC check - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is CRITICAL: 122 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad [21:15:25] PROBLEM - Old JVM GC check - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is CRITICAL: 108.8 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad [21:16:03] 10Operations, 10Traffic, 10observability: prometheus2004 not scraping lvs2007 & lvs2008 - https://phabricator.wikimedia.org/T246860 (10CDanis) 05Resolved→03Open a:05CDanis→03None I'll take a look at this tomorrow in my day, unless @fgiunchedi beats me to it. [21:25:21] PROBLEM - Old JVM GC check - cloudelastic1002-cloudelastic-chi-eqiad on cloudelastic1002 is CRITICAL: 103.4 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad [21:34:22] * RhinosF1 here until after SWAT [21:34:31] jouncebot: next [21:34:31] In 1 hour(s) and 25 minute(s): Evening SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200309T2300) [21:34:36] ages left [21:35:03] Urbanecm: I suggest we do sco.wiki first and then the throttle cleanup patch. 
[22:00:17] RECOVERY - Old JVM GC check - cloudelastic1002-cloudelastic-chi-eqiad on cloudelastic1002 is OK: (C)100 gt (W)80 gt 77.29 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad [22:03:57] RECOVERY - Old JVM GC check - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is OK: (C)100 gt (W)80 gt 72.2 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad [22:29:55] PROBLEM - Old JVM GC check - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is CRITICAL: 108.8 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad [22:44:25] PROBLEM - Old JVM GC check - cloudelastic1002-cloudelastic-chi-eqiad on cloudelastic1002 is CRITICAL: 102.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad [22:57:19] * RhinosF1 peeks in [22:57:24] ready [23:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Evening SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200309T2300). [23:00:04] RhinosF1: A patch you scheduled for Evening SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[23:01:28] I can do the SWAT, give me a minute [23:01:41] good luck RoanKattouw :) [23:02:01] RoanKattouw: great, this is my first. Can we do the sco.wiki patch first pls? [23:03:08] Operations: Deploy the cescout package (censorship monitoring) - https://phabricator.wikimedia.org/T247273 (ssingh) [23:03:25] Operations: Deploy the cescout package (censorship monitoring) - https://phabricator.wikimedia.org/T247273 (ssingh) [23:06:40] (CR) Catrope: [C: +2] Create Define/Define Talk: namespace on scowiki with CapitalLinkOverrides true [mediawiki-config] - https://gerrit.wikimedia.org/r/577802 (https://phabricator.wikimedia.org/T247172) (owner: RhinosF1) [23:07:40] (Merged) jenkins-bot: Create Define/Define Talk: namespace on scowiki with CapitalLinkOverrides true [mediawiki-config] - https://gerrit.wikimedia.org/r/577802 (https://phabricator.wikimedia.org/T247172) (owner: RhinosF1) [23:09:04] RhinosF1: OK, your scowiki patch is on mwdebug1001 for testing, please test it there using the WikimediaDebug browser extension [23:09:31] {{doing}} [23:13:39] RoanKattouw: https://sco.wikipedia.org/w/index.php?title=Define:Test&action=info - LGTM [23:13:55] OK, deploying everywhere [23:15:20] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Create Define/Define talk: namespace on scowiki (duration: 01m 00s) [23:15:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:18:11] RoanKattouw: proceed with the throttle patch then when ready [23:18:54] (CR) Catrope: [C: +2] Remove expired throttle config from throttle.php [mediawiki-config] - https://gerrit.wikimedia.org/r/577346 (owner: RhinosF1) [23:19:45] (Merged) jenkins-bot: Remove expired throttle config from throttle.php [mediawiki-config] - https://gerrit.wikimedia.org/r/577346 (owner: RhinosF1) [23:21:24] !log catrope@deploy1001 Synchronized wmf-config/throttle.php: Remove expired throttle exemptions (duration: 01m 00s) [23:21:27]
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:46] RoanKattouw: That's SWAT done from me! [23:23:37] * RhinosF1 is off to sleep now