[00:00:04] twentyafterfour: (Dis)respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200730T0000). Please do the needful.
[00:00:28] RECOVERY - MariaDB Replica Lag: s4 on db1145 is OK: OK slave_sql_lag Replication lag: 0.16 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:08:10] RECOVERY - Check systemd state on chartmuseum2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:11:47] (PS1) Mholloway: Implement GetContentModels hook [extensions/JsonConfig] (wmf/1.36.0-wmf.2) - https://gerrit.wikimedia.org/r/617306 (https://phabricator.wikimedia.org/T259126)
[00:13:14] PROBLEM - Check systemd state on chartmuseum2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:14:25] Operations, serviceops: reinstall xhgui* with buster - https://phabricator.wikimedia.org/T259206 (Dzahn)
[00:23:47] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[00:23:47] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[00:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:37:45] (CR) Mholloway: [C: +2] Implement GetContentModels hook [extensions/JsonConfig] (wmf/1.36.0-wmf.2) - https://gerrit.wikimedia.org/r/617306 (https://phabricator.wikimedia.org/T259126) (owner: Mholloway)
[00:40:47] RECOVERY - mediawiki-installation DSH group on wtp2008 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[00:48:11] RECOVERY - Check systemd state on chartmuseum2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:48:51] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:51:59] PROBLEM - parsoid on wtp2010 is CRITICAL: connect to address 10.192.16.52 and port 8000: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/parsoid
[00:51:59] PROBLEM - Check size of conntrack table on wtp2010 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.52: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[00:52:12] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[00:52:13] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[00:52:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:52:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:53:23] ACKNOWLEDGEMENT - parsoid on wtp2010 is CRITICAL: connect to address 10.192.16.52 and port 8000: Connection refused daniel_zahn reinstall https://wikitech.wikimedia.org/wiki/Services/Monitoring/parsoid
[00:53:23] ACKNOWLEDGEMENT - php7.2-fpm service on wtp2010 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.52: Connection reset by peer daniel_zahn reinstall https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:57:59] PROBLEM - Check systemd state on chartmuseum2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:00:04] (Merged) jenkins-bot: Implement GetContentModels hook [extensions/JsonConfig] (wmf/1.36.0-wmf.2) - https://gerrit.wikimedia.org/r/617306 (https://phabricator.wikimedia.org/T259126) (owner: Mholloway)
[01:00:27] RECOVERY - Check size of conntrack table on wtp2009 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[01:03:20] Operations, serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp2009.codfw.wmnet'] ` and were **ALL** successful.
[01:06:27] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[01:07:19] !log mholloway-shell@deploy1001 Synchronized php-1.36.0-wmf.2/extensions/JsonConfig: Backport: Implement GetContentModels hook (T259126) (duration: 01m 07s)
[01:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:07:25] T259126: Warning: Locally stored wiki page has unsupported content model (from JsonConfig) - https://phabricator.wikimedia.org/T259126
[01:09:07] RECOVERY - Check the last execution of helm-chartctl-package-all on chartmuseum2001 is OK: OK: Status of the systemd unit helm-chartctl-package-all https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:16:23] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[01:16:31] Operations, serviceops: reinstall xhgui* with buster - https://phabricator.wikimedia.org/T259206 (Dzahn) p:Triage→High
[01:16:41] Operations, serviceops: reinstall xhgui* with buster - https://phabricator.wikimedia.org/T259206 (Dzahn) a:Dzahn
[01:17:29] RECOVERY - Check systemd state on chartmuseum2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:18:13] Operations, serviceops: reinstall xhgui* with buster - https://phabricator.wikimedia.org/T259206 (Dzahn)
[01:20:09] PROBLEM - Check the last execution of helm-chartctl-package-all on chartmuseum2001 is CRITICAL: CRITICAL: Status of the systemd unit helm-chartctl-package-all https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:20:33] PROBLEM - Check systemd state on chartmuseum2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:22:56] !log imported in apt.wikimedia.org for buster: php-slim, php-slim-views, php-perftools-xhgui-collector, php-pimple, php-psr-http-server-middleware, php-psr-http-server-handler, xhgui
[01:23:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:30:20] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp2009.codfw.wmnet
[01:30:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:33:50] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[01:33:51] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[01:33:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:33:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:42:26] RECOVERY - Check size of conntrack table on wtp2010 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[01:42:47] RECOVERY - Check systemd state on chartmuseum2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:51:09] RECOVERY - Check the last execution of helm-chartctl-package-all on chartmuseum2001 is OK: OK: Status of the systemd unit helm-chartctl-package-all https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:53:14] !log dpifke@deploy1001 Started deploy [performance/arc-lamp@ad87f69]: Deploying https://gerrit.wikimedia.org/r/c/performance/arc-lamp/+/615302
[01:53:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:53:19] !log dpifke@deploy1001 Finished deploy [performance/arc-lamp@ad87f69]: Deploying https://gerrit.wikimedia.org/r/c/performance/arc-lamp/+/615302 (duration: 00m 05s)
[01:53:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:59:07] (PS4) Dave Pifke: arclamp: Run & scrape Prometheus exporter [puppet] - https://gerrit.wikimedia.org/r/613359 (https://phabricator.wikimedia.org/T256035)
[02:11:47] RECOVERY - parsoid on wtp2010 is OK: HTTP OK: HTTP/1.1 200 OK - 1022 bytes in 0.175 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/parsoid
[02:12:45] Operations, serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp2010.codfw.wmnet'] ` and were **ALL** successful.
[02:18:55] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp2010.codfw.wmnet
[02:18:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:19:31] Operations, serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (Dzahn) wtp2002 through wtp2010 done and repooled.
[02:28:52] Operations, Graphoid, serviceops, Chinese-Sites, Platform Engineering (Icebox): Undeploy graphoid for phase 2 wiki's - https://phabricator.wikimedia.org/T258463 (Jseddon) Open→Resolved
[02:28:55] Operations, Graphoid, serviceops, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), Platform Engineering (Icebox): Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (Jseddon)
[02:29:46] Operations, Graphoid, serviceops, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), Platform Engineering (Icebox): Undeploy graphoid for phase 3 wiki's - https://phabricator.wikimedia.org/T259207 (Jseddon)
[02:29:57] Operations, Graphoid, serviceops, Platform Engineering (Icebox): Undeploy graphoid for phase 3 wiki's - https://phabricator.wikimedia.org/T259207 (Jseddon)
[03:44:37] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:46:31] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:54:09] RECOVERY - exim queue on mx1001 is OK: OK: Less than 2000 mails in exim queue. https://wikitech.wikimedia.org/wiki/Exim
[04:41:23] (PS2) Catrope: Enable and configure GrowthExperiments on hewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/616959 (https://phabricator.wikimedia.org/T255020)
[05:26:01] !log Deploy MCR schema change on labswiki (wikitech) T238966
[05:26:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:26:07] T238966: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966
[06:10:39] Operations, MediaWiki-extensions-Score, Security-Team, Wikimedia-General-or-Unknown, and 3 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (RhinosF1) Will there be any disclosure of the issue so 3rd party sites that followed suit in disablin...
[06:47:09] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:53:51] (CR) JMeybohm: [C: -1] "I think this is a great addition!" (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/582792 (https://phabricator.wikimedia.org/T244843) (owner: Giuseppe Lavagetto)
[06:55:41] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[06:57:32] !log upload druid_0.19.0-1 packages to buster-wikimedia
[06:57:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:46] (upgrade of one cluster planned for today)
[07:01:21] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:05:07] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:07:33] (PS1) Elukey: setup.py: skip pytest>= 6.0.0 to avoid prospector failures [software/spicerack] - https://gerrit.wikimedia.org/r/617378
[07:08:47] Operations, MediaWiki-extensions-Score, Security-Team, Wikimedia-General-or-Unknown, and 3 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (tstarling) >>! In T257066#6347460, @RhinosF1 wrote: > Will there be any disclosure of the issue so 3r...
[07:10:25] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 74 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:12:42] (PS1) Filippo Giunchedi: Decom weblog1001 [puppet] - https://gerrit.wikimedia.org/r/617379 (https://phabricator.wikimedia.org/T259217)
[07:16:01] (PS1) Alexandros Kosiaris: Update some comments [software/otrs] - https://gerrit.wikimedia.org/r/617380
[07:16:03] (PS1) Alexandros Kosiaris: Release 1.0.16. First version that only support 6.0.x [software/otrs] - https://gerrit.wikimedia.org/r/617381 (https://phabricator.wikimedia.org/T187984)
[07:16:17] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 49 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:16:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1121', diff saved to https://phabricator.wikimedia.org/P12129 and previous config saved to /var/cache/conftool/dbconfig/20200730-071633-marostegui.json
[07:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:03] (PS1) Elukey: role::druid::public::worker: update settings for 0.19 [puppet] - https://gerrit.wikimedia.org/r/617382 (https://phabricator.wikimedia.org/T244482)
[07:20:25] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 132, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:20:25] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:21:08] (CR) Filippo Giunchedi: [C: +2] Decom weblog1001 [puppet] - https://gerrit.wikimedia.org/r/617379 (https://phabricator.wikimedia.org/T259217) (owner: Filippo Giunchedi)
[07:22:02] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime
[07:22:03] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[07:22:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:10] Operations, MediaWiki-extensions-Score, Security-Team, Wikimedia-General-or-Unknown, and 3 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (RhinosF1) > Yes. I planned on putting out an announcement a couple of days ago, but it turns out to b...
[07:22:17] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 51 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:22:17] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/24228/druid1004.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/617382 (https://phabricator.wikimedia.org/T244482) (owner: Elukey)
[07:31:30] !log filippo@cumin1001 START - Cookbook sre.hosts.decommission
[07:31:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:17] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[07:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:42] (PS1) Marostegui: db2083: Reimage to Buster [puppet] - https://gerrit.wikimedia.org/r/617383 (https://phabricator.wikimedia.org/T250666)
[07:33:21] (CR) Marostegui: [C: +2] db2083: Reimage to Buster [puppet] - https://gerrit.wikimedia.org/r/617383 (https://phabricator.wikimedia.org/T250666) (owner: Marostegui)
[07:35:00] (PS1) Filippo Giunchedi: Decom weblog1001 [dns] - https://gerrit.wikimedia.org/r/617384 (https://phabricator.wikimedia.org/T259217)
[07:36:07] (CR) Alexandros Kosiaris: [V: +2 C: +2] Update some comments [software/otrs] - https://gerrit.wikimedia.org/r/617380 (owner: Alexandros Kosiaris)
[07:36:15] (CR) Filippo Giunchedi: [C: +2] Decom weblog1001 [dns] - https://gerrit.wikimedia.org/r/617384 (https://phabricator.wikimedia.org/T259217) (owner: Filippo Giunchedi)
[07:37:23] (PS6) Kormat: Create debian packages. [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/616846
[07:38:06] (PS2) Alexandros Kosiaris: Release 1.0.16. First version that only support 6.0.x [software/otrs] - https://gerrit.wikimedia.org/r/617381 (https://phabricator.wikimedia.org/T187984)
[07:38:16] Operations, ops-eqiad, decommission-hardware, Patch-For-Review: Decommission weblog1001 (unrack or return to spares) - https://phabricator.wikimedia.org/T259217 (fgiunchedi)
[07:38:55] Operations, ops-eqiad, decommission-hardware, Patch-For-Review: Decommission weblog1001 (unrack or return to spares) - https://phabricator.wikimedia.org/T259217 (fgiunchedi) @Cmjohnson or @Jclark-ctr this host can be either unracked (purchased in 2016) or returned to spare, thanks!
[07:39:03] (PS1) Muehlenhoff: Enable CAS for Hue (WIP) [puppet] - https://gerrit.wikimedia.org/r/617385
[07:39:18] (CR) jerkins-bot: [V: -1] Enable CAS for Hue (WIP) [puppet] - https://gerrit.wikimedia.org/r/617385 (owner: Muehlenhoff)
[07:39:45] (PS2) Muehlenhoff: Enable CAS for Hue (WIP) [puppet] - https://gerrit.wikimedia.org/r/617385
[07:45:08] (CR) Volans: [C: +1] "LGTM although I'm seeing upstream very active and the fix seems already merged into master. I'm keen to maybe wait a day or two to see if " [software/spicerack] - https://gerrit.wikimedia.org/r/617378 (owner: Elukey)
[07:50:21] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime
[07:50:22] (PS1) Alexandros Kosiaris: push-notifications: Remove from CI 2 wrongly added vars [puppet] - https://gerrit.wikimedia.org/r/617387 (https://phabricator.wikimedia.org/T259152)
[07:50:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:57] (PS1) Filippo Giunchedi: profile: ensure only one webrequest host sends 5xx to logstash [puppet] - https://gerrit.wikimedia.org/r/617388 (https://phabricator.wikimedia.org/T247968)
[07:51:52] (CR) Ammarpad: ClosedWikiProvider: Use testUserForCreation rather than testForAuthentication (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/615723 (https://phabricator.wikimedia.org/T258695) (owner: Urbanecm)
[07:52:12] (CR) jerkins-bot: [V: -1] profile: ensure only one webrequest host sends 5xx to logstash [puppet] - https://gerrit.wikimedia.org/r/617388 (https://phabricator.wikimedia.org/T247968) (owner: Filippo Giunchedi)
[07:52:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[07:52:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:33] (PS2) Filippo Giunchedi: profile: ensure only one webrequest host sends 5xx to logstash [puppet] - https://gerrit.wikimedia.org/r/617388 (https://phabricator.wikimedia.org/T247968)
[07:55:23] (CR) Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1003/24230/" [puppet] - https://gerrit.wikimedia.org/r/617388 (https://phabricator.wikimedia.org/T247968) (owner: Filippo Giunchedi)
[07:55:37] (CR) Alexandros Kosiaris: [C: +2] push-notifications: Remove from CI 2 wrongly added vars [puppet] - https://gerrit.wikimedia.org/r/617387 (https://phabricator.wikimedia.org/T259152) (owner: Alexandros Kosiaris)
[07:57:31] (PS10) Alexandros Kosiaris: traffic: Add ticket-test.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/614759 (https://phabricator.wikimedia.org/T187984)
[08:01:09] (CR) Alexandros Kosiaris: [C: -1] "Looks pretty ok to me, comments inline. It also misses a commit for admin/common/calico/default-kubernetes-policy.yaml" (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/616467 (https://phabricator.wikimedia.org/T254906) (owner: Hnowlan)
[08:02:29] ema: I am gonna merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/614759 and corresponding DNS record for adding the OTRS migration endpoint.
[08:02:39] migration testing more correctly
[08:03:45] akosiaris: let me take a quick look
[08:04:25] akosiaris: any chance to get that behind TLS? :)
[08:04:50] ema: niah, not worth it. It's for like 1 month or so
[08:04:56] fair enough
[08:06:16] (CR) Ema: [C: +1] traffic: Add ticket-test.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/614759 (https://phabricator.wikimedia.org/T187984) (owner: Alexandros Kosiaris)
[08:11:25] thanks ema!
[08:13:40] (PS1) Muehlenhoff: Fix up ERB templates for Hue/CAS [puppet] - https://gerrit.wikimedia.org/r/617391
[08:14:40] (PS12) Ayounsi: Initial templating for CR routing-options [homer/public] - https://gerrit.wikimedia.org/r/547587
[08:15:05] (CR) jerkins-bot: [V: -1] Initial templating for CR routing-options [homer/public] - https://gerrit.wikimedia.org/r/547587 (owner: Ayounsi)
[08:16:53] (PS1) Marostegui: db2083: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/617392
[08:17:48] (CR) Marostegui: [C: +2] db2083: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/617392 (owner: Marostegui)
[08:18:02] (CR) Muehlenhoff: [C: +2] Fix up ERB templates for Hue/CAS [puppet] - https://gerrit.wikimedia.org/r/617391 (owner: Muehlenhoff)
[08:19:28] (PS1) Filippo Giunchedi: Revert "wikimedia: failover to netmon2001" [dns] - https://gerrit.wikimedia.org/r/617393 (https://phabricator.wikimedia.org/T247967)
[08:20:59] (PS1) Filippo Giunchedi: Revert "Failover to netmon2001" [puppet] - https://gerrit.wikimedia.org/r/617395 (https://phabricator.wikimedia.org/T247967)
[08:21:26] Operations, Readers-Web-Backlog, WMF-Legal, SEO: (Automate) adding wikinews language versions to the Google Publisher Center / Google News - https://phabricator.wikimedia.org/T254437 (Aklapper) p:High→Medium @Nonovian: Resetting priority to reflect reality. See https://www.mediawiki.org/w...
[08:22:18] Operations, netops, observability, Patch-For-Review, User-fgiunchedi: Migrate role::netmon to Buster - https://phabricator.wikimedia.org/T247967 (fgiunchedi)
[08:28:16] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:28:34] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:29:00] (PS1) Muehlenhoff: Hue/CAS: Fix one more occurance of the port being passed to the template [puppet] - https://gerrit.wikimedia.org/r/617397
[08:34:21] (CR) Muehlenhoff: [C: +2] Hue/CAS: Fix one more occurance of the port being passed to the template [puppet] - https://gerrit.wikimedia.org/r/617397 (owner: Muehlenhoff)
[08:34:44] (Abandoned) Jcrespo: mariadb backups: Include extra valid sections on checking script [puppet] - https://gerrit.wikimedia.org/r/538885 (https://phabricator.wikimedia.org/T231208) (owner: Jcrespo)
[08:40:50] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on logstash1011 is CRITICAL: 102.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37
[08:41:07] (PS1) Muehlenhoff: Add IDP service definition for Hue [puppet] - https://gerrit.wikimedia.org/r/617398
[08:41:10] (CR) Filippo Giunchedi: [C: +2] Revert "Failover to netmon2001" [puppet] - https://gerrit.wikimedia.org/r/617395 (https://phabricator.wikimedia.org/T247967) (owner: Filippo Giunchedi)
[08:43:41] !log flip smokeping/librenms from netmon2001 to netmon1002 - T247967
[08:43:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:49] T247967: Migrate role::netmon to Buster - https://phabricator.wikimedia.org/T247967
[08:44:22] (CR) Filippo Giunchedi: [C: +2] Revert "wikimedia: failover to netmon2001" [dns] - https://gerrit.wikimedia.org/r/617393 (https://phabricator.wikimedia.org/T247967) (owner: Filippo Giunchedi)
[08:49:27] Operations, Mail: Forwarding or alias for fundraising@ - https://phabricator.wikimedia.org/T252932 (Aklapper) @JGulingan: Could you please answer the last comment(s)? Thanks in advance!
[08:53:30] (CR) Jcrespo: [C: +2] "This may have been your best patch to day- I couldn't find any bug on my testing, and I tried hard." [software/transferpy] - https://gerrit.wikimedia.org/r/613128 (https://phabricator.wikimedia.org/T257602) (owner: Privacybatm)
[08:54:00] (Merged) jenkins-bot: Make transferpy configurable using a configuration file [software/transferpy] - https://gerrit.wikimedia.org/r/613128 (https://phabricator.wikimedia.org/T257602) (owner: Privacybatm)
[08:57:11] (PS3) Muehlenhoff: Enable CAS for Hue [puppet] - https://gerrit.wikimedia.org/r/617385
[08:59:08] Operations, netops, observability, Patch-For-Review, User-fgiunchedi: Migrate role::netmon to Buster - https://phabricator.wikimedia.org/T247967 (fgiunchedi)
[08:59:27] (PS13) Ayounsi: Initial templating for CR routing-options [homer/public] - https://gerrit.wikimedia.org/r/547587
[08:59:47] (CR) Vgutierrez: [C: +2] Add .gitignore [software/acme-chief] - https://gerrit.wikimedia.org/r/617175 (owner: BryanDavis)
[08:59:49] (CR) jerkins-bot: [V: -1] Initial templating for CR routing-options [homer/public] - https://gerrit.wikimedia.org/r/547587 (owner: Ayounsi)
[09:02:39] (Merged) jenkins-bot: Add .gitignore [software/acme-chief] - https://gerrit.wikimedia.org/r/617175 (owner: BryanDavis)
[09:02:44] (Merged) jenkins-bot: acme_chief: Profide .chained.crt.key file support [software/acme-chief] - https://gerrit.wikimedia.org/r/617176 (https://phabricator.wikimedia.org/T255249) (owner: BryanDavis)
[09:02:59] (Merged) jenkins-bot: api: Allow acme-chief clients to fetch .chained.crt.key files [software/acme-chief] - https://gerrit.wikimedia.org/r/617177 (https://phabricator.wikimedia.org/T255249) (owner: BryanDavis)
[09:03:06] Profide... my OCD is crying a little bit
[09:03:41] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[09:06:40] (PS14) Ayounsi: Initial templating for CR routing-options [homer/public] - https://gerrit.wikimedia.org/r/547587
[09:06:54] Operations, netops, observability, Patch-For-Review, User-fgiunchedi: Migrate role::netmon to Buster - https://phabricator.wikimedia.org/T247967 (fgiunchedi) Open→Resolved a:fgiunchedi This is complete! All netmon hosts are running Buster
[09:06:57] Operations, Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (fgiunchedi)
[09:06:59] (CR) jerkins-bot: [V: -1] Initial templating for CR routing-options [homer/public] - https://gerrit.wikimedia.org/r/547587 (owner: Ayounsi)
[09:08:01] (PS15) Ayounsi: Initial templating for CR routing-options [homer/public] - https://gerrit.wikimedia.org/r/547587
[09:08:15] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[09:09:12] (CR) Ayounsi: "Ready for reviews!" [homer/public] - https://gerrit.wikimedia.org/r/547587 (owner: Ayounsi)
[09:10:45] (CR) Ayounsi: Initial templating for CR routing-options (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/547587 (owner: Ayounsi)
[09:12:25] (PS1) Vgutierrez: Release 0.27 [software/acme-chief] - https://gerrit.wikimedia.org/r/617400 (https://phabricator.wikimedia.org/T255249)
[09:13:18] (PS2) Vgutierrez: Fix repository name in .gitreview [software/acme-chief] - https://gerrit.wikimedia.org/r/611309 (owner: Hashar)
[09:17:28] (PS7) Kormat: Create debian packages. [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/616846
[09:20:24] (CR) Vgutierrez: [C: +2] Fix repository name in .gitreview [software/acme-chief] - https://gerrit.wikimedia.org/r/611309 (owner: Hashar)
[09:22:05] (PS4) Privacybatm: transferpy: Create required config directory at the time of deb installation [software/transferpy] - https://gerrit.wikimedia.org/r/612162 (https://phabricator.wikimedia.org/T257599)
[09:22:07] (PS8) Giuseppe Lavagetto: Add local service proxy to the tls terminator v0.2 [deployment-charts] - https://gerrit.wikimedia.org/r/582792 (https://phabricator.wikimedia.org/T244843)
[09:23:46] (Merged) jenkins-bot: Fix repository name in .gitreview [software/acme-chief] - https://gerrit.wikimedia.org/r/611309 (owner: Hashar)
[09:24:51] (CR) Jcrespo: "Maybe it is my environment, but after applying locally this change, I get 2 linting errors:" [software/transferpy] - https://gerrit.wikimedia.org/r/612162 (https://phabricator.wikimedia.org/T257599) (owner: Privacybatm)
[09:25:11] (PS2) Vgutierrez: Release 0.27 [software/acme-chief] - https://gerrit.wikimedia.org/r/617400 (https://phabricator.wikimedia.org/T255249)
[09:25:22] vgutierrez: thank you ;)
[09:25:32] np, sorry about the 20 days delay /o\
[09:25:43] I've just spotted it a few minutes ago
[09:25:53] it is not like it was a life threatening issue ;]
[09:28:13] (PS8) Kormat: Create debian packages. [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/616846 (https://phabricator.wikimedia.org/T259021) [09:29:26] (PS9) Kormat: Create debian packages. [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/616846 (https://phabricator.wikimedia.org/T259021) [09:29:38] (CR) Jcrespo: "> Patch Set 4:" [software/transferpy] - https://gerrit.wikimedia.org/r/612162 (https://phabricator.wikimedia.org/T257599) (owner: Privacybatm) [09:29:52] (CR) Vgutierrez: [C: +2] Release 0.27 [software/acme-chief] - https://gerrit.wikimedia.org/r/617400 (https://phabricator.wikimedia.org/T255249) (owner: Vgutierrez) [09:30:15] (PS1) Marostegui: dbproxy1013,1015: Test db1107 into haproxy [puppet] - https://gerrit.wikimedia.org/r/617402 (https://phabricator.wikimedia.org/T257540) [09:30:59] (PS10) Kormat: Create debian packages. [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/616846 (https://phabricator.wikimedia.org/T259021) [09:31:50] (CR) Kormat: "Ready for review now." [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/616846 (https://phabricator.wikimedia.org/T259021) (owner: Kormat) [09:31:52] (CR) Jcrespo: [C: +2] "Let's merge for now as is, let's consider build cleanup later." 
[software/transferpy] - https://gerrit.wikimedia.org/r/612162 (https://phabricator.wikimedia.org/T257599) (owner: Privacybatm) [09:32:21] (Merged) jenkins-bot: transferpy: Create required config directory at the time of deb installation [software/transferpy] - https://gerrit.wikimedia.org/r/612162 (https://phabricator.wikimedia.org/T257599) (owner: Privacybatm) [09:32:33] (Merged) jenkins-bot: Release 0.27 [software/acme-chief] - https://gerrit.wikimedia.org/r/617400 (https://phabricator.wikimedia.org/T255249) (owner: Vgutierrez) [09:34:00] (CR) Kormat: [C: -1] dbproxy1013,1015: Test db1107 into haproxy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/617402 (https://phabricator.wikimedia.org/T257540) (owner: Marostegui) [09:36:41] (CR) Jcrespo: "Mismatched dependency (see bellow)." (2 comments) [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/616846 (https://phabricator.wikimedia.org/T259021) (owner: Kormat) [09:38:27] (PS11) Kormat: Create debian packages. 
[software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/616846 (https://phabricator.wikimedia.org/T259021) [09:39:38] (CR) Kormat: "> Patch Set 10:" (2 comments) [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/616846 (https://phabricator.wikimedia.org/T259021) (owner: Kormat) [09:39:50] (CR) Marostegui: dbproxy1013,1015: Test db1107 into haproxy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/617402 (https://phabricator.wikimedia.org/T257540) (owner: Marostegui) [09:41:22] (PS1) Elukey: Initial release of wmflib [software/pywmflib] - https://gerrit.wikimedia.org/r/617403 (https://phabricator.wikimedia.org/T257905) [09:42:44] (PS2) Elukey: Initial release of wmflib [software/pywmflib] - https://gerrit.wikimedia.org/r/617403 (https://phabricator.wikimedia.org/T257905) [09:43:58] (CR) Jcrespo: "> Patch Set 11:" (2 comments) [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/616846 (https://phabricator.wikimedia.org/T259021) (owner: Kormat) [09:45:23] (CR) Kormat: "> Oh, sorry, I don't mean to suggest those code changes (or at least not on this patch). So from your answer I understand that both Remote" [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/616846 (https://phabricator.wikimedia.org/T259021) (owner: Kormat) [09:45:40] (CR) Jcrespo: "Also if the linter doesn't let you sleep because of the extension (mysql.py) we can rename it, as long we don't rename it to "mysql" 😊" [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/616846 (https://phabricator.wikimedia.org/T259021) (owner: Kormat) [09:46:30] (CR) Jcrespo: [C: +1] "General, I think this is good, let's merge ASAP, test carefully before deploy +1." 
[software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/616846 (https://phabricator.wikimedia.org/T259021) (owner: Kormat) [09:53:31] (PS2) Marostegui: dbproxy1013,1015: Test db1107 in haproxy [puppet] - https://gerrit.wikimedia.org/r/617402 (https://phabricator.wikimedia.org/T257540) [09:54:03] (CR) Kormat: [C: +1] dbproxy1013,1015: Test db1107 in haproxy [puppet] - https://gerrit.wikimedia.org/r/617402 (https://phabricator.wikimedia.org/T257540) (owner: Marostegui) [09:56:01] (CR) Privacybatm: "> Patch Set 4:" [software/transferpy] - https://gerrit.wikimedia.org/r/612162 (https://phabricator.wikimedia.org/T257599) (owner: Privacybatm) [09:57:29] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1012-production-logstash-eqiad on logstash1012 is CRITICAL: 102.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1012&panelId=37 [10:00:04] mvolz: May I have your attention please! Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200730T1000) [10:01:42] (CR) Filippo Giunchedi: [C: +1] "LGTM, just noting that Prometheus will scrape metrics for each arclamp host in each site. Pointing this out in case it is relevant for e.g" [puppet] - https://gerrit.wikimedia.org/r/613359 (https://phabricator.wikimedia.org/T256035) (owner: Dave Pifke) [10:03:01] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:04:38] (CR) Filippo Giunchedi: "Nice job! LGTM overall, see inline. 
Haven't tried to build the package myself though" (2 comments) [debs/prometheus-es-exporter] (debian/sid) - https://gerrit.wikimedia.org/r/617250 (https://phabricator.wikimedia.org/T222826) (owner: Cwhite) [10:06:12] (PS5) Hnowlan: api-gateway: add helmfile.d configuration [deployment-charts] - https://gerrit.wikimedia.org/r/616467 (https://phabricator.wikimedia.org/T254906) [10:06:47] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:08:40] (CR) Filippo Giunchedi: "LGTM overall, a few questions:" [puppet] - https://gerrit.wikimedia.org/r/617260 (https://phabricator.wikimedia.org/T256418) (owner: Cwhite) [10:12:23] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:13:43] <_joe_> this is again parsoid ^^ [10:13:48] <_joe_> I need to open a task [10:14:07] (PS1) JMeybohm: Support Swift V1 reauth [debs/chartmuseum] - https://gerrit.wikimedia.org/r/617409 (https://phabricator.wikimedia.org/T259221) [10:14:25] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 50 probes of 570 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:14:40] <_joe_> oh there is already https://phabricator.wikimedia.org/T258499 [10:16:34] (PS12) Kormat: Create debian packages. 
[software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/616846 (https://phabricator.wikimedia.org/T259021) [10:18:01] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:18:24] Operations, Parsoid, Platform Engineering, Wikimedia-production-error: Investigate Parsoid Exception: Invariant failed: Bad UTF-8 at end of string (2 byte sequence) - https://phabricator.wikimedia.org/T258499 (Joe) [10:27:04] Operations, netops: OSPF metrics - https://phabricator.wikimedia.org/T200277 (ayounsi) New proposal! Change to T200277#6077728 is to use the following fields: * `metric` - keeps things more generic than `latency` * `state` - choice between `default`, `preferred`, `drained` [10:35:18] (CR) Giuseppe Lavagetto: [C: +1] Support Swift V1 reauth [debs/chartmuseum] - https://gerrit.wikimedia.org/r/617409 (https://phabricator.wikimedia.org/T259221) (owner: JMeybohm) [10:37:15] (CR) Volans: "Thanks for the bootstrap, really nice! Some small nits inline, mostly metadata." 
(23 comments) [software/pywmflib] - https://gerrit.wikimedia.org/r/617403 (https://phabricator.wikimedia.org/T257905) (owner: Elukey) [10:37:19] (CR) Marostegui: [C: +2] dbproxy1013,1015: Test db1107 in haproxy [puppet] - https://gerrit.wikimedia.org/r/617402 (https://phabricator.wikimedia.org/T257540) (owner: Marostegui) [10:38:02] !log Reload haproxy on dbproxy1013 and dbproxy10165 [10:38:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:58] (CR) Jcrespo: [C: -1] "This is what I get from executing this:" [software/transferpy] - https://gerrit.wikimedia.org/r/614686 (https://phabricator.wikimedia.org/T256450) (owner: Privacybatm) [10:42:36] Amir1, awight, Urbanecm, and Lucas_WMDE, I'd like to promote train to group1 earlier rather than later, so that we can promote to group2 later today, if all goes well; do you mind if I do it right before the backport window? should only take a few minutes [10:43:05] fine by me, looks like there’s nothing scheduled to backport yet anyways [10:43:07] liw: afaics, there is no patch scheduled for the backport window [10:43:20] (CR) Jcrespo: "Q:" (1 comment) [software/transferpy] - https://gerrit.wikimedia.org/r/617071 (https://phabricator.wikimedia.org/T257601) (owner: Privacybatm) [10:43:20] So yes, go ahead, and feel free to go into the window [10:43:30] ack, will do [10:44:40] (PS1) Lars Wirzenius: group1 wikis to 1.36.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/617410 [10:44:42] (CR) Lars Wirzenius: [C: +2] group1 wikis to 1.36.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/617410 (owner: Lars Wirzenius) [10:45:23] (Merged) jenkins-bot: group1 wikis to 1.36.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/617410 (owner: Lars Wirzenius) [10:46:03] (CR) Jcrespo: "Looks good otherwise." 
(1 comment) [software/transferpy] - https://gerrit.wikimedia.org/r/617068 (https://phabricator.wikimedia.org/T257601) (owner: Privacybatm) [10:46:41] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/617398 (owner: Muehlenhoff) [10:46:58] Operations, FR-Q2-FY2020-21-cleanup-list, Product-Infrastructure-Team-Backlog, Security, Services (next): Protect sensitive user-related information with a UserData / auth / session service - https://phabricator.wikimedia.org/T140813 (Jstone6336) p:Medium→Unbreak! [10:47:31] (CR) Jcrespo: "Could this be an unmarked dependency or the other patch and that is why it is failing to me?" [software/transferpy] - https://gerrit.wikimedia.org/r/615174 (https://phabricator.wikimedia.org/T257601) (owner: Privacybatm) [10:47:36] (CR) Hnowlan: api-gateway: add helmfile.d configuration (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/616467 (https://phabricator.wikimedia.org/T254906) (owner: Hnowlan) [10:48:25] (CR) Privacybatm: "> Patch Set 6: Code-Review-1" [software/transferpy] - https://gerrit.wikimedia.org/r/614686 (https://phabricator.wikimedia.org/T256450) (owner: Privacybatm) [10:48:43] !log liw@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.36.0-wmf.2 [10:48:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:15] Operations, Product-Infrastructure-Team-Backlog, Security, Services (next): Protect sensitive user-related information with a UserData / auth / session service - https://phabricator.wikimedia.org/T140813 (Urbanecm) p:Unbreak!→Medium [10:49:51] !log liw@deploy1001 Synchronized php: group1 wikis to 1.36.0-wmf.2 (duration: 01m 07s) [10:49:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:00] and done [10:50:01] liw: hi, ad https://phabricator.wikimedia.org/T257970#6346255, that task seems to exist for quite some time? 
[10:53:27] Urbanecm, hm? I don't understand what you mean [10:54:41] liw: in that comment, you mention you reported T157997, but that task seems to exist for years [10:54:42] T157997: BannerExistenceException due to non-existing CentralNotice banner (after Special:LanguageStats view) - https://phabricator.wikimedia.org/T157997 [10:56:33] liw: This deployment's sound track was Queen's "We will rock you". :) nice soundtrack, comment made me smile [10:56:59] Urbanecm, er, yeah, bad wording on my part. I should've said "we added to T... that this is still/again happening" or something like that [10:57:13] (CR) Alexandros Kosiaris: [C: +2] traffic: Add ticket-test.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/614759 (https://phabricator.wikimedia.org/T187984) (owner: Alexandros Kosiaris) [10:57:16] RhinosF1, :) [10:57:47] (PS2) Alexandros Kosiaris: Temporarily add ticket-test.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/616532 (https://phabricator.wikimedia.org/T187984) [10:58:02] (CR) Alexandros Kosiaris: [C: +2] Temporarily add ticket-test.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/616532 (https://phabricator.wikimedia.org/T187984) (owner: Alexandros Kosiaris) [10:58:07] liw: Aha, gotcha [10:58:46] I'm not always entirely calm and serene enough to be clear and unambiguous during train week :( [10:59:38] (do point out if I'm unclear or wrong, thanks) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do European mid-day backport window(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200730T1100). [11:00:49] liw: is anyone always calm and serene? 
You do well [11:01:02] Operations, Continuous-Integration-Infrastructure, Product-Infrastructure-Team-Backlog, Push-Notification-Service, serviceops: puppet errors on contint servers related to helmfiles for push-notifications - https://phabricator.wikimedia.org/T259152 (akosiaris) Open→Resolved a:akosia... [11:01:02] * RhinosF1 likes the idea of deployments with a soundtrack [11:01:11] RECOVERY - Rate of JVM GC Old generation-s runs - logstash1012-production-logstash-eqiad on logstash1012 is OK: (C)100 gt (W)80 gt 77.29 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1012&panelId=37 [11:03:01] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:05:34] liw: ^ [11:05:44] (CR) Privacybatm: "> Patch Set 8:" [software/transferpy] - https://gerrit.wikimedia.org/r/615174 (https://phabricator.wikimedia.org/T257601) (owner: Privacybatm) [11:06:05] (CR) Jbond: [C: +1] "checked the proposed config matches current and logic all looks good to me" [homer/public] - https://gerrit.wikimedia.org/r/547587 (owner: Ayounsi) [11:06:47] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:17:39] (PS1) Urbanecm: sysop_itwiki: Add WP as an alias for NS_PROJECT [mediawiki-config] - https://gerrit.wikimedia.org/r/617411 (https://phabricator.wikimedia.org/T259243) [11:17:41] (PS1) Urbanecm: sysop_itwiki: Add several pages to wgWhitelistRead [mediawiki-config] - https://gerrit.wikimedia.org/r/617412 (https://phabricator.wikimedia.org/T259243) [11:17:43] (PS1) Urbanecm: Add import sources to sysop_itwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/617413 (https://phabricator.wikimedia.org/T259243) [11:17:45] (PS1) Urbanecm: sysop_itwiki: Set favicon to Wikimedia_logo_blue.svg [mediawiki-config] - https://gerrit.wikimedia.org/r/617414 (https://phabricator.wikimedia.org/T259243) [11:18:44] (CR) jerkins-bot: [V: -1] sysop_itwiki: Set favicon to Wikimedia_logo_blue.svg [mediawiki-config] - https://gerrit.wikimedia.org/r/617414 (https://phabricator.wikimedia.org/T259243) (owner: Urbanecm) [11:20:41] Amir1, sorry, was on other desktop watching logstash... 
seems to have sorted itself out now [11:22:51] (PS7) Privacybatm: Transferer.py: Add proper cleanup [software/transferpy] - https://gerrit.wikimedia.org/r/614686 (https://phabricator.wikimedia.org/T256450) [11:23:07] (CR) Jbond: [C: +2] sudo: create new command to create safe wildcard sudo commands [puppet] - https://gerrit.wikimedia.org/r/616717 (https://phabricator.wikimedia.org/T258943) (owner: Jbond) [11:23:13] (CR) jerkins-bot: [V: -1] Transferer.py: Add proper cleanup [software/transferpy] - https://gerrit.wikimedia.org/r/614686 (https://phabricator.wikimedia.org/T256450) (owner: Privacybatm) [11:23:39] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:23:56] (CR) Jbond: [C: +1] Enable CAS for Hue [puppet] - https://gerrit.wikimedia.org/r/617385 (owner: Muehlenhoff) [11:25:21] (PS2) Urbanecm: sysop_itwiki: Set favicon to Wikimedia_logo_blue.svg [mediawiki-config] - https://gerrit.wikimedia.org/r/617414 (https://phabricator.wikimedia.org/T259243) [11:26:05] this MW alert is annoying :/ [11:26:38] (CR) Jbond: [C: +2] thanos: add thanos.wikimedia.org to the cache layer [puppet] - https://gerrit.wikimedia.org/r/617105 (https://phabricator.wikimedia.org/T151009) (owner: Jbond) [11:26:43] (PS6) Jbond: thanos: add thanos.wikimedia.org to the cache layer [puppet] - https://gerrit.wikimedia.org/r/617105 (https://phabricator.wikimedia.org/T151009) [11:27:25] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:33:36] (PS8) Privacybatm: Transferer.py: Add proper cleanup [software/transferpy] - https://gerrit.wikimedia.org/r/614686 (https://phabricator.wikimedia.org/T256450) [11:33:58] (CR) jerkins-bot: [V: -1] Transferer.py: Add proper cleanup [software/transferpy] - https://gerrit.wikimedia.org/r/614686 (https://phabricator.wikimedia.org/T256450) (owner: Privacybatm) [11:34:51] (PS2) Jbond: thanos: add thanos cname pointing to cache [dns] - https://gerrit.wikimedia.org/r/617107 (https://phabricator.wikimedia.org/T151009) [11:35:10] (PS1) Urbanecm: ClosedWikiProvider: Do not run when $wmgUseCentralAuth is false [mediawiki-config] - https://gerrit.wikimedia.org/r/617415 (https://phabricator.wikimedia.org/T259246) [11:35:25] (CR) Jbond: [C: +2] thanos: add thanos cname pointing to cache [dns] - https://gerrit.wikimedia.org/r/617107 (https://phabricator.wikimedia.org/T151009) (owner: Jbond) [11:37:00] (CR) Urbanecm: [C: +2] sysop_itwiki: Add WP as an alias for NS_PROJECT [mediawiki-config] - https://gerrit.wikimedia.org/r/617411 (https://phabricator.wikimedia.org/T259243) (owner: Urbanecm) [11:37:23] (CR) Ammarpad: [C: +1] "Yes, that's it." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/617415 (https://phabricator.wikimedia.org/T259246) (owner: Urbanecm) [11:37:47] (Merged) jenkins-bot: sysop_itwiki: Add WP as an alias for NS_PROJECT [mediawiki-config] - https://gerrit.wikimedia.org/r/617411 (https://phabricator.wikimedia.org/T259243) (owner: Urbanecm) [11:37:54] (CR) Urbanecm: [C: +2] sysop_itwiki: Add several pages to wgWhitelistRead [mediawiki-config] - https://gerrit.wikimedia.org/r/617412 (https://phabricator.wikimedia.org/T259243) (owner: Urbanecm) [11:38:41] (Merged) jenkins-bot: sysop_itwiki: Add several pages to wgWhitelistRead [mediawiki-config] - https://gerrit.wikimedia.org/r/617412 (https://phabricator.wikimedia.org/T259243) (owner: Urbanecm) [11:39:09] (CR) Urbanecm: [C: +2] "thanks Ammarpad, let's ship this" [mediawiki-config] - https://gerrit.wikimedia.org/r/617415 (https://phabricator.wikimedia.org/T259246) (owner: Urbanecm) [11:39:32] (PS1) Jbond: thanos: correct typo [dns] - https://gerrit.wikimedia.org/r/617417 [11:39:34] (CR) Privacybatm: "Why flake8 saying it as redefinition?" 
[software/transferpy] - https://gerrit.wikimedia.org/r/614686 (https://phabricator.wikimedia.org/T256450) (owner: Privacybatm) [11:39:40] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 5ea4bc87ee3b5ed1ef4f7cf2b1068e678f6eb42c: sysop_itwiki: Add WP as an alias for NS_PROJECT (T259243) (duration: 01m 08s) [11:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:47] T259243: Config tweaks for sysop-it.wikipedia - https://phabricator.wikimedia.org/T259243 [11:39:53] (Merged) jenkins-bot: ClosedWikiProvider: Do not run when $wmgUseCentralAuth is false [mediawiki-config] - https://gerrit.wikimedia.org/r/617415 (https://phabricator.wikimedia.org/T259246) (owner: Urbanecm) [11:40:04] (CR) Jbond: [C: +2] thanos: correct typo [dns] - https://gerrit.wikimedia.org/r/617417 (owner: Jbond) [11:41:19] (CR) Urbanecm: [C: +2] Add import sources to sysop_itwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/617413 (https://phabricator.wikimedia.org/T259243) (owner: Urbanecm) [11:42:03] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 7aa0c2361a1e0e363700d54a109b81497b5a045b: sysop_itwiki: Add several pages to wgWhitelistRead (T259243) (duration: 01m 06s) [11:42:04] (Merged) jenkins-bot: Add import sources to sysop_itwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/617413 (https://phabricator.wikimedia.org/T259243) (owner: Urbanecm) [11:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:10] !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: fc5de151ee2c9cf5c64a7d13b2e65e39bb349296: ClosedWikiProvider: Do not run when $wmgUseCentralAuth is false (T259246) (duration: 01m 07s) [11:44:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:16] T259246: Class 'CentralAuthUser' not found - https://phabricator.wikimedia.org/T259246 [11:45:10] (PS3) Urbanecm: 
sysop_itwiki: Set favicon to Wikimedia_logo_blue.svg [mediawiki-config] - https://gerrit.wikimedia.org/r/617414 (https://phabricator.wikimedia.org/T259243) [11:45:18] (CR) Urbanecm: [C: +2] sysop_itwiki: Set favicon to Wikimedia_logo_blue.svg [mediawiki-config] - https://gerrit.wikimedia.org/r/617414 (https://phabricator.wikimedia.org/T259243) (owner: Urbanecm) [11:45:57] (Merged) jenkins-bot: sysop_itwiki: Set favicon to Wikimedia_logo_blue.svg [mediawiki-config] - https://gerrit.wikimedia.org/r/617414 (https://phabricator.wikimedia.org/T259243) (owner: Urbanecm) [11:46:32] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: fc48441f26afc6f4e97c5f7e96185d04cacb0f4b: Add import sources to sysop_itwiki (T259243) (duration: 01m 08s) [11:46:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:37] T259243: Config tweaks for sysop-it.wikipedia - https://phabricator.wikimedia.org/T259243 [11:47:56] (CR) Elukey: [C: +2] role::druid::public::worker: update settings for 0.19 [puppet] - https://gerrit.wikimedia.org/r/617382 (https://phabricator.wikimedia.org/T244482) (owner: Elukey) [11:48:29] !log urbanecm@deploy1001 Synchronized static/favicon/wmf-blue.ico: 399e9c5d949ade5f574ef965e05288b7253c3c3e: sysop_itwiki: Set favicon to Wikimedia_logo_blue.svg (T259243; 1/2) (duration: 01m 06s) [11:48:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:32] (PS1) Urbanecm: Revert "sysop_itwiki: Set favicon to Wikimedia_logo_blue.svg" [mediawiki-config] - https://gerrit.wikimedia.org/r/617427 (https://phabricator.wikimedia.org/T259243) [11:49:38] (CR) Urbanecm: [C: +2] Revert "sysop_itwiki: Set favicon to Wikimedia_logo_blue.svg" [mediawiki-config] - https://gerrit.wikimedia.org/r/617427 (https://phabricator.wikimedia.org/T259243) (owner: Urbanecm) [11:50:23] (Merged) jenkins-bot: Revert "sysop_itwiki: Set favicon to Wikimedia_logo_blue.svg" 
[mediawiki-config] - https://gerrit.wikimedia.org/r/617427 (https://phabricator.wikimedia.org/T259243) (owner: Urbanecm) [11:50:25] Operations, OTRS, serviceops, Patch-For-Review, User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (akosiaris) An update: The upgrade on the new node using a test database has progressed ok. A couple of issues met: The script ./DBUpdate... [11:53:14] !log urbanecm@deploy1001 Synchronized static/favicon/: c08f774b9b05cb9c5faf692c59dd45bf5d65b557: Revert "sysop_itwiki: Set favicon to Wikimedia_logo_blue.svg" (T259243) (duration: 01m 06s) [11:53:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:19] T259243: Config tweaks for sysop-it.wikipedia - https://phabricator.wikimedia.org/T259243 [11:54:11] (PS9) Privacybatm: Transferer.py: Add proper cleanup [software/transferpy] - https://gerrit.wikimedia.org/r/614686 (https://phabricator.wikimedia.org/T256450) [11:54:55] (PS1) Ayounsi: Netbox: add circuits support [software/homer] - https://gerrit.wikimedia.org/r/617418 [11:55:10] (CR) Urbanecm: ClosedWikiProvider: Use testUserForCreation rather than testForAuthentication (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/615723 (https://phabricator.wikimedia.org/T258695) (owner: Urbanecm) [11:56:06] (CR) jerkins-bot: [V: -1] Netbox: add circuits support [software/homer] - https://gerrit.wikimedia.org/r/617418 (owner: Ayounsi) [11:56:45] (PS10) Privacybatm: Transferer.py: Add proper cleanup [software/transferpy] - https://gerrit.wikimedia.org/r/614686 (https://phabricator.wikimedia.org/T256450) [11:58:20] (CR) Privacybatm: "I don't know from where the tests are importing os!" 
[software/transferpy] - https://gerrit.wikimedia.org/r/614686 (https://phabricator.wikimedia.org/T256450) (owner: Privacybatm) [11:59:16] Operations, OTRS, serviceops, Patch-For-Review, User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (akosiaris) DNS and edge cache changes have been merged, this is ready to be tested by agents. I 'll ping on the OTRS wiki noticeboard aski... [12:00:05] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200730T1200) [12:02:49] (PS13) Kormat: Create debian packages. [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/616846 (https://phabricator.wikimedia.org/T259021) [12:05:45] (PS1) Muehlenhoff: Make U2F token expiry configurable [puppet] - https://gerrit.wikimedia.org/r/617419 [12:07:00] (CR) jerkins-bot: [V: -1] Make U2F token expiry configurable [puppet] - https://gerrit.wikimedia.org/r/617419 (owner: Muehlenhoff) [12:07:10] !log upgrade of the druid public cluster (serving AQS) from 0.12.3 to 0.19 [12:07:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:53] (PS1) Ema: ATS: add caching rule for thanos-query.discovery.wmnet [puppet] - https://gerrit.wikimedia.org/r/617420 (https://phabricator.wikimedia.org/T151009) [12:09:42] (CR) Ema: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/617420 (https://phabricator.wikimedia.org/T151009) (owner: Ema) [12:10:01] PROBLEM - OTRS SMTP on otrs1001 is CRITICAL: connect to address 10.64.16.39 and port 25: Connection refused https://wikitech.wikimedia.org/wiki/OTRS%23Troubleshooting [12:11:02] (PS2) Muehlenhoff: Make U2F token expiry configurable [puppet] - https://gerrit.wikimedia.org/r/617419 [12:12:19] (CR) jerkins-bot: [V: -1] Make U2F token expiry configurable [puppet] - https://gerrit.wikimedia.org/r/617419 (owner: Muehlenhoff) [12:16:28] 
(Abandoned) Hnowlan: api-gateway: Restrict unauthenticated write HTTP methods, permit read HTTP methods [deployment-charts] - https://gerrit.wikimedia.org/r/613650 (https://phabricator.wikimedia.org/T256769) (owner: Hnowlan) [12:16:40] (CR) Volans: "one suggestion inline as requested on IRC 😊" (1 comment) [software/homer] - https://gerrit.wikimedia.org/r/617418 (owner: Ayounsi) [12:16:42] (PS1) Elukey: profile::druid::historical: set class prefixes like the other daemons [puppet] - https://gerrit.wikimedia.org/r/617421 (https://phabricator.wikimedia.org/T244482) [12:17:56] (CR) jerkins-bot: [V: -1] profile::druid::historical: set class prefixes like the other daemons [puppet] - https://gerrit.wikimedia.org/r/617421 (https://phabricator.wikimedia.org/T244482) (owner: Elukey) [12:21:45] (CR) JMeybohm: [C: +2] Support Swift V1 reauth [debs/chartmuseum] - https://gerrit.wikimedia.org/r/617409 (https://phabricator.wikimedia.org/T259221) (owner: JMeybohm) [12:24:30] (Merged) jenkins-bot: Support Swift V1 reauth [debs/chartmuseum] - https://gerrit.wikimedia.org/r/617409 (https://phabricator.wikimedia.org/T259221) (owner: JMeybohm) [12:25:03] (PS2) Elukey: profile::druid::historical: set class prefixes like the other daemons [puppet] - https://gerrit.wikimedia.org/r/617421 (https://phabricator.wikimedia.org/T244482) [12:26:19] (CR) jerkins-bot: [V: -1] profile::druid::historical: set class prefixes like the other daemons [puppet] - https://gerrit.wikimedia.org/r/617421 (https://phabricator.wikimedia.org/T244482) (owner: Elukey) [12:26:21] (PS1) Urbanecm: sysop_itwiki: Set favicon to Wikimedia_logo_blue.svg [mediawiki-config] - https://gerrit.wikimedia.org/r/617422 (https://phabricator.wikimedia.org/T259243) [12:26:36] (PS3) Muehlenhoff: Make U2F token expiry configurable [puppet] - https://gerrit.wikimedia.org/r/617419 [12:27:48] (PS3) Elukey: profile::druid::historical: set class prefixes 
like the other daemons [puppet] - https://gerrit.wikimedia.org/r/617421 (https://phabricator.wikimedia.org/T244482) [12:27:56] (CR) jerkins-bot: [V: -1] Make U2F token expiry configurable [puppet] - https://gerrit.wikimedia.org/r/617419 (owner: Muehlenhoff) [12:29:21] (CR) Elukey: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1003/24237/" [puppet] - https://gerrit.wikimedia.org/r/617421 (https://phabricator.wikimedia.org/T244482) (owner: Elukey) [12:32:26] (PS1) Vgutierrez: Add .gitignore [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/617423 [12:32:28] (PS1) Vgutierrez: acme_chief: Profide .chained.crt.key file support [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/617424 (https://phabricator.wikimedia.org/T255249) [12:32:30] (PS1) Vgutierrez: api: Allow acme-chief clients to fetch .chained.crt.key files [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/617425 (https://phabricator.wikimedia.org/T255249) [12:32:32] (PS1) Vgutierrez: Release 0.27 [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/617446 (https://phabricator.wikimedia.org/T255249) [12:32:34] (PS1) Vgutierrez: debian: Add release 0.27 to changelog [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/617447 (https://phabricator.wikimedia.org/T255249) [12:33:04] (PS4) Muehlenhoff: Make U2F token expiry configurable [puppet] - https://gerrit.wikimedia.org/r/617419 [12:34:53] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:35:59] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily 
edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:36:09] (03PS11) 10Privacybatm: Transferer.py: Add proper cleanup [software/transferpy] - 10https://gerrit.wikimedia.org/r/614686 (https://phabricator.wikimedia.org/T256450) [12:36:14] (03PS5) 10Muehlenhoff: Make U2F token expiry configurable [puppet] - 10https://gerrit.wikimedia.org/r/617419 [12:36:15] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:36:41] (03CR) 10Vgutierrez: [C: 03+2] Add .gitignore [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/617423 (owner: 10Vgutierrez) [12:36:45] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:36:47] (03CR) 10Vgutierrez: [C: 03+2] acme_chief: Profide .chained.crt.key file support [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/617424 (https://phabricator.wikimedia.org/T255249) (owner: 10Vgutierrez) [12:36:52] (03CR) 10Vgutierrez: [C: 03+2] api: Allow acme-chief clients to fetch .chained.crt.key files [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/617425 (https://phabricator.wikimedia.org/T255249) (owner: 10Vgutierrez) [12:36:58] (03CR) 10Vgutierrez: [C: 03+2] Release 0.27 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/617446 (https://phabricator.wikimedia.org/T255249) (owner: 10Vgutierrez) [12:37:03] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: 
/analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:37:29] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:37:44] (03CR) 10Privacybatm: "Okay! Now things are alright. Please have a look." [software/transferpy] - 10https://gerrit.wikimedia.org/r/614686 (https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm) [12:37:45] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:38:01] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:38:20] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/24239/" [puppet] - 10https://gerrit.wikimedia.org/r/617419 (owner: 10Muehlenhoff) [12:38:31] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:38:39] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:38:49] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:39:15] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:39:21] this is me --^ hiccup about the upgrade [12:39:55] (03Merged) 10jenkins-bot: Add .gitignore [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/617423 (owner: 10Vgutierrez) [12:40:08] (03Merged) 10jenkins-bot: acme_chief: Profide .chained.crt.key file support [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/617424 (https://phabricator.wikimedia.org/T255249) (owner: 10Vgutierrez) [12:40:11] (03Merged) 10jenkins-bot: api: Allow acme-chief clients to fetch .chained.crt.key files [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/617425 (https://phabricator.wikimedia.org/T255249) (owner: 10Vgutierrez) [12:40:23] (03Merged) 10jenkins-bot: Release 0.27 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/617446 (https://phabricator.wikimedia.org/T255249) (owner: 10Vgutierrez) [12:43:36] (03PS9) 10Privacybatm: Firewall.py: Save the target port after reservation [software/transferpy] - 10https://gerrit.wikimedia.org/r/615174 (https://phabricator.wikimedia.org/T257601) [12:46:02] (03CR) 10Hnowlan: [C: 03+2] api-gateway: set CORS headers to allow all domains for API [deployment-charts] - 10https://gerrit.wikimedia.org/r/613160 (https://phabricator.wikimedia.org/T256771) (owner: 10Hnowlan) [12:46:12] (03PS2) 10Hnowlan: api-gateway: set CORS headers to allow all domains for API [deployment-charts] - 10https://gerrit.wikimedia.org/r/613160 (https://phabricator.wikimedia.org/T256771) [12:46:14] (03CR) 10jerkins-bot: [V: 04-1] api-gateway: set 
CORS headers to allow all domains for API [deployment-charts] - 10https://gerrit.wikimedia.org/r/613160 (https://phabricator.wikimedia.org/T256771) (owner: 10Hnowlan) [12:47:04] (03CR) 10Hnowlan: [V: 03+2 C: 03+2] api-gateway: set CORS headers to allow all domains for API [deployment-charts] - 10https://gerrit.wikimedia.org/r/613160 (https://phabricator.wikimedia.org/T256771) (owner: 10Hnowlan) [12:47:06] (03CR) 10Privacybatm: "> Patch Set 2:" (031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/617071 (https://phabricator.wikimedia.org/T257601) (owner: 10Privacybatm) [12:48:08] (03Merged) 10jenkins-bot: api-gateway: set CORS headers to allow all domains for API [deployment-charts] - 10https://gerrit.wikimedia.org/r/613160 (https://phabricator.wikimedia.org/T256771) (owner: 10Hnowlan) [12:48:20] (03PS3) 10Privacybatm: transferpy: Release transferpy 1.0 [software/transferpy] - 10https://gerrit.wikimedia.org/r/617071 (https://phabricator.wikimedia.org/T257601) [12:49:01] (03CR) 10Privacybatm: "> Patch Set 2:" (031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/617068 (https://phabricator.wikimedia.org/T257601) (owner: 10Privacybatm) [12:50:27] (03PS3) 10Privacybatm: transferpy: Improve documentation [software/transferpy] - 10https://gerrit.wikimedia.org/r/617068 (https://phabricator.wikimedia.org/T257601) [12:51:56] (03Abandoned) 10Urbanecm: Add "work" namespace to search results for Bengali Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616609 (owner: 10Bodhisattwa) [12:52:19] (03CR) 10Privacybatm: "> Patch Set 4:" [software/transferpy] - 10https://gerrit.wikimedia.org/r/616282 (https://phabricator.wikimedia.org/T257601) (owner: 10Privacybatm) [12:54:17] (03PS1) 10JMeybohm: chartmuseum: Fix inverted hostname and IP [puppet] - 10https://gerrit.wikimedia.org/r/617455 (https://phabricator.wikimedia.org/T253843) [12:54:23] (03PS4) 10Jbond: storeconfigs: add debug option to test $settings variable [puppet] - 
10https://gerrit.wikimedia.org/r/617156 [12:55:41] (03CR) 10jerkins-bot: [V: 04-1] storeconfigs: add debug option to test $settings variable [puppet] - 10https://gerrit.wikimedia.org/r/617156 (owner: 10Jbond) [12:57:34] (03PS2) 10Vgutierrez: debian: Add release 0.27 to changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/617447 (https://phabricator.wikimedia.org/T255249) [12:58:22] (03CR) 10Ema: [C: 03+1] chartmuseum: Fix inverted hostname and IP [puppet] - 10https://gerrit.wikimedia.org/r/617455 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [12:59:27] (03CR) 10JMeybohm: [C: 03+2] chartmuseum: Fix inverted hostname and IP [puppet] - 10https://gerrit.wikimedia.org/r/617455 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [13:00:04] liw and brennen: Your horoscope predicts another unfortunate Mediawiki train - European+American Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200730T1300). 
[13:02:12] !log imported chartmuseum_0.12.0-3 to buster-wikimedia [13:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:01] (03PS6) 10Muehlenhoff: Make U2F token expiry configurable [puppet] - 10https://gerrit.wikimedia.org/r/617419 (https://phabricator.wikimedia.org/T258029) [13:19:12] (03CR) 10Jbond: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/617419 (https://phabricator.wikimedia.org/T258029) (owner: 10Muehlenhoff) [13:20:33] (03PS1) 10Jbond: thanos-sso: create a discover name to be used by the authenticate FE [dns] - 10https://gerrit.wikimedia.org/r/617456 (https://phabricator.wikimedia.org/T258029) [13:21:02] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 51 probes of 570 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:22:19] (03CR) 10Jcrespo: "This looks much cleaner and simple at first look." [software/transferpy] - 10https://gerrit.wikimedia.org/r/614686 (https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm) [13:23:10] (03PS1) 10Jbond: ATS - thanos: update thanos.wikimedia.org to use CNAME based name [puppet] - 10https://gerrit.wikimedia.org/r/617457 (https://phabricator.wikimedia.org/T258029) [13:23:39] (03PS2) 10Jbond: thanos-sso: create a discover name to be used by the authenticate FE [dns] - 10https://gerrit.wikimedia.org/r/617456 (https://phabricator.wikimedia.org/T258029) [13:25:33] 10Operations, 10OTRS, 10serviceops, 10Patch-For-Review, 10User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (10jcrespo) It takes very little to load another snapshot if you think you need it. 
[13:26:14] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 48 probes of 570 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:26:41] (03CR) 10Muehlenhoff: [C: 03+2] Add IDP service definition for Hue [puppet] - 10https://gerrit.wikimedia.org/r/617398 (owner: 10Muehlenhoff) [13:27:32] (03PS14) 10Kormat: Create debian packages. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/616846 (https://phabricator.wikimedia.org/T259021) [13:39:28] (03CR) 10Vgutierrez: [C: 03+2] "merging after checking that it builds as expected. I need to check with hashar what's going on with debian-glue" [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/617447 (https://phabricator.wikimedia.org/T255249) (owner: 10Vgutierrez) [13:39:56] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:40:19] (03CR) 10Muehlenhoff: [C: 03+2] Make U2F token expiry configurable [puppet] - 10https://gerrit.wikimedia.org/r/617419 (https://phabricator.wikimedia.org/T258029) (owner: 10Muehlenhoff) [13:42:17] (03Merged) 10jenkins-bot: debian: Add release 0.27 to changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/617447 (https://phabricator.wikimedia.org/T255249) (owner: 10Vgutierrez) [13:44:02] (03PS1) 10Marostegui: Revert "dbproxy1013,1015: Test db1107 in haproxy" [puppet] - 10https://gerrit.wikimedia.org/r/617428 [13:44:43] (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy1013,1015: Test db1107 in haproxy" [puppet] - 10https://gerrit.wikimedia.org/r/617428 (owner: 10Marostegui) [13:45:02] RECOVERY - MediaWiki exceptions and fatals 
per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:46:06] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, though please add a comment with a brief explanation of why this is in place (plus the task #)" [dns] - 10https://gerrit.wikimedia.org/r/617456 (https://phabricator.wikimedia.org/T258029) (owner: 10Jbond) [13:46:31] !log installing qemu security updates on Buster [13:46:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:52] (03CR) 10Filippo Giunchedi: [C: 03+1] ATS - thanos: update thanos.wikimedia.org to use CNAME based name [puppet] - 10https://gerrit.wikimedia.org/r/617457 (https://phabricator.wikimedia.org/T258029) (owner: 10Jbond) [13:47:01] !log upload acme-chief 0.27 to apt.wm.o (buster) - T255249 [13:47:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:07] T255249: acme-chief: support for generating a concatenated cert/key file - https://phabricator.wikimedia.org/T255249 [13:47:25] !log upgrade acme-chief to version 0.27 - T255249 [13:47:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:27] !log volans@cumin1001 START - Cookbook sre.dns.netbox [13:50:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:58] (03PS1) 10Kormat: passwords: Add dummy mysql replication password [labs/private] - 10https://gerrit.wikimedia.org/r/617473 [13:53:18] (03CR) 10Marostegui: [C: 03+1] passwords: Add dummy mysql replication password [labs/private] - 10https://gerrit.wikimedia.org/r/617473 (owner: 10Kormat) [13:53:45] kormat: don't put marostegui's passwords in a public repo please [13:54:03] it is not a good thing to do to a colleague [13:54:30] haha [13:54:35] :D [13:54:51] elukey: i don't really think of marostegui as a colleague. 
more of a work-place hazard, really. [13:55:15] kormat: I know that you are a wise man, you are right [13:55:30] (03PS3) 10Hnowlan: api-gateway: proxy clusters interface through Envoy [deployment-charts] - 10https://gerrit.wikimedia.org/r/616121 (https://phabricator.wikimedia.org/T254908) [13:55:37] (03CR) 10Kormat: [V: 03+2 C: 03+2] passwords: Add dummy mysql replication password [labs/private] - 10https://gerrit.wikimedia.org/r/617473 (owner: 10Kormat) [13:56:13] I think next time I will leave new replication configurations to be figured out by elukey and kormat - that'll be fun to watch [13:56:21] hahahahahahahahaah [13:56:21] <_joe_> hnowlan: you might be interested in the patches I'm writing that allow to validate an envoy config within that repo [13:56:31] * kormat coughs [13:56:38] marostegui: can I fix this with some wikilove? I don't really like kormat [13:56:44] <3 [13:56:46] fixed! [13:56:50] \o/ [13:57:21] <_joe_> also that's fake new, we all know that marostegui's passwords are all 12345 [13:57:48] unless forced to be 8 chars, in that case 12345678 [13:58:02] <_joe_> https://www.youtube.com/watch?v=a6iW-8xPw3k [13:58:38] hahahaha [14:01:59] * RhinosF1 likes watching SRE banter [14:04:37] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:04:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:29] (03PS3) 10Jbond: thanos-sso: create a discover name to be used by the authenticate FE [dns] - 10https://gerrit.wikimedia.org/r/617456 (https://phabricator.wikimedia.org/T151009) [14:05:32] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:06:04] (03PS2) 10Jbond: ATS - thanos: update thanos.wikimedia.org to use CNAME based name 
[puppet] - 10https://gerrit.wikimedia.org/r/617457 (https://phabricator.wikimedia.org/T151009) [14:06:52] (03CR) 10Jbond: [C: 03+2] thanos-sso: create a discover name to be used by the authenticate FE [dns] - 10https://gerrit.wikimedia.org/r/617456 (https://phabricator.wikimedia.org/T151009) (owner: 10Jbond) [14:07:37] (03CR) 10Jbond: [C: 03+2] ATS - thanos: update thanos.wikimedia.org to use CNAME based name [puppet] - 10https://gerrit.wikimedia.org/r/617457 (https://phabricator.wikimedia.org/T151009) (owner: 10Jbond) [14:10:04] (03CR) 10Jcrespo: [C: 03+2] Transferer.py: Add proper cleanup [software/transferpy] - 10https://gerrit.wikimedia.org/r/614686 (https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm) [14:10:31] (03Merged) 10jenkins-bot: Transferer.py: Add proper cleanup [software/transferpy] - 10https://gerrit.wikimedia.org/r/614686 (https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm) [14:11:29] (03PS10) 10Jcrespo: Firewall.py: Save the target port after reservation [software/transferpy] - 10https://gerrit.wikimedia.org/r/615174 (https://phabricator.wikimedia.org/T257601) (owner: 10Privacybatm) [14:14:45] (03CR) 10Jcrespo: [C: 03+2] Firewall.py: Save the target port after reservation [software/transferpy] - 10https://gerrit.wikimedia.org/r/615174 (https://phabricator.wikimedia.org/T257601) (owner: 10Privacybatm) [14:15:24] (03CR) 10Jcrespo: [C: 03+2] transferpy: Improve documentation [software/transferpy] - 10https://gerrit.wikimedia.org/r/617068 (https://phabricator.wikimedia.org/T257601) (owner: 10Privacybatm) [14:15:32] (03PS1) 10Andrew Bogott: mwopenstackclient: Fix designate client auth confusion with no-auth project [puppet] - 10https://gerrit.wikimedia.org/r/617475 (https://phabricator.wikimedia.org/T258973) [14:16:33] (03PS4) 10Jcrespo: transferpy: Improve documentation [software/transferpy] - 10https://gerrit.wikimedia.org/r/617068 (https://phabricator.wikimedia.org/T257601) (owner: 10Privacybatm) [14:16:35] (03CR) 10Andrew 
Bogott: [C: 03+2] mwopenstackclient: Fix designate client auth confusion with no-auth project [puppet] - 10https://gerrit.wikimedia.org/r/617475 (https://phabricator.wikimedia.org/T258973) (owner: 10Andrew Bogott) [14:18:38] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:18:50] (03CR) 10Jcrespo: [C: 03+2] transferpy: Improve documentation (031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/617068 (https://phabricator.wikimedia.org/T257601) (owner: 10Privacybatm) [14:19:17] (03Merged) 10jenkins-bot: transferpy: Improve documentation [software/transferpy] - 10https://gerrit.wikimedia.org/r/617068 (https://phabricator.wikimedia.org/T257601) (owner: 10Privacybatm) [14:19:31] 10Operations, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Provide authenticated access to Thanos native web interface - https://phabricator.wikimedia.org/T151009 (10jbond) 05Open→03Resolved This is now configured with CAS-SSO authentication at https://thanos.wikimedia.org [14:20:17] (03CR) 10Jcrespo: [C: 03+2] transferpy: Release transferpy 1.0 [software/transferpy] - 10https://gerrit.wikimedia.org/r/617071 (https://phabricator.wikimedia.org/T257601) (owner: 10Privacybatm) [14:22:05] (03PS4) 10Jcrespo: transferpy: Release transferpy 1.0 [software/transferpy] - 10https://gerrit.wikimedia.org/r/617071 (https://phabricator.wikimedia.org/T257601) (owner: 10Privacybatm) [14:30:32] !log installing squid security updates [14:30:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:45] (03PS1) 10Jbond: thanos-sso: add descriptive comment and task reference [dns] - 10https://gerrit.wikimedia.org/r/617478 (https://phabricator.wikimedia.org/T151009) [14:36:38] (03CR) 10Filippo Giunchedi: [C: 03+1] thanos-sso: 
add descriptive comment and task reference [dns] - 10https://gerrit.wikimedia.org/r/617478 (https://phabricator.wikimedia.org/T151009) (owner: 10Jbond) [14:37:53] (03CR) 10Jbond: [C: 03+2] thanos-sso: add descriptive comment and task reference [dns] - 10https://gerrit.wikimedia.org/r/617478 (https://phabricator.wikimedia.org/T151009) (owner: 10Jbond) [14:38:53] 10Operations, 10Mail, 10observability, 10User-MoritzMuehlenhoff: Fix paniclog alert to only sent mails once - https://phabricator.wikimedia.org/T257016 (10MoritzMuehlenhoff) >>! In T257016#6344879, @MoritzMuehlenhoff wrote: > @herron Seems to work fine, didn't see a paniclog mail today \o/ Actually, the p... [14:39:18] (03PS5) 10JMeybohm: New upstream version 2.16.9 [debs/helm] - 10https://gerrit.wikimedia.org/r/616065 (https://phabricator.wikimedia.org/T258773) [14:42:23] (03PS1) 10Elukey: profile::mariadb::misc::analytics::multiinstance: change ports [puppet] - 10https://gerrit.wikimedia.org/r/617479 (https://phabricator.wikimedia.org/T234826) [14:45:31] (03PS2) 10Elukey: profile::mariadb::misc::analytics::multiinstance: change ports [puppet] - 10https://gerrit.wikimedia.org/r/617479 (https://phabricator.wikimedia.org/T234826) [14:48:34] (03PS1) 10Jdlrobson: Disable panning and zooming until ready [extensions/Kartographer] (wmf/1.36.0-wmf.2) - 10https://gerrit.wikimedia.org/r/617429 (https://phabricator.wikimedia.org/T257872) [14:55:42] RECOVERY - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on logstash1011 is OK: (C)100 gt (W)80 gt 75.25 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37 [15:02:01] (03CR) 10JMeybohm: [C: 04-1] Add local service proxy to the tls terminator v0.2 (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/582792 
(https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [15:05:20] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:07:14] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:11:55] (03CR) 10JMeybohm: [C: 04-1] envoyproxy::tls_terminator: update tls definitions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/617086 (https://phabricator.wikimedia.org/T258140) (owner: 10Giuseppe Lavagetto) [15:19:16] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/617083 (owner: 10Filippo Giunchedi) [15:46:32] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on logstash1011 is CRITICAL: 103.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37 [15:55:21] (03CR) 10Ayounsi: Netbox: add circuits support (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/617418 (owner: 10Ayounsi) [15:56:22] 10Operations, 10ops-eqiad: please connect eqiad's RIPE Atlas anchor to one of the SCSes - https://phabricator.wikimedia.org/T258221 (10RobH) >>! In T258221#6334410, @Jclark-ctr wrote: > @CDanis connected console to port 41 on scs-c1-eqiad I just jumped on and tested this connection, it works! I'm closing t... 
[15:57:40] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 200 (expecting: 404) https://wikitech.wikimedia.org/wiki/Citoid [15:58:06] 10Operations, 10ops-eqiad: please connect eqiad's RIPE Atlas anchor to one of the SCSes - https://phabricator.wikimedia.org/T258221 (10RobH) [15:58:36] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 200 (expecting: 404) https://wikitech.wikimedia.org/wiki/Citoid [15:58:55] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` wtp2011.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [15:59:33] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` wtp2012.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [16:00:04] godog and _joe_: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet request window(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200730T1600). [16:00:36] ^ no change in this window [16:07:56] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:11:42] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:15:14] (03CR) 10Alexandros Kosiaris: [C: 04-1] "LGTM, aside from the probably useful inclusion of api-rw." (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/616467 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [16:15:32] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:23:36] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:32:31] sssssss [16:38:14] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [16:38:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:21] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [16:40:21] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [16:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:04] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:41:16] RECOVERY - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on 
logstash1011 is OK: (C)100 gt (W)80 gt 53.9 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37 [16:41:48] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:55] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [16:41:55] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [16:41:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:09] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [16:44:10] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [16:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:14] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [16:44:15] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [16:44:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:00] (03CR) 10Dzahn: [C: 03+2] add backends for misc services with multiple backends but without geoip [dns] - 10https://gerrit.wikimedia.org/r/616911 (owner: 10Dzahn) [16:47:07] (03PS2) 10Dzahn: add backends for misc services with multiple backends but without geoip [dns] - 10https://gerrit.wikimedia.org/r/616911 [16:47:17] (03CR) 10Dzahn: [C: 03+2] "comments-only" [dns] - 10https://gerrit.wikimedia.org/r/616911 (owner: 10Dzahn) [16:49:49] ^ this makes it more obvious which misc things have multiple backends, just without geoip, 
and which are truly eqiad-only [16:50:09] 10Puppet, 10Beta-Cluster-Infrastructure, 10Performance-Team: Puppet failing due to dependency cycle on deployment-xhgui03 - https://phabricator.wikimedia.org/T259278 (10bd808) [16:50:11] and should be easier if we ever need to fail them over [16:51:22] PROBLEM - Check the NTP synchronisation status of timesyncd on wtp2012 is CRITICAL: connect to address 10.192.32.27 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP [16:51:22] PROBLEM - dhclient process on wtp2011 is CRITICAL: connect to address 10.192.32.26 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [16:51:43] 10Puppet, 10Beta-Cluster-Infrastructure, 10Performance-Team: Puppet failing due to dependency cycle on deployment-xhgui03 - https://phabricator.wikimedia.org/T259278 (10Dzahn) This comes from T254310#6347281. [16:51:48] (03PS6) 10Hnowlan: api-gateway: add helmfile.d configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/616467 (https://phabricator.wikimedia.org/T254906) [16:51:50] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:52:32] PROBLEM - Check the last execution of php7.2-fpm_check_restart on wtp2012 is CRITICAL: connect to address 10.192.32.27 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:53:26] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [16:53:27] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:53:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:35] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [16:53:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:35] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:53:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:40] wtp2012 - ACK [16:53:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:41] 10Puppet, 10Beta-Cluster-Infrastructure, 10Performance-Team: Puppet failing due to dependency cycle on deployment-xhgui03 - https://phabricator.wikimedia.org/T259278 (10greg) Hey @dpifke, pinging you since I saw xhgui above :) [16:56:54] (03CR) 10Hnowlan: api-gateway: add helmfile.d configuration (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/616467 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [17:00:05] halfak and accraze: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Graphoid / ORES . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200730T1700). 
[17:00:33] (03PS1) 10RLazarus: Revert "puppetmaster: clearer edit message when you can't rewrite history" [puppet] - 10https://gerrit.wikimedia.org/r/617430 [17:01:13] (03CR) 10jerkins-bot: [V: 04-1] Revert "puppetmaster: clearer edit message when you can't rewrite history" [puppet] - 10https://gerrit.wikimedia.org/r/617430 (owner: 10RLazarus) [17:01:40] jerkins-- [17:02:04] (03CR) 10Dzahn: [C: 03+1] "I got "trap: ERR: bad trap" and "warning too few spaces before comment" about /srv/private/hieradata/common.yaml without touching that fi" [puppet] - 10https://gerrit.wikimedia.org/r/617430 (owner: 10RLazarus) [17:02:37] rzl: just the commit message :) [17:03:00] yeah, I'm just grumbling because I put the commit message into the little revert popup on gerrit [17:03:08] our tools are fighting with our other tools and I have to clean up after them [17:03:15] indeed. yea... [17:03:35] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@d3ab874]: airflow: refinery_drop_hive_partitions: Fix kerberos token passing [17:03:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:30] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@d3ab874]: airflow: refinery_drop_hive_partitions: Fix kerberos token passing (duration: 00m 55s) [17:04:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:44] (03PS2) 10RLazarus: Revert "puppetmaster: clearer edit message when you can't rewrite history" [puppet] - 10https://gerrit.wikimedia.org/r/617430 [17:06:51] (03CR) 10RLazarus: [C: 03+2] Revert "puppetmaster: clearer edit message when you can't rewrite history" [puppet] - 10https://gerrit.wikimedia.org/r/617430 (owner: 10RLazarus) [17:08:19] (03PS1) 10Bstorm: paws haproxy: switch to the chained cert for TLS [puppet] - 10https://gerrit.wikimedia.org/r/617497 (https://phabricator.wikimedia.org/T255249) [17:10:35] (03CR) 10Bstorm: [C: 03+2] paws haproxy: switch to the chained cert for 
TLS [puppet] - 10https://gerrit.wikimedia.org/r/617497 (https://phabricator.wikimedia.org/T255249) (owner: 10Bstorm) [17:10:45] (03PS1) 10Dzahn: xhgui: try to break dependency cycle between php-twig package and APT pin (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/617498 (https://phabricator.wikimedia.org/T259278) [17:12:30] 10Operations, 10CommRel-Specialists-Support, 10Editing-team, 10Fundraising-Backlog, and 9 others: RFC: Serve Main Page of Wikimedia wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (10Elitre) Removing my team, I don't think there's anything for us here? [17:12:53] 10Operations, 10Editing-team, 10Fundraising-Backlog, 10Parsing-Team, and 8 others: RFC: Serve Main Page of Wikimedia wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (10Elitre) [17:21:39] 10Operations, 10Mail, 10OTRS, 10Trust-and-Safety, and 2 others: Forward emails addressed to privacy@wikidata to privacy@wikimedia - https://phabricator.wikimedia.org/T255733 (10Dzahn) @jrbs @Emufarmers I just made the change to the wikimediafoundation.org template (and wmflabs.org). This means you should... [17:25:58] 10Operations, 10ops-esams, 10DC-Ops, 10Patch-For-Review: ESAMS Refresh/Rebuild (October 2019) - https://phabricator.wikimedia.org/T235805 (10RobH) 05Open→03Resolved resolving as this now only has the single sub task to terminate the racks, also in the same project, so no need to keep a master tracking... [17:27:09] 10Operations, 10ops-esams, 10DC-Ops, 10Traffic: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167 (10RobH) a:05RobH→03BBlack Ok, this has now sat neglected for awhile. @bblack: Should I resume updating bios on these hosts in a rotating, one per cluster fash... 
[17:27:16] 10Operations, 10ops-esams, 10DC-Ops, 10Traffic: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167 (10RobH) 05Stalled→03Open [17:27:18] 10Operations, 10Traffic: Servers freezing across the caching cluster - https://phabricator.wikimedia.org/T238305 (10RobH) [17:35:27] (03PS1) 10Cmjohnson: Adding mgmt dns for cloudvirt servers, netbox script has already been run [dns] - 10https://gerrit.wikimedia.org/r/617502 (https://phabricator.wikimedia.org/T251627) [17:37:05] (03CR) 10Cmjohnson: [C: 03+2] Adding mgmt dns for cloudvirt servers, netbox script has already been run [dns] - 10https://gerrit.wikimedia.org/r/617502 (https://phabricator.wikimedia.org/T251627) (owner: 10Cmjohnson) [17:45:03] (03Abandoned) 10Dzahn: admins: turn all wdqs-admins into wdqs-roots [puppet] - 10https://gerrit.wikimedia.org/r/616564 (https://phabricator.wikimedia.org/T258739) (owner: 10Dzahn) [17:49:18] 10Operations, 10ops-eqiad: Check Netbox/dns/reality inconsistencies - https://phabricator.wikimedia.org/T259283 (10Volans) p:05Triage→03Medium [17:54:19] 10Puppet, 10Beta-Cluster-Infrastructure, 10Performance-Team, 10Patch-For-Review: Puppet failing due to dependency cycle on deployment-xhgui03 - https://phabricator.wikimedia.org/T259278 (10Dzahn) It was introduced by https://gerrit.wikimedia.org/r/c/operations/puppet/+/610446 Working on providing the php-... 
[17:54:32] (03CR) 10Dzahn: "https://phabricator.wikimedia.org/T259278" [puppet] - 10https://gerrit.wikimedia.org/r/610446 (https://phabricator.wikimedia.org/T254310) (owner: 10Dave Pifke) [17:55:25] 10Puppet, 10Beta-Cluster-Infrastructure, 10Performance-Team, 10Patch-For-Review: Puppet failing due to dependency cycle on deployment-xhgui03 - https://phabricator.wikimedia.org/T259278 (10Dzahn) [17:59:46] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Move mobileapps to use TLS only - https://phabricator.wikimedia.org/T255876 (10MSantos) [18:00:04] RoanKattouw, Niharika, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy Morning backport window(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200730T1800). [18:00:05] Jdlrobson: A patch you scheduled for Morning backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:15] (03PS1) 10Volans: mgmt: netbox-generated data for mgmt eqiad [dns] - 10https://gerrit.wikimedia.org/r/617509 (https://phabricator.wikimedia.org/T233183) [18:00:22] Lovely :) [18:00:22] (03CR) 10jerkins-bot: [V: 04-1] mgmt: netbox-generated data for mgmt eqiad [dns] - 10https://gerrit.wikimedia.org/r/617509 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [18:00:25] I can deploy today! [18:00:28] o/ here [18:00:50] thanks Urbanecm [18:00:57] (03CR) 10Urbanecm: [C: 03+2] Disable panning and zooming until ready [extensions/Kartographer] (wmf/1.36.0-wmf.2) - 10https://gerrit.wikimedia.org/r/617429 (https://phabricator.wikimedia.org/T257872) (owner: 10Jdlrobson) [18:01:14] Jdlrobson: just to confirm, you do not intend to backport this to .1? [18:01:16] Urbanecm: Hi, would it be fine to add https://gerrit.wikimedia.org/r/617422 to the current window? (Sorry for being very late) [18:01:33] Daimona: sure, I was actually planning to do that too! 
[18:01:35] (03CR) 10Urbanecm: [C: 03+2] sysop_itwiki: Set favicon to Wikimedia_logo_blue.svg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617422 (https://phabricator.wikimedia.org/T259243) (owner: 10Urbanecm) [18:01:41] Great, thank you :) [18:01:48] thank you for the ico :) [18:02:27] (03Merged) 10jenkins-bot: sysop_itwiki: Set favicon to Wikimedia_logo_blue.svg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617422 (https://phabricator.wikimedia.org/T259243) (owner: 10Urbanecm) [18:04:29] Credit goes to Civvì, glad it helped :) [18:04:38] (03PS2) 10Volans: mgmt: netbox-generated data for mgmt eqiad [dns] - 10https://gerrit.wikimedia.org/r/617509 (https://phabricator.wikimedia.org/T233183) [18:04:49] !log urbanecm@deploy1001 Synchronized static/favicon/wmf-blue.ico: 14ef2ec2956fd8f66be7efb3c5978ac0eda7ce97: sysop_itwiki: Set favicon to Wikimedia_logo_blue.svg (T259243; 1/2) (duration: 01m 06s) [18:04:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:57] T259243: Config tweaks for sysop-it.wikipedia - https://phabricator.wikimedia.org/T259243 [18:05:39] 20:01 Jdlrobson: just to confirm, you do not intend to backport this to wmf.1? [18:05:47] (03CR) 10Volans: "See also https://phabricator.wikimedia.org/T259283" [dns] - 10https://gerrit.wikimedia.org/r/617509 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [18:06:23] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 14ef2ec2956fd8f66be7efb3c5978ac0eda7ce97: sysop_itwiki: Set favicon to Wikimedia_logo_blue.svg (T259243; 2/2) (duration: 01m 06s) [18:06:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:33] Daimona: should be done! Looks good to me. 
[18:07:17] I'm still getting the old one, likely cache-related [18:07:36] (03PS1) 10Dzahn: admins: set a http_proxy for myself, dzahn [puppet] - 10https://gerrit.wikimedia.org/r/617512 [18:07:36] yeah, browsers tend to cache favicons for longer time [18:08:10] RECOVERY - Check the last execution of php7.2-fpm_check_restart on wtp2012 is OK: OK: Status of the systemd unit php7.2-fpm_check_restart https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:08:45] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [18:08:46] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:08:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:58] wtp2012 - ACK [18:09:04] Urbanecm: i want to apply this change to the ca.wikipedia.org and make sure and in the next deploy [18:09:51] gotcha, thanks! [18:14:04] (03PS1) 10Urbanecm: Add import sources for yuewiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617516 (https://phabricator.wikimedia.org/T258913) [18:14:16] (03PS2) 10Ryan Kemper: [wdqs] disable autoploys on wdqs1009 [puppet] - 10https://gerrit.wikimedia.org/r/616446 (owner: 10DCausse) [18:14:18] (03CR) 10Urbanecm: [C: 03+2] Add import sources for yuewiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617516 (https://phabricator.wikimedia.org/T258913) (owner: 10Urbanecm) [18:14:26] Duh, stupid browser is still caching the old one, but I assume it's going to be fine :D Thank you! [18:14:38] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] [wdqs] disable autoploys on wdqs1009 [puppet] - 10https://gerrit.wikimedia.org/r/616446 (owner: 10DCausse) [18:14:46] works for me! 
https://usercontent.irccloud-cdn.com/file/Q3FRLyw1/image.png [18:15:18] (03Merged) 10jenkins-bot: Add import sources for yuewiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617516 (https://phabricator.wikimedia.org/T258913) (owner: 10Urbanecm) [18:15:43] Yeah, I think I'm just going to wait for the browser to wake up [18:19:08] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp2011.codfw.wmnet'] ` and were **ALL** successful. [18:21:19] (03PS1) 10Urbanecm: Fix definition of yuewiktionary import sources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617518 (https://phabricator.wikimedia.org/T258913) [18:21:28] (03CR) 10Urbanecm: [C: 03+2] Fix definition of yuewiktionary import sources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617518 (https://phabricator.wikimedia.org/T258913) (owner: 10Urbanecm) [18:21:52] (03Merged) 10jenkins-bot: Disable panning and zooming until ready [extensions/Kartographer] (wmf/1.36.0-wmf.2) - 10https://gerrit.wikimedia.org/r/617429 (https://phabricator.wikimedia.org/T257872) (owner: 10Jdlrobson) [18:21:54] (03CR) 10Ryan Kemper: "`sudo puppet-merge` done" [puppet] - 10https://gerrit.wikimedia.org/r/616446 (owner: 10DCausse) [18:22:22] (03Merged) 10jenkins-bot: Fix definition of yuewiktionary import sources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617518 (https://phabricator.wikimedia.org/T258913) (owner: 10Urbanecm) [18:23:10] RECOVERY - dhclient process on wtp2011 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [18:23:12] RECOVERY - Check the NTP synchronisation status of timesyncd on wtp2012 is OK: OK: synced at Thu 2020-07-30 18:23:10 UTC. 
https://wikitech.wikimedia.org/wiki/NTP [18:24:25] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 617516: Add import sources for yuewiktionary | https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/617516; 617518: Fix definition of yuewiktionary import sources | https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/617518 # T258913 (duration: 01m 06s) [18:24:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:31] T258913: Add import sources and namespaces for yuewiktionary - https://phabricator.wikimedia.org/T258913 [18:25:23] Urbanecm: ready to test when you are :) [18:25:39] Jdlrobson: pulled the change onto mwdebug1001, could you have a look, please? [18:26:53] looking [18:29:31] Urbanecm: taking a bit longer than expected due to some connection challenges [18:29:41] no problem. Is there anything I can help with? [18:30:27] Urbanecm: looks good! [18:30:29] 10Operations, 10Fundraising-Backlog: New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 (10DStrine) Thanks all for discussing. Just for context, if we move back to our old Thank you pages, we will not be able to hide banners for people who have... [18:30:29] you can sync! [18:30:33] thanks, syncing! [18:32:35] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.2/extensions/Kartographer/modules/box/Map.js: aa3dbd54f8e422a511e55b5efba6b5f48253dbe7: Disable panning and zooming until ready (T257872) (duration: 01m 06s) [18:32:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:41] T257872: Uncaught Error: Set map center and zoom first on mobile domain Android - https://phabricator.wikimedia.org/T257872 [18:32:57] Jdlrobson: should be done! Anything else? :) [18:34:01] Urbanecm: nope THANK you! I'm monitoring the client side errors on logstash now.
[18:34:12] !log Morning B&C done [18:34:15] okay then :) [18:34:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:24] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:50:47] !log mforns@deploy1001 Started deploy [analytics/refinery@adb0d09]: Regular analytics weekly train [analytics/refinery@adb0d09b6584a7a26143623cf6173ae8983423e3] [18:50:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:59] !log imported twig (php-twig) into APT repo [18:59:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:04] liw and brennen: Time to snap out of that daydream and deploy Mediawiki train - European+American Version (secondary timeslot). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200730T1900). [19:00:29] promoting to group2 momentarily. [19:01:28] !log mforns@deploy1001 Finished deploy [analytics/refinery@adb0d09]: Regular analytics weekly train [analytics/refinery@adb0d09b6584a7a26143623cf6173ae8983423e3] (duration: 10m 41s) [19:01:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:09] (03PS1) 10Brennen Bearnes: all wikis to 1.36.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617524 [19:06:12] (03CR) 10Brennen Bearnes: [C: 03+2] all wikis to 1.36.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617524 (owner: 10Brennen Bearnes) [19:06:47] (03Merged) 10jenkins-bot: all wikis to 1.36.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617524 (owner: 10Brennen Bearnes) [19:09:25] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:10:12] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.36.0-wmf.2 [19:10:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:09] rolling back [19:12:45] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:13:07] (03CR) 10Jcrespo: "Looks good- will you have time tomorrow to deploy this (will need mysql restart) and finish most of the setup?" [puppet] - 10https://gerrit.wikimedia.org/r/617479 (https://phabricator.wikimedia.org/T234826) (owner: 10Elukey) [19:13:48] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: Revert group2 wikis to 1.36.0-wmf.1 [19:13:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:38] (03PS1) 10Brennen Bearnes: Revert "all wikis to 1.36.0-wmf.2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617525 [19:15:40] (03CR) 10Brennen Bearnes: [C: 03+2] Revert "all wikis to 1.36.0-wmf.2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617525 (owner: 10Brennen Bearnes) [19:16:22] (03Merged) 10jenkins-bot: Revert "all wikis to 1.36.0-wmf.2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617525 (owner: 10Brennen Bearnes) [19:18:03] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:21:05] (03PS1) 10Herron: logstash7: increase SSD tier JVM heap to 32G [puppet] - 10https://gerrit.wikimedia.org/r/617526 (https://phabricator.wikimedia.org/T259219) [19:23:41] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:24:36] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp2011.codfw.wmnet [19:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:01] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` wtp2013.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [19:26:08] 10Puppet, 10Beta-Cluster-Infrastructure, 10Performance-Team, 10Patch-For-Review: Puppet failing due to dependency cycle on deployment-xhgui03 - https://phabricator.wikimedia.org/T259278 (10Dzahn) a:03Dzahn [19:26:44] (03CR) 10Herron: "PCC https://puppet-compiler.wmflabs.org/compiler1001/24240/" [puppet] - 10https://gerrit.wikimedia.org/r/617526 (https://phabricator.wikimedia.org/T259219) (owner: 10Herron) [19:27:23] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:44:21] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:46:06] (03PS1) 10Herron: exim4: move daily paniclog rotate from exim4-base to exim4-paniclog [puppet] - 10https://gerrit.wikimedia.org/r/617529 (https://phabricator.wikimedia.org/T257016) [19:48:15] 10Operations, 10Mail, 10WMF-Communications: Updating DNS records (pr.wikimedia.org) - https://phabricator.wikimedia.org/T231387 (10Dzahn) [19:48:53] 10Operations, 10Mail, 10observability, 10Patch-For-Review, 10User-MoritzMuehlenhoff: Fix paniclog alert to only sent mails once - https://phabricator.wikimedia.org/T257016 (10herron) >>! In T257016#6348642, @MoritzMuehlenhoff wrote: >>>! In T257016#6344879, @MoritzMuehlenhoff wrote: >> @herron Seems to w... [19:50:41] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:51:39] hi operations! 
[19:52:25] fr-tech is still looking for a stable solution to setting banner hide cookies on *.wikipedia.org domains for donors [19:52:57] at present we're exploiting the fact that donatewiki is available at both donate.wikimedia.org and donate.wikipedia.org [19:53:24] and redirecting donors to the thank you page at donate.wikipedia.org so that we can set the hide cookies [19:54:19] since browsers now reject cookies from our old method of cross-domain tags [19:54:53] ejegg: is there a phab task about this problem? [19:54:55] Our current solution sets the cookies fine, but Krinkle has warned it is unsupported [19:55:24] bd808 There is, one sec [19:56:07] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:56:39] bd808: here's the initial ticket: https://phabricator.wikimedia.org/T251780 [19:57:12] we've asked for a clone of donatewiki just for the TY pages: https://phabricator.wikimedia.org/T259002 [19:57:42] but we have just now realized that mobile users with the app installed are seeing errors instead of the TY page [19:58:19] In Hive, there are 404s from the app trying to load the TY page via the rest API [19:58:35] so I think the app is hijacking *.wikipedia.org links [19:58:59] and will try to open for whatever subdomain we choose for that cloned wiki [19:59:43] that sounds like something to talk through with the android and ios app teams maybe? [19:59:54] sure [20:00:05] or restbase developers, depending on what's causing the url capture (iff that's the reason) [20:00:09] the other question is whether it would be easy to turn on the rest API [20:00:09] what do the 404 urls look like? 
[20:00:33] (03PS2) 10Dzahn: xhgui: fix dependency cycle between php-twig package and APT pin [puppet] - 10https://gerrit.wikimedia.org/r/617498 (https://phabricator.wikimedia.org/T259278) [20:00:38] but that would serve it in the app in a way that doesn't do what you want the thank you page to do afaik [20:00:50] requesting /api/rest_v1/page/mobile-html/Thank_You%2Fja [20:00:54] for one, all its custom HTML would likely not render well or at all in the app [20:01:05] Krinkle: yeah, it might look bad [20:01:16] (03PS3) 10Dzahn: xhgui: fix dependency cycle between php-twig package and APT pin [puppet] - 10https://gerrit.wikimedia.org/r/617498 (https://phabricator.wikimedia.org/T259278) [20:01:22] but at least it wouldn't be an error page [20:01:34] which has prompted a few hundred donors to keep retrying [20:01:34] there's presumably a sitematrix or other such list controlling this [20:01:40] I didn't even know we did fund raising banners in the native apps. that's cool :) [20:01:51] bd808 so these are mostly email donors [20:01:56] i.e. coming in from an email send [20:02:03] clicking the link and getting the app opening it unexpectedly [20:02:17] ah. that makes a bit more sense I guess [20:02:25] (03CR) 10Herron: "PCC https://puppet-compiler.wmflabs.org/compiler1003/24241/" [puppet] - 10https://gerrit.wikimedia.org/r/617529 (https://phabricator.wikimedia.org/T257016) (owner: 10Herron) [20:02:41] From what I know of app URL hijacking, we can't change that until the next app version upgrade [20:03:03] so problem #1 really seems to be how to set a *.wikipedia.org cookie to stop further banners? 
[20:03:17] bd808 yeah, that's what started us down this whole road [20:03:29] we used to do it via tags on donate.wikimedia.org [20:03:47] but these days browsers reject third-party cookies [20:04:09] (03CR) 10Dzahn: [C: 03+2] xhgui: fix dependency cycle between php-twig package and APT pin [puppet] - 10https://gerrit.wikimedia.org/r/617498 (https://phabricator.wikimedia.org/T259278) (owner: 10Dzahn) [20:04:24] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp2012.codfw.wmnet'] ` Of which those **FAILED**: ` ['wtp2012.codfw.wmnet'] ` [20:04:58] (03PS1) 10Bstorm: toolsdb: remove one of the temporary replication filters [puppet] - 10https://gerrit.wikimedia.org/r/617533 (https://phabricator.wikimedia.org/T253738) [20:05:15] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:05:25] does that rest API require a lot of infrastructure for each site? Or is it basically just activating an extension? [20:06:00] (03PS2) 10Bstorm: toolsdb: remove one of the temporary replication filters [puppet] - 10https://gerrit.wikimedia.org/r/617533 (https://phabricator.wikimedia.org/T253738) [20:06:20] ejegg: that's the restbase stuff. I think it basically "comes for free" in a typical main cluster wiki? [20:06:51] It might be acceptable to us to serve a funny-looking thank you page in-app to the small number of donors with the app installed if we can keep setting the hide cookie for the vast majority of donors giving in a browser from a banner [20:07:21] bd808 ok, that would be ideal. 
I'll add that detail to the donatewiki clone request [20:07:55] that whole new wiki approach really seems like overkill [20:08:04] (03CR) 10Bstorm: [C: 03+2] toolsdb: remove one of the temporary replication filters [puppet] - 10https://gerrit.wikimedia.org/r/617533 (https://phabricator.wikimedia.org/T253738) (owner: 10Bstorm) [20:08:32] !log [wtp2012:~] $ sudo rm -rf /srv/deployment/parsoid/deploy-cache [20:08:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:56] Krinkle: if they redirected through a *.wikipedia.org URL and the cookie was sent with the redirect, does that work with modern 3rd-party cookie blocking? [20:09:12] No, that's also blocked [20:09:19] Common tactic [20:09:20] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [20:09:21] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [20:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:49] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:10:18] bd808: e.g. big store paying little store to redirect through them to track interest in products without user account or consent [20:11:52] If you do find a way that works and is not the user looking at Wikipedia.org, it would be high on the browsers' list of things to block next and also risks public image damage [20:12:06] hehe, we had a patch all ready to deploy to do exactly that redirect [20:14:16] 10Operations, 10Fundraising-Backlog: New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 (10Ejegg) It looks like our apps are hijacking all requests 
to *.wikipedia.org and trying to load the pages via the rest API. Can we also have restbase ins... [20:14:23] ejegg: has this been confirmed to be an in app 404 consistently reproduced from the Pedia url and works fine from the Media url? [20:15:49] If so then the issue is just that the wiki doesn't exist to restbase because the url is non-canonical. That could be fixed by a redirect somewhere in the varnish or ATS rules to apply before the /rest-v1/ split [20:16:42] In the current setup I mean [20:17:03] Yes, if it were a new wiki it would need restbase but that's standard. Nothing you need to do or ask for [20:17:43] is there something really new and emergent here? Firefox started blocking 3rd party a year ago if I'm reading blog posts correctly and it sounds like chrome won't follow for a couple more years? [20:18:06] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:18:23] (03CR) 10Cwhite: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/617260 (https://phabricator.wikimedia.org/T256418) (owner: 10Cwhite) [20:21:10] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:22:02] PROBLEM - Check size of conntrack table on wtp2013 is CRITICAL: connect to address 10.192.32.28 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [20:22:02] PROBLEM - parsoid on wtp2013 is CRITICAL: connect to address 10.192.32.28 and port 8000: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/parsoid [20:23:14] ACK - wtp2013 [20:23:25] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [20:23:26] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:23:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:09] Krinkle: we haven't been able to repro it yet [20:24:17] I only just now found the 404 in hive [20:24:32] but the useragent is the wikipedia iOS app [20:24:58] I'd guess the app doesn't actually open for the Media url [20:25:07] (03PS6) 10Jdlrobson: Switch test wikis to new version of vector by default (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614891 (https://phabricator.wikimedia.org/T254227) [20:26:16] ejegg: well after repro, perhaps check with iOS devs on what the app would do if the restbase request were to redirect from Pedia to Media [20:26:36] brennen: I have a patch for T258609 (under review), do you want to try re-deploying today? [20:26:36] T258609: Circular dependency when creating service! GrowthExperimentsConfigurationLoader -> GrowthExperimentsConfigurationLoader - https://phabricator.wikimedia.org/T258609 [20:26:39] That could be a one line fix if it's supported at that level [20:26:48] (03CR) 10Dave Pifke: [C: 03+1] "TIL, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/617498 (https://phabricator.wikimedia.org/T259278) (owner: 10Dzahn) [20:26:50] (03PS2) 10Cwhite: debianization [debs/prometheus-es-exporter] (debian/sid) - 10https://gerrit.wikimedia.org/r/617250 (https://phabricator.wikimedia.org/T222826) [20:27:24] tgr: if reviewers are comfortable with it, i'm comfortable redeploying any time in about the next hour. [20:27:35] cool, thanks [20:27:56] thanks for the speedy response on that one. [20:28:12] (03CR) 10Cwhite: debianization (032 comments) [debs/prometheus-es-exporter] (debian/sid) - 10https://gerrit.wikimedia.org/r/617250 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [20:30:32] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/617526 (https://phabricator.wikimedia.org/T259219) (owner: 10Herron) [20:31:26] (03CR) 1020after4: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/616922 (owner: 10Dzahn) [20:35:24] (03CR) 10Cwhite: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/617388 (https://phabricator.wikimedia.org/T247968) (owner: 10Filippo Giunchedi) [20:35:43] (03CR) 1020after4: "I can't tell what actually failed from the console output." [puppet] - 10https://gerrit.wikimedia.org/r/616922 (owner: 10Dzahn) [20:37:44] ok, I can reproduce the issue [20:38:08] let's see where the mobile app teams hang out.... 
[20:39:33] (03CR) 10Dzahn: "needs manual rebase it looks" [puppet] - 10https://gerrit.wikimedia.org/r/616922 (owner: 10Dzahn) [20:41:50] (03CR) 1020after4: envoyproxy::tls_terminator: update tls definitions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/617086 (https://phabricator.wikimedia.org/T258140) (owner: 10Giuseppe Lavagetto) [20:42:52] (03PS2) 10Dzahn: aphlict: make client port and IP also configurable, rename parameters [puppet] - 10https://gerrit.wikimedia.org/r/616922 [20:43:09] (03PS1) 10BryanDavis: cloud: Remove legacy conditionals from profile::base::labs [puppet] - 10https://gerrit.wikimedia.org/r/617539 [20:45:29] !log ryankemper@deploy1001 Started deploy [wdqs/wdqs@e797cf0]: 0.3.42 [20:45:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:44] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp2012.codfw.wmnet [20:45:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:08] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` wtp2014.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [20:46:31] ejegg: yeah I remember on my phone, when I had the app installed, if I clicked on a wikipedia.org link, it asked me if I wanted to open it in the app. Probably best would be to update the app to somehow not capture TY page URLs, but I'm not even sure how android apps do that [20:46:41] ejegg: I assume it isn't only iOS? [20:47:13] AndyRussG: good question - I have only seen the iOS thing in the logs [20:47:25] yeah I bet it could be both [20:47:30] I think I can work out which URLs the android app hijacks from the source code [20:48:35] Krinkle: thanks again for the help! 
BTW we also had a solution that involved setting a cookie on meta.wikimedia.org, reading it server-side when the banner itself is fetched from that domain, and varying the response based on whether the cookie says the banner should be hidden. But making that feasible would have involved a lot of extra work from Traffic [20:48:37] 10Puppet, 10Beta-Cluster-Infrastructure, 10Performance-Team: Puppet failing due to dependency cycle on deployment-xhgui03 - https://phabricator.wikimedia.org/T259278 (10Dzahn) 05Open→03Resolved The issue is gone now with the change above which stopped using require_package to install xhgui. (and after D... [20:48:58] (03CR) 10Dzahn: "fixed puppet run on deployment-xhgui03" [puppet] - 10https://gerrit.wikimedia.org/r/617498 (https://phabricator.wikimedia.org/T259278) (owner: 10Dzahn) [20:49:02] (03PS1) 10Gergő Tisza: Fast and ugly fix for T258609 [extensions/GrowthExperiments] (wmf/1.36.0-wmf.2) - 10https://gerrit.wikimedia.org/r/617437 (https://phabricator.wikimedia.org/T258609) [20:50:04] brennen: ^ [20:52:54] tgr: cool - testable if i get it out to mwdebug2001? [20:53:29] OK, looks like the android app hijacks *.wikipedia.org/wiki/* : https://github.com/wikimedia/apps-android-wikipedia/blob/master/app/src/main/AndroidManifest.xml#L92 [20:54:57] (03CR) 10Brennen Bearnes: [C: 03+2] Fast and ugly fix for T258609 [extensions/GrowthExperiments] (wmf/1.36.0-wmf.2) - 10https://gerrit.wikimedia.org/r/617437 (https://phabricator.wikimedia.org/T258609) (owner: 10Gergő Tisza) [20:54:59] brennen: I can test that it didn't break Growth features. I'm not sure how to reproduce the srwiki error though. [20:55:36] locally, I could generate a similar error by running update.php but obviously I can't test that in production. 
[20:55:41] right [20:55:42] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:56:23] the error looked like it should affect basically all pageviews, but I could not repro that locally or on srwiki, and I don't know much about LanguageConverter. [20:56:33] (03PS1) 10Dzahn: install_server: switch xhgui* servers from stretch to buster [puppet] - 10https://gerrit.wikimedia.org/r/617540 (https://phabricator.wikimedia.org/T259206) [20:56:48] (...or on srwiki beta...) [20:57:09] tgr: i guess we'll know quite quickly if it's deployed. [20:57:45] (03CR) 10jerkins-bot: [V: 04-1] Fast and ugly fix for T258609 [extensions/GrowthExperiments] (wmf/1.36.0-wmf.2) - 10https://gerrit.wikimedia.org/r/617437 (https://phabricator.wikimedia.org/T258609) (owner: 10Gergő Tisza) [20:57:54] let's go ahead and verify on mwdebug1002 (actual hostname) to the extent possible and give it a shot. [20:58:06] 10Puppet, 10Analytics, 10VPS-Projects: Puppet failing on wikistats.analytics.eqiad.wmflabs due to statistics::user - https://phabricator.wikimedia.org/T259307 (10bd808) [20:58:21] * brennen waits on CI [21:00:39] (03CR) 10Brennen Bearnes: [C: 03+2] "recheck" [extensions/GrowthExperiments] (wmf/1.36.0-wmf.2) - 10https://gerrit.wikimedia.org/r/617437 (https://phabricator.wikimedia.org/T258609) (owner: 10Gergő Tisza) [21:03:12] Ugh, that's T259199. We fixed that but the fix is not in wmf.2 I suppose. [21:03:13] T259199: GrowthExperiments Selenium tests break with "Homepage Mentorship module: no mentor available for user ..." 
- https://phabricator.wikimedia.org/T259199 [21:03:37] (03PS1) 10Gergő Tisza: Fix selenium test breakage due to error logging with no mentor page [extensions/GrowthExperiments] (wmf/1.36.0-wmf.2) - 10https://gerrit.wikimedia.org/r/617438 (https://phabricator.wikimedia.org/T259199) [21:04:15] (03PS2) 10Gergő Tisza: Fast and ugly fix for T258609 [extensions/GrowthExperiments] (wmf/1.36.0-wmf.2) - 10https://gerrit.wikimedia.org/r/617437 (https://phabricator.wikimedia.org/T258609) [21:07:16] (03PS1) 10Catrope: Fix layout of NotificationsInboxWidget on narrow screen [skins/MinervaNeue] (wmf/1.36.0-wmf.2) - 10https://gerrit.wikimedia.org/r/617439 (https://phabricator.wikimedia.org/T258939) [21:08:18] brennen: While we're on the topic of cleaning up messes the Growth team caused in wmf.2, is it OK if I also deploy this MinervaNeue change? ---^^ [21:08:24] brennen: should work now if merged with the parent. [21:08:36] RoanKattouw: sure [21:08:37] It's a bit more cosmetic rather than an actual everything-dies-with-errors train blocker though, so I can wait a little if you want [21:09:00] (03CR) 10Catrope: [C: 03+2] Fix layout of NotificationsInboxWidget on narrow screen [skins/MinervaNeue] (wmf/1.36.0-wmf.2) - 10https://gerrit.wikimedia.org/r/617439 (https://phabricator.wikimedia.org/T258939) (owner: 10Catrope) [21:09:05] i'll make you a trade - sync the other files while you're at it. 
:) [21:09:09] ^ RoanKattouw [21:09:12] Will do [21:09:15] (03CR) 10Urbanecm: [C: 04-1] Enable GrowthExperiments on Persian Wikipedia (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609893 (https://phabricator.wikimedia.org/T253291) (owner: 10Catrope) [21:09:17] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10Jclark-ctr) [21:10:10] !log ryankemper@deploy1001 Finished deploy [wdqs/wdqs@e797cf0]: 0.3.42 (duration: 24m 41s) [21:10:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:50] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:10:53] (03CR) 10Dzahn: [C: 03+2] install_server: switch xhgui* servers from stretch to buster [puppet] - 10https://gerrit.wikimedia.org/r/617540 (https://phabricator.wikimedia.org/T259206) (owner: 10Dzahn) [21:11:38] (03PS2) 10Catrope: Enable GrowthExperiments on Persian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609893 (https://phabricator.wikimedia.org/T253291) [21:11:52] (03CR) 10Catrope: Enable GrowthExperiments on Persian Wikipedia (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609893 (https://phabricator.wikimedia.org/T253291) (owner: 10Catrope) [21:12:06] (03CR) 10Dave Pifke: [C: 03+1] install_server: switch xhgui* servers from stretch to buster [puppet] - 10https://gerrit.wikimedia.org/r/617540 (https://phabricator.wikimedia.org/T259206) (owner: 10Dzahn) [21:12:28] (03CR) 10jerkins-bot: [V: 04-1] Enable GrowthExperiments on Persian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609893 (https://phabricator.wikimedia.org/T253291) (owner: 10Catrope) [21:12:31] (03PS3) 10Catrope: Enable 
and configure GrowthExperiments on hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616959 (https://phabricator.wikimedia.org/T255020) [21:12:32] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson name rack position asset_tag switchport an-worker1102 A4 39 WMF5406 38 an-worker1103 A7 39 WMF5407 25 an-worker110... [21:12:55] (03PS3) 10Catrope: Enable GrowthExperiments on Persian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609893 (https://phabricator.wikimedia.org/T253291) [21:13:23] (03CR) 10jerkins-bot: [V: 04-1] Enable and configure GrowthExperiments on hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616959 (https://phabricator.wikimedia.org/T255020) (owner: 10Catrope) [21:13:40] (03CR) 10jerkins-bot: [V: 04-1] Enable GrowthExperiments on Persian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609893 (https://phabricator.wikimedia.org/T253291) (owner: 10Catrope) [21:14:32] Krinkle / bd808 so it looks like that request wouldn't work on donate's wikiMedia url either: https://donate.wikimedia.org/api/rest_v1/page/mobile-html/Thank_You%2Fja is a json-formatted 404 [21:15:27] (03CR) 10Brennen Bearnes: [C: 03+2] Fix selenium test breakage due to error logging with no mentor page [extensions/GrowthExperiments] (wmf/1.36.0-wmf.2) - 10https://gerrit.wikimedia.org/r/617438 (https://phabricator.wikimedia.org/T259199) (owner: 10Gergő Tisza) [21:15:53] so I guess Krinkle's suggestion for a one-line redirect fix is out of the question? 
[21:16:03] ejegg: that makes me wonder if the backing "microservice" understands sub pages [21:16:14] (03CR) 10Brennen Bearnes: [V: 03+2 C: 03+2] Fast and ugly fix for T258609 [extensions/GrowthExperiments] (wmf/1.36.0-wmf.2) - 10https://gerrit.wikimedia.org/r/617437 (https://phabricator.wikimedia.org/T258609) (owner: 10Gergő Tisza) [21:16:42] no dice if I strip off the %2Fja either bd808 [21:16:59] (03CR) 10Brennen Bearnes: [C: 03+2] "Clicked verified accidentally here." [extensions/GrowthExperiments] (wmf/1.36.0-wmf.2) - 10https://gerrit.wikimedia.org/r/617437 (https://phabricator.wikimedia.org/T258609) (owner: 10Gergő Tisza) [21:17:06] RoanKattouw: you would need to run `composer buildDBLists` (or manually touch the .dblist files) too :-) [21:17:12] Yeah I saw [21:17:13] ejegg: https://donate.wikimedia.org/api/rest_v1/ [21:17:15] Doing that now [21:17:19] ah, ok :) [21:17:21] 10Operations, 10Graphoid, 10Code-Stewardship-Reviews, 10Release-Engineering-Team (Code Health), and 2 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (10kaldari) I'd like to propose that we close this ticket, since we've decided we are no longer going to be using Grap... [21:17:58] hmm, "Site info fetch failed." ? [21:18:17] (03PS4) 10Catrope: Enable and configure GrowthExperiments on hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616959 (https://phabricator.wikimedia.org/T255020) [21:19:03] ejegg: yep, I suppose worth chatting with CPT to see why that is [21:19:10] CPT? 
[21:19:17] core platform team [21:19:24] aka mwcore + services [21:19:29] ah, thanks [21:20:10] (03PS4) 10Catrope: Enable GrowthExperiments on Persian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609893 (https://phabricator.wikimedia.org/T253291) [21:20:41] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616959 (https://phabricator.wikimedia.org/T255020) (owner: 10Catrope) [21:21:06] ejegg: whereas for donate.wikipedia.org it just says, "Not found" [21:21:27] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609893 (https://phabricator.wikimedia.org/T253291) (owner: 10Catrope) [21:22:15] AndyRussG: right, I think that would need more site config [21:22:40] and I think Krinkle was hoping we could just redirect that to the equivalent urls on donate.wikiMedia.org [21:22:50] but it looks like those don't work either :( [21:24:47] 10Operations, 10Graphoid, 10Code-Stewardship-Reviews, 10Release-Engineering-Team (Code Health), and 2 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (10Krinkle) So who has taken stewardship over the service for its remaining operation and to do the work done/being do... 
[21:25:28] ejegg: restbase not working there is a bug that I expect to be an easy fix once pointed out to the service owners [21:27:31] ah cool [21:27:32] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [21:27:32] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:27:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:36] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [21:27:36] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [21:27:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:12] anyone know if the core team reads much #wikimedia-dev in between all the bot posts there? Or is there a better channel to look for them? [21:30:40] !log reinstalling xhgui2001 [21:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:42] (03Merged) 10jenkins-bot: Fix layout of NotificationsInboxWidget on narrow screen [skins/MinervaNeue] (wmf/1.36.0-wmf.2) - 10https://gerrit.wikimedia.org/r/617439 (https://phabricator.wikimedia.org/T258939) (owner: 10Catrope) [21:31:44] (03Merged) 10jenkins-bot: Fix selenium test breakage due to error logging with no mentor page [extensions/GrowthExperiments] (wmf/1.36.0-wmf.2) - 10https://gerrit.wikimedia.org/r/617438 (https://phabricator.wikimedia.org/T259199) (owner: 10Gergő Tisza) [21:31:49] (03Merged) 10jenkins-bot: Fast and ugly fix for T258609 [extensions/GrowthExperiments] (wmf/1.36.0-wmf.2) - 10https://gerrit.wikimedia.org/r/617437 (https://phabricator.wikimedia.org/T258609) (owner: 10Gergő Tisza) [21:33:28] ejegg: they have #wikimedia-cpt [21:34:10] thanks mutante ! [21:34:24] RoanKattouw: awright, backports for T258609 are merged [21:34:24] T258609: Circular dependency when creating service! 
GrowthExperimentsConfigurationLoader -> GrowthExperimentsConfigurationLoader - https://phabricator.wikimedia.org/T258609 [21:34:43] Thanks for the ping! I had gotten distracted with something else because it took so long [21:34:45] Will sync them now [21:35:29] ta. we're still a bit before official cutoff, so i'll take one more crack at rolling forward once that's done. [21:38:45] RECOVERY - Check size of conntrack table on wtp2013 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [21:39:10] !log catrope@deploy1001 Synchronized php-1.36.0-wmf.2/skins/MinervaNeue/: T258939 (duration: 01m 08s) [21:39:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:15] T258939: Special:Notifications header too wide for mobile - https://phabricator.wikimedia.org/T258939 [21:40:16] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [21:40:17] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:40:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:53] !log catrope@deploy1001 Synchronized php-1.36.0-wmf.2/extensions/GrowthExperiments/: T258609 (duration: 01m 06s) [21:40:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:58] T258609: Circular dependency when creating service! GrowthExperimentsConfigurationLoader -> GrowthExperimentsConfigurationLoader - https://phabricator.wikimedia.org/T258609 [21:41:31] !log revoking and resigning puppet cert for xhgui2001.codfw.wmnet T259206 [21:41:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:36] T259206: reinstall xhgui* with buster - https://phabricator.wikimedia.org/T259206 [21:42:36] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. 
- https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp2013.codfw.wmnet'] ` and were **ALL** successful. [21:45:05] RoanKattouw: all clear? [21:45:16] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp2013.codfw.wmnet [21:45:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:25] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` wtp2015.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [21:45:33] brennen: I'm all done, I don't know how to check if the circular dependency error has gone away though [21:45:58] yeah, based on what tgr was saying i'm not sure there's a known reproduction case. [21:46:11] I guess you'll just have to roll forward and see? [21:46:16] yeah. here goes. [21:46:28] 10Operations, 10Graphoid, 10Code-Stewardship-Reviews, 10Release-Engineering-Team (Code Health), and 2 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (10kaldari) The undeployment/client-side enabling is being handled by @Jseddon. For supporting Graphoid's final days... 
[21:46:36] (03PS1) 10Brennen Bearnes: all wikis to 1.36.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617544 [21:46:38] (03CR) 10Brennen Bearnes: [C: 03+2] all wikis to 1.36.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617544 (owner: 10Brennen Bearnes) [21:47:21] (03Merged) 10jenkins-bot: all wikis to 1.36.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617544 (owner: 10Brennen Bearnes) [21:48:25] we can check some of the URLs from the error logs [21:49:55] canaries didn't yell at all this time, that's a good sign [21:51:31] (03PS1) 10Dzahn: switch xhgui2001 from xhgui::app to webperf::xhgui [puppet] - 10https://gerrit.wikimedia.org/r/617546 (https://phabricator.wikimedia.org/T259206) [21:51:33] hrm - ssh: connect to host wtp2015.codfw.wmnet port 22: Connection timed out [21:51:40] PROBLEM - Apache HTTP on wtp2014 is CRITICAL: connect to address 10.192.32.29 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [21:51:42] PROBLEM - nutcracker process on wtp2014 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.29: Connection reset by peer https://wikitech.wikimedia.org/wiki/Nutcracker [21:51:49] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [21:51:49] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [21:51:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:57] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [21:51:57] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:51:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:04] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.36.0-wmf.2 [21:52:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:12] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:23] ACKNOWLEDGEMENT - Apache HTTP on wtp2014 is CRITICAL: connect to address 10.192.32.29 and port 80: Connection refused daniel_zahn reinstall https://wikitech.wikimedia.org/wiki/Application_servers [21:52:49] tried a few, seems fixed [21:53:26] tgr: yeah, log looks about as expected [21:53:45] we would have known pretty quickly i think. it was ~4k errors in the first few minutes last go. [22:03:36] 10Operations, 10Fundraising-Backlog: New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 (10Ejegg) Requested restbase for donate.wiki[mp]edia.org here: T259309 [22:03:56] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:04:13] welp. [22:08:26] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:08:52] i think i'm rolling this back for these `Argument 1 passed to Wikimedia\Parsoid\Utils\DOMDataUtils::getDataMw() must be an instance of DOMElement` errors. [22:09:23] (unless they're not new - checking) [22:10:19] yep, rolling back. 
[22:12:36] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: Revert group2 wikis to 1.36.0-wmf.1 [22:12:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:43] (03CR) 10Dave Pifke: [C: 03+1] switch xhgui2001 from xhgui::app to webperf::xhgui [puppet] - 10https://gerrit.wikimedia.org/r/617546 (https://phabricator.wikimedia.org/T259206) (owner: 10Dzahn) [22:16:25] (03PS1) 10Brennen Bearnes: Revert "all wikis to 1.36.0-wmf.2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617551 [22:16:27] (03CR) 10Brennen Bearnes: [C: 03+2] Revert "all wikis to 1.36.0-wmf.2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617551 (owner: 10Brennen Bearnes) [22:17:31] (03Merged) 10jenkins-bot: Revert "all wikis to 1.36.0-wmf.2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617551 (owner: 10Brennen Bearnes) [22:18:50] (03PS2) 10Ahmon Dancy: Add mtail program for monitoring the Zuul error log [puppet] - 10https://gerrit.wikimedia.org/r/617271 (https://phabricator.wikimedia.org/T258821) [22:20:06] (03CR) 10jerkins-bot: [V: 04-1] Add mtail program for monitoring the Zuul error log [puppet] - 10https://gerrit.wikimedia.org/r/617271 (https://phabricator.wikimedia.org/T258821) (owner: 10Ahmon Dancy) [22:23:13] (03PS3) 10Ahmon Dancy: Add mtail program for monitoring the Zuul error log [puppet] - 10https://gerrit.wikimedia.org/r/617271 (https://phabricator.wikimedia.org/T258821) [22:24:27] (03CR) 10jerkins-bot: [V: 04-1] Add mtail program for monitoring the Zuul error log [puppet] - 10https://gerrit.wikimedia.org/r/617271 (https://phabricator.wikimedia.org/T258821) (owner: 10Ahmon Dancy) [22:27:39] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [22:27:39] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [22:27:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:27:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:39] 
(03CR) 10Dzahn: [C: 03+2] switch xhgui2001 from xhgui::app to webperf::xhgui [puppet] - 10https://gerrit.wikimedia.org/r/617546 (https://phabricator.wikimedia.org/T259206) (owner: 10Dzahn) [22:34:30] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [22:34:31] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [22:34:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:34:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:13] i'm calling it for the day. thanks all for earlier train assistance. [22:51:29] PROBLEM - nutcracker socket on wtp2015 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.30: Connection reset by peer https://wikitech.wikimedia.org/wiki/Nutcracker [22:51:31] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudvirt10[31-39]eqiad.wmnet - https://phabricator.wikimedia.org/T251627 (10Cmjohnson) [22:53:05] RECOVERY - Apache HTTP on wtp2014 is OK: HTTP OK: HTTP/1.1 302 Found - 631 bytes in 0.197 second response time https://wikitech.wikimedia.org/wiki/Application_servers [22:53:51] PROBLEM - parsoid on wtp2015 is CRITICAL: connect to address 10.192.32.30 and port 8000: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/parsoid [22:53:51] PROBLEM - Check size of conntrack table on wtp2015 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.30: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [22:54:29] PROBLEM - Check no envoy runtime configuration is left persistent on wtp2015 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.30: Connection reset by peer https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [22:55:32] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. 
- https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp2014.codfw.wmnet'] ` and were **ALL** successful. [22:56:11] PROBLEM - Check systemd state on wtp2015 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.30: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:56:11] PROBLEM - MD RAID on wtp2015 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.30: Connection reset by peer https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [22:56:11] PROBLEM - php7.2-fpm service on wtp2015 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.30: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:56:25] (03PS1) 10Dzahn: xhgui: ensure the config file is written after the package is installed [puppet] - 10https://gerrit.wikimedia.org/r/617554 (https://phabricator.wikimedia.org/T259278) [22:57:11] PROBLEM - Check that envoy is running on wtp2015 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.30: Connection reset by peer https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [22:57:21] (03CR) 10Dave Pifke: [C: 03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/617554 (https://phabricator.wikimedia.org/T259278) (owner: 10Dzahn) [22:58:27] PROBLEM - puppet last run on wtp2015 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.30: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [22:59:02] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp2014.codfw.wmnet [22:59:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:05] RoanKattouw, Niharika, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for Evening backport window(Max 6 patches). 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200730T2300). [23:00:05] RoanKattouw: A patch you scheduled for Evening backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:11] (03CR) 10Dzahn: [C: 03+2] xhgui: ensure the config file is written after the package is installed [puppet] - 10https://gerrit.wikimedia.org/r/617554 (https://phabricator.wikimedia.org/T259278) (owner: 10Dzahn) [23:00:17] I'll deploy it myself [23:00:51] PROBLEM - Check the NTP synchronisation status of timesyncd on wtp2015 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.30: Connection reset by peer https://wikitech.wikimedia.org/wiki/NTP [23:01:06] (03CR) 10Catrope: [C: 03+2] Enable GrowthExperiments on Persian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609893 (https://phabricator.wikimedia.org/T253291) (owner: 10Catrope) [23:01:49] PROBLEM - Check the last execution of php7.2-fpm_check_restart on wtp2015 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.30: Connection reset by peer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [23:01:54] (03Merged) 10jenkins-bot: Enable GrowthExperiments on Persian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609893 (https://phabricator.wikimedia.org/T253291) (owner: 10Catrope) [23:02:55] Oh ugh I didn't realize the train was rolled back again [23:03:13] PROBLEM - PHP opcache health on wtp2015 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.30: Connection reset by peer https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [23:03:31] (03PS1) 10Catrope: Revert "Enable GrowthExperiments on Persian Wikipedia" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617441 [23:03:48] (03PS2) 10Catrope: Revert "Enable GrowthExperiments on Persian Wikipedia" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/617441 (https://phabricator.wikimedia.org/T253291) [23:03:54] (03CR) 10Catrope: [C: 03+2] Revert "Enable GrowthExperiments on Persian Wikipedia" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617441 (https://phabricator.wikimedia.org/T253291) (owner: 10Catrope) [23:04:40] (03Merged) 10jenkins-bot: Revert "Enable GrowthExperiments on Persian Wikipedia" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617441 (https://phabricator.wikimedia.org/T253291) (owner: 10Catrope) [23:05:17] PROBLEM - PHP7 rendering on wtp2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:05:41] (03PS1) 10Catrope: Enable GrowthExperiments on Persian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617442 (https://phabricator.wikimedia.org/T253291) [23:11:26] RECOVERY - nutcracker process on wtp2014 is OK: PROCS OK: 1 process with UID = 112 (nutcracker), command name nutcracker https://wikitech.wikimedia.org/wiki/Nutcracker [23:19:31] PROBLEM - mediawiki-installation DSH group on wtp2015 is CRITICAL: Host wtp2015 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [23:20:54] (03PS1) 10Cmjohnson: Update dhcpd file with mac addresses for cloudvirt hosts [puppet] - 10https://gerrit.wikimedia.org/r/617557 (https://phabricator.wikimedia.org/T251627) [23:21:03] PROBLEM - Apache HTTP on wtp2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:24:08] (03CR) 10Dzahn: "still happening on a regular basis, for example today on xhgui2001" [puppet] - 10https://gerrit.wikimedia.org/r/451206 (https://phabricator.wikimedia.org/T196968) (owner: 10Dzahn) [23:27:37] (03PS1) 10Catrope: Mobile Special:Notifications: Properly close overlay on selection [extensions/Echo] (wmf/1.36.0-wmf.2) - 10https://gerrit.wikimedia.org/r/617443 
(https://phabricator.wikimedia.org/T258954)
[23:27:45] (CR) Catrope: [C: +2] Mobile Special:Notifications: Properly close overlay on selection [extensions/Echo] (wmf/1.36.0-wmf.2) - https://gerrit.wikimedia.org/r/617443 (https://phabricator.wikimedia.org/T258954) (owner: Catrope)
[23:28:00] (PS1) Catrope: Mobile Special:Notifications: Properly close overlay on selection [extensions/Echo] (wmf/1.36.0-wmf.1) - https://gerrit.wikimedia.org/r/617444 (https://phabricator.wikimedia.org/T258954)
[23:28:05] (CR) Catrope: [C: +2] Mobile Special:Notifications: Properly close overlay on selection [extensions/Echo] (wmf/1.36.0-wmf.1) - https://gerrit.wikimedia.org/r/617444 (https://phabricator.wikimedia.org/T258954) (owner: Catrope)
[23:28:35] I'm going to deploy these ---^^ instead of the fawiki config patch, since the fix for that was just merged and we're still in the evening backport window
[23:28:50] (PS1) Andrew Bogott: Add records for cloudvirt103[1-9] [dns] - https://gerrit.wikimedia.org/r/617558
[23:29:13] (PS2) Andrew Bogott: Add records for cloudvirt103[1-9] [dns] - https://gerrit.wikimedia.org/r/617558 (https://phabricator.wikimedia.org/T251627)
[23:29:26] (PS1) Cmjohnson: Add production dns for cloudvirt1031-1039 [dns] - https://gerrit.wikimedia.org/r/617559 (https://phabricator.wikimedia.org/T251627)
[23:32:25] (PS2) Cmjohnson: Update and remove tabs in hcpd file with mac addresses for cloudvirt hosts [puppet] - https://gerrit.wikimedia.org/r/617557 (https://phabricator.wikimedia.org/T251627)
[23:33:34] Operations, serviceops: reinstall xhgui* with buster - https://phabricator.wikimedia.org/T259206 (Dzahn) xhgui2001 has been reinstalled with buster. xhgui is now installed by puppet. xhgui1001 is as before for right now.
[23:33:44] (CR) Cmjohnson: [C: +2] Add production dns for cloudvirt1031-1039 [dns] - https://gerrit.wikimedia.org/r/617559 (https://phabricator.wikimedia.org/T251627) (owner: Cmjohnson)
[23:34:58] (CR) Dzahn: Add mtail program for monitoring the Zuul error log (1 comment) [puppet] - https://gerrit.wikimedia.org/r/617271 (https://phabricator.wikimedia.org/T258821) (owner: Ahmon Dancy)
[23:36:31] (PS3) Dzahn: aphlict: make client port and IP also configurable, rename parameters [puppet] - https://gerrit.wikimedia.org/r/616922
[23:37:28] (CR) Cmjohnson: [C: +2] Update and remove tabs in hcpd file with mac addresses for cloudvirt hosts [puppet] - https://gerrit.wikimedia.org/r/617557 (https://phabricator.wikimedia.org/T251627) (owner: Cmjohnson)
[23:42:25] (CR) Dzahn: [C: -1] "port gets removed https://puppet-compiler.wmflabs.org/compiler1001/24242/aphlict1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/616922 (owner: Dzahn)
[23:45:31] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[23:46:24] Operations, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudvirt10[31-39]eqiad.wmnet - https://phabricator.wikimedia.org/T251627 (Cmjohnson)
[23:46:32] (Merged) jenkins-bot: Mobile Special:Notifications: Properly close overlay on selection [extensions/Echo] (wmf/1.36.0-wmf.2) - https://gerrit.wikimedia.org/r/617443 (https://phabricator.wikimedia.org/T258954) (owner: Catrope)
[23:49:13] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[23:50:54] (Merged) jenkins-bot: Mobile Special:Notifications: Properly close overlay on selection [extensions/Echo] (wmf/1.36.0-wmf.1) - https://gerrit.wikimedia.org/r/617444 (https://phabricator.wikimedia.org/T258954) (owner: Catrope)
[23:56:15] RECOVERY - Check no envoy runtime configuration is left persistent on wtp2015 is OK: HTTP OK: HTTP/1.1 200 OK - 286 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[23:56:52] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[23:56:52] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[23:56:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:56:59] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[23:57:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:57:01] RECOVERY - Apache HTTP on wtp2015 is OK: HTTP OK: HTTP/1.1 302 Found - 631 bytes in 0.191 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[23:57:05] RECOVERY - Check that envoy is running on wtp2015 is OK: OK - envoyproxy.service is active https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[23:57:15] RECOVERY - nutcracker socket on wtp2015 is OK: TCP OK - 0.000 second response time on socket /var/run/nutcracker/redis_codfw.sock https://wikitech.wikimedia.org/wiki/Nutcracker
[23:57:19] RECOVERY - Check systemd state on wtp2015 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:57:19] RECOVERY - php7.2-fpm service on wtp2015 is OK: OK - php7.2-fpm is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[23:57:25] RECOVERY - parsoid on wtp2015 is OK: HTTP OK: HTTP/1.1 200 OK - 1022 bytes in 0.196 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/parsoid
[23:57:25] RECOVERY - Check size of conntrack table on wtp2015 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[23:58:03] RECOVERY - PHP opcache health on wtp2015 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[23:59:25] Operations, serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp2015.codfw.wmnet'] ` and were **ALL** successful.
[23:59:27] RECOVERY - PHP7 rendering on wtp2015 is OK: HTTP OK: HTTP/1.1 302 Found - 644 bytes in 0.159 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering