[00:02:41] Puppet, Beta-Cluster-Infrastructure: puppetdb on deployment-puppetdb03 keeps getting OOMKilled - https://phabricator.wikimedia.org/T248041 (Mholloway) And again today.
[00:14:59] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[00:15:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:18:27] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[00:18:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:18:41] (PS1) Legoktm: mediawiki: Install firejail from stretch-backports [puppet] - https://gerrit.wikimedia.org/r/616955 (https://phabricator.wikimedia.org/T179022)
[00:18:57] TimStarling: ^^
[00:19:38] (CR) jerkins-bot: [V: -1] mediawiki: Install firejail from stretch-backports [puppet] - https://gerrit.wikimedia.org/r/616955 (https://phabricator.wikimedia.org/T179022) (owner: Legoktm)
[00:20:57] (PS2) Legoktm: mediawiki: Install firejail from stretch-backports [puppet] - https://gerrit.wikimedia.org/r/616955 (https://phabricator.wikimedia.org/T179022)
[00:21:36] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:22:05] (CR) Ryan Kemper: [V: +2 C: +2] "I'm all for anything that cuts down on noise. Shipping it!" [puppet] - https://gerrit.wikimedia.org/r/616631 (owner: Ebernhardson)
[00:23:24] (CR) Ryan Kemper: "`sudo puppet-merge` done" [puppet] - https://gerrit.wikimedia.org/r/616631 (owner: Ebernhardson)
[00:24:56] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:29:02] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[00:31:34] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[00:33:17] Operations, serviceops, Patch-For-Review: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp1046.eqiad.wmnet'] ` and were **ALL** successful.
[00:37:54] PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog1001.eqiad.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops
[00:37:54] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog1001.eqiad.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[00:43:16] RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops
[00:43:16] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[00:44:04] Operations, Mail, OTRS, Trust-and-Safety, and 2 others: Forward emails addressed to privacy@wikidata to privacy@wikimedia - https://phabricator.wikimedia.org/T255733 (Dzahn) a:Dzahn
[00:44:54] (CR) Ryan Kemper: [V: +2 C: +2] "https://puppet-compiler.wmflabs.org/compiler1003/24203/" [puppet] - https://gerrit.wikimedia.org/r/616027 (owner: DCausse)
[00:48:56] !log sudo -E cumin -b 10 'A:wdqs-all' 'sudo run-puppet-agent'
[00:48:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:53:48] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1046.eqiad.wmnet
[00:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:54:05] (PS1) Catrope: Enable and configure GrowthExperiments on hewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/616959 (https://phabricator.wikimedia.org/T255020)
[01:13:15] Operations, serviceops, Patch-For-Review: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp1047.eqiad.wmnet'] ` and were **ALL** successful.
[01:15:31] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1047.eqiad.wmnet
[01:15:34] (PS1) Tim Starling: Remove NO_EXECVE when executing gs for now [extensions/Score] (wmf/1.36.0-wmf.1) - https://gerrit.wikimedia.org/r/616937 (https://phabricator.wikimedia.org/T257091)
[01:15:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:15:37] (CR) Tim Starling: [C: +2] Remove NO_EXECVE when executing gs for now [extensions/Score] (wmf/1.36.0-wmf.1) - https://gerrit.wikimedia.org/r/616937 (https://phabricator.wikimedia.org/T257091) (owner: Tim Starling)
[01:15:57] (PS1) Tim Starling: Remove NO_EXECVE when executing gs for now [extensions/Score] (wmf/1.36.0-wmf.2) - https://gerrit.wikimedia.org/r/616938 (https://phabricator.wikimedia.org/T257091)
[01:16:14] (CR) Tim Starling: [C: +2] Remove NO_EXECVE when executing gs for now [extensions/Score] (wmf/1.36.0-wmf.2) - https://gerrit.wikimedia.org/r/616938 (https://phabricator.wikimedia.org/T257091) (owner: Tim Starling)
[01:17:57] Operations, serviceops, Patch-For-Review: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp1048.eqiad.wmnet'] ` and were **ALL** successful.
[01:19:58] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1048.eqiad.wmnet
[01:20:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:25:59] Operations, serviceops, Patch-For-Review: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (Dzahn) >>! In T258775#6341636, @JMeybohm wrote: > All hosts but `wtp104[6-8].eqiad.wmnet` completed. wtp1046, wtp1047, wtp1048 completed and repooled on wtp1...
[01:26:05] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_eventgate_main_http_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:29:39] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:34:38] (Merged) jenkins-bot: Remove NO_EXECVE when executing gs for now [extensions/Score] (wmf/1.36.0-wmf.1) - https://gerrit.wikimedia.org/r/616937 (https://phabricator.wikimedia.org/T257091) (owner: Tim Starling)
[01:35:20] (Merged) jenkins-bot: Remove NO_EXECVE when executing gs for now [extensions/Score] (wmf/1.36.0-wmf.2) - https://gerrit.wikimedia.org/r/616938 (https://phabricator.wikimedia.org/T257091) (owner: Tim Starling)
[01:36:29] (CR) Ryan Kemper: "Puppet-merged, manually ran puppet agent on all wdqs nodes, everything looks good." [puppet] - https://gerrit.wikimedia.org/r/616027 (owner: DCausse)
[01:38:14] Operations, Performance-Team, serviceops, Sustainability (Incident Followup): Test gutter pool failover in production and memcached 1.5.x - https://phabricator.wikimedia.org/T240684 (aaron) Open→Resolved Yeah, technically, all sorts of anomalies are possible, so callers should always (a)...
[01:38:17] Operations, serviceops, Patch-For-Review: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (aaron)
[01:45:09] !log tstarling@deploy1001 Synchronized php-1.36.0-wmf.1/extensions/Score/includes/Score.php: work around firejail bug (duration: 01m 08s)
[01:45:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:47:23] !log tstarling@deploy1001 Synchronized php-1.36.0-wmf.2/extensions/Score/includes/Score.php: work around firejail bug (duration: 01m 07s)
[01:47:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:49:48] (PS1) Tim Starling: Re-enable LilyPond/Score in safe mode (2nd attempt) [mediawiki-config] - https://gerrit.wikimedia.org/r/616941
[01:49:59] PROBLEM - Long running screen/tmux on netbox1001 is CRITICAL: CRIT: Long running tmux process. (user: crusnov PID: 17784, 1910768s 1728000s). https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens
[01:52:38] (PS2) Tim Starling: Re-enable LilyPond/Score in safe mode (2nd attempt) [mediawiki-config] - https://gerrit.wikimedia.org/r/616941
[01:54:25] (CR) Tim Starling: [C: +2] Re-enable LilyPond/Score in safe mode (2nd attempt) [mediawiki-config] - https://gerrit.wikimedia.org/r/616941 (owner: Tim Starling)
[01:55:13] (Merged) jenkins-bot: Re-enable LilyPond/Score in safe mode (2nd attempt) [mediawiki-config] - https://gerrit.wikimedia.org/r/616941 (owner: Tim Starling)
[02:10:28] Operations, MediaWiki-General, Performance-Team, serviceops-radar, and 3 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (Krinkle) p:High→Medium
[02:18:42] Operations, Arc-Lamp, Performance-Team, Patch-For-Review: webperf1002 server close to have /srv partition full - https://phabricator.wikimedia.org/T257931 (Krinkle)
[02:18:56] Operations, Arc-Lamp, Performance-Team: webperf1002 server close to have /srv partition full - https://phabricator.wikimedia.org/T257931 (Krinkle) Open→Resolved
[02:19:02] !log tstarling@deploy1001 Synchronized wmf-config/CommonSettings.php: re-enable lilypond in safe mode (duration: 01m 09s)
[02:19:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:19:13] Operations, Arc-Lamp, Performance-Team: webperf1002 server close to have /srv partition full - https://phabricator.wikimedia.org/T257931 (Krinkle) >>! In T235456#6343338, @Krinkle wrote: > {F31952510 height=250} Nice @dpifke :)
[02:35:12] Operations, MediaWiki-extensions-Score, Security-Team, Wikimedia-General-or-Unknown, and 3 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (tstarling)
[03:07:33] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:13:11] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:30:27] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:32:19] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:36:34] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:37:57] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:52:59] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:58:39] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:04:21] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:08:10] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:13:49] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:17:37] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:27:03] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:30:49] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:35:27] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 1.354e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:37:21] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 333 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:51:27] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:54:47] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:58:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1142', diff saved to https://phabricator.wikimedia.org/P12100 and previous config saved to /var/cache/conftool/dbconfig/20200729-045859-marostegui.json
[04:59:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:02:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1142', diff saved to https://phabricator.wikimedia.org/P12101 and previous config saved to /var/cache/conftool/dbconfig/20200729-050204-marostegui.json
[05:02:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:02:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1141', diff saved to https://phabricator.wikimedia.org/P12102 and previous config saved to /var/cache/conftool/dbconfig/20200729-050247-marostegui.json
[05:02:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:03:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1078', diff saved to https://phabricator.wikimedia.org/P12103 and previous config saved to /var/cache/conftool/dbconfig/20200729-050346-marostegui.json
[05:03:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:21:13] (PS1) Marostegui: db2106: Reimage to Buster [puppet] - https://gerrit.wikimedia.org/r/616989 (https://phabricator.wikimedia.org/T250666)
[05:21:59] (CR) Marostegui: [C: +2] db2106: Reimage to Buster [puppet] - https://gerrit.wikimedia.org/r/616989 (https://phabricator.wikimedia.org/T250666) (owner: Marostegui)
[05:23:41] Operations, Mail, OTRS, Trust-and-Safety, and 2 others: Forward emails addressed to privacy@wikidata to privacy@wikimedia - https://phabricator.wikimedia.org/T255733 (Emufarmers) We are fine with this.
[05:35:13] (PS5) Giuseppe Lavagetto: helmfile: add data for enabling service proxy in k8s [puppet] - https://gerrit.wikimedia.org/r/616812
[05:43:31] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime
[05:43:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:45:35] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[05:45:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:52:48] !log standardize mr1-eqsin interfaces
[05:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:04] !log ssh doc1001.eqiad.wmnet sudo -u doc-uploader git -C /srv/docroot pull
[06:01:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:05:47] !log standardize mr1-ulsfo interfaces
[06:06:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:12:53] (PS6) Giuseppe Lavagetto: helmfile: add data for enabling service proxy in k8s [puppet] - https://gerrit.wikimedia.org/r/616812
[06:14:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1078', diff saved to https://phabricator.wikimedia.org/P12104 and previous config saved to /var/cache/conftool/dbconfig/20200729-061450-marostegui.json
[06:14:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:19] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/616920 (https://phabricator.wikimedia.org/T258775) (owner: Dzahn)
[06:16:21] !log standardize mr1-codfw interfaces
[06:16:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:21] (PS1) Tim Starling: Turn off .ly source downloads [extensions/Score] (wmf/1.36.0-wmf.1) - https://gerrit.wikimedia.org/r/616978
[06:17:53] (CR) Tim Starling: [C: +2] "It's in wmf.2 but not wmf.1" [extensions/Score] (wmf/1.36.0-wmf.1) - https://gerrit.wikimedia.org/r/616978 (owner: Tim Starling)
[06:18:14] (CR) Muehlenhoff: [C: -1] "We're actively moving away from stretch-backports, in fact it will be disabled very soon entirely with https://gerrit.wikimedia.org/r/c/op" [puppet] - https://gerrit.wikimedia.org/r/616955 (https://phabricator.wikimedia.org/T179022) (owner: Legoktm)
[06:20:06] (PS1) Marostegui: db2106: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/616994
[06:20:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1078', diff saved to https://phabricator.wikimedia.org/P12105 and previous config saved to /var/cache/conftool/dbconfig/20200729-062009-marostegui.json
[06:20:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:21:42] (PS7) Giuseppe Lavagetto: helmfile: add data for enabling service proxy in k8s [puppet] - https://gerrit.wikimedia.org/r/616812
[06:21:53] (CR) Marostegui: [C: +2] db2106: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/616994 (owner: Marostegui)
[06:22:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1112', diff saved to https://phabricator.wikimedia.org/P12106 and previous config saved to /var/cache/conftool/dbconfig/20200729-062224-marostegui.json
[06:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:31] (PS1) Marostegui: db2117: Reimage to Buster [puppet] - https://gerrit.wikimedia.org/r/616995 (https://phabricator.wikimedia.org/T250666)
[06:26:39] !log standardize mr1-eqiad interfaces
[06:26:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:27:14] (CR) Marostegui: [C: +2] db2117: Reimage to Buster [puppet] - https://gerrit.wikimedia.org/r/616995 (https://phabricator.wikimedia.org/T250666) (owner: Marostegui)
[06:30:29] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/616885 (https://phabricator.wikimedia.org/T259000) (owner: Herron)
[06:32:42] (PS1) Ayounsi: Add interfaces support for management routers [homer/public] - https://gerrit.wikimedia.org/r/616997
[06:33:19] Operations, ops-eqiad, Analytics, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (elukey) I think that we should coordinate first about how to proceed, given what discussed in T243521#6005828. There are two things to keep in mind: * rack...
[06:34:52] (CR) Ayounsi: [C: +2] "Self-merging as I manually tested it and pushed the differences to the routers. So it's currently a NOOP." [homer/public] - https://gerrit.wikimedia.org/r/616997 (owner: Ayounsi)
[06:35:01] Operations, ops-eqiad, DC-Ops: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (elukey) Cross-posting: we had a chat a while ago with dc-ops about 10g-enabled racks and availability, ending up in T243521#6005828. The config may be outdated, but...
[06:35:18] (Merged) jenkins-bot: Add interfaces support for management routers [homer/public] - https://gerrit.wikimedia.org/r/616997 (owner: Ayounsi)
[06:35:33] (Merged) jenkins-bot: Turn off .ly source downloads [extensions/Score] (wmf/1.36.0-wmf.1) - https://gerrit.wikimedia.org/r/616978 (owner: Tim Starling)
[06:36:58] (PS8) Giuseppe Lavagetto: helmfile: add data for enabling service proxy in k8s [puppet] - https://gerrit.wikimedia.org/r/616812
[06:40:13] (PS3) Muehlenhoff: Add CAS support to Hue [puppet] - https://gerrit.wikimedia.org/r/616541
[06:48:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime
[06:48:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:50:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[06:50:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:52:29] (PS9) Giuseppe Lavagetto: helmfile: add data for enabling service proxy in k8s [puppet] - https://gerrit.wikimedia.org/r/616812
[07:08:21] (CR) Ayounsi: [C: +1] Update netbox to v2.8.8-wmf [software/netbox-deploy] - https://gerrit.wikimedia.org/r/616892 (https://phabricator.wikimedia.org/T258942) (owner: CRusnov)
[07:19:19] (PS1) Elukey: Remove AAAA/PTR records for db1108 [dns] - https://gerrit.wikimedia.org/r/617064 (https://phabricator.wikimedia.org/T234826)
[07:25:55] (CR) Jcrespo: [C: -2] Remove AAAA/PTR records for db1108 [dns] - https://gerrit.wikimedia.org/r/617064 (https://phabricator.wikimedia.org/T234826) (owner: Elukey)
[07:26:12] jynus: ?
[07:26:23] I'm commenting on patch
[07:26:42] yes I wasn't about to merge, it was just a code review
[07:27:04] (CR) Jcrespo: [C: -2] "While this is an outlier, I think I was able to make it work. Let's keep the ipv6 for now." [dns] - https://gerrit.wikimedia.org/r/617064 (https://phabricator.wikimedia.org/T234826) (owner: Elukey)
[07:29:10] okok :)
[07:29:19] (CR) Jcrespo: [C: -2] "> so replication from db1108 would fail/timeout multiple times" [dns] - https://gerrit.wikimedia.org/r/617064 (https://phabricator.wikimedia.org/T234826) (owner: Elukey)
[07:31:21] elukey: I am getting some strange errors on backup
[07:31:33] I may need you for a while
[07:33:37] jynus: they just said they were afk for a few hours in analytics
[07:33:46] ok
[07:33:51] thanks
[07:33:56] jynus: I am about to go afk for a bit, sorry :( is it ok if we do it this afternoon?
[07:34:05] yes, no rush
[07:34:09] super
[07:36:43] (CR) Giuseppe Lavagetto: [C: +1] "https://puppet-compiler.wmflabs.org/compiler1002/24209/deploy1001.eqiad.wmnet/fulldiff.html at long last I managed to produce the correct " [puppet] - https://gerrit.wikimedia.org/r/616812 (owner: Giuseppe Lavagetto)
[07:44:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1141', diff saved to https://phabricator.wikimedia.org/P12107 and previous config saved to /var/cache/conftool/dbconfig/20200729-074414-marostegui.json
[07:44:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:32] (CR) Filippo Giunchedi: [C: +1] kibana: CVE-2020-7016 / CVE-2020-7017 mitigations [puppet] - https://gerrit.wikimedia.org/r/616885 (https://phabricator.wikimedia.org/T259000) (owner: Herron)
[07:45:24] (CR) Filippo Giunchedi: [C: +2] prometheus: upgrade snmp-exporter config [puppet] - https://gerrit.wikimedia.org/r/616857 (https://phabricator.wikimedia.org/T247967) (owner: Filippo Giunchedi)
[07:48:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1141', diff saved to https://phabricator.wikimedia.org/P12108 and previous config saved to /var/cache/conftool/dbconfig/20200729-074828-marostegui.json
[07:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:05] Operations, WMDE-Analytics-Engineering, Graphite, User-Addshore: Regularly & Automatically backup WMDE metrics stored in graphite - https://phabricator.wikimedia.org/T125408 (Addshore) Stalled→Declined
[07:52:54] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:53:13] (PS1) Filippo Giunchedi: prometheus: hide diffs in snmp_exporter::module [puppet] - https://gerrit.wikimedia.org/r/617067 (https://phabricator.wikimedia.org/T247967)
[07:53:55] Operations, netops, observability, Patch-For-Review, User-fgiunchedi: Migrate role::netmon to Buster - https://phabricator.wikimedia.org/T247967 (fgiunchedi)
[07:54:50] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:55:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1141', diff saved to https://phabricator.wikimedia.org/P12109 and previous config saved to /var/cache/conftool/dbconfig/20200729-075558-marostegui.json
[07:56:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:44] (PS1) Privacybatm: transferpy: Improve documentation [software/transferpy] - https://gerrit.wikimedia.org/r/617068 (https://phabricator.wikimedia.org/T257601)
[07:58:56] (CR) Filippo Giunchedi: "LGTM overall, see nits inline" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/616851 (https://phabricator.wikimedia.org/T222826) (owner: Cwhite)
[08:02:40] Operations, serviceops, Patch-For-Review: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (JMeybohm) >>! In T258775#6343259, @Dzahn wrote: >>>! In T258775#6341636, @JMeybohm wrote: >> All hosts but `wtp104[6-8].eqiad.wmnet` completed. > > wtp1046, w...
[08:02:55] (CR) Filippo Giunchedi: [C: +2] prometheus: hide diffs in snmp_exporter::module [puppet] - https://gerrit.wikimedia.org/r/617067 (https://phabricator.wikimedia.org/T247967) (owner: Filippo Giunchedi)
[08:03:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1141', diff saved to https://phabricator.wikimedia.org/P12110 and previous config saved to /var/cache/conftool/dbconfig/20200729-080318-marostegui.json
[08:03:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:08] (CR) Volans: "Change looks sane, I've a question inline and a more general one:" (1 comment) [puppet-lint/wmf_styleguide-check] - https://gerrit.wikimedia.org/r/616895 (https://phabricator.wikimedia.org/T259013) (owner: Jbond)
[08:04:12] (PS1) Filippo Giunchedi: install_server: reinstall netmon1002 with Buster [puppet] - https://gerrit.wikimedia.org/r/617069 (https://phabricator.wikimedia.org/T247967)
[08:04:16] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[08:04:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1121', diff saved to https://phabricator.wikimedia.org/P12111 and previous config saved to /var/cache/conftool/dbconfig/20200729-080442-marostegui.json
[08:04:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:25] (CR) Filippo Giunchedi: [C: +2] install_server: reinstall netmon1002 with Buster [puppet] - https://gerrit.wikimedia.org/r/617069 (https://phabricator.wikimedia.org/T247967) (owner: Filippo Giunchedi)
[08:05:34] !log Deploy MCR schema change on db1121 (lag will show up on s4), also remove triggers on db1124:3314
[08:05:35] (PS2) Filippo Giunchedi: install_server: reinstall netmon1002 with Buster [puppet] - https://gerrit.wikimedia.org/r/617069 (https://phabricator.wikimedia.org/T247967)
[08:05:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:04] (PS1) Privacybatm: transferpy: Release transferpy 1.0 [software/transferpy] - https://gerrit.wikimedia.org/r/617071 (https://phabricator.wikimedia.org/T257601)
[08:08:02] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[08:10:54] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:11:32] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:12:57] (PS2) Privacybatm: transferpy: Improve documentation [software/transferpy] - https://gerrit.wikimedia.org/r/617068 (https://phabricator.wikimedia.org/T257601)
[08:13:41] (PS2) Privacybatm: transferpy: Release transferpy 1.0 [software/transferpy] - https://gerrit.wikimedia.org/r/617071 (https://phabricator.wikimedia.org/T257601)
[08:18:39] (CR) DCausse: [C: +2] increment extra plugin to 6.5.4-wmf-11 [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/616602 (owner: Ebernhardson)
[08:18:55] (CR) DCausse: [C: +1] "I meant +1 :)" [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/616602 (owner: Ebernhardson)
[08:21:25] (CR) Giuseppe Lavagetto: [C: +1] "LGTM, the comments are nitpicks you can safely disregard." (2 comments) [docker-images/docker-report] - https://gerrit.wikimedia.org/r/616813 (owner: JMeybohm)
[08:23:57] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on logstash1011 is CRITICAL: 106.8 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37
[08:24:28] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime
[08:24:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:34] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:26:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:12] (CR) JMeybohm: [C: -1] helmfile: add data for enabling service proxy in k8s (2 comments) [puppet] - https://gerrit.wikimedia.org/r/616812 (owner: Giuseppe Lavagetto)
[08:27:27] (PS1) Filippo Giunchedi: prometheus: bump scrape timeout for PDUs [puppet] - https://gerrit.wikimedia.org/r/617073 (https://phabricator.wikimedia.org/T247967)
[08:29:11] (Abandoned) Filippo Giunchedi: prometheus: bump scrape timeout for PDUs [puppet] - https://gerrit.wikimedia.org/r/617073 (https://phabricator.wikimedia.org/T247967) (owner: Filippo Giunchedi)
[08:31:54] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:32:10] (PS3) Jbond: validate_$type: add checks to prevent legacy stdlib functions [puppet-lint/wmf_styleguide-check] - https://gerrit.wikimedia.org/r/616895 (https://phabricator.wikimedia.org/T259013)
[08:33:10] (CR) Jbond: "Thanks updated" (1 comment) [puppet-lint/wmf_styleguide-check] - https://gerrit.wikimedia.org/r/616895 (https://phabricator.wikimedia.org/T259013) (owner: Jbond)
[08:33:42] (PS1) DCausse: [wdqs] install openjdk-8-dbg [puppet] - https://gerrit.wikimedia.org/r/617074
[08:34:45] Operations, SRE-Access-Requests, Wikidata, Wikidata-Query-Service, and 2 others: wdqs admins should have access to nginx logs, jstack on wdqs machines - https://phabricator.wikimedia.org/T258739 (dcausse) @herron thanks for the deploy. It works well for me. For jstack I need an extra package for...
[08:35:11] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/617074 (owner: DCausse)
[08:40:28] (CR) Jbond: "Thanks updated, for clarification i have created this CR with the aim of supporting both python 2.7 and 3.7 more as a transition then a pe" (1 comment) [software/puppet-compiler] - https://gerrit.wikimedia.org/r/615793 (owner: Jbond)
[08:40:56] (CR) Giuseppe Lavagetto: "While the change seems technically correct, I'd like to see some rationale for making debian/rules more complicated in the commit message."
[debs/helm] - 10https://gerrit.wikimedia.org/r/616065 (https://phabricator.wikimedia.org/T258773) (owner: 10JMeybohm) [08:41:05] (03CR) 10Muehlenhoff: [C: 03+2] [wdqs] install openjdk-8-dbg [puppet] - 10https://gerrit.wikimedia.org/r/617074 (owner: 10DCausse) [08:41:22] (03CR) 10Jbond: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/616911 (owner: 10Dzahn) [08:41:38] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:42:06] !log jayme@cumin1001 conftool action : set/pooled=inactive; selector: name=wtp2001.codfw.wmnet [08:42:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:27] 10Operations, 10serviceops, 10Patch-For-Review: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jayme on cumin1001.eqiad.wmnet for hosts: ` wtp2001.codfw.wmnet ` The log can be found in `/var/log... 
[08:42:38] (03PS1) 10Filippo Giunchedi: role: use rsync wrap_with_stunnel for netmon [puppet] - 10https://gerrit.wikimedia.org/r/617076 (https://phabricator.wikimedia.org/T247967) [08:43:59] (03PS1) 10Jcrespo: mariadb: Create ugly exception for port assignment for db1108 [puppet] - 10https://gerrit.wikimedia.org/r/617077 (https://phabricator.wikimedia.org/T234826) [08:45:59] (03CR) 10Vgutierrez: [C: 03+2] Release 8.0.8-1wm2 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/616724 (owner: 10Vgutierrez) [08:46:26] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:47:27] (03CR) 10Filippo Giunchedi: [C: 03+2] "PCC https://puppet-compiler.wmflabs.org/compiler1002/24210/" [puppet] - 10https://gerrit.wikimedia.org/r/617076 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi) [08:47:33] (03PS2) 10Filippo Giunchedi: role: use rsync wrap_with_stunnel for netmon [puppet] - 10https://gerrit.wikimedia.org/r/617076 (https://phabricator.wikimedia.org/T247967) [08:49:52] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:52:16] 10Operations, 10netops, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Migrate role::netmon to Buster - https://phabricator.wikimedia.org/T247967 (10fgiunchedi) [08:52:29] (03CR) 10Volans: [C: 03+1] "ACK, LGTM" [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/616895 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [08:53:13] (03PS1) 10Muehlenhoff: profile::java: Add support to deploy debug packages [puppet] - 10https://gerrit.wikimedia.org/r/617079 [08:54:27] (03CR) 10jerkins-bot: [V: 04-1] profile::java: Add support to deploy debug packages [puppet] - 10https://gerrit.wikimedia.org/r/617079 (owner: 10Muehlenhoff) [08:55:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1121', diff saved to https://phabricator.wikimedia.org/P12112 and previous config saved to /var/cache/conftool/dbconfig/20200729-085504-marostegui.json [08:55:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:36] !log The above was db1112 [08:55:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:33] (03PS2) 10Muehlenhoff: profile::java: Add support to deploy debug packages [puppet] - 10https://gerrit.wikimedia.org/r/617079 [08:56:54] (03CR) 10Giuseppe Lavagetto: [C: 03+1] helmfile: add data for enabling service proxy in k8s (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/616812 (owner: 10Giuseppe Lavagetto) [09:00:55] (03CR) 10JMeybohm: [C: 03+1] "Thanks for clarification! 
LGTM then" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/616812 (owner: 10Giuseppe Lavagetto) [09:10:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1112', diff saved to https://phabricator.wikimedia.org/P12113 and previous config saved to /var/cache/conftool/dbconfig/20200729-091006-marostegui.json [09:10:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:38] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/24211/" [puppet] - 10https://gerrit.wikimedia.org/r/617079 (owner: 10Muehlenhoff) [09:10:47] 10Operations, 10vm-requests, 10Patch-For-Review: esams,ulsfo,eqsin: one VM request each for install_servers - https://phabricator.wikimedia.org/T254157 (10Volans) @Dzahn what's the status of this? It appears that the VM is up but not in puppet at all. [09:11:51] (03CR) 10Giuseppe Lavagetto: [C: 03+2] helmfile: add data for enabling service proxy in k8s [puppet] - 10https://gerrit.wikimedia.org/r/616812 (owner: 10Giuseppe Lavagetto) [09:13:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1112', diff saved to https://phabricator.wikimedia.org/P12114 and previous config saved to /var/cache/conftool/dbconfig/20200729-091319-marostegui.json [09:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:06] (03PS1) 10Marostegui: db2117: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/617080 [09:15:08] (03CR) 10Marostegui: [C: 03+2] db2117: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/617080 (owner: 10Marostegui) [09:15:17] !log upload trafficserver 8.0.8-1wm2 to apt.wm.o (buster) [09:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1112', diff saved to https://phabricator.wikimedia.org/P12115 and previous config saved to /var/cache/conftool/dbconfig/20200729-091528-marostegui.json 
[09:15:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:02] !log upgrade ATS to version 8.0.8-1wm2 on cp4026 and cp4032 [09:16:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:33] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime [09:20:33] !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [09:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:19] (03CR) 10Alexandros Kosiaris: [C: 03+2] Rakefile: Correctly match start of YAML docs [deployment-charts] - 10https://gerrit.wikimedia.org/r/616024 (owner: 10Alexandros Kosiaris) [09:29:30] (03Merged) 10jenkins-bot: Rakefile: Correctly match start of YAML docs [deployment-charts] - 10https://gerrit.wikimedia.org/r/616024 (owner: 10Alexandros Kosiaris) [09:29:32] (03Merged) 10jenkins-bot: charts for push-notification service [deployment-charts] - 10https://gerrit.wikimedia.org/r/602390 (https://phabricator.wikimedia.org/T250493) (owner: 10MSantos) [09:37:08] 10Operations, 10netops, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Migrate role::netmon to Buster - https://phabricator.wikimedia.org/T247967 (10fgiunchedi) [09:39:49] (03PS5) 10Giuseppe Lavagetto: Add local service proxy to the tls terminator v0.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/582792 (https://phabricator.wikimedia.org/T244843) [09:44:16] !log upgrade ATS to version 8.0.8-1wm2 on cp5006 and cp5012 [09:44:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:27] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/616124 (owner: 10Herron) [09:47:07] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/616738 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [09:48:11] (03CR) 10Jbond: [C: 03+2] graphite::web: drop validate_functions and add type validation [puppet] - 10https://gerrit.wikimedia.org/r/616738 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [09:49:31] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/616753 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [09:49:57] (03CR) 10Jbond: [C: 03+2] prometheus: drop legacy validate_functions [puppet] - 10https://gerrit.wikimedia.org/r/616753 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [09:51:41] PROBLEM - mediawiki-installation DSH group on wtp2001 is CRITICAL: Host wtp2001 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [09:52:51] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 105.8 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [09:53:55] PROBLEM - Apache HTTP on wtp2001 is CRITICAL: connect to address 10.192.16.43 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [09:53:55] PROBLEM - nutcracker process on wtp2001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.43: Connection reset by peer https://wikitech.wikimedia.org/wiki/Nutcracker [09:56:15] PROBLEM - nutcracker socket on wtp2001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.43: Connection reset by peer https://wikitech.wikimedia.org/wiki/Nutcracker [09:56:43] (03PS1) 10Filippo 
Giunchedi: rsync: listen for stunnel connections on all AFs [puppet] - 10https://gerrit.wikimedia.org/r/617083 [09:57:11] (03PS2) 10Filippo Giunchedi: rsync: listen for stunnel connections on v4/v6 [puppet] - 10https://gerrit.wikimedia.org/r/617083 [09:58:16] wtp2001 is me (reimaging) [09:58:45] PROBLEM - parsoid on wtp2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Services/Monitoring/parsoid [10:07:22] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/617079 (owner: 10Muehlenhoff) [10:11:45] RECOVERY - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is OK: (C)100 gt (W)80 gt 69.15 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [10:12:06] !log upgrade ATS to version 8.0.8-1wm2 on cp3064 and cp3065 [10:12:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:37] !log tstarling@deploy1001 Synchronized php-1.36.0-wmf.1/extensions/Score/extension.json: do not offer .ly downloads (duration: 01m 20s) [10:19:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:15] !log tstarling@deploy1001 Synchronized php-1.36.0-wmf.1/extensions/Score/includes/Score.php: do not offer .ly downloads (duration: 01m 07s) [10:21:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:26] (03CR) 10Kormat: "Is there any chance we can get the ports changed instead? E.g. 3351/3352." [puppet] - 10https://gerrit.wikimedia.org/r/617077 (https://phabricator.wikimedia.org/T234826) (owner: 10Jcrespo) [10:28:11] PROBLEM - Keyholder SSH agent on netmon1002 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. 
https://wikitech.wikimedia.org/wiki/Keyholder [10:30:03] (03CR) 10Vgutierrez: [C: 04-1] "the TLS cert used on envoy at aphlict.discovery.wmnet must include the public faced hostname phabricator.wikimedia.org in the SAN list, cu" [puppet] - 10https://gerrit.wikimedia.org/r/615797 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [10:30:31] ^^ netmon1002 has been restarted lately? [10:31:54] yes indeed, I've reimaged it earlier today, I'll rearm [10:32:19] thx :D [10:32:43] np! {{done}} [10:32:52] with that, I'll go to lunch [10:33:49] RECOVERY - Keyholder SSH agent on netmon1002 is OK: OK: Keyholder is armed with all configured keys. https://wikitech.wikimedia.org/wiki/Keyholder [10:36:57] 10Operations, 10DBA, 10User-Kormat: Package wmfmariadbpy as a .deb - https://phabricator.wikimedia.org/T259021 (10Kormat) There are a number of scripts (i.e. executable python scripts) in this repo, but i'm not sure which ones are actively used or not: I know these are used: ` switchover.py replication_tree... [10:37:32] (03PS26) 10Hnowlan: api-gateway: Basic envoy chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) [10:37:51] (03CR) 10Elukey: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/617077 (https://phabricator.wikimedia.org/T234826) (owner: 10Jcrespo) [10:39:41] (03CR) 10Elukey: [C: 03+1] profile::java: Add support to deploy debug packages [puppet] - 10https://gerrit.wikimedia.org/r/617079 (owner: 10Muehlenhoff) [10:40:54] !log rolling upgrade of ATS to version 8.0.8-1wm2 [10:40:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:22] (03CR) 10Muehlenhoff: "Looks good, but before deploying we should check existing ferm rules, not sure if that happened yet? 
Most should be created by auto_ferm, " [puppet] - 10https://gerrit.wikimedia.org/r/617083 (owner: 10Filippo Giunchedi) [10:41:24] (03PS1) 10Jbond: role::logstash::apifeature: refactor role to profile [puppet] - 10https://gerrit.wikimedia.org/r/617085 (https://phabricator.wikimedia.org/T259013) [10:41:32] 10Operations, 10DBA, 10User-Kormat: Package wmfmariadbpy as a .deb - https://phabricator.wikimedia.org/T259021 (10Marostegui) So: these are definitely used: ` compare.py mysql.py backup_mariadb.py osc_host.py ` [10:43:21] RECOVERY - Apache HTTP on wtp2001 is OK: HTTP OK: HTTP/1.1 302 Found - 631 bytes in 0.192 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:43:21] RECOVERY - nutcracker process on wtp2001 is OK: PROCS OK: 1 process with UID = 112 (nutcracker), command name nutcracker https://wikitech.wikimedia.org/wiki/Nutcracker [10:43:27] RECOVERY - nutcracker socket on wtp2001 is OK: TCP OK - 0.000 second response time on socket /var/run/nutcracker/redis_codfw.sock https://wikitech.wikimedia.org/wiki/Nutcracker [10:44:37] (03PS1) 10Giuseppe Lavagetto: envoyproxy::tls_terminator: update tls definitions [puppet] - 10https://gerrit.wikimedia.org/r/617086 (https://phabricator.wikimedia.org/T258140) [10:44:54] (03PS2) 10Jbond: role::logstash::apifeature: refactor role to profile [puppet] - 10https://gerrit.wikimedia.org/r/617085 (https://phabricator.wikimedia.org/T259013) [10:49:12] (03PS3) 10Jbond: role::logstash::apifeature: refactor role to profile [puppet] - 10https://gerrit.wikimedia.org/r/617085 (https://phabricator.wikimedia.org/T259013) [10:52:06] (03PS4) 10Jbond: role::logstash::apifeature: refactor role to profile [puppet] - 10https://gerrit.wikimedia.org/r/617085 [10:53:16] (03CR) 10jerkins-bot: [V: 04-1] role::logstash::apifeature: refactor role to profile [puppet] - 10https://gerrit.wikimedia.org/r/617085 (owner: 10Jbond) [10:54:25] (03CR) 10Hnowlan: [C: 03+1] envoyproxy::tls_terminator: update tls definitions [puppet] - 
10https://gerrit.wikimedia.org/r/617086 (https://phabricator.wikimedia.org/T258140) (owner: 10Giuseppe Lavagetto) [10:54:28] (03CR) 10Elukey: Move mjolnir's daemons to search-loader hosts (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/616101 (https://phabricator.wikimedia.org/T258245) (owner: 10Elukey) [10:54:37] (03PS5) 10Jbond: role::logstash::apifeature: refactor role to profile [puppet] - 10https://gerrit.wikimedia.org/r/617085 [10:56:21] (03PS8) 10Elukey: Move mjolnir's daemons to search-loader hosts [puppet] - 10https://gerrit.wikimedia.org/r/616101 (https://phabricator.wikimedia.org/T258245) [10:57:54] (03CR) 10JMeybohm: [C: 04-1] Add local service proxy to the tls terminator v0.2 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/582792 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [10:58:52] 10Operations, 10DBA, 10User-Kormat: Package wmfmariadbpy as a .deb - https://phabricator.wikimedia.org/T259021 (10jcrespo) My intention is to put backup_mariadb.py and its dependencies (remote execution, etc.) on a separate package (that is why I showed you the https://gerrit.wikimedia.org/r/c/operations/sof... [11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for European mid-day backport window(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200729T1100). [11:00:05] VulpesVulpes825: A patch you scheduled for European mid-day backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:35] Present and currently waiting for patch deployment. [11:01:14] I can deploy today :) [11:01:17] looking at the patch now [11:01:47] the 1x version only seems to have minimal changes, is that correct? [11:01:52] * Lucas_WMDE looks at the task [11:02:10] Lucas_WMDE That is correct. 
[11:02:24] ok [11:02:32] The reason why 1x logo is changed, but not 1.5x and 2x is the previous update in unknown. [11:02:40] (03PS4) 10Lucas Werkmeister (WMDE): Change the logo for Wu Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616760 (https://phabricator.wikimedia.org/T259005) (owner: 10VulpesVulpes825) [11:02:51] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Change the logo for Wu Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616760 (https://phabricator.wikimedia.org/T259005) (owner: 10VulpesVulpes825) [11:03:34] (03Merged) 10jenkins-bot: Change the logo for Wu Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616760 (https://phabricator.wikimedia.org/T259005) (owner: 10VulpesVulpes825) [11:04:04] * Lucas_WMDE looks up how to deploy logo changes [11:04:11] probably needs some cache purging commands IIRC [11:04:22] ah yes, https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers#Purging [11:05:01] and I assume it doesn’t make sense to test this on mwdebug1002, /static probably isn’t affected by X-Wikimedia-Debug [11:06:03] Lucas_WMDE: static is affected by X-Wikimedia-Debug [11:06:11] oh, ok [11:06:14] then I can try it out, I guess [11:06:19] yup [11:06:24] thanks [11:06:29] ok change is on mwdebug1001 [11:06:36] and yes, you need to purge changed static URLs (mwscript purgeList.php) [11:06:57] (03PS6) 10Jbond: role::logstash: refactor role to conform to coding guid [puppet] - 10https://gerrit.wikimedia.org/r/617085 [11:07:43] yup, seems to work (at 200% zoom, logo gets the larger characters with XWD) [11:07:43] Lucas_WMDE: VulpesVulpes825: wfm [11:07:45] syncing [11:07:58] :( [11:08:54] !log lucaswerkmeister-wmde@deploy1001 Synchronized static/images/project-logos/: Config: [[gerrit:616760|Change the logo for Wu Wikipedia (T259005)]] (duration: 01m 08s) [11:08:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:00] T259005: Change the logo of Wu Wikipedia - 
https://phabricator.wikimedia.org/T259005 [11:09:35] !log lucaswerkmeister-wmde@mwmaint1002:~$ printf 'https://en.wikipedia.org/static/images/project-logos/%s\n' 'wuuwiki.png' 'wuuwiki-1.5x.png' 'wuuwiki-2x.png' | mwscript purgeList.php # T259005 [11:09:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:57] and now it also works without the debug header [11:10:18] (y) [11:11:02] It is now live on wuu wikipedia. Lucas_WMDE, thank you for your help. [11:11:12] you’re welcome, thanks for the patch :) [11:11:50] !log EU B&C window done [11:11:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:30] (03PS1) 10Jbond: role::logstash::puppetboard: refactor role [puppet] - 10https://gerrit.wikimedia.org/r/617087 (https://phabricator.wikimedia.org/T259013) [11:15:55] (03PS2) 10Jbond: role::logstash::puppetboard: refactor role [puppet] - 10https://gerrit.wikimedia.org/r/617087 (https://phabricator.wikimedia.org/T259013) [11:17:18] 10Operations, 10DBA, 10User-Kormat: Package wmfmariadbpy as a .deb - https://phabricator.wikimedia.org/T259021 (10jcrespo) Let me go a bit overboard an propose you the following to be added to a potential wmfmariadbpy package- to be installed on cumin hosts: * Libraries: WMFMariaDB, WMFReplication (I believ... [11:19:01] 10Operations, 10DBA, 10User-Kormat: Package wmfmariadbpy as a .deb - https://phabricator.wikimedia.org/T259021 (10jcrespo) BTW, this is duplicate of 3yo T165358 :-). [11:19:35] 10Operations, 10DBA, 10User-Kormat: Package wmfmariadbpy as a .deb - https://phabricator.wikimedia.org/T259021 (10jcrespo) [11:22:29] 10Operations, 10DBA, 10User-Kormat: Package wmfmariadbpy as a .deb - https://phabricator.wikimedia.org/T259021 (10Marostegui) >>! In T259021#6344214, @jcrespo wrote: > My intention is to put backup_mariadb.py and its dependencies (remote execution, etc.) on a separate package (that is why I showed you the ht... 
[11:22:38] (03CR) 10Jcrespo: "> we had rigid restrictions" [puppet] - 10https://gerrit.wikimedia.org/r/617077 (https://phabricator.wikimedia.org/T234826) (owner: 10Jcrespo) [11:25:28] (03PS3) 10Jbond: role::logstash::puppetboard: refactor role [puppet] - 10https://gerrit.wikimedia.org/r/617087 (https://phabricator.wikimedia.org/T259013) [11:25:52] 10Operations, 10serviceops, 10Patch-For-Review: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp2001.codfw.wmnet'] ` and were **ALL** successful. [11:26:20] RECOVERY - parsoid on wtp2001 is OK: HTTP OK: HTTP/1.1 200 OK - 1022 bytes in 0.170 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/parsoid [11:27:05] (03PS1) 10Jbond: role::logstash::collector: remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/617090 (https://phabricator.wikimedia.org/T259013) [11:27:20] PROBLEM - MariaDB Replica SQL: analytics_meta on db1108 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1133, Errmsg: Error Cant find any matching row in the user table on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:27:45] elukey: ^ [11:27:53] I will check with you [11:27:53] eh, I didn't touch anything? 
[11:28:04] (03PS1) 10Elukey: Add hue overrides to an-tool1009 [puppet] - 10https://gerrit.wikimedia.org/r/617091 (https://phabricator.wikimedia.org/T258768) [11:28:08] nono it is me, I created a test database, of course it didn't work [11:28:21] elukey: it was the grants, not the DB [11:28:25] I will work with you on that [11:28:30] ahhhh [11:28:39] It will be an easy fix [11:28:43] elukey: also please don't create "test" databases :-D, call them something else :-D [11:29:06] we just had to spend 2 days of kormat's work fixing an issue with that :-D [11:29:14] (not kidding) [11:30:01] well I have to test a new version of the Hue daemon, and to avoid using the prod DB I need to create another one, it is technically testing but if good it will allow me to upgrade [11:30:08] and I can't do it elsewhere sadly [11:30:17] sorry I didn't express myself weel [11:30:21] you should create test databases [11:30:29] just not call them "test" :-D [11:30:39] or testsomething [11:31:01] ah okok yes I agree :) [11:31:02] jynus: it is not called test or anything similar [11:31:05] ah, ok [11:31:08] I missunderstood [11:31:17] I called it hue_next, really promising [11:31:20] (03PS1) 10Urbanecm: Fix overindentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617092 [11:31:25] (not really since I broke replication after a minute) [11:31:28] elukey: that has no issues [11:31:42] well, the naming at least [11:32:52] (03PS1) 10Urbanecm: Add Wikipedia wordmark for trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617093 (https://phabricator.wikimedia.org/T255489) [11:33:12] (03CR) 10Urbanecm: [C: 03+2] "noop for production" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617092 (owner: 10Urbanecm) [11:33:18] (03CR) 10Urbanecm: [C: 03+2] Add Wikipedia wordmark for trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617093 (https://phabricator.wikimedia.org/T255489) (owner: 10Urbanecm) [11:33:59] (03Merged) 10jenkins-bot: Fix overindentation 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/617092 (owner: 10Urbanecm) [11:34:03] (03Merged) 10jenkins-bot: Add Wikipedia wordmark for trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617093 (https://phabricator.wikimedia.org/T255489) (owner: 10Urbanecm) [11:34:04] RECOVERY - MariaDB Replica SQL: analytics_meta on db1108 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:35:58] (03PS7) 10Jbond: role::logstash: refactor role to conform to coding guid [puppet] - 10https://gerrit.wikimedia.org/r/617085 [11:36:03] (03CR) 10Jcrespo: "> then it should be in theory easy to stop the mariadb slaves and restart them no?" [puppet] - 10https://gerrit.wikimedia.org/r/617077 (https://phabricator.wikimedia.org/T234826) (owner: 10Jcrespo) [11:36:19] (03PS4) 10Jbond: role::logstash::puppetboard: refactor role [puppet] - 10https://gerrit.wikimedia.org/r/617087 (https://phabricator.wikimedia.org/T259013) [11:36:22] (03Abandoned) 10Jcrespo: mariadb: Create ugly exception for port assignment for db1108 [puppet] - 10https://gerrit.wikimedia.org/r/617077 (https://phabricator.wikimedia.org/T234826) (owner: 10Jcrespo) [11:36:22] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 9f7e03292941d0d782437862f406efa7e1c6463e: Fix overindentation (duration: 01m 08s) [11:36:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:26] (03PS2) 10Jbond: role::logstash::collector: remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/617090 (https://phabricator.wikimedia.org/T259013) [11:36:52] (03CR) 10Elukey: [C: 03+2] Add hue overrides to an-tool1009 [puppet] - 10https://gerrit.wikimedia.org/r/617091 (https://phabricator.wikimedia.org/T258768) (owner: 10Elukey) [11:37:50] (03CR) 10Jforrester: [C: 04-1] "I don't like this approach, as I said, but if you insist, we can…" (033 comments) [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/612879 (https://phabricator.wikimedia.org/T241961) (owner: 10C. Scott Ananian) [11:38:15] 10Operations, 10Fundraising-Backlog: New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 (10Pcoombe) @Krinkle I do quite like the idea of exploring a microsite in the longer term, but it would involve more work that we hadn't planned for. We're... [11:39:09] !log urbanecm@deploy1001 Synchronized static/images/mobile/copyright/wikipedia-wordmark-tr.svg: 252bb6c1bf83d96a14a0ef63e06eb544eef8a00b: Add Wikipedia wordmark for trwiki (T255489; sync 1/2) (duration: 01m 06s) [11:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:14] T255489: Mobile version logo on Turkish Wikipedia - https://phabricator.wikimedia.org/T255489 [11:41:19] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 252bb6c1bf83d96a14a0ef63e06eb544eef8a00b: Add Wikipedia wordmark for trwiki (T255489; sync 2/2) (duration: 01m 05s) [11:41:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:52] (03PS1) 10Jbond: role::logstash7: refactor role to conform to coding guide [puppet] - 10https://gerrit.wikimedia.org/r/617095 (https://phabricator.wikimedia.org/T259013) [11:42:21] (03PS8) 10Jbond: role::logstash: refactor role to conform to coding guide [puppet] - 10https://gerrit.wikimedia.org/r/617085 [11:42:56] (03PS2) 10Hnowlan: api-gateway: add helmfile.d configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/616467 (https://phabricator.wikimedia.org/T254906) [11:43:01] elukey: I think I may have found a solution to make things easier- create a third instance for tests/one time things [11:43:19] that way it can be in read-write and will not affect replication? 
[11:43:35] (03PS1) 10Cmjohnson: Adding mgmt dns for alert1001 to dns file, netbox aleady updated [dns] - 10https://gerrit.wikimedia.org/r/617096 (https://phabricator.wikimedia.org/T255072) [11:43:48] yes definitely this is a possible solution, but I also like to get exposed to these issues since I am learning a lot [11:44:02] I know it is a toll on your team but I hope to be more independent eventually :) [11:44:04] when you are available later or tomorrow [11:44:23] I will want to ask you, and at the same time show you how to do an emergency recovery [11:44:24] (03CR) 10Cmjohnson: [C: 03+2] Adding mgmt dns for alert1001 to dns file, netbox aleady updated [dns] - 10https://gerrit.wikimedia.org/r/617096 (https://phabricator.wikimedia.org/T255072) (owner: 10Cmjohnson) [11:44:42] in case you have to do it and we are not around [11:45:07] (03PS2) 10Jbond: role::logstash7: refactor role to conform to coding guide [puppet] - 10https://gerrit.wikimedia.org/r/617095 (https://phabricator.wikimedia.org/T259013) [11:45:23] (03PS5) 10Jbond: role::logstash::puppetboard: refactor role [puppet] - 10https://gerrit.wikimedia.org/r/617087 (https://phabricator.wikimedia.org/T259013) [11:45:32] (03PS3) 10Jbond: role::logstash::collector: remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/617090 (https://phabricator.wikimedia.org/T259013) [11:46:23] (03PS3) 10Jbond: role::logstash7: refactor role to conform to coding guide [puppet] - 10https://gerrit.wikimedia.org/r/617095 (https://phabricator.wikimedia.org/T259013) [11:47:42] (03CR) 10Muehlenhoff: "This is currently in use on logstash1007-1009 and logstash2004-2006? It will be obsolete with the full move to Kibana 7, though." 
[puppet] - 10https://gerrit.wikimedia.org/r/617090 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [11:52:25] (03PS1) 10Cmjohnson: Adding production dns alert1001, public ip with ipv6 [dns] - 10https://gerrit.wikimedia.org/r/617101 (https://phabricator.wikimedia.org/T255072) [11:52:29] (03CR) 10Jbond: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/617090 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [11:52:50] (03CR) 10jerkins-bot: [V: 04-1] Adding production dns alert1001, public ip with ipv6 [dns] - 10https://gerrit.wikimedia.org/r/617101 (https://phabricator.wikimedia.org/T255072) (owner: 10Cmjohnson) [11:54:12] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:55:07] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/616541 (owner: 10Muehlenhoff) [11:55:46] (03PS2) 10Cmjohnson: Adding production dns alert1001, public ip with ipv6 [dns] - 10https://gerrit.wikimedia.org/r/617101 (https://phabricator.wikimedia.org/T255072) [11:56:01] 10Operations, 10netbox: Add SSO support to netbox - https://phabricator.wikimedia.org/T244849 (10jbond) >>! In T244849#6330155, @ayounsi wrote: > No idea if it's useful here but came across https://github.com/jeremyschulman/netbox-plugin-auth-saml2 forgot to respond to this, yes this is useful thanks @ayounsi [11:57:25] (03CR) 10Cmjohnson: [C: 03+2] Adding production dns alert1001, public ip with ipv6 [dns] - 10https://gerrit.wikimedia.org/r/617101 (https://phabricator.wikimedia.org/T255072) (owner: 10Cmjohnson) [11:57:58] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:00:04] Urbanecm and Amir1: #bothumor I � Unicode. All rise for Create avkwiki deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200729T1200). [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200729T1200) [12:00:28] 10Operations, 10Readers-Web-Backlog, 10WMF-Legal, 10SEO: (Automate) adding wikinews language versions to the Google Publisher Center / Google News - https://phabricator.wikimedia.org/T254437 (10Nonovian) p:05Medium→03High I change the priority for more visibility on Phabricator. [12:00:45] Amir1: \o/ [12:01:23] * marostegui around in case [12:02:44] o/ [12:02:57] You start, I'm around :D [12:03:15] cool [12:03:49] * Urbanecm rebasing config file [12:04:38] (03CR) 10Filippo Giunchedi: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/617083 (owner: 10Filippo Giunchedi) [12:05:03] (03PS8) 10Urbanecm: Initial configuration for avkwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612897 (https://phabricator.wikimedia.org/T257943) [12:05:25] (03PS9) 10Urbanecm: Initial configuration for avkwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612897 (https://phabricator.wikimedia.org/T257943) [12:05:31] (03CR) 10Urbanecm: [C: 03+2] Initial configuration for avkwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612897 (https://phabricator.wikimedia.org/T257943) (owner: 10Urbanecm) [12:06:36] (03Merged) 10jenkins-bot: Initial configuration for avkwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612897 (https://phabricator.wikimedia.org/T257943) (owner: 10Urbanecm) [12:06:44] fetching config to the deployment host [12:07:10] pulling to mwmaint [12:07:15] !log rebooting idp2001 for kernel update [12:07:17] !log jmm@cumin2001 START - Cookbook 
sre.hosts.reboot-single [12:07:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:44] Amir1: so, `mwscript extensions/WikimediaMaintenance/addWiki.php --wiki=cebwiki avk wikipedia avkwiki avk.wikipedia.org` would be the magical command now, right? [12:07:55] Yup [12:08:01] okay, running that [12:09:14] sql avkwiki --write says the master host is 10.64.32.197, which is db1100, which is in s5 [12:09:17] seems it worked [12:09:41] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [12:09:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:55] Urbanecm: I can successfully see the new database on s5 slaves [12:09:59] cooolio [12:10:00] cool! [12:10:10] Amir1: at which point am I supposed to sync db-* files? [12:10:25] something tells me before everything else [12:10:36] yup, that's first [12:10:40] okay, doing [12:11:35] https://wikitech.wikimedia.org/wiki/Add_a_wiki#Database_creation [12:12:02] !log urbanecm@deploy1001 Synchronized wmf-config/db-eqiad.php: Creating avkwiki (T257943) (duration: 01m 05s) [12:12:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:07] T257943: Create Wikipedia Kotava - https://phabricator.wikimedia.org/T257943 [12:12:55] that doesn't say anything about db-eqiad.php and db-codfw.eqiad Amir1 - we need to update the wiki page for different shard process (ideally once we move some closed wikis) [12:13:06] *db-codfw.php, of course [12:13:38] yeah but the concept is the same, first db files [12:13:49] i see [12:13:51] db lists, db-eqiad.php [12:14:27] !log urbanecm@deploy1001 Synchronized wmf-config/db-codfw.php: Creating avkwiki (T257943) (duration: 01m 06s) [12:14:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:33] syncing dblists [12:14:53] I move to another desk now [12:15:06] (03PS1) 
10Jbond: thanos: add thanos.wikimedia.org top the cache layer [puppet] - 10https://gerrit.wikimedia.org/r/617105 (https://phabricator.wikimedia.org/T151009) [12:15:29] (03PS1) 10Cmjohnson: Adding alert1001 to site.pp and dhpd file [puppet] - 10https://gerrit.wikimedia.org/r/617106 (https://phabricator.wikimedia.org/T255072) [12:15:41] !log urbanecm@deploy1001 Synchronized dblists: Creating avkwiki (T257943) (duration: 01m 06s) [12:15:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:48] wikiversions going now [12:15:51] (03PS1) 10Jbond: thanos: add thanos cname pointing to cache [dns] - 10https://gerrit.wikimedia.org/r/617107 (https://phabricator.wikimedia.org/T151009) [12:16:10] (03PS2) 10Jbond: thanos: add thanos.wikimedia.org top the cache layer [puppet] - 10https://gerrit.wikimedia.org/r/617105 (https://phabricator.wikimedia.org/T151009) [12:19:53] (03CR) 10Cmjohnson: [C: 03+2] Adding alert1001 to site.pp and dhpd file [puppet] - 10https://gerrit.wikimedia.org/r/617106 (https://phabricator.wikimedia.org/T255072) (owner: 10Cmjohnson) [12:22:14] (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/24218/cp1079.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/617105 (https://phabricator.wikimedia.org/T151009) (owner: 10Jbond) [12:24:08] scap is taking longer than usual to finish [12:24:19] !log urbanecm@deploy1001 rebuilt and synchronized wikiversions files: Creating avkwiki (T257943) [12:24:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:25] T257943: Create Wikipedia Kotava - https://phabricator.wikimedia.org/T257943 [12:26:03] !log urbanecm@deploy1001 Synchronized static/images/project-logos/: Creating avkwiki (T257943) (duration: 01m 06s) [12:26:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:16] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Creating avkwiki (T257943) (duration: 01m
03s) [12:27:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:34] !log urbanecm@deploy1001 Synchronized langlist: Creating avkwiki (T257943) (duration: 01m 05s) [12:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:45] (03PS1) 10Urbanecm: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617110 [12:28:47] (03CR) 10Urbanecm: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617110 (owner: 10Urbanecm) [12:29:29] (03CR) 10Nikerabbit: "The file looks not optimized for size. Is that intentional?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617093 (https://phabricator.wikimedia.org/T255489) (owner: 10Urbanecm) [12:29:36] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617110 (owner: 10Urbanecm) [12:30:18] 10Operations, 10Continuous-Integration-Infrastructure, 10Jenkins: Update Jenkins gpg release key in reprepro - https://phabricator.wikimedia.org/T259116 (10hashar) [12:30:59] !log urbanecm@deploy1001 Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 02m 21s) [12:31:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:12] Amir1: marostegui: So, we're done now :). [12:31:18] <3 [12:31:28] Cool, let's check cebwiki [12:31:30] Urbanecm: let me know when ready to sanitize on the ticket [12:31:34] Does it look like avk wiki now? [12:31:43] I hope not :-/ [12:31:56] marostegui: it should be ready now, the database got created. 
I'll put a note on the DBA ticket [12:32:01] thanks [12:33:06] Amir1: cebwiki looks fine to me [12:33:12] \o/ [12:33:42] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:34:45] (03PS6) 10JMeybohm: Add local service proxy to the tls terminator v0.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/582792 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [12:35:27] Amir1: ^^ - doesn't look related to me though [12:36:02] hmm, what's causing it? [12:37:30] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:37:37] not sure, but `[{exception_id}] {exception_url} Wikimedia\Assert\InvariantException from line 224 of /srv/mediawiki/php-1.36.0-wmf.1/vendor/wikimedia/assert/src/Assert.php: Invariant failed: Bad UTF-8 at end of string (3 byte sequence)` sounds like T242298, opened in 1.35.0-wmf.14 times [12:37:37] T242298: Invariant failed: Bad UTF-8 at end of string (2 byte sequence) - https://phabricator.wikimedia.org/T242298 [12:38:05] (03CR) 10JMeybohm: "While trying to verify the generated envoy.yaml it seemed easier to push a new patch than to mention the linefeed chomping problems one-by" [deployment-charts] - 10https://gerrit.wikimedia.org/r/582792 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [12:40:31] Urbanecm Amir1 I see the table being fine on external store for avkwiki [12:40:36] so that looks good too [12:40:44] cool!
[12:41:47] 10Operations, 10User-jbond: OKR: Install and configure new CFSSL PKI server - https://phabricator.wikimedia.org/T259117 (10jbond) p:05Triage→03Medium [12:43:07] \o/ [12:44:25] !log imported curl 7.38.0-4+deb8u16+wmf1 to apt.wikimedia.org (jessie-wikimedia) T259102 [12:44:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:15] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: Upgrade ELK Stack - https://phabricator.wikimedia.org/T234854 (10jcrespo) I am getting a lot of 500 internal server errors on logstash-next instance. I am guessing that is expected/WIP? [12:48:28] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [12:48:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:06] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [12:49:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:25] 10Operations, 10User-jbond: OKR: Install and configure new CFSSL PKI server - https://phabricator.wikimedia.org/T259117 (10jbond) [12:49:40] 10Operations, 10User-jbond: OKR: Install and configure new CFSSL PKI server - https://phabricator.wikimedia.org/T259117 (10jbond) [12:50:31] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Due By: 2020-07-25) rack/setup/install alert1001 - https://phabricator.wikimedia.org/T255072 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` alert1001.wikimedia.org ` The log can be foun... 
[12:51:47] (03PS3) 10Jbond: thanos: add thanos.wikimedia.org to the cache layer [puppet] - 10https://gerrit.wikimedia.org/r/617105 (https://phabricator.wikimedia.org/T151009) [12:51:56] (03CR) 10CDanis: [C: 03+1] rsync: listen for stunnel connections on v4/v6 [puppet] - 10https://gerrit.wikimedia.org/r/617083 (owner: 10Filippo Giunchedi) [12:54:02] (03CR) 10Ema: [C: 03+1] thanos: add thanos.wikimedia.org to the cache layer [puppet] - 10https://gerrit.wikimedia.org/r/617105 (https://phabricator.wikimedia.org/T151009) (owner: 10Jbond) [12:54:49] (03PS2) 10Volans: mgmt: netbox-generated data for mgmt codfw [dns] - 10https://gerrit.wikimedia.org/r/615668 (https://phabricator.wikimedia.org/T233183) [12:56:55] !log volans@cumin1001 START - Cookbook sre.dns.netbox [12:56:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:05] (03PS4) 10Jbond: thanos: add thanos.wikimedia.org to the cache layer [puppet] - 10https://gerrit.wikimedia.org/r/617105 (https://phabricator.wikimedia.org/T151009) [12:57:26] 10Operations, 10RESTBase, 10Traffic, 10Platform Team Workboards (Clinic Duty Team): RESTBase CORS redirect resolve should not hit frontend caches - https://phabricator.wikimedia.org/T259054 (10ema) [12:58:00] !log volans@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [12:58:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:10] [HEADS UP] I'm about to merge the DNS patch that moves all codfw mgmt record to the auto-generated ones via Netbox in few minutes ( https://gerrit.wikimedia.org/r/c/operations/dns/+/615668 ) [12:58:11] (03CR) 10Filippo Giunchedi: "LGTM overall" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/617105 (https://phabricator.wikimedia.org/T151009) (owner: 10Jbond) [13:00:04] liw and brennen: #bothumor My software never has bugs. It just develops random features. Rise for Mediawiki train - European+American Version. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200729T1300). [13:01:15] (03PS1) 10Lars Wirzenius: group1 wikis to 1.36.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617114 [13:01:17] (03CR) 10Lars Wirzenius: [C: 03+2] group1 wikis to 1.36.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617114 (owner: 10Lars Wirzenius) [13:02:00] (03Merged) 10jenkins-bot: group1 wikis to 1.36.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617114 (owner: 10Lars Wirzenius) [13:03:27] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [13:03:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:50] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:04:02] (03PS4) 10Kormat: Create debian packages. 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/616846 [13:04:24] !log liw@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.36.0-wmf.2 [13:04:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:32] !log liw@deploy1001 Synchronized php: group1 wikis to 1.36.0-wmf.2 (duration: 01m 07s) [13:05:34] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:05:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:03] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Due By: 2020-07-25) rack/setup/install alert1001 - https://phabricator.wikimedia.org/T255072 (10Cmjohnson) [13:07:09] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Due By: 2020-07-25) rack/setup/install alert1001 - https://phabricator.wikimedia.org/T255072 (10Cmjohnson) 05Open→03Resolved [13:08:51] (03PS1) 10Jbond: profile::thanos::httpd: pass maxconn and query_port to idp vhost [puppet] - 10https://gerrit.wikimedia.org/r/617115 [13:10:05] (03CR) 10jerkins-bot: [V: 04-1] profile::thanos::httpd: pass maxconn and query_port to idp vhost [puppet] - 10https://gerrit.wikimedia.org/r/617115 (owner: 10Jbond) [13:10:11] (03CR) 10Andrew Bogott: [C: 03+2] cloud - hiera5: migrate labs main environment to hiera5 [puppet] - 10https://gerrit.wikimedia.org/r/615159 (owner: 10Jbond) [13:13:16] 10Operations, 10RESTBase, 10Traffic, 10Platform Team Workboards (Clinic Duty Team): RESTBase CORS redirect resolve should not hit frontend caches - https://phabricator.wikimedia.org/T259054 (10ema) For the record I don't think this currently causes any specific functional issue. I've spotted a few RESTBase... 
[13:17:26] (03PS1) 10Volans: dns: skip Netbox addresses without DNS name [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/617119 (https://phabricator.wikimedia.org/T233183) [13:17:50] (03CR) 10jerkins-bot: [V: 04-1] dns: skip Netbox addresses without DNS name [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/617119 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [13:18:30] (03PS2) 10Volans: dns: skip Netbox addresses without DNS name [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/617119 (https://phabricator.wikimedia.org/T233183) [13:25:26] RECOVERY - puppet last run on otrs1001 is OK: OK: Puppet is currently disabled (disable for OTRS 6.x upgrade), not alerting. Last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:26:14] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:29:15] (03PS4) 10Muehlenhoff: Add CAS support to Hue [puppet] - 10https://gerrit.wikimedia.org/r/616541 [13:29:23] !log jayme@cumin1001 conftool action : set/pooled=yes; selector: name=wtp2001.codfw.wmnet [13:29:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:43] (03PS1) 10Elukey: role::druid::test_analytics::worker: fix wrong monitor name [puppet] - 10https://gerrit.wikimedia.org/r/617120 [13:31:20] (03CR) 10Elukey: [C: 03+2] role::druid::test_analytics::worker: fix wrong monitor name [puppet] - 10https://gerrit.wikimedia.org/r/617120 (owner: 10Elukey) [13:33:06] 10Operations, 10serviceops, 10Patch-For-Review: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10JMeybohm) @Dzahn I did `wtp2001.codfw.wmnet` as that was pretty full as well. 
[13:33:53] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/24222/" [puppet] - 10https://gerrit.wikimedia.org/r/616541 (owner: 10Muehlenhoff) [13:34:14] (03CR) 10Muehlenhoff: [C: 03+2] Add CAS support to Hue [puppet] - 10https://gerrit.wikimedia.org/r/616541 (owner: 10Muehlenhoff) [13:34:58] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Due By: 2020-07-25) rack/setup/install alert1001 - https://phabricator.wikimedia.org/T255072 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['alert1001.wikimedia.org'] ` and were **ALL** successful. [13:36:00] RECOVERY - mediawiki-installation DSH group on wtp2001 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [13:38:30] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:42:32] (03PS3) 10Volans: dns: check that primary addresses have DNS names [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/617119 (https://phabricator.wikimedia.org/T233183) [13:43:42] (03CR) 10Volans: [C: 03+2] "Self-merging as this is currently breaking the sre.dns.netbox cookbook due to a primary address without a DNS name." 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/617119 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [13:45:13] !log volans@cumin1001 START - Cookbook sre.dns.netbox [13:45:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:42] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:50:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:48] (03CR) 10Giuseppe Lavagetto: [C: 03+1] envoy-future: new image for future versions of Envoy (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/616865 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [13:52:19] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:55:34] !log migrating *all* codfw mgmt DNS records to the autogenerated ones via Netbox - T233183 [13:55:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:42] T233183: Automate generation of Management DNS records from Netbox - https://phabricator.wikimedia.org/T233183 [13:55:46] (03CR) 10Volans: [C: 03+2] mgmt: netbox-generated data for mgmt codfw [dns] - 10https://gerrit.wikimedia.org/r/615668 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [13:55:49] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:57:21] (03CR) 10JMeybohm: [C: 03+1] "I'm unsure if it's a good idea to symlink to resources from ../envoy as that might lead to changes that are easy to overlook when bumping " [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/616865 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [14:00:55] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:03:24] (03CR) 10Elukey: [C: 03+1] Initial debian commit [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/610880 (https://phabricator.wikimedia.org/T251006) (owner: 10Ottomata) [14:03:56] (03CR) 10Hnowlan: "> Patch Set 1: Code-Review+1" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/616865 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [14:05:37] (03CR) 10Hnowlan: envoy-future: new image for future versions of Envoy (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/616865 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [14:07:15] (03PS1) 10Volans: scripts: codfw migrated to Netbox [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/617147 (https://phabricator.wikimedia.org/T233183) [14:07:22] (03PS5) 10Cwhite: provision loki on grafana-next [puppet] - 10https://gerrit.wikimedia.org/r/616851 (https://phabricator.wikimedia.org/T222826) [14:07:27] 10Operations, 10Mail, 10observability, 10User-MoritzMuehlenhoff: Fix paniclog alert to only sent mails once - https://phabricator.wikimedia.org/T257016 (10MoritzMuehlenhoff) @herron Seems to work 
fine, didn't see a paniclog mail today \o/ [14:07:56] (03CR) 10Cwhite: provision loki on grafana-next (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/616851 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [14:08:44] (03CR) 10Volans: [C: 03+2] "Self merging to align the behaviour to the just migrated records. Just a feature flag." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/617147 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [14:15:18] (03CR) 10CRusnov: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/617119 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [14:17:03] PROBLEM - Ensure local MW versions match expected deployment on wtp2001 is CRITICAL: CRITICAL: Missing 1 sites from wikiversions. 513 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:18:26] (03CR) 10CRusnov: [V: 03+2 C: 03+2] Update netbox to v2.8.8-wmf [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/616892 (https://phabricator.wikimedia.org/T258942) (owner: 10CRusnov) [14:18:28] (03PS5) 10Kormat: Create debian packages. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/616846 [14:19:31] (03CR) 10Muehlenhoff: "Ack, I missed that it's the end of a patch series." [puppet] - 10https://gerrit.wikimedia.org/r/617090 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [14:20:29] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:21:33] 10Operations, 10Traffic: varnishmtail silently stops working if varnishncsa crashes - https://phabricator.wikimedia.org/T259020 (10ema) I've opened https://github.com/google/mtail/issues/331 to get an opinion from upstream. 
[14:21:54] (03PS1) 10Cmjohnson: Adding all dns for an-test-worker hosts in eqiad [dns] - 10https://gerrit.wikimedia.org/r/617148 (https://phabricator.wikimedia.org/T255520) [14:22:31] (03PS2) 10Jbond: profile::thanos::httpd: pass maxconn and query_port to idp vhost [puppet] - 10https://gerrit.wikimedia.org/r/617115 [14:23:50] (03CR) 10jerkins-bot: [V: 04-1] profile::thanos::httpd: pass maxconn and query_port to idp vhost [puppet] - 10https://gerrit.wikimedia.org/r/617115 (owner: 10Jbond) [14:23:54] (03CR) 10Hnowlan: [V: 03+2 C: 03+2] envoy-future: new image for future versions of Envoy [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/616865 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [14:23:59] (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/24224/" [puppet] - 10https://gerrit.wikimedia.org/r/617115 (owner: 10Jbond) [14:24:45] (03PS2) 10Cmjohnson: Adding all dns for an-test-worker hosts in eqiad [dns] - 10https://gerrit.wikimedia.org/r/617148 (https://phabricator.wikimedia.org/T255520) [14:25:44] (03CR) 10Cmjohnson: [C: 03+2] Adding all dns for an-test-worker hosts in eqiad [dns] - 10https://gerrit.wikimedia.org/r/617148 (https://phabricator.wikimedia.org/T255520) (owner: 10Cmjohnson) [14:27:08] ls -lh [14:27:19] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [14:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:36] (03PS3) 10Jbond: profile::thanos::httpd: pass maxconn and query_port to idp vhost [puppet] - 10https://gerrit.wikimedia.org/r/617115 [14:29:39] (03CR) 10Jbond: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/24225/" [puppet] - 10https://gerrit.wikimedia.org/r/617115 (owner: 10Jbond) [14:29:39] !log installing exiv2 security updates [14:29:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:09] !log install curl security update for jessie [14:30:13] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:02] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:34:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:13] (03CR) 10Herron: [C: 03+2] kibana: CVE-2020-7016 / CVE-2020-7017 mitigations [puppet] - 10https://gerrit.wikimedia.org/r/616885 (https://phabricator.wikimedia.org/T259000) (owner: 10Herron) [14:36:52] (03PS5) 10Jbond: thanos: add thanos.wikimedia.org to the cache layer [puppet] - 10https://gerrit.wikimedia.org/r/617105 (https://phabricator.wikimedia.org/T151009) [14:37:06] (03CR) 10Jbond: thanos: add thanos.wikimedia.org to the cache layer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/617105 (https://phabricator.wikimedia.org/T151009) (owner: 10Jbond) [14:39:15] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:39:16] (03PS1) 10Urbanecm: Set muswiki to reqd only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617152 (https://phabricator.wikimedia.org/T259004) [14:39:58] (03PS2) 10Urbanecm: Set muswiki to read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617152 (https://phabricator.wikimedia.org/T259004) [14:43:12] (03PS2) 10JMeybohm: Add a new action to helm-chartctl to upload prebuild tgz [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/616813 [14:43:14] (03PS2) 10JMeybohm: New package version [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/616814 [14:44:30] (03CR) 10jerkins-bot: [V: 04-1] Add a new action to helm-chartctl to upload prebuild tgz [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/616813 (owner: 10JMeybohm) [14:44:46] (03CR) 10jerkins-bot: [V: 04-1] New package version 
[docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/616814 (owner: 10JMeybohm) [14:45:21] 10Operations, 10SRE-Access-Requests, 10Wikidata, 10Wikidata-Query-Service, and 2 others: wdqs admins should have access to nginx logs, jstack on wdqs machines - https://phabricator.wikimedia.org/T258739 (10herron) 05Open→03Resolved >>! In T258739#6343721, @dcausse wrote: > @herron thanks for the deploy... [14:48:18] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops, 10cloud-services-team (Hardware): (Need By: 2020-06-12) rack/setup/install WMCS 10G switches - https://phabricator.wikimedia.org/T251632 (10Andrew) Is there any reason not to close this? Are there still asset tags or netbox things left to do? [14:48:22] (03PS3) 10Peter.ovchyn: Add defaults for initial state for sidebar. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610069 (https://phabricator.wikimedia.org/T254230) [14:48:40] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (10Andrew) These hosts are in service now. @Cmjohnson, can this be closed? 
[14:48:45] 10Operations: Integrate Stretch 9.13 point update - https://phabricator.wikimedia.org/T258407 (10MoritzMuehlenhoff) [14:49:53] !log installing ruby-json security updates [14:49:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:07] (03PS3) 10JMeybohm: Add a new action to helm-chartctl to upload prebuild tgz [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/616813 [14:51:09] (03PS3) 10JMeybohm: New package version [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/616814 [14:52:19] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:58:17] (03PS1) 10Jbond: storeconfigs: add debug option to test $settings variable [puppet] - 10https://gerrit.wikimedia.org/r/617156 [14:58:20] (03PS1) 10Jbond: storeconfigs: only export resources if storeconfigs is enabled [puppet] - 10https://gerrit.wikimedia.org/r/617157 [14:59:02] (03PS2) 10Jbond: storeconfigs: add debug option to test $settings variable [puppet] - 10https://gerrit.wikimedia.org/r/617156 [14:59:32] (03CR) 10JMeybohm: [C: 03+2] New package version [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/616814 (owner: 10JMeybohm) [14:59:55] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:00:05] (03CR) 10JMeybohm: [C: 03+2] Add a new action to helm-chartctl to upload prebuild tgz [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/616813 (owner: 10JMeybohm) [15:00:12] (03CR) 
10jerkins-bot: [V: 04-1] storeconfigs: only export resources if storeconfigs is enabled [puppet] - 10https://gerrit.wikimedia.org/r/617157 (owner: 10Jbond) [15:01:03] (03PS1) 10Herron: Revert "kibana: CVE-2020-7016 / CVE-2020-7017 mitigations" [puppet] - 10https://gerrit.wikimedia.org/r/617140 [15:01:38] (03CR) 10Herron: [C: 03+2] Revert "kibana: CVE-2020-7016 / CVE-2020-7017 mitigations" [puppet] - 10https://gerrit.wikimedia.org/r/617140 (owner: 10Herron) [15:02:06] (03Merged) 10jenkins-bot: Add a new action to helm-chartctl to upload prebuild tgz [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/616813 (owner: 10JMeybohm) [15:02:07] (03Merged) 10jenkins-bot: New package version [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/616814 (owner: 10JMeybohm) [15:05:31] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:08:14] 10Operations: Integrate Stretch 9.13 point update - https://phabricator.wikimedia.org/T258407 (10MoritzMuehlenhoff) [15:10:28] !log imported docker-report_0.0.8-1 to buster-wikimedia [15:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:01] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:14:03] (03PS3) 10Jbond: storeconfigs: add debug option to test $settings variable [puppet] - 10https://gerrit.wikimedia.org/r/617156 [15:17:09] (03PS2) 10Jbond: storeconfigs: only export resources if storeconfigs is enabled [puppet] - 
10https://gerrit.wikimedia.org/r/617157 [15:18:42] (03CR) 10jerkins-bot: [V: 04-1] storeconfigs: only export resources if storeconfigs is enabled [puppet] - 10https://gerrit.wikimedia.org/r/617157 (owner: 10Jbond) [15:21:05] (03CR) 10Urbanecm: [C: 03+2] Set muswiki to read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617152 (https://phabricator.wikimedia.org/T259004) (owner: 10Urbanecm) [15:22:09] (03Merged) 10jenkins-bot: Set muswiki to read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617152 (https://phabricator.wikimedia.org/T259004) (owner: 10Urbanecm) [15:24:06] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 617152: Set muswiki to read only | https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/617152 (T259004) (duration: 01m 08s) [15:24:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:12] T259004: Move muswiki and mhwiktionary (closed wikis) from s3 to s5 - https://phabricator.wikimedia.org/T259004 [15:25:38] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (10Cmjohnson) 05Open→03Resolved Thanks @Andrew for the assist with these! Resolved [15:26:22] PROBLEM - Check systemd state on chartmuseum2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:26:26] PROBLEM - Check the last execution of helm-chartctl-package-all on chartmuseum2001 is CRITICAL: CRITICAL: Status of the systemd unit helm-chartctl-package-all https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:32:44] 10Operations, 10serviceops, 10Patch-For-Review: All wtp and parse servers have a bad partition scheme. 
- https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` wtp2002.codfw.wmnet ` The log can be found in `/var/log... [15:33:06] !log liw@deploy1001 rebuilt and synchronized wikiversions files: Revert "group[0|1] wikis to 1.36.0-wmf.1" [15:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:36] 10Operations, 10serviceops, 10Patch-For-Review: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` wtp2003.codfw.wmnet ` The log can be found in `/var/log... [15:33:48] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/617105 (https://phabricator.wikimedia.org/T151009) (owner: 10Jbond) [15:33:49] we should have a dedicated wikitech page for SRE team's systemd timers, not pointing to the analytics one :D [15:34:08] (see icinga alarm about chartmuseum above) [15:34:24] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:35:01] (03PS27) 10Hnowlan: api-gateway: Basic envoy chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) [15:35:18] (03PS3) 10Hnowlan: api-gateway: add helmfile.d configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/616467 (https://phabricator.wikimedia.org/T254906) [15:35:51] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/616851 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [15:36:15] liw: sorry, seems we ran into each other at the deployment host. 
Please let me know when it is safe enough, I need to revert a testing patch I created a while ago. [15:36:42] RECOVERY - Ensure local MW versions match expected deployment on wtp2001 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [15:37:32] Urbanecm, will be a little while while I sort out things, sorry [15:37:53] no problem, please ping me once ready :) [15:38:07] Urbanecm, will do [15:40:35] (03PS1) 10Herron: Revert "Revert "kibana: CVE-2020-7016 / CVE-2020-7017 mitigations"" [puppet] - 10https://gerrit.wikimedia.org/r/617141 [15:43:00] (03PS1) 10Lars Wirzenius: Revert "group1 wikis to 1.36.0-wmf.2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617165 [15:43:34] (03CR) 10Lars Wirzenius: [V: 03+2 C: 03+2] Revert "group1 wikis to 1.36.0-wmf.2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617165 (owner: 10Lars Wirzenius) [15:44:29] !log volans@cumin1001 START - Cookbook sre.dns.netbox [15:44:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:42] (03CR) 10Hnowlan: api-gateway: Basic envoy chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [15:44:48] Urbanecm, I'm done for now [15:44:52] thanks! [15:45:23] (03PS1) 10Urbanecm: Revert "Set muswiki to read only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617167 (https://phabricator.wikimedia.org/T259004) [15:45:55] (03CR) 10Herron: "saw an intermittent error on the front end shortly after merging the original patch and quickly reverted for troubleshooting. 
after a clo" [puppet] - 10https://gerrit.wikimedia.org/r/617141 (owner: 10Herron) [15:46:17] (03CR) 10Urbanecm: [C: 03+2] Revert "Set muswiki to read only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617167 (https://phabricator.wikimedia.org/T259004) (owner: 10Urbanecm) [15:46:36] (03CR) 10Dzahn: [C: 03+2] installserver: use correct partman recipe for parse* [puppet] - 10https://gerrit.wikimedia.org/r/616920 (https://phabricator.wikimedia.org/T258775) (owner: 10Dzahn) [15:47:10] (03Merged) 10jenkins-bot: Revert "Set muswiki to read only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617167 (https://phabricator.wikimedia.org/T259004) (owner: 10Urbanecm) [15:47:12] (03CR) 10Herron: [C: 03+2] Revert "Revert "kibana: CVE-2020-7016 / CVE-2020-7017 mitigations"" [puppet] - 10https://gerrit.wikimedia.org/r/617141 (owner: 10Herron) [15:48:44] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 617167: Revert "Set muswiki to read only" | https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/617167 (T259004) (duration: 01m 06s) [15:48:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:50] T259004: Move muswiki and mhwiktionary (closed wikis) from s3 to s5 - https://phabricator.wikimedia.org/T259004 [15:48:56] * Urbanecm is done too [15:49:14] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:53:38] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:56:49] 
10Operations, 10Traffic, 10observability, 10serviceops, 10Patch-For-Review: monitoring for mismatched LVS realserver addresses/configurations - https://phabricator.wikimedia.org/T258648 (10CDanis) 05Open→03Declined The structure of the data makes this teeth-pullingly impossibly difficult to do well,... [15:56:53] 10Operations, 10Traffic, 10observability, 10serviceops: Alert "LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4" is flapping - https://phabricator.wikimedia.org/T258614 (10CDanis) [15:56:56] RECOVERY - Check systemd state on chartmuseum2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:58:34] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:00:58] PROBLEM - Check systemd state on chartmuseum2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:01:47] 10Operations, 10serviceops, 10Patch-For-Review: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ['parse2001.codfw.wmnet', 'parse2002.codfw.wmnet', 'par... [16:02:35] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:02:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:54] 10Operations, 10serviceops, 10Patch-For-Review: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10Dzahn) >>! In T258775#6344708, @JMeybohm wrote: > @Dzahn I did `wtp2001.codfw.wmnet` as that was pretty full as well. 
Thank you. Taking over with wtp2002,2003... [16:14:43] 10Operations, 10Traffic, 10observability, 10serviceops, 10Patch-For-Review: monitoring for mismatched LVS realserver addresses/configurations - https://phabricator.wikimedia.org/T258648 (10Joe) Truth is that every host that has multiple pools that use the same backend IP can happily live with only one po... [16:15:18] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [16:15:19] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [16:15:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:05] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [16:16:05] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [16:16:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:48] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [16:16:51] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [16:16:51] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [16:16:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:35] (03PS7) 10Giuseppe Lavagetto: Add local service proxy to the tls terminator v0.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/582792 (https://phabricator.wikimedia.org/T244843) [16:18:54] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:18:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:04] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [16:19:07] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:48] (03PS1) 10Bartosz Dziewoński: Revert new reply API for now (1.36.0-wmf.2 only) [extensions/DiscussionTools] (wmf/1.36.0-wmf.2) - 10https://gerrit.wikimedia.org/r/617169 (https://phabricator.wikimedia.org/T252558) [16:21:06] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:21:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:16] (03PS28) 10Hnowlan: api-gateway: Basic envoy chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) [16:21:54] (03CR) 10Bartosz Dziewoński: "(please check that this looks right and +1)" [extensions/DiscussionTools] (wmf/1.36.0-wmf.2) - 10https://gerrit.wikimedia.org/r/617169 (https://phabricator.wikimedia.org/T252558) (owner: 10Bartosz Dziewoński) [16:22:13] PROBLEM - Check that envoy is running on wtp2002 is CRITICAL: connect to address 10.192.16.44 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [16:22:43] this is not supposed to happen because the reimage script is running and downtimes stuff ^ [16:23:17] PROBLEM - mediawiki-installation DSH group on wtp2003 is CRITICAL: Host wtp2003 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:23:17] PROBLEM - puppet last run on wtp2002 is CRITICAL: connect to address 10.192.16.44 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:23:46] there was an exception when the cookbook tried to downtime. i am manually downtiming. 
they are unpooled reinstalls [16:24:49] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [16:24:49] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:24:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:54] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [16:24:55] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:24:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:23] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2002.codfw.wmnet', 'parse2003.codfw.wmnet', 'parse2001.codfw.wmnet'] ` and were **ALL** successful. [16:28:15] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ['parse2004.codfw.wmnet', 'parse2005.codfw.wmnet', 'parse2006.codfw.wmnet'] `... 
[16:29:46] (03PS1) 10Bartosz Dziewoński: Move VisualEditor from beta to default on enwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617170 (https://phabricator.wikimedia.org/T258992) [16:33:37] (03PS29) 10Hnowlan: api-gateway: Basic envoy chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) [16:36:03] (03CR) 10Hnowlan: [C: 03+2] api-gateway: Basic envoy chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [16:36:39] 10Operations, 10Continuous-Integration-Infrastructure, 10serviceops: puppet errors on contint servers related to helmfiles for push-notifications - https://phabricator.wikimedia.org/T259152 (10Dzahn) [16:37:13] (03Merged) 10jenkins-bot: api-gateway: Basic envoy chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [16:38:52] RECOVERY - Check the last execution of helm-chartctl-package-all on chartmuseum2001 is OK: OK: Status of the systemd unit helm-chartctl-package-all https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:40:44] (03CR) 10Ppchelko: [C: 04-1] api-gateway: add helmfile.d configuration (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/616467 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [16:41:29] (03CR) 10Dzahn: "please see https://phabricator.wikimedia.org/T259152 for an issue with puppet on the contint servers that seems related here" [puppet] - 10https://gerrit.wikimedia.org/r/613104 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [16:42:12] 10Operations, 10Continuous-Integration-Infrastructure, 10serviceops: puppet errors on contint servers related to helmfiles for push-notifications - https://phabricator.wikimedia.org/T259152 (10Dzahn) [16:42:15] 10Operations, 10Product-Infrastructure-Team-Backlog, 
10Push-Notification-Service, 10serviceops: Deploy push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973 (10Dzahn) [16:42:42] 10Operations, 10Continuous-Integration-Infrastructure, 10serviceops: puppet errors on contint servers related to helmfiles for push-notifications - https://phabricator.wikimedia.org/T259152 (10Dzahn) [16:42:53] 10Operations, 10Continuous-Integration-Infrastructure, 10serviceops: puppet errors on contint servers related to helmfiles for push-notifications - https://phabricator.wikimedia.org/T259152 (10Dzahn) [16:42:55] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10serviceops: Deploy push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973 (10Dzahn) [16:43:19] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [16:43:23] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [16:43:23] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [16:43:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:25] 10Operations, 10Continuous-Integration-Infrastructure, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10serviceops: puppet errors on contint servers related to helmfiles for push-notifications - https://phabricator.wikimedia.org/T259152 (10Dzahn) [16:45:24] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:45:26] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [16:45:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:26] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:47:30] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:56] PROBLEM - Check the last execution of helm-chartctl-package-all on chartmuseum2001 is CRITICAL: CRITICAL: Status of the systemd unit helm-chartctl-package-all https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:51:03] (03PS1) 10Urbanecm: Enable Translate extension at plwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617174 (https://phabricator.wikimedia.org/T259087) [16:51:31] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2006.codfw.wmnet', 'parse2005.codfw.wmnet', 'parse2004.codfw.wmnet'] ` and were **ALL** successful. [16:52:08] 10Operations, 10Continuous-Integration-Infrastructure, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10serviceops: puppet errors on contint servers related to helmfiles for push-notifications - https://phabricator.wikimedia.org/T259152 (10hashar) The error message refers to `push-no... [16:52:34] (03PS4) 10Hnowlan: api-gateway: add helmfile.d configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/616467 (https://phabricator.wikimedia.org/T254906) [16:53:18] (03PS1) 10BryanDavis: Add .gitignore [software/acme-chief] - 10https://gerrit.wikimedia.org/r/617175 [16:53:20] (03PS1) 10BryanDavis: acme_chief: Profide .crt.chained.key file support [software/acme-chief] - 10https://gerrit.wikimedia.org/r/617176 (https://phabricator.wikimedia.org/T255249) [16:53:22] (03PS1) 10BryanDavis: api: Allow acme-chief clients to fetch .chained.crt.key files [software/acme-chief] - 10https://gerrit.wikimedia.org/r/617177 (https://phabricator.wikimedia.org/T255249) [16:55:00] (03PS4) 10C. 
Scott Ananian: Alternate configuration mechanism for Parsoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612879 (https://phabricator.wikimedia.org/T241961) [16:56:06] 10Operations, 10Continuous-Integration-Infrastructure, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10serviceops: puppet errors on contint servers related to helmfiles for push-notifications - https://phabricator.wikimedia.org/T259152 (10Dzahn) >>! In T259152#6345663, @hashar wrote... [16:56:12] (03PS2) 10BryanDavis: acme_chief: Profide .chained.crt.key file support [software/acme-chief] - 10https://gerrit.wikimedia.org/r/617176 (https://phabricator.wikimedia.org/T255249) [16:56:14] (03PS2) 10BryanDavis: api: Allow acme-chief clients to fetch .chained.crt.key files [software/acme-chief] - 10https://gerrit.wikimedia.org/r/617177 (https://phabricator.wikimedia.org/T255249) [17:00:00] (03PS5) 10C. Scott Ananian: Alternate configuration mechanism for Parsoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612879 (https://phabricator.wikimedia.org/T241961) [17:00:10] (03CR) 10C. Scott Ananian: Alternate configuration mechanism for Parsoid (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612879 (https://phabricator.wikimedia.org/T241961) (owner: 10C. Scott Ananian) [17:01:02] (03CR) 10C. Scott Ananian: Alternate configuration mechanism for Parsoid (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612879 (https://phabricator.wikimedia.org/T241961) (owner: 10C. Scott Ananian) [17:01:37] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ['parse2007.codfw.wmnet', 'parse2008.codfw.wmnet', 'parse2009.codfw.wmnet'] `... 
[17:04:11] 10Operations, 10SRE-tools: Exception raised while executing cookbook sre.hosts.downtime - https://phabricator.wikimedia.org/T259158 (10Dzahn) [17:06:09] 10Operations, 10SRE-tools: Exception raised while executing cookbook sre.hosts.downtime - https://phabricator.wikimedia.org/T259158 (10Dzahn) [17:10:06] 10Operations, 10Epic, 10Goal: automatically collect network error reports from users' browsers (Network Error Logging API) - https://phabricator.wikimedia.org/T257527 (10CDanis) [17:11:59] 10Operations, 10SRE-tools: Exception raised while executing cookbook sre.hosts.downtime - https://phabricator.wikimedia.org/T259158 (10Volans) From the cookbook logs: ` 2020-07-29 16:15:18,960 dzahn 15835 [DEBUG puppetdb.py:320 in _execute] Queried puppetdb for '["or", ["=", "certname", "wtp2003.codfw.wmnet"]]... [17:15:13] 10Operations, 10ops-eqiad, 10Analytics-Clusters: (Need By: TBD) upgrade ram in an-master100[12] - https://phabricator.wikimedia.org/T259162 (10RobH) p:05Triage→03Medium [17:15:16] 10Operations, 10ops-eqiad, 10Analytics-Clusters: (Need By: TBD) upgrade ram in an-master100[12] - https://phabricator.wikimedia.org/T259162 (10RobH) [17:15:22] 10Operations, 10ops-eqiad, 10Analytics-Clusters: (Need By: TBD) upgrade ram in an-master100[12] - https://phabricator.wikimedia.org/T259162 (10RobH) Please note that while I just created this task, the actual memory has NOT yet been placed to order. It was escalated for approvals and placement today. 
[17:15:22] RECOVERY - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on logstash1011 is OK: (C)100 gt (W)80 gt 75.25 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37 [17:16:40] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [17:16:43] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [17:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:44] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [17:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:09] 10Operations, 10Phabricator, 10Traffic, 10serviceops, and 2 others: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (10mmodell) So I think that this is all that remains: * [] cache layer proxy wss://phabricator.wikimedia.org to aphl... [17:18:14] 10Operations, 10SRE-tools: Exception raised while executing cookbook sre.hosts.downtime - https://phabricator.wikimedia.org/T259158 (10Dzahn) >>! In T259158#6345791, @Volans wrote: > so either catalog compilation for wtp servers is particularly slow This seems likely to be the case. On the first run it does a... [17:19:23] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:19:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:48] 10Operations, 10SRE-tools: Exception raised while executing cookbook sre.hosts.downtime - https://phabricator.wikimedia.org/T259158 (10Volans) No, I meant the catalog compilation on the puppetmaster, after which the catalog is sent to puppetdb. 
It's unrelated to how much time takes the first puppet run on the... [17:20:34] RECOVERY - Check systemd state on chartmuseum2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:20:47] 10Operations, 10ops-eqiad, 10Analytics-Clusters: (Need By: TBD) upgrade ram in an-master100[12] - https://phabricator.wikimedia.org/T259162 (10RobH) > 10:15 < robh> : So we have a number (at least 3) tasks for upgrading memory in existing hosts > 10:15 < robh> : ive just been pushing the actual upgrade t... [17:21:05] mutante: as a pro-tip I usually suggest to open multiple tmux/screen and run the reimages with 2~3 minutes of delay between each other [17:21:23] to avoid some race conditions that are possible, both on the icinga stuff and the certificate signing on the puppetmaster [17:21:24] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:21:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:40] 10Operations, 10ops-eqiad, 10Analytics-Clusters: (Need By: TBD) upgrade ram in an-master100[12] - https://phabricator.wikimedia.org/T259162 (10RobH) a:03Cmjohnson [17:21:45] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [17:21:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:38] (03CR) 10Hnowlan: api-gateway: add helmfile.d configuration (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/616467 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [17:23:31] volans: ack! though the issue happened when i used 2 separate wmf-auto-reimage-host in 2 separate screens. when i used wmf-auto-reimage on 3 hosts at once i did not have that issue. 
but they were parse* with role(insetup) unlike wtp* with prod roles [17:25:19] 10Operations, 10ops-eqiad, 10Analytics-Clusters: (Need By: TBD) upgrade ram in an-master100[12] - https://phabricator.wikimedia.org/T259162 (10elukey) [17:27:48] PROBLEM - Check systemd state on chartmuseum2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:28:20] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2008.codfw.wmnet', 'parse2007.codfw.wmnet', 'parse2009.codfw.wmnet'] ` and were **ALL** successful. [17:29:08] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:32:46] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:41:22] RECOVERY - Check that envoy is running on wtp2002 is OK: OK - envoyproxy.service is active https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [17:45:46] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp2002.codfw.wmnet'] ` and were **ALL** successful. [17:46:18] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. 
- https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ['parse2010.codfw.wmnet', 'parse2011.codfw.wmnet', 'parse2012.codfw.wmnet'] `... [17:48:00] RECOVERY - puppet last run on wtp2002 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:48:52] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:49:10] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp2003.codfw.wmnet'] ` and were **ALL** successful. [17:50:02] RECOVERY - Long running screen/tmux on netbox1001 is OK: OK: No SCREEN or tmux processes detected. https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [17:50:20] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp2002.codfw.wmnet [17:50:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:48] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp2003.codfw.wmnet [17:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:26] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` wtp2004.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [17:52:40] RECOVERY - Long running screen/tmux on weblog1001 is OK: OK: No SCREEN or tmux processes detected. 
https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [17:54:22] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:56:35] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` wtp2005.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [17:56:39] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp2005.codfw.wmnet'] ` Of which those **FAILED**: ` ['wtp2005.codfw.wmnet'] ` [17:57:27] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10Dzahn) Note wtp2005 is missing because of T257903. [17:58:05] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` wtp2006.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [17:58:10] (03PS1) 10Dave Pifke: arclamp: restore 90 day retention [puppet] - 10https://gerrit.wikimedia.org/r/617201 (https://phabricator.wikimedia.org/T235455) [18:00:05] liw and brennen: I, the Bot under the Fountain, allow thee, The Deployer, to do Train log triage with CPT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200729T1800). [18:00:05] RoanKattouw, Niharika, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. 
Time for (yet another) Morning backport window(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200729T1800). [18:00:05] MatmaRex and Urbanecm: A patch you scheduled for Morning backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:51] hi [18:01:12] I'll deploy [18:01:17] hi MatmaRex [18:01:24] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [18:01:26] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [18:01:27] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [18:01:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:19] (03CR) 10Urbanecm: [C: 03+2] Revert new reply API for now (1.36.0-wmf.2 only) [extensions/DiscussionTools] (wmf/1.36.0-wmf.2) - 10https://gerrit.wikimedia.org/r/617169 (https://phabricator.wikimedia.org/T252558) (owner: 10Bartosz Dziewoński) [18:02:33] (03CR) 10Urbanecm: [C: 03+2] Move VisualEditor from beta to default on enwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617170 (https://phabricator.wikimedia.org/T258992) (owner: 10Bartosz Dziewoński) [18:02:41] (03CR) 10Urbanecm: [C: 03+2] Enable Translate extension at plwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617174 (https://phabricator.wikimedia.org/T259087) (owner: 10Urbanecm) [18:03:18] (03Merged) 10jenkins-bot: Move VisualEditor from beta to default on enwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617170 (https://phabricator.wikimedia.org/T258992) (owner: 10Bartosz Dziewoński) [18:03:31] (03Merged) 10jenkins-bot: Enable Translate extension at plwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617174 
(https://phabricator.wikimedia.org/T259087) (owner: 10Urbanecm) [18:03:32] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:03:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:39] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [18:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:58] MatmaRex: config patch is ready for you to test at mwdebug1001 [18:04:15] (03PS1) 10QChris: Bump gerrit.war to 3.2.3-1-g185bdc3a69 [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/617205 [18:05:34] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:05:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:42] (03CR) 10QChris: "The corresponding WAR has been uploaded to our archiva already." [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/617205 (owner: 10QChris) [18:05:43] !log Create tables for Translate extension in plwikimedia (T259087) [18:05:46] Urbanecm: seems good [18:05:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:47] T259087: Enable Translate extension on pl.wikimedia.org - https://phabricator.wikimedia.org/T259087 [18:05:52] thank you, syncing [18:06:04] (03Merged) 10jenkins-bot: Revert new reply API for now (1.36.0-wmf.2 only) [extensions/DiscussionTools] (wmf/1.36.0-wmf.2) - 10https://gerrit.wikimedia.org/r/617169 (https://phabricator.wikimedia.org/T252558) (owner: 10Bartosz Dziewoński) [18:06:17] (sorry you had to wait, i was testing on the wrong mwdebug server for a while and i was very confused) [18:06:43] no problem :) [18:07:28] !log urbanecm@deploy1001 Synchronized dblists/visualeditor-nondefault.dblist: a237f5b40c3662c0f08398abeeaadba61d7462f8: Move VisualEditor from beta to default on enwikiversity (T258992) (duration: 01m 06s) [18:07:32] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:33] T258992: VisualEditor the default editor for wikiversity (english) - https://phabricator.wikimedia.org/T258992 [18:07:42] (03CR) 10Dzahn: [C: 03+1] "+1, should i just merge or you want to add reviewers?" [puppet] - 10https://gerrit.wikimedia.org/r/617201 (https://phabricator.wikimedia.org/T235455) (owner: 10Dave Pifke) [18:09:33] MatmaRex: your backport is ready at mwdebug1001 [18:09:37] could you have a look, please? [18:09:38] (03CR) 10Dave Pifke: "I think this is pretty low risk, but adding Krinkle to reviewers since he's the one who reminded me to follow up on this." [puppet] - 10https://gerrit.wikimedia.org/r/617201 (https://phabricator.wikimedia.org/T235455) (owner: 10Dave Pifke) [18:09:47] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2011.codfw.wmnet', 'parse2012.codfw.wmnet', 'parse2010.codfw.wmnet'] ` and were **ALL** successful. [18:10:35] yeah [18:10:44] 10Operations, 10vm-requests, 10Patch-For-Review: esams,ulsfo,eqsin: one VM request each for install_servers - https://phabricator.wikimedia.org/T254157 (10Dzahn) @Volans Yea, that's right. The status is still that creating the VM worked but installing the OS did not (T254157#6241107). I will get back to de... 
[18:11:22] Urbanecm: also looks good [18:11:25] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: d54f041be6508b641eec08e25287d280374cc863: Enable Translate extension at plwikimedia (T259087) (duration: 01m 08s) [18:11:28] thank you, syncing [18:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:31] T259087: Enable Translate extension on pl.wikimedia.org - https://phabricator.wikimedia.org/T259087 [18:12:19] 10Operations, 10Mail, 10OTRS, 10Trust-and-Safety, and 2 others: Forward emails addressed to privacy@wikidata to privacy@wikimedia - https://phabricator.wikimedia.org/T255733 (10Dzahn) 05Stalled→03Open [18:12:43] RECOVERY - Check systemd state on chartmuseum2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:13:14] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.2/extensions/DiscussionTools/: 00ecec80d12a34977d55dd09bce0c5a1aab369f9: Revert new reply API for now (T252558) (duration: 01m 06s) [18:13:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:19] T252558: Create a low bandwidth reply API using parser.php/modifier.php - https://phabricator.wikimedia.org/T252558 [18:13:20] MatmaRex: should be all done! [18:13:28] (03PS1) 10QChris: Bring back jsonevent-layout library [software/gerrit] (wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/617206 [18:13:37] thanks Urbanecm [18:13:41] happy to help! [18:13:46] !log Morning B&C window is done [18:13:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:51] (03CR) 10jerkins-bot: [V: 04-1] Bring back jsonevent-layout library [software/gerrit] (wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/617206 (owner: 10QChris) [18:15:47] (03CR) 10QChris: "We need this change to enable json logging."
[software/gerrit] (wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/617206 (owner: 10QChris) [18:16:47] (03CR) 10QChris: [V: 03+1 C: 03+1] Bring back jsonevent-layout library [software/gerrit] (wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/617206 (owner: 10QChris) [18:17:08] (03CR) 10jerkins-bot: [V: 04-1] Bring back jsonevent-layout library [software/gerrit] (wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/617206 (owner: 10QChris) [18:20:17] (03CR) 10QChris: "Just in case searching no longer finds the commit at some point in the future. This is" [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/617205 (owner: 10QChris) [18:20:20] PROBLEM - Check systemd state on chartmuseum2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:20:24] (03CR) 10Krinkle: [C: 03+1] arclamp: restore 90 day retention [puppet] - 10https://gerrit.wikimedia.org/r/617201 (https://phabricator.wikimedia.org/T235455) (owner: 10Dave Pifke) [18:20:48] (03CR) 10Krinkle: [C: 03+1] "Good to go" [puppet] - 10https://gerrit.wikimedia.org/r/617201 (https://phabricator.wikimedia.org/T235455) (owner: 10Dave Pifke) [18:21:43] (03CR) 10Dzahn: [C: 03+2] arclamp: restore 90 day retention [puppet] - 10https://gerrit.wikimedia.org/r/617201 (https://phabricator.wikimedia.org/T235455) (owner: 10Dave Pifke) [18:26:44] (03CR) 10Herron: [C: 03+1] provision loki on grafana-next [puppet] - 10https://gerrit.wikimedia.org/r/616851 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [18:27:51] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ['parse2013.codfw.wmnet', 'parse2014.codfw.wmnet', 'parse2015.codfw.wmnet'] `... 
[18:32:26] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [18:32:26] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [18:32:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:05] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [18:36:06] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [18:36:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:14] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [18:36:15] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [18:36:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:42] 10Operations, 10vm-requests, 10Patch-For-Review: esams,ulsfo,eqsin: one VM request each for install_servers - https://phabricator.wikimedia.org/T254157 (10Volans) Not a specific issue for me, came up as inconsistency in some cross checks for Netbox automation. Up to you. [18:39:18] 10Operations, 10SRE-tools: Exception raised while executing cookbook sre.hosts.downtime - https://phabricator.wikimedia.org/T259158 (10Dzahn) Ah, ok! It happened again when running wtp2004 (separate screen window, separate script). `spicerack.remote.RemoteError: No hosts provided ` Then i manually used `sre... 
[18:39:30] (03CR) 10QChris: "The relevant bug on Phabricator is https://phabricator.wikimedia.org/T259135" [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/617205 (owner: 10QChris) [18:39:51] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [18:39:51] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [18:39:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:30] (03CR) 10QChris: [V: 03+1 C: 03+1] "The relevant task in Phabricator is https://phabricator.wikimedia.org/T259135" [software/gerrit] (wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/617206 (owner: 10QChris) [18:40:58] (03PS2) 10QChris: Bump gerrit.war to 3.2.3-1-g185bdc3a69 [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/617205 (https://phabricator.wikimedia.org/T259135) [18:42:52] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [18:42:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:56] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [18:43:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:01] Urbanecm: I'd need to do a quick gerrit upgrade. Did I read your message from 30 minutes ago correctly that the Morning backport window is done and updating gerrit would not get in your way? [18:44:22] qchris: yes, it is all yours now :) [18:44:27] Cool beans. [18:44:30] Thanks. [18:44:58] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:45:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:56] I have no clue what "Train log triage with CPT" (which should also be happening now) is. I don't see any logs about it in the channel. I don't want to ping them, as they might be doing $REALLY_IMPORTANT_STUFF. Do you know by chance what they are up to?
[18:45:59] Urbanecm: ^ [18:46:42] qchris: I **assume** they are looking at logs (in logstash), and discussing them, but that is a pure guess. [18:47:07] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:47:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:17] Ok. Thanks. [18:48:48] PROBLEM - Check that envoy is running on wtp2004 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.46: Connection reset by peer https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [18:49:52] PROBLEM - mediawiki-installation DSH group on wtp2006 is CRITICAL: Host wtp2006 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [18:49:52] PROBLEM - puppet last run on wtp2004 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.46: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [18:50:14] RECOVERY - mediawiki-installation DSH group on wtp2003 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [18:50:20] RECOVERY - Check systemd state on chartmuseum2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:50:32] (03PS1) 10Herron: dns: add forward/reverse records for kafkamon[12]002 [dns] - 10https://gerrit.wikimedia.org/r/617211 (https://phabricator.wikimedia.org/T257561) [18:52:06] PROBLEM - Apache HTTP on wtp2006 is CRITICAL: connect to address 10.192.16.48 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [18:52:16] PROBLEM - nutcracker process on wtp2006 is CRITICAL: connect to address 10.192.16.48 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Nutcracker [18:52:16] PROBLEM - Check the NTP synchronisation status of timesyncd on wtp2004 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.46: Connection reset by peer 
https://wikitech.wikimedia.org/wiki/NTP [18:52:34] (03CR) 10Thcipriani: [C: 03+1] Bump gerrit.war to 3.2.3-1-g185bdc3a69 [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/617205 (https://phabricator.wikimedia.org/T259135) (owner: 10QChris) [18:53:22] PROBLEM - Check the last execution of php7.2-fpm_check_restart on wtp2004 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.46: Connection reset by peer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:53:26] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2013.codfw.wmnet', 'parse2014.codfw.wmnet'] ` Of which those **FAILED**: ` ['parse2015.codfw.wmnet'] ` [18:53:48] br-ennen let me know that it's ok to restart Gerrit. So I'll prepare the upgrade. [18:54:00] thanks for checking in [18:54:02] PROBLEM - Check systemd state on chartmuseum2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:54:40] PROBLEM - PHP opcache health on wtp2004 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.46: Connection reset by peer https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [18:54:40] PROBLEM - nutcracker socket on wtp2006 is CRITICAL: connect to address 10.192.16.48 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Nutcracker [18:56:38] PROBLEM - PHP7 rendering on wtp2004 is CRITICAL: connect to address 10.192.16.46 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [18:56:54] PROBLEM - parsoid on wtp2006 is CRITICAL: connect to address 10.192.16.48 and port 8000: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/parsoid [18:56:54] PROBLEM - Check size of conntrack table on wtp2006 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.48: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [18:56:54] PROBLEM - Check whether ferm is active by checking the default input chain on wtp2004 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.46: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [18:57:30] (03CR) 10QChris: [V: 03+2 C: 03+2] Bump gerrit.war to 3.2.3-1-g185bdc3a69 [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/617205 (https://phabricator.wikimedia.org/T259135) (owner: 10QChris) [18:57:32] PROBLEM - Check no envoy runtime configuration is left persistent on wtp2006 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.48: Connection reset by peer https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [18:58:18] PROBLEM - Ensure local MW versions match expected deployment on wtp2006 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.48: Connection reset by peer 
https://wikitech.wikimedia.org/wiki/Application_servers [18:59:31] Just a heads up that I'll restart Gerrit shortly to deploy a security fix https://phabricator.wikimedia.org/T259135 [18:59:57] wtp* hosts^ downtimed. that's an issue with setting the downtime during reimage [19:00:04] liw and brennen: That opportune time is upon us again. Time for a Mediawiki train - European+American Version (secondary timeslot) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200729T1900). [19:00:15] !log qchris@deploy1001 Started deploy [gerrit/gerrit@9275b30]: Gerrit to v3.2.3-1-g185bdc3a69 on gerrit1001 [19:00:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:23] !log qchris@deploy1001 Finished deploy [gerrit/gerrit@9275b30]: Gerrit to v3.2.3-1-g185bdc3a69 on gerrit1001 (duration: 00m 08s) [19:00:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:43] !log Restarting Gerrit on gerrit1001 to make security fix effective. [19:00:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:48] a train status update: currently blocked, we're not presently doing anything in the current window. [19:00:52] RECOVERY - Check systemd state on chartmuseum2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:02:01] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ['parse2016.codfw.wmnet', 'parse2017.codfw.wmnet', 'parse2018.codfw.wmnet'] `... 
[19:02:52] ACKNOWLEDGEMENT - Gerrit Health Check on gerrit.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1529 bytes in 0.008 second response time daniel_zahn restart for config change by chris https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [19:03:06] Thanks mutante ^ [19:03:34] yw,np [19:03:47] ACKNOWLEDGEMENT - SSH access on gerrit1001 is CRITICAL: connect to address 208.80.154.137 and port 29418: Connection refused daniel_zahn restarting https://wikitech.wikimedia.org/wiki/Gerrit [19:03:54] !log qchris@deploy1001 Started deploy [gerrit/gerrit@9275b30]: Gerrit to v3.2.3-1-g185bdc3a69 on gerrit2001 [19:03:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:04] !log qchris@deploy1001 Finished deploy [gerrit/gerrit@9275b30]: Gerrit to v3.2.3-1-g185bdc3a69 on gerrit2001 (duration: 00m 09s) [19:04:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:31] !log Restarting Gerrit on gerrit2001 (gerrit-replica) to make security fix effective. [19:04:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:28] PROBLEM - Check systemd state on chartmuseum2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:07:02] RECOVERY - Check systemd state on chartmuseum2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:08:32] (03CR) 10Dzahn: [C: 03+1] "looks good" [dns] - 10https://gerrit.wikimedia.org/r/617211 (https://phabricator.wikimedia.org/T257561) (owner: 10Herron) [19:08:45] (03CR) 10Herron: [C: 03+2] dns: add forward/reverse records for kafkamon[12]002 [dns] - 10https://gerrit.wikimedia.org/r/617211 (https://phabricator.wikimedia.org/T257561) (owner: 10Herron) [19:12:12] PROBLEM - Check systemd state on chartmuseum2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:16:06] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [19:16:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:04] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [19:17:04] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [19:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:35] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [19:18:37] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [19:18:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:37] !log herron@cumin1001 START - Cookbook sre.ganeti.makevm [19:19:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:00] !log herron@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) [19:20:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:37] !log dzahn@cumin1001 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) [19:20:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:35] RECOVERY - Check systemd state on chartmuseum2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:25:42] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2018.codfw.wmnet', 'parse2016.codfw.wmnet', 'parse2017.codfw.wmnet'] ` and were **ALL** successful. [19:26:35] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ['parse2019.codfw.wmnet', 'parse2020.codfw.wmnet'] ` The log can be found in... [19:26:43] PROBLEM - Check systemd state on chartmuseum2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:27:32] !log herron@cumin1001 START - Cookbook sre.ganeti.makevm [19:27:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:37] !log herron@cumin1001 START - Cookbook sre.ganeti.makevm [19:29:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:20] (03PS1) 10Cwhite: debianization [debs/prometheus-es-exporter] (debian/sid) - 10https://gerrit.wikimedia.org/r/617250 (https://phabricator.wikimedia.org/T222826) [19:36:29] (03PS6) 10Cwhite: provision loki on grafana-next [puppet] - 10https://gerrit.wikimedia.org/r/616851 (https://phabricator.wikimedia.org/T222826) [19:40:05] RECOVERY - Check systemd state on chartmuseum2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:41:37] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [19:41:39] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [19:41:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:48] !log herron@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [19:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:44] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [19:43:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:01] !log herron@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [19:44:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:10] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:44:54] PROBLEM - Check systemd 
state on chartmuseum2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:45:10] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:45:42] (03CR) 10QChris: [V: 03+2 C: 03+2] "Since it's deployed already, I'll self-merge this here." [software/gerrit] (wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/617206 (owner: 10QChris) [19:45:48] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [19:45:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:09] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2019.codfw.wmnet', 'parse2020.codfw.wmnet'] ` and were **ALL** successful. [19:53:27] (03PS1) 10Herron: dhcp: add records for kafkamon[12]002 [puppet] - 10https://gerrit.wikimedia.org/r/617254 (https://phabricator.wikimedia.org/T257561) [19:54:31] (03PS5) 10Jdlrobson: Switch test wikis to new version of vector by default (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614891 (https://phabricator.wikimedia.org/T254227) [19:57:45] (03PS2) 10Herron: dhcp: add records for kafkamon[12]002 [puppet] - 10https://gerrit.wikimedia.org/r/617254 (https://phabricator.wikimedia.org/T257561) [20:00:04] halfak and accraze: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Graphoid / ORES . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200729T2000). 
[20:00:16] (03CR) 10Herron: [C: 03+2] dhcp: add records for kafkamon[12]002 [puppet] - 10https://gerrit.wikimedia.org/r/617254 (https://phabricator.wikimedia.org/T257561) (owner: 10Herron) [20:02:36] RECOVERY - PHP7 rendering on wtp2004 is OK: HTTP OK: HTTP/1.1 302 Found - 645 bytes in 0.582 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:05:46] RECOVERY - PHP opcache health on wtp2004 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [20:05:52] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp2004.codfw.wmnet'] ` and were **ALL** successful. [20:05:58] RECOVERY - Check size of conntrack table on wtp2006 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [20:06:12] RECOVERY - Check the last execution of php7.2-fpm_check_restart on wtp2004 is OK: OK: Status of the systemd unit php7.2-fpm_check_restart https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [20:08:14] PROBLEM - Host wtp2006 is DOWN: PING CRITICAL - Packet loss = 100% [20:09:54] RECOVERY - puppet last run on wtp2004 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [20:10:22] RECOVERY - Host wtp2006 is UP: PING OK - Packet loss = 0%, RTA = 36.26 ms [20:10:41] RECOVERY - nutcracker process on wtp2006 is OK: PROCS OK: 1 process with UID = 112 (nutcracker), command name nutcracker https://wikitech.wikimedia.org/wiki/Nutcracker [20:11:18] RECOVERY - Apache HTTP on wtp2006 is OK: HTTP OK: HTTP/1.1 302 Found - 630 bytes in 0.158 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:12:51] RECOVERY - Check that envoy is running on wtp2004 is OK: OK - 
envoyproxy.service is active https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [20:12:51] RECOVERY - nutcracker socket on wtp2006 is OK: TCP OK - 0.000 second response time on socket /var/run/nutcracker/redis_codfw.sock https://wikitech.wikimedia.org/wiki/Nutcracker [20:14:58] RECOVERY - Ensure local MW versions match expected deployment on wtp2006 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [20:16:19] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10Dzahn) all parse2* hosts done: ` [cumin1001:~] $ sudo cumin parse2* 'df -h | grep srv | cut -d " " -f12' .. 20 hosts will be targeted: 1%... [20:17:30] RECOVERY - Check no envoy runtime configuration is left persistent on wtp2006 is OK: HTTP OK: HTTP/1.1 200 OK - 286 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [20:17:38] RECOVERY - Check whether ferm is active by checking the default input chain on wtp2004 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:17:38] RECOVERY - Check the NTP synchronisation status of timesyncd on wtp2004 is OK: OK: synced at Wed 2020-07-29 20:17:36 UTC. https://wikitech.wikimedia.org/wiki/NTP [20:19:17] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp2004.codfw.wmnet [20:19:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:49] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` wtp2007.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... 
[20:24:36] (03PS1) 10Cwhite: prometheus: puppetized install of prometheus-es-exporter [puppet] - 10https://gerrit.wikimedia.org/r/617260 (https://phabricator.wikimedia.org/T256418) [20:25:50] (03CR) 10jerkins-bot: [V: 04-1] prometheus: puppetized install of prometheus-es-exporter [puppet] - 10https://gerrit.wikimedia.org/r/617260 (https://phabricator.wikimedia.org/T256418) (owner: 10Cwhite) [20:25:59] (03CR) 10Dzahn: [C: 03+2] "the members of this group have been promoted to roots but i am merging this anyways, there were no concerns and maybe the admins group wil" [puppet] - 10https://gerrit.wikimedia.org/r/615818 (https://phabricator.wikimedia.org/T258739) (owner: 10Dzahn) [20:26:06] (03PS2) 10Cwhite: prometheus: puppetized install of prometheus-es-exporter [puppet] - 10https://gerrit.wikimedia.org/r/617260 (https://phabricator.wikimedia.org/T256418) [20:26:28] (03PS4) 10Dzahn: admins: let wdqs-admins view nginx logs [puppet] - 10https://gerrit.wikimedia.org/r/615818 (https://phabricator.wikimedia.org/T258739) [20:27:22] (03CR) 10jerkins-bot: [V: 04-1] prometheus: puppetized install of prometheus-es-exporter [puppet] - 10https://gerrit.wikimedia.org/r/617260 (https://phabricator.wikimedia.org/T256418) (owner: 10Cwhite) [20:28:48] (03PS3) 10Cwhite: prometheus: puppetized install of prometheus-es-exporter [puppet] - 10https://gerrit.wikimedia.org/r/617260 (https://phabricator.wikimedia.org/T256418) [20:30:59] (03PS2) 10Dzahn: admins: let wdqs-admins run jstack as root [puppet] - 10https://gerrit.wikimedia.org/r/615821 (https://phabricator.wikimedia.org/T258739) [20:31:51] (03CR) 10jerkins-bot: [V: 04-1] admins: let wdqs-admins run jstack as root [puppet] - 10https://gerrit.wikimedia.org/r/615821 (https://phabricator.wikimedia.org/T258739) (owner: 10Dzahn) [20:32:04] (03CR) 10Dzahn: admins: let wdqs-admins run jstack as root (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/615821 (https://phabricator.wikimedia.org/T258739) (owner: 10Dzahn) [20:32:26] 
RECOVERY - parsoid on wtp2006 is OK: HTTP OK: HTTP/1.1 200 OK - 1022 bytes in 0.171 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/parsoid [20:32:37] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp2006.codfw.wmnet'] ` and were **ALL** successful. [20:34:12] !log crusnov@deploy1001 Started deploy [netbox/deploy@fde9dfe]: Test deploy of 2.8.8 to netbox-next [20:34:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:44] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` wtp2008.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [20:35:10] (03PS3) 10Dzahn: admins: let wdqs-admins run jstack as root [puppet] - 10https://gerrit.wikimedia.org/r/615821 (https://phabricator.wikimedia.org/T258739) [20:35:24] !log crusnov@deploy1001 Finished deploy [netbox/deploy@fde9dfe]: Test deploy of 2.8.8 to netbox-next (duration: 01m 12s) [20:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:29] !log crusnov@deploy1001 Started deploy [netbox/deploy@fde9dfe]: Test deploy of 2.8.8 to netbox-next pt2 [20:35:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:34] !log crusnov@deploy1001 Finished deploy [netbox/deploy@fde9dfe]: Test deploy of 2.8.8 to netbox-next pt2 (duration: 00m 05s) [20:35:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:48] (03PS4) 10Dzahn: admins: let wdqs-admins run jstack as root [puppet] - 10https://gerrit.wikimedia.org/r/615821 (https://phabricator.wikimedia.org/T258739) [20:38:50] RECOVERY - Check systemd state on chartmuseum2001 is OK: OK - running: 
The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:43:18] (03CR) 10Dzahn: [C: 03+2] "the members of this group have been promoted to roots but merging this anyways. maybe it will be used again in the future." [puppet] - 10https://gerrit.wikimedia.org/r/615821 (https://phabricator.wikimedia.org/T258739) (owner: 10Dzahn) [20:43:50] PROBLEM - Check systemd state on chartmuseum2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:55:38] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:56:39] (03PS1) 10Herron: assign kafkamon[12]002 role::insetup [puppet] - 10https://gerrit.wikimedia.org/r/617262 (https://phabricator.wikimedia.org/T257561) [20:58:10] (03CR) 10Herron: [C: 03+2] assign kafkamon[12]002 role::insetup [puppet] - 10https://gerrit.wikimedia.org/r/617262 (https://phabricator.wikimedia.org/T257561) (owner: 10Herron) [21:00:41] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [21:00:41] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [21:00:45] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:00:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:40] RECOVERY - MediaWiki exceptions and fatals per minute on 
icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:15:26] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [21:15:27] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [21:15:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:16] PROBLEM - PHP7 rendering on wtp2007 is CRITICAL: connect to address 10.192.16.49 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:21:18] PROBLEM - Check whether ferm is active by checking the default input chain on wtp2007 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.49: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:22:00] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [21:22:01] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:22:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:32] (03CR) 10Ppchelko: [C: 04-2] "After discussion with Eric, I'm changing my mind. See ticket for the reasoning." [deployment-charts] - 10https://gerrit.wikimedia.org/r/613650 (https://phabricator.wikimedia.org/T256769) (owner: 10Hnowlan) [21:24:56] RECOVERY - Check systemd state on chartmuseum2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:28:44] PROBLEM - Check systemd state on chartmuseum2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:32:36] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:40:22] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:51:54] PROBLEM - dhclient process on wtp2008 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.50: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [21:58:48] PROBLEM - mediawiki-installation DSH group on wtp2008 is CRITICAL: Host wtp2008 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [22:00:43] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [22:00:43] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [22:00:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:22] (03PS1) 10Ahmon Dancy: Add mtail program for monitoring the Zuul error log [puppet] - 10https://gerrit.wikimedia.org/r/617271 (https://phabricator.wikimedia.org/T258821) [22:03:03] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/616110 (https://phabricator.wikimedia.org/T251497) (owner: 10ZPapierski) [22:27:58] PROBLEM - MariaDB Replica Lag: s4 on db1145 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1658.31 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:29:50] 
RECOVERY - PHP7 rendering on wtp2007 is OK: HTTP OK: HTTP/1.1 302 Found - 645 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [22:33:55] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp2007.codfw.wmnet'] ` and were **ALL** successful. [22:39:24] RECOVERY - Check whether ferm is active by checking the default input chain on wtp2007 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [22:39:32] RECOVERY - dhclient process on wtp2008 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [22:45:24] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp2007.codfw.wmnet [22:45:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:44] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp2006.codfw.wmnet [22:46:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:36] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` wtp2009.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [23:00:04] RoanKattouw, Niharika, and Urbanecm: (Dis)respected human, time to deploy Evening backport window(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200729T2300). Please do the needful. 
[23:03:51] (03PS1) 10Urbanecm: Add several extra namespaces for mswiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617280 (https://phabricator.wikimedia.org/T255391) [23:04:46] (03CR) 10Urbanecm: [C: 03+2] Add several extra namespaces for mswiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617280 (https://phabricator.wikimedia.org/T255391) (owner: 10Urbanecm) [23:05:35] (03Merged) 10jenkins-bot: Add several extra namespaces for mswiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617280 (https://phabricator.wikimedia.org/T255391) (owner: 10Urbanecm) [23:07:16] RECOVERY - Check systemd state on chartmuseum2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:07:22] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 396a395c79c606cb7deeb7906fefc7f16e63fa4f: Add several extra namespaces for mswiktionary (T255391) (duration: 01m 07s) [23:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:29] T255391: Create new namespaces on the Malay Wiktionary - https://phabricator.wikimedia.org/T255391 [23:09:07] !log Run mwscript namespaceDupes.php --wiki=mswiktionary --fix (T255391) [23:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:02] PROBLEM - Check systemd state on chartmuseum2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:12:03] (03PS1) 10Urbanecm: Search Work NS by default at bnwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617284 (https://phabricator.wikimedia.org/T258982) [23:18:43] RECOVERY - mediawiki-installation DSH group on wtp2006 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [23:22:29] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. 
- https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp2008.codfw.wmnet'] ` and were **ALL** successful. [23:28:54] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [23:28:54] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [23:28:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:35] (03PS1) 10BryanDavis: jessie-ssd: Fetch base image from docker-registry.tools.wmflabs.org [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/617288 [23:37:33] (03CR) 10Alex Monk: "you'll also want to add it to the safelist in api.py" [software/acme-chief] - 10https://gerrit.wikimedia.org/r/617176 (https://phabricator.wikimedia.org/T255249) (owner: 10BryanDavis) [23:39:09] (03CR) 10Alex Monk: [C: 03+2] "oh it's in the other PR, okay" [software/acme-chief] - 10https://gerrit.wikimedia.org/r/617176 (https://phabricator.wikimedia.org/T255249) (owner: 10BryanDavis) [23:39:54] (03CR) 10Alex Monk: [C: 03+2] api: Allow acme-chief clients to fetch .chained.crt.key files [software/acme-chief] - 10https://gerrit.wikimedia.org/r/617177 (https://phabricator.wikimedia.org/T255249) (owner: 10BryanDavis) [23:41:25] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp2008.codfw.wmnet [23:41:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:41:52] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` wtp2010.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... 
[23:49:43] 10Operations, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 3 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10tstarling) 05Open→03Resolved a:03tstarling [23:51:22] PROBLEM - parsoid on wtp2009 is CRITICAL: connect to address 10.192.16.51 and port 8000: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/parsoid [23:51:22] PROBLEM - Check size of conntrack table on wtp2009 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.51: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [23:51:39] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [23:51:40] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [23:51:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:51:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:52:14] ACKNOWLEDGEMENT - Check size of conntrack table on wtp2009 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.51: Connection reset by peer daniel_zahn reinstall https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [23:52:14] ACKNOWLEDGEMENT - Long running screen/tmux on wtp2009 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.51: Connection reset by peer daniel_zahn reinstall https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens