[00:02:41] <wikibugs>	 Puppet, Beta-Cluster-Infrastructure: puppetdb on deployment-puppetdb03 keeps getting OOMKilled - https://phabricator.wikimedia.org/T248041 (Mholloway) And again today.
[00:14:59] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[00:15:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:18:27] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[00:18:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:18:41] <wikibugs>	 (PS1) Legoktm: mediawiki: Install firejail from stretch-backports [puppet] - https://gerrit.wikimedia.org/r/616955 (https://phabricator.wikimedia.org/T179022)
[00:18:57] <legoktm>	 TimStarling: ^^
[00:19:38] <wikibugs>	 (CR) jerkins-bot: [V: -1] mediawiki: Install firejail from stretch-backports [puppet] - https://gerrit.wikimedia.org/r/616955 (https://phabricator.wikimedia.org/T179022) (owner: Legoktm)
[00:20:57] <wikibugs>	 (PS2) Legoktm: mediawiki: Install firejail from stretch-backports [puppet] - https://gerrit.wikimedia.org/r/616955 (https://phabricator.wikimedia.org/T179022)
[00:21:36] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:22:05] <wikibugs>	 (CR) Ryan Kemper: [V: +2 C: +2] "I'm all for anything that cuts down on noise. Shipping it!" [puppet] - https://gerrit.wikimedia.org/r/616631 (owner: Ebernhardson)
[00:23:24] <wikibugs>	 (CR) Ryan Kemper: "`sudo puppet-merge` done" [puppet] - https://gerrit.wikimedia.org/r/616631 (owner: Ebernhardson)
[00:24:56] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:29:02] <icinga-wm>	 PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[00:31:34] <icinga-wm>	 RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[00:33:17] <wikibugs>	 Operations, serviceops, Patch-For-Review: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp1046.eqiad.wmnet'] `  and were **ALL** successful.
[00:37:54] <icinga-wm>	 PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog1001.eqiad.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops
[00:37:54] <icinga-wm>	 PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog1001.eqiad.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[00:43:16] <icinga-wm>	 RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops
[00:43:16] <icinga-wm>	 RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[00:44:04] <wikibugs>	 Operations, Mail, OTRS, Trust-and-Safety, and 2 others: Forward emails addressed to privacy@wikidata to privacy@wikimedia - https://phabricator.wikimedia.org/T255733 (Dzahn) a:Dzahn
[00:44:54] <wikibugs>	 (CR) Ryan Kemper: [V: +2 C: +2] "https://puppet-compiler.wmflabs.org/compiler1003/24203/" [puppet] - https://gerrit.wikimedia.org/r/616027 (owner: DCausse)
[00:48:56] <ryankemper>	 !log sudo -E cumin -b 10 'A:wdqs-all' 'sudo run-puppet-agent'
[00:48:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:53:48] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1046.eqiad.wmnet
[00:53:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:54:05] <wikibugs>	 (PS1) Catrope: Enable and configure GrowthExperiments on hewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/616959 (https://phabricator.wikimedia.org/T255020)
[01:13:15] <wikibugs>	 Operations, serviceops, Patch-For-Review: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp1047.eqiad.wmnet'] `  and were **ALL** successful.
[01:15:31] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1047.eqiad.wmnet
[01:15:34] <wikibugs>	 (PS1) Tim Starling: Remove NO_EXECVE when executing gs for now [extensions/Score] (wmf/1.36.0-wmf.1) - https://gerrit.wikimedia.org/r/616937 (https://phabricator.wikimedia.org/T257091)
[01:15:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:15:37] <wikibugs>	 (CR) Tim Starling: [C: +2] Remove NO_EXECVE when executing gs for now [extensions/Score] (wmf/1.36.0-wmf.1) - https://gerrit.wikimedia.org/r/616937 (https://phabricator.wikimedia.org/T257091) (owner: Tim Starling)
[01:15:57] <wikibugs>	 (PS1) Tim Starling: Remove NO_EXECVE when executing gs for now [extensions/Score] (wmf/1.36.0-wmf.2) - https://gerrit.wikimedia.org/r/616938 (https://phabricator.wikimedia.org/T257091)
[01:16:14] <wikibugs>	 (CR) Tim Starling: [C: +2] Remove NO_EXECVE when executing gs for now [extensions/Score] (wmf/1.36.0-wmf.2) - https://gerrit.wikimedia.org/r/616938 (https://phabricator.wikimedia.org/T257091) (owner: Tim Starling)
[01:17:57] <wikibugs>	 Operations, serviceops, Patch-For-Review: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp1048.eqiad.wmnet'] `  and were **ALL** successful.
[01:19:58] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1048.eqiad.wmnet
[01:20:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:25:59] <wikibugs>	 Operations, serviceops, Patch-For-Review: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (Dzahn) >>! In T258775#6341636, @JMeybohm wrote: > All hosts but `wtp104[6-8].eqiad.wmnet` completed.  wtp1046, wtp1047, wtp1048 completed and repooled  on wtp1...
[01:26:05] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_eventgate_main_http_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:29:39] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:34:38] <wikibugs>	 (Merged) jenkins-bot: Remove NO_EXECVE when executing gs for now [extensions/Score] (wmf/1.36.0-wmf.1) - https://gerrit.wikimedia.org/r/616937 (https://phabricator.wikimedia.org/T257091) (owner: Tim Starling)
[01:35:20] <wikibugs>	 (Merged) jenkins-bot: Remove NO_EXECVE when executing gs for now [extensions/Score] (wmf/1.36.0-wmf.2) - https://gerrit.wikimedia.org/r/616938 (https://phabricator.wikimedia.org/T257091) (owner: Tim Starling)
[01:36:29] <wikibugs>	 (CR) Ryan Kemper: "Puppet-merged, manually ran puppet agent on all wdqs nodes, everything looks good." [puppet] - https://gerrit.wikimedia.org/r/616027 (owner: DCausse)
[01:38:14] <wikibugs>	 Operations, Performance-Team, serviceops, Sustainability (Incident Followup): Test gutter pool failover in production  and memcached 1.5.x - https://phabricator.wikimedia.org/T240684 (aaron) Open→Resolved Yeah, technically, all sorts of anomalies are possible, so callers should always (a)...
[01:38:17] <wikibugs>	 Operations, serviceops, Patch-For-Review: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (aaron)
[01:45:09] <logmsgbot>	 !log tstarling@deploy1001 Synchronized php-1.36.0-wmf.1/extensions/Score/includes/Score.php: work around firejail bug (duration: 01m 08s)
[01:45:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:47:23] <logmsgbot>	 !log tstarling@deploy1001 Synchronized php-1.36.0-wmf.2/extensions/Score/includes/Score.php: work around firejail bug (duration: 01m 07s)
[01:47:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:49:48] <wikibugs>	 (PS1) Tim Starling: Re-enable LilyPond/Score in safe mode (2nd attempt) [mediawiki-config] - https://gerrit.wikimedia.org/r/616941
[01:49:59] <icinga-wm>	 PROBLEM - Long running screen/tmux on netbox1001 is CRITICAL: CRIT: Long running tmux process. (user: crusnov PID: 17784, 1910768s 1728000s). https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens
[01:52:38] <wikibugs>	 (PS2) Tim Starling: Re-enable LilyPond/Score in safe mode (2nd attempt) [mediawiki-config] - https://gerrit.wikimedia.org/r/616941
[01:54:25] <wikibugs>	 (CR) Tim Starling: [C: +2] Re-enable LilyPond/Score in safe mode (2nd attempt) [mediawiki-config] - https://gerrit.wikimedia.org/r/616941 (owner: Tim Starling)
[01:55:13] <wikibugs>	 (Merged) jenkins-bot: Re-enable LilyPond/Score in safe mode (2nd attempt) [mediawiki-config] - https://gerrit.wikimedia.org/r/616941 (owner: Tim Starling)
[02:10:28] <wikibugs>	 Operations, MediaWiki-General, Performance-Team, serviceops-radar, and 3 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (Krinkle) p:High→Medium
[02:18:42] <wikibugs>	 Operations, Arc-Lamp, Performance-Team, Patch-For-Review: webperf1002 server close to have /srv partition full - https://phabricator.wikimedia.org/T257931 (Krinkle)
[02:18:56] <wikibugs>	 Operations, Arc-Lamp, Performance-Team: webperf1002 server close to have /srv partition full - https://phabricator.wikimedia.org/T257931 (Krinkle) Open→Resolved
[02:19:02] <logmsgbot>	 !log tstarling@deploy1001 Synchronized wmf-config/CommonSettings.php: re-enable lilypond in safe mode (duration: 01m 09s)
[02:19:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:19:13] <wikibugs>	 Operations, Arc-Lamp, Performance-Team: webperf1002 server close to have /srv partition full - https://phabricator.wikimedia.org/T257931 (Krinkle) >>! In T235456#6343338, @Krinkle wrote: > {F31952510 height=250}    Nice @dpifke :)
[02:35:12] <wikibugs>	 Operations, MediaWiki-extensions-Score, Security-Team, Wikimedia-General-or-Unknown, and 3 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (tstarling)
[03:07:33] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:13:11] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:30:27] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:32:19] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:36:34] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:37:57] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:52:59] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:58:39] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:04:21] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:08:10] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:13:49] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:17:37] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:27:03] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:30:49] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:35:27] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 1.354e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:37:21] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 333 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:51:27] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:54:47] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:58:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1142', diff saved to https://phabricator.wikimedia.org/P12100 and previous config saved to /var/cache/conftool/dbconfig/20200729-045859-marostegui.json
[04:59:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:02:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1142', diff saved to https://phabricator.wikimedia.org/P12101 and previous config saved to /var/cache/conftool/dbconfig/20200729-050204-marostegui.json
[05:02:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:02:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1141', diff saved to https://phabricator.wikimedia.org/P12102 and previous config saved to /var/cache/conftool/dbconfig/20200729-050247-marostegui.json
[05:02:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:03:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1078', diff saved to https://phabricator.wikimedia.org/P12103 and previous config saved to /var/cache/conftool/dbconfig/20200729-050346-marostegui.json
[05:03:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:21:13] <wikibugs>	 (PS1) Marostegui: db2106: Reimage to Buster [puppet] - https://gerrit.wikimedia.org/r/616989 (https://phabricator.wikimedia.org/T250666)
[05:21:59] <wikibugs>	 (CR) Marostegui: [C: +2] db2106: Reimage to Buster [puppet] - https://gerrit.wikimedia.org/r/616989 (https://phabricator.wikimedia.org/T250666) (owner: Marostegui)
[05:23:41] <wikibugs>	 Operations, Mail, OTRS, Trust-and-Safety, and 2 others: Forward emails addressed to privacy@wikidata to privacy@wikimedia - https://phabricator.wikimedia.org/T255733 (Emufarmers) We are fine with this.
[05:35:13] <wikibugs>	 (PS5) Giuseppe Lavagetto: helmfile: add data for enabling service proxy in k8s [puppet] - https://gerrit.wikimedia.org/r/616812
[05:43:31] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime
[05:43:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:45:35] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[05:45:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:52:48] <XioNoX>	 !log standardize mr1-eqsin interfaces
[05:52:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:04] <legoktm>	 !log ssh doc1001.eqiad.wmnet sudo -u doc-uploader git -C /srv/docroot pull
[06:01:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:05:47] <XioNoX>	 !log standardize mr1-ulsfo interfaces
[06:06:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:12:53] <wikibugs>	 (PS6) Giuseppe Lavagetto: helmfile: add data for enabling service proxy in k8s [puppet] - https://gerrit.wikimedia.org/r/616812
[06:14:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1078', diff saved to https://phabricator.wikimedia.org/P12104 and previous config saved to /var/cache/conftool/dbconfig/20200729-061450-marostegui.json
[06:14:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:19] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/616920 (https://phabricator.wikimedia.org/T258775) (owner: Dzahn)
[06:16:21] <XioNoX>	 !log standardize mr1-codfw interfaces
[06:16:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:21] <wikibugs>	 (PS1) Tim Starling: Turn off .ly source downloads [extensions/Score] (wmf/1.36.0-wmf.1) - https://gerrit.wikimedia.org/r/616978
[06:17:53] <wikibugs>	 (CR) Tim Starling: [C: +2] "It's in wmf.2 but not wmf.1" [extensions/Score] (wmf/1.36.0-wmf.1) - https://gerrit.wikimedia.org/r/616978 (owner: Tim Starling)
[06:18:14] <wikibugs>	 (CR) Muehlenhoff: [C: -1] "We're actively moving away from stretch-backports, in fact it will be disabled very soon entirely with https://gerrit.wikimedia.org/r/c/op" [puppet] - https://gerrit.wikimedia.org/r/616955 (https://phabricator.wikimedia.org/T179022) (owner: Legoktm)
[06:20:06] <wikibugs>	 (PS1) Marostegui: db2106: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/616994
[06:20:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1078', diff saved to https://phabricator.wikimedia.org/P12105 and previous config saved to /var/cache/conftool/dbconfig/20200729-062009-marostegui.json
[06:20:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:21:42] <wikibugs>	 (PS7) Giuseppe Lavagetto: helmfile: add data for enabling service proxy in k8s [puppet] - https://gerrit.wikimedia.org/r/616812
[06:21:53] <wikibugs>	 (CR) Marostegui: [C: +2] db2106: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/616994 (owner: Marostegui)
[06:22:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1112', diff saved to https://phabricator.wikimedia.org/P12106 and previous config saved to /var/cache/conftool/dbconfig/20200729-062224-marostegui.json
[06:22:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:31] <wikibugs>	 (PS1) Marostegui: db2117: Reimage to Buster [puppet] - https://gerrit.wikimedia.org/r/616995 (https://phabricator.wikimedia.org/T250666)
[06:26:39] <XioNoX>	 !log standardize mr1-eqiad interfaces
[06:26:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:27:14] <wikibugs>	 (CR) Marostegui: [C: +2] db2117: Reimage to Buster [puppet] - https://gerrit.wikimedia.org/r/616995 (https://phabricator.wikimedia.org/T250666) (owner: Marostegui)
[06:30:29] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/616885 (https://phabricator.wikimedia.org/T259000) (owner: Herron)
[06:32:42] <wikibugs>	 (PS1) Ayounsi: Add interfaces support for management routers [homer/public] - https://gerrit.wikimedia.org/r/616997
[06:33:19] <wikibugs>	 Operations, ops-eqiad, Analytics, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (elukey) I think that we should coordinate first about how to proceed, given what discussed in T243521#6005828. There are two things to keep in mind:  * rack...
[06:34:52] <wikibugs>	 (CR) Ayounsi: [C: +2] "Self-merging as I manually tested it and pushed the differences to the routers. So it's currently a NOOP." [homer/public] - https://gerrit.wikimedia.org/r/616997 (owner: Ayounsi)
[06:35:01] <wikibugs>	 Operations, ops-eqiad, DC-Ops: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (elukey) Cross-posting: we had a chat a while ago with dc-ops about 10g-enabled racks and availability, ending up in T243521#6005828. The config may be outdated, but...
[06:35:18] <wikibugs>	 (Merged) jenkins-bot: Add interfaces support for management routers [homer/public] - https://gerrit.wikimedia.org/r/616997 (owner: Ayounsi)
[06:35:33] <wikibugs>	 (Merged) jenkins-bot: Turn off .ly source downloads [extensions/Score] (wmf/1.36.0-wmf.1) - https://gerrit.wikimedia.org/r/616978 (owner: Tim Starling)
[06:36:58] <wikibugs>	 (PS8) Giuseppe Lavagetto: helmfile: add data for enabling service proxy in k8s [puppet] - https://gerrit.wikimedia.org/r/616812
[06:40:13] <wikibugs>	 (PS3) Muehlenhoff: Add CAS support to Hue [puppet] - https://gerrit.wikimedia.org/r/616541
[06:48:09] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime
[06:48:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:50:15] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[06:50:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:52:29] <wikibugs>	 (PS9) Giuseppe Lavagetto: helmfile: add data for enabling service proxy in k8s [puppet] - https://gerrit.wikimedia.org/r/616812
[07:08:21] <wikibugs>	 (CR) Ayounsi: [C: +1] Update netbox to v2.8.8-wmf [software/netbox-deploy] - https://gerrit.wikimedia.org/r/616892 (https://phabricator.wikimedia.org/T258942) (owner: CRusnov)
[07:19:19] <wikibugs>	 (PS1) Elukey: Remove AAAA/PTR records for db1108 [dns] - https://gerrit.wikimedia.org/r/617064 (https://phabricator.wikimedia.org/T234826)
[07:25:55] <wikibugs>	 (CR) Jcrespo: [C: -2] Remove AAAA/PTR records for db1108 [dns] - https://gerrit.wikimedia.org/r/617064 (https://phabricator.wikimedia.org/T234826) (owner: Elukey)
[07:26:12] <elukey>	 jynus: ?
[07:26:23] <jynus>	 I'm commenting on patch
[07:26:42] <elukey>	 yes I wasn't about to merge, it was just a code review
[07:27:04] <wikibugs>	 (CR) Jcrespo: [C: -2] "While this is an outlier, I think I was able to make it work. Let's keep the ipv6 for now." [dns] - https://gerrit.wikimedia.org/r/617064 (https://phabricator.wikimedia.org/T234826) (owner: Elukey)
[07:29:10] <elukey>	 okok :)
[07:29:19] <wikibugs>	 (CR) Jcrespo: [C: -2] "> so replication from db1108 would fail/timeout multiple times" [dns] - https://gerrit.wikimedia.org/r/617064 (https://phabricator.wikimedia.org/T234826) (owner: Elukey)
[07:31:21] <jynus>	 elukey: I am getting some strange errors on backup
[07:31:33] <jynus>	 I may need you for a while
[07:33:37] <RhinosF1>	 jynus: they just said they were afk for a few hours in analytics
[07:33:46] <jynus>	 ok
[07:33:51] <jynus>	 thanks
[07:33:56] <elukey>	 jynus: I am about to go afk for a bit, sorry :( is it ok if we do it this afternoon?
[07:34:05] <jynus>	 yes, no rush
[07:34:09] <elukey>	 super
[07:36:43] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] "https://puppet-compiler.wmflabs.org/compiler1002/24209/deploy1001.eqiad.wmnet/fulldiff.html at long last I managed to produce the correct " [puppet] - https://gerrit.wikimedia.org/r/616812 (owner: Giuseppe Lavagetto)
[07:44:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1141', diff saved to https://phabricator.wikimedia.org/P12107 and previous config saved to /var/cache/conftool/dbconfig/20200729-074414-marostegui.json
[07:44:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:32] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] kibana: CVE-2020-7016 / CVE-2020-7017 mitigations [puppet] - https://gerrit.wikimedia.org/r/616885 (https://phabricator.wikimedia.org/T259000) (owner: Herron)
[07:45:24] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] prometheus: upgrade snmp-exporter config [puppet] - https://gerrit.wikimedia.org/r/616857 (https://phabricator.wikimedia.org/T247967) (owner: Filippo Giunchedi)
[07:48:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1141', diff saved to https://phabricator.wikimedia.org/P12108 and previous config saved to /var/cache/conftool/dbconfig/20200729-074828-marostegui.json
[07:48:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:05] <wikibugs>	 Operations, WMDE-Analytics-Engineering, Graphite, User-Addshore: Regularly & Automatically backup WMDE metrics stored in graphite - https://phabricator.wikimedia.org/T125408 (Addshore) Stalled→Declined
[07:52:54] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:53:13] <wikibugs>	 (PS1) Filippo Giunchedi: prometheus: hide diffs in snmp_exporter::module [puppet] - https://gerrit.wikimedia.org/r/617067 (https://phabricator.wikimedia.org/T247967)
[07:53:55] <wikibugs>	 Operations, netops, observability, Patch-For-Review, User-fgiunchedi: Migrate role::netmon to Buster - https://phabricator.wikimedia.org/T247967 (fgiunchedi)
[07:54:50] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:55:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1141', diff saved to https://phabricator.wikimedia.org/P12109 and previous config saved to /var/cache/conftool/dbconfig/20200729-075558-marostegui.json
[07:56:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:44] <wikibugs>	 (PS1) Privacybatm: transferpy: Improve documentation [software/transferpy] - https://gerrit.wikimedia.org/r/617068 (https://phabricator.wikimedia.org/T257601)
[07:58:56] <wikibugs>	 (CR) Filippo Giunchedi: "LGTM overall, see nits inline" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/616851 (https://phabricator.wikimedia.org/T222826) (owner: Cwhite)
[08:02:40] <wikibugs>	 Operations, serviceops, Patch-For-Review: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (JMeybohm) >>! In T258775#6343259, @Dzahn wrote: >>>! In T258775#6341636, @JMeybohm wrote: >> All hosts but `wtp104[6-8].eqiad.wmnet` completed. >  > wtp1046, w...
[08:02:55] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] prometheus: hide diffs in snmp_exporter::module [puppet] - https://gerrit.wikimedia.org/r/617067 (https://phabricator.wikimedia.org/T247967) (owner: Filippo Giunchedi)
[08:03:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1141', diff saved to https://phabricator.wikimedia.org/P12110 and previous config saved to /var/cache/conftool/dbconfig/20200729-080318-marostegui.json
[08:03:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:08] <wikibugs>	 (03CR) 10Volans: "Change looks sane, I've a question inline and a more general one:" (031 comment) [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/616895 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond)
[08:04:12] <wikibugs>	 (03PS1) 10Filippo Giunchedi: install_server: reinstall netmon1002 with Buster [puppet] - 10https://gerrit.wikimedia.org/r/617069 (https://phabricator.wikimedia.org/T247967)
[08:04:16] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[08:04:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1121', diff saved to https://phabricator.wikimedia.org/P12111 and previous config saved to /var/cache/conftool/dbconfig/20200729-080442-marostegui.json
[08:04:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:25] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] install_server: reinstall netmon1002 with Buster [puppet] - 10https://gerrit.wikimedia.org/r/617069 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi)
[08:05:34] <marostegui>	 !log Deploy MCR schema change on db1121 (lag will show up on s4), also remove triggers on db1124:3314
[08:05:35] <wikibugs>	 (03PS2) 10Filippo Giunchedi: install_server: reinstall netmon1002 with Buster [puppet] - 10https://gerrit.wikimedia.org/r/617069 (https://phabricator.wikimedia.org/T247967)
[08:05:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:04] <wikibugs>	 (03PS1) 10Privacybatm: transferpy: Release transferpy 1.0 [software/transferpy] - 10https://gerrit.wikimedia.org/r/617071 (https://phabricator.wikimedia.org/T257601)
[08:08:02] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[08:10:54] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:11:32] <icinga-wm>	 PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:12:57] <wikibugs>	 (03PS2) 10Privacybatm: transferpy: Improve documentation [software/transferpy] - 10https://gerrit.wikimedia.org/r/617068 (https://phabricator.wikimedia.org/T257601)
[08:13:41] <wikibugs>	 (03PS2) 10Privacybatm: transferpy: Release transferpy 1.0 [software/transferpy] - 10https://gerrit.wikimedia.org/r/617071 (https://phabricator.wikimedia.org/T257601)
[08:18:39] <wikibugs>	 (03CR) 10DCausse: [C: 03+2] increment extra plugin to 6.5.4-wmf-11 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/616602 (owner: 10Ebernhardson)
[08:18:55] <wikibugs>	 (03CR) 10DCausse: [C: 03+1] "I meant +1 :)" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/616602 (owner: 10Ebernhardson)
[08:21:25] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM, the comments are nitpicks you can safely disregard." (032 comments) [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/616813 (owner: 10JMeybohm)
[08:23:57] <icinga-wm>	 PROBLEM - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on logstash1011 is CRITICAL: 106.8 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37
[08:24:28] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.downtime
[08:24:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:34] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:26:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:12] <wikibugs>	 (03CR) 10JMeybohm: [C: 04-1] helmfile: add data for enabling service proxy in k8s (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/616812 (owner: 10Giuseppe Lavagetto)
[08:27:27] <wikibugs>	 (03PS1) 10Filippo Giunchedi: prometheus: bump scrape timeout for PDUs [puppet] - 10https://gerrit.wikimedia.org/r/617073 (https://phabricator.wikimedia.org/T247967)
[08:29:11] <wikibugs>	 (03Abandoned) 10Filippo Giunchedi: prometheus: bump scrape timeout for PDUs [puppet] - 10https://gerrit.wikimedia.org/r/617073 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi)
[08:31:54] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:32:10] <wikibugs>	 (03PS3) 10Jbond: validate_$type: add checks to prevent legacy stdlib functions [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/616895 (https://phabricator.wikimedia.org/T259013)
[08:33:10] <wikibugs>	 (03CR) 10Jbond: "Thanks updated" (031 comment) [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/616895 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond)
[08:33:42] <wikibugs>	 (03PS1) 10DCausse: [wdqs] install openjdk-8-dbg [puppet] - 10https://gerrit.wikimedia.org/r/617074
[08:34:45] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Wikidata, 10Wikidata-Query-Service, and 2 others: wdqs admins should have access to nginx logs, jstack on wdqs machines - https://phabricator.wikimedia.org/T258739 (10dcausse) @herron thanks for the deploy. It works well for me. For jstack I need an extra package for...
[08:35:11] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/617074 (owner: 10DCausse)
[08:40:28] <wikibugs>	 (03CR) 10Jbond: "Thanks updated, for clarification i have created this CR with the aim of supporting both python 2.7 and 3.7 more as a transition than a pe" (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/615793 (owner: 10Jbond)
[08:40:56] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: "While the change seems technically correct, I'd like to see some rationale for making debian/rules more complicated in the commit message." [debs/helm] - 10https://gerrit.wikimedia.org/r/616065 (https://phabricator.wikimedia.org/T258773) (owner: 10JMeybohm)
[08:41:05] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] [wdqs] install openjdk-8-dbg [puppet] - 10https://gerrit.wikimedia.org/r/617074 (owner: 10DCausse)
[08:41:22] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/616911 (owner: 10Dzahn)
[08:41:38] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[08:42:06] <logmsgbot>	 !log jayme@cumin1001 conftool action : set/pooled=inactive; selector: name=wtp2001.codfw.wmnet
[08:42:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:27] <wikibugs>	 10Operations, 10serviceops, 10Patch-For-Review: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jayme on cumin1001.eqiad.wmnet for hosts: ` wtp2001.codfw.wmnet ` The log can be found in `/var/log...
[08:42:38] <wikibugs>	 (03PS1) 10Filippo Giunchedi: role: use rsync wrap_with_stunnel for netmon [puppet] - 10https://gerrit.wikimedia.org/r/617076 (https://phabricator.wikimedia.org/T247967)
[08:43:59] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Create ugly exception for port assignment for db1108 [puppet] - 10https://gerrit.wikimedia.org/r/617077 (https://phabricator.wikimedia.org/T234826)
[08:45:59] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] Release 8.0.8-1wm2 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/616724 (owner: 10Vgutierrez)
[08:46:26] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[08:47:27] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] "PCC https://puppet-compiler.wmflabs.org/compiler1002/24210/" [puppet] - 10https://gerrit.wikimedia.org/r/617076 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi)
[08:47:33] <wikibugs>	 (03PS2) 10Filippo Giunchedi: role: use rsync wrap_with_stunnel for netmon [puppet] - 10https://gerrit.wikimedia.org/r/617076 (https://phabricator.wikimedia.org/T247967)
[08:49:52] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[08:52:16] <wikibugs>	 10Operations, 10netops, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Migrate role::netmon to Buster - https://phabricator.wikimedia.org/T247967 (10fgiunchedi)
[08:52:29] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "ACK, LGTM" [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/616895 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond)
[08:53:13] <wikibugs>	 (03PS1) 10Muehlenhoff: profile::java: Add support to deploy debug packages [puppet] - 10https://gerrit.wikimedia.org/r/617079
[08:54:27] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] profile::java: Add support to deploy debug packages [puppet] - 10https://gerrit.wikimedia.org/r/617079 (owner: 10Muehlenhoff)
[08:55:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1121', diff saved to https://phabricator.wikimedia.org/P12112 and previous config saved to /var/cache/conftool/dbconfig/20200729-085504-marostegui.json
[08:55:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:36] <marostegui>	 !log The above was db1112
[08:55:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:33] <wikibugs>	 (03PS2) 10Muehlenhoff: profile::java: Add support to deploy debug packages [puppet] - 10https://gerrit.wikimedia.org/r/617079
[08:56:54] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] helmfile: add data for enabling service proxy in k8s (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/616812 (owner: 10Giuseppe Lavagetto)
[09:00:55] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] "Thanks for clarification! LGTM then" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/616812 (owner: 10Giuseppe Lavagetto)
[09:10:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1112', diff saved to https://phabricator.wikimedia.org/P12113 and previous config saved to /var/cache/conftool/dbconfig/20200729-091006-marostegui.json
[09:10:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:38] <wikibugs>	 (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/24211/" [puppet] - 10https://gerrit.wikimedia.org/r/617079 (owner: 10Muehlenhoff)
[09:10:47] <wikibugs>	 10Operations, 10vm-requests, 10Patch-For-Review: esams,ulsfo,eqsin: one VM request each for install_servers - https://phabricator.wikimedia.org/T254157 (10Volans) @Dzahn what's the status of this? It appears that the VM is up but not in puppet at all.
[09:11:51] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] helmfile: add data for enabling service proxy in k8s [puppet] - 10https://gerrit.wikimedia.org/r/616812 (owner: 10Giuseppe Lavagetto)
[09:13:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1112', diff saved to https://phabricator.wikimedia.org/P12114 and previous config saved to /var/cache/conftool/dbconfig/20200729-091319-marostegui.json
[09:13:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:06] <wikibugs>	 (03PS1) 10Marostegui: db2117: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/617080
[09:15:08] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db2117: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/617080 (owner: 10Marostegui)
[09:15:17] <vgutierrez>	 !log upload trafficserver 8.0.8-1wm2 to apt.wm.o (buster)
[09:15:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1112', diff saved to https://phabricator.wikimedia.org/P12115 and previous config saved to /var/cache/conftool/dbconfig/20200729-091528-marostegui.json
[09:15:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:02] <vgutierrez>	 !log upgrade ATS to version 8.0.8-1wm2 on cp4026 and cp4032
[09:16:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:33] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.downtime
[09:20:33] <logmsgbot>	 !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[09:20:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:19] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] Rakefile: Correctly match start of YAML docs [deployment-charts] - 10https://gerrit.wikimedia.org/r/616024 (owner: 10Alexandros Kosiaris)
[09:29:30] <wikibugs>	 (03Merged) 10jenkins-bot: Rakefile: Correctly match start of YAML docs [deployment-charts] - 10https://gerrit.wikimedia.org/r/616024 (owner: 10Alexandros Kosiaris)
[09:29:32] <wikibugs>	 (03Merged) 10jenkins-bot: charts for push-notification service [deployment-charts] - 10https://gerrit.wikimedia.org/r/602390 (https://phabricator.wikimedia.org/T250493) (owner: 10MSantos)
[09:37:08] <wikibugs>	 10Operations, 10netops, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Migrate role::netmon to Buster - https://phabricator.wikimedia.org/T247967 (10fgiunchedi)
[09:39:49] <wikibugs>	 (03PS5) 10Giuseppe Lavagetto: Add local service proxy to the tls terminator v0.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/582792 (https://phabricator.wikimedia.org/T244843)
[09:44:16] <vgutierrez>	 !log upgrade ATS to version 8.0.8-1wm2 on cp5006 and cp5012
[09:44:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:45:27] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/616124 (owner: 10Herron)
[09:47:07] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/616738 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond)
[09:48:11] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] graphite::web: drop validate_functions and add type validation [puppet] - 10https://gerrit.wikimedia.org/r/616738 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond)
[09:49:31] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/616753 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond)
[09:49:57] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] prometheus: drop legacy validate_functions [puppet] - 10https://gerrit.wikimedia.org/r/616753 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond)
[09:51:41] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on wtp2001 is CRITICAL: Host wtp2001 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[09:52:51] <icinga-wm>	 PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 105.8 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[09:53:55] <icinga-wm>	 PROBLEM - Apache HTTP on wtp2001 is CRITICAL: connect to address 10.192.16.43 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers
[09:53:55] <icinga-wm>	 PROBLEM - nutcracker process on wtp2001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.43: Connection reset by peer https://wikitech.wikimedia.org/wiki/Nutcracker
[09:56:15] <icinga-wm>	 PROBLEM - nutcracker socket on wtp2001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.43: Connection reset by peer https://wikitech.wikimedia.org/wiki/Nutcracker
[09:56:43] <wikibugs>	 (03PS1) 10Filippo Giunchedi: rsync: listen for stunnel connections on all AFs [puppet] - 10https://gerrit.wikimedia.org/r/617083
[09:57:11] <wikibugs>	 (03PS2) 10Filippo Giunchedi: rsync: listen for stunnel connections on v4/v6 [puppet] - 10https://gerrit.wikimedia.org/r/617083
[09:58:16] <jayme>	 wtp2001 is me (reimaging)
[09:58:45] <icinga-wm>	 PROBLEM - parsoid on wtp2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Services/Monitoring/parsoid
[10:07:22] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/617079 (owner: 10Muehlenhoff)
[10:11:45] <icinga-wm>	 RECOVERY - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is OK: (C)100 gt (W)80 gt 69.15 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[10:12:06] <vgutierrez>	 !log upgrade ATS to version 8.0.8-1wm2 on cp3064 and cp3065
[10:12:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:37] <logmsgbot>	 !log tstarling@deploy1001 Synchronized php-1.36.0-wmf.1/extensions/Score/extension.json: do not offer .ly downloads (duration: 01m 20s)
[10:19:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:21:15] <logmsgbot>	 !log tstarling@deploy1001 Synchronized php-1.36.0-wmf.1/extensions/Score/includes/Score.php: do not offer .ly downloads (duration: 01m 07s)
[10:21:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:25:26] <wikibugs>	 (03CR) 10Kormat: "Is there any chance we can get the ports changed instead? E.g. 3351/3352." [puppet] - 10https://gerrit.wikimedia.org/r/617077 (https://phabricator.wikimedia.org/T234826) (owner: 10Jcrespo)
[10:28:11] <icinga-wm>	 PROBLEM - Keyholder SSH agent on netmon1002 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. https://wikitech.wikimedia.org/wiki/Keyholder
[10:30:03] <wikibugs>	 (03CR) 10Vgutierrez: [C: 04-1] "the TLS cert used on envoy at aphlict.discovery.wmnet must include the public faced hostname phabricator.wikimedia.org in the SAN list, cu" [puppet] - 10https://gerrit.wikimedia.org/r/615797 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn)
[10:30:31] <vgutierrez>	 ^^ netmon1002 has been restarted lately?
[10:31:54] <godog>	 yes indeed, I've reimaged it earlier today, I'll rearm
[10:32:19] <vgutierrez>	 thx :D
[10:32:43] <godog>	 np! {{done}}
[10:32:52] <godog>	 with that, I'll go to lunch
[10:33:49] <icinga-wm>	 RECOVERY - Keyholder SSH agent on netmon1002 is OK: OK: Keyholder is armed with all configured keys. https://wikitech.wikimedia.org/wiki/Keyholder
[10:36:57] <wikibugs>	 10Operations, 10DBA, 10User-Kormat: Package wmfmariadbpy as a .deb - https://phabricator.wikimedia.org/T259021 (10Kormat) There are a number of scripts (i.e. executable python scripts) in this repo, but i'm not sure which ones are actively used or not:  I know these are used: ` switchover.py replication_tree...
[10:37:32] <wikibugs>	 (03PS26) 10Hnowlan: api-gateway: Basic envoy chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906)
[10:37:51] <wikibugs>	 (03CR) 10Elukey: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/617077 (https://phabricator.wikimedia.org/T234826) (owner: 10Jcrespo)
[10:39:41] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] profile::java: Add support to deploy debug packages [puppet] - 10https://gerrit.wikimedia.org/r/617079 (owner: 10Muehlenhoff)
[10:40:54] <vgutierrez>	 !log rolling upgrade of ATS to version 8.0.8-1wm2
[10:40:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:41:22] <wikibugs>	 (03CR) 10Muehlenhoff: "Looks good, but before deploying we should check existing ferm rules, not sure if that happened yet? Most should be created by auto_ferm, " [puppet] - 10https://gerrit.wikimedia.org/r/617083 (owner: 10Filippo Giunchedi)
[10:41:24] <wikibugs>	 (03PS1) 10Jbond: role::logstash::apifeature: refactor role to profile [puppet] - 10https://gerrit.wikimedia.org/r/617085 (https://phabricator.wikimedia.org/T259013)
[10:41:32] <wikibugs>	 10Operations, 10DBA, 10User-Kormat: Package wmfmariadbpy as a .deb - https://phabricator.wikimedia.org/T259021 (10Marostegui) So: these are definitely used: ` compare.py mysql.py backup_mariadb.py osc_host.py `
[10:43:21] <icinga-wm>	 RECOVERY - Apache HTTP on wtp2001 is OK: HTTP OK: HTTP/1.1 302 Found - 631 bytes in 0.192 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:43:21] <icinga-wm>	 RECOVERY - nutcracker process on wtp2001 is OK: PROCS OK: 1 process with UID = 112 (nutcracker), command name nutcracker https://wikitech.wikimedia.org/wiki/Nutcracker
[10:43:27] <icinga-wm>	 RECOVERY - nutcracker socket on wtp2001 is OK: TCP OK - 0.000 second response time on socket /var/run/nutcracker/redis_codfw.sock https://wikitech.wikimedia.org/wiki/Nutcracker
[10:44:37] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: envoyproxy::tls_terminator: update tls definitions [puppet] - 10https://gerrit.wikimedia.org/r/617086 (https://phabricator.wikimedia.org/T258140)
[10:44:54] <wikibugs>	 (03PS2) 10Jbond: role::logstash::apifeature: refactor role to profile [puppet] - 10https://gerrit.wikimedia.org/r/617085 (https://phabricator.wikimedia.org/T259013)
[10:49:12] <wikibugs>	 (03PS3) 10Jbond: role::logstash::apifeature: refactor role to profile [puppet] - 10https://gerrit.wikimedia.org/r/617085 (https://phabricator.wikimedia.org/T259013)
[10:52:06] <wikibugs>	 (03PS4) 10Jbond: role::logstash::apifeature: refactor role to profile [puppet] - 10https://gerrit.wikimedia.org/r/617085
[10:53:16] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] role::logstash::apifeature: refactor role to profile [puppet] - 10https://gerrit.wikimedia.org/r/617085 (owner: 10Jbond)
[10:54:25] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] envoyproxy::tls_terminator: update tls definitions [puppet] - 10https://gerrit.wikimedia.org/r/617086 (https://phabricator.wikimedia.org/T258140) (owner: 10Giuseppe Lavagetto)
[10:54:28] <wikibugs>	 (03CR) 10Elukey: Move mjolnir's daemons to search-loader hosts (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/616101 (https://phabricator.wikimedia.org/T258245) (owner: 10Elukey)
[10:54:37] <wikibugs>	 (03PS5) 10Jbond: role::logstash::apifeature: refactor role to profile [puppet] - 10https://gerrit.wikimedia.org/r/617085
[10:56:21] <wikibugs>	 (03PS8) 10Elukey: Move mjolnir's daemons to search-loader hosts [puppet] - 10https://gerrit.wikimedia.org/r/616101 (https://phabricator.wikimedia.org/T258245)
[10:57:54] <wikibugs>	 (03CR) 10JMeybohm: [C: 04-1] Add local service proxy to the tls terminator v0.2 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/582792 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto)
[10:58:52] <wikibugs>	 10Operations, 10DBA, 10User-Kormat: Package wmfmariadbpy as a .deb - https://phabricator.wikimedia.org/T259021 (10jcrespo) My intention is to put backup_mariadb.py and its dependencies (remote execution, etc.) on a separate package (that is why I showed you the https://gerrit.wikimedia.org/r/c/operations/sof...
[11:00:05] <jouncebot>	 Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for European mid-day backport window(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200729T1100).
[11:00:05] <jouncebot>	 VulpesVulpes825: A patch you scheduled for European mid-day backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:35] <VulpesVulpes825>	 Present and currently waiting for patch deployment.
[11:01:14] <Lucas_WMDE>	 I can deploy today :)
[11:01:17] <Lucas_WMDE>	 looking at the patch now
[11:01:47] <Lucas_WMDE>	 the 1x version only seems to have minimal changes, is that correct?
[11:01:52] * Lucas_WMDE looks at the task
[11:02:10] <VulpesVulpes825>	 Lucas_WMDE That is correct.
[11:02:24] <Lucas_WMDE>	 ok
[11:02:32] <VulpesVulpes825>	 The reason why the 1x logo was changed, but not the 1.5x and 2x, in the previous update is unknown.
[11:02:40] <wikibugs>	 (03PS4) 10Lucas Werkmeister (WMDE): Change the logo for Wu Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616760 (https://phabricator.wikimedia.org/T259005) (owner: 10VulpesVulpes825)
[11:02:51] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Change the logo for Wu Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616760 (https://phabricator.wikimedia.org/T259005) (owner: 10VulpesVulpes825)
[11:03:34] <wikibugs>	 (03Merged) 10jenkins-bot: Change the logo for Wu Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616760 (https://phabricator.wikimedia.org/T259005) (owner: 10VulpesVulpes825)
[11:04:04] * Lucas_WMDE looks up how to deploy logo changes
[11:04:11] <Lucas_WMDE>	 probably needs some cache purging commands IIRC
[11:04:22] <Lucas_WMDE>	 ah yes, https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers#Purging
[11:05:01] <Lucas_WMDE>	 and I assume it doesn’t make sense to test this on mwdebug1002, /static probably isn’t affected by X-Wikimedia-Debug
[11:06:03] <Urbanecm>	 Lucas_WMDE: static is affected by X-Wikimedia-Debug
[11:06:11] <Lucas_WMDE>	 oh, ok
[11:06:14] <Lucas_WMDE>	 then I can try it out, I guess
[11:06:19] <Urbanecm>	 yup
[11:06:24] <Lucas_WMDE>	 thanks
[11:06:29] <Lucas_WMDE>	 ok change is on mwdebug1001
[11:06:36] <Urbanecm>	 and yes, you need to purge changed static URLs (mwscript purgeList.php)
[11:06:57] <wikibugs>	 (03PS6) 10Jbond: role::logstash: refactor role to conform to coding guid [puppet] - 10https://gerrit.wikimedia.org/r/617085
[11:07:43] <Lucas_WMDE>	 yup, seems to work (at 200% zoom, logo gets the larger characters with XWD)
[11:07:43] <Urbanecm>	 Lucas_WMDE: VulpesVulpes825: wfm
[11:07:45] <Lucas_WMDE>	 syncing
[11:07:58] <Urbanecm>	 :(
[11:08:54] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1001 Synchronized static/images/project-logos/: Config: [[gerrit:616760|Change the logo for Wu Wikipedia (T259005)]] (duration: 01m 08s)
[11:08:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:09:00] <stashbot>	 T259005: Change the logo of Wu Wikipedia - https://phabricator.wikimedia.org/T259005
[11:09:35] <Lucas_WMDE>	 !log lucaswerkmeister-wmde@mwmaint1002:~$ printf 'https://en.wikipedia.org/static/images/project-logos/%s\n' 'wuuwiki.png' 'wuuwiki-1.5x.png' 'wuuwiki-2x.png' | mwscript purgeList.php # T259005
[11:09:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:09:57] <Lucas_WMDE>	 and now it also works without the debug header
[11:10:18] <Urbanecm>	 (y)
[11:11:02] <VulpesVulpes825>	 It is now live on wuu wikipedia. Lucas_WMDE, thank you for your help.
[11:11:12] <Lucas_WMDE>	 you’re welcome, thanks for the patch :)
[11:11:50] <Lucas_WMDE>	 !log EU B&C window done
[11:11:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:15:30] <wikibugs>	 (03PS1) 10Jbond: role::logstash::puppetboard: refactor role [puppet] - 10https://gerrit.wikimedia.org/r/617087 (https://phabricator.wikimedia.org/T259013)
[11:15:55] <wikibugs>	 (03PS2) 10Jbond: role::logstash::puppetboard: refactor role [puppet] - 10https://gerrit.wikimedia.org/r/617087 (https://phabricator.wikimedia.org/T259013)
[11:17:18] <wikibugs>	 10Operations, 10DBA, 10User-Kormat: Package wmfmariadbpy as a .deb - https://phabricator.wikimedia.org/T259021 (10jcrespo) Let me go a bit overboard and propose the following to be added to a potential wmfmariadbpy package- to be installed on cumin hosts:  * Libraries: WMFMariaDB, WMFReplication (I believ...
[11:19:01] <wikibugs>	 10Operations, 10DBA, 10User-Kormat: Package wmfmariadbpy as a .deb - https://phabricator.wikimedia.org/T259021 (10jcrespo) BTW, this is duplicate of 3yo T165358 :-).
[11:19:35] <wikibugs>	 10Operations, 10DBA, 10User-Kormat: Package wmfmariadbpy as a .deb - https://phabricator.wikimedia.org/T259021 (10jcrespo)
[11:22:29] <wikibugs>	 10Operations, 10DBA, 10User-Kormat: Package wmfmariadbpy as a .deb - https://phabricator.wikimedia.org/T259021 (10Marostegui) >>! In T259021#6344214, @jcrespo wrote: > My intention is to put backup_mariadb.py and its dependencies (remote execution, etc.) on a separate package (that is why I showed you the ht...
[11:22:38] <wikibugs>	 (03CR) 10Jcrespo: "> we had rigid restrictions" [puppet] - 10https://gerrit.wikimedia.org/r/617077 (https://phabricator.wikimedia.org/T234826) (owner: 10Jcrespo)
[11:25:28] <wikibugs>	 (03PS3) 10Jbond: role::logstash::puppetboard: refactor role [puppet] - 10https://gerrit.wikimedia.org/r/617087 (https://phabricator.wikimedia.org/T259013)
[11:25:52] <wikibugs>	 10Operations, 10serviceops, 10Patch-For-Review: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp2001.codfw.wmnet'] `  and were **ALL** successful.
[11:26:20] <icinga-wm>	 RECOVERY - parsoid on wtp2001 is OK: HTTP OK: HTTP/1.1 200 OK - 1022 bytes in 0.170 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/parsoid
[11:27:05] <wikibugs>	 (03PS1) 10Jbond: role::logstash::collector: remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/617090 (https://phabricator.wikimedia.org/T259013)
[11:27:20] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: analytics_meta on db1108 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1133, Errmsg: Error Cant find any matching row in the user table on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:27:45] <marostegui>	 elukey: ^
[11:27:53] <marostegui>	 I will check with you
[11:27:53] <jynus>	 eh, I didn't touch anything?
[11:28:04] <wikibugs>	 (03PS1) 10Elukey: Add hue overrides to an-tool1009 [puppet] - 10https://gerrit.wikimedia.org/r/617091 (https://phabricator.wikimedia.org/T258768)
[11:28:08] <elukey>	 nono it is me, I created a test database, of course it didn't work
[11:28:21] <marostegui>	 elukey: it was the grants, not the DB
[11:28:25] <marostegui>	 I will work with you on that
[11:28:30] <elukey>	 ahhhh
[11:28:39] <marostegui>	 It will be an easy fix
[11:28:43] <jynus>	 elukey: also please don't create "test" databases :-D, call them something else :-D
[11:29:06] <jynus>	 we just had to spend 2 days of kormat's work fixing an issue with that :-D
[11:29:14] <jynus>	 (not kidding)
[11:30:01] <elukey>	 well I have to test a new version of the Hue daemon, and to avoid using the prod DB I need to create another one, it is technically testing but if good it will allow me to upgrade
[11:30:08] <elukey>	 and I can't do it elsewhere sadly
[11:30:17] <jynus>	 sorry I didn't express myself well
[11:30:21] <jynus>	 you should create test databases
[11:30:29] <jynus>	 just not call them "test" :-D
[11:30:39] <jynus>	 or testsomething
[11:31:01] <elukey>	 ah okok yes I agree :)
[11:31:02] <marostegui>	 jynus: it is not called test or anything similar
[11:31:05] <jynus>	 ah, ok
[11:31:08] <jynus>	 I misunderstood
[11:31:17] <elukey>	 I called it hue_next, really promising
[11:31:20] <wikibugs>	 (03PS1) 10Urbanecm: Fix overindentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617092
[11:31:25] <elukey>	 (not really since I broke replication after a minute)
[11:31:28] <jynus>	 elukey: that has no issues
[11:31:42] <jynus>	 well, the naming at least
[11:32:52] <wikibugs>	 (03PS1) 10Urbanecm: Add Wikipedia wordmark for trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617093 (https://phabricator.wikimedia.org/T255489)
[11:33:12] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] "noop for production" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617092 (owner: 10Urbanecm)
[11:33:18] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Add Wikipedia wordmark for trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617093 (https://phabricator.wikimedia.org/T255489) (owner: 10Urbanecm)
[11:33:59] <wikibugs>	 (03Merged) 10jenkins-bot: Fix overindentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617092 (owner: 10Urbanecm)
[11:34:03] <wikibugs>	 (03Merged) 10jenkins-bot: Add Wikipedia wordmark for trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617093 (https://phabricator.wikimedia.org/T255489) (owner: 10Urbanecm)
[11:34:04] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: analytics_meta on db1108 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:35:58] <wikibugs>	 (03PS7) 10Jbond: role::logstash: refactor role to conform to coding guid [puppet] - 10https://gerrit.wikimedia.org/r/617085
[11:36:03] <wikibugs>	 (03CR) 10Jcrespo: "> then it should be in theory easy to stop the mariadb slaves and restart them no?" [puppet] - 10https://gerrit.wikimedia.org/r/617077 (https://phabricator.wikimedia.org/T234826) (owner: 10Jcrespo)
[11:36:19] <wikibugs>	 (03PS4) 10Jbond: role::logstash::puppetboard: refactor role [puppet] - 10https://gerrit.wikimedia.org/r/617087 (https://phabricator.wikimedia.org/T259013)
[11:36:22] <wikibugs>	 (03Abandoned) 10Jcrespo: mariadb: Create ugly exception for port assignment for db1108 [puppet] - 10https://gerrit.wikimedia.org/r/617077 (https://phabricator.wikimedia.org/T234826) (owner: 10Jcrespo)
[11:36:22] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 9f7e03292941d0d782437862f406efa7e1c6463e: Fix overindentation (duration: 01m 08s)
[11:36:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:36:26] <wikibugs>	 (03PS2) 10Jbond: role::logstash::collector: remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/617090 (https://phabricator.wikimedia.org/T259013)
[11:36:52] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Add hue overrides to an-tool1009 [puppet] - 10https://gerrit.wikimedia.org/r/617091 (https://phabricator.wikimedia.org/T258768) (owner: 10Elukey)
[11:37:50] <wikibugs>	 (03CR) 10Jforrester: [C: 04-1] "I don't like this approach, as I said, but if you insist, we can…" (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612879 (https://phabricator.wikimedia.org/T241961) (owner: 10C. Scott Ananian)
[11:38:15] <wikibugs>	 10Operations, 10Fundraising-Backlog: New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 (10Pcoombe) @Krinkle I do quite like the idea of exploring a microsite in the longer term, but it would involve more work that we hadn't planned for. We're...
[11:39:09] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized static/images/mobile/copyright/wikipedia-wordmark-tr.svg: 252bb6c1bf83d96a14a0ef63e06eb544eef8a00b: Add Wikipedia wordmark for trwiki (T255489; sync 1/2) (duration: 01m 06s)
[11:39:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:39:14] <stashbot>	 T255489: Mobile version logo on Turkish Wikipedia - https://phabricator.wikimedia.org/T255489
[11:41:19] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 252bb6c1bf83d96a14a0ef63e06eb544eef8a00b: Add Wikipedia wordmark for trwiki (T255489; sync 2/2) (duration: 01m 05s)
[11:41:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:41:52] <wikibugs>	 (03PS1) 10Jbond: role::logstash7: refactor role to conform to coding guide [puppet] - 10https://gerrit.wikimedia.org/r/617095 (https://phabricator.wikimedia.org/T259013)
[11:42:21] <wikibugs>	 (03PS8) 10Jbond: role::logstash: refactor role to conform to coding guide [puppet] - 10https://gerrit.wikimedia.org/r/617085
[11:42:56] <wikibugs>	 (03PS2) 10Hnowlan: api-gateway: add helmfile.d configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/616467 (https://phabricator.wikimedia.org/T254906)
[11:43:01] <jynus>	 elukey: I think I may have found a solution to make things easier- create a third instance for tests/one time things
[11:43:19] <jynus>	 that way it can be in read-write and will not affect replication?
[11:43:35] <wikibugs>	 (03PS1) 10Cmjohnson: Adding mgmt dns for alert1001 to dns file, netbox aleady updated [dns] - 10https://gerrit.wikimedia.org/r/617096 (https://phabricator.wikimedia.org/T255072)
[11:43:48] <elukey>	 yes definitely this is a possible solution, but I also like to get exposed to these issues since I am learning a lot
[11:44:02] <elukey>	 I know it is a toll on your team but I hope to be more independent eventually :)
[11:44:04] <jynus>	 when you are available later or tomorrow
[11:44:23] <jynus>	 I will want to ask you, and at the same time show you how to do an emergency recovery
[11:44:24] <wikibugs>	 (03CR) 10Cmjohnson: [C: 03+2] Adding mgmt dns for alert1001 to dns file, netbox aleady updated [dns] - 10https://gerrit.wikimedia.org/r/617096 (https://phabricator.wikimedia.org/T255072) (owner: 10Cmjohnson)
[11:44:42] <jynus>	 in case you have to do it and we are not around
[11:45:07] <wikibugs>	 (03PS2) 10Jbond: role::logstash7: refactor role to conform to coding guide [puppet] - 10https://gerrit.wikimedia.org/r/617095 (https://phabricator.wikimedia.org/T259013)
[11:45:23] <wikibugs>	 (03PS5) 10Jbond: role::logstash::puppetboard: refactor role [puppet] - 10https://gerrit.wikimedia.org/r/617087 (https://phabricator.wikimedia.org/T259013)
[11:45:32] <wikibugs>	 (03PS3) 10Jbond: role::logstash::collector: remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/617090 (https://phabricator.wikimedia.org/T259013)
[11:46:23] <wikibugs>	 (03PS3) 10Jbond: role::logstash7: refactor role to conform to coding guide [puppet] - 10https://gerrit.wikimedia.org/r/617095 (https://phabricator.wikimedia.org/T259013)
[11:47:42] <wikibugs>	 (03CR) 10Muehlenhoff: "This is currently in use on logstash1007-1009 and logstash2004-2006?  It will be obsolete with the full move to Kibana 7, though." [puppet] - 10https://gerrit.wikimedia.org/r/617090 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond)
[11:52:25] <wikibugs>	 (03PS1) 10Cmjohnson: Adding production dns alert1001, public ip with ipv6 [dns] - 10https://gerrit.wikimedia.org/r/617101 (https://phabricator.wikimedia.org/T255072)
[11:52:29] <wikibugs>	 (03CR) 10Jbond: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/617090 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond)
[11:52:50] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Adding production dns alert1001, public ip with ipv6 [dns] - 10https://gerrit.wikimedia.org/r/617101 (https://phabricator.wikimedia.org/T255072) (owner: 10Cmjohnson)
[11:54:12] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[11:55:07] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/616541 (owner: 10Muehlenhoff)
[11:55:46] <wikibugs>	 (03PS2) 10Cmjohnson: Adding production dns alert1001, public ip with ipv6 [dns] - 10https://gerrit.wikimedia.org/r/617101 (https://phabricator.wikimedia.org/T255072)
[11:56:01] <wikibugs>	 10Operations, 10netbox: Add SSO support to netbox - https://phabricator.wikimedia.org/T244849 (10jbond) >>! In T244849#6330155, @ayounsi wrote: > No idea if it's useful here but came across https://github.com/jeremyschulman/netbox-plugin-auth-saml2  forgot to respond to this, yes this is useful thanks @ayounsi
[11:57:25] <wikibugs>	 (03CR) 10Cmjohnson: [C: 03+2] Adding production dns alert1001, public ip with ipv6 [dns] - 10https://gerrit.wikimedia.org/r/617101 (https://phabricator.wikimedia.org/T255072) (owner: 10Cmjohnson)
[11:57:58] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[12:00:04] <jouncebot>	 Urbanecm and Amir1: #bothumor I � Unicode. All rise for Create avkwiki deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200729T1200).
[12:00:04] <jouncebot>	 Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200729T1200)
[12:00:28] <wikibugs>	 10Operations, 10Readers-Web-Backlog, 10WMF-Legal, 10SEO: (Automate) adding wikinews language versions to the Google Publisher Center / Google News - https://phabricator.wikimedia.org/T254437 (10Nonovian) p:05Medium→03High I change the priority for more visibility on Phabricator.
[12:00:45] <Urbanecm>	 Amir1: \o/
[12:01:23] * marostegui around in case 
[12:02:44] <Amir1>	 o/
[12:02:57] <Amir1>	 You start, I'm around :D
[12:03:15] <Urbanecm>	 cool
[12:03:49] * Urbanecm rebasing config file
[12:04:38] <wikibugs>	 (03CR) 10Filippo Giunchedi: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/617083 (owner: 10Filippo Giunchedi)
[12:05:03] <wikibugs>	 (03PS8) 10Urbanecm: Initial configuration for avkwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612897 (https://phabricator.wikimedia.org/T257943)
[12:05:25] <wikibugs>	 (03PS9) 10Urbanecm: Initial configuration for avkwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612897 (https://phabricator.wikimedia.org/T257943)
[12:05:31] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Initial configuration for avkwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612897 (https://phabricator.wikimedia.org/T257943) (owner: 10Urbanecm)
[12:06:36] <wikibugs>	 (03Merged) 10jenkins-bot: Initial configuration for avkwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612897 (https://phabricator.wikimedia.org/T257943) (owner: 10Urbanecm)
[12:06:44] <Urbanecm>	 fetching config to the deployment host
[12:07:10] <Urbanecm>	 pulling to mwmaint
[12:07:15] <moritzm>	 !log rebooting idp2001 for kernel update
[12:07:17] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[12:07:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:07:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:07:44] <Urbanecm>	 Amir1: so, `mwscript extensions/WikimediaMaintenance/addWiki.php --wiki=cebwiki avk wikipedia avkwiki avk.wikipedia.org` would be the magical command now, right?
[12:07:55] <Amir1>	 Yup
[12:08:01] <Urbanecm>	 okay, running that
[12:09:14] <Urbanecm>	 sql avkwiki --write says the master host is 10.64.32.197, which is db1100, which is in s5
[12:09:17] <Urbanecm>	 seems it worked
[12:09:41] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[12:09:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:09:55] <marostegui>	 Urbanecm: I can successfully see the new database on s5 slaves
[12:09:59] <Amir1>	 cooolio
[12:10:00] <Urbanecm>	 cool!
[12:10:10] <Urbanecm>	 Amir1: at which point am I supposed to sync db-* files?
[12:10:25] <Urbanecm>	 something tells me before everything else
[12:10:36] <Amir1>	 yup, that's first
[12:10:40] <Urbanecm>	 okay, doing
[12:11:35] <Amir1>	 https://wikitech.wikimedia.org/wiki/Add_a_wiki#Database_creation
[12:12:02] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/db-eqiad.php: Creating avkwiki (T257943) (duration: 01m 05s)
[12:12:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:12:07] <stashbot>	 T257943: Create Wikipedia Kotava - https://phabricator.wikimedia.org/T257943
[12:12:55] <Urbanecm>	 that doesn't say anything about db-eqiad.php and db-codfw.eqiad Amir1 - we need to update the wiki page for different shard process (ideally once we move some closed wikis)
[12:13:06] <Urbanecm>	 *db-codfw.php, of course
[12:13:38] <Amir1>	 yeah but the concept is the same, first db files
[12:13:49] <Urbanecm>	 i see
[12:13:51] <Amir1>	 db lists, db-eqiad.php
[12:14:27] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/db-codfw.php: Creating avkwiki (T257943) (duration: 01m 06s)
[12:14:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:14:33] <Urbanecm>	 syncing dblists
[12:14:53] <Amir1>	 I'm moving to another desk now
[12:15:06] <wikibugs>	 (03PS1) 10Jbond: thanos: add thanos.wikimedia.org top the cache layer [puppet] - 10https://gerrit.wikimedia.org/r/617105 (https://phabricator.wikimedia.org/T151009)
[12:15:29] <wikibugs>	 (03PS1) 10Cmjohnson: Adding alert1001 to site.pp and dhpd file [puppet] - 10https://gerrit.wikimedia.org/r/617106 (https://phabricator.wikimedia.org/T255072)
[12:15:41] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized dblists: Creating avkwiki (T257943) (duration: 01m 06s)
[12:15:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:15:48] <Urbanecm>	 wikiversions going now
[12:15:51] <wikibugs>	 (03PS1) 10Jbond: thanos: add thanos cname pointing to cache [dns] - 10https://gerrit.wikimedia.org/r/617107 (https://phabricator.wikimedia.org/T151009)
[12:16:10] <wikibugs>	 (03PS2) 10Jbond: thanos: add thanos.wikimedia.org top the cache layer [puppet] - 10https://gerrit.wikimedia.org/r/617105 (https://phabricator.wikimedia.org/T151009)
[12:19:53] <wikibugs>	 (03CR) 10Cmjohnson: [C: 03+2] Adding alert1001 to site.pp and dhpd file [puppet] - 10https://gerrit.wikimedia.org/r/617106 (https://phabricator.wikimedia.org/T255072) (owner: 10Cmjohnson)
[12:22:14] <wikibugs>	 (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/24218/cp1079.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/617105 (https://phabricator.wikimedia.org/T151009) (owner: 10Jbond)
[12:24:08] <Urbanecm>	 scap is taking longer than usual to finish
[12:24:19] <logmsgbot>	 !log urbanecm@deploy1001 rebuilt and synchronized wikiversions files: Creating avkwiki (T257943)
[12:24:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:24:25] <stashbot>	 T257943: Create Wikipedia Kotava - https://phabricator.wikimedia.org/T257943
[12:26:03] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized static/images/project-logos/: Creating avkwiki (T257943) (duration: 01m 06s)
[12:26:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:27:16] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Creating avkwiki (T257943) (duration: 01m 03s)
[12:27:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:28:34] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized langlist: Creating avkwiki (T257943) (duration: 01m 05s)
[12:28:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:28:45] <wikibugs>	 (03PS1) 10Urbanecm: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617110
[12:28:47] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617110 (owner: 10Urbanecm)
[12:29:29] <wikibugs>	 (03CR) 10Nikerabbit: "The file looks not optimized for size. Is that intentional?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617093 (https://phabricator.wikimedia.org/T255489) (owner: 10Urbanecm)
[12:29:36] <wikibugs>	 (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617110 (owner: 10Urbanecm)
[12:30:18] <wikibugs>	 10Operations, 10Continuous-Integration-Infrastructure, 10Jenkins: Update Jenkins gpg release key in reprepro - https://phabricator.wikimedia.org/T259116 (10hashar)
[12:30:59] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 02m 21s)
[12:31:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:31:12] <Urbanecm>	 Amir1: marostegui: So, we're done now :).
[12:31:18] <marostegui>	 <3
[12:31:28] <Amir1>	 Cool, let's check cebwiki
[12:31:30] <marostegui>	 Urbanecm: let me know when ready to sanitize on the ticket
[12:31:34] <Amir1>	 Does it look like avk wiki now?
[12:31:43] <marostegui>	 I  hope not :-/
[12:31:56] <Urbanecm>	 marostegui: it should be ready now, the database got created. I'll put a note on the DBA ticket
[12:32:01] <marostegui>	 thanks
[12:33:06] <Urbanecm>	 Amir1: cebwiki looks fine to me
[12:33:12] <Amir1>	 \o/
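Note on the exchange above: the sync order for avkwiki, as it appears in the `!log` lines, can be written down as a small check of Amir1's "first db files" rule. This is only a sketch. `SYNC_ORDER` is copied from the log, while `is_db_file` and `db_files_first` are hypothetical helper names for illustration and are not part of scap or any Wikimedia tooling.

```python
# Sync order as it appears in the !log lines for avkwiki (T257943).
SYNC_ORDER = [
    "wmf-config/db-eqiad.php",
    "wmf-config/db-codfw.php",
    "dblists",
    "wikiversions",
    "static/images/project-logos/",
    "wmf-config/InitialiseSettings.php",
    "langlist",
    "wmf-config/interwiki.php",
]


def is_db_file(path):
    """Hypothetical helper: count db-*.php config files and dblists as db files."""
    name = path.rsplit("/", 1)[-1]
    return (name.startswith("db-") and name.endswith(".php")) or path == "dblists"


def db_files_first(order):
    """Return True if no db file is synced after a non-db file."""
    seen_other = False
    for path in order:
        if is_db_file(path):
            if seen_other:
                return False
        else:
            seen_other = True
    return True
```

Calling `db_files_first(SYNC_ORDER)` checks the logged order against that rule; an order that syncs `langlist` before `wmf-config/db-eqiad.php` would fail it.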
[12:33:42] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[12:34:45] <wikibugs>	 (03PS6) 10JMeybohm: Add local service proxy to the tls terminator v0.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/582792 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto)
[12:35:27] <Urbanecm>	 Amir1: ^^ - doesn't look related to me though
[12:36:02] <Amir1>	 hmm, what's causing it?
[12:37:30] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[12:37:37] <Urbanecm>	 not sure, but `[{exception_id}] {exception_url} Wikimedia\Assert\InvariantException from line 224 of /srv/mediawiki/php-1.36.0-wmf.1/vendor/wikimedia/assert/src/Assert.php: Invariant failed: Bad UTF-8 at end of string (3 byte sequence)` sounds like T242298, opened in 1.35.0-wmf.14 times
[12:37:37] <stashbot>	 T242298: Invariant failed: Bad UTF-8 at end of string (2 byte sequence) - https://phabricator.wikimedia.org/T242298
[12:38:05] <wikibugs>	 (03CR) 10JMeybohm: "While trying to verify the generated envoy.yaml it seemed easier to push a new patch that to mention the linefeed chomping problems one-by" [deployment-charts] - 10https://gerrit.wikimedia.org/r/582792 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto)
[12:40:31] <marostegui>	 Urbanecm Amir1 I see the table being fine on external store for avkwiki
[12:40:36] <marostegui>	 so that looks good too
[12:40:44] <Urbanecm>	 cool!
[12:41:47] <wikibugs>	 10Operations, 10User-jbond: OKR: Install and configure new CFSSL PKI server - https://phabricator.wikimedia.org/T259117 (10jbond) p:05Triage→03Medium
[12:43:07] <Amir1>	 \o/
[12:44:25] <moritzm>	 !log imported curl 7.38.0-4+deb8u16+wmf1 to apt.wikimedia.org (jessie-wikimedia) T259102
[12:44:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:46:15] <wikibugs>	 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: Upgrade ELK Stack - https://phabricator.wikimedia.org/T234854 (10jcrespo) I am getting a lot of 500 internal server errors on logstash-next instance. I am guessing that is expected/WIP?
[12:48:28] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[12:48:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:49:06] <logmsgbot>	 !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97)
[12:49:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:49:25] <wikibugs>	 10Operations, 10User-jbond: OKR: Install and configure new CFSSL PKI server - https://phabricator.wikimedia.org/T259117 (10jbond)
[12:49:40] <wikibugs>	 10Operations, 10User-jbond: OKR: Install and configure new CFSSL PKI server - https://phabricator.wikimedia.org/T259117 (10jbond)
[12:50:31] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Due By: 2020-07-25) rack/setup/install alert1001 - https://phabricator.wikimedia.org/T255072 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` alert1001.wikimedia.org ` The log can be foun...
[12:51:47] <wikibugs>	 (03PS3) 10Jbond: thanos: add thanos.wikimedia.org to the cache layer [puppet] - 10https://gerrit.wikimedia.org/r/617105 (https://phabricator.wikimedia.org/T151009)
[12:51:56] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] rsync: listen for stunnel connections on v4/v6 [puppet] - 10https://gerrit.wikimedia.org/r/617083 (owner: 10Filippo Giunchedi)
[12:54:02] <wikibugs>	 (03CR) 10Ema: [C: 03+1] thanos: add thanos.wikimedia.org to the cache layer [puppet] - 10https://gerrit.wikimedia.org/r/617105 (https://phabricator.wikimedia.org/T151009) (owner: 10Jbond)
[12:54:49] <wikibugs>	 (03PS2) 10Volans: mgmt: netbox-generated data for mgmt codfw [dns] - 10https://gerrit.wikimedia.org/r/615668 (https://phabricator.wikimedia.org/T233183)
[12:56:55] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.dns.netbox
[12:56:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:57:05] <wikibugs>	 (03PS4) 10Jbond: thanos: add thanos.wikimedia.org to the cache layer [puppet] - 10https://gerrit.wikimedia.org/r/617105 (https://phabricator.wikimedia.org/T151009)
[12:57:26] <wikibugs>	 10Operations, 10RESTBase, 10Traffic, 10Platform Team Workboards (Clinic Duty Team): RESTBase CORS redirect resolve should not hit frontend caches - https://phabricator.wikimedia.org/T259054 (10ema)
[12:58:00] <logmsgbot>	 !log volans@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[12:58:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:58:10] <volans>	 [HEADS UP] I'm about to merge the DNS patch that moves all codfw mgmt records to the auto-generated ones via Netbox in a few minutes ( https://gerrit.wikimedia.org/r/c/operations/dns/+/615668 )
[12:58:11] <wikibugs>	 (03CR) 10Filippo Giunchedi: "LGTM overall" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/617105 (https://phabricator.wikimedia.org/T151009) (owner: 10Jbond)
[13:00:04] <jouncebot>	 liw and brennen: #bothumor My software never has bugs. It just develops random features. Rise for Mediawiki train - European+American Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200729T1300).
[13:01:15] <wikibugs>	 (03PS1) 10Lars Wirzenius: group1 wikis to 1.36.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617114
[13:01:17] <wikibugs>	 (03CR) 10Lars Wirzenius: [C: 03+2] group1 wikis to 1.36.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617114 (owner: 10Lars Wirzenius)
[13:02:00] <wikibugs>	 (03Merged) 10jenkins-bot: group1 wikis to 1.36.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617114 (owner: 10Lars Wirzenius)
[13:03:27] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime
[13:03:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:03:50] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:04:02] <wikibugs>	 (03PS4) 10Kormat: Create debian packages. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/616846
[13:04:24] <logmsgbot>	 !log liw@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.36.0-wmf.2
[13:04:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:05:32] <logmsgbot>	 !log liw@deploy1001 Synchronized php: group1 wikis to 1.36.0-wmf.2 (duration: 01m 07s)
[13:05:34] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[13:05:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:05:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:07:03] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Due By: 2020-07-25) rack/setup/install alert1001 - https://phabricator.wikimedia.org/T255072 (10Cmjohnson)
[13:07:09] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Due By: 2020-07-25) rack/setup/install alert1001 - https://phabricator.wikimedia.org/T255072 (10Cmjohnson) 05Open→03Resolved
[13:08:51] <wikibugs>	 (03PS1) 10Jbond: profile::thanos::httpd: pass maxconn and query_port to idp vhost [puppet] - 10https://gerrit.wikimedia.org/r/617115
[13:10:05] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] profile::thanos::httpd: pass maxconn and query_port to idp vhost [puppet] - 10https://gerrit.wikimedia.org/r/617115 (owner: 10Jbond)
[13:10:11] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] cloud - hiera5: migrate labs main environment to hiera5 [puppet] - 10https://gerrit.wikimedia.org/r/615159 (owner: 10Jbond)
[13:13:16] <wikibugs>	 10Operations, 10RESTBase, 10Traffic, 10Platform Team Workboards (Clinic Duty Team): RESTBase CORS redirect resolve should not hit frontend caches - https://phabricator.wikimedia.org/T259054 (10ema) For the record I don't think this currently causes any specific functional issue. I've spotted a few RESTBase...
[13:17:26] <wikibugs>	 (03PS1) 10Volans: dns: skip Netbox addresses without DNS name [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/617119 (https://phabricator.wikimedia.org/T233183)
[13:17:50] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] dns: skip Netbox addresses without DNS name [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/617119 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans)
[13:18:30] <wikibugs>	 (03PS2) 10Volans: dns: skip Netbox addresses without DNS name [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/617119 (https://phabricator.wikimedia.org/T233183)
[13:25:26] <icinga-wm>	 RECOVERY - puppet last run on otrs1001 is OK: OK: Puppet is currently disabled (disable for OTRS 6.x upgrade), not alerting. Last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[13:26:14] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:29:15] <wikibugs>	 (03PS4) 10Muehlenhoff: Add CAS support to Hue [puppet] - 10https://gerrit.wikimedia.org/r/616541
[13:29:23] <logmsgbot>	 !log jayme@cumin1001 conftool action : set/pooled=yes; selector: name=wtp2001.codfw.wmnet
[13:29:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:43] <wikibugs>	 (03PS1) 10Elukey: role::druid::test_analytics::worker: fix wrong monitor name [puppet] - 10https://gerrit.wikimedia.org/r/617120
[13:31:20] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] role::druid::test_analytics::worker: fix wrong monitor name [puppet] - 10https://gerrit.wikimedia.org/r/617120 (owner: 10Elukey)
[13:33:06] <wikibugs>	 10Operations, 10serviceops, 10Patch-For-Review: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10JMeybohm) @Dzahn I did `wtp2001.codfw.wmnet` as that was pretty full as well.
[13:33:53] <wikibugs>	 (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/24222/" [puppet] - 10https://gerrit.wikimedia.org/r/616541 (owner: 10Muehlenhoff)
[13:34:14] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add CAS support to Hue [puppet] - 10https://gerrit.wikimedia.org/r/616541 (owner: 10Muehlenhoff)
[13:34:58] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Due By: 2020-07-25) rack/setup/install alert1001 - https://phabricator.wikimedia.org/T255072 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['alert1001.wikimedia.org'] `  and were **ALL** successful.
[13:36:00] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on wtp2001 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[13:38:30] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:42:32] <wikibugs>	 (03PS3) 10Volans: dns: check that primary addresses have DNS names [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/617119 (https://phabricator.wikimedia.org/T233183)
[13:43:42] <wikibugs>	 (03CR) 10Volans: [C: 03+2] "Self-merging as this is currently breaking the sre.dns.netbox cookbook due to a primary address without a DNS name." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/617119 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans)
[13:45:13] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.dns.netbox
[13:45:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:50:42] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:50:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:50:48] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] envoy-future: new image for future versions of Envoy (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/616865 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan)
[13:52:19] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:55:34] <volans>	 !log migrating *all* codfw mgmt DNS records to the autogenerated ones via Netbox - T233183
[13:55:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:42] <stashbot>	 T233183: Automate generation of Management DNS records from Netbox - https://phabricator.wikimedia.org/T233183
[13:55:46] <wikibugs>	 (03CR) 10Volans: [C: 03+2] mgmt: netbox-generated data for mgmt codfw [dns] - 10https://gerrit.wikimedia.org/r/615668 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans)
[13:55:49] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:57:21] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] "I'm unsure if it's a good idea to symlink to resources from ../envoy as that might lead to changes that are easy to overlook when bumping " [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/616865 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan)
[14:00:55] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:03:24] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] Initial debian commit [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/610880 (https://phabricator.wikimedia.org/T251006) (owner: 10Ottomata)
[14:03:56] <wikibugs>	 (03CR) 10Hnowlan: "> Patch Set 1: Code-Review+1" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/616865 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan)
[14:05:37] <wikibugs>	 (03CR) 10Hnowlan: envoy-future: new image for future versions of Envoy (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/616865 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan)
[14:07:15] <wikibugs>	 (03PS1) 10Volans: scripts: codfw migrated to Netbox [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/617147 (https://phabricator.wikimedia.org/T233183)
[14:07:22] <wikibugs>	 (03PS5) 10Cwhite: provision loki on grafana-next [puppet] - 10https://gerrit.wikimedia.org/r/616851 (https://phabricator.wikimedia.org/T222826)
[14:07:27] <wikibugs>	 10Operations, 10Mail, 10observability, 10User-MoritzMuehlenhoff: Fix paniclog alert to only sent mails once - https://phabricator.wikimedia.org/T257016 (10MoritzMuehlenhoff) @herron Seems to work fine, didn't see a paniclog mail today \o/
[14:07:56] <wikibugs>	 (03CR) 10Cwhite: provision loki on grafana-next (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/616851 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite)
[14:08:44] <wikibugs>	 (03CR) 10Volans: [C: 03+2] "Self merging to align the behaviour to the just migrated records. Just a feature flag." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/617147 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans)
[14:15:18] <wikibugs>	 (03CR) 10CRusnov: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/617119 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans)
[14:17:03] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on wtp2001 is CRITICAL: CRITICAL: Missing 1 sites from wikiversions. 513 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:18:26] <wikibugs>	 (03CR) 10CRusnov: [V: 03+2 C: 03+2] Update netbox to v2.8.8-wmf [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/616892 (https://phabricator.wikimedia.org/T258942) (owner: 10CRusnov)
[14:18:28] <wikibugs>	 (03PS5) 10Kormat: Create debian packages. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/616846
[14:19:31] <wikibugs>	 (03CR) 10Muehlenhoff: "Ack, I missed that it's the end of a patch series." [puppet] - 10https://gerrit.wikimedia.org/r/617090 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond)
[14:20:29] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:21:33] <wikibugs>	 10Operations, 10Traffic: varnishmtail silently stops working if varnishncsa crashes - https://phabricator.wikimedia.org/T259020 (10ema) I've opened https://github.com/google/mtail/issues/331 to get an opinion from upstream.
[14:21:54] <wikibugs>	 (03PS1) 10Cmjohnson: Adding all dns for an-test-worker hosts in eqiad [dns] - 10https://gerrit.wikimedia.org/r/617148 (https://phabricator.wikimedia.org/T255520)
[14:22:31] <wikibugs>	 (03PS2) 10Jbond: profile::thanos::httpd: pass maxconn and query_port to idp vhost [puppet] - 10https://gerrit.wikimedia.org/r/617115
[14:23:50] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] profile::thanos::httpd: pass maxconn and query_port to idp vhost [puppet] - 10https://gerrit.wikimedia.org/r/617115 (owner: 10Jbond)
[14:23:54] <wikibugs>	 (03CR) 10Hnowlan: [V: 03+2 C: 03+2] envoy-future: new image for future versions of Envoy [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/616865 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan)
[14:23:59] <wikibugs>	 (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/24224/" [puppet] - 10https://gerrit.wikimedia.org/r/617115 (owner: 10Jbond)
[14:24:45] <wikibugs>	 (03PS2) 10Cmjohnson: Adding all dns for an-test-worker hosts in eqiad [dns] - 10https://gerrit.wikimedia.org/r/617148 (https://phabricator.wikimedia.org/T255520)
[14:25:44] <wikibugs>	 (03CR) 10Cmjohnson: [C: 03+2] Adding all dns for an-test-worker hosts in eqiad [dns] - 10https://gerrit.wikimedia.org/r/617148 (https://phabricator.wikimedia.org/T255520) (owner: 10Cmjohnson)
[14:27:08] <marostegui>	 ls -lh
[14:27:19] <logmsgbot>	 !log pt1979@cumin2001 START - Cookbook sre.dns.netbox
[14:27:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:28:36] <wikibugs>	 (03PS3) 10Jbond: profile::thanos::httpd: pass maxconn and query_port to idp vhost [puppet] - 10https://gerrit.wikimedia.org/r/617115
[14:29:39] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/24225/" [puppet] - 10https://gerrit.wikimedia.org/r/617115 (owner: 10Jbond)
[14:29:39] <moritzm>	 !log installing exiv2 security updates
[14:29:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:30:09] <jbond42>	 !log install curl security update for jessie
[14:30:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:34:02] <logmsgbot>	 !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:34:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:35:13] <wikibugs>	 (03CR) 10Herron: [C: 03+2] kibana: CVE-2020-7016 / CVE-2020-7017 mitigations [puppet] - 10https://gerrit.wikimedia.org/r/616885 (https://phabricator.wikimedia.org/T259000) (owner: 10Herron)
[14:36:52] <wikibugs>	 (03PS5) 10Jbond: thanos: add thanos.wikimedia.org to the cache layer [puppet] - 10https://gerrit.wikimedia.org/r/617105 (https://phabricator.wikimedia.org/T151009)
[14:37:06] <wikibugs>	 (03CR) 10Jbond: thanos: add thanos.wikimedia.org to the cache layer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/617105 (https://phabricator.wikimedia.org/T151009) (owner: 10Jbond)
[14:39:15] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:39:16] <wikibugs>	 (03PS1) 10Urbanecm: Set muswiki to reqd only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617152 (https://phabricator.wikimedia.org/T259004)
[14:39:58] <wikibugs>	 (03PS2) 10Urbanecm: Set muswiki to read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617152 (https://phabricator.wikimedia.org/T259004)
[14:43:12] <wikibugs>	 (03PS2) 10JMeybohm: Add a new action to helm-chartctl to upload prebuild tgz [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/616813
[14:43:14] <wikibugs>	 (03PS2) 10JMeybohm: New package version [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/616814
[14:44:30] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add a new action to helm-chartctl to upload prebuild tgz [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/616813 (owner: 10JMeybohm)
[14:44:46] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] New package version [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/616814 (owner: 10JMeybohm)
[14:45:21] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Wikidata, 10Wikidata-Query-Service, and 2 others: wdqs admins should have access to nginx logs, jstack on wdqs machines - https://phabricator.wikimedia.org/T258739 (10herron) 05Open→03Resolved >>! In T258739#6343721, @dcausse wrote: > @herron thanks for the deploy...
[14:48:18] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10netops, 10cloud-services-team (Hardware): (Need By: 2020-06-12) rack/setup/install WMCS 10G switches - https://phabricator.wikimedia.org/T251632 (10Andrew) Is there any reason not to close this?  Are there still asset tags or netbox things left to do?
[14:48:22] <wikibugs>	 (03PS3) 10Peter.ovchyn: Add defaults for initial state for sidebar. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/610069 (https://phabricator.wikimedia.org/T254230)
[14:48:40] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (10Andrew) These hosts are in service now.  @Cmjohnson, can this be closed?
[14:48:45] <wikibugs>	 10Operations: Integrate Stretch 9.13 point update - https://phabricator.wikimedia.org/T258407 (10MoritzMuehlenhoff)
[14:49:53] <moritzm>	 !log installing ruby-json security updates
[14:49:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:07] <wikibugs>	 (03PS3) 10JMeybohm: Add a new action to helm-chartctl to upload prebuild tgz [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/616813
[14:51:09] <wikibugs>	 (03PS3) 10JMeybohm: New package version [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/616814
[14:52:19] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:58:17] <wikibugs>	 (03PS1) 10Jbond: storeconfigs: add debug option to test $settings variable [puppet] - 10https://gerrit.wikimedia.org/r/617156
[14:58:20] <wikibugs>	 (03PS1) 10Jbond: storeconfigs: only export resources if storeconfigs is enabled [puppet] - 10https://gerrit.wikimedia.org/r/617157
[14:59:02] <wikibugs>	 (03PS2) 10Jbond: storeconfigs: add debug option to test $settings variable [puppet] - 10https://gerrit.wikimedia.org/r/617156
[14:59:32] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] New package version [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/616814 (owner: 10JMeybohm)
[14:59:55] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[15:00:05] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Add a new action to helm-chartctl to upload prebuild tgz [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/616813 (owner: 10JMeybohm)
[15:00:12] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] storeconfigs: only export resources if storeconfigs is enabled [puppet] - 10https://gerrit.wikimedia.org/r/617157 (owner: 10Jbond)
[15:01:03] <wikibugs>	 (03PS1) 10Herron: Revert "kibana: CVE-2020-7016 / CVE-2020-7017 mitigations" [puppet] - 10https://gerrit.wikimedia.org/r/617140
[15:01:38] <wikibugs>	 (03CR) 10Herron: [C: 03+2] Revert "kibana: CVE-2020-7016 / CVE-2020-7017 mitigations" [puppet] - 10https://gerrit.wikimedia.org/r/617140 (owner: 10Herron)
[15:02:06] <wikibugs>	 (03Merged) 10jenkins-bot: Add a new action to helm-chartctl to upload prebuild tgz [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/616813 (owner: 10JMeybohm)
[15:02:07] <wikibugs>	 (03Merged) 10jenkins-bot: New package version [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/616814 (owner: 10JMeybohm)
[15:05:31] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[15:08:14] <wikibugs>	 10Operations: Integrate Stretch 9.13 point update - https://phabricator.wikimedia.org/T258407 (10MoritzMuehlenhoff)
[15:10:28] <jayme>	 !log imported docker-report_0.0.8-1 to buster-wikimedia
[15:10:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:13:01] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[15:14:03] <wikibugs>	 (03PS3) 10Jbond: storeconfigs: add debug option to test $settings variable [puppet] - 10https://gerrit.wikimedia.org/r/617156
[15:17:09] <wikibugs>	 (03PS2) 10Jbond: storeconfigs: only export resources if storeconfigs is enabled [puppet] - 10https://gerrit.wikimedia.org/r/617157
[15:18:42] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] storeconfigs: only export resources if storeconfigs is enabled [puppet] - 10https://gerrit.wikimedia.org/r/617157 (owner: 10Jbond)
[15:21:05] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Set muswiki to read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617152 (https://phabricator.wikimedia.org/T259004) (owner: 10Urbanecm)
[15:22:09] <wikibugs>	 (03Merged) 10jenkins-bot: Set muswiki to read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617152 (https://phabricator.wikimedia.org/T259004) (owner: 10Urbanecm)
[15:24:06] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 617152: Set muswiki to read only | https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/617152 (T259004) (duration: 01m 08s)
[15:24:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:24:12] <stashbot>	 T259004: Move muswiki and mhwiktionary (closed wikis) from s3 to s5 - https://phabricator.wikimedia.org/T259004
[15:25:38] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (10Cmjohnson) 05Open→03Resolved Thanks @Andrew for the assist with these!  Resolved
[15:26:22] <icinga-wm>	 PROBLEM - Check systemd state on chartmuseum2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:26:26] <icinga-wm>	 PROBLEM - Check the last execution of helm-chartctl-package-all on chartmuseum2001 is CRITICAL: CRITICAL: Status of the systemd unit helm-chartctl-package-all https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:32:44] <wikibugs>	 10Operations, 10serviceops, 10Patch-For-Review: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` wtp2002.codfw.wmnet ` The log can be found in `/var/log...
[15:33:06] <logmsgbot>	 !log liw@deploy1001 rebuilt and synchronized wikiversions files: Revert "group[0|1] wikis to 1.36.0-wmf.1"
[15:33:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:33:36] <wikibugs>	 10Operations, 10serviceops, 10Patch-For-Review: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` wtp2003.codfw.wmnet ` The log can be found in `/var/log...
[15:33:48] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/617105 (https://phabricator.wikimedia.org/T151009) (owner: 10Jbond)
[15:33:49] <elukey>	 we should have a dedicated wikitech page for SRE team's systemd timers, not pointing to the analytics one :D
[15:34:08] <elukey>	 (see icinga alarm about chartmuseum above)
[15:34:24] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[15:35:01] <wikibugs>	 (03PS27) 10Hnowlan: api-gateway: Basic envoy chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906)
[15:35:18] <wikibugs>	 (03PS3) 10Hnowlan: api-gateway: add helmfile.d configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/616467 (https://phabricator.wikimedia.org/T254906)
[15:35:51] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/616851 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite)
[15:36:15] <Urbanecm>	 liw: sorry, seems we ran into each other at the deployment host. Please let me know when it is safe enough, I need to revert a testing patch I created a while ago.
[15:36:42] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on wtp2001 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[15:37:32] <liw>	 Urbanecm, will be a little while while I sort out things, sorry
[15:37:53] <Urbanecm>	 no problem, please ping me once ready :)
[15:38:07] <liw>	 Urbanecm, will do
[15:40:35] <wikibugs>	 (03PS1) 10Herron: Revert "Revert "kibana: CVE-2020-7016 / CVE-2020-7017 mitigations"" [puppet] - 10https://gerrit.wikimedia.org/r/617141
[15:43:00] <wikibugs>	 (03PS1) 10Lars Wirzenius: Revert "group1 wikis to 1.36.0-wmf.2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617165
[15:43:34] <wikibugs>	 (03CR) 10Lars Wirzenius: [V: 03+2 C: 03+2] Revert "group1 wikis to 1.36.0-wmf.2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617165 (owner: 10Lars Wirzenius)
[15:44:29] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.dns.netbox
[15:44:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:44:42] <wikibugs>	 (03CR) 10Hnowlan: api-gateway: Basic envoy chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan)
[15:44:48] <liw>	 Urbanecm, I'm done for now
[15:44:52] <Urbanecm>	 thanks!
[15:45:23] <wikibugs>	 (03PS1) 10Urbanecm: Revert "Set muswiki to read only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617167 (https://phabricator.wikimedia.org/T259004)
[15:45:55] <wikibugs>	 (03CR) 10Herron: "saw an intermittent error on the front end shortly after merging the original patch and quickly reverted for troubleshooting.  after a clo" [puppet] - 10https://gerrit.wikimedia.org/r/617141 (owner: 10Herron)
[15:46:17] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Revert "Set muswiki to read only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617167 (https://phabricator.wikimedia.org/T259004) (owner: 10Urbanecm)
[15:46:36] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] installserver: use correct partman recipe for parse* [puppet] - 10https://gerrit.wikimedia.org/r/616920 (https://phabricator.wikimedia.org/T258775) (owner: 10Dzahn)
[15:47:10] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Set muswiki to read only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617167 (https://phabricator.wikimedia.org/T259004) (owner: 10Urbanecm)
[15:47:12] <wikibugs>	 (03CR) 10Herron: [C: 03+2] Revert "Revert "kibana: CVE-2020-7016 / CVE-2020-7017 mitigations"" [puppet] - 10https://gerrit.wikimedia.org/r/617141 (owner: 10Herron)
[15:48:44] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 617167: Revert "Set muswiki to read only" | https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/617167 (T259004) (duration: 01m 06s)
[15:48:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:48:50] <stashbot>	 T259004: Move muswiki and mhwiktionary (closed wikis) from s3 to s5 - https://phabricator.wikimedia.org/T259004
[15:48:56] * Urbanecm is done too
[15:49:14] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[15:53:38] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[15:56:49] <wikibugs>	 10Operations, 10Traffic, 10observability, 10serviceops, 10Patch-For-Review: monitoring for mismatched LVS realserver addresses/configurations - https://phabricator.wikimedia.org/T258648 (10CDanis) 05Open→03Declined The structure of the data makes this teeth-pullingly impossibly difficult to do well,...
[15:56:53] <wikibugs>	 10Operations, 10Traffic, 10observability, 10serviceops: Alert "LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4" is flapping - https://phabricator.wikimedia.org/T258614 (10CDanis)
[15:56:56] <icinga-wm>	 RECOVERY - Check systemd state on chartmuseum2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:58:34] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:00:58] <icinga-wm>	 PROBLEM - Check systemd state on chartmuseum2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:01:47] <wikibugs>	 10Operations, 10serviceops, 10Patch-For-Review: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ['parse2001.codfw.wmnet', 'parse2002.codfw.wmnet', 'par...
[16:02:35] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:02:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:02:54] <wikibugs>	 10Operations, 10serviceops, 10Patch-For-Review: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10Dzahn) >>! In T258775#6344708, @JMeybohm wrote: > @Dzahn I did `wtp2001.codfw.wmnet` as that was pretty full as well.  Thank you. Taking over with wtp2002,2003...
[16:14:43] <wikibugs>	 10Operations, 10Traffic, 10observability, 10serviceops, 10Patch-For-Review: monitoring for mismatched LVS realserver addresses/configurations - https://phabricator.wikimedia.org/T258648 (10Joe) Truth is that every host that has multiple pools that use the same backend IP can happily live with only one po...
[16:15:18] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[16:15:19] <logmsgbot>	 !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[16:15:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:15:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:16:05] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[16:16:05] <logmsgbot>	 !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[16:16:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:16:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:16:48] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[16:16:51] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[16:16:51] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[16:16:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:16:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:16:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:17:35] <wikibugs>	 (03PS7) 10Giuseppe Lavagetto: Add local service proxy to the tls terminator v0.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/582792 (https://phabricator.wikimedia.org/T244843)
[16:18:54] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[16:18:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:19:04] <logmsgbot>	 !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[16:19:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:20:48] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Revert new reply API for now (1.36.0-wmf.2 only) [extensions/DiscussionTools] (wmf/1.36.0-wmf.2) - 10https://gerrit.wikimedia.org/r/617169 (https://phabricator.wikimedia.org/T252558)
[16:21:06] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[16:21:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:21:16] <wikibugs>	 (03PS28) 10Hnowlan: api-gateway: Basic envoy chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906)
[16:21:54] <wikibugs>	 (03CR) 10Bartosz Dziewoński: "(please check that this looks right and +1)" [extensions/DiscussionTools] (wmf/1.36.0-wmf.2) - 10https://gerrit.wikimedia.org/r/617169 (https://phabricator.wikimedia.org/T252558) (owner: 10Bartosz Dziewoński)
[16:22:13] <icinga-wm>	 PROBLEM - Check that envoy is running on wtp2002 is CRITICAL: connect to address 10.192.16.44 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[16:22:43] <mutante>	 this is not supposed to happen, because the reimage script is running and it downtimes the hosts ^
[16:23:17] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on wtp2003 is CRITICAL: Host wtp2003 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[16:23:17] <icinga-wm>	 PROBLEM - puppet last run on wtp2002 is CRITICAL: connect to address 10.192.16.44 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:23:46] <mutante>	 there was an exception when the cookbook tried to set the downtime, so i am downtiming manually. these are unpooled reinstalls
[16:24:49] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[16:24:49] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[16:24:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:24:54] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[16:24:55] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[16:24:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:25:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:25:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:26:23] <wikibugs>	 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2002.codfw.wmnet', 'parse2003.codfw.wmnet', 'parse2001.codfw.wmnet'] `  and were **ALL** successful.
[16:28:15] <wikibugs>	 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ['parse2004.codfw.wmnet', 'parse2005.codfw.wmnet', 'parse2006.codfw.wmnet'] `...
[16:29:46] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Move VisualEditor from beta to default on enwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617170 (https://phabricator.wikimedia.org/T258992)
[16:33:37] <wikibugs>	 (03PS29) 10Hnowlan: api-gateway: Basic envoy chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906)
[16:36:03] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] api-gateway: Basic envoy chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan)
[16:36:39] <wikibugs>	 10Operations, 10Continuous-Integration-Infrastructure, 10serviceops: puppet errors on contint servers related to helmfiles for push-notifications - https://phabricator.wikimedia.org/T259152 (10Dzahn)
[16:37:13] <wikibugs>	 (03Merged) 10jenkins-bot: api-gateway: Basic envoy chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan)
[16:38:52] <icinga-wm>	 RECOVERY - Check the last execution of helm-chartctl-package-all on chartmuseum2001 is OK: OK: Status of the systemd unit helm-chartctl-package-all https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:40:44] <wikibugs>	 (03CR) 10Ppchelko: [C: 04-1] api-gateway: add helmfile.d configuration (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/616467 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan)
[16:41:29] <wikibugs>	 (03CR) 10Dzahn: "please see https://phabricator.wikimedia.org/T259152 for an issue with puppet on the contint servers that seems related here" [puppet] - 10https://gerrit.wikimedia.org/r/613104 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli)
[16:42:12] <wikibugs>	 10Operations, 10Continuous-Integration-Infrastructure, 10serviceops: puppet errors on contint servers related to helmfiles for push-notifications - https://phabricator.wikimedia.org/T259152 (10Dzahn)
[16:42:15] <wikibugs>	 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10serviceops: Deploy push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973 (10Dzahn)
[16:42:42] <wikibugs>	 10Operations, 10Continuous-Integration-Infrastructure, 10serviceops: puppet errors on contint servers related to helmfiles for push-notifications - https://phabricator.wikimedia.org/T259152 (10Dzahn)
[16:42:53] <wikibugs>	 10Operations, 10Continuous-Integration-Infrastructure, 10serviceops: puppet errors on contint servers related to helmfiles for push-notifications - https://phabricator.wikimedia.org/T259152 (10Dzahn)
[16:42:55] <wikibugs>	 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10serviceops: Deploy push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973 (10Dzahn)
[16:43:19] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[16:43:23] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[16:43:23] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[16:43:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:43:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:43:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:44:25] <wikibugs>	 10Operations, 10Continuous-Integration-Infrastructure, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10serviceops: puppet errors on contint servers related to helmfiles for push-notifications - https://phabricator.wikimedia.org/T259152 (10Dzahn)
[16:45:24] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[16:45:26] <logmsgbot>	 !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[16:45:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:45:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:47:26] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[16:47:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:49:56] <icinga-wm>	 PROBLEM - Check the last execution of helm-chartctl-package-all on chartmuseum2001 is CRITICAL: CRITICAL: Status of the systemd unit helm-chartctl-package-all https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:51:03] <wikibugs>	 (03PS1) 10Urbanecm: Enable Translate extension at plwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617174 (https://phabricator.wikimedia.org/T259087)
[16:51:31] <wikibugs>	 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2006.codfw.wmnet', 'parse2005.codfw.wmnet', 'parse2004.codfw.wmnet'] `  and were **ALL** successful.
[16:52:08] <wikibugs>	 10Operations, 10Continuous-Integration-Infrastructure, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10serviceops: puppet errors on contint servers related to helmfiles for push-notifications - https://phabricator.wikimedia.org/T259152 (10hashar) The error message refers to `push-no...
[16:52:34] <wikibugs>	 (03PS4) 10Hnowlan: api-gateway: add helmfile.d configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/616467 (https://phabricator.wikimedia.org/T254906)
[16:53:18] <wikibugs>	 (03PS1) 10BryanDavis: Add .gitignore [software/acme-chief] - 10https://gerrit.wikimedia.org/r/617175
[16:53:20] <wikibugs>	 (03PS1) 10BryanDavis: acme_chief: Profide .crt.chained.key file support [software/acme-chief] - 10https://gerrit.wikimedia.org/r/617176 (https://phabricator.wikimedia.org/T255249)
[16:53:22] <wikibugs>	 (03PS1) 10BryanDavis: api: Allow acme-chief clients to fetch .chained.crt.key files [software/acme-chief] - 10https://gerrit.wikimedia.org/r/617177 (https://phabricator.wikimedia.org/T255249)
[16:55:00] <wikibugs>	 (03PS4) 10C. Scott Ananian: Alternate configuration mechanism for Parsoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612879 (https://phabricator.wikimedia.org/T241961)
[16:56:06] <wikibugs>	 10Operations, 10Continuous-Integration-Infrastructure, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10serviceops: puppet errors on contint servers related to helmfiles for push-notifications - https://phabricator.wikimedia.org/T259152 (10Dzahn) >>! In T259152#6345663, @hashar wrote...
[16:56:12] <wikibugs>	 (03PS2) 10BryanDavis: acme_chief: Profide .chained.crt.key file support [software/acme-chief] - 10https://gerrit.wikimedia.org/r/617176 (https://phabricator.wikimedia.org/T255249)
[16:56:14] <wikibugs>	 (03PS2) 10BryanDavis: api: Allow acme-chief clients to fetch .chained.crt.key files [software/acme-chief] - 10https://gerrit.wikimedia.org/r/617177 (https://phabricator.wikimedia.org/T255249)
[17:00:00] <wikibugs>	 (03PS5) 10C. Scott Ananian: Alternate configuration mechanism for Parsoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612879 (https://phabricator.wikimedia.org/T241961)
[17:00:10] <wikibugs>	 (03CR) 10C. Scott Ananian: Alternate configuration mechanism for Parsoid (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612879 (https://phabricator.wikimedia.org/T241961) (owner: 10C. Scott Ananian)
[17:01:02] <wikibugs>	 (03CR) 10C. Scott Ananian: Alternate configuration mechanism for Parsoid (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612879 (https://phabricator.wikimedia.org/T241961) (owner: 10C. Scott Ananian)
[17:01:37] <wikibugs>	 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ['parse2007.codfw.wmnet', 'parse2008.codfw.wmnet', 'parse2009.codfw.wmnet'] `...
[17:04:11] <wikibugs>	 10Operations, 10SRE-tools: Exception raised while executing cookbook sre.hosts.downtime - https://phabricator.wikimedia.org/T259158 (10Dzahn)
[17:06:09] <wikibugs>	 10Operations, 10SRE-tools: Exception raised while executing cookbook sre.hosts.downtime - https://phabricator.wikimedia.org/T259158 (10Dzahn)
[17:10:06] <wikibugs>	 10Operations, 10Epic, 10Goal: automatically collect network error reports from users' browsers (Network Error Logging API) - https://phabricator.wikimedia.org/T257527 (10CDanis)
[17:11:59] <wikibugs>	 10Operations, 10SRE-tools: Exception raised while executing cookbook sre.hosts.downtime - https://phabricator.wikimedia.org/T259158 (10Volans) From the cookbook logs: ` 2020-07-29 16:15:18,960 dzahn 15835 [DEBUG puppetdb.py:320 in _execute] Queried puppetdb for '["or", ["=", "certname", "wtp2003.codfw.wmnet"]]...
[17:15:13] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Clusters: (Need By: TBD) upgrade ram in an-master100[12] - https://phabricator.wikimedia.org/T259162 (10RobH) p:05Triage→03Medium
[17:15:16] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Clusters: (Need By: TBD) upgrade ram in an-master100[12] - https://phabricator.wikimedia.org/T259162 (10RobH)
[17:15:22] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Clusters: (Need By: TBD) upgrade ram in an-master100[12] - https://phabricator.wikimedia.org/T259162 (10RobH) Please note that while I just created this task, the actual memory has NOT yet been placed to order.  It was escalated for approvals and placement today.
[17:15:22] <icinga-wm>	 RECOVERY - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on logstash1011 is OK: (C)100 gt (W)80 gt 75.25 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37
[17:16:40] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[17:16:43] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[17:16:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:16:44] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[17:16:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:16:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:18:09] <wikibugs>	 10Operations, 10Phabricator, 10Traffic, 10serviceops, and 2 others: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (10mmodell) So I think that this is all that remains:  * [] cache layer proxy wss://phabricator.wikimedia.org to aphl...
[17:18:14] <wikibugs>	 10Operations, 10SRE-tools: Exception raised while executing cookbook sre.hosts.downtime - https://phabricator.wikimedia.org/T259158 (10Dzahn) >>! In T259158#6345791, @Volans wrote: > so either catalog compilation for wtp servers is particularly slow  This seems likely to be the case. On the first run it does a...
[17:19:23] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[17:19:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:19:48] <wikibugs>	 10Operations, 10SRE-tools: Exception raised while executing cookbook sre.hosts.downtime - https://phabricator.wikimedia.org/T259158 (10Volans) No, I meant the catalog compilation on the puppetmaster, after which the catalog is sent to puppetdb. It's unrelated to how much time takes the first puppet run on the...
[17:20:34] <icinga-wm>	 RECOVERY - Check systemd state on chartmuseum2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:20:47] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Clusters: (Need By: TBD) upgrade ram in an-master100[12] - https://phabricator.wikimedia.org/T259162 (10RobH) > 10:15 < robh>  :  So we have a number (at least 3) tasks for upgrading memory in existing hosts > 10:15 < robh>  :  ive just been pushing the actual upgrade t...
[17:21:05] <volans>	 mutante: as a pro-tip I usually suggest to open multiple tmux/screen and run the reimages with 2~3 minutes of delay between each other
[17:21:23] <volans>	 to avoid some race conditions that are possible, both on the icinga stuff and the certificate signing on the puppetmaster
[17:21:24] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[17:21:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:21:40] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Clusters: (Need By: TBD) upgrade ram in an-master100[12] - https://phabricator.wikimedia.org/T259162 (10RobH) a:03Cmjohnson
[17:21:45] <logmsgbot>	 !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[17:21:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:22:38] <wikibugs>	 (03CR) 10Hnowlan: api-gateway: add helmfile.d configuration (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/616467 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan)
[17:23:31] <mutante>	 volans: ack! though the issue happened when i used 2 separate wmf-auto-reimage-host in 2 separate screens.  when i used wmf-auto-reimage on 3 hosts at once i did not have that issue. but they were parse* with role(insetup) unlike wtp* with prod roles
[17:25:19] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Clusters: (Need By: TBD) upgrade ram in an-master100[12] - https://phabricator.wikimedia.org/T259162 (10elukey)
[17:27:48] <icinga-wm>	 PROBLEM - Check systemd state on chartmuseum2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:28:20] <wikibugs>	 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2008.codfw.wmnet', 'parse2007.codfw.wmnet', 'parse2009.codfw.wmnet'] `  and were **ALL** successful.
[17:29:08] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[17:32:46] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[17:41:22] <icinga-wm>	 RECOVERY - Check that envoy is running on wtp2002 is OK: OK - envoyproxy.service is active https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[17:45:46] <wikibugs>	 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp2002.codfw.wmnet'] `  and were **ALL** successful.
[17:46:18] <wikibugs>	 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ['parse2010.codfw.wmnet', 'parse2011.codfw.wmnet', 'parse2012.codfw.wmnet'] `...
[17:48:00] <icinga-wm>	 RECOVERY - puppet last run on wtp2002 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[17:48:52] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[17:49:10] <wikibugs>	 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp2003.codfw.wmnet'] `  and were **ALL** successful.
[17:50:02] <icinga-wm>	 RECOVERY - Long running screen/tmux on netbox1001 is OK: OK: No SCREEN or tmux processes detected. https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens
[17:50:20] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp2002.codfw.wmnet
[17:50:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:50:48] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp2003.codfw.wmnet
[17:50:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:51:26] <wikibugs>	 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` wtp2004.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020...
[17:52:40] <icinga-wm>	 RECOVERY - Long running screen/tmux on weblog1001 is OK: OK: No SCREEN or tmux processes detected. https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens
[17:54:22] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[17:56:35] <wikibugs>	 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` wtp2005.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020...
[17:56:39] <wikibugs>	 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp2005.codfw.wmnet'] `  Of which those **FAILED**: ` ['wtp2005.codfw.wmnet'] `
[17:57:27] <wikibugs>	 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10Dzahn) Note wtp2005 is missing because of T257903.
[17:58:05] <wikibugs>	 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` wtp2006.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020...
[17:58:10] <wikibugs>	 (03PS1) 10Dave Pifke: arclamp: restore 90 day retention [puppet] - 10https://gerrit.wikimedia.org/r/617201 (https://phabricator.wikimedia.org/T235455)
[18:00:05] <jouncebot>	 liw and brennen: I, the Bot under the Fountain, allow thee, The Deployer, to do Train log triage with CPT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200729T1800).
[18:00:05] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Morning backport window(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200729T1800).
[18:00:05] <jouncebot>	 MatmaRex and Urbanecm: A patch you scheduled for Morning backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[18:00:51] <MatmaRex>	 hi
[18:01:12] <Urbanecm>	 I'll deploy
[18:01:17] <Urbanecm>	 hi MatmaRex 
[18:01:24] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[18:01:26] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[18:01:27] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[18:01:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:01:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:01:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:02:19] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Revert new reply API for now (1.36.0-wmf.2 only) [extensions/DiscussionTools] (wmf/1.36.0-wmf.2) - 10https://gerrit.wikimedia.org/r/617169 (https://phabricator.wikimedia.org/T252558) (owner: 10Bartosz Dziewoński)
[18:02:33] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Move VisualEditor from beta to default on enwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617170 (https://phabricator.wikimedia.org/T258992) (owner: 10Bartosz Dziewoński)
[18:02:41] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Enable Translate extension at plwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617174 (https://phabricator.wikimedia.org/T259087) (owner: 10Urbanecm)
[18:03:18] <wikibugs>	 (03Merged) 10jenkins-bot: Move VisualEditor from beta to default on enwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617170 (https://phabricator.wikimedia.org/T258992) (owner: 10Bartosz Dziewoński)
[18:03:31] <wikibugs>	 (03Merged) 10jenkins-bot: Enable Translate extension at plwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617174 (https://phabricator.wikimedia.org/T259087) (owner: 10Urbanecm)
[18:03:32] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[18:03:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:03:39] <logmsgbot>	 !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[18:03:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:03:58] <Urbanecm>	 MatmaRex: config patch is ready for you to test at mwdebug1001
[18:04:15] <wikibugs>	 (03PS1) 10QChris: Bump gerrit.war to 3.2.3-1-g185bdc3a69 [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/617205
[18:05:34] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[18:05:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:05:42] <wikibugs>	 (03CR) 10QChris: "The corresponding WAR has been uploaded to our archiva already." [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/617205 (owner: 10QChris)
[18:05:43] <Urbanecm>	 !log Create tables for Translate extension in plwikimedia (T259087)
[18:05:46] <MatmaRex>	 Urbanecm: seems good
[18:05:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:05:47] <stashbot>	 T259087: Enable Translate extension on pl.wikimedia.org - https://phabricator.wikimedia.org/T259087
[18:05:52] <Urbanecm>	 thank you, syncing
[18:06:04] <wikibugs>	 (03Merged) 10jenkins-bot: Revert new reply API for now (1.36.0-wmf.2 only) [extensions/DiscussionTools] (wmf/1.36.0-wmf.2) - 10https://gerrit.wikimedia.org/r/617169 (https://phabricator.wikimedia.org/T252558) (owner: 10Bartosz Dziewoński)
[18:06:17] <MatmaRex>	 (sorry you had to wait, i was testing on the wrong mwdebug server for a while and i was very confused)
[18:06:43] <Urbanecm>	 no problem :)
[18:07:28] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized dblists/visualeditor-nondefault.dblist: a237f5b40c3662c0f08398abeeaadba61d7462f8: Move VisualEditor from beta to default on enwikiversity (T258992) (duration: 01m 06s)
[18:07:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:07:33] <stashbot>	 T258992: VisualEditor the default editor for wikiversity (english) - https://phabricator.wikimedia.org/T258992
[18:07:42] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "+1, should i just merge or you want to add reviewers?" [puppet] - 10https://gerrit.wikimedia.org/r/617201 (https://phabricator.wikimedia.org/T235455) (owner: 10Dave Pifke)
[18:09:33] <Urbanecm>	 MatmaRex: your backport is ready at mwdebug1001
[18:09:37] <Urbanecm>	 could you have a look, please?
[18:09:38] <wikibugs>	 (03CR) 10Dave Pifke: "I think this is pretty low risk, but adding Krinkle to reviewers since he's the one who reminded me to follow up on this." [puppet] - 10https://gerrit.wikimedia.org/r/617201 (https://phabricator.wikimedia.org/T235455) (owner: 10Dave Pifke)
[18:09:47] <wikibugs>	 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2011.codfw.wmnet', 'parse2012.codfw.wmnet', 'parse2010.codfw.wmnet'] `  and were **ALL** successful.
[18:10:35] <MatmaRex>	 yeah
[18:10:44] <wikibugs>	 10Operations, 10vm-requests, 10Patch-For-Review: esams,ulsfo,eqsin: one VM request each for install_servers - https://phabricator.wikimedia.org/T254157 (10Dzahn) @Volans Yea, that's right. The status is still  that creating the VM worked but installing the OS did not (T254157#6241107).  I will get back to de...
[18:11:22] <MatmaRex>	 Urbanecm: also looks good
[18:11:25] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: d54f041be6508b641eec08e25287d280374cc863: Enable Translate extension at plwikimedia (T259087) (duration: 01m 08s)
[18:11:28] <Urbanecm>	 thank you, syncing
[18:11:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:11:31] <stashbot>	 T259087: Enable Translate extension on pl.wikimedia.org - https://phabricator.wikimedia.org/T259087
[18:12:19] <wikibugs>	 10Operations, 10Mail, 10OTRS, 10Trust-and-Safety, and 2 others: Forward emails addressed to privacy@wikidata to privacy@wikimedia - https://phabricator.wikimedia.org/T255733 (10Dzahn) 05Stalled→03Open
[18:12:43] <icinga-wm>	 RECOVERY - Check systemd state on chartmuseum2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:13:14] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.2/extensions/DiscussionTools/: 00ecec80d12a34977d55dd09bce0c5a1aab369f9: Revert new reply API for now (T252558) (duration: 01m 06s)
[18:13:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:13:19] <stashbot>	 T252558: Create a low bandwidth reply API using parser.php/modifier.php - https://phabricator.wikimedia.org/T252558
[18:13:20] <Urbanecm>	 MatmaRex: should be all done!
[18:13:28] <wikibugs>	 (03PS1) 10QChris: Bring back jsonevent-layout library [software/gerrit] (wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/617206
[18:13:37] <MatmaRex>	 thanks Urbanecm
[18:13:41] <Urbanecm>	 happy to help!
[18:13:46] <Urbanecm>	 !log Morning B&C window is done
[18:13:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:13:51] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Bring back jsonevent-layout library [software/gerrit] (wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/617206 (owner: 10QChris)
[18:15:47] <wikibugs>	 (03CR) 10QChris: "We need this change to enable json logging." [software/gerrit] (wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/617206 (owner: 10QChris)
[18:16:47] <wikibugs>	 (03CR) 10QChris: [V: 03+1 C: 03+1] Bring back jsonevent-layout library [software/gerrit] (wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/617206 (owner: 10QChris)
[18:17:08] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Bring back jsonevent-layout library [software/gerrit] (wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/617206 (owner: 10QChris)
[18:20:17] <wikibugs>	 (03CR) 10QChris: "Just in case searching no longer finds the commit at some point in the future. This is" [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/617205 (owner: 10QChris)
[18:20:20] <icinga-wm>	 PROBLEM - Check systemd state on chartmuseum2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:20:24] <wikibugs>	 (03CR) 10Krinkle: [C: 03+1] arclamp: restore 90 day retention [puppet] - 10https://gerrit.wikimedia.org/r/617201 (https://phabricator.wikimedia.org/T235455) (owner: 10Dave Pifke)
[18:20:48] <wikibugs>	 (03CR) 10Krinkle: [C: 03+1] "Good to go" [puppet] - 10https://gerrit.wikimedia.org/r/617201 (https://phabricator.wikimedia.org/T235455) (owner: 10Dave Pifke)
[18:21:43] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] arclamp: restore 90 day retention [puppet] - 10https://gerrit.wikimedia.org/r/617201 (https://phabricator.wikimedia.org/T235455) (owner: 10Dave Pifke)
[18:26:44] <wikibugs>	 (03CR) 10Herron: [C: 03+1] provision loki on grafana-next [puppet] - 10https://gerrit.wikimedia.org/r/616851 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite)
[18:27:51] <wikibugs>	 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ['parse2013.codfw.wmnet', 'parse2014.codfw.wmnet', 'parse2015.codfw.wmnet'] `...
[18:32:26] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[18:32:26] <logmsgbot>	 !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[18:32:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:32:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:36:05] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[18:36:06] <logmsgbot>	 !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[18:36:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:36:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:36:14] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[18:36:15] <logmsgbot>	 !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[18:36:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:36:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:38:42] <wikibugs>	 10Operations, 10vm-requests, 10Patch-For-Review: esams,ulsfo,eqsin: one VM request each for install_servers - https://phabricator.wikimedia.org/T254157 (10Volans) Not a specific issue for me, came up as inconsistency in some cross checks for Netbox automation. Up to you.
[18:39:18] <wikibugs>	 10Operations, 10SRE-tools: Exception raised while executing cookbook sre.hosts.downtime - https://phabricator.wikimedia.org/T259158 (10Dzahn) Ah, ok!  It happened again when running wtp2004 (separate screen window, separate script).  `spicerack.remote.RemoteError: No hosts provided `  Then i manually used `sre...
[18:39:30] <wikibugs>	 (03CR) 10QChris: "The relevant bug on Phabricator is https://phabricator.wikimedia.org/T259135" [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/617205 (owner: 10QChris)
[18:39:51] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[18:39:51] <logmsgbot>	 !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[18:39:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:39:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:40:30] <wikibugs>	 (03CR) 10QChris: [V: 03+1 C: 03+1] "The relevant task in Phabricator is https://phabricator.wikimedia.org/T259135" [software/gerrit] (wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/617206 (owner: 10QChris)
[18:40:58] <wikibugs>	 (03PS2) 10QChris: Bump gerrit.war to 3.2.3-1-g185bdc3a69 [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/617205 (https://phabricator.wikimedia.org/T259135)
[18:42:52] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[18:42:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:42:56] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[18:43:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:44:01] <qchris>	 Urbanecm: I'd need to do a quick gerrit upgrade. Did I read your message from 30 minutes ago correctly that the Morning backport window is done and updating gerrit would not get in your way?
[18:44:22] <Urbanecm>	 qchris: yes, it is all yours now :)
[18:44:27] <qchris>	 Cool beans.
[18:44:30] <qchris>	 Thanks.
[18:44:58] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[18:45:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:45:56] <qchris>	 I have no clue what "Train log triage with CPT" (which should also be happening now) is. I don't see any logs about it in the channel. I don't want to ping them, as they might be doing $REALLY_IMPORTANT_STUFF. Do you know by chance what they are up to?
[18:45:59] <qchris>	 Urbanecm: ^
[18:46:42] <Urbanecm>	 qchris: I **assume** they are looking at logs (in logstash), and discussing them, but that is a pure guess.
[18:47:07] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[18:47:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:47:17] <qchris>	 Ok. Thanks.
[18:48:48] <icinga-wm>	 PROBLEM - Check that envoy is running on wtp2004 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.46: Connection reset by peer https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[18:49:52] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on wtp2006 is CRITICAL: Host wtp2006 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[18:49:52] <icinga-wm>	 PROBLEM - puppet last run on wtp2004 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.46: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[18:50:14] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on wtp2003 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[18:50:20] <icinga-wm>	 RECOVERY - Check systemd state on chartmuseum2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:50:32] <wikibugs>	 (03PS1) 10Herron: dns: add forward/reverse records for kafkamon[12]002 [dns] - 10https://gerrit.wikimedia.org/r/617211 (https://phabricator.wikimedia.org/T257561)
[18:52:06] <icinga-wm>	 PROBLEM - Apache HTTP on wtp2006 is CRITICAL: connect to address 10.192.16.48 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers
[18:52:16] <icinga-wm>	 PROBLEM - nutcracker process on wtp2006 is CRITICAL: connect to address 10.192.16.48 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Nutcracker
[18:52:16] <icinga-wm>	 PROBLEM - Check the NTP synchronisation status of timesyncd on wtp2004 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.46: Connection reset by peer https://wikitech.wikimedia.org/wiki/NTP
[18:52:34] <wikibugs>	 (03CR) 10Thcipriani: [C: 03+1] Bump gerrit.war to 3.2.3-1-g185bdc3a69 [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/617205 (https://phabricator.wikimedia.org/T259135) (owner: 10QChris)
[18:53:22] <icinga-wm>	 PROBLEM - Check the last execution of php7.2-fpm_check_restart on wtp2004 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.46: Connection reset by peer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:53:26] <wikibugs>	 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2013.codfw.wmnet', 'parse2014.codfw.wmnet'] `  Of which those **FAILED**: ` ['parse2015.codfw.wmnet'] `
[18:53:48] <qchris>	 br-ennen let me know that it's ok to restart Gerrit. So I'll prepare the upgrade.
[18:54:00] <brennen>	 thanks for checking in
[18:54:02] <icinga-wm>	 PROBLEM - Check systemd state on chartmuseum2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:54:40] <icinga-wm>	 PROBLEM - PHP opcache health on wtp2004 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.46: Connection reset by peer https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[18:54:40] <icinga-wm>	 PROBLEM - nutcracker socket on wtp2006 is CRITICAL: connect to address 10.192.16.48 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Nutcracker
[18:56:38] <icinga-wm>	 PROBLEM - PHP7 rendering on wtp2004 is CRITICAL: connect to address 10.192.16.46 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[18:56:54] <icinga-wm>	 PROBLEM - parsoid on wtp2006 is CRITICAL: connect to address 10.192.16.48 and port 8000: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/parsoid
[18:56:54] <icinga-wm>	 PROBLEM - Check size of conntrack table on wtp2006 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.48: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[18:56:54] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on wtp2004 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.46: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[18:57:30] <wikibugs>	 (03CR) 10QChris: [V: 03+2 C: 03+2] Bump gerrit.war to 3.2.3-1-g185bdc3a69 [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/617205 (https://phabricator.wikimedia.org/T259135) (owner: 10QChris)
[18:57:32] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on wtp2006 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.48: Connection reset by peer https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[18:58:18] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on wtp2006 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.48: Connection reset by peer https://wikitech.wikimedia.org/wiki/Application_servers
[18:59:31] <qchris>	 Just a heads up that I'll restart Gerrit shortly to deploy a security fix https://phabricator.wikimedia.org/T259135
[18:59:57] <mutante>	 wtp* hosts^ downtimed. that's an issue with setting the downtime during reimage
[19:00:04] <jouncebot>	 liw and brennen: That opportune time is upon us again. Time for a Mediawiki train - European+American Version (secondary timeslot) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200729T1900).
[19:00:15] <logmsgbot>	 !log qchris@deploy1001 Started deploy [gerrit/gerrit@9275b30]: Gerrit to v3.2.3-1-g185bdc3a69 on gerrit1001
[19:00:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:00:23] <logmsgbot>	 !log qchris@deploy1001 Finished deploy [gerrit/gerrit@9275b30]: Gerrit to v3.2.3-1-g185bdc3a69 on gerrit1001 (duration: 00m 08s)
[19:00:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:00:43] <qchris>	 !log Restarting Gerrit on gerrit1001 to make security fix effective.
[19:00:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:00:48] <brennen>	 a train status update: currently blocked, we're not presently doing anything in the current window.
[19:00:52] <icinga-wm>	 RECOVERY - Check systemd state on chartmuseum2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:02:01] <wikibugs>	 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ['parse2016.codfw.wmnet', 'parse2017.codfw.wmnet', 'parse2018.codfw.wmnet'] `...
[19:02:52] <icinga-wm>	 ACKNOWLEDGEMENT - Gerrit Health Check on gerrit.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1529 bytes in 0.008 second response time daniel_zahn restart for config change by chris https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[19:03:06] <qchris>	 Thanks mutante ^
[19:03:34] <mutante>	 yw,np
[19:03:47] <icinga-wm>	 ACKNOWLEDGEMENT - SSH access on gerrit1001 is CRITICAL: connect to address 208.80.154.137 and port 29418: Connection refused daniel_zahn restarting https://wikitech.wikimedia.org/wiki/Gerrit
[19:03:54] <logmsgbot>	 !log qchris@deploy1001 Started deploy [gerrit/gerrit@9275b30]: Gerrit to v3.2.3-1-g185bdc3a69 on gerrit2001
[19:03:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:04:04] <logmsgbot>	 !log qchris@deploy1001 Finished deploy [gerrit/gerrit@9275b30]: Gerrit to v3.2.3-1-g185bdc3a69 on gerrit2001 (duration: 00m 09s)
[19:04:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:04:31] <qchris>	 !log Restarting Gerrit on gerrit2001 (gerrit-replica) to make security fix effective.
[19:04:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:05:28] <icinga-wm>	 PROBLEM - Check systemd state on chartmuseum2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:07:02] <icinga-wm>	 RECOVERY - Check systemd state on chartmuseum2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:08:32] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "looks good" [dns] - 10https://gerrit.wikimedia.org/r/617211 (https://phabricator.wikimedia.org/T257561) (owner: 10Herron)
[19:08:45] <wikibugs>	 (03CR) 10Herron: [C: 03+2] dns: add forward/reverse records for kafkamon[12]002 [dns] - 10https://gerrit.wikimedia.org/r/617211 (https://phabricator.wikimedia.org/T257561) (owner: 10Herron)
[19:12:12] <icinga-wm>	 PROBLEM - Check systemd state on chartmuseum2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:16:06] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[19:16:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:17:04] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[19:17:04] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[19:17:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:17:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:18:35] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[19:18:37] <logmsgbot>	 !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[19:18:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:18:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:19:37] <logmsgbot>	 !log herron@cumin1001 START - Cookbook sre.ganeti.makevm
[19:19:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:20:00] <logmsgbot>	 !log herron@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99)
[19:20:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:20:37] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[19:20:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:23:35] <icinga-wm>	 RECOVERY - Check systemd state on chartmuseum2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:25:42] <wikibugs>	 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2018.codfw.wmnet', 'parse2016.codfw.wmnet', 'parse2017.codfw.wmnet'] `  and were **ALL** successful.
[19:26:35] <wikibugs>	 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ['parse2019.codfw.wmnet', 'parse2020.codfw.wmnet'] ` The log can be found in...
[19:26:43] <icinga-wm>	 PROBLEM - Check systemd state on chartmuseum2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:27:32] <logmsgbot>	 !log herron@cumin1001 START - Cookbook sre.ganeti.makevm
[19:27:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:29:37] <logmsgbot>	 !log herron@cumin1001 START - Cookbook sre.ganeti.makevm
[19:29:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:33:20] <wikibugs>	 (03PS1) 10Cwhite: debianization [debs/prometheus-es-exporter] (debian/sid) - 10https://gerrit.wikimedia.org/r/617250 (https://phabricator.wikimedia.org/T222826)
[19:36:29] <wikibugs>	 (03PS6) 10Cwhite: provision loki on grafana-next [puppet] - 10https://gerrit.wikimedia.org/r/616851 (https://phabricator.wikimedia.org/T222826)
[19:40:05] <icinga-wm>	 RECOVERY - Check systemd state on chartmuseum2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:41:37] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[19:41:39] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[19:41:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:41:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:41:48] <logmsgbot>	 !log herron@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
[19:41:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:43:44] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[19:43:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:44:01] <logmsgbot>	 !log herron@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
[19:44:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:44:10] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:44:54] <icinga-wm>	 PROBLEM - Check systemd state on chartmuseum2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:45:10] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:45:42] <wikibugs>	 (03CR) 10QChris: [V: 03+2 C: 03+2] "Since it's deployed already, I'll self-merge this here." [software/gerrit] (wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/617206 (owner: 10QChris)
[19:45:48] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[19:45:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:52:09] <wikibugs>	 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2019.codfw.wmnet', 'parse2020.codfw.wmnet'] `  and were **ALL** successful.
[19:53:27] <wikibugs>	 (03PS1) 10Herron: dhcp: add records for kafkamon[12]002 [puppet] - 10https://gerrit.wikimedia.org/r/617254 (https://phabricator.wikimedia.org/T257561)
[19:54:31] <wikibugs>	 (03PS5) 10Jdlrobson: Switch test wikis to new version of vector by default (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614891 (https://phabricator.wikimedia.org/T254227)
[19:57:45] <wikibugs>	 (03PS2) 10Herron: dhcp: add records for kafkamon[12]002 [puppet] - 10https://gerrit.wikimedia.org/r/617254 (https://phabricator.wikimedia.org/T257561)
[20:00:04] <jouncebot>	 halfak and accraze: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Graphoid / ORES . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200729T2000).
[20:00:16] <wikibugs>	 (03CR) 10Herron: [C: 03+2] dhcp: add records for kafkamon[12]002 [puppet] - 10https://gerrit.wikimedia.org/r/617254 (https://phabricator.wikimedia.org/T257561) (owner: 10Herron)
[20:02:36] <icinga-wm>	 RECOVERY - PHP7 rendering on wtp2004 is OK: HTTP OK: HTTP/1.1 302 Found - 645 bytes in 0.582 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:05:46] <icinga-wm>	 RECOVERY - PHP opcache health on wtp2004 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[20:05:52] <wikibugs>	 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp2004.codfw.wmnet'] `  and were **ALL** successful.
[20:05:58] <icinga-wm>	 RECOVERY - Check size of conntrack table on wtp2006 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[20:06:12] <icinga-wm>	 RECOVERY - Check the last execution of php7.2-fpm_check_restart on wtp2004 is OK: OK: Status of the systemd unit php7.2-fpm_check_restart https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[20:08:14] <icinga-wm>	 PROBLEM - Host wtp2006 is DOWN: PING CRITICAL - Packet loss = 100%
[20:09:54] <icinga-wm>	 RECOVERY - puppet last run on wtp2004 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[20:10:22] <icinga-wm>	 RECOVERY - Host wtp2006 is UP: PING OK - Packet loss = 0%, RTA = 36.26 ms
[20:10:41] <icinga-wm>	 RECOVERY - nutcracker process on wtp2006 is OK: PROCS OK: 1 process with UID = 112 (nutcracker), command name nutcracker https://wikitech.wikimedia.org/wiki/Nutcracker
[20:11:18] <icinga-wm>	 RECOVERY - Apache HTTP on wtp2006 is OK: HTTP OK: HTTP/1.1 302 Found - 630 bytes in 0.158 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:12:51] <icinga-wm>	 RECOVERY - Check that envoy is running on wtp2004 is OK: OK - envoyproxy.service is active https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[20:12:51] <icinga-wm>	 RECOVERY - nutcracker socket on wtp2006 is OK: TCP OK - 0.000 second response time on socket /var/run/nutcracker/redis_codfw.sock https://wikitech.wikimedia.org/wiki/Nutcracker
[20:14:58] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on wtp2006 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[20:16:19] <wikibugs>	 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10Dzahn) all parse2* hosts done:   ` [cumin1001:~] $ sudo cumin parse2* 'df -h | grep srv | cut -d " " -f12' .. 20 hosts will be targeted:  1%...
[20:17:30] <icinga-wm>	 RECOVERY - Check no envoy runtime configuration is left persistent on wtp2006 is OK: HTTP OK: HTTP/1.1 200 OK - 286 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[20:17:38] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on wtp2004 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[20:17:38] <icinga-wm>	 RECOVERY - Check the NTP synchronisation status of timesyncd on wtp2004 is OK: OK: synced at Wed 2020-07-29 20:17:36 UTC. https://wikitech.wikimedia.org/wiki/NTP
[20:19:17] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp2004.codfw.wmnet
[20:19:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:19:49] <wikibugs>	 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` wtp2007.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020...
[20:24:36] <wikibugs>	 (03PS1) 10Cwhite: prometheus: puppetized install of prometheus-es-exporter [puppet] - 10https://gerrit.wikimedia.org/r/617260 (https://phabricator.wikimedia.org/T256418)
[20:25:50] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] prometheus: puppetized install of prometheus-es-exporter [puppet] - 10https://gerrit.wikimedia.org/r/617260 (https://phabricator.wikimedia.org/T256418) (owner: 10Cwhite)
[20:25:59] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "the members of this group have been promoted to roots but i am merging this anyways, there were no concerns and maybe the admins group wil" [puppet] - 10https://gerrit.wikimedia.org/r/615818 (https://phabricator.wikimedia.org/T258739) (owner: 10Dzahn)
[20:26:06] <wikibugs>	 (03PS2) 10Cwhite: prometheus: puppetized install of prometheus-es-exporter [puppet] - 10https://gerrit.wikimedia.org/r/617260 (https://phabricator.wikimedia.org/T256418)
[20:26:28] <wikibugs>	 (03PS4) 10Dzahn: admins: let wdqs-admins view nginx logs [puppet] - 10https://gerrit.wikimedia.org/r/615818 (https://phabricator.wikimedia.org/T258739)
[20:27:22] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] prometheus: puppetized install of prometheus-es-exporter [puppet] - 10https://gerrit.wikimedia.org/r/617260 (https://phabricator.wikimedia.org/T256418) (owner: 10Cwhite)
[20:28:48] <wikibugs>	 (03PS3) 10Cwhite: prometheus: puppetized install of prometheus-es-exporter [puppet] - 10https://gerrit.wikimedia.org/r/617260 (https://phabricator.wikimedia.org/T256418)
[20:30:59] <wikibugs>	 (03PS2) 10Dzahn: admins: let wdqs-admins run jstack as root [puppet] - 10https://gerrit.wikimedia.org/r/615821 (https://phabricator.wikimedia.org/T258739)
[20:31:51] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] admins: let wdqs-admins run jstack as root [puppet] - 10https://gerrit.wikimedia.org/r/615821 (https://phabricator.wikimedia.org/T258739) (owner: 10Dzahn)
[20:32:04] <wikibugs>	 (03CR) 10Dzahn: admins: let wdqs-admins run jstack as root (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/615821 (https://phabricator.wikimedia.org/T258739) (owner: 10Dzahn)
[20:32:26] <icinga-wm>	 RECOVERY - parsoid on wtp2006 is OK: HTTP OK: HTTP/1.1 200 OK - 1022 bytes in 0.171 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/parsoid
[20:32:37] <wikibugs>	 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp2006.codfw.wmnet'] `  and were **ALL** successful.
[20:34:12] <logmsgbot>	 !log crusnov@deploy1001 Started deploy [netbox/deploy@fde9dfe]: Test deploy of 2.8.8 to netbox-next
[20:34:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:34:44] <wikibugs>	 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` wtp2008.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020...
[20:35:10] <wikibugs>	 (03PS3) 10Dzahn: admins: let wdqs-admins run jstack as root [puppet] - 10https://gerrit.wikimedia.org/r/615821 (https://phabricator.wikimedia.org/T258739)
[20:35:24] <logmsgbot>	 !log crusnov@deploy1001 Finished deploy [netbox/deploy@fde9dfe]: Test deploy of 2.8.8 to netbox-next (duration: 01m 12s)
[20:35:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:35:29] <logmsgbot>	 !log crusnov@deploy1001 Started deploy [netbox/deploy@fde9dfe]: Test deploy of 2.8.8 to netbox-next pt2
[20:35:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:35:34] <logmsgbot>	 !log crusnov@deploy1001 Finished deploy [netbox/deploy@fde9dfe]: Test deploy of 2.8.8 to netbox-next pt2 (duration: 00m 05s)
[20:35:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:36:48] <wikibugs>	 (03PS4) 10Dzahn: admins: let wdqs-admins run jstack as root [puppet] - 10https://gerrit.wikimedia.org/r/615821 (https://phabricator.wikimedia.org/T258739)
[20:38:50] <icinga-wm>	 RECOVERY - Check systemd state on chartmuseum2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:43:18] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "the members of this group have been promoted to roots but merging this anyways. maybe it will be used again in the future." [puppet] - 10https://gerrit.wikimedia.org/r/615821 (https://phabricator.wikimedia.org/T258739) (owner: 10Dzahn)
[20:43:50] <icinga-wm>	 PROBLEM - Check systemd state on chartmuseum2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:55:38] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:56:39] <wikibugs>	 (03PS1) 10Herron: assign kafkamon[12]002 role::insetup [puppet] - 10https://gerrit.wikimedia.org/r/617262 (https://phabricator.wikimedia.org/T257561)
[20:58:10] <wikibugs>	 (03CR) 10Herron: [C: 03+2] assign kafkamon[12]002 role::insetup [puppet] - 10https://gerrit.wikimedia.org/r/617262 (https://phabricator.wikimedia.org/T257561) (owner: 10Herron)
[21:00:41] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[21:00:41] <logmsgbot>	 !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[21:00:45] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:00:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:00:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:09:40] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:15:26] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[21:15:27] <logmsgbot>	 !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[21:15:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:15:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:21:16] <icinga-wm>	 PROBLEM - PHP7 rendering on wtp2007 is CRITICAL: connect to address 10.192.16.49 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[21:21:18] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on wtp2007 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.49: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[21:22:00] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[21:22:01] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[21:22:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:22:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:24:32] <wikibugs>	 (03CR) 10Ppchelko: [C: 04-2] "After discussion with Eric, I'm changing my mind. See ticket for the reasoning." [deployment-charts] - 10https://gerrit.wikimedia.org/r/613650 (https://phabricator.wikimedia.org/T256769) (owner: 10Hnowlan)
[21:24:56] <icinga-wm>	 RECOVERY - Check systemd state on chartmuseum2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:28:44] <icinga-wm>	 PROBLEM - Check systemd state on chartmuseum2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:32:36] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:40:22] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:51:54] <icinga-wm>	 PROBLEM - dhclient process on wtp2008 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.50: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[21:58:48] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on wtp2008 is CRITICAL: Host wtp2008 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[22:00:43] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[22:00:43] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[22:00:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:00:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:01:22] <wikibugs>	 (03PS1) 10Ahmon Dancy: Add mtail program for monitoring the Zuul error log [puppet] - 10https://gerrit.wikimedia.org/r/617271 (https://phabricator.wikimedia.org/T258821)
[22:03:03] <wikibugs>	 (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/616110 (https://phabricator.wikimedia.org/T251497) (owner: 10ZPapierski)
[22:27:58] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s4 on db1145 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1658.31 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:29:50] <icinga-wm>	 RECOVERY - PHP7 rendering on wtp2007 is OK: HTTP OK: HTTP/1.1 302 Found - 645 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[22:33:55] <wikibugs>	 Operations, serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp2007.codfw.wmnet'] `  and were **ALL** successful.
[22:39:24] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on wtp2007 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[22:39:32] <icinga-wm>	 RECOVERY - dhclient process on wtp2008 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[22:45:24] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp2007.codfw.wmnet
[22:45:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:46:44] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp2006.codfw.wmnet
[22:46:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:48:36] <wikibugs>	 Operations, serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` wtp2009.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020...
[23:00:04] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: (Dis)respected human, time to deploy Evening backport window(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200729T2300). Please do the needful.
[23:03:51] <wikibugs>	 (PS1) Urbanecm: Add several extra namespaces for mswiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/617280 (https://phabricator.wikimedia.org/T255391)
[23:04:46] <wikibugs>	 (CR) Urbanecm: [C: +2] Add several extra namespaces for mswiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/617280 (https://phabricator.wikimedia.org/T255391) (owner: Urbanecm)
[23:05:35] <wikibugs>	 (Merged) jenkins-bot: Add several extra namespaces for mswiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/617280 (https://phabricator.wikimedia.org/T255391) (owner: Urbanecm)
[23:07:16] <icinga-wm>	 RECOVERY - Check systemd state on chartmuseum2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:07:22] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 396a395c79c606cb7deeb7906fefc7f16e63fa4f: Add several extra namespaces for mswiktionary (T255391) (duration: 01m 07s)
[23:07:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:07:29] <stashbot>	 T255391: Create new namespaces on the Malay Wiktionary - https://phabricator.wikimedia.org/T255391
[23:09:07] <Urbanecm>	 !log Run mwscript namespaceDupes.php --wiki=mswiktionary --fix (T255391)
[23:09:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:12:02] <icinga-wm>	 PROBLEM - Check systemd state on chartmuseum2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:12:03] <wikibugs>	 (PS1) Urbanecm: Search Work NS by default at bnwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/617284 (https://phabricator.wikimedia.org/T258982)
[23:18:43] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on wtp2006 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[23:22:29] <wikibugs>	 Operations, serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp2008.codfw.wmnet'] `  and were **ALL** successful.
[23:28:54] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[23:28:54] <logmsgbot>	 !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[23:28:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:29:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:29:35] <wikibugs>	 (PS1) BryanDavis: jessie-ssd: Fetch base image from docker-registry.tools.wmflabs.org [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/617288
[23:37:33] <wikibugs>	 (CR) Alex Monk: "you'll also want to add it to the safelist in api.py" [software/acme-chief] - https://gerrit.wikimedia.org/r/617176 (https://phabricator.wikimedia.org/T255249) (owner: BryanDavis)
[23:39:09] <wikibugs>	 (CR) Alex Monk: [C: +2] "oh it's in the other PR, okay" [software/acme-chief] - https://gerrit.wikimedia.org/r/617176 (https://phabricator.wikimedia.org/T255249) (owner: BryanDavis)
[23:39:54] <wikibugs>	 (CR) Alex Monk: [C: +2] api: Allow acme-chief clients to fetch .chained.crt.key files [software/acme-chief] - https://gerrit.wikimedia.org/r/617177 (https://phabricator.wikimedia.org/T255249) (owner: BryanDavis)
[23:41:25] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp2008.codfw.wmnet
[23:41:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:41:52] <wikibugs>	 Operations, serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` wtp2010.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020...
[23:49:43] <wikibugs>	 Operations, MediaWiki-extensions-Score, Security-Team, Wikimedia-General-or-Unknown, and 3 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (tstarling) Open→Resolved a:tstarling
[23:51:22] <icinga-wm>	 PROBLEM - parsoid on wtp2009 is CRITICAL: connect to address 10.192.16.51 and port 8000: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/parsoid
[23:51:22] <icinga-wm>	 PROBLEM - Check size of conntrack table on wtp2009 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.51: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[23:51:39] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[23:51:40] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[23:51:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:51:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:52:14] <icinga-wm>	 ACKNOWLEDGEMENT - Check size of conntrack table on wtp2009 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.51: Connection reset by peer daniel_zahn reinstall https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[23:52:14] <icinga-wm>	 ACKNOWLEDGEMENT - Long running screen/tmux on wtp2009 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.51: Connection reset by peer daniel_zahn reinstall https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens