[00:02:20] <icinga-wm>	 RECOVERY - dump of matomo in eqiad on icinga1001 is OK: Last dump for matomo at eqiad (db1108.eqiad.wmnet:3351) taken on 2020-08-11 00:00:01 (0 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[00:05:05] <wikibugs>	 (PS1) Dzahn: Revert "admins: set http_proxy for myself, dzahn" [puppet] - https://gerrit.wikimedia.org/r/619367
[00:06:33] <wikibugs>	 (PS3) Dave Pifke: arclamp: configurable email address for cron jobs [puppet] - https://gerrit.wikimedia.org/r/619395
[00:08:22] <mutante>	 !log releases-jenkins.wikimedia.org currently under maintenance (T247652)
[00:08:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:08:26] <stashbot>	 T247652: replace backends for releases.wikimedia.org with buster VMs - https://phabricator.wikimedia.org/T247652
[00:08:44] <wikibugs>	 (CR) Dzahn: [C: +2] Revert "admins: set http_proxy for myself, dzahn" [puppet] - https://gerrit.wikimedia.org/r/619367 (owner: Dzahn)
[00:09:03] <wikibugs>	 (CR) Dave Pifke: "> ah, nice. but you don't need to go one step further and add a lookup() in the parameter and put it in Hiera for labs?" [puppet] - https://gerrit.wikimedia.org/r/619395 (owner: Dave Pifke)
[00:10:01] <wikibugs>	 (CR) Dzahn: [C: +2] arclamp: configurable email address for cron jobs [puppet] - https://gerrit.wikimedia.org/r/619395 (owner: Dave Pifke)
[00:13:17] <wikibugs>	 (CR) Dzahn: "webperf1002: noop" [puppet] - https://gerrit.wikimedia.org/r/619395 (owner: Dave Pifke)
[00:20:28] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:21:28] <wikibugs>	 (PS1) Dzahn: Revert "switch releases.wikimedia.org to buster backends" [dns] - https://gerrit.wikimedia.org/r/619368
[00:22:21] <wikibugs>	 (CR) Dzahn: [C: +2] Revert "switch releases.wikimedia.org to buster backends" [dns] - https://gerrit.wikimedia.org/r/619368 (owner: Dzahn)
[00:24:10] <mutante>	 !log reverting switch of releases.wikimedia.org for today since releases-jenkins.wikimedia.org is tied to it and new jenkins still needs some config and plugins (T247652)
[00:24:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:24:13] <stashbot>	 T247652: replace backends for releases.wikimedia.org with buster VMs - https://phabricator.wikimedia.org/T247652
[00:25:29] <wikibugs>	 Operations, Performance-Team, Traffic, Wikimedia-Incident: 15% response start regression as of 2019-11-11 (Varnish->ATS) - https://phabricator.wikimedia.org/T238494 (Krinkle)
[00:26:22] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:31:59] <logmsgbot>	 !log dpifke@deploy1001 Started deploy [performance/arc-lamp@fc5f1c6]: Deploying latest attempt to fix T259167
[00:32:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:32:02] <stashbot>	 T259167: Truncated ArcLamp output files - https://phabricator.wikimedia.org/T259167
[00:33:02] <logmsgbot>	 !log dpifke@deploy1001 Finished deploy [performance/arc-lamp@fc5f1c6]: Deploying latest attempt to fix T259167 (duration: 01m 03s)
[00:33:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:33:26] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:35:18] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 49835896 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:37:18] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 118752 and 100 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:40:10] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:41:18] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:46:08] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:46:57] <wikibugs>	 (CR) BPirkle: [C: +1] "Looks good, approved for self-merge and deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/618646 (https://phabricator.wikimedia.org/T250248) (owner: Tim Starling)
[00:50:04] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:50:44] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:01:53] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[01:09:46] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[01:17:38] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[01:25:30] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[01:32:36] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:42:24] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:55:50] <wikibugs>	 (CR) Tim Starling: [C: +2] Enable fastStale mode on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/618646 (https://phabricator.wikimedia.org/T250248) (owner: Tim Starling)
[01:56:42] <wikibugs>	 (Merged) jenkins-bot: Enable fastStale mode on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/618646 (https://phabricator.wikimedia.org/T250248) (owner: Tim Starling)
[01:59:52] <logmsgbot>	 !log tstarling@deploy1001 Synchronized wmf-config/PoolCounterSettings.php: enabling fast stale mode T250248 (duration: 00m 58s)
[01:59:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:59:55] <stashbot>	 T250248: Fast stale ParserCache responses on PoolCounter contention - https://phabricator.wikimedia.org/T250248
[02:05:41] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.36.0-wmf.4 [core] (wmf/1.36.0-wmf.4) - https://gerrit.wikimedia.org/r/619397
[02:09:23] <wikibugs>	 (PS2) DannyS712: Branch commit for wmf/1.36.0-wmf.4 [core] (wmf/1.36.0-wmf.4) - https://gerrit.wikimedia.org/r/619397 (https://phabricator.wikimedia.org/T257972) (owner: TrainBranchBot)
[02:28:38] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[02:33:36] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:40:22] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[02:41:28] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:52:16] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:00:06] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:19:36] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:25:28] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:32:23] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:37:16] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:42:12] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:47:10] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:53:08] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:14:48] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:24:40] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:32:38] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:33:36] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:41:26] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:44:22] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:50:20] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:02:18] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:06:16] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:21:18] <wikibugs>	 (PS1) Marostegui: install_sever: Do not reimage dbproxy1019 [puppet] - https://gerrit.wikimedia.org/r/619400
[05:21:43] <wikibugs>	 (CR) Marostegui: [C: +2] install_sever: Do not reimage dbproxy1019 [puppet] - https://gerrit.wikimedia.org/r/619400 (owner: Marostegui)
[05:32:56] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:38:45] <_joe_>	 marostegui: have you looked at the fatals?
[05:39:07] <_joe_>	 they've been ongoing for hours and hours
[05:39:57] <_joe_>	 not completely sure why tbh
[05:40:49] <marostegui>	 _joe_: for days even
[05:40:52] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:41:01] <_joe_>	 not with this frequency
[05:41:39] <_joe_>	 and most of the times, it's just due to a bug in parsoid that's known
[05:42:15] <_joe_>	 I'm not sure, though, of the numbers I see in that dashboard
[05:42:29] <_joe_>	 how they're calculated, given logstash has a different picture
[05:50:16] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=webperf_arclamp site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:54:12] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:15:38] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[06:23:34] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[06:24:02] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:27:30] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[06:32:18] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:33:20] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[06:37:18] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[06:40:20] <wikibugs>	 (CR) JMeybohm: [C: +2] releases: Remove deployment-charts repo [puppet] - https://gerrit.wikimedia.org/r/618352 (https://phabricator.wikimedia.org/T253843) (owner: JMeybohm)
[06:42:06] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:45:00] <wikibugs>	 (PS1) Ayounsi: Re-add netbox_driven_interfaces feature flag [homer/public] - https://gerrit.wikimedia.org/r/619431
[06:45:02] <wikibugs>	 (PS1) Ayounsi: Re-prioritize peering over transit eqiad/esams [homer/public] - https://gerrit.wikimedia.org/r/619432 (https://phabricator.wikimedia.org/T259614)
[06:45:04] <wikibugs>	 (PS1) JMeybohm: Remove helm repo (index.yaml and chart tars) from git [deployment-charts] - https://gerrit.wikimedia.org/r/619433 (https://phabricator.wikimedia.org/T253843)
[06:45:06] <wikibugs>	 (CR) jerkins-bot: [V: -1] Remove helm repo (index.yaml and chart tars) from git [deployment-charts] - https://gerrit.wikimedia.org/r/619433 (https://phabricator.wikimedia.org/T253843) (owner: JMeybohm)
[06:45:34] <XioNoX>	 !log Re-prioritize peering over transit eqiad/esams - T259614
[06:45:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:45:38] <stashbot>	 T259614: Re-prioritize peering over transit - https://phabricator.wikimedia.org/T259614
[06:47:32] <wikibugs>	 (CR) Ayounsi: [C: +2] Re-add netbox_driven_interfaces feature flag [homer/public] - https://gerrit.wikimedia.org/r/619431 (owner: Ayounsi)
[06:47:37] <wikibugs>	 (CR) Ayounsi: [C: +2] Re-prioritize peering over transit eqiad/esams [homer/public] - https://gerrit.wikimedia.org/r/619432 (https://phabricator.wikimedia.org/T259614) (owner: Ayounsi)
[06:47:55] <wikibugs>	 (Merged) jenkins-bot: Re-add netbox_driven_interfaces feature flag [homer/public] - https://gerrit.wikimedia.org/r/619431 (owner: Ayounsi)
[06:48:01] <wikibugs>	 (Merged) jenkins-bot: Re-prioritize peering over transit eqiad/esams [homer/public] - https://gerrit.wikimedia.org/r/619432 (https://phabricator.wikimedia.org/T259614) (owner: Ayounsi)
[06:50:01] <wikibugs>	 (PS2) JMeybohm: Remove helm repo (index.yaml and chart tars) from git [deployment-charts] - https://gerrit.wikimedia.org/r/619433 (https://phabricator.wikimedia.org/T253843)
[06:54:16] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.prepare-upgrade
[06:54:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:32] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] "Burn with fire" [deployment-charts] - https://gerrit.wikimedia.org/r/619433 (https://phabricator.wikimedia.org/T253843) (owner: JMeybohm)
[06:56:04] <wikibugs>	 (CR) JMeybohm: [C: +2] Remove helm repo (index.yaml and chart tars) from git [deployment-charts] - https://gerrit.wikimedia.org/r/619433 (https://phabricator.wikimedia.org/T253843) (owner: JMeybohm)
[06:56:29] <jayme>	 🔥
[06:57:10] <wikibugs>	 (Merged) jenkins-bot: Remove helm repo (index.yaml and chart tars) from git [deployment-charts] - https://gerrit.wikimedia.org/r/619433 (https://phabricator.wikimedia.org/T253843) (owner: JMeybohm)
[07:02:52] <Urbanecm>	 jouncebot: next
[07:02:53] <jouncebot>	 In 0 hour(s) and 57 minute(s): Move muswiki and mhwiktionary (closed wikis) from s3 to s5 (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200811T0800)
[07:03:40] <wikibugs>	 Operations, netops, Patch-For-Review: Re-prioritize peering over transit - https://phabricator.wikimedia.org/T259614 (ayounsi) Open→Resolved All done!
[07:06:46] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:10:38] <wikibugs>	 (PS4) ZPapierski: Replace a query service during data reload [puppet] - https://gerrit.wikimedia.org/r/619259 (https://phabricator.wikimedia.org/T259543)
[07:10:44] <wikibugs>	 (PS2) ZPapierski: Add a weekly reload job for wcqs data reload [puppet] - https://gerrit.wikimedia.org/r/619289 (https://phabricator.wikimedia.org/T251515)
[07:12:12] <wikibugs>	 (CR) Giuseppe Lavagetto: helmfile: refactoring blubberoid (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/615498 (https://phabricator.wikimedia.org/T258572) (owner: Giuseppe Lavagetto)
[07:13:02] <wikibugs>	 (PS6) Giuseppe Lavagetto: helmfile: refactoring blubberoid [deployment-charts] - https://gerrit.wikimedia.org/r/615498 (https://phabricator.wikimedia.org/T258572)
[07:18:36] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:20:28] <wikibugs>	 (PS1) JMeybohm: releases: Remove absend ressources [puppet] - https://gerrit.wikimedia.org/r/619434 (https://phabricator.wikimedia.org/T253843)
[07:23:36] <wikibugs>	 (PS1) JMeybohm: helm: Remove obsolete cron ressource [puppet] - https://gerrit.wikimedia.org/r/619435
[07:24:32] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] releases: Remove absend ressources (1 comment) [puppet] - https://gerrit.wikimedia.org/r/619434 (https://phabricator.wikimedia.org/T253843) (owner: JMeybohm)
[07:25:05] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] helm: Remove obsolete cron ressource (1 comment) [puppet] - https://gerrit.wikimedia.org/r/619435 (owner: JMeybohm)
[07:26:57] <wikibugs>	 (CR) JMeybohm: [C: +1] "Yeah! Let's do it 🎉" [deployment-charts] - https://gerrit.wikimedia.org/r/615498 (https://phabricator.wikimedia.org/T258572) (owner: Giuseppe Lavagetto)
[07:29:08] <icinga-wm>	 PROBLEM - Rate of JVM GC Old generation-s runs - logstash1012-production-logstash-eqiad on logstash1012 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1012&panelId=37
[07:31:30] <wikibugs>	 (CR) JMeybohm: [C: +2] helm: Remove obsolete cron ressource [puppet] - https://gerrit.wikimedia.org/r/619435 (owner: JMeybohm)
[07:33:24] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:33:34] <wikibugs>	 (PS2) JMeybohm: helm: Remove obsolete cron resource [puppet] - https://gerrit.wikimedia.org/r/619435
[07:34:35] <wikibugs>	 (PS1) Giuseppe Lavagetto: helmfile.d: refactor eventgate-main [deployment-charts] - https://gerrit.wikimedia.org/r/619437
[07:35:05] <wikibugs>	 (PS2) JMeybohm: releases: Remove absent resources [puppet] - https://gerrit.wikimedia.org/r/619434 (https://phabricator.wikimedia.org/T253843)
[07:35:38] <wikibugs>	 (CR) JMeybohm: [C: +2] helm: Remove obsolete cron resource (1 comment) [puppet] - https://gerrit.wikimedia.org/r/619435 (owner: JMeybohm)
[07:40:45] <wikibugs>	 (PS3) ZPapierski: Add a weekly reload job for wcqs data reload [puppet] - https://gerrit.wikimedia.org/r/619289 (https://phabricator.wikimedia.org/T251515)
[07:41:13] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:41:17] <wikibugs>	 (PS5) ZPapierski: Replace a query service during data reload [puppet] - https://gerrit.wikimedia.org/r/619259 (https://phabricator.wikimedia.org/T259543)
[07:42:02] <wikibugs>	 (PS4) ZPapierski: Add a weekly reload job for wcqs data reload [puppet] - https://gerrit.wikimedia.org/r/619289 (https://phabricator.wikimedia.org/T251515)
[07:58:46] <wikibugs>	 (CR) Kormat: [C: +2] "recheck" [puppet] - https://gerrit.wikimedia.org/r/619291 (owner: Kormat)
[08:00:04] <jouncebot>	 marostegui, Urbanecm, and Amir1: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Move muswiki and mhwiktionary (closed wikis) from s3 to s5 deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200811T0800).
[08:00:11] <marostegui>	 o/
[08:00:18] <Amir1>	 o/
[08:00:22] <jynus>	 yet another? hope not so!
[08:00:23] <Urbanecm>	 o/
[08:00:55] <marostegui>	 Urbanecm Amir1 we are following this then https://phabricator.wikimedia.org/T259004#6348180 ?
[08:01:40] <Urbanecm>	 Yes. Ready to turn wikis to read only. 
[08:02:16] <Amir1>	 yup
[08:02:46] <Amir1>	 I'm here mostly for emotional support, Urbanecm will do the main stuff
[08:02:51] <marostegui>	 Urbanecm: go for it
[08:02:56] <Urbanecm>	 ack :)
[08:03:05] <wikibugs>	 (PS6) Urbanecm: Turn muswiki and mhwiktionary to read-only [mediawiki-config] - https://gerrit.wikimedia.org/r/618089 (https://phabricator.wikimedia.org/T259004)
[08:03:07] <wikibugs>	 (CR) Urbanecm: [C: +2] Turn muswiki and mhwiktionary to read-only [mediawiki-config] - https://gerrit.wikimedia.org/r/618089 (https://phabricator.wikimedia.org/T259004) (owner: Urbanecm)
[08:03:10] <RhinosF1>	 Good luck
[08:04:23] <wikibugs>	 (Merged) jenkins-bot: Turn muswiki and mhwiktionary to read-only [mediawiki-config] - https://gerrit.wikimedia.org/r/618089 (https://phabricator.wikimedia.org/T259004) (owner: Urbanecm)
[08:04:36] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.prepare-upgrade (exit_code=0)
[08:04:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:20] <wikibugs>	 (CR) Hashar: [C: +1] "Looks like the Prometeus JMX exporter used to be deployed using scap via this repository operations/software/prometheus_jmx_exporter . The" [software/prometheus_jmx_exporter] - https://gerrit.wikimedia.org/r/404224 (https://phabricator.wikimedia.org/T184882) (owner: Thcipriani)
[08:06:25] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: a04bc1f27e4ef4e38002d546d30bfd2d1dc60d0e: Turn muswiki and mhwiktionary to read-only (T259004) (duration: 01m 01s)
[08:06:26] <Urbanecm>	 marostegui: wikis should be read-only now.
[08:06:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:28] <stashbot>	 T259004: Move muswiki and mhwiktionary (closed wikis) from s3 to s5 - https://phabricator.wikimedia.org/T259004
[08:06:36] <marostegui>	 Urbanecm: ok, going to proceed
[08:06:47] <wikibugs>	 (Abandoned) Hashar: Scap: git_fat -> git_binary_manager [software/prometheus_jmx_exporter] - https://gerrit.wikimedia.org/r/404224 (https://phabricator.wikimedia.org/T184882) (owner: Thcipriani)
[08:06:49] <wikibugs>	 (PS5) ZPapierski: Add a weekly reload job for wcqs data reload [puppet] - https://gerrit.wikimedia.org/r/619289 (https://phabricator.wikimedia.org/T251515)
[08:10:03] <marostegui>	 both loaded into codfw, checking stuff and after that, will sanitize sanitarium host
[08:10:22] <Urbanecm>	 ack
[08:12:15] <marostegui>	 looks good, going to proceed with eqiad
[08:13:47] <Urbanecm>	 marostegui: ack. Would you mind me preparing the "point to s5" patch at mwdebug hosts, or should I wait with that?
[08:14:23] <marostegui>	 Urbanecm: only codfw I would say
[08:15:24] <wikibugs>	 (CR) Aklapper: "Nobody said it's broken... It is outdated and linking to a "blog" makes me have different expectations than getting one single blog *post*" [mediawiki-config] - https://gerrit.wikimedia.org/r/619129 (https://phabricator.wikimedia.org/T259978) (owner: Aklapper)
[08:15:34] <Urbanecm>	 marostegui: okay, merging and pulling to mwdebug2001
[08:16:24] <wikibugs>	 (PS4) Urbanecm: Point muswiki and mhwiktionary to s5 [mediawiki-config] - https://gerrit.wikimedia.org/r/618090 (https://phabricator.wikimedia.org/T259004)
[08:16:24] <wikibugs>	 (CR) Urbanecm: [C: +2] Point muswiki and mhwiktionary to s5 [mediawiki-config] - https://gerrit.wikimedia.org/r/618090 (https://phabricator.wikimedia.org/T259004) (owner: Urbanecm)
[08:16:39] <wikibugs>	 (Merged) jenkins-bot: Point muswiki and mhwiktionary to s5 [mediawiki-config] - https://gerrit.wikimedia.org/r/618090 (https://phabricator.wikimedia.org/T259004) (owner: Urbanecm)
[08:18:25] <Urbanecm>	 marostegui: FYI: the patch is at mwdebug2001 only.
[08:19:08] <marostegui>	 Urbanecm: cool, I am proceeding with eqiad hosts
[08:19:15] <Urbanecm>	 ack
[08:23:53] <wikibugs>	 (03PS1) 10Hashar: Add .gitreview file [debs/hue] - 10https://gerrit.wikimedia.org/r/619438
[08:23:55] <wikibugs>	 10Operations, 10Patch-For-Review: Fix "Blog" link on noc.wikimedia.org - https://phabricator.wikimedia.org/T259978 (10Aklapper) One single blogpost is not the "Blog of the operators of Wikimedia's servers". So that's broken.
[08:24:58] <marostegui>	 all done, doing some checks now
[08:24:59] <wikibugs>	 (03PS3) 10Hashar: Add basic Debian packaging [debs/hue] - 10https://gerrit.wikimedia.org/r/618728 (https://phabricator.wikimedia.org/T233073) (owner: 10Elukey)
[08:25:20] <Urbanecm>	 ack
[08:26:32] <wikibugs>	 (03PS2) 10Aklapper: noc: Remove link to outdated blog [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619129 (https://phabricator.wikimedia.org/T259978)
[08:27:55] <marostegui>	 Urbanecm: everything is done, we should be now at this step: Change ./wmf-config/config/muswiki.yaml and ./wmf-config/config/mhwiktionary.yaml and then run composer buildDBLists from https://phabricator.wikimedia.org/T259004#6348180
[08:29:27] <Urbanecm>	 thanks!
[08:29:34] <Urbanecm>	 going to pull that patch to mwdebug1001
[08:30:04] <marostegui>	 Urbanecm: sounds good, and if possible, let's generate a write for those wikis so I can check they get replicated safely?
[08:31:05] <Urbanecm>	 marostegui: sure! I'm now verifying the patch works by looking at IP of the master the wikis talk to
[08:31:20] <marostegui>	 excellent
[08:33:03] <Urbanecm>	 marostegui: in eqiad, it talks to db1100/10.64.32.197, in codfw, it talks to db2123/10.192.16.12. Looks good to me per https://dbtree.wikimedia.org/
[08:33:38] <marostegui>	 Urbanecm: ips and hostnames are correct
[08:33:42] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:34:20] <Urbanecm>	 marostegui: great! I'm setting read-only back to false at mwdebug1001, so I can generate a write for you.
[08:34:32] <marostegui>	 cool
[08:34:45] <wikibugs>	 (03PS1) 10Urbanecm: Revert "Turn muswiki and mhwiktionary to read-only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619369 (https://phabricator.wikimedia.org/T259004)
[08:34:51] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Revert "Turn muswiki and mhwiktionary to read-only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619369 (https://phabricator.wikimedia.org/T259004) (owner: 10Urbanecm)
[08:35:37] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Turn muswiki and mhwiktionary to read-only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619369 (https://phabricator.wikimedia.org/T259004) (owner: 10Urbanecm)
[08:36:58] <Urbanecm>	 marostegui: I created https://mus.wikipedia.org/wiki/User:Martin_Urbanec/This_is_a_test_page (should appear in the page table)
[08:37:05] <marostegui>	 let me check
[08:38:55] <marostegui>	 Urbanecm: looks good, the row is on s5 but not on s3
[08:39:00] <marostegui>	 can you do the same for the other one?
[08:39:02] <Amir1>	 \o/
[08:39:02] <Urbanecm>	 sure!
[08:39:40] <Urbanecm>	 marostegui: created https://mh.wiktionary.org/wiki/User:Martin_Urbanec/This_is_a_test_page
[08:39:46] <marostegui>	 checking
[08:40:42] <wikibugs>	 (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1001/24408/" [puppet] - 10https://gerrit.wikimedia.org/r/619296 (owner: 10Filippo Giunchedi)
[08:40:48] <marostegui>	 looks good too
[08:40:55] <Urbanecm>	 good! 
[08:40:55] <wikibugs>	 (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1001/24408/" [puppet] - 10https://gerrit.wikimedia.org/r/619295 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi)
[08:40:56] <marostegui>	 also sanitization is working fine as well on labs hosts
[08:41:14] <Urbanecm>	 so, I think I can sync the shard change to all hosts now!
[08:41:25] <marostegui>	 yep!
[08:41:29] <Urbanecm>	 doing!
[08:41:30] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:42:22] <wikibugs>	 (03CR) 10Kormat: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/619291 (owner: 10Kormat)
[08:42:50] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: -1] "Good idea, but I think the current form is a bit too specific and also leaves the burden of preparing a wheels archive on the developer, w" (4 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/605653 (https://phabricator.wikimedia.org/T259611) (owner: Hashar)
[08:43:34] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/db-codfw.php: 81f4594b6c583f938821549b3a1800fec5b120bb: Point muswiki and mhwiktionary to s5 (T259004; 1/3) (duration: 01m 02s)
[08:43:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:39] <stashbot>	 T259004: Move muswiki and mhwiktionary (closed wikis) from s3 to s5 - https://phabricator.wikimedia.org/T259004
[08:43:54] <wikibugs>	 10Operations, 10observability: VictorOps behavior on long-ack'd incidents - https://phabricator.wikimedia.org/T259465 (10fgiunchedi)
[08:44:48] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/db-eqiad.php: 81f4594b6c583f938821549b3a1800fec5b120bb: Point muswiki and mhwiktionary to s5 (T259004; 2/3) (duration: 00m 58s)
[08:44:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:58] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized dblists/: 81f4594b6c583f938821549b3a1800fec5b120bb: Point muswiki and mhwiktionary to s5 (T259004; 3/3) (duration: 00m 58s)
[08:46:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:14] <Urbanecm>	 marostegui: shard change should be done. Any final checks before I make the wikis rw?
[08:46:31] <marostegui>	 Urbanecm: let's create one more test for each wiki?
[08:46:47] <Urbanecm>	 okay!
[08:46:56] <marostegui>	 thank you
[08:48:52] <Urbanecm>	 marostegui: created https://mus.wikipedia.org/wiki/User:Martin_Urbanec/Foo and https://mh.wiktionary.org/wiki/User:Martin_Urbanec/Foo
[08:48:58] <marostegui>	 checking!
[08:49:47] <marostegui>	 looks good, changes are on s5 and not on s3!
[08:49:58] <Urbanecm>	 cool!
[08:50:31] <Urbanecm>	 so, looks ready for read-write to me :)
[08:50:43] <marostegui>	 yep!
[08:50:47] <Urbanecm>	 syncing!
[08:52:29] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: e6ec237b6b6fb67a0a80613909589bc724f5eecf: Revert "Turn muswiki and mhwiktionary to read-only" (T259004) (duration: 00m 58s)
[08:52:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:32] <stashbot>	 T259004: Move muswiki and mhwiktionary (closed wikis) from s3 to s5 - https://phabricator.wikimedia.org/T259004
[08:52:53] <Urbanecm>	 marostegui: done!
[08:53:00] <wikibugs>	 10Operations, 10observability: VictorOps behavior on long-ack'd incidents - https://phabricator.wikimedia.org/T259465 (10fgiunchedi) The current thinking is to try option #4: ack'd incidents in VO that haven't been resolved within X hours will re-trigger, using X = 12. The normal workflow is sth like this:  1....
[08:53:04] <marostegui>	 Urbanecm: let's do that same test once more to be fully sure?
[08:53:29] <Urbanecm>	 marostegui: sure. Would you mind me deleting the pages now (logging table)?
[08:53:38] <marostegui>	 that sounds good
[08:54:05] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] helmfile: refactoring blubberoid [deployment-charts] - 10https://gerrit.wikimedia.org/r/615498 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto)
[08:54:16] <Urbanecm>	 marostegui: https://mus.wikipedia.org/w/index.php?title=User:Martin_Urbanec/This_is_a_test_page and https://mh.wiktionary.org/w/index.php?title=User:Martin_Urbanec/This_is_a_test_page were just deleted
[08:54:21] <marostegui>	 ok, checking
[08:55:10] <wikibugs>	 (03Merged) 10jenkins-bot: helmfile: refactoring blubberoid [deployment-charts] - 10https://gerrit.wikimedia.org/r/615498 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto)
[08:55:11] <marostegui>	 looks good!
[08:55:15] <Urbanecm>	 \o/
[08:55:22] <wikibugs>	 10Operations, 10observability, 10User-fgiunchedi: VictorOps behavior on long-ack'd incidents - https://phabricator.wikimedia.org/T259465 (10fgiunchedi)
[08:55:45] <marostegui>	 So we are done I think! We need to follow up the doc change at https://phabricator.wikimedia.org/T259438 and I am going to create a cleanup task to remove the tables from s3
[08:56:01] <Urbanecm>	 yup! I'll change the docs then. Thanks marostegui !
[08:56:09] <kormat>	 new phab task: "Drop all tables from s3"
[08:56:12] <marostegui>	 Thaaaaanks so much for driving this!
[08:56:30] <Urbanecm>	 happy to help!
[08:57:02] <jynus>	 if you won't need my backups, I will take a break
[08:57:12] <marostegui>	 jynus: I think we are fine, thank you!
[08:59:30] <Urbanecm>	 marostegui: do we need some follow-up from cloud-services, or should wiki replicas in labs work?
[08:59:45] <Urbanecm>	 (also, what about analytics replicas?)
[08:59:52] <marostegui>	 Urbanecm: no, there's nothing to do there
[09:00:17] <Urbanecm>	 okay, cool
[09:00:23] <marostegui>	 Urbanecm: the analytics dbstore might need creating the views though, as it is multi-instance
[09:00:24] <marostegui>	 let me check
[09:00:30] <marostegui>	 not sure if they use views there or not
[09:00:52] <jynus>	 if you mean the non-cloud ones, no, they don't have views
[09:01:01] <marostegui>	 yeah, just checked, no views there
[09:01:06] <marostegui>	 so we are good
[09:01:13] <jynus>	 we could run the check-private data after table removal once, just in case
[09:01:13] <Urbanecm>	 good, thanks for checking that :)
[09:01:35] <marostegui>	 jynus: I am going to run it now for s5, should be quick
[09:01:37] <volans>	 !log renewed puppet certificate on scb1001.eqiad.wmnet
[09:01:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:36] <icinga-wm>	 RECOVERY - puppet last run on scb1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[09:06:59] <marostegui>	 private data check looks clean on both db1124:3315 and db2094:3315
[09:07:52] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[09:09:23] <Urbanecm>	 ^^this is _not_ related to the shard change AFAICS^^
[09:10:22] <marostegui>	 not from what I can see no
[09:11:28] <marostegui>	 !log Rename tables on muswiki and mhwiktionary on s3 master (db1123) without replication T260112
[09:11:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:31] <stashbot>	 T260112: Remove muswiki and mhwiktionary from s3 - https://phabricator.wikimedia.org/T260112
[09:13:40] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[09:21:29] <wikibugs>	 10Operations, 10Fundraising-Backlog, 10MW-1.36-notes (1.36.0-wmf.4; 2020-08-11), 10User-Urbanecm, and 2 others: New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 (10Urbanecm)
[09:22:01] <wikibugs>	 10Operations, 10Fundraising-Backlog, 10MW-1.36-notes (1.36.0-wmf.4; 2020-08-11), 10User-Urbanecm, and 2 others: New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 (10Urbanecm)
[09:25:10] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.dns.netbox
[09:25:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:30] <wikibugs>	 (03PS1) 10Kormat: mariadb: Drop check_mariadb.py in favour of packaged version [puppet] - 10https://gerrit.wikimedia.org/r/619443
[09:26:20] <wikibugs>	 (03PS1) 10Elukey: Upgrade the Hadoop test cluster to Bigtop [puppet] - 10https://gerrit.wikimedia.org/r/619444
[09:27:31] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Upgrade the Hadoop test cluster to Bigtop [puppet] - 10https://gerrit.wikimedia.org/r/619444 (owner: 10Elukey)
[09:28:06] <wikibugs>	 (03PS2) 10Kormat: mariadb: Drop check_mariadb.py in favour of packaged version [puppet] - 10https://gerrit.wikimedia.org/r/619443 (https://phabricator.wikimedia.org/T259516)
[09:29:59] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:30:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:10] <wikibugs>	 (03PS2) 10Gehel: Set up WCQS test server [puppet] - 10https://gerrit.wikimedia.org/r/618059 (owner: 10ZPapierski)
[09:33:50] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:34:07] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] Set up WCQS test server [puppet] - 10https://gerrit.wikimedia.org/r/618059 (owner: 10ZPapierski)
[09:39:40] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:40:25] <wikibugs>	 (03PS1) 10Ladsgroup: Exclude thankyou wiki for mobile redirect [puppet] - 10https://gerrit.wikimedia.org/r/619446 (https://phabricator.wikimedia.org/T259002)
[09:40:52] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[09:44:33] <wikibugs>	 (03PS3) 10Kormat: mariadb: Drop check_mariadb.py in favour of packaged version [puppet] - 10https://gerrit.wikimedia.org/r/619443 (https://phabricator.wikimedia.org/T259516)
[09:48:30] <wikibugs>	 (03PS4) 10Kormat: mariadb: Drop check_mariadb.py in favour of packaged version [puppet] - 10https://gerrit.wikimedia.org/r/619443 (https://phabricator.wikimedia.org/T259516)
[09:48:40] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[09:48:56] <wikibugs>	 (03PS3) 10Filippo Giunchedi: Debian packaging for Grafana plugins [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/618953 (https://phabricator.wikimedia.org/T259143)
[09:49:12] <wikibugs>	 (03CR) 10Kormat: "PCC run: https://puppet-compiler.wmflabs.org/compiler1003/24411/" [puppet] - 10https://gerrit.wikimedia.org/r/619443 (https://phabricator.wikimedia.org/T259516) (owner: 10Kormat)
[09:50:20] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] grafana: temp disable grafana db sync ahead of upgrade [puppet] - 10https://gerrit.wikimedia.org/r/618069 (https://phabricator.wikimedia.org/T259143) (owner: 10Filippo Giunchedi)
[09:51:59] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.stop-cluster
[09:52:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:11] <elukey>	 this is the test cluster --^
[09:54:25] <volans>	 [heads up] I'm about to merge the DNS patch to migrate all eqiad mgmt records to netbox generated ones in a bit
[09:56:53] <wikibugs>	 (03PS5) 10Volans: mgmt: netbox-generated data for mgmt eqiad [dns] - 10https://gerrit.wikimedia.org/r/617509 (https://phabricator.wikimedia.org/T233183)
[09:59:34] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: deployment_server::helmfile: generate files for new-style helmfile organization [puppet] - 10https://gerrit.wikimedia.org/r/619448 (https://phabricator.wikimedia.org/T258572)
[10:00:12] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.stop-cluster (exit_code=0)
[10:00:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:01:39] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.change-distro-from-cdh
[10:01:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:01:50] <elukey>	 (test cluster)
[10:02:17] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: deployment_server::helmfile: generate files for new-style helmfile organization [puppet] - 10https://gerrit.wikimedia.org/r/619448 (https://phabricator.wikimedia.org/T258572)
[10:04:00] <icinga-wm>	 PROBLEM - grafana.wikimedia.org on grafana2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org
[10:07:25] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/24413/deploy1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/619448 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto)
[10:07:30] <icinga-wm>	 PROBLEM - Check systemd state on grafana2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:11:23] <icinga-wm>	 RECOVERY - Check systemd state on grafana2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:11:48] <icinga-wm>	 RECOVERY - grafana.wikimedia.org on grafana2001 is OK: HTTP OK: HTTP/1.1 200 OK - 17053 bytes in 0.138 second response time https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org
[10:16:29] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: deployment_server::helmfile: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/619449
[10:19:14] <icinga-wm>	 PROBLEM - Check systemd state on grafana2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:19:38] <icinga-wm>	 PROBLEM - grafana.wikimedia.org on grafana2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org
[10:20:18] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.change-distro-from-cdh (exit_code=0)
[10:20:23] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] deployment_server::helmfile: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/619449 (owner: 10Giuseppe Lavagetto)
[10:20:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:21:52] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:23:32] <icinga-wm>	 RECOVERY - grafana.wikimedia.org on grafana2001 is OK: HTTP OK: HTTP/1.1 200 OK - 57001 bytes in 0.166 second response time https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org
[10:25:06] <icinga-wm>	 RECOVERY - Check systemd state on grafana2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:26:27] <wikibugs>	 (03PS4) 10Filippo Giunchedi: Debian packaging for Grafana plugins [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/618953 (https://phabricator.wikimedia.org/T259143)
[10:29:42] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:32:01] <wikibugs>	 (03PS8) 10Hnowlan: api-gateway: open parts of the admin interface internally [deployment-charts] - 10https://gerrit.wikimedia.org/r/616121 (https://phabricator.wikimedia.org/T254908)
[10:32:24] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:34:26] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.dns.netbox
[10:34:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:38:13] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:38:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:28] <volans>	 !log migrating *all* eqiad mgmt DNS records to the autogenerated ones via Netbox - T233183
[10:39:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:31] <stashbot>	 T233183: Automate generation of Management DNS records from Netbox - https://phabricator.wikimedia.org/T233183
[10:39:36] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: helmfile.d/blubberoid: fix the paths of SRE-controlled values [deployment-charts] - 10https://gerrit.wikimedia.org/r/619450
[10:39:38] <wikibugs>	 (03CR) 10Volans: [C: 03+2] mgmt: netbox-generated data for mgmt eqiad [dns] - 10https://gerrit.wikimedia.org/r/617509 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans)
[10:40:38] <wikibugs>	 (CR) Hnowlan: "> Patch Set 7:" (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/616121 (https://phabricator.wikimedia.org/T254908) (owner: Hnowlan)
[10:42:12] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:42:23] <wikibugs>	 (03PS1) 10Filippo Giunchedi: profile: switch Grafana plugins to Debian package [puppet] - 10https://gerrit.wikimedia.org/r/619451 (https://phabricator.wikimedia.org/T259143)
[10:43:01] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] helmfile.d/blubberoid: fix the paths of SRE-controlled values [deployment-charts] - 10https://gerrit.wikimedia.org/r/619450 (owner: 10Giuseppe Lavagetto)
[10:44:43] <wikibugs>	 (03Merged) 10jenkins-bot: helmfile.d/blubberoid: fix the paths of SRE-controlled values [deployment-charts] - 10https://gerrit.wikimedia.org/r/619450 (owner: 10Giuseppe Lavagetto)
[10:45:52] <wikibugs>	 (03CR) 10Jcrespo: "The patch as it is looks ok, however, given that in some cases this is a paging alert, and given our past with failures deploying new aler" [puppet] - 10https://gerrit.wikimedia.org/r/619443 (https://phabricator.wikimedia.org/T259516) (owner: 10Kormat)
[10:49:27] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] api-gateway: open parts of the admin interface internally [deployment-charts] - 10https://gerrit.wikimedia.org/r/616121 (https://phabricator.wikimedia.org/T254908) (owner: 10Hnowlan)
[10:50:39] <wikibugs>	 (03Merged) 10jenkins-bot: api-gateway: open parts of the admin interface internally [deployment-charts] - 10https://gerrit.wikimedia.org/r/616121 (https://phabricator.wikimedia.org/T254908) (owner: 10Hnowlan)
[10:50:49] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: helmfile.d/blubberoid: fix typo [deployment-charts] - 10https://gerrit.wikimedia.org/r/619452
[10:51:45] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] helmfile.d/blubberoid: fix typo [deployment-charts] - 10https://gerrit.wikimedia.org/r/619452 (owner: 10Giuseppe Lavagetto)
[10:51:54] <wikibugs>	 (03PS2) 10KartikMistry: Enable Content Translation in Sundanese Wikipedia as a default tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619255 (https://phabricator.wikimedia.org/T258502)
[10:52:56] <wikibugs>	 (03Merged) 10jenkins-bot: helmfile.d/blubberoid: fix typo [deployment-charts] - 10https://gerrit.wikimedia.org/r/619452 (owner: 10Giuseppe Lavagetto)
[10:57:02] <wikibugs>	 (03PS1) 10Hnowlan: api-gateway: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/619454
[10:58:32] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] api-gateway: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/619454 (owner: 10Hnowlan)
[10:59:39] <wikibugs>	 (03Merged) 10jenkins-bot: api-gateway: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/619454 (owner: 10Hnowlan)
[11:00:05] <jouncebot>	 Amir1, Lucas_WMDE, awight, and Urbanecm: Your horoscope predicts another unfortunate European mid-day backport window(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200811T1100).
[11:00:05] <jouncebot>	 kart_: A patch you scheduled for European mid-day backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:25] <Lucas_WMDE>	 Lemony Snicket’s A Series Of Unfortunate Deploys
[11:00:32] * kart_ is here.
[11:00:48] <Urbanecm>	 I can deploy today, unless kart_ wants to self-service :)
[11:01:04] <Lucas_WMDE>	 (I’m about to go for lunch so i can’t actually deploy, only make stupid jokes, sorry ^^)
[11:01:10] <Lucas_WMDE>	 thx Urbanecm
[11:01:11] <kart_>	 Urbanecm: I can deploy to just make sure I can deploy. No worries :)
[11:01:25] <Urbanecm>	 go ahead then :)
[11:02:30] <Urbanecm>	 (wow, is there a plan to have CT by default? nice!)
[11:02:47] <wikibugs>	 (CR) Ladsgroup: Create dispatch lag alerts for test.wikidata.org (2 comments) [puppet] - https://gerrit.wikimedia.org/r/618084 (https://phabricator.wikimedia.org/T258374) (owner: Michael Große)
[11:02:49] <wikibugs>	 (03CR) 10KartikMistry: [C: 03+2] Enable Content Translation in Sundanese Wikipedia as a default tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619255 (https://phabricator.wikimedia.org/T258502) (owner: 10KartikMistry)
[11:03:35] <wikibugs>	 (03Merged) 10jenkins-bot: Enable Content Translation in Sundanese Wikipedia as a default tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619255 (https://phabricator.wikimedia.org/T258502) (owner: 10KartikMistry)
[11:03:37] <kart_>	 Urbanecm: We are moving slowly out-of-beta for Wikis after discussing with community.
[11:03:44] <Urbanecm>	 (y)
[11:03:48] <Urbanecm>	 I like that :)
[11:03:57] <Majavah>	 Urbanecm: I'll have a late config patch in a minute
[11:03:57] <RhinosF1>	 Content Translation is really nice
[11:04:14] <Urbanecm>	 Majavah: ack, add it to the calendar then :)
[11:04:22] <Majavah>	 sure
[11:04:45] <wikibugs>	 (03CR) 10Ladsgroup: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/618084 (https://phabricator.wikimedia.org/T258374) (owner: 10Michael Große)
[11:05:16] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:06:22] <wikibugs>	 (03PS1) 10Majavah: labs: Disable TheWikipediaLibrary due to email issues [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619455 (https://phabricator.wikimedia.org/T256297)
[11:07:16] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:07:50] <Majavah>	 Urbanecm: added
[11:07:55] <Urbanecm>	 ack
[11:08:00] <Majavah>	 it's beta only
[11:08:58] <logmsgbot>	 !log kartik@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit|619255|Enable ContentTranslation in Sundanese WP as a default tool (T258502)]] (duration: 00m 59s)
[11:09:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:09:01] <stashbot>	 T258502: Enable Content Translation in Sundanese Wikipedia as a default tool - https://phabricator.wikimedia.org/T258502
[11:09:04] <kart_>	 Urbanecm: I'm done with deploy.
[11:09:44] <Urbanecm>	 kart_: thanks
[11:11:16] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] "noop for prod" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619455 (https://phabricator.wikimedia.org/T256297) (owner: 10Majavah)
[11:11:20] <Urbanecm>	 Majavah: done :)
[11:11:39] <Urbanecm>	 (the magic should do that within 30 minutes)
[11:11:47] <Majavah>	 thanks!
[11:12:04] <wikibugs>	 (03Merged) 10jenkins-bot: labs: Disable TheWikipediaLibrary due to email issues [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619455 (https://phabricator.wikimedia.org/T256297) (owner: 10Majavah)
[11:15:09] <wikibugs>	 (03PS4) 10Cparle: MediaSearch A/B test on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616530 (https://phabricator.wikimedia.org/T254388) (owner: 10DCausse)
[11:16:16] <wikibugs>	 (03PS5) 10Cparle: MediaSearch A/B test on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616530 (https://phabricator.wikimedia.org/T254388) (owner: 10DCausse)
[11:17:28] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/619446 (https://phabricator.wikimedia.org/T259002) (owner: 10Ladsgroup)
[11:18:17] <Urbanecm>	 !log EU B&C window done
[11:18:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:30:44] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
[11:30:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:08] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:36:27] <wikibugs>	 Operations, SRE-tools, Traffic, Goal, and 3 others: Automate generation of Management DNS records from Netbox - https://phabricator.wikimedia.org/T233183 (Volans) Open→Resolved a:crusnov→Volans All management records are now generated via Netbox, related wikitech documentation upd...
[11:37:38] <wikibugs>	 (03PS3) 10Volans: zone_validator: fix private/public detection [dns] - 10https://gerrit.wikimedia.org/r/616873
[11:41:00] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:42:37] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
[11:42:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:48:50] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:52:24] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:54:20] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:54:24] <volans>	 netbox was fixed (context in T260077 )
[11:54:25] <stashbot>	 T260077: netbox dumps: fix permissions and timestamp - https://phabricator.wikimedia.org/T260077
[11:54:56] <marostegui>	 !log Install new MariaDB 10.4.14 on db2102
[11:54:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:55:44] <wikibugs>	 (CR) Gehel: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/619259 (https://phabricator.wikimedia.org/T259543) (owner: ZPapierski)
[11:58:23] <wikibugs>	 Operations, Discovery-Search (Current work): Reshard commonswiki_file elasticsearch index - https://phabricator.wikimedia.org/T260083 (Gehel)
[11:59:17] <wikibugs>	 Operations, Discovery-Search (Current work): wdqs1009 has puppet changes on each run - https://phabricator.wikimedia.org/T260123 (Gehel)
[12:00:04] <jouncebot>	 Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200811T1200)
[12:12:20] <wikibugs>	 (CR) Filippo Giunchedi: "To be merged shortly before the upgrade" [puppet] - https://gerrit.wikimedia.org/r/619451 (https://phabricator.wikimedia.org/T259143) (owner: Filippo Giunchedi)
[12:27:30] <wikibugs>	 Operations, Discovery-Search (Current work): wdqs1009 has puppet changes on each run - https://phabricator.wikimedia.org/T260123 (Gehel) Note: @Zbyszko is having a look into this as well.
[12:35:22] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:35:26] <wikibugs>	 Operations, DBA, User-Kormat: switchover.py breaks on 10.4 master - https://phabricator.wikimedia.org/T260127 (Kormat)
[12:35:39] <wikibugs>	 Operations, DBA, User-Kormat: switchover.py breaks on 10.4 master - https://phabricator.wikimedia.org/T260127 (Kormat) p:Triage→Medium a:Kormat
[12:36:49] <logmsgbot>	 !log oblivian@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' .
[12:36:49] <logmsgbot>	 !log oblivian@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
[12:36:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:36:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:37:20] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:42:35] <wikibugs>	 Puppet, Beta-Cluster-Infrastructure, VPS-Projects, Product-Infrastructure-Team-Backlog (Kanban): Puppet failures on deployment-docker-changeprop01, deployment-docker-cpjobqueue01, deployment-push-notifications01, deployment-docker-mobileapps01, and deployme... - https://phabricator.wikimedia.org/T259812
[12:42:40] <wikibugs>	 (CR) Hashar: [C: +2] "Cause well train" [core] (wmf/1.36.0-wmf.4) - https://gerrit.wikimedia.org/r/619397 (https://phabricator.wikimedia.org/T257972) (owner: TrainBranchBot)
[12:44:16] <wikibugs>	 Operations, DBA, User-Kormat, User-jbond: Refactor mariadb puppet code - https://phabricator.wikimedia.org/T256972 (Kormat)
[12:47:46] <wikibugs>	 (PS2) Cicalese: Configured additional settings for API Portal beta wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/619339 (https://phabricator.wikimedia.org/T259569)
[12:48:46] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[12:52:34] <kormat>	 !log uploaded wmfmariadbpy 0.2 packages to apt1001
[12:52:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:52:43] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[12:54:59] <wikibugs>	 (PS3) Cicalese: Configured additional settings for API Portal beta wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/619339 (https://phabricator.wikimedia.org/T259569)
[12:59:44] <wikibugs>	 (PS1) Marostegui: control-mariadb-10.4*: Update package version [software] - https://gerrit.wikimedia.org/r/619462
[13:00:04] <jouncebot>	 hashar and twentyafterfour: It is that lovely time of the day again! You are hereby commanded to deploy Mediawiki train - European+American Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200811T1300).
[13:01:36] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.36.0-wmf.4 [core] (wmf/1.36.0-wmf.4) - https://gerrit.wikimedia.org/r/619397 (https://phabricator.wikimedia.org/T257972) (owner: TrainBranchBot)
[13:03:47] <logmsgbot>	 !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
[13:03:47] <logmsgbot>	 !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' .
[13:03:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:03:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:05:32] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] web_testing: Remove the apache-fast-test placeholder [puppet] - https://gerrit.wikimedia.org/r/618602 (owner: RLazarus)
[13:06:38] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] web_testing: Clean up the old class used for apache-fast-test. [puppet] - https://gerrit.wikimedia.org/r/618603 (owner: RLazarus)
[13:06:40] <wikibugs>	 Operations, Phabricator, Traffic, serviceops, and 2 others: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (Vgutierrez) hmm that's interesting, please note that this is not the first time we use websockets. etherpad.wm.o i...
[13:12:36] <hashar>	 !log Applied 1.36.0-wmf.4 security patches # T257972
[13:12:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:12:40] <stashbot>	 T257972: 1.36.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T257972
[13:13:48] <wikibugs>	 (PS1) Hashar: testwikis wikis to 1.36.0-wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/619464
[13:13:50] <wikibugs>	 (CR) Hashar: [C: +2] testwikis wikis to 1.36.0-wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/619464 (owner: Hashar)
[13:14:37] <wikibugs>	 (Merged) jenkins-bot: testwikis wikis to 1.36.0-wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/619464 (owner: Hashar)
[13:14:44] <logmsgbot>	 !log hashar@deploy1001 Started scap: testwikis wikis to 1.36.0-wmf.4
[13:14:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:16:37] <logmsgbot>	 !log oblivian@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
[13:16:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:21:58] <wikibugs>	 (PS1) CDanis: enable envoy websockets support for role::aphlict [puppet] - https://gerrit.wikimedia.org/r/619465 (https://phabricator.wikimedia.org/T238593)
[13:23:37] <wikibugs>	 (CR) Vgutierrez: [C: +1] enable envoy websockets support for role::aphlict [puppet] - https://gerrit.wikimedia.org/r/619465 (https://phabricator.wikimedia.org/T238593) (owner: CDanis)
[13:23:41] <logmsgbot>	 !log oblivian@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
[13:23:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:56] <wikibugs>	 (CR) CDanis: [C: +2] enable envoy websockets support for role::aphlict [puppet] - https://gerrit.wikimedia.org/r/619465 (https://phabricator.wikimedia.org/T238593) (owner: CDanis)
[13:27:18] <logmsgbot>	 !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
[13:27:19] <logmsgbot>	 !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' .
[13:27:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:27:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:27:48] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:34:10] <wikibugs>	 (CR) Gilles: "Looking at the first image, aawiki.png, you're achieving 42% size reduction with a loss of 69.3% of distinct colors. My patch achieved 52%" [mediawiki-config] - https://gerrit.wikimedia.org/r/609584 (https://phabricator.wikimedia.org/T252108) (owner: Thiemo Kreuz (WMDE))
[13:35:13] <wikibugs>	 (PS1) Jayprakash12345: Enable tewiki as import source for tewikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/619371
[13:40:44] <wikibugs>	 (CR) Gilles: "I'm going to deploy this the week of August 24, as I'll be working without interruption the months that follow, ensuring that I can addres" [mediawiki-config] - https://gerrit.wikimedia.org/r/599284 (https://phabricator.wikimedia.org/T252108) (owner: Gilles)
[13:41:28] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:48:28] <wikibugs>	 (CR) Volans: [C: +2] zone_validator: fix private/public detection [dns] - https://gerrit.wikimedia.org/r/616873 (owner: Volans)
[13:48:58] <wikibugs>	 Operations, Phabricator, Traffic, serviceops, and 2 others: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (CDanis) The Envoy TLS terminator is now configured to allow websocket upgrades -- however, it's improperly configu...
[13:51:04] <wikibugs>	 (CR) ZPapierski: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/24417/ - changes from a previous patch and cron job is not enabled on wdqs instances" [puppet] - https://gerrit.wikimedia.org/r/619289 (https://phabricator.wikimedia.org/T251515) (owner: ZPapierski)
[13:52:45] <wikibugs>	 (PS2) Jayprakash12345: Enable tewiki as import source for tewikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/619371 (https://phabricator.wikimedia.org/T260107)
[13:53:01] <wikibugs>	 (CR) Jayprakash12345: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/619371 (https://phabricator.wikimedia.org/T260107) (owner: Jayprakash12345)
[13:53:06] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:56:58] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:00:02] <wikibugs>	 (PS1) JMeybohm: Set helm extra args early [debs/helmfile] - https://gerrit.wikimedia.org/r/619471
[14:00:56] <wikibugs>	 (PS1) Ottomata: camus::job - pass stream_configs_constraints through to refinery camus wrapper [puppet] - https://gerrit.wikimedia.org/r/619472 (https://phabricator.wikimedia.org/T251935)
[14:01:07] <wikibugs>	 (CR) jerkins-bot: [V: -1] camus::job - pass stream_configs_constraints through to refinery camus wrapper [puppet] - https://gerrit.wikimedia.org/r/619472 (https://phabricator.wikimedia.org/T251935) (owner: Ottomata)
[14:04:34] <wikibugs>	 (CR) Giuseppe Lavagetto: "LGTM, I'm sure we're missing something though :P" [debs/helmfile] - https://gerrit.wikimedia.org/r/619471 (owner: JMeybohm)
[14:04:38] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] Set helm extra args early [debs/helmfile] - https://gerrit.wikimedia.org/r/619471 (owner: JMeybohm)
[14:05:02] <icinga-wm>	 PROBLEM - Disk space on mwdebug1001 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=78%): /tmp 0 MB (0% inode=78%): /var/tmp 0 MB (0% inode=78%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mwdebug1001&var-datasource=eqiad+prometheus/ops
[14:05:20] <wikibugs>	 (CR) Thiemo Kreuz (WMDE): "> […] loss of 69.3% of distinct colors." [mediawiki-config] - https://gerrit.wikimedia.org/r/609584 (https://phabricator.wikimedia.org/T252108) (owner: Thiemo Kreuz (WMDE))
[14:05:25] <hashar>	 rsync: recv_generator: mkdir "/srv/mediawiki/php-1.36.0-wmf.4/resources" failed: No space left on device (28)
[14:05:28] <hashar>	 damn
[14:05:47] <hashar>	 that is on mw1319 apparently
[14:06:14] <wikibugs>	 (CR) Herron: [C: +2] lists: stop automatically sycing fermium to lists1001 [puppet] - https://gerrit.wikimedia.org/r/619355 (https://phabricator.wikimedia.org/T224586) (owner: Herron)
[14:06:17] <jynus>	 mwdebug too, hashar
[14:06:26] <hashar>	 yeah / on mwdebug1001 is full ..
[14:06:43] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:07:50] <jynus>	 do we really need to keep so many versions?
[14:07:54] <hashar>	 but mw1319 is just fine
[14:08:07] <hashar>	 well we clean the old versions
[14:08:08] <jynus>	 there is like over 100 mw versions
[14:08:14] <jynus>	 on mwdebug1001
[14:08:15] <hashar>	 oh
[14:08:20] <hashar>	 yeah that is not normal
[14:08:33] <jynus>	 can you have a look, hashar, and tell me if I can manually remove some?
[14:08:35] <hashar>	 same on mw1319
[14:08:41] <wikibugs>	 (CR) Thiemo Kreuz (WMDE): [C: -1] "https://www.mediawiki.org/wiki/Gerrit/Privilege_policy/en#Merging_without_review" [mediawiki-config] - https://gerrit.wikimedia.org/r/599284 (https://phabricator.wikimedia.org/T252108) (owner: Gilles)
[14:08:42] <jynus>	 to get out of the danger
[14:08:53] <hashar>	 supposedly scap should clean those
[14:08:53] <jynus>	 I don't like to have a host with no / space
[14:09:05] <jynus>	 just for a quick check, we can create a task later
[14:09:16] <hashar>	 but /srv/mediawiki has versions all the way done to 1.32.0 from mid 2018
[14:09:24] <hashar>	 s/done/down
[14:09:47] <jynus>	 can I delete for example all 1.32.0 versions or it is dangerous to do it on only 1 host?
[14:09:57] <hashar>	 ah no those are empty directories
[14:09:58] <jynus>	 or maybe can be done from scap?
[14:10:17] <hashar>	 I guess scap does not delete the cache/l10n empty directories
[14:10:33] <jynus>	 I see
[14:10:36] <hashar>	 !log mw1319: scap pull
[14:10:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:11:47] <hashar>	 for mwdebug1001 I can't tell
[14:11:52] <wikibugs>	 (CR) JMeybohm: [C: +2] "> Patch Set 1:" [debs/helmfile] - https://gerrit.wikimedia.org/r/619471 (owner: JMeybohm)
[14:12:08] <jynus>	 there was a recent increase in bytes used at 14h
[14:12:17] <jynus>	 other hosts not affected, but mwdebug was
[14:12:39] <wikibugs>	 (PS2) Herron: lists: make lists1001 primary mailman host [puppet] - https://gerrit.wikimedia.org/r/619354 (https://phabricator.wikimedia.org/T224586)
[14:13:33] <wikibugs>	 (CR) Kormat: [C: -2] "The current version of wmfmariadbpy (0.2) will pull in cumin on all db hosts, so this should not be merged until that is fixed." [puppet] - https://gerrit.wikimedia.org/r/619443 (https://phabricator.wikimedia.org/T259516) (owner: Kormat)
[14:13:35] <jynus>	 any suggestion, hashar to proceed?
[14:13:48] <hashar>	 for mwdebug1001?  I don't know I haven't looked into it
[14:13:55] <jynus>	 is it safe to remove the cache of old versions?
[14:13:56] <hashar>	 I just had scap alerting about mw1319 but it seems all fine 
[14:14:02] <hashar>	 the old caches yeah
[14:14:14] <hashar>	 seems like there are just left over l10n cache directories
[14:14:30] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:14:41] <jynus>	 the problem is if that is not fixed, deploys will fail
[14:14:44] <wikibugs>	 (Merged) jenkins-bot: Set helm extra args early [debs/helmfile] - https://gerrit.wikimedia.org/r/619471 (owner: JMeybohm)
[14:14:57] <wikibugs>	 (PS1) Kormat: Update remote execution libraries from transferpy [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/619476 (https://phabricator.wikimedia.org/T259516)
[14:15:51] <jynus>	 but 1319 should be ok, filesystem-wise
[14:16:20] <wikibugs>	 (PS3) Herron: lists: make lists1001 primary mailman host [puppet] - https://gerrit.wikimedia.org/r/619354 (https://phabricator.wikimedia.org/T224586)
[14:16:30] <jynus>	 but the cache is KBs only, so I don't think that is the issue
[14:18:09] <jynus>	 each release is 6-8GB, and it doesn't fit on the 49G available for /srv
[14:18:39] <hashar>	 yeah
[14:18:53] <jynus>	 let me see if there is something I can do about it
[14:19:22] <jynus>	 it is a vm, so maybe the fs can be extended
[14:19:42] <hashar>	 6690	./php-1.35.0-wmf.40
[14:19:42] <hashar>	 6702	./php-1.35.0-wmf.41
[14:19:46] <hashar>	 they should have been purged 
[14:19:48] <hashar>	 bah
[14:20:23] <jynus>	 yeah, but it is just KB, minor issue we can report later
[14:20:34] <hashar>	 na those are in mB ;)
[14:20:37] <hashar>	 err
[14:20:38] <hashar>	 MB
[14:20:49] <hashar>	 we have some step to delete the old versions on tuesday
[14:20:54] <hashar>	 but apparently that does not occur bah
[14:23:14] <_joe_>	 so we currently have 
[14:23:16] <_joe_>	 uhm
[14:23:23] <_joe_>	 6 versions live?
[14:23:28] <_joe_>	 that seems excessive
[14:24:32] <hashar>	 in the deploy doc we have:
[14:24:54] <hashar>	 "Decide what old stuff to prune":  find /srv/mediawiki-staging -mindepth 2 -maxdepth 2 -type f -path './php-*/README' -ctime +7 -exec dirname {} \;
[14:25:10] <hashar>	 which yielded nothing for me earlier today
[14:25:16] <hashar>	 so I just moved to the next steps
[14:25:33] <hashar>	 else for all those old versions we need to run: scap clean --delete <VERSION HERE>
[14:25:40] <hashar>	 which apparently hasn't been done for the last few trains
[14:25:51] <_joe_>	 that's the real issue
[14:25:57] <_joe_>	 we still have, on all appservers
[14:26:20] <_joe_>	 php-1.35.0-wmf.40 php-1.35.0-wmf.41 php-1.36.0-wmf.1 php-1.36.0-wmf.2 php-1.36.0-wmf.3 php-1.36.0-wmf.4
[14:26:33] <_joe_>	 that's at least 3 releases too many
[14:26:42] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw1312 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:26:50] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on snapshot1008 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:26:52] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2311 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:26:53] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2372 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:26:54] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mwdebug2002 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:26:58] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2310 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:26:58] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2324 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:27:02] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on wtp2004 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:27:03] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on wtp2014 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:27:20] <logmsgbot>	 !log hashar@deploy1001 sync aborted: testwikis wikis to 1.36.0-wmf.4 (duration: 72m 36s)
[14:27:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:27:28] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw1351 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:27:28] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw1302 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:27:28] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on wtp1040 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:27:28] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mwdebug1002 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:27:40] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw1275 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:27:40] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw1327 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:27:43] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2253 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:27:48] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mwdebug1001 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:27:56] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw1363 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:27:56] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2136 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:28:06] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw1412 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:28:06] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw1300 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:28:06] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2273 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:28:07] <_joe_>	 hashar: I can just remove the dirs for you
[14:28:32] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2317 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:28:36] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2292 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:28:38] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on wtp1027 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:28:44] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw1297 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:28:44] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2296 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:28:44] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2221 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:29:03] <hashar>	 "Decide what old stuff to prune":  find /srv/mediawiki-staging -mindepth 2 -maxdepth 2 -type f -path './php-*/README' -ctime +7 -exec dirname {} \;
[14:29:04] <hashar>	 ahah
[14:29:05] <hashar>	 found it
[14:29:08] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw1368 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:29:08] <hashar>	 README no longer exists
[14:29:10] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on wtp2011 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:29:16] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw1362 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:29:17] <hashar>	 it is README.md :-(
[14:29:18] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2218 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:29:46] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2208 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:29:48] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on wtp1041 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:30:12] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on snapshot1005 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:30:21] <hauskatze>	 rm README.md ?
[14:30:24] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2204 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:30:29] <hashar>	 !log Cleaning old MediaWiki versions that were never removed 
[14:30:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:30:38] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw1402 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:30:44] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw1385 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:30:46] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw1328 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:30:46] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2331 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:30:46] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2362 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:30:46] <wikibugs>	 (PS1) JMeybohm: Fix distribution in changelog [debs/helmfile] - https://gerrit.wikimedia.org/r/619478
[14:30:50] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2258 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:30:53] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw1290 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:30:53] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2247 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:30:53] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2137 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:31:00] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2266 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:31:02] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2271 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:31:03] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2209 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:31:03] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2211 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:31:12] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw1410 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:31:16] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2329 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:31:23] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2325 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:31:32] <_joe_>	 hnowlan: we need to make that grace period longer
[14:31:52] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2260 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:32:02] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw1354 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:32:03] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2219 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:32:10] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2190 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:32:14] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw1375 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:32:14] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw1349 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:32:19] <hnowlan>	 _joe_: ack, will fix
[14:32:27] <_joe_>	 hashar: another thing I noticed: php-1.36.0-wmf.3 is 1.8 GB larger than previous versions
[14:32:37] <_joe_>	 any idea why?
[14:32:48] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on snapshot1008 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:32:49] <hashar>	 webpack / js dependencies being added with npm ?
[14:32:54] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2324 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:32:54] <hashar>	 more seriously, I don't know
[14:32:59] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Fix distribution in changelog [debs/helmfile] - 10https://gerrit.wikimedia.org/r/619478 (owner: 10JMeybohm)
[14:33:14] <_joe_>	 hashar: you know who might know?
[14:33:35] <hashar>	 well
[14:33:38] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw1327 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:33:40] <hashar>	 I don't know anything about mediawiki anymore
[14:33:52] <hashar>	 so short of filing a task, I guess I am not that helpful on that front ;]
[14:34:35] <_joe_>	 cache went from 5.5G to 7.2G
[14:34:37] <_joe_>	 lol
[14:35:00] <_joe_>	 the l10n cache is 7.2 GB
[14:36:04] <wikibugs>	 (03CR) 10Kormat: [C: 03+1] control-mariadb-10.4*: Update package version [software] - 10https://gerrit.wikimedia.org/r/619462 (owner: 10Marostegui)
[14:37:02] <papaul>	 !log replacing msw-b5,b6,b7 and b8
[14:37:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:38:22] <_joe_>	 hashar: sorry any idea what these .~tmp~ directories are?
[14:38:33] <_joe_>	 it seems we keep two identical copies of the same l10n cache
[14:39:50] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2136 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:39:58] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw1412 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:39:58] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2273 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:40:26] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2317 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:40:30] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2292 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:40:31] <logmsgbot>	 !log hashar@deploy1001 Pruned MediaWiki: 1.35.0-wmf.40 (duration: 10m 24s)
[14:40:32] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on wtp1027 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:40:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:40:38] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw1297 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:40:38] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2296 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:40:38] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2221 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:40:45] <hnowlan>	 _joe_: is 2 hours a reasonable window? 
[14:41:02] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw1368 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:41:02] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on wtp2011 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:41:10] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw1362 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:41:12] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2218 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:41:42] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2208 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:41:43] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on wtp1041 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:42:08] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on snapshot1005 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:42:20] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2204 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:42:34] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw1402 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:42:40] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw1385 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:42:42] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw1328 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:42:43] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2331 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:42:43] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2362 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:42:44] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2258 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:42:48] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw1290 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:42:50] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2247 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:42:50] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2137 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:42:56] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2271 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:42:56] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2266 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:42:56] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2211 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:42:56] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2209 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:43:06] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw1410 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:43:10] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2329 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:43:17] <hashar>	 _joe_: sorry was busy. I guess we can file your finding as a train blocker ( https://phabricator.wikimedia.org/T257972 )
[14:43:18] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2325 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:43:35] <wikibugs>	 (03CR) 10Gilles: "Thiemo, your behaviour constitutes harassment and you are repeatedly breaching the trolling clause of the Code of Conduct:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599284 (https://phabricator.wikimedia.org/T252108) (owner: 10Gilles)
[14:43:46] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2260 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:43:56] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw1354 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:43:56] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2219 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:44:06] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2190 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:44:08] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw1349 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:44:08] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw1375 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:44:33] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw1312 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:44:44] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2311 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:44:44] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2372 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:44:46] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mwdebug2002 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:44:50] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2310 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:44:53] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on wtp2004 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:44:53] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on wtp2014 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:45:18] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw1351 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:45:18] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on wtp1040 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:45:18] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw1302 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:45:32] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw1275 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:45:34] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2253 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:45:46] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw1363 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:45:56] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw1300 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:46:56] <icinga-wm>	 RECOVERY - Disk space on mwdebug1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mwdebug1001&var-datasource=eqiad+prometheus/ops
[14:47:52] <logmsgbot>	 !log hashar@deploy1001 Pruned MediaWiki: 1.35.0-wmf.41 (duration: 04m 20s)
[14:47:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:48:26] <jayme>	 !log imported helmfile_0.125.2-1 to buster-wikimedia, jessie-wikimedia, stretch-wikimedia
[14:48:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:18] <logmsgbot>	 !log hashar@deploy1001 Pruned MediaWiki: 1.36.0-wmf.1 (duration: 02m 07s)
[14:50:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:14] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mwdebug1002 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:51:38] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mwdebug1001 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:51:39] <logmsgbot>	 !log otto@deploy1001 Started deploy [analytics/refinery@35c4430]: Deploying to an-launcher1002 to get camus wrapper script changes - T251935
[14:51:41] * hashar files more tasks
[14:51:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:44] <stashbot>	 T251935: All EventGate instances should use EventStreamConfig - https://phabricator.wikimedia.org/T251935
[14:52:20] <hashar>	 _joe_: jynus: for the old l10n cache directories I have filed a task we will act on eventually I guess ( https://phabricator.wikimedia.org/T260146 )
[14:52:40] <hashar>	 and I cleaned the old versions
[14:52:54] <logmsgbot>	 !log otto@deploy1001 Finished deploy [analytics/refinery@35c4430]: Deploying to an-launcher1002 to get camus wrapper script changes - T251935 (duration: 01m 14s)
[14:52:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:54:55] <wikibugs>	 (03PS1) 10Hnowlan: check_mw_versions: increase grace period from 1 hour to 2 [puppet] - 10https://gerrit.wikimedia.org/r/619482
[14:55:08] <_joe_>	 hashar: what's not clear to me is why we have those .~tmp~ directories
[14:55:29] <hashar>	 used by the l10n cache / cdb thingie
[14:55:35] <hashar>	 I can't remember off hand how it works
[14:55:54] <hashar>	 but I guess they are all generated to that .~tmp~ sub directory then once generated moved to replace the old ones
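(Editor's note: hashar's guess above describes a common "build in a temporary subdirectory, then move into place" pattern. The following is only a minimal sketch of that general pattern, not the actual l10n cache or scap code. The function name `rebuild_cache` and the file names are hypothetical; only the `.~tmp~` directory name comes from the log.)

```python
import os
import shutil
import tempfile


def rebuild_cache(cache_dir, files):
    """Write new cache files into a temp subdirectory, then move them into place.

    `files` maps a file name to its bytes content. Hypothetical helper,
    for illustration only.
    """
    tmp_dir = os.path.join(cache_dir, ".~tmp~")
    os.makedirs(tmp_dir, exist_ok=True)

    # Step 1: generate everything under the temp subdirectory, so readers
    # of cache_dir never see a half-written file.
    for name, data in files.items():
        with open(os.path.join(tmp_dir, name), "wb") as fh:
            fh.write(data)

    # Step 2: move each finished file over the old one. os.replace() is an
    # atomic rename on POSIX when source and target share a filesystem.
    for name in files:
        os.replace(os.path.join(tmp_dir, name), os.path.join(cache_dir, name))

    # Step 3: remove the now-empty temp subdirectory.
    shutil.rmtree(tmp_dir)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        cache = os.path.join(root, "l10n")
        rebuild_cache(cache, {"example.cdb": b"new contents"})
        print(sorted(os.listdir(cache)))
```

In a scheme like this, copying instead of moving in step 2, or skipping step 3, would leave a second full copy on disk. The log does not establish whether that is what happened here.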
[14:55:57] <jayme>	 !log updated helmfile to 0.125.2-1 on contint* and deploy*
[14:55:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:56:10] <logmsgbot>	 !log hashar@deploy1001 Pruned MediaWiki: 1.36.0-wmf.2 (duration: 04m 15s)
[14:56:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:56:20] <logmsgbot>	 !log hashar@deploy1001 Started scap: (no justification provided)
[14:56:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:56:26] <hashar>	 syncing again bah
[14:58:24] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/618765 (owner: 10Volans)
[14:59:03] <marostegui>	 !log Deploy MCR change on db1116:3318
[14:59:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:17] <wikibugs>	 (03CR) 10Ottomata: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/619472 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata)
[15:01:12] <icinga-wm>	 PROBLEM - Host ps1-b5-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[15:01:24] <icinga-wm>	 RECOVERY - Host ps1-b5-codfw is UP: PING OK - Packet loss = 0%, RTA = 34.46 ms
[15:01:36] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/618766 (owner: 10Volans)
[15:02:48] <wikibugs>	 (03PS2) 10Ottomata: camus::job - pass stream_configs_constraints through to refinery camus wrapper [puppet] - 10https://gerrit.wikimedia.org/r/619472 (https://phabricator.wikimedia.org/T251935)
[15:03:15] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] camus::job - pass stream_configs_constraints through to refinery camus wrapper [puppet] - 10https://gerrit.wikimedia.org/r/619472 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata)
[15:04:41] <wikibugs>	 (03PS3) 10Ottomata: camus::job - pass stream_configs_constraints through to refinery camus wrapper [puppet] - 10https://gerrit.wikimedia.org/r/619472 (https://phabricator.wikimedia.org/T251935)
[15:05:04] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] camus::job - pass stream_configs_constraints through to refinery camus wrapper [puppet] - 10https://gerrit.wikimedia.org/r/619472 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata)
[15:05:54] <wikibugs>	 (03PS4) 10Ottomata: camus::job - pass stream_configs_constraints through to refinery camus wrapper [puppet] - 10https://gerrit.wikimedia.org/r/619472 (https://phabricator.wikimedia.org/T251935)
[15:07:08] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[15:07:25] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] wmcs/ceph/backy: add basic backup script, wmcs-backup-instances [puppet] - 10https://gerrit.wikimedia.org/r/619350 (https://phabricator.wikimedia.org/T259192) (owner: 10Andrew Bogott)
[15:07:38] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/24425/an-launcher1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/619472 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata)
[15:08:28] <wikibugs>	 (03PS3) 10Ayounsi: Netbox: add circuits support [software/homer] - 10https://gerrit.wikimedia.org/r/617418
[15:08:58] <wikibugs>	 (03PS2) 10Ayounsi: Configure transport links OSPF based on Netbox data [homer/public] - 10https://gerrit.wikimedia.org/r/617603 (https://phabricator.wikimedia.org/T200277)
[15:10:01] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Netbox: add circuits support [software/homer] - 10https://gerrit.wikimedia.org/r/617418 (owner: 10Ayounsi)
[15:12:31] <wikibugs>	 (03PS1) 10Andrew Bogott: wmcs/ceph/backy fix name of backup script [puppet] - 10https://gerrit.wikimedia.org/r/619486 (https://phabricator.wikimedia.org/T259192)
[15:13:00] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[15:13:56] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] wmcs/ceph/backy fix name of backup script [puppet] - 10https://gerrit.wikimedia.org/r/619486 (https://phabricator.wikimedia.org/T259192) (owner: 10Andrew Bogott)
[15:15:22] <wikibugs>	 (03PS1) 10Ottomata: camus - declare event service specific camus::jobs. [puppet] - 10https://gerrit.wikimedia.org/r/619487 (https://phabricator.wikimedia.org/T251935)
[15:16:32] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] camus - declare event service specific camus::jobs. [puppet] - 10https://gerrit.wikimedia.org/r/619487 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata)
[15:17:36] <wikibugs>	 (03PS2) 10Ottomata: camus - declare event service specific camus::jobs. [puppet] - 10https://gerrit.wikimedia.org/r/619487 (https://phabricator.wikimedia.org/T251935)
[15:18:52] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] camus - declare event service specific camus::jobs. [puppet] - 10https://gerrit.wikimedia.org/r/619487 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata)
[15:18:59] <wikibugs>	 (03PS4) 10Ayounsi: Netbox: add circuits support [software/homer] - 10https://gerrit.wikimedia.org/r/617418
[15:19:25] <wikibugs>	 (03PS3) 10Ottomata: camus - declare event service specific camus::jobs. [puppet] - 10https://gerrit.wikimedia.org/r/619487 (https://phabricator.wikimedia.org/T251935)
[15:20:37] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] camus - declare event service specific camus::jobs. [puppet] - 10https://gerrit.wikimedia.org/r/619487 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata)
[15:20:56] <wikibugs>	 (03PS4) 10Ottomata: camus - declare event service specific camus::jobs. [puppet] - 10https://gerrit.wikimedia.org/r/619487 (https://phabricator.wikimedia.org/T251935)
[15:24:03] <wikibugs>	 10Operations, 10observability, 10User-fgiunchedi: VictorOps behavior on long-ack'd incidents - https://phabricator.wikimedia.org/T259465 (10Bstorm) Are we considering the retrigger to be something implemented for the prod SRE rotation only? On WMCS, we seem pretty ok with manually resolving (small group and...
[15:27:12] <logmsgbot>	 !log hashar@deploy1001 Finished scap: (no justification provided) (duration: 30m 51s)
[15:27:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:29:47] <hashar>	 continuing with group 0
[15:29:57] <wikibugs>	 (03PS1) 10Hashar: group0 wikis to 1.36.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619489
[15:29:59] <wikibugs>	 (03CR) 10Hashar: [C: 03+2] group0 wikis to 1.36.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619489 (owner: 10Hashar)
[15:30:39] <wikibugs>	 (03Merged) 10jenkins-bot: group0 wikis to 1.36.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619489 (owner: 10Hashar)
[15:31:13] <wikibugs>	 (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler1002/24429/an-launcher1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/619487 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata)
[15:33:55] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] "Migration plan:" [puppet] - 10https://gerrit.wikimedia.org/r/619487 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata)
[15:36:31] <logmsgbot>	 !log hashar@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.4
[15:36:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:36:48] <wikibugs>	 10Operations, 10observability, 10User-fgiunchedi: VictorOps behavior on long-ack'd incidents - https://phabricator.wikimedia.org/T259465 (10fgiunchedi) >>! In T259465#6376474, @Bstorm wrote: > Are we considering the retrigger to be something implemented for the prod SRE rotation only? On WMCS, we seem pretty...
[15:37:44] <wikibugs>	 (03PS4) 10Herron: lists: make lists1001 primary mailman host [puppet] - 10https://gerrit.wikimedia.org/r/619354 (https://phabricator.wikimedia.org/T224586)
[15:38:42] <wikibugs>	 (03PS1) 10Hnowlan: api-gateway: serve public traffic over TLS [deployment-charts] - 10https://gerrit.wikimedia.org/r/619490 (https://phabricator.wikimedia.org/T254908)
[15:39:28] <wikibugs>	 10Puppet, 10Beta-Cluster-Infrastructure, 10Product-Infrastructure-Team-Backlog, 10VPS-Projects: Puppet failures on deployment-docker-changeprop01, deployment-docker-cpjobqueue01, deployment-push-notifications01, deployment-docker-mobileapps01, and deployment-docker... - https://phabricator.wikimedia.org/T259812
[15:40:08] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[15:43:58] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:44:30] <icinga-wm>	 PROBLEM - Host mw2208.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:44:30] <icinga-wm>	 PROBLEM - Host mw2210.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:46:00] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[15:47:18] <wikibugs>	 (03PS1) 10Ottomata: camus::job - Fix typo in stream_configs_constraints_opt [puppet] - 10https://gerrit.wikimedia.org/r/619491 (https://phabricator.wikimedia.org/T251935)
[15:47:57] <icinga-wm>	 PROBLEM - Host mw2189.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:48:06] <RhinosF1>	 hashar: FYI https://phabricator.wikimedia.org/T260155#6376588
[15:48:19] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] camus::job - Fix typo in stream_configs_constraints_opt [puppet] - 10https://gerrit.wikimedia.org/r/619491 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata)
[15:48:41] <hashar>	 RhinosF1: ah cool I was about to file it
[15:49:11] <RhinosF1>	 hashar: :) I get emails for anything having their priority changed to UBN
[15:49:31] <hashar>	 ah smart
[15:49:33] <icinga-wm>	 RECOVERY - Host mw2208.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.74 ms
[15:49:33] <icinga-wm>	 RECOVERY - Host mw2210.mgmt is UP: PING OK - Packet loss = 0%, RTA = 38.14 ms
[15:49:56] <RhinosF1>	 hashar: nosey :)
[15:50:38] <hashar>	 made it a blocker to the train
[15:50:48] <RhinosF1>	 Ty
[15:51:12] <wikibugs>	 (03PS1) 10Jdlrobson: Revert "ServiceWiring: Avoid usage of deprecated Title::getSubjectPage()" [skins/MinervaNeue] (wmf/1.36.0-wmf.4) - 10https://gerrit.wikimedia.org/r/619376 (https://phabricator.wikimedia.org/T260155)
[15:51:25] <wikibugs>	 (03PS1) 10JMeybohm: helm: Add wmf-stable helm repo [puppet] - 10https://gerrit.wikimedia.org/r/619493
[15:51:29] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[15:53:03] <icinga-wm>	 RECOVERY - Host mw2189.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.01 ms
[15:53:28] <hashar>	 Jdlrobson: will deploy your fix
[15:53:31] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:53:33] <wikibugs>	 (03CR) 10Hashar: [C: 03+2] Revert "ServiceWiring: Avoid usage of deprecated Title::getSubjectPage()" [skins/MinervaNeue] (wmf/1.36.0-wmf.4) - 10https://gerrit.wikimedia.org/r/619376 (https://phabricator.wikimedia.org/T260155) (owner: 10Jdlrobson)
[15:53:43] <wikibugs>	 10Operations, 10observability, 10User-fgiunchedi: VictorOps behavior on long-ack'd incidents - https://phabricator.wikimedia.org/T259465 (10Bstorm) >>! In T259465#6376519, @fgiunchedi wrote: >  > Good question re: SRE rotation only, I forgot to specify that the setting is unfortunately global per organizatio...
[15:54:01] <icinga-wm>	 PROBLEM - Host wtp2015.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:59:21] <icinga-wm>	 RECOVERY - Host wtp2015.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.76 ms
[16:00:05] <jouncebot>	 godog and _joe_: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Puppet request window(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200811T1600).
[16:00:05] <jouncebot>	 Amir1: A patch you scheduled for Puppet request window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[16:00:13] <Amir1>	 o/
[16:00:41] <papaul>	 msw-b5,b6,b7 and b8 replacement done
[16:03:21] <hashar>	 Jdlrobson: apparently the change will merge in 10 minutes
[16:04:39] <wikibugs>	 10Operations, 10DBA, 10User-Kormat: switchover.py breaks on 10.4 master - https://phabricator.wikimedia.org/T260127 (10Marostegui) p:05Medium→03High Setting it to high as we don't have many "old" masters with 10.4 but we already have some that would use this script: x1, es4, es5...
[16:04:42] <_joe_>	 Amir1: can we do it tomorrow please?
[16:04:57] <_joe_>	 as in tomorrow morning? I was about to leave
[16:05:01] <Amir1>	 sure, no worries. nothing urgent
[16:11:17] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] noc: Remove link to outdated blog [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619129 (https://phabricator.wikimedia.org/T259978) (owner: 10Aklapper)
[16:11:44] <mutante>	 _joe_: ACK, i saw the ping about testreduce. will look. have a good rest of the night
[16:12:21] <herron>	 !log migrating lists.wikimedia.org services from fermium to lists1001 T224586
[16:12:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:12:27] <stashbot>	 T224586: Migrate fermium to Buster - https://phabricator.wikimedia.org/T224586
[16:12:35] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "ServiceWiring: Avoid usage of deprecated Title::getSubjectPage()" [skins/MinervaNeue] (wmf/1.36.0-wmf.4) - 10https://gerrit.wikimedia.org/r/619376 (https://phabricator.wikimedia.org/T260155) (owner: 10Jdlrobson)
[16:13:00] <wikibugs>	 (03CR) 10Herron: [C: 03+2] lists: make lists1001 primary mailman host [puppet] - 10https://gerrit.wikimedia.org/r/619354 (https://phabricator.wikimedia.org/T224586) (owner: 10Herron)
[16:14:39] <hashar>	 Jdlrobson: pulling your patch to mwdebug1001
[16:17:55] <wikibugs>	 (03CR) 10RLazarus: [C: 03+2] web_testing: Remove the apache-fast-test placeholder [puppet] - 10https://gerrit.wikimedia.org/r/618602 (owner: 10RLazarus)
[16:18:09] <wikibugs>	 10Operations, 10SRE-Access-Requests: Add new SSH key for Neil Shah-Quinn - https://phabricator.wikimedia.org/T260160 (10nshahquinn-wmf)
[16:19:13] <wikibugs>	 10Operations, 10SRE-Access-Requests: Add new SSH key for Neil Shah-Quinn - https://phabricator.wikimedia.org/T260160 (10nshahquinn-wmf)
[16:20:29] <wikibugs>	 (03CR) 10Dzahn: "aww. man. sorry, this is not the puppet-repo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619129 (https://phabricator.wikimedia.org/T259978) (owner: 10Aklapper)
[16:20:54] <wikibugs>	 (03PS1) 10Dzahn: Revert "noc: Remove link to outdated blog" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619377
[16:21:32] <wikibugs>	 (03PS1) 10Ottomata: Refine - bump version to 0.0.132, but default to not merging Hive schemas [puppet] - 10https://gerrit.wikimedia.org/r/619496 (https://phabricator.wikimedia.org/T259924)
[16:21:51] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] Revert "noc: Remove link to outdated blog" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619377 (owner: 10Dzahn)
[16:22:34] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] releases: Remove absent resources [puppet] - 10https://gerrit.wikimedia.org/r/619434 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm)
[16:22:57] <wikibugs>	 10Operations, 10ops-codfw, 10netops: (Need by:  ) codfw:rack/setup/new management switches - https://phabricator.wikimedia.org/T253154 (10Papaul)
[16:23:46] <hashar>	 Jdlrobson: bah will do after sorry
[16:24:22] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission-hardware, and 2 others: decom cloudvirt1015 - https://phabricator.wikimedia.org/T257366 (10nskaggs) p:05Triage→03Medium a:05Jclark-ctr→03Andrew
[16:25:32] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Refine - bump version to 0.0.132, but default to not merging Hive schemas [puppet] - 10https://gerrit.wikimedia.org/r/619496 (https://phabricator.wikimedia.org/T259924) (owner: 10Ottomata)
[16:26:44] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:28:54] <wikibugs>	 (03CR) 10Nuria: [C: 03+1] "Nice, I see, there is no need to revert" [puppet] - 10https://gerrit.wikimedia.org/r/619496 (https://phabricator.wikimedia.org/T259924) (owner: 10Ottomata)
[16:29:36] <wikibugs>	 (03PS6) 10Hnowlan: Add discovery and disabled LVS components for API gateway [puppet] - 10https://gerrit.wikimedia.org/r/615512 (https://phabricator.wikimedia.org/T254908)
[16:31:09] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1013 with 10G interfaces - https://phabricator.wikimedia.org/T243414 (10nskaggs) p:05Triage→03Low
[16:32:34] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:38:31] <wikibugs>	 10Operations, 10ops-eqiad, 10netops: cloudflare CLF-20200806 dmarc to router patch - https://phabricator.wikimedia.org/T259923 (10Cmjohnson) @ayounsi  the cross-connect has been completed but I am not seeing any light from the demarc panel
[16:44:13] <wikibugs>	 (03PS1) 10Herron: lists: lists1001 enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/619498 (https://phabricator.wikimedia.org/T224586)
[16:46:34] <wikibugs>	 (03PS1) 10Hnowlan: wmnet: add api-gateway records [dns] - 10https://gerrit.wikimedia.org/r/619499 (https://phabricator.wikimedia.org/T254908)
[16:46:55] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] wmnet: add api-gateway records [dns] - 10https://gerrit.wikimedia.org/r/619499 (https://phabricator.wikimedia.org/T254908) (owner: 10Hnowlan)
[16:47:40] <wikibugs>	 (03CR) 10Herron: [C: 03+2] lists: lists1001 enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/619498 (https://phabricator.wikimedia.org/T224586) (owner: 10Herron)
[16:48:23] <wikibugs>	 10Operations, 10ops-eqiad, 10netops: cloudflare CLF-20200806 dmarc to router patch - https://phabricator.wikimedia.org/T259923 (10Cmjohnson) just to confirm,  I double-checked the Equinix completion email and verified the ports (11,12) and the patch number (21504182-A).  I also verified the sfp+ and fiber ar...
[16:49:08] <hashar>	 bah back
[16:49:36] <hashar>	 let's deploy the hotfix for T260155
[16:49:37] <stashbot>	 T260155: PHP Fatal error: Uncaught Error: Call to undefined method TitleValue::isSubpage() in /srv/mediawiki/php-1.36.0-wmf.4/skins/MinervaNeue/includes/Skins/SkinUserPageHelper.php:55 in /srv/mediawiki/php-1.36.0-wmf.4/skins/MinervaNeue/includes/Skins/SkinUserPageHelper.php on line 55 - https://phabricator.wikimedia.org/T260155
[16:51:21] <wikibugs>	 (03PS2) 10Hnowlan: wmnet: add api-gateway records [dns] - 10https://gerrit.wikimedia.org/r/619499 (https://phabricator.wikimedia.org/T254908)
[16:51:38] <wikibugs>	 10Operations, 10SRE-Access-Requests: Add new SSH key for Neil Shah-Quinn - https://phabricator.wikimedia.org/T260160 (10nshahquinn-wmf)
[16:51:44] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:53:17] <logmsgbot>	 !log hashar@deploy1001 Synchronized php-1.36.0-wmf.4/skins/MinervaNeue/: Revert "ServiceWiring: Avoid usage of deprecated Title::getSubjectPage()" - T260155 (duration: 01m 06s)
[16:53:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:53:26] <hashar>	 that should fix the icinga alert
[16:56:02] <wikibugs>	 (03PS1) 10Cmjohnson: updating mgmt ip to reflect correct asset tag cloudcephosd host [dns] - 10https://gerrit.wikimedia.org/r/619503 (https://phabricator.wikimedia.org/T251619)
[16:56:10] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] updating mgmt ip to reflect correct asset tag cloudcephosd host [dns] - 10https://gerrit.wikimedia.org/r/619503 (https://phabricator.wikimedia.org/T251619) (owner: 10Cmjohnson)
[16:56:59] <wikibugs>	 (03Abandoned) 10Cmjohnson: updating mgmt ip to reflect correct asset tag cloudcephosd host [dns] - 10https://gerrit.wikimedia.org/r/619503 (https://phabricator.wikimedia.org/T251619) (owner: 10Cmjohnson)
[16:58:25] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[16:58:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:59:12] <wikibugs>	 (03PS7) 10Hnowlan: Add discovery and disabled LVS components for API gateway [puppet] - 10https://gerrit.wikimedia.org/r/615512 (https://phabricator.wikimedia.org/T254908)
[17:00:04] <jouncebot>	 halfak and accraze: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200811T1700).
[17:00:42] <wikibugs>	 (03PS1) 10Elukey: admin: add new ssh key for neilpquinn-wmf [puppet] - 10https://gerrit.wikimedia.org/r/619505 (https://phabricator.wikimedia.org/T260160)
[17:01:08] <wikibugs>	 (03PS1) 10Dbarratt: Grant all users in the checkuser group the investigate right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619506 (https://phabricator.wikimedia.org/T260171)
[17:01:28] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[17:02:00] <wikibugs>	 10Operations, 10Patch-For-Review: Migrate fermium to Buster - https://phabricator.wikimedia.org/T224586 (10herron) 05Open→03Resolved a:03herron lists.wikimedia.org is now running from the buster host lists1001.wikimedia.org.  Fermium (the old lists host) has been shut down (via gnt-instance shutdown) and...
[17:02:02] <wikibugs>	 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10herron)
[17:02:08] <wikibugs>	 (03CR) 10Dbarratt: Grant all users in the checkuser group the investigate right (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619506 (https://phabricator.wikimedia.org/T260171) (owner: 10Dbarratt)
[17:03:30] <wikibugs>	 (03PS2) 10Elukey: admin: add new ssh key for neilpquinn-wmf [puppet] - 10https://gerrit.wikimedia.org/r/619505 (https://phabricator.wikimedia.org/T260160)
[17:04:02] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:04:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:04:59] <wikibugs>	 (03PS1) 10Herron: lists: remove fermium entry from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/619507 (https://phabricator.wikimedia.org/T224586)
[17:05:24] <wikibugs>	 (03CR) 10Tchanders: [C: 03+1] Grant all users in the checkuser group the investigate right (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619506 (https://phabricator.wikimedia.org/T260171) (owner: 10Dbarratt)
[17:06:54] <Amir1>	 herron: thanks for the upgrade!
[17:07:11] <wikibugs>	 (03CR) 10Herron: [C: 03+2] lists: remove fermium entry from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/619507 (https://phabricator.wikimedia.org/T224586) (owner: 10Herron)
[17:07:22] <herron>	 Amir1: np, glad to be off jessie!
[17:09:15] <wikibugs>	 (03PS1) 10Mholloway: Update mobileapps to 2020-08-11-170318-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/619508
[17:11:46] <wikibugs>	 (03CR) 10Mholloway: [C: 03+2] Update mobileapps to 2020-08-11-170318-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/619508 (owner: 10Mholloway)
[17:12:51] <wikibugs>	 (03Merged) 10jenkins-bot: Update mobileapps to 2020-08-11-170318-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/619508 (owner: 10Mholloway)
[17:16:11] <wikibugs>	 (03PS1) 10Mholloway: Update proton to 2020-08-11-170508-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/619509
[17:19:16] <wikibugs>	 (03CR) 10Mholloway: [C: 03+2] Update proton to 2020-08-11-170508-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/619509 (owner: 10Mholloway)
[17:20:27] <wikibugs>	 (03Merged) 10jenkins-bot: Update proton to 2020-08-11-170508-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/619509 (owner: 10Mholloway)
[17:22:04] <wikibugs>	 (03PS1) 10Ppchelko: Create api-gateway-logstream image. [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/619512 (https://phabricator.wikimedia.org/T251812)
[17:25:16] <wikibugs>	 (03PS2) 10Ppchelko: Create api-gateway-logstream image. [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/619512 (https://phabricator.wikimedia.org/T251812)
[17:25:18] <logmsgbot>	 !log mholloway-shell@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'proton' for release 'production' .
[17:25:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:28:48] <logmsgbot>	 !log mholloway-shell@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'proton' for release 'production' .
[17:28:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:31:40] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[17:31:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:32:11] <wikibugs>	 (03PS3) 10Ppchelko: Create api-gateway-logstream image. [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/619512 (https://phabricator.wikimedia.org/T251812)
[17:33:08] <logmsgbot>	 !log mholloway-shell@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'proton' for release 'production' .
[17:33:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:33:14] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] wmnet: add api-gateway records [dns] - 10https://gerrit.wikimedia.org/r/619499 (https://phabricator.wikimedia.org/T254908) (owner: 10Hnowlan)
[17:33:20] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] Add discovery and disabled LVS components for API gateway [puppet] - 10https://gerrit.wikimedia.org/r/615512 (https://phabricator.wikimedia.org/T254908) (owner: 10Hnowlan)
[17:34:36] <wikibugs>	 (03CR) 10Urbanecm: [C: 04-1] "This changes a wrong variable :-). See more in-text." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619506 (https://phabricator.wikimedia.org/T260171) (owner: 10Dbarratt)
[17:36:08] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:36:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:36:28] <hauskatze>	 Niharika: thanks, I was not aware re deprecation of `investigate`
[17:38:48] <logmsgbot>	 !log mholloway-shell@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
[17:38:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:38:52] <wikibugs>	 (03CR) 10Urbanecm: [C: 04-1] Grant all users in the checkuser group the investigate right (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619506 (https://phabricator.wikimedia.org/T260171) (owner: 10Dbarratt)
[17:40:20] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: (Need By:2020-08-17) label/setup/install pki1001 - https://phabricator.wikimedia.org/T259826 (10Cmjohnson)
[17:40:41] <wikibugs>	 (03PS5) 10Ppchelko: Modify api-gateway access logging to conform to schema [deployment-charts] - 10https://gerrit.wikimedia.org/r/619341 (https://phabricator.wikimedia.org/T251812)
[17:42:54] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] "LGTM but I'd be interested to hear what serviceops think of this approach. I'm not sure how else we might do this" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/619512 (https://phabricator.wikimedia.org/T251812) (owner: 10Ppchelko)
[17:43:23] <logmsgbot>	 !log mholloway-shell@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' .
[17:43:23] <logmsgbot>	 !log mholloway-shell@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
[17:43:24] <wikibugs>	 (03PS1) 10Cmjohnson: add production dns for pki1001 [dns] - 10https://gerrit.wikimedia.org/r/619518 (https://phabricator.wikimedia.org/T259826)
[17:43:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:43:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:43:32] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] add production dns for pki1001 [dns] - 10https://gerrit.wikimedia.org/r/619518 (https://phabricator.wikimedia.org/T259826) (owner: 10Cmjohnson)
[17:44:05] <wikibugs>	 (03PS2) 10Cmjohnson: add production dns for pki1001 [dns] - 10https://gerrit.wikimedia.org/r/619518 (https://phabricator.wikimedia.org/T259826)
[17:44:56] <wikibugs>	 (03CR) 10Cmjohnson: [C: 03+2] add production dns for pki1001 [dns] - 10https://gerrit.wikimedia.org/r/619518 (https://phabricator.wikimedia.org/T259826) (owner: 10Cmjohnson)
[17:46:04] <wikibugs>	 10Operations, 10ops-eqiad, 10netops: cloudflare CLF-20200806 dmarc to router patch - https://phabricator.wikimedia.org/T259923 (10Cmjohnson)
[17:48:57] <wikibugs>	 10Operations, 10ops-eqiad: relforge1001's mgmt IP not reachable - https://phabricator.wikimedia.org/T259777 (10Cmjohnson) 05Open→03Resolved Replaced the cable
[17:50:35] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need By: 2020-09-30) rack/setup/install frmx1001 & frdata1002 - https://phabricator.wikimedia.org/T260181 (10RobH)
[17:50:43] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need By: 2020-09-30) rack/setup/install frmx1001 & frdata1002 - https://phabricator.wikimedia.org/T260181 (10RobH)
[17:50:54] <wikibugs>	 (03PS1) 10Ottomata: EventStreamConfig - Remove extraneous mediawiki.api-request stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619523 (https://phabricator.wikimedia.org/T251935)
[17:51:09] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need By: 2020-09-30) rack/setup/install frmx1001 & frdata1002 - https://phabricator.wikimedia.org/T260181 (10RobH)
[17:52:01] <wikibugs>	 (03PS2) 10Ottomata: EventStreamConfig - Remove extraneous mediawiki.api-request stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619523 (https://phabricator.wikimedia.org/T251935)
[17:52:32] <icinga-wm>	 RECOVERY - Host relforge1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms
[17:53:02] <logmsgbot>	 !log mholloway-shell@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' .
[17:53:02] <logmsgbot>	 !log mholloway-shell@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
[17:53:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:53:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:53:45] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] EventStreamConfig - Remove extraneous mediawiki.api-request stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619523 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata)
[17:54:08] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: (Need By:  2020-09-30) rack/setup/install frmx2001.frack.codfw.wmnet, frdata2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T260183 (10RobH)
[17:54:17] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: (Need By:  2020-09-30) rack/setup/install frmx2001.frack.codfw.wmnet, frdata2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T260183 (10RobH)
[17:55:46] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1013 is CRITICAL: instance=kubernetes1013.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[17:56:37] <logmsgbot>	 !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: EventStreamConfig - Remove extraneous mediawiki.api-request stream - T251935 (duration: 01m 01s)
[17:56:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:56:40] <stashbot>	 T251935: All EventGate instances should use EventStreamConfig - https://phabricator.wikimedia.org/T251935
[17:58:28] <wikibugs>	 10Operations, 10ops-eqiad, 10netops: cloudflare CLF-20200806 dmarc to router patch - https://phabricator.wikimedia.org/T259923 (10RobH) We've not gotten confirmation from Cloudflare that they are turned up, I'll email them to let them know we are ready on our end!
[18:00:04] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Morning backport window(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200811T1800).
[18:00:04] <jouncebot>	 RoanKattouw, CindyCicaleseWMF, and davidwbarratt: A patch you scheduled for Morning backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[18:00:11] <RoanKattouw>	 I'll deploy
[18:00:37] <wikibugs>	 (03PS2) 10Catrope: Direct GrowthExperiments help panel questions to mentors on cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618786 (https://phabricator.wikimedia.org/T250235) (owner: 10Gergő Tisza)
[18:00:46] <Pchelolo>	 RoanKattouw: please skip the one from CindyCicaleseWMF, I'll deploy that one
[18:00:46] <wikibugs>	 (03CR) 10Catrope: [C: 03+2] Direct GrowthExperiments help panel questions to mentors on cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618786 (https://phabricator.wikimedia.org/T250235) (owner: 10Gergő Tisza)
[18:00:51] <RoanKattouw>	 OK
[18:01:00] <Pchelolo>	 we're using it for showing how to do that
[18:01:28] <wikibugs>	 (03Merged) 10jenkins-bot: Direct GrowthExperiments help panel questions to mentors on cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618786 (https://phabricator.wikimedia.org/T250235) (owner: 10Gergő Tisza)
[18:01:36] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes1013 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[18:02:44] <davidwbarratt>	 I'm here!
[18:04:20] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2012 is CRITICAL: instance=kubernetes2012.codfw.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[18:04:28] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2013 is CRITICAL: instance=kubernetes2013.codfw.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[18:04:34] <wikibugs>	 (03PS1) 10Ottomata: camus - replace mediawiki_analytics_events with eventgate-analytics_events job [puppet] - 10https://gerrit.wikimedia.org/r/619532 (https://phabricator.wikimedia.org/T251935)
[18:05:37] <logmsgbot>	 !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Direct GrowthExperiments help panel questions to mentors on cswiki (T250235) (duration: 01m 03s)
[18:05:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:05:40] <stashbot>	 T250235: Scale: pilot help panel with mentorship - https://phabricator.wikimedia.org/T250235
[18:06:01] <RoanKattouw>	 davidwbarratt: Your patch has -1 comments from Martin and I think he's right
[18:06:09] <RoanKattouw>	 $wgGrantPermissions is about OAuth grants (confusingly)
[18:06:30] <wikibugs>	 (03PS2) 10Ottomata: camus - replace mediawiki_analytics_events with eventgate-analytics_events job [puppet] - 10https://gerrit.wikimedia.org/r/619532 (https://phabricator.wikimedia.org/T251935)
[18:07:44] <mutante>	 @seen hashar
[18:07:44] <wm-bot>	 mutante: Last time I saw hashar they were quitting the network with reason: Quit: I am a virus. Please copy paste me in your /quit message to help me propagate N/A at 8/11/2020 5:10:40 PM (57m4s ago)
[18:07:50] <RoanKattouw>	 I think you're probably looking for $wgGroupPermissions
[18:07:57] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops, 10SRE-swift-storage: (Need By: ASAP) rack/setup/install ms-be2057.codfw.wmnet - https://phabricator.wikimedia.org/T260188 (10RobH)
[18:08:12] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops, 10SRE-swift-storage: (Need By: ASAP) rack/setup/install ms-be2057.codfw.wmnet - https://phabricator.wikimedia.org/T260188 (10RobH)
[18:08:23] <RoanKattouw>	 Pchelolo: Go ahead with your path
[18:08:26] <RoanKattouw>	 *patch
[18:08:29] <Pchelolo>	 thank you RoanKattouw
[18:08:59] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/24431/an-launcher1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/619532 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata)
[18:09:18] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops, 10SRE-swift-storage: (Need By: ASAP) rack/setup/install ms-be2057.codfw.wmnet (Test Server - Keep Boxes) - https://phabricator.wikimedia.org/T260188 (10wiki_willy)
[18:09:36] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops, 10SRE-swift-storage: (Need By: ASAP) rack/setup/install ms-be2057.codfw.wmnet (Test Server - Keep Boxes) - https://phabricator.wikimedia.org/T260188 (10RobH) a:03fgiunchedi @fgiunchedi: What racking restrictions and what OS did you have for this incoming test system?...
[18:10:10] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2012 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[18:10:29] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops, 10SRE-swift-storage: (Need By: ASAP) rack/setup/install ms-be2057.codfw.wmnet (Test Server - Keep Boxes) - https://phabricator.wikimedia.org/T260188 (10RobH)
[18:10:33] <wikibugs>	 (03PS4) 10Ppchelko: Configured additional settings for API Portal beta wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619339 (https://phabricator.wikimedia.org/T259569) (owner: 10Cicalese)
[18:10:42] <wikibugs>	 (03CR) 10Ppchelko: [C: 03+2] Configured additional settings for API Portal beta wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619339 (https://phabricator.wikimedia.org/T259569) (owner: 10Cicalese)
[18:11:17] <davidwbarratt>	 ugh
[18:11:30] <wikibugs>	 (03Merged) 10jenkins-bot: Configured additional settings for API Portal beta wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619339 (https://phabricator.wikimedia.org/T259569) (owner: 10Cicalese)
[18:11:42] * Urbanecm waves to davidwbarratt 
[18:11:56] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[18:12:16] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2013 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[18:12:34] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops, 10SRE-swift-storage: (Need By: ASAP) rack/setup/install ms-be2057.codfw.wmnet (Test Server - Keep Boxes) - https://phabricator.wikimedia.org/T260188 (10RobH)
[18:15:48] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[18:16:57] <wikibugs>	 (03PS2) 10Dbarratt: Grant all users on frwiki the checkuser group the investigate right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619506 (https://phabricator.wikimedia.org/T260171)
[18:17:27] <davidwbarratt>	 hey!
[18:17:39] <davidwbarratt>	 I updated the patch: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/619506/
[18:17:56] <davidwbarratt>	 Urbanecm & RoanKattouw ^
[18:18:12] <RoanKattouw>	 Looking
[18:18:24] <logmsgbot>	 !log ppchelko@deploy1001 Synchronized wmf-config/CommonSettings-labs.php: Beta-only: Configured additional settings for API Portal beta wiki gerrit:619339 (duration: 01m 03s)
[18:18:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:18:33] <Pchelolo>	 done with deploy
[18:18:37] <Urbanecm>	 this works, but only for frwiki. Is that what you want davidwbarratt ?
[18:18:45] <davidwbarratt>	 yes, just french wikipedia for now
[18:18:56] <davidwbarratt>	 that's the only other place it's enabled on
[18:18:56] <Urbanecm>	 Pchelolo: fyi, you don't need to sync -labs files :-). Just merge and fetch to deploy1001.
[18:19:02] <davidwbarratt>	 should I fix the merge conflict?
[18:19:07] <wikibugs>	 (03PS3) 10Urbanecm: Grant all users on frwiki the checkuser group the investigate right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619506 (https://phabricator.wikimedia.org/T260171) (owner: 10Dbarratt)
[18:19:14] <Urbanecm>	 davidwbarratt: that needed only a rebase, done
[18:19:14] <Urbanecm>	 LGTM
[18:19:18] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619506 (https://phabricator.wikimedia.org/T260171) (owner: 10Dbarratt)
[18:19:20] <Pchelolo>	 Urbanecm: we were using this one as a demo on how you would deploy )
[18:19:23] <Pchelolo>	 thank you
[18:19:28] <Urbanecm>	 aha :-)
[18:19:46] <Urbanecm>	 fine then, it shouldn't break anything :-)
[18:20:30] <wikibugs>	 (03PS1) 10Cwhite: prometheus: add mediawiki level queries [puppet] - 10https://gerrit.wikimedia.org/r/619539 (https://phabricator.wikimedia.org/T256418)
[18:20:42] <wikibugs>	 (03CR) 10Dzahn: "yes, the directories are gone on releases100*" [puppet] - 10https://gerrit.wikimedia.org/r/619434 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm)
[18:20:44] <RoanKattouw>	 OK I'll deploy then
[18:20:48] <RoanKattouw>	 Thanks for giving me the heads up Pchelolo 
[18:21:02] <wikibugs>	 (03CR) 10Catrope: [C: 03+2] Grant all users on frwiki the checkuser group the investigate right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619506 (https://phabricator.wikimedia.org/T260171) (owner: 10Dbarratt)
[18:21:49] <wikibugs>	 (03Merged) 10jenkins-bot: Grant all users on frwiki the checkuser group the investigate right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619506 (https://phabricator.wikimedia.org/T260171) (owner: 10Dbarratt)
[18:21:58] <davidwbarratt>	 uhh, where is the extension again?
[18:22:03] <davidwbarratt>	 the browser extension
[18:22:12] <wikibugs>	 (03PS2) 10RLazarus: web_testing: Clean up the old class used for apache-fast-test. [puppet] - 10https://gerrit.wikimedia.org/r/618603
[18:22:16] <wikibugs>	 (03PS1) 10Ottomata: camus - include mediawiki.api-request in eventgate-analytics topics to check [puppet] - 10https://gerrit.wikimedia.org/r/619541 (https://phabricator.wikimedia.org/T251935)
[18:23:24] <Urbanecm>	 davidwbarratt: see https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug
[18:23:29] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] camus - include mediawiki.api-request in eventgate-analytics topics to check [puppet] - 10https://gerrit.wikimedia.org/r/619541 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata)
[18:23:56] <davidwbarratt>	 Urbanecm oh great! thanks!
[18:24:11] <RoanKattouw>	 davidwbarratt: Ready for you on mwdebug1002 (note 1002 not 1001) when you're set up
[18:24:22] <davidwbarratt>	 ok, testing now
[18:24:36] <wikibugs>	 (03PS2) 10Ottomata: camus - include mediawiki.api-request in eventgate-analytics topics to check [puppet] - 10https://gerrit.wikimedia.org/r/619541 (https://phabricator.wikimedia.org/T251935)
[18:25:01] <wikibugs>	 (03CR) 10RLazarus: [C: 03+2] web_testing: Clean up the old class used for apache-fast-test. [puppet] - 10https://gerrit.wikimedia.org/r/618603 (owner: 10RLazarus)
[18:25:08] <davidwbarratt>	 perfect! I see it on https://fr.wikipedia.org/wiki/Sp%C3%A9cial:Liste_des_droits_de_groupe !
[18:25:49] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] camus - include mediawiki.api-request in eventgate-analytics topics to check [puppet] - 10https://gerrit.wikimedia.org/r/619541 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata)
[18:28:19] <logmsgbot>	 !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Grant investigate right to checkuser group on frwiki (T260171) (duration: 01m 04s)
[18:28:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:28:22] <stashbot>	 T260171: Fix issues with permissions for Special:Investigate access to checkusers on frwiki - https://phabricator.wikimedia.org/T260171
[18:28:41] <wikibugs>	 (03PS3) 10Ottomata: camus - include mediawiki.api-request in eventgate-analytics topics to check [puppet] - 10https://gerrit.wikimedia.org/r/619541 (https://phabricator.wikimedia.org/T251935)
[18:29:13] <davidwbarratt>	 is it done?
[18:29:50] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] camus - include mediawiki.api-request in eventgate-analytics topics to check [puppet] - 10https://gerrit.wikimedia.org/r/619541 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata)
[18:31:28] <davidwbarratt>	 RoanKattouw ?
[18:31:35] <RoanKattouw>	 Yes sorry
[18:31:44] <davidwbarratt>	 no worries, I still fail at reading the logs. :)
[18:32:29] <wikibugs>	 (03PS4) 10Ottomata: camus - include mediawiki.api-request in eventgate-analytics topics to check [puppet] - 10https://gerrit.wikimedia.org/r/619541 (https://phabricator.wikimedia.org/T251935)
[18:32:43] <davidwbarratt>	 RoanKattouw thank you so much!
[18:33:32] <wikibugs>	 10Operations, 10ops-eqiad, 10netops: cloudflare CLF-20200806 dmarc to router patch - https://phabricator.wikimedia.org/T259923 (10RobH) a:05Cmjohnson→03RobH I just emailed back and forth with Matt @ Cloudflare (he is very prompt in replies!).  They had to put in an EQ order to have a patch placed between...
[18:35:01] <wikibugs>	 10Operations, 10ops-eqiad, 10netops: cloudflare CLF-20200806 dmarc to router patch - https://phabricator.wikimedia.org/T259923 (10RobH) 05Open→03Resolved >>! In T259923#6377508, @RobH wrote: > I just emailed back and forth with Matt @ Cloudflare (he is very prompt in replies!). >  > They had to put in an...
[18:36:10] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] camus - include mediawiki.api-request in eventgate-analytics topics to check [puppet] - 10https://gerrit.wikimedia.org/r/619541 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata)
[18:41:48] <wikibugs>	 (03PS1) 10Ottomata: Re-enable canary for staging eventgate-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/619543
[18:43:11] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Re-enable canary for staging eventgate-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/619543 (owner: 10Ottomata)
[18:44:55] <logmsgbot>	 !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
[18:44:55] <logmsgbot>	 !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' .
[18:44:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:44:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:45:41] <wikibugs>	 10Operations: expired puppet cert on scb1001 - https://phabricator.wikimedia.org/T260094 (10Dzahn)
[18:46:08] <wikibugs>	 (03PS1) 10Ottomata: eventgate-analytics - Use remote EventStreamConfig in eqiad and codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/619544 (https://phabricator.wikimedia.org/T251935)
[18:47:42] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] eventgate-analytics - Use remote EventStreamConfig in eqiad and codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/619544 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata)
[18:48:52] <logmsgbot>	 !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
[18:48:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:55:17] <logmsgbot>	 !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
[18:55:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:00:04] <jouncebot>	 hashar and twentyafterfour: (Dis)respected human, time to deploy Mediawiki train - European+American Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200811T1900). Please do the needful.
[19:11:08] <wikibugs>	 10Operations, 10Performance-Team, 10vm-requests: More RAM needed for webperf1002 and webperf2002 - https://phabricator.wikimedia.org/T260192 (10dpifke)
[19:11:22] <wikibugs>	 (03PS1) 10Ottomata: EventStreamConfig - add streams for eventgate-main [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619552 (https://phabricator.wikimedia.org/T251935)
[19:12:10] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] EventStreamConfig - add streams for eventgate-main [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619552 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata)
[19:13:27] <wikibugs>	 10Operations, 10Performance-Team, 10vm-requests: More RAM needed for webperf1002 and webperf2002 - https://phabricator.wikimedia.org/T260192 (10dpifke)
[19:16:30] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[19:19:42] <icinga-wm>	 PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100%
[19:20:02] <icinga-wm>	 PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: CRITICAL - Destination Unreachable (2607:f6f0:205::153)
[19:22:55] <wikibugs>	 (03PS2) 10Ottomata: EventStreamConfig - add streams for eventgate-main [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619552 (https://phabricator.wikimedia.org/T251935)
[19:23:27] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] EventStreamConfig - add streams for eventgate-main [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619552 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata)
[19:24:22] <wikibugs>	 (03PS3) 10Ottomata: EventStreamConfig - add streams for eventgate-main [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619552 (https://phabricator.wikimedia.org/T251935)
[19:25:14] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] EventStreamConfig - add streams for eventgate-main [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619552 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata)
[19:26:01] <wikibugs>	 (03PS4) 10Ottomata: EventStreamConfig - add streams for eventgate-main [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619552 (https://phabricator.wikimedia.org/T251935)
[19:29:14] <icinga-wm>	 PROBLEM - Check systemd state on aphlict1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:29:40] <ottomata>	 twentyafterfour: hashar you all training?
[19:30:09] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[19:31:08] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on aphlict1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn debugging, not in service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:31:12] <icinga-wm>	 RECOVERY - Check systemd state on aphlict1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:31:48] <icinga-wm>	 RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms
[19:32:02] <icinga-wm>	 RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 2.47 ms
[19:32:02] <ottomata>	 ok, i'm merging my config change, nothing is using it yet
[19:32:08] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] EventStreamConfig - add streams for eventgate-main [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619552 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata)
[19:35:42] <logmsgbot>	 !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: EventStreamConfig - Add streams for eventgate-main - T251935 (duration: 01m 04s)
[19:35:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:35:46] <stashbot>	 T251935: All EventGate instances should use EventStreamConfig - https://phabricator.wikimedia.org/T251935
[19:37:10] <wikibugs>	 (03PS1) 10Ottomata: eventgate-main - use MW EventStreamConfig in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/619554 (https://phabricator.wikimedia.org/T251935)
[19:38:47] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] eventgate-main - use MW EventStreamConfig in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/619554 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata)
[19:40:10] <logmsgbot>	 !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' .
[19:40:10] <logmsgbot>	 !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
[19:40:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:40:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:59:24] <wikibugs>	 (03PS2) 10Cwhite: prometheus: add mediawiki level queries [puppet] - 10https://gerrit.wikimedia.org/r/619539 (https://phabricator.wikimedia.org/T256418)
[20:01:24] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] prometheus: add mediawiki level queries [puppet] - 10https://gerrit.wikimedia.org/r/619539 (https://phabricator.wikimedia.org/T256418) (owner: 10Cwhite)
[20:07:45] <wikibugs>	 (03PS1) 10Cwhite: prometheus: track total hits for mediawiki level queries [puppet] - 10https://gerrit.wikimedia.org/r/619558 (https://phabricator.wikimedia.org/T256418)
[20:07:55] <wikibugs>	 (03PS2) 10Cwhite: prometheus: track total hits for mediawiki level queries [puppet] - 10https://gerrit.wikimedia.org/r/619558 (https://phabricator.wikimedia.org/T256418)
[20:09:06] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] prometheus: track total hits for mediawiki level queries [puppet] - 10https://gerrit.wikimedia.org/r/619558 (https://phabricator.wikimedia.org/T256418) (owner: 10Cwhite)
[20:09:27] <wikibugs>	 (03PS1) 10Dzahn: aphlict: listen on IPv6 instead IPv4 for client and admin ports [puppet] - 10https://gerrit.wikimedia.org/r/619560 (https://phabricator.wikimedia.org/T238593)
[20:10:39] <wikibugs>	 10Operations, 10Phabricator, 10Traffic, 10serviceops, and 2 others: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (10Dzahn) test comment
[20:11:38] <wikibugs>	 (03PS2) 10Dzahn: aphlict: listen on IPv6 instead IPv4 for client and admin ports [puppet] - 10https://gerrit.wikimedia.org/r/619560 (https://phabricator.wikimedia.org/T238593)
[20:12:54] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] aphlict: listen on IPv6 instead IPv4 for client and admin ports [puppet] - 10https://gerrit.wikimedia.org/r/619560 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn)
[20:16:17] <wikibugs>	 10Operations, 10Phabricator, 10Traffic, 10serviceops, and 2 others: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (10mmodell) Will it blend?
[20:16:26] <Krinkle>	 Jdlrobson: ping regarding https://gerrit.wikimedia.org/r/c/mediawiki/core/+/619092 - can roll this out if you're around to fix the two user-facing issues in Echo and sitenotice.
[20:17:03] <Krinkle>	 sitenotice broken for over a week now for some wikis
[20:17:44] <wikibugs>	 (03PS1) 10Andrew Bogott: Mark cloudvirt1015 as spare [puppet] - 10https://gerrit.wikimedia.org/r/619562 (https://phabricator.wikimedia.org/T257366)
[20:19:30] <wikibugs>	 10Operations, 10Phabricator, 10Traffic, 10serviceops, and 2 others: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (10mmodell) it appears to blend.
[20:29:41] <wikibugs>	 (03PS2) 10Andrew Bogott: Mark cloudvirt1015 as spare [puppet] - 10https://gerrit.wikimedia.org/r/619562 (https://phabricator.wikimedia.org/T257366)
[20:32:06] <wikibugs>	 (03PS1) 10Cwhite: prometheus: add config tests [puppet] - 10https://gerrit.wikimedia.org/r/619563 (https://phabricator.wikimedia.org/T256418)
[20:32:31] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] prometheus: add config tests [puppet] - 10https://gerrit.wikimedia.org/r/619563 (https://phabricator.wikimedia.org/T256418) (owner: 10Cwhite)
[20:32:49] <wikibugs>	 (03PS2) 10Cwhite: prometheus: add config tests [puppet] - 10https://gerrit.wikimedia.org/r/619563 (https://phabricator.wikimedia.org/T256418)
[20:34:30] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] prometheus: add config tests [puppet] - 10https://gerrit.wikimedia.org/r/619563 (https://phabricator.wikimedia.org/T256418) (owner: 10Cwhite)
[20:36:23] <wikibugs>	 10Operations, 10Phabricator, 10Traffic, 10serviceops, and 2 others: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (10Dzahn) >>! In T238593#6376059, @CDanis wrote: > The Envoy TLS terminator is now configured to allow websocket upgr...
[20:37:11] <wikibugs>	 10Operations, 10DBA, 10Phabricator: replace phabricator db passwords with longer passwords - https://phabricator.wikimedia.org/T250361 (10mmodell) ugh.  @jcrespo, I apologize, I let the ball drop on this one.  It wouldn't take much effort on my part, we already have the puppet scaffolding to support separati...
[20:44:07] <wikibugs>	 10Operations, 10Phabricator, 10Traffic, 10serviceops, and 2 others: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (10Dzahn) 05Open→03Resolved a:03Dzahn We are seeing realtime notifications again and aphlict is now separated f...
[20:44:24] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:46:20] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:51:09] <icinga-wm>	 RECOVERY - Rate of JVM GC Old generation-s runs - logstash1012-production-logstash-eqiad on logstash1012 is OK: (C)100 gt (W)80 gt 76.27 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1012&panelId=37
[20:51:23] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] "That is the primary change for the switch, the parent changes are dummy ones to cleanup puppet.  We once deployed it but it failed cause s" [puppet] - 10https://gerrit.wikimedia.org/r/611369 (https://phabricator.wikimedia.org/T149924) (owner: 10Hashar)
[20:53:31] <wikibugs>	 10Operations, 10serviceops, 10Platform Team Initiatives (Containerise Services): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10Dzahn) Checking the box for phabricator/aphlict.  aphlict is now running on a dedicated VM, aphlict1001, on buster and nodejs 10....
[20:53:49] <wikibugs>	 10Operations, 10serviceops, 10Platform Team Initiatives (Containerise Services): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10Dzahn)
[20:54:00] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:54:56] <wikibugs>	 10Operations, 10serviceops, 10Platform Team Initiatives (Containerise Services): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10Dzahn) Also checking the box for etherpad. That is also on buster and nodejs10 meanwhile. Upgraded by Alex Kosiaris.
[20:55:10] <wikibugs>	 10Operations, 10serviceops, 10Platform Team Initiatives (Containerise Services): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10Dzahn)
[20:58:09] <wikibugs>	 (03PS1) 10ArielGlenn: cleanup misc dumps that aren't stored in per-date urls [puppet] - 10https://gerrit.wikimedia.org/r/619571 (https://phabricator.wikimedia.org/T257782)
[21:07:48] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:10:44] <wikibugs>	 (03PS1) 10Cwhite: fix flake8 [puppet] - 10https://gerrit.wikimedia.org/r/619572
[21:14:38] <wikibugs>	 (03PS3) 10Cwhite: prometheus: add config tests [puppet] - 10https://gerrit.wikimedia.org/r/619563 (https://phabricator.wikimedia.org/T256418)
[21:17:09] <Jdlrobson>	 Krinkle: it can be backported yes.
[21:18:18] <wikibugs>	 (03CR) 10Krinkle: [C: 03+2] skins: Call headElement() after getTemplateData() in SkinMustache [core] (wmf/1.36.0-wmf.3) - 10https://gerrit.wikimedia.org/r/619092 (https://phabricator.wikimedia.org/T259872) (owner: 10Krinkle)
[21:19:01] * Krinkle gets a new beverage
[21:21:14] * hauskatze drills a hole in Krinkle 's new beverage glass/can
[21:31:53] <wikibugs>	 (03PS1) 10Cwhite: prometheus: remove unnecessary define and split mediawiki queries by channel [puppet] - 10https://gerrit.wikimedia.org/r/619574 (https://phabricator.wikimedia.org/T256418)
[21:31:55] <wikibugs>	 (03PS1) 10Andrew Bogott: Openstack Nova: warn if any flavors are not assigned aggregates [puppet] - 10https://gerrit.wikimedia.org/r/619575
[21:32:05] * Platonides gives Krinkle hauskatze's new jar
[21:32:23] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Openstack Nova: warn if any flavors are not assigned aggregates [puppet] - 10https://gerrit.wikimedia.org/r/619575 (owner: 10Andrew Bogott)
[21:35:57] <wikibugs>	 (03PS2) 10Andrew Bogott: Openstack Nova: warn if any flavors are not assigned aggregates [puppet] - 10https://gerrit.wikimedia.org/r/619575
[21:39:15] <wikibugs>	 (03PS1) 10Jdlrobson: Hide vertical nav-boxes on mobile domain [extensions/MobileFrontend] (wmf/1.36.0-wmf.3) - 10https://gerrit.wikimedia.org/r/619380 (https://phabricator.wikimedia.org/T231160)
[21:39:16] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:39:50] <wikibugs>	 (03PS1) 10Jdlrobson: Hide vertical nav-boxes on mobile domain [extensions/MobileFrontend] (wmf/1.36.0-wmf.4) - 10https://gerrit.wikimedia.org/r/619381 (https://phabricator.wikimedia.org/T231160)
[21:41:06] <wikibugs>	 (03Merged) 10jenkins-bot: skins: Call headElement() after getTemplateData() in SkinMustache [core] (wmf/1.36.0-wmf.3) - 10https://gerrit.wikimedia.org/r/619092 (https://phabricator.wikimedia.org/T259872) (owner: 10Krinkle)
[21:41:12] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:42:15] <Jdlrobson>	 Krinkle: need me to test before syncing?
[21:43:09] <Krinkle>	 staging now,
[21:43:10] <Krinkle>	 yeah
[21:44:18] <Krinkle>	 Jdlrobson: live on mwdebug1002
[21:45:09] <Krinkle>	 I'm also checking https://nl.wikimedia.org/wiki/Home with private and mwdebug1002 and see the button is working there with XWD on
[21:49:33] <Jdlrobson>	 Krinkle: LGTM
[21:51:27] <Krinkle>	 ok, rolling out
[21:52:27] <logmsgbot>	 !log krinkle@deploy1001 Synchronized php-1.36.0-wmf.3/includes/skins/SkinMustache.php: Ibe1f07346, T259872, T259858 (duration: 01m 04s)
[21:52:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:52:31] <stashbot>	 T259858: Sitenotice: Button for dismissing content isn't in the right place and does nothing - https://phabricator.wikimedia.org/T259858
[21:52:31] <stashbot>	 T259872: Echo new message alert has no orange background in vector - https://phabricator.wikimedia.org/T259872
[22:04:44] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[22:08:38] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[22:15:59] <wikibugs>	 (03PS3) 10Andrew Bogott: Openstack Nova: warn if any flavors are not assigned aggregates [puppet] - 10https://gerrit.wikimedia.org/r/619575 (https://phabricator.wikimedia.org/T259542)
[22:16:23] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Openstack Nova: warn if any flavors are not assigned aggregates [puppet] - 10https://gerrit.wikimedia.org/r/619575 (https://phabricator.wikimedia.org/T259542) (owner: 10Andrew Bogott)
[22:19:28] <wikibugs>	 (03PS4) 10Andrew Bogott: Openstack Nova: warn if any flavors are not assigned aggregates [puppet] - 10https://gerrit.wikimedia.org/r/619575 (https://phabricator.wikimedia.org/T259542)
[22:20:28] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Openstack Nova: warn if any flavors are not assigned aggregates [puppet] - 10https://gerrit.wikimedia.org/r/619575 (https://phabricator.wikimedia.org/T259542) (owner: 10Andrew Bogott)
[22:25:15] <wikibugs>	 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10Dzahn)
[22:27:54] <wikibugs>	 10Operations, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review, 10Privacy: Kibana next sending telemetry to elastic.co - https://phabricator.wikimedia.org/T259794 (10Krinkle) Can we set a hard CSP on this domain at the web server level so that in general our report will be "oh no, there's a requ...
[22:30:40] <wikibugs>	 (03PS1) 10Dzahn: mailman: replace fermium with lists1001 in rsync scripts [puppet] - 10https://gerrit.wikimedia.org/r/619585 (https://phabricator.wikimedia.org/T224586)
[22:37:12] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "+1 for icinga-sms.py" [puppet] - 10https://gerrit.wikimedia.org/r/619572 (owner: 10Cwhite)
[22:37:20] <wikibugs>	 (03PS1) 10Dzahn: remove fermium from DHCP,partman and acme_chief [puppet] - 10https://gerrit.wikimedia.org/r/619586 (https://phabricator.wikimedia.org/T224586)
[22:42:49] <wikibugs>	 (03PS1) 10Jdlrobson: Beta cluster: Enable search in header on Vector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619588 (https://phabricator.wikimedia.org/T249363)
[22:46:06] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:47:50] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:00:04] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: Your horoscope predicts another unfortunate Evening backport window(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200811T2300).
[23:00:04] <jouncebot>	 Jdlrobson and kaldari: A patch you scheduled for Evening backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:01:28] <Jdlrobson>	 o/ here
[23:01:44] <Urbanecm>	 I can deploy today!
[23:01:48] <Jdlrobson>	 thanks Urbanecm 
[23:02:11] <kaldari>	 here!
[23:02:24] <kaldari>	 Thank you!
[23:02:45] <wikibugs>	 (03PS2) 10Urbanecm: Switching to updated license definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618586 (owner: 10Kaldari)
[23:02:47] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Switching to updated license definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618586 (owner: 10Kaldari)
[23:03:27] <wikibugs>	 (03Merged) 10jenkins-bot: Switching to updated license definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618586 (owner: 10Kaldari)
[23:03:29] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Hide vertical nav-boxes on mobile domain [extensions/MobileFrontend] (wmf/1.36.0-wmf.3) - 10https://gerrit.wikimedia.org/r/619380 (https://phabricator.wikimedia.org/T231160) (owner: 10Jdlrobson)
[23:03:31] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Hide vertical nav-boxes on mobile domain [extensions/MobileFrontend] (wmf/1.36.0-wmf.4) - 10https://gerrit.wikimedia.org/r/619381 (https://phabricator.wikimedia.org/T231160) (owner: 10Jdlrobson)
[23:04:01] <Urbanecm>	 kaldari: could you test that at mwdebug1001, please?
[23:04:35] <kaldari>	 will do....
[23:06:20] <kaldari>	 Urbanecm: es perfecto!
[23:06:25] <Urbanecm>	 syncing!
[23:07:56] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: 28faa279dacf6a4d6f0a663844e913738c2fa142: Switching to updated license definition (duration: 01m 04s)
[23:07:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:08:01] <Urbanecm>	 kaldari: done!
[23:09:17] <kaldari>	 Thanks! I'll keep an eye on the logs just in case.
[23:10:56] <Jdlrobson>	 Urbanecm: https://gerrit.wikimedia.org/r/c/619588/ is beta cluster only. I think it just needs a +2 ?
[23:11:09] <Urbanecm>	 yup
[23:11:17] <Urbanecm>	 (and git pull at deploy1001)
[23:11:50] <Urbanecm>	 Jdlrobson: should that be merged?
[23:12:32] <Jdlrobson>	 Urbanecm: yes please
[23:12:44] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Beta cluster: Enable search in header on Vector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619588 (https://phabricator.wikimedia.org/T249363) (owner: 10Jdlrobson)
[23:12:48] <Urbanecm>	 Jdlrobson: done!
[23:12:56] <Urbanecm>	 (it will be auto-deployed within 30 minutes)
[23:13:28] <wikibugs>	 (03Merged) 10jenkins-bot: Beta cluster: Enable search in header on Vector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619588 (https://phabricator.wikimedia.org/T249363) (owner: 10Jdlrobson)
[23:16:54] <Jdlrobson>	 thanks for that Urbanecm 
[23:16:59] <Urbanecm>	 happy to help
[23:25:56] <wikibugs>	 (03Merged) 10jenkins-bot: Hide vertical nav-boxes on mobile domain [extensions/MobileFrontend] (wmf/1.36.0-wmf.3) - 10https://gerrit.wikimedia.org/r/619380 (https://phabricator.wikimedia.org/T231160) (owner: 10Jdlrobson)
[23:25:59] <wikibugs>	 (03Merged) 10jenkins-bot: Hide vertical nav-boxes on mobile domain [extensions/MobileFrontend] (wmf/1.36.0-wmf.4) - 10https://gerrit.wikimedia.org/r/619381 (https://phabricator.wikimedia.org/T231160) (owner: 10Jdlrobson)
[23:26:29] <Jdlrobson>	 ready to test on debug :)
[23:27:24] <Urbanecm>	 just wait a sec, doing git-fu :-)
[23:28:30] <Urbanecm>	 Jdlrobson: pulled onto mwdebug1001 :)
[23:31:21] <Jdlrobson>	 Urbanecm: hmm it's not kicking in (but that's fine) and i've just realised why. (face palm)
[23:31:24] <Jdlrobson>	 you can sync that though
[23:31:49] <Urbanecm>	 Jdlrobson: okay. Do you need any follow-up patch?
[23:31:59] <Urbanecm>	 Happy to sync that too, but you'd need to get someone to merge it to master
[23:32:32] <Jdlrobson>	 Urbanecm: could i trouble you for one more config patch?
[23:32:37] <Urbanecm>	 sure!
[23:33:42] <Jdlrobson>	 https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/619592 Update wgMFRemovableClasses [NEW]
[23:33:46] <Jdlrobson>	 i'll add it to wikitech:Deployments
[23:33:51] <wikibugs>	 (03PS1) 10Jdlrobson: Update wgMFRemovableClasses [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619592 (https://phabricator.wikimedia.org/T231160)
[23:33:51] <Urbanecm>	 cool
[23:34:22] <Jdlrobson>	 i didnt realise there was a production override
[23:34:25] <Urbanecm>	 hmm, isn't the value the same as in extension.json?
[23:34:25] <Jdlrobson>	 would have saved a lot of time! :)
[23:34:33] <Jdlrobson>	 yep but arrays dont merge by default
[23:34:39] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Update wgMFRemovableClasses [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619592 (https://phabricator.wikimedia.org/T231160) (owner: 10Jdlrobson)
[23:34:52] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.4/extensions/MobileFrontend/extension.json: 81d54b0ec82d0b78f723f9400031e918a4a143aa: Hide vertical nav-boxes on mobile domain (T231160) (duration: 01m 05s)
[23:34:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:34:55] <Urbanecm>	 I thought about removing the production override, but I'm fine with syncing this too :)
[23:34:55] <Jdlrobson>	 (Associative arrays that is)
[23:34:56] <stashbot>	 T231160: HtmlFormatter incorrectly removes partial classname matches in "xenomobile" or "not-an-navbox" - https://phabricator.wikimedia.org/T231160
[23:35:17] <wikibugs>	 (03Merged) 10jenkins-bot: Update wgMFRemovableClasses [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619592 (https://phabricator.wikimedia.org/T231160) (owner: 10Jdlrobson)
[23:35:18] <Jdlrobson>	 Me too, but '.mbox-image' is not present in MobileFrontend
[23:35:36] <Urbanecm>	 ah, gotcha!
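The exchange above turns on a production override replacing an extension's default array wholesale instead of merging with it, which is why the extension.json change alone did not kick in on mwdebug1001. A minimal Python sketch of that behavior follows. It is only an illustration, not MediaWiki's actual loading code: `resolve_setting` is a hypothetical helper, and the class lists are invented placeholder values, apart from `.mbox-image`, which the chat names as present only in the production override.

```python
from typing import Optional


def resolve_setting(default: dict, override: Optional[dict] = None) -> dict:
    """Return the override as-is when one exists; nothing is merged in from the default."""
    return default if override is None else override


# Hypothetical values: the extension default gains a new class in the backport...
extension_default = {"base": [".navbox", ".vertical-navbox"]}
# ...but the production override was written earlier and does not list it.
production_override = {"base": [".navbox", ".mbox-image"]}

effective = resolve_setting(extension_default, production_override)

# The new default entry is lost until the override itself is updated.
missing = [c for c in extension_default["base"] if c not in effective["base"]]
print("effective:", effective["base"])
print("missing from override:", missing)
```

Under these assumptions the two remedies discussed in the channel are either to add the new class to the override (the patch that was synced) or to drop the override, which would also drop `.mbox-image`.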
[23:36:00] <Urbanecm>	 once the .3 patch is deployed, I'll ping you to test it :)
[23:36:35] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.3/extensions/MobileFrontend/extension.json: c22d65ff9b2439f484ab8ccffed87b00e78c3ad2: Hide vertical nav-boxes on mobile domain (T231160) (duration: 01m 03s)
[23:36:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:37:05] <Urbanecm>	 Jdlrobson: ready for you at mwdebug1001
[23:37:50] <Jdlrobson>	 yay
[23:37:51] <Jdlrobson>	 that did it!
[23:37:53] <Jdlrobson>	 please sync :)
[23:37:57] <Urbanecm>	 wonderful, syncing!
[23:38:13] <Jdlrobson>	 please sync :)
[23:38:15] <Jdlrobson>	 oops sorry
[23:38:17] <Jdlrobson>	 wrong tab :)
[23:39:32] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 0f238f71c95c7bd7534c28abfac759fbb47f674f: Update wgMFRemovableClasses (T231160) (duration: 01m 03s)
[23:39:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:39:44] <Urbanecm>	 Jdlrobson: should be all done :)
[23:39:47] <Urbanecm>	 anything else?
[23:40:48] <Jdlrobson>	 thanks for all your help today Urbanecm !
[23:41:17] <Urbanecm>	 !log Evening B&C window completed
[23:41:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:41:20] <Urbanecm>	 no problem Jdlrobson :)