[00:02:20] RECOVERY - dump of matomo in eqiad on icinga1001 is OK: Last dump for matomo at eqiad (db1108.eqiad.wmnet:3351) taken on 2020-08-11 00:00:01 (0 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [00:05:05] (03PS1) 10Dzahn: Revert "admins: set http_proxy for myself, dzahn" [puppet] - 10https://gerrit.wikimedia.org/r/619367 [00:06:33] (03PS3) 10Dave Pifke: arclamp: configurable email address for cron jobs [puppet] - 10https://gerrit.wikimedia.org/r/619395 [00:08:22] !log releases-jenkins.wikimedia.org currently under maintenance (T247652) [00:08:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:08:26] T247652: replace backends for releases.wikimedia.org with buster VMs - https://phabricator.wikimedia.org/T247652 [00:08:44] (03CR) 10Dzahn: [C: 03+2] Revert "admins: set http_proxy for myself, dzahn" [puppet] - 10https://gerrit.wikimedia.org/r/619367 (owner: 10Dzahn) [00:09:03] (03CR) 10Dave Pifke: "> ah, nice. but you don't need to go one step further and add a lookup() in the parameter and put it in Hiera for labs?" 
[puppet] - 10https://gerrit.wikimedia.org/r/619395 (owner: 10Dave Pifke) [00:10:01] (03CR) 10Dzahn: [C: 03+2] arclamp: configurable email address for cron jobs [puppet] - 10https://gerrit.wikimedia.org/r/619395 (owner: 10Dave Pifke) [00:13:17] (03CR) 10Dzahn: "webperf1002: noop" [puppet] - 10https://gerrit.wikimedia.org/r/619395 (owner: 10Dave Pifke) [00:20:28] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [00:21:28] (03PS1) 10Dzahn: Revert "switch releases.wikimedia.org to buster backends" [dns] - 10https://gerrit.wikimedia.org/r/619368 [00:22:21] (03CR) 10Dzahn: [C: 03+2] Revert "switch releases.wikimedia.org to buster backends" [dns] - 10https://gerrit.wikimedia.org/r/619368 (owner: 10Dzahn) [00:24:10] !log reverting switch of releases.wikimedia.org for today since releases-jenkins.wikimedia.org is tied to it and new jenkins still needs some config and plugins (T247652) [00:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:24:13] T247652: replace backends for releases.wikimedia.org with buster VMs - https://phabricator.wikimedia.org/T247652 [00:25:29] 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-Incident: 15% response start regression as of 2019-11-11 (Varnish->ATS) - https://phabricator.wikimedia.org/T238494 (10Krinkle) [00:26:22] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [00:31:59] !log dpifke@deploy1001 Started deploy [performance/arc-lamp@fc5f1c6]: Deploying latest attempt to fix T259167 [00:32:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:32:02] T259167: Truncated ArcLamp output files - https://phabricator.wikimedia.org/T259167 [00:33:02] !log dpifke@deploy1001 Finished deploy [performance/arc-lamp@fc5f1c6]: Deploying latest attempt to fix T259167 (duration: 01m 03s) [00:33:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:33:26] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:35:18] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 49835896 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:37:18] RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 118752 and 100 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:40:10] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [00:41:18] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:46:08] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [00:46:57] (03CR) 10BPirkle: [C: 03+1] "Looks good, approved for self-merge and deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618646 (https://phabricator.wikimedia.org/T250248) (owner: 10Tim Starling) [00:50:04] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [00:50:44] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:01:53] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [01:09:46] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [01:17:38] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [01:25:30] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [01:32:36] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:42:24] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:55:50] (03CR) 10Tim Starling: [C: 03+2] Enable fastStale mode on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618646 (https://phabricator.wikimedia.org/T250248) (owner: 10Tim Starling) [01:56:42] (03Merged) 10jenkins-bot: Enable fastStale mode on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618646 (https://phabricator.wikimedia.org/T250248) (owner: 10Tim Starling) [01:59:52] !log tstarling@deploy1001 Synchronized wmf-config/PoolCounterSettings.php: enabling fast stale mode T250248 (duration: 00m 58s) [01:59:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:59:55] T250248: Fast stale ParserCache responses on PoolCounter contention - https://phabricator.wikimedia.org/T250248 [02:05:41] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.36.0-wmf.4 [core] (wmf/1.36.0-wmf.4) - 10https://gerrit.wikimedia.org/r/619397 [02:09:23] (03PS2) 10DannyS712: Branch commit for wmf/1.36.0-wmf.4 [core] (wmf/1.36.0-wmf.4) - 10https://gerrit.wikimedia.org/r/619397 (https://phabricator.wikimedia.org/T257972) (owner: 10TrainBranchBot) [02:28:38] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [02:33:36] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:40:22] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [02:41:28] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:52:16] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:00:06] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:19:36] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:25:28] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:32:23] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:37:16] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:42:12] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:47:10] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:53:08] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [04:14:48] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [04:24:40] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [04:32:38] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [04:33:36] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:41:26] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:44:22] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [04:50:20] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:02:18] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:06:16] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:21:18] (03PS1) 10Marostegui: install_sever: Do not reimage dbproxy1019 [puppet] - 10https://gerrit.wikimedia.org/r/619400 [05:21:43] (03CR) 10Marostegui: [C: 03+2] install_sever: Do not reimage dbproxy1019 [puppet] - 10https://gerrit.wikimedia.org/r/619400 (owner: 10Marostegui) [05:32:56] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:38:45] <_joe_> marostegui: have you looked at the fatals? [05:39:07] <_joe_> they've been ongoing for hours and hours [05:39:57] <_joe_> not completely sure why tbh [05:40:49] _joe_: for days even [05:40:52] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:41:01] <_joe_> not with this frequency [05:41:39] <_joe_> and most of the time, it's just due to a bug in parsoid that's known [05:42:15] <_joe_> I'm not sure, though, of the numbers I see in that dashboard [05:42:29] <_joe_> how they're calculated, given logstash has a different picture [05:50:16] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=webperf_arclamp site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:54:12] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:15:38] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [06:23:34] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [06:24:02] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:27:30] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [06:32:18] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:33:20] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [06:37:18] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [06:40:20] (03CR) 10JMeybohm: [C: 03+2] releases: Remove deployment-charts repo [puppet] - 10https://gerrit.wikimedia.org/r/618352 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [06:42:06] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:45:00] (03PS1) 10Ayounsi: Re-add netbox_driven_interfaces feature flag [homer/public] - 10https://gerrit.wikimedia.org/r/619431 [06:45:02] (03PS1) 10Ayounsi: Re-prioritize peering over transit eqiad/esams [homer/public] - 10https://gerrit.wikimedia.org/r/619432 (https://phabricator.wikimedia.org/T259614) [06:45:04] (03PS1) 10JMeybohm: Remove helm repo (index.yaml and chart tars) from git [deployment-charts] - 10https://gerrit.wikimedia.org/r/619433 (https://phabricator.wikimedia.org/T253843) [06:45:06] (03CR) 10jerkins-bot: [V: 04-1] Remove helm repo (index.yaml and chart tars) from git [deployment-charts] - 10https://gerrit.wikimedia.org/r/619433 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [06:45:34] !log Re-prioritize peering over transit eqiad/esams - T259614 [06:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:45:38] T259614: Re-prioritize peering over transit - https://phabricator.wikimedia.org/T259614 [06:47:32] (03CR) 10Ayounsi: [C: 03+2] Re-add netbox_driven_interfaces feature flag [homer/public] - 10https://gerrit.wikimedia.org/r/619431 (owner: 10Ayounsi) [06:47:37] (03CR) 10Ayounsi: [C: 03+2] Re-prioritize peering over transit eqiad/esams [homer/public] - 10https://gerrit.wikimedia.org/r/619432 (https://phabricator.wikimedia.org/T259614) (owner: 10Ayounsi) [06:47:55] (03Merged) 10jenkins-bot: Re-add netbox_driven_interfaces feature flag [homer/public] - 10https://gerrit.wikimedia.org/r/619431 (owner: 10Ayounsi) [06:48:01] (03Merged) 10jenkins-bot: Re-prioritize peering over transit eqiad/esams [homer/public] - 10https://gerrit.wikimedia.org/r/619432 (https://phabricator.wikimedia.org/T259614) (owner: 10Ayounsi) [06:50:01] (03PS2) 10JMeybohm: Remove helm repo (index.yaml and chart tars) from git [deployment-charts] - 10https://gerrit.wikimedia.org/r/619433 (https://phabricator.wikimedia.org/T253843) [06:54:16] !log 
ayounsi@cumin1001 START - Cookbook sre.network.prepare-upgrade [06:54:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:32] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "Burn with fire" [deployment-charts] - 10https://gerrit.wikimedia.org/r/619433 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [06:56:04] (03CR) 10JMeybohm: [C: 03+2] Remove helm repo (index.yaml and chart tars) from git [deployment-charts] - 10https://gerrit.wikimedia.org/r/619433 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [06:56:29] 🔥 [06:57:10] (03Merged) 10jenkins-bot: Remove helm repo (index.yaml and chart tars) from git [deployment-charts] - 10https://gerrit.wikimedia.org/r/619433 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [07:02:52] jouncebot: next [07:02:53] In 0 hour(s) and 57 minute(s): Move muswiki and mhwiktionary (closed wikis) from s3 to s5 (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200811T0800) [07:03:40] 10Operations, 10netops, 10Patch-For-Review: Re-prioritize peering over transit - https://phabricator.wikimedia.org/T259614 (10ayounsi) 05Open→03Resolved All done! 
[07:06:46] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [07:10:38] (03PS4) 10ZPapierski: Replace a query service during data reload [puppet] - 10https://gerrit.wikimedia.org/r/619259 (https://phabricator.wikimedia.org/T259543) [07:10:44] (03PS2) 10ZPapierski: Add a weekly reload job for wcqs data reload [puppet] - 10https://gerrit.wikimedia.org/r/619289 (https://phabricator.wikimedia.org/T251515) [07:12:12] (03CR) 10Giuseppe Lavagetto: helmfile: refactoring blubberoid (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/615498 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [07:13:02] (03PS6) 10Giuseppe Lavagetto: helmfile: refactoring blubberoid [deployment-charts] - 10https://gerrit.wikimedia.org/r/615498 (https://phabricator.wikimedia.org/T258572) [07:18:36] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [07:20:28] (03PS1) 10JMeybohm: releases: Remove absend ressources [puppet] - 10https://gerrit.wikimedia.org/r/619434 (https://phabricator.wikimedia.org/T253843) [07:23:36] (03PS1) 10JMeybohm: helm: Remove obsolete cron ressource [puppet] - 10https://gerrit.wikimedia.org/r/619435 [07:24:32] (03CR) 10Giuseppe Lavagetto: [C: 03+1] releases: Remove absend ressources (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/619434 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [07:25:05] (03CR) 10Giuseppe Lavagetto: [C: 03+1] helm: Remove obsolete cron ressource (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/619435 (owner: 10JMeybohm) [07:26:57] (03CR) 10JMeybohm: [C: 03+1] "Yeah! Let's do it 🎉" [deployment-charts] - 10https://gerrit.wikimedia.org/r/615498 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [07:29:08] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1012-production-logstash-eqiad on logstash1012 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1012&panelId=37 [07:31:30] (03CR) 10JMeybohm: [C: 03+2] helm: Remove obsolete cron ressource [puppet] - 10https://gerrit.wikimedia.org/r/619435 (owner: 10JMeybohm) [07:33:24] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:33:34] (03PS2) 10JMeybohm: helm: Remove obsolete cron resource [puppet] - 10https://gerrit.wikimedia.org/r/619435 [07:34:35] (03PS1) 10Giuseppe Lavagetto: helmfile.d: refactor eventgate-main [deployment-charts] - 
10https://gerrit.wikimedia.org/r/619437 [07:35:05] (03PS2) 10JMeybohm: releases: Remove absent resources [puppet] - 10https://gerrit.wikimedia.org/r/619434 (https://phabricator.wikimedia.org/T253843) [07:35:38] (03CR) 10JMeybohm: [C: 03+2] helm: Remove obsolete cron resource (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/619435 (owner: 10JMeybohm) [07:40:45] (03PS3) 10ZPapierski: Add a weekly reload job for wcqs data reload [puppet] - 10https://gerrit.wikimedia.org/r/619289 (https://phabricator.wikimedia.org/T251515) [07:41:13] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:41:17] (03PS5) 10ZPapierski: Replace a query service during data reload [puppet] - 10https://gerrit.wikimedia.org/r/619259 (https://phabricator.wikimedia.org/T259543) [07:42:02] (03PS4) 10ZPapierski: Add a weekly reload job for wcqs data reload [puppet] - 10https://gerrit.wikimedia.org/r/619289 (https://phabricator.wikimedia.org/T251515) [07:58:46] (03CR) 10Kormat: [C: 03+2] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/619291 (owner: 10Kormat) [08:00:04] marostegui, Urbanecm, and Amir1: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Move muswiki and mhwiktionary (closed wikis) from s3 to s5 deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200811T0800). [08:00:11] o/ [08:00:18] o/ [08:00:22] yet another? hope not so! [08:00:23] o/ [08:00:55] Urbanecm Amir1 we are following this then https://phabricator.wikimedia.org/T259004#6348180 ? [08:01:40] Yes. Ready to turn wikis to read only. 
[08:02:16] yup [08:02:46] I'm here mostly for emotional support, Urbanecm will do the main stuff [08:02:51] Urbanecm: go for it [08:02:56] ack :) [08:03:05] (03PS6) 10Urbanecm: Turn muswiki and mhwiktionary to read-only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618089 (https://phabricator.wikimedia.org/T259004) [08:03:07] (03CR) 10Urbanecm: [C: 03+2] Turn muswiki and mhwiktionary to read-only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618089 (https://phabricator.wikimedia.org/T259004) (owner: 10Urbanecm) [08:03:10] Good luck [08:04:23] (03Merged) 10jenkins-bot: Turn muswiki and mhwiktionary to read-only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618089 (https://phabricator.wikimedia.org/T259004) (owner: 10Urbanecm) [08:04:36] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.prepare-upgrade (exit_code=0) [08:04:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:20] (03CR) 10Hashar: [C: 03+1] "Looks like the Prometeus JMX exporter used to be deployed using scap via this repository operations/software/prometheus_jmx_exporter . The" [software/prometheus_jmx_exporter] - 10https://gerrit.wikimedia.org/r/404224 (https://phabricator.wikimedia.org/T184882) (owner: 10Thcipriani) [08:06:25] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: a04bc1f27e4ef4e38002d546d30bfd2d1dc60d0e: Turn muswiki and mhwiktionary to read-only (T259004) (duration: 01m 01s) [08:06:26] marostegui: wikis should be read-only now. 
[08:06:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:28] T259004: Move muswiki and mhwiktionary (closed wikis) from s3 to s5 - https://phabricator.wikimedia.org/T259004 [08:06:36] Urbanecm: ok, going to proceed [08:06:47] (03Abandoned) 10Hashar: Scap: git_fat -> git_binary_manager [software/prometheus_jmx_exporter] - 10https://gerrit.wikimedia.org/r/404224 (https://phabricator.wikimedia.org/T184882) (owner: 10Thcipriani) [08:06:49] (03PS5) 10ZPapierski: Add a weekly reload job for wcqs data reload [puppet] - 10https://gerrit.wikimedia.org/r/619289 (https://phabricator.wikimedia.org/T251515) [08:10:03] both loaded into codfw, checking stuff and after that, will sanitize sanitarium host [08:10:22] ack [08:12:15] looks good, going to proceed with eqiad [08:13:47] marostegui: ack. Would you mind me preparing the "point to s5" patch at mwdebug hosts, or should I wait with that? [08:14:23] Urbanecm: only codfw I would say [08:15:24] (03CR) 10Aklapper: "Nobody said it's broken... It is outdated and linking to a "blog" makes me have different expectations than getting one single blog *post*" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619129 (https://phabricator.wikimedia.org/T259978) (owner: 10Aklapper) [08:15:34] marostegui: okay, merging and pulling to mwdebug2001 [08:16:24] (03PS4) 10Urbanecm: Point muswiki and mhwiktionary to s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618090 (https://phabricator.wikimedia.org/T259004) [08:16:24] (03CR) 10Urbanecm: [C: 03+2] Point muswiki and mhwiktionary to s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618090 (https://phabricator.wikimedia.org/T259004) (owner: 10Urbanecm) [08:16:39] (03Merged) 10jenkins-bot: Point muswiki and mhwiktionary to s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618090 (https://phabricator.wikimedia.org/T259004) (owner: 10Urbanecm) [08:18:25] marostegui: FYI: the patch is at mwdebug2001 only. 
[08:19:08] Urbanecm: cool, I am proceeding with eqiad hosts [08:19:15] ack [08:23:53] (03PS1) 10Hashar: Add .gitreview file [debs/hue] - 10https://gerrit.wikimedia.org/r/619438 [08:23:55] 10Operations, 10Patch-For-Review: Fix "Blog" link on noc.wikimedia.org - https://phabricator.wikimedia.org/T259978 (10Aklapper) One single blogpost is not the "Blog of the operators of Wikimedia's servers". So that's broken. [08:24:58] all done, doing some checks now [08:24:59] (03PS3) 10Hashar: Add basic Debian packaging [debs/hue] - 10https://gerrit.wikimedia.org/r/618728 (https://phabricator.wikimedia.org/T233073) (owner: 10Elukey) [08:25:20] ack [08:26:32] (03PS2) 10Aklapper: noc: Remove link to outdated blog [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619129 (https://phabricator.wikimedia.org/T259978) [08:27:55] Urbanecm: everything is done, we should be now at this step: Change ./wmf-config/config/muswiki.yaml and ./wmf-config/config/mhwiktionary.yaml and then run composer buildDBLists from https://phabricator.wikimedia.org/T259004#6348180 [08:29:27] thanks! [08:29:34] going to pull that patch to mwdebug1001 [08:30:04] Urbanecm: sounds good, and if possible, let's generate a write for those wikis so I can check they get replicated safely? [08:31:05] marostegui: sure! I'm now verifying the patch works by looking at IP of the master the wikis talk to [08:31:20] excellent [08:33:03] marostegui: in eqiad, it talks to db1100/10.64.32.197, in codfw, it talks to db2123/10.192.16.12. Looks good to me per https://dbtree.wikimedia.org/ [08:33:38] Urbanecm: ips and hostnames are correct [08:33:42] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:34:20] marostegui: great! I'm setting read-only back to false at mwdebug1001, so I can generate a write for you. 
[08:34:32] cool [08:34:45] (03PS1) 10Urbanecm: Revert "Turn muswiki and mhwiktionary to read-only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619369 (https://phabricator.wikimedia.org/T259004) [08:34:51] (03CR) 10Urbanecm: [C: 03+2] Revert "Turn muswiki and mhwiktionary to read-only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619369 (https://phabricator.wikimedia.org/T259004) (owner: 10Urbanecm) [08:35:37] (03Merged) 10jenkins-bot: Revert "Turn muswiki and mhwiktionary to read-only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619369 (https://phabricator.wikimedia.org/T259004) (owner: 10Urbanecm) [08:36:58] marostegui: I created https://mus.wikipedia.org/wiki/User:Martin_Urbanec/This_is_a_test_page (should appear in the page table) [08:37:05] let me check [08:38:55] Urbanecm: looks good, the row is on s5 but not on s3 [08:39:00] can you do the same for the other one? [08:39:02] \o/ [08:39:02] sure! [08:39:40] marostegui: created https://mh.wiktionary.org/wiki/User:Martin_Urbanec/This_is_a_test_page [08:39:46] checking [08:40:42] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1001/24408/" [puppet] - 10https://gerrit.wikimedia.org/r/619296 (owner: 10Filippo Giunchedi) [08:40:48] looks good too [08:40:55] good! [08:40:55] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1001/24408/" [puppet] - 10https://gerrit.wikimedia.org/r/619295 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [08:40:56] also sanitization is working fine as well on labs hosts [08:41:14] so, I think I can sync the shard change to all hosts now! [08:41:25] yep! [08:41:29] doing! [08:41:30] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:42:22] (03CR) 10Kormat: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/619291 (owner: 10Kormat) [08:42:50] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Good idea, but I think the current form is a bit too specific and also leaves the burden of preparing a wheels archive on the developer, w" (034 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/605653 (https://phabricator.wikimedia.org/T259611) (owner: 10Hashar) [08:43:34] !log urbanecm@deploy1001 Synchronized wmf-config/db-codfw.php: 81f4594b6c583f938821549b3a1800fec5b120bb: Point muswiki and mhwiktionary to s5 (T259004; 1/3) (duration: 01m 02s) [08:43:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:39] T259004: Move muswiki and mhwiktionary (closed wikis) from s3 to s5 - https://phabricator.wikimedia.org/T259004 [08:43:54] 10Operations, 10observability: VictorOps behavior on long-ack'd incidents - https://phabricator.wikimedia.org/T259465 (10fgiunchedi) [08:44:48] !log urbanecm@deploy1001 Synchronized wmf-config/db-eqiad.php: 81f4594b6c583f938821549b3a1800fec5b120bb: Point muswiki and mhwiktionary to s5 (T259004; 2/3) (duration: 00m 58s) [08:44:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:58] !log urbanecm@deploy1001 Synchronized dblists/: 81f4594b6c583f938821549b3a1800fec5b120bb: Point muswiki and mhwiktionary to s5 (T259004; 3/3) (duration: 00m 58s) [08:46:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:14] marostegui: shard change should be done. Any final checks before I make the wikis rw? [08:46:31] Urbanecm: let's create one more test for each wiki? [08:46:47] okay! [08:46:56] thank you [08:48:52] marostegui: created https://mus.wikipedia.org/wiki/User:Martin_Urbanec/Foo and https://mh.wiktionary.org/wiki/User:Martin_Urbanec/Foo [08:48:58] checking! 
[08:49:47] looks good, changes are on s5 and not on s3! [08:49:58] cool! [08:50:31] so, looks ready for read-write to me :) [08:50:43] yep! [08:50:47] syncing! [08:52:29] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: e6ec237b6b6fb67a0a80613909589bc724f5eecf: Revert "Turn muswiki and mhwiktionary to read-only" (T259004) (duration: 00m 58s) [08:52:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:32] T259004: Move muswiki and mhwiktionary (closed wikis) from s3 to s5 - https://phabricator.wikimedia.org/T259004 [08:52:53] marostegui: done! [08:53:00] 10Operations, 10observability: VictorOps behavior on long-ack'd incidents - https://phabricator.wikimedia.org/T259465 (10fgiunchedi) The current thinking is to try option #4: ack'd incidents in VO that haven't been resolved within X hours will re-trigger, using X = 12. The normal workflow is sth like this: 1.... [08:53:04] Urbanecm: let's do that same test once more to be fully sure? [08:53:29] marostegui: sure. Would you mind me deleting the pages now (logging table)? [08:53:38] that sounds good [08:54:05] (03CR) 10Giuseppe Lavagetto: [C: 03+2] helmfile: refactoring blubberoid [deployment-charts] - 10https://gerrit.wikimedia.org/r/615498 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [08:54:16] marostegui: https://mus.wikipedia.org/w/index.php?title=User:Martin_Urbanec/This_is_a_test_page and https://mh.wiktionary.org/w/index.php?title=User:Martin_Urbanec/This_is_a_test_page were just deleted [08:54:21] ok, checking [08:55:10] (03Merged) 10jenkins-bot: helmfile: refactoring blubberoid [deployment-charts] - 10https://gerrit.wikimedia.org/r/615498 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [08:55:11] looks good!
[08:55:15] \o/ [08:55:22] 10Operations, 10observability, 10User-fgiunchedi: VictorOps behavior on long-ack'd incidents - https://phabricator.wikimedia.org/T259465 (10fgiunchedi) [08:55:45] So we are done I think! We need to follow up the doc change at https://phabricator.wikimedia.org/T259438 and I am going to create a cleanup task to remove the tables from s3 [08:56:01] yup! I'll change the docs then. Thanks marostegui ! [08:56:09] new phab task: "Drop all tables from s3" [08:56:12] Thaaaaanks so much for driving this! [08:56:30] happy to help! [08:57:02] if you won't need my backups, I will take a break [08:57:12] jynus: I think we are fine, thank you! [08:59:30] marostegui: do we need some follow-up from cloud-services, or should wiki replicas in labs work? [08:59:45] (also, what about analytics replicas?) [08:59:52] Urbanecm: no, there's nothing to do there [09:00:17] okay, cool [09:00:23] Urbanecm: the analytics dbstore might need creating the views though, as it is multi-instance [09:00:24] let me check [09:00:30] not sure if they use views there or not [09:00:52] if you mean the non-cloud ones, no, they don't have views [09:01:01] yeah, just checked, no views there [09:01:06] so we are good [09:01:13] we could run the check-private data after table removal once, just in case [09:01:13] good, thanks for checking that :) [09:01:35] jynus: I am going to run it now for s5, should be quick [09:01:37] !log renewed puppet certificate on scb1001.eqiad.wmnet [09:01:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:36] RECOVERY - puppet last run on scb1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [09:06:59] private data check looks clean on both db1124:3315 and db2094:3315 [09:07:52] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:09:23] ^^this is _not_ related to the shard change AFAICS^^ [09:10:22] not from what I can see no [09:11:28] !log Rename tables on muswiki and mhwiktionary on s3 master (db1123) without replication T260112 [09:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:31] T260112: Remove muswiki and mhwiktionary from s3 - https://phabricator.wikimedia.org/T260112 [09:13:40] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:21:29] 10Operations, 10Fundraising-Backlog, 10MW-1.36-notes (1.36.0-wmf.4; 2020-08-11), 10User-Urbanecm, and 2 others: New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 (10Urbanecm) [09:22:01] 10Operations, 10Fundraising-Backlog, 10MW-1.36-notes (1.36.0-wmf.4; 2020-08-11), 10User-Urbanecm, and 2 others: New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 (10Urbanecm) [09:25:10] !log volans@cumin1001 START - Cookbook sre.dns.netbox [09:25:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:30] (03PS1) 10Kormat: mariadb: Drop check_mariadb.py in favour of packaged version [puppet] - 10https://gerrit.wikimedia.org/r/619443 [09:26:20] (03PS1) 10Elukey: Upgrade the Hadoop test cluster to Bigtop [puppet] - 10https://gerrit.wikimedia.org/r/619444 [09:27:31] (03CR) 10Elukey: [C: 03+2] Upgrade the Hadoop test cluster to Bigtop [puppet] - 10https://gerrit.wikimedia.org/r/619444 (owner: 10Elukey) [09:28:06] (03PS2) 10Kormat: mariadb: Drop 
check_mariadb.py in favour of packaged version [puppet] - 10https://gerrit.wikimedia.org/r/619443 (https://phabricator.wikimedia.org/T259516) [09:29:59] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:30:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:10] (03PS2) 10Gehel: Set up WCQS test server [puppet] - 10https://gerrit.wikimedia.org/r/618059 (owner: 10ZPapierski) [09:33:50] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:34:07] (03CR) 10Gehel: [C: 03+2] Set up WCQS test server [puppet] - 10https://gerrit.wikimedia.org/r/618059 (owner: 10ZPapierski) [09:39:40] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:40:25] (03PS1) 10Ladsgroup: Exclude thankyou wiki for mobile redirect [puppet] - 10https://gerrit.wikimedia.org/r/619446 (https://phabricator.wikimedia.org/T259002) [09:40:52] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:44:33] (03PS3) 10Kormat: mariadb: Drop check_mariadb.py in favour of packaged version [puppet] - 10https://gerrit.wikimedia.org/r/619443 (https://phabricator.wikimedia.org/T259516) [09:48:30] (03PS4) 10Kormat: mariadb: Drop check_mariadb.py in favour of packaged version [puppet] - 10https://gerrit.wikimedia.org/r/619443 (https://phabricator.wikimedia.org/T259516) [09:48:40] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:48:56] (03PS3) 10Filippo Giunchedi: Debian packaging for Grafana plugins [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/618953 (https://phabricator.wikimedia.org/T259143) [09:49:12] (03CR) 10Kormat: "PCC run: https://puppet-compiler.wmflabs.org/compiler1003/24411/" [puppet] - 10https://gerrit.wikimedia.org/r/619443 (https://phabricator.wikimedia.org/T259516) (owner: 10Kormat) [09:50:20] (03CR) 10Filippo Giunchedi: [C: 03+2] grafana: temp disable grafana db sync ahead of upgrade [puppet] - 10https://gerrit.wikimedia.org/r/618069 (https://phabricator.wikimedia.org/T259143) (owner: 10Filippo Giunchedi) [09:51:59] !log elukey@cumin1001 START - Cookbook sre.hadoop.stop-cluster [09:52:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:11] this is the test cluster --^ [09:54:25] [heads up] I'm about to merge the DNS patch to migrate all eqiad mgmt records to netbox generated ones in a bit [09:56:53] (03PS5) 10Volans: mgmt: netbox-generated data for mgmt eqiad [dns] - 10https://gerrit.wikimedia.org/r/617509 (https://phabricator.wikimedia.org/T233183) [09:59:34] (03PS1) 10Giuseppe Lavagetto: deployment_server::helmfile: generate files for new-style helmfile organization [puppet] - 10https://gerrit.wikimedia.org/r/619448 (https://phabricator.wikimedia.org/T258572) [10:00:12] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.stop-cluster (exit_code=0) [10:00:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:39] !log elukey@cumin1001 START - Cookbook sre.hadoop.change-distro-from-cdh [10:01:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:50] (test cluster) [10:02:17] (03PS2) 10Giuseppe Lavagetto: deployment_server::helmfile: generate files for new-style helmfile 
organization [puppet] - 10https://gerrit.wikimedia.org/r/619448 (https://phabricator.wikimedia.org/T258572) [10:04:00] PROBLEM - grafana.wikimedia.org on grafana2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org [10:07:25] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/24413/deploy1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/619448 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [10:07:30] PROBLEM - Check systemd state on grafana2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:11:23] RECOVERY - Check systemd state on grafana2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:11:48] RECOVERY - grafana.wikimedia.org on grafana2001 is OK: HTTP OK: HTTP/1.1 200 OK - 17053 bytes in 0.138 second response time https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org [10:16:29] (03PS1) 10Giuseppe Lavagetto: deployment_server::helmfile: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/619449 [10:19:14] PROBLEM - Check systemd state on grafana2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:19:38] PROBLEM - grafana.wikimedia.org on grafana2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org [10:20:18] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.change-distro-from-cdh (exit_code=0) [10:20:23] (03CR) 10Giuseppe Lavagetto: [C: 03+2] deployment_server::helmfile: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/619449 (owner: 10Giuseppe Lavagetto) [10:20:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:52] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:23:32] RECOVERY - grafana.wikimedia.org on grafana2001 is OK: HTTP OK: HTTP/1.1 200 OK - 57001 bytes in 0.166 second response time https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org [10:25:06] RECOVERY - Check systemd state on grafana2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:26:27] (03PS4) 10Filippo Giunchedi: Debian packaging for Grafana plugins [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/618953 (https://phabricator.wikimedia.org/T259143) [10:29:42] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:32:01] (03PS8) 10Hnowlan: api-gateway: open parts of the admin interface internally [deployment-charts] - 10https://gerrit.wikimedia.org/r/616121 (https://phabricator.wikimedia.org/T254908) [10:32:24] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:34:26] !log volans@cumin1001 START - Cookbook sre.dns.netbox [10:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:13] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:38:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:28] !log migrating *all* eqiad mgmt DNS records to the autogenerated ones via Netbox - T233183 [10:39:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:31] T233183: Automate generation of Management DNS records from Netbox - https://phabricator.wikimedia.org/T233183 [10:39:36] (03PS1) 10Giuseppe Lavagetto: helmfile.d/blubberoid: fix the paths of SRE-controlled values [deployment-charts] - 10https://gerrit.wikimedia.org/r/619450 [10:39:38] (03CR) 10Volans: [C: 03+2] mgmt: netbox-generated data for mgmt eqiad [dns] - 10https://gerrit.wikimedia.org/r/617509 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [10:40:38] (03CR) 10Hnowlan: "> Patch Set 7:" (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/616121 (https://phabricator.wikimedia.org/T254908) (owner: 10Hnowlan) [10:42:12] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:42:23] (03PS1) 10Filippo Giunchedi: profile: switch Grafana plugins to Debian package [puppet] - 10https://gerrit.wikimedia.org/r/619451 (https://phabricator.wikimedia.org/T259143) [10:43:01] (03CR) 10Giuseppe Lavagetto: [C: 03+2] helmfile.d/blubberoid: fix the paths of SRE-controlled values [deployment-charts] - 10https://gerrit.wikimedia.org/r/619450 (owner: 10Giuseppe Lavagetto) [10:44:43] (03Merged) 10jenkins-bot: helmfile.d/blubberoid: fix the paths of SRE-controlled values [deployment-charts] - 10https://gerrit.wikimedia.org/r/619450 (owner: 10Giuseppe Lavagetto) [10:45:52] (03CR) 10Jcrespo: "The patch as it is looks ok, however, given that in some cases this is a paging alert, and given our past with failures deploying new aler" [puppet] - 10https://gerrit.wikimedia.org/r/619443 (https://phabricator.wikimedia.org/T259516) (owner: 10Kormat) [10:49:27] (03CR) 10Hnowlan: [C: 03+2] api-gateway: open parts of the admin interface internally [deployment-charts] - 10https://gerrit.wikimedia.org/r/616121 (https://phabricator.wikimedia.org/T254908) (owner: 10Hnowlan) [10:50:39] (03Merged) 10jenkins-bot: api-gateway: open parts of the admin interface internally [deployment-charts] - 10https://gerrit.wikimedia.org/r/616121 (https://phabricator.wikimedia.org/T254908) (owner: 10Hnowlan) [10:50:49] (03PS1) 10Giuseppe Lavagetto: helmfile.d/blubberoid: fix typo [deployment-charts] - 10https://gerrit.wikimedia.org/r/619452 [10:51:45] (03CR) 10Giuseppe Lavagetto: [C: 03+2] helmfile.d/blubberoid: fix typo [deployment-charts] - 10https://gerrit.wikimedia.org/r/619452 (owner: 10Giuseppe Lavagetto) [10:51:54] (03PS2) 10KartikMistry: Enable Content Translation in Sundanese Wikipedia as a default tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619255 (https://phabricator.wikimedia.org/T258502) [10:52:56] (03Merged) 10jenkins-bot: helmfile.d/blubberoid: fix typo [deployment-charts] - 
10https://gerrit.wikimedia.org/r/619452 (owner: 10Giuseppe Lavagetto) [10:57:02] (03PS1) 10Hnowlan: api-gateway: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/619454 [10:58:32] (03CR) 10Hnowlan: [C: 03+2] api-gateway: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/619454 (owner: 10Hnowlan) [10:59:39] (03Merged) 10jenkins-bot: api-gateway: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/619454 (owner: 10Hnowlan) [11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: Your horoscope predicts another unfortunate European mid-day backport window(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200811T1100). [11:00:05] kart_: A patch you scheduled for European mid-day backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:25] Lemony Snicket’s A Series Of Unfortunate Deploys [11:00:32] * kart_ is here. [11:00:48] I can deploy today, unless kart_ wants to self-service :) [11:01:04] (I’m about to go for lunch so i can’t actually deploy, only make stupid jokes, sorry ^^) [11:01:10] thx Urbanecm [11:01:11] Urbanecm: I can deploy to just make sure I can deploy. No worries :) [11:01:25] go ahead then :) [11:02:30] (wow, is there a plan to have CT by default? nice!) 
[11:02:47] (03CR) 10Ladsgroup: Create dispatch lag alerts for test.wikidata.org (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/618084 (https://phabricator.wikimedia.org/T258374) (owner: 10Michael Große) [11:02:49] (03CR) 10KartikMistry: [C: 03+2] Enable Content Translation in Sundanese Wikipedia as a default tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619255 (https://phabricator.wikimedia.org/T258502) (owner: 10KartikMistry) [11:03:35] (03Merged) 10jenkins-bot: Enable Content Translation in Sundanese Wikipedia as a default tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619255 (https://phabricator.wikimedia.org/T258502) (owner: 10KartikMistry) [11:03:37] Urbanecm: We are moving slowly out-of-beta for Wikis after discussing with community. [11:03:44] (y) [11:03:48] I like that :) [11:03:57] Urbanecm: I'll have a late config patch in a minute [11:03:57] Content Translation is really nice [11:04:14] Majavah: ack, add it to the calendar then :) [11:04:22] sure [11:04:45] (03CR) 10Ladsgroup: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/618084 (https://phabricator.wikimedia.org/T258374) (owner: 10Michael Große) [11:05:16] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:06:22] (03PS1) 10Majavah: labs: Disable TheWikipediaLibrary due to email issues [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619455 (https://phabricator.wikimedia.org/T256297) [11:07:16] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:07:50] Urbanecm: added [11:07:55] ack [11:08:00] it's beta only [11:08:58] !log kartik@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit|619255|Enable ContentTranslation in Sundanese WP as a default tool (T258502)]] (duration: 00m 59s) [11:09:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:01] T258502: Enable Content Translation in Sundanese Wikipedia as a default tool - https://phabricator.wikimedia.org/T258502 [11:09:04] Urbanecm: I'm done with deploy. [11:09:44] kart_: thanks [11:11:16] (03CR) 10Urbanecm: [C: 03+2] "noop for prod" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619455 (https://phabricator.wikimedia.org/T256297) (owner: 10Majavah) [11:11:20] Majavah: done :) [11:11:39] (the magic should do that within 30 minutes) [11:11:47] thanks! [11:12:04] (03Merged) 10jenkins-bot: labs: Disable TheWikipediaLibrary due to email issues [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619455 (https://phabricator.wikimedia.org/T256297) (owner: 10Majavah) [11:15:09] (03PS4) 10Cparle: MediaSearch A/B test on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616530 (https://phabricator.wikimedia.org/T254388) (owner: 10DCausse) [11:16:16] (03PS5) 10Cparle: MediaSearch A/B test on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616530 (https://phabricator.wikimedia.org/T254388) (owner: 10DCausse) [11:17:28] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/619446 (https://phabricator.wikimedia.org/T259002) (owner: 10Ladsgroup) [11:18:17] !log EU B&C window done [11:18:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:44] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . 
[11:30:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:08] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:36:27] 10Operations, 10SRE-tools, 10Traffic, 10Goal, and 3 others: Automate generation of Management DNS records from Netbox - https://phabricator.wikimedia.org/T233183 (10Volans) 05Open→03Resolved a:05crusnov→03Volans All management records are now generated via Netbox, related wikitech documentation upd... [11:37:38] (03PS3) 10Volans: zone_validator: fix private/public detection [dns] - 10https://gerrit.wikimedia.org/r/616873 [11:41:00] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:42:37] !log hnowlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [11:42:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:50] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:52:24] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:54:20] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:54:24] netbox was fixed (context in T260077 ) [11:54:25] T260077: netbox dumps: fix permissions and timestamp - https://phabricator.wikimedia.org/T260077 [11:54:56] !log Install new MariaDB 10.4.14 on db2102 [11:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:44] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/619259 (https://phabricator.wikimedia.org/T259543) (owner: 10ZPapierski) [11:58:23] 10Operations, 10Discovery-Search (Current work): Reshard commonswiki_file elasticsearch index - https://phabricator.wikimedia.org/T260083 (10Gehel) [11:59:17] 10Operations, 10Discovery-Search (Current work): wdqs1009 has puppet changes on each run - https://phabricator.wikimedia.org/T260123 (10Gehel) [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200811T1200) [12:12:20] (03CR) 10Filippo Giunchedi: "To be merged shortly before the upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/619451 (https://phabricator.wikimedia.org/T259143) (owner: 10Filippo Giunchedi) [12:27:30] 10Operations, 10Discovery-Search (Current work): wdqs1009 has puppet changes on each run - https://phabricator.wikimedia.org/T260123 (10Gehel) Note: @Zbyszko is having a look into this as well. 
[12:35:22] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:35:26] 10Operations, 10DBA, 10User-Kormat: switchover.py breaks on 10.4 master - https://phabricator.wikimedia.org/T260127 (10Kormat) [12:35:39] 10Operations, 10DBA, 10User-Kormat: switchover.py breaks on 10.4 master - https://phabricator.wikimedia.org/T260127 (10Kormat) p:05Triage→03Medium a:03Kormat [12:36:49] !log oblivian@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' . [12:36:49] !log oblivian@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'production' . [12:36:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:20] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:42:35] 10Puppet, 10Beta-Cluster-Infrastructure, 10VPS-Projects, 10Product-Infrastructure-Team-Backlog (Kanban): Puppet failures on deployment-docker-changeprop01, deployment-docker-cpjobqueue01, deployment-push-notifications01, deployment-docker-mobileapps01, and deployme... 
- https://phabricator.wikimedia.org/T259812 [12:42:40] (03CR) 10Hashar: [C: 03+2] "Cause well train" [core] (wmf/1.36.0-wmf.4) - 10https://gerrit.wikimedia.org/r/619397 (https://phabricator.wikimedia.org/T257972) (owner: 10TrainBranchBot) [12:44:16] 10Operations, 10DBA, 10User-Kormat, 10User-jbond: Refactor mariadb puppet code - https://phabricator.wikimedia.org/T256972 (10Kormat) [12:47:46] (03PS2) 10Cicalese: Configured additional settings for API Portal beta wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619339 (https://phabricator.wikimedia.org/T259569) [12:48:46] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:52:34] !log uploaded wmfmariadbpy 0.2 packages to apt1001 [12:52:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:43] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:54:59] (03PS3) 10Cicalese: Configured additional settings for API Portal beta wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619339 (https://phabricator.wikimedia.org/T259569) [12:59:44] (03PS1) 10Marostegui: control-mariadb-10.4*: Update package version [software] - 10https://gerrit.wikimedia.org/r/619462 [13:00:04] hashar and twentyafterfour: It is that lovely time of the day again! You are hereby commanded to deploy Mediawiki train - European+American Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200811T1300). 
[13:01:36] (03Merged) 10jenkins-bot: Branch commit for wmf/1.36.0-wmf.4 [core] (wmf/1.36.0-wmf.4) - 10https://gerrit.wikimedia.org/r/619397 (https://phabricator.wikimedia.org/T257972) (owner: 10TrainBranchBot) [13:03:47] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'production' . [13:03:47] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' . [13:03:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:32] (03CR) 10Giuseppe Lavagetto: [C: 03+1] web_testing: Remove the apache-fast-test placeholder [puppet] - 10https://gerrit.wikimedia.org/r/618602 (owner: 10RLazarus) [13:06:38] (03CR) 10Giuseppe Lavagetto: [C: 03+1] web_testing: Clean up the old class used for apache-fast-test. [puppet] - 10https://gerrit.wikimedia.org/r/618603 (owner: 10RLazarus) [13:06:40] 10Operations, 10Phabricator, 10Traffic, 10serviceops, and 2 others: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (10Vgutierrez) hmm that's interesting, please note that this is not the first time we use websockets. etherpad.wm.o i... 
[13:12:36] !log Applied 1.36.0-wmf.4 security patches # T257972 [13:12:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:40] T257972: 1.36.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T257972 [13:13:48] (03PS1) 10Hashar: testwikis wikis to 1.36.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619464 [13:13:50] (03CR) 10Hashar: [C: 03+2] testwikis wikis to 1.36.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619464 (owner: 10Hashar) [13:14:37] (03Merged) 10jenkins-bot: testwikis wikis to 1.36.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619464 (owner: 10Hashar) [13:14:44] !log hashar@deploy1001 Started scap: testwikis wikis to 1.36.0-wmf.4 [13:14:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:37] !log oblivian@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'blubberoid' for release 'production' . [13:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:58] (03PS1) 10CDanis: enable envoy websockets support for role::aphlict [puppet] - 10https://gerrit.wikimedia.org/r/619465 (https://phabricator.wikimedia.org/T238593) [13:23:37] (03CR) 10Vgutierrez: [C: 03+1] enable envoy websockets support for role::aphlict [puppet] - 10https://gerrit.wikimedia.org/r/619465 (https://phabricator.wikimedia.org/T238593) (owner: 10CDanis) [13:23:41] !log oblivian@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'blubberoid' for release 'production' . [13:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:56] (03CR) 10CDanis: [C: 03+2] enable envoy websockets support for role::aphlict [puppet] - 10https://gerrit.wikimedia.org/r/619465 (https://phabricator.wikimedia.org/T238593) (owner: 10CDanis) [13:27:18] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'production' . 
[13:27:19] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' . [13:27:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:48] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:34:10] (03CR) 10Gilles: "Looking at the first image, aawiki.png, you're achieving 42% size reduction with a loss of 69.3% of distinct colors. My patch achieved 52%" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609584 (https://phabricator.wikimedia.org/T252108) (owner: 10Thiemo Kreuz (WMDE)) [13:35:13] (03PS1) 10Jayprakash12345: Enable tewiki as import source for tewikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619371 [13:40:44] (03CR) 10Gilles: "I'm going to deploy this the week of August 24, as I'll be working without interruption the months that follow, ensuring that I can addres" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599284 (https://phabricator.wikimedia.org/T252108) (owner: 10Gilles) [13:41:28] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:48:28] (03CR) 10Volans: [C: 03+2] zone_validator: fix private/public detection [dns] - 10https://gerrit.wikimedia.org/r/616873 (owner: 10Volans) [13:48:58] 10Operations, 10Phabricator, 10Traffic, 10serviceops, and 2 others: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (10CDanis) The Envoy TLS terminator is now configured to allow websocket upgrades -- however, it's improperly configu... [13:51:04] (03CR) 10ZPapierski: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/24417/ - changes from a previous patch and cron job is not enabled on wdqs instances" [puppet] - 10https://gerrit.wikimedia.org/r/619289 (https://phabricator.wikimedia.org/T251515) (owner: 10ZPapierski) [13:52:45] (03PS2) 10Jayprakash12345: Enable tewiki as import source for tewikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619371 (https://phabricator.wikimedia.org/T260107) [13:53:01] (03CR) 10Jayprakash12345: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619371 (https://phabricator.wikimedia.org/T260107) (owner: 10Jayprakash12345) [13:53:06] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:56:58] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:00:02] (03PS1) 10JMeybohm: Set helm extra args early [debs/helmfile] - 10https://gerrit.wikimedia.org/r/619471 [14:00:56] (03PS1) 10Ottomata: camus::job - pass stream_configs_constraints through to refinery camus wrapper [puppet] - 10https://gerrit.wikimedia.org/r/619472 (https://phabricator.wikimedia.org/T251935) [14:01:07] (03CR) 10jerkins-bot: [V: 04-1] camus::job - pass stream_configs_constraints through to refinery camus wrapper [puppet] - 10https://gerrit.wikimedia.org/r/619472 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata) [14:04:34] (03CR) 10Giuseppe Lavagetto: "LGTM, I'm sure we're missing something though :P" [debs/helmfile] - 10https://gerrit.wikimedia.org/r/619471 (owner: 10JMeybohm) [14:04:38] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Set helm extra args early [debs/helmfile] - 10https://gerrit.wikimedia.org/r/619471 (owner: 10JMeybohm) [14:05:02] PROBLEM - Disk space on mwdebug1001 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=78%): /tmp 0 MB (0% inode=78%): /var/tmp 0 MB (0% inode=78%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mwdebug1001&var-datasource=eqiad+prometheus/ops [14:05:20] (03CR) 10Thiemo Kreuz (WMDE): "> […] loss of 69.3% of distinct colors." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/609584 (https://phabricator.wikimedia.org/T252108) (owner: 10Thiemo Kreuz (WMDE)) [14:05:25] rsync: recv_generator: mkdir "/srv/mediawiki/php-1.36.0-wmf.4/resources" failed: No space left on device (28) [14:05:28] damn [14:05:47] that is on mw1319 apparently [14:06:14] (03CR) 10Herron: [C: 03+2] lists: stop automatically sycing fermium to lists1001 [puppet] - 10https://gerrit.wikimedia.org/r/619355 (https://phabricator.wikimedia.org/T224586) (owner: 10Herron) [14:06:17] mwdebug too, hashar [14:06:26] yeah / on mwdebug1001 is full .. [14:06:43] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:07:50] do we really need to keep so many versions? [14:07:54] but mw1319 is just fine [14:08:07] well we clean the old versions [14:08:08] there is like over 100 mw versions [14:08:14] on mwdebug1001 [14:08:15] oh [14:08:20] yeah that is not normal [14:08:33] can you have a look, hashar, and tell me if I can manually remove some? [14:08:35] same on mw1319 [14:08:41] (03CR) 10Thiemo Kreuz (WMDE): [C: 04-1] "https://www.mediawiki.org/wiki/Gerrit/Privilege_policy/en#Merging_without_review" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599284 (https://phabricator.wikimedia.org/T252108) (owner: 10Gilles) [14:08:42] to get out of the danger [14:08:53] supposedly scap should clean those [14:08:53] I don't like to have a host with no / space [14:09:05] just for a quick check, we can create a task late [14:09:16] but /srv/mediawiki has versions all the way done to 1.32.0 from mid 2018 [14:09:24] s/done/down [14:09:47] can I delete for example all 1.32.0 versions or it is dangerous to do it on only 1 host? 
[14:09:57] ah no those are empty directories [14:09:58] or maybe can be done from scap? [14:10:17] I guess scap does not delete the cache/l10n empty directories [14:10:33] I see [14:10:36] !log mw1319: scap pull [14:10:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:47] for mwdebug1001 I cant tell [14:11:52] (03CR) 10JMeybohm: [C: 03+2] "> Patch Set 1:" [debs/helmfile] - 10https://gerrit.wikimedia.org/r/619471 (owner: 10JMeybohm) [14:12:08] there was a recent increase in bytes used at 14h [14:12:17] other hosts not affected, but mwdebug was [14:12:39] (03PS2) 10Herron: lists: make lists1001 primary mailman host [puppet] - 10https://gerrit.wikimedia.org/r/619354 (https://phabricator.wikimedia.org/T224586) [14:13:33] (03CR) 10Kormat: [C: 04-2] "The current version of wmfmariadbpy (0.2) will pull in cumin on all db hosts, so this should not be merged until that is fixed." [puppet] - 10https://gerrit.wikimedia.org/r/619443 (https://phabricator.wikimedia.org/T259516) (owner: 10Kormat) [14:13:35] any suggestion, hashar to proceed? [14:13:48] for mwdebug1001? I don't know I haven't looked into it [14:13:55] is it safe to remove the cache of old versions? [14:13:56] I just had scap alerting about mw1319 but it seems all fine [14:14:02] the old caches yeah [14:14:14] seems like there are just left over l10n cache directories [14:14:30] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:14:41] the problem is if that is not fixed, deploys will fail [14:14:44] (03Merged) 10jenkins-bot: Set helm extra args early [debs/helmfile] - 10https://gerrit.wikimedia.org/r/619471 (owner: 10JMeybohm) [14:14:57] (03PS1) 10Kormat: Update remote execution libraries from transferpy [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/619476 (https://phabricator.wikimedia.org/T259516) [14:15:51] but 1319 should be ok, filesystem-wise [14:16:20] (03PS3) 10Herron: lists: make lists1001 primary mailman host [puppet] - 10https://gerrit.wikimedia.org/r/619354 (https://phabricator.wikimedia.org/T224586) [14:16:30] but the cache is KBs only, so I don't think that is the issue [14:18:09] each release is 6-8GB, and it doesn't fit on the 49G available for /srv [14:18:39] yeah [14:18:53] let me see if there is something I can do about it [14:19:22] it is a vm, so maybe the fs can be extended [14:19:42] 6690 ./php-1.35.0-wmf.40 [14:19:42] 6702 ./php-1.35.0-wmf.41 [14:19:46] they should have been purged [14:19:48] bah [14:20:23] yeah, but it is just KB, minor issue we can report later [14:20:34] na those are in mB ;) [14:20:37] err [14:20:38] MB [14:20:49] we have some step to delete the old versions on tuesday [14:20:54] but apparently that does not occur bah [14:23:14] <_joe_> so we currently have [14:23:16] <_joe_> uhm [14:23:23] <_joe_> 6 versions live? 
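Note on the disk numbers quoted above: a rough headroom check, using only the figures from the chat (6-8 GB per release, 49G for /srv, 6 versions present). The function name is hypothetical, and the result ignores any temporary copies made during a sync, so treat it as a sketch rather than a measurement.

```python
def srv_headroom_gb(versions, per_version_gb, capacity_gb):
    """Space left on /srv after storing `versions` releases of the given size."""
    return capacity_gb - versions * per_version_gb

# Figures taken from the conversation; everything else is illustrative.
best_case = srv_headroom_gb(versions=6, per_version_gb=6, capacity_gb=49)
worst_case = srv_headroom_gb(versions=6, per_version_gb=8, capacity_gb=49)
print(best_case, worst_case)
```

With six releases on disk, the remaining space ranges from 13 GB down to 1 GB, which is less than one additional release at the upper size estimate.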
[14:23:28] <_joe_> that seems excessive [14:24:32] in the deploy doc we have: [14:24:54] "Decide what old stuff to prune": find /srv/mediawiki-staging -mindepth 2 -maxdepth 2 -type f -path './php-*/README' -ctime +7 -exec dirname {} \; [14:25:10] which yielded nothing for me earlier today [14:25:16] so I just moved to the next steps [14:25:33] else for all those old versions we need to run: scap clean --delete [14:25:40] which apparnetly hasn't been done for the last few trains [14:25:51] <_joe_> that's the real issue [14:25:57] <_joe_> we still have, on all appservers [14:26:20] <_joe_> php-1.35.0-wmf.40 php-1.35.0-wmf.41 php-1.36.0-wmf.1 php-1.36.0-wmf.2 php-1.36.0-wmf.3 php-1.36.0-wmf.4 [14:26:33] <_joe_> that's at least 3 releases too much [14:26:42] PROBLEM - Ensure local MW versions match expected deployment on mw1312 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:26:50] PROBLEM - Ensure local MW versions match expected deployment on snapshot1008 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:26:52] PROBLEM - Ensure local MW versions match expected deployment on mw2311 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:26:53] PROBLEM - Ensure local MW versions match expected deployment on mw2372 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:26:54] PROBLEM - Ensure local MW versions match expected deployment on mwdebug2002 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:26:58] PROBLEM - Ensure local MW versions match expected deployment on mw2310 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:26:58] PROBLEM - Ensure local MW versions match expected deployment on mw2324 is CRITICAL: CRITICAL: 3 
mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:27:02] PROBLEM - Ensure local MW versions match expected deployment on wtp2004 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:27:03] PROBLEM - Ensure local MW versions match expected deployment on wtp2014 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:27:20] !log hashar@deploy1001 sync aborted: testwikis wikis to 1.36.0-wmf.4 (duration: 72m 36s) [14:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:28] PROBLEM - Ensure local MW versions match expected deployment on mw1351 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:27:28] PROBLEM - Ensure local MW versions match expected deployment on mw1302 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:27:28] PROBLEM - Ensure local MW versions match expected deployment on wtp1040 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:27:28] PROBLEM - Ensure local MW versions match expected deployment on mwdebug1002 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:27:40] PROBLEM - Ensure local MW versions match expected deployment on mw1275 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:27:40] PROBLEM - Ensure local MW versions match expected deployment on mw1327 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:27:43] PROBLEM - Ensure local MW versions match expected deployment on mw2253 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:27:48] PROBLEM - Ensure local MW 
versions match expected deployment on mwdebug1001 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:27:56] PROBLEM - Ensure local MW versions match expected deployment on mw1363 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:27:56] PROBLEM - Ensure local MW versions match expected deployment on mw2136 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:28:06] PROBLEM - Ensure local MW versions match expected deployment on mw1412 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:28:06] PROBLEM - Ensure local MW versions match expected deployment on mw1300 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:28:06] PROBLEM - Ensure local MW versions match expected deployment on mw2273 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:28:07] <_joe_> hashar: I can just remove the dirs for you [14:28:32] PROBLEM - Ensure local MW versions match expected deployment on mw2317 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:28:36] PROBLEM - Ensure local MW versions match expected deployment on mw2292 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:28:38] PROBLEM - Ensure local MW versions match expected deployment on wtp1027 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:28:44] PROBLEM - Ensure local MW versions match expected deployment on mw1297 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:28:44] PROBLEM - Ensure local MW versions match expected deployment on mw2296 is CRITICAL: 
CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:28:44] PROBLEM - Ensure local MW versions match expected deployment on mw2221 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:29:03] "Decide what old stuff to prune": find /srv/mediawiki-staging -mindepth 2 -maxdepth 2 -type f -path './php-*/README' -ctime +7 -exec dirname {} \; [14:29:04] ahah [14:29:05] found it [14:29:08] PROBLEM - Ensure local MW versions match expected deployment on mw1368 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:29:08] README no more exist [14:29:10] PROBLEM - Ensure local MW versions match expected deployment on wtp2011 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:29:16] PROBLEM - Ensure local MW versions match expected deployment on mw1362 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:29:17] it is README.md :-( [14:29:18] PROBLEM - Ensure local MW versions match expected deployment on mw2218 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:29:46] PROBLEM - Ensure local MW versions match expected deployment on mw2208 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:29:48] PROBLEM - Ensure local MW versions match expected deployment on wtp1041 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:30:12] PROBLEM - Ensure local MW versions match expected deployment on snapshot1005 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:30:21] rm README.md ? 
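Note on the pruning check discussed above: the documented `find` step matched only a file named `README`, so it returned nothing once the file became `README.md`. A minimal sketch of the same check that accepts either name follows. The function name, the `MARKERS` tuple and the `now` parameter are assumptions made for illustration and are not part of scap or the deploy documentation. The age test only approximates `-ctime +7`, because `find` rounds ages down to whole days.

```python
import time
from pathlib import Path

# Marker names to accept: README is what the documented command looked for,
# README.md is the name reported in the chat. Hypothetical constant.
MARKERS = ("README", "README.md")


def stale_versions(root, days=7, now=None):
    """Return php-* directories under root whose marker file has an inode
    change time (ctime) more than `days` days before `now`."""
    now = time.time() if now is None else now
    cutoff = now - days * 86400
    stale = []
    for version_dir in sorted(Path(root).glob("php-*")):
        if not version_dir.is_dir():
            continue
        for name in MARKERS:
            marker = version_dir / name
            if marker.is_file() and marker.stat().st_ctime < cutoff:
                stale.append(version_dir)
                break  # one matching marker is enough for this directory
    return stale
```

Calling `stale_versions("/srv/mediawiki-staging")` would list candidate directories for the `scap clean --delete` step mentioned in the conversation. Directories without either marker file are skipped, which is the same blind spot the original command had, so an empty result is worth a second look.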
[14:30:24] PROBLEM - Ensure local MW versions match expected deployment on mw2204 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:30:29] !log Cleaning old MediaWiki versions that were never removed [14:30:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:38] PROBLEM - Ensure local MW versions match expected deployment on mw1402 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:30:44] PROBLEM - Ensure local MW versions match expected deployment on mw1385 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:30:46] PROBLEM - Ensure local MW versions match expected deployment on mw1328 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:30:46] PROBLEM - Ensure local MW versions match expected deployment on mw2331 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:30:46] PROBLEM - Ensure local MW versions match expected deployment on mw2362 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:30:46] (03PS1) 10JMeybohm: Fix distribution in changelog [debs/helmfile] - 10https://gerrit.wikimedia.org/r/619478 [14:30:50] PROBLEM - Ensure local MW versions match expected deployment on mw2258 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:30:53] PROBLEM - Ensure local MW versions match expected deployment on mw1290 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:30:53] PROBLEM - Ensure local MW versions match expected deployment on mw2247 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:30:53] PROBLEM - Ensure local MW 
versions match expected deployment on mw2137 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:31:00] PROBLEM - Ensure local MW versions match expected deployment on mw2266 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:31:02] PROBLEM - Ensure local MW versions match expected deployment on mw2271 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:31:03] PROBLEM - Ensure local MW versions match expected deployment on mw2209 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:31:03] PROBLEM - Ensure local MW versions match expected deployment on mw2211 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:31:12] PROBLEM - Ensure local MW versions match expected deployment on mw1410 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:31:16] PROBLEM - Ensure local MW versions match expected deployment on mw2329 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:31:23] PROBLEM - Ensure local MW versions match expected deployment on mw2325 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:31:32] <_joe_> hnowlan: we need to make that grace period longer [14:31:52] PROBLEM - Ensure local MW versions match expected deployment on mw2260 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:32:02] PROBLEM - Ensure local MW versions match expected deployment on mw1354 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:32:03] PROBLEM - Ensure local MW versions match expected deployment on mw2219 is CRITICAL: 
CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:32:10] PROBLEM - Ensure local MW versions match expected deployment on mw2190 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:32:14] PROBLEM - Ensure local MW versions match expected deployment on mw1375 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:32:14] PROBLEM - Ensure local MW versions match expected deployment on mw1349 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:32:19] _joe_: ack, will fix [14:32:27] <_joe_> hashar: another thing I noticed: php-1.36.0-wmf.3 is 1.8 GB larger than previous versions [14:32:37] <_joe_> any idea why? [14:32:48] RECOVERY - Ensure local MW versions match expected deployment on snapshot1008 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:32:49] webpack / js dependencies being added with npm ? [14:32:54] RECOVERY - Ensure local MW versions match expected deployment on mw2324 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:32:54] more seriously, I don't know [14:32:59] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Fix distribution in changelog [debs/helmfile] - 10https://gerrit.wikimedia.org/r/619478 (owner: 10JMeybohm) [14:33:14] <_joe_> hashar: you know who might know? 
[14:33:35] well [14:33:38] RECOVERY - Ensure local MW versions match expected deployment on mw1327 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:33:40] I don't know anything about mediawiki anymore [14:33:52] so short of filing a task, I guess I am not that helpful on that front ;] [14:34:35] <_joe_> cache went from 5.5G to 7.2G [14:34:37] <_joe_> lol [14:35:00] <_joe_> the l10n cache is 7.2 GB [14:36:04] (03CR) 10Kormat: [C: 03+1] control-mariadb-10.4*: Update package version [software] - 10https://gerrit.wikimedia.org/r/619462 (owner: 10Marostegui) [14:37:02] !log replacing msw-b5,b6,b7 and b8 [14:37:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:22] <_joe_> hashar: sorry any idea what these .~tmp~ directories are? [14:38:33] <_joe_> it seems we keep two identical copies of the same l10n cache [14:39:50] RECOVERY - Ensure local MW versions match expected deployment on mw2136 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:39:58] RECOVERY - Ensure local MW versions match expected deployment on mw1412 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:39:58] RECOVERY - Ensure local MW versions match expected deployment on mw2273 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:40:26] RECOVERY - Ensure local MW versions match expected deployment on mw2317 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:40:30] RECOVERY - Ensure local MW versions match expected deployment on mw2292 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:40:31] !log hashar@deploy1001 Pruned MediaWiki: 1.35.0-wmf.40 (duration: 10m 24s) [14:40:32] RECOVERY - Ensure local MW versions match expected deployment on wtp1027 is OK: OKAY: wikiversions in sync 
https://wikitech.wikimedia.org/wiki/Application_servers [14:40:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:38] RECOVERY - Ensure local MW versions match expected deployment on mw1297 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:40:38] RECOVERY - Ensure local MW versions match expected deployment on mw2296 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:40:38] RECOVERY - Ensure local MW versions match expected deployment on mw2221 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:40:45] _joe_: is 2 hours a reasonable window? [14:41:02] RECOVERY - Ensure local MW versions match expected deployment on mw1368 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:41:02] RECOVERY - Ensure local MW versions match expected deployment on wtp2011 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:41:10] RECOVERY - Ensure local MW versions match expected deployment on mw1362 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:41:12] RECOVERY - Ensure local MW versions match expected deployment on mw2218 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:41:42] RECOVERY - Ensure local MW versions match expected deployment on mw2208 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:41:43] RECOVERY - Ensure local MW versions match expected deployment on wtp1041 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:42:08] RECOVERY - Ensure local MW versions match expected deployment on snapshot1005 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:42:20] RECOVERY - Ensure local MW versions match expected 
deployment on mw2204 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:42:34] RECOVERY - Ensure local MW versions match expected deployment on mw1402 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:42:40] RECOVERY - Ensure local MW versions match expected deployment on mw1385 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:42:42] RECOVERY - Ensure local MW versions match expected deployment on mw1328 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:42:43] RECOVERY - Ensure local MW versions match expected deployment on mw2331 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:42:43] RECOVERY - Ensure local MW versions match expected deployment on mw2362 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:42:44] RECOVERY - Ensure local MW versions match expected deployment on mw2258 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:42:48] RECOVERY - Ensure local MW versions match expected deployment on mw1290 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:42:50] RECOVERY - Ensure local MW versions match expected deployment on mw2247 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:42:50] RECOVERY - Ensure local MW versions match expected deployment on mw2137 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:42:56] RECOVERY - Ensure local MW versions match expected deployment on mw2271 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:42:56] RECOVERY - Ensure local MW versions match expected deployment on mw2266 is OK: OKAY: wikiversions in sync 
https://wikitech.wikimedia.org/wiki/Application_servers [14:42:56] RECOVERY - Ensure local MW versions match expected deployment on mw2211 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:42:56] RECOVERY - Ensure local MW versions match expected deployment on mw2209 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:43:06] RECOVERY - Ensure local MW versions match expected deployment on mw1410 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:43:10] RECOVERY - Ensure local MW versions match expected deployment on mw2329 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:43:17] _joe_: sorry was busy. I guess we can file your finding as a train blocker ( https://phabricator.wikimedia.org/T257972 ) [14:43:18] RECOVERY - Ensure local MW versions match expected deployment on mw2325 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:43:35] (03CR) 10Gilles: "Thiemo, your behaviour constitutes harassment and you are repeatedly breaching the trolling clause of the Code of Conduct:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599284 (https://phabricator.wikimedia.org/T252108) (owner: 10Gilles) [14:43:46] RECOVERY - Ensure local MW versions match expected deployment on mw2260 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:43:56] RECOVERY - Ensure local MW versions match expected deployment on mw1354 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:43:56] RECOVERY - Ensure local MW versions match expected deployment on mw2219 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:44:06] RECOVERY - Ensure local MW versions match expected deployment on mw2190 is OK: OKAY: wikiversions in sync 
https://wikitech.wikimedia.org/wiki/Application_servers [14:44:08] RECOVERY - Ensure local MW versions match expected deployment on mw1349 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:44:08] RECOVERY - Ensure local MW versions match expected deployment on mw1375 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:44:33] RECOVERY - Ensure local MW versions match expected deployment on mw1312 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:44:44] RECOVERY - Ensure local MW versions match expected deployment on mw2311 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:44:44] RECOVERY - Ensure local MW versions match expected deployment on mw2372 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:44:46] RECOVERY - Ensure local MW versions match expected deployment on mwdebug2002 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:44:50] RECOVERY - Ensure local MW versions match expected deployment on mw2310 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:44:53] RECOVERY - Ensure local MW versions match expected deployment on wtp2004 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:44:53] RECOVERY - Ensure local MW versions match expected deployment on wtp2014 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:45:18] RECOVERY - Ensure local MW versions match expected deployment on mw1351 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:45:18] RECOVERY - Ensure local MW versions match expected deployment on wtp1040 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:45:18] RECOVERY - Ensure 
local MW versions match expected deployment on mw1302 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:45:32] RECOVERY - Ensure local MW versions match expected deployment on mw1275 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:45:34] RECOVERY - Ensure local MW versions match expected deployment on mw2253 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:45:46] RECOVERY - Ensure local MW versions match expected deployment on mw1363 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:45:56] RECOVERY - Ensure local MW versions match expected deployment on mw1300 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:46:56] RECOVERY - Disk space on mwdebug1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mwdebug1001&var-datasource=eqiad+prometheus/ops [14:47:52] !log hashar@deploy1001 Pruned MediaWiki: 1.35.0-wmf.41 (duration: 04m 20s) [14:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:26] !log imported helmfile_0.125.2-1 to buster-wikimedia, jessie-wikimedia, stretch-wikimedia [14:48:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:18] !log hashar@deploy1001 Pruned MediaWiki: 1.36.0-wmf.1 (duration: 02m 07s) [14:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:14] RECOVERY - Ensure local MW versions match expected deployment on mwdebug1002 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:51:38] RECOVERY - Ensure local MW versions match expected deployment on mwdebug1001 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:51:39] !log otto@deploy1001 Started 
deploy [analytics/refinery@35c4430]: Deploying to an-launcher1002 to get camus wrapper script changes - T251935 [14:51:41] * hashar files more tasks [14:51:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:44] T251935: All EventGate instances should use EventStreamConfig - https://phabricator.wikimedia.org/T251935 [14:52:20] _joe_: jynus: for the old l10n cache directories I have filed a task we will act on eventually I guess ( https://phabricator.wikimedia.org/T260146 ) [14:52:40] and I cleaned the old versions [14:52:54] !log otto@deploy1001 Finished deploy [analytics/refinery@35c4430]: Deploying to an-launcher1002 to get camus wrapper script changes - T251935 (duration: 01m 14s) [14:52:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:55] (03PS1) 10Hnowlan: check_mw_versions: increase grace period from 1 hour to 2 [puppet] - 10https://gerrit.wikimedia.org/r/619482 [14:55:08] <_joe_> hashar: what's not clear to me is why we have those .~tmp~ directories [14:55:29] used by the l10n cache / cdb thingie [14:55:35] I can't remember off hand how it works [14:55:54] but I guess they are all generated to that .~tmp~ sub directory then once generated moved to replace the old ones [14:55:57] !log updated helmfile to 0.125.2-1 on contint* and deploy* [14:55:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:10] !log hashar@deploy1001 Pruned MediaWiki: 1.36.0-wmf.2 (duration: 04m 15s) [14:56:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:20] !log hashar@deploy1001 Started scap: (no justification provided) [14:56:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:26] syncing again bah [14:58:24] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/618765 (owner: 10Volans) [14:59:03] !log Deploy MCR change on db1116:3318 [14:59:06] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:17] (03CR) 10Ottomata: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/619472 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata) [15:01:12] PROBLEM - Host ps1-b5-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:01:24] RECOVERY - Host ps1-b5-codfw is UP: PING OK - Packet loss = 0%, RTA = 34.46 ms [15:01:36] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/618766 (owner: 10Volans) [15:02:48] (03PS2) 10Ottomata: camus::job - pass stream_configs_constraints through to refinery camus wrapper [puppet] - 10https://gerrit.wikimedia.org/r/619472 (https://phabricator.wikimedia.org/T251935) [15:03:15] (03CR) 10jerkins-bot: [V: 04-1] camus::job - pass stream_configs_constraints through to refinery camus wrapper [puppet] - 10https://gerrit.wikimedia.org/r/619472 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata) [15:04:41] (03PS3) 10Ottomata: camus::job - pass stream_configs_constraints through to refinery camus wrapper [puppet] - 10https://gerrit.wikimedia.org/r/619472 (https://phabricator.wikimedia.org/T251935) [15:05:04] (03CR) 10jerkins-bot: [V: 04-1] camus::job - pass stream_configs_constraints through to refinery camus wrapper [puppet] - 10https://gerrit.wikimedia.org/r/619472 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata) [15:05:54] (03PS4) 10Ottomata: camus::job - pass stream_configs_constraints through to refinery camus wrapper [puppet] - 10https://gerrit.wikimedia.org/r/619472 (https://phabricator.wikimedia.org/T251935) [15:07:08] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:07:25] (03CR) 10Andrew Bogott: [C: 03+2] wmcs/ceph/backy: add basic 
backup script, wmcs-backup-instances [puppet] - 10https://gerrit.wikimedia.org/r/619350 (https://phabricator.wikimedia.org/T259192) (owner: 10Andrew Bogott) [15:07:38] (03CR) 10Ottomata: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/24425/an-launcher1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/619472 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata) [15:08:28] (03PS3) 10Ayounsi: Netbox: add circuits support [software/homer] - 10https://gerrit.wikimedia.org/r/617418 [15:08:58] (03PS2) 10Ayounsi: Configure transport links OSPF based on Netbox data [homer/public] - 10https://gerrit.wikimedia.org/r/617603 (https://phabricator.wikimedia.org/T200277) [15:10:01] (03CR) 10jerkins-bot: [V: 04-1] Netbox: add circuits support [software/homer] - 10https://gerrit.wikimedia.org/r/617418 (owner: 10Ayounsi) [15:12:31] (03PS1) 10Andrew Bogott: wmcs/ceph/backy fix name of backup script [puppet] - 10https://gerrit.wikimedia.org/r/619486 (https://phabricator.wikimedia.org/T259192) [15:13:00] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:13:56] (03CR) 10Andrew Bogott: [C: 03+2] wmcs/ceph/backy fix name of backup script [puppet] - 10https://gerrit.wikimedia.org/r/619486 (https://phabricator.wikimedia.org/T259192) (owner: 10Andrew Bogott) [15:15:22] (03PS1) 10Ottomata: camus - declare event service specific camus::jobs. [puppet] - 10https://gerrit.wikimedia.org/r/619487 (https://phabricator.wikimedia.org/T251935) [15:16:32] (03CR) 10jerkins-bot: [V: 04-1] camus - declare event service specific camus::jobs. 
[puppet] - 10https://gerrit.wikimedia.org/r/619487 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata) [15:17:36] (03PS2) 10Ottomata: camus - declare event service specific camus::jobs. [puppet] - 10https://gerrit.wikimedia.org/r/619487 (https://phabricator.wikimedia.org/T251935) [15:18:52] (03CR) 10jerkins-bot: [V: 04-1] camus - declare event service specific camus::jobs. [puppet] - 10https://gerrit.wikimedia.org/r/619487 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata) [15:18:59] (03PS4) 10Ayounsi: Netbox: add circuits support [software/homer] - 10https://gerrit.wikimedia.org/r/617418 [15:19:25] (03PS3) 10Ottomata: camus - declare event service specific camus::jobs. [puppet] - 10https://gerrit.wikimedia.org/r/619487 (https://phabricator.wikimedia.org/T251935) [15:20:37] (03CR) 10jerkins-bot: [V: 04-1] camus - declare event service specific camus::jobs. [puppet] - 10https://gerrit.wikimedia.org/r/619487 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata) [15:20:56] (03PS4) 10Ottomata: camus - declare event service specific camus::jobs. [puppet] - 10https://gerrit.wikimedia.org/r/619487 (https://phabricator.wikimedia.org/T251935) [15:24:03] 10Operations, 10observability, 10User-fgiunchedi: VictorOps behavior on long-ack'd incidents - https://phabricator.wikimedia.org/T259465 (10Bstorm) Are we considering the retrigger to be something implemented for the prod SRE rotation only? On WMCS, we seem pretty ok with manually resolving (small group and... 
[15:27:12] !log hashar@deploy1001 Finished scap: (no justification provided) (duration: 30m 51s) [15:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:47] continuing with group 0 [15:29:57] (03PS1) 10Hashar: group0 wikis to 1.36.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619489 [15:29:59] (03CR) 10Hashar: [C: 03+2] group0 wikis to 1.36.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619489 (owner: 10Hashar) [15:30:39] (03Merged) 10jenkins-bot: group0 wikis to 1.36.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619489 (owner: 10Hashar) [15:31:13] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler1002/24429/an-launcher1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/619487 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata) [15:33:55] (03CR) 10Ottomata: [C: 03+2] "Migration plan:" [puppet] - 10https://gerrit.wikimedia.org/r/619487 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata) [15:36:31] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.4 [15:36:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:48] 10Operations, 10observability, 10User-fgiunchedi: VictorOps behavior on long-ack'd incidents - https://phabricator.wikimedia.org/T259465 (10fgiunchedi) >>! In T259465#6376474, @Bstorm wrote: > Are we considering the retrigger to be something implemented for the prod SRE rotation only? On WMCS, we seem pretty... 
[15:37:44] (03PS4) 10Herron: lists: make lists1001 primary mailman host [puppet] - 10https://gerrit.wikimedia.org/r/619354 (https://phabricator.wikimedia.org/T224586) [15:38:42] (03PS1) 10Hnowlan: api-gateway: serve public traffic over TLS [deployment-charts] - 10https://gerrit.wikimedia.org/r/619490 (https://phabricator.wikimedia.org/T254908) [15:39:28] 10Puppet, 10Beta-Cluster-Infrastructure, 10Product-Infrastructure-Team-Backlog, 10VPS-Projects: Puppet failures on deployment-docker-changeprop01, deployment-docker-cpjobqueue01, deployment-push-notifications01, deployment-docker-mobileapps01, and deployment-docker... - https://phabricator.wikimedia.org/T259812 [15:40:08] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:43:58] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:44:30] PROBLEM - Host mw2208.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:44:30] PROBLEM - Host mw2210.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:46:00] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:47:18] (03PS1) 10Ottomata: camus::job - Fix typo in stream_configs_constraints_opt [puppet] - 10https://gerrit.wikimedia.org/r/619491 (https://phabricator.wikimedia.org/T251935) [15:47:57] PROBLEM - Host mw2189.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:48:06] hashar: FYI https://phabricator.wikimedia.org/T260155#6376588 [15:48:19] (03CR) 10Ottomata: [C: 03+2] camus::job - Fix typo in stream_configs_constraints_opt [puppet] - 10https://gerrit.wikimedia.org/r/619491 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata) [15:48:41] RhinosF1: ah cool I was about to file it [15:49:11] hashar: :) I get emails for anything having their priority changed to UBN [15:49:31] ah smart [15:49:33] RECOVERY - Host mw2208.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.74 ms [15:49:33] RECOVERY - Host mw2210.mgmt is UP: PING OK - Packet loss = 0%, RTA = 38.14 ms [15:49:56] hashar: nosey :) [15:50:38] made it a blocker to the train [15:50:48] Ty [15:51:12] (03PS1) 10Jdlrobson: Revert "ServiceWiring: Avoid usage of deprecated Title::getSubjectPage()" [skins/MinervaNeue] (wmf/1.36.0-wmf.4) - 10https://gerrit.wikimedia.org/r/619376 (https://phabricator.wikimedia.org/T260155) [15:51:25] (03PS1) 10JMeybohm: helm: Add wmf-stable helm repo [puppet] - 10https://gerrit.wikimedia.org/r/619493 [15:51:29] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:53:03] RECOVERY - Host mw2189.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.01 ms [15:53:28] Jdlrobson: will deploy your fix [15:53:31] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:53:33] (03CR) 10Hashar: [C: 03+2] Revert "ServiceWiring: Avoid usage of deprecated Title::getSubjectPage()" [skins/MinervaNeue] (wmf/1.36.0-wmf.4) - 10https://gerrit.wikimedia.org/r/619376 (https://phabricator.wikimedia.org/T260155) (owner: 10Jdlrobson) [15:53:43] 10Operations, 10observability, 10User-fgiunchedi: VictorOps behavior on long-ack'd incidents - https://phabricator.wikimedia.org/T259465 (10Bstorm) >>! In T259465#6376519, @fgiunchedi wrote: > > Good question re: SRE rotation only, I forgot to specify that the setting is unfortunately global per organizatio... [15:54:01] PROBLEM - Host wtp2015.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:59:21] RECOVERY - Host wtp2015.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.76 ms [16:00:05] godog and _joe_: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Puppet request window(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200811T1600). [16:00:05] Amir1: A patch you scheduled for Puppet request window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[16:00:13] o/ [16:00:41] msw-b5,b6,b7.and b8 replacement done [16:03:21] Jdlrobson: apparently the change will merge in 10 minutes [16:04:39] 10Operations, 10DBA, 10User-Kormat: switchover.py breaks on 10.4 master - https://phabricator.wikimedia.org/T260127 (10Marostegui) p:05Medium→03High Setting it to high as we don't have many "old" masters with 10.4 but we already have some that would use this script: x1, es4, es5... [16:04:42] <_joe_> Amir1: can we do it tomorrow please? [16:04:57] <_joe_> as in tomorrow morning? I was about to leave [16:05:01] sure, no worries. nothing urgent [16:11:17] (03CR) 10Dzahn: [C: 03+2] noc: Remove link to outdated blog [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619129 (https://phabricator.wikimedia.org/T259978) (owner: 10Aklapper) [16:11:44] _joe_: ACK, i saw the ping about testreduce. will look. have a good rest of the night [16:12:21] !log migrating lists.wikimedia.org services from fermium to lists1001 T224586 [16:12:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:27] T224586: Migrate fermium to Buster - https://phabricator.wikimedia.org/T224586 [16:12:35] (03Merged) 10jenkins-bot: Revert "ServiceWiring: Avoid usage of deprecated Title::getSubjectPage()" [skins/MinervaNeue] (wmf/1.36.0-wmf.4) - 10https://gerrit.wikimedia.org/r/619376 (https://phabricator.wikimedia.org/T260155) (owner: 10Jdlrobson) [16:13:00] (03CR) 10Herron: [C: 03+2] lists: make lists1001 primary mailman host [puppet] - 10https://gerrit.wikimedia.org/r/619354 (https://phabricator.wikimedia.org/T224586) (owner: 10Herron) [16:14:39] Jdlrobson: pulling your patch to mwdebug1001 [16:17:55] (03CR) 10RLazarus: [C: 03+2] web_testing: Remove the apache-fast-test placeholder [puppet] - 10https://gerrit.wikimedia.org/r/618602 (owner: 10RLazarus) [16:18:09] 10Operations, 10SRE-Access-Requests: Add new SSH key for Neil Shah-Quinn - https://phabricator.wikimedia.org/T260160 (10nshahquinn-wmf) [16:19:13] 10Operations, 
10SRE-Access-Requests: Add new SSH key for Neil Shah-Quinn - https://phabricator.wikimedia.org/T260160 (10nshahquinn-wmf) [16:20:29] (03CR) 10Dzahn: "aww. man. sorry, this is not the puppet-repo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619129 (https://phabricator.wikimedia.org/T259978) (owner: 10Aklapper) [16:20:54] (03PS1) 10Dzahn: Revert "noc: Remove link to outdated blog" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619377 [16:21:32] (03PS1) 10Ottomata: Refine - bump version to 0.0.132, but default to not merging Hive schemas [puppet] - 10https://gerrit.wikimedia.org/r/619496 (https://phabricator.wikimedia.org/T259924) [16:21:51] (03CR) 10Dzahn: [C: 03+2] Revert "noc: Remove link to outdated blog" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619377 (owner: 10Dzahn) [16:22:34] (03CR) 10Dzahn: [C: 03+2] releases: Remove absent resources [puppet] - 10https://gerrit.wikimedia.org/r/619434 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [16:22:57] 10Operations, 10ops-codfw, 10netops: (Need by: ) codfw:rack/setup/new management switches - https://phabricator.wikimedia.org/T253154 (10Papaul) [16:23:46] Jdlrobson: bah will do after sorry [16:24:22] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission-hardware, and 2 others: decom cloudvirt1015 - https://phabricator.wikimedia.org/T257366 (10nskaggs) p:05Triage→03Medium a:05Jclark-ctr→03Andrew [16:25:32] (03CR) 10Ottomata: [C: 03+2] Refine - bump version to 0.0.132, but default to not merging Hive schemas [puppet] - 10https://gerrit.wikimedia.org/r/619496 (https://phabricator.wikimedia.org/T259924) (owner: 10Ottomata) [16:26:44] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:28:54] (03CR) 10Nuria: [C: 
03+1] "Nice, I see, there is no need to revert" [puppet] - 10https://gerrit.wikimedia.org/r/619496 (https://phabricator.wikimedia.org/T259924) (owner: 10Ottomata) [16:29:36] (03PS6) 10Hnowlan: Add discovery and disabled LVS components for API gateway [puppet] - 10https://gerrit.wikimedia.org/r/615512 (https://phabricator.wikimedia.org/T254908) [16:31:09] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1013 with 10G interfaces - https://phabricator.wikimedia.org/T243414 (10nskaggs) p:05Triage→03Low [16:32:34] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:38:31] 10Operations, 10ops-eqiad, 10netops: cloudflare CLF-20200806 dmarc to router patch - https://phabricator.wikimedia.org/T259923 (10Cmjohnson) @ayounsi the cross-connect has been completed but I am not seeing any light from the demarc panel [16:44:13] (03PS1) 10Herron: lists: lists1001 enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/619498 (https://phabricator.wikimedia.org/T224586) [16:46:34] (03PS1) 10Hnowlan: wmnet: add api-gateway records [dns] - 10https://gerrit.wikimedia.org/r/619499 (https://phabricator.wikimedia.org/T254908) [16:46:55] (03CR) 10jerkins-bot: [V: 04-1] wmnet: add api-gateway records [dns] - 10https://gerrit.wikimedia.org/r/619499 (https://phabricator.wikimedia.org/T254908) (owner: 10Hnowlan) [16:47:40] (03CR) 10Herron: [C: 03+2] lists: lists1001 enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/619498 (https://phabricator.wikimedia.org/T224586) (owner: 10Herron) [16:48:23] 10Operations, 10ops-eqiad, 10netops: cloudflare CLF-20200806 dmarc to router patch - https://phabricator.wikimedia.org/T259923 (10Cmjohnson) just to confirm, I double-checked the Equinix 
completion email and verified the ports (11,12) and the patch number (21504182-A). I also verified the sfp+ and fiber ar... [16:49:08] bah back [16:49:36] lets deploy the hotfix for T260155 [16:49:37] T260155: PHP Fatal error: Uncaught Error: Call to undefined method TitleValue::isSubpage() in /srv/mediawiki/php-1.36.0-wmf.4/skins/MinervaNeue/includes/Skins/SkinUserPageHelper.php:55 in /srv/mediawiki/php-1.36.0-wmf.4/skins/MinervaNeue/includes/Skins/SkinUserPageHelper.php on line 55 - https://phabricator.wikimedia.org/T260155 [16:51:21] (03PS2) 10Hnowlan: wmnet: add api-gateway records [dns] - 10https://gerrit.wikimedia.org/r/619499 (https://phabricator.wikimedia.org/T254908) [16:51:38] 10Operations, 10SRE-Access-Requests: Add new SSH key for Neil Shah-Quinn - https://phabricator.wikimedia.org/T260160 (10nshahquinn-wmf) [16:51:44] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:53:17] !log hashar@deploy1001 Synchronized php-1.36.0-wmf.4/skins/MinervaNeue/: Revert "ServiceWiring: Avoid usage of deprecated Title::getSubjectPage()" - T260155 (duration: 01m 06s) [16:53:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:26] that should fix the icinga alert [16:56:02] (03PS1) 10Cmjohnson: updating mgmt ip to reflect correct asset tag cloudcephosd host [dns] - 10https://gerrit.wikimedia.org/r/619503 (https://phabricator.wikimedia.org/T251619) [16:56:10] (03CR) 10jerkins-bot: [V: 04-1] updating mgmt ip to reflect correct asset tag cloudcephosd host [dns] - 10https://gerrit.wikimedia.org/r/619503 (https://phabricator.wikimedia.org/T251619) (owner: 10Cmjohnson) [16:56:59] (03Abandoned) 10Cmjohnson: updating mgmt ip to reflect correct asset tag cloudcephosd host [dns] - 
10https://gerrit.wikimedia.org/r/619503 (https://phabricator.wikimedia.org/T251619) (owner: 10Cmjohnson) [16:58:25] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [16:58:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:12] (03PS7) 10Hnowlan: Add discovery and disabled LVS components for API gateway [puppet] - 10https://gerrit.wikimedia.org/r/615512 (https://phabricator.wikimedia.org/T254908) [17:00:04] halfak and accraze: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200811T1700). [17:00:42] (03PS1) 10Elukey: admin: add new ssh key for neilpquinn-wmf [puppet] - 10https://gerrit.wikimedia.org/r/619505 (https://phabricator.wikimedia.org/T260160) [17:01:08] (03PS1) 10Dbarratt: Grant all users in the checkuser group the investigate right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619506 (https://phabricator.wikimedia.org/T260171) [17:01:28] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:02:00] 10Operations, 10Patch-For-Review: Migrate fermium to Buster - https://phabricator.wikimedia.org/T224586 (10herron) 05Open→03Resolved a:03herron lists.wikimedia.org is now running from the buster host lists1001.wikimedia.org. Fermium (the old lists host) has been shut down (via gnt-instance shutdown) and... 
[17:02:02] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10herron) [17:02:08] (03CR) 10Dbarratt: Grant all users in the checkuser group the investigate right (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619506 (https://phabricator.wikimedia.org/T260171) (owner: 10Dbarratt) [17:03:30] (03PS2) 10Elukey: admin: add new ssh key for neilpquinn-wmf [puppet] - 10https://gerrit.wikimedia.org/r/619505 (https://phabricator.wikimedia.org/T260160) [17:04:02] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:04:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:59] (03PS1) 10Herron: lists: remove fermium entry from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/619507 (https://phabricator.wikimedia.org/T224586) [17:05:24] (03CR) 10Tchanders: [C: 03+1] Grant all users in the checkuser group the investigate right (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619506 (https://phabricator.wikimedia.org/T260171) (owner: 10Dbarratt) [17:06:54] herron: thanks for the upgrade! [17:07:11] (03CR) 10Herron: [C: 03+2] lists: remove fermium entry from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/619507 (https://phabricator.wikimedia.org/T224586) (owner: 10Herron) [17:07:22] Amir1: np, glad to be off jessie! 
[17:09:15] (03PS1) 10Mholloway: Update mobileapps to 2020-08-11-170318-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/619508 [17:11:46] (03CR) 10Mholloway: [C: 03+2] Update mobileapps to 2020-08-11-170318-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/619508 (owner: 10Mholloway) [17:12:51] (03Merged) 10jenkins-bot: Update mobileapps to 2020-08-11-170318-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/619508 (owner: 10Mholloway) [17:16:11] (03PS1) 10Mholloway: Update proton to 2020-08-11-170508-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/619509 [17:19:16] (03CR) 10Mholloway: [C: 03+2] Update proton to 2020-08-11-170508-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/619509 (owner: 10Mholloway) [17:20:27] (03Merged) 10jenkins-bot: Update proton to 2020-08-11-170508-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/619509 (owner: 10Mholloway) [17:22:04] (03PS1) 10Ppchelko: Create api-gateway-logstream image. [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/619512 (https://phabricator.wikimedia.org/T251812) [17:25:16] (03PS2) 10Ppchelko: Create api-gateway-logstream image. [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/619512 (https://phabricator.wikimedia.org/T251812) [17:25:18] !log mholloway-shell@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'proton' for release 'production' . [17:25:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:48] !log mholloway-shell@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'proton' for release 'production' . [17:28:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:40] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [17:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:11] (03PS3) 10Ppchelko: Create api-gateway-logstream image. 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/619512 (https://phabricator.wikimedia.org/T251812) [17:33:08] !log mholloway-shell@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'proton' for release 'production' . [17:33:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:14] (03CR) 10CDanis: [C: 03+1] wmnet: add api-gateway records [dns] - 10https://gerrit.wikimedia.org/r/619499 (https://phabricator.wikimedia.org/T254908) (owner: 10Hnowlan) [17:33:20] (03CR) 10CDanis: [C: 03+1] Add discovery and disabled LVS components for API gateway [puppet] - 10https://gerrit.wikimedia.org/r/615512 (https://phabricator.wikimedia.org/T254908) (owner: 10Hnowlan) [17:34:36] (03CR) 10Urbanecm: [C: 04-1] "This changes a wrong variable :-). See more in-text." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619506 (https://phabricator.wikimedia.org/T260171) (owner: 10Dbarratt) [17:36:08] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:36:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:28] Niharika: thanks, I was not aware re deprecation of `investigate` [17:38:48] !log mholloway-shell@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . 
[17:38:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:52] (03CR) 10Urbanecm: [C: 04-1] Grant all users in the checkuser group the investigate right (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619506 (https://phabricator.wikimedia.org/T260171) (owner: 10Dbarratt) [17:40:20] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By:2020-08-17) label/setup/install pki1001 - https://phabricator.wikimedia.org/T259826 (10Cmjohnson) [17:40:41] (03PS5) 10Ppchelko: Modify api-gateway access logging to conform to schema [deployment-charts] - 10https://gerrit.wikimedia.org/r/619341 (https://phabricator.wikimedia.org/T251812) [17:42:54] (03CR) 10Hnowlan: [C: 03+1] "LGTM but I'd be interested to hear what serviceops think of this approach. I'm not sure how else we might do this" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/619512 (https://phabricator.wikimedia.org/T251812) (owner: 10Ppchelko) [17:43:23] !log mholloway-shell@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' . [17:43:23] !log mholloway-shell@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' . 
[17:43:24] (03PS1) 10Cmjohnson: add production dns for pki1001 [dns] - 10https://gerrit.wikimedia.org/r/619518 (https://phabricator.wikimedia.org/T259826) [17:43:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:32] (03CR) 10jerkins-bot: [V: 04-1] add production dns for pki1001 [dns] - 10https://gerrit.wikimedia.org/r/619518 (https://phabricator.wikimedia.org/T259826) (owner: 10Cmjohnson) [17:44:05] (03PS2) 10Cmjohnson: add production dns for pki1001 [dns] - 10https://gerrit.wikimedia.org/r/619518 (https://phabricator.wikimedia.org/T259826) [17:44:56] (03CR) 10Cmjohnson: [C: 03+2] add production dns for pki1001 [dns] - 10https://gerrit.wikimedia.org/r/619518 (https://phabricator.wikimedia.org/T259826) (owner: 10Cmjohnson) [17:46:04] 10Operations, 10ops-eqiad, 10netops: cloudflare CLF-20200806 dmarc to router patch - https://phabricator.wikimedia.org/T259923 (10Cmjohnson) [17:48:57] 10Operations, 10ops-eqiad: relforge1001's mgmt IP not reachable - https://phabricator.wikimedia.org/T259777 (10Cmjohnson) 05Open→03Resolved Replaced the cable [17:50:35] 10Operations, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need By: 2020-09-30) rack/setup/install frmx1001 & frdata1002 - https://phabricator.wikimedia.org/T260181 (10RobH) [17:50:43] 10Operations, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need By: 2020-09-30) rack/setup/install frmx1001 & frdata1002 - https://phabricator.wikimedia.org/T260181 (10RobH) [17:50:54] (03PS1) 10Ottomata: EventStreamConfig - Remove extraneous mediawiki.api-request stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619523 (https://phabricator.wikimedia.org/T251935) [17:51:09] 10Operations, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need By: 2020-09-30) rack/setup/install frmx1001 & frdata1002 - https://phabricator.wikimedia.org/T260181 (10RobH) [17:52:01] (03PS2) 10Ottomata: EventStreamConfig - 
Remove extraneous mediawiki.api-request stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619523 (https://phabricator.wikimedia.org/T251935) [17:52:32] RECOVERY - Host relforge1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms [17:53:02] !log mholloway-shell@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' . [17:53:02] !log mholloway-shell@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [17:53:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:45] (03CR) 10Ottomata: [C: 03+2] EventStreamConfig - Remove extraneous mediawiki.api-request stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619523 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata) [17:54:08] 10Operations, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: (Need By: 2020-09-30) rack/setup/install frmx2001.frack.codfw.wmnet, frdata2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T260183 (10RobH) [17:54:17] 10Operations, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: (Need By: 2020-09-30) rack/setup/install frmx2001.frack.codfw.wmnet, frdata2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T260183 (10RobH) [17:55:46] PROBLEM - kubelet operational latencies on kubernetes1013 is CRITICAL: instance=kubernetes1013.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [17:56:37] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: EventStreamConfig - Remove extraneous mediawiki.api-request stream - T251935 (duration: 01m 01s) [17:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:40] T251935: All EventGate instances should use EventStreamConfig - https://phabricator.wikimedia.org/T251935 [17:58:28] 10Operations, 
10ops-eqiad, 10netops: cloudflare CLF-20200806 dmarc to router patch - https://phabricator.wikimedia.org/T259923 (10RobH) We've not gotten confirmation from Cloudflare that they are turned up, I'll email them to let them know we are ready on our end! [18:00:04] RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Morning backport window(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200811T1800). [18:00:04] RoanKattouw, CindyCicaleseWMF, and davidwbarratt: A patch you scheduled for Morning backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:11] I'll deploy [18:00:37] (03PS2) 10Catrope: Direct GrowthExperiments help panel questions to mentors on cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618786 (https://phabricator.wikimedia.org/T250235) (owner: 10Gergő Tisza) [18:00:46] RoanKattouw: please skip one from CindyCicaleseWMF, I'll deploy that one [18:00:46] (03CR) 10Catrope: [C: 03+2] Direct GrowthExperiments help panel questions to mentors on cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618786 (https://phabricator.wikimedia.org/T250235) (owner: 10Gergő Tisza) [18:00:51] OK [18:01:00] we're using it for showing how to do that [18:01:28] (03Merged) 10jenkins-bot: Direct GrowthExperiments help panel questions to mentors on cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618786 (https://phabricator.wikimedia.org/T250235) (owner: 10Gergő Tisza) [18:01:36] RECOVERY - kubelet operational latencies on kubernetes1013 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [18:02:44] I'm here! 
[18:04:20] PROBLEM - kubelet operational latencies on kubernetes2012 is CRITICAL: instance=kubernetes2012.codfw.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [18:04:28] PROBLEM - kubelet operational latencies on kubernetes2013 is CRITICAL: instance=kubernetes2013.codfw.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [18:04:34] (03PS1) 10Ottomata: camus - replace mediawiki_analytics_events with eventgate-analytics_events job [puppet] - 10https://gerrit.wikimedia.org/r/619532 (https://phabricator.wikimedia.org/T251935) [18:05:37] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Direct GrowthExperiments help panel questions to mentors on cswiki (T250235) (duration: 01m 03s) [18:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:40] T250235: Scale: pilot help panel with mentorship - https://phabricator.wikimedia.org/T250235 [18:06:01] davidwbarratt: Your patch has -1 comments from Martin and I think he's right [18:06:09] $wgGrantPermissions is about OAuth grants (confusingly) [18:06:30] (03PS2) 10Ottomata: camus - replace mediawiki_analytics_events with eventgate-analytics_events job [puppet] - 10https://gerrit.wikimedia.org/r/619532 (https://phabricator.wikimedia.org/T251935) [18:07:44] @seen hashar [18:07:44] mutante: Last time I saw hashar they were quitting the network with reason: Quit: I am a virus. 
Please copy paste me in your /quit message to help me propagate N/A at 8/11/2020 5:10:40 PM (57m4s ago) [18:07:50] I think you're probably looking for $wgGroupPermissions [18:07:57] 10Operations, 10ops-codfw, 10DC-Ops, 10SRE-swift-storage: (Need By: ASAP) rack/setup/install ms-be2057.codfw.wmnet - https://phabricator.wikimedia.org/T260188 (10RobH) [18:08:12] 10Operations, 10ops-codfw, 10DC-Ops, 10SRE-swift-storage: (Need By: ASAP) rack/setup/install ms-be2057.codfw.wmnet - https://phabricator.wikimedia.org/T260188 (10RobH) [18:08:23] Pchelolo: Go ahead with your path [18:08:26] *patch [18:08:29] thank you RoanKattouw [18:08:59] (03CR) 10Ottomata: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/24431/an-launcher1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/619532 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata) [18:09:18] 10Operations, 10ops-codfw, 10DC-Ops, 10SRE-swift-storage: (Need By: ASAP) rack/setup/install ms-be2057.codfw.wmnet (Test Server - Keep Boxes) - https://phabricator.wikimedia.org/T260188 (10wiki_willy) [18:09:36] 10Operations, 10ops-codfw, 10DC-Ops, 10SRE-swift-storage: (Need By: ASAP) rack/setup/install ms-be2057.codfw.wmnet (Test Server - Keep Boxes) - https://phabricator.wikimedia.org/T260188 (10RobH) a:03fgiunchedi @fgiunchedi: What racking restrictions and what OS did you have for this incoming test system?... [18:10:10] RECOVERY - kubelet operational latencies on kubernetes2012 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [18:10:29] 10Operations, 10ops-codfw, 10DC-Ops, 10SRE-swift-storage: (Need By: ASAP) rack/setup/install ms-be2057.codfw.wmnet (Test Server - Keep Boxes) - https://phabricator.wikimedia.org/T260188 (10RobH) [18:10:33] (03PS4) 10Ppchelko: Configured additional settings for API Portal beta wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619339 (https://phabricator.wikimedia.org/T259569) (owner: 10Cicalese) [18:10:42] (03CR) 10Ppchelko: [C: 03+2] Configured additional settings for API Portal beta wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619339 (https://phabricator.wikimedia.org/T259569) (owner: 10Cicalese) [18:11:17] ugh [18:11:30] (03Merged) 10jenkins-bot: Configured additional settings for API Portal beta wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619339 (https://phabricator.wikimedia.org/T259569) (owner: 10Cicalese) [18:11:42] * Urbanecm waves to davidwbarratt [18:11:56] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:12:16] RECOVERY - kubelet operational latencies on kubernetes2013 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [18:12:34] 10Operations, 10ops-codfw, 10DC-Ops, 10SRE-swift-storage: (Need By: ASAP) rack/setup/install ms-be2057.codfw.wmnet (Test Server - Keep Boxes) - https://phabricator.wikimedia.org/T260188 (10RobH) [18:15:48] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:16:57] (03PS2) 10Dbarratt: Grant all users on frwiki the checkuser group the investigate right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619506 (https://phabricator.wikimedia.org/T260171) [18:17:27] hey! [18:17:39] I updated the patch: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/619506/ [18:17:56] Urbanecm & RoanKattouw ^ [18:18:12] Looking [18:18:24] !log ppchelko@deploy1001 Synchronized wmf-config/CommonSettings-labs.php: Beta-only: Configured additional settings for API Portal beta wiki gerrit:619339 (duration: 01m 03s) [18:18:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:33] done with deploy [18:18:37] this works, but only for frwiki. Is that what you want davidwbarratt ? [18:18:45] yes, just french wikipedia for now [18:18:56] that's the only other place it's enabled on [18:18:56] Pchelolo: fyi, you don't need to sync -labs files :-). Just merge and fetch to deploy1001. [18:19:02] should I fix the merge conflict? 
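[note] The mix-up discussed above (18:06-18:07) was between $wgGrantPermissions, which is about OAuth grants, and $wgGroupPermissions, which assigns rights to user groups. The sketch below models that distinction. It is written in Python, not the PHP used in wmf-config, and every name and value in it (GROUP_PERMISSIONS, GRANT_PERMISSIONS, "example-grant", rights_for_groups) is a hypothetical illustration, not the content of change 619506.

```python
# Hypothetical model, NOT the real wmf-config contents.
# Group permissions: user group -> {right: allowed?}
GROUP_PERMISSIONS = {
    "checkuser": {"checkuser": True, "investigate": True},
    "user": {"edit": True},
}

# Grant permissions: OAuth-style grant name -> {right: allowed?}
# Adding a right here affects what an authorised application may do,
# not what members of a user group may do.
GRANT_PERMISSIONS = {
    "example-grant": {"edit": True},
}


def rights_for_groups(groups, group_permissions=GROUP_PERMISSIONS):
    """Return the set of rights conferred by membership in `groups`."""
    rights = set()
    for group in groups:
        for right, allowed in group_permissions.get(group, {}).items():
            if allowed:
                rights.add(right)
    return rights
```

In this model, editing GRANT_PERMISSIONS never changes the result of rights_for_groups(["checkuser"]), which is roughly why the first patchset was described as changing the wrong variable.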
[18:19:07] (03PS3) 10Urbanecm: Grant all users on frwiki the checkuser group the investigate right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619506 (https://phabricator.wikimedia.org/T260171) (owner: 10Dbarratt) [18:19:14] davidwbarratt: that needed only a rebase, done [18:19:14] LGTM [18:19:18] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619506 (https://phabricator.wikimedia.org/T260171) (owner: 10Dbarratt) [18:19:20] Urbanecm: we were using this one as a demo on how you would deploy ) [18:19:23] thank you [18:19:28] aha :-) [18:19:46] fine then, it shouldn't break anything :-) [18:20:30] (03PS1) 10Cwhite: prometheus: add mediawiki level queries [puppet] - 10https://gerrit.wikimedia.org/r/619539 (https://phabricator.wikimedia.org/T256418) [18:20:42] (03CR) 10Dzahn: "yes, the directories are gone on releases100*" [puppet] - 10https://gerrit.wikimedia.org/r/619434 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [18:20:44] OK I'll deploy then [18:20:48] Thanks for giving me the heads up Pchelolo [18:21:02] (03CR) 10Catrope: [C: 03+2] Grant all users on frwiki the checkuser group the investigate right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619506 (https://phabricator.wikimedia.org/T260171) (owner: 10Dbarratt) [18:21:49] (03Merged) 10jenkins-bot: Grant all users on frwiki the checkuser group the investigate right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619506 (https://phabricator.wikimedia.org/T260171) (owner: 10Dbarratt) [18:21:58] uhh, where is the extension again? [18:22:03] the browser extension [18:22:12] (03PS2) 10RLazarus: web_testing: Clean up the old class used for apache-fast-test. 
[puppet] - 10https://gerrit.wikimedia.org/r/618603 [18:22:16] (03PS1) 10Ottomata: camus - include mediawiki.api-request in eventgate-analytics topics to check [puppet] - 10https://gerrit.wikimedia.org/r/619541 (https://phabricator.wikimedia.org/T251935) [18:23:24] davidwbarratt: see https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug [18:23:29] (03CR) 10jerkins-bot: [V: 04-1] camus - include mediawiki.api-request in eventgate-analytics topics to check [puppet] - 10https://gerrit.wikimedia.org/r/619541 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata) [18:23:56] Urbanecm oh great! thanks! [18:24:11] davidwbarratt: Ready for you on mwdebug1002 (note 1002 not 1001) when you're set up [18:24:22] ok, testing now [18:24:36] (03PS2) 10Ottomata: camus - include mediawiki.api-request in eventgate-analytics topics to check [puppet] - 10https://gerrit.wikimedia.org/r/619541 (https://phabricator.wikimedia.org/T251935) [18:25:01] (03CR) 10RLazarus: [C: 03+2] web_testing: Clean up the old class used for apache-fast-test. [puppet] - 10https://gerrit.wikimedia.org/r/618603 (owner: 10RLazarus) [18:25:08] perfect! I see it on https://fr.wikipedia.org/wiki/Sp%C3%A9cial:Liste_des_droits_de_groupe ! 
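[note] The testing step above relies on the X-Wikimedia-Debug header (normally set by the browser extension) to route a request to a debug host. A minimal sketch of building such a request by hand follows. The helper name debug_request is hypothetical, and both the "backend=<host>" value format and the fully qualified host name mwdebug1002.eqiad.wmnet are assumptions here; the log only names "mwdebug1002". The request is constructed but not sent.

```python
from urllib.request import Request


def debug_request(url, backend="mwdebug1002.eqiad.wmnet"):
    """Build (but do not send) a request carrying the debug header.

    Assumption: the header value has the form "backend=<host>".
    """
    return Request(url, headers={"X-Wikimedia-Debug": f"backend={backend}"})


# URL taken from the chat message above.
req = debug_request(
    "https://fr.wikipedia.org/wiki/Sp%C3%A9cial:Liste_des_droits_de_groupe"
)
```

Passing req to urllib.request.urlopen would actually perform the request; that step is left out so the sketch has no network side effects.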
[18:25:49] (03CR) 10jerkins-bot: [V: 04-1] camus - include mediawiki.api-request in eventgate-analytics topics to check [puppet] - 10https://gerrit.wikimedia.org/r/619541 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata) [18:28:19] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Grant investigate right to checkuser group on frwiki (T260171) (duration: 01m 04s) [18:28:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:22] T260171: Fix issues with permissions for Special:Investigate access to checkusers on frwiki - https://phabricator.wikimedia.org/T260171 [18:28:41] (03PS3) 10Ottomata: camus - include mediawiki.api-request in eventgate-analytics topics to check [puppet] - 10https://gerrit.wikimedia.org/r/619541 (https://phabricator.wikimedia.org/T251935) [18:29:13] is it done? [18:29:50] (03CR) 10jerkins-bot: [V: 04-1] camus - include mediawiki.api-request in eventgate-analytics topics to check [puppet] - 10https://gerrit.wikimedia.org/r/619541 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata) [18:31:28] RoanKattouw ? [18:31:35] Yes sorry [18:31:44] no worries, I still fail at reading the logs. :) [18:32:29] (03PS4) 10Ottomata: camus - include mediawiki.api-request in eventgate-analytics topics to check [puppet] - 10https://gerrit.wikimedia.org/r/619541 (https://phabricator.wikimedia.org/T251935) [18:32:43] RoanKattouw thank you so much! [18:33:32] 10Operations, 10ops-eqiad, 10netops: cloudflare CLF-20200806 dmarc to router patch - https://phabricator.wikimedia.org/T259923 (10RobH) a:05Cmjohnson→03RobH I just emailed back and forth with Matt @ Cloudflare (he is very prompt in replies!). They had to put in an EQ order to have a patch placed between... [18:35:01] 10Operations, 10ops-eqiad, 10netops: cloudflare CLF-20200806 dmarc to router patch - https://phabricator.wikimedia.org/T259923 (10RobH) 05Open→03Resolved >>! 
In T259923#6377508, @RobH wrote: > I just emailed back and forth with Matt @ Cloudflare (he is very prompt in replies!). > > They had to put in an... [18:36:10] (03CR) 10Ottomata: [C: 03+2] camus - include mediawiki.api-request in eventgate-analytics topics to check [puppet] - 10https://gerrit.wikimedia.org/r/619541 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata) [18:41:48] (03PS1) 10Ottomata: Re-enable canary for staging eventgate-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/619543 [18:43:11] (03CR) 10Ottomata: [C: 03+2] Re-enable canary for staging eventgate-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/619543 (owner: 10Ottomata) [18:44:55] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [18:44:55] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' . [18:44:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:41] 10Operations: expired puppet cert on scb1001 - https://phabricator.wikimedia.org/T260094 (10Dzahn) [18:46:08] (03PS1) 10Ottomata: eventgate-analytics - Use remote EventStreamConfig in eqiad and codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/619544 (https://phabricator.wikimedia.org/T251935) [18:47:42] (03CR) 10Ottomata: [C: 03+2] eventgate-analytics - Use remote EventStreamConfig in eqiad and codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/619544 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata) [18:48:52] !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . 
[18:48:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:17] !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [18:55:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:04] hashar and twentyafterfour: (Dis)respected human, time to deploy Mediawiki train - European+American Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200811T1900). Please do the needful. [19:11:08] 10Operations, 10Performance-Team, 10vm-requests: More RAM needed for webperf1002 and webperf2002 - https://phabricator.wikimedia.org/T260192 (10dpifke) [19:11:22] (03PS1) 10Ottomata: EventStreamConfig - add streams for eventgate-main [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619552 (https://phabricator.wikimedia.org/T251935) [19:12:10] (03CR) 10jerkins-bot: [V: 04-1] EventStreamConfig - add streams for eventgate-main [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619552 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata) [19:13:27] 10Operations, 10Performance-Team, 10vm-requests: More RAM needed for webperf1002 and webperf2002 - https://phabricator.wikimedia.org/T260192 (10dpifke) [19:16:30] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:19:42] PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100% [19:20:02] PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: CRITICAL - Destination Unreachable (2607:f6f0:205::153) [19:22:55] (03PS2) 10Ottomata: EventStreamConfig - add streams for eventgate-main [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619552 (https://phabricator.wikimedia.org/T251935) 
[19:23:27] (03CR) 10jerkins-bot: [V: 04-1] EventStreamConfig - add streams for eventgate-main [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619552 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata) [19:24:22] (03PS3) 10Ottomata: EventStreamConfig - add streams for eventgate-main [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619552 (https://phabricator.wikimedia.org/T251935) [19:25:14] (03CR) 10jerkins-bot: [V: 04-1] EventStreamConfig - add streams for eventgate-main [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619552 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata) [19:26:01] (03PS4) 10Ottomata: EventStreamConfig - add streams for eventgate-main [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619552 (https://phabricator.wikimedia.org/T251935) [19:29:14] PROBLEM - Check systemd state on aphlict1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:29:40] twentyafterfour: hashar you all training? [19:30:09] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:31:08] ACKNOWLEDGEMENT - Check systemd state on aphlict1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
daniel_zahn debugging, not in service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:31:12] RECOVERY - Check systemd state on aphlict1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:31:48] RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms [19:32:02] RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 2.47 ms [19:32:02] ok, i'm merging my config change, nothing is using it yet [19:32:08] (03CR) 10Ottomata: [C: 03+2] EventStreamConfig - add streams for eventgate-main [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619552 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata) [19:35:42] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: EventStreamConfig - Add streams for eventgate-main - T251935 (duration: 01m 04s) [19:35:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:46] T251935: All EventGate instances should use EventStreamConfig - https://phabricator.wikimedia.org/T251935 [19:37:10] (03PS1) 10Ottomata: eventgate-main - use MW EventStreamConfig in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/619554 (https://phabricator.wikimedia.org/T251935) [19:38:47] (03CR) 10Ottomata: [C: 03+2] eventgate-main - use MW EventStreamConfig in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/619554 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata) [19:40:10] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [19:40:10] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . 
[19:40:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:24] (03PS2) 10Cwhite: prometheus: add mediawiki level queries [puppet] - 10https://gerrit.wikimedia.org/r/619539 (https://phabricator.wikimedia.org/T256418) [20:01:24] (03CR) 10Cwhite: [C: 03+2] prometheus: add mediawiki level queries [puppet] - 10https://gerrit.wikimedia.org/r/619539 (https://phabricator.wikimedia.org/T256418) (owner: 10Cwhite) [20:07:45] (03PS1) 10Cwhite: prometheus: track total hits for mediawiki level queries [puppet] - 10https://gerrit.wikimedia.org/r/619558 (https://phabricator.wikimedia.org/T256418) [20:07:55] (03PS2) 10Cwhite: prometheus: track total hits for mediawiki level queries [puppet] - 10https://gerrit.wikimedia.org/r/619558 (https://phabricator.wikimedia.org/T256418) [20:09:06] (03CR) 10Cwhite: [C: 03+2] prometheus: track total hits for mediawiki level queries [puppet] - 10https://gerrit.wikimedia.org/r/619558 (https://phabricator.wikimedia.org/T256418) (owner: 10Cwhite) [20:09:27] (03PS1) 10Dzahn: aphlict: listen on IPv6 instead IPv4 for client and admin ports [puppet] - 10https://gerrit.wikimedia.org/r/619560 (https://phabricator.wikimedia.org/T238593) [20:10:39] 10Operations, 10Phabricator, 10Traffic, 10serviceops, and 2 others: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (10Dzahn) test comment [20:11:38] (03PS2) 10Dzahn: aphlict: listen on IPv6 instead IPv4 for client and admin ports [puppet] - 10https://gerrit.wikimedia.org/r/619560 (https://phabricator.wikimedia.org/T238593) [20:12:54] (03CR) 10Dzahn: [C: 03+2] aphlict: listen on IPv6 instead IPv4 for client and admin ports [puppet] - 10https://gerrit.wikimedia.org/r/619560 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [20:16:17] 10Operations, 10Phabricator, 10Traffic, 10serviceops, and 2 others: Phabricator 
downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (10mmodell) Will it blend? [20:16:26] Jdlrobson: ping regarding https://gerrit.wikimedia.org/r/c/mediawiki/core/+/619092 - can roll this out if you're around to fix the two user-facing issues in Echo and sitenotice. [20:17:03] sitenotice broken for over a week now for some wikis [20:17:44] (03PS1) 10Andrew Bogott: Mark cloudvirt1015 as spare [puppet] - 10https://gerrit.wikimedia.org/r/619562 (https://phabricator.wikimedia.org/T257366) [20:19:30] 10Operations, 10Phabricator, 10Traffic, 10serviceops, and 2 others: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (10mmodell) it appears to blend. [20:29:41] (03PS2) 10Andrew Bogott: Mark cloudvirt1015 as spare [puppet] - 10https://gerrit.wikimedia.org/r/619562 (https://phabricator.wikimedia.org/T257366) [20:32:06] (03PS1) 10Cwhite: prometheus: add config tests [puppet] - 10https://gerrit.wikimedia.org/r/619563 (https://phabricator.wikimedia.org/T256418) [20:32:31] (03CR) 10jerkins-bot: [V: 04-1] prometheus: add config tests [puppet] - 10https://gerrit.wikimedia.org/r/619563 (https://phabricator.wikimedia.org/T256418) (owner: 10Cwhite) [20:32:49] (03PS2) 10Cwhite: prometheus: add config tests [puppet] - 10https://gerrit.wikimedia.org/r/619563 (https://phabricator.wikimedia.org/T256418) [20:34:30] (03CR) 10jerkins-bot: [V: 04-1] prometheus: add config tests [puppet] - 10https://gerrit.wikimedia.org/r/619563 (https://phabricator.wikimedia.org/T256418) (owner: 10Cwhite) [20:36:23] 10Operations, 10Phabricator, 10Traffic, 10serviceops, and 2 others: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (10Dzahn) >>! In T238593#6376059, @CDanis wrote: > The Envoy TLS terminator is now configured to allow websocket upgr... 
[20:37:11] 10Operations, 10DBA, 10Phabricator: replace phabricator db passwords with longer passwords - https://phabricator.wikimedia.org/T250361 (10mmodell) ugh. @jcrespo, I apologize, I let the ball drop on this one. It wouldn't take much effort on my part, we already have the puppet scaffolding to support separati... [20:44:07] 10Operations, 10Phabricator, 10Traffic, 10serviceops, and 2 others: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (10Dzahn) 05Open→03Resolved a:03Dzahn We are seeing realtime notifications again and aphlict is now separated f... [20:44:24] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:46:20] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:51:09] RECOVERY - Rate of JVM GC Old generation-s runs - logstash1012-production-logstash-eqiad on logstash1012 is OK: (C)100 gt (W)80 gt 76.27 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1012&panelId=37 [20:51:23] (03CR) 10Hashar: [C: 03+1] "That is the primary change for the switch, the parent changes are dummy ones to cleanup puppet. 
We once deployed it but it failed cause s" [puppet] - 10https://gerrit.wikimedia.org/r/611369 (https://phabricator.wikimedia.org/T149924) (owner: 10Hashar) [20:53:31] 10Operations, 10serviceops, 10Platform Team Initiatives (Containerise Services): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10Dzahn) Checking the box for phabricator/aphlict. aphlict is now running on a dedicated VM, aphlict1001, on buster and nodejs 10.... [20:53:49] 10Operations, 10serviceops, 10Platform Team Initiatives (Containerise Services): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10Dzahn) [20:54:00] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:54:56] 10Operations, 10serviceops, 10Platform Team Initiatives (Containerise Services): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10Dzahn) Also checking the box for etherpad. That is also on buster and nodejs10 meanwhile. Upgraded by Alex Kosiaris. [20:55:10] 10Operations, 10serviceops, 10Platform Team Initiatives (Containerise Services): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10Dzahn) [20:58:09] (03PS1) 10ArielGlenn: cleanup misc dumps that aren't stored in per-date urls [puppet] - 10https://gerrit.wikimedia.org/r/619571 (https://phabricator.wikimedia.org/T257782) [21:07:48] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:10:44] (03PS1) 10Cwhite: fix flake8 [puppet] - 10https://gerrit.wikimedia.org/r/619572 [21:14:38] (03PS3) 10Cwhite: prometheus: add config tests [puppet] - 10https://gerrit.wikimedia.org/r/619563 (https://phabricator.wikimedia.org/T256418) [21:17:09] Krinkle: it can be backported yes. [21:18:18] (03CR) 10Krinkle: [C: 03+2] skins: Call headElement() after getTemplateData() in SkinMustache [core] (wmf/1.36.0-wmf.3) - 10https://gerrit.wikimedia.org/r/619092 (https://phabricator.wikimedia.org/T259872) (owner: 10Krinkle) [21:19:01] * Krinkle gets a new beverage [21:21:14] * hauskatze drills a hole in Krinkle 's new beverage glass/can [21:31:53] (03PS1) 10Cwhite: prometheus: remove unnecessary define and split mediawiki queries by channel [puppet] - 10https://gerrit.wikimedia.org/r/619574 (https://phabricator.wikimedia.org/T256418) [21:31:55] (03PS1) 10Andrew Bogott: Openstack Nova: warn if any flavors are not assigned aggregates [puppet] - 10https://gerrit.wikimedia.org/r/619575 [21:32:05] * Platonides gives Krinkle hauskatze's new jar [21:32:23] (03CR) 10jerkins-bot: [V: 04-1] Openstack Nova: warn if any flavors are not assigned aggregates [puppet] - 10https://gerrit.wikimedia.org/r/619575 (owner: 10Andrew Bogott) [21:35:57] (03PS2) 10Andrew Bogott: Openstack Nova: warn if any flavors are not assigned aggregates [puppet] - 10https://gerrit.wikimedia.org/r/619575 [21:39:15] (03PS1) 10Jdlrobson: Hide vertical nav-boxes on mobile domain [extensions/MobileFrontend] (wmf/1.36.0-wmf.3) - 10https://gerrit.wikimedia.org/r/619380 (https://phabricator.wikimedia.org/T231160) [21:39:16] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:39:50] (03PS1) 10Jdlrobson: Hide vertical nav-boxes on mobile domain [extensions/MobileFrontend] (wmf/1.36.0-wmf.4) - 10https://gerrit.wikimedia.org/r/619381 (https://phabricator.wikimedia.org/T231160) [21:41:06] (03Merged) 10jenkins-bot: skins: Call headElement() after getTemplateData() in SkinMustache [core] (wmf/1.36.0-wmf.3) - 10https://gerrit.wikimedia.org/r/619092 (https://phabricator.wikimedia.org/T259872) (owner: 10Krinkle) [21:41:12] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:42:15] Krinkle: need me to test before syncing? [21:43:09] staging now, [21:43:10] yeah [21:44:18] Jdlrobson: live on mwdebug1002 [21:45:09] I'm also checking https://nl.wikimedia.org/wiki/Home with private and mwdebug1002 and see the button is working there with XWD on [21:49:33] Krinkle: LGTM [21:51:27] ok, rolling out [21:52:27] !log krinkle@deploy1001 Synchronized php-1.36.0-wmf.3/includes/skins/SkinMustache.php: Ibe1f07346, T259872, T259858 (duration: 01m 04s) [21:52:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:31] T259858: Sitenotice: Button for dismissing content isn't in the right place and does nothing - https://phabricator.wikimedia.org/T259858 [21:52:31] T259872: Echo new message alert has no orange background in vector - https://phabricator.wikimedia.org/T259872 [22:04:44] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:08:38] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:15:59] (03PS3) 10Andrew Bogott: Openstack Nova: warn if any flavors are not assigned aggregates [puppet] - 10https://gerrit.wikimedia.org/r/619575 (https://phabricator.wikimedia.org/T259542) [22:16:23] (03CR) 10jerkins-bot: [V: 04-1] Openstack Nova: warn if any flavors are not assigned aggregates [puppet] - 10https://gerrit.wikimedia.org/r/619575 (https://phabricator.wikimedia.org/T259542) (owner: 10Andrew Bogott) [22:19:28] (03PS4) 10Andrew Bogott: Openstack Nova: warn if any flavors are not assigned aggregates [puppet] - 10https://gerrit.wikimedia.org/r/619575 (https://phabricator.wikimedia.org/T259542) [22:20:28] (03CR) 10Andrew Bogott: [C: 03+2] Openstack Nova: warn if any flavors are not assigned aggregates [puppet] - 10https://gerrit.wikimedia.org/r/619575 (https://phabricator.wikimedia.org/T259542) (owner: 10Andrew Bogott) [22:25:15] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10Dzahn) [22:27:54] 10Operations, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review, 10Privacy: Kibana next sending telemetry to elastic.co - https://phabricator.wikimedia.org/T259794 (10Krinkle) Can we set a hard CSP on this domain at the web server level so that in general our report will be "oh no, there's a requ... 
[22:30:40] (03PS1) 10Dzahn: mailman: replace fermium with lists1001 in rsync scripts [puppet] - 10https://gerrit.wikimedia.org/r/619585 (https://phabricator.wikimedia.org/T224586) [22:37:12] (03CR) 10Dzahn: [C: 03+1] "+1 for icinga-sms.py" [puppet] - 10https://gerrit.wikimedia.org/r/619572 (owner: 10Cwhite) [22:37:20] (03PS1) 10Dzahn: remove fermium from DHCP,partman and acme_chief [puppet] - 10https://gerrit.wikimedia.org/r/619586 (https://phabricator.wikimedia.org/T224586) [22:42:49] (03PS1) 10Jdlrobson: Beta cluster: Enable search in header on Vector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619588 (https://phabricator.wikimedia.org/T249363) [22:46:06] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:47:50] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:00:04] RoanKattouw, Niharika, and Urbanecm: Your horoscope predicts another unfortunate Evening backport window(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200811T2300). [23:00:04] Jdlrobson and kaldari: A patch you scheduled for Evening backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:01:28] o/ here [23:01:44] I can deploy today! [23:01:48] thanks Urbanecm [23:02:11] here! [23:02:24] Thank you! 
[23:02:45] (03PS2) 10Urbanecm: Switching to updated license definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618586 (owner: 10Kaldari) [23:02:47] (03CR) 10Urbanecm: [C: 03+2] Switching to updated license definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618586 (owner: 10Kaldari) [23:03:27] (03Merged) 10jenkins-bot: Switching to updated license definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618586 (owner: 10Kaldari) [23:03:29] (03CR) 10Urbanecm: [C: 03+2] Hide vertical nav-boxes on mobile domain [extensions/MobileFrontend] (wmf/1.36.0-wmf.3) - 10https://gerrit.wikimedia.org/r/619380 (https://phabricator.wikimedia.org/T231160) (owner: 10Jdlrobson) [23:03:31] (03CR) 10Urbanecm: [C: 03+2] Hide vertical nav-boxes on mobile domain [extensions/MobileFrontend] (wmf/1.36.0-wmf.4) - 10https://gerrit.wikimedia.org/r/619381 (https://phabricator.wikimedia.org/T231160) (owner: 10Jdlrobson) [23:04:01] kaldari: could you test that at mwdebug1001, please? [23:04:35] will do.... [23:06:20] Urbanecm: es perfecto! [23:06:25] syncing! [23:07:56] !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: 28faa279dacf6a4d6f0a663844e913738c2fa142: Switching to updated license definition (duration: 01m 04s) [23:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:08:01] kaldari: done! [23:09:17] Thanks! I'll keep an eye on the logs just in case. [23:10:56] Urbanecm: https://gerrit.wikimedia.org/r/c/619588/ is beta cluster only. I think it just needs a +2 ? [23:11:09] yup [23:11:17] (and git pull at deploy1001) [23:11:50] Jdlrobson: should that be merged? [23:12:32] Urbanecm: yes please [23:12:44] (03CR) 10Urbanecm: [C: 03+2] Beta cluster: Enable search in header on Vector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619588 (https://phabricator.wikimedia.org/T249363) (owner: 10Jdlrobson) [23:12:48] Jdlrobson: done! 
[23:12:56] (it will be auto-deployed within 30 minutes) [23:13:28] (03Merged) 10jenkins-bot: Beta cluster: Enable search in header on Vector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619588 (https://phabricator.wikimedia.org/T249363) (owner: 10Jdlrobson) [23:16:54] thanks for that Urbanecm [23:16:59] happy to help [23:25:56] (03Merged) 10jenkins-bot: Hide vertical nav-boxes on mobile domain [extensions/MobileFrontend] (wmf/1.36.0-wmf.3) - 10https://gerrit.wikimedia.org/r/619380 (https://phabricator.wikimedia.org/T231160) (owner: 10Jdlrobson) [23:25:59] (03Merged) 10jenkins-bot: Hide vertical nav-boxes on mobile domain [extensions/MobileFrontend] (wmf/1.36.0-wmf.4) - 10https://gerrit.wikimedia.org/r/619381 (https://phabricator.wikimedia.org/T231160) (owner: 10Jdlrobson) [23:26:29] ready to test on debug :) [23:27:24] just wait a sec, doing git-fu :-) [23:28:30] Jdlrobson: pulled onto mwdebug1001 :) [23:31:21] Urbanecm: hmm it's not kicking in (but that's fine) and i've just realised why. (face palm) [23:31:24] you can sync that though [23:31:49] Jdlrobson: okay. Do you need any follow-up patch? [23:31:59] Happy to sync that too, but you'd need to get someone to merge it to master [23:32:32] Urbanecm: could i trouble you for one more config patch? [23:32:37] sure! [23:33:42] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/619592 Update wgMFRemovableClasses [NEW] [23:33:46] i'll add it to wikitech:Deployments [23:33:51] (03PS1) 10Jdlrobson: Update wgMFRemovableClasses [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619592 (https://phabricator.wikimedia.org/T231160) [23:33:51] cool [23:34:22] i didnt realise there was a production override [23:34:25] hmm, isn't the value the same as in extensions.json? [23:34:25] would have saved a lot of time! 
:) [23:34:33] yep but arrays dont merge by default [23:34:39] (03CR) 10Urbanecm: [C: 03+2] Update wgMFRemovableClasses [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619592 (https://phabricator.wikimedia.org/T231160) (owner: 10Jdlrobson) [23:34:52] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.4/extensions/MobileFrontend/extension.json: 81d54b0ec82d0b78f723f9400031e918a4a143aa: Hide vertical nav-boxes on mobile domain (T231160) (duration: 01m 05s) [23:34:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:55] I thought about removing the production override, but I'm fine with syncing this too :) [23:34:55] (Associative arrays that is) [23:34:56] T231160: HtmlFormatter incorrectly removes partial classname matches in "xenomobile" or "not-an-navbox" - https://phabricator.wikimedia.org/T231160 [23:35:17] (03Merged) 10jenkins-bot: Update wgMFRemovableClasses [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619592 (https://phabricator.wikimedia.org/T231160) (owner: 10Jdlrobson) [23:35:18] Me too but .mbox-image' is not present in MobileFrontend [23:35:36] ah, gotcha! [23:36:00] once the .3 patch is deployed, I'll ping you to test it :) [23:36:35] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.3/extensions/MobileFrontend/extension.json: c22d65ff9b2439f484ab8ccffed87b00e78c3ad2: Hide vertical nav-boxes on mobile domain (T231160) (duration: 01m 03s) [23:36:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:37:05] Jdlrobson: ready for you at mwdebug1001 [23:37:50] yay [23:37:51] that did it! [23:37:53] please sync :) [23:37:57] wonderful, syncing! 
[23:38:13] please sync :) [23:38:15] oops sorry [23:38:17] wrong tab :) [23:39:32] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 0f238f71c95c7bd7534c28abfac759fbb47f674f: Update wgMFRemovableClasses (T231160) (duration: 01m 03s) [23:39:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:39:44] Jdlrobson: should be all done :) [23:39:47] anything else? [23:40:48] thanks for all your help today Urbanecm ! [23:41:17] !log Evening B&C window completed [23:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:41:20] no problem Jdlrobson :)