[00:02:40] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:03:14] RECOVERY - Check systemd state on an-launcher1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:06:18] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:08:44] PROBLEM - Check systemd state on an-launcher1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:09:56] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:11:46] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:15:28] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:20:54] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:25:02] RECOVERY - Check systemd state on wdqs1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:30:30] PROBLEM - Check systemd state on wdqs1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:41:00] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:46:30] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:54:16] (CR) Krinkle: "I too thought it was a bug, but I was convinced otherwise." [mediawiki-config] - https://gerrit.wikimedia.org/r/594214 (owner: Jforrester) [01:03:16] (CR) Krinkle: [C: +1] Replace stringified class names with ::class [mediawiki-config] - https://gerrit.wikimedia.org/r/593654 (https://phabricator.wikimedia.org/T251841) (owner: Reedy) [01:04:20] Hello, can someone please refresh the special page for broken redirects on srwiki? [01:04:53] I reverted a large number of talk pages, and I want to be sure that there aren't any broken redirects in the talk namespace on srwiki [01:13:09] I created task T251849 [01:13:10] T251849: Run updateSpecialPages.php maintenance script on srwiki - https://phabricator.wikimedia.org/T251849 [01:13:27] Please do this as soon as possible, I'm going to sleep now.. 
Cya [01:13:46] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:14:59] (PS1) Andrew Bogott: Keystone: create /etc/keystone/fernet-keys directory [puppet] - https://gerrit.wikimedia.org/r/594335 (https://phabricator.wikimedia.org/T251294) [01:15:36] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:15:41] (CR) Aaron Schulz: "Yes. No issues that I know of. Flow has extra DB/DBO layers that use startAtomic/endAtomic a lot, which mitigates the problems a lot." [mediawiki-config] - https://gerrit.wikimedia.org/r/525977 (owner: Aaron Schulz) [01:22:19] (CR) Andrew Bogott: [C: +2] Keystone: create /etc/keystone/fernet-keys directory [puppet] - https://gerrit.wikimedia.org/r/594335 (https://phabricator.wikimedia.org/T251294) (owner: Andrew Bogott) [01:32:06] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:35:46] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:36:18] PROBLEM - PHP opcache health on mwdebug1001 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [01:43:36] RECOVERY - PHP opcache health on mwdebug1001 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [01:50:26] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:52:18] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:54:30] (PS1) Andrew Bogott: Openstack haproxy ferm rules: support AAAA and fix syntax [puppet] - https://gerrit.wikimedia.org/r/594338 (https://phabricator.wikimedia.org/T251294) [01:56:11] (PS4) Aaron Schulz: Use DBO_DEFAULT for extension1 [mediawiki-config] - https://gerrit.wikimedia.org/r/525977 [01:59:29] (PS2) Andrew Bogott: Openstack haproxy ferm rules: support AAAA and fix syntax [puppet] - https://gerrit.wikimedia.org/r/594338 (https://phabricator.wikimedia.org/T251294) [02:25:32] RECOVERY - Check systemd state on wdqs1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:28:13] (CR) Krinkle: [C: +1] Use DBO_DEFAULT for extension1 [mediawiki-config] - https://gerrit.wikimedia.org/r/525977 (owner: Aaron Schulz) [02:30:58] PROBLEM - Check systemd state on wdqs1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:03:50] RECOVERY - Check systemd state on an-launcher1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:09:18] PROBLEM - Check systemd state on an-launcher1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:25:36] RECOVERY - Check systemd state on wdqs1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:31:06] PROBLEM - Check systemd state on wdqs1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:02:14] RECOVERY - Check systemd state on an-launcher1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:09:28] PROBLEM - Check systemd state on an-launcher1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:25:44] RECOVERY - Check systemd state on wdqs1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:31:10] PROBLEM - Check systemd state on wdqs1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:39:35] !log Restart mysql on tendril host: db1115 - T231769 [04:39:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:39:39] T231769: Investigate possible memory leak on db1115 - https://phabricator.wikimedia.org/T231769 [04:43:35] ACKNOWLEDGEMENT - HTTPS-dbtree on dbmonitor1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 354 bytes in 0.009 second response time Marostegui tendril restart https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org [04:53:52] PROBLEM - HTTPS-dbtree on dbmonitor1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org [04:54:35] ^ I am dealing with that [04:54:50] host seems unresponsive [04:54:55] I have rebooted it [04:54:59] It was totally overloaded [04:56:26] host back and mysql started [04:57:14] ^kormat reason #11 to kill tendril [04:57:22] RECOVERY - HTTPS-dbtree on dbmonitor1001 is OK: HTTP OK: HTTP/1.1 200 OK - 85473 bytes in 0.906 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org [05:00:04] marostegui: How many deployers does it take to do s5 and s6 primary database master restart deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200505T0500). [05:00:12] I am not doing the restarts yet [05:01:17] The messages log is full of blocked connections? [05:01:30] is it normal for events to take some time to get fresh status after reboot? [05:01:36] no [05:01:40] they are normally instant [05:01:52] that may have caused the overload, then [05:02:43] check /var/log/messages [05:02:50] I haven't seen that before in the tendril log [05:02:57] on tendril db or dbmonitor? 
[05:03:01] db1115 [05:03:24] I saw just a few of them on messages.1 but definitely not the amount we are getting now [05:03:51] root@db1115:~# cat /var/log/messages.1 | grep MAC | wc -l [05:03:52] 797 [05:03:52] root@db1115:~# cat /var/log/messages | grep MAC | wc -l [05:03:52] 4709 [05:05:36] PROBLEM - Check systemd state on labstore1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:05:44] PROBLEM - Check systemd state on labstore1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:06:34] PROBLEM - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:10:14] RECOVERY - Check systemd state on wdqs1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:12:10] ACKNOWLEDGEMENT - Check systemd state on labstore1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Bstorm HDFS rsync jobs are broken, apparently. The analytics teams would have been notified by the particular jobs. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:12:11] ACKNOWLEDGEMENT - Check systemd state on labstore1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Bstorm HDFS rsync jobs are broken, apparently. The analytics teams would have been notified by the particular jobs. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:15:44] PROBLEM - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:18:48] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 92 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:19:06] !log Start s5 and s6 maintenance - T251154 [05:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:19:09] T251154: Upgrade and restart s5 and s6 primary DB master: Tue 5th May - https://phabricator.wikimedia.org/T251154 [05:20:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set s5 and s6 as read-only for maintenance T251154', diff saved to https://phabricator.wikimedia.org/P11132 and previous config saved to /var/cache/conftool/dbconfig/20200505-052058-marostegui.json [05:21:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:21:13] RO confirmed [05:23:34] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:23:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set s5 and s6 as read-only=off for maintenance T251154', diff saved to https://phabricator.wikimedia.org/P11133 and previous config saved to /var/cache/conftool/dbconfig/20200505-052334-marostegui.json [05:23:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:23:37] we are back [05:23:45] checking [05:24:02] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 62, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:24:10] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:24:13] fr still read only [05:24:36] I can edit it fine [05:24:41] yeah, now yes [05:24:43] maybe the config was still spreading? [05:24:44] it took some seconds [05:24:46] yeah [05:24:49] it is not instant [05:24:53] checking dewiki [05:25:05] works fine [05:25:20] yeah [05:25:34] question [05:25:47] on restart, do you need to restart heartbeat or does it work automatically? [05:26:06] or 3, you run puppet anyway? [05:26:06] I run puppet right away [05:26:09] :-D [05:26:35] cool, noting for my own knowledge [05:26:55] I am going to update the tasks with the timings and the fact that we started a bit later than scheduled [05:27:00] Thanks for the support! [05:27:16] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:27:22] tendril is still a tad slow for me [05:27:39] I don't think that's normal, but go on, I will keep researching that [05:27:41] yeah, that's next. We need to check what's going on there :( [05:28:11] https://grafana.wikimedia.org/d/000000377/host-overview?panelId=3&fullscreen&orgId=1&refresh=5m&var-server=db1115&var-datasource=eqiad%20prometheus%2Fops&var-cluster=mysql [05:28:18] Something happened at around 4am utc [05:28:38] and apparently keeps happening [05:28:46] well, backups are running today [05:28:56] but that should not be a huge load [05:29:11] also they start much earlier than 4am [05:29:33] Operations, DBA: Upgrade and restart s5 and s6 primary DB master: Tue 5th May - https://phabricator.wikimedia.org/T251154 (Marostegui) Open→Resolved This was done. We started a bit later than expected due to some on-going issues with another service. 
RO started: 05:20:59 RO finished: 05:23:34 [05:29:36] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui) [05:30:02] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui) [05:30:23] Operations, DBA: Upgrade and restart s5 and s6 primary DB master: Tue 5th May - https://phabricator.wikimedia.org/T251154 (Marostegui) [05:31:52] 05:31:43 up 36 min, 3 users, load average: 64.56, 57.27, 36.13 [05:37:00] ACKNOWLEDGEMENT - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 92 jobs Jcrespo gerrit backups sometimes get delayed at beginning of month due to large, full backups - The acknowledgement expires at: 2020-05-13 05:36:00. https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [06:09:00] RECOVERY - Check systemd state on wdqs1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:10:04] PROBLEM - PHP opcache health on scandium is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [06:11:56] RECOVERY - Check systemd state on wdqs1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:14:28] PROBLEM - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:17:26] PROBLEM - Check systemd state on wdqs1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:20:58] following up with search --^ [06:26:34] RECOVERY - PHP opcache health on scandium is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [06:34:13] (PS2) ArielGlenn: CodeReview tables are now available for public download. [puppet] - https://gerrit.wikimedia.org/r/593731 (https://phabricator.wikimedia.org/T243055) [06:34:30] RECOVERY - Check systemd state on wdqs1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:36:08] (CR) ArielGlenn: [C: +2] CodeReview tables are now available for public download. [puppet] - https://gerrit.wikimedia.org/r/593731 (https://phabricator.wikimedia.org/T243055) (owner: ArielGlenn) [06:46:20] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:48:08] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:52:34] (PS1) Marostegui: tendril.my.cnf: Enable tokudb for 10.1 [puppet] - https://gerrit.wikimedia.org/r/594405 (https://phabricator.wikimedia.org/T231185) [06:55:46] RECOVERY - Check systemd state on wdqs1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:59:00] !log depool wdqs1006 heavy lag [06:59:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:01:14] PROBLEM - Check systemd state on wdqs1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:03:32] (CR) Marostegui: [C: +2] "Works as expected: https://puppet-compiler.wmflabs.org/compiler1002/22277/" [puppet] - https://gerrit.wikimedia.org/r/594405 (https://phabricator.wikimedia.org/T231185) (owner: Marostegui) [07:08:23] Operations, DBA, Patch-For-Review: Disable/remove unused features on Tendril - https://phabricator.wikimedia.org/T231185 (Marostegui) We had an issue with tendril today where tendril was very slow and almost unresponsive, at first I thought it was another case of {T231769}, but it wasn't. First of a... [07:12:20] (CR) Ema: Add the ability to consume from kafka (1 comment) [software/purged] - https://gerrit.wikimedia.org/r/594147 (https://phabricator.wikimedia.org/T133821) (owner: Giuseppe Lavagetto) [07:14:42] PROBLEM - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:14:43] (CR) Ema: [C: +1] "Let's add a line to d/changelog, other than that I like it!" [software/purged] - https://gerrit.wikimedia.org/r/594147 (https://phabricator.wikimedia.org/T133821) (owner: Giuseppe Lavagetto) [07:16:54] (PS1) Busecolak: add Dockerfile and example config for Druid v0.17.0 [software/druid_exporter] - https://gerrit.wikimedia.org/r/594408 [07:16:56] (CR) Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! 
:) To learn how to get your code changes reviewed faster and more likely to get" [software/druid_exporter] - https://gerrit.wikimedia.org/r/594408 (owner: Busecolak) [07:19:54] (CR) Busecolak: [C: +1] add Dockerfile and example config for Druid v0.17.0 [software/druid_exporter] - https://gerrit.wikimedia.org/r/594408 (owner: Busecolak) [07:21:04] (CR) Muehlenhoff: [WIP] Initial debian commit (1 comment) [debs/anaconda] (debian) - https://gerrit.wikimedia.org/r/594204 (https://phabricator.wikimedia.org/T251006) (owner: Ottomata) [07:24:53] (CR) Ema: "Really cool. Are we gonna run this in CI too? Currently purged is using debian-glue so we could just add `make integration` to debian/rule" (2 comments) [software/purged] - https://gerrit.wikimedia.org/r/594148 (https://phabricator.wikimedia.org/T133821) (owner: Giuseppe Lavagetto) [07:24:58] RECOVERY - Check systemd state on wdqs1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:30:28] PROBLEM - Check systemd state on wdqs1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:30:52] this is wip, should be fixed soon --^ [07:33:08] (CR) Elukey: [V: +2 C: +2] add Dockerfile and example config for Druid v0.17.0 [software/druid_exporter] - https://gerrit.wikimedia.org/r/594408 (owner: Busecolak) [07:33:13] (CR) Elukey: [V: +2 C: +2] "Thanks a lot!" 
[software/druid_exporter] - https://gerrit.wikimedia.org/r/594408 (owner: Busecolak) [07:36:37] !log zpapierski@deploy1001 Started deploy [wdqs/wdqs@d37a059]: fix for the duplicated jars [07:36:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:58] PROBLEM - Query Service HTTP Port on wdqs1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Server Error - 9597 bytes in 0.014 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [07:43:46] PROBLEM - Check systemd state on wdqs1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:43:48] PROBLEM - Query Service HTTP Port on wdqs2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Server Error - 9597 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [07:43:50] PROBLEM - Query Service HTTP Port on wdqs2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Server Error - 9597 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [07:44:36] PROBLEM - Check systemd state on wdqs2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:45:22] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2002.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2002.codfw.wmnet are marked down but pooled: wdqs-internal_80: Servers wdqs2004.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:45:25] PROBLEM - LVS HTTP codfw IPv4 #page on wdqs.svc.codfw.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Server Error - 9597 bytes in 0.078 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [07:45:28] PROBLEM - Check systemd state on wdqs2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:45:47] looks like we have a bad deployment on wdqs [07:45:47] woop [07:45:52] PROBLEM - Query Service HTTP Port on wdqs2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Server Error - 9597 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [07:45:55] rollback coming up [07:45:57] <_joe_> zpapierski: rollback :) [07:45:59] ok! [07:46:35] <_joe_> gehel: can you ack the alert in victorops? [07:46:45] <_joe_> (have you even set up victorops?) 
[07:46:46] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=jmx_wdqs_updater site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:46:46] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2002.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2002.codfw.wmnet are marked down but pooled: wdqs-internal_80: Servers wdqs2004.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:46:53] _joe_: I acked it [07:46:56] PROBLEM - Query Service HTTP Port on wdqs2005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Server Error - 9597 bytes in 0.011 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [07:47:02] _joe_: not yet setup :/ [07:47:09] I'm doing the rollback [07:47:11] <_joe_> XioNoX: in theory who's working on it should, but ok in this case [07:47:15] <_joe_> gehel: thanks! [07:47:20] RECOVERY - Check systemd state on wdqs2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:47:20] PROBLEM - Check systemd state on wdqs2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:47:21] yeah I know but they're busy :) [07:48:28] PROBLEM - Query Service HTTP Port on wdqs2006 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Server Error - 9597 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [07:49:06] PROBLEM - Check systemd state on wdqs2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:49:08] PROBLEM - Check systemd state on wdqs2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:49:10] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([wdqs2004.codfw.wmnet, wdqs2002.codfw.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [07:49:10] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([wdqs2004.codfw.wmnet, wdqs2002.codfw.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [07:50:08] !log gehel@deploy1001 Started deploy [wdqs/wdqs@d37a059]: rollback wdqs to v 0.3.22 [07:50:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:13] * volans almost here, but I see is known and being handled [07:50:20] PROBLEM - Query Service HTTP Port on wdqs2004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Server Error - 9597 bytes in 0.011 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [07:50:23] PROBLEM - LVS HTTPS eqiad IPv4 #page on wdqs.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 9640 bytes in 1.011 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [07:50:40] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1004.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1004.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:50:51] PROBLEM - LVS HTTP eqiad IPv4 #page on wdqs.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Server Error - 9597 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [07:50:53] 
<_joe_> uh why eqiad as well now? [07:51:00] PROBLEM - Query Service HTTP Port on wdqs1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Server Error - 9597 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [07:51:10] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1004.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1004.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:51:12] PROBLEM - Check systemd state on wdqs2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:51:16] RECOVERY - Query Service HTTP Port on wdqs1003 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.434 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [07:51:21] * jbond42 here [07:51:30] PROBLEM - Query Service HTTP Port on wdqs1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Server Error - 9597 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [07:51:30] <_joe_> jbond42: a bad release, it seems [07:51:43] actually, a bad fix for a bad release :/ [07:51:54] ack thanks [07:52:00] PROBLEM - Check systemd state on wdqs1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:52:17] RECOVERY - LVS HTTPS eqiad IPv4 #page on wdqs.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 483 bytes in 1.060 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [07:52:26] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:52:36] RECOVERY - Query Service HTTP Port on wdqs2005 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.034 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [07:52:54] RECOVERY - Query Service HTTP Port on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.029 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [07:52:54] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:52:55] RECOVERY - LVS HTTP codfw IPv4 #page on wdqs.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.101 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [07:52:58] PROBLEM - Check systemd state on wdqs2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:53:04] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:53:10] RECOVERY - Query Service HTTP Port on wdqs2003 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [07:53:12] RECOVERY - Query Service HTTP Port on wdqs2001 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [07:53:17] * apergos peeks in [07:53:22] RECOVERY - Query Service HTTP Port on wdqs1004 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.020 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [07:53:22] RECOVERY - Query Service HTTP Port on wdqs2002 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [07:53:25] ah [07:53:26] PROBLEM - Check systemd state on wdqs1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:53:28] and we're coming back up! 
[07:53:45] thx [07:53:56] RECOVERY - Check systemd state on wdqs2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:54:06] RECOVERY - Query Service HTTP Port on wdqs2004 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [07:54:06] RECOVERY - Query Service HTTP Port on wdqs2006 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.025 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [07:54:26] !log gehel@deploy1001 Finished deploy [wdqs/wdqs@d37a059]: rollback wdqs to v 0.3.22 (duration: 04m 18s) [07:54:26] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:54:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:37] RECOVERY - LVS HTTP eqiad IPv4 #page on wdqs.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.019 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [07:54:48] RECOVERY - Check systemd state on wdqs2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:54:59] (03PS1) 10Ema: vtc test box: install py3 version of jenkinsapi [puppet] - 10https://gerrit.wikimedia.org/r/594411 [07:55:00] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [07:55:01] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [07:56:18] RECOVERY - Check systemd state on wdqs1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:58:38] RECOVERY - Check systemd state on wdqs1003 is OK: OK - running: The system is fully 
operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:04:54] RECOVERY - Check systemd state on wdqs1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:07:31] (03PS4) 10Gilles: engine.ghostscript: use -sstdout=%stderr with gs [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/593358 (https://phabricator.wikimedia.org/T236240) (owner: 10AntiCompositeNumber) [08:07:38] RECOVERY - Check systemd state on wdqs2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:08:28] !log installing Java security updates on notebook/stat hosts [08:08:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:56] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:09:10] (03CR) 10Gilles: [C: 03+2] "Excellent work, thank you!" 
[software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/593358 (https://phabricator.wikimedia.org/T236240) (owner: 10AntiCompositeNumber) [08:09:16] (03CR) 10Ema: [C: 03+2] vcl: test 'exp' admission policy on two nodes [puppet] - 10https://gerrit.wikimedia.org/r/594144 (https://phabricator.wikimedia.org/T249809) (owner: 10Ema) [08:13:08] RECOVERY - Check systemd state on wdqs2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:16:46] RECOVERY - Check systemd state on wdqs2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:16:56] RECOVERY - Check systemd state on wdqs2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:17:39] (03CR) 10Gilles: [V: 03+2 C: 03+2] engine.ghostscript: use -sstdout=%stderr with gs [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/593358 (https://phabricator.wikimedia.org/T236240) (owner: 10AntiCompositeNumber) [08:19:06] RECOVERY - Check systemd state on wdqs1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:19:12] !log cp2027 and cp2029 (both text): varnish-fe restart to clear cache and evaluate 'exp' admission policy T144187 T249809 [08:19:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:16] T249809: cache_upload varnish-fe exhausting transient memory - https://phabricator.wikimedia.org/T249809 [08:19:16] T144187: Better handling for one-hit-wonder objects - https://phabricator.wikimedia.org/T144187 [08:19:56] @icinga-wm is currently posting in #wikidata - is this intentional? 
[08:20:27] hi [08:20:43] icinga-wm is "spamming" #wikidata [08:20:51] not sure if someone here can do something about it [08:21:16] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 93 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [08:22:34] CustosLimen: there was an incident related to wikidata query service, it should be over now [08:22:55] thanks [08:23:59] (03PS1) 10Jcrespo: tendril: Fix Google charts js api change [software/tendril] - 10https://gerrit.wikimedia.org/r/594412 [08:26:09] !log upgrading slapd on serpens/seaborgium [08:26:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:32] !log cp2028 and cp2030 (both upload): varnish-fe restart to clear cache and evaluate 'exp' admission policy T144187 T249809 [08:27:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:36] T249809: cache_upload varnish-fe exhausting transient memory - https://phabricator.wikimedia.org/T249809 [08:27:36] T144187: Better handling for one-hit-wonder objects - https://phabricator.wikimedia.org/T144187 [08:29:21] (03PS5) 10Jbond: cookbooks sre.hosts.rotate-pdu-password: small refactor [cookbooks] - 10https://gerrit.wikimedia.org/r/593476 (https://phabricator.wikimedia.org/T246890) [08:32:03] (03CR) 10Ayounsi: [C: 03+1] "Not tested but reviewed the code and discussed it with John and lgtm!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/593476 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [08:33:09] (03PS3) 10Jbond: cookbook sre.hosts.rotate-pdu-password: use request.Session and response.raise_for_status [cookbooks] - 10https://gerrit.wikimedia.org/r/594173 (https://phabricator.wikimedia.org/T246890) [08:33:15] (03PS8) 10Jbond: cookbooks sre.hosts.rotate-pdu-password: update to raise exceptions [cookbooks] - 10https://gerrit.wikimedia.org/r/594197 (https://phabricator.wikimedia.org/T246890) [08:33:20] (03PS14) 10Jbond: (WIP) cookbooks sre.hosts.rotate-pdu-password: reset SNMP [cookbooks] - 10https://gerrit.wikimedia.org/r/594199 (https://phabricator.wikimedia.org/T246890) [08:37:17] (03CR) 10Kormat: [C: 04-1] "One minor comment so far." (031 comment) [software/tendril] - 10https://gerrit.wikimedia.org/r/594412 (owner: 10Jcrespo) [08:38:41] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/594211 (owner: 10Muehlenhoff) [08:39:06] RECOVERY - Check systemd state on wdqs1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:39:25] @dcausse still posting [08:39:29] (in #wikidata) [08:44:05] !log reimaging es1024 to buster T250666 [08:44:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:09] T250666: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 [08:44:16] (03PS2) 10Dzahn: profile,gerrit: add enable_monitoring flag for gerrit-test [puppet] - 10https://gerrit.wikimedia.org/r/594293 (https://phabricator.wikimedia.org/T239151) (owner: 10Cwhite) [08:44:28] (03CR) 10Dzahn: [C: 03+2] "thanks! lgtm. 
https://puppet-compiler.wmflabs.org/compiler1003/22283/" [puppet] - 10https://gerrit.wikimedia.org/r/594293 (https://phabricator.wikimedia.org/T239151) (owner: 10Cwhite) [08:44:36] PROBLEM - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:45:32] (03CR) 10Jbond: "lgtm apart from the CI issue" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/594212 (owner: 10Muehlenhoff) [08:51:03] (03PS1) 10Dzahn: admins: add Eduardo Medina to ldap_only_admins (wmf) [puppet] - 10https://gerrit.wikimedia.org/r/594417 (https://phabricator.wikimedia.org/T251358) [08:53:31] (03CR) 10Ayounsi: [C: 03+1] "Actually found some small issues while reviewing the next CR." (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/593476 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [08:53:54] !log installing Java security updates on releases* [08:53:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:15] (03CR) 10Ayounsi: [C: 03+1] "Not tested but code reviewed and LGTM, using Session() makes the code much cleaner indeed!" [cookbooks] - 10https://gerrit.wikimedia.org/r/594173 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [08:55:30] (03CR) 10Volans: "small nit inline, Arzhel arrived before me :)" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/593476 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [08:55:53] (03CR) 10Muehlenhoff: Also specify system user range for systemd-sysusers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/594212 (owner: 10Muehlenhoff) [08:57:36] (03CR) 10Dzahn: "ran puppet on gerrit1001 and gerrit1002 then on icinga1001. 
checks got removed for gerrit1002" [puppet] - 10https://gerrit.wikimedia.org/r/594293 (https://phabricator.wikimedia.org/T239151) (owner: 10Cwhite) [08:57:38] (03PS2) 10Jcrespo: tendril: Fix Google charts js api change [software/tendril] - 10https://gerrit.wikimedia.org/r/594412 [08:57:54] (03CR) 10Jcrespo: "> Patch Set 1: Code-Review-1" (031 comment) [software/tendril] - 10https://gerrit.wikimedia.org/r/594412 (owner: 10Jcrespo) [08:58:14] (03CR) 10Jcrespo: "Done" (031 comment) [software/tendril] - 10https://gerrit.wikimedia.org/r/594412 (owner: 10Jcrespo) [08:58:47] (03PS1) 10Kormat: Revert "install_server: Allow reimage of es2025" [puppet] - 10https://gerrit.wikimedia.org/r/594420 (https://phabricator.wikimedia.org/T250666) [09:00:58] (03CR) 10Kormat: [C: 03+2] Revert "install_server: Allow reimage of es2025" [puppet] - 10https://gerrit.wikimedia.org/r/594420 (https://phabricator.wikimedia.org/T250666) (owner: 10Kormat) [09:01:41] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Add Eamedina to `wmf` LDAF group - https://phabricator.wikimedia.org/T251358 (10Dzahn) p:05Triage→03High [09:01:41] (03CR) 10Muehlenhoff: [C: 03+2] Ship /etc/sysusers.d base directory for systemd-sysusers [puppet] - 10https://gerrit.wikimedia.org/r/594211 (owner: 10Muehlenhoff) [09:01:45] kormat: shall I puppet-merge along? [09:02:01] moritzm: : yes, please :) [09:02:06] (03CR) 10Volans: [C: 04-1] "Small details inline" (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/594173 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [09:02:33] done [09:02:40] thanks :) [09:03:32] !log restarting wdqs updater on all servers [09:03:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:47] 10Operations, 10Gerrit, 10vm-requests, 10Patch-For-Review: Gerrit VM to test data migration - https://phabricator.wikimedia.org/T239151 (10Dzahn) Icinga monitoring for gerrit1002 has been removed. Thanks Cole. 
[09:03:57] RECOVERY - Check systemd state on wdqs1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:04:27] (03PS6) 10Jbond: cookbooks sre.hosts.rotate-pdu-password: small refactor [cookbooks] - 10https://gerrit.wikimedia.org/r/593476 (https://phabricator.wikimedia.org/T246890) [09:05:08] (03CR) 10Jbond: "updated thanks" (035 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/593476 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [09:06:47] (03CR) 10Volans: [C: 03+1] "optional nits inline, looks good otherwise" (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/594197 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [09:07:23] PROBLEM - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:07:57] (03CR) 10Dzahn: [C: 03+2] "noop in prod as it should. 
https://puppet-compiler.wmflabs.org/compiler1002/22284/gerrit1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/594136 (owner: 10Paladox) [09:08:11] (03PS3) 10Jcrespo: tendril: Fix Google charts js api change [software/tendril] - 10https://gerrit.wikimedia.org/r/594412 [09:09:35] (03PS4) 10Paladox: phabricator: Set phabricator_domain & phabricator_altdomain for devtools [puppet] - 10https://gerrit.wikimedia.org/r/594154 [09:09:48] (03CR) 10Dzahn: "eh, wrong compiler link but still good: https://puppet-compiler.wmflabs.org/compiler1001/22285/phab1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/594136 (owner: 10Paladox) [09:09:51] RECOVERY - Check systemd state on wdqs1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:09:58] (03PS3) 10Paladox: phabricator: Install the php zip extension [puppet] - 10https://gerrit.wikimedia.org/r/594157 [09:10:04] (03PS7) 10Paladox: phabricator: Drop phd.pid-directory as it's now uneeded [puppet] - 10https://gerrit.wikimedia.org/r/594162 [09:10:06] (03CR) 10Marostegui: "Let's use T96499 as main tack for this patch so it is easier to find in the future" [software/tendril] - 10https://gerrit.wikimedia.org/r/594412 (owner: 10Jcrespo) [09:10:52] (03CR) 10Kormat: "One more, even more minor, comment." 
(031 comment) [software/tendril] - 10https://gerrit.wikimedia.org/r/594412 (owner: 10Jcrespo) [09:11:22] (03PS4) 10Jbond: cookbook sre.hosts.rotate-pdu-password: use request.Session and response.raise_for_status [cookbooks] - 10https://gerrit.wikimedia.org/r/594173 (https://phabricator.wikimedia.org/T246890) [09:12:56] (03PS5) 10Jbond: cookbook sre.hosts.rotate-pdu-password: use request.Session and response.raise_for_status [cookbooks] - 10https://gerrit.wikimedia.org/r/594173 (https://phabricator.wikimedia.org/T246890) [09:13:00] (03PS4) 10Jcrespo: tendril: Fix Google charts js api change [software/tendril] - 10https://gerrit.wikimedia.org/r/594412 [09:13:47] PROBLEM - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:14:25] (03PS6) 10Jbond: cookbook sre.hosts.rotate-pdu-password: use request.Session and response.raise_for_status [cookbooks] - 10https://gerrit.wikimedia.org/r/594173 (https://phabricator.wikimedia.org/T246890) [09:14:28] (03CR) 10Jbond: "updated thanks" (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/594173 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [09:15:57] (03PS2) 10Muehlenhoff: Also specify system user range for systemd-sysusers [puppet] - 10https://gerrit.wikimedia.org/r/594212 [09:16:11] (03PS1) 10Jcrespo: dbtree: Fix Google charts js api change [software/dbtree] - 10https://gerrit.wikimedia.org/r/594422 [09:17:51] (03CR) 10Jcrespo: tendril: Fix Google charts js api change (031 comment) [software/tendril] - 10https://gerrit.wikimedia.org/r/594412 (owner: 10Jcrespo) [09:18:45] (03PS4) 10Vgutierrez: Release 8.1.0-unreleased-1wm1 [debs/trafficserver] (8.1.x) - 10https://gerrit.wikimedia.org/r/591308 [09:18:57] (03CR) 10jerkins-bot: [V: 04-1] Release 8.1.0-unreleased-1wm1 [debs/trafficserver] (8.1.x) - 10https://gerrit.wikimedia.org/r/591308 (owner: 10Vgutierrez) [09:19:42] 
(03CR) 10Kormat: [C: 03+1] "LGTM" [software/tendril] - 10https://gerrit.wikimedia.org/r/594412 (owner: 10Jcrespo) [09:21:18] (03CR) 10Kormat: [C: 03+1] dbtree: Fix Google charts js api change [software/dbtree] - 10https://gerrit.wikimedia.org/r/594422 (owner: 10Jcrespo) [09:21:46] (03CR) 10Jcrespo: [C: 03+2] tendril: Fix Google charts js api change [software/tendril] - 10https://gerrit.wikimedia.org/r/594412 (owner: 10Jcrespo) [09:23:39] (03CR) 10Dzahn: "unsure if we need it, but in production it does not show the warning, so maybe Mukunda told it to ignore that?" [puppet] - 10https://gerrit.wikimedia.org/r/594157 (owner: 10Paladox) [09:24:43] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] tendril: Fix Google charts js api change [software/tendril] - 10https://gerrit.wikimedia.org/r/594412 (owner: 10Jcrespo) [09:24:56] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3052 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [09:26:04] (03PS9) 10Jbond: cookbooks sre.hosts.rotate-pdu-password: update to raise exceptions [cookbooks] - 10https://gerrit.wikimedia.org/r/594197 (https://phabricator.wikimedia.org/T246890) [09:27:30] (03CR) 10Dzahn: "let's use the same names as in production besides the domain name of course?" 
[puppet] - 10https://gerrit.wikimedia.org/r/594154 (owner: 10Paladox) [09:28:10] (03CR) 10jerkins-bot: [V: 04-1] cookbooks sre.hosts.rotate-pdu-password: update to raise exceptions [cookbooks] - 10https://gerrit.wikimedia.org/r/594197 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [09:28:46] (03PS10) 10Jbond: cookbooks sre.hosts.rotate-pdu-password: update to raise exceptions [cookbooks] - 10https://gerrit.wikimedia.org/r/594197 (https://phabricator.wikimedia.org/T246890) [09:30:20] 10Operations, 10DBA, 10Privacy Engineering, 10Traffic, and 4 others: dbtree loads third party resources (from google.com/jsapi) - https://phabricator.wikimedia.org/T96499 (10Marostegui) For the record: https://gerrit.wikimedia.org/r/#/c/operations/software/tendril/+/594412/ https://gerrit.wikimedia.org/r/#... [09:30:45] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/22286/" [puppet] - 10https://gerrit.wikimedia.org/r/594212 (owner: 10Muehlenhoff) [09:31:51] 10Operations, 10Traffic, 10Performance-Team (Radar): Edge cache response time per server should be monitored - https://phabricator.wikimedia.org/T238086 (10Gilles) @dpifke once deployed, we will need to nuke the existing data for navtiming_responsestart_by_host_seconds on Prometheus. Otherwise it's going to... 
[09:33:53] 10Operations, 10vm-requests: codfw: 1 VM for builder - https://phabricator.wikimedia.org/T248165 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This is complete for a while [09:34:29] 10Operations, 10Traffic, 10Performance-Team (Radar): Edge cache response time per server should be monitored - https://phabricator.wikimedia.org/T238086 (10Gilles) Put together a dashboard (with the underlying labels swapped for now): https://grafana.wikimedia.org/d/M7xQ_BeWk/response-time-by-host?orgId=1 [09:34:45] (03PS5) 10Paladox: phabricator: Set phabricator_domain & phabricator_altdomain for devtools [puppet] - 10https://gerrit.wikimedia.org/r/594154 [09:36:24] !log jmm@cumin2001 START - Cookbook sre.hosts.decommission [09:36:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:26] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3052 is OK: HTTP OK: HTTP/1.0 200 OK - 22728 bytes in 0.260 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [09:36:47] !log removing boron.eqiad.wmnet [09:36:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:57] moritzm: thanks for being the first tester of the decom cookbook since yesterday's spicerack release ;) [09:37:06] (03CR) 10Dzahn: [C: 03+2] "cloud-only" [puppet] - 10https://gerrit.wikimedia.org/r/594154 (owner: 10Paladox) [09:37:08] just changed something in the netbox bit [09:37:08] happy to test :-) [09:37:23] ah but this is a VM [09:37:23] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [09:37:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:27] 10Operations, 10vm-requests: codfw: 1 VM for builder - https://phabricator.wikimedia.org/T248165 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2001 for hosts: `boron.eqiad.wmnet` - boron.eqiad.wmnet (**PASS**) - Downtimed host on Icinga - 
Found Ganeti VM - VM shutdown -... [09:37:28] so yeah, not really [09:37:47] ok, still everything was working well :-) [09:39:22] RECOVERY - Check systemd state on wdqs1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:40:03] (03CR) 10Jcrespo: [C: 03+2] dbtree: Fix Google charts js api change [software/dbtree] - 10https://gerrit.wikimedia.org/r/594422 (owner: 10Jcrespo) [09:40:57] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] dbtree: Fix Google charts js api change [software/dbtree] - 10https://gerrit.wikimedia.org/r/594422 (owner: 10Jcrespo) [09:41:05] (03PS1) 10Muehlenhoff: Remove boron from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/594433 [09:44:25] (03PS8) 10Paladox: phabricator: Drop phd.pid-directory as it's now uneeded [puppet] - 10https://gerrit.wikimedia.org/r/594162 [09:46:56] (03PS2) 10Muehlenhoff: Remove boron from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/594433 [09:48:31] ACKNOWLEDGEMENT - Router interfaces on cr1-eqsin is CRITICAL: CRITICAL: host 103.102.166.129, interfaces up: 87, down: 2, dormant: 0, excluded: 0, unused: 0: Ayounsi Future link to be provisioned. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:48:56] (03CR) 10Dzahn: [C: 03+1] "yea, the warning says phab daemons no longer use PID files and i don't see a PID file in that location in prod. 
but letting 20after4 also " [puppet] - 10https://gerrit.wikimedia.org/r/594162 (owner: 10Paladox) [09:49:11] (03PS1) 10Jcrespo: dbtree: Add missing loader reference (followup to previous patch) [software/dbtree] - 10https://gerrit.wikimedia.org/r/594435 [09:50:21] (03CR) 10Muehlenhoff: [C: 03+2] Remove boron from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/594433 (owner: 10Muehlenhoff) [09:50:43] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] dbtree: Add missing loader reference (followup to previous patch) [software/dbtree] - 10https://gerrit.wikimedia.org/r/594435 (owner: 10Jcrespo) [09:52:23] (03PS1) 10Jbond: cookbook sre.hosts.rotate-pdu-password: rename [cookbooks] - 10https://gerrit.wikimedia.org/r/594436 (https://phabricator.wikimedia.org/T246890) [09:53:31] (03CR) 10Jbond: [C: 03+1] Also specify system user range for systemd-sysusers [puppet] - 10https://gerrit.wikimedia.org/r/594212 (owner: 10Muehlenhoff) [09:53:47] (03PS2) 10Jbond: cookbook sre.hosts.rotate-pdu-password: rename [cookbooks] - 10https://gerrit.wikimedia.org/r/594436 (https://phabricator.wikimedia.org/T246890) [09:54:00] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:55:08] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:55:43] (03PS1) 10Muehlenhoff: Remove boron from DNS [dns] - 10https://gerrit.wikimedia.org/r/594438 [09:56:11] (03CR) 10jerkins-bot: [V: 04-1] cookbook sre.hosts.rotate-pdu-password: rename [cookbooks] - 10https://gerrit.wikimedia.org/r/594436 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [09:56:48] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3052 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [10:00:13] (03CR) 10Muehlenhoff: [C: 03+2] Remove boron from 
DNS [dns] - 10https://gerrit.wikimedia.org/r/594438 (owner: 10Muehlenhoff) [10:00:37] (03PS1) 10Filippo Giunchedi: icinga: filter out notifications disabled on /alerts, add /problems [puppet] - 10https://gerrit.wikimedia.org/r/594441 [10:01:10] (03PS3) 10Jbond: cookbook sre.hosts.rotate-pdu-password: rename [cookbooks] - 10https://gerrit.wikimedia.org/r/594436 (https://phabricator.wikimedia.org/T246890) [10:01:54] (03CR) 10DCausse: increment extra plugin to 6.5.4-wmf-9 (032 comments) [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/593833 (owner: 10Mstyles) [10:02:36] (03CR) 10Muehlenhoff: [C: 03+2] Also specify system user range for systemd-sysusers [puppet] - 10https://gerrit.wikimedia.org/r/594212 (owner: 10Muehlenhoff) [10:02:58] (03CR) 10jerkins-bot: [V: 04-1] cookbook sre.hosts.rotate-pdu-password: rename [cookbooks] - 10https://gerrit.wikimedia.org/r/594436 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [10:04:58] (03CR) 10Dzahn: "I feel strongly that disabling notifications is not a valid way to "handle" an alert. 
Icinga agrees with that and this is like a hack to o" [puppet] - 10https://gerrit.wikimedia.org/r/594441 (owner: 10Filippo Giunchedi) [10:09:04] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3052 is OK: HTTP OK: HTTP/1.0 200 OK - 22727 bytes in 0.257 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [10:09:54] (03PS1) 10DCausse: Force LC_ALL=C when sorting the checksums file [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/594443 [10:11:29] (03CR) 10DCausse: increment extra plugin to 6.5.4-wmf-9 (031 comment) [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/593833 (owner: 10Mstyles) [10:16:00] !log temp disabling puppet on all ganeti hosts to carefully deploy change related to rapi cert location [10:16:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:25] (03PS3) 10Dzahn: sslcert: add parameter to support cergen private keys [puppet] - 10https://gerrit.wikimedia.org/r/593249 [10:17:29] mutante: you might want to disable notifications for the netbox syncs from ganeti too ;) [10:17:39] and/or pause their timers [10:18:47] !log copy prometheus-pdns-exporter v0.5.1 from stretch-wikimedia to buster-wikimedia in apt1001 (T251575) [10:18:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:50] T251575: Build prometheus-pdns-exporter for Buster - https://phabricator.wikimedia.org/T251575 [10:20:40] volans: if puppet is disabled on the ganeti side it won't change anything about the cert or syncs from it [10:23:08] mutante: sure, I meant when you're applying the change [10:23:11] sorry wasn't clear [10:23:20] !log copy prometheus-rabbitmq-exporter v0.4 from stretch-wikimedia to buster-wikimedia in apt1001 (T251660) [10:23:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:24] T251660: Create prometheus-rabbitmq-exporter for buster - 
https://phabricator.wikimedia.org/T251660 [10:25:31] i will enable it on a single host first. then i will hopefully see a noop and if it breaks unexpectedly i will revert before re-enabling [10:25:40] (03PS1) 10Jbond: (WIP) cookbooks sre.hosts.rotate-pdu-password: reset SNMP [cookbooks] - 10https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890) [10:25:55] (03Abandoned) 10Jbond: (WIP) cookbooks sre.hosts.rotate-pdu-password: reset SNMP [cookbooks] - 10https://gerrit.wikimedia.org/r/594199 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [10:28:51] (03CR) 10jerkins-bot: [V: 04-1] (WIP) cookbooks sre.hosts.rotate-pdu-password: reset SNMP [cookbooks] - 10https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [10:28:51] (03CR) 10Dzahn: [C: 03+2] sslcert: add parameter to support cergen private keys [puppet] - 10https://gerrit.wikimedia.org/r/593249 (owner: 10Dzahn) [10:29:00] (03PS1) 10Kormat: install_server: Allow reimage of es1024 [puppet] - 10https://gerrit.wikimedia.org/r/594446 (https://phabricator.wikimedia.org/T250666) [10:30:07] (03CR) 10Marostegui: [C: 03+1] install_server: Allow reimage of es1024 [puppet] - 10https://gerrit.wikimedia.org/r/594446 (https://phabricator.wikimedia.org/T250666) (owner: 10Kormat) [10:31:05] (03CR) 10Kormat: [C: 03+2] install_server: Allow reimage of es1024 [puppet] - 10https://gerrit.wikimedia.org/r/594446 (https://phabricator.wikimedia.org/T250666) (owner: 10Kormat) [10:32:59] (03PS1) 10Kormat: install_server: switch es1024 to buster [puppet] - 10https://gerrit.wikimedia.org/r/594449 (https://phabricator.wikimedia.org/T250666) [10:33:27] (03CR) 10Marostegui: [C: 03+1] install_server: switch es1024 to buster [puppet] - 10https://gerrit.wikimedia.org/r/594449 (https://phabricator.wikimedia.org/T250666) (owner: 10Kormat) [10:33:29] (03PS4) 10Jbond: cookbook sre.hosts.rotate-pdu-password: rename [cookbooks] - 10https://gerrit.wikimedia.org/r/594436 
(https://phabricator.wikimedia.org/T246890) [10:33:39] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [10:33:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:49] (03PS11) 10Jbond: cookbooks sre.hosts.rotate-pdu-password: update to raise exceptions [cookbooks] - 10https://gerrit.wikimedia.org/r/594197 (https://phabricator.wikimedia.org/T246890) [10:34:09] (03PS5) 10Jbond: cookbook sre.hosts.rotate-pdu-password: rename [cookbooks] - 10https://gerrit.wikimedia.org/r/594436 (https://phabricator.wikimedia.org/T246890) [10:34:19] (03PS2) 10Jbond: (WIP) cookbooks sre.hosts.rotate-pdu-password: reset SNMP [cookbooks] - 10https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890) [10:34:41] (03PS2) 10Kormat: install_server: switch es1024 to buster [puppet] - 10https://gerrit.wikimedia.org/r/594449 (https://phabricator.wikimedia.org/T250666) [10:35:06] volans: it's all noop. the key for the certs is in 2 different locations and was copied manually. this change now means it uses the right location and no more copying. 
noop because in private repo it's identical file in 2 locations [10:35:18] just confirmed on a single host and so on [10:36:08] sslcert::certificate now supports using cergen and copies the key from the right place [10:36:17] ack, great, ok then [10:36:20] :) [10:36:24] (03CR) 10jerkins-bot: [V: 04-1] (WIP) cookbooks sre.hosts.rotate-pdu-password: reset SNMP [cookbooks] - 10https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [10:36:34] (03CR) 10jerkins-bot: [V: 04-1] cookbook sre.hosts.rotate-pdu-password: rename [cookbooks] - 10https://gerrit.wikimedia.org/r/594436 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [10:37:10] (03PS3) 10Jbond: (WIP) cookbooks sre.hosts.rotate-pdu-password: reset SNMP [cookbooks] - 10https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890) [10:37:55] volans: any reason im getting `pylint: broad-except` now but not before? is there something better to except then Exception? [10:38:33] jbond42: IIRC it was doing that before too [10:38:34] (03CR) 10Kormat: [C: 03+2] install_server: switch es1024 to buster [puppet] - 10https://gerrit.wikimedia.org/r/594449 (https://phabricator.wikimedia.org/T250666) (owner: 10Kormat) [10:38:35] which CR? 
[10:38:49] https://gerrit.wikimedia.org/r/594436 [10:39:28] (03CR) 10jerkins-bot: [V: 04-1] (WIP) cookbooks sre.hosts.rotate-pdu-password: reset SNMP [cookbooks] - 10https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [10:40:09] that's expected AFAICT [10:41:47] (03CR) 10Dzahn: "temp disabled puppet on ganeti*, then deployed on single host first, confirmed the key is identical in both locations for all DCs and noop" [puppet] - 10https://gerrit.wikimedia.org/r/593249 (owner: 10Dzahn) [10:42:09] hmm, I guess the original script may have been committed before these checks were in place [10:43:20] (03PS1) 10Ema: Add flag -frontend_delay [software/purged] - 10https://gerrit.wikimedia.org/r/594454 (https://phabricator.wikimedia.org/T249583) [10:44:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1126 T248086', diff saved to https://phabricator.wikimedia.org/P11139 and previous config saved to /var/cache/conftool/dbconfig/20200505-104441-marostegui.json [10:44:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:47] T248086: Drop wb_terms in production from s4 (commonswiki, testcommonswiki), s3 (testwikidatawiki), s8 (wikidatawiki) - https://phabricator.wikimedia.org/T248086 [10:44:58] (03PS2) 10Ema: Add flag -frontend_delay [software/purged] - 10https://gerrit.wikimedia.org/r/594454 (https://phabricator.wikimedia.org/T249583) [10:45:04] (03PS6) 10Jbond: cookbook sre.hosts.rotate-pdu-password: rename [cookbooks] - 10https://gerrit.wikimedia.org/r/594436 (https://phabricator.wikimedia.org/T246890) [10:45:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1126 T248086', diff saved to https://phabricator.wikimedia.org/P11140 and previous config saved to /var/cache/conftool/dbconfig/20200505-104540-marostegui.json [10:45:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:25] (03CR) 10Jbond: cookbook sre.hosts.rotate-pdu-password: rename 
(031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/594436 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [10:48:13] (03PS4) 10Jbond: (WIP) cookbooks sre.hosts.rotate-pdu-password: reset SNMP [cookbooks] - 10https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890) [10:48:16] jbond42: it's ok to add the ignore comment if you want to catch that and continue in all cases [10:48:38] spicerack will already catch that for you but would stop the execution [10:49:27] volans: i decided to just drop it as i think i catch most exceptions but if you think its better to keep it and ignore i can add it back [10:50:19] ok as is, if we'll encounter some weird cases we can decide what to do [10:50:26] (03CR) 10jerkins-bot: [V: 04-1] (WIP) cookbooks sre.hosts.rotate-pdu-password: reset SNMP [cookbooks] - 10https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [10:50:48] sounds good to me [10:52:14] (03PS5) 10Jbond: (WIP) cookbooks sre.hosts.rotate-pdu-password: reset SNMP [cookbooks] - 10https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890) [10:54:16] (03CR) 10jerkins-bot: [V: 04-1] (WIP) cookbooks sre.hosts.rotate-pdu-password: reset SNMP [cookbooks] - 10https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [10:55:45] (03PS6) 10Jbond: (WIP) cookbooks sre.hosts.rotate-pdu-password: reset SNMP [cookbooks] - 10https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890) [10:56:38] (03PS6) 10KartikMistry: Adjust ContentTranslation MT threshold for Chinese Wikipedia to 70% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592479 (https://phabricator.wikimedia.org/T246383) (owner: 10VulpesVulpes825) [10:57:09] (03PS1) 10Jcrespo: dbtree: Update jquery version to avoid outadated code [software/dbtree] - 10https://gerrit.wikimedia.org/r/594457 [10:57:59] (03CR) 10jerkins-bot: [V: 04-1] (WIP) cookbooks 
sre.hosts.rotate-pdu-password: reset SNMP [cookbooks] - 10https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [10:58:39] (03PS2) 10Jcrespo: dbtree: Update jquery version to avoid outdated code [software/dbtree] - 10https://gerrit.wikimedia.org/r/594457 [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200505T1100). [11:00:04] kart_: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:13] * kart_ is here. will proceed for the SWAT. [11:00:32] !log kormat@cumin1001 dbctl commit (dc=all): 'Depool es1024 for reimaging, add es1023 (master) for reading in the meantime T250666', diff saved to https://phabricator.wikimedia.org/P11141 and previous config saved to /var/cache/conftool/dbconfig/20200505-110031-kormat.json [11:00:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:36] T250666: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 [11:00:44] (03CR) 10KartikMistry: [C: 03+2] "SWAT." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/592479 (https://phabricator.wikimedia.org/T246383) (owner: 10VulpesVulpes825) [11:01:39] (03Merged) 10jenkins-bot: Adjust ContentTranslation MT threshold for Chinese Wikipedia to 70% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592479 (https://phabricator.wikimedia.org/T246383) (owner: 10VulpesVulpes825) [11:01:49] !log installing remaining openldap security updates (client-side libs, tools) [11:01:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:58] kart_: should that change be proceeding per https://phabricator.wikimedia.org/T246383#6087778 [11:02:42] RhinosF1: we have discussed in the team for this and it is OK. [11:02:46] to deploy. [11:03:11] kart_: okay, was just confused with no comment on the task :) [11:03:47] RhinosF1: We should have some more updates after observing this change. [11:04:04] Ack [11:04:39] (03PS4) 10Giuseppe Lavagetto: Add integration tests using docker-compose [software/purged] - 10https://gerrit.wikimedia.org/r/594148 (https://phabricator.wikimedia.org/T133821) [11:08:13] (03PS7) 10Jbond: (WIP) cookbooks sre.hosts.rotate-pdu-password: reset SNMP [cookbooks] - 10https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890) [11:08:59] (03PS8) 10Jbond: (WIP) cookbooks sre.hosts.rotate-pdu-password: reset SNMP [cookbooks] - 10https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890) [11:09:09] !log kartik@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit|592479|Adjust ContentTranslation MT threshold for Chinese WP to 70% (T246383)]] (duration: 01m 01s) [11:09:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:13] T246383: Adjust the threshold for Chinese Wikipedia to prevent publishing when overall unmodified content is higher than 70% - https://phabricator.wikimedia.org/T246383 [11:10:37] (03PS1) 10Filippo Giunchedi: profile: initial tests 
for logstash filters [puppet] - 10https://gerrit.wikimedia.org/r/594460 (https://phabricator.wikimedia.org/T251869) [11:11:09] (03CR) 10jerkins-bot: [V: 04-1] (WIP) cookbooks sre.hosts.rotate-pdu-password: reset SNMP [cookbooks] - 10https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [11:12:23] Since there are no other patches, I'll consider EU SWAT to finish. [11:22:41] !log EU SWAT done. [11:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:47] Forgot this ^^ [11:23:41] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:23:52] (03CR) 10Joal: [C: 03+1] "LGTM :)" [puppet] - 10https://gerrit.wikimedia.org/r/594272 (https://phabricator.wikimedia.org/T238357) (owner: 10Nuria) [11:26:04] !log rolling restart of apache/FPM on mw1261-mw1265 [11:26:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:15] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3050 is OK: HTTP OK: HTTP/1.0 200 OK - 22733 bytes in 7.362 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:30:45] !log Drop T248086_wb_terms table on labsdb hosts - T248086 [11:30:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:48] T248086: Drop wb_terms in production from s4 (commonswiki, testcommonswiki), s3 (testwikidatawiki), s8 (wikidatawiki) - https://phabricator.wikimedia.org/T248086 [11:31:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1087 T248086', diff saved to https://phabricator.wikimedia.org/P11143 and previous config saved to /var/cache/conftool/dbconfig/20200505-113100-marostegui.json [11:31:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:52] !log marostegui@cumin1001 
dbctl commit (dc=all): 'Repool db1087 T248086', diff saved to https://phabricator.wikimedia.org/P11144 and previous config saved to /var/cache/conftool/dbconfig/20200505-113152-marostegui.json [11:31:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:42] (03PS2) 10Dzahn: icinga: replace check_ssl_http with check_ssl_http_letsencrypt [puppet] - 10https://gerrit.wikimedia.org/r/594103 (https://phabricator.wikimedia.org/T251726) [11:32:44] (03PS2) 10Dzahn: admins: add Eduardo Medina to ldap_only_admins (wmf) [puppet] - 10https://gerrit.wikimedia.org/r/594417 (https://phabricator.wikimedia.org/T251358) [11:32:46] (03PS1) 10Dzahn: gerrit/cloud: add some missing Hiera keys for devtools project [puppet] - 10https://gerrit.wikimedia.org/r/594461 (https://phabricator.wikimedia.org/T236569) [11:33:08] (03CR) 10Dzahn: [C: 03+2] icinga: replace check_ssl_http with check_ssl_http_letsencrypt [puppet] - 10https://gerrit.wikimedia.org/r/594103 (https://phabricator.wikimedia.org/T251726) (owner: 10Dzahn) [11:33:43] (03CR) 10Dzahn: "just for planet and wmfusercontent ..." [puppet] - 10https://gerrit.wikimedia.org/r/594103 (https://phabricator.wikimedia.org/T251726) (owner: 10Dzahn) [11:37:26] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Certificate *.wikipedia.org valid until 2020-06-20 - https://phabricator.wikimedia.org/T251726 (10Dzahn) >>! In T251726#6103487, @Vgutierrez wrote: > Currently we're using the LE unified cert on the US DCs (codfw, eqiad and ulsfo). LE certs are va... 
[11:39:31] (03CR) 10Dzahn: [C: 03+2] admins: add Eduardo Medina to ldap_only_admins (wmf) [puppet] - 10https://gerrit.wikimedia.org/r/594417 (https://phabricator.wikimedia.org/T251358) (owner: 10Dzahn) [11:40:02] (03PS9) 10Jbond: (WIP) cookbooks sre.hosts.rotate-pdu-password: reset SNMP [cookbooks] - 10https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890) [11:41:16] !log LDAP - added eamedia to wmf group (T251358) [11:41:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:21] T251358: Add Eamedina to `wmf` LDAF group - https://phabricator.wikimedia.org/T251358 [11:41:25] (03PS10) 10Jbond: (WIP) cookbooks sre.hosts.rotate-pdu-password: reset SNMP [cookbooks] - 10https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890) [11:43:18] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Add Eamedina to `wmf` LDAF group - https://phabricator.wikimedia.org/T251358 (10Dzahn) 05Open→03Resolved a:03Dzahn @eamedina Thanks for the details. You have been added to the "wmf" group now. You can try out logstash. 
[11:43:41] (03CR) 10jerkins-bot: [V: 04-1] (WIP) cookbooks sre.hosts.rotate-pdu-password: reset SNMP [cookbooks] - 10https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [11:44:22] (03CR) 10Dzahn: [C: 03+2] gerrit/cloud: add some missing Hiera keys for devtools project [puppet] - 10https://gerrit.wikimedia.org/r/594461 (https://phabricator.wikimedia.org/T236569) (owner: 10Dzahn) [11:45:57] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops: (Need By: TBD) rack/setup/install WMCS 10G switches - https://phabricator.wikimedia.org/T251632 (10ayounsi) Cabling diagram, let me know if something is missing or unclear: {F31803448} [11:47:46] !log rolling restart of apache on kibana hosts [11:47:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:49] (03PS1) 10Alexandros Kosiaris: deployment-prep: Remove citoid profile config [puppet] - 10https://gerrit.wikimedia.org/r/594463 [11:57:51] (03CR) 10Elukey: [C: 03+1] Add tests for atskafka.go [software/atskafka] - 10https://gerrit.wikimedia.org/r/593890 (owner: 10Ema) [11:59:06] (03CR) 10Elukey: [C: 03+1] "Nice! I guess that this is provided by the go std lib right? So we don't need to add deps etc..?" [software/atskafka] - 10https://gerrit.wikimedia.org/r/593892 (owner: 10Ema) [12:00:24] (03PS11) 10Jbond: (WIP) cookbooks sre.hosts.rotate-pdu-password: reset SNMP [cookbooks] - 10https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890) [12:00:33] (03CR) 10Elukey: "Not really sure what kind of licenses we use for tools that we publish, e.g. Apache 2.0 vs GPL vs etc.. 
GPL seems fine, I'll let you decid" [software/atskafka] - 10https://gerrit.wikimedia.org/r/593894 (owner: 10Ema) [12:03:21] !log rolling restart of apache on puppetboard* to pick up OpenLDAP update [12:03:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:01] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [12:05:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:28] (03CR) 10Elukey: Filter logs by regular expression (031 comment) [software/atskafka] - 10https://gerrit.wikimedia.org/r/594116 (https://phabricator.wikimedia.org/T237993) (owner: 10Ema) [12:07:33] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:35] (03CR) 10Jbond: [V: 03+2 C: 03+2] apereo_cas: update templates login page [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/587538 (https://phabricator.wikimedia.org/T233939) (owner: 10Jbond) [12:07:47] !log updating cas login page [12:07:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:47] (03CR) 10Volans: [C: 03+2] changelog: specify breaking change [software/spicerack] - 10https://gerrit.wikimedia.org/r/594273 (owner: 10Volans) [12:09:59] (03CR) 10Volans: [C: 03+2] doc: set min version of sphinx_rtd_theme to 0.1.9 [software/spicerack] - 10https://gerrit.wikimedia.org/r/594274 (owner: 10Volans) [12:10:10] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=idp site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:10:53] (03CR) 10Volans: [C: 03+2] doc: fix documentation generation for Sphinx 3 [software/spicerack] - 10https://gerrit.wikimedia.org/r/594275 (owner: 10Volans) [12:11:16] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:15:10] (03CR) 10Jbond: [C: 03+1] "not tested but looks good to me and great improvment" [puppet] - 10https://gerrit.wikimedia.org/r/594460 (https://phabricator.wikimedia.org/T251869) (owner: 10Filippo Giunchedi) [12:16:34] (03Merged) 10jenkins-bot: changelog: specify breaking change [software/spicerack] - 10https://gerrit.wikimedia.org/r/594273 (owner: 10Volans) [12:17:31] (03Merged) 10jenkins-bot: doc: set min version of sphinx_rtd_theme to 0.1.9 [software/spicerack] - 10https://gerrit.wikimedia.org/r/594274 (owner: 10Volans) [12:18:12] (03Merged) 10jenkins-bot: doc: fix documentation generation for Sphinx 3 [software/spicerack] - 10https://gerrit.wikimedia.org/r/594275 (owner: 10Volans) [12:22:14] 10Operations, 10Wikimedia-Mailing-lists, 10Malayalam-Sites: Wikiml-l mail archives are empty after August 2019 (moderation enabled but nobody moderates, hence no emails get delivered) - https://phabricator.wikimedia.org/T251554 (10Praveenp) [12:34:51] (03PS2) 10Jbond: apereo_cas: add more timeout values [puppet] - 10https://gerrit.wikimedia.org/r/587515 [12:35:24] (03CR) 10jerkins-bot: [V: 04-1] apereo_cas: add more timeout values [puppet] - 10https://gerrit.wikimedia.org/r/587515 (owner: 10Jbond) [12:35:54] (03PS3) 10Jbond: apereo_cas: add more timeout values [puppet] - 10https://gerrit.wikimedia.org/r/587515 [12:37:03] (03CR) 10Jbond: apereo_cas: add more timeout values (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/587515 (owner: 10Jbond) [12:37:07] !log push pfw policy - T251769 [12:37:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:08] (03CR) 10Dzahn: [C: 03+2] "confirmed these are already symlinks to the capitalized version" [puppet] - 10https://gerrit.wikimedia.org/r/591335 (https://phabricator.wikimedia.org/T250806) (owner: 10Reedy) [12:40:18] 10Operations: Migrate Cumin 
hosts to Buster - https://phabricator.wikimedia.org/T245114 (10hashar) CI on WMCS uses a cumin master. Krenair already created the Buster instance `integration-cumin-02.integration.eqiad.wmflabs`. We can use it to experiment :] [12:47:39] (03PS1) 10Kormat: Revert "install_server: Allow reimage of es1024" [puppet] - 10https://gerrit.wikimedia.org/r/594470 (https://phabricator.wikimedia.org/T250666) [12:48:21] (03CR) 10Kormat: [C: 03+2] Revert "install_server: Allow reimage of es1024" [puppet] - 10https://gerrit.wikimedia.org/r/594470 (https://phabricator.wikimedia.org/T250666) (owner: 10Kormat) [12:50:04] (03PS1) 10Arturo Borrero Gonzalez: kubeadm: refactor toolforge code for reuse by PAWS [puppet] - 10https://gerrit.wikimedia.org/r/594471 (https://phabricator.wikimedia.org/T251297) [12:52:24] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:52:54] !log kormat@cumin1001 dbctl commit (dc=all): 'Repool es1024 at 25% after reimaging T250666', diff saved to https://phabricator.wikimedia.org/P11145 and previous config saved to /var/cache/conftool/dbconfig/20200505-125254-kormat.json [12:52:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:58] T250666: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 [12:53:05] (03PS1) 10Joal: Update turnilo pageview config with new dimensions [puppet] - 10https://gerrit.wikimedia.org/r/594472 (https://phabricator.wikimedia.org/T243090) [12:54:16] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:55:59] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: add toolforge.org and wmcloud.org to CSP allows [puppet] - 10https://gerrit.wikimedia.org/r/591236 (https://phabricator.wikimedia.org/T130748) (owner: 10BryanDavis) [13:02:02] (03CR) 10Ema: [C: 03+2] Add tests for atskafka.go [software/atskafka] - 10https://gerrit.wikimedia.org/r/593890 (owner: 10Ema) [13:02:26] (03CR) 10Ema: [C: 03+2] "> Nice! I guess that this is provided by the go std lib right? So we" [software/atskafka] - 10https://gerrit.wikimedia.org/r/593892 (owner: 10Ema) [13:02:40] (03CR) 10Ema: [C: 03+2] Add license and copyright notices [software/atskafka] - 10https://gerrit.wikimedia.org/r/593894 (owner: 10Ema) [13:07:30] (03PS2) 10Alexandros Kosiaris: deployment-prep: Remove citoid profile config [puppet] - 10https://gerrit.wikimedia.org/r/594463 [13:07:35] (03PS1) 10Elukey: superset::proxy: allow to set the x_forwareded_proto header [puppet] - 10https://gerrit.wikimedia.org/r/594474 [13:08:36] (03CR) 10jerkins-bot: [V: 04-1] superset::proxy: allow to set the x_forwareded_proto header [puppet] - 10https://gerrit.wikimedia.org/r/594474 (owner: 10Elukey) [13:08:46] uff [13:09:53] (03PS3) 10Ema: Filter logs by regular expression [software/atskafka] - 10https://gerrit.wikimedia.org/r/594116 (https://phabricator.wikimedia.org/T237993) [13:10:08] (03PS2) 10Elukey: superset::proxy: allow to set the x_forwareded_proto header [puppet] - 10https://gerrit.wikimedia.org/r/594474 [13:11:44] (03CR) 10Giuseppe Lavagetto: Add integration tests using docker-compose (032 comments) [software/purged] - 10https://gerrit.wikimedia.org/r/594148 (https://phabricator.wikimedia.org/T133821) (owner: 10Giuseppe Lavagetto) [13:12:28] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 64, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:12:32] (03PS3) 10Elukey: superset::proxy: allow to set the x_forwareded_proto header [puppet] - 10https://gerrit.wikimedia.org/r/594474 [13:12:43] (03CR) 10Mforns: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/594472 (https://phabricator.wikimedia.org/T243090) (owner: 10Joal) [13:13:06] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:14:29] (03CR) 10Alexandros Kosiaris: [C: 03+2] deployment-prep: Remove citoid profile config [puppet] - 10https://gerrit.wikimedia.org/r/594463 (owner: 10Alexandros Kosiaris) [13:14:33] (03CR) 10Ema: Filter logs by regular expression (031 comment) [software/atskafka] - 10https://gerrit.wikimedia.org/r/594116 (https://phabricator.wikimedia.org/T237993) (owner: 10Ema) [13:15:21] !log kormat@cumin1001 dbctl commit (dc=all): 'Repool es1024 to 50% after reimaging T250666', diff saved to https://phabricator.wikimedia.org/P11147 and previous config saved to /var/cache/conftool/dbconfig/20200505-131520-kormat.json [13:15:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:24] T250666: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 [13:18:35] !log upgrade ATS to version 8.1 () on cp4026, cp4032, cp5006 and cp5011 [13:18:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:56] (03PS12) 10Jbond: (WIP) cookbooks sre.hosts.rotate-pdu-password: reset SNMP [cookbooks] - 10https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890) [13:20:13] (03PS13) 10Jbond: (WIP) cookbooks sre.hosts.rotate-pdu-password: reset SNMP [cookbooks] - 10https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890) [13:21:15] (03PS14) 10Jbond: 
(WIP) cookbooks sre.hosts.rotate-pdu-password: reset SNMP [cookbooks] - 10https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890) [13:23:23] (03CR) 10jerkins-bot: [V: 04-1] (WIP) cookbooks sre.hosts.rotate-pdu-password: reset SNMP [cookbooks] - 10https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [13:23:30] (03PS1) 10Dzahn: contint: move common and default Hiera settings to role level [puppet] - 10https://gerrit.wikimedia.org/r/594475 (https://phabricator.wikimedia.org/T224591) [13:24:07] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Add Eamedina to `wmf` LDAF group - https://phabricator.wikimedia.org/T251358 (10eamedina) Thanks @Dzahn! I tried out logstash and was able to login successfully 👍 [13:24:34] (03PS2) 10Dzahn: contint: move common and default Hiera settings to role level [puppet] - 10https://gerrit.wikimedia.org/r/594475 (https://phabricator.wikimedia.org/T224591) [13:25:24] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "Code LGTM, left a few nitpicks (mostly about docstrings :)). Feel free to merge as-is." (033 comments) [software/purged] - 10https://gerrit.wikimedia.org/r/594454 (https://phabricator.wikimedia.org/T249583) (owner: 10Ema) [13:30:41] (03PS2) 10Alexandros Kosiaris: cxserver: Remove logstash logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/573240 (https://phabricator.wikimedia.org/T219921) [13:30:45] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Add Eamedina to `wmf` LDAF group - https://phabricator.wikimedia.org/T251358 (10Dzahn) Great! Thanks for confirming. [13:30:52] (03CR) 10Elukey: [C: 03+2] superset::proxy: allow to set the x_forwareded_proto header [puppet] - 10https://gerrit.wikimedia.org/r/594474 (owner: 10Elukey) [13:32:00] 10Operations, 10Traffic, 10serviceops: Certificate *.wikipedia.org valid until 2020-06-20 - https://phabricator.wikimedia.org/T251726 (10BBlack) >>! 
In T251726#6108138, @Dzahn wrote: > Even if we still use non-LE certs in some DCs i believe this is ok since we should also have other monitoring for the expira... [13:32:20] (03CR) 10Alexandros Kosiaris: [C: 03+2] cxserver: Remove logstash logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/573240 (https://phabricator.wikimedia.org/T219921) (owner: 10Alexandros Kosiaris) [13:32:30] (03PS15) 10Jbond: (WIP) cookbooks sre.hosts.rotate-pdu-password: reset SNMP [cookbooks] - 10https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890) [13:33:19] (03PS1) 10Dzahn: contint: switch jenkins/zuul from contint1001 to contint2001 [puppet] - 10https://gerrit.wikimedia.org/r/594477 (https://phabricator.wikimedia.org/T224591) [13:33:46] (03PS4) 10Ottomata: [WIP] Initial debian commit [debs/anaconda] (debian) - 10https://gerrit.wikimedia.org/r/594204 (https://phabricator.wikimedia.org/T251006) [13:34:01] (03PS2) 10Elukey: profile::reportupdater::jobs: Add delay to published_cx2_translations_mysql [puppet] - 10https://gerrit.wikimedia.org/r/594169 (owner: 10Mforns) [13:35:53] (03CR) 10Elukey: [C: 03+1] Filter logs by regular expression [software/atskafka] - 10https://gerrit.wikimedia.org/r/594116 (https://phabricator.wikimedia.org/T237993) (owner: 10Ema) [13:36:01] (03CR) 10Elukey: [C: 03+2] profile::reportupdater::jobs: Add delay to published_cx2_translations_mysql [puppet] - 10https://gerrit.wikimedia.org/r/594169 (owner: 10Mforns) [13:36:36] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'cxserver' for release 'staging' . 
[13:36:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:06] !log Updated Jenkins job https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler to have it defined in JJB # T97513 [13:41:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:10] T97513: Migrate Jenkins job "operations-puppet-catalog-compiler" to Jenkins Job Builder - https://phabricator.wikimedia.org/T97513 [13:41:35] (03PS16) 10Jbond: (WIP) cookbooks sre.hosts.rotate-pdu-password: reset SNMP [cookbooks] - 10https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890) [13:41:57] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'cxserver' for release 'production' . [13:41:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:26] (03CR) 10Ema: [C: 03+2] Filter logs by regular expression [software/atskafka] - 10https://gerrit.wikimedia.org/r/594116 (https://phabricator.wikimedia.org/T237993) (owner: 10Ema) [13:43:32] (03PS1) 10Dzahn: switch contint from 1001 to 2001 [dns] - 10https://gerrit.wikimedia.org/r/594480 (https://phabricator.wikimedia.org/T224591) [13:43:38] (03CR) 10Jhedden: [C: 03+1] Openstack haproxy ferm rules: support AAAA and fix syntax [puppet] - 10https://gerrit.wikimedia.org/r/594338 (https://phabricator.wikimedia.org/T251294) (owner: 10Andrew Bogott) [13:45:34] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'cxserver' for release 'production' . [13:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:02] !log deploy cxserver chart 0.0.15 to staging, codfw, eqiad. 
T219921 [13:46:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:06] T219921: Move cxserver logging to new logging pipeline - https://phabricator.wikimedia.org/T219921 [13:46:38] (03PS17) 10Jbond: (WIP) cookbooks sre.hosts.rotate-pdu-password: reset SNMP [cookbooks] - 10https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890) [13:48:27] (03CR) 10Ema: [C: 03+1] Add integration tests using docker-compose [software/purged] - 10https://gerrit.wikimedia.org/r/594148 (https://phabricator.wikimedia.org/T133821) (owner: 10Giuseppe Lavagetto) [13:49:03] 10Operations, 10Traffic, 10serviceops: Certificate *.wikipedia.org valid until 2020-06-20 - https://phabricator.wikimedia.org/T251726 (10Dzahn) Or we could make a new Icinga check that isn't check_http for a specific service but runs openssl directly on the cert file in the private repo and has a generic nam... [13:49:49] (03PS2) 10Dzahn: contint: switch jenkins/zuul/gearman to contint2001 [puppet] - 10https://gerrit.wikimedia.org/r/594477 (https://phabricator.wikimedia.org/T224591) [13:50:56] (03PS3) 10Andrew Bogott: Openstack haproxy ferm rules: support AAAA and fix syntax [puppet] - 10https://gerrit.wikimedia.org/r/594338 (https://phabricator.wikimedia.org/T251294) [13:50:58] (03PS1) 10Andrew Bogott: openstack/buster/nova: Create 'nova' system user in puppet [puppet] - 10https://gerrit.wikimedia.org/r/594483 (https://phabricator.wikimedia.org/T251294) [13:51:03] 10Operations, 10Wikimedia-Logstash, 10observability, 10service-runner, 10Patch-For-Review: Move service-runner to new logging infrastructure - https://phabricator.wikimedia.org/T211125 (10akosiaris) [13:51:05] 10Operations, 10CX-cxserver, 10Wikimedia-Logstash, 10observability, and 3 others: Move cxserver logging to new logging pipeline - https://phabricator.wikimedia.org/T219921 (10akosiaris) 05Open→03Resolved a:03akosiaris I am gonna resolve this, I just removed direct logstash logging from cxserver, 
now... [13:52:40] (03PS2) 10Filippo Giunchedi: profile: initial tests for logstash filters [puppet] - 10https://gerrit.wikimedia.org/r/594460 (https://phabricator.wikimedia.org/T251869) [13:53:04] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:53:15] 10Operations, 10CX-cxserver, 10Product-Infrastructure-Team-Backlog, 10Wikifeeds, and 3 others: service-runner apps running on kubernetes emit logs with log level 50 - https://phabricator.wikimedia.org/T239459 (10akosiaris) [13:54:54] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:55:32] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM, one comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/587515 (owner: 10Jbond) [13:55:50] (03CR) 10Andrew Bogott: [C: 03+2] Openstack haproxy ferm rules: support AAAA and fix syntax [puppet] - 10https://gerrit.wikimedia.org/r/594338 (https://phabricator.wikimedia.org/T251294) (owner: 10Andrew Bogott) [13:55:58] (03CR) 10Andrew Bogott: [C: 03+2] openstack/buster/nova: Create 'nova' system user in puppet [puppet] - 10https://gerrit.wikimedia.org/r/594483 (https://phabricator.wikimedia.org/T251294) (owner: 10Andrew Bogott) [13:56:01] (03PS1) 10Alexandros Kosiaris: eventgate, eventstreams, citoid: Log with namedlevels [deployment-charts] - 10https://gerrit.wikimedia.org/r/594492 (https://phabricator.wikimedia.org/T239459) [13:56:53] (03CR) 10Alexandros Kosiaris: [C: 04-1] "-1ed until eventgate, eventstreams and citoid have bumped their service-runner dependency to 2.7.7" [deployment-charts] - 10https://gerrit.wikimedia.org/r/594492 (https://phabricator.wikimedia.org/T239459) (owner: 10Alexandros 
Kosiaris)
[14:18:33] Operations, LDAP-Access-Requests: Add `Jgiannelos` to `wmf` LDAP group - https://phabricator.wikimedia.org/T251899 (Jgiannelos)
[14:22:43] (PS18) Jbond: (WIP) cookbooks sre.hosts.rotate-pdu-password: reset SNMP [cookbooks] - https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890)
[14:28:27] (PS1) Kormat: install_server: switch d-i-test to buster [puppet] - https://gerrit.wikimedia.org/r/594494 (https://phabricator.wikimedia.org/T251768)
[14:31:00] PROBLEM - mediawiki originals uploads -hourly- for codfw on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005:9112 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[14:31:30] PROBLEM - mediawiki originals uploads -hourly- for eqiad on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005:9112 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[14:31:59] !log kormat@cumin1001 dbctl commit (dc=all): 'Repool es1024 to 75% after reimaging T250666', diff saved to https://phabricator.wikimedia.org/P11149 and previous config saved to /var/cache/conftool/dbconfig/20200505-143158-kormat.json
[14:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:32:04] T250666: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666
[14:35:22] PROBLEM - ps1-a7-codfw-infeed-load-tower-B-phase-X on ps1-a7-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:35:46] PROBLEM - ps1-a7-codfw-infeed-load-tower-B-phase-Y on ps1-a7-codfw
is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:35:52] PROBLEM - ps1-a7-codfw-infeed-load-tower-B-phase-Z on ps1-a7-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:36:02] PROBLEM - ps1-a7-codfw-infeed-load-tower-A-phase-Y on ps1-a7-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:36:14] PROBLEM - ps1-a7-codfw-infeed-load-tower-A-phase-X on ps1-a7-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:36:30] Operations, ops-eqiad, DBA, DC-Ops: (Need By: 31st May) rack/setup/install db114[1-9] - https://phabricator.wikimedia.org/T251614 (Jclark-ctr) a:Jclark-ctr
[14:37:00] PROBLEM - ps1-a7-codfw-infeed-load-tower-A-phase-Z on ps1-a7-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:37:08] (CR) Mvolz: "The hat before ^2.7.1 means 2.7.7 would get installed on update - I think - but I'll explicitly update to it and test the update." [deployment-charts] - https://gerrit.wikimedia.org/r/594492 (https://phabricator.wikimedia.org/T239459) (owner: Alexandros Kosiaris)
[14:39:34] (CR) Alexandros Kosiaris: [C: -1] "> The hat before ^2.7.1 means 2.7.7 would get installed on update - I think - but I'll explicitly update to it and test the update."
[deployment-charts] - https://gerrit.wikimedia.org/r/594492 (https://phabricator.wikimedia.org/T239459) (owner: Alexandros Kosiaris)
[14:40:41] (CR) Filippo Giunchedi: "LGTM, but looking at the test logs I notice there's both s-maxage= and maxage= and I don't know what's the expected behavior" [puppet] - https://gerrit.wikimedia.org/r/594316 (https://phabricator.wikimedia.org/T251466) (owner: Cwhite)
[14:41:40] (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/594320 (https://phabricator.wikimedia.org/T251466) (owner: Cwhite)
[14:42:40] Operations, ops-eqiad: (Need by: TDB) rack/setup/install cloudelastic100[56] - https://phabricator.wikimedia.org/T249062 (Jclark-ctr) dac cable is in the wrong nic port switched. Confirmed with @elukey is working now
[14:43:02] !log hnowlan@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'changeprop' for release 'production' .
[14:43:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:44:59] !log hnowlan@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' .
[14:45:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:45:08] Operations, ops-eqiad: (Need by: TDB) rack/setup/install cloudelastic100[56] - https://phabricator.wikimedia.org/T249062 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` cloudelastic1005.wikimedia.org ` The log can be found in `/var/log/wmf-auto-re...
[14:47:04] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[14:52:59] (CR) Jcrespo: [C: +1] install_server: switch d-i-test to buster [puppet] - https://gerrit.wikimedia.org/r/594494 (https://phabricator.wikimedia.org/T251768) (owner: Kormat)
[14:53:51] (PS1) Jbond: pcc: add default COMPILER_MODE [puppet] - https://gerrit.wikimedia.org/r/594497
[14:54:20] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3058 is OK: HTTP OK: HTTP/1.0 200 OK - 22726 bytes in 0.257 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[14:54:29] (CR) CDanis: [C: +1] pcc: add default COMPILER_MODE [puppet] - https://gerrit.wikimedia.org/r/594497 (owner: Jbond)
[14:54:49] (CR) RLazarus: [C: +1] pcc: add default COMPILER_MODE [puppet] - https://gerrit.wikimedia.org/r/594497 (owner: Jbond)
[14:54:55] (CR) Jbond: [C: +2] pcc: add default COMPILER_MODE [puppet] - https://gerrit.wikimedia.org/r/594497 (owner: Jbond)
[14:56:41] !log hnowlan@deploy1001 Started deploy [changeprop/deploy@6c65779]: Enabling on_transclusion_update on k8s, disabling on scb
[14:56:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:57:54] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[14:58:12] !log hnowlan@deploy1001 Finished deploy [changeprop/deploy@6c65779]: Enabling on_transclusion_update on k8s, disabling on scb (duration: 01m 31s)
[14:58:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:58:33] (CR) Nuria: [C: +1] Update turnilo pageview config with new dimensions [puppet] - https://gerrit.wikimedia.org/r/594472
(https://phabricator.wikimedia.org/T243090) (owner: Joal)
[14:59:45] (CR) Filippo Giunchedi: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/594441 (owner: Filippo Giunchedi)
[14:59:59] (CR) Nuria: [C: +1] "I can test this change together with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/594272/" [puppet] - https://gerrit.wikimedia.org/r/594472 (https://phabricator.wikimedia.org/T243090) (owner: Joal)
[15:00:42] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime
[15:00:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:02:04] (CR) RLazarus: "https://puppet-compiler.wmflabs.org/compiler1002/22303/" [puppet] - https://gerrit.wikimedia.org/r/594239 (https://phabricator.wikimedia.org/T244852) (owner: RLazarus)
[15:02:06] (CR) Giuseppe Lavagetto: [C: +1] "LGTM, but run the puppet compiler to confirm it's a noop just in case." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/594239 (https://phabricator.wikimedia.org/T244852) (owner: RLazarus)
[15:02:16] _joe_: ahahaha
[15:03:13] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[15:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:03:29] Operations, observability: Review Icinga alarms with disabled notifications - https://phabricator.wikimedia.org/T149643 (fgiunchedi)
[15:04:42] Operations, ops-eqiad: (Need by: TDB) rack/setup/install cloudelastic100[56] - https://phabricator.wikimedia.org/T249062 (elukey) ` 15:03:13 | cloudelastic1005.wikimedia.org | WARNING: unable to verify that BIOS boot parameters are back to normal, got: Boot parameter version: 1 Boot parameter 5 is valid/...
[15:07:59] Operations, ops-eqiad: (Need by: TDB) rack/setup/install cloudelastic100[56] - https://phabricator.wikimedia.org/T249062 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudelastic1005.wikimedia.org'] ` and were **ALL** successful.
[15:08:13] (PS4) Andrew Bogott: Openstack haproxy ferm rules: support AAAA and fix syntax [puppet] - https://gerrit.wikimedia.org/r/594338 (https://phabricator.wikimedia.org/T251294)
[15:08:15] (PS1) Andrew Bogott: nova::common: fix ordering a bit for a clean install [puppet] - https://gerrit.wikimedia.org/r/594500 (https://phabricator.wikimedia.org/T251294)
[15:08:50] Operations, ops-eqiad: (Need by: TDB) rack/setup/install cloudelastic100[56] - https://phabricator.wikimedia.org/T249062 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` cloudelastic1006.wikimedia.org ` The log can be found in `/var/log/wmf-auto-re...
[15:09:15] (CR) jerkins-bot: [V: -1] nova::common: fix ordering a bit for a clean install [puppet] - https://gerrit.wikimedia.org/r/594500 (https://phabricator.wikimedia.org/T251294) (owner: Andrew Bogott)
[15:09:28] (PS2) Andrew Bogott: nova::common: fix ordering a bit for a clean install [puppet] - https://gerrit.wikimedia.org/r/594500 (https://phabricator.wikimedia.org/T251294)
[15:09:35] Operations, Traffic, serviceops: Certificate *.wikipedia.org valid until 2020-06-20 - https://phabricator.wikimedia.org/T251726 (Vgutierrez) >>! In T251726#6108138, @Dzahn wrote: > Even if we still use non-LE certs in some DCs i believe this is ok since we should also have other monitoring for the ex...
[15:09:50] (CR) jerkins-bot: [V: -1] nova::common: fix ordering a bit for a clean install [puppet] - https://gerrit.wikimedia.org/r/594500 (https://phabricator.wikimedia.org/T251294) (owner: Andrew Bogott)
[15:09:52] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3058 is OK: HTTP OK: HTTP/1.0 200 OK - 22740 bytes in 0.269 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[15:10:24] PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_centrallog2001.codfw.wmnet:6514} https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops
[15:11:20] (PS1) Dzahn: add static-codereview.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/594501 (https://phabricator.wikimedia.org/T243056)
[15:11:26] (PS3) Andrew Bogott: nova::common: fix ordering a bit for a clean install [puppet] - https://gerrit.wikimedia.org/r/594500 (https://phabricator.wikimedia.org/T251294)
[15:12:03] (PS2) Cwhite: mtail: update mtail testing to python3 [puppet] - https://gerrit.wikimedia.org/r/594320 (https://phabricator.wikimedia.org/T251466)
[15:12:12] (CR) Andrew Bogott: [C: +2] nova::common: fix ordering a bit for a clean install [puppet] - https://gerrit.wikimedia.org/r/594500 (https://phabricator.wikimedia.org/T251294) (owner: Andrew Bogott)
[15:14:24] (PS4) Cwhite: toil: mitigate monthly acct cronspam [puppet] - https://gerrit.wikimedia.org/r/593750 (https://phabricator.wikimedia.org/T167035)
[15:14:51] (CR) Cwhite: [C: +2] mtail: update mtail testing to python3 [puppet] - https://gerrit.wikimedia.org/r/594320 (https://phabricator.wikimedia.org/T251466) (owner: Cwhite)
[15:17:00] (PS8) Cwhite: mtail: add flag to install mtail from apt component [puppet] - https://gerrit.wikimedia.org/r/593327
(https://phabricator.wikimedia.org/T251466)
[15:17:18] RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops
[15:23:38] (PS19) Jbond: cookbooks sre.hosts.rotate-pdu-password: reset SNMP [cookbooks] - https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890)
[15:24:09] (CR) Jbond: "There are still a few issues but a quick review would be appreciated" (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890) (owner: Jbond)
[15:24:21] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime
[15:24:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:24:24] (CR) Filippo Giunchedi: "See inline" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/588769 (https://phabricator.wikimedia.org/T199236) (owner: Cwhite)
[15:24:54] (CR) Jcrespo: [V: +2 C: +2] dbtree: Update jquery version to avoid outdated code [software/dbtree] - https://gerrit.wikimedia.org/r/594457 (owner: Jcrespo)
[15:26:53] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[15:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:28:50] (PS1) Elukey: profile::analytics::cluster::hdfs_mount: temporarily disable alarms [puppet] - https://gerrit.wikimedia.org/r/594506
[15:30:35] (CR) Filippo Giunchedi: "I'm not sure I understand why moving to global variables?
Seems like it would make it harder to test" [puppet] - https://gerrit.wikimedia.org/r/588759 (https://phabricator.wikimedia.org/T199236) (owner: Cwhite)
[15:32:22] (CR) Elukey: [C: +2] profile::analytics::cluster::hdfs_mount: temporarily disable alarms [puppet] - https://gerrit.wikimedia.org/r/594506 (owner: Elukey)
[15:32:39] Operations, ops-eqiad: (Need by: TDB) rack/setup/install cloudelastic100[56] - https://phabricator.wikimedia.org/T249062 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudelastic1006.wikimedia.org'] ` and were **ALL** successful.
[15:32:49] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[15:35:15] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[15:35:34] (CR) Jbond: apereo_cas: add more timeout values (1 comment) [puppet] - https://gerrit.wikimedia.org/r/587515 (owner: Jbond)
[15:38:43] !log kormat@cumin1001 dbctl commit (dc=all): 'Repool es1024 to 100% after reimaging T250666', diff saved to https://phabricator.wikimedia.org/P11153 and previous config saved to /var/cache/conftool/dbconfig/20200505-153843-kormat.json
[15:38:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:38:46] T250666: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666
[15:39:10] (CR) Muehlenhoff: [C: +1] apereo_cas: add more timeout values (1 comment) [puppet] - https://gerrit.wikimedia.org/r/587515 (owner: Jbond)
[15:39:19] (CR) Filippo Giunchedi: [C: +1] "Personally I think the if os_version() around the class invocation is more clear but this works too" [puppet] - https://gerrit.wikimedia.org/r/593750 (https://phabricator.wikimedia.org/T167035) (owner: Cwhite)
[15:39:22] Operations, LDAP-Access-Requests: Add `Jgiannelos` to `wmf` LDAP group - https://phabricator.wikimedia.org/T251899 (Aklapper) Hi @Jgiannelos, please see https://phabricator.wikimedia.org/project/profile/1564/ for required data and information. (And if you're following some team's onboarding documentation...
[15:41:59] (PS2) Arturo Borrero Gonzalez: kubeadm: refactor toolforge code for reuse by PAWS [puppet] - https://gerrit.wikimedia.org/r/594471 (https://phabricator.wikimedia.org/T251297)
[15:46:23] (PS1) Mvolz: Citoid: Update restbase to 2.7.7 [deployment-charts] - https://gerrit.wikimedia.org/r/594513 (https://phabricator.wikimedia.org/T239459)
[15:51:40] (PS2) Mvolz: Citoid: Update restbase to 2.7.7 [deployment-charts] - https://gerrit.wikimedia.org/r/594513 (https://phabricator.wikimedia.org/T239459)
[15:56:53] Operations, ops-eqiad: (Need by: TDB) rack/setup/install cloudelastic100[56] - https://phabricator.wikimedia.org/T249062 (elukey)
[15:57:42] (CR) Mstyles: [C: +2] "so https://gerrit.wikimedia.org/r/c/operations/software/elasticsearch/plugins/+/593833 will get rebased on top of this?" [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/594443 (owner: DCausse)
[15:59:23] (CR) Mstyles: increment extra plugin to 6.5.4-wmf-9 (2 comments) [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/593833 (owner: Mstyles)
[15:59:58] (CR) Cwhite: [C: +2] aptrepo: add mtail component for controlled mtail upgrade [puppet] - https://gerrit.wikimedia.org/r/593314 (https://phabricator.wikimedia.org/T251466) (owner: Cwhite)
[16:00:04] godog and _joe_: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Puppet SWAT(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200505T1600).
[16:00:04] qedk: A patch you scheduled for Puppet SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[16:00:18] (PS3) Cwhite: aptrepo: add mtail component for controlled mtail upgrade [puppet] - https://gerrit.wikimedia.org/r/593314 (https://phabricator.wikimedia.org/T251466)
[16:01:08] (CR) DCausse: "> Patch Set 1: Code-Review+2" [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/594443 (owner: DCausse)
[16:01:14] (PS3) Arturo Borrero Gonzalez: kubeadm: refactor toolforge code for reuse by PAWS [puppet] - https://gerrit.wikimedia.org/r/594471 (https://phabricator.wikimedia.org/T251297)
[16:02:12] (CR) jerkins-bot: [V: -1] kubeadm: refactor toolforge code for reuse by PAWS [puppet] - https://gerrit.wikimedia.org/r/594471 (https://phabricator.wikimedia.org/T251297) (owner: Arturo Borrero Gonzalez)
[16:04:16] (PS1) Elukey: profile::kerberos::client: allow to set a different credentials cache dir [puppet] - https://gerrit.wikimedia.org/r/594516
[16:04:38] (PS5) Arturo Borrero Gonzalez: toolforge: Increase HSTS max-age directive to one month [puppet] - https://gerrit.wikimedia.org/r/593747 (https://phabricator.wikimedia.org/T102367) (owner: QEDK)
[16:08:56] (CR) Arturo Borrero Gonzalez: [C: +2] toolforge: Increase HSTS max-age directive to one month [puppet] - https://gerrit.wikimedia.org/r/593747 (https://phabricator.wikimedia.org/T102367) (owner: QEDK)
[16:09:12] oh thanks i was just coming for this
[16:09:51] should i test?
[16:11:21] (PS2) Elukey: profile::kerberos::client: allow to set a different credentials cache dir [puppet] - https://gerrit.wikimedia.org/r/594516
[16:11:55] looks good: https://securityheaders.com/?q=wordcount.toolforge.org&followRedirects=on
[16:11:57] (PS4) Mstyles: increment extra plugin to 6.5.4-wmf-9 [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/593833 (https://phabricator.wikimedia.org/T222669)
[16:12:28] (PS5) Mstyles: increment extra plugin to 6.5.4-wmf-9 [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/593833 (https://phabricator.wikimedia.org/T222669)
[16:13:04] (CR) Mstyles: "> Patch Set 1:" [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/594443 (owner: DCausse)
[16:13:36] (CR) Elukey: "A nice no-op" [puppet] - https://gerrit.wikimedia.org/r/594516 (owner: Elukey)
[16:18:28] Operations, DNS: Reverse DNS missing for some hosts - https://phabricator.wikimedia.org/T251522 (colewhite) p:Triage→Medium
[16:18:32] (PS1) Elukey: Change Kerberos credentials cache location on stat1005 [puppet] - https://gerrit.wikimedia.org/r/594519
[16:18:52] !log notice: planning branch cut for 1.35.0-wmf.31 (T249963) at 16:30 UTC
[16:18:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:18:57] T249963: 1.35.0-wmf.31 deployment blockers - https://phabricator.wikimedia.org/T249963
[16:20:46] Operations, DBA, DC-Ops, Sustainability (Incident Prevention): PXE Boot defaults to automatically reimaging (normally destroying os and all filesystemdata) on all servers - https://phabricator.wikimedia.org/T251416 (colewhite) p:Triage→Medium
[16:21:33] Operations, serviceops: Chaos Engineering - Stop for x hours one or more mc10xx memcached shards - https://phabricator.wikimedia.org/T251378 (colewhite) p:Triage→Medium
[16:23:19] (PS2) Elukey: Change Kerberos credentials cache location on stat1005 [puppet] -
https://gerrit.wikimedia.org/r/594519
[16:23:42] (CR) jerkins-bot: [V: -1] Change Kerberos credentials cache location on stat1005 [puppet] - https://gerrit.wikimedia.org/r/594519 (owner: Elukey)
[16:23:47] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:26:30] (PS3) Elukey: Change Kerberos credentials cache location on stat1005 [puppet] - https://gerrit.wikimedia.org/r/594519
[16:28:53] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1003/22318/" [puppet] - https://gerrit.wikimedia.org/r/594519 (owner: Elukey)
[16:29:22] (CR) CDanis: [C: +1] "I think removing 'notifications disabled' hosts from /alerts is orthogonal to the policy question of whether or not it's okay to disable n" [puppet] - https://gerrit.wikimedia.org/r/594441 (owner: Filippo Giunchedi)
[16:29:36] (CR) Herron: [C: +1] toil: mitigate monthly acct cronspam [puppet] - https://gerrit.wikimedia.org/r/593750 (https://phabricator.wikimedia.org/T167035) (owner: Cwhite)
[16:29:43] Operations, serviceops: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (colewhite)
[16:29:49] Operations, serviceops: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (colewhite) p:Triage→Medium
[16:30:04] Operations, DC-Ops, cloud-services-team (Kanban): labstore1005 A PCIe link training failure error on boot - https://phabricator.wikimedia.org/T169286 (JHedden) Open→Stalled Waiting for the next reboot of this host
[16:30:34] (PS4) Elukey: Change Kerberos credentials cache location on stat1005 [puppet] - https://gerrit.wikimedia.org/r/594519
[16:32:41] !triggering branch cut for 1.35.0-wmf.31 (T249963) via https://releases-jenkins.wikimedia.org/job/MediaWiki%20Train%20Branch%20Cut/build?delay=0sec
[16:32:41] T249963: 1.35.0-wmf.31 deployment
blockers - https://phabricator.wikimedia.org/T249963
[16:33:00] (PS5) Cwhite: toil: mitigate monthly acct cronspam [puppet] - https://gerrit.wikimedia.org/r/593750 (https://phabricator.wikimedia.org/T167035)
[16:33:06] (CR) Elukey: "http://puppet-compiler.wmflabs.org/22319/" [puppet] - https://gerrit.wikimedia.org/r/594519 (owner: Elukey)
[16:33:11] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[16:34:15] brennen forgot !log :)
[16:34:32] paladox: d'oh - thanks. :)
[16:34:37] !log triggering branch cut for 1.35.0-wmf.31 (T249963) via https://releases-jenkins.wikimedia.org/job/MediaWiki%20Train%20Branch%20Cut/build?delay=0sec
[16:34:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:34:41] yw :)
[16:34:44] Operations, ops-codfw, DC-Ops: (Need By: TDB) rack/setup/install thanos-fe200[123] - https://phabricator.wikimedia.org/T251635 (Papaul)
[16:36:15] Operations, DC-Ops, cloud-services-team (Kanban): labstore1005 A PCIe link training failure error on boot - https://phabricator.wikimedia.org/T169286 (bd808)
[16:36:19] Operations, Cloud-VPS (Debian Jessie Deprecation), cloud-services-team (Kanban): Migrate labstore1004/labstore1005 to Stretch/Buster - https://phabricator.wikimedia.org/T224582 (bd808)
[16:36:33] (CR) Cwhite: [C: +2] "PCC looks good: https://puppet-compiler.wmflabs.org/compiler1001/22320/" [puppet] - https://gerrit.wikimedia.org/r/593750 (https://phabricator.wikimedia.org/T167035) (owner: Cwhite)
[16:38:22] Operations, Patch-For-Review: stretch acct monthly cron will spam when /var/log/wtmp.1 doesn't exist - https://phabricator.wikimedia.org/T167035 (colewhite) mitigation was deployed today. will watch for next run.
[16:38:33] Operations, Patch-For-Review: Tracking and Reducing cron-spam to root@ - https://phabricator.wikimedia.org/T132324 (colewhite)
[16:38:35] Operations, Patch-For-Review: stretch acct monthly cron will spam when /var/log/wtmp.1 doesn't exist - https://phabricator.wikimedia.org/T167035 (colewhite) Open→Resolved a:colewhite
[16:40:56] (PS6) Mstyles: increment extra plugin to 6.5.4-wmf-9 [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/593833 (https://phabricator.wikimedia.org/T222669)
[16:41:41] Operations, observability, cloud-services-team (Kanban): remove cloud "dev" hosts from Icinga? - https://phabricator.wikimedia.org/T250787 (Andrew) a:Andrew It's useful to have these visible in icinga, but they ought to all be downtimed (pretty much forever). I'll double-check that this is curre...
[16:42:29] (CR) Herron: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/594441 (owner: Filippo Giunchedi)
[16:45:25] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3060 is OK: HTTP OK: HTTP/1.0 200 OK - 22749 bytes in 0.277 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[16:48:09] !log 1.35.0-wmf.31 was branched at 4d3fed31a435e7bd24925a154f89a9407670986d for T249963
[16:48:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:48:12] T249963: 1.35.0-wmf.31 deployment blockers - https://phabricator.wikimedia.org/T249963
[16:49:05] (CR) DCausse: [C: +1] "lgtm" [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/593833 (https://phabricator.wikimedia.org/T222669) (owner: Mstyles)
[16:50:52] (CR) CDanis: "Brandon, will you have time soon for this stack of changes?" [dns] - https://gerrit.wikimedia.org/r/572269 (owner: CDanis)
[17:00:04] halfak and accraze: May I have your attention please! Services – Graphoid / Citoid / ORES.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200505T1700)
[17:00:57] RECOVERY - mediawiki originals uploads -hourly- for eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[17:01:47] brennen: thanks for announcing the branch cut in advance this week. i didn't have anything in particular that i was concerned about this week, but i've often wished for more predictability in the branch cut time
[17:02:13] RECOVERY - mediawiki originals uploads -hourly- for codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[17:02:55] mdholloway: you bet. i'll add that to the train instructions.
[17:08:03] (PS3) BBlack: Manitoba: better served by codfw [dns] - https://gerrit.wikimedia.org/r/572269 (owner: CDanis)
[17:08:51] (CR) BBlack: [C: +2] Manitoba: better served by codfw [dns] - https://gerrit.wikimedia.org/r/572269 (owner: CDanis)
[17:09:10] (PS3) BBlack: Saskatchewan: ulsfo >> codfw > eqiad [dns] - https://gerrit.wikimedia.org/r/572270 (owner: CDanis)
[17:09:39] (CR) BBlack: [C: +2] Saskatchewan: ulsfo >> codfw > eqiad [dns] - https://gerrit.wikimedia.org/r/572270 (owner: CDanis)
[17:10:17] bblack: ah, thanks!
[17:10:27] np
[17:10:39] I thought they would need esams-offline fixups too, but in practice they don't :)
[17:11:01] ah yeah I think I remember checking for that :)
[17:24:14] (PS1) Catrope: GrowthExperiments: Disable guidance feature on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/594529
[17:31:15] Operations, DNS, Traffic, Wikimedia-Blog: Availability to update DNS records for blog.wikimedia.org - https://phabricator.wikimedia.org/T251931 (Varnent)
[17:31:19] (PS1) Brennen Bearnes: testwikis wikis to 1.35.0-wmf.31 [mediawiki-config] - https://gerrit.wikimedia.org/r/594531
[17:31:21] (CR) Brennen Bearnes: [C: +2] testwikis wikis to 1.35.0-wmf.31 [mediawiki-config] - https://gerrit.wikimedia.org/r/594531 (owner: Brennen Bearnes)
[17:32:04] (Merged) jenkins-bot: testwikis wikis to 1.35.0-wmf.31 [mediawiki-config] - https://gerrit.wikimedia.org/r/594531 (owner: Brennen Bearnes)
[17:35:15] !log brennen@deploy1001 Started scap: testwikis wikis to 1.35.0-wmf.31
[17:35:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:38:04] Operations, Anti-Harassment, Traffic: Add IP Info (ASN & Geolocation) to requests to MediaWiki - https://phabricator.wikimedia.org/T251933 (dbarratt)
[17:38:25] Operations, Anti-Harassment, Traffic: Add IP Info (ASN & Geolocation) to requests to MediaWiki - https://phabricator.wikimedia.org/T251933 (dbarratt)
[17:39:23] Operations, Anti-Harassment, Traffic: Add IP Info (ASN & Geolocation) to requests to MediaWiki - https://phabricator.wikimedia.org/T251933 (dbarratt)
[17:40:06] Operations, Anti-Harassment, Traffic: Add IP Info (ASN & Geolocation) to requests to MediaWiki - https://phabricator.wikimedia.org/T251933 (dbarratt)
[17:41:55] Operations, Anti-Harassment, Traffic: Add IP Info (ASN & Geolocation) to requests to MediaWiki - https://phabricator.wikimedia.org/T251933 (dbarratt)
[17:46:20] (CR) RLazarus: [C: +2]
mcrouter_wancache: Clean up $use_gutter now that it's true everywhere. [puppet] - https://gerrit.wikimedia.org/r/594239 (https://phabricator.wikimedia.org/T244852) (owner: RLazarus)
[17:47:12] (CR) Cwhite: "> Patch Set 4:" [puppet] - https://gerrit.wikimedia.org/r/588759 (https://phabricator.wikimedia.org/T199236) (owner: Cwhite)
[17:50:54] Operations, serviceops: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (hashar) @MoritzMuehlenhoff filed a similar task previously. Do you have details as to why the releng/bazel:0.4.0 fails?
[17:51:40] Operations, serviceops, Continuous-Integration-Config: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (hashar)
[17:58:07] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200505T1800)
[18:04:36] (PS1) Mforns: analytics::refinery::job::refine.pp: Bump up jar version [puppet] - https://gerrit.wikimedia.org/r/594538
[18:04:41] Operations, Anti-Harassment, Traffic, serviceops: Add IP Info (ASN & Geolocation) to requests to MediaWiki - https://phabricator.wikimedia.org/T251933 (CDanis)
[18:05:17] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3058 is OK: HTTP OK: HTTP/1.0 200 OK - 22735 bytes in 0.273 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[18:05:17] Operations, Anti-Harassment, Traffic, serviceops: Add IP Info (ASN & Geolocation) to requests to MediaWiki - https://phabricator.wikimedia.org/T251933 (dbarratt)
[18:06:03] Operations, observability, cloud-services-team (Kanban): remove cloud "dev" hosts from Icinga?
- https://phabricator.wikimedia.org/T250787 (Andrew) Open→Resolved I've downtimed all cloud*-dev.* servers until 2030. [18:07:11] (PS2) Mforns: analytics::refinery::job::refine.pp: Bump up jar version [puppet] - https://gerrit.wikimedia.org/r/594538 [18:11:15] (PS5) Cwhite: smart: prepare collect_smart_metrics for handling devices of different types [puppet] - https://gerrit.wikimedia.org/r/588759 (https://phabricator.wikimedia.org/T199236) [18:12:00] (CR) jerkins-bot: [V: -1] smart: prepare collect_smart_metrics for handling devices of different types [puppet] - https://gerrit.wikimedia.org/r/588759 (https://phabricator.wikimedia.org/T199236) (owner: Cwhite) [18:13:32] (PS6) Cwhite: smart: prepare collect_smart_metrics for handling devices of different types [puppet] - https://gerrit.wikimedia.org/r/588759 (https://phabricator.wikimedia.org/T199236) [18:15:17] !log mforns@deploy1001 Started deploy [analytics/refinery@ebd624a]: Regular analytics weekly train [analytics/refinery@ebd624a5e4c88ac6983387d4603971f8a326ee7c] [18:15:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:18] Operations, Research, SRE-Access-Requests: Requesting access to analytics-privatedata-users and researchers for Aroraakhil - https://phabricator.wikimedia.org/T241096 (Ottomata) FYI, I just added aarora to the nda LDAP group. This seemed to have been missed as part of this access request. https://w...
[18:22:10] (PS1) Andrew Bogott: rabbitmq::plugins: change order of args to rabbitmq-plugins check [puppet] - https://gerrit.wikimedia.org/r/594543 (https://phabricator.wikimedia.org/T251294) [18:22:54] (PS1) Herron: admin: add ryankemper to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/594544 (https://phabricator.wikimedia.org/T251572) [18:23:19] (CR) Andrew Bogott: [C: +2] rabbitmq::plugins: change order of args to rabbitmq-plugins check [puppet] - https://gerrit.wikimedia.org/r/594543 (https://phabricator.wikimedia.org/T251294) (owner: Andrew Bogott) [18:34:12] !log mforns@deploy1001 Finished deploy [analytics/refinery@ebd624a]: Regular analytics weekly train [analytics/refinery@ebd624a5e4c88ac6983387d4603971f8a326ee7c] (duration: 18m 54s) [18:34:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:14] !log mforns@deploy1001 Started deploy [analytics/refinery@ebd624a] (thin): Regular analytics weekly train THIN [analytics/refinery@ebd624a5e4c88ac6983387d4603971f8a326ee7c] [18:35:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:23] !log mforns@deploy1001 Finished deploy [analytics/refinery@ebd624a] (thin): Regular analytics weekly train THIN [analytics/refinery@ebd624a5e4c88ac6983387d4603971f8a326ee7c] (duration: 00m 09s) [18:35:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:28] !log depool mw2221 for some manual testing [18:39:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:49] (CR) Herron: [C: +2] admin: add ryankemper to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/594544 (https://phabricator.wikimedia.org/T251572) (owner: Herron) [18:46:59] PROBLEM - Ensure local MW versions match expected deployment on mw1270 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:46:59] PROBLEM - Ensure local MW versions
match expected deployment on mw1330 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:46:59] PROBLEM - Ensure local MW versions match expected deployment on mw1325 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:47:01] PROBLEM - Ensure local MW versions match expected deployment on wtp2011 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:47:11] PROBLEM - Ensure local MW versions match expected deployment on scandium is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:47:11] PROBLEM - Ensure local MW versions match expected deployment on mw1405 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:47:15] PROBLEM - Ensure local MW versions match expected deployment on mw2219 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:47:15] PROBLEM - Ensure local MW versions match expected deployment on mw2163 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:47:23] PROBLEM - Ensure local MW versions match expected deployment on mw2140 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:47:23] PROBLEM - Ensure local MW versions match expected deployment on mw2178 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:47:33] PROBLEM - Ensure local MW versions match expected deployment on mwmaint1002 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:47:33] PROBLEM - Ensure local MW versions match expected deployment on mw1328 is CRITICAL: CRITICAL: 3 mismatched wikiversions 
https://wikitech.wikimedia.org/wiki/Application_servers [18:47:33] PROBLEM - Ensure local MW versions match expected deployment on mw1335 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:47:33] PROBLEM - Ensure local MW versions match expected deployment on mw1305 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:47:33] PROBLEM - Ensure local MW versions match expected deployment on mw1351 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:47:34] PROBLEM - Ensure local MW versions match expected deployment on mw2292 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:47:34] PROBLEM - Ensure local MW versions match expected deployment on mw2311 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:47:35] PROBLEM - Ensure local MW versions match expected deployment on mw2317 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:47:35] PROBLEM - Ensure local MW versions match expected deployment on mw2274 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:47:36] PROBLEM - Ensure local MW versions match expected deployment on mw2239 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:47:37] PROBLEM - Ensure local MW versions match expected deployment on mw2137 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:47:47] PROBLEM - Ensure local MW versions match expected deployment on mw1400 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:47:47] PROBLEM - Ensure local MW versions match expected deployment 
on mw2362 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:48:01] PROBLEM - Ensure local MW versions match expected deployment on mw2247 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:48:01] PROBLEM - Ensure local MW versions match expected deployment on mw2209 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:48:11] PROBLEM - Ensure local MW versions match expected deployment on labweb1001 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:48:15] PROBLEM - Ensure local MW versions match expected deployment on mw1368 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:48:15] PROBLEM - Ensure local MW versions match expected deployment on mw1275 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:48:15] PROBLEM - Ensure local MW versions match expected deployment on mw1274 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:48:23] PROBLEM - Ensure local MW versions match expected deployment on mw1385 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:48:25] PROBLEM - Ensure local MW versions match expected deployment on mw2273 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:48:39] PROBLEM - Ensure local MW versions match expected deployment on mw2296 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:48:39] PROBLEM - Ensure local MW versions match expected deployment on mw2260 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:48:45] 
PROBLEM - Ensure local MW versions match expected deployment on mw1327 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:48:45] PROBLEM - Ensure local MW versions match expected deployment on mw2329 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:48:45] PROBLEM - Ensure local MW versions match expected deployment on wtp1027 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:48:48] cdanis: ^? not sure if related [18:48:57] PROBLEM - Ensure local MW versions match expected deployment on mw2168 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:48:59] PROBLEM - Ensure local MW versions match expected deployment on mw1322 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:49:01] PROBLEM - Ensure local MW versions match expected deployment on mw2183 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:49:03] PROBLEM - Ensure local MW versions match expected deployment on mw2204 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:49:03] PROBLEM - Ensure local MW versions match expected deployment on mw2136 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:49:07] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [18:49:07] PROBLEM - Ensure local MW versions match expected deployment on mw1316 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:49:09] PROBLEM - Ensure local MW versions match expected 
deployment on wtp1041 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:49:09] PROBLEM - Ensure local MW versions match expected deployment on mw1354 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:49:09] PROBLEM - Ensure local MW versions match expected deployment on mw2139 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:49:13] PROBLEM - Ensure local MW versions match expected deployment on mw2325 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:49:15] PROBLEM - Ensure local MW versions match expected deployment on mwdebug2001 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:49:19] PROBLEM - Ensure local MW versions match expected deployment on mw2164 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:49:21] PROBLEM - Ensure local MW versions match expected deployment on mw2135 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:49:23] PROBLEM - Ensure local MW versions match expected deployment on mw1409 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:49:23] PROBLEM - Ensure local MW versions match expected deployment on mw2266 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:49:31] PROBLEM - Ensure local MW versions match expected deployment on mw2190 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:49:41] PROBLEM - Ensure local MW versions match expected deployment on mw1297 is CRITICAL: CRITICAL: 3 mismatched wikiversions 
https://wikitech.wikimedia.org/wiki/Application_servers [18:49:41] PROBLEM - Ensure local MW versions match expected deployment on mw2155 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:49:41] PROBLEM - Ensure local MW versions match expected deployment on mw1300 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:49:43] PROBLEM - Ensure local MW versions match expected deployment on mw1395 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:49:49] PROBLEM - Ensure local MW versions match expected deployment on mw2258 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:49:51] PROBLEM - Ensure local MW versions match expected deployment on snapshot1005 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:49:51] PROBLEM - Ensure local MW versions match expected deployment on mw2310 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:49:55] PROBLEM - Ensure local MW versions match expected deployment on mw2324 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:50:01] PROBLEM - Ensure local MW versions match expected deployment on mw2253 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:50:05] PROBLEM - Ensure local MW versions match expected deployment on mw2283 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:50:05] PROBLEM - Ensure local MW versions match expected deployment on mw2263 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:50:05] PROBLEM - Ensure local MW versions match expected 
deployment on mw2271 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:50:21] PROBLEM - Ensure local MW versions match expected deployment on mw1341 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:50:23] PROBLEM - Ensure local MW versions match expected deployment on mw2221 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:50:33] PROBLEM - Ensure local MW versions match expected deployment on mw1410 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:50:37] PROBLEM - Ensure local MW versions match expected deployment on mw1362 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:50:41] PROBLEM - Ensure local MW versions match expected deployment on mw1402 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:50:41] PROBLEM - Ensure local MW versions match expected deployment on mw1393 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:50:41] PROBLEM - Ensure local MW versions match expected deployment on mw1363 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:50:41] PROBLEM - Ensure local MW versions match expected deployment on mw1349 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:50:43] PROBLEM - Ensure local MW versions match expected deployment on mw1311 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:50:43] PROBLEM - Ensure local MW versions match expected deployment on mw1326 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers 
[18:50:43] PROBLEM - Ensure local MW versions match expected deployment on mw2331 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:50:49] PROBLEM - Ensure local MW versions match expected deployment on wtp2004 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:51:01] PROBLEM - Ensure local MW versions match expected deployment on mw1302 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:51:01] PROBLEM - Ensure local MW versions match expected deployment on mw1296 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:51:03] PROBLEM - Ensure local MW versions match expected deployment on wtp2014 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:51:07] PROBLEM - Ensure local MW versions match expected deployment on mw2359 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:51:11] PROBLEM - Ensure local MW versions match expected deployment on wtp1036 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:51:19] PROBLEM - Ensure local MW versions match expected deployment on mw2241 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:51:21] PROBLEM - Ensure local MW versions match expected deployment on mw1375 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:51:21] PROBLEM - Ensure local MW versions match expected deployment on mw1273 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:51:27] PROBLEM - Ensure local MW versions match expected deployment on mw2208 is CRITICAL: CRITICAL: 3 mismatched 
wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:51:27] PROBLEM - Ensure local MW versions match expected deployment on mw2154 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:51:35] PROBLEM - Ensure local MW versions match expected deployment on mw1312 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:51:41] PROBLEM - Ensure local MW versions match expected deployment on mw1383 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:51:42] shdubsh: ^^ ^ [18:51:45] PROBLEM - Ensure local MW versions match expected deployment on mw1399 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:51:46] ? [18:51:53] PROBLEM - Ensure local MW versions match expected deployment on mw1392 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:51:53] PROBLEM - Ensure local MW versions match expected deployment on mw2216 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:51:55] PROBLEM - Ensure local MW versions match expected deployment on mwdebug2002 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:52:03] PROBLEM - Ensure local MW versions match expected deployment on mw2156 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:52:03] PROBLEM - Ensure local MW versions match expected deployment on wtp2018 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:52:07] PROBLEM - Ensure local MW versions match expected deployment on mw2211 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:52:07] 
PROBLEM - Ensure local MW versions match expected deployment on mw1338 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:52:07] PROBLEM - Ensure local MW versions match expected deployment on mw1290 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:52:07] PROBLEM - Ensure local MW versions match expected deployment on wtp2008 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:52:17] PROBLEM - Ensure local MW versions match expected deployment on mw2372 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:52:17] PROBLEM - Ensure local MW versions match expected deployment on mwdebug1002 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:52:19] PROBLEM - Ensure local MW versions match expected deployment on snapshot1008 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:52:35] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3060 is OK: HTTP OK: HTTP/1.0 200 OK - 22743 bytes in 0.258 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [18:52:47] RECOVERY - Ensure local MW versions match expected deployment on mw1330 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:53:01] RECOVERY - Ensure local MW versions match expected deployment on scandium is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:53:03] RECOVERY - Ensure local MW versions match expected deployment on mw2163 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:53:11] RECOVERY - Ensure local MW versions match expected deployment on mw2140 is OK: OKAY: wikiversions in sync 
https://wikitech.wikimedia.org/wiki/Application_servers [18:53:19] brennen: Has scap not finished? [18:53:23] RECOVERY - Ensure local MW versions match expected deployment on mwmaint1002 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:53:23] RECOVERY - Ensure local MW versions match expected deployment on mw1305 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:53:23] RECOVERY - Ensure local MW versions match expected deployment on mw1335 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:53:23] RECOVERY - Ensure local MW versions match expected deployment on mw2292 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:53:23] RECOVERY - Ensure local MW versions match expected deployment on mw2274 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:53:24] RECOVERY - Ensure local MW versions match expected deployment on mw2239 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:53:28] Reedy: it has not. [18:53:31] o_0 [18:53:37] RECOVERY - Ensure local MW versions match expected deployment on mw1400 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:53:50] Reedy: 78%. i'm running into a bug that i will dig up shortly. 
[18:53:57] RECOVERY - Ensure local MW versions match expected deployment on labweb1001 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:54:03] RECOVERY - Ensure local MW versions match expected deployment on mw1274 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:54:27] RECOVERY - Ensure local MW versions match expected deployment on mw2296 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:54:33] RECOVERY - Ensure local MW versions match expected deployment on mw1327 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:54:33] RECOVERY - Ensure local MW versions match expected deployment on mw2329 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:54:38] suffice it to say that at the moment, scap periodically gives me an opportunity to press enter a bunch to make it go faster. 
[18:54:45] RECOVERY - Ensure local MW versions match expected deployment on mw2168 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:54:47] RECOVERY - Ensure local MW versions match expected deployment on mw1322 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:54:49] RECOVERY - Ensure local MW versions match expected deployment on mw2183 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:54:51] RECOVERY - Ensure local MW versions match expected deployment on mw2204 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:54:55] RECOVERY - Ensure local MW versions match expected deployment on mw1316 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:54:57] RECOVERY - Ensure local MW versions match expected deployment on mw1354 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:54:57] RECOVERY - Ensure local MW versions match expected deployment on mw2139 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:55:01] RECOVERY - Ensure local MW versions match expected deployment on mw2325 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:55:01] RECOVERY - Ensure local MW versions match expected deployment on mwdebug2001 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:55:11] RECOVERY - Ensure local MW versions match expected deployment on mw2135 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:55:13] RECOVERY - Ensure local MW versions match expected deployment on mw1409 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:55:19] RECOVERY - Ensure local MW versions match expected deployment on mw2190 is OK: 
OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:55:31] RECOVERY - Ensure local MW versions match expected deployment on mw1300 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:55:33] RECOVERY - Ensure local MW versions match expected deployment on mw1395 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:55:41] RECOVERY - Ensure local MW versions match expected deployment on mw2258 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:55:41] RECOVERY - Ensure local MW versions match expected deployment on mw2310 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:55:47] RECOVERY - Ensure local MW versions match expected deployment on mw2324 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:55:57] RECOVERY - Ensure local MW versions match expected deployment on mw2283 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:55:57] RECOVERY - Ensure local MW versions match expected deployment on mw2263 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:55:57] RECOVERY - Ensure local MW versions match expected deployment on mw2271 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:56:11] RECOVERY - Ensure local MW versions match expected deployment on mw1341 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:56:15] RECOVERY - Ensure local MW versions match expected deployment on mw2221 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:56:27] RECOVERY - Ensure local MW versions match expected deployment on mw1362 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:56:31] 
RECOVERY - Ensure local MW versions match expected deployment on mw1402 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:56:31] RECOVERY - Ensure local MW versions match expected deployment on mw1393 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:56:31] RECOVERY - Ensure local MW versions match expected deployment on mw1349 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:56:33] RECOVERY - Ensure local MW versions match expected deployment on mw1311 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:56:33] RECOVERY - Ensure local MW versions match expected deployment on mw1326 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:56:37] RECOVERY - Ensure local MW versions match expected deployment on wtp2004 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:56:49] RECOVERY - Ensure local MW versions match expected deployment on mw1302 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:56:49] RECOVERY - Ensure local MW versions match expected deployment on mw1296 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:56:55] RECOVERY - Ensure local MW versions match expected deployment on mw2359 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:56:57] RECOVERY - Ensure local MW versions match expected deployment on wtp1036 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:57:07] RECOVERY - Ensure local MW versions match expected deployment on mw2241 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:57:07] RECOVERY - Ensure local MW versions match expected deployment on mw1375 is OK: OKAY: 
wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:57:07] RECOVERY - Ensure local MW versions match expected deployment on mw1273 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:57:13] RECOVERY - Ensure local MW versions match expected deployment on mw2208 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:57:15] RECOVERY - Ensure local MW versions match expected deployment on mw2154 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:57:29] RECOVERY - Ensure local MW versions match expected deployment on mw1383 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:57:33] RECOVERY - Ensure local MW versions match expected deployment on mw1399 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:57:41] RECOVERY - Ensure local MW versions match expected deployment on mw1392 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:57:41] RECOVERY - Ensure local MW versions match expected deployment on mw2216 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:57:41] RECOVERY - Ensure local MW versions match expected deployment on mwdebug2002 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:57:49] RECOVERY - Ensure local MW versions match expected deployment on mw2156 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:57:49] RECOVERY - Ensure local MW versions match expected deployment on wtp2018 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:57:53] RECOVERY - Ensure local MW versions match expected deployment on mw2211 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:57:55] 
RECOVERY - Ensure local MW versions match expected deployment on mw1338 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:57:55] RECOVERY - Ensure local MW versions match expected deployment on mw1290 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:57:55] RECOVERY - Ensure local MW versions match expected deployment on wtp2008 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:58:05] RECOVERY - Ensure local MW versions match expected deployment on mwdebug1002 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:58:05] RECOVERY - Ensure local MW versions match expected deployment on mw2372 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:58:07] RECOVERY - Ensure local MW versions match expected deployment on snapshot1008 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:58:35] RECOVERY - Ensure local MW versions match expected deployment on mw1325 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:58:37] RECOVERY - Ensure local MW versions match expected deployment on mw1270 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:58:38] RECOVERY - Ensure local MW versions match expected deployment on wtp2011 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:58:47] RECOVERY - Ensure local MW versions match expected deployment on mw1405 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:58:51] RECOVERY - Ensure local MW versions match expected deployment on mw2219 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:58:59] RECOVERY - Ensure local MW versions match expected deployment on mw2178 is OK: 
OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:59:11] RECOVERY - Ensure local MW versions match expected deployment on mw1351 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:59:11] RECOVERY - Ensure local MW versions match expected deployment on mw2317 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:59:11] RECOVERY - Ensure local MW versions match expected deployment on mw2311 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:59:15] RECOVERY - Ensure local MW versions match expected deployment on mw2137 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:59:27] RECOVERY - Ensure local MW versions match expected deployment on mw2362 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:59:37] RECOVERY - Ensure local MW versions match expected deployment on mw2247 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:59:38] RECOVERY - Ensure local MW versions match expected deployment on mw2209 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:59:51] RECOVERY - Ensure local MW versions match expected deployment on mw1275 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:59:51] RECOVERY - Ensure local MW versions match expected deployment on mw1368 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:00:01] RECOVERY - Ensure local MW versions match expected deployment on mw1385 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:00:05] brennen and hashar: #bothumor My software never has bugs. It just develops random features. Rise for Mediawiki train - American+European Version. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200505T1900). [19:00:05] RECOVERY - Ensure local MW versions match expected deployment on mw2273 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:00:15] RECOVERY - Ensure local MW versions match expected deployment on mw2260 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:00:21] RECOVERY - Ensure local MW versions match expected deployment on wtp1027 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:00:24] the part of that alert that pings irc *really* needs to be an aggregated one [19:00:37] RECOVERY - Ensure local MW versions match expected deployment on mw2136 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:00:38] yeah. :\ [19:00:45] RECOVERY - Ensure local MW versions match expected deployment on wtp1041 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:00:45] rzl: the thing i was editing was just the phpadmin stuff [19:00:51] so I think it had more to do with brennen's scap [19:00:57] RECOVERY - Ensure local MW versions match expected deployment on mw2164 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:01:01] RECOVERY - Ensure local MW versions match expected deployment on mw2266 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:01:08] cdanis: yep for sure, sorry for the ping [19:01:21] RECOVERY - Ensure local MW versions match expected deployment on mw1297 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:01:21] RECOVERY - Ensure local MW versions match expected deployment on mw2155 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:01:31] RECOVERY - Ensure local MW versions match expected deployment 
on snapshot1005 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:01:41] RECOVERY - Ensure local MW versions match expected deployment on mw2253 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:02:13] RECOVERY - Ensure local MW versions match expected deployment on mw1410 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:02:21] RECOVERY - Ensure local MW versions match expected deployment on mw1363 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:02:22] cdanis: this does seem like a good candidate for aggregation. I'll check for a task/file as needed. :) [19:02:23] RECOVERY - Ensure local MW versions match expected deployment on mw2331 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:02:27] !log train status: 1.35.0-wmf.31: presently pressing enter through scap-cdb-rebuild; at 8% (T249963, T223287) [19:02:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:32] T249963: 1.35.0-wmf.31 deployment blockers - https://phabricator.wikimedia.org/T249963 [19:02:32] T223287: Investigate scap-cdb-rebuild idling until pressing ENTER repeatedly - https://phabricator.wikimedia.org/T223287 [19:02:43] RECOVERY - Ensure local MW versions match expected deployment on wtp2014 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:03:03] shinken-wm: thanks! yeah there are a bunch of per-appserver alerts where... we approximately want the workflow of "if one host is broken for $LONG_ENOUGH, file a ticket; if many are broken, alert on IRC once" [19:03:11] RECOVERY - Ensure local MW versions match expected deployment on mw1312 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:03:28] _really_ need to get to the bottom of T223287 one of these days. 
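The workflow cdanis describes above ("if one host is broken for $LONG_ENOUGH, file a ticket; if many are broken, alert on IRC once") could be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual Icinga configuration: the function name `route_alerts`, the thresholds `long_enough` and `many`, and the shape of the input are all hypothetical.

```python
from datetime import timedelta


def route_alerts(failing, long_enough=timedelta(hours=1), many=5):
    """Decide what to do with per-host check failures.

    failing: dict mapping hostname -> how long the check has been failing.
    long_enough, many: hypothetical thresholds, not values taken from the log.
    Returns a list of (action, detail) tuples.
    """
    if len(failing) >= many:
        # Many hosts broken at once: emit a single aggregated IRC alert
        # instead of one message per host.
        return [("irc", f"{len(failing)} hosts failing check")]
    # Only a few hosts broken: file a ticket for each one that has been
    # failing for at least long_enough, and stay quiet on IRC.
    return [
        ("ticket", host)
        for host, age in sorted(failing.items())
        if age >= long_enough
    ]
```

With these assumed defaults, a fleet-wide wikiversions mismatch during a scap run would produce one IRC line, while a single app server stuck out of sync for over an hour would produce a ticket and no IRC traffic.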
[19:03:36] s/shinken-wm/shdubsh/ [19:04:15] brennen: Aha, the "shit isn't continuing till you press enter" bug [19:04:59] RECOVERY - Ensure local MW versions match expected deployment on mw1328 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:05:07] 😱 [19:12:38] !log brennen@deploy1001 Finished scap: testwikis wikis to 1.35.0-wmf.31 (duration: 97m 23s) [19:12:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:11] !log mforns@deploy1001 Started deploy [analytics/refinery@6868fc0]: Regular analytics weekly train (2nd try) [analytics/refinery@ebd624a5e4c88ac6983387d4603971f8a326ee7c] [19:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:07] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3052 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:17:10] (03PS1) 10Brennen Bearnes: group0 wikis to 1.35.0-wmf.31 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/594552 [19:17:12] (03CR) 10Brennen Bearnes: [C: 03+2] group0 wikis to 1.35.0-wmf.31 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/594552 (owner: 10Brennen Bearnes) [19:18:01] (03Merged) 10jenkins-bot: group0 wikis to 1.35.0-wmf.31 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/594552 (owner: 10Brennen Bearnes) [19:19:43] 10Operations, 10observability: check_mw_versions alerts for each individual app server - https://phabricator.wikimedia.org/T251942 (10colewhite) [19:19:52] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.35.0-wmf.31 [19:19:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:57] 10Operations, 10observability: check_mw_versions alerts for each individual app server - https://phabricator.wikimedia.org/T251942 (10colewhite) p:05Triage→03Medium [19:20:32] 10Operations, 10observability: 
check_mw_versions alerts for each individual app server - https://phabricator.wikimedia.org/T251942 (10colewhite) [19:24:46] (03PS5) 10Cwhite: smart: add multiple hpsa controller support [puppet] - 10https://gerrit.wikimedia.org/r/588769 (https://phabricator.wikimedia.org/T199236) [19:24:49] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3052 is OK: HTTP OK: HTTP/1.0 200 OK - 22738 bytes in 0.257 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:28:26] (03CR) 10Cwhite: smart: add multiple hpsa controller support (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/588769 (https://phabricator.wikimedia.org/T199236) (owner: 10Cwhite) [19:34:43] (03CR) 10Cwhite: [C: 03+1] "This is a good additional viewpoint into the state of the fleet." [puppet] - 10https://gerrit.wikimedia.org/r/594441 (owner: 10Filippo Giunchedi) [19:36:07] (03PS1) 10CDanis: php-admin: add Content-Types [puppet] - 10https://gerrit.wikimedia.org/r/594560 [19:36:09] (03PS1) 10CDanis: typos: fix "fragmenation" [puppet] - 10https://gerrit.wikimedia.org/r/594561 [19:38:30] !log mforns@deploy1001 Finished deploy [analytics/refinery@6868fc0]: Regular analytics weekly train (2nd try) [analytics/refinery@ebd624a5e4c88ac6983387d4603971f8a326ee7c] (duration: 25m 18s) [19:38:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:36] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Discovery-Search (Current work): SRE Onboarding - Ryan Kemper, Search Platform team - https://phabricator.wikimedia.org/T251572 (10RKemper) Production SSH Key: `ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAICHjbEiVf5Z7L4yrqByE9kVcRtR3MTmwyPg65l3LPZc7 rk... 
[19:38:45] !log mforns@deploy1001 Started deploy [analytics/refinery@6868fc0] (thin): Regular analytics weekly train THIN [analytics/refinery@ebd624a5e4c88ac6983387d4603971f8a326ee7c] [19:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:53] !log mforns@deploy1001 Finished deploy [analytics/refinery@6868fc0] (thin): Regular analytics weekly train THIN [analytics/refinery@ebd624a5e4c88ac6983387d4603971f8a326ee7c] (duration: 00m 08s) [19:38:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:05] 10Operations, 10DNS, 10Traffic, 10Wikimedia-Blog: Availability to update DNS records for blog.wikimedia.org - https://phabricator.wikimedia.org/T251931 (10colewhite) p:05Triage→03High [19:57:00] 10Operations, 10DNS, 10Traffic, 10Wikimedia-Blog: Availability to update DNS records for blog.wikimedia.org - https://phabricator.wikimedia.org/T251931 (10colewhite) The requested day and time will determine who is available to assist you with this task. Please let us know the details as soon as you have... 
[19:58:39] 10Operations, 10Anti-Harassment, 10Traffic, 10serviceops: Add IP Info (ASN & Geolocation) to requests to MediaWiki - https://phabricator.wikimedia.org/T251933 (10colewhite) p:05Triage→03Medium [19:58:47] (03CR) 10Ottomata: [C: 03+2] analytics::refinery::job::refine.pp: Bump up jar version [puppet] - 10https://gerrit.wikimedia.org/r/594538 (owner: 10Mforns) [19:58:55] (03PS1) 10Herron: admin: give ryankemper shell access and add to ops group [puppet] - 10https://gerrit.wikimedia.org/r/594563 (https://phabricator.wikimedia.org/T251572) [20:00:25] 10Operations, 10SRE-Access-Requests: Requesting Access to Stat1004, Stat1006, Stat1007, notebook1003 and notebook1004 - https://phabricator.wikimedia.org/T251523 (10colewhite) p:05Triage→03Medium a:03colewhite [20:02:44] !log added ryankemper to wmf and ops ldap groups T251572 [20:02:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:48] T251572: SRE Onboarding - Ryan Kemper, Search Platform team - https://phabricator.wikimedia.org/T251572 [20:03:04] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Discovery-Search (Current work), 10Patch-For-Review: SRE Onboarding - Ryan Kemper, Search Platform team - https://phabricator.wikimedia.org/T251572 (10herron) [20:03:57] 10Operations, 10LDAP-Access-Requests: Add `Jgiannelos` to `wmf` LDAP group - https://phabricator.wikimedia.org/T251899 (10colewhite) p:05Triage→03Medium a:03colewhite [20:05:03] 10Operations, 10LDAP-Access-Requests: LDAP-Access-Requests for Superset - https://phabricator.wikimedia.org/T251516 (10colewhite) p:05Triage→03Medium a:03colewhite [20:06:00] 10Operations, 10Analytics, 10LDAP-Access-Requests: LDAP access to the wmf group for Antonino Hemmer (superset, turnilo, hue) - https://phabricator.wikimedia.org/T251123 (10colewhite) p:05Triage→03Medium a:03colewhite [20:07:23] 10Operations, 10SRE-Access-Requests: Requesting Access to Stat1004, Stat1006, Stat1007, notebook1003 and notebook1004 - 
https://phabricator.wikimedia.org/T251122 (10colewhite) p:05Triage→03Medium a:03colewhite [20:07:34] (03PS2) 10Herron: admin: give ryankemper shell access and add to ops group [puppet] - 10https://gerrit.wikimedia.org/r/594563 (https://phabricator.wikimedia.org/T251572) [20:08:45] (03PS1) 10Ottomata: Add camus job event_dynamic_stream_configs [puppet] - 10https://gerrit.wikimedia.org/r/594565 (https://phabricator.wikimedia.org/T251609) [20:09:43] (03CR) 10Herron: [C: 03+1] admin: give ryankemper shell access and add to ops group [puppet] - 10https://gerrit.wikimedia.org/r/594563 (https://phabricator.wikimedia.org/T251572) (owner: 10Herron) [20:09:49] (03CR) 10jerkins-bot: [V: 04-1] Add camus job event_dynamic_stream_configs [puppet] - 10https://gerrit.wikimedia.org/r/594565 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [20:10:56] 10Operations, 10SRE-Access-Requests: Requesting Access to sites from Google Search Console - https://phabricator.wikimedia.org/T251128 (10colewhite) p:05Triage→03Medium a:03colewhite [20:11:39] (03CR) 10Ryan Kemper: [C: 03+2] admin: give ryankemper shell access and add to ops group [puppet] - 10https://gerrit.wikimedia.org/r/594563 (https://phabricator.wikimedia.org/T251572) (owner: 10Herron) [20:12:54] (03PS2) 10Ottomata: Add camus job event_dynamic_stream_configs [puppet] - 10https://gerrit.wikimedia.org/r/594565 (https://phabricator.wikimedia.org/T251609) [20:16:37] (03CR) 10RLazarus: [C: 03+1] "love to defragmenate my hard drive" [puppet] - 10https://gerrit.wikimedia.org/r/594561 (owner: 10CDanis) [20:18:27] (03PS2) 10CDanis: typos: fix "fragmenation" [puppet] - 10https://gerrit.wikimedia.org/r/594561 [20:18:39] (03CR) 10CDanis: [C: 03+2] typos: fix "fragmenation" [puppet] - 10https://gerrit.wikimedia.org/r/594561 (owner: 10CDanis) [20:19:31] (03PS5) 10Ottomata: Initial debian commit [debs/anaconda] (debian) - 10https://gerrit.wikimedia.org/r/594204 (https://phabricator.wikimedia.org/T251006) [20:20:06] (03CR) 
10Ottomata: "Thank you! That works:" [debs/anaconda] (debian) - 10https://gerrit.wikimedia.org/r/594204 (https://phabricator.wikimedia.org/T251006) (owner: 10Ottomata) [20:20:31] (03PS2) 10CDanis: php-admin: add Content-Types [puppet] - 10https://gerrit.wikimedia.org/r/594560 [20:21:07] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3052 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [20:21:37] (03CR) 10RLazarus: [C: 03+1] php-admin: add Content-Types [puppet] - 10https://gerrit.wikimedia.org/r/594560 (owner: 10CDanis) [20:21:52] (03CR) 10CDanis: [C: 03+2] php-admin: add Content-Types [puppet] - 10https://gerrit.wikimedia.org/r/594560 (owner: 10CDanis) [20:22:24] (03CR) 10Ottomata: "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/594565 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [20:23:02] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Discovery-Search (Current work), 10Patch-For-Review: SRE Onboarding - Ryan Kemper, Search Platform team - https://phabricator.wikimedia.org/T251572 (10herron) [20:28:13] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3052 is OK: HTTP OK: HTTP/1.0 200 OK - 22738 bytes in 0.276 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [20:35:47] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Discovery-Search (Current work), 10Patch-For-Review: SRE Onboarding - Ryan Kemper, Search Platform team - https://phabricator.wikimedia.org/T251572 (10RKemper) [20:37:12] 10Operations, 10LDAP-Access-Requests: LDAP Access for Superset for PDas - https://phabricator.wikimedia.org/T251516 (10Aklapper) [20:37:41] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install cloudceph200[123]-dev - https://phabricator.wikimedia.org/T250846 (10Papaul) [20:38:40] 10Operations, 10ops-codfw, 10Cloud-Services, 10DC-Ops: 
(Need By: TBD) rack/setup/install cloudceph200[123]-dev - https://phabricator.wikimedia.org/T250846 (10Papaul) [20:41:03] PROBLEM - Host analytics1060 is DOWN: PING CRITICAL - Packet loss = 100% [20:49:23] (03CR) 10CRusnov: "> Patch Set 2:" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/592721 (owner: 10CRusnov) [21:09:42] 10Operations, 10ops-codfw, 10Cloud-Services, 10DC-Ops: (Need By: TBD) rack/setup/install cloudceph200[123]-dev - https://phabricator.wikimedia.org/T250846 (10JHedden) You can use the partman config `echo partman/standard.cfg partman/raid1-2dev.cfg` This is the same as we have in production. 2 x OS drives... [21:30:57] (03CR) 10Hashar: "Tentatively the links have been broken since mid March, so do we still need to redirect? I am tempted to break them to avoid carrying thos" [puppet] - 10https://gerrit.wikimedia.org/r/593344 (owner: 10Krinkle) [21:37:51] jouncebot: now [21:37:51] No deployments scheduled for the next 1 hour(s) and 22 minute(s) [21:37:52] jouncebot: next [21:37:52] In 1 hour(s) and 22 minute(s): Evening SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200505T2300) [21:53:29] (03CR) 10Bstorm: [C: 03+2] "I'm merging it. I'll do a bit more kicking around when deployed on toolsbeta."
[software/tools-webservice] - 10https://gerrit.wikimedia.org/r/586162 (https://phabricator.wikimedia.org/T197930) (owner: 10BryanDavis) [21:54:08] (03Merged) 10jenkins-bot: Replace pykube with a custom API client [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/586162 (https://phabricator.wikimedia.org/T197930) (owner: 10BryanDavis) [21:54:52] Backport for T251952 ready to be synced [21:54:53] T251952: CoreParserFunctions::revisionuser: Call to a member function getUser() on boolean - https://phabricator.wikimedia.org/T251952 [21:55:37] !log reedy@deploy1001 Synchronized php-1.35.0-wmf.31/includes/specials/SpecialNewpages.php: T251950 (duration: 01m 06s) [21:55:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:40] T251950: MutableRevisionRecord: Bad value for parameter $visibility: must be a integer - https://phabricator.wikimedia.org/T251950 [21:57:27] !log reedy@deploy1001 Synchronized php-1.35.0-wmf.31/includes/parser/CoreParserFunctions.php: T251952 (duration: 01m 05s) [21:57:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:14] thanks much DannyS712, Reedy. [21:58:20] um, https://www.mediawiki.org/wiki/Help_talk:Magic_words is still broken [21:58:34] hrm, so it is. [21:58:37] I guess it wasn't the issue? 
`XrHhgwpAMMEAADYIcS8AAADI` [21:58:42] Seemingly with the same error [21:58:42] 2020-05-05 21:57:53 [XrHhYQpAIH4AAMdPuwEAAABM] mw1409 mediawikiwiki 1.35.0-wmf.31 exception ERROR: [XrHhYQpAIH4AAMdPuwEAAABM] /wiki/Help_talk:Magic_words Error from line 1400 of /srv/mediawiki/php-1.35.0-wmf.31/includes/parser/CoreParserFunctions.php: Call to a member function getUser() on boolean {"exception_id":"XrHhYQpAIH4AAMdPuwEAAABM","exception_url":"/wiki/Help_talk:Magic_words","caught_by":"mwe_handler"} [21:59:10] I pulled too early [21:59:18] Line 1400 means that the patch wasn't deployed [21:59:26] See above [22:00:27] !log reedy@deploy1001 Synchronized php-1.35.0-wmf.31/includes/parser/CoreParserFunctions.php: T251952 take 2 (duration: 01m 06s) [22:00:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:30] T251952: CoreParserFunctions::revisionuser: Call to a member function getUser() on boolean - https://phabricator.wikimedia.org/T251952 [22:00:34] And indeed, it is now fixed [22:05:31] (03PS1) 10Bstorm: d/changelog: prepare for 0.69 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/594579 [22:08:22] (03CR) 10Bstorm: "I presume I still should do this step right, Arturo? If this is right, I'll try to deploy with your script tomorrow :)" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/594579 (owner: 10Bstorm) [22:09:58] (03CR) 10Volans: [C: 03+1] "Ship it!" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/592721 (owner: 10CRusnov) [22:50:18] 10Operations, 10SRE-Access-Requests: Requesting Access to Stat1004, Stat1006, Stat1007, notebook1003 and notebook1004 - https://phabricator.wikimedia.org/T251122 (10colewhite) Hi Antonino! I'll go ahead and prep the necessary changes. In the mean time, would you mind re-submitting your SSH key as a comment a... 
[22:51:37] (03PS1) 10Cwhite: admin: add ahemmer to analytics-privatedata-users and researchers [puppet] - 10https://gerrit.wikimedia.org/r/594593 (https://phabricator.wikimedia.org/T251899) [22:54:21] (03PS2) 10Cwhite: admin: add ahemmer to analytics-privatedata-users and researchers [puppet] - 10https://gerrit.wikimedia.org/r/594593 (https://phabricator.wikimedia.org/T251122) [22:58:52] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Add `Jgiannelos` to `wmf` LDAP group - https://phabricator.wikimedia.org/T251899 (10colewhite) [23:00:04] RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Evening SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200505T2300). [23:00:04] Huji: A patch you scheduled for Evening SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:05:35] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Add `Jgiannelos` to `wmf` LDAP group - https://phabricator.wikimedia.org/T251899 (10colewhite) Hi @Jgiannelos! I've added the LDAP access request template to the description. Once you have a chance to fill out the purpose field, we can complete this... [23:08:19] 10Operations, 10Analytics, 10LDAP-Access-Requests: LDAP access to the wmf group for Antonino Hemmer (superset, turnilo, hue) - https://phabricator.wikimedia.org/T251123 (10colewhite) [23:08:23] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting Access to Stat1004, Stat1006, Stat1007, notebook1003 and notebook1004 - https://phabricator.wikimedia.org/T251122 (10colewhite) [23:08:57] 10Operations, 10Analytics, 10LDAP-Access-Requests: LDAP access to the wmf group for Antonino Hemmer (superset, turnilo, hue) - https://phabricator.wikimedia.org/T251123 (10colewhite) This can be completed once the patch T251122 has been deployed. [23:09:57] Hi. 
I am here because of the patch associated with T249643 [23:09:57] T249643: Restore the "reviewer" group for fawiki - https://phabricator.wikimedia.org/T249643 [23:11:46] This would be my very first time being involved with a Deployment so guidance is appreciated. [23:12:34] RoanKattouw: do I bother you for this, given that your name was listed as the Deployer for this window? [23:14:07] (03PS1) 10Cwhite: admin: add pdas to ldap only users [puppet] - 10https://gerrit.wikimedia.org/r/594598 (https://phabricator.wikimedia.org/T251516) [23:14:14] huji: Hi, sorry, I'm interviewing a job applicant right now but I'll be there in 15 mins [23:14:27] RoanKattouw: no worries, I can wait [23:14:45] (03CR) 10CRusnov: [V: 03+2 C: 03+2] Update Netbox to v2.8.1-wmf [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/592721 (owner: 10CRusnov) [23:16:52] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP Access for Superset for PDas - https://phabricator.wikimedia.org/T251516 (10colewhite) Hi Praveen! It appears you are a WMF employee, so this is likely a request for membership to the `wmf` LDAP group. @Nuria, would you mind confirming that this...
[23:20:02] !log crusnov@deploy1001 Started deploy [netbox/deploy@03cc2dd]: Netbox upgrade to 2.8.1 (part1) [23:20:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:22] !log crusnov@deploy1001 Finished deploy [netbox/deploy@03cc2dd]: Netbox upgrade to 2.8.1 (part1) (duration: 01m 20s) [23:21:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:23] !log crusnov@deploy1001 Started deploy [netbox/deploy@03cc2dd]: Netbox upgrade to 2.8.1 (part1) [23:21:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:37] !log crusnov@deploy1001 Finished deploy [netbox/deploy@03cc2dd]: Netbox upgrade to 2.8.1 (part1) (duration: 01m 14s) [23:22:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:43] !log crusnov@deploy1001 Started deploy [netbox/deploy@03cc2dd]: Netbox upgrade to 2.8.1 (part3) [23:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:55] !log crusnov@deploy1001 Finished deploy [netbox/deploy@03cc2dd]: Netbox upgrade to 2.8.1 (part3) (duration: 00m 11s) [23:22:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:26:19] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:26:19] PROBLEM - Check the last execution of netbox_ganeti_ulsfo_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_ulsfo_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [23:28:09] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:31:03] huji: OK I'm back! [23:31:10] Do you have the WikimediaDebug browser extension installed? 
[23:31:12] * huji shows thumbs up [23:31:18] If not, please install it, it will help you test [23:31:37] RoanKattouw: I don't but is it relevant for this particular patch? The only thing it does is create a new user group on one wiki [23:32:08] I see. Yeah that's pretty simple, I can test that for you [23:32:24] (03CR) 10Catrope: [C: 03+2] Restore the 'reviewer' group for fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587301 (https://phabricator.wikimedia.org/T249643) (owner: 10Huji) [23:33:18] (03Merged) 10jenkins-bot: Restore the 'reviewer' group for fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587301 (https://phabricator.wikimedia.org/T249643) (owner: 10Huji) [23:33:19] RoanKattouw: I installed it anyway [23:39:01] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:39:37] huji: OK, your patch is now on mwdebug1002. You can test it by going to Special:Listgrouprights on fawiki, then enabling the extension, pointing it to mwdebug1002, and refreshing. With the extension set to "on", you should see the new group, with it set to "off" you shouldn't see it [23:40:33] PROBLEM - Check the last execution of netbox_ganeti_esams_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_esams_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [23:40:49] ^known issue fixing [23:40:50] Thanks. That is not happening though. With it set to ON or OFF, I see the same list of groups. [23:41:25] Actually, I take it back [23:41:35] It is there, but I cannot assign it to users. [23:41:54] RoanKattouw: I thought my patch would allow sysops (of which, I am one) to assign a person to this new group. No? 
[23:42:07] Yes, and that is what it says here too [23:42:23] Under Administrators it says "Add groups: Rollbackers, Autopatrollers, Confirmed users, Patrollers, Extended confirmed users, Eliminators, IP block exemptions and Reviewers" [23:42:51] Are you saying that if you go to Special:Userrights it doesn't show up? [23:42:52] RoanKattouw: oh wait. I think there was a caching issue or something, because the new checkbox just showed up on Special:UserRights/someuser [23:42:58] Oh cool, OK [23:43:14] Yep, now it works [23:43:22] I did also update the server a little bit later than I said, I did the steps in the wrong order so the change was only there ~90 seconds after I said it was [23:43:29] OK great, then let's deploy it everywhere [23:44:11] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [23:44:50] !log catrope@deploy1001 Synchronized wmf-config/flaggedrevs.php: Restore the reviewer group on fawiki (T249643) (duration: 01m 06s) [23:44:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:53] T249643: Restore the "reviewer" group for fawiki - https://phabricator.wikimedia.org/T249643 [23:45:46] huji: Alright, that's it. Thanks for installing the extension and doing the testing stuff even though it was confusing for a bit there [23:46:24] (03PS1) 10CRusnov: ganeti-netbox-sync.py: Fix for API drift [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/594600 [23:46:27] PROBLEM - Check the last execution of netbox_ganeti_eqiad_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [23:46:42] RoanKattouw: thank you for deploying it and educating me! 
[23:46:45] PROBLEM - Check the last execution of netbox_ganeti_eqsin_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqsin_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [23:47:01] (03CR) 10CRusnov: [V: 03+2 C: 03+2] "Self-reviewing after self-test due to breakage." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/594600 (owner: 10CRusnov) [23:49:51] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:51:21] RECOVERY - Check the last execution of netbox_ganeti_esams_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_esams_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [23:55:17] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:58:45] RECOVERY - Check the last execution of netbox_ganeti_ulsfo_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_ulsfo_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [23:58:55] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state