[00:28:45] PROBLEM - PHP opcache health on mw2307 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [00:30:57] PROBLEM - PHP opcache health on mw2309 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [00:59:51] PROBLEM - Check systemd state on elastic2028 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:57:53] (03CR) 10Andrew Bogott: [C: 03+2] Nova: add a simple vendordata REST service [puppet] - 10https://gerrit.wikimedia.org/r/656640 (https://phabricator.wikimedia.org/T271273) (owner: 10Andrew Bogott) [02:03:39] (03PS1) 10Andrew Bogott: nova: install the novavendordata api in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/656986 (https://phabricator.wikimedia.org/T271273) [02:04:43] (03CR) 10Andrew Bogott: [C: 03+2] nova: install the novavendordata api in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/656986 (https://phabricator.wikimedia.org/T271273) (owner: 10Andrew Bogott) [02:07:24] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.36.0-wmf.27 [core] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/656987 [02:12:43] (03PS2) 10DannyS712: Branch commit for wmf/1.36.0-wmf.27 [core] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/656987 (https://phabricator.wikimedia.org/T271341) (owner: 10TrainBranchBot) [02:47:57] PROBLEM - dump of matomo in eqiad on alert1001 is CRITICAL: Last dump for matomo at eqiad (db1108.eqiad.wmnet:3351) taken on 2021-01-19 02:27:34 is 1 GB, but previous one was 0 GB, a change of 114.0% https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [03:59:05] PROBLEM - Check systemd state on elastic2049 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:33:59] cdanis: I assume still no deploys, right? [04:34:08] the message is going to get ignored if not updated :) [04:34:30] (not by me, I'm just predicting others) [04:37:25] !log locks scap on deploy1001 as precaution [04:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:37:33] Please undo if/when okay. [04:38:57] * Krinkle undoes the above based on https://phabricator.wikimedia.org/T272215#6755025 [04:39:34] !log unlocked per https://phabricator.wikimedia.org/T272215#6755025 [04:39:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:48:11] PROBLEM - Check systemd state on elastic2036 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:04:53] !log Upgrade kernel on pc2007 pc2008 pc2009 pc2010 T272121 [06:04:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:11:09] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=mysql-parsercache site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:20:51] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:24:57] 10ops-codfw, 10DBA: cold reset and upgrade pc2010's idrac - https://phabricator.wikimedia.org/T272337 (10Marostegui) [06:25:07] 10ops-codfw, 10DBA: cold reset and upgrade pc2010's idrac - https://phabricator.wikimedia.org/T272337 (10Marostegui) p:05Triage→03Medium [06:30:37] 10ops-codfw, 10DBA: cold reset and upgrade pc2010's idrac - https://phabricator.wikimedia.org/T272337 (10Marostegui) 05Open→03Resolved I tried the cold reset from the CLI and looks like I was able to reboot the host from the idrac, so considering this fixed [06:55:41] (03PS1) 10Marostegui: db1082: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/656998 [06:57:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1082 to stop replication T272008', diff saved to https://phabricator.wikimedia.org/P13821 and previous config saved to /var/cache/conftool/dbconfig/20210119-065748-marostegui.json [06:57:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:57:53] T272008: Move wikireplicas under the new sanitarium hosts (db1154, db1155) - https://phabricator.wikimedia.org/T272008 [06:58:02] (03CR) 10Marostegui: [C: 03+2] db1082: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/656998 (owner: 10Marostegui) [07:02:47] !log Stop MySQL on db1082 T272008 [07:02:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:14:51] !log clean up prometheus es exporter units on es-codfw nodes not needed anymore [07:14:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:29] RECOVERY - Check systemd state on elastic2027 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:17:05] RECOVERY - Check systemd state on elastic2025 is OK: OK - running: The system is fully 
operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:18:21] RECOVERY - Check systemd state on elastic2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:19:51] RECOVERY - Check systemd state on elastic2029 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:20:19] (03PS1) 10Marostegui: Revert "db1082: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/656918 [07:20:55] (03CR) 10Marostegui: [C: 03+2] Revert "db1082: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/656918 (owner: 10Marostegui) [07:22:16] RECOVERY - Check systemd state on elastic2032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:23:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1082 (re)pooling @ 25%: After moving wikireplicas to another host', diff saved to https://phabricator.wikimedia.org/P13822 and previous config saved to /var/cache/conftool/dbconfig/20210119-072329-root.json [07:23:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:14] RECOVERY - Check systemd state on elastic2035 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:25:03] (03PS1) 10Ladsgroup: bacula: Migrate hiera() to lookup() and setting datatype [puppet] - 10https://gerrit.wikimedia.org/r/657040 (https://phabricator.wikimedia.org/T209953) [07:25:58] RECOVERY - Check systemd state on elastic2036 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:26:40] (03CR) 10Ladsgroup: "PCC: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27511/console" [puppet] - 10https://gerrit.wikimedia.org/r/657040 
(https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [07:27:18] RECOVERY - Check systemd state on elastic2034 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:27:52] RECOVERY - Check systemd state on elastic2052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:29:38] RECOVERY - Check systemd state on elastic2028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:29:38] RECOVERY - Check systemd state on elastic2030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:29:38] RECOVERY - Check systemd state on elastic2047 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:29:38] RECOVERY - Check systemd state on elastic2049 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:29:38] RECOVERY - Check systemd state on elastic2033 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:30:18] gooood [07:30:47] (03PS1) 10Ladsgroup: cloudinfra: Migrate hiera() to lookup() [puppet] - 10https://gerrit.wikimedia.org/r/657043 (https://phabricator.wikimedia.org/T209953) [07:34:12] RECOVERY - Elevated latency for icinga checks in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga [07:35:36] <_joe_> heads up: I'm about to do a null deployment, after that I'll update the topic [07:36:46] PROBLEM - Check systemd state on elastic2031 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:37:14] (03PS1) 10Ladsgroup: query_service: Migrate hiera() to lookup() [puppet] - 10https://gerrit.wikimedia.org/r/657044 (https://phabricator.wikimedia.org/T209953) [07:38:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1082 (re)pooling @ 50%: After moving wikireplicas to another host', diff saved to https://phabricator.wikimedia.org/P13823 and previous config saved to /var/cache/conftool/dbconfig/20210119-073832-root.json [07:38:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:48] RECOVERY - Check systemd state on elastic2031 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:39:58] (03CR) 10Ladsgroup: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/27512/" [puppet] - 10https://gerrit.wikimedia.org/r/657044 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [07:41:42] !log oblivian@deploy1001 Synchronized README: Null deployments to test php restarts from scap (duration: 01m 23s) [07:41:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:17] (03CR) 10Marostegui: [C: 03+2] cloudinfra: Migrate hiera() to lookup() [puppet] - 10https://gerrit.wikimedia.org/r/657043 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [07:53:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1082 (re)pooling @ 75%: After moving wikireplicas to another host', diff saved to https://phabricator.wikimedia.org/P13824 and previous config saved to /var/cache/conftool/dbconfig/20210119-075336-root.json [07:53:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:31] <_joe_> Status: Up! 
| Server Admin Log: https://bit.ly/wikitech | This channel is logged: https://bit.ly/opsirclog | SRE Clinic Duty: jynus [07:54:36] <_joe_> argh [07:54:43] <_joe_> I forgot /topic :D [08:02:30] (03PS1) 10Ladsgroup: eventlogging: Migrate hiera() to lookup() and setting datatype [puppet] - 10https://gerrit.wikimedia.org/r/657045 (https://phabricator.wikimedia.org/T209953) [08:04:39] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one nit inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/656953 (https://phabricator.wikimedia.org/T259780) (owner: 10Filippo Giunchedi) [08:08:08] (03CR) 10Muehlenhoff: [C: 03+1] rsyslog: install rsyslog from component/rsyslog on Buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/656953 (https://phabricator.wikimedia.org/T259780) (owner: 10Filippo Giunchedi) [08:08:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1082 (re)pooling @ 100%: After moving wikireplicas to another host', diff saved to https://phabricator.wikimedia.org/P13825 and previous config saved to /var/cache/conftool/dbconfig/20210119-080839-root.json [08:08:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:53] (03CR) 10Elukey: [C: 03+2] eventlogging: Migrate hiera() to lookup() and setting datatype [puppet] - 10https://gerrit.wikimedia.org/r/657045 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [08:15:06] (03PS1) 10Marostegui: db1112: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/657046 [08:21:05] (03PS1) 10Ladsgroup: kafka: Use lookup() instead of hiera() in code comment [puppet] - 10https://gerrit.wikimedia.org/r/657047 (https://phabricator.wikimedia.org/T209953) [08:29:01] (03CR) 10Ladsgroup: "Found this tiny thing 😅" [puppet] - 10https://gerrit.wikimedia.org/r/657047 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [08:30:24] (03CR) 10Elukey: [C: 03+2] kafka: Use lookup() instead of hiera() in code comment [puppet] - 
10https://gerrit.wikimedia.org/r/657047 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [08:31:33] (03CR) 10David Caro: "Nice :)" (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/656640 (https://phabricator.wikimedia.org/T271273) (owner: 10Andrew Bogott) [08:31:43] (03PS1) 10Elukey: sre.druid.roll-restart-workers: move to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657049 (https://phabricator.wikimedia.org/T269925) [08:32:21] (03CR) 10David Caro: "> Patch Set 15:" (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/656640 (https://phabricator.wikimedia.org/T271273) (owner: 10Andrew Bogott) [08:35:55] (03PS4) 10Filippo Giunchedi: rsyslog: install rsyslog from component/rsyslog on Buster [puppet] - 10https://gerrit.wikimedia.org/r/656953 (https://phabricator.wikimedia.org/T259780) [08:36:52] (03CR) 10Gehel: [C: 03+2] query_service: Migrate hiera() to lookup() [puppet] - 10https://gerrit.wikimedia.org/r/657044 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [08:36:57] (03PS2) 10Gehel: query_service: Migrate hiera() to lookup() [puppet] - 10https://gerrit.wikimedia.org/r/657044 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [08:39:57] (03CR) 10Kormat: [C: 03+2] install_server: Fix domain for dborch1001 [puppet] - 10https://gerrit.wikimedia.org/r/656903 (https://phabricator.wikimedia.org/T266106) (owner: 10Kormat) [08:40:29] (03PS5) 10Filippo Giunchedi: rsyslog: install rsyslog from component/rsyslog on Buster [puppet] - 10https://gerrit.wikimedia.org/r/656953 (https://phabricator.wikimedia.org/T259780) [08:40:41] (03CR) 10Filippo Giunchedi: "Thank you for the review!" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/656953 (https://phabricator.wikimedia.org/T259780) (owner: 10Filippo Giunchedi) [08:54:07] (03CR) 10JMeybohm: [C: 03+1] "LGTM!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/655915 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan) [08:54:23] !log swift codfw-prod: more weight to ms-be20[58-61] - T269337 [08:54:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:27] T269337: Add ms-be20[58-61] to swift - https://phabricator.wikimedia.org/T269337 [08:56:55] (03CR) 10Volans: [C: 03+1] "LGTM, thanks for the migration!" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/657049 (https://phabricator.wikimedia.org/T269925) (owner: 10Elukey) [08:58:46] (03CR) 10Jcrespo: "This is likely to be deployable as is, but let me test PCC on the actual host affected (dbprovs and backupX002)." [puppet] - 10https://gerrit.wikimedia.org/r/657040 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [08:58:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1078 to stop replication T272008', diff saved to https://phabricator.wikimedia.org/P13826 and previous config saved to /var/cache/conftool/dbconfig/20210119-085856-marostegui.json [08:59:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:01] T272008: Move wikireplicas under the new sanitarium hosts (db1154, db1155) - https://phabricator.wikimedia.org/T272008 [08:59:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1078, depooled by mistake', diff saved to https://phabricator.wikimedia.org/P13827 and previous config saved to /var/cache/conftool/dbconfig/20210119-085918-marostegui.json [08:59:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:31] (03CR) 10Marostegui: [C: 03+2] db1112: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/657046 (owner: 10Marostegui) [09:01:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1112 to stop replication T272008', diff saved to https://phabricator.wikimedia.org/P13828 and previous config saved to 
/var/cache/conftool/dbconfig/20210119-090100-marostegui.json [09:01:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:01] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/656953 (https://phabricator.wikimedia.org/T259780) (owner: 10Filippo Giunchedi) [09:11:21] (03PS1) 10Ayounsi: Add tests [software/homer] - 10https://gerrit.wikimedia.org/r/657051 [09:14:13] (03CR) 10jerkins-bot: [V: 04-1] Add tests [software/homer] - 10https://gerrit.wikimedia.org/r/657051 (owner: 10Ayounsi) [09:14:55] (03PS2) 10Ayounsi: Add tests [software/homer] - 10https://gerrit.wikimedia.org/r/657051 [09:17:02] (03CR) 10Filippo Giunchedi: [C: 03+2] rsyslog: install rsyslog from component/rsyslog on Buster [puppet] - 10https://gerrit.wikimedia.org/r/656953 (https://phabricator.wikimedia.org/T259780) (owner: 10Filippo Giunchedi) [09:17:59] (03PS1) 10Marostegui: Revert "db1112: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/656921 [09:18:01] (03CR) 10Volans: [C: 03+1] "Looks good, just one nit (and waiting for CI)" (033 comments) [software/homer] - 10https://gerrit.wikimedia.org/r/657051 (owner: 10Ayounsi) [09:18:44] (03CR) 10Marostegui: [C: 03+2] Revert "db1112: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/656921 (owner: 10Marostegui) [09:19:24] 10SRE, 10observability, 10Patch-For-Review, 10User-fgiunchedi: rsyslog occasional segfault on centrallog hosts - https://phabricator.wikimedia.org/T259780 (10fgiunchedi) [09:20:10] 10SRE, 10observability, 10Patch-For-Review, 10User-fgiunchedi: rsyslog occasional segfault on centrallog hosts - https://phabricator.wikimedia.org/T259780 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Resolving since we're running a fixed rsyslog version now. 
[09:24:52] (03PS1) 10Filippo Giunchedi: role: remove rsyslog TLS remedy [puppet] - 10https://gerrit.wikimedia.org/r/657052 (https://phabricator.wikimedia.org/T199406) [09:26:32] (03PS2) 10Filippo Giunchedi: role: remove rsyslog TLS remedy [puppet] - 10https://gerrit.wikimedia.org/r/657052 (https://phabricator.wikimedia.org/T199406) [09:28:15] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27513/console" [puppet] - 10https://gerrit.wikimedia.org/r/657052 (https://phabricator.wikimedia.org/T199406) (owner: 10Filippo Giunchedi) [09:31:34] (03CR) 10Elukey: sre.druid.roll-restart-workers: move to class API (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/657049 (https://phabricator.wikimedia.org/T269925) (owner: 10Elukey) [09:31:36] (03PS2) 10Elukey: sre.druid.roll-restart-workers: move to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657049 (https://phabricator.wikimedia.org/T269925) [09:33:53] (03CR) 10Elukey: [C: 03+2] sre.druid.roll-restart-workers: move to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657049 (https://phabricator.wikimedia.org/T269925) (owner: 10Elukey) [09:36:38] (03Merged) 10jenkins-bot: sre.druid.roll-restart-workers: move to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657049 (https://phabricator.wikimedia.org/T269925) (owner: 10Elukey) [09:38:04] (03CR) 10Ladsgroup: "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/657040 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [09:40:45] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2016.codfw.wmnet [09:40:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:45] PROBLEM - Check systemd state on ms-be2019 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:49:09] PROBLEM - Check systemd state on ms-be2033 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:49:43] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:49:58] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2016.codfw.wmnet [09:49:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:08] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2017.codfw.wmnet [09:51:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:19] !log jmm@cumin2001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host ms-be2017.codfw.wmnet [09:51:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:51] (03PS2) 10Jcrespo: bacula: Migrate hiera() to lookup() and setting datatype [puppet] - 10https://gerrit.wikimedia.org/r/657040 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [09:58:42] (03PS2) 10Elukey: Initial configuration of the Hadoop backup cluster [puppet] - 10https://gerrit.wikimedia.org/r/635751 (https://phabricator.wikimedia.org/T260411) [10:00:04] (03CR) 10Jcrespo: [C: 03+2] "Thanks for the patch!" 
[puppet] - 10https://gerrit.wikimedia.org/r/657040 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [10:00:58] (03PS3) 10Elukey: Initial configuration of the Hadoop backup cluster [puppet] - 10https://gerrit.wikimedia.org/r/635751 (https://phabricator.wikimedia.org/T260411) [10:01:20] (03PS3) 10Ayounsi: Add tests [software/homer] - 10https://gerrit.wikimedia.org/r/657051 [10:02:35] (03CR) 10Elukey: "I have finally updated the patch with 18 new worker nodes, but two of them are still down (need to follow up with dcops, it is probably a " [puppet] - 10https://gerrit.wikimedia.org/r/635751 (https://phabricator.wikimedia.org/T260411) (owner: 10Elukey) [10:03:47] PROBLEM - Check systemd state on ms-be2019 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:08:30] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2017.codfw.wmnet [10:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:25] PROBLEM - Check systemd state on ms-be2030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:13:05] PROBLEM - Check systemd state on ms-be2019 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:17:50] (03PS4) 10Ayounsi: Add tests [software/homer] - 10https://gerrit.wikimedia.org/r/657051 [10:18:09] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2017.codfw.wmnet [10:18:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:21] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:18:55] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Allow WMDE intern Amrutha to access Superset - https://phabricator.wikimedia.org/T271725 (10jcrespo) [10:20:24] (03CR) 10Volans: [C: 03+1] "Nice! thanks a lot for adding the tests! Small nit inline, LGTM otherwise." (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/657051 (owner: 10Ayounsi) [10:22:31] 10SRE, 10DBA, 10Orchestrator, 10CAS-SSO, 10User-Kormat: orchestrator: Support SSO - https://phabricator.wikimedia.org/T266106 (10Kormat) [10:23:51] (03PS5) 10Ayounsi: Add tests [software/homer] - 10https://gerrit.wikimedia.org/r/657051 [10:24:52] (03PS1) 10Kormat: wikimedia.org: Add orchestrator CNAME [dns] - 10https://gerrit.wikimedia.org/r/657059 (https://phabricator.wikimedia.org/T266106) [10:25:02] (03CR) 10Hnowlan: [C: 03+2] similar-users: add helmfile configuration. [deployment-charts] - 10https://gerrit.wikimedia.org/r/655915 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan) [10:25:15] PROBLEM - Check systemd state on ms-be2021 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:26:47] (03Merged) 10jenkins-bot: similar-users: add helmfile configuration. [deployment-charts] - 10https://gerrit.wikimedia.org/r/655915 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan) [10:27:14] (03PS6) 10Ayounsi: Add tests [software/homer] - 10https://gerrit.wikimedia.org/r/657051 [10:27:53] (03CR) 10Volans: [C: 03+1] "Ship it!" 
[software/homer] - 10https://gerrit.wikimedia.org/r/657051 (owner: 10Ayounsi) [10:29:23] (03CR) 10Kormat: [C: 03+2] wikimedia.org: Add orchestrator CNAME [dns] - 10https://gerrit.wikimedia.org/r/657059 (https://phabricator.wikimedia.org/T266106) (owner: 10Kormat) [10:30:07] (03CR) 10Volans: "post-merge -1" (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/657059 (https://phabricator.wikimedia.org/T266106) (owner: 10Kormat) [10:31:14] volans: hurm. i put it in the same section as 'tendril', which is marked as "Other websites (NO wikis!)" [10:31:25] PROBLEM - Check systemd state on ms-be2033 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:32:20] kormat: oh, my bad I missed that last section header, anyway it looks to me it belongs more to the "; Service aliases (alphabetical order)" [10:32:31] YMMV ofc [10:32:38] I'm not authoritative on the dns repo ;) [10:33:14] all the others in that block are dyna [10:33:41] many, not all [10:33:59] the rest are either external or misplaced IMHO (tendril and puppet) [10:34:50] (03PS1) 10Muehlenhoff: Bump timeout for accessing RAID in smart_data_dump [puppet] - 10https://gerrit.wikimedia.org/r/657060 [10:35:47] (03PS2) 10Filippo Giunchedi: debian: add packaging [debs/phalerts] - 10https://gerrit.wikimedia.org/r/656866 [10:35:49] volans: you are not authoritative on the dns repo? Srsly? :D [10:36:54] I have only the delegation for the netbox-generated zonefiles elukey :D [10:38:08] kormat: please don't trust Riccardo in code reviews anymore :D [10:38:12] (03CR) 10Ayounsi: [C: 03+2] "Shipping." [software/homer] - 10https://gerrit.wikimedia.org/r/657051 (owner: 10Ayounsi) [10:38:24] elukey: [nit] you're assuming i ever did. 
;) [10:38:35] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, +Cole for visibility" [puppet] - 10https://gerrit.wikimedia.org/r/657060 (owner: 10Muehlenhoff) [10:39:03] RECOVERY - Check systemd state on ms-be2033 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:40:49] (03Merged) 10jenkins-bot: Add tests [software/homer] - 10https://gerrit.wikimedia.org/r/657051 (owner: 10Ayounsi) [10:41:40] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1020.eqiad.wmnet [10:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:49] 10SRE, 10LDAP-Access-Requests, 10SRE-Access-Requests: Requesting access to production shell and wmf ldap access for Razzi Abuissa - https://phabricator.wikimedia.org/T261443 (10Aklapper) @Dzahn: In my understanding this ticket wasn't a request for any //direct// #Gerrit-Privilege-Requests itself, but instead... [10:47:49] RECOVERY - Check systemd state on ms-be2030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:48:01] RECOVERY - Check systemd state on ms-be2021 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:49:01] (03PS1) 10Kormat: wikimedia.org: Move orchestrator/tendril to better* section [dns] - 10https://gerrit.wikimedia.org/r/657062 [10:49:36] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1020.eqiad.wmnet [10:49:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:49] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1021.eqiad.wmnet [10:50:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:25] 10SRE, 10vm-requests: eqiad: 1 of VMs requested for Cumin - https://phabricator.wikimedia.org/T272349 (10MoritzMuehlenhoff) [10:53:38] 
10SRE, 10vm-requests: eqiad: 1 of VMs requested for Cumin - https://phabricator.wikimedia.org/T272349 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [10:54:27] moritzm: can --^ wait a sec? I can send the code review to move the makevm cookbook to the class api, so we can test it [10:54:36] (03PS2) 10Muehlenhoff: Bump timeout for accessing RAID in smart_data_dump [puppet] - 10https://gerrit.wikimedia.org/r/657060 [10:54:42] elukey: definitely! [10:54:59] moritzm: all right, 10 mins and the code review should be coming [10:55:07] no hurry :-) [10:55:19] 10SRE, 10vm-requests: eqiad: 1 of VMs requested for Cumin - https://phabricator.wikimedia.org/T272349 (10Volans) LGTM, +1 [10:55:21] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 88395656 and 10 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [10:56:18] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [10:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:10] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1021.eqiad.wmnet [10:58:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:13] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 179960 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [10:59:23] PROBLEM - Check systemd state on ms-be2053 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:01:48] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1023.eqiad.wmnet [11:01:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:26] !log hnowlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [11:06:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:20] (03CR) 10Volans: [C: 03+1] "LGTM, thanks!" [dns] - 10https://gerrit.wikimedia.org/r/657062 (owner: 10Kormat) [11:08:40] (03CR) 10Kormat: [C: 03+2] wikimedia.org: Move orchestrator/tendril to better* section [dns] - 10https://gerrit.wikimedia.org/r/657062 (owner: 10Kormat) [11:09:33] PROBLEM - Check systemd state on ms-be2019 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:10:14] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1023.eqiad.wmnet [11:10:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:22] !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [11:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:32] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1024.eqiad.wmnet [11:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:52] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'canary' . [11:11:52] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'test' . [11:11:52] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' . 
[11:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:53] PROBLEM - Elevated latency for icinga checks in eqiad on alert1001 is CRITICAL: cluster=alerting instance=alert1001 job=icinga site=eqiad https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga [11:17:48] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1024.eqiad.wmnet [11:17:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:14] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1025.eqiad.wmnet [11:26:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:53] PROBLEM - Check systemd state on ms-be2018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:33:32] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1025.eqiad.wmnet [11:33:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:57] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'canary' . [11:33:57] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'test' . [11:33:57] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' . 
[11:34:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:00] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1026.eqiad.wmnet [11:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:14] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1026.eqiad.wmnet [11:41:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:24] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1028.eqiad.wmnet [11:41:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:15] RECOVERY - Check systemd state on ms-be2053 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:42:54] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/655518 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [11:47:55] PROBLEM - SSH on ms-be2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:49:19] PROBLEM - Check systemd state on ms-be2019 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:49:29] PROBLEM - Check systemd state on ms-be2041 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:54:11] !log installing remaining openssl 1.1 updates on stretch [11:54:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:05] (03CR) 10Jbond: "minor comment" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/655485 (owner: 10CDanis) [11:57:21] James_F: next [11:57:26] sorry [11:57:33] jouncebot: next [11:57:33] In 0 hour(s) and 2 minute(s): European mid-day backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210119T1200) [11:58:39] RECOVERY - Elevated latency for icinga checks in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga [11:59:29] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1028.eqiad.wmnet [11:59:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for European mid-day backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210119T1200). [12:00:04] Kizule, Majavah, jan_drewniak, and Daimona: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:10] here [12:00:10] \o/ [12:00:13] I can deploy today [12:00:23] o/ (I can do my own) [12:00:30] I'll start with you Majavah :) [12:00:48] (03PS2) 10Urbanecm: Revert "Switch fiwiki to their 500k temporary logo!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655281 (owner: 10Majavah) [12:00:53] (03CR) 10Urbanecm: [C: 03+2] Revert "Switch fiwiki to their 500k temporary logo!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/655281 (owner: 10Majavah) [12:01:31] (03PS1) 10Elukey: sre.ganeti.makevm: move to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657065 [12:01:43] moritzm: 10 mins was clearly too optimistic :D [12:01:45] (03Merged) 10jenkins-bot: Revert "Switch fiwiki to their 500k temporary logo!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655281 (owner: 10Majavah) [12:02:33] (03PS2) 10Elukey: sre.ganeti.makevm: move to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657065 [12:02:46] Majavah: please test it at mwdebug1001 [12:02:53] RECOVERY - Check systemd state on ms-be2019 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:03:19] RECOVERY - SSH on ms-be2019 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:03:39] Urbanecm: I see the temporary logo on some pages and the original on some, maybe caching? [12:03:57] Majavah: can you try it with ?debug=1 please? [12:04:20] Urbanecm: that shows the original logo [12:04:25] o/ [12:04:26] elukey: I'll review within 10 minutes :-) [12:04:30] ahahahhahah [12:04:40] so that means it works, I'll sync it then :) [12:04:51] moritzm: this time if I get less than 10 comments from Riccardo I'll consider it a win [12:05:14] jan_drewniak: ack, I'll ping you at the end of the window to sync it :) [12:05:21] at least nothing is broken :D [12:05:24] elukey: what python version does that run on because f' strings might be nicer to write than format(dc=dc if it's after 3.6 but that's personal preference on whether you want to write dc 3 times or not? [12:05:36] Urbanecm: sure thing [12:06:16] RhinosF1: good point! 
Maybe for a later change, I am moving the cookbook to the new API in this one (keeping things as close as possible) [12:06:36] but we should be able to use f-strings for sure [12:06:41] official buster repos have 3.7, stretch is on 3.5 [12:06:56] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 6a4cbe662655edaa4f6c36e69877766a6a48d828: Revert "Switch fiwiki to their 500k temporary logo!" (T270974) (duration: 00m 56s) [12:06:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:59] T270974: Temporarily change logo on the Finnish Wikipedia - https://phabricator.wikimedia.org/T270974 [12:07:01] Majavah: done :) [12:07:23] Urbanecm: thanks! maybe we can remove the logo files in a week? or something like that [12:07:28] Daimona: you're next :) [12:07:31] Majavah: sure, works for me [12:07:35] Sure [12:08:01] elukey: ah cool, if Majavah is right then it might have to wait until buster is across the world. It just seems strange to write dc 3 times. I mean {}_{}.format(dc, whatevertheothervariablewas) would also work, but I think that's less clean. [12:08:07] has merge conflict, rebasing [12:08:31] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1029.eqiad.wmnet [12:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:45] RhinosF1: yep sure, if you want you can send a code change after mine to follow up! 
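Editor's note: the formatting options weighed in the exchange above can be compared side by side. This is only a minimal sketch. The variable names and values (`dc`, `row`, `"eqiad"`, `"A"`) are hypothetical, and the actual strings in the makevm cookbook are not shown in this log.

```python
# Hypothetical values for illustration; not taken from the cookbook itself.
dc = "eqiad"
row = "A"

# Keyword-style str.format: "dc" is written three times
# (once in the placeholder, twice in the dc=dc argument).
with_keywords = "{dc}_{row}".format(dc=dc, row=row)

# Positional str.format: shorter, but the placeholders no longer
# say which value they hold.
positional = "{}_{}".format(dc, row)

# f-string: each name is written once. Requires Python 3.6 or later.
f_string = f"{dc}_{row}"

# All three spellings build the same string.
assert with_keywords == positional == f_string
print(f_string)
```

An f-string is a syntax error on Python 3.5, so a file containing one fails to compile there and a runtime version check cannot guard it. That is why the interpreter version on the host that runs the cookbooks matters in this discussion.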
[12:09:16] we run the cookbooks on a buster node (cumin1001/cumin2001) [12:10:10] (03PS6) 10Urbanecm: wgAbuseFilterAflFilterMigrationStage: Make WRITE_BOTH everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647116 (https://phabricator.wikimedia.org/T269712) (owner: 10Jforrester) [12:10:32] (03CR) 10Urbanecm: [C: 03+2] wgAbuseFilterAflFilterMigrationStage: Make WRITE_BOTH everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647116 (https://phabricator.wikimedia.org/T269712) (owner: 10Jforrester) [12:11:20] (03Merged) 10jenkins-bot: wgAbuseFilterAflFilterMigrationStage: Make WRITE_BOTH everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647116 (https://phabricator.wikimedia.org/T269712) (owner: 10Jforrester) [12:12:00] Daimona: please test at mwdebug1001 :) [12:12:23] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'test' . [12:12:23] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'canary' . [12:12:23] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' . [12:12:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:28] Yup, going [12:14:25] I've created an AbuseLog entry and it seems to be fine [12:14:37] Can't test much more than this, but checking logstash just in case [12:15:52] Daimona: I think you can see the log entry via toolforge replicas, if it's already there? 
Or I can do a prod query for you if you tell me the ID(s) [12:16:19] I also think we should test with a global filter just to be sure it works [12:16:20] (03PS1) 10Arturo Borrero Gonzalez: [DON'T MERGE] Allow Cloud VPS NAT address for $wmgAllowLabsAnonEdits wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657067 (https://phabricator.wikimedia.org/T209011) [12:16:48] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1029.eqiad.wmnet [12:16:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:22] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1030.eqiad.wmnet [12:17:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:45] Urbanecm: the columns weren't added to the replicas yet [12:17:52] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [cookbooks] - 10https://gerrit.wikimedia.org/r/657065 (owner: 10Elukey) [12:17:54] So yeah, a prod query would be appreciated [12:18:04] Daimona: what are the AFL IDs please? 🙂 [12:18:11] RECOVERY - Check systemd state on ms-be2018 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:18:30] (and wiki) [12:18:41] 1503108 on itwp [12:18:47] Still have to test with global filters [12:18:51] (03CR) 10Volans: [C: 03+1] "LGTM, minor nits inline. Thanks a lot for the refactor!" (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/657065 (owner: 10Elukey) [12:18:57] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "Marking as -1. Merging this change should be coupled with https://gerrit.wikimedia.org/r/c/operations/puppet/+/656883" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657067 (https://phabricator.wikimedia.org/T209011) (owner: 10Arturo Borrero Gonzalez) [12:19:00] I'm checking detailed logs on logstash (and opened an unrelated report in the meanwhile) [12:20:19] Logstash is clean [12:20:44] great! 
[12:21:03] (03PS3) 10Elukey: sre.ganeti.makevm: move to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657065 [12:21:11] Now hunting for a global filter that I can trigger [12:21:12] Daimona: https://phabricator.wikimedia.org/P13832 is the log entry, LGTM, but pasting it so you can check it too [12:21:25] (03CR) 10Elukey: sre.ganeti.makevm: move to class API (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/657065 (owner: 10Elukey) [12:21:33] Yup, looks perfect [12:22:31] (03CR) 10Volans: [C: 03+1] "Ship it!" [cookbooks] - 10https://gerrit.wikimedia.org/r/657065 (owner: 10Elukey) [12:22:51] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' . [12:22:51] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'test' . [12:22:51] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'canary' . [12:22:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:56] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1030.eqiad.wmnet [12:22:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:21] (03CR) 10Klausman: [C: 03+1] Initial configuration of the Hadoop backup cluster [puppet] - 10https://gerrit.wikimedia.org/r/635751 (https://phabricator.wikimedia.org/T260411) (owner: 10Elukey) [12:23:39] Daimona: I can edit https://meta.wikimedia.org/wiki/Special:AbuseFilter/159 for you if you want [12:24:40] Don't worry, I found one :d [12:24:41] * :D [12:24:49] elukey: will look over them later [12:25:00] okay :) [12:25:02] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1031.eqiad.wmnet [12:25:04] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:50] {{Done}}, https://meta.wikimedia.org/wiki/Special:AbuseLog/1184834 [12:26:39] So IDs are: 3014224 on frwiki and 1184834 on metawiki [12:27:28] looking [12:29:11] Daimona: posted at https://phabricator.wikimedia.org/P13833. [12:29:31] (03CR) 10Elukey: [C: 03+2] sre.ganeti.makevm: move to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657065 (owner: 10Elukey) [12:29:40] it worries me a bit that metawiki's has afl_global=0, on the other hand, that's in line with afl_filter, so maybe it's all right? [12:29:54] moritzm: I am merging, feel free to test it anytime, if it breaks I'll follow up :D [12:30:10] That's intentional [12:30:15] okay [12:30:20] so all good then? [12:30:28] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1031.eqiad.wmnet [12:30:29] Because the filter is local to meta. It should have afl_wiki set, but that's unrelated [12:30:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:39] I'd say yes, thank you! [12:30:46] okay, syncing! [12:30:53] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1033.eqiad.wmnet [12:30:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:43] Next up is running the script, and hope it won't be a chore [12:31:54] yup, I'll start it after the window [12:32:09] Daimona: want to do https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/647117 now (READ_NEW in beta), or should we wait with that for later? 
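Editor's note: the point above, that `afl_global=0` is "in line with afl_filter" for a filter local to meta, can be illustrated with a rough sketch of splitting the old column into the two new ones. The `"global-"` prefix for global filters in the legacy `afl_filter` value is an assumption, as are the function and type names. None of them is confirmed by this log.

```python
from typing import NamedTuple

# Assumed marker for global filters in the legacy afl_filter column.
GLOBAL_PREFIX = "global-"


class NewColumns(NamedTuple):
    """Hypothetical container for the two new column values."""
    afl_filter_id: int
    afl_global: int


def split_afl_filter(afl_filter: str) -> NewColumns:
    """Derive (afl_filter_id, afl_global) from a legacy afl_filter value."""
    if afl_filter.startswith(GLOBAL_PREFIX):
        # Global filter: strip the prefix and flag the row as global.
        return NewColumns(int(afl_filter[len(GLOBAL_PREFIX):]), 1)
    # Local filter: the value is just the numeric filter ID.
    return NewColumns(int(afl_filter), 0)


# A filter that is local to the wiki where it triggered keeps afl_global=0.
local_row = split_afl_filter("42")
global_row = split_afl_filter("global-7")
print(local_row, global_row)
```

Under these assumptions, a hit logged on metawiki by one of metawiki's own filters has no prefix in the old column, so the new row carries `afl_global=0`, which matches what was observed.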
[12:32:11] (03Merged) 10jenkins-bot: sre.ganeti.makevm: move to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657065 (owner: 10Elukey) [12:32:16] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 338c0f9fe32512266c3030f7c9b7f8804ed30432: wgAbuseFilterAflFilterMigrationStage: Make WRITE_BOTH everywhere (T269712) (duration: 00m 56s) [12:32:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:20] T269712: Migrate afl_filter to afl_filter_id and afl_global - https://phabricator.wikimedia.org/T269712 [12:32:31] Cool, thank you again [12:32:38] np [12:33:05] (there's a question few lines above for you Daimona ) [12:34:54] Oh, sorry [12:35:02] We can do it now I think [12:35:11] okay, merging it too :) [12:35:18] Beta is already using WRITE_BOTH since December I believe, so... [12:35:21] (03PS4) 10Urbanecm: wgAbuseFilterAflFilterMigrationStage: Make READ_NEW in Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647117 (https://phabricator.wikimedia.org/T269712) (owner: 10Jforrester) [12:35:45] (03CR) 10Urbanecm: [C: 03+2] "no-op for prod, per Daimona's OK in -operations" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647117 (https://phabricator.wikimedia.org/T269712) (owner: 10Jforrester) [12:36:36] (03Merged) 10jenkins-bot: wgAbuseFilterAflFilterMigrationStage: Make READ_NEW in Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647117 (https://phabricator.wikimedia.org/T269712) (owner: 10Jforrester) [12:36:57] and...done; will be deployed to beta automagically [12:37:01] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1033.eqiad.wmnet [12:37:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:17] jan_drewniak: all clear for you 🙂 [12:37:30] Urbanecm: thanks [12:37:42] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/656842 
(https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [12:38:29] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/656842 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [12:38:55] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1034.eqiad.wmnet [12:38:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:00] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:656842| Bumping portals to master (T128546)]] (duration: 00m 56s) [12:44:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:04] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [12:44:28] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' . [12:44:28] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'test' . [12:44:28] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'canary' . [12:44:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:57] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:656842| Bumping portals to master (T128546)]] (duration: 00m 56s) [12:45:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:31] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'production' . [12:45:31] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'staging' . 
[12:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:53] let there be cake! (belated cake, but still...) www.wikipedia.org [12:46:36] \o/ [12:46:53] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1034.eqiad.wmnet [12:46:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:10] (03PS3) 10RhinosF1: Use f string to avoid repeating words. [cookbooks] - 10https://gerrit.wikimedia.org/r/656923 [12:47:55] elukey: ^ works [12:48:32] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1035.eqiad.wmnet [12:48:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:24] RECOVERY - Check systemd state on ms-be2041 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:56:35] elukey: I ran script to find uses of format and I can replace them all if you and Volans think good idea later tonight. [12:58:18] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1035.eqiad.wmnet [12:58:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:06] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1036.eqiad.wmnet [12:59:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:25] (03PS1) 10Kormat: orchestrator: Put orchestrator behind IDP [puppet] - 10https://gerrit.wikimedia.org/r/657070 (https://phabricator.wikimedia.org/T266106) [13:00:04] Urbanecm and Amir1: Dear deployers, time to do the Create new wiki deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210119T1300). 
[13:00:12] o/ [13:00:12] \o/ [13:01:26] preparing the config [13:02:03] hellooo [13:02:13] here to watch the new wiki going live [13:07:48] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1036.eqiad.wmnet [13:07:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:03] hi Evrifaessa :) [13:09:31] (03PS1) 10Urbanecm: initial configuration for trwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657072 (https://phabricator.wikimedia.org/T271260) [13:09:56] (03PS2) 10Urbanecm: Initial configuration for trwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657072 (https://phabricator.wikimedia.org/T271260) [13:12:09] (03CR) 10Urbanecm: [C: 03+2] Initial configuration for trwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657072 (https://phabricator.wikimedia.org/T271260) (owner: 10Urbanecm) [13:12:16] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.796 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [13:13:02] (03Merged) 10jenkins-bot: Initial configuration for trwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657072 (https://phabricator.wikimedia.org/T271260) (owner: 10Urbanecm) [13:13:04] (03CR) 10Evrifaessa: [C: 03+1] Initial configuration for trwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657072 (https://phabricator.wikimedia.org/T271260) (owner: 10Urbanecm) [13:13:12] (03PS2) 10Kormat: orchestrator: Put orchestrator behind IDP [puppet] - 10https://gerrit.wikimedia.org/r/657070 (https://phabricator.wikimedia.org/T266106) [13:13:52] that alert...doesn't look good. does it mean logstash no longer indexes MW events (like exceptions) properly? If that's true, that should be fixed before i start to sync it out? 
[13:13:59] Amir1: ^^ [13:14:27] it used to happen before too [13:14:34] I saw it, I think it'll recover [13:14:51] pinging observability team if it doesn't recover [13:15:06] ack. Should I continue Amir1 ? [13:15:27] wait for five min first, to see if it recovvers [13:15:32] ack [13:16:14] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1037.eqiad.wmnet [13:16:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:14] PROBLEM - Check systemd state on ms-be2020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:17:48] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 11.05 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [13:18:10] doesn't sound like a recovery [13:18:15] godog: is it intentional? ^ [13:18:32] time to wake up SREs mwhaha [13:18:35] (03PS3) 10Kormat: orchestrator: Put orchestrator behind IDP [puppet] - 10https://gerrit.wikimedia.org/r/657070 (https://phabricator.wikimedia.org/T266106) [13:18:39] (jk I swear) [13:19:03] pinging o11y team for now [13:19:22] (03CR) 10Kormat: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27518/console" [puppet] - 10https://gerrit.wikimedia.org/r/657070 (https://phabricator.wikimedia.org/T266106) (owner: 10Kormat) [13:20:24] (03PS2) 10DCausse: Add import_commons_mediainfo_dumps to role::analytics_cluster::launcher [puppet] - 10https://gerrit.wikimedia.org/r/642411 [13:21:10] PROBLEM - SSH on mw1278.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:21:36] Amir1: I'll take a look, but no not intentional [13:22:21] Thanks. 
Let us know when we can continue deployment [13:22:28] thanks :) [13:22:38] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1037.eqiad.wmnet [13:22:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:42] RECOVERY - Check systemd state on ms-be2020 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:23:16] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 10.61 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [13:23:56] it's a really persistent alarm [13:24:32] (03PS4) 10Kormat: orchestrator: Put orchestrator behind IDP [puppet] - 10https://gerrit.wikimedia.org/r/657070 (https://phabricator.wikimedia.org/T266106) [13:24:38] yeah, if you want to go ahead and finish the deployment I think that might be ok, while I investigate [13:24:45] Amir1 Urbanecm ^ [13:24:55] cool [13:25:00] Thanks [13:25:24] (03CR) 10Kormat: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27519/console" [puppet] - 10https://gerrit.wikimedia.org/r/657070 (https://phabricator.wikimedia.org/T266106) (owner: 10Kormat) [13:25:47] thanks [13:26:47] (03CR) 10Muehlenhoff: orchestrator: Put orchestrator behind IDP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/657070 (https://phabricator.wikimedia.org/T266106) (owner: 10Kormat) [13:27:39] (03PS5) 10Kormat: orchestrator: Put orchestrator behind IDP [puppet] - 10https://gerrit.wikimedia.org/r/657070 (https://phabricator.wikimedia.org/T266106) [13:28:56] hohoho, works :) https://usercontent.irccloud-cdn.com/file/XqEEl9Wc/image.png [13:29:29] \o/ [13:29:43] syncing it out [13:30:36] !log urbanecm@deploy1001 Synchronized wmf-config/db-eqiad.php: Creating trwikivoyage 
(T271260) (duration: 00m 56s) [13:30:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:40] T271260: Create Wikivoyage Turkish - https://phabricator.wikimedia.org/T271260 [13:30:56] (03CR) 10Kormat: "@muehlenhoff: I'm flying mostly-blind here, so any feedback you have is very welcomed :)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/657070 (https://phabricator.wikimedia.org/T266106) (owner: 10Kormat) [13:31:07] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1038.eqiad.wmnet [13:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:37] !log urbanecm@deploy1001 Synchronized wmf-config/db-codfw.php: Creating trwikivoyage (T271260) (duration: 00m 55s) [13:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:35] !log urbanecm@deploy1001 Synchronized dblists: Creating trwikivoyage (T271260) (duration: 00m 55s) [13:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:05] !log urbanecm@deploy1001 rebuilt and synchronized wikiversions files: Creating trwikivoyage (T271260) [13:34:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:27] I'll poke at logstash and dump the offending messages [13:35:07] !log urbanecm@deploy1001 Synchronized static/images/project-logos/: Creating trwikivoyage (T271260) (duration: 00m 55s) [13:35:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:22] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Creating trwikivoyage (T271260) (duration: 00m 55s) [13:36:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:27] T271260: Create Wikivoyage Turkish - https://phabricator.wikimedia.org/T271260 [13:36:34] and...just iw cache now [13:36:48] Evrifaessa: the wiki is up, but needscontent to be imported first :). 
someone'll care about it [13:37:20] !log urbanecm@deploy1001 update-interwiki-cache aborted: Update interwiki cache (duration: 00m 05s) [13:37:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:28] (03CR) 10Kormat: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27520/console" [puppet] - 10https://gerrit.wikimedia.org/r/657070 (https://phabricator.wikimedia.org/T266106) (owner: 10Kormat) [13:37:34] (03PS1) 10Urbanecm: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657077 [13:37:36] (03CR) 10Urbanecm: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657077 (owner: 10Urbanecm) [13:38:04] (03CR) 10Muehlenhoff: "One comment inline, otherwise looks good to me" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/657070 (https://phabricator.wikimedia.org/T266106) (owner: 10Kormat) [13:38:13] !log bounce logstash on logstash1025 to debug unindexable logs [13:38:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:20] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657077 (owner: 10Urbanecm) [13:38:47] (03PS6) 10Kormat: orchestrator: Put orchestrator behind IDP [puppet] - 10https://gerrit.wikimedia.org/r/657070 (https://phabricator.wikimedia.org/T266106) [13:39:19] !log urbanecm@deploy1001 Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 01m 53s) [13:39:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:31] (03CR) 10Kormat: orchestrator: Put orchestrator behind IDP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/657070 (https://phabricator.wikimedia.org/T266106) (owner: 10Kormat) [13:39:34] !log trwikivoyage is created [13:39:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:43] godog: ftr, I'm done with deployments 
now [13:39:49] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1038.eqiad.wmnet [13:39:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:58] Urbanecm: ack, thanks [13:40:00] \o/ [13:40:53] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1039.eqiad.wmnet [13:40:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:06] (03CR) 10Kormat: "@muehlenhoff: moving you to reviewer :)" [puppet] - 10https://gerrit.wikimedia.org/r/657070 (https://phabricator.wikimedia.org/T266106) (owner: 10Kormat) [13:43:37] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/657070 (https://phabricator.wikimedia.org/T266106) (owner: 10Kormat) [13:43:55] (03CR) 10Kormat: [C: 03+2] orchestrator: Put orchestrator behind IDP [puppet] - 10https://gerrit.wikimedia.org/r/657070 (https://phabricator.wikimedia.org/T266106) (owner: 10Kormat) [13:45:21] Urbanecm: When will someone import stuff? [13:45:28] PROBLEM - Check systemd state on ms-be2019 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:48:02] Evrifaessa: it's usually people from langcom doing the work [13:48:28] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 11.44 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [13:49:16] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1039.eqiad.wmnet [13:49:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:46] !log start of foreachwikiindblist wikidataclient extensions/Wikibase/lib/maintenance/populateSitesTable.php --force-protocol https (T271264) [13:49:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:51] T271264: Add Wikidata support for trwikivoyage - https://phabricator.wikimedia.org/T271264 [13:59:05] (03PS1) 10Kormat: orchestrator: Enable apache2 mod_ssl [puppet] - 10https://gerrit.wikimedia.org/r/657082 (https://phabricator.wikimedia.org/T266106) [14:00:52] (03CR) 10Kormat: [C: 03+2] orchestrator: Enable apache2 mod_ssl [puppet] - 10https://gerrit.wikimedia.org/r/657082 (https://phabricator.wikimedia.org/T266106) (owner: 10Kormat) [14:02:08] kormat: <3 [14:03:08] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1040.eqiad.wmnet [14:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:52] (03PS1) 10Kormat: orchestrator: Add sslcert:dhparam [puppet] - 10https://gerrit.wikimedia.org/r/657084 (https://phabricator.wikimedia.org/T266106) [14:07:10] (03CR) 10Kormat: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27521/console" [puppet] - 10https://gerrit.wikimedia.org/r/657084 (https://phabricator.wikimedia.org/T266106) (owner: 10Kormat) [14:07:21] 10SRE, 10DBA, 10Orchestrator, 
10CAS-SSO, and 2 others: orchestrator: Support SSO - https://phabricator.wikimedia.org/T266106 (10Kormat) [14:07:25] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 11.33 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [14:07:28] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/657084 (https://phabricator.wikimedia.org/T266106) (owner: 10Kormat) [14:08:00] (03CR) 10Kormat: [V: 03+1 C: 03+2] orchestrator: Add sslcert:dhparam [puppet] - 10https://gerrit.wikimedia.org/r/657084 (https://phabricator.wikimedia.org/T266106) (owner: 10Kormat) [14:08:21] !log end of foreachwikiindblist wikidataclient extensions/Wikibase/lib/maintenance/populateSitesTable.php --force-protocol https (T271264) [14:08:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:26] T271264: Add Wikidata support for trwikivoyage - https://phabricator.wikimedia.org/T271264 [14:09:30] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1040.eqiad.wmnet [14:09:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:00] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1041.eqiad.wmnet [14:10:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:14] 10SRE, 10DBA, 10Orchestrator, 10CAS-SSO, and 2 others: orchestrator: Support SSO - https://phabricator.wikimedia.org/T266106 (10Kormat) [14:15:44] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1041.eqiad.wmnet [14:15:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:32] !log Sanitize trwikivoyage on db2094:3315, db1124:3315, db1154:3315 T271261 [14:19:35] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:36] T271261: Prepare and check storage layer for trwikivoyage - https://phabricator.wikimedia.org/T271261 [14:22:49] RECOVERY - SSH on mw1278.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:24:18] (03PS1) 10Hnowlan: similar-users: Add TLS configuration configmap [deployment-charts] - 10https://gerrit.wikimedia.org/r/657091 (https://phabricator.wikimedia.org/T268837) [14:26:00] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1042.eqiad.wmnet [14:26:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:55] RECOVERY - Check systemd state on ms-be2019 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:57] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1042.eqiad.wmnet [14:30:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:09] 10SRE, 10SRE-Access-Requests: Requesting access to analytics for lilients - https://phabricator.wikimedia.org/T272264 (10Ottomata) Hi @lilients_WMDE Am happy to approve this, but perhaps all you really need is `nda` LDAP access? If you have this, you can access some data in Hive via [[ https://wikitech.wikime... 
[14:37:25] 10SRE, 10SRE-Access-Requests: Requesting access to analytics for lilients - https://phabricator.wikimedia.org/T272264 (10jcrespo) a:05Ottomata→03lilients_WMDE [14:49:55] (03CR) 10jerkins-bot: [V: 04-1] [nitpick] don't assign variables only used to return [cookbooks] - 10https://gerrit.wikimedia.org/r/657106 (owner: 10RhinosF1) [14:50:40] (03CR) 10RhinosF1: [nitpick] don't assign variables only used to return (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/657106 (owner: 10RhinosF1) [14:51:19] (03PS3) 10RhinosF1: [nitpick] don't assign variables only used to return [cookbooks] - 10https://gerrit.wikimedia.org/r/657106 [14:51:29] (03PS1) 10Jbond: sre: convert the generic reboot functions to the cookbook class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 [14:51:31] (03PS1) 10Hnowlan: services: similar-users discovery and LVS component [puppet] - 10https://gerrit.wikimedia.org/r/657101 (https://phabricator.wikimedia.org/T268837) [14:52:28] (03PS4) 10RhinosF1: [nitpick] don't assign variables only used to return [cookbooks] - 10https://gerrit.wikimedia.org/r/657106 [14:53:23] (03CR) 10jerkins-bot: [V: 04-1] sre: convert the generic reboot functions to the cookbook class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (owner: 10Jbond) [14:54:21] (03CR) 10jerkins-bot: [V: 04-1] [nitpick] don't assign variables only used to return [cookbooks] - 10https://gerrit.wikimedia.org/r/657106 (owner: 10RhinosF1) [14:54:34] (03PS5) 10RhinosF1: [nitpick] don't assign variables only used to return [cookbooks] - 10https://gerrit.wikimedia.org/r/657106 [14:54:59] (03PS6) 10RhinosF1: [nitpick] don't assign variables only used to return [cookbooks] - 10https://gerrit.wikimedia.org/r/657106 [14:55:25] (03PS7) 10RhinosF1: [nitpick] don't assign variables only used to return [cookbooks] - 10https://gerrit.wikimedia.org/r/657106 [14:55:47] PROBLEM - Elevated latency for icinga checks in eqiad on alert1001 is CRITICAL: cluster=alerting 
instance=alert1001 job=icinga site=eqiad https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga [14:56:37] volans: I did that for another nitpick I've developed. The only one I don't understand is cf.py but I don't understand that because I'm not sure it's ever got the end. [14:56:56] 10SRE, 10ops-eqiad, 10Traffic: lvs1015 interface errors - https://phabricator.wikimedia.org/T272258 (10BBlack) @Cmjohnson - let me know when you're ready to deal with this, and we'll stop service on the node and fail things over to lvs1016. [14:57:52] (03PS8) 10RhinosF1: [nitpick] don't assign variables only used to return [cookbooks] - 10https://gerrit.wikimedia.org/r/657106 [14:57:54] (03CR) 10jerkins-bot: [V: 04-1] [nitpick] don't assign variables only used to return [cookbooks] - 10https://gerrit.wikimedia.org/r/657106 (owner: 10RhinosF1) [14:58:13] (03PS1) 10Jgreen: A/PTR records for frdata-(eqiad|codfw).wm.o, adjust legacy hostname to frdata1001 [dns] - 10https://gerrit.wikimedia.org/r/657103 (https://phabricator.wikimedia.org/T272066) [14:58:44] RhinosF1: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/657106/7/cookbooks/sre/network/cf.py#b76 is changing the behaviour, why do you want to change it [14:59:22] same for the pdus [14:59:26] volans: I don't, see my review comment. 
I don't quite get how that works because unless I'm blind it wouldn't append [14:59:27] and I din't check the others [14:59:57] Prepare-upgrade confuses me [15:00:05] None of others should change behaviour [15:00:16] (03CR) 10Jgreen: [C: 03+2] A/PTR records for frdata-(eqiad|codfw).wm.o, adjust legacy hostname to frdata1001 [dns] - 10https://gerrit.wikimedia.org/r/657103 (https://phabricator.wikimedia.org/T272066) (owner: 10Jgreen) [15:00:35] RhinosF1: you return inside a for loop [15:00:43] instead of changing the return code to return after the loop is over [15:01:17] 10SRE, 10SRE-Access-Requests: Requesting access to analytics for lilients - https://phabricator.wikimedia.org/T272264 (10lilients_WMDE) Hi @jcrespo and @Ottomata, I have LDAP access and also to the nice tool you linked. Thanks! I guess for now that is sufficient. According to my colleagues I will need the shel... [15:01:25] (03PS13) 10MSantos: start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) [15:01:38] volans: before it's overwritten result_json on each iteration so it'd always get whatever the last slot is. [15:01:48] I'm trying to work out if that's what you intended [15:01:54] I'm talking about the link above, the return_code ones [15:01:58] (03PS2) 10Jbond: sre: convert the generic reboot functions to the cookbook class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 [15:01:59] in cf and pdus for example [15:02:48] volans: there's no behaviour change. It just stops storing it in a variable as there's no need. You can simply return 1 or 0. Nothing else uses the variable. 
[15:02:57] (03CR) 10jerkins-bot: [V: 04-1] start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [15:03:22] (03CR) 10Hnowlan: [C: 03+2] similar-users: Add TLS configuration configmap [deployment-charts] - 10https://gerrit.wikimedia.org/r/657091 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan) [15:03:51] !log authdns-update DNS adjustments for frdata-(eqiad|codfw) [15:03:52] (03PS3) 10Jbond: sre: convert the generic reboot functions to the cookbook class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 [15:03:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:37] RhinosF1: pick https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/657106/2/cookbooks/sre/network/cf.py#50 as example. There is a for loop, for each iteration something gets done, and we store the return code 1 if anything fails, but we keep operating on the other iterations [15:04:42] if you return it will stop there [15:04:56] and miss to operate on the other items [15:05:12] (03Merged) 10jenkins-bot: similar-users: Add TLS configuration configmap [deployment-charts] - 10https://gerrit.wikimedia.org/r/657091 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan) [15:05:16] volans: ah, yeah. Hang on. I'll fix that. [15:05:28] the existing code it's correct [15:05:32] what do you want to fix? [15:06:04] RhinosF1: we really appreciate you taking a pass on the code, but I'd suggest to study it a bit more before sending code reviews :) [15:06:33] (03CR) 10jerkins-bot: [V: 04-1] sre: convert the generic reboot functions to the cookbook class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (owner: 10Jbond) [15:06:39] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' . 
[15:06:39] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'test' . [15:06:39] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'canary' . [15:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:28] (03PS9) 10RhinosF1: [nitpick] don't assign variables only used to return [cookbooks] - 10https://gerrit.wikimedia.org/r/657106 [15:08:57] (03PS4) 10Jbond: sre: convert the generic reboot functions to the cookbook class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 [15:10:13] 10SRE, 10SRE-Access-Requests: Requesting access to analytics for lilients - https://phabricator.wikimedia.org/T272264 (10Ottomata) If you'll need it relatively soon (within the next few months), then let's go ahead and get it set up since you've already started the process. If you'll need it later than that,... [15:10:20] volans: that works round it. I still need to find a use of the prepare-upgrade part to get my head round what it's doing. [15:10:25] (03CR) 10jerkins-bot: [V: 04-1] [nitpick] don't assign variables only used to return [cookbooks] - 10https://gerrit.wikimedia.org/r/657106 (owner: 10RhinosF1) [15:11:10] RhinosF1: also please check on wikitech the guidelines to format code reviews and docstrings :) [15:11:11] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:11:15] (03CR) 10jerkins-bot: [V: 04-1] sre: convert the generic reboot functions to the cookbook class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (owner: 10Jbond) [15:11:20] elukey: I'll go over them. 
[15:11:48] RhinosF1: the existing code in cf.py has nothing wrong to correct with regard to variable assignments and returns [15:14:11] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:14:13] (03PS1) 10Hnowlan: similar-users: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/657126 (https://phabricator.wikimedia.org/T268837) [15:15:44] !log Run `foreachwikiindblist closed extensions/AbuseFilter/maintenance/MigrateAflFilter.php` (T269713) [15:15:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:48] T269713: Run the MigrateAflFilter script for AbuseFilter - https://phabricator.wikimedia.org/T269713 [15:16:03] (03CR) 10Hnowlan: [C: 03+2] similar-users: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/657126 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan) [15:17:07] volans: that's from a list of popular extensions to flake8 and its advice is not to assign a variable only to return it but the more I think the more it makes less sense in this use case. [15:18:10] 10SRE, 10SRE-Access-Requests: Requesting access to analytics for lilients - https://phabricator.wikimedia.org/T272264 (10lilients_WMDE) Ok, then go ahead. 
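Note on the exchange above: the point volans makes at [15:04:37] (store the return code and keep iterating, rather than returning inside the loop) can be sketched roughly as follows. This is a minimal illustration only, not the actual cf.py code; the function names, the `action` callable and the sample items are all hypothetical.

```python
def run_all(items, action):
    """Apply action to every item; return 1 if any item failed, else 0."""
    return_code = 0
    for item in items:
        try:
            action(item)
        except Exception:
            # Remember the failure, but keep operating on the other items.
            return_code = 1
    return return_code


def run_until_first_failure(items, action):
    """Variant with an early return: stops at the first failing item."""
    for item in items:
        try:
            action(item)
        except Exception:
            # Returning here skips every remaining item.
            return 1
    return 0


def make_action(processed):
    """Build a hypothetical action that records items and fails on 'bad'."""
    def action(item):
        if item == "bad":
            raise ValueError(item)
        processed.append(item)
    return action


seen_all, seen_early = [], []
rc_all = run_all(["a", "bad", "c"], make_action(seen_all))
rc_early = run_until_first_failure(["a", "bad", "c"], make_action(seen_early))
```

Both variants report the failure with return code 1, but only the first still processes "c". That is why the variable looks redundant to a linter rule about "assigning only to return", yet removing it changes behaviour.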
[15:18:19] (03Merged) 10jenkins-bot: similar-users: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/657126 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan) [15:18:52] (03PS14) 10MSantos: start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) [15:20:24] (03CR) 10jerkins-bot: [V: 04-1] start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [15:21:26] (03Abandoned) 10RhinosF1: [nitpick] don't assign variables only used to return [cookbooks] - 10https://gerrit.wikimedia.org/r/657106 (owner: 10RhinosF1) [15:22:34] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1043.eqiad.wmnet [15:22:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:04] (03CR) 10Legoktm: [C: 03+1] docker_registry_ha: Add "Vary: Accept" to response [puppet] - 10https://gerrit.wikimedia.org/r/650153 (https://phabricator.wikimedia.org/T256762) (owner: 10JMeybohm) [15:23:36] (03PS5) 10Jbond: sre: convert the generic reboot functions to the cookbook class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 [15:23:49] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' . [15:23:49] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'test' . [15:23:49] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'canary' . 
[15:23:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:17] !log jmm@cumin2001 START - Cookbook sre.ganeti.makevm for new host cuminunpriv1001.eqiad.wmnet [15:26:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:36] (03CR) 10Bstorm: "This looks good. I think this is just a matter of when we want to hold our breath and try it, right?" [puppet] - 10https://gerrit.wikimedia.org/r/656883 (https://phabricator.wikimedia.org/T209011) (owner: 10Arturo Borrero Gonzalez) [15:28:47] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1043.eqiad.wmnet [15:28:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:07] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1044.eqiad.wmnet [15:29:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:22] (03PS1) 10Elukey: Set Bigtop 1.5 for the Hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/657127 (https://phabricator.wikimedia.org/T269919) [15:29:31] 10SRE, 10SRE-Access-Requests: Requesting access to analytics for lilients - https://phabricator.wikimedia.org/T272264 (10Ottomata) Ok, then! Shell access approved. [15:30:38] 10SRE, 10SRE-Access-Requests: Requesting access to analytics for lilients - https://phabricator.wikimedia.org/T272264 (10jcrespo) > I will need the shell access later on, but no rush Based on the reasons for request, and analytics overwatch, SREs will proceed with cluster access rather than just web access. 
[15:30:51] 10SRE, 10SRE-Access-Requests: Requesting access to analytics for lilients - https://phabricator.wikimedia.org/T272264 (10jcrespo) a:05lilients_WMDE→03jcrespo [15:31:03] (03CR) 10Elukey: [C: 03+2] Set Bigtop 1.5 for the Hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/657127 (https://phabricator.wikimedia.org/T269919) (owner: 10Elukey) [15:31:10] 10SRE, 10SRE-Access-Requests: Requesting access to analytics for lilients - https://phabricator.wikimedia.org/T272264 (10jcrespo) [15:31:41] RhinosF1: fwiw that flake8 plugin is buggy as it's reporting false positives and doesn't seem either popular nor much developed [15:33:00] (03CR) 10Legoktm: docker_registry_ha: Add a script to generate a static HTML homepage (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/654725 (https://phabricator.wikimedia.org/T179696) (owner: 10Legoktm) [15:33:13] (03PS8) 10Legoktm: docker_registry_ha: Add a script to generate a static HTML homepage [puppet] - 10https://gerrit.wikimedia.org/r/654725 (https://phabricator.wikimedia.org/T179696) [15:33:15] (03PS4) 10Legoktm: docker_registry_ha: Have nginx serve /srv/homepage for / [puppet] - 10https://gerrit.wikimedia.org/r/655792 (https://phabricator.wikimedia.org/T179696) [15:33:20] 10SRE, 10vm-requests: esams/ulsfo/eqsin: 1 VM requested for bastions - https://phabricator.wikimedia.org/T271404 (10MoritzMuehlenhoff) 05Open→03Resolved This is done. 
[15:33:28] *neither [15:37:29] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1044.eqiad.wmnet [15:37:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:52] (03CR) 10Giuseppe Lavagetto: [C: 03+1] docker_registry_ha: Have nginx serve /srv/homepage for / [puppet] - 10https://gerrit.wikimedia.org/r/655792 (https://phabricator.wikimedia.org/T179696) (owner: 10Legoktm) [15:38:13] (03PS1) 10Hnowlan: similar-users: change entrypoint to reflect code changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/657128 (https://phabricator.wikimedia.org/T268837) [15:40:34] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1045.eqiad.wmnet [15:40:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:15] (03PS7) 10Jbond: cookbook sre.apt.reboot: [cookbooks] - 10https://gerrit.wikimedia.org/r/656139 [15:41:39] (03CR) 10Gmodena: [C: 03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/657128 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan) [15:41:43] (03CR) 10BryanDavis: [C: 04-1] "-1 to indicate that there are a some things we need to update prior to committing this change. 
The main one I can think of right now is th" [puppet] - 10https://gerrit.wikimedia.org/r/656883 (https://phabricator.wikimedia.org/T209011) (owner: 10Arturo Borrero Gonzalez) [15:42:06] (03CR) 10Hnowlan: [C: 03+2] similar-users: change entrypoint to reflect code changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/657128 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan) [15:43:00] !log jmm@cumin2001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host cuminunpriv1001.eqiad.wmnet [15:43:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:26] (03Merged) 10jenkins-bot: similar-users: change entrypoint to reflect code changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/657128 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan) [15:45:39] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'test' . [15:45:39] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' . [15:45:39] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'canary' . 
[15:45:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:46] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1045.eqiad.wmnet [15:46:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:36] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1046.eqiad.wmnet [15:47:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 25%: After moving wikireplicas to another host', diff saved to https://phabricator.wikimedia.org/P13835 and previous config saved to /var/cache/conftool/dbconfig/20210119-155127-root.json [15:51:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:33] !log elukey@cumin1001 START - Cookbook sre.hadoop.stop-cluster for Hadoop test cluster: Stop the Hadoop cluster before maintenance. 
- elukey@cumin1001 [15:51:33] (03PS1) 10Muehlenhoff: Add cuminunpriv1001 [puppet] - 10https://gerrit.wikimedia.org/r/657129 [15:51:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:21] (03CR) 10Jbond: cookbook sre.apt.reboot: (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/656139 (owner: 10Jbond) [15:53:23] (03CR) 10Alexandros Kosiaris: [C: 03+1] docker_registry_ha: Have nginx serve /srv/homepage for / [puppet] - 10https://gerrit.wikimedia.org/r/655792 (https://phabricator.wikimedia.org/T179696) (owner: 10Legoktm) [15:54:22] (03CR) 10Volans: [C: 03+1] "LGTM for the python bits" [puppet] - 10https://gerrit.wikimedia.org/r/654725 (https://phabricator.wikimedia.org/T179696) (owner: 10Legoktm) [15:54:40] (03PS1) 10Hnowlan: similar-users: fix volume mount ordering [deployment-charts] - 10https://gerrit.wikimedia.org/r/657130 (https://phabricator.wikimedia.org/T268837) [15:55:32] 10SRE, 10LDAP-Access-Requests, 10SRE-Access-Requests: Requesting access to production shell and wmf ldap access for Razzi Abuissa - https://phabricator.wikimedia.org/T261443 (10Dzahn) "Gerrit Analytics Group" is different from membership in wmf LDAP though. It's a custom Gerrit group. [15:56:17] (03PS8) 10Jbond: cookbook sre.apt.reboot: [cookbooks] - 10https://gerrit.wikimedia.org/r/656139 [15:56:51] (03PS9) 10Jbond: cookbook sre.apt.reboot: [cookbooks] - 10https://gerrit.wikimedia.org/r/656139 [15:56:59] jouncebot now [15:56:59] No deployments scheduled for the next 1 hour(s) and 3 minute(s) [15:57:18] (03PS1) 10Lucas Werkmeister (WMDE): Update unitConversionConfig.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657131 (https://phabricator.wikimedia.org/T270252) [15:58:08] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.stop-cluster (exit_code=0) for Hadoop test cluster: Stop the Hadoop cluster before maintenance. 
- elukey@cumin1001 [15:58:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:27] 10SRE, 10Epic, 10cloud-services-team (Kanban): CloudVPS: network architecture - https://phabricator.wikimedia.org/T209460 (10aborrero) [15:58:31] 10SRE, 10Cloud-VPS, 10Epic, 10IPv6, 10cloud-services-team (Kanban): Enable IPv6 on CloudVPS - https://phabricator.wikimedia.org/T37947 (10aborrero) [15:58:39] !log elukey@cumin1001 START - Cookbook sre.hadoop.change-distro-from-cdh for Hadoop test cluster: Change Hadoop distribution - elukey@cumin1001 [15:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:48] (03CR) 10jerkins-bot: [V: 04-1] cookbook sre.apt.reboot: [cookbooks] - 10https://gerrit.wikimedia.org/r/656139 (owner: 10Jbond) [15:58:53] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-2] "don’t deploy yet, see T267644#6758238" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657131 (https://phabricator.wikimedia.org/T270252) (owner: 10Lucas Werkmeister (WMDE)) [16:01:57] (03PS10) 10Jbond: cookbook sre.apt.reboot: [cookbooks] - 10https://gerrit.wikimedia.org/r/656139 [16:04:01] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Couple of comments inline. Overall /me likes." 
(035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/654725 (https://phabricator.wikimedia.org/T179696) (owner: 10Legoktm) [16:06:02] 10SRE, 10Epic, 10cloud-services-team (Kanban): CloudVPS: network architecture - https://phabricator.wikimedia.org/T209460 (10aborrero) [16:06:21] (03CR) 10Alexandros Kosiaris: [C: 03+1] docker_registry_ha: Add "Vary: Accept" to response [puppet] - 10https://gerrit.wikimedia.org/r/650153 (https://phabricator.wikimedia.org/T256762) (owner: 10JMeybohm) [16:06:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 50%: After moving wikireplicas to another host', diff saved to https://phabricator.wikimedia.org/P13836 and previous config saved to /var/cache/conftool/dbconfig/20210119-160630-root.json [16:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:47] (03CR) 10Cwhite: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/657060 (owner: 10Muehlenhoff) [16:07:03] 10SRE, 10Epic, 10cloud-services-team (Kanban): CloudVPS: network architecture - https://phabricator.wikimedia.org/T209460 (10aborrero) [16:07:24] !log powercycling ms-be1046, stuck during boot [16:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:00] (03CR) 10Cwhite: [C: 03+1] role: remove rsyslog TLS remedy [puppet] - 10https://gerrit.wikimedia.org/r/657052 (https://phabricator.wikimedia.org/T199406) (owner: 10Filippo Giunchedi) [16:08:31] PROBLEM - Host ms-be1046 is DOWN: PING CRITICAL - Packet loss = 100% [16:09:36] 10SRE, 10Epic, 10cloud-services-team (Kanban): CloudVPS: network architecture - https://phabricator.wikimedia.org/T209460 (10aborrero) [16:10:19] 10SRE, 10ops-eqiad: ms-be1046 stuck on reboot - https://phabricator.wikimedia.org/T272396 (10MoritzMuehlenhoff) [16:12:30] (03PS1) 10Kormat: orchestrator: Use CAS headers for user identification [puppet] - 10https://gerrit.wikimedia.org/r/657135 (https://phabricator.wikimedia.org/T266106) [16:12:35] (03PS2) 
10Hnowlan: similar-users: fix volume mount ordering [deployment-charts] - 10https://gerrit.wikimedia.org/r/657130 (https://phabricator.wikimedia.org/T268837) [16:13:28] (03CR) 10Kormat: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27522/console" [puppet] - 10https://gerrit.wikimedia.org/r/657135 (https://phabricator.wikimedia.org/T266106) (owner: 10Kormat) [16:14:29] (03CR) 10Marostegui: [C: 03+1] "what a privilege" [puppet] - 10https://gerrit.wikimedia.org/r/657135 (https://phabricator.wikimedia.org/T266106) (owner: 10Kormat) [16:14:32] (03CR) 10Hnowlan: [C: 03+2] similar-users: fix volume mount ordering [deployment-charts] - 10https://gerrit.wikimedia.org/r/657130 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan) [16:14:45] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.change-distro-from-cdh (exit_code=0) for Hadoop test cluster: Change Hadoop distribution - elukey@cumin1001 [16:14:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:57] PROBLEM - Check systemd state on ms-be2052 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:15:19] gooooood [16:15:32] (03CR) 10Kormat: [V: 03+1 C: 03+2] orchestrator: Use CAS headers for user identification [puppet] - 10https://gerrit.wikimedia.org/r/657135 (https://phabricator.wikimedia.org/T266106) (owner: 10Kormat) [16:15:59] (03Merged) 10jenkins-bot: similar-users: fix volume mount ordering [deployment-charts] - 10https://gerrit.wikimedia.org/r/657130 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan) [16:16:44] (03PS1) 10Effie Mouzeli: scap: disable udp logging [puppet] - 10https://gerrit.wikimedia.org/r/657136 (https://phabricator.wikimedia.org/T227080) [16:18:40] 10SRE, 10DBA, 10Orchestrator, 10Patch-For-Review, 10User-Kormat: orchestrator: Puppetize - https://phabricator.wikimedia.org/T265990 (10Kormat) [16:18:54] 10SRE, 10DBA, 10Orchestrator, 10CAS-SSO, and 2 others: orchestrator: Support SSO - https://phabricator.wikimedia.org/T266106 (10Kormat) 05Open→03Resolved [16:19:18] 10SRE, 10Cloud-VPS, 10cloud-services-team (Kanban): Move various support services for Cloud VPS currently in prod into their own instances - https://phabricator.wikimedia.org/T207536 (10aborrero) [16:19:22] 10SRE, 10Epic, 10cloud-services-team (Kanban): CloudVPS: network architecture - https://phabricator.wikimedia.org/T209460 (10aborrero) [16:20:21] 10SRE, 10DBA, 10Orchestrator, 10CAS-SSO, and 2 others: orchestrator: Support SSO - https://phabricator.wikimedia.org/T266106 (10Marostegui) Great work! [16:20:48] (03CR) 10Giuseppe Lavagetto: "I concur with alex's comment about the ownership of the homepage directory (it should be read-only for nginx). 
Other than that, you have a" [puppet] - 10https://gerrit.wikimedia.org/r/654725 (https://phabricator.wikimedia.org/T179696) (owner: 10Legoktm) [16:21:05] (03CR) 10Gmodena: [C: 03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/657130 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan) [16:21:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 75%: After moving wikireplicas to another host', diff saved to https://phabricator.wikimedia.org/P13837 and previous config saved to /var/cache/conftool/dbconfig/20210119-162134-root.json [16:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:45] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/657052 (https://phabricator.wikimedia.org/T199406) (owner: 10Filippo Giunchedi) [16:21:49] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' . [16:21:49] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'test' . [16:21:49] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'canary' . [16:21:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:36] (03PS1) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: remove unused code from main.conf [puppet] - 10https://gerrit.wikimedia.org/r/657138 (https://phabricator.wikimedia.org/T272305) [16:23:38] (03PS1) 10Giuseppe Lavagetto: [WiP] mediawiki: use a data structure to define prod_sites [puppet] - 10https://gerrit.wikimedia.org/r/657139 (https://phabricator.wikimedia.org/T272305) [16:23:51] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'canary' . 
[16:23:51] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'test' . [16:23:51] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' . [16:23:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:29] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [16:24:43] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [16:25:14] (03CR) 10jerkins-bot: [V: 04-1] [WiP] mediawiki: use a data structure to define prod_sites [puppet] - 10https://gerrit.wikimedia.org/r/657139 (https://phabricator.wikimedia.org/T272305) (owner: 10Giuseppe Lavagetto) [16:25:51] 10SRE, 10Cloud-VPS, 10cloud-services-team (Kanban): Move various support services for Cloud VPS currently in prod into their own instances - https://phabricator.wikimedia.org/T207536 (10aborrero) [16:26:03] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[16:26:19] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [16:27:31] 10SRE, 10Epic, 10cloud-services-team (Kanban): CloudVPS: network architecture - https://phabricator.wikimedia.org/T209460 (10aborrero) [16:27:34] 10SRE, 10Cloud-Services, 10Mail, 10Patch-For-Review, 10User-herron: Create a Cloud VPS SMTP smarthost - https://phabricator.wikimedia.org/T41785 (10aborrero) [16:30:19] RECOVERY - Elevated latency for icinga checks in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga [16:30:28] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' . [16:30:28] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'canary' . [16:30:28] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'test' . 
[16:30:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:57] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:32:22] 10SRE, 10Epic, 10cloud-services-team (Kanban): CloudVPS: network architecture - https://phabricator.wikimedia.org/T209460 (10aborrero) [16:32:37] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10RobH) a:05RobH→03Cmjohnson I'm unsubscribing myself from this, as its been taken over by the subteam, and its causing a lot of noise in my phabricator... [16:34:30] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Couple of inline comments, but once addressed we should be able to enable this." (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/650469 (https://phabricator.wikimedia.org/T228967) (owner: 10JMeybohm) [16:36:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 100%: After moving wikireplicas to another host', diff saved to https://phabricator.wikimedia.org/P13838 and previous config saved to /var/cache/conftool/dbconfig/20210119-163637-root.json [16:36:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:23] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:39:25] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'canary' . 
[16:39:25] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' . [16:39:25] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'test' . [16:39:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:40] !log jmm@cumin2001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host ms-be1046.eqiad.wmnet [16:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:54] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] role: remove rsyslog TLS remedy [puppet] - 10https://gerrit.wikimedia.org/r/657052 (https://phabricator.wikimedia.org/T199406) (owner: 10Filippo Giunchedi) [16:43:18] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' . [16:43:18] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'test' . [16:43:18] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'canary' . 
[16:43:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:32] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2312.codfw.wmnet with reason: REIMAGE [16:43:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:44] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2313.codfw.wmnet with reason: REIMAGE [16:43:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:28] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/657136 (https://phabricator.wikimedia.org/T227080) (owner: 10Effie Mouzeli) [16:45:06] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2314.codfw.wmnet with reason: REIMAGE [16:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:22] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2315.codfw.wmnet with reason: REIMAGE [16:45:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:40] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2312.codfw.wmnet with reason: REIMAGE [16:45:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:28] !log 1.36.0-wmf.27 was branched at fbb516d8e33924c6cb66c93bb6d42907558c31f3 for T271341 [16:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:34] T271341: 1.36.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T271341 [16:47:41] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2315.codfw.wmnet with reason: REIMAGE [16:47:42] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:54] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'canary' . [16:47:54] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' . [16:47:54] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'test' . [16:47:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:44] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2313.codfw.wmnet with reason: REIMAGE [16:48:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:06] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2314.codfw.wmnet with reason: REIMAGE [16:50:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:17] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on mw2314.codfw.wmnet with reason: new install on buster [16:51:17] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mw2314.codfw.wmnet with reason: new install on buster [16:51:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:22] (03CR) 10Brennen Bearnes: [C: 03+2] Branch commit for wmf/1.36.0-wmf.27 [core] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/656987 (https://phabricator.wikimedia.org/T271341) (owner: 10TrainBranchBot) [16:56:25] (03PS1) 10Dave Pifke: webperf: enable Apache base::service_auto_restart [puppet] - 10https://gerrit.wikimedia.org/r/657149 
(https://phabricator.wikimedia.org/T135991) [16:58:35] 10SRE, 10Inuka-Team, 10Security-Team, 10Product-Analytics (Kanban): Provide raw KaiOSAppFeedback data to Chelsea Riley for analysis - https://phabricator.wikimedia.org/T271202 (10JFishback_WMF) I spoke with a member of #wmf-legal about this issue and the tl;dr is that we do not presently have a sufficientl... [17:00:03] (03CR) 10Ahmon Dancy: [C: 03+1] scap: disable udp logging [puppet] - 10https://gerrit.wikimedia.org/r/657136 (https://phabricator.wikimedia.org/T227080) (owner: 10Effie Mouzeli) [17:00:04] jbond42 and cdanis: #bothumor I � Unicode. All rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210119T1700). [17:04:11] (03PS1) 10Hnowlan: similar-users: correct loglevel, remove readiness probe [deployment-charts] - 10https://gerrit.wikimedia.org/r/657151 (https://phabricator.wikimedia.org/T268837) [17:04:40] !log mwscript extensions/AbuseFilter/maintenance/MigrateAflFilter.php --wiki=testwiki --batch-size=1000 # T269713 [17:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:44] T269713: Run the MigrateAflFilter script for AbuseFilter - https://phabricator.wikimedia.org/T269713 [17:04:58] RECOVERY - Check systemd state on ms-be2052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:05:25] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2312.codfw.wmnet'] ` an... 
[17:06:08] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2313.codfw.wmnet'] ` an... [17:06:19] !log mwscript extensions/AbuseFilter/maintenance/MigrateAflFilter.php --wiki=test2wiki --batch-size=1000 # T269713 [17:06:20] (03PS1) 10Arturo Borrero Gonzalez: [DONT MERGE] cloud: drop NAT exceptions for dumps NFS [puppet] - 10https://gerrit.wikimedia.org/r/657152 (https://phabricator.wikimedia.org/T272397) [17:06:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:20] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2315.codfw.wmnet'] ` an... [17:07:50] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2314.codfw.wmnet'] ` an... [17:08:47] !log Run extensions/AbuseFilter/maintenance/MigrateAflFilter.php for all group0 wikis (T269713) [17:08:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:15] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "-1 because we need to evaluate some other stuff before merging the patch." 
[puppet] - 10https://gerrit.wikimedia.org/r/657152 (https://phabricator.wikimedia.org/T272397) (owner: 10Arturo Borrero Gonzalez) [17:12:48] (03CR) 10BryanDavis: [C: 04-1] "> Patch Set 2: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/656883 (https://phabricator.wikimedia.org/T209011) (owner: 10Arturo Borrero Gonzalez) [17:16:01] (03CR) 10Effie Mouzeli: "PCC https://puppet-compiler.wmflabs.org/compiler1001/27524" [puppet] - 10https://gerrit.wikimedia.org/r/654330 (owner: 10Aaron Schulz) [17:18:43] (03CR) 10Ayounsi: [DONT MERGE] cloud: drop NAT exceptions for dumps NFS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/657152 (https://phabricator.wikimedia.org/T272397) (owner: 10Arturo Borrero Gonzalez) [17:23:37] (03PS1) 10Bstorm: wikireplicas: Add new DNS names for multiinstance replicas [puppet] - 10https://gerrit.wikimedia.org/r/657155 (https://phabricator.wikimedia.org/T267376) [17:24:02] (03Merged) 10jenkins-bot: Branch commit for wmf/1.36.0-wmf.27 [core] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/656987 (https://phabricator.wikimedia.org/T271341) (owner: 10TrainBranchBot) [17:25:27] 10SRE, 10ops-codfw: codfw: add VC-links IDs to Netbox - https://phabricator.wikimedia.org/T268749 (10Papaul) Row A complete [17:29:28] (03PS1) 10Elukey: Revert "Set Bigtop 1.5 for the Hadoop test cluster" [puppet] - 10https://gerrit.wikimedia.org/r/657109 [17:29:57] (03CR) 10Awight: [C: 03+1] "Ready to go!" 
[puppet] - 10https://gerrit.wikimedia.org/r/649662 (https://phabricator.wikimedia.org/T262209) (owner: 10Awight) [17:30:09] !log Start of `foreachwikiindblist group1 extensions/AbuseFilter/maintenance/MigrateAflFilter.php --batch-size=1000 ` (T269713) [17:30:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:15] T269713: Run the MigrateAflFilter script for AbuseFilter - https://phabricator.wikimedia.org/T269713 [17:33:33] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:33:46] 10SRE, 10Wikimedia-Logstash, 10observability, 10Software-Licensing: Elasticsearch and Kibana are switching to non-OSI-approved SSPL licence - https://phabricator.wikimedia.org/T272238 (10dpifke) This may affect plans to use the Elastic Common Schema for logging. https://github.com/elastic/ecs still appear... [17:34:28] 10SRE, 10ops-codfw, 10DC-Ops, 10RESTBase: restbase2009 reimaging issues - https://phabricator.wikimedia.org/T269853 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` restbase2009.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/20210... [17:34:32] 10SRE, 10ops-codfw, 10DC-Ops, 10RESTBase: restbase2009 reimaging issues - https://phabricator.wikimedia.org/T269853 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['restbase2009.codfw.wmnet'] ` Of which those **FAILED**: ` ['restbase2009.codfw.wmnet'] ` [17:35:03] 10SRE, 10ops-codfw, 10DC-Ops, 10RESTBase: restbase2009 reimaging issues - https://phabricator.wikimedia.org/T269853 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` restbase2009.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/20210... 
[17:35:21] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:35:38] (03CR) 10Elukey: [C: 03+2] Revert "Set Bigtop 1.5 for the Hadoop test cluster" [puppet] - 10https://gerrit.wikimedia.org/r/657109 (owner: 10Elukey) [17:56:07] 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: 2021-03-31) rack/setup/install cloudgw100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T272403 (10RobH) [17:56:18] 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: 2021-03-31) rack/setup/install cloudgw100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T272403 (10RobH) [17:58:04] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase2009.codfw.wmnet with reason: REIMAGE [17:58:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:02] !log starting deploy-promote to testwikis for 1.36.0-wmf.27 (T271341) [17:59:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:06] T271341: 1.36.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T271341 [17:59:31] (03PS1) 10Brennen Bearnes: testwikis wikis to 1.36.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657157 [17:59:33] (03CR) 10Brennen Bearnes: [C: 03+2] testwikis wikis to 1.36.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657157 (owner: 10Brennen Bearnes) [17:59:57] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on restbase2009.codfw.wmnet with reason: REIMAGE [17:59:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:05] chrisalbon and accraze: It is that lovely time of the day again! You are hereby commanded to deploy Services – Graphoid / ORES. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210119T1800). [18:00:21] (03Merged) 10jenkins-bot: testwikis wikis to 1.36.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657157 (owner: 10Brennen Bearnes) [18:01:00] !log brennen@deploy1001 Started scap: testwikis wikis to 1.36.0-wmf.27 [18:01:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:59] 10SRE, 10DBA, 10Performance-Team, 10Platform Engineering Roadmap Decision Making, 10User-Kormat: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 (10Krinkle) @daniel @WDoranWMF Now that the docs have landed (thanks @nnikkhoui), I believe the next step is removing the obsolete gr... [18:03:11] 10SRE, 10DBA, 10Performance-Team, 10Platform Engineering Roadmap Decision Making, 10User-Kormat: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 (10Krinkle) [18:16:12] (03PS5) 10Bstorm: toolforge k8s: upgrade docker and containerd [puppet] - 10https://gerrit.wikimedia.org/r/639881 (https://phabricator.wikimedia.org/T263284) [18:17:20] 10SRE, 10ops-codfw, 10DC-Ops, 10RESTBase: restbase2009 reimaging issues - https://phabricator.wikimedia.org/T269853 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['restbase2009.codfw.wmnet'] ` and were **ALL** successful. 
[18:21:40] 10SRE, 10ops-codfw, 10DC-Ops, 10RESTBase: restbase2009 reimaging issues - https://phabricator.wikimedia.org/T269853 (10Papaul) 05Open→03Resolved a:03Papaul @hnowlan this server is ready for service [18:22:27] 10SRE, 10ops-codfw: Degraded RAID on restbase2009 - https://phabricator.wikimedia.org/T268622 (10Papaul) @hnowlan re-image complete server is ready for service [18:28:28] PROBLEM - cassandra-a CQL 10.192.48.54:9042 on restbase2009 is CRITICAL: connect to address 10.192.48.54 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [18:29:02] (03CR) 10Gmodena: [C: 03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/657151 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan) [18:31:06] PROBLEM - cassandra-a SSL 10.192.48.54:7001 on restbase2009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662 [18:33:34] PROBLEM - cassandra-a service on restbase2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:34:37] ACKNOWLEDGEMENT - SSH on ms-be1046 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Muehlenhoff T272396 https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:34:37] ACKNOWLEDGEMENT - Host ms-be1046 is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff T272396 [18:36:00] PROBLEM - cassandra-b CQL 10.192.48.55:9042 on restbase2009 is CRITICAL: connect to address 10.192.48.55 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [18:37:18] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={atlas_exporter,netbox_device_statistics} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:38:32] PROBLEM - cassandra-b SSL 10.192.48.55:7001 on restbase2009 is CRITICAL: SSL CRITICAL - failed 
to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662 [18:39:22] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:39:29] !log mbsantos@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'push-notifications' for release 'main' . [18:39:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:16] PROBLEM - cassandra-b service on restbase2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:42:21] !log brennen@deploy1001 Finished scap: testwikis wikis to 1.36.0-wmf.27 (duration: 41m 57s) [18:42:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:21] !log mbsantos@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'push-notifications' for release 'main' . [18:43:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:44] PROBLEM - cassandra-c CQL 10.192.48.56:9042 on restbase2009 is CRITICAL: connect to address 10.192.48.56 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [18:46:14] PROBLEM - cassandra-c SSL 10.192.48.56:7001 on restbase2009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662 [18:47:24] !log mbsantos@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'push-notifications' for release 'main' . 
[18:47:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:40] PROBLEM - PHP opcache health on mw2315 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [18:48:52] PROBLEM - cassandra-c service on restbase2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:54:46] twentyafterfour: I messed up the meeting time :( sorry [18:55:02] I thought it starts in 6 min only to realize that is the end of it.. ugh [18:55:28] twentyafterfour: [18:57:21] (03PS2) 10Mstyles: update flink logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/654723 (https://phabricator.wikimedia.org/T264006) [18:58:33] !log brennen@deploy1001 Pruned MediaWiki: 1.36.0-wmf.22 (duration: 03m 53s) [18:58:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:40] 10SRE, 10Wikimedia-Logstash, 10observability, 10Software-Licensing: Elasticsearch and Kibana are switching to non-OSI-approved SSPL licence - https://phabricator.wikimedia.org/T272238 (10Volans) Related: https://www.elastic.co/blog/why-license-change-AWS [19:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210119T1900) [19:01:01] (03PS3) 10Mstyles: update flink logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/654723 (https://phabricator.wikimedia.org/T264006) [19:04:04] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.692 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [19:14:04] PROBLEM - Ensure local MW versions match expected deployment on mw2315 is CRITICAL: CRITICAL: 3 mismatched wikiversions 
https://wikitech.wikimedia.org/wiki/Application_servers [19:22:26] !log elukey@cumin1001 START - Cookbook sre.hadoop.stop-cluster for Hadoop test cluster: Stop the Hadoop cluster before maintenance. - elukey@cumin1001 [19:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:59] testing the rollback [19:27:47] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.stop-cluster (exit_code=0) for Hadoop test cluster: Stop the Hadoop cluster before maintenance. - elukey@cumin1001 [19:27:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:01] !log elukey@cumin1001 START - Cookbook sre.hadoop.change-distro-from-cdh for Hadoop test cluster: Change Hadoop distribution - elukey@cumin1001 [19:30:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:30] PROBLEM - mediawiki-installation DSH group on mw2315 is CRITICAL: Host mw2315 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [19:37:29] 10SRE, 10Wikimedia-Logstash, 10observability, 10Software-Licensing: Elasticsearch and Kibana are switching to non-OSI-approved SSPL licence - https://phabricator.wikimedia.org/T272238 (10Ottomata) Confluent did the same thing a few years ago: https://www.confluent.io/blog/license-changes-confluent-platform... 
[19:39:08] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 21.52 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [19:42:00] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:46:48] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.change-distro-from-cdh (exit_code=0) for Hadoop test cluster: Change Hadoop distribution - elukey@cumin1001 [19:46:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:58] woooow [19:47:02] the rollback worked as well [19:47:26] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds [19:47:30] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:48:02] (03CR) 10CRusnov: [C: 03+2] interface_automation.py: Minor refactors and fixes for 2.9 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/656954 (https://phabricator.wikimedia.org/T266487) (owner: 10CRusnov) [19:49:24] PROBLEM - Check systemd state on ms-be2033 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:52:56] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [20:00:04] brennen and liw: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Mediawiki train - American+European Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210119T2000). [20:02:06] rolling to group0 momentarily. [20:03:56] (03PS1) 10Brennen Bearnes: group0 wikis to 1.36.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657174 [20:03:58] (03CR) 10Brennen Bearnes: [C: 03+2] group0 wikis to 1.36.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657174 (owner: 10Brennen Bearnes) [20:04:52] (03Merged) 10jenkins-bot: group0 wikis to 1.36.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657174 (owner: 10Brennen Bearnes) [20:07:12] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2033 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:07:38] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2312.codfw.wmnet [20:07:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:45] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2313.codfw.wmnet [20:07:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:56] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2314.codfw.wmnet [20:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:07] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2315.codfw.wmnet [20:08:08] RECOVERY - Ensure local MW versions match expected deployment on mw2315 is OK: OKAY: Not alerting due to fresh production 
wikiversions: 131 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [20:08:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:29] jouncebot: next [20:08:29] In 3 hour(s) and 51 minute(s): Evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210120T0000) [20:09:05] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.27 [20:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:08] chaomodus: here is an issue I noticed during wmf-reimage cookbook and related to netbox https://phabricator.wikimedia.org/P13839 [20:10:10] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:13:00] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:14:56] RECOVERY - Check systemd state on ms-be2033 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:18:52] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds [20:19:32] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 80129576 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:19:38] 
PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds [20:20:46] PROBLEM - Elevated latency for icinga checks in eqiad on alert1001 is CRITICAL: cluster=alerting instance=alert1001 job=icinga site=eqiad https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga [20:22:34] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 1273288 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:25:04] PROBLEM - PHP opcache health on mw2312 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [20:27:06] PROBLEM - PHP opcache health on mw2313 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [20:28:04] hrm [20:31:44] 10SRE, 10ops-eqiad: ms-be1046 stuck on reboot - https://phabricator.wikimedia.org/T272396 (10wiki_willy) a:03Cmjohnson [20:32:19] 10SRE, 10ops-eqiad: Degraded RAID on ms-be1032 - https://phabricator.wikimedia.org/T272209 (10wiki_willy) a:03Cmjohnson [20:35:34] RECOVERY - mediawiki-installation DSH group on mw2315 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [20:38:46] brennen: opcache issue on mw23 hosts is expected due to reimages. that is codfw-only [20:38:55] other alerts not related though [20:39:06] mutante: ack, thanks. 
[20:39:14] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2033 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:39:28] i don't _think_ latency on icinga checks correlates with deploy. [20:40:21] brennen: I don't think so either, it seems to be latency for checks themselves, not latency of appservers or something like that [20:40:29] right on. [20:41:10] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:41:22] ACKNOWLEDGEMENT - PHP opcache health on mw2312 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn reimaged https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [20:41:22] ACKNOWLEDGEMENT - PHP opcache health on mw2313 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn reimaged https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [20:41:22] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2313 is CRITICAL: Host mw2313 is not in mediawiki-installation dsh group daniel_zahn reimaged https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [20:43:00] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [20:46:20] ACKed the ones I know about to remove noise [20:46:56] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [20:53:16] PROBLEM - PHP opcache health on mw2314 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [20:56:56] mutante: just wanted to run a 
maintenance script for T269713 (no impact on appservers expected, only a bunch of update queries), but if we're in an incident, happy to delay. Any issues with going forward? [20:56:57] T269713: Run the MigrateAflFilter script for AbuseFilter - https://phabricator.wikimedia.org/T269713 [20:59:11] i am seeing a spike of "Could not enqueue jobs from stream ..." errors at the moment. [20:59:11] Urbanecm: we are not in an incident, go ahead [20:59:24] eh.. besides what brennen just said maybe [20:59:50] maybe just a transient spike. [21:00:12] started in a tmux session under my account at mwmaint, feel free to kill if needed [21:00:36] !log Start of `foreachwikiindblist group2 extensions/AbuseFilter/maintenance/MigrateAflFilter.php --batch-size=1000` (T269713) [21:00:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:00] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:08:36] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 246146080 and 5 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [21:11:18] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:11:20] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:11:50] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 881704 and 0 seconds 
https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [21:14:26] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:14:28] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:22:30] PROBLEM - Ensure local MW versions match expected deployment on mw2314 is CRITICAL: CRITICAL: 131 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [21:22:36] PROBLEM - Ensure local MW versions match expected deployment on mw2313 is CRITICAL: CRITICAL: 131 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [21:23:09] fixing those last 2 [21:23:18] PROBLEM - Ensure local MW versions match expected deployment on mw2312 is CRITICAL: CRITICAL: 131 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [21:24:58] scap pulling on hosts that were reimaged during deployment [21:26:10] PROBLEM - Ensure local MW versions match expected deployment on mw2315 is CRITICAL: CRITICAL: 131 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [21:28:35] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Allow WMDE intern Amrutha to access Superset - https://phabricator.wikimedia.org/T271725 (10KFrancis) @jcrespo Please send me Amrutha Chandra's full name (if something different than this) and their email address and I'll work on processing this request. Th... 
[21:29:46] RECOVERY - Ensure local MW versions match expected deployment on mw2314 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [21:29:52] RECOVERY - Ensure local MW versions match expected deployment on mw2313 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [21:30:34] RECOVERY - Ensure local MW versions match expected deployment on mw2312 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [21:31:36] PROBLEM - Disk space on kafka-test1010 is CRITICAL: DISK CRITICAL - free space: / 1784 MB (1% inode=99%): /tmp 1784 MB (1% inode=99%): /var/tmp 1784 MB (1% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=kafka-test1010&var-datasource=eqiad+prometheus/ops [21:33:20] RECOVERY - Ensure local MW versions match expected deployment on mw2315 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [21:35:04] PROBLEM - Kafka Broker Replica Max Lag on kafka-test1009 is CRITICAL: 1.393e+07 ge 5e+06 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=test-eqiad&var-kafka_broker=kafka-test1009 [21:35:10] PROBLEM - Kafka Broker Replica Max Lag on kafka-test1008 is CRITICAL: 1.327e+07 ge 5e+06 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=test-eqiad&var-kafka_broker=kafka-test1008 [21:35:26] PROBLEM - Kafka Broker Replica Max Lag on kafka-test1006 is CRITICAL: 1.32e+07 ge 5e+06 https://wikitech.wikimedia.org/wiki/Kafka/Administration 
https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=test-eqiad&var-kafka_broker=kafka-test1006 [21:35:31] this is razzi and I testing some stuff ^^^^ [21:35:36] the kafka-test things [21:36:16] (03CR) 10Legoktm: docker_registry_ha: Add a script to generate a static HTML homepage (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/654725 (https://phabricator.wikimedia.org/T179696) (owner: 10Legoktm) [21:36:26] PROBLEM - Kafka Broker Replica Max Lag on kafka-test1007 is CRITICAL: 7.689e+06 ge 5e+06 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=test-eqiad&var-kafka_broker=kafka-test1007 [21:36:45] (03PS9) 10Legoktm: docker_registry_ha: Add a script to generate a static HTML homepage [puppet] - 10https://gerrit.wikimedia.org/r/654725 (https://phabricator.wikimedia.org/T179696) [21:36:47] (03PS5) 10Legoktm: docker_registry_ha: Have nginx serve /srv/homepage for / [puppet] - 10https://gerrit.wikimedia.org/r/655792 (https://phabricator.wikimedia.org/T179696) [21:37:01] ottomata: thanks! 
(and the mw version ones are me and fixed) [21:37:26] (03CR) 10jerkins-bot: [V: 04-1] docker_registry_ha: Add a script to generate a static HTML homepage [puppet] - 10https://gerrit.wikimedia.org/r/654725 (https://phabricator.wikimedia.org/T179696) (owner: 10Legoktm) [21:37:36] (03CR) 10jerkins-bot: [V: 04-1] docker_registry_ha: Have nginx serve /srv/homepage for / [puppet] - 10https://gerrit.wikimedia.org/r/655792 (https://phabricator.wikimedia.org/T179696) (owner: 10Legoktm) [21:38:32] (03PS10) 10Legoktm: docker_registry_ha: Add a script to generate a static HTML homepage [puppet] - 10https://gerrit.wikimedia.org/r/654725 (https://phabricator.wikimedia.org/T179696) [21:38:34] (03PS6) 10Legoktm: docker_registry_ha: Have nginx serve /srv/homepage for / [puppet] - 10https://gerrit.wikimedia.org/r/655792 (https://phabricator.wikimedia.org/T179696) [21:40:03] (03PS4) 10Mstyles: update flink logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/654723 (https://phabricator.wikimedia.org/T264006) [21:45:29] jouncebot now [21:45:29] For the next 0 hour(s) and 14 minute(s): Mediawiki train - American+European Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210119T2000) [21:46:20] !log wiping kafka-test cluster data and starting from scratch - T255973 [21:46:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:25] T255973: Balance Kafka topic partitions on Kafka Jumbo to take advantage of the new brokers - https://phabricator.wikimedia.org/T255973 [21:47:16] i'm seeing some new-looking "table mediawikiwiki.translate_cache doesn't exist" stuff. going to roll this back just in case. 
[21:50:57] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2315.codfw.wmnet [21:50:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:16] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2312.codfw.wmnet [21:51:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:29] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2313.codfw.wmnet [21:51:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:33] brennen: https://phabricator.wikimedia.org/rETRA96307453f26c68ed8e0ae11e4671bf0fe1234d4e [21:51:35] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: Revert group0 wikis to 1.36.0-wmf.26 [21:51:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:11] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2314.codfw.wmnet [21:52:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:02] (03PS1) 10Brennen Bearnes: Revert "group0 wikis to 1.36.0-wmf.27" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657197 [21:53:04] (03CR) 10Brennen Bearnes: [C: 03+2] Revert "group0 wikis to 1.36.0-wmf.27" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657197 (owner: 10Brennen Bearnes) [21:53:48] RECOVERY - Disk space on kafka-test1010 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=kafka-test1010&var-datasource=eqiad+prometheus/ops [21:54:20] (03Merged) 10jenkins-bot: Revert "group0 wikis to 1.36.0-wmf.27" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657197 (owner: 10Brennen Bearnes) [21:54:23] RhinosF1: looks relevant - should i comment on T182433? 
[21:54:24] T182433: Implement a stronger synchronization in RepoNG and Translate - https://phabricator.wikimedia.org/T182433 [21:54:45] brennen: that commit has been there since .23 though [21:54:49] hrm [21:54:56] So I don't know yet why that's new if it is [21:55:02] yeah, i'm going to file this as a production error [21:55:17] brennen: do you have a full trace? [21:55:37] It's definitely a production error and probably needs a dba to apply the change or a feature flag [21:56:00] RhinosF1: yeah, one sec. [22:00:50] (03CR) 10Urbanecm: [C: 04-2] "to avoid accidential merge" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657067 (https://phabricator.wikimedia.org/T209011) (owner: 10Arturo Borrero Gonzalez) [22:02:07] RhinosF1: trace on T272428. [22:02:07] T272428: Error 1146: Table 'mediawikiwiki.translate_cache' doesn't exist - https://phabricator.wikimedia.org/T272428 [22:02:14] RECOVERY - Kafka Broker Replica Max Lag on kafka-test1007 is OK: (C)5e+06 ge (W)1e+06 ge 1.491e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=test-eqiad&var-kafka_broker=kafka-test1007 [22:05:36] James_F: the code should have been there since .23 though [22:05:50] Why's it suddenly turned on with this deployment [22:08:14] PROBLEM - Check systemd state on ms-be2017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:10:00] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds [22:10:30] I found it https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Translate/+/606424/54/utils/MessageUpdateJob.php is what turned it on [22:13:12] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [22:14:06] RECOVERY - Kafka Broker Replica Max Lag on kafka-test1006 is OK: (C)5e+06 ge (W)1e+06 ge 9.632e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=test-eqiad&var-kafka_broker=kafka-test1006 [22:14:43] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10Jclark-ctr) DYT7773 is correct ST for db1156 located last DB server racked in D3 U12 [22:15:14] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10Jclark-ctr) [22:17:04] RECOVERY - Kafka Broker Replica Max Lag on kafka-test1009 is OK: (C)5e+06 ge (W)1e+06 ge 3.867e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=test-eqiad&var-kafka_broker=kafka-test1009 [22:17:10] RECOVERY - Kafka Broker Replica Max Lag on kafka-test1008 is OK: (C)5e+06 ge (W)1e+06 ge 3.061e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration 
https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=test-eqiad&var-kafka_broker=kafka-test1008 [22:24:30] RECOVERY - Check systemd state on ms-be2017 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:29:57] (03CR) 10CRusnov: "Tested on -next works as expected." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/657199 (owner: 10CRusnov) [22:31:07] RhinosF1: Good find. [22:33:16] James_F: ty :) [22:33:24] (03PS1) 10QChris: Add .gitreview [debs/alertmanager-webhook-logger] - 10https://gerrit.wikimedia.org/r/657200 [22:33:26] (03CR) 10QChris: [V: 03+2 C: 03+2] Add .gitreview [debs/alertmanager-webhook-logger] - 10https://gerrit.wikimedia.org/r/657200 (owner: 10QChris) [22:33:45] Reverting the patch is pretty easy, but might be the wrong way to go if there's other code silently expecting it to work. [22:34:06] James_F: that's beyond me + php at 22:34 [22:34:12] * James_F grins. [22:34:27] that's beyond me + php awake and sane to be honest [22:34:32] and I'm neither [22:34:34] Yeah, leave to the Language team rather than reverting right now is I think the best route forward, but that's brennen's call. [22:36:14] James_F, RhinosF1: i think at this point in the day i'm prepared to call train stopped here and wait for language team to sort it. [22:36:34] Cool. [22:36:50] (03PS11) 10Cwhite: profile: add priority to logstash filter filenames [puppet] - 10https://gerrit.wikimedia.org/r/650629 (https://phabricator.wikimedia.org/T254533) [22:36:51] * James_F awaits the nastygram e-mail from brennen about the train being halted. [22:37:03] updating blocker ticket accordingly & prepping nastygram. :) [22:37:48] brennen: that's cool. Like I said, I'm not awake and sane at the moment anyway. [22:38:10] RhinosF1: yeah - thanks for the help at such a late hour. 
[22:38:37] it's ok [22:39:06] * RhinosF1 hasn't been awake and sane at the same time since about mid september [22:39:12] (03PS11) 10Legoktm: docker_registry_ha: Add a script to generate a static HTML homepage [puppet] - 10https://gerrit.wikimedia.org/r/654725 (https://phabricator.wikimedia.org/T179696) [22:39:38] (03PS12) 10Legoktm: docker_registry_ha: Add a script to generate a static HTML homepage [puppet] - 10https://gerrit.wikimedia.org/r/654725 (https://phabricator.wikimedia.org/T179696) [22:40:13] (03CR) 10Legoktm: [C: 03+2] "PS11: I adjusted the script to use http://localhost:5000 because that's what the registry listens on" [puppet] - 10https://gerrit.wikimedia.org/r/654725 (https://phabricator.wikimedia.org/T179696) (owner: 10Legoktm) [22:42:36] (03PS1) 10Legoktm: docker_registry_ha: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/657204 [22:42:52] (03PS2) 10Legoktm: docker_registry_ha: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/657204 [22:44:33] (03CR) 10Legoktm: [C: 03+2] docker_registry_ha: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/657204 (owner: 10Legoktm) [22:44:38] (03CR) 10Cwhite: [C: 03+2] profile: add priority to logstash filter filenames [puppet] - 10https://gerrit.wikimedia.org/r/650629 (https://phabricator.wikimedia.org/T254533) (owner: 10Cwhite) [22:46:04] oh legoktm can now +2 in puppet :) [22:47:04] tabbycat: more specifically, I can break stuff in puppet! [22:47:28] legoktm: well, we all need some fun from time to time :) [22:47:50] legoktm: You've been breaking things everywhere else for a while, now you can /really/ wreak havoc, huh? 
;-) [22:48:13] bet they keep buying t-shirts for "I broke puppet, timely fixed it and all I got was this lousy t-shirt (and managed to keep my job)" [22:48:39] (03PS15) 10Cwhite: profile: add ecs pre and post filters to pipeline [puppet] - 10https://gerrit.wikimedia.org/r/647028 (https://phabricator.wikimedia.org/T234565) [22:48:47] (03PS1) 10Legoktm: docker_registry_ha: Add missing parameter for systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/657206 [22:49:38] (03CR) 10Legoktm: [C: 03+2] docker_registry_ha: Add missing parameter for systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/657206 (owner: 10Legoktm) [22:51:44] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:54:52] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:55:24] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds [22:56:26] PROBLEM - cassandra-b CQL 10.64.0.127:9042 on aqs1004 is CRITICAL: connect to address 10.64.0.127 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [22:57:12] PROBLEM - Check systemd state on aqs1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:57:26] tabbycat: i still await my "i broke wikipedia" shirt. [22:57:42] PROBLEM - Check systemd state on registry2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:58:07] brennen: Aren't they standard issue in RelEng? ;-) [22:58:13] ...i guess maybe i'll have to order some [22:58:28] the registry2001 failure is me, still fixing [22:58:36] that'd be a good method: you're issued your shirt when you start, but you're not allowed to wear it until the inevitable happens [22:58:50] PROBLEM - cassandra-b service on aqs1004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:00:07] brennen: please mail your request first class, postage prepaid to: 1/#1600 New Montgomery Street, San Francisco, CA 94010 USA :-) [23:00:23] brennen: As I've said before, it's not the kind of award we're meant to make with oak leaf clusters. ;-) [23:00:40] (03CR) 10CRusnov: "This change is ready for review." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/657207 (owner: 10CRusnov) [23:01:00] James_F: stinging nettles more like [23:01:52] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [23:02:04] Very Gothic. [23:03:18] RECOVERY - NFS Share Volume Space /srv/tools on labstore1004 is OK: DISK OK - free space: /srv/tools 1999333 MB (25% inode=81%): https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage%23NFS_volume_cleanup https://grafana.wikimedia.org/d/50z0i4XWz/tools-overall-nfs-storage-utilization?orgId=1 [23:09:28] PROBLEM - Check systemd state on registry1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:16:04] (03PS1) 10Legoktm: docker_registry_ha: Make registery-homepage-builder Python 3.5 compatible [puppet] - 10https://gerrit.wikimedia.org/r/657210 [23:16:05] /ac/ac [23:16:42] RECOVERY - Check systemd state on aqs1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:17:36] ACKNOWLEDGEMENT - Check systemd state on registry2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Legoktm T179696, fixing https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:17:47] ACKNOWLEDGEMENT - Check systemd state on registry1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Legoktm T179696, fixing https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:18:18] RECOVERY - cassandra-b service on aqs1004 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:18:40] (03CR) 10Legoktm: "I'll need to upload 0.0.4 of python3-docker-report to stretch-wikimedia before this will work." 
[puppet] - 10https://gerrit.wikimedia.org/r/657210 (owner: 10Legoktm) [23:22:24] RECOVERY - cassandra-b CQL 10.64.0.127:9042 on aqs1004 is OK: TCP OK - 0.000 second response time on 10.64.0.127 port 9042 https://phabricator.wikimedia.org/T93886 [23:27:12] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:27:12] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:33:20] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:33:20] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:40:44] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:43:46] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:52:10] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:55:22] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:59:15] (03PS1) 10Cwhite: profile: drop ECS messages on legacy cluster [puppet] - 10https://gerrit.wikimedia.org/r/657213 (https://phabricator.wikimedia.org/T234565)