[00:01:18] (PS2) Ebernhardson: Add Cirrus testing profile for glent m1 [mediawiki-config] - https://gerrit.wikimedia.org/r/672565 (https://phabricator.wikimedia.org/T262612)
[00:01:22] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:12:14] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:24:24] legoktm: oops, thanks! the decom cookbook had ended but it does not set it to inactive automatically
[00:28:30] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:35:08] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:39:52] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:52:10] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:54:14] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:58:42] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:01:08] (PS3) CRusnov: Group Sensitive Remote: Sync groups at auth time, not creation time [software/netbox] - https://gerrit.wikimedia.org/r/672548 (https://phabricator.wikimedia.org/T244849)
[01:03:31] (CR) CRusnov: "Okay tested and deployed on -next. Funny thing about when this gets called: It doesn't seem to call .authenticate unless you fail the toke" [software/netbox] - https://gerrit.wikimedia.org/r/672548 (https://phabricator.wikimedia.org/T244849) (owner: CRusnov)
[01:05:24] (PS4) CRusnov: Group Sensitive Remote: Sync groups at auth time, not creation time [software/netbox] - https://gerrit.wikimedia.org/r/672548 (https://phabricator.wikimedia.org/T244849)
[01:05:42] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:07:52] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:21:14] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:30:34] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:35:14] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:40:49] (PS1) Ladsgroup: flaggedrevs: Simplify the config a bit [mediawiki-config] - https://gerrit.wikimedia.org/r/672571
[01:41:54] RECOVERY - mediawiki originals uploads -hourly- for codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[01:44:32] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:45:16] (CR) DannyS712: [C: +1] flaggedrevs: Simplify the config a bit [mediawiki-config] - https://gerrit.wikimedia.org/r/672571 (owner: Ladsgroup)
[01:54:00] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:56:22] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:07:49] (PS1) TrainBranchBot: Branch commit for wmf/1.36.0-wmf.35 [core] (wmf/1.36.0-wmf.35) - https://gerrit.wikimedia.org/r/672574
[02:15:08] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:19:54] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:25:55] (PS2) Andrew Bogott: Nova vendordata: rework initial partitioning [puppet] - https://gerrit.wikimedia.org/r/672538 (https://phabricator.wikimedia.org/T272114)
[02:29:46] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:32:08] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:37:08] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:41:56] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:43:37] (PS1) Tim Starling: Use the RequestTimeout library to set time limits [mediawiki-config] - https://gerrit.wikimedia.org/r/672579 (https://phabricator.wikimedia.org/T269326)
[02:44:34] RECOVERY - dump of matomo in eqiad on alert1001 is OK: Last dump for matomo at eqiad (db1108.eqiad.wmnet:3351) taken on 2021-03-16 02:28:01 (0 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[02:49:06] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:51:16] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:54:27] (CR) BryanDavis: wikireplicas: create actual paws database accounts (2 comments) [puppet] - https://gerrit.wikimedia.org/r/672540 (https://phabricator.wikimedia.org/T276284) (owner: Bstorm)
[03:03:56] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 132994224 and 15 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:06:12] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 731016 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:12:08] (PS3) Andrew Bogott: Nova vendordata: rework initial partitioning [puppet] - https://gerrit.wikimedia.org/r/672538 (https://phabricator.wikimedia.org/T272114)
[03:17:16] PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:18:40] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:19:32] RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 8.340 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:23:06] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:27:26] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:40:56] RECOVERY - mediawiki originals uploads -hourly- for eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[03:40:56] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:54:39] (CR) Krinkle: [C: +1] xhgui: enable database access for admins [puppet] - https://gerrit.wikimedia.org/r/621100 (https://phabricator.wikimedia.org/T260640) (owner: Dave Pifke)
[04:00:09] (PS1) Krinkle: rdbms: avoid undefined "expectBy" notices in TransactionProfiler (II) [core] (wmf/1.36.0-wmf.35) - https://gerrit.wikimedia.org/r/672587 (https://phabricator.wikimedia.org/T269789)
[04:05:18] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:07:30] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:18:54] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:21:14] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:24:48] SRE, SRE-Access-Requests: Add samwalton to analytics-privatedata-users - https://phabricator.wikimedia.org/T277298 (nshahquinn-wmf) @Volans he can already access Hue, but when he tried to query private data (`mediawiki_history`, for T271962) he got a permissions error.
[04:30:40] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:38:00] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:47:48] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={ircd,netbox_device_statistics} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:47:58] (PS3) KartikMistry: WIP: Update cxserver to 2021-03-15-131520-production [deployment-charts] - https://gerrit.wikimedia.org/r/672386 (https://phabricator.wikimedia.org/T271711)
[04:52:36] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:02:22] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:03:24] RECOVERY - cassandra-a service on aqs1010 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[05:04:48] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:19:04] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={ircd,netbox_device_statistics} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:20:30] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 236, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:22:00] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:23:56] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:28:38] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:31:28] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 135, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:32:16] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 238, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:33:18] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:35:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1076', diff saved to https://phabricator.wikimedia.org/P14862 and previous config saved to /var/cache/conftool/dbconfig/20210316-053516-marostegui.json
[05:35:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:35:36] !log Stop MySQL on db1162 to clone db1162 T258361
[05:35:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:35:44] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[05:38:33] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui) db1076 -> db1162 transfer is on-going now
[05:39:20] SRE, ops-eqiad, DBA, DC-Ops: Upgrade firmware on db1136 - https://phabricator.wikimedia.org/T277007 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1136.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/20210316053...
[05:46:56] PROBLEM - SSH on mw2246.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:47:50] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={ircd,netbox_device_statistics} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:50:10] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:51:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1136.eqiad.wmnet with reason: REIMAGE
[05:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:53:13] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1136.eqiad.wmnet with reason: REIMAGE
[05:53:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:53:43] PROBLEM - Disk space on backup2002 is CRITICAL: DISK CRITICAL - free space: /srv 2972129 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=backup2002&var-datasource=codfw+prometheus/ops
[05:59:22] SRE, ops-eqiad, DBA, DC-Ops: Upgrade firmware on db1136 - https://phabricator.wikimedia.org/T277007 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1136.eqiad.wmnet'] ` and were **ALL** successful.
[06:08:21] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:09:57] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:15:13] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:19:05] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:32:30] (PS1) Marostegui: db1136: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/672606 (https://phabricator.wikimedia.org/T277007)
[06:43:21] (CR) Marostegui: [C: +2] db1136: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/672606 (https://phabricator.wikimedia.org/T277007) (owner: Marostegui)
[06:44:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1136 (re)pooling @ 25%: Repool db1136', diff saved to https://phabricator.wikimedia.org/P14864 and previous config saved to /var/cache/conftool/dbconfig/20210316-064358-root.json
[06:44:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:45:21] SRE, ops-eqiad, DBA, DC-Ops, Patch-For-Review: Upgrade firmware on db1136 - https://phabricator.wikimedia.org/T277007 (Marostegui) Open→Resolved Host is being repooled. Thanks!
[06:45:26] SRE, DBA: db1080-95 batch possibly suffering BBU issues - https://phabricator.wikimedia.org/T258386 (Marostegui)
[06:45:29] SRE, DBA: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 - https://phabricator.wikimedia.org/T216240 (Marostegui)
[06:47:53] RECOVERY - SSH on mw2246.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:49:29] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:51:39] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:51:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2120 T275633', diff saved to https://phabricator.wikimedia.org/P14865 and previous config saved to /var/cache/conftool/dbconfig/20210316-065148-marostegui.json
[06:51:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:56] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633
[06:52:15] !log Stop MySQL on db2120 to clone db2150 - T275633
[06:52:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:40] (PS1) Marostegui: db2148: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/672608
[06:55:24] (CR) Marostegui: [C: +2] db2148: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/672608 (owner: Marostegui)
[06:58:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2108', diff saved to https://phabricator.wikimedia.org/P14867 and previous config saved to /var/cache/conftool/dbconfig/20210316-065814-marostegui.json
[06:58:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2148', diff saved to https://phabricator.wikimedia.org/P14868 and previous config saved to /var/cache/conftool/dbconfig/20210316-065840-marostegui.json
[06:58:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:59:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1136 (re)pooling @ 50%: Repool db1136', diff saved to https://phabricator.wikimedia.org/P14869 and previous config saved to /var/cache/conftool/dbconfig/20210316-065903-root.json
[06:59:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:37] SRE, DBA, Wikimedia-Mailing-lists: Create databases for mailman3 - https://phabricator.wikimedia.org/T256538 (Marostegui) >>! In T256538#6889860, @Legoktm wrote: >>>! In T256538#6889858, @Marostegui wrote: >> @Ladsgroup are these just testing databases that will be deleted at some point or are these...
[07:09:53] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:12:15] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:14:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1136 (re)pooling @ 75%: Repool db1136', diff saved to https://phabricator.wikimedia.org/P14870 and previous config saved to /var/cache/conftool/dbconfig/20210316-071407-root.json
[07:14:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:05] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:29:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1136 (re)pooling @ 100%: Repool db1136', diff saved to https://phabricator.wikimedia.org/P14871 and previous config saved to /var/cache/conftool/dbconfig/20210316-072910-root.json
[07:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:38:31] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:45:51] SRE, MW-on-K8s, serviceops: Figure out appropriate readiness and liveness probes - https://phabricator.wikimedia.org/T276908 (Joe) After some more work, this is my ideas for liveness and readiness probes: # httpd: -- liveness: tcp connection to the main tcp port -- readiness: http status page (via t...
[07:46:14] (PS1) Elukey: profile::docker::engine: add default to the version parameter [puppet] - https://gerrit.wikimedia.org/r/672647
[07:46:59] SRE, MW-on-K8s, serviceops: Figure out appropriate readiness and liveness probes - https://phabricator.wikimedia.org/T276908 (Joe) p:Triage→High
[07:49:05] (CR) ZPapierski: [C: +1] zpapierski home: Start tmux on connect [puppet] - https://gerrit.wikimedia.org/r/670943 (owner: Ebernhardson)
[07:49:28] (PS2) Elukey: profile::docker::engine: add default to the version parameter [puppet] - https://gerrit.wikimedia.org/r/672647
[07:51:02] (CR) Elukey: hiera/modules: Add role for ML k8s workers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/672402 (https://phabricator.wikimedia.org/T272918) (owner: Klausman)
[07:51:09] !log swift eqiad-prod: less weight for ms-be[1019-1026] - T272836
[07:51:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:16] T272836: Decom ms-be[1019-1026] from swift - https://phabricator.wikimedia.org/T272836
[07:52:38] (CR) Filippo Giunchedi: [C: +1] prometheus: generate metrics on dead letter events [puppet] - https://gerrit.wikimedia.org/r/672558 (https://phabricator.wikimedia.org/T277080) (owner: Cwhite)
[07:53:03] (CR) Filippo Giunchedi: [C: +1] logstash: short-circuit dead letter recursion [puppet] - https://gerrit.wikimedia.org/r/672556 (https://phabricator.wikimedia.org/T277080) (owner: Cwhite)
[07:58:37] (PS1) Marostegui: db1162: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/672649 (https://phabricator.wikimedia.org/T258361)
[08:03:05] RECOVERY - Check systemd state on aqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:04:23] (CR) Marostegui: [C: +2] db1162: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/672649 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui)
[08:05:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 1%: Slowly repool db1162', diff saved to https://phabricator.wikimedia.org/P14872 and previous config saved to /var/cache/conftool/dbconfig/20210316-080547-root.json
[08:05:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:16] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui) I am slowly repooling db1162
[08:10:43] (PS1) Marostegui: mariadb: Productionize db2150 [puppet] - https://gerrit.wikimedia.org/r/672652 (https://phabricator.wikimedia.org/T275633)
[08:11:43] (CR) Marostegui: [C: +2] mariadb: Productionize db2150 [puppet] - https://gerrit.wikimedia.org/r/672652 (https://phabricator.wikimedia.org/T275633) (owner: Marostegui)
[08:14:11] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 66 probes of 599 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:14:58] (PS3) Elukey: hadoop: add a profile to deploy the capacity scheduler's settings [puppet] - https://gerrit.wikimedia.org/r/672373 (https://phabricator.wikimedia.org/T277062)
[08:16:38] I'm taking down logmsgbot shortly to enable nick enforcing (T276303)
[08:16:38] T276303: logmsgbot auth issues - https://phabricator.wikimedia.org/T276303
[08:18:53] !log enable nick enforcing for logmsgbot - T276303
[08:18:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:21] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:20:04] {{done}}
[08:20:45] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 48 probes of 599 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:20:51] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:20:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 2%: Slowly repool db1162', diff saved to https://phabricator.wikimedia.org/P14873 and previous config saved to /var/cache/conftool/dbconfig/20210316-082051-root.json
[08:20:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:54] (PS1) Elukey: hadoop: set the Yarn capacity scheduler for the test cluster [puppet] - https://gerrit.wikimedia.org/r/672654 (https://phabricator.wikimedia.org/T277062)
[08:33:42] (CR) Hashar: [C: +1] "The Jenkins instances are working fine with Java 11 :]" [puppet] - https://gerrit.wikimedia.org/r/670776 (https://phabricator.wikimedia.org/T269354) (owner: Muehlenhoff)
[08:33:49] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:35:08] (CR) Muehlenhoff: [C: +2] releases/ci: Remove now obsolete Java 8 packages [puppet] - https://gerrit.wikimedia.org/r/670776 (https://phabricator.wikimedia.org/T269354) (owner: Muehlenhoff)
[08:35:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 5%: Slowly repool db1162', diff saved to https://phabricator.wikimedia.org/P14874 and previous config saved to /var/cache/conftool/dbconfig/20210316-083555-root.json
[08:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1076 (re)pooling @ 25%: Slowly repool db1076', diff saved to https://phabricator.wikimedia.org/P14875 and previous config saved to /var/cache/conftool/dbconfig/20210316-083653-root.json
[08:36:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:39] !log remove Java 8 from contint/releases T269354
[08:42:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:46] T269354: Switch Jenkins servers to Java 11 - https://phabricator.wikimedia.org/T269354
[08:44:54] (PS2) Elukey: hadoop: set the Yarn capacity scheduler for the test cluster [puppet] - https://gerrit.wikimedia.org/r/672654 (https://phabricator.wikimedia.org/T277062)
[08:46:54] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28623/console" [puppet] - https://gerrit.wikimedia.org/r/672654 (https://phabricator.wikimedia.org/T277062) (owner: Elukey)
[08:47:00] !log Check tables on db2150 db2120 T276742
[08:47:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:07] T276742: Check for errors on all tables on some hosts - https://phabricator.wikimedia.org/T276742
[08:49:45] (PS3) Elukey: hadoop: set the Yarn capacity scheduler for the test cluster [puppet] - https://gerrit.wikimedia.org/r/672654 (https://phabricator.wikimedia.org/T277062)
[08:50:49] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:50:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 10%: Slowly repool db1162', diff saved to https://phabricator.wikimedia.org/P14876 and previous config saved to /var/cache/conftool/dbconfig/20210316-085058-root.json
[08:51:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1076 (re)pooling @ 50%: Slowly repool db1076', diff saved to https://phabricator.wikimedia.org/P14877 and previous config saved to /var/cache/conftool/dbconfig/20210316-085157-root.json
[08:52:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:13] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:54:20] (PS4) Elukey: hadoop: set the Yarn capacity scheduler for the test cluster [puppet] - https://gerrit.wikimedia.org/r/672654 (https://phabricator.wikimedia.org/T277062)
[08:56:59] !log jayme@cumin1001 START - Cookbook sre.discovery.service-route
[08:57:01] !log jayme@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0)
[08:57:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:10] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 18 hosts with reason: Reinitialize codfw k8s cluster with new etcd
[08:59:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:17] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 18 hosts with reason: Reinitialize codfw k8s cluster with new etcd
[08:59:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:49] !log starting the k8s codfw cluster reinitialization process
[08:59:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:08] (PS5) Elukey: hadoop: set the Yarn capacity scheduler for the test cluster [puppet] - https://gerrit.wikimedia.org/r/672654 (https://phabricator.wikimedia.org/T277062)
[09:00:12] !log jayme@cumin1001 START - Cookbook sre.discovery.service-route
[09:00:12] !log jayme@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0)
[09:00:13] !log jayme@cumin1001 START - Cookbook sre.discovery.service-route
[09:00:13] !log jayme@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0)
[09:00:13] !log jayme@cumin1001 START - Cookbook sre.discovery.service-route
[09:00:13] !log jayme@cumin1001 END (PASS) - Cookbook 
sre.discovery.service-route (exit_code=0) [09:00:14] !log jayme@cumin1001 START - Cookbook sre.discovery.service-route [09:00:14] !log jayme@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) [09:00:14] !log jayme@cumin1001 START - Cookbook sre.discovery.service-route [09:00:15] !log jayme@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) [09:00:15] !log jayme@cumin1001 START - Cookbook sre.discovery.service-route [09:00:15] !log jayme@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) [09:00:16] !log jayme@cumin1001 START - Cookbook sre.discovery.service-route [09:00:16] !log jayme@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) [09:00:17] !log jayme@cumin1001 START - Cookbook sre.discovery.service-route [09:00:17] !log jayme@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) [09:00:18] !log jayme@cumin1001 START - Cookbook sre.discovery.service-route [09:00:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:18] !log jayme@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) [09:00:19] !log jayme@cumin1001 START - Cookbook sre.discovery.service-route [09:00:19] !log jayme@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) [09:00:20] !log jayme@cumin1001 START - Cookbook sre.discovery.service-route [09:00:20] !log jayme@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) [09:00:21] !log jayme@cumin1001 START - Cookbook sre.discovery.service-route [09:00:21] !log jayme@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) [09:00:22] !log jayme@cumin1001 START - Cookbook sre.discovery.service-route [09:00:22] !log jayme@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) [09:00:23] !log jayme@cumin1001 START - Cookbook sre.discovery.service-route [09:00:23] !log jayme@cumin1001 END (PASS) - Cookbook 
sre.discovery.service-route (exit_code=0) [09:00:24] !log jayme@cumin1001 START - Cookbook sre.discovery.service-route [09:00:24] !log jayme@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) [09:00:25] !log jayme@cumin1001 START - Cookbook sre.discovery.service-route [09:00:25] !log jayme@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) [09:00:26] !log jayme@cumin1001 START - Cookbook sre.discovery.service-route [09:00:26] !log jayme@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) [09:00:27] !log jayme@cumin1001 START - Cookbook sre.discovery.service-route [09:00:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:27] !log jayme@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) [09:00:28] !log jayme@cumin1001 START - Cookbook sre.discovery.service-route [09:00:28] !log jayme@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) [09:00:29] !log jayme@cumin1001 START - Cookbook sre.discovery.service-route [09:00:29] !log jayme@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) [09:00:30] !log jayme@cumin1001 START - Cookbook sre.discovery.service-route [09:00:30] !log jayme@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) [09:00:31] !log jayme@cumin1001 START - Cookbook sre.discovery.service-route [09:00:31] !log jayme@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) [09:00:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:59] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:23] 10SRE, 10SRE-Access-Requests: Add samwalton to analytics-privatedata-users - https://phabricator.wikimedia.org/T277298 (10Samwalton9) >>! In T277298#6916288, @nshahquinn-wmf wrote: > @Volans he can already access Hue, but when he tried to query private data (`mediawiki_history`, for T271962) he got a permissio... [09:01:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:54] !log jayme@cumin1001 START - Cookbook sre.discovery.service-route [09:01:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:56] !log jayme@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) [09:01:58] (03PS2) 10Muehlenhoff: profile::kerberos::client: Default to use DNS canonicalisation [puppet] - 10https://gerrit.wikimedia.org/r/671130 (https://phabricator.wikimedia.org/T257412) [09:02:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:14] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28626/console" [puppet] - 10https://gerrit.wikimedia.org/r/672654 
(https://phabricator.wikimedia.org/T277062) (owner: 10Elukey) [09:02:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:13] !log jayme@cumin1001 START - Cookbook sre.discovery.service-route [09:03:13] !log jayme@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) [09:03:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:19] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/671130 (https://phabricator.wikimedia.org/T257412) (owner: 10Muehlenhoff) [09:03:20] !log jayme@cumin1001 START - Cookbook sre.discovery.service-route [09:03:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:22] !log jayme@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) [09:03:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:54] !log jayme@cumin1001 START - Cookbook sre.discovery.service-route [09:03:54] !log jayme@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) [09:03:56] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 15%: Slowly repool db1162', diff saved to https://phabricator.wikimedia.org/P14878 and previous config saved to /var/cache/conftool/dbconfig/20210316-090602-root.json [09:06:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:44] (03PS1) 10Hashar: contint: use Java 11 on Jenkins agents [puppet] - 10https://gerrit.wikimedia.org/r/672658 
(https://phabricator.wikimedia.org/T269354) [09:07:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1076 (re)pooling @ 75%: Slowly repool db1076', diff saved to https://phabricator.wikimedia.org/P14879 and previous config saved to /var/cache/conftool/dbconfig/20210316-090701-root.json [09:07:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:27] (03PS4) 10Elukey: hadoop: add a profile to deploy the capacity scheduler's settings [puppet] - 10https://gerrit.wikimedia.org/r/672373 (https://phabricator.wikimedia.org/T277062) [09:07:29] (03PS6) 10Elukey: hadoop: set the Yarn capacity scheduler for the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/672654 (https://phabricator.wikimedia.org/T277062) [09:07:47] (03CR) 10jerkins-bot: [V: 04-1] contint: use Java 11 on Jenkins agents [puppet] - 10https://gerrit.wikimedia.org/r/672658 (https://phabricator.wikimedia.org/T269354) (owner: 10Hashar) [09:09:21] (03CR) 10Elukey: "@Ottomata,@Razzi: this is what I have in mind for the profile, the settings will need to be adjusted/discussed of course, lemme know if it" [puppet] - 10https://gerrit.wikimedia.org/r/672373 (https://phabricator.wikimedia.org/T277062) (owner: 10Elukey) [09:10:15] !log jayme@cumin1001 conftool action : set/pooled=false; selector: name=codfw,dnsdisc=apertium [09:10:15] !log jayme@cumin1001 conftool action : set/pooled=false; selector: name=codfw,dnsdisc=api-gateway [09:10:16] !log jayme@cumin1001 conftool action : set/pooled=false; selector: name=codfw,dnsdisc=blubberoid [09:10:16] !log jayme@cumin1001 conftool action : set/pooled=false; selector: name=codfw,dnsdisc=citoid [09:10:16] !log jayme@cumin1001 conftool action : set/pooled=false; selector: name=codfw,dnsdisc=cxserver [09:10:17] !log jayme@cumin1001 conftool action : set/pooled=false; selector: name=codfw,dnsdisc=echostore [09:10:17] !log jayme@cumin1001 conftool action : set/pooled=false; selector: name=codfw,dnsdisc=eventgate-analytics [09:10:17] 
!log jayme@cumin1001 conftool action : set/pooled=false; selector: name=codfw,dnsdisc=eventgate-analytics-external [09:10:18] !log jayme@cumin1001 conftool action : set/pooled=false; selector: name=codfw,dnsdisc=eventgate-logging-external [09:10:18] !log jayme@cumin1001 conftool action : set/pooled=false; selector: name=codfw,dnsdisc=eventgate-main [09:10:18] !log jayme@cumin1001 conftool action : set/pooled=false; selector: name=codfw,dnsdisc=eventstreams [09:10:19] !log jayme@cumin1001 conftool action : set/pooled=false; selector: name=codfw,dnsdisc=eventstreams-internal [09:10:19] !log jayme@cumin1001 conftool action : set/pooled=false; selector: name=codfw,dnsdisc=linkrecommendation [09:10:20] !log jayme@cumin1001 conftool action : set/pooled=false; selector: name=codfw,dnsdisc=mathoid [09:10:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:20] !log jayme@cumin1001 conftool action : set/pooled=false; selector: name=codfw,dnsdisc=mobileapps [09:10:20] !log jayme@cumin1001 conftool action : set/pooled=false; selector: name=codfw,dnsdisc=proton [09:10:21] !log jayme@cumin1001 conftool action : set/pooled=false; selector: name=codfw,dnsdisc=push-notifications [09:10:21] !log jayme@cumin1001 conftool action : set/pooled=false; selector: name=codfw,dnsdisc=recommendation-api [09:10:22] !log jayme@cumin1001 conftool action : set/pooled=false; selector: name=codfw,dnsdisc=sessionstore [09:10:22] !log jayme@cumin1001 conftool action : set/pooled=false; selector: name=codfw,dnsdisc=similar-users [09:10:23] !log jayme@cumin1001 conftool action : set/pooled=false; selector: name=codfw,dnsdisc=termbox [09:10:23] !log jayme@cumin1001 conftool action : set/pooled=false; selector: name=codfw,dnsdisc=wikifeeds [09:10:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:38] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:57] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=codfw,dnsdisc=restbase-async [09:13:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:35] !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=restbase-async [09:15:40] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:03] (03CR) 10Jcrespo: "You can use the "resolved" feature in any order you find useful to you, no problem with that." [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/672150 (owner: 10H.krishna123) [09:17:13] (03CR) 10Jbond: [C: 03+2] zpapierski home: Start tmux on connect [puppet] - 10https://gerrit.wikimedia.org/r/670943 (owner: 10Ebernhardson) [09:17:52] thanks, jbond42 [09:18:10] no probs, merging now; it will take ~30 mins to fully propagate [09:18:19] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={ircd,netbox_device_statistics} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:18:21] !log switch restbase-async to eqiad since the kubernetes codfw cluster is being reinitialized and it makes little sense to have it there while the callers will run in eqiad only [09:18:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:28] (03CR) 10Jcrespo: "recheck" [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/671942 (https://phabricator.wikimedia.org/T277162) (owner: 10Rohitesh20) [09:19:08] (03CR) 10jerkins-bot: [V: 04-1] Add logger to recover-dump::to indicate actions taken [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/671942 (https://phabricator.wikimedia.org/T277162) (owner: 10Rohitesh20) [09:20:40] (03CR) 10Jcrespo: "./wmfbackups/cli/recover_dump.py:165:5: F841 local variable 'logger' is assigned to but never used" [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/671942 (https://phabricator.wikimedia.org/T277162) (owner: 10Rohitesh20) [09:21:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 20%: Slowly repool db1162', diff saved to https://phabricator.wikimedia.org/P14880 and previous config saved to /var/cache/conftool/dbconfig/20210316-092106-root.json [09:21:11] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:16] (03CR) 10Jcrespo: "recheck" [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/672150 (owner: 10H.krishna123) [09:22:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1076 (re)pooling @ 100%: Slowly repool db1076', diff saved to https://phabricator.wikimedia.org/P14881 and previous config saved to /var/cache/conftool/dbconfig/20210316-092204-root.json [09:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:15] (03PS2) 10Hashar: contint: use Java 11 on Jenkins agents [puppet] - 10https://gerrit.wikimedia.org/r/672658 (https://phabricator.wikimedia.org/T269354) [09:23:17] (03PS1) 10Hashar: puppet_compiler: stop installing openjdk [puppet] - 10https://gerrit.wikimedia.org/r/672660 (https://phabricator.wikimedia.org/T269354) [09:23:31] (03CR) 10Jcrespo: "If this helps, the tests run correctly (although I haven't had yet a deep check at them)." [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/672150 (owner: 10H.krishna123) [09:24:58] (03PS3) 10Muehlenhoff: profile::kerberos::client: Default to use DNS canonicalisation [puppet] - 10https://gerrit.wikimedia.org/r/671130 (https://phabricator.wikimedia.org/T257412) [09:28:10] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/671130 (https://phabricator.wikimedia.org/T257412) (owner: 10Muehlenhoff) [09:31:58] (03CR) 10DCausse: create helmfile.d structure (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) (owner: 10Mstyles) [09:32:09] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:34:49] !log poweroff acrux and acrab T277191 [09:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:56] T277191: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 [09:36:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 25%: Slowly repool db1162', diff saved to https://phabricator.wikimedia.org/P14883 and previous config saved to /var/cache/conftool/dbconfig/20210316-093609-root.json [09:36:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:45] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:38:11] (03CR) 10Jbond: "> Patch Set 3:" [software/netbox] - 10https://gerrit.wikimedia.org/r/672548 (https://phabricator.wikimedia.org/T244849) (owner: 10CRusnov) [09:39:05] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:39:37] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - kubemaster_6443: Servers acrab.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:40:04] oops, fixing [09:40:18] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubetcd2004.codfw.wmnet [09:40:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:13] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - kubemaster_6443: Servers acrab.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:41:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1099:3318 for schema change', diff saved to https://phabricator.wikimedia.org/P14884 and previous config saved to /var/cache/conftool/dbconfig/20210316-094117-marostegui.json [09:41:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:36] (03PS2) 10JMeybohm: kubernetes codfw: Populate new worker hiera keys for k8s update [puppet] - 10https://gerrit.wikimedia.org/r/671174 (https://phabricator.wikimedia.org/T277191) [09:43:39] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:43:43] (03PS3) 10JMeybohm: kubernetes codfw: Apply role/hiera to new masters [puppet] - 10https://gerrit.wikimedia.org/r/671171 (https://phabricator.wikimedia.org/T277191) [09:44:05] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([acrab.codfw.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [09:44:21] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
kubetcd2004.codfw.wmnet [09:44:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:31] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubetcd2005.codfw.wmnet [09:44:31] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([acrab.codfw.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [09:44:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:17] PROBLEM - Prometheus k8s cache not updating on prometheus2003 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23k8s_cache_not_updating https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus2003&var-datasource=codfw+prometheus/ops [09:45:26] (03PS1) 10David Caro: ceph-common: add fio for testing and debugging [puppet] - 10https://gerrit.wikimedia.org/r/672670 (https://phabricator.wikimedia.org/T273649) [09:45:57] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:46:10] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubetcd2006.codfw.wmnet [09:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:43] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubetcd2005.codfw.wmnet [09:46:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:01] PROBLEM - puppet last run on sretest1002 is CRITICAL: CRITICAL: Puppet has been disabled for longer than 86400 seconds, message: test puppet deactivations alerts re-enable after 22/03/21 - jbond, last run 1 day ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [09:47:09] PROBLEM - Prometheus k8s cache not updating on prometheus2004 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23k8s_cache_not_updating https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus2004&var-datasource=codfw+prometheus/ops [09:47:30] (03PS4) 10Muehlenhoff: profile::kerberos::client: Default to use DNS canonicalisation [puppet] - 10https://gerrit.wikimedia.org/r/671130 (https://phabricator.wikimedia.org/T257412) [09:47:57] (03PS1) 10Marostegui: install_server: Reimage db1161 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/672671 (https://phabricator.wikimedia.org/T258361) [09:48:44] !log drain ganeti2011 [09:48:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:07] (03PS1) 10Alexandros Kosiaris: Add new kubemaster.svc.codfw.wmnet cert [puppet] - 10https://gerrit.wikimedia.org/r/672672 (https://phabricator.wikimedia.org/T277191) [09:49:10] * jbond42 looking at sretest that shouldn't have gone critical for another 6 days [09:49:17] prometheus is us as well, k8s masters down in codfw. 
Acknowledged [09:49:57] (03CR) 10Marostegui: [C: 03+2] install_server: Reimage db1161 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/672671 (https://phabricator.wikimedia.org/T258361) (owner: 10Marostegui) [09:50:12] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubetcd2006.codfw.wmnet [09:50:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:30] 10SRE, 10cloud-services-team (Kanban): WMCS standalone puppet master does not lookup cherry picked hiera change - https://phabricator.wikimedia.org/T277526 (10hashar) [09:50:39] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:51:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 49%: Slowly repool db1162', diff saved to https://phabricator.wikimedia.org/P14885 and previous config saved to /var/cache/conftool/dbconfig/20210316-095113-root.json [09:51:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:39] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ganeti2011.codfw.wmnet [09:51:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:08] 10SRE, 10cloud-services-team (Kanban): WMCS standalone puppet master does not lookup cherry picked hiera change - https://phabricator.wikimedia.org/T277526 (10hashar) [09:53:08] 10Puppet, 10SRE, 10Beta-Cluster-Infrastructure, 10cloud-services-team (Kanban): WMCS standalone puppet master does not lookup cherry picked hiera change - https://phabricator.wikimedia.org/T277526 (10Majavah) [09:53:26] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10ops-monitoring-bot) Script wmf-auto-reimage was launched 
by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1161.eqiad.wmnet'] ` The log ca... [09:55:08] (03CR) 10Alexandros Kosiaris: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/671174 (https://phabricator.wikimedia.org/T277191) (owner: 10JMeybohm) [09:55:17] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:56:19] (03CR) 10Hashar: "The puppet compiler profile includes ::profile::ci::slave::labs::commons . With https://gerrit.wikimedia.org/r/672658 I have added ::prof" [puppet] - 10https://gerrit.wikimedia.org/r/672660 (https://phabricator.wikimedia.org/T269354) (owner: 10Hashar) [09:56:43] (03CR) 10Alexandros Kosiaris: [C: 03+2] Add new kubemaster.svc.codfw.wmnet cert [puppet] - 10https://gerrit.wikimedia.org/r/672672 (https://phabricator.wikimedia.org/T277191) (owner: 10Alexandros Kosiaris) [09:57:54] (03CR) 10Hashar: "I have cherry picked it on the deployment-prep puppet master but the value is not looked up by the Puppet master ( T277526 ). 
So that has" [puppet] - 10https://gerrit.wikimedia.org/r/672658 (https://phabricator.wikimedia.org/T269354) (owner: 10Hashar) [09:59:17] !log Push new certs for kubemaster.svc.codfw.wmnet - T277191 [09:59:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:27] T277191: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 [10:00:04] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2011.codfw.wmnet [10:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:45] !log drain ganeti2012 [10:03:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:02] (03PS1) 10Jbond: base::check_puppetrun fix warning vs critical state [puppet] - 10https://gerrit.wikimedia.org/r/672677 [10:04:43] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:05:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1161.eqiad.wmnet with reason: REIMAGE [10:05:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 50%: Slowly repool db1162', diff saved to https://phabricator.wikimedia.org/P14886 and previous config saved to /var/cache/conftool/dbconfig/20210316-100617-root.json [10:06:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:03] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:07:26] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1161.eqiad.wmnet with reason: REIMAGE [10:07:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:39] (03PS1) 10Muehlenhoff: Switch the IDPs to the serial Memcached transcoder [puppet] - 10https://gerrit.wikimedia.org/r/672679 (https://phabricator.wikimedia.org/T273867) [10:07:43] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ganeti2012.codfw.wmnet [10:07:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:45] (03CR) 10Kormat: [C: 03+1] xhgui: add dummy admin password [labs/private] - 10https://gerrit.wikimedia.org/r/672461 (https://phabricator.wikimedia.org/T260640) (owner: 10Dave Pifke) [10:09:20] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:09:24] PROBLEM - rpki grafana alert on alert1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: rsync status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [10:09:47] (03CR) 10Jbond: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/672660 (https://phabricator.wikimedia.org/T269354) (owner: 10Hashar) [10:10:24] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:12:40] (03CR) 10Jbond: [C: 03+2] base::check_puppetrun fix warning vs critical state [puppet] - 10https://gerrit.wikimedia.org/r/672677 (owner: 10Jbond) [10:13:38] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={ircd,netbox_device_statistics} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:14:35] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/672679 (https://phabricator.wikimedia.org/T273867) (owner: 10Muehlenhoff) [10:14:39] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1161.eqiad.wmnet'] ` and were **ALL** successful. [10:15:09] <_joe_> is ircd down in codfw? [10:15:31] (03CR) 10JMeybohm: [C: 03+2] kubernetes codfw: Populate new worker hiera keys for k8s update [puppet] - 10https://gerrit.wikimedia.org/r/671174 (https://phabricator.wikimedia.org/T277191) (owner: 10JMeybohm) [10:15:34] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2012.codfw.wmnet [10:15:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:50] <_joe_> moritzm: can be related to what you've just done? [10:15:54] (03CR) 10Klausman: [C: 03+1] profile::docker::engine: add default to the version parameter [puppet] - 10https://gerrit.wikimedia.org/r/672647 (owner: 10Elukey) [10:16:50] _joe_: no, don't think so, kraz is running on ganeti2020, which wasn't touched at all [10:17:56] irc.wikimedia.org works for me, though? 
[10:18:08] just connected to #de.wikipedia and can see changes fly by [10:18:29] _joe_: i remember seeing an alert about ircd in codfw yesterday or so, too [10:19:09] <_joe_> yeah I was trying to understand what that alert was about [10:20:26] Cole recently added some alerting, maybe 1-2 weeks ago, maybe that needs finetuning of sorts [10:21:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 60%: Slowly repool db1162', diff saved to https://phabricator.wikimedia.org/P14887 and previous config saved to /var/cache/conftool/dbconfig/20210316-102121-root.json [10:21:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:06] what was the error, host unreachability or "ircecho is not relaying", currently kraz seems all fine in Icinga [10:22:23] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:23:49] ah yes, the "PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={irc" one [10:24:31] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:25:41] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:26:15] (03CR) 10David Caro: paws: block using the Jupyterhub from Tor (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/671286 (https://phabricator.wikimedia.org/T276615) (owner: 10Bstorm) [10:26:18] that sounds like an error in the new metrics collection, I'll reopen the task [10:26:45] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/appservers-https on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:26:53] RECOVERY - Confd template for /srv/config-master/pybal/codfw/apaches on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:26:57] RECOVERY - Confd template for /srv/config-master/pybal/codfw/appservers-https on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:27:03] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/apaches on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:28:04] (03CR) 10Alexandros Kosiaris: [C: 03+2] kubernetes codfw: Apply role/hiera to new masters [puppet] - 10https://gerrit.wikimedia.org/r/671171 (https://phabricator.wikimedia.org/T277191) (owner: 10JMeybohm) [10:28:11] (03PS4) 10Alexandros Kosiaris: kubernetes codfw: Apply role/hiera to new masters [puppet] - 10https://gerrit.wikimedia.org/r/671171 (https://phabricator.wikimedia.org/T277191) (owner: 10JMeybohm) [10:30:19] <_joe_> volans: it looks like the decom script doesn't ensure the server is not in conftool anymore [10:30:51] <_joe_> that caused the failures above, the server was set to inactive (while it should really be removed completely) after the decom script ran [10:31:04] 10SRE, 10IRCecho, 10Icinga, 10observability: Icinga check for ircecho should check for actual activity - 
https://phabricator.wikimedia.org/T216611 (10MoritzMuehlenhoff) This has been flapping in Icinga, e.g. for today: ` [11:13] PROBLEM - Prometheus jobs reduced availability on alert1001 is CR... [10:31:35] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:31:43] RECOVERY - rpki grafana alert on alert1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [10:31:55] 10SRE, 10IRCecho, 10Icinga, 10observability: Icinga check for ircecho should check for actual activity - https://phabricator.wikimedia.org/T216611 (10MoritzMuehlenhoff) 05Resolved→03Open [10:33:42] 10SRE, 10SRE-Access-Requests: Add samwalton to analytics-privatedata-users - https://phabricator.wikimedia.org/T277298 (10Volans) @nshahquinn-wmf , @Samwalton9 : thanks for the additional info, I've spoke with analytics and indeed the membership of the `analytics-privatedata-users` group is needed to access so... [10:34:46] 10Puppet, 10SRE, 10Beta-Cluster-Infrastructure, 10cloud-services-team (Kanban): WMCS standalone puppet master does not lookup cherry picked hiera change - https://phabricator.wikimedia.org/T277526 (10jbond) @hashar tl;dr the you should place the value in `hieradata/cloud.yaml` to set a sane default for all... [10:36:11] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:36:25] _joe_: no, the decom cookbook right now is not touching confctl at all, it's supposed to be run after the host has been removed from production already, and that changes greatly between hosts. 
If there is an additional check that we can add in conftool to prevent this happy to add it (patches are welcome too ;) ) [10:36:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 75%: Slowly repool db1162', diff saved to https://phabricator.wikimedia.org/P14889 and previous config saved to /var/cache/conftool/dbconfig/20210316-103625-root.json [10:36:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:55] (03PS1) 10Muehlenhoff: Add irc2001.wikimedia.org (running buster) as second irc server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672687 (https://phabricator.wikimedia.org/T224579) [10:39:45] <_joe_> volans: I think I'll settle for a task [10:40:03] _joe_: it also presents the user a list of references in the puppet public/private repos as well as the mediawiki one so that the user is aware of leftovers [10:40:25] <_joe_> volans: so conftool-data would be caught, right? [10:40:28] yes [10:40:35] it should, if not is a bug [10:40:35] <_joe_> ok then disregard all [10:40:43] <_joe_> yeah dunno about that [10:40:51] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/kubemaster on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/kubemaster is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:41:02] if you have a hostname I can check the logs [10:41:11] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/kubemaster on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/kubemaster is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:41:28] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [10:41:31] PROBLEM - Confd template for /srv/config-master/pybal/codfw/kubemaster on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/kubemaster is 
broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:41:56] akosiaris: ^ thats probably the last change, right? [10:43:56] (03PS2) 10David Caro: ceph-common: add fio for testing and debugging [puppet] - 10https://gerrit.wikimedia.org/r/672670 (https://phabricator.wikimedia.org/T273649) [10:44:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3318 (re)pooling @ 25%: Slowly repool db1099:3318', diff saved to https://phabricator.wikimedia.org/P14890 and previous config saved to /var/cache/conftool/dbconfig/20210316-104420-root.json [10:44:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:30] (03CR) 10David Caro: "Added very simple tests to make sure nothing explodes." [puppet] - 10https://gerrit.wikimedia.org/r/672670 (https://phabricator.wikimedia.org/T273649) (owner: 10David Caro) [10:44:34] (03CR) 10jerkins-bot: [V: 04-1] ceph-common: add fio for testing and debugging [puppet] - 10https://gerrit.wikimedia.org/r/672670 (https://phabricator.wikimedia.org/T273649) (owner: 10David Caro) [10:44:38] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/671130 (https://phabricator.wikimedia.org/T257412) (owner: 10Muehlenhoff) [10:46:51] !log jayme@cumin1001 conftool action : set/pooled=inactive; selector: dc=codfw,service=kubesvc,name=kubernetes2016.codfw.wmnet [10:46:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:01] !log jayme@cumin1001 conftool action : set/pooled=inactive; selector: dc=codfw,service=kubesvc,name=kubernetes2015.codfw.wmnet [10:47:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:21] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2001.codfw.wmnet with reason: REIMAGE [10:47:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:55] (03PS1) 10Giuseppe Lavagetto: profile::discovery::client: remove services filename [puppet] - 
10https://gerrit.wikimedia.org/r/672689 [10:49:20] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2002.codfw.wmnet with reason: REIMAGE [10:49:27] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2001.codfw.wmnet with reason: REIMAGE [10:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:46] !log jayme@cumin1001 conftool action : set/pooled=inactive; selector: dc=codfw,service=kubesvc,name=kubernetes2005.codfw.wmnet [10:49:52] !log jayme@cumin1001 conftool action : set/pooled=inactive; selector: dc=codfw,service=kubesvc,name=kubernetes2006.codfw.wmnet [10:49:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:08] PROBLEM - Check systemd state on kubemaster2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:51:12] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28627/console" [puppet] - 10https://gerrit.wikimedia.org/r/672689 (owner: 10Giuseppe Lavagetto) [10:51:22] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2003.codfw.wmnet with reason: REIMAGE [10:51:27] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2002.codfw.wmnet with reason: REIMAGE [10:51:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 100%: Slowly repool db1162', diff saved to https://phabricator.wikimedia.org/P14891 and previous config saved to /var/cache/conftool/dbconfig/20210316-105128-root.json [10:51:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:34] (03PS1) 10Alexandros Kosiaris: Correctly add new kubemaster.svc.codfw.wmnet cert [puppet] - 10https://gerrit.wikimedia.org/r/672690 (https://phabricator.wikimedia.org/T277191) [10:51:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:55] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Correctly add new kubemaster.svc.codfw.wmnet cert [puppet] - 10https://gerrit.wikimedia.org/r/672690 (https://phabricator.wikimedia.org/T277191) (owner: 10Alexandros Kosiaris) [10:51:58] (03CR) 10jerkins-bot: [V: 04-1] Correctly add new kubemaster.svc.codfw.wmnet cert [puppet] - 10https://gerrit.wikimedia.org/r/672690 (https://phabricator.wikimedia.org/T277191) (owner: 10Alexandros Kosiaris) [10:52:23] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2007.codfw.wmnet with reason: REIMAGE [10:52:30] PROBLEM - Confd template for /srv/config-master/pybal/codfw/eventstreams on puppetmaster1001 is CRITICAL: 
Compilation of file /srv/config-master/pybal/codfw/eventstreams is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:52:30] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/sessionstore on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/sessionstore is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:52:30] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/eventgate-main on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/eventgate-main is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:52:30] PROBLEM - Confd template for /srv/config-master/pybal/codfw/apertium on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/apertium is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:52:30] PROBLEM - Confd template for /srv/config-master/pybal/codfw/similar-users on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/similar-users is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:52:30] PROBLEM - Confd template for /srv/config-master/pybal/codfw/wikifeeds on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/wikifeeds is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:52:30] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/eventgate-analytics-external on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/eventgate-analytics-external is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:52:31] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/mathoid on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/mathoid is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:52:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:36] PROBLEM - Confd template for 
/srv/config-master/pybal/eqiad/cxserver on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/cxserver is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:52:36] PROBLEM - Confd template for /srv/config-master/pybal/codfw/citoid on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/citoid is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:52:36] PROBLEM - Confd template for /srv/config-master/pybal/codfw/termbox on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/termbox is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:52:45] (03PS2) 10Alexandros Kosiaris: Correctly add new kubemaster.svc.codfw.wmnet cert [puppet] - 10https://gerrit.wikimedia.org/r/672690 (https://phabricator.wikimedia.org/T277191) [10:52:52] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/linkrecommendation on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/linkrecommendation is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:53:08] PROBLEM - Confd template for /srv/config-master/pybal/codfw/eventgate-logging-external on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/eventgate-logging-external is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:53:08] PROBLEM - Confd template for /srv/config-master/pybal/codfw/eventgate-main on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/eventgate-main is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:53:18] PROBLEM - Confd template for /srv/config-master/pybal/codfw/mobileapps on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/mobileapps is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:53:30] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
kubernetes2003.codfw.wmnet with reason: REIMAGE [10:53:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:45] (03PS5) 10Muehlenhoff: profile::kerberos::client: Default to use DNS canonicalisation [puppet] - 10https://gerrit.wikimedia.org/r/671130 (https://phabricator.wikimedia.org/T257412) [10:53:52] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] "Per pcc, the change seems to DTRT" [puppet] - 10https://gerrit.wikimedia.org/r/672689 (owner: 10Giuseppe Lavagetto) [10:54:24] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2008.codfw.wmnet with reason: REIMAGE [10:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:03] _joe_: I can confirm from logs that the command run should have output among the other things matching: [10:55:06] conftool-data/node/codfw.yaml: mw2225.codfw.wmnet: [apache2,nginx] [10:55:16] (as it still does) [10:55:26] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2004.codfw.wmnet with reason: REIMAGE [10:55:31] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2007.codfw.wmnet with reason: REIMAGE [10:55:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:51] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2009.codfw.wmnet with reason: REIMAGE [10:55:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:25] (03PS3) 10Jbond: ceph-common: add fio for testing and debugging [puppet] - 10https://gerrit.wikimedia.org/r/672670 (https://phabricator.wikimedia.org/T273649) (owner: 10David Caro) [10:57:30] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2008.codfw.wmnet with reason: REIMAGE [10:57:35] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:40] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/671130 (https://phabricator.wikimedia.org/T257412) (owner: 10Muehlenhoff) [10:58:25] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2010.codfw.wmnet with reason: REIMAGE [10:58:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:40] (03PS4) 10KartikMistry: WIP: Update cxserver to 2021-03-15-131520-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/672386 (https://phabricator.wikimedia.org/T271711) [10:59:02] PROBLEM - Confd template for /srv/config-master/pybal/codfw/eventstreams-internal on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/eventstreams-internal is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:59:02] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/similar-users on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/similar-users is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:59:02] PROBLEM - Confd template for /srv/config-master/pybal/codfw/api-gateway on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/api-gateway is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:59:02] PROBLEM - Confd template for /srv/config-master/pybal/codfw/recommendation-api on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/recommendation-api is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:59:22] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2011.codfw.wmnet with reason: REIMAGE [10:59:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3318 (re)pooling @ 50%: Slowly repool db1099:3318', diff saved to https://phabricator.wikimedia.org/P14892 and previous config saved to 
/var/cache/conftool/dbconfig/20210316-105924-root.json [10:59:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:29] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2009.codfw.wmnet with reason: REIMAGE [10:59:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:39] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubernetes2017.codfw.wmnet [10:59:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: Your horoscope predicts another unfortunate European mid-day backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210316T1100). [11:00:04] kart_ and MatmaRex: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:22] hi [11:00:26] * kart_ is here [11:00:27] !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes2004.codfw.wmnet with reason: REIMAGE [11:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:33] I’m in a meeting, is someone else around to run the window? 
[11:00:49] i can deploy today [11:00:51] cc Lucas_WMDE [11:00:59] cool, thanks :) [11:01:00] PROBLEM - Confd template for /srv/config-master/pybal/codfw/mobileapps on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/mobileapps is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:01:00] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/wikifeeds on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/wikifeeds is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:01:00] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/zotero on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/zotero is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:01:02] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/proton on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/proton is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:01:10] PROBLEM - Confd template for /srv/config-master/pybal/codfw/push-notifications on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/push-notifications is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:01:31] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2011.codfw.wmnet with reason: REIMAGE [11:01:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:14] (03PS2) 10Urbanecm: Enable ContentTranslation as a default tool in Amharic, Maltese and Uzbek Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672391 (https://phabricator.wikimedia.org/T276765) (owner: 10KartikMistry) [11:02:20] (03CR) 10Urbanecm: [C: 03+2] Enable ContentTranslation as a default tool in Amharic, Maltese and Uzbek Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672391 
(https://phabricator.wikimedia.org/T276765) (owner: 10KartikMistry) [11:02:25] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2012.codfw.wmnet with reason: REIMAGE [11:02:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:38] oh it's now. I'm in meeting the whole hour :( [11:03:00] PROBLEM - Confd template for /srv/config-master/pybal/codfw/sessionstore on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/sessionstore is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:00] PROBLEM - Confd template for /srv/config-master/pybal/codfw/wikifeeds on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/wikifeeds is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:00] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/push-notifications on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/push-notifications is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:00] Amir1: will you have time for new wikis in an hour, as scheduled? [11:03:12] yeah yeah [11:03:14] good [11:03:25] !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes2010.codfw.wmnet with reason: REIMAGE [11:03:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:46] (03Merged) 10jenkins-bot: Enable ContentTranslation as a default tool in Amharic, Maltese and Uzbek Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672391 (https://phabricator.wikimedia.org/T276765) (owner: 10KartikMistry) [11:04:04] PROBLEM - Confd template for /srv/config-master/pybal/codfw/kubemaster on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/kubemaster is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:04:20] kart_: can you test on mwdebug1001, please? 
[11:04:26] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2013.codfw.wmnet with reason: REIMAGE [11:04:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:32] Urbanecm: sure. [11:04:41] (03PS2) 10Urbanecm: Enable DiscussionTools' beta features on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672507 (https://phabricator.wikimedia.org/T273146) (owner: 10Bartosz Dziewoński) [11:04:46] (03CR) 10Urbanecm: [C: 03+2] Enable DiscussionTools' beta features on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672507 (https://phabricator.wikimedia.org/T273146) (owner: 10Bartosz Dziewoński) [11:05:02] PROBLEM - Confd template for /srv/config-master/pybal/codfw/apertium on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/apertium is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:05:04] PROBLEM - Confd template for /srv/config-master/pybal/codfw/similar-users on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/similar-users is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:05:04] PROBLEM - Confd template for /srv/config-master/pybal/codfw/zotero on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/zotero is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:05:04] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/termbox on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/termbox is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:05:04] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/cxserver on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/cxserver is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:05:04] PROBLEM - Confd template for /srv/config-master/pybal/codfw/blubberoid on puppetmaster2001 is CRITICAL: 
Compilation of file /srv/config-master/pybal/codfw/blubberoid is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:05:04] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/recommendation-api on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/recommendation-api is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:05:23] jayme: are the icinga alerts related to your work? I'd like to do a MW deploys, and I'd like to make sure they're safe to ignore. [11:05:25] (for me) [11:05:30] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/citoid on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/citoid is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:05:39] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2012.codfw.wmnet with reason: REIMAGE [11:05:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:52] (03Merged) 10jenkins-bot: Enable DiscussionTools' beta features on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672507 (https://phabricator.wikimedia.org/T273146) (owner: 10Bartosz Dziewoński) [11:06:19] Urbanecm: looks good! 
[11:06:23] thanks [11:06:26] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2014.codfw.wmnet with reason: REIMAGE [11:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:48] waiting for a comment regarding the alerts [11:07:15] PROBLEM - Confd template for /srv/config-master/pybal/codfw/api on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/api is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:07:15] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/echostore on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/echostore is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:07:15] PROBLEM - Confd template for /srv/config-master/pybal/codfw/citoid on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/citoid is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:07:17] PROBLEM - Confd template for /srv/config-master/pybal/codfw/zotero on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/zotero is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:07:39] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/mobileapps on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/mobileapps is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:07:49] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2013.codfw.wmnet with reason: REIMAGE [11:07:51] Urbanecm: it seems regarding reimaging of services on codfw: [Wikitech-l] codfw kubernetes cluster upgrade this week [11:07:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:09] !log akosiaris@cumin1001 conftool action : set/weight=10; selector: dc=codfw,service=kubemaster,name=.*,cluster=kubernetes [11:08:12] PROBLEM - Host 
kubernetes2010 is DOWN: PING CRITICAL - Packet loss = 100% [11:08:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:18] possibly, but this amount of alerts makes it harder to see alerts that _are_ related to deploys [11:08:22] !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,service=kubemaster,name=.*,cluster=kubernetes [11:08:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:33] RECOVERY - Host kubernetes2010 is UP: PING OK - Packet loss = 0%, RTA = 31.74 ms [11:09:29] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/api on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/api is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:09:29] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/eventgate-analytics on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/eventgate-analytics is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:09:29] PROBLEM - Confd template for /srv/config-master/pybal/codfw/linkrecommendation on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/linkrecommendation is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:09:29] PROBLEM - Confd template for /srv/config-master/pybal/codfw/cxserver on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/cxserver is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:09:29] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/linkrecommendation-external on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/linkrecommendation-external is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:09:29] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/zotero on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/zotero is 
broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:10:04] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2014.codfw.wmnet with reason: REIMAGE [11:10:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:30] akosiaris ^ Is it safe to deploy mediawiki-config deployment with above alerts? [11:10:34] akosiaris: jayme: those alerts make it hard for me to see alerts that are (or might be) related to my deploys. If they're ignorable, can they be silenced or something? [11:11:08] kart_: yes [11:11:26] Urbanecm: yeah let me do that [11:11:41] thanks a lot akosiaris :) [11:11:42] PROBLEM - Confd template for /srv/config-master/pybal/codfw/proton on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/proton is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:11:42] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/proton on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/proton is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:11:42] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/eventgate-logging-external on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/eventgate-logging-external is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:11:44] PROBLEM - Confd template for /srv/config-master/pybal/codfw/linkrecommendation-external on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/linkrecommendation-external is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:11:44] PROBLEM - Confd template for /srv/config-master/pybal/codfw/sessionstore on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/sessionstore is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:11:44] PROBLEM - Confd template for 
/srv/config-master/pybal/eqiad/apertium on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/apertium is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:11:44] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/eventgate-analytics on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/eventgate-analytics is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:11:48] doing the sync kart_ [11:12:10] PROBLEM - Confd template for /srv/config-master/pybal/codfw/eventgate-analytics-external on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/eventgate-analytics-external is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:12:52] 10SRE, 10netbox, 10Patch-For-Review: Add SSO support to netbox - https://phabricator.wikimedia.org/T244849 (10MoritzMuehlenhoff) >>! In T244849#6915834, @Volans wrote: > I think that it's required to avoid the security issue of a user removed from an LDAP group keeping the previous access and the usability i... 
[11:13:16] ACKNOWLEDGEMENT - Confd template for /srv/config-master/pybal/codfw/apertium on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/apertium is broken alexandros kosiaris codfw k8s reinit https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:13:16] ACKNOWLEDGEMENT - Confd template for /srv/config-master/pybal/codfw/api on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/api is broken alexandros kosiaris codfw k8s reinit https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:13:16] ACKNOWLEDGEMENT - Confd template for /srv/config-master/pybal/codfw/api-gateway on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/api-gateway is broken alexandros kosiaris codfw k8s reinit https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:13:16] ACKNOWLEDGEMENT - Confd template for /srv/config-master/pybal/codfw/blubberoid on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/blubberoid is broken alexandros kosiaris codfw k8s reinit https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:13:16] ACKNOWLEDGEMENT - Confd template for /srv/config-master/pybal/codfw/citoid on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/citoid is broken alexandros kosiaris codfw k8s reinit https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:13:19] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 835f9ab9fb107a339e6a9dcc008c9626ba66853e: Enable ContentTranslation as a default tool in Amharic, Maltese and Uzbek Wikipedias (T276765) (duration: 01m 00s) [11:13:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:27] T276765: Enable Content Translation in 3 Wikipedia as a default tool - https://phabricator.wikimedia.org/T276765 [11:13:31] kart_: done [11:13:45] MatmaRex: first one pulled onto mwdebug1001, please test [11:13:55] (03PS2) 10Urbanecm: Enable 
DiscussionTools' beta features on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672508 (https://phabricator.wikimedia.org/T276493) (owner: 10Bartosz Dziewoński) [11:13:57] Urbanecm: Thanks!! [11:14:00] (03CR) 10Urbanecm: [C: 03+2] Enable DiscussionTools' beta features on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672508 (https://phabricator.wikimedia.org/T276493) (owner: 10Bartosz Dziewoński) [11:14:02] np kart_ [11:14:06] thanks, looking [11:14:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3318 (re)pooling @ 75%: Slowly repool db1099:3318', diff saved to https://phabricator.wikimedia.org/P14893 and previous config saved to /var/cache/conftool/dbconfig/20210316-111427-root.json [11:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:36] Urbanecm: seems fine [11:14:45] thanks, syncing [11:14:50] (03Merged) 10jenkins-bot: Enable DiscussionTools' beta features on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672508 (https://phabricator.wikimedia.org/T276493) (owner: 10Bartosz Dziewoński) [11:15:00] Urbanecm: we can probably do all the other patches at once, they all make similar changes to different wikis [11:15:13] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubernetes2005.codfw.wmnet [11:15:16] MatmaRex: ack, let me merge 'em all [11:15:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:30] (03PS2) 10Urbanecm: Enable DiscussionTools' beta features on almost all remaining projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672509 (https://phabricator.wikimedia.org/T276498) (owner: 10Bartosz Dziewoński) [11:15:35] (03PS3) 10Urbanecm: Make DiscussionTools' replytool available for everyone on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672510 (https://phabricator.wikimedia.org/T277103) (owner: 10Bartosz Dziewoński) [11:15:37] (03CR) 10Urbanecm: [C: 03+2] Enable DiscussionTools' 
beta features on almost all remaining projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672509 (https://phabricator.wikimedia.org/T276498) (owner: 10Bartosz Dziewoński) [11:15:42] (03CR) 10Urbanecm: [C: 03+2] Make DiscussionTools' replytool available for everyone on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672510 (https://phabricator.wikimedia.org/T277103) (owner: 10Bartosz Dziewoński) [11:16:09] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: f0d546502458437ae7b050c3f4bdb5f5a67a9529: Enable DiscussionTools beta features on enwiki (T273146) (duration: 00m 58s) [11:16:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:16] T273146: Make the Reply and New Discussion Tools available as opt-in Beta Features at en.wiki - https://phabricator.wikimedia.org/T273146 [11:16:18] first one synced [11:16:20] PROBLEM - Confd template for /srv/config-master/pybal/codfw/cxserver on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/cxserver is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:16:20] PROBLEM - Confd template for /srv/config-master/pybal/codfw/recommendation-api on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/recommendation-api is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:16:22] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/blubberoid on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/blubberoid is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:16:22] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/eventstreams-internal on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/eventstreams-internal is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:16:22] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/linkrecommendation-external on 
puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/linkrecommendation-external is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:16:22] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/recommendation-api on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/recommendation-api is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:16:24] PROBLEM - Confd template for /srv/config-master/pybal/codfw/eventgate-analytics on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/eventgate-analytics is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:16:24] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/api-gateway on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/api-gateway is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:16:24] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/eventgate-main on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/eventgate-main is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:16:33] akosiaris: apparently the silencing did not work properly ^^ :-) [11:16:51] (03Merged) 10jenkins-bot: Enable DiscussionTools' beta features on almost all remaining projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672509 (https://phabricator.wikimedia.org/T276498) (owner: 10Bartosz Dziewoński) [11:16:54] (03Merged) 10jenkins-bot: Make DiscussionTools' replytool available for everyone on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672510 (https://phabricator.wikimedia.org/T277103) (owner: 10Bartosz Dziewoński) [11:17:16] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubernetes2017.codfw.wmnet [11:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:21] MatmaRex: all of them are at 
mwdebug1001, can you test? [11:17:43] looking [11:17:59] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubernetes2005.codfw.wmnet [11:18:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:10] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/citoid on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/citoid is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:19:10] PROBLEM - Confd template for /srv/config-master/pybal/codfw/eventstreams on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/eventstreams is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:19:12] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/eventstreams on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/eventstreams is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:19:12] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/sessionstore on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/sessionstore is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:19:12] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/wikifeeds on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/wikifeeds is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:19:25] (03PS1) 10Muehlenhoff: Switch java::package to ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/672693 [11:19:37] (03CR) 10Muehlenhoff: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/672660 (https://phabricator.wikimedia.org/T269354) (owner: 10Hashar) [11:19:52] Urbanecm: everything seems fine [11:19:57] thanks, syncing [11:20:34] (03PS2) 10Muehlenhoff: Switch java::package to ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/672693 [11:22:51] !log urbanecm@deploy1002 
Synchronized wmf-config/InitialiseSettings.php: c444517: 4e66529: dff200b: Enable DiscussionTools features on several projects (T276493; T276498; T277103) (duration: 00m 57s) [11:23:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:01] T276493: Make the Reply Tool and New Discussion Tools available as opt-in Beta Features at ru.wiki - https://phabricator.wikimedia.org/T276493 [11:23:02] T276498: Make New Discussion Tool and Reply Tool available as Beta Features at all sister projects - https://phabricator.wikimedia.org/T276498 [11:23:02] T277103: Please turn on the Reply tool and New Discussion tool at office.wiki - https://phabricator.wikimedia.org/T277103 [11:23:22] MatmaRex: done [11:23:24] anything else? [11:23:32] thanks! [11:23:43] RECOVERY - Check systemd state on kubemaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:24:28] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/672693 (owner: 10Muehlenhoff) [11:24:47] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad average message consume rate in last 30m on alert1001 is CRITICAL: 0 le 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [11:26:27] PROBLEM - Confd template for /srv/config-master/pybal/codfw/eventstreams-internal on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/eventstreams-internal is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:27:03] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad average message produce rate in last 30m on alert1001 is CRITICAL: 0 le 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration 
https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [11:28:46] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubernetes2006.codfw.wmnet [11:28:47] !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host kubernetes2006.codfw.wmnet [11:28:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3318 (re)pooling @ 100%: Slowly repool db1099:3318', diff saved to https://phabricator.wikimedia.org/P14895 and previous config saved to /var/cache/conftool/dbconfig/20210316-112931-root.json [11:29:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:44] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubernetes2006.codfw.wmnet [11:29:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:58] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubernetes2015.codfw.wmnet [11:30:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:35] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubernetes2016.codfw.wmnet [11:30:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:33] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] ceph-common: add fio for testing and debugging [puppet] - 10https://gerrit.wikimedia.org/r/672670 (https://phabricator.wikimedia.org/T273649) (owner: 10David Caro) [11:31:40] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubernetes2006.codfw.wmnet [11:31:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:12] !log upgrade memached in mc1023, 
mc2023 [11:32:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:32] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubernetes2016.codfw.wmnet [11:32:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:13] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubernetes2015.codfw.wmnet [11:33:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:32] (03CR) 10David Caro: [C: 03+2] ceph-common: add fio for testing and debugging [puppet] - 10https://gerrit.wikimedia.org/r/672670 (https://phabricator.wikimedia.org/T273649) (owner: 10David Caro) [11:34:29] (03PS6) 10Muehlenhoff: profile::kerberos::client: Default to use DNS canonicalisation [puppet] - 10https://gerrit.wikimedia.org/r/671130 (https://phabricator.wikimedia.org/T257412) [11:35:21] PROBLEM - Confd template for /srv/config-master/pybal/codfw/linkrecommendation on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/linkrecommendation is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:35:21] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/blubberoid on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/blubberoid is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:35:41] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/671130 (https://phabricator.wikimedia.org/T257412) (owner: 10Muehlenhoff) [11:42:33] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/mathoid on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/mathoid is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:43:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1172 for schema change', diff saved to https://phabricator.wikimedia.org/P14896 and previous config 
saved to /var/cache/conftool/dbconfig/20210316-114310-marostegui.json [11:43:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:05] (03PS7) 10Muehlenhoff: profile::kerberos::client: Default to use DNS canonicalisation [puppet] - 10https://gerrit.wikimedia.org/r/671130 (https://phabricator.wikimedia.org/T257412) [11:45:11] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/mobileapps on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/mobileapps is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:45:46] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/671130 (https://phabricator.wikimedia.org/T257412) (owner: 10Muehlenhoff) [11:46:35] ACKNOWLEDGEMENT - puppet last run on mwdebug1001 is CRITICAL: CRITICAL: Puppet has been disabled for longer than 86400 seconds, message: experiments - jiji - jiji, last run 3 days ago with 0 failures Effie Mouzeli experiment https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:47:13] (03PS1) 10Jbond: Rakefile: update the spec test to check for module Rakefile [puppet] - 10https://gerrit.wikimedia.org/r/672698 [11:47:15] (03PS1) 10Jbond: foo: add test module to check rakefile cehck [puppet] - 10https://gerrit.wikimedia.org/r/672699 [11:47:31] PROBLEM - Confd template for /srv/config-master/pybal/codfw/echostore on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/echostore is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:47:56] (03CR) 10jerkins-bot: [V: 04-1] foo: add test module to check rakefile cehck [puppet] - 10https://gerrit.wikimedia.org/r/672699 (owner: 10Jbond) [11:50:21] (03PS2) 10Jbond: Rakefile: update the spec test to check for module Rakefile [puppet] - 10https://gerrit.wikimedia.org/r/672698 [11:50:23] (03CR) 10JMeybohm: [C: 04-1] "This is such a weird design...glad you figured it out!" 
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/672537 (https://phabricator.wikimedia.org/T273521) (owner: Legoktm) [11:50:39] (PS2) Jbond: foo: add test module to check rakefile cehck [puppet] - https://gerrit.wikimedia.org/r/672699 [11:51:11] (CR) jerkins-bot: [V: -1] foo: add test module to check rakefile cehck [puppet] - https://gerrit.wikimedia.org/r/672699 (owner: Jbond) [11:51:55] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/eventstreams-internal on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/eventstreams-internal is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:51:55] PROBLEM - Confd template for /srv/config-master/pybal/codfw/eventgate-analytics on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/eventgate-analytics is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:52:35] (CR) JMeybohm: [C: -1] docker_registry_ha: Require authentication from k8s nodes (1 comment) [puppet] - https://gerrit.wikimedia.org/r/672537 (https://phabricator.wikimedia.org/T273521) (owner: Legoktm) [11:53:25] (PS3) Jbond: foo: add test module to check rakefile cehck [puppet] - https://gerrit.wikimedia.org/r/672699 [11:54:18] (CR) jerkins-bot: [V: -1] foo: add test module to check rakefile cehck [puppet] - https://gerrit.wikimedia.org/r/672699 (owner: Jbond) [11:54:33] !log jayme@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=kubernetes,service=kubesvc [11:54:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:18] (CR) Jbond: "See example https://integration.wikimedia.org/ci/job/operations-puppet-tests-buster-docker/22071/console" [puppet] - https://gerrit.wikimedia.org/r/672698 (owner: Jbond) [11:56:28] jouncebot: next [11:56:29] In 0 hour(s) and 3 minute(s): New wikis
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210316T1200) [11:58:24] (PS9) Hnowlan: aqs: add aqs1011 to cassandra 3.11 test cluster, add aqs_next role [puppet] - https://gerrit.wikimedia.org/r/671132 (https://phabricator.wikimedia.org/T257572) [12:00:04] Urbanecm and Amir1: How many deployers does it take to do New wikis deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210316T1200). [12:00:11] i hope only two :) [12:00:12] o/ [12:00:46] My meeting is running a bit over, but generally I'm around [12:01:10] okay, great [12:01:17] I'm preparing config for the third wiki [12:01:49] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:01:51] (CR) Hnowlan: [C: +2] aqs: add aqs1011 to cassandra 3.11 test cluster, add aqs_next role [puppet] - https://gerrit.wikimedia.org/r/671132 (https://phabricator.wikimedia.org/T257572) (owner: Hnowlan) [12:02:39] SRE, SRE-Access-Requests, wikimedia-irc-freenode: Grant wmopbot +o permissions in #wikimedia-operations IRC channel - https://phabricator.wikimedia.org/T275711 (faidon) I //think// I've implemented this -- it's been a while :) [12:02:44] (CR) Alexandros Kosiaris: [C: +1] Aggregate IPPools in codfw and eqiad, enable codfw [deployment-charts] - https://gerrit.wikimedia.org/r/671144 (https://phabricator.wikimedia.org/T277191) (owner: JMeybohm) [12:04:14] (CR) JMeybohm: [C: +2] Aggregate IPPools in codfw and eqiad, enable codfw [deployment-charts] - https://gerrit.wikimedia.org/r/671144 (https://phabricator.wikimedia.org/T277191) (owner: JMeybohm) [12:04:26] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/672693 (owner: Muehlenhoff) [12:05:25] (PS1) Urbanecm: Initial configuration for mnwwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/672702
(https://phabricator.wikimedia.org/T276125) [12:05:26] let's start [12:05:31] (PS3) Urbanecm: Initial configuration for taywiki [mediawiki-config] - https://gerrit.wikimedia.org/r/672376 (https://phabricator.wikimedia.org/T275803) [12:05:38] (CR) Urbanecm: [C: +2] Initial configuration for taywiki [mediawiki-config] - https://gerrit.wikimedia.org/r/672376 (https://phabricator.wikimedia.org/T275803) (owner: Urbanecm) [12:05:40] (Merged) jenkins-bot: Aggregate IPPools in codfw and eqiad, enable codfw [deployment-charts] - https://gerrit.wikimedia.org/r/671144 (https://phabricator.wikimedia.org/T277191) (owner: JMeybohm) [12:06:05] (CR) Muehlenhoff: "PCC shows no diff for Hadoop roles and the cuminunpriv* change is expected: https://puppet-compiler.wmflabs.org/compiler1001/694/" [puppet] - https://gerrit.wikimedia.org/r/671130 (https://phabricator.wikimedia.org/T257412) (owner: Muehlenhoff) [12:06:30] (Merged) jenkins-bot: Initial configuration for taywiki [mediawiki-config] - https://gerrit.wikimedia.org/r/672376 (https://phabricator.wikimedia.org/T275803) (owner: Urbanecm) [12:07:02] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'sync'. [12:07:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:10] pulled to mwmaint1002 [12:07:43] hopefully it'll be easy [12:07:56] fingers crossed, knock on wood, thoughts and prayers [12:08:09] Amir1: "only" deprecation warnings now [12:08:37] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:08:58] yay progress :D [12:09:08] yeah :) [12:09:27] so syncing now [12:10:19] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on aqs1011.eqiad.wmnet with reason: New buster host [12:10:20] SRE, SRE-Access-Requests: Add samwalton to analytics-privatedata-users - https://phabricator.wikimedia.org/T277298 (elukey) @nshahquinn-wmf @Samwalton9 if possible I'd ask to use Pyspark + hive to explore mediawiki history, Hue is currently not in a great shape after the hadoop upgrade and we should use... [12:10:20] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on aqs1011.eqiad.wmnet with reason: New buster host [12:10:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:35] !log urbanecm@deploy1002 Synchronized wmf-config/db-eqiad.php: Creating taywiki (T275803) (duration: 00m 59s) [12:10:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:42] T275803: Create Wikipedia Atayal - https://phabricator.wikimedia.org/T275803 [12:11:45] !log urbanecm@deploy1002 Synchronized wmf-config/db-codfw.php: Creating taywiki (T275803) (duration: 01m 02s) [12:11:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:49] !log urbanecm@deploy1002 Synchronized dblists: Creating taywiki (T275803) (duration: 00m 58s) [12:13:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:30] !log urbanecm@deploy1002 rebuilt and synchronized wikiversions files: Creating taywiki (T275803) [12:14:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:23] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:15:55] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [12:15:56] (03CR) 10Muehlenhoff: [C: 03+2] Switch java::package to ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/672693 (owner: 10Muehlenhoff) [12:16:07] !log urbanecm@deploy1002 Synchronized static/images/project-logos/: Creating taywiki (T275803) (duration: 00m 58s) [12:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:15] T275803: Create Wikipedia Atayal - https://phabricator.wikimedia.org/T275803 [12:17:14] (03PS1) 10Alexandros Kosiaris: Add kubernetes2017 to BGP [homer/public] - 10https://gerrit.wikimedia.org/r/672708 (https://phabricator.wikimedia.org/T277191) [12:17:16] (03PS1) 10Alexandros Kosiaris: Add kubernetes1017 to BGP peers [homer/public] - 10https://gerrit.wikimedia.org/r/672709 [12:17:17] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'sync'. [12:17:23] !log urbanecm@deploy1002 Synchronized wmf-config/logos.php: Creating taywiki (T275803) (duration: 00m 57s) [12:17:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:49] (03CR) 10jerkins-bot: [V: 04-1] Add kubernetes1017 to BGP peers [homer/public] - 10https://gerrit.wikimedia.org/r/672709 (owner: 10Alexandros Kosiaris) [12:18:07] RECOVERY - cassandra-a CQL 10.64.0.88:9042 on aqs1010 is OK: TCP OK - 0.000 second response time on 10.64.0.88 port 9042 https://phabricator.wikimedia.org/T93886 [12:18:35] * elukey lunch! [12:18:43] RECOVERY - Prometheus k8s cache not updating on prometheus2004 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23k8s_cache_not_updating https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus2004&var-datasource=codfw+prometheus/ops [12:18:51] (03PS2) 10Alexandros Kosiaris: Add kubernetes1017 to BGP peers [homer/public] - 10https://gerrit.wikimedia.org/r/672709 [12:19:06] (03CR) 10Alexandros Kosiaris: [C: 03+2] Add kubernetes2017 to BGP [homer/public] - 10https://gerrit.wikimedia.org/r/672708 (https://phabricator.wikimedia.org/T277191) (owner: 10Alexandros Kosiaris) [12:19:07] RECOVERY - Prometheus k8s cache not updating on prometheus2003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23k8s_cache_not_updating https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus2003&var-datasource=codfw+prometheus/ops [12:19:17] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Creating taywiki (T275803) (duration: 00m 58s) [12:19:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:47] RECOVERY - cassandra-b service on aqs1010 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:19:51] (03Merged) 10jenkins-bot: Add kubernetes2017 to BGP [homer/public] - 10https://gerrit.wikimedia.org/r/672708 (https://phabricator.wikimedia.org/T277191) (owner: 10Alexandros Kosiaris) [12:19:53] (03PS3) 10Urbanecm: Initial configuration for trvwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672378 (https://phabricator.wikimedia.org/T276246) [12:19:58] (03CR) 10Urbanecm: [C: 03+2] Initial configuration for trvwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672378 (https://phabricator.wikimedia.org/T276246) (owner: 10Urbanecm) [12:20:26] !log urbanecm@deploy1002 Synchronized langlist: Creating taywiki (T275803) (duration: 00m 57s) [12:20:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:35] Amir1: first wiki live. 
[12:20:52] (03Merged) 10jenkins-bot: Initial configuration for trvwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672378 (https://phabricator.wikimedia.org/T276246) (owner: 10Urbanecm) [12:22:06] \o/ [12:22:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10jijiki) [12:24:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10jijiki) p:05Medium→03High [12:24:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10jijiki) @Cmjohnson please let me know if I can help to put these 2 servers into production as soon as possible [12:26:34] syncing second one [12:27:27] !log urbanecm@deploy1002 Synchronized wmf-config/db-eqiad.php: Creating trvwiki (T276246) (duration: 00m 57s) [12:27:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:35] T276246: Create Wikipedia Kari Seediq - https://phabricator.wikimedia.org/T276246 [12:28:06] (03PS2) 10Urbanecm: Initial configuration for mnwwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672702 (https://phabricator.wikimedia.org/T276125) [12:28:15] (03CR) 10Urbanecm: [C: 03+2] Initial configuration for mnwwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672702 (https://phabricator.wikimedia.org/T276125) (owner: 10Urbanecm) [12:28:36] !log urbanecm@deploy1002 Synchronized wmf-config/db-codfw.php: Creating trvwiki (T276246) (duration: 01m 02s) [12:28:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:48] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'sync'. 
[12:28:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:07] (03Merged) 10jenkins-bot: Initial configuration for mnwwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672702 (https://phabricator.wikimedia.org/T276125) (owner: 10Urbanecm) [12:29:37] !log urbanecm@deploy1002 Synchronized dblists: Creating trvwiki (T276246) (duration: 00m 57s) [12:29:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:00] !log urbanecm@deploy1002 rebuilt and synchronized wikiversions files: Creating trvwiki (T276246) [12:31:04] PROBLEM - Router interfaces on cr3-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 77, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:31:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:01] !log urbanecm@deploy1002 Synchronized static/images/project-logos/: Creating trvwiki (T276246) (duration: 00m 58s) [12:32:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:21] (03CR) 10Alexandros Kosiaris: [C: 04-1] "This will" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/672647 (owner: 10Elukey) [12:32:28] RECOVERY - Router interfaces on cr3-knams is OK: OK: host 91.198.174.246, interfaces up: 79, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:33:00] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Creating trvwiki (T276246) (duration: 00m 57s) [12:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:07] T276246: Create Wikipedia Kari Seediq - https://phabricator.wikimedia.org/T276246 [12:34:03] !log urbanecm@deploy1002 Synchronized langlist: Creating trvwiki (T276246) (duration: 00m 58s) [12:34:08] Amir1: second wiki done [12:34:09] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:25] wohooo [12:35:44] we had three IIRC [12:35:53] yup [12:35:53] it's hard to keep track of those now [12:36:08] syncing third one [12:36:15] Great [12:36:44] Amir1: i was trying to create a workboard for the create project [12:36:48] !log urbanecm@deploy1002 Synchronized wmf-config/db-eqiad.php: Creating mnwwiktionary (T276125) (duration: 00m 58s) [12:36:54] Then I can squeeze this noop patch after it: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/672571 [12:36:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:57] T276125: Create Wiktionary Mon - https://phabricator.wikimedia.org/T276125 [12:37:03] Amir1: but i wasn't able to do so :/ [12:37:08] Urbanecm: I can do it [12:37:13] thanks [12:37:22] do you want it as a subproject or milestone? [12:37:45] Do we want to include subtickets and parent tickets? [12:37:49] so many questions [12:37:51] i would go for a milestone [12:37:51] !log urbanecm@deploy1002 Synchronized wmf-config/db-codfw.php: Creating mnwwiktionary (T276125) (duration: 00m 58s) [12:37:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:02] and we probably don't need to tag other tasks [12:38:07] we rarely really track them [12:38:14] and if we do at one time, a separate tag would be probably better [12:39:00] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'sync'. 
[12:39:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:10] yeah you're right [12:39:18] !log urbanecm@deploy1002 Synchronized dblists: Creating mnwwiktionary (T276125) (duration: 00m 57s) [12:39:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:40] !log urbanecm@deploy1002 rebuilt and synchronized wikiversions files: Creating mnwwiktionary (T276125) [12:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:04] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:41:43] !log urbanecm@deploy1002 Synchronized static/images/project-logos/: Creating mnwwiktionary (T276125) (duration: 01m 01s) [12:41:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:34] (03PS1) 10Arturo Borrero Gonzalez: toolforge: fix apt pinnings for several packages in stretch [puppet] - 10https://gerrit.wikimedia.org/r/672711 [12:42:45] !log urbanecm@deploy1002 Synchronized wmf-config/logos.php: Creating mnwwiktionary (T276125) (duration: 01m 00s) [12:42:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:57] T276125: Create Wiktionary Mon - https://phabricator.wikimedia.org/T276125 [12:43:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 25%: Slowly repool db1172', diff saved to https://phabricator.wikimedia.org/P14897 and previous config saved to /var/cache/conftool/dbconfig/20210316-124303-root.json [12:43:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:16] RECOVERY - cassandra-b CQL 10.64.0.120:9042 on aqs1010 is OK: TCP OK - 0.000 second response time on 10.64.0.120 port 9042 https://phabricator.wikimedia.org/T93886 [12:43:46] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 
Creating mnwwiktionary (T276125) (duration: 00m 57s) [12:43:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:05] (03PS1) 10Urbanecm: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672712 [12:44:07] (03CR) 10Urbanecm: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672712 (owner: 10Urbanecm) [12:45:00] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672712 (owner: 10Urbanecm) [12:46:00] !log urbanecm@deploy1002 Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 02m 06s) [12:46:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:07] Amir1: so, should be all done :) [12:46:29] yay [12:46:41] Urbanecm: while you're there can you push that noop patch? [12:46:44] I can do it too [12:46:50] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:47:14] let's hope it's really noop [12:47:28] (03CR) 10Urbanecm: [C: 03+2] flaggedrevs: Simplify the config a bit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672571 (owner: 10Ladsgroup) [12:47:45] everything with flagged revs is a trap [12:47:52] yeah [12:48:58] (03Merged) 10jenkins-bot: flaggedrevs: Simplify the config a bit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672571 (owner: 10Ladsgroup) [12:49:38] (03PS1) 10Alexandros Kosiaris: Move profile::kubernetes::node::cni_config to eqiad only [puppet] - 10https://gerrit.wikimedia.org/r/672713 (https://phabricator.wikimedia.org/T277191) [12:50:08] Urbanecm: btw. Create is already a milestone of wiki setup [12:50:15] interesting [12:50:17] maybe I'm missing something obvious then? [12:50:21] so why can't i create columns? 
[12:50:25] !log urbanecm@deploy1002 Synchronized wmf-config/flaggedrevs.php: 1426d04abe08458dac57868a85550e05f9cb544b: flaggedrevs: Simplify the config a bit (duration: 00m 58s) [12:50:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:42] Urbanecm: can you now? https://phabricator.wikimedia.org/project/board/2942/ [12:50:46] ohoho [12:50:48] thanks [12:50:55] I needed to create a workboard for it [12:51:01] thanks [12:51:04] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: fix apt pinnings for several packages in stretch [puppet] - 10https://gerrit.wikimedia.org/r/672711 (owner: 10Arturo Borrero Gonzalez) [12:51:13] Amir1: will you update new wiki handler to reflect status in columns too, or should i? [12:51:37] I can do it but I have a lot of stuff on my plate for today [12:51:38] !log volans@cumin1001 START - Cookbook sre.dns.netbox [12:51:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:50] if you beat me to it, I don't mind :D [12:53:16] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:53:19] hehe [12:53:21] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28628/console" [puppet] - 10https://gerrit.wikimedia.org/r/672713 (https://phabricator.wikimedia.org/T277191) (owner: 10Alexandros Kosiaris) [12:53:33] 10SRE, 10SRE-Access-Requests: Add samwalton to analytics-privatedata-users - https://phabricator.wikimedia.org/T277298 (10nshahquinn-wmf) @elukey what exactly will happen if we use Hue too much? I recommended that Sam use Hue since he just wants to write an SQL query without having to deal with Jupyter, Pyth... 
[12:53:48] !log New wiki creation is done [12:53:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:56] I just created some columns but not sure about stalled atm [12:54:18] we should have a "blocked for now" too [12:54:25] (03CR) 10Alexandros Kosiaris: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/28628/ says ok, merging" [puppet] - 10https://gerrit.wikimedia.org/r/672713 (https://phabricator.wikimedia.org/T277191) (owner: 10Alexandros Kosiaris) [12:54:46] i would split it to "Under discussion" and "Not ready" [12:54:56] what do you think Amir1 ? [12:55:26] yeah definitely split but not sure about naming [12:55:45] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:01] I think the naming is good for now, we can change it later [12:56:31] (03PS1) 10Jbond: hiera - cloud: add grafana config [puppet] - 10https://gerrit.wikimedia.org/r/672714 [12:56:33] (03PS1) 10Jbond: C:grafana: add types [puppet] - 10https://gerrit.wikimedia.org/r/672715 [12:57:19] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28629/console" [puppet] - 10https://gerrit.wikimedia.org/r/672715 (owner: 10Jbond) [12:57:48] https://phabricator.wikimedia.org/project/board/2942/ [12:57:52] Urbanecm: vola [12:57:53] (03CR) 10Jbond: [C: 03+2] hiera - cloud: add grafana config [puppet] - 10https://gerrit.wikimedia.org/r/672714 (owner: 10Jbond) [12:57:57] (03CR) 10Jbond: [V: 03+1 C: 03+2] C:grafana: add types [puppet] - 10https://gerrit.wikimedia.org/r/672715 (owner: 10Jbond) [12:58:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 50%: Slowly repool db1172', diff saved to https://phabricator.wikimedia.org/P14898 and previous config saved to /var/cache/conftool/dbconfig/20210316-125807-root.json [12:58:17] *voila [12:58:17] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:18] Amir1: ohoho, thanks [12:58:46] you make the patch for maintenance bot then :D [12:58:53] afk for lunch [12:58:56] :D [12:59:06] !log drain ganeti2013 [12:59:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:12] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:01:35] (03PS1) 10Jbond: cloud - sso grafana: use unique rw domain [puppet] - 10https://gerrit.wikimedia.org/r/672716 [13:02:07] (03CR) 10Jbond: [V: 03+2 C: 03+2] cloud - sso grafana: use unique rw domain [puppet] - 10https://gerrit.wikimedia.org/r/672716 (owner: 10Jbond) [13:02:44] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'apertium' for release 'staging' . [13:02:44] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'apertium' for release 'production' . [13:02:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:11] !log sync all services on the new codfw kubernetes cluster T277191 [13:03:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:18] T277191: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 [13:04:44] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [13:04:44] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' . 
[13:04:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:19] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'blubberoid' for release 'production' . [13:05:19] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'blubberoid' for release 'staging' . [13:05:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:44] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [13:05:44] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'changeprop' for release 'production' . [13:05:44] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [13:05:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:56] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/citoid on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:06:00] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/mathoid on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:06:04] (03PS1) 10Jbond: cloud: sso - grafana: fix vhosts [puppet] - 10https://gerrit.wikimedia.org/r/672717 [13:06:20] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/eventgate-logging-external on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:06:30] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/wikifeeds on puppetmaster2001 is OK: No errors detected 
https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:06:30] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/cxserver on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:06:32] RECOVERY - Confd template for /srv/config-master/pybal/codfw/echostore on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:06:32] RECOVERY - Confd template for /srv/config-master/pybal/codfw/zotero on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:06:32] RECOVERY - Confd template for /srv/config-master/pybal/codfw/blubberoid on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:06:32] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/similar-users on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:06:32] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/wikifeeds on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:06:32] RECOVERY - Confd template for /srv/config-master/pybal/codfw/proton on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:06:33] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/proton on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:06:34] RECOVERY - Confd template for /srv/config-master/pybal/codfw/apertium on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:06:34] RECOVERY - Confd template for /srv/config-master/pybal/codfw/recommendation-api on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:06:38] RECOVERY - Confd template for /srv/config-master/pybal/codfw/eventstreams on puppetmaster1001 is OK: 
No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:06:40] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/push-notifications on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:06:40] RECOVERY - Confd template for /srv/config-master/pybal/codfw/mathoid on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:06:44] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/proton on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:06:44] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/api on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:06:46] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/linkrecommendation-external on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:06:46] RECOVERY - Confd template for /srv/config-master/pybal/codfw/eventstreams-internal on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:06:46] RECOVERY - Confd template for /srv/config-master/pybal/codfw/recommendation-api on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:06:46] RECOVERY - Confd template for /srv/config-master/pybal/codfw/echostore on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:06:46] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/eventgate-analytics on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:06:46] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/mobileapps on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:06:46] RECOVERY - Confd template for 
/srv/config-master/pybal/eqiad/eventstreams on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:06:47] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/appservers-https on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:06:48] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/recommendation-api on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:06:48] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/cxserver on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:06:49] RECOVERY - Confd template for /srv/config-master/pybal/codfw/termbox on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:06:49] RECOVERY - Confd template for /srv/config-master/pybal/codfw/eventgate-main on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:06:50] RECOVERY - Confd template for /srv/config-master/pybal/codfw/appservers-https on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:06:56] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/linkrecommendation on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:07:00] RECOVERY - Confd template for /srv/config-master/pybal/codfw/api-gateway on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:07:00] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/eventstreams-internal on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:07:00] RECOVERY - Confd template for /srv/config-master/pybal/codfw/cxserver on puppetmaster1001 is OK: No errors detected 
https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:07:02] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/eventgate-logging-external on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:07:02] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/api-gateway on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:07:04] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/linkrecommendation-external on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:07:04] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/apaches on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:07:04] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/zotero on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:07:06] RECOVERY - Confd template for /srv/config-master/pybal/codfw/api on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:07:06] RECOVERY - Confd template for /srv/config-master/pybal/codfw/citoid on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:07:08] RECOVERY - Confd template for /srv/config-master/pybal/codfw/similar-users on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:07:08] RECOVERY - Confd template for /srv/config-master/pybal/codfw/cxserver on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:07:08] RECOVERY - Confd template for /srv/config-master/pybal/codfw/apertium on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:07:09] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 
'changeprop-jobqueue' for release 'production' . [13:07:09] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . [13:07:10] RECOVERY - Confd template for /srv/config-master/pybal/codfw/linkrecommendation on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:07:12] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:07:14] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/echostore on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:07:14] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/termbox on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:07:16] RECOVERY - Confd template for /srv/config-master/pybal/codfw/zotero on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:07:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:24] RECOVERY - Confd template for /srv/config-master/pybal/codfw/mobileapps on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:07:24] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/kubemaster on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:07:24] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/zotero on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:26] RECOVERY - Confd template for /srv/config-master/pybal/codfw/eventgate-analytics on puppetmaster1001 is OK: No errors detected 
https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:07:26] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/eventstreams-internal on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:07:26] RECOVERY - Confd template for /srv/config-master/pybal/codfw/eventstreams on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:07:28] RECOVERY - Confd template for /srv/config-master/pybal/codfw/wikifeeds on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:07:34] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/sessionstore on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:07:34] RECOVERY - Confd template for /srv/config-master/pybal/codfw/linkrecommendation on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:07:38] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/kubemaster on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:07:38] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/mobileapps on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:07:39] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'citoid' for release 'production' . [13:07:39] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'citoid' for release 'staging' . 
[13:07:40] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/blubberoid on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:07:42] RECOVERY - Confd template for /srv/config-master/pybal/codfw/citoid on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:07:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:46] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/recommendation-api on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:07:46] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/blubberoid on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:07:46] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/eventgate-main on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:07:48] RECOVERY - Confd template for /srv/config-master/pybal/codfw/kubemaster on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:07:48] RECOVERY - Confd template for /srv/config-master/pybal/codfw/eventgate-analytics-external on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:07:48] RECOVERY - Confd template for /srv/config-master/pybal/codfw/sessionstore on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:07:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:50] RECOVERY - Confd template for /srv/config-master/pybal/codfw/eventstreams-internal on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:07:50] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/api on puppetmaster2001 is OK: No errors detected 
https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:07:50] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/sessionstore on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:07:50] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/eventgate-analytics-external on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:07:52] RECOVERY - Confd template for /srv/config-master/pybal/codfw/similar-users on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:07:53] RECOVERY - Confd template for /srv/config-master/pybal/codfw/eventgate-analytics on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:07:54] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/eventgate-main on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:07:54] RECOVERY - Confd template for /srv/config-master/pybal/codfw/wikifeeds on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:07:54] (03CR) 10Jbond: [C: 03+2] cloud: sso - grafana: fix vhosts [puppet] - 10https://gerrit.wikimedia.org/r/672717 (owner: 10Jbond) [13:07:54] RECOVERY - Confd template for /srv/config-master/pybal/codfw/mobileapps on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:07:54] RECOVERY - Confd template for /srv/config-master/pybal/codfw/eventgate-logging-external on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:07:56] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/apertium on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:07:59] (03PS2) 10Jbond: cloud: sso - grafana: fix vhosts [puppet] - 
10https://gerrit.wikimedia.org/r/672717 [13:08:00] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/eventgate-analytics on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:08:00] RECOVERY - Confd template for /srv/config-master/pybal/codfw/linkrecommendation-external on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:08:02] RECOVERY - Confd template for /srv/config-master/pybal/codfw/sessionstore on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:08:04] (03CR) 10Jbond: [V: 03+2 C: 03+2] cloud: sso - grafana: fix vhosts [puppet] - 10https://gerrit.wikimedia.org/r/672717 (owner: 10Jbond) [13:08:15] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'cxserver' for release 'staging' . [13:08:15] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'cxserver' for release 'production' . 
[13:08:18] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/citoid on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:08:18] RECOVERY - Confd template for /srv/config-master/pybal/codfw/push-notifications on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:08:20] RECOVERY - Confd template for /srv/config-master/pybal/codfw/apaches on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:08:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:38] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/mathoid on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:08:42] RECOVERY - Confd template for /srv/config-master/pybal/codfw/kubemaster on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:09:52] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'echostore' for release 'staging' . [13:09:52] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'echostore' for release 'production' . [13:09:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:37] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [13:10:37] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' . 
[13:10:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:22] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [13:12:01] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ganeti2013.codfw.wmnet [13:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:27] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [13:12:28] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [13:12:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:00] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [13:13:09] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [13:13:09] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . 
[13:13:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 75%: Slowly repool db1172', diff saved to https://phabricator.wikimedia.org/P14899 and previous config saved to /var/cache/conftool/dbconfig/20210316-131310-root.json [13:13:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:16] RECOVERY - puppet last run on kubestagemaster1001 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:13:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:09] (03PS1) 10Jbond: fix cloud: sso-grafana: comment out cas for now [puppet] - 10https://gerrit.wikimedia.org/r/672718 [13:15:04] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [13:15:04] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [13:15:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:56] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventstreams' for release 'canary' . [13:16:56] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventstreams' for release 'production' . [13:17:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:44] (03CR) 10Jbond: [C: 03+2] fix cloud: sso-grafana: comment out cas for now [puppet] - 10https://gerrit.wikimedia.org/r/672718 (owner: 10Jbond) [13:18:37] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventstreams-internal' for release 'main' . 
[13:18:37] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventstreams-internal' for release 'canary' . [13:18:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:26] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . [13:19:26] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . [13:19:26] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [13:19:27] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2013.codfw.wmnet [13:19:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:40] (03PS1) 10Urbanecm: trvwiki: set logo to File:Wikipedia-logo-v2-trv.svg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672719 (https://phabricator.wikimedia.org/T276246) [13:20:41] jouncebot: now [13:20:42] No deployments scheduled for the next 2 hour(s) and 39 minute(s) [13:20:51] (03CR) 10Urbanecm: [C: 03+2] trvwiki: set logo to File:Wikipedia-logo-v2-trv.svg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672719 (https://phabricator.wikimedia.org/T276246) (owner: 10Urbanecm) [13:20:54] !log drain ganeti2014 [13:20:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:40] (03Merged) 10jenkins-bot: trvwiki: set logo to File:Wikipedia-logo-v2-trv.svg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672719 (https://phabricator.wikimedia.org/T276246) (owner: 
10Urbanecm) [13:22:26] !log urbanecm@deploy1002 sync-file aborted: 7fb50c3: trvwiki: set logo to File:Wikipedia-logo-v2-trv.svg (T276246) (duration: 00m 00s) [13:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:37] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mathoid' for release 'staging' . [13:22:37] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mathoid' for release 'production' . [13:22:38] T276246: Create Wikipedia Kari Seediq - https://phabricator.wikimedia.org/T276246 [13:22:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:44] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ganeti2014.codfw.wmnet [13:22:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:31] !log urbanecm@deploy1002 Synchronized static/images/project-logos/: 7fb50c3: trvwiki: set logo to File:Wikipedia-logo-v2-trv.svg (T276246; 1/2) (duration: 01m 01s) [13:23:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:34] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . [13:24:34] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' . 
[13:24:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:44] !log urbanecm@deploy1002 Synchronized wmf-config/logos.php: 7fb50c3: trvwiki: set logo to File:Wikipedia-logo-v2-trv.svg (T276246; 2/2) (duration: 00m 57s) [13:24:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:51] * Urbanecm done [13:24:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:27] (03CR) 10Elukey: profile::docker::engine: add default to the version parameter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/672647 (owner: 10Elukey) [13:26:44] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'proton' for release 'production' . [13:26:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:51] (03CR) 10Jbond: [C: 03+2] puppet_compiler: stop installing openjdk [puppet] - 10https://gerrit.wikimedia.org/r/672660 (https://phabricator.wikimedia.org/T269354) (owner: 10Hashar) [13:27:56] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'push-notifications' for release 'canary' . [13:27:56] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'push-notifications' for release 'main' . 
[13:28:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:08] (03CR) 10Alexandros Kosiaris: [C: 04-1] profile::docker::engine: add default to the version parameter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/672647 (owner: 10Elukey) [13:28:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 100%: Slowly repool db1172', diff saved to https://phabricator.wikimedia.org/P14900 and previous config saved to /var/cache/conftool/dbconfig/20210316-132814-root.json [13:28:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1101:3318 for schema change', diff saved to https://phabricator.wikimedia.org/P14901 and previous config saved to /var/cache/conftool/dbconfig/20210316-132844-marostegui.json [13:28:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:16] (03CR) 10David Caro: [C: 03+1] "Awesome!! <3" [puppet] - 10https://gerrit.wikimedia.org/r/672698 (owner: 10Jbond) [13:30:09] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2014.codfw.wmnet [13:30:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:19] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'recommendation-api' for release 'production' . 
[13:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:51] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/672658 (https://phabricator.wikimedia.org/T269354) (owner: 10Hashar) [13:30:57] (03CR) 10Jbond: [C: 03+2] Rakefile: update the spec test to check for module Rakefile [puppet] - 10https://gerrit.wikimedia.org/r/672698 (owner: 10Jbond) [13:31:26] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'sessionstore' for release 'production' . [13:31:26] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'sessionstore' for release 'staging' . [13:31:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:36] !log drain ganeti2015 [13:31:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:56] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'similar-users' for release 'main' . [13:32:08] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:32:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:19] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/672723 [13:32:34] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'termbox' for release 'staging' . [13:32:34] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'termbox' for release 'production' . [13:32:34] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'termbox' for release 'test' . 
[13:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:41] (03PS2) 10Muehlenhoff: Add approval for graphite-admins [puppet] - 10https://gerrit.wikimedia.org/r/671182 (https://phabricator.wikimedia.org/T276465) [13:33:56] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' . [13:33:56] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [13:34:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:04] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'zotero' for release 'staging' . [13:35:04] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'zotero' for release 'production' . [13:35:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:19] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ganeti2015.codfw.wmnet [13:35:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:43] (03PS1) 10Alexandros Kosiaris: kubernetes2017: Assign failure-domain labels [puppet] - 10https://gerrit.wikimedia.org/r/672724 [13:36:11] (03CR) 10Alexandros Kosiaris: [C: 03+2] kubernetes2017: Assign failure-domain labels [puppet] - 10https://gerrit.wikimedia.org/r/672724 (owner: 10Alexandros Kosiaris) [13:36:20] 10SRE, 10SRE-Access-Requests: Add samwalton to analytics-privatedata-users - https://phabricator.wikimedia.org/T277298 (10Ottomata) Perhaps Superset + SQLLab is a better option? 
Or is the fact that the query takes longer than a minute mean that it will timeout there? [13:36:24] 10SRE, 10Patch-For-Review: Migrate dbmonitor hosts to Buster - https://phabricator.wikimedia.org/T224589 (10MoritzMuehlenhoff) With orchestrator in place, can these be removed now? Support for jessie will cease in two weeks. [13:37:32] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:37:33] 10SRE, 10Patch-For-Review: Migrate dbmonitor hosts to Buster - https://phabricator.wikimedia.org/T224589 (10Marostegui) Unfortunately not, we don't have all the sections in Orchestrator yet. And also need to see how we'll be replacing the query activity, we have some ideas on that front but yet to be implement... [13:38:30] PROBLEM - Host ml-etcd2003 is DOWN: PING CRITICAL - Packet loss = 100% [13:39:30] 10SRE, 10SRE-Access-Requests: Add samwalton to analytics-privatedata-users - https://phabricator.wikimedia.org/T277298 (10elukey) >>! In T277298#6917176, @nshahquinn-wmf wrote: > @elukey what exactly will happen if we use Hue too much? > > I recommended that Sam use Hue since he just wants to write an SQL qu... [13:42:58] 10SRE, 10Patch-For-Review: Migrate dbmonitor hosts to Buster - https://phabricator.wikimedia.org/T224589 (10MoritzMuehlenhoff) Ouch, let's move to dbmonitor to Stretch, then? If PHP 5 is the blocker (I remember some issues with PHP7 vaguely), I can make a stretch-wikimedia build of php5, but this really, reall... [13:44:43] (03PS2) 10Alexandros Kosiaris: Revert "Revert "mobileapps: Enable egress network policy"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/670540 (owner: 10MSantos) [13:45:00] (03CR) 10Alexandros Kosiaris: "Merging this as it's required for the application to work." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/670540 (owner: 10MSantos) [13:45:06] !log powercycling ganeti2015, stuck on reboot [13:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:28] 10SRE, 10Patch-For-Review: Migrate dbmonitor hosts to Buster - https://phabricator.wikimedia.org/T224589 (10Marostegui) The problem was indeed `mysqli`, we can try to see if we can run php5 on stretch as you propose. @jcrespo took a deep look at this a couple of years ago I think, so maybe he can give more co... [13:47:15] (03CR) 10Alexandros Kosiaris: [C: 03+2] Revert "Revert "mobileapps: Enable egress network policy"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/670540 (owner: 10MSantos) [13:48:54] (03Abandoned) 10Elukey: profile::docker::engine: add default to the version parameter [puppet] - 10https://gerrit.wikimedia.org/r/672647 (owner: 10Elukey) [13:49:51] (03PS10) 10Klausman: hiera/modules: Add role for ML k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/672402 (https://phabricator.wikimedia.org/T272918) [13:50:05] (03Merged) 10jenkins-bot: Revert "Revert "mobileapps: Enable egress network policy"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/670540 (owner: 10MSantos) [13:50:21] (03CR) 10Hashar: "Will redo it to use the cloud common hieradata hierarchy" [puppet] - 10https://gerrit.wikimedia.org/r/672658 (https://phabricator.wikimedia.org/T269354) (owner: 10Hashar) [13:51:38] 10SRE, 10Patch-For-Review: Migrate dbmonitor hosts to Buster - https://phabricator.wikimedia.org/T224589 (10MoritzMuehlenhoff) >>! In T224589#6917296, @Marostegui wrote: > The problem was indeed `mysqli`, we can try to see if we can run php5 on stretch as you propose. > @jcrespo took a deep look at this a cou... 
[13:52:26] (03CR) 10Klausman: hiera/modules: Add role for ML k8s workers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/672402 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [13:53:15] (03CR) 10Ottomata: "Man our role/profile rules really suck for default settings, eh?" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/672373 (https://phabricator.wikimedia.org/T277062) (owner: 10Elukey) [13:53:30] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:53:31] 10SRE, 10ops-codfw: ganeti2015 doesn't boot - https://phabricator.wikimedia.org/T277537 (10MoritzMuehlenhoff) [13:54:14] (03PS1) 10Jbond: P:base::firewall::extra: profile to set ferm rules via hiera [puppet] - 10https://gerrit.wikimedia.org/r/672727 [13:55:09] (03CR) 10Muehlenhoff: [C: 03+2] Add approval for graphite-admins [puppet] - 10https://gerrit.wikimedia.org/r/671182 (https://phabricator.wikimedia.org/T276465) (owner: 10Muehlenhoff) [13:55:17] (03CR) 10jerkins-bot: [V: 04-1] P:base::firewall::extra: profile to set ferm rules via hiera [puppet] - 10https://gerrit.wikimedia.org/r/672727 (owner: 10Jbond) [13:55:51] 10SRE, 10Patch-For-Review: Migrate dbmonitor hosts to Buster - https://phabricator.wikimedia.org/T224589 (10jcrespo) > @jcrespo took a deep look at this a couple of years ago I think I gave up- it is not just a question of mysql->mysqli rewrite, which would be trivial, it also uses global variables in a way t... 
[13:55:54] PROBLEM - Host ganeti2015 is DOWN: PING CRITICAL - Packet loss = 100% [13:56:05] ACKNOWLEDGEMENT - Host ganeti2015 is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff T277537 [13:57:34] (03PS2) 10Jbond: P:base::firewall::extra: profile to set ferm rules via hiera [puppet] - 10https://gerrit.wikimedia.org/r/672727 [13:58:04] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:59:06] (03CR) 10Elukey: hadoop: add a profile to deploy the capacity scheduler's settings (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/672373 (https://phabricator.wikimedia.org/T277062) (owner: 10Elukey) [13:59:13] (03CR) 10Jbond: [C: 03+2] P:base::firewall::extra: profile to set ferm rules via hiera [puppet] - 10https://gerrit.wikimedia.org/r/672727 (owner: 10Jbond) [13:59:41] moritzm: ok to merge yours [14:00:10] * jbond42 boldly mergeing [14:00:41] 10SRE, 10DBA, 10Wikimedia-Mailing-lists: Create databases for mailman3 - https://phabricator.wikimedia.org/T256538 (10Ladsgroup) >>! In T256538#6916381, @Marostegui wrote: > Do you have some estimations on how long you want to run this test for? For me, hopefully a month or two. Add one or two to be safe. [14:01:34] 10SRE, 10DBA, 10Wikimedia-Mailing-lists: Create databases for mailman3 - https://phabricator.wikimedia.org/T256538 (10Marostegui) >>! In T256538#6917363, @Ladsgroup wrote: >>>! In T256538#6916381, @Marostegui wrote: >> Do you have some estimations on how long you want to run this test for? > > For me, hopef... 
[14:02:36] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:04:42] (03CR) 10Elukey: hadoop: add a profile to deploy the capacity scheduler's settings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/672373 (https://phabricator.wikimedia.org/T277062) (owner: 10Elukey) [14:07:10] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:11:48] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:12:23] jbond42: thanks, sorry missed the pint [14:12:38] moritzm: np [14:14:10] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:16:07] (03CR) 10Hashar: "I wanted to try something else for hiera but that did not work :) So yeah +1 and I take care of the various WMCS instances having this c" [puppet] - 10https://gerrit.wikimedia.org/r/672658 (https://phabricator.wikimedia.org/T269354) (owner: 10Hashar) [14:17:33] (03PS1) 10Kormat: mariadb: Enable ssl when using profile::mariadb::client [puppet] - 10https://gerrit.wikimedia.org/r/672728 [14:18:55] (03CR) 10Kormat: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28630/console" [puppet] - 10https://gerrit.wikimedia.org/r/672728 (owner: 10Kormat) [14:19:19] (03PS1) 10Jbond: cloud - sso-grafana: move horizon config to hiera [puppet] - 10https://gerrit.wikimedia.org/r/672729 [14:20:33] (03CR) 10Jbond: [C: 03+2] cloud - sso-grafana: move horizon config to hiera [puppet] - 10https://gerrit.wikimedia.org/r/672729 (owner: 10Jbond) [14:22:09] (03CR) 10Hashar: [C: 03+1] contint: use Java 11 on Jenkins agents [puppet] - 10https://gerrit.wikimedia.org/r/672658 (https://phabricator.wikimedia.org/T269354) (owner: 10Hashar) [14:25:56] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:29:43] (03CR) 10Kormat: [V: 03+1] "For the folks on CC (akosiaris, bd808, hnowlan, krinkle, urbanecm), you're all here because there's one or more occurrences of" [puppet] - 10https://gerrit.wikimedia.org/r/672728 (owner: 10Kormat) [14:30:38] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:31:26] 10SRE, 10serviceops, 10Performance-Team (Radar): Get rid of nutcracker for connecting to redis - https://phabricator.wikimedia.org/T277183 (10AMooney) [14:32:54] (03PS1) 10Jbond: cloud sso-grafana: use correct ldap groups [puppet] - 10https://gerrit.wikimedia.org/r/672731 [14:33:59] (03CR) 10Jbond: [C: 03+2] doc: script to restart php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/666309 (https://phabricator.wikimedia.org/T275468) (owner: 10Hashar) [14:34:11] (03CR) 10Jbond: [C: 03+2] cloud sso-grafana: use correct ldap groups [puppet] - 10https://gerrit.wikimedia.org/r/672731 (owner: 10Jbond) [14:37:36] !log start of foreachwikiindblist wikidataclient extensions/Wikibase/lib/maintenance/populateSitesTable.php --force-protocol https (T276251 T276129 T275839) [14:37:45] (03PS3) 10Jbond: P:grafana: Update CAS config to authenticate users on the correct vhost [puppet] - 10https://gerrit.wikimedia.org/r/654813 (https://phabricator.wikimedia.org/T269272) [14:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:47] T276129: Add Wikidata support for mnwwiktionary - https://phabricator.wikimedia.org/T276129 [14:37:48] T276251: Add Wikidata support for trvwiki - https://phabricator.wikimedia.org/T276251 [14:37:49] T275839: Add Wikidata support for taywiki - https://phabricator.wikimedia.org/T275839 [14:42:30] (03CR) 10Alexandros Kosiaris: [C: 03+1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/672728 (owner: 10Kormat) [14:47:12] (03PS1) 10Jbond: cloud - pki: add vhost definision for new cas site [puppet] - 10https://gerrit.wikimedia.org/r/672734 [14:47:26] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={ircd,netbox_device_statistics} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:48:24] kormat: quick question about the patch you CC'ed me on: will `sql xxwiki --write` on mwmaint1002 work? [14:48:36] it's the standard way to connect to master DB [14:49:08] (03CR) 10Jbond: [C: 03+2] cloud - pki: add vhost definision for new cas site [puppet] - 10https://gerrit.wikimedia.org/r/672734 (owner: 10Jbond) [14:49:22] Urbanecm: it looks like that uses `mwscript mysql.php`, which i can only assume does not use the mysql client's config [14:49:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1101:3318 (re)pooling @ 25%: Slowly repool db1101:3318', diff saved to https://phabricator.wikimedia.org/P14902 and previous config saved to /var/cache/conftool/dbconfig/20210316-144935-root.json [14:49:39] (03CR) 10Muehlenhoff: [C: 03+2] contint: use Java 11 on Jenkins agents [puppet] - 10https://gerrit.wikimedia.org/r/672658 (https://phabricator.wikimedia.org/T269354) (owner: 10Hashar) [14:49:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:05] Urbanecm: oh. but it says it takes options to pass to mysql. sigh. i guess i'll have to find/read mysql.php [14:51:03] kormat: mysql.php is at https://github.com/wikimedia/mediawiki/blob/master/maintenance/mysql.php [14:52:10] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:53:25] (03PS1) 10Hashar: gerrit: remove sudo rule for 'gerritslave' user [puppet] - 10https://gerrit.wikimedia.org/r/672735 [14:53:55] !log jmm@cumin2001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host ganeti2015.codfw.wmnet [14:54:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:23] kormat: it seems to use the IP (ie. 10.64.48.34 for s2 master) [14:54:31] uff [14:56:22] Urbanecm: welp. 
this won't work [14:56:29] kormat: it runs sth like `mysql --defaults-extra-file=/tmp/mw-mysql1a9ba1df9b68.ini --user=wikiadmin --database=cswiki --host=10.64.0.99` [14:57:06] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:57:14] Urbanecm: thanks for pointing this out [14:57:25] so `sql` never uses ssl. that's.. concerning. [14:58:08] !log end of foreachwikiindblist wikidataclient extensions/Wikibase/lib/maintenance/populateSitesTable.php --force-protocol https (T276251 T276129 T275839) [14:58:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:22] T276129: Add Wikidata support for mnwwiktionary - https://phabricator.wikimedia.org/T276129 [14:58:23] T276251: Add Wikidata support for trvwiki - https://phabricator.wikimedia.org/T276251 [14:58:23] T275839: Add Wikidata support for taywiki - https://phabricator.wikimedia.org/T275839 [14:58:31] no problem kormat. Can you create a task to change mysql.php's behavior? 
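Note on the exchange above: per Urbanecm's message, `sql` ends up running `mysql` with an explicit `--defaults-extra-file`, `--user`, `--database` and `--host`, and kormat concludes it never uses SSL. A minimal sketch of that command line, with an optional TLS flag added, might look like the following. It is written in Python for illustration only (the real `mysql.php` is PHP); `build_mysql_argv` and `require_tls` are hypothetical names, and the use of the MariaDB client's `--ssl` option is an assumption about one possible fix, not what T277539 decided.

```python
import shlex


def build_mysql_argv(user, database, host, defaults_extra_file, require_tls=False):
    """Build an argv list shaped like the command quoted in the log.

    Hypothetical helper for illustration; not MediaWiki code.
    """
    argv = [
        "mysql",
        # The client expects --defaults-extra-file before other options.
        f"--defaults-extra-file={defaults_extra_file}",
        f"--user={user}",
        f"--database={database}",
        f"--host={host}",
    ]
    if require_tls:
        # Assumption: the MariaDB client's --ssl option is used to request TLS.
        argv.append("--ssl")
    return argv


if __name__ == "__main__":
    # Values copied from the command quoted in the log.
    argv = build_mysql_argv(
        user="wikiadmin",
        database="cswiki",
        host="10.64.0.99",
        defaults_extra_file="/tmp/mw-mysql1a9ba1df9b68.ini",
        require_tls=True,
    )
    print(shlex.join(argv))
```

With `require_tls=False` the list matches the command quoted in the log, which carries no TLS option of its own.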
[14:59:25] (03CR) 10Urbanecm: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/672728 (owner: 10Kormat) [14:59:51] (03CR) 10Urbanecm: [C: 04-1] "this will probably break `sql xxwiki`, as it runs something like this under the hoods:" [puppet] - 10https://gerrit.wikimedia.org/r/672728 (owner: 10Kormat) [15:02:23] (03CR) 10Bstorm: paws: block using the Jupyterhub from Tor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/671286 (https://phabricator.wikimedia.org/T276615) (owner: 10Bstorm) [15:03:24] !log hashar@deploy1002 Started deploy [integration/docroot@44d5685]: Verify check can restart php-fpm # T275468 [15:03:31] !log hashar@deploy1002 Finished deploy [integration/docroot@44d5685]: Verify check can restart php-fpm # T275468 (duration: 00m 07s) [15:03:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:34] T275468: Apache on doc1001 does not see updated PHP files for hours/days after deployment - https://phabricator.wikimedia.org/T275468 [15:03:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1101:3318 (re)pooling @ 50%: Slowly repool db1101:3318', diff saved to https://phabricator.wikimedia.org/P14903 and previous config saved to /var/cache/conftool/dbconfig/20210316-150439-root.json [15:04:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:53] (03CR) 10Ayounsi: Add kubernetes1017 to BGP peers (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/672709 (owner: 10Alexandros Kosiaris) [15:05:57] (03CR) 10Muehlenhoff: [C: 03+2] gerrit: remove sudo rule for 'gerritslave' user [puppet] - 10https://gerrit.wikimedia.org/r/672735 (owner: 10Hashar) [15:06:27] !log hashar@deploy1002 Started deploy [integration/docroot@cf787a5]: (no justification provided) [15:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:58] !log 
hashar@deploy1002 Finished deploy [integration/docroot@cf787a5]: (no justification provided) (duration: 00m 30s) [15:07:04] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:07:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:15] Urbanecm: i filed https://phabricator.wikimedia.org/T277539, but i've no idea what tags to put on it :) [15:08:37] (03CR) 10DCausse: [C: 03+1] Add Cirrus testing profile for glent m1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672565 (https://phabricator.wikimedia.org/T262612) (owner: 10Ebernhardson) [15:10:24] kormat: thanks. I’ll add some :) [15:10:28] Urbanecm: <3 [15:16:16] PROBLEM - Host analytics1066.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:17:39] this is being worked on --^ [15:17:57] elukey: kicking it while it's down, eh? [15:18:43] (03CR) 10Kormat: [V: 03+1 C: 04-2] "Opened https://phabricator.wikimedia.org/T277539 to track the issue with mysql.php" [puppet] - 10https://gerrit.wikimedia.org/r/672728 (owner: 10Kormat) [15:19:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1101:3318 (re)pooling @ 75%: Slowly repool db1101:3318', diff saved to https://phabricator.wikimedia.org/P14904 and previous config saved to /var/cache/conftool/dbconfig/20210316-151943-root.json [15:19:44] kormat: are you by any chance suggesting that I am a bad person? [15:19:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:55] elukey: is that something that needs to be _suggested_? 
[15:21:17] kormat: ahahhaha [15:22:42] RECOVERY - Host analytics1066.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.10 ms [15:23:23] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: analytics1066's BBU might need to be replaced - https://phabricator.wikimedia.org/T277005 (10Cmjohnson) 05Open→03Resolved swapped the bbu, the server is back up and handed back to @elukey [15:23:32] RECOVERY - Check systemd state on analytics1066 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:24:10] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: analytics1066's BBU might need to be replaced - https://phabricator.wikimedia.org/T277005 (10elukey) ` elukey@analytics1066:~$ sudo megacli -LDInfo -Lall -aALL | grep "Cache Policy" Default Cache Policy: WriteBack, ReadAdaptive, Direct, Write Cache OK if... [15:24:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1038: PCIe error - https://phabricator.wikimedia.org/T276922 (10Cmjohnson) I see this server is down and will need to go through the process of removing all the components and adding them back 1 by 1 until I can figure out what is cau... 
[15:26:57] (03CR) 10CRusnov: "> Patch Set 4:" [software/netbox] - 10https://gerrit.wikimedia.org/r/672548 (https://phabricator.wikimedia.org/T244849) (owner: 10CRusnov) [15:27:27] !log ayounsi@deploy1002 Started deploy [homer/deploy@759f82c]: T277006 [15:27:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:35] T277006: Remove servers interface names from switches interfaces descriptions - https://phabricator.wikimedia.org/T277006 [15:30:17] (03PS1) 10Arturo Borrero Gonzalez: icinga: monitor: fix notes URL for some toolschecker entries [puppet] - 10https://gerrit.wikimedia.org/r/672739 [15:31:12] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] icinga: monitor: fix notes URL for some toolschecker entries [puppet] - 10https://gerrit.wikimedia.org/r/672739 (owner: 10Arturo Borrero Gonzalez) [15:32:24] !log ayounsi@deploy1002 Finished deploy [homer/deploy@759f82c]: T277006 (duration: 04m 56s) [15:32:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:42] RECOVERY - MegaRAID on analytics1066 is OK: OK: optimal, 12 logical, 13 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [15:34:32] (03CR) 10CRusnov: "> Patch Set 4:" [software/netbox] - 10https://gerrit.wikimedia.org/r/672548 (https://phabricator.wikimedia.org/T244849) (owner: 10CRusnov) [15:34:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1101:3318 (re)pooling @ 100%: Slowly repool db1101:3318', diff saved to https://phabricator.wikimedia.org/P14905 and previous config saved to /var/cache/conftool/dbconfig/20210316-153446-root.json [15:34:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:47] (03PS10) 10CRusnov: netbox, profile::netbox: Switch to CAS authentication [puppet] - 10https://gerrit.wikimedia.org/r/668753 (https://phabricator.wikimedia.org/T244849) [15:39:16] 10SRE, 10Patch-For-Review: Migrate irc.wikimedia.org/kraz to Buster - 
https://phabricator.wikimedia.org/T224579 (10colewhite) Since yesterday, the Prometheus jobs reduced availability alert has been firing about ircd on irc2001. Looking at the logs, there appears to be some breakdown in communication between... [15:39:56] (03PS3) 10Alexandros Kosiaris: Add kubernetes1017 to BGP peers [homer/public] - 10https://gerrit.wikimedia.org/r/672709 [15:41:22] 10SRE, 10IRCecho, 10Icinga, 10observability: Icinga check for ircecho should check for actual activity - https://phabricator.wikimedia.org/T216611 (10colewhite) 05Open→03Resolved Digging a bit into this found that there is some breakdown in communication between prometheus-ircd-exporter and ircd on the... [15:46:57] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 16 days, 16:00:00 on acrab.codfw.wmnet with reason: Extend downtime for like a month until we remove the VMs [15:46:58] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16 days, 16:00:00 on acrab.codfw.wmnet with reason: Extend downtime for like a month until we remove the VMs [15:47:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:03] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 16 days, 16:00:00 on acrux.codfw.wmnet with reason: Extend downtime for like a month until we remove the VMs [15:47:04] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16 days, 16:00:00 on acrux.codfw.wmnet with reason: Extend downtime for like a month until we remove the VMs [15:47:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:32] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:52:11] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:56:29] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:57:41] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:00:04] jbond42 and cdanis: #bothumor My software never has bugs. It just develops random features. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210316T1600). [16:00:17] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:01:37] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:04:45] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:08:45] (03CR) 10Alexandros Kosiaris: [C: 03+1] "LGTM, don't forget to bump the chart version though" [deployment-charts] - 10https://gerrit.wikimedia.org/r/672544 (owner: 10Jeena Huneidi) [16:11:33] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:11:59] (03CR) 10Mstyles: rdf-streaming-updater: fix networkpolicy selector (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/672544 (owner: 10Jeena Huneidi) [16:15:37] (03PS4) 10Mstyles: create helmfile.d structure [deployment-charts] - 10https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) [16:16:44] (03CR) 10Andrew Bogott: "I built an old-school exec node in testlabs and applied this class. Here's the fstab difference from before/after:" [puppet] - 10https://gerrit.wikimedia.org/r/672456 (https://phabricator.wikimedia.org/T272114) (owner: 10Andrew Bogott) [16:17:11] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={ircd,netbox_device_statistics} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:17:12] (03CR) 10Mstyles: create helmfile.d structure (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) (owner: 10Mstyles) [16:17:24] !log testreduce1001 - gzip /var/log/daemon.log.1 ; apt-get clean .. 
free some disk space [16:17:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:01] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [16:19:11] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:20:45] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2226.codfw.wmnet [16:20:48] (03PS1) 10Bstorm: toolschecker: sudo should use -i to properly have the right environ [puppet] - 10https://gerrit.wikimedia.org/r/672749 [16:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:54] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2227.codfw.wmnet [16:20:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:13] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw2226.codfw.wmnet [16:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:27] (03CR) 10Bstorm: "I have some vague memory of running a chmod or something to make errors go away on tools-checker-* hosts. 
That would have been blown away " [puppet] - 10https://gerrit.wikimedia.org/r/672749 (owner: 10Bstorm) [16:22:43] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] toolschecker: sudo should use -i to properly have the right environ [puppet] - 10https://gerrit.wikimedia.org/r/672749 (owner: 10Bstorm) [16:22:57] (03CR) 10Bstorm: [C: 03+2] toolschecker: sudo should use -i to properly have the right environ [puppet] - 10https://gerrit.wikimedia.org/r/672749 (owner: 10Bstorm) [16:24:07] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2240.codfw.wmnet [16:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:41] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2241.codfw.wmnet [16:25:46] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2242.codfw.wmnet [16:25:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:16] (03PS1) 10ArielGlenn: Dumps: Restructure page content batch jobs [dumps] - 10https://gerrit.wikimedia.org/r/672753 (https://phabricator.wikimedia.org/T252396) [16:36:39] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:37:21] where is jenkins? [16:38:38] apergos: https://integration.wikimedia.org/zuul/ [16:38:49] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:39:05] 11 minutes [16:39:11] It's quite backlogged [16:39:31] (03PS3) 10Eevans: Update sessionstore prod to Kask 2021-03-12-195445-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/671252 (https://phabricator.wikimedia.org/T274262) [16:39:33] I'll say [16:40:00] blame cscott -- https://gerrit.wikimedia.org/r/c/mediawiki/core/+/665126 [16:40:06] (03CR) 10Eevans: [C: 03+2] Update sessionstore prod to Kask 2021-03-12-195445-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/671252 (https://phabricator.wikimedia.org/T274262) (owner: 10Eevans) [16:40:26] 11-13 patches in that chain [16:40:43] PROBLEM - mediawiki-installation DSH group on mw2227 is CRITICAL: Host mw2227 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:40:59] apergos: 332 patches currently in CI according to https://grafana.wikimedia.org/d/000000284/continuous-integration?viewPanel=4&orgId=1&from=now-5m&to=now [16:43:05] that's unheard of [16:43:08] it now says 14 monutes [16:43:09] *minutes [16:43:09] something's up [16:43:34] 😬 [16:44:34] !log eevans@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'sessionstore' for release 'production' . [16:44:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:11] sp 13*45 = see you tomorrow [16:46:13] right [16:47:42] !log eevans@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'sessionstore' for release 'production' . 
[16:47:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:07] PROBLEM - Device not healthy -SMART- on analytics1066 is CRITICAL: cluster=analytics device=sat+megaraid,0 instance=analytics1066 job=node site=eqiad https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1066&var-datasource=eqiad+prometheus/ops [16:55:06] (03CR) 10Ahmon Dancy: [C: 03+2] Branch commit for wmf/1.36.0-wmf.35 [core] (wmf/1.36.0-wmf.35) - 10https://gerrit.wikimedia.org/r/672574 (owner: 10TrainBranchBot) [16:55:20] whatttt [16:56:45] 10SRE, 10netbox, 10Patch-For-Review: Add SSO support to netbox - https://phabricator.wikimedia.org/T244849 (10crusnov) So we discussed this at the automation meeting, and it turns out we've all agreed that the current code and patches need to be thrown out entirely and the project redone with the django-auth... [17:00:05] chrisalbon and accraze: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Graphoid / ORES . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210316T1700). [17:04:44] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts mw2226.codfw.wmnet [17:04:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:20] (03PS2) 10ArielGlenn: Dumps: Restructure page content batch jobs [dumps] - 10https://gerrit.wikimedia.org/r/672753 (https://phabricator.wikimedia.org/T252396) [17:09:53] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw2227.codfw.wmnet [17:09:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:46] 10SRE, 10MW-on-K8s, 10serviceops: Figure out appropriate readiness and liveness probes - https://phabricator.wikimedia.org/T276908 (10jijiki) >>! 
In T276908#6916385, @Joe wrote: > After some more work, this is my ideas for liveness and readiness probes: > > # httpd: > -- liveness: tcp connection to the main... [17:12:56] (03CR) 10Jbond: [C: 03+2] logspam.pl: Ignore messages from mwmaint* hosts [puppet] - 10https://gerrit.wikimedia.org/r/672483 (owner: 10Ahmon Dancy) [17:13:41] (03PS1) 10Ebernhardson: Expand CirrusSearchNamespaceWeights with explicit ns numbers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672761 (https://phabricator.wikimedia.org/T277332) [17:14:59] (03PS2) 10Ebernhardson: Expand CirrusSearchNamespaceWeights with explicit ns numbers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672761 (https://phabricator.wikimedia.org/T277332) [17:19:45] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:22:15] (03Merged) 10jenkins-bot: Branch commit for wmf/1.36.0-wmf.35 [core] (wmf/1.36.0-wmf.35) - 10https://gerrit.wikimedia.org/r/672574 (owner: 10TrainBranchBot) [17:24:19] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:30:31] (03CR) 10Cwhite: [C: 03+2] prometheus: generate metrics on dead letter events [puppet] - 10https://gerrit.wikimedia.org/r/672558 (https://phabricator.wikimedia.org/T277080) (owner: 10Cwhite) [17:30:37] (03PS2) 10Cwhite: prometheus: generate metrics on dead letter events [puppet] - 10https://gerrit.wikimedia.org/r/672558 (https://phabricator.wikimedia.org/T277080) [17:31:27] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [17:32:06] !log ppchelko@deploy1002 Started deploy [restbase/deploy@f99ddaa]: Add new wikis T275837 T271983 T273466 T276127 T273460 T276249 [17:32:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:21] T276127: Add mnwwiktionary to RESTBase - https://phabricator.wikimedia.org/T276127 [17:32:21] T271983: Add altwiki to RESTBase - https://phabricator.wikimedia.org/T271983 [17:32:21] T273466: Add mniwiki to RESTBase - https://phabricator.wikimedia.org/T273466 [17:32:21] T276249: Add trvwiki to RESTBase - https://phabricator.wikimedia.org/T276249 [17:32:22] T275837: Add taywiki to RESTBase - https://phabricator.wikimedia.org/T275837 [17:32:22] T273460: Add mniwiktionary to RESTBase - https://phabricator.wikimedia.org/T273460 [17:32:40] (03PS1) 10Volans: tests: add tests for the configuration files [homer/public] - 10https://gerrit.wikimedia.org/r/672765 (https://phabricator.wikimedia.org/T272688) [17:33:19] (03CR) 10jerkins-bot: [V: 04-1] tests: add tests for the configuration files [homer/public] - 10https://gerrit.wikimedia.org/r/672765 (https://phabricator.wikimedia.org/T272688) (owner: 10Volans) [17:33:46] (03CR) 10Ahmon Dancy: [C: 03+2] rdbms: 
avoid undefined "expectBy" notices in TransactionProfiler (II) [core] (wmf/1.36.0-wmf.35) - 10https://gerrit.wikimedia.org/r/672587 (https://phabricator.wikimedia.org/T269789) (owner: 10Krinkle) [17:37:39] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts mw2227.codfw.wmnet [17:37:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:57] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:40:38] (03PS1) 10Giuseppe Lavagetto: Add liveness/readiness probe script to php-fpm images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/672767 (https://phabricator.wikimedia.org/T276908) [17:40:39] (03PS1) 10Giuseppe Lavagetto: [WiP] test harness for php-fpm images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/672768 [17:41:02] (03PS5) 10Mstyles: rdf-streaming-updater:create helmfile.d structure [deployment-charts] - 10https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) [17:43:39] (03PS1) 10Cwhite: logstash: update ingest errors to use dead letters gauge [puppet] - 10https://gerrit.wikimedia.org/r/672769 (https://phabricator.wikimedia.org/T277080) [17:44:31] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 6 days, 0:00:00 on aqs1011.eqiad.wmnet with reason: New buster hosts, not in use [17:44:31] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6 days, 0:00:00 on aqs1011.eqiad.wmnet with reason: New buster hosts, not in use [17:44:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:11] (03PS1) 10Cwhite: logstash: clean up mtail config [puppet] - 10https://gerrit.wikimedia.org/r/672771 
(https://phabricator.wikimedia.org/T277080) [17:49:11] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:50:45] (03CR) 10Cwhite: [C: 03+2] logstash: short-circuit dead letter recursion [puppet] - 10https://gerrit.wikimedia.org/r/672556 (https://phabricator.wikimedia.org/T277080) (owner: 10Cwhite) [17:58:20] 10SRE, 10GitLab (Initialization), 10Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)), 10User-brennen: Define auth strategy for GitLab - https://phabricator.wikimedia.org/T274461 (10brennen) So @Sergey.Trofimovsky.SF reports getting prompted for a credential when trying to follow the user creat... [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210316T1800) [18:03:38] !log ppchelko@deploy1002 Finished deploy [restbase/deploy@f99ddaa]: Add new wikis T275837 T271983 T273466 T276127 T273460 T276249 (duration: 31m 31s) [18:03:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:51] T276127: Add mnwwiktionary to RESTBase - https://phabricator.wikimedia.org/T276127 [18:03:51] T271983: Add altwiki to RESTBase - https://phabricator.wikimedia.org/T271983 [18:03:52] T273466: Add mniwiki to RESTBase - https://phabricator.wikimedia.org/T273466 [18:03:52] T276249: Add trvwiki to RESTBase - https://phabricator.wikimedia.org/T276249 [18:03:52] T275837: Add taywiki to RESTBase - https://phabricator.wikimedia.org/T275837 [18:03:52] T273460: Add mniwiktionary to RESTBase - https://phabricator.wikimedia.org/T273460 [18:06:55] (03Merged) 10jenkins-bot: rdbms: avoid undefined "expectBy" notices in TransactionProfiler (II) [core] (wmf/1.36.0-wmf.35) - 10https://gerrit.wikimedia.org/r/672587 (https://phabricator.wikimedia.org/T269789) (owner: 10Krinkle) [18:07:44] (03PS1) 10Effie 
Mouzeli: profile::mcrouter_wancache: add spec tests [puppet] - 10https://gerrit.wikimedia.org/r/672773 [18:08:53] (03CR) 10jerkins-bot: [V: 04-1] profile::mcrouter_wancache: add spec tests [puppet] - 10https://gerrit.wikimedia.org/r/672773 (owner: 10Effie Mouzeli) [18:11:56] (03PS1) 10MSantos: mobileapps: bump to 2021-03-12-060749-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/672775 [18:13:28] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:13:29] (03Abandoned) 10CRusnov: Group Sensitive Remote: Sync groups at auth time, not creation time [software/netbox] - 10https://gerrit.wikimedia.org/r/672548 (https://phabricator.wikimedia.org/T244849) (owner: 10CRusnov) [18:17:27] (03PS2) 10MSantos: mobileapps: bump to 2021-03-12-060749-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/672775 [18:18:15] (03CR) 10Legoktm: docker_registry_ha: Require authentication from k8s nodes (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/672537 (https://phabricator.wikimedia.org/T273521) (owner: 10Legoktm) [18:18:48] (03PS5) 10Legoktm: docker_registry_ha: Require authentication from k8s nodes [puppet] - 10https://gerrit.wikimedia.org/r/672537 (https://phabricator.wikimedia.org/T273521) [18:19:14] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:23:32] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={ircd,routinator} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:23:53] (03CR) 10Legoktm: [C: 04-1] docker_registry_ha: Require authentication from k8s nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/672537 (https://phabricator.wikimedia.org/T273521) (owner: 10Legoktm) [18:25:44] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:28:14] (03CR) 10MSantos: [C: 03+2] mobileapps: bump to 2021-03-12-060749-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/672775 (owner: 10MSantos) [18:30:07] (03Merged) 10jenkins-bot: mobileapps: bump to 2021-03-12-060749-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/672775 (owner: 10MSantos) [18:32:00] 10SRE, 10SRE-Access-Requests: Add samwalton to analytics-privatedata-users - https://phabricator.wikimedia.org/T277298 (10DannyH) I approve Sam's request. [18:39:13] 10SRE, 10MediaWiki-General, 10observability, 10serviceops, and 2 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (10lmata) Hi @AMooney, I'd like to present this patch as the other of the two I was hoping to bring to your attention for next clinic duty... Please let me know... [18:41:09] !log mbsantos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . 
[18:41:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:07] !log mbsantos@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [18:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:01] !log mbsantos@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [18:48:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:48] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:50:52] (03PS6) 10Legoktm: docker_registry_ha: Require authentication from k8s nodes [puppet] - 10https://gerrit.wikimedia.org/r/672537 (https://phabricator.wikimedia.org/T273521) [18:51:06] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [18:51:30] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [18:51:43] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28632/console" [puppet] - 10https://gerrit.wikimedia.org/r/672537 (https://phabricator.wikimedia.org/T273521) (owner: 10Legoktm) [18:54:08] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: 
All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:55:47] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw2228.codfw.wmnet [18:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:17] (03PS1) 10Ahmon Dancy: testwikis wikis to 1.36.0-wmf.35 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672781 [18:56:19] (03CR) 10Ahmon Dancy: [C: 03+2] testwikis wikis to 1.36.0-wmf.35 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672781 (owner: 10Ahmon Dancy) [18:57:18] (03Merged) 10jenkins-bot: testwikis wikis to 1.36.0-wmf.35 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672781 (owner: 10Ahmon Dancy) [18:58:21] !log dancy@deploy1002 Started scap: testwikis wikis to 1.36.0-wmf.35 [18:58:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:08] PROBLEM - kubelet operational latencies on kubernetes1007 is CRITICAL: instance=kubernetes1007.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [19:00:04] dancy and brennen: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Mediawiki train - American Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210316T1900). [19:03:36] here as backup. [19:03:38] RECOVERY - kubelet operational latencies on kubernetes1007 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [19:06:30] !log commit changes to pfw3-codfw - T274422 [19:06:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:56] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2228.codfw.wmnet [19:08:01] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2228.codfw.wmnet [19:08:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:24] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw2229.codfw.wmnet [19:10:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:04] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2229.codfw.wmnet [19:11:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:22] PROBLEM - Check systemd state on stat1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:19:16] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:19:33] (03PS2) 10Dzahn: DHCP: remove mw2243 through mw2250 [puppet] - 10https://gerrit.wikimedia.org/r/670988 [19:19:54] (03CR) 10Dzahn: [C: 03+2] DHCP: remove mw2243 through mw2250 [puppet] - 10https://gerrit.wikimedia.org/r/670988 (owner: 10Dzahn) [19:20:31] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2229.codfw.wmnet [19:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:17] (03CR) 10Dzahn: [C: 03+2] site/conftool: decom mw2224 through mw2229 [puppet] - 10https://gerrit.wikimedia.org/r/670957 (https://phabricator.wikimedia.org/T277119) (owner: 10Dzahn) [19:21:19] (03PS3) 10Dzahn: site/conftool: decom mw2224 through mw2229 [puppet] - 10https://gerrit.wikimedia.org/r/670957 (https://phabricator.wikimedia.org/T277119) [19:22:45] ACKNOWLEDGEMENT - Host ml-etcd2003 is DOWN: PING CRITICAL - Packet loss = 100% Elukey T277537 [19:31:00] !log dancy@deploy1002 Finished scap: testwikis wikis to 1.36.0-wmf.35 (duration: 33m 41s) [19:31:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:19] 10SRE, 10LDAP-Access-Requests: Grant Access to for apine - https://phabricator.wikimedia.org/T277544 (10Aklapper) @cmassaro: That username seems wrong? https://wikitech.wikimedia.org/wiki/User:apine says that the user account is not registered. 
[19:41:26] (03PS1) 10Ahmon Dancy: group0 wikis to 1.36.0-wmf.35 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672785 [19:41:28] (03CR) 10Ahmon Dancy: [C: 03+2] group0 wikis to 1.36.0-wmf.35 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672785 (owner: 10Ahmon Dancy) [19:41:31] 10SRE, 10DNS, 10Traffic: Apple Business Manager: verify ownership of wikimedia.org - https://phabricator.wikimedia.org/T274592 (10bcampbell) Hey @BBlack all good, thanks for the help. Unfortunately, the verification random string resets itself every 14 calendar days, so the txt record in the original request... [19:43:16] (03Merged) 10jenkins-bot: group0 wikis to 1.36.0-wmf.35 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672785 (owner: 10Ahmon Dancy) [19:44:53] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.35 [19:44:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:51] (03PS1) 10Ahmon Dancy: group0 wikis to 1.36.0-wmf.34 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672791 [19:49:53] (03CR) 10Ahmon Dancy: [C: 03+2] group0 wikis to 1.36.0-wmf.34 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672791 (owner: 10Ahmon Dancy) [19:50:45] (03Merged) 10jenkins-bot: group0 wikis to 1.36.0-wmf.34 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672791 (owner: 10Ahmon Dancy) [19:51:41] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:51:57] !log commit changes to pfw3-eqiad - T274422 [19:52:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:20] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.34 [19:52:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:47] (03PS1) 10BBlack: wikimedia.org: Add Apple Business Manager TXT record [dns] - 10https://gerrit.wikimedia.org/r/672794 (https://phabricator.wikimedia.org/T274592) [19:58:55] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [20:00:48] !log 1.36.0-wmf.35 train status (T274939): blocked at group0 on T277362 [20:00:48] 10SRE, 10LDAP-Access-Requests: Grant Access to for apine - https://phabricator.wikimedia.org/T277544 (10Legoktm) Please see https://wikitech.wikimedia.org/wiki/Analytics/Data_access#What_access_should_I_request? - I think you want a #sre-access-requests, not an LDAP one. >>! In T277544#6918998, @Ak... 
[20:00:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:56] T277362: Deprecation warning client-repo wikitext link - https://phabricator.wikimedia.org/T277362 [20:00:56] T274939: 1.36.0-wmf.35 deployment blockers - https://phabricator.wikimedia.org/T274939 [20:01:03] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [20:02:13] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:02:25] (03PS1) 10Dzahn: otrs: replace spamassassin cron with systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/672796 (https://phabricator.wikimedia.org/T273673) [20:02:35] (03PS2) 10Effie Mouzeli: profile::mcrouter_wancache: add spec tests [puppet] - 10https://gerrit.wikimedia.org/r/672773 [20:04:23] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:04:30] (03CR) 10BBlack: [C: 03+2] wikimedia.org: Add Apple Business Manager TXT record [dns] - 10https://gerrit.wikimedia.org/r/672794 (https://phabricator.wikimedia.org/T274592) (owner: 10BBlack) [20:05:42] 10SRE, 10DNS, 10Traffic, 10Patch-For-Review: Apple Business Manager: verify ownership of wikimedia.org - https://phabricator.wikimedia.org/T274592 (10BBlack) @bcampbell - Updated with the new record, try again? 
[20:06:36] (03PS1) 10Elukey: Replace labsdb1012 with clouddb1021 in analytics-in4 [homer/public] - 10https://gerrit.wikimedia.org/r/672797 (https://phabricator.wikimedia.org/T269211) [20:07:28] 10SRE, 10SRE-Access-Requests: Add samwalton to analytics-privatedata-users - https://phabricator.wikimedia.org/T277298 (10Ottomata) Oh, in case needed: approved by me too. [20:08:41] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:08:52] 10SRE, 10DNS, 10Traffic, 10Patch-For-Review: Apple Business Manager: verify ownership of wikimedia.org - https://phabricator.wikimedia.org/T274592 (10bcampbell) @BBlack Bingo, that worked. Our domain is now verified. Apple's documentation says you can remove the record now if you want. [20:08:54] (03CR) 10Razzi: [C: 03+1] Replace labsdb1012 with clouddb1021 in analytics-in4 [homer/public] - 10https://gerrit.wikimedia.org/r/672797 (https://phabricator.wikimedia.org/T269211) (owner: 10Elukey) [20:15:11] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:15:52] !log remove DMZ zone from pfw3-eqiad - T174203 [20:16:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:03] T174203: Investigate decommissioning two eqiad-frack vlans - https://phabricator.wikimedia.org/T174203 [20:19:21] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:21:27] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:25:03] 10SRE, 10SRE-Access-Requests, 10wikimedia-irc-freenode: Grant wmopbot +o permissions in #wikimedia-operations IRC channel - https://phabricator.wikimedia.org/T275711 (10DeltaQuad) 05Open→03Resolved [20:25:25] 10SRE, 10SRE-Access-Requests, 10wikimedia-irc-freenode: Grant wmopbot +o permissions in #wikimedia-operations IRC channel - https://phabricator.wikimedia.org/T275711 (10DeltaQuad) >*!*@ wikimedia/bot/wmopbot +Aeiortv (OP) [modified 8h 22m 19s ago] [20:27:59] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:30:11] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:33:20] (03CR) 10Ayounsi: [C: 03+1] Replace labsdb1012 with clouddb1021 in analytics-in4 [homer/public] - 10https://gerrit.wikimedia.org/r/672797 (https://phabricator.wikimedia.org/T269211) (owner: 10Elukey) [20:36:43] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:37:49] (03CR) 10Ayounsi: [C: 03+1] Add kubernetes1017 to BGP peers [homer/public] - 10https://gerrit.wikimedia.org/r/672709 (owner: 10Alexandros Kosiaris) [20:39:38] 10SRE, 10SRE-Access-Requests, 10wikimedia-irc-freenode: Grant wmopbot +o permissions in #wikimedia-operations IRC channel - https://phabricator.wikimedia.org/T275711 (10Legoktm) a:03faidon Thanks! 
[20:41:05] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:45:36] !log andrew@deploy1002 Started deploy [horizon/deploy@e4fd934]: tiny horizon patch to support flavor deprecation [20:45:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:20] !log andrew@deploy1002 Finished deploy [horizon/deploy@e4fd934]: tiny horizon patch to support flavor deprecation (duration: 03m 44s) [20:49:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:32] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install new linecards into routers - https://phabricator.wikimedia.org/T277339 (10ayounsi) Those are hot swap-able, so it can be installed anytime. Please sync up with me first so I'm around to test it, let's use `slot 1`. I'll use that opportunity to... [20:54:04] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) install MPC7E-MRATE FPC into cr[12]-codfw - https://phabricator.wikimedia.org/T277341 (10ayounsi) Those are hot swap-able, so then can be installed anytime. Please sync up with me first so I'm around to test them and let's use `slot 1`. Just in case here is the... [20:54:39] 10SRE, 10DC-Ops, 10homer, 10netops: Remove servers interface names from switches interfaces descriptions - https://phabricator.wikimedia.org/T277006 (10ayounsi) 05Open→03Resolved a:03ayounsi All done! 
[20:59:13] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2230.codfw.wmnet [20:59:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:32] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2231.codfw.wmnet [20:59:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:42] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2232.codfw.wmnet [20:59:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:23] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:04:29] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:08:22] (03PS1) 10BBlack: Revert "wikimedia.org: Add Apple Business Manager TXT record" [dns] - 10https://gerrit.wikimedia.org/r/672801 (https://phabricator.wikimedia.org/T274592) [21:10:25] PROBLEM - mediawiki-installation DSH group on mw2232 is CRITICAL: Host mw2232 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [21:10:47] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:10:47] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads 
https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [21:11:55] (03CR) 10BBlack: [C: 03+2] Revert "wikimedia.org: Add Apple Business Manager TXT record" [dns] - 10https://gerrit.wikimedia.org/r/672801 (https://phabricator.wikimedia.org/T274592) (owner: 10BBlack) [21:12:47] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [21:13:13] PROBLEM - mediawiki-installation DSH group on mw2231 is CRITICAL: Host mw2231 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [21:14:34] 10SRE, 10SRE-Access-Requests: Requesting access to sites from Google Search Console - https://phabricator.wikimedia.org/T277602 (10CGlenn) [21:15:03] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:17:42] (03PS1) 10Cwhite: Add normalized_message keyword field [software/ecs] - 10https://gerrit.wikimedia.org/r/672805 [21:19:21] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:21:37] (03PS2) 10Cwhite: Add normalized_message keyword field [software/ecs] - 10https://gerrit.wikimedia.org/r/672805 [21:22:43] (03PS3) 10Cwhite: Add normalized_message keyword field [software/ecs] - 10https://gerrit.wikimedia.org/r/672805 [21:35:10] (03PS1) 10Cwhite: logstash: add normalized_message field to ECS [puppet] - 10https://gerrit.wikimedia.org/r/672828 [21:40:08] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:49:30] PROBLEM - mediawiki-installation DSH group on mw2230 is CRITICAL: Host mw2230 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [21:52:53] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:53:55] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:57:35] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:58:55] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:04:47] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:05:09] (03PS2) 10Dzahn: otrs: replace spamassassin cron with systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/672796 (https://phabricator.wikimedia.org/T273673) [22:05:11] (03PS1) 10Dzahn: otrs: remove absented cron code [puppet] - 10https://gerrit.wikimedia.org/r/672830 (https://phabricator.wikimedia.org/T273673) [22:06:23] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:11:49] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:18:54] 10SRE, 10ops-codfw: ganeti2015 doesn't boot - https://phabricator.wikimedia.org/T277537 (10wiki_willy) a:03Papaul Chatted with @MoritzMuehlenhoff today, and this can wait until @Papaul is back next Monday. 
Thanks, Willy [22:18:56] (03PS1) 10Dzahn: site/conftool: remove mw2239 [puppet] - 10https://gerrit.wikimedia.org/r/672834 [22:20:13] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:32:45] (03PS2) 10Dzahn: site/conftool: remove mw2230 through mw2242 [puppet] - 10https://gerrit.wikimedia.org/r/672834 (https://phabricator.wikimedia.org/T277119) [22:36:17] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:38:29] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:38:45] (03CR) 10Ladsgroup: "This change is ready for review." [core] (wmf/1.36.0-wmf.35) - 10https://gerrit.wikimedia.org/r/672811 (https://phabricator.wikimedia.org/T277362) (owner: 10Ladsgroup) [22:45:17] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:46:20] (03CR) 10Ori.livneh: "This change is ready for review." [extensions/Scribunto] (wmf/1.36.0-wmf.34) - 10https://gerrit.wikimedia.org/r/672836 (owner: 10Ori.livneh) [22:46:43] 10SRE, 10GitLab (Initialization), 10Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)), 10User-brennen: Define auth strategy for GitLab - https://phabricator.wikimedia.org/T274461 (10jbond) >>! In T274461#6918624, @brennen wrote: > So @Sergey.Trofimovsky.SF reports getting prompted for a credent... 
[22:47:11] (03PS4) 10Krinkle: InitialiseSettings: Remove wmg/wg indirection for BotPasswords (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666778 [22:47:13] (03PS4) 10Krinkle: CommonSettings: Remove wmg/wg indirection for BotPasswords (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666779 [22:48:22] (03PS6) 10Effie Mouzeli: mediawiki::mcrouter: add onhost memcached unix socket support [puppet] - 10https://gerrit.wikimedia.org/r/663565 (https://phabricator.wikimedia.org/T273115) [22:49:47] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:49:57] (03CR) 10Jeena Huneidi: "> Patch Set 4: Code-Review+1" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/672544 (owner: 10Jeena Huneidi) [22:50:02] (03PS5) 10Krinkle: CommonSettings: Remove wmg/wg indirection for BotPasswords (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666779 [22:50:09] (03PS2) 10Krinkle: InitialiseSettings: Remove wmg/wg indirection for BotPasswords (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667935 [22:51:41] ori: rolling out the above, can take yours with it [22:52:12] (03PS5) 10Jeena Huneidi: rdf-streaming-updater: fix networkpolicy selector [deployment-charts] - 10https://gerrit.wikimedia.org/r/672544 [22:53:01] (03CR) 10Krinkle: [C: 03+2] InitialiseSettings: Remove wmg/wg indirection for BotPasswords (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666778 (owner: 10Krinkle) [22:53:48] (03Merged) 10jenkins-bot: InitialiseSettings: Remove wmg/wg indirection for BotPasswords (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666778 (owner: 10Krinkle) [22:55:54] Krinkle: awesome, thanks [22:56:10] (03CR) 10Krinkle: [C: 03+2] CommonSettings: Remove wmg/wg indirection for BotPasswords (2/3) [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/666779 (owner: 10Krinkle) [22:56:34] jouncebot: refresh [22:56:35] I refreshed my knowledge about deployments. [22:56:46] (03PS1) 10H.krishna123: Add new methods in recover-dump to measure execution time [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/672837 (https://phabricator.wikimedia.org/T277160) [22:57:00] (03Merged) 10jenkins-bot: CommonSettings: Remove wmg/wg indirection for BotPasswords (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666779 (owner: 10Krinkle) [22:57:22] Amir1: I can roll it out for you [22:58:20] (03CR) 10Krinkle: [C: 03+2] InitialiseSettings: Remove wmg/wg indirection for BotPasswords (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667935 (owner: 10Krinkle) [22:58:21] Sure. Thanks [22:59:06] !log krinkle@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Ie24eb2077 (duration: 00m 58s) [22:59:12] (03Merged) 10jenkins-bot: InitialiseSettings: Remove wmg/wg indirection for BotPasswords (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667935 (owner: 10Krinkle) [22:59:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:04] RoanKattouw, Niharika, and Urbanecm: Time to snap out of that daydream and deploy Evening backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210316T2300). [23:00:05] ebernhardson and Amir1: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:22] o/ I'm around but Krinkle is handling it (Thanks!) 
[23:00:46] * Krinkle looks for which line pinged [23:00:53] I see you're having a chat with jouncebot [23:00:54] good times :) [23:00:58] \o [23:01:56] (03CR) 10Krinkle: [C: 03+2] Revert "Deprecate constructing revision with non-proper page" [core] (wmf/1.36.0-wmf.35) - 10https://gerrit.wikimedia.org/r/672811 (https://phabricator.wikimedia.org/T277362) (owner: 10Ladsgroup) [23:02:01] The lockdown has taken its toll on me :D [23:02:04] Amir1: you know this branch isn't live yet, right? [23:02:18] * Krinkle knows nothing [23:02:30] !log krinkle@deploy1002 Synchronized wmf-config/CommonSettings.php: I4097cbcb1d5 (duration: 00m 59s) [23:02:34] isn't it already cut? [23:02:39] let me see [23:02:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:02:54] (03CR) 10Mstyles: rdf-streaming-updater: fix networkpolicy selector (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/672544 (owner: 10Jeena Huneidi) [23:03:35] !log applied hotfix to phabricator/src/infrastructure/customfield/storage/PhabricatorCustomFieldStorage.php and restarted php-fpm [23:03:37] Krinkle: it's deployed to group0 https://phabricator.wikimedia.org/T277593 [23:03:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:03:45] > ~34 of these after deploy of 1.36.0-wmf.35 to group0. [23:04:20] versions.toolforge disagrees [23:04:30] how do I get jenkins-bot to run on https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Scribunto/+/672836? (I'm very rusty.) [23:04:58] ugh git is being annoying again [23:05:00] ori: it is already https://integration.wikimedia.org/zuul/?q=672836 [23:05:04] maybe got rolled back? 
[23:05:18] I'll fix the git rebase issue in the meantime [23:05:20] but we've tolerated CI slowing down from 5min to 25min [23:05:22] oh good, thanks [23:05:23] so this is the new normal [23:05:37] :(((( [23:05:56] deposit a token at this task to register your dissatisfaction :) https://phabricator.wikimedia.org/T225730 [23:06:35] token deposited [23:06:58] <3 [23:07:51] !log krinkle@deploy1002 Synchronized wmf-config/InitialiseSettings.php: I8d8c94d95c6 (duration: 00m 59s) [23:07:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:23] (03PS2) 10Ladsgroup: Revert "Deprecate constructing revision with non-proper page" [core] (wmf/1.36.0-wmf.35) - 10https://gerrit.wikimedia.org/r/672811 (https://phabricator.wikimedia.org/T277362) [23:12:29] (03CR) 10Krinkle: [C: 03+2] "CI still running, but a selenium job failed:" [extensions/Scribunto] (wmf/1.36.0-wmf.34) - 10https://gerrit.wikimedia.org/r/672836 (owner: 10Ori.livneh) [23:12:33] (03CR) 10Ladsgroup: [C: 03+2] Revert "Deprecate constructing revision with non-proper page" [core] (wmf/1.36.0-wmf.35) - 10https://gerrit.wikimedia.org/r/672811 (https://phabricator.wikimedia.org/T277362) (owner: 10Ladsgroup) [23:13:07] ebernhardson: are the patches fine to apply separately one by one? [23:13:36] (03CR) 10Krinkle: [C: 03+2] Expand CirrusSearchNamespaceWeights with explicit ns numbers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672761 (https://phabricator.wikimedia.org/T277332) (owner: 10Ebernhardson) [23:14:01] Krinkle: yup, they can apply together or one at a time, they impact separate things [23:14:08] (03CR) 10jerkins-bot: [V: 04-1] Temporary debug logging for LuaEngine::frameExists('current') [extensions/Scribunto] (wmf/1.36.0-wmf.34) - 10https://gerrit.wikimedia.org/r/672836 (owner: 10Ori.livneh) [23:14:25] ebernhardson: ok, I'll roll out the first one now. Will be up for mwdebug1002 verification soon [23:14:40] ok [23:14:49] Krinkle: Do you need me for anything? 
Looks like you're handling the train blocker? [23:15:12] dancy: ack, I was already deploying pre-window so I'm taking them the same time [23:15:24] I don't know if these were train unblockers or not, but they were on the backport window [23:15:47] OK. Will you be rolling group0 to .35? [23:15:53] I don't think the test failure ('Can't call getText on element with selector "#mw-content-text .mw-parser-output" because element wasn't found') on my change is legit.. [23:16:06] Or just backport to .34? [23:16:09] dancy: No, just doing the backport window at the current calendar time [23:16:33] Gotcha. Have a good day. I'm going offline [23:16:50] I think maybe amir's revert is train related? [23:17:07] which might end up unblocking it, I don't know yeah. [23:17:09] Yeah. https://phabricator.wikimedia.org/T277362 [23:17:31] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:17:34] so maybe later today you/releng could roll forward in that case [23:17:45] should be done in 30min or so [23:18:03] not sure why the config patch isn't landing [23:18:07] (03PS3) 10Krinkle: Expand CirrusSearchNamespaceWeights with explicit ns numbers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672761 (https://phabricator.wikimedia.org/T277332) (owner: 10Ebernhardson) [23:18:12] (03CR) 10Krinkle: [C: 03+2] Expand CirrusSearchNamespaceWeights with explicit ns numbers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672761 (https://phabricator.wikimedia.org/T277332) (owner: 10Ebernhardson) [23:18:17] (03PS3) 10Krinkle: Add Cirrus testing profile for glent m1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672565 (https://phabricator.wikimedia.org/T262612) (owner: 10Ebernhardson) [23:18:18] Jenkins seems to be overloaded today [23:19:02] ah, it needs fast-forward but doesn't say 
anything [23:19:04] landing now [23:19:10] (03Merged) 10jenkins-bot: Expand CirrusSearchNamespaceWeights with explicit ns numbers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672761 (https://phabricator.wikimedia.org/T277332) (owner: 10Ebernhardson) [23:19:36] ebernhardson: alright, nr 1 is up on mwdebug1002 [23:19:51] For my own sanity I will wait until the Weds train window to roll forward. [23:20:02] k :) [23:20:29] Krinkle: all looks sane [23:21:45] ack, noticed one INFO log at https://logstash.wikimedia.org/app/dashboards#/view/mwdebug1002 [23:21:49] but nothing unexpected I suppose [23:21:51] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:21:51] rolling out [23:22:05] (03CR) 10Ori.livneh: [C: 03+2] "One more attempt to deflake." [extensions/Scribunto] (wmf/1.36.0-wmf.34) - 10https://gerrit.wikimedia.org/r/672836 (owner: 10Ori.livneh) [23:22:31] !log krinkle@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Icd6635cb302cc, T277332 (duration: 00m 58s) [23:22:36] (03CR) 10Krinkle: [C: 03+2] Add Cirrus testing profile for glent m1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672565 (https://phabricator.wikimedia.org/T262612) (owner: 10Ebernhardson) [23:22:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:39] T277332: Uncaught Error: Widget not found / Call to a member function getNsIndex() on null on CirrusSearch result page with internal error - https://phabricator.wikimedia.org/T277332 [23:25:13] (03PS4) 10Krinkle: Add Cirrus testing profile for glent m1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672565 (https://phabricator.wikimedia.org/T262612) (owner: 10Ebernhardson) [23:25:20] (03CR) 10Krinkle: [C: 03+2] Add Cirrus testing profile for glent m1 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/672565 (https://phabricator.wikimedia.org/T262612) (owner: 10Ebernhardson) [23:25:41] ori: the gate jobs were still running from my +2, the test failure was from the patch submission jobs. [23:25:52] but the added +2 didn't re-run them unfortunately [23:25:54] fortunately* [23:26:04] https://integration.wikimedia.org/zuul/ filter "gate" [23:26:08] almost done :) [23:26:27] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:27:06] (03Merged) 10jenkins-bot: Add Cirrus testing profile for glent m1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672565 (https://phabricator.wikimedia.org/T262612) (owner: 10Ebernhardson) [23:27:44] ack [23:28:06] ebernhardson: 2nd patch up on mwdebug1002 [23:30:31] Krinkle: looks good as well [23:31:06] ack, rolling out now [23:31:19] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [23:31:58] !log krinkle@deploy1002 Synchronized wmf-config/InitialiseSettings.php: I1ca4f30c2, T262612 (duration: 00m 57s) [23:31:59] (03Abandoned) 10H.krishna123: Add new methods in recover-dump to measure execution time [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/672837 (https://phabricator.wikimedia.org/T277160) (owner: 10H.krishna123) [23:32:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:32:05] T262612: Run an A/B test using suggestions generated using glent Method 1 - https://phabricator.wikimedia.org/T262612 [23:33:27] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All 
metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:37:16] (03Merged) 10jenkins-bot: Temporary debug logging for LuaEngine::frameExists('current') [extensions/Scribunto] (wmf/1.36.0-wmf.34) - 10https://gerrit.wikimedia.org/r/672836 (owner: 10Ori.livneh) [23:39:40] cscott are you around? [23:40:23] (03PS2) 10H.krishna123: Add new methods in recover-dump to measure execution time [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/672150 (https://phabricator.wikimedia.org/T277160) [23:41:44] (03PS3) 10H.krishna123: Add new methods in recover-dump to measure execution time [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/672150 (https://phabricator.wikimedia.org/T277160) [23:43:38] ori: wakey wakey :D [23:44:05] checking [23:44:23] Krinkle: you haven't synced yet, right? [23:44:43] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:45:12] ori: I pulled it on mwdebug1002 just now [23:45:22] i'm tailing AdHocDebug.log on mwlog1001 [23:45:23] I guess we don't have a repro, right? 
[23:45:43] (03Merged) 10jenkins-bot: Revert "Deprecate constructing revision with non-proper page" [core] (wmf/1.36.0-wmf.35) - 10https://gerrit.wikimedia.org/r/672811 (https://phabricator.wikimedia.org/T277362) (owner: 10Ladsgroup) [23:46:09] ori: for now just want to exercise the common path on XWD to make sure it's not gonna throw an error of sorts [23:47:16] !log There is an uncommitted dirty diff in /srv/mediawiki-staging/php-1.36.0-wmf.34/extensions/WikimediaMaintenance/createExtensionTables.php [23:47:20] Yup [23:47:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:47:49] tgr|away: might be yours ^ given SAL entry from earlier [23:48:23] DannyS712 eating dinner now, but I'll keep an eye on IRC [23:48:38] tgr_: I'll undo it for now, assuming it wasn't synced [23:49:01] ori: does it get triggered on an edit with one #invoke? [23:49:04] Krinkle: something about a missing path segment? [23:49:11] - $path = "$IP/extensions/GrowthExperiments/maintenance/mysql"; [23:49:11] + $path = "$IP/extensions/GrowthExperiments/maintenance/schemas/mysql"; [23:49:13] Krinkle: yes, I just tried it and it was fine [23:49:25] https://en.wikipedia.org/w/index.php?title=User:ATDT/sandbox&oldid=1012546982 on mwdebug1001 [23:49:30] yeah, that's mine, sorry. I was sure I cleaned it up. [23:49:47] np, undid it now using git checkout [23:50:09] (no log messages but that's what i'm expecting.) [23:50:56] ori: syncing now [23:51:13] thanks [23:51:31] Amir1: ready to go? [23:51:40] Sure! [23:51:45] it should be noop [23:51:48] !log krinkle@deploy1002 Synchronized php-1.36.0-wmf.34/extensions/Scribunto/: I84e8732d8d - tmp logging (duration: 00m 58s) [23:51:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:51:54] I don't know how it can be easily tested [23:52:09] still no log messages \o/ [23:53:09] Amir1: pulled to mwd1002, but not verifiable right now, so this is for verification later when the train rolls out, right? 
[23:53:51] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:55:07] yes [23:56:11] syncing [23:56:32] !log krinkle@deploy1002 Synchronized php-1.36.0-wmf.35/includes/Revision/: I8619ab9e92b, T277362, T275531 (duration: 00m 58s) [23:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:56:40] T275531: Make RevisionRecord return a ProperPageIdentity - https://phabricator.wikimedia.org/T275531 [23:56:41] T277362: Deprecation warning client-repo wikitext link - https://phabricator.wikimedia.org/T277362