[00:02:03] alright, that was the backport deploy window [00:02:06] jouncebot: next [00:02:06] In 10 hour(s) and 57 minute(s): European mid-day backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210317T1100) [00:03:03] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:03:43] PROBLEM - MariaDB Replica SQL: s5 on db2089 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Error Duplicate entry 212.200.0.0/16-0-0-1 for key ipb_address_unique on query. Default database: srwiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:03:51] (03CR) 10CRusnov: "Ping @BBlack on the disposition of this script." [puppet] - 10https://gerrit.wikimedia.org/r/630693 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [00:03:59] PROBLEM - MariaDB Replica SQL: s5 on db2137 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Error Duplicate entry 212.200.0.0/16-0-0-1 for key ipb_address_unique on query. Default database: srwiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:04:15] PROBLEM - MariaDB Replica SQL: s5 on db2099 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Error Duplicate entry 212.200.0.0/16-0-0-1 for key ipb_address_unique on query. Default database: srwiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:05:23] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:05:27] PROBLEM - MariaDB Replica SQL: s5 on db2139 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Error Duplicate entry 212.200.0.0/16-0-0-1 for key ipb_address_unique on query. Default database: srwiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:06:02] Krinkle: thanks again [00:07:45] (03CR) 10CRusnov: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/670922 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [00:09:21] (03CR) 10CRusnov: "Just to verify, the previous +1s still apply?" [puppet] - 10https://gerrit.wikimedia.org/r/658396 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [00:09:35] Krinkle: Thanks! [00:12:07] PROBLEM - dump of es5 in eqiad on alert1001 is CRITICAL: dump for es5 at eqiad taken more than 8 days ago: Most recent backup 2021-03-09 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [00:12:23] (03PS2) 10CRusnov: confd/confd-lint-wrap.py: Port for Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/658414 (https://phabricator.wikimedia.org/T247364) [00:17:47] (03CR) 10Ori.livneh: "Filed https://phabricator.wikimedia.org/T277614 for the flaky test" [extensions/Scribunto] (wmf/1.36.0-wmf.34) - 10https://gerrit.wikimedia.org/r/672836 (owner: 10Ori.livneh) [00:17:49] (03PS6) 10Jeena Huneidi: rdf-streaming-updater: fix networkpolicy selector [deployment-charts] - 10https://gerrit.wikimedia.org/r/672544 [00:20:38] (03PS7) 10Jeena Huneidi: rdf-streaming-updater: fix networkpolicy selector [deployment-charts] - 10https://gerrit.wikimedia.org/r/672544 [00:22:19] PROBLEM - MariaDB Replica Lag: s5 on db2089 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1395.42 seconds 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:23:03] PROBLEM - MariaDB Replica Lag: s5 on db2139 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1438.13 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:23:25] PROBLEM - MariaDB Replica Lag: s5 on db2137 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1460.20 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:23:25] PROBLEM - MariaDB Replica Lag: s5 on db2099 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1460.23 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:26:02] (03CR) 10Mstyles: rdf-streaming-updater: fix networkpolicy selector (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/672544 (owner: 10Jeena Huneidi) [00:27:21] (03CR) 10Mstyles: rdf-streaming-updater: fix networkpolicy selector (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/672544 (owner: 10Jeena Huneidi) [00:27:36] (03CR) 10Mstyles: [C: 03+1] rdf-streaming-updater: fix networkpolicy selector [deployment-charts] - 10https://gerrit.wikimedia.org/r/672544 (owner: 10Jeena Huneidi) [00:28:47] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:28:53] PROBLEM - dump of es4 in eqiad on alert1001 is CRITICAL: dump for es4 at eqiad taken more than 8 days ago: Most recent backup 2021-03-09 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [00:31:23] RECOVERY - mediawiki originals uploads -hourly- for codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [00:31:27] RECOVERY - mediawiki originals uploads -hourly- for eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [00:34:50] (03PS3) 10Legoktm: [WIP] Add shellbox chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/667047 [00:35:49] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add shellbox chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/667047 (owner: 10Legoktm) [00:38:09] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:20:17] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:24:57] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:31:59] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:34:19] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:38:59] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:41:19] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:55:27] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:57:49] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:01:12] (03CR) 10Razzi: [C: 03+1] Replace labsdb1012 with clouddb1021 in analytics-in4 (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/672797 (https://phabricator.wikimedia.org/T269211) (owner: 10Elukey) [02:02:27] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={ircd,netbox_device_statistics} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:26:31] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:31:19] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:36:05] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:40:47] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={ircd,netbox_device_statistics} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:43:07] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:48:01] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:50:25] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:57:25] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={ircd,netbox_device_statistics} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [03:09:13] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [03:17:45] PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:22:41] RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.066 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:31:45] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={ircd,netbox_device_statistics} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [03:38:59] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [03:46:13] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [03:48:43] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [03:53:43] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [03:56:03] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:00:39] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={atlas_exporter,ircd} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:03:03] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:25:03] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:35:17] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:42:47] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:55:07] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:59:49] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:09:25] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:18:43] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:18:43] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:21:09] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:42:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1111 for schema change', diff saved to https://phabricator.wikimedia.org/P14907 and previous config saved to /var/cache/conftool/dbconfig/20210317-054206-marostegui.json [05:42:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:47:33] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={ircd,netbox_device_statistics} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:52:45] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:57:50] (03PS1) 10Marostegui: install_server: Do not reimage db1162 [puppet] - 10https://gerrit.wikimedia.org/r/672878 (https://phabricator.wikimedia.org/T258361) [05:59:05] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db1162 [puppet] - 10https://gerrit.wikimedia.org/r/672878 (https://phabricator.wikimedia.org/T258361) (owner: 10Marostegui) [05:59:27] (03PS1) 10ArielGlenn: Create group for root access to snapshot, dumpsdata and labstore1006,7 hosts [puppet] - 10https://gerrit.wikimedia.org/r/672879 (https://phabricator.wikimedia.org/T277629) [06:00:11] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={ircd,netbox_device_statistics} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:01:11] (03PS1) 10Marostegui: instances.yaml: Add db2150 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/672880 (https://phabricator.wikimedia.org/T275633) [06:02:07] (03CR) 
10Marostegui: [C: 03+2] instances.yaml: Add db2150 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/672880 (https://phabricator.wikimedia.org/T275633) (owner: 10Marostegui) [06:02:43] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:03:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db2150 to s7, depooled T275633', diff saved to https://phabricator.wikimedia.org/P14908 and previous config saved to /var/cache/conftool/dbconfig/20210317-060358-marostegui.json [06:04:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:04:06] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [06:17:29] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={ircd,netbox_device_statistics} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:18:14] ACKNOWLEDGEMENT - MariaDB Replica Lag: s5 on db2089 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 22606.75 seconds Marostegui T277632 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:18:14] ACKNOWLEDGEMENT - MariaDB Replica SQL: s5 on db2089 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Error Duplicate entry 212.200.0.0/16-0-0-1 for key ipb_address_unique on query. Default database: srwiki. 
[Query snipped] Marostegui T277632 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:18:14] ACKNOWLEDGEMENT - MariaDB Replica Lag: s5 on db2099 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 22681.21 seconds Marostegui T277632 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:18:14] ACKNOWLEDGEMENT - MariaDB Replica SQL: s5 on db2099 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Error Duplicate entry 212.200.0.0/16-0-0-1 for key ipb_address_unique on query. Default database: srwiki. [Query snipped] Marostegui T277632 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:18:14] ACKNOWLEDGEMENT - MariaDB Replica Lag: s5 on db2137 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 22677.91 seconds Marostegui T277632 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:18:14] ACKNOWLEDGEMENT - MariaDB Replica SQL: s5 on db2137 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Error Duplicate entry 212.200.0.0/16-0-0-1 for key ipb_address_unique on query. Default database: srwiki. [Query snipped] Marostegui T277632 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:18:15] ACKNOWLEDGEMENT - MariaDB Replica Lag: s5 on db2139 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 22657.36 seconds Marostegui T277632 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:18:15] ACKNOWLEDGEMENT - MariaDB Replica SQL: s5 on db2139 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Error Duplicate entry 212.200.0.0/16-0-0-1 for key ipb_address_unique on query. Default database: srwiki. 
[Query snipped] Marostegui T277632 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:19:23] (03PS1) 10Marostegui: db2089,db2137,db2139: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/672882 (https://phabricator.wikimedia.org/T277632) [06:19:51] (03CR) 10Marostegui: [C: 03+2] db2089,db2137,db2139: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/672882 (https://phabricator.wikimedia.org/T277632) (owner: 10Marostegui) [06:20:01] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:23:49] RECOVERY - MariaDB Replica SQL: s5 on db2089 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:25:45] RECOVERY - MariaDB Replica SQL: s5 on db2139 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:26:41] RECOVERY - MariaDB Replica SQL: s5 on db2137 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:27:03] RECOVERY - MariaDB Replica SQL: s5 on db2099 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:29:41] RECOVERY - MariaDB Replica Lag: s5 on db2139 is OK: OK slave_sql_lag Replication lag: 0.08 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:30:05] RECOVERY - MariaDB Replica Lag: s5 on db2099 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:30:41] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals 
cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [06:30:45] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [06:31:19] RECOVERY - MariaDB Replica Lag: s5 on db2089 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:32:27] RECOVERY - MariaDB Replica Lag: s5 on db2137 is OK: OK slave_sql_lag Replication lag: 0.37 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:32:53] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:42:39] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:45:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 25%: Slowly repool db1111', diff saved to https://phabricator.wikimedia.org/P14909 and previous config saved to /var/cache/conftool/dbconfig/20210317-064513-root.json [06:45:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:45:21] (03CR) 10Ayounsi: [C: 03+1] Replace labsdb1012 with clouddb1021 in analytics-in4 (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/672797 (https://phabricator.wikimedia.org/T269211) (owner: 10Elukey) [06:46:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db2150 into s7 T275633', diff saved to https://phabricator.wikimedia.org/P14910 and previous config saved to /var/cache/conftool/dbconfig/20210317-064606-marostegui.json [06:46:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:46:14] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [06:46:55] (03PS1) 10Marostegui: db2150: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/672884 (https://phabricator.wikimedia.org/T275633) [06:47:27] (03CR) 10Marostegui: [C: 03+2] db2150: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/672884 (https://phabricator.wikimedia.org/T275633) (owner: 10Marostegui) [06:50:01] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:51:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1082 to clone db1161 T258361', diff saved to https://phabricator.wikimedia.org/P14911 and previous config saved to /var/cache/conftool/dbconfig/20210317-065146-marostegui.json [06:51:53] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:51:54] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 [06:52:18] (03CR) 10Elukey: [C: 03+1] burrow/check_kafka_consumer_lag.py: Port to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/658396 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [06:52:19] !log Stop MySQL on db1082 to clone db1161 (lag will appear on s5 on wikireplicas) - T258361 [06:52:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:20] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:54:59] (03CR) 10Elukey: [C: 03+1] profile::kerberos::client: Default to use DNS canonicalisation [puppet] - 10https://gerrit.wikimedia.org/r/671130 (https://phabricator.wikimedia.org/T257412) (owner: 10Muehlenhoff) [06:55:02] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) On going transfer from db1082 to db1161 [06:58:44] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:59:58] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:00:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 50%: Slowly repool db1111', diff saved to https://phabricator.wikimedia.org/P14913 and previous config saved to /var/cache/conftool/dbconfig/20210317-070017-root.json [07:00:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:31] 10ops-eqiad: analytics1063 interface errors - https://phabricator.wikimedia.org/T277633 (10ayounsi) [07:01:09] XioNoX: --^ <3 [07:01:20] ;) [07:06:50] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:07:55] 10ops-eqiad: elastic1062 interface errors - https://phabricator.wikimedia.org/T277634 (10ayounsi) [07:08:01] elukey: https://phabricator.wikimedia.org/T277634 [07:08:53] will we get automatic alarms in the future? [07:10:02] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:10:27] (03CR) 10Ladsgroup: [C: 03+1] "Looks straightforward enough." [puppet] - 10https://gerrit.wikimedia.org/r/672796 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [07:13:26] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:15:12] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:15:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 75%: Slowly repool db1111', diff saved to https://phabricator.wikimedia.org/P14914 and previous config saved to /var/cache/conftool/dbconfig/20210317-071520-root.json [07:15:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:00] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:30:24] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:30:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 100%: Slowly repool db1111', diff saved to https://phabricator.wikimedia.org/P14915 and previous config saved to /var/cache/conftool/dbconfig/20210317-073024-root.json [07:30:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:16] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:33:41] elukey: https://phabricator.wikimedia.org/T225140#6920441 [07:34:01] ack [07:34:02] elukey: dcops and I already get the emails for those btw [07:34:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1114 for schema change', diff saved to https://phabricator.wikimedia.org/P14916 and previous config saved to /var/cache/conftool/dbconfig/20210317-073403-marostegui.json [07:34:09] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:30] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:38:59] 10SRE, 10Wikimedia-Portals, 10Regression: www.wikipedia.org/robots.txt should not be a redirect - https://phabricator.wikimedia.org/T242500 (10Aklapper) a:05Techwizzie→03None [07:39:56] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:49:52] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:50:06] (03CR) 10Muehlenhoff: "This needs an approval: line, other than that looks fine, but needs discussion/signoff in next SRE meeting." 
[puppet] - https://gerrit.wikimedia.org/r/672879 (https://phabricator.wikimedia.org/T277629) (owner: ArielGlenn) [07:50:14] !log swift eqiad-prod: less weight for ms-be[1019-1026] - T272836 [07:50:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:26] T272836: Decom ms-be[1019-1026] from swift - https://phabricator.wikimedia.org/T272836 [07:50:44] SRE, Domains, Okapi, Traffic: Add enterprise subdomain for OKAPI - https://phabricator.wikimedia.org/T276585 (Aklapper) [07:54:24] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:58:02] (CR) Elukey: Replace labsdb1012 with clouddb1021 in analytics-in4 (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/672797 (https://phabricator.wikimedia.org/T269211) (owner: Elukey) [07:58:16] SRE, SRE-Access-Requests: Add samwalton to analytics-privatedata-users - https://phabricator.wikimedia.org/T277298 (Volans) [07:59:02] (PS2) Elukey: Replace labsdb1012 with clouddb1021 in analytics-in4 [homer/public] - https://gerrit.wikimedia.org/r/672797 (https://phabricator.wikimedia.org/T269211) [07:59:35] (CR) Elukey: Replace labsdb1012 with clouddb1021 in analytics-in4 (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/672797 (https://phabricator.wikimedia.org/T269211) (owner: Elukey) [08:00:41] (PS1) Marostegui: mariadb: Productionize db1161 [puppet] - https://gerrit.wikimedia.org/r/672974 (https://phabricator.wikimedia.org/T258361) [08:01:19] (CR) Marostegui: [C: +2] mariadb: Productionize db1161 [puppet] - https://gerrit.wikimedia.org/r/672974 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui) [08:01:22] RECOVERY - mediawiki originals uploads -hourly- for eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [08:02:29] (CR) Muehlenhoff: [C: +2] profile::kerberos::client: Default to use DNS canonicalisation [puppet] - https://gerrit.wikimedia.org/r/671130 (https://phabricator.wikimedia.org/T257412) (owner: Muehlenhoff) [08:03:38] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:08:14] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:10:42] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:15:26] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:15:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1082 (re)pooling @ 25%: Slowly repool db1082', diff saved to https://phabricator.wikimedia.org/P14917 and previous config saved to /var/cache/conftool/dbconfig/20210317-081557-root.json [08:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:29] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui) db1161 is now replicating [08:22:25] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: 
All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:26:47] (PS1) Hashar: Review access change [dumps] (refs/meta/config) - https://gerrit.wikimedia.org/r/672817 [08:26:55] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={ircd,routinator} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:27:33] (PS2) Hashar: Make operations-dumps group owner of the hierarchy [dumps] (refs/meta/config) - https://gerrit.wikimedia.org/r/672817 (https://phabricator.wikimedia.org/T277630) [08:31:01] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [08:31:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1082 (re)pooling @ 50%: Slowly repool db1082', diff saved to https://phabricator.wikimedia.org/P14918 and previous config saved to /var/cache/conftool/dbconfig/20210317-083101-root.json [08:31:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:25] (CR) Alexandros Kosiaris: [C: +2] otrs: replace spamassassin cron with systemd timer [puppet] - https://gerrit.wikimedia.org/r/672796 (https://phabricator.wikimedia.org/T273673) (owner: Dzahn) [08:35:41] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:35:52] (PS1) Muehlenhoff: Add thirdparty/php56 [puppet] - https://gerrit.wikimedia.org/r/672976 (https://phabricator.wikimedia.org/T224589) [08:38:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 25%: Slowly repool db1114', diff saved to https://phabricator.wikimedia.org/P14919 and previous config saved to /var/cache/conftool/dbconfig/20210317-083846-root.json [08:38:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:59] (CR) Alexandros Kosiaris: "Looks ok on otrs1001. Thanks!" [puppet] - https://gerrit.wikimedia.org/r/672796 (https://phabricator.wikimedia.org/T273673) (owner: Dzahn) [08:39:11] (PS2) Alexandros Kosiaris: otrs: remove absented cron code [puppet] - https://gerrit.wikimedia.org/r/672830 (https://phabricator.wikimedia.org/T273673) (owner: Dzahn) [08:39:16] (CR) Alexandros Kosiaris: [C: +2] otrs: remove absented cron code [puppet] - https://gerrit.wikimedia.org/r/672830 (https://phabricator.wikimedia.org/T273673) (owner: Dzahn) [08:40:55] (PS2) Muehlenhoff: Add thirdparty/php56 [puppet] - https://gerrit.wikimedia.org/r/672976 (https://phabricator.wikimedia.org/T224589) [08:45:05] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:46:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1082 (re)pooling @ 75%: Slowly repool db1082', diff saved to https://phabricator.wikimedia.org/P14920 and previous config saved to /var/cache/conftool/dbconfig/20210317-084605-root.json [08:46:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:48] (CR) Hashar: [V: +2 C: +2] Make operations-dumps group owner of 
the hierarchy [dumps] (refs/meta/config) - https://gerrit.wikimedia.org/r/672817 (https://phabricator.wikimedia.org/T277630) (owner: Hashar) [08:49:27] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:51:53] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [08:53:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 50%: Slowly repool db1114', diff saved to https://phabricator.wikimedia.org/P14921 and previous config saved to /var/cache/conftool/dbconfig/20210317-085350-root.json [08:53:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:34] (PS1) Muehlenhoff: Adapt package installation for tendril/buster to pull in PHP 5.6 [puppet] - https://gerrit.wikimedia.org/r/672977 (https://phabricator.wikimedia.org/T224589) [08:58:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1084 T276302', diff saved to https://phabricator.wikimedia.org/P14922 and previous config saved to /var/cache/conftool/dbconfig/20210317-085852-marostegui.json [08:58:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:00] T276302: decommission db1084.eqiad.wmnet - https://phabricator.wikimedia.org/T276302 [08:59:03] (CR) Muehlenhoff: [C: +2] Add thirdparty/php56 [puppet] - https://gerrit.wikimedia.org/r/672976 (https://phabricator.wikimedia.org/T224589) (owner: Muehlenhoff) [09:01:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1082 (re)pooling @ 100%: Slowly repool db1082', diff saved 
to https://phabricator.wikimedia.org/P14923 and previous config saved to /var/cache/conftool/dbconfig/20210317-090108-root.json [09:01:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:05] PROBLEM - HTTPS-wmfusercontent on phab.wmfusercontent.org is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org valid until 2021-04-16 09:01:44 +0000 (expires in 29 days) https://phabricator.wikimedia.org/tag/phabricator/ [09:03:51] PROBLEM - HTTPS-planet on en.planet.wikimedia.org is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org valid until 2021-04-16 09:01:44 +0000 (expires in 29 days) https://wikitech.wikimedia.org/wiki/Planet.wikimedia.org [09:04:05] !log jayme@cumin1001 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=citoid [09:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:14] !log jayme@cumin1001 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=cxserver [09:04:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:30] !log jayme@cumin1001 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=recommendation-api [09:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 75%: Slowly repool db1114', diff saved to https://phabricator.wikimedia.org/P14924 and previous config saved to /var/cache/conftool/dbconfig/20210317-090853-root.json [09:08:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:19] (CR) Giuseppe Lavagetto: [C: +1] docker_registry_ha: Require authentication from k8s nodes (1 comment) [puppet] - https://gerrit.wikimedia.org/r/672537 (https://phabricator.wikimedia.org/T273521) (owner: Legoktm) [09:11:10] (CR) Giuseppe Lavagetto: [C: +1] chromium-render: Add default labels and fix name of configmap [deployment-charts] - 
https://gerrit.wikimedia.org/r/670464 (owner: JMeybohm) [09:15:33] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={ircd,netbox_device_statistics} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:17:55] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:18:10] SRE, netops, Patch-For-Review: Auhoritative ports list - https://phabricator.wikimedia.org/T277146 (Joe) FWIW, the document on wikitech is not authoritative - `service::catalog` in hiera is. [09:19:22] (CR) Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/28633/" [puppet] - https://gerrit.wikimedia.org/r/672977 (https://phabricator.wikimedia.org/T224589) (owner: Muehlenhoff) [09:22:37] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:23:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 100%: Slowly repool db1114', diff saved to https://phabricator.wikimedia.org/P14925 and previous config saved to /var/cache/conftool/dbconfig/20210317-092357-root.json [09:24:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1109 for schema change', diff saved to https://phabricator.wikimedia.org/P14926 and previous config saved to /var/cache/conftool/dbconfig/20210317-092443-marostegui.json [09:24:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:57] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All 
metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:25:55] RECOVERY - dump of es4 in eqiad on alert1001 is OK: Last dump for es4 at eqiad (es1022.eqiad.wmnet) taken on 2021-03-16 00:00:01 (1585 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [09:25:55] RECOVERY - dump of es5 in eqiad on alert1001 is OK: Last dump for es5 at eqiad (es1025.eqiad.wmnet) taken on 2021-03-16 00:00:01 (1563 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [09:27:18] SRE, DBA, Wikimedia-Mailing-lists: Create databases for mailman3 - https://phabricator.wikimedia.org/T256538 (Marostegui) So to sum up for this testing databases: Section: m5 Name: `testmailman3` and `testmailman3web` Approximate time frame before deleting them: 2-3 months I would need the users an... [09:29:45] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={ircd,netbox_device_statistics} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:32:12] SRE, DBA, Wikimedia-Mailing-lists: Create databases for mailman3 - https://phabricator.wikimedia.org/T256538 (jcrespo) @Ladsgroup I assume no backups needed as this is a test? 
[09:32:17] (PS3) Giuseppe Lavagetto: [WiP] Helm chart to run MediaWiki [deployment-charts] - https://gerrit.wikimedia.org/r/670220 (https://phabricator.wikimedia.org/T265327) [09:32:19] (PS1) Giuseppe Lavagetto: Scaffold: fix template calls for php applications [deployment-charts] - https://gerrit.wikimedia.org/r/672980 [09:32:45] (CR) jerkins-bot: [V: -1] [WiP] Helm chart to run MediaWiki [deployment-charts] - https://gerrit.wikimedia.org/r/670220 (https://phabricator.wikimedia.org/T265327) (owner: Giuseppe Lavagetto) [09:34:55] (PS1) Marostegui: db2099: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/672981 [09:34:59] (PS1) Kormat: mariadb: Disable notifications for db2099 [puppet] - https://gerrit.wikimedia.org/r/672982 (https://phabricator.wikimedia.org/T277632) [09:35:09] kormat: XD [09:35:16] 4s, dang [09:35:25] You are so slow [09:35:31] 🥀 [09:35:36] kormat you go or I do? [09:35:41] i got it [09:35:46] <3 [09:36:16] (CR) Kormat: [C: +2] mariadb: Disable notifications for db2099 [puppet] - https://gerrit.wikimedia.org/r/672982 (https://phabricator.wikimedia.org/T277632) (owner: Kormat) [09:36:19] (Abandoned) Marostegui: db2099: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/672981 (owner: Marostegui) [09:36:55] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:37:19] (CR) Alexandros Kosiaris: [C: +1] chromium-render: Add default labels and fix name of configmap [deployment-charts] - https://gerrit.wikimedia.org/r/670464 (owner: JMeybohm) [09:43:50] SRE, Patch-For-Review: Migrate irc.wikimedia.org/kraz to Buster - https://phabricator.wikimedia.org/T224579 (fgiunchedi) It looks like the exporter was stuck in a loop: ` [pid 18567] <... 
recvfrom resumed> "", 8192, 0, NULL, NULL) = 0 [pid 18486] <... futex resumed> ) = 0 [pid 18567] futex(0x55a8b... [09:47:41] !log restart varnish-fe on cp5011 [09:47:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:59] RECOVERY - Varnish frontend child restarted on cp5011 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp5011&var-datasource=eqsin+prometheus/ops [09:51:58] SRE, DBA, Wikimedia-Mailing-lists: Create databases for mailman3 - https://phabricator.wikimedia.org/T256538 (Ladsgroup) >>! In T256538#6920684, @jcrespo wrote: > @Ladsgroup I assume no backups needed as this is a test? correct [09:53:47] SRE, DBA, Wikimedia-Mailing-lists: Create databases for mailman3 - https://phabricator.wikimedia.org/T256538 (Ladsgroup) >>! In T256538#6920675, @Marostegui wrote: > > I would need the users and the desired grants. user `testmailman3` having all rights on `testmailman3` database user `testmailman3... [09:59:40] !log imported PHP 5.6.40 to thirdparty/php56 T224589 [09:59:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:49] T224589: Migrate dbmonitor hosts to Buster - https://phabricator.wikimedia.org/T224589 [10:00:27] (CR) Muehlenhoff: [C: +2] Adapt package installation for tendril/buster to pull in PHP 5.6 [puppet] - https://gerrit.wikimedia.org/r/672977 (https://phabricator.wikimedia.org/T224589) (owner: Muehlenhoff) [10:04:31] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:07:42] (CR) Alexandros Kosiaris: [C: +1] rdf-streaming-updater: fix networkpolicy selector [deployment-charts] - https://gerrit.wikimedia.org/r/672544 (owner: Jeena Huneidi) [10:13:28] SRE, DBA, Wikimedia-Mailing-lists: Create test databases for mailman3 - https://phabricator.wikimedia.org/T256538 (Marostegui) a:Marostegui [10:16:53] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:23:28] (PS5) Elukey: hadoop: add a profile to deploy the capacity scheduler's settings [puppet] - https://gerrit.wikimedia.org/r/672373 (https://phabricator.wikimedia.org/T277062) [10:23:32] SRE, DBA, Wikimedia-Mailing-lists: Create test databases for mailman3 - https://phabricator.wikimedia.org/T256538 (Marostegui) @Ladsgroup from which hosts would you be connecting from? m5 doesn't use the proxy (yet), so I would need to grant certain IPs instead of the proxy ones. 
[10:23:34] (PS7) Elukey: hadoop: set the Yarn capacity scheduler for the test cluster [puppet] - https://gerrit.wikimedia.org/r/672654 (https://phabricator.wikimedia.org/T277062) [10:24:34] (CR) jerkins-bot: [V: -1] hadoop: add a profile to deploy the capacity scheduler's settings [puppet] - https://gerrit.wikimedia.org/r/672373 (https://phabricator.wikimedia.org/T277062) (owner: Elukey) [10:26:17] (PS6) Elukey: hadoop: add a profile to deploy the capacity scheduler's settings [puppet] - https://gerrit.wikimedia.org/r/672373 (https://phabricator.wikimedia.org/T277062) [10:26:19] (PS8) Elukey: hadoop: set the Yarn capacity scheduler for the test cluster [puppet] - https://gerrit.wikimedia.org/r/672654 (https://phabricator.wikimedia.org/T277062) [10:27:26] (CR) jerkins-bot: [V: -1] hadoop: add a profile to deploy the capacity scheduler's settings [puppet] - https://gerrit.wikimedia.org/r/672373 (https://phabricator.wikimedia.org/T277062) (owner: Elukey) [10:28:33] SRE, Packaging, puppet-compiler, User-jbond: PCC always has an ERROR when compiling for servers with profile::redis::slave - https://phabricator.wikimedia.org/T228266 (jbond) Just a note that the above python snippet can now be preformed with something like the following from the compiler hosts... [10:29:17] SRE, DBA, Wikimedia-Mailing-lists: Create test databases for mailman3 - https://phabricator.wikimedia.org/T256538 (Marostegui) Databases created on db1128.eqiad.wmnet (m5 master): ` # host m5-master.eqiad.wmnet m5-master.eqiad.wmnet is an alias for db1128.eqiad.wmnet. db1128.eqiad.wmnet has address 1... [10:31:53] RECOVERY - mediawiki originals uploads -hourly- for codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [10:32:03] (PS1) Hashar: gerrit: GerritSite.css remove CI customization [puppet] - https://gerrit.wikimedia.org/r/672986 (https://phabricator.wikimedia.org/T277645) [10:32:05] (PS1) Hashar: gerrit: GerritSite.css remove unused diff customization [puppet] - https://gerrit.wikimedia.org/r/672987 (https://phabricator.wikimedia.org/T232893) [10:32:21] RECOVERY - mediawiki originals uploads -hourly- for eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [10:37:01] SRE, DBA, Wikimedia-Mailing-lists: Create test databases for mailman3 - https://phabricator.wikimedia.org/T256538 (Ladsgroup) Thanks. It should be the IP of the VM but that's not created yet (T276686) we were waiting for the databases to be created first (basically a chicken and egg problem) [10:37:15] (PS7) Elukey: hadoop: add a profile to deploy the capacity scheduler's settings [puppet] - https://gerrit.wikimedia.org/r/672373 (https://phabricator.wikimedia.org/T277062) [10:37:17] (PS9) Elukey: hadoop: set the Yarn capacity scheduler for the test cluster [puppet] - https://gerrit.wikimedia.org/r/672654 (https://phabricator.wikimedia.org/T277062) [10:37:37] SRE, DBA, Wikimedia-Mailing-lists: Create test databases for mailman3 - https://phabricator.wikimedia.org/T256538 (Marostegui) Databases are now created, once I get the IPs I will create the users :) [10:38:29] (CR) jerkins-bot: [V: -1] hadoop: add a profile to deploy the capacity scheduler's settings [puppet] - https://gerrit.wikimedia.org/r/672373 (https://phabricator.wikimedia.org/T277062) (owner: Elukey) [10:38:33] (CR) Elukey: hadoop: add a profile to deploy 
the capacity scheduler's settings (2 comments) [puppet] - https://gerrit.wikimedia.org/r/672373 (https://phabricator.wikimedia.org/T277062) (owner: Elukey) [10:41:24] (PS8) Elukey: hadoop: add a profile to deploy the capacity scheduler's settings [puppet] - https://gerrit.wikimedia.org/r/672373 (https://phabricator.wikimedia.org/T277062) [10:41:26] (PS10) Elukey: hadoop: set the Yarn capacity scheduler for the test cluster [puppet] - https://gerrit.wikimedia.org/r/672654 (https://phabricator.wikimedia.org/T277062) [10:55:38] (CR) Elukey: "> Patch Set 4:" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/672373 (https://phabricator.wikimedia.org/T277062) (owner: Elukey) [10:58:11] (PS1) Jbond: add profile::kubernetes::node::docker_kubernetes_user_password 672537 [labs/private] - https://gerrit.wikimedia.org/r/672991 [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: Dear deployers, time to do the European mid-day backport window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210317T1100). [11:00:04] No GERRIT patches in the queue for this window AFAICS. 
[11:00:41] yay [11:00:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1109 (re)pooling @ 25%: Slowly repool db1109', diff saved to https://phabricator.wikimedia.org/P14927 and previous config saved to /var/cache/conftool/dbconfig/20210317-110050-root.json [11:00:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:42] (CR) Muehlenhoff: [C: +2] Switch the IDPs to the serial Memcached transcoder [puppet] - https://gerrit.wikimedia.org/r/672679 (https://phabricator.wikimedia.org/T273867) (owner: Muehlenhoff) [11:07:10] (PS9) Elukey: hadoop: add a profile to deploy the capacity scheduler's settings [puppet] - https://gerrit.wikimedia.org/r/672373 (https://phabricator.wikimedia.org/T277062) [11:07:12] (PS11) Elukey: hadoop: set the Yarn capacity scheduler for the test cluster [puppet] - https://gerrit.wikimedia.org/r/672654 (https://phabricator.wikimedia.org/T277062) [11:08:16] (CR) jerkins-bot: [V: -1] hadoop: add a profile to deploy the capacity scheduler's settings [puppet] - https://gerrit.wikimedia.org/r/672373 (https://phabricator.wikimedia.org/T277062) (owner: Elukey) [11:08:27] (CR) Jbond: [V: +2 C: +2] add profile::kubernetes::node::docker_kubernetes_user_password 672537 [labs/private] - https://gerrit.wikimedia.org/r/672991 (owner: Jbond) [11:09:35] !log restarting tomcat on idp.wikimedia.org [11:09:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:29] (PS10) Elukey: hadoop: add a profile to deploy the capacity scheduler's settings [puppet] - https://gerrit.wikimedia.org/r/672373 (https://phabricator.wikimedia.org/T277062) [11:14:31] (PS12) Elukey: hadoop: set the Yarn capacity scheduler for the test cluster [puppet] - https://gerrit.wikimedia.org/r/672654 (https://phabricator.wikimedia.org/T277062) [11:15:36] (CR) jerkins-bot: [V: -1] hadoop: add a profile to deploy the capacity scheduler's settings [puppet] - 
https://gerrit.wikimedia.org/r/672373 (https://phabricator.wikimedia.org/T277062) (owner: Elukey) [11:15:49] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:15:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1109 (re)pooling @ 50%: Slowly repool db1109', diff saved to https://phabricator.wikimedia.org/P14928 and previous config saved to /var/cache/conftool/dbconfig/20210317-111553-root.json [11:15:58] for some reason utis/run_ci_locally gives me the green light but I get -1 from jenkins [11:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:38] !log jayme@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=restbase-async,name=eqiad [11:17:38] !log jayme@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=restbase-async,name=codfw [11:17:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:51] (PS11) Elukey: hadoop: add a profile to deploy the capacity scheduler's settings [puppet] - https://gerrit.wikimedia.org/r/672373 (https://phabricator.wikimedia.org/T277062) [11:17:53] (PS13) Elukey: hadoop: set the Yarn capacity scheduler for the test cluster [puppet] - https://gerrit.wikimedia.org/r/672654 (https://phabricator.wikimedia.org/T277062) [11:20:33] !log switch restbase-async back to codfw (the newly initialized cluster) [11:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:17] (PS12) Elukey: hadoop: add a profile to deploy the capacity scheduler's settings [puppet] - https://gerrit.wikimedia.org/r/672373 (https://phabricator.wikimedia.org/T277062) [11:23:19] (PS14) Elukey: hadoop: set the Yarn capacity scheduler for the test cluster [puppet] - https://gerrit.wikimedia.org/r/672654 
(https://phabricator.wikimedia.org/T277062) [11:26:52] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28638/console" [puppet] - https://gerrit.wikimedia.org/r/672654 (https://phabricator.wikimedia.org/T277062) (owner: Elukey) [11:29:29] SRE, Dumps-Generation, SRE-Access-Requests, Patch-For-Review: Create new group for root access to snapshot*, dumpsdata* and labstore1006,7 with holger in it - https://phabricator.wikimedia.org/T277629 (Volans) p:Triage→Medium [11:29:29] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:30:49] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:30:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1109 (re)pooling @ 75%: Slowly repool db1109', diff saved to https://phabricator.wikimedia.org/P14929 and previous config saved to /var/cache/conftool/dbconfig/20210317-113057-root.json [11:31:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:22] SRE, Dumps-Generation, SRE-Access-Requests, Patch-For-Review: Create new group for root access to snapshot*, dumpsdata* and labstore1006,7 with holger in it - https://phabricator.wikimedia.org/T277629 (Volans) @ArielGlenn This is actually a two-fold requests. This one is for the new group creatio... [11:41:13] (CR) Elukey: [C: +1] "LGTM but I am very ignorant about the workers set up." 
[puppet] - https://gerrit.wikimedia.org/r/672402 (https://phabricator.wikimedia.org/T272918) (owner: Klausman) [11:42:24] SRE, LDAP-Access-Requests: Grant Access to for apine - https://phabricator.wikimedia.org/T277544 (Volans) p:Triage→Medium [11:42:27] (PS2) ArielGlenn: Create group for root access to snapshot, dumpsdata and labstore1006,7 hosts [puppet] - https://gerrit.wikimedia.org/r/672879 (https://phabricator.wikimedia.org/T277629) [11:43:16] (CR) ArielGlenn: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/672879 (https://phabricator.wikimedia.org/T277629) (owner: ArielGlenn) [11:44:42] SRE, Dumps-Generation, SRE-Access-Requests, Patch-For-Review: Create new group for root access to snapshot*, dumpsdata* and labstore1006,7 with holger in it - https://phabricator.wikimedia.org/T277629 (ArielGlenn) >>! In T277629#6921052, @Volans wrote: > @ArielGlenn This is actually a two-fold re... [11:46:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1109 (re)pooling @ 100%: Slowly repool db1109', diff saved to https://phabricator.wikimedia.org/P14930 and previous config saved to /var/cache/conftool/dbconfig/20210317-114601-root.json [11:46:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:42] (PS11) Klausman: hiera/modules: Add role for ML k8s workers [puppet] - https://gerrit.wikimedia.org/r/672402 (https://phabricator.wikimedia.org/T272918) [11:47:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1087 for schema change', diff saved to https://phabricator.wikimedia.org/P14931 and previous config saved to /var/cache/conftool/dbconfig/20210317-114746-marostegui.json [11:47:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:04] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28643/console" [puppet] - https://gerrit.wikimedia.org/r/672537 
(https://phabricator.wikimedia.org/T273521) (owner: 10Legoktm) [11:49:11] !log Deploy schema change on s8, lag will appear on wiki replicas T276150 T276156 [11:49:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:20] T276150: Schema change to make rc_id unsigned and rc_timestamp BINARY - https://phabricator.wikimedia.org/T276150 [11:49:21] T276156: Drop default of rc_timestamp - https://phabricator.wikimedia.org/T276156 [11:52:37] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:53:01] (03PS12) 10Klausman: hiera/modules: Add role for ML k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/672402 (https://phabricator.wikimedia.org/T272918) [11:55:15] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28644/console" [puppet] - 10https://gerrit.wikimedia.org/r/672537 (https://phabricator.wikimedia.org/T273521) (owner: 10Legoktm) [11:56:20] (03CR) 10Ayounsi: "This change is ready for review." [software/homer] - 10https://gerrit.wikimedia.org/r/663536 (https://phabricator.wikimedia.org/T273865) (owner: 10Ayounsi) [11:57:05] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:57:45] (03CR) 10Jbond: [V: 03+1 C: 03+1] "lgtm" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/672537 (https://phabricator.wikimedia.org/T273521) (owner: 10Legoktm) [11:59:07] PROBLEM - Long running screen/tmux on ganeti1011 is CRITICAL: CRIT: Long running tmux process. (user: kormat PID: 148816, 1737893s 1728000s). 
https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [11:59:23] * kormat looks innocent [11:59:25] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:01:41] RECOVERY - Long running screen/tmux on ganeti1011 is OK: OK: No SCREEN or tmux processes detected. https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [12:01:48] 😇 [12:10:40] !log jayme@cumin1001 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=eventgate-main [12:10:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:29] !log jayme@cumin1001 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=mathoid [12:11:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:05] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad average message consume rate in last 30m on alert1001 is OK: (C)0 le (W)100 le 119.1 https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [12:15:15] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad average message produce rate in last 30m on alert1001 is OK: (C)0 le (W)100 le 107.4 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [12:16:23] (03CR) 10jerkins-bot: [V: 04-1] Capirca POC [software/homer] - 10https://gerrit.wikimedia.org/r/663536 (https://phabricator.wikimedia.org/T273865) (owner: 10Ayounsi) [12:19:18] (03CR) 10Jbond: Capirca POC (035 comments) [software/homer] - 
10https://gerrit.wikimedia.org/r/663536 (https://phabricator.wikimedia.org/T273865) (owner: 10Ayounsi) [12:19:52] (03PS1) 10Sharvaniharan: Add event stream config for android.image_recommendations_interaction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673005 [12:21:48] (03CR) 10jerkins-bot: [V: 04-1] Add event stream config for android.image_recommendations_interaction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673005 (owner: 10Sharvaniharan) [12:25:58] (03CR) 10Ottomata: "Looks generally correct to me! :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673005 (owner: 10Sharvaniharan) [12:28:01] (03PS1) 10Kosta Harlan: linkrecommendation: Bump memory limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/673006 (https://phabricator.wikimedia.org/T277297) [12:29:20] (03PS2) 10Sharvaniharan: Add event stream config for android.image_recommendations_interaction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673005 [12:30:12] (03CR) 10Sharvaniharan: "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673005 (owner: 10Sharvaniharan) [12:30:33] (03CR) 10jerkins-bot: [V: 04-1] Add event stream config for android.image_recommendations_interaction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673005 (owner: 10Sharvaniharan) [12:34:25] (03PS3) 10Sharvaniharan: Add event stream config for android.image_recommendations_interaction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673005 [12:34:51] (03CR) 10Ottomata: "Ok! Looks good to me!" 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/672373 (https://phabricator.wikimedia.org/T277062) (owner: 10Elukey) [12:35:42] (03CR) 10jerkins-bot: [V: 04-1] Add event stream config for android.image_recommendations_interaction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673005 (owner: 10Sharvaniharan) [12:44:24] !log jayme@cumin1001 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=apertium [12:44:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:48] !log jayme@cumin1001 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=blubberoid [12:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:22] !log jayme@cumin1001 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=linkrecommendation [12:45:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:38] !log jayme@cumin1001 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=proton [12:45:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:44] !log jayme@cumin1001 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=push-notifications [12:45:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:55] !log jayme@cumin1001 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=similar-users [12:46:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:31] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:52:56] 10SRE, 10Dumps-Generation, 10SRE-Access-Requests, 10Patch-For-Review: Create new group for root access to snapshot*, dumpsdata* and labstore1006,7 with holger in it - https://phabricator.wikimedia.org/T277629 (10WDoranWMF) Hi as @holger.knust manager, I approve the request [12:56:33] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:58:52] (03CR) 10Alexandros Kosiaris: [C: 04-1] "From what I gathered (please correct me if I am wrong), now cxserver will also listen on port 9090 service a prometheus endpoint (/metrics" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/672386 (https://phabricator.wikimedia.org/T271711) (owner: 10KartikMistry) [12:58:59] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:04:43] 10SRE, 10vm-requests: eqiad: 1 of VMs requested for tendril/buster - https://phabricator.wikimedia.org/T277657 (10MoritzMuehlenhoff) [13:07:05] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/672449 (owner: 10PipelineBot) [13:07:10] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/672398 (owner: 10PipelineBot) [13:07:17] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/672238 (owner: 10PipelineBot) [13:09:02] (03PS9) 10Ayounsi: Capirca POC [software/homer] - 10https://gerrit.wikimedia.org/r/663536 (https://phabricator.wikimedia.org/T273865) [13:09:08] (03CR) 10Ayounsi: Capirca POC (034 comments) [software/homer] - 10https://gerrit.wikimedia.org/r/663536 (https://phabricator.wikimedia.org/T273865) (owner: 10Ayounsi) [13:11:50] (03PS5) 10KartikMistry: WIP: Update cxserver to 2021-03-15-131520-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/672386 (https://phabricator.wikimedia.org/T271711) [13:11:59] !log installing tiff security updates [13:12:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:08] (03PS1) 10Mvolz: Update Zotero to 2021-03-12-015945-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/673011 [13:13:25] !log jayme@cumin1001 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=termbox [13:13:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:31] !log jayme@cumin1001 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=wikifeeds [13:13:38] !log jayme@cumin1001 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=mobileapps [13:13:38] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:03] !log jayme@cumin1001 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=eventstreams-internal [13:14:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:25] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:17:25] !log jayme@cumin1001 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=eventstreams [13:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:46] (03PS1) 10Kormat: mariadb: Don't use ssl client auth [puppet] - 10https://gerrit.wikimedia.org/r/673013 [13:20:45] (03CR) 10Kormat: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28645/console" [puppet] - 10https://gerrit.wikimedia.org/r/673013 (owner: 10Kormat) [13:20:58] (03CR) 10Jbond: "Gone through and made a few comments. It would be really useful if we could have a side by side diff of the changes this would make. in " (038 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/663535 (https://phabricator.wikimedia.org/T273865) (owner: 10Ayounsi) [13:21:39] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [13:22:03] (03CR) 10Kormat: [V: 03+1] "As discussed yesterday." 
[puppet] - 10https://gerrit.wikimedia.org/r/673013 (owner: 10Kormat) [13:22:15] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [13:23:45] (03CR) 10Marostegui: [C: 03+1] "Let's see what this breaks 👍" [puppet] - 10https://gerrit.wikimedia.org/r/673013 (owner: 10Kormat) [13:23:51] !log stopping s5 instance on db2099 and restoring from backup T277632 [13:23:53] marostegui: :D [13:23:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:04] (03CR) 10KartikMistry: "> Patch Set 4: Code-Review-1" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/672386 (https://phabricator.wikimedia.org/T271711) (owner: 10KartikMistry) [13:24:13] (03CR) 10Kormat: [V: 03+1 C: 03+2] mariadb: Don't use ssl client auth [puppet] - 10https://gerrit.wikimedia.org/r/673013 (owner: 10Kormat) [13:27:40] !log otto@deploy1002 Started deploy [analytics/aqs/deploy@3e92346]: deploy aqs as part of train - T207171, T263697 [13:27:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:49] T263697: Add more popular articles per country data to AQS - https://phabricator.wikimedia.org/T263697 [13:27:49] T207171: Have a way to show the most popular pages per country - https://phabricator.wikimedia.org/T207171 [13:29:16] (03CR) 10jerkins-bot: [V: 04-1] Capirca POC [software/homer] - 10https://gerrit.wikimedia.org/r/663536 (https://phabricator.wikimedia.org/T273865) (owner: 10Ayounsi) [13:31:04] !log otto@deploy1002 Finished deploy [analytics/aqs/deploy@3e92346]: deploy aqs as part of train - T207171, T263697 (duration: 03m 24s) [13:31:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:23] !log 
stopping db2089:s5 T277632 [13:33:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:08] !log stopping db2137:s5 T277632 [13:38:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1087 (re)pooling @ 25%: Slowly repool db1087', diff saved to https://phabricator.wikimedia.org/P14932 and previous config saved to /var/cache/conftool/dbconfig/20210317-134018-root.json [13:40:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:29] !log otto@deploy1002 Started deploy [analytics/refinery@d2f1b28]: Regular analytics weekly train [analytics/refinery@d2f1b28] [13:40:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:27] !log razzi@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-conf1002.eqiad.wmnet [13:41:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:44] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-conf1002.eqiad.wmnet [13:46:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:04] !log razzi@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-conf1003.eqiad.wmnet [13:47:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:15] !log jayme@cumin1001 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=echostore [13:47:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:22] !log jayme@cumin1001 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=api-gateway [13:47:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:37] !log jayme@cumin1001 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=eventgate-logging-external [13:47:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[13:47:45] !log jayme@cumin1001 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=eventgate-analytics-external [13:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:45] (03CR) 10Elukey: "> Patch Set 12:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/672373 (https://phabricator.wikimedia.org/T277062) (owner: 10Elukey) [13:51:27] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:52:05] !log otto@deploy1002 Finished deploy [analytics/refinery@d2f1b28]: Regular analytics weekly train [analytics/refinery@d2f1b28] (duration: 11m 36s) [13:52:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:25] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-conf1003.eqiad.wmnet [13:52:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:31] (03CR) 10Elukey: hadoop: add a profile to deploy the capacity scheduler's settings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/672373 (https://phabricator.wikimedia.org/T277062) (owner: 10Elukey) [13:53:53] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:54:04] !log razzi@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-conf1001.eqiad.wmnet [13:54:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1087 (re)pooling @ 50%: Slowly repool db1087', diff saved to https://phabricator.wikimedia.org/P14933 and previous config saved to /var/cache/conftool/dbconfig/20210317-135522-root.json [13:55:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:50] (03CR) 10Ottomata: hadoop: add a profile to deploy the capacity scheduler's settings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/672373 (https://phabricator.wikimedia.org/T277062) (owner: 10Elukey) [13:55:54] (03CR) 10Ottomata: [C: 03+1] hadoop: add a profile to deploy the capacity scheduler's settings [puppet] - 10https://gerrit.wikimedia.org/r/672373 (https://phabricator.wikimedia.org/T277062) (owner: 10Elukey) [13:56:20] !log otto@deploy1002 Started deploy [analytics/refinery@d2f1b28] (thin): Regular analytics weekly train THIN [analytics/refinery@d2f1b28] [13:56:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:27] !log otto@deploy1002 Finished deploy [analytics/refinery@d2f1b28] (thin): Regular analytics weekly train THIN [analytics/refinery@d2f1b28] (duration: 00m 06s) [13:56:29] (03CR) 10Elukey: hadoop: add a profile to deploy the capacity scheduler's settings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/672373 (https://phabricator.wikimedia.org/T277062) (owner: 10Elukey) [13:56:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:41] !log otto@deploy1002 Started deploy [analytics/refinery@d2f1b28] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@d2f1b28] [13:56:46] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:28] (03PS1) 10Jbond: C:puppet_compiler: ensure the yaml dir is always readable [puppet] - 10https://gerrit.wikimedia.org/r/673019 [13:58:41] !log added bullseye tftpboot environment T275873 [13:58:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:48] T275873: Prepare our base system layer for Debian 11/bullseye - https://phabricator.wikimedia.org/T275873 [13:59:34] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-conf1001.eqiad.wmnet [13:59:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:56] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28646/console" [puppet] - 10https://gerrit.wikimedia.org/r/673019 (owner: 10Jbond) [14:01:00] !log otto@deploy1002 Finished deploy [analytics/refinery@d2f1b28] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@d2f1b28] (duration: 04m 19s) [14:01:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:14] (03CR) 10Jbond: [V: 03+1 C: 03+2] C:puppet_compiler: ensure the yaml dir is always readable [puppet] - 10https://gerrit.wikimedia.org/r/673019 (owner: 10Jbond) [14:02:30] !log jayme@cumin1001 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=eventgate-analytics [14:02:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:39] !log jayme@cumin1001 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=sessionstore [14:02:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1087 (re)pooling @ 75%: Slowly repool db1087', diff saved to https://phabricator.wikimedia.org/P14934 and previous config saved to /var/cache/conftool/dbconfig/20210317-141028-root.json [14:10:35] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:15] (03PS1) 10Muehlenhoff: Adapt condition to mask puppet service [puppet] - 10https://gerrit.wikimedia.org/r/673025 [14:12:41] (03CR) 10JMeybohm: [C: 03+2] admin/: Remove codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/671170 (https://phabricator.wikimedia.org/T277191) (owner: 10Alexandros Kosiaris) [14:13:00] (03CR) 10jerkins-bot: [V: 04-1] Adapt condition to mask puppet service [puppet] - 10https://gerrit.wikimedia.org/r/673025 (owner: 10Muehlenhoff) [14:14:52] (03PS1) 10Herron: mwlog: add primary/standby host settings and rsync [puppet] - 10https://gerrit.wikimedia.org/r/673026 (https://phabricator.wikimedia.org/T224565) [14:16:58] (03CR) 10JMeybohm: [C: 03+1] docker_registry_ha: Require authentication from k8s nodes [puppet] - 10https://gerrit.wikimedia.org/r/672537 (https://phabricator.wikimedia.org/T273521) (owner: 10Legoktm) [14:17:13] (03PS1) 10Jbond: P:doc: use the correct php version for each debian distro [puppet] - 10https://gerrit.wikimedia.org/r/673027 [14:17:33] (03PS2) 10Herron: mwlog: add primary/standby host settings and rsync [puppet] - 10https://gerrit.wikimedia.org/r/673026 (https://phabricator.wikimedia.org/T224565) [14:17:38] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host testreduce1001.eqiad.wmnet [14:17:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:00] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28648/console" [puppet] - 10https://gerrit.wikimedia.org/r/673027 (owner: 10Jbond) [14:18:07] !log rebooting restreduce1001 for T277580 [14:18:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:14] T277580: Bump disk space on testreduce1001 - https://phabricator.wikimedia.org/T277580 [14:20:27] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler1001/28649/" [puppet] - 
10https://gerrit.wikimedia.org/r/673026 (https://phabricator.wikimedia.org/T224565) (owner: 10Herron) [14:21:33] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [14:22:09] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [14:24:46] (03PS2) 10Muehlenhoff: Adapt condition to mask puppet service [puppet] - 10https://gerrit.wikimedia.org/r/673025 [14:25:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1087 (re)pooling @ 100%: Slowly repool db1087', diff saved to https://phabricator.wikimedia.org/P14935 and previous config saved to /var/cache/conftool/dbconfig/20210317-142532-root.json [14:25:33] (03CR) 10jerkins-bot: [V: 04-1] Adapt condition to mask puppet service [puppet] - 10https://gerrit.wikimedia.org/r/673025 (owner: 10Muehlenhoff) [14:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:54] (03CR) 10Herron: [C: 03+2] grafana: make domainrw optional [puppet] - 10https://gerrit.wikimedia.org/r/672445 (owner: 10Herron) [14:28:16] (03PS1) 10Jbond: cloud - sso: add class definitions to hiera [puppet] - 10https://gerrit.wikimedia.org/r/673029 [14:29:20] (03PS2) 10Jbond: cloud - sso: add class definitions to hiera [puppet] - 10https://gerrit.wikimedia.org/r/673029 [14:32:47] (03PS3) 10Muehlenhoff: Adapt condition to mask puppet service [puppet] - 10https://gerrit.wikimedia.org/r/673025 [14:37:59] (03CR) 10Filippo Giunchedi: [C: 03+1] 
mwlog: add primary/standby host settings and rsync [puppet] - 10https://gerrit.wikimedia.org/r/673026 (https://phabricator.wikimedia.org/T224565) (owner: 10Herron) [14:39:15] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/670922 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [14:43:14] (03PS1) 10Mholloway: [BUG] sessionTick: Tick right away on sessionReset [extensions/WikimediaEvents] (wmf/1.36.0-wmf.35) - 10https://gerrit.wikimedia.org/r/672821 (https://phabricator.wikimedia.org/T277515) [14:43:44] (03CR) 10Jbond: [C: 03+2] cloud - sso: add class definitions to hiera [puppet] - 10https://gerrit.wikimedia.org/r/673029 (owner: 10Jbond) [14:45:48] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testreduce1001.eqiad.wmnet [14:45:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:33] (03PS1) 10Jbond: O:grafana: move httpd to the P:grafana [puppet] - 10https://gerrit.wikimedia.org/r/673037 [14:48:35] (03PS1) 10Jbond: hiera - cloud: P:grafana now installes httpd so no need to do it seperatly [puppet] - 10https://gerrit.wikimedia.org/r/673038 [14:51:08] (03PS2) 10Majavah: Add deployment-restbase03 to beta cluster [puppet] - 10https://gerrit.wikimedia.org/r/669477 (https://phabricator.wikimedia.org/T250574) [14:52:00] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28651/console" [puppet] - 10https://gerrit.wikimedia.org/r/673037 (owner: 10Jbond) [14:52:05] (03PS3) 10Majavah: Add deployment-restbase03 to beta cluster [puppet] - 10https://gerrit.wikimedia.org/r/669477 (https://phabricator.wikimedia.org/T250574) [14:56:43] !log jmm@cumin1001 START - Cookbook sre.ganeti.makevm for new host dbmonitor1002.wikimedia.org [14:56:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:20] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 
10https://gerrit.wikimedia.org/r/673025 (owner: 10Muehlenhoff) [14:58:12] (03CR) 10Jbond: [C: 03+1] modules/beta/files/wmf-beta-update-databases.py: Port to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/670922 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [14:58:51] (03CR) 10Muehlenhoff: [C: 03+2] Add deployment-restbase03 to beta cluster [puppet] - 10https://gerrit.wikimedia.org/r/669477 (https://phabricator.wikimedia.org/T250574) (owner: 10Majavah) [15:02:23] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [15:02:59] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [15:05:16] (03CR) 10Herron: [C: 03+2] mwlog: add primary/standby host settings and rsync [puppet] - 10https://gerrit.wikimedia.org/r/673026 (https://phabricator.wikimedia.org/T224565) (owner: 10Herron) [15:06:11] (03CR) 10Klausman: [C: 03+2] hiera/modules: Add role for ML k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/672402 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [15:07:05] (03CR) 10Mholloway: "I've added this to the next backport window on https://wikitech.wikimedia.org/wiki/Deployments." 
[extensions/WikimediaEvents] (wmf/1.36.0-wmf.35) - 10https://gerrit.wikimedia.org/r/672821 (https://phabricator.wikimedia.org/T277515) (owner: 10Mholloway) [15:12:39] (03PS19) 10Jbond: netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) [15:13:07] PROBLEM - Check systemd state on mwlog1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:13:44] (03PS1) 10Klausman: hiera: change docker package for ML worker nodes to docker.io [puppet] - 10https://gerrit.wikimedia.org/r/673044 (https://phabricator.wikimedia.org/T272918) [15:14:52] PROBLEM - Check systemd state on ml-serve1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:14:53] PROBLEM - Check systemd state on ml-serve1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:14:53] PROBLEM - Check systemd state on ml-serve1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:14:53] PROBLEM - Check systemd state on ml-serve1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:51] elukey, klausman: FYI ^^^ seems docker [15:16:07] Yes, my mistake [15:16:42] (03CR) 10Elukey: [C: 03+1] hiera: change docker package for ML worker nodes to docker.io [puppet] - 10https://gerrit.wikimedia.org/r/673044 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [15:17:19] ACKNOWLEDGEMENT - Check systemd state on ml-serve1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Klausman Puppet setup mistake (wrong docker package name) https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:17:27] ACKNOWLEDGEMENT - Check systemd state on ml-serve1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Klausman Puppet setup mistake (wrong docker package name) https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:17:27] ACKNOWLEDGEMENT - Check systemd state on ml-serve1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Klausman Puppet setup mistake (wrong docker package name) https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:17:27] ACKNOWLEDGEMENT - Check systemd state on ml-serve1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
Klausman Puppet setup mistake (wrong docker package name) https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:17:51] (03CR) 10Klausman: [C: 03+2] hiera: change docker package for ML worker nodes to docker.io [puppet] - 10https://gerrit.wikimedia.org/r/673044 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [15:21:06] (03PS1) 10Herron: rsync::quickdatacopy add auto_ferm_ipv6 param [puppet] - 10https://gerrit.wikimedia.org/r/673045 [15:21:47] (03PS2) 10Herron: rsync::quickdatacopy add auto_ferm_ipv6 param [puppet] - 10https://gerrit.wikimedia.org/r/673045 [15:22:07] PROBLEM - DPKG on ml-serve1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [15:22:21] PROBLEM - Check systemd state on ml-serve2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:22:33] PROBLEM - DPKG on ml-serve1004 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [15:23:28] !log jmm@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host dbmonitor1002.wikimedia.org [15:23:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:04] (03PS1) 10Volans: admin: add Sam Walton as LDAP only account [puppet] - 10https://gerrit.wikimedia.org/r/673046 (https://phabricator.wikimedia.org/T277298) [15:27:42] (03CR) 10jerkins-bot: [V: 04-1] admin: add Sam Walton as LDAP only account [puppet] - 10https://gerrit.wikimedia.org/r/673046 (https://phabricator.wikimedia.org/T277298) (owner: 10Volans) [15:28:55] (03PS2) 10Volans: admin: add Sam Walton as LDAP only account [puppet] - 10https://gerrit.wikimedia.org/r/673046 (https://phabricator.wikimedia.org/T277298) [15:29:41] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler1001/28652/" [puppet] - 10https://gerrit.wikimedia.org/r/673045 (owner: 
10Herron) [15:33:34] (03CR) 10CRusnov: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/628436 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [15:33:39] (03CR) 10CRusnov: [C: 03+2] modules/tcpircbot: Port to Python3 [puppet] - 10https://gerrit.wikimedia.org/r/628436 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [15:33:43] (03PS3) 10Volans: admin: add Sam Walton as LDAP only account [puppet] - 10https://gerrit.wikimedia.org/r/673046 (https://phabricator.wikimedia.org/T277298) [15:34:37] PROBLEM - DPKG on ml-serve1003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [15:35:04] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/673046 (https://phabricator.wikimedia.org/T277298) (owner: 10Volans) [15:35:25] (03PS4) 10CRusnov: modules/tcpircbot: Port to Python3 [puppet] - 10https://gerrit.wikimedia.org/r/628436 (https://phabricator.wikimedia.org/T247364) [15:35:27] (03CR) 10CRusnov: modules/tcpircbot: Port to Python3 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/628436 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [15:35:47] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler1002/28653/ including additional hosts using rsync::quickdatacopy (grafana hosts) where this s" [puppet] - 10https://gerrit.wikimedia.org/r/673045 (owner: 10Herron) [15:35:50] (03PS5) 10CRusnov: modules/tcpircbot: Port to Python3 [puppet] - 10https://gerrit.wikimedia.org/r/628436 (https://phabricator.wikimedia.org/T247364) [15:36:27] (03CR) 10Cwhite: [C: 03+1] rsync::quickdatacopy add auto_ferm_ipv6 param [puppet] - 10https://gerrit.wikimedia.org/r/673045 (owner: 10Herron) [15:36:31] (03PS1) 10Majavah: beta: remove deployment-restbase01 [puppet] - 10https://gerrit.wikimedia.org/r/673047 (https://phabricator.wikimedia.org/T250574) [15:36:45] (03CR) 10Alexandros Kosiaris: [C: 03+1] hiera/modules: Add role for ML k8s 
workers (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/672402 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [15:37:28] (03PS1) 10Jbond: hiera - cloud: add config for pki-debmon [puppet] - 10https://gerrit.wikimedia.org/r/673048 [15:37:36] (03PS1) 10Muehlenhoff: Add dbmonitor1002 to site.pp/DHCP [puppet] - 10https://gerrit.wikimedia.org/r/673049 [15:39:09] (03CR) 10CRusnov: [C: 03+2] modules/tcpircbot: Port to Python3 [puppet] - 10https://gerrit.wikimedia.org/r/628436 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [15:40:39] PROBLEM - DPKG on ml-serve1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [15:40:57] PROBLEM - Check systemd state on ml-serve2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:41:17] (03CR) 10Volans: [C: 03+2] admin: add Sam Walton as LDAP only account [puppet] - 10https://gerrit.wikimedia.org/r/673046 (https://phabricator.wikimedia.org/T277298) (owner: 10Volans) [15:41:31] PROBLEM - DPKG on ml-serve2003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [15:41:57] (03PS4) 10Legoktm: [WIP] Add shellbox chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/667047 [15:42:28] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Add samwalton to analytics-privatedata-users - https://phabricator.wikimedia.org/T277298 (10Volans) [15:42:43] PROBLEM - Check systemd state on ml-serve2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:43:03] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add shellbox chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/667047 (owner: 10Legoktm) [15:47:31] RECOVERY - Check systemd state on ml-serve2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:31] PROBLEM - Check systemd state on ml-serve2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:49:08] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Add samwalton to analytics-privatedata-users - https://phabricator.wikimedia.org/T277298 (10Volans) @Samwalton9 The patch has been merged. Within ~half an hour Puppet should have run everywhere and should apply the changes. Feel free to resolve it once you ca... [15:50:37] 10SRE, 10SRE-Access-Requests: Requesting access to sites from Google Search Console - https://phabricator.wikimedia.org/T277602 (10Volans) @CGlenn I'd need your manager approval here on task in order to proceed, but I can't find their Phabricator's account to ping them. Could you please let them know? [15:53:36] 10SRE, 10SRE-Access-Requests: Grant Access to for apine - https://phabricator.wikimedia.org/T277544 (10Jdforrester-WMF) Yeah, Cory will want `wmf` I think, or possibly `analytics-privatedata-users` for his work with us in the Abstract Wikipedia team? 
[15:58:21] PROBLEM - DPKG on ml-serve2002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [15:58:41] (03PS2) 10Muehlenhoff: Add dbmonitor1002 to site.pp/DHCP [puppet] - 10https://gerrit.wikimedia.org/r/673049 [15:59:48] (03PS1) 10Jbond: P:debmonitor: Make profile compatibla wth cloud environments [puppet] - 10https://gerrit.wikimedia.org/r/673050 [16:00:47] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28654/console" [puppet] - 10https://gerrit.wikimedia.org/r/673050 (owner: 10Jbond) [16:01:22] (03CR) 10Muehlenhoff: [C: 03+2] Add dbmonitor1002 to site.pp/DHCP [puppet] - 10https://gerrit.wikimedia.org/r/673049 (owner: 10Muehlenhoff) [16:03:48] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) install MPC7E-MRATE FPC into cr[12]-codfw - https://phabricator.wikimedia.org/T277341 (10ayounsi) a:05ayounsi→03Papaul [16:04:02] !log dancy@deploy1002 Synchronized php-1.36.0-wmf.35/includes/Revision/RevisionRecord.php: (no justification provided) (duration: 00m 58s) [16:04:03] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28655/console" [puppet] - 10https://gerrit.wikimedia.org/r/673050 (owner: 10Jbond) [16:04:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:19] (03PS1) 10Ahmon Dancy: group0 wikis to 1.36.0-wmf.35 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673051 [16:04:22] (03CR) 10Ahmon Dancy: [C: 03+2] group0 wikis to 1.36.0-wmf.35 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673051 (owner: 10Ahmon Dancy) [16:05:12] (03Merged) 10jenkins-bot: group0 wikis to 1.36.0-wmf.35 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673051 (owner: 10Ahmon Dancy) [16:06:49] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.35 [16:06:57] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:59] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:10:57] PROBLEM - Check systemd state on ml-serve2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:11:19] PROBLEM - DPKG on ml-serve2004 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [16:13:11] PROBLEM - DPKG on ml-serve2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [16:14:54] (03CR) 10Arturo Borrero Gonzalez: P:debmonitor: Make profile compatibla wth cloud environments (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/673050 (owner: 10Jbond) [16:15:37] (03PS2) 10Jbond: P:debmonitor: Make profile compatible with cloud environments [puppet] - 10https://gerrit.wikimedia.org/r/673050 [16:15:55] (03CR) 10Jbond: "thanks" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/673050 (owner: 10Jbond) [16:22:59] (03PS1) 10Zabe: Define Portal namespace for niawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673053 (https://phabricator.wikimedia.org/T277671) [16:23:54] (03PS2) 10Zabe: Define Portal and Portal talk namespace for niawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673053 (https://phabricator.wikimedia.org/T277671) [16:24:22] (03CR) 10Razzi: [C: 03+1] Replace labsdb1012 with clouddb1021 in analytics-in4 [homer/public] - 10https://gerrit.wikimedia.org/r/672797 (https://phabricator.wikimedia.org/T269211) (owner: 10Elukey) [16:27:49] (03CR) 10jerkins-bot: [V: 04-1] Define Portal and Portal talk namespace for niawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673053 (https://phabricator.wikimedia.org/T277671) (owner: 10Zabe) [16:29:31] (03PS3) 10Zabe: Define 
Portal and Portal talk namespace for niawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673053 (https://phabricator.wikimedia.org/T277671) [16:37:59] !log upgrade memcached on mc1025, mc2025 [16:38:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:37] (03CR) 10Ahmon Dancy: [C: 03+1] gerrit: use ecs-compatible apache access log format [puppet] - 10https://gerrit.wikimedia.org/r/670951 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [16:41:28] (03CR) 10Subramanya Sastry: [C: 03+1] "All good to go. Regression testing on 54 potentially regressing titles indicates it is all good." [vendor] (wmf/1.36.0-wmf.35) - 10https://gerrit.wikimedia.org/r/672812 (https://phabricator.wikimedia.org/T276649) (owner: 10C. Scott Ananian) [16:44:44] !log andrew@deploy1002 Started deploy [horizon/deploy@e4fd934]: more support for disabled flavors [16:44:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:52] !log andrew@deploy1002 Finished deploy [horizon/deploy@e4fd934]: more support for disabled flavors (duration: 00m 07s) [16:44:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:08] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Figure out appropriate readiness and liveness probes - https://phabricator.wikimedia.org/T276908 (10JMeybohm) >>! In T276908#6916385, @Joe wrote: > After some more work, this is my ideas for liveness and readiness probes: > > # httpd: > -- liveness: tc... 
[16:45:10] !log andrew@deploy1002 Started deploy [horizon/deploy@8c50f27]: more support for disabled flavors [16:45:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:18] !log andrew@deploy1002 Finished deploy [horizon/deploy@8c50f27]: more support for disabled flavors (duration: 00m 07s) [16:45:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:02] !log andrew@deploy1002 Started deploy [horizon/deploy@8c50f27]: more support for disabled flavors [16:48:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:05] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Figure out appropriate readiness and liveness probes - https://phabricator.wikimedia.org/T276908 (10Joe) >>! In T276908#6922060, @JMeybohm wrote: >>>! In T276908#6916385, @Joe wrote: >> After some more work, this is my ideas for liveness and readiness p... [16:50:35] !log andrew@deploy1002 Finished deploy [horizon/deploy@8c50f27]: more support for disabled flavors (duration: 02m 32s) [16:50:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:33] (03CR) 10JMeybohm: [C: 03+2] admin: Remove staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/671169 (owner: 10Alexandros Kosiaris) [16:54:15] (03Merged) 10jenkins-bot: admin: Remove staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/671169 (owner: 10Alexandros Kosiaris) [16:54:19] (03PS1) 10Effie Mouzeli: hieradata: enable ipv6 on envoy services on all mw servers [puppet] - 10https://gerrit.wikimedia.org/r/673061 (https://phabricator.wikimedia.org/T255568) [16:54:25] (03Merged) 10jenkins-bot: admin/: Remove codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/671170 (https://phabricator.wikimedia.org/T277191) (owner: 10Alexandros Kosiaris) [16:54:52] 10SRE, 10Prod-Kubernetes, 10SRE-tools, 10serviceops, 10User-jijiki: Create a cookbook to perform a rolling reboot of a kubernetes cluster - 
https://phabricator.wikimedia.org/T260661 (10JMeybohm) [16:55:08] 10SRE, 10Prod-Kubernetes, 10SRE-tools, 10serviceops: Create a cookbook for depooling one or all services from one kubernetes cluster - https://phabricator.wikimedia.org/T260663 (10JMeybohm) [16:58:17] (03CR) 10Ayounsi: [C: 03+1] Replace labsdb1012 with clouddb1021 in analytics-in4 [homer/public] - 10https://gerrit.wikimedia.org/r/672797 (https://phabricator.wikimedia.org/T269211) (owner: 10Elukey) [16:59:03] 10SRE, 10Prod-Kubernetes, 10SRE-tools, 10serviceops: Write a cookbook to set a k8s cluster in maintenance mode - https://phabricator.wikimedia.org/T277677 (10JMeybohm) p:05Triage→03Medium [17:03:38] (03PS2) 10JMeybohm: fluent-bit: Switch to nobody and use seed_image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/670838 (https://phabricator.wikimedia.org/T274852) [17:03:47] (03PS2) 10JMeybohm: ratelimit: Switch to nobody, update build and base image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/670836 (https://phabricator.wikimedia.org/T274852) [17:04:15] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [17:04:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:21] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: update ingest errors to use dead letters gauge [puppet] - 10https://gerrit.wikimedia.org/r/672769 (https://phabricator.wikimedia.org/T277080) (owner: 10Cwhite) [17:06:56] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: clean up mtail config [puppet] - 10https://gerrit.wikimedia.org/r/672771 (https://phabricator.wikimedia.org/T277080) (owner: 10Cwhite) [17:09:36] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:09:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:45] (03PS2) 10Effie Mouzeli: hieradata: enable ipv6 on envoy services on all mw servers [puppet] - 10https://gerrit.wikimedia.org/r/673061 
(https://phabricator.wikimedia.org/T255568) [17:10:20] (03CR) 10Herron: [C: 03+2] rsync::quickdatacopy add auto_ferm_ipv6 param [puppet] - 10https://gerrit.wikimedia.org/r/673045 (owner: 10Herron) [17:12:37] RECOVERY - Check systemd state on mwlog1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:14:57] 10SRE, 10SRE-Access-Requests: Requesting access to sites from Google Search Console - https://phabricator.wikimedia.org/T277602 (10Aklapper) (Adding notorious link to https://www.mediawiki.org/wiki/Phabricator/Help#Creating_your_account if folks want to create an account here.) [17:15:03] RECOVERY - Check systemd state on ml-serve1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:15:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10Cmjohnson) a:05Cmjohnson→03RobH @robh mc1037 NIC cfg done (enabled pxe on 10G disabled on the 1GE), MAC BC:97:E1:E4:4B:30 mc1038 same an... 
[17:16:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10Cmjohnson) [17:17:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10Cmjohnson) @RobH after you set the 2 up can you assign this task back to @Jclark-ctr to finish the remainder please [17:19:29] (03CR) 10Andrew Bogott: [C: 03+2] Nova vendordata: rework initial partitioning [puppet] - 10https://gerrit.wikimedia.org/r/672538 (https://phabricator.wikimedia.org/T272114) (owner: 10Andrew Bogott) [17:22:01] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:22:09] PROBLEM - Check systemd state on ml-serve1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:22:41] PROBLEM - Ensure local MW versions match expected deployment on mw2232 is CRITICAL: CRITICAL: 131 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [17:23:43] PROBLEM - Ensure local MW versions match expected deployment on mw2231 is CRITICAL: CRITICAL: 131 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [17:23:49] PROBLEM - Ensure local MW versions match expected deployment on mw2230 is CRITICAL: CRITICAL: 131 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [17:25:41] PROBLEM - tcpircbot_service_running on alert1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args tcpircbot.py https://wikitech.wikimedia.org/wiki/Logmsgbot [17:26:00] that's me i'll look in a sec [17:28:46] 10Puppet, 10Beta-Cluster-Infrastructure: Unduplicate beta cluster hiera keys set both in Horizon and in ops/puppet - https://phabricator.wikimedia.org/T277680 (10Majavah) [17:28:53] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [17:29:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:37] (03CR) 10Volans: [C: 03+1] "LGTM, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/673050 (owner: 10Jbond) [17:34:40] (03CR) 10Jeena Huneidi: [C: 03+2] rdf-streaming-updater: fix networkpolicy selector [deployment-charts] - 10https://gerrit.wikimedia.org/r/672544 (owner: 10Jeena Huneidi) [17:35:01] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:35:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:18] (03Merged) 10jenkins-bot: rdf-streaming-updater: fix networkpolicy selector [deployment-charts] - 10https://gerrit.wikimedia.org/r/672544 (owner: 10Jeena Huneidi) [17:39:29] (03PS1) 10Bstorm: paws haproxy: tighten block restriction to use x-forwarded-for as well [puppet] - 10https://gerrit.wikimedia.org/r/673065 (https://phabricator.wikimedia.org/T276615) [17:41:07] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw2230.codfw.wmnet [17:41:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:56] 10SRE, 10SRE-Access-Requests: Grant Access to for apine - https://phabricator.wikimedia.org/T277544 (10Volans) @Jdforrester-WMF the user `apine` is already part of the `wmf` group. Adding @Ottomata for advice on what kind of access is needed in this case. [17:44:15] 10SRE, 10SRE-Access-Requests: Requesting access to sites from Google Search Console - https://phabricator.wikimedia.org/T277602 (10Volans) p:05Triage→03Medium [17:46:20] (03CR) 10Dzahn: "Wow, so nice to see this is already merged. 
did not expect that to happen :) thanks" [puppet] - 10https://gerrit.wikimedia.org/r/672796 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [17:48:03] (03CR) 10Dzahn: [C: 03+2] gerrit: use ecs-compatible apache access log format [puppet] - 10https://gerrit.wikimedia.org/r/670951 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [17:48:11] (03PS2) 10Dzahn: gerrit: use ecs-compatible apache access log format [puppet] - 10https://gerrit.wikimedia.org/r/670951 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [17:48:26] 10SRE, 10SRE-Access-Requests: Grant Access to for apine - https://phabricator.wikimedia.org/T277544 (10Ottomata) mediawiki_wikitext_history will need analytics-privatdata-user with ssh and kerberos access. `wmf` LDAP will be useful too. So: https://wikitech.wikimedia.org/wiki/Analytics/Data_access... [17:48:41] (03CR) 10Razzi: [C: 03+2] Replace labsdb1012 with clouddb1021 in analytics-in4 [homer/public] - 10https://gerrit.wikimedia.org/r/672797 (https://phabricator.wikimedia.org/T269211) (owner: 10Elukey) [17:50:45] !log update firewall rules to allow mysql-sqoop in analytics-in4 to access clouddb1021 - https://gerrit.wikimedia.org/r/c/operations/homer/public/+/672797 [17:50:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:29] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2230.codfw.wmnet [17:52:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:20] (03CR) 10Dzahn: "can confirm the format of /var/log/apache2/gerrit.wikimedia.org.https.access.log changed" [puppet] - 10https://gerrit.wikimedia.org/r/670951 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [17:55:49] (03CR) 10Dzahn: [C: 03+2] phabricator: use ecs-compatible apache access log format [puppet] - 10https://gerrit.wikimedia.org/r/670950 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [17:56:38] !log dzahn@cumin1001 
START - Cookbook sre.hosts.decommission for hosts mw2231.codfw.wmnet [17:56:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:22] (03CR) 10Dzahn: "can confirm the format of /var/log/apache2/phabricator_access.log changed on phab1001" [puppet] - 10https://gerrit.wikimedia.org/r/670950 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [18:00:04] RoanKattouw, Niharika, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210317T1800). [18:00:04] cscott, mholloway, hip, and Zabe: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:04] dancy and brennen: Time to snap out of that daydream and deploy Train log triage with CPT. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210317T1800). [18:00:31] 10SRE, 10SRE-Access-Requests: Requesting access to sites from Google Search Console - https://phabricator.wikimedia.org/T277602 (10CGlenn) No problem! @Volans Hi @ahemmer ! Can you approve my phabricator ticket for me please? [18:00:37] o/ [18:05:32] d:q [18:06:05] (03PS1) 10Volans: WIP. tests: generate documentation from schemas [homer/public] - 10https://gerrit.wikimedia.org/r/673071 (https://phabricator.wikimedia.org/T272688) [18:07:03] isn't that :wq ? :-P [18:07:18] (03CR) 10jerkins-bot: [V: 04-1] WIP. 
tests: generate documentation from schemas [homer/public] - 10https://gerrit.wikimedia.org/r/673071 (https://phabricator.wikimedia.org/T272688) (owner: 10Volans) [18:08:30] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [18:08:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:13] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2231.codfw.wmnet [18:09:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:28] 10SRE, 10SRE-Access-Requests: Grant Access to for apine - https://phabricator.wikimedia.org/T277544 (10Volans) @cmassaro please update the task description updating the request for what's listed in the previous reply. You can follow the instructions at: https://wikitech.wikimedia.org/wiki/Production... [18:09:47] i'm here for the parsoid deploy [18:09:53] parsoid backport, rather [18:10:26] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw2232.codfw.wmnet [18:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:36] (since we were just speaking of parsoid and backports) [18:10:44] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:10:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:18] (03CR) 10Paladox: "Most styles in this file can be removed apart from the styles for login." [puppet] - 10https://gerrit.wikimedia.org/r/672986 (https://phabricator.wikimedia.org/T277645) (owner: 10Hashar) [18:13:34] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: TBD) rack/setup/install db11[76-84] - https://phabricator.wikimedia.org/T273566 (10Cmjohnson) a:05Cmjohnson→03RobH @RobH These are ready for you to finish the installs, I did verify that I was able to connect to mgmt on all of them. Use the temp password. 
[18:13:36] (03CR) 10Bstorm: [C: 03+2] paws haproxy: tighten block restriction to use x-forwarded-for as well [puppet] - 10https://gerrit.wikimedia.org/r/673065 (https://phabricator.wikimedia.org/T276615) (owner: 10Bstorm) [18:14:21] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: TBD) rack/setup/install db11[76-84] - https://phabricator.wikimedia.org/T273566 (10Cmjohnson) [18:14:58] 10SRE, 10SRE-Access-Requests: Requesting access to sites from Google Search Console - https://phabricator.wikimedia.org/T277602 (10ahemmer) Approved! [18:17:02] RoanKattouw, Niharika, Urbanecm: who's setting things on fire today? [18:17:12] 10SRE, 10ops-eqiad: analytics1063 interface errors - https://phabricator.wikimedia.org/T277633 (10Cmjohnson) @elukey does this need to be scheduled or can I just swap the cable? it will take ~30secs to do [18:17:14] I can take a turn [18:17:16] Haven't done it in a while :) [18:17:22] RoanKattouw: go ahead then :) [18:17:50] * cscott prepares marshmallows and a toasting stick [18:18:01] (03CR) 10Catrope: [C: 03+2] Define Portal and Portal talk namespace for niawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673053 (https://phabricator.wikimedia.org/T277671) (owner: 10Zabe) [18:18:32] Zabe: Your change is first, please be ready to test on the debug server in a few minutes [18:19:03] ok [18:19:37] (03Merged) 10jenkins-bot: Define Portal and Portal talk namespace for niawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673053 (https://phabricator.wikimedia.org/T277671) (owner: 10Zabe) [18:19:38] ( https://wikitech.wikimedia.org/wiki/WikimediaDebug#Browser_usage if you're not already familiar with the WikimediaDebug browser extension) [18:22:04] Zabe: Your change is ready for testing on mwdebug1002 [18:22:23] (03CR) 10Catrope: [C: 03+2] Bump wikimedia/parsoid to 0.13.0-a28 [vendor] (wmf/1.36.0-wmf.35) - 10https://gerrit.wikimedia.org/r/672812 (https://phabricator.wikimedia.org/T276649) (owner: 10C. 
Scott Ananian) [18:22:46] 10SRE, 10ops-eqiad: analytics1063 interface errors - https://phabricator.wikimedia.org/T277633 (10elukey) >>! In T277633#6922486, @Cmjohnson wrote: > @elukey does this need to be scheduled or can I just swap the cable? it will take ~30secs to do +1 to proceed Also adding @razzi in Cc [18:23:15] mholloway: Are you around for your backport deploy for WikimediaEvents? [18:23:27] RoanKattouw: yep! [18:23:38] (03CR) 10Catrope: [C: 03+2] [BUG] sessionTick: Tick right away on sessionReset [extensions/WikimediaEvents] (wmf/1.36.0-wmf.35) - 10https://gerrit.wikimedia.org/r/672821 (https://phabricator.wikimedia.org/T277515) (owner: 10Mholloway) [18:24:56] RoanKattouw: it's working the correct way [18:27:33] RECOVERY - mediawiki originals uploads -hourly- for codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [18:27:52] ours is really only visible in aggregate as events come in after the change. we'll be watching dashboards and running queries as events come in with the new patch [18:29:39] RECOVERY - mediawiki originals uploads -hourly- for eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [18:30:00] Yay! Will deploy [18:31:43] !log catrope@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Define Portal and Portal talk namespace for niawiki (T277671) (duration: 01m 11s) [18:31:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:52] T277671: Portal namespace setup for nia.wikipedia - https://phabricator.wikimedia.org/T277671 [18:32:00] Zabe: Your change is live! 
Thank you for your patience [18:32:20] thanks [18:32:49] The other two changes (Parsoid and WikimediaEvents) are waiting on CI, once that's done they'll merge at the same time and I'll put them on the debug server together [18:33:47] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2232.codfw.wmnet [18:33:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:09] as usual (sigh) parsoid can't be tested on the debug server, but i'll test on group0 once that goes live [18:35:02] i can watch the logs on the debug server to make sure things aren't exploding though [18:35:06] (03CR) 10Dzahn: [C: 03+2] gerrit: GerritSite.css remove CI customization [puppet] - 10https://gerrit.wikimedia.org/r/672986 (https://phabricator.wikimedia.org/T277645) (owner: 10Hashar) [18:37:46] cscott: OK [18:40:00] actually, i think i figured out a way to test this on mwdebug* too. also watching the client errors dashboard [18:41:23] Well, now I'm learning how long CI takes on vendor changes :/ [18:42:40] (03CR) 10Dzahn: [C: 03+2] gerrit: GerritSite.css remove unused diff customization [puppet] - 10https://gerrit.wikimedia.org/r/672987 (https://phabricator.wikimedia.org/T232893) (owner: 10Hashar) [18:42:47] (03PS2) 10Dzahn: gerrit: GerritSite.css remove unused diff customization [puppet] - 10https://gerrit.wikimedia.org/r/672987 (https://phabricator.wikimedia.org/T232893) (owner: 10Hashar) [18:43:29] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2233.codfw.wmnet [18:43:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:38] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2234.codfw.wmnet [18:43:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:45] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2235.codfw.wmnet [18:43:50] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:57] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw2233.codfw.wmnet [18:44:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:09] (03Merged) 10jenkins-bot: Bump wikimedia/parsoid to 0.13.0-a28 [vendor] (wmf/1.36.0-wmf.35) - 10https://gerrit.wikimedia.org/r/672812 (https://phabricator.wikimedia.org/T276649) (owner: 10C. Scott Ananian) [18:47:11] (03Merged) 10jenkins-bot: [BUG] sessionTick: Tick right away on sessionReset [extensions/WikimediaEvents] (wmf/1.36.0-wmf.35) - 10https://gerrit.wikimedia.org/r/672821 (https://phabricator.wikimedia.org/T277515) (owner: 10Mholloway) [18:48:21] Finally! [18:48:34] (03PS1) 10Ottomata: Remove schema overrides for 6 finished EL migrations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673075 (https://phabricator.wikimedia.org/T267347) [18:49:21] (03CR) 10Ottomata: "@Mforns, from what I can tell, the first 3 schemas listed here are fully migrated in their extension.json files on all wikis. So this sho" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673075 (https://phabricator.wikimedia.org/T267347) (owner: 10Ottomata) [18:49:51] (03PS2) 10Ottomata: Remove schema overrides for 6 finished EL migrations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673075 (https://phabricator.wikimedia.org/T267347) [18:49:58] mholloway: Your change is on mwdebug1002, go ahead and test [18:50:14] cscott: For you, I'll just deploy now [18:51:26] (03PS20) 10Jbond: netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) [18:52:07] (03CR) 10CRusnov: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/673077 (owner: 10CRusnov) [18:52:11] (03CR) 10jerkins-bot: [V: 04-1] netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) (owner: 10Jbond) [18:52:26] !log catrope@deploy1002 Synchronized php-1.36.0-wmf.35/vendor/: Bump wikimedia/parsoid to 0.13.0-a28 (T276649) (duration: 01m 18s) [18:52:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:34] T276649: PHP Notice: Undefined property: stdClass::$title - https://phabricator.wikimedia.org/T276649 [18:53:51] RECOVERY - Check systemd state on ml-serve1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:54:26] RoanKattouw: sigh, i was wrong about my testing strategy, to make a long story short i'd need to log in to ensure i'm in sample but i'm only interested in the first tick of a (timing) session, which would have passed by the time i'm logged in [18:54:32] (03PS21) 10Jbond: netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) [18:54:47] cscott: Your change is live [18:54:51] mholloway: OK I can just deploy if you want [18:54:58] if you don't mind deploying, i'll keep an eye on client errors. fwiw i am not worried about this one [18:55:00] yes please [18:55:14] (03CR) 10jerkins-bot: [V: 04-1] netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) (owner: 10Jbond) [18:55:30] 10SRE, 10SRE-Access-Requests: Grant Access to for apine - https://phabricator.wikimedia.org/T277544 (10cmassaro) 05Open→03Invalid [18:55:52] 10SRE, 10SRE-Access-Requests: Grant Access to for apine - https://phabricator.wikimedia.org/T277544 (10cmassaro) Thanks, all! 
I've made the request here: https://phabricator.wikimedia.org/T277692 [18:56:52] OK, rolling [18:57:29] PROBLEM - Check systemd state on ml-serve1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:57:59] !log catrope@deploy1002 Synchronized php-1.36.0-wmf.35/extensions/WikimediaEvents/: sessionTick: Tick right away on sessionReset (T277515) (duration: 01m 10s) [18:58:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:07] T277515: [Session Length] Missing sessions - https://phabricator.wikimedia.org/T277515 [19:00:04] dancy and brennen: May I have your attention please! Mediawiki train - American Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210317T1900) [19:00:11] RoanKattouw: thank you! [19:00:24] RoanKattouw: All done? [19:00:31] Yup, all done [19:00:34] thx [19:00:35] Just under the wire [19:00:39] nice [19:01:02] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2233.codfw.wmnet [19:01:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:53] (03PS1) 10Ahmon Dancy: group1 wikis to 1.36.0-wmf.35 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673078 [19:01:55] (03CR) 10Ahmon Dancy: [C: 03+2] group1 wikis to 1.36.0-wmf.35 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673078 (owner: 10Ahmon Dancy) [19:02:19] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw2234.codfw.wmnet [19:02:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:00] (03Merged) 10jenkins-bot: group1 wikis to 1.36.0-wmf.35 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673078 (owner: 10Ahmon Dancy) [19:03:57] RoanKattouw: tested, looks good, thanks! 
[19:04:42] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.36.0-wmf.35 [19:04:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:54] !log ganeti1011 - rebooting VM testreduce1001 on ganeti level for T277580 [19:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:01] T277580: Bump disk space on testreduce1001 - https://phabricator.wikimedia.org/T277580 [19:06:08] !log dancy@deploy1002 Synchronized php: group1 wikis to 1.36.0-wmf.35 (duration: 01m 26s) [19:06:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:36] (03PS22) 10Jbond: netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) [19:10:18] (03CR) 10jerkins-bot: [V: 04-1] netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) (owner: 10Jbond) [19:12:21] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:14:15] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:14:37] !log andrew@deploy1002 Started deploy [horizon/deploy@8967660]: clean up a reverted hack [19:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:23] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2234.codfw.wmnet [19:17:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:02] !log andrew@deploy1002 Finished deploy [horizon/deploy@8967660]: clean up a reverted hack (duration: 03m 25s) [19:18:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:33] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: TBD) rack/setup/install db11[76-84] - https://phabricator.wikimedia.org/T273566 (10RobH) So I just checked and confirmed with Chris that when the add server interface script was run for each of these, the skip ivp6 checkbox was checked, but they all seem to ha... [19:21:57] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: TBD) rack/setup/install db11[76-84] - https://phabricator.wikimedia.org/T273566 (10RobH) bios, raid, and idrac firmware updated on all hosts for this task. 
[19:23:23] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw2235.codfw.wmnet [19:23:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:53] (03PS23) 10Jbond: netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) [19:27:57] (03PS6) 10Jbond: P:base: add ability to manage services file [puppet] - 10https://gerrit.wikimedia.org/r/670918 [19:29:49] !log testreduce1001 - rebooted, fdisk /dev/sdb, create partition table, create primary partition, mkfs.ext4 /dev/vdb1 [19:29:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:00] (03PS1) 10Ottomata: Always set REQUESTS_CA_BUNDLE for spark so that Python executors will use the proper CA certs [puppet] - 10https://gerrit.wikimedia.org/r/673083 (https://phabricator.wikimedia.org/T272313) [19:32:16] (03PS24) 10Jbond: netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) [19:32:49] (03PS7) 10Jbond: P:base: add ability to manage services file [puppet] - 10https://gerrit.wikimedia.org/r/670918 [19:33:07] (03CR) 10jerkins-bot: [V: 04-1] Always set REQUESTS_CA_BUNDLE for spark so that Python executors will use the proper CA certs [puppet] - 10https://gerrit.wikimedia.org/r/673083 (https://phabricator.wikimedia.org/T272313) (owner: 10Ottomata) [19:33:53] (03PS2) 10Ottomata: Always set REQUESTS_CA_BUNDLE for spark [puppet] - 10https://gerrit.wikimedia.org/r/673083 (https://phabricator.wikimedia.org/T272313) [19:34:59] PROBLEM - SSH on mw2236.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:37:43] (03PS1) 10Ottomata: Needed for new release of anaconda-wmf. 
[puppet] - 10https://gerrit.wikimedia.org/r/673084 (https://phabricator.wikimedia.org/T262847) [19:37:46] (03CR) 10Ottomata: [C: 03+2] Always set REQUESTS_CA_BUNDLE for spark [puppet] - 10https://gerrit.wikimedia.org/r/673083 (https://phabricator.wikimedia.org/T272313) (owner: 10Ottomata) [19:38:03] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2235.codfw.wmnet [19:38:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:42] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Release anaconda-2020.02-wmf3 [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/670939 (owner: 10Ottomata) [19:41:19] (03PS1) 10Andrew Bogott: Nova: allow projectadmins to resize VMs [puppet] - 10https://gerrit.wikimedia.org/r/673085 [19:42:24] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2236.codfw.wmnet [19:42:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:30] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2237.codfw.wmnet [19:42:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:41] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2238.codfw.wmnet [19:42:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:55] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw2236.codfw.wmnet [19:43:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:27] (03PS8) 10Jbond: P:base: add ability to manage services file [puppet] - 10https://gerrit.wikimedia.org/r/670918 [19:44:54] !log andrew@deploy1002 Started deploy [horizon/deploy@3c2d1ee]: support VM resizing [19:45:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:37] (03CR) 10Andrew Bogott: [C: 03+2] Nova: allow projectadmins to resize VMs [puppet] - 10https://gerrit.wikimedia.org/r/673085 (owner: 
10Andrew Bogott) [19:45:40] (03PS1) 10RobH: new db host additions [puppet] - 10https://gerrit.wikimedia.org/r/673090 (https://phabricator.wikimedia.org/T273566) [19:48:07] RECOVERY - Check systemd state on ml-serve2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:48:37] !log andrew@deploy1002 Finished deploy [horizon/deploy@3c2d1ee]: support VM resizing (duration: 03m 42s) [19:48:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:28] (03CR) 10RobH: [C: 03+2] new db host additions [puppet] - 10https://gerrit.wikimedia.org/r/673090 (https://phabricator.wikimedia.org/T273566) (owner: 10RobH) [19:51:58] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2236.codfw.wmnet [19:52:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:53] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install db11[76-84] - https://phabricator.wikimedia.org/T273566 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` db1176.eqiad.wmnet ` The log can be found in `/var/l... [19:52:54] PROBLEM - Check systemd state on ml-serve2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:53:30] PROBLEM - mediawiki-installation DSH group on mw2237 is CRITICAL: Host mw2237 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [19:54:17] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw2237.codfw.wmnet [19:54:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:54] (03PS1) 10Ebernhardson: Add fallback profile including glent m1 [extensions/CirrusSearch] (wmf/1.36.0-wmf.35) - 10https://gerrit.wikimedia.org/r/672825 (https://phabricator.wikimedia.org/T262612) [19:57:46] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatdata-user for Cory Massaro - https://phabricator.wikimedia.org/T277692 (10dr0ptp4kt) Approved. [19:58:55] (03PS1) 10Andrew Bogott: nova-fullstack: use the new g3 flavor: g3.cores1.ram2.disk20 [puppet] - 10https://gerrit.wikimedia.org/r/673092 [19:59:32] (03Abandoned) 10Dzahn: site/conftool: decom mw2239 through mw2242 [puppet] - 10https://gerrit.wikimedia.org/r/671260 (https://phabricator.wikimedia.org/T277119) (owner: 10Dzahn) [19:59:47] (03CR) 10Andrew Bogott: [C: 03+2] nova-fullstack: use the new g3 flavor: g3.cores1.ram2.disk20 [puppet] - 10https://gerrit.wikimedia.org/r/673092 (owner: 10Andrew Bogott) [20:00:04] chrisalbon and accraze: Your horoscope predicts another unfortunate Services – Graphoid / ORES deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210317T2000). 
[20:04:14] PROBLEM - nova instance creation test on cloudcontrol1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name python3, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:05:15] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2237.codfw.wmnet [20:05:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:45] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1176.eqiad.wmnet with reason: REIMAGE [20:05:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:20] 10SRE, 10Data-Persistence-Backup, 10Goal: Create a first release of the media backups automation tools - https://phabricator.wikimedia.org/T276445 (10jcrespo) a:03jcrespo [20:06:40] 10SRE, 10Data-Persistence-Backup, 10Goal, 10Patch-For-Review: Puppetize media backups infrastructure - https://phabricator.wikimedia.org/T276442 (10jcrespo) a:03jcrespo [20:07:26] RECOVERY - nova instance creation test on cloudcontrol1003 is OK: PROCS OK: 1 process with command name python3, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:07:52] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install db11[76-84] - https://phabricator.wikimedia.org/T273566 (10Marostegui) Having IPv6 allocated is fine as long as they don't have a DNS attached to it :-) [20:07:58] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1176.eqiad.wmnet with reason: REIMAGE [20:07:59] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw2238.codfw.wmnet [20:08:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:58] PROBLEM - k8s API server requests latencies on argon is CRITICAL: instance=10.64.32.133 
verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:15:06] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: TBD) rack/setup/install db11[76-84] - https://phabricator.wikimedia.org/T273566 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1176.eqiad.wmnet'] ` and were **ALL** successful. [20:17:36] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatdata-user for Cory Massaro - https://phabricator.wikimedia.org/T277692 (10Ottomata) Approved. @JAllemandou FYI for you that Cory wants to work with mediawiki_wikitext_history. :) [20:17:41] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: TBD) rack/setup/install db11[76-84] - https://phabricator.wikimedia.org/T273566 (10RobH) [20:19:13] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2238.codfw.wmnet [20:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:59] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: TBD) rack/setup/install db11[76-84] - https://phabricator.wikimedia.org/T273566 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['db1177.eqiad.wmnet', 'db1178.eqiad.wmnet', 'db1179.eqiad.wmnet', 'db1180... [20:27:03] (03CR) 10Ottomata: [C: 03+2] Needed for new release of anaconda-wmf. [puppet] - 10https://gerrit.wikimedia.org/r/673084 (https://phabricator.wikimedia.org/T262847) (owner: 10Ottomata) [20:27:17] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_arclamp site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:31:39] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:35:20] (03PS1) 10Dzahn: rsync::quickdatacopy: replace cron job with systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/673097 (https://phabricator.wikimedia.org/T273673) [20:35:57] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1177.eqiad.wmnet with reason: REIMAGE [20:36:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:11] RECOVERY - Check systemd state on ml-serve2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:37:31] PROBLEM - DPKG on an-worker1080 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [20:37:55] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1178.eqiad.wmnet with reason: REIMAGE [20:38:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:01] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1177.eqiad.wmnet with reason: REIMAGE [20:38:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:41] PROBLEM - DPKG on an-worker1082 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [20:39:24] !log andrew@deploy1002 Started deploy [horizon/deploy@17ea780]: display volume usage summaries [20:39:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:54] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1179.eqiad.wmnet with reason: REIMAGE [20:40:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:08] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1178.eqiad.wmnet with reason: REIMAGE [20:40:13] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:29] RECOVERY - k8s API server requests latencies on argon is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:41:56] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1180.eqiad.wmnet with reason: REIMAGE [20:42:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:13] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01691 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [20:42:20] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1179.eqiad.wmnet with reason: REIMAGE [20:42:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:59] !log andrew@deploy1002 Finished deploy [horizon/deploy@17ea780]: display volume usage summaries (duration: 03m 34s) [20:43:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:58] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1181.eqiad.wmnet with reason: REIMAGE [20:44:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:25] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1180.eqiad.wmnet with reason: REIMAGE [20:44:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:06] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1182.eqiad.wmnet with reason: REIMAGE [20:45:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:27] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1181.eqiad.wmnet with reason: REIMAGE [20:46:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[20:47:06] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1183.eqiad.wmnet with reason: REIMAGE [20:47:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:28] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1182.eqiad.wmnet with reason: REIMAGE [20:48:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:59] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1184.eqiad.wmnet with reason: REIMAGE [20:49:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:55] (03PS1) 10Cwhite: logstash: return canceled event in expected format [puppet] - 10https://gerrit.wikimedia.org/r/673103 [20:50:33] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1183.eqiad.wmnet with reason: REIMAGE [20:50:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:28] (03CR) 10Cwhite: [C: 03+2] logstash: return canceled event in expected format [puppet] - 10https://gerrit.wikimedia.org/r/673103 (owner: 10Cwhite) [20:52:37] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1184.eqiad.wmnet with reason: REIMAGE [20:52:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:08] (03PS2) 10Cwhite: logstash: return canceled event in expected format [puppet] - 10https://gerrit.wikimedia.org/r/673103 [20:55:31] (03CR) 10Cwhite: [C: 03+2] logstash: return canceled event in expected format [puppet] - 10https://gerrit.wikimedia.org/r/673103 (owner: 10Cwhite) [20:58:52] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: TBD) rack/setup/install db11[76-84] - https://phabricator.wikimedia.org/T273566 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1177.eqiad.wmnet', 'db1178.eqiad.wmnet', 'db1179.eqiad.wmnet', 'db1180.eqiad.wmnet', 'db1181.eqiad.wmnet', 'db1182.eqi... 
[21:01:13] (03PS1) 10Jbond: (WIP): netbase: firt pass at parsing service::cataloug ports [puppet] - 10https://gerrit.wikimedia.org/r/673105 [21:02:27] PROBLEM - Check systemd state on ml-serve2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:02:29] (03CR) 10Andrew Bogott: "@hashar and/or @dduvall, I'd like some help testing this. First let's make sure this doesn't break existing nodes (it should leave them as" [puppet] - 10https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) (owner: 10Andrew Bogott) [21:02:36] (03CR) 10jerkins-bot: [V: 04-1] (WIP): netbase: firt pass at parsing service::cataloug ports [puppet] - 10https://gerrit.wikimedia.org/r/673105 (owner: 10Jbond) [21:03:56] (03PS2) 10Jbond: (WIP): netbase: firt pass at parsing service::cataloug ports [puppet] - 10https://gerrit.wikimedia.org/r/673105 [21:05:30] (03CR) 10jerkins-bot: [V: 04-1] (WIP): netbase: firt pass at parsing service::cataloug ports [puppet] - 10https://gerrit.wikimedia.org/r/673105 (owner: 10Jbond) [21:05:47] (03PS25) 10Jbond: netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) [21:05:57] (03PS9) 10Jbond: P:base: add ability to manage services file [puppet] - 10https://gerrit.wikimedia.org/r/670918 [21:06:40] (03CR) 10jerkins-bot: [V: 04-1] netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) (owner: 10Jbond) [21:08:09] RECOVERY - DPKG on an-worker1080 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [21:09:27] RECOVERY - DPKG on an-worker1082 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [21:13:02] (03PS10) 10Jbond: P:base: add ability to manage services file [puppet] - 10https://gerrit.wikimedia.org/r/670918 
[21:18:14] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission frqueue1001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T277171 (10wiki_willy) a:03Cmjohnson [21:22:47] 10SRE, 10MediaWiki-General, 10serviceops, 10MW-1.35-notes (1.35.0-wmf.28; 2020-04-14), and 3 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10Legoktm) Was this ever finished? This has come up again on https://en.wi... [21:24:14] RECOVERY - Check systemd state on ml-serve1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:24:16] (03CR) 10Bstorm: wikireplicas: create actual paws database accounts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/672540 (https://phabricator.wikimedia.org/T276284) (owner: 10Bstorm) [21:24:19] (03PS3) 10Jbond: (WIP): netbase: firt pass at parsing service::cataloug ports [puppet] - 10https://gerrit.wikimedia.org/r/673105 [21:26:08] (03PS26) 10Jbond: netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) [21:26:19] (03PS11) 10Jbond: P:base: add ability to manage services file [puppet] - 10https://gerrit.wikimedia.org/r/670918 [21:26:28] (03PS4) 10Jbond: (WIP): netbase: firt pass at parsing service::cataloug ports [puppet] - 10https://gerrit.wikimedia.org/r/673105 [21:27:30] PROBLEM - Check systemd state on ml-serve1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:31:05] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: TBD) rack/setup/install db11[76-84] - https://phabricator.wikimedia.org/T273566 (10RobH) [21:35:12] (03PS3) 10Dzahn: site/conftool: remove mw2230 through mw2238, rack A3 [puppet] - 10https://gerrit.wikimedia.org/r/672834 (https://phabricator.wikimedia.org/T277119) [21:35:47] (03PS5) 10Jbond: (WIP): netbase: firt pass at parsing service::cataloug ports [puppet] - 10https://gerrit.wikimedia.org/r/673105 [21:35:54] (03CR) 10Dzahn: [C: 03+2] site/conftool: remove mw2230 through mw2238, rack A3 [puppet] - 10https://gerrit.wikimedia.org/r/672834 (https://phabricator.wikimedia.org/T277119) (owner: 10Dzahn) [21:37:01] (03CR) 10jerkins-bot: [V: 04-1] (WIP): netbase: firt pass at parsing service::cataloug ports [puppet] - 10https://gerrit.wikimedia.org/r/673105 (owner: 10Jbond) [21:40:03] (03PS4) 10Dzahn: site/conftool: remove mw2230 through mw2238, rack A3 [puppet] - 10https://gerrit.wikimedia.org/r/672834 (https://phabricator.wikimedia.org/T277119) [21:42:08] (03PS4) 10Bstorm: wikireplicas: create actual paws database accounts [puppet] - 10https://gerrit.wikimedia.org/r/672540 (https://phabricator.wikimedia.org/T276284) [21:45:40] (03CR) 10Bstorm: wikireplicas: create actual paws database accounts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/672540 (https://phabricator.wikimedia.org/T276284) (owner: 10Bstorm) [21:48:14] (03CR) 10Jbond: "there is curr" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/673105 (owner: 10Jbond) [21:49:12] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={atlas_exporter,netbox_device_statistics} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:51:14] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:52:28] (03PS1) 10Cwhite: logstash: remove type setting on dlq input [puppet] - 10https://gerrit.wikimedia.org/r/673131 [21:57:52] (03PS1) 10Dwisehaupt: Add ssl check for frdata2001 [puppet] - 10https://gerrit.wikimedia.org/r/673132 (https://phabricator.wikimedia.org/T260183) [21:59:39] 10SRE, 10serviceops: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (10jijiki) p:05Triage→03Medium [22:04:32] 10SRE, 10Traffic, 10Wikimedia-General-or-Unknown, 10User-DannyS712, 10affects-Kiwix-and-openZIM: Pages whose title ends with semicolon (;) are intermittently inaccessible (likely due to ATS) - https://phabricator.wikimedia.org/T238285 (10valerio.bozzolan) [22:08:09] (03PS1) 10RobH: mc103[78] install params [puppet] - 10https://gerrit.wikimedia.org/r/673137 (https://phabricator.wikimedia.org/T274925) [22:08:38] (03CR) 10RobH: [C: 03+2] mc103[78] install params [puppet] - 10https://gerrit.wikimedia.org/r/673137 (https://phabricator.wikimedia.org/T274925) (owner: 10RobH) [22:09:47] (03CR) 10CRusnov: "> Patch Set 2: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/662765 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [22:10:50] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['mc1037.eqiad.wmnet', 'mc1038.eqiad.w... [22:12:23] (03CR) 10CRusnov: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/670925 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [22:19:49] (03CR) 10CRusnov: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/670928 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [22:21:17] 10SRE, 10serviceops: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (10Krinkle) [22:23:13] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1037.eqiad.wmnet with reason: REIMAGE [22:23:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:25:09] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1038.eqiad.wmnet with reason: REIMAGE [22:25:13] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1037.eqiad.wmnet with reason: REIMAGE [22:25:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:25:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:27:23] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1038.eqiad.wmnet with reason: REIMAGE [22:27:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:28:46] (03CR) 10CRusnov: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/670933 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [22:29:55] (03CR) 10CRusnov: "Is there any obvious way of testing this? Who should I talk to about that? Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/655743 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [22:29:57] RECOVERY - DPKG on ml-serve1004 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [22:31:17] (03CR) 10CRusnov: "Looking at how to test this, is there a particular server to do that on? Thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/658396 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [22:31:27] RECOVERY - Check systemd state on ml-serve2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:34:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mc1037.eqiad.wmnet', 'mc1038.eqiad.wmnet'] ` and were **ALL** successful. [22:35:11] PROBLEM - Check systemd state on ml-serve2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:40:38] (03PS1) 10Cwhite: logstash: add late-stage host field type validation for ecs events [puppet] - 10https://gerrit.wikimedia.org/r/673142 [22:40:53] (03CR) 10CRusnov: "In particular I'm interested if there'll be any fall-out adding python3-redis to the ::redis::client::python module." [puppet] - 10https://gerrit.wikimedia.org/r/670937 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [22:44:37] RECOVERY - Check systemd state on ml-serve1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:46:05] (03PS2) 10Cwhite: logstash: remove type setting on dlq input [puppet] - 10https://gerrit.wikimedia.org/r/673131 [22:47:37] (03PS3) 10Cwhite: logstash: remove type setting on dlq input [puppet] - 10https://gerrit.wikimedia.org/r/673131 [22:48:03] RECOVERY - puppet last run on mwdebug1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [22:50:23] (03CR) 10CRusnov: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/670938 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [22:53:11] (03PS1) 10Cwhite: ensure host field is the correct type in late-stage ecs filter [software/ecs] - 10https://gerrit.wikimedia.org/r/673148 [22:54:04] (03CR) 10Bstorm: [C: 03+1] "There is a lot here, so I would be very surprised if there wasn't refinement needed in the near future after merge that we've missed in re" [cookbooks] - 10https://gerrit.wikimedia.org/r/667183 (https://phabricator.wikimedia.org/T274497) (owner: 10David Caro) [22:54:26] jouncebot: refresh [22:54:27] I refreshed my knowledge about deployments. [22:55:20] (03PS4) 10Cwhite: logstash: remove type setting on dlq input [puppet] - 10https://gerrit.wikimedia.org/r/673131 (https://phabricator.wikimedia.org/T277080) [22:57:39] PROBLEM - Check systemd state on ml-serve1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:00:04] RoanKattouw, Niharika, and Urbanecm: Dear deployers, time to do the Evening backport window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210317T2300). [23:00:04] ebernhardson: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[23:00:50] here, i'll ship it [23:00:59] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:01:02] (03CR) 10Ebernhardson: [C: 03+2] Add fallback profile including glent m1 [extensions/CirrusSearch] (wmf/1.36.0-wmf.35) - 10https://gerrit.wikimedia.org/r/672825 (https://phabricator.wikimedia.org/T262612) (owner: 10Ebernhardson) [23:01:11] (03CR) 10Bstorm: [C: 03+1] "As I was looking at this, I thought of one or two things we might want to do. However, I really think we'll know best what we'd want to do" [cookbooks] - 10https://gerrit.wikimedia.org/r/667214 (https://phabricator.wikimedia.org/T274497) (owner: 10David Caro) [23:01:28] ack ebernhardson [23:01:37] would you please ping me once you're done? [23:01:41] RECOVERY - Check systemd state on ml-serve2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:02:17] Urbanecm: sure [23:02:24] thx [23:02:41] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:03:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10RobH) [23:04:18] (03CR) 10Bstorm: "I'd love to get at least one more check of this before I merge it. 
I'll try merging tomorrow, and if I end up having to delete a few thous" [puppet] - 10https://gerrit.wikimedia.org/r/672540 (https://phabricator.wikimedia.org/T276284) (owner: 10Bstorm) [23:05:28] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10RobH) a:05RobH→03Jclark-ctr #serviceops please be aware mc1037 and mc1038 are ready for your team to place into service. The rest are no... [23:07:03] PROBLEM - Check systemd state on ml-serve2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:11:39] (03CR) 10CRusnov: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/670952 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [23:12:47] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [23:15:03] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [23:28:28] (03Merged) 10jenkins-bot: Add fallback profile including glent m1 [extensions/CirrusSearch] (wmf/1.36.0-wmf.35) - 10https://gerrit.wikimedia.org/r/672825 (https://phabricator.wikimedia.org/T262612) (owner: 10Ebernhardson) [23:28:58] (03PS1) 10Urbanecm: Define confirmed group in MediaWikiServices hook [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673159 
(https://phabricator.wikimedia.org/T275334) [23:30:09] !log ebernhardson@deploy1002 Synchronized php-1.36.0-wmf.35/extensions/CirrusSearch/profiles/FallbackProfiles.config.php: Add fallback profile including glent m1 (duration: 01m 42s) [23:30:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:29] (03CR) 10jerkins-bot: [V: 04-1] Define confirmed group in MediaWikiServices hook [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673159 (https://phabricator.wikimedia.org/T275334) (owner: 10Urbanecm) [23:30:37] Urbanecm: i'm all done [23:30:41] thanks [23:32:24] (03PS2) 10Urbanecm: Define confirmed group in MediaWikiServices hook [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673159 (https://phabricator.wikimedia.org/T275334) [23:33:26] (03CR) 10Urbanecm: [C: 03+2] Define confirmed group in MediaWikiServices hook [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673159 (https://phabricator.wikimedia.org/T275334) (owner: 10Urbanecm) [23:35:06] (03PS3) 10Urbanecm: Define confirmed group in MediaWikiServices hook [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673159 (https://phabricator.wikimedia.org/T275334) [23:35:37] (03CR) 10Urbanecm: [C: 03+2] Define confirmed group in MediaWikiServices hook [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673159 (https://phabricator.wikimedia.org/T275334) (owner: 10Urbanecm) [23:36:23] RECOVERY - Check systemd state on ml-serve2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:36:27] (03PS1) 10Urbanecm: idwiki: Deploy Growth features to newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673162 (https://phabricator.wikimedia.org/T259024) [23:36:34] (03CR) 10Urbanecm: [C: 03+2] idwiki: Deploy Growth features to newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673162 (https://phabricator.wikimedia.org/T259024) (owner: 10Urbanecm) [23:37:12] (03Merged) 10jenkins-bot: 
Define confirmed group in MediaWikiServices hook [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673159 (https://phabricator.wikimedia.org/T275334) (owner: 10Urbanecm) [23:37:37] (03Merged) 10jenkins-bot: idwiki: Deploy Growth features to newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673162 (https://phabricator.wikimedia.org/T259024) (owner: 10Urbanecm) [23:40:18] !log urbanecm@deploy1002 Synchronized wmf-config/CommonSettings.php: 5c14e7d2045f0905f7e85b249e821bbe8d69c600: Define confirmed group in MediaWikiServices hook (T275334, T277704, T275310, T275333) (duration: 01m 08s) [23:40:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:40:30] T275334: Changing user groups from $wgExtensionFunctions no longer works reliably - https://phabricator.wikimedia.org/T275334 [23:40:30] T277704: Account in "eventcoordinator" group cannot change rights via Special:UserRights - https://phabricator.wikimedia.org/T277704 [23:40:30] T275310: Sysop can not grant "confirmed user" flag on pt.wiki: Not shown in the user interface - https://phabricator.wikimedia.org/T275310 [23:40:30] T275333: Special:ListUsers/confirmed doesn't load the list of confirmed users for some users at ptwiki - https://phabricator.wikimedia.org/T275333 [23:40:59] four bugs fixed today: ::white_check_mark: [23:41:08] Nice work [23:41:18] bonus point: only one patch needed :D [23:42:50] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: c730dd5feb865a8325279cd4e76c133512f14251: idwiki: Deploy Growth features to newcomers (T259024) (duration: 01m 08s) [23:42:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:57] T259024: Deploy Growth experiments at Indonesian Wikipedia - https://phabricator.wikimedia.org/T259024 [23:43:27] PROBLEM - Check systemd state on ml-serve2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:43:46] legoktm: https://phabricator.wikimedia.org/T275334#6923618 sounds like a good plan to me [23:44:06] perfect, /me assigns T275334 to Urbanecm [23:44:08] ;-) [23:44:17] I'm not sure it's a good idea :D [23:45:39] the one thing I'm not sure of is whether $wgGroupInheritsPermissions should support inheriting from multiple groups [23:45:56] imagine some "arbcom" group that wants to be "checkuser" + "oversight" [23:46:58] I wonder if the various botadmin groups are actually fully sysop + bot [23:47:04] legoktm: arbcom group inheriting from CU and OS is not a good idea anyway. CU/OS-level permissions can be granted only by stewards, [23:47:24] btw, you would also need wgRenameGroup [23:47:47] https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/CommonSettings.php#L4198 [23:48:43] and https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/CommonSettings.php#L4061 is also in an extension function, although I'm not sure why [23:49:23] it's because it's checking if the user is logged in or not [23:50:00] it's checking whether a session is persistent AFAICS [23:51:17] same thing :p [23:52:57] I commented on https://phabricator.wikimedia.org/T112147#6923653
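(Editor's sketch, not from the channel.) The question raised at 23:45:39, whether a group could inherit permissions from more than one group, can be illustrated with a small Python model. This is only a sketch under stated assumptions: it is not MediaWiki's implementation, the function name `effective_permissions` and both dictionaries are hypothetical, and the right names are placeholders. It assumes each group maps to a list of parent groups and that a group's rights are the union of its own rights and everything it inherits, transitively.

```python
def effective_permissions(group, own_permissions, inherits):
    """Return the union of a group's own rights and all inherited rights.

    own_permissions: dict mapping group name -> set of rights (hypothetical data)
    inherits: dict mapping group name -> list of parent groups (hypothetical data)
    """
    seen = set()      # groups already visited, so inheritance cycles terminate
    rights = set()
    stack = [group]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        rights |= own_permissions.get(current, set())
        stack.extend(inherits.get(current, []))
    return rights


# Illustrative data only; not taken from any real wiki configuration.
OWN = {
    "checkuser": {"checkuser", "checkuser-log"},
    "oversight": {"suppressrevision"},
    "arbcom": set(),
}
INHERITS = {"arbcom": ["checkuser", "oversight"]}

print(sorted(effective_permissions("arbcom", OWN, INHERITS)))
```

The visited set is the main design point: once inheritance from several groups is allowed, cycles become possible, and the traversal has to tolerate them. The sketch says nothing about who may grant such a group, which is the concern raised at 23:47:04.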