[00:02:01] mutante sorry, not sure when it stopped
[00:02:20] (PS1) Gergő Tisza: Ensure variant A homepage sidebar is always at least 300px [extensions/GrowthExperiments] (wmf/1.36.0-wmf.11) - https://gerrit.wikimedia.org/r/630801 (https://phabricator.wikimedia.org/T263905)
[00:02:54] (CR) Gergő Tisza: [C: +2] Ensure variant A homepage sidebar is always at least 300px [extensions/GrowthExperiments] (wmf/1.36.0-wmf.11) - https://gerrit.wikimedia.org/r/630801 (https://phabricator.wikimedia.org/T263905) (owner: Gergő Tisza)
[00:11:34] Operations, Machine Learning Platform, ORES, serviceops, Wikimedia-production-error: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (Ladsgroup) Looking at the pattern of requests to ores in the past couple of days, it seems OKAPI has been bringing dow...
[00:12:18] Operations, Machine Learning Platform, ORES, Okapi, and 2 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (Reedy)
[00:13:17] (Merged) jenkins-bot: Ensure variant A homepage sidebar is always at least 300px [extensions/GrowthExperiments] (wmf/1.36.0-wmf.11) - https://gerrit.wikimedia.org/r/630801 (https://phabricator.wikimedia.org/T263905) (owner: Gergő Tisza)
[00:41:20] !log tgr@deploy1001 Synchronized php-1.36.0-wmf.11/extensions/GrowthExperiments/: Backport: [[gerrit:630801|Ensure variant A homepage sidebar is always at least 300px (T263905)]] (duration: 01m 01s)
[00:41:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:41:27] T263905: Mentor and help module are too narrow on desktop Special:Homepage in new Vector - https://phabricator.wikimedia.org/T263905
[01:10:47] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={swagger_check_citoid_cluster_codfw,swagger_check_restbase_esams} site={codfw,esams} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:12:31] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:15:57] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:17:39] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:22:41] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:24:23] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:27:39] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 103 probes of 569 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:28:17] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 69 probes of 569 (alerts on 65) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:33:19] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 47 probes of 569 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:33:57] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 48 probes of 569 (alerts on 65) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:37:59] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:41:07] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:01:55] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:03:33] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:06:45] PROBLEM - Wikitech-static main page has content on cloudweb2001-dev is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 17452 bytes in 0.237 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:06:57] PROBLEM - Wikitech-static main page has content on labweb1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 17452 bytes in 0.211 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:07:21] PROBLEM - Wikitech-static main page has content on labweb1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 17452 bytes in 0.192 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:10:07] PROBLEM - Ubuntu mirror in sync with upstream on sodium is CRITICAL: /srv/mirrors/ubuntu is over 14 hours old. https://wikitech.wikimedia.org/wiki/Mirrors
[04:19:59] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:21:43] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:31:49] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:36:51] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:45:11] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:46:51] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:02:11] (PS1) Marostegui: db2125: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/630984 (https://phabricator.wikimedia.org/T260670)
[05:11:21] (CR) Marostegui: [C: +2] db2125: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/630984 (https://phabricator.wikimedia.org/T260670) (owner: Marostegui)
[05:28:04] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:29:10] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:29:42] !log Reduce busy-time from 3600 to 1800 on labsdb1010
[05:29:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:36:31] (PS1) Marostegui: mariadb: Decommission es2019 [puppet] - https://gerrit.wikimedia.org/r/630986 (https://phabricator.wikimedia.org/T264063)
[05:36:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission
[05:36:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:38:34] (PS1) Marostegui: dns: Remove es2019 entries [dns] - https://gerrit.wikimedia.org/r/630987 (https://phabricator.wikimedia.org/T264063)
[05:40:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[05:40:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:40:37] (CR) Marostegui: [C: +2] mariadb: Decommission es2019 [puppet] - https://gerrit.wikimedia.org/r/630986 (https://phabricator.wikimedia.org/T264063) (owner: Marostegui)
[05:41:23] (CR) Marostegui: [C: +2] dns: Remove es2019 entries [dns] - https://gerrit.wikimedia.org/r/630987 (https://phabricator.wikimedia.org/T264063) (owner: Marostegui)
[05:43:27] !log Remove es2019 from tendril and zarcillo T264063
[05:43:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:43:33] T264063: decommission es2019.codfw.wmnet - https://phabricator.wikimedia.org/T264063
[05:43:56] Operations, ops-codfw, DC-Ops, decommission-hardware: decommission es2019.codfw.wmnet - https://phabricator.wikimedia.org/T264063 (Marostegui) Ready for #dc-ops!
[06:00:08] (PS2) Elukey: admin: add journactl perms to airflow-search-admins [puppet] - https://gerrit.wikimedia.org/r/630635
[06:01:35] (CR) Elukey: [C: +2] admin: add journactl perms to airflow-search-admins [puppet] - https://gerrit.wikimedia.org/r/630635 (owner: Elukey)
[06:07:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2085:3318', diff saved to https://phabricator.wikimedia.org/P12846 and previous config saved to /var/cache/conftool/dbconfig/20200930-060705-marostegui.json
[06:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:07:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2085:3318', diff saved to https://phabricator.wikimedia.org/P12847 and previous config saved to /var/cache/conftool/dbconfig/20200930-060754-marostegui.json
[06:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:10:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2082', diff saved to https://phabricator.wikimedia.org/P12848 and previous config saved to /var/cache/conftool/dbconfig/20200930-061005-marostegui.json
[06:10:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:10:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2082', diff saved to https://phabricator.wikimedia.org/P12849 and previous config saved to /var/cache/conftool/dbconfig/20200930-061036-marostegui.json
[06:10:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:50] (CR) Ayounsi: [C: +2] profile::prometheus::snmp_exporter: update snmp ro string [puppet] - https://gerrit.wikimedia.org/r/630867 (owner: Jbond)
[06:19:13] !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers
[06:19:14] Operations, Machine Learning Platform, ORES, Okapi, and 2 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (Joe) >>! In T263910#6503699, @Ladsgroup wrote: > Looking at the pattern of requests to ores in the past couple of days, it seems OKAPI has be...
[06:19:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:58] (CR) Ayounsi: [C: +2] Remove unused mock passwords (1 comment) [labs/private] - https://gerrit.wikimedia.org/r/630830 (owner: Ayounsi)
[06:21:22] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0)
[06:21:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:22:49] Operations, ops-eqiad, Analytics-Radar, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (elukey) Executed the cookbook also up to an-worker1110, all looks good!
[06:23:38] (CR) Ayounsi: [V: +2 C: +2] Remove unused mock passwords [labs/private] - https://gerrit.wikimedia.org/r/630830 (owner: Ayounsi)
[06:24:47] Operations, ops-eqiad, Analytics-Radar, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (elukey) Stalled→Open
[06:25:14] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:26:54] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:31:48] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:33:26] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:35:37] (PS1) Elukey: Update rack settings for new Analytics Hadoop nodes in hiera [puppet] - https://gerrit.wikimedia.org/r/630991 (https://phabricator.wikimedia.org/T255140)
[06:39:22] (PS2) Elukey: Update rack settings for new Analytics Hadoop nodes in hiera [puppet] - https://gerrit.wikimedia.org/r/630991 (https://phabricator.wikimedia.org/T255140)
[06:42:56] (CR) Elukey: "I have used https://netbox.wikimedia.org/ to get the rack placement, please check if I made mistakes 😊" [puppet] - https://gerrit.wikimedia.org/r/630991 (https://phabricator.wikimedia.org/T255140) (owner: Elukey)
[06:43:48] RECOVERY - Ubuntu mirror in sync with upstream on sodium is OK: /srv/mirrors/ubuntu is over 1 hours old. https://wikitech.wikimedia.org/wiki/Mirrors
[06:50:37] Status: Up | Server Admin Log: https://bit.ly/wikitech | This channel is logged: https://bit.ly/opsirclog | Ops Clinic Duty: herron
[06:52:47] Operations, fundraising-tech-ops, netops: Automate diff and commit of frack ACL - https://phabricator.wikimedia.org/T260655 (ayounsi) We can use Juniper permissions system to restrict which part of the config a user can change: https://www.juniper.net/documentation/en_US/junos/topics/topic-map/junos-...
[06:58:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es2016 T264156', diff saved to https://phabricator.wikimedia.org/P12850 and previous config saved to /var/cache/conftool/dbconfig/20200930-065838-marostegui.json
[06:58:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:44] T264156: decommission es2016.codfw.wmnet - https://phabricator.wikimedia.org/T264156
[06:59:34] (PS1) Marostegui: es2016: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/631132 (https://phabricator.wikimedia.org/T264156)
[07:00:05] (CR) Marostegui: [C: +2] es2016: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/631132 (https://phabricator.wikimedia.org/T264156) (owner: Marostegui)
[07:00:28] PROBLEM - Maps - OSM synchronization lag - eqiad on alert1001 is CRITICAL: 2.157e+07 ge 2.592e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=11&fullscreen&orgId=1
[07:00:33] !log elukey@deploy1001 Started deploy [analytics/superset/deploy@7bdc414]: Upgrade to 0.37.2
[07:00:36] XioNoX: your puppet changes are ok to merge?
[07:00:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:47] marostegui: oops, yep
[07:00:54] XioNoX: merging!
[07:00:57] thx
[07:01:21] !log elukey@deploy1001 Finished deploy [analytics/superset/deploy@7bdc414]: Upgrade to 0.37.2 (duration: 00m 49s)
[07:01:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:45] (PS1) Marostegui: backup2002.cnf.erb: Change es3 backup source [puppet] - https://gerrit.wikimedia.org/r/631134
[07:03:48] PROBLEM - Maps - OSM synchronization lag - eqiad on alert1001 is CRITICAL: 2.157e+07 ge 2.592e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=11&fullscreen&orgId=1
[07:03:56] (CR) Marostegui: [C: +2] backup2002.cnf.erb: Change es3 backup source [puppet] - https://gerrit.wikimedia.org/r/631134 (owner: Marostegui)
[07:05:07] !log Stop mysql on es2016 before decommissioning T264156
[07:05:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:14] T264156: decommission es2016.codfw.wmnet - https://phabricator.wikimedia.org/T264156
[07:05:52] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS1299/IPv4: Connect - Telia, AS1299/IPv6: Connect - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:06:07] (PS2) Marostegui: mariadb: Promote db1131 to s6 master [puppet] - https://gerrit.wikimedia.org/r/630773 (https://phabricator.wikimedia.org/T263227)
[07:06:15] (PS3) Marostegui: wmnet: Update s6-master alias [dns] - https://gerrit.wikimedia.org/r/630774 (https://phabricator.wikimedia.org/T263227)
[07:07:22] PROBLEM - MariaDB Replica SQL: analytics_meta on db1108 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1061, Errmsg: Error Duplicate key name ix_alerts_active on query. Default database: superset_production. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[07:08:38] elukey: ^
[07:08:55] elukey: if you need help, pm me! :)
[07:09:48] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:11:28] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:14:00] (CR) Jcrespo: "I've updated grants for dump user on that host." [puppet] - https://gerrit.wikimedia.org/r/631134 (owner: Marostegui)
[07:15:27] marostegui: ah just upgraded superset's db, checking
[07:18:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db1131 with weight 0 T263227', diff saved to https://phabricator.wikimedia.org/P12851 and previous config saved to /var/cache/conftool/dbconfig/20200930-071841-marostegui.json
[07:18:46] PROBLEM - Maps - OSM synchronization lag - eqiad on alert1001 is CRITICAL: 2.157e+07 ge 2.592e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=11&fullscreen&orgId=1
[07:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:48] T263227: Failover s6 master, db1093 to db1131 - https://phabricator.wikimedia.org/T263227
[07:23:47] (PS3) Rosalie Perside (WMDE): Remove unused $wgExtraLanguageNames['qqq'] assignment [mediawiki-config] - https://gerrit.wikimedia.org/r/628773 (https://phabricator.wikimedia.org/T263441) (owner: Lucas Werkmeister (WMDE))
[07:23:49] (PS3) Rosalie Perside (WMDE): Stop using $wmgExtraLanguageNames in CommonSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/628774 (https://phabricator.wikimedia.org/T263441) (owner: Lucas Werkmeister (WMDE))
[07:23:51] (PS3) Rosalie Perside (WMDE): Remove $wmgExtraLanguageNames from InitialiseSettings-labs.php [mediawiki-config] - https://gerrit.wikimedia.org/r/628775 (https://phabricator.wikimedia.org/T263441) (owner: Lucas Werkmeister (WMDE))
[07:25:38] (CR) Rosalie Perside (WMDE): [C: +1] Remove unused $wgExtraLanguageNames['qqq'] assignment [mediawiki-config] - https://gerrit.wikimedia.org/r/628773 (https://phabricator.wikimedia.org/T263441) (owner: Lucas Werkmeister (WMDE))
[07:25:51] (CR) Rosalie Perside (WMDE): [C: +1] Stop using $wmgExtraLanguageNames in CommonSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/628774 (https://phabricator.wikimedia.org/T263441) (owner: Lucas Werkmeister (WMDE))
[07:26:02] (CR) Rosalie Perside (WMDE): [C: +1] Remove $wmgExtraLanguageNames from InitialiseSettings-labs.php [mediawiki-config] - https://gerrit.wikimedia.org/r/628775 (https://phabricator.wikimedia.org/T263441) (owner: Lucas Werkmeister (WMDE))
[07:26:16] (CR) Rosalie Perside (WMDE): [C: +1] Remove $wmgExtraLanguageNames from InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/628776 (https://phabricator.wikimedia.org/T263441) (owner: Lucas Werkmeister (WMDE))
[07:27:06] RECOVERY - MariaDB Replica SQL: analytics_meta on db1108 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[07:27:14] Operations: Integrate Buster 10.6 point update - https://phabricator.wikimedia.org/T263974 (MoritzMuehlenhoff)
[07:29:44] (CR) Muehlenhoff: [C: +2] Enabled managed sources.list for esams/eqsin [puppet] - https://gerrit.wikimedia.org/r/630879 (https://phabricator.wikimedia.org/T158562) (owner: Muehlenhoff)
[07:31:54] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:34:22] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:36:22] (CR) Marostegui: mariadb: Promote db1131 to s6 master [puppet] - https://gerrit.wikimedia.org/r/630773 (https://phabricator.wikimedia.org/T263227) (owner: Marostegui)
[07:36:27] (CR) Marostegui: [C: +2] mariadb: Promote db1131 to s6 master [puppet] - https://gerrit.wikimedia.org/r/630773 (https://phabricator.wikimedia.org/T263227) (owner: Marostegui)
[07:39:56] Operations, Machine Learning Platform, ORES, Okapi, and 2 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (akosiaris) >>! In T263910#6503978, @Joe wrote: >>>! In T263910#6503699, @Ladsgroup wrote: >> Looking at the pattern of requests to ores in th...
[07:41:00] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:41:08] Operations, SRE-swift-storage, Patch-For-Review, User-fgiunchedi: Put ms-be2057 (Dell R740xd2) in service - https://phabricator.wikimedia.org/T261633 (fgiunchedi) Status update: ms-be2057 is at ~40% used. The rest of the cluster is behaving as expected (i.e. freeing up space) except for ms-be2017...
[07:41:37] !log Starting s6 eqiad failover from db1093 to db1131 - T263227
[07:41:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:42] T263227: Failover s6 master, db1093 to db1131 - https://phabricator.wikimedia.org/T263227
[07:42:22] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:44:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db1131 on s6 eqiad master T263227, also give weight to db1093 as new API host', diff saved to https://phabricator.wikimedia.org/P12852 and previous config saved to /var/cache/conftool/dbconfig/20200930-074417-marostegui.json
[07:44:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:17] (CR) Marostegui: wmnet: Update s6-master alias [dns] - https://gerrit.wikimedia.org/r/630774 (https://phabricator.wikimedia.org/T263227) (owner: Marostegui)
[07:47:22] (CR) Marostegui: [C: +2] wmnet: Update s6-master alias [dns] - https://gerrit.wikimedia.org/r/630774 (https://phabricator.wikimedia.org/T263227) (owner: Marostegui)
[07:51:41] Operations: FY2020-2021 Q1 eqiad -> codfw switchover - https://phabricator.wikimedia.org/T243316 (Marostegui)
[07:54:05] (PS1) Muehlenhoff: Enable managed sources.list for codfw [puppet] - https://gerrit.wikimedia.org/r/631139 (https://phabricator.wikimedia.org/T158562)
[07:55:59] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'test' .
[07:55:59] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' .
[07:56:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:39] !log upgrade termbox to latest chart, fixing various prometheus-statsd-export configuration minor issues.
[07:56:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:09] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'termbox' for release 'production' .
[08:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:10] (PS1) Marostegui: instances.yaml: Remove es2016 from dbctl [puppet] - https://gerrit.wikimedia.org/r/631141 (https://phabricator.wikimedia.org/T264156)
[08:06:22] !log akosiaris@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'termbox' for release 'production' .
[08:06:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:01] (CR) Marostegui: [C: +2] instances.yaml: Remove es2016 from dbctl [puppet] - https://gerrit.wikimedia.org/r/631141 (https://phabricator.wikimedia.org/T264156) (owner: Marostegui)
[08:08:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove es2016 from dbctl T264156', diff saved to https://phabricator.wikimedia.org/P12853 and previous config saved to /var/cache/conftool/dbconfig/20200930-080817-marostegui.json
[08:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:23] T264156: decommission es2016.codfw.wmnet - https://phabricator.wikimedia.org/T264156
[08:10:23] Operations, DBA: db1080-95 batch possibly suffering BBU issues - https://phabricator.wikimedia.org/T258386 (Marostegui)
[08:10:55] (PS1) Filippo Giunchedi: alertmanager: add column after alert status on IRC [puppet] - https://gerrit.wikimedia.org/r/631142 (https://phabricator.wikimedia.org/T258948)
[08:11:18] (CR) jerkins-bot: [V: -1] alertmanager: add column after alert status on IRC [puppet] - https://gerrit.wikimedia.org/r/631142 (https://phabricator.wikimedia.org/T258948) (owner: Filippo Giunchedi)
[08:12:07] (PS2) Filippo Giunchedi: alertmanager: add column after alert status on IRC [puppet] - https://gerrit.wikimedia.org/r/631142 (https://phabricator.wikimedia.org/T258948)
[08:14:35] Operations, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-worker10[18-40] - https://phabricator.wikimedia.org/T260445 (elukey) >>! In T260445#6431168, @Cmjohnson wrote: > @elukey Are you trying to re-use hostnames? We should be using an-worker1118+ Sorry didn't see...
[08:17:01] Operations: FY2020-2021 Q1 codfw -> eqiad switchback - https://phabricator.wikimedia.org/T243318 (Marostegui)
[08:20:50] (CR) Filippo Giunchedi: [C: +2] alertmanager: add column after alert status on IRC [puppet] - https://gerrit.wikimedia.org/r/631142 (https://phabricator.wikimedia.org/T258948) (owner: Filippo Giunchedi)
[08:21:23] (CR) Filippo Giunchedi: [V: +2 C: +2] am: categorise netbox links [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/630900 (owner: Filippo Giunchedi)
[08:23:22] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/630661 (owner: Dzahn)
[08:24:20] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/630691 (owner: Dzahn)
[08:25:04] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/629443 (owner: Dzahn)
[08:30:51] Operations, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-worker10[18-40] - https://phabricator.wikimedia.org/T260445 (elukey) @Cmjohnson I checked the items listed in the package slip but I don't see the quantity, only the fact that a 480G disk is listed (T258727#64...
[08:32:25] Operations, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-worker10[18-41] - https://phabricator.wikimedia.org/T260445 (elukey)
[08:32:34] Operations, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-worker10[18-41] - https://phabricator.wikimedia.org/T260445 (elukey)
[08:34:40] (CR) Arturo Borrero Gonzalez: [C: +2] interfaces: drop aggregate (bonding) dead code [puppet] - https://gerrit.wikimedia.org/r/630779 (owner: Arturo Borrero Gonzalez)
[08:40:29] (PS15) Muehlenhoff: reboot-groups [cookbooks] - https://gerrit.wikimedia.org/r/625597
[08:43:48] (CR) Muehlenhoff: reboot-groups (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/625597 (owner: Muehlenhoff)
[08:45:21] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime
[08:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:28] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:45:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:42] !log deploying schema change to s7/eqiad T259831
[08:45:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:46] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831
[08:50:24] Operations, ops-eqiad, DC-Ops, Wikidata, and 2 others: Check for errors on wdqs1009 disks - https://phabricator.wikimedia.org/T263125 (Gehel) a:Gehel→wiki_willy `smart-data-dump` shows a few errors for sdb: ` # HELP device_smart_program_fail_cnt_total SMART attribute program_fail_cnt_tot...
[08:53:55] (CR) Jbond: "> Patch Set 2: Code-Review-1" (5 comments) [puppet] - https://gerrit.wikimedia.org/r/630589 (owner: Jbond)
[08:55:19] (CR) Jbond: [C: +1] Enable managed sources.list for codfw [puppet] - https://gerrit.wikimedia.org/r/631139 (https://phabricator.wikimedia.org/T158562) (owner: Muehlenhoff)
[08:57:02] Operations, ops-eqiad, DC-Ops, Wikidata, and 2 others: Check for errors on wdqs1009 disks - https://phabricator.wikimedia.org/T263125 (ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on cumin1001.eqiad.wmnet for hosts: ` wdqs1009.eqiad.wmnet ` The log can be found in `/var/log/w...
[08:57:54] Operations, LDAP-Access-Requests: Access to Superset for Jack Rabah - https://phabricator.wikimedia.org/T263868 (MoritzMuehlenhoff) Resolved→Open Reopening, this needs an update in data.yaml as well (in the ldap users table).
[08:58:02] Operations, Traffic, Patch-For-Review: Discarded VCL files stuck in auto/busy state cause high number of backend probe requests - https://phabricator.wikimedia.org/T236754 (ema) Open→Resolved a:ema Confirmed on a busy text@esams node (cp3050): issuing various VCL reloads, all old VCLs ge...
[09:04:55] (03PS3) 10Jbond: labstore::nfs_mount: drop support for empty string share_path [puppet] - 10https://gerrit.wikimedia.org/r/630589 [09:05:01] (03CR) 10jerkins-bot: [V: 04-1] labstore::nfs_mount: drop support for empty string share_path [puppet] - 10https://gerrit.wikimedia.org/r/630589 (owner: 10Jbond) [09:08:55] (03CR) 10Kormat: [C: 03+2] Split static methods out of WMFMariaDB.py into dbutil.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/630187 (owner: 10Kormat) [09:09:54] (03Merged) 10jenkins-bot: Split static methods out of WMFMariaDB.py into dbutil.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/630187 (owner: 10Kormat) [09:10:43] (03PS1) 10JMeybohm: services_proxy: Add nodejs keepalive timeout (4.5s) to citoid and zotero [puppet] - 10https://gerrit.wikimedia.org/r/631147 (https://phabricator.wikimedia.org/T255869) [09:10:51] !log gehel@cumin1001 START - Cookbook sre.hosts.downtime [09:10:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:17] 10Operations, 10Analytics-Radar, 10Patch-For-Review: Move Hue to a Buster VM - https://phabricator.wikimedia.org/T258768 (10MoritzMuehlenhoff) >>! In T258768#6497946, @elukey wrote: > Currently two outstanding UI issues: > In theory those are not blocking the migration of Hue, so this task could probably be... [09:12:52] !log gehel@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:12:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:58] (03PS3) 10Kormat: Update WMFMariaDB to dbutils [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/630193 [09:13:47] (03CR) 10Kormat: "Now that the wmfmariadbpy CR is merged, i've updated this repo to point at the relevant commit. 
Once 0.6 is released, that can be replaced" [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/630193 (owner: 10Kormat) [09:15:32] (03CR) 10Jcrespo: [C: 03+2] Update WMFMariaDB to dbutils [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/630193 (owner: 10Kormat) [09:16:22] (03CR) 10Jcrespo: "Now the question is if you want to do a new deploy of both wmfmariadbpy and wmfbackups just 2 days before me going on vacation for 2 weeks" [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/630193 (owner: 10Kormat) [09:17:55] (03CR) 10Muehlenhoff: [C: 03+2] Enable managed sources.list for codfw [puppet] - 10https://gerrit.wikimedia.org/r/631139 (https://phabricator.wikimedia.org/T158562) (owner: 10Muehlenhoff) [09:18:13] (03CR) 10Kormat: "> Patch Set 3:" [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/630193 (owner: 10Kormat) [09:20:07] (03CR) 10Jcrespo: "> Patch Set 3:" [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/630193 (owner: 10Kormat) [09:20:41] 10Operations, 10ops-eqiad, 10DC-Ops, 10Wikidata, and 2 others: Check for errors on wdqs1009 disks - https://phabricator.wikimedia.org/T263125 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wdqs1009.eqiad.wmnet'] ` and were **ALL** successful. 
[09:21:27] RECOVERY - Check systemd state on wdqs1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:24:44] (03PS1) 10Kormat: WMFMariaDB: remove execute_many() [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/631149 [09:24:46] (03CR) 10Jbond: "See inline the difference between alias and query is still a bit confusing" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/625597 (owner: 10Muehlenhoff) [09:26:01] 10Operations, 10CX-cxserver, 10Product-Infrastructure-Team-Backlog, 10Wikifeeds, and 3 others: service-runner apps running on kubernetes emit logs with log level 50 - https://phabricator.wikimedia.org/T239459 (10Mvolz) [09:26:25] (03PS4) 10Jbond: labstore::nfs_mount: drop support for empty string share_path [puppet] - 10https://gerrit.wikimedia.org/r/630589 [09:26:37] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:27:18] 10Operations, 10CX-cxserver, 10Product-Infrastructure-Team-Backlog, 10Wikifeeds, and 3 others: service-runner apps running on kubernetes emit logs with log level 50 - https://phabricator.wikimedia.org/T239459 (10Mvolz) The hold-up seems to be eventstreams; it actually uses a fork of service runner, and the... [09:27:47] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:33:16] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1001/25534/restbase1025.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/631147 (https://phabricator.wikimedia.org/T255869) (owner: 10JMeybohm) [09:35:11] 10Operations, 10DBA: Migrate MySQLs to use ROW-based replication - https://phabricator.wikimedia.org/T109179 (10Marostegui) @jcrespo I would like to close this - I don't think this is doable on long-term even, I would even say this is very long-long-long-long term for sX sections. There are many limitations he... [09:38:30] 10Operations, 10DBA: Migrate MySQLs to use ROW-based replication - https://phabricator.wikimedia.org/T109179 (10jcrespo) 05Open→03Declined > This ticket is to decide if this change is worth it, how to do it, where (maybe not all servers require it), when and what blockers there are. In a way this is "done... [09:38:46] 10Operations, 10DBA: Migrate MySQLs to use ROW-based replication - https://phabricator.wikimedia.org/T109179 (10jcrespo) 05Declined→03Resolved [09:39:21] (03CR) 10JMeybohm: [C: 03+2] "Thanks joe!" [puppet] - 10https://gerrit.wikimedia.org/r/631147 (https://phabricator.wikimedia.org/T255869) (owner: 10JMeybohm) [09:40:13] 10Operations, 10Machine Learning Platform, 10ORES, 10Okapi, and 2 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10Ladsgroup) This sounds good to me, my only worry is that it would be too much work for something that's going to be replaced in the soon-ish... 
[09:41:57] (03PS8) 10Jbond: swift: convert roles to profiles, fix various lint issues [puppet] - 10https://gerrit.wikimedia.org/r/630568 (https://phabricator.wikimedia.org/T247956) [09:45:03] (03PS1) 10Hnowlan: envoy-future: use envoy 1.15.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/631151 [09:50:43] !log imported envoyproxy 1.15.1 to buster-wikimedia component/envoy-future - T264157 [09:50:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:34] (03CR) 10JMeybohm: [C: 03+1] envoy-future: use envoy 1.15.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/631151 (owner: 10Hnowlan) [09:54:36] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 62, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:56:50] (03CR) 10Hnowlan: [V: 03+2 C: 03+2] envoy-future: use envoy 1.15.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/631151 (owner: 10Hnowlan) [09:59:52] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'citoid' for release 'staging' . 
[09:59:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:24] (03PS1) 10Hnowlan: api-gateway: use envoy 1.15.1 image [deployment-charts] - 10https://gerrit.wikimedia.org/r/631154 (https://phabricator.wikimedia.org/T264157) [10:02:26] (03PS1) 10Elukey: Allow deployment of AMD ROCm drivers on Stretch [puppet] - 10https://gerrit.wikimedia.org/r/631153 (https://phabricator.wikimedia.org/T255138) [10:07:02] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [10:07:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:10] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:07:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:17] !log deploying schema change to s4/eqiad T259831 [10:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:22] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [10:08:40] (03PS1) 10Jbond: profile::swift::proxy_tls: Swap hiera for lookup function [puppet] - 10https://gerrit.wikimedia.org/r/631155 [10:08:42] (03PS1) 10Jbond: swift::swiftrepl: convert role to profile [puppet] - 10https://gerrit.wikimedia.org/r/631156 [10:08:44] (03PS1) 10Jbond: swift::storage: migrate swift::storage from a role to profile [puppet] - 10https://gerrit.wikimedia.org/r/631157 [10:08:46] (03PS1) 10Jbond: profile::swift::stats_reporter: bass statesd host and port via hiera [puppet] - 10https://gerrit.wikimedia.org/r/631158 [10:08:48] (03PS1) 10Jbond: profile::swift::proxy: pass swift paramters via profile namespace [puppet] - 10https://gerrit.wikimedia.org/r/631159 [10:08:50] (03PS1) 10Jbond: (WIP): remove old files: Theses likkly need to be added to earlier CR's [puppet] - 10https://gerrit.wikimedia.org/r/631160 [10:09:53] (03CR) 10jerkins-bot: [V: 04-1] swift::swiftrepl: convert role to profile [puppet] - 10https://gerrit.wikimedia.org/r/631156 
(owner: 10Jbond) [10:10:07] (03CR) 10Jbond: "Ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/631155 (owner: 10Jbond) [10:10:39] (03CR) 10jerkins-bot: [V: 04-1] profile::swift::stats_reporter: bass statesd host and port via hiera [puppet] - 10https://gerrit.wikimedia.org/r/631158 (owner: 10Jbond) [10:11:16] (03PS1) 10Kormat: mysql.py: Update to dbutil [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/631161 [10:12:42] (03CR) 10Kormat: [C: 03+2] mysql.py: Update to dbutil [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/631161 (owner: 10Kormat) [10:12:53] (03PS2) 10Jbond: swift::swiftrepl: convert role to profile [puppet] - 10https://gerrit.wikimedia.org/r/631156 [10:14:07] (03Merged) 10jenkins-bot: mysql.py: Update to dbutil [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/631161 (owner: 10Kormat) [10:14:36] (03PS2) 10Kormat: WMFMariaDB: remove execute_many() [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/631149 [10:16:48] (03CR) 10Kormat: [C: 03+2] WMFMariaDB: remove execute_many() [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/631149 (owner: 10Kormat) [10:17:34] (03PS3) 10Jbond: swift::swiftrepl: convert role to profile [puppet] - 10https://gerrit.wikimedia.org/r/631156 [10:18:20] (03Merged) 10jenkins-bot: WMFMariaDB: remove execute_many() [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/631149 (owner: 10Kormat) [10:19:25] (03CR) 10Jbond: "PCC: ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/631156 (owner: 10Jbond) [10:19:37] (03PS2) 10Jbond: swift::storage: migrate swift::storage from a role to profile [puppet] - 10https://gerrit.wikimedia.org/r/631157 [10:21:29] !log jayme@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'citoid' for release 'production' . 
[10:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:20] (03CR) 10Klausman: [C: 03+2] Update rack settings for new Analytics Hadoop nodes in hiera [puppet] - 10https://gerrit.wikimedia.org/r/630991 (https://phabricator.wikimedia.org/T255140) (owner: 10Elukey) [10:24:07] !log jayme@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'citoid' for release 'production' . [10:24:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:18] (03CR) 10Klausman: [C: 03+2] Allow deployment of AMD ROCm drivers on Stretch [puppet] - 10https://gerrit.wikimedia.org/r/631153 (https://phabricator.wikimedia.org/T255138) (owner: 10Elukey) [10:24:40] 10Operations, 10User-Kormat: Submit reuse-parts to partman upstream - https://phabricator.wikimedia.org/T264169 (10Kormat) [10:24:48] 10Operations, 10User-Kormat: Submit reuse-parts to partman upstream - https://phabricator.wikimedia.org/T264169 (10Kormat) p:05Triage→03Low [10:25:01] ^ moritzm: so i don't keep forgetting about it [10:25:40] 10Operations, 10ops-codfw, 10DBA, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Marostegui) Just for the record, those CPU reset/error have been happening since the first crash. 
[10:25:43] (03CR) 10Hnowlan: [C: 03+2] api-gateway: use envoy 1.15.1 image [deployment-charts] - 10https://gerrit.wikimedia.org/r/631154 (https://phabricator.wikimedia.org/T264157) (owner: 10Hnowlan) [10:26:16] (03PS3) 10Jbond: swift::storage: migrate swift::storage from a role to profile [puppet] - 10https://gerrit.wikimedia.org/r/631157 [10:28:09] (03PS4) 10Jbond: swift::storage: migrate swift::storage from a role to profile [puppet] - 10https://gerrit.wikimedia.org/r/631157 [10:28:22] (03CR) 10Giuseppe Lavagetto: "Apart from my comment on the sizing of the memcached installation, LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/630845 (https://phabricator.wikimedia.org/T244340) (owner: 10Effie Mouzeli) [10:28:24] (03Merged) 10jenkins-bot: api-gateway: use envoy 1.15.1 image [deployment-charts] - 10https://gerrit.wikimedia.org/r/631154 (https://phabricator.wikimedia.org/T264157) (owner: 10Hnowlan) [10:31:04] (03CR) 10Effie Mouzeli: mcrouter: install ohhost memcached on MediaWiki servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/630845 (https://phabricator.wikimedia.org/T244340) (owner: 10Effie Mouzeli) [10:31:43] (03PS1) 10Filippo Giunchedi: am: send resolved alerts [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/631162 (https://phabricator.wikimedia.org/T258948) [10:31:45] (03PS4) 10Effie Mouzeli: mcrouter: install ohhost memcached on MediaWiki servers [puppet] - 10https://gerrit.wikimedia.org/r/630845 (https://phabricator.wikimedia.org/T244340) [10:34:02] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [10:34:02] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . 
[10:34:03] (03PS5) 10Jbond: swift::storage: migrate swift::storage from a role to profile [puppet] - 10https://gerrit.wikimedia.org/r/631157 [10:34:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:29] (03PS6) 10Jbond: swift::storage: migrate swift::storage from a role to profile [puppet] - 10https://gerrit.wikimedia.org/r/631157 [10:36:51] (03CR) 10Jbond: "Ready to review" [puppet] - 10https://gerrit.wikimedia.org/r/631157 (owner: 10Jbond) [10:37:31] (03CR) 10Jbond: "Ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/631157 (owner: 10Jbond) [10:37:42] (03PS2) 10Jbond: profile::swift::stats_reporter: bass statesd host and port via hiera [puppet] - 10https://gerrit.wikimedia.org/r/631158 [10:38:51] (03CR) 10jerkins-bot: [V: 04-1] profile::swift::stats_reporter: bass statesd host and port via hiera [puppet] - 10https://gerrit.wikimedia.org/r/631158 (owner: 10Jbond) [10:40:53] (03PS16) 10Muehlenhoff: reboot-groups [cookbooks] - 10https://gerrit.wikimedia.org/r/625597 [10:43:57] (03PS3) 10Jbond: profile::swift::stats_reporter: pass statsd host and port via hiera [puppet] - 10https://gerrit.wikimedia.org/r/631158 [10:44:04] (03CR) 10Muehlenhoff: reboot-groups (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/625597 (owner: 10Muehlenhoff) [10:44:54] !log hnowlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [10:44:55] !log hnowlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' . 
[10:44:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:05] (03CR) 10Jbond: "Ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/631158 (owner: 10Jbond) [10:47:01] !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [10:47:01] !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [10:47:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:51] 10Operations, 10ops-codfw, 10DBA, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Marostegui) There is one thing I have seen, which is that the temperature of this host, according to grafana is a lot higher than a host on the same section (db2126... 
[10:52:30] (03PS2) 10Jbond: profile::swift::proxy_tls: Swap hiera for lookup function [puppet] - 10https://gerrit.wikimedia.org/r/631155 [10:52:32] (03PS4) 10Jbond: swift::swiftrepl: convert role to profile [puppet] - 10https://gerrit.wikimedia.org/r/631156 [10:52:34] (03PS7) 10Jbond: swift::storage: migrate swift::storage from a role to profile [puppet] - 10https://gerrit.wikimedia.org/r/631157 [10:52:36] (03PS4) 10Jbond: profile::swift::stats_reporter: pass statsd host and port via hiera [puppet] - 10https://gerrit.wikimedia.org/r/631158 [10:52:38] (03PS2) 10Jbond: profile::swift::proxy: pass swift paramters via profile namespace [puppet] - 10https://gerrit.wikimedia.org/r/631159 [10:54:07] (03CR) 10Muehlenhoff: Allow deployment of AMD ROCm drivers on Stretch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/631153 (https://phabricator.wikimedia.org/T255138) (owner: 10Elukey) [10:55:52] (03PS3) 10Jbond: profile::swift::proxy: pass swift paramters via profile namespace [puppet] - 10https://gerrit.wikimedia.org/r/631159 [10:57:59] !log installing librsvg security updates [10:58:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy European mid-day backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200930T1100). [11:00:04] Nikerabbit: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[11:00:06] (03PS1) 10Arturo Borrero Gonzalez: openstack: neutron: use iptables-legacy [puppet] - 10https://gerrit.wikimedia.org/r/631167 (https://phabricator.wikimedia.org/T262979) [11:01:32] (03PS4) 10Jbond: profile::swift::proxy: pass swift parameters via profile namespace [puppet] - 10https://gerrit.wikimedia.org/r/631159 [11:03:20] ooo/ [11:03:43] Any deployers around? [11:03:57] (03CR) 10Jbond: "Ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/631159 (owner: 10Jbond) [11:04:45] 10Operations, 10Discovery-Search, 10User-MoritzMuehlenhoff: Also use java::security on elasticsearch/relforge - https://phabricator.wikimedia.org/T251540 (10MoritzMuehlenhoff) >>! In T251540#6172552, @MoritzMuehlenhoff wrote: > This is now a parameter of the java profile, so once Elastic* are migrated to tha... [11:05:15] If not, I'll deploy myself [11:05:31] (03PS2) 10Nikerabbit: Enable Special:TranslationStats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627744 (https://phabricator.wikimedia.org/T263004) [11:06:08] !log disable puppet on P:mediawiki::mcrouter_wancache for 630845 - T244340 [11:06:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:14] T244340: Reduce read pressure on mc* servers by adding a machine-local Memcached instance (on-host memcached) - https://phabricator.wikimedia.org/T244340 [11:06:32] 10Operations, 10Machine Learning Platform, 10ORES, 10Okapi, and 2 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10awight) #okapi team, please consider following the https://stream.wikimedia.org/?doc#/Streams/get_v2_stream_revision_score event stream rathe... 
[11:06:38] (03CR) 10Nikerabbit: [C: 03+2] Enable Special:TranslationStats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627744 (https://phabricator.wikimedia.org/T263004) (owner: 10Nikerabbit) [11:07:18] (03Merged) 10jenkins-bot: Enable Special:TranslationStats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627744 (https://phabricator.wikimedia.org/T263004) (owner: 10Nikerabbit) [11:10:58] 10Operations: Migrate remaining services using Java to profile::java - https://phabricator.wikimedia.org/T264174 (10MoritzMuehlenhoff) [11:11:13] (03PS1) 10Jbond: swift: move swift parameters to the profile namespace [puppet] - 10https://gerrit.wikimedia.org/r/631169 [11:11:20] (03PS2) 10Arturo Borrero Gonzalez: openstack: neutron: use iptables-legacy [puppet] - 10https://gerrit.wikimedia.org/r/631167 (https://phabricator.wikimedia.org/T262979) [11:11:40] 10Operations, 10Cassandra: Move cassandra puppet code (used by Restbase, Sessionstore, AQS) to profile::java - https://phabricator.wikimedia.org/T261966 (10MoritzMuehlenhoff) [11:12:03] 10Operations: Migrate remaining services using Java to profile::java - https://phabricator.wikimedia.org/T264174 (10MoritzMuehlenhoff) [11:12:05] 10Operations, 10Cassandra: Move cassandra puppet code (used by Restbase, Sessionstore, AQS) to profile::java - https://phabricator.wikimedia.org/T261966 (10MoritzMuehlenhoff) [11:12:46] 10Operations, 10Cassandra: Move cassandra puppet code (used by Restbase, Sessionstore, AQS, maps) to profile::java - https://phabricator.wikimedia.org/T261966 (10MoritzMuehlenhoff) [11:13:43] hmm, is it okay to deploy from deployment.eqiad.wmnet, or should I use deployment.codfw.wmnet now? 
[11:13:44] (03CR) 10Jbond: "Ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/631169 (owner: 10Jbond) [11:14:24] (03CR) 10Effie Mouzeli: [C: 03+2] mcrouter: install ohhost memcached on MediaWiki servers [puppet] - 10https://gerrit.wikimedia.org/r/630845 (https://phabricator.wikimedia.org/T244340) (owner: 10Effie Mouzeli) [11:14:50] Nikerabbit: use deploy1001.eqiad.wmnet [11:15:04] deploy2001 is not the active deployment host [11:16:16] Urbanecm: thanks [11:16:20] no problem [11:16:29] Nikerabbit: btw, I'm around, if you want me to deploy [11:16:35] Urbanecm: in fact I think both redirect to the correct server [11:16:42] Nikerabbit: indeed :) [11:16:54] Urbanecm: I'm almost done, just testing on a debug server [11:16:58] ack :) [11:17:13] (03PS5) 10Jbond: profile::swift::stats_reporter: pass statsd host and port via hiera [puppet] - 10https://gerrit.wikimedia.org/r/631158 [11:17:30] PROBLEM - puppet last run on mwdebug1001 is CRITICAL: CRITICAL: Puppet last ran 2 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:18:31] effie: ^^ [11:18:36] (03PS5) 10Jbond: profile::swift::proxy: pass swift parameters via profile namespace [puppet] - 10https://gerrit.wikimedia.org/r/631159 [11:20:11] (03PS3) 10Arturo Borrero Gonzalez: openstack: neutron: use iptables-legacy [puppet] - 10https://gerrit.wikimedia.org/r/631167 (https://phabricator.wikimedia.org/T262979) [11:20:27] (03PS6) 10Jbond: profile::swift::proxy: pass swift parameters via profile namespace [puppet] - 10https://gerrit.wikimedia.org/r/631159 [11:21:34] (03Abandoned) 10Jbond: (WIP): remove old files: Theses likkly need to be added to earlier CR's [puppet] - 10https://gerrit.wikimedia.org/r/631160 (owner: 10Jbond) [11:21:35] !log nikerabbit@deploy1001 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:627744|Enable Special:TranslationStats (T263004)]] (duration: 00m 59s) [11:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[11:21:41] T263004: Enable Special:TranslationStats on Wikimedia production - https://phabricator.wikimedia.org/T263004 [11:21:55] (03PS2) 10Jbond: swift: move swift parameters to the profile namespace [puppet] - 10https://gerrit.wikimedia.org/r/631169 [11:22:21] 10Operations: Switch Zookeeper to profile::java - https://phabricator.wikimedia.org/T264176 (10MoritzMuehlenhoff) [11:23:08] RECOVERY - puppet last run on mwdebug1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:23:22] 10Operations: Switch cergen to profile::java - https://phabricator.wikimedia.org/T264177 (10MoritzMuehlenhoff) [11:23:24] Urbanecm: I think that's it... or did I miss any final steps? [11:23:49] Nikerabbit: no, that should be it :) [11:23:50] 10Operations: Switch puppetdb to profile::java - https://phabricator.wikimedia.org/T264178 (10MoritzMuehlenhoff) [11:24:00] (03CR) 10Jbond: "> Patch Set 7:" [puppet] - 10https://gerrit.wikimedia.org/r/630568 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [11:24:08] (03PS1) 10Jcrespo: [WIP] Start adding more unit tests [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/631172 [11:24:28] (03Abandoned) 10Jbond: swift: convert roles to profiles, fix various lint issues [puppet] - 10https://gerrit.wikimedia.org/r/630568 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [11:30:43] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC as expected: https://puppet-compiler.wmflabs.org/compiler1002/25551/" [puppet] - 10https://gerrit.wikimedia.org/r/631167 (https://phabricator.wikimedia.org/T262979) (owner: 10Arturo Borrero Gonzalez) [11:31:11] 10Operations, 10SRE-Access-Requests: Please replace Shannon Baileys SSH key - https://phabricator.wikimedia.org/T264127 (10MarcoAurelio) [11:31:54] (03CR) 10Jbond: [C: 03+1] "LGTM couple of minor comments" (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/625597 (owner: 
10Muehlenhoff) [11:33:30] !log enable puppet P:mediawiki::mcrouter_wancache for 630845 - T244340 [11:33:31] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [11:33:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:35] T244340: Reduce read pressure on mc* servers by adding a machine-local Memcached instance (on-host memcached) - https://phabricator.wikimedia.org/T244340 [11:33:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:41] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:47] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [11:33:47] Urbanecm: thanks [11:33:48] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:33:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:59] (03PS1) 10Hashar: gerrit: add link to codesearch [puppet] - 10https://gerrit.wikimedia.org/r/631174 (https://phabricator.wikimedia.org/T264163) [11:34:13] (03PS3) 10Effie Mouzeli: hieradata: enable onhost memcached on mwdebug1001 [puppet] - 10https://gerrit.wikimedia.org/r/630856 (https://phabricator.wikimedia.org/T263958) [11:35:38] (03CR) 10Hashar: "That looks like: https://phabricator.wikimedia.org/F32368737" [puppet] - 10https://gerrit.wikimedia.org/r/631174 (https://phabricator.wikimedia.org/T264163) (owner: 10Hashar) [11:36:20] (03CR) 10Kosta Harlan: [C: 03+1] gerrit: add link to codesearch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/631174 (https://phabricator.wikimedia.org/T264163) (owner: 10Hashar) [11:41:31] (03CR) 10Effie Mouzeli: [C: 03+2] hieradata: enable onhost memcached on mwdebug1001 [puppet] - 10https://gerrit.wikimedia.org/r/630856 (https://phabricator.wikimedia.org/T263958) (owner: 10Effie Mouzeli) 
[11:57:16] (03PS17) 10Muehlenhoff: reboot-groups [cookbooks] - 10https://gerrit.wikimedia.org/r/625597 [12:01:12] (03CR) 10Muehlenhoff: reboot-groups (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/625597 (owner: 10Muehlenhoff) [12:14:14] (03PS2) 10Hashar: gerrit: add link to codesearch [puppet] - 10https://gerrit.wikimedia.org/r/631174 (https://phabricator.wikimedia.org/T264163) [12:14:19] (03CR) 10Hashar: gerrit: add link to codesearch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/631174 (https://phabricator.wikimedia.org/T264163) (owner: 10Hashar) [12:16:53] 10Operations, 10Wikidata, 10Wikidata-Query-Service: Migrate WDQS to profile::java - https://phabricator.wikimedia.org/T264181 (10MoritzMuehlenhoff) [12:17:41] 10Operations, 10Gerrit: Migrate Gerrit to profile::java - https://phabricator.wikimedia.org/T264182 (10MoritzMuehlenhoff) [14:05:02] !log create thirdparty/amd-rocm33 for stretch-wikimedia [14:05:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:18] 10Operations, 10DBA, 10SRE-swift-storage, 10Goal: Research storage solutions for media backups - https://phabricator.wikimedia.org/T264190 (10jcrespo) [14:08:39] 10Operations, 10Data-Persistence-Backup, 10SRE-swift-storage, 10Goal: Research storage solutions for media backups - https://phabricator.wikimedia.org/T264190 (10jcrespo) [14:08:56] (03PS5) 10Jbond: cfssl::multiroot: add multiroot class [puppet] - 10https://gerrit.wikimedia.org/r/631183 (https://phabricator.wikimedia.org/T259117) [14:10:51] (03PS1) 10Pablo Grass (WMDE): Revert "labs: Turn on termbox v2 on wikidatawiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631023 [14:10:52] !log powering down ores100[3-9 to upgrade memory in each T259909 [14:10:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:57] T259909: (Need By: TBD) install memory upgrades in ores100[1-9] - https://phabricator.wikimedia.org/T259909 [14:11:19] (03PS6) 10Jbond: 
cfssl::multiroot: add multiroot class [puppet] - 10https://gerrit.wikimedia.org/r/631183 (https://phabricator.wikimedia.org/T259117) [14:12:51] (03CR) 10Jbond: [C: 03+2] cfssl::multiroot: add multiroot class [puppet] - 10https://gerrit.wikimedia.org/r/631183 (https://phabricator.wikimedia.org/T259117) (owner: 10Jbond) [14:13:10] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 (10Marostegui) The plan agreed with Papaul is to use an old disk from an es host that was decommissioned, and see if the controller recognizes the disk. If it does, the new disk is prob... [14:15:26] (03PS1) 10Jbond: cfssl: fix multirootca service ExecStart [puppet] - 10https://gerrit.wikimedia.org/r/631192 [14:15:56] (03CR) 10Jbond: [C: 03+2] cfssl: fix multirootca service ExecStart [puppet] - 10https://gerrit.wikimedia.org/r/631192 (owner: 10Jbond) [14:17:45] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - api_80: Servers mw1400.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:19:06] (03PS1) 10Jbond: cfssl: use correct config file for multirooca service [puppet] - 10https://gerrit.wikimedia.org/r/631193 [14:19:23] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:19:46] (03CR) 10Jbond: [C: 03+2] cfssl: use correct config file for multirooca service [puppet] - 10https://gerrit.wikimedia.org/r/631193 (owner: 10Jbond) [14:20:02] (03PS18) 10Muehlenhoff: reboot-groups [cookbooks] - 10https://gerrit.wikimedia.org/r/625597 [14:20:32] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime [14:20:32] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [14:20:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:40] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime [14:20:40] !log 
hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [14:20:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:59] PROBLEM - Check systemd state on pki2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:21:11] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [14:22:43] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime [14:22:44] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:47] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime [14:22:48] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:22:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:56] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime [14:22:56] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:47] (03CR) 10Muehlenhoff: [C: 03+2] reboot-groups [cookbooks] - 
10https://gerrit.wikimedia.org/r/625597 (owner: 10Muehlenhoff) [14:25:33] (03PS1) 10Jbond: cfssl::multirootca: listen on primary ip and use safe title [puppet] - 10https://gerrit.wikimedia.org/r/631194 [14:26:42] (03CR) 10Jbond: [C: 03+2] cfssl::multirootca: listen on primary ip and use safe title [puppet] - 10https://gerrit.wikimedia.org/r/631194 (owner: 10Jbond) [14:29:13] (03PS1) 10Jbond: cfssl: correct fact [puppet] - 10https://gerrit.wikimedia.org/r/631195 [14:29:41] (03CR) 10Jbond: [C: 03+2] cfssl: correct fact [puppet] - 10https://gerrit.wikimedia.org/r/631195 (owner: 10Jbond) [14:32:25] RECOVERY - Check systemd state on pki2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:34:31] (03PS2) 10Pablo Grass (WMDE): Revert "labs: Turn on termbox v2 on wikidatawiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631023 [14:38:27] (03PS1) 10Muehlenhoff: Add profile::java for cergen [puppet] - 10https://gerrit.wikimedia.org/r/631197 (https://phabricator.wikimedia.org/T264177) [14:40:01] (03PS3) 10Pablo Grass (WMDE): Revert "labs: Turn on termbox v2 on wikidatawiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631023 (https://phabricator.wikimedia.org/T264066) [14:40:54] (03CR) 10Lars Wirzenius: [C: 03+2] "Scap 3.15.0, which includes these plugins, has been deployed now, so I approve of this change." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/601388 (https://phabricator.wikimedia.org/T248490) (owner: 10Jforrester) [14:41:17] 10Operations, 10Traffic: External Monitoring alerting on 400 Bad Request errors - https://phabricator.wikimedia.org/T264111 (10colewhite) p:05Triage→03Medium a:03ema [14:41:49] (03Merged) 10jenkins-bot: Drop scap plugins, moved into scap proper [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601388 (https://phabricator.wikimedia.org/T248490) (owner: 10Jforrester) [14:42:24] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Djellel Difallah - https://phabricator.wikimedia.org/T263692 (10Isaac) Thanks @ArielGlenn ! @Nuria -- anything additional you need from us? [14:46:15] (03CR) 10WMDE-leszek: [C: 03+1] Revert "labs: Turn on termbox v2 on wikidatawiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631023 (https://phabricator.wikimedia.org/T264066) (owner: 10Pablo Grass (WMDE)) [14:46:31] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 (10Papaul) The disk from one of the decom es server works [14:46:55] 10Operations, 10Traffic: External Monitoring alerting on 400 Bad Request errors - https://phabricator.wikimedia.org/T264111 (10ema) In the checks definition, under "Advanced" -> "Custom HTTP request headers" we used to specify a header with quotes, which was passed verbatim by CA App Synthetic Monitor. Quotes... [14:46:59] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 (10Papaul) Status Name State Slot Number Size Security Status Bus Protocol Media Type Hot Spare Remaining Rated Write Endurance Physical Disk 0:1:0 Online 0 1862.5 GB... [14:50:11] 10Operations, 10Traffic: External Monitoring alerting on 400 Bad Request errors - https://phabricator.wikimedia.org/T264111 (10colewhite) 05Open→03Resolved Notifications re-enabled. 
[14:51:00] (03PS1) 10Hnowlan: changeprop: lower log level to error [deployment-charts] - 10https://gerrit.wikimedia.org/r/631200 (https://phabricator.wikimedia.org/T264195) [14:51:31] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 (10Marostegui) I can see the disk now: ` Time: Wed Sep 30 14:44:09 2020 Code: 0x00000072 Class: 0 Locale: 0x02 Event Description: State change on PD 02(e0x20/s2) from OFFLINE(10) to RE... [14:56:25] (03CR) 10Alexandros Kosiaris: [C: 04-1] docker: add data types (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/630661 (owner: 10Dzahn) [15:02:07] RECOVERY - mediawiki originals uploads -hourly- for codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [15:06:39] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 (10Marostegui) Not sure if it is actually going to work but at least the disk is seen: ` root@es2026:~# megacli -PDRbld -ShowProg -physdrv[32:2] -aALL Rebuild Progress on Device at En... [15:06:50] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) install memory upgrades in ores100[1-9] - https://phabricator.wikimedia.org/T259909 (10Cmjohnson) [15:07:38] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) install memory upgrades in ores100[1-9] - https://phabricator.wikimedia.org/T259909 (10Cmjohnson) 05Open→03Resolved @akosiaris the memory upgrade is complete, verified all servers are up and running. 
Resolving this task [15:11:17] (03PS4) 10Jbond: P:pki::client: add ability to create certs [puppet] - 10https://gerrit.wikimedia.org/r/629207 (https://phabricator.wikimedia.org/T259117) [15:11:40] 10Operations, 10Maps, 10Product-Infrastructure-Team-Backlog (Kanban): Increase frequency of OSM replication - https://phabricator.wikimedia.org/T137939 (10MSantos) 05Open→03Resolved This task is finished and working on codfw cluster, but the scripts are disabled in eqiad, for that part of the work please... [15:11:44] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 (10Papaul) I create another dispatch to request a new disk and shipped the one received on 9/25/2020 back. ` Create Dispatch: Success You have successfully submitted request SR1038... [15:12:21] 10Operations, 10ops-eqsin, 10serviceops: ganeti5002 was down / powered off, machine check entries in SEL - https://phabricator.wikimedia.org/T261130 (10RobH) [15:13:24] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 (10Marostegui) Great! The rebuild is happening, slowly, but at least has started: ` root@es2026:~# megacli -PDRbld -ShowProg -physdrv[32:2] -aALL Rebuild Progress on Device at Enclosur... 
[15:24:56] (03PS2) 10Muehlenhoff: Add profile::java for cergen [puppet] - 10https://gerrit.wikimedia.org/r/631197 (https://phabricator.wikimedia.org/T264177) [15:25:12] (03PS5) 10Jbond: P:pki::client: add ability to create certs [puppet] - 10https://gerrit.wikimedia.org/r/629207 (https://phabricator.wikimedia.org/T259117) [15:26:49] 10Operations, 10Maps, 10Wikimedia-Logstash, 10observability, and 3 others: Move kartotherian/tilerator logging to new logging pipeline - https://phabricator.wikimedia.org/T222377 (10MSantos) Blocked on {T260947} [15:28:29] (03PS6) 10Jbond: P:pki::client: add ability to create certs [puppet] - 10https://gerrit.wikimedia.org/r/629207 (https://phabricator.wikimedia.org/T259117) [15:28:56] !log removed librsvg 2.40.20-3+wmf1+stretch1 from component/thumbor, superseded by 2.40.21-0+deb9u1 released via stretch-security [15:29:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:10] (03PS7) 10Jbond: P:pki::client: add ability to create certs [puppet] - 10https://gerrit.wikimedia.org/r/629207 (https://phabricator.wikimedia.org/T259117) [15:30:25] 10Operations, 10Traffic, 10Platform Team Initiatives (API Gateway), 10Story: Client Developer has a cookie-free API call - https://phabricator.wikimedia.org/T258748 (10WDoranWMF) [15:33:22] Nikerabbit are you around? [15:34:30] (03PS8) 10Jbond: P:pki::client: add ability to create certs [puppet] - 10https://gerrit.wikimedia.org/r/629207 (https://phabricator.wikimedia.org/T259117) [15:34:35] PROBLEM - Check the last execution of mediawiki_job_wikidata-updateQueryServiceLag on mwmaint2001 is CRITICAL: CRITICAL: Status of the systemd unit mediawiki_job_wikidata-updateQueryServiceLag https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:35:53] PROBLEM - ping-offload grafana alert on alert1001 is CRITICAL: CRITICAL: Ping offload ( https://grafana.wikimedia.org/d/000000513/ping-offload ) is alerting: target IP missing on hosts loopback. 
https://wikitech.wikimedia.org/wiki/Ping_offload%23InAddrErrors_alert https://grafana.wikimedia.org/d/000000513/ [15:36:23] PROBLEM - Prometheus prometheus2003/ops restarted: beware possible monitoring artifacts on prometheus2003 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/ops [15:36:57] (03CR) 10Jbond: [C: 03+2] P:pki::client: add ability to create certs [puppet] - 10https://gerrit.wikimedia.org/r/629207 (https://phabricator.wikimedia.org/T259117) (owner: 10Jbond) [15:37:27] PROBLEM - Maps - OSM synchronization lag - eqiad on alert1001 is CRITICAL: 2.16e+07 ge 2.592e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=11&fullscreen&orgId=1 [15:37:33] RECOVERY - ping-offload grafana alert on alert1001 is OK: OK: Ping offload ( https://grafana.wikimedia.org/d/000000513/ping-offload ) is not alerting. https://wikitech.wikimedia.org/wiki/Ping_offload%23InAddrErrors_alert https://grafana.wikimedia.org/d/000000513/ [15:37:50] 10Operations, 10Traffic, 10Platform Team Initiatives (API Gateway), 10Story: Client Developer has a cookie-free API call - https://phabricator.wikimedia.org/T258748 (10Joe) Those cookies are harmless cookies we set at the edge cache for all requests. We could add an exception for the api gateway, but it w... 
[15:38:15] PROBLEM - Prometheus prometheus1003/ops restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops [15:38:45] PROBLEM - Prometheus prometheus2004/ops restarted: beware possible monitoring artifacts on prometheus2004 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/ops [15:39:19] DannyS712: yes, replied [15:39:29] (03PS2) 10Andrew Bogott: Update partman recipes for cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/630954 (https://phabricator.wikimedia.org/T263677) [15:39:31] (03PS1) 10Andrew Bogott: nova: reduce live-migration timeout to 10 minutes [puppet] - 10https://gerrit.wikimedia.org/r/631204 [15:42:19] (03CR) 10Andrew Bogott: [C: 03+2] Update partman recipes for cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/630954 (https://phabricator.wikimedia.org/T263677) (owner: 10Andrew Bogott) [15:42:37] (03CR) 10Andrew Bogott: [C: 03+2] nova: reduce live-migration timeout to 10 minutes [puppet] - 10https://gerrit.wikimedia.org/r/631204 (owner: 10Andrew Bogott) [15:43:09] (03PS1) 10Jbond: cfssl:cert: make profile optional [puppet] - 10https://gerrit.wikimedia.org/r/631205 [15:43:56] (03CR) 10Jbond: [C: 03+2] cfssl:cert: make profile optional [puppet] - 10https://gerrit.wikimedia.org/r/631205 (owner: 10Jbond) [15:45:11] RECOVERY - Check the last execution of mediawiki_job_wikidata-updateQueryServiceLag on mwmaint2001 is OK: OK: Status of the systemd unit mediawiki_job_wikidata-updateQueryServiceLag https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:46:06] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/25564/" 
[puppet] - 10https://gerrit.wikimedia.org/r/631197 (https://phabricator.wikimedia.org/T264177) (owner: 10Muehlenhoff) [15:48:11] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([mw1290.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [15:51:11] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - api_80: Servers mw1398.eqiad.wmnet are marked down but pooled: api-https_443: Servers mw1284.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:55:05] (03PS1) 10Jbond: cfssl::cert: include stderr in unless comparison [puppet] - 10https://gerrit.wikimedia.org/r/631208 [15:55:34] (03CR) 10Jbond: [C: 03+2] cfssl::cert: include stderr in unless comparison [puppet] - 10https://gerrit.wikimedia.org/r/631208 (owner: 10Jbond) [15:56:05] PROBLEM - Stale file for node-exporter textfile in eqiad on alert1001 is CRITICAL: cluster=misc file=smartmon.prom instance=relforge1004 job=node site=eqiad https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile [16:00:04] CindyCicaleseWMF: My dear minions, it's time we take the moon! Just kidding. Time for Platform Engineering Team deployment deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200930T1600). [16:01:26] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) install memory upgrades in ores100[1-9] - https://phabricator.wikimedia.org/T259909 (10akosiaris) Awesome, many thanks! [16:01:33] I'm here and getting ready to deploy. Is there anything going on or are we good to go? [16:02:11] all quiet as far as I can tell, glhf :) [16:02:27] RECOVERY - Prometheus prometheus2003/ops restarted: beware possible monitoring artifacts on prometheus2003 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/ops [16:02:39] RECOVERY - Prometheus prometheus1003/ops restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops [16:03:09] RECOVERY - Prometheus prometheus2004/ops restarted: beware possible monitoring artifacts on prometheus2004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/ops [16:03:27] thanks! [16:06:34] (03PS4) 10Cicalese: Add beta config for API Portal/OAuth communications [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630947 (https://phabricator.wikimedia.org/T261358) [16:07:37] (03CR) 10Cicalese: [C: 03+2] Add beta config for API Portal/OAuth communications [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630947 (https://phabricator.wikimedia.org/T261358) (owner: 10Cicalese) [16:08:18] (03Merged) 10jenkins-bot: Add beta config for API Portal/OAuth communications [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630947 (https://phabricator.wikimedia.org/T261358) (owner: 10Cicalese) [16:09:21] (03PS1) 10Effie Mouzeli: hieradata: mwdebug fix for onhost memcached [puppet] - 10https://gerrit.wikimedia.org/r/631210 [16:10:16] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 (10Papaul) Return tracking information {F32368964} [16:12:23] (03PS5) 10Dzahn: docker: add data types [puppet] - 10https://gerrit.wikimedia.org/r/630661 [16:12:28] (03CR) 10Dzahn: docker: add data types (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/630661 (owner: 
10Dzahn) [16:15:44] (03CR) 10Effie Mouzeli: [C: 03+2] hieradata: mwdebug fix for onhost memcached [puppet] - 10https://gerrit.wikimedia.org/r/631210 (owner: 10Effie Mouzeli) [16:16:55] (03PS3) 10Dzahn: DHCP switch TFTP server for esams from bast3004 to install3001 [puppet] - 10https://gerrit.wikimedia.org/r/630966 (https://phabricator.wikimedia.org/T252526) [16:17:13] (03PS3) 10Dzahn: DHCP: switch TFTP server for ulsfo from bast4002 to install4001 [puppet] - 10https://gerrit.wikimedia.org/r/630964 (https://phabricator.wikimedia.org/T252526) [16:18:23] (03CR) 10Dzahn: [C: 03+2] DHCP: switch TFTP server for ulsfo from bast4002 to install4001 [puppet] - 10https://gerrit.wikimedia.org/r/630964 (https://phabricator.wikimedia.org/T252526) (owner: 10Dzahn) [16:20:43] jouncebot: next [16:20:43] In 1 hour(s) and 39 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200930T1800) [16:20:43] In 1 hour(s) and 39 minute(s): Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200930T1800) [16:20:49] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [16:20:49] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:20:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:56] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [16:20:56] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:20:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:37] !log re-enabled puppet on install2003 [16:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:58] CindyCicaleseWMF: are you done deploying? Are there any objection to rolling the train forward to group0 now? 
https://phabricator.wikimedia.org/T263177#6504363 indicates it's safe. [16:24:12] twentyafterfour: We're getting changes that we do not expect after merging and then doing "git fetch origin && git diff HEAD origin". Do you have any ideas? [16:25:09] CindyCicaleseWMF: I'm not sure,that doesn't sound good though [16:26:07] I see my change and a lot of other stuff. What do you normally do in such a situation? [16:26:19] I usually use `git diff HEAD...origin` I'm not sure how it behaves when you use `HEAD origin` [16:27:22] still a lot of other changes to .gitignore, scap files, etc. [16:27:27] what does `git log origin...HEAD` say? [16:27:29] [16:27:34] hmmm [16:27:59] I can take a look, which repo? [16:28:09] mediawiki-config [16:28:14] ohh [16:28:25] I made a simple change to CommonSettings-labs.php [16:28:42] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10Services, 10Service-deployment-requests: New Service Request: Wikimedia push notification service - https://phabricator.wikimedia.org/T250452 (10Jgiannelos) [16:29:01] I see that at the end of the git log origin..HEAD, but lots of other changes before [16:29:06] CindyCicaleseWMF: looks like just one patch that wasn't deployed [16:29:50] on deploy1001 `tig HEAD origin/master`: [16:29:52] b85d3e5 2020-09-29 17:15 Cindy Cicalese ● {origin/master} {origin/HEAD} Add beta config for API Portal/OAuth communications [16:29:55] 658f289 2020-06-01 11:47 James D. Forrester ● Drop scap plugins, moved into scap proper [16:29:57] 7b43921 2020-09-16 10:49 Niklas Laxström ● [master] Enable Special:TranslationStats [16:30:11] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission es2019.codfw.wmnet - https://phabricator.wikimedia.org/T264063 (10Papaul) ` papaul@asw-d-codfw# show | compare [edit interfaces interface-range vlan-private1-d-codfw] - member ge-6/0/19; [edit interfaces interface-range disabled]... [16:30:58] hmm, what do you suggest? 
[16:30:59] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10Services, 10Service-deployment-requests: New Service Request: Wikimedia push notification service - https://phabricator.wikimedia.org/T250452 (10Jgiannelos) @sdkim i updated the checklist in the acceptance criteria. It lo... [16:30:59] the patch from James isn't really meant to be deployed since it's just scap stuff so I think it's fine to pull all that in and deploy. [16:31:20] so just move on to the "git merge origin/master"? [16:31:27] yeah [16:31:31] cool - thanks! [16:31:36] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [16:31:38] you're welcome! [16:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:42] twentyafterfour: sync-file is running now, then I should be done [16:33:52] !log cicalese@deploy1001 Synchronized wmf-config/CommonSettings-labs.php: Add beta config for API Portal/OAuth communications (duration: 00m 58s) [16:33:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:18] done - thanks! [16:34:24] (03PS1) 10Effie Mouzeli: mcrouter_wancache: fix typo in WarmUpRoute [puppet] - 10https://gerrit.wikimedia.org/r/631212 [16:35:11] thanks CindyCicaleseWMF [16:36:10] volans: fyi, I used makevm cookbook in each of the 3 POPs and noticed how that also automatically assigns an IP and f.e. for ULSFO I did not even have to make the manual edit anymore. worked fine, thanks:) [16:36:21] (03CR) 10Effie Mouzeli: [C: 03+2] mcrouter_wancache: fix typo in WarmUpRoute [puppet] - 10https://gerrit.wikimedia.org/r/631212 (owner: 10Effie Mouzeli) [16:36:43] mutante: thanks for the feedback! 
as of yesterday eqsin too is fully automated (dns got migrated) [16:36:46] esams next :) [16:37:19] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:37:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:53] volans: cool, and if you ever saw a "testvm" get in your way, I was using to test new install servers (and on the side the whole create VM / IP thing) and they will be removed again, of course with decom cookbook [16:37:58] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:38:16] hah corona deal [16:38:24] mutante: yeah saw them, all good [16:39:14] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:39:40] 10Operations, 10ops-eqiad, 10DC-Ops, 10Wikidata, and 2 others: Check for errors on wdqs1009 disks - https://phabricator.wikimedia.org/T263125 (10wiki_willy) a:05wiki_willy→03Cmjohnson [16:40:07] 10Operations, 10Machine Learning Platform, 10ORES, 10Okapi, and 2 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10RBrounley_WMF) Hey, connecting with folks on ORES team around this today - sorry we were given advice that the ORES stream may have some data... [16:40:39] 10Operations, 10ops-eqiad, 10DC-Ops, 10Wikidata, and 2 others: Check for errors on wdqs1009 disks - https://phabricator.wikimedia.org/T263125 (10wiki_willy) Moving over to @RobH and @Cmjohnson, so they can pull TSR reports for a replacement part [16:41:36] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission es2019.codfw.wmnet - https://phabricator.wikimedia.org/T264063 (10Papaul) [16:46:50] 10Operations, 10Machine Learning Platform, 10ORES, 10Okapi, and 2 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10Joe) >>! 
In T263910#6505891, @RBrounley_WMF wrote: > Hey, connecting with folks on ORES team around this today - sorry we were given advice t... [16:55:18] 10Operations, 10Machine Learning Platform, 10ORES, 10Okapi, and 2 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10Ladsgroup) >>! In T263910#6505891, @RBrounley_WMF wrote: > Hey, connecting with folks on ORES team around this today - sorry we were given ad... [16:56:00] (03PS4) 10Dzahn: DHCP switch TFTP server for esams from bast3004 to install3001 [puppet] - 10https://gerrit.wikimedia.org/r/630966 (https://phabricator.wikimedia.org/T252526) [16:58:56] (03PS1) 10ArielGlenn: add deployment-snapshot02 to dsh list for deployment-prep in wmcs [puppet] - 10https://gerrit.wikimedia.org/r/631216 (https://phabricator.wikimedia.org/T245402) [17:01:01] (03CR) 10ArielGlenn: [C: 03+2] add deployment-snapshot02 to dsh list for deployment-prep in wmcs [puppet] - 10https://gerrit.wikimedia.org/r/631216 (https://phabricator.wikimedia.org/T245402) (owner: 10ArielGlenn) [17:01:36] !log finished adding restbase2018-a to the cassandra cluster [17:01:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:01] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [17:02:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:01] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:04:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:05] PROBLEM - nova-compute proc maximum on cloudvirt1033 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:05:13] PROBLEM - nova-compute proc minimum on cloudvirt1033 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:07:48] (03CR) 10Dzahn: [C: 03+2] DHCP switch TFTP server for esams from bast3004 to install3001 [puppet] - 10https://gerrit.wikimedia.org/r/630966 (https://phabricator.wikimedia.org/T252526) (owner: 10Dzahn) [17:08:30] apergos: multi-merge on puppetmaster ok? (multiple/no):) [17:08:42] ah crap [17:08:44] yes please [17:09:00] mutante: [17:09:42] apergos: done! i was looking at the actual change. [17:09:50] beware the host name change in cloud [17:09:53] yeah it's a "labs only" change [17:10:04] i's a new instance [17:10:06] does it work? [17:10:08] only the new style names work [17:10:14] in some places you will see new host names [17:10:24] but you can't use them in other places, depends [17:10:26] well the old style host name doesn't even resolve for the new instance [17:10:31] ok [17:10:42] so if we can't use the new style one, I am really and truly screwed [17:11:16] I won't know if it works for awhile :-/ [17:13:23] I tested putting it into my .ssh/config and to connect. I am talking to something but my connection gets closed, fwiw. [17:14:09] oh I can ssh in with that name, like I say it's the only style name that works [17:14:17] ok then [17:14:41] RECOVERY - nova-compute proc maximum on cloudvirt1033 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:15:07] if you fail to connect, you might have your config set up oddly; are you able to connect to other deployment-prep instances with new-style names? [17:17:28] 10Operations, 10Data-Persistence-Backup, 10SRE-swift-storage, 10Goal: Research storage solutions for media backups - https://phabricator.wikimedia.org/T264190 (10herron) p:05Triage→03Medium [17:17:41] I am not sure which exactly are the ones that have new-style names and issues I ran into were in other VPS projects. 
If it works for you then it's good enough for right now. [17:18:24] 10Operations, 10Data-Persistence-Backup, 10SRE-swift-storage, 10Goal: Prepare a proof of concept of the minimum setup capable of backup and recover testwiki media files - https://phabricator.wikimedia.org/T264189 (10herron) p:05Triage→03Medium [17:19:22] one thing is that in openstack-browser you will see host names but you can't copy/paste those into the puppet compiler, you need to SSH and get the old host names [17:19:42] 10Operations, 10Gerrit: Migrate Gerrit to profile::java - https://phabricator.wikimedia.org/T264182 (10herron) p:05Triage→03Medium [17:20:59] mutante: the old<->new name mapping is static. s/.eqiad.wmflabs/.eqiad1.wikimedia.cloud/ and in reverse [17:21:15] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - api_80: Servers mw1315.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:21:30] and yes the new names are long, but hopefully most folks have tab completion for their ssh clients [17:22:01] 10Operations, 10SRE-Access-Requests: Please replace Shannon Baileys SSH key - https://phabricator.wikimedia.org/T264127 (10herron) p:05Triage→03Medium [17:23:07] bd808: ok, thanks [17:23:25] 10Operations, 10Analytics, 10Traffic: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074 (10herron) p:05Triage→03High [17:24:19] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:25:47] 10Operations, 10Puppet, 10Patch-For-Review: unbound variable error when calling puppet-merge script with an explicit treeish - https://phabricator.wikimedia.org/T264014 (10herron) p:05Triage→03Medium [17:25:55] question: how do I view the wfDebug output logs for the beta cluster? 
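Note on the 17:20:59 message above: bd808 describes the old<->new Cloud VPS name mapping as a static suffix swap (s/.eqiad.wmflabs/.eqiad1.wikimedia.cloud/ and the reverse). A minimal sketch of that swap follows, assuming the suffix is the only part that changes. The function names and the example FQDN are hypothetical and not taken from any Wikimedia tooling.

```python
# Sketch of the static suffix mapping quoted in the log.
# Assumption: only the domain suffix differs between old and new names.
OLD_SUFFIX = ".eqiad.wmflabs"
NEW_SUFFIX = ".eqiad1.wikimedia.cloud"


def to_new_name(host: str) -> str:
    """Map an old-style instance name to the new style (hypothetical helper)."""
    if host.endswith(OLD_SUFFIX):
        return host[: -len(OLD_SUFFIX)] + NEW_SUFFIX
    return host  # leave names that do not match untouched


def to_old_name(host: str) -> str:
    """Reverse mapping, e.g. for tools that still expect old-style names."""
    if host.endswith(NEW_SUFFIX):
        return host[: -len(NEW_SUFFIX)] + OLD_SUFFIX
    return host


if __name__ == "__main__":
    # Hypothetical FQDN built from the instance and project named in the log.
    old = "deployment-snapshot02.deployment-prep.eqiad.wmflabs"
    print(to_new_name(old))
    print(to_old_name(to_new_name(old)))
```

As the 17:10:26 message notes, an old-style name may not resolve at all for a new instance, so the reverse mapping only yields a string for tools that ask for one.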
[17:27:37] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - api_80: Servers mw1288.eqiad.wmnet are marked down but pooled: api-https_443: Servers mw1394.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:28:09] mw1315 is "marked down" but seems up. hmm.. rescheduling check.. or depooling it [17:28:22] it's not ongoing work, is it [17:28:55] 10Operations, 10observability, 10serviceops: Strongswan Icinga check: do not report issues about depooled hosts - https://phabricator.wikimedia.org/T148976 (10herron) p:05Triage→03Medium [17:31:01] RECOVERY - nova-compute proc minimum on cloudvirt1033 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:31:47] andrewbogott: is wikitech-static moving or some maintenance? i saw package upgrades yesterday and now it's redirecting to regular wikitech? [17:32:02] I just did a routine maintenance there yesterday [17:32:08] the redir is something that's been wrong for ages [17:32:31] https://phabricator.wikimedia.org/T257643 [17:32:41] it only happens at the very top-level url [17:32:42] andrewbogott: Icinga thinks there is " HTTP CRITICAL: HTTP/1.1 500 Internal Server Error -" on both labweb1001 and labweb1002 for wikitech-static [17:32:51] but that's not where that is served? [17:33:07] andrewbogott: ah, gotcha @redirect. 
thanks [17:33:15] oh, ok, something else is actually wrong [17:33:16] https://wikitech-static.wikimedia.org/wiki/Main_Page [17:33:34] mutante: I can look at that in a bit if you don't get there first [17:33:42] although really I'm hoping Reedy will jump in before I get there [17:33:42] oh, yea, that part I had not seen, just icinga [17:33:46] 10Operations: Adding Italian Wikinews to Google Search Console to add it to Google News - https://phabricator.wikimedia.org/T253988 (10herron) Removing the SRE-Access-Requests tag for now, please re-add when ready to proceed with this. Thanks! [17:34:16] andrewbogott: ack, it doesn't seem to be UBN [17:34:47] 10Operations, 10LDAP-Access-Requests: Add Bereket teshome to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T262921 (10herron) 05Open→03Resolved [17:34:57] other pages still working [17:35:23] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:36:21] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([mw1394.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [17:41:20] is someone working on api server pools, with these repeated pybal alerts? [17:41:58] hmm, taking a look around a little [17:42:22] (it's also interesting that only lvs1016 is alerting, which is the backup one) [17:42:45] bblack: not that i know of, i was wondering about it and that one mw host it mentioned looked fine. 
before i could depool it the alert recovered [17:42:56] and nothing in SAL [17:43:14] then saw other stuff in Icinga [17:43:33] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission es2019.codfw.wmnet - https://phabricator.wikimedia.org/T264063 (10Papaul) [17:43:44] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission es2019.codfw.wmnet - https://phabricator.wikimedia.org/T264063 (10Papaul) complete [17:44:20] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission es2019.codfw.wmnet - https://phabricator.wikimedia.org/T264063 (10Papaul) 05Open→03Resolved [17:45:15] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission es2013.codfw.wmnet - https://phabricator.wikimedia.org/T263740 (10Papaul) [17:45:55] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission es2013.codfw.wmnet - https://phabricator.wikimedia.org/T263740 (10Papaul) 05Open→03Resolved complete [17:46:20] yeah I just did a quick diff on lvs1016 of the pooled stuff showing in ipvsadm between ports 80 and 443, and there's some diff [17:46:48] 1297, 1398, 1314 are all only in the port 80 set, and 1313 is only in the 443 set [17:46:53] I imagine there's some flapping going on [17:47:10] yeah lots of: [17:47:13] Sep 30 17:46:56 lvs1016 pybal[2315]: [api_80 ProxyFetch] WARN: mw1315.eqiad.wmnet (enabled/partially up/not pooled): Fetch failed (http://en.wikipedia.org/w/api.php), 5.001 s [17:47:17] for various api servers [17:47:23] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [17:48:26] but it's not just api, I think api is just the most-sensitive... [17:50:04] there's always sporadic "Fetch failed" in syslog, but lvs1016 seems to have a ton more such spam than its peers [17:50:49] either it has a network issue, or we've yet again hit the case of pybal falling behind on its own healthchecks?
1016 does have more total servers/checks than the others, since it's the superset of everything 1013, 1014, and 1015 handle [17:52:29] attempting a pybal restart in case that helps... [17:52:46] 10Operations, 10ops-eqsin, 10serviceops: ganeti5002 was down / powered off, machine check entries in SEL - https://phabricator.wikimedia.org/T261130 (10RobH) I've gone ahead and fixed the self dispatch issue (had to add a new SG group to get this to work, not sure how it worked last time I sent parts but wha... [17:52:54] !log lvs1016: restart pybal [17:52:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:49] the rate of healthcheck failures looks just as unacceptable as before the restart, so far [18:00:04] twentyafterfour and hashar: That opportune time is upon us again. Time for a Train log triage with CPT deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200930T1800). [18:00:04] RoanKattouw, Niharika, and Urbanecm: (Dis)respected human, time to deploy Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200930T1800). Please do the needful. [18:00:04] RoanKattouw, Jdlrobson, and hip: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[18:00:12] (03PS1) 10Urbanecm: [labs] Sync wmgMonologChannels with production value [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631224 [18:00:18] why does the bot not respect me [18:00:58] haha [18:00:59] jouncebot will not harm you, or through inaction allow you to come to harm, but it doesn't have to *like* you [18:01:09] I'll do the deployment [18:01:13] o/ present [18:01:22] thanks RoanKattouw :) [18:01:33] Here [18:01:37] 2 of the 4 patches are mine anyway [18:01:57] (03PS3) 10Catrope: Move search in header for anons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630207 (https://phabricator.wikimedia.org/T263032) (owner: 10Jdlrobson) [18:02:16] (03CR) 10Catrope: [C: 03+2] Move search in header for anons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630207 (https://phabricator.wikimedia.org/T263032) (owner: 10Jdlrobson) [18:03:06] (03Merged) 10jenkins-bot: Move search in header for anons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630207 (https://phabricator.wikimedia.org/T263032) (owner: 10Jdlrobson) [18:03:32] Jdlrobson: Your patch is on mwdebug2001, please test [18:03:48] RoanKattouw: on it [18:05:22] RoanKattouw: looks great! feel free to sync [18:05:43] RoanKattouw: but now i'm testing i realize i should have just defaulted it to true for everyone [18:06:27] (i'll follow up with a patch before the end of the window) [18:06:30] OK great [18:06:56] (03CR) 10Gehel: [C: 03+2] Extracting obvious reporting code to a Reporter class. [software/cumin] - 10https://gerrit.wikimedia.org/r/626660 (https://phabricator.wikimedia.org/T212783) (owner: 10Gehel) [18:07:14] (03PS1) 10Jdlrobson: Always make the search in the header [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631225 (https://phabricator.wikimedia.org/T263032) [18:08:01] (03PS4) 10Catrope: clientError: Enable on Wikidata + all Wikipedias besides enwiki.
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/630908 (https://phabricator.wikimedia.org/T255585) (owner: 10Jason Linehan) [18:08:06] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Move search in header for anons (T263032) (duration: 00m 59s) [18:08:06] (03CR) 10Catrope: [C: 03+2] clientError: Enable on Wikidata + all Wikipedias besides enwiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630908 (https://phabricator.wikimedia.org/T255585) (owner: 10Jason Linehan) [18:08:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:11] T263032: Deploy the new location of the search bar to new vector and begin A/B test on test wikis - https://phabricator.wikimedia.org/T263032 [18:08:36] (03PS2) 10Jdlrobson: Always make the search in the header for anons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631225 (https://phabricator.wikimedia.org/T263032) [18:08:59] (03Merged) 10jenkins-bot: clientError: Enable on Wikidata + all Wikipedias besides enwiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630908 (https://phabricator.wikimedia.org/T255585) (owner: 10Jason Linehan) [18:09:28] hip5: Your patch is on mwdebug2001, please test [18:09:35] roger [18:09:55] [still looking at lvs1016 issues, not production critical at present] [18:10:31] RoanKattouw: looks good [18:12:32] RoanKattouw: if https://gerrit.wikimedia.org/r/631225 could go out at the end of the deploy that would be great [18:12:49] Yeah will do [18:13:00] (03CR) 10Volans: [C: 03+1] "LGTM, thanks for adding those!" 
[software/cumin] - 10https://gerrit.wikimedia.org/r/630934 (owner: 10Gehel) [18:13:35] (03PS3) 10Catrope: Always make the search in the header for anons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631225 (https://phabricator.wikimedia.org/T263032) (owner: 10Jdlrobson) [18:14:00] (03CR) 10Catrope: [C: 03+2] Always make the search in the header for anons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631225 (https://phabricator.wikimedia.org/T263032) (owner: 10Jdlrobson) [18:14:18] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable clientError on Wikidata and all Wikipedias except enwiki (T255585) (duration: 00m 58s) [18:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:23] T255585: Extend client-side error logging coverage - https://phabricator.wikimedia.org/T255585 [18:14:49] (03Merged) 10jenkins-bot: Always make the search in the header for anons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631225 (https://phabricator.wikimedia.org/T263032) (owner: 10Jdlrobson) [18:16:53] (03PS1) 10Effie Mouzeli: hieradata: enable onhost memcached on mw2271 [puppet] - 10https://gerrit.wikimedia.org/r/631246 (https://phabricator.wikimedia.org/T263958) [18:21:36] 10Operations, 10Maps, 10Wikimedia-Logstash, 10observability, and 3 others: Move kartotherian/tilerator logging to new logging pipeline - https://phabricator.wikimedia.org/T222377 (10Mholloway) a:05Mholloway→03None [18:21:58] RoanKattouw: on 2001? [18:22:09] (03PS2) 10Dzahn: DHCP: set TFTP servers in other DC to bootstrap install servers [puppet] - 10https://gerrit.wikimedia.org/r/630971 (https://phabricator.wikimedia.org/T252526) [18:23:03] 10Operations, 10ops-eqiad, 10DC-Ops, 10Wikidata, and 2 others: Check for errors on wdqs1009 disks - https://phabricator.wikimedia.org/T263125 (10RobH) I've pulled the TSR and submitted SR1038301301 for dispatching a replacement SSD to eqiad. 
This is now over to Chris for receipt and installation/return of... [18:23:29] Jdlrobson: Sorry I got distracted [18:23:47] Jdlrobson: It is now [18:24:12] (03CR) 10Dzahn: [C: 03+2] DHCP: set TFTP servers in other DC to bootstrap install servers [puppet] - 10https://gerrit.wikimedia.org/r/630971 (https://phabricator.wikimedia.org/T252526) (owner: 10Dzahn) [18:24:24] (03PS2) 10Catrope: GrowthExperiments: Enable for newcomers on ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630886 (https://phabricator.wikimedia.org/T255027) [18:24:26] (03CR) 10Catrope: [C: 03+2] GrowthExperiments: Enable for newcomers on ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630886 (https://phabricator.wikimedia.org/T255027) (owner: 10Catrope) [18:24:29] RoanKattouw: perfect [18:24:39] please sync away! [18:25:09] (03Merged) 10jenkins-bot: GrowthExperiments: Enable for newcomers on ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630886 (https://phabricator.wikimedia.org/T255027) (owner: 10Catrope) [18:25:45] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - api-https_443: Servers mw1287.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:28:09] 10Operations, 10ops-eqiad, 10DC-Ops, 10Wikidata, and 2 others: Check for errors on wdqs1009 disks - https://phabricator.wikimedia.org/T263125 (10RobH) IRC update from my chat with @gehel: wdqs1009 is not in production, but a non user facing test server. We can do our hdparm testing of secure erasure with... 
[18:28:54] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Put search in header for anons on all wikis, not just desktop-improvements wikis (T263032) (duration: 00m 59s) [18:28:55] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([mw1287.eqiad.wmnet, mw1396.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [18:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:00] T263032: Deploy the new location of the search bar to new vector and begin A/B test on test wikis - https://phabricator.wikimedia.org/T263032 [18:31:11] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [18:31:17] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [18:31:51] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable GrowthExperiments for newcomers on ptwiki (T225027) (duration: 00m 58s) [18:31:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:56] T225027: long queries - https://phabricator.wikimedia.org/T225027 [18:32:02] (03PS3) 10Catrope: Enable and configure GrowthExperiments on svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627395 (https://phabricator.wikimedia.org/T257220) [18:32:07] (03CR) 10Catrope: [C: 03+2] Enable and configure GrowthExperiments on svwiki [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/627395 (https://phabricator.wikimedia.org/T257220) (owner: 10Catrope) [18:32:43] (03PS1) 10Dzahn: bastionhost::pop: stop TFTP service on role level [puppet] - 10https://gerrit.wikimedia.org/r/631249 (https://phabricator.wikimedia.org/T252526) [18:32:50] (03Merged) 10jenkins-bot: Enable and configure GrowthExperiments on svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627395 (https://phabricator.wikimedia.org/T257220) (owner: 10Catrope) [18:32:54] 10Operations, 10ops-eqiad, 10Traffic, 10netops: lvs1016 enp5s0f0 interface errors - https://phabricator.wikimedia.org/T264227 (10BBlack) p:05Triage→03High [18:33:05] 10Operations, 10DBA, 10Sustainability (Incident Followup), 10Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10RhinosF1) I just noticed the IR on wikitech says: >duplicate key for ip banning The ipblocks table actually s... [18:34:15] that alert about uploads looks like a mass upload from a bot. exactly as the docs link says ..and also "typically nothing is immediately on fire" though should be followed-up on [18:35:40] (03CR) 10Dzahn: [C: 03+2] bastionhost::pop: stop TFTP service on role level [puppet] - 10https://gerrit.wikimedia.org/r/631249 (https://phabricator.wikimedia.org/T252526) (owner: 10Dzahn) [18:36:08] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Djellel Difallah - https://phabricator.wikimedia.org/T263692 (10Nuria) @Isaac When does this collaboration expire? 
[18:36:31] !log lvs1016 pybal diff alerts downtimed in icinga for ~48h to reduce annoying flappy alert spam, with reference to https://phabricator.wikimedia.org/T264227 [18:36:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:58] 10Operations, 10ops-eqiad, 10Traffic, 10netops: lvs1016 enp5s0f0 interface errors - https://phabricator.wikimedia.org/T264227 (10BBlack) [18:37:38] bblack: oh, thank you. so it is actually limited to the test host. wasn't sure what else to do there that would have been helpful [18:38:26] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable and configure GrowthExperiments on svwiki (T257220) (duration: 00m 58s) [18:38:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:33] T257220: Deploy Growth features on Swedish Wikipedia - https://phabricator.wikimedia.org/T257220 [18:38:34] well the pattern seemed weird, so then I noticed all the failing proxyfetches were in the same vlan/row, then I went looking for interface errors, etc... :) [18:39:04] I'm betting on a physical issue, but we'll see [18:39:53] 10Operations, 10ops-eqiad, 10DC-Ops, 10Wikidata, and 2 others: Check for errors on wdqs1009 disks - https://phabricator.wikimedia.org/T263125 (10Gehel) Note that it is trivial to reimage this server, so feel free to nuke it instead of rebuilding the raid if it's easier (or use this as a learning opportunit... [18:39:53] bblack: I see. 
so we should be glad the physical issue manifested on the backup host out of all hosts [18:40:13] yes :) [18:40:17] (03CR) 10Gehel: [C: 03+2] Adding some type annotations to clustershell.py [software/cumin] - 10https://gerrit.wikimedia.org/r/630934 (owner: 10Gehel) [18:40:40] although once we figured it out, we'd just fail over and use 1016 in that case [18:45:09] (03PS2) 10Effie Mouzeli: hieradata: enable onhost memcached on mw2271 [puppet] - 10https://gerrit.wikimedia.org/r/631246 (https://phabricator.wikimedia.org/T263958) [18:47:32] (03CR) 10Effie Mouzeli: [C: 03+2] hieradata: enable onhost memcached on mw2271 [puppet] - 10https://gerrit.wikimedia.org/r/631246 (https://phabricator.wikimedia.org/T263958) (owner: 10Effie Mouzeli) [18:48:31] RoanKattouw: Are you done SWATing? [18:49:17] (03PS4) 10Hoo man: Revert "labs: Turn on termbox v2 on wikidatawiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631023 (https://phabricator.wikimedia.org/T264066) (owner: 10Pablo Grass (WMDE)) [18:49:52] (03PS1) 10Dzahn: add testvm to partman and comment about bast5001 to site [puppet] - 10https://gerrit.wikimedia.org/r/631250 [18:51:03] (03CR) 10Dzahn: [C: 03+2] add testvm to partman and comment about bast5001 to site [puppet] - 10https://gerrit.wikimedia.org/r/631250 (owner: 10Dzahn) [18:52:53] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Djellel Difallah - https://phabricator.wikimedia.org/T263692 (10Isaac) > When does this collaboration expire? @Nuria We have agreed to a six-month MOU/NDA with the opportunity to renew, so 10 March 2021 is the current exp... 
[18:53:29] (03PS3) 10Dzahn: bastionhost::pop: remove tftp from bastions [puppet] - 10https://gerrit.wikimedia.org/r/629496 (https://phabricator.wikimedia.org/T252526) [18:54:36] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Djellel Difallah - https://phabricator.wikimedia.org/T263692 (10Nuria) Ok, let's approve access until 10 March 2021 and when collaboration is extended access can be so. Approved on my end. [18:54:45] (03CR) 10Hoo man: [C: 03+2] Revert "labs: Turn on termbox v2 on wikidatawiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631023 (https://phabricator.wikimedia.org/T264066) (owner: 10Pablo Grass (WMDE)) [18:55:15] (03CR) 10Dzahn: "@Muehlenhoff, this is mostly just FYI, TFTP has been switched from bast* to new install* VMs in the 3 POPs. I have tested OS installs afte" [puppet] - 10https://gerrit.wikimedia.org/r/629496 (https://phabricator.wikimedia.org/T252526) (owner: 10Dzahn) [18:55:50] (03Merged) 10jenkins-bot: Revert "labs: Turn on termbox v2 on wikidatawiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631023 (https://phabricator.wikimedia.org/T264066) (owner: 10Pablo Grass (WMDE)) [18:56:15] (03CR) 10Dzahn: "When merging this I would manually remove the package with --purge" [puppet] - 10https://gerrit.wikimedia.org/r/629496 (https://phabricator.wikimedia.org/T252526) (owner: 10Dzahn) [18:58:35] !log hoo@deploy1001 Synchronized wmf-config/Wikibase.php: Revert "labs: Turn on termbox v2 on wikidatawiki" (T264066) (duration: 00m 58s) [18:58:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:41] T264066: Disable termbox v2 on desktop Wikidata - https://phabricator.wikimedia.org/T264066 [19:00:05] twentyafterfour and hashar: #bothumor I � Unicode. All rise for Mediawiki train - American+European Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200930T1900). 
[19:00:23] !log hoo@deploy1001 Synchronized wmf-config/: Revert "labs: Turn on termbox v2 on wikidatawiki" (T264066) (duration: 00m 58s) [19:00:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:31] I'm done [19:00:45] thanks hoo [19:01:50] !log disable puppet on mw2271 and use onhost memcached - T263958 [19:01:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:56] T263958: Test onhost memcached performance and functionality - https://phabricator.wikimedia.org/T263958 [19:04:50] (03PS5) 10CRusnov: Migrate ESAMS to Netbox Automation [dns] - 10https://gerrit.wikimedia.org/r/630647 (https://phabricator.wikimedia.org/T258729) [19:09:22] (03PS1) 1020after4: group0 wikis to 1.36.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631258 [19:09:24] (03CR) 1020after4: [C: 03+2] group0 wikis to 1.36.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631258 (owner: 1020after4) [19:09:40] (03PS1) 10Krinkle: Collect data about CodeMirror preference usage [extensions/WikimediaEvents] (wmf/1.36.0-wmf.11) - 10https://gerrit.wikimedia.org/r/631229 (https://phabricator.wikimedia.org/T260138) [19:10:25] (03Merged) 10jenkins-bot: group0 wikis to 1.36.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631258 (owner: 1020after4) [19:10:35] (03CR) 10Krinkle: [C: 03+1] "Please schedule for backport deploy as you see fit :)" [extensions/WikimediaEvents] (wmf/1.36.0-wmf.11) - 10https://gerrit.wikimedia.org/r/631229 (https://phabricator.wikimedia.org/T260138) (owner: 10Krinkle) [19:11:06] (03PS1) 10Herron: admin: change sbailey ssh key [puppet] - 10https://gerrit.wikimedia.org/r/631259 (https://phabricator.wikimedia.org/T264127) [19:11:16] 10Operations, 10Machine Learning Platform, 10ORES, 10Okapi, and 2 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10ACraze) >>! 
In T263910#6505937, @Ladsgroup wrote: > If there are scores missing from the stream, feel free to hit the endpoint but I suggest... [19:12:42] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.11 [19:12:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:02] (03CR) 10Kosta Harlan: [C: 03+1] gerrit: add link to codesearch [puppet] - 10https://gerrit.wikimedia.org/r/631174 (https://phabricator.wikimedia.org/T264163) (owner: 10Hashar) [19:14:05] (03PS1) 10Dzahn: switch DHCP servers in POPs to new local install hosts [homer/public] - 10https://gerrit.wikimedia.org/r/631261 (https://phabricator.wikimedia.org/T252526) [19:17:54] (03PS1) 10Dzahn: installserver: remove hiera host overrides, start DHCP and squid [puppet] - 10https://gerrit.wikimedia.org/r/631262 (https://phabricator.wikimedia.org/T252526) [19:18:42] (03CR) 10Dzahn: [C: 03+2] installserver: remove hiera host overrides, start DHCP and squid [puppet] - 10https://gerrit.wikimedia.org/r/631262 (https://phabricator.wikimedia.org/T252526) (owner: 10Dzahn) [19:21:17] !log activating DHCP and squid on install[345]001.wikimedia.org [19:21:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:30] 10Operations, 10Patch-For-Review: serve tftpboot environment from the install servers and create one in each edge POP - https://phabricator.wikimedia.org/T252526 (10Dzahn) #netops could you please deploy https://gerrit.wikimedia.org/r/c/operations/homer/public/+/631261 ? 
[19:34:52] (03PS2) 10Dzahn: quarry: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/630691 [19:35:37] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "NOOP https://puppet-compiler.wmflabs.org/compiler1003/25569/" [puppet] - 10https://gerrit.wikimedia.org/r/630691 (owner: 10Dzahn) [19:37:19] 10Operations: Refill tool stuck "waiting for an available worker" - https://phabricator.wikimedia.org/T264233 (10Keith_D) [19:39:22] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): cloudvirt1033 psu redundancy alert - https://phabricator.wikimedia.org/T263145 (10Andrew) 05Resolved→03Open this is still red in icinga: "Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical" [19:40:18] (03CR) 10Dzahn: [V: 03+1] "I can confirm the key matches what is on the ticket but for these cases we don't want to trust Phabricator alone. So if you could do any s" [puppet] - 10https://gerrit.wikimedia.org/r/631259 (https://phabricator.wikimedia.org/T264127) (owner: 10Herron) [19:42:36] (03CR) 10Paladox: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/631174 (https://phabricator.wikimedia.org/T264163) (owner: 10Hashar) [19:43:20] (03CR) 10Dzahn: [C: 03+2] gerrit: add link to codesearch [puppet] - 10https://gerrit.wikimedia.org/r/631174 (https://phabricator.wikimedia.org/T264163) (owner: 10Hashar) [19:44:33] (03CR) 10Dzahn: [C: 03+2] gerrit: add link to codesearch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/631174 (https://phabricator.wikimedia.org/T264163) (owner: 10Hashar) [19:44:44] (03CR) 10Dzahn: [C: 03+2] "resolving comment and submitting" [puppet] - 10https://gerrit.wikimedia.org/r/631174 (https://phabricator.wikimedia.org/T264163) (owner: 10Hashar) [19:47:21] (03CR) 10Dzahn: "button added: https://phabricator.wikimedia.org/T264163#6506605" [puppet] - 10https://gerrit.wikimedia.org/r/631174 (https://phabricator.wikimedia.org/T264163) (owner: 10Hashar) [19:50:57] (03PS3) 10Dzahn: 
profile::swift::proxy_tls: Swap hiera for lookup function [puppet] - 10https://gerrit.wikimedia.org/r/631155 (owner: 10Jbond) [19:51:42] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/25535/ms-fe2005.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/631155 (owner: 10Jbond) [19:52:49] mutante: thx ;) [19:52:58] hashar: no problem [19:53:39] (03PS1) 10Effie Mouzeli: Revert "hieradata: enable onhost memcached on mw2271" [puppet] - 10https://gerrit.wikimedia.org/r/631230 [19:53:42] my first ever polygerrit patch ! :] [19:53:54] heh:) [19:53:56] paladox: ^ [19:54:11] :D [19:54:27] \o/ [19:54:40] (03CR) 10Effie Mouzeli: [C: 04-2] "NO need to merge this unless mw2271 is erroring or we see issues that might be related to this." [puppet] - 10https://gerrit.wikimedia.org/r/631230 (owner: 10Effie Mouzeli) [19:54:46] mutante: unrelated, I have updated the process to fetch the Jenkins debian package! :] https://wikitech.wikimedia.org/wiki/Jenkins#Get_the_package [19:54:51] now using solely reprepro [19:55:07] and moritz tweaked the reprepro to get rid of some old entries that were left over [19:55:19] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/631270 [19:56:16] mutante: tldr: reprepro update -C thirdparty/ci --restrict=jenkins buster-wikimedia [19:56:19] does the magic ;] [19:56:19] hashar: alright, but we have always been using reprepro [19:56:29] yeah well kind of [19:56:37] there was an alternate method with a bunch of commands [19:56:47] and the last time we got the non LTS package for some reason ;) [19:58:22] yea, eh.. it was one command either way, same as we import other packages. I wrote the alternate method. but if the automatic download works now without touching unrelated packages, alright [20:00:04] chrisalbon and accraze: Dear deployers, time to do the Services – Graphoid / ORES deploy. Dont look at me like that. You signed up for it. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200930T2000). [20:02:29] (03PS1) 10PipelineBot: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/631273 [20:03:30] (03CR) 10Dzahn: "confirmed noop (ms-fe1005/2005)" [puppet] - 10https://gerrit.wikimedia.org/r/631155 (owner: 10Jbond) [20:06:10] (03PS5) 10Dzahn: swift::swiftrepl: convert role to profile [puppet] - 10https://gerrit.wikimedia.org/r/631156 (owner: 10Jbond) [20:08:39] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/25570/" [puppet] - 10https://gerrit.wikimedia.org/r/631156 (owner: 10Jbond) [20:10:41] mutante: yup the new command only acts on a specific component / package. So that should address the issue of touching other unrelated packages [20:10:43] (03CR) 10Dzahn: "confirmed on ms-fe1005/2005. only thing changed is motd" [puppet] - 10https://gerrit.wikimedia.org/r/631156 (owner: 10Jbond) [20:11:13] mutante: possibly we could have a thirdparty/jenkins component dedicated to just jenkins, that way we are sure to never touch anything else but jenkins ;] [20:13:24] hashar: I noticed the --restrict part. that seems to be it, yea. sounds good :) [20:13:52] (03PS8) 10Dzahn: swift::storage: migrate swift::storage from a role to profile [puppet] - 10https://gerrit.wikimedia.org/r/631157 (owner: 10Jbond) [20:13:58] mutante: next thing, I will finally look at how to sync the jenkins master in a way that does not require adjusting the uid/gid ;] [20:14:36] mutante: I will write some runbook about how to switch the CI master. And I guess we can later schedule a switch over when time allows [20:16:01] (03PS1) 10Ahmon Dancy: removed a comment [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/631275 [20:16:28] mutante: or have you had a runbook already?
[20:18:38] (03CR) 10Ahmon Dancy: [C: 03+2] removed a comment [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/631275 (owner: 10Ahmon Dancy) [20:18:43] (03CR) 10Ahmon Dancy: [C: 03+2] Add wikiversions-dev.json (train-dev branch) [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/631276 (owner: 10Ahmon Dancy) [20:18:58] hashar: I think the checkboxes on our https://phabricator.wikimedia.org/T224591 are close to a runbook [20:19:18] just needs some more detail like exact rsync command [20:19:19] (03Merged) 10jenkins-bot: removed a comment [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/631275 (owner: 10Ahmon Dancy) [20:19:21] (03Merged) 10jenkins-bot: Add wikiversions-dev.json (train-dev branch) [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/631276 (owner: 10Ahmon Dancy) [20:19:38] mutante: pretty much yes [20:19:44] see the bottom that we did not get to [20:19:50] mutante: my .plan is to extract the infos from that task and stick them on a wiki page [20:20:15] https://phabricator.wikimedia.org/T256422 [20:20:38] hashar: sounds good, thanks. 
and btw also that ticket [20:21:59] I think a good way is if we can just link to a previous gerrit change after each of those check boxes [20:22:18] that is always nice being able to just copy an old change [20:36:28] !log temp disabling puppet on swift::storage (swift-be) hosts, applying gerrit:631157 refactoring change [20:36:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:10] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "< mutante> !log temp disabling puppet on swift::storage (swift-be) hosts, applying gerrit:631157 refactoring change" [puppet] - 10https://gerrit.wikimedia.org/r/631157 (owner: 10Jbond) [20:39:30] (03CR) 10Dzahn: "confirmed total NOOP on ms-be2046/ms-be1046 - re-enabling on other hosts" [puppet] - 10https://gerrit.wikimedia.org/r/631157 (owner: 10Jbond) [20:41:53] (03PS6) 10Dzahn: profile::swift::stats_reporter: pass statsd host and port via hiera [puppet] - 10https://gerrit.wikimedia.org/r/631158 (owner: 10Jbond) [20:46:20] (03CR) 10Dzahn: "note ms-be2017 and ms-be2057 are disabled because they were previously disabled with another reason" [puppet] - 10https://gerrit.wikimedia.org/r/631157 (owner: 10Jbond) [20:46:38] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [20:47:37] !log temp disabling puppet on C:profile::swift::stats_reporter hosts, applying gerrit:631158 refactoring change [20:47:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:46] (03CR) 10Dzahn: [C: 03+2] profile::swift::stats_reporter: pass statsd host and port via hiera [puppet] - 10https://gerrit.wikimedia.org/r/631158 (owner: 10Jbond) [20:48:08] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "20:47 < mutante> !log temp disabling puppet on C:profile::swift::stats_reporter hosts, applying gerrit:631158 refactoring change" [puppet] - 10https://gerrit.wikimedia.org/r/631158 (owner: 10Jbond) [20:49:00] jouncebot: next [20:49:00] In 2 hour(s) 
and 10 minute(s): Evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200930T2300) [20:51:13] (03CR) 10Dzahn: "confirmed NOOP on thanos-fe1001, ms-fe2005, re-enabling on other 12 hosts." [puppet] - 10https://gerrit.wikimedia.org/r/631158 (owner: 10Jbond) [20:51:53] (03PS1) 10Catrope: Prevent returning the full templatelinks table in TemplateFilter [extensions/GrowthExperiments] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/631235 (https://phabricator.wikimedia.org/T264029) [20:52:12] (03PS1) 10Catrope: Prevent returning the full templatelinks table in TemplateFilter [extensions/GrowthExperiments] (wmf/1.36.0-wmf.11) - 10https://gerrit.wikimedia.org/r/631236 (https://phabricator.wikimedia.org/T264029) [20:53:10] (03PS1) 1020after4: group1 wikis to 1.36.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631283 [20:53:12] (03CR) 1020after4: [C: 03+2] group1 wikis to 1.36.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631283 (owner: 1020after4) [20:53:52] (03Merged) 10jenkins-bot: group1 wikis to 1.36.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631283 (owner: 1020after4) [20:56:54] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.36.0-wmf.11 [20:56:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:14] 24 hours already, huh [20:57:35] <_joe_> downtime expired, right? [20:57:42] ? [20:57:42] Ah! [20:57:45] okay cool :) [20:57:52] i disabled notifications [20:58:00] how did that alert? [20:58:03] expired VO ack yeah [20:58:14] and it pages when expires? 
:O [20:58:14] !log twentyafterfour@deploy1001 Synchronized php: group1 wikis to 1.36.0-wmf.11 (duration: 01m 20s) [20:58:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:23] I'm rolling back group1 due to a bunch of new errors [20:58:37] we’ll need to resolve them in VO, ack only lasts 24h before re-paging [20:58:38] 10Operations, 10Editing-team, 10MassMessage, 10WMF-JobQueue, 10Platform Team Workboards (Clinic Duty Team): Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (10aaron) Are the jobs failing near the end and getting retried? Also, job B can still be enqueued if a dupl... [20:58:58] ahhh, so the 24hr downtime was a coincidence, this would have re-paged from VO right now anyway [20:59:14] (03CR) 10Paladox: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/631237 (owner: 10Paladox) [20:59:29] I have some time to investigate that "bunch of new errors" once a task is filed [20:59:58] PROBLEM - Apache HTTP on mw2326 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1966 bytes in 0.083 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:00:11] !log twentyafterfour@deploy1001 scap failed: average error rate on 5/6 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/e474f13ffac6b8c3bf919c4aeafc8c9b for details) [21:00:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:20] PROBLEM - PHP7 rendering on mw2217 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2027 bytes in 0.071 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:00:48] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 5541 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:00:50] PROBLEM - Apache HTTP on mw2217 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2027 bytes in 0.071 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:00:58] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:01:05] twentyafterfour: don't forget to --force to skip the canary check, which will fail because of the existing errors [21:01:08] here to help if needed [21:01:12] PROBLEM - PHP7 rendering on mw2326 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1966 bytes in 0.083 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:02:26] PROBLEM - PHP7 rendering on mw2284 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2102 bytes in 0.093 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:02:32] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:02:54] PROBLEM - Apache HTTP on mw2284 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2102 bytes in 0.091 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:03:32] doh [21:03:38] (03CR) 10Dzahn: [C: 03+1] gerrit: open link in new window [puppet] - 10https://gerrit.wikimedia.org/r/631237 (owner: 10Paladox) [21:04:14] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: rollback [21:04:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:02] ok one of the (many) errors is 
https://phabricator.wikimedia.org/T264241 [21:06:34] just got `internal_api_error_Error` on en.wikiversity when running via api `action=query&list=tags&tglimit=max&tgprop=displayname` - don't have any of the other info, but is this related to the train? [21:06:37] twentyafterfour: I hope it's not related to the earlier chat about seeing lots of different files during deploy? [21:07:29] DannyS712: seems like it could be, a revert is going on right now [21:08:25] mutante: nope [21:08:32] alright [21:08:37] it's legit breakage in the new version branch [21:08:48] *nod* [21:08:48] and several unrelated breakages it seems [21:08:51] https://phabricator.wikimedia.org/T264243 [21:10:08] DannyS712: https://en.wikiversity.org/w/api.php?action=query&list=tags&tglimit=max&tgprop=displayname works for me (now) [21:10:20] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` an-worker1111.eqiad.wmnet ` The log can be found... [21:10:25] it's been rolled back [21:10:47] yep, thx [21:11:10] mutante yeah it was as part of my global watchlist user script, but https://en.wikiversity.org/w/api.php?action=query&list=tags&tgprop=displayname&tglimit=max works for me [21:11:36] DannyS712: you hit it during the short period of time it was broken. 
should be back to normal [21:14:22] I don't understand how some of this stuff didn't break in ci [21:16:57] also why the errors haven't gone away completely :-/ [21:18:07] or why there are a lot for version wmf.10 [21:19:42] (03PS1) 10Dzahn: prometheus: convert mysqld_exporter from role to profile [puppet] - 10https://gerrit.wikimedia.org/r/631288 [21:20:48] (03CR) 10jerkins-bot: [V: 04-1] prometheus: convert mysqld_exporter from role to profile [puppet] - 10https://gerrit.wikimedia.org/r/631288 (owner: 10Dzahn) [21:22:27] rolling back group0 also [21:22:38] (03PS2) 10Dzahn: prometheus: convert mysqld_exporter from role to profile [puppet] - 10https://gerrit.wikimedia.org/r/631288 [21:23:25] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: rollback group0 also [21:23:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:29] cscott https://phabricator.wikimedia.org/T264241#6506990 - is the last .10 meant to be .11 ? [21:27:41] ...nevermind, already fixed [21:29:05] (03PS1) 10Dzahn: trafficserver: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/631291 [21:32:18] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1111.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1111.eqiad.wmnet'] ` [21:36:01] mutante just got internal_api_error_Error on scowiki for the same tags query [21:36:19] Why doesn't the `mw.Api()` javascript class report more than just the error code? [21:37:54] (03PS1) 10Dzahn: zookeeper: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/631295 [21:38:14] DannyS712: it works for me on scowiki. 
i think that happened to you during the rollback then [21:39:05] (03CR) 10jerkins-bot: [V: 04-1] zookeeper: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/631295 (owner: 10Dzahn) [21:39:41] it was about 30 seconds before I reported it - was the rollback still ongoing? [21:40:54] (03PS1) 10Ahmon Dancy: train-dev: Disable wmgUseTranslate and wmgUseTranslationNotifications [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/631296 [21:41:11] (03PS2) 10Dzahn: zookeeper: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/631295 [21:41:18] DannyS712: yes, i think so [21:41:35] (03CR) 10DannyS712: train-dev: Disable wmgUseTranslate and wmgUseTranslationNotifications (031 comment) [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/631296 (owner: 10Ahmon Dancy) [21:41:49] (03CR) 10jerkins-bot: [V: 04-1] train-dev: Disable wmgUseTranslate and wmgUseTranslationNotifications [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/631296 (owner: 10Ahmon Dancy) [21:42:18] (03CR) 10jerkins-bot: [V: 04-1] zookeeper: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/631295 (owner: 10Dzahn) [21:42:42] So this says jerkins, but on the actual patch it says jenkins? [21:42:55] (03CR) 10Ahmon Dancy: train-dev: Disable wmgUseTranslate and wmgUseTranslationNotifications (031 comment) [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/631296 (owner: 10Ahmon Dancy) [21:42:57] DannyS712: it becomes jerkins when it downvotes you :p [21:43:04] DannyS712: they're the same :) we just make it say "jerkins" when it's downvoting your patch because it's a jerk [21:43:18] lol is this a feature of wikibugs ? [21:43:24] yes [21:43:32] neat. Is the source code published? 
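Aside on the tags query and the `internal_api_error_Error` reports above: a minimal Python sketch of building the same query and reading the error object's `info` text along with its `code`. It assumes the default MediaWiki API error shape (an `error` object with `code` and `info` keys). The helper names and the sample response body are hypothetical, made up for illustration, and `format=json` is added here even though the query quoted in the log did not set it. Nothing below contacts the network.

```python
import json
from urllib.parse import urlencode

# Endpoint taken from the URL quoted in the log.
API_URL = "https://en.wikiversity.org/w/api.php"


def build_tags_query(api_url=API_URL):
    """Return the URL for the tags query quoted in the log, asking for JSON."""
    params = {
        "action": "query",
        "list": "tags",
        "tglimit": "max",
        "tgprop": "displayname",
        "format": "json",  # assumption: not part of the query quoted in the log
    }
    return api_url + "?" + urlencode(params)


def describe_error(body):
    """Return 'code: info' for an API error body, or None if there is no error."""
    data = json.loads(body)
    error = data.get("error")
    if error is None:
        return None
    return "{}: {}".format(error.get("code", "unknown"), error.get("info", ""))


# Hypothetical response body, shaped like a MediaWiki API error object.
sample = '{"error": {"code": "internal_api_error_Error", "info": "(placeholder text)"}}'
print(build_tags_query())
print(describe_error(sample))
```

A client that logs the `info` text next to the code may make reports like the ones above easier to match against a Phabricator task, though what the server puts in `info` during a failed deploy is not shown in this log.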
[21:43:53] https://www.mediawiki.org/wiki/Wikibugs#Source [21:46:16] !log depool mw2356 and mw2319 [21:46:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:30] https://gerrit.wikimedia.org/r/plugins/gitiles/labs/tools/wikibugs2/+/refs/heads/master/grrrrit.py#118 found it [21:46:49] it bugs me that there are 2 spaces instead of 1 before the # [21:49:10] (03PS2) 10Ahmon Dancy: train-dev: Disable wmgUseTranslate and wmgUseTranslationNotifications [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/631296 [21:49:24] (03CR) 10DannyS712: [C: 03+1] train-dev: Disable wmgUseTranslate and wmgUseTranslationNotifications [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/631296 (owner: 10Ahmon Dancy) [21:49:46] DannyS712: two spaces before an end of line comment is a standard Python convention [21:50:07] oh. Its been years since I used python, I guess I forgot [21:50:10] flake8 will whine at you for only using one :) [21:51:00] (03CR) 10Ahmon Dancy: [C: 03+2] train-dev: Disable wmgUseTranslate and wmgUseTranslationNotifications [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/631296 (owner: 10Ahmon Dancy) [21:51:36] (03Merged) 10jenkins-bot: train-dev: Disable wmgUseTranslate and wmgUseTranslationNotifications [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/631296 (owner: 10Ahmon Dancy) [21:52:03] (03PS3) 10Dzahn: prometheus: convert mysqld_exporter from role to profile [puppet] - 10https://gerrit.wikimedia.org/r/631288 [21:54:26] RECOVERY - PHP7 rendering on mw2284 is OK: HTTP OK: HTTP/1.1 302 Found - 643 bytes in 0.131 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:54:28] RECOVERY - Apache HTTP on mw2217 is OK: HTTP OK: HTTP/1.1 302 Found - 629 bytes in 0.120 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:54:52] RECOVERY - PHP7 rendering on mw2326 is OK: HTTP OK: HTTP/1.1 302 Found - 
643 bytes in 0.105 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:54:54] RECOVERY - Apache HTTP on mw2284 is OK: HTTP OK: HTTP/1.1 302 Found - 629 bytes in 0.120 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:55:18] RECOVERY - Apache HTTP on mw2326 is OK: HTTP OK: HTTP/1.1 302 Found - 629 bytes in 0.103 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:55:40] RECOVERY - PHP7 rendering on mw2217 is OK: HTTP OK: HTTP/1.1 302 Found - 643 bytes in 0.098 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:56:08] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 42 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:57:11] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1114.eqiad.wmnet', 'an-worker1115.eqi... 
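Aside on the 21:46 to 21:50 exchange about two spaces before `#`: PEP 8 asks for at least two spaces between a statement and an inline comment, and pycodestyle (which flake8 runs) reports shorter gaps as E261. The sketch below is not how flake8 implements the check; it is a small stand-in using only the standard `tokenize` module, and the function name `inline_comment_gaps` is hypothetical.

```python
import io
import tokenize

# Token types that are layout or comments, not code.
LAYOUT = {
    tokenize.NL,
    tokenize.NEWLINE,
    tokenize.INDENT,
    tokenize.DEDENT,
    tokenize.COMMENT,
}


def inline_comment_gaps(source):
    """Yield (line_number, gap) for each comment that follows code on its line.

    gap is the number of columns between the end of the last code token
    and the '#' that starts the comment.
    """
    last_code = None
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            # Inline only if the last code token ended on the same row.
            if last_code is not None and last_code.end[0] == tok.start[0]:
                yield tok.start[0], tok.start[1] - last_code.end[1]
        elif tok.type not in LAYOUT:
            last_code = tok


sample = "x = 1 # one space\ny = 2  # two spaces\n# standalone comment\n"
for line_number, gap in inline_comment_gaps(sample):
    verdict = "ok" if gap >= 2 else "too close"
    print(line_number, gap, verdict)
```

Standalone comment lines are skipped because no code token ends on their row, so only the first two lines of the sample are reported.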
[21:58:10] (03PS3) 10Dzahn: zookeeper: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/631295 [22:03:26] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [22:03:34] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:03:34] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:03:34] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:03:36] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={swagger_check_restbase_cluster_eqiad,swagger_check_wikifeeds_codfw} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:03:44] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:03:48] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=2, 
connect=None, read=None, redirect=None, status=None)) after connection broken by ReadTimeoutError(HTTPSConnectionPool(host=wikifeeds.svc.codfw.wmnet, port=4101): Read timed out. (read timeout=15)): /?spec https://wikitech.wikimedia.org/wiki/Wikifeeds [22:03:52] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:04:38] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:04:56] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` an-worker1114.eqiad.wmnet ` The log can be found... 
[22:05:08] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [22:05:10] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:05:10] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [22:05:16] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:05:18] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:05:20] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:05:36] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:06:12] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:07:00] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:07:12] 10Operations, 10serviceops, 10Wikimedia-production-error: PHP7 corruption reports in 2020 (Call on wrong object, etc.) 
- https://phabricator.wikimedia.org/T245183 (10CDanis) [22:07:17] (03PS1) 10Dzahn: hadoop::monitoring: hiera()->lookup(), add data types [puppet] - 10https://gerrit.wikimedia.org/r/631302 [22:10:09] !log robh@cumin1001 START - Cookbook sre.hosts.downtime [22:10:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:11:21] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1114.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1114.eqiad.wmnet'] ` [22:11:52] (03PS1) 10Dzahn: pmacct: hiera()->lookup(), add data types [puppet] - 10https://gerrit.wikimedia.org/r/631303 [22:12:06] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [22:12:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:44] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10RobH) [22:17:11] (03PS1) 10Dzahn: wmcs::services::ntp: hiera()->lookup(), add data types [puppet] - 10https://gerrit.wikimedia.org/r/631304 [22:24:49] (03PS1) 10Dzahn: keyholder: hiera->lookup, data types [puppet] - 10https://gerrit.wikimedia.org/r/631306 [22:24:59] (03PS2) 10Urbanecm: [labs] Sync wmgMonologChannels with production value [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631224 [22:25:53] (03CR) 10jerkins-bot: [V: 04-1] keyholder: hiera->lookup, data types [puppet] - 10https://gerrit.wikimedia.org/r/631306 (owner: 10Dzahn) [22:29:48] (03PS1) 10Dzahn: base: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/631307 [22:30:43] RECOVERY - mediawiki originals uploads -hourly- for eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [22:31:03] RECOVERY - mediawiki originals uploads -hourly- for codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [22:35:29] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1115.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1114.eqiad.wmnet'] ` [22:37:34] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` an-worker1116.eqiad.wmnet ` The log can be found... [22:37:58] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` an-worker1117.eqiad.wmnet ` The log can be found... 
[22:40:10] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10RobH) [22:48:52] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1117.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1117.eqiad.wmnet'] ` [22:50:30] !log robh@cumin1001 START - Cookbook sre.hosts.downtime [22:50:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:54] (03PS1) 10Cwhite: backport patch: MarshalJSON bucket bounds as strings [debs/mtail] (debian/sid) - 10https://gerrit.wikimedia.org/r/631310 [22:52:25] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [22:52:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:21] (03PS1) 10Ebernhardson: cirrus: Increase more_like cache from one to seven days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631312 (https://phabricator.wikimedia.org/T264053) [22:56:02] (03PS2) 10Ebernhardson: cirrus: Increase more_like cache from one to three days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631312 (https://phabricator.wikimedia.org/T264053) [22:56:59] (03PS2) 10Cwhite: backport patch: MarshalJSON bucket bounds as strings configure gbp to use pdebuild [debs/mtail] (debian/sid) - 10https://gerrit.wikimedia.org/r/631310 [23:00:04] RoanKattouw, Niharika, and Urbanecm: Time to snap out of that daydream and deploy Evening backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200930T2300). [23:00:04] RoanKattouw: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[23:01:14] i have an easy one in there too, i suppose it didn't refresh [23:04:27] I'll do the deployment in a minute [23:05:42] RoanKattouw: hang on please [23:05:44] I think we should probably hold off deploying [23:05:46] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1116.eqiad.wmnet'] ` and were **ALL** successful. [23:05:49] OK, waiting [23:05:51] there is an ongoing incident [23:05:56] (03CR) 10Cwhite: [C: 03+1] "LGTM" [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/631162 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [23:05:58] over in -sre we're debugging some cache weirdness that we haven't sorted out yet [23:06:07] My patches fix an OOM error in production BTW [23:06:13] But I can wait until things settle down [23:06:13] RoanKattouw: oh? [23:06:25] well fixes would be welcome but OOM isn't the issue we're seeing [23:07:01] RoanKattouw, opcache corruption issues. 
[23:07:26] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10RobH) [23:20:15] (03PS1) 10Dzahn: toolforge/grid: hiera()->lookup(), add data types [puppet] - 10https://gerrit.wikimedia.org/r/631315 [23:21:15] (03CR) 10jerkins-bot: [V: 04-1] toolforge/grid: hiera()->lookup(), add data types [puppet] - 10https://gerrit.wikimedia.org/r/631315 (owner: 10Dzahn) [23:21:25] (03PS2) 10Dzahn: keyholder: hiera->lookup, data types [puppet] - 10https://gerrit.wikimedia.org/r/631306 [23:23:50] (03PS2) 10Dzahn: toolforge/grid: hiera()->lookup(), add data types [puppet] - 10https://gerrit.wikimedia.org/r/631315 [23:26:07] (03CR) 10Dzahn: [V: 04-1] maps: hiera()->lookup(), add data types (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/629439 (owner: 10Dzahn) [23:26:12] (03PS5) 10Dzahn: maps: hiera()->lookup(), add data types [puppet] - 10https://gerrit.wikimedia.org/r/629439 [23:28:32] (03PS4) 10Dzahn: prometheus: convert mysqld_exporter from role to profile [puppet] - 10https://gerrit.wikimedia.org/r/631288 (https://phabricator.wikimedia.org/T159412) [23:30:23] (03PS2) 10Dzahn: cumin: remove alias wikireplicas-analytics [puppet] - 10https://gerrit.wikimedia.org/r/630256 [23:30:37] (03Abandoned) 10Dzahn: cumin: remove alias wikireplicas-analytics [puppet] - 10https://gerrit.wikimedia.org/r/630256 (owner: 10Dzahn) [23:54:14] (03PS1) 10Krinkle: Reject ParserCache entries from the last wmf.11 deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631318 (https://phabricator.wikimedia.org/T263851) [23:58:20] (03PS2) 10Krinkle: Reject ParserCache entries from the last wmf.11 deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631318 (https://phabricator.wikimedia.org/T264257)