[00:03:24] James_F: yay
[00:03:29] thanks for everything today
[00:03:33] hopefully you can rest now?
[00:15:15] Jdlrobson: point taken re minerva icon, will reply on task first thing tomorrow to clarify my view, and leave it to you
[00:15:17] good night :)
[00:27:25] (PS1) Dave Pifke: php: $enable_request_profiling should affect CLI [puppet] - https://gerrit.wikimedia.org/r/599476 (https://phabricator.wikimedia.org/T253547)
[02:04:51] (CR) Krinkle: [C: +1] "Nice :)" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/599476 (https://phabricator.wikimedia.org/T253547) (owner: Dave Pifke)
[02:50:14] PROBLEM - Check systemd state on an-launcher1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:51:24] (PS1) Huji: Set wgCheckUserLogLogins to true by default [mediawiki-config] - https://gerrit.wikimedia.org/r/599492 (https://phabricator.wikimedia.org/T253946)
[02:51:50] (PS2) Huji: Set wgCheckUserLogLogins to true by default [mediawiki-config] - https://gerrit.wikimedia.org/r/599492 (https://phabricator.wikimedia.org/T253802)
[02:52:16] (PS3) Huji: Set wgCheckUserLogLogins to true by default [mediawiki-config] - https://gerrit.wikimedia.org/r/599492 (https://phabricator.wikimedia.org/T253802)
[04:25:19] !log Start topology changes in s4 - T253808
[04:25:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:25:23] T253808: db1138 (s4 master) crashed due to memory issues - https://phabricator.wikimedia.org/T253808
[04:41:05] (PS2) Marostegui: mariadb: Promote db1081 to s4 master [puppet] - https://gerrit.wikimedia.org/r/599155 (https://phabricator.wikimedia.org/T253808)
[04:41:06] (CR) Marostegui: mariadb: Promote db1081 to s4 master [puppet] - https://gerrit.wikimedia.org/r/599155 (https://phabricator.wikimedia.org/T253808) (owner: Marostegui)
[04:42:13] (CR) Marostegui: [C:
+2] mariadb: Promote db1081 to s4 master [puppet] - https://gerrit.wikimedia.org/r/599155 (https://phabricator.wikimedia.org/T253808) (owner: Marostegui)
[05:00:04] marostegui, jynus, and kormat: May I have your attention please! s4 database master failover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200529T0500)
[05:00:11] let's go?
[05:00:20] I'm ready
[05:00:26] +1
[05:00:33] ok, let's start
[05:00:37] !log Starting s4 failover from db1138 to db1081 -T253808
[05:00:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:00:42] T253808: db1138 (s4 master) crashed due to memory issues - https://phabricator.wikimedia.org/T253808
[05:00:42] <_joe_> I'm here if you need any check
[05:01:26] wow, dbctl on cumin2001 is taking AGES, going to run it from 1001
[05:01:33] will follow up later on that
[05:01:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set s4 as read-only for maintenance T253808', diff saved to https://phabricator.wikimedia.org/P11333 and previous config saved to /var/cache/conftool/dbconfig/20200529-050153-marostegui.json
[05:01:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:01:58] <_joe_> yeah cross-dc performance is not great, given the number of etcd queries
[05:02:07] RO confirmed
[05:02:11] WARNING: The database has been locked for maintenance
[05:02:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db1081 to s4 master and remove read-only from s4 T253808', diff saved to https://phabricator.wikimedia.org/P11334 and previous config saved to /var/cache/conftool/dbconfig/20200529-050224-marostegui.json
[05:02:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:02:27] all done
[05:02:43] I can edit again
[05:02:45] A+
[05:02:49] <_joe_> new edits in rc too
[05:02:53] same here
[05:03:27] <_joe_> so, yes, dbctl is doing a shitton of requests to etcd, so it's exponentially faster if run from the DC where the etcd
main cluster is
[05:03:33] everything looks good so far
[05:03:34] waiting for good metrics
[05:03:56] no errors on log
[05:04:06] _joe_: yeah, but it used to be a lot faster from 2001 a few months ago, in fact cd4nis always said that for switchovers it was better to run it from there
[05:04:27] <_joe_> yeah well, a few months back rz.l and I switched over etcd
[05:04:32] <_joe_> now the main DC is eqiad
[05:04:53] <_joe_> so that advice is outdated :)
[05:05:06] haha
[05:05:16] <_joe_> or better, inaccurate
[05:05:30] "malicious"
[05:05:55] (CR) Marostegui: wmnet: Update s4-master alias [dns] - https://gerrit.wikimedia.org/r/599156 (https://phabricator.wikimedia.org/T253808) (owner: Marostegui)
[05:06:21] (CR) Marostegui: [C: +2] wmnet: Update s4-master alias [dns] - https://gerrit.wikimedia.org/r/599156 (https://phabricator.wikimedia.org/T253808) (owner: Marostegui)
[05:06:49] <_joe_> dig -t SRV _etcd._tcp.conftool.eqiad.wmnet will tell you which DC is master
[05:07:21] <_joe_> kormat: I was trying to be kind to our "friend"
[05:08:06] :D
[05:08:09] Operations, ops-eqiad, DBA, Patch-For-Review, Wikimedia-Incident: db1138 (s4 master) crashed due to memory issues - https://phabricator.wikimedia.org/T253808 (Marostegui)
[05:09:49] Operations, ops-eqiad, DBA, Patch-For-Review, Wikimedia-Incident: db1138 (s4 master) crashed due to memory issues - https://phabricator.wikimedia.org/T253808 (Marostegui) The master failover was done successfully. This was done successfully RO started at 05:01:54 RO stopped at 05:02:25 Total...
[05:15:53] (PS1) Marostegui: db1138: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/599583 (https://phabricator.wikimedia.org/T253808)
[05:17:28] (CR) Kormat: [C: +1] db1138: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/599583 (https://phabricator.wikimedia.org/T253808) (owner: Marostegui)
[05:18:04] (CR) Marostegui: [C: +2] db1138: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/599583 (https://phabricator.wikimedia.org/T253808) (owner: Marostegui)
[05:19:54] Operations: FY2020-2021 Q1 codfw -> eqiad switchback - https://phabricator.wikimedia.org/T243318 (Marostegui)
[05:20:56] !log Deploy schema change on db1138 (no longer s4 master) - T250055
[05:21:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:21:01] T250055: Remove image.img_deleted column from production - https://phabricator.wikimedia.org/T250055
[05:31:23] (PS3) Kormat: transfer.py: Enforce mariadb version match for xtrabackup. [puppet] - https://gerrit.wikimedia.org/r/599343
[05:32:53] (CR) Jcrespo: [C: -1] "xtrabackup should be on patch on the latest version of all packages, so we can use the one found on path." [puppet] - https://gerrit.wikimedia.org/r/599343 (owner: Kormat)
[05:33:04] (CR) jerkins-bot: [V: -1] transfer.py: Enforce mariadb version match for xtrabackup.
[puppet] - https://gerrit.wikimedia.org/r/599343 (owner: Kormat)
[05:33:19] (CR) Jcrespo: [C: -1] "> Patch Set 3: Code-Review-1" [puppet] - https://gerrit.wikimedia.org/r/599343 (owner: Kormat)
[05:42:54] (CR) Dzahn: [C: +1] dnsdist: allow DoT (DNS-over-TLS) [puppet] - https://gerrit.wikimedia.org/r/599390 (https://phabricator.wikimedia.org/T252132) (owner: Ssingh)
[05:45:01] (PS1) Jcrespo: transfer.py: Use xtrabackup on path now that all packages are 10.1.43+ [puppet] - https://gerrit.wikimedia.org/r/599596
[05:45:14] (PS2) Jcrespo: transfer.py: Use xtrabackup on path now that all packages are 10.1.43+ [puppet] - https://gerrit.wikimedia.org/r/599596
[05:56:17] (PS3) Jcrespo: transfer.py: Use xtrabackup on path now that all packages are 10.1.43+ [puppet] - https://gerrit.wikimedia.org/r/599596
[05:56:32] (PS4) Jcrespo: transfer.py: Use xtrabackup on path now that all packages are 10.1.43+ [puppet] - https://gerrit.wikimedia.org/r/599596
[05:58:11] (CR) jerkins-bot: [V: -1] transfer.py: Use xtrabackup on path now that all packages are 10.1.43+ [puppet] - https://gerrit.wikimedia.org/r/599596 (owner: Jcrespo)
[05:58:40] (PS5) Jcrespo: transfer.py: Use xtrabackup on path now that all packages are 10.1.43+ [puppet] - https://gerrit.wikimedia.org/r/599596
[06:00:26] (PS6) Jcrespo: transfer.py: Use xtrabackup on path now that all packages are 10.1.43+ [puppet] - https://gerrit.wikimedia.org/r/599596
[06:00:26] (CR) jerkins-bot: [V: -1] transfer.py: Use xtrabackup on path now that all packages are 10.1.43+ [puppet] - https://gerrit.wikimedia.org/r/599596 (owner: Jcrespo)
[06:08:50] (PS1) Dzahn: site: decom mw2173 through mw2179 [puppet] - https://gerrit.wikimedia.org/r/599603 (https://phabricator.wikimedia.org/T247018)
[06:10:03] (PS1) Kormat: mariadb: Add mbstream to path [software] - https://gerrit.wikimedia.org/r/599604
[06:12:58] (PS1) Dzahn: site: decom mw2180 through
mw2186 [puppet] - https://gerrit.wikimedia.org/r/599606 (https://phabricator.wikimedia.org/T247018)
[06:13:11] (PS7) Jcrespo: transfer.py: Use xtrabackup on path now that all packages are 10.1.43+ [puppet] - https://gerrit.wikimedia.org/r/599596
[06:13:38] (Abandoned) Dzahn: decom 15 codfw appservers from rack C3 [puppet] - https://gerrit.wikimedia.org/r/579073 (https://phabricator.wikimedia.org/T247018) (owner: Dzahn)
[06:14:27] RECOVERY - Check systemd state on an-launcher1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:15:12] (PS8) Jcrespo: transfer.py: Use xtrabackup on path now that all packages are 10.1.43+ [puppet] - https://gerrit.wikimedia.org/r/599596
[06:16:57] (CR) Elukey: [C: +1] prometheus: move analytics to profile [puppet] - https://gerrit.wikimedia.org/r/599342 (https://phabricator.wikimedia.org/T252186) (owner: Filippo Giunchedi)
[06:21:09] (PS1) Dzahn: remove production IPs of mw2163 through mw2172 [dns] - https://gerrit.wikimedia.org/r/599610 (https://phabricator.wikimedia.org/T247018)
[06:24:04] (CR) Dzahn: [C: +2] "these are decom'ed and gone from conftool-data and set to decom in netbox" [dns] - https://gerrit.wikimedia.org/r/599610 (https://phabricator.wikimedia.org/T247018) (owner: Dzahn)
[06:28:25] (CR) Kormat: [C: -1] transfer.py: Use xtrabackup on path now that all packages are 10.1.43+ (1 comment) [puppet] - https://gerrit.wikimedia.org/r/599596 (owner: Jcrespo)
[06:28:43] (PS1) Dzahn: remove production IPs of mw2150 through mw2162 [dns] - https://gerrit.wikimedia.org/r/599614 (https://phabricator.wikimedia.org/T247018)
[06:29:48] Operations, Commons, MediaWiki-File-management, MediaWiki-Uploading, and 3 others: Some (recent?)
uploads to Commons are not available on other wikis - https://phabricator.wikimedia.org/T253405 (MGA73) Perhaps there still is a problem per https://phabricator.wikimedia.org/T253952 ? An image not s...
[06:34:22] (CR) Jcrespo: transfer.py: Use xtrabackup on path now that all packages are 10.1.43+ (1 comment) [puppet] - https://gerrit.wikimedia.org/r/599596 (owner: Jcrespo)
[06:36:21] !log deneb - systemctl start docker-reporter-releng-images
[06:36:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:35] (CR) Dzahn: [C: +2] remove production IPs of mw2150 through mw2162 [dns] - https://gerrit.wikimedia.org/r/599614 (https://phabricator.wikimedia.org/T247018) (owner: Dzahn)
[06:37:20] (CR) Dzahn: "former jobrunners that are gone" [dns] - https://gerrit.wikimedia.org/r/599614 (https://phabricator.wikimedia.org/T247018) (owner: Dzahn)
[06:37:33] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:39:43] (Abandoned) Dzahn: site: reorganize mediawiki appserver structure [puppet] - https://gerrit.wikimedia.org/r/577705 (owner: Dzahn)
[06:41:26] (Abandoned) Dzahn: mariadb-eventlogging-repl: base::service_unit -> systemd::service [puppet] - https://gerrit.wikimedia.org/r/506558 (https://phabricator.wikimedia.org/T194724) (owner: Dzahn)
[06:42:29] (PS4) Dzahn: mediawiki::cgroup: base::service_unit -> systemd::service [puppet] - https://gerrit.wikimedia.org/r/448778 (https://phabricator.wikimedia.org/T194724)
[06:48:06] Operations, observability, User-fgiunchedi: Include apache_exporter in puppet module apache - https://phabricator.wikimedia.org/T187434 (Dzahn) The apache module has been deleted.
[06:49:22] Operations, observability, User-fgiunchedi: Include apache_exporter in puppet module httpd (was: apache) - https://phabricator.wikimedia.org/T187434 (Dzahn)
[06:54:00] (CR) Kormat: [C: +1] transfer.py: Use xtrabackup on path now that all packages are 10.1.43+ (1 comment) [puppet] - https://gerrit.wikimedia.org/r/599596 (owner: Jcrespo)
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200529T0700)
[07:00:07] !log installing rake security updates
[07:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:57] !log mw1293 (canary jobrunner ) replace apache2.conf with version from mwdebug1001, restart apache, to debug for T190111
[07:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:00] T190111: VirtualHost for mod_status breaks debugging Apache/MediaWiki from localhost - https://phabricator.wikimedia.org/T190111
[07:12:24] !log installing xdg-utils update from latest Buster point release
[07:12:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:35] (PS1) Elukey: Add performance suggestions to the README [software/druid_exporter] - https://gerrit.wikimedia.org/r/599678
[07:13:49] (CR) Elukey: [V: +2 C: +2] Add performance suggestions to the README [software/druid_exporter] - https://gerrit.wikimedia.org/r/599678 (owner: Elukey)
[07:15:25] !log installing el-api update from latest Buster point release
[07:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:17] Operations, Wikimedia-Apache-configuration, Performance-Team (Radar): VirtualHost for mod_status breaks debugging Apache/MediaWiki from localhost - https://phabricator.wikimedia.org/T190111 (Dzahn) Just confirmed this is still the case. mw1299, as a jobrunner, behaves differently from mwdebug1001 and...
[07:17:58] Operations, Wikimedia-Apache-configuration, Performance-Team (Radar): VirtualHost for mod_status breaks debugging Apache/MediaWiki from localhost (on jobrunners) - https://phabricator.wikimedia.org/T190111 (Dzahn)
[07:23:02] (PS1) Dzahn: mediawiki: also include the MW apache2.conf on jobrunners [puppet] - https://gerrit.wikimedia.org/r/599683 (https://phabricator.wikimedia.org/T190111)
[07:24:16] Operations: Integrate Buster 10.4 point update - https://phabricator.wikimedia.org/T252394 (MoritzMuehlenhoff)
[07:25:17] !log updating perf on buster systems to new version from 10.4 point release
[07:25:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:35] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 87, down: 1, dormant: 0, excluded: 2, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:32:44] (PS1) Ema: atskafka: increase queue.buffering.max.ms [puppet] - https://gerrit.wikimedia.org/r/599685 (https://phabricator.wikimedia.org/T253551)
[07:34:43] (CR) Ema: "pcc lgtm https://puppet-compiler.wmflabs.org/compiler1001/22883/cp3050.esams.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/599685 (https://phabricator.wikimedia.org/T253551) (owner: Ema)
[07:35:25] Operations, SRE-tools, User-Joe, User-jijiki: Covert deploy_apache_change.sh to a spicerack cookbook - https://phabricator.wikimedia.org/T203948 (Dzahn) This file has meanwhile moved to `./modules/httpbb/templates/deploy_apache_change.sh.erb` and is installed on cumin masters from `modules/httpbb...
[07:35:48] (CR) Elukey: [C: +1] atskafka: increase queue.buffering.max.ms [puppet] - https://gerrit.wikimedia.org/r/599685 (https://phabricator.wikimedia.org/T253551) (owner: Ema)
[07:36:49] (PS2) Muehlenhoff: Rename squid3 class to squid [puppet] - https://gerrit.wikimedia.org/r/596201
[07:37:18] (CR) Dzahn: [C: +1] "compiled on everything using this class: https://puppet-compiler.wmflabs.org/compiler1001/22882/" [puppet] - https://gerrit.wikimedia.org/r/448778 (https://phabricator.wikimedia.org/T194724) (owner: Dzahn)
[07:38:25] (CR) Filippo Giunchedi: "Not familiar enough with docker-pkg for a meaningful vote but LGTM" [docker-images/production-images] - https://gerrit.wikimedia.org/r/597317 (https://phabricator.wikimedia.org/T222826) (owner: Cwhite)
[07:38:33] (CR) Filippo Giunchedi: [C: +1] hiera: install mtail 3.0.0~rc35 from component in ulsfo and codfw [puppet] - https://gerrit.wikimedia.org/r/599473 (https://phabricator.wikimedia.org/T251466) (owner: Cwhite)
[07:38:42] (CR) Filippo Giunchedi: [C: +1] hiera: install mtail 3.0.0~rc35 from component in esams and eqiad [puppet] - https://gerrit.wikimedia.org/r/599474 (https://phabricator.wikimedia.org/T251466) (owner: Cwhite)
[07:40:00] (CR) Filippo Giunchedi: [C: +1] Remove ganglia configs from cdh and jmxtrans modules [puppet] - https://gerrit.wikimedia.org/r/599415 (https://phabricator.wikimedia.org/T253555) (owner: Ottomata)
[07:42:19] (CR) Muehlenhoff: [C: +2] Rename squid3 class to squid [puppet] - https://gerrit.wikimedia.org/r/596201 (owner: Muehlenhoff)
[07:43:23] (PS1) Dzahn: stdlib: fix "double quoted string" lint warnings [puppet] - https://gerrit.wikimedia.org/r/599689
[07:45:40] Operations, serviceops: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (Dzahn) 06:36 mutante: deneb - systemctl start docker-reporter-releng-images 06:37 <+icinga-wm> RECOVERY - Check systemd
state on deneb is OK: OK - running: The system is fully operational h...
[07:46:39] PROBLEM - PHP opcache health on mwdebug1001 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[07:47:08] (CR) Ema: [C: +2] atskafka: increase queue.buffering.max.ms [puppet] - https://gerrit.wikimedia.org/r/599685 (https://phabricator.wikimedia.org/T253551) (owner: Ema)
[07:47:26] (PS1) Filippo Giunchedi: install_server: add thanos-fe/thanos-be to late_command swift uid preprovision [puppet] - https://gerrit.wikimedia.org/r/599693 (https://phabricator.wikimedia.org/T123918)
[07:49:43] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 89, down: 0, dormant: 0, excluded: 2, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:50:28] (CR) Muehlenhoff: [C: +1] "Looks good to me" [puppet] - https://gerrit.wikimedia.org/r/599693 (https://phabricator.wikimedia.org/T123918) (owner: Filippo Giunchedi)
[07:52:45] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:52:47] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atskafka site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:53:49] looks like ams-ix flapped when they added the extra link
[07:54:23] PROBLEM - Check systemd state on cp3050 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:55:15] PROBLEM - IPv4 ping to esams on ripe-atlas-esams is CRITICAL: CRITICAL - failed 77 probes of 649 (alerts on 35) - https://atlas.ripe.net/measurements/23449935/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:55:37] the cp3050 alert is due to myself, please ignore
[07:55:43] ok, thx
[07:56:03] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 144 probes of 567 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:56:13] RECOVERY - Check systemd state on cp3050 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:56:17] (CR) Filippo Giunchedi: [C: +2] install_server: add thanos-fe/thanos-be to late_command swift uid preprovision [puppet] - https://gerrit.wikimedia.org/r/599693 (https://phabricator.wikimedia.org/T123918) (owner: Filippo Giunchedi)
[07:56:18] waiting for whatever ams-ix did to stabilize
[07:56:20] !log phabricator - killed pid 25070 (git) which used 100% of CPU, restarted phd service
[07:56:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:27] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:57:00] mmhh yeah I can't reach the esams bastion now
[07:57:35] mtr stops at telia
[07:58:15] RECOVERY - BFD status on cr2-esams is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:58:22] (PS3) Elukey: Swap profile::java::analytics with profile::java [puppet] - https://gerrit.wikimedia.org/r/599389
[07:58:51] (back now)
[07:59:25] trying to figure out what's up, I can't reach bast1002 over v6 so not sure if issues with my provider or with ams-ix
[08:00:04] (PS1) Muehlenhoff: profile::analytics::database::meta: Remove support for jessie [puppet] - https://gerrit.wikimedia.org/r/599696
[08:01:11] !log phabricator - broken due to "PhabricatorRepositoryMirrorEngine::pushToGitRepository" starting git process that uses 100% CPU, stopped phd service
[08:01:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:31] FWIW mtr seems to work for me over v6 from hetzner, some loss at zayo tho
[08:01:34] (PS1) Ema: atskafka: remove trailing whitespace from queue.buffering.max.ms [puppet] - https://gerrit.wikimedia.org/r/599697 (https://phabricator.wikimedia.org/T253551)
[08:02:01] (CR) jerkins-bot: [V: -1] atskafka: remove trailing whitespace from queue.buffering.max.ms [puppet] - https://gerrit.wikimedia.org/r/599697 (https://phabricator.wikimedia.org/T253551) (owner: Ema)
[08:02:03] (PS2) Ema: atskafka: remove trailing whitespace from queue.buffering.max.ms [puppet] - https://gerrit.wikimedia.org/r/599697 (https://phabricator.wikimedia.org/T253551)
[08:02:45] godog: is v4 still working?
[08:03:04] XioNoX: yeah from home to bast3002 is working fine now
[08:03:13] (CR) Ema: [C: +2] atskafka: remove trailing whitespace from queue.buffering.max.ms [puppet] - https://gerrit.wikimedia.org/r/599697 (https://phabricator.wikimedia.org/T253551) (owner: Ema)
[08:03:14] !log add new AMS-IX link to LACP bundle
[08:03:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:26] bast3004 even
[08:03:51] ok now v4 to bast3004 from hetzner works too
[08:03:54] err, v6
[08:04:05] PROBLEM - PHD should be running on phab1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args php ./phd-daemon, UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[08:04:08] !log phabricator - restarted apache2 - back for me now
[08:04:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:48] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[08:05:16] godog: everything looks back to normal
[08:05:46] waiting for the ripe atlas check to recover
[08:05:55] RECOVERY - PHD should be running on phab1001 is OK: PROCS OK: 1 process with regex args php ./phd-daemon, UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[08:06:00] Operations, Analytics, Traffic: Spammy events coming our way for sites such us https://ru.wikipedia.kim - https://phabricator.wikimedia.org/T190843 (King77001) thats serious [[ https://uniprojectmaterials.com | project topics ]]
[08:06:30] XioNoX: yeah looks like it, I'm looking at https://grafana.wikimedia.org/d/000000479/frontend-traffic and traffic is coming back
[08:06:38] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 23 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[08:08:11] great
[08:08:35] looks like there is a bug in Junos and it doesn't want to display the stats for that one
LACP bundle...
[08:08:37] (PS4) Elukey: Swap profile::java::analytics with profile::java [puppet] - https://gerrit.wikimedia.org/r/599389
[08:08:44] ae0 and ae1 fine, ae2 nop
[08:12:41] RECOVERY - IPv4 ping to esams on ripe-atlas-esams is OK: OK - failed 8 probes of 649 (alerts on 35) - https://atlas.ripe.net/measurements/23449935/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:13:31] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 45 probes of 567 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:22:23] (PS5) Elukey: Swap profile::java::analytics with profile::java [puppet] - https://gerrit.wikimedia.org/r/599389
[08:24:23] (CR) Elukey: [C: +1] profile::analytics::database::meta: Remove support for jessie [puppet] - https://gerrit.wikimedia.org/r/599696 (owner: Muehlenhoff)
[08:25:49] (CR) Elukey: [C: +1] Remove ganglia configs from cdh and jmxtrans modules [puppet] - https://gerrit.wikimedia.org/r/599415 (https://phabricator.wikimedia.org/T253555) (owner: Ottomata)
[08:30:26] !log update swift uid/gid on thanos hosts - T123918
[08:30:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:29] T123918: 'swift' user/group IDs should be consistent across the fleet - https://phabricator.wikimedia.org/T123918
[08:31:44] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1001/22887/" [puppet] - https://gerrit.wikimedia.org/r/599389 (owner: Elukey)
[08:33:00] (PS6) Elukey: Swap profile::java::analytics with profile::java [puppet] - https://gerrit.wikimedia.org/r/599389
[08:33:45] PROBLEM - Check systemd state on thanos-be2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:35:35] RECOVERY - Check systemd state on thanos-be2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:37:54] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/599390 (https://phabricator.wikimedia.org/T252132) (owner: Ssingh)
[08:38:31] Operations, netops: No LACP info for cr2-esams:ae2 - https://phabricator.wikimedia.org/T253970 (ayounsi) p:Triage→Medium
[08:39:22] Operations, CAS-SSO, User-jbond: CAS build as a deb - https://phabricator.wikimedia.org/T233947 (jbond) Open→Resolved >>! In T233947#6172089, @MoritzMuehlenhoff wrote: > Puppet is now deployed as a deb. Resolving please reopen if there are further task
[08:39:24] Operations, Security-Team, CAS-SSO, User-jbond: Further steps for CAS/web SSO - https://phabricator.wikimedia.org/T233921 (jbond)
[08:40:41] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/599358 (https://phabricator.wikimedia.org/T235162) (owner: Muehlenhoff)
[08:41:28] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/599359 (https://phabricator.wikimedia.org/T235162) (owner: Muehlenhoff)
[08:41:47] Operations, cloud-services-team (Kanban): Migrate remaining self-hosted puppet masters to Puppet 5 / facter 3 - https://phabricator.wikimedia.org/T241719 (MoritzMuehlenhoff)
[08:42:23] Operations, cloud-services-team (Kanban): Migrate remaining self-hosted puppet masters to Puppet 5 / facter 3 - https://phabricator.wikimedia.org/T241719 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff Closing this task, puppet 5 and facter 3 have since been folded into the "main" compo...
[08:43:30] Operations, ops-codfw, User-jbond: No microcode updates loaded on puppetmaster2001/2002 after reimage to Buster - https://phabricator.wikimedia.org/T235250 (MoritzMuehlenhoff) Open→Resolved Closing this task, we now have monitoring for the application of microcode updates in place and this di...
[08:44:04] Operations, Patch-For-Review, User-jbond: Restrict GIDs for system users to 499 as the upper boundary - https://phabricator.wikimedia.org/T235162 (MoritzMuehlenhoff) a:MoritzMuehlenhoff
[08:45:38] Operations, cloud-services-team (Kanban): Remove systemd from openstack-mitaka - https://phabricator.wikimedia.org/T231793 (MoritzMuehlenhoff) Open→Resolved That component is gone in the mean time, closing the task
[08:47:09] Operations: Migrate irc.wikimedia.org/kraz to Buster - https://phabricator.wikimedia.org/T224579 (MoritzMuehlenhoff)
[08:47:21] Operations: Migrate dbmonitor hosts to Buster - https://phabricator.wikimedia.org/T224589 (MoritzMuehlenhoff)
[08:47:29] Operations, OTRS: Migrate mendelevium/OTRS host to Buster - https://phabricator.wikimedia.org/T224590 (MoritzMuehlenhoff)
[08:48:11] Operations: Migrate auth* servers to Stretch/Buster - https://phabricator.wikimedia.org/T224571 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff These have been reimaged to Buster a while ago
[08:48:13] Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (MoritzMuehlenhoff)
[08:51:18] Operations: Migrate dbmonitor hosts to Buster - https://phabricator.wikimedia.org/T224589 (jcrespo)
[08:51:23] Operations: Migrate dbmonitor hosts to Buster - https://phabricator.wikimedia.org/T224589 (jcrespo) Please don't consider dbmonitor2001 as upgraded- as the application doesn't work after os upgrade.
[08:51:32] (PS1) RhinosF1: Change $wgNamespaceRobotPolicies on Thai wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/599737
[08:56:21] (PS1) Kormat: mariadb: Enable notifications for db2139 [puppet] - https://gerrit.wikimedia.org/r/599739 (https://phabricator.wikimedia.org/T252985)
[08:56:56] Operations: decom cookbook: dry-run mode not working / PuppetDB and Debmonitor removals can fail - https://phabricator.wikimedia.org/T229998 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff Closing this task, this was addressed later by adding a delay into the decom cookbook which addresses...
[08:59:56] Operations, netops: No LACP info for cr2-esams:ae2 - https://phabricator.wikimedia.org/T253970 (ayounsi) To check after the reboot needed for T245520.
[09:02:10] (CR) RhinosF1: "I've just created the patch to do this on the other thwikis." [mediawiki-config] - https://gerrit.wikimedia.org/r/598999 (https://phabricator.wikimedia.org/T253574) (owner: Urbanecm)
[09:06:10] (PS2) RhinosF1: Change $wgNamespaceRobotPolicies on Thai wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/599737 (https://phabricator.wikimedia.org/T253578)
[09:08:23] (PS1) Kormat: maraidb: Add db2040 to s4 [puppet] - https://gerrit.wikimedia.org/r/599746 (https://phabricator.wikimedia.org/T252985)
[09:09:23] (CR) jerkins-bot: [V: -1] maraidb: Add db2040 to s4 [puppet] - https://gerrit.wikimedia.org/r/599746 (https://phabricator.wikimedia.org/T252985) (owner: Kormat)
[09:10:35] (PS1) Dzahn: site: add new appservers mw2336 through mw2339 [puppet] - https://gerrit.wikimedia.org/r/599749 (https://phabricator.wikimedia.org/T241852)
[09:10:37] (PS2) Kormat: maraidb: Add db2040 to s4 [puppet] - https://gerrit.wikimedia.org/r/599746 (https://phabricator.wikimedia.org/T252985)
[09:11:41] (PS1) Ladsgroup: mediawiki: Add api.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/599751
(https://phabricator.wikimedia.org/T246945) [09:12:03] (03CR) 10Jcrespo: maraidb: Add db2040 to s4 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/599746 (https://phabricator.wikimedia.org/T252985) (owner: 10Kormat) [09:12:19] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/599300 (owner: 10Muehlenhoff) [09:15:42] (03PS3) 10Kormat: maraidb: Add db2040 to s4 [puppet] - 10https://gerrit.wikimedia.org/r/599746 (https://phabricator.wikimedia.org/T252985) [09:16:03] (03CR) 10Kormat: maraidb: Add db2040 to s4 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/599746 (https://phabricator.wikimedia.org/T252985) (owner: 10Kormat) [09:19:19] 10Operations, 10SRE-Access-Requests: Requesting access for deployment and restricted group for cicalese - https://phabricator.wikimedia.org/T253676 (10Dzahn) The "restricted" group is a subset of the "deployment" group. There are no hosts that have only restricted but not also deployment and the sudo privilege... [09:20:15] (03CR) 10Jcrespo: [C: 04-1] "xtrabackup_info has a server_version property, which is what should be checked on --prepare, not the package installed, which may not be t" [puppet] - 10https://gerrit.wikimedia.org/r/599343 (owner: 10Kormat) [09:21:27] (03Abandoned) 10Kormat: transfer.py: Enforce mariadb version match for xtrabackup. [puppet] - 10https://gerrit.wikimedia.org/r/599343 (owner: 10Kormat) [09:22:43] (03PS1) 10Ayounsi: Add cicalese to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/599758 (https://phabricator.wikimedia.org/T253676) [09:24:27] (03CR) 10Jcrespo: [C: 03+1] "This looks good to me, but I would wait for manuel review." 
[puppet] - 10https://gerrit.wikimedia.org/r/599746 (https://phabricator.wikimedia.org/T252985) (owner: 10Kormat) [09:24:44] (03CR) 10Dzahn: "this also does something unrelated to Max Binder's account" [puppet] - 10https://gerrit.wikimedia.org/r/599758 (https://phabricator.wikimedia.org/T253676) (owner: 10Ayounsi) [09:24:50] (03PS2) 10Ayounsi: Add cicalese to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/599758 (https://phabricator.wikimedia.org/T253676) [09:26:08] (03CR) 10Dzahn: "lgtm. adding new deployers usually gets a review from releng though, a +1 would be good" [puppet] - 10https://gerrit.wikimedia.org/r/599758 (https://phabricator.wikimedia.org/T253676) (owner: 10Ayounsi) [09:28:12] (03CR) 10Jcrespo: [C: 03+1] mariadb: Enable notifications for db2139 [puppet] - 10https://gerrit.wikimedia.org/r/599739 (https://phabricator.wikimedia.org/T252985) (owner: 10Kormat) [09:29:17] (03CR) 10Kormat: [C: 03+2] mariadb: Enable notifications for db2139 [puppet] - 10https://gerrit.wikimedia.org/r/599739 (https://phabricator.wikimedia.org/T252985) (owner: 10Kormat) [09:31:45] 10Operations, 10ops-codfw, 10decommission, 10serviceops, 10Patch-For-Review: codfw: decom at least 15 appservers in codfw rack C3 to make room for new servers - https://phabricator.wikimedia.org/T247018 (10Dzahn) >>! In T247018#6169200, @Papaul wrote: > @Dzahn Please to not resolve yet. I still have mgm... [09:32:46] (03PS1) 10Ema: 0.2: pass the appropriate list of labels for the metric [software/prometheus-rdkafka-exporter] - 10https://gerrit.wikimedia.org/r/599759 (https://phabricator.wikimedia.org/T253551) [09:33:41] (03PS2) 10Ema: 0.2: pass the appropriate list of labels for the metric [software/prometheus-rdkafka-exporter] - 10https://gerrit.wikimedia.org/r/599759 (https://phabricator.wikimedia.org/T253551) [09:33:48] (03PS1) 10RhinosF1: Add localised sitename for bewikibooks. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/599760 [09:36:28] (03PS2) 10RhinosF1: Add localised sitename for bewikibooks. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599760 (https://phabricator.wikimedia.org/T253962) [09:36:37] (03CR) 10Elukey: [C: 03+1] 0.2: pass the appropriate list of labels for the metric [software/prometheus-rdkafka-exporter] - 10https://gerrit.wikimedia.org/r/599759 (https://phabricator.wikimedia.org/T253551) (owner: 10Ema) [09:51:37] (03CR) 10Jcrespo: [C: 03+1] maraidb: Add db2040 to s4 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/599746 (https://phabricator.wikimedia.org/T252985) (owner: 10Kormat) [09:52:58] ^^ kormat s/maraidb/mariadb/? OCD compliant comment [09:53:51] vgutierrez: sorry, are you not familiar with the maraidb new service? [09:54:11] am I THAT deprecated? [09:54:24] * vgutierrez cries in the corner [09:54:53] it is like mariadb but you hear the "Mery Christmas" song when it starts :-P [09:55:03] lol [09:55:12] All I Want for Christmas Is You, I meant [09:55:16] there is a file "typos" in the root of the puppet repo. it already has: mariabd, and maribdb. maybe you want to add it [09:56:05] it's ma-raid-b. it's a, uh, disk array config. 
[09:57:22] (03PS4) 10Kormat: mariadb: Add db2040 to s4 [puppet] - 10https://gerrit.wikimedia.org/r/599746 (https://phabricator.wikimedia.org/T252985) [09:57:29] (03PS1) 10Jcrespo: typos: Add maraidb to the list of detected typos [puppet] - 10https://gerrit.wikimedia.org/r/599768 [09:57:49] hahah [09:58:10] (03CR) 10Dzahn: [C: 03+1] typos: Add maraidb to the list of detected typos [puppet] - 10https://gerrit.wikimedia.org/r/599768 (owner: 10Jcrespo) [09:58:14] (03PS1) 10Jbond: puppetmaster2003: add AAAA [dns] - 10https://gerrit.wikimedia.org/r/599769 (https://phabricator.wikimedia.org/T253173) [09:58:18] I am serious about the patch [09:58:23] (03CR) 10Kormat: [C: 03+1] "Begrudging +1" [puppet] - 10https://gerrit.wikimedia.org/r/599768 (owner: 10Jcrespo) [09:58:32] (03CR) 10Jcrespo: [C: 03+2] typos: Add maraidb to the list of detected typos [puppet] - 10https://gerrit.wikimedia.org/r/599768 (owner: 10Jcrespo) [09:58:47] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster2003: add AAAA [dns] - 10https://gerrit.wikimedia.org/r/599769 (https://phabricator.wikimedia.org/T253173) (owner: 10Jbond) [10:00:00] (03PS5) 10Kormat: mariadb: Add db2040 to s4 [puppet] - 10https://gerrit.wikimedia.org/r/599746 (https://phabricator.wikimedia.org/T252985) [10:00:41] OCD happiness++, thanks kormat <3 [10:01:14] (03CR) 10Kormat: mariadb: Add db2040 to s4 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/599746 (https://phabricator.wikimedia.org/T252985) (owner: 10Kormat) [10:01:26] vgutierrez: thanks for spotting it :) [10:01:44] 10Operations: FY2020-2021 Q1 codfw -> eqiad switchback - https://phabricator.wikimedia.org/T243318 (10Marostegui) [10:02:20] (03PS1) 10Elukey: Set Debian Buster for druid1001 [puppet] - 10https://gerrit.wikimedia.org/r/599770 (https://phabricator.wikimedia.org/T253980) [10:02:37] !log Compress InnoDB on db1138 T232446 [10:02:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:42] T232446: Compress new Wikibase tables - 
https://phabricator.wikimedia.org/T232446 [10:03:47] (03PS1) 10Jcrespo: mariadb: Super important patch [puppet] - 10https://gerrit.wikimedia.org/r/599771 [10:04:22] (03CR) 10Elukey: [C: 03+2] Set Debian Buster for druid1001 [puppet] - 10https://gerrit.wikimedia.org/r/599770 (https://phabricator.wikimedia.org/T253980) (owner: 10Elukey) [10:04:28] (03CR) 10Jcrespo: [C: 04-2] mariadb: Super important patch [puppet] - 10https://gerrit.wikimedia.org/r/599771 (owner: 10Jcrespo) [10:04:40] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Super important patch [puppet] - 10https://gerrit.wikimedia.org/r/599771 (owner: 10Jcrespo) [10:05:07] ^works as intended [10:05:35] (03Abandoned) 10Jcrespo: mariadb: Super important patch [puppet] - 10https://gerrit.wikimedia.org/r/599771 (owner: 10Jcrespo) [10:05:49] (03PS2) 10Jbond: puppetmaster2003: add AAAA [dns] - 10https://gerrit.wikimedia.org/r/599769 (https://phabricator.wikimedia.org/T253173) [10:06:18] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster2003: add AAAA [dns] - 10https://gerrit.wikimedia.org/r/599769 (https://phabricator.wikimedia.org/T253173) (owner: 10Jbond) [10:07:14] (03CR) 10Vgutierrez: puppetmaster2003: add AAAA (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/599769 (https://phabricator.wikimedia.org/T253173) (owner: 10Jbond) [10:08:46] (03CR) 10Jcrespo: [C: 03+1] "Cool to me, just wanted to make sure it was intentional :-D" [puppet] - 10https://gerrit.wikimedia.org/r/599746 (https://phabricator.wikimedia.org/T252985) (owner: 10Kormat) [10:10:36] (03CR) 10Jcrespo: [C: 03+1] "Answering your question:" [puppet] - 10https://gerrit.wikimedia.org/r/599746 (https://phabricator.wikimedia.org/T252985) (owner: 10Kormat) [10:11:20] jynus: thanks for the answer. icinga === sadness, as suspected [10:11:49] well, like everything it is relative [10:12:07] hmm. could i downtime the host for like 30mins, then enable notifications, then downtime just the relevant checks? 
[10:12:10] would I prefer a slightly different model: I do, and actually observability are working on it [10:12:21] but it works so far [10:12:44] (03PS3) 10Jbond: puppetmaster2003: add AAAA [dns] - 10https://gerrit.wikimedia.org/r/599769 (https://phabricator.wikimedia.org/T253173) [10:12:51] kormat: the main issue [10:12:58] is that if you downtime all services [10:13:09] that doesn't affect new services that are setup after downtime [10:13:13] (03CR) 10Jbond: "fixed thanks" (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/599769 (https://phabricator.wikimedia.org/T253173) (owner: 10Jbond) [10:13:38] :(( [10:13:39] in other words, "downtime all services" means "downtime this list of named checks" [10:13:45] if the check changes name [10:13:52] icingaaaaaaa *shakes fist* [10:13:55] or a new one apears it doesn't apply [10:14:08] the other thing is downtime and alerting are separate [10:14:29] if you downtime a service after it started alerting [10:14:36] it will alert for the ok again [10:14:42] * kormat sobs [10:14:52] you may want to disable notifications instead [10:15:19] it is weird, so many people end up creating alerts when dealing with edge cases [10:15:41] but once you learn how it works it is consistent [10:16:03] again, I still would like a different model, and I think many people would agree with that [10:16:18] https://wikitech.wikimedia.org/wiki/Icinga#Avoid_Icinga_spam_on_new_server_installs [10:16:57] yeah, but if I remember correctly, even the very much perfected wmf-auto-reimage [10:17:02] (03CR) 10Jbond: [C: 03+2] puppetmaster2003: add AAAA [dns] - 10https://gerrit.wikimedia.org/r/599769 (https://phabricator.wikimedia.org/T253173) (owner: 10Jbond) [10:17:05] could run into race conditions [10:17:21] and that is after someone spent a lot of hours making it good [10:18:06] I would change the above into kormat's strategy [10:18:17] if you "ACK" a service that means it is silent until the next state change, that is the "will alert for the OK 
again". But if it's in a downtime it won't alert for an OK . [10:18:17] which is adding a hiera key disabling notifications [10:18:27] as long as the detached and delayed dowtime cookbook suceed the reimage is race-free ;) [10:18:35] oh, reimage yes [10:18:43] it is icinga what I don't trust [10:19:06] I meant race free for false alerts [10:19:29] I belive last iteration made those very very unlikely? [10:19:53] because it could happen that icinga puppet finishes 1 second after setup or something like that [10:20:35] mutante: oh, god. 🤮 [10:20:49] in any case, I would change https://wikitech.wikimedia.org/wiki/Icinga#Avoid_Icinga_spam_on_new_server_installs into "disable notifications on hiera" as kormat is doing [10:21:05] I think it is just easier [10:21:26] jynus: and doesn't involve blocking other monitoring changes in the meantime. [10:21:38] kormat: that is working as intended. an ACK is meant to start alerting again once things change. A downtime means it will stay silent until the downtime expires. [10:21:49] mutante: i meant the procedure you linked [10:21:59] yeah, that is why I said that it is consistent [10:22:09] but I would prefer a different model [10:22:41] i would really not recommend disabling notifications. it leads to https://phabricator.wikimedia.org/T149643 [10:22:49] but if I start saying to control on netbox a server lifecycle I will get jumped on [10:23:01] there is also https://phabricator.wikimedia.org/T252002 for the discussion [10:23:26] mutante: if it is on puppet it is difficult they will be forgotten [10:23:30] i have never seen an alert of something during a downtime [10:24:12] mutante: I can show you how if you downtime a service after downtimed, it will alert on UP even during downtime [10:24:23] because downtime doesn't cancel ongoing issues, only new ones [10:26:50] mutante: so, you feel the 6-step procedure you linked that involves disabling puppet on icinga is superior to disabling notifications for a host? 
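Editor's sketch of the ACK vs. downtime behaviour being debated above. This is a toy model of the semantics as the participants describe them, not Icinga's actual state machine: the class `ToyCheck`, its fields, and the minute-based clock are all hypothetical. It encodes three claims from the log: an ACK is silent only until the next state change, a downtime set before a problem stays silent, and (jynus's claim, which mutante could not confirm) a recovery still alerts if the problem had already alerted before the downtime was set.

```python
from dataclasses import dataclass


@dataclass
class ToyCheck:
    """Toy model of one check; NOT Icinga's real implementation."""
    state: str = "OK"
    acked: bool = False
    downtime_until: int = 0          # minutes on an arbitrary clock
    problem_notified: bool = False   # a problem alert already went out

    def ack(self) -> None:
        self.acked = True

    def downtime(self, now: int, minutes: int) -> None:
        self.downtime_until = now + minutes

    def change_state(self, new_state: str, now: int) -> bool:
        """Apply a state change; return True if it would notify."""
        if new_state == self.state:
            return False
        self.state = new_state
        self.acked = False  # an ACK only lasts until the next state change
        recovery = new_state == "OK"
        if now < self.downtime_until:
            # Claim from the log: a recovery still alerts if the problem
            # had already alerted before the downtime was set.
            notify = recovery and self.problem_notified
        else:
            notify = True
        if recovery:
            self.problem_notified = False
        elif notify:
            self.problem_notified = True
        return notify


# Downtime set after the alert already fired: the recovery notifies.
late = ToyCheck()
late.change_state("CRITICAL", now=0)
late.downtime(now=1, minutes=30)
print(late.change_state("OK", now=10))
```

Under these assumptions, setting the downtime before the first failure keeps both the problem and the recovery silent, which is the case the "downtime the host right after the icinga puppet run" procedure relies on.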
[10:26:59] what you are describing sounds like the behaviour of ACK [10:28:10] (03PS1) 10Muehlenhoff: Extend MOU for tiziano and add Martin Gerlach as new point of contact [puppet] - 10https://gerrit.wikimedia.org/r/599778 [10:28:19] kormat: yea, for adding multiple new hosts i think it is. you can't disable notifications for services before they exist [10:28:25] RECOVERY - PHP opcache health on mwdebug1001 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:29:17] mutante: in this case (https://gerrit.wikimedia.org/r/c/operations/puppet/+/599746) i'm configuring a host that already exists [10:29:40] the CR adds it to a mariadb role, which will cause icinga to start alerting [10:30:46] ideally i'd like to be able to remove `profile::base::notifications: disabled` in the same CR [10:32:02] yea, the procedure above was for adding the mw-appserver role to machines that already exist in site.pp [10:32:06] so it's similar [10:32:34] (03PS1) 10Jbond: AAAA: for flerovium and furud [dns] - 10https://gerrit.wikimedia.org/r/599779 [10:32:34] the difference is you have only 1 server though [10:32:52] yes. and it can take multiple hours for icinga to go green [10:32:54] yeah it can be simplified for 1 server [10:33:00] (03CR) 10jerkins-bot: [V: 04-1] AAAA: for flerovium and furud [dns] - 10https://gerrit.wikimedia.org/r/599779 (owner: 10Jbond) [10:33:16] but kormat need the role applied on puppet to even start setting it up [10:33:30] mariadb must exist before the service is provisioned [10:33:34] so what can be done is: [10:33:38] and that is created by puppet [10:33:54] please don't tell me to create a mariadb::core_in_setup role [10:34:06] - re-enable notifications in puppet in that CR, puppet-merge it [10:34:22] (what i'm really missing right now is prom alertmanager. 
there i could trivially create a silence to match everything for host X, even before host X was setup) [10:34:47] volans: then icinga runs and we all get a page :-D [10:34:51] (03PS2) 10Jbond: AAAA: for flerovium and furud [dns] - 10https://gerrit.wikimedia.org/r/599779 (https://phabricator.wikimedia.org/T253173) [10:34:52] let me finish [10:34:55] he he [10:34:58] please let volans finish [10:35:32] - run puppet on db2140, as soon as it gets the catalog from the master: Info: Applying configuration.... [10:35:39] - run puppet on icinga1001 [10:35:58] - as soon as the puppet run on icinga finish downtime the host for 30m [10:36:15] - after few minutes, when all is green and you have only 3 alerts red [10:36:21] downtime those 3 for X hours [10:36:28] it's mostly manual, I know [10:36:36] that wont work, it will alert on up [10:36:46] why? [10:36:53] the immediate downtime for 30m covers them [10:36:55] don't ask me, ask icinga [10:37:01] downtiming before they go HARD CRITICAL [10:37:01] nope, it will alert on up [10:37:38] also you most likelu have downtimed them before even the first check is done [10:37:42] when they are pending [10:37:56] I don't know, but we agreed with alex that "running to downtime a service" was a bad idea some time ago [10:37:58] i can't confirm this. i did not have alerts with the procedure linked [10:38:22] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good." [dns] - 10https://gerrit.wikimedia.org/r/599779 (https://phabricator.wikimedia.org/T253173) (owner: 10Jbond) [10:38:28] kormat was asking a way to disable notifications in the same CR, this is the best I can offer with the current setup [10:38:38] *remove disable notification [10:39:09] volans: ack, thanks. 
i think in the end it's just much simpler to do 2 puppet CRs [10:39:24] I still belive kormat's method is superior on all cases (even if not perfect [10:39:37] given the constraints [10:39:57] and to avoid the stress of paging because "you didn't run fast enough to click a button" [10:40:07] +2 [10:40:09] (03CR) 10Muehlenhoff: [C: 03+2] Extend MOU for tiziano and add Martin Gerlach as new point of contact [puppet] - 10https://gerrit.wikimedia.org/r/599778 (owner: 10Muehlenhoff) [10:40:48] he is setting up 20 servers at the same time, it is trivial to do a check that everything was enabled at the end of the batch, unlikely to produce forgotten disablings [10:41:04] specialyl because there IS a log of the actions (it is on puppet) [10:41:13] we have 148 hosts with disabled notifications, how to ensure we didn't left anyone by mistake? [10:41:14] while icinga downtimes will be lost on restart/reset [10:41:17] on the other hand we just said the only difference is he has only 1 server [10:41:26] now it's 20 [10:41:40] volans: checking if they have been disabled on puppet [10:41:41] "jynus| while icinga downtimes will be lost on restart/reset" restart/reset of what? 
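Editor's sketch of the "check whether they have been disabled on puppet" idea raised above: a minimal scan of hiera YAML files for the `profile::base::notifications: disabled` key quoted earlier in the log. The directory layout (one `<host>.yaml` file per host) and the function name are assumptions for illustration, and the match is a plain-text regex rather than a real YAML parse.

```python
import re
from pathlib import Path
from typing import List

# Matches the hiera key quoted in the log, with optional quoting
# of the value and an optional trailing comment.
DISABLED = re.compile(
    r"""^[ \t]*profile::base::notifications:[ \t]*['"]?disabled['"]?[ \t]*(#.*)?$""",
    re.MULTILINE,
)


def hosts_with_notifications_disabled(hiera_dir: Path) -> List[str]:
    """Return the names of host files that still disable notifications.

    Assumes (hypothetically) one YAML file per host under hiera_dir,
    named after the host.
    """
    hosts = []
    for path in hiera_dir.rglob("*.yaml"):
        if DISABLED.search(path.read_text(encoding="utf-8")):
            hosts.append(path.stem)
    return sorted(hosts)
```

Run at the end of a batch of host setups, a list like this could be compared against the hosts that are expected to be in service, so a forgotten override shows up in review rather than in a missed page.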
[10:42:22] 1) if the checks change name [10:42:40] 2) if icinga gets overloaded and the internal db gets out of space [10:42:59] 3) someone clicks the wrong button [10:42:59] 2) can't happen now, was fixed 1 or 2y ago [10:44:49] there is no button clicking involved in either of the methods [10:44:53] I think that the solution here should be to fix the checks, that seems that should be aware of some other "is this host in prod and needs paging" [10:45:01] state [10:45:12] (03CR) 10Muehlenhoff: [C: 03+2] Unconditionally apply SubjectAltNameWarning [puppet] - 10https://gerrit.wikimedia.org/r/599300 (owner: 10Muehlenhoff) [10:45:19] volans: the real fix is simpler: get rid of icinga [10:45:35] except that's not simple at all [10:45:35] not necesarilly [10:45:47] model can change without touching icinga [10:45:50] mutante: touché [10:46:08] I mentioned netbox-like approach to handle the lifecycle of servers [10:46:33] but I won't elaborate further before f*idon tells me that is not a good idea [10:46:35] :-D [10:46:58] but you get the general model change I am refering to [10:47:19] a dynamic, not on puppet, way to setup a host status [10:47:26] don't we already have the prod state in etcd via dbctl? 
[10:47:38] if it's not pooled don't page [10:47:39] uf [10:47:45] volans: partially [10:48:01] dbctl only contains databases that mediawiki cares about [10:48:02] then giuseppe will shout at me for using conftool by icinga :-D [10:49:12] not only that, a) it should page before pooling and b) icinga cannot change some things dynamically [10:49:35] I cannot make a check a warning based on dynamic logic of a check [10:49:41] or make it page or not [10:49:55] let me rephrase that [10:50:04] icinga doesn't know about warnings/crits [10:50:16] and the check doesn't know about paging/icinga setup [10:50:23] it is all very static [10:51:22] I cannot disable paging or disable a check right now based on dbctl, but not because of icinga, or at least not directly- but because how we configure it [10:53:00] sure, but there are middle ways if you want to improve the current state [10:53:27] !log updating mwdebug2002 to 7.2.31 [10:53:29] (03PS1) 10Jbond: profile/manifests/dumps: enable ipv6 drop ferm rule for 443 [puppet] - 10https://gerrit.wikimedia.org/r/599783 (https://phabricator.wikimedia.org/T253173) [10:53:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:33] volans: exactly my point :-D [10:53:55] (03CR) 10Ema: [C: 03+2] 0.2: pass the appropriate list of labels for the metric [software/prometheus-rdkafka-exporter] - 10https://gerrit.wikimedia.org/r/599759 (https://phabricator.wikimedia.org/T253551) (owner: 10Ema) [10:54:32] I might have misunderstood but it doesn't seem to me you're not interested in trying those middle ways [10:54:43] *too may negations, sorry [10:54:49] I might have misunderstood but it doesn't seem to me you're interested in trying those middle ways [11:01:12] !log upload prometheus-rdkafka-exporter 0.2 to buster-wikimedia T253551 [11:01:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:16] T253551: atskafka: expose rdkafka metrics to prometheus - 
https://phabricator.wikimedia.org/T253551 [11:01:29] (03PS1) 10Ema: 0.9: build against rdkafka-exporter 0.2 [software/atskafka] - 10https://gerrit.wikimedia.org/r/599786 (https://phabricator.wikimedia.org/T253551) [11:06:08] either way I think having Alertmanager isn't too far away at this point, is it [11:06:22] which probably goes a long way towards fixing these issues? [11:08:09] <_joe_> mark: I think the underlying problem is going to be fixed by a tighter integration between netbox and how we set up infrastructure. But yes, in this specific case we're struggling with a very peculiar icinga sillyness [11:08:23] <_joe_> with a set of very specific, even :P [11:12:28] (03CR) 10Ema: [C: 03+2] 0.9: build against rdkafka-exporter 0.2 [software/atskafka] - 10https://gerrit.wikimedia.org/r/599786 (https://phabricator.wikimedia.org/T253551) (owner: 10Ema) [11:13:29] 10Operations: Migrate role::bastionhost::general and role::bastionhost::pop to Buster - https://phabricator.wikimedia.org/T253779 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [11:14:05] right [11:14:35] so in general, replacing icinga with something better will help either way ;) [11:14:58] 10Operations: Migrate role::bastionhost::general and role::bastionhost::pop to Buster - https://phabricator.wikimedia.org/T253779 (10MoritzMuehlenhoff) The reimage for the edge bastions is soft-blocked on the local Prometheus servers co-hosted on the bastions being into local Prometheus servers running on Ganeti. [11:16:33] (03CR) 10JMeybohm: [C: 03+1] blubberoid: ifguard volumes stanza [deployment-charts] - 10https://gerrit.wikimedia.org/r/599338 (owner: 10Alexandros Kosiaris) [11:17:07] (03CR) 10JMeybohm: [C: 03+1] zotero: if guard the volumes stanza [deployment-charts] - 10https://gerrit.wikimedia.org/r/599337 (owner: 10Alexandros Kosiaris) [11:23:31] (03CR) 10JMeybohm: "> Interestingly, that value is set nowhere in our deploys and hence the environment variable UPSTREAM_TIMEOUT ends up as an empty one." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/599336 (owner: 10Alexandros Kosiaris) [11:28:12] (03CR) 10JMeybohm: [C: 03+1] eventgate: Port tls.upstream_timeout in values.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/599336 (owner: 10Alexandros Kosiaris) [11:28:54] (03CR) 10JMeybohm: [C: 03+1] zotero: ifguard deployment annotations [deployment-charts] - 10https://gerrit.wikimedia.org/r/599335 (owner: 10Alexandros Kosiaris) [11:31:02] (03CR) 10JMeybohm: [C: 03+1] eventgate: Use gotemplate comment syntax [deployment-charts] - 10https://gerrit.wikimedia.org/r/599334 (owner: 10Alexandros Kosiaris) [11:31:45] (03PS1) 10Jbond: theemin.codfw.wmnet: add AAAA [dns] - 10https://gerrit.wikimedia.org/r/599798 (https://phabricator.wikimedia.org/T253173) [11:32:46] !log installing cups security updates (client-side libs/tools) [11:32:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:49] Can someone give https://phabricator.wikimedia.org/T253988 a look? [11:34:59] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/599798 (https://phabricator.wikimedia.org/T253173) (owner: 10Jbond) [11:35:09] (03CR) 10JMeybohm: [C: 03+1] chromium-render: Move ports to the debug pattern [deployment-charts] - 10https://gerrit.wikimedia.org/r/599333 (owner: 10Alexandros Kosiaris) [11:36:25] (03CR) 10JMeybohm: [C: 03+1] eventstreams: Use gotemplate comment syntax [deployment-charts] - 10https://gerrit.wikimedia.org/r/599329 (owner: 10Alexandros Kosiaris) [11:37:34] (03CR) 10JMeybohm: [C: 03+1] debug: Don't pass nodePort: null across all charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/599330 (owner: 10Alexandros Kosiaris) [11:38:40] 10Operations, 10SRE-Access-Requests: Adding Italian Wikinews to Google Search Console to add it to Google News - https://phabricator.wikimedia.org/T253988 (10Majavah) See https://wikitech.wikimedia.org/wiki/Google_Search_Console_access. 
In short, you need a really good reason, a valid NDA and a sponsor from WM... [11:43:13] (03PS1) 10Jbond: mongodb: enable ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/599803 (https://phabricator.wikimedia.org/T253173) [11:44:12] (03CR) 10Jbond: [C: 03+2] theemin.codfw.wmnet: add AAAA [dns] - 10https://gerrit.wikimedia.org/r/599798 (https://phabricator.wikimedia.org/T253173) (owner: 10Jbond) [11:44:23] (03CR) 10JMeybohm: [C: 04-1] "Why create the secret at all if it does not contain data?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/599332 (owner: 10Alexandros Kosiaris) [11:46:20] (03CR) 10Jbond: "pcc: https://puppet-compiler.wmflabs.org/compiler1001/22888/" [puppet] - 10https://gerrit.wikimedia.org/r/599803 (https://phabricator.wikimedia.org/T253173) (owner: 10Jbond) [11:46:22] (03CR) 10Muehlenhoff: "This is probably no longer needed, mongodb went non-free and isn't part of Debian after Stretch anymore. Following that the Performance Te" [puppet] - 10https://gerrit.wikimedia.org/r/599803 (https://phabricator.wikimedia.org/T253173) (owner: 10Jbond) [11:46:53] 10Operations: Migrate role::bastionhost::general and role::bastionhost::pop to Buster - https://phabricator.wikimedia.org/T253779 (10Dzahn) These are physical machines, not VMs. I assume we want to keep it that way for bastions? Are are ganeti VMs an option for us even for bastions? [11:47:20] 10Operations: Migrate role::bastionhost::general and role::bastionhost::pop to Buster - https://phabricator.wikimedia.org/T253779 (10MoritzMuehlenhoff) These need to remain physical servers. [11:48:17] 10Operations, 10SRE-Access-Requests: Adding Italian Wikinews to Google Search Console to add it to Google News - https://phabricator.wikimedia.org/T253988 (10Ferdi2005) And could someone at WMF work with me to add Wikinews in Italian to Google News without giving me access? 
[11:48:31] (03Abandoned) 10Jbond: mongodb: enable ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/599803 (https://phabricator.wikimedia.org/T253173) (owner: 10Jbond) [11:49:30] Hi! [11:49:33] https://phabricator.wikimedia.org/T253988 [11:49:34] 10Operations, 10SRE-Access-Requests: Adding Italian Wikinews to Google Search Console to add it to Google News - https://phabricator.wikimedia.org/T253988 (10RhinosF1) >>! In T253988#6176514, @Ferdi2005 wrote: > And could someone at WMF work with me to add Wikinews in Italian to Google News without giving me a... [11:49:54] I’d like to add Italian Wikinews to Google News [11:50:06] Couls someone at WMF operations work with me? [11:50:08] *Could [11:50:27] So I can prepare everything and then someone at WMF can send the request via search console? [11:51:47] (03CR) 10JMeybohm: [C: 03+1] all charts: Only declare nodePort if specified [deployment-charts] - 10https://gerrit.wikimedia.org/r/599331 (owner: 10Alexandros Kosiaris) [11:53:41] (03CR) 10JMeybohm: [C: 03+1] "> I am overly pedantic" [deployment-charts] - 10https://gerrit.wikimedia.org/r/599327 (owner: 10Alexandros Kosiaris) [11:55:15] (03PS3) 10JMeybohm: kafka-dev: Drop redundant YAML doc starts [deployment-charts] - 10https://gerrit.wikimedia.org/r/598279 (owner: 10Alexandros Kosiaris) [11:55:44] (03CR) 10JMeybohm: [C: 03+1] "Just fixed a typo in commit message" [deployment-charts] - 10https://gerrit.wikimedia.org/r/598279 (owner: 10Alexandros Kosiaris) [11:55:52] (03CR) 10Ssingh: [C: 03+2] dnsdist: allow DoT (DNS-over-TLS) [puppet] - 10https://gerrit.wikimedia.org/r/599390 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [11:58:13] ferdi2005: sure, we have a rotating (weekly) schedule of SREs who handle such access requests (called "SRE clinic duty"), T253988 will be handled as part of that# [11:58:14] T253988: Adding Italian Wikinews to Google Search Console to add it to Google News - https://phabricator.wikimedia.org/T253988 [12:00:35] (03PS1) 
10RhinosF1: Enable VE on bnwikibook's wikijunior namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599810 [12:01:19] (03CR) 10JMeybohm: [C: 04-1] Probes: If guard them in all charts (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/599328 (owner: 10Alexandros Kosiaris) [12:02:49] 10Operations: Integrate Buster 10.4 point update - https://phabricator.wikimedia.org/T252394 (10MoritzMuehlenhoff) [12:05:54] (03PS2) 10RhinosF1: Enable VE on bnwikibook's wikijunior namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599810 (https://phabricator.wikimedia.org/T241893) [12:08:09] (03CR) 10Alexandros Kosiaris: "Ah yes, I remember the details now. Interestingly, the exact same pattern of the args parameter is followed more or less across all servic" [deployment-charts] - 10https://gerrit.wikimedia.org/r/598055 (owner: 10JMeybohm) [12:10:41] (03PS1) 10Alexandros Kosiaris: Create namespaces/calico rules for new services [deployment-charts] - 10https://gerrit.wikimedia.org/r/599812 (https://phabricator.wikimedia.org/T225680) [12:13:59] (03PS1) 10Jbond: sretest: add AAAA records [dns] - 10https://gerrit.wikimedia.org/r/599813 (https://phabricator.wikimedia.org/T253173) [12:15:13] !log roll-restart to upgrade thanos to 0.13.0rc0 - T252186 T233956 [12:15:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:17] T252186: Deploy Thanos (Prometheus long-term storage) stateful components - https://phabricator.wikimedia.org/T252186 [12:15:17] T233956: Deploy Thanos (long-term storage) stateless components: sidecar and query - https://phabricator.wikimedia.org/T233956 [12:15:19] (03CR) 10Jbond: [C: 03+2] sretest: add AAAA records [dns] - 10https://gerrit.wikimedia.org/r/599813 (https://phabricator.wikimedia.org/T253173) (owner: 10Jbond) [12:28:39] (03PS1) 10Dzahn: bastionhost: include TFTP and DHCP profiles in bastion hosts in POPs [puppet] - 10https://gerrit.wikimedia.org/r/599817 
(https://phabricator.wikimedia.org/T252526) [12:30:54] 10Operations, 10ops-eqiad: Degraded RAID on restbase-dev1004 - https://phabricator.wikimedia.org/T253607 (10Cmjohnson) @wiki_willy I submitted a ticket with HPE, we'll see what they say Your case was successfully submitted. Please note your Case ID: 5347610050 for future reference. [12:32:44] 10Operations, 10ops-codfw: Degraded RAID on restbase2009 - https://phabricator.wikimedia.org/T253715 (10hnowlan) Thanks @Papaul! I'm reimaging this host now. [12:34:48] (03PS2) 10Dzahn: bastionhost: include TFTP and DHCP profiles in bastion hosts in POPs [puppet] - 10https://gerrit.wikimedia.org/r/599817 (https://phabricator.wikimedia.org/T252526) [12:35:11] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime [12:35:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:40] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:37:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:47] (03PS1) 10Cmjohnson: Removing old mgmt dns entries for rhodium [dns] - 10https://gerrit.wikimedia.org/r/599820 [12:42:01] (03PS3) 10Cmjohnson: Removing mgmt dns for asset tags associated w/elastic1020-1031 [dns] - 10https://gerrit.wikimedia.org/r/597894 (https://phabricator.wikimedia.org/T239821) [12:42:05] (03CR) 10jerkins-bot: [V: 04-1] Removing old mgmt dns entries for rhodium [dns] - 10https://gerrit.wikimedia.org/r/599820 (owner: 10Cmjohnson) [12:42:53] (03CR) 10Cmjohnson: [C: 03+2] Removing mgmt dns for asset tags associated w/elastic1020-1031 [dns] - 10https://gerrit.wikimedia.org/r/597894 (https://phabricator.wikimedia.org/T239821) (owner: 10Cmjohnson) [12:43:11] (03PS3) 10Jbond: build.gradle: add memcached support to cas blob [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/592659 (https://phabricator.wikimedia.org/T233931) [12:43:25] (03PS2) 10Cmjohnson: Removing old mgmt dns entries for rhodium [dns] - 
10https://gerrit.wikimedia.org/r/599820 [12:48:07] (03CR) 10Cmjohnson: [C: 03+2] Removing old mgmt dns entries for rhodium [dns] - 10https://gerrit.wikimedia.org/r/599820 (owner: 10Cmjohnson) [12:49:26] 10Operations, 10ops-codfw: Degraded RAID on restbase2009 - https://phabricator.wikimedia.org/T253715 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by hnowlan on cumin1001.eqiad.wmnet for hosts: ` restbase2009.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202005291249_hnowlan_73... [12:49:55] !log reimaging restbase2009 after disk replacement [12:49:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:34] 10Operations, 10ops-eqiad, 10decommission, 10User-jbond: decommission rhodium - https://phabricator.wikimedia.org/T235503 (10Cmjohnson) [12:51:04] 10Operations, 10ops-eqiad, 10decommission, 10User-jbond: decommission rhodium - https://phabricator.wikimedia.org/T235503 (10Cmjohnson) 05Open→03Resolved Removed from rack, dns removed, removed from network switch cfg, updated netbox. [12:56:44] (03PS1) 10Elukey: Add specific overrides for the zookeeper version on druid1001 [puppet] - 10https://gerrit.wikimedia.org/r/599829 (https://phabricator.wikimedia.org/T253980) [12:57:44] (03PS2) 10Elukey: Add specific overrides for the zookeeper version on druid1001 [puppet] - 10https://gerrit.wikimedia.org/r/599829 (https://phabricator.wikimedia.org/T253980) [12:58:00] (03CR) 10Elukey: [C: 03+2] Add specific overrides for the zookeeper version on druid1001 [puppet] - 10https://gerrit.wikimedia.org/r/599829 (https://phabricator.wikimedia.org/T253980) (owner: 10Elukey) [13:00:34] (03PS3) 10Dzahn: bastionhost: include TFTP and DHCP profiles in bastion hosts in POPs [puppet] - 10https://gerrit.wikimedia.org/r/599817 (https://phabricator.wikimedia.org/T252526) [13:06:09] (03CR) 10Hashar: [C: 04-1] "We still have contint1001 running Jessie but it should be reimaged soonish." 
[puppet] - 10https://gerrit.wikimedia.org/r/598990 (owner: 10Muehlenhoff) [13:08:42] (03PS1) 10Cmjohnson: Removing mgmt dns for decom host americium [dns] - 10https://gerrit.wikimedia.org/r/599841 (https://phabricator.wikimedia.org/T245038) [13:09:37] 10Operations, 10ops-eqiad, 10decommission, 10fundraising-tech-ops, 10Patch-For-Review: decommission americium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T245038 (10Cmjohnson) [13:09:45] (03PS2) 10Cmjohnson: Removing mgmt dns for decom host americium [dns] - 10https://gerrit.wikimedia.org/r/599841 (https://phabricator.wikimedia.org/T245038) [13:10:15] (03CR) 10Cmjohnson: [C: 03+2] Removing mgmt dns for decom host americium [dns] - 10https://gerrit.wikimedia.org/r/599841 (https://phabricator.wikimedia.org/T245038) (owner: 10Cmjohnson) [13:10:34] 10Operations, 10ops-codfw: Degraded RAID on restbase2009 - https://phabricator.wikimedia.org/T253715 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['restbase2009.codfw.wmnet'] ` Of which those **FAILED**: ` ['restbase2009.codfw.wmnet'] ` [13:13:00] 10Operations, 10ops-eqiad, 10decommission, 10fundraising-tech-ops: decommission americium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T245038 (10Cmjohnson) 05Open→03Resolved Removed from rack, mgmt dns removed, netbox updated [13:14:13] (03PS1) 10Elukey: Force Java 11 JRE for zookeeper on druid1001 [puppet] - 10https://gerrit.wikimedia.org/r/599846 (https://phabricator.wikimedia.org/T253980) [13:14:54] (03PS1) 10Cmjohnson: Removing old mgmt dns entry for bismuth's asset tag [dns] - 10https://gerrit.wikimedia.org/r/599848 (https://phabricator.wikimedia.org/T248516) [13:15:25] (03PS2) 10Elukey: Force Java 11 JRE for zookeeper on druid1001 [puppet] - 10https://gerrit.wikimedia.org/r/599846 (https://phabricator.wikimedia.org/T253980) [13:15:46] (03CR) 10Elukey: [C: 04-1] "This turned out to be non very great on druid, since we need java 11 in there, and following this path we cannot really do it." 
[puppet] - 10https://gerrit.wikimedia.org/r/599389 (owner: 10Elukey) [13:15:56] (03CR) 10Cmjohnson: [C: 03+2] Removing old mgmt dns entry for bismuth's asset tag [dns] - 10https://gerrit.wikimedia.org/r/599848 (https://phabricator.wikimedia.org/T248516) (owner: 10Cmjohnson) [13:17:45] 10Operations, 10ops-eqiad, 10decommission, 10fundraising-tech-ops, 10Patch-For-Review: decommission bismuth.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T248516 (10Cmjohnson) 05Open→03Resolved removed from rack, dns removed, netbox updated. [13:17:51] (03PS4) 10Dzahn: bastionhost: include TFTP and DHCP profiles in bastion hosts in POPs [puppet] - 10https://gerrit.wikimedia.org/r/599817 (https://phabricator.wikimedia.org/T252526) [13:20:25] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/22891/" [puppet] - 10https://gerrit.wikimedia.org/r/599846 (https://phabricator.wikimedia.org/T253980) (owner: 10Elukey) [13:22:12] (03PS1) 10Gehel: Revert "enable dumps of structured data from commons" [puppet] - 10https://gerrit.wikimedia.org/r/599856 (https://phabricator.wikimedia.org/T221917) [13:23:55] (03PS2) 10Gehel: Revert "enable dumps of structured data from commons" [puppet] - 10https://gerrit.wikimedia.org/r/599856 (https://phabricator.wikimedia.org/T221917) [13:24:07] (03CR) 10ArielGlenn: [C: 03+1] "Sad face. But you gotta do what ya gotta do!" 
[puppet] - 10https://gerrit.wikimedia.org/r/599856 (https://phabricator.wikimedia.org/T221917) (owner: 10Gehel) [13:25:00] (03CR) 10Gehel: [C: 03+2] Revert "enable dumps of structured data from commons" [puppet] - 10https://gerrit.wikimedia.org/r/599856 (https://phabricator.wikimedia.org/T221917) (owner: 10Gehel) [13:27:28] (03PS4) 10Jbond: build.gradle: add memcached support to cas blob [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/592659 (https://phabricator.wikimedia.org/T233931) [13:32:17] 10Operations, 10ops-codfw: Degraded RAID on restbase2009 - https://phabricator.wikimedia.org/T253715 (10Papaul) @hnowlan I changed the Netbox status for this server to failed. Once you finished the re-imaging please let me so i can update the Netbox status. thanks [13:35:24] 10Operations: planned upstream deprecation of the ssh-rsa signing algorithm (RSA with SHA-1) - https://phabricator.wikimedia.org/T253824 (10MoritzMuehlenhoff) Nice writeup! I'll have a more in depth look and see what I can do to potentially fix this for the OpenSSH package in next/last Stretch point release. We... [13:38:27] (03CR) 10Jbond: [C: 03+2] apero_cas: alow ability to use memcached for tickets (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/592660 (https://phabricator.wikimedia.org/T233933) (owner: 10Jbond) [13:41:42] (03PS8) 10Jbond: profile::idp: add mcrouter [puppet] - 10https://gerrit.wikimedia.org/r/592642 (https://phabricator.wikimedia.org/T233933) [13:42:12] (03PS10) 10Jbond: apero_cas: alow ability to use memcached for tickets [puppet] - 10https://gerrit.wikimedia.org/r/592660 (https://phabricator.wikimedia.org/T233933) [13:46:35] (03CR) 10JMeybohm: [C: 03+1] "This is pretty cool!" 
(031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/598280 (owner: 10Alexandros Kosiaris) [13:48:10] (03CR) 10Jbond: [C: 03+2] profile::idp: add mcrouter [puppet] - 10https://gerrit.wikimedia.org/r/592642 (https://phabricator.wikimedia.org/T233933) (owner: 10Jbond) [13:50:11] 10Operations, 10ORES, 10Scoring-platform-team: [Epic] Deploy ORES in kubernetes cluster - https://phabricator.wikimedia.org/T182331 (10akosiaris) >>! In T182331#6169433, @akosiaris wrote: >>>! In T182331#6167624, @ACraze wrote: >> I'm wondering about pod size limits and what that means for our current archit... [13:51:28] (03PS1) 10Muehlenhoff: Mask puppet in d-i on pre Buster [puppet] - 10https://gerrit.wikimedia.org/r/599867 [13:52:09] 10Operations: planned upstream deprecation of the ssh-rsa signing algorithm (RSA with SHA-1) - https://phabricator.wikimedia.org/T253824 (10CDanis) Great, thanks! I haven't checked how easily [[ https://github.com/openssh/openssh-portable/commit/183ba55aaaecca0206184b854ad6155df237adbe | 183ba55 ]] would apply... [13:52:36] PROBLEM - cassandra-b service on restbase2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:53:08] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/599867 (owner: 10Muehlenhoff) [13:53:43] (03CR) 10Volans: [C: 03+1] "LGTM, thanks a lot!" [puppet] - 10https://gerrit.wikimedia.org/r/599867 (owner: 10Muehlenhoff) [13:53:56] (03CR) 10Muehlenhoff: [C: 03+2] Mask puppet in d-i on pre Buster [puppet] - 10https://gerrit.wikimedia.org/r/599867 (owner: 10Muehlenhoff) [13:54:38] 10Operations: Why do we have 2 sets of squid proxies? 
- https://phabricator.wikimedia.org/T254011 (10Dzahn) [13:54:46] PROBLEM - cassandra-c CQL 10.192.48.56:9042 on restbase2009 is CRITICAL: connect to address 10.192.48.56 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [13:55:37] hnowlan: for example the downtime is something that would have been taken care by the reimage ^^^ ;) [13:55:52] whoops :) [13:55:52] (03PS7) 10Jbond: apero_cas: enable memcached on idp_test [puppet] - 10https://gerrit.wikimedia.org/r/592661 (https://phabricator.wikimedia.org/T233931) [13:57:12] PROBLEM - cassandra-c SSL 10.192.48.56:7001 on restbase2009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662 [13:58:30] (03CR) 10Jbond: [C: 03+2] apero_cas: enable memcached on idp_test [puppet] - 10https://gerrit.wikimedia.org/r/592661 (https://phabricator.wikimedia.org/T233931) (owner: 10Jbond) [13:59:01] (03PS2) 10Alexandros Kosiaris: Add k8s dummy tokens for 3 new services. [labs/private] - 10https://gerrit.wikimedia.org/r/580294 (https://phabricator.wikimedia.org/T241230) [13:59:28] PROBLEM - cassandra-c service on restbase2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:59:56] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Add k8s dummy tokens for 3 new services. [labs/private] - 10https://gerrit.wikimedia.org/r/580294 (https://phabricator.wikimedia.org/T241230) (owner: 10Alexandros Kosiaris) [14:02:23] (03CR) 10Alexandros Kosiaris: [C: 04-2] "I am sorry, this can't really happen. recommendation_api is colocated with other services on the scb cluster and all are nodejs6. 
Aside fr" [puppet] - 10https://gerrit.wikimedia.org/r/560454 (https://phabricator.wikimedia.org/T241230) (owner: 10Bmansurov) [14:02:24] 10Operations, 10ops-codfw: Degraded RAID on restbase2009 - https://phabricator.wikimedia.org/T253715 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by hnowlan on cumin1001.eqiad.wmnet for hosts: ` restbase2009.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202005291402_hnowlan_11... [14:02:59] (03Abandoned) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/597998 (owner: 10Hashar) [14:03:27] hnowlan: forgot to mention a pro-tip [14:03:33] use cumin2001 for codfw hosts ;) [14:04:13] fair [14:04:24] not an issue, just slightly quicker :) [14:05:51] 10Operations: planned upstream deprecation of the ssh-rsa signing algorithm (RSA with SHA-1) - https://phabricator.wikimedia.org/T253824 (10CDanis) p:05Triage→03Medium [14:06:32] (03PS1) 10Jbond: apereo_cas: correct typo [puppet] - 10https://gerrit.wikimedia.org/r/599872 [14:06:35] (03PS2) 10Alexandros Kosiaris: Kubernetes: Create token stanzas for some new services [puppet] - 10https://gerrit.wikimedia.org/r/580295 (https://phabricator.wikimedia.org/T241230) [14:06:52] PROBLEM - Host es1019.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:08:17] (03CR) 10Jbond: [C: 03+2] apereo_cas: correct typo [puppet] - 10https://gerrit.wikimedia.org/r/599872 (owner: 10Jbond) [14:09:17] akosiaris: mereged your private repo changes [14:09:24] RECOVERY - Long running screen/tmux on mw1320 is OK: OK: No SCREEN or tmux processes detected. https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [14:09:50] (03CR) 10Alexandros Kosiaris: [C: 03+2] Kubernetes: Create token stanzas for some new services [puppet] - 10https://gerrit.wikimedia.org/r/580295 (https://phabricator.wikimedia.org/T241230) (owner: 10Alexandros Kosiaris) [14:13:53] jbond42: thanks! 
[14:14:10] you beat by my 1min or so [14:14:20] you beat me by 1min or so* [14:15:40] !log ran extensions/MachineVision/maintenance/removeBlacklistedSuggestions.php on commonswiki (T253821) [14:15:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:45] T253821: CAT blacklist update, 2020-05-27 - https://phabricator.wikimedia.org/T253821 [14:16:24] PROBLEM - Check systemd state on idp-test2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:05] PROBLEM - Check systemd state on idp-test1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:34] (03CR) 10Alexandros Kosiaris: [C: 03+2] Create namespaces/calico rules for new services [deployment-charts] - 10https://gerrit.wikimedia.org/r/599812 (https://phabricator.wikimedia.org/T225680) (owner: 10Alexandros Kosiaris) [14:20:02] (03Merged) 10jenkins-bot: Create namespaces/calico rules for new services [deployment-charts] - 10https://gerrit.wikimedia.org/r/599812 (https://phabricator.wikimedia.org/T225680) (owner: 10Alexandros Kosiaris) [14:20:17] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={atlas_exporter,routinator} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:21:17] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:24:37] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime [14:24:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:16] 10Operations, 10netops: all network devices must run OpenSSH >= 7.2p1 but != 7.4p1 - https://phabricator.wikimedia.org/T254013 (10CDanis) [14:26:31] PROBLEM - Check systemd state on idp2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:27:10] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:27:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:58] (03PS1) 10Dzahn: add IPs for installservers in POPs [dns] - 10https://gerrit.wikimedia.org/r/599883 (https://phabricator.wikimedia.org/T252526) [14:32:25] (03CR) 10jerkins-bot: [V: 04-1] add IPs for installservers in POPs [dns] - 10https://gerrit.wikimedia.org/r/599883 (https://phabricator.wikimedia.org/T252526) (owner: 10Dzahn) [14:33:16] PROBLEM - Check systemd state on idp1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:33:54] PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:34:46] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=idp site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:35:02] PROBLEM - Memcached on idp-test1001 is CRITICAL: connect to address 208.80.154.87 and port 11000: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [14:35:08] RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:35:37] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'coredns' . [14:35:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:41] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [14:35:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:56] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:37:33] (03CR) 10RLazarus: "Adding Chris to get a more Prometheus-experienced eye on it than mine, then I'm happy to handle the merge." 
[puppet] - 10https://gerrit.wikimedia.org/r/597176 (https://phabricator.wikimedia.org/T225739) (owner: 10Dave Pifke) [14:40:30] 10Operations, 10netops, 10Patch-For-Review: intermittent brief data dropouts for esams netflow data - https://phabricator.wikimedia.org/T253128 (10CDanis) We had one more of these: `May 28 10:57:33 netflow3001 nfacctd[31200]: WARN ( default_kafka/kafka ): Missing data detected (plugin_buffer_size=1444 plugi... [14:41:00] PROBLEM - Memcached on idp2001 is CRITICAL: connect to address 208.80.153.23 and port 11000: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [14:41:51] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'kube-system' for release 'coredns' . [14:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:55] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [14:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:39] 10Operations, 10ops-codfw: Degraded RAID on restbase2009 - https://phabricator.wikimedia.org/T253715 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['restbase2009.codfw.wmnet'] ` and were **ALL** successful. [14:44:11] \o/ [14:44:29] 10Operations: debian-installer: partman doesn't allow lvm LVs to be reused when reimaging - https://phabricator.wikimedia.org/T252027 (10MoritzMuehlenhoff) Thanks for working on this! I'd need to read up on some of the Partman guts for a full review, but the general approach seems totally workable. As for the q... [14:46:16] 10Operations, 10ops-codfw: Degraded RAID on restbase2009 - https://phabricator.wikimedia.org/T253715 (10hnowlan) @Papaul - reimage was successful, you can update status. thank you! [14:47:44] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'kube-system' for release 'coredns' . 
[14:47:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:49] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [14:47:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:35] (03PS4) 10Alexandros Kosiaris: rake: Add kubeyaml validation [deployment-charts] - 10https://gerrit.wikimedia.org/r/598280 [14:48:46] PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:48:52] (03CR) 10Alexandros Kosiaris: rake: Add kubeyaml validation (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/598280 (owner: 10Alexandros Kosiaris) [14:49:16] (03PS4) 10CDanis: nfacctd: various increases to buffer sizes [puppet] - 10https://gerrit.wikimedia.org/r/598841 (https://phabricator.wikimedia.org/T253128) [14:49:30] PROBLEM - Memcached on idp-test2001 is CRITICAL: connect to address 208.80.153.25 and port 11000: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [14:52:00] (03CR) 10Ayounsi: [C: 03+1] nfacctd: various increases to buffer sizes [puppet] - 10https://gerrit.wikimedia.org/r/598841 (https://phabricator.wikimedia.org/T253128) (owner: 10CDanis) [14:52:08] RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:54:05] 10Operations, 10ops-codfw: Degraded RAID on restbase2009 - https://phabricator.wikimedia.org/T253715 (10Papaul) 05Open→03Resolved Done thank you. 
[14:56:17] (03PS2) 10Alexandros Kosiaris: Reorder YAML Service definitions [deployment-charts] - 10https://gerrit.wikimedia.org/r/599327 [14:56:27] (03PS1) 10Jbond: idp: fix route type [puppet] - 10https://gerrit.wikimedia.org/r/599890 [14:57:57] (03CR) 10Alexandros Kosiaris: [C: 03+2] Reorder YAML Service definitions [deployment-charts] - 10https://gerrit.wikimedia.org/r/599327 (owner: 10Alexandros Kosiaris) [14:58:01] (03CR) 10Alexandros Kosiaris: [C: 03+2] "thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/599327 (owner: 10Alexandros Kosiaris) [14:58:21] (03CR) 10CDanis: [C: 03+2] nfacctd: various increases to buffer sizes [puppet] - 10https://gerrit.wikimedia.org/r/598841 (https://phabricator.wikimedia.org/T253128) (owner: 10CDanis) [14:58:23] (03CR) 10Elukey: [C: 03+1] nfacctd: various increases to buffer sizes [puppet] - 10https://gerrit.wikimedia.org/r/598841 (https://phabricator.wikimedia.org/T253128) (owner: 10CDanis) [14:58:27] (03Merged) 10jenkins-bot: Reorder YAML Service definitions [deployment-charts] - 10https://gerrit.wikimedia.org/r/599327 (owner: 10Alexandros Kosiaris) [14:59:02] !log disabling puppet on netflow* to deploy Ic71e96f0 T253128 [14:59:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:06] T253128: intermittent brief data dropouts for esams netflow data - https://phabricator.wikimedia.org/T253128 [14:59:36] (03PS2) 10Jbond: idp: fix route type [puppet] - 10https://gerrit.wikimedia.org/r/599890 [15:00:17] (03PS2) 10Alexandros Kosiaris: Probes: If guard them in all charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/599328 [15:00:24] 10Operations, 10Analytics, 10Traffic, 10Readers-Web-Backlog (Tracking): Mobile redirects drop provenance parameters - https://phabricator.wikimedia.org/T252227 (10Jdlrobson) [15:00:32] (03CR) 10Alexandros Kosiaris: Probes: If guard them in all charts (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/599328 (owner: 10Alexandros 
Kosiaris) [15:00:34] (03CR) 10jerkins-bot: [V: 04-1] Probes: If guard them in all charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/599328 (owner: 10Alexandros Kosiaris) [15:01:04] (03PS2) 10Dzahn: site: add new appservers mw2335 through mw2339 [puppet] - 10https://gerrit.wikimedia.org/r/599749 (https://phabricator.wikimedia.org/T241852) [15:01:04] PROBLEM - Memcached on idp1001 is CRITICAL: connect to address 208.80.154.26 and port 11000: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [15:01:52] (03PS3) 10Alexandros Kosiaris: Probes: If guard them in all charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/599328 [15:02:23] (03PS2) 10Rush: peek: add asana and env variable dependencies [puppet] - 10https://gerrit.wikimedia.org/r/599433 (https://phabricator.wikimedia.org/T242285) [15:02:37] (03CR) 10Jbond: [C: 03+2] idp: fix route type [puppet] - 10https://gerrit.wikimedia.org/r/599890 (owner: 10Jbond) [15:03:15] (03CR) 10jerkins-bot: [V: 04-1] peek: add asana and env variable dependencies [puppet] - 10https://gerrit.wikimedia.org/r/599433 (https://phabricator.wikimedia.org/T242285) (owner: 10Rush) [15:03:29] (03CR) 10Alexandros Kosiaris: [C: 03+2] Probes: If guard them in all charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/599328 (owner: 10Alexandros Kosiaris) [15:04:46] (03PS3) 10Rush: peek: add asana and env variable dependencies [puppet] - 10https://gerrit.wikimedia.org/r/599433 (https://phabricator.wikimedia.org/T242285) [15:05:39] (03CR) 10jerkins-bot: [V: 04-1] peek: add asana and env variable dependencies [puppet] - 10https://gerrit.wikimedia.org/r/599433 (https://phabricator.wikimedia.org/T242285) (owner: 10Rush) [15:06:36] (03PS2) 10Alexandros Kosiaris: eventstreams: Use gotemplate comment syntax [deployment-charts] - 10https://gerrit.wikimedia.org/r/599329 [15:06:41] (03PS3) 10Alexandros Kosiaris: eventstreams: Use gotemplate comment syntax [deployment-charts] - 10https://gerrit.wikimedia.org/r/599329 
[15:06:45] (03CR) 10jerkins-bot: [V: 04-1] eventstreams: Use gotemplate comment syntax [deployment-charts] - 10https://gerrit.wikimedia.org/r/599329 (owner: 10Alexandros Kosiaris) [15:07:02] (03PS4) 10Alexandros Kosiaris: Probes: If guard them in all charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/599328 [15:07:04] (03PS4) 10Rush: peek: add asana and env variable dependencies [puppet] - 10https://gerrit.wikimedia.org/r/599433 (https://phabricator.wikimedia.org/T242285) [15:07:57] (03PS2) 10Dzahn: add IPs for installservers in POPs [dns] - 10https://gerrit.wikimedia.org/r/599883 (https://phabricator.wikimedia.org/T252526) [15:08:06] (03CR) 10Rush: [C: 03+2] peek: add asana and env variable dependencies [puppet] - 10https://gerrit.wikimedia.org/r/599433 (https://phabricator.wikimedia.org/T242285) (owner: 10Rush) [15:09:14] (03CR) 10Alexandros Kosiaris: [C: 03+2] eventstreams: Use gotemplate comment syntax [deployment-charts] - 10https://gerrit.wikimedia.org/r/599329 (owner: 10Alexandros Kosiaris) [15:09:17] (03PS4) 10Alexandros Kosiaris: eventstreams: Use gotemplate comment syntax [deployment-charts] - 10https://gerrit.wikimedia.org/r/599329 [15:09:21] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] eventstreams: Use gotemplate comment syntax [deployment-charts] - 10https://gerrit.wikimedia.org/r/599329 (owner: 10Alexandros Kosiaris) [15:10:35] (03PS2) 10Alexandros Kosiaris: debug: Don't pass nodePort: null across all charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/599330 [15:11:47] (03Abandoned) 10Dzahn: bastionhost: include TFTP and DHCP profiles in bastion hosts in POPs [puppet] - 10https://gerrit.wikimedia.org/r/599817 (https://phabricator.wikimedia.org/T252526) (owner: 10Dzahn) [15:13:05] (03CR) 10Alexandros Kosiaris: [C: 03+2] debug: Don't pass nodePort: null across all charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/599330 (owner: 10Alexandros Kosiaris) [15:13:37] (03Merged) 10jenkins-bot: debug: Don't pass nodePort: 
null across all charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/599330 (owner: 10Alexandros Kosiaris) [15:13:56] (03PS2) 10Alexandros Kosiaris: all charts: Only declare nodePort if specified [deployment-charts] - 10https://gerrit.wikimedia.org/r/599331 [15:14:32] (03CR) 10Alexandros Kosiaris: [C: 03+2] all charts: Only declare nodePort if specified [deployment-charts] - 10https://gerrit.wikimedia.org/r/599331 (owner: 10Alexandros Kosiaris) [15:14:56] RECOVERY - Memcached on idp-test1001 is OK: TCP OK - 0.000 second response time on 208.80.154.87 port 11000 https://wikitech.wikimedia.org/wiki/Memcached [15:15:01] (03Merged) 10jenkins-bot: all charts: Only declare nodePort if specified [deployment-charts] - 10https://gerrit.wikimedia.org/r/599331 (owner: 10Alexandros Kosiaris) [15:21:46] 10Operations, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Uploading, and 3 others: Some (recent?) uploads to Commons are not available on other wikis - https://phabricator.wikimedia.org/T253405 (10AntiCompositeNumber) That is not this bug. [15:22:22] (03PS3) 10Dzahn: add IPs for installservers in POPs [dns] - 10https://gerrit.wikimedia.org/r/599883 (https://phabricator.wikimedia.org/T252526) [15:22:31] 10Operations, 10netops: intermittent brief data dropouts for esams netflow data - https://phabricator.wikimedia.org/T253128 (10CDanis) 05Open→03Resolved a:03CDanis [15:23:28] (03CR) 10Bstorm: [C: 04-1] "Hit a snag in local testing. Going to double check before merging." 
[puppet] - 10https://gerrit.wikimedia.org/r/595201 (https://phabricator.wikimedia.org/T252219) (owner: 10Bstorm) [15:23:38] RECOVERY - Check systemd state on idp-test1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:24:35] (03CR) 10JMeybohm: [C: 03+1] rake: Add kubeyaml validation (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/598280 (owner: 10Alexandros Kosiaris) [15:26:46] (03PS1) 10Jbond: mcrouter: update ssl options if running on buster [puppet] - 10https://gerrit.wikimedia.org/r/599896 [15:28:17] (03CR) 10CDanis: Add check_prometheus rules for navtiming (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/597176 (https://phabricator.wikimedia.org/T225739) (owner: 10Dave Pifke) [15:28:21] (03PS2) 10Jbond: mcrouter: update ssl options if running on buster [puppet] - 10https://gerrit.wikimedia.org/r/599896 [15:29:14] !log Performing a rolling restart of the `relforge` clusters as part of elasticsearch plugins upgrade [15:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:34] (03PS3) 10Jbond: mcrouter: update ssl options if running on buster [puppet] - 10https://gerrit.wikimedia.org/r/599896 [15:29:47] (03PS1) 10Volans: transports: catch configuration load errors [software/homer] - 10https://gerrit.wikimedia.org/r/599897 (https://phabricator.wikimedia.org/T253795) [15:31:44] 10Operations, 10User-MoritzMuehlenhoff: planned upstream deprecation of the ssh-rsa signing algorithm (RSA with SHA-1) - https://phabricator.wikimedia.org/T253824 (10MoritzMuehlenhoff) [15:33:34] (03CR) 10Bstorm: [C: 04-1] wikireplicas: remove MCR-obsoleted fields from the replica views (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/595201 (https://phabricator.wikimedia.org/T252219) (owner: 10Bstorm) [15:34:04] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 53 probes of 571 (alerts on 50) - 
https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:36:35] (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/22893/" [puppet] - 10https://gerrit.wikimedia.org/r/599896 (owner: 10Jbond) [15:38:55] (03CR) 10Muehlenhoff: [C: 03+1] "Renaming daemon arguments without backwards compat? It's almost as if they don't want their code to be used outside of Facebook..." [puppet] - 10https://gerrit.wikimedia.org/r/599896 (owner: 10Jbond) [15:39:54] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 47 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:39:57] (03PS4) 10Bstorm: wikireplicas: remove MCR-obsoleted fields from the replica views [puppet] - 10https://gerrit.wikimedia.org/r/595201 (https://phabricator.wikimedia.org/T252219) [15:40:48] (03CR) 10Jbond: "> Patch Set 3: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/599896 (owner: 10Jbond) [15:44:08] 10Operations: Conffile handling for PHP 7.2 packages - https://phabricator.wikimedia.org/T231881 (10MoritzMuehlenhoff) 05Open→03Resolved This happens fairly often with the PHP packages (basically whenever a new default config is shipped), e.g. it happened again for the 7.2.26->7.2.31 update. These still need... 
[15:44:11] 10Operations, 10serviceops: Upgrade to PHP 7.2.24 - https://phabricator.wikimedia.org/T237239 (10MoritzMuehlenhoff) [15:44:14] (03CR) 10Bstorm: [C: 03+2] wikireplicas: remove MCR-obsoleted fields from the replica views (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/595201 (https://phabricator.wikimedia.org/T252219) (owner: 10Bstorm) [15:47:30] RECOVERY - Memcached on idp1001 is OK: TCP OK - 0.001 second response time on 208.80.154.26 port 11000 https://wikitech.wikimedia.org/wiki/Memcached [15:47:52] 10Operations: confd: Superfluous golang dependency - https://phabricator.wikimedia.org/T215593 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This got fixed with the 0.16 re-packaging. We can ignore the 0.9 jessie package, closing. [15:48:38] RECOVERY - Memcached on idp2001 is OK: TCP OK - 0.036 second response time on 208.80.153.23 port 11000 https://wikitech.wikimedia.org/wiki/Memcached [15:50:49] ACKNOWLEDGEMENT - Check systemd state on idp-test2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. John Bond migrating to memcache https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:50:49] ACKNOWLEDGEMENT - Memcached on idp-test2001 is CRITICAL: connect to address 208.80.153.25 and port 11000: Connection refused John Bond migrating to memcache https://wikitech.wikimedia.org/wiki/Memcached [15:50:49] ACKNOWLEDGEMENT - Check systemd state on idp1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. John Bond migrating to memcache https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:50:49] ACKNOWLEDGEMENT - Check systemd state on idp2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
John Bond migrating to memcache https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:52:59] (PS2) Volans: transports: catch configuration load errors [software/homer] - https://gerrit.wikimedia.org/r/599897 (https://phabricator.wikimedia.org/T253795) [15:56:21] Operations, Wikimedia-SVG-rendering, Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (AntiCompositeNumber) [15:58:40] Concluded rolling restart of the `relforge` clusters as part of elasticsearch plugins upgrade. Both hosts `relforge1001` and `relforge1002` are back up. Downtime removed. [15:58:46] (PS2) Alexandros Kosiaris: chromium-render: Move ports to the debug pattern [deployment-charts] - https://gerrit.wikimedia.org/r/599333 [15:58:46] Oops forgot the log [15:59:05] !log Concluded rolling restart of the `relforge` clusters as part of elasticsearch plugins upgrade. Both hosts `relforge1001` and `relforge1002` are back up. Downtime lifted. [15:59:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:55] (CR) Alexandros Kosiaris: [C: +2] chromium-render: Move ports to the debug pattern [deployment-charts] - https://gerrit.wikimedia.org/r/599333 (owner: Alexandros Kosiaris) [16:00:17] (PS2) Alexandros Kosiaris: eventgate: Use gotemplate comment syntax [deployment-charts] - https://gerrit.wikimedia.org/r/599334 [16:00:27] (Merged) jenkins-bot: chromium-render: Move ports to the debug pattern [deployment-charts] - https://gerrit.wikimedia.org/r/599333 (owner: Alexandros Kosiaris) [16:00:47] !log Updating views on labsdb1012 T252219 [16:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:50] T252219: Drop MCR-obsoleted fields from the wiki replicas - https://phabricator.wikimedia.org/T252219 [16:01:07] (CR) Alexandros Kosiaris: [C: +2] eventgate: Use gotemplate comment syntax [deployment-charts] - https://gerrit.wikimedia.org/r/599334
(owner: Alexandros Kosiaris) [16:01:15] Operations, User-MoritzMuehlenhoff: planned upstream deprecation of the ssh-rsa signing algorithm (RSA with SHA-1) - https://phabricator.wikimedia.org/T253824 (CDanis) I contacted OpenSSH upstream: https://lists.mindrot.org/pipermail/openssh-unix-dev/2020-May/038533.html [16:01:17] (Merged) jenkins-bot: eventgate: Use gotemplate comment syntax [deployment-charts] - https://gerrit.wikimedia.org/r/599334 (owner: Alexandros Kosiaris) [16:01:19] (PS2) Alexandros Kosiaris: zotero: ifguard deployment annotations [deployment-charts] - https://gerrit.wikimedia.org/r/599335 [16:01:50] (CR) Alexandros Kosiaris: [C: +2] zotero: ifguard deployment annotations [deployment-charts] - https://gerrit.wikimedia.org/r/599335 (owner: Alexandros Kosiaris) [16:02:17] (Merged) jenkins-bot: zotero: ifguard deployment annotations [deployment-charts] - https://gerrit.wikimedia.org/r/599335 (owner: Alexandros Kosiaris) [16:02:42] (CR) Alexandros Kosiaris: "> With v0.2 tls_helpers (used everywhere else) the upstream timeout is not set via environment variable but directly in the envoy config a" [deployment-charts] - https://gerrit.wikimedia.org/r/599336 (owner: Alexandros Kosiaris) [16:02:45] (PS2) Alexandros Kosiaris: eventgate: Port tls.upstream_timeout in values.yaml [deployment-charts] - https://gerrit.wikimedia.org/r/599336 [16:03:35] (CR) Alexandros Kosiaris: [C: +2] eventgate: Port tls.upstream_timeout in values.yaml [deployment-charts] - https://gerrit.wikimedia.org/r/599336 (owner: Alexandros Kosiaris) [16:03:51] (PS2) Alexandros Kosiaris: zotero: if guard the volumes stanza [deployment-charts] - https://gerrit.wikimedia.org/r/599337 [16:04:16] (Merged) jenkins-bot: eventgate: Port tls.upstream_timeout in values.yaml [deployment-charts] - https://gerrit.wikimedia.org/r/599336 (owner: Alexandros Kosiaris) [16:04:46] (CR) Alexandros Kosiaris: [C: +2] zotero: if
guard the volumes stanza [deployment-charts] - https://gerrit.wikimedia.org/r/599337 (owner: Alexandros Kosiaris) [16:05:13] (Merged) jenkins-bot: zotero: if guard the volumes stanza [deployment-charts] - https://gerrit.wikimedia.org/r/599337 (owner: Alexandros Kosiaris) [16:05:32] (PS2) Alexandros Kosiaris: blubberoid: ifguard volumes stanza [deployment-charts] - https://gerrit.wikimedia.org/r/599338 [16:06:38] (CR) Alexandros Kosiaris: [C: +2] blubberoid: ifguard volumes stanza [deployment-charts] - https://gerrit.wikimedia.org/r/599338 (owner: Alexandros Kosiaris) [16:07:05] (Merged) jenkins-bot: blubberoid: ifguard volumes stanza [deployment-charts] - https://gerrit.wikimedia.org/r/599338 (owner: Alexandros Kosiaris) [16:07:20] Operations, Thumbor, Wikimedia-SVG-rendering, Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (AntiCompositeNumber) [16:07:59] (CR) Greg Grossmeier: [C: +1] "Approved." [puppet] - https://gerrit.wikimedia.org/r/599758 (https://phabricator.wikimedia.org/T253676) (owner: Ayounsi) [16:08:29] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access for deployment and restricted group for cicalese - https://phabricator.wikimedia.org/T253676 (greg) Approved. Godspeed, @CCicalese_WMF, let me/us know if you need anything. [16:09:08] (CR) Greg Grossmeier: [C: +1] "btw, can we make the RelEng approval for the deployment group an official step (in the task or somewhere)?"
[puppet] - https://gerrit.wikimedia.org/r/599758 (https://phabricator.wikimedia.org/T253676) (owner: Ayounsi) [16:09:36] Operations, User-MoritzMuehlenhoff: Investigate StorCLI - https://phabricator.wikimedia.org/T254019 (MoritzMuehlenhoff) [16:12:02] (PS4) Alexandros Kosiaris: kafka-dev: Drop redundant YAML doc starts [deployment-charts] - https://gerrit.wikimedia.org/r/598279 [16:13:27] (CR) Alexandros Kosiaris: "That's actually a valid question, which I was trying to avoid answering in this change, being more preoccupied with appeasing kubeyaml." [deployment-charts] - https://gerrit.wikimedia.org/r/599332 (owner: Alexandros Kosiaris) [16:13:41] (CR) Alexandros Kosiaris: [C: +2] kafka-dev: Drop redundant YAML doc starts [deployment-charts] - https://gerrit.wikimedia.org/r/598279 (owner: Alexandros Kosiaris) [16:14:14] (Merged) jenkins-bot: kafka-dev: Drop redundant YAML doc starts [deployment-charts] - https://gerrit.wikimedia.org/r/598279 (owner: Alexandros Kosiaris) [16:14:36] Operations, ops-esams, netops: Amsterdam maintenance (June 2020) - https://phabricator.wikimedia.org/T254021 (ayounsi) p:Triage→Medium [16:27:22] PROBLEM - cassandra-b service on restbase2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:27:28] PROBLEM - cassandra-b SSL 10.192.48.55:7001 on restbase2009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662 [16:27:32] PROBLEM - cassandra-b CQL 10.192.48.55:9042 on restbase2009 is CRITICAL: connect to address 10.192.48.55 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [16:27:40] PROBLEM - cassandra-c SSL 10.192.48.56:7001 on restbase2009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662 [16:27:54] PROBLEM - cassandra-a
service on restbase2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:27:58] PROBLEM - cassandra-c service on restbase2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:28:04] PROBLEM - cassandra-a CQL 10.192.48.54:9042 on restbase2009 is CRITICAL: connect to address 10.192.48.54 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [16:28:16] PROBLEM - cassandra-c CQL 10.192.48.56:9042 on restbase2009 is CRITICAL: connect to address 10.192.48.56 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [16:28:58] PROBLEM - cassandra-a SSL 10.192.48.54:7001 on restbase2009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662 [16:31:28] PROBLEM - PHP opcache health on scandium is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:41:22] (PS6) Privacybatm: Write documentation using Sphinx [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/598295 (https://phabricator.wikimedia.org/T253219) [16:45:03] /win 10 [16:46:00] RECOVERY - PHP opcache health on scandium is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:50:43] !log Performing a rolling restart of the `cloudelastic` clusters (chi, psi, omega) as part of elasticsearch plugins upgrade. Host and service checks disabled.
[16:50:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:17] Operations, netops: all network devices must run OpenSSH >= 7.2p1 but != 7.4p1 - https://phabricator.wikimedia.org/T254013 (ayounsi) [16:53:42] Operations, netops: all network devices must run OpenSSH >= 7.2p1 but != 7.4p1 - https://phabricator.wikimedia.org/T254013 (ayounsi) [16:56:21] Operations, netops: all network devices must run OpenSSH >= 7.2p1 but != 7.4p1 - https://phabricator.wikimedia.org/T254013 (ayounsi) * Junos recommended version for the MX480s have OpenSSH_6.9 * The SRXs will need new models (SRX300) to support junos >12.1 * Switches upgrades (when possible) are very imp... [16:56:47] Operations, User-MoritzMuehlenhoff: planned upstream deprecation of the ssh-rsa signing algorithm (RSA with SHA-1) - https://phabricator.wikimedia.org/T253824 (ayounsi) [16:56:49] Operations, netops: all network devices must run OpenSSH >= 7.2p1 but != 7.4p1 - https://phabricator.wikimedia.org/T254013 (ayounsi) Open→Stalled p:Triage→Low [17:00:05] (PS1) Bstorm: toolsdb: Add misbehaving table to replication filters [puppet] - https://gerrit.wikimedia.org/r/599926 (https://phabricator.wikimedia.org/T253738) [17:11:47] cd [17:11:49] oops :) [17:13:29] (CR) Ayounsi: [C: +2] Add cicalese to deployment group [puppet] - https://gerrit.wikimedia.org/r/599758 (https://phabricator.wikimedia.org/T253676) (owner: Ayounsi) [17:13:46] (PS3) Ayounsi: Add cicalese to deployment group [puppet] - https://gerrit.wikimedia.org/r/599758 (https://phabricator.wikimedia.org/T253676) [17:15:30] (CR) Ayounsi: [C: +2] "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/599758 (https://phabricator.wikimedia.org/T253676) (owner: Ayounsi) [17:17:03] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access for deployment and restricted group for cicalese - https://phabricator.wikimedia.org/T253676 (ayounsi)
[17:17:22] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access for deployment and restricted group for cicalese - https://phabricator.wikimedia.org/T253676 (ayounsi) Open→Resolved a:ayounsi Done! let me know if you're having any issue. [17:28:52] !log updating views on labsdb1009 T252219 [17:28:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:56] T252219: Drop MCR-obsoleted fields from the wiki replicas - https://phabricator.wikimedia.org/T252219 [17:34:27] Operations: Look into feasibility of disabling sha-1 host keys on our ssh daemons - https://phabricator.wikimedia.org/T167966 (CDanis) `ssh-rsa` keys aren't necessarily tied to SHA1 as of OpenSSH 7.2p1. See T253824 for far too many details. [17:47:59] (PS6) Jdlrobson: Use AddFooterLink hook for code of conduct and contact links [mediawiki-config] - https://gerrit.wikimedia.org/r/596277 (https://phabricator.wikimedia.org/T251817) [17:48:12] (PS2) RLazarus: maintenance: Migrate initsitestats to periodic_job [puppet] - https://gerrit.wikimedia.org/r/593772 (https://phabricator.wikimedia.org/T211250) [17:49:54] (CR) RLazarus: [C: +2] maintenance: Migrate initsitestats to periodic_job [puppet] - https://gerrit.wikimedia.org/r/593772 (https://phabricator.wikimedia.org/T211250) (owner: RLazarus) [17:52:40] (PS2) RLazarus: maintenance: Migrate startupregistrystats to periodic_job [puppet] - https://gerrit.wikimedia.org/r/593774 (https://phabricator.wikimedia.org/T211250) [17:55:11] (CR) RLazarus: [C: +2] maintenance: Migrate startupregistrystats to periodic_job [puppet] - https://gerrit.wikimedia.org/r/593774 (https://phabricator.wikimedia.org/T211250) (owner: RLazarus) [17:59:55] (PS3) RLazarus: maintenance: Migrate updatequerypages to periodic_job [puppet] - https://gerrit.wikimedia.org/r/593797 (https://phabricator.wikimedia.org/T211250) [18:02:30] (CR) RLazarus: [C: +2] maintenance: Migrate
updatequerypages to periodic_job [puppet] - https://gerrit.wikimedia.org/r/593797 (https://phabricator.wikimedia.org/T211250) (owner: RLazarus) [18:18:44] Operations, Analytics, Traffic: Spammy events coming our way for sites such us https://ru.wikipedia.kim - https://phabricator.wikimedia.org/T190843 (Nuria) Open→Declined [18:25:25] (CR) Greg Grossmeier: [C: +1] "> It's in https://wikitech.wikimedia.org/wiki/SRE_Clinic_Duty#Deployment_Groups" [puppet] - https://gerrit.wikimedia.org/r/599758 (https://phabricator.wikimedia.org/T253676) (owner: Ayounsi) [18:28:15] (PS1) Volans: dns: add support for virtual machines [software/netbox-extras] - https://gerrit.wikimedia.org/r/599948 (https://phabricator.wikimedia.org/T233183) [18:28:16] Operations, SRE-Access-Requests: Requesting access for deployment and restricted group for cicalese - https://phabricator.wikimedia.org/T253676 (CCicalese_WMF) Thank you!! [18:39:23] (PS1) RLazarus: maintenance: Migrate wikidata prune jobs to periodic_job [puppet] - https://gerrit.wikimedia.org/r/599956 (https://phabricator.wikimedia.org/T211250) [18:42:28] (PS1) Ssingh: dnsdist: add a parameter to use dnsdist's packet cache [puppet] - https://gerrit.wikimedia.org/r/599958 (https://phabricator.wikimedia.org/T252132) [18:45:50] (CR) Ssingh: "https://puppet-compiler.wmflabs.org/compiler1001/22895/" [puppet] - https://gerrit.wikimedia.org/r/599958 (https://phabricator.wikimedia.org/T252132) (owner: Ssingh) [18:52:42] Operations, ORES, Scoring-platform-team: [Epic] Deploy ORES in kubernetes cluster - https://phabricator.wikimedia.org/T182331 (ACraze) @akosiaris ah yeah I see what you're saying, the 56GB RAM is utilized memory due to COW, but the RSS is much higher, which is not great for a containerized solution....
[19:14:59] PROBLEM - MariaDB Slave Lag: s1 on db2097 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 893.11 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [19:18:02] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/599958 (https://phabricator.wikimedia.org/T252132) (owner: Ssingh) [19:19:32] (CR) Ssingh: [C: +2] dnsdist: add a parameter to use dnsdist's packet cache [puppet] - https://gerrit.wikimedia.org/r/599958 (https://phabricator.wikimedia.org/T252132) (owner: Ssingh) [19:27:30] !log Successfully finished a rolling restart of the `cloudelastic` clusters (chi, psi, omega) as part of elasticsearch plugins upgrade. Host and service checks re-enabled. [19:27:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:03] Operations, ORES, Release Pipeline (Blubber), Scoring-platform-team (Current): Build blubber file for ORES - https://phabricator.wikimedia.org/T210268 (ACraze) @thcipriani thanks, PipelineLib seems to be what I was missing here :) [19:52:00] Operations, Thumbor, serviceops, User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (AntiCompositeNumber) [19:54:19] PROBLEM - Router interfaces on mr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.199, interfaces up: 35, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:56:07] RECOVERY - Router interfaces on mr1-eqiad is OK: OK: host 208.80.154.199, interfaces up: 37, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:12:14] Operations, ops-codfw, procurement: codfw: Next Gen test rack - https://phabricator.wikimedia.org/T251570 (Papaul) [20:17:35] Operations, ops-codfw, procurement: codfw: Next Gen test rack - https://phabricator.wikimedia.org/T251570 (Marostegui) As we discussed a
few days ago on IRC, I will have db2113 depooled 24h before the day you pick. Any day works for me. [20:40:31] RECOVERY - MariaDB Slave Lag: s1 on db2097 is OK: OK slave_sql_lag Replication lag: 0.13 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [20:43:18] (PS2) Volans: dns: add support for virtual machines [software/netbox-extras] - https://gerrit.wikimedia.org/r/599948 (https://phabricator.wikimedia.org/T233183) [20:55:33] !log updating views on labsdb1011 T252219 [20:55:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:37] T252219: Drop MCR-obsoleted fields from the wiki replicas - https://phabricator.wikimedia.org/T252219 [21:06:59] Operations: Why do we have 2 sets of squid proxies? - https://phabricator.wikimedia.org/T254011 (Urbanecm) I recall seeing that url-downloader has some download limits, and some server-side upload requests didn't come through for that reason. Not sure if there is a usecase for limiting that through.
[21:10:16] (PS3) Volans: Improve error catching [software/homer] - https://gerrit.wikimedia.org/r/599897 (https://phabricator.wikimedia.org/T253795) [21:51:26] (PS1) CDanis: node_nic_firmware: fix comments in header [puppet] - https://gerrit.wikimedia.org/r/600010 [21:52:53] (CR) CDanis: [C: +2] node_nic_firmware: fix comments in header [puppet] - https://gerrit.wikimedia.org/r/600010 (owner: CDanis) [22:16:01] (PS1) Andrew Bogott: designate: allow mdns to listen on ipv6 [puppet] - https://gerrit.wikimedia.org/r/600017 (https://phabricator.wikimedia.org/T253780) [22:32:24] !log updated views on labsdb1010 T252219 [22:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:29] T252219: Drop MCR-obsoleted fields from the wiki replicas - https://phabricator.wikimedia.org/T252219 [22:46:19] Operations, SRE-Access-Requests: Requesting access for deployment and restricted group for cicalese - https://phabricator.wikimedia.org/T253676 (Dzahn) I just did the "The user should also be added to the Gerrit group wmf-deployment." from [[ https://wikitech.wikimedia.org/wiki/SRE_Clinic_Duty#Deployment... [23:00:29] (PS1) Andrew Bogott: pdns: add allow-axfr-ips setting for cloud auth recursors [puppet] - https://gerrit.wikimedia.org/r/600035 [23:27:07] (CR) Bstorm: [C: +2] toolsdb: Add misbehaving table to replication filters [puppet] - https://gerrit.wikimedia.org/r/599926 (https://phabricator.wikimedia.org/T253738) (owner: Bstorm)