[00:04:54] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01083 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [00:11:11] (03PS9) 10Dave Pifke: arclamp: serve SVGs, compressed logs from Swift [puppet] - 10https://gerrit.wikimedia.org/r/623068 (https://phabricator.wikimedia.org/T244776) [00:12:46] (03PS10) 10Dave Pifke: arclamp: serve SVGs, compressed logs from Swift [puppet] - 10https://gerrit.wikimedia.org/r/623068 (https://phabricator.wikimedia.org/T244776) [00:13:31] (03CR) 10Legoktm: arclamp: serve SVGs, compressed logs from Swift (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/623068 (https://phabricator.wikimedia.org/T244776) (owner: 10Dave Pifke) [00:22:31] (03CR) 10Dave Pifke: arclamp: serve SVGs, compressed logs from Swift (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/623068 (https://phabricator.wikimedia.org/T244776) (owner: 10Dave Pifke) [00:31:59] (03PS11) 10Dave Pifke: arclamp: serve SVGs, compressed logs from Swift [puppet] - 10https://gerrit.wikimedia.org/r/623068 (https://phabricator.wikimedia.org/T244776) [00:32:26] (03CR) 10Dzahn: [C: 03+2] parsoid::testreduce: switch mysql data dir to /srv/data/mysql [puppet] - 10https://gerrit.wikimedia.org/r/673175 (https://phabricator.wikimedia.org/T277580) (owner: 10Dzahn) [00:32:48] (03PS2) 10Dzahn: parsoid::testreduce: switch mysql data dir to /srv/data/mysql [puppet] - 10https://gerrit.wikimedia.org/r/673175 (https://phabricator.wikimedia.org/T277580) [00:33:21] (03CR) 10Dave Pifke: "I'll need to test these changes in beta; will do so first thing tomorrow." 
[puppet] - 10https://gerrit.wikimedia.org/r/623068 (https://phabricator.wikimedia.org/T244776) (owner: 10Dave Pifke) [00:43:22] dancy: I think you may've forgotten to apply that commit before syncing [00:43:36] looking at /srv/mediawiki-staging/php-1.36.0-wmf.35, the commit was not applied [00:43:40] on deploy1002 [00:44:01] * Krinkle reopened task [00:45:02] !log testreduce1001 - stop mysql; rsyncing /var/lib/mysql to /srv/data/mysql (T277580) [00:45:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:45:11] T277580: Bump disk space on testreduce1001 - https://phabricator.wikimedia.org/T277580 [00:45:20] hmm. [00:46:31] Krinkle: The liquidthreads one? [00:46:47] yeah [00:47:03] forgot submodule update perhaps? [00:49:32] gah, yes. [00:50:14] (03PS1) 10Bstorm: static-binaries: first pass at a stripped-down image for binaries [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/673378 (https://phabricator.wikimedia.org/T277749) [00:51:41] !log dancy@deploy1002 Synchronized php-1.36.0-wmf.35/extensions/LiquidThreads/classes/Thread.php: T277772 (duration: 00m 58s) [00:51:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:51:49] T277772: Use of Article::getId was deprecated in MediaWiki 1.35. [Called from Thread::setRoot] - https://phabricator.wikimedia.org/T277772 [00:52:38] Thanks for the heads-up Krinkle. [00:53:30] yw :) [00:54:08] dancy: I updated mediawiki-errors in logstash to incldue maintanance/shell.php in its debugging filter (previously this checked maintenance/eval.php only) [00:54:21] it also excludes mwdebug host names [00:54:35] I saw that. Thank ou. [00:54:40] *you [00:55:02] I... pressed save a few seconds ago? [00:55:08] (03CR) 10Bstorm: "This image removes text editors because they are mostly useless if you basically are uploading binary blobs to NFS. 
It basically assumes y" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/673378 (https://phabricator.wikimedia.org/T277749) (owner: 10Bstorm) [00:55:16] This one - https://logstash.wikimedia.org/app/dashboards#/view/mediawiki-errors [00:55:41] anyhow, yeah, so it uses exception.trace, I don't know if that maps directly but.. aye, hope its of some use [00:57:08] I discussed how to translate that to the logspam script with brenn.en today and we decided to just leave the script as-is for the time being. [00:58:40] I'll let brennen know about the updates you made since they're relevant to the conversation we had. [00:59:04] k :) no problem either way [01:05:34] RECOVERY - Check systemd state on ml-serve2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:05:37] (03PS1) 10Bstorm: maintain-dbusers: fix the order of the paws accounts listing [puppet] - 10https://gerrit.wikimedia.org/r/673380 (https://phabricator.wikimedia.org/T276284) [01:08:11] (03CR) 10BryanDavis: [C: 03+1] maintain-dbusers: fix the order of the paws accounts listing [puppet] - 10https://gerrit.wikimedia.org/r/673380 (https://phabricator.wikimedia.org/T276284) (owner: 10Bstorm) [01:12:26] PROBLEM - Check systemd state on ml-serve2002 is CRITICAL: CRITICAL - degraded: The following units failed: docker.service,docker.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:43:25] !log T275885 Revoking current `relforge` TLS cert in advance of generation of new cert: `ryankemper@puppetmaster1001:/srv/private$ sudo puppet cert clean relforge.svc.eqiad.wmnet` [02:43:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:43:34] T275885: Generate SSL certification for relforge1003.eqiad.wmnet and relforge1004.eqiad.wmnet - https://phabricator.wikimedia.org/T275885 [03:06:06] RECOVERY - Check systemd state on ml-serve2002 is OK: OK - running: The system is fully 
operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:13:00] PROBLEM - Check systemd state on ml-serve2002 is CRITICAL: CRITICAL - degraded: The following units failed: docker.service,docker.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:20:18] (03PS1) 10Ryan Kemper: relforge: generate new TLS certs [puppet] - 10https://gerrit.wikimedia.org/r/673386 (https://phabricator.wikimedia.org/T275885) [03:21:12] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:22:00] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:22:58] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS1299/IPv4: Connect - Telia, AS1299/IPv6: Active - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [03:23:08] (03CR) 10Ryan Kemper: [C: 03+2] relforge: generate new TLS certs [puppet] - 10https://gerrit.wikimedia.org/r/673386 (https://phabricator.wikimedia.org/T275885) (owner: 10Ryan Kemper) [03:25:08] PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:26:36] !log T275885 `ryankemper@cumin1001:~$ sudo cumin 'P{relforge*}' 'sudo run-puppet-agent'` [03:26:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:26:45] T275885: Generate SSL certification for relforge1003.eqiad.wmnet and relforge1004.eqiad.wmnet - https://phabricator.wikimedia.org/T275885 [03:27:52] !log [wdqs] `ryankemper@wdqs1013:~$ sudo systemctl restart wdqs-blazegraph` [03:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:29:46] RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.078 second response time 
https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:45:14] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:46:08] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 58, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [04:46:50] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:49:28] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:51:56] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:19:42] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:24:18] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:40:42] I'm doing some stuff on beta clustre [05:40:47] don't be alarmed [05:41:31] 10SRE, 10Gerrit-Privilege-Requests, 10LDAP-Access-Requests: Grant Access to wmf for TsepoThoabala - https://phabricator.wikimedia.org/T277804 (10Dzahn) confirmed wikitech/LDAP/developer account: ` [mwmaint1002:~] $ /usr/bin/ldapsearch -x "sn=Tsepo*"| grep uid dn: uid=tsepothoabala,ou=people,dc=wikimedia,dc... 
[05:48:30] Amir1: beta stuff generally belongs in #-releng, not here [05:48:52] I know, it's just if it alarms [05:49:19] appreciate the heads up in both places and that Icinga isn't ignored. but also, good night [05:49:27] if it does, I think those still go to releng [05:49:46] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:50:04] (03PS1) 10Dzahn: admin: add Tsepo Thoabala to ldap_only admins, group wmf [puppet] - 10https://gerrit.wikimedia.org/r/673388 (https://phabricator.wikimedia.org/T277804) [05:51:08] (03PS2) 10Dzahn: admin: add Tsepo Thoabala to ldap_only admins, group wmf [puppet] - 10https://gerrit.wikimedia.org/r/673388 (https://phabricator.wikimedia.org/T277804) [05:52:08] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:52:45] (03CR) 10Dzahn: "This needs to go together with a manual "[mwmaint1002:~] $ sudo modify-ldap-group wmf" to add to the LDAP group." [puppet] - 10https://gerrit.wikimedia.org/r/673388 (https://phabricator.wikimedia.org/T277804) (owner: 10Dzahn) [06:03:48] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:06:06] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:08:56] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 132, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:09:04] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:29:00] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:30:31] (03CR) 10Muehlenhoff: [C: 03+1] "Running the offboarding script is mostly non-destructive, since it generates an LDIF which you need apply outside of the script (only the " [puppet] - 10https://gerrit.wikimedia.org/r/662765 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [06:37:20] (03PS1) 10Ladsgroup: beta: Fix beta's url shortener [puppet] - 10https://gerrit.wikimedia.org/r/673389 [06:38:22] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:40:27] (03CR) 10Muehlenhoff: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/670933 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [06:41:00] 10SRE, 10GitLab (Initialization), 10Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)), 10User-brennen: Define auth strategy for GitLab - https://phabricator.wikimedia.org/T274461 (10Sergey.Trofimovsky.SF) >> Something missing from the docs? 
> ahh yes, i have placed the ldap cn=admin password in... [06:42:23] (03CR) 10Muehlenhoff: wmcs-webproxy.py: Port to Python 3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/670933 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [06:43:04] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:43:15] beta cluster only patch to fix url shortener in beta cluster. Any SRE willing to merge please πŸ₯Ί https://gerrit.wikimedia.org/r/c/operations/puppet/+/673389 [06:45:28] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:47:10] Amir1: having a look in a few [06:47:36] Thanks. I'm not super great in apache rules [06:47:52] but it should be fine [06:48:28] Amir1: o/ I am a little confused about the first redirect, since the location matches .+ [06:48:34] couldn't we use https://httpd.apache.org/docs/2.4/mod/mod_alias.html#redirectmatch ? 
[06:48:48] (I just had my coffee so my brain is slow, apologies for silly questions) [06:50:43] ah yes ok now I get it [06:51:06] elukey: o/ [06:51:19] so redirectmatch may be cleaner, but this works yes [06:51:28] it's basically redirecting w.wiki to one target and w.wiki/fff to another [06:51:38] (03CR) 10Muehlenhoff: [C: 03+2] beta: Fix beta's url shortener [puppet] - 10https://gerrit.wikimedia.org/r/673389 (owner: 10Ladsgroup) [06:51:51] yeah, apache configs are confusing and complicated [06:51:57] (not as much as exim4 though) [06:52:18] Amir1: well using redirectmatch + redirect without locations is less confusing, this is why I proposed it :) [06:52:22] anywayyyyyy [06:52:32] all good change merged [06:53:01] there's a patch by Ryan which hasn't been puppet-merged [06:53:23] elukey: it's beta cluster, I'm pretty sure it was broken for years. The whole thing is a mess :D [06:53:23] ryankemper: you still around, good to merge? (new TLS certs for relforge) [06:54:48] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:55:50] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/654336 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [06:56:02] Amir1: would you be willing to fix beta logstash at the same time? Its also broken and I haven't managed to fix it :D [06:56:54] Majavah: if you give me some context I can take a look but I'm not great in ELK [06:58:23] Amir1: basically it just stopped receiving events, fun things like "UDP listener died EADDRINUSE" due to some port conflicts, no idea on how it's supposed to work [06:59:59] oh that seems fun [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210319T0700) [07:00:05] ticket? [07:01:09] spread over time to T233134, T241481 and T276521, I guess T233134 should be the "main" task [07:01:10] T276521: deployment-logstash03 puppet errors - https://phabricator.wikimedia.org/T276521 [07:01:10] T241481: deployment-logstash03: UDP listener died EADDRINUSE, logstash port conflict with rsyslogd - https://phabricator.wikimedia.org/T241481 [07:01:10] T233134: logstash-beta.wmflabs.org does not receive any mediawiki events - https://phabricator.wikimedia.org/T233134 [07:02:58] cool [07:03:18] Added to my todo list for today but these look like a mess already [07:04:11] thanks! I already tried but couldn't get it working, another pair of eyes would be helpful [07:06:18] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:08:49] (03PS1) 10Ladsgroup: Add Wikidata's query builder in toolforge to beta's url shortener [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673391 (https://phabricator.wikimedia.org/T273162) [07:11:00] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:12:21] moritzm: doh, my bad - yes it's clear to merge if it hasn't been already (checking now) [07:13:22] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:13:42] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [07:15:02] moritzm: Amir1: I'm merging both our changes now [07:15:15] all good [07:15:28] done [07:16:02] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [07:16:22] !log T275885 `ryankemper@cumin1001:~$ sudo cumin 'P{relforge*}' 'sudo run-puppet-agent'` (change hadn't been merged when I ran the agent earlier) [07:16:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:32] T275885: Generate SSL certification for relforge1003.eqiad.wmnet and relforge1004.eqiad.wmnet - https://phabricator.wikimedia.org/T275885 [07:21:11] ack, thx [07:21:22] (03PS3) 10ArielGlenn: update bash worker script for handling scondary workers processing job batches [dumps] - 10https://gerrit.wikimedia.org/r/638043 (https://phabricator.wikimedia.org/T252396) [07:25:45] (03CR) 10Sascha: [C: 03+1] static-binaries: first pass at a stripped-down image for binaries [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/673378 (https://phabricator.wikimedia.org/T277749) (owner: 10Bstorm) [07:29:38] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:31:58] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:36:44] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:39:04] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:44:50] 10SRE, 10CAS-SSO: Investigate/enable new actuators for U2F token management - https://phabricator.wikimedia.org/T277837 (10MoritzMuehlenhoff) [07:45:23] 10SRE, 10CAS-SSO: Investigate/enable new actuators for U2F token management - https://phabricator.wikimedia.org/T277837 (10MoritzMuehlenhoff) Same for "A number of new administrative actuator endpoints are presented to report back on the registered authentication handlers and policies." [07:55:18] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:56:53] (03CR) 10David Caro: "> Patch Set 1:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/670933 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [07:58:23] 10SRE, 10CAS-SSO: CAS per-service TGT setting - https://phabricator.wikimedia.org/T277840 (10MoritzMuehlenhoff) [07:59:54] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:00:42] (03CR) 10Elukey: [C: 03+1] "LGTM even if I am not super familiar with helm admin_ng. 
IP ranges looks good :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/673227 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [08:02:15] (03CR) 10Ladsgroup: [C: 03+2] "noop for production" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673391 (https://phabricator.wikimedia.org/T273162) (owner: 10Ladsgroup) [08:02:57] (03Merged) 10jenkins-bot: Add Wikidata's query builder in toolforge to beta's url shortener [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673391 (https://phabricator.wikimedia.org/T273162) (owner: 10Ladsgroup) [08:03:51] rebased on deploy1001 % [08:04:32] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:08:06] (03PS4) 10ArielGlenn: update bash worker script for handling scondary workers processing job batches [dumps] - 10https://gerrit.wikimedia.org/r/638043 (https://phabricator.wikimedia.org/T252396) [08:12:51] 10SRE, 10observability: rsyslog-kubernetes missing in buster-wikimedia - https://phabricator.wikimedia.org/T277739 (10elukey) I have updated the `operations/debs/rsyslog` from salsa.debian.org, now it contains `8.2102.0`. I then tried something simple: * created a local branch `debian/buster-wikimedia` from m... [08:14:02] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:14:29] (03PS4) 10ArielGlenn: distinguish between "no wikis with batches available" and "no wikis left to run" [dumps] - 10https://gerrit.wikimedia.org/r/673210 (https://phabricator.wikimedia.org/T252396) [08:15:48] (03PS3) 10Kosta Harlan: linkrecommendation: Bump memory limit and image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/673006 (https://phabricator.wikimedia.org/T277297) [08:16:06] (03CR) 10DCausse: create helmfile.d structure (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) (owner: 10Mstyles) [08:16:08] 10SRE, 10CAS-SSO: WebAuthn FIDO2 support in CAS - https://phabricator.wikimedia.org/T277841 (10MoritzMuehlenhoff) [08:16:31] (03PS5) 10ArielGlenn: update bash worker script for handling scondary workers processing job batches [dumps] - 10https://gerrit.wikimedia.org/r/638043 (https://phabricator.wikimedia.org/T252396) [08:18:36] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:22:12] 10SRE, 10CAS-SSO: Update CAS to 6.3 - https://phabricator.wikimedia.org/T271684 (10MoritzMuehlenhoff) I filed tasks for new features introduced in 6.3: https://phabricator.wikimedia.org/T277837 https://phabricator.wikimedia.org/T277840 https://phabricator.wikimedia.org/T277841 [08:22:54] !log upload alluxio 2.4.1 to thirdparty/bigtop15 on stretch/buster-wikimedia [08:23:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:14] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:30:08] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:32:24] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:35:46] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:36:00] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:37:37] (03PS2) 10Majavah: beta: remove deployment-restbase[01-02] [puppet] - 10https://gerrit.wikimedia.org/r/673047 (https://phabricator.wikimedia.org/T250574) [08:37:46] (03CR) 10Alexandros Kosiaris: [C: 04-1] helm: Make ML k8s clusters visible to helm (036 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/673227 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [08:46:18] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:49:03] 10SRE, 10Patch-For-Review: Migrate irc.wikimedia.org/kraz to Buster - https://phabricator.wikimedia.org/T224579 (10MoritzMuehlenhoff) >>! In T224579#6924029, @fgiunchedi wrote: > Sure enough, the exporter is out of FDs again. 
I'm +1 to just remove the exporter since the service doesn't have an owner, the expor... [08:53:16] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:56:01] (03PS1) 10Elukey: aptrepo: add a new rsyslog-k8s component for buster-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/673442 (https://phabricator.wikimedia.org/T277739) [09:00:45] (03CR) 10Volans: [C: 03+2] beta: remove deployment-restbase[01-02] [puppet] - 10https://gerrit.wikimedia.org/r/673047 (https://phabricator.wikimedia.org/T250574) (owner: 10Majavah) [09:01:13] (03PS4) 10Klausman: helm: Make ML k8s clusters visible to helm [deployment-charts] - 10https://gerrit.wikimedia.org/r/673227 (https://phabricator.wikimedia.org/T272918) [09:04:46] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:06:30] (03CR) 10Alexandros Kosiaris: linkrecommendation: Bump memory limit and image version (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/673006 (https://phabricator.wikimedia.org/T277297) (owner: 10Kosta Harlan) [09:07:01] (03PS5) 10Klausman: helm: Make ML k8s clusters visible to helm [deployment-charts] - 10https://gerrit.wikimedia.org/r/673227 (https://phabricator.wikimedia.org/T272918) [09:07:09] (03CR) 10Klausman: helm: Make ML k8s clusters visible to helm (036 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/673227 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [09:07:40] (03PS1) 10Jcrespo: dbbackups: Reenable notifications on db2101 after data load [puppet] - 10https://gerrit.wikimedia.org/r/673443 (https://phabricator.wikimedia.org/T277632) [09:07:51] (03PS2) 10Jcrespo: dbbackups: 
Reenable notifications on db2101 after data load [puppet] - 10https://gerrit.wikimedia.org/r/673443 (https://phabricator.wikimedia.org/T277632) [09:08:39] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Reenable notifications on db2101 after data load [puppet] - 10https://gerrit.wikimedia.org/r/673443 (https://phabricator.wikimedia.org/T277632) (owner: 10Jcrespo) [09:09:26] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:11:55] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/673388 (https://phabricator.wikimedia.org/T277804) (owner: 10Dzahn) [09:12:53] (03CR) 10Volans: [C: 03+2] admin: add Tsepo Thoabala to ldap_only admins, group wmf [puppet] - 10https://gerrit.wikimedia.org/r/673388 (https://phabricator.wikimedia.org/T277804) (owner: 10Dzahn) [09:12:56] (03CR) 10Awight: [C: 03+1] Enable CodeMirror accessibility colors on initial wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673326 (https://phabricator.wikimedia.org/T276346) (owner: 10Andrew-WMDE) [09:15:20] (03PS4) 10Kosta Harlan: linkrecommendation: Bump requests memory limit and image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/673006 (https://phabricator.wikimedia.org/T277297) [09:15:32] 10SRE, 10Gerrit-Privilege-Requests, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to wmf for TsepoThoabala - https://phabricator.wikimedia.org/T277804 (10Volans) 05Open→03Resolved p:05Triage→03Medium a:03Volans Patch merged, added user to the `wmf` group. @TThoabala all done, resolving.
[09:16:22] (03PS5) 10Kosta Harlan: linkrecommendation: Bump requests memory limit and image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/673006 (https://phabricator.wikimedia.org/T277297) [09:16:34] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:23:30] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:23:33] (03CR) 10Kormat: [C: 03+1] Add sql with the empty database structure to the repo [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673292 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [09:23:50] 10SRE, 10serviceops: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (10JMeybohm) I don't really like option 3 just because it moves parts of the software stack to the node itself and I would personally like them to be as dumb as possible, ideally... 
[09:26:08] (03CR) 10Alexandros Kosiaris: [C: 04-1] linkrecommendation: Bump requests memory limit and image version (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/673006 (https://phabricator.wikimedia.org/T277297) (owner: 10Kosta Harlan) [09:28:01] (03PS6) 10Kosta Harlan: linkrecommendation: Bump requests memory limit and image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/673006 (https://phabricator.wikimedia.org/T277297) [09:28:18] (03CR) 10Alexandros Kosiaris: [C: 03+1] linkrecommendation: Bump requests memory limit and image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/673006 (https://phabricator.wikimedia.org/T277297) (owner: 10Kosta Harlan) [09:28:20] (03CR) 10Kosta Harlan: linkrecommendation: Bump requests memory limit and image version (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/673006 (https://phabricator.wikimedia.org/T277297) (owner: 10Kosta Harlan) [09:32:14] (03CR) 10JMeybohm: [C: 03+2] chromium-render: Add default labels and fix name of configmap [deployment-charts] - 10https://gerrit.wikimedia.org/r/670464 (owner: 10JMeybohm) [09:32:38] (03CR) 10Jcrespo: [C: 03+2] Add sql with the empty database structure to the repo [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673292 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [09:33:15] (03Merged) 10jenkins-bot: Add sql with the empty database structure to the repo [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673292 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [09:33:36] (03Merged) 10jenkins-bot: chromium-render: Add default labels and fix name of configmap [deployment-charts] - 10https://gerrit.wikimedia.org/r/670464 (owner: 10JMeybohm) [09:33:57] (03CR) 10Elukey: [C: 03+2] aptrepo: add a new rsyslog-k8s component for buster-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/673442 (https://phabricator.wikimedia.org/T277739) (owner: 10Elukey) [09:34:04] (03CR) 10Gergő 
Tisza: [C: 03+2] linkrecommendation: Bump requests memory limit and image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/673006 (https://phabricator.wikimedia.org/T277297) (owner: 10Kosta Harlan) [09:35:30] (03Merged) 10jenkins-bot: linkrecommendation: Bump requests memory limit and image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/673006 (https://phabricator.wikimedia.org/T277297) (owner: 10Kosta Harlan) [09:36:20] (03PS1) 10Kormat: compare: Use dbutil.addr_split for parsing host:port [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/673446 (https://phabricator.wikimedia.org/T277843) [09:36:25] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'proton' for release 'production' . [09:36:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:32] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:39:57] (03CR) 10Jcrespo: [C: 03+1] "Looks fine to me: test, ship it, close the ticket! :-)" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/673446 (https://phabricator.wikimedia.org/T277843) (owner: 10Kormat) [09:40:26] !log kharlan@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . 
[09:40:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:48] 10SRE, 10CAS-SSO: Update CAS to 6.3 - https://phabricator.wikimedia.org/T271684 (10jbond) 05Open→03Resolved a:03jbond [09:47:28] 10SRE, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Convert helm releases to the new release naming schem - https://phabricator.wikimedia.org/T277849 (10JMeybohm) [09:47:38] 10SRE, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Convert helm releases to the new release naming schem - https://phabricator.wikimedia.org/T277849 (10JMeybohm) p:05Triage→03Low [09:48:50] 10SRE, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Convert helm releases to the new release naming schema - https://phabricator.wikimedia.org/T277849 (10JMeybohm) [09:49:00] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:51:12] 10SRE, 10Prod-Kubernetes, 10SRE-tools: Support downtiming services in our cookbooks - https://phabricator.wikimedia.org/T277740 (10akosiaris) >>! In T277740#6925615, @Volans wrote: > Doh, I think we have naming clash here :) I figured, hence the comment. > > - service: as in Icinga single service belong... [10:04:58] !log kharlan@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . [10:04:58] !log kharlan@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . 
[10:05:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:31] (03CR) 10Jbond: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/673228 (owner: 10Jbond) [10:10:20] (03CR) 10Kormat: [C: 03+2] "Look good, shipping:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/673446 (https://phabricator.wikimedia.org/T277843) (owner: 10Kormat) [10:10:59] !log kharlan@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . [10:10:59] !log kharlan@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [10:11:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:16] (03CR) 10Volans: "I've no context on the task at hand, did just a generic Python pass. Feel free to ignore most of the comments." (0311 comments) [puppet] - 10https://gerrit.wikimedia.org/r/673377 (https://phabricator.wikimedia.org/T277775) (owner: 10Cwhite) [10:18:17] 10SRE, 10observability: rsyslog-kubernetes missing in buster-wikimedia - https://phabricator.wikimedia.org/T277739 (10elukey) After a chat with Moritz we decided to create a specific component with 8.1901 for buster: ` root@apt1001:/srv/wikimedia# reprepro lsbycomponent rsyslog rsyslog | 8.1901.0-1~bpo8+wmf1... 
[10:18:25] (03PS1) 10Arturo Borrero Gonzalez: sonofgridengine: master: ensure cpp package is installed [puppet] - 10https://gerrit.wikimedia.org/r/673448 (https://phabricator.wikimedia.org/T277653) [10:19:49] (03PS2) 10Arturo Borrero Gonzalez: sonofgridengine: master: ensure cpp package is installed [puppet] - 10https://gerrit.wikimedia.org/r/673448 (https://phabricator.wikimedia.org/T277653) [10:20:51] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] sonofgridengine: master: ensure cpp package is installed [puppet] - 10https://gerrit.wikimedia.org/r/673448 (https://phabricator.wikimedia.org/T277653) (owner: 10Arturo Borrero Gonzalez) [10:22:05] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/658414 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [10:22:56] (03CR) 10Jcrespo: "CCing current Swift and DB owners- consider if my advice on previous comment is fair or I am being too cautious. Up to you." [puppet] - 10https://gerrit.wikimedia.org/r/670972 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [10:23:13] (03PS1) 10Elukey: profile::rsyslog::kubernetes: add component for buster [puppet] - 10https://gerrit.wikimedia.org/r/673450 (https://phabricator.wikimedia.org/T277739) [10:23:35] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:24:36] (03PS2) 10Elukey: profile::rsyslog::kubernetes: add component for buster [puppet] - 10https://gerrit.wikimedia.org/r/673450 (https://phabricator.wikimedia.org/T277739) [10:25:43] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:27:13] 10SRE, 10Security-Team, 10CAS-SSO, 10User-jbond: Validate Single Logout Flow - https://phabricator.wikimedia.org/T233941 (10jbond) https://wiki.shibboleth.net/confluence/display/CONCEPT/SLOIssues seems like a useful document when considering this [10:28:46] 10SRE, 10Product-Infrastructure-Team-Backlog, 10Proton: Proton metrics broken - https://phabricator.wikimedia.org/T277857 (10JMeybohm) [10:29:46] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28674/console" [puppet] - 10https://gerrit.wikimedia.org/r/673450 (https://phabricator.wikimedia.org/T277739) (owner: 10Elukey) [10:30:01] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:30:55] (03CR) 10Alexandros Kosiaris: [C: 03+1] helm: Make ML k8s clusters visible to helm [deployment-charts] - 10https://gerrit.wikimedia.org/r/673227 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [10:31:19] (03CR) 10Klausman: [C: 03+2] helm: Make ML k8s clusters visible to helm [deployment-charts] - 10https://gerrit.wikimedia.org/r/673227 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [10:34:13] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:35:31] (03CR) 10Volans: "As this seems to have a lot of shared code between the 2 blocks, consider if it might be useful to move the common part into sre/elasticse" [cookbooks] - 10https://gerrit.wikimedia.org/r/673343 (https://phabricator.wikimedia.org/T277792) (owner: 10Ryan Kemper) [10:36:04] (03PS1) 10Elukey: cumin: fix ml-serve aliases and add new ones [puppet] - 10https://gerrit.wikimedia.org/r/673452 (https://phabricator.wikimedia.org/T272918) [10:36:46] (03Merged) 10jenkins-bot: helm: Make ML k8s clusters visible to helm [deployment-charts] - 10https://gerrit.wikimedia.org/r/673227 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [10:36:49] volans, klausman --^ [10:36:51] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/673452 (https://phabricator.wikimedia.org/T272918) (owner: 10Elukey) [10:37:04] not sure if you need the distrinction worker/masters within a DC too [10:37:10] but you'll see later on that [10:37:22] (03CR) 10Klausman: [C: 03+1] cumin: fix ml-serve aliases and add new ones [puppet] - 10https://gerrit.wikimedia.org/r/673452 (https://phabricator.wikimedia.org/T272918) (owner: 10Elukey) [10:37:39] 10SRE, 10Product-Infrastructure-Team-Backlog, 10Proton: Proton metrics broken - https://phabricator.wikimedia.org/T277857 (10Jgiannelos) I think this is the patch that introduced the change from statsd metrics to native prometheus: https://gerrit.wikimedia.org/r/c/mediawiki/services/chromium-render/+/558213 [10:38:27] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:40:33] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within 
thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:40:35] volans: yes yes I had the same idea [10:40:43] (03CR) 10Elukey: [C: 03+2] cumin: fix ml-serve aliases and add new ones [puppet] - 10https://gerrit.wikimedia.org/r/673452 (https://phabricator.wikimedia.org/T272918) (owner: 10Elukey) [10:41:16] (03CR) 10Elukey: [V: 03+1 C: 03+2] profile::rsyslog::kubernetes: add component for buster [puppet] - 10https://gerrit.wikimedia.org/r/673450 (https://phabricator.wikimedia.org/T277739) (owner: 10Elukey) [10:41:49] !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [10:41:49] !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [10:41:53] 10SRE, 10SRE-Access-Requests: Requesting access to sites from Google Search Console - https://phabricator.wikimedia.org/T277602 (10Volans) 05Open→03Resolved a:03Volans @CGlenn I've added you to the mobile domain too `am.m.wikipedia.org`, I consider the approval for the whole "//language//". Resolving, fe... [10:41:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:09] RECOVERY - Check systemd state on ml-serve1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:42:23] !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [10:42:23] !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[10:42:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:38] !log installing dbmonitor1002 T224589 [10:42:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:45] T224589: Migrate dbmonitor hosts to Buster - https://phabricator.wikimedia.org/T224589 [10:44:20] volans: just realized - ml-serve: A:ml-serve-master and A:ml-serve-worker [10:44:28] * elukey plays sad_trombone.wav [10:44:32] fixing it [10:44:47] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:45:31] RECOVERY - Check systemd state on ml-serve1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:45:33] !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [10:45:33] !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[10:45:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:52] (03PS1) 10Jgiannelos: Configure prometheus metrics for chromium-renderer [deployment-charts] - 10https://gerrit.wikimedia.org/r/673454 (https://phabricator.wikimedia.org/T277857) [10:48:28] (03PS1) 10Alexandros Kosiaris: deployment: Add ML cluster to deploy hosts [puppet] - 10https://gerrit.wikimedia.org/r/673455 [10:48:33] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.002407 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [10:49:15] (03PS1) 10Elukey: cumin: fix ml-serve alias and add newer ones [puppet] - 10https://gerrit.wikimedia.org/r/673457 (https://phabricator.wikimedia.org/T272918) [10:49:33] volans: --^ [10:49:53] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28675/console" [puppet] - 10https://gerrit.wikimedia.org/r/673455 (owner: 10Alexandros Kosiaris) [10:53:37] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:54:41] 10SRE, 10GitLab (Initialization), 10Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)), 10User-brennen: Define auth strategy for GitLab - https://phabricator.wikimedia.org/T274461 (10jbond) >>! In T274461#6927626, @Sergey.Trofimovsky.SF wrote: >>> Something missing from the docs? >> ahh yes, i h... 
[10:58:53] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/673457 (https://phabricator.wikimedia.org/T272918) (owner: 10Elukey) [10:59:03] elukey: done, sorry for missing the and/or typo [10:59:21] (03CR) 10Elukey: [C: 03+2] cumin: fix ml-serve alias and add newer ones [puppet] - 10https://gerrit.wikimedia.org/r/673457 (https://phabricator.wikimedia.org/T272918) (owner: 10Elukey) [10:59:28] my bad :) [11:08:39] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:08:56] (03PS2) 10Alexandros Kosiaris: deployment: Add ML cluster to deploy hosts [puppet] - 10https://gerrit.wikimedia.org/r/673455 [11:12:10] (03PS3) 10Alexandros Kosiaris: deployment: Add ML cluster to deploy hosts [puppet] - 10https://gerrit.wikimedia.org/r/673455 [11:12:59] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:13:12] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28677/console" [puppet] - 10https://gerrit.wikimedia.org/r/673455 (owner: 10Alexandros Kosiaris) [11:13:37] (03PS1) 10Jbond: hiera - cloud: move debmon to sso project [puppet] - 10https://gerrit.wikimedia.org/r/673461 [11:14:41] (03CR) 10Jbond: [C: 03+2] hiera - cloud: move debmon to sso project [puppet] - 10https://gerrit.wikimedia.org/r/673461 (owner: 10Jbond) [11:17:21] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={ircd,netbox_device_statistics} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:18:03] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2002.codfw.wmnet with reason: REIMAGE [11:18:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:46] (03PS2) 10Ayounsi: tests: add tests for the configuration files [homer/public] - 10https://gerrit.wikimedia.org/r/672765 (https://phabricator.wikimedia.org/T272688) (owner: 10Volans) [11:20:02] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2002.codfw.wmnet with reason: REIMAGE [11:20:04] (03CR) 10JMeybohm: [C: 03+1] "LTGM" [puppet] - 10https://gerrit.wikimedia.org/r/673455 (owner: 10Alexandros Kosiaris) [11:20:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:12] (03CR) 10Klausman: [C: 03+1] deployment: Add ML cluster to deploy hosts [puppet] - 10https://gerrit.wikimedia.org/r/673455 (owner: 10Alexandros Kosiaris) [11:20:23] (03CR) 10jerkins-bot: [V: 04-1] tests: add tests for the configuration files [homer/public] - 
10https://gerrit.wikimedia.org/r/672765 (https://phabricator.wikimedia.org/T272688) (owner: 10Volans) [11:22:55] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:23:52] (03PS1) 10Jbond: hiera - cloud: correct debmon name [puppet] - 10https://gerrit.wikimedia.org/r/673463 [11:23:58] 10SRE, 10Services, 10Patch-For-Review, 10Performance-Team (Radar), 10Service-deployment-requests: New Service Request: xhgui - https://phabricator.wikimedia.org/T277483 (10akosiaris) >>! In T277483#6925411, @dpifke wrote: >>>! In T277483#6924456, @akosiaris wrote: >> * Is xhgui stateless? More specifical... [11:25:03] (03CR) 10Alexandros Kosiaris: [V: 03+1 C: 03+2] deployment: Add ML cluster to deploy hosts [puppet] - 10https://gerrit.wikimedia.org/r/673455 (owner: 10Alexandros Kosiaris) [11:25:07] 10SRE, 10Services, 10serviceops-radar, 10Patch-For-Review, and 2 others: New Service Request: xhgui - https://phabricator.wikimedia.org/T277483 (10akosiaris) p:05Triage→03Medium [11:25:49] (03CR) 10Jbond: [C: 03+2] hiera - cloud: correct debmon name [puppet] - 10https://gerrit.wikimedia.org/r/673463 (owner: 10Jbond) [11:27:48] !log akosiaris@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [11:27:51] !log akosiaris@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [11:27:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:12] !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [11:29:15] !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[11:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:34] !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [11:29:38] !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [11:29:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:56] !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [11:29:56] !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [11:30:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:05] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={atlas_exporter,ircd} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:30:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:31] (03PS1) 10Jbond: cloud hiera - sso: add puppetmasters block [puppet] - 10https://gerrit.wikimedia.org/r/673464 [11:33:56] (03CR) 10Jbond: [C: 03+2] cloud hiera - sso: add puppetmasters block [puppet] - 10https://gerrit.wikimedia.org/r/673464 (owner: 10Jbond) [11:34:19] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:36:40] !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [11:36:44] !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. 
[11:36:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:04] !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [11:37:07] !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [11:37:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:19] (03PS1) 10Jbond: cloud - sso: fix puppet masters format [puppet] - 10https://gerrit.wikimedia.org/r/673467 [11:45:00] (03CR) 10Jbond: [C: 03+2] cloud - sso: fix puppet masters format [puppet] - 10https://gerrit.wikimedia.org/r/673467 (owner: 10Jbond) [11:47:02] (03CR) 10JMeybohm: [C: 03+1] "This looks about right, thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/673454 (https://phabricator.wikimedia.org/T277857) (owner: 10Jgiannelos) [11:47:51] (03PS3) 10Ayounsi: tests: add tests for the configuration files [homer/public] - 10https://gerrit.wikimedia.org/r/672765 (https://phabricator.wikimedia.org/T272688) (owner: 10Volans) [11:47:53] (03PS2) 10Ayounsi: WIP. tests: generate documentation from schemas [homer/public] - 10https://gerrit.wikimedia.org/r/673071 (https://phabricator.wikimedia.org/T272688) (owner: 10Volans) [11:48:13] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:48:17] (03CR) 10jerkins-bot: [V: 04-1] tests: add tests for the configuration files [homer/public] - 10https://gerrit.wikimedia.org/r/672765 (https://phabricator.wikimedia.org/T272688) (owner: 10Volans) [11:48:25] (03CR) 10jerkins-bot: [V: 04-1] WIP. 
tests: generate documentation from schemas [homer/public] - 10https://gerrit.wikimedia.org/r/673071 (https://phabricator.wikimedia.org/T272688) (owner: 10Volans) [11:50:01] (03PS1) 10Volans: tests: fix pip backtracking [software/pywmflib] - 10https://gerrit.wikimedia.org/r/673468 [11:50:21] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:54:56] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Enable bracket matching on group0 and wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673312 (https://phabricator.wikimedia.org/T273591) (owner: 10Andrew-WMDE) [11:55:51] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={ircd,swagger_check_citoid_cluster_eqiad} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:58:07] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:59:57] (03CR) 10Volans: "This is my proposal to fix the issues we're getting in the last days with the aborted CI due to pip backtracking [1] taking too long to re" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/673468 (owner: 10Volans) [12:03:17] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:10:13] !log upgrade memcached on mc1026,mc2026 [12:10:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:57] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:13:45] (03PS1) 10Kosta Harlan: linkrecommendation: Add Swagger UI environment variables [deployment-charts] - 10https://gerrit.wikimedia.org/r/673471 (https://phabricator.wikimedia.org/T277644) [12:23:16] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:25:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: 2021-03-31) rack/setup/install cloudgw100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T272403 (10aborrero) a:05aborrero→03RobH The missing VLAN was just recently resolved in {T277020} The contrlol plane 1G port was ther... 
[12:26:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: 2021-03-31) rack/setup/install cloudgw100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T272403 (10aborrero) [12:26:44] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:32:06] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:33:38] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:34:48] !log klausman@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2002.codfw.wmnet with reason: REIMAGE [12:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:50] !log klausman@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2002.codfw.wmnet with reason: REIMAGE [12:36:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:04] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:42:54] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:50:43] (03CR) 10Alexandros Kosiaris: [C: 03+1] eventrouter: Update build and base image, switch to nobody [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/669846 (https://phabricator.wikimedia.org/T274852) (owner: 10JMeybohm) [12:51:02] (03CR) 10Alexandros Kosiaris: [C: 03+1] ratelimit: Switch to nobody, update build and base image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/670836 (https://phabricator.wikimedia.org/T274852) (owner: 10JMeybohm) [12:51:08] 10SRE, 10Gerrit-Privilege-Requests, 10LDAP-Access-Requests: Grant Access to wmf for TsepoThoabala - https://phabricator.wikimedia.org/T277804 (10Aklapper) @TThoabala: Hi, did this ticket supersede T277797 ? If yes, then please set the task status there to `declined` - thanks! [12:51:19] (03CR) 10Alexandros Kosiaris: [C: 03+1] fluent-bit: Switch to nobody and use seed_image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/670838 (https://phabricator.wikimedia.org/T274852) (owner: 10JMeybohm) [12:56:54] (03PS1) 10Arturo Borrero Gonzalez: openstack: nova: disable /etc/host management from cloud-init [puppet] - 10https://gerrit.wikimedia.org/r/673475 (https://phabricator.wikimedia.org/T277866) [12:59:34] (03CR) 10Arturo Borrero Gonzalez: "perhaps other option is to manage the template `/etc/cloud/templates/hosts.debian.tmp` via puppet before cloud-init runs at VM creating ti" [puppet] - 10https://gerrit.wikimedia.org/r/673475 (https://phabricator.wikimedia.org/T277866) (owner: 10Arturo Borrero Gonzalez) [13:10:34] 10SRE, 10CAS-SSO: Investigate/enable new actuators for U2F token management - https://phabricator.wikimedia.org/T277837 (10MoritzMuehlenhoff) p:05Triage→03Low [13:10:40] 10SRE, 10CAS-SSO: CAS per-service TGT setting - https://phabricator.wikimedia.org/T277840 (10MoritzMuehlenhoff) p:05Triage→03Low 
[13:10:46] 10SRE, 10CAS-SSO: WebAuthn FIDO2 support in CAS - https://phabricator.wikimedia.org/T277841 (10MoritzMuehlenhoff) p:05Triage→03Medium [13:19:34] (03PS1) 10Kormat: WMFMariaDB: Allow setting debug via env var [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/673480 [13:22:14] (03CR) 10Kormat: [C: 03+2] WMFMariaDB: Allow setting debug via env var [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/673480 (owner: 10Kormat) [13:24:57] (03PS1) 10Jbond: cloud - hiera: add horizon config to yaml [puppet] - 10https://gerrit.wikimedia.org/r/673481 [13:25:56] (03Merged) 10jenkins-bot: WMFMariaDB: Allow setting debug via env var [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/673480 (owner: 10Kormat) [13:26:39] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:28:55] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:31:33] (03PS2) 10Jbond: cloud - hiera: add horizon config to yaml [puppet] - 10https://gerrit.wikimedia.org/r/673481 [13:33:47] 10SRE, 10Services, 10Service-deployment-requests, 10artificial-intelligence: New Service Request 'open_nsfw' - https://phabricator.wikimedia.org/T250110 (10akosiaris) Hello, >>! In T250110#6924592, @Chtnnh wrote: > Hello! > > Yes, we would love to have this service deployed. Although, over the course of... 
[13:34:27] (03CR) 10Jbond: [C: 03+2] cloud - hiera: add horizon config to yaml [puppet] - 10https://gerrit.wikimedia.org/r/673481 (owner: 10Jbond) [13:35:41] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:37:07] (03PS1) 10Jbond: cloud sso: add puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/673485 [13:37:55] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:38:33] (03CR) 10Jbond: [C: 03+2] cloud sso: add puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/673485 (owner: 10Jbond) [13:46:51] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:49:05] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:50:22] (03CR) 10Alexandros Kosiaris: [C: 04-1] "I am sorry, my previous comments were wrong, please disregard." 
(031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/672386 (https://phabricator.wikimedia.org/T271711) (owner: 10KartikMistry) [13:53:33] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:57:13] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 235 probes of 605 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:58:01] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:58:34] (03PS1) 10Andrew Bogott: nova vendordata: adjust cloud-init package list [puppet] - 10https://gerrit.wikimedia.org/r/673489 [13:59:42] (03CR) 10Andrew Bogott: [C: 03+2] nova vendordata: adjust cloud-init package list [puppet] - 10https://gerrit.wikimedia.org/r/673489 (owner: 10Andrew Bogott) [14:03:07] (03PS2) 10Alexandros Kosiaris: docker: tabs to spaces [puppet] - 10https://gerrit.wikimedia.org/r/672450 (owner: 10Legoktm) [14:03:20] (03CR) 10Alexandros Kosiaris: [C: 03+2] docker: tabs to spaces [puppet] - 10https://gerrit.wikimedia.org/r/672450 (owner: 10Legoktm) [14:03:23] 10SRE, 10Services, 10Service-deployment-requests, 10artificial-intelligence: New Service Request 'open_nsfw' - https://phabricator.wikimedia.org/T250110 (10Chtnnh) I understand @akosiaris ! Is it possible to deploy to production as volunteers? As in, is it possible for long time volunteers to have deploy... 
[14:03:27] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 50 probes of 605 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:08:51] (03PS6) 10KartikMistry: Update cxserver to 2021-03-15-131520-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/672386 (https://phabricator.wikimedia.org/T271711) [14:09:33] (03CR) 10KartikMistry: "> Patch Set 5: Code-Review-1" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/672386 (https://phabricator.wikimedia.org/T271711) (owner: 10KartikMistry) [14:11:27] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:13:25] (03PS1) 10Andrew Bogott: nova-fullstack: temporarily run with a different base image [puppet] - 10https://gerrit.wikimedia.org/r/673496 [14:16:42] (03CR) 10Andrew Bogott: [C: 03+2] nova-fullstack: temporarily run with a different base image [puppet] - 10https://gerrit.wikimedia.org/r/673496 (owner: 10Andrew Bogott) [14:20:31] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:21:09] (03CR) 10David Caro: [C: 03+1] "This works on my local." 
[software/pywmflib] - 10https://gerrit.wikimedia.org/r/673468 (owner: 10Volans) [14:23:45] (03PS1) 10Jbond: sso-debmon: comment out classes so we can at least get one puppet run [puppet] - 10https://gerrit.wikimedia.org/r/673499 [14:25:28] (03CR) 10Jbond: [C: 03+2] sso-debmon: comment out classes so we can at least get one puppet run [puppet] - 10https://gerrit.wikimedia.org/r/673499 (owner: 10Jbond) [14:29:43] (03PS1) 10Jbond: Revert "sso-debmon: comment out classes so we can at least get one puppet run" [puppet] - 10https://gerrit.wikimedia.org/r/673121 [14:34:17] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:34:17] (03CR) 10Jbond: [C: 03+2] Revert "sso-debmon: comment out classes so we can at least get one puppet run" [puppet] - 10https://gerrit.wikimedia.org/r/673121 (owner: 10Jbond) [14:34:56] 10SRE, 10Services, 10Service-deployment-requests, 10artificial-intelligence: New Service Request 'open_nsfw' - https://phabricator.wikimedia.org/T250110 (10akosiaris) >>! In T250110#6928585, @Chtnnh wrote: > I understand @akosiaris ! > > Is it possible to deploy to production as volunteers? As in, is it... [14:36:41] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:38:22] 10SRE, 10Services, 10Service-deployment-requests, 10artificial-intelligence: New Service Request 'open_nsfw' - https://phabricator.wikimedia.org/T250110 (10Chtnnh) I see. I think the team (@Harshineesriram, @Abbasidaniyal and I) will have to put some thought into that. As far as the timeline is concerned... 
[14:47:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: 2021-03-31) rack/setup/install cloudgw100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T272403 (10RobH) @aborrero, Perhaps this wasn't conveyed at the time of order, and it may cause issues, but we don't support connecting mi... [14:47:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: 2021-03-31) rack/setup/install cloudgw100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T272403 (10RobH) a:05RobH→03aborrero [14:48:40] 10SRE, 10Analytics-Radar, 10Machine-Learning-Team: Kubeflow on stat machines - https://phabricator.wikimedia.org/T275551 (10akosiaris) Just a few clarifications and answers. > cloud vps is a kubernetes cluster It's toolforge that's half powered by a kubernetes cluster. The other half is powered by son of g... [14:52:35] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:54:51] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:57:43] (03CR) 10Bstorm: "You'll need this on the grid master as well. shadow_master should only be on the shadow server." 
[puppet] - 10https://gerrit.wikimedia.org/r/673448 (https://phabricator.wikimedia.org/T277653) (owner: 10Arturo Borrero Gonzalez) [14:59:10] (03CR) 10Bstorm: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/673448 (https://phabricator.wikimedia.org/T277653) (owner: 10Arturo Borrero Gonzalez) [15:04:03] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:04:42] (03PS1) 10Jbond: P:debmonitor: fix dependencies in cloud [puppet] - 10https://gerrit.wikimedia.org/r/673511 [15:06:19] (03CR) 10Jbond: [C: 03+2] P:debmonitor: fix dependencies in cloud [puppet] - 10https://gerrit.wikimedia.org/r/673511 (owner: 10Jbond) [15:08:53] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:14:19] (03CR) 10Bstorm: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/673448 (https://phabricator.wikimedia.org/T277653) (owner: 10Arturo Borrero Gonzalez) [15:16:06] (03PS1) 10Jbond: P:debmonitor: fix nginx ssl config [puppet] - 10https://gerrit.wikimedia.org/r/673514 [15:16:50] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28678/console" [puppet] - 10https://gerrit.wikimedia.org/r/673514 (owner: 10Jbond) [15:17:25] (03PS1) 10Elukey: Add alluxio keytabs on Hadoop test [puppet] - 10https://gerrit.wikimedia.org/r/673515 (https://phabricator.wikimedia.org/T266641) [15:17:39] (03CR) 10Bstorm: "I think anything cloud-init does is our one guaranteed change on a VM, since a user can disable puppet or break it. 
It would be great if w" [puppet] - 10https://gerrit.wikimedia.org/r/673475 (https://phabricator.wikimedia.org/T277866) (owner: 10Arturo Borrero Gonzalez) [15:18:15] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={atlas_exporter,ircd,netbox_device_statistics} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:18:23] (03CR) 10Elukey: [C: 03+2] Add alluxio keytabs on Hadoop test [puppet] - 10https://gerrit.wikimedia.org/r/673515 (https://phabricator.wikimedia.org/T266641) (owner: 10Elukey) [15:20:57] (03CR) 10Bstorm: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/673475 (https://phabricator.wikimedia.org/T277866) (owner: 10Arturo Borrero Gonzalez) [15:22:51] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:31:18] (03PS12) 10Dave Pifke: arclamp: serve SVGs, compressed logs from Swift [puppet] - 10https://gerrit.wikimedia.org/r/623068 (https://phabricator.wikimedia.org/T244776) [15:33:43] (03CR) 10Dave Pifke: "This is ready to merge at your convenience." 
[puppet] - 10https://gerrit.wikimedia.org/r/623068 (https://phabricator.wikimedia.org/T244776) (owner: 10Dave Pifke) [15:34:03] (03CR) 10Bstorm: [C: 03+2] maintain-dbusers: fix the order of the paws accounts listing [puppet] - 10https://gerrit.wikimedia.org/r/673380 (https://phabricator.wikimedia.org/T276284) (owner: 10Bstorm) [15:34:45] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:36:52] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:debmonitor: fix nginx ssl config [puppet] - 10https://gerrit.wikimedia.org/r/673514 (owner: 10Jbond) [15:39:25] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:40:05] (03CR) 10CRusnov: "LGTM :)" [puppet] - 10https://gerrit.wikimedia.org/r/565800 (owner: 10Legoktm) [15:41:06] (03Abandoned) 10CRusnov: mwgrep.py: Port to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/670975 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [15:41:12] 10SRE, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Set resource requests and limits for calico PODs - https://phabricator.wikimedia.org/T277877 (10JMeybohm) [15:41:20] 10SRE, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Set resource requests and limits for calico PODs - https://phabricator.wikimedia.org/T277877 (10JMeybohm) p:05Triage→03High [15:46:21] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:48:41] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:56:59] (03PS1) 10Jbond: P:debmonitor::client: update cas vhost and open FW [puppet] - 10https://gerrit.wikimedia.org/r/673523 [15:58:25] (03PS4) 10Cwhite: logstash: add and enable dlq max_bytes workaround [puppet] - 10https://gerrit.wikimedia.org/r/673377 (https://phabricator.wikimedia.org/T277775) [15:59:29] (03CR) 10jerkins-bot: [V: 04-1] logstash: add and enable dlq max_bytes workaround [puppet] - 10https://gerrit.wikimedia.org/r/673377 (https://phabricator.wikimedia.org/T277775) (owner: 10Cwhite) [16:01:45] !log upgrade memcached on mc-gp200* [16:01:47] (03PS1) 10Bstorm: maintain-dbusers: type cast the uid for paws users [puppet] - 10https://gerrit.wikimedia.org/r/673524 (https://phabricator.wikimedia.org/T276284) [16:01:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:02] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28680/console" [puppet] - 10https://gerrit.wikimedia.org/r/673523 (owner: 10Jbond) [16:02:55] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:debmonitor::client: update cas vhost and open FW [puppet] - 10https://gerrit.wikimedia.org/r/673523 (owner: 10Jbond) [16:03:35] (03CR) 10Bstorm: [C: 03+2] maintain-dbusers: type cast the uid for paws users [puppet] - 10https://gerrit.wikimedia.org/r/673524 (https://phabricator.wikimedia.org/T276284) (owner: 10Bstorm) [16:05:19] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:06:28] 10SRE, 10serviceops, 10Patch-For-Review, 10User-jijiki: Upgrade memcached to version 1.6.x - https://phabricator.wikimedia.org/T270315 (10jijiki) [16:07:22] (03PS1) 10Effie Mouzeli: hieradata: install 
memcached 1.6 to gutter pool servers [puppet] - 10https://gerrit.wikimedia.org/r/673527 (https://phabricator.wikimedia.org/T270315) [16:07:57] (03PS1) 10Jbond: cloud - hiera: move hiera keys to correct level [puppet] - 10https://gerrit.wikimedia.org/r/673528 [16:10:17] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:10:56] (03CR) 10Jbond: [C: 03+2] cloud - hiera: move hiera keys to correct level [puppet] - 10https://gerrit.wikimedia.org/r/673528 (owner: 10Jbond) [16:11:51] ssh is telling me that the key for bast1002.wikimedia.org changed, namely to SHA256:XfPttsgImI8r43WfwENq8eA36R6i88RNnE409XiNpBk. [16:11:59] Can someone confirm that this is expected? [16:12:38] duesen: https://wikitech.wikimedia.org/w/index.php?title=Help%3ASSH_Fingerprints%2Fbast1002.wikimedia.org&type=revision&diff=1900025&oldid=1799398, looks like yes [16:13:11] Majavah: thank you! 
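(Note on the fingerprint check above: a minimal sketch, assuming you have the host's public key line from a trusted source such as the wikitech page linked in the reply. The function name `openssh_sha256_fingerprint` is hypothetical; it computes the `SHA256:...` string in the format ssh prints, so it can be compared with the value in the warning.)

```python
import base64
import hashlib


def openssh_sha256_fingerprint(public_key_line: str) -> str:
    """Return the OpenSSH-style SHA256 fingerprint for one public key line."""
    # A public key line has the form "<key-type> <base64-blob> [comment]".
    fields = public_key_line.split()
    blob = base64.b64decode(fields[1])
    digest = hashlib.sha256(blob).digest()
    # OpenSSH shows the digest as base64 with the trailing "=" padding removed.
    return "SHA256:" + base64.b64encode(digest).decode("ascii").rstrip("=")
```

(The result is only as trustworthy as the source of the key line; comparing against a key fetched over the same unverified connection proves nothing.)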
[16:15:15] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:15:52] (03CR) 10Effie Mouzeli: [C: 03+2] hieradata: install memcached 1.6 to gutter pool servers [puppet] - 10https://gerrit.wikimedia.org/r/673527 (https://phabricator.wikimedia.org/T270315) (owner: 10Effie Mouzeli) [16:16:18] (03PS5) 10Cwhite: logstash: add and enable dlq max_bytes workaround [puppet] - 10https://gerrit.wikimedia.org/r/673377 (https://phabricator.wikimedia.org/T277775) [16:17:27] (03CR) 10jerkins-bot: [V: 04-1] logstash: add and enable dlq max_bytes workaround [puppet] - 10https://gerrit.wikimedia.org/r/673377 (https://phabricator.wikimedia.org/T277775) (owner: 10Cwhite) [16:21:34] (03PS1) 10Jbond: P:debmonitor::server: allow users to configure the cas required_groups [puppet] - 10https://gerrit.wikimedia.org/r/673533 [16:22:35] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:23:59] (03CR) 10jerkins-bot: [V: 04-1] P:debmonitor::server: allow users to configure the cas required_groups [puppet] - 10https://gerrit.wikimedia.org/r/673533 (owner: 10Jbond) [16:25:51] (03PS2) 10Jbond: P:debmonitor::server: allow users to configure the cas required_groups [puppet] - 10https://gerrit.wikimedia.org/r/673533 [16:27:52] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28684/console" [puppet] - 10https://gerrit.wikimedia.org/r/673533 (owner: 10Jbond) [16:28:16] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:debmonitor::server: allow users to configure the cas required_groups [puppet] - 10https://gerrit.wikimedia.org/r/673533 (owner: 10Jbond) [16:30:11] (03CR) 10Dzahn: [C: 03+1] "Ah, thanks for pointing out Joe's change. With that I am +1 then :) thanks" [puppet] - 10https://gerrit.wikimedia.org/r/673228 (owner: 10Jbond) [16:31:32] (03CR) 10Jbond: [C: 03+2] P:tcpircbot: drop monitoring of service [puppet] - 10https://gerrit.wikimedia.org/r/673228 (owner: 10Jbond) [16:33:18] (03PS6) 10Cwhite: logstash: add and enable dlq max_bytes workaround [puppet] - 10https://gerrit.wikimedia.org/r/673377 (https://phabricator.wikimedia.org/T277775) [16:34:25] (03CR) 10jerkins-bot: [V: 04-1] logstash: add and enable dlq max_bytes workaround [puppet] - 10https://gerrit.wikimedia.org/r/673377 (https://phabricator.wikimedia.org/T277775) (owner: 10Cwhite) [16:35:41] (03PS7) 10Cwhite: logstash: add and enable dlq max_bytes workaround [puppet] - 10https://gerrit.wikimedia.org/r/673377 (https://phabricator.wikimedia.org/T277775) [16:36:49] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets 
[16:37:29] (03PS8) 10Cwhite: logstash: add and enable dlq max_bytes workaround [puppet] - 10https://gerrit.wikimedia.org/r/673377 (https://phabricator.wikimedia.org/T277775) [16:40:16] (03CR) 10Cwhite: "Thanks for the review! All were valid points." (0311 comments) [puppet] - 10https://gerrit.wikimedia.org/r/673377 (https://phabricator.wikimedia.org/T277775) (owner: 10Cwhite) [16:46:23] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:52:11] (03PS1) 10Jbond: pki - cloud: add sso puppet CA to authorised CA's [puppet] - 10https://gerrit.wikimedia.org/r/673537 [16:52:55] (03PS3) 10Effie Mouzeli: profile::mcrouter_wancache: add spec tests [puppet] - 10https://gerrit.wikimedia.org/r/672773 [16:55:59] (03PS1) 10Bstorm: maintain-dbusers: correct the types on a the PAWS UID and paths [puppet] - 10https://gerrit.wikimedia.org/r/673538 (https://phabricator.wikimedia.org/T276284) [16:57:08] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:00:42] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:06:24] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:07:56] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:07:57] (03PS4) 10Effie Mouzeli: profile::mcrouter_wancache: add spec tests [puppet] - 10https://gerrit.wikimedia.org/r/672773 [17:15:34] (03CR) 10Jbond: profile::mcrouter_wancache: add spec tests (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/672773 (owner: 10Effie Mouzeli) [17:16:59] 10SRE, 10ops-eqiad, 10Analytics-Radar: Try to move some new analytics worker nodes to different racks - https://phabricator.wikimedia.org/T276239 (10Cmjohnson) @elukey can I move the 2 servers anytime or does this need to be scheduled? Move an-worker1129 to A2 Move an-worker1139 to A7 [17:19:04] (03CR) 10Jbond: [C: 03+2] pki - cloud: add sso puppet CA to authorised CA's [puppet] - 10https://gerrit.wikimedia.org/r/673537 (owner: 10Jbond) [17:23:20] 10SRE, 10ops-eqiad, 10Analytics-Radar: Try to move some new analytics worker nodes to different racks - https://phabricator.wikimedia.org/T276239 (10elukey) @Cmjohnson anytime is fine! Thanks :) [17:25:18] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={ircd,routinator} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:27:32] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:31:43] (03PS5) 10Effie Mouzeli: profile::mcrouter_wancache: add spec tests [puppet] - 10https://gerrit.wikimedia.org/r/672773 [17:31:54] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:31:55] 10Puppet, 10SRE-tools, 10Python3-Porting, 10User-MoritzMuehlenhoff, and 2 others: Convert .py.erb files to files with configurations - https://phabricator.wikimedia.org/T277892 (10crusnov) [17:32:12] 10Puppet, 10SRE-tools, 10Python3-Porting, 10User-MoritzMuehlenhoff, and 2 others: Convert .py.erb files to files with configurations - https://phabricator.wikimedia.org/T277892 (10crusnov) p:05Triage→03Medium [17:33:33] (03CR) 10Legoktm: [C: 03+2] "Don't see any more backtracking occurring. Thanks!" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/673468 (owner: 10Volans) [17:34:06] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:38:34] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:43:01] (03Merged) 10jenkins-bot: tests: fix pip backtracking [software/pywmflib] - 10https://gerrit.wikimedia.org/r/673468 (owner: 10Volans) [17:45:16] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:46:15] (03CR) 10Bstorm: [C: 03+2] maintain-dbusers: correct the types on a the PAWS UID and paths [puppet] - 10https://gerrit.wikimedia.org/r/673538 (https://phabricator.wikimedia.org/T276284) (owner: 10Bstorm) [18:00:18] (03PS34) 10Jbond: netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) [18:03:32] (03PS12) 10Jbond: P:base: add ability to manage services file [puppet] - 10https://gerrit.wikimedia.org/r/670918 [18:03:47] (03PS7) 10Jbond: (WIP): netbase: first pass at parsing service::catalogue ports [puppet] - 10https://gerrit.wikimedia.org/r/673105 [18:05:20] (03CR) 10jerkins-bot: [V: 04-1] (WIP): netbase: first pass at parsing service::catalogue ports [puppet] - 10https://gerrit.wikimedia.org/r/673105 (owner: 10Jbond) [18:12:03] (03CR) 10Mstyles: create helmfile.d structure (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) (owner: 10Mstyles) [18:15:24] (03PS1) 10Razzi: turnilo: add monitoring for http [puppet] - 10https://gerrit.wikimedia.org/r/673556 (https://phabricator.wikimedia.org/T277729) [18:16:34] (03CR) 10jerkins-bot: [V: 04-1] turnilo: add monitoring for http [puppet] - 10https://gerrit.wikimedia.org/r/673556 (https://phabricator.wikimedia.org/T277729) (owner: 10Razzi) [18:19:08] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:24:26] (03PS8) 10Effie Mouzeli: mediawiki::mcrouter: add onhost memcached unix socket support [puppet] - 10https://gerrit.wikimedia.org/r/663565 (https://phabricator.wikimedia.org/T273115) [18:31:04] RECOVERY - Prometheus jobs reduced availability on 
alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:34:46] (03CR) 10Elukey: "Razzi I think that we should try to hit the local endpoint, namely the one offered by the Turnilo nodejs app:" [puppet] - 10https://gerrit.wikimedia.org/r/673556 (https://phabricator.wikimedia.org/T277729) (owner: 10Razzi) [18:35:54] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:37:26] (03CR) 10CRusnov: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/670985 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [18:40:47] (03CR) 10Legoktm: [C: 03+1] site/conftool-data: turn mw2251,mw2252 into canaries (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/673367 (https://phabricator.wikimedia.org/T277780) (owner: 10Dzahn) [18:43:55] (03PS8) 10Jbond: (WIP): netbase: first pass at parsing service::catalogue ports [puppet] - 10https://gerrit.wikimedia.org/r/673105 [18:45:25] (03CR) 10jerkins-bot: [V: 04-1] (WIP): netbase: first pass at parsing service::catalogue ports [puppet] - 10https://gerrit.wikimedia.org/r/673105 (owner: 10Jbond) [18:45:43] (03PS2) 10Razzi: turnilo: add monitoring for http [puppet] - 10https://gerrit.wikimedia.org/r/673556 (https://phabricator.wikimedia.org/T277729) [18:46:09] !log deploy2002 - disable puppet, copy modified version of scap-master-sync over it that does not --exclude="**/cache/l10n/*.cdb" (for T275826) [18:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:19] T275826: L10n cache files building up on backup deploy hosts - https://phabricator.wikimedia.org/T275826 [18:47:43] (03CR) 10Razzi: "Unless I missed it, it looked like there aren't 
any local appserver checks yet; here's my attempt at a new one." [puppet] - 10https://gerrit.wikimedia.org/r/673556 (https://phabricator.wikimedia.org/T277729) (owner: 10Razzi) [18:51:14] (03PS9) 10Effie Mouzeli: mediawiki::mcrouter: add onhost memcached unix socket support [puppet] - 10https://gerrit.wikimedia.org/r/663565 (https://phabricator.wikimedia.org/T273115) [18:52:32] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:53:02] (03PS1) 10Legoktm: tests: fix pip backtracking [cookbooks] - 10https://gerrit.wikimedia.org/r/673558 [18:53:11] (03PS9) 10Jbond: (WIP): netbase: first pass at parsing service::catalogue ports [puppet] - 10https://gerrit.wikimedia.org/r/673105 [18:55:35] (03CR) 10Effie Mouzeli: "> Patch Set 4: Code-Review-1" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/663565 (https://phabricator.wikimedia.org/T273115) (owner: 10Effie Mouzeli) [18:55:57] (03CR) 10Dzahn: turnilo: add monitoring for http (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/673556 (https://phabricator.wikimedia.org/T277729) (owner: 10Razzi) [18:56:48] (03CR) 10Dzahn: turnilo: add monitoring for http (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/673556 (https://phabricator.wikimedia.org/T277729) (owner: 10Razzi) [18:57:39] (03CR) 10Dzahn: turnilo: add monitoring for http (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/673556 (https://phabricator.wikimedia.org/T277729) (owner: 10Razzi) [18:59:15] (03CR) 10Dzahn: [C: 03+2] site/conftool-data: turn mw2251,mw2252 into canaries [puppet] - 10https://gerrit.wikimedia.org/r/673367 (https://phabricator.wikimedia.org/T277780) (owner: 10Dzahn) [19:00:04] 10Puppet, 10SRE-tools, 10Python3-Porting, 10User-crusnov, 10User-jbond: Port dstat related scripts to Python 3 - https://phabricator.wikimedia.org/T277910 
(10crusnov) [19:00:16] 10Puppet, 10SRE-tools, 10Python3-Porting, 10User-crusnov, 10User-jbond: Port dstat related scripts to Python 3 - https://phabricator.wikimedia.org/T277910 (10crusnov) p:05Triage→03Medium [19:01:24] (03PS3) 10Razzi: turnilo: add monitoring for node application [puppet] - 10https://gerrit.wikimedia.org/r/673556 (https://phabricator.wikimedia.org/T277729) [19:01:26] (03CR) 10Razzi: turnilo: add monitoring for node application (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/673556 (https://phabricator.wikimedia.org/T277729) (owner: 10Razzi) [19:06:09] (03CR) 10BBlack: [C: 03+1] "This seems like a correct copy of the technique of the other referenced patch! 😊" [cookbooks] - 10https://gerrit.wikimedia.org/r/673558 (owner: 10Legoktm) [19:06:15] (03CR) 10CRusnov: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/670990 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [19:09:44] (03CR) 10Dzahn: turnilo: add monitoring for node application (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/673556 (https://phabricator.wikimedia.org/T277729) (owner: 10Razzi) [19:09:56] (03CR) 10Legoktm: [C: 03+2] tests: fix pip backtracking [cookbooks] - 10https://gerrit.wikimedia.org/r/673558 (owner: 10Legoktm) [19:11:40] (03PS35) 10Jbond: netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) [19:11:58] (03CR) 10Dzahn: "fyi, one change that happens if you turn a server into a canary is also a change in envoy config:" [puppet] - 10https://gerrit.wikimedia.org/r/673367 (https://phabricator.wikimedia.org/T277780) (owner: 10Dzahn) [19:12:30] (03PS13) 10Jbond: P:base: add ability to manage services file [puppet] - 10https://gerrit.wikimedia.org/r/670918 [19:12:42] (03PS10) 10Jbond: (WIP): netbase: first pass at parsing service::catalogue ports [puppet] - 10https://gerrit.wikimedia.org/r/673105 [19:17:21] (03Merged) 
10jenkins-bot: tests: fix pip backtracking [cookbooks] - 10https://gerrit.wikimedia.org/r/673558 (owner: 10Legoktm) [19:18:11] (03PS4) 10Legoktm: sre.ganeti.makevm: Automatically generate fqdn from hostname [cookbooks] - 10https://gerrit.wikimedia.org/r/668867 (https://phabricator.wikimedia.org/T276516) [19:18:19] (03CR) 10Legoktm: [C: 03+2] "..." [cookbooks] - 10https://gerrit.wikimedia.org/r/668867 (https://phabricator.wikimedia.org/T276516) (owner: 10Legoktm) [19:20:50] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={ircd,netbox_device_statistics} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:22:32] (03Merged) 10jenkins-bot: sre.ganeti.makevm: Automatically generate fqdn from hostname [cookbooks] - 10https://gerrit.wikimedia.org/r/668867 (https://phabricator.wikimedia.org/T276516) (owner: 10Legoktm) [19:24:57] !log deploy2002 - re-enabled puppet, reverted patch of scap-sync-master [19:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:30] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:27:26] (03PS1) 10Legoktm: tests: fix pip backtracking [software/cumin] - 10https://gerrit.wikimedia.org/r/673564 [19:28:07] (03CR) 10Legoktm: tests: fix pip backtracking (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/673564 (owner: 10Legoktm) [19:33:09] !log dzahn@cumin1001 conftool action : set/weight=1; selector: name=mw2251.codfw.wmnet,service=canary [19:33:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:15] !log dzahn@cumin1001 conftool action : set/weight=1; selector: name=mw2252.codfw.wmnet,service=canary [19:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:06] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2251.codfw.wmnet,service=canary [19:37:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:17] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2252.codfw.wmnet,service=canary [19:37:22] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={ircd,netbox_device_statistics} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:37:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:31] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2244.codfw.wmnet [19:39:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:39] !log legoktm@cumin1001 START - Cookbook sre.ganeti.makevm for new host lists1002.wikimedia.org [19:39:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:45] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2245.codfw.wmnet [19:39:51] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:31] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw2244.codfw.wmnet [19:40:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:08] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:42:15] ugh [19:42:22] mutante: I think we conflicted on the netbox step :/ [19:42:35] my diff shows the removal of mw2244 [19:43:04] legoktm: my cookbook is at the "Sleeping for 3 minutes step" [19:43:13] removal of mw2244 is correct [19:43:19] ok, I'm going to accept the diff [19:43:21] though I am not sure if it will mean my run will fail later [19:43:36] It happened to me when I tried to do 2 decoms at once [19:43:45] and I accepted it as well.. yes, please do [19:43:47] we will see [19:44:04] https://phabricator.wikimedia.org/rONED8c1c033f628adcada809fd30e27d3210f43d362f [19:44:33] if it's removed from DNS before all other decom steps are done [19:44:36] there might be remnants [19:44:38] not sure [19:44:47] but it is already past "removed from puppetDB" [19:45:14] feels like this step should have a lock [19:46:08] one issue i could see is when it tries to connect to mgmt to shut it down [19:46:31] 10SRE, 10MW-on-K8s, 10Shellbox, 10serviceops, and 4 others: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (10Daimona) [19:48:08] 10SRE, 10DBA, 10Epic, 10Performance-Team (Radar), 10Sustainability (Incident Followup): Decide how to improve parsercache replication, sharding and HA - https://phabricator.wikimedia.org/T133523 (10Krinkle) [19:48:30] 10SRE, 10DBA, 10Epic, 10Performance-Team (Radar), 10Sustainability (Incident Followup): Decide how to improve parsercache replication, sharding and HA - 
https://phabricator.wikimedia.org/T133523 (10Krinkle) a:03Kormat [19:48:57] 10SRE, 10DBA, 10Epic, 10Performance-Team (Radar), 10Sustainability (Incident Followup): Decide how to improve parsercache replication, sharding and HA - https://phabricator.wikimedia.org/T133523 (10Krinkle) >>! In T277831#6927485, @Krinkle wrote: >> The concerned raised by @Kormat is that the current se... [19:49:20] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:50:47] !log testreduce1001 - confirmed MariaDB @@datadir is /srv/data/mysql and deleting /var/lib/mysql (T277580) [19:50:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:54] T277580: Bump disk space on testreduce1001 - https://phabricator.wikimedia.org/T277580 [19:53:21] !log legoktm@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host lists1002.wikimedia.org [19:53:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:54] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2244.codfw.wmnet [19:53:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:01] 10SRE, 10ops-codfw, 10serviceops, 10Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw2244.codfw.wmnet` - mw2244.codfw.wmnet (**PASS**) - Downtime... [19:54:02] legoktm: it's running homer now to shut down switch port and that's it. 
exit 0 [19:54:10] :D [19:54:10] seems to be fine [19:54:12] phew [19:54:16] yep [19:54:18] (03PS11) 10Jbond: P:netbase: parse the service catalouge and inject the service ports [puppet] - 10https://gerrit.wikimedia.org/r/673105 [19:54:24] (03PS1) 10Legoktm: install_server: Add lists1002.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/673590 (https://phabricator.wikimedia.org/T276686) [19:55:18] (03CR) 10Jbond: P:netbase: parse the service catalouge and inject the service ports (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/673105 (owner: 10Jbond) [19:55:31] (03CR) 10jerkins-bot: [V: 04-1] P:netbase: parse the service catalouge and inject the service ports [puppet] - 10https://gerrit.wikimedia.org/r/673105 (owner: 10Jbond) [19:55:37] (03CR) 10Legoktm: [C: 03+2] install_server: Add lists1002.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/673590 (https://phabricator.wikimedia.org/T276686) (owner: 10Legoktm) [19:55:37] I am doing one more decom but then that's it for this Friday [19:55:51] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw2245.codfw.wmnet [19:55:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:57] I have no more VMs for today :) [19:57:08] ack [19:59:11] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:59:58] (03CR) 10Elukey: turnilo: add monitoring for node application (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/673556 (https://phabricator.wikimedia.org/T277729) (owner: 10Razzi) [20:00:18] (03CR) 10Alexandros Kosiaris: [C: 03+1] Update cxserver to 2021-03-15-131520-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/672386 (https://phabricator.wikimedia.org/T271711) (owner: 10KartikMistry) [20:00:34] (03PS12) 10Jbond: P:netbase: parse the service catalouge and inject the service ports [puppet] - 10https://gerrit.wikimedia.org/r/673105 [20:00:37] (03PS1) 10Legoktm: site.pp: Add lists1002.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/673591 (https://phabricator.wikimedia.org/T276686) [20:00:50] (03CR) 10Jbond: "this is ready for review now" [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) (owner: 10Jbond) [20:01:43] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={ircd,netbox_device_statistics} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:03:07] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:03:10] 10SRE, 10ops-codfw, 10serviceops, 10Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10Dzahn) [20:04:51] (03PS1) 10Dzahn: DHCP: switch scandium to use buster installer [puppet] - 10https://gerrit.wikimedia.org/r/673592 (https://phabricator.wikimedia.org/T268248) [20:08:34] (03CR) 10Legoktm: [C: 03+2] site.pp: Add lists1002.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/673591 (https://phabricator.wikimedia.org/T276686) (owner: 10Legoktm) [20:08:37] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/673594 [20:09:26] (03CR) 10Dzahn: [C: 03+2] DHCP: switch scandium to use buster installer [puppet] - 10https://gerrit.wikimedia.org/r/673592 (https://phabricator.wikimedia.org/T268248) (owner: 10Dzahn) [20:10:49] (03PS36) 10Jbond: netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) [20:11:57] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2245.codfw.wmnet [20:12:03] 10SRE, 10ops-codfw, 10serviceops, 10Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw2245.codfw.wmnet` - mw2245.codfw.wmnet (**PASS**) - Downtime... 
[20:12:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:31] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:13:31] (03CR) 10Dzahn: [C: 03+2] site/conftool-data: decom mw2244,mw2245, former canary servers [puppet] - 10https://gerrit.wikimedia.org/r/673368 (https://phabricator.wikimedia.org/T277780) (owner: 10Dzahn) [20:14:00] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on scandium.eqiad.wmnet with reason: reimage [20:14:01] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on scandium.eqiad.wmnet with reason: reimage [20:14:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:19] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:15:56] !log scandium - reimaging with buster [20:16:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:22] 10SRE, 10DBA, 10Wikimedia-Mailing-lists: Create test databases for mailman3 - https://phabricator.wikimedia.org/T256538 (10Legoktm) >>! In T256538#6920846, @Marostegui wrote: > Databases are now created, once I get the IPs I will create the users :) 208.80.154.13 (https://netbox.wikimedia.org/virtualization... 
[20:19:56] 10SRE, 10Security-Team, 10Wikimedia-Mailing-lists: Upgrade GNU Mailman from 2.1 to Mailman3 - https://phabricator.wikimedia.org/T52864 (10Legoktm) [20:19:59] 10SRE, 10Wikimedia-Mailing-lists, 10vm-requests: Requesting a test VM in production for mailman3 - https://phabricator.wikimedia.org/T276686 (10Legoktm) 05Open→03Resolved Done, lists1002.wikimedia.org now exists. [20:20:19] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:21:36] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:23:34] (03PS37) 10Jbond: netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) [20:24:07] (03PS14) 10Jbond: P:base: add ability to manage services file [puppet] - 10https://gerrit.wikimedia.org/r/670918 [20:24:16] (03PS13) 10Jbond: P:netbase: parse the service catalouge and inject the service ports [puppet] - 10https://gerrit.wikimedia.org/r/673105 [20:29:44] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on scandium.eqiad.wmnet with reason: REIMAGE [20:29:50] 10SRE, 10Performance-Team, 10Platform Engineering, 10Goal: Decommission the "session redis" cluster - https://phabricator.wikimedia.org/T243520 (10Krinkle) [20:29:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:59] 10SRE, 10MediaWiki-General, 10Performance-Team, 10serviceops-radar, and 3 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (10Krinkle) [20:31:48] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
2:00:00 on scandium.eqiad.wmnet with reason: REIMAGE [20:31:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:42] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:34:50] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:42:35] 10SRE, 10observability: The "logstash-*" index pattern does not contain any of the following field types: ip - https://phabricator.wikimedia.org/T238795 (10colewhite) 05Open→03Resolved a:03colewhite ECS is typing these fields appropriately since https://gerrit.wikimedia.org/r/c/operations/puppet/+/647029 [20:43:07] (03PS1) 10Legoktm: sre.ganeti.makevm: Update example after 22c586eb2ac23 [cookbooks] - 10https://gerrit.wikimedia.org/r/673597 [20:49:21] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install pki-root1001.eqiad.wmnet - https://phabricator.wikimedia.org/T276625 (10RobH) p:05Medium→03High a:03RobH [20:50:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install thumbor100[56] - https://phabricator.wikimedia.org/T273914 (10wiki_willy) a:03Jclark-ctr [20:51:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10wiki_willy) a:03Jclark-ctr [20:53:49] dpifke: I'm going to deploy the arclamp / swift change now [20:54:03] SGTM. 
[20:56:28] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28686/console" [puppet] - 10https://gerrit.wikimedia.org/r/623068 (https://phabricator.wikimedia.org/T244776) (owner: 10Dave Pifke) [20:57:37] (03CR) 10Legoktm: [V: 03+1 C: 03+2] arclamp: serve SVGs, compressed logs from Swift [puppet] - 10https://gerrit.wikimedia.org/r/623068 (https://phabricator.wikimedia.org/T244776) (owner: 10Dave Pifke) [20:59:44] running puppet now [21:05:41] https://performance.wikimedia.org/arclamp/svgs/daily/2021-03-19.excimer.all.reversed.svgz "Internal Server Error" [21:06:03] [Fri Mar 19 21:05:42.367791 2021] [proxy:warn] [pid 15412:tid 139721669838592] [client 2620:0:861:101:10:64:0:215:33844] AH01144: No protocol handler was valid for the URL /arclamp/svgs/daily/2021-03-19.excimer.all.reversed.svgz. If you are using a DSO version of mod_proxy, make sure the proxy submodules are included in the configuration using LoadModule. [21:06:13] maybe it didn't match the regex? [21:06:37] Hmm, looking. [21:07:22] (03PS2) 10Dzahn: site/conftool-data: decom mw2244,mw2245, former canary servers [puppet] - 10https://gerrit.wikimedia.org/r/673368 (https://phabricator.wikimedia.org/T277780) [21:08:36] (03CR) 10Dzahn: [C: 03+2] site/conftool-data: decom mw2244,mw2245, former canary servers [puppet] - 10https://gerrit.wikimedia.org/r/673368 (https://phabricator.wikimedia.org/T277780) (owner: 10Dzahn) [21:08:37] Looks like Swift is HTTP in beta, HTTPS in prod. [21:08:42] (03PS3) 10Dzahn: site/conftool-data: decom mw2244,mw2245, former canary servers [puppet] - 10https://gerrit.wikimedia.org/r/673368 (https://phabricator.wikimedia.org/T277780) [21:08:52] Checking to see if there's a mod_proxy_https we need to add. [21:09:11] there is, yes [21:09:43] I don't see it in /etc/apache2/mods-available? Or is it part of proxy_http2? 
[21:09:49] or, maybe not [21:09:54] yeah, I just checked that too [21:10:19] https://httpd.apache.org/docs/2.4/mod/mod_proxy_http.html says it supports HTTPS [21:11:02] !log scandium - stop apache and rerun puppet which fails after reimaging because it tries to run an nginx on port 80 which is already used by apache T268248 [21:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:11] T268248: upgrade scandium to buster - https://phabricator.wikimedia.org/T268248 [21:11:52] using https://regex101.com/ the regex does match [21:14:13] I tried commenting out the block to see if that made a difference, but it didn't [21:14:33] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10Dzahn) [21:14:44] It's odd, it seems to be working for logs but not for svgs. [21:14:57] I tried changing it to http instead of https and it didn't make a difference. [21:15:38] Unless maybe stepped on each other making changes. :) Trying again. [21:15:42] oops [21:16:11] https://stackoverflow.com/questions/23931987/apache-proxy-no-protocol-handler-was-valid says we need mod_ssl, which is not currently enabled [21:16:47] 10SRE, 10vm-requests, 10GitLab (Initialization), 10Patch-For-Review, 10User-brennen: Eqiad: 2 VM request for GitLab - https://phabricator.wikimedia.org/T274459 (10Dzahn) Do you want to keep this open? Or simply close and reopen if/once you want a second VM? [21:17:16] Makes sense. It works with http to the Swift backend, looking at Puppet code to see if changing that is a quick fix. 
[21:17:51] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10Dzahn) [21:18:16] 10SRE, 10serviceops: upgrade mwmaint servers to buster - https://phabricator.wikimedia.org/T267607 (10Dzahn) 05Open→03Stalled mwmaint1002 will be upgraded during the DC switchover period in Q4 [21:18:24] (03PS1) 10Legoktm: webperf: Enable mod_ssl for performance website [puppet] - 10https://gerrit.wikimedia.org/r/673599 [21:19:20] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28687/console" [puppet] - 10https://gerrit.wikimedia.org/r/673599 (owner: 10Legoktm) [21:19:30] dpifke: ^ [21:19:47] (03CR) 10Dave Pifke: [C: 03+1] webperf: Enable mod_ssl for performance website [puppet] - 10https://gerrit.wikimedia.org/r/673599 (owner: 10Legoktm) [21:19:50] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={ircd,netbox_device_statistics} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:20:11] Works for me. That's cleaner than trying to rewrite the Swift URL from hieradata. [21:20:38] and I think we want internal traffic to go over HTTPS anyways [21:20:45] (03CR) 10Legoktm: [V: 03+1 C: 03+2] webperf: Enable mod_ssl for performance website [puppet] - 10https://gerrit.wikimedia.org/r/673599 (owner: 10Legoktm) [21:20:46] And means we don't depend on an infrequently-used Swift endpoint, in case HTTP access to it ever goes away. 
[21:20:54] (03PS2) 10Legoktm: webperf: Enable mod_ssl for performance website [puppet] - 10https://gerrit.wikimedia.org/r/673599 [21:20:57] (03CR) 10Legoktm: [V: 03+2 C: 03+2] webperf: Enable mod_ssl for performance website [puppet] - 10https://gerrit.wikimedia.org/r/673599 (owner: 10Legoktm) [21:21:09] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:21:12] yea, it's nice to encrypt all the internal traffic as well, +1 [21:22:25] ok, new error :p [21:22:32] https://performance.wikimedia.org/arclamp/svgs/daily/2021-03-19.excimer.all.reversed.svgz "upstream connect error or disconnect/reset before headers. reset reason: connection failure" [21:22:55] Certificate issue? [21:25:09] The certificate for ms-fe.svc.eqiad.wmnet was issued by Puppet, we probably need to tell Apache about it. [21:25:11] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:26:13] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:26:27] * legoktm looks to see how that's done elsewhere [21:27:08] It seems to be available in /var/lib/puppet/ssl/certs/ca.pem. [21:28:27] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:29:37] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:33:02] (03PS1) 10Dave Pifke: arclamp: allow Puppet CA for ms-fe.svc.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/673602 (https://phabricator.wikimedia.org/T244776) [21:33:11] legoktm ^ I think that might do it. [21:33:37] did you try it out already? [21:33:44] No, can do so if you want. [21:33:56] please :) [21:34:06] Just looked at file permissions and tested using openssl s_client. [21:34:16] Manually adding to Apache config now. [21:34:49] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:36:28] Hmm. Apache is complaining it can't bind to 443 when run apache2ctl restart. [21:37:02] er you're not using systemd? [21:37:36] Same via systemctl restart. [21:38:19] envoy is probably sitting on 443 already [21:38:25] why is apache trying to bind to it though? [21:39:04] (yes, it is envoy on 443) [21:40:37] /etc/apache2/ports.conf [21:40:45] (03PS4) 10Razzi: turnilo: add monitoring for node application [puppet] - 10https://gerrit.wikimedia.org/r/673556 (https://phabricator.wikimedia.org/T277729) [21:40:46] Dunno why it worked when Puppet restarted it though? [21:41:16] is that actually included? [21:41:37] From /etc/apache2.conf, yes. [21:41:40] * legoktm forces a puppet run [21:42:39] I/puppet only did a reload earlier [21:42:43] but now it's still down [21:44:31] I don't know of a great way to override ports.conf later, so I guess we need to have Puppet overwrite it. [21:45:11] I'm going manually comment out the Listen 443 for now and see if the other fix works. [21:45:18] ok [21:45:38] there's $remove_default_ports in puppet, I'm going to use that [21:46:21] Nice, someone else has already had this problem. 
:) [21:46:37] PROBLEM - Check systemd state on webperf2001 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:47:45] I'll ack that in a minute [21:48:13] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:48:38] (03PS1) 10Legoktm: webperf: Don't have apache listen on 443 [puppet] - 10https://gerrit.wikimedia.org/r/673603 [21:49:49] ACKNOWLEDGEMENT - Check systemd state on webperf2001 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service Legoktm working on it https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:49:50] Does remove_default_ports remove 80 as well? If so, do we need to add it back in? [21:50:04] (03PS2) 10Dave Pifke: arclamp: enable SSL to ms-fe.svc.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/673602 (https://phabricator.wikimedia.org/T244776) [21:50:53] ^ tested and works. 
(Also needed "SSLProxyEngine On") [21:51:15] (03PS2) 10Legoktm: webperf: Don't have apache listen on 443 [puppet] - 10https://gerrit.wikimedia.org/r/673603 [21:52:14] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28689/console" [puppet] - 10https://gerrit.wikimedia.org/r/673603 (owner: 10Legoktm) [21:54:16] I'm trying to figure out where the real ports are set [21:54:31] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={ircd,netbox_device_statistics} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:55:25] ok, it seems like the other places just define their own ports.conf [21:55:36] * legoktm just does that [21:56:39] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:58:57] (03PS3) 10Legoktm: webperf: Don't have apache listen on 443 [puppet] - 10https://gerrit.wikimedia.org/r/673603 [21:59:30] The fact that Varnish is reaching webperf1001 via HTTP negates at least some of the value of all this work to get webperf1001 → ms-fe-svc working over HTTPS. But I guess that's a problem for another day. :) [21:59:40] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28690/console" [puppet] - 10https://gerrit.wikimedia.org/r/673603 (owner: 10Legoktm) [21:59:44] is it not talking to envoy? [22:00:13] (03PS4) 10Legoktm: webperf: Don't have apache listen on 443 [puppet] - 10https://gerrit.wikimedia.org/r/673603 [22:00:16] (03CR) 10Dave Pifke: [C: 03+1] "LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/673603 (owner: 10Legoktm) [22:00:18] (03CR) 10jerkins-bot: [V: 04-1] webperf: Don't have apache listen on 443 [puppet] - 10https://gerrit.wikimedia.org/r/673603 (owner: 10Legoktm) [22:01:24] (03CR) 10jerkins-bot: [V: 04-1] webperf: Don't have apache listen on 443 [puppet] - 10https://gerrit.wikimedia.org/r/673603 (owner: 10Legoktm) [22:02:00] (03PS5) 10Legoktm: webperf: Don't have apache listen on 443 [puppet] - 10https://gerrit.wikimedia.org/r/673603 [22:02:12] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:05:03] (03PS6) 10Legoktm: webperf: Don't have apache listen on 443 [puppet] - 10https://gerrit.wikimedia.org/r/673603 [22:05:07] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:05:52] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28692/console" [puppet] - 10https://gerrit.wikimedia.org/r/673603 (owner: 10Legoktm) [22:06:42] (03PS7) 10Legoktm: webperf: Don't have apache listen on 443 [puppet] - 10https://gerrit.wikimedia.org/r/673603 [22:06:48] (03CR) 10Legoktm: [V: 03+2 C: 03+2] webperf: Don't have apache listen on 443 [puppet] - 10https://gerrit.wikimedia.org/r/673603 (owner: 10Legoktm) [22:07:04] (03CR) 10Legoktm: [C: 03+2] arclamp: enable SSL to ms-fe.svc.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/673602 (https://phabricator.wikimedia.org/T244776) (owner: 10Dave Pifke) [22:08:41] dpifke: ran puppet, I think it's all working now? [22:09:19] Looks good from here. Sorry this turned out to be such a chore. 
[22:09:50] :D it wouldn't a real Friday if it was boring [22:10:19] Thanks for your help! :) [22:10:23] :)) [22:10:45] one thing that did surprise me is that there didn't seem to be any monitoring that alarmed despite the site being down, I'll file a task for that [22:11:26] Yeah. I think we monitor the backends but not webperf1001 itself. [22:11:43] RECOVERY - Check systemd state on webperf2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:11:52] I thought there was something at the Varnish layer that did, but I guess that's wrong. [22:12:57] (03PS1) 10Bstorm: maintain-dbusers: polish things up a bit [puppet] - 10https://gerrit.wikimedia.org/r/673606 (https://phabricator.wikimedia.org/T276284) [22:13:48] 10SRE, 10Performance-Team, 10observability: Add monitoring for performance.wikimedia.org - https://phabricator.wikimedia.org/T277927 (10Legoktm) [22:15:15] * legoktm -> afk for a short break, still pingable though [22:28:47] 10SRE, 10Performance-Team, 10observability: Add monitoring for performance.wikimedia.org - https://phabricator.wikimedia.org/T277927 (10dpifke) a:03dpifke Related: T260086 We have Icinga checks for most of the backends (XHGui, ArcLamp), but not for Apache on webperf1001 itself. Ideally, we can monitor er... [22:43:28] (03PS7) 10Sharvaniharan: Add event stream config for android.image_recommendations_interaction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673005 [22:47:47] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={ircd,netbox_device_statistics} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:54:43] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:57:30] (03CR) 10Dzahn: turnilo: add monitoring for node application (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/673556 (https://phabricator.wikimedia.org/T277729) (owner: 10Razzi) [23:03:39] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:03:40] 10Puppet, 10Beta-Cluster-Infrastructure, 10Cloud-Services, 10Release-Engineering-Team (Other / Uncategorized), and 2 others: Re-think puppet management for deployment-prep - https://phabricator.wikimedia.org/T161675 (10dpifke) Intermediate proposal: can we give +2 rights on labs/private to everyone with ro... [23:05:49] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:12:13] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:13:15] 10Puppet, 10Beta-Cluster-Infrastructure, 10Cloud-Services, 10Release-Engineering-Team (Other / Uncategorized), and 2 others: Re-think puppet management for deployment-prep - https://phabricator.wikimedia.org/T161675 (10bd808) >>! In T161675#6930652, @dpifke wrote: > Intermediate proposal: can we give +2 ri... [23:14:23] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:18:47] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:20:12] 10Puppet, 10Beta-Cluster-Infrastructure, 10Cloud-Services, 10Release-Engineering-Team (Other / Uncategorized), and 2 others: Re-think puppet management for deployment-prep - https://phabricator.wikimedia.org/T161675 (10dpifke) >>! In T161675#6930689, @bd808 wrote: > For anyone wondering who this is, see th... [23:20:57] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:35:21] 10Puppet, 10Beta-Cluster-Infrastructure, 10Cloud-Services, 10Release-Engineering-Team (Other / Uncategorized), and 2 others: Re-think puppet management for deployment-prep - https://phabricator.wikimedia.org/T161675 (10bd808) >>! In T161675#6930741, @dpifke wrote: >>>! In T161675#6930689, @bd808 wrote: >>... [23:38:33] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:39:51] 10Puppet, 10Beta-Cluster-Infrastructure, 10Cloud-Services, 10Release-Engineering-Team (Other / Uncategorized), and 2 others: Re-think puppet management for deployment-prep - https://phabricator.wikimedia.org/T161675 (10Legoktm) >>! In T161675#6930652, @dpifke wrote: > Intermediate proposal: can we give +2... [23:40:47] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:47:37] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:50:33] 10Puppet, 10SRE: Have puppet httpd class support enabling mod_ssl without having apache listen on port 443 - https://phabricator.wikimedia.org/T277989 (10Legoktm) [23:51:25] 10Puppet, 10Beta-Cluster-Infrastructure, 10Cloud-Services, 10Release-Engineering-Team (Other / Uncategorized), and 2 others: Re-think puppet management for deployment-prep - https://phabricator.wikimedia.org/T161675 (10bd808) >>! In T161675#6930761, @Legoktm wrote: >>>! In T161675#6930652, @dpifke wrote: >... [23:54:55] (03CR) 10Legoktm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672687 (https://phabricator.wikimedia.org/T224579) (owner: 10Muehlenhoff) [23:56:23] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:59:09] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase