[00:04:54] <icinga-wm>	 PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01083 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[00:11:11] <wikibugs>	 (PS9) Dave Pifke: arclamp: serve SVGs, compressed logs from Swift [puppet] - https://gerrit.wikimedia.org/r/623068 (https://phabricator.wikimedia.org/T244776)
[00:12:46] <wikibugs>	 (PS10) Dave Pifke: arclamp: serve SVGs, compressed logs from Swift [puppet] - https://gerrit.wikimedia.org/r/623068 (https://phabricator.wikimedia.org/T244776)
[00:13:31] <wikibugs>	 (CR) Legoktm: arclamp: serve SVGs, compressed logs from Swift (1 comment) [puppet] - https://gerrit.wikimedia.org/r/623068 (https://phabricator.wikimedia.org/T244776) (owner: Dave Pifke)
[00:22:31] <wikibugs>	 (CR) Dave Pifke: arclamp: serve SVGs, compressed logs from Swift (1 comment) [puppet] - https://gerrit.wikimedia.org/r/623068 (https://phabricator.wikimedia.org/T244776) (owner: Dave Pifke)
[00:31:59] <wikibugs>	 (PS11) Dave Pifke: arclamp: serve SVGs, compressed logs from Swift [puppet] - https://gerrit.wikimedia.org/r/623068 (https://phabricator.wikimedia.org/T244776)
[00:32:26] <wikibugs>	 (CR) Dzahn: [C: +2] parsoid::testreduce: switch mysql data dir to /srv/data/mysql [puppet] - https://gerrit.wikimedia.org/r/673175 (https://phabricator.wikimedia.org/T277580) (owner: Dzahn)
[00:32:48] <wikibugs>	 (PS2) Dzahn: parsoid::testreduce: switch mysql data dir to /srv/data/mysql [puppet] - https://gerrit.wikimedia.org/r/673175 (https://phabricator.wikimedia.org/T277580)
[00:33:21] <wikibugs>	 (CR) Dave Pifke: "I'll need to test these changes in beta; will do so first thing tomorrow." [puppet] - https://gerrit.wikimedia.org/r/623068 (https://phabricator.wikimedia.org/T244776) (owner: Dave Pifke)
[00:43:22] <Krinkle>	 dancy: I think you may've forgotten to apply that commit before syncing
[00:43:36] <Krinkle>	 looking at /srv/mediawiki-staging/php-1.36.0-wmf.35, the commit was not applied
[00:43:40] <Krinkle>	 on deploy1002
[00:44:01] * Krinkle reopened task
[00:45:02] <mutante>	 !log testreduce1001 - stop mysql; rsyncing /var/lib/mysql to /srv/data/mysql (T277580)
[00:45:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:45:11] <stashbot>	 T277580: Bump disk space on testreduce1001 - https://phabricator.wikimedia.org/T277580
[00:45:20] <dancy>	 hmm.
[00:46:31] <dancy>	 Krinkle: The liquidthreads one?
[00:46:47] <Krinkle>	 yeah
[00:47:03] <Krinkle>	 forgot submodule update perhaps?
[00:49:32] <dancy>	 gah, yes.
[00:50:14] <wikibugs>	 (PS1) Bstorm: static-binaries: first pass at a stripped-down image for binaries [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/673378 (https://phabricator.wikimedia.org/T277749)
[00:51:41] <logmsgbot>	 !log dancy@deploy1002 Synchronized php-1.36.0-wmf.35/extensions/LiquidThreads/classes/Thread.php: T277772 (duration: 00m 58s)
[00:51:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:51:49] <stashbot>	 T277772: Use of Article::getId was deprecated in MediaWiki 1.35. [Called from Thread::setRoot] - https://phabricator.wikimedia.org/T277772
[00:52:38] <dancy>	 Thanks for the heads-up Krinkle.
[00:53:30] <Krinkle>	 yw :)
[00:54:08] <Krinkle>	 dancy: I updated mediawiki-errors in logstash to include maintenance/shell.php in its debugging filter (previously this checked maintenance/eval.php only)
[00:54:21] <Krinkle>	 it also excludes mwdebug host names
[00:54:35] <dancy>	 I saw that. Thank ou.
[00:54:40] <dancy>	 *you
[00:55:02] <Krinkle>	 I... pressed save a few seconds ago?
[00:55:08] <wikibugs>	 (CR) Bstorm: "This image removes text editors because they are mostly useless if you basically are uploading binary blobs to NFS. It basically assumes y" [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/673378 (https://phabricator.wikimedia.org/T277749) (owner: Bstorm)
[00:55:16] <Krinkle>	 This one - https://logstash.wikimedia.org/app/dashboards#/view/mediawiki-errors
[00:55:41] <Krinkle>	 anyhow, yeah, so it uses exception.trace, I don't know if that maps directly but.. aye, hope it's of some use
[00:57:08] <dancy>	 I discussed how to translate that to the logspam script with brenn.en today and we decided to just leave the script as-is for the time being.
[00:58:40] <dancy>	 I'll let brennen know about the updates you made since they're relevant to the conversation we had.
[00:59:04] <Krinkle>	 k :) no problem either way
[01:05:34] <icinga-wm>	 RECOVERY - Check systemd state on ml-serve2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:05:37] <wikibugs>	 (PS1) Bstorm: maintain-dbusers: fix the order of the paws accounts listing [puppet] - https://gerrit.wikimedia.org/r/673380 (https://phabricator.wikimedia.org/T276284)
[01:08:11] <wikibugs>	 (CR) BryanDavis: [C: +1] maintain-dbusers: fix the order of the paws accounts listing [puppet] - https://gerrit.wikimedia.org/r/673380 (https://phabricator.wikimedia.org/T276284) (owner: Bstorm)
[01:12:26] <icinga-wm>	 PROBLEM - Check systemd state on ml-serve2002 is CRITICAL: CRITICAL - degraded: The following units failed: docker.service,docker.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:43:25] <ryankemper>	 !log T275885 Revoking current `relforge` TLS cert in advance of generation of new cert: `ryankemper@puppetmaster1001:/srv/private$ sudo puppet cert clean relforge.svc.eqiad.wmnet`
[02:43:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:43:34] <stashbot>	 T275885: Generate SSL certification for relforge1003.eqiad.wmnet and relforge1004.eqiad.wmnet - https://phabricator.wikimedia.org/T275885
[03:06:06] <icinga-wm>	 RECOVERY - Check systemd state on ml-serve2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:13:00] <icinga-wm>	 PROBLEM - Check systemd state on ml-serve2002 is CRITICAL: CRITICAL - degraded: The following units failed: docker.service,docker.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:20:18] <wikibugs>	 (PS1) Ryan Kemper: relforge: generate new TLS certs [puppet] - https://gerrit.wikimedia.org/r/673386 (https://phabricator.wikimedia.org/T275885)
[03:21:12] <icinga-wm>	 PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:22:00] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:22:58] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS1299/IPv4: Connect - Telia, AS1299/IPv6: Active - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[03:23:08] <wikibugs>	 (CR) Ryan Kemper: [C: +2] relforge: generate new TLS certs [puppet] - https://gerrit.wikimedia.org/r/673386 (https://phabricator.wikimedia.org/T275885) (owner: Ryan Kemper)
[03:25:08] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:26:36] <ryankemper>	 !log T275885 `ryankemper@cumin1001:~$ sudo cumin 'P{relforge*}' 'sudo run-puppet-agent'`
[03:26:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:26:45] <stashbot>	 T275885: Generate SSL certification for relforge1003.eqiad.wmnet and relforge1004.eqiad.wmnet - https://phabricator.wikimedia.org/T275885
[03:27:52] <ryankemper>	 !log [wdqs] `ryankemper@wdqs1013:~$ sudo systemctl restart wdqs-blazegraph`
[03:27:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:29:46] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.078 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[04:45:14] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:46:08] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 58, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[04:46:50] <icinga-wm>	 RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:49:28] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:51:56] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:19:42] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:24:18] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:40:42] <Amir1>	 I'm doing some stuff on beta cluster
[05:40:47] <Amir1>	 don't be alarmed
[05:41:31] <wikibugs>	 SRE, Gerrit-Privilege-Requests, LDAP-Access-Requests: Grant Access to wmf for TsepoThoabala - https://phabricator.wikimedia.org/T277804 (Dzahn) confirmed wikitech/LDAP/developer account:  ` [mwmaint1002:~] $  /usr/bin/ldapsearch -x "sn=Tsepo*"| grep uid dn: uid=tsepothoabala,ou=people,dc=wikimedia,dc...
[05:48:30] <Majavah>	 Amir1: beta stuff generally belongs in #-releng, not here
[05:48:52] <Amir1>	 I know, it's just if it alarms
[05:49:19] <mutante>	 appreciate the heads up in both places and that Icinga isn't ignored. but also, good night
[05:49:27] <Majavah>	 if it does, I think those still go to releng
[05:49:46] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:50:04] <wikibugs>	 (PS1) Dzahn: admin: add Tsepo Thoabala to ldap_only admins, group wmf [puppet] - https://gerrit.wikimedia.org/r/673388 (https://phabricator.wikimedia.org/T277804)
[05:51:08] <wikibugs>	 (PS2) Dzahn: admin: add Tsepo Thoabala to ldap_only admins, group wmf [puppet] - https://gerrit.wikimedia.org/r/673388 (https://phabricator.wikimedia.org/T277804)
[05:52:08] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:52:45] <wikibugs>	 (CR) Dzahn: "This needs to go together with a manual "[mwmaint1002:~] $ sudo modify-ldap-group wmf" to add to the LDAP group." [puppet] - https://gerrit.wikimedia.org/r/673388 (https://phabricator.wikimedia.org/T277804) (owner: Dzahn)
[06:03:48] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:06:06] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:08:56] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 132, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:09:04] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:29:00] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:30:31] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Running the offboarding script is mostly non-destructive, since it generates an LDIF which you need apply outside of the script (only the " [puppet] - https://gerrit.wikimedia.org/r/662765 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[06:37:20] <wikibugs>	 (PS1) Ladsgroup: beta: Fix beta's url shortener [puppet] - https://gerrit.wikimedia.org/r/673389
[06:38:22] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:40:27] <wikibugs>	 (CR) Muehlenhoff: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/670933 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[06:41:00] <wikibugs>	 SRE, GitLab (Initialization), Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)), User-brennen: Define auth strategy for GitLab - https://phabricator.wikimedia.org/T274461 (Sergey.Trofimovsky.SF) >> Something missing from the docs? > ahh yes, i have placed the ldap cn=admin password in...
[06:42:23] <wikibugs>	 (CR) Muehlenhoff: wmcs-webproxy.py: Port to Python 3 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/670933 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[06:43:04] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:43:15] <Amir1>	 beta cluster only patch to fix url shortener in beta cluster. Any SRE willing to merge please 🥺 https://gerrit.wikimedia.org/r/c/operations/puppet/+/673389
[06:45:28] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:47:10] <moritzm>	 Amir1: having a look in a few
[06:47:36] <Amir1>	 Thanks. I'm not super great in apache rules
[06:47:52] <Amir1>	 but it should be fine
[06:48:28] <elukey>	 Amir1: o/ I am a little confused about the first redirect, since the location matches .+
[06:48:34] <elukey>	 couldn't we use https://httpd.apache.org/docs/2.4/mod/mod_alias.html#redirectmatch ?
[06:48:48] <elukey>	 (I just had my coffee so my brain is slow, apologies for silly questions)
[06:50:43] <elukey>	 ah yes ok now I get it
[06:51:06] <Amir1>	 elukey: o/ 
[06:51:19] <elukey>	 so redirectmatch may be cleaner, but this works yes
[06:51:28] <Amir1>	 it's basically redirecting w.wiki to one target and w.wiki/fff to another
[06:51:38] <wikibugs>	 (CR) Muehlenhoff: [C: +2] beta: Fix beta's url shortener [puppet] - https://gerrit.wikimedia.org/r/673389 (owner: Ladsgroup)
[06:51:51] <Amir1>	 yeah, apache configs are confusing and complicated 
[06:51:57] <Amir1>	 (not as much as exim4 though)
[06:52:18] <elukey>	 Amir1: well using redirectmatch + redirect without locations is less confusing, this is why I proposed it :)
[06:52:22] <elukey>	 anywayyyyyy
[06:52:32] <elukey>	 all good change merged
[06:53:01] <moritzm>	 there's a patch by Ryan which hasn't been puppet-merged
[06:53:23] <Amir1>	 elukey: it's beta cluster, I'm pretty sure it was broken for years. The whole thing is a mess :D
[06:53:23] <moritzm>	 ryankemper: you still around, good to merge? (new TLS certs for relforge)
[06:54:48] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:55:50] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/654336 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[06:56:02] <Majavah>	 Amir1: would you be willing to fix beta logstash at the same time? It's also broken and I haven't managed to fix it :D
[06:56:54] <Amir1>	 Majavah: if you give me some context I can take a look but I'm not great in ELK 
[06:58:23] <Majavah>	 Amir1: basically it just stopped receiving events, fun things like "UDP listener died EADDRINUSE" due to some port conflicts, no idea on how it's supposed to work
[06:59:59] <Amir1>	 oh that seems fun
[07:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210319T0700)
[07:00:05] <Amir1>	 ticket?
[07:01:09] <Majavah>	 spread over time to T233134, T241481 and T276521, I guess T233134 should be the "main" task
[07:01:10] <stashbot>	 T276521: deployment-logstash03 puppet errors - https://phabricator.wikimedia.org/T276521
[07:01:10] <stashbot>	 T241481: deployment-logstash03: UDP listener died EADDRINUSE, logstash port conflict with rsyslogd - https://phabricator.wikimedia.org/T241481
[07:01:10] <stashbot>	 T233134: logstash-beta.wmflabs.org does not receive any mediawiki events - https://phabricator.wikimedia.org/T233134
[07:02:58] <Amir1>	 cool
[07:03:18] <Amir1>	 Added to my todo list for today but these look like a mess already
[07:04:11] <Majavah>	 thanks! I already tried but couldn't get it working, another pair of eyes would be helpful
[07:06:18] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:08:49] <wikibugs>	 (PS1) Ladsgroup: Add Wikidata's query builder in toolforge to beta's url shortener [mediawiki-config] - https://gerrit.wikimedia.org/r/673391 (https://phabricator.wikimedia.org/T273162)
[07:11:00] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:12:21] <ryankemper>	 moritzm: doh, my bad - yes it's clear to merge if it hasn't been already (checking now)
[07:13:22] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:13:42] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[07:15:02] <ryankemper>	 moritzm: Amir1: I'm merging both our changes now
[07:15:15] <Amir1>	 all good
[07:15:28] <ryankemper>	 done
[07:16:02] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[07:16:22] <ryankemper>	 !log T275885 `ryankemper@cumin1001:~$ sudo cumin 'P{relforge*}' 'sudo run-puppet-agent'` (change hadn't been merged when I ran the agent earlier)
[07:16:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:32] <stashbot>	 T275885: Generate SSL certification for relforge1003.eqiad.wmnet and relforge1004.eqiad.wmnet - https://phabricator.wikimedia.org/T275885
[07:21:11] <moritzm>	 ack, thx
[07:21:22] <wikibugs>	 (PS3) ArielGlenn: update bash worker script for handling scondary workers processing job batches [dumps] - https://gerrit.wikimedia.org/r/638043 (https://phabricator.wikimedia.org/T252396)
[07:25:45] <wikibugs>	 (CR) Sascha: [C: +1] static-binaries: first pass at a stripped-down image for binaries [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/673378 (https://phabricator.wikimedia.org/T277749) (owner: Bstorm)
[07:29:38] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:31:58] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:36:44] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:39:04] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:44:50] <wikibugs>	 SRE, CAS-SSO: Investigate/enable new actuators for U2F token management - https://phabricator.wikimedia.org/T277837 (MoritzMuehlenhoff)
[07:45:23] <wikibugs>	 SRE, CAS-SSO: Investigate/enable new actuators for U2F token management - https://phabricator.wikimedia.org/T277837 (MoritzMuehlenhoff) Same for "A number of new administrative actuator endpoints are presented to report back on the registered authentication handlers and policies."
[07:55:18] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:56:53] <wikibugs>	 (CR) David Caro: "> Patch Set 1:" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/670933 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[07:58:23] <wikibugs>	 SRE, CAS-SSO: CAS per-service TGT setting - https://phabricator.wikimedia.org/T277840 (MoritzMuehlenhoff)
[07:59:54] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:00:42] <wikibugs>	 (CR) Elukey: [C: +1] "LGTM even if I am not super familiar with helm admin_ng. IP ranges looks good :)" [deployment-charts] - https://gerrit.wikimedia.org/r/673227 (https://phabricator.wikimedia.org/T272918) (owner: Klausman)
[08:02:15] <wikibugs>	 (CR) Ladsgroup: [C: +2] "noop for production" [mediawiki-config] - https://gerrit.wikimedia.org/r/673391 (https://phabricator.wikimedia.org/T273162) (owner: Ladsgroup)
[08:02:57] <wikibugs>	 (Merged) jenkins-bot: Add Wikidata's query builder in toolforge to beta's url shortener [mediawiki-config] - https://gerrit.wikimedia.org/r/673391 (https://phabricator.wikimedia.org/T273162) (owner: Ladsgroup)
[08:03:51] <Amir1>	 rebased on deploy1001 %
[08:04:32] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:08:06] <wikibugs>	 (PS4) ArielGlenn: update bash worker script for handling scondary workers processing job batches [dumps] - https://gerrit.wikimedia.org/r/638043 (https://phabricator.wikimedia.org/T252396)
[08:12:51] <wikibugs>	 SRE, observability: rsyslog-kubernetes missing in buster-wikimedia - https://phabricator.wikimedia.org/T277739 (elukey) I have updated the `operations/debs/rsyslog` from salsa.debian.org, now it contains `8.2102.0`. I then tried something simple:  * created a local branch `debian/buster-wikimedia` from m...
[08:14:02] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:14:29] <wikibugs>	 (PS4) ArielGlenn: distinguish between "no wikis with batches available" and "no wikis left to run" [dumps] - https://gerrit.wikimedia.org/r/673210 (https://phabricator.wikimedia.org/T252396)
[08:15:48] <wikibugs>	 (PS3) Kosta Harlan: linkrecommendation: Bump memory limit and image version [deployment-charts] - https://gerrit.wikimedia.org/r/673006 (https://phabricator.wikimedia.org/T277297)
[08:16:06] <wikibugs>	 (CR) DCausse: create helmfile.d structure (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) (owner: Mstyles)
[08:16:08] <wikibugs>	 SRE, CAS-SSO: WebAuthn FIDO2 support in CAS - https://phabricator.wikimedia.org/T277841 (MoritzMuehlenhoff)
[08:16:31] <wikibugs>	 (PS5) ArielGlenn: update bash worker script for handling scondary workers processing job batches [dumps] - https://gerrit.wikimedia.org/r/638043 (https://phabricator.wikimedia.org/T252396)
[08:18:36] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:22:12] <wikibugs>	 SRE, CAS-SSO: Update CAS to 6.3 - https://phabricator.wikimedia.org/T271684 (MoritzMuehlenhoff) I filed tasks for new features introduced in 6.3: https://phabricator.wikimedia.org/T277837 https://phabricator.wikimedia.org/T277840 https://phabricator.wikimedia.org/T277841
[08:22:54] <elukey>	 !log upload alluxio 2.4.1 to thirdparty/bigtop15 on stretch/buster-wikimedia
[08:23:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:14] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:30:08] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:32:24] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:35:46] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:36:00] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:37:37] <wikibugs>	 (PS2) Majavah: beta: remove deployment-restbase[01-02] [puppet] - https://gerrit.wikimedia.org/r/673047 (https://phabricator.wikimedia.org/T250574)
[08:37:46] <wikibugs>	 (CR) Alexandros Kosiaris: [C: -1] helm: Make ML k8s clusters visible to helm (6 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/673227 (https://phabricator.wikimedia.org/T272918) (owner: Klausman)
[08:46:18] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:49:03] <wikibugs>	 SRE, Patch-For-Review: Migrate irc.wikimedia.org/kraz to Buster - https://phabricator.wikimedia.org/T224579 (MoritzMuehlenhoff) >>! In T224579#6924029, @fgiunchedi wrote: > Sure enough, the exporter is out of FDs again. I'm +1 to just remove the exporter since the service doesn't have an owner, the expor...
[08:53:16] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:56:01] <wikibugs>	 (PS1) Elukey: aptrepo: add a new rsyslog-k8s component for buster-wikimedia [puppet] - https://gerrit.wikimedia.org/r/673442 (https://phabricator.wikimedia.org/T277739)
[09:00:45] <wikibugs>	 (CR) Volans: [C: +2] beta: remove deployment-restbase[01-02] [puppet] - https://gerrit.wikimedia.org/r/673047 (https://phabricator.wikimedia.org/T250574) (owner: Majavah)
[09:01:13] <wikibugs>	 (PS4) Klausman: helm: Make ML k8s clusters visible to helm [deployment-charts] - https://gerrit.wikimedia.org/r/673227 (https://phabricator.wikimedia.org/T272918)
[09:04:46] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:06:30] <wikibugs>	 (CR) Alexandros Kosiaris: linkrecommendation: Bump memory limit and image version (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/673006 (https://phabricator.wikimedia.org/T277297) (owner: Kosta Harlan)
[09:07:01] <wikibugs>	 (PS5) Klausman: helm: Make ML k8s clusters visible to helm [deployment-charts] - https://gerrit.wikimedia.org/r/673227 (https://phabricator.wikimedia.org/T272918)
[09:07:09] <wikibugs>	 (CR) Klausman: helm: Make ML k8s clusters visible to helm (6 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/673227 (https://phabricator.wikimedia.org/T272918) (owner: Klausman)
[09:07:40] <wikibugs>	 (PS1) Jcrespo: dbbackups: Reenable notifications on db2101 after data load [puppet] - https://gerrit.wikimedia.org/r/673443 (https://phabricator.wikimedia.org/T277632)
[09:07:51] <wikibugs>	 (PS2) Jcrespo: dbbackups: Reenable notifications on db2101 after data load [puppet] - https://gerrit.wikimedia.org/r/673443 (https://phabricator.wikimedia.org/T277632)
[09:08:39] <wikibugs>	 (CR) Jcrespo: [C: +2] dbbackups: Reenable notifications on db2101 after data load [puppet] - https://gerrit.wikimedia.org/r/673443 (https://phabricator.wikimedia.org/T277632) (owner: Jcrespo)
[09:09:26] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:11:55] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/673388 (https://phabricator.wikimedia.org/T277804) (owner: 10Dzahn)
[09:12:53] <wikibugs>	 (03CR) 10Volans: [C: 03+2] admin: add Tsepo Thoabala to ldap_only admins, group wmf [puppet] - 10https://gerrit.wikimedia.org/r/673388 (https://phabricator.wikimedia.org/T277804) (owner: 10Dzahn)
[09:12:56] <wikibugs>	 (03CR) 10Awight: [C: 03+1] Enable CodeMirror accessibility colors on initial wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673326 (https://phabricator.wikimedia.org/T276346) (owner: 10Andrew-WMDE)
[09:15:20] <wikibugs>	 (03PS4) 10Kosta Harlan: linkrecommendation: Bump requests memory limit and image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/673006 (https://phabricator.wikimedia.org/T277297)
[09:15:32] <wikibugs>	 10SRE, 10Gerrit-Privilege-Requests, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to wmf for TsepoThoabala - https://phabricator.wikimedia.org/T277804 (10Volans) 05Open→03Resolved p:05Triage→03Medium a:03Volans Patch merged, added user to the `wmf` group.  @TThoabala all done, resolving.
[09:16:22] <wikibugs>	 (03PS5) 10Kosta Harlan: linkrecommendation: Bump requests memory limit and image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/673006 (https://phabricator.wikimedia.org/T277297)
[09:16:34] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:23:30] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:23:33] <wikibugs>	 (03CR) 10Kormat: [C: 03+1] Add sql with the empty database structure to the repo [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673292 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo)
[09:23:50] <wikibugs>	 10SRE, 10serviceops: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (10JMeybohm) I don't really like option 3 just because it moves parts of the software stack to the node itself and I would personally like them to be as dumb as possible, ideally...
[09:26:08] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] linkrecommendation: Bump requests memory limit and image version (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/673006 (https://phabricator.wikimedia.org/T277297) (owner: 10Kosta Harlan)
[09:28:01] <wikibugs>	 (03PS6) 10Kosta Harlan: linkrecommendation: Bump requests memory limit and image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/673006 (https://phabricator.wikimedia.org/T277297)
[09:28:18] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] linkrecommendation: Bump requests memory limit and image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/673006 (https://phabricator.wikimedia.org/T277297) (owner: 10Kosta Harlan)
[09:28:20] <wikibugs>	 (03CR) 10Kosta Harlan: linkrecommendation: Bump requests memory limit and image version (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/673006 (https://phabricator.wikimedia.org/T277297) (owner: 10Kosta Harlan)
[09:32:14] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] chromium-render: Add default labels and fix name of configmap [deployment-charts] - 10https://gerrit.wikimedia.org/r/670464 (owner: 10JMeybohm)
[09:32:38] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] Add sql with the empty database structure to the repo [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673292 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo)
[09:33:15] <wikibugs>	 (03Merged) 10jenkins-bot: Add sql with the empty database structure to the repo [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673292 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo)
[09:33:36] <wikibugs>	 (03Merged) 10jenkins-bot: chromium-render: Add default labels and fix name of configmap [deployment-charts] - 10https://gerrit.wikimedia.org/r/670464 (owner: 10JMeybohm)
[09:33:57] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] aptrepo: add a new rsyslog-k8s component for buster-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/673442 (https://phabricator.wikimedia.org/T277739) (owner: 10Elukey)
[09:34:04] <wikibugs>	 (03CR) 10Gergő Tisza: [C: 03+2] linkrecommendation: Bump requests memory limit and image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/673006 (https://phabricator.wikimedia.org/T277297) (owner: 10Kosta Harlan)
[09:35:30] <wikibugs>	 (03Merged) 10jenkins-bot: linkrecommendation: Bump requests memory limit and image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/673006 (https://phabricator.wikimedia.org/T277297) (owner: 10Kosta Harlan)
[09:36:20] <wikibugs>	 (03PS1) 10Kormat: compare: Use dbutil.addr_split for parsing host:port [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/673446 (https://phabricator.wikimedia.org/T277843)
[09:36:25] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'proton' for release 'production' .
[09:36:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:32] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:39:57] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] "Looks fine to me: test, ship it, close the ticket! :-)" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/673446 (https://phabricator.wikimedia.org/T277843) (owner: 10Kormat)
[09:40:26] <logmsgbot>	 !log kharlan@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
[09:40:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:48] <wikibugs>	 10SRE, 10CAS-SSO: Update CAS to 6.3 - https://phabricator.wikimedia.org/T271684 (10jbond) 05Open→03Resolved a:03jbond
[09:47:28] <wikibugs>	 10SRE, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Convert helm releases to the new release naming schem - https://phabricator.wikimedia.org/T277849 (10JMeybohm)
[09:47:38] <wikibugs>	 10SRE, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Convert helm releases to the new release naming schem - https://phabricator.wikimedia.org/T277849 (10JMeybohm) p:05Triage→03Low
[09:48:50] <wikibugs>	 10SRE, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Convert helm releases to the new release naming schema - https://phabricator.wikimedia.org/T277849 (10JMeybohm)
[09:49:00] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:51:12] <wikibugs>	 10SRE, 10Prod-Kubernetes, 10SRE-tools: Support downtiming services in our cookbooks - https://phabricator.wikimedia.org/T277740 (10akosiaris) >>! In T277740#6925615, @Volans wrote: > Doh, I think we have naming clash here :)  I figured, hence the comment.  >  >   - service: as in Icinga single service belong...
[10:04:58] <logmsgbot>	 !log kharlan@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
[10:04:58] <logmsgbot>	 !log kharlan@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
[10:05:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:05:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:05:31] <wikibugs>	 (03CR) 10Jbond: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/673228 (owner: 10Jbond)
[10:10:20] <wikibugs>	 (03CR) 10Kormat: [C: 03+2] "Look good, shipping:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/673446 (https://phabricator.wikimedia.org/T277843) (owner: 10Kormat)
[10:10:59] <logmsgbot>	 !log kharlan@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
[10:10:59] <logmsgbot>	 !log kharlan@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
[10:11:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:11:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:14:16] <wikibugs>	 (03CR) 10Volans: "I've no context on the task at hand, did just a generic Python pass. Feel free to ignore most of the comments." (0311 comments) [puppet] - 10https://gerrit.wikimedia.org/r/673377 (https://phabricator.wikimedia.org/T277775) (owner: 10Cwhite)
[10:18:17] <wikibugs>	 10SRE, 10observability: rsyslog-kubernetes missing in buster-wikimedia - https://phabricator.wikimedia.org/T277739 (10elukey) After a chat with Moritz we decided to create a specific component with 8.1901 for buster:  ` root@apt1001:/srv/wikimedia# reprepro lsbycomponent rsyslog rsyslog | 8.1901.0-1~bpo8+wmf1...
[10:18:25] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: sonofgridengine: master: ensure cpp package is installed [puppet] - 10https://gerrit.wikimedia.org/r/673448 (https://phabricator.wikimedia.org/T277653)
[10:19:49] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: sonofgridengine: master: ensure cpp package is installed [puppet] - 10https://gerrit.wikimedia.org/r/673448 (https://phabricator.wikimedia.org/T277653)
[10:20:51] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] sonofgridengine: master: ensure cpp package is installed [puppet] - 10https://gerrit.wikimedia.org/r/673448 (https://phabricator.wikimedia.org/T277653) (owner: 10Arturo Borrero Gonzalez)
[10:22:05] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/658414 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[10:22:56] <wikibugs>	 (03CR) 10Jcrespo: "CCing current Swift and DB owners- consider if my advice on previous comment is fair or I am being too cautious. Up to you." [puppet] - 10https://gerrit.wikimedia.org/r/670972 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[10:23:13] <wikibugs>	 (03PS1) 10Elukey: profile::rsyslog::kubernetes: add component for buster [puppet] - 10https://gerrit.wikimedia.org/r/673450 (https://phabricator.wikimedia.org/T277739)
[10:23:35] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:24:36] <wikibugs>	 (03PS2) 10Elukey: profile::rsyslog::kubernetes: add component for buster [puppet] - 10https://gerrit.wikimedia.org/r/673450 (https://phabricator.wikimedia.org/T277739)
[10:25:43] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:27:13] <wikibugs>	 10SRE, 10Security-Team, 10CAS-SSO, 10User-jbond: Validate Single Logout Flow - https://phabricator.wikimedia.org/T233941 (10jbond) https://wiki.shibboleth.net/confluence/display/CONCEPT/SLOIssues seems like a useful document when considering this
[10:28:46] <wikibugs>	 10SRE, 10Product-Infrastructure-Team-Backlog, 10Proton: Proton metrics broken - https://phabricator.wikimedia.org/T277857 (10JMeybohm)
[10:29:46] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28674/console" [puppet] - 10https://gerrit.wikimedia.org/r/673450 (https://phabricator.wikimedia.org/T277739) (owner: 10Elukey)
[10:30:01] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:30:55] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] helm: Make ML k8s clusters visible to helm [deployment-charts] - 10https://gerrit.wikimedia.org/r/673227 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman)
[10:31:19] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] helm: Make ML k8s clusters visible to helm [deployment-charts] - 10https://gerrit.wikimedia.org/r/673227 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman)
[10:34:13] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:35:31] <wikibugs>	 (03CR) 10Volans: "As this seems to have a lot of shared code between the 2 blocks, consider if it might be useful to move the common part into sre/elasticse" [cookbooks] - 10https://gerrit.wikimedia.org/r/673343 (https://phabricator.wikimedia.org/T277792) (owner: 10Ryan Kemper)
[10:36:04] <wikibugs>	 (03PS1) 10Elukey: cumin: fix ml-serve aliases and add new ones [puppet] - 10https://gerrit.wikimedia.org/r/673452 (https://phabricator.wikimedia.org/T272918)
[10:36:46] <wikibugs>	 (03Merged) 10jenkins-bot: helm: Make ML k8s clusters visible to helm [deployment-charts] - 10https://gerrit.wikimedia.org/r/673227 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman)
[10:36:49] <elukey>	 volans, klausman --^
[10:36:51] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/673452 (https://phabricator.wikimedia.org/T272918) (owner: 10Elukey)
[10:37:04] <volans>	 not sure if you need the distinction worker/masters within a DC too
[10:37:10] <volans>	 but you'll see later on that
[10:37:22] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] cumin: fix ml-serve aliases and add new ones [puppet] - 10https://gerrit.wikimedia.org/r/673452 (https://phabricator.wikimedia.org/T272918) (owner: 10Elukey)
[10:37:39] <wikibugs>	 10SRE, 10Product-Infrastructure-Team-Backlog, 10Proton: Proton metrics broken - https://phabricator.wikimedia.org/T277857 (10Jgiannelos) I think this is the patch that introduced the change from statsd metrics to native prometheus: https://gerrit.wikimedia.org/r/c/mediawiki/services/chromium-render/+/558213
[10:38:27] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:40:33] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:40:35] <elukey>	 volans: yes yes I had the same idea
[10:40:43] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] cumin: fix ml-serve aliases and add new ones [puppet] - 10https://gerrit.wikimedia.org/r/673452 (https://phabricator.wikimedia.org/T272918) (owner: 10Elukey)
[10:41:16] <wikibugs>	 (03CR) 10Elukey: [V: 03+1 C: 03+2] profile::rsyslog::kubernetes: add component for buster [puppet] - 10https://gerrit.wikimedia.org/r/673450 (https://phabricator.wikimedia.org/T277739) (owner: 10Elukey)
[10:41:49] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[10:41:49] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[10:41:53] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to sites from Google Search Console - https://phabricator.wikimedia.org/T277602 (10Volans) 05Open→03Resolved a:03Volans @CGlenn I've added you to the mobile domain too `am.m.wikipedia.org`, I consider the approval for the whole "//language//". Resolving, fe...
[10:41:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:09] <icinga-wm>	 RECOVERY - Check systemd state on ml-serve1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:42:23] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[10:42:23] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[10:42:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:38] <moritzm>	 !log installing dbmonitor1002 T224589
[10:42:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:45] <stashbot>	 T224589: Migrate dbmonitor hosts to Buster - https://phabricator.wikimedia.org/T224589
[10:44:20] <elukey>	 volans: just realized - ml-serve: A:ml-serve-master and A:ml-serve-worker 
[10:44:28] * elukey plays sad_trombone.wav
[10:44:32] <elukey>	 fixing it
[10:44:47] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:45:31] <icinga-wm>	 RECOVERY - Check systemd state on ml-serve1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:45:33] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[10:45:33] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[10:45:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:52] <wikibugs>	 (03PS1) 10Jgiannelos: Configure prometheus metrics for chromium-renderer [deployment-charts] - 10https://gerrit.wikimedia.org/r/673454 (https://phabricator.wikimedia.org/T277857)
[10:48:28] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: deployment: Add ML cluster to deploy hosts [puppet] - 10https://gerrit.wikimedia.org/r/673455
[10:48:33] <icinga-wm>	 RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.002407 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[10:49:15] <wikibugs>	 (03PS1) 10Elukey: cumin: fix ml-serve alias and add newer ones [puppet] - 10https://gerrit.wikimedia.org/r/673457 (https://phabricator.wikimedia.org/T272918)
[10:49:33] <elukey>	 volans: --^
[10:49:53] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28675/console" [puppet] - 10https://gerrit.wikimedia.org/r/673455 (owner: 10Alexandros Kosiaris)
[10:53:37] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:54:41] <wikibugs>	 10SRE, 10GitLab (Initialization), 10Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)), 10User-brennen: Define auth strategy for GitLab - https://phabricator.wikimedia.org/T274461 (10jbond) >>! In T274461#6927626, @Sergey.Trofimovsky.SF wrote: >>> Something missing from the docs? >> ahh yes, i h...
[10:58:53] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/673457 (https://phabricator.wikimedia.org/T272918) (owner: 10Elukey)
[10:59:03] <volans>	 elukey: done, sorry for missing the and/or typo
[10:59:21] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] cumin: fix ml-serve alias and add newer ones [puppet] - 10https://gerrit.wikimedia.org/r/673457 (https://phabricator.wikimedia.org/T272918) (owner: 10Elukey)
[10:59:28] <elukey>	 my bad :)
[11:08:39] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:08:56] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: deployment: Add ML cluster to deploy hosts [puppet] - 10https://gerrit.wikimedia.org/r/673455
[11:12:10] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: deployment: Add ML cluster to deploy hosts [puppet] - 10https://gerrit.wikimedia.org/r/673455
[11:12:59] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:13:12] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28677/console" [puppet] - 10https://gerrit.wikimedia.org/r/673455 (owner: 10Alexandros Kosiaris)
[11:13:37] <wikibugs>	 (03PS1) 10Jbond: hiera - cloud: move debmon to sso project [puppet] - 10https://gerrit.wikimedia.org/r/673461
[11:14:41] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] hiera - cloud: move debmon to sso project [puppet] - 10https://gerrit.wikimedia.org/r/673461 (owner: 10Jbond)
[11:17:21] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={ircd,netbox_device_statistics} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:18:03] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2002.codfw.wmnet with reason: REIMAGE
[11:18:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:19:46] <wikibugs>	 (03PS2) 10Ayounsi: tests: add tests for the configuration files [homer/public] - 10https://gerrit.wikimedia.org/r/672765 (https://phabricator.wikimedia.org/T272688) (owner: 10Volans)
[11:20:02] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2002.codfw.wmnet with reason: REIMAGE
[11:20:04] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] "LTGM" [puppet] - 10https://gerrit.wikimedia.org/r/673455 (owner: 10Alexandros Kosiaris)
[11:20:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:20:12] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] deployment: Add ML cluster to deploy hosts [puppet] - 10https://gerrit.wikimedia.org/r/673455 (owner: 10Alexandros Kosiaris)
[11:20:23] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] tests: add tests for the configuration files [homer/public] - 10https://gerrit.wikimedia.org/r/672765 (https://phabricator.wikimedia.org/T272688) (owner: 10Volans)
[11:22:55] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:23:52] <wikibugs>	 (03PS1) 10Jbond: hiera - cloud: correct debmon name [puppet] - 10https://gerrit.wikimedia.org/r/673463
[11:23:58] <wikibugs>	 10SRE, 10Services, 10Patch-For-Review, 10Performance-Team (Radar), 10Service-deployment-requests: New Service Request: xhgui - https://phabricator.wikimedia.org/T277483 (10akosiaris) >>! In T277483#6925411, @dpifke wrote: >>>! In T277483#6924456, @akosiaris wrote: >> * Is xhgui stateless? More specifical...
[11:25:03] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 03+1 C: 03+2] deployment: Add ML cluster to deploy hosts [puppet] - 10https://gerrit.wikimedia.org/r/673455 (owner: 10Alexandros Kosiaris)
[11:25:07] <wikibugs>	 10SRE, 10Services, 10serviceops-radar, 10Patch-For-Review, and 2 others: New Service Request: xhgui - https://phabricator.wikimedia.org/T277483 (10akosiaris) p:05Triage→03Medium
[11:25:49] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] hiera - cloud: correct debmon name [puppet] - 10https://gerrit.wikimedia.org/r/673463 (owner: 10Jbond)
[11:27:48] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[11:27:51] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[11:27:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:29:12] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[11:29:15] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[11:29:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:29:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:29:34] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[11:29:38] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[11:29:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:29:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:29:56] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[11:29:56] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[11:30:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:30:05] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={atlas_exporter,ircd} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:30:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:31:31] <wikibugs>	 (03PS1) 10Jbond: cloud hiera - sso: add puppetmasters block [puppet] - 10https://gerrit.wikimedia.org/r/673464
[11:33:56] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] cloud hiera - sso: add puppetmasters block [puppet] - 10https://gerrit.wikimedia.org/r/673464 (owner: 10Jbond)
[11:34:19] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:36:40] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[11:36:44] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[11:36:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:36:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:37:04] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[11:37:07] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[11:37:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:37:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:44:19] <wikibugs>	 (03PS1) 10Jbond: cloud - sso: fix puppet masters format [puppet] - 10https://gerrit.wikimedia.org/r/673467
[11:45:00] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] cloud - sso: fix puppet masters format [puppet] - 10https://gerrit.wikimedia.org/r/673467 (owner: 10Jbond)
[11:47:02] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] "This looks about right, thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/673454 (https://phabricator.wikimedia.org/T277857) (owner: 10Jgiannelos)
[11:47:51] <wikibugs>	 (03PS3) 10Ayounsi: tests: add tests for the configuration files [homer/public] - 10https://gerrit.wikimedia.org/r/672765 (https://phabricator.wikimedia.org/T272688) (owner: 10Volans)
[11:47:53] <wikibugs>	 (03PS2) 10Ayounsi: WIP. tests: generate documentation from schemas [homer/public] - 10https://gerrit.wikimedia.org/r/673071 (https://phabricator.wikimedia.org/T272688) (owner: 10Volans)
[11:48:13] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:48:17] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] tests: add tests for the configuration files [homer/public] - 10https://gerrit.wikimedia.org/r/672765 (https://phabricator.wikimedia.org/T272688) (owner: 10Volans)
[11:48:25] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] WIP. tests: generate documentation from schemas [homer/public] - 10https://gerrit.wikimedia.org/r/673071 (https://phabricator.wikimedia.org/T272688) (owner: 10Volans)
[11:50:01] <wikibugs>	 (03PS1) 10Volans: tests: fix pip backtracking [software/pywmflib] - 10https://gerrit.wikimedia.org/r/673468
[11:50:21] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:54:56] <wikibugs>	 (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Enable bracket matching on group0 and wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673312 (https://phabricator.wikimedia.org/T273591) (owner: 10Andrew-WMDE)
[11:55:51] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={ircd,swagger_check_citoid_cluster_eqiad} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:58:07] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:59:57] <wikibugs>	 (03CR) 10Volans: "This is my proposal to fix the issues we're getting in the last days with the aborted CI due to pip backtracking [1] taking too long to re" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/673468 (owner: 10Volans)
[12:03:17] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:10:13] <effie>	 !log upgrade memcached on mc1026,mc2026 
[12:10:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:10:57] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:13:45] <wikibugs>	 (03PS1) 10Kosta Harlan: linkrecommendation: Add Swagger UI environment variables [deployment-charts] - 10https://gerrit.wikimedia.org/r/673471 (https://phabricator.wikimedia.org/T277644)
[12:23:16] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:25:49] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: 2021-03-31) rack/setup/install cloudgw100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T272403 (10aborrero) a:05aborrero→03RobH The missing VLAN was just recently resolved in {T277020}   The control plane 1G port was ther...
[12:26:40] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: 2021-03-31) rack/setup/install cloudgw100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T272403 (10aborrero)
[12:26:44] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:32:06] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:33:38] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:34:48] <logmsgbot>	 !log klausman@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2002.codfw.wmnet with reason: REIMAGE
[12:34:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:36:50] <logmsgbot>	 !log klausman@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2002.codfw.wmnet with reason: REIMAGE
[12:36:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:38:04] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:42:54] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:50:43] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] eventrouter: Update build and base image, switch to nobody [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/669846 (https://phabricator.wikimedia.org/T274852) (owner: 10JMeybohm)
[12:51:02] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] ratelimit: Switch to nobody, update build and base image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/670836 (https://phabricator.wikimedia.org/T274852) (owner: 10JMeybohm)
[12:51:08] <wikibugs>	 10SRE, 10Gerrit-Privilege-Requests, 10LDAP-Access-Requests: Grant Access to wmf for TsepoThoabala - https://phabricator.wikimedia.org/T277804 (10Aklapper) @TThoabala: Hi, did this ticket supersede T277797 ? If yes, then please set the task status there to `declined` - thanks!
[12:51:19] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] fluent-bit: Switch to nobody and use seed_image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/670838 (https://phabricator.wikimedia.org/T274852) (owner: 10JMeybohm)
[12:56:54] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: openstack: nova: disable /etc/host management from cloud-init [puppet] - 10https://gerrit.wikimedia.org/r/673475 (https://phabricator.wikimedia.org/T277866)
[12:59:34] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "perhaps other option is to manage the template `/etc/cloud/templates/hosts.debian.tmp` via puppet before cloud-init runs at VM creating ti" [puppet] - 10https://gerrit.wikimedia.org/r/673475 (https://phabricator.wikimedia.org/T277866) (owner: 10Arturo Borrero Gonzalez)
[13:10:34] <wikibugs>	 10SRE, 10CAS-SSO: Investigate/enable new actuators for U2F token management - https://phabricator.wikimedia.org/T277837 (10MoritzMuehlenhoff) p:05Triage→03Low
[13:10:40] <wikibugs>	 10SRE, 10CAS-SSO: CAS per-service TGT setting - https://phabricator.wikimedia.org/T277840 (10MoritzMuehlenhoff) p:05Triage→03Low
[13:10:46] <wikibugs>	 10SRE, 10CAS-SSO: WebAuthn FIDO2 support in CAS - https://phabricator.wikimedia.org/T277841 (10MoritzMuehlenhoff) p:05Triage→03Medium
[13:19:34] <wikibugs>	 (03PS1) 10Kormat: WMFMariaDB: Allow setting debug via env var [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/673480
[13:22:14] <wikibugs>	 (03CR) 10Kormat: [C: 03+2] WMFMariaDB: Allow setting debug via env var [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/673480 (owner: 10Kormat)
[13:24:57] <wikibugs>	 (03PS1) 10Jbond: cloud - hiera: add horizon config to yaml [puppet] - 10https://gerrit.wikimedia.org/r/673481
[13:25:56] <wikibugs>	 (03Merged) 10jenkins-bot: WMFMariaDB: Allow setting debug via env var [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/673480 (owner: 10Kormat)
[13:26:39] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:28:55] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:31:33] <wikibugs>	 (03PS2) 10Jbond: cloud - hiera: add horizon config to yaml [puppet] - 10https://gerrit.wikimedia.org/r/673481
[13:33:47] <wikibugs>	 10SRE, 10Services, 10Service-deployment-requests, 10artificial-intelligence: New Service Request 'open_nsfw' - https://phabricator.wikimedia.org/T250110 (10akosiaris) Hello,  >>! In T250110#6924592, @Chtnnh wrote: > Hello! >  > Yes, we would love to have this service deployed. Although, over the course of...
[13:34:27] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] cloud - hiera: add horizon config to yaml [puppet] - 10https://gerrit.wikimedia.org/r/673481 (owner: 10Jbond)
[13:35:41] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:37:07] <wikibugs>	 (03PS1) 10Jbond: cloud sso: add puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/673485
[13:37:55] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:38:33] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] cloud sso: add puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/673485 (owner: 10Jbond)
[13:46:51] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:49:05] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:50:22] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] "I am sorry, my previous comments were wrong, please disregard." (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/672386 (https://phabricator.wikimedia.org/T271711) (owner: 10KartikMistry)
[13:53:33] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:57:13] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 235 probes of 605 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[13:58:01] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:58:34] <wikibugs>	 (03PS1) 10Andrew Bogott: nova vendordata: adjust cloud-init package list [puppet] - 10https://gerrit.wikimedia.org/r/673489
[13:59:42] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] nova vendordata: adjust cloud-init package list [puppet] - 10https://gerrit.wikimedia.org/r/673489 (owner: 10Andrew Bogott)
[14:03:07] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: docker: tabs to spaces [puppet] - 10https://gerrit.wikimedia.org/r/672450 (owner: 10Legoktm)
[14:03:20] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] docker: tabs to spaces [puppet] - 10https://gerrit.wikimedia.org/r/672450 (owner: 10Legoktm)
[14:03:23] <wikibugs>	 10SRE, 10Services, 10Service-deployment-requests, 10artificial-intelligence: New Service Request 'open_nsfw' - https://phabricator.wikimedia.org/T250110 (10Chtnnh) I understand @akosiaris !   Is it possible to deploy to production as volunteers? As in, is it possible for long time volunteers to have deploy...
[14:03:27] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 50 probes of 605 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[14:08:51] <wikibugs>	 (03PS6) 10KartikMistry: Update cxserver to 2021-03-15-131520-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/672386 (https://phabricator.wikimedia.org/T271711)
[14:09:33] <wikibugs>	 (03CR) 10KartikMistry: "> Patch Set 5: Code-Review-1" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/672386 (https://phabricator.wikimedia.org/T271711) (owner: 10KartikMistry)
[14:11:27] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:13:25] <wikibugs>	 (03PS1) 10Andrew Bogott: nova-fullstack: temporarily run with a different base image [puppet] - 10https://gerrit.wikimedia.org/r/673496
[14:16:42] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] nova-fullstack: temporarily run with a different base image [puppet] - 10https://gerrit.wikimedia.org/r/673496 (owner: 10Andrew Bogott)
[14:20:31] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:21:09] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "This works on my local." [software/pywmflib] - 10https://gerrit.wikimedia.org/r/673468 (owner: 10Volans)
[14:23:45] <wikibugs>	 (03PS1) 10Jbond: sso-debmon: comment out classes so we can at least get one puppet run [puppet] - 10https://gerrit.wikimedia.org/r/673499
[14:25:28] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] sso-debmon: comment out classes so we can at least get one puppet run [puppet] - 10https://gerrit.wikimedia.org/r/673499 (owner: 10Jbond)
[14:29:43] <wikibugs>	 (03PS1) 10Jbond: Revert "sso-debmon: comment out classes so we can at least get one puppet run" [puppet] - 10https://gerrit.wikimedia.org/r/673121
[14:34:17] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:34:17] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] Revert "sso-debmon: comment out classes so we can at least get one puppet run" [puppet] - 10https://gerrit.wikimedia.org/r/673121 (owner: 10Jbond)
[14:34:56] <wikibugs>	 10SRE, 10Services, 10Service-deployment-requests, 10artificial-intelligence: New Service Request 'open_nsfw' - https://phabricator.wikimedia.org/T250110 (10akosiaris) >>! In T250110#6928585, @Chtnnh wrote: > I understand @akosiaris !  >  > Is it possible to deploy to production as volunteers? As in, is it...
[14:36:41] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:38:22] <wikibugs>	 10SRE, 10Services, 10Service-deployment-requests, 10artificial-intelligence: New Service Request 'open_nsfw' - https://phabricator.wikimedia.org/T250110 (10Chtnnh) I see. I think the team (@Harshineesriram, @Abbasidaniyal and I) will have to put some thought into that.   As far as the timeline is concerned...
[14:47:43] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: 2021-03-31) rack/setup/install cloudgw100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T272403 (10RobH) @aborrero,  Perhaps this wasn't conveyed at the time of order, and it may cause issues, but we don't support connecting mi...
[14:47:56] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: 2021-03-31) rack/setup/install cloudgw100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T272403 (10RobH) a:05RobH→03aborrero
[14:48:40] <wikibugs>	 10SRE, 10Analytics-Radar, 10Machine-Learning-Team: Kubeflow on stat machines - https://phabricator.wikimedia.org/T275551 (10akosiaris) Just a few clarifications and answers.  > cloud vps is a kubernetes cluster  It's toolforge that's half powered by a kubernetes cluster. The other half is powered by son of g...
[14:52:35] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:54:51] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:57:43] <wikibugs>	 (03CR) 10Bstorm: "You'll need this on the grid master as well. shadow_master should only be on the shadow server." [puppet] - 10https://gerrit.wikimedia.org/r/673448 (https://phabricator.wikimedia.org/T277653) (owner: 10Arturo Borrero Gonzalez)
[14:59:10] <wikibugs>	 (03CR) 10Bstorm: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/673448 (https://phabricator.wikimedia.org/T277653) (owner: 10Arturo Borrero Gonzalez)
[15:04:03] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:04:42] <wikibugs>	 (03PS1) 10Jbond: P:debmonitor: fix dependencies in cloud [puppet] - 10https://gerrit.wikimedia.org/r/673511
[15:06:19] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] P:debmonitor: fix dependencies in cloud [puppet] - 10https://gerrit.wikimedia.org/r/673511 (owner: 10Jbond)
[15:08:53] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:14:19] <wikibugs>	 (03CR) 10Bstorm: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/673448 (https://phabricator.wikimedia.org/T277653) (owner: 10Arturo Borrero Gonzalez)
[15:16:06] <wikibugs>	 (03PS1) 10Jbond: P:debmonitor: fix nginx ssl config [puppet] - 10https://gerrit.wikimedia.org/r/673514
[15:16:50] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28678/console" [puppet] - 10https://gerrit.wikimedia.org/r/673514 (owner: 10Jbond)
[15:17:25] <wikibugs>	 (03PS1) 10Elukey: Add alluxio keytabs on Hadoop test [puppet] - 10https://gerrit.wikimedia.org/r/673515 (https://phabricator.wikimedia.org/T266641)
[15:17:39] <wikibugs>	 (03CR) 10Bstorm: "I think anything cloud-init does is our one guaranteed change on a VM, since a user can disable puppet or break it. It would be great if w" [puppet] - 10https://gerrit.wikimedia.org/r/673475 (https://phabricator.wikimedia.org/T277866) (owner: 10Arturo Borrero Gonzalez)
[15:18:15] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={atlas_exporter,ircd,netbox_device_statistics} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:18:23] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Add alluxio keytabs on Hadoop test [puppet] - 10https://gerrit.wikimedia.org/r/673515 (https://phabricator.wikimedia.org/T266641) (owner: 10Elukey)
[15:20:57] <wikibugs>	 (03CR) 10Bstorm: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/673475 (https://phabricator.wikimedia.org/T277866) (owner: 10Arturo Borrero Gonzalez)
[15:22:51] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:31:18] <wikibugs>	 (03PS12) 10Dave Pifke: arclamp: serve SVGs, compressed logs from Swift [puppet] - 10https://gerrit.wikimedia.org/r/623068 (https://phabricator.wikimedia.org/T244776)
[15:33:43] <wikibugs>	 (03CR) 10Dave Pifke: "This is ready to merge at your convenience." [puppet] - 10https://gerrit.wikimedia.org/r/623068 (https://phabricator.wikimedia.org/T244776) (owner: 10Dave Pifke)
[15:34:03] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] maintain-dbusers: fix the order of the paws accounts listing [puppet] - 10https://gerrit.wikimedia.org/r/673380 (https://phabricator.wikimedia.org/T276284) (owner: 10Bstorm)
[15:34:45] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:36:52] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] P:debmonitor: fix nginx ssl config [puppet] - 10https://gerrit.wikimedia.org/r/673514 (owner: 10Jbond)
[15:39:25] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:40:05] <wikibugs>	 (03CR) 10CRusnov: "LGTM :)" [puppet] - 10https://gerrit.wikimedia.org/r/565800 (owner: 10Legoktm)
[15:41:06] <wikibugs>	 (03Abandoned) 10CRusnov: mwgrep.py: Port to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/670975 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[15:41:12] <wikibugs>	 10SRE, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Set resource requests and limits for calico PODs - https://phabricator.wikimedia.org/T277877 (10JMeybohm)
[15:41:20] <wikibugs>	 10SRE, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Set resource requests and limits for calico PODs - https://phabricator.wikimedia.org/T277877 (10JMeybohm) p:05Triage→03High
[15:46:21] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:48:41] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:56:59] <wikibugs>	 (03PS1) 10Jbond: P:debmonitor::client: update cas vhost and open FW [puppet] - 10https://gerrit.wikimedia.org/r/673523
[15:58:25] <wikibugs>	 (03PS4) 10Cwhite: logstash: add and enable dlq max_bytes workaround [puppet] - 10https://gerrit.wikimedia.org/r/673377 (https://phabricator.wikimedia.org/T277775)
[15:59:29] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] logstash: add and enable dlq max_bytes workaround [puppet] - 10https://gerrit.wikimedia.org/r/673377 (https://phabricator.wikimedia.org/T277775) (owner: 10Cwhite)
[16:01:45] <effie>	 !log upgrade memcached on mc-gp200*
[16:01:47] <wikibugs>	 (03PS1) 10Bstorm: maintain-dbusers: type cast the uid for paws users [puppet] - 10https://gerrit.wikimedia.org/r/673524 (https://phabricator.wikimedia.org/T276284)
[16:01:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:02:02] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28680/console" [puppet] - 10https://gerrit.wikimedia.org/r/673523 (owner: 10Jbond)
[16:02:55] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] P:debmonitor::client: update cas vhost and open FW [puppet] - 10https://gerrit.wikimedia.org/r/673523 (owner: 10Jbond)
[16:03:35] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] maintain-dbusers: type cast the uid for paws users [puppet] - 10https://gerrit.wikimedia.org/r/673524 (https://phabricator.wikimedia.org/T276284) (owner: 10Bstorm)
[16:05:19] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:06:28] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review, 10User-jijiki: Upgrade memcached to version 1.6.x - https://phabricator.wikimedia.org/T270315 (10jijiki)
[16:07:22] <wikibugs>	 (03PS1) 10Effie Mouzeli: hieradata: install memcached 1.6 to gutter pool servers [puppet] - 10https://gerrit.wikimedia.org/r/673527 (https://phabricator.wikimedia.org/T270315)
[16:07:57] <wikibugs>	 (03PS1) 10Jbond: cloud - hiera: move hiera keys to correct level [puppet] - 10https://gerrit.wikimedia.org/r/673528
[16:10:17] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:10:56] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] cloud - hiera: move hiera keys to correct level [puppet] - 10https://gerrit.wikimedia.org/r/673528 (owner: 10Jbond)
[16:11:51] <duesen>	 ssh is telling me that the key for bast1002.wikimedia.org changed, namely to SHA256:XfPttsgImI8r43WfwENq8eA36R6i88RNnE409XiNpBk.
[16:11:59] <duesen>	 Can someone confirm that this is expected?
[16:12:38] <Majavah>	 duesen: https://wikitech.wikimedia.org/w/index.php?title=Help%3ASSH_Fingerprints%2Fbast1002.wikimedia.org&type=revision&diff=1900025&oldid=1799398, looks like yes
[16:13:11] <duesen>	 Majavah: thank you!
[16:15:15] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:15:52] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] hieradata: install memcached 1.6 to gutter pool servers [puppet] - 10https://gerrit.wikimedia.org/r/673527 (https://phabricator.wikimedia.org/T270315) (owner: 10Effie Mouzeli)
[16:16:18] <wikibugs>	 (03PS5) 10Cwhite: logstash: add and enable dlq max_bytes workaround [puppet] - 10https://gerrit.wikimedia.org/r/673377 (https://phabricator.wikimedia.org/T277775)
[16:17:27] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] logstash: add and enable dlq max_bytes workaround [puppet] - 10https://gerrit.wikimedia.org/r/673377 (https://phabricator.wikimedia.org/T277775) (owner: 10Cwhite)
[16:21:34] <wikibugs>	 (03PS1) 10Jbond: P:debmonitor::server: allow users to configure the cas required_groups [puppet] - 10https://gerrit.wikimedia.org/r/673533
[16:22:35] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:23:59] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] P:debmonitor::server: allow users to configure the cas required_groups [puppet] - 10https://gerrit.wikimedia.org/r/673533 (owner: 10Jbond)
[16:25:51] <wikibugs>	 (03PS2) 10Jbond: P:debmonitor::server: allow users to configure the cas required_groups [puppet] - 10https://gerrit.wikimedia.org/r/673533
[16:27:52] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28684/console" [puppet] - 10https://gerrit.wikimedia.org/r/673533 (owner: 10Jbond)
[16:28:16] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] P:debmonitor::server: allow users to configure the cas required_groups [puppet] - 10https://gerrit.wikimedia.org/r/673533 (owner: 10Jbond)
[16:30:11] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "Ah, thanks for pointing out Joe's change. With that I am +1 then :) thanks" [puppet] - 10https://gerrit.wikimedia.org/r/673228 (owner: 10Jbond)
[16:31:32] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] P:tcpircbot: drop monitoring of service [puppet] - 10https://gerrit.wikimedia.org/r/673228 (owner: 10Jbond)
[16:33:18] <wikibugs>	 (03PS6) 10Cwhite: logstash: add and enable dlq max_bytes workaround [puppet] - 10https://gerrit.wikimedia.org/r/673377 (https://phabricator.wikimedia.org/T277775)
[16:34:25] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] logstash: add and enable dlq max_bytes workaround [puppet] - 10https://gerrit.wikimedia.org/r/673377 (https://phabricator.wikimedia.org/T277775) (owner: 10Cwhite)
[16:35:41] <wikibugs>	 (03PS7) 10Cwhite: logstash: add and enable dlq max_bytes workaround [puppet] - 10https://gerrit.wikimedia.org/r/673377 (https://phabricator.wikimedia.org/T277775)
[16:36:49] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:37:29] <wikibugs>	 (03PS8) 10Cwhite: logstash: add and enable dlq max_bytes workaround [puppet] - 10https://gerrit.wikimedia.org/r/673377 (https://phabricator.wikimedia.org/T277775)
[16:40:16] <wikibugs>	 (03CR) 10Cwhite: "Thanks for the review!  All were valid points." (0311 comments) [puppet] - 10https://gerrit.wikimedia.org/r/673377 (https://phabricator.wikimedia.org/T277775) (owner: 10Cwhite)
[16:46:23] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:52:11] <wikibugs>	 (03PS1) 10Jbond: pki - cloud: add sso puppet CA to authorised CA's [puppet] - 10https://gerrit.wikimedia.org/r/673537
[16:52:55] <wikibugs>	 (03PS3) 10Effie Mouzeli: profile::mcrouter_wancache: add spec tests [puppet] - 10https://gerrit.wikimedia.org/r/672773
[16:55:59] <wikibugs>	 (03PS1) 10Bstorm: maintain-dbusers: correct the types on a the PAWS UID and paths [puppet] - 10https://gerrit.wikimedia.org/r/673538 (https://phabricator.wikimedia.org/T276284)
[16:57:08] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:00:42] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:06:24] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:07:56] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:07:57] <wikibugs>	 (03PS4) 10Effie Mouzeli: profile::mcrouter_wancache: add spec tests [puppet] - 10https://gerrit.wikimedia.org/r/672773
[17:15:34] <wikibugs>	 (03CR) 10Jbond: profile::mcrouter_wancache: add spec tests (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/672773 (owner: 10Effie Mouzeli)
[17:16:59] <wikibugs>	 10SRE, 10ops-eqiad, 10Analytics-Radar: Try to move some new analytics worker nodes to different racks - https://phabricator.wikimedia.org/T276239 (10Cmjohnson) @elukey can I move the 2 servers anytime or does this need to be scheduled?  Move an-worker1129 to A2 Move an-worker1139 to A7
[17:19:04] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] pki - cloud: add sso puppet CA to authorised CA's [puppet] - 10https://gerrit.wikimedia.org/r/673537 (owner: 10Jbond)
[17:23:20] <wikibugs>	 10SRE, 10ops-eqiad, 10Analytics-Radar: Try to move some new analytics worker nodes to different racks - https://phabricator.wikimedia.org/T276239 (10elukey) @Cmjohnson anytime is fine! Thanks :)
[17:25:18] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={ircd,routinator} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:27:32] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:31:43] <wikibugs>	 (03PS5) 10Effie Mouzeli: profile::mcrouter_wancache: add spec tests [puppet] - 10https://gerrit.wikimedia.org/r/672773
[17:31:54] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:31:55] <wikibugs>	 Puppet, SRE-tools, Python3-Porting, User-MoritzMuehlenhoff, and 2 others: Convert .py.erb files to files with configurations - https://phabricator.wikimedia.org/T277892 (crusnov)
[17:32:12] <wikibugs>	 Puppet, SRE-tools, Python3-Porting, User-MoritzMuehlenhoff, and 2 others: Convert .py.erb files to files with configurations - https://phabricator.wikimedia.org/T277892 (crusnov) p:Triage→Medium
[17:33:33] <wikibugs>	 (CR) Legoktm: [C: +2] "Don't see any more backtracking occurring. Thanks!" [software/pywmflib] - https://gerrit.wikimedia.org/r/673468 (owner: Volans)
[17:34:06] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:38:34] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:43:01] <wikibugs>	 (Merged) jenkins-bot: tests: fix pip backtracking [software/pywmflib] - https://gerrit.wikimedia.org/r/673468 (owner: Volans)
[17:45:16] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:46:15] <wikibugs>	 (CR) Bstorm: [C: +2] maintain-dbusers: correct the types on a the PAWS UID and paths [puppet] - https://gerrit.wikimedia.org/r/673538 (https://phabricator.wikimedia.org/T276284) (owner: Bstorm)
[18:00:18] <wikibugs>	 (PS34) Jbond: netbase: add new module to manage /etc/services [puppet] - https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146)
[18:03:32] <wikibugs>	 (PS12) Jbond: P:base: add ability to manage services file [puppet] - https://gerrit.wikimedia.org/r/670918
[18:03:47] <wikibugs>	 (PS7) Jbond: (WIP): netbase: first pass at parsing service::catalogue ports [puppet] - https://gerrit.wikimedia.org/r/673105
[18:05:20] <wikibugs>	 (CR) jerkins-bot: [V: -1] (WIP): netbase: first pass at parsing service::catalogue ports [puppet] - https://gerrit.wikimedia.org/r/673105 (owner: Jbond)
[18:12:03] <wikibugs>	 (CR) Mstyles: create helmfile.d structure (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) (owner: Mstyles)
[18:15:24] <wikibugs>	 (PS1) Razzi: turnilo: add monitoring for http [puppet] - https://gerrit.wikimedia.org/r/673556 (https://phabricator.wikimedia.org/T277729)
[18:16:34] <wikibugs>	 (CR) jerkins-bot: [V: -1] turnilo: add monitoring for http [puppet] - https://gerrit.wikimedia.org/r/673556 (https://phabricator.wikimedia.org/T277729) (owner: Razzi)
[18:19:08] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:24:26] <wikibugs>	 (PS8) Effie Mouzeli: mediawiki::mcrouter: add onhost memcached unix socket support [puppet] - https://gerrit.wikimedia.org/r/663565 (https://phabricator.wikimedia.org/T273115)
[18:31:04] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:34:46] <wikibugs>	 (CR) Elukey: "Razzi I think that we should try to hit the local endpoint, namely the one offered by the Turnilo nodejs app:" [puppet] - https://gerrit.wikimedia.org/r/673556 (https://phabricator.wikimedia.org/T277729) (owner: Razzi)
[18:35:54] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:37:26] <wikibugs>	 (CR) CRusnov: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/670985 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[18:40:47] <wikibugs>	 (CR) Legoktm: [C: +1] site/conftool-data: turn mw2251,mw2252 into canaries (1 comment) [puppet] - https://gerrit.wikimedia.org/r/673367 (https://phabricator.wikimedia.org/T277780) (owner: Dzahn)
[18:43:55] <wikibugs>	 (PS8) Jbond: (WIP): netbase: first pass at parsing service::catalogue ports [puppet] - https://gerrit.wikimedia.org/r/673105
[18:45:25] <wikibugs>	 (CR) jerkins-bot: [V: -1] (WIP): netbase: first pass at parsing service::catalogue ports [puppet] - https://gerrit.wikimedia.org/r/673105 (owner: Jbond)
[18:45:43] <wikibugs>	 (PS2) Razzi: turnilo: add monitoring for http [puppet] - https://gerrit.wikimedia.org/r/673556 (https://phabricator.wikimedia.org/T277729)
[18:46:09] <mutante>	 !log deploy2002 - disable puppet, copy modified version of scap-master-sync over it that does not --exclude="**/cache/l10n/*.cdb"  (for T275826)
[18:46:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:46:19] <stashbot>	 T275826: L10n cache files building up on backup deploy hosts - https://phabricator.wikimedia.org/T275826
[18:47:43] <wikibugs>	 (CR) Razzi: "Unless I missed it, it looked like there aren't any local appserver checks yet; here's my attempt at a new one." [puppet] - https://gerrit.wikimedia.org/r/673556 (https://phabricator.wikimedia.org/T277729) (owner: Razzi)
[18:51:14] <wikibugs>	 (PS9) Effie Mouzeli: mediawiki::mcrouter: add onhost memcached unix socket support [puppet] - https://gerrit.wikimedia.org/r/663565 (https://phabricator.wikimedia.org/T273115)
[18:52:32] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:53:02] <wikibugs>	 (PS1) Legoktm: tests: fix pip backtracking [cookbooks] - https://gerrit.wikimedia.org/r/673558
[18:53:11] <wikibugs>	 (PS9) Jbond: (WIP): netbase: first pass at parsing service::catalogue ports [puppet] - https://gerrit.wikimedia.org/r/673105
[18:55:35] <wikibugs>	 (CR) Effie Mouzeli: "> Patch Set 4: Code-Review-1" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/663565 (https://phabricator.wikimedia.org/T273115) (owner: Effie Mouzeli)
[18:55:57] <wikibugs>	 (CR) Dzahn: turnilo: add monitoring for http (2 comments) [puppet] - https://gerrit.wikimedia.org/r/673556 (https://phabricator.wikimedia.org/T277729) (owner: Razzi)
[18:56:48] <wikibugs>	 (CR) Dzahn: turnilo: add monitoring for http (1 comment) [puppet] - https://gerrit.wikimedia.org/r/673556 (https://phabricator.wikimedia.org/T277729) (owner: Razzi)
[18:57:39] <wikibugs>	 (CR) Dzahn: turnilo: add monitoring for http (1 comment) [puppet] - https://gerrit.wikimedia.org/r/673556 (https://phabricator.wikimedia.org/T277729) (owner: Razzi)
[18:59:15] <wikibugs>	 (CR) Dzahn: [C: +2] site/conftool-data: turn mw2251,mw2252 into canaries [puppet] - https://gerrit.wikimedia.org/r/673367 (https://phabricator.wikimedia.org/T277780) (owner: Dzahn)
[19:00:04] <wikibugs>	 Puppet, SRE-tools, Python3-Porting, User-crusnov, User-jbond: Port dstat related scripts to Python 3 - https://phabricator.wikimedia.org/T277910 (crusnov)
[19:00:16] <wikibugs>	 Puppet, SRE-tools, Python3-Porting, User-crusnov, User-jbond: Port dstat related scripts to Python 3 - https://phabricator.wikimedia.org/T277910 (crusnov) p:Triage→Medium
[19:01:24] <wikibugs>	 (PS3) Razzi: turnilo: add monitoring for node application [puppet] - https://gerrit.wikimedia.org/r/673556 (https://phabricator.wikimedia.org/T277729)
[19:01:26] <wikibugs>	 (CR) Razzi: turnilo: add monitoring for node application (1 comment) [puppet] - https://gerrit.wikimedia.org/r/673556 (https://phabricator.wikimedia.org/T277729) (owner: Razzi)
[19:06:09] <wikibugs>	 (CR) BBlack: [C: +1] "This seems like a correct copy of the technique of the other referenced patch! 😊" [cookbooks] - https://gerrit.wikimedia.org/r/673558 (owner: Legoktm)
[19:06:15] <wikibugs>	 (CR) CRusnov: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/670990 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[19:09:44] <wikibugs>	 (CR) Dzahn: turnilo: add monitoring for node application (1 comment) [puppet] - https://gerrit.wikimedia.org/r/673556 (https://phabricator.wikimedia.org/T277729) (owner: Razzi)
[19:09:56] <wikibugs>	 (CR) Legoktm: [C: +2] tests: fix pip backtracking [cookbooks] - https://gerrit.wikimedia.org/r/673558 (owner: Legoktm)
[19:11:40] <wikibugs>	 (PS35) Jbond: netbase: add new module to manage /etc/services [puppet] - https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146)
[19:11:58] <wikibugs>	 (CR) Dzahn: "fyi, one change that happens if you turn a server into a canary is also a change in envoy config:" [puppet] - https://gerrit.wikimedia.org/r/673367 (https://phabricator.wikimedia.org/T277780) (owner: Dzahn)
[19:12:30] <wikibugs>	 (PS13) Jbond: P:base: add ability to manage services file [puppet] - https://gerrit.wikimedia.org/r/670918
[19:12:42] <wikibugs>	 (PS10) Jbond: (WIP): netbase: first pass at parsing service::catalogue ports [puppet] - https://gerrit.wikimedia.org/r/673105
[19:17:21] <wikibugs>	 (Merged) jenkins-bot: tests: fix pip backtracking [cookbooks] - https://gerrit.wikimedia.org/r/673558 (owner: Legoktm)
[19:18:11] <wikibugs>	 (PS4) Legoktm: sre.ganeti.makevm: Automatically generate fqdn from hostname [cookbooks] - https://gerrit.wikimedia.org/r/668867 (https://phabricator.wikimedia.org/T276516)
[19:18:19] <wikibugs>	 (CR) Legoktm: [C: +2] "..." [cookbooks] - https://gerrit.wikimedia.org/r/668867 (https://phabricator.wikimedia.org/T276516) (owner: Legoktm)
[19:20:50] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={ircd,netbox_device_statistics} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:22:32] <wikibugs>	 (Merged) jenkins-bot: sre.ganeti.makevm: Automatically generate fqdn from hostname [cookbooks] - https://gerrit.wikimedia.org/r/668867 (https://phabricator.wikimedia.org/T276516) (owner: Legoktm)
[19:24:57] <mutante>	 !log deploy2002 - re-enabled puppet, reverted patch of scap-master-sync
[19:25:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:25:30] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:27:26] <wikibugs>	 (PS1) Legoktm: tests: fix pip backtracking [software/cumin] - https://gerrit.wikimedia.org/r/673564
[19:28:07] <wikibugs>	 (CR) Legoktm: tests: fix pip backtracking (1 comment) [software/cumin] - https://gerrit.wikimedia.org/r/673564 (owner: Legoktm)
[19:33:09] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/weight=1; selector: name=mw2251.codfw.wmnet,service=canary
[19:33:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:33:15] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/weight=1; selector: name=mw2252.codfw.wmnet,service=canary
[19:33:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:37:06] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2251.codfw.wmnet,service=canary
[19:37:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:37:17] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2252.codfw.wmnet,service=canary
[19:37:22] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={ircd,netbox_device_statistics} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:37:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:39:31] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2244.codfw.wmnet
[19:39:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:39:39] <logmsgbot>	 !log legoktm@cumin1001 START - Cookbook sre.ganeti.makevm for new host lists1002.wikimedia.org
[19:39:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:39:45] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2245.codfw.wmnet
[19:39:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:40:31] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw2244.codfw.wmnet
[19:40:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:42:08] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:42:15] <legoktm>	 ugh
[19:42:22] <legoktm>	 mutante: I think we conflicted on the netbox step :/
[19:42:35] <legoktm>	 my diff shows the removal of mw2244
[19:43:04] <mutante>	 legoktm: my cookbook is at the "Sleeping for 3 minutes" step
[19:43:13] <mutante>	 removal of mw2244 is correct
[19:43:19] <legoktm>	 ok, I'm going to accept the diff 
[19:43:21] <mutante>	 though I am not sure if it will mean my run will fail later
[19:43:36] <mutante>	 It happened to me when I tried to do 2 decoms at once
[19:43:45] <mutante>	 and I accepted it as well.. yes, please do
[19:43:47] <mutante>	 we will see
[19:44:04] <legoktm>	 https://phabricator.wikimedia.org/rONED8c1c033f628adcada809fd30e27d3210f43d362f
[19:44:33] <mutante>	 if it's removed from DNS before all other decom steps are done
[19:44:36] <mutante>	 there might be remnants
[19:44:38] <mutante>	 not sure
[19:44:47] <mutante>	 but it is already past "removed from puppetDB"
[19:45:14] <legoktm>	 feels like this step should have a lock
[19:46:08] <mutante>	 one issue i could see is when it tries to connect to mgmt to shut it down
[19:46:31] <wikibugs>	 SRE, MW-on-K8s, Shellbox, serviceops, and 4 others: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (Daimona)
[19:48:08] <wikibugs>	 SRE, DBA, Epic, Performance-Team (Radar), Sustainability (Incident Followup): Decide how to improve  parsercache replication, sharding and HA - https://phabricator.wikimedia.org/T133523 (Krinkle)
[19:48:30] <wikibugs>	 SRE, DBA, Epic, Performance-Team (Radar), Sustainability (Incident Followup): Decide how to improve  parsercache replication, sharding and HA - https://phabricator.wikimedia.org/T133523 (Krinkle) a:Kormat
[19:48:57] <wikibugs>	 SRE, DBA, Epic, Performance-Team (Radar), Sustainability (Incident Followup): Decide how to improve  parsercache replication, sharding and HA - https://phabricator.wikimedia.org/T133523 (Krinkle) >>! In T277831#6927485, @Krinkle wrote: >> The concerned raised by @Kormat is that the current se...
[19:49:20] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:50:47] <mutante>	 !log testreduce1001 - confirmed MariaDB @@datadir is /srv/data/mysql and deleting /var/lib/mysql (T277580)
[19:50:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:50:54] <stashbot>	 T277580: Bump disk space on testreduce1001 - https://phabricator.wikimedia.org/T277580
[19:53:21] <logmsgbot>	 !log legoktm@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host lists1002.wikimedia.org
[19:53:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:53:54] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2244.codfw.wmnet
[19:53:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:54:01] <wikibugs>	 SRE, ops-codfw, serviceops, Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw2244.codfw.wmnet` - mw2244.codfw.wmnet (**PASS**)   - Downtime...
[19:54:02] <mutante>	 legoktm: it's running homer now to shut down switch port and that's it. exit 0 
[19:54:10] <legoktm>	 :D
[19:54:10] <mutante>	 seems to be fine
[19:54:12] <legoktm>	 phew
[19:54:16] <mutante>	 yep
[19:54:18] <wikibugs>	 (PS11) Jbond: P:netbase: parse the service catalouge and inject the service ports [puppet] - https://gerrit.wikimedia.org/r/673105
[19:54:24] <wikibugs>	 (PS1) Legoktm: install_server: Add lists1002.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/673590 (https://phabricator.wikimedia.org/T276686)
[19:55:18] <wikibugs>	 (CR) Jbond: P:netbase: parse the service catalouge and inject the service ports (1 comment) [puppet] - https://gerrit.wikimedia.org/r/673105 (owner: Jbond)
[19:55:31] <wikibugs>	 (CR) jerkins-bot: [V: -1] P:netbase: parse the service catalouge and inject the service ports [puppet] - https://gerrit.wikimedia.org/r/673105 (owner: Jbond)
[19:55:37] <wikibugs>	 (CR) Legoktm: [C: +2] install_server: Add lists1002.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/673590 (https://phabricator.wikimedia.org/T276686) (owner: Legoktm)
[19:55:37] <mutante>	 I am doing one more decom but then that's it for this Friday
[19:55:51] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw2245.codfw.wmnet
[19:55:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:56:57] <legoktm>	 I have no more VMs for today :)
[19:57:08] <mutante>	 ack
[19:59:11] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:59:58] <wikibugs>	 (CR) Elukey: turnilo: add monitoring for node application (2 comments) [puppet] - https://gerrit.wikimedia.org/r/673556 (https://phabricator.wikimedia.org/T277729) (owner: Razzi)
[20:00:18] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] Update cxserver to 2021-03-15-131520-production [deployment-charts] - https://gerrit.wikimedia.org/r/672386 (https://phabricator.wikimedia.org/T271711) (owner: KartikMistry)
[20:00:34] <wikibugs>	 (PS12) Jbond: P:netbase: parse the service catalouge and inject the service ports [puppet] - https://gerrit.wikimedia.org/r/673105
[20:00:37] <wikibugs>	 (PS1) Legoktm: site.pp: Add lists1002.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/673591 (https://phabricator.wikimedia.org/T276686)
[20:00:50] <wikibugs>	 (CR) Jbond: "this is ready for review now" [puppet] - https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) (owner: Jbond)
[20:01:43] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={ircd,netbox_device_statistics} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:03:07] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:03:10] <wikibugs>	 SRE, ops-codfw, serviceops, Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (Dzahn)
[20:04:51] <wikibugs>	 (PS1) Dzahn: DHCP: switch scandium to use buster installer [puppet] - https://gerrit.wikimedia.org/r/673592 (https://phabricator.wikimedia.org/T268248)
[20:08:34] <wikibugs>	 (CR) Legoktm: [C: +2] site.pp: Add lists1002.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/673591 (https://phabricator.wikimedia.org/T276686) (owner: Legoktm)
[20:08:37] <wikibugs>	 (PS1) Herron: wip [puppet] - https://gerrit.wikimedia.org/r/673594
[20:09:26] <wikibugs>	 (CR) Dzahn: [C: +2] DHCP: switch scandium to use buster installer [puppet] - https://gerrit.wikimedia.org/r/673592 (https://phabricator.wikimedia.org/T268248) (owner: Dzahn)
[20:10:49] <wikibugs>	 (PS36) Jbond: netbase: add new module to manage /etc/services [puppet] - https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146)
[20:11:57] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2245.codfw.wmnet
[20:12:03] <wikibugs>	 SRE, ops-codfw, serviceops, Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw2245.codfw.wmnet` - mw2245.codfw.wmnet (**PASS**)   - Downtime...
[20:12:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:12:31] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:13:31] <wikibugs>	 (CR) Dzahn: [C: +2] site/conftool-data: decom mw2244,mw2245, former canary servers [puppet] - https://gerrit.wikimedia.org/r/673368 (https://phabricator.wikimedia.org/T277780) (owner: Dzahn)
[20:14:00] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on scandium.eqiad.wmnet with reason: reimage
[20:14:01] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on scandium.eqiad.wmnet with reason: reimage
[20:14:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:14:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:14:19] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:15:56] <mutante>	 !log scandium - reimaging with buster
[20:16:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:19:22] <wikibugs>	 SRE, DBA, Wikimedia-Mailing-lists: Create test databases for mailman3 - https://phabricator.wikimedia.org/T256538 (Legoktm) >>! In T256538#6920846, @Marostegui wrote: > Databases are now created, once I get the IPs I will create the users :)  208.80.154.13 (https://netbox.wikimedia.org/virtualization...
[20:19:56] <wikibugs>	 SRE, Security-Team, Wikimedia-Mailing-lists: Upgrade GNU Mailman from 2.1 to Mailman3 - https://phabricator.wikimedia.org/T52864 (Legoktm)
[20:19:59] <wikibugs>	 SRE, Wikimedia-Mailing-lists, vm-requests: Requesting a test VM in production for mailman3 - https://phabricator.wikimedia.org/T276686 (Legoktm) Open→Resolved Done, lists1002.wikimedia.org now exists.
[20:20:19] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:21:36] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:23:34] <wikibugs>	 (PS37) Jbond: netbase: add new module to manage /etc/services [puppet] - https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146)
[20:24:07] <wikibugs>	 (PS14) Jbond: P:base: add ability to manage services file [puppet] - https://gerrit.wikimedia.org/r/670918
[20:24:16] <wikibugs>	 (PS13) Jbond: P:netbase: parse the service catalouge and inject the service ports [puppet] - https://gerrit.wikimedia.org/r/673105
[20:29:44] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on scandium.eqiad.wmnet with reason: REIMAGE
[20:29:50] <wikibugs>	 SRE, Performance-Team, Platform Engineering, Goal: Decommission the "session redis" cluster - https://phabricator.wikimedia.org/T243520 (Krinkle)
[20:29:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:29:59] <wikibugs>	 SRE, MediaWiki-General, Performance-Team, serviceops-radar, and 3 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (Krinkle)
[20:31:48] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on scandium.eqiad.wmnet with reason: REIMAGE
[20:31:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:33:42] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:34:50] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:42:35] <wikibugs>	 SRE, observability: The "logstash-*" index pattern does not contain any of the following field types: ip - https://phabricator.wikimedia.org/T238795 (colewhite) Open→Resolved a:colewhite ECS is typing these fields appropriately since https://gerrit.wikimedia.org/r/c/operations/puppet/+/647029
[20:43:07] <wikibugs>	 (PS1) Legoktm: sre.ganeti.makevm: Update example after 22c586eb2ac23 [cookbooks] - https://gerrit.wikimedia.org/r/673597
[20:49:21] <wikibugs>	 SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install pki-root1001.eqiad.wmnet - https://phabricator.wikimedia.org/T276625 (RobH) p:Medium→High a:RobH
[20:50:09] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops-radar: (Need By: TBD) rack/setup/install thumbor100[56] - https://phabricator.wikimedia.org/T273914 (wiki_willy) a:Jclark-ctr
[20:51:05] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (wiki_willy) a:Jclark-ctr
[20:53:49] <legoktm>	 dpifke: I'm going to deploy the arclamp / swift change now
[20:54:03] <dpifke>	 SGTM.
[20:56:28] <wikibugs>	 (CR) Legoktm: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28686/console" [puppet] - https://gerrit.wikimedia.org/r/623068 (https://phabricator.wikimedia.org/T244776) (owner: Dave Pifke)
[20:57:37] <wikibugs>	 (CR) Legoktm: [V: +1 C: +2] arclamp: serve SVGs, compressed logs from Swift [puppet] - https://gerrit.wikimedia.org/r/623068 (https://phabricator.wikimedia.org/T244776) (owner: Dave Pifke)
[20:59:44] <legoktm>	 running puppet now
[21:05:41] <legoktm>	 https://performance.wikimedia.org/arclamp/svgs/daily/2021-03-19.excimer.all.reversed.svgz "Internal Server Error"
[21:06:03] <legoktm>	 [Fri Mar 19 21:05:42.367791 2021] [proxy:warn] [pid 15412:tid 139721669838592] [client 2620:0:861:101:10:64:0:215:33844] AH01144: No protocol handler was valid for the URL /arclamp/svgs/daily/2021-03-19.excimer.all.reversed.svgz. If you are using a DSO version of mod_proxy, make sure the proxy submodules are included in the configuration using LoadModule.
[21:06:13] <legoktm>	 maybe it didn't match the regex?
[21:06:37] <dpifke>	 Hmm, looking.
[21:07:22] <wikibugs>	 (PS2) Dzahn: site/conftool-data: decom mw2244,mw2245, former canary servers [puppet] - https://gerrit.wikimedia.org/r/673368 (https://phabricator.wikimedia.org/T277780)
[21:08:36] <wikibugs>	 (CR) Dzahn: [C: +2] site/conftool-data: decom mw2244,mw2245, former canary servers [puppet] - https://gerrit.wikimedia.org/r/673368 (https://phabricator.wikimedia.org/T277780) (owner: Dzahn)
[21:08:37] <dpifke>	 Looks like Swift is HTTP in beta, HTTPS in prod.
[21:08:42] <wikibugs>	 (PS3) Dzahn: site/conftool-data: decom mw2244,mw2245, former canary servers [puppet] - https://gerrit.wikimedia.org/r/673368 (https://phabricator.wikimedia.org/T277780)
[21:08:52] <dpifke>	 Checking to see if there's a mod_proxy_https we need to add.
[21:09:11] <legoktm>	 there is, yes
[21:09:43] <dpifke>	 I don't see it in /etc/apache2/mods-available?  Or is it part of proxy_http2?
[21:09:49] <legoktm>	 or, maybe not
[21:09:54] <legoktm>	 yeah, I just checked that too
[21:10:19] <legoktm>	 https://httpd.apache.org/docs/2.4/mod/mod_proxy_http.html says it supports HTTPS
[21:11:02] <mutante>	 !log scandium - stop apache and rerun puppet which fails after reimaging because it tries to run an nginx on port 80 which is already used by apache T268248
[21:11:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:11:11] <stashbot>	 T268248: upgrade scandium to buster - https://phabricator.wikimedia.org/T268248
[21:11:52] <legoktm>	 using https://regex101.com/ the regex does match
[21:14:13] <legoktm>	 I tried commenting out the <Location /arclamp> block to see if that made a difference, but it didn't
[21:14:33] <wikibugs>	 SRE, serviceops, Patch-For-Review, Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (Dzahn)
[21:14:44] <dpifke>	 It's odd, it seems to be working for logs but not for svgs.
[21:14:57] <dpifke>	 I tried changing it to http instead of https and it didn't make a difference.
[21:15:38] <dpifke>	 Unless maybe we stepped on each other making changes. :)  Trying again.
[21:15:42] <legoktm>	 oops
[21:16:11] <legoktm>	 https://stackoverflow.com/questions/23931987/apache-proxy-no-protocol-handler-was-valid says we need mod_ssl, which is not currently enabled
[21:16:47] <wikibugs>	 SRE, vm-requests, GitLab (Initialization), Patch-For-Review, User-brennen: Eqiad: 2 VM request for GitLab - https://phabricator.wikimedia.org/T274459 (Dzahn) Do you want to keep this open?  Or simply close and reopen if/once you want a second VM?
[21:17:16] <dpifke>	 Makes sense.  It works with http to the Swift backend, looking at Puppet code to see if changing that is a quick fix.
[21:17:51] <wikibugs>	 SRE, serviceops, Patch-For-Review, Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (Dzahn)
[21:18:16] <wikibugs>	 SRE, serviceops: upgrade mwmaint servers to buster - https://phabricator.wikimedia.org/T267607 (Dzahn) Open→Stalled mwmaint1002 will be upgraded during the DC switchover period in Q4
[21:18:24] <wikibugs>	 (PS1) Legoktm: webperf: Enable mod_ssl for performance website [puppet] - https://gerrit.wikimedia.org/r/673599
[21:19:20] <wikibugs>	 (CR) Legoktm: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28687/console" [puppet] - https://gerrit.wikimedia.org/r/673599 (owner: Legoktm)
[21:19:30] <legoktm>	 dpifke: ^
[21:19:47] <wikibugs>	 (CR) Dave Pifke: [C: +1] webperf: Enable mod_ssl for performance website [puppet] - https://gerrit.wikimedia.org/r/673599 (owner: Legoktm)
[21:19:50] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={ircd,netbox_device_statistics} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:20:11] <dpifke>	 Works for me.  That's cleaner than trying to rewrite the Swift URL from hieradata.
[21:20:38] <legoktm>	 and I think we want internal traffic to go over HTTPS anyways
[21:20:45] <wikibugs>	 (CR) Legoktm: [V: +1 C: +2] webperf: Enable mod_ssl for performance website [puppet] - https://gerrit.wikimedia.org/r/673599 (owner: Legoktm)
[21:20:46] <dpifke>	 And means we don't depend on an infrequently-used Swift endpoint, in case HTTP access to it ever goes away.
[21:20:54] <wikibugs>	 (PS2) Legoktm: webperf: Enable mod_ssl for performance website [puppet] - https://gerrit.wikimedia.org/r/673599
[21:20:57] <wikibugs>	 (CR) Legoktm: [V: +2 C: +2] webperf: Enable mod_ssl for performance website [puppet] - https://gerrit.wikimedia.org/r/673599 (owner: Legoktm)
[21:21:09] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:21:12] <mutante>	 yea, it's nice to encrypt all the internal traffic as well, +1
[21:22:25] <legoktm>	 ok, new error :p
[21:22:32] <legoktm>	 https://performance.wikimedia.org/arclamp/svgs/daily/2021-03-19.excimer.all.reversed.svgz "upstream connect error or disconnect/reset before headers. reset reason: connection failure"
[21:22:55] <dpifke>	 Certificate issue?
[21:25:09] <dpifke>	 The certificate for ms-fe.svc.eqiad.wmnet was issued by Puppet, we probably need to tell Apache about it.
[21:25:11] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:26:13] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:26:27] * legoktm looks to see how that's done elsewhere
[21:27:08] <dpifke>	 It seems to be available in /var/lib/puppet/ssl/certs/ca.pem.
[21:28:27] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:29:37] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:33:02] <wikibugs>	 (PS1) Dave Pifke: arclamp: allow Puppet CA for ms-fe.svc.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/673602 (https://phabricator.wikimedia.org/T244776)
[21:33:11] <dpifke>	 legoktm ^ I think that might do it.
[21:33:37] <legoktm>	 did you try it out already?
[21:33:44] <dpifke>	 No, can do so if you want.
[21:33:56] <legoktm>	 please :)
[21:34:06] <dpifke>	 Just looked at file permissions and tested using openssl s_client.
[21:34:16] <dpifke>	 Manually adding to Apache config now.
[21:34:49] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:36:28] <dpifke>	 Hmm.  Apache is complaining it can't bind to 443 when I run apache2ctl restart.
[21:37:02] <legoktm>	 er you're not using systemd?
[21:37:36] <dpifke>	 Same via systemctl restart.
[21:38:19] <legoktm>	 envoy is probably sitting on 443 already
[21:38:25] <legoktm>	 why is apache trying to bind to it though?
[21:39:04] <legoktm>	 (yes, it is envoy on 443)
[21:40:37] <dpifke>	  /etc/apache2/ports.conf
[21:40:45] <wikibugs>	 (PS4) Razzi: turnilo: add monitoring for node application [puppet] - https://gerrit.wikimedia.org/r/673556 (https://phabricator.wikimedia.org/T277729)
[21:40:46] <dpifke>	 Dunno why it worked when Puppet restarted it though?
[21:41:16] <legoktm>	 is that actually included?
[21:41:37] <dpifke>	 From /etc/apache2.conf, yes.
[21:41:40] * legoktm forces a puppet run
[21:42:39] <legoktm>	 I/puppet only did a reload earlier
[21:42:43] <legoktm>	 but now it's still down
[21:44:31] <dpifke>	 I don't know of a great way to override ports.conf later, so I guess we need to have Puppet overwrite it.
[21:45:11] <dpifke>	 I'm going to manually comment out the Listen 443 for now and see if the other fix works.
[21:45:18] <legoktm>	 ok
[21:45:38] <legoktm>	 there's $remove_default_ports in puppet, I'm going to use that
[21:46:21] <dpifke>	 Nice, someone else has already had this problem. :) 
[21:46:37] <icinga-wm>	 PROBLEM - Check systemd state on webperf2001 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:47:45] <legoktm>	 I'll ack that in a minute
[21:48:13] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:48:38] <wikibugs>	 (PS1) Legoktm: webperf: Don't have apache listen on 443 [puppet] - https://gerrit.wikimedia.org/r/673603
[21:49:49] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on webperf2001 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service Legoktm working on it https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:49:50] <dpifke>	 Does remove_default_ports remove 80 as well?  If so, do we need to add it back in?
[21:50:04] <wikibugs>	 (PS2) Dave Pifke: arclamp: enable SSL to ms-fe.svc.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/673602 (https://phabricator.wikimedia.org/T244776)
[21:50:53] <dpifke>	 ^ tested and works.  (Also needed "SSLProxyEngine On")
[21:51:15] <wikibugs>	 (PS2) Legoktm: webperf: Don't have apache listen on 443 [puppet] - https://gerrit.wikimedia.org/r/673603
[21:52:14] <wikibugs>	 (CR) Legoktm: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28689/console" [puppet] - https://gerrit.wikimedia.org/r/673603 (owner: Legoktm)
[21:54:16] <legoktm>	 I'm trying to figure out where the real ports are set
[21:54:31] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={ircd,netbox_device_statistics} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:55:25] <legoktm>	 ok, it seems like the other places just define their own ports.conf
[21:55:36] * legoktm just does that
[21:56:39] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:58:57] <wikibugs>	 (PS3) Legoktm: webperf: Don't have apache listen on 443 [puppet] - https://gerrit.wikimedia.org/r/673603
[21:59:30] <dpifke>	 The fact that Varnish is reaching webperf1001 via HTTP negates at least some of the value of all this work to get webperf1001 → ms-fe-svc working over HTTPS.  But I guess that's a problem for another day. :)
[21:59:40] <wikibugs>	 (CR) Legoktm: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28690/console" [puppet] - https://gerrit.wikimedia.org/r/673603 (owner: Legoktm)
[21:59:44] <legoktm>	 is it not talking to envoy?
[22:00:13] <wikibugs>	 (PS4) Legoktm: webperf: Don't have apache listen on 443 [puppet] - https://gerrit.wikimedia.org/r/673603
[22:00:16] <wikibugs>	 (CR) Dave Pifke: [C: +1] "LGTM." [puppet] - https://gerrit.wikimedia.org/r/673603 (owner: Legoktm)
[22:00:18] <wikibugs>	 (CR) jerkins-bot: [V: -1] webperf: Don't have apache listen on 443 [puppet] - https://gerrit.wikimedia.org/r/673603 (owner: Legoktm)
[22:01:24] <wikibugs>	 (CR) jerkins-bot: [V: -1] webperf: Don't have apache listen on 443 [puppet] - https://gerrit.wikimedia.org/r/673603 (owner: Legoktm)
[22:02:00] <wikibugs>	 (PS5) Legoktm: webperf: Don't have apache listen on 443 [puppet] - https://gerrit.wikimedia.org/r/673603
[22:02:12] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:05:03] <wikibugs>	 (PS6) Legoktm: webperf: Don't have apache listen on 443 [puppet] - https://gerrit.wikimedia.org/r/673603
[22:05:07] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:05:52] <wikibugs>	 (CR) Legoktm: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28692/console" [puppet] - https://gerrit.wikimedia.org/r/673603 (owner: Legoktm)
[22:06:42] <wikibugs>	 (PS7) Legoktm: webperf: Don't have apache listen on 443 [puppet] - https://gerrit.wikimedia.org/r/673603
[22:06:48] <wikibugs>	 (CR) Legoktm: [V: +2 C: +2] webperf: Don't have apache listen on 443 [puppet] - https://gerrit.wikimedia.org/r/673603 (owner: Legoktm)
[22:07:04] <wikibugs>	 (CR) Legoktm: [C: +2] arclamp: enable SSL to ms-fe.svc.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/673602 (https://phabricator.wikimedia.org/T244776) (owner: Dave Pifke)
[22:08:41] <legoktm>	 dpifke: ran puppet, I think it's all working now?
[22:09:19] <dpifke>	 Looks good from here.  Sorry this turned out to be such a chore.
[22:09:50] <legoktm>	 :D it wouldn't be a real Friday if it was boring
[22:10:19] <dpifke>	 Thanks for your help! :)
[22:10:23] <legoktm>	 :))
[22:10:45] <legoktm>	 one thing that did surprise me is that there didn't seem to be any monitoring that alarmed despite the site being down, I'll file a task for that
[22:11:26] <dpifke>	 Yeah.  I think we monitor the backends but not webperf1001 itself.
[22:11:43] <icinga-wm>	 RECOVERY - Check systemd state on webperf2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:11:52] <dpifke>	 I thought there was something at the Varnish layer that did, but I guess that's wrong.
[22:12:57] <wikibugs>	 (PS1) Bstorm: maintain-dbusers: polish things up a bit [puppet] - https://gerrit.wikimedia.org/r/673606 (https://phabricator.wikimedia.org/T276284)
[22:13:48] <wikibugs>	 SRE, Performance-Team, observability: Add monitoring for performance.wikimedia.org - https://phabricator.wikimedia.org/T277927 (Legoktm)
[22:15:15] * legoktm -> afk for a short break, still pingable though
[22:28:47] <wikibugs>	 SRE, Performance-Team, observability: Add monitoring for performance.wikimedia.org - https://phabricator.wikimedia.org/T277927 (dpifke) a:dpifke Related: T260086  We have Icinga checks for most of the backends (XHGui, ArcLamp), but not for Apache on webperf1001 itself.  Ideally, we can monitor er...
[22:43:28] <wikibugs>	 (PS7) Sharvaniharan: Add event stream config for android.image_recommendations_interaction [mediawiki-config] - https://gerrit.wikimedia.org/r/673005
[22:47:47] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={ircd,netbox_device_statistics} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:54:43] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:57:30] <wikibugs>	 (CR) Dzahn: turnilo: add monitoring for node application (1 comment) [puppet] - https://gerrit.wikimedia.org/r/673556 (https://phabricator.wikimedia.org/T277729) (owner: Razzi)
[23:03:39] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:03:40] <wikibugs>	 Puppet, Beta-Cluster-Infrastructure, Cloud-Services, Release-Engineering-Team (Other / Uncategorized), and 2 others: Re-think puppet management for deployment-prep - https://phabricator.wikimedia.org/T161675 (dpifke) Intermediate proposal: can we give +2 rights on labs/private to everyone with ro...
[23:05:49] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:12:13] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:13:15] <wikibugs>	 Puppet, Beta-Cluster-Infrastructure, Cloud-Services, Release-Engineering-Team (Other / Uncategorized), and 2 others: Re-think puppet management for deployment-prep - https://phabricator.wikimedia.org/T161675 (bd808) >>! In T161675#6930652, @dpifke wrote: > Intermediate proposal: can we give +2 ri...
[23:14:23] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:18:47] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:20:12] <wikibugs>	 Puppet, Beta-Cluster-Infrastructure, Cloud-Services, Release-Engineering-Team (Other / Uncategorized), and 2 others: Re-think puppet management for deployment-prep - https://phabricator.wikimedia.org/T161675 (dpifke) >>! In T161675#6930689, @bd808 wrote: > For anyone wondering who this is, see th...
[23:20:57] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:35:21] <wikibugs>	 Puppet, Beta-Cluster-Infrastructure, Cloud-Services, Release-Engineering-Team (Other / Uncategorized), and 2 others: Re-think puppet management for deployment-prep - https://phabricator.wikimedia.org/T161675 (bd808) >>! In T161675#6930741, @dpifke wrote: >>>! In T161675#6930689, @bd808 wrote: >>...
[23:38:33] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:39:51] <wikibugs>	 Puppet, Beta-Cluster-Infrastructure, Cloud-Services, Release-Engineering-Team (Other / Uncategorized), and 2 others: Re-think puppet management for deployment-prep - https://phabricator.wikimedia.org/T161675 (Legoktm) >>! In T161675#6930652, @dpifke wrote: > Intermediate proposal: can we give +2...
[23:40:47] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:47:37] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:50:33] <wikibugs>	 Puppet, SRE: Have puppet httpd class support enabling mod_ssl without having apache listen on port 443 - https://phabricator.wikimedia.org/T277989 (Legoktm)
[23:51:25] <wikibugs>	 Puppet, Beta-Cluster-Infrastructure, Cloud-Services, Release-Engineering-Team (Other / Uncategorized), and 2 others: Re-think puppet management for deployment-prep - https://phabricator.wikimedia.org/T161675 (bd808) >>! In T161675#6930761, @Legoktm wrote: >>>! In T161675#6930652, @dpifke wrote: >...
[23:54:55] <wikibugs>	 (CR) Legoktm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/672687 (https://phabricator.wikimedia.org/T224579) (owner: Muehlenhoff)
[23:56:23] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:59:09] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase