[00:00:04] <jouncebot>	 twentyafterfour: Time to snap out of that daydream and deploy Phabricator update. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210318T0000).
[00:05:31] <icinga-wm>	 PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.0102 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[00:09:27] <wikibugs>	 (PS1) Ahmon Dancy: Only report errors from production mw servers [puppet] - https://gerrit.wikimedia.org/r/673171
[00:10:36] <wikibugs>	 (PS2) Ahmon Dancy: logspam.pl: Only process errors from production mw servers [puppet] - https://gerrit.wikimedia.org/r/673171
[00:13:20] <wikibugs>	 (CR) Krinkle: logspam.pl: Only process errors from production mw servers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/673171 (owner: Ahmon Dancy)
[00:14:27] <wikibugs>	 (CR) Sharvaniharan: "@ottomata, @MHolloway any idea what the tab v/s space error is about? I did use tabs.. not sure what's wrong" [mediawiki-config] - https://gerrit.wikimedia.org/r/673005 (owner: Sharvaniharan)
[00:17:58] <wikibugs>	 SRE, Services, Patch-For-Review, Performance-Team (Radar), Service-deployment-requests: New Service Request: xhgui - https://phabricator.wikimedia.org/T277483 (Krinkle)
[00:18:57] <wikibugs>	 (CR) Ahmon Dancy: [C: -1] "Holding due to Krinkle's comments." [puppet] - https://gerrit.wikimedia.org/r/673171 (owner: Ahmon Dancy)
[00:19:27] <wikibugs>	 (CR) Brennen Bearnes: logspam.pl: Only process errors from production mw servers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/673171 (owner: Ahmon Dancy)
[00:19:39] <Krinkle>	 dancy: any specific noise that led to this?
[00:19:44] <Krinkle>	 might be able to shed some light
[00:19:51] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:20:55] <dancy>	 yes, this:  `2021-03-17 23:12:39 [37c59da3060210298617bec3] mwdebug1001 enwiki 1.36.0-wmf.34 error WARNING: [37c59da3060210298617bec3] [no req]   ErrorException: PHP Notice: Writing to directory /home/urbanecm/.config/psysh is not allowed. `
[00:21:29] <Urbanecm>	 dancy: that's me running shell.php on mwdebug1001
[00:21:40] <Krinkle>	 Yeah, eval/shell.php are excluded in Logstash for that reason
[00:21:45] <dancy>	 We had an exclusion for mwmaint* already but this showed up today so I tried coming at it from a different direction.
[00:21:51] <Krinkle>	 can be run on any mw* server, including mw1234
[00:22:06] <Urbanecm>	 sorry, I'll try to not run shell.php elsewhere :)
[00:22:13] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:22:14] <Krinkle>	 well, I do, and have to.
[00:22:40] <dancy>	 OK. I can change it to an eval/shell.php filter.
[00:23:02] <wikibugs>	 (CR) Krinkle: logspam.pl: Only process errors from production mw servers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/673171 (owner: Ahmon Dancy)
[00:23:42] <Krinkle>	 dancy: alright. if anything not eval/shell related does pop up, happy to take a look. there might be something else we need to fix or avoid by other means.
[00:24:08] <dancy>	 Sounds good. Thanks all. I'll update the commit tomorrow.  
[00:34:25] <wikibugs>	 (PS8) Krinkle: arclamp: serve SVGs, compressed logs from Swift [puppet] - https://gerrit.wikimedia.org/r/623068 (https://phabricator.wikimedia.org/T244776) (owner: Dave Pifke)
[00:34:44] <wikibugs>	 (CR) Krinkle: [C: +1] "This is ready to go afaics." [puppet] - https://gerrit.wikimedia.org/r/623068 (https://phabricator.wikimedia.org/T244776) (owner: Dave Pifke)
[01:13:02] <wikibugs>	 (CR) Brennen Bearnes: logspam.pl: Only process errors from production mw servers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/673171 (owner: Ahmon Dancy)
[01:14:59] <wikibugs>	 (PS2) Aaron Schulz: Use $region for default mcrouter routes [puppet] - https://gerrit.wikimedia.org/r/654330
[01:32:29] <icinga-wm>	 RECOVERY - mediawiki originals uploads -hourly- for codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[01:32:33] <icinga-wm>	 RECOVERY - mediawiki originals uploads -hourly- for eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[01:35:39] <icinga-wm>	 RECOVERY - Check systemd state on ml-serve2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:42:57] <icinga-wm>	 PROBLEM - Check systemd state on ml-serve2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:43:19] <wikibugs>	 (CR) Aaron Schulz: "Compiler shows diffs for me: https://puppet-compiler.wmflabs.org/compiler1002/28659/" [puppet] - https://gerrit.wikimedia.org/r/654330 (owner: Aaron Schulz)
[01:46:03] <wikibugs>	 (PS1) Dzahn: parsoid::testreduce: switch mysql data dir to /srv/data/mysql [puppet] - https://gerrit.wikimedia.org/r/673175 (https://phabricator.wikimedia.org/T277580)
[01:48:27] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={netbox_device_statistics,pdu_sentry4} site={eqiad,ulsfo} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:50:53] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:07:05] <icinga-wm>	 PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting%23Nova-fullstack
[03:09:31] <icinga-wm>	 RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 3 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting%23Nova-fullstack
[03:26:45] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:35:33] <icinga-wm>	 RECOVERY - Check systemd state on ml-serve2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:36:31] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 8.591 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:37:29] <icinga-wm>	 RECOVERY - Check systemd state on ml-serve1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:42:03] <icinga-wm>	 PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[03:42:05] <icinga-wm>	 PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[03:42:53] <icinga-wm>	 PROBLEM - Check systemd state on ml-serve2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:43:53] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:44:47] <icinga-wm>	 PROBLEM - Check systemd state on ml-serve1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:46:25] <andrewbogott>	 !log restarting slapd on seaborgium, serpens, and r-o ldap replicas (we're getting irregular connection failures)
[03:46:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:01:07] <icinga-wm>	 RECOVERY - Check systemd state on ml-serve2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:01:51] <icinga-wm>	 RECOVERY - Check systemd state on ml-serve2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:08:29] <icinga-wm>	 PROBLEM - Check systemd state on ml-serve2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:09:11] <icinga-wm>	 PROBLEM - Check systemd state on ml-serve2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:10:31] <wikibugs>	 (CR) Subramanya Sastry: [C: +1] "Will the current data in /var/lib/mysql be copied over separately after?" [puppet] - https://gerrit.wikimedia.org/r/673175 (https://phabricator.wikimedia.org/T277580) (owner: Dzahn)
[04:10:44] <wikibugs>	 (CR) Legoktm: [C: -1] arclamp: serve SVGs, compressed logs from Swift (1 comment) [puppet] - https://gerrit.wikimedia.org/r/623068 (https://phabricator.wikimedia.org/T244776) (owner: Dave Pifke)
[04:11:58] <wikibugs>	 (PS1) Andrew Bogott: Nova vendordata first boot: try to work around a puppet race [puppet] - https://gerrit.wikimedia.org/r/673178
[04:12:37] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Nova vendordata first boot: try to work around a puppet race [puppet] - https://gerrit.wikimedia.org/r/673178 (owner: Andrew Bogott)
[04:13:55] <wikibugs>	 (CR) Legoktm: [C: -1] "Ping me tomorrow (Thursday) and we can sync on deploying this?" [puppet] - https://gerrit.wikimedia.org/r/623068 (https://phabricator.wikimedia.org/T244776) (owner: Dave Pifke)
[04:20:35] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (backup1002), Fresh: 98 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[04:29:49] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:32:13] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:52:31] <icinga-wm>	 RECOVERY - mediawiki originals uploads -hourly- for eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[04:52:31] <icinga-wm>	 RECOVERY - mediawiki originals uploads -hourly- for codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[05:13:01] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE, AS6939/IPv6: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:36:21] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 83, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:44:55] <icinga-wm>	 RECOVERY - Check systemd state on ml-serve1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:51:55] <icinga-wm>	 PROBLEM - Check systemd state on ml-serve1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:04:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1126 for schema change', diff saved to https://phabricator.wikimedia.org/P14940 and previous config saved to /var/cache/conftool/dbconfig/20210318-060445-marostegui.json
[06:04:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:11:43] <icinga-wm>	 PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[06:11:47] <icinga-wm>	 PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[06:18:12] <wikibugs>	 (PS1) Marostegui: db1084: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/673186 (https://phabricator.wikimedia.org/T276302)
[06:18:55] <wikibugs>	 (CR) Marostegui: [C: +2] db1084: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/673186 (https://phabricator.wikimedia.org/T276302) (owner: Marostegui)
[06:20:45] <wikibugs>	 (PS1) Marostegui: instances.yaml: Add db1161 to dbctl [puppet] - https://gerrit.wikimedia.org/r/673187 (https://phabricator.wikimedia.org/T258361)
[06:21:21] <wikibugs>	 (CR) Marostegui: [C: +2] instances.yaml: Add db1161 to dbctl [puppet] - https://gerrit.wikimedia.org/r/673187 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui)
[06:22:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2120', diff saved to https://phabricator.wikimedia.org/P14941 and previous config saved to /var/cache/conftool/dbconfig/20210318-062201-marostegui.json
[06:22:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:31:06] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: (Need By: TBD) rack/setup/install db11[76-84] - https://phabricator.wikimedia.org/T273566 (Marostegui) Looking good: ` [06:29:42] marostegui@cumin1001:~$ sudo cumin 'db11[76-84].eqiad.wmnet' 'free -g ; echo ; df -hT /srv; echo ; pvs ; echo ; megacli -LdPdInfo -a0 | eg...
[06:32:29] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE, AS6939/IPv6: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:32:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1161 to dbctl, depooled T258361', diff saved to https://phabricator.wikimedia.org/P14942 and previous config saved to /var/cache/conftool/dbconfig/20210318-063241-marostegui.json
[06:32:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:32:50] <stashbot>	 T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[06:33:35] <wikibugs>	 SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui) db1161 is now on dbctl but depooled. Won't pool till Monday
[06:48:43] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 83, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:52:35] <wikibugs>	 (PS1) Ladsgroup: flaggedrevs: Disable multiple dimensions in hewikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/673189
[06:53:29] <wikibugs>	 (PS1) Marostegui: db11[77-84]: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/673190 (https://phabricator.wikimedia.org/T275633)
[06:54:20] <wikibugs>	 (CR) Marostegui: [C: +2] db11[77-84]: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/673190 (https://phabricator.wikimedia.org/T275633) (owner: Marostegui)
[06:59:31] <wikibugs>	 (PS1) Marostegui: install_server: Reimage db1156 as stretch [puppet] - https://gerrit.wikimedia.org/r/673191 (https://phabricator.wikimedia.org/T258361)
[07:00:18] <wikibugs>	 (CR) Marostegui: [C: +2] install_server: Reimage db1156 as stretch [puppet] - https://gerrit.wikimedia.org/r/673191 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui)
[07:01:35] <wikibugs>	 SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1156.eqiad.wmnet'] ` The log ca...
[07:01:44] <wikibugs>	 Puppet, Beta-Cluster-Infrastructure: Unduplicate beta cluster hiera keys set both in Horizon and in ops/puppet - https://phabricator.wikimedia.org/T277680 (Majavah) See also: {T161675}
[07:02:18] <wikibugs>	 (PS1) ArielGlenn: Dumps: continue restructuring page content batches [dumps] - https://gerrit.wikimedia.org/r/673192 (https://phabricator.wikimedia.org/T252396)
[07:02:46] <wikibugs>	 (CR) jerkins-bot: [V: -1] Dumps: continue restructuring page content batches [dumps] - https://gerrit.wikimedia.org/r/673192 (https://phabricator.wikimedia.org/T252396) (owner: ArielGlenn)
[07:05:04] <wikibugs>	 SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui) For db1165 which will replace db1085 in s6, what I will do: - Do not reimage db1165 to Stretch, instead will leave it as Buster and...
[07:05:20] <wikibugs>	 SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[07:07:28] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:09:06] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:13:27] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1156.eqiad.wmnet with reason: REIMAGE
[07:13:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:15:35] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1156.eqiad.wmnet with reason: REIMAGE
[07:15:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 25%: Slowly repool db1126', diff saved to https://phabricator.wikimedia.org/P14943 and previous config saved to /var/cache/conftool/dbconfig/20210318-071747-root.json
[07:17:50] <icinga-wm>	 RECOVERY - Check systemd state on ml-serve2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:17:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:32] <marostegui>	 !log Deploy schema change on s4 codfw master, lag will appear - T276150 T276156
[07:19:35] <wikibugs>	 (PS2) ArielGlenn: Dumps: continue restructuring page content batches [dumps] - https://gerrit.wikimedia.org/r/673192 (https://phabricator.wikimedia.org/T252396)
[07:19:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:40] <stashbot>	 T276150: Schema change to make rc_id unsigned and rc_timestamp BINARY - https://phabricator.wikimedia.org/T276150
[07:19:41] <stashbot>	 T276156: Drop default of rc_timestamp - https://phabricator.wikimedia.org/T276156
[07:20:30] <dcausse>	 !log depooling & restarting blazegraph on wdqs1005
[07:20:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:34] <icinga-wm>	 PROBLEM - Check systemd state on ml-serve2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:22:00] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.079 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[07:22:54] <wikibugs>	 SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1156.eqiad.wmnet'] `  and were **ALL** successful.
[07:23:48] <icinga-wm>	 RECOVERY - Backup freshness on backup1001 is OK: Fresh: 99 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[07:27:10] <_joe_>	 uhm why is tcpircbot alerting?
[07:27:15] <_joe_>	 not again, sigh
[07:27:56] <_joe_>	 I see dbctl works though
[07:28:28] <_joe_>	 yeah it's the alert that's wrong
[07:32:51] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 50%: Slowly repool db1126', diff saved to https://phabricator.wikimedia.org/P14944 and previous config saved to /var/cache/conftool/dbconfig/20210318-073250-root.json
[07:32:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:46] <icinga-wm>	 ACKNOWLEDGEMENT - tcpircbot_service_running on alert1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args tcpircbot.py Giuseppe Lavagetto False positive the service is running fine https://wikitech.wikimedia.org/wiki/Logmsgbot
[07:34:43] <wikibugs>	 (PS1) ArielGlenn: Name the batch files lock retry variables better [dumps] - https://gerrit.wikimedia.org/r/673194 (https://phabricator.wikimedia.org/T252396)
[07:36:13] <wikibugs>	 (PS1) Marostegui: mariadb: Promote db1136 to s7 master [puppet] - https://gerrit.wikimedia.org/r/673195 (https://phabricator.wikimedia.org/T274336)
[07:36:27] <wikibugs>	 (CR) Marostegui: [C: -2] "Wait for the failover day" [puppet] - https://gerrit.wikimedia.org/r/673195 (https://phabricator.wikimedia.org/T274336) (owner: Marostegui)
[07:38:20] <wikibugs>	 (PS1) Marostegui: wmnet: Update s7-master cname [dns] - https://gerrit.wikimedia.org/r/673196 (https://phabricator.wikimedia.org/T274336)
[07:39:36] <wikibugs>	 (CR) Marostegui: [C: -2] "Wait for the failover day" [dns] - https://gerrit.wikimedia.org/r/673196 (https://phabricator.wikimedia.org/T274336) (owner: Marostegui)
[07:40:04] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:41:06] <wikibugs>	 SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[07:41:08] <icinga-wm>	 RECOVERY - mediawiki originals uploads -hourly- for eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[07:41:18] <icinga-wm>	 RECOVERY - mediawiki originals uploads -hourly- for codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[07:42:06] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:42:12] <wikibugs>	 SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[07:44:12] <wikibugs>	 SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[07:47:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 75%: Slowly repool db1126', diff saved to https://phabricator.wikimedia.org/P14945 and previous config saved to /var/cache/conftool/dbconfig/20210318-074754-root.json
[07:48:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:39] <wikibugs>	 (PS1) ArielGlenn: worker.py --batches requires one job name be specified [dumps] - https://gerrit.wikimedia.org/r/673198 (https://phabricator.wikimedia.org/T252396)
[07:53:08] <wikibugs>	 (PS2) ArielGlenn: update worker scripts to loop in secondary batch worker mode [dumps] - https://gerrit.wikimedia.org/r/638043 (https://phabricator.wikimedia.org/T252396)
[07:59:19] <wikibugs>	 (PS1) Alexandros Kosiaris: ml_k8s::worker: Use new kubernetes/calico [puppet] - https://gerrit.wikimedia.org/r/673199
[08:01:33] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] "Gonna merge this as I am trying to debug the creation of the docker volume_group. It should NOT be related but it's causing noise and a di" [puppet] - https://gerrit.wikimedia.org/r/673199 (owner: Alexandros Kosiaris)
[08:02:58] <akosiaris>	 !log reimage ml-serve1004 to debug a docker volume_group issue
[08:02:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 100%: Slowly repool db1126', diff saved to https://phabricator.wikimedia.org/P14946 and previous config saved to /var/cache/conftool/dbconfig/20210318-080258-root.json
[08:03:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:09] <elukey>	 akosiaris: thanks, lemme know if I can help
[08:03:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:17] <elukey>	 (I was asking about the calico packages in #serviceops :)
[08:05:43] <akosiaris>	 ah, we only have them for stretch? 
[08:05:51] <akosiaris>	 remind me again, why are you targeting buster?
[08:07:20] <elukey>	 akosiaris: simply because we thought that it would have been easier than later on with prod traffic flowing, but we assumed it wouldn't have caused pain to others :) We can revert to Stretch in case
[08:08:10] <akosiaris>	 I think it's the inverse. Per the buster migration task, it's actually quite more difficult. You are pioneering a bit there
[08:08:45] <akosiaris>	 and btw, we are pondering whether for the services cluster it makes sense to skip straight to bullseye 
[08:09:01] <icinga-wm>	 RECOVERY - Check systemd state on ml-serve1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:09:07] <elukey>	 akosiaris: the pioneering part is fine, the painful part is pinging you daily to ask for guidance, this is why I said that :)
[08:09:31] <akosiaris>	 services/main/the k8s cluster up to now. We need to name it now that we have >1 I guess
[08:11:26] <akosiaris>	 one thing that you will definitely need to rebuild is our rsyslog package for buster
[08:11:43] <akosiaris>	 we have the rsyslog-kubernetes package which IIRC is not on buster
[08:11:47] <elukey>	 akosiaris: what magic did you use to make dockerd running? The lvm partitions were not created by the storage profile
[08:11:48] * akosiaris double checking
[08:12:12] <elukey>	 I thought there was a race condition in puppet but now I am confused
[08:12:15] <akosiaris>	 elukey: No magic, I just merged the puppet change you saw above
[08:12:26] <akosiaris>	 well no it's actually magic, cause I don't understand why it failed
[08:12:37] <akosiaris>	 this is bashable ^ :P
[08:12:39] <elukey>	 ahahahah
[08:13:02] <elukey>	 my impression was that docker.io was installed and dockerd started before the lvm class in profile::docker::engine
[08:13:04] <akosiaris>	 but seriously, I kind of went on a hunch seeing the diff in the catalog
[08:13:05] <elukey>	 err storage
[08:13:24] <akosiaris>	 it tried to start but failed, but that was not the issue
[08:13:36] <akosiaris>	 the problem was that puppet wasn't trying to create the docker lvm volume_group
[08:13:44] <akosiaris>	 as to why it wasn't trying to do that, there, you got me
[08:13:53] <akosiaris>	 I even got the catalog and it clearly referenced it
[08:13:56] <elukey>	 ah it wasn't even trying?
[08:14:26] <akosiaris>	 yeah, in a way that made 0 sense to me. 
[08:14:52] <akosiaris>	 at some point I thought it was because of dependencies, but no indication of that
[08:15:08] <akosiaris>	 I was probably way too tired and missed something yesterday night, but this is puzzling to say the least
[08:15:31] <elukey>	 akosiaris: sorry to re-ask, but isn't it because docker.io was installed by puppet before the lvm volumes were created in profile::docker::storage?
[08:15:45] <icinga-wm>	 RECOVERY - Check systemd state on ml-serve1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:16:05] <elukey>	 the only explanation that I can have is that your recent change caused a different puppet execution order
[08:16:13] <elukey>	 otherwise I feel lost
[08:16:42] <logmsgbot>	 !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1004.eqiad.wmnet with reason: REIMAGE
[08:16:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:27] <akosiaris>	 elukey: my feeling exactly yesterday. 
[08:17:36] <akosiaris>	 it's not better right now. but at least I got a lead now
[08:17:55] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={netbox_device_statistics,routinator} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:17:59] <wikibugs>	 (03PS1) 10Majavah: Enable CentralAuth IRC feed in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673201 (https://phabricator.wikimedia.org/T277432)
[08:18:13] <icinga-wm>	 RECOVERY - Check systemd state on ml-serve2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:18:53] <logmsgbot>	 !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1004.eqiad.wmnet with reason: REIMAGE
[08:19:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:00] <akosiaris>	 elukey: you might be right, but I should be getting some error about failed dependencies creating the lvm volumes and I did not
[08:20:19] <akosiaris>	 plus... how on earth could kubernetes-node and calico be related to the docker profile class.
[08:21:02] <icinga-wm>	 RECOVERY - DPKG on ml-serve1001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[08:21:08] <elukey>	 akosiaris: I stopped wondering why when dealing with puppet a long time ago :D
[08:21:37] <elukey>	 (due also to my ignorance about its internals, but I chose mental sanity instead)
[08:22:05] <akosiaris>	 yeah, that's where you got me. I went down the road of stitching puppet catalogs yesterday
[08:22:19] <elukey>	 thanks a lot for all the work btw :)
[08:22:22] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:22:34] <akosiaris>	 and it's clear that something was up with the catalog and it would not create the volume_group, but I am still not sure what
[08:22:42] <akosiaris>	 or how 2 hiera variables fixed it
[08:23:58] <elukey>	 maybe we could add an explicit dep between profile::docker::engine and ::storage
[08:24:50] <icinga-wm>	 RECOVERY - Check systemd state on ml-serve1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:27:48] <godog>	 !log swift eqiad-prod: less weight for ms-be[1019-1026] - T272836
[08:27:50] <akosiaris>	 it's probably the calico change that did the diff. It does rely on docker
[08:27:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:57] <stashbot>	 T272836: Decom ms-be[1019-1026] from swift - https://phabricator.wikimedia.org/T272836
[08:28:07] <akosiaris>	 whereas on version 3 it does not
[08:28:40] <elukey>	 ah good point
[08:28:54] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:29:29] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatdata-user for Cory Massaro - https://phabricator.wikimedia.org/T277692 (10JAllemandou) Thanks for letting me know @Ottomata  :) @cmassaro : Let's sync on the work you wish to accomplish, as wikitext-history is really big and I might have some h...
[08:31:40] <wikibugs>	 (03CR) 10DCausse: [C: 03+1] rdf-streaming-updater:create helmfile.d structure [deployment-charts] - 10https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) (owner: 10Mstyles)
[08:31:56] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:32:34] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 67 probes of 602 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:34:08] <wikibugs>	 10SRE, 10Patch-For-Review: Migrate irc.wikimedia.org/kraz to Buster - https://phabricator.wikimedia.org/T224579 (10fgiunchedi) >>! In T224579#6920712, @fgiunchedi wrote: > And connections from Prometheus kept piling up. AFAIK the service/exporter is not owned ATM, I've restarted the exporter but this is obviou...
[08:34:54] <icinga-wm>	 RECOVERY - Check systemd state on ml-serve2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:38:26] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 48 probes of 602 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:38:44] <icinga-wm>	 PROBLEM - Check systemd state on snapshot1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:40:18] <icinga-wm>	 RECOVERY - DPKG on ml-serve1003 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[08:40:48] <icinga-wm>	 RECOVERY - DPKG on ml-serve2001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[08:44:48] <icinga-wm>	 RECOVERY - DPKG on ml-serve1002 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[08:45:13] <wikibugs>	 10SRE, 10Patch-For-Review: Migrate irc.wikimedia.org/kraz to Buster - https://phabricator.wikimedia.org/T224579 (10Majavah)
[08:46:10] <icinga-wm>	 RECOVERY - Check systemd state on ml-serve2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:47:16] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:49:32] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:51:27] <icinga-wm>	 RECOVERY - DPKG on ml-serve2004 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[08:53:05] <icinga-wm>	 RECOVERY - DPKG on ml-serve2003 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[08:54:33] <_joe_>	 apergos: snapshot1005 is in a very bad state
[08:54:46] <_joe_>	 I can't get dmesg output and the fs is read-only
[08:56:03] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: check_systemd_state: improve alerting message [puppet] - 10https://gerrit.wikimedia.org/r/673209
[08:56:51] <_joe_>	 yeah gonna try to reboot it, but I doubt it will come up cleanly
[08:57:22] <_joe_>	 Mar 18 08:33:01 snapshot1005 kernel: [2677270.299698] sd 0:1:0:0: [sda] tag#611 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[08:57:23] <wikibugs>	 (03PS1) 10ArielGlenn: distinguish between "no wikis with batches available" and "no wikis left to run" [dumps] - 10https://gerrit.wikimedia.org/r/673210 (https://phabricator.wikimedia.org/T252396)
[08:57:35] <apergos>	 ouch
[08:57:45] <apergos>	 nothing is going on over there so I dunno why
[08:58:10] <apergos>	 sda orilly
[08:58:15] <apergos>	 well that is dead as dead all right
[08:58:19] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] check_systemd_state: improve alerting message [puppet] - 10https://gerrit.wikimedia.org/r/673209 (owner: 10Giuseppe Lavagetto)
[08:58:31] <apergos>	 it's out of warranty, the new boxes were ordered to replace snapshot1005,6,7 
[08:58:43] <apergos>	 and expected Jan 31 but no word from Dell, they are just gone
[08:58:55] <apergos>	 I saw an escalation to willy on the task overnight
[08:58:56] <_joe_>	 wut
[08:59:09] <_joe_>	 so, ok to reboot?
[08:59:10] <apergos>	 in the meantime I have a testbed host I can assign the production role to, for the next round
[08:59:25] <_joe_>	 worst that can happen is it doesn't come back
[08:59:26] <apergos>	 oh sure, actually lemme see if I can get on the host first
[08:59:34] <_joe_>	 yes you can
[08:59:42] <_joe_>	 ssh works, sudo works, that's about it :P
[09:00:33] <apergos>	 lol /usr/bin/various are broken
[09:00:38] <apergos>	 ps axuww worked :-D
[09:00:51] <apergos>	 nothing of interest happening there so lemme get off
[09:01:01] <apergos>	 reboot away
[09:02:03] <apergos>	 I'm amazed that's the one thing that alerted in icinga (the systemd unit)
[09:02:56] <_joe_>	 yeah...
[09:03:55] <apergos>	 doo dee doo dee doo
[09:04:08] <_joe_>	 so the only clean way to reboot was using systemd
[09:04:14] <_joe_>	 because it's in-memory and running
[09:04:20] <_joe_>	 take that systemd-haters
[09:04:27] <apergos>	 hahahaha
[09:04:34] <apergos>	 you were a systemd hater once
[09:04:39] <_joe_>	 nope
[09:04:41] <_joe_>	 never been
[09:04:49] <apergos>	 don't make me look in the logs
[09:04:53] <_joe_>	 !log attempted reboot of snapshot1005, read-only filesystem and probably disks are broken beyond repair
[09:05:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:03] <_joe_>	 I'm pretty sure I've never been :)
[09:05:11] <_joe_>	 you're confusing me and brandon
[09:05:25] <apergos>	 nice attempt at a save :-P :-D
[09:05:27] <_joe_>	 I might have expressed contempt towards Lennart and his attitude
[09:05:37] <apergos>	 you and brandon don't even vaguely kinda look alike
[09:05:42] <apergos>	 if it was Seddon, well...
[09:05:59] <icinga-wm>	 PROBLEM - Host snapshot1005 is DOWN: PING CRITICAL - Packet loss = 100%
[09:06:00] <_joe_>	 well we both do ramble like old men a lot
[09:06:03] <apergos>	 heh
[09:06:06] <_joe_>	 hey icinga, keep up
[09:06:06] <apergos>	 and you're both not
[09:06:21] <apergos>	 I'm an old geezer, the rest of you younguns are just wannabes :-P
[09:06:45] <apergos>	 I assume you're watching the console?
[09:06:59] <_joe_>	 no I'm not
[09:07:14] <_joe_>	 my plan was to wait a bit first, and if it doesn't come back, go to the console
[09:07:25] <apergos>	 I think there's two os disks in raid 1 (ssd)
[09:07:30] <apergos>	 oh ok, that's fine too
[09:08:10] <apergos>	 yep hw raid 1 two ssds for the os indeed
[09:08:27] <_joe_>	 hw raid and it broke? wtf
[09:08:43] <apergos>	 unclear
[09:09:10] <_joe_>	 ugh, it's an ilo?
[09:09:16] <apergos>	 these are hp proliant boxes
[09:09:19] <_joe_>	 sigh
[09:09:24] <apergos>	 the three we are replacing, incl this one
[09:13:22] <_joe_>	 !log hard reboot of snapshot1005
[09:13:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:04] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] "Couple of inline comments" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/673209 (owner: 10Giuseppe Lavagetto)
[09:15:55] <_joe_>	 apergos: it looks like the failure is in the bios controller
[09:16:04] <apergos>	 lovely
[09:16:19] <apergos>	 any message that can be screenshotted or copy-pastad into a task?
[09:16:55] <_joe_>	 it's booting now though
[09:17:00] <apergos>	 er wut
[09:17:09] <apergos>	 how is it booting if the...   >_<
[09:17:12] <_joe_>	 yeah the message was about a *previous* failure
[09:17:17] <apergos>	 oh. hrm
[09:17:21] <_joe_>	 I think the hard reset might have done the trick
[09:17:25] <apergos>	 welllll
[09:17:35] <_joe_>	 [FAILED] Failed to mount /mnt/dumpsdata.
[09:17:35] <apergos>	 maybe I'll just turn that into the testbed host anyways to be safe
[09:17:37] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:17:47] <apergos>	 huh
[09:17:49] <icinga-wm>	 RECOVERY - Host snapshot1005 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[09:17:56] <_joe_>	 please take a look yourself at this point :)
[09:18:35] <apergos>	 why would it fail to mount a frickin nfs share
[09:19:00] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: check_systemd_state: improve alerting message (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/673209 (owner: 10Giuseppe Lavagetto)
[09:19:43] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:20:37] <apergos>	 it's there now :-/
[09:20:43] <icinga-wm>	 RECOVERY - Check systemd state on snapshot1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:24:05] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:24:06] <apergos>	 mount.nfs: Failed to resolve server dumpsdata1003.eqiad.wmnet: Name or service not known
[09:24:08] <apergos>	 ok really?
[09:24:53] <apergos>	 but two minutes later it was ok during the puppet run after reboot
[09:25:35] <wikibugs>	 (03CR) 10Kormat: mariadb: Promote db1136 to s7 master (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/673195 (https://phabricator.wikimedia.org/T274336) (owner: 10Marostegui)
[09:25:36] <_joe_>	 apergos: i guess that systemd unit needs to add a dependency on network.target maybe?
[09:25:43] <apergos>	 ah maybe so
[09:26:23] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:26:49] <wikibugs>	 (03CR) 10Marostegui: [C: 04-2] mariadb: Promote db1136 to s7 master (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/673195 (https://phabricator.wikimedia.org/T274336) (owner: 10Marostegui)
[09:27:13] <apergos>	 no indication in the logs of how it died, nothing on the reboot to indicate issues that I can see
[09:27:20] <apergos>	 still going to switch it out for another host though
[09:27:20] <wikibugs>	 (03CR) 10Kormat: [C: 03+1] wmnet: Update s7-master cname [dns] - 10https://gerrit.wikimedia.org/r/673196 (https://phabricator.wikimedia.org/T274336) (owner: 10Marostegui)
[09:27:31] <wikibugs>	 (03CR) 10Kormat: [C: 03+1] mariadb: Promote db1136 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/673195 (https://phabricator.wikimedia.org/T274336) (owner: 10Marostegui)
[09:27:38] <wikibugs>	 (03PS2) 10Marostegui: mariadb: Promote db1136 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/673195 (https://phabricator.wikimedia.org/T274336)
[09:32:38] <wikibugs>	 (03CR) 10Kormat: [C: 03+1] mariadb: Promote db1136 to s7 master (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/673195 (https://phabricator.wikimedia.org/T274336) (owner: 10Marostegui)
[09:32:57] <apergos>	 not really a systemd unit, I think it's just from the nfs mount being in pass 0 in /etc/fstab (which is likely my fault some way or other)... still looking around
[09:34:28] <apergos>	 hmm no. meh
[09:35:27] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:37:39] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:37:40] <wikibugs>	 10SRE, 10Prod-Kubernetes, 10SRE-tools, 10serviceops: Write a cookbook to set a k8s cluster in maintenance mode - https://phabricator.wikimedia.org/T277677 (10JMeybohm)
[09:44:24] <apergos>	 lol snapshot1005 is already the testbed. so I guess I don't have to move it :-D
[09:46:06] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] P:debmonitor: Make profile compatible with cloud environments [puppet] - 10https://gerrit.wikimedia.org/r/673050 (owner: 10Jbond)
[09:46:15] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] hiera - cloud: add config for pki-debmon [puppet] - 10https://gerrit.wikimedia.org/r/673048 (owner: 10Jbond)
[09:46:34] <wikibugs>	 10Puppet, 10SRE, 10Beta-Cluster-Infrastructure, 10Technical-Debt, 10Tracking-Neverending: Minimize differences between beta and production (Tracking) - https://phabricator.wikimedia.org/T87220 (10Majavah)
[09:47:08] <wikibugs>	 (03PS1) 10Marostegui: install_server: Reimage db1181 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/673217 (https://phabricator.wikimedia.org/T275633)
[09:47:35] <wikibugs>	 10SRE, 10observability: rsyslog-kubernetes missing in buster-wikimedia - https://phabricator.wikimedia.org/T277739 (10elukey)
[09:47:45] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] install_server: Reimage db1181 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/673217 (https://phabricator.wikimedia.org/T275633) (owner: 10Marostegui)
[09:48:44] <wikibugs>	 (03PS1) 10Kormat: hiera: Reenable notifications for db2089+db2137 [puppet] - 10https://gerrit.wikimedia.org/r/673219 (https://phabricator.wikimedia.org/T277632)
[09:49:32] <wikibugs>	 10SRE, 10Prod-Kubernetes, 10SRE-tools: Support downtiming services in our cookbooks - https://phabricator.wikimedia.org/T277740 (10JMeybohm) p:05Triage→03Medium
[09:49:42] <wikibugs>	 (03CR) 10Marostegui: "The other two will be handled by jaime?" [puppet] - 10https://gerrit.wikimedia.org/r/673219 (https://phabricator.wikimedia.org/T277632) (owner: 10Kormat)
[09:50:09] <wikibugs>	 (03CR) 10Kormat: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/673219 (https://phabricator.wikimedia.org/T277632) (owner: 10Kormat)
[09:50:19] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] hiera: Reenable notifications for db2089+db2137 [puppet] - 10https://gerrit.wikimedia.org/r/673219 (https://phabricator.wikimedia.org/T277632) (owner: 10Kormat)
[09:51:09] <icinga-wm>	 PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[09:51:16] <wikibugs>	 10SRE, 10Prod-Kubernetes, 10SRE-tools, 10serviceops: Create a cookbook for depooling one or all services from one kubernetes cluster - https://phabricator.wikimedia.org/T260663 (10JMeybohm) The cookbook does not seem to work (tried during the kubernetes codfw reinit): * It did not allow multiple services a...
[09:51:51] <icinga-wm>	 PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[09:55:03] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to sites from Google Search Console - https://phabricator.wikimedia.org/T277602 (10Volans) @CGlenn access granted. Please confirm that it's working and that you've read the following notice:  ` If you believe your Google account delegated for Google Webmaster Too...
[09:56:50] <wikibugs>	 10SRE, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Services, 10Service-deployment-requests: [DRAFT] New Service Request tegola - https://phabricator.wikimedia.org/T274390 (10Jgiannelos)
[09:57:20] <wikibugs>	 (03PS1) 10Jcrespo: dbbackups: Move s5 from db2139 to db2101 [puppet] - 10https://gerrit.wikimedia.org/r/673220 (https://phabricator.wikimedia.org/T277632)
[09:57:40] <wikibugs>	 (03PS2) 10Jcrespo: dbbackups: Move s5 from db2139 to db2101 [puppet] - 10https://gerrit.wikimedia.org/r/673220 (https://phabricator.wikimedia.org/T277632)
[09:57:55] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: check_systemd_state: improve alerting message [puppet] - 10https://gerrit.wikimedia.org/r/673209
[09:57:57] <wikibugs>	 (03CR) 10Kormat: [C: 03+2] hiera: Reenable notifications for db2089+db2137 [puppet] - 10https://gerrit.wikimedia.org/r/673219 (https://phabricator.wikimedia.org/T277632) (owner: 10Kormat)
[09:58:04] <wikibugs>	 (03PS3) 10Jcrespo: dbbackups: Move s5 from db2139 to db2101 [puppet] - 10https://gerrit.wikimedia.org/r/673220 (https://phabricator.wikimedia.org/T277632)
[10:00:04] <wikibugs>	 (03CR) 10Jcrespo: "FYI" [puppet] - 10https://gerrit.wikimedia.org/r/673220 (https://phabricator.wikimedia.org/T277632) (owner: 10Jcrespo)
[10:00:04] <jouncebot>	 mvolz: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Citoid /  Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210318T1000).
[10:00:07] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] check_systemd_state: improve alerting message [puppet] - 10https://gerrit.wikimedia.org/r/673209 (owner: 10Giuseppe Lavagetto)
[10:04:18] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: check_systemd_state: improve alerting message [puppet] - 10https://gerrit.wikimedia.org/r/673209
[10:04:44] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] dbbackups: Move s5 from db2139 to db2101 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/673220 (https://phabricator.wikimedia.org/T277632) (owner: 10Jcrespo)
[10:06:34] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] "I was aware." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/673220 (https://phabricator.wikimedia.org/T277632) (owner: 10Jcrespo)
[10:11:16] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatdata-user for Cory Massaro - https://phabricator.wikimedia.org/T277692 (10Volans)
[10:13:37] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatdata-user for Cory Massaro - https://phabricator.wikimedia.org/T277692 (10Volans) p:05Triage→03Medium @cmassaro I can't find your signature on L3, could you please double check you've signed it?
[10:13:55] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: check_systemd_state: improve alerting message (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/673209 (owner: 10Giuseppe Lavagetto)
[10:15:18] <wikibugs>	 (03PS1) 10Jbond: systemd::monitor: create a generic nrpe systemd check [puppet] - 10https://gerrit.wikimedia.org/r/673221
[10:16:49] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] systemd::monitor: create a generic nrpe systemd check [puppet] - 10https://gerrit.wikimedia.org/r/673221 (owner: 10Jbond)
[10:19:31] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-user for Cory Massaro - https://phabricator.wikimedia.org/T277692 (10Volans)
[10:19:39] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Cory Massaro - https://phabricator.wikimedia.org/T277692 (10Volans)
[10:19:45] <wikibugs>	 10SRE, 10observability: rsyslog-kubernetes missing in buster-wikimedia - https://phabricator.wikimedia.org/T277739 (10akosiaris) > After a chat with Filippo, IIUC 8.2008.0-1 is used only on centrallog nodes (hence the component) but we might want to use 8.19011 provided on Buster and add the custom bits for rs...
[10:23:20] <icinga-wm>	 PROBLEM - Check systemd state on ml-serve1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:25:10] <wikibugs>	 (03PS1) 10Klausman: helm: Add calico/cordns/values config for ML k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/673222
[10:27:48] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] check_systemd_state: improve alerting message [puppet] - 10https://gerrit.wikimedia.org/r/673209 (owner: 10Giuseppe Lavagetto)
[10:28:02] <icinga-wm>	 PROBLEM - mysqld processes on db2101 is CRITICAL: PROCS CRITICAL: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[10:28:15] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] ml_k8s::worker: Use new kubernetes/calico [puppet] - 10https://gerrit.wikimedia.org/r/673199 (owner: 10Alexandros Kosiaris)
[10:28:49] <marostegui>	 jynus: ^ is that one of the backup ones?
[10:29:45] <wikibugs>	 (03PS13) 10Elukey: hadoop: add a profile to deploy the capacity scheduler's settings [puppet] - 10https://gerrit.wikimedia.org/r/672373 (https://phabricator.wikimedia.org/T277062)
[10:29:47] <wikibugs>	 (03PS15) 10Elukey: hadoop: set the Yarn capacity scheduler for the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/672654 (https://phabricator.wikimedia.org/T277062)
[10:30:04] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] helm: Add calico/cordns/values config for ML k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/673222 (owner: 10Klausman)
[10:30:08] <marostegui>	 yep, it is a backup source
[10:30:36] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:30:43] <jynus>	 oh, I guess there is a race condition with alert1001
[10:30:45] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] WIP: Update cxserver to 2021-03-15-131520-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/672386 (https://phabricator.wikimedia.org/T271711) (owner: 10KartikMistry)
[10:30:58] <jynus>	 I will run puppet on icinga so it is disabled
[10:31:43] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Cory Massaro - https://phabricator.wikimedia.org/T277692 (10Volans)
[10:32:11] <kormat>	 jynus: in case it's relevant: once you add `profile::base::notifications: disabled`, you need to run puppet on the affected machine once, and then on alert1001 *twice*
[10:32:29] <jynus>	 yeah, it is not great
[10:32:31] <kormat>	 (don't ask me _why_. i've just discovered this empherically)
[10:32:51] <kormat>	 *empirically
[10:33:03] <jynus>	 not sure if you have to run it twice or just "wait an indeterminate number of minutes between puppet runs"
[10:33:09] <volans>	 https://json-schema.org/understanding-json-schema/structuring.html
[10:33:10] <jynus>	 I guess for the backend to get updated
[10:33:19] <volans>	 ops, wrong tab :D
[10:33:32] <wikibugs>	 (03PS2) 10Jbond: systemd::monitor: create a generic nrpe systemd check [puppet] - 10https://gerrit.wikimedia.org/r/673221
[10:33:43] <kormat>	 jynus: i run it twice in succession, and that works.
[10:33:50] <jynus>	 volans, the right tab is: https://learnxinyminutes.com/docs/yaml/
[10:34:19] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28661/console" [puppet] - 10https://gerrit.wikimedia.org/r/672654 (https://phabricator.wikimedia.org/T277062) (owner: 10Elukey)
[10:34:28] <jynus>	 kormat, sure. I just think there is a wait, for example, on the reimaging scripts like that
[10:34:57] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:35:39] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] helm: Add calico/cordns/values config for ML k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/673222 (owner: 10Klausman)
[10:35:55] <wikibugs>	 (03CR) 10Elukey: "All right all puppet-style comments resolved, now I am going to only modify the capacity scheduler settings :)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/672373 (https://phabricator.wikimedia.org/T277062) (owner: 10Elukey)
[10:36:58] <wikibugs>	 (03Merged) 10jenkins-bot: helm: Add calico/cordns/values config for ML k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/673222 (owner: 10Klausman)
[10:37:03] <wikibugs>	 (03PS1) 10Volans: admin: add apine to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/673223 (https://phabricator.wikimedia.org/T277692)
[10:37:16] <wikibugs>	 (03CR) 10Klausman: [V: 03+2 C: 03+2] helm: Add calico/cordns/values config for ML k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/673222 (owner: 10Klausman)
[10:37:42] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "Waiting for L3 to be signed" [puppet] - 10https://gerrit.wikimedia.org/r/673223 (https://phabricator.wikimedia.org/T277692) (owner: 10Volans)
[10:37:59] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 9): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28662/console" [puppet] - 10https://gerrit.wikimedia.org/r/673221 (owner: 10Jbond)
[10:44:58] <wikibugs>	 10SRE, 10observability: rsyslog-kubernetes missing in buster-wikimedia - https://phabricator.wikimedia.org/T277739 (10elukey) I am a little confused from the debian/stretch-wikimedia branch of the rsyslog repo, because the debian changelog seems to mention `8.38.0-1~bpo9+1wmf1` (it corresponds to a commit from...
[10:48:37] <wikibugs>	 10SRE, 10observability: rsyslog-kubernetes missing in buster-wikimedia - https://phabricator.wikimedia.org/T277739 (10elukey)
[10:56:57] <wikibugs>	 (03PS1) 10Kosta Harlan: Remove variant C from list of valid variants [extensions/GrowthExperiments] (wmf/1.36.0-wmf.35) - 10https://gerrit.wikimedia.org/r/673107 (https://phabricator.wikimedia.org/T277727)
[10:57:34] <wikibugs>	 (03PS1) 10Gergő Tisza: GrowthExperiments: set $wgGEHomepageNewAccountVariants to D only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673224 (https://phabricator.wikimedia.org/T277727)
[10:57:51] <wikibugs>	 (03PS3) 10Jbond: systemd::monitor: create a generic nrpe systemd check [puppet] - 10https://gerrit.wikimedia.org/r/673221
[10:57:53] <wikibugs>	 (03PS1) 10Jbond: systemd::service: Add ability to monitor services [puppet] - 10https://gerrit.wikimedia.org/r/673225
[11:00:04] <jouncebot>	 Amir1, Lucas_WMDE, awight, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) European mid-day backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210318T1100).
[11:00:04] <jouncebot>	 Majavah: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:12] <Urbanecm>	 i can deploy today
[11:00:15] <Majavah>	 here, beta only patch
[11:00:18] <Urbanecm>	 kostajh: around?
[11:00:28] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] systemd::service: Add ability to monitor services [puppet] - 10https://gerrit.wikimedia.org/r/673225 (owner: 10Jbond)
[11:00:35] <kostajh>	 Urbanecm: indeed
[11:00:44] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Enable CentralAuth IRC feed in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673201 (https://phabricator.wikimedia.org/T277432) (owner: 10Majavah)
[11:00:45] <kostajh>	 just finished updating the calendar with the two patches
[11:01:00] <wikibugs>	 (03PS2) 10Mvolz: Update Zotero to 2021-03-12-015945-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/673011
[11:01:02] <wikibugs>	 (03PS2) 10Jbond: systemd::service: Add ability to monitor services [puppet] - 10https://gerrit.wikimedia.org/r/673225
[11:01:04] <wikibugs>	 (03PS2) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/672723 (owner: 10PipelineBot)
[11:01:06] <Urbanecm>	 Majavah: done, it will apply automagically within 30 minutes. Shout at me if it doesn't ;)
[11:01:14] <Majavah>	 Urbanecm: sure, thanks!
[11:01:23] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Remove variant C from list of valid variants [extensions/GrowthExperiments] (wmf/1.36.0-wmf.35) - 10https://gerrit.wikimedia.org/r/673107 (https://phabricator.wikimedia.org/T277727) (owner: 10Kosta Harlan)
[11:01:43] <wikibugs>	 (03Merged) 10jenkins-bot: Enable CentralAuth IRC feed in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673201 (https://phabricator.wikimedia.org/T277432) (owner: 10Majavah)
[11:01:51] <Urbanecm>	 kostajh: please remind me, does the config depend on backport?
[11:02:56] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] systemd::service: Add ability to monitor services [puppet] - 10https://gerrit.wikimedia.org/r/673225 (owner: 10Jbond)
[11:03:22] <wikibugs>	 (03PS3) 10Jbond: systemd::service: Add ability to monitor services [puppet] - 10https://gerrit.wikimedia.org/r/673225
[11:03:42] <kostajh>	 Urbanecm: it does not
[11:03:58] <Urbanecm>	 kostajh: okay, let's do the config first then
[11:04:04] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] GrowthExperiments: set $wgGEHomepageNewAccountVariants to D only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673224 (https://phabricator.wikimedia.org/T277727) (owner: 10Gergő Tisza)
[11:04:31] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28665/console" [puppet] - 10https://gerrit.wikimedia.org/r/673225 (owner: 10Jbond)
[11:05:13] <wikibugs>	 (03CR) 10Mvolz: [C: 03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/672723 (owner: 10PipelineBot)
[11:05:21] <wikibugs>	 (03Merged) 10jenkins-bot: GrowthExperiments: set $wgGEHomepageNewAccountVariants to D only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673224 (https://phabricator.wikimedia.org/T277727) (owner: 10Gergő Tisza)
[11:05:52] <wikibugs>	 (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/669967 (owner: 10PipelineBot)
[11:06:02] <wikibugs>	 (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/670839 (owner: 10PipelineBot)
[11:06:14] <Urbanecm>	 kostajh: can you test on mwdebug1001 please?
[11:06:23] <kostajh>	 Urbanecm: the config patch?
[11:06:26] <Urbanecm>	 yup
[11:06:28] <Urbanecm>	 if possible
[11:06:37] <Majavah>	 Urbanecm: beta scaps seem to be stuck looking at https://integration.wikimedia.org/ci/job/beta-scap-eqiad/ :-/
[11:06:38] <wikibugs>	 (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/672723 (owner: 10PipelineBot)
[11:06:48] <Urbanecm>	 Majavah: lemme restart the job
[11:06:49] <kostajh>	 Urbanecm: alright
[11:07:51] <Urbanecm>	 Majavah: restarted
[11:08:22] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/CommonSettings.php: NOOP: e7f5eac: Enable CentralAuth IRC feed in beta cluster (T277432) (duration: 01m 12s)
[11:08:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:08:40] <stashbot>	 T277432: Enable CentralAuth IRC feed in beta cluster - https://phabricator.wikimedia.org/T277432
[11:09:10] <Majavah>	 Urbanecm: thanks! does that require some Jenkins UI access or could I have done that myself on deployment-deploy01 somehow?
[11:09:36] <Urbanecm>	 Majavah: it requires access in jenkins UI
[11:09:45] <Urbanecm>	 (precisely, be in https://ldap.toolforge.org/group/nda IIRC)
[11:09:45] <kostajh>	 Urbanecm: seems fine
[11:09:50] <Urbanecm>	 thanks kostajh, syncing
[11:10:00] <Urbanecm>	 Majavah: (or in wmf LDAP group)
[11:10:27] <Majavah>	 Urbanecm: okay, thanks
[11:10:41] <Amir1>	 If there's time, can this be done too? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/673189
[11:11:04] <Amir1>	 I can check it on mwdebug to make sure the wiki doesn't break
[11:11:35] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 0005676e704cad907655a4a0bca7bd2164714b1c: GrowthExperiments: set $wgGEHomepageNewAccountVariants to D only (T277727) (duration: 01m 10s)
[11:11:38] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'citoid' for release 'staging' .
[11:11:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:11:43] <stashbot>	 T277727: hewiki users seem to get variant C on desktop which breaks the UI - https://phabricator.wikimedia.org/T277727
[11:11:47] <Urbanecm>	 config done kostajh 
[11:11:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:11:51] <wikibugs>	 (03PS3) 10Mvolz: Update Zotero to 2021-03-12-015945-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/673011
[11:11:59] <kostajh>	 Urbanecm: cool :) 
[11:12:16] <Urbanecm>	 Amir1: sure
[11:12:22] <wikibugs>	 (03PS7) 10David Caro: wmcs.toolforge.etcd: Added cookbook to depool and remove a node [cookbooks] - 10https://gerrit.wikimedia.org/r/667183 (https://phabricator.wikimedia.org/T274497)
[11:12:24] <wikibugs>	 (03PS5) 10David Caro: wmcs.toolforge: add cookbook to create an instance of a prefix [cookbooks] - 10https://gerrit.wikimedia.org/r/667214 (https://phabricator.wikimedia.org/T274497)
[11:12:26] <wikibugs>	 (03PS3) 10David Caro: wmcs.toolforge: add cookbook to add a new etcd node [cookbooks] - 10https://gerrit.wikimedia.org/r/668090 (https://phabricator.wikimedia.org/T274497)
[11:12:54] <wikibugs>	 (03Merged) 10jenkins-bot: Remove variant C from list of valid variants [extensions/GrowthExperiments] (wmf/1.36.0-wmf.35) - 10https://gerrit.wikimedia.org/r/673107 (https://phabricator.wikimedia.org/T277727) (owner: 10Kosta Harlan)
[11:13:11] <Amir1>	 Thanks!
[11:14:27] <logmsgbot>	 !log mvolz@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'citoid' for release 'production' .
[11:14:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:14:41] <Urbanecm>	 kostajh: backport now at mwdebug1001
[11:14:43] * Urbanecm testing as well
[11:14:48] <kostajh>	 Urbanecm: thx, looking
[11:14:56] <wikibugs>	 (03PS1) 10Klausman: helm: Make ML k8s clusters visible to helm [deployment-charts] - 10https://gerrit.wikimedia.org/r/673227 (https://phabricator.wikimedia.org/T272918)
[11:15:44] <kostajh>	 Urbanecm: hmm, I'm not seeing the SE module for a user who has their variant set to C
[11:15:57] <Urbanecm>	 kostajh: my test acc from yesterday has this https://usercontent.irccloud-cdn.com/file/PTF84Zae/image.png
[11:16:27] <logmsgbot>	 !log mvolz@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'citoid' for release 'production' .
[11:16:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:16:36] <Urbanecm>	 and the same appears when I run `new mw.Api().saveOption('growthexperiments-homepage-variant', 'C').done( function() { window.location.reload() });` again
[11:16:44] <kostajh>	 Urbanecm: that looks good but the account I created after you synced the config patch has https://imgur.com/a/K1AWDUE
[11:16:55] * kostajh looks at logs
[11:17:07] <Urbanecm>	 kostajh: that looks like frwiki?
[11:17:21] <kostajh>	 Urbanecm: oops. right
[11:17:28] <Urbanecm>	 you need to test at hewiki, because that's the only Wikipedia we target that's in group1
[11:17:34] <kostajh>	 wrong group
[11:17:44] <kostajh>	 yep looking again :)
[11:17:52] <Urbanecm>	 thx
[11:18:19] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:18:42] <kostajh>	 Urbanecm: yeah seems good to me
[11:18:45] <Urbanecm>	 cool
[11:18:46] <Urbanecm>	 let's sync
[11:19:03] <wikibugs>	 10SRE, 10Services, 10Patch-For-Review, 10Performance-Team (Radar), 10Service-deployment-requests: New Service Request: xhgui - https://phabricator.wikimedia.org/T277483 (10akosiaris) Hi,  Thanks for this request.  Couple of quick questions and pointers  * Is xhgui stateless? More specifically ** Does xhg...
[11:19:16] <wikibugs>	 (03CR) 10Gergő Tisza: [C: 03+1] flaggedrevs: Disable multiple dimensions in hewikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673189 (owner: 10Ladsgroup)
[11:19:23] <wikibugs>	 (03PS1) 10Jbond: P:tcpircbot: drop monitoring of service [puppet] - 10https://gerrit.wikimedia.org/r/673228
[11:19:25] <wikibugs>	 (03PS1) 10Jbond: P:tcpircbot: delete a previoulsy absented resource [puppet] - 10https://gerrit.wikimedia.org/r/673229
[11:19:27] <wikibugs>	 (03PS1) 10Jbond: P:tcpircbot: fix minor style violations [puppet] - 10https://gerrit.wikimedia.org/r/673230
[11:19:39] <wikibugs>	 (03PS2) 10Urbanecm: flaggedrevs: Disable multiple dimensions in hewikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673189 (owner: 10Ladsgroup)
[11:19:43] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] flaggedrevs: Disable multiple dimensions in hewikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673189 (owner: 10Ladsgroup)
[11:20:08] <Urbanecm>	 kostajh: should we backport to wmf.34 as well, to fix the frwiki blank homepage?
[11:20:27] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:20:33] <wikibugs>	 (03CR) 10Jbond: "i think this looks fine however i think we should drop this check in favour of the generic check_systemd_state check, see https://gerrit.w" [puppet] - 10https://gerrit.wikimedia.org/r/673077 (owner: 10CRusnov)
[11:20:34] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized php-1.36.0-wmf.35/extensions/GrowthExperiments/includes/HomepageHooks.php: 3b2aa1aa28e9d204f32ae937a84ec211137cbb2e: Remove variant C from list of valid variants (T277727) (duration: 01m 09s)
[11:20:41] <Urbanecm>	 anyway, live for wmf.35
[11:20:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:20:44] <stashbot>	 T277727: hewiki users seem to get variant C on desktop which breaks the UI - https://phabricator.wikimedia.org/T277727
[11:20:51] <wikibugs>	 (03Merged) 10jenkins-bot: flaggedrevs: Disable multiple dimensions in hewikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673189 (owner: 10Ladsgroup)
[11:21:13] <Urbanecm>	 Amir1: pulled ^^ onto mwdebug1001
[11:21:30] <Amir1>	 ack
[11:22:00] <wikibugs>	 (03CR) 10Mvolz: [C: 03+2] Update Zotero to 2021-03-12-015945-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/673011 (owner: 10Mvolz)
[11:22:30] <wikibugs>	 10SRE, 10Services, 10Service-deployment-requests, 10artificial-intelligence: New Service Request 'open_nsfw' - https://phabricator.wikimedia.org/T250110 (10akosiaris) Hi,  Is this still something that might happen? I don't see any activity on this task for the last 5 months. Note that it will require a per...
[11:23:14] <Amir1>	 Urbanecm: looks good, please proceed
[11:23:18] <Urbanecm>	 syncing
[11:23:23] <wikibugs>	 (03Merged) 10jenkins-bot: Update Zotero to 2021-03-12-015945-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/673011 (owner: 10Mvolz)
[11:23:33] <Majavah>	 Urbanecm: patch was finally synced to beta and seems to be working, thanks
[11:23:51] <Urbanecm>	 cool Majavah 
[11:24:57] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/flaggedrevs.php: 896c9f019b17d1ad3a1589d377158ca2fb91ebaa: flaggedrevs: Disable multiple dimensions in hewikisource (duration: 01m 09s)
[11:25:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:25:05] <Urbanecm>	 Amir1: done
[11:25:10] <Urbanecm>	 anything else?
[11:25:14] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'zotero' for release 'staging' .
[11:25:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:25:27] <Amir1>	 nah, thanks!
[11:25:33] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/654336 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[11:25:49] <Urbanecm>	 cool :)
[11:27:01] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/662765 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[11:27:25] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] check_systemd_state: improve alerting message [puppet] - 10https://gerrit.wikimedia.org/r/673209 (owner: 10Giuseppe Lavagetto)
[11:31:24] <wikibugs>	 (03CR) 10Jbond: check_keystone_roles.py: Port to Python 3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/670925 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[11:32:42] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: check_systemd_state: fix variable to format [puppet] - 10https://gerrit.wikimedia.org/r/673234
[11:32:46] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] wmcs.toolforge.etcd: Added cookbook to depool and remove a node [cookbooks] - 10https://gerrit.wikimedia.org/r/667183 (https://phabricator.wikimedia.org/T274497) (owner: 10David Caro)
[11:32:52] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: check_systemd_state: fix variable to format [puppet] - 10https://gerrit.wikimedia.org/r/673234
[11:32:55] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] wmcs.toolforge: add cookbook to add a new etcd node [cookbooks] - 10https://gerrit.wikimedia.org/r/668090 (https://phabricator.wikimedia.org/T274497) (owner: 10David Caro)
[11:32:57] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] wmcs.toolforge: add cookbook to create an instance of a prefix [cookbooks] - 10https://gerrit.wikimedia.org/r/667214 (https://phabricator.wikimedia.org/T274497) (owner: 10David Caro)
[11:33:24] <wikibugs>	 (03CR) 10Jbond: wmcs-spreadcheck.py: Port to Python 3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/670928 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[11:33:47] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: check_systemd_state: fix variable to format [puppet] - 10https://gerrit.wikimedia.org/r/673234
[11:34:11] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] check_systemd_state: fix variable to format [puppet] - 10https://gerrit.wikimedia.org/r/673234 (owner: 10Giuseppe Lavagetto)
[11:34:11] <logmsgbot>	 !log mvolz@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'zotero' for release 'production' .
[11:34:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:35:35] <icinga-wm>	 RECOVERY - Check systemd state on ml-serve2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:36:58] <wikibugs>	 (03CR) 10Jbond: wmcs-webproxy.py: Port to Python 3 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/670933 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[11:37:24] <logmsgbot>	 !log mvolz@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'zotero' for release 'production' .
[11:37:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:41:27] <wikibugs>	 (03CR) 10Jbond: wmcs-webproxy.py: Port to Python 3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/670933 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[11:42:37] <icinga-wm>	 PROBLEM - Check systemd state on ml-serve2002 is CRITICAL: CRITICAL - degraded: The following units failed: docker.service,docker.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:45:28] <wikibugs>	 (03PS1) 10Volans: openldap: improve cross-validate-accounts [puppet] - 10https://gerrit.wikimedia.org/r/673241
[11:46:50] <wikibugs>	 (03CR) 10Volans: "Tested with my user and the WMCS key and it printed:" [puppet] - 10https://gerrit.wikimedia.org/r/673241 (owner: 10Volans)
[11:47:15] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:52:30] <wikibugs>	 (03PS14) 10Elukey: hadoop: add a profile to deploy the capacity scheduler's settings [puppet] - 10https://gerrit.wikimedia.org/r/672373 (https://phabricator.wikimedia.org/T277062)
[11:52:31] <wikibugs>	 (03PS16) 10Elukey: hadoop: set the Yarn capacity scheduler for the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/672654 (https://phabricator.wikimedia.org/T277062)
[11:53:27] <wikibugs>	 (03CR) 10Jbond: "lgtm but i there are a few dependencies between required and not required params" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/673241 (owner: 10Volans)
[11:53:49] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:57:30] <wikibugs>	 (03CR) 10Volans: "replies inline" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/673241 (owner: 10Volans)
[11:57:48] <wikibugs>	 (03PS1) 10Jcrespo: dbbackups: Reenable notifications on db2139 after removing s5 instance [puppet] - 10https://gerrit.wikimedia.org/r/673246 (https://phabricator.wikimedia.org/T277632)
[11:58:02] <wikibugs>	 (03PS2) 10Jcrespo: dbbackups: Reenable notifications on db2139 after removing s5 instance [puppet] - 10https://gerrit.wikimedia.org/r/673246 (https://phabricator.wikimedia.org/T277632)
[11:58:33] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] dbbackups: Reenable notifications on db2139 after removing s5 instance [puppet] - 10https://gerrit.wikimedia.org/r/673246 (https://phabricator.wikimedia.org/T277632) (owner: 10Jcrespo)
[11:58:37] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] dbbackups: Reenable notifications on db2139 after removing s5 instance [puppet] - 10https://gerrit.wikimedia.org/r/673246 (https://phabricator.wikimedia.org/T277632) (owner: 10Jcrespo)
[11:59:28] <wikibugs>	 (03PS3) 10Jcrespo: dbbackups: Reenable notifications on db2139 after removing s5 instance [puppet] - 10https://gerrit.wikimedia.org/r/673246 (https://phabricator.wikimedia.org/T277632)
[12:00:03] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] dbbackups: Reenable notifications on db2139 after removing s5 instance [puppet] - 10https://gerrit.wikimedia.org/r/673246 (https://phabricator.wikimedia.org/T277632) (owner: 10Jcrespo)
[12:00:10] <wikibugs>	 (03PS4) 10Jcrespo: dbbackups: Reenable notifications on db2139 after removing s5 instance [puppet] - 10https://gerrit.wikimedia.org/r/673246 (https://phabricator.wikimedia.org/T277632)
[12:00:13] <wikibugs>	 (03CR) 10Jcrespo: [V: 03+2 C: 03+2] dbbackups: Reenable notifications on db2139 after removing s5 instance [puppet] - 10https://gerrit.wikimedia.org/r/673246 (https://phabricator.wikimedia.org/T277632) (owner: 10Jcrespo)
[12:01:55] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/655743 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[12:03:43] <wikibugs>	 (03CR) 10Jbond: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/670937 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[12:03:48] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] proxylistener.py: Port to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/670937 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[12:05:59] <wikibugs>	 (03PS1) 10Majavah: swift: compare kernel version directly [puppet] - 10https://gerrit.wikimedia.org/r/673250 (https://phabricator.wikimedia.org/T276179)
[12:06:57] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Cory Massaro - https://phabricator.wikimedia.org/T277692 (10Volans) a:05Ottomata→03cmassaro
[12:07:54] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm few minor nits" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/670938 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[12:10:40] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] pybal-eval-check.py: Port to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/670952 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[12:10:57] <wikibugs>	 10SRE, 10Services, 10Service-deployment-requests, 10artificial-intelligence: New Service Request 'open_nsfw' - https://phabricator.wikimedia.org/T250110 (10Chtnnh) Hello!  Yes, we would love to have this service deployed. Although, over the course of the last 5 months, we have developed a newer iteration o...
[12:16:33] <wikibugs>	 (03CR) 10Jbond: "ack thanks for the info lgtm" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/673241 (owner: 10Volans)
[12:21:34] <wikibugs>	 (03CR) 10Majavah: "cc'ing people listed on todays puppet request window, requesting review instead of cherrypicking on beta (recommended by https://wikitech." [puppet] - 10https://gerrit.wikimedia.org/r/673250 (https://phabricator.wikimedia.org/T276179) (owner: 10Majavah)
[12:31:59] <icinga-wm>	 RECOVERY - mediawiki originals uploads -hourly- for eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[12:33:01] <icinga-wm>	 RECOVERY - mediawiki originals uploads -hourly- for codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[12:34:23] <wikibugs>	 10SRE, 10Prod-Kubernetes, 10SRE-tools: Support downtiming services in our cookbooks - https://phabricator.wikimedia.org/T277740 (10Volans) @JMeybohm thanks for the task, this is surely something we want to add support for.  There's also a catch that I'm not sure how to solve right now, because the since serv...
[12:39:52] <wikibugs>	 (03PS2) 10Jbond: swift: compare kernel version directly [puppet] - 10https://gerrit.wikimedia.org/r/673250 (https://phabricator.wikimedia.org/T276179) (owner: 10Majavah)
[12:40:10] <jbond42>	 Majavah: you ok for me to merge ^^^ now or do you want to wait for the window?
[12:42:47] <wikibugs>	 (03CR) 10Jbond: "fyi i made a few minor changes" [puppet] - 10https://gerrit.wikimedia.org/r/673250 (https://phabricator.wikimedia.org/T276179) (owner: 10Majavah)
[12:42:51] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] swift: compare kernel version directly [puppet] - 10https://gerrit.wikimedia.org/r/673250 (https://phabricator.wikimedia.org/T276179) (owner: 10Majavah)
[12:49:17] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] 6.3.2: create 6.3.2 release [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/670483 (owner: 10Jbond)
[12:53:15] <wikibugs>	 (03PS1) 10Ladsgroup: languageLabelDescriptionAliases: use getLanguageNameByCode [extensions/Wikibase] (wmf/1.36.0-wmf.35) - 10https://gerrit.wikimedia.org/r/673108 (https://phabricator.wikimedia.org/T275611)
[12:54:15] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] "UBN" [extensions/Wikibase] (wmf/1.36.0-wmf.35) - 10https://gerrit.wikimedia.org/r/673108 (https://phabricator.wikimedia.org/T275611) (owner: 10Ladsgroup)
[12:54:32] <Amir1>	 I'm deploying a fix for the train (UBN)
[12:55:56] <Majavah>	 jbond42: hi, around now, feel free to merge
[12:56:09] <jbond42>	 Majavah: ack will merge now
[12:56:25] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] swift: compare kernel version directly [puppet] - 10https://gerrit.wikimedia.org/r/673250 (https://phabricator.wikimedia.org/T276179) (owner: 10Majavah)
[12:57:31] <jbond42>	 Majavah: merged and deployed to deployment-puppetmaster04
[12:57:56] <Majavah>	 jbond42: thanks! I'll run puppet manually on the affected deployment-prep instances and will report back if there are any problems
[12:58:44] <jbond42>	 !log upload cas_6.3.2 to apt buster-wikimedia
[12:58:46] <jbond42>	 Majavah: ack
[12:58:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:07:40] <wikibugs>	 (03PS1) 10Jbond: idp: failover for upgrade [dns] - 10https://gerrit.wikimedia.org/r/673265 (https://phabricator.wikimedia.org/T271684)
[13:08:07] <Majavah>	 jbond42: the patch correctly detected the kernel version, it updated fstab but even after systemctl daemon-reload systemctl has not picked up the new params, trying to reboot a host now to see if systemctl picks up the changes
[13:08:25] <jbond42>	 ack
[13:08:45] <wikibugs>	 10SRE, 10ops-eqiad: analytics1063 interface errors - https://phabricator.wikimedia.org/T277633 (10Cmjohnson) 05Open→03Resolved This has been completed
[13:08:50] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] idp: failover for upgrade [dns] - 10https://gerrit.wikimedia.org/r/673265 (https://phabricator.wikimedia.org/T271684) (owner: 10Jbond)
[13:09:56] <wikibugs>	 (03PS1) 10Jbond: Revert "idp: failover for upgrade" [dns] - 10https://gerrit.wikimedia.org/r/673109
[13:10:41] <Majavah>	 jbond42: found the issue, it applies the options the wrong way around, I'll make a patch to fix
[13:11:07] <jbond42>	 ack ping me when it's in
[13:12:02] <icinga-wm>	 PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.0102 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[13:12:09] <wikibugs>	 (03PS1) 10Majavah: swift: fix kernel version check [puppet] - 10https://gerrit.wikimedia.org/r/673266
[13:12:10] <Majavah>	 there ^
[13:12:17] <Majavah>	 jbond42: ^
[13:12:25] <Majavah>	 did that cause that prod puppet failure check or other problems?
[13:12:29] <jbond42>	 Majavah: ack one sec im just checking on the icinga error
[13:12:35] <jbond42>	 possibly
[13:14:14] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] swift: fix kernel version check [puppet] - 10https://gerrit.wikimedia.org/r/673266 (owner: 10Majavah)
[13:14:19] <elukey>	 jbond42: the alert is barely above the threshold, could it be that the ml-servexxxx nodes (WIP) are contributing?
[13:15:02] <jbond42>	 elukey: yes it could be 
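The alert text `0.0102 ge 0.01` reads as a failure ratio of 0.0102 compared against a threshold of 0.01 with greater-or-equal, which is why a few extra failing hosts can tip it over. A minimal Python sketch of that comparison follows; the function names and the fleet size of 1500 hosts are hypothetical and are not taken from the actual Icinga check.

```python
def failure_ratio(failed_hosts: int, total_hosts: int) -> float:
    """Fraction of hosts whose last puppet run failed."""
    if total_hosts <= 0:
        raise ValueError("total_hosts must be positive")
    return failed_hosts / total_hosts


def is_critical(ratio: float, threshold: float = 0.01) -> bool:
    """Mirror the 'ge' in the alert text: critical when ratio >= threshold."""
    return ratio >= threshold


# The value reported in the alert sits just over the threshold.
print(is_critical(0.0102))

# Hypothetical fleet of 1500 hosts: two more failures flip the state.
for failed in (14, 16):
    print(failed, is_critical(failure_ratio(failed, 1500)))
```

With a margin this small, a transient failure on a few nodes is enough to trigger the alert, and it should clear again once those nodes complete a clean run.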
[13:15:39] <jbond42>	 i think the change above may have caused a transient error on some of the swift nodes which pushed it over the edge
[13:16:21] <jbond42>	 indeed we got 'mount point not mounted or bad option.'
[13:16:38] <jbond42>	 Majavah: that fix has been deployed
[13:16:50] <Majavah>	 jbond42: thanks, seems to be working on beta
[13:16:55] <Majavah>	 oops, sorry for that
[13:17:34] <jbond42>	 Majavah: no probs, my fault i actually thought it was wrong but somehow convinced myself it wasn't
[13:17:49] <jbond42>	 either way no harm done
[13:18:00] <icinga-wm>	 PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.0102 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[13:18:46] <Majavah>	 also beta swift works now again!  systemctl needed a manual daemon-reload
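The swift patches discussed above depend on comparing the kernel version and picking mount options from the result; the first attempt applied the options the wrong way around. A minimal Python sketch of that kind of check follows. The cutoff version and the option names are hypothetical placeholders, since the actual Puppet code and options are not shown in this log.

```python
import re
from typing import Tuple


def kernel_tuple(release: str) -> Tuple[int, ...]:
    """Turn a release such as '4.19.0-14-amd64' into (4, 19, 0)."""
    match = re.match(r"(\d+(?:\.\d+)*)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return tuple(int(part) for part in match.group(1).split("."))


# Hypothetical cutoff and option names, for illustration only.
CUTOFF = (4, 19)
NEW_OPTIONS = "new-style-option"
OLD_OPTIONS = "old-style-option"


def mount_options(release: str) -> str:
    """Pick options by numeric comparison, newest kernels get NEW_OPTIONS."""
    if kernel_tuple(release) >= CUTOFF:
        return NEW_OPTIONS
    return OLD_OPTIONS


print(mount_options("4.9.0-13-amd64"))
print(mount_options("5.10.0-8-amd64"))
```

Comparing integer tuples avoids the string-ordering trap in which "4.9" sorts after "4.19", and keeping both branches in one small function makes an inverted condition easier to spot in review.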
[13:19:15] <wikibugs>	 10SRE, 10observability: rsyslog-kubernetes missing in buster-wikimedia - https://phabricator.wikimedia.org/T277739 (10elukey) >>! In T277739#6924397, @elukey wrote: > I am a little confused from the debian/stretch-wikimedia branch of the rsyslog repo, because the debian changelog seems to mention `8.38.0-1~bpo...
[13:19:46] <wikibugs>	 (03Merged) 10jenkins-bot: languageLabelDescriptionAliases: use getLanguageNameByCode [extensions/Wikibase] (wmf/1.36.0-wmf.35) - 10https://gerrit.wikimedia.org/r/673108 (https://phabricator.wikimedia.org/T275611) (owner: 10Ladsgroup)
[13:21:17] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: toolforge: grid: base: stop using LVM [puppet] - 10https://gerrit.wikimedia.org/r/673267 (https://phabricator.wikimedia.org/T272114)
[13:21:35] <Amir1>	 it's on mwdebug1001 and confirming that it works, syncing to the world
[13:22:51] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] Revert "idp: failover for upgrade" [dns] - 10https://gerrit.wikimedia.org/r/673109 (owner: 10Jbond)
[13:23:47] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized php-1.36.0-wmf.35/extensions/Wikibase/repo: [[gerrit:673108|languageLabelDescriptionAliases: use getLanguageNameByCode]] (T275611 T277722) (duration: 01m 14s)
[13:23:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:56] <stashbot>	 T277722: TypeError: this._languageCodes is undefined   at getLanguageNameMap - https://phabricator.wikimedia.org/T277722
[13:23:56] <stashbot>	 T275611: Termbox v2 uses the wrong list of languages - https://phabricator.wikimedia.org/T275611
[13:25:48] <wikibugs>	 10SRE, 10observability: rsyslog-kubernetes missing in buster-wikimedia - https://phabricator.wikimedia.org/T277739 (10akosiaris) >>! In T277739#6924819, @elukey wrote: >>>! In T277739#6924397, @elukey wrote: >> I am a little confused from the debian/stretch-wikimedia branch of the rsyslog repo, because the deb...
[13:26:47] <wikibugs>	 (03CR) 10Ottomata: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/658396 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[13:27:48] <wikibugs>	 (03CR) 10Ottomata: "I see some spaces on the opening bracket of your config entry, I bet that's it." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673005 (owner: 10Sharvaniharan)
[13:29:21] <wikibugs>	 10SRE, 10SRE-Access-Requests: Add samwalton to analytics-privatedata-users - https://phabricator.wikimedia.org/T277298 (10Samwalton9) Thanks @Volans! Looks like I can get further than before, but I now see the following error when attempting to run a query: `Error while compiling statement: FAILED: RuntimeExce...
[13:33:23] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10jijiki) @RobH thank you! @Jclark-ctr, mc1039-mc1054 can be racked in Q4, unless we have more mc* victims. Thank you!
[13:34:14] <wikibugs>	 10SRE, 10SRE-Access-Requests: Add samwalton to analytics-privatedata-users - https://phabricator.wikimedia.org/T277298 (10Ladsgroup) I assume you need kerberos setup and if you have it, you need to initialize it with "kinit" first
[13:34:23] <wikibugs>	 10SRE, 10SRE-Access-Requests: Add samwalton to analytics-privatedata-users - https://phabricator.wikimedia.org/T277298 (10Ladsgroup) https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos/UserGuide
[13:35:21] <wikibugs>	 10SRE, 10Performance-Team, 10serviceops, 10User-jijiki: Run latest Thumbor on Docker with Buster + Python 3 - https://phabricator.wikimedia.org/T267327 (10Gilles)
[13:35:42] <wikibugs>	 10SRE, 10SRE-Access-Requests: Add samwalton to analytics-privatedata-users - https://phabricator.wikimedia.org/T277298 (10Ottomata) This might have got lost in all the comments, but to query mediawiki_history Sam will need to be in the analytics-privatedata-users group.  Hm, we don't have this case covered in...
[13:36:23] <wikibugs>	 (03CR) 10Andrew Bogott: "This may be a more complete solution: https://gerrit.wikimedia.org/r/c/operations/puppet/+/672456" [puppet] - 10https://gerrit.wikimedia.org/r/673267 (https://phabricator.wikimedia.org/T272114) (owner: 10Arturo Borrero Gonzalez)
[13:38:10] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 88 probes of 604 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[13:40:30] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:42:33] <wikibugs>	 (03PS1) 10Ottomata: Add samwalton as a posix user in analytics-privatedata-users, but no ssh [puppet] - 10https://gerrit.wikimedia.org/r/673270 (https://phabricator.wikimedia.org/T277298)
[13:42:46] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:43:46] <wikibugs>	 (03PS1) 10Volans: interface automation: fix support for cloud-hosts [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/673271
[13:44:22] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 49 probes of 604 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[13:44:44] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] interface automation: fix support for cloud-hosts [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/673271 (owner: 10Volans)
[13:44:59] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] interface automation: fix support for cloud-hosts [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/673271 (owner: 10Volans)
[13:46:10] <icinga-wm>	 PROBLEM - k8s API server requests latencies on chlorine is CRITICAL: instance=10.64.0.45 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[13:46:35] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Cory Massaro - https://phabricator.wikimedia.org/T277692 (10cmassaro) Signed!
[13:47:13] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Cory Massaro - https://phabricator.wikimedia.org/T277692 (10cmassaro) a:05cmassaro→03Ottomata
[13:48:35] <wikibugs>	 (03PS2) 10Volans: interface automation: fix support for cloud-hosts [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/673271
[13:49:20] <elukey>	 !log reboot analytics1066
[13:49:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:49:58] <wikibugs>	 (03CR) 10Volans: [C: 03+2] interface automation: fix support for cloud-hosts [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/673271 (owner: 10Volans)
[13:50:12] <wikibugs>	 10SRE, 10CAS-SSO: Update CAS to 6.3 - https://phabricator.wikimedia.org/T271684 (10jbond) Production, staging and the cloud idp have all now been upgraded to cas 6.3.2
[13:53:54] <wikibugs>	 (03CR) 10Volans: "see inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/673270 (https://phabricator.wikimedia.org/T277298) (owner: 10Ottomata)
[13:55:44] <wikibugs>	 (03PS1) 10Jbond: P:pki::client: add missing hiera key [puppet] - 10https://gerrit.wikimedia.org/r/673275
[13:56:25] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] P:pki::client: add missing hiera key [puppet] - 10https://gerrit.wikimedia.org/r/673275 (owner: 10Jbond)
[13:57:22] <icinga-wm>	 RECOVERY - k8s API server requests latencies on chlorine is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[14:02:38] <wikibugs>	 (03CR) 10Ottomata: Add samwalton as a posix user in analytics-privatedata-users, but no ssh (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/673270 (https://phabricator.wikimedia.org/T277298) (owner: 10Ottomata)
[14:03:00] <wikibugs>	 (03PS1) 10Gergő Tisza: Revert "[beta] Disable captchas while they are completely broken" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673112 (https://phabricator.wikimedia.org/T276176)
[14:04:06] <wikibugs>	 (03CR) 10David Caro: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/667183 (https://phabricator.wikimedia.org/T274497) (owner: 10David Caro)
[14:04:08] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/673270 (https://phabricator.wikimedia.org/T277298) (owner: 10Ottomata)
[14:06:24] <wikibugs>	 (03PS1) 10Volans: interface automation: 2nd fix for cloud-hosts VLAN [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/673276
[14:07:31] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] interface automation: 2nd fix for cloud-hosts VLAN [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/673276 (owner: 10Volans)
[14:10:01] <wikibugs>	 (03CR) 10Volans: [C: 03+2] interface automation: 2nd fix for cloud-hosts VLAN [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/673276 (owner: 10Volans)
[14:10:59] <wikibugs>	 10SRE, 10observability: rsyslog-kubernetes missing in buster-wikimedia - https://phabricator.wikimedia.org/T277739 (10fgiunchedi) >>! In T277739#6924365, @akosiaris wrote: >> After a chat with Filippo, IIUC 8.2008.0-1 is used only on centrallog nodes (hence the component) but we might want to use 8.19011 provi...
[14:12:46] <jinxer-wm>	 (Primary inbound port utilisation over 80%  #page) firing: Primary inbound port utilisation over 80%  #page - https://alerts.wikimedia.org
[14:13:00] <XioNoX>	 looking
[14:13:22] <wikibugs>	 (03CR) 10Joal: "A bunch of comments - happy to talk more :)" (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/672373 (https://phabricator.wikimedia.org/T277062) (owner: 10Elukey)
[14:13:47] <jinxer-wm>	 (Primary outbound port utilisation over 80%  #page) firing: Primary outbound port utilisation over 80%  #page - https://alerts.wikimedia.org
[14:13:54] <cdanis>	 looking
[14:14:13] <XioNoX>	 what's going on between codfw and ulsfo ? https://librenms.wikimedia.org/device/device=89/tab=port/port=16787/
[14:14:16] <wikibugs>	 10SRE, 10observability: rsyslog-kubernetes missing in buster-wikimedia - https://phabricator.wikimedia.org/T277739 (10elukey) This is what I did on deneb before reading Filippo's answer :)  * `apt-get source rsyslog -t buster` * applied Filippo's patch manually (git diff HEAD~1 HEAD on the debian/stretch-wikim...
[14:14:24] <cdanis>	 XioNoX: https://librenms.wikimedia.org/bill/bill_id=10/
[14:14:44] <XioNoX>	 yep
[14:15:18] <XioNoX>	 cdanis: all through equinix
[14:15:42] <XioNoX>	 let's see netflow
[14:15:52] <cdanis>	 XioNoX: https://w.wiki/36x$
[14:16:13] <XioNoX>	 cdanis: and it's not being cashed
[14:16:15] <XioNoX>	 cached
[14:16:28] <XioNoX>	 as it comes from a transport link
[14:16:35] <cdanis>	 yeah
[14:16:39] <icinga-wm>	 PROBLEM - LibreNMS has a critical alert #page on alert1001 is CRITICAL: Primary outbound port utilisation over 80% #page (cr1-codfw.wikimedia.org) https://bit.ly/wmf-librenms
[14:16:48] <cdanis>	 there's the actual page :)
[14:17:04] <XioNoX>	 splunk loging failed...
[14:17:08] <XioNoX>	 can someone ack it...
[14:17:15] <cdanis>	 it's upload-lb unsurprisingly
[14:17:19] <rzl>	 acked
[14:17:20] <bblack>	 hey
[14:17:28] <rzl>	 I can't stay though, interviewing shortly
[14:17:32] <volans>	 acked
[14:17:36] <godog>	 I'm here too
[14:17:46] <jinxer-wm>	 (Primary inbound port utilisation over 80%  #page) resolved: Primary inbound port utilisation over 80%  #page - https://alerts.wikimedia.org
[14:18:47] <jinxer-wm>	 (Primary outbound port utilisation over 80%  #page) resolved: Primary outbound port utilisation over 80%  #page - https://alerts.wikimedia.org
[14:18:57] <icinga-wm>	 RECOVERY - LibreNMS has a critical alert #page on alert1001 is OK: OK: zero critical LibreNMS alerts https://bit.ly/wmf-librenms
[14:21:16] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:23:22] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:24:56] <wikibugs>	 10SRE, 10observability: rsyslog-kubernetes missing in buster-wikimedia - https://phabricator.wikimedia.org/T277739 (10elukey) Update after a chat with Filippo:  * a fleetwide update of rsyslog is painful, we can avoid it with the component solution (extremely wise point) * update the rsyslog repo seems good -...
[14:25:39] <wikibugs>	 (03PS1) 10BBlack: Ratelimit applebot temporarily [puppet] - 10https://gerrit.wikimedia.org/r/673281
[14:27:29] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] Ratelimit applebot temporarily [puppet] - 10https://gerrit.wikimedia.org/r/673281 (owner: 10BBlack)
[14:29:18] <hashar>	 !log Restarting CI Jenkins for plugin upgrade
[14:29:22] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Add samwalton as a posix user in analytics-privatedata-users, but no ssh [puppet] - 10https://gerrit.wikimedia.org/r/673270 (https://phabricator.wikimedia.org/T277298) (owner: 10Ottomata)
[14:29:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:31:41] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Add samwalton to analytics-privatedata-users - https://phabricator.wikimedia.org/T277298 (10Ottomata) @Samwalton9 try now!
[14:34:08] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Add samwalton to analytics-privatedata-users - https://phabricator.wikimedia.org/T277298 (10Samwalton9) `org.apache.hive.service.cli.HiveSQLException: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec...
[14:35:21] <wikibugs>	 (03PS1) 10Ottomata: Don't echo unset in env_vars.sh on deactivate [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/673284
[14:35:36] <wikibugs>	 (03CR) 10Ottomata: [V: 03+2 C: 03+2] Don't echo unset in env_vars.sh on deactivate [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/673284 (owner: 10Ottomata)
[14:37:49] <dcausse>	 !log repooling wdqs1005
[14:37:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:39:20] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] "I still don't know which PHP version to use, most probably will aim at using php 7.2 to be aligned with MediaWiki prod.  But that can be a" [puppet] - 10https://gerrit.wikimedia.org/r/673027 (owner: 10Jbond)
[14:40:30] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] P:doc: use the correct php version for each debian distro [puppet] - 10https://gerrit.wikimedia.org/r/673027 (owner: 10Jbond)
[14:43:09] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[14:43:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:45:16] <wikibugs>	 (03PS1) 10Hashar: contint: remove erroneous hiera setting for labs [puppet] - 10https://gerrit.wikimedia.org/r/673286 (https://phabricator.wikimedia.org/T269354)
[14:48:35] <wikibugs>	 (03PS2) 10Klausman: helm: Make ML k8s clusters visible to helm [deployment-charts] - 10https://gerrit.wikimedia.org/r/673227 (https://phabricator.wikimedia.org/T272918)
[14:49:34] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: remove type setting on dlq input [puppet] - 10https://gerrit.wikimedia.org/r/673131 (https://phabricator.wikimedia.org/T277080) (owner: 10Cwhite)
[14:49:54] <wikibugs>	 10Puppet, 10SRE, 10Beta-Cluster-Infrastructure, 10cloud-services-team (Kanban): WMCS standalone puppet master does not lookup cherry picked hiera change - https://phabricator.wikimedia.org/T277526 (10hashar) I just misunderstood how hiera lookup works on WMCS.  The bulk of it is that parameters specific to...
[14:50:10] <wikibugs>	 (03PS2) 10Hashar: contint: remove erroneous hiera setting for labs [puppet] - 10https://gerrit.wikimedia.org/r/673286 (https://phabricator.wikimedia.org/T277526)
[14:50:20] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:50:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:02] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Do we know why the host field is causing the exceptions ?" [puppet] - 10https://gerrit.wikimedia.org/r/673142 (owner: 10Cwhite)
[14:51:33] <wikibugs>	 (03PS3) 10Klausman: helm: Make ML k8s clusters visible to helm [deployment-charts] - 10https://gerrit.wikimedia.org/r/673227 (https://phabricator.wikimedia.org/T272918)
[14:54:14] <wikibugs>	 (03CR) 10Cwhite: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/673142 (owner: 10Cwhite)
[14:54:22] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:56:34] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:57:12] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 87 probes of 604 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[14:58:06] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: add late-stage host field type validation for ecs events [puppet] - 10https://gerrit.wikimedia.org/r/673142 (owner: 10Cwhite)
[14:59:07] <wikibugs>	 (03PS1) 10Jbond: admin: clean up yaml lint issues [puppet] - 10https://gerrit.wikimedia.org/r/673287
[14:59:09] <wikibugs>	 (03PS1) 10Jbond: admin: add yaml lint to admin data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/673288
[14:59:47] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] admin: add yaml lint to admin data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/673288 (owner: 10Jbond)
[15:00:12] <wikibugs>	 (03PS2) 10Jbond: admin: clean up yaml lint issues [puppet] - 10https://gerrit.wikimedia.org/r/673287
[15:00:31] <wikibugs>	 (03PS2) 10Jbond: admin: add yaml lint to admin data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/673288
[15:00:35] <wikibugs>	 10Puppet, 10SRE, 10Beta-Cluster-Infrastructure, 10Patch-For-Review, 10cloud-services-team (Kanban): WMCS standalone puppet master does not lookup cherry picked hiera change - https://phabricator.wikimedia.org/T277526 (10hashar) 05Open→03Resolved a:03hashar Thank you @jbond !
[15:01:08] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] admin: add yaml lint to admin data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/673288 (owner: 10Jbond)
[15:02:00] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Add samwalton to analytics-privatedata-users - https://phabricator.wikimedia.org/T277298 (10Ottomata) access=WRITE ... ?  What is the query you are trying to run?
[15:03:20] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 50 probes of 604 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[15:07:38] <wikibugs>	 (03CR) 10Volans: "L3 signed now" [puppet] - 10https://gerrit.wikimedia.org/r/673223 (https://phabricator.wikimedia.org/T277692) (owner: 10Volans)
[15:08:14] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Cory Massaro - https://phabricator.wikimedia.org/T277692 (10Volans) a:05Ottomata→03Volans
[15:11:16] <wikibugs>	 10SRE, 10SRE-Access-Requests: Add samwalton to analytics-privatedata-users - https://phabricator.wikimedia.org/T277298 (10Samwalton9) 05Open→03Resolved a:03Samwalton9 >>! In T277298#6925220, @Ottomata wrote: > access=WRITE ... ? >  > What is the query you are trying to run?  I'm doing a simple test query...
[15:15:46] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/673223 (https://phabricator.wikimedia.org/T277692) (owner: 10Volans)
[15:16:16] <wikibugs>	 (03PS3) 10Jbond: admin: add yaml lint to admin data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/673288
[15:18:53] <wikibugs>	 (03CR) 10Volans: [C: 03+2] admin: add apine to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/673223 (https://phabricator.wikimedia.org/T277692) (owner: 10Volans)
[15:19:00] <wikibugs>	 (03PS1) 10Jbond: DO NOT MERGE: Test CI picks up duplicate yaml keys [puppet] - 10https://gerrit.wikimedia.org/r/673291
[15:19:47] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] DO NOT MERGE: Test CI picks up duplicate yaml keys [puppet] - 10https://gerrit.wikimedia.org/r/673291 (owner: 10Jbond)
[15:20:34] <wikibugs>	 (03PS2) 10Volans: admin: add apine to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/673223 (https://phabricator.wikimedia.org/T277692)
[15:20:39] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] admin: clean up yaml lint issues [puppet] - 10https://gerrit.wikimedia.org/r/673287 (owner: 10Jbond)
[15:20:43] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] admin: add yaml lint to admin data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/673288 (owner: 10Jbond)
[15:21:41] <wikibugs>	 (03PS4) 10Jcrespo: WIP [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/665383
[15:21:43] <wikibugs>	 (03PS1) 10Jcrespo: Add sql with the empty database structure to the repo [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673292 (https://phabricator.wikimedia.org/T138562)
[15:21:52] <wikibugs>	 (03PS2) 10Jcrespo: Add sql with the empty database structure to the repo [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673292 (https://phabricator.wikimedia.org/T138562)
[15:21:55] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10Gilles)
[15:22:15] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add sql with the empty database structure to the repo [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673292 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo)
[15:22:17] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] WIP [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/665383 (owner: 10Jcrespo)
[15:27:05] <wikibugs>	 10SRE, 10SRE-Access-Requests: Add samwalton to analytics-privatedata-users - https://phabricator.wikimedia.org/T277298 (10Ottomata) Hm, maybe your user had to be created on the host that runs Hue first.  I did not manually do that after I merged, so it may have just done that now.  Great!
[15:28:53] <wikibugs>	 (03CR) 10Volans: [C: 03+2] admin: add apine to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/673223 (https://phabricator.wikimedia.org/T277692) (owner: 10Volans)
[15:29:05] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: TBD) rack/setup/install db11[76-84] - https://phabricator.wikimedia.org/T273566 (10RobH) 05Open→03Resolved
[15:30:12] <wikibugs>	 (03PS3) 10Volans: admin: add apine to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/673223 (https://phabricator.wikimedia.org/T277692)
[15:30:16] <volans>	 damn, 2 conflicts in 10 minutes :d
[15:31:22] <wikibugs>	 (03Abandoned) 10Jbond: DO NOT MERGE: Test CI picks up duplicate yaml keys [puppet] - 10https://gerrit.wikimedia.org/r/673291 (owner: 10Jbond)
[15:32:20] <wikibugs>	 (03CR) 10Volans: [C: 03+2] admin: add apine to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/673223 (https://phabricator.wikimedia.org/T277692) (owner: 10Volans)
[15:32:41] <jbond42>	 volans: ahh that could have been me reformatting the few low hanging yamllint errors
[15:33:00] <volans>	 you and otto, but it's fine, no worries :D
[15:33:04] <jbond42>	 :)
[15:33:10] <shdubsh>	 !log clean up dead letter queue and restart all logstashes
[15:33:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:34:33] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Support building a grid-exec node with cinder or flavor-defined storage [puppet] - 10https://gerrit.wikimedia.org/r/672456 (https://phabricator.wikimedia.org/T272114) (owner: 10Andrew Bogott)
[15:38:40] <wikibugs>	 (03PS3) 10Jcrespo: Add sql with the empty database structure to the repo [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673292 (https://phabricator.wikimedia.org/T138562)
[15:38:57] <wikibugs>	 10SRE, 10Services, 10Patch-For-Review, 10Performance-Team (Radar), 10Service-deployment-requests: New Service Request: xhgui - https://phabricator.wikimedia.org/T277483 (10dpifke) >>! In T277483#6924456, @akosiaris wrote: > * Is xhgui stateless? More specifically > ** Does xhgui in any way persist anythi...
[15:39:13] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add sql with the empty database structure to the repo [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673292 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo)
[15:39:18] <wikibugs>	 (03PS4) 10Jcrespo: Add sql with the empty database structure to the repo [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673292 (https://phabricator.wikimedia.org/T138562)
[15:39:48] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add sql with the empty database structure to the repo [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673292 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo)
[15:40:52] <wikibugs>	 (03PS5) 10Jcrespo: Add sql with the empty database structure to the repo [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673292 (https://phabricator.wikimedia.org/T138562)
[15:41:06] <wikibugs>	 (03PS6) 10Jcrespo: Add sql with the empty database structure to the repo [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673292 (https://phabricator.wikimedia.org/T138562)
[15:42:29] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] logstash: remove type setting on dlq input [puppet] - 10https://gerrit.wikimedia.org/r/673131 (https://phabricator.wikimedia.org/T277080) (owner: 10Cwhite)
[15:42:36] <wikibugs>	 (03CR) 10Jcrespo: "This will require later debian changes to install the .sql on the filesystem." [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673292 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo)
[15:42:51] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] logstash: add late-stage host field type validation for ecs events [puppet] - 10https://gerrit.wikimedia.org/r/673142 (owner: 10Cwhite)
[15:43:40] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] ensure host field is the correct type in late-stage ecs filter [software/ecs] - 10https://gerrit.wikimedia.org/r/673148 (owner: 10Cwhite)
[15:43:45] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Cory Massaro - https://phabricator.wikimedia.org/T277692 (10Volans) Created kerberos principal:  ` krb1001 ~$ sudo manage_principals.py create apine --email_address=cmassaro@wikimedia.org Principal successf...
[15:44:05] <wikibugs>	 (03Abandoned) 10Arturo Borrero Gonzalez: toolforge: grid: base: stop using LVM [puppet] - 10https://gerrit.wikimedia.org/r/673267 (https://phabricator.wikimedia.org/T272114) (owner: 10Arturo Borrero Gonzalez)
[15:44:07] <wikibugs>	 (03Merged) 10jenkins-bot: ensure host field is the correct type in late-stage ecs filter [software/ecs] - 10https://gerrit.wikimedia.org/r/673148 (owner: 10Cwhite)
[15:50:05] <wikibugs>	 (03PS7) 10Jcrespo: Add sql with the empty database structure to the repo [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673292 (https://phabricator.wikimedia.org/T138562)
[15:50:21] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Cory Massaro - https://phabricator.wikimedia.org/T277692 (10Volans) @cmassaro kerberos activated and patch with your access merged, please follow https://wikitech.wikimedia.org/wiki/Production_access#Settin...
[15:51:06] <wikibugs>	 (03Abandoned) 10Ahmon Dancy: logspam.pl: Only process errors from production mw servers [puppet] - 10https://gerrit.wikimedia.org/r/673171 (owner: 10Ahmon Dancy)
[15:51:47] <wikibugs>	 (03PS2) 10ArielGlenn: distinguish between "no wikis with batches available" and "no wikis left to run" [dumps] - 10https://gerrit.wikimedia.org/r/673210 (https://phabricator.wikimedia.org/T252396)
[15:52:45] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Cory Massaro - https://phabricator.wikimedia.org/T277692 (10Volans) And feel free to resolve this task once it's all working as expected.
[15:53:28] <wikibugs>	 (03PS1) 10Gilles: Add edge cache hostname to Server-Timing header [puppet] - 10https://gerrit.wikimedia.org/r/673295 (https://phabricator.wikimedia.org/T277769)
[15:55:00] <wikibugs>	 (03CR) 10Volans: "post-merge optional nit ;)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/673288 (owner: 10Jbond)
[15:56:02] <wikibugs>	 (03CR) 10Jcrespo: "Like tendril, the database deployment for metadata db backups was never documented, as we handled it with puppet/backup infrastructure." [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673292 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo)
[15:56:28] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Cory Massaro - https://phabricator.wikimedia.org/T277692 (10Volans) a:05Volans→03None @Ottomata is there anything to be done on the analytics side to sync the user for the intended usage?
[15:57:05] <wikibugs>	 (03PS8) 10Jcrespo: Add sql with the empty database structure to the repo [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673292 (https://phabricator.wikimedia.org/T138562)
[15:57:32] <wikibugs>	 (03PS1) 10Jbond: admin: data_validate use Path().parent vs Path.parents[0] [puppet] - 10https://gerrit.wikimedia.org/r/673296
[15:57:47] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Cory Massaro - https://phabricator.wikimedia.org/T277692 (10Ottomata) Nope, that's it!  Thank you!
[15:58:25] <wikibugs>	 (03PS9) 10Jcrespo: Add sql with the empty database structure to the repo [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673292 (https://phabricator.wikimedia.org/T138562)
[16:00:04] <jouncebot>	 jbond42 and cdanis: Dear deployers, time to do the Puppet request window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210318T1600).
[16:00:04] <jouncebot>	 Majavah: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[16:00:42] <Majavah>	 ^ already done
[16:01:52] <jbond42>	 :) thanks
[16:02:06] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/673296 (owner: 10Jbond)
[16:02:21] <wikibugs>	 10SRE, 10Analytics, 10Analytics-Kanban, 10Traffic: Traffic anomalies: Factor out list of countries into a dedicated Hive table - https://phabricator.wikimedia.org/T272052 (10fdans) 05Open→03Resolved
[16:02:37] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] admin: data_validate use Path().parent vs Path.parents[0] [puppet] - 10https://gerrit.wikimedia.org/r/673296 (owner: 10Jbond)
[16:03:43] <wikibugs>	 (03CR) 10Jbond: "> Patch Set 3:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/673288 (owner: 10Jbond)
[16:04:17] <wikibugs>	 (03CR) 10Volans: [C: 03+2] openldap: improve cross-validate-accounts [puppet] - 10https://gerrit.wikimedia.org/r/673241 (owner: 10Volans)
[16:06:17] <wikibugs>	 (03PS1) 10Jon Harald Søby: Add alt, bcl, diq, mad, mni, mnw, nia, skr, tay and trv to InterwikiSortOrders [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673297
[16:06:34] <wikibugs>	 10SRE, 10Prod-Kubernetes, 10SRE-tools: Support downtiming services in our cookbooks - https://phabricator.wikimedia.org/T277740 (10akosiaris) fwiw, a possibly desired UX would be something like    `$ sre.downtime.service 'service1|service2|service3'` or `$ sre.downtime.service service1 [service2] [service3]`...
[16:07:40] <wikibugs>	 (03PS2) 10Jforrester: Revert "[beta] Disable captchas while they are completely broken" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673112 (https://phabricator.wikimedia.org/T276176) (owner: 10Gergő Tisza)
[16:07:53] <wikibugs>	 (03PS1) 10Jbond: P:debmonitor::server: make logback and monitoring optional [puppet] - 10https://gerrit.wikimedia.org/r/673299
[16:08:44] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add alt, bcl, diq, mad, mni, mnw, nia, skr, tay and trv to InterwikiSortOrders [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673297 (owner: 10Jon Harald Søby)
[16:09:10] <wikibugs>	 (03CR) 10Urbanecm: [C: 04-1] "thanks for the patch!" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673297 (owner: 10Jon Harald Søby)
[16:10:14] <wikibugs>	 (03PS2) 10Jon Harald Søby: Add alt, bcl, diq, mad, mni, mnw, nia, skr, tay and trv to InterwikiSortOrders [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673297
[16:10:58] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM if PCC is happy" [puppet] - 10https://gerrit.wikimedia.org/r/673299 (owner: 10Jbond)
[16:11:18] <wikibugs>	 (03CR) 10CRusnov: [C: 03+1] "seems good, unless the extra test was in place in order to differentiate from generic systemd alerts." [puppet] - 10https://gerrit.wikimedia.org/r/673228 (owner: 10Jbond)
[16:13:14] <wikibugs>	 (03CR) 10Jforrester: "@Urbanecm, want to deploy this?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657696 (https://phabricator.wikimedia.org/T269712) (owner: 10Jforrester)
[16:14:06] <wikibugs>	 10SRE, 10Prod-Kubernetes, 10SRE-tools: Support downtiming services in our cookbooks - https://phabricator.wikimedia.org/T277740 (10Volans) Doh, I think we have naming clash here :)    - service: as in Icinga single service belonging to an Icinga host   - service: as in a WMF .svc. service but treated as a Ho...
[16:15:17] <tgr_>	 I'll deploy a beta-only patch.
[16:15:37] <James_F>	 tgr_: Ha, beat me to it.
[16:15:47] <wikibugs>	 (03CR) 10Jforrester: [C: 03+1] Revert "[beta] Disable captchas while they are completely broken" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673112 (https://phabricator.wikimedia.org/T276176) (owner: 10Gergő Tisza)
[16:16:17] <wikibugs>	 (03CR) 10Gergő Tisza: [C: 03+2] Revert "[beta] Disable captchas while they are completely broken" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673112 (https://phabricator.wikimedia.org/T276176) (owner: 10Gergő Tisza)
[16:17:36] <wikibugs>	 (03PS2) 10Jbond: P:debmonitor::server: make logback and monitoring optional [puppet] - 10https://gerrit.wikimedia.org/r/673299
[16:18:37] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "[beta] Disable captchas while they are completely broken" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673112 (https://phabricator.wikimedia.org/T276176) (owner: 10Gergő Tisza)
[16:19:20] <wikibugs>	 (03PS2) 10Jforrester: Drop ability to use graphoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/654954 (https://phabricator.wikimedia.org/T242855)
[16:19:22] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28668/console" [puppet] - 10https://gerrit.wikimedia.org/r/673299 (owner: 10Jbond)
[16:19:26] <wikibugs>	 (03CR) 10Jbond: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/673228 (owner: 10Jbond)
[16:19:57] <wikibugs>	 (03PS2) 10Jforrester: wgAbuseFilterAflFilterMigrationStage: Make COMPAT_NEW in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657696 (https://phabricator.wikimedia.org/T269712)
[16:22:03] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] P:debmonitor::server: make logback and monitoring optional [puppet] - 10https://gerrit.wikimedia.org/r/673299 (owner: 10Jbond)
[16:23:56] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:26:20] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:33:12] <wikibugs>	 (03PS3) 10ArielGlenn: distinguish between "no wikis with batches available" and "no wikis left to run" [dumps] - 10https://gerrit.wikimedia.org/r/673210 (https://phabricator.wikimedia.org/T252396)
[16:37:30] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2239.codfw.wmnet
[16:37:36] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2240.codfw.wmnet
[16:37:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:37:41] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2241.codfw.wmnet
[16:37:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:37:44] <wikibugs>	 (03PS4) 10Sharvaniharan: Add event stream config for android.image_recommendations_interaction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673005
[16:37:46] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2242.codfw.wmnet
[16:37:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:37:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:38:15] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw2239.codfw.wmnet
[16:38:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:40:17] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add event stream config for android.image_recommendations_interaction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673005 (owner: 10Sharvaniharan)
[16:40:24] <wikibugs>	 (03CR) 10Mholloway: Add event stream config for android.image_recommendations_interaction (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673005 (owner: 10Sharvaniharan)
[16:40:44] <wikibugs>	 (03CR) 10Dzahn: "We would leave the data import/export to you like last time. This was more to point out there is this path to change once it's time for it" [puppet] - 10https://gerrit.wikimedia.org/r/673175 (https://phabricator.wikimedia.org/T277580) (owner: 10Dzahn)
[16:45:21] <wikibugs>	 (03CR) 10Mholloway: Add event stream config for android.image_recommendations_interaction (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673005 (owner: 10Sharvaniharan)
[16:47:50] <wikibugs>	 10SRE, 10Analytics-Radar: Upgrade to Kafka MirrorMaker 2 - https://phabricator.wikimedia.org/T277467 (10fdans)
[16:50:31] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "I think the biggest advantage is that a user on IRC will immediately see what the actual problem is instead of some generic "systemd state" [puppet] - 10https://gerrit.wikimedia.org/r/673225 (owner: 10Jbond)
[16:51:08] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2239.codfw.wmnet
[16:51:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:52:42] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers kubernetes1008.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1016.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1004.eqiad.wmnet, kubernetes1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[16:53:02] <icinga-wm>	 PROBLEM - LVS linkrecommendation-external eqiad port 4006/tcp - Link Recommendation- public release- linkrecommendation.svc.eqiad.wmnet IPv4 on linkrecommendation.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.23 and port 4006: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[16:53:28] <wikibugs>	 (03CR) 10RLazarus: [C: 03+1] "Oh, I love this. I'd been hoping we could have this sort of unit-specific monitoring -- the problem Daniel mentions has been bugging me, b" [puppet] - 10https://gerrit.wikimedia.org/r/673225 (owner: 10Jbond)
[16:54:31] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw2240.codfw.wmnet
[16:54:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:55:52] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[16:57:28] <icinga-wm>	 RECOVERY - LVS linkrecommendation-external eqiad port 4006/tcp - Link Recommendation- public release- linkrecommendation.svc.eqiad.wmnet IPv4 on linkrecommendation.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 196 bytes in 3.771 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[16:57:30] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "> Patch Set 3: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/673225 (owner: 10Jbond)
[16:57:38] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] systemd::monitor: create a generic nrpe systemd check [puppet] - 10https://gerrit.wikimedia.org/r/673221 (owner: 10Jbond)
[16:57:42] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] systemd::service: Add ability to monitor services [puppet] - 10https://gerrit.wikimedia.org/r/673225 (owner: 10Jbond)
[16:58:24] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/673221 (owner: 10Jbond)
[16:59:57] <wikibugs>	 (03PS1) 10Bstorm: toolforge: set up an "opt out" of the infrastructure profile [puppet] - 10https://gerrit.wikimedia.org/r/673304 (https://phabricator.wikimedia.org/T277756)
[17:00:04] <jouncebot>	 chrisalbon and accraze: Dear deployers, time to do the Services – Graphoid / ORES deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210318T1700).
[17:00:56] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers kubernetes1010.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1016.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[17:01:49] <wikibugs>	 10SRE, 10Services, 10Service-deployment-requests, 10artificial-intelligence: New Service Request 'open_nsfw' - https://phabricator.wikimedia.org/T250110 (10calbon) Sounds great! Can you post here when the model is up on toolforge? I'd love to take a look.
[17:02:24] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[17:02:48] <wikibugs>	 10SRE, 10Analytics, 10Traffic: varnishkafka / ATSkafka should support setting the kafka message timestamp - https://phabricator.wikimedia.org/T277553 (10fdans) p:05Triage→03Medium a:03razzi cc @ema
[17:03:36] <icinga-wm>	 PROBLEM - AQS root url on aqs1010 is CRITICAL: connect to address 10.64.0.40 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[17:03:39] <wikibugs>	 (03PS1) 10Jforrester: FlaggedRevs: Stop setting wgFlaggedRevsWhitelist, now ignored [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673306
[17:04:04] <elukey>	 aqs1010 is a new host, downtime expired for sure
[17:07:36] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2240.codfw.wmnet
[17:07:40] <wikibugs>	 (03CR) 10CRusnov: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/670970 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[17:07:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:09:43] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw2241.codfw.wmnet
[17:09:46] <wikibugs>	 (03PS1) 10Andrew-WMDE: Enable bracket matching on group0 and wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673312 (https://phabricator.wikimedia.org/T273591)
[17:09:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:09:49] <wikibugs>	 (03PS6) 10Mstyles: create helmfile.d structure [deployment-charts] - 10https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006)
[17:09:53] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: 2021-03-31) rack/setup/install cloudgw100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T272403 (10Cmjohnson) @robh these are ready for you, had a delay in getting them set up because the netbox script didn't work for them, the...
[17:10:03] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: 2021-03-31) rack/setup/install cloudgw100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T272403 (10Cmjohnson)
[17:10:06] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: 2021-03-31) rack/setup/install cloudgw100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T272403 (10Cmjohnson) a:05Cmjohnson→03RobH
[17:10:52] <wikibugs>	 (03CR) 10CRusnov: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/670972 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[17:11:07] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: decom mw2256 (was: mw2256 - CPU/board hardware issue) - https://phabricator.wikimedia.org/T263065 (10Dzahn) @Papaul I just noticed this host has status "offline" in netbox. But should be "decom" state.
[17:13:25] <wikibugs>	 10ops-codfw, 10serviceops: decom 7 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10Dzahn)
[17:15:13] <wikibugs>	 (03CR) 10CRusnov: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/670973 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[17:17:19] <wikibugs>	 10SRE, 10Services, 10Service-deployment-requests, 10artificial-intelligence: New Service Request 'open_nsfw' - https://phabricator.wikimedia.org/T250110 (10Chtnnh) @calbon we have hit some snags in deploying the model on toolforge and are in the process of resolving those. But in the meantime you can have...
[17:17:39] <wikibugs>	 (03CR) 10CRusnov: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/670975 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[17:18:37] <wikibugs>	 (03CR) 10CRusnov: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/670977 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[17:23:49] <wikibugs>	 (03PS1) 10Urbanecm: hrwiki: Deploy Growth features to newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673316 (https://phabricator.wikimedia.org/T275684)
[17:24:13] <Urbanecm>	 jouncebot: next
[17:24:13] <jouncebot>	 In 0 hour(s) and 35 minute(s): Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210318T1800)
[17:26:05] <wikibugs>	 (03CR) 10CRusnov: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/670978 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[17:26:24] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: [WiP] Helm chart to run MediaWiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/670220 (https://phabricator.wikimedia.org/T265327)
[17:27:06] <wikibugs>	 (03CR) 10David Caro: "You can ignore my 'nit' comments." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/670933 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[17:27:42] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [WiP] Helm chart to run MediaWiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/670220 (https://phabricator.wikimedia.org/T265327) (owner: 10Giuseppe Lavagetto)
[17:27:44] <wikibugs>	 (03CR) 10Legoktm: "See Change-Id: I51ba05f2537b4c068a0150c22fc00920a9f70edb :)" [puppet] - 10https://gerrit.wikimedia.org/r/670975 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[17:28:14] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2241.codfw.wmnet
[17:28:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:28:33] <wikibugs>	 (03CR) 10David Caro: "Same comments as the other patch (https://gerrit.wikimedia.org/r/c/operations/puppet/+/670933)" [puppet] - 10https://gerrit.wikimedia.org/r/670928 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[17:28:47] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] rabbitmqadmin.py: Port to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/670970 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[17:29:38] <wikibugs>	 (03PS1) 10Urbanecm: simplewiki: Enable Growth team features in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673319 (https://phabricator.wikimedia.org/T277550)
[17:31:01] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers kubernetes1003.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1013.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[17:31:17] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "The 'nit' can be ignored." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/673304 (https://phabricator.wikimedia.org/T277756) (owner: 10Bstorm)
[17:31:52] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw2242.codfw.wmnet
[17:31:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:32:21] <wikibugs>	 (03CR) 10David Caro: paws: block using the Jupyterhub from Tor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/671286 (https://phabricator.wikimedia.org/T276615) (owner: 10Bstorm)
[17:32:29] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[17:33:15] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw2242 is CRITICAL: Host mw2242 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[17:34:30] <wikibugs>	 (03PS1) 10PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/673321
[17:35:23] <wikibugs>	 (03PS1) 10Urbanecm: tewiki: Enable Growth features in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673323 (https://phabricator.wikimedia.org/T277491)
[17:35:48] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] simplewiki: Enable Growth team features in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673319 (https://phabricator.wikimedia.org/T277550) (owner: 10Urbanecm)
[17:36:36] <wikibugs>	 (03CR) 10Dduvall: [C: 03+2] blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/673321 (owner: 10PipelineBot)
[17:37:12] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install payments100[5-8] - https://phabricator.wikimedia.org/T266481 (10Cmjohnson)
[17:37:59] <wikibugs>	 (03Merged) 10jenkins-bot: simplewiki: Enable Growth team features in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673319 (https://phabricator.wikimedia.org/T277550) (owner: 10Urbanecm)
[17:38:23] <wikibugs>	 (03CR) 10Bstorm: paws: block using the Jupyterhub from Tor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/671286 (https://phabricator.wikimedia.org/T276615) (owner: 10Bstorm)
[17:39:34] <wikibugs>	 (03Merged) 10jenkins-bot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/673321 (owner: 10PipelineBot)
[17:40:14] <wikibugs>	 10SRE, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install payments100[5-8] - https://phabricator.wikimedia.org/T266481 (10Cmjohnson) a:05Cmjohnson→03Jgreen Assigning this to @Jgreen to complete the installs.  All the on-site work has been completed, network ports are set up and enabled so please ins...
[17:40:19] <logmsgbot>	 !log dduvall@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' .
[17:40:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:41:12] <wikibugs>	 (03PS2) 10Urbanecm: tewiki: Enable Growth features in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673323 (https://phabricator.wikimedia.org/T277491)
[17:41:14] <wikibugs>	 (03CR) 10Subramanya Sastry: [C: 03+1] "Okay, in that case, let us wait till we finish any rt testing we need to do for next week's train before rolling this out." [puppet] - 10https://gerrit.wikimedia.org/r/673175 (https://phabricator.wikimedia.org/T277580) (owner: 10Dzahn)
[17:41:16] <wikibugs>	 (03CR) 10Bstorm: toolforge: set up an "opt out" of the infrastructure profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/673304 (https://phabricator.wikimedia.org/T277756) (owner: 10Bstorm)
[17:41:18] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] tewiki: Enable Growth features in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673323 (https://phabricator.wikimedia.org/T277491) (owner: 10Urbanecm)
[17:41:36] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/670972 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[17:42:02] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 04342e9bb0765a6a58ad78bd7eaa380d4167f0c1: simplewiki: Enable Growth team features in stealth mode (T277550) (duration: 01m 10s)
[17:42:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:42:10] <stashbot>	 T277550: Deploy Growth features on Simple English Wikipedia - https://phabricator.wikimedia.org/T277550
[17:42:39] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1038: PCIe error - https://phabricator.wikimedia.org/T276922 (10Cmjohnson) So far I have found out nothing This is a PCI error, in the past, this would mean a blown capacitor but this is the first I've seen of this error since we left...
[17:43:10] <wikibugs>	 (03CR) 10Dzahn: "Ok, sounds good!" [puppet] - 10https://gerrit.wikimedia.org/r/673175 (https://phabricator.wikimedia.org/T277580) (owner: 10Dzahn)
[17:44:20] <wikibugs>	 (03CR) 10CRusnov: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/670981 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[17:44:37] <wikibugs>	 (03Merged) 10jenkins-bot: tewiki: Enable Growth features in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673323 (https://phabricator.wikimedia.org/T277491) (owner: 10Urbanecm)
[17:45:00] <wikibugs>	 10SRE, 10serviceops: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (10jijiki)
[17:45:11] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized dblists/growthexperiments.dblist: 04342e9bb0765a6a58ad78bd7eaa380d4167f0c1: simplewiki: Enable Growth team features in stealth mode (T277550) (duration: 01m 09s)
[17:45:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:45:38] <wikibugs>	 (03CR) 10Jbond: get-raid-status-megacli.py: Port to Python 3 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/670973 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[17:46:49] <wikibugs>	 (03PS1) 10Zabe: Use of Article::getId was deprecated in MediaWiki 1.35 [extensions/LiquidThreads] (wmf/1.36.0-wmf.35) - 10https://gerrit.wikimedia.org/r/673114 (https://phabricator.wikimedia.org/T277772)
[17:47:30] <wikibugs>	 (03CR) 10Jcrespo: "If it works, ship it, but you may want to deploy at the same time the Swift and DB owners are around, as they will probably have ongoing R" [puppet] - 10https://gerrit.wikimedia.org/r/670972 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[17:47:48] <wikibugs>	 10SRE, 10serviceops: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (10Joe) As far as mcrouter goes, the only non-brittle solution is to run it inside the pod, so solution 1. The reason is simple: restarting mcrouter and/or it crashing on the nod...
[17:48:23] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 55aa6cb: tewiki: Enable Growth features in stealth mode (T277491; 1/2) (duration: 01m 10s)
[17:48:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:48:31] <stashbot>	 T277491: Request to implement Growth experiments on Telugu Wikipedia (Tewiki) - https://phabricator.wikimedia.org/T277491
[17:48:39] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/28670/ happy with that so merging" [puppet] - 10https://gerrit.wikimedia.org/r/673304 (https://phabricator.wikimedia.org/T277756) (owner: 10Bstorm)
[17:49:14] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/670977 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[17:49:45] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:50:17] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2242.codfw.wmnet
[17:50:17] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized dblists/growthexperiments.dblist: 55aa6cb: tewiki: Enable Growth features in stealth mode (T277491; 2/2) (duration: 01m 08s)
[17:50:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:50:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:51:01] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists, 10vm-requests: Requesting a test VM in production for mailman3 - https://phabricator.wikimedia.org/T276686 (10Legoktm) a:03Legoktm
[17:51:14] <wikibugs>	 10SRE, 10Data-Persistence-Backup, 10SRE-swift-storage, 10Goal: Research storage solutions for media backups - https://phabricator.wikimedia.org/T264190 (10jcrespo) 05Open→03Resolved Research (analysis) and Design finished for now, we are now in implementation phase: T276442 and T276445.  Documentation...
[17:51:18] <wikibugs>	 10SRE, 10Data-Persistence-Backup, 10SRE-swift-storage, 10Epic, 10Goal: WMF media storage must be adequately backed up in a remote location - https://phabricator.wikimedia.org/T262668 (10jcrespo)
[17:51:27] <wikibugs>	 (03PS1) 10Andrew-WMDE: Enable CodeMirror accessibility colors on initial wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673326 (https://phabricator.wikimedia.org/T276346)
[17:53:19] <wikibugs>	 (03CR) 10Andrew Bogott: "I built a test VM, toolsbeta-sgeexec-0902.toolsbeta.eqiad1.wikimedia.cloud, which looks OK.  Swap is turned off for the toolsbeta-sgeexec " [puppet] - 10https://gerrit.wikimedia.org/r/672456 (https://phabricator.wikimedia.org/T272114) (owner: 10Andrew Bogott)
[17:53:55] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers kubernetes1001.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1016.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kubernetes1015.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[17:54:30] <wikibugs>	 (03CR) 10Jbond: "lgtm but see inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/670978 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[17:55:52] <wikibugs>	 (03CR) 10Legoktm: [V: 03+1] docker_registry_ha: Require authentication from k8s nodes (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/672537 (https://phabricator.wikimedia.org/T273521) (owner: 10Legoktm)
[17:56:05] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[17:56:12] <wikibugs>	 10SRE, 10ops-eqiad: elastic1062 interface errors - https://phabricator.wikimedia.org/T277634 (10Cmjohnson) 05Open→03Resolved replaced the production cable
[17:56:52] <wikibugs>	 (03CR) 10Jbond: wmcs-webproxy.py: Port to Python 3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/670933 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[17:56:55] <wikibugs>	 10SRE, 10ops-codfw, 10decommission-hardware: decommission frqueue1001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T277171 (10Cmjohnson) a:05Cmjohnson→03Papaul
[17:58:51] <legoktm>	 !log disabled puppet on registry* for rolling out https://gerrit.wikimedia.org/r/672537
[17:58:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:59:33] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers kubernetes1008.eqiad.wmnet, kubernetes1003.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1006.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[18:00:04] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: Your horoscope predicts another unfortunate Morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210318T1800).
[18:00:04] <jouncebot>	 Urbanecm: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[18:00:09] <wikibugs>	 (03PS1) 10Urbanecm: mswiki: Enable Growth features in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673329 (https://phabricator.wikimedia.org/T277562)
[18:00:19] <Urbanecm>	 I'll self-service
[18:00:36] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] mswiki: Enable Growth features in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673329 (https://phabricator.wikimedia.org/T277562) (owner: 10Urbanecm)
[18:00:48] <wikibugs>	 (03PS2) 10Urbanecm: hrwiki: Deploy Growth features to newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673316 (https://phabricator.wikimedia.org/T275684)
[18:00:52] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] hrwiki: Deploy Growth features to newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673316 (https://phabricator.wikimedia.org/T275684) (owner: 10Urbanecm)
[18:01:51] <wikibugs>	 (03Merged) 10jenkins-bot: mswiki: Enable Growth features in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673329 (https://phabricator.wikimedia.org/T277562) (owner: 10Urbanecm)
[18:01:59] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:02:05] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[18:02:08] <wikibugs>	 (03CR) 10Legoktm: [V: 03+1 C: 03+2] docker_registry_ha: Require authentication from k8s nodes [puppet] - 10https://gerrit.wikimedia.org/r/672537 (https://phabricator.wikimedia.org/T273521) (owner: 10Legoktm)
[18:02:18] <wikibugs>	 (03PS3) 10Urbanecm: hrwiki: Deploy Growth features to newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673316 (https://phabricator.wikimedia.org/T275684)
[18:02:23] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] hrwiki: Deploy Growth features to newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673316 (https://phabricator.wikimedia.org/T275684) (owner: 10Urbanecm)
[18:02:53] <wikibugs>	 (03CR) 10Jbond: mwgrep.py: Port to Python 3 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/670975 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[18:03:15] <wikibugs>	 (03Merged) 10jenkins-bot: hrwiki: Deploy Growth features to newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673316 (https://phabricator.wikimedia.org/T275684) (owner: 10Urbanecm)
[18:06:06] <wikibugs>	 (03CR) 10Jbond: [C: 04-1] "see inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/670981 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[18:07:23] <wikibugs>	 (03PS3) 1001miki10: Disable ContentTranslation New article campaign in fiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672416 (https://phabricator.wikimedia.org/T277473)
[18:07:43] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:07:51] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers kubernetes1008.eqiad.wmnet, kubernetes1012.eqiad.wmnet, kubernetes1003.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1016.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1015.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[18:09:07] <icinga-wm>	 PROBLEM - LVS linkrecommendation-external eqiad port 4006/tcp - Link Recommendation- public release- linkrecommendation.svc.eqiad.wmnet IPv4 on linkrecommendation.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.23 and port 4006: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[18:09:21] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:10:01] <jynus>	 what's the right reaction to that?^
[18:10:42] <legoktm>	 I remember seeing something about linkrecommendation in phab, one sec
[18:10:45] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 179d9e5: mswiki: Enable Growth features in stealth mode (T277562; 1/2) (duration: 01m 11s)
[18:10:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:10:53] <stashbot>	 T277562: Deploy Growth features on Malay Wikipedia - https://phabricator.wikimedia.org/T277562
[18:11:43] <wikibugs>	 Puppet, SRE, Beta-Cluster-Infrastructure, Technical-Debt, Tracking-Neverending: Minimize differences between beta and production (Tracking) - https://phabricator.wikimedia.org/T87220 (Majavah)
[18:11:57] <legoktm>	 https://phabricator.wikimedia.org/T277297
[18:12:05] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers kubernetes1008.eqiad.wmnet, kubernetes1012.eqiad.wmnet, kubernetes1003.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1001.eqiad.wmnet, kubernetes1016.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1006.eqiad.
[18:12:05] <icinga-wm>	 down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[18:12:13] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized dblists/growthexperiments.dblist: 179d9e5: mswiki: Enable Growth features in stealth mode (T277562; 2/2) (duration: 01m 08s)
[18:12:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:12:33] <icinga-wm>	 RECOVERY - LVS linkrecommendation-external eqiad port 4006/tcp - Link Recommendation- public release- linkrecommendation.svc.eqiad.wmnet IPv4 on linkrecommendation.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 196 bytes in 5.943 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[18:12:34] <jynus>	 legoktm, thanks
[18:12:46] <jynus>	 If it has been happening for days, probably not an emergency
[18:12:52] <legoktm>	 https://phabricator.wikimedia.org/T277297
[18:12:57] <legoktm>	 I don't think the service is in active use yet
[18:13:04] <jynus>	 ah, much better
[18:13:08] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: decom mw2256 (was: mw2256 - CPU/board hardware issue) - https://phabricator.wikimedia.org/T263065 (wiki_willy) Hi @Dzahn - we typically change the status to "offline" after the server is unracked.
[18:13:09] <jynus>	 :-)
[18:13:15] <jynus>	 that is the part I didn't know
[18:13:53] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[18:16:33] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[18:17:18] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 44eddcc: hrwiki: Deploy Growth features to newcomers (T275684) (duration: 01m 08s)
[18:17:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:17:26] <stashbot>	 T275684: Deploy Growth features on Croatian Wikipedia - https://phabricator.wikimedia.org/T275684
[18:17:43] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Audit down ports - https://phabricator.wikimedia.org/T218751 (Cmjohnson) a:Cmjohnson→ayounsi @ayounsi I verified all of the ports listed in https://librenms.wikimedia.org/ports/state=down/hostname=asw/format=list_basic/ are not in service at the moment.  There were 2 i...
[18:18:08] <legoktm>	 left a comment for now
[18:18:37] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: decom mw2256 (was: mw2256 - CPU/board hardware issue) - https://phabricator.wikimedia.org/T263065 (Dzahn) @wiki_willy Oh, right, I got confused here myself and compared it to the servers that have been decom'ed but are still physically in racks. All is good the...
[18:19:44] <Urbanecm>	 jynus: legoktm: linkrecommendation is supposed to be used...soon :)
[18:19:50] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): cloudvirt1038: PCIe error - https://phabricator.wikimedia.org/T276922 (Andrew) Thanks Chris
[18:19:55] <jynus>	 yeah, no problem
[18:20:13] <jynus>	 it is just that when seeing lvs complain, normally it is a very bad thing :-)
[18:20:58] <wikibugs>	 SRE, serviceops: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (jijiki) >>! In T277711#6926081, @Joe wrote: > As far as mcrouter goes, the only non-brittle solution is to run it inside the pod, so solution 1. The reason is simple: restarti...
[18:21:19] <Urbanecm>	 definitely :)
[18:23:37] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Cory Massaro - https://phabricator.wikimedia.org/T277692 (cmassaro) I am able to access the bastion. I am not able to access stat1006.equiad.wmnet, though. I can provide the output from `ssh -v` if that would help.
[18:23:46] <Zabe>	 Urbanecm: Hey, I have a question if you have time. If I want to submit a patch for backporting to wmf/1.36.0-wmf.35, does the patch for the master branch already have to be merged, or is that not important?
[18:24:08] <legoktm>	 unless there are exceptional circumstances, it should already be merged into master
[18:24:57] <Majavah>	 do you have a specific example in mind?
[18:25:26] <Urbanecm>	 Zabe: it SHOULD be merged
[18:25:30] <Urbanecm>	 (really really really should)
[18:25:36] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Cory Massaro - https://phabricator.wikimedia.org/T277692 (Urbanecm) >>! In T277692#6926262, @cmassaro wrote: > I am able to access the bastion. I am not able to access stat1006.equiad.wmnet, though. I can provide the output fr...
[18:25:53] <Zabe>	 ok thx
[18:26:01] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers kubernetes1001.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1016.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kubernetes1015.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[18:26:26] <dancy>	 Zabe: Given that the warning only showed up once, let's wait for the normal processes.  We can do a backport during the train window if needed.
[18:26:37] <icinga-wm>	 PROBLEM - LVS linkrecommendation-external eqiad port 4006/tcp - Link Recommendation- public release- linkrecommendation.svc.eqiad.wmnet IPv4 on linkrecommendation.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.23 and port 4006: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[18:26:40] <dancy>	 (which is in about 30 minutes)
[18:27:09] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers kubernetes1016.eqiad.wmnet, kubernetes1006.eqiad.wmnet, kubernetes1004.eqiad.wmnet, kubernetes1015.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[18:28:04] <legoktm>	 !log re-enabled puppet on registry*
[18:28:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:28:15] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[18:28:18] <Majavah>	 Urbanecm: sadly RFC 6919 has defined "REALLY SHOULD NOT" but not "really really really should" :D
[18:28:29] <Urbanecm>	 :D
[18:29:02] <wikibugs>	 (PS6) Jbond: (WIP): netbase: first pass at parsing service::catalogue ports [puppet] - https://gerrit.wikimedia.org/r/673105
[18:29:29] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Cory Massaro - https://phabricator.wikimedia.org/T277692 (cmassaro) Got it. I did set up my SSH config as prescribed there, and `ssh bast1002.wikimedia.org` works as a result. When I try `ssh stat1006.eqiad.wmnet`, it looks li...
[18:29:53] <wikibugs>	 SRE, netops, Patch-For-Review: Auhoritative ports list - https://phabricator.wikimedia.org/T277146 (jbond) >>! In T277146#6913871, @Kormat wrote: > Just in case it's relevant, we use a range of ports for mariadb. Most (but not all) of them are [[ https://github.com/wikimedia/puppet/blob/da54cc6f29deb...
[18:30:26] <wikibugs>	 (CR) jerkins-bot: [V: -1] (WIP): netbase: first pass at parsing service::catalogue ports [puppet] - https://gerrit.wikimedia.org/r/673105 (owner: Jbond)
[18:31:33] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[18:31:49] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Cory Massaro - https://phabricator.wikimedia.org/T277692 (Dzahn) @cmassaro It seems to be a typo in the host name.  found in auth.log on bast1002  ` error: connect_to stat1006.equiad.wmnet: unknown        host (Name or service...
[18:35:16] <wikibugs>	 (CR) Effie Mouzeli: create helmfile.d structure (10 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) (owner: Mstyles)
[18:35:37] <icinga-wm>	 RECOVERY - LVS linkrecommendation-external eqiad port 4006/tcp - Link Recommendation- public release- linkrecommendation.svc.eqiad.wmnet IPv4 on linkrecommendation.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 193 bytes in 1.052 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[18:36:33] <wikibugs>	 (CR) Effie Mouzeli: "There are a few more things that we should look into, but let's fix them in the next iteration:)" [deployment-charts] - https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) (owner: Mstyles)
[18:36:56] <wikibugs>	 (CR) Legoktm: hiera: add dummy secrets for ML k8s workers (1 comment) [labs/private] - https://gerrit.wikimedia.org/r/672455 (https://phabricator.wikimedia.org/T272918) (owner: Klausman)
[18:37:28] <wikibugs>	 (CR) Effie Mouzeli: [C: -1] create helmfile.d structure [deployment-charts] - https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) (owner: Mstyles)
[18:37:56] <wikibugs>	 (PS1) Legoktm: ml_k8s: Remove docker registry password [labs/private] - https://gerrit.wikimedia.org/r/673333
[18:38:08] <wikibugs>	 (PS1) Dzahn: site/conftool-data: decom mw2239 through mw2242, rack A4 [puppet] - https://gerrit.wikimedia.org/r/673334 (https://phabricator.wikimedia.org/T277119)
[18:38:35] <wikibugs>	 (CR) Legoktm: [V: +2 C: +2] ml_k8s: Remove docker registry password [labs/private] - https://gerrit.wikimedia.org/r/673333 (owner: Legoktm)
[18:42:03] <wikibugs>	 (PS5) Sharvaniharan: Add event stream config for android.image_recommendations_interaction [mediawiki-config] - https://gerrit.wikimedia.org/r/673005
[18:44:12] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: decom mw2256 (was: mw2256 - CPU/board hardware issue) - https://phabricator.wikimedia.org/T263065 (wiki_willy) No worries @Dzahn, thanks for checking.  =)
[18:44:38] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Cory Massaro - https://phabricator.wikimedia.org/T277692 (cmassaro) Ahhhh sorry, that is embarrassing. It's all good now. Thank you!
[18:44:53] <wikibugs>	 (PS1) Bstorm: prometheus: re-order sudo access for the cron [puppet] - https://gerrit.wikimedia.org/r/673336
[18:45:07] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Cory Massaro - https://phabricator.wikimedia.org/T277692 (cmassaro) >>! In T277692#6924028, @JAllemandou wrote: > Thanks for letting me know @Ottomata  :) > @cmassaro : Let's sync on the work you wish to accomplish, as wikitex...
[18:45:10] <wikibugs>	 (CR) Sharvaniharan: Add event stream config for android.image_recommendations_interaction (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/673005 (owner: Sharvaniharan)
[18:46:11] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] prometheus: re-order sudo access for the cron [puppet] - https://gerrit.wikimedia.org/r/673336 (owner: Bstorm)
[18:46:46] <wikibugs>	 (CR) Bstorm: [C: +2] prometheus: re-order sudo access for the cron [puppet] - https://gerrit.wikimedia.org/r/673336 (owner: Bstorm)
[18:46:59] <wikibugs>	 (PS6) Sharvaniharan: Add event stream config for android.image_recommendations_interaction [mediawiki-config] - https://gerrit.wikimedia.org/r/673005
[18:52:13] <wikibugs>	 (CR) Ottomata: [C: +1] Add event stream config for android.image_recommendations_interaction [mediawiki-config] - https://gerrit.wikimedia.org/r/673005 (owner: Sharvaniharan)
[18:56:01] <wikibugs>	 Puppet, SRE, Beta-Cluster-Infrastructure, Technical-Debt, Tracking-Neverending: Minimize differences between beta and production (Tracking) - https://phabricator.wikimedia.org/T87220 (Krinkle)
[18:57:07] <wikibugs>	 SRE, MediaWiki-General, observability, serviceops, and 2 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (AMooney) p:Medium→High
[18:57:53] <wikibugs>	 SRE, Beta-Cluster-Infrastructure, Technical-Debt, Tracking-Neverending: Minimize infrastructure differences between Beta Cluster and production - https://phabricator.wikimedia.org/T87220 (Krinkle)
[18:58:31] <wikibugs>	 (CR) Dzahn: [C: +2] site/conftool-data: decom mw2239 through mw2242, rack A4 [puppet] - https://gerrit.wikimedia.org/r/673334 (https://phabricator.wikimedia.org/T277119) (owner: Dzahn)
[18:59:59] <wikibugs>	 (PS2) Majavah: Remove deploymentwiki configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/649609 (https://phabricator.wikimedia.org/T198673)
[19:00:04] <jouncebot>	 dancy and brennen: Dear deployers, time to do the Mediawiki train - American Version deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210318T1900).
[19:00:51] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Cory Massaro - https://phabricator.wikimedia.org/T277692 (cmassaro) Open→Resolved
[19:01:37] <wikibugs>	 (PS1) Ahmon Dancy: group2 wikis to 1.36.0-wmf.35 [mediawiki-config] - https://gerrit.wikimedia.org/r/673340
[19:01:39] <wikibugs>	 (CR) Ahmon Dancy: [C: +2] group2 wikis to 1.36.0-wmf.35 [mediawiki-config] - https://gerrit.wikimedia.org/r/673340 (owner: Ahmon Dancy)
[19:02:45] <wikibugs>	 (Merged) jenkins-bot: group2 wikis to 1.36.0-wmf.35 [mediawiki-config] - https://gerrit.wikimedia.org/r/673340 (owner: Ahmon Dancy)
[19:04:35] <logmsgbot>	 !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.36.0-wmf.35
[19:04:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:12:33] <wikibugs>	 (PS1) Jgreen: add A/PTR records for payments100[5-8].frack.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/673342 (https://phabricator.wikimedia.org/T266481)
[19:15:02] <wikibugs>	 (CR) Jgreen: [C: +2] add A/PTR records for payments100[5-8].frack.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/673342 (https://phabricator.wikimedia.org/T266481) (owner: Jgreen)
[19:17:20] <wikibugs>	 (CR) Jgreen: [C: +2] Add ssl check for frdata2001 [puppet] - https://gerrit.wikimedia.org/r/673132 (https://phabricator.wikimedia.org/T260183) (owner: Dwisehaupt)
[19:19:31] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={ircd,netbox_device_statistics} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:21:05] <wikibugs>	 SRE, fundraising-tech-ops, Patch-For-Review: (Need By: TBD) rack/setup/install payments100[5-8] - https://phabricator.wikimedia.org/T266481 (Jgreen)
[19:21:56] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:24:51] <wikibugs>	 (PS3) Legoktm: sre.ganeti.makevm: Automatically generate fqdn from hostname [cookbooks] - https://gerrit.wikimedia.org/r/668867 (https://phabricator.wikimedia.org/T276516)
[19:25:03] <wikibugs>	 (CR) Legoktm: [C: +2] sre.ganeti.makevm: Automatically generate fqdn from hostname [cookbooks] - https://gerrit.wikimedia.org/r/668867 (https://phabricator.wikimedia.org/T276516) (owner: Legoktm)
[19:26:05] <wikibugs>	 (PS1) Ryan Kemper: elasticsearch: combined plugin upgrade + reboot [cookbooks] - https://gerrit.wikimedia.org/r/673343 (https://phabricator.wikimedia.org/T277792)
[19:28:59] <wikibugs>	 (CR) Ryan Kemper: "This implements the combined plugin upgrade + reboot functionally as a new cookbook. We'll want to circle back and refactor this because t" [cookbooks] - https://gerrit.wikimedia.org/r/673343 (https://phabricator.wikimedia.org/T277792) (owner: Ryan Kemper)
[19:35:54] <icinga-wm>	 RECOVERY - Check systemd state on ml-serve2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:40:46] <icinga-wm>	 PROBLEM - Check systemd state on ml-serve2002 is CRITICAL: CRITICAL - degraded: The following units failed: docker.service,docker.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:45:10] <wikibugs>	 (CR) jerkins-bot: [V: -1] sre.ganeti.makevm: Automatically generate fqdn from hostname [cookbooks] - https://gerrit.wikimedia.org/r/668867 (https://phabricator.wikimedia.org/T276516) (owner: Legoktm)
[19:46:34] <wikibugs>	 (CR) jerkins-bot: [V: -1] elasticsearch: combined plugin upgrade + reboot [cookbooks] - https://gerrit.wikimedia.org/r/673343 (https://phabricator.wikimedia.org/T277792) (owner: Ryan Kemper)
[19:50:19] <wikibugs>	 (PS1) Andrew Bogott: nova vendordata.txt: try to fix signing of the wikimedia apt repo [puppet] - https://gerrit.wikimedia.org/r/673349 (https://phabricator.wikimedia.org/T271273)
[19:51:21] <wikibugs>	 (CR) Andrew Bogott: [C: +2] nova vendordata.txt: try to fix signing of the wikimedia apt repo [puppet] - https://gerrit.wikimedia.org/r/673349 (https://phabricator.wikimedia.org/T271273) (owner: Andrew Bogott)
[19:56:20] <dancy>	 Zabe: Can you nag someone for a review on https://gerrit.wikimedia.org/r/c/mediawiki/extensions/LiquidThreads/+/673325 ?
[20:00:05] <wikibugs>	 (PS1) Andrew Bogott: nova vendordata: install gpg and dirmngr earlier in the cloud-init [puppet] - https://gerrit.wikimedia.org/r/673351 (https://phabricator.wikimedia.org/T271273)
[20:01:25] <wikibugs>	 (CR) Andrew Bogott: [C: +2] nova vendordata: install gpg and dirmngr earlier in the cloud-init [puppet] - https://gerrit.wikimedia.org/r/673351 (https://phabricator.wikimedia.org/T271273) (owner: Andrew Bogott)
[20:05:11] <wikibugs>	 (PS1) Herron: add dummy grafana api key to pacify PCC [labs/private] - https://gerrit.wikimedia.org/r/673352
[20:06:24] <wikibugs>	 (CR) Herron: [V: +2 C: +2] add dummy grafana api key to pacify PCC [labs/private] - https://gerrit.wikimedia.org/r/673352 (owner: Herron)
[20:07:53] <wikibugs>	 (PS1) Andrew Bogott: nova vendordata: install gnupg instead of gpg [puppet] - https://gerrit.wikimedia.org/r/673353 (https://phabricator.wikimedia.org/T271273)
[20:09:14] <wikibugs>	 (CR) Andrew Bogott: [C: +2] nova vendordata: install gnupg instead of gpg [puppet] - https://gerrit.wikimedia.org/r/673353 (https://phabricator.wikimedia.org/T271273) (owner: Andrew Bogott)
[20:10:51] <wikibugs>	 (CR) Gehel: [C: +1] "LGTM!" [cookbooks] - https://gerrit.wikimedia.org/r/673343 (https://phabricator.wikimedia.org/T277792) (owner: Ryan Kemper)
[20:11:06] <gehel>	 ryankemper: ^
[20:11:20] <ryankemper>	 gehel: thanks
[20:15:49] <wikibugs>	 (PS2) Kosta Harlan: linkrecommendation: Bump memory limit and image version [deployment-charts] - https://gerrit.wikimedia.org/r/673006 (https://phabricator.wikimedia.org/T277297)
[20:16:25] <wikibugs>	 (CR) Herron: "https://puppet-compiler.wmflabs.org/compiler1001/28673/" [puppet] - https://gerrit.wikimedia.org/r/671283 (owner: Herron)
[20:19:12] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:19:31] <Zabe>	 DannyS712: See ^. Do you have time to look at https://gerrit.wikimedia.org/r/c/mediawiki/extensions/LiquidThreads/+/673325 ?
[20:20:00] <wikibugs>	 (CR) DannyS712: [C: +1] "LGTM pending deployment of the flagged revs patch to all wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/673306 (owner: Jforrester)
[20:20:36] <legoktm>	 +2'd
[20:20:48] <Zabe>	 legoktm: thx
[20:21:05] <legoktm>	 LQT is like, whatever comes after living on life support but not dead yet
[20:21:16] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:22:18] <wikibugs>	 (CR) Legoktm: [C: +2] "recheck" [cookbooks] - https://gerrit.wikimedia.org/r/668867 (https://phabricator.wikimedia.org/T276516) (owner: Legoktm)
[20:22:40] <Majavah>	 legoktm: maybe they should have used a short Gerrit URL without the project name visible :P
[20:22:51] <legoktm>	 hehe
[20:25:21] <Zabe>	 dancy: patch is merged
[20:25:59] <dancy>	 Awesome. Do you have a before/after test to verify it?
[20:27:02] <wikibugs>	 (CR) Ahmon Dancy: [C: +2] Use of Article::getId was deprecated in MediaWiki 1.35 [extensions/LiquidThreads] (wmf/1.36.0-wmf.35) - https://gerrit.wikimedia.org/r/673114 (https://phabricator.wikimedia.org/T277772) (owner: Zabe)
[20:27:40] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: 2021-03-31) rack/setup/install cloudgw100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T272403 (RobH) >>! In T272403#6898639, @Jclark-ctr wrote: > @Cmjohnson  > cloudgw1001 c8 u29 ports13/19  cableid #5322 > cloudgw1002. d5...
[20:27:55] <Zabe>	 no, because I don't really know how to test if there still is this deprecation warning.
[20:28:32] <dancy>	 hmm. ok.  I +2'd the .35 cherry pick.  Waiting for it to merge.
[20:31:20] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: 2021-03-31) rack/setup/install cloudgw100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T272403 (RobH) a:RobH→aborrero The networking requirements for this request are wholly unclear at the time of this comment.  The...
[20:33:09] <wikibugs>	 (Merged) jenkins-bot: Use of Article::getId was deprecated in MediaWiki 1.35 [extensions/LiquidThreads] (wmf/1.36.0-wmf.35) - https://gerrit.wikimedia.org/r/673114 (https://phabricator.wikimedia.org/T277772) (owner: Zabe)
[20:34:30] <legoktm>	 based on the stacktrace on the task it looks like it shows up if you reply to a post
[20:37:02] <logmsgbot>	 !log dancy@deploy1002 Synchronized php-1.36.0-wmf.35/extensions/LiquidThreads/classes/Thread.php: (no justification provided) (duration: 01m 05s)
[20:37:08] <dancy>	 Zabe: deployed
[20:37:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:38:28] <Zabe>	 dancy: Thanks for your help
[20:38:41] <dancy>	 Thanks for the fix!
[20:42:31] <wikibugs>	 (CR) jerkins-bot: [V: -1] sre.ganeti.makevm: Automatically generate fqdn from hostname [cookbooks] - https://gerrit.wikimedia.org/r/668867 (https://phabricator.wikimedia.org/T276516) (owner: Legoktm)
[20:43:13] <wikibugs>	 (PS1) Andrew Bogott: nova vendordata: disable ssh password logins with cloud-init [puppet] - https://gerrit.wikimedia.org/r/673357
[20:44:00] <wikibugs>	 (CR) Andrew Bogott: [C: +2] nova vendordata: disable ssh password logins with cloud-init [puppet] - https://gerrit.wikimedia.org/r/673357 (owner: Andrew Bogott)
[20:50:11] <wikibugs>	 (PS1) Andrew Bogott: Revert "nova vendordata.txt: try to fix signing of the wikimedia apt repo" [puppet] - https://gerrit.wikimedia.org/r/673359
[20:51:02] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Revert "nova vendordata.txt: try to fix signing of the wikimedia apt repo" [puppet] - https://gerrit.wikimedia.org/r/673359 (owner: Andrew Bogott)
[21:07:48] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refine_sanitize_eventlogging_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:11:21] <wikibugs>	 (PS1) Andrew Bogott: nova vendordata: further attempts to get gnupg installed up front [puppet] - https://gerrit.wikimedia.org/r/673362
[21:12:19] <wikibugs>	 (CR) Andrew Bogott: [C: +2] nova vendordata: further attempts to get gnupg installed up front [puppet] - https://gerrit.wikimedia.org/r/673362 (owner: Andrew Bogott)
[21:13:51] <wikibugs>	 (PS27) Jbond: netbase: add new module to manage /etc/services [puppet] - https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146)
[21:14:45] <wikibugs>	 (CR) jerkins-bot: [V: -1] netbase: add new module to manage /etc/services [puppet] - https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) (owner: Jbond)
[21:17:23] <wikibugs>	 (CR) Mstyles: create helmfile.d structure (10 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) (owner: Mstyles)
[21:22:48] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:32:59] <wikibugs>	 (PS1) Andrew Bogott: nova vendordata: fix apt-get command [puppet] - https://gerrit.wikimedia.org/r/673364
[21:35:00] <wikibugs>	 (CR) Andrew Bogott: [C: +2] nova vendordata: fix apt-get command [puppet] - https://gerrit.wikimedia.org/r/673364 (owner: Andrew Bogott)
[21:41:11] <wikibugs>	 SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (Dzahn) You can start installing new servers in rack A3 in place of mw2215 through mw2238:  https://netbox.wikimedia.org/dcim/devices/?q=mw2&rack_id=45&m...
[21:46:23] <wikibugs>	 (PS7) Effie Mouzeli: mediawiki::mcrouter: add onhost memcached unix socket support [puppet] - https://gerrit.wikimedia.org/r/663565 (https://phabricator.wikimedia.org/T273115)
[21:49:12] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to sites from Google Search Console - https://phabricator.wikimedia.org/T277602 (CGlenn) Hello @Volans ! Thank you for pointing that out to me. Just to confirm, I can access am.wikipedia.org in GSC.  Is possible we can add am.m.wikipedia.org as well?  Or would...
[22:00:55] <wikibugs>	 (PS1) Dzahn: site/conftool-data: turn mw2251,mw2252 into canaries [puppet] - https://gerrit.wikimedia.org/r/673367 (https://phabricator.wikimedia.org/T277780)
[22:00:57] <wikibugs>	 (PS1) Dzahn: site/conftool-data: decom mw2244,mw2245, former canary servers [puppet] - https://gerrit.wikimedia.org/r/673368 (https://phabricator.wikimedia.org/T277780)
[22:04:23] <wikibugs>	 (PS28) Jbond: netbase: add new module to manage /etc/services [puppet] - https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146)
[22:05:07] <wikibugs>	 SRE, netops, Patch-For-Review: Authoritative ports list - https://phabricator.wikimedia.org/T277146 (Reedy)
[22:05:13] <wikibugs>	 (CR) jerkins-bot: [V: -1] netbase: add new module to manage /etc/services [puppet] - https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) (owner: Jbond)
[22:10:51] <wikibugs>	 (PS1) Brennen Bearnes: ActorStore::getActorById - fall back to master. [core] (wmf/1.36.0-wmf.35) - https://gerrit.wikimedia.org/r/673115 (https://phabricator.wikimedia.org/T277795)
[22:13:50] <wikibugs>	 (CR) Ppchelko: [C: +1] "The commit message is a bit misleading now, cause there's no TODO anymore, but overall it should fix the prod error." [core] (wmf/1.36.0-wmf.35) - https://gerrit.wikimedia.org/r/673115 (https://phabricator.wikimedia.org/T277795) (owner: Brennen Bearnes)
[22:16:28] <brennen>	 Pchelolo: i can go ahead and sling above out, assuming zuul is happy - is there any kind of mwdebug testing possible / needed?  i guess my assumption is it's not plausible to reproduce...
[22:17:23] <brennen>	 jouncebot now
[22:17:23] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 42 minute(s)
[22:17:38] <wikibugs>	 (CR) Brennen Bearnes: [C: +2] "> Patch Set 1: Code-Review+1" [core] (wmf/1.36.0-wmf.35) - https://gerrit.wikimedia.org/r/673115 (https://phabricator.wikimedia.org/T277795) (owner: Brennen Bearnes)
[22:18:32] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:18:50] <marxarelli>	 brennen: mind if i do a blubberoid deploy? it shouldn't be very eventful
[22:18:59] <brennen>	 marxarelli: go for it
[22:19:04] <marxarelli>	 cool
[22:19:35] <Pchelolo>	 brennen: no clue how to reproduce it reliably
[22:19:38] <wikibugs>	 (PS1) PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/673371
[22:20:34] <brennen>	 Pchelolo: yeah, figured.  guess we'll just keep an eye for errors to drop off.
[22:20:35] <Pchelolo>	 I think the best way is to just wait and see if logstash errors are gone
[22:20:48] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:21:46] <wikibugs>	 (CR) Dduvall: [C: +2] blubberoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/673371 (owner: PipelineBot)
[22:22:22] * brennen twiddles thumbs and waits for zuul.
[22:23:42] <wikibugs>	 (Merged) jenkins-bot: blubberoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/673371 (owner: PipelineBot)
[22:23:49] <wikibugs>	 (Abandoned) Dduvall: blubberoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/636081 (owner: PipelineBot)
[22:23:57] <wikibugs>	 (Abandoned) Dduvall: blubberoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/656521 (owner: PipelineBot)
[22:24:03] <wikibugs>	 (Abandoned) Dduvall: blubberoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/658682 (owner: PipelineBot)
[22:24:12] <wikibugs>	 (Abandoned) Dduvall: blubberoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/667045 (owner: PipelineBot)
[22:24:38] <marxarelli>	 sorry ^ just cleanup
[22:25:25] <logmsgbot>	 !log dduvall@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' .
[22:25:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:27:16] <wikibugs>	 (PS2) Dzahn: site/conftool-data: turn mw2251,mw2252 into canaries [puppet] - https://gerrit.wikimedia.org/r/673367 (https://phabricator.wikimedia.org/T277780)
[22:29:09] <wikibugs>	 (PS29) Jbond: netbase: add new module to manage /etc/services [puppet] - https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146)
[22:29:29] <brennen>	 jouncebot next
[22:29:30] <jouncebot>	 In 0 hour(s) and 30 minute(s): Evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210318T2300)
[22:29:45] <logmsgbot>	 !log dduvall@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
[22:29:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:30:07] <wikibugs>	 (CR) jerkins-bot: [V: -1] netbase: add new module to manage /etc/services [puppet] - https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) (owner: Jbond)
[22:30:53] <logmsgbot>	 !log dduvall@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
[22:30:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:37:04] <wikibugs>	 (PS7) Mstyles: create helmfile.d structure [deployment-charts] - https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006)
[22:41:55] <wikibugs>	 (PS1) Jdlrobson: Don't define a default icon [mediawiki-config] - https://gerrit.wikimedia.org/r/673372
[22:43:45] <wikibugs>	 (CR) jerkins-bot: [V: -1] Don't define a default icon [mediawiki-config] - https://gerrit.wikimedia.org/r/673372 (owner: Jdlrobson)
[22:48:43] <wikibugs>	 (Merged) jenkins-bot: ActorStore::getActorById - fall back to master. [core] (wmf/1.36.0-wmf.35) - https://gerrit.wikimedia.org/r/673115 (https://phabricator.wikimedia.org/T277795) (owner: Brennen Bearnes)
[22:48:56] <wikibugs>	 (CR) Jdlrobson: "I can't backport this today Legoktm, Greg but it seems reasonable that the WMF logo should only apply to office wiki, and not be the defau" [mediawiki-config] - https://gerrit.wikimedia.org/r/673372 (owner: Jdlrobson)
[22:49:05] <wikibugs>	 (PS2) Jdlrobson: Don't define a default icon [mediawiki-config] - https://gerrit.wikimedia.org/r/673372
[22:50:14] <wikibugs>	 (CR) jerkins-bot: [V: -1] Don't define a default icon [mediawiki-config] - https://gerrit.wikimedia.org/r/673372 (owner: Jdlrobson)
[22:51:43] <brennen>	 going ahead with that backport.
[22:52:52] <wikibugs>	 (PS3) Jdlrobson: Don't define a default icon [mediawiki-config] - https://gerrit.wikimedia.org/r/673372
[22:53:04] <logmsgbot>	 !log brennen@deploy1002 Synchronized php-1.36.0-wmf.35/includes/specials/SpecialContributions.php: Backport: [[gerrit:673115|ActorStore::getActorById - fall back to master. (T277795)]] (duration: 01m 07s)
[22:53:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:53:12] <stashbot>	 T277795: User not found by actor ID: [id] - https://phabricator.wikimedia.org/T277795
[22:53:27] <wikibugs>	 SRE, ops-codfw, serviceops, Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02  - https://phabricator.wikimedia.org/T277780 (Dzahn)
[22:54:43] <wikibugs>	 SRE, ops-codfw, serviceops, Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (Dzahn) 6 out of 8 are jobrunners.  Maybe best to wait for T274171 to have started and turn some new servers in A3 into jobrunners, then remove these in A4 af...
[22:55:15] <wikibugs>	 SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (Dzahn)
[22:55:20] <wikibugs>	 SRE, ops-codfw, serviceops, Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (Dzahn)
[22:55:23] <wikibugs>	 SRE, ops-codfw, serviceops, Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (Dzahn) Open→Stalled p:Triage→High
[22:57:24] <wikibugs>	 (CR) Dzahn: "Well, I have nothing against this but also I can't help to think "but he literally just merged the systemd::service class" which would be " [puppet] - https://gerrit.wikimedia.org/r/673228 (owner: Jbond)
[22:59:00] <wikibugs>	 (CR) Dzahn: "I don't think the notes URL is a difference here, but the IRC part is. It's nice to know early what it is about rather than first having t" [puppet] - https://gerrit.wikimedia.org/r/673228 (owner: Jbond)
[22:59:11] <wikibugs>	 (CR) Bstorm: [C: +2] "The #1 request so far is that I remove the nested ternary expressions. On that note, since I know that part is working from the last time " [puppet] - https://gerrit.wikimedia.org/r/672540 (https://phabricator.wikimedia.org/T276284) (owner: Bstorm)
[22:59:26] <wikibugs>	 (PS1) Dduvall: pipeline: Use build environment HTTP proxy for APT sources [mediawiki-config] - https://gerrit.wikimedia.org/r/673375 (https://phabricator.wikimedia.org/T277109)
[23:00:04] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy Evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210318T2300).
[23:00:05] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[23:02:05] <wikibugs>	 (CR) Krinkle: Don't define a default icon (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/673372 (owner: Jdlrobson)
[23:02:58] <wikibugs>	 (CR) Jdlrobson: Don't define a default icon (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/673372 (owner: Jdlrobson)
[23:03:57] <wikibugs>	 (PS4) Jdlrobson: Don't define a default icon [mediawiki-config] - https://gerrit.wikimedia.org/r/673372 (https://phabricator.wikimedia.org/T274199)
[23:06:00] <wikibugs>	 (CR) Krinkle: Don't define a default icon (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/673372 (https://phabricator.wikimedia.org/T274199) (owner: Jdlrobson)
[23:06:39] <wikibugs>	 (PS30) Jbond: netbase: add new module to manage /etc/services [puppet] - https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146)
[23:06:41] <wikibugs>	 (CR) Krinkle: Don't define a default icon (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/673372 (https://phabricator.wikimedia.org/T274199) (owner: Jdlrobson)
[23:06:44] <brennen>	 !log train status: 1.36.0-wmf.35 (T274939) stable on all wikis after deploy of hotfix for T277795
[23:06:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:06:54] <stashbot>	 T277795: User not found by actor ID: [id] - https://phabricator.wikimedia.org/T277795
[23:06:54] <stashbot>	 T274939: 1.36.0-wmf.35 deployment blockers - https://phabricator.wikimedia.org/T274939
[23:07:31] <wikibugs>	 (CR) jerkins-bot: [V: -1] netbase: add new module to manage /etc/services [puppet] - https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) (owner: Jbond)
[23:07:36] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[23:08:22] <wikibugs>	 (PS31) Jbond: netbase: add new module to manage /etc/services [puppet] - https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146)
[23:09:21] <wikibugs>	 (CR) Jbond: netbase: add new module to manage /etc/services (1 comment) [puppet] - https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) (owner: Jbond)
[23:13:20] <wikibugs>	 (PS5) Jdlrobson: Don't define a default icon [mediawiki-config] - https://gerrit.wikimedia.org/r/673372
[23:13:33] <wikibugs>	 (CR) Jdlrobson: Don't define a default icon (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/673372 (owner: Jdlrobson)
[23:14:16] <wikibugs>	 (PS32) Jbond: netbase: add new module to manage /etc/services [puppet] - https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146)
[23:14:28] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[23:14:47] <wikibugs>	 (CR) Dduvall: [C: +1] "Tested successfully by running https://releases-jenkins.wikimedia.org/job/mediawiki-config-pipeline-wmf-publish/28/console and verifying t" [mediawiki-config] - https://gerrit.wikimedia.org/r/673375 (https://phabricator.wikimedia.org/T277109) (owner: Dduvall)
[23:15:07] <wikibugs>	 (PS1) Legoktm: docker_registry_ha: Add documentation to profile class [puppet] - https://gerrit.wikimedia.org/r/673376
[23:15:09] <wikibugs>	 (CR) jerkins-bot: [V: -1] netbase: add new module to manage /etc/services [puppet] - https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) (owner: Jbond)
[23:17:20] <wikibugs>	 (CR) Jeena Huneidi: [C: +2] pipeline: Use build environment HTTP proxy for APT sources [mediawiki-config] - https://gerrit.wikimedia.org/r/673375 (https://phabricator.wikimedia.org/T277109) (owner: Dduvall)
[23:18:15] <wikibugs>	 (Merged) jenkins-bot: pipeline: Use build environment HTTP proxy for APT sources [mediawiki-config] - https://gerrit.wikimedia.org/r/673375 (https://phabricator.wikimedia.org/T277109) (owner: Dduvall)
[23:18:53] <wikibugs>	 (PS33) Jbond: netbase: add new module to manage /etc/services [puppet] - https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146)
[23:21:52] <marxarelli>	 since the current backports window is empty, i'm going to deploy ^. it's a noop pipeline-only mediawiki-config change. cc: longma 
[23:22:08] <longma>	 hi
[23:22:09] <legoktm>	 :thumbsup:
[23:22:12] <brennen>	 marxarelli: ack.
[23:25:39] <logmsgbot>	 !log dduvall@deploy1002 Synchronized .pipeline: config: [[gerrit:673375|Use build environment HTTP proxy for APT sources (T277109)]] (duration: 01m 02s)
[23:25:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:25:48] <stashbot>	 T277109: Containers on releases hosts cannot update apt cache from non-WMF sources - https://phabricator.wikimedia.org/T277109
[23:26:06] <marxarelli>	 legoktm, brennen: done. thanks!
[23:26:24] <legoktm>	 I'm not sure I did anything :p
[23:26:34] <longma>	 :P
[23:26:57] <marxarelli>	 haha, well you have my thanks nonetheless :p
[23:27:04] <marxarelli>	 deal with it
[23:29:12] <brennen>	 moral support. :)
[23:30:24] <wikibugs>	 (PS1) Cwhite: logstash: add and enable dlq max_bytes workaround [puppet] - https://gerrit.wikimedia.org/r/673377 (https://phabricator.wikimedia.org/T277775)
[23:30:30] <wikibugs>	 SRE, MediaWiki-extensions-Score, Security-Team, Wikimedia-General-or-Unknown, and 3 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (RandomCanadian) Would it be possible to implement the temporary solutions as described at [[ https://en.wiki...
[23:30:49] <wikibugs>	 (CR) jerkins-bot: [V: -1] logstash: add and enable dlq max_bytes workaround [puppet] - https://gerrit.wikimedia.org/r/673377 (https://phabricator.wikimedia.org/T277775) (owner: Cwhite)
[23:33:11] <brennen>	 gah, i think i may have synced the wrong file earlier.
[23:33:48] <wikibugs>	 (PS2) Cwhite: logstash: add and enable dlq max_bytes workaround [puppet] - https://gerrit.wikimedia.org/r/673377 (https://phabricator.wikimedia.org/T277775)
[23:33:55] <brennen>	 yep.  going to rectify.
[23:34:16] <wikibugs>	 (CR) jerkins-bot: [V: -1] logstash: add and enable dlq max_bytes workaround [puppet] - https://gerrit.wikimedia.org/r/673377 (https://phabricator.wikimedia.org/T277775) (owner: Cwhite)
[23:35:42] <logmsgbot>	 !log brennen@deploy1002 Synchronized php-1.36.0-wmf.35/includes/user/ActorStore.php: Backport: [[gerrit:673115|ActorStore::getActorById - fall back to master. (T277795)]] (duration: 00m 58s)
[23:35:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:35:50] <stashbot>	 T277795: User not found by actor ID: [id] - https://phabricator.wikimedia.org/T277795
[23:38:51] <logmsgbot>	 !log brennen@deploy1002 Synchronized php-1.36.0-wmf.35/includes/user/ActorStore.php: Backport: [[gerrit:673115|ActorStore::getActorById - fall back to master. (T277795)]] (duration: 00m 57s)
[23:38:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:45:19] <wikibugs>	 (PS3) Cwhite: logstash: add and enable dlq max_bytes workaround [puppet] - https://gerrit.wikimedia.org/r/673377 (https://phabricator.wikimedia.org/T277775)
[23:46:43] <wikibugs>	 (PS6) Legoktm: Don't define a default icon [mediawiki-config] - https://gerrit.wikimedia.org/r/673372 (https://phabricator.wikimedia.org/T274199) (owner: Jdlrobson)
[23:46:54] <legoktm>	 brennen: OK if I sync a config change?
[23:47:05] <brennen>	 legoktm: yeah, you're clear.
[23:47:12] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:47:28] <brennen>	 i'm signing off for the day, train looks good.
[23:48:02] <wikibugs>	 (PS7) Legoktm: Don't define a default icon [mediawiki-config] - https://gerrit.wikimedia.org/r/673372 (https://phabricator.wikimedia.org/T274199) (owner: Jdlrobson)
[23:48:41] <wikibugs>	 (CR) Legoktm: "PS5: Fixed ordering of special projects list to be alphabetical, and left a pointer to the phab task about why the default is null." [mediawiki-config] - https://gerrit.wikimedia.org/r/673372 (https://phabricator.wikimedia.org/T274199) (owner: Jdlrobson)
[23:49:30] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:50:54] <wikibugs>	 (CR) Legoktm: [C: +2] Don't define a default icon [mediawiki-config] - https://gerrit.wikimedia.org/r/673372 (https://phabricator.wikimedia.org/T274199) (owner: Jdlrobson)
[23:51:45] <wikibugs>	 (Merged) jenkins-bot: Don't define a default icon [mediawiki-config] - https://gerrit.wikimedia.org/r/673372 (https://phabricator.wikimedia.org/T274199) (owner: Jdlrobson)
[23:52:33] <legoktm>	 marxarelli: I merged your change in /srv/mediawiki-staging, so I believe I've officially earned the thanks now ;)
[23:53:22] <legoktm>	 tested config patch on mwdebug1002, the correct Meta logo is back
[23:53:32] <marxarelli>	 legoktm: oh. which change?
[23:53:35] <wikibugs>	 (CR) Cwhite: [C: +1] "first steps LGTM" [puppet] - https://gerrit.wikimedia.org/r/671283 (owner: Herron)
[23:53:53] <legoktm>	 the pipeline one apt proxy one 
[23:54:11] <marxarelli>	 er... wasn't it already merge?
[23:54:14] <marxarelli>	 merged
[23:54:14] <legoktm>	 well, pulled it in, not merged
[23:54:24] <legoktm>	 git fetch origin; git rebase origin/master
[23:54:32] <marxarelli>	 oh boy... haha, k thanks!
[23:54:37] <legoktm>	 :D
[23:54:50] <marxarelli>	 i must have done `git log` to compare and then forgot to rebase :/
[23:55:29] <legoktm>	 np :))
[23:56:20] <logmsgbot>	 !log legoktm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Don't define a default icon (T274199) (duration: 00m 57s)
[23:56:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:56:28] <stashbot>	 T274199: With new Vector skin and Timeless, Meta, Wikimania and Wikitech logos are replaced by the WM Foundation logo - https://phabricator.wikimedia.org/T274199
[23:57:55] <wikibugs>	 (CR) Dave Pifke: "Huh, I didn't realize this never got merged.  Thanks for picking it up!" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/623068 (https://phabricator.wikimedia.org/T244776) (owner: Dave Pifke)