[00:00:04] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for Evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210115T0000).
[00:00:05] <jouncebot>	 James_F and MatmaRex: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:01:06] <James_F>	 (Already done.)
[00:01:10] <MatmaRex>	 oh, oops
[00:01:23] <James_F>	 MatmaRex: No worries, I marked them as done already.
[00:01:26] <MatmaRex>	 ah, you marked them as done on the schedule. alright
[00:01:33] <MatmaRex>	 thanks
[00:07:51] <icinga-wm>	 ACKNOWLEDGEMENT - PHP opcache health on mw2236 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn recent reimage makes it normal https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[00:07:51] <icinga-wm>	 ACKNOWLEDGEMENT - PHP opcache health on mw2269 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn recent reimage makes it normal https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[00:07:51] <icinga-wm>	 ACKNOWLEDGEMENT - PHP opcache health on mw2270 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn recent reimage makes it normal https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[00:21:59] <DannyS712>	 RoanKattouw Niharika Urbanecm is it too late to add something? I'd like to backport https://gerrit.wikimedia.org/r/c/mediawiki/extensions/AbuseFilter/+/655417 to the currently deployed branch
[00:27:43] <RoanKattouw>	 DannyS712: How urgent is that? It contains an i18n change, so it would require a full scap
[00:30:04] <DannyS712>	 it would be very helpful to oversighters, since the functionality to change the visibility of a hit from the details view, which is usually how these are reported, is currently broken. It's *possible* to work around this, but it takes a bit of time.
[00:31:58] <DannyS712>	 I think it's worth it for a full scap
[00:34:41] <James_F>	 Full scap out of hours before a holiday weekend? Eh.
[00:35:20] <DannyS712>	 out of hours? It's the middle of the backport window... also there is a holiday?
[00:36:07] <wikibugs>	 (PS1) DannyS712: Restore hide link when viewing single AbuseLog entries [extensions/AbuseFilter] (wmf/1.36.0-wmf.26) - https://gerrit.wikimedia.org/r/655943 (https://phabricator.wikimedia.org/T271667)
[00:36:10] <James_F>	 DannyS712: Monday is a no-deploy day.
[00:36:34] <James_F>	 DannyS712: By the time CI finishes on the patch we'll be way past the backport window.
[00:37:04] <DannyS712>	 oh, MLK. Okay. So no-go until Tuesday?
[00:37:27] <James_F>	 Unless it's really urgent.
[00:37:51] <James_F>	 I don't do AF suppression in practice, so…
[00:38:05] <DannyS712>	 I haven't run into it yet, but I know other oversighters have
[00:39:31] <DannyS712>	 it's especially a pain on mobile
[00:44:09] <wikibugs>	 (CR) Daimona Eaytoy: "Do we really need a backport?" [extensions/AbuseFilter] (wmf/1.36.0-wmf.26) - https://gerrit.wikimedia.org/r/655943 (https://phabricator.wikimedia.org/T271667) (owner: DannyS712)
[00:44:47] <Daimona>	 Catching up now
[00:45:35] <Daimona>	 I understand it might be a pain rn, but not sure if it deserves a full scap right now
[00:45:51] <Daimona>	 s/ rn//
[00:58:08] <wikibugs>	 (CR) jerkins-bot: [V: -1] Restore hide link when viewing single AbuseLog entries [extensions/AbuseFilter] (wmf/1.36.0-wmf.26) - https://gerrit.wikimedia.org/r/655943 (https://phabricator.wikimedia.org/T271667) (owner: DannyS712)
[01:04:54] <wikibugs>	 (CR) DannyS712: "recheck" [extensions/AbuseFilter] (wmf/1.36.0-wmf.26) - https://gerrit.wikimedia.org/r/655943 (https://phabricator.wikimedia.org/T271667) (owner: DannyS712)
[01:19:36] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:22:08] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:32:32] <icinga-wm>	 PROBLEM - PHP opcache health on mw2268 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[02:29:05] <wikibugs>	 SRE, ops-codfw, DC-Ops, RESTBase: restbase2009 reimaging issues - https://phabricator.wikimedia.org/T269853 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` restbase2009.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/20210...
[02:29:07] <wikibugs>	 SRE, ops-codfw, DC-Ops, RESTBase: restbase2009 reimaging issues - https://phabricator.wikimedia.org/T269853 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['restbase2009.codfw.wmnet'] `  Of which those **FAILED**: ` ['restbase2009.codfw.wmnet'] `
[02:29:21] <wikibugs>	 SRE, ops-codfw, DC-Ops, RESTBase: restbase2009 reimaging issues - https://phabricator.wikimedia.org/T269853 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` restbase2009.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/20210...
[02:49:59] <wikibugs>	 (PS1) Andrew Bogott: keystone: slight policy adjustments [puppet] - https://gerrit.wikimedia.org/r/656283 (https://phabricator.wikimedia.org/T272117)
[02:51:00] <wikibugs>	 (CR) Andrew Bogott: [C: +2] keystone: slight policy adjustments [puppet] - https://gerrit.wikimedia.org/r/656283 (https://phabricator.wikimedia.org/T272117) (owner: Andrew Bogott)
[03:06:22] <wikibugs>	 SRE, ops-codfw, DC-Ops, RESTBase: restbase2009 reimaging issues - https://phabricator.wikimedia.org/T269853 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['restbase2009.codfw.wmnet'] `  Of which those **FAILED**: ` ['restbase2009.codfw.wmnet'] `
[03:12:55] <wikibugs>	 (PS1) Gergő Tisza: Update /analytics/legacy/homepagemodule/ schema version to 1.1.0 [mediawiki-config] - https://gerrit.wikimedia.org/r/656284 (https://phabricator.wikimedia.org/T270309)
[06:07:41] <wikibugs>	 (PS1) Joal: profile::analytics::refinery Add HDFS folders [puppet] - https://gerrit.wikimedia.org/r/656307 (https://phabricator.wikimedia.org/T271560)
[06:09:26] <wikibugs>	 (CR) Joal: "@elukey: I'm not sure about the place where to create the folders to ensure correct ordering (groups existing for instance). Let me know i" [puppet] - https://gerrit.wikimedia.org/r/656307 (https://phabricator.wikimedia.org/T271560) (owner: Joal)
[07:08:44] <wikibugs>	 (CR) Elukey: "The change looks very good but I think that we'd need to add a hiera selector since we use the profile in multiple places/hosts:" [puppet] - https://gerrit.wikimedia.org/r/656307 (https://phabricator.wikimedia.org/T271560) (owner: Joal)
[07:35:18] <wikibugs>	 (PS1) Elukey: sre.hosts.reboot-single: move to class API [cookbooks] - https://gerrit.wikimedia.org/r/656310
[07:36:48] <wikibugs>	 (PS2) Elukey: sre.hosts.reboot-single: move to class API [cookbooks] - https://gerrit.wikimedia.org/r/656310
[07:38:42] <wikibugs>	 (CR) jerkins-bot: [V: -1] sre.hosts.reboot-single: move to class API [cookbooks] - https://gerrit.wikimedia.org/r/656310 (owner: Elukey)
[07:41:30] <elukey>	 yep yep
[07:43:37] <icinga-wm>	 PROBLEM - SSH on logstash2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:45:47] <wikibugs>	 (PS3) Elukey: sre.hosts.reboot-single: move to class API [cookbooks] - https://gerrit.wikimedia.org/r/656310
[07:46:22] <wikibugs>	 SRE, DBA: Refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[07:49:20] <wikibugs>	 (PS1) Ryan Kemper: T262211: bring "new" relforge hosts into service [puppet] - https://gerrit.wikimedia.org/r/656369
[07:49:48] <wikibugs>	 (CR) jerkins-bot: [V: -1] T262211: bring "new" relforge hosts into service [puppet] - https://gerrit.wikimedia.org/r/656369 (owner: Ryan Kemper)
[07:50:51] <wikibugs>	 (PS2) Ryan Kemper: search: bring "new" relforge hosts into service [puppet] - https://gerrit.wikimedia.org/r/656369 (https://phabricator.wikimedia.org/T262211)
[07:57:07] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single
[07:57:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:14] <wikibugs>	 (CR) Ryan Kemper: [C: +2] query_service: Migrate hiera() to lookup() in common.pp [puppet] - https://gerrit.wikimedia.org/r/656266 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[07:59:27] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[07:59:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210115T0800)
[08:01:51] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single
[08:01:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:41] <icinga-wm>	 PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:05:53] <wikibugs>	 (CR) Elukey: [V: +1] "Effie: Ok to merge?" [puppet] - https://gerrit.wikimedia.org/r/647190 (owner: Elukey)
[08:06:59] <wikibugs>	 (CR) Elukey: [C: +2] cache: Make statsd address an argument and hiera() -> lookup() [puppet] - https://gerrit.wikimedia.org/r/655790 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[08:07:27] <wikibugs>	 (CR) Elukey: [C: +2] cache: Migrate hiera() to lookup() and setting datatype in eventlogging [puppet] - https://gerrit.wikimedia.org/r/656015 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[08:13:46] <wikibugs>	 (CR) Ryan Kemper: [C: +2] query_service: Remove gui files from wdqs [puppet] - https://gerrit.wikimedia.org/r/655955 (https://phabricator.wikimedia.org/T271851) (owner: Ladsgroup)
[08:15:12] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[08:15:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:18] <wikibugs>	 (CR) Ryan Kemper: "Sounds like we can circle back to clean up `modules/query_service/manifests/gui.pp` at a later time if we so choose." [puppet] - https://gerrit.wikimedia.org/r/655955 (https://phabricator.wikimedia.org/T271851) (owner: Ladsgroup)
[08:15:28] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single
[08:15:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:40] <wikibugs>	 (CR) David Caro: [C: +2] wmcs.ceph.osd: disable write caches when possible [puppet] - https://gerrit.wikimedia.org/r/655923 (https://phabricator.wikimedia.org/T271527) (owner: David Caro)
[08:17:59] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 132, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:18:21] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:18:53] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:19:37] <icinga-wm>	 PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01545 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[08:21:24] <icinga-wm>	 RECOVERY - SSH on logstash2005 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:23:31] <_joe_>	 uh several puppet failures
[08:23:56] <_joe_>	 wdqs
[08:24:07] <ryankemper>	 _joe_: yup reverting right now
[08:24:15] <ryankemper>	 https://www.irccloud.com/pastebin/KwZ1fyzY/
[08:24:33] <wikibugs>	 (PS1) Ryan Kemper: Revert "query_service: Remove gui files from wdqs" [puppet] - https://gerrit.wikimedia.org/r/656295
[08:24:34] <_joe_>	 ryankemper: hi! I didn't notice you were still up :D
[08:25:07] <ryankemper>	 Working a weird schedule today :P
[08:25:17] <wikibugs>	 (CR) Ryan Kemper: "----- OUTPUT of 'sudo run-puppet-agent' -----" [puppet] - https://gerrit.wikimedia.org/r/656295 (owner: Ryan Kemper)
[08:25:21] <wikibugs>	 (CR) Ryan Kemper: [C: +2] Revert "query_service: Remove gui files from wdqs" [puppet] - https://gerrit.wikimedia.org/r/656295 (owner: Ryan Kemper)
[08:25:37] <kostajh>	 XioNoX: we (Growth team) would like to do an emergency deploy sometime today, a one line patch to unbreak our main feature. Would this be OK? The patch is https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/656370
[08:26:05] <kostajh>	 (Cc tgr_ )
[08:27:24] <_joe_>	 kostajh: what is broken?
[08:27:44] <wikibugs>	 (PS1) David Caro: wmcs.ceph.osd: actually disable write caches [puppet] - https://gerrit.wikimedia.org/r/656371
[08:27:47] <_joe_>	 oh I see
[08:28:10] <ryankemper>	 !log WDQS puppet run successful
[08:28:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:22] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[08:28:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:24] <kostajh>	 _joe_: on the wikis we have our extension deployed to, most users are directed to Special:Homepage after account creation. The main content on there is broken due to a failing elastic search query
[08:28:43] <_joe_>	 kostajh: yeah found the task and reading it
[08:29:05] <_joe_>	 So the user experience is broken, that's definitely something that should be deployed in an emergency
[08:29:19] <_joe_>	 and AIUI has no expected perf impact/risk with rollback
[08:29:20] <kostajh>	 _joe_: yes, the user experience is broken
[08:29:27] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] wmcs.ceph.osd: actually disable write caches [puppet] - https://gerrit.wikimedia.org/r/656371 (owner: David Caro)
[08:30:02] <icinga-wm>	 RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.003708 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[08:30:09] <_joe_>	 ryankemper: <3
[08:30:10] <kostajh>	 The list of tasks we show to a user comes from the cache. The patch disables validating those tasks for freshness
[08:35:12] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27483/console" [puppet] - https://gerrit.wikimedia.org/r/655516 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn)
[08:36:11] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:39:25] <marostegui>	 !log Restart clouddb1013-clouddb1020
[08:39:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:26] <icinga-wm>	 RECOVERY - Check systemd state on ncredir5002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:44:48] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/655516 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn)
[08:44:52] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1 C: +1] monitoring::host: move hostgroup_default to params, hiera->lookup [puppet] - https://gerrit.wikimedia.org/r/655516 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn)
[08:45:02] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 2 down 24 https://wikitech.wikimedia.org/wiki/HAProxy
[08:45:22] <marostegui>	 ^ expected
[08:45:31] <moritzm>	 !log installing bast4003 T257324
[08:45:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:34] <stashbot>	 T257324: Consolidate edge bastion server into ganeti - https://phabricator.wikimedia.org/T257324
[09:04:47] <wikibugs>	 (CR) Joal: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/656307 (https://phabricator.wikimedia.org/T271560) (owner: Joal)
[09:04:56] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[09:05:53] <wikibugs>	 (PS2) Joal: profile::analytics::refinery Create HDFS folders [puppet] - https://gerrit.wikimedia.org/r/656307 (https://phabricator.wikimedia.org/T271560)
[09:07:16] <RhinosF1>	 Is there anyone who can merge a change to unblock CI for CentralAuth?
[09:07:24] <wikibugs>	 (CR) jerkins-bot: [V: -1] profile::analytics::refinery Create HDFS folders [puppet] - https://gerrit.wikimedia.org/r/656307 (https://phabricator.wikimedia.org/T271560) (owner: Joal)
[09:07:58] <moritzm>	 !log installing bast5002 T257324
[09:08:00] <RhinosF1>	 https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/656298 for https://phabricator.wikimedia.org/T272123
[09:08:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:02] <stashbot>	 T257324: Consolidate edge bastion server into ganeti - https://phabricator.wikimedia.org/T257324
[09:08:56] <wikibugs>	 SRE, SRE-Access-Requests: Analytics access for dev: Bill Pirkle - https://phabricator.wikimedia.org/T272065 (Aklapper) //(For future reference, feel free to use https://phabricator.wikimedia.org/maniphest/task/edit/form/8/ for such requests, linked from https://wikitech.wikimedia.org/wiki/Analytics/Data_...
[09:09:41] <wikibugs>	 SRE, ops-eqiad, DBA: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (Marostegui)
[09:09:58] <wikibugs>	 SRE, SRE-Access-Requests: Analytics access for dev: Nikki Nikkhoui - https://phabricator.wikimedia.org/T272057 (Aklapper) //(For future reference, feel free to use https://phabricator.wikimedia.org/maniphest/task/edit/form/8/ for such requests, linked from https://wikitech.wikimedia.org/wiki/Analytics/Da...
[09:10:08] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[09:11:26] <wikibugs>	 SRE, ops-eqiad, DBA: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (Marostegui)
[09:12:14] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([druid1008.eqiad.wmnet, druid1004.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[09:12:34] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[09:12:40] <wikibugs>	 SRE, ops-eqiad, DBA: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (Marostegui) p:Triage→High Setting to high as we are trying to finish up the new wiki replicas infra
[09:13:56] <wikibugs>	 (CR) Elukey: profile::analytics::refinery Create HDFS folders (2 comments) [puppet] - https://gerrit.wikimedia.org/r/656307 (https://phabricator.wikimedia.org/T271560) (owner: Joal)
[09:14:55] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] "+1ed. Ping me to merge after the -1 by Dan (pending on the completion of the aforementioned task) is removed." [puppet] - https://gerrit.wikimedia.org/r/655500 (https://phabricator.wikimedia.org/T271477) (owner: Dduvall)
[09:15:06] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[09:16:24] <elukey>	 we are working on --^
[09:16:35] <elukey>	 it is a little subset of the aqs api, related to druid
[09:16:52] <elukey>	 when we drop old datasources it sadly stops answering queries, we didn't find a good way to do it
[09:17:05] <elukey>	 also I think that I should group alerts for AQS, to avoid spamming
[09:18:59] <godog>	 !log swift codfw-prod: more weight to ms-be20[58-61] - T269337
[09:19:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:02] <stashbot>	 T269337: Add ms-be20[58-61] to swift - https://phabricator.wikimedia.org/T269337
[09:23:22] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[09:24:02] <jynus>	 !log rolling restart of dbprov1* hosts
[09:24:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:07] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [cookbooks] - https://gerrit.wikimedia.org/r/656310 (owner: Elukey)
[09:24:27] <wikibugs>	 (PS1) Kosta Harlan: Temporarily disable cache revalidation [extensions/GrowthExperiments] (wmf/1.36.0-wmf.26) - https://gerrit.wikimedia.org/r/656301 (https://phabricator.wikimedia.org/T272103)
[09:25:04] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[09:25:04] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[09:25:54] <wikibugs>	 (PS2) Kosta Harlan: Temporarily disable cache revalidation [extensions/GrowthExperiments] (wmf/1.36.0-wmf.26) - https://gerrit.wikimedia.org/r/656301 (https://phabricator.wikimedia.org/T272103)
[09:27:47] <kostajh>	 _joe_ / XioNoX here's the patch for wmf.26, what is the next step in the process to deploy it? https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/656301 
[09:28:27] <_joe_>	 kostajh: merge and deploy
[09:29:03] <_joe_>	 there is no additional process to the normal emergency deployment
[09:29:03] <kostajh>	 tgr_: are you around to deploy?
[09:29:28] <kostajh>	 ok, so it doesn't need to be added somewhere here for example https://wikitech.wikimedia.org/wiki/Deployments#Friday,_January_15
[09:29:34] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[09:30:07] <tgr_>	 I'd rather not deploy at 1 AM, easy to make mistakes and a few more hours won't make that much of a difference
[09:30:12] <tgr_>	 I can deploy it in the morning
[09:30:20] <_joe_>	 kostajh: a good record in SAL is better
[09:30:46] <_joe_>	 tgr_: we can search for a deployer in the meantime, but ack!
[09:30:49] <wikibugs>	 SRE, ops-eqiad, DBA: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (Marostegui) There's something going on with this host: ` racadm>>serveraction powerstatus Server power status: OFF racadm>>serveraction powerup Server power operation initiated successfully racadm>>serverac...
[09:31:54] <RhinosF1>	 kostajh: could I get a +2 for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/656298 if you don't mind?
[09:34:04] <wikibugs>	 SRE, ops-eqiad, DBA: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (aborrero)
[09:34:07] <wikibugs>	 SRE, ops-eqiad, DBA: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (Marostegui) And without doing anything again:  ` racadm>>serveraction powerstatus Server power status: ON `
[09:34:39] <wikibugs>	 SRE, ops-eqiad, DBA: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (Marostegui) Interestingly, I cannot see anything on the console, so I have no idea what it is doing and if it is rebooting or doing something else.
[09:35:35] <kostajh>	 RhinosF1: trying to take a quick look now but would be better if you could find someone else. Have two kids at home due to school closures and dealing with other issues now unfortunately 
[09:35:49] <Reedy>	 What're we looking at?
[09:35:58] <wikibugs>	 SRE, Data-Persistence-Backup, SRE-swift-storage, Traffic, netops: Depool codfw swift cluster - https://phabricator.wikimedia.org/T267338 (jcrespo) p:Triage→High
[09:36:09] <RhinosF1>	 Reedy: unblocking CI on central auth due to phan failure
[09:36:17] <RhinosF1>	 kostajh: ack
[09:36:21] <_joe_>	 no, we're not talking about that
[09:36:39] <_joe_>	 we're talking about https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/656301
[09:36:46] <vgutierrez>	 !log rolling restart acme-chief servers to catch up on kernel upgrades
[09:36:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:25] <_joe_>	 RhinosF1: we're in the midst of an emergency deployment, please hold on :)
[09:38:06] <kostajh>	 Reedy: are you able to help with deploying https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/656301 ?
[09:38:07] <_joe_>	 kostajh: I would assume the task should be UBN!, btw
[09:38:17] <hashar>	 yeah I am around as well
[09:38:18] <Reedy>	 I was about to send an all caps message
[09:38:27] <hashar>	 was busy ranting over some private message
[09:38:28] <Reedy>	 I'm just quickly reading
[09:38:30] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[09:38:30] <kostajh>	 {done} 
[09:38:42] <_joe_>	 hashar: :*
[09:39:07] <hashar>	 kostajh: tgr_ I can take over, please do take care of kids or your sleep schedule! :]
[09:40:36] <hashar>	 of course the change in master breaks bah
[09:40:37] <wikibugs>	 (CR) Reedy: [C: +2] Temporarily disable cache revalidation [extensions/GrowthExperiments] (wmf/1.36.0-wmf.26) - https://gerrit.wikimedia.org/r/656301 (https://phabricator.wikimedia.org/T272103) (owner: Kosta Harlan)
[09:40:42] <_joe_>	 elukey: have you seen the alert about the druid brokers in LVS?
[09:41:12] <Reedy>	 hashar: unrelated AF errors on master though
[09:41:16] <kostajh>	 hashar: the change in master should be OK once the abusefilter patch is done merging. But... yeah
[09:41:27] <Reedy>	 deployment branch is seemingly ok, it's just doing ALL OF THE BROWSER TESTS
[09:41:43] <_joe_>	 Reedy: sit back, relax, and enjoy the tests
[09:42:03] <Reedy>	 We should live stream the browsertests somewhere
[09:42:12] <Reedy>	 hashar: gogo high priority feature request
[09:42:23] <hashar>	 hmm
[09:42:27] <hashar>	 yeah that should be doable
[09:42:48] <hashar>	 since we use Xvfb as a frame buffer, theoretically we can stream the buffer to Youtube or Twitch
[09:43:23] <_joe_>	 elukey: nevermind, it's icinga being weird, the alert has recovered since forever, it's just still critical in icinga for $reasons
[09:44:07] * apergos is around too if testing help is needed
[09:44:22] <hashar>	 so basically let's wait for patches to get merged, then I guess it is all about confirming the use case in https://phabricator.wikimedia.org/T272103 is addressed on mwdebug
[09:44:25] <hashar>	 and we can roll forward
[09:44:32] <apergos>	 uh huh
[09:45:09] <apergos>	 let me load up the page on mwdebug now and confirm I can break it, then I'll be in shape to test when it rolls out 
[09:45:38] <hashar>	 I wasn't even aware we have [[Special:Homepage]] :-\
[09:46:02] <apergos>	 me neither
[09:46:41] <hashar>	 that is super nice (has to be enabled in one's user preferences)
[09:47:23] <hashar>	 which gets one a customized home page that lists edit suggestions, how many folks watched articles I have changed (a few millions yeah!!) etc
[09:47:39] <hashar>	 and somehow Trizek is my tutor (hi!)
[09:47:49] <Reedy>	 quick, ask lots of questions
[09:49:47] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single
[09:49:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:34] <wikibugs>	 (Merged) jenkins-bot: Temporarily disable cache revalidation [extensions/GrowthExperiments] (wmf/1.36.0-wmf.26) - https://gerrit.wikimedia.org/r/656301 (https://phabricator.wikimedia.org/T272103) (owner: Kosta Harlan)
[09:51:54] <wikibugs>	 (PS1) Ayounsi: Add jhernandez to deployment [puppet] - https://gerrit.wikimedia.org/r/656375 (https://phabricator.wikimedia.org/T271859)
[09:52:23] <kostajh>	 hashar: https://www.mediawiki.org/wiki/Growth has the gory details if you're curious. It's not on all wikis (yet)
[09:52:23] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add jhernandez to deployment [puppet] - https://gerrit.wikimedia.org/r/656375 (https://phabricator.wikimedia.org/T271859) (owner: Ayounsi)
[09:52:45] <apergos>	 ok wow with uselang=en I get the name of the string for translation....  but anyways, yep I can get the error. 
[09:53:11] <hashar>	 great thank you apergos !
[09:53:28] <apergos>	 and now we watch jenkins :-)
[09:53:33] <hashar>	 I get it as well
[09:53:53] <kostajh>	 apergos: what do you mean about name of the string for translation?
[09:54:03] <Reedy>	 I'm guessing <foo-bar-message-name>
[09:54:06] <apergos>	 I mean that all these strings that you see
[09:54:09] <Reedy>	 rather than an english string
[09:54:14] <Reedy>	 so it's behaving like qqx
[09:54:16] <apergos>	 are look-upable in a file of i18n things
[09:54:20] <apergos>	 each of those things has a name
[09:54:29] <kostajh>	 right. but I don't see that with uselang=en, which is why I'm concerned/confused
[09:54:39] <wikibugs>	 (PS2) Ayounsi: Add jhernandez to deployment [puppet] - https://gerrit.wikimedia.org/r/656375 (https://phabricator.wikimedia.org/T271859)
[09:54:40] <apergos>	 when you  go to translatewiki you can get the name of the string and translate to the lang of your choic
[09:54:41] <apergos>	 e
[09:54:59] <apergos>	 I get the name of that string displayed rather than an en error message :-D
[09:55:06] <kostajh>	 if you're seeing the string name with uselang=en, that sounds like a problem of its own
[09:55:17] <apergos>	 growthexperiments-homepage-suggestededits-error-title
[09:55:21] <Reedy>	 I've just pulled the fix onto mwdebug1002
[09:55:28] <apergos>	 ok lemme see what happens
[09:55:32] <wikibugs>	 (03PS1) 10Joal: profile::analytics::refinery::job::hdfs_cleaner Update [puppet] - 10https://gerrit.wikimedia.org/r/656376 (https://phabricator.wikimedia.org/T271560)
[09:55:45] <kostajh>	 Reedy: thanks, looks good to me
[09:55:52] <hashar>	 fixed
[09:56:08] <apergos>	 ah
[09:56:13] <apergos>	 I had "uselang=n"
[09:56:14] <apergos>	 lol
[09:56:16] <Reedy>	 lmao
[09:56:26] <kostajh>	 FWIW the non-JavaScript experience was never broken, for the NoScript users among you :)
[09:56:36] <hashar>	 and now my day is going to be ruined as I empty up my queue of 200 suggested edits
[09:57:08] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to labweb1001 and labweb1002 for jhernandez - https://phabricator.wikimedia.org/T271859 (10ayounsi)
[09:57:15] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] profile::analytics::refinery::job::hdfs_cleaner Update [puppet] - 10https://gerrit.wikimedia.org/r/656376 (https://phabricator.wikimedia.org/T271560) (owner: 10Joal)
[09:57:56] <_joe_>	 apergos: what language is that?
[09:58:24] <logmsgbot>	 !log reedy@deploy1001 Synchronized php-1.36.0-wmf.26/extensions/GrowthExperiments/includes/NewcomerTasks/TaskSuggester/CacheDecorator.php: T272103 (duration: 00m 57s)
[09:58:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:27] <stashbot>	 T272103: [regression - wmf.26] frwiki Homepage  SE module has 'cirrussearch-query-too-long' for default filters - https://phabricator.wikimedia.org/T272103
[09:58:43] <kostajh>	 I guess I didn't realize you can add arbitrary input for the uselang parameter, without validation / fallback to a known language code
[09:58:49] <apergos>	 none!
[09:59:23] <_joe_>	 uuh yeah maybe let's not elaborate too much on that :)
[09:59:51] <apergos>	 yeah anyways no errors for me
[10:00:23] <apergos>	 long after the deploy already went around :-D
[10:00:46] <hashar>	 I had once defined an 'EN'  language
[10:00:51] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/656375 (https://phabricator.wikimedia.org/T271859) (owner: 10Ayounsi)
[10:01:08] <kostajh>	 Reedy / hashar / apergos / _joe_ thank you for your help! 
[10:01:09] <hashar>	 which really was english but applying uppercase() to all messages. That was for the INTERNATIONAL CAPS LOCK DAY
[10:01:11] <Reedy>	 np
[10:01:17] <hashar>	 \o/
[10:01:25] <hashar>	 and thanks tgr_ for the extra investigation!
[10:02:06] <wikibugs>	 10SRE, 10serviceops, 10User-jijiki: Enable TLS on memcached - https://phabricator.wikimedia.org/T271967 (10jijiki) p:05Triage→03Medium
[10:02:23] <logmsgbot>	 !log vgutierrez@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
[10:02:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:59] <icinga-wm>	 PROBLEM - Keyholder SSH agent on acmechief2001 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. https://wikitech.wikimedia.org/wiki/Keyholder
[10:03:50] <elukey>	 vgutierrez: o/ if you are ok I'd merge https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/656310 to update the cookbook for reboot single, and then we could test it to see if there is anything to fix
[10:03:54] <elukey>	 would it be ok?
[10:04:02] * vgutierrez checking
[10:04:09] <wikibugs>	 (03PS6) 10Giuseppe Lavagetto: Switch the base image to buster from stretch. [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/615683
[10:04:11] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Add jhernandez to deployment [puppet] - 10https://gerrit.wikimedia.org/r/656375 (https://phabricator.wikimedia.org/T271859) (owner: 10Ayounsi)
[10:04:16] <vgutierrez>	 elukey: yeah :)
[10:05:29] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Switch the base image to buster from stretch. [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/615683 (owner: 10Giuseppe Lavagetto)
[10:05:59] <icinga-wm>	 RECOVERY - Keyholder SSH agent on acmechief2001 is OK: OK: Keyholder is armed with all configured keys. https://wikitech.wikimedia.org/wiki/Keyholder
[10:06:10] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to labweb1001 and labweb1002 for jhernandez - https://phabricator.wikimedia.org/T271859 (10ayounsi) You're all set, give it 30min for Puppet to run. Let me know if any issues.
[10:06:16] <wikibugs>	 10SRE, 10Wikimedia-Logstash: Update saved / short links with objects in ELK7 - https://phabricator.wikimedia.org/T272016 (10fgiunchedi) p:05Triage→03Medium
[10:06:17] <elukey>	 vgutierrez: ack merging
[10:06:18] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to labweb1001 and labweb1002 for jhernandez - https://phabricator.wikimedia.org/T271859 (10ayounsi) 05Open→03Resolved a:03ayounsi
[10:06:24] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] sre.hosts.reboot-single: move to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/656310 (owner: 10Elukey)
[10:06:39] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/656310 (owner: 10Elukey)
[10:06:51] <hashar>	 I have dropped the unbreak now status
[10:06:58] <hashar>	 Reedy: thx for the deployment!
[10:07:55] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-test-client1001.eqiad.wmnet
[10:07:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:07:58] <wikibugs>	 10SRE, 10Wikimedia-Logstash: Update saved / short links with objects in ELK7 - https://phabricator.wikimedia.org/T272016 (10fgiunchedi) >>! In T272016#6747870, @Lucas_Werkmeister_WMDE wrote: > Is it possible to restore the /goto/ links?  AIUI missing `/goto/` links was an expected side effect of the migration,...
[10:08:04] <elukey>	 nice :)
[10:09:58] <wikibugs>	 (03PS3) 10Joal: profile::analytics::refinery Create HDFS folders [puppet] - 10https://gerrit.wikimedia.org/r/656307 (https://phabricator.wikimedia.org/T271560)
[10:10:02] <wikibugs>	 (03CR) 10Joal: profile::analytics::refinery Create HDFS folders (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/656307 (https://phabricator.wikimedia.org/T271560) (owner: 10Joal)
[10:10:14] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-client1001.eqiad.wmnet
[10:10:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:16] <wikibugs>	 (03PS2) 10Joal: profile::analytics::refinery::job::hdfs_cleaner Update [puppet] - 10https://gerrit.wikimedia.org/r/656376 (https://phabricator.wikimedia.org/T271560)
[10:12:43] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2029 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:13:05] <wikibugs>	 (03PS1) 10Ayounsi: Add nikkin to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/656377 (https://phabricator.wikimedia.org/T272057)
[10:13:07] <wikibugs>	 (03PS1) 10Ayounsi: Add bpirkle to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/656378 (https://phabricator.wikimedia.org/T272065)
[10:13:58] <elukey>	 vgutierrez: my test worked, you can proceed if you have another one
[10:16:12] <wikibugs>	 10SRE, 10DBA, 10Orchestrator, 10CAS-SSO, 10User-Kormat: orchestrator: Support SSO - https://phabricator.wikimedia.org/T266106 (10Kormat)
[10:16:43] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2029 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:18:23] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Analytics access for dev: Nikki Nikkhoui - https://phabricator.wikimedia.org/T272057 (10ayounsi) @Ottomata do we need approval from you (Analytics) as well?
[10:18:38] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] Add nikkin to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/656377 (https://phabricator.wikimedia.org/T272057) (owner: 10Ayounsi)
[10:18:56] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] Add bpirkle to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/656378 (https://phabricator.wikimedia.org/T272065) (owner: 10Ayounsi)
[10:18:57] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Analytics access for dev: Bill Pirkle - https://phabricator.wikimedia.org/T272065 (10ayounsi) @Ottomata do we need approval from you (Analytics) as well?
[10:19:00] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "We also should drop the reference from modules/aptrepo/files/distributions-wikimedia" [puppet] - 10https://gerrit.wikimedia.org/r/639881 (https://phabricator.wikimedia.org/T263284) (owner: 10Bstorm)
[10:19:04] <wikibugs>	 (03CR) 10Muehlenhoff: "One note inline, also needs approval by Otto in the Phab task, other than that looks fine." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/656377 (https://phabricator.wikimedia.org/T272057) (owner: 10Ayounsi)
[10:19:49] <wikibugs>	 (03CR) 10Muehlenhoff: Add bpirkle to analytics-privatedata-users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/656378 (https://phabricator.wikimedia.org/T272065) (owner: 10Ayounsi)
[10:21:02] <vgutierrez>	 elukey: ok :)
[10:21:19] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host acmechief-test2001.codfw.wmnet
[10:21:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:24:29] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2035 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:24:57] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2054 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:25:13] <vgutierrez>	 elukey: looking good here as well
[10:25:22] <elukey>	 perfect
[10:25:27] <godog>	 the ms-be2* errors are sort-of expected, rebalancing
[10:26:36] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief-test2001.codfw.wmnet
[10:26:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:21] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host acmechief-test1001.eqiad.wmnet
[10:27:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:41] <icinga-wm>	 PROBLEM - SSH on ms-be2032 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:30:06] <wikibugs>	 (03PS1) 10Muehlenhoff: Make bast4003/bast5002 bastion hosts [puppet] - 10https://gerrit.wikimedia.org/r/656380 (https://phabricator.wikimedia.org/T257324)
[10:39:42] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: cr/firewall.cf: cloud-in4: seperate dumps service ACL to a different term [homer/public] - 10https://gerrit.wikimedia.org/r/654823 (https://phabricator.wikimedia.org/T209082)
[10:39:44] <wikibugs>	 (03PS4) 10Arturo Borrero Gonzalez: cr/firewall.cf: cloud-in4: seperate gerrit service ACL to a different term [homer/public] - 10https://gerrit.wikimedia.org/r/654826 (https://phabricator.wikimedia.org/T209082)
[10:39:46] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: cr/firewall.cf: cloud-in4: fix ports for contint2001 [homer/public] - 10https://gerrit.wikimedia.org/r/654830 (https://phabricator.wikimedia.org/T209082)
[10:39:59] <icinga-wm>	 RECOVERY - SSH on ms-be2032 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:40:50] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief-test1001.eqiad.wmnet
[10:40:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:58] <effie>	 !log reboot mc2036 - T269596
[10:44:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:08] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: production-images: switch to buster as seed image [puppet] - 10https://gerrit.wikimedia.org/r/656381
[10:45:33] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-conf1001.eqiad.wmnet
[10:45:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:01] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1019 is CRITICAL: CRITICAL check_failover servers up 16 down 2 https://wikitech.wikimedia.org/wiki/HAProxy
[10:46:19] <_joe_>	 uh, kormat/marostegui?
[10:46:44] <vgutierrez>	 !log disable puppet on acme-chief clients
[10:46:46] <kormat>	 oh. hi. looking
[10:46:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:31] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] production-images: switch to buster as seed image [puppet] - 10https://gerrit.wikimedia.org/r/656381 (owner: 10Giuseppe Lavagetto)
[10:48:23] <icinga-wm>	 ACKNOWLEDGEMENT - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 16 down 2 Kormat Checking https://wikitech.wikimedia.org/wiki/HAProxy
[10:48:23] <icinga-wm>	 ACKNOWLEDGEMENT - haproxy failover on dbproxy1019 is CRITICAL: CRITICAL check_failover servers up 16 down 2 Kormat Checking https://wikitech.wikimedia.org/wiki/HAProxy
[10:48:46] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: "Sorry, I completely forgot this patch was here, and I reimplemented it myself yesterday :/" [puppet] - 10https://gerrit.wikimedia.org/r/597559 (owner: 10Cwhite)
[10:48:47] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host acmechief1001.eqiad.wmnet
[10:48:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:49:52] <marostegui>	 hey the haproxy thing is expected
[10:50:06] <marostegui>	 kormat _joe_ ^
[10:50:18] <kormat>	 marostegui: i figured that once i realised what's behind it :)
[10:50:25] <marostegui>	  :-)
[10:50:40] <marostegui>	 I will be back in an hour or so
[10:51:12] <jynus>	 some of our alerting is complicated to ack, it happened to me yesterday
[10:51:13] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-conf1001.eqiad.wmnet
[10:51:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:51:49] <jynus>	 e.g. non-obvious dependencies or metrics monitoring more than one thing with a single alert
[10:52:02] <_joe_>	 !log rebuilding the docker images coredns,nutcracker,prometheus-statsd-exporter,service-checker,wmfdebug to use wikimedia-buster as a base
[10:52:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:36] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief1001.eqiad.wmnet
[10:52:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:53:15] <jynus>	 !log rolling restart of dbprov2* hosts
[10:53:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:53:17] <vgutierrez>	 !log re-enable puppet on acme-chief clients
[10:53:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:53:56] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-conf1003.eqiad.wmnet
[10:53:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:09] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:55:57] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2036.codfw.wmnet
[10:55:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:56:20] <logmsgbot>	 !log jiji@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host mc2036.codfw.wmnet
[10:56:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:56:44] <elukey>	 effie: I merged a new version of the cookbook this morning, if you see anything weird let me know
[10:57:39] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2029 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:58:11] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-conf1003.eqiad.wmnet
[10:58:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:59:01] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2036.codfw.wmnet
[10:59:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:59:19] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:59:36] <effie>	 elukey: ok !
[11:00:41] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] cr/firewall.cf: cloud-in4: seperate dumps service ACL to a different term [homer/public] - 10https://gerrit.wikimedia.org/r/654823 (https://phabricator.wikimedia.org/T209082) (owner: 10Arturo Borrero Gonzalez)
[11:01:04] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] cr/firewall.cf: cloud-in4: seperate gerrit service ACL to a different term [homer/public] - 10https://gerrit.wikimedia.org/r/654826 (https://phabricator.wikimedia.org/T209082) (owner: 10Arturo Borrero Gonzalez)
[11:01:13] <wikibugs>	 (03Merged) 10jenkins-bot: cr/firewall.cf: cloud-in4: seperate dumps service ACL to a different term [homer/public] - 10https://gerrit.wikimedia.org/r/654823 (https://phabricator.wikimedia.org/T209082) (owner: 10Arturo Borrero Gonzalez)
[11:01:20] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] cr/firewall.cf: cloud-in4: fix ports for contint2001 [homer/public] - 10https://gerrit.wikimedia.org/r/654830 (https://phabricator.wikimedia.org/T209082) (owner: 10Arturo Borrero Gonzalez)
[11:01:43] <wikibugs>	 (03Merged) 10jenkins-bot: cr/firewall.cf: cloud-in4: seperate gerrit service ACL to a different term [homer/public] - 10https://gerrit.wikimedia.org/r/654826 (https://phabricator.wikimedia.org/T209082) (owner: 10Arturo Borrero Gonzalez)
[11:01:53] <wikibugs>	 (03Merged) 10jenkins-bot: cr/firewall.cf: cloud-in4: fix ports for contint2001 [homer/public] - 10https://gerrit.wikimedia.org/r/654830 (https://phabricator.wikimedia.org/T209082) (owner: 10Arturo Borrero Gonzalez)
[11:03:27] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:03:49] <wikibugs>	 (03PS1) 10Joal: Update web.xml removing jetty default dir listing [debs/archiva] - 10https://gerrit.wikimedia.org/r/656382 (https://phabricator.wikimedia.org/T272082)
[11:03:58] <joal>	 elukey: --^ for when you're back
[11:05:35] <wikibugs>	 10SRE, 10DBA, 10Orchestrator, 10CAS-SSO, 10User-Kormat: orchestrator: Support SSO - https://phabricator.wikimedia.org/T266106 (10Kormat)
[11:06:52] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2036.codfw.wmnet
[11:06:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:11:58] <XioNoX>	 !log update cloud-in4 firewall rules
[11:12:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:15:06] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-conf1002.eqiad.wmnet
[11:15:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:15:37] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: wmfdebug: swap iproute with iproute2 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/656383
[11:18:25] <wikibugs>	 (03CR) 10Kormat: [C: 03+1] wmfdebug: swap iproute with iproute2 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/656383 (owner: 10Giuseppe Lavagetto)
[11:18:39] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: production-images: add stretch to base images [puppet] - 10https://gerrit.wikimedia.org/r/656384
[11:19:49] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2035 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:19:54] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-conf1002.eqiad.wmnet
[11:19:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:21:57] <icinga-wm>	 PROBLEM - SSH on ms-be2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[11:23:07] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2016 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:23:26] <wikibugs>	 (03PS1) 10Effie Mouzeli: hiera: clean up memcached configuration [puppet] - 10https://gerrit.wikimedia.org/r/656385 (https://phabricator.wikimedia.org/T213089)
[11:25:01] <icinga-wm>	 RECOVERY - SSH on ms-be2018 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[11:25:42] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] production-images: add stretch to base images [puppet] - 10https://gerrit.wikimedia.org/r/656384 (owner: 10Giuseppe Lavagetto)
[11:28:19] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 213 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[11:30:17] <jynus>	 !log rolling restart of eqiad source backup dbs
[11:30:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:30:36] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] wmfdebug: swap iproute with iproute2 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/656383 (owner: 10Giuseppe Lavagetto)
[11:30:53] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 36 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[11:35:37] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 285 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[11:37:15] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 27 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[11:38:01] <jynus>	 there seem to be spikes of exceptions every 6 minutes
[11:41:40] <jynus>	 https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?orgId=1&var-datasource=eqiad%20prometheus%2Fops&viewPanel=2&from=1610707291821&to=1610710891822
[11:42:52] <jynus>	 they seem to be OOMs, I think
[11:53:37] <wikibugs>	 (03CR) 10Volans: "LGTM, couple of questions/nits inline" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/655896 (https://phabricator.wikimedia.org/T271702) (owner: 10Jbond)
[11:54:13] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudgw: introduce set counters [puppet] - 10https://gerrit.wikimedia.org/r/656388
[11:56:15] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: introduce set counters [puppet] - 10https://gerrit.wikimedia.org/r/656388 (owner: 10Arturo Borrero Gonzalez)
[11:59:10] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Make bast4003/bast5002 bastion hosts [puppet] - 10https://gerrit.wikimedia.org/r/656380 (https://phabricator.wikimedia.org/T257324) (owner: 10Muehlenhoff)
[12:04:49] <wikibugs>	 (03CR) 10Volans: "General comment inline." (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/656139 (owner: 10Jbond)
[12:06:02] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: production-images: correctly refer to the registry with its variable [puppet] - 10https://gerrit.wikimedia.org/r/656391
[12:10:33] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] production-images: correctly refer to the registry with its variable [puppet] - 10https://gerrit.wikimedia.org/r/656391 (owner: 10Giuseppe Lavagetto)
[12:10:42] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "Looks reasonable as a pure conversion. Some possible future expansion for later inline." (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/656212 (https://phabricator.wikimedia.org/T269925) (owner: 10Elukey)
[12:12:33] <wikibugs>	 (03PS1) 10Hashar: Display image label when publishing [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/656393
[12:13:41] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:27:35] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2020 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:27:53] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] Remove the 'letsencrypt' module [puppet] - 10https://gerrit.wikimedia.org/r/655762 (https://phabricator.wikimedia.org/T252199) (owner: 10Andrew Bogott)
[12:34:02] <wikibugs>	 (03CR) 10Vgutierrez: [C: 04-1] "careful though.. it looks like the module still has some references on the code:" [puppet] - 10https://gerrit.wikimedia.org/r/655762 (https://phabricator.wikimedia.org/T252199) (owner: 10Andrew Bogott)
[12:48:57] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:53:38] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:02:06] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2018 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:13:00] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:13:20] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:17:17] <wikibugs>	 (03PS1) 10Muehlenhoff: Update SSH default config for new bastions running on Ganeti [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/656402
[13:21:30] <wikibugs>	 (03CR) 10Muehlenhoff: [V: 03+2] Update SSH default config for new bastions running on Ganeti [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/656402 (owner: 10Muehlenhoff)
[13:21:35] <wikibugs>	 (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Update SSH default config for new bastions running on Ganeti [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/656402 (owner: 10Muehlenhoff)
[13:24:40] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2016 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:28:32] <icinga-wm>	 PROBLEM - SSH on ms-be2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:30:40] <icinga-wm>	 RECOVERY - SSH on ms-be2018 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:32:11] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10SRE-Access-Requests: Requesting access to production shell and wmf ldap access for Razzi Abuissa - https://phabricator.wikimedia.org/T261443 (10Aklapper)
[13:36:50] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[13:41:18] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[20:32:39] <wikibugs>	 (03PS1) 10Urbanecm: Compress frwiki's anniversary logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/656508 (https://phabricator.wikimedia.org/T272075)
[20:32:49] <Urbanecm>	 James_F: sorry to distract you, mind +1'ing the above compress patch? ;)
[20:33:57] <wikibugs>	 (03CR) 10Jforrester: [C: 03+1] "Looks roughly right, eyeballing." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/656508 (https://phabricator.wikimedia.org/T272075) (owner: 10Urbanecm)
[20:34:15] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Compress frwiki's anniversary logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/656508 (https://phabricator.wikimedia.org/T272075) (owner: 10Urbanecm)
[20:34:18] <Urbanecm>	 thanks, going to finish this then
[20:34:59] <_joe_>	 James_F: it would be great if people did reply when a time sensitive inquiry gets made by a volunteer, during their work day
[20:35:28] <wikibugs>	 (03Merged) 10jenkins-bot: Compress frwiki's anniversary logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/656508 (https://phabricator.wikimedia.org/T272075) (owner: 10Urbanecm)
[20:35:37] <_joe_>	 apparently I was the only one available, and I don't think it was my role to say more than "there is no problem if you deploy a svg", which I did
[20:36:07] <_joe_>	 (I'm not referring to you ofc, rather to people in releng and sre)
[20:36:09] <James_F>	 _joe_: Totally. It's not your fault. The system is explicitly designed to say 'no' unless things are sufficiently on fire that people are here anyway.
[20:36:15] * James_F nods.
[20:36:30] <James_F>	 But also our CI for the config repo should spot un-crunched logos and whine.
[20:36:41] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized static/images/mobile/copyright/wikipedia-fr-20.svg: 66e6be391ecfde7ca0604146ab978987ce472b5c: Set anniversary logo for frwiki (1/3; T272075) (duration: 00m 58s)
[20:36:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:36:45] <stashbot>	 T272075: Enable anniversary logo for fr.wikipedia - 20th birthday - https://phabricator.wikimedia.org/T272075
[20:36:50] <_joe_>	 well, it is a special occasion, and the thing that is on fire is we need a code deploy to change a logo :)
[20:37:00] <James_F>	 Yes.
[20:37:06] <James_F>	 But the on-wiki process was worse.
[20:37:15] <_joe_>	 oh my :D
[20:37:33] <_joe_>	 you know there are alternatives to on-wiki and in-code, right? :P
[20:37:56] <James_F>	 _joe_: Alternatives, yes, but not processes we've actually tried.
[20:37:56] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized static/images/mobile/copyright/wikipedia-tagline-fr-20.svg: 66e6be391ecfde7ca0604146ab978987ce472b5c: Set anniversary logo for frwiki (2/3; T272075) (duration: 00m 55s)
[20:37:56] <_joe_>	 some sort of configuration backoffice 
[20:37:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:38:01] <James_F>	 Yeah.
[20:38:11] <icinga-wm>	 PROBLEM - Host ganeti5002 is DOWN: PING CRITICAL - Packet loss = 100%
[20:38:30] <James_F>	 Or even on-wiki configuration requests via a special system, Special:SiteConfiguration, like Wikia do.
[20:39:14] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 66e6be391ecfde7ca0604146ab978987ce472b5c: Set anniversary logo for frwiki (3/3; T272075) (duration: 00m 55s)
[20:39:15] <icinga-wm>	 RECOVERY - Host ganeti5002 is UP: PING OK - Packet loss = 0%, RTA = 225.27 ms
[20:39:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:39:36] <Urbanecm>	 anyway, thanks James_F and _joe_, this should be done now
[20:39:59] <_joe_>	 Urbanecm: thank you for taking care of that for the french community :)
[20:40:20] <James_F>	 Urbanecm: I get the new logo on desktop but not mobile; is that intended?
[20:40:38] <Urbanecm>	 James_F: yes
[20:40:42] <James_F>	 Ack.
[20:40:48] <Lofhi>	 Thanks James_F and Urbanecm!
[20:40:58] <James_F>	 OK, looks good. Thanks for doing this. Boo to frwiki for not asking for this beforehand. :-)
[20:41:09] <James_F>	 (Also it's not frwiki's birthday until March. Tsk. ;-))
[20:42:04] <_joe_>	 James_F: DETAILS
[20:42:15] <James_F>	 _joe_: Yeah yeah, I know. :-)
[20:42:34] <_joe_>	 James_F: do you remember if frwiki was born before itwiki?
[20:43:03] <_joe_>	 yeah it was, boo
[20:43:06] <James_F>	 Yeah; first batch was fr/ca/de. Second batch was it/es/eo I think.
[20:43:11] <James_F>	 Sorry. :-)
[20:50:32] <wikibugs>	 SRE, Wikimedia-Mailing-lists, I18n: Mailman password reminder mail (and other texts) has broken encoding in Czech - https://phabricator.wikimedia.org/T271123 (Mormegil) Well, yes, for Czech, the subscription confirmation e-mail seems to be sent correctly, now. But as I said above, it is a problem for...
[20:53:15] <wikibugs>	 SRE, ops-eqsin, serviceops: ganeti5002 was down / powered off, machine check entries in SEL - https://phabricator.wikimedia.org/T261130 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` ganeti5002.eqsin.wmnet ` The log can be found in `/var/log/...
[20:55:27] <Nemo_bis>	 I initially thought it was a flying spaghetti monster
[20:57:38] <wikibugs>	 (Abandoned) Cwhite: profile: add ca_bundle configuration option to docker-pkg configs [puppet] - https://gerrit.wikimedia.org/r/597559 (owner: Cwhite)
[21:03:16] <wikibugs>	 SRE, ops-eqsin, serviceops: ganeti5002 was down / powered off, machine check entries in SEL - https://phabricator.wikimedia.org/T261130 (Papaul) This issue was that after replacing the system motherboard I am guessing that the credentials were restored in the new IDRAC board from the chassis flash bac...
[21:19:01] <logmsgbot>	 !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti5002.eqsin.wmnet with reason: REIMAGE
[21:19:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:22:40] <logmsgbot>	 !log pt1979@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti5002.eqsin.wmnet with reason: REIMAGE
[21:22:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:24:33] <wikibugs>	 (PS1) Bstorm: openstack: remove the queens hiera hiding out in places [puppet] - https://gerrit.wikimedia.org/r/656514 (https://phabricator.wikimedia.org/T261134)
[21:26:14] <wikibugs>	 (PS3) Andrew Bogott: Add designate packages and manifests for openstack/train [puppet] - https://gerrit.wikimedia.org/r/656502 (https://phabricator.wikimedia.org/T261135)
[21:26:16] <wikibugs>	 (PS1) Andrew Bogott: Change profile::openstack::eqiad1::version from queens to stein for VMs [puppet] - https://gerrit.wikimedia.org/r/656515
[21:27:07] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Change profile::openstack::eqiad1::version from queens to stein for VMs [puppet] - https://gerrit.wikimedia.org/r/656515 (owner: Andrew Bogott)
[21:28:59] <wikibugs>	 (PS2) Bstorm: openstack: remove the queens hiera hiding out in places [puppet] - https://gerrit.wikimedia.org/r/656514 (https://phabricator.wikimedia.org/T261134)
[21:29:28] <wikibugs>	 (CR) Andrew Bogott: "correction: on Stretch this doesn't upgrade anything.  On Buster VMs it will cause newly-built or -upgraded VMs to install Stein client pa" [puppet] - https://gerrit.wikimedia.org/r/656515 (owner: Andrew Bogott)
[21:31:17] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:32:05] <wikibugs>	 (PS3) Bstorm: openstack: remove the queens hiera hiding out in places [puppet] - https://gerrit.wikimedia.org/r/656514 (https://phabricator.wikimedia.org/T261134)
[21:34:24] <wikibugs>	 (CR) Bstorm: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/27498/" [puppet] - https://gerrit.wikimedia.org/r/656514 (https://phabricator.wikimedia.org/T261134) (owner: Bstorm)
[21:38:04] <wikibugs>	 SRE, ops-eqsin, serviceops: ganeti5002 was down / powered off, machine check entries in SEL - https://phabricator.wikimedia.org/T261130 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti5002.eqsin.wmnet'] `  and were **ALL** successful.
[21:39:42] <wikibugs>	 SRE, ops-eqsin, serviceops: ganeti5002 was down / powered off, machine check entries in SEL - https://phabricator.wikimedia.org/T261130 (RobH) Open→Resolved a:wiki_willy→RobH So this is now ready to be pushed back into service, resolving this hw repair task.
[21:55:13] <wikibugs>	 (CR) Alexandros Kosiaris: [V: +1 C: +1] k8s_infrastructure_users: Amend to support groups, avoid uid conflicts [puppet] - https://gerrit.wikimedia.org/r/647011 (https://phabricator.wikimedia.org/T269461) (owner: Alexandros Kosiaris)
[21:55:37] <icinga-wm>	 RECOVERY - Long running screen/tmux on maps1009 is OK: OK: No SCREEN or tmux processes detected. https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens
[21:59:08] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to labweb1001 and labweb1002 for jhernandez - https://phabricator.wikimedia.org/T271859 (Jhernandez) @Majavah right on the money! I thought `Host bast` would be matching bast on the input host but apparently not (no asterisks I guess).  I've added an explicit se...
[22:06:47] <wikibugs>	 (CR) Daimona Eaytoy: [C: +1] "Actually, I just realized that this change is currently a no-op. The default stage was made WRITE_BOTH directly in AbuseFilter back in Dec" [mediawiki-config] - https://gerrit.wikimedia.org/r/647116 (https://phabricator.wikimedia.org/T269712) (owner: Jforrester)
[22:07:31] <wikibugs>	 (CR) Daimona Eaytoy: [C: +1] "I believe this is ready, since all wikis are already at WRITE_BOTH" [mediawiki-config] - https://gerrit.wikimedia.org/r/647117 (https://phabricator.wikimedia.org/T269712) (owner: Jforrester)
[22:21:59] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2018 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:27:28] <wikibugs>	 (PS1) RobH: adding new an-workers to puppet [puppet] - https://gerrit.wikimedia.org/r/656516 (https://phabricator.wikimedia.org/T260445)
[22:27:43] <wikibugs>	 (PS2) RobH: adding new an-workers to puppet [puppet] - https://gerrit.wikimedia.org/r/656516 (https://phabricator.wikimedia.org/T260445)
[22:29:23] <wikibugs>	 (CR) RobH: [C: +2] adding new an-workers to puppet [puppet] - https://gerrit.wikimedia.org/r/656516 (https://phabricator.wikimedia.org/T260445) (owner: RobH)
[22:36:31] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:44:10] <wikibugs>	 SRE, ops-eqiad, Analytics-Clusters, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` an-worker1118.eqiad.wmnet ` Th...
[22:46:43] <wikibugs>	 SRE, ops-eqiad, Analytics-Clusters, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (RobH)
[22:54:17] <wikibugs>	 SRE, Wikimedia-Mailing-lists, I18n: Mailman password reminder mail (and other texts) has broken encoding in Czech - https://phabricator.wikimedia.org/T271123 (Ladsgroup) That's what I have been saying, if you fix something, it breaks something else. It's a whack-a-mole at the current state.
[22:58:18] <wikibugs>	 SRE, ops-eqiad, Analytics-Clusters, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1118.eqiad.wmnet'] `  Of which those **FAILED**: ` ['an-worker1118....
[23:00:57] <wikibugs>	 (PS1) PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/656521
[23:15:44] <wikibugs>	 (PS1) RobH: splitting an-workers to their own netboot line [puppet] - https://gerrit.wikimedia.org/r/656522 (https://phabricator.wikimedia.org/T260445)
[23:16:40] <wikibugs>	 (PS2) RobH: splitting an-workers to their own netboot line [puppet] - https://gerrit.wikimedia.org/r/656522 (https://phabricator.wikimedia.org/T260445)
[23:17:23] <wikibugs>	 (CR) RobH: [C: +2] splitting an-workers to their own netboot line [puppet] - https://gerrit.wikimedia.org/r/656522 (https://phabricator.wikimedia.org/T260445) (owner: RobH)
[23:24:25] <wikibugs>	 SRE, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (RobH)
[23:30:11] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2018 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:31:11] <wikibugs>	 SRE, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` an-worker1118.eqiad.wmnet ` The log can be found in...
[23:41:53] <wikibugs>	 SRE, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1118.eqiad.wmnet'] `  Of which those **FAILED**: ` ['an-worker1118.eqiad.wmnet'] `
[23:46:07] <wikibugs>	 SRE, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` an-worker1118.eqiad.wmnet ` The log can be found in...
[23:57:56] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1118.eqiad.wmnet with reason: REIMAGE
[23:57:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:59:58] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1118.eqiad.wmnet with reason: REIMAGE
[23:59:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log