[00:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for Evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210115T0000). [00:00:05] James_F and MatmaRex: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:01:06] (Already done.) [00:01:10] oh, oops [00:01:23] MatmaRex: No worries, I marked them as done already. [00:01:26] ah, you marked them as done on the schedule. alright [00:01:33] thanks [00:07:51] ACKNOWLEDGEMENT - PHP opcache health on mw2236 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn recent reimage makes it normal https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [00:07:51] ACKNOWLEDGEMENT - PHP opcache health on mw2269 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn recent reimage makes it normal https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [00:07:51] ACKNOWLEDGEMENT - PHP opcache health on mw2270 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn recent reimage makes it normal https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [00:21:59] RoanKattouw Niharika Urbanecm is it too late to add something? I'd like to backport https://gerrit.wikimedia.org/r/c/mediawiki/extensions/AbuseFilter/+/655417 to the currently deployed branch [00:27:43] DannyS712: How urgent is that? It contains an i18n change, so it would require a full scap [00:30:04] it would be very helpful to oversighters, since the functionality to change the visibility of a hit from the details view, which is usually how these are reported, is currently broken. Its *possible* to work around this, but takes a bit of time. 
[00:31:58] I think it's worth it for a full scap [00:34:41] Full scap out of hours before a holiday weekend? Eh. [00:35:20] out of hours? It's the middle of the backport window... also there is a holiday? [00:36:07] (03PS1) 10DannyS712: Restore hide link when viewing single AbuseLog entries [extensions/AbuseFilter] (wmf/1.36.0-wmf.26) - 10https://gerrit.wikimedia.org/r/655943 (https://phabricator.wikimedia.org/T271667) [00:36:10] DannyS712: Monday is a no-deploy day. [00:36:34] DannyS712: By the time CI finishes on the patch we'll be way past the backport window. [00:37:04] oh, MLK. Okay. So no-go until Tuesday? [00:37:27] Unless it's really urgent. [00:37:51] I don't do AF suppression in practice, so… [00:38:05] I haven't run into it yet, but I know other oversighters have [00:39:31] it's especially a pain on mobile [00:44:09] (03CR) 10Daimona Eaytoy: "Do we really need a backport?" [extensions/AbuseFilter] (wmf/1.36.0-wmf.26) - 10https://gerrit.wikimedia.org/r/655943 (https://phabricator.wikimedia.org/T271667) (owner: 10DannyS712) [00:44:47] Catching up now [00:45:35] I understand it might be a pain rn, but not sure if it deserves a full scap right now [00:45:51] s/ rn// [00:58:08] (03CR) 10jerkins-bot: [V: 04-1] Restore hide link when viewing single AbuseLog entries [extensions/AbuseFilter] (wmf/1.36.0-wmf.26) - 10https://gerrit.wikimedia.org/r/655943 (https://phabricator.wikimedia.org/T271667) (owner: 10DannyS712) [01:04:54] (03CR) 10DannyS712: "recheck" [extensions/AbuseFilter] (wmf/1.36.0-wmf.26) - 10https://gerrit.wikimedia.org/r/655943 (https://phabricator.wikimedia.org/T271667) (owner: 10DannyS712) [01:19:36] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:22:08] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within 
thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:32:32] PROBLEM - PHP opcache health on mw2268 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [02:29:05] 10SRE, 10ops-codfw, 10DC-Ops, 10RESTBase: restbase2009 reimaging issues - https://phabricator.wikimedia.org/T269853 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` restbase2009.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/20210... [02:29:07] 10SRE, 10ops-codfw, 10DC-Ops, 10RESTBase: restbase2009 reimaging issues - https://phabricator.wikimedia.org/T269853 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['restbase2009.codfw.wmnet'] ` Of which those **FAILED**: ` ['restbase2009.codfw.wmnet'] ` [02:29:21] 10SRE, 10ops-codfw, 10DC-Ops, 10RESTBase: restbase2009 reimaging issues - https://phabricator.wikimedia.org/T269853 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` restbase2009.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/20210... 
[02:49:59] (03PS1) 10Andrew Bogott: keystone: slight policy adjustments [puppet] - 10https://gerrit.wikimedia.org/r/656283 (https://phabricator.wikimedia.org/T272117) [02:51:00] (03CR) 10Andrew Bogott: [C: 03+2] keystone: slight policy adjustments [puppet] - 10https://gerrit.wikimedia.org/r/656283 (https://phabricator.wikimedia.org/T272117) (owner: 10Andrew Bogott) [03:06:22] 10SRE, 10ops-codfw, 10DC-Ops, 10RESTBase: restbase2009 reimaging issues - https://phabricator.wikimedia.org/T269853 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['restbase2009.codfw.wmnet'] ` Of which those **FAILED**: ` ['restbase2009.codfw.wmnet'] ` [03:12:55] (03PS1) 10Gergő Tisza: Update /analytics/legacy/homepagemodule/ schema version to 1.1.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/656284 (https://phabricator.wikimedia.org/T270309) [06:07:41] (03PS1) 10Joal: profile::analytics::refinery Add HDFS folders [puppet] - 10https://gerrit.wikimedia.org/r/656307 (https://phabricator.wikimedia.org/T271560) [06:09:26] (03CR) 10Joal: "@elukey: I'm not sure about the place where to create the folders to ensure correct ordering (groups existing for instance). 
Let me know i" [puppet] - 10https://gerrit.wikimedia.org/r/656307 (https://phabricator.wikimedia.org/T271560) (owner: 10Joal) [07:08:44] (03CR) 10Elukey: "The change looks very good but I think that we'd need to add a hiera selector since we use the profile in multiple places/hosts:" [puppet] - 10https://gerrit.wikimedia.org/r/656307 (https://phabricator.wikimedia.org/T271560) (owner: 10Joal) [07:35:18] (03PS1) 10Elukey: sre.hosts.reboot-single: move to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/656310 [07:36:48] (03PS2) 10Elukey: sre.hosts.reboot-single: move to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/656310 [07:38:42] (03CR) 10jerkins-bot: [V: 04-1] sre.hosts.reboot-single: move to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/656310 (owner: 10Elukey) [07:41:30] yep yep [07:43:37] PROBLEM - SSH on logstash2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:45:47] (03PS3) 10Elukey: sre.hosts.reboot-single: move to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/656310 [07:46:22] 10SRE, 10DBA: Refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [07:49:20] (03PS1) 10Ryan Kemper: T262211: bring "new" relforge hosts into service [puppet] - 10https://gerrit.wikimedia.org/r/656369 [07:49:48] (03CR) 10jerkins-bot: [V: 04-1] T262211: bring "new" relforge hosts into service [puppet] - 10https://gerrit.wikimedia.org/r/656369 (owner: 10Ryan Kemper) [07:50:51] (03PS2) 10Ryan Kemper: search: bring "new" relforge hosts into service [puppet] - 10https://gerrit.wikimedia.org/r/656369 (https://phabricator.wikimedia.org/T262211) [07:57:07] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single [07:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:14] (03CR) 10Ryan Kemper: [C: 03+2] query_service: Migrate hiera() to lookup() in common.pp [puppet] - 
10https://gerrit.wikimedia.org/r/656266 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [07:59:27] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [07:59:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210115T0800) [08:01:51] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single [08:01:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:41] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:05:53] (03CR) 10Elukey: [V: 03+1] "Effie: Ok to merge?" [puppet] - 10https://gerrit.wikimedia.org/r/647190 (owner: 10Elukey) [08:06:59] (03CR) 10Elukey: [C: 03+2] cache: Make statsd address an argument and hiera() -> lookup() [puppet] - 10https://gerrit.wikimedia.org/r/655790 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [08:07:27] (03CR) 10Elukey: [C: 03+2] cache: Migrate hiera() to lookup() and setting datatype in eventlogging [puppet] - 10https://gerrit.wikimedia.org/r/656015 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [08:13:46] (03CR) 10Ryan Kemper: [C: 03+2] query_service: Remove gui files from wdqs [puppet] - 10https://gerrit.wikimedia.org/r/655955 (https://phabricator.wikimedia.org/T271851) (owner: 10Ladsgroup) [08:15:12] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [08:15:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:18] (03CR) 10Ryan Kemper: "Sounds like we can circle back to clean up `modules/query_service/manifests/gui.pp` at a latter time if we so choose." 
[puppet] - 10https://gerrit.wikimedia.org/r/655955 (https://phabricator.wikimedia.org/T271851) (owner: 10Ladsgroup) [08:15:28] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single [08:15:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:40] (03CR) 10David Caro: [C: 03+2] wmcs.ceph.osd: disable write caches when possible [puppet] - 10https://gerrit.wikimedia.org/r/655923 (https://phabricator.wikimedia.org/T271527) (owner: 10David Caro) [08:17:59] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 132, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:18:21] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:18:53] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:19:37] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01545 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [08:21:24] RECOVERY - SSH on logstash2005 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:23:31] <_joe_> uh several puppet failures [08:23:56] <_joe_> wdqs [08:24:07] _joe_: yup reverting right now [08:24:15] https://www.irccloud.com/pastebin/KwZ1fyzY/ [08:24:33] (03PS1) 10Ryan Kemper: Revert "query_service: Remove gui files from wdqs" [puppet] - 10https://gerrit.wikimedia.org/r/656295 [08:24:34] <_joe_> ryankemper: hi! 
I didn't notice you were still up :D [08:25:07] Working a weird schedule today :P [08:25:17] (03CR) 10Ryan Kemper: "----- OUTPUT of 'sudo run-puppet-agent' -----" [puppet] - 10https://gerrit.wikimedia.org/r/656295 (owner: 10Ryan Kemper) [08:25:21] (03CR) 10Ryan Kemper: [C: 03+2] Revert "query_service: Remove gui files from wdqs" [puppet] - 10https://gerrit.wikimedia.org/r/656295 (owner: 10Ryan Kemper) [08:25:37] XioNoX: we (Growth team) would like to do an emergency deploy sometime today, a one line patch to unbreak our main feature. Would this be OK? The patch is https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/656370 [08:26:05] (Cc tgr_ ) [08:27:24] <_joe_> kostajh: what is broken? [08:27:44] (03PS1) 10David Caro: wmcs.ceph.osd: actually disable write caches [puppet] - 10https://gerrit.wikimedia.org/r/656371 [08:27:47] <_joe_> oh I see [08:28:10] !log WDQS puppet run successful [08:28:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:22] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [08:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:24] _joe_: on the wikis we have our extension deployed to, most users are directed to Special:Homepage after account creation. 
The main content on there is broken due to a failing elastic search query [08:28:43] <_joe_> kostajh: yeah found the task and reading it [08:29:05] <_joe_> So the user experience is broken, that's definitely something that should be deployed in an emergency [08:29:19] <_joe_> and AIUI has no expected perf impact/risk with rollback [08:29:20] _joe_: yes, the user experience is broken [08:29:27] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] wmcs.ceph.osd: actually disable write caches [puppet] - 10https://gerrit.wikimedia.org/r/656371 (owner: 10David Caro) [08:30:02] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.003708 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [08:30:09] <_joe_> ryankemper: <3 [08:30:10] The list of tasks we show to a user comes from the cache. The patch disables validating those tasks for freshness [08:35:12] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27483/console" [puppet] - 10https://gerrit.wikimedia.org/r/655516 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [08:36:11] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:39:25] !log Restart clouddb1013-clouddb1020 [08:39:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:26] RECOVERY - Check systemd state on ncredir5002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:44:48] (03CR) 10Filippo Giunchedi: [V: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/655516 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [08:44:52] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+1] monitoring::host: move hostgroup_default to params, hiera->lookup [puppet] - 10https://gerrit.wikimedia.org/r/655516 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [08:45:02] PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 2 down 24 https://wikitech.wikimedia.org/wiki/HAProxy [08:45:22] ^ expected [08:45:31] !log installing bast4003 T257324 [08:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:34] T257324: Consolidate edge bastion server into ganeti - https://phabricator.wikimedia.org/T257324 [09:04:47] (03CR) 10Joal: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/656307 (https://phabricator.wikimedia.org/T271560) (owner: 10Joal) [09:04:56] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:05:53] (03PS2) 10Joal: profile::analytics::refinery Create HDFS folders [puppet] - 10https://gerrit.wikimedia.org/r/656307 (https://phabricator.wikimedia.org/T271560) [09:07:16] Is there anyone who can merge a change to unblock CI for CentralAuth? 
[09:07:24] (03CR) 10jerkins-bot: [V: 04-1] profile::analytics::refinery Create HDFS folders [puppet] - 10https://gerrit.wikimedia.org/r/656307 (https://phabricator.wikimedia.org/T271560) (owner: 10Joal) [09:07:58] !log installing bast5002 T257324 [09:08:00] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/656298 for https://phabricator.wikimedia.org/T272123 [09:08:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:02] T257324: Consolidate edge bastion server into ganeti - https://phabricator.wikimedia.org/T257324 [09:08:56] 10SRE, 10SRE-Access-Requests: Analytics access for dev: Bill Pirkle - https://phabricator.wikimedia.org/T272065 (10Aklapper) //(For future reference, feel free to use https://phabricator.wikimedia.org/maniphest/task/edit/form/8/ for such requests, linked from https://wikitech.wikimedia.org/wiki/Analytics/Data_... [09:09:41] 10SRE, 10ops-eqiad, 10DBA: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (10Marostegui) [09:09:58] 10SRE, 10SRE-Access-Requests: Analytics access for dev: Nikki Nikkhoui - https://phabricator.wikimedia.org/T272057 (10Aklapper) //(For future reference, feel free to use https://phabricator.wikimedia.org/maniphest/task/edit/form/8/ for such requests, linked from https://wikitech.wikimedia.org/wiki/Analytics/Da... 
[09:10:08] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:11:26] 10SRE, 10ops-eqiad, 10DBA: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (10Marostegui) [09:12:14] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([druid1008.eqiad.wmnet, druid1004.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [09:12:34] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:12:40] 10SRE, 10ops-eqiad, 10DBA: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (10Marostegui) p:05Triage→03High Setting to high as we are trying to finish up the new wiki replicas infra [09:13:56] (03CR) 10Elukey: profile::analytics::refinery Create HDFS folders (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/656307 (https://phabricator.wikimedia.org/T271560) (owner: 10Joal) [09:14:55] (03CR) 10Alexandros Kosiaris: [C: 03+1] "+1ed. Ping me to merge after the -1 by Dan (pending on the completion of the aforementioned task) is removed." 
[puppet] - 10https://gerrit.wikimedia.org/r/655500 (https://phabricator.wikimedia.org/T271477) (owner: 10Dduvall) [09:15:06] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:16:24] we are working on --^ [09:16:35] it is a little subset of the aqs api, related to druid [09:16:52] when we drop old datasources it sadly stops answering queries, we didn't find a good way to do it [09:17:05] also I think that I should group alerts for AQS, to avoid spamming [09:18:59] !log swift codfw-prod: more weight to ms-be20[58-61] - T269337 [09:19:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:02] T269337: Add ms-be20[58-61] to swift - https://phabricator.wikimedia.org/T269337 [09:23:22] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [09:24:02] !log rolling restart of dbprov1* hosts [09:24:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:07] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [cookbooks] - 10https://gerrit.wikimedia.org/r/656310 (owner: 10Elukey) [09:24:27] (03PS1) 10Kosta Harlan: Temporarily disable cache revalidation [extensions/GrowthExperiments] (wmf/1.36.0-wmf.26) - 10https://gerrit.wikimedia.org/r/656301 (https://phabricator.wikimedia.org/T272103) [09:25:04] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:25:04] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:25:54] (03PS2) 10Kosta Harlan: Temporarily disable cache revalidation [extensions/GrowthExperiments] 
(wmf/1.36.0-wmf.26) - 10https://gerrit.wikimedia.org/r/656301 (https://phabricator.wikimedia.org/T272103) [09:27:47] _joe_ / XioNoX here's the patch for wmf.26, what is the next step in the process to deploy it? https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/656301 [09:28:27] <_joe_> kostajh: merge and deploy [09:29:03] <_joe_> there is no additional process to the normal emergency deployment [09:29:03] tgr_: are you around to deploy? [09:29:28] ok, so it doesn't need to be added somewhere here for example https://wikitech.wikimedia.org/wiki/Deployments#Friday,_January_15 [09:29:34] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:30:07] I'd rather not deploy at 1 AM, easy to make mistakes and a few more hours won't make that much of a difference [09:30:12] I can deploy it in the morning [09:30:20] <_joe_> kostajh: a good record in SAL is better [09:30:46] <_joe_> tgr_: we can search for a deployer in the meantime, but ack! [09:30:49] 10SRE, 10ops-eqiad, 10DBA: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (10Marostegui) There's something going on with this host: ` racadm>>serveraction powerstatus Server power status: OFF racadm>>serveraction powerup Server power operation initiated successfully racadm>>serverac... [09:31:54] kostajh: could I get a +2 for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/656298 if you don't mind? 
[09:34:04] 10SRE, 10ops-eqiad, 10DBA: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (10aborrero) [09:34:07] 10SRE, 10ops-eqiad, 10DBA: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (10Marostegui) And without doing anything again: ` racadm>>serveraction powerstatus Server power status: ON ` [09:34:39] 10SRE, 10ops-eqiad, 10DBA: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (10Marostegui) Interestingly, I cannot see anything on the console, so I have no idea what it is doing and if it is rebooting or doing something else. [09:35:35] RhinosF1: trying to take a quick look now but would be better if you could find someone else. Have two kids at home due to school closures and dealing with other issues now unfortunately [09:35:49] What're we looking at? [09:35:58] 10SRE, 10Data-Persistence-Backup, 10SRE-swift-storage, 10Traffic, 10netops: Depool codfw swift cluster - https://phabricator.wikimedia.org/T267338 (10jcrespo) p:05Triage→03High [09:36:09] Reedy: unblocking CI on central auth due to phan failure [09:36:17] kostajh: ack [09:36:21] <_joe_> no, we're not talking about that [09:36:39] <_joe_> we're talking about https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/656301 [09:36:46] !log rolling restart acme-chief servers to catch up on kernel upgrades [09:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:25] <_joe_> RhinosF1: we're in the midst of an emergency deployment, please hold on :) [09:38:06] Reedy: are you able to help with deploying https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/656301 ? 
[09:38:07] <_joe_> kostajh: I would assume the task should be UBN!, btw [09:38:17] yeah I am around as well [09:38:18] I was about to send an all caps message [09:38:27] was busy ranting over some private message [09:38:28] I'm just quickly reading [09:38:30] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:38:30] {done} [09:38:42] <_joe_> hashar: :* [09:39:07] kostajh: tgr_ I can take over, please do take care of kids or your sleep schedule! :] [09:40:36] of course the change in master breaks bah [09:40:37] (03CR) 10Reedy: [C: 03+2] Temporarily disable cache revalidation [extensions/GrowthExperiments] (wmf/1.36.0-wmf.26) - 10https://gerrit.wikimedia.org/r/656301 (https://phabricator.wikimedia.org/T272103) (owner: 10Kosta Harlan) [09:40:42] <_joe_> elukey: have you seen the alert about the druid brokers in LVS? [09:41:12] hashar: unrelated AF errors on master though [09:41:16] hashar: the change in master should be OK once the abusefilter patch is done merging. But... 
yeah [09:41:27] deployment branch is seemingly ok, it's just doing ALL OF THE BROWSER TESTS [09:41:43] <_joe_> Reedy: sit back, relax, and enjoy the tests [09:42:03] We should live stream the browsertests somewhere [09:42:12] hashar: gogo high priority feature request [09:42:23] hmm [09:42:27] yeah that should be doable [09:42:48] since we use Xvfb as a frame buffer, theoretically we can stream the buffer to YouTube or Twitch [09:43:23] <_joe_> elukey: nevermind, it's icinga being weird, the alert has recovered since forever, it's just still critical in icinga for $reasons [09:44:07] * apergos is around too if testing help is needed [09:44:22] so basically let's wait for patches to get merged, then I guess it is all about confirming the use case in https://phabricator.wikimedia.org/T272103 is addressed on mwdebug [09:44:25] and we can roll forward [09:44:32] uh huh [09:45:09] let me load up the page on mwdebug now and confirm I can break it, then I'll be in shape to test when it rolls out [09:45:38] I wasn't even aware we have [[Special:Homepage]] :-\ [09:46:02] me neither [09:46:41] that is super nice (has to be enabled in one's user preferences) [09:47:23] which gets one a customized home page that lists edit suggestions, how many folks watched articles I have changed (a few millions yeah!!) etc [09:47:39] and somehow Trizek is my tutor (hi!) 
[09:47:49] quick, ask lots of questions [09:49:47] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single [09:49:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:34] (03Merged) 10jenkins-bot: Temporarily disable cache revalidation [extensions/GrowthExperiments] (wmf/1.36.0-wmf.26) - 10https://gerrit.wikimedia.org/r/656301 (https://phabricator.wikimedia.org/T272103) (owner: 10Kosta Harlan) [09:51:54] (03PS1) 10Ayounsi: Add jhernandez to deployment [puppet] - 10https://gerrit.wikimedia.org/r/656375 (https://phabricator.wikimedia.org/T271859) [09:52:23] hashar: https://www.mediawiki.org/wiki/Growth has the gory details if you're curious. It's not on all wikis (yet) [09:52:23] (03CR) 10jerkins-bot: [V: 04-1] Add jhernandez to deployment [puppet] - 10https://gerrit.wikimedia.org/r/656375 (https://phabricator.wikimedia.org/T271859) (owner: 10Ayounsi) [09:52:45] ok wow with uselang=en I get the name of the string for translation.... but anyways, yep I can get the error. [09:53:11] great thank you apergos! [09:53:28] and now we watch jenkins :-) [09:53:33] I get it as well [09:53:53] apergos: what do you mean about name of the string for translation? [09:54:03] I'm guessing [09:54:06] I mean that all these strings that you see [09:54:09] rather than an English string [09:54:14] so it's behaving like qqx [09:54:16] are look-upable in a file of i18n things [09:54:20] each of those things has a name [09:54:29] right. 
but I don't see that with uselang=en, which is why I'm concerned/confused [09:54:39] (03PS2) 10Ayounsi: Add jhernandez to deployment [puppet] - 10https://gerrit.wikimedia.org/r/656375 (https://phabricator.wikimedia.org/T271859) [09:54:40] when you go to translatewiki you can get the name of the string and translate to the lang of your choice [09:54:59] I get the name of that string displayed rather than an en error message :-D [09:55:06] if you're seeing the string name with uselang=en, that sounds like a problem of its own [09:55:17] growthexperiments-homepage-suggestededits-error-title [09:55:21] I've just pulled the fix onto mwdebug1002 [09:55:28] ok lemme see what happens [09:55:32] (03PS1) 10Joal: profile::analytics::refinery::job::hdfs_cleaner Update [puppet] - 10https://gerrit.wikimedia.org/r/656376 (https://phabricator.wikimedia.org/T271560) [09:55:45] Reedy: thanks, looks good to me [09:55:52] fixed [09:56:08] ah [09:56:13] I had "uselang=n" [09:56:14] lol [09:56:16] lmao [09:56:26] FWIW the non-JavaScript experience was never broken, for the NoScript users among you :) [09:56:36] and now my day is going to be ruined as I empty up my queue of 200 suggested edits [09:57:08] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to labweb1001 and labweb1002 for jhernandez - https://phabricator.wikimedia.org/T271859 (10ayounsi) [09:57:15] (03CR) 10jerkins-bot: [V: 04-1] profile::analytics::refinery::job::hdfs_cleaner Update [puppet] - 10https://gerrit.wikimedia.org/r/656376 (https://phabricator.wikimedia.org/T271560) (owner: 10Joal) [09:57:56] <_joe_> apergos: what language is that? 
[09:58:24] !log reedy@deploy1001 Synchronized php-1.36.0-wmf.26/extensions/GrowthExperiments/includes/NewcomerTasks/TaskSuggester/CacheDecorator.php: T272103 (duration: 00m 57s) [09:58:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:27] T272103: [regression - wmf.26] frwiki Homepage SE module has 'cirrussearch-query-too-long' for default filters - https://phabricator.wikimedia.org/T272103 [09:58:43] I guess I didn't realize you can add arbitrary input for the uselang parameter, without validation / fallback to a known language code [09:58:49] none! [09:59:23] <_joe_> uuh yeah maybe let's not elaborate too much on that :) [09:59:51] yeah anyways no errors for me [10:00:23] long after the deploy already went around :-D [10:00:46] I had once defined an 'EN' language [10:00:51] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/656375 (https://phabricator.wikimedia.org/T271859) (owner: 10Ayounsi) [10:01:08] Reedy / hashar / apergos / _joe_ thank you for your help! [10:01:09] which really was English but with uppercase() applied to all messages. That was for the INTERNATIONAL CAPS LOCK DAY [10:01:11] np [10:01:17] \o/ [10:01:25] and thanks tgr_ for the extra investigation! [10:02:06] 10SRE, 10serviceops, 10User-jijiki: Enable TLS on memcached - https://phabricator.wikimedia.org/T271967 (10jijiki) p:05Triage→03Medium [10:02:23] !log vgutierrez@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [10:02:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:59] PROBLEM - Keyholder SSH agent on acmechief2001 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. 
https://wikitech.wikimedia.org/wiki/Keyholder [10:03:50] vgutierrez: o/ if you are ok I'd merge https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/656310 to update the cookbook for reboot single, and then we could test it to see if there is anything to fix [10:03:54] would it be ok? [10:04:02] * vgutierrez checking [10:04:09] (03PS6) 10Giuseppe Lavagetto: Switch the base image to buster from stretch. [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/615683 [10:04:11] (03CR) 10Ayounsi: [C: 03+2] Add jhernandez to deployment [puppet] - 10https://gerrit.wikimedia.org/r/656375 (https://phabricator.wikimedia.org/T271859) (owner: 10Ayounsi) [10:04:16] elukey: yeah :) [10:05:29] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Switch the base image to buster from stretch. [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/615683 (owner: 10Giuseppe Lavagetto) [10:05:59] RECOVERY - Keyholder SSH agent on acmechief2001 is OK: OK: Keyholder is armed with all configured keys. https://wikitech.wikimedia.org/wiki/Keyholder [10:06:10] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to labweb1001 and labweb1002 for jhernandez - https://phabricator.wikimedia.org/T271859 (10ayounsi) You're all set, give it 30min for Puppet to run. Let me know if any issues. 
[10:06:16] 10SRE, 10Wikimedia-Logstash: Update saved / short links with objects in ELK7 - https://phabricator.wikimedia.org/T272016 (10fgiunchedi) p:05Triage→03Medium [10:06:17] vgutierrez: ack merging [10:06:18] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to labweb1001 and labweb1002 for jhernandez - https://phabricator.wikimedia.org/T271859 (10ayounsi) 05Open→03Resolved a:03ayounsi [10:06:24] (03CR) 10Elukey: [C: 03+2] sre.hosts.reboot-single: move to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/656310 (owner: 10Elukey) [10:06:39] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/656310 (owner: 10Elukey) [10:06:51] I have dropped the unbreak now status [10:06:58] Reedy: thx for the deployment! [10:07:55] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-test-client1001.eqiad.wmnet [10:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:58] 10SRE, 10Wikimedia-Logstash: Update saved / short links with objects in ELK7 - https://phabricator.wikimedia.org/T272016 (10fgiunchedi) >>! In T272016#6747870, @Lucas_Werkmeister_WMDE wrote: > Is it possible to restore the /goto/ links? AIUI missing `/goto/` links was an expected side effect of the migration,...
[10:08:04] nice :) [10:09:58] (03PS3) 10Joal: profile::analytics::refinery Create HDFS folders [puppet] - 10https://gerrit.wikimedia.org/r/656307 (https://phabricator.wikimedia.org/T271560) [10:10:02] (03CR) 10Joal: profile::analytics::refinery Create HDFS folders (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/656307 (https://phabricator.wikimedia.org/T271560) (owner: 10Joal) [10:10:14] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-client1001.eqiad.wmnet [10:10:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:16] (03PS2) 10Joal: profile::analytics::refinery::job::hdfs_cleaner Update [puppet] - 10https://gerrit.wikimedia.org/r/656376 (https://phabricator.wikimedia.org/T271560) [10:12:43] PROBLEM - Check systemd state on ms-be2029 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:13:05] (03PS1) 10Ayounsi: Add nikkin to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/656377 (https://phabricator.wikimedia.org/T272057) [10:13:07] (03PS1) 10Ayounsi: Add bpirkle to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/656378 (https://phabricator.wikimedia.org/T272065) [10:13:58] vgutierrez: my test worked, you can proceed if you have another one [10:16:12] 10SRE, 10DBA, 10Orchestrator, 10CAS-SSO, 10User-Kormat: orchestrator: Support SSO - https://phabricator.wikimedia.org/T266106 (10Kormat) [10:16:43] PROBLEM - Check systemd state on ms-be2029 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:18:23] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Analytics access for dev: Nikki Nikkhoui - https://phabricator.wikimedia.org/T272057 (10ayounsi) @Ottomata do we need approval from you (Analytics) as well? 
[10:18:38] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Add nikkin to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/656377 (https://phabricator.wikimedia.org/T272057) (owner: 10Ayounsi) [10:18:56] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Add bpirkle to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/656378 (https://phabricator.wikimedia.org/T272065) (owner: 10Ayounsi) [10:18:57] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Analytics access for dev: Bill Pirkle - https://phabricator.wikimedia.org/T272065 (10ayounsi) @Ottomata do we need approval from you (Analytics) as well? [10:19:00] (03CR) 10Arturo Borrero Gonzalez: "We also should drop the reference from modules/aptrepo/files/distributions-wikimedia" [puppet] - 10https://gerrit.wikimedia.org/r/639881 (https://phabricator.wikimedia.org/T263284) (owner: 10Bstorm) [10:19:04] (03CR) 10Muehlenhoff: "One note inline, also needs approval by Otto in the Phab task, other than that looks fine." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/656377 (https://phabricator.wikimedia.org/T272057) (owner: 10Ayounsi) [10:19:49] (03CR) 10Muehlenhoff: Add bpirkle to analytics-privatedata-users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/656378 (https://phabricator.wikimedia.org/T272065) (owner: 10Ayounsi) [10:21:02] elukey: ok :) [10:21:19] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host acmechief-test2001.codfw.wmnet [10:21:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:29] PROBLEM - Check systemd state on ms-be2035 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:24:57] PROBLEM - Check systemd state on ms-be2054 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:25:13] elukey: looking good here as well [10:25:22] perfect [10:25:27] the ms-be2* errors are sort-of expected, rebalancing [10:26:36] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief-test2001.codfw.wmnet [10:26:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:21] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host acmechief-test1001.eqiad.wmnet [10:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:41] PROBLEM - SSH on ms-be2032 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:30:06] (03PS1) 10Muehlenhoff: Make bast4003/bast5002 bastion hosts [puppet] - 10https://gerrit.wikimedia.org/r/656380 (https://phabricator.wikimedia.org/T257324) [10:39:42] (03PS3) 10Arturo Borrero Gonzalez: cr/firewall.cf: cloud-in4: seperate dumps service ACL to a different term [homer/public] - 10https://gerrit.wikimedia.org/r/654823 (https://phabricator.wikimedia.org/T209082) [10:39:44] (03PS4) 10Arturo Borrero Gonzalez: cr/firewall.cf: cloud-in4: seperate gerrit service ACL to a different term [homer/public] - 10https://gerrit.wikimedia.org/r/654826 (https://phabricator.wikimedia.org/T209082) [10:39:46] (03PS3) 10Arturo Borrero Gonzalez: cr/firewall.cf: cloud-in4: fix ports for contint2001 [homer/public] - 10https://gerrit.wikimedia.org/r/654830 (https://phabricator.wikimedia.org/T209082) [10:39:59] RECOVERY - SSH on ms-be2032 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:40:50] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief-test1001.eqiad.wmnet [10:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:58] !log reboot mc2036 - T269596
[10:44:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:08] (03PS1) 10Giuseppe Lavagetto: production-images: switch to buster as seed image [puppet] - 10https://gerrit.wikimedia.org/r/656381 [10:45:33] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-conf1001.eqiad.wmnet [10:45:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:01] PROBLEM - haproxy failover on dbproxy1019 is CRITICAL: CRITICAL check_failover servers up 16 down 2 https://wikitech.wikimedia.org/wiki/HAProxy [10:46:19] <_joe_> uh, kormat/marostegui? [10:46:44] !log disable puppet on acme-chief clients [10:46:46] oh. hi. looking [10:46:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:31] (03CR) 10Giuseppe Lavagetto: [C: 03+2] production-images: switch to buster as seed image [puppet] - 10https://gerrit.wikimedia.org/r/656381 (owner: 10Giuseppe Lavagetto) [10:48:23] ACKNOWLEDGEMENT - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 16 down 2 Kormat Checking https://wikitech.wikimedia.org/wiki/HAProxy [10:48:23] ACKNOWLEDGEMENT - haproxy failover on dbproxy1019 is CRITICAL: CRITICAL check_failover servers up 16 down 2 Kormat Checking https://wikitech.wikimedia.org/wiki/HAProxy [10:48:46] (03CR) 10Giuseppe Lavagetto: "Sorry, I completely forgot this patch was here, and I reimplemented it myself yesterday :/" [puppet] - 10https://gerrit.wikimedia.org/r/597559 (owner: 10Cwhite) [10:48:47] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host acmechief1001.eqiad.wmnet [10:48:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:52] hey the haproxy thing is expected [10:50:06] kormat _joe_ ^ [10:50:18] marostegui: i figured that once i realised what's behind it :) [10:50:25] :-) [10:50:40] I will be back in an hour or so [10:51:12] some of our alerting is complicated to ack, it happened to me 
yesterday [10:51:13] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-conf1001.eqiad.wmnet [10:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:49] e.g. non-obvious dependencies or metrics monitoring more than one thing with a single alert [10:52:02] <_joe_> !log rebuilding the docker images coredns,nutcracker,prometheus-statsd-exporter,service-checker,wmfdebug to use wikimedia-buster as a base [10:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:36] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief1001.eqiad.wmnet [10:52:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:15] !log rolling restart of dbprov2* hosts [10:53:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:17] !log re-enable puppet on acme-chief clients [10:53:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:56] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-conf1003.eqiad.wmnet [10:53:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:09] PROBLEM - Check systemd state on ms-be2016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:55:57] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2036.codfw.wmnet [10:55:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:20] !log jiji@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host mc2036.codfw.wmnet [10:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:44] effie: I merged a new version of the cookbook this morning, if you see anything weird let me know [10:57:39] RECOVERY - Check systemd state on ms-be2029 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:58:11] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-conf1003.eqiad.wmnet [10:58:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:01] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2036.codfw.wmnet [10:59:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:19] RECOVERY - Check systemd state on ms-be2054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:59:36] elukey: ok ! 
[11:00:41] (03CR) 10Ayounsi: [C: 03+2] cr/firewall.cf: cloud-in4: seperate dumps service ACL to a different term [homer/public] - 10https://gerrit.wikimedia.org/r/654823 (https://phabricator.wikimedia.org/T209082) (owner: 10Arturo Borrero Gonzalez) [11:01:04] (03CR) 10Ayounsi: [C: 03+2] cr/firewall.cf: cloud-in4: seperate gerrit service ACL to a different term [homer/public] - 10https://gerrit.wikimedia.org/r/654826 (https://phabricator.wikimedia.org/T209082) (owner: 10Arturo Borrero Gonzalez) [11:01:13] (03Merged) 10jenkins-bot: cr/firewall.cf: cloud-in4: seperate dumps service ACL to a different term [homer/public] - 10https://gerrit.wikimedia.org/r/654823 (https://phabricator.wikimedia.org/T209082) (owner: 10Arturo Borrero Gonzalez) [11:01:20] (03CR) 10Ayounsi: [C: 03+2] cr/firewall.cf: cloud-in4: fix ports for contint2001 [homer/public] - 10https://gerrit.wikimedia.org/r/654830 (https://phabricator.wikimedia.org/T209082) (owner: 10Arturo Borrero Gonzalez) [11:01:43] (03Merged) 10jenkins-bot: cr/firewall.cf: cloud-in4: seperate gerrit service ACL to a different term [homer/public] - 10https://gerrit.wikimedia.org/r/654826 (https://phabricator.wikimedia.org/T209082) (owner: 10Arturo Borrero Gonzalez) [11:01:53] (03Merged) 10jenkins-bot: cr/firewall.cf: cloud-in4: fix ports for contint2001 [homer/public] - 10https://gerrit.wikimedia.org/r/654830 (https://phabricator.wikimedia.org/T209082) (owner: 10Arturo Borrero Gonzalez) [11:03:27] PROBLEM - Check systemd state on ms-be2016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:03:49] (03PS1) 10Joal: Update web.xml removing jetty default dir listing [debs/archiva] - 10https://gerrit.wikimedia.org/r/656382 (https://phabricator.wikimedia.org/T272082) [11:03:58] elukey: --^ for when you're back [11:05:35] 10SRE, 10DBA, 10Orchestrator, 10CAS-SSO, 10User-Kormat: orchestrator: Support SSO - https://phabricator.wikimedia.org/T266106 (10Kormat) [11:06:52] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2036.codfw.wmnet [11:06:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:58] !log update cloud-in4 firewall rules [11:12:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:06] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-conf1002.eqiad.wmnet [11:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:37] (03PS1) 10Giuseppe Lavagetto: wmfdebug: swap iproute with iproute2 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/656383 [11:18:25] (03CR) 10Kormat: [C: 03+1] wmfdebug: swap iproute with iproute2 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/656383 (owner: 10Giuseppe Lavagetto) [11:18:39] (03PS1) 10Giuseppe Lavagetto: production-images: add stretch to base images [puppet] - 10https://gerrit.wikimedia.org/r/656384 [11:19:49] RECOVERY - Check systemd state on ms-be2035 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:19:54] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-conf1002.eqiad.wmnet [11:19:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:57] PROBLEM - SSH on ms-be2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:23:07] RECOVERY - 
Check systemd state on ms-be2016 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:23:26] (03PS1) 10Effie Mouzeli: hiera: clean up memcached configuration [puppet] - 10https://gerrit.wikimedia.org/r/656385 (https://phabricator.wikimedia.org/T213089) [11:25:01] RECOVERY - SSH on ms-be2018 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:25:42] (03CR) 10Giuseppe Lavagetto: [C: 03+2] production-images: add stretch to base images [puppet] - 10https://gerrit.wikimedia.org/r/656384 (owner: 10Giuseppe Lavagetto) [11:28:19] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 213 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:30:17] !log rolling restart of eqiad source backup dbs [11:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:36] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] wmfdebug: swap iproute with iproute2 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/656383 (owner: 10Giuseppe Lavagetto) [11:30:53] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 36 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:35:37] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 285 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:37:15] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 27 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:38:01] there seems to be spikes of exceptions every 6 minutes [11:41:40] https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?orgId=1&var-datasource=eqiad%20prometheus%2Fops&viewPanel=2&from=1610707291821&to=1610710891822 [11:42:52] they seem to be OOMs, I think [11:53:37] (03CR) 10Volans: "LGTM, couple of questions/nits inline" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/655896 (https://phabricator.wikimedia.org/T271702) (owner: 10Jbond) [11:54:13] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: introduce set counters [puppet] - 10https://gerrit.wikimedia.org/r/656388 [11:56:15] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: introduce set counters [puppet] - 10https://gerrit.wikimedia.org/r/656388 (owner: 10Arturo Borrero Gonzalez) [11:59:10] (03CR) 10Muehlenhoff: [C: 03+2] Make bast4003/bast5002 bastion hosts [puppet] - 10https://gerrit.wikimedia.org/r/656380 (https://phabricator.wikimedia.org/T257324) (owner: 10Muehlenhoff) [12:04:49] (03CR) 10Volans: "General comment inline." (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/656139 (owner: 10Jbond) [12:06:02] (03PS1) 10Giuseppe Lavagetto: production-images: correctly refer to the registry with its variable [puppet] - 10https://gerrit.wikimedia.org/r/656391 [12:10:33] (03CR) 10Giuseppe Lavagetto: [C: 03+2] production-images: correctly refer to the registry with its variable [puppet] - 10https://gerrit.wikimedia.org/r/656391 (owner: 10Giuseppe Lavagetto) [12:10:42] (03CR) 10Volans: [C: 03+1] "Looks reasonable as a pure conversion. Some possible future expansion for later inline." 
(034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/656212 (https://phabricator.wikimedia.org/T269925) (owner: 10Elukey) [12:12:33] (03PS1) 10Hashar: Display image label when publishing [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/656393 [12:13:41] PROBLEM - Check systemd state on ms-be2020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:27:35] RECOVERY - Check systemd state on ms-be2020 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:27:53] (03CR) 10Vgutierrez: [C: 03+1] Remove the 'letsencrypt' module [puppet] - 10https://gerrit.wikimedia.org/r/655762 (https://phabricator.wikimedia.org/T252199) (owner: 10Andrew Bogott) [12:34:02] (03CR) 10Vgutierrez: [C: 04-1] "careful though.. it looks like the module still has some references on the code:" [puppet] - 10https://gerrit.wikimedia.org/r/655762 (https://phabricator.wikimedia.org/T252199) (owner: 10Andrew Bogott) [12:48:57] PROBLEM - Check systemd state on ms-be2018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:53:38] PROBLEM - Check systemd state on ms-be2016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:02:06] RECOVERY - Check systemd state on ms-be2018 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:13:00] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:13:20] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:17:17] (03PS1) 10Muehlenhoff: Update SSH default config for new bastions running on Ganeti [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/656402 [13:21:30] (03CR) 10Muehlenhoff: [V: 03+2] Update SSH default config for new bastions running on Ganeti [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/656402 (owner: 10Muehlenhoff) [13:21:35] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Update SSH default config for new bastions running on Ganeti [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/656402 (owner: 10Muehlenhoff) [13:24:40] RECOVERY - Check systemd state on ms-be2016 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:28:32] PROBLEM - SSH on ms-be2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:30:40] RECOVERY - SSH on ms-be2018 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:32:11] 10SRE, 10LDAP-Access-Requests, 10SRE-Access-Requests: Requesting access to production shell and wmf ldap access for Razzi Abuissa - https://phabricator.wikimedia.org/T261443 (10Aklapper) [13:36:50] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api 
(bad URL) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [13:41:18] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [20:32:39] (03PS1) 10Urbanecm: Compress frwiki's anniversary logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/656508 (https://phabricator.wikimedia.org/T272075) [20:32:49] James_F: sorry to distract you, mind +1'ing the above compress patch? ;) [20:33:57] (03CR) 10Jforrester: [C: 03+1] "Looks roughly right, eyeballing." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/656508 (https://phabricator.wikimedia.org/T272075) (owner: 10Urbanecm) [20:34:15] (03CR) 10Urbanecm: [C: 03+2] Compress frwiki's anniversary logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/656508 (https://phabricator.wikimedia.org/T272075) (owner: 10Urbanecm) [20:34:18] thanks, going to finish this then [20:34:59] <_joe_> James_F: it would be great if people did reply when a time sensitive inquiry gets made by a volunteer, during their work day [20:35:28] (03Merged) 10jenkins-bot: Compress frwiki's anniversary logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/656508 (https://phabricator.wikimedia.org/T272075) (owner: 10Urbanecm) [20:35:37] <_joe_> apparently I was the only one available, and I don't think it was my role to say more than "there is no problem if you deploy a svg", which I did [20:36:07] <_joe_> (I'm not referring to you ofc, rather to people in releng and sre) [20:36:09] _joe_: Totally. It's not your fault. The system is explicitly designed to say 'no' unless things are sufficiently on fire that people are here anyway. [20:36:15] * James_F nods. [20:36:30] But also our CI for the config repo should spot un-crunched logos and whine. 
[20:36:41] !log urbanecm@deploy1001 Synchronized static/images/mobile/copyright/wikipedia-fr-20.svg: 66e6be391ecfde7ca0604146ab978987ce472b5c: Set anniversary logo for frwiki (1/3; T272075) (duration: 00m 58s) [20:36:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:45] T272075: Enable anniversary logo for fr.wikipedia - 20th birthday - https://phabricator.wikimedia.org/T272075 [20:36:50] <_joe_> well, it is a special occasion, and the thing that is on fire is we need a code deploy to change a logo :) [20:37:00] Yes. [20:37:06] But the on-wiki process was worse. [20:37:15] <_joe_> oh my :D [20:37:33] <_joe_> you know there are alternatives to on-wiki and in-code, right? :P [20:37:56] _joe_: Alternatives, yes, but not processes we've actually tried. [20:37:56] !log urbanecm@deploy1001 Synchronized static/images/mobile/copyright/wikipedia-tagline-fr-20.svg: 66e6be391ecfde7ca0604146ab978987ce472b5c: Set anniversary logo for frwiki (2/3; T272075) (duration: 00m 55s) [20:37:56] <_joe_> some sort of configuration backoffice [20:37:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:01] Yeah. [20:38:11] PROBLEM - Host ganeti5002 is DOWN: PING CRITICAL - Packet loss = 100% [20:38:30] Or even on-wiki configuration requests via a special system, Special:SiteConfiguration, like Wikia do. [20:39:14] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 66e6be391ecfde7ca0604146ab978987ce472b5c: Set anniversary logo for frwiki (3/3; T272075) (duration: 00m 55s) [20:39:15] RECOVERY - Host ganeti5002 is UP: PING OK - Packet loss = 0%, RTA = 225.27 ms [20:39:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:36] anyway, thanks James_F and _joe_, this should be done now [20:39:59] <_joe_> Urbanecm: thank you for taking care of that for the french community :) [20:40:20] Urbanecm: I get the new logo on desktop but not mobile; is that intended? 
[20:40:38] James_F: yes [20:40:42] Ack. [20:40:48] Thanks James_F and Urbanecm! [20:40:58] OK, looks good. Thanks for doing this. Boo to frwiki for not asking for this beforehand. :-) [20:41:09] (Also it's not frwiki's birthday until March. Tsk. ;-)) [20:42:04] <_joe_> James_F: DETAILS [20:42:15] _joe_: Yeah yeah, I know. :-) [20:42:34] <_joe_> James_F: do you remember if frwiki was born before than itwiki? [20:43:03] <_joe_> yeah it was, boo [20:43:06] Yeah; first batch was fr/ca/de. Second batch was it/es/eo I think. [20:43:11] Sorry. :-) [20:50:32] 10SRE, 10Wikimedia-Mailing-lists, 10I18n: Mailman password reminder mail (and other texts) has broken encoding in Czech - https://phabricator.wikimedia.org/T271123 (10Mormegil) Well, yes, for Czech, the subscription confirmation e-mail seems to be sent correctly, now. But as I said above, it is a problem for... [20:53:15] 10SRE, 10ops-eqsin, 10serviceops: ganeti5002 was down / powered off, machine check entries in SEL - https://phabricator.wikimedia.org/T261130 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` ganeti5002.eqsin.wmnet ` The log can be found in `/var/log/... [20:55:27] I initially thought it was a flying spaghetti monster [20:57:38] (03Abandoned) 10Cwhite: profile: add ca_bundle configuration option to docker-pkg configs [puppet] - 10https://gerrit.wikimedia.org/r/597559 (owner: 10Cwhite) [21:03:16] 10SRE, 10ops-eqsin, 10serviceops: ganeti5002 was down / powered off, machine check entries in SEL - https://phabricator.wikimedia.org/T261130 (10Papaul) This issue was that after replacing the system motherboard I am guess that the credentials were restored in the new IDRAC board from the chassis flash bac... 
[21:19:01] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti5002.eqsin.wmnet with reason: REIMAGE [21:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:40] !log pt1979@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti5002.eqsin.wmnet with reason: REIMAGE [21:22:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:33] (03PS1) 10Bstorm: openstack: remove the queens hiera hiding out in places [puppet] - 10https://gerrit.wikimedia.org/r/656514 (https://phabricator.wikimedia.org/T261134) [21:26:14] (03PS3) 10Andrew Bogott: Add designate packages and manifests for openstack/train [puppet] - 10https://gerrit.wikimedia.org/r/656502 (https://phabricator.wikimedia.org/T261135) [21:26:16] (03PS1) 10Andrew Bogott: Change profile::openstack::eqiad1::version from queens to stein for VMs [puppet] - 10https://gerrit.wikimedia.org/r/656515 [21:27:07] (03CR) 10Andrew Bogott: [C: 03+2] Change profile::openstack::eqiad1::version from queens to stein for VMs [puppet] - 10https://gerrit.wikimedia.org/r/656515 (owner: 10Andrew Bogott) [21:28:59] (03PS2) 10Bstorm: openstack: remove the queens hiera hiding out in places [puppet] - 10https://gerrit.wikimedia.org/r/656514 (https://phabricator.wikimedia.org/T261134) [21:29:28] (03CR) 10Andrew Bogott: "correction: on Stretch this doesn't upgrade anything. On Buster VMs it will cause newly-built or -upgraded VMs to install Stein client pa" [puppet] - 10https://gerrit.wikimedia.org/r/656515 (owner: 10Andrew Bogott) [21:31:17] PROBLEM - Check systemd state on ms-be2018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:32:05] (03PS3) 10Bstorm: openstack: remove the queens hiera hiding out in places [puppet] - 10https://gerrit.wikimedia.org/r/656514 (https://phabricator.wikimedia.org/T261134) [21:34:24] (03CR) 10Bstorm: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/27498/" [puppet] - 10https://gerrit.wikimedia.org/r/656514 (https://phabricator.wikimedia.org/T261134) (owner: 10Bstorm) [21:38:04] 10SRE, 10ops-eqsin, 10serviceops: ganeti5002 was down / powered off, machine check entries in SEL - https://phabricator.wikimedia.org/T261130 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti5002.eqsin.wmnet'] ` and were **ALL** successful. [21:39:42] 10SRE, 10ops-eqsin, 10serviceops: ganeti5002 was down / powered off, machine check entries in SEL - https://phabricator.wikimedia.org/T261130 (10RobH) 05Open→03Resolved a:05wiki_willy→03RobH So this is now ready to be pushed back into service, resolving this hw repair task. [21:55:13] (03CR) 10Alexandros Kosiaris: [V: 03+1 C: 03+1] k8s_infrastructure_users: Amend to support groups, avoid uid conflicts [puppet] - 10https://gerrit.wikimedia.org/r/647011 (https://phabricator.wikimedia.org/T269461) (owner: 10Alexandros Kosiaris) [21:55:37] RECOVERY - Long running screen/tmux on maps1009 is OK: OK: No SCREEN or tmux processes detected. https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [21:59:08] 10SRE, 10SRE-Access-Requests: Requesting access to labweb1001 and labweb1002 for jhernandez - https://phabricator.wikimedia.org/T271859 (10Jhernandez) @Majavah right on the money! I thought `Host bast` would be matching bast on the input host but apparently not (no asterisks I guess). I've added a explicit se... [22:06:47] (03CR) 10Daimona Eaytoy: [C: 03+1] "Actually, I just realized that this change is currently a no-op. 
The default stage was made WRITE_BOTH directly in AbuseFilter back in Dec" [mediawiki-config] - https://gerrit.wikimedia.org/r/647116 (https://phabricator.wikimedia.org/T269712) (owner: Jforrester) [22:07:31] (CR) Daimona Eaytoy: [C: +1] "I believe this is ready, since all wikis are already at WRITE_BOTH" [mediawiki-config] - https://gerrit.wikimedia.org/r/647117 (https://phabricator.wikimedia.org/T269712) (owner: Jforrester) [22:21:59] RECOVERY - Check systemd state on ms-be2018 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:27:28] (PS1) RobH: adding new an-workers to puppet [puppet] - https://gerrit.wikimedia.org/r/656516 (https://phabricator.wikimedia.org/T260445) [22:27:43] (PS2) RobH: adding new an-workers to puppet [puppet] - https://gerrit.wikimedia.org/r/656516 (https://phabricator.wikimedia.org/T260445) [22:29:23] (CR) RobH: [C: +2] adding new an-workers to puppet [puppet] - https://gerrit.wikimedia.org/r/656516 (https://phabricator.wikimedia.org/T260445) (owner: RobH) [22:36:31] PROBLEM - Check systemd state on ms-be2018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:44:10] SRE, ops-eqiad, Analytics-Clusters, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` an-worker1118.eqiad.wmnet ` Th...
[22:46:43] SRE, ops-eqiad, Analytics-Clusters, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (RobH) [22:54:17] SRE, Wikimedia-Mailing-lists, I18n: Mailman password reminder mail (and other texts) has broken encoding in Czech - https://phabricator.wikimedia.org/T271123 (Ladsgroup) That's what I have been saying, if you fix something, it breaks something else. It's a whack-a-mole at the current state. [22:58:18] SRE, ops-eqiad, Analytics-Clusters, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1118.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1118.... [23:00:57] (PS1) PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/656521 [23:15:44] (PS1) RobH: splitting an-workers to their own netboot line [puppet] - https://gerrit.wikimedia.org/r/656522 (https://phabricator.wikimedia.org/T260445) [23:16:40] (PS2) RobH: splitting an-workers to their own netboot line [puppet] - https://gerrit.wikimedia.org/r/656522 (https://phabricator.wikimedia.org/T260445) [23:17:23] (CR) RobH: [C: +2] splitting an-workers to their own netboot line [puppet] - https://gerrit.wikimedia.org/r/656522 (https://phabricator.wikimedia.org/T260445) (owner: RobH) [23:24:25] SRE, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (RobH) [23:30:11] RECOVERY - Check systemd state on ms-be2018 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:31:11] SRE, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445
(ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` an-worker1118.eqiad.wmnet ` The log can be found in... [23:41:53] SRE, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1118.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1118.eqiad.wmnet'] ` [23:46:07] SRE, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` an-worker1118.eqiad.wmnet ` The log can be found in... [23:57:56] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1118.eqiad.wmnet with reason: REIMAGE [23:57:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:59:58] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1118.eqiad.wmnet with reason: REIMAGE [23:59:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log