[00:03:20] 10Operations, 10PHP 7.0 support, 10Patch-For-Review, 10Performance-Team (Radar): Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (10Tgr) >>! In T211488#5103305, @Joe wrote: > - `include_path` in php's ini is still set to the value it had in the old t... [00:03:24] urandom: :) np [00:03:45] urandom: i am not entirely sure if we can remove that from restbase::base. it's been added in 2017 i see [00:04:02] we can try and compile on * though [00:07:35] (03PS2) 10Volans: flake8: enforce import order and adopt W504 [software/spicerack] - 10https://gerrit.wikimedia.org/r/503147 [00:12:02] 10Operations, 10ops-codfw: audit all codfw pdu tower draws - https://phabricator.wikimedia.org/T163362 (10Dzahn) duplicate of T163339 ? [00:12:38] (03CR) 10Volans: flake8: enforce import order and adopt W504 (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/503147 (owner: 10Volans) [00:14:53] 10Operations, 10ops-codfw, 10Discovery-Search (Current work): Degraded RAID on elastic2048 - https://phabricator.wikimedia.org/T220038 (10Papaul) 05Open→03Resolved This is done, we can close it. [00:15:20] (03CR) 10Tim Starling: [C: 03+2] profiler: Increase max stack depth for sampling profiler to 250 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503083 (https://phabricator.wikimedia.org/T176916) (owner: 10Krinkle) [00:16:13] (03Merged) 10jenkins-bot: profiler: Increase max stack depth for sampling profiler to 250 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503083 (https://phabricator.wikimedia.org/T176916) (owner: 10Krinkle) [00:21:34] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Setup elasticsearch on new codfw servers - https://phabricator.wikimedia.org/T210265 (10Papaul) [00:21:38] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install elastic203[7-9], elastic204[0-9], elastic205[0-4] - https://phabricator.wikimedia.org/T210450 (10Papaul) 05Open→03Resolved @Gehel We can close this. Thanks [00:22:38] (03PS1) 10Dzahn: restbase::base: remove include passwords::cassandra [puppet] - 10https://gerrit.wikimedia.org/r/503151 [00:24:26] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10netops: switch port configuration for frmon2001 - https://phabricator.wikimedia.org/T196557 (10Papaul) 05Open→03Resolved This is done , it can be close. [00:24:29] 10Operations, 10fundraising-tech-ops, 10Patch-For-Review: rack/setup/install Prometeuse/Grafana host frmon2001 for fr-tech - https://phabricator.wikimedia.org/T196476 (10Papaul) [00:25:28] !log tstarling@deploy1001 Synchronized wmf-config/profiler.php: increase excimer max depth (duration: 00m 53s) [00:25:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:25:43] (03CR) 10jenkins-bot: profiler: Increase max stack depth for sampling profiler to 250 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503083 (https://phabricator.wikimedia.org/T176916) (owner: 10Krinkle) [00:26:55] 10Operations, 10ops-codfw: Degraded RAID on labtestcontrol2003 - https://phabricator.wikimedia.org/T218403 (10Dzahn) This ticket isn't an actual RAID failure. As the output says the check just failed to connect to the host at that time. the RAID check in Icinga says today: OK: Active: 6, Working: 6, Failed:... 
[00:27:05] 10Operations, 10ops-codfw: Degraded RAID on labtestcontrol2003 - https://phabricator.wikimedia.org/T218403 (10Dzahn) 05Open→03Invalid [00:28:50] 10Operations, 10ops-codfw: Degraded RAID on labtestcontrol2003 - https://phabricator.wikimedia.org/T218403 (10Dzahn) also "get_raid_status_md" doesn't exist at that location anymore, but: ` root@labtestcontrol2003:~# sudo /usr/local/lib/nagios/plugins/check_raid OK: Active: 6, Working: 6, Failed: 0, Spare:... [00:32:13] 10Operations, 10ops-codfw, 10monitoring: labtestcontrol2003 - UNKNOWN power supply status - https://phabricator.wikimedia.org/T220783 (10Dzahn) [00:32:20] 10Operations, 10ops-codfw, 10cloud-services-team: Degraded RAID on labtestservices2002 - https://phabricator.wikimedia.org/T218405 (10Papaul) 05Open→03Resolved a:03Papaul This host was renamed to cloudservices2002-dev and reimaged on T220101 and icinga is showing OK: Active: 6, Working: 6, Failed: 0,... [00:34:39] 10Operations, 10ops-codfw, 10DC-Ops: labtestneutron2002: refresh/rename to cloudnet2002-dev - https://phabricator.wikimedia.org/T214370 (10Papaul) a:03Papaul [00:38:48] 10Operations, 10ops-codfw, 10DC-Ops: codfw: rename/relabel labtestneutron2001 to cloudnet2001-dev - https://phabricator.wikimedia.org/T214181 (10Papaul) 05Open→03Resolved a:03Papaul @faidon yes we can resolve this [00:41:34] 10Operations, 10ops-codfw: update physical labels from naos.codfw.wmnet to deploy2001.codfw.wmnet - https://phabricator.wikimedia.org/T195421 (10Papaul) a:03Papaul [00:52:14] (03PS1) 10Dzahn: decom cloudnet2001-dev.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/503152 (https://phabricator.wikimedia.org/T218025) [00:53:26] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission: cloudnet2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T218025 (10Dzahn) per chat with Papaul: - switch port is done - server is > 5 years old and should not go back to spare - removi... [00:56:14] (03CR) 10Dzahn: [C: 03+2] decom cloudnet2001-dev.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/503152 (https://phabricator.wikimedia.org/T218025) (owner: 10Dzahn) [00:57:08] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission: cloudnet2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T218025 (10Dzahn) [01:00:05] !log puppet cert clean, puppet node clean, puppet node deactivate on cloudnet2001-dev.codfw.wmnet (T218025) [01:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:00:09] T218025: decommission: cloudnet2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T218025 [01:17:31] (03PS1) 10Dzahn: add wikibase.org as parked domain [dns] - 10https://gerrit.wikimedia.org/r/503154 [01:19:16] 10Operations, 10netops: Juniper security advisories (April 2019) - https://phabricator.wikimedia.org/T220716 (10ayounsi) 05Open→03Resolved thanks, tl;dr; all good! > 2019-04 Security Bulletin: Junos OS: SRX5000 series: Kernel crash (vmcore) upon receipt of a specific packet on fxp0 interface (CVE-2019-004... [01:56:33] PROBLEM - Free Blazegraph allocators wdqs-blazegraph on wdqs1009 is CRITICAL: cluster=wdqs-test instance=wdqs1009:9193 job=blazegraph site=eqiad https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=32&fullscreen [01:57:21] (03CR) 10BryanDavis: "I have corrected the Title-casing issues for all but 8 '(objectClass=posixaccount)' records in the LDAP directory. 
These 8 all have duplic" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497423 (https://phabricator.wikimedia.org/T165795) (owner: 10BryanDavis) [02:24:22] (03CR) 10MarkAHershberger: Package 1.19.4 with stdeb (033 comments) [debs/python-git-archive-all] - 10https://gerrit.wikimedia.org/r/500201 (owner: 10MarkAHershberger) [02:26:07] (03PS1) 10MarkAHershberger: Package 1.19.4 with stdeb [debs/python-git-archive-all] - 10https://gerrit.wikimedia.org/r/503155 [02:26:42] (03PS1) 10MarkAHershberger: I7e66e85e242f865246474e493bf92846f371ae2a [debs/python-git-archive-all] - 10https://gerrit.wikimedia.org/r/503156 [02:28:10] (03Abandoned) 10MarkAHershberger: Package 1.19.4 with stdeb [debs/python-git-archive-all] - 10https://gerrit.wikimedia.org/r/503155 (owner: 10MarkAHershberger) [02:35:18] (03PS2) 10MarkAHershberger: Address Kunal's concerns with already-merged code [debs/python-git-archive-all] - 10https://gerrit.wikimedia.org/r/503156 [02:41:52] (03PS3) 10MarkAHershberger: Address Kunal's concerns with already-merged code [debs/python-git-archive-all] - 10https://gerrit.wikimedia.org/r/503156 [03:20:07] (03PS4) 10MarkAHershberger: Address Kunal's concerns with already-merged code [debs/python-git-archive-all] - 10https://gerrit.wikimedia.org/r/503156 [03:24:05] (03CR) 10MarkAHershberger: "W00! lintian clean" [debs/python-git-archive-all] - 10https://gerrit.wikimedia.org/r/503156 (owner: 10MarkAHershberger) [04:57:10] 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui) [04:57:30] 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui) db2044 again: ` root@db2044:~# hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380264FFFB0) Port Name: 1I Port Name: 2I Gen8 ServBP... [04:58:08] 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui) [05:02:06] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install (5) codfw dedicated dump slaves - https://phabricator.wikimedia.org/T219463 (10Marostegui) [05:03:32] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install (5) codfw dedicated dump slaves - https://phabricator.wikimedia.org/T219463 (10Marostegui) 05Open→03Resolved All these host are now ready to be productionized at T220572. There is a problem with the controller exposure to the OS w... [05:07:28] (03PS1) 10Marostegui: db2[097|098|099|100|101]: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/503164 (https://phabricator.wikimedia.org/T219463) [05:10:27] (03CR) 10Marostegui: [C: 03+2] db2[097|098|099|100|101]: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/503164 (https://phabricator.wikimedia.org/T219463) (owner: 10Marostegui) [05:38:27] (03CR) 10Vgutierrez: [C: 04-1] "I'd like those checks to be there, as we intend to use this as a staging environment to validate changes before going to production." 
[puppet] - 10https://gerrit.wikimedia.org/r/503122 (owner: 10Dzahn) [06:01:43] 10Operations, 10Mail, 10Patch-For-Review: SPF record for canonical domains - https://phabricator.wikimedia.org/T193408 (10Vgutierrez) [06:01:47] 10Operations, 10DNS, 10Mail, 10Traffic, and 3 others: wikidata.org lacks SPF record - https://phabricator.wikimedia.org/T210134 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez Fixed by T193408 [06:09:11] 10Operations, 10Analytics, 10User-Elukey: Investigate if a Prometheus exporter for the AMD GPU(s) can be easily created - https://phabricator.wikimedia.org/T220784 (10elukey) p:05Triage→03Normal [06:10:26] 10Operations, 10SRE-Access-Requests: Requesting deployment access for santhosh - https://phabricator.wikimedia.org/T220785 (10santhosh) [06:11:03] (03PS1) 10Vgutierrez: Add SPF record for wikisource.org [dns] - 10https://gerrit.wikimedia.org/r/503165 (https://phabricator.wikimedia.org/T193408) [06:11:21] (03CR) 10Vgutierrez: [C: 03+1] Add SPF record for wikisource.org [dns] - 10https://gerrit.wikimedia.org/r/503165 (https://phabricator.wikimedia.org/T193408) (owner: 10Vgutierrez) [06:13:02] (03PS1) 10KartikMistry: Add santhosh to deploy-service [puppet] - 10https://gerrit.wikimedia.org/r/503167 (https://phabricator.wikimedia.org/T220785) [06:16:21] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting deployment access for santhosh - https://phabricator.wikimedia.org/T220785 (10Arrbee) This is an approved request for Santhosh. Thanks. [06:20:10] (03PS1) 10Vgutierrez: Add SPF record for wikibooks.org [dns] - 10https://gerrit.wikimedia.org/r/503177 (https://phabricator.wikimedia.org/T193408) [06:21:00] (03CR) 10Vgutierrez: [C: 03+1] Add SPF record for wikibooks.org [dns] - 10https://gerrit.wikimedia.org/r/503177 (https://phabricator.wikimedia.org/T193408) (owner: 10Vgutierrez) [06:24:47] 10Operations: HP Gen9 onboard controller review - https://phabricator.wikimedia.org/T216175 (10Marostegui) Heh...HP decided to rename the tool and on the Gen10, @MoritzMuehlenhoff found it (T220572#5106204): ` HPE renamed the tool, I installed "ssacli" and now "ssacli controller all show config" works fine. ` [06:26:54] 10Operations, 10DNS, 10Traffic: Add SPF record for non-canonical domains that are not parked - https://phabricator.wikimedia.org/T220786 (10Vgutierrez) [06:27:04] 10Operations, 10DNS, 10Traffic: Add SPF record for non-canonical domains that are not parked - https://phabricator.wikimedia.org/T220786 (10Vgutierrez) p:05Triage→03Normal [06:31:47] 10Operations, 10DNS, 10Traffic: Add SPF record for non-canonical domains that are not parked - https://phabricator.wikimedia.org/T220786 (10Vgutierrez) [06:31:51] 10Operations, 10Cloud-VPS, 10DNS, 10Mail, and 3 others: Set SPF (... 
-all) for toolserver.org - https://phabricator.wikimedia.org/T131930 (10Vgutierrez) [06:33:11] (03PS1) 10Vgutierrez: Add SPF records for non-canonical non-parked domains [dns] - 10https://gerrit.wikimedia.org/r/503258 (https://phabricator.wikimedia.org/T220786) [06:35:43] (03CR) 10Vgutierrez: [C: 03+1] "For those domains which have MX records set to something different than mx[12]001.wm.o or gmail, I've added "mx" to their SPF record" [dns] - 10https://gerrit.wikimedia.org/r/503258 (https://phabricator.wikimedia.org/T220786) (owner: 10Vgutierrez) [06:37:45] 10Operations: Sort out which RAID packages are still needed - https://phabricator.wikimedia.org/T216043 (10Marostegui) [06:38:53] 10Operations, 10Icinga, 10monitoring: Fix RAID handler alert to work with Gen10 hosts and ssacli tool - https://phabricator.wikimedia.org/T220787 (10Marostegui) [06:46:04] (03PS1) 10Muehlenhoff: Sync ssacli from the HPE repository [puppet] - 10https://gerrit.wikimedia.org/r/503261 (https://phabricator.wikimedia.org/T220787) [06:49:45] (03CR) 10Marostegui: [C: 03+1] Sync ssacli from the HPE repository [puppet] - 10https://gerrit.wikimedia.org/r/503261 (https://phabricator.wikimedia.org/T220787) (owner: 10Muehlenhoff) [06:57:52] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Fix RAID handler alert and puppet facter to work with Gen10 hosts and ssacli tool - https://phabricator.wikimedia.org/T220787 (10Marostegui) [07:00:10] (03CR) 10Muehlenhoff: [C: 03+2] Sync ssacli from the HPE repository [puppet] - 10https://gerrit.wikimedia.org/r/503261 (https://phabricator.wikimedia.org/T220787) (owner: 10Muehlenhoff) [07:04:45] !log synced ssacli to thirdparty/hwraid components for jessie/stretch T220787 [07:04:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:04:50] T220787: Fix RAID handler alert and puppet facter to work with Gen10 hosts and ssacli tool - https://phabricator.wikimedia.org/T220787 [07:09:16] 10Operations, 10Analytics, 10EventBus, 10monitoring, and 3 others: Upgrade statsd_exporter to 0.9 - https://phabricator.wikimedia.org/T220709 (10akosiaris) p:05Triage→03Normal [07:12:24] !log Manually install ssacli on db2[097|098|099|100|101|102] T220787 T220572 [07:12:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:29] T220572: Productionize eqiad and codfw source backup hosts & codfw backup test host - https://phabricator.wikimedia.org/T220572 [07:12:30] T220787: Fix RAID handler alert and puppet facter to work with Gen10 hosts and ssacli tool - https://phabricator.wikimedia.org/T220787 [07:13:28] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Fix RAID handler alert and puppet facter to work with Gen10 hosts and ssacli tool - https://phabricator.wikimedia.org/T220787 (10MoritzMuehlenhoff) We need to extend the "raid" fact in modules/raid/lib/facter/raid.rb to also detect the Gen10 control... 
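For context on the Gen10 work just above: the fact that needs extending lives in modules/raid/lib/facter/raid.rb and is Ruby, so the following is only a language-agnostic sketch (written in Python) of the detection logic MoritzMuehlenhoff describes. It assumes the controller is visible in `lspci` output as a "Smart Array" device and that Gen10 hosts carry the renamed `ssacli` binary while older generations still ship `hpssacli`; none of the names below are taken from the real fact.
```python
import shutil
import subprocess


def detect_hp_raid_tool():
    """Sketch of the Gen10/ssacli detection discussed in T220787.

    Assumptions (not from the actual raid fact): the controller shows up
    in `lspci` as a "Smart Array" device, and the CLI is `ssacli` on Gen10
    hosts but `hpssacli` on older generations.
    """
    lspci = subprocess.run(["lspci"], capture_output=True, text=True).stdout
    if "Smart Array" not in lspci:
        return None  # no HP/HPE hardware RAID controller detected
    for tool in ("ssacli", "hpssacli"):  # prefer the renamed Gen10 tool
        if shutil.which(tool):
            return tool
    return None  # controller present but no CLI installed yet
```
The real change lands later in the log as jbond's patches 503332–503334 against the raid module.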
[07:23:51] (03PS3) 10Arturo Borrero Gonzalez: striker: factor out common code to a shared profile [puppet] - 10https://gerrit.wikimedia.org/r/502472 [07:24:36] (03PS3) 10Filippo Giunchedi: aptrepo: reflow and sort distributions-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/503013 [07:24:53] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] striker: factor out common code to a shared profile [puppet] - 10https://gerrit.wikimedia.org/r/502472 (owner: 10Arturo Borrero Gonzalez) [07:25:21] (03CR) 10Filippo Giunchedi: [C: 03+2] aptrepo: reflow and sort distributions-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/503013 (owner: 10Filippo Giunchedi) [07:25:40] (03PS4) 10Filippo Giunchedi: aptrepo: reflow and sort distributions-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/503013 [07:27:29] (03PS5) 10Arturo Borrero Gonzalez: Toolforge: cleanup unused puppet code [puppet] - 10https://gerrit.wikimedia.org/r/503035 (https://phabricator.wikimedia.org/T219362) [07:27:49] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Toolforge: cleanup unused puppet code [puppet] - 10https://gerrit.wikimedia.org/r/503035 (https://phabricator.wikimedia.org/T219362) (owner: 10Arturo Borrero Gonzalez) [07:32:55] (03PS1) 10Muehlenhoff: Update the source distro for the HPE thirdparty suite [puppet] - 10https://gerrit.wikimedia.org/r/503264 (https://phabricator.wikimedia.org/T220787) [07:33:49] (03CR) 10Marostegui: [C: 03+1] Update the source distro for the HPE thirdparty suite [puppet] - 10https://gerrit.wikimedia.org/r/503264 (https://phabricator.wikimedia.org/T220787) (owner: 10Muehlenhoff) [07:35:38] (03CR) 10Muehlenhoff: [C: 03+2] Update the source distro for the HPE thirdparty suite [puppet] - 10https://gerrit.wikimedia.org/r/503264 (https://phabricator.wikimedia.org/T220787) (owner: 10Muehlenhoff) [07:37:26] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/503116 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [07:38:17] (03CR) 10Filippo Giunchedi: [C: 03+2] aptrepo: add component/elastalert [puppet] - 10https://gerrit.wikimedia.org/r/503014 (https://phabricator.wikimedia.org/T213933) (owner: 10Filippo Giunchedi) [07:38:26] (03PS3) 10Filippo Giunchedi: aptrepo: add component/elastalert [puppet] - 10https://gerrit.wikimedia.org/r/503014 (https://phabricator.wikimedia.org/T213933) [07:40:01] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/503123 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [07:40:54] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "minor comments" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/503123 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [07:41:52] (03CR) 10Alexandros Kosiaris: [C: 03+2] ores: use hiera for statsd host [puppet] - 10https://gerrit.wikimedia.org/r/499875 (https://phabricator.wikimedia.org/T218567) (owner: 10Ladsgroup) [07:41:59] (03PS8) 10Alexandros Kosiaris: ores: use hiera for statsd host [puppet] - 10https://gerrit.wikimedia.org/r/499875 (https://phabricator.wikimedia.org/T218567) (owner: 10Ladsgroup) [07:42:01] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/503119 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [07:42:23] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/503117 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [07:43:00] (03PS1) 10Muehlenhoff: Remove support for trusty in two Prometheus 
exporters [puppet] - 10https://gerrit.wikimedia.org/r/503265 [07:43:40] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db2102.codfw.wmnet as a testing host for codfw backups - https://phabricator.wikimedia.org/T219461 (10jcrespo) @robh @faidon Re: T219461#5103942 I wonder if we should document this stop as one to do for these models. The sda/sdb renam... [07:47:54] (03CR) 10Arturo Borrero Gonzalez: "Did you check puppet catalog compiler for labnet/labcontrol servers?" [puppet] - 10https://gerrit.wikimedia.org/r/502991 (owner: 10Alex Monk) [07:54:59] (03PS19) 10Daimona Eaytoy: Move all AbuseFilter config to abusefilter.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477063 (https://phabricator.wikimedia.org/T145931) [07:55:08] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db2102.codfw.wmnet as a testing host for codfw backups - https://phabricator.wikimedia.org/T219461 (10MoritzMuehlenhoff) >>! In T219461#5106335, @jcrespo wrote: > @robh @faidon Re: T219461#5103942 I wonder if we should document this s... [07:55:33] (03PS1) 10Elukey: oozie: override the oozie-setup script [puppet/cdh] - 10https://gerrit.wikimedia.org/r/503266 (https://phabricator.wikimedia.org/T218343) [07:55:49] (03PS20) 10Daimona Eaytoy: Move all AbuseFilter config to abusefilter.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477063 (https://phabricator.wikimedia.org/T145931) [07:56:31] (03PS20) 10Daimona Eaytoy: Update AbuseFilter config to keep the status quo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475772 [07:57:27] (03CR) 10jerkins-bot: [V: 04-1] Update AbuseFilter config to keep the status quo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475772 (owner: 10Daimona Eaytoy) [07:58:14] (03PS21) 10Daimona Eaytoy: Update AbuseFilter config to keep the status quo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475772 [07:58:56] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db2102.codfw.wmnet as a testing host for codfw backups - https://phabricator.wikimedia.org/T219461 (10jcrespo) @MoritzMuehlenhoff, just guessing, but I am assuming it is a chassis "bundled" SD card reader, not something we have bought... [07:59:09] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db2102.codfw.wmnet as a testing host for codfw backups - https://phabricator.wikimedia.org/T219461 (10Marostegui) >>! In T219461#5106371, @MoritzMuehlenhoff wrote: >>>! In T219461#5106335, @jcrespo wrote: >> @robh @faidon Re: T219461#... [08:02:25] !log updated ssacli in thirdparty/hwraid component for stretch to 3.30-13.0 T220787 [08:02:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:29] T220787: Fix RAID handler alert and puppet facter to work with Gen10 hosts and ssacli tool - https://phabricator.wikimedia.org/T220787 [08:04:05] 10Operations: HP Gen9 onboard controller review - https://phabricator.wikimedia.org/T216175 (10Marostegui) [08:04:07] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Fix RAID handler alert and puppet facter to work with Gen10 hosts and ssacli tool - https://phabricator.wikimedia.org/T220787 (10Marostegui) [08:09:38] (03CR) 10Alexandros Kosiaris: "PCC quite happy at https://integration.wikimedia.org/ci/view/Ops/job/operations-puppet-catalog-compiler/15718/console, merging." 
[puppet] - 10https://gerrit.wikimedia.org/r/502499 (owner: 10Alex Monk) [08:09:46] (03PS4) 10Alexandros Kosiaris: Move maintenance_hosts out of network::constants into hieradata [puppet] - 10https://gerrit.wikimedia.org/r/502499 (owner: 10Alex Monk) [08:09:53] (03CR) 10Alexandros Kosiaris: [C: 03+2] Move maintenance_hosts out of network::constants into hieradata [puppet] - 10https://gerrit.wikimedia.org/r/502499 (owner: 10Alex Monk) [08:14:49] (03PS3) 10Alexandros Kosiaris: swift-rw: Mock it as a geo-resource [dns] - 10https://gerrit.wikimedia.org/r/502453 (https://phabricator.wikimedia.org/T204245) [08:14:54] (03CR) 10Alexandros Kosiaris: [C: 03+2] swift-rw: Mock it as a geo-resource [dns] - 10https://gerrit.wikimedia.org/r/502453 (https://phabricator.wikimedia.org/T204245) (owner: 10Alexandros Kosiaris) [08:22:51] (03PS2) 10Alexandros Kosiaris: Move bastion_hosts out of network::constants into hieradata [puppet] - 10https://gerrit.wikimedia.org/r/502607 (owner: 10Alex Monk) [08:24:12] (03CR) 10Alexandros Kosiaris: "I went ahead and rebase this one since I broke the chain in Ie0ff7f3fc383251acabc5eb8e49d719a627e17b3" [puppet] - 10https://gerrit.wikimedia.org/r/502607 (owner: 10Alex Monk) [08:24:25] (03CR) 10jerkins-bot: [V: 04-1] Move bastion_hosts out of network::constants into hieradata [puppet] - 10https://gerrit.wikimedia.org/r/502607 (owner: 10Alex Monk) [08:25:07] (03CR) 10Alexandros Kosiaris: [C: 03+1] "+1ed, will merge after https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/502607/1 is merged" [puppet] - 10https://gerrit.wikimedia.org/r/502612 (owner: 10Alex Monk) [08:33:19] (03CR) 10Alexandros Kosiaris: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/502607 (owner: 10Alex Monk) [08:34:06] (03CR) 10jerkins-bot: [V: 04-1] Move bastion_hosts out of network::constants into hieradata [puppet] - 10https://gerrit.wikimedia.org/r/502607 (owner: 10Alex Monk) [08:35:26] !log gilles@deploy1001 Synchronized php-1.33.0-wmf.25/extensions/NavigationTiming/modules/ext.navigationTiming.js: T220788 Fix veaction === null case (duration: 00m 54s) [08:35:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:30] T220788: NavigationTiming probably broken in 1.33.0-wmf.25 - https://phabricator.wikimedia.org/T220788 [08:56:27] (03PS1) 10Ladsgroup: Add Western Armenian Wikipedia to wmf-config/InterwikiSortOrders.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503270 (https://phabricator.wikimedia.org/T219871) [08:56:41] 10Operations, 10ops-codfw, 10decommission, 10cloud-services-team (Kanban): decommission: labtestcontrol2001.wikimedia.org - https://phabricator.wikimedia.org/T218021 (10aborrero) a:03aborrero [09:00:12] (03PS1) 10Arturo Borrero Gonzalez: openstack: codfw1dev: use DBs hosted at clouddb2001-dev.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/503272 (https://phabricator.wikimedia.org/T220096) [09:01:23] (03PS1) 10Alexandros Kosiaris: Revert "swift-rw: Mock it as a geo-resource" [dns] - 10https://gerrit.wikimedia.org/r/503273 [09:02:09] (03CR) 10Alexandros Kosiaris: [C: 03+2] Revert "swift-rw: Mock it as a geo-resource" [dns] - 10https://gerrit.wikimedia.org/r/503273 (owner: 10Alexandros Kosiaris) [09:03:09] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "pcc as expected: https://puppet-compiler.wmflabs.org/compiler1002/15721/" [puppet] - 10https://gerrit.wikimedia.org/r/503272 (https://phabricator.wikimedia.org/T220096) (owner: 10Arturo Borrero Gonzalez) [09:03:14] (03PS1) 10Alexandros Kosiaris: Add a new swift.discovery.wmnet resource 
[dns] - 10https://gerrit.wikimedia.org/r/503274 (https://phabricator.wikimedia.org/T204245) [09:03:43] (03CR) 10Alexandros Kosiaris: [C: 03+2] Add a new swift.discovery.wmnet resource [dns] - 10https://gerrit.wikimedia.org/r/503274 (https://phabricator.wikimedia.org/T204245) (owner: 10Alexandros Kosiaris) [09:05:52] 10Operations, 10ops-codfw, 10decommission, 10cloud-services-team (Kanban): decommission: labtestcontrol2001.wikimedia.org - https://phabricator.wikimedia.org/T218021 (10aborrero) [09:05:57] !log T218021 disable icinga checks for labtestcontrol2001 [09:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:01] T218021: decommission: labtestcontrol2001.wikimedia.org - https://phabricator.wikimedia.org/T218021 [09:07:45] !log added the wikimedia repository key to the stretch build chroot on boron, fixes builds using the PHP72/SPICERACK hooks [09:07:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:20] (03PS3) 10Alexandros Kosiaris: Move bastion_hosts out of network::constants into hieradata [puppet] - 10https://gerrit.wikimedia.org/r/502607 (owner: 10Alex Monk) [09:10:25] 10Operations, 10Wikimedia-Mailing-lists: Change ownership of wikimania-program@lists.wikimedia.org - https://phabricator.wikimedia.org/T220641 (10fgiunchedi) a:05fgiunchedi→03ICueva This is done, please let us know if you need a new password for the list as well! [09:13:39] (03PS2) 10Effie Mouzeli: Expose haproxy total request time via mtail [puppet] - 10https://gerrit.wikimedia.org/r/502972 (https://phabricator.wikimedia.org/T220499) (owner: 10Gilles) [09:14:05] (03CR) 10jerkins-bot: [V: 04-1] Expose haproxy total request time via mtail [puppet] - 10https://gerrit.wikimedia.org/r/502972 (https://phabricator.wikimedia.org/T220499) (owner: 10Gilles) [09:17:14] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Fix RAID handler alert and puppet facter to work with Gen10 hosts and ssacli tool - https://phabricator.wikimedia.org/T220787 (10Volans) In addition io T220787#5106275, from the top of my head I think we need also: - check if the DSA script we're us... [09:17:41] (03PS3) 10Effie Mouzeli: Expose haproxy total request time via mtail [puppet] - 10https://gerrit.wikimedia.org/r/502972 (https://phabricator.wikimedia.org/T220499) (owner: 10Gilles) [09:18:13] (03CR) 10jerkins-bot: [V: 04-1] Expose haproxy total request time via mtail [puppet] - 10https://gerrit.wikimedia.org/r/502972 (https://phabricator.wikimedia.org/T220499) (owner: 10Gilles) [09:20:55] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Fix RAID handler alert and puppet facter to work with Gen10 hosts and ssacli tool - https://phabricator.wikimedia.org/T220787 (10MoritzMuehlenhoff) >>! In T220787#5106465, @Volans wrote: > In addition io T220787#5106275, from the top of my head I th... 
[09:25:01] (03PS4) 10Effie Mouzeli: Expose haproxy total request time via mtail [puppet] - 10https://gerrit.wikimedia.org/r/502972 (https://phabricator.wikimedia.org/T220499) (owner: 10Gilles) [09:25:31] (03CR) 10jerkins-bot: [V: 04-1] Expose haproxy total request time via mtail [puppet] - 10https://gerrit.wikimedia.org/r/502972 (https://phabricator.wikimedia.org/T220499) (owner: 10Gilles) [09:30:34] !log reset mgmt card on labtestcontrol2003 - T220783 [09:30:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:42] T220783: labtestcontrol2003 - UNKNOWN power supply status - https://phabricator.wikimedia.org/T220783 [09:33:40] 10Operations, 10ops-codfw: Degraded RAID on labtestcontrol2003 - https://phabricator.wikimedia.org/T218403 (10Volans) [09:33:44] 10Operations, 10ops-codfw, 10monitoring: labtestcontrol2003 - UNKNOWN power supply status - https://phabricator.wikimedia.org/T220783 (10Volans) 05Open→03Resolved p:05Triage→03Normal a:03Volans I've reset the mgmt card (see https://wikitech.wikimedia.org/wiki/Management_Interfaces#Reset_the_managem... [09:34:07] (03CR) 10Elukey: "Thanks a lot for the work!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/502511 (owner: 10Muehlenhoff) [09:36:17] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Fix RAID handler alert and puppet facter to work with Gen10 hosts and ssacli tool - https://phabricator.wikimedia.org/T220787 (10jcrespo) This is slightly offtopic, but there is a bit of overlap between the -SMART- checks and the RAID (Megacli/HP) o... [09:37:19] !log updated mwdebug1002 to php-wikidiff 1.8.1 [09:37:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:53] (03PS4) 10Jcrespo: transfer.py: Allow for a 3rd transfer type: decompression [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/500043 (https://phabricator.wikimedia.org/T219631) [09:42:55] (03PS5) 10Jcrespo: mariadb: Allow new option --stop-slave for xtrabackup transfers [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/502828 (https://phabricator.wikimedia.org/T206203) [09:43:03] RECOVERY - Check systemd state on cloudcontrol2001-dev is OK: OK - running: The system is fully operational [09:43:17] (03CR) 10jerkins-bot: [V: 04-1] transfer.py: Allow for a 3rd transfer type: decompression [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/500043 (https://phabricator.wikimedia.org/T219631) (owner: 10Jcrespo) [09:43:20] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Allow new option --stop-slave for xtrabackup transfers [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/502828 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [09:46:23] 10Operations, 10Analytics, 10User-Elukey: Investigate if a Prometheus exporter for the AMD GPU(s) can be easily created - https://phabricator.wikimedia.org/T220784 (10fgiunchedi) +1, something that parses the json and write metrics in text format for node-exporter to pick up sounds good to me [09:51:57] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: Traceback (most recent call last): https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [09:52:28] (03PS1) 10Arturo Borrero Gonzalez: openstack: keystone: stretch: use python-openssl from stretch [puppet] - 10https://gerrit.wikimedia.org/r/503279 (https://phabricator.wikimedia.org/T215407) [09:52:53] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: Traceback (most recent call last): 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [09:53:28] !log updated mwdebug1001 to php-wikidiff 1.8.1 [09:53:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:37] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: Traceback (most recent call last): https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [09:55:07] (03PS2) 10Arturo Borrero Gonzalez: openstack: keystone: stretch: use python-openssl from stretch [puppet] - 10https://gerrit.wikimedia.org/r/503279 (https://phabricator.wikimedia.org/T215407) [09:55:27] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: Traceback (most recent call last): https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [09:55:33] (03PS5) 10Effie Mouzeli: Expose haproxy total request time via mtail [puppet] - 10https://gerrit.wikimedia.org/r/502972 (https://phabricator.wikimedia.org/T220499) (owner: 10Gilles) [09:55:35] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: Traceback (most recent call last): https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [09:56:13] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: Traceback (most recent call last): https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [09:56:29] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: keystone: stretch: use python-openssl from stretch [puppet] - 10https://gerrit.wikimedia.org/r/503279 (https://phabricator.wikimedia.org/T215407) (owner: 10Arturo Borrero Gonzalez) [10:00:16] (03PS7) 10Jcrespo: mariadb-snapshots: Setup full daily snapshots for all codfw sections [puppet] - 10https://gerrit.wikimedia.org/r/500980 (https://phabricator.wikimedia.org/T206203) [10:00:18] (03PS5) 10Jcrespo: mariadb-snapshots: Stop replication during transfer [puppet] - 10https://gerrit.wikimedia.org/r/501546 (https://phabricator.wikimedia.org/T206203) [10:00:28] 10Operations, 10Maps, 10Reading-Infrastructure-Team-Backlog, 10Patch-For-Review: migrate maps servers to stretch with the current style - https://phabricator.wikimedia.org/T198622 (10Mathew.onipe) This task is now complete and the lessons learnt have been documented here: https://wikitech.wikimedia.org/wik... [10:00:33] what can we do about the ripe-atlas alerts? [10:01:09] 10Operations, 10Maps, 10Reading-Infrastructure-Team-Backlog, 10Patch-For-Review: migrate maps servers to stretch with the current style - https://phabricator.wikimedia.org/T198622 (10Mathew.onipe) 05Open→03Resolved [10:01:16] appservers.svc.codfw.wmnet flapped briefly as well [10:02:18] although CRITICAL: Traceback (most recent call last) [10:03:43] PROBLEM - puppet last run on cloudcontrol2001-dev is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Package[debmonitor-client],File[/etc/apt/preferences.d/mitaka_stretch_nojessiebpo.pref] [10:03:44] (03CR) 10Gilles: [C: 03+1] Expose haproxy total request time via mtail [puppet] - 10https://gerrit.wikimedia.org/r/502972 (https://phabricator.wikimedia.org/T220499) (owner: 10Gilles) [10:09:51] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: Traceback (most recent call last): https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [10:10:09] 10Operations, 10DBA, 10Patch-For-Review, 10codfw-rollout: [RFC] improve parsercache replication and sharding handling - https://phabricator.wikimedia.org/T133523 (10jcrespo) A bit of a recap on the original questions: * Parsercache keys are renamed to pc1, pc2, pc3 at: T210725 * Parsercaches are write-wri... [10:11:06] (03PS1) 10Arturo Borrero Gonzalez: openstack: virt: reallocate libssl1.0.0 package exclusion [puppet] - 10https://gerrit.wikimedia.org/r/503284 (https://phabricator.wikimedia.org/T215407) [10:11:11] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: Traceback (most recent call last): https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [10:13:04] !log matomo updated to 3.9.1 on matomo1001 + deb upload to wikimedia-stretch - T218037 [10:13:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:08] T218037: Upgrade matomo1001 to latest upstream - https://phabricator.wikimedia.org/T218037 [10:13:51] (03CR) 10Effie Mouzeli: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/15723/ looks ok" [puppet] - 10https://gerrit.wikimedia.org/r/502972 (https://phabricator.wikimedia.org/T220499) (owner: 10Gilles) [10:15:06] (03CR) 10Effie Mouzeli: [C: 03+2] Expose haproxy total request time via mtail [puppet] - 10https://gerrit.wikimedia.org/r/502972 (https://phabricator.wikimedia.org/T220499) (owner: 10Gilles) [10:16:22] (03CR) 10Alexandros Kosiaris: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/15722/ LGTM, merging" [puppet] - 10https://gerrit.wikimedia.org/r/502607 (owner: 10Alex Monk) [10:16:30] (03PS4) 10Alexandros Kosiaris: Move bastion_hosts out of network::constants into hieradata [puppet] - 10https://gerrit.wikimedia.org/r/502607 (owner: 10Alex Monk) [10:17:57] (03CR) 10Elukey: "LGTM, left a nit for variable naming :)" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/502972 (https://phabricator.wikimedia.org/T220499) (owner: 10Gilles) [10:21:10] (03PS2) 10Arturo Borrero Gonzalez: openstack: virt: reallocate libssl1.0.0 package exclusion [puppet] - 10https://gerrit.wikimedia.org/r/503284 (https://phabricator.wikimedia.org/T215407) [10:21:59] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "pcc as expected: https://puppet-compiler.wmflabs.org/compiler1002/15727/" [puppet] - 10https://gerrit.wikimedia.org/r/503284 (https://phabricator.wikimedia.org/T215407) (owner: 10Arturo Borrero Gonzalez) [10:23:00] (03CR) 10Marostegui: "I would deploy this on a single host with puppet disabled on the rest, reload haproxy and all that and just look at the graphs just in cas" [puppet] - 10https://gerrit.wikimedia.org/r/502972 (https://phabricator.wikimedia.org/T220499) (owner: 10Gilles) [10:27:33] (03CR) 10Alexandros Kosiaris: [C: 04-2] "Yesterday's alerts weren't (aren't?) spam though. 
This is an actual problem, with a manifestation at the kubelet operation latencies level" [puppet] - 10https://gerrit.wikimedia.org/r/503079 (https://phabricator.wikimedia.org/T219556) (owner: 10CDanis) [10:30:07] RECOVERY - puppet last run on cloudcontrol2001-dev is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [10:32:15] (03CR) 10Alexandros Kosiaris: [C: 04-2] "Perhaps we should re-engineer a bit these alerts to distinguish between the various operation types. For example we could exclude from the" [puppet] - 10https://gerrit.wikimedia.org/r/503079 (https://phabricator.wikimedia.org/T219556) (owner: 10CDanis) [10:32:51] !log T219626 reimaging cloudcontrol2001-dev [10:32:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:55] T219626: codfw1dev: bootstrap cloudcontrol servers in mitaka/stretch - https://phabricator.wikimedia.org/T219626 [10:39:31] (03PS1) 10Arturo Borrero Gonzalez: openstack: nova: compute: mitaka: stretch: refresh comment about libss1.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/503291 [10:40:21] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: nova: compute: mitaka: stretch: refresh comment about libss1.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/503291 (owner: 10Arturo Borrero Gonzalez) [10:44:49] (03PS1) 10Arturo Borrero Gonzalez: labtestcontrol2001: decommision [puppet] - 10https://gerrit.wikimedia.org/r/503296 (https://phabricator.wikimedia.org/T218021) [10:45:23] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] labtestcontrol2001: decommision [puppet] - 10https://gerrit.wikimedia.org/r/503296 (https://phabricator.wikimedia.org/T218021) (owner: 10Arturo Borrero Gonzalez) [10:46:19] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission: labtestcontrol2001.wikimedia.org - https://phabricator.wikimedia.org/T218021 (10aborrero) 05Stalled→03Open a:05aborrero→03RobH [10:47:12] 10Operations, 10Patch-For-Review, 10cloud-services-team (Kanban): Track remaining trusty servers in production - https://phabricator.wikimedia.org/T212772 (10aborrero) [10:49:20] 10Operations, 10DBA, 10Patch-For-Review, 10codfw-rollout: [RFC] improve parsercache replication and sharding handling - https://phabricator.wikimedia.org/T133523 (10jcrespo) 3 additional items/proposals regarding purging: * Smarter purging- something maybe priority queue based, while respecting TTL, not s... [10:50:33] (03PS2) 10Arturo Borrero Gonzalez: labtestweb2001: decommission [puppet] - 10https://gerrit.wikimedia.org/r/502966 (https://phabricator.wikimedia.org/T218024) [10:51:17] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): decommmision: labtestweb2001.wikimedia.org - https://phabricator.wikimedia.org/T218024 (10aborrero) [10:54:22] RECOVERY - puppet last run on labtestcontrol2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:57:30] PROBLEM - Check systemd state on cloudnet2003-dev is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[10:57:34] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 15 probes of 405 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [10:57:40] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 4 probes of 444 (alerts on 35) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [10:57:40] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 5 probes of 444 (alerts on 35) - https://atlas.ripe.net/measurements/1791210/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [10:58:00] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 13 probes of 405 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [10:58:08] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 4 probes of 444 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [10:58:58] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 13 probes of 405 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [10:59:31] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] labtestweb2001: decommission [puppet] - 10https://gerrit.wikimedia.org/r/502966 (https://phabricator.wikimedia.org/T218024) (owner: 10Arturo Borrero Gonzalez) [10:59:50] (03PS16) 10Mathew.onipe: Add wdqs data transfer cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/488256 (https://phabricator.wikimedia.org/T213401) [11:00:14] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 5 probes of 444 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:00:21] (03CR) 10Mathew.onipe: Add wdqs data transfer cookbook (0315 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/488256 (https://phabricator.wikimedia.org/T213401) (owner: 10Mathew.onipe) [11:00:34] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 13 probes of 405 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:00:57] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): decommmision: labtestweb2001.wikimedia.org - https://phabricator.wikimedia.org/T218024 (10aborrero) a:05aborrero→03RobH [11:01:46] PROBLEM - Free Blazegraph allocators wdqs-blazegraph on wdqs1009 is CRITICAL: cluster=wdqs-test instance=wdqs1009:9193 job=blazegraph site=eqiad https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=32&fullscreen [11:02:40] * gehel is looking at those allocators, data reimport coming up soon [11:07:06] PROBLEM - Check systemd state on labtestcontrol2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:07:50] PROBLEM - puppet last run on cp1077 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file] [11:07:58] PROBLEM - puppet last run on cp1075 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file] [11:08:10] PROBLEM - puppet last run on cp1089 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file] [11:08:56] taking a look at those puppet failures [11:09:18] PROBLEM - keystone admin endpoint port 35357 on labtestcontrol2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [11:10:17] !log installing Java security updates on remaining maps hosts [11:10:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:27] Name of VCL object, 'cloudweb2001-dev', contains illegal charac [11:10:28] ter '-' [11:11:26] PROBLEM - puppet last run on cp1079 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file] [11:12:34] PROBLEM - puppet last run on cp1087 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file] [11:12:48] let's see if I can fix it [11:15:18] (03PS1) 10Filippo Giunchedi: hieradata: fix VCL illegal character for director [puppet] - 10https://gerrit.wikimedia.org/r/503312 [11:16:14] PROBLEM - puppet last run on cp1081 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file] [11:16:18] (03PS1) 10Arturo Borrero Gonzalez: cloudweb2001-dev: add IPv6 [dns] - 10https://gerrit.wikimedia.org/r/503313 (https://phabricator.wikimedia.org/T220426) [11:16:42] arturo: FYI https://gerrit.wikimedia.org/r/c/operations/puppet/+/503312 [11:16:50] * arturo looking [11:16:59] or anyone else really, if available for a quick review [11:17:25] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] hieradata: fix VCL illegal character for director [puppet] - 10https://gerrit.wikimedia.org/r/503312 (owner: 10Filippo Giunchedi) [11:17:33] godog: +1 [11:18:04] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudweb2001-dev: add IPv6 [dns] - 10https://gerrit.wikimedia.org/r/503313 (https://phabricator.wikimedia.org/T220426) (owner: 10Arturo Borrero Gonzalez) [11:18:15] arturo: thanks! 
[11:18:25] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: fix VCL illegal character for director [puppet] - 10https://gerrit.wikimedia.org/r/503312 (owner: 10Filippo Giunchedi) [11:18:28] sorry for the mess [11:18:47] np, easy enough to fix [11:19:48] !log installed Java security updates on relforge* hosts [11:19:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:13] RECOVERY - puppet last run on cp1075 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:23:27] RECOVERY - puppet last run on cp1089 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:24:23] RECOVERY - keystone admin endpoint port 35357 on labtestcontrol2003 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 759 bytes in 0.075 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [11:27:45] 10Operations, 10Prod-Kubernetes, 10Kubernetes: Re-evaluate kubelet operation latencies alerts - https://phabricator.wikimedia.org/T220808 (10akosiaris) [11:29:39] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: Traceback (most recent call last): https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:30:41] RECOVERY - Check systemd state on cloudnet2003-dev is OK: OK - running: The system is fully operational [11:31:07] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: Traceback (most recent call last): https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:31:35] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: Traceback (most recent call last): https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:31:41] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: Traceback (most recent call last): https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:31:55] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: Traceback (most recent call last): https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:32:09] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: Traceback (most recent call last): https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:32:10] (03PS1) 10Gilles: Oversample navtiming on cawiki and commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503317 (https://phabricator.wikimedia.org/T220807) [11:32:19] PROBLEM - keystone admin endpoint port 35357 on labtestcontrol2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [11:33:11] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: Traceback (most recent call last): https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:33:21] RECOVERY - keystone admin endpoint port 35357 on labtestcontrol2003 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 759 bytes in 2.839 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [11:33:27] RECOVERY - puppet last run on cp1077 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [11:34:11] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:34:19] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: Traceback (most recent call last): 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:34:19] (03CR) 10Gilles: [C: 03+2] Oversample navtiming on cawiki and commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503317 (https://phabricator.wikimedia.org/T220807) (owner: 10Gilles) [11:35:19] (03Merged) 10jenkins-bot: Oversample navtiming on cawiki and commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503317 (https://phabricator.wikimedia.org/T220807) (owner: 10Gilles) [11:37:04] !log reindexing Greek, Turkish, and Irish wikis on elastic@eqiad and elastic@codfw complete (T217806) [11:37:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:08] T217806: Reindex Greek, Turkish, and Irish wikis to keep lang-specific lowercasing & enable empty-token filtering (Greek) - https://phabricator.wikimedia.org/T217806 [11:37:09] PROBLEM - keystone admin endpoint port 35357 on labtestcontrol2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [11:40:57] RECOVERY - keystone admin endpoint port 35357 on labtestcontrol2003 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 759 bytes in 8.776 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [11:41:08] (03CR) 10jenkins-bot: Oversample navtiming on cawiki and commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503317 (https://phabricator.wikimedia.org/T220807) (owner: 10Gilles) [11:42:23] RECOVERY - puppet last run on cp1079 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:43:31] RECOVERY - puppet last run on cp1087 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:44:08] !log gilles@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T220807 Oversample navtiming on cawiki and commonswiki (duration: 05m 14s) [11:44:10] thx godog <3 [11:44:49] PROBLEM - keystone admin endpoint port 35357 on labtestcontrol2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [11:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:53] T220807: Alert on group1 canary wikis navtiming report rate - https://phabricator.wikimedia.org/T220807 [11:46:40] !log upgrading acmechief hosts to latest buster state [11:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:59] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 5 probes of 444 (alerts on 35) - https://atlas.ripe.net/measurements/1791210/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:47:01] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 13 probes of 405 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:47:01] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 5 probes of 444 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:47:03] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 4 probes of 444 (alerts on 35) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:47:11] RECOVERY - puppet last run on cp1081 is OK: OK: Puppet is currently enabled, last run 4 
minutes ago with 0 failures [11:47:35] 10Operations, 10Prod-Kubernetes, 10Kubernetes: Re-evaluate kubelet operation latencies alerts - https://phabricator.wikimedia.org/T220808 (10akosiaris) p:05Triage→03High Today (2019-04-12), I 've raised the possibility that T220661 is related to the reason these alerts are flapping so much. Judging fro... [11:47:59] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 13 probes of 405 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:48:05] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:48:08] (03CR) 10Alexandros Kosiaris: [C: 04-2] "https://phabricator.wikimedia.org/T220808 FWIW" [puppet] - 10https://gerrit.wikimedia.org/r/503079 (https://phabricator.wikimedia.org/T219556) (owner: 10CDanis) [11:49:11] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 5 probes of 444 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:49:41] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 14 probes of 405 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:49:57] RECOVERY - keystone admin endpoint port 35357 on labtestcontrol2003 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 759 bytes in 4.821 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [11:51:09] (03PS1) 10Arturo Borrero Gonzalez: clouddb2001-dev: include ferm configuration [puppet] - 10https://gerrit.wikimedia.org/r/503323 (https://phabricator.wikimedia.org/T220096) [11:51:13] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 14 probes of 405 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:51:58] (03CR) 10jerkins-bot: [V: 04-1] clouddb2001-dev: include ferm configuration [puppet] - 10https://gerrit.wikimedia.org/r/503323 (https://phabricator.wikimedia.org/T220096) (owner: 10Arturo Borrero Gonzalez) [11:52:52] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC as expected https://puppet-compiler.wmflabs.org/compiler1002/15728/clouddb2001-dev.codfw.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/503323 (https://phabricator.wikimedia.org/T220096) (owner: 10Arturo Borrero Gonzalez) [11:54:46] (03PS2) 10Arturo Borrero Gonzalez: clouddb2001-dev: include ferm configuration [puppet] - 10https://gerrit.wikimedia.org/r/503323 (https://phabricator.wikimedia.org/T220096) [11:55:41] (03CR) 10jerkins-bot: [V: 04-1] clouddb2001-dev: include ferm configuration [puppet] - 10https://gerrit.wikimedia.org/r/503323 (https://phabricator.wikimedia.org/T220096) (owner: 10Arturo Borrero Gonzalez) [11:56:43] (03PS3) 10Arturo Borrero Gonzalez: clouddb2001-dev: include ferm configuration [puppet] - 10https://gerrit.wikimedia.org/r/503323 (https://phabricator.wikimedia.org/T220096) [11:56:49] !log upgrading app server canaries to version 1.8.1 of the PHP wikidiff extension (HHVM already deployed) T203069 [11:56:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:53] T203069: Deploy wikidiff2 v1.8.1 with changed signature - https://phabricator.wikimedia.org/T203069 [11:57:54] 
(03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "pcc as expeccted: https://puppet-compiler.wmflabs.org/compiler1002/15730/clouddb2001-dev.codfw.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/503323 (https://phabricator.wikimedia.org/T220096) (owner: 10Arturo Borrero Gonzalez) [12:00:22] (03PS1) 10Arturo Borrero Gonzalez: cloudcontrol2001-dev: drop local keystone db [puppet] - 10https://gerrit.wikimedia.org/r/503327 (https://phabricator.wikimedia.org/T219626) [12:05:34] (03PS1) 10Gilles: Reduce cawiki survey sampling rate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503328 (https://phabricator.wikimedia.org/T220807) [12:05:52] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudcontrol2001-dev: drop local keystone db [puppet] - 10https://gerrit.wikimedia.org/r/503327 (https://phabricator.wikimedia.org/T219626) (owner: 10Arturo Borrero Gonzalez) [12:07:33] (03CR) 10Gilles: [C: 03+2] Reduce cawiki survey sampling rate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503328 (https://phabricator.wikimedia.org/T220807) (owner: 10Gilles) [12:08:45] (03Merged) 10jenkins-bot: Reduce cawiki survey sampling rate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503328 (https://phabricator.wikimedia.org/T220807) (owner: 10Gilles) [12:11:13] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [12:13:11] PROBLEM - keystone admin endpoint port 35357 on labtestcontrol2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [12:14:45] (03CR) 10jenkins-bot: Reduce cawiki survey sampling rate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503328 (https://phabricator.wikimedia.org/T220807) (owner: 10Gilles) [12:16:03] !log gilles@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T220807 Reduce cawiki survey sampling rate (duration: 05m 11s) [12:16:07] (03PS1) 10Jbond: ssacli: update raid fact to detect Gen10 devices [puppet] - 10https://gerrit.wikimedia.org/r/503332 (https://phabricator.wikimedia.org/T220787) [12:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:09] (03PS1) 10Jbond: raid: refactor structure [puppet] - 10https://gerrit.wikimedia.org/r/503333 (https://phabricator.wikimedia.org/T220787) [12:16:09] T220807: Alert on group1 canary wikis navtiming report rate - https://phabricator.wikimedia.org/T220807 [12:16:11] (03PS1) 10Jbond: raid: add ssacli class [puppet] - 10https://gerrit.wikimedia.org/r/503334 (https://phabricator.wikimedia.org/T220787) [12:16:21] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [12:16:45] (03CR) 10jerkins-bot: [V: 04-1] raid: add ssacli class [puppet] - 10https://gerrit.wikimedia.org/r/503334 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [12:16:54] (03CR) 10jerkins-bot: [V: 04-1] ssacli: update raid fact to detect Gen10 devices [puppet] - 10https://gerrit.wikimedia.org/r/503332 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [12:16:57] (03CR) 10jerkins-bot: [V: 04-1] raid: refactor structure [puppet] - 10https://gerrit.wikimedia.org/r/503333 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [12:16:59] RECOVERY - keystone admin endpoint port 35357 on labtestcontrol2003 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 759 bytes in 5.858 
second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [12:18:11] (03PS1) 10Muehlenhoff: Enable profile::base::firewall for profile::openstack::codfw1dev::db [puppet] - 10https://gerrit.wikimedia.org/r/503335 [12:18:50] (03PS2) 10Jbond: ssacli: update raid fact to detect Gen10 devices [puppet] - 10https://gerrit.wikimedia.org/r/503332 (https://phabricator.wikimedia.org/T220787) [12:19:33] (03PS3) 10Jbond: ssacli: update raid fact to detect Gen10 devices [puppet] - 10https://gerrit.wikimedia.org/r/503332 (https://phabricator.wikimedia.org/T220787) [12:19:56] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Enable profile::base::firewall for profile::openstack::codfw1dev::db [puppet] - 10https://gerrit.wikimedia.org/r/503335 (owner: 10Muehlenhoff) [12:20:39] (03PS4) 10Jbond: ssacli: update raid fact to detect Gen10 devices [puppet] - 10https://gerrit.wikimedia.org/r/503332 (https://phabricator.wikimedia.org/T220787) [12:20:41] (03PS2) 10Jbond: raid: refactor structure [puppet] - 10https://gerrit.wikimedia.org/r/503333 (https://phabricator.wikimedia.org/T220787) [12:20:43] (03PS2) 10Jbond: raid: add ssacli class [puppet] - 10https://gerrit.wikimedia.org/r/503334 (https://phabricator.wikimedia.org/T220787) [12:21:17] (03CR) 10jerkins-bot: [V: 04-1] raid: add ssacli class [puppet] - 10https://gerrit.wikimedia.org/r/503334 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [12:21:23] (03CR) 10jerkins-bot: [V: 04-1] ssacli: update raid fact to detect Gen10 devices [puppet] - 10https://gerrit.wikimedia.org/r/503332 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [12:21:26] (03CR) 10jerkins-bot: [V: 04-1] raid: refactor structure [puppet] - 10https://gerrit.wikimedia.org/r/503333 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [12:21:33] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [12:22:15] PROBLEM - keystone admin endpoint port 35357 on labtestcontrol2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [12:22:19] !log T220095 disable icinga checks for labtestcontrol2003 [12:22:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:22] T220095: labtestcontrol2003: rename to cloudcontrol2003-dev - https://phabricator.wikimedia.org/T220095 [12:23:31] (03PS3) 10Jbond: raid: add ssacli class [puppet] - 10https://gerrit.wikimedia.org/r/503334 (https://phabricator.wikimedia.org/T220787) [12:24:08] (03CR) 10jerkins-bot: [V: 04-1] raid: add ssacli class [puppet] - 10https://gerrit.wikimedia.org/r/503334 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [12:25:59] RECOVERY - keystone admin endpoint port 35357 on labtestcontrol2003 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 759 bytes in 0.963 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [12:26:14] (03Abandoned) 10Revi: Add SPF record for wikidata.org [dns] - 10https://gerrit.wikimedia.org/r/477034 (https://phabricator.wikimedia.org/T210134) (owner: 10Revi) [12:28:43] (03CR) 10Vgutierrez: [C: 03+1] varnish/trafficserver: add regex to cover www.wikiba.se as well [puppet] - 10https://gerrit.wikimedia.org/r/500715 (https://phabricator.wikimedia.org/T99531) (owner: 10Dzahn) [12:31:21] PROBLEM - keystone admin endpoint port 35357 on labtestcontrol2003 is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [12:31:53] (03PS3) 10Jbond: raid: refactor structure [puppet] - 10https://gerrit.wikimedia.org/r/503333 (https://phabricator.wikimedia.org/T220787) [12:32:40] 10Operations, 10Prod-Kubernetes, 10Kubernetes: Re-evaluate kubelet operation latencies alerts - https://phabricator.wikimedia.org/T220808 (10akosiaris) A breakdown of the alerts per host follows starting from 2019-03-26 to 2019-04-12 follows ` 89 instance=kubernetes2001 84 instance=kubernetes1001... [12:35:07] RECOVERY - keystone admin endpoint port 35357 on labtestcontrol2003 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 759 bytes in 0.075 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [12:39:53] (03CR) 10Gehel: [C: 04-1] "A few minor comments..." (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/488256 (https://phabricator.wikimedia.org/T213401) (owner: 10Mathew.onipe) [12:44:23] PROBLEM - keystone admin endpoint port 35357 on labtestcontrol2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [12:45:31] RECOVERY - keystone admin endpoint port 35357 on labtestcontrol2003 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 759 bytes in 0.076 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [12:45:40] 10Operations, 10Prod-Kubernetes, 10Kubernetes: Re-evaluate kubelet operation latencies alerts - https://phabricator.wikimedia.org/T220808 (10akosiaris) I am thinking about excluding `exec_sync` operations for a while from the checks to restore faith in the alerts. [12:45:46] (03PS4) 10Jbond: raid: refactor structure [puppet] - 10https://gerrit.wikimedia.org/r/503333 (https://phabricator.wikimedia.org/T220787) [12:46:41] (03PS15) 10Ema: role::cache::upload_ats: mixed Varnish/ATS setup [puppet] - 10https://gerrit.wikimedia.org/r/501360 (https://phabricator.wikimedia.org/T219967) [12:49:15] PROBLEM - keystone admin endpoint port 35357 on labtestcontrol2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [12:49:46] !log rolling restart of cassandra on maps* for jvm upgrade [12:49:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:05] RECOVERY - keystone admin endpoint port 35357 on labtestcontrol2003 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 759 bytes in 0.076 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [12:50:45] PROBLEM - Check systemd state on clouddb2001-dev is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:51:04] PROBLEM - mysqld processes on clouddb2001-dev is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [12:51:20] is that in use? [12:51:30] seriously that is paging? 
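For context on the clouddb2001-dev alerts just above ([12:50]–[12:51]): the host was still being set up, and the firewall changes merged shortly before ([11:52] clouddb2001-dev ferm configuration, [12:19] profile::base::firewall for profile::openstack::codfw1dev::db) follow the usual pattern of including the base firewall profile in the role and opening the service ports with ferm::service rules. The sketch below is illustrative only and is not the content of changes 503323/503335; the rule name, port and srange are assumptions.

```puppet
# Illustrative sketch only -- not the actual diff of 503323/503335.
class profile::openstack::codfw1dev::db {
    # Default-deny host firewall managed via ferm.
    include profile::base::firewall

    # Open the database port to the appropriate networks; the rule name,
    # port and srange here are placeholders, not the real values.
    ferm::service { 'clouddb-mariadb':
        proto  => 'tcp',
        port   => '3306',
        srange => '$DOMAIN_NETWORKS',
    }
}
```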
[12:51:35] * apergos looks in [12:52:06] <_joe_> arturo: you tell me [12:52:07] arturo: I asked cloud to fix that last time, and in the sre meeting [12:52:15] <_joe_> :) [12:52:29] <_joe_> arturo: I think you have all your hosts paging by default [12:52:36] sigh [12:52:53] <_joe_> but well, this is a single service [12:52:56] <_joe_> so it shouldn't [12:53:01] <_joe_> it's clearly unintended [12:53:26] (03PS2) 10Alexandros Kosiaris: Move bastion_hosts out of ferm macros [puppet] - 10https://gerrit.wikimedia.org/r/502612 (owner: 10Alex Monk) [12:53:29] !log decommissioning cassandra-c, restbase2008 -- T208087 [12:53:30] (03PS4) 10Jbond: raid: add ssacli class [puppet] - 10https://gerrit.wikimedia.org/r/503334 (https://phabricator.wikimedia.org/T220787) [12:53:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:32] T208087: Replace remaining Samsung SSDs - https://phabricator.wikimedia.org/T208087 [12:54:05] <_joe_> arturo: anyhow, nothing to see, move along? [12:54:06] ACKNOWLEDGEMENT - Check systemd state on clouddb2001-dev is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Arturo Borrero Gonzalez this shouldnt page. We are working on this server. [12:54:07] ACKNOWLEDGEMENT - mysqld processes on clouddb2001-dev is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld Arturo Borrero Gonzalez this shouldnt page. We are working on this server. https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [12:54:19] _joe_ I belive all cloud databases may be paging, because they just copied the production hosts [12:54:34] <_joe_> arturo: may I suggest to downtime that host for now? [12:54:50] thanks for the ack in any case [12:54:52] _joe_: sure, also I disabled all the checks [12:54:59] <_joe_> thanks :) [12:55:06] (03PS1) 10Lucas Werkmeister (WMDE): Remove constraint-suggestions beta feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503342 (https://phabricator.wikimedia.org/T220609) [12:56:18] I also suggested to use notifications_enabled: 0 for all hosts that are being setup [12:57:18] !log Purge old rows and optimize tables on spare host pc1010 T210725 [12:57:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:22] T210725: Replace parsercache keys to something more meaningful on db-XXXX.php - https://phabricator.wikimedia.org/T210725 [12:59:23] (03PS10) 10Ema: cache: add profile::cache::varnish::frontend [puppet] - 10https://gerrit.wikimedia.org/r/502833 (https://phabricator.wikimedia.org/T219967) [12:59:29] (03PS1) 10Alexandros Kosiaris: Omit exec_sync operations from kubelet alerts [puppet] - 10https://gerrit.wikimedia.org/r/503344 (https://phabricator.wikimedia.org/T220808) [12:59:47] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-2] "DNM yet (see T209879)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503342 (https://phabricator.wikimedia.org/T220609) (owner: 10Lucas Werkmeister (WMDE)) [13:00:23] (03PS16) 10Ema: role::cache::upload_ats: mixed Varnish/ATS setup [puppet] - 10https://gerrit.wikimedia.org/r/501360 (https://phabricator.wikimedia.org/T219967) [13:00:31] (03CR) 10Alexandros Kosiaris: [C: 04-2] "https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/503344 for a different approach" [puppet] - 10https://gerrit.wikimedia.org/r/503079 (https://phabricator.wikimedia.org/T219556) (owner: 10CDanis) [13:02:05] (03PS1) 10Arturo Borrero Gonzalez: codfw1dev: disable paging for all hosts [puppet] - 10https://gerrit.wikimedia.org/r/503345 [13:02:18] (03CR) 
10Marostegui: raid: add ssacli class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/503334 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [13:03:18] _joe_: hope this prevents any more noise from those servers https://gerrit.wikimedia.org/r/c/operations/puppet/+/503345 [13:04:01] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T220691 (10Rosalie_WMDE) Recieved and Signed. Thanks [13:04:08] <_joe_> arturo: you don't need to get them in irc either? [13:04:16] nop [13:04:19] 10Operations, 10ops-eqiad, 10DC-Ops, 10Data-Services: labsdb1009.mgmt down - https://phabricator.wikimedia.org/T218789 (10jcrespo) Any updates? This would block any serious maintenance on the host. [13:04:29] (at least, not in the next 2 months) [13:04:55] 10Operations, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Re-evaluate kubelet operation latencies alerts - https://phabricator.wikimedia.org/T220808 (10akosiaris) >>! In T220808#5107076, @akosiaris wrote: > I am thinking about excluding `exec_sync` operations for a while from the checks to restore... [13:06:15] (03CR) 10Alexandros Kosiaris: [C: 03+2] Omit exec_sync operations from kubelet alerts [puppet] - 10https://gerrit.wikimedia.org/r/503344 (https://phabricator.wikimedia.org/T220808) (owner: 10Alexandros Kosiaris) [13:06:44] (03PS2) 10Arturo Borrero Gonzalez: codfw1dev: disable paging for all hosts [puppet] - 10https://gerrit.wikimedia.org/r/503345 [13:08:00] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] codfw1dev: disable paging for all hosts [puppet] - 10https://gerrit.wikimedia.org/r/503345 (owner: 10Arturo Borrero Gonzalez) [13:09:35] RECOVERY - kubelet operational latencies on kubernetes2004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [13:10:18] 10Operations, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Re-evaluate kubelet operation latencies alerts - https://phabricator.wikimedia.org/T220808 (10akosiaris) p:05High→03Low Change merged and shepherded into production. I am lowering priority but not resolving as we probably want to evalua... 
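To make the exec_sync change above concrete: the kubelet latency checks are Icinga checks driven by Prometheus queries, so "omitting exec_sync operations" amounts to adding a label matcher that filters that operation type out of the query. The resource below is a hedged sketch, not the actual content of change 503344; the metric name, thresholds and query shape are placeholders.

```puppet
# Hedged sketch of excluding exec_sync from a kubelet latency alert.
# Metric name, thresholds and query shape are assumptions, not the real check.
monitoring::check_prometheus { 'kubelet-operational-latencies':
    description     => 'kubelet operational latencies',
    # The operation_type matcher is the interesting part: exec_sync is
    # excluded so its known-slow operations stop flapping the alert.
    query           => 'kubelet_runtime_operations_latency_microseconds{operation_type!="exec_sync"}',
    warning         => 400000,
    critical        => 850000,
    prometheus_url  => "http://prometheus.svc.${::site}.wmnet/k8s",
    dashboard_links => ['https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1'],
}
```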
[13:10:38] (03PS17) 10Ema: role::cache::upload_ats: mixed Varnish/ATS setup [puppet] - 10https://gerrit.wikimedia.org/r/501360 (https://phabricator.wikimedia.org/T219967) [13:10:59] (03PS5) 10Jbond: raid: add ssacli class [puppet] - 10https://gerrit.wikimedia.org/r/503334 (https://phabricator.wikimedia.org/T220787) [13:12:36] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/503069 (owner: 10Bearloga) [13:13:07] PROBLEM - keystone admin endpoint port 35357 on labtestcontrol2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:13:18] (03CR) 10Gehel: flake8: enforce import order and adopt W504 (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/503147 (owner: 10Volans) [13:13:20] (03CR) 10Krinkle: profile::mediawiki::php: tweak ini settings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/502986 (https://phabricator.wikimedia.org/T211488) (owner: 10Giuseppe Lavagetto) [13:13:35] (03PS5) 10Jbond: ssacli: update raid fact to detect Gen10 devices [puppet] - 10https://gerrit.wikimedia.org/r/503332 (https://phabricator.wikimedia.org/T220787) [13:14:06] ACKNOWLEDGEMENT - Check systemd state on labtestcontrol2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Arturo Borrero Gonzalez WIP [13:14:06] ACKNOWLEDGEMENT - keystone admin endpoint port 35357 on labtestcontrol2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Arturo Borrero Gonzalez WIP https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:14:24] (03PS5) 10Jbond: raid: refactor structure [puppet] - 10https://gerrit.wikimedia.org/r/503333 (https://phabricator.wikimedia.org/T220787) [13:14:34] (03PS6) 10Jbond: raid: add ssacli class [puppet] - 10https://gerrit.wikimedia.org/r/503334 (https://phabricator.wikimedia.org/T220787) [13:15:13] 10Operations, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Re-evaluate kubelet operation latencies alerts - https://phabricator.wikimedia.org/T220808 (10akosiaris) a:03akosiaris [13:15:35] RECOVERY - keystone admin endpoint port 35357 on labtestcontrol2003 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 759 bytes in 0.075 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:15:43] (03PS3) 10Alexandros Kosiaris: Move bastion_hosts out of ferm macros [puppet] - 10https://gerrit.wikimedia.org/r/502612 (owner: 10Alex Monk) [13:15:45] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Move bastion_hosts out of ferm macros [puppet] - 10https://gerrit.wikimedia.org/r/502612 (owner: 10Alex Monk) [13:16:39] PROBLEM - puppet last run on krypton is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:17:22] (03CR) 10Jbond: raid: add ssacli class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/503334 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [13:17:41] ah dammit [13:17:55] wave of puppet alerts incoming, disabling puppet across the fleet [13:18:52] you need any help akosiaris ? [13:18:59] !log disable puppet across the fleet to avoid incoming puppet alert storm [13:19:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:12] cdanis: niah, thanks, easy enough to fix, some parenthesis forgotten [13:19:13] PROBLEM - Check systemd state on logstash1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
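On the codfw1dev paging question settled above ([12:51]–[13:08]): rather than touching individual checks, "disable paging for all hosts" is normally done by overriding monitoring data in hiera for the affected hosts, along the lines of the notifications_enabled: 0 suggestion at [12:56]. The sketch below is a hedged illustration, not change 503345; the hiera key and the notifications_enabled parameter are assumptions, and the real monitoring module's interface may differ.

```puppet
# Hedged sketch, not the diff of 503345. Key and parameter names are
# assumptions; the real monitoring module's interface may differ.
class profile::example::icinga_host (
    # Hosts that are still being set up override this to '0' in their hieradata
    # so they stop notifying (and paging) until they are actually in service.
    String $notifications_enabled = lookup('notifications_enabled', { 'default_value' => '1' }),
) {
    monitoring::host { $facts['networking']['fqdn']:
        notifications_enabled => $notifications_enabled,
    }
}
```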
[13:19:23] PROBLEM - Check systemd state on mw1250 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:19:35] PROBLEM - Check systemd state on dns5001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:19:35] PROBLEM - Check systemd state on db1069 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:19:43] PROBLEM - Check systemd state on mw1343 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:19:43] PROBLEM - Check systemd state on mw1342 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:19:53] PROBLEM - Check systemd state on poolcounter1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:20:01] PROBLEM - Check systemd state on archiva1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:20:03] PROBLEM - Check systemd state on aqs1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:20:11] PROBLEM - Check systemd state on ms-be2013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:20:11] PROBLEM - Check systemd state on es2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:20:15] PROBLEM - Check systemd state on mw1306 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:20:17] PROBLEM - Check systemd state on wdqs2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:20:19] PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:20:23] PROBLEM - Check systemd state on mc2027 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:20:27] PROBLEM - Check systemd state on wtp1043 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:20:33] PROBLEM - Check systemd state on helium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:20:33] PROBLEM - Check systemd state on elastic1038 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:20:33] PROBLEM - Check systemd state on acmechief2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:20:35] PROBLEM - Check systemd state on mw1336 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:20:37] PROBLEM - Check systemd state on mc1036 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:20:37] PROBLEM - Check systemd state on mw1244 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:20:39] PROBLEM - Check systemd state on db2073 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:20:41] PROBLEM - Check systemd state on mw2232 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:20:41] PROBLEM - Check systemd state on thumbor2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[13:20:43] PROBLEM - Check systemd state on es2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:20:43] PROBLEM - Check systemd state on mw2284 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:20:47] PROBLEM - Check systemd state on mw2198 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:20:49] PROBLEM - Check systemd state on cloudelastic1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:20:51] PROBLEM - Check systemd state on wtp1047 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:20:53] PROBLEM - Check systemd state on ganeti1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:20:55] PROBLEM - Check systemd state on mw1324 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:20:55] PROBLEM - Check systemd state on db2088 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:20:57] PROBLEM - Check systemd state on wtp1041 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:03] PROBLEM - Check systemd state on ms-be1037 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:07] PROBLEM - Check systemd state on ms-be2022 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:09] PROBLEM - Check systemd state on mw2205 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:11] PROBLEM - Check systemd state on mw1347 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:13] PROBLEM - Check systemd state on mw2263 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:13] PROBLEM - Check systemd state on wtp1025 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:15] PROBLEM - Check systemd state on mw2260 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:15] PROBLEM - Check systemd state on mwdebug2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:17] PROBLEM - Check systemd state on kubetcd2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:19] PROBLEM - Check systemd state on db1085 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:20] patch coming up [13:21:21] PROBLEM - Check systemd state on krypton is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:23] PROBLEM - Check systemd state on install2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:23] PROBLEM - Check systemd state on kubernetes2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:25] PROBLEM - Check systemd state on mc2024 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:25] PROBLEM - Check systemd state on ms-be2018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[13:21:31] PROBLEM - Check systemd state on db2043 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:31] PROBLEM - Check systemd state on mw1270 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:31] PROBLEM - Check systemd state on kafka-jumbo1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:31] PROBLEM - Check systemd state on ores1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:31] (03PS1) 10Alexandros Kosiaris: Add missing parentheses in ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/503348 [13:21:35] PROBLEM - Check systemd state on dbstore1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:37] PROBLEM - Check systemd state on wtp1034 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:41] PROBLEM - Check systemd state on elastic2040 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:45] PROBLEM - Check systemd state on graphite1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:47] PROBLEM - Check systemd state on dns5002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:51] (03PS1) 10Esanders: mwgrep: Include JSON files in search [puppet] - 10https://gerrit.wikimedia.org/r/503349 [13:21:55] PROBLEM - Check systemd state on relforge1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:55] PROBLEM - Check systemd state on multatuli is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:55] PROBLEM - Check systemd state on wdqs1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:57] PROBLEM - Check systemd state on elastic1036 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:59] PROBLEM - Check systemd state on maps1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:59] PROBLEM - Check systemd state on mw1261 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:22:01] PROBLEM - Check systemd state on mw2175 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:22:01] PROBLEM - Check systemd state on mw1258 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:22:03] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:22:03] (03PS11) 10Ema: cache: add profile::cache::varnish::frontend [puppet] - 10https://gerrit.wikimedia.org/r/502833 (https://phabricator.wikimedia.org/T219967) [13:22:03] PROBLEM - Check systemd state on analytics1072 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:22:05] (03PS18) 10Ema: role::cache::upload_ats: mixed Varnish/ATS setup [puppet] - 10https://gerrit.wikimedia.org/r/501360 (https://phabricator.wikimedia.org/T219967) [13:22:05] PROBLEM - Check systemd state on analytics1057 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[13:22:05] PROBLEM - Check systemd state on mw2187 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:22:07] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Add missing parentheses in ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/503348 (owner: 10Alexandros Kosiaris) [13:22:09] PROBLEM - Check systemd state on ms-be2016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:22:11] PROBLEM - Check systemd state on analytics1056 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:22:15] PROBLEM - Check systemd state on restbase2014 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:22:15] PROBLEM - Check systemd state on wtp1045 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:22:19] PROBLEM - Check systemd state on ms-fe2008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:22:21] PROBLEM - Check systemd state on mw1334 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:22:21] PROBLEM - Check systemd state on dns2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:22:23] PROBLEM - Check systemd state on db2080 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:22:29] PROBLEM - Check systemd state on db1090 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:22:31] PROBLEM - Check systemd state on mw2139 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:22:35] PROBLEM - Check systemd state on mw2180 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:22:37] sigh, I wasn't fast enough to avoid most of this [13:22:37] PROBLEM - Check systemd state on analytics1046 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:22:39] PROBLEM - Check systemd state on ms-be2048 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:22:39] PROBLEM - Check systemd state on wtp2018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:22:41] PROBLEM - Check systemd state on aqs1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:22:43] PROBLEM - Check systemd state on elastic2047 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:22:43] PROBLEM - Check systemd state on analytics1049 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:22:43] PROBLEM - Check systemd state on db1078 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:22:43] PROBLEM - Check systemd state on rdb1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:22:55] PROBLEM - Check systemd state on es1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:22:57] PROBLEM - Check systemd state on torrelay1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:23:05] PROBLEM - Check systemd state on analytics1062 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
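The puppet failure driving the "Check systemd state" storm above (and continuing below) is a ferm one: when a list of hosts is resolved into a rule, ferm needs the set wrapped in parentheses, and a rule generated without them fails ferm's config check, leaving ferm.service failed on every host that picked up the bad config. Below is a hedged sketch of that pitfall with made-up names; it is not the literal diff of change 503348.

```puppet
# Hedged sketch of the ferm parentheses pitfall; rule and variable names are
# made up and this is not the literal content of change 503348.
$bastion_hosts = ['bast1002.wikimedia.org', 'bast2002.wikimedia.org']

# BROKEN (for illustration): without wrapping parentheses, the resolved
# space-separated host list is not a valid ferm set, the generated config
# fails validation, and ferm.service goes down fleet-wide:
#   rule => "proto tcp dport ssh saddr @resolve(${join($bastion_hosts, ' ')}) ACCEPT;",

# FIXED: wrap the list so ferm parses it as a single set of addresses.
ferm::rule { 'bastion-ssh-example':
    rule => "proto tcp dport ssh saddr (@resolve((${join($bastion_hosts, ' ')}))) ACCEPT;",
}
```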
[13:23:05] PROBLEM - Check systemd state on cloudservices1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:23:07] PROBLEM - Check systemd state on mw1317 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:23:09] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:23:13] PROBLEM - Check systemd state on ms-be1024 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:23:17] PROBLEM - Check systemd state on mw2142 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:23:23] PROBLEM - Check systemd state on elastic2029 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:23:25] PROBLEM - Check systemd state on elastic2032 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:23:27] PROBLEM - Check systemd state on wtp2008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:23:29] PROBLEM - Check systemd state on ms-be2024 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:23:35] PROBLEM - Check systemd state on wtp1027 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:23:39] PROBLEM - Check systemd state on wtp2015 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:24:03] PROBLEM - Check systemd state on mw2239 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:24:07] PROBLEM - Check systemd state on mw2171 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:24:29] RECOVERY - Check systemd state on mw1317 is OK: OK - running: The system is fully operational [13:24:56] !log re-enable puppet across the fleet. Patch merged, recovery storm coming [13:24:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:05] PROBLEM - Check systemd state on ms-be2031 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:26:26] seems like I was only fast enough to stop 90% of that storm [13:27:10] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Fix RAID handler alert and puppet facter to work with Gen10 hosts and ssacli tool - https://phabricator.wikimedia.org/T220787 (10jbond) I have created a series of changes starting with [[ https://gerrit.wikimedia.org/r/503332 | 503332 ]] which adds... [13:27:19] RECOVERY - puppet last run on krypton is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [13:27:59] RECOVERY - Check systemd state on krypton is OK: OK - running: The system is fully operational [13:28:29] PROBLEM - puppet last run on acmechief2001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Service[ferm] [13:30:42] 10Operations, 10Puppet: Add a CI check for the use of hiera() function - https://phabricator.wikimedia.org/T220820 (10MoritzMuehlenhoff) [13:30:49] 10Operations, 10Puppet: Add a CI check for the use of hiera() function - https://phabricator.wikimedia.org/T220820 (10MoritzMuehlenhoff) p:05Triage→03Low [13:31:35] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10serviceops, and 4 others: Add security sensitive nodes to our kubernetes cluster - https://phabricator.wikimedia.org/T220821 (10akosiaris) [13:34:09] 10Operations, 10vm-requests: Site: 4 VM request for kubernetes - https://phabricator.wikimedia.org/T220822 (10akosiaris) [13:35:31] RECOVERY - Check systemd state on labtestcontrol2003 is OK: OK - running: The system is fully operational [13:36:16] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10serviceops, and 4 others: Add security sensitive nodes to our kubernetes cluster - https://phabricator.wikimedia.org/T220821 (10akosiaris) p:05Triage→03Normal [13:36:21] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10serviceops, and 4 others: Add security sensitive nodes to our kubernetes cluster - https://phabricator.wikimedia.org/T220821 (10akosiaris) [13:36:23] RECOVERY - Check systemd state on acmechief2001 is OK: OK - running: The system is fully operational [13:36:25] 10Operations, 10vm-requests: Site: 4 VM request for kubernetes - https://phabricator.wikimedia.org/T220822 (10akosiaris) [13:37:46] (03PS2) 10Alex Monk: Move caches out of network::constants into hieradata [puppet] - 10https://gerrit.wikimedia.org/r/502630 [13:39:03] RECOVERY - puppet last run on acmechief2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:39:47] 10Operations, 10vm-requests: Site: 4 VM request for kubernetes - https://phabricator.wikimedia.org/T220822 (10akosiaris) p:05Triage→03Normal [13:40:27] (03CR) 10Krinkle: [C: 03+1] "I'd recommend including 'Gadgets-definition' by default as well, that would allow us to make it a one-stop-shop for 99% of cases regarding" [puppet] - 10https://gerrit.wikimedia.org/r/503349 (owner: 10Esanders) [13:41:23] (03PS12) 10Ema: cache: add profile::cache::varnish::frontend [puppet] - 10https://gerrit.wikimedia.org/r/502833 (https://phabricator.wikimedia.org/T219967) [13:41:25] (03PS19) 10Ema: role::cache::upload_ats: mixed Varnish/ATS setup [puppet] - 10https://gerrit.wikimedia.org/r/501360 (https://phabricator.wikimedia.org/T219967) [13:46:48] RECOVERY - Check systemd state on db2073 is OK: OK - running: The system is fully operational [13:46:50] RECOVERY - Check systemd state on mw1250 is OK: OK - running: The system is fully operational [13:46:50] RECOVERY - Check systemd state on mw2232 is OK: OK - running: The system is fully operational [13:46:56] RECOVERY - Check systemd state on cloudelastic1002 is OK: OK - running: The system is fully operational [13:47:00] RECOVERY - Check systemd state on db1069 is OK: OK - running: The system is fully operational [13:47:02] RECOVERY - Check systemd state on mw1324 is OK: OK - running: The system is fully operational [13:47:16] RECOVERY - Check systemd state on dns5001 is OK: OK - running: The system is fully operational [13:47:18] RECOVERY - Check systemd state on poolcounter1001 is OK: OK - running: The system is fully operational [13:47:24] RECOVERY - Check systemd state on mw2263 is OK: OK - running: The system is fully operational [13:47:28] akosiaris, monitoring_hosts and deployment_hosts 
are looking more complicated... monitoring_hosts gets used in standard::ntp which comes from ::standard which seems to be included in over 200 different places [13:47:30] RECOVERY - Check systemd state on archiva1001 is OK: OK - running: The system is fully operational [13:47:34] RECOVERY - Check systemd state on wtp1043 is OK: OK - running: The system is fully operational [13:47:38] RECOVERY - Check systemd state on helium is OK: OK - running: The system is fully operational [13:47:40] RECOVERY - Check systemd state on elastic1038 is OK: OK - running: The system is fully operational [13:47:40] RECOVERY - Check systemd state on logstash1011 is OK: OK - running: The system is fully operational [13:47:44] RECOVERY - Check systemd state on mw1244 is OK: OK - running: The system is fully operational [13:47:47] (03CR) 10Muehlenhoff: Initial Kerberos KDC/kadmin server profiles/roles (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/502511 (owner: 10Muehlenhoff) [13:47:48] RECOVERY - Check systemd state on analytics1072 is OK: OK - running: The system is fully operational [13:47:54] RECOVERY - Check systemd state on thumbor2002 is OK: OK - running: The system is fully operational [13:47:54] RECOVERY - Check systemd state on es2003 is OK: OK - running: The system is fully operational [13:48:00] RECOVERY - Check systemd state on wtp1047 is OK: OK - running: The system is fully operational [13:48:00] RECOVERY - Check systemd state on ganeti1004 is OK: OK - running: The system is fully operational [13:48:04] RECOVERY - Check systemd state on wtp1041 is OK: OK - running: The system is fully operational [13:48:05] (03PS4) 10Muehlenhoff: Initial Kerberos KDC/kadmin server profiles/roles [puppet] - 10https://gerrit.wikimedia.org/r/502511 [13:48:06] RECOVERY - Check systemd state on db2088 is OK: OK - running: The system is fully operational [13:48:10] RECOVERY - Check systemd state on mw1343 is OK: OK - running: The system is fully operational [13:48:10] RECOVERY - Check systemd state on mw1342 is OK: OK - running: The system is fully operational [13:48:14] RECOVERY - Check systemd state on ms-be1037 is OK: OK - running: The system is fully operational [13:48:18] RECOVERY - Check systemd state on mw1347 is OK: OK - running: The system is fully operational [13:48:20] RECOVERY - Check systemd state on ms-be2022 is OK: OK - running: The system is fully operational [13:48:20] RECOVERY - Check systemd state on mw2205 is OK: OK - running: The system is fully operational [13:48:24] deployment_hosts gets used under ssh::server which comes from ::ssh but also profile::base creates it like this: create_resources('class', {'ssh::server' => $ssh_server_settings}) [13:48:28] RECOVERY - Check systemd state on db1085 is OK: OK - running: The system is fully operational [13:48:28] RECOVERY - Check systemd state on mw2260 is OK: OK - running: The system is fully operational [13:48:40] RECOVERY - Check systemd state on wdqs1005 is OK: OK - running: The system is fully operational [13:48:41] $ssh_server_settings = hiera('profile::base::ssh_server_settings', {}), [13:48:42] RECOVERY - Check systemd state on mw1336 is OK: OK - running: The system is fully operational [13:48:44] RECOVERY - Check systemd state on dns5002 is OK: OK - running: The system is fully operational [13:48:44] RECOVERY - Check systemd state on mc1036 is OK: OK - running: The system is fully operational [13:48:44] RECOVERY - Check systemd state on mw1261 is OK: OK - running: The system is fully operational [13:48:44] RECOVERY - Check systemd 
state on maps1003 is OK: OK - running: The system is fully operational [13:48:50] RECOVERY - Check systemd state on analytics1057 is OK: OK - running: The system is fully operational [13:48:56] RECOVERY - Check systemd state on mw2284 is OK: OK - running: The system is fully operational [13:48:56] RECOVERY - Check systemd state on ms-be2016 is OK: OK - running: The system is fully operational [13:48:58] RECOVERY - Check systemd state on mw2198 is OK: OK - running: The system is fully operational [13:49:02] (03CR) 10jerkins-bot: [V: 04-1] Initial Kerberos KDC/kadmin server profiles/roles [puppet] - 10https://gerrit.wikimedia.org/r/502511 (owner: 10Muehlenhoff) [13:49:06] RECOVERY - Check systemd state on ms-fe2008 is OK: OK - running: The system is fully operational [13:49:10] RECOVERY - Check systemd state on dns2001 is OK: OK - running: The system is fully operational [13:49:10] RECOVERY - Check systemd state on db2080 is OK: OK - running: The system is fully operational [13:49:14] RECOVERY - Check systemd state on db1090 is OK: OK - running: The system is fully operational [13:49:16] RECOVERY - Check systemd state on elastic2047 is OK: OK - running: The system is fully operational [13:49:22] RECOVERY - Check systemd state on wtp1025 is OK: OK - running: The system is fully operational [13:49:26] RECOVERY - Check systemd state on mw2180 is OK: OK - running: The system is fully operational [13:49:26] RECOVERY - Check systemd state on analytics1046 is OK: OK - running: The system is fully operational [13:49:28] RECOVERY - Check systemd state on wtp2018 is OK: OK - running: The system is fully operational [13:49:28] RECOVERY - Check systemd state on aqs1008 is OK: OK - running: The system is fully operational [13:49:30] RECOVERY - Check systemd state on mwdebug2001 is OK: OK - running: The system is fully operational [13:49:30] RECOVERY - Check systemd state on db1078 is OK: OK - running: The system is fully operational [13:49:32] RECOVERY - Check systemd state on kubetcd2002 is OK: OK - running: The system is fully operational [13:49:38] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [13:49:38] RECOVERY - Check systemd state on relforge1001 is OK: OK - running: The system is fully operational [13:49:42] RECOVERY - Check systemd state on elastic1036 is OK: OK - running: The system is fully operational [13:49:46] RECOVERY - Check systemd state on mw2142 is OK: OK - running: The system is fully operational [13:49:46] RECOVERY - Check systemd state on mw1258 is OK: OK - running: The system is fully operational [13:49:48] RECOVERY - Check systemd state on mwdebug1001 is OK: OK - running: The system is fully operational [13:49:50] RECOVERY - Check systemd state on multatuli is OK: OK - running: The system is fully operational [13:49:52] RECOVERY - Check systemd state on mw2175 is OK: OK - running: The system is fully operational [13:49:56] RECOVERY - Check systemd state on wtp2008 is OK: OK - running: The system is fully operational [13:49:56] RECOVERY - Check systemd state on analytics1056 is OK: OK - running: The system is fully operational [13:49:56] RECOVERY - Check systemd state on mw2187 is OK: OK - running: The system is fully operational [13:49:58] RECOVERY - Check systemd state on ms-be2024 is OK: OK - running: The system is fully operational [13:50:02] RECOVERY - Check systemd state on wtp1045 is OK: OK - running: The system is fully operational [13:50:02] RECOVERY - Check systemd state on db2043 is OK: OK - running: The system is 
fully operational [13:50:04] RECOVERY - Check systemd state on restbase2014 is OK: OK - running: The system is fully operational [13:50:04] RECOVERY - Check systemd state on es2001 is OK: OK - running: The system is fully operational [13:50:06] RECOVERY - Check systemd state on mw1334 is OK: OK - running: The system is fully operational [13:50:10] RECOVERY - Check systemd state on wtp2015 is OK: OK - running: The system is fully operational [13:50:24] RECOVERY - Check systemd state on mw2139 is OK: OK - running: The system is fully operational [13:50:32] RECOVERY - Check systemd state on ms-be2048 is OK: OK - running: The system is fully operational [13:50:32] RECOVERY - Check systemd state on analytics1049 is OK: OK - running: The system is fully operational [13:50:34] RECOVERY - Check systemd state on rdb1009 is OK: OK - running: The system is fully operational [13:50:40] RECOVERY - Check systemd state on ms-be1024 is OK: OK - running: The system is fully operational [13:50:54] RECOVERY - Check systemd state on elastic2029 is OK: OK - running: The system is fully operational [13:50:54] RECOVERY - Check systemd state on elastic2032 is OK: OK - running: The system is fully operational [13:51:02] RECOVERY - Check systemd state on wtp1027 is OK: OK - running: The system is fully operational [13:51:20] RECOVERY - Check systemd state on ms-be2031 is OK: OK - running: The system is fully operational [13:51:41] (03PS20) 10Ema: role::cache::upload_ats: mixed Varnish/ATS setup [puppet] - 10https://gerrit.wikimedia.org/r/501360 (https://phabricator.wikimedia.org/T219967) [13:51:45] I wonder which hosts do not have ::standard [13:51:54] RECOVERY - Check systemd state on kubernetes2004 is OK: OK - running: The system is fully operational [13:52:02] RECOVERY - Check systemd state on mw2171 is OK: OK - running: The system is fully operational [13:52:06] RECOVERY - Check systemd state on torrelay1001 is OK: OK - running: The system is fully operational [13:53:22] PROBLEM - Mediawiki Cirrussearch update lag - codfw on icinga1001 is CRITICAL: CRITICAL: 10.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [13:53:44] RECOVERY - Check systemd state on es1011 is OK: OK - running: The system is fully operational [13:53:46] PROBLEM - Mediawiki Cirrussearch update lag - eqiad on icinga1001 is CRITICAL: CRITICAL: 10.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [13:53:48] RECOVERY - Check systemd state on ms-be2018 is OK: OK - running: The system is fully operational [13:53:56] RECOVERY - Check systemd state on wtp1034 is OK: OK - running: The system is fully operational [13:55:24] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me (and tested on db2102)" [puppet] - 10https://gerrit.wikimedia.org/r/503332 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [13:55:32] RECOVERY - Check systemd state on elastic2040 is OK: OK - running: The system is fully operational [13:55:36] RECOVERY - Check systemd state on ms-be2013 is OK: OK - running: The system is fully operational [13:57:14] RECOVERY - Check systemd state on aqs1004 is OK: OK - running: The system is fully operational [13:57:22] RECOVERY - Check systemd state on kafka-jumbo1004 is OK: OK - running: The system is fully operational [13:57:34] RECOVERY - Check systemd state on ores1009 is OK: OK - running: The system is fully operational [13:59:24] RECOVERY - Check 
systemd state on wdqs2001 is OK: OK - running: The system is fully operational [14:01:04] RECOVERY - Check systemd state on install2002 is OK: OK - running: The system is fully operational [14:02:50] RECOVERY - Check systemd state on dbstore1005 is OK: OK - running: The system is fully operational [14:02:54] RECOVERY - Check systemd state on mc2024 is OK: OK - running: The system is fully operational [14:04:23] 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10elukey) Opened https://github.com/RadeonOpenCompute/ROCm/issues/761 to... [14:04:32] RECOVERY - Check systemd state on cloudservices1003 is OK: OK - running: The system is fully operational [14:04:44] RECOVERY - Check systemd state on mc2027 is OK: OK - running: The system is fully operational [14:04:46] RECOVERY - Check systemd state on mw1270 is OK: OK - running: The system is fully operational [14:04:48] RECOVERY - Check systemd state on mw1306 is OK: OK - running: The system is fully operational [14:06:22] RECOVERY - Check systemd state on analytics1062 is OK: OK - running: The system is fully operational [14:06:42] RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational [14:07:29] (03PS6) 10Jbond: ssacli: update raid fact to detect Gen10 devices [puppet] - 10https://gerrit.wikimedia.org/r/503332 (https://phabricator.wikimedia.org/T220787) [14:08:24] (03CR) 10Jbond: [C: 03+2] ssacli: update raid fact to detect Gen10 devices [puppet] - 10https://gerrit.wikimedia.org/r/503332 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [14:08:30] RECOVERY - Check systemd state on mw2239 is OK: OK - running: The system is fully operational [14:10:10] RECOVERY - Check systemd state on graphite1004 is OK: OK - running: The system is fully operational [14:14:52] RECOVERY - Mediawiki Cirrussearch update lag - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [14:15:22] RECOVERY - Mediawiki Cirrussearch update lag - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [14:15:26] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 32094736840 and 0 seconds [14:15:46] * onimisionipe is checking postgres lag [14:19:58] (03CR) 10Mathew.onipe: Add wdqs data transfer cookbook (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/488256 (https://phabricator.wikimedia.org/T213401) (owner: 10Mathew.onipe) [14:20:09] (03PS17) 10Mathew.onipe: Add wdqs data transfer cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/488256 (https://phabricator.wikimedia.org/T213401) [14:22:37] 10Operations, 10Traffic, 10Patch-For-Review: tagged_interface sometimes exceeds IFNAMSIZ - https://phabricator.wikimedia.org/T209707 (10MoritzMuehlenhoff) We could also look into a backport of https://github.com/systemd/systemd/commit/9009d3b5c3b6d191be69215736be77583e0f23f9 to Stretch, seems totally doable... 
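Picking up the [13:47]–[13:48] explanation of why deployment_hosts is the awkward case: ssh::server is not only reached through the ::ssh module, it is also declared dynamically by profile::base from a hiera hash via create_resources(), so once the value stops coming from network::constants every one of those paths has to read it from hiera. A minimal sketch of that wiring follows, assuming trimmed-down parameter lists; the real classes in operations/puppet carry more parameters and logic than shown.

```puppet
# Minimal sketch of the wiring quoted in the discussion above; parameter lists
# are trimmed and the types are assumptions, not the real operations/puppet code.
class ssh::server (
    # Hosts allowed extra privileges in sshd_config (e.g. deployment sources).
    Array[Stdlib::IP::Address] $deployment_hosts = [],
) {
    # ...renders /etc/ssh/sshd_config from a template using $deployment_hosts...
}

class profile::base (
    Hash $ssh_server_settings = lookup('profile::base::ssh_server_settings', { 'default_value' => {} }),
) {
    # This is the create_resources() call quoted at [13:48]: every key of the
    # hiera hash (deployment_hosts included) becomes a parameter of
    # Class['ssh::server'], so the data must live in hieradata once it is no
    # longer exported by network::constants.
    create_resources('class', { 'ssh::server' => $ssh_server_settings })
}
```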
[14:23:10] (03PS1) 10Gehel: elasticsearch: rename "update lag" check to "update rate" [puppet] - 10https://gerrit.wikimedia.org/r/503366 [14:29:15] !log depool maps2001 for postgres initialization [14:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:22] (03PS2) 10Filippo Giunchedi: aptrepo: validate deb822 files [puppet] - 10https://gerrit.wikimedia.org/r/503025 [14:30:46] 10Operations, 10Traffic, 10Patch-For-Review: tagged_interface sometimes exceeds IFNAMSIZ - https://phabricator.wikimedia.org/T209707 (10Andrew) Installing cloudvirt1024 with Buster isn't really an option -- we'd have to port all OpenStack packages for versions M and N to Buster just to keep this one server a... [14:32:51] (03CR) 10Mathew.onipe: [C: 03+1] elasticsearch: rename "update lag" check to "update rate" [puppet] - 10https://gerrit.wikimedia.org/r/503366 (owner: 10Gehel) [14:33:47] (03PS13) 10Ema: cache: add profile::cache::varnish::frontend [puppet] - 10https://gerrit.wikimedia.org/r/502833 (https://phabricator.wikimedia.org/T219967) [14:34:23] (03PS21) 10Ema: role::cache::upload_ats: mixed Varnish/ATS setup [puppet] - 10https://gerrit.wikimedia.org/r/501360 (https://phabricator.wikimedia.org/T219967) [14:35:04] PROBLEM - Check systemd state on maps2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:35:15] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1002/15735/install1002.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/503025 (owner: 10Filippo Giunchedi) [14:35:19] (03CR) 10Marostegui: ">" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/503334 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [14:36:10] maps is me [14:36:13] silencing! [14:38:13] (03CR) 10Jbond: "> Patch Set 6:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/503334 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [14:41:38] (03PS14) 10Ema: cache: add profile::cache::varnish::frontend [puppet] - 10https://gerrit.wikimedia.org/r/502833 (https://phabricator.wikimedia.org/T219967) [14:42:43] (03PS22) 10Ema: role::cache::upload_ats: mixed Varnish/ATS setup [puppet] - 10https://gerrit.wikimedia.org/r/501360 (https://phabricator.wikimedia.org/T219967) [14:43:40] (03CR) 10Marostegui: raid: refactor structure (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/503333 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [14:43:43] (03CR) 10Andrew Bogott: [C: 03+1] "As I understand it, the proposed solution is only available on Buster. We will still need something like this in order to keep Stretch wo" [puppet] - 10https://gerrit.wikimedia.org/r/474272 (https://phabricator.wikimedia.org/T209707) (owner: 10Vgutierrez) [14:48:52] (03PS23) 10Ema: role::cache::upload_ats: mixed Varnish/ATS setup [puppet] - 10https://gerrit.wikimedia.org/r/501360 (https://phabricator.wikimedia.org/T219967) [14:49:50] PROBLEM - puppet last run on cloudcontrol2001-dev is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:50:28] 10Operations, 10Deployments, 10HHVM, 10Performance-Team (Radar), and 2 others: Translation cache exhaustion caused by changes to PHP code in file scope - https://phabricator.wikimedia.org/T103886 (10Krinkle) [14:50:31] 10Operations, 10HHVM, 10Scap (Scap3-MediaWiki-MVP): Make scap able to depool/repool servers via the conftool API - https://phabricator.wikimedia.org/T104352 (10Krinkle) [14:51:02] 10Operations, 10Deployments, 10HHVM, 10Performance-Team (Radar), and 2 others: Translation cache exhaustion caused by changes to PHP code in file scope - https://phabricator.wikimedia.org/T103886 (10Krinkle) 05Open→03Resolved Seems fine. We'll find out for sure when we continue work on T99740 for local... [14:52:24] (03PS1) 10CDanis: check_ripe_atlas: log exceptions to syslog, not /dev/null [puppet] - 10https://gerrit.wikimedia.org/r/503374 [14:52:33] (03PS6) 10Jbond: raid: refactor structure [puppet] - 10https://gerrit.wikimedia.org/r/503333 (https://phabricator.wikimedia.org/T220787) [14:53:01] (03PS7) 10Jbond: raid: add ssacli class [puppet] - 10https://gerrit.wikimedia.org/r/503334 (https://phabricator.wikimedia.org/T220787) [14:54:45] (03CR) 10Jbond: raid: refactor structure (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/503333 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [14:54:57] (03PS8) 10Jbond: raid: add ssacli class [puppet] - 10https://gerrit.wikimedia.org/r/503334 (https://phabricator.wikimedia.org/T220787) [14:55:21] 10Operations, 10PHP 7.0 support, 10Patch-For-Review, 10Performance-Team (Radar): Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (10Krinkle) @Joe Regarding opcache, I'm not sure why this change for manual reloading is applied now. Could we do that af... [14:58:42] 10Operations, 10Discovery-Wikidata-Query-Service-Sprint: data reimport on wdqs1009 and wdqs1010 - https://phabricator.wikimedia.org/T220830 (10Gehel) [15:00:20] (03CR) 10Gehel: [C: 03+1] "LGTM, let's see if volans has something to say" [cookbooks] - 10https://gerrit.wikimedia.org/r/488256 (https://phabricator.wikimedia.org/T213401) (owner: 10Mathew.onipe) [15:01:32] (03PS15) 10Ema: cache: add profile::cache::varnish::frontend [puppet] - 10https://gerrit.wikimedia.org/r/502833 (https://phabricator.wikimedia.org/T219967) [15:02:04] (03PS24) 10Ema: role::cache::upload_ats: mixed Varnish/ATS setup [puppet] - 10https://gerrit.wikimedia.org/r/501360 (https://phabricator.wikimedia.org/T219967) [15:03:00] ACKNOWLEDGEMENT - Free Blazegraph allocators wdqs-blazegraph on wdqs1009 is CRITICAL: cluster=wdqs-test instance=wdqs1009:9193 job=blazegraph site=eqiad Gehel data reimport needed - https://phabricator.wikimedia.org/T220830 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=32&fullscreen [15:03:38] (03PS25) 10Ema: role::cache::upload_ats: mixed Varnish/ATS setup [puppet] - 10https://gerrit.wikimedia.org/r/501360 (https://phabricator.wikimedia.org/T219967) [15:05:26] (03PS3) 10Krinkle: Consistently use HTML attributes with quotes [puppet] - 10https://gerrit.wikimedia.org/r/497346 (owner: 10Fomafix) [15:05:57] (03PS4) 10Krinkle: mediawiki: Add missing quotes to HTML attributes on error pages [puppet] - 10https://gerrit.wikimedia.org/r/497346 (owner: 10Fomafix) [15:06:37] (03CR) 10Krinkle: [C: 03+1] "LGTM. Later, the hhvm-fatal-error.php.erb file should be converted to use errorpage.html.erb template. 
I have not yet done that :) - part " [puppet] - 10https://gerrit.wikimedia.org/r/497346 (owner: 10Fomafix) [15:06:54] (03CR) 10Lucas Werkmeister (WMDE): "Where does the value of 3in come from? On my local wiki, that looks rather short, and 4in fit into the statement box pretty well." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498660 (https://phabricator.wikimedia.org/T218191) (owner: 10Alaa Sarhan) [15:07:58] (03CR) 10Muehlenhoff: cloudelastic: allow jobrunners and mwmaint nodes to access cloudelastic (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/502829 (https://phabricator.wikimedia.org/T220625) (owner: 10Gehel) [15:08:18] 10Operations, 10Performance-Team (Radar): PHP fatal error handler not working on mwdebug servers - https://phabricator.wikimedia.org/T217846 (10Krinkle) (This is blocking , which is low priority cleanup.) [15:09:48] (03CR) 10Jbond: "Looks good to me however there is one comment relating to django internals which i coudln't resolve" (032 comments) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443366 (https://phabricator.wikimedia.org/T198592) (owner: 10Volans) [15:13:18] (03CR) 10Gehel: cloudelastic: allow jobrunners and mwmaint nodes to access cloudelastic (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/502829 (https://phabricator.wikimedia.org/T220625) (owner: 10Gehel) [15:15:19] (03PS1) 10Alex Monk: deployment-prep: Update upload host [puppet] - 10https://gerrit.wikimedia.org/r/503380 [15:15:58] Krenair: yeah I know, which is why I said it's gonna be complicated ;-) [15:16:24] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting deployment access for santhosh - https://phabricator.wikimedia.org/T220785 (10fgiunchedi) a:03akosiaris Followed up with Alex, assigning to him. [15:17:02] I knew some would be more complicated than others [15:17:27] akosiaris, do you know which hosts deliberately do not have standard? [15:17:49] I wonder if this stuff should go through profile::base ? [15:19:22] (03PS1) 10Ema: cache: implement profile::cache::varnish::backend [puppet] - 10https://gerrit.wikimedia.org/r/503381 (https://phabricator.wikimedia.org/T219967) [15:20:07] (03CR) 10Michael Große: [C: 03+1] "Looks good to me, but there is still a pending discussion." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503342 (https://phabricator.wikimedia.org/T220609) (owner: 10Lucas Werkmeister (WMDE)) [15:20:58] (03CR) 10Muehlenhoff: cloudelastic: allow jobrunners and mwmaint nodes to access cloudelastic (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/502829 (https://phabricator.wikimedia.org/T220625) (owner: 10Gehel) [15:27:43] (03PS2) 10Ema: cache: implement profile::cache::varnish::backend [puppet] - 10https://gerrit.wikimedia.org/r/503381 (https://phabricator.wikimedia.org/T219967) [15:29:23] (03PS5) 10Krinkle: Avoid redirects from HTTPS to HTTP and back to HTTPS [puppet] - 10https://gerrit.wikimedia.org/r/469262 (owner: 10Fomafix) [15:29:33] (03CR) 10Krinkle: [C: 03+1] "LGMT. Confirmed the redirects." [puppet] - 10https://gerrit.wikimedia.org/r/469262 (owner: 10Fomafix) [15:29:37] !log gerrit restart incoming [15:29:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:48] !log gerrit back [15:32:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:36] PROBLEM - puppet last run on db2095 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [15:36:00] PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/xhgui] [15:36:44] PROBLEM - puppet last run on kafka2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [15:36:51] * apergos raises an eyebrow curiously [15:37:12] PROBLEM - puppet last run on labsdb1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [15:37:32] (03CR) 10Andrew Bogott: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/503123 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [15:38:32] PROBLEM - puppet last run on gerrit2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_All-Avatars] [15:39:45] I guess that is expected because of the gerrit restart [15:41:39] (03CR) 10Andrew Bogott: [C: 03+2] deployment-prep: Update upload host [puppet] - 10https://gerrit.wikimedia.org/r/503380 (owner: 10Alex Monk) [15:43:16] PROBLEM - keystone admin endpoint port 35357 on labtestcontrol2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:44:24] RECOVERY - keystone admin endpoint port 35357 on labtestcontrol2003 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 759 bytes in 0.075 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:44:41] 10Operations, 10SRE-Access-Requests: analytics-wmde group addition for Lucas Werkmeister - https://phabricator.wikimedia.org/T220084 (10RazShuty) @Nuria can you please take action? this would help a lot. [15:46:58] (03CR) 10Andrew Bogott: "https://puppet-compiler.wmflabs.org/compiler1002/15741/" [puppet] - 10https://gerrit.wikimedia.org/r/501587 (https://phabricator.wikimedia.org/T171188) (owner: 10Alex Monk) [15:47:01] (03CR) 10Volans: "Thanks a lot for the review, I know it's a lot of code, in particular if lacking context. Replies inline." (032 comments) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443366 (https://phabricator.wikimedia.org/T198592) (owner: 10Volans) [15:47:09] 10Operations, 10ops-eqiad, 10Analytics, 10hardware-requests, and 2 others: Upgrade kafka-jumbo100[1-6] to 10G NICs (if possible) - https://phabricator.wikimedia.org/T220700 (10ayounsi) From: https://netbox.wikimedia.org/dcim/devices/?q=kafka-jumbo&status=1 kafka-jumbo1002 kafka-jumbo1004 kafka-jumbo1005... 
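An aside on the burst of "puppet last run" criticals above: the failed resources are all Exec[git_pull_*] resources, which shell out to git against the Gerrit replica, so a short Gerrit restart makes exactly one Puppet run fail on each affected host and the next scheduled run recovers on its own. A rough Python equivalent of what such an exec does (the path is hypothetical; the real resources are defined in Puppet, not Python):

```python
# Rough illustration of an Exec[git_pull_...] resource: pull a repo cloned
# from Gerrit. If Gerrit is down for a restart, git exits non-zero, the
# resource fails and the Puppet run is reported as failed; the next
# scheduled run (typically ~30 minutes later) pulls cleanly and the
# Icinga "puppet last run" check recovers, as seen above.
import subprocess
import sys

REPO_DIR = "/srv/mediawiki-config"   # hypothetical checkout location

try:
    subprocess.run(
        ["git", "-C", REPO_DIR, "pull", "--ff-only"],
        check=True,
        timeout=300,
    )
except (subprocess.CalledProcessError, subprocess.TimeoutExpired) as exc:
    # Puppet would mark the Exec resource as failed at this point.
    sys.exit(f"git pull failed: {exc}")
```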
[15:47:54] (03PS3) 10Ema: cache: implement profile::cache::varnish::backend [puppet] - 10https://gerrit.wikimedia.org/r/503381 (https://phabricator.wikimedia.org/T219967) [15:53:55] (03CR) 10Jbond: [C: 04-1] "mostly fine bu one error i think" (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443368 (https://phabricator.wikimedia.org/T198592) (owner: 10Volans) [15:56:07] !log starting data trasnfer from wdqs1008 to wdqs1009 - T220830 [15:56:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:11] T220830: data reimport on wdqs1009 and wdqs1010 - https://phabricator.wikimedia.org/T220830 [15:56:53] (03PS16) 10Ema: cache: add profile::cache::varnish::frontend [puppet] - 10https://gerrit.wikimedia.org/r/502833 (https://phabricator.wikimedia.org/T219967) [15:56:55] (03PS4) 10Ema: cache: implement profile::cache::varnish::backend [puppet] - 10https://gerrit.wikimedia.org/r/503381 (https://phabricator.wikimedia.org/T219967) [15:58:49] (03CR) 10Jbond: [C: 03+1] "Thanks for the responses, looks good to me" (032 comments) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443366 (https://phabricator.wikimedia.org/T198592) (owner: 10Volans) [15:58:52] 10Operations, 10SRE-Access-Requests: analytics-wmde group addition for Lucas Werkmeister - https://phabricator.wikimedia.org/T220084 (10Nuria) @RazShuty has user signed NDA? [16:01:55] 10Operations, 10SRE-Access-Requests: analytics-wmde group addition for Lucas Werkmeister - https://phabricator.wikimedia.org/T220084 (10RazShuty) @Nuria : [x] - User has signed the L3 Acknowledgement of Wikimedia Server Access Responsibilities Document. [x] - User has a valid NDA on file with WMF legal. (Thi... [16:02:30] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:03:14] RECOVERY - puppet last run on kafka2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:03:42] RECOVERY - puppet last run on labsdb1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:04:56] RECOVERY - puppet last run on gerrit2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:05:18] RECOVERY - puppet last run on db2095 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:12:30] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T220691 (10RStallman-legalteam) Confirming that NDA has been executed. Thanks! [16:13:16] PROBLEM - keystone admin endpoint port 35357 on labtestcontrol2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:13:30] (03PS17) 10Ema: cache: add profile::cache::varnish::frontend [puppet] - 10https://gerrit.wikimedia.org/r/502833 (https://phabricator.wikimedia.org/T219967) [16:14:27] (03PS5) 10Ema: cache: implement profile::cache::varnish::backend [puppet] - 10https://gerrit.wikimedia.org/r/503381 (https://phabricator.wikimedia.org/T219967) [16:14:41] !log install ifstat on all the mc1* hosts for network bandwidth investigation [16:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:33] how do i find out which nodes are using toollabs::kube2proxy? it's not in https://tools.wmflabs.org/openstack-browser/puppetclass/ and also not elsewhere? 
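For readers unfamiliar with the cookbooks repo being reviewed above (Gerrit change 488256): Spicerack cookbooks of this era are plain Python modules exposing argument_parser() and run(args, spicerack). A minimal sketch of that shape with a placeholder body; the actual wdqs data-transfer logic lives in the change under review and is not reproduced here, and the commands shown are purely illustrative:

```python
"""Skeleton of a function-based Spicerack cookbook (illustrative only)."""
import argparse

__title__ = "Example: transfer data between two hosts"


def argument_parser():
    """Return the parser for this cookbook's CLI arguments."""
    parser = argparse.ArgumentParser(description=__title__)
    parser.add_argument("source", help="FQDN of the source host")
    parser.add_argument("dest", help="FQDN of the destination host")
    return parser


def run(args, spicerack):
    """Entry point called by the cookbook runner; return 0 on success."""
    remote = spicerack.remote()
    source = remote.query(args.source)
    dest = remote.query(args.dest)
    # Placeholder: a real transfer cookbook would stop services, copy the
    # data and restart services; here we only echo on both hosts.
    source.run_sync("echo source ready")
    dest.run_sync("echo destination ready")
    return 0
```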
[16:15:48] RECOVERY - keystone admin endpoint port 35357 on labtestcontrol2003 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 759 bytes in 4.039 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:16:11] (03PS6) 10Ema: cache: implement profile::cache::varnish::backend [puppet] - 10https://gerrit.wikimedia.org/r/503381 (https://phabricator.wikimedia.org/T219967) [16:19:18] (03PS5) 10Andrew Bogott: openstack::puppet::master::encapi: Avoid nginx-apache conflict [puppet] - 10https://gerrit.wikimedia.org/r/501587 (https://phabricator.wikimedia.org/T171188) (owner: 10Alex Monk) [16:20:24] (03CR) 10Andrew Bogott: [C: 03+2] openstack::puppet::master::encapi: Avoid nginx-apache conflict [puppet] - 10https://gerrit.wikimedia.org/r/501587 (https://phabricator.wikimedia.org/T171188) (owner: 10Alex Monk) [16:23:46] PROBLEM - keystone admin endpoint port 35357 on labtestcontrol2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:23:48] 10Operations, 10Wikimedia-Logstash, 10service-runner, 10Core Platform Team Kanban (Done with CPT), and 2 others: Move service-runner to new logging infrastructure - https://phabricator.wikimedia.org/T211125 (10bmansurov) [16:23:52] 10Operations, 10Recommendation-API, 10Wikimedia-Logstash, 10service-runner, and 2 others: Move recommendation-api logging to new logging pipeline - https://phabricator.wikimedia.org/T219926 (10bmansurov) 05Open→03Resolved [16:24:56] RECOVERY - keystone admin endpoint port 35357 on labtestcontrol2003 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 759 bytes in 0.076 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:26:56] RECOVERY - Check systemd state on maps2001 is OK: OK - running: The system is fully operational [16:27:00] (03PS18) 10Ema: cache: add profile::cache::varnish::frontend [puppet] - 10https://gerrit.wikimedia.org/r/502833 (https://phabricator.wikimedia.org/T219967) [16:27:14] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 208 and 0 seconds [16:27:23] (03PS7) 10Ema: cache: implement profile::cache::varnish::backend [puppet] - 10https://gerrit.wikimedia.org/r/503381 (https://phabricator.wikimedia.org/T219967) [16:34:12] PROBLEM - keystone admin endpoint port 35357 on labtestcontrol2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:34:39] 10Operations, 10Maps (Tilerator), 10Patch-For-Review, 10Reading-Infrastructure-Team-Backlog (Kanban): investigate tilerator crash on maps eqiad - https://phabricator.wikimedia.org/T204047 (10MSantos) 05Open→03Resolved [16:35:22] RECOVERY - keystone admin endpoint port 35357 on labtestcontrol2003 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 759 bytes in 0.076 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:35:41] 10Operations, 10Maps, 10Traffic, 10Reading-Infrastructure-Team-Backlog (Kanban): Decide on Cache-Control headers for map tiles - https://phabricator.wikimedia.org/T186732 (10MSantos) 05Open→03Resolved [16:42:14] PROBLEM - Memory correctable errors -EDAC- on thumbor1004 is CRITICAL: 4.001 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops [16:45:19] (03PS2) 10Jbond: debdeploy: add zsh autocompletion script [puppet] - 
10https://gerrit.wikimedia.org/r/503058 [16:46:55] "Repooling thumbor1004 until we replace its memory" [16:46:58] eh.. re or "DE" [16:48:27] mutante: it has not been causing issues [16:48:38] jijiki: you literally just downtimed, right :) [16:48:43] I did [16:48:46] ok:) [16:48:46] (03CR) 10Jbond: "Thanks for the review, Chris, will wait for Moritz to review as he may want this in the package as opposed to puppet" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/503058 (owner: 10Jbond) [16:53:46] PROBLEM - keystone admin endpoint port 35357 on labtestcontrol2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:54:18] (03PS1) 10Paladox: Gerrit: Put