[00:03:20] 10Operations, 10PHP 7.0 support, 10Patch-For-Review, 10Performance-Team (Radar): Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (10Tgr) >>! In T211488#5103305, @Joe wrote: > - `include_path` in php's ini is still set to the value it had in the old t... [00:03:24] urandom: :) np [00:03:45] urandom: i am not entirely sure if we can remove that from restbase::base. it's been added in 2017 i see [00:04:02] we can try and compile on * though [00:07:35] (03PS2) 10Volans: flake8: enforce import order and adopt W504 [software/spicerack] - 10https://gerrit.wikimedia.org/r/503147 [00:12:02] 10Operations, 10ops-codfw: audit all codfw pdu tower draws - https://phabricator.wikimedia.org/T163362 (10Dzahn) duplicate of T163339 ? [00:12:38] (03CR) 10Volans: flake8: enforce import order and adopt W504 (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/503147 (owner: 10Volans) [00:14:53] 10Operations, 10ops-codfw, 10Discovery-Search (Current work): Degraded RAID on elastic2048 - https://phabricator.wikimedia.org/T220038 (10Papaul) 05Open→03Resolved This is done, we can close it. [00:15:20] (03CR) 10Tim Starling: [C: 03+2] profiler: Increase max stack depth for sampling profiler to 250 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503083 (https://phabricator.wikimedia.org/T176916) (owner: 10Krinkle) [00:16:13] (03Merged) 10jenkins-bot: profiler: Increase max stack depth for sampling profiler to 250 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503083 (https://phabricator.wikimedia.org/T176916) (owner: 10Krinkle) [00:21:34] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Setup elasticsearch on new codfw servers - https://phabricator.wikimedia.org/T210265 (10Papaul) [00:21:38] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install elastic203[7-9], elastic204[0-9], elastic205[0-4] - https://phabricator.wikimedia.org/T210450 (10Papaul) 05Open→03Resolved @Gehel We can close this. Thanks [00:22:38] (03PS1) 10Dzahn: restbase::base: remove include passwords::cassandra [puppet] - 10https://gerrit.wikimedia.org/r/503151 [00:24:26] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10netops: switch port configuration for frmon2001 - https://phabricator.wikimedia.org/T196557 (10Papaul) 05Open→03Resolved This is done , it can be close. [00:24:29] 10Operations, 10fundraising-tech-ops, 10Patch-For-Review: rack/setup/install Prometeuse/Grafana host frmon2001 for fr-tech - https://phabricator.wikimedia.org/T196476 (10Papaul) [00:25:28] !log tstarling@deploy1001 Synchronized wmf-config/profiler.php: increase excimer max depth (duration: 00m 53s) [00:25:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:25:43] (03CR) 10jenkins-bot: profiler: Increase max stack depth for sampling profiler to 250 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503083 (https://phabricator.wikimedia.org/T176916) (owner: 10Krinkle) [00:26:55] 10Operations, 10ops-codfw: Degraded RAID on labtestcontrol2003 - https://phabricator.wikimedia.org/T218403 (10Dzahn) This ticket isn't an actual RAID failure. As the output says the check just failed to connect to the host at that time. the RAID check in Icinga says today: OK: Active: 6, Working: 6, Failed:... 
[00:27:05] 10Operations, 10ops-codfw: Degraded RAID on labtestcontrol2003 - https://phabricator.wikimedia.org/T218403 (10Dzahn) 05Open→03Invalid [00:28:50] 10Operations, 10ops-codfw: Degraded RAID on labtestcontrol2003 - https://phabricator.wikimedia.org/T218403 (10Dzahn) also "get_raid_status_md" doesn't exist at that location anymore, but: ` root@labtestcontrol2003:~# sudo /usr/local/lib/nagios/plugins/check_raid OK: Active: 6, Working: 6, Failed: 0, Spare:... [00:32:13] 10Operations, 10ops-codfw, 10monitoring: labtestcontrol2003 - UNKNOWN power supply status - https://phabricator.wikimedia.org/T220783 (10Dzahn) [00:32:20] 10Operations, 10ops-codfw, 10cloud-services-team: Degraded RAID on labtestservices2002 - https://phabricator.wikimedia.org/T218405 (10Papaul) 05Open→03Resolved a:03Papaul This host was renamed to cloudservices2002-dev and reimaged on T220101 and icinga is showing OK: Active: 6, Working: 6, Failed: 0,... [00:34:39] 10Operations, 10ops-codfw, 10DC-Ops: labtestneutron2002: refresh/rename to cloudnet2002-dev - https://phabricator.wikimedia.org/T214370 (10Papaul) a:03Papaul [00:38:48] 10Operations, 10ops-codfw, 10DC-Ops: codfw: rename/relabel labtestneutron2001 to cloudnet2001-dev - https://phabricator.wikimedia.org/T214181 (10Papaul) 05Open→03Resolved a:03Papaul @faidon yes we can resolve this [00:41:34] 10Operations, 10ops-codfw: update physical labels from naos.codfw.wmnet to deploy2001.codfw.wmnet - https://phabricator.wikimedia.org/T195421 (10Papaul) a:03Papaul [00:52:14] (03PS1) 10Dzahn: decom cloudnet2001-dev.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/503152 (https://phabricator.wikimedia.org/T218025) [00:53:26] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission: cloudnet2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T218025 (10Dzahn) per chat with Papaul: - switch port is done - server is > 5 years old and should not go back to spare - removi... [00:56:14] (03CR) 10Dzahn: [C: 03+2] decom cloudnet2001-dev.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/503152 (https://phabricator.wikimedia.org/T218025) (owner: 10Dzahn) [00:57:08] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission: cloudnet2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T218025 (10Dzahn) [01:00:05] !log puppet cert clean, puppet node clean, puppet node deactivate on cloudnet2001-dev.codfw.wmnet (T218025) [01:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:00:09] T218025: decommission: cloudnet2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T218025 [01:17:31] (03PS1) 10Dzahn: add wikibase.org as parked domain [dns] - 10https://gerrit.wikimedia.org/r/503154 [01:19:16] 10Operations, 10netops: Juniper security advisories (April 2019) - https://phabricator.wikimedia.org/T220716 (10ayounsi) 05Open→03Resolved thanks, tl;dr; all good! > 2019-04 Security Bulletin: Junos OS: SRX5000 series: Kernel crash (vmcore) upon receipt of a specific packet on fxp0 interface (CVE-2019-004... [01:56:33] PROBLEM - Free Blazegraph allocators wdqs-blazegraph on wdqs1009 is CRITICAL: cluster=wdqs-test instance=wdqs1009:9193 job=blazegraph site=eqiad https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=32&fullscreen [01:57:21] (03CR) 10BryanDavis: "I have corrected the Title-casing issues for all but 8 '(objectClass=posixaccount)' records in the LDAP directory. 
These 8 all have duplic" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497423 (https://phabricator.wikimedia.org/T165795) (owner: 10BryanDavis) [02:24:22] (03CR) 10MarkAHershberger: Package 1.19.4 with stdeb (033 comments) [debs/python-git-archive-all] - 10https://gerrit.wikimedia.org/r/500201 (owner: 10MarkAHershberger) [02:26:07] (03PS1) 10MarkAHershberger: Package 1.19.4 with stdeb [debs/python-git-archive-all] - 10https://gerrit.wikimedia.org/r/503155 [02:26:42] (03PS1) 10MarkAHershberger: I7e66e85e242f865246474e493bf92846f371ae2a [debs/python-git-archive-all] - 10https://gerrit.wikimedia.org/r/503156 [02:28:10] (03Abandoned) 10MarkAHershberger: Package 1.19.4 with stdeb [debs/python-git-archive-all] - 10https://gerrit.wikimedia.org/r/503155 (owner: 10MarkAHershberger) [02:35:18] (03PS2) 10MarkAHershberger: Address Kunal's concerns with already-merged code [debs/python-git-archive-all] - 10https://gerrit.wikimedia.org/r/503156 [02:41:52] (03PS3) 10MarkAHershberger: Address Kunal's concerns with already-merged code [debs/python-git-archive-all] - 10https://gerrit.wikimedia.org/r/503156 [03:20:07] (03PS4) 10MarkAHershberger: Address Kunal's concerns with already-merged code [debs/python-git-archive-all] - 10https://gerrit.wikimedia.org/r/503156 [03:24:05] (03CR) 10MarkAHershberger: "W00! lintian clean" [debs/python-git-archive-all] - 10https://gerrit.wikimedia.org/r/503156 (owner: 10MarkAHershberger) [04:57:10] 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui) [04:57:30] 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui) db2044 again: ` root@db2044:~# hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380264FFFB0) Port Name: 1I Port Name: 2I Gen8 ServBP... [04:58:08] 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui) [05:02:06] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install (5) codfw dedicated dump slaves - https://phabricator.wikimedia.org/T219463 (10Marostegui) [05:03:32] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install (5) codfw dedicated dump slaves - https://phabricator.wikimedia.org/T219463 (10Marostegui) 05Open→03Resolved All these host are now ready to be productionized at T220572. There is a problem with the controller exposure to the OS w... [05:07:28] (03PS1) 10Marostegui: db2[097|098|099|100|101]: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/503164 (https://phabricator.wikimedia.org/T219463) [05:10:27] (03CR) 10Marostegui: [C: 03+2] db2[097|098|099|100|101]: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/503164 (https://phabricator.wikimedia.org/T219463) (owner: 10Marostegui) [05:38:27] (03CR) 10Vgutierrez: [C: 04-1] "I'd like those checks to be there, as we intend to use this as a staging environment to validate changes before going to production." 
[puppet] - 10https://gerrit.wikimedia.org/r/503122 (owner: 10Dzahn) [06:01:43] 10Operations, 10Mail, 10Patch-For-Review: SPF record for canonical domains - https://phabricator.wikimedia.org/T193408 (10Vgutierrez) [06:01:47] 10Operations, 10DNS, 10Mail, 10Traffic, and 3 others: wikidata.org lacks SPF record - https://phabricator.wikimedia.org/T210134 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez Fixed by T193408 [06:09:11] 10Operations, 10Analytics, 10User-Elukey: Investigate if a Prometheus exporter for the AMD GPU(s) can be easily created - https://phabricator.wikimedia.org/T220784 (10elukey) p:05Triage→03Normal [06:10:26] 10Operations, 10SRE-Access-Requests: Requesting deployment access for santhosh - https://phabricator.wikimedia.org/T220785 (10santhosh) [06:11:03] (03PS1) 10Vgutierrez: Add SPF record for wikisource.org [dns] - 10https://gerrit.wikimedia.org/r/503165 (https://phabricator.wikimedia.org/T193408) [06:11:21] (03CR) 10Vgutierrez: [C: 03+1] Add SPF record for wikisource.org [dns] - 10https://gerrit.wikimedia.org/r/503165 (https://phabricator.wikimedia.org/T193408) (owner: 10Vgutierrez) [06:13:02] (03PS1) 10KartikMistry: Add santhosh to deploy-service [puppet] - 10https://gerrit.wikimedia.org/r/503167 (https://phabricator.wikimedia.org/T220785) [06:16:21] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting deployment access for santhosh - https://phabricator.wikimedia.org/T220785 (10Arrbee) This is an approved request for Santhosh. Thanks. [06:20:10] (03PS1) 10Vgutierrez: Add SPF record for wikibooks.org [dns] - 10https://gerrit.wikimedia.org/r/503177 (https://phabricator.wikimedia.org/T193408) [06:21:00] (03CR) 10Vgutierrez: [C: 03+1] Add SPF record for wikibooks.org [dns] - 10https://gerrit.wikimedia.org/r/503177 (https://phabricator.wikimedia.org/T193408) (owner: 10Vgutierrez) [06:24:47] 10Operations: HP Gen9 onboard controller review - https://phabricator.wikimedia.org/T216175 (10Marostegui) Heh...HP decided to rename the tool and on the Gen10, @MoritzMuehlenhoff found it (T220572#5106204): ` HPE renamed the tool, I installed "ssacli" and now "ssacli controller all show config" works fine. ` [06:26:54] 10Operations, 10DNS, 10Traffic: Add SPF record for non-canonical domains that are not parked - https://phabricator.wikimedia.org/T220786 (10Vgutierrez) [06:27:04] 10Operations, 10DNS, 10Traffic: Add SPF record for non-canonical domains that are not parked - https://phabricator.wikimedia.org/T220786 (10Vgutierrez) p:05Triage→03Normal [06:31:47] 10Operations, 10DNS, 10Traffic: Add SPF record for non-canonical domains that are not parked - https://phabricator.wikimedia.org/T220786 (10Vgutierrez) [06:31:51] 10Operations, 10Cloud-VPS, 10DNS, 10Mail, and 3 others: Set SPF (... 
-all) for toolserver.org - https://phabricator.wikimedia.org/T131930 (10Vgutierrez) [06:33:11] (03PS1) 10Vgutierrez: Add SPF records for non-canonical non-parked domains [dns] - 10https://gerrit.wikimedia.org/r/503258 (https://phabricator.wikimedia.org/T220786) [06:35:43] (03CR) 10Vgutierrez: [C: 03+1] "For those domains which have MX records set to something different than mx[12]001.wm.o or gmail, I've added "mx" to their SPF record" [dns] - 10https://gerrit.wikimedia.org/r/503258 (https://phabricator.wikimedia.org/T220786) (owner: 10Vgutierrez) [06:37:45] 10Operations: Sort out which RAID packages are still needed - https://phabricator.wikimedia.org/T216043 (10Marostegui) [06:38:53] 10Operations, 10Icinga, 10monitoring: Fix RAID handler alert to work with Gen10 hosts and ssacli tool - https://phabricator.wikimedia.org/T220787 (10Marostegui) [06:46:04] (03PS1) 10Muehlenhoff: Sync ssacli from the HPE repository [puppet] - 10https://gerrit.wikimedia.org/r/503261 (https://phabricator.wikimedia.org/T220787) [06:49:45] (03CR) 10Marostegui: [C: 03+1] Sync ssacli from the HPE repository [puppet] - 10https://gerrit.wikimedia.org/r/503261 (https://phabricator.wikimedia.org/T220787) (owner: 10Muehlenhoff) [06:57:52] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Fix RAID handler alert and puppet facter to work with Gen10 hosts and ssacli tool - https://phabricator.wikimedia.org/T220787 (10Marostegui) [07:00:10] (03CR) 10Muehlenhoff: [C: 03+2] Sync ssacli from the HPE repository [puppet] - 10https://gerrit.wikimedia.org/r/503261 (https://phabricator.wikimedia.org/T220787) (owner: 10Muehlenhoff) [07:04:45] !log synced ssacli to thirdparty/hwraid components for jessie/stretch T220787 [07:04:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:04:50] T220787: Fix RAID handler alert and puppet facter to work with Gen10 hosts and ssacli tool - https://phabricator.wikimedia.org/T220787 [07:09:16] 10Operations, 10Analytics, 10EventBus, 10monitoring, and 3 others: Upgrade statsd_exporter to 0.9 - https://phabricator.wikimedia.org/T220709 (10akosiaris) p:05Triage→03Normal [07:12:24] !log Manually install ssacli on db2[097|098|099|100|101|102] T220787 T220572 [07:12:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:29] T220572: Productionize eqiad and codfw source backup hosts & codfw backup test host - https://phabricator.wikimedia.org/T220572 [07:12:30] T220787: Fix RAID handler alert and puppet facter to work with Gen10 hosts and ssacli tool - https://phabricator.wikimedia.org/T220787 [07:13:28] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Fix RAID handler alert and puppet facter to work with Gen10 hosts and ssacli tool - https://phabricator.wikimedia.org/T220787 (10MoritzMuehlenhoff) We need to extend the "raid" fact in modules/raid/lib/facter/raid.rb to also detect the Gen10 control... 
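For context on the Gen10 work just above: the fact that needs extending lives in modules/raid/lib/facter/raid.rb and is Ruby, so the following is only a language-agnostic sketch (written in Python) of the detection logic MoritzMuehlenhoff describes. It assumes the controller is visible in `lspci` output as a "Smart Array" device and that Gen10 hosts carry the renamed `ssacli` binary while older generations still ship `hpssacli`; none of the names below are taken from the real fact.
```python
import shutil
import subprocess


def detect_hp_raid_tool():
    """Sketch of the Gen10/ssacli detection discussed in T220787.

    Assumptions (not from the actual raid fact): the controller shows up
    in `lspci` as a "Smart Array" device, and the CLI is `ssacli` on Gen10
    hosts but `hpssacli` on older generations.
    """
    lspci = subprocess.run(["lspci"], capture_output=True, text=True).stdout
    if "Smart Array" not in lspci:
        return None  # no HP/HPE hardware RAID controller detected
    for tool in ("ssacli", "hpssacli"):  # prefer the renamed Gen10 tool
        if shutil.which(tool):
            return tool
    return None  # controller present but no CLI installed yet
```
The real change lands later in the log as jbond's patches 503332–503334 against the raid module.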
[07:23:51] (03PS3) 10Arturo Borrero Gonzalez: striker: factor out common code to a shared profile [puppet] - 10https://gerrit.wikimedia.org/r/502472 [07:24:36] (03PS3) 10Filippo Giunchedi: aptrepo: reflow and sort distributions-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/503013 [07:24:53] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] striker: factor out common code to a shared profile [puppet] - 10https://gerrit.wikimedia.org/r/502472 (owner: 10Arturo Borrero Gonzalez) [07:25:21] (03CR) 10Filippo Giunchedi: [C: 03+2] aptrepo: reflow and sort distributions-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/503013 (owner: 10Filippo Giunchedi) [07:25:40] (03PS4) 10Filippo Giunchedi: aptrepo: reflow and sort distributions-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/503013 [07:27:29] (03PS5) 10Arturo Borrero Gonzalez: Toolforge: cleanup unused puppet code [puppet] - 10https://gerrit.wikimedia.org/r/503035 (https://phabricator.wikimedia.org/T219362) [07:27:49] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Toolforge: cleanup unused puppet code [puppet] - 10https://gerrit.wikimedia.org/r/503035 (https://phabricator.wikimedia.org/T219362) (owner: 10Arturo Borrero Gonzalez) [07:32:55] (03PS1) 10Muehlenhoff: Update the source distro for the HPE thirdparty suite [puppet] - 10https://gerrit.wikimedia.org/r/503264 (https://phabricator.wikimedia.org/T220787) [07:33:49] (03CR) 10Marostegui: [C: 03+1] Update the source distro for the HPE thirdparty suite [puppet] - 10https://gerrit.wikimedia.org/r/503264 (https://phabricator.wikimedia.org/T220787) (owner: 10Muehlenhoff) [07:35:38] (03CR) 10Muehlenhoff: [C: 03+2] Update the source distro for the HPE thirdparty suite [puppet] - 10https://gerrit.wikimedia.org/r/503264 (https://phabricator.wikimedia.org/T220787) (owner: 10Muehlenhoff) [07:37:26] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/503116 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [07:38:17] (03CR) 10Filippo Giunchedi: [C: 03+2] aptrepo: add component/elastalert [puppet] - 10https://gerrit.wikimedia.org/r/503014 (https://phabricator.wikimedia.org/T213933) (owner: 10Filippo Giunchedi) [07:38:26] (03PS3) 10Filippo Giunchedi: aptrepo: add component/elastalert [puppet] - 10https://gerrit.wikimedia.org/r/503014 (https://phabricator.wikimedia.org/T213933) [07:40:01] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/503123 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [07:40:54] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "minor comments" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/503123 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [07:41:52] (03CR) 10Alexandros Kosiaris: [C: 03+2] ores: use hiera for statsd host [puppet] - 10https://gerrit.wikimedia.org/r/499875 (https://phabricator.wikimedia.org/T218567) (owner: 10Ladsgroup) [07:41:59] (03PS8) 10Alexandros Kosiaris: ores: use hiera for statsd host [puppet] - 10https://gerrit.wikimedia.org/r/499875 (https://phabricator.wikimedia.org/T218567) (owner: 10Ladsgroup) [07:42:01] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/503119 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [07:42:23] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/503117 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [07:43:00] (03PS1) 10Muehlenhoff: Remove support for trusty in two Prometheus 
exporters [puppet] - 10https://gerrit.wikimedia.org/r/503265 [07:43:40] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db2102.codfw.wmnet as a testing host for codfw backups - https://phabricator.wikimedia.org/T219461 (10jcrespo) @robh @faidon Re: T219461#5103942 I wonder if we should document this stop as one to do for these models. The sda/sdb renam... [07:47:54] (03CR) 10Arturo Borrero Gonzalez: "Did you check puppet catalog compiler for labnet/labcontrol servers?" [puppet] - 10https://gerrit.wikimedia.org/r/502991 (owner: 10Alex Monk) [07:54:59] (03PS19) 10Daimona Eaytoy: Move all AbuseFilter config to abusefilter.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477063 (https://phabricator.wikimedia.org/T145931) [07:55:08] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db2102.codfw.wmnet as a testing host for codfw backups - https://phabricator.wikimedia.org/T219461 (10MoritzMuehlenhoff) >>! In T219461#5106335, @jcrespo wrote: > @robh @faidon Re: T219461#5103942 I wonder if we should document this s... [07:55:33] (03PS1) 10Elukey: oozie: override the oozie-setup script [puppet/cdh] - 10https://gerrit.wikimedia.org/r/503266 (https://phabricator.wikimedia.org/T218343) [07:55:49] (03PS20) 10Daimona Eaytoy: Move all AbuseFilter config to abusefilter.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477063 (https://phabricator.wikimedia.org/T145931) [07:56:31] (03PS20) 10Daimona Eaytoy: Update AbuseFilter config to keep the status quo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475772 [07:57:27] (03CR) 10jerkins-bot: [V: 04-1] Update AbuseFilter config to keep the status quo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475772 (owner: 10Daimona Eaytoy) [07:58:14] (03PS21) 10Daimona Eaytoy: Update AbuseFilter config to keep the status quo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475772 [07:58:56] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db2102.codfw.wmnet as a testing host for codfw backups - https://phabricator.wikimedia.org/T219461 (10jcrespo) @MoritzMuehlenhoff, just guessing, but I am assuming it is a chassis "bundled" SD card reader, not something we have bought... [07:59:09] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db2102.codfw.wmnet as a testing host for codfw backups - https://phabricator.wikimedia.org/T219461 (10Marostegui) >>! In T219461#5106371, @MoritzMuehlenhoff wrote: >>>! In T219461#5106335, @jcrespo wrote: >> @robh @faidon Re: T219461#... [08:02:25] !log updated ssacli in thirdparty/hwraid component for stretch to 3.30-13.0 T220787 [08:02:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:29] T220787: Fix RAID handler alert and puppet facter to work with Gen10 hosts and ssacli tool - https://phabricator.wikimedia.org/T220787 [08:04:05] 10Operations: HP Gen9 onboard controller review - https://phabricator.wikimedia.org/T216175 (10Marostegui) [08:04:07] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Fix RAID handler alert and puppet facter to work with Gen10 hosts and ssacli tool - https://phabricator.wikimedia.org/T220787 (10Marostegui) [08:09:38] (03CR) 10Alexandros Kosiaris: "PCC quite happy at https://integration.wikimedia.org/ci/view/Ops/job/operations-puppet-catalog-compiler/15718/console, merging." 
[puppet] - 10https://gerrit.wikimedia.org/r/502499 (owner: 10Alex Monk) [08:09:46] (03PS4) 10Alexandros Kosiaris: Move maintenance_hosts out of network::constants into hieradata [puppet] - 10https://gerrit.wikimedia.org/r/502499 (owner: 10Alex Monk) [08:09:53] (03CR) 10Alexandros Kosiaris: [C: 03+2] Move maintenance_hosts out of network::constants into hieradata [puppet] - 10https://gerrit.wikimedia.org/r/502499 (owner: 10Alex Monk) [08:14:49] (03PS3) 10Alexandros Kosiaris: swift-rw: Mock it as a geo-resource [dns] - 10https://gerrit.wikimedia.org/r/502453 (https://phabricator.wikimedia.org/T204245) [08:14:54] (03CR) 10Alexandros Kosiaris: [C: 03+2] swift-rw: Mock it as a geo-resource [dns] - 10https://gerrit.wikimedia.org/r/502453 (https://phabricator.wikimedia.org/T204245) (owner: 10Alexandros Kosiaris) [08:22:51] (03PS2) 10Alexandros Kosiaris: Move bastion_hosts out of network::constants into hieradata [puppet] - 10https://gerrit.wikimedia.org/r/502607 (owner: 10Alex Monk) [08:24:12] (03CR) 10Alexandros Kosiaris: "I went ahead and rebase this one since I broke the chain in Ie0ff7f3fc383251acabc5eb8e49d719a627e17b3" [puppet] - 10https://gerrit.wikimedia.org/r/502607 (owner: 10Alex Monk) [08:24:25] (03CR) 10jerkins-bot: [V: 04-1] Move bastion_hosts out of network::constants into hieradata [puppet] - 10https://gerrit.wikimedia.org/r/502607 (owner: 10Alex Monk) [08:25:07] (03CR) 10Alexandros Kosiaris: [C: 03+1] "+1ed, will merge after https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/502607/1 is merged" [puppet] - 10https://gerrit.wikimedia.org/r/502612 (owner: 10Alex Monk) [08:33:19] (03CR) 10Alexandros Kosiaris: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/502607 (owner: 10Alex Monk) [08:34:06] (03CR) 10jerkins-bot: [V: 04-1] Move bastion_hosts out of network::constants into hieradata [puppet] - 10https://gerrit.wikimedia.org/r/502607 (owner: 10Alex Monk) [08:35:26] !log gilles@deploy1001 Synchronized php-1.33.0-wmf.25/extensions/NavigationTiming/modules/ext.navigationTiming.js: T220788 Fix veaction === null case (duration: 00m 54s) [08:35:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:30] T220788: NavigationTiming probably broken in 1.33.0-wmf.25 - https://phabricator.wikimedia.org/T220788 [08:56:27] (03PS1) 10Ladsgroup: Add Western Armenian Wikipedia to wmf-config/InterwikiSortOrders.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503270 (https://phabricator.wikimedia.org/T219871) [08:56:41] 10Operations, 10ops-codfw, 10decommission, 10cloud-services-team (Kanban): decommission: labtestcontrol2001.wikimedia.org - https://phabricator.wikimedia.org/T218021 (10aborrero) a:03aborrero [09:00:12] (03PS1) 10Arturo Borrero Gonzalez: openstack: codfw1dev: use DBs hosted at clouddb2001-dev.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/503272 (https://phabricator.wikimedia.org/T220096) [09:01:23] (03PS1) 10Alexandros Kosiaris: Revert "swift-rw: Mock it as a geo-resource" [dns] - 10https://gerrit.wikimedia.org/r/503273 [09:02:09] (03CR) 10Alexandros Kosiaris: [C: 03+2] Revert "swift-rw: Mock it as a geo-resource" [dns] - 10https://gerrit.wikimedia.org/r/503273 (owner: 10Alexandros Kosiaris) [09:03:09] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "pcc as expected: https://puppet-compiler.wmflabs.org/compiler1002/15721/" [puppet] - 10https://gerrit.wikimedia.org/r/503272 (https://phabricator.wikimedia.org/T220096) (owner: 10Arturo Borrero Gonzalez) [09:03:14] (03PS1) 10Alexandros Kosiaris: Add a new swift.discovery.wmnet resource 
[dns] - 10https://gerrit.wikimedia.org/r/503274 (https://phabricator.wikimedia.org/T204245) [09:03:43] (03CR) 10Alexandros Kosiaris: [C: 03+2] Add a new swift.discovery.wmnet resource [dns] - 10https://gerrit.wikimedia.org/r/503274 (https://phabricator.wikimedia.org/T204245) (owner: 10Alexandros Kosiaris) [09:05:52] 10Operations, 10ops-codfw, 10decommission, 10cloud-services-team (Kanban): decommission: labtestcontrol2001.wikimedia.org - https://phabricator.wikimedia.org/T218021 (10aborrero) [09:05:57] !log T218021 disable icinga checks for labtestcontrol2001 [09:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:01] T218021: decommission: labtestcontrol2001.wikimedia.org - https://phabricator.wikimedia.org/T218021 [09:07:45] !log added the wikimedia repository key to the stretch build chroot on boron, fixes builds using the PHP72/SPICERACK hooks [09:07:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:20] (03PS3) 10Alexandros Kosiaris: Move bastion_hosts out of network::constants into hieradata [puppet] - 10https://gerrit.wikimedia.org/r/502607 (owner: 10Alex Monk) [09:10:25] 10Operations, 10Wikimedia-Mailing-lists: Change ownership of wikimania-program@lists.wikimedia.org - https://phabricator.wikimedia.org/T220641 (10fgiunchedi) a:05fgiunchedi→03ICueva This is done, please let us know if you need a new password for the list as well! [09:13:39] (03PS2) 10Effie Mouzeli: Expose haproxy total request time via mtail [puppet] - 10https://gerrit.wikimedia.org/r/502972 (https://phabricator.wikimedia.org/T220499) (owner: 10Gilles) [09:14:05] (03CR) 10jerkins-bot: [V: 04-1] Expose haproxy total request time via mtail [puppet] - 10https://gerrit.wikimedia.org/r/502972 (https://phabricator.wikimedia.org/T220499) (owner: 10Gilles) [09:17:14] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Fix RAID handler alert and puppet facter to work with Gen10 hosts and ssacli tool - https://phabricator.wikimedia.org/T220787 (10Volans) In addition io T220787#5106275, from the top of my head I think we need also: - check if the DSA script we're us... [09:17:41] (03PS3) 10Effie Mouzeli: Expose haproxy total request time via mtail [puppet] - 10https://gerrit.wikimedia.org/r/502972 (https://phabricator.wikimedia.org/T220499) (owner: 10Gilles) [09:18:13] (03CR) 10jerkins-bot: [V: 04-1] Expose haproxy total request time via mtail [puppet] - 10https://gerrit.wikimedia.org/r/502972 (https://phabricator.wikimedia.org/T220499) (owner: 10Gilles) [09:20:55] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Fix RAID handler alert and puppet facter to work with Gen10 hosts and ssacli tool - https://phabricator.wikimedia.org/T220787 (10MoritzMuehlenhoff) >>! In T220787#5106465, @Volans wrote: > In addition io T220787#5106275, from the top of my head I th... 
[09:25:01] (03PS4) 10Effie Mouzeli: Expose haproxy total request time via mtail [puppet] - 10https://gerrit.wikimedia.org/r/502972 (https://phabricator.wikimedia.org/T220499) (owner: 10Gilles) [09:25:31] (03CR) 10jerkins-bot: [V: 04-1] Expose haproxy total request time via mtail [puppet] - 10https://gerrit.wikimedia.org/r/502972 (https://phabricator.wikimedia.org/T220499) (owner: 10Gilles) [09:30:34] !log reset mgmt card on labtestcontrol2003 - T220783 [09:30:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:42] T220783: labtestcontrol2003 - UNKNOWN power supply status - https://phabricator.wikimedia.org/T220783 [09:33:40] 10Operations, 10ops-codfw: Degraded RAID on labtestcontrol2003 - https://phabricator.wikimedia.org/T218403 (10Volans) [09:33:44] 10Operations, 10ops-codfw, 10monitoring: labtestcontrol2003 - UNKNOWN power supply status - https://phabricator.wikimedia.org/T220783 (10Volans) 05Open→03Resolved p:05Triage→03Normal a:03Volans I've reset the mgmt card (see https://wikitech.wikimedia.org/wiki/Management_Interfaces#Reset_the_managem... [09:34:07] (03CR) 10Elukey: "Thanks a lot for the work!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/502511 (owner: 10Muehlenhoff) [09:36:17] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Fix RAID handler alert and puppet facter to work with Gen10 hosts and ssacli tool - https://phabricator.wikimedia.org/T220787 (10jcrespo) This is slightly offtopic, but there is a bit of overlap between the -SMART- checks and the RAID (Megacli/HP) o... [09:37:19] !log updated mwdebug1002 to php-wikidiff 1.8.1 [09:37:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:53] (03PS4) 10Jcrespo: transfer.py: Allow for a 3rd transfer type: decompression [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/500043 (https://phabricator.wikimedia.org/T219631) [09:42:55] (03PS5) 10Jcrespo: mariadb: Allow new option --stop-slave for xtrabackup transfers [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/502828 (https://phabricator.wikimedia.org/T206203) [09:43:03] RECOVERY - Check systemd state on cloudcontrol2001-dev is OK: OK - running: The system is fully operational [09:43:17] (03CR) 10jerkins-bot: [V: 04-1] transfer.py: Allow for a 3rd transfer type: decompression [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/500043 (https://phabricator.wikimedia.org/T219631) (owner: 10Jcrespo) [09:43:20] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Allow new option --stop-slave for xtrabackup transfers [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/502828 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [09:46:23] 10Operations, 10Analytics, 10User-Elukey: Investigate if a Prometheus exporter for the AMD GPU(s) can be easily created - https://phabricator.wikimedia.org/T220784 (10fgiunchedi) +1, something that parses the json and write metrics in text format for node-exporter to pick up sounds good to me [09:51:57] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: Traceback (most recent call last): https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [09:52:28] (03PS1) 10Arturo Borrero Gonzalez: openstack: keystone: stretch: use python-openssl from stretch [puppet] - 10https://gerrit.wikimedia.org/r/503279 (https://phabricator.wikimedia.org/T215407) [09:52:53] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: Traceback (most recent call last): 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [09:53:28] !log updated mwdebug1001 to php-wikidiff 1.8.1 [09:53:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:37] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: Traceback (most recent call last): https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [09:55:07] (03PS2) 10Arturo Borrero Gonzalez: openstack: keystone: stretch: use python-openssl from stretch [puppet] - 10https://gerrit.wikimedia.org/r/503279 (https://phabricator.wikimedia.org/T215407) [09:55:27] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: Traceback (most recent call last): https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [09:55:33] (03PS5) 10Effie Mouzeli: Expose haproxy total request time via mtail [puppet] - 10https://gerrit.wikimedia.org/r/502972 (https://phabricator.wikimedia.org/T220499) (owner: 10Gilles) [09:55:35] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: Traceback (most recent call last): https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [09:56:13] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: Traceback (most recent call last): https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [09:56:29] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: keystone: stretch: use python-openssl from stretch [puppet] - 10https://gerrit.wikimedia.org/r/503279 (https://phabricator.wikimedia.org/T215407) (owner: 10Arturo Borrero Gonzalez) [10:00:16] (03PS7) 10Jcrespo: mariadb-snapshots: Setup full daily snapshots for all codfw sections [puppet] - 10https://gerrit.wikimedia.org/r/500980 (https://phabricator.wikimedia.org/T206203) [10:00:18] (03PS5) 10Jcrespo: mariadb-snapshots: Stop replication during transfer [puppet] - 10https://gerrit.wikimedia.org/r/501546 (https://phabricator.wikimedia.org/T206203) [10:00:28] 10Operations, 10Maps, 10Reading-Infrastructure-Team-Backlog, 10Patch-For-Review: migrate maps servers to stretch with the current style - https://phabricator.wikimedia.org/T198622 (10Mathew.onipe) This task is now complete and the lessons learnt have been documented here: https://wikitech.wikimedia.org/wik... [10:00:33] what can we do about the ripe-atlas alerts? [10:01:09] 10Operations, 10Maps, 10Reading-Infrastructure-Team-Backlog, 10Patch-For-Review: migrate maps servers to stretch with the current style - https://phabricator.wikimedia.org/T198622 (10Mathew.onipe) 05Open→03Resolved [10:01:16] appservers.svc.codfw.wmnet flapped briefly as well [10:02:18] although CRITICAL: Traceback (most recent call last) [10:03:43] PROBLEM - puppet last run on cloudcontrol2001-dev is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Package[debmonitor-client],File[/etc/apt/preferences.d/mitaka_stretch_nojessiebpo.pref] [10:03:44] (03CR) 10Gilles: [C: 03+1] Expose haproxy total request time via mtail [puppet] - 10https://gerrit.wikimedia.org/r/502972 (https://phabricator.wikimedia.org/T220499) (owner: 10Gilles) [10:09:51] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: Traceback (most recent call last): https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [10:10:09] 10Operations, 10DBA, 10Patch-For-Review, 10codfw-rollout: [RFC] improve parsercache replication and sharding handling - https://phabricator.wikimedia.org/T133523 (10jcrespo) A bit of a recap on the original questions: * Parsercache keys are renamed to pc1, pc2, pc3 at: T210725 * Parsercaches are write-wri... [10:11:06] (03PS1) 10Arturo Borrero Gonzalez: openstack: virt: reallocate libssl1.0.0 package exclusion [puppet] - 10https://gerrit.wikimedia.org/r/503284 (https://phabricator.wikimedia.org/T215407) [10:11:11] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: Traceback (most recent call last): https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [10:13:04] !log matomo updated to 3.9.1 on matomo1001 + deb upload to wikimedia-stretch - T218037 [10:13:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:08] T218037: Upgrade matomo1001 to latest upstream - https://phabricator.wikimedia.org/T218037 [10:13:51] (03CR) 10Effie Mouzeli: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/15723/ looks ok" [puppet] - 10https://gerrit.wikimedia.org/r/502972 (https://phabricator.wikimedia.org/T220499) (owner: 10Gilles) [10:15:06] (03CR) 10Effie Mouzeli: [C: 03+2] Expose haproxy total request time via mtail [puppet] - 10https://gerrit.wikimedia.org/r/502972 (https://phabricator.wikimedia.org/T220499) (owner: 10Gilles) [10:16:22] (03CR) 10Alexandros Kosiaris: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/15722/ LGTM, merging" [puppet] - 10https://gerrit.wikimedia.org/r/502607 (owner: 10Alex Monk) [10:16:30] (03PS4) 10Alexandros Kosiaris: Move bastion_hosts out of network::constants into hieradata [puppet] - 10https://gerrit.wikimedia.org/r/502607 (owner: 10Alex Monk) [10:17:57] (03CR) 10Elukey: "LGTM, left a nit for variable naming :)" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/502972 (https://phabricator.wikimedia.org/T220499) (owner: 10Gilles) [10:21:10] (03PS2) 10Arturo Borrero Gonzalez: openstack: virt: reallocate libssl1.0.0 package exclusion [puppet] - 10https://gerrit.wikimedia.org/r/503284 (https://phabricator.wikimedia.org/T215407) [10:21:59] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "pcc as expected: https://puppet-compiler.wmflabs.org/compiler1002/15727/" [puppet] - 10https://gerrit.wikimedia.org/r/503284 (https://phabricator.wikimedia.org/T215407) (owner: 10Arturo Borrero Gonzalez) [10:23:00] (03CR) 10Marostegui: "I would deploy this on a single host with puppet disabled on the rest, reload haproxy and all that and just look at the graphs just in cas" [puppet] - 10https://gerrit.wikimedia.org/r/502972 (https://phabricator.wikimedia.org/T220499) (owner: 10Gilles) [10:27:33] (03CR) 10Alexandros Kosiaris: [C: 04-2] "Yesterday's alerts weren't (aren't?) spam though. 
This is an actual problem, with a manifestation at the kubelet operation latencies level" [puppet] - 10https://gerrit.wikimedia.org/r/503079 (https://phabricator.wikimedia.org/T219556) (owner: 10CDanis) [10:30:07] RECOVERY - puppet last run on cloudcontrol2001-dev is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [10:32:15] (03CR) 10Alexandros Kosiaris: [C: 04-2] "Perhaps we should re-engineer a bit these alerts to distinguish between the various operation types. For example we could exclude from the" [puppet] - 10https://gerrit.wikimedia.org/r/503079 (https://phabricator.wikimedia.org/T219556) (owner: 10CDanis) [10:32:51] !log T219626 reimaging cloudcontrol2001-dev [10:32:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:55] T219626: codfw1dev: bootstrap cloudcontrol servers in mitaka/stretch - https://phabricator.wikimedia.org/T219626 [10:39:31] (03PS1) 10Arturo Borrero Gonzalez: openstack: nova: compute: mitaka: stretch: refresh comment about libss1.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/503291 [10:40:21] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: nova: compute: mitaka: stretch: refresh comment about libss1.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/503291 (owner: 10Arturo Borrero Gonzalez) [10:44:49] (03PS1) 10Arturo Borrero Gonzalez: labtestcontrol2001: decommision [puppet] - 10https://gerrit.wikimedia.org/r/503296 (https://phabricator.wikimedia.org/T218021) [10:45:23] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] labtestcontrol2001: decommision [puppet] - 10https://gerrit.wikimedia.org/r/503296 (https://phabricator.wikimedia.org/T218021) (owner: 10Arturo Borrero Gonzalez) [10:46:19] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission: labtestcontrol2001.wikimedia.org - https://phabricator.wikimedia.org/T218021 (10aborrero) 05Stalled→03Open a:05aborrero→03RobH [10:47:12] 10Operations, 10Patch-For-Review, 10cloud-services-team (Kanban): Track remaining trusty servers in production - https://phabricator.wikimedia.org/T212772 (10aborrero) [10:49:20] 10Operations, 10DBA, 10Patch-For-Review, 10codfw-rollout: [RFC] improve parsercache replication and sharding handling - https://phabricator.wikimedia.org/T133523 (10jcrespo) 3 additional items/proposals regarding purging: * Smarter purging- something maybe priority queue based, while respecting TTL, not s... [10:50:33] (03PS2) 10Arturo Borrero Gonzalez: labtestweb2001: decommission [puppet] - 10https://gerrit.wikimedia.org/r/502966 (https://phabricator.wikimedia.org/T218024) [10:51:17] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): decommmision: labtestweb2001.wikimedia.org - https://phabricator.wikimedia.org/T218024 (10aborrero) [10:54:22] RECOVERY - puppet last run on labtestcontrol2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:57:30] PROBLEM - Check systemd state on cloudnet2003-dev is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[10:57:34] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 15 probes of 405 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [10:57:40] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 4 probes of 444 (alerts on 35) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [10:57:40] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 5 probes of 444 (alerts on 35) - https://atlas.ripe.net/measurements/1791210/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [10:58:00] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 13 probes of 405 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [10:58:08] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 4 probes of 444 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [10:58:58] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 13 probes of 405 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [10:59:31] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] labtestweb2001: decommission [puppet] - 10https://gerrit.wikimedia.org/r/502966 (https://phabricator.wikimedia.org/T218024) (owner: 10Arturo Borrero Gonzalez) [10:59:50] (03PS16) 10Mathew.onipe: Add wdqs data transfer cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/488256 (https://phabricator.wikimedia.org/T213401) [11:00:14] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 5 probes of 444 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:00:21] (03CR) 10Mathew.onipe: Add wdqs data transfer cookbook (0315 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/488256 (https://phabricator.wikimedia.org/T213401) (owner: 10Mathew.onipe) [11:00:34] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 13 probes of 405 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:00:57] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): decommmision: labtestweb2001.wikimedia.org - https://phabricator.wikimedia.org/T218024 (10aborrero) a:05aborrero→03RobH [11:01:46] PROBLEM - Free Blazegraph allocators wdqs-blazegraph on wdqs1009 is CRITICAL: cluster=wdqs-test instance=wdqs1009:9193 job=blazegraph site=eqiad https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=32&fullscreen [11:02:40] * gehel is looking at those allocators, data reimport coming up soon [11:07:06] PROBLEM - Check systemd state on labtestcontrol2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:07:50] PROBLEM - puppet last run on cp1077 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file] [11:07:58] PROBLEM - puppet last run on cp1075 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file] [11:08:10] PROBLEM - puppet last run on cp1089 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file] [11:08:56] taking a look at those puppet failures [11:09:18] PROBLEM - keystone admin endpoint port 35357 on labtestcontrol2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [11:10:17] !log installing Java security updates on remaining maps hosts [11:10:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:27] Name of VCL object, 'cloudweb2001-dev', contains illegal charac [11:10:28] ter '-' [11:11:26] PROBLEM - puppet last run on cp1079 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file] [11:12:34] PROBLEM - puppet last run on cp1087 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file] [11:12:48] let's see if I can fix it [11:15:18] (03PS1) 10Filippo Giunchedi: hieradata: fix VCL illegal character for director [puppet] - 10https://gerrit.wikimedia.org/r/503312 [11:16:14] PROBLEM - puppet last run on cp1081 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file] [11:16:18] (03PS1) 10Arturo Borrero Gonzalez: cloudweb2001-dev: add IPv6 [dns] - 10https://gerrit.wikimedia.org/r/503313 (https://phabricator.wikimedia.org/T220426) [11:16:42] arturo: FYI https://gerrit.wikimedia.org/r/c/operations/puppet/+/503312 [11:16:50] * arturo looking [11:16:59] or anyone else really, if available for a quick review [11:17:25] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] hieradata: fix VCL illegal character for director [puppet] - 10https://gerrit.wikimedia.org/r/503312 (owner: 10Filippo Giunchedi) [11:17:33] godog: +1 [11:18:04] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudweb2001-dev: add IPv6 [dns] - 10https://gerrit.wikimedia.org/r/503313 (https://phabricator.wikimedia.org/T220426) (owner: 10Arturo Borrero Gonzalez) [11:18:15] arturo: thanks! 
[11:18:25] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: fix VCL illegal character for director [puppet] - 10https://gerrit.wikimedia.org/r/503312 (owner: 10Filippo Giunchedi) [11:18:28] sorry for the mess [11:18:47] np, easy enough to fix [11:19:48] !log installed Java security updates on relforge* hosts [11:19:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:13] RECOVERY - puppet last run on cp1075 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:23:27] RECOVERY - puppet last run on cp1089 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:24:23] RECOVERY - keystone admin endpoint port 35357 on labtestcontrol2003 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 759 bytes in 0.075 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [11:27:45] 10Operations, 10Prod-Kubernetes, 10Kubernetes: Re-evaluate kubelet operation latencies alerts - https://phabricator.wikimedia.org/T220808 (10akosiaris) [11:29:39] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: Traceback (most recent call last): https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:30:41] RECOVERY - Check systemd state on cloudnet2003-dev is OK: OK - running: The system is fully operational [11:31:07] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: Traceback (most recent call last): https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:31:35] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: Traceback (most recent call last): https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:31:41] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: Traceback (most recent call last): https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:31:55] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: Traceback (most recent call last): https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:32:09] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: Traceback (most recent call last): https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:32:10] (03PS1) 10Gilles: Oversample navtiming on cawiki and commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503317 (https://phabricator.wikimedia.org/T220807) [11:32:19] PROBLEM - keystone admin endpoint port 35357 on labtestcontrol2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [11:33:11] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: Traceback (most recent call last): https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:33:21] RECOVERY - keystone admin endpoint port 35357 on labtestcontrol2003 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 759 bytes in 2.839 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [11:33:27] RECOVERY - puppet last run on cp1077 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [11:34:11] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:34:19] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: Traceback (most recent call last): 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:34:19] (03CR) 10Gilles: [C: 03+2] Oversample navtiming on cawiki and commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503317 (https://phabricator.wikimedia.org/T220807) (owner: 10Gilles) [11:35:19] (03Merged) 10jenkins-bot: Oversample navtiming on cawiki and commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503317 (https://phabricator.wikimedia.org/T220807) (owner: 10Gilles) [11:37:04] !log reindexing Greek, Turkish, and Irish wikis on elastic@eqiad and elastic@codfw complete (T217806) [11:37:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:08] T217806: Reindex Greek, Turkish, and Irish wikis to keep lang-specific lowercasing & enable empty-token filtering (Greek) - https://phabricator.wikimedia.org/T217806 [11:37:09] PROBLEM - keystone admin endpoint port 35357 on labtestcontrol2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [11:40:57] RECOVERY - keystone admin endpoint port 35357 on labtestcontrol2003 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 759 bytes in 8.776 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [11:41:08] (03CR) 10jenkins-bot: Oversample navtiming on cawiki and commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503317 (https://phabricator.wikimedia.org/T220807) (owner: 10Gilles) [11:42:23] RECOVERY - puppet last run on cp1079 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:43:31] RECOVERY - puppet last run on cp1087 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:44:08] !log gilles@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T220807 Oversample navtiming on cawiki and commonswiki (duration: 05m 14s) [11:44:10] thx godog <3 [11:44:49] PROBLEM - keystone admin endpoint port 35357 on labtestcontrol2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [11:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:53] T220807: Alert on group1 canary wikis navtiming report rate - https://phabricator.wikimedia.org/T220807 [11:46:40] !log upgrading acmechief hosts to latest buster state [11:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:59] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 5 probes of 444 (alerts on 35) - https://atlas.ripe.net/measurements/1791210/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:47:01] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 13 probes of 405 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:47:01] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 5 probes of 444 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:47:03] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 4 probes of 444 (alerts on 35) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:47:11] RECOVERY - puppet last run on cp1081 is OK: OK: Puppet is currently enabled, last run 4 
minutes ago with 0 failures [11:47:35] 10Operations, 10Prod-Kubernetes, 10Kubernetes: Re-evaluate kubelet operation latencies alerts - https://phabricator.wikimedia.org/T220808 (10akosiaris) p:05Triage→03High Today (2019-04-12), I 've raised the possibility that T220661 is related to the reason these alerts are flapping so much. Judging fro... [11:47:59] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 13 probes of 405 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:48:05] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:48:08] (03CR) 10Alexandros Kosiaris: [C: 04-2] "https://phabricator.wikimedia.org/T220808 FWIW" [puppet] - 10https://gerrit.wikimedia.org/r/503079 (https://phabricator.wikimedia.org/T219556) (owner: 10CDanis) [11:49:11] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 5 probes of 444 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:49:41] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 14 probes of 405 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:49:57] RECOVERY - keystone admin endpoint port 35357 on labtestcontrol2003 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 759 bytes in 4.821 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [11:51:09] (03PS1) 10Arturo Borrero Gonzalez: clouddb2001-dev: include ferm configuration [puppet] - 10https://gerrit.wikimedia.org/r/503323 (https://phabricator.wikimedia.org/T220096) [11:51:13] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 14 probes of 405 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:51:58] (03CR) 10jerkins-bot: [V: 04-1] clouddb2001-dev: include ferm configuration [puppet] - 10https://gerrit.wikimedia.org/r/503323 (https://phabricator.wikimedia.org/T220096) (owner: 10Arturo Borrero Gonzalez) [11:52:52] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC as expected https://puppet-compiler.wmflabs.org/compiler1002/15728/clouddb2001-dev.codfw.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/503323 (https://phabricator.wikimedia.org/T220096) (owner: 10Arturo Borrero Gonzalez) [11:54:46] (03PS2) 10Arturo Borrero Gonzalez: clouddb2001-dev: include ferm configuration [puppet] - 10https://gerrit.wikimedia.org/r/503323 (https://phabricator.wikimedia.org/T220096) [11:55:41] (03CR) 10jerkins-bot: [V: 04-1] clouddb2001-dev: include ferm configuration [puppet] - 10https://gerrit.wikimedia.org/r/503323 (https://phabricator.wikimedia.org/T220096) (owner: 10Arturo Borrero Gonzalez) [11:56:43] (03PS3) 10Arturo Borrero Gonzalez: clouddb2001-dev: include ferm configuration [puppet] - 10https://gerrit.wikimedia.org/r/503323 (https://phabricator.wikimedia.org/T220096) [11:56:49] !log upgrading app server canaries to version 1.8.1 of the PHP wikidiff extension (HHVM already deployed) T203069 [11:56:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:53] T203069: Deploy wikidiff2 v1.8.1 with changed signature - https://phabricator.wikimedia.org/T203069 [11:57:54] 
(03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "pcc as expeccted: https://puppet-compiler.wmflabs.org/compiler1002/15730/clouddb2001-dev.codfw.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/503323 (https://phabricator.wikimedia.org/T220096) (owner: 10Arturo Borrero Gonzalez) [12:00:22] (03PS1) 10Arturo Borrero Gonzalez: cloudcontrol2001-dev: drop local keystone db [puppet] - 10https://gerrit.wikimedia.org/r/503327 (https://phabricator.wikimedia.org/T219626) [12:05:34] (03PS1) 10Gilles: Reduce cawiki survey sampling rate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503328 (https://phabricator.wikimedia.org/T220807) [12:05:52] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudcontrol2001-dev: drop local keystone db [puppet] - 10https://gerrit.wikimedia.org/r/503327 (https://phabricator.wikimedia.org/T219626) (owner: 10Arturo Borrero Gonzalez) [12:07:33] (03CR) 10Gilles: [C: 03+2] Reduce cawiki survey sampling rate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503328 (https://phabricator.wikimedia.org/T220807) (owner: 10Gilles) [12:08:45] (03Merged) 10jenkins-bot: Reduce cawiki survey sampling rate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503328 (https://phabricator.wikimedia.org/T220807) (owner: 10Gilles) [12:11:13] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [12:13:11] PROBLEM - keystone admin endpoint port 35357 on labtestcontrol2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [12:14:45] (03CR) 10jenkins-bot: Reduce cawiki survey sampling rate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503328 (https://phabricator.wikimedia.org/T220807) (owner: 10Gilles) [12:16:03] !log gilles@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T220807 Reduce cawiki survey sampling rate (duration: 05m 11s) [12:16:07] (03PS1) 10Jbond: ssacli: update raid fact to detect Gen10 devices [puppet] - 10https://gerrit.wikimedia.org/r/503332 (https://phabricator.wikimedia.org/T220787) [12:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:09] (03PS1) 10Jbond: raid: refactor structure [puppet] - 10https://gerrit.wikimedia.org/r/503333 (https://phabricator.wikimedia.org/T220787) [12:16:09] T220807: Alert on group1 canary wikis navtiming report rate - https://phabricator.wikimedia.org/T220807 [12:16:11] (03PS1) 10Jbond: raid: add ssacli class [puppet] - 10https://gerrit.wikimedia.org/r/503334 (https://phabricator.wikimedia.org/T220787) [12:16:21] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [12:16:45] (03CR) 10jerkins-bot: [V: 04-1] raid: add ssacli class [puppet] - 10https://gerrit.wikimedia.org/r/503334 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [12:16:54] (03CR) 10jerkins-bot: [V: 04-1] ssacli: update raid fact to detect Gen10 devices [puppet] - 10https://gerrit.wikimedia.org/r/503332 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [12:16:57] (03CR) 10jerkins-bot: [V: 04-1] raid: refactor structure [puppet] - 10https://gerrit.wikimedia.org/r/503333 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [12:16:59] RECOVERY - keystone admin endpoint port 35357 on labtestcontrol2003 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 759 bytes in 5.858 
second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [12:18:11] (03PS1) 10Muehlenhoff: Enable profile::base::firewall for profile::openstack::codfw1dev::db [puppet] - 10https://gerrit.wikimedia.org/r/503335 [12:18:50] (03PS2) 10Jbond: ssacli: update raid fact to detect Gen10 devices [puppet] - 10https://gerrit.wikimedia.org/r/503332 (https://phabricator.wikimedia.org/T220787) [12:19:33] (03PS3) 10Jbond: ssacli: update raid fact to detect Gen10 devices [puppet] - 10https://gerrit.wikimedia.org/r/503332 (https://phabricator.wikimedia.org/T220787) [12:19:56] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Enable profile::base::firewall for profile::openstack::codfw1dev::db [puppet] - 10https://gerrit.wikimedia.org/r/503335 (owner: 10Muehlenhoff) [12:20:39] (03PS4) 10Jbond: ssacli: update raid fact to detect Gen10 devices [puppet] - 10https://gerrit.wikimedia.org/r/503332 (https://phabricator.wikimedia.org/T220787) [12:20:41] (03PS2) 10Jbond: raid: refactor structure [puppet] - 10https://gerrit.wikimedia.org/r/503333 (https://phabricator.wikimedia.org/T220787) [12:20:43] (03PS2) 10Jbond: raid: add ssacli class [puppet] - 10https://gerrit.wikimedia.org/r/503334 (https://phabricator.wikimedia.org/T220787) [12:21:17] (03CR) 10jerkins-bot: [V: 04-1] raid: add ssacli class [puppet] - 10https://gerrit.wikimedia.org/r/503334 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [12:21:23] (03CR) 10jerkins-bot: [V: 04-1] ssacli: update raid fact to detect Gen10 devices [puppet] - 10https://gerrit.wikimedia.org/r/503332 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [12:21:26] (03CR) 10jerkins-bot: [V: 04-1] raid: refactor structure [puppet] - 10https://gerrit.wikimedia.org/r/503333 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [12:21:33] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [12:22:15] PROBLEM - keystone admin endpoint port 35357 on labtestcontrol2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [12:22:19] !log T220095 disable icinga checks for labtestcontrol2003 [12:22:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:22] T220095: labtestcontrol2003: rename to cloudcontrol2003-dev - https://phabricator.wikimedia.org/T220095 [12:23:31] (03PS3) 10Jbond: raid: add ssacli class [puppet] - 10https://gerrit.wikimedia.org/r/503334 (https://phabricator.wikimedia.org/T220787) [12:24:08] (03CR) 10jerkins-bot: [V: 04-1] raid: add ssacli class [puppet] - 10https://gerrit.wikimedia.org/r/503334 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [12:25:59] RECOVERY - keystone admin endpoint port 35357 on labtestcontrol2003 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 759 bytes in 0.963 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [12:26:14] (03Abandoned) 10Revi: Add SPF record for wikidata.org [dns] - 10https://gerrit.wikimedia.org/r/477034 (https://phabricator.wikimedia.org/T210134) (owner: 10Revi) [12:28:43] (03CR) 10Vgutierrez: [C: 03+1] varnish/trafficserver: add regex to cover www.wikiba.se as well [puppet] - 10https://gerrit.wikimedia.org/r/500715 (https://phabricator.wikimedia.org/T99531) (owner: 10Dzahn) [12:31:21] PROBLEM - keystone admin endpoint port 35357 on labtestcontrol2003 is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [12:31:53] (03PS3) 10Jbond: raid: refactor structure [puppet] - 10https://gerrit.wikimedia.org/r/503333 (https://phabricator.wikimedia.org/T220787) [12:32:40] 10Operations, 10Prod-Kubernetes, 10Kubernetes: Re-evaluate kubelet operation latencies alerts - https://phabricator.wikimedia.org/T220808 (10akosiaris) A breakdown of the alerts per host follows starting from 2019-03-26 to 2019-04-12 follows ` 89 instance=kubernetes2001 84 instance=kubernetes1001... [12:35:07] RECOVERY - keystone admin endpoint port 35357 on labtestcontrol2003 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 759 bytes in 0.075 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [12:39:53] (03CR) 10Gehel: [C: 04-1] "A few minor comments..." (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/488256 (https://phabricator.wikimedia.org/T213401) (owner: 10Mathew.onipe) [12:44:23] PROBLEM - keystone admin endpoint port 35357 on labtestcontrol2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [12:45:31] RECOVERY - keystone admin endpoint port 35357 on labtestcontrol2003 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 759 bytes in 0.076 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [12:45:40] 10Operations, 10Prod-Kubernetes, 10Kubernetes: Re-evaluate kubelet operation latencies alerts - https://phabricator.wikimedia.org/T220808 (10akosiaris) I am thinking about excluding `exec_sync` operations for a while from the checks to restore faith in the alerts. [12:45:46] (03PS4) 10Jbond: raid: refactor structure [puppet] - 10https://gerrit.wikimedia.org/r/503333 (https://phabricator.wikimedia.org/T220787) [12:46:41] (03PS15) 10Ema: role::cache::upload_ats: mixed Varnish/ATS setup [puppet] - 10https://gerrit.wikimedia.org/r/501360 (https://phabricator.wikimedia.org/T219967) [12:49:15] PROBLEM - keystone admin endpoint port 35357 on labtestcontrol2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [12:49:46] !log rolling restart of cassandra on maps* for jvm upgrade [12:49:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:05] RECOVERY - keystone admin endpoint port 35357 on labtestcontrol2003 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 759 bytes in 0.076 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [12:50:45] PROBLEM - Check systemd state on clouddb2001-dev is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:51:04] PROBLEM - mysqld processes on clouddb2001-dev is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [12:51:20] is that in use? [12:51:30] seriously that is paging? 
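For context on the clouddb2001-dev alerts just above ([12:50]–[12:51]): the host was still being set up, and the firewall changes merged shortly before ([11:52] clouddb2001-dev ferm configuration, [12:19] profile::base::firewall for profile::openstack::codfw1dev::db) follow the usual pattern of including the base firewall profile in the role and opening the service ports with ferm::service rules. The sketch below is illustrative only and is not the content of changes 503323/503335; the rule name, port and srange are assumptions.

```puppet
# Illustrative sketch only -- not the actual diff of 503323/503335.
class profile::openstack::codfw1dev::db {
    # Default-deny host firewall managed via ferm.
    include profile::base::firewall

    # Open the database port to the appropriate networks; the rule name,
    # port and srange here are placeholders, not the real values.
    ferm::service { 'clouddb-mariadb':
        proto  => 'tcp',
        port   => '3306',
        srange => '$DOMAIN_NETWORKS',
    }
}
```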
[12:51:35] * apergos looks in [12:52:06] <_joe_> arturo: you tell me [12:52:07] arturo: I asked cloud to fix that last time, and in the sre meeting [12:52:15] <_joe_> :) [12:52:29] <_joe_> arturo: I think you have all your hosts paging by default [12:52:36] sigh [12:52:53] <_joe_> but well, this is a single service [12:52:56] <_joe_> so it shouldn't [12:53:01] <_joe_> it's clearly unintended [12:53:26] (03PS2) 10Alexandros Kosiaris: Move bastion_hosts out of ferm macros [puppet] - 10https://gerrit.wikimedia.org/r/502612 (owner: 10Alex Monk) [12:53:29] !log decommissioning cassandra-c, restbase2008 -- T208087 [12:53:30] (03PS4) 10Jbond: raid: add ssacli class [puppet] - 10https://gerrit.wikimedia.org/r/503334 (https://phabricator.wikimedia.org/T220787) [12:53:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:32] T208087: Replace remaining Samsung SSDs - https://phabricator.wikimedia.org/T208087 [12:54:05] <_joe_> arturo: anyhow, nothing to see, move along? [12:54:06] ACKNOWLEDGEMENT - Check systemd state on clouddb2001-dev is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Arturo Borrero Gonzalez this shouldnt page. We are working on this server. [12:54:07] ACKNOWLEDGEMENT - mysqld processes on clouddb2001-dev is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld Arturo Borrero Gonzalez this shouldnt page. We are working on this server. https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [12:54:19] _joe_ I belive all cloud databases may be paging, because they just copied the production hosts [12:54:34] <_joe_> arturo: may I suggest to downtime that host for now? [12:54:50] thanks for the ack in any case [12:54:52] _joe_: sure, also I disabled all the checks [12:54:59] <_joe_> thanks :) [12:55:06] (03PS1) 10Lucas Werkmeister (WMDE): Remove constraint-suggestions beta feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503342 (https://phabricator.wikimedia.org/T220609) [12:56:18] I also suggested to use notifications_enabled: 0 for all hosts that are being setup [12:57:18] !log Purge old rows and optimize tables on spare host pc1010 T210725 [12:57:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:22] T210725: Replace parsercache keys to something more meaningful on db-XXXX.php - https://phabricator.wikimedia.org/T210725 [12:59:23] (03PS10) 10Ema: cache: add profile::cache::varnish::frontend [puppet] - 10https://gerrit.wikimedia.org/r/502833 (https://phabricator.wikimedia.org/T219967) [12:59:29] (03PS1) 10Alexandros Kosiaris: Omit exec_sync operations from kubelet alerts [puppet] - 10https://gerrit.wikimedia.org/r/503344 (https://phabricator.wikimedia.org/T220808) [12:59:47] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-2] "DNM yet (see T209879)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503342 (https://phabricator.wikimedia.org/T220609) (owner: 10Lucas Werkmeister (WMDE)) [13:00:23] (03PS16) 10Ema: role::cache::upload_ats: mixed Varnish/ATS setup [puppet] - 10https://gerrit.wikimedia.org/r/501360 (https://phabricator.wikimedia.org/T219967) [13:00:31] (03CR) 10Alexandros Kosiaris: [C: 04-2] "https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/503344 for a different approach" [puppet] - 10https://gerrit.wikimedia.org/r/503079 (https://phabricator.wikimedia.org/T219556) (owner: 10CDanis) [13:02:05] (03PS1) 10Arturo Borrero Gonzalez: codfw1dev: disable paging for all hosts [puppet] - 10https://gerrit.wikimedia.org/r/503345 [13:02:18] (03CR) 
10Marostegui: raid: add ssacli class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/503334 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [13:03:18] _joe_: hope this prevents any more noise from those servers https://gerrit.wikimedia.org/r/c/operations/puppet/+/503345 [13:04:01] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T220691 (10Rosalie_WMDE) Recieved and Signed. Thanks [13:04:08] <_joe_> arturo: you don't need to get them in irc either? [13:04:16] nop [13:04:19] 10Operations, 10ops-eqiad, 10DC-Ops, 10Data-Services: labsdb1009.mgmt down - https://phabricator.wikimedia.org/T218789 (10jcrespo) Any updates? This would block any serious maintenance on the host. [13:04:29] (at least, not in the next 2 months) [13:04:55] 10Operations, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Re-evaluate kubelet operation latencies alerts - https://phabricator.wikimedia.org/T220808 (10akosiaris) >>! In T220808#5107076, @akosiaris wrote: > I am thinking about excluding `exec_sync` operations for a while from the checks to restore... [13:06:15] (03CR) 10Alexandros Kosiaris: [C: 03+2] Omit exec_sync operations from kubelet alerts [puppet] - 10https://gerrit.wikimedia.org/r/503344 (https://phabricator.wikimedia.org/T220808) (owner: 10Alexandros Kosiaris) [13:06:44] (03PS2) 10Arturo Borrero Gonzalez: codfw1dev: disable paging for all hosts [puppet] - 10https://gerrit.wikimedia.org/r/503345 [13:08:00] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] codfw1dev: disable paging for all hosts [puppet] - 10https://gerrit.wikimedia.org/r/503345 (owner: 10Arturo Borrero Gonzalez) [13:09:35] RECOVERY - kubelet operational latencies on kubernetes2004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [13:10:18] 10Operations, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Re-evaluate kubelet operation latencies alerts - https://phabricator.wikimedia.org/T220808 (10akosiaris) p:05High→03Low Change merged and shepherded into production. I am lowering priority but not resolving as we probably want to evalua... 
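To make the exec_sync change above concrete: the kubelet latency checks are Icinga checks driven by Prometheus queries, so "omitting exec_sync operations" amounts to adding a label matcher that filters that operation type out of the query. The resource below is a hedged sketch, not the actual content of change 503344; the metric name, thresholds and query shape are placeholders.

```puppet
# Hedged sketch of excluding exec_sync from a kubelet latency alert.
# Metric name, thresholds and query shape are assumptions, not the real check.
monitoring::check_prometheus { 'kubelet-operational-latencies':
    description     => 'kubelet operational latencies',
    # The operation_type matcher is the interesting part: exec_sync is
    # excluded so its known-slow operations stop flapping the alert.
    query           => 'kubelet_runtime_operations_latency_microseconds{operation_type!="exec_sync"}',
    warning         => 400000,
    critical        => 850000,
    prometheus_url  => "http://prometheus.svc.${::site}.wmnet/k8s",
    dashboard_links => ['https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1'],
}
```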
[13:10:38] (03PS17) 10Ema: role::cache::upload_ats: mixed Varnish/ATS setup [puppet] - 10https://gerrit.wikimedia.org/r/501360 (https://phabricator.wikimedia.org/T219967) [13:10:59] (03PS5) 10Jbond: raid: add ssacli class [puppet] - 10https://gerrit.wikimedia.org/r/503334 (https://phabricator.wikimedia.org/T220787) [13:12:36] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/503069 (owner: 10Bearloga) [13:13:07] PROBLEM - keystone admin endpoint port 35357 on labtestcontrol2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:13:18] (03CR) 10Gehel: flake8: enforce import order and adopt W504 (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/503147 (owner: 10Volans) [13:13:20] (03CR) 10Krinkle: profile::mediawiki::php: tweak ini settings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/502986 (https://phabricator.wikimedia.org/T211488) (owner: 10Giuseppe Lavagetto) [13:13:35] (03PS5) 10Jbond: ssacli: update raid fact to detect Gen10 devices [puppet] - 10https://gerrit.wikimedia.org/r/503332 (https://phabricator.wikimedia.org/T220787) [13:14:06] ACKNOWLEDGEMENT - Check systemd state on labtestcontrol2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Arturo Borrero Gonzalez WIP [13:14:06] ACKNOWLEDGEMENT - keystone admin endpoint port 35357 on labtestcontrol2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Arturo Borrero Gonzalez WIP https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:14:24] (03PS5) 10Jbond: raid: refactor structure [puppet] - 10https://gerrit.wikimedia.org/r/503333 (https://phabricator.wikimedia.org/T220787) [13:14:34] (03PS6) 10Jbond: raid: add ssacli class [puppet] - 10https://gerrit.wikimedia.org/r/503334 (https://phabricator.wikimedia.org/T220787) [13:15:13] 10Operations, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Re-evaluate kubelet operation latencies alerts - https://phabricator.wikimedia.org/T220808 (10akosiaris) a:03akosiaris [13:15:35] RECOVERY - keystone admin endpoint port 35357 on labtestcontrol2003 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 759 bytes in 0.075 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:15:43] (03PS3) 10Alexandros Kosiaris: Move bastion_hosts out of ferm macros [puppet] - 10https://gerrit.wikimedia.org/r/502612 (owner: 10Alex Monk) [13:15:45] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Move bastion_hosts out of ferm macros [puppet] - 10https://gerrit.wikimedia.org/r/502612 (owner: 10Alex Monk) [13:16:39] PROBLEM - puppet last run on krypton is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:17:22] (03CR) 10Jbond: raid: add ssacli class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/503334 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [13:17:41] ah dammit [13:17:55] wave of puppet alerts incoming, disabling puppet across the fleet [13:18:52] you need any help akosiaris ? [13:18:59] !log disable puppet across the fleet to avoid incoming puppet alert storm [13:19:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:12] cdanis: niah, thanks, easy enough to fix, some parenthesis forgotten [13:19:13] PROBLEM - Check systemd state on logstash1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
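On the codfw1dev paging question settled above ([12:51]–[13:08]): rather than touching individual checks, "disable paging for all hosts" is normally done by overriding monitoring data in hiera for the affected hosts, along the lines of the notifications_enabled: 0 suggestion at [12:56]. The sketch below is a hedged illustration, not change 503345; the hiera key and the notifications_enabled parameter are assumptions, and the real monitoring module's interface may differ.

```puppet
# Hedged sketch, not the diff of 503345. Key and parameter names are
# assumptions; the real monitoring module's interface may differ.
class profile::example::icinga_host (
    # Hosts that are still being set up override this to '0' in their hieradata
    # so they stop notifying (and paging) until they are actually in service.
    String $notifications_enabled = lookup('notifications_enabled', { 'default_value' => '1' }),
) {
    monitoring::host { $facts['networking']['fqdn']:
        notifications_enabled => $notifications_enabled,
    }
}
```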
[13:19:23] PROBLEM - Check systemd state on mw1250 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:19:35] PROBLEM - Check systemd state on dns5001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:19:35] PROBLEM - Check systemd state on db1069 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:19:43] PROBLEM - Check systemd state on mw1343 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:19:43] PROBLEM - Check systemd state on mw1342 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:19:53] PROBLEM - Check systemd state on poolcounter1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:20:01] PROBLEM - Check systemd state on archiva1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:20:03] PROBLEM - Check systemd state on aqs1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:20:11] PROBLEM - Check systemd state on ms-be2013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:20:11] PROBLEM - Check systemd state on es2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:20:15] PROBLEM - Check systemd state on mw1306 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:20:17] PROBLEM - Check systemd state on wdqs2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:20:19] PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:20:23] PROBLEM - Check systemd state on mc2027 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:20:27] PROBLEM - Check systemd state on wtp1043 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:20:33] PROBLEM - Check systemd state on helium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:20:33] PROBLEM - Check systemd state on elastic1038 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:20:33] PROBLEM - Check systemd state on acmechief2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:20:35] PROBLEM - Check systemd state on mw1336 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:20:37] PROBLEM - Check systemd state on mc1036 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:20:37] PROBLEM - Check systemd state on mw1244 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:20:39] PROBLEM - Check systemd state on db2073 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:20:41] PROBLEM - Check systemd state on mw2232 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:20:41] PROBLEM - Check systemd state on thumbor2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[13:20:43] PROBLEM - Check systemd state on es2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:20:43] PROBLEM - Check systemd state on mw2284 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:20:47] PROBLEM - Check systemd state on mw2198 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:20:49] PROBLEM - Check systemd state on cloudelastic1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:20:51] PROBLEM - Check systemd state on wtp1047 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:20:53] PROBLEM - Check systemd state on ganeti1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:20:55] PROBLEM - Check systemd state on mw1324 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:20:55] PROBLEM - Check systemd state on db2088 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:20:57] PROBLEM - Check systemd state on wtp1041 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:03] PROBLEM - Check systemd state on ms-be1037 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:07] PROBLEM - Check systemd state on ms-be2022 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:09] PROBLEM - Check systemd state on mw2205 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:11] PROBLEM - Check systemd state on mw1347 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:13] PROBLEM - Check systemd state on mw2263 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:13] PROBLEM - Check systemd state on wtp1025 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:15] PROBLEM - Check systemd state on mw2260 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:15] PROBLEM - Check systemd state on mwdebug2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:17] PROBLEM - Check systemd state on kubetcd2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:19] PROBLEM - Check systemd state on db1085 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:20] patch coming up [13:21:21] PROBLEM - Check systemd state on krypton is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:23] PROBLEM - Check systemd state on install2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:23] PROBLEM - Check systemd state on kubernetes2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:25] PROBLEM - Check systemd state on mc2024 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:25] PROBLEM - Check systemd state on ms-be2018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[13:21:31] PROBLEM - Check systemd state on db2043 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:31] PROBLEM - Check systemd state on mw1270 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:31] PROBLEM - Check systemd state on kafka-jumbo1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:31] PROBLEM - Check systemd state on ores1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:31] (03PS1) 10Alexandros Kosiaris: Add missing parentheses in ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/503348 [13:21:35] PROBLEM - Check systemd state on dbstore1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:37] PROBLEM - Check systemd state on wtp1034 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:41] PROBLEM - Check systemd state on elastic2040 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:45] PROBLEM - Check systemd state on graphite1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:47] PROBLEM - Check systemd state on dns5002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:51] (03PS1) 10Esanders: mwgrep: Include JSON files in search [puppet] - 10https://gerrit.wikimedia.org/r/503349 [13:21:55] PROBLEM - Check systemd state on relforge1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:55] PROBLEM - Check systemd state on multatuli is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:55] PROBLEM - Check systemd state on wdqs1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:57] PROBLEM - Check systemd state on elastic1036 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:59] PROBLEM - Check systemd state on maps1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:21:59] PROBLEM - Check systemd state on mw1261 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:22:01] PROBLEM - Check systemd state on mw2175 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:22:01] PROBLEM - Check systemd state on mw1258 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:22:03] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:22:03] (03PS11) 10Ema: cache: add profile::cache::varnish::frontend [puppet] - 10https://gerrit.wikimedia.org/r/502833 (https://phabricator.wikimedia.org/T219967) [13:22:03] PROBLEM - Check systemd state on analytics1072 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:22:05] (03PS18) 10Ema: role::cache::upload_ats: mixed Varnish/ATS setup [puppet] - 10https://gerrit.wikimedia.org/r/501360 (https://phabricator.wikimedia.org/T219967) [13:22:05] PROBLEM - Check systemd state on analytics1057 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[13:22:05] PROBLEM - Check systemd state on mw2187 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:22:07] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Add missing parentheses in ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/503348 (owner: 10Alexandros Kosiaris) [13:22:09] PROBLEM - Check systemd state on ms-be2016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:22:11] PROBLEM - Check systemd state on analytics1056 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:22:15] PROBLEM - Check systemd state on restbase2014 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:22:15] PROBLEM - Check systemd state on wtp1045 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:22:19] PROBLEM - Check systemd state on ms-fe2008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:22:21] PROBLEM - Check systemd state on mw1334 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:22:21] PROBLEM - Check systemd state on dns2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:22:23] PROBLEM - Check systemd state on db2080 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:22:29] PROBLEM - Check systemd state on db1090 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:22:31] PROBLEM - Check systemd state on mw2139 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:22:35] PROBLEM - Check systemd state on mw2180 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:22:37] sigh, I wasn't fast enough to avoid most of this [13:22:37] PROBLEM - Check systemd state on analytics1046 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:22:39] PROBLEM - Check systemd state on ms-be2048 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:22:39] PROBLEM - Check systemd state on wtp2018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:22:41] PROBLEM - Check systemd state on aqs1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:22:43] PROBLEM - Check systemd state on elastic2047 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:22:43] PROBLEM - Check systemd state on analytics1049 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:22:43] PROBLEM - Check systemd state on db1078 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:22:43] PROBLEM - Check systemd state on rdb1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:22:55] PROBLEM - Check systemd state on es1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:22:57] PROBLEM - Check systemd state on torrelay1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:23:05] PROBLEM - Check systemd state on analytics1062 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
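The puppet failure driving the "Check systemd state" storm above (and continuing below) is a ferm one: when a list of hosts is resolved into a rule, ferm needs the set wrapped in parentheses, and a rule generated without them fails ferm's config check, leaving ferm.service failed on every host that picked up the bad config. Below is a hedged sketch of that pitfall with made-up names; it is not the literal diff of change 503348.

```puppet
# Hedged sketch of the ferm parentheses pitfall; rule and variable names are
# made up and this is not the literal content of change 503348.
$bastion_hosts = ['bast1002.wikimedia.org', 'bast2002.wikimedia.org']

# BROKEN (for illustration): without wrapping parentheses, the resolved
# space-separated host list is not a valid ferm set, the generated config
# fails validation, and ferm.service goes down fleet-wide:
#   rule => "proto tcp dport ssh saddr @resolve(${join($bastion_hosts, ' ')}) ACCEPT;",

# FIXED: wrap the list so ferm parses it as a single set of addresses.
ferm::rule { 'bastion-ssh-example':
    rule => "proto tcp dport ssh saddr (@resolve((${join($bastion_hosts, ' ')}))) ACCEPT;",
}
```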
[13:23:05] PROBLEM - Check systemd state on cloudservices1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:23:07] PROBLEM - Check systemd state on mw1317 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:23:09] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:23:13] PROBLEM - Check systemd state on ms-be1024 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:23:17] PROBLEM - Check systemd state on mw2142 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:23:23] PROBLEM - Check systemd state on elastic2029 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:23:25] PROBLEM - Check systemd state on elastic2032 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:23:27] PROBLEM - Check systemd state on wtp2008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:23:29] PROBLEM - Check systemd state on ms-be2024 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:23:35] PROBLEM - Check systemd state on wtp1027 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:23:39] PROBLEM - Check systemd state on wtp2015 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:24:03] PROBLEM - Check systemd state on mw2239 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:24:07] PROBLEM - Check systemd state on mw2171 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:24:29] RECOVERY - Check systemd state on mw1317 is OK: OK - running: The system is fully operational [13:24:56] !log re-enable puppet across the fleet. Patch merged, recovery storm coming [13:24:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:05] PROBLEM - Check systemd state on ms-be2031 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:26:26] seems like I was only fast enough to stop 90% of that storm [13:27:10] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Fix RAID handler alert and puppet facter to work with Gen10 hosts and ssacli tool - https://phabricator.wikimedia.org/T220787 (10jbond) I have created a series of changes starting with [[ https://gerrit.wikimedia.org/r/503332 | 503332 ]] which adds... [13:27:19] RECOVERY - puppet last run on krypton is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [13:27:59] RECOVERY - Check systemd state on krypton is OK: OK - running: The system is fully operational [13:28:29] PROBLEM - puppet last run on acmechief2001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Service[ferm] [13:30:42] 10Operations, 10Puppet: Add a CI check for the use of hiera() function - https://phabricator.wikimedia.org/T220820 (10MoritzMuehlenhoff) [13:30:49] 10Operations, 10Puppet: Add a CI check for the use of hiera() function - https://phabricator.wikimedia.org/T220820 (10MoritzMuehlenhoff) p:05Triage→03Low [13:31:35] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10serviceops, and 4 others: Add security sensitive nodes to our kubernetes cluster - https://phabricator.wikimedia.org/T220821 (10akosiaris) [13:34:09] 10Operations, 10vm-requests: Site: 4 VM request for kubernetes - https://phabricator.wikimedia.org/T220822 (10akosiaris) [13:35:31] RECOVERY - Check systemd state on labtestcontrol2003 is OK: OK - running: The system is fully operational [13:36:16] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10serviceops, and 4 others: Add security sensitive nodes to our kubernetes cluster - https://phabricator.wikimedia.org/T220821 (10akosiaris) p:05Triage→03Normal [13:36:21] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10serviceops, and 4 others: Add security sensitive nodes to our kubernetes cluster - https://phabricator.wikimedia.org/T220821 (10akosiaris) [13:36:23] RECOVERY - Check systemd state on acmechief2001 is OK: OK - running: The system is fully operational [13:36:25] 10Operations, 10vm-requests: Site: 4 VM request for kubernetes - https://phabricator.wikimedia.org/T220822 (10akosiaris) [13:37:46] (03PS2) 10Alex Monk: Move caches out of network::constants into hieradata [puppet] - 10https://gerrit.wikimedia.org/r/502630 [13:39:03] RECOVERY - puppet last run on acmechief2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:39:47] 10Operations, 10vm-requests: Site: 4 VM request for kubernetes - https://phabricator.wikimedia.org/T220822 (10akosiaris) p:05Triage→03Normal [13:40:27] (03CR) 10Krinkle: [C: 03+1] "I'd recommend including 'Gadgets-definition' by default as well, that would allow us to make it a one-stop-shop for 99% of cases regarding" [puppet] - 10https://gerrit.wikimedia.org/r/503349 (owner: 10Esanders) [13:41:23] (03PS12) 10Ema: cache: add profile::cache::varnish::frontend [puppet] - 10https://gerrit.wikimedia.org/r/502833 (https://phabricator.wikimedia.org/T219967) [13:41:25] (03PS19) 10Ema: role::cache::upload_ats: mixed Varnish/ATS setup [puppet] - 10https://gerrit.wikimedia.org/r/501360 (https://phabricator.wikimedia.org/T219967) [13:46:48] RECOVERY - Check systemd state on db2073 is OK: OK - running: The system is fully operational [13:46:50] RECOVERY - Check systemd state on mw1250 is OK: OK - running: The system is fully operational [13:46:50] RECOVERY - Check systemd state on mw2232 is OK: OK - running: The system is fully operational [13:46:56] RECOVERY - Check systemd state on cloudelastic1002 is OK: OK - running: The system is fully operational [13:47:00] RECOVERY - Check systemd state on db1069 is OK: OK - running: The system is fully operational [13:47:02] RECOVERY - Check systemd state on mw1324 is OK: OK - running: The system is fully operational [13:47:16] RECOVERY - Check systemd state on dns5001 is OK: OK - running: The system is fully operational [13:47:18] RECOVERY - Check systemd state on poolcounter1001 is OK: OK - running: The system is fully operational [13:47:24] RECOVERY - Check systemd state on mw2263 is OK: OK - running: The system is fully operational [13:47:28] akosiaris, monitoring_hosts and deployment_hosts 
are looking more complicated... monitoring_hosts gets used in standard::ntp which comes from ::standard which seems to be included in over 200 different places [13:47:30] RECOVERY - Check systemd state on archiva1001 is OK: OK - running: The system is fully operational [13:47:34] RECOVERY - Check systemd state on wtp1043 is OK: OK - running: The system is fully operational [13:47:38] RECOVERY - Check systemd state on helium is OK: OK - running: The system is fully operational [13:47:40] RECOVERY - Check systemd state on elastic1038 is OK: OK - running: The system is fully operational [13:47:40] RECOVERY - Check systemd state on logstash1011 is OK: OK - running: The system is fully operational [13:47:44] RECOVERY - Check systemd state on mw1244 is OK: OK - running: The system is fully operational [13:47:47] (03CR) 10Muehlenhoff: Initial Kerberos KDC/kadmin server profiles/roles (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/502511 (owner: 10Muehlenhoff) [13:47:48] RECOVERY - Check systemd state on analytics1072 is OK: OK - running: The system is fully operational [13:47:54] RECOVERY - Check systemd state on thumbor2002 is OK: OK - running: The system is fully operational [13:47:54] RECOVERY - Check systemd state on es2003 is OK: OK - running: The system is fully operational [13:48:00] RECOVERY - Check systemd state on wtp1047 is OK: OK - running: The system is fully operational [13:48:00] RECOVERY - Check systemd state on ganeti1004 is OK: OK - running: The system is fully operational [13:48:04] RECOVERY - Check systemd state on wtp1041 is OK: OK - running: The system is fully operational [13:48:05] (03PS4) 10Muehlenhoff: Initial Kerberos KDC/kadmin server profiles/roles [puppet] - 10https://gerrit.wikimedia.org/r/502511 [13:48:06] RECOVERY - Check systemd state on db2088 is OK: OK - running: The system is fully operational [13:48:10] RECOVERY - Check systemd state on mw1343 is OK: OK - running: The system is fully operational [13:48:10] RECOVERY - Check systemd state on mw1342 is OK: OK - running: The system is fully operational [13:48:14] RECOVERY - Check systemd state on ms-be1037 is OK: OK - running: The system is fully operational [13:48:18] RECOVERY - Check systemd state on mw1347 is OK: OK - running: The system is fully operational [13:48:20] RECOVERY - Check systemd state on ms-be2022 is OK: OK - running: The system is fully operational [13:48:20] RECOVERY - Check systemd state on mw2205 is OK: OK - running: The system is fully operational [13:48:24] deployment_hosts gets used under ssh::server which comes from ::ssh but also profile::base creates it like this: create_resources('class', {'ssh::server' => $ssh_server_settings}) [13:48:28] RECOVERY - Check systemd state on db1085 is OK: OK - running: The system is fully operational [13:48:28] RECOVERY - Check systemd state on mw2260 is OK: OK - running: The system is fully operational [13:48:40] RECOVERY - Check systemd state on wdqs1005 is OK: OK - running: The system is fully operational [13:48:41] $ssh_server_settings = hiera('profile::base::ssh_server_settings', {}), [13:48:42] RECOVERY - Check systemd state on mw1336 is OK: OK - running: The system is fully operational [13:48:44] RECOVERY - Check systemd state on dns5002 is OK: OK - running: The system is fully operational [13:48:44] RECOVERY - Check systemd state on mc1036 is OK: OK - running: The system is fully operational [13:48:44] RECOVERY - Check systemd state on mw1261 is OK: OK - running: The system is fully operational [13:48:44] RECOVERY - Check systemd 
state on maps1003 is OK: OK - running: The system is fully operational [13:48:50] RECOVERY - Check systemd state on analytics1057 is OK: OK - running: The system is fully operational [13:48:56] RECOVERY - Check systemd state on mw2284 is OK: OK - running: The system is fully operational [13:48:56] RECOVERY - Check systemd state on ms-be2016 is OK: OK - running: The system is fully operational [13:48:58] RECOVERY - Check systemd state on mw2198 is OK: OK - running: The system is fully operational [13:49:02] (03CR) 10jerkins-bot: [V: 04-1] Initial Kerberos KDC/kadmin server profiles/roles [puppet] - 10https://gerrit.wikimedia.org/r/502511 (owner: 10Muehlenhoff) [13:49:06] RECOVERY - Check systemd state on ms-fe2008 is OK: OK - running: The system is fully operational [13:49:10] RECOVERY - Check systemd state on dns2001 is OK: OK - running: The system is fully operational [13:49:10] RECOVERY - Check systemd state on db2080 is OK: OK - running: The system is fully operational [13:49:14] RECOVERY - Check systemd state on db1090 is OK: OK - running: The system is fully operational [13:49:16] RECOVERY - Check systemd state on elastic2047 is OK: OK - running: The system is fully operational [13:49:22] RECOVERY - Check systemd state on wtp1025 is OK: OK - running: The system is fully operational [13:49:26] RECOVERY - Check systemd state on mw2180 is OK: OK - running: The system is fully operational [13:49:26] RECOVERY - Check systemd state on analytics1046 is OK: OK - running: The system is fully operational [13:49:28] RECOVERY - Check systemd state on wtp2018 is OK: OK - running: The system is fully operational [13:49:28] RECOVERY - Check systemd state on aqs1008 is OK: OK - running: The system is fully operational [13:49:30] RECOVERY - Check systemd state on mwdebug2001 is OK: OK - running: The system is fully operational [13:49:30] RECOVERY - Check systemd state on db1078 is OK: OK - running: The system is fully operational [13:49:32] RECOVERY - Check systemd state on kubetcd2002 is OK: OK - running: The system is fully operational [13:49:38] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [13:49:38] RECOVERY - Check systemd state on relforge1001 is OK: OK - running: The system is fully operational [13:49:42] RECOVERY - Check systemd state on elastic1036 is OK: OK - running: The system is fully operational [13:49:46] RECOVERY - Check systemd state on mw2142 is OK: OK - running: The system is fully operational [13:49:46] RECOVERY - Check systemd state on mw1258 is OK: OK - running: The system is fully operational [13:49:48] RECOVERY - Check systemd state on mwdebug1001 is OK: OK - running: The system is fully operational [13:49:50] RECOVERY - Check systemd state on multatuli is OK: OK - running: The system is fully operational [13:49:52] RECOVERY - Check systemd state on mw2175 is OK: OK - running: The system is fully operational [13:49:56] RECOVERY - Check systemd state on wtp2008 is OK: OK - running: The system is fully operational [13:49:56] RECOVERY - Check systemd state on analytics1056 is OK: OK - running: The system is fully operational [13:49:56] RECOVERY - Check systemd state on mw2187 is OK: OK - running: The system is fully operational [13:49:58] RECOVERY - Check systemd state on ms-be2024 is OK: OK - running: The system is fully operational [13:50:02] RECOVERY - Check systemd state on wtp1045 is OK: OK - running: The system is fully operational [13:50:02] RECOVERY - Check systemd state on db2043 is OK: OK - running: The system is 
fully operational [13:50:04] RECOVERY - Check systemd state on restbase2014 is OK: OK - running: The system is fully operational [13:50:04] RECOVERY - Check systemd state on es2001 is OK: OK - running: The system is fully operational [13:50:06] RECOVERY - Check systemd state on mw1334 is OK: OK - running: The system is fully operational [13:50:10] RECOVERY - Check systemd state on wtp2015 is OK: OK - running: The system is fully operational [13:50:24] RECOVERY - Check systemd state on mw2139 is OK: OK - running: The system is fully operational [13:50:32] RECOVERY - Check systemd state on ms-be2048 is OK: OK - running: The system is fully operational [13:50:32] RECOVERY - Check systemd state on analytics1049 is OK: OK - running: The system is fully operational [13:50:34] RECOVERY - Check systemd state on rdb1009 is OK: OK - running: The system is fully operational [13:50:40] RECOVERY - Check systemd state on ms-be1024 is OK: OK - running: The system is fully operational [13:50:54] RECOVERY - Check systemd state on elastic2029 is OK: OK - running: The system is fully operational [13:50:54] RECOVERY - Check systemd state on elastic2032 is OK: OK - running: The system is fully operational [13:51:02] RECOVERY - Check systemd state on wtp1027 is OK: OK - running: The system is fully operational [13:51:20] RECOVERY - Check systemd state on ms-be2031 is OK: OK - running: The system is fully operational [13:51:41] (03PS20) 10Ema: role::cache::upload_ats: mixed Varnish/ATS setup [puppet] - 10https://gerrit.wikimedia.org/r/501360 (https://phabricator.wikimedia.org/T219967) [13:51:45] I wonder which hosts do not have ::standard [13:51:54] RECOVERY - Check systemd state on kubernetes2004 is OK: OK - running: The system is fully operational [13:52:02] RECOVERY - Check systemd state on mw2171 is OK: OK - running: The system is fully operational [13:52:06] RECOVERY - Check systemd state on torrelay1001 is OK: OK - running: The system is fully operational [13:53:22] PROBLEM - Mediawiki Cirrussearch update lag - codfw on icinga1001 is CRITICAL: CRITICAL: 10.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [13:53:44] RECOVERY - Check systemd state on es1011 is OK: OK - running: The system is fully operational [13:53:46] PROBLEM - Mediawiki Cirrussearch update lag - eqiad on icinga1001 is CRITICAL: CRITICAL: 10.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [13:53:48] RECOVERY - Check systemd state on ms-be2018 is OK: OK - running: The system is fully operational [13:53:56] RECOVERY - Check systemd state on wtp1034 is OK: OK - running: The system is fully operational [13:55:24] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me (and tested on db2102)" [puppet] - 10https://gerrit.wikimedia.org/r/503332 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [13:55:32] RECOVERY - Check systemd state on elastic2040 is OK: OK - running: The system is fully operational [13:55:36] RECOVERY - Check systemd state on ms-be2013 is OK: OK - running: The system is fully operational [13:57:14] RECOVERY - Check systemd state on aqs1004 is OK: OK - running: The system is fully operational [13:57:22] RECOVERY - Check systemd state on kafka-jumbo1004 is OK: OK - running: The system is fully operational [13:57:34] RECOVERY - Check systemd state on ores1009 is OK: OK - running: The system is fully operational [13:59:24] RECOVERY - Check 
systemd state on wdqs2001 is OK: OK - running: The system is fully operational [14:01:04] RECOVERY - Check systemd state on install2002 is OK: OK - running: The system is fully operational [14:02:50] RECOVERY - Check systemd state on dbstore1005 is OK: OK - running: The system is fully operational [14:02:54] RECOVERY - Check systemd state on mc2024 is OK: OK - running: The system is fully operational [14:04:23] 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10elukey) Opened https://github.com/RadeonOpenCompute/ROCm/issues/761 to... [14:04:32] RECOVERY - Check systemd state on cloudservices1003 is OK: OK - running: The system is fully operational [14:04:44] RECOVERY - Check systemd state on mc2027 is OK: OK - running: The system is fully operational [14:04:46] RECOVERY - Check systemd state on mw1270 is OK: OK - running: The system is fully operational [14:04:48] RECOVERY - Check systemd state on mw1306 is OK: OK - running: The system is fully operational [14:06:22] RECOVERY - Check systemd state on analytics1062 is OK: OK - running: The system is fully operational [14:06:42] RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational [14:07:29] (03PS6) 10Jbond: ssacli: update raid fact to detect Gen10 devices [puppet] - 10https://gerrit.wikimedia.org/r/503332 (https://phabricator.wikimedia.org/T220787) [14:08:24] (03CR) 10Jbond: [C: 03+2] ssacli: update raid fact to detect Gen10 devices [puppet] - 10https://gerrit.wikimedia.org/r/503332 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [14:08:30] RECOVERY - Check systemd state on mw2239 is OK: OK - running: The system is fully operational [14:10:10] RECOVERY - Check systemd state on graphite1004 is OK: OK - running: The system is fully operational [14:14:52] RECOVERY - Mediawiki Cirrussearch update lag - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [14:15:22] RECOVERY - Mediawiki Cirrussearch update lag - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [14:15:26] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 32094736840 and 0 seconds [14:15:46] * onimisionipe is checking postgres lag [14:19:58] (03CR) 10Mathew.onipe: Add wdqs data transfer cookbook (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/488256 (https://phabricator.wikimedia.org/T213401) (owner: 10Mathew.onipe) [14:20:09] (03PS17) 10Mathew.onipe: Add wdqs data transfer cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/488256 (https://phabricator.wikimedia.org/T213401) [14:22:37] 10Operations, 10Traffic, 10Patch-For-Review: tagged_interface sometimes exceeds IFNAMSIZ - https://phabricator.wikimedia.org/T209707 (10MoritzMuehlenhoff) We could also look into a backport of https://github.com/systemd/systemd/commit/9009d3b5c3b6d191be69215736be77583e0f23f9 to Stretch, seems totally doable... 
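Picking up the [13:47]–[13:48] explanation of why deployment_hosts is the awkward case: ssh::server is not only reached through the ::ssh module, it is also declared dynamically by profile::base from a hiera hash via create_resources(), so once the value stops coming from network::constants every one of those paths has to read it from hiera. A minimal sketch of that wiring follows, assuming trimmed-down parameter lists; the real classes in operations/puppet carry more parameters and logic than shown.

```puppet
# Minimal sketch of the wiring quoted in the discussion above; parameter lists
# are trimmed and the types are assumptions, not the real operations/puppet code.
class ssh::server (
    # Hosts allowed extra privileges in sshd_config (e.g. deployment sources).
    Array[Stdlib::IP::Address] $deployment_hosts = [],
) {
    # ...renders /etc/ssh/sshd_config from a template using $deployment_hosts...
}

class profile::base (
    Hash $ssh_server_settings = lookup('profile::base::ssh_server_settings', { 'default_value' => {} }),
) {
    # This is the create_resources() call quoted at [13:48]: every key of the
    # hiera hash (deployment_hosts included) becomes a parameter of
    # Class['ssh::server'], so the data must live in hieradata once it is no
    # longer exported by network::constants.
    create_resources('class', { 'ssh::server' => $ssh_server_settings })
}
```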
[14:23:10] (03PS1) 10Gehel: elasticsearch: rename "update lag" check to "update rate" [puppet] - 10https://gerrit.wikimedia.org/r/503366 [14:29:15] !log depool maps2001 for postgres initialization [14:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:22] (03PS2) 10Filippo Giunchedi: aptrepo: validate deb822 files [puppet] - 10https://gerrit.wikimedia.org/r/503025 [14:30:46] 10Operations, 10Traffic, 10Patch-For-Review: tagged_interface sometimes exceeds IFNAMSIZ - https://phabricator.wikimedia.org/T209707 (10Andrew) Installing cloudvirt1024 with Buster isn't really an option -- we'd have to port all OpenStack packages for versions M and N to Buster just to keep this one server a... [14:32:51] (03CR) 10Mathew.onipe: [C: 03+1] elasticsearch: rename "update lag" check to "update rate" [puppet] - 10https://gerrit.wikimedia.org/r/503366 (owner: 10Gehel) [14:33:47] (03PS13) 10Ema: cache: add profile::cache::varnish::frontend [puppet] - 10https://gerrit.wikimedia.org/r/502833 (https://phabricator.wikimedia.org/T219967) [14:34:23] (03PS21) 10Ema: role::cache::upload_ats: mixed Varnish/ATS setup [puppet] - 10https://gerrit.wikimedia.org/r/501360 (https://phabricator.wikimedia.org/T219967) [14:35:04] PROBLEM - Check systemd state on maps2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:35:15] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1002/15735/install1002.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/503025 (owner: 10Filippo Giunchedi) [14:35:19] (03CR) 10Marostegui: ">" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/503334 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [14:36:10] maps is me [14:36:13] silencing! [14:38:13] (03CR) 10Jbond: "> Patch Set 6:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/503334 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [14:41:38] (03PS14) 10Ema: cache: add profile::cache::varnish::frontend [puppet] - 10https://gerrit.wikimedia.org/r/502833 (https://phabricator.wikimedia.org/T219967) [14:42:43] (03PS22) 10Ema: role::cache::upload_ats: mixed Varnish/ATS setup [puppet] - 10https://gerrit.wikimedia.org/r/501360 (https://phabricator.wikimedia.org/T219967) [14:43:40] (03CR) 10Marostegui: raid: refactor structure (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/503333 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [14:43:43] (03CR) 10Andrew Bogott: [C: 03+1] "As I understand it, the proposed solution is only available on Buster. We will still need something like this in order to keep Stretch wo" [puppet] - 10https://gerrit.wikimedia.org/r/474272 (https://phabricator.wikimedia.org/T209707) (owner: 10Vgutierrez) [14:48:52] (03PS23) 10Ema: role::cache::upload_ats: mixed Varnish/ATS setup [puppet] - 10https://gerrit.wikimedia.org/r/501360 (https://phabricator.wikimedia.org/T219967) [14:49:50] PROBLEM - puppet last run on cloudcontrol2001-dev is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:50:28] 10Operations, 10Deployments, 10HHVM, 10Performance-Team (Radar), and 2 others: Translation cache exhaustion caused by changes to PHP code in file scope - https://phabricator.wikimedia.org/T103886 (10Krinkle) [14:50:31] 10Operations, 10HHVM, 10Scap (Scap3-MediaWiki-MVP): Make scap able to depool/repool servers via the conftool API - https://phabricator.wikimedia.org/T104352 (10Krinkle) [14:51:02] 10Operations, 10Deployments, 10HHVM, 10Performance-Team (Radar), and 2 others: Translation cache exhaustion caused by changes to PHP code in file scope - https://phabricator.wikimedia.org/T103886 (10Krinkle) 05Open→03Resolved Seems fine. We'll find out for sure when we continue work on T99740 for local... [14:52:24] (03PS1) 10CDanis: check_ripe_atlas: log exceptions to syslog, not /dev/null [puppet] - 10https://gerrit.wikimedia.org/r/503374 [14:52:33] (03PS6) 10Jbond: raid: refactor structure [puppet] - 10https://gerrit.wikimedia.org/r/503333 (https://phabricator.wikimedia.org/T220787) [14:53:01] (03PS7) 10Jbond: raid: add ssacli class [puppet] - 10https://gerrit.wikimedia.org/r/503334 (https://phabricator.wikimedia.org/T220787) [14:54:45] (03CR) 10Jbond: raid: refactor structure (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/503333 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [14:54:57] (03PS8) 10Jbond: raid: add ssacli class [puppet] - 10https://gerrit.wikimedia.org/r/503334 (https://phabricator.wikimedia.org/T220787) [14:55:21] 10Operations, 10PHP 7.0 support, 10Patch-For-Review, 10Performance-Team (Radar): Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (10Krinkle) @Joe Regarding opcache, I'm not sure why this change for manual reloading is applied now. Could we do that af... [14:58:42] 10Operations, 10Discovery-Wikidata-Query-Service-Sprint: data reimport on wdqs1009 and wdqs1010 - https://phabricator.wikimedia.org/T220830 (10Gehel) [15:00:20] (03CR) 10Gehel: [C: 03+1] "LGTM, let's see if volans has something to say" [cookbooks] - 10https://gerrit.wikimedia.org/r/488256 (https://phabricator.wikimedia.org/T213401) (owner: 10Mathew.onipe) [15:01:32] (03PS15) 10Ema: cache: add profile::cache::varnish::frontend [puppet] - 10https://gerrit.wikimedia.org/r/502833 (https://phabricator.wikimedia.org/T219967) [15:02:04] (03PS24) 10Ema: role::cache::upload_ats: mixed Varnish/ATS setup [puppet] - 10https://gerrit.wikimedia.org/r/501360 (https://phabricator.wikimedia.org/T219967) [15:03:00] ACKNOWLEDGEMENT - Free Blazegraph allocators wdqs-blazegraph on wdqs1009 is CRITICAL: cluster=wdqs-test instance=wdqs1009:9193 job=blazegraph site=eqiad Gehel data reimport needed - https://phabricator.wikimedia.org/T220830 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=32&fullscreen [15:03:38] (03PS25) 10Ema: role::cache::upload_ats: mixed Varnish/ATS setup [puppet] - 10https://gerrit.wikimedia.org/r/501360 (https://phabricator.wikimedia.org/T219967) [15:05:26] (03PS3) 10Krinkle: Consistently use HTML attributes with quotes [puppet] - 10https://gerrit.wikimedia.org/r/497346 (owner: 10Fomafix) [15:05:57] (03PS4) 10Krinkle: mediawiki: Add missing quotes to HTML attributes on error pages [puppet] - 10https://gerrit.wikimedia.org/r/497346 (owner: 10Fomafix) [15:06:37] (03CR) 10Krinkle: [C: 03+1] "LGTM. Later, the hhvm-fatal-error.php.erb file should be converted to use errorpage.html.erb template. 
I have not yet done that :) - part " [puppet] - 10https://gerrit.wikimedia.org/r/497346 (owner: 10Fomafix) [15:06:54] (03CR) 10Lucas Werkmeister (WMDE): "Where does the value of 3in come from? On my local wiki, that looks rather short, and 4in fit into the statement box pretty well." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498660 (https://phabricator.wikimedia.org/T218191) (owner: 10Alaa Sarhan) [15:07:58] (03CR) 10Muehlenhoff: cloudelastic: allow jobrunners and mwmaint nodes to access cloudelastic (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/502829 (https://phabricator.wikimedia.org/T220625) (owner: 10Gehel) [15:08:18] 10Operations, 10Performance-Team (Radar): PHP fatal error handler not working on mwdebug servers - https://phabricator.wikimedia.org/T217846 (10Krinkle) (This is blocking , which is low priority cleanup.) [15:09:48] (03CR) 10Jbond: "Looks good to me however there is one comment relating to django internals which i coudln't resolve" (032 comments) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443366 (https://phabricator.wikimedia.org/T198592) (owner: 10Volans) [15:13:18] (03CR) 10Gehel: cloudelastic: allow jobrunners and mwmaint nodes to access cloudelastic (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/502829 (https://phabricator.wikimedia.org/T220625) (owner: 10Gehel) [15:15:19] (03PS1) 10Alex Monk: deployment-prep: Update upload host [puppet] - 10https://gerrit.wikimedia.org/r/503380 [15:15:58] Krenair: yeah I know, which is why I said it's gonna be complicated ;-) [15:16:24] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting deployment access for santhosh - https://phabricator.wikimedia.org/T220785 (10fgiunchedi) a:03akosiaris Followed up with Alex, assigning to him. [15:17:02] I knew some would be more complicated than others [15:17:27] akosiaris, do you know which hosts deliberately do not have standard? [15:17:49] I wonder if this stuff should go through profile::base ? [15:19:22] (03PS1) 10Ema: cache: implement profile::cache::varnish::backend [puppet] - 10https://gerrit.wikimedia.org/r/503381 (https://phabricator.wikimedia.org/T219967) [15:20:07] (03CR) 10Michael Große: [C: 03+1] "Looks good to me, but there is still a pending discussion." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503342 (https://phabricator.wikimedia.org/T220609) (owner: 10Lucas Werkmeister (WMDE)) [15:20:58] (03CR) 10Muehlenhoff: cloudelastic: allow jobrunners and mwmaint nodes to access cloudelastic (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/502829 (https://phabricator.wikimedia.org/T220625) (owner: 10Gehel) [15:27:43] (03PS2) 10Ema: cache: implement profile::cache::varnish::backend [puppet] - 10https://gerrit.wikimedia.org/r/503381 (https://phabricator.wikimedia.org/T219967) [15:29:23] (03PS5) 10Krinkle: Avoid redirects from HTTPS to HTTP and back to HTTPS [puppet] - 10https://gerrit.wikimedia.org/r/469262 (owner: 10Fomafix) [15:29:33] (03CR) 10Krinkle: [C: 03+1] "LGMT. Confirmed the redirects." [puppet] - 10https://gerrit.wikimedia.org/r/469262 (owner: 10Fomafix) [15:29:37] !log gerrit restart incoming [15:29:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:48] !log gerrit back [15:32:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:36] PROBLEM - puppet last run on db2095 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [15:36:00] PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/xhgui] [15:36:44] PROBLEM - puppet last run on kafka2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [15:36:51] * apergos raises an eyebrow curiously [15:37:12] PROBLEM - puppet last run on labsdb1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [15:37:32] (03CR) 10Andrew Bogott: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/503123 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [15:38:32] PROBLEM - puppet last run on gerrit2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_All-Avatars] [15:39:45] I guess that is expected because of the gerrit restart [15:41:39] (03CR) 10Andrew Bogott: [C: 03+2] deployment-prep: Update upload host [puppet] - 10https://gerrit.wikimedia.org/r/503380 (owner: 10Alex Monk) [15:43:16] PROBLEM - keystone admin endpoint port 35357 on labtestcontrol2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:44:24] RECOVERY - keystone admin endpoint port 35357 on labtestcontrol2003 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 759 bytes in 0.075 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:44:41] 10Operations, 10SRE-Access-Requests: analytics-wmde group addition for Lucas Werkmeister - https://phabricator.wikimedia.org/T220084 (10RazShuty) @Nuria can you please take action? this would help a lot. [15:46:58] (03CR) 10Andrew Bogott: "https://puppet-compiler.wmflabs.org/compiler1002/15741/" [puppet] - 10https://gerrit.wikimedia.org/r/501587 (https://phabricator.wikimedia.org/T171188) (owner: 10Alex Monk) [15:47:01] (03CR) 10Volans: "Thanks a lot for the review, I know it's a lot of code, in particular if lacking context. Replies inline." (032 comments) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443366 (https://phabricator.wikimedia.org/T198592) (owner: 10Volans) [15:47:09] 10Operations, 10ops-eqiad, 10Analytics, 10hardware-requests, and 2 others: Upgrade kafka-jumbo100[1-6] to 10G NICs (if possible) - https://phabricator.wikimedia.org/T220700 (10ayounsi) From: https://netbox.wikimedia.org/dcim/devices/?q=kafka-jumbo&status=1 kafka-jumbo1002 kafka-jumbo1004 kafka-jumbo1005... 
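An aside on the burst of "puppet last run" criticals above: the failed resources are all Exec[git_pull_*] resources, which shell out to git against the Gerrit replica, so a short Gerrit restart makes exactly one Puppet run fail on each affected host and the next scheduled run recovers on its own. A rough Python equivalent of what such an exec does (the path is hypothetical; the real resources are defined in Puppet, not Python):

```python
# Rough illustration of an Exec[git_pull_...] resource: pull a repo cloned
# from Gerrit. If Gerrit is down for a restart, git exits non-zero, the
# resource fails and the Puppet run is reported as failed; the next
# scheduled run (typically ~30 minutes later) pulls cleanly and the
# Icinga "puppet last run" check recovers, as seen above.
import subprocess
import sys

REPO_DIR = "/srv/mediawiki-config"   # hypothetical checkout location

try:
    subprocess.run(
        ["git", "-C", REPO_DIR, "pull", "--ff-only"],
        check=True,
        timeout=300,
    )
except (subprocess.CalledProcessError, subprocess.TimeoutExpired) as exc:
    # Puppet would mark the Exec resource as failed at this point.
    sys.exit(f"git pull failed: {exc}")
```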
[15:47:54] (03PS3) 10Ema: cache: implement profile::cache::varnish::backend [puppet] - 10https://gerrit.wikimedia.org/r/503381 (https://phabricator.wikimedia.org/T219967) [15:53:55] (03CR) 10Jbond: [C: 04-1] "mostly fine bu one error i think" (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443368 (https://phabricator.wikimedia.org/T198592) (owner: 10Volans) [15:56:07] !log starting data trasnfer from wdqs1008 to wdqs1009 - T220830 [15:56:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:11] T220830: data reimport on wdqs1009 and wdqs1010 - https://phabricator.wikimedia.org/T220830 [15:56:53] (03PS16) 10Ema: cache: add profile::cache::varnish::frontend [puppet] - 10https://gerrit.wikimedia.org/r/502833 (https://phabricator.wikimedia.org/T219967) [15:56:55] (03PS4) 10Ema: cache: implement profile::cache::varnish::backend [puppet] - 10https://gerrit.wikimedia.org/r/503381 (https://phabricator.wikimedia.org/T219967) [15:58:49] (03CR) 10Jbond: [C: 03+1] "Thanks for the responses, looks good to me" (032 comments) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443366 (https://phabricator.wikimedia.org/T198592) (owner: 10Volans) [15:58:52] 10Operations, 10SRE-Access-Requests: analytics-wmde group addition for Lucas Werkmeister - https://phabricator.wikimedia.org/T220084 (10Nuria) @RazShuty has user signed NDA? [16:01:55] 10Operations, 10SRE-Access-Requests: analytics-wmde group addition for Lucas Werkmeister - https://phabricator.wikimedia.org/T220084 (10RazShuty) @Nuria : [x] - User has signed the L3 Acknowledgement of Wikimedia Server Access Responsibilities Document. [x] - User has a valid NDA on file with WMF legal. (Thi... [16:02:30] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:03:14] RECOVERY - puppet last run on kafka2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:03:42] RECOVERY - puppet last run on labsdb1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:04:56] RECOVERY - puppet last run on gerrit2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:05:18] RECOVERY - puppet last run on db2095 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:12:30] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T220691 (10RStallman-legalteam) Confirming that NDA has been executed. Thanks! [16:13:16] PROBLEM - keystone admin endpoint port 35357 on labtestcontrol2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:13:30] (03PS17) 10Ema: cache: add profile::cache::varnish::frontend [puppet] - 10https://gerrit.wikimedia.org/r/502833 (https://phabricator.wikimedia.org/T219967) [16:14:27] (03PS5) 10Ema: cache: implement profile::cache::varnish::backend [puppet] - 10https://gerrit.wikimedia.org/r/503381 (https://phabricator.wikimedia.org/T219967) [16:14:41] !log install ifstat on all the mc1* hosts for network bandwidth investigation [16:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:33] how do i find out which nodes are using toollabs::kube2proxy? it's not in https://tools.wmflabs.org/openstack-browser/puppetclass/ and also not elsewhere? 
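For readers unfamiliar with the cookbooks repo being reviewed above (Gerrit change 488256): Spicerack cookbooks of this era are plain Python modules exposing argument_parser() and run(args, spicerack). A minimal sketch of that shape with a placeholder body; the actual wdqs data-transfer logic lives in the change under review and is not reproduced here, and the commands shown are purely illustrative:

```python
"""Skeleton of a function-based Spicerack cookbook (illustrative only)."""
import argparse

__title__ = "Example: transfer data between two hosts"


def argument_parser():
    """Return the parser for this cookbook's CLI arguments."""
    parser = argparse.ArgumentParser(description=__title__)
    parser.add_argument("source", help="FQDN of the source host")
    parser.add_argument("dest", help="FQDN of the destination host")
    return parser


def run(args, spicerack):
    """Entry point called by the cookbook runner; return 0 on success."""
    remote = spicerack.remote()
    source = remote.query(args.source)
    dest = remote.query(args.dest)
    # Placeholder: a real transfer cookbook would stop services, copy the
    # data and restart services; here we only echo on both hosts.
    source.run_sync("echo source ready")
    dest.run_sync("echo destination ready")
    return 0
```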
[16:15:48] RECOVERY - keystone admin endpoint port 35357 on labtestcontrol2003 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 759 bytes in 4.039 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:16:11] (03PS6) 10Ema: cache: implement profile::cache::varnish::backend [puppet] - 10https://gerrit.wikimedia.org/r/503381 (https://phabricator.wikimedia.org/T219967) [16:19:18] (03PS5) 10Andrew Bogott: openstack::puppet::master::encapi: Avoid nginx-apache conflict [puppet] - 10https://gerrit.wikimedia.org/r/501587 (https://phabricator.wikimedia.org/T171188) (owner: 10Alex Monk) [16:20:24] (03CR) 10Andrew Bogott: [C: 03+2] openstack::puppet::master::encapi: Avoid nginx-apache conflict [puppet] - 10https://gerrit.wikimedia.org/r/501587 (https://phabricator.wikimedia.org/T171188) (owner: 10Alex Monk) [16:23:46] PROBLEM - keystone admin endpoint port 35357 on labtestcontrol2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:23:48] 10Operations, 10Wikimedia-Logstash, 10service-runner, 10Core Platform Team Kanban (Done with CPT), and 2 others: Move service-runner to new logging infrastructure - https://phabricator.wikimedia.org/T211125 (10bmansurov) [16:23:52] 10Operations, 10Recommendation-API, 10Wikimedia-Logstash, 10service-runner, and 2 others: Move recommendation-api logging to new logging pipeline - https://phabricator.wikimedia.org/T219926 (10bmansurov) 05Open→03Resolved [16:24:56] RECOVERY - keystone admin endpoint port 35357 on labtestcontrol2003 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 759 bytes in 0.076 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:26:56] RECOVERY - Check systemd state on maps2001 is OK: OK - running: The system is fully operational [16:27:00] (03PS18) 10Ema: cache: add profile::cache::varnish::frontend [puppet] - 10https://gerrit.wikimedia.org/r/502833 (https://phabricator.wikimedia.org/T219967) [16:27:14] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 208 and 0 seconds [16:27:23] (03PS7) 10Ema: cache: implement profile::cache::varnish::backend [puppet] - 10https://gerrit.wikimedia.org/r/503381 (https://phabricator.wikimedia.org/T219967) [16:34:12] PROBLEM - keystone admin endpoint port 35357 on labtestcontrol2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:34:39] 10Operations, 10Maps (Tilerator), 10Patch-For-Review, 10Reading-Infrastructure-Team-Backlog (Kanban): investigate tilerator crash on maps eqiad - https://phabricator.wikimedia.org/T204047 (10MSantos) 05Open→03Resolved [16:35:22] RECOVERY - keystone admin endpoint port 35357 on labtestcontrol2003 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 759 bytes in 0.076 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:35:41] 10Operations, 10Maps, 10Traffic, 10Reading-Infrastructure-Team-Backlog (Kanban): Decide on Cache-Control headers for map tiles - https://phabricator.wikimedia.org/T186732 (10MSantos) 05Open→03Resolved [16:42:14] PROBLEM - Memory correctable errors -EDAC- on thumbor1004 is CRITICAL: 4.001 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops [16:45:19] (03PS2) 10Jbond: debdeploy: add zsh autocompletion script [puppet] - 
10https://gerrit.wikimedia.org/r/503058 [16:46:55] "Repooling thumbor1004 until we replace its memory" [16:46:58] eh.. re or "DE" [16:48:27] mutante: it has not been causing issues [16:48:38] jijiki: you literally just downtimed, right :) [16:48:43] I did [16:48:46] ok:) [16:48:46] (03CR) 10Jbond: "Thanks for the review, Chris, will wait for Moritz to review as he may want this in the package as opposed to puppet" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/503058 (owner: 10Jbond) [16:53:46] PROBLEM - keystone admin endpoint port 35357 on labtestcontrol2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:54:18] (03PS1) 10Paladox: Gerrit: Put