[00:00:04] <jouncebot>	 addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor I � Unicode. All rise for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190207T0000).
[00:00:04] <jouncebot>	 MatmaRex and tgr: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:17] <MatmaRex>	 hi
[00:03:04] <twentyafterfour>	 o/ 
[00:03:12] <Platonides>	 hi MatmaRex 
[00:03:33] <icinga-wm>	 PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:03:45] <twentyafterfour>	 I can swat if needed
[00:05:09] <twentyafterfour>	 tgr: are you able to test for SWAT? 
[00:05:25] <twentyafterfour>	 MatmaRex: I'll merge your changes first 
[00:06:35] <tgr>	 twentyafterfour: yeah
[00:06:42] <wikibugs>	 (CR) Nuria: [C: +1] "Looks good, if Erik's experiments are  successful we will also give sudo to gilles and (maybe) Adam Baso" [puppet] - https://gerrit.wikimedia.org/r/488606 (https://phabricator.wikimedia.org/T215384) (owner: Dzahn)
[00:15:17] <wikibugs>	 (PS5) 20after4: Merge the "extended-uploader" and "autopatrolled" user groups on Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/485487 (https://phabricator.wikimedia.org/T214003) (owner: Zoranzoki21)
[00:19:03] <twentyafterfour>	 ok the extension patches are taking the slow ride through CI.  tgr: I take it that the migration should happen first before the config change?
[00:19:49] <tgr>	 twentyafterfour: yeah, after the patch is merged one of the groups wouldn't exist anymore
[00:24:23] <twentyafterfour>	 !log running `mwscript migrateUserGroup.php commonswiki extended-uploader autopatrolled` on deploy1001
[00:24:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:24:27] <icinga-wm>	 PROBLEM - Gerritj on gerrit.wikimedia.org is CRITICAL: The command defined for service Gerritj does not exist
[00:25:13] <twentyafterfour>	 gerritj?
[00:26:25] <MatmaRex>	 :o
[00:29:45] <twentyafterfour>	 MatmaRex: should I sync these individually or do them at the same time? 
[00:30:16] <MatmaRex>	 twentyafterfour: safe to do either way, they are unrelated fixes
[00:31:18] <twentyafterfour>	 ok both merged 
[00:31:24] <twentyafterfour>	 I'll sync them one at a time though 
[00:32:15] <icinga-wm>	 RECOVERY - Gerritj on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 351334 bytes in 0.165 second response time
[00:38:52] <MatmaRex>	 twentyafterfour: please ping me when i can confirm the fixes
[00:42:12] <twentyafterfour>	 MatmaRex: they should be on mwdebug1001 now 
[00:42:52] <twentyafterfour>	 tgr: any idea how long this kind of migration should take?  It's done 6 million users so far 
[00:43:00] <MatmaRex>	 1001? a bit of variety
[00:43:27] <tgr>	 it goes through all users? wow
[00:43:35] <twentyafterfour>	 tgr apparently :-/
[00:43:48] <twentyafterfour>	 MatmaRex: :-o
[00:44:16] <tgr>	 there are about 5000 users who should be affected
[00:44:45] <James_F>	 twentyafterfour: There are only 7,488,571 registered users on Commons, so shouldn't be too long.
[00:44:58] <MatmaRex>	 twentyafterfour: both work as expected
[00:45:13] <MatmaRex>	 (i think every swat in at least several months had me test on 1002 :) )
[00:45:19] <twentyafterfour>	 James_F: nice, thanks
[00:45:44] <wikibugs>	 (PS1) Dzahn: icinga/gerrit: add double quotes around URL part in check command [puppet] - https://gerrit.wikimedia.org/r/488636 (https://phabricator.wikimedia.org/T215033)
[00:45:45] <twentyafterfour>	 MatmaRex: I'm pretty sure it doesn't matter which one as long as I sync the same one you test ;)
[00:46:03] * James_F grins.
[00:46:10] <twentyafterfour>	  MatmaRex: thanks for testing
[00:46:38] <wikibugs>	 (PS2) Dzahn: icinga/gerrit: add double quotes around URL part in check command [puppet] - https://gerrit.wikimedia.org/r/488636 (https://phabricator.wikimedia.org/T215033)
[00:47:09] <wikibugs>	 (CR) Dzahn: [C: +2] icinga/gerrit: add double quotes around URL part in check command [puppet] - https://gerrit.wikimedia.org/r/488636 (https://phabricator.wikimedia.org/T215033) (owner: Dzahn)
[00:47:14] <twentyafterfour>	 !log syncing commit dd8654ac9b3f2e88241e65d3ea35aea9699defc5 for Bug: T209052
[00:47:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:47:17] <stashbot>	 T209052: Load page content in parallel with VE code on Mobile with ArticleTargetLoader - https://phabricator.wikimedia.org/T209052
[00:47:30] <tgr>	 I'm not sure how many autopatrollers should be there but at a glance there are way less now than total commons users so the script doesn't seem to be doing anything stupid
[00:47:41] <tgr>	 well, in terms of output, anyway
[00:47:51] <tgr>	 processing all users is definitely stupid
[00:48:20] <twentyafterfour>	 Done! 72 users in group 'extended-uploader' are now in 'autopatrolled' instead.
[00:48:52] <twentyafterfour>	 tgr: so I'll merge the config change now
[00:48:54] <logmsgbot>	 !log twentyafterfour@deploy1001 Synchronized php-1.33.0-wmf.16/extensions/MobileFrontend/: SWAT dd8654ac9b3f2e88241e65d3ea35aea9699defc5 (duration: 01m 00s)
[00:48:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:49:45] <wikibugs>	 (CR) 20after4: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/485487 (https://phabricator.wikimedia.org/T214003) (owner: Zoranzoki21)
[00:49:49] <icinga-wm>	 PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: The command defined for service Gerrit JSON does not exist
[00:50:02] <tgr>	 hm, maybe I misremembered and autopatrollers is the one with 5000ish members then
[00:50:25] <icinga-wm>	 RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 27192 bytes in 0.033 second response time
[00:50:51] <wikibugs>	 (Merged) jenkins-bot: Merge the "extended-uploader" and "autopatrolled" user groups on Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/485487 (https://phabricator.wikimedia.org/T214003) (owner: Zoranzoki21)
[00:51:50] <mutante>	 paladox: ^ 
[00:52:22] <paladox>	 mutante nice!
[00:52:31] <tgr>	 apparently no user rights log entry either :/
[00:53:08] <logmsgbot>	 !log twentyafterfour@deploy1001 Synchronized php-1.33.0-wmf.16/extensions/VisualEditor/: SWAT f89e12fc466d2c51343d9815c70a0b4602acc333 to fix bug: T209610 (duration: 00m 55s)
[00:53:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:53:11] <stashbot>	 T209610: On mobile, template context menu doesn't show the name of the template - https://phabricator.wikimedia.org/T209610
[00:54:01] <twentyafterfour>	 tgr: the config change should be live on mwdebug1001
[00:54:16] <wikibugs>	 Operations, Gerrit, Icinga, monitoring, and 2 others: improve Gerrit monitoring (was: Investigate why icinga did not report high cpu/load for gerrit) - https://phabricator.wikimedia.org/T215033 (Dzahn) The new check "Gerrit JSON" works now:  https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi...
[00:55:08] <tgr>	 twentyafterfour: looks good
[00:55:09] <mutante>	 paladox: dont know if should close or only after also doing the healthcheck plugin thing
[00:55:27] <wikibugs>	 (CR) jenkins-bot: Merge the "extended-uploader" and "autopatrolled" user groups on Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/485487 (https://phabricator.wikimedia.org/T214003) (owner: Zoranzoki21)
[00:56:20] <wikibugs>	 Operations, Gerrit, Icinga, monitoring, and 2 others: improve Gerrit monitoring (was: Investigate why icinga did not report high cpu/load for gerrit) - https://phabricator.wikimedia.org/T215033 (Dzahn) Open→Resolved a:Dzahn
[00:56:59] <logmsgbot>	 !log twentyafterfour@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT config change for Bug: T214003 (duration: 00m 53s)
[00:57:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:57:02] <stashbot>	 T214003: Merge the "extended-uploader" and "autopatrolled" user groups on Commons - https://phabricator.wikimedia.org/T214003
[00:58:12] <tgr>	 thanks! filed T215479 and T215480 about the issues
[00:58:13] <stashbot>	 T215479: migrateUserGroup.php should not process all user records - https://phabricator.wikimedia.org/T215479
[00:58:13] <stashbot>	 T215480: migrateUserGroup.php should make a user rights log entry - https://phabricator.wikimedia.org/T215480
[01:00:04] <jouncebot>	 twentyafterfour: I, the Bot under the Fountain, allow thee, The Deployer, to do Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190207T0100).
[01:01:51] <MatmaRex>	 thanks for deploying twentyafterfour!
[01:03:59] <twentyafterfour>	 you're welcome! glad to help out ;) 
[01:04:39] <twentyafterfour>	 !log no phabricator deployment tonight 
[01:04:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:05:01] <twentyafterfour>	 !log US Evening SWAT is complete 
[01:05:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:17:45] <wikibugs>	 (CR) Volans: administrative: add owner getter to Reason class (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/488204 (https://phabricator.wikimedia.org/T205884) (owner: Volans)
[01:34:38] <wikibugs>	 (PS2) Volans: sre.hosts: add decommission cookbook [cookbooks] - https://gerrit.wikimedia.org/r/487982 (https://phabricator.wikimedia.org/T205886)
[01:35:03] <wikibugs>	 (CR) Volans: "REplies inline" (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/487982 (https://phabricator.wikimedia.org/T205886) (owner: Volans)
[01:36:25] <wikibugs>	 (CR) CRusnov: [C: +1] administrative: add owner getter to Reason class (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/488204 (https://phabricator.wikimedia.org/T205884) (owner: Volans)
[01:40:34] <wikibugs>	 (CR) Volans: icinga: enable check for psi and omega clusters (1 comment) [puppet] - https://gerrit.wikimedia.org/r/488485 (https://phabricator.wikimedia.org/T212850) (owner: Mathew.onipe)
[01:46:08] <wikibugs>	 (PS4) Volans: management: add management module [software/spicerack] - https://gerrit.wikimedia.org/r/486529 (https://phabricator.wikimedia.org/T205885)
[01:46:10] <wikibugs>	 (PS4) Volans: icinga: add context manager for downtimed hosts [software/spicerack] - https://gerrit.wikimedia.org/r/486530
[01:46:12] <wikibugs>	 (PS2) Volans: puppet: add delete() method to remove a host [software/spicerack] - https://gerrit.wikimedia.org/r/487981 (https://phabricator.wikimedia.org/T205884)
[01:46:35] <wikibugs>	 (CR) Volans: management: add management module (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/486529 (https://phabricator.wikimedia.org/T205885) (owner: Volans)
[01:50:54] <wikibugs>	 (CR) Gehel: [C: +1] "LGTM" [software/spicerack] - https://gerrit.wikimedia.org/r/486529 (https://phabricator.wikimedia.org/T205885) (owner: Volans)
[01:52:51] <wikibugs>	 (CR) Volans: [C: +2] management: add management module [software/spicerack] - https://gerrit.wikimedia.org/r/486529 (https://phabricator.wikimedia.org/T205885) (owner: Volans)
[01:59:26] <wikibugs>	 (Merged) jenkins-bot: management: add management module [software/spicerack] - https://gerrit.wikimedia.org/r/486529 (https://phabricator.wikimedia.org/T205885) (owner: Volans)
[01:59:28] <wikibugs>	 (Merged) jenkins-bot: icinga: add context manager for downtimed hosts [software/spicerack] - https://gerrit.wikimedia.org/r/486530 (owner: Volans)
[01:59:36] <wikibugs>	 (CR) Gehel: [C: +1] "LGTM, will merge tomorrow" [puppet] - https://gerrit.wikimedia.org/r/487924 (https://phabricator.wikimedia.org/T215199) (owner: EBernhardson)
[02:00:36] <wikibugs>	 (CR) jenkins-bot: management: add management module [software/spicerack] - https://gerrit.wikimedia.org/r/486529 (https://phabricator.wikimedia.org/T205885) (owner: Volans)
[02:00:43] <wikibugs>	 (CR) Volans: [C: +2] puppet: add delete() method to remove a host [software/spicerack] - https://gerrit.wikimedia.org/r/487981 (https://phabricator.wikimedia.org/T205884) (owner: Volans)
[02:01:38] <wikibugs>	 (CR) jenkins-bot: icinga: add context manager for downtimed hosts [software/spicerack] - https://gerrit.wikimedia.org/r/486530 (owner: Volans)
[02:03:45] <wikibugs>	 (Abandoned) Gehel: Proposal: cleanup of management class [software/spicerack] - https://gerrit.wikimedia.org/r/487094 (owner: Gehel)
[02:06:37] <wikibugs>	 (Merged) jenkins-bot: puppet: add delete() method to remove a host [software/spicerack] - https://gerrit.wikimedia.org/r/487981 (https://phabricator.wikimedia.org/T205884) (owner: Volans)
[02:07:18] <wikibugs>	 (PS1) Milimetric: Use correct command from systemd timer [puppet] - https://gerrit.wikimedia.org/r/488670
[02:07:38] <wikibugs>	 (CR) jenkins-bot: puppet: add delete() method to remove a host [software/spicerack] - https://gerrit.wikimedia.org/r/487981 (https://phabricator.wikimedia.org/T205884) (owner: Volans)
[02:07:40] <wikibugs>	 (PS2) Volans: administrative: add owner getter to Reason class [software/spicerack] - https://gerrit.wikimedia.org/r/488204 (https://phabricator.wikimedia.org/T205884)
[02:10:04] <wikibugs>	 (CR) Volans: "replies inline" (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/488204 (https://phabricator.wikimedia.org/T205884) (owner: Volans)
[02:10:16] <wikibugs>	 (CR) Krinkle: "Would this explain why some mwgrep queries produced outdated or incomplete results? I don't have concrete examples right not, but I've sen" [puppet] - https://gerrit.wikimedia.org/r/487924 (https://phabricator.wikimedia.org/T215199) (owner: EBernhardson)
[02:17:55] <wikibugs>	 (PS1) Kosta Harlan: GrowthExperiments: Enable search for help panel on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/488675 (https://phabricator.wikimedia.org/T209301)
[02:18:22] <wikibugs>	 (PS2) Milimetric: Use correct command from systemd timer [puppet] - https://gerrit.wikimedia.org/r/488670
[02:22:17] <wikibugs>	 (CR) Volans: [C: +2] "Merging as there were already +1 and the last change is only on the docstring." [software/spicerack] - https://gerrit.wikimedia.org/r/488204 (https://phabricator.wikimedia.org/T205884) (owner: Volans)
[02:28:03] <wikibugs>	 (Merged) jenkins-bot: administrative: add owner getter to Reason class [software/spicerack] - https://gerrit.wikimedia.org/r/488204 (https://phabricator.wikimedia.org/T205884) (owner: Volans)
[02:29:13] <wikibugs>	 (CR) jenkins-bot: administrative: add owner getter to Reason class [software/spicerack] - https://gerrit.wikimedia.org/r/488204 (https://phabricator.wikimedia.org/T205884) (owner: Volans)
[02:50:53] <wikibugs>	 (CR) Krinkle: [C: +1] Use standard version of plain-text GPL [cookbooks] - https://gerrit.wikimedia.org/r/460731 (owner: Legoktm)
[03:44:47] <icinga-wm>	 RECOVERY - Long running screen/tmux on an-coord1001 is OK: OK: SCREEN detected but not long running.
[03:54:27] <wikibugs>	 (PS1) Reedy: Don't add EP NS where the wiki has no pages in that NS [mediawiki-config] - https://gerrit.wikimedia.org/r/488720 (https://phabricator.wikimedia.org/T200391)
[03:56:12] <Reedy>	 jouncebot: now
[03:56:12] <jouncebot>	 No deployments scheduled for the next 8 hour(s) and 3 minute(s)
[03:56:15] <Reedy>	 jouncebot: next
[03:56:15] <jouncebot>	 In 8 hour(s) and 3 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190207T1200)
[03:56:28] <wikibugs>	 (PS2) Reedy: Don't add EP NS where the wiki has no pages in that NS [mediawiki-config] - https://gerrit.wikimedia.org/r/488720 (https://phabricator.wikimedia.org/T200391)
[03:57:56] <wikibugs>	 (CR) Reedy: [C: +2] Don't add EP NS where the wiki has no pages in that NS [mediawiki-config] - https://gerrit.wikimedia.org/r/488720 (https://phabricator.wikimedia.org/T200391) (owner: Reedy)
[03:59:03] <wikibugs>	 (Merged) jenkins-bot: Don't add EP NS where the wiki has no pages in that NS [mediawiki-config] - https://gerrit.wikimedia.org/r/488720 (https://phabricator.wikimedia.org/T200391) (owner: Reedy)
[03:59:15] <wikibugs>	 (CR) jenkins-bot: Don't add EP NS where the wiki has no pages in that NS [mediawiki-config] - https://gerrit.wikimedia.org/r/488720 (https://phabricator.wikimedia.org/T200391) (owner: Reedy)
[04:00:36] <logmsgbot>	 !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Disable EP namespaces on wikis with no EP pages (duration: 00m 57s)
[04:00:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:08:11] <wikibugs>	 (PS2) Tim Starling: Use excimer to set a graceful wall clock time limit in PHP 7 [mediawiki-config] - https://gerrit.wikimedia.org/r/487069
[04:20:44] <wikibugs>	 (PS3) Tim Starling: Use excimer to set a graceful wall clock time limit in PHP 7 [mediawiki-config] - https://gerrit.wikimedia.org/r/487069
[04:21:20] <wikibugs>	 (CR) Tim Starling: "PS3: remove set_time_limit() in the excimer case, for simplicity, as suggested by Krinkle." [mediawiki-config] - https://gerrit.wikimedia.org/r/487069 (owner: Tim Starling)
[04:22:12] <wikibugs>	 (CR) Tim Starling: [C: +2] Use excimer to set a graceful wall clock time limit in PHP 7 [mediawiki-config] - https://gerrit.wikimedia.org/r/487069 (owner: Tim Starling)
[04:23:21] <wikibugs>	 (Merged) jenkins-bot: Use excimer to set a graceful wall clock time limit in PHP 7 [mediawiki-config] - https://gerrit.wikimedia.org/r/487069 (owner: Tim Starling)
[04:32:49] <wikibugs>	 (CR) jenkins-bot: Use excimer to set a graceful wall clock time limit in PHP 7 [mediawiki-config] - https://gerrit.wikimedia.org/r/487069 (owner: Tim Starling)
[04:35:10] <logmsgbot>	 !log tstarling@deploy1001 Synchronized wmf-config/set-time-limit.php: (no justification provided) (duration: 00m 54s)
[04:35:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:44:34] <wikibugs>	 (CR) Tim Starling: [C: +2] "It works, except that the error displayed is not very user-friendly." [mediawiki-config] - https://gerrit.wikimedia.org/r/487069 (owner: Tim Starling)
[04:59:06] <wikibugs>	 (PS1) Effie Mouzeli: admin: fixed typo in username ha78na [puppet] - https://gerrit.wikimedia.org/r/488753 (https://phabricator.wikimedia.org/T215352)
[05:04:08] <wikibugs>	 (CR) Dzahn: [C: +2] admin: fixed typo in username ha78na [puppet] - https://gerrit.wikimedia.org/r/488753 (https://phabricator.wikimedia.org/T215352) (owner: Effie Mouzeli)
[05:04:35] <wikibugs>	 (CR) Dzahn: [C: +2] "[mwmaint1002:~] $ ldaplist -l passwd ha78na" [puppet] - https://gerrit.wikimedia.org/r/488753 (https://phabricator.wikimedia.org/T215352) (owner: Effie Mouzeli)
[05:05:34] <wikibugs>	 (CR) Dzahn: "thanks!" [puppet] - https://gerrit.wikimedia.org/r/488753 (https://phabricator.wikimedia.org/T215352) (owner: Effie Mouzeli)
[05:07:27] <wikibugs>	 (CR) Effie Mouzeli: ":D" [puppet] - https://gerrit.wikimedia.org/r/488753 (https://phabricator.wikimedia.org/T215352) (owner: Effie Mouzeli)
[05:20:58] <wikibugs>	 (PS1) Samwilson: Add all fonts used in production MediaWiki [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/488764 (https://phabricator.wikimedia.org/T213669)
[05:32:10] <wikibugs>	 (PS3) Fsero: Bump helm to 2.12.2 for security and features [debs/helm] (debian/stretch-wikimedia) - https://gerrit.wikimedia.org/r/488089 (https://phabricator.wikimedia.org/T215244)
[05:33:17] <wikibugs>	 (PS4) Fsero: Bump helm to 2.12.2 for security and features [debs/helm] (debian/stretch-wikimedia) - https://gerrit.wikimedia.org/r/488089 (https://phabricator.wikimedia.org/T215244)
[05:35:00] <wikibugs>	 (CR) Fsero: "Thanks for the review!" [debs/helm] (debian/stretch-wikimedia) - https://gerrit.wikimedia.org/r/488089 (https://phabricator.wikimedia.org/T215244) (owner: Fsero)
[06:01:19] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s4 on db2084 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 330.67 seconds
[06:01:25] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s4 on db2051 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 334.28 seconds
[06:01:29] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s4 on db2058 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 336.41 seconds
[06:01:33] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s4 on db2090 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 339.43 seconds
[06:01:33] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s4 on db2065 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 339.88 seconds
[06:03:05] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s4 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 377.20 seconds
[06:03:17] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s4 on db2073 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 384.25 seconds
[06:03:17] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s4 on db2091 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 383.55 seconds
[06:10:51] <wikibugs>	 Operations, Analytics, Analytics-Kanban, Product-Analytics, Patch-For-Review: dbstore1002 Mysql errors - https://phabricator.wikimedia.org/T213670 (Marostegui) dbstore1002 crashed, possibly due to {T215450}
[06:11:00] <wikibugs>	 Operations: issue pulling 1 layer of docker-registry.wikimedia.org/releng/composer-php71:latest - https://phabricator.wikimedia.org/T209507 (fsero) Open→Resolved I think this was fixed adjusting Cache-Control headers on docker-registry so varnish can serve content accordingly, report back if not :)
[06:12:59] <wikibugs>	 Operations, serviceops, vm-requests, Patch-For-Review, User-fsero: eqiad: 1-2 VM requests for docker-registry-beta.wikimedia.org - https://phabricator.wikimedia.org/T212212 (fsero)
[06:14:06] <wikibugs>	 Operations, Prod-Kubernetes, serviceops, Kubernetes, User-fsero: Make swift containers for docker registry cross replicated. - https://phabricator.wikimedia.org/T214289 (fsero)
[06:14:17] <marostegui>	 !log Ease consistency options on db2051 (s4 master) to let it catch up on replication
[06:14:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:14:55] <wikibugs>	 Operations, Citoid, serviceops, Patch-For-Review, and 2 others: allow zotero container nodejs server to define the amount of heap used instead of the fixed limit of 1.7Gi - https://phabricator.wikimedia.org/T213414 (fsero)
[06:15:09] <wikibugs>	 Operations, Prod-Kubernetes, serviceops, Kubernetes, and 2 others: improve docker registry architecture - https://phabricator.wikimedia.org/T209271 (fsero)
[06:15:43] <wikibugs>	 (PS3) Marostegui: dbstore1003: Increase number mysql of instances [puppet] - https://gerrit.wikimedia.org/r/488454 (https://phabricator.wikimedia.org/T210478)
[06:18:55] <wikibugs>	 (CR) Marostegui: [C: +2] dbstore1003: Increase number mysql of instances [puppet] - https://gerrit.wikimedia.org/r/488454 (https://phabricator.wikimedia.org/T210478) (owner: Marostegui)
[06:25:55] <wikibugs>	 (CR) Elukey: [C: +1] "Thanks Daniel!" [puppet] - https://gerrit.wikimedia.org/r/488606 (https://phabricator.wikimedia.org/T215384) (owner: Dzahn)
[06:26:13] <wikibugs>	 (PS3) Elukey: Use correct command from systemd timer [puppet] - https://gerrit.wikimedia.org/r/488670 (owner: Milimetric)
[06:27:27] <wikibugs>	 (CR) Elukey: [C: +2] Use correct command from systemd timer [puppet] - https://gerrit.wikimedia.org/r/488670 (owner: Milimetric)
[06:27:41] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 45247.34 seconds
[06:27:41] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 44010.34 seconds
[06:27:55] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 44205.36 seconds
[06:28:05] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s6 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 43122.55 seconds
[06:28:09] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 43922.30 seconds
[06:28:21] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Update_rows_v1 event on table wikidatawiki.echo_notification: Cant find record in echo_notification, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1069-bin.000334, end_log_pos 547679268
[06:28:23] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 40570.74 seconds
[06:28:25] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 42857.16 seconds
[06:28:33] <elukey>	 going to fix dbstore1002 in a bit --^
[06:29:56] <wikibugs>	 (PS1) Alexandros Kosiaris: mathoid: Remove mwapi_req/restbase_req [deployment-charts] - https://gerrit.wikimedia.org/r/488800
[06:30:41] <icinga-wm>	 PROBLEM - puppet last run on an-worker1084 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/R/biocLite.R]
[06:33:09] <icinga-wm>	 PROBLEM - puppet last run on labmon1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/rsyslog.d/40-prometheus.conf]
[06:34:57] <icinga-wm>	 RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational
[06:45:31] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s4 on db2051 is OK: OK slave_sql_lag Replication lag: 32.28 seconds
[06:45:39] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s4 on db2090 is OK: OK slave_sql_lag Replication lag: 30.12 seconds
[06:45:55] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s4 on db2095 is OK: OK slave_sql_lag Replication lag: 6.67 seconds
[06:46:07] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s4 on db2073 is OK: OK slave_sql_lag Replication lag: 0.43 seconds
[06:46:07] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s4 on db2091 is OK: OK slave_sql_lag Replication lag: 0.44 seconds
[06:46:43] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s4 on db2084 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[06:48:14] <marostegui>	 !log Restore consistency options on db2051
[06:48:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:09] <icinga-wm>	 RECOVERY - puppet last run on an-worker1084 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:57:35] <wikibugs>	 (CR) Marostegui: [C: +1] Add staging-db-analytics.eqiad.wmnet CNAME to dbstore1003 [dns] - https://gerrit.wikimedia.org/r/488535 (https://phabricator.wikimedia.org/T210478) (owner: Elukey)
[06:57:50] <wikibugs>	 (PS3) Marostegui: dbstore-grants: Add research user and fixing styling [puppet] - https://gerrit.wikimedia.org/r/488267 (https://phabricator.wikimedia.org/T214469)
[06:58:33] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s4 on db2058 is OK: OK slave_sql_lag Replication lag: 0.50 seconds
[06:58:41] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s4 on db2065 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[06:59:23] <wikibugs>	 (CR) Marostegui: [C: +2] dbstore-grants: Add research user and fixing styling [puppet] - https://gerrit.wikimedia.org/r/488267 (https://phabricator.wikimedia.org/T214469) (owner: Marostegui)
[06:59:31] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Depool db1084 [mediawiki-config] - https://gerrit.wikimedia.org/r/488816 (https://phabricator.wikimedia.org/T210713)
[06:59:35] <icinga-wm>	 RECOVERY - puppet last run on labmon1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:00:49] <wikibugs>	 (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1084 [mediawiki-config] - https://gerrit.wikimedia.org/r/488816 (https://phabricator.wikimedia.org/T210713) (owner: Marostegui)
[07:01:51] <wikibugs>	 (Merged) jenkins-bot: db-eqiad.php: Depool db1084 [mediawiki-config] - https://gerrit.wikimedia.org/r/488816 (https://phabricator.wikimedia.org/T210713) (owner: Marostegui)
[07:02:03] <wikibugs>	 (CR) jenkins-bot: db-eqiad.php: Depool db1084 [mediawiki-config] - https://gerrit.wikimedia.org/r/488816 (https://phabricator.wikimedia.org/T210713) (owner: Marostegui)
[07:03:22] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1084 (duration: 00m 55s)
[07:03:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:03:35] <marostegui>	 !log Deploy schema change on db1084 - T210713
[07:03:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:03:38] <stashbot>	 T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713
[07:10:42] <wikibugs>	 (CR) Marostegui: "Make sure to review grants to make sure check_mariadb can access those hosts via socket, it has been a long while since we set them up" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/488504 (https://phabricator.wikimedia.org/T172489) (owner: Jcrespo)
[07:25:56] <wikibugs>	 (PS1) Marostegui: Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - https://gerrit.wikimedia.org/r/488828
[07:29:38] <wikibugs>	 Operations, Cloud-VPS, Toolforge, Traffic, Patch-For-Review: Wikimedia varnish rules no longer exempt all Cloud VPS/Toolforge IPs from rate limits (HTTP 429 response) - https://phabricator.wikimedia.org/T213475 (akosiaris) I 've added the capacity to varnish puppet code to augment the wikimed...
[07:34:43] <wikibugs>	 (CR) Elukey: [C: +2] Add staging-db-analytics.eqiad.wmnet CNAME to dbstore1003 [dns] - https://gerrit.wikimedia.org/r/488535 (https://phabricator.wikimedia.org/T210478) (owner: Elukey)
[07:36:10] <wikibugs>	 (PS1) Reedy: Add wikimaniawiki and wikimania2018wiki to some more dblists [mediawiki-config] - https://gerrit.wikimedia.org/r/488829 (https://phabricator.wikimedia.org/T215486)
[07:36:27] <wikibugs>	 (CR) Reedy: [C: +2] Add wikimaniawiki and wikimania2018wiki to some more dblists [mediawiki-config] - https://gerrit.wikimedia.org/r/488829 (https://phabricator.wikimedia.org/T215486) (owner: Reedy)
[07:36:57] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add wikimaniawiki and wikimania2018wiki to some more dblists [mediawiki-config] - https://gerrit.wikimedia.org/r/488829 (https://phabricator.wikimedia.org/T215486) (owner: Reedy)
[07:37:13] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add wikimaniawiki and wikimania2018wiki to some more dblists [mediawiki-config] - https://gerrit.wikimedia.org/r/488829 (https://phabricator.wikimedia.org/T215486) (owner: Reedy)
[07:38:24] <wikibugs>	 Operations, Wiki-Loves-Love, Wikimedia-Mailing-lists: Reset password for wll mailling list - https://phabricator.wikimedia.org/T215390 (Psychoslave) Hello everybody, would it be possible to know how much time in average it takes for such a ticket to be treat, so we can take that into account in how w...
[07:39:15] <wikibugs>	 (PS2) Reedy: Add wikimaniawiki and wikimania2018wiki to some more dblists [mediawiki-config] - https://gerrit.wikimedia.org/r/488829 (https://phabricator.wikimedia.org/T215486)
[07:39:56] <wikibugs>	 (CR) Reedy: [C: +2] Add wikimaniawiki and wikimania2018wiki to some more dblists [mediawiki-config] - https://gerrit.wikimedia.org/r/488829 (https://phabricator.wikimedia.org/T215486) (owner: Reedy)
[07:40:49] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - https://gerrit.wikimedia.org/r/488828 (owner: Marostegui)
[07:40:59] <wikibugs>	 (Merged) jenkins-bot: Add wikimaniawiki and wikimania2018wiki to some more dblists [mediawiki-config] - https://gerrit.wikimedia.org/r/488829 (https://phabricator.wikimedia.org/T215486) (owner: Reedy)
[07:42:02] <wikibugs>	 (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - https://gerrit.wikimedia.org/r/488828 (owner: Marostegui)
[07:42:24] <marostegui>	 Reedy: I will go after you :)
[07:42:27] <logmsgbot>	 !log reedy@deploy1001 Synchronized dblists/: Wikimania T215486 (duration: 00m 54s)
[07:42:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:30] <stashbot>	 T215486: Shortcut interwiki links have wrong target at wikimaniawiki - https://phabricator.wikimedia.org/T215486
[07:43:01] <Reedy>	 marostegui: feel free
[07:43:06] <marostegui>	 Thanks!
[07:43:57] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1084 (duration: 00m 53s)
[07:43:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:11] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] Use standard version of plain-text GPL (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/460731 (owner: 10Legoktm)
[07:44:52] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488835 (https://phabricator.wikimedia.org/T210713)
[07:45:04] <wikibugs>	 (03PS1) 10Reedy: sort dblists... [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488836
[07:46:06] <wikibugs>	 (03PS1) 10Reedy: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488837
[07:46:09] <wikibugs>	 (03CR) 10Reedy: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488837 (owner: 10Reedy)
[07:46:11] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488835 (https://phabricator.wikimedia.org/T210713) (owner: 10Marostegui)
[07:46:20] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] sort dblists... [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488836 (owner: 10Reedy)
[07:46:28] <wikibugs>	 (03CR) 10jenkins-bot: Add wikimaniawiki and wikimania2018wiki to some more dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488829 (https://phabricator.wikimedia.org/T215486) (owner: 10Reedy)
[07:46:30] <wikibugs>	 (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488828 (owner: 10Marostegui)
[07:47:20] <wikibugs>	 (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488837 (owner: 10Reedy)
[07:47:23] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488835 (https://phabricator.wikimedia.org/T210713) (owner: 10Marostegui)
[07:47:33] <wikibugs>	 (03CR) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488837 (owner: 10Reedy)
[07:47:35] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488835 (https://phabricator.wikimedia.org/T210713) (owner: 10Marostegui)
[07:47:47] <marostegui>	 Reedy: After you :)
[07:48:23] <logmsgbot>	 !log reedy@deploy1001 Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 02m 20s)
[07:48:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:40] <wikibugs>	 (03CR) 10Reedy: "Some of these definitely are out of place... I dunno which way round the _ should be. We don't document the correct sorting command, do we" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488836 (owner: 10Reedy)
[07:49:41] <Reedy>	 marostegui: Cheers. That's me done now
[07:49:56] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1081 (duration: 00m 53s)
[07:49:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:20] <marostegui>	 Reedy: :)
[07:50:24] <marostegui>	 !log Deploy schema change on db1081
[07:50:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:31] <wikibugs>	 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Allow Erik Bernhardson to have root access on stat1005 for GPU testing - https://phabricator.wikimedia.org/T215384 (10Joe) I second the idea, and I see @Nuria has given +1 to the patch which I assume can count as manager approval. Given...
[08:09:13] <wikibugs>	 (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1081" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488851
[08:10:56] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1081" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488851 (owner: 10Marostegui)
[08:12:01] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1081" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488851 (owner: 10Marostegui)
[08:12:47] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [debs/helm] (debian/stretch-wikimedia) - 10https://gerrit.wikimedia.org/r/488089 (https://phabricator.wikimedia.org/T215244) (owner: 10Fsero)
[08:13:05] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[08:13:06] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1081 (duration: 00m 54s)
[08:13:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:15] <marostegui>	 elukey: \o/
[08:14:06] <marostegui>	 !log Deploy schema change on s4 primary master (db1068) - T210713
[08:14:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:09] <stashbot>	 T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713
[08:14:50] <elukey>	 marostegui: I think it is still broken :(
[08:15:07] <marostegui>	 :(
[08:16:57] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Update_rows_v1 event on table cywiki.echo_notification: Cant find record in echo_notification, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1069-bin.000334, end_log_pos 550016117
[08:16:59] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s5 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 284.50 seconds
[08:18:15] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[08:18:23] <elukey>	 this time is good \o/
[08:18:36] <elukey>	 sigh too soon
[08:19:05] <wikibugs>	 (03CR) 10Fsero: [C: 03+2] Bump helm to 2.12.2 for security and features [debs/helm] (debian/stretch-wikimedia) - 10https://gerrit.wikimedia.org/r/488089 (https://phabricator.wikimedia.org/T215244) (owner: 10Fsero)
[08:20:05] <wikibugs>	 (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1081" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488851 (owner: 10Marostegui)
[08:23:21] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Could not execute Write_rows_v1 event on table wikishared.echo_unread_wikis: Duplicate entry 34079543-enwiki for key echo_unread_wikis_user_wiki, Error_code: 1062: handler error HA_ERR_FOUND_DUPP_KEY: the events master log db1069-bin.000334, end_log_pos 555529662
[08:32:21] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[08:34:24] <godog>	 !log swift codfw-prod: more weight to ms-be2047 - T209395 T209921
[08:34:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:28] <stashbot>	 T209395: rack/setup/install new ms-be servers ms-be204[4-9] ,ms-be2050 - https://phabricator.wikimedia.org/T209395
[08:34:28] <stashbot>	 T209921: ms-be2047 spontaneous reboots - https://phabricator.wikimedia.org/T209921
[08:36:13] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Update_rows_v1 event on table itwiki.echo_notification: Cant find record in echo_notification, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1069-bin.000334, end_log_pos 560643293
[08:39:17] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to deployment, contint-admins, and contint-docker for Brennen Bearnes - https://phabricator.wikimedia.org/T215328 (10Joe) Hi @brennen - before I can grant you access some things are needed:  - Please read and sign https://phabricator.wikimedia.org/L3 if yo...
[08:39:27] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to deployment, contint-admins, and contint-docker for Brennen Bearnes - https://phabricator.wikimedia.org/T215328 (10Joe) a:03Joe
[08:42:53] <wikibugs>	 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Allow Erik Bernhardson to have root access on stat1005 for GPU testing - https://phabricator.wikimedia.org/T215384 (10Joe) a:03Joe
[08:45:14] <wikibugs>	 10Operations, 10Research, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to researchers group for phuedx - https://phabricator.wikimedia.org/T214957 (10Joe) I guess this is ok as long as @Nuria  approves the addition.
[08:45:24] <wikibugs>	 10Operations, 10Research, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to researchers group for phuedx - https://phabricator.wikimedia.org/T214957 (10Joe) a:03Joe
[08:46:21] <wikibugs>	 10Operations, 10Cloud-VPS, 10SRE-Access-Requests, 10cloud-services-team, and 2 others: Create cloudelastic-root group - https://phabricator.wikimedia.org/T214922 (10Joe) Hi @Mathew.onipe I'd need more context on why we want to create this group please.
[08:46:26] <wikibugs>	 10Operations, 10Research, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to researchers group for phuedx - https://phabricator.wikimedia.org/T214957 (10elukey) I think that it is fine to proceed in this case! :)
[08:47:01] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: admins: add phuedx to researchers [puppet] - 10https://gerrit.wikimedia.org/r/488595 (https://phabricator.wikimedia.org/T214957) (owner: 10Dzahn)
[08:48:40] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] admins: add phuedx to researchers [puppet] - 10https://gerrit.wikimedia.org/r/488595 (https://phabricator.wikimedia.org/T214957) (owner: 10Dzahn)
[08:51:21] <wikibugs>	 10Operations, 10Research, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to researchers group for phuedx - https://phabricator.wikimedia.org/T214957 (10Joe) Yes, it is fine, she also gave +1 to the patch already. Merging it. Thanks @DZahn for writing the patch.  @phuedx you should have your a...
[08:51:30] <wikibugs>	 10Operations, 10Research, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to researchers group for phuedx - https://phabricator.wikimedia.org/T214957 (10Joe) 05Open→03Resolved
[08:53:01] <marostegui>	 !log Deploy schema change on s7 codfw master (db2047), this will generate lag on s7 codfw - T210713
[08:53:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:07] <stashbot>	 T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713
[08:54:11] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s6 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 44.48 seconds
[09:00:21] <icinga-wm>	 RECOVERY - EDAC syslog messages on db1068 is OK: (C)4 ge (W)2 ge 1 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=db1068&var-datasource=eqiad+prometheus/ops
[09:01:45] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work), 10Patch-For-Review: Analytics query access for search platform NLP contractor @Julia.glen - https://phabricator.wikimedia.org/T214623 (10Joe) a:03Dzahn
[09:02:10] <wikibugs>	 10Operations, 10DBA, 10Patch-For-Review: correctable memory errors db1068 (commons primary master database) - https://phabricator.wikimedia.org/T213664 (10Marostegui) 05Open→03Resolved And back again: `RECOVERY - EDAC syslog messages on db1068 is OK: (C)4 ge (W)2 ge 1`  As Jaime said: T213664#4924636 thi...
[09:04:27] <wikibugs>	 (03CR) 10Jcrespo: ">" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/488504 (https://phabricator.wikimedia.org/T172489) (owner: 10Jcrespo)
[09:05:35] <wikibugs>	 (03CR) 10Marostegui: "> >" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/488504 (https://phabricator.wikimedia.org/T172489) (owner: 10Jcrespo)
[09:09:02] <wikibugs>	 (03CR) 10Jcrespo: "> > >" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/488504 (https://phabricator.wikimedia.org/T172489) (owner: 10Jcrespo)
[09:10:01] <wikibugs>	 (03CR) 10DCausse: [C: 03+1] "One known reason for stale results to have appeared recently is the activation of these new clusters but only between Jan 16 and 24, perio" [puppet] - 10https://gerrit.wikimedia.org/r/487924 (https://phabricator.wikimedia.org/T215199) (owner: 10EBernhardson)
[09:15:29] <fsero>	 !log uploading helm and tiller 2.12.2 deb package to stretch and jessie
[09:15:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:16] <wikibugs>	 (03CR) 10Marostegui: "> > > >" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/488504 (https://phabricator.wikimedia.org/T172489) (owner: 10Jcrespo)
[09:18:39] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2037 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:18:44] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "hm this is for debian/stretch-wikimedia. This probably belongs in master as well." [debs/helm] (debian/stretch-wikimedia) - 10https://gerrit.wikimedia.org/r/488089 (https://phabricator.wikimedia.org/T215244) (owner: 10Fsero)
[09:18:51] <icinga-wm>	 PROBLEM - puppet last run on stat1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[researchers_ensure_members]
[09:19:13] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: admin: add dsharpe, give access to deployment/analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/488880 (https://phabricator.wikimedia.org/T214130)
[09:20:15] <icinga-wm>	 RECOVERY - Memory correctable errors -EDAC- on db1068 is OK: (C)4 ge (W)2 ge 1 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=db1068&var-datasource=eqiad+prometheus/ops
[09:23:17] <jynus>	 !log running alter table on db2055 for performance testing T212092
[09:23:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:20] <stashbot>	 T212092: Provide a strategy for testing the performance of queries needed to show the list of user-agents for each IP - https://phabricator.wikimedia.org/T212092
[09:24:24] <wikibugs>	 10Operations, 10Proton, 10Reading-Infrastructure-Team-Backlog, 10Traffic, and 3 others: Document and possibly fine-tune how Proton interacts with Varnish - https://phabricator.wikimedia.org/T213371 (10akosiaris) >>! In T213371#4932956, @pmiazga wrote: > @Tgr I assume you're still waiting for answers from @...
[09:24:40] <wikibugs>	 10Operations, 10Gerrit, 10Icinga, 10Release-Engineering-Team, and 2 others: gerrit: Add a icinga check that uses the healthcheck endpoint - https://phabricator.wikimedia.org/T215457 (10hashar)
[09:24:43] <wikibugs>	 10Operations, 10Gerrit, 10Icinga, 10monitoring, and 2 others: Install "healthcheck" plugin on gerrit - https://phabricator.wikimedia.org/T214326 (10hashar)
[09:25:09] <wikibugs>	 (03CR) 10Jcrespo: "So +1 ?" [puppet] - 10https://gerrit.wikimedia.org/r/488504 (https://phabricator.wikimedia.org/T172489) (owner: 10Jcrespo)
[09:25:37] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] mariadb: Set read_only monitoring for core_test hosts [puppet] - 10https://gerrit.wikimedia.org/r/488504 (https://phabricator.wikimedia.org/T172489) (owner: 10Jcrespo)
[09:25:54] <wikibugs>	 (03CR) 10Fsero: "> Patch Set 4:" [debs/helm] (debian/stretch-wikimedia) - 10https://gerrit.wikimedia.org/r/488089 (https://phabricator.wikimedia.org/T215244) (owner: 10Fsero)
[09:26:15] <wikibugs>	 (03PS3) 10Jcrespo: mariadb: Set read_only monitoring for core_test hosts [puppet] - 10https://gerrit.wikimedia.org/r/488504 (https://phabricator.wikimedia.org/T172489)
[09:27:22] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] mariadb: Set read_only monitoring for core_test hosts [puppet] - 10https://gerrit.wikimedia.org/r/488504 (https://phabricator.wikimedia.org/T172489) (owner: 10Jcrespo)
[09:30:44] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] admin: add dsharpe, give access to deployment/analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/488880 (https://phabricator.wikimedia.org/T214130) (owner: 10Giuseppe Lavagetto)
[09:36:17] <wikibugs>	 10Operations, 10Cloud-VPS, 10SRE-Access-Requests, 10cloud-services-team, and 2 others: Create cloudelastic-root group - https://phabricator.wikimedia.org/T214922 (10Mathew.onipe) Hi @Joe  cloudelastic is a replica of cirrussearch like labsdb* is to maps*. So this group separates access to cloudelastic and...
[09:36:30] <wikibugs>	 10Operations, 10serviceops, 10vm-requests, 10Release-Engineering-Team (Watching / External): Increase mwdebugXXXX hosts CPU and memory(?) - https://phabricator.wikimedia.org/T212955 (10hashar) I think @fsero / @akosiaris should be able to bump the number of CPUs on those Ganeti instances :-] We can try wit...
[09:36:49] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s2 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 274.59 seconds
[09:37:09] <wikibugs>	 10Operations, 10Performance-Team, 10Traffic, 10media-storage: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10fgiunchedi) >>! In T211661#4931840, @ori wrote: >>>! In T211661#4931056, @fgiunchedi wrote: >> And indeed I share the concerns already mentioned, na...
[09:40:09] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:41:49] <akosiaris>	 !log reboot mwdebug1001, mwdebug1002, mwdebug2001, mwdebug2002 for VCPU upgrade. T212955
[09:41:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:41:52] <stashbot>	 T212955: Increase mwdebugXXXX hosts CPU and memory(?) - https://phabricator.wikimedia.org/T212955
[09:42:14] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[09:42:18] <wikibugs>	 10Operations, 10Elasticsearch, 10Discovery-Search (Current work), 10Patch-For-Review: Create Icinga check for failed shard allocation - https://phabricator.wikimedia.org/T212850 (10fgiunchedi)
[09:42:54] <icinga-wm>	 PROBLEM - Host mwdebug2002 is DOWN: PING CRITICAL - Packet loss = 100%
[09:43:48] <icinga-wm>	 RECOVERY - Host mwdebug2002 is UP: PING OK - Packet loss = 0%, RTA = 36.39 ms
[09:44:32] <icinga-wm>	 RECOVERY - puppet last run on stat1006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[09:46:04] <wikibugs>	 10Operations, 10serviceops, 10vm-requests, 10Release-Engineering-Team (Watching / External): Increase mwdebugXXXX hosts CPU - https://phabricator.wikimedia.org/T212955 (10akosiaris)
[09:46:13] <hashar>	 akosiaris: that was fast :)
[09:46:42] <hashar>	 feel free to mark the task resolved
[09:47:38] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2037 is OK: OK - running: The system is fully operational
[09:49:04] <wikibugs>	 10Operations, 10Scap, 10Release-Engineering-Team (Watching / External): mwdebug1001 and mwdebug1002 are reliably the last two hosts to finish scap-cdb-rebuild - https://phabricator.wikimedia.org/T203625 (10akosiaris)
[09:49:06] <wikibugs>	 10Operations, 10serviceops, 10vm-requests, 10Release-Engineering-Team (Watching / External): Increase mwdebugXXXX hosts CPU - https://phabricator.wikimedia.org/T212955 (10akosiaris) 05Open→03Resolved a:03akosiaris I 've removed the memory part cause https://grafana.wikimedia.org/d/000000377/host-over...
[09:49:57] <marostegui>	 !log Deploy schema change on db1116 - T210713
[09:49:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:59] <stashbot>	 T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713
[09:53:08] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Update_rows_v1 event on table wikishared.echo_unread_wikis: Cant find record in echo_unread_wikis, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1069-bin.000334, end_log_pos 644994322
[09:53:16] <elukey>	 reallyyyyyyy
[09:53:18] <wikibugs>	 10Operations, 10Scap, 10Release-Engineering-Team (Watching / External): mwdebug1001 and mwdebug1002 are reliably the last two hosts to finish scap-cdb-rebuild - https://phabricator.wikimedia.org/T203625 (10hashar) The hosts mwdebug1001, mwdebug1002, mwdebug2001, mwdebug2002 now have four vCPUs allocated (was...
[09:53:23] <elukey>	 sigh
[09:53:31] <marostegui>	 hahaha
[09:58:00] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[10:01:44] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table wikidatawiki.echo_notification: Cant find record in echo_notification, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1069-bin.000334, end_log_pos 664164087
[10:05:28] <wikibugs>	 (03PS2) 10Jcrespo: Revert "mariadb: Depool db2055 for performance testing" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488483
[10:08:12] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s3 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 232.30 seconds
[10:10:56] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: spamassasion: Skip localhost entries [puppet] - 10https://gerrit.wikimedia.org/r/488894
[10:12:42] <wikibugs>	 (03PS32) 10DCausse: [cirrus] Cleanup transitional states [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381)
[10:12:44] <wikibugs>	 (03PS1) 10DCausse: [cirrus] Start using local nginx reverse proxy for connections reuse [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488895 (https://phabricator.wikimedia.org/T215491)
[10:13:35] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [cirrus] Cleanup transitional states [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse)
[10:13:37] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [cirrus] Start using local nginx reverse proxy for connections reuse [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488895 (https://phabricator.wikimedia.org/T215491) (owner: 10DCausse)
[10:14:25] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: spamassasion: Skip localhost entries [puppet] - 10https://gerrit.wikimedia.org/r/488894
[10:15:00] <wikibugs>	 10Operations, 10Performance-Team, 10Traffic, 10media-storage: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10Gilles) I thought we would start with a very low percentage and ramp it up gradually. And yes, I thought our beloved swift proxy is where it would l...
[10:16:39] <wikibugs>	 10Operations, 10Performance-Team, 10Traffic, 10media-storage: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10jijiki) Is it possible to hold this a bit for until after we upgrade all Thumbor servers to stretch? Two birds with one stone :)
[10:16:43] <wikibugs>	 (03PS2) 10DCausse: [cirrus] Start using local nginx reverse proxy for connections reuse [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488895 (https://phabricator.wikimedia.org/T215491)
[10:16:45] <wikibugs>	 (03PS33) 10DCausse: [cirrus] Cleanup transitional states [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381)
[10:21:32] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2020 is OK: OK - running: The system is fully operational
[10:21:36] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] spamassasion: Skip localhost entries [puppet] - 10https://gerrit.wikimedia.org/r/488894 (owner: 10Alexandros Kosiaris)
[10:23:36] <wikibugs>	 10Operations, 10Performance-Team, 10Traffic, 10media-storage: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10Gilles) I'd argue that we don't want both changes to happen around the same time. And this is probably less prone to emergency bugfixes than the Str...
[10:28:36] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[10:28:41] <elukey>	 let's see
[10:30:50] <marostegui>	 broke again :(
[10:31:33] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudcontrol2001-dev: fix PTR record [dns] - 10https://gerrit.wikimedia.org/r/488896 (https://phabricator.wikimedia.org/T214448)
[10:31:45] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: admin: add dsharpe, give access to deployment/analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/488880 (https://phabricator.wikimedia.org/T214130)
[10:31:52] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudcontrol2001-dev: fix PTR record [dns] - 10https://gerrit.wikimedia.org/r/488896 (https://phabricator.wikimedia.org/T214448) (owner: 10Arturo Borrero Gonzalez)
[10:32:12] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] admin: add dsharpe, give access to deployment/analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/488880 (https://phabricator.wikimedia.org/T214130) (owner: 10Giuseppe Lavagetto)
[10:32:26] <_joe_>	 arturo: uh?
[10:32:30] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Update_rows_v1 event on table enwiki.echo_notification: Cant find record in echo_notification, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1069-bin.000334, end_log_pos 684451049
[10:32:36] <arturo>	 _joe_: ?
[10:32:59] <_joe_>	 arturo: that's not strictly related to the diffscan results, right?
[10:33:16] <arturo>	 _joe_: I guess not, but I'm reviewing all the stuff and found this inconsistency
[10:33:29] <_joe_>	 sure, sure :)
[10:33:56] <_joe_>	 I wasn't sure how it related, it's good to fix stuff anyways, I just didn't get the relationship :)
[10:34:42] <arturo>	 I think the problem is perhaps the server has no role applied
[10:35:14] <_joe_>	 not even "standard"?
[10:35:24] * arturo nods
[10:36:08] <_joe_>	 ok that looks like an issue
[10:36:25] <arturo>	 they were imaged just yesterday I think
[10:36:53] <_joe_>	 they should usually get role "spare::system" or whatever applied
[10:37:38] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[10:38:10] <icinga-wm>	 PROBLEM - puppet last run on mw2279 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[10:39:03] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudcontrol2001-dev: spare system for now [puppet] - 10https://gerrit.wikimedia.org/r/488897 (https://phabricator.wikimedia.org/T214448)
[10:39:08] <icinga-wm>	 PROBLEM - puppet last run on mwlog2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[10:39:16] <arturo>	 _joe_: T214448
[10:39:16] <stashbot>	 T214448: rack/setup/install cloudcontrol2001-dev & cloudvirt200[123]-dev - https://phabricator.wikimedia.org/T214448
[10:39:20] <icinga-wm>	 PROBLEM - puppet last run on mw2286 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[10:39:24] <arturo>	 _joe_: sorry https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/488897/
[10:39:32] <icinga-wm>	 PROBLEM - puppet last run on mw1226 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[10:39:32] <icinga-wm>	 PROBLEM - puppet last run on mw1245 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[10:39:52] <icinga-wm>	 PROBLEM - puppet last run on mw2259 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[10:39:54] <icinga-wm>	 PROBLEM - puppet last run on mw2203 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[10:39:55] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] cloudcontrol2001-dev: spare system for now [puppet] - 10https://gerrit.wikimedia.org/r/488897 (https://phabricator.wikimedia.org/T214448) (owner: 10Arturo Borrero Gonzalez)
[10:40:09] <_joe_>	 oh the puppet failures are my fault
[10:40:29] <_joe_>	 but they're going away on a second run. I'll fix it
[10:40:56] <icinga-wm>	 PROBLEM - puppet last run on mw2208 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[10:41:02] <icinga-wm>	 PROBLEM - puppet last run on mw2236 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[10:41:22] <icinga-wm>	 PROBLEM - puppet last run on mw2226 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[10:41:22] <icinga-wm>	 PROBLEM - puppet last run on mw2230 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[10:41:22] <icinga-wm>	 PROBLEM - puppet last run on mw2146 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[10:41:30] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Update_rows_v1 event on table commonswiki.echo_notification: Cant find record in echo_notification, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1069-bin.000334, end_log_pos 718537987
[10:41:46] <icinga-wm>	 PROBLEM - puppet last run on mw2262 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[10:41:52] <icinga-wm>	 PROBLEM - puppet last run on labtestweb2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[10:41:58] <icinga-wm>	 PROBLEM - puppet last run on an-master1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[analytics-privatedata-users_ensure_members]
[10:42:00] <icinga-wm>	 PROBLEM - puppet last run on mw2185 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[10:42:02] <icinga-wm>	 PROBLEM - puppet last run on mw2250 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[10:42:35] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] cloudcontrol2001-dev: spare system for now [puppet] - https://gerrit.wikimedia.org/r/488897 (https://phabricator.wikimedia.org/T214448) (owner: Arturo Borrero Gonzalez)
[10:42:48] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[10:43:18] <icinga-wm>	 PROBLEM - puppet last run on people1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[all-users_ensure_members]
[10:43:31] <marostegui>	 !log Run mysqldump from dbstore1003 to dump dbstore1002:staging.mep_word_persistence - T215450
[10:43:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:34] <stashbot>	 T215450: Sqoop staging.mep_word_persistence to HDFS and drop the table from dbstore1002 - https://phabricator.wikimedia.org/T215450
[10:44:18] <icinga-wm>	 PROBLEM - puppet last run on stat1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[analytics-privatedata-users_ensure_members]
[10:44:18] <icinga-wm>	 PROBLEM - puppet last run on notebook1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[analytics-privatedata-users_ensure_members]
[10:46:42] <wikibugs>	 Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to production for dsharpe - https://phabricator.wikimedia.org/T214130 (Joe) a:Joe
[10:47:16] <icinga-wm>	 RECOVERY - puppet last run on an-master1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:49:36] <icinga-wm>	 RECOVERY - puppet last run on notebook1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[10:49:36] <icinga-wm>	 RECOVERY - puppet last run on stat1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[10:50:20] <icinga-wm>	 PROBLEM - puppet last run on mw2143 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[10:50:50] <icinga-wm>	 PROBLEM - puppet last run on mw2156 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 16 seconds ago with 1 failures. Failed resources (up to 3 shown)
[10:51:56] <icinga-wm>	 RECOVERY - puppet last run on mw2146 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:52:28] <icinga-wm>	 RECOVERY - puppet last run on labtestweb2001 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[10:53:30] <icinga-wm>	 PROBLEM - puppet last run on mw1348 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 10 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[10:53:30] <icinga-wm>	 PROBLEM - puppet last run on mw2196 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 10 seconds ago with 1 failures. Failed resources (up to 3 shown)
[10:53:38] <icinga-wm>	 PROBLEM - puppet last run on mw2192 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 25 seconds ago with 1 failures. Failed resources (up to 3 shown)
[10:53:56] <_joe_>	 these will autorecover soon
[10:54:01] <wikibugs>	 Operations, Wikimedia-Logstash, User-herron: Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs - https://phabricator.wikimedia.org/T213899 (fgiunchedi)
[10:54:04] <icinga-wm>	 PROBLEM - puppet last run on mw2274 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 42 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[10:54:16] <icinga-wm>	 PROBLEM - puppet last run on mw2197 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 58 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[10:54:21] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: cloudvirt200X-dev: add roles in codfw1dev [puppet] - https://gerrit.wikimedia.org/r/488899 (https://phabricator.wikimedia.org/T214448)
[10:54:28] <icinga-wm>	 PROBLEM - puppet last run on mw2204 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 59 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[10:55:10] <icinga-wm>	 PROBLEM - puppet last run on mw1222 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[10:55:13] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Depool db1101:3317,3318 [mediawiki-config] - https://gerrit.wikimedia.org/r/488900 (https://phabricator.wikimedia.org/T210713)
[10:55:18] <icinga-wm>	 PROBLEM - puppet last run on mw1320 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[10:55:26] <icinga-wm>	 PROBLEM - puppet last run on mw1284 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 23 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[10:55:28] <icinga-wm>	 PROBLEM - puppet last run on mw1240 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[10:55:36] <icinga-wm>	 RECOVERY - puppet last run on mw2143 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[10:55:46] <icinga-wm>	 PROBLEM - puppet last run on mw2259 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[10:55:50] <icinga-wm>	 PROBLEM - puppet last run on mw2248 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 seconds ago with 1 failures. Failed resources (up to 3 shown)
[10:55:54] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Update_rows_v1 event on table wikishared.echo_unread_wikis: Cant find record in echo_unread_wikis, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1069-bin.000334, end_log_pos 863963344
[10:55:54] <icinga-wm>	 PROBLEM - puppet last run on bast1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[all-users_ensure_members]
[10:56:01] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] cloudvirt200X-dev: add roles in codfw1dev [puppet] - https://gerrit.wikimedia.org/r/488899 (https://phabricator.wikimedia.org/T214448) (owner: Arturo Borrero Gonzalez)
[10:56:06] <icinga-wm>	 RECOVERY - puppet last run on mw2156 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[10:56:25] <wikibugs>	 (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1101:3317,3318 [mediawiki-config] - https://gerrit.wikimedia.org/r/488900 (https://phabricator.wikimedia.org/T210713) (owner: Marostegui)
[10:57:00] <icinga-wm>	 PROBLEM - puppet last run on mw2238 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[10:57:29] <wikibugs>	 (Merged) jenkins-bot: db-eqiad.php: Depool db1101:3317,3318 [mediawiki-config] - https://gerrit.wikimedia.org/r/488900 (https://phabricator.wikimedia.org/T210713) (owner: Marostegui)
[10:57:41] <wikibugs>	 (CR) jenkins-bot: db-eqiad.php: Depool db1101:3317,3318 [mediawiki-config] - https://gerrit.wikimedia.org/r/488900 (https://phabricator.wikimedia.org/T210713) (owner: Marostegui)
[10:57:52] <icinga-wm>	 RECOVERY - puppet last run on mw2185 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[10:58:44] <icinga-wm>	 PROBLEM - puppet last run on mw2223 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[10:58:45] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1101 for alter and mysql upgrade (duration: 00m 56s)
[10:58:46] <icinga-wm>	 RECOVERY - puppet last run on mw1348 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[10:58:46] <icinga-wm>	 RECOVERY - puppet last run on mw2196 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[10:58:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:58:54] <icinga-wm>	 PROBLEM - puppet last run on mw1282 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[10:58:56] <icinga-wm>	 RECOVERY - puppet last run on mw2192 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[10:58:58] <wikibugs>	 Operations, Proton, Security-Team, Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q3), Reading-Infrastructure-Team-Backlog (Kanban): [2 hrs] Decide on handling system updates for Proton - https://phabricator.wikimedia.org/T213366 (hashar) TLDR: in the CI job, puppeteer does not down...
[10:59:04] <icinga-wm>	 PROBLEM - puppet last run on mw1296 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 27 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[10:59:20] <icinga-wm>	 RECOVERY - puppet last run on mw2274 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[10:59:22] <icinga-wm>	 PROBLEM - puppet last run on mw1313 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[10:59:22] <icinga-wm>	 PROBLEM - puppet last run on mw2252 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[10:59:32] <icinga-wm>	 RECOVERY - puppet last run on mw2197 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[10:59:42] <icinga-wm>	 PROBLEM - puppet last run on mw1271 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[10:59:44] <icinga-wm>	 RECOVERY - puppet last run on mw2204 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[10:59:52] <icinga-wm>	 PROBLEM - puppet last run on bast4002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[all-users_ensure_members]
[11:00:16] <icinga-wm>	 PROBLEM - puppet last run on cloudvirt2001-dev is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:00:20] <icinga-wm>	 RECOVERY - puppet last run on mwlog2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:00:26] <icinga-wm>	 RECOVERY - puppet last run on mw1222 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:00:30] <icinga-wm>	 RECOVERY - puppet last run on mw2286 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:00:32] <icinga-wm>	 PROBLEM - puppet last run on mw2269 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[11:00:32] <icinga-wm>	 PROBLEM - puppet last run on mw2227 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 8 seconds ago with 1 failures. Failed resources (up to 3 shown)
[11:00:34] <icinga-wm>	 RECOVERY - puppet last run on mw1320 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[11:00:44] <icinga-wm>	 RECOVERY - puppet last run on mw1240 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[11:00:44] <icinga-wm>	 RECOVERY - puppet last run on mw1284 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:00:56] <icinga-wm>	 PROBLEM - puppet last run on mw1304 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 13 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[11:01:02] <icinga-wm>	 RECOVERY - puppet last run on mw2259 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:01:06] <icinga-wm>	 RECOVERY - puppet last run on mw2248 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:01:13] <wikibugs>	 Puppet, cloud-services-team (Kanban): ops/puppet: generalize systemd resource control for users - https://phabricator.wikimedia.org/T215401 (elukey) So user ids are set in the admin module's data.yaml:  ` elukey@stat1006:~$ id elukey uid=13926(elukey)    elukey:     ensure: present     gid: 500     name:...
[11:01:18] <icinga-wm>	 PROBLEM - puppet last run on mw1267 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[11:01:32] <icinga-wm>	 PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[all-users_ensure_members]
[11:02:14] <icinga-wm>	 RECOVERY - puppet last run on mw2238 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[11:02:24] <wikibugs>	 Operations, Performance-Team, Traffic, media-storage: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (jijiki) >>! In T211661#4934183, @Gilles wrote: > I'd argue that we don't want both changes to happen around the same time. And this is probably less...
[11:02:30] <icinga-wm>	 RECOVERY - puppet last run on mw2230 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:02:32] <icinga-wm>	 PROBLEM - puppet last run on mw2277 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 11 seconds ago with 1 failures. Failed resources (up to 3 shown)
[11:02:36] <icinga-wm>	 PROBLEM - puppet last run on mw2241 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[11:02:36] <icinga-wm>	 PROBLEM - puppet last run on mw2170 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 25 seconds ago with 1 failures. Failed resources (up to 3 shown)
[11:03:14] <icinga-wm>	 PROBLEM - puppet last run on mw2250 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[11:04:00] <icinga-wm>	 RECOVERY - puppet last run on mw2223 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[11:04:04] <icinga-wm>	 PROBLEM - puppet last run on mw1251 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[11:04:06] <icinga-wm>	 PROBLEM - puppet last run on mw2206 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[11:04:10] <icinga-wm>	 RECOVERY - puppet last run on mw1282 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:04:12] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: hiera: cloudvirt200X-dev: add hosts overrides [puppet] - https://gerrit.wikimedia.org/r/488902 (https://phabricator.wikimedia.org/T214448)
[11:04:14] <icinga-wm>	 PROBLEM - puppet last run on mw2217 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 59 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[11:04:20] <icinga-wm>	 RECOVERY - puppet last run on mw1296 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:04:34] <icinga-wm>	 RECOVERY - puppet last run on mw2279 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[11:04:38] <icinga-wm>	 RECOVERY - puppet last run on mw1313 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[11:04:38] <icinga-wm>	 RECOVERY - puppet last run on mw2252 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[11:04:38] <icinga-wm>	 PROBLEM - puppet last run on mw2220 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 35 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[11:04:46] <icinga-wm>	 PROBLEM - puppet last run on mw2275 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[11:04:52] <icinga-wm>	 PROBLEM - puppet last run on mw1229 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 32 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[11:04:56] <icinga-wm>	 RECOVERY - puppet last run on mw1271 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[11:04:58] <icinga-wm>	 PROBLEM - puppet last run on mw1319 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 58 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[11:05:03] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] hiera: cloudvirt200X-dev: add hosts overrides [puppet] - https://gerrit.wikimedia.org/r/488902 (https://phabricator.wikimedia.org/T214448) (owner: Arturo Borrero Gonzalez)
[11:05:26] <icinga-wm>	 PROBLEM - puppet last run on mw1265 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 48 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[11:05:28] <icinga-wm>	 PROBLEM - puppet last run on mw1272 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 26 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[11:05:48] <icinga-wm>	 RECOVERY - puppet last run on mw2269 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[11:05:48] <icinga-wm>	 RECOVERY - puppet last run on mw2227 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:05:50] <icinga-wm>	 PROBLEM - puppet last run on mw2165 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[11:06:02] <icinga-wm>	 PROBLEM - puppet last run on mw1247 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[11:06:02] <icinga-wm>	 RECOVERY - puppet last run on mw1245 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[11:06:12] <icinga-wm>	 RECOVERY - puppet last run on mw1304 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:06:12] <icinga-wm>	 PROBLEM - puppet last run on mw1327 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 30 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[11:06:14] <icinga-wm>	 PROBLEM - puppet last run on mw2199 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 seconds ago with 1 failures. Failed resources (up to 3 shown)
[11:06:20] <icinga-wm>	 RECOVERY - puppet last run on mw2203 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[11:06:22] <icinga-wm>	 PROBLEM - puppet last run on mw2289 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 49 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[11:06:22] <icinga-wm>	 PROBLEM - puppet last run on mw2225 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 12 seconds ago with 1 failures. Failed resources (up to 3 shown)
[11:06:34] <icinga-wm>	 RECOVERY - puppet last run on mw1267 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[11:06:36] <icinga-wm>	 PROBLEM - puppet last run on mw1337 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 41 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[11:07:24] <icinga-wm>	 PROBLEM - puppet last run on mw1294 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 32 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[11:07:32] <icinga-wm>	 PROBLEM - puppet last run on mwdebug2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[11:07:35] <wikibugs>	 (PS5) Elukey: Introduce systemd::slice::all_users [puppet] - https://gerrit.wikimedia.org/r/488077 (https://phabricator.wikimedia.org/T212824)
[11:07:46] <icinga-wm>	 PROBLEM - puppet last run on mw2186 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 11 seconds ago with 1 failures. Failed resources (up to 3 shown)
[11:07:46] <icinga-wm>	 RECOVERY - puppet last run on mw2226 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[11:07:46] <icinga-wm>	 RECOVERY - puppet last run on mw2277 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:07:48] <icinga-wm>	 PROBLEM - puppet last run on mw2281 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 15 seconds ago with 1 failures. Failed resources (up to 3 shown)
[11:07:51] <wikibugs>	 Operations, serviceops, User-jijiki: Fix spamassassin's "warn: netset: cannot include <network>" warning - https://phabricator.wikimedia.org/T215496 (jijiki) p:Triage→Normal
[11:07:52] <icinga-wm>	 RECOVERY - puppet last run on mw2170 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:07:52] <icinga-wm>	 RECOVERY - puppet last run on mw2241 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[11:07:54] <icinga-wm>	 PROBLEM - puppet last run on mw2239 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 23 seconds ago with 1 failures. Failed resources (up to 3 shown)
[11:07:54] <icinga-wm>	 PROBLEM - puppet last run on mw2222 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[11:08:30] <icinga-wm>	 RECOVERY - puppet last run on mw2250 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:09:20] <icinga-wm>	 RECOVERY - puppet last run on mw1251 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[11:09:22] <icinga-wm>	 RECOVERY - puppet last run on mw2206 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[11:09:30] <icinga-wm>	 RECOVERY - puppet last run on mw2217 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:09:32] <icinga-wm>	 PROBLEM - puppet last run on mw1328 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[11:09:40] <icinga-wm>	 PROBLEM - puppet last run on mw1308 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 20 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[11:09:40] <icinga-wm>	 PROBLEM - puppet last run on mw1273 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[11:09:40] <icinga-wm>	 PROBLEM - puppet last run on mw1274 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 50 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[11:09:40] <icinga-wm>	 PROBLEM - puppet last run on mw2218 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 56 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[11:09:44] <_joe_>	 sorry for the spam
[11:09:46] <icinga-wm>	 RECOVERY - puppet last run on people1001 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[11:09:52] <icinga-wm>	 PROBLEM - DPKG on cloudvirt2003-dev is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[11:09:52] <icinga-wm>	 PROBLEM - puppet last run on mw2209 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 48 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[11:09:56] <icinga-wm>	 RECOVERY - puppet last run on mw2220 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:10:02] <icinga-wm>	 RECOVERY - puppet last run on mw2275 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[11:10:04] <icinga-wm>	 PROBLEM - DPKG on cloudvirt2001-dev is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[11:10:08] <icinga-wm>	 RECOVERY - puppet last run on mw1229 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:10:14] <icinga-wm>	 RECOVERY - puppet last run on mw1319 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[11:10:14] <icinga-wm>	 PROBLEM - puppet last run on mw1303 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 8 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[11:10:24] <icinga-wm>	 PROBLEM - puppet last run on mw1281 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 12 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[11:10:24] <icinga-wm>	 PROBLEM - puppet last run on bast2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[all-users_ensure_members]
[11:10:36] <icinga-wm>	 PROBLEM - puppet last run on mw2167 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 10 seconds ago with 1 failures. Failed resources (up to 3 shown)
[11:10:42] <icinga-wm>	 RECOVERY - puppet last run on mw1265 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[11:10:44] <icinga-wm>	 RECOVERY - puppet last run on mw1272 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:11:04] <icinga-wm>	 RECOVERY - puppet last run on mw2165 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:11:08] <icinga-wm>	 PROBLEM - puppet last run on mw2216 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 seconds ago with 1 failures. Failed resources (up to 3 shown)
[11:11:18] <icinga-wm>	 RECOVERY - puppet last run on mw1247 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[11:11:18] <icinga-wm>	 RECOVERY - puppet last run on mw1226 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:11:22] <icinga-wm>	 RECOVERY - DPKG on cloudvirt2001-dev is OK: All packages OK
[11:11:28] <icinga-wm>	 RECOVERY - puppet last run on mw1327 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:11:30] <icinga-wm>	 RECOVERY - puppet last run on mw2199 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:11:30] <icinga-wm>	 PROBLEM - puppet last run on mw1346 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 58 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[11:11:38] <icinga-wm>	 RECOVERY - puppet last run on mw2289 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:11:38] <icinga-wm>	 RECOVERY - puppet last run on mw2225 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:11:48] <icinga-wm>	 PROBLEM - puppet last run on mw1314 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 22 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[11:11:52] <icinga-wm>	 RECOVERY - puppet last run on mw1337 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[11:11:56] <icinga-wm>	 PROBLEM - puppet last run on mw2179 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 31 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members]
[11:12:26] <icinga-wm>	 RECOVERY - DPKG on cloudvirt2003-dev is OK: All packages OK
[11:12:36] <icinga-wm>	 RECOVERY - puppet last run on mw2208 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[11:12:38] <icinga-wm>	 RECOVERY - puppet last run on mw1294 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:12:44] <icinga-wm>	 RECOVERY - puppet last run on mw2236 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[11:12:46] <icinga-wm>	 RECOVERY - puppet last run on mwdebug2002 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[11:12:46] <wikibugs>	 (CR) BBlack: [C: +1] ssl: get rid of the expired digicert-2017 certificate [puppet] - https://gerrit.wikimedia.org/r/487584 (https://phabricator.wikimedia.org/T215103) (owner: Vgutierrez)
[11:12:52] <wikibugs>	 (PS1) Ladsgroup: Update interwiki cache to have yuewiktionary instead of zh-yue [mediawiki-config] - https://gerrit.wikimedia.org/r/488903 (https://phabricator.wikimedia.org/T214400)
[11:13:02] <icinga-wm>	 RECOVERY - puppet last run on mw2186 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:13:04] <icinga-wm>	 RECOVERY - puppet last run on mw2281 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:13:08] <icinga-wm>	 RECOVERY - puppet last run on mw2239 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:13:08] <icinga-wm>	 RECOVERY - puppet last run on mw2222 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[11:13:30] <icinga-wm>	 RECOVERY - puppet last run on mw2262 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[11:13:54] <icinga-wm>	 PROBLEM - puppet last run on bast5001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[all-users_ensure_members]
[11:14:32] <icinga-wm>	 PROBLEM - Check systemd state on cloudvirt2003-dev is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:14:48] <icinga-wm>	 RECOVERY - puppet last run on mw1328 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:14:49] <wikibugs>	 10Operations, 10Wikimedia-Logstash: Move iegreview from udp2log to syslog - https://phabricator.wikimedia.org/T215497 (10fgiunchedi) p:05Triage→03Normal
[11:14:56] <icinga-wm>	 RECOVERY - puppet last run on mw1308 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:14:56] <icinga-wm>	 RECOVERY - puppet last run on mw1273 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[11:14:56] <icinga-wm>	 RECOVERY - puppet last run on mw1274 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[11:14:56] <icinga-wm>	 RECOVERY - puppet last run on mw2218 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:15:08] <icinga-wm>	 RECOVERY - puppet last run on mw2209 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[11:15:30] <icinga-wm>	 PROBLEM - Check systemd state on cloudvirt2001-dev is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:15:32] <icinga-wm>	 RECOVERY - puppet last run on mw1303 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:15:40] <icinga-wm>	 RECOVERY - puppet last run on mw1281 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:15:52] <icinga-wm>	 RECOVERY - puppet last run on mw2167 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[11:16:26] <icinga-wm>	 RECOVERY - puppet last run on mw2216 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[11:16:28] <fsero>	 !log upgrade helm to 2.12.2 on deploy{1001,2001} and contint{1001,2001}
[11:16:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:16:44] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: hiera: cloudvirt200X-dev: fix wrong hiera keys names [puppet] - 10https://gerrit.wikimedia.org/r/488905 (https://phabricator.wikimedia.org/T214448)
[11:16:46] <icinga-wm>	 RECOVERY - puppet last run on mw1346 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[11:16:46] <icinga-wm>	 PROBLEM - DPKG on cloudvirt2002-dev is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[11:17:02] <icinga-wm>	 RECOVERY - puppet last run on mw1314 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:17:10] <fsero>	 !log upgrade helm to 2.12.2 on deploy{1001,2001} and contint{1001,2001} T215244
[11:17:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:17:12] <icinga-wm>	 RECOVERY - puppet last run on mw2179 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:17:46] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] hiera: cloudvirt200X-dev: fix wrong hiera keys names [puppet] - 10https://gerrit.wikimedia.org/r/488905 (https://phabricator.wikimedia.org/T214448) (owner: 10Arturo Borrero Gonzalez)
[11:18:04] <icinga-wm>	 RECOVERY - DPKG on cloudvirt2002-dev is OK: All packages OK
[11:18:13] <wikibugs>	 10Operations, 10Wikimedia-Logstash: Move wikimania-scholarships from udp2log to syslog - https://phabricator.wikimedia.org/T215499 (10fgiunchedi) p:05Triage→03Normal
[11:18:22] <icinga-wm>	 PROBLEM - puppet last run on cloudvirt2003-dev is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Mount[/var/lib/nova/instances]
[11:19:19] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] ssl: get rid of the expired digicert-2017 certificate [puppet] - 10https://gerrit.wikimedia.org/r/487584 (https://phabricator.wikimedia.org/T215103) (owner: 10Vgutierrez)
[11:19:22] <icinga-wm>	 RECOVERY - Check systemd state on cloudvirt2001-dev is OK: OK - running: The system is fully operational
[11:19:30] <wikibugs>	 (03PS2) 10Vgutierrez: ssl: get rid of the expired digicert-2017 certificate [puppet] - 10https://gerrit.wikimedia.org/r/487584 (https://phabricator.wikimedia.org/T215103)
[11:19:44] <icinga-wm>	 RECOVERY - Check systemd state on cloudvirt2003-dev is OK: OK - running: The system is fully operational
[11:20:44] <icinga-wm>	 PROBLEM - Check systemd state on cloudvirt2002-dev is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:22:31] <wikibugs>	 (03PS9) 10Jbond: Improve CI checks to ensure a basic catalogue compiles on all supported OS's [puppet] - 10https://gerrit.wikimedia.org/r/487882 (https://phabricator.wikimedia.org/T215275)
[11:22:44] <icinga-wm>	 PROBLEM - Host cloudvirt2003-dev is DOWN: PING CRITICAL - Packet loss = 100%
[11:23:06] <icinga-wm>	 PROBLEM - Host cloudvirt2001-dev is DOWN: PING CRITICAL - Packet loss = 100%
[11:23:26] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Improve CI checks to ensure a basic catalogue compiles on all supported OS's [puppet] - 10https://gerrit.wikimedia.org/r/487882 (https://phabricator.wikimedia.org/T215275) (owner: 10Jbond)
[11:24:40] <icinga-wm>	 PROBLEM - puppet last run on cloudvirt2002-dev is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/nova/policy.json]
[11:25:14] <wikibugs>	 (03CR) 10Hashar: "recheck" [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/487612 (owner: 10CRusnov)
[11:26:14] <icinga-wm>	 RECOVERY - puppet last run on bast4002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[11:26:18] <wikibugs>	 (03PS1) 10Ladsgroup: Set EntityUsageTable addUsage batch size to 300 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488907 (https://phabricator.wikimedia.org/T215146)
[11:26:36] <icinga-wm>	 RECOVERY - Host cloudvirt2003-dev is UP: PING OK - Packet loss = 0%, RTA = 36.19 ms
[11:26:51] <wikibugs>	 10Operations, 10Wikimedia-Logstash, 10User-fgiunchedi, 10User-herron: Increase utilization of application logging pipeline (FY2018-2019 Q3 TEC6) - https://phabricator.wikimedia.org/T213157 (10fgiunchedi)
[11:27:12] <icinga-wm>	 RECOVERY - Check systemd state on cloudvirt2002-dev is OK: OK - running: The system is fully operational
[11:27:22] <icinga-wm>	 RECOVERY - HTTPS Unified ECDSA on cp4026 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345546 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2019-11-22 07:59:59 +0000 (expires in 287 days)
[11:27:24] <icinga-wm>	 RECOVERY - Check systemd state on cp4026 is OK: OK - running: The system is fully operational
[11:27:26] <icinga-wm>	 RECOVERY - HTTPS Unified RSA on cp4026 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345544 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2019-11-22 07:59:59 +0000 (expires in 287 days)
[11:27:36] <icinga-wm>	 PROBLEM - Host mw1299 is DOWN: PING CRITICAL - Packet loss = 100%
[11:27:38] <icinga-wm>	 RECOVERY - puppet last run on bast1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:28:28] <icinga-wm>	 RECOVERY - Host cloudvirt2001-dev is UP: PING OK - Packet loss = 0%, RTA = 36.15 ms
[11:29:12] <icinga-wm>	 PROBLEM - configured eth on cloudvirt2003-dev is CRITICAL: connect to address 10.192.20.14 port 5666: Connection refused
[11:29:18] <icinga-wm>	 PROBLEM - DPKG on cloudvirt2003-dev is CRITICAL: connect to address 10.192.20.14 port 5666: Connection refused
[11:29:52] <icinga-wm>	 PROBLEM - SSH on cloudvirt2003-dev is CRITICAL: connect to address 10.192.20.14 and port 22: Connection refused
[11:30:04] <icinga-wm>	 PROBLEM - Disk space on cloudvirt2003-dev is CRITICAL: connect to address 10.192.20.14 port 5666: Connection refused
[11:30:04] <icinga-wm>	 PROBLEM - Check systemd state on cloudvirt2003-dev is CRITICAL: connect to address 10.192.20.14 port 5666: Connection refused
[11:30:10] <icinga-wm>	 PROBLEM - MD RAID on cloudvirt2003-dev is CRITICAL: connect to address 10.192.20.14 port 5666: Connection refused
[11:30:22] <wikibugs>	 (03CR) 10Addshore: [C: 03+1] Set EntityUsageTable addUsage batch size to 300 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488907 (https://phabricator.wikimedia.org/T215146) (owner: 10Ladsgroup)
[11:30:24] <icinga-wm>	 PROBLEM - dhclient process on cloudvirt2003-dev is CRITICAL: connect to address 10.192.20.14 port 5666: Connection refused
[11:31:06] <icinga-wm>	 PROBLEM - dhclient process on cloudvirt2001-dev is CRITICAL: connect to address 10.192.20.5 port 5666: Connection refused
[11:31:22] <icinga-wm>	 PROBLEM - SSH on cloudvirt2001-dev is CRITICAL: connect to address 10.192.20.5 and port 22: Connection refused
[11:31:28] <icinga-wm>	 PROBLEM - MD RAID on cloudvirt2001-dev is CRITICAL: connect to address 10.192.20.5 port 5666: Connection refused
[11:31:30] <icinga-wm>	 PROBLEM - puppet last run on cloudvirt2003-dev is CRITICAL: connect to address 10.192.20.14 port 5666: Connection refused
[11:31:32] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] "Other than that LGTM" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/488077 (https://phabricator.wikimedia.org/T212824) (owner: Elukey)
[11:31:48] <icinga-wm>	 PROBLEM - configured eth on cloudvirt2001-dev is CRITICAL: connect to address 10.192.20.5 port 5666: Connection refused
[11:31:54] <icinga-wm>	 PROBLEM - Disk space on cloudvirt2001-dev is CRITICAL: connect to address 10.192.20.5 port 5666: Connection refused
[11:32:04] <icinga-wm>	 PROBLEM - DPKG on cloudvirt2001-dev is CRITICAL: connect to address 10.192.20.5 port 5666: Connection refused
[11:32:18] <icinga-wm>	 PROBLEM - Check systemd state on cloudvirt2001-dev is CRITICAL: connect to address 10.192.20.5 port 5666: Connection refused
[11:32:18] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] Revert "mariadb: Depool db2055 for performance testing" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488483 (owner: 10Jcrespo)
[11:33:02] <icinga-wm>	 RECOVERY - puppet last run on cp4026 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:33:10] <icinga-wm>	 RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:33:23] <wikibugs>	 (03PS1) 10Hashar: Add .gitreview file [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/488909
[11:33:25] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "mariadb: Depool db2055 for performance testing" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488483 (owner: 10Jcrespo)
[11:33:58] <wikibugs>	 (03CR) 10Hashar: "That is for https://www.mediawiki.org/wiki/Gerrit/git-review :)" [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/488909 (owner: 10Hashar)
[11:34:32] <icinga-wm>	 PROBLEM - puppet last run on cloudvirt2001-dev is CRITICAL: connect to address 10.192.20.5 port 5666: Connection refused
[11:34:41] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] ":)" [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/487612 (owner: 10CRusnov)
[11:35:23] <jynus>	 'mw1279.eqiad.wmnet' failed: ERROR: 50% OVER_THRESHOLD
[11:36:01] <jynus>	 and I think mw1299 is down
[11:36:48] <icinga-wm>	 RECOVERY - puppet last run on bast2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[11:37:05] <logmsgbot>	 !log jynus@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2055 (duration: 03m 02s)
[11:37:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:40:16] <icinga-wm>	 RECOVERY - puppet last run on bast5001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[11:40:38] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp1080 is OK: OK
[11:40:40] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp2004 is OK: OK
[11:40:42] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp2024 is OK: OK
[11:40:42] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp2012 is OK: OK
[11:40:42] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp2001 is OK: OK
[11:40:42] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp2006 is OK: OK
[11:40:42] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp2016 is OK: OK
[11:40:44] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp3030 is OK: OK
[11:40:52] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp4032 is OK: OK
[11:40:52] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp4022 is OK: OK
[11:40:56] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp5008 is OK: OK
[11:40:58] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp3032 is OK: OK
[11:41:02] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp2020 is OK: OK
[11:41:02] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp2017 is OK: OK
[11:41:02] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp2023 is OK: OK
[11:41:02] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp2013 is OK: OK
[11:41:02] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp2019 is OK: OK
[11:41:04] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp1082 is OK: OK
[11:41:04] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp3036 is OK: OK
[11:41:04] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp3044 is OK: OK
[11:41:04] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp3049 is OK: OK
[11:41:04] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp3045 is OK: OK
[11:41:08] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp4025 is OK: OK
[11:41:08] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp3041 is OK: OK
[11:41:12] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp5002 is OK: OK
[11:41:12] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp5003 is OK: OK
[11:41:12] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp5005 is OK: OK
[11:41:12] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp5006 is OK: OK
[11:41:14] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp4031 is OK: OK
[11:41:14] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp3033 is OK: OK
[11:41:14] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp3037 is OK: OK
[11:41:14] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp3035 is OK: OK
[11:41:16] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp1087 is OK: OK
[11:41:18] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp4023 is OK: OK
[11:41:18] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp3043 is OK: OK
[11:41:18] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp3040 is OK: OK
[11:41:20] * vgutierrez hides
[11:41:20] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp2005 is OK: OK
[11:41:24] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp2010 is OK: OK
[11:41:28] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp1077 is OK: OK
[11:41:28] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp1085 is OK: OK
[11:41:28] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp4030 is OK: OK
[11:41:29] <vgutierrez>	 sorry about the noise folks :)
[11:41:30] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp1078 is OK: OK
[11:41:30] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp1083 is OK: OK
[11:41:36] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp3042 is OK: OK
[11:41:38] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp1079 is OK: OK
[11:41:38] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp1090 is OK: OK
[11:41:41] <wikibugs>	 (03CR) 10jenkins-bot: Revert "mariadb: Depool db2055 for performance testing" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488483 (owner: 10Jcrespo)
[11:41:44] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp4021 is OK: OK
[11:41:44] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp4029 is OK: OK
[11:41:44] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp3038 is OK: OK
[11:41:44] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp3047 is OK: OK
[11:41:44] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp3046 is OK: OK
[11:41:44] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp3034 is OK: OK
[11:41:46] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp1088 is OK: OK
[11:41:48] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp1084 is OK: OK
[11:41:48] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp5011 is OK: OK
[11:41:48] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp2022 is OK: OK
[11:41:50] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp1089 is OK: OK
[11:41:50] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp2026 is OK: OK
[11:41:56] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp1081 is OK: OK
[11:41:56] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp1076 is OK: OK
[11:41:56] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp1086 is OK: OK
[11:41:58] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp5009 is OK: OK
[11:41:58] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp5012 is OK: OK
[11:41:58] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp5007 is OK: OK
[11:43:19] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production for dsharpe - https://phabricator.wikimedia.org/T214130 (10Joe)
[11:43:34] <wikibugs>	 (03PS6) 10Elukey: Introduce systemd::slice::all_users [puppet] - 10https://gerrit.wikimedia.org/r/488077 (https://phabricator.wikimedia.org/T212824)
[11:44:19] <wikibugs>	 (03PS10) 10Jbond: Improve CI checks to ensure a basic catalogue compiles on all supported OS's [puppet] - 10https://gerrit.wikimedia.org/r/487882 (https://phabricator.wikimedia.org/T215275)
[11:45:16] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Improve CI checks to ensure a basic catalogue compiles on all supported OS's [puppet] - 10https://gerrit.wikimedia.org/r/487882 (https://phabricator.wikimedia.org/T215275) (owner: 10Jbond)
[11:46:45] <wikibugs>	 Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to production for dsharpe - https://phabricator.wikimedia.org/T214130 (Joe) @Dsharpe you should be able to log onto the systems accessible via those groups - for example, `deploy1001`.  If you can access those servers, please resol...
[11:47:49] <icinga-wm>	 PROBLEM - Check the NTP synchronisation status of timesyncd on cloudvirt2003-dev is CRITICAL: connect to address 10.192.20.14 port 5666: Connection refused
[11:48:26] <wikibugs>	 10Operations, 10Cloud-VPS, 10SRE-Access-Requests, 10cloud-services-team, and 2 others: Create cloudelastic-root group - https://phabricator.wikimedia.org/T214922 (10Joe) ok great - this should be discussed in the SRE meeting on monday.
[11:49:37] <icinga-wm>	 PROBLEM - Check the NTP synchronisation status of timesyncd on cloudvirt2001-dev is CRITICAL: connect to address 10.192.20.5 port 5666: Connection refused
[11:52:07] <wikibugs>	 (03CR) 10Elukey: "Arturo: let's see if Moritz has any comment about this approach and then if none, let's merge? :)" [puppet] - 10https://gerrit.wikimedia.org/r/488077 (https://phabricator.wikimedia.org/T212824) (owner: 10Elukey)
[11:53:32] <wikibugs>	 10Operations, 10Cloud-Services, 10Patch-For-Review: rack/setup/install cloudcontrol2001-dev & cloudvirt200[123]-dev - https://phabricator.wikimedia.org/T214448 (10aborrero) >>! In T214448#4909868, @Andrew wrote: >>>! In T214448#4909558, @Papaul wrote: >> @Andrew there is no  raid controller on the new server...
[11:53:41] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cloudvirt2003-dev is CRITICAL: connect to address 10.192.20.14 port 5666: Connection refused
[11:55:25] <marostegui>	 !log Stop MySQL on db1101:3317 and db1101:3318 for mysql upgrade
[11:55:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:56:41] <wikibugs>	 (CR) Hashar: Improve CI checks to ensure a basic catalogue compiles on all supported OS's (1 comment) [puppet] - https://gerrit.wikimedia.org/r/487882 (https://phabricator.wikimedia.org/T215275) (owner: Jbond)
[11:57:39] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cloudvirt2001-dev is CRITICAL: connect to address 10.192.20.5 port 5666: Connection refused
[11:57:50] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudvirt200[123]-dev: use partman/raid1-lvm-xfs-nova.cfg [puppet] - 10https://gerrit.wikimedia.org/r/488914 (https://phabricator.wikimedia.org/T214448)
[11:58:09] <icinga-wm>	 PROBLEM - Long running screen/tmux on prometheus2003 is CRITICAL: CRIT: Long running SCREEN process. (user: root PID: 11023, 1741693s 1728000s).
[11:58:26] <wikibugs>	 10Operations, 10Cloud-Services, 10Patch-For-Review: rack/setup/install cloudcontrol2001-dev & cloudvirt200[123]-dev - https://phabricator.wikimedia.org/T214448 (10aborrero) >>! In T214448#4934464, @aborrero wrote: >>>! In T214448#4909868, @Andrew wrote: >>>>! In T214448#4909558, @Papaul wrote: >>> @Andrew th...
[11:58:37] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudvirt200[123]-dev: use partman/raid1-lvm-xfs-nova.cfg [puppet] - 10https://gerrit.wikimedia.org/r/488914 (https://phabricator.wikimedia.org/T214448) (owner: 10Arturo Borrero Gonzalez)
[11:59:27] <wikibugs>	 Operations, Analytics, Analytics-Kanban, Product-Analytics, Patch-For-Review: dbstore1002 Mysql errors - https://phabricator.wikimedia.org/T213670 (Marostegui) @elukey @jcrespo Any objection to putting dbstore1002 as IDEMPOTENT? This host crashes every single day, the data already drifts a lot...
[12:00:04] <jouncebot>	 addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190207T1200).
[12:00:05] <jouncebot>	 Amir1: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[12:00:19] <Amir1>	 o/
[12:03:16] <arturo>	 !log T214448 reimaging again cloudvirt200[1-3]-dev.codfw.wmnet
[12:03:18] <wikibugs>	 10Operations, 10LDAP-Access-Requests, 10User-Addshore, 10User-jijiki: Add "raz-shuty" to nda ldap group - https://phabricator.wikimedia.org/T214488 (10jijiki) p:05Triage→03Normal
[12:03:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:03:19] <stashbot>	 T214448: rack/setup/install cloudcontrol2001-dev & cloudvirt200[123]-dev - https://phabricator.wikimedia.org/T214448
[12:03:20] <wikibugs>	 10Operations, 10Cloud-Services, 10Patch-For-Review: rack/setup/install cloudcontrol2001-dev & cloudvirt200[123]-dev - https://phabricator.wikimedia.org/T214448 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by aborrero on cumin2001.codfw.wmnet for hosts: ` ['cloudvirt2001-dev.codfw.wmnet', 'clou...
[12:03:52] <wikibugs>	 10Operations, 10ops-eqsin, 10Traffic: Degraded RAID on cp5010 - https://phabricator.wikimedia.org/T214274 (10jijiki) p:05Triage→03Normal
[12:04:09] <wikibugs>	 10Operations, 10Traffic: cp nodes still try to OCSP staple the already expired digicert-2017 certificate - https://phabricator.wikimedia.org/T215103 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez After merging the change, the following commands have been issued over cumin: ` rm -f /etc/update-ocsp.d/dig...
[12:04:44] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Patch-For-Review: dbstore1002 Mysql errors - https://phabricator.wikimedia.org/T213670 (10jcrespo) ok to me, data is already garbage, more garbage would not be a problem :-)
[12:06:50] <logmsgbot>	 !log vgutierrez@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4026.ulsfo.wmnet
[12:06:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:07:14] <wikibugs>	 10Operations, 10Maps, 10Patch-For-Review: Kartotherian service on maps100[2-4]  timed out on when trying to get tiles. - https://phabricator.wikimedia.org/T214434 (10Joe) p:05Triage→03High
[12:07:35] <icinga-wm>	 RECOVERY - SSH on cloudvirt2003-dev is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u4 (protocol 2.0)
[12:07:49] <icinga-wm>	 RECOVERY - SSH on cloudvirt2001-dev is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u4 (protocol 2.0)
[12:10:08] <wikibugs>	 (03PS11) 10Jbond: Improve CI checks to ensure a basic catalogue compiles on all supported OS's [puppet] - 10https://gerrit.wikimedia.org/r/487882 (https://phabricator.wikimedia.org/T215275)
[12:10:10] <wikibugs>	 10Operations, 10Traffic, 10HTTPS: en.wikipedia.com [sic] serves an invalid certificate - https://phabricator.wikimedia.org/T214253 (10Joe) p:05Triage→03Low
[12:11:27] <wikibugs>	 (CR) Jbond: "@Hashar thanks for the extensive review, see comments inline" (12 comments) [puppet] - https://gerrit.wikimedia.org/r/487882 (https://phabricator.wikimedia.org/T215275) (owner: Jbond)
[12:11:31] <wikibugs>	 10Operations, 10Wiki-Loves-Love, 10Wikimedia-Mailing-lists, 10User-jijiki: Reset password for wll mailling list - https://phabricator.wikimedia.org/T215390 (10jijiki) p:05Triage→03Normal
[12:12:23] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1101 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488916
[12:12:59] <Amir1>	 I guess I'll do the SWAT then
[12:14:28] <wikibugs>	 10Operations, 10Analytics, 10Research, 10serviceops, and 4 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10Joe)
[12:14:31] <wikibugs>	 10Operations, 10vm-requests: Site: 1 VM request for recommender-systems - https://phabricator.wikimedia.org/T215421 (10Joe) 05Open→03Stalled
[12:15:48] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488903 (https://phabricator.wikimedia.org/T214400) (owner: 10Ladsgroup)
[12:15:56] <wikibugs>	 10Operations, 10vm-requests: Site: 1 VM request for recommender-systems - https://phabricator.wikimedia.org/T215421 (10Joe) I'm not sure how this request derives from the non-conclusive discussion that is ongoing in the parent task.  I am unsure if this ticket should be declined or just stalled - stalling it t...
[12:16:06] <wikibugs>	 10Operations, 10vm-requests: Site: 1 VM request for recommender-systems - https://phabricator.wikimedia.org/T215421 (10Joe) p:05Triage→03Normal
[12:16:57] <wikibugs>	 (03Merged) 10jenkins-bot: Update interwiki cache to have yuewiktionary instead of zh-yue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488903 (https://phabricator.wikimedia.org/T214400) (owner: 10Ladsgroup)
[12:20:08] <wikibugs>	 10Operations, 10Wiki-Loves-Love, 10Wikimedia-Mailing-lists, 10User-jijiki: Reset password for wll mailling list - https://phabricator.wikimedia.org/T215390 (10Joe) @Psychoslave I would need additional information, yes. Can you still receive emails at the email address listed as the wll@ admin address? I ne...
[12:20:32] <wikibugs>	 (03PS5) 10Vgutierrez: certcentral: Implement staging time [software/certcentral] - 10https://gerrit.wikimedia.org/r/485594 (https://phabricator.wikimedia.org/T213737)
[12:21:11] <wikibugs>	 (CR) Vgutierrez: "Thx for the review!" (2 comments) [software/certcentral] - https://gerrit.wikimedia.org/r/485594 (https://phabricator.wikimedia.org/T213737) (owner: Vgutierrez)
[12:21:32] <Amir1>	 Works fine at mwdebug1002, moving forward
[12:21:59] <marostegui>	 Amir1: let me know when I can deploy db-eqiad.php
[12:22:06] <Amir1>	 sure!
[12:22:09] <marostegui>	 thanks!
[12:22:10] <wikibugs>	 10Operations, 10Wiki-Loves-Love, 10Wikimedia-Mailing-lists, 10User-jijiki: Reset password for wll mailling list - https://phabricator.wikimedia.org/T215390 (10Joe) a:03Joe
[12:23:37] <wikibugs>	 10Operations: Archival of home directories on servers with very large homes - https://phabricator.wikimedia.org/T215171 (10Joe) p:05Triage→03Normal
[12:26:18] <wikibugs>	 (03CR) 10jenkins-bot: Update interwiki cache to have yuewiktionary instead of zh-yue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488903 (https://phabricator.wikimedia.org/T214400) (owner: 10Ladsgroup)
[12:26:24] <logmsgbot>	 !log ladsgroup@deploy1001 Synchronized wmf-config/interwiki.php: SWAT: [[gerrit:488903|Update interwiki cache to have yuewiktionary instead of zh-yue (T214400)]] (duration: 03m 04s)
[12:26:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:26:27] <stashbot>	 T214400: Add yue.wikt to Cognate - https://phabricator.wikimedia.org/T214400
[12:27:03] <Amir1>	 marostegui: I'm done
[12:27:08] <marostegui>	 thank you!
[12:27:14] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Slowly repool db1101 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488916 (owner: 10Marostegui)
[12:27:18] <Amir1>	 I have another patch going but I need to wait in between
[12:27:54] <Amir1>	 btw, one apache had a sync error
[12:28:13] <Amir1>	 marostegui: i.e. tell me when you're done :D
[12:28:16] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1101 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488916 (owner: 10Marostegui)
[12:31:35] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1101 after alter and mysql upgrade (duration: 03m 02s)
[12:31:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:31:48] <marostegui>	 Amir1: I am done! Was mw1299.eqiad.wmnet the one that failed for you?
[12:32:30] <Amir1>	 yup
[12:32:34] <marostegui>	 I will take a look
[12:33:24] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: hiera: cloudvirt200[1-3]-dev: fix extra LVM volume name [puppet] - 10https://gerrit.wikimedia.org/r/488918 (https://phabricator.wikimedia.org/T214448)
[12:33:58] <Amir1>	 Thanks!
[12:34:07] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] hiera: cloudvirt200[1-3]-dev: fix extra LVM volume name [puppet] - 10https://gerrit.wikimedia.org/r/488918 (https://phabricator.wikimedia.org/T214448) (owner: 10Arturo Borrero Gonzalez)
[12:34:14] <marostegui>	 !log Powercycle mw1299 as it is down and not responding
[12:34:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:34:29] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] "SWAT!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488907 (https://phabricator.wikimedia.org/T215146) (owner: 10Ladsgroup)
[12:34:59] <Amir1>	 marostegui: btw ^ This might have some effects on the database
[12:35:32] <wikibugs>	 (03Merged) 10jenkins-bot: Set EntityUsageTable addUsage batch size to 300 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488907 (https://phabricator.wikimedia.org/T215146) (owner: 10Ladsgroup)
[12:35:34] <Amir1>	 makes batches smaller, so more sql commands 
[12:35:54] <marostegui>	 as long as they are fast...
[12:35:56] <marostegui>	 :)
[12:36:08] <marostegui>	 what is the batch size now?
[12:36:17] <Amir1>	 now, it's 500
[12:36:22] <Amir1>	 it reduces it to 300
[12:36:42] <Amir1>	 which hopefully helps with T205045
[12:36:43] <stashbot>	 T205045: Exception from LinksUpdate: Deadlock found in database query  (from Wikibase\Client\Usage\Sql\EntityUsageTable::addUsages) - https://phabricator.wikimedia.org/T205045
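A minimal sketch of the trade-off Amir1 describes above (smaller batches, so more SQL commands), assuming one write statement is issued per batch. The helpers `batch_count` and `chunk` are hypothetical illustrations, not Wikibase code, and the 1500-row figure is invented for the arithmetic.

```python
import math
from typing import Iterator, List, Sequence


def batch_count(total_rows: int, batch_size: int) -> int:
    """Number of batches needed to write total_rows rows.

    Assumption: each batch maps to exactly one SQL statement.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return math.ceil(total_rows / batch_size)


def chunk(rows: Sequence[int], batch_size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of rows, each at most batch_size long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(rows), batch_size):
        yield list(rows[start:start + batch_size])


# Hypothetical workload of 1500 usage rows for one update.
old = batch_count(1500, 500)  # 3 statements at the old batch size
new = batch_count(1500, 300)  # 5 statements at the new batch size
```

Each statement touches fewer rows at the smaller size, which is the property hoped to help with the deadlocks in T205045; the cost is the higher statement count that marostegui asks about.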
[12:37:26] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1101 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488916 (owner: 10Marostegui)
[12:37:28] <wikibugs>	 (03CR) 10jenkins-bot: Set EntityUsageTable addUsage batch size to 300 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488907 (https://phabricator.wikimedia.org/T215146) (owner: 10Ladsgroup)
[12:37:39] <addshore>	 woo!
[12:37:48] <wikibugs>	 10Operations, 10MobileFrontend, 10Traffic, 10Readers-Web-Backlog (Tracking): Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (10Joe) p:05Triage→03Normal
[12:38:17] <icinga-wm>	 RECOVERY - Host mw1299 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[12:38:21] <Amir1>	 Since it's untestable, I'm moving forward, if things break, it'll show up
[12:38:33] * addshore is watching too :)
[12:39:52] <logmsgbot>	 !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:488907|Set EntityUsageTable addUsage batch size to 300 (T215146)]], Part I (duration: 00m 55s)
[12:39:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:39:55] <stashbot>	 T215146: Decrease EntityUsageTable addUsage batch size - https://phabricator.wikimedia.org/T215146
[12:40:25] <marostegui>	 Amir1: mw1299 didn't fail this time, right?
[12:40:34] <Amir1>	 marostegui: nope, thanks!
[12:40:39] <wikibugs>	 10Operations, 10ops-esams: Degraded RAID on cp3030 - https://phabricator.wikimedia.org/T214879 (10Joe) 05Open→03Invalid
[12:40:43] <marostegui>	 Coolio
[12:41:01] <wikibugs>	 10Operations: wmf-auto-reimage-host: icinga downtime error - https://phabricator.wikimedia.org/T214314 (10jijiki) p:05Triage→03Normal
[12:41:35] <wikibugs>	 10Operations, 10WMF-Legal, 10Graphite, 10Performance-Team (Radar), 10Software-Licensing: Add license statement to Grafana dashboards - https://phabricator.wikimedia.org/T214819 (10Joe) p:05Triage→03Low
[12:42:15] <marostegui>	 !log Set dbstore1002 as IDEMPOTENT - T213670
[12:42:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:42:18] <stashbot>	 T213670: dbstore1002 Mysql errors - https://phabricator.wikimedia.org/T213670
[12:42:26] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[12:42:28] <logmsgbot>	 !log ladsgroup@deploy1001 Synchronized wmf-config/Wikibase.php: SWAT: [[gerrit:488907|Set EntityUsageTable addUsage batch size to 300]], Part II (duration: 00m 54s)
[12:42:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:42:30] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s7 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 233.52 seconds
[12:42:41] <Amir1>	 !log EU SWAT is done
[12:42:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:45:50] <wikibugs>	 10Operations, 10monitoring: icinga1001 crashed - https://phabricator.wikimedia.org/T214760 (10Joe) p:05Triage→03High
[12:45:59] <wikibugs>	 (03PS1) 10Marostegui: dbstore.my.cnf: Make the slave IDEMPOTENT [puppet] - 10https://gerrit.wikimedia.org/r/488920 (https://phabricator.wikimedia.org/T213670)
[12:46:38] <wikibugs>	 10Operations, 10Continuous-Integration-Infrastructure, 10Packaging: Upgrade jenkins-debian-glue to v0.20.0 - https://phabricator.wikimedia.org/T212774 (10jijiki) p:05Triage→03Normal
[12:47:14] <wikibugs>	 10Operations, 10monitoring: WMF's Grafana installation does not follow Wikimedia's visual identity guidelines - https://phabricator.wikimedia.org/T214762 (10Joe) p:05Triage→03Low
[12:47:34] <wikibugs>	 10Operations, 10PHP 7.2 support: PHP Fatal error:  The UdpSocket to 127.0.0.1:10514 has been closed (from Monolog/SyslogUdp) - https://phabricator.wikimedia.org/T214734 (10Joe) p:05Triage→03High
[12:47:38] <wikibugs>	 10Operations, 10Discovery, 10Discovery-Search: Create extra elasticsearch clusters in beta cluster - https://phabricator.wikimedia.org/T213940 (10jijiki) p:05Triage→03Normal
[12:48:06] <wikibugs>	 (03CR) 10Marostegui: "Just to confirm, this file is only used on dbstore1002:" [puppet] - 10https://gerrit.wikimedia.org/r/488920 (https://phabricator.wikimedia.org/T213670) (owner: 10Marostegui)
[12:48:47] <wikibugs>	 10Operations, 10DNS, 10Domains, 10Traffic, 10HTTPS: Merge Wikipedia subdomains into one, to discourage censorship - https://phabricator.wikimedia.org/T215071 (10jijiki) p:05Triage→03Normal
[12:49:17] <wikibugs>	 10Operations, 10Continuous-Integration-Config: CI errors  not being displayed in console log - https://phabricator.wikimedia.org/T214726 (10jijiki) p:05Triage→03Normal
[12:50:54] <wikibugs>	 10Operations, 10MobileFrontend, 10Traffic, 10Readers-Web-Backlog (Tracking): Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (10Krenair) (People interested in merging subdomains may also be interested in {T215071} which is about mergi...
[12:57:54] <wikibugs>	 10Operations, 10puppet-compiler: puppet: compiler-update-facts error and warning - https://phabricator.wikimedia.org/T214472 (10jijiki) p:05Triage→03Normal
[13:00:05] <jouncebot>	 Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190207T1300)
[13:00:43] <wikibugs>	 10Operations, 10monitoring, 10Patch-For-Review: EDAC events not being reported by node-exporter? - https://phabricator.wikimedia.org/T214529 (10jijiki) p:05Triage→03Normal
[13:02:47] <wikibugs>	 10Operations, 10DNS, 10Traffic, 10fundraising-tech-ops: remove IBM/Silverpop 1024-bit domain key - https://phabricator.wikimedia.org/T214525 (10jijiki) p:05Triage→03Normal
[13:04:03] <wikibugs>	 10Operations, 10Wiki-Loves-Love, 10Wikimedia-Mailing-lists, 10User-jijiki: Reset password for wll mailling list - https://phabricator.wikimedia.org/T215390 (10jijiki) a:05Joe→03jijiki
[13:05:08] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: x1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 197.71 seconds
[13:10:43] <wikibugs>	 10Operations, 10DNS, 10Domains, 10Traffic, 10HTTPS: Merge Wikipedia subdomains into one, to discourage censorship - https://phabricator.wikimedia.org/T215071 (10BBlack) The linked ESNI ticket is kind of a random user question ticket, and not actually one created for working on it (which still off in the...
[13:13:03] <wikibugs>	 10Operations, 10Cloud-Services, 10Patch-For-Review: rack/setup/install cloudcontrol2001-dev & cloudvirt200[123]-dev - https://phabricator.wikimedia.org/T214448 (10aborrero) I'm seeing this in cloudvirt2003-dev:  ` [   13.270987] kvm: disabled by bios [   13.729525] kvm: disabled by bios `
[13:18:32] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Patch-For-Review: dbstore1002 Mysql errors - https://phabricator.wikimedia.org/T213670 (10JAllemandou) sqoop for actor and comment tables just finished and we should use the new hardware next month, so no problem for me either :)
[13:24:15] <wikibugs>	 10Operations, 10Wikimedia-General-or-Unknown, 10serviceops, 10PHP 7.2 support, 10User-jijiki: mwscript dies on mwmaint with PHP=php7.2 due to php-redis missing - https://phabricator.wikimedia.org/T215376 (10Krenair) >>! In T215376#4932704, @Dzahn wrote: >>>! In T215376#4932577, @Reedy wrote: >> In `modul...
[13:25:04] <wikibugs>	 10Operations, 10monitoring, 10Patch-For-Review: EDAC events not being reported by node-exporter? - https://phabricator.wikimedia.org/T214529 (10CDanis) a:03CDanis
[13:33:57] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: hiera: cloudvirt200[1-3]-dev: fix again instance_dev hiera key [puppet] - 10https://gerrit.wikimedia.org/r/488926 (https://phabricator.wikimedia.org/T214448)
[13:34:35] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] hiera: cloudvirt200[1-3]-dev: fix again instance_dev hiera key [puppet] - 10https://gerrit.wikimedia.org/r/488926 (https://phabricator.wikimedia.org/T214448) (owner: 10Arturo Borrero Gonzalez)
[13:35:30] <wikibugs>	 10Operations, 10MediaWiki-Debug-Logger, 10Wikimedia-Logstash, 10monitoring: MediaWiki logging & encryption - https://phabricator.wikimedia.org/T126989 (10fgiunchedi) Status update: mw logs that were going to logstash in plaintext now are being sent via localhost -> rsyslog -> kafka -> logstash and the netw...
[13:41:12] <wikibugs>	 10Operations, 10monitoring, 10User-fgiunchedi: prometheus on bast3002 misbehaving - https://phabricator.wikimedia.org/T192610 (10fgiunchedi) 05Open→03Invalid We haven't seen this reoccurring afaik, also we're upgrading to Prometheus 2.6, tentatively resolving.
[13:41:25] <wikibugs>	 10Operations, 10DNS, 10Domains, 10Traffic, 10HTTPS: Merge Wikipedia subdomains into one, to discourage censorship - https://phabricator.wikimedia.org/T215071 (10BBlack) p:05Normal→03Low Expounding on the lamentations above in a more realistic triage sort of sense:  * It's a very complex project which...
[13:43:36] <wikibugs>	 10Operations, 10Goal, 10User-Elukey, 10User-fgiunchedi: Export Prometheus-compatible JVM metrics from JVMs in production - https://phabricator.wikimedia.org/T177197 (10fgiunchedi)
[13:44:53] <wikibugs>	 10Operations, 10media-storage, 10User-fgiunchedi: Track down the source of periodic increases in requests to swift eqiad - https://phabricator.wikimedia.org/T173721 (10fgiunchedi) 05Open→03Resolved Turns out the spikes are varnish upload backends periodic restarts, thus expected.
[13:45:09] <wikibugs>	 10Operations, 10Cloud-Services, 10Patch-For-Review: rack/setup/install cloudcontrol2001-dev & cloudvirt200[123]-dev - https://phabricator.wikimedia.org/T214448 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by aborrero on cumin2001.codfw.wmnet for hosts: ` ['cloudvirt2003-dev.codfw.wmnet'] ` The...
[13:46:24] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2034 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[13:54:36] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Introduce systemd::slice::all_users [puppet] - 10https://gerrit.wikimedia.org/r/488077 (https://phabricator.wikimedia.org/T212824) (owner: 10Elukey)
[13:54:49] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "> Arturo: let's see if Moritz has any comment about this approach and" [puppet] - 10https://gerrit.wikimedia.org/r/488077 (https://phabricator.wikimedia.org/T212824) (owner: 10Elukey)
[13:56:09] <wikibugs>	 (03PS1) 10Milimetric: Separate logfile for production sqoop [puppet] - 10https://gerrit.wikimedia.org/r/488928
[13:58:08] <wikibugs>	 (03CR) 10Joal: [C: 03+1] "Works for me - Thanks Dan" [puppet] - 10https://gerrit.wikimedia.org/r/488928 (owner: 10Milimetric)
[14:00:04] <jouncebot>	 Deploy window MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190207T1400)
[14:01:29] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Separate logfile for production sqoop [puppet] - 10https://gerrit.wikimedia.org/r/488928 (owner: 10Milimetric)
[14:08:11] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] dbstore.my.cnf: Make the slave IDEMPOTENT [puppet] - 10https://gerrit.wikimedia.org/r/488920 (https://phabricator.wikimedia.org/T213670) (owner: 10Marostegui)
[14:11:34] <wikibugs>	 10Operations, 10Cloud-Services, 10Patch-For-Review: rack/setup/install cloudcontrol2001-dev & cloudvirt200[123]-dev - https://phabricator.wikimedia.org/T214448 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudvirt2003-dev.codfw.wmnet'] `  and were **ALL** successful.
[14:12:02] <wikibugs>	 10Operations, 10monitoring: Expose linux kernel firewall and connections statistics - https://phabricator.wikimedia.org/T215277 (10jbond) p:05Triage→03Normal
[14:12:24] <wikibugs>	 10Operations, 10Puppet, 10Continuous-Integration-Config, 10Patch-For-Review: Improve CI checks to cover more of the code base - https://phabricator.wikimedia.org/T215275 (10jbond) p:05Triage→03Normal
[14:12:34] <wikibugs>	 10Operations, 10Puppet: Audit /etc/apt directories - https://phabricator.wikimedia.org/T214605 (10jbond) p:05Triage→03Low
[14:25:13] <wikibugs>	 10Operations, 10Analytics, 10Research, 10serviceops, and 4 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10akosiaris) How is the data going to make it from Hadoop, which resides in the analytics cluster and is firewalled at the router level...
[14:28:58] <wikibugs>	 10Operations, 10Analytics, 10Research, 10serviceops, and 4 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10Ottomata) > How is the data going to make it from Hadoop, which resides in the analytics cluster and is firewalled at the router level...
[14:33:41] <wikibugs>	 10Operations, 10Analytics, 10Research, 10serviceops, and 4 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10bmansurov) >>! In T213566#4934832, @akosiaris wrote: > Is it just a `LOAD DATA INFILE "something.tsv"` or is it something more complex...
[14:34:16] <wikibugs>	 (03PS2) 10Marostegui: dbstore.my.cnf: Make the slave IDEMPOTENT [puppet] - 10https://gerrit.wikimedia.org/r/488920 (https://phabricator.wikimedia.org/T213670)
[14:34:27] <jbond42>	 !log deploying security updates for libgd3
[14:34:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:34:54] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] dbstore.my.cnf: Make the slave IDEMPOTENT [puppet] - 10https://gerrit.wikimedia.org/r/488920 (https://phabricator.wikimedia.org/T213670) (owner: 10Marostegui)
[14:36:03] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1101 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488931
[14:37:41] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Increase traffic for db1101 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488931 (owner: 10Marostegui)
[14:38:29] <wikibugs>	 10Operations, 10Research, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to researchers group for phuedx - https://phabricator.wikimedia.org/T214957 (10phuedx) Thanks, @Dzahn, @elukey, @Joe, and @Nuria!  I apologise for not including an approver on the task. I wasn't actually sure who should...
[14:38:46] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for db1101 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488931 (owner: 10Marostegui)
[14:39:30] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1101 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488931 (owner: 10Marostegui)
[14:39:52] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1101 after alter and mysql upgrade (duration: 00m 55s)
[14:39:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:41:00] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Depool db1085 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488932
[14:41:14] <wikibugs>	 10Operations, 10Analytics, 10Research, 10serviceops, and 4 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10akosiaris) >>! In T213566#4934835, @Ottomata wrote: >> How is the data going to make it from Hadoop, which resides in the analytics cl...
[14:42:29] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] mariadb: Depool db1085 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488932 (owner: 10Jcrespo)
[14:44:29] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2034 is OK: OK - running: The system is fully operational
[14:44:30] <wikibugs>	 10Operations, 10Analytics, 10Research, 10serviceops, and 4 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10bmansurov) > That does look simple enough and not resource expensive on mwmaint1002. I guess it can fit in there as well? But a VM is...
[14:46:39] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Fully repool db1101:3317,3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488934
[14:51:48] <marostegui>	 jouncebot: next
[14:51:48] <jouncebot>	 In 2 hour(s) and 8 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190207T1700)
[14:56:41] <wikibugs>	 10Operations, 10Analytics, 10Research, 10serviceops, and 4 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10Ottomata) > they will also not allow them to send the SYN/ACK packet required for the second (of the three) phase of the TCP handshake...
[14:59:13] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Fully repool db1101:3317,3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488934 (owner: 10Marostegui)
[15:00:19] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool db1101:3317,3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488934 (owner: 10Marostegui)
[15:01:36] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1101 (duration: 00m 55s)
[15:01:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:01:49] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Fully repool db1101:3317,3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488934 (owner: 10Marostegui)
[15:07:15] <logmsgbot>	 !log anomie@mwmaint1002 Fixing log_search after migrateActors.php on test wikis and mediawikiwiki for T215464. This may cause lag in codfw.
[15:07:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:07:20] <stashbot>	 T215464: Oversighters can no longer see suppressed contributions past a certain date when using the offender parameter - https://phabricator.wikimedia.org/T215464
[15:07:28] <wikibugs>	 10Operations, 10MediaWiki-Cache, 10serviceops, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 2 others: Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (10EvanProdromou) So, re-reading https://phabricator.wik...
[15:14:28] <wikibugs>	 (03PS6) 10Gehel: mwgrep: Query all search clusters [puppet] - 10https://gerrit.wikimedia.org/r/487924 (https://phabricator.wikimedia.org/T215199) (owner: 10EBernhardson)
[15:15:50] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] mwgrep: Query all search clusters [puppet] - 10https://gerrit.wikimedia.org/r/487924 (https://phabricator.wikimedia.org/T215199) (owner: 10EBernhardson)
[15:16:18] <logmsgbot>	 !log anomie@mwmaint1002 Fixing log_search after migrateActors.php on section 1 wikis for T215464. This may cause lag in codfw.
[15:16:18] <logmsgbot>	 !log anomie@mwmaint1002 Fixing log_search after migrateActors.php on section 2 wikis for T215464. This may cause lag in codfw.
[15:16:18] <logmsgbot>	 !log anomie@mwmaint1002 Fixing log_search after migrateActors.php on remaining section 3 wikis for T215464. This may cause lag in codfw.
[15:16:18] <logmsgbot>	 !log anomie@mwmaint1002 Fixing log_search after migrateActors.php on section 4 wikis for T215464. This may cause lag in codfw.
[15:16:18] <logmsgbot>	 !log anomie@mwmaint1002 Fixing log_search after migrateActors.php on section 5 wikis for T215464. This may cause lag in codfw.
[15:16:18] <logmsgbot>	 !log anomie@mwmaint1002 Fixing log_search after migrateActors.php on section 6 wikis for T215464. This may cause lag in codfw.
[15:16:19] <logmsgbot>	 !log anomie@mwmaint1002 Fixing log_search after migrateActors.php on section 7 wikis for T215464. This may cause lag in codfw.
[15:16:19] <logmsgbot>	 !log anomie@mwmaint1002 Fixing log_search after migrateActors.php on section 8 wikis for T215464. This may cause lag in codfw.
[15:16:20] <logmsgbot>	 !log anomie@mwmaint1002 Fixing log_search after migrateActors.php on wikitech for T215464. This may cause lag in codfw.
[15:16:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:16:21] <stashbot>	 T215464: Oversighters can no longer see suppressed contributions past a certain date when using the offender parameter - https://phabricator.wikimedia.org/T215464
[15:16:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:16:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:16:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:16:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:16:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:16:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:16:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:16:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:17:59] <wikibugs>	 10Operations, 10Wikimedia-General-or-Unknown, 10serviceops, 10PHP 7.2 support, 10User-jijiki: mwscript dies on mwmaint with PHP=php7.2 due to php-redis missing - https://phabricator.wikimedia.org/T215376 (10Dzahn) If we use "present" (and not a specific version or "latest" either) we would get whatever t...
[15:20:28] <wikibugs>	 (03PS2) 10Jcrespo: mariadb: Depool db1085 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488932
[15:23:52] <icinga-wm>	 PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[15:24:54] <icinga-wm>	 RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[15:32:18] <wikibugs>	 (03PS3) 10Gehel: icinga: enable check for psi and omega clusters [puppet] - 10https://gerrit.wikimedia.org/r/488485 (https://phabricator.wikimedia.org/T212850) (owner: 10Mathew.onipe)
[15:32:19] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to deployment, contint-admins, and contint-docker for Brennen Bearnes - https://phabricator.wikimedia.org/T215328 (10brennen) Hi @Joe  -   > Please read and sign https://phabricator.wikimedia.org/L3 if you didn't do it already  Read and signed.  > Confirm...
[15:34:08] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] icinga: enable check for psi and omega clusters [puppet] - 10https://gerrit.wikimedia.org/r/488485 (https://phabricator.wikimedia.org/T212850) (owner: 10Mathew.onipe)
[15:34:19] <wikibugs>	 10Operations, 10ops-codfw, 10decommission: Decom mw2213 - https://phabricator.wikimedia.org/T203434 (10Papaul)
[15:35:36] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] mariadb: Depool db1085 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488932 (owner: 10Jcrespo)
[15:37:20] <wikibugs>	 (03Merged) 10jenkins-bot: mariadb: Depool db1085 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488932 (owner: 10Jcrespo)
[15:39:10] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on logstash1007 is CRITICAL: CRITICAL - elasticsearch https://10.64.0.37:9200/_cluster/health error while fetching: [SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:661)
[15:39:10] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on logstash1005 is CRITICAL: CRITICAL - elasticsearch https://10.64.16.185:9200/_cluster/health error while fetching: [SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:661)
[15:39:12] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on logstash1009 is CRITICAL: CRITICAL - elasticsearch https://10.64.32.27:9200/_cluster/health error while fetching: [SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:661)
[15:39:19] <wikibugs>	 (03PS1) 10Gehel: Revert "icinga: enable check for psi and omega clusters" [puppet] - 10https://gerrit.wikimedia.org/r/488952
[15:39:38] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on relforge1001 is CRITICAL: CRITICAL - elasticsearch https://10.64.4.13:9200/_cluster/health error while fetching: [SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:661)
[15:39:44] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on logstash2003 is CRITICAL: CRITICAL - elasticsearch https://10.192.48.131:9200/_cluster/health error while fetching: [SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:661)
[15:39:44] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on logstash2002 is CRITICAL: CRITICAL - elasticsearch https://10.192.32.180:9200/_cluster/health error while fetching: [SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:661)
[15:39:58] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on logstash1004 is CRITICAL: CRITICAL - elasticsearch https://10.64.0.162:9200/_cluster/health error while fetching: [SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:661)
[15:40:41] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] Revert "icinga: enable check for psi and omega clusters" [puppet] - 10https://gerrit.wikimedia.org/r/488952 (owner: 10Gehel)
[15:41:00] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on logstash2001 is CRITICAL: CRITICAL - elasticsearch https://10.192.0.112:9200/_cluster/health error while fetching: [SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:661)
[15:46:00] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on logstash1008 is CRITICAL: CRITICAL - elasticsearch https://10.64.0.90:9200/_cluster/health error while fetching: [SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:661)
[15:46:27] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on logstash2003 is OK: OK - elasticsearch status production-logstash-codfw: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 1, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-codfw, relocating_shards: 0, active_shards_percent_as_number: 100.0, a
[15:46:27] <icinga-wm>	 initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards: 0
[15:46:27] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on logstash2002 is OK: OK - elasticsearch status production-logstash-codfw: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 1, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-codfw, relocating_shards: 0, active_shards_percent_as_number: 100.0, a
[15:46:27] <icinga-wm>	 initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards: 0
[15:46:29] <wikibugs>	 (03CR) 10jenkins-bot: mariadb: Depool db1085 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488932 (owner: 10Jcrespo)
[15:46:40] <vgutierrez>	 uh...
[15:46:41] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on logstash2001 is OK: OK - elasticsearch status production-logstash-codfw: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 1, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-codfw, relocating_shards: 0, active_shards_percent_as_number: 100.0, a
[15:46:41] <icinga-wm>	 initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards: 0
[15:46:45] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on logstash1004 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 86, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 3, active_shards_percent_as_number: 100.0, 
[15:46:45] <icinga-wm>	 2, initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards: 0
[15:46:57] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on logstash1005 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 86, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 3, active_shards_percent_as_number: 100.0, 
[15:46:57] <icinga-wm>	 2, initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards: 0
[15:46:57] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on logstash1007 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 86, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 3, active_shards_percent_as_number: 100.0, 
[15:46:57] <icinga-wm>	 2, initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards: 0
[15:47:01] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on logstash1009 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 86, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 3, active_shards_percent_as_number: 100.0, 
[15:47:01] <icinga-wm>	 2, initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards: 0
[15:47:15] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on logstash1008 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 86, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 3, active_shards_percent_as_number: 100.0, 
[15:47:15] <icinga-wm>	 2, initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards: 0
[15:51:05] <logmsgbot>	 !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1085 (duration: 00m 58s)
[15:51:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:53:09] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on relforge1001 is OK: OK - elasticsearch status relforge-eqiad: status: green, number_of_nodes: 2, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 83, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0, active_shards: 104, in
[15:53:09] <icinga-wm>	 : 0, number_of_data_nodes: 2, delayed_unassigned_shards: 0
[15:53:23] <wikibugs>	 10Operations, 10Wikimedia-Logstash, 10User-herron: Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs - https://phabricator.wikimedia.org/T213899 (10fgiunchedi) I've ran an audit on producers that sent lo...
[15:53:57] <wikibugs>	 10Operations, 10Maps, 10Reading-Infrastructure-Team-Backlog, 10Patch-For-Review: migrate maps servers to stretch with the current style - https://phabricator.wikimedia.org/T198622 (10Gehel)
[15:55:05] <gehel>	 !log starting reimage of maps2004 - T198622
[15:55:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:55:09] <stashbot>	 T198622: migrate maps servers to stretch with the current style - https://phabricator.wikimedia.org/T198622
[15:56:01] <wikibugs>	 (03PS3) 10Gehel: maps: migrate maps2004 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/487360 (https://phabricator.wikimedia.org/T198622) (owner: 10Mathew.onipe)
[15:58:21] <icinga-wm>	 RECOVERY - Long running screen/tmux on prometheus2003 is OK: OK: No SCREEN or tmux processes detected.
[15:58:33] <wikibugs>	 10Operations, 10ops-codfw, 10decommission: Decommission baham - https://phabricator.wikimedia.org/T199247 (10Papaul)
[15:59:52] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] maps: migrate maps2004 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/487360 (https://phabricator.wikimedia.org/T198622) (owner: 10Mathew.onipe)
[16:00:09] <icinga-wm>	 PROBLEM - Long running screen/tmux on restbase1016 is CRITICAL: CRIT: Long running SCREEN process. (user: root PID: 37796, 1737147s 1728000s).
[16:02:04] <godog>	 fixed ^
[16:03:29] <jynus>	 !log restart db1085, temporary s6 lag on wikireplicas
[16:03:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:03:38] <wikibugs>	 10Operations, 10Maps, 10Reading-Infrastructure-Team-Backlog, 10Patch-For-Review: migrate maps servers to stretch with the current style - https://phabricator.wikimedia.org/T198622 (10Gehel)
[16:05:53] <icinga-wm>	 PROBLEM - Long running screen/tmux on an-coord1001 is CRITICAL: CRIT: Long running SCREEN process. (user: otto PID: 26051, 2072360s 1728000s).
[16:07:30] <wikibugs>	 10Operations, 10Maps, 10Reading-Infrastructure-Team-Backlog, 10Patch-For-Review: migrate maps servers to stretch with the current style - https://phabricator.wikimedia.org/T198622 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on cumin2001.codfw.wmnet for hosts: ` ['maps2004.codfw.wmn...
[16:07:48] <wikibugs>	 10Operations, 10Performance-Team, 10Traffic, 10media-storage: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10ori) I don't understand the preference for sampling Swift requests rather than Varnish requests. You'd have greater resilience to overload (for the...
[16:10:53] <wikibugs>	 10Operations, 10ops-eqsin, 10Traffic: Degraded RAID on cp5010 - https://phabricator.wikimedia.org/T214274 (10Vgutierrez) since @ayounsi is going to eqsin datacenter later this month maybe we could join efforts and replace sdb. ^^ @RobH
[16:11:26] <wikibugs>	 (03PS1) 10Jcrespo: Revert "mariadb: Depool db1085 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488956
[16:18:11] <wikibugs>	 10Operations, 10Wiki-Loves-Love, 10Wikimedia-Mailing-lists, 10User-jijiki: Reset password for wll mailling list - https://phabricator.wikimedia.org/T215390 (10Psychoslave) Yes, the email is still valid, if you can also send me the email subject here once sent, that might help find it more quickly in case i...
[16:25:27] <icinga-wm>	 PROBLEM - Host cloudcontrol1004 is DOWN: PING CRITICAL - Packet loss = 100%
[16:25:56] <robh>	 that paged
[16:26:19] <robh>	 is someone working on cloudcontrol1004?
[16:26:22] <mark>	 yes
[16:26:32] <robh>	 ok, didn't wanna assume =]
[16:26:33] <icinga-wm>	 ACKNOWLEDGEMENT - Host cloudcontrol1004 is DOWN: PING CRITICAL - Packet loss = 100% GTirloni T215075
[16:27:18] <wikibugs>	 10Operations, 10ops-codfw, 10Patch-For-Review, 10User-fgiunchedi: ms-be2047 spontaneous reboots - https://phabricator.wikimedia.org/T209921 (10Papaul) Old server has been shipped out. Shipping information below.   {F28148277}
[16:29:34] <wikibugs>	 (03PS5) 10AndyRussG: Give protect right to centralnoticeadmin on Meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483044 (https://phabricator.wikimedia.org/T209873)
[16:40:36] <icinga-wm>	 PROBLEM - Host cloudstore1008.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:41:09] <chaomodus>	 presumably that's related?
[16:44:28] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): cloudvirt1015: apparent hardware errors in CPU/Memory - https://phabricator.wikimedia.org/T215012 (10aborrero) a:05aborrero→03RobH
[16:45:29] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): cloudvirt1015: apparent hardware errors in CPU/Memory - https://phabricator.wikimedia.org/T215012 (10aborrero)
[16:46:09] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): cloudvirt1015: apparent hardware errors in CPU/Memory - https://phabricator.wikimedia.org/T215012 (10aborrero)
[16:46:19] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): cloudvirt1015: apparent hardware errors in CPU/Memory - https://phabricator.wikimedia.org/T215012 (10RobH) ` root@cloudvirt1015.mgmt.eqiad.wmnet's password:  /admin1-> racadm getsel Record:      1 Date/Time:   10/29/...
[16:46:39] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): cloudvirt1015: apparent hardware errors in CPU/Memory - https://phabricator.wikimedia.org/T215012 (10aborrero)
[16:47:16] <icinga-wm>	 PROBLEM - Host mw1299 is DOWN: PING CRITICAL - Packet loss = 100%
[16:49:12] <icinga-wm>	 RECOVERY - Host cloudstore1008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.12 ms
[16:51:26] <wikibugs>	 10Operations, 10MediaWiki-Cache, 10MW-1.33-notes (1.33.0-wmf.16; 2019-02-05), 10Patch-For-Review, and 3 others: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10elukey) So group 1 has been deployed (that should...
[16:52:46] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): cloudvirt1015: apparent hardware errors in CPU/Memory - https://phabricator.wikimedia.org/T215012 (10RobH)
[16:53:52] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): cloudvirt1015: apparent hardware errors in CPU/Memory - https://phabricator.wikimedia.org/T215012 (10RobH) >>! In T215012#4924650, @Andrew wrote: > Since this host is empty we should rebuild it with Stretch before pu...
[16:54:09] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): cloudvirt1015: apparent hardware errors in CPU/Memory - https://phabricator.wikimedia.org/T215012 (10RobH) a:05RobH→03Cmjohnson
[16:54:45] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): cloudvirt1015: apparent hardware errors in CPU/Memory - https://phabricator.wikimedia.org/T215012 (10RobH)
[16:55:00] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): cloudvirt1015: apparent hardware errors in CPU/Memory - https://phabricator.wikimedia.org/T215012 (10RobH)
[16:55:52] <icinga-wm>	 PROBLEM - Host cloudstore1009.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:56:33] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): cloudvirt1015: apparent hardware errors in CPU/Memory - https://phabricator.wikimedia.org/T215012 (10RobH)
[16:58:05] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): cloudvirt1015: apparent hardware errors in CPU/Memory - https://phabricator.wikimedia.org/T215012 (10RobH)
[16:58:07] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] Revert "mariadb: Depool db1085 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488956 (owner: 10Jcrespo)
[16:59:16] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "mariadb: Depool db1085 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488956 (owner: 10Jcrespo)
[16:59:24] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): cloudvirt1015: apparent hardware errors in CPU/Memory - https://phabricator.wikimedia.org/T215012 (10Andrew) Note that this isn't the first time we've had issues with 1015:  T171473
[17:00:04] <jouncebot>	 godog and _joe_: How many deployers does it take to do Puppet SWAT(Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190207T1700).
[17:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[17:01:10] <godog>	 \o/
[17:05:24] <wikibugs>	 (03CR) 10jenkins-bot: Revert "mariadb: Depool db1085 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488956 (owner: 10Jcrespo)
[17:06:38] <icinga-wm>	 RECOVERY - Host cloudstore1009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.15 ms
[17:06:44] <logmsgbot>	 !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1085 (duration: 03m 03s)
[17:06:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:09:58] <jynus>	 connect to host mw1299.eqiad.wmnet port 22: Connection timed out
[17:10:16] <jynus>	 wasn't that one powercycled recently?
[17:10:45] <marostegui>	 jynus: yes, and it came back fine, but looks like it only lasted 5 hours... :(
[17:25:34] <wikibugs>	 10Operations, 10vm-requests: Site: 1 VM request for recommender-systems - https://phabricator.wikimedia.org/T215421 (10Dzahn) @Joe I recommended starting it, partially because i thought the outcome to use a VM was pretty likely and partially because actually listing what resources are needed might be a valuabl...
[17:28:56] <wikibugs>	 10Operations, 10ops-eqiad: cloudstore100{8,9} - Upgrade to 10GbE - https://phabricator.wikimedia.org/T214079 (10RobH) Ok, assisting in this I've done the following:  * removed cloudstore100[89]  from asw2-a-eqiad(ge-5/0/14 & ge-6/0/17) and cloudstore1009 from asw-a-eqiad:ge-6/0/17. ** removed the descriptions,...
[17:31:57] <icinga-wm>	 RECOVERY - MegaRAID on db1073 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy
[17:32:42] <marostegui>	 cmjohnson1: ^ I assume you changed the disk? :-)
[17:33:03] <cmjohnson1>	 yes...sorry I got hung up with cloud stuff
[17:33:17] <marostegui>	 Sure no worries! I will close the task - thank you!
[17:33:26] <cmjohnson1>	 thanks
[17:34:08] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1073 - https://phabricator.wikimedia.org/T215050 (10Marostegui) 05Open→03Resolved Thanks @Cmjohnson for replacing disk #6! ` 17:31 <+icinga-wm> RECOVERY - MegaRAID on db1073 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy `
[17:37:00] <wikibugs>	 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team: cloudcontrol1004 mgmt HTTPS SSL error - https://phabricator.wikimedia.org/T215075 (10Cmjohnson) I updated the f/w and bios with the SPP provided by HP The error did not resolve, I  had to reset the rbsu to manufacturer settings and the err...
[17:38:45] <wikibugs>	 10Operations, 10Proton, 10Security-Team, 10Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q3), 10Reading-Infrastructure-Team-Backlog (Kanban): [2 hrs] Decide on handling system updates for Proton - https://phabricator.wikimedia.org/T213366 (10Jhernandez)
[17:40:30] <icinga-wm>	 RECOVERY - Host cloudcontrol1004 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[17:43:45] <wikibugs>	 (03PS1) 10Bstorm: wiki replicas: Adding the ar_comment_id field to archive_userindex [puppet] - 10https://gerrit.wikimedia.org/r/488972 (https://phabricator.wikimedia.org/T212617)
[17:43:47] <wikibugs>	 (03PS1) 10RobH: migrate cloudstore100[89] to row d dns change [dns] - 10https://gerrit.wikimedia.org/r/488973 (https://phabricator.wikimedia.org/T214079)
[17:44:59] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/481154 (https://phabricator.wikimedia.org/T150264) (owner: 10Faidon Liambotis)
[17:46:02] <wikibugs>	 (03PS2) 10Bstorm: toolforge: shuffle some packages into and around genpp [puppet] - 10https://gerrit.wikimedia.org/r/488208 (https://phabricator.wikimedia.org/T210116)
[17:47:18] <wikibugs>	 (03CR) 10Sbisson: [C: 03+1] GrowthExperiments: Enable search for help panel on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488675 (https://phabricator.wikimedia.org/T209301) (owner: 10Kosta Harlan)
[17:47:32] <wikibugs>	 (03CR) 10Cmjohnson: [C: 03+1] "looks good to me" [dns] - 10https://gerrit.wikimedia.org/r/488973 (https://phabricator.wikimedia.org/T214079) (owner: 10RobH)
[17:47:49] <wikibugs>	 (03CR) 10RobH: [C: 03+2] migrate cloudstore100[89] to row d dns change [dns] - 10https://gerrit.wikimedia.org/r/488973 (https://phabricator.wikimedia.org/T214079) (owner: 10RobH)
[17:48:18] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] wiki replicas: Adding the ar_comment_id field to archive_userindex [puppet] - 10https://gerrit.wikimedia.org/r/488972 (https://phabricator.wikimedia.org/T212617) (owner: 10Bstorm)
[17:48:53] <icinga-wm>	 PROBLEM - HHVM rendering on mw1272 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:50:03] <icinga-wm>	 RECOVERY - HHVM rendering on mw1272 is OK: HTTP OK: HTTP/1.1 200 OK - 75081 bytes in 0.116 second response time
[17:52:22] <wikibugs>	 10Operations, 10MediaWiki-Cache, 10serviceops, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 2 others: Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (10EvanProdromou) One thing we'd need to make sure of is...
[17:54:39] <wikibugs>	 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team: cloudcontrol1004 mgmt HTTPS SSL error - https://phabricator.wikimedia.org/T215075 (10GTirloni) Server looks okay to me. Thanks @Cmjohnson
[17:55:49] <wikibugs>	 10Operations, 10cloud-services-team, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Upgrade Prometheus to 2.6 in deployment-prep and tools - https://phabricator.wikimedia.org/T215272 (10fgiunchedi) Conversion of tools-prometheus-02 worked as expected, I've stopped v1, moved v1 metrics out of the wa...
[18:00:04] <jouncebot>	 cscott, arlolra, subbu, halfak, and Amir1: #bothumor My software never has bugs. It just develops random features. Rise for Services – Graphoid / Parsoid / Citoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190207T1800).
[18:00:24] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install labsdb1012.eqiad.wmnet - https://phabricator.wikimedia.org/T215231 (10elukey) Since this host is important for the Analytics team, I'd be up to take over from the OS install perspective to remove some work from...
[18:05:07] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] mediawiki/scap: do not install sql scripts on canary appservers [puppet] - 10https://gerrit.wikimedia.org/r/479142 (https://phabricator.wikimedia.org/T211512) (owner: 10Dzahn)
[18:11:05] <wikibugs>	 10Operations, 10LDAP-Access-Requests, 10User-Addshore, 10User-jijiki: Add "raz-shuty" to nda ldap group - https://phabricator.wikimedia.org/T214488 (10RStallman-legalteam) Yes, both have signed. Please proceed. Thanks!
[18:27:47] <wikibugs>	 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team: cloudcontrol1004 mgmt HTTPS SSL error - https://phabricator.wikimedia.org/T215075 (10Cmjohnson) 05Open→03Resolved
[18:32:48] <mutante>	 !log LDAP - adding raz-shuty to group nda (T214488)
[18:32:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:32:52] <stashbot>	 T214488: Add "raz-shuty" to nda ldap group - https://phabricator.wikimedia.org/T214488
[18:34:50] <wikibugs>	 10Operations, 10LDAP-Access-Requests, 10User-Addshore, 10User-jijiki: Add "raz-shuty" to nda ldap group - https://phabricator.wikimedia.org/T214488 (10Dzahn) 05Open→03Resolved a:03Dzahn @RazShuty @addshore done !  (Raz was already in other LDAP groups (wmde) so no code change needed in the admin modu...
[18:35:13] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on cloudelastic1004 - https://phabricator.wikimedia.org/T215542 (10ops-monitoring-bot)
[18:38:45] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): cloudelastic1004: SMART/disk error - https://phabricator.wikimedia.org/T209029 (10Cmjohnson) The disk has been replaced, @aborrero the OS will need to be re-installed.  Until then the raid is out of whack because I removed /dev/sda.
[18:39:04] <wikibugs>	 (03PS3) 10Bstorm: toolforge: shuffle some packages into and around genpp [puppet] - 10https://gerrit.wikimedia.org/r/488208 (https://phabricator.wikimedia.org/T210116)
[18:40:30] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] toolforge: shuffle some packages into and around genpp [puppet] - 10https://gerrit.wikimedia.org/r/488208 (https://phabricator.wikimedia.org/T210116) (owner: 10Bstorm)
[18:40:51] <wikibugs>	 10Operations, 10ops-eqiad, 10Patch-For-Review: cloudstore100{8,9} - Upgrade to 10GbE - https://phabricator.wikimedia.org/T214079 (10Cmjohnson) a:03RobH @RobH Can you do a re-install and hand off to cloud, please.  I moved the servers to row D racks d2 and d7 I connected to 10G switch I changed bios boot cf...
[18:41:49] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on cloudelastic1004 - https://phabricator.wikimedia.org/T215542 (10Cmjohnson) 05Open→03Invalid
[18:43:56] <wikibugs>	 10Operations, 10ops-eqiad, 10Patch-For-Review: cloudstore100{8,9} - Upgrade to 10GbE - https://phabricator.wikimedia.org/T214079 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` cloudstore1008.wikimedia.org ` The log can be found in `/var/log/wmf-auto-...
[18:44:26] <wikibugs>	 10Operations, 10ops-eqiad, 10Patch-For-Review: cloudstore100{8,9} - Upgrade to 10GbE - https://phabricator.wikimedia.org/T214079 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudstore1008.wikimedia.org'] `  Of which those **FAILED**: ` ['cloudstore1008.wikimedia.org'] `
[18:47:11] <wikibugs>	 (03PS6) 10Cwhite: role: add backwards-compatibility rules to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/485889 (https://phabricator.wikimedia.org/T213708)
[18:47:19] <wikibugs>	 (03Abandoned) 10Dzahn: admins: remove empty OIT admin group [puppet] - 10https://gerrit.wikimedia.org/r/488119 (owner: 10Dzahn)
[18:56:30] <icinga-wm>	 PROBLEM - Check systemd state on cloudcontrol1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[18:57:55] <wikibugs>	 (03PS1) 10RobH: update cloudstore100[89] mac addresses [puppet] - 10https://gerrit.wikimedia.org/r/488992 (https://phabricator.wikimedia.org/T214079)
[18:59:08] <wikibugs>	 (03CR) 10RobH: [C: 03+2] update cloudstore100[89] mac addresses [puppet] - 10https://gerrit.wikimedia.org/r/488992 (https://phabricator.wikimedia.org/T214079) (owner: 10RobH)
[18:59:10] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] role: add backwards-compatibility rules to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/485889 (https://phabricator.wikimedia.org/T213708) (owner: 10Cwhite)
[18:59:42] <wikibugs>	 (03PS7) 10Cwhite: role: add backwards-compatibility rules to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/485889 (https://phabricator.wikimedia.org/T213708)
[19:00:04] <jouncebot>	 addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Morning SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190207T1900).
[19:00:04] <jouncebot>	 Zppix and kostajh: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[19:00:26] <kostajh>	 I'm here
[19:00:48] <wikibugs>	 10Operations, 10ops-eqiad, 10Patch-For-Review: cloudstore100{8,9} - Upgrade to 10GbE - https://phabricator.wikimedia.org/T214079 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for cloudstore1008.wikimedia.org and performed the following actions: - Revoked Puppet certificate - Removed from...
[19:01:22] <wikibugs>	 10Operations, 10ops-eqiad, 10Patch-For-Review: cloudstore100{8,9} - Upgrade to 10GbE - https://phabricator.wikimedia.org/T214079 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for cloudstore1009.wikimedia.org and performed the following actions: - Revoked Puppet certificate - Removed from...
[19:09:23] <kostajh>	 Anyone around to do SWAT?
[19:13:21] <stephanebisson>	 I'll SWAT
[19:13:30] <kostajh>	 stephanebisson: thanks
[19:15:13] <wikibugs>	 (03PS2) 10Sbisson: GrowthExperiments: Enable search for help panel on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488675 (https://phabricator.wikimedia.org/T209301) (owner: 10Kosta Harlan)
[19:15:28] <wikibugs>	 (03CR) 10Sbisson: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488675 (https://phabricator.wikimedia.org/T209301) (owner: 10Kosta Harlan)
[19:15:30] <wikibugs>	 (03PS4) 10Dzahn: librenms/smokeping/rancid/netbox: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/486150
[19:16:49] <wikibugs>	 (03Merged) 10jenkins-bot: GrowthExperiments: Enable search for help panel on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488675 (https://phabricator.wikimedia.org/T209301) (owner: 10Kosta Harlan)
[19:17:41] <anomie>	 stephanebisson: FYI, I just added something to the SWAT.
[19:17:51] <stephanebisson>	 anomie: ok
[19:19:39] <wikibugs>	 (03CR) 10jenkins-bot: GrowthExperiments: Enable search for help panel on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488675 (https://phabricator.wikimedia.org/T209301) (owner: 10Kosta Harlan)
[19:21:16] <wikibugs>	 (03PS20) 10Cwhite: prometheus: upgrade to node-exporter 0.17 [puppet] - 10https://gerrit.wikimedia.org/r/486192 (https://phabricator.wikimedia.org/T213708)
[19:23:35] <wikibugs>	 10Operations, 10ops-eqiad, 10Patch-For-Review: cloudstore100{8,9} - Upgrade to 10GbE - https://phabricator.wikimedia.org/T214079 (10RobH) a:05RobH→03GTirloni Ok, these are both reinstalled and ready for use/takeover.
[19:23:44] <stephanebisson>	 kostajh: I've tested enabling on search, now syncing it. Do you want to test the ios scrolling issue?
[19:24:16] <kostajh>	 stephanebisson: yes please
[19:24:37] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] prometheus: upgrade to node-exporter 0.17 [puppet] - 10https://gerrit.wikimedia.org/r/486192 (https://phabricator.wikimedia.org/T213708) (owner: 10Cwhite)
[19:25:13] <logmsgbot>	 !log sbisson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:488675|GrowthExperiments: Enable search for help panel on testwiki]] (duration: 03m 04s)
[19:25:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:28:01] <logmsgbot>	 !log sbisson@deploy1001 sync-file aborted: SWAT: [[gerrit:488675|GrowthExperiments: Enable search for help panel on testwiki]] (duration: 02m 22s)
[19:28:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:29:16] <stephanebisson>	 kostajh: Your patch is now on mwdebug1002
[19:30:13] <kostajh>	 stephanebisson: ok, looking
[19:30:21] <stephanebisson>	 FYI operation people: I encountered sync timeouts today https://phabricator.wikimedia.org/P8060
[19:30:51] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "parameter 'admins' expects a Boolean value, got String" [puppet] - 10https://gerrit.wikimedia.org/r/486150 (owner: 10Dzahn)
[19:31:35] <wikibugs>	 (03PS3) 10Herron: lists:warn if unknown host issues mail from cmd containing our domain [puppet] - 10https://gerrit.wikimedia.org/r/488602 (https://phabricator.wikimedia.org/T215251)
[19:32:46] <wikibugs>	 (03CR) 10BryanDavis: admin: create new system groups for cloudelastic nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/487040 (https://phabricator.wikimedia.org/T214922) (owner: 10Mathew.onipe)
[19:36:05] <ebernhardson>	 stephanebisson: can I add another to SWAT?
[19:37:10] <ebernhardson>	 (i can deploy if you're already done)
[19:37:34] <SMalyshev>	 we probably want to deploy it asap since it's a production error (https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikibaseLexeme/+/489000)
[19:39:15] <stephanebisson>	 ebernhardson: sure. We're almost finished with kostajh's patches. You and anomie can discuss the relative priorities of your patches.
[19:39:47] <SMalyshev>	 we can wait for anomie's patches to be done
[19:40:16] <SMalyshev>	 it's not *that* urgent (it's some empty searches erroring out but not anything on fire seriously)
[19:42:30] <kostajh>	 stephanebisson: looks good, please merge
[19:42:37] <stephanebisson>	 kostajh: syncing your patch now
[19:43:01] <stephanebisson>	 anomie: Your patch is next? Do you prefer to do it yourself?
[19:43:12] <anomie>	 stephanebisson: I can, but I'd rather be lazy ;)
[19:43:27] <wikibugs>	 (03PS1) 10Andrew Bogott: openstack: refactor 'envscript' bits into their own profile [puppet] - 10https://gerrit.wikimedia.org/r/489001 (https://phabricator.wikimedia.org/T215211)
[19:43:45] <stephanebisson>	 anomie: I understand
[19:43:49] <stephanebisson>	 I'll do it
[19:45:01] <logmsgbot>	 !log sbisson@deploy1001 Synchronized php-1.33.0-wmf.16/extensions/GrowthExperiments/: SWAT: [[gerrit:488988|Help Panel: Fix iOS scroll bug]] (duration: 03m 02s)
[19:45:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:45:50] <stephanebisson>	 mw1299.eqiad.wmnet is always timing out on scap-sync... Is it a problem?
[19:47:07] <wikibugs>	 (03PS5) 10Dzahn: librenms/smokeping/rancid/netbox: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/486150
[19:47:40] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] librenms/smokeping/rancid/netbox: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/486150 (owner: 10Dzahn)
[19:49:35] <stephanebisson>	 anomie: It looks like your change is going to fail the php71-docker job: https://integration.wikimedia.org/zuul/
[19:50:21] <stephanebisson>	 I'll proceed with SMalyshev 's patch
[19:50:26] <anomie>	 stephanebisson: Stupid flaky npm.
[19:50:32] <SMalyshev>	 cool
[19:50:33] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] openstack: refactor 'envscript' bits into their own profile [puppet] - 10https://gerrit.wikimedia.org/r/489001 (https://phabricator.wikimedia.org/T215211) (owner: 10Andrew Bogott)
[19:51:05] <wikibugs>	 10Operations, 10Cloud-VPS, 10SRE-Access-Requests, 10cloud-services-team, and 2 others: Create cloudelastic-root group - https://phabricator.wikimedia.org/T214922 (10bd808) Related rights groups are wmcs-roots and wmcs-admin. Those 2 groups grant broader rights across Cloud Services bare metal instances (Op...
[19:55:41] <wikibugs>	 (03PS1) 10Andrew Bogott: openstack: include 'envscripts' on compute nodes [puppet] - 10https://gerrit.wikimedia.org/r/489005 (https://phabricator.wikimedia.org/T215211)
[19:57:45] <wikibugs>	 (03PS1) 10Herron: logstash: add input identifier tags to kafka logstash inputs [puppet] - 10https://gerrit.wikimedia.org/r/489006 (https://phabricator.wikimedia.org/T213899)
[20:00:04] <jouncebot>	 twentyafterfour: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Americas version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190207T2000).
[20:00:17] <icinga-wm>	 RECOVERY - Long running screen/tmux on restbase1016 is OK: OK: No SCREEN or tmux processes detected.
[20:00:47] <wikibugs>	 (03CR) 10Herron: "PCC looks good https://puppet-compiler.wmflabs.org/compiler1002/14575/" [puppet] - 10https://gerrit.wikimedia.org/r/489006 (https://phabricator.wikimedia.org/T213899) (owner: 10Herron)
[20:00:53] <stephanebisson>	 We still have 1 patch in progress in this SWAT window
[20:01:03] <wikibugs>	 (03PS2) 10Herron: logstash: add input identifier tags to kafka logstash inputs [puppet] - 10https://gerrit.wikimedia.org/r/489006 (https://phabricator.wikimedia.org/T213899)
[20:01:15] <stephanebisson>	 You can do it Jenkins, come on
[20:01:42] <SMalyshev>	 yeah these things are long... about 20 mins
[20:02:05] <wikibugs>	 (03CR) 10Herron: [C: 03+2] logstash: add input identifier tags to kafka logstash inputs [puppet] - 10https://gerrit.wikimedia.org/r/489006 (https://phabricator.wikimedia.org/T213899) (owner: 10Herron)
[20:02:35] <stephanebisson>	 SMalyshev: Is your patch testable through a debug server?
[20:02:48] <SMalyshev>	 stephanebisson: should be...
[20:07:38] <wikibugs>	 (03PS4) 10Mathew.onipe: admin: create new system groups for cloudelastic nodes [puppet] - 10https://gerrit.wikimedia.org/r/487040 (https://phabricator.wikimedia.org/T214922)
[20:14:09] <SMalyshev>	 stephanebisson: ok CI is done
[20:15:19] <stephanebisson>	 SMalyshev: your change should be on mwdebug1002 for you to test
[20:15:29] <SMalyshev>	 great, testing
[20:16:15] <SMalyshev>	 stephanebisson: yep seems to be working just like it should
[20:16:47] <stephanebisson>	 SMalyshev: deploying now
[20:17:57] <SMalyshev>	 thanks!
[20:19:38] <logmsgbot>	 !log sbisson@deploy1001 Synchronized php-1.33.0-wmf.16/extensions/WikibaseLexeme/src/DataAccess/Search/LexemeFulltextResult.php: SWAT: [[gerrit:489000|Fix fatal error - EmptySet does not exist anymore]] (duration: 03m 03s)
[20:19:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:20:06] <stephanebisson>	 And that concludes SWAT for now. Sorry for the delay. 
[20:21:04] <stephanebisson>	 Just want to reiterate that syncing to mw1299.eqiad.wmnet has been timing out during this SWAT window.
[20:31:39] <wikibugs>	 (03PS1) 10Paladox: [wip] zuul: Convert to using scap [puppet] - 10https://gerrit.wikimedia.org/r/489012
[20:33:25] <wikibugs>	 10Operations, 10ops-eqsin, 10Traffic: Degraded RAID on cp5010 - https://phabricator.wikimedia.org/T214274 (10RobH) Ok, I opened a support request with dell to ship a replacement SSD to eqsin:   Confirmed: Request 986142470 was successfully submitted.
[20:35:24] <wikibugs>	 (03PS6) 10Dzahn: librenms/smokeping/rancid/netbox: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/486150
[20:36:15] <wikibugs>	 10Operations, 10Analytics, 10Research, 10serviceops, and 4 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10Nuria) Ideally I would prefer that stats machines are completely out of the workflow of pushing data to machines like mwmaint1002.eqia...
[20:38:00] <wikibugs>	 10Operations, 10Performance-Team, 10Traffic, 10media-storage: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10Gilles) The "initial ramp up" might not ever be done, if we reach a point where the writes and deletes introduced are creating too much overhead, we...
[20:39:42] <wikibugs>	 10Operations, 10ops-codfw, 10decommission: Decom mw2213 - https://phabricator.wikimedia.org/T203434 (10Papaul)
[20:42:26] <wikibugs>	 10Operations, 10Analytics, 10Discovery, 10Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (10Nuria)
[20:43:14] <wikibugs>	 10Operations, 10Analytics, 10Research, 10serviceops, and 4 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10Nuria) As @Ottomata pointed out more generic discussion about this topic can be found here: https://phabricator.wikimedia.org/T213976
[20:44:43] <wikibugs>	 (03CR) 10Ppchelko: [C: 03+1] mathoid: Remove mwapi_req/restbase_req [deployment-charts] - 10https://gerrit.wikimedia.org/r/488800 (owner: 10Alexandros Kosiaris)
[20:55:35] <twentyafterfour>	 !log train status: deploying 1.33.0-wmf.16 to group2 
[20:55:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:57:22] <wikibugs>	 (03CR) 10BryanDavis: "I understand why this is desired by the Community Tech team, but I'm not super excited about adding all of this bloat to every php7.2 Kube" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/488764 (https://phabricator.wikimedia.org/T213669) (owner: 10Samwilson)
[20:57:25] <wikibugs>	 (03PS1) 1020after4: group2 wikis to 1.33.0-wmf.16  refs T206670 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489017
[20:57:27] <wikibugs>	 (03CR) 1020after4: [C: 03+2] group2 wikis to 1.33.0-wmf.16  refs T206670 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489017 (owner: 1020after4)
[20:58:37] <wikibugs>	 (03Merged) 10jenkins-bot: group2 wikis to 1.33.0-wmf.16  refs T206670 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489017 (owner: 1020after4)
[20:59:44] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, this must be merged at the same time of I731669c28791005237418c36787d2eb42f4c3312 so that the next puppet run should do the right th" [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/487612 (owner: 10CRusnov)
[21:00:01] <wikibugs>	 (03CR) 10jenkins-bot: group2 wikis to 1.33.0-wmf.16  refs T206670 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489017 (owner: 1020after4)
[21:00:06] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, this must be merged together with I19ed3b30a71a11226447779055601463a2b43fd3" [puppet] - 10https://gerrit.wikimedia.org/r/488235 (owner: 10CRusnov)
[21:05:17] <wikibugs>	 10Operations, 10Cloud-Services, 10Kubernetes: etcd config depends on puppet certs, but puppet doesn't know - https://phabricator.wikimedia.org/T169287 (10Bstorm)
[21:05:20] <wikibugs>	 10Operations, 10ops-eqsin, 10Traffic: Degraded RAID on cp5010 - https://phabricator.wikimedia.org/T214274 (10RobH) Oh, just the output from troubleshooting on the system.  The system should show TWO SSDs and only sees one now:   `  robh@cp5010:~$ cat /proc/mdstat Personalities : [raid1] [linear] [multipath]...
[21:06:56] <wikibugs>	 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-herron: Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs - https://phabricator.wikimedia.org/T213899 (10herron) >>! In T213899#4935098, @...
[21:09:51] <wikibugs>	 (03CR) 10Jforrester: "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488836 (owner: 10Reedy)
[21:11:28] <wikibugs>	 (03CR) 10Reedy: "Seems the most sensible option longer term, yeah" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488836 (owner: 10Reedy)
[21:12:00] <wikibugs>	 10Operations, 10CirrusSearch, 10serviceops, 10Discovery-Search (Current work), 10Patch-For-Review: Find an alternative to HHVM curl connection pooling for PHP 7 - https://phabricator.wikimedia.org/T210717 (10debt) Moving to #discovery-search-sprint waiting column to see if there is anything else we need...
[21:12:27] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "Small detail inline, as discussed on IRC" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/486150 (owner: 10Dzahn)
[21:12:48] <wikibugs>	 10Operations, 10ops-eqsin, 10Traffic: Degraded RAID on cp5010 - https://phabricator.wikimedia.org/T214274 (10Vgutierrez) that's right, the kernel shut down sdb due to the errors, that's why it is not even listed on lshw
[21:15:30] <wikibugs>	 (03PS9) 10Ottomata: Add kafka-dev chart for local development [deployment-charts] - 10https://gerrit.wikimedia.org/r/484498 (https://phabricator.wikimedia.org/T211247)
[21:15:33] <wikibugs>	 (03CR) 10Ottomata: Add kafka-dev chart for local development (038 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/484498 (https://phabricator.wikimedia.org/T211247) (owner: 10Ottomata)
[21:18:02] <wikibugs>	 (03PS1) 10Gilles: Set expiry headers on thumbnails [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/489022 (https://phabricator.wikimedia.org/T211661)
[21:19:42] <wikibugs>	 10Operations, 10ops-eqsin, 10Traffic: Degraded RAID on cp5010 - https://phabricator.wikimedia.org/T214274 (10Vgutierrez) here is the log line: `Jan 21 01:39:21 cp5010 kernel: [7472184.163052] sd 1:0:0:0: [sdb] Stopping disk`
[21:21:54] <wikibugs>	 10Operations, 10ops-ulsfo, 10Patch-For-Review: ulsfo: install new PDUs in racks / phase out APC loaner PDU use - https://phabricator.wikimedia.org/T209101 (10RobH) all PDUs in ulsfo are now properly mounted.  The temp/humidity leads are plugged in, but not run anywhere until AFTER we get rid of the decom sys...
[21:22:15] <robh>	 !log updating firmware on ps1-22-ulsfo via T209101
[21:22:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:22:18] <stashbot>	 T209101: ulsfo: install new PDUs in racks / phase out APC loaner PDU use - https://phabricator.wikimedia.org/T209101
[21:22:23] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "We have at least two other related changes that are needed:" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/487882 (https://phabricator.wikimedia.org/T215275) (owner: 10Jbond)
[21:31:20] <wikibugs>	 10Operations, 10Release Pipeline, 10Patch-For-Review, 10Release-Engineering-Team (Backlog): Design pipeline image versioning scheme - https://phabricator.wikimedia.org/T209088 (10jeena) I think it would be useful to have a tag with the version or include it in one of the tags. The date only tells you what...
[21:33:00] <wikibugs>	 (03CR) 10Mobrovac: "Hmm, while I agree about simplifying things, these templates are loaded by the template code regardless. Even though Mathoid doesn't use t" [deployment-charts] - 10https://gerrit.wikimedia.org/r/488800 (owner: 10Alexandros Kosiaris)
[21:33:46] <wikibugs>	 (03CR) 10Niharika29: "Bryan, without this the SVG Translate tool is quite useless because most of the languages don't show up. What do you recommend we do?" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/488764 (https://phabricator.wikimedia.org/T213669) (owner: 10Samwilson)
[21:38:41] <wikibugs>	 10Operations, 10ops-ulsfo, 10Patch-For-Review: ulsfo: install new PDUs in racks / phase out APC loaner PDU use - https://phabricator.wikimedia.org/T209101 (10RobH) Ok, while updating these, I've noticed that the power feeds in ulsfo are not balanced.  Tower A is around 7 amps and tower B is around 2 amps for...
[21:38:53] <robh>	 !log updating firmware on ps1-23-ulsfo via T209101  ps1-22-ulsfo update completed
[21:38:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:38:57] <stashbot>	 T209101: ulsfo: install new PDUs in racks / phase out APC loaner PDU use - https://phabricator.wikimedia.org/T209101
[21:40:22] <wikibugs>	 (03CR) 10Ottomata: add statsd_exporter config to mathoid (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/482718 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite)
[21:41:13] <wikibugs>	 (03PS2) 10Andrew Bogott: openstack: include 'envscripts' on compute nodes [puppet] - 10https://gerrit.wikimedia.org/r/489005 (https://phabricator.wikimedia.org/T215211)
[21:42:08] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] openstack: include 'envscripts' on compute nodes [puppet] - 10https://gerrit.wikimedia.org/r/489005 (https://phabricator.wikimedia.org/T215211) (owner: 10Andrew Bogott)
[21:43:27] <logmsgbot>	 !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group2 wikis to 1.33.0-wmf.16  refs T206670
[21:43:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:43:30] <stashbot>	 T206670: 1.33.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T206670
[21:44:45] <icinga-wm>	 PROBLEM - HHVM rendering on mw1289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:47:13] <icinga-wm>	 RECOVERY - HHVM rendering on mw1289 is OK: HTTP OK: HTTP/1.1 200 OK - 75072 bytes in 0.957 second response time
[21:50:42] <wikibugs>	 (03PS2) 10Volans: Use standard version of plain-text GPL [cookbooks] - 10https://gerrit.wikimedia.org/r/460731 (owner: 10Legoktm)
[21:51:08] <twentyafterfour>	 hmm, there is a significant increase of 60 second timeouts after promoting group2 to wmf.16
[21:52:54] <wikibugs>	 (03CR) 10Volans: [C: 03+2] Use standard version of plain-text GPL [cookbooks] - 10https://gerrit.wikimedia.org/r/460731 (owner: 10Legoktm)
[21:54:39] <wikibugs>	 (03Merged) 10jenkins-bot: Use standard version of plain-text GPL [cookbooks] - 10https://gerrit.wikimedia.org/r/460731 (owner: 10Legoktm)
[21:57:24] <twentyafterfour>	 meh looks transient. it's no longer possible to push out the train without a big flood of timeouts spamming the logs, or at least that seems to be the new normal
[21:58:08] <bd808>	 let's hope that is an HHVM warmup problem that php7 will fix
[21:58:49] <twentyafterfour>	 yeah I hope so 
[21:58:54] <wikibugs>	 (03PS16) 10Ottomata: Helm chart for eventgate-analytics deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/483035 (https://phabricator.wikimedia.org/T211247)
[21:59:28] <twentyafterfour>	 the spike lasts for about 10 minutes, that's one hell of a warmup period
[22:00:50] <wikibugs>	 (03PS1) 10Reedy: Add a test to check dblists are sorted consistently [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489086
[22:01:33] <wikibugs>	 (03CR) 10Ottomata: Helm chart for eventgate-analytics deployment (0320 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/483035 (https://phabricator.wikimedia.org/T211247) (owner: 10Ottomata)
[22:01:41] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add a test to check dblists are sorted consistently [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489086 (owner: 10Reedy)
[22:15:36] <wikibugs>	 10Operations, 10monitoring, 10Patch-For-Review: EDAC events not being reported by node-exporter? - https://phabricator.wikimedia.org/T214529 (10CDanis) Talked some with @BBlack today, who observed that there are in fact a variety of drivers that back this stuff in the kernel, and that it's very possible we'r...
[22:23:05] <wikibugs>	 10Operations, 10Wikimedia-Mailing-lists: Please create docker-sig@ mailing list - https://phabricator.wikimedia.org/T215563 (10greg)
[22:30:04] <wikibugs>	 (03PS7) 10Dzahn: librenms/smokeping/rancid/netbox: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/486150
[22:33:26] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/14577/" [puppet] - 10https://gerrit.wikimedia.org/r/486150 (owner: 10Dzahn)
[22:34:19] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] librenms/smokeping/rancid/netbox: add data types to parameters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/486150 (owner: 10Dzahn)
[22:35:07] <wikibugs>	 (03PS8) 10Dzahn: librenms/smokeping/rancid/netbox: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/486150
[22:41:06] <wikibugs>	 (03PS2) 10Reedy: sort dblists... [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488836
[22:41:09] <Reedy>	 jouncebot: now
[22:41:09] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 18 minute(s)
[22:41:10] <Reedy>	 jouncebot: next
[22:41:10] <jouncebot>	 In 1 hour(s) and 18 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190208T0000)
[22:42:05] <wikibugs>	 (03CR) 10Dzahn: "noop on all netmon servers" [puppet] - 10https://gerrit.wikimedia.org/r/486150 (owner: 10Dzahn)
[22:42:43] <wikibugs>	 (03CR) 10Reedy: [C: 03+2] sort dblists... [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488836 (owner: 10Reedy)
[22:43:07] <wikibugs>	 (03CR) 10Jforrester: [C: 03+1] sort dblists... [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488836 (owner: 10Reedy)
[22:44:55] <wikibugs>	 (03Merged) 10jenkins-bot: sort dblists... [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488836 (owner: 10Reedy)
[22:46:07] <wikibugs>	 (03PS2) 10Reedy: Add a test to check dblists are sorted consistently [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489086
[22:48:37] <logmsgbot>	 !log reedy@deploy1001 Synchronized dblists/: alphasort dblists (duration: 02m 56s)
[22:48:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:49:01] <icinga-wm>	 PROBLEM - puppet last run on an-worker1090 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:53:11] <wikibugs>	 (03CR) 10jenkins-bot: sort dblists... [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488836 (owner: 10Reedy)
[22:59:11] <wikibugs>	 10Operations, 10ops-ulsfo, 10Patch-For-Review: ulsfo: install new PDUs in racks / phase out APC loaner PDU use - https://phabricator.wikimedia.org/T209101 (10RobH) Ok, firmware updated and all power balanced.
[23:00:27] <wikibugs>	 (03PS3) 10Reedy: Add a test to check dblists are sorted consistently [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489086
[23:00:38] <mutante>	 Krinkle: gpg: sending key 06670C4D66D17553 to hkps://hkps.pool.sks-keyservers.net
[23:01:00] <mutante>	 signed
[23:01:29] <mutante>	 (re: keysigning party that didn't happen at allhands)
[23:03:10] <robh>	 i wanna make a gpg key joke but i dont have the heart to mock it.
[23:05:59] <Krinkle>	 mutante: okay
[23:06:22] <wikibugs>	 (03CR) 10Reedy: [C: 03+2] Add a test to check dblists are sorted consistently [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489086 (owner: 10Reedy)
[23:07:25] <wikibugs>	 (03Merged) 10jenkins-bot: Add a test to check dblists are sorted consistently [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489086 (owner: 10Reedy)
[23:09:13] <Krinkle>	 Reedy: thanks :)
[23:13:00] <wikibugs>	 10Operations, 10ops-ulsfo: ulsfo: install new PDUs in racks / phase out APC loaner PDU use - https://phabricator.wikimedia.org/T209101 (10RobH)
[23:13:12] <wikibugs>	 10Operations, 10ops-ulsfo: ulsfo: install new PDUs in racks / phase out APC loaner PDU use - https://phabricator.wikimedia.org/T209101 (10RobH)
[23:14:33] <Reedy>	 Krinkle: Hm?
[23:14:59] <wikibugs>	 10Operations, 10ops-ulsfo: ulsfo: setup ulsfo PDUs - https://phabricator.wikimedia.org/T209101 (10RobH)
[23:15:01] <icinga-wm>	 RECOVERY - puppet last run on an-worker1090 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[23:16:39] <wikibugs>	 (03CR) 10jenkins-bot: Add a test to check dblists are sorted consistently [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489086 (owner: 10Reedy)
[23:17:10] <Krinkle>	 Reedy: The sorting dblist test
[23:17:12] <wikibugs>	 (03CR) 10BryanDavis: "> Bryan, without this the SVG Translate tool is quite useless because" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/488764 (https://phabricator.wikimedia.org/T213669) (owner: 10Samwilson)
[23:17:15] <Reedy>	 Aha
[23:17:24] <Krinkle>	 mutante: haven't received it yet btw, I guess it takes a while to replicate. Want to e-mail?
[23:17:37] <Krinkle>	 (can send encrypted for my key)
[23:17:46] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s4 on db2084 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 450.61 seconds
[23:18:12] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s4 on db2091 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 465.49 seconds
[23:18:14] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s4 on db2058 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 467.14 seconds
[23:18:16] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s4 on db2090 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 469.17 seconds
[23:18:34] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s4 on db2073 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 478.49 seconds
[23:18:45] <logmsgbot>	 !log reedy@deploy1001 Synchronized README: must be up to date (duration: 02m 54s)
[23:18:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:18:50] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s4 on db2065 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 488.08 seconds
[23:18:56] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s4 on db2051 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 493.12 seconds
[23:20:17] <wikibugs>	 10Operations, 10ops-eqiad: mw1299 is down - https://phabricator.wikimedia.org/T215569 (10Reedy)
[23:21:06] <mutante>	 Krinkle: yea, somehow it's always delayed a bit. mailed!
[23:23:25] <wikibugs>	 10Operations, 10ops-eqiad: mw1299 is down - https://phabricator.wikimedia.org/T215569 (10Reedy) Depending what's up with it... It might want depooling and removing from the scap host lists
[23:23:49] <logmsgbot>	 !log reedy@deploy1001 Synchronized tests/dblistTest.php: Sync test (duration: 02m 55s)
[23:23:49] <Krinkle>	 mutante: thx, got it
[23:23:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:28:56] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s4 on db2091 is OK: OK slave_sql_lag Replication lag: 42.26 seconds
[23:28:58] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s4 on db2058 is OK: OK slave_sql_lag Replication lag: 34.05 seconds
[23:29:00] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s4 on db2090 is OK: OK slave_sql_lag Replication lag: 30.11 seconds
[23:29:10] <XioNoX>	 !log restart ps1-22-ulsfo
[23:29:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:29:20] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s4 on db2073 is OK: OK slave_sql_lag Replication lag: 0.24 seconds
[23:29:34] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s4 on db2084 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[23:29:40] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s4 on db2065 is OK: OK slave_sql_lag Replication lag: 0.43 seconds
[23:29:46] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s4 on db2051 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[23:36:49] <greg-g>	 mutante: Krinkle wait, are we doing a keysigning party still? :)
[23:37:18] <Krinkle>	 greg-g: you can be next ^_^
[23:38:48] <mutante>	 that's why i said it on channel instead of just PM basically :)
[23:39:09] <mutante>	 krinkle had given me a piece of paper 
[23:40:35] <greg-g>	 I just made https://people.wikimedia.org/~gjg/tmp/ksp-releng-20190129.txt for our team to do on monday in hangout
[23:40:40] <wikibugs>	 (03CR) 10EBernhardson: [C: 03+1] "labs will continue using hhvm pools until the next patch after which they will be un-pooled. Seems reasonable enough." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488895 (https://phabricator.wikimedia.org/T215491) (owner: 10DCausse)
[23:40:57] <mutante>	 greg-g: we can have a global signing party on hangout some day
[23:41:05] <greg-g>	 I'm down
[23:41:52] <mutante>	 you know. this would have been a good activity for the icebreaker challenge.. bonus item if you get your bingo card signed with gpg
[23:42:18] <mutante>	 well, maybe for engineering
[23:44:15] <greg-g>	 :)
[23:48:58] <wikibugs>	 (03CR) 10Samwilson: "> Run it as a webservice on the Debian Stretch job grid? We have all the fonts in that environment as far as I know." [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/488764 (https://phabricator.wikimedia.org/T213669) (owner: 10Samwilson)
[23:49:46] <greg-g>	 I love saying random numbers and letters to my computer in a coffee shop
[23:49:55] <greg-g>	 they already think I'm weird here, so it's OK
[23:50:09] <wikibugs>	 (03PS2) 10CRusnov: Add reports element to reports path in netbox config [puppet] - 10https://gerrit.wikimedia.org/r/488235
[23:50:26] <wikibugs>	 (03PS1) 10Dzahn: have CNAMEs for bastions in each DC, so numbers dont change for users [dns] - 10https://gerrit.wikimedia.org/r/489103
[23:50:34] <mutante>	 greg-g: ^
[23:50:40] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] have CNAMEs for bastions in each DC, so numbers dont change for users [dns] - 10https://gerrit.wikimedia.org/r/489103 (owner: 10Dzahn)
[23:51:10] <wikibugs>	 (03CR) 10CRusnov: [C: 03+2] Add reports element to reports path in netbox config [puppet] - 10https://gerrit.wikimedia.org/r/488235 (owner: 10CRusnov)
[23:51:28] <greg-g>	 mutante: :)
[23:51:49] <mutante>	 RESULT: 0 Errors, 2223 Warnings  :p
[23:52:06] <wikibugs>	 (03CR) 10CRusnov: [C: 03+2] Reorganize and add tox/CI support for repository. [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/487612 (owner: 10CRusnov)