[00:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor I � Unicode. All rise for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190207T0000).
[00:00:04] MatmaRex and tgr: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:17] hi
[00:03:04] o/
[00:03:12] hi MatmaRex
[00:03:33] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:03:45] I can swat if needed
[00:05:09] tgr: are you able to test for SWAT?
[00:05:25] MatmaRex: I'll merge your changes first
[00:06:35] twentyafterfour: yeah
[00:06:42] (CR) Nuria: [C: +1] "Looks good, if Erik's experiments are successful we will also give sudo to gilles and (maybe) Adam Baso" [puppet] - https://gerrit.wikimedia.org/r/488606 (https://phabricator.wikimedia.org/T215384) (owner: Dzahn)
[00:15:17] (PS5) 20after4: Merge the "extended-uploader" and "autopatrolled" user groups on Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/485487 (https://phabricator.wikimedia.org/T214003) (owner: Zoranzoki21)
[00:19:03] ok the extension patches are taking the slow ride through CI. tgr: I take it that the migration should happen first before the config change?
[00:19:49] twentyafterfour: yeah, after the patch is merged one of the groups wouldn't exist anymore
[00:24:23] !log running `mwscript migrateUserGroup.php commonswiki extended-uploader autopatrolled` on deploy1001
[00:24:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:24:27] PROBLEM - Gerritj on gerrit.wikimedia.org is CRITICAL: The command defined for service Gerritj does not exist
[00:25:13] gerritj?
[00:26:25] :o
[00:29:45] MatmaRex: should I sync these individually or do them at the same time?
[00:30:16] twentyafterfour: safe to do either way, they are unrelated fixes
[00:31:18] ok both merged
[00:31:24] I'll sync them one at a time though
[00:32:15] RECOVERY - Gerritj on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 351334 bytes in 0.165 second response time
[00:38:52] twentyafterfour: please ping me when i can confirm the fixes
[00:42:12] MatmaRex: they should be on mwdebug1001 now
[00:42:52] tgr: any idea how long this kind of migration should take? It's done 6 million users so far
[00:43:00] 1001? a bit of variety
[00:43:27] it goes through all users? wow
[00:43:35] tgr apparently :-/
[00:43:48] MatmaRex: :-o
[00:44:16] there are about 5000 users who should be affected
[00:44:45] twentyafterfour: There are only 7,488,571 registered users on Commons, so shouldn't be too long.
[00:44:58] twentyafterfour: both work as expected
[00:45:13] (i think every swat in at least several months had me test on 1002 :) )
[00:45:19] James_F: nice, thanks
[00:45:44] (PS1) Dzahn: icinga/gerrit: add double quotes around URL part in check command [puppet] - https://gerrit.wikimedia.org/r/488636 (https://phabricator.wikimedia.org/T215033)
[00:45:45] MatmaRex: I'm pretty sure it doesn't matter which one as long as I sync the same one you test ;)
[00:46:03] * James_F grins.
[00:46:10] MatmaRex: thanks for testing
[00:46:38] (PS2) Dzahn: icinga/gerrit: add double quotes around URL part in check command [puppet] - https://gerrit.wikimedia.org/r/488636 (https://phabricator.wikimedia.org/T215033)
[00:47:09] (CR) Dzahn: [C: +2] icinga/gerrit: add double quotes around URL part in check command [puppet] - https://gerrit.wikimedia.org/r/488636 (https://phabricator.wikimedia.org/T215033) (owner: Dzahn)
[00:47:14] !log syncing commit dd8654ac9b3f2e88241e65d3ea35aea9699defc5 for Bug: T209052
[00:47:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:47:17] T209052: Load page content in parallel with VE code on Mobile with ArticleTargetLoader - https://phabricator.wikimedia.org/T209052
[00:47:30] I'm not sure how many autopatrollers should be there but at a glance there are way less now than total commons users so the script doesn't seem to be doing anything stupid
[00:47:41] well, in terms of output, anyway
[00:47:51] processing all users is definitely stupid
[00:48:20] Done! 72 users in group 'extended-uploader' are now in 'autopatrolled' instead.
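Editorial sketch, not part of the channel log: the complaint above is that the migration walked every user record even though only a small group was affected (later filed in this log as T215479). The following Python snippet illustrates the alternative of selecting only the rows that belong to the old group, in batches. It is a minimal sketch under stated assumptions, not the code of migrateUserGroup.php: it uses an in-memory SQLite database as a stand-in, a simplified table modeled on MediaWiki's `user_groups` (`ug_user`, `ug_group`), and the function name `migrate_group` and the `batch_size` parameter are hypothetical.

```python
import sqlite3


def migrate_group(conn, old_group, new_group, batch_size=200):
    """Move members of old_group into new_group.

    Only rows that are actually in old_group are read, so the cost scales
    with the size of that group rather than with the total number of users.
    (Hypothetical helper; not the real maintenance script.)
    """
    moved = 0
    while True:
        rows = conn.execute(
            "SELECT ug_user FROM user_groups "
            "WHERE ug_group = ? ORDER BY ug_user LIMIT ?",
            (old_group, batch_size),
        ).fetchall()
        if not rows:
            break
        for (user_id,) in rows:
            # The user may already be in the new group; the primary key plus
            # INSERT OR IGNORE avoids creating a duplicate membership row.
            conn.execute(
                "INSERT OR IGNORE INTO user_groups (ug_user, ug_group) "
                "VALUES (?, ?)",
                (user_id, new_group),
            )
            conn.execute(
                "DELETE FROM user_groups WHERE ug_user = ? AND ug_group = ?",
                (user_id, old_group),
            )
            moved += 1
        conn.commit()  # one commit per batch keeps each transaction short
    return moved


def demo():
    """Build a tiny simplified table and run the migration on it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE user_groups ("
        "ug_user INTEGER, ug_group TEXT, PRIMARY KEY (ug_user, ug_group))"
    )
    conn.executemany(
        "INSERT INTO user_groups VALUES (?, ?)",
        [
            (1, "extended-uploader"),
            (2, "extended-uploader"),
            (2, "autopatrolled"),  # already a member of the target group
            (3, "autopatrolled"),
        ],
    )
    moved = migrate_group(conn, "extended-uploader", "autopatrolled", batch_size=1)
    remaining = conn.execute(
        "SELECT COUNT(*) FROM user_groups WHERE ug_group = 'extended-uploader'"
    ).fetchone()[0]
    members = [
        row[0]
        for row in conn.execute(
            "SELECT ug_user FROM user_groups "
            "WHERE ug_group = 'autopatrolled' ORDER BY ug_user"
        )
    ]
    return moved, remaining, members
```

Calling `demo()` runs the sketch against four sample rows. The sketch writes no user rights log entry, which is the second gap raised in the discussion; a real script would need to handle that separately.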
[00:48:52] tgr: so I'll merge the config change now
[00:48:54] !log twentyafterfour@deploy1001 Synchronized php-1.33.0-wmf.16/extensions/MobileFrontend/: SWAT dd8654ac9b3f2e88241e65d3ea35aea9699defc5 (duration: 01m 00s)
[00:48:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:49:45] (CR) 20after4: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/485487 (https://phabricator.wikimedia.org/T214003) (owner: Zoranzoki21)
[00:49:49] PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: The command defined for service Gerrit JSON does not exist
[00:50:02] hm, maybe I misremembered and autopatrollers is the one with 5000ish members then
[00:50:25] RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 27192 bytes in 0.033 second response time
[00:50:51] (Merged) jenkins-bot: Merge the "extended-uploader" and "autopatrolled" user groups on Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/485487 (https://phabricator.wikimedia.org/T214003) (owner: Zoranzoki21)
[00:51:50] paladox: ^
[00:52:22] mutante nice!
[00:52:31] apparently no user rights log entry either :/
[00:53:08] !log twentyafterfour@deploy1001 Synchronized php-1.33.0-wmf.16/extensions/VisualEditor/: SWAT f89e12fc466d2c51343d9815c70a0b4602acc333 to fix bug: T209610 (duration: 00m 55s)
[00:53:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:53:11] T209610: On mobile, template context menu doesn't show the name of the template - https://phabricator.wikimedia.org/T209610
[00:54:01] tgr: the config change should be live on mwdebug1001
[00:54:16] Operations, Gerrit, Icinga, monitoring, and 2 others: improve Gerrit monitoring (was: Investigate why icinga did not report high cpu/load for gerrit) - https://phabricator.wikimedia.org/T215033 (Dzahn) The new check "Gerrit JSON" works now: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi...
[00:55:08] twentyafterfour: looks good
[00:55:09] paladox: dont know if should close or only after also doing the healthcheck plugin thing
[00:55:27] (CR) jenkins-bot: Merge the "extended-uploader" and "autopatrolled" user groups on Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/485487 (https://phabricator.wikimedia.org/T214003) (owner: Zoranzoki21)
[00:56:20] Operations, Gerrit, Icinga, monitoring, and 2 others: improve Gerrit monitoring (was: Investigate why icinga did not report high cpu/load for gerrit) - https://phabricator.wikimedia.org/T215033 (Dzahn) Open→Resolved a:Dzahn
[00:56:59] !log twentyafterfour@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT config change for Bug: T214003 (duration: 00m 53s)
[00:57:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:57:02] T214003: Merge the "extended-uploader" and "autopatrolled" user groups on Commons - https://phabricator.wikimedia.org/T214003
[00:58:12] thanks! filed T215479 and T215480 about the issues
[00:58:13] T215479: migrateUserGroup.php should not process all user records - https://phabricator.wikimedia.org/T215479
[00:58:13] T215480: migrateUserGroup.php should make a user rights log entry - https://phabricator.wikimedia.org/T215480
[01:00:04] twentyafterfour: I, the Bot under the Fountain, allow thee, The Deployer, to do Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190207T0100).
[01:01:51] thanks for deploying twentyafterfour!
[01:03:59] you're welcome!
glad to help out ;)
[01:04:39] !log no phabricator deployment tonight
[01:04:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:05:01] !log US Evening SWAT is complete
[01:05:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:17:45] (CR) Volans: administrative: add owner getter to Reason class (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/488204 (https://phabricator.wikimedia.org/T205884) (owner: Volans)
[01:34:38] (PS2) Volans: sre.hosts: add decommission cookbook [cookbooks] - https://gerrit.wikimedia.org/r/487982 (https://phabricator.wikimedia.org/T205886)
[01:35:03] (CR) Volans: "REplies inline" (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/487982 (https://phabricator.wikimedia.org/T205886) (owner: Volans)
[01:36:25] (CR) CRusnov: [C: +1] administrative: add owner getter to Reason class (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/488204 (https://phabricator.wikimedia.org/T205884) (owner: Volans)
[01:40:34] (CR) Volans: icinga: enable check for psi and omega clusters (1 comment) [puppet] - https://gerrit.wikimedia.org/r/488485 (https://phabricator.wikimedia.org/T212850) (owner: Mathew.onipe)
[01:46:08] (PS4) Volans: management: add management module [software/spicerack] - https://gerrit.wikimedia.org/r/486529 (https://phabricator.wikimedia.org/T205885)
[01:46:10] (PS4) Volans: icinga: add context manager for downtimed hosts [software/spicerack] - https://gerrit.wikimedia.org/r/486530
[01:46:12] (PS2) Volans: puppet: add delete() method to remove a host [software/spicerack] - https://gerrit.wikimedia.org/r/487981 (https://phabricator.wikimedia.org/T205884)
[01:46:35] (CR) Volans: management: add management module (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/486529 (https://phabricator.wikimedia.org/T205885) (owner: Volans)
[01:50:54] (CR) Gehel:
[C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/486529 (https://phabricator.wikimedia.org/T205885) (owner: 10Volans) [01:52:51] (03CR) 10Volans: [C: 03+2] management: add management module [software/spicerack] - 10https://gerrit.wikimedia.org/r/486529 (https://phabricator.wikimedia.org/T205885) (owner: 10Volans) [01:59:26] (03Merged) 10jenkins-bot: management: add management module [software/spicerack] - 10https://gerrit.wikimedia.org/r/486529 (https://phabricator.wikimedia.org/T205885) (owner: 10Volans) [01:59:28] (03Merged) 10jenkins-bot: icinga: add context manager for downtimed hosts [software/spicerack] - 10https://gerrit.wikimedia.org/r/486530 (owner: 10Volans) [01:59:36] (03CR) 10Gehel: [C: 03+1] "LGTM, will merge tomorrow" [puppet] - 10https://gerrit.wikimedia.org/r/487924 (https://phabricator.wikimedia.org/T215199) (owner: 10EBernhardson) [02:00:36] (03CR) 10jenkins-bot: management: add management module [software/spicerack] - 10https://gerrit.wikimedia.org/r/486529 (https://phabricator.wikimedia.org/T205885) (owner: 10Volans) [02:00:43] (03CR) 10Volans: [C: 03+2] puppet: add delete() method to remove a host [software/spicerack] - 10https://gerrit.wikimedia.org/r/487981 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [02:01:38] (03CR) 10jenkins-bot: icinga: add context manager for downtimed hosts [software/spicerack] - 10https://gerrit.wikimedia.org/r/486530 (owner: 10Volans) [02:03:45] (03Abandoned) 10Gehel: Proposal: cleanup of management class [software/spicerack] - 10https://gerrit.wikimedia.org/r/487094 (owner: 10Gehel) [02:06:37] (03Merged) 10jenkins-bot: puppet: add delete() method to remove a host [software/spicerack] - 10https://gerrit.wikimedia.org/r/487981 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [02:07:18] (03PS1) 10Milimetric: Use correct command from systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/488670 [02:07:38] (03CR) 10jenkins-bot: puppet: add delete() method to remove 
a host [software/spicerack] - 10https://gerrit.wikimedia.org/r/487981 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [02:07:40] (03PS2) 10Volans: administrative: add owner getter to Reason class [software/spicerack] - 10https://gerrit.wikimedia.org/r/488204 (https://phabricator.wikimedia.org/T205884) [02:10:04] (03CR) 10Volans: "replies inline" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/488204 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [02:10:16] (03CR) 10Krinkle: "Would this explain why some mwgrep queries produced outdated or incomplete results? I don't have concrete examples right not, but I've sen" [puppet] - 10https://gerrit.wikimedia.org/r/487924 (https://phabricator.wikimedia.org/T215199) (owner: 10EBernhardson) [02:17:55] (03PS1) 10Kosta Harlan: GrowthExperiments: Enable search for help panel on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488675 (https://phabricator.wikimedia.org/T209301) [02:18:22] (03PS2) 10Milimetric: Use correct command from systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/488670 [02:22:17] (03CR) 10Volans: [C: 03+2] "Merging as there were already +1 and the last change is only on the docstring." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/488204 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [02:28:03] (03Merged) 10jenkins-bot: administrative: add owner getter to Reason class [software/spicerack] - 10https://gerrit.wikimedia.org/r/488204 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [02:29:13] (03CR) 10jenkins-bot: administrative: add owner getter to Reason class [software/spicerack] - 10https://gerrit.wikimedia.org/r/488204 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [02:50:53] (03CR) 10Krinkle: [C: 03+1] Use standard version of plain-text GPL [cookbooks] - 10https://gerrit.wikimedia.org/r/460731 (owner: 10Legoktm) [03:44:47] RECOVERY - Long running screen/tmux on an-coord1001 is OK: OK: SCREEN detected but not long running. [03:54:27] (03PS1) 10Reedy: Don't add EP NS where the wiki has no pages in that NS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488720 (https://phabricator.wikimedia.org/T200391) [03:56:12] jouncebot: now [03:56:12] No deployments scheduled for the next 8 hour(s) and 3 minute(s) [03:56:15] jouncebot: next [03:56:15] In 8 hour(s) and 3 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190207T1200) [03:56:28] (03PS2) 10Reedy: Don't add EP NS where the wiki has no pages in that NS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488720 (https://phabricator.wikimedia.org/T200391) [03:57:56] (03CR) 10Reedy: [C: 03+2] Don't add EP NS where the wiki has no pages in that NS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488720 (https://phabricator.wikimedia.org/T200391) (owner: 10Reedy) [03:59:03] (03Merged) 10jenkins-bot: Don't add EP NS where the wiki has no pages in that NS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488720 (https://phabricator.wikimedia.org/T200391) (owner: 10Reedy) [03:59:15] (03CR) 10jenkins-bot: Don't add EP NS where the wiki has no pages in that NS 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/488720 (https://phabricator.wikimedia.org/T200391) (owner: 10Reedy) [04:00:36] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Disable EP namespaces on wikis with no EP pages (duration: 00m 57s) [04:00:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:08:11] (03PS2) 10Tim Starling: Use excimer to set a graceful wall clock time limit in PHP 7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/487069 [04:20:44] (03PS3) 10Tim Starling: Use excimer to set a graceful wall clock time limit in PHP 7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/487069 [04:21:20] (03CR) 10Tim Starling: "PS3: remove set_time_limit() in the excimer case, for simplicity, as suggested by Krinkle." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/487069 (owner: 10Tim Starling) [04:22:12] (03CR) 10Tim Starling: [C: 03+2] Use excimer to set a graceful wall clock time limit in PHP 7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/487069 (owner: 10Tim Starling) [04:23:21] (03Merged) 10jenkins-bot: Use excimer to set a graceful wall clock time limit in PHP 7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/487069 (owner: 10Tim Starling) [04:32:49] (03CR) 10jenkins-bot: Use excimer to set a graceful wall clock time limit in PHP 7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/487069 (owner: 10Tim Starling) [04:35:10] !log tstarling@deploy1001 Synchronized wmf-config/set-time-limit.php: (no justification provided) (duration: 00m 54s) [04:35:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:44:34] (03CR) 10Tim Starling: [C: 03+2] "It works, except that the error displayed is not very user-friendly." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/487069 (owner: 10Tim Starling) [04:59:06] (03PS1) 10Effie Mouzeli: admin: fixed typo in username ha78na [puppet] - 10https://gerrit.wikimedia.org/r/488753 (https://phabricator.wikimedia.org/T215352) [05:04:08] (03CR) 10Dzahn: [C: 03+2] admin: fixed typo in username ha78na [puppet] - 10https://gerrit.wikimedia.org/r/488753 (https://phabricator.wikimedia.org/T215352) (owner: 10Effie Mouzeli) [05:04:35] (03CR) 10Dzahn: [C: 03+2] "[mwmaint1002:~] $ ldaplist -l passwd ha78na" [puppet] - 10https://gerrit.wikimedia.org/r/488753 (https://phabricator.wikimedia.org/T215352) (owner: 10Effie Mouzeli) [05:05:34] (03CR) 10Dzahn: "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/488753 (https://phabricator.wikimedia.org/T215352) (owner: 10Effie Mouzeli) [05:07:27] (03CR) 10Effie Mouzeli: ":D" [puppet] - 10https://gerrit.wikimedia.org/r/488753 (https://phabricator.wikimedia.org/T215352) (owner: 10Effie Mouzeli) [05:20:58] (03PS1) 10Samwilson: Add all fonts used in production MediaWiki [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/488764 (https://phabricator.wikimedia.org/T213669) [05:32:10] (03PS3) 10Fsero: Bump helm to 2.12.2 for security and features [debs/helm] (debian/stretch-wikimedia) - 10https://gerrit.wikimedia.org/r/488089 (https://phabricator.wikimedia.org/T215244) [05:33:17] (03PS4) 10Fsero: Bump helm to 2.12.2 for security and features [debs/helm] (debian/stretch-wikimedia) - 10https://gerrit.wikimedia.org/r/488089 (https://phabricator.wikimedia.org/T215244) [05:35:00] (03CR) 10Fsero: "Thanks for the review!" 
[debs/helm] (debian/stretch-wikimedia) - 10https://gerrit.wikimedia.org/r/488089 (https://phabricator.wikimedia.org/T215244) (owner: 10Fsero) [06:01:19] PROBLEM - MariaDB Slave Lag: s4 on db2084 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 330.67 seconds [06:01:25] PROBLEM - MariaDB Slave Lag: s4 on db2051 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 334.28 seconds [06:01:29] PROBLEM - MariaDB Slave Lag: s4 on db2058 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 336.41 seconds [06:01:33] PROBLEM - MariaDB Slave Lag: s4 on db2090 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 339.43 seconds [06:01:33] PROBLEM - MariaDB Slave Lag: s4 on db2065 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 339.88 seconds [06:03:05] PROBLEM - MariaDB Slave Lag: s4 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 377.20 seconds [06:03:17] PROBLEM - MariaDB Slave Lag: s4 on db2073 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 384.25 seconds [06:03:17] PROBLEM - MariaDB Slave Lag: s4 on db2091 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 383.55 seconds [06:10:51] 10Operations, 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Patch-For-Review: dbstore1002 Mysql errors - https://phabricator.wikimedia.org/T213670 (10Marostegui) dbstore1002 crashed, possibly due to {T215450} [06:11:00] 10Operations: issue pulling 1 layer of docker-registry.wikimedia.org/releng/composer-php71:latest - https://phabricator.wikimedia.org/T209507 (10fsero) 05Open→03Resolved I think this was fixed adjusting Cache-Control headers on docker-registry so varnish can serve content accordingly, report back if not :) [06:12:59] 10Operations, 10serviceops, 10vm-requests, 10Patch-For-Review, 10User-fsero: eqiad: 1-2 VM requests for docker-registry-beta.wikimedia.org - https://phabricator.wikimedia.org/T212212 (10fsero) [06:14:06] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10User-fsero: Make swift containers for docker registry cross 
replicated. - https://phabricator.wikimedia.org/T214289 (10fsero) [06:14:17] !log Ease consistency options on db2051 (s4 master) to let it catch up on replication [06:14:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:14:55] 10Operations, 10Citoid, 10serviceops, 10Patch-For-Review, and 2 others: allow zotero container nodejs server to define the amount of heap used instead of the fixed limit of 1.7Gi - https://phabricator.wikimedia.org/T213414 (10fsero) [06:15:09] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, and 2 others: improve docker registry architecture - https://phabricator.wikimedia.org/T209271 (10fsero) [06:15:43] (03PS3) 10Marostegui: dbstore1003: Increase number mysql of instances [puppet] - 10https://gerrit.wikimedia.org/r/488454 (https://phabricator.wikimedia.org/T210478) [06:18:55] (03CR) 10Marostegui: [C: 03+2] dbstore1003: Increase number mysql of instances [puppet] - 10https://gerrit.wikimedia.org/r/488454 (https://phabricator.wikimedia.org/T210478) (owner: 10Marostegui) [06:25:55] (03CR) 10Elukey: [C: 03+1] "Thanks Daniel!" 
[puppet] - 10https://gerrit.wikimedia.org/r/488606 (https://phabricator.wikimedia.org/T215384) (owner: 10Dzahn) [06:26:13] (03PS3) 10Elukey: Use correct command from systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/488670 (owner: 10Milimetric) [06:27:27] (03CR) 10Elukey: [C: 03+2] Use correct command from systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/488670 (owner: 10Milimetric) [06:27:41] PROBLEM - MariaDB Slave Lag: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 45247.34 seconds [06:27:41] PROBLEM - MariaDB Slave Lag: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 44010.34 seconds [06:27:55] PROBLEM - MariaDB Slave Lag: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 44205.36 seconds [06:28:05] PROBLEM - MariaDB Slave Lag: s6 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 43122.55 seconds [06:28:09] PROBLEM - MariaDB Slave Lag: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 43922.30 seconds [06:28:21] PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Update_rows_v1 event on table wikidatawiki.echo_notification: Cant find record in echo_notification, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1069-bin.000334, end_log_pos 547679268 [06:28:23] PROBLEM - MariaDB Slave Lag: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 40570.74 seconds [06:28:25] PROBLEM - MariaDB Slave Lag: s2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 42857.16 seconds [06:28:33] going to fix dbstore1002 in a bit --^ [06:29:56] (03PS1) 10Alexandros Kosiaris: mathoid: Remove mwapi_req/restbase_req [deployment-charts] - 10https://gerrit.wikimedia.org/r/488800 [06:30:41] PROBLEM - puppet last run on an-worker1084 is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/R/biocLite.R] [06:33:09] PROBLEM - puppet last run on labmon1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/rsyslog.d/40-prometheus.conf] [06:34:57] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational [06:45:31] RECOVERY - MariaDB Slave Lag: s4 on db2051 is OK: OK slave_sql_lag Replication lag: 32.28 seconds [06:45:39] RECOVERY - MariaDB Slave Lag: s4 on db2090 is OK: OK slave_sql_lag Replication lag: 30.12 seconds [06:45:55] RECOVERY - MariaDB Slave Lag: s4 on db2095 is OK: OK slave_sql_lag Replication lag: 6.67 seconds [06:46:07] RECOVERY - MariaDB Slave Lag: s4 on db2073 is OK: OK slave_sql_lag Replication lag: 0.43 seconds [06:46:07] RECOVERY - MariaDB Slave Lag: s4 on db2091 is OK: OK slave_sql_lag Replication lag: 0.44 seconds [06:46:43] RECOVERY - MariaDB Slave Lag: s4 on db2084 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [06:48:14] !log Restore consistency options on db2051 [06:48:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:57:09] RECOVERY - puppet last run on an-worker1084 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:57:35] (03CR) 10Marostegui: [C: 03+1] Add staging-db-analytics.eqiad.wmnet CNAME to dbstore1003 [dns] - 10https://gerrit.wikimedia.org/r/488535 (https://phabricator.wikimedia.org/T210478) (owner: 10Elukey) [06:57:50] (03PS3) 10Marostegui: dbstore-grants: Add research user and fixing styling [puppet] - 10https://gerrit.wikimedia.org/r/488267 (https://phabricator.wikimedia.org/T214469) [06:58:33] RECOVERY - MariaDB Slave Lag: s4 on db2058 is OK: OK slave_sql_lag Replication lag: 0.50 seconds [06:58:41] RECOVERY - MariaDB Slave Lag: s4 on db2065 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [06:59:23] (03CR) 10Marostegui: [C: 03+2] 
dbstore-grants: Add research user and fixing styling [puppet] - 10https://gerrit.wikimedia.org/r/488267 (https://phabricator.wikimedia.org/T214469) (owner: 10Marostegui) [06:59:31] (03PS1) 10Marostegui: db-eqiad.php: Depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488816 (https://phabricator.wikimedia.org/T210713) [06:59:35] RECOVERY - puppet last run on labmon1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [07:00:49] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488816 (https://phabricator.wikimedia.org/T210713) (owner: 10Marostegui) [07:01:51] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488816 (https://phabricator.wikimedia.org/T210713) (owner: 10Marostegui) [07:02:03] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488816 (https://phabricator.wikimedia.org/T210713) (owner: 10Marostegui) [07:03:22] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1084 (duration: 00m 55s) [07:03:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:03:35] !log Deploy schema change on db1084 - T210713 [07:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:03:38] T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 [07:10:42] (03CR) 10Marostegui: "Make sure to review grants to make sure check_mariadb can access those hosts via socket, it has been a long while since we set them up" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/488504 (https://phabricator.wikimedia.org/T172489) (owner: 10Jcrespo) [07:25:56] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488828 [07:29:38] 10Operations, 10Cloud-VPS, 10Toolforge, 10Traffic, 
10Patch-For-Review: Wikimedia varnish rules no longer exempt all Cloud VPS/Toolforge IPs from rate limits (HTTP 429 response) - https://phabricator.wikimedia.org/T213475 (10akosiaris) I 've added the capacity to varnish puppet code to augment the wikimed... [07:34:43] (03CR) 10Elukey: [C: 03+2] Add staging-db-analytics.eqiad.wmnet CNAME to dbstore1003 [dns] - 10https://gerrit.wikimedia.org/r/488535 (https://phabricator.wikimedia.org/T210478) (owner: 10Elukey) [07:36:10] (03PS1) 10Reedy: Add wikimaniawiki and wikimania2018wiki to some more dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488829 (https://phabricator.wikimedia.org/T215486) [07:36:27] (03CR) 10Reedy: [C: 03+2] Add wikimaniawiki and wikimania2018wiki to some more dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488829 (https://phabricator.wikimedia.org/T215486) (owner: 10Reedy) [07:36:57] (03CR) 10jerkins-bot: [V: 04-1] Add wikimaniawiki and wikimania2018wiki to some more dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488829 (https://phabricator.wikimedia.org/T215486) (owner: 10Reedy) [07:37:13] (03CR) 10jerkins-bot: [V: 04-1] Add wikimaniawiki and wikimania2018wiki to some more dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488829 (https://phabricator.wikimedia.org/T215486) (owner: 10Reedy) [07:38:24] 10Operations, 10Wiki-Loves-Love, 10Wikimedia-Mailing-lists: Reset password for wll mailling list - https://phabricator.wikimedia.org/T215390 (10Psychoslave) Hello everybody, would it be possible to know how much time in average it takes for such a ticket to be treat, so we can take that into account in how w... 
[07:39:15] (03PS2) 10Reedy: Add wikimaniawiki and wikimania2018wiki to some more dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488829 (https://phabricator.wikimedia.org/T215486) [07:39:56] (03CR) 10Reedy: [C: 03+2] Add wikimaniawiki and wikimania2018wiki to some more dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488829 (https://phabricator.wikimedia.org/T215486) (owner: 10Reedy) [07:40:49] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488828 (owner: 10Marostegui) [07:40:59] (03Merged) 10jenkins-bot: Add wikimaniawiki and wikimania2018wiki to some more dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488829 (https://phabricator.wikimedia.org/T215486) (owner: 10Reedy) [07:42:02] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488828 (owner: 10Marostegui) [07:42:24] Reedy: I will go after you :) [07:42:27] !log reedy@deploy1001 Synchronized dblists/: Wikimania T215486 (duration: 00m 54s) [07:42:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:30] T215486: Shortcut interwiki links have wrong target at wikimaniawiki - https://phabricator.wikimedia.org/T215486 [07:43:01] marostegui: feel free [07:43:06] Thanks! [07:43:57] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1084 (duration: 00m 53s) [07:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:11] (03CR) 10Alexandros Kosiaris: [C: 03+1] Use standard version of plain-text GPL (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/460731 (owner: 10Legoktm) [07:44:52] (03PS1) 10Marostegui: db-eqiad.php: Depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488835 (https://phabricator.wikimedia.org/T210713) [07:45:04] (03PS1) 10Reedy: sort dblists... 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/488836 [07:46:06] (03PS1) 10Reedy: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488837 [07:46:09] (03CR) 10Reedy: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488837 (owner: 10Reedy) [07:46:11] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488835 (https://phabricator.wikimedia.org/T210713) (owner: 10Marostegui) [07:46:20] (03CR) 10jerkins-bot: [V: 04-1] sort dblists... [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488836 (owner: 10Reedy) [07:46:28] (03CR) 10jenkins-bot: Add wikimaniawiki and wikimania2018wiki to some more dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488829 (https://phabricator.wikimedia.org/T215486) (owner: 10Reedy) [07:46:30] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488828 (owner: 10Marostegui) [07:47:20] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488837 (owner: 10Reedy) [07:47:23] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488835 (https://phabricator.wikimedia.org/T210713) (owner: 10Marostegui) [07:47:33] (03CR) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488837 (owner: 10Reedy) [07:47:35] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488835 (https://phabricator.wikimedia.org/T210713) (owner: 10Marostegui) [07:47:47] Reedy: After you :) [07:48:23] !log reedy@deploy1001 Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 02m 20s) [07:48:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:40] (03CR) 10Reedy: "Some of these definitely are out of place... 
I dunno which way round the _ should be. We don't document the correct sorting command, do we" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488836 (owner: 10Reedy) [07:49:41] marostegui: Cheers. That's me done now [07:49:56] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1081 (duration: 00m 53s) [07:49:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:20] Reedy: :) [07:50:24] !log Deploy schema change on db1081 [07:50:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:31] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Allow Erik Bernhardson to have root access on stat1005 for GPU testing - https://phabricator.wikimedia.org/T215384 (10Joe) I second the idea, and I see @Nuria has given +1 to the patch which I assume can count as manager approval. Given... [08:09:13] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1081" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488851 [08:10:56] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1081" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488851 (owner: 10Marostegui) [08:12:01] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1081" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488851 (owner: 10Marostegui) [08:12:47] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" 
[debs/helm] (debian/stretch-wikimedia) - 10https://gerrit.wikimedia.org/r/488089 (https://phabricator.wikimedia.org/T215244) (owner: 10Fsero) [08:13:05] RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [08:13:06] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1081 (duration: 00m 54s) [08:13:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:15] elukey: \o/ [08:14:06] !log Deploy schema change on s4 primary master (db1068) - T210713 [08:14:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:09] T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 [08:14:50] marostegui: I think it is still broken :( [08:15:07] :( [08:16:57] PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Update_rows_v1 event on table cywiki.echo_notification: Cant find record in echo_notification, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1069-bin.000334, end_log_pos 550016117 [08:16:59] RECOVERY - MariaDB Slave Lag: s5 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 284.50 seconds [08:18:15] RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [08:18:23] this time is good \o/ [08:18:36] sigh too soon [08:19:05] (03CR) 10Fsero: [C: 03+2] Bump helm to 2.12.2 for security and features [debs/helm] (debian/stretch-wikimedia) - 10https://gerrit.wikimedia.org/r/488089 (https://phabricator.wikimedia.org/T215244) (owner: 10Fsero) [08:20:05] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1081" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488851 (owner: 10Marostegui) [08:23:21] PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Could not execute 
Write_rows_v1 event on table wikishared.echo_unread_wikis: Duplicate entry 34079543-enwiki for key echo_unread_wikis_user_wiki, Error_code: 1062: handler error HA_ERR_FOUND_DUPP_KEY: the events master log db1069-bin.000334, end_log_pos 555529662 [08:32:21] RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [08:34:24] !log swift codfw-prod: more weight to ms-be2047 - T209395 T209921 [08:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:28] T209395: rack/setup/install new ms-be servers ms-be204[4-9] ,ms-be2050 - https://phabricator.wikimedia.org/T209395 [08:34:28] T209921: ms-be2047 spontaneous reboots - https://phabricator.wikimedia.org/T209921 [08:36:13] PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Update_rows_v1 event on table itwiki.echo_notification: Cant find record in echo_notification, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1069-bin.000334, end_log_pos 560643293 [08:39:17] 10Operations, 10SRE-Access-Requests: Requesting access to deployment, contint-admins, and contint-docker for Brennen Bearnes - https://phabricator.wikimedia.org/T215328 (10Joe) Hi @brennen - before I can grant you access some things are needed: - Please read and sign https://phabricator.wikimedia.org/L3 if yo... 
[08:39:27] 10Operations, 10SRE-Access-Requests: Requesting access to deployment, contint-admins, and contint-docker for Brennen Bearnes - https://phabricator.wikimedia.org/T215328 (10Joe) a:03Joe [08:42:53] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Allow Erik Bernhardson to have root access on stat1005 for GPU testing - https://phabricator.wikimedia.org/T215384 (10Joe) a:03Joe [08:45:14] 10Operations, 10Research, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to researchers group for phuedx - https://phabricator.wikimedia.org/T214957 (10Joe) I guess this is ok as long as @Nuria approves the addition. [08:45:24] 10Operations, 10Research, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to researchers group for phuedx - https://phabricator.wikimedia.org/T214957 (10Joe) a:03Joe [08:46:21] 10Operations, 10Cloud-VPS, 10SRE-Access-Requests, 10cloud-services-team, and 2 others: Create cloudelastic-root group - https://phabricator.wikimedia.org/T214922 (10Joe) Hi @Mathew.onipe I'd need more context on why we want to create this group please. [08:46:26] 10Operations, 10Research, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to researchers group for phuedx - https://phabricator.wikimedia.org/T214957 (10elukey) I think that it is fine to proceed in this case! :) [08:47:01] (03PS3) 10Giuseppe Lavagetto: admins: add phuedx to researchers [puppet] - 10https://gerrit.wikimedia.org/r/488595 (https://phabricator.wikimedia.org/T214957) (owner: 10Dzahn) [08:48:40] (03CR) 10Giuseppe Lavagetto: [C: 03+2] admins: add phuedx to researchers [puppet] - 10https://gerrit.wikimedia.org/r/488595 (https://phabricator.wikimedia.org/T214957) (owner: 10Dzahn) [08:51:21] 10Operations, 10Research, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to researchers group for phuedx - https://phabricator.wikimedia.org/T214957 (10Joe) Yes, it is fine, she also gave +1 to the patch already. Merging it. 
Thanks @DZahn for writing the patch. @phuedx you should have your a... [08:51:30] 10Operations, 10Research, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to researchers group for phuedx - https://phabricator.wikimedia.org/T214957 (10Joe) 05Open→03Resolved [08:53:01] !log Deploy schema change on s7 codfw master (db2047), this will generate lag on s7 codfw - T210713 [08:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:07] T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 [08:54:11] RECOVERY - MariaDB Slave Lag: s6 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 44.48 seconds [09:00:21] RECOVERY - EDAC syslog messages on db1068 is OK: (C)4 ge (W)2 ge 1 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=db1068&var-datasource=eqiad+prometheus/ops [09:01:45] 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work), 10Patch-For-Review: Analytics query access for search platform NLP contractor @Julia.glen - https://phabricator.wikimedia.org/T214623 (10Joe) a:03Dzahn [09:02:10] 10Operations, 10DBA, 10Patch-For-Review: correctable memory errors db1068 (commons primary master database) - https://phabricator.wikimedia.org/T213664 (10Marostegui) 05Open→03Resolved And back again: `RECOVERY - EDAC syslog messages on db1068 is OK: (C)4 ge (W)2 ge 1` As Jaime said: T213664#4924636 thi... 
[09:04:27] (03CR) 10Jcrespo: ">" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/488504 (https://phabricator.wikimedia.org/T172489) (owner: 10Jcrespo) [09:05:35] (03CR) 10Marostegui: "> >" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/488504 (https://phabricator.wikimedia.org/T172489) (owner: 10Jcrespo) [09:09:02] (03CR) 10Jcrespo: "> > >" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/488504 (https://phabricator.wikimedia.org/T172489) (owner: 10Jcrespo) [09:10:01] (03CR) 10DCausse: [C: 03+1] "One known reason for stale results to have appeared recently is the activation of these new clusters but only between Jan 16 and 24, perio" [puppet] - 10https://gerrit.wikimedia.org/r/487924 (https://phabricator.wikimedia.org/T215199) (owner: 10EBernhardson) [09:15:29] !log uploading helm and tiller 2.12.2 deb package to stretch and jessie [09:15:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:16] (03CR) 10Marostegui: "> > > >" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/488504 (https://phabricator.wikimedia.org/T172489) (owner: 10Jcrespo) [09:18:39] PROBLEM - Check systemd state on ms-be2037 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:18:44] (03CR) 10Alexandros Kosiaris: "hm this is for debian/stretch-wikimedia. This probably belongs in master as well." [debs/helm] (debian/stretch-wikimedia) - 10https://gerrit.wikimedia.org/r/488089 (https://phabricator.wikimedia.org/T215244) (owner: 10Fsero) [09:18:51] PROBLEM - puppet last run on stat1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[researchers_ensure_members] [09:19:13] (03PS1) 10Giuseppe Lavagetto: admin: add dsharpe, give access to deployment/analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/488880 (https://phabricator.wikimedia.org/T214130) [09:20:15] RECOVERY - Memory correctable errors -EDAC- on db1068 is OK: (C)4 ge (W)2 ge 1 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=db1068&var-datasource=eqiad+prometheus/ops [09:23:17] !log running alter table on db2055 for performance testing T212092 [09:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:20] T212092: Provide a strategy for testing the performance of queries needed to show the list of user-agents for each IP - https://phabricator.wikimedia.org/T212092 [09:24:24] 10Operations, 10Proton, 10Reading-Infrastructure-Team-Backlog, 10Traffic, and 3 others: Document and possibly fine-tune how Proton interacts with Varnish - https://phabricator.wikimedia.org/T213371 (10akosiaris) >>! In T213371#4932956, @pmiazga wrote: > @Tgr I assume you're still waiting for answers from @... [09:24:40] 10Operations, 10Gerrit, 10Icinga, 10Release-Engineering-Team, and 2 others: gerrit: Add a icinga check that uses the healthcheck endpoint - https://phabricator.wikimedia.org/T215457 (10hashar) [09:24:43] 10Operations, 10Gerrit, 10Icinga, 10monitoring, and 2 others: Install "healthcheck" plugin on gerrit - https://phabricator.wikimedia.org/T214326 (10hashar) [09:25:09] (03CR) 10Jcrespo: "So +1 ?"
[puppet] - 10https://gerrit.wikimedia.org/r/488504 (https://phabricator.wikimedia.org/T172489) (owner: 10Jcrespo) [09:25:37] (03CR) 10Marostegui: [C: 03+1] mariadb: Set read_only monitoring for core_test hosts [puppet] - 10https://gerrit.wikimedia.org/r/488504 (https://phabricator.wikimedia.org/T172489) (owner: 10Jcrespo) [09:25:54] (03CR) 10Fsero: "> Patch Set 4:" [debs/helm] (debian/stretch-wikimedia) - 10https://gerrit.wikimedia.org/r/488089 (https://phabricator.wikimedia.org/T215244) (owner: 10Fsero) [09:26:15] (03PS3) 10Jcrespo: mariadb: Set read_only monitoring for core_test hosts [puppet] - 10https://gerrit.wikimedia.org/r/488504 (https://phabricator.wikimedia.org/T172489) [09:27:22] (03CR) 10Jcrespo: [C: 03+2] mariadb: Set read_only monitoring for core_test hosts [puppet] - 10https://gerrit.wikimedia.org/r/488504 (https://phabricator.wikimedia.org/T172489) (owner: 10Jcrespo) [09:30:44] (03CR) 10Giuseppe Lavagetto: [C: 03+2] admin: add dsharpe, give access to deployment/analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/488880 (https://phabricator.wikimedia.org/T214130) (owner: 10Giuseppe Lavagetto) [09:36:17] 10Operations, 10Cloud-VPS, 10SRE-Access-Requests, 10cloud-services-team, and 2 others: Create cloudelastic-root group - https://phabricator.wikimedia.org/T214922 (10Mathew.onipe) Hi @Joe cloudelastic is a replica of cirrussearch like labsdb* is to maps*. So this group separates access to cloudelastic and... [09:36:30] 10Operations, 10serviceops, 10vm-requests, 10Release-Engineering-Team (Watching / External): Increase mwdebugXXXX hosts CPU and memory(?) - https://phabricator.wikimedia.org/T212955 (10hashar) I think @fsero / @akosiaris should be able to bump the number of CPUs on those Ganeti instances :-] We can try wit... 
[09:36:49] RECOVERY - MariaDB Slave Lag: s2 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 274.59 seconds [09:37:09] 10Operations, 10Performance-Team, 10Traffic, 10media-storage: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10fgiunchedi) >>! In T211661#4931840, @ori wrote: >>>! In T211661#4931056, @fgiunchedi wrote: >> And indeed I share the concerns already mentioned, na... [09:40:09] PROBLEM - Check systemd state on ms-be2020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:41:49] !log reboot mwdebug1001, mwdebug1002, mwdebug2001, mwdebug2002 for VCPU upgrade. T212955 [09:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:52] T212955: Increase mwdebugXXXX hosts CPU and memory(?) - https://phabricator.wikimedia.org/T212955 [09:42:14] RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [09:42:18] 10Operations, 10Elasticsearch, 10Discovery-Search (Current work), 10Patch-For-Review: Create Icinga check for failed shard allocation - https://phabricator.wikimedia.org/T212850 (10fgiunchedi) [09:42:54] PROBLEM - Host mwdebug2002 is DOWN: PING CRITICAL - Packet loss = 100% [09:43:48] RECOVERY - Host mwdebug2002 is UP: PING OK - Packet loss = 0%, RTA = 36.39 ms [09:44:32] RECOVERY - puppet last run on stat1006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:46:04] 10Operations, 10serviceops, 10vm-requests, 10Release-Engineering-Team (Watching / External): Increase mwdebugXXXX hosts CPU - https://phabricator.wikimedia.org/T212955 (10akosiaris) [09:46:13] akosiaris: that was fast :) [09:46:42] feel free to m.ark the task resolved [09:47:38] RECOVERY - Check systemd state on ms-be2037 is OK: OK - running: The system is fully operational [09:49:04] 10Operations, 10Scap, 10Release-Engineering-Team (Watching / External): mwdebug1001 and mwdebug1002 are reliably 
the last two hosts to finish scap-cdb-rebuild - https://phabricator.wikimedia.org/T203625 (10akosiaris) [09:49:06] 10Operations, 10serviceops, 10vm-requests, 10Release-Engineering-Team (Watching / External): Increase mwdebugXXXX hosts CPU - https://phabricator.wikimedia.org/T212955 (10akosiaris) 05Open→03Resolved a:03akosiaris I 've removed the memory part cause https://grafana.wikimedia.org/d/000000377/host-over... [09:49:57] !log Deploy schema change on db1116 - T210713 [09:49:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:59] T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 [09:53:08] PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Update_rows_v1 event on table wikishared.echo_unread_wikis: Cant find record in echo_unread_wikis, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1069-bin.000334, end_log_pos 644994322 [09:53:16] reallyyyyyyy [09:53:18] 10Operations, 10Scap, 10Release-Engineering-Team (Watching / External): mwdebug1001 and mwdebug1002 are reliably the last two hosts to finish scap-cdb-rebuild - https://phabricator.wikimedia.org/T203625 (10hashar) The hosts mwdebug1001, mwdebug1002, mwdebug2001, mwdebug2002 now have four vCPUs allocated (was... 
[09:53:23] sigh [09:53:31] hahaha [09:58:00] RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [10:01:44] PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table wikidatawiki.echo_notification: Cant find record in echo_notification, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1069-bin.000334, end_log_pos 664164087 [10:05:28] (03PS2) 10Jcrespo: Revert "mariadb: Depool db2055 for performance testing" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488483 [10:08:12] RECOVERY - MariaDB Slave Lag: s3 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 232.30 seconds [10:10:56] (03PS1) 10Alexandros Kosiaris: spamassasion: Skip localhost entries [puppet] - 10https://gerrit.wikimedia.org/r/488894 [10:12:42] (03PS32) 10DCausse: [cirrus] Cleanup transitional states [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381) [10:12:44] (03PS1) 10DCausse: [cirrus] Start using local nginx reverse proxy for connections reuse [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488895 (https://phabricator.wikimedia.org/T215491) [10:13:35] (03CR) 10jerkins-bot: [V: 04-1] [cirrus] Cleanup transitional states [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [10:13:37] (03CR) 10jerkins-bot: [V: 04-1] [cirrus] Start using local nginx reverse proxy for connections reuse [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488895 (https://phabricator.wikimedia.org/T215491) (owner: 10DCausse) [10:14:25] (03PS2) 10Alexandros Kosiaris: spamassasion: Skip localhost entries [puppet] - 10https://gerrit.wikimedia.org/r/488894 [10:15:00] 10Operations, 10Performance-Team, 10Traffic, 10media-storage: Automatically clean up unused thumbnails in Swift - 
https://phabricator.wikimedia.org/T211661 (10Gilles) I thought we would start with a very low percentage and ramp it up gradually. And yes, I thought our beloved swift proxy is where it would l... [10:16:39] 10Operations, 10Performance-Team, 10Traffic, 10media-storage: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10jijiki) Is it possible to hold this a bit for until after we upgrade all Thumbor servers to stretch? Two birds with one stone :) [10:16:43] (03PS2) 10DCausse: [cirrus] Start using local nginx reverse proxy for connections reuse [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488895 (https://phabricator.wikimedia.org/T215491) [10:16:45] (03PS33) 10DCausse: [cirrus] Cleanup transitional states [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381) [10:21:32] RECOVERY - Check systemd state on ms-be2020 is OK: OK - running: The system is fully operational [10:21:36] (03CR) 10Alexandros Kosiaris: [C: 03+2] spamassasion: Skip localhost entries [puppet] - 10https://gerrit.wikimedia.org/r/488894 (owner: 10Alexandros Kosiaris) [10:23:36] 10Operations, 10Performance-Team, 10Traffic, 10media-storage: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10Gilles) I'd argue that we don't want both changes to happen around the same time. And this is probably less prone to emergency bugfixes than the Str... 
[10:28:36] RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [10:28:41] let's see [10:30:50] broke again :( [10:31:33] (03PS1) 10Arturo Borrero Gonzalez: cloudcontrol2001-dev: fix PTR record [dns] - 10https://gerrit.wikimedia.org/r/488896 (https://phabricator.wikimedia.org/T214448) [10:31:45] (03PS2) 10Giuseppe Lavagetto: admin: add dsharpe, give access to deployment/analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/488880 (https://phabricator.wikimedia.org/T214130) [10:31:52] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudcontrol2001-dev: fix PTR record [dns] - 10https://gerrit.wikimedia.org/r/488896 (https://phabricator.wikimedia.org/T214448) (owner: 10Arturo Borrero Gonzalez) [10:32:12] (03CR) 10Giuseppe Lavagetto: [C: 03+2] admin: add dsharpe, give access to deployment/analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/488880 (https://phabricator.wikimedia.org/T214130) (owner: 10Giuseppe Lavagetto) [10:32:26] <_joe_> arturo: uh? [10:32:30] PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Update_rows_v1 event on table enwiki.echo_notification: Cant find record in echo_notification, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1069-bin.000334, end_log_pos 684451049 [10:32:36] _joe_: ? [10:32:59] <_joe_> arturo: that's not strictly related to the diffscan results, right? [10:33:16] _joe_: I guess not, but I'm reviewing all the stuff and found this inconsistency [10:33:29] <_joe_> sure, sure :) [10:33:56] <_joe_> I wasn't sure how it related, it's good to fix stuff anyways, I just didn't get the relationship :) [10:34:42] I think the problem is perhaps the server has no role applied [10:35:14] <_joe_> not even "standard"? 
[10:35:24] * arturo nods [10:36:08] <_joe_> ok that looks like an issue [10:36:25] they were imaged just yesterday I think [10:36:53] <_joe_> they should usually get role "spare::system" or whatever applied [10:37:38] RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [10:38:10] PROBLEM - puppet last run on mw2279 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [10:39:03] (03PS1) 10Arturo Borrero Gonzalez: cloudcontrol2001-dev: spare system for now [puppet] - 10https://gerrit.wikimedia.org/r/488897 (https://phabricator.wikimedia.org/T214448) [10:39:08] PROBLEM - puppet last run on mwlog2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [10:39:16] _joe_: T214448 [10:39:16] T214448: rack/setup/install cloudcontrol2001-dev & cloudvirt200[123]-dev - https://phabricator.wikimedia.org/T214448 [10:39:20] PROBLEM - puppet last run on mw2286 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [10:39:24] _joe_: sorry https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/488897/ [10:39:32] PROBLEM - puppet last run on mw1226 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [10:39:32] PROBLEM - puppet last run on mw1245 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [10:39:52] PROBLEM - puppet last run on mw2259 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[deployment_ensure_members] [10:39:54] PROBLEM - puppet last run on mw2203 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [10:39:55] (03CR) 10Giuseppe Lavagetto: [C: 03+1] cloudcontrol2001-dev: spare system for now [puppet] - 10https://gerrit.wikimedia.org/r/488897 (https://phabricator.wikimedia.org/T214448) (owner: 10Arturo Borrero Gonzalez) [10:40:09] <_joe_> oh the puppet failures are my fault [10:40:29] <_joe_> but they're going away on a second run. I'll fix it [10:40:56] PROBLEM - puppet last run on mw2208 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [10:41:02] PROBLEM - puppet last run on mw2236 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [10:41:22] PROBLEM - puppet last run on mw2226 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [10:41:22] PROBLEM - puppet last run on mw2230 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [10:41:22] PROBLEM - puppet last run on mw2146 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[deployment_ensure_members] [10:41:30] PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Update_rows_v1 event on table commonswiki.echo_notification: Cant find record in echo_notification, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1069-bin.000334, end_log_pos 718537987 [10:41:46] PROBLEM - puppet last run on mw2262 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [10:41:52] PROBLEM - puppet last run on labtestweb2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [10:41:58] PROBLEM - puppet last run on an-master1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[analytics-privatedata-users_ensure_members] [10:42:00] PROBLEM - puppet last run on mw2185 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [10:42:02] PROBLEM - puppet last run on mw2250 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [10:42:35] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudcontrol2001-dev: spare system for now [puppet] - 10https://gerrit.wikimedia.org/r/488897 (https://phabricator.wikimedia.org/T214448) (owner: 10Arturo Borrero Gonzalez) [10:42:48] RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [10:43:18] PROBLEM - puppet last run on people1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[all-users_ensure_members] [10:43:31] !log Run mysqldump from dbstore1003 to dump dbstore1002:staging.mep_word_persistence - T215450 [10:43:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:34] T215450: Sqoop staging.mep_word_persistence to HDFS and drop the table from dbstore1002 - https://phabricator.wikimedia.org/T215450 [10:44:18] PROBLEM - puppet last run on stat1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[analytics-privatedata-users_ensure_members] [10:44:18] PROBLEM - puppet last run on notebook1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[analytics-privatedata-users_ensure_members] [10:46:42] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production for dsharpe - https://phabricator.wikimedia.org/T214130 (10Joe) a:03Joe [10:47:16] RECOVERY - puppet last run on an-master1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:49:36] RECOVERY - puppet last run on notebook1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:49:36] RECOVERY - puppet last run on stat1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:50:20] PROBLEM - puppet last run on mw2143 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [10:50:50] PROBLEM - puppet last run on mw2156 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 16 seconds ago with 1 failures. 
Failed resources (up to 3 shown) [10:51:56] RECOVERY - puppet last run on mw2146 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:52:28] RECOVERY - puppet last run on labtestweb2001 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [10:53:30] PROBLEM - puppet last run on mw1348 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 10 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [10:53:30] PROBLEM - puppet last run on mw2196 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 10 seconds ago with 1 failures. Failed resources (up to 3 shown) [10:53:38] PROBLEM - puppet last run on mw2192 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 25 seconds ago with 1 failures. Failed resources (up to 3 shown) [10:53:56] <_joe_> these will autorecover soon [10:54:01] 10Operations, 10Wikimedia-Logstash, 10User-herron: Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs - https://phabricator.wikimedia.org/T213899 (10fgiunchedi) [10:54:04] PROBLEM - puppet last run on mw2274 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 42 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [10:54:16] PROBLEM - puppet last run on mw2197 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 58 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [10:54:21] (03PS1) 10Arturo Borrero Gonzalez: cloudvirt200X-dev: add roles in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/488899 (https://phabricator.wikimedia.org/T214448) [10:54:28] PROBLEM - puppet last run on mw2204 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 59 seconds ago with 1 failures. 
Failed resources (up to 3 shown): Exec[deployment_ensure_members] [10:55:10] PROBLEM - puppet last run on mw1222 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [10:55:13] (03PS1) 10Marostegui: db-eqiad.php: Depool db1101:3317,3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488900 (https://phabricator.wikimedia.org/T210713) [10:55:18] PROBLEM - puppet last run on mw1320 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [10:55:26] PROBLEM - puppet last run on mw1284 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 23 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [10:55:28] PROBLEM - puppet last run on mw1240 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [10:55:36] RECOVERY - puppet last run on mw2143 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:55:46] PROBLEM - puppet last run on mw2259 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [10:55:50] PROBLEM - puppet last run on mw2248 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 seconds ago with 1 failures. Failed resources (up to 3 shown) [10:55:54] PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Update_rows_v1 event on table wikishared.echo_unread_wikis: Cant find record in echo_unread_wikis, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1069-bin.000334, end_log_pos 863963344 [10:55:54] PROBLEM - puppet last run on bast1002 is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[all-users_ensure_members] [10:56:01] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudvirt200X-dev: add roles in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/488899 (https://phabricator.wikimedia.org/T214448) (owner: 10Arturo Borrero Gonzalez) [10:56:06] RECOVERY - puppet last run on mw2156 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:56:25] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1101:3317,3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488900 (https://phabricator.wikimedia.org/T210713) (owner: 10Marostegui) [10:57:00] PROBLEM - puppet last run on mw2238 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [10:57:29] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1101:3317,3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488900 (https://phabricator.wikimedia.org/T210713) (owner: 10Marostegui) [10:57:41] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1101:3317,3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488900 (https://phabricator.wikimedia.org/T210713) (owner: 10Marostegui) [10:57:52] RECOVERY - puppet last run on mw2185 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [10:58:44] PROBLEM - puppet last run on mw2223 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. 
Failed resources (up to 3 shown): Exec[deployment_ensure_members] [10:58:45] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1101 for alter and mysql upgrade (duration: 00m 56s) [10:58:46] RECOVERY - puppet last run on mw1348 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:58:46] RECOVERY - puppet last run on mw2196 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:58:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:54] PROBLEM - puppet last run on mw1282 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [10:58:56] RECOVERY - puppet last run on mw2192 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:58:58] 10Operations, 10Proton, 10Security-Team, 10Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q3), 10Reading-Infrastructure-Team-Backlog (Kanban): [2 hrs] Decide on handling system updates for Proton - https://phabricator.wikimedia.org/T213366 (10hashar) TLDR: in the CI job, puppeteer does not down... [10:59:04] PROBLEM - puppet last run on mw1296 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 27 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [10:59:20] RECOVERY - puppet last run on mw2274 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:59:22] PROBLEM - puppet last run on mw1313 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [10:59:22] PROBLEM - puppet last run on mw2252 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. 
Failed resources (up to 3 shown): Exec[deployment_ensure_members] [10:59:32] RECOVERY - puppet last run on mw2197 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:59:42] PROBLEM - puppet last run on mw1271 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [10:59:44] RECOVERY - puppet last run on mw2204 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:59:52] PROBLEM - puppet last run on bast4002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[all-users_ensure_members] [11:00:16] PROBLEM - puppet last run on cloudvirt2001-dev is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:00:20] RECOVERY - puppet last run on mwlog2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:00:26] RECOVERY - puppet last run on mw1222 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:00:30] RECOVERY - puppet last run on mw2286 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:00:32] PROBLEM - puppet last run on mw2269 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [11:00:32] PROBLEM - puppet last run on mw2227 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 8 seconds ago with 1 failures. 
Failed resources (up to 3 shown) [11:00:34] RECOVERY - puppet last run on mw1320 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:00:44] RECOVERY - puppet last run on mw1240 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:00:44] RECOVERY - puppet last run on mw1284 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:00:56] PROBLEM - puppet last run on mw1304 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 13 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [11:01:02] RECOVERY - puppet last run on mw2259 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:01:06] RECOVERY - puppet last run on mw2248 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:01:13] 10Puppet, 10cloud-services-team (Kanban): ops/puppet: generalize systemd resource control for users - https://phabricator.wikimedia.org/T215401 (10elukey) So user ids are set in the admin module's data.yaml: ` elukey@stat1006:~$ id elukey uid=13926(elukey) elukey: ensure: present gid: 500 name:... [11:01:18] PROBLEM - puppet last run on mw1267 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [11:01:32] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[all-users_ensure_members] [11:02:14] RECOVERY - puppet last run on mw2238 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:02:24] 10Operations, 10Performance-Team, 10Traffic, 10media-storage: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10jijiki) >>! In T211661#4934183, @Gilles wrote: > I'd argue that we don't want both changes to happen around the same time. 
And this is probably less... [11:02:30] RECOVERY - puppet last run on mw2230 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:02:32] PROBLEM - puppet last run on mw2277 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 11 seconds ago with 1 failures. Failed resources (up to 3 shown) [11:02:36] PROBLEM - puppet last run on mw2241 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [11:02:36] PROBLEM - puppet last run on mw2170 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 25 seconds ago with 1 failures. Failed resources (up to 3 shown) [11:03:14] PROBLEM - puppet last run on mw2250 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [11:04:00] RECOVERY - puppet last run on mw2223 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [11:04:04] PROBLEM - puppet last run on mw1251 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [11:04:06] PROBLEM - puppet last run on mw2206 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [11:04:10] RECOVERY - puppet last run on mw1282 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:04:12] (03PS1) 10Arturo Borrero Gonzalez: hiera: cloudvirt200X-dev: add hosts overrides [puppet] - 10https://gerrit.wikimedia.org/r/488902 (https://phabricator.wikimedia.org/T214448) [11:04:14] PROBLEM - puppet last run on mw2217 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 59 seconds ago with 1 failures. 
Failed resources (up to 3 shown): Exec[deployment_ensure_members] [11:04:20] RECOVERY - puppet last run on mw1296 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:04:34] RECOVERY - puppet last run on mw2279 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:04:38] RECOVERY - puppet last run on mw1313 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:04:38] RECOVERY - puppet last run on mw2252 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:04:38] PROBLEM - puppet last run on mw2220 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 35 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [11:04:46] PROBLEM - puppet last run on mw2275 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [11:04:52] PROBLEM - puppet last run on mw1229 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 32 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [11:04:56] RECOVERY - puppet last run on mw1271 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:04:58] PROBLEM - puppet last run on mw1319 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 58 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [11:05:03] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] hiera: cloudvirt200X-dev: add hosts overrides [puppet] - 10https://gerrit.wikimedia.org/r/488902 (https://phabricator.wikimedia.org/T214448) (owner: 10Arturo Borrero Gonzalez) [11:05:26] PROBLEM - puppet last run on mw1265 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 48 seconds ago with 1 failures. 
Failed resources (up to 3 shown): Exec[deployment_ensure_members] [11:05:28] PROBLEM - puppet last run on mw1272 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 26 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [11:05:48] RECOVERY - puppet last run on mw2269 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:05:48] RECOVERY - puppet last run on mw2227 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:05:50] PROBLEM - puppet last run on mw2165 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [11:06:02] PROBLEM - puppet last run on mw1247 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [11:06:02] RECOVERY - puppet last run on mw1245 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [11:06:12] RECOVERY - puppet last run on mw1304 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:06:12] PROBLEM - puppet last run on mw1327 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 30 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [11:06:14] PROBLEM - puppet last run on mw2199 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 seconds ago with 1 failures. Failed resources (up to 3 shown) [11:06:20] RECOVERY - puppet last run on mw2203 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [11:06:22] PROBLEM - puppet last run on mw2289 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 49 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [11:06:22] PROBLEM - puppet last run on mw2225 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 12 seconds ago with 1 failures. 
Failed resources (up to 3 shown) [11:06:34] RECOVERY - puppet last run on mw1267 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:06:36] PROBLEM - puppet last run on mw1337 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 41 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [11:07:24] PROBLEM - puppet last run on mw1294 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 32 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [11:07:32] PROBLEM - puppet last run on mwdebug2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [11:07:35] (03PS5) 10Elukey: Introduce systemd::slice::all_users [puppet] - 10https://gerrit.wikimedia.org/r/488077 (https://phabricator.wikimedia.org/T212824) [11:07:46] PROBLEM - puppet last run on mw2186 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 11 seconds ago with 1 failures. Failed resources (up to 3 shown) [11:07:46] RECOVERY - puppet last run on mw2226 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [11:07:46] RECOVERY - puppet last run on mw2277 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:07:48] PROBLEM - puppet last run on mw2281 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 15 seconds ago with 1 failures. 
Failed resources (up to 3 shown) [11:07:51] 10Operations, 10serviceops, 10User-jijiki: Fix spamassassin's "warn: netset: cannot include " warning - https://phabricator.wikimedia.org/T215496 (10jijiki) p:05Triage→03Normal [11:07:52] RECOVERY - puppet last run on mw2170 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:07:52] RECOVERY - puppet last run on mw2241 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [11:07:54] PROBLEM - puppet last run on mw2239 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 23 seconds ago with 1 failures. Failed resources (up to 3 shown) [11:07:54] PROBLEM - puppet last run on mw2222 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [11:08:30] RECOVERY - puppet last run on mw2250 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:09:20] RECOVERY - puppet last run on mw1251 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:09:22] RECOVERY - puppet last run on mw2206 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:09:30] RECOVERY - puppet last run on mw2217 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:09:32] PROBLEM - puppet last run on mw1328 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [11:09:40] PROBLEM - puppet last run on mw1308 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 20 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [11:09:40] PROBLEM - puppet last run on mw1273 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[deployment_ensure_members] [11:09:40] PROBLEM - puppet last run on mw1274 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 50 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [11:09:40] PROBLEM - puppet last run on mw2218 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 56 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [11:09:44] <_joe_> sorry for the spam [11:09:46] RECOVERY - puppet last run on people1001 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [11:09:52] PROBLEM - DPKG on cloudvirt2003-dev is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:09:52] PROBLEM - puppet last run on mw2209 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 48 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [11:09:56] RECOVERY - puppet last run on mw2220 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:10:02] RECOVERY - puppet last run on mw2275 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [11:10:04] PROBLEM - DPKG on cloudvirt2001-dev is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:10:08] RECOVERY - puppet last run on mw1229 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:10:14] RECOVERY - puppet last run on mw1319 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [11:10:14] PROBLEM - puppet last run on mw1303 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 8 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [11:10:24] PROBLEM - puppet last run on mw1281 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 12 seconds ago with 1 failures. 
Failed resources (up to 3 shown): Exec[deployment_ensure_members] [11:10:24] PROBLEM - puppet last run on bast2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[all-users_ensure_members] [11:10:36] PROBLEM - puppet last run on mw2167 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 10 seconds ago with 1 failures. Failed resources (up to 3 shown) [11:10:42] RECOVERY - puppet last run on mw1265 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [11:10:44] RECOVERY - puppet last run on mw1272 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:11:04] RECOVERY - puppet last run on mw2165 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:11:08] PROBLEM - puppet last run on mw2216 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 seconds ago with 1 failures. Failed resources (up to 3 shown) [11:11:18] RECOVERY - puppet last run on mw1247 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [11:11:18] RECOVERY - puppet last run on mw1226 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:11:22] RECOVERY - DPKG on cloudvirt2001-dev is OK: All packages OK [11:11:28] RECOVERY - puppet last run on mw1327 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:11:30] RECOVERY - puppet last run on mw2199 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:11:30] PROBLEM - puppet last run on mw1346 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 58 seconds ago with 1 failures. 
Failed resources (up to 3 shown): Exec[deployment_ensure_members] [11:11:38] RECOVERY - puppet last run on mw2289 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:11:38] RECOVERY - puppet last run on mw2225 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:11:48] PROBLEM - puppet last run on mw1314 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 22 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [11:11:52] RECOVERY - puppet last run on mw1337 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [11:11:56] PROBLEM - puppet last run on mw2179 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 31 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [11:12:26] RECOVERY - DPKG on cloudvirt2003-dev is OK: All packages OK [11:12:36] RECOVERY - puppet last run on mw2208 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:12:38] RECOVERY - puppet last run on mw1294 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:12:44] RECOVERY - puppet last run on mw2236 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:12:46] RECOVERY - puppet last run on mwdebug2002 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [11:12:46] (03CR) 10BBlack: [C: 03+1] ssl: get rid of the expired digicert-2017 certificate [puppet] - 10https://gerrit.wikimedia.org/r/487584 (https://phabricator.wikimedia.org/T215103) (owner: 10Vgutierrez) [11:12:52] (03PS1) 10Ladsgroup: Update interwiki cache to have yuewiktionary instead of zh-yue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488903 (https://phabricator.wikimedia.org/T214400) [11:13:02] RECOVERY - puppet last run on mw2186 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:13:04] RECOVERY - puppet 
last run on mw2281 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:13:08] RECOVERY - puppet last run on mw2239 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:13:08] RECOVERY - puppet last run on mw2222 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [11:13:30] RECOVERY - puppet last run on mw2262 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:13:54] PROBLEM - puppet last run on bast5001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[all-users_ensure_members] [11:14:32] PROBLEM - Check systemd state on cloudvirt2003-dev is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:14:48] RECOVERY - puppet last run on mw1328 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:14:49] 10Operations, 10Wikimedia-Logstash: Move iegreview from udp2log to syslog - https://phabricator.wikimedia.org/T215497 (10fgiunchedi) p:05Triage→03Normal [11:14:56] RECOVERY - puppet last run on mw1308 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:14:56] RECOVERY - puppet last run on mw1273 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:14:56] RECOVERY - puppet last run on mw1274 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [11:14:56] RECOVERY - puppet last run on mw2218 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:15:08] RECOVERY - puppet last run on mw2209 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [11:15:30] PROBLEM - Check systemd state on cloudvirt2001-dev is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[11:15:32] RECOVERY - puppet last run on mw1303 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:15:40] RECOVERY - puppet last run on mw1281 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:15:52] RECOVERY - puppet last run on mw2167 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:16:26] RECOVERY - puppet last run on mw2216 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [11:16:28] !log upgrade helm to 2.12.2 on deploy{1001,2001} and contint{1001,2001} [11:16:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:44] (03PS1) 10Arturo Borrero Gonzalez: hiera: cloudvirt200X-dev: fix wrong hiera keys names [puppet] - 10https://gerrit.wikimedia.org/r/488905 (https://phabricator.wikimedia.org/T214448) [11:16:46] RECOVERY - puppet last run on mw1346 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [11:16:46] PROBLEM - DPKG on cloudvirt2002-dev is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:17:02] RECOVERY - puppet last run on mw1314 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:17:10] !log upgrade helm to 2.12.2 on deploy{1001,2001} and contint{1001,2001} T215244 [11:17:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:12] RECOVERY - puppet last run on mw2179 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:17:46] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] hiera: cloudvirt200X-dev: fix wrong hiera keys names [puppet] - 10https://gerrit.wikimedia.org/r/488905 (https://phabricator.wikimedia.org/T214448) (owner: 10Arturo Borrero Gonzalez) [11:18:04] RECOVERY - DPKG on cloudvirt2002-dev is OK: All packages OK [11:18:13] 10Operations, 10Wikimedia-Logstash: Move wikimania-scholarships from udp2log to syslog - https://phabricator.wikimedia.org/T215499 (10fgiunchedi) 
p:05Triage→03Normal [11:18:22] PROBLEM - puppet last run on cloudvirt2003-dev is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Mount[/var/lib/nova/instances] [11:19:19] (03CR) 10Vgutierrez: [C: 03+2] ssl: get rid of the expired digicert-2017 certificate [puppet] - 10https://gerrit.wikimedia.org/r/487584 (https://phabricator.wikimedia.org/T215103) (owner: 10Vgutierrez) [11:19:22] RECOVERY - Check systemd state on cloudvirt2001-dev is OK: OK - running: The system is fully operational [11:19:30] (03PS2) 10Vgutierrez: ssl: get rid of the expired digicert-2017 certificate [puppet] - 10https://gerrit.wikimedia.org/r/487584 (https://phabricator.wikimedia.org/T215103) [11:19:44] RECOVERY - Check systemd state on cloudvirt2003-dev is OK: OK - running: The system is fully operational [11:20:44] PROBLEM - Check systemd state on cloudvirt2002-dev is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:22:31] (03PS9) 10Jbond: Improve CI checks to ensure a basic catalogue compiles on all supported OS's [puppet] - 10https://gerrit.wikimedia.org/r/487882 (https://phabricator.wikimedia.org/T215275) [11:22:44] PROBLEM - Host cloudvirt2003-dev is DOWN: PING CRITICAL - Packet loss = 100% [11:23:06] PROBLEM - Host cloudvirt2001-dev is DOWN: PING CRITICAL - Packet loss = 100% [11:23:26] (03CR) 10jerkins-bot: [V: 04-1] Improve CI checks to ensure a basic catalogue compiles on all supported OS's [puppet] - 10https://gerrit.wikimedia.org/r/487882 (https://phabricator.wikimedia.org/T215275) (owner: 10Jbond) [11:24:40] PROBLEM - puppet last run on cloudvirt2002-dev is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/nova/policy.json] [11:25:14] (03CR) 10Hashar: "recheck" [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/487612 (owner: 10CRusnov) [11:26:14] RECOVERY - puppet last run on bast4002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:26:18] (03PS1) 10Ladsgroup: Set EntityUsageTable addUsage batch size to 300 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488907 (https://phabricator.wikimedia.org/T215146) [11:26:36] RECOVERY - Host cloudvirt2003-dev is UP: PING OK - Packet loss = 0%, RTA = 36.19 ms [11:26:51] 10Operations, 10Wikimedia-Logstash, 10User-fgiunchedi, 10User-herron: Increase utilization of application logging pipeline (FY2018-2019 Q3 TEC6) - https://phabricator.wikimedia.org/T213157 (10fgiunchedi) [11:27:12] RECOVERY - Check systemd state on cloudvirt2002-dev is OK: OK - running: The system is fully operational [11:27:22] RECOVERY - HTTPS Unified ECDSA on cp4026 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345546 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2019-11-22 07:59:59 +0000 (expires in 287 days) [11:27:24] RECOVERY - Check systemd state on cp4026 is OK: OK - running: The system is fully operational [11:27:26] RECOVERY - HTTPS Unified RSA on cp4026 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345544 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2019-11-22 07:59:59 +0000 (expires in 287 days) [11:27:36] PROBLEM - Host mw1299 is DOWN: PING CRITICAL - Packet loss = 100% [11:27:38] RECOVERY - puppet last run on bast1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:28:28] RECOVERY - Host cloudvirt2001-dev is UP: PING OK - Packet loss = 0%, RTA = 36.15 ms [11:29:12] PROBLEM - configured eth on cloudvirt2003-dev is CRITICAL: connect to address 
10.192.20.14 port 5666: Connection refused [11:29:18] PROBLEM - DPKG on cloudvirt2003-dev is CRITICAL: connect to address 10.192.20.14 port 5666: Connection refused [11:29:52] PROBLEM - SSH on cloudvirt2003-dev is CRITICAL: connect to address 10.192.20.14 and port 22: Connection refused [11:30:04] PROBLEM - Disk space on cloudvirt2003-dev is CRITICAL: connect to address 10.192.20.14 port 5666: Connection refused [11:30:04] PROBLEM - Check systemd state on cloudvirt2003-dev is CRITICAL: connect to address 10.192.20.14 port 5666: Connection refused [11:30:10] PROBLEM - MD RAID on cloudvirt2003-dev is CRITICAL: connect to address 10.192.20.14 port 5666: Connection refused [11:30:22] (03CR) 10Addshore: [C: 03+1] Set EntityUsageTable addUsage batch size to 300 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488907 (https://phabricator.wikimedia.org/T215146) (owner: 10Ladsgroup) [11:30:24] PROBLEM - dhclient process on cloudvirt2003-dev is CRITICAL: connect to address 10.192.20.14 port 5666: Connection refused [11:31:06] PROBLEM - dhclient process on cloudvirt2001-dev is CRITICAL: connect to address 10.192.20.5 port 5666: Connection refused [11:31:22] PROBLEM - SSH on cloudvirt2001-dev is CRITICAL: connect to address 10.192.20.5 and port 22: Connection refused [11:31:28] PROBLEM - MD RAID on cloudvirt2001-dev is CRITICAL: connect to address 10.192.20.5 port 5666: Connection refused [11:31:30] PROBLEM - puppet last run on cloudvirt2003-dev is CRITICAL: connect to address 10.192.20.14 port 5666: Connection refused [11:31:32] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "Other than that LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/488077 (https://phabricator.wikimedia.org/T212824) (owner: 10Elukey) [11:31:48] PROBLEM - configured eth on cloudvirt2001-dev is CRITICAL: connect to address 10.192.20.5 port 5666: Connection refused [11:31:54] PROBLEM - Disk space on cloudvirt2001-dev is CRITICAL: connect to address 10.192.20.5 port 5666: Connection 
refused [11:32:04] PROBLEM - DPKG on cloudvirt2001-dev is CRITICAL: connect to address 10.192.20.5 port 5666: Connection refused [11:32:18] PROBLEM - Check systemd state on cloudvirt2001-dev is CRITICAL: connect to address 10.192.20.5 port 5666: Connection refused [11:32:18] (03CR) 10Jcrespo: [C: 03+2] Revert "mariadb: Depool db2055 for performance testing" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488483 (owner: 10Jcrespo) [11:33:02] RECOVERY - puppet last run on cp4026 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:33:10] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:33:23] (03PS1) 10Hashar: Add .gitreview file [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/488909 [11:33:25] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db2055 for performance testing" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488483 (owner: 10Jcrespo) [11:33:58] (03CR) 10Hashar: "That is for https://www.mediawiki.org/wiki/Gerrit/git-review :)" [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/488909 (owner: 10Hashar) [11:34:32] PROBLEM - puppet last run on cloudvirt2001-dev is CRITICAL: connect to address 10.192.20.5 port 5666: Connection refused [11:34:41] (03CR) 10Hashar: [C: 03+1] ":)" [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/487612 (owner: 10CRusnov) [11:35:23] 'mw1279.eqiad.wmnet' failed: ERROR: 50% OVER_THRESHOLD [11:36:01] and I think mw1299 is down [11:36:48] RECOVERY - puppet last run on bast2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:37:05] !log jynus@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2055 (duration: 03m 02s) [11:37:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:16] RECOVERY - puppet last run on bast5001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:40:38] 
RECOVERY - Freshness of OCSP Stapling files on cp1080 is OK: OK [11:40:40] RECOVERY - Freshness of OCSP Stapling files on cp2004 is OK: OK [11:40:42] RECOVERY - Freshness of OCSP Stapling files on cp2024 is OK: OK [11:40:42] RECOVERY - Freshness of OCSP Stapling files on cp2012 is OK: OK [11:40:42] RECOVERY - Freshness of OCSP Stapling files on cp2001 is OK: OK [11:40:42] RECOVERY - Freshness of OCSP Stapling files on cp2006 is OK: OK [11:40:42] RECOVERY - Freshness of OCSP Stapling files on cp2016 is OK: OK [11:40:44] RECOVERY - Freshness of OCSP Stapling files on cp3030 is OK: OK [11:40:52] RECOVERY - Freshness of OCSP Stapling files on cp4032 is OK: OK [11:40:52] RECOVERY - Freshness of OCSP Stapling files on cp4022 is OK: OK [11:40:56] RECOVERY - Freshness of OCSP Stapling files on cp5008 is OK: OK [11:40:58] RECOVERY - Freshness of OCSP Stapling files on cp3032 is OK: OK [11:41:02] RECOVERY - Freshness of OCSP Stapling files on cp2020 is OK: OK [11:41:02] RECOVERY - Freshness of OCSP Stapling files on cp2017 is OK: OK [11:41:02] RECOVERY - Freshness of OCSP Stapling files on cp2023 is OK: OK [11:41:02] RECOVERY - Freshness of OCSP Stapling files on cp2013 is OK: OK [11:41:02] RECOVERY - Freshness of OCSP Stapling files on cp2019 is OK: OK [11:41:04] RECOVERY - Freshness of OCSP Stapling files on cp1082 is OK: OK [11:41:04] RECOVERY - Freshness of OCSP Stapling files on cp3036 is OK: OK [11:41:04] RECOVERY - Freshness of OCSP Stapling files on cp3044 is OK: OK [11:41:04] RECOVERY - Freshness of OCSP Stapling files on cp3049 is OK: OK [11:41:04] RECOVERY - Freshness of OCSP Stapling files on cp3045 is OK: OK [11:41:08] RECOVERY - Freshness of OCSP Stapling files on cp4025 is OK: OK [11:41:08] RECOVERY - Freshness of OCSP Stapling files on cp3041 is OK: OK [11:41:12] RECOVERY - Freshness of OCSP Stapling files on cp5002 is OK: OK [11:41:12] RECOVERY - Freshness of OCSP Stapling files on cp5003 is OK: OK [11:41:12] RECOVERY - Freshness of OCSP Stapling files on 
cp5005 is OK: OK [11:41:12] RECOVERY - Freshness of OCSP Stapling files on cp5006 is OK: OK [11:41:14] RECOVERY - Freshness of OCSP Stapling files on cp4031 is OK: OK [11:41:14] RECOVERY - Freshness of OCSP Stapling files on cp3033 is OK: OK [11:41:14] RECOVERY - Freshness of OCSP Stapling files on cp3037 is OK: OK [11:41:14] RECOVERY - Freshness of OCSP Stapling files on cp3035 is OK: OK [11:41:16] RECOVERY - Freshness of OCSP Stapling files on cp1087 is OK: OK [11:41:18] RECOVERY - Freshness of OCSP Stapling files on cp4023 is OK: OK [11:41:18] RECOVERY - Freshness of OCSP Stapling files on cp3043 is OK: OK [11:41:18] RECOVERY - Freshness of OCSP Stapling files on cp3040 is OK: OK [11:41:20] * vgutierrez hides [11:41:20] RECOVERY - Freshness of OCSP Stapling files on cp2005 is OK: OK [11:41:24] RECOVERY - Freshness of OCSP Stapling files on cp2010 is OK: OK [11:41:28] RECOVERY - Freshness of OCSP Stapling files on cp1077 is OK: OK [11:41:28] RECOVERY - Freshness of OCSP Stapling files on cp1085 is OK: OK [11:41:28] RECOVERY - Freshness of OCSP Stapling files on cp4030 is OK: OK [11:41:29] sorry about the noise folks :) [11:41:30] RECOVERY - Freshness of OCSP Stapling files on cp1078 is OK: OK [11:41:30] RECOVERY - Freshness of OCSP Stapling files on cp1083 is OK: OK [11:41:36] RECOVERY - Freshness of OCSP Stapling files on cp3042 is OK: OK [11:41:38] RECOVERY - Freshness of OCSP Stapling files on cp1079 is OK: OK [11:41:38] RECOVERY - Freshness of OCSP Stapling files on cp1090 is OK: OK [11:41:41] (03CR) 10jenkins-bot: Revert "mariadb: Depool db2055 for performance testing" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488483 (owner: 10Jcrespo) [11:41:44] RECOVERY - Freshness of OCSP Stapling files on cp4021 is OK: OK [11:41:44] RECOVERY - Freshness of OCSP Stapling files on cp4029 is OK: OK [11:41:44] RECOVERY - Freshness of OCSP Stapling files on cp3038 is OK: OK [11:41:44] RECOVERY - Freshness of OCSP Stapling files on cp3047 is OK: OK [11:41:44] 
RECOVERY - Freshness of OCSP Stapling files on cp3046 is OK: OK [11:41:44] RECOVERY - Freshness of OCSP Stapling files on cp3034 is OK: OK [11:41:46] RECOVERY - Freshness of OCSP Stapling files on cp1088 is OK: OK [11:41:48] RECOVERY - Freshness of OCSP Stapling files on cp1084 is OK: OK [11:41:48] RECOVERY - Freshness of OCSP Stapling files on cp5011 is OK: OK [11:41:48] RECOVERY - Freshness of OCSP Stapling files on cp2022 is OK: OK [11:41:50] RECOVERY - Freshness of OCSP Stapling files on cp1089 is OK: OK [11:41:50] RECOVERY - Freshness of OCSP Stapling files on cp2026 is OK: OK [11:41:56] RECOVERY - Freshness of OCSP Stapling files on cp1081 is OK: OK [11:41:56] RECOVERY - Freshness of OCSP Stapling files on cp1076 is OK: OK [11:41:56] RECOVERY - Freshness of OCSP Stapling files on cp1086 is OK: OK [11:41:58] RECOVERY - Freshness of OCSP Stapling files on cp5009 is OK: OK [11:41:58] RECOVERY - Freshness of OCSP Stapling files on cp5012 is OK: OK [11:41:58] RECOVERY - Freshness of OCSP Stapling files on cp5007 is OK: OK [11:43:19] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production for dsharpe - https://phabricator.wikimedia.org/T214130 (10Joe) [11:43:34] (03PS6) 10Elukey: Introduce systemd::slice::all_users [puppet] - 10https://gerrit.wikimedia.org/r/488077 (https://phabricator.wikimedia.org/T212824) [11:44:19] (03PS10) 10Jbond: Improve CI checks to ensure a basic catalogue compiles on all supported OS's [puppet] - 10https://gerrit.wikimedia.org/r/487882 (https://phabricator.wikimedia.org/T215275) [11:45:16] (03CR) 10jerkins-bot: [V: 04-1] Improve CI checks to ensure a basic catalogue compiles on all supported OS's [puppet] - 10https://gerrit.wikimedia.org/r/487882 (https://phabricator.wikimedia.org/T215275) (owner: 10Jbond) [11:46:45] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production for dsharpe - https://phabricator.wikimedia.org/T214130 (10Joe) @Dsharpe you should be able to long 
onto the systems accessible via those groups - for example, `deploy1001`. If you can access those servers, please resol... [11:47:49] PROBLEM - Check the NTP synchronisation status of timesyncd on cloudvirt2003-dev is CRITICAL: connect to address 10.192.20.14 port 5666: Connection refused [11:48:26] 10Operations, 10Cloud-VPS, 10SRE-Access-Requests, 10cloud-services-team, and 2 others: Create cloudelastic-root group - https://phabricator.wikimedia.org/T214922 (10Joe) ok great - this should be discussed in the SRE meeting on monday. [11:49:37] PROBLEM - Check the NTP synchronisation status of timesyncd on cloudvirt2001-dev is CRITICAL: connect to address 10.192.20.5 port 5666: Connection refused [11:52:07] (03CR) 10Elukey: "Arturo: let's see if Moritz has any comment about this approach and then if none, let's merge? :)" [puppet] - 10https://gerrit.wikimedia.org/r/488077 (https://phabricator.wikimedia.org/T212824) (owner: 10Elukey) [11:53:32] 10Operations, 10Cloud-Services, 10Patch-For-Review: rack/setup/install cloudcontrol2001-dev & cloudvirt200[123]-dev - https://phabricator.wikimedia.org/T214448 (10aborrero) >>! In T214448#4909868, @Andrew wrote: >>>! In T214448#4909558, @Papaul wrote: >> @Andrew there is no raid controller on the new server... 
[11:53:41] PROBLEM - IPMI Sensor Status on cloudvirt2003-dev is CRITICAL: connect to address 10.192.20.14 port 5666: Connection refused [11:55:25] !log Stop MySQL on db1101:3317 and db1101:3318 for mysql upgrade [11:55:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:41] (03CR) 10Hashar: Improve CI checks to ensure a basic catalogue compiles on all supported OS's (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/487882 (https://phabricator.wikimedia.org/T215275) (owner: 10Jbond) [11:57:39] PROBLEM - IPMI Sensor Status on cloudvirt2001-dev is CRITICAL: connect to address 10.192.20.5 port 5666: Connection refused [11:57:50] (03PS1) 10Arturo Borrero Gonzalez: cloudvirt200[123]-dev: use partman/raid1-lvm-xfs-nova.cfg [puppet] - 10https://gerrit.wikimedia.org/r/488914 (https://phabricator.wikimedia.org/T214448) [11:58:09] PROBLEM - Long running screen/tmux on prometheus2003 is CRITICAL: CRIT: Long running SCREEN process. (user: root PID: 11023, 1741693s 1728000s). [11:58:26] 10Operations, 10Cloud-Services, 10Patch-For-Review: rack/setup/install cloudcontrol2001-dev & cloudvirt200[123]-dev - https://phabricator.wikimedia.org/T214448 (10aborrero) >>! In T214448#4934464, @aborrero wrote: >>>! In T214448#4909868, @Andrew wrote: >>>>! In T214448#4909558, @Papaul wrote: >>> @Andrew th... [11:58:37] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudvirt200[123]-dev: use partman/raid1-lvm-xfs-nova.cfg [puppet] - 10https://gerrit.wikimedia.org/r/488914 (https://phabricator.wikimedia.org/T214448) (owner: 10Arturo Borrero Gonzalez) [11:59:27] 10Operations, 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Patch-For-Review: dbstore1002 Mysql errors - https://phabricator.wikimedia.org/T213670 (10Marostegui) @elukey @jcrespo Any objection to put dbstore1002 as IDEMPOTENT? This host crashes every single day, the data is already drifts a lot... 
[12:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190207T1200). [12:00:05] Amir1: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:19] o/ [12:03:16] !log T214448 reimaging again cloudvirt200[1-3]-dev.codfw.wmnet [12:03:18] 10Operations, 10LDAP-Access-Requests, 10User-Addshore, 10User-jijiki: Add "raz-shuty" to nda ldap group - https://phabricator.wikimedia.org/T214488 (10jijiki) p:05Triage→03Normal [12:03:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:19] T214448: rack/setup/install cloudcontrol2001-dev & cloudvirt200[123]-dev - https://phabricator.wikimedia.org/T214448 [12:03:20] 10Operations, 10Cloud-Services, 10Patch-For-Review: rack/setup/install cloudcontrol2001-dev & cloudvirt200[123]-dev - https://phabricator.wikimedia.org/T214448 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by aborrero on cumin2001.codfw.wmnet for hosts: ` ['cloudvirt2001-dev.codfw.wmnet', 'clou... [12:03:52] 10Operations, 10ops-eqsin, 10Traffic: Degraded RAID on cp5010 - https://phabricator.wikimedia.org/T214274 (10jijiki) p:05Triage→03Normal [12:04:09] 10Operations, 10Traffic: cp nodes still try to OCSP staple the already expired digicert-2017 certificate - https://phabricator.wikimedia.org/T215103 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez After merging the change, the following commands have been issued over cumin: ` rm -f /etc/update-ocsp.d/dig... 
[12:04:44] 10Operations, 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Patch-For-Review: dbstore1002 Mysql errors - https://phabricator.wikimedia.org/T213670 (10jcrespo) ok to me, data is already garbage, more garbage would not be a problem :-) [12:06:50] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4026.ulsfo.wmnet [12:06:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:14] 10Operations, 10Maps, 10Patch-For-Review: Kartotherian service on maps100[2-4] timed out on when trying to get tiles. - https://phabricator.wikimedia.org/T214434 (10Joe) p:05Triage→03High [12:07:35] RECOVERY - SSH on cloudvirt2003-dev is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u4 (protocol 2.0) [12:07:49] RECOVERY - SSH on cloudvirt2001-dev is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u4 (protocol 2.0) [12:10:08] (03PS11) 10Jbond: Improve CI checks to ensure a basic catalogue compiles on all supported OS's [puppet] - 10https://gerrit.wikimedia.org/r/487882 (https://phabricator.wikimedia.org/T215275) [12:10:10] 10Operations, 10Traffic, 10HTTPS: en.wikipedia.com [sic] serves an invalid certificate - https://phabricator.wikimedia.org/T214253 (10Joe) p:05Triage→03Low [12:11:27] (03CR) 10Jbond: "@Hashar thanks for the extensive review, see comments inline" (0312 comments) [puppet] - 10https://gerrit.wikimedia.org/r/487882 (https://phabricator.wikimedia.org/T215275) (owner: 10Jbond) [12:11:31] 10Operations, 10Wiki-Loves-Love, 10Wikimedia-Mailing-lists, 10User-jijiki: Reset password for wll mailling list - https://phabricator.wikimedia.org/T215390 (10jijiki) p:05Triage→03Normal [12:12:23] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1101 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488916 [12:12:59] I guess I do the SWAT then [12:14:28] 10Operations, 10Analytics, 10Research, 10serviceops, and 4 others: Transferring data from Hadoop to production MySQL database - 
https://phabricator.wikimedia.org/T213566 (10Joe) [12:14:31] 10Operations, 10vm-requests: Site: 1 VM request for recommender-systems - https://phabricator.wikimedia.org/T215421 (10Joe) 05Open→03Stalled [12:15:48] (03CR) 10Ladsgroup: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488903 (https://phabricator.wikimedia.org/T214400) (owner: 10Ladsgroup) [12:15:56] 10Operations, 10vm-requests: Site: 1 VM request for recommender-systems - https://phabricator.wikimedia.org/T215421 (10Joe) I'm not sure how this request derives from the non-conclusive discussion that is ongoing in the parent task. I am unsure if this ticket should be declined or just stalled - stalling it t... [12:16:06] 10Operations, 10vm-requests: Site: 1 VM request for recommender-systems - https://phabricator.wikimedia.org/T215421 (10Joe) p:05Triage→03Normal [12:16:57] (03Merged) 10jenkins-bot: Update interwiki cache to have yuewiktionary instead of zh-yue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488903 (https://phabricator.wikimedia.org/T214400) (owner: 10Ladsgroup) [12:20:08] 10Operations, 10Wiki-Loves-Love, 10Wikimedia-Mailing-lists, 10User-jijiki: Reset password for wll mailling list - https://phabricator.wikimedia.org/T215390 (10Joe) @Psychoslave I would need additional information, yes. Can you still receive emails at the email address listed as the wll@ admin address? I ne... [12:20:32] (03PS5) 10Vgutierrez: certcentral: Implement staging time [software/certcentral] - 10https://gerrit.wikimedia.org/r/485594 (https://phabricator.wikimedia.org/T213737) [12:21:11] (03CR) 10Vgutierrez: "Thx for the review!" (032 comments) [software/certcentral] - 10https://gerrit.wikimedia.org/r/485594 (https://phabricator.wikimedia.org/T213737) (owner: 10Vgutierrez) [12:21:32] Works fine at mwdebug1002, moving forward [12:21:59] Amir1: let me know when I can deploy db-eqiad.php [12:22:06] sure! [12:22:09] thanks! 
[12:22:10] 10Operations, 10Wiki-Loves-Love, 10Wikimedia-Mailing-lists, 10User-jijiki: Reset password for wll mailling list - https://phabricator.wikimedia.org/T215390 (10Joe) a:03Joe [12:23:37] 10Operations: Archival of home directories on servers with very large homes - https://phabricator.wikimedia.org/T215171 (10Joe) p:05Triage→03Normal [12:26:18] (03CR) 10jenkins-bot: Update interwiki cache to have yuewiktionary instead of zh-yue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488903 (https://phabricator.wikimedia.org/T214400) (owner: 10Ladsgroup) [12:26:24] !log ladsgroup@deploy1001 Synchronized wmf-config/interwiki.php: SWAT: [[gerrit:488903|Update interwiki cache to have yuewiktionary instead of zh-yue (T214400)]] (duration: 03m 04s) [12:26:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:27] T214400: Add yue.wikt to Cognate - https://phabricator.wikimedia.org/T214400 [12:27:03] marostegui: I'm done [12:27:08] thank you! [12:27:14] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Slowly repool db1101 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488916 (owner: 10Marostegui) [12:27:18] I have another patch going but I need to wait in between [12:27:54] btw. One apache had sync error [12:28:13] marostegui: i.e. tell me when you're done :D [12:28:16] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1101 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488916 (owner: 10Marostegui) [12:31:35] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1101 after alter and mysql upgrade (duration: 03m 02s) [12:31:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:48] Amir1: I am done! Was mw1299.eqiad.wmnet the one that failed for you? 
[12:32:30] yup [12:32:34] I will take a look [12:33:24] (03PS1) 10Arturo Borrero Gonzalez: hiera: cloudvirt200[1-3]-dev: fix extra LVM volume name [puppet] - 10https://gerrit.wikimedia.org/r/488918 (https://phabricator.wikimedia.org/T214448) [12:33:58] Thanks! [12:34:07] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] hiera: cloudvirt200[1-3]-dev: fix extra LVM volume name [puppet] - 10https://gerrit.wikimedia.org/r/488918 (https://phabricator.wikimedia.org/T214448) (owner: 10Arturo Borrero Gonzalez) [12:34:14] !log Powercycle mw1299 as it is down and not responding [12:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:29] (03CR) 10Ladsgroup: [C: 03+2] "SWAT!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488907 (https://phabricator.wikimedia.org/T215146) (owner: 10Ladsgroup) [12:34:59] marostegui: btw ^ This might have some effects on the database [12:35:32] (03Merged) 10jenkins-bot: Set EntityUsageTable addUsage batch size to 300 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488907 (https://phabricator.wikimedia.org/T215146) (owner: 10Ladsgroup) [12:35:34] makes batches smaller, so more sql commands [12:35:54] as long as they are fast... [12:35:56] :) [12:36:08] what is the batch size now? [12:36:17] now, it's 500 [12:36:22] it reduce it to 300 [12:36:42] which hopefully helps with T205045 [12:36:43] T205045: Exception from LinksUpdate: Deadlock found in database query (from Wikibase\Client\Usage\Sql\EntityUsageTable::addUsages) - https://phabricator.wikimedia.org/T205045 [12:37:26] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1101 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488916 (owner: 10Marostegui) [12:37:28] (03CR) 10jenkins-bot: Set EntityUsageTable addUsage batch size to 300 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488907 (https://phabricator.wikimedia.org/T215146) (owner: 10Ladsgroup) [12:37:39] woo! 
[12:37:48] 10Operations, 10MobileFrontend, 10Traffic, 10Readers-Web-Backlog (Tracking): Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (10Joe) p:05Triage→03Normal [12:38:17] RECOVERY - Host mw1299 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [12:38:21] Since it's untestable, I'm moving forward, if things break, it'll show up [12:38:33] * addshore is watching too :) [12:39:52] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:488907|Set EntityUsageTable addUsage batch size to 300 (T215146)]], Part I (duration: 00m 55s) [12:39:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:55] T215146: Decrease EntityUsageTable addUsage batch size - https://phabricator.wikimedia.org/T215146 [12:40:25] Amir1: mw1299 didn't fail this time, right? [12:40:34] marostegui: nope, thanks! [12:40:39] 10Operations, 10ops-esams: Degraded RAID on cp3030 - https://phabricator.wikimedia.org/T214879 (10Joe) 05Open→03Invalid [12:40:43] Coolio [12:41:01] 10Operations: wmf-auto-reimage-host: icinga downtime error - https://phabricator.wikimedia.org/T214314 (10jijiki) p:05Triage→03Normal [12:41:35] 10Operations, 10WMF-Legal, 10Graphite, 10Performance-Team (Radar), 10Software-Licensing: Add license statement to Grafana dashboards - https://phabricator.wikimedia.org/T214819 (10Joe) p:05Triage→03Low [12:42:15] !log Set dbstore1002 as IDEMPOTENT - T213670 [12:42:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:18] T213670: dbstore1002 Mysql errors - https://phabricator.wikimedia.org/T213670 [12:42:26] RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [12:42:28] !log ladsgroup@deploy1001 Synchronized wmf-config/Wikibase.php: SWAT: [[gerrit:488907|Set EntityUsageTable addUsage batch size to 300]], Part II (duration: 00m 54s) [12:42:30] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:30] RECOVERY - MariaDB Slave Lag: s7 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 233.52 seconds [12:42:41] !log EU SWAT is done [12:42:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:50] 10Operations, 10monitoring: icinga1001 crashed - https://phabricator.wikimedia.org/T214760 (10Joe) p:05Triage→03High [12:45:59] (03PS1) 10Marostegui: dbstore.my.cnf: Make the slave IDEMPOTENT [puppet] - 10https://gerrit.wikimedia.org/r/488920 (https://phabricator.wikimedia.org/T213670) [12:46:38] 10Operations, 10Continuous-Integration-Infrastructure, 10Packaging: Upgrade jenkins-debian-glue to v0.20.0 - https://phabricator.wikimedia.org/T212774 (10jijiki) p:05Triage→03Normal [12:47:14] 10Operations, 10monitoring: WMF's Grafana installation does not follow Wikimedia's visual identity guidelines - https://phabricator.wikimedia.org/T214762 (10Joe) p:05Triage→03Low [12:47:34] 10Operations, 10PHP 7.2 support: PHP Fatal error: The UdpSocket to 127.0.0.1:10514 has been closed (from Monolog/SyslogUdp) - https://phabricator.wikimedia.org/T214734 (10Joe) p:05Triage→03High [12:47:38] 10Operations, 10Discovery, 10Discovery-Search: Create extra elasticsearch clusters in beta cluster - https://phabricator.wikimedia.org/T213940 (10jijiki) p:05Triage→03Normal [12:48:06] (03CR) 10Marostegui: "Just to confirm, this file is only used on dbstore1002:" [puppet] - 10https://gerrit.wikimedia.org/r/488920 (https://phabricator.wikimedia.org/T213670) (owner: 10Marostegui) [12:48:47] 10Operations, 10DNS, 10Domains, 10Traffic, 10HTTPS: Merge Wikipedia subdomains into one, to discourage censorship - https://phabricator.wikimedia.org/T215071 (10jijiki) p:05Triage→03Normal [12:49:17] 10Operations, 10Continuous-Integration-Config: CI errors not being displayed in console log - https://phabricator.wikimedia.org/T214726 (10jijiki) p:05Triage→03Normal [12:50:54] 10Operations, 10MobileFrontend, 10Traffic, 
10Readers-Web-Backlog (Tracking): Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (10Krenair) (People interested in merging subdomains may also be interested in {T215071} which is about mergi... [12:57:54] 10Operations, 10puppet-compiler: puppet: compiler-update-facts error and warning - https://phabricator.wikimedia.org/T214472 (10jijiki) p:05Triage→03Normal [13:00:05] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190207T1300) [13:00:43] 10Operations, 10monitoring, 10Patch-For-Review: EDAC events not being reported by node-exporter? - https://phabricator.wikimedia.org/T214529 (10jijiki) p:05Triage→03Normal [13:02:47] 10Operations, 10DNS, 10Traffic, 10fundraising-tech-ops: remove IBM/Silverpop 1024-bit domain key - https://phabricator.wikimedia.org/T214525 (10jijiki) p:05Triage→03Normal [13:04:03] 10Operations, 10Wiki-Loves-Love, 10Wikimedia-Mailing-lists, 10User-jijiki: Reset password for wll mailling list - https://phabricator.wikimedia.org/T215390 (10jijiki) a:05Joe→03jijiki [13:05:08] RECOVERY - MariaDB Slave Lag: x1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 197.71 seconds [13:10:43] 10Operations, 10DNS, 10Domains, 10Traffic, 10HTTPS: Merge Wikipedia subdomains into one, to discourage censorship - https://phabricator.wikimedia.org/T215071 (10BBlack) The linked ESNI ticket is kind of a random user question ticket, and not actually one created for working on it (which still off in the... 
[13:13:03] 10Operations, 10Cloud-Services, 10Patch-For-Review: rack/setup/install cloudcontrol2001-dev & cloudvirt200[123]-dev - https://phabricator.wikimedia.org/T214448 (10aborrero) I'm seeing this in cloudvirt2003-dev: ` [ 13.270987] kvm: disabled by bios [ 13.729525] kvm: disabled by bios ` [13:18:32] 10Operations, 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Patch-For-Review: dbstore1002 Mysql errors - https://phabricator.wikimedia.org/T213670 (10JAllemandou) sqoop for actor and comment tables just finished and we should use the new hardware next month, ,so no problem fir me either :) [13:24:15] 10Operations, 10Wikimedia-General-or-Unknown, 10serviceops, 10PHP 7.2 support, 10User-jijiki: mwscript dies on mwmaint with PHP=php7.2 due to php-redis missing - https://phabricator.wikimedia.org/T215376 (10Krenair) >>! In T215376#4932704, @Dzahn wrote: >>>! In T215376#4932577, @Reedy wrote: >> In `modul... [13:25:04] 10Operations, 10monitoring, 10Patch-For-Review: EDAC events not being reported by node-exporter? - https://phabricator.wikimedia.org/T214529 (10CDanis) a:03CDanis [13:33:57] (03PS1) 10Arturo Borrero Gonzalez: hiera: cloudvirt200[1-3]-dev: fix again instance_dev hiera key [puppet] - 10https://gerrit.wikimedia.org/r/488926 (https://phabricator.wikimedia.org/T214448) [13:34:35] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] hiera: cloudvirt200[1-3]-dev: fix again instance_dev hiera key [puppet] - 10https://gerrit.wikimedia.org/r/488926 (https://phabricator.wikimedia.org/T214448) (owner: 10Arturo Borrero Gonzalez) [13:35:30] 10Operations, 10MediaWiki-Debug-Logger, 10Wikimedia-Logstash, 10monitoring: MediaWiki logging & encryption - https://phabricator.wikimedia.org/T126989 (10fgiunchedi) Status update: mw logs that were going to logstash in plaintext now are being sent via localhost -> rsyslog -> kafka -> logstash and the netw... 
[13:41:12] 10Operations, 10monitoring, 10User-fgiunchedi: prometheus on bast3002 misbehaving - https://phabricator.wikimedia.org/T192610 (10fgiunchedi) 05Open→03Invalid We haven't seen this reoccurring afaik, also we're upgrading to Prometheus 2.6, tentatively resolving. [13:41:25] 10Operations, 10DNS, 10Domains, 10Traffic, 10HTTPS: Merge Wikipedia subdomains into one, to discourage censorship - https://phabricator.wikimedia.org/T215071 (10BBlack) p:05Normal→03Low Expounding on the lamentations above in a more realistic triage sort of sense: * It's a very complex project which... [13:43:36] 10Operations, 10Goal, 10User-Elukey, 10User-fgiunchedi: Export Prometheus-compatible JVM metrics from JVMs in production - https://phabricator.wikimedia.org/T177197 (10fgiunchedi) [13:44:53] 10Operations, 10media-storage, 10User-fgiunchedi: Track down the source of periodic increases in requests to swift eqiad - https://phabricator.wikimedia.org/T173721 (10fgiunchedi) 05Open→03Resolved Turns out the spikes are varnish upload backends periodic restarts, thus expected. [13:45:09] 10Operations, 10Cloud-Services, 10Patch-For-Review: rack/setup/install cloudcontrol2001-dev & cloudvirt200[123]-dev - https://phabricator.wikimedia.org/T214448 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by aborrero on cumin2001.codfw.wmnet for hosts: ` ['cloudvirt2003-dev.codfw.wmnet'] ` The... [13:46:24] PROBLEM - Check systemd state on ms-be2034 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[13:54:36] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Introduce systemd::slice::all_users [puppet] - 10https://gerrit.wikimedia.org/r/488077 (https://phabricator.wikimedia.org/T212824) (owner: 10Elukey) [13:54:49] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "> Arturo: let's see if Moritz has any comment about this approach and" [puppet] - 10https://gerrit.wikimedia.org/r/488077 (https://phabricator.wikimedia.org/T212824) (owner: 10Elukey) [13:56:09] (03PS1) 10Milimetric: Separate logfile for production sqoop [puppet] - 10https://gerrit.wikimedia.org/r/488928 [13:58:08] (03CR) 10Joal: [C: 03+1] "Works for me - Thanks Dan" [puppet] - 10https://gerrit.wikimedia.org/r/488928 (owner: 10Milimetric) [14:00:04] Deploy window MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190207T1400) [14:01:29] (03CR) 10Elukey: [C: 03+2] Separate logfile for production sqoop [puppet] - 10https://gerrit.wikimedia.org/r/488928 (owner: 10Milimetric) [14:08:11] (03CR) 10Jcrespo: [C: 03+1] dbstore.my.cnf: Make the slave IDEMPOTENT [puppet] - 10https://gerrit.wikimedia.org/r/488920 (https://phabricator.wikimedia.org/T213670) (owner: 10Marostegui) [14:11:34] 10Operations, 10Cloud-Services, 10Patch-For-Review: rack/setup/install cloudcontrol2001-dev & cloudvirt200[123]-dev - https://phabricator.wikimedia.org/T214448 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudvirt2003-dev.codfw.wmnet'] ` and were **ALL** successful. 
[14:12:02] 10Operations, 10monitoring: Expose linux kernel firewall and connections statistics - https://phabricator.wikimedia.org/T215277 (10jbond) p:05Triage→03Normal [14:12:24] 10Operations, 10Puppet, 10Continuous-Integration-Config, 10Patch-For-Review: Improve CI checks to cover more of the code base - https://phabricator.wikimedia.org/T215275 (10jbond) p:05Triage→03Normal [14:12:34] 10Operations, 10Puppet: Audit /etc/apt directories - https://phabricator.wikimedia.org/T214605 (10jbond) p:05Triage→03Low [14:25:13] 10Operations, 10Analytics, 10Research, 10serviceops, and 4 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10akosiaris) How is the data going to make it from Hadoop, which resides in the analytics cluster and is firewalled at the router level... [14:28:58] 10Operations, 10Analytics, 10Research, 10serviceops, and 4 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10Ottomata) > How is the data going to make it from Hadoop, which resides in the analytics cluster and is firewalled at the router level... [14:33:41] 10Operations, 10Analytics, 10Research, 10serviceops, and 4 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10bmansurov) >>! In T213566#4934832, @akosiaris wrote: > Is it just a `LOAD DATA INFILE "something.tsv"` or is it something more complex... 
[14:34:16] (03PS2) 10Marostegui: dbstore.my.cnf: Make the slave IDEMPOTENT [puppet] - 10https://gerrit.wikimedia.org/r/488920 (https://phabricator.wikimedia.org/T213670) [14:34:27] !log deploying security updates for libgd3 [14:34:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:54] (03CR) 10Marostegui: [C: 03+2] dbstore.my.cnf: Make the slave IDEMPOTENT [puppet] - 10https://gerrit.wikimedia.org/r/488920 (https://phabricator.wikimedia.org/T213670) (owner: 10Marostegui) [14:36:03] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1101 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488931 [14:37:41] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Increase traffic for db1101 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488931 (owner: 10Marostegui) [14:38:29] 10Operations, 10Research, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to researchers group for phuedx - https://phabricator.wikimedia.org/T214957 (10phuedx) Thanks, @Dzahn, @elukey, @Joe, and @Nuria! I apologise for not including an approver on the task. I wasn't actually sure who should... 
[14:38:46] (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for db1101 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488931 (owner: 10Marostegui) [14:39:30] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1101 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488931 (owner: 10Marostegui) [14:39:52] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1101 after alter and mysql upgrade (duration: 00m 55s) [14:39:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:00] (03PS1) 10Jcrespo: mariadb: Depool db1085 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488932 [14:41:14] 10Operations, 10Analytics, 10Research, 10serviceops, and 4 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10akosiaris) >>! In T213566#4934835, @Ottomata wrote: >> How is the data going to make it from Hadoop, which resides in the analytics cl... [14:42:29] (03CR) 10Marostegui: [C: 03+1] mariadb: Depool db1085 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488932 (owner: 10Jcrespo) [14:44:29] RECOVERY - Check systemd state on ms-be2034 is OK: OK - running: The system is fully operational [14:44:30] 10Operations, 10Analytics, 10Research, 10serviceops, and 4 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10bmansurov) > That does look simple enough and not resource expensive on mwmaint1002. I guess it can fit in there as well? But a VM is... 
[14:46:39] (PS1) Marostegui: db-eqiad.php: Fully repool db1101:3317,3318 [mediawiki-config] - https://gerrit.wikimedia.org/r/488934 [14:51:48] jouncebot: next [14:51:48] In 2 hour(s) and 8 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190207T1700) [14:56:41] Operations, Analytics, Research, serviceops, and 4 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (Ottomata) > they will also not allow them to send the SYN/ACK packet required for the second (of the three) phase of the TCP handshake... [14:59:13] (CR) Marostegui: [C: +2] db-eqiad.php: Fully repool db1101:3317,3318 [mediawiki-config] - https://gerrit.wikimedia.org/r/488934 (owner: Marostegui) [15:00:19] (Merged) jenkins-bot: db-eqiad.php: Fully repool db1101:3317,3318 [mediawiki-config] - https://gerrit.wikimedia.org/r/488934 (owner: Marostegui) [15:01:36] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1101 (duration: 00m 55s) [15:01:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:49] (CR) jenkins-bot: db-eqiad.php: Fully repool db1101:3317,3318 [mediawiki-config] - https://gerrit.wikimedia.org/r/488934 (owner: Marostegui) [15:07:15] !log anomie@mwmaint1002 Fixing log_search after migrateActors.php on test wikis and mediawikiwiki for T215464. This may cause lag in codfw. [15:07:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:20] T215464: Oversighters can no longer see suppressed contributions past a certain date when using the offender parameter - https://phabricator.wikimedia.org/T215464 [15:07:28] Operations, MediaWiki-Cache, serviceops, Core Platform Team (Security, stability, performance and scalability (TEC1)), and 2 others: Use a multi-dc aware store for ObjectCache's MainStash if needed. 
- https://phabricator.wikimedia.org/T212129 (EvanProdromou) So, re-reading https://phabricator.wik... [15:14:28] (PS6) Gehel: mwgrep: Query all search clusters [puppet] - https://gerrit.wikimedia.org/r/487924 (https://phabricator.wikimedia.org/T215199) (owner: EBernhardson) [15:15:50] (CR) Gehel: [C: +2] mwgrep: Query all search clusters [puppet] - https://gerrit.wikimedia.org/r/487924 (https://phabricator.wikimedia.org/T215199) (owner: EBernhardson) [15:16:18] !log anomie@mwmaint1002 Fixing log_search after migrateActors.php on section 1 wikis for T215464. This may cause lag in codfw. [15:16:18] !log anomie@mwmaint1002 Fixing log_search after migrateActors.php on section 2 wikis for T215464. This may cause lag in codfw. [15:16:18] !log anomie@mwmaint1002 Fixing log_search after migrateActors.php on remaining section 3 wikis for T215464. This may cause lag in codfw. [15:16:18] !log anomie@mwmaint1002 Fixing log_search after migrateActors.php on section 4 wikis for T215464. This may cause lag in codfw. [15:16:18] !log anomie@mwmaint1002 Fixing log_search after migrateActors.php on section 5 wikis for T215464. This may cause lag in codfw. [15:16:18] !log anomie@mwmaint1002 Fixing log_search after migrateActors.php on section 6 wikis for T215464. This may cause lag in codfw. [15:16:19] !log anomie@mwmaint1002 Fixing log_search after migrateActors.php on section 7 wikis for T215464. This may cause lag in codfw. [15:16:19] !log anomie@mwmaint1002 Fixing log_search after migrateActors.php on section 8 wikis for T215464. This may cause lag in codfw. [15:16:20] !log anomie@mwmaint1002 Fixing log_search after migrateActors.php on wikitech for T215464. This may cause lag in codfw. 
[15:16:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:21] T215464: Oversighters can no longer see suppressed contributions past a certain date when using the offender parameter - https://phabricator.wikimedia.org/T215464 [15:16:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:59] Operations, Wikimedia-General-or-Unknown, serviceops, PHP 7.2 support, User-jijiki: mwscript dies on mwmaint with PHP=php7.2 due to php-redis missing - https://phabricator.wikimedia.org/T215376 (Dzahn) If we use "present" (and not a specific version or "latest" either) we would get whatever t... [15:20:28] (PS2) Jcrespo: mariadb: Depool db1085 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/488932 [15:23:52] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:24:54] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:32:18] (PS3) Gehel: icinga: enable check for psi and omega clusters [puppet] - https://gerrit.wikimedia.org/r/488485 (https://phabricator.wikimedia.org/T212850) (owner: Mathew.onipe) [15:32:19] Operations, SRE-Access-Requests: Requesting access to deployment, contint-admins, and contint-docker for Brennen Bearnes - https://phabricator.wikimedia.org/T215328 (brennen) Hi @Joe - > Please read and sign https://phabricator.wikimedia.org/L3 if you didn't do it already Read and signed. > Confirm... [15:34:08] (CR) Gehel: [C: +2] icinga: enable check for psi and omega clusters [puppet] - https://gerrit.wikimedia.org/r/488485 (https://phabricator.wikimedia.org/T212850) (owner: Mathew.onipe) [15:34:19] Operations, ops-codfw, decommission: Decom mw2213 - https://phabricator.wikimedia.org/T203434 (Papaul) [15:35:36] (CR) Jcrespo: [C: +2] mariadb: Depool db1085 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/488932 (owner: Jcrespo) [15:37:20] (Merged) jenkins-bot: mariadb: Depool db1085 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/488932 (owner: Jcrespo) [15:39:10] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1007 is CRITICAL: CRITICAL - elasticsearch https://10.64.0.37:9200/_cluster/health error while fetching: [SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:661) [15:39:10] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1005 is CRITICAL: CRITICAL - elasticsearch https://10.64.16.185:9200/_cluster/health error while fetching: [SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:661) [15:39:12] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1009 is CRITICAL: CRITICAL - elasticsearch https://10.64.32.27:9200/_cluster/health error while fetching: [SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:661) [15:39:19] (PS1) Gehel: Revert "icinga: 
enable check for psi and omega clusters" [puppet] - https://gerrit.wikimedia.org/r/488952 [15:39:38] PROBLEM - ElasticSearch health check for shards on 9200 on relforge1001 is CRITICAL: CRITICAL - elasticsearch https://10.64.4.13:9200/_cluster/health error while fetching: [SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:661) [15:39:44] PROBLEM - ElasticSearch health check for shards on 9200 on logstash2003 is CRITICAL: CRITICAL - elasticsearch https://10.192.48.131:9200/_cluster/health error while fetching: [SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:661) [15:39:44] PROBLEM - ElasticSearch health check for shards on 9200 on logstash2002 is CRITICAL: CRITICAL - elasticsearch https://10.192.32.180:9200/_cluster/health error while fetching: [SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:661) [15:39:58] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1004 is CRITICAL: CRITICAL - elasticsearch https://10.64.0.162:9200/_cluster/health error while fetching: [SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:661) [15:40:41] (CR) Gehel: [C: +2] Revert "icinga: enable check for psi and omega clusters" [puppet] - https://gerrit.wikimedia.org/r/488952 (owner: Gehel) [15:41:00] PROBLEM - ElasticSearch health check for shards on 9200 on logstash2001 is CRITICAL: CRITICAL - elasticsearch https://10.192.0.112:9200/_cluster/health error while fetching: [SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:661) [15:46:00] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1008 is CRITICAL: CRITICAL - elasticsearch https://10.64.0.90:9200/_cluster/health error while fetching: [SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:661) [15:46:27] RECOVERY - ElasticSearch health check for shards on 9200 on logstash2003 is OK: OK - elasticsearch status production-logstash-codfw: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, 
timed_out: False, active_primary_shards: 1, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-codfw, relocating_shards: 0, active_shards_percent_as_number: 100.0, a [15:46:27] initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards: 0 [15:46:27] RECOVERY - ElasticSearch health check for shards on 9200 on logstash2002 is OK: OK - elasticsearch status production-logstash-codfw: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 1, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-codfw, relocating_shards: 0, active_shards_percent_as_number: 100.0, a [15:46:27] initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards: 0 [15:46:29] (CR) jenkins-bot: mariadb: Depool db1085 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/488932 (owner: Jcrespo) [15:46:40] uh... [15:46:41] RECOVERY - ElasticSearch health check for shards on 9200 on logstash2001 is OK: OK - elasticsearch status production-logstash-codfw: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 1, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-codfw, relocating_shards: 0, active_shards_percent_as_number: 100.0, a [15:46:41] initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards: 0 [15:46:45] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1004 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 86, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 3, active_shards_percent_as_number: 100.0, [15:46:45] 2, initializing_shards: 0, 
number_of_data_nodes: 3, delayed_unassigned_shards: 0 [15:46:57] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1005 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 86, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 3, active_shards_percent_as_number: 100.0, [15:46:57] 2, initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards: 0 [15:46:57] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1007 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 86, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 3, active_shards_percent_as_number: 100.0, [15:46:57] 2, initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards: 0 [15:47:01] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1009 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 86, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 3, active_shards_percent_as_number: 100.0, [15:47:01] 2, initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards: 0 [15:47:15] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1008 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 86, task_max_waiting_in_queue_millis: 
0, cluster_name: production-logstash-eqiad, relocating_shards: 3, active_shards_percent_as_number: 100.0, [15:47:15] 2, initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards: 0 [15:51:05] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1085 (duration: 00m 58s) [15:51:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:09] RECOVERY - ElasticSearch health check for shards on 9200 on relforge1001 is OK: OK - elasticsearch status relforge-eqiad: status: green, number_of_nodes: 2, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 83, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0, active_shards: 104, in [15:53:09] : 0, number_of_data_nodes: 2, delayed_unassigned_shards: 0 [15:53:23] Operations, Wikimedia-Logstash, User-herron: Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs - https://phabricator.wikimedia.org/T213899 (fgiunchedi) I've ran an audit on producers that sent lo... [15:53:57] Operations, Maps, Reading-Infrastructure-Team-Backlog, Patch-For-Review: migrate maps servers to stretch with the current style - https://phabricator.wikimedia.org/T198622 (Gehel) [15:55:05] !log starting reimage of maps2004 - T198622 [15:55:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:09] T198622: migrate maps servers to stretch with the current style - https://phabricator.wikimedia.org/T198622 [15:56:01] (PS3) Gehel: maps: migrate maps2004 to stretch [puppet] - https://gerrit.wikimedia.org/r/487360 (https://phabricator.wikimedia.org/T198622) (owner: Mathew.onipe) [15:58:21] RECOVERY - Long running screen/tmux on prometheus2003 is OK: OK: No SCREEN or tmux processes detected. 
[15:58:33] Operations, ops-codfw, decommission: Decommission baham - https://phabricator.wikimedia.org/T199247 (Papaul) [15:59:52] (CR) Gehel: [C: +2] maps: migrate maps2004 to stretch [puppet] - https://gerrit.wikimedia.org/r/487360 (https://phabricator.wikimedia.org/T198622) (owner: Mathew.onipe) [16:00:09] PROBLEM - Long running screen/tmux on restbase1016 is CRITICAL: CRIT: Long running SCREEN process. (user: root PID: 37796, 1737147s 1728000s). [16:02:04] fixed ^ [16:03:29] !log restart db1085, temporary s6 lag on wikireplicas [16:03:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:38] Operations, Maps, Reading-Infrastructure-Team-Backlog, Patch-For-Review: migrate maps servers to stretch with the current style - https://phabricator.wikimedia.org/T198622 (Gehel) [16:05:53] PROBLEM - Long running screen/tmux on an-coord1001 is CRITICAL: CRIT: Long running SCREEN process. (user: otto PID: 26051, 2072360s 1728000s). [16:07:30] Operations, Maps, Reading-Infrastructure-Team-Backlog, Patch-For-Review: migrate maps servers to stretch with the current style - https://phabricator.wikimedia.org/T198622 (ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on cumin2001.codfw.wmnet for hosts: ` ['maps2004.codfw.wmn... [16:07:48] Operations, Performance-Team, Traffic, media-storage: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (ori) I don't understand the preference for sampling Swift requests rather than Varnish requests. You'd have greater resilience to overload (for the... [16:10:53] Operations, ops-eqsin, Traffic: Degraded RAID on cp5010 - https://phabricator.wikimedia.org/T214274 (Vgutierrez) since @ayounsi is going to eqsin datacenter later this month maybe we could join efforts and replace sdb. 
^^ @RobH [16:11:26] (PS1) Jcrespo: Revert "mariadb: Depool db1085 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/488956 [16:18:11] Operations, Wiki-Loves-Love, Wikimedia-Mailing-lists, User-jijiki: Reset password for wll mailling list - https://phabricator.wikimedia.org/T215390 (Psychoslave) Yes, the email is still valid, if you can also send me the email subject here once sent, that might help find it more quickly in case i... [16:25:27] PROBLEM - Host cloudcontrol1004 is DOWN: PING CRITICAL - Packet loss = 100% [16:25:56] that paged [16:26:19] is someone working on cloudcontrol1004? [16:26:22] yes [16:26:32] ok, didnt wanna assume =] [16:26:33] ACKNOWLEDGEMENT - Host cloudcontrol1004 is DOWN: PING CRITICAL - Packet loss = 100% GTirloni T215075 [16:27:18] Operations, ops-codfw, Patch-For-Review, User-fgiunchedi: ms-be2047 spontaneous reboots - https://phabricator.wikimedia.org/T209921 (Papaul) Old server has been shipped out. Shipping information below. {F28148277} [16:29:34] (PS5) AndyRussG: Give protect right to centralnoticeadmin on Meta [mediawiki-config] - https://gerrit.wikimedia.org/r/483044 (https://phabricator.wikimedia.org/T209873) [16:40:36] PROBLEM - Host cloudstore1008.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:41:09] presumably that's related? 
[16:44:28] Operations, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Kanban): cloudvirt1015: apparent hardware errors in CPU/Memory - https://phabricator.wikimedia.org/T215012 (aborrero) a:aborrero→RobH [16:45:29] Operations, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Kanban): cloudvirt1015: apparent hardware errors in CPU/Memory - https://phabricator.wikimedia.org/T215012 (aborrero) [16:46:09] Operations, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Kanban): cloudvirt1015: apparent hardware errors in CPU/Memory - https://phabricator.wikimedia.org/T215012 (aborrero) [16:46:19] Operations, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Kanban): cloudvirt1015: apparent hardware errors in CPU/Memory - https://phabricator.wikimedia.org/T215012 (RobH) ` root@cloudvirt1015.mgmt.eqiad.wmnet's password: /admin1-> racadm getsel Record: 1 Date/Time: 10/29/... [16:46:39] Operations, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Kanban): cloudvirt1015: apparent hardware errors in CPU/Memory - https://phabricator.wikimedia.org/T215012 (aborrero) [16:47:16] PROBLEM - Host mw1299 is DOWN: PING CRITICAL - Packet loss = 100% [16:49:12] RECOVERY - Host cloudstore1008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.12 ms [16:51:26] Operations, MediaWiki-Cache, MW-1.33-notes (1.33.0-wmf.16; 2019-02-05), Patch-For-Review, and 3 others: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (elukey) So group 1 has been deployed (that should... 
[16:52:46] Operations, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Kanban): cloudvirt1015: apparent hardware errors in CPU/Memory - https://phabricator.wikimedia.org/T215012 (RobH) [16:53:52] Operations, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Kanban): cloudvirt1015: apparent hardware errors in CPU/Memory - https://phabricator.wikimedia.org/T215012 (RobH) >>! In T215012#4924650, @Andrew wrote: > Since this host is empty we should rebuild it with Stretch before pu... [16:54:09] Operations, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Kanban): cloudvirt1015: apparent hardware errors in CPU/Memory - https://phabricator.wikimedia.org/T215012 (RobH) a:RobH→Cmjohnson [16:54:45] Operations, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Kanban): cloudvirt1015: apparent hardware errors in CPU/Memory - https://phabricator.wikimedia.org/T215012 (RobH) [16:55:00] Operations, ops-eqiad, DC-Ops, cloud-services-team (Kanban): cloudvirt1015: apparent hardware errors in CPU/Memory - https://phabricator.wikimedia.org/T215012 (RobH) [16:55:52] PROBLEM - Host cloudstore1009.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:56:33] Operations, ops-eqiad, DC-Ops, cloud-services-team (Kanban): cloudvirt1015: apparent hardware errors in CPU/Memory - https://phabricator.wikimedia.org/T215012 (RobH) [16:58:05] Operations, ops-eqiad, DC-Ops, cloud-services-team (Kanban): cloudvirt1015: apparent hardware errors in CPU/Memory - https://phabricator.wikimedia.org/T215012 (RobH) [16:58:07] (CR) Jcrespo: [C: +2] Revert "mariadb: Depool db1085 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/488956 (owner: Jcrespo) [16:59:16] (Merged) jenkins-bot: Revert "mariadb: Depool db1085 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/488956 (owner: Jcrespo) [16:59:24] Operations, ops-eqiad, DC-Ops, 
cloud-services-team (Kanban): cloudvirt1015: apparent hardware errors in CPU/Memory - https://phabricator.wikimedia.org/T215012 (Andrew) Note that this isn't the first time we've had issues with 1015: T171473 [17:00:04] godog and _joe_: How many deployers does it take to do Puppet SWAT(Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190207T1700). [17:00:04] No GERRIT patches in the queue for this window AFAICS. [17:01:10] \o/ [17:05:24] (CR) jenkins-bot: Revert "mariadb: Depool db1085 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/488956 (owner: Jcrespo) [17:06:38] RECOVERY - Host cloudstore1009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.15 ms [17:06:44] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1085 (duration: 03m 03s) [17:06:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:58] connect to host mw1299.eqiad.wmnet port 22: Connection timed out [17:10:16] wasn't that one powercycled recently? [17:10:45] jynus: yes, and it came back fine, but looks like it only lasted 5 hours... :( [17:25:34] Operations, vm-requests: Site: 1 VM request for recommender-systems - https://phabricator.wikimedia.org/T215421 (Dzahn) @Joe I recommended starting it, partially because i thought the outcome to use a VM was pretty likely and partially because actually listing what resources are needed might be a valuabl... [17:28:56] Operations, ops-eqiad: cloudstore100{8,9} - Upgrade to 10GbE - https://phabricator.wikimedia.org/T214079 (RobH) Ok, assisting in this I've done the following: * removed cloudstore100[89] from asw2-a-eqiad(ge-5/0/14 & ge-6/0/17) and cloudstore1009 from asw-a-eqiad:ge-6/0/17. ** removed the descriptions,... [17:31:57] RECOVERY - MegaRAID on db1073 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [17:32:42] cmjohnson1: ^ I assume you changed the disk? 
:-) [17:33:03] yes...sorry I got hung up with cloud stuff [17:33:17] Sure no worries! I will close the task - thank you! [17:33:26] thanks [17:34:08] Operations, ops-eqiad, DBA: Degraded RAID on db1073 - https://phabricator.wikimedia.org/T215050 (Marostegui) Open→Resolved Thanks @Cmjohnson for replacing disk #6! ` 17:31 <+icinga-wm> RECOVERY - MegaRAID on db1073 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy ` [17:37:00] Operations, ops-eqiad, Cloud-Services, cloud-services-team: cloudcontrol1004 mgmt HTTPS SSL error - https://phabricator.wikimedia.org/T215075 (Cmjohnson) I updated the f/w and bios with the SPP provided by HP The error did not resolve, I had to reset the rbsu to manufacturer settings and the err... [17:38:45] Operations, Proton, Security-Team, Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q3), Reading-Infrastructure-Team-Backlog (Kanban): [2 hrs] Decide on handling system updates for Proton - https://phabricator.wikimedia.org/T213366 (Jhernandez) [17:40:30] RECOVERY - Host cloudcontrol1004 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [17:43:45] (PS1) Bstorm: wiki replicas: Adding the ar_comment_id field to archive_userindex [puppet] - https://gerrit.wikimedia.org/r/488972 (https://phabricator.wikimedia.org/T212617) [17:43:47] (PS1) RobH: migrate cloudstore100[89] to row d dns change [dns] - https://gerrit.wikimedia.org/r/488973 (https://phabricator.wikimedia.org/T214079) [17:44:59] (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/481154 (https://phabricator.wikimedia.org/T150264) (owner: Faidon Liambotis) [17:46:02] (PS2) Bstorm: toolforge: shuffle some packages into and around genpp [puppet] - https://gerrit.wikimedia.org/r/488208 (https://phabricator.wikimedia.org/T210116) [17:47:18] (CR) Sbisson: [C: +1] GrowthExperiments: Enable search for help panel on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/488675 
(https://phabricator.wikimedia.org/T209301) (owner: Kosta Harlan) [17:47:32] (CR) Cmjohnson: [C: +1] "looks good to me" [dns] - https://gerrit.wikimedia.org/r/488973 (https://phabricator.wikimedia.org/T214079) (owner: RobH) [17:47:49] (CR) RobH: [C: +2] migrate cloudstore100[89] to row d dns change [dns] - https://gerrit.wikimedia.org/r/488973 (https://phabricator.wikimedia.org/T214079) (owner: RobH) [17:48:18] (CR) Bstorm: [C: +2] wiki replicas: Adding the ar_comment_id field to archive_userindex [puppet] - https://gerrit.wikimedia.org/r/488972 (https://phabricator.wikimedia.org/T212617) (owner: Bstorm) [17:48:53] PROBLEM - HHVM rendering on mw1272 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:50:03] RECOVERY - HHVM rendering on mw1272 is OK: HTTP OK: HTTP/1.1 200 OK - 75081 bytes in 0.116 second response time [17:52:22] Operations, MediaWiki-Cache, serviceops, Core Platform Team (Security, stability, performance and scalability (TEC1)), and 2 others: Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (EvanProdromou) One thing we'd need to make sure of is... [17:54:39] Operations, ops-eqiad, Cloud-Services, cloud-services-team: cloudcontrol1004 mgmt HTTPS SSL error - https://phabricator.wikimedia.org/T215075 (GTirloni) Server looks okay to me. Thanks @Cmjohnson [17:55:49] Operations, cloud-services-team, monitoring, Patch-For-Review, User-fgiunchedi: Upgrade Prometheus to 2.6 in deployment-prep and tools - https://phabricator.wikimedia.org/T215272 (fgiunchedi) Conversion of tools-prometheus-02 worked as expected, I've stopped v1, moved v1 metrics out of the wa... [18:00:04] cscott, arlolra, subbu, halfak, and Amir1: #bothumor My software never has bugs. It just develops random features. Rise for Services – Graphoid / Parsoid / Citoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190207T1800). 
[18:00:24] Operations, ops-eqiad, Analytics, Patch-For-Review, User-Elukey: rack/setup/install labsdb1012.eqiad.wmnet - https://phabricator.wikimedia.org/T215231 (elukey) Since this host is important for the Analytics team, I'd be up to take over from the OS install perspective to remove some work from... [18:05:07] (CR) Dzahn: [C: -1] mediawiki/scap: do not install sql scripts on canary appservers [puppet] - https://gerrit.wikimedia.org/r/479142 (https://phabricator.wikimedia.org/T211512) (owner: Dzahn) [18:11:05] Operations, LDAP-Access-Requests, User-Addshore, User-jijiki: Add "raz-shuty" to nda ldap group - https://phabricator.wikimedia.org/T214488 (RStallman-legalteam) Yes, both have signed. Please proceed. Thanks! [18:27:47] Operations, ops-eqiad, Cloud-Services, cloud-services-team: cloudcontrol1004 mgmt HTTPS SSL error - https://phabricator.wikimedia.org/T215075 (Cmjohnson) Open→Resolved [18:32:48] !log LDAP - adding raz-shuty to group nda (T214488) [18:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:52] T214488: Add "raz-shuty" to nda ldap group - https://phabricator.wikimedia.org/T214488 [18:34:50] Operations, LDAP-Access-Requests, User-Addshore, User-jijiki: Add "raz-shuty" to nda ldap group - https://phabricator.wikimedia.org/T214488 (Dzahn) Open→Resolved a:Dzahn @RazShuty @addshore done ! (Raz was already in other LDAP groups (wmde) so no code change needed in the admin modu... [18:35:13] Operations, ops-eqiad: Degraded RAID on cloudelastic1004 - https://phabricator.wikimedia.org/T215542 (ops-monitoring-bot) [18:38:45] Operations, ops-eqiad, DC-Ops, cloud-services-team (Kanban): cloudelastic1004: SMART/disk error - https://phabricator.wikimedia.org/T209029 (Cmjohnson) The disk has been replaced, @aborrero the OS will need to be re-installed. Until then the raid is out of whack because I removed /dev/sda. 
[18:39:04] (PS3) Bstorm: toolforge: shuffle some packages into and around genpp [puppet] - https://gerrit.wikimedia.org/r/488208 (https://phabricator.wikimedia.org/T210116) [18:40:30] (CR) Bstorm: [C: +2] toolforge: shuffle some packages into and around genpp [puppet] - https://gerrit.wikimedia.org/r/488208 (https://phabricator.wikimedia.org/T210116) (owner: Bstorm) [18:40:51] Operations, ops-eqiad, Patch-For-Review: cloudstore100{8,9} - Upgrade to 10GbE - https://phabricator.wikimedia.org/T214079 (Cmjohnson) a:RobH @RobH Can you do a re-install and hand off to cloud, please. I moved the servers to row D racks d2 and d7 I connected to 10G switch I changed bios boot cf... [18:41:49] Operations, ops-eqiad: Degraded RAID on cloudelastic1004 - https://phabricator.wikimedia.org/T215542 (Cmjohnson) Open→Invalid [18:43:56] Operations, ops-eqiad, Patch-For-Review: cloudstore100{8,9} - Upgrade to 10GbE - https://phabricator.wikimedia.org/T214079 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` cloudstore1008.wikimedia.org ` The log can be found in `/var/log/wmf-auto-... [18:44:26] Operations, ops-eqiad, Patch-For-Review: cloudstore100{8,9} - Upgrade to 10GbE - https://phabricator.wikimedia.org/T214079 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudstore1008.wikimedia.org'] ` Of which those **FAILED**: ` ['cloudstore1008.wikimedia.org'] ` [18:47:11] (PS6) Cwhite: role: add backwards-compatibility rules to prometheus [puppet] - https://gerrit.wikimedia.org/r/485889 (https://phabricator.wikimedia.org/T213708) [18:47:19] (Abandoned) Dzahn: admins: remove empty OIT admin group [puppet] - https://gerrit.wikimedia.org/r/488119 (owner: Dzahn) [18:56:30] PROBLEM - Check systemd state on cloudcontrol1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[18:57:55] (03PS1) 10RobH: update cloudstore100[89] mac addresses [puppet] - 10https://gerrit.wikimedia.org/r/488992 (https://phabricator.wikimedia.org/T214079) [18:59:08] (03CR) 10RobH: [C: 03+2] update cloudstore100[89] mac addresses [puppet] - 10https://gerrit.wikimedia.org/r/488992 (https://phabricator.wikimedia.org/T214079) (owner: 10RobH) [18:59:10] (03CR) 10Cwhite: [C: 03+2] role: add backwards-compatibility rules to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/485889 (https://phabricator.wikimedia.org/T213708) (owner: 10Cwhite) [18:59:42] (03PS7) 10Cwhite: role: add backwards-compatibility rules to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/485889 (https://phabricator.wikimedia.org/T213708) [19:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Morning SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190207T1900). [19:00:04] Zppix and kostajh: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:26] I'm here [19:00:48] 10Operations, 10ops-eqiad, 10Patch-For-Review: cloudstore100{8,9} - Upgrade to 10GbE - https://phabricator.wikimedia.org/T214079 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for cloudstore1008.wikimedia.org and performed the following actions: - Revoked Puppet certificate - Removed from... [19:01:22] 10Operations, 10ops-eqiad, 10Patch-For-Review: cloudstore100{8,9} - Upgrade to 10GbE - https://phabricator.wikimedia.org/T214079 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for cloudstore1009.wikimedia.org and performed the following actions: - Revoked Puppet certificate - Removed from... [19:09:23] Anyone around to do SWAT? 
[19:13:21] I'll SWAT [19:13:30] stephanebisson: thanks [19:15:13] (03PS2) 10Sbisson: GrowthExperiments: Enable search for help panel on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488675 (https://phabricator.wikimedia.org/T209301) (owner: 10Kosta Harlan) [19:15:28] (03CR) 10Sbisson: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488675 (https://phabricator.wikimedia.org/T209301) (owner: 10Kosta Harlan) [19:15:30] (03PS4) 10Dzahn: librenms/smokeping/rancid/netbox: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/486150 [19:16:49] (03Merged) 10jenkins-bot: GrowthExperiments: Enable search for help panel on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488675 (https://phabricator.wikimedia.org/T209301) (owner: 10Kosta Harlan) [19:17:41] stephanebisson: FYI, I just added something to the SWAT. [19:17:51] anomie: ok [19:19:39] (03CR) 10jenkins-bot: GrowthExperiments: Enable search for help panel on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488675 (https://phabricator.wikimedia.org/T209301) (owner: 10Kosta Harlan) [19:21:16] (03PS20) 10Cwhite: prometheus: upgrade to node-exporter 0.17 [puppet] - 10https://gerrit.wikimedia.org/r/486192 (https://phabricator.wikimedia.org/T213708) [19:23:35] 10Operations, 10ops-eqiad, 10Patch-For-Review: cloudstore100{8,9} - Upgrade to 10GbE - https://phabricator.wikimedia.org/T214079 (10RobH) a:05RobH→03GTirloni Ok, these are both reinstalled and ready for use/takeover. [19:23:44] kostajh: I've tested enabling on search, now syncing it. Do you want to test the ios scrolling issue? 
[19:24:16] stephanebisson: yes please [19:24:37] (03CR) 10Cwhite: [C: 03+2] prometheus: upgrade to node-exporter 0.17 [puppet] - 10https://gerrit.wikimedia.org/r/486192 (https://phabricator.wikimedia.org/T213708) (owner: 10Cwhite) [19:25:13] !log sbisson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:488675|GrowthExperiments: Enable search for help panel on testwiki]] (duration: 03m 04s) [19:25:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:01] !log sbisson@deploy1001 sync-file aborted: SWAT: [[gerrit:488675|GrowthExperiments: Enable search for help panel on testwiki]] (duration: 02m 22s) [19:28:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:16] kostajh: Your patch is now on mwdebug1002 [19:30:13] stephanebisson: ok, looking [19:30:21] FYI operation people: I encountered sync timeouts today https://phabricator.wikimedia.org/P8060 [19:30:51] (03CR) 10Dzahn: [C: 04-1] "parameter 'admins' expects a Boolean value, got String" [puppet] - 10https://gerrit.wikimedia.org/r/486150 (owner: 10Dzahn) [19:31:35] (03PS3) 10Herron: lists:warn if unknown host issues mail from cmd containing our domain [puppet] - 10https://gerrit.wikimedia.org/r/488602 (https://phabricator.wikimedia.org/T215251) [19:32:46] (03CR) 10BryanDavis: admin: create new system groups for cloudelastic nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/487040 (https://phabricator.wikimedia.org/T214922) (owner: 10Mathew.onipe) [19:36:05] stephanebisson: can add another to swat? [19:37:10] (i can deploy if you're already done) [19:37:34] we probably want to deploy it asap since it's a production error (https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikibaseLexeme/+/489000) [19:39:15] ebernhardson: sure. We're almost finish with kostajh's patches. You and anomie can discuss the relative priorities of your patches. 
[19:39:47] we can wait for anomie patches t be done [19:40:16] it's not *that* urgent (it's some empty searches erroring out but not anything on fire seriously) [19:42:30] stephanebisson: looks good, please merge [19:42:37] kostajh: syncing you patch now [19:43:01] anomie: You patch is next? Do you prefer to do it yourself? [19:43:12] stephanebisson: I can, but I'd rather be lazy ;) [19:43:27] (03PS1) 10Andrew Bogott: openstack: refactor 'envscript' bits into their own profile [puppet] - 10https://gerrit.wikimedia.org/r/489001 (https://phabricator.wikimedia.org/T215211) [19:43:45] anomie: I understand [19:43:49] I'll do it [19:45:01] !log sbisson@deploy1001 Synchronized php-1.33.0-wmf.16/extensions/GrowthExperiments/: SWAT: [[gerrit:488988|Help Panel: Fix iOS scroll bug]] (duration: 03m 02s) [19:45:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:50] mw1299.eqiad.wmnet is always timing out on scap-sync... Is it a problem? [19:47:07] (03PS5) 10Dzahn: librenms/smokeping/rancid/netbox: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/486150 [19:47:40] (03CR) 10jerkins-bot: [V: 04-1] librenms/smokeping/rancid/netbox: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/486150 (owner: 10Dzahn) [19:49:35] anomie: It looks like your change is going to fail the php71-docker job: https://integration.wikimedia.org/zuul/ [19:50:21] I'll proceed with SMalyshev 's patch [19:50:26] stephanebisson: Stupid flaky npm. [19:50:32] cool [19:50:33] (03CR) 10Andrew Bogott: [C: 03+2] openstack: refactor 'envscript' bits into their own profile [puppet] - 10https://gerrit.wikimedia.org/r/489001 (https://phabricator.wikimedia.org/T215211) (owner: 10Andrew Bogott) [19:51:05] 10Operations, 10Cloud-VPS, 10SRE-Access-Requests, 10cloud-services-team, and 2 others: Create cloudelastic-root group - https://phabricator.wikimedia.org/T214922 (10bd808) Related rights groups are wmcs-roots and wmcs-admin. 
Those 2 groups grant broader rights across Cloud Services bare metal instances (Op... [19:55:41] (03PS1) 10Andrew Bogott: openstack: include 'envscripts' on compute nodes [puppet] - 10https://gerrit.wikimedia.org/r/489005 (https://phabricator.wikimedia.org/T215211) [19:57:45] (03PS1) 10Herron: logstash: add input identifier tags to kafka logstash inputs [puppet] - 10https://gerrit.wikimedia.org/r/489006 (https://phabricator.wikimedia.org/T213899) [20:00:04] twentyafterfour: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Americas version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190207T2000). [20:00:17] RECOVERY - Long running screen/tmux on restbase1016 is OK: OK: No SCREEN or tmux processes detected. [20:00:47] (03CR) 10Herron: "PCC looks good https://puppet-compiler.wmflabs.org/compiler1002/14575/" [puppet] - 10https://gerrit.wikimedia.org/r/489006 (https://phabricator.wikimedia.org/T213899) (owner: 10Herron) [20:00:53] We still have 1 patch in progress in this SWAT window [20:01:03] (03PS2) 10Herron: logstash: add input identifier tags to kafka logstash inputs [puppet] - 10https://gerrit.wikimedia.org/r/489006 (https://phabricator.wikimedia.org/T213899) [20:01:15] You can do it Jenkins, come on [20:01:42] yeah these things are long... about 20 mins [20:02:05] (03CR) 10Herron: [C: 03+2] logstash: add input identifier tags to kafka logstash inputs [puppet] - 10https://gerrit.wikimedia.org/r/489006 (https://phabricator.wikimedia.org/T213899) (owner: 10Herron) [20:02:35] SMalyshev: Is you patch testable through a debug server? [20:02:48] stephanebisson: should be... 
[20:07:38] (03PS4) 10Mathew.onipe: admin: create new system groups for cloudelastic nodes [puppet] - 10https://gerrit.wikimedia.org/r/487040 (https://phabricator.wikimedia.org/T214922) [20:14:09] stephanebisson: ok CI is done [20:15:19] SMalyshev: your change should be on mwdebug1002 for you to test [20:15:29] great testing [20:16:15] stephanebisson: yep seems to be working just like it should [20:16:47] SMalyshev: deploying now [20:17:57] thanks! [20:19:38] !log sbisson@deploy1001 Synchronized php-1.33.0-wmf.16/extensions/WikibaseLexeme/src/DataAccess/Search/LexemeFulltextResult.php: SWAT: [[gerrit:489000|Fix fatal error - EmptySet does not exist anymore]] (duration: 03m 03s) [20:19:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:06] And that concludes SWAT for now. Sorry for the delay. [20:21:04] Just want to reiterate that syncing to mw1299.eqiad.wmnet has been timing out during this SWAT window. [20:31:39] (03PS1) 10Paladox: [wip] zuul: Convert to using scap [puppet] - 10https://gerrit.wikimedia.org/r/489012 [20:33:25] 10Operations, 10ops-eqsin, 10Traffic: Degraded RAID on cp5010 - https://phabricator.wikimedia.org/T214274 (10RobH) Ok, I opened a support request with dell to ship a replacement SSD to eqsin: Confirmed: Request 986142470 was successfully submitted. [20:35:24] (03PS6) 10Dzahn: librenms/smokeping/rancid/netbox: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/486150 [20:36:15] 10Operations, 10Analytics, 10Research, 10serviceops, and 4 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10Nuria) Ideally I would prefer that stats machines are completely out of the workflow of pushing data to machines like mwmaint1002.eqia... 
[20:38:00] 10Operations, 10Performance-Team, 10Traffic, 10media-storage: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10Gilles) The "initial ramp up" might not ever be done, if we reach a point where the writes and deletes introduced are creating too much overhead, we... [20:39:42] 10Operations, 10ops-codfw, 10decommission: Decom mw2213 - https://phabricator.wikimedia.org/T203434 (10Papaul) [20:42:26] 10Operations, 10Analytics, 10Discovery, 10Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (10Nuria) [20:43:14] 10Operations, 10Analytics, 10Research, 10serviceops, and 4 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10Nuria) As @Ottomata pointed out more generic discussion about this topic can be found here: https://phabricator.wikimedia.org/T213976 [20:44:43] (03CR) 10Ppchelko: [C: 03+1] mathoid: Remove mwapi_req/restbase_req [deployment-charts] - 10https://gerrit.wikimedia.org/r/488800 (owner: 10Alexandros Kosiaris) [20:55:35] !log train status: deploying 1.33.0-wmf.16 to group2 [20:55:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:22] (03CR) 10BryanDavis: "I understand why this is desired by the Community Tech team, but I'm not super excited about adding all of this bloat to every php7.2 Kube" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/488764 (https://phabricator.wikimedia.org/T213669) (owner: 10Samwilson) [20:57:25] (03PS1) 1020after4: group2 wikis to 1.33.0-wmf.16 refs T206670 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489017 [20:57:27] (03CR) 1020after4: [C: 03+2] group2 wikis to 1.33.0-wmf.16 refs T206670 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489017 (owner: 1020after4) [20:58:37] (03Merged) 10jenkins-bot: group2 wikis to 1.33.0-wmf.16 refs T206670 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/489017 (owner: 1020after4) [20:59:44] (03CR) 10Volans: [C: 03+1] "LGTM, this must be merged at the same time of I731669c28791005237418c36787d2eb42f4c3312 so that the next puppet run should do the right th" [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/487612 (owner: 10CRusnov) [21:00:01] (03CR) 10jenkins-bot: group2 wikis to 1.33.0-wmf.16 refs T206670 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489017 (owner: 1020after4) [21:00:06] (03CR) 10Volans: [C: 03+1] "LGTM, this must be merged together with I19ed3b30a71a11226447779055601463a2b43fd3" [puppet] - 10https://gerrit.wikimedia.org/r/488235 (owner: 10CRusnov) [21:05:17] 10Operations, 10Cloud-Services, 10Kubernetes: etcd config depends on puppet certs, but puppet doesn't know - https://phabricator.wikimedia.org/T169287 (10Bstorm) [21:05:20] 10Operations, 10ops-eqsin, 10Traffic: Degraded RAID on cp5010 - https://phabricator.wikimedia.org/T214274 (10RobH) Oh, just the output from troubleshooting on the system. The system should show TWO SSDs and only sees one now: ` robh@cp5010:~$ cat /proc/mdstat Personalities : [raid1] [linear] [multipath]... [21:06:56] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-herron: Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs - https://phabricator.wikimedia.org/T213899 (10herron) >>! In T213899#4935098, @... 
[21:09:51] (03CR) 10Jforrester: "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488836 (owner: 10Reedy) [21:11:28] (03CR) 10Reedy: "Seems the most sensible option longer term, yeah" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488836 (owner: 10Reedy) [21:12:00] 10Operations, 10CirrusSearch, 10serviceops, 10Discovery-Search (Current work), 10Patch-For-Review: Find an alternative to HHVM curl connection pooling for PHP 7 - https://phabricator.wikimedia.org/T210717 (10debt) Moving to #discovery-search-sprint waiting column to see if there is anything else we need... [21:12:27] (03CR) 10Volans: [C: 04-1] "Small detail inline, as discussed on IRC" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/486150 (owner: 10Dzahn) [21:12:48] 10Operations, 10ops-eqsin, 10Traffic: Degraded RAID on cp5010 - https://phabricator.wikimedia.org/T214274 (10Vgutierrez) that's right, the kernel shutdown sdb due to the errors, that's why is not even listed on lshw [21:15:30] (03PS9) 10Ottomata: Add kafka-dev chart for local development [deployment-charts] - 10https://gerrit.wikimedia.org/r/484498 (https://phabricator.wikimedia.org/T211247) [21:15:33] (03CR) 10Ottomata: Add kafka-dev chart for local development (038 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/484498 (https://phabricator.wikimedia.org/T211247) (owner: 10Ottomata) [21:18:02] (03PS1) 10Gilles: Set expiry headers on thumbnails [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/489022 (https://phabricator.wikimedia.org/T211661) [21:19:42] 10Operations, 10ops-eqsin, 10Traffic: Degraded RAID on cp5010 - https://phabricator.wikimedia.org/T214274 (10Vgutierrez) here is the log line: `Jan 21 01:39:21 cp5010 kernel: [7472184.163052] sd 1:0:0:0: [sdb] Stopping disk` [21:21:54] 10Operations, 10ops-ulsfo, 10Patch-For-Review: ulsfo: install new PDUs in racks / phase out APC loaner PDU use - https://phabricator.wikimedia.org/T209101 (10RobH) all PDUs in ulsfo are now 
properly mounted. The temp/humidity leads are plugged in, but not run anywhere until AFTER we get rid of the decom sys... [21:22:15] !log updating firmware on ps1-22-ulsfo via T209101 [21:22:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:18] T209101: ulsfo: install new PDUs in racks / phase out APC loaner PDU use - https://phabricator.wikimedia.org/T209101 [21:22:23] (03CR) 10Volans: [C: 04-1] "We have at least two other related changes that are needed:" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/487882 (https://phabricator.wikimedia.org/T215275) (owner: 10Jbond) [21:31:20] 10Operations, 10Release Pipeline, 10Patch-For-Review, 10Release-Engineering-Team (Backlog): Design pipeline image versioning scheme - https://phabricator.wikimedia.org/T209088 (10jeena) I think it would be useful to have a tag with the version or include it in one of the tags. The date only tells you what... [21:33:00] (03CR) 10Mobrovac: "Hmm, while I agree about simplifying things, these templates are loaded by the template code regardless. Even though Mathoid doesn't use t" [deployment-charts] - 10https://gerrit.wikimedia.org/r/488800 (owner: 10Alexandros Kosiaris) [21:33:46] (03CR) 10Niharika29: "Bryan, without this the SVG Translate tool is quite useless because most of the languages don't show up. What do you recommend we do?" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/488764 (https://phabricator.wikimedia.org/T213669) (owner: 10Samwilson) [21:38:41] 10Operations, 10ops-ulsfo, 10Patch-For-Review: ulsfo: install new PDUs in racks / phase out APC loaner PDU use - https://phabricator.wikimedia.org/T209101 (10RobH) Ok, while updating these, I've noticed that the power feeds in ulsfo are not balanced. Tower A is around 7 amps and tower B is around 2 amps for... 
[21:38:53] !log updating firmware on ps1-23-ulsfo via T209101 ps1-22-ulsfo update completed [21:38:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:57] T209101: ulsfo: install new PDUs in racks / phase out APC loaner PDU use - https://phabricator.wikimedia.org/T209101 [21:40:22] (03CR) 10Ottomata: add statsd_exporter config to mathoid (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/482718 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [21:41:13] (03PS2) 10Andrew Bogott: openstack: include 'envscripts' on compute nodes [puppet] - 10https://gerrit.wikimedia.org/r/489005 (https://phabricator.wikimedia.org/T215211) [21:42:08] (03CR) 10Andrew Bogott: [C: 03+2] openstack: include 'envscripts' on compute nodes [puppet] - 10https://gerrit.wikimedia.org/r/489005 (https://phabricator.wikimedia.org/T215211) (owner: 10Andrew Bogott) [21:43:27] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group2 wikis to 1.33.0-wmf.16 refs T206670 [21:43:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:30] T206670: 1.33.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T206670 [21:44:45] PROBLEM - HHVM rendering on mw1289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:47:13] RECOVERY - HHVM rendering on mw1289 is OK: HTTP OK: HTTP/1.1 200 OK - 75072 bytes in 0.957 second response time [21:50:42] (03PS2) 10Volans: Use standard version of plain-text GPL [cookbooks] - 10https://gerrit.wikimedia.org/r/460731 (owner: 10Legoktm) [21:51:08] hmm, there is a significant increase of 60 second timeouts after promoting group2 to wmf.16 [21:52:54] (03CR) 10Volans: [C: 03+2] Use standard version of plain-text GPL [cookbooks] - 10https://gerrit.wikimedia.org/r/460731 (owner: 10Legoktm) [21:54:39] (03Merged) 10jenkins-bot: Use standard version of plain-text GPL [cookbooks] - 10https://gerrit.wikimedia.org/r/460731 (owner: 10Legoktm) [21:57:24] 
meh looks transient. it's no longer possible to push out the train without a big flood of timeouts spamming the logs, or at least that seems to be the new normal [21:58:08] let's hope that is an HHVM warmup problem that php7 will fix [21:58:49] yeah I hope so [21:58:54] (03PS16) 10Ottomata: Helm chart for eventgate-analytics deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/483035 (https://phabricator.wikimedia.org/T211247) [21:59:28] the spike lasts for about 10 minutes, that's one hell of a warmup period [22:00:50] (03PS1) 10Reedy: Add a test to check dblists are sorted consistently [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489086 [22:01:33] (03CR) 10Ottomata: Helm chart for eventgate-analytics deployment (0320 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/483035 (https://phabricator.wikimedia.org/T211247) (owner: 10Ottomata) [22:01:41] (03CR) 10jerkins-bot: [V: 04-1] Add a test to check dblists are sorted consistently [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489086 (owner: 10Reedy) [22:15:36] 10Operations, 10monitoring, 10Patch-For-Review: EDAC events not being reported by node-exporter? - https://phabricator.wikimedia.org/T214529 (10CDanis) Talked some with @BBlack today, who observed that there are in fact a variety of drivers that back this stuff in the kernel, and that it's very possible we'r... 
[22:23:05] 10Operations, 10Wikimedia-Mailing-lists: Please create docker-sig@ mailing list - https://phabricator.wikimedia.org/T215563 (10greg) [22:30:04] (03PS7) 10Dzahn: librenms/smokeping/rancid/netbox: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/486150 [22:33:26] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/14577/" [puppet] - 10https://gerrit.wikimedia.org/r/486150 (owner: 10Dzahn) [22:34:19] (03CR) 10Dzahn: [C: 03+2] librenms/smokeping/rancid/netbox: add data types to parameters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/486150 (owner: 10Dzahn) [22:35:07] (03PS8) 10Dzahn: librenms/smokeping/rancid/netbox: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/486150 [22:41:06] (03PS2) 10Reedy: sort dblists... [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488836 [22:41:09] jouncebot: now [22:41:09] No deployments scheduled for the next 1 hour(s) and 18 minute(s) [22:41:10] jouncebot: next [22:41:10] In 1 hour(s) and 18 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190208T0000) [22:42:05] (03CR) 10Dzahn: "noop on all netmon servers" [puppet] - 10https://gerrit.wikimedia.org/r/486150 (owner: 10Dzahn) [22:42:43] (03CR) 10Reedy: [C: 03+2] sort dblists... [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488836 (owner: 10Reedy) [22:43:07] (03CR) 10Jforrester: [C: 03+1] sort dblists... [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488836 (owner: 10Reedy) [22:44:55] (03Merged) 10jenkins-bot: sort dblists... 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/488836 (owner: 10Reedy) [22:46:07] (03PS2) 10Reedy: Add a test to check dblists are sorted consistently [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489086 [22:48:37] !log reedy@deploy1001 Synchronized dblists/: alphasort dblists (duration: 02m 56s) [22:48:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:01] PROBLEM - puppet last run on an-worker1090 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:53:11] (03CR) 10jenkins-bot: sort dblists... [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488836 (owner: 10Reedy) [22:59:11] 10Operations, 10ops-ulsfo, 10Patch-For-Review: ulsfo: install new PDUs in racks / phase out APC loaner PDU use - https://phabricator.wikimedia.org/T209101 (10RobH) Ok, firmware updated and all power balanced. [23:00:27] (03PS3) 10Reedy: Add a test to check dblists are sorted consistently [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489086 [23:00:38] Krinkle: gpg: sending key 06670C4D66D17553 to hkps://hkps.pool.sks-keyservers.net [23:01:00] signed [23:01:29] (re: keysigning party that didn't happen at allhands) [23:03:10] i wanna make a gpg key joke but i dont have the heart to mock it. 
[23:05:59] mutante: okay [23:06:22] (03CR) 10Reedy: [C: 03+2] Add a test to check dblists are sorted consistently [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489086 (owner: 10Reedy) [23:07:25] (03Merged) 10jenkins-bot: Add a test to check dblists are sorted consistently [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489086 (owner: 10Reedy) [23:09:13] Reedy: thanks :) [23:13:00] 10Operations, 10ops-ulsfo: ulsfo: install new PDUs in racks / phase out APC loaner PDU use - https://phabricator.wikimedia.org/T209101 (10RobH) [23:13:12] 10Operations, 10ops-ulsfo: ulsfo: install new PDUs in racks / phase out APC loaner PDU use - https://phabricator.wikimedia.org/T209101 (10RobH) [23:14:33] Krinkle: Hm? [23:14:59] 10Operations, 10ops-ulsfo: ulsfo: setup ulsfo PDUs - https://phabricator.wikimedia.org/T209101 (10RobH) [23:15:01] RECOVERY - puppet last run on an-worker1090 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:16:39] (03CR) 10jenkins-bot: Add a test to check dblists are sorted consistently [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489086 (owner: 10Reedy) [23:17:10] Reedy: The sorting dblist test [23:17:12] (03CR) 10BryanDavis: "> Bryan, without this the SVG Translate tool is quite useless because" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/488764 (https://phabricator.wikimedia.org/T213669) (owner: 10Samwilson) [23:17:15] Aha [23:17:24] mutante: haven't received it yet btw, I guess it takes a while to replica. Want to e-mail? 
[23:17:37] (can send encrypted for my key) [23:17:46] PROBLEM - MariaDB Slave Lag: s4 on db2084 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 450.61 seconds [23:18:12] PROBLEM - MariaDB Slave Lag: s4 on db2091 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 465.49 seconds [23:18:14] PROBLEM - MariaDB Slave Lag: s4 on db2058 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 467.14 seconds [23:18:16] PROBLEM - MariaDB Slave Lag: s4 on db2090 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 469.17 seconds [23:18:34] PROBLEM - MariaDB Slave Lag: s4 on db2073 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 478.49 seconds [23:18:45] !log reedy@deploy1001 Synchronized README: must be up to date (duration: 02m 54s) [23:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:18:50] PROBLEM - MariaDB Slave Lag: s4 on db2065 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 488.08 seconds [23:18:56] PROBLEM - MariaDB Slave Lag: s4 on db2051 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 493.12 seconds [23:20:17] 10Operations, 10ops-eqiad: mw1299 is down - https://phabricator.wikimedia.org/T215569 (10Reedy) [23:21:06] Krinkle: yea, somehow it's always delayed a bit. mailed! [23:23:25] 10Operations, 10ops-eqiad: mw1299 is down - https://phabricator.wikimedia.org/T215569 (10Reedy) Depending what's up with it... 
It might want depooling and removing from the scap host lists [23:23:49] !log reedy@deploy1001 Synchronized tests/dblistTest.php: Sync test (duration: 02m 55s) [23:23:49] mutante: thx, got it [23:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:28:56] RECOVERY - MariaDB Slave Lag: s4 on db2091 is OK: OK slave_sql_lag Replication lag: 42.26 seconds [23:28:58] RECOVERY - MariaDB Slave Lag: s4 on db2058 is OK: OK slave_sql_lag Replication lag: 34.05 seconds [23:29:00] RECOVERY - MariaDB Slave Lag: s4 on db2090 is OK: OK slave_sql_lag Replication lag: 30.11 seconds [23:29:10] !log restart ps1-22-ulsfo [23:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:20] RECOVERY - MariaDB Slave Lag: s4 on db2073 is OK: OK slave_sql_lag Replication lag: 0.24 seconds [23:29:34] RECOVERY - MariaDB Slave Lag: s4 on db2084 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [23:29:40] RECOVERY - MariaDB Slave Lag: s4 on db2065 is OK: OK slave_sql_lag Replication lag: 0.43 seconds [23:29:46] RECOVERY - MariaDB Slave Lag: s4 on db2051 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [23:36:49] mutante: Krinkle wait, are we doing a keysigning party still? :) [23:37:18] greg-g: you can be next ^_^ [23:38:48] that's why i said it on channel instead of just PM basically :) [23:39:09] krinkle had given me a piece of paper [23:40:35] I just made https://people.wikimedia.org/~gjg/tmp/ksp-releng-20190129.txt for our team to do on monday in hangout [23:40:40] (03CR) 10EBernhardson: [C: 03+1] "labs will continue using hhvm pools until the next patch after which they will be un-pooled. Seems reasonable enough." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488895 (https://phabricator.wikimedia.org/T215491) (owner: 10DCausse) [23:40:57] greg-g: we can have a global signing party on hangout some day [23:41:05] I'm down [23:41:52] you know. this would have been a good activity for the icebreaker challenge.. 
bonus item if you get your bingo card signed with gpg [23:42:18] well, maybe for engineering [23:44:15] :) [23:48:58] (03CR) 10Samwilson: "> Run it as a webservice on the Debian Stretch job grid? We have all the fonts in that environment as far as I know." [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/488764 (https://phabricator.wikimedia.org/T213669) (owner: 10Samwilson) [23:49:46] I love saying random numbers letters to my computer in a coffee shop [23:49:55] they already think I'm weird here, so it's OK [23:50:09] (03PS2) 10CRusnov: Add reports element to reports path in netbox config [puppet] - 10https://gerrit.wikimedia.org/r/488235 [23:50:26] (03PS1) 10Dzahn: have CNAMEs for bastions in each DC, so numbers dont change for users [dns] - 10https://gerrit.wikimedia.org/r/489103 [23:50:34] greg-g: ^ [23:50:40] (03CR) 10jerkins-bot: [V: 04-1] have CNAMEs for bastions in each DC, so numbers dont change for users [dns] - 10https://gerrit.wikimedia.org/r/489103 (owner: 10Dzahn) [23:51:10] (03CR) 10CRusnov: [C: 03+2] Add reports element to reports path in netbox config [puppet] - 10https://gerrit.wikimedia.org/r/488235 (owner: 10CRusnov) [23:51:28] mutante: :) [23:51:49] RESULT: 0 Errors, 2223 Warnings :p [23:52:06] (03CR) 10CRusnov: [C: 03+2] Reorganize and add tox/CI support for repository. [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/487612 (owner: 10CRusnov)