[00:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: It is that lovely time of the day again! You are hereby commanded to deploy Evening SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190307T0000). [00:00:04] Smalyshev and AaronSchulz: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:04:21] I'm going to sneak one more tiny patch into SWAT if there is room [00:06:00] actually, nevermind it only matters that that patch makes it to beta cluster [00:08:34] heh [00:08:57] it's SDC related, and they don't have captions on commons yet :) [00:11:43] here [00:14:10] any takers for SWAT? [00:15:01] I'm waiting on a patch to merge [00:15:13] so not doing anything atm [00:16:38] SMalyshev: do you want to do those config patches yourself? [00:17:12] (03PS3) 10Smalyshev: Run WikibaseCirrusSearch code for search on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493629 (https://phabricator.wikimedia.org/T217276) [00:17:28] AaronSchulz: I can't do them myself.... 
I don't have permissions [00:17:49] heh, OK, I'll do them [00:18:32] (03CR) 10Aaron Schulz: [C: 03+2] Run WikibaseCirrusSearch code for search on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493629 (https://phabricator.wikimedia.org/T217276) (owner: 10Smalyshev) [00:21:25] (03Merged) 10jenkins-bot: Run WikibaseCirrusSearch code for search on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493629 (https://phabricator.wikimedia.org/T217276) (owner: 10Smalyshev) [00:23:03] (03PS2) 10Aaron Schulz: Enable loading WikibaseCirrusSearch (disabled) on production wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494632 (owner: 10Smalyshev) [00:23:48] !log aaron@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Run WikibaseCirrusSearch code for search on testwikidatawiki (duration: 00m 56s) [00:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:23:51] (03CR) 10Aaron Schulz: [C: 03+2] Enable loading WikibaseCirrusSearch (disabled) on production wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494632 (owner: 10Smalyshev) [00:24:02] (03PS9) 10CRusnov: Add system timer for running ganeti->netbox sync. [puppet] - 10https://gerrit.wikimedia.org/r/493774 (https://phabricator.wikimedia.org/T215229) [00:25:40] anyone running something on mwmaint1002? [00:25:52] (03Merged) 10jenkins-bot: Enable loading WikibaseCirrusSearch (disabled) on production wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494632 (owner: 10Smalyshev) [00:25:53] it's been spamming eval errors for 30 min or so [00:26:16] not me... 
[00:28:05] !log aaron@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable loading WikibaseCirrusSearch (disabled) on production wikis (duration: 00m 55s) [00:28:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:29:17] (03CR) 10jenkins-bot: Run WikibaseCirrusSearch code for search on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493629 (https://phabricator.wikimedia.org/T217276) (owner: 10Smalyshev) [00:29:19] (03CR) 10jenkins-bot: Enable loading WikibaseCirrusSearch (disabled) on production wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494632 (owner: 10Smalyshev) [00:34:22] (03PS10) 10CRusnov: Add system timer for running ganeti->netbox sync. [puppet] - 10https://gerrit.wikimedia.org/r/493774 (https://phabricator.wikimedia.org/T215229) [00:35:09] AaronSchulz: is everything deployed? [00:36:13] still waiting on my own core patch to merge, but nothing else [00:36:28] (nothing else left) [00:36:36] ok, my checks show whatever needs to be working is working, so thanks! [00:36:53] (03PS11) 10CRusnov: Add system timer for running ganeti->netbox sync. [puppet] - 10https://gerrit.wikimedia.org/r/493774 (https://phabricator.wikimedia.org/T215229) [00:42:08] (03CR) 10Smalyshev: icinga: add notes URLs to various monitoring checks, part 3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/494729 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [00:43:18] !log aaron@deploy1001 Synchronized php-1.33.0-wmf.20/includes/specials/SpecialActiveusers.php: f929e2a5069 (duration: 00m 56s) [00:43:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:47:37] !log aaron@deploy1001 Synchronized php-1.33.0-wmf.20/includes/specials/pagers/ActiveUsersPager.php: f929e2a5069 (duration: 00m 56s) [00:47:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:00:04] twentyafterfour: #bothumor Q:How do functions break up? 
A:They stop calling each other. Rise for Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190307T0100). [01:07:36] 10Operations, 10SRE-Access-Requests: Grant root on MediaWiki maintenance hosts to perf-roots - https://phabricator.wikimedia.org/T217813 (10aaron) [01:10:36] !log preparing phabricator upgrade [01:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:17:19] !log starting phabricator update to tag release/2019-03-07/1 - expect momentary downtime [01:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:19:10] !log phabricator update complete [01:19:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:49:40] (03PS1) 10Paladox: Add "multi-site" plugin so gerrit can have multi masters [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/494865 [01:50:15] (03CR) 10Paladox: [C: 04-2] "Should not be submitted until tested" [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/494865 (owner: 10Paladox) [03:28:16] (03PS2) 10KartikMistry: Enable edittag for ExternalGuidance in CX and VE [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494477 (https://phabricator.wikimedia.org/T216123) [04:00:04] kart_: My dear minions, it's time we take the moon! Just kidding. Time for deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190307T0400). 
[04:01:59] Yes [04:03:06] !log Started manual run of unpublished ContentTranslation draft purge script (T217310) [04:03:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:03:09] T217310: Run unpublished draft purge script for CX (Week of 03/03) - https://phabricator.wikimedia.org/T217310 [04:04:16] (03CR) 10KartikMistry: Enable edittag for ExternalGuidance in CX and VE (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494477 (https://phabricator.wikimedia.org/T216123) (owner: 10KartikMistry) [04:47:13] PROBLEM - Disk space on notebook1004 is CRITICAL: DISK CRITICAL - free space: /srv 4931 MB (3% inode=83%) [05:22:53] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [05:25:22] 10Operations, 10ExternalGuidance, 10Traffic, 10MW-1.33-notes (1.33.0-wmf.18; 2019-02-19), 10Patch-For-Review: Deliver mobile-based version for automatic translations - https://phabricator.wikimedia.org/T212197 (10santhosh) Confirmed that enwiki redirects to mobile version when accessed from Google transl... [05:37:23] PROBLEM - Disk space on elastic1017 is CRITICAL: DISK CRITICAL - free space: /srv 27399 MB (5% inode=99%) [05:46:59] RECOVERY - Disk space on elastic1017 is OK: DISK OK [05:47:36] 10Operations, 10Cloud-VPS, 10Toolforge, 10LDAP: groups: cannot find name for group ID - https://phabricator.wikimedia.org/T217280 (10bd808) Two new root emails for jobs that did not launch because of LDAP hiccups and two more queues in disabled state as a result: ` scheduling info: queue instanc... 
[05:57:00] 10Operations, 10Cloud-VPS, 10Toolforge, 10LDAP: LDAP server running out of memory frequently and disrupting Cloud VPS clients - https://phabricator.wikimedia.org/T217280 (10bd808) [05:58:44] (03PS1) 10Marostegui: db-eqiad.php: Depool db1121 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494868 [06:01:06] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1121 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494868 (owner: 10Marostegui) [06:02:26] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1121 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494868 (owner: 10Marostegui) [06:02:43] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1121 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494868 (owner: 10Marostegui) [06:03:45] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1121 (duration: 00m 57s) [06:03:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:03:48] !log Deploy schema change on db1121, this will generate lag on labsdb:s4 - T86342 [06:03:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:03:51] T86342: Dropping page.page_no_title_convert on wmf databases - https://phabricator.wikimedia.org/T86342 [06:30:21] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[06:35:25] 10Operations, 10Discovery, 10Elasticsearch, 10Icinga, and 2 others: Merge http and https elasticsearch icinga checks into one - https://phabricator.wikimedia.org/T215587 (10Mathew.onipe) [06:38:22] (03PS1) 10Mathew.onipe: icinga: merge https and http checks [puppet] - 10https://gerrit.wikimedia.org/r/494869 (https://phabricator.wikimedia.org/T215587) [06:38:55] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational [06:40:17] !log Finished manual run of unpublished ContentTranslation draft purge script (T217310) [06:40:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:27] T217310: Run unpublished draft purge script for CX (Week of 03/03) - https://phabricator.wikimedia.org/T217310 [06:42:39] (03CR) 10Mathew.onipe: "PCC output is Ok: https://puppet-compiler.wmflabs.org/compiler1002/15018/" [puppet] - 10https://gerrit.wikimedia.org/r/494869 (https://phabricator.wikimedia.org/T215587) (owner: 10Mathew.onipe) [06:50:27] RECOVERY - Disk space on notebook1004 is OK: DISK OK [06:50:49] RECOVERY - Disk space on notebook1003 is OK: DISK OK [06:53:01] cleaned up some stuff --^ [06:57:57] PROBLEM - MediaWiki memcached error rate on graphite1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [06:59:06] what [06:59:48] mcrouter returning tjos [06:59:51] *tkos [07:00:00] (03PS1) 10Marostegui: db-eqiad.php: Depool db1122 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494871 [07:00:17] yeah for mc1022 [07:00:23] RECOVERY - MediaWiki memcached error rate on graphite1004 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [07:00:27] so same issue that we have been trying to patch recently [07:01:24] yep just confirmed via 
https://grafana.wikimedia.org/d/000000317/memcache-slabs?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=memcached&var-instance=mc1022&var-slab=All [07:01:56] the long term fix is https://phabricator.wikimedia.org/T213802 [07:02:24] I'll send an email to ops to explain how to check this thing, mw error alerts might come again [07:02:27] sigh [07:10:19] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1122 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494871 (owner: 10Marostegui) [07:11:20] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1122 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494871 (owner: 10Marostegui) [07:12:31] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1122 for MySQL upgrade (duration: 00m 57s) [07:12:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:49] !log Stop MySQL on db1122 to upgrade [07:12:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:09] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494872 [07:18:13] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494872 [07:19:31] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1122 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494871 (owner: 10Marostegui) [07:19:41] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494872 (owner: 10Marostegui) [07:20:47] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494872 (owner: 10Marostegui) [07:20:59] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494872 (owner: 10Marostegui) [07:21:56] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1121 (duration: 00m 56s) 
[07:21:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:07] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1122 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494873 [07:26:22] 10Operations, 10ops-eqiad, 10Analytics, 10DBA, and 2 others: rack/setup/install labsdb1012.eqiad.wmnet - https://phabricator.wikimedia.org/T215231 (10Marostegui) @elukey the problem is that if we add it to the existing proxies, they'll be reachable by wikireplica users, as there is a round robin there, so... [07:26:40] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Slowly repool db1122 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494873 (owner: 10Marostegui) [07:27:39] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1122 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494873 (owner: 10Marostegui) [07:28:55] !log marostegui@deploy1001 sync-file aborted: Repool db1121 (duration: 00m 01s) [07:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:07] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1122 (duration: 00m 55s) [07:30:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:38] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1122 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494873 (owner: 10Marostegui) [07:39:01] (03PS1) 10Elukey: profile::labs::db::wikireplica: add special ferm rules for analytics [puppet] - 10https://gerrit.wikimedia.org/r/494874 (https://phabricator.wikimedia.org/T215231) [07:39:35] (03CR) 10jerkins-bot: [V: 04-1] profile::labs::db::wikireplica: add special ferm rules for analytics [puppet] - 10https://gerrit.wikimedia.org/r/494874 (https://phabricator.wikimedia.org/T215231) (owner: 10Elukey) [07:42:11] (03PS2) 10Elukey: profile::labs::db::wikireplica: add special ferm rules for analytics [puppet] - 10https://gerrit.wikimedia.org/r/494874 (https://phabricator.wikimedia.org/T215231) 
[07:43:20] (03PS1) 10Marostegui: db-eqiad.php: Repool db1122 in API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494875 [07:45:24] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/15019/" [puppet] - 10https://gerrit.wikimedia.org/r/494874 (https://phabricator.wikimedia.org/T215231) (owner: 10Elukey) [07:45:26] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Repool db1122 in API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494875 (owner: 10Marostegui) [07:46:01] (03CR) 10Elukey: [C: 04-1] "Nope I have added a change to the cdh module, sigh" [puppet] - 10https://gerrit.wikimedia.org/r/494874 (https://phabricator.wikimedia.org/T215231) (owner: 10Elukey) [07:46:13] (03CR) 10Marostegui: "an-coord1001.eqiad.wmnet is the only host that will use it?" [puppet] - 10https://gerrit.wikimedia.org/r/494874 (https://phabricator.wikimedia.org/T215231) (owner: 10Elukey) [07:46:27] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1122 in API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494875 (owner: 10Marostegui) [07:46:40] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1122 in API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494875 (owner: 10Marostegui) [07:46:47] (03PS3) 10Elukey: profile::labs::db::wikireplica: add special ferm rules for analytics [puppet] - 10https://gerrit.wikimedia.org/r/494874 (https://phabricator.wikimedia.org/T215231) [07:47:20] (03CR) 10Elukey: "> an-coord1001.eqiad.wmnet is the only host that will use it?" 
[puppet] - 10https://gerrit.wikimedia.org/r/494874 (https://phabricator.wikimedia.org/T215231) (owner: 10Elukey) [07:47:49] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: repool db1122 into API (duration: 00m 55s) [07:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:33] (03CR) 10Dzahn: icinga: add notes URLs to various monitoring checks, part 3 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/494729 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [07:48:58] (03PS4) 10Elukey: profile::labs::db::wikireplica: add special ferm rules for analytics [puppet] - 10https://gerrit.wikimedia.org/r/494874 (https://phabricator.wikimedia.org/T215231) [07:49:31] (03PS5) 10Dzahn: icinga: add notes URLs to various monitoring checks, part 2 [puppet] - 10https://gerrit.wikimedia.org/r/494511 [07:49:33] (03PS2) 10Dzahn: icinga: add notes URLs to various monitoring checks, part 3 [puppet] - 10https://gerrit.wikimedia.org/r/494729 (https://phabricator.wikimedia.org/T197873) [07:49:39] (03CR) 10Elukey: "Ok added also the stat boxes as well after a chat with my team. So the list is:" [puppet] - 10https://gerrit.wikimedia.org/r/494874 (https://phabricator.wikimedia.org/T215231) (owner: 10Elukey) [07:51:18] (03CR) 10Elukey: [C: 04-1] "And of course I forgot that all the hadoop workers will need to be able to contact this host.." [puppet] - 10https://gerrit.wikimedia.org/r/494874 (https://phabricator.wikimedia.org/T215231) (owner: 10Elukey) [07:52:15] marostegui: so I just realized that I'd need to allow all the hadoop workers to contact the new host [07:52:23] but I don't have a list in hiera that I can pull [07:56:14] (03CR) 10Jcrespo: [C: 04-1] "The idea is ok, but please don't touch the wikireplica profile or add conditional hiera keys if possible. 
I think it can be fully done by " [puppet] - 10https://gerrit.wikimedia.org/r/494874 (https://phabricator.wikimedia.org/T215231) (owner: 10Elukey) [07:57:56] (03PS9) 10Dzahn: Gerrit: Add icinga check to use healthcheck endpoint [puppet] - 10https://gerrit.wikimedia.org/r/489457 (https://phabricator.wikimedia.org/T215457) (owner: 10Paladox) [08:02:26] (03PS5) 10Elukey: profile::labs::db::wikireplica: add special ferm rules for analytics [puppet] - 10https://gerrit.wikimedia.org/r/494874 (https://phabricator.wikimedia.org/T215231) [08:02:28] (03CR) 10Elukey: "> The idea is ok, but please don't touch the wikireplica profile or" [puppet] - 10https://gerrit.wikimedia.org/r/494874 (https://phabricator.wikimedia.org/T215231) (owner: 10Elukey) [08:02:30] (03CR) 10Dzahn: "> I think we could have both for now and then once we're happy with this new one consider if having the other one is just redundant." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/489457 (https://phabricator.wikimedia.org/T215457) (owner: 10Paladox) [08:03:19] (03CR) 10Dzahn: [C: 03+2] Gerrit: Add icinga check to use healthcheck endpoint [puppet] - 10https://gerrit.wikimedia.org/r/489457 (https://phabricator.wikimedia.org/T215457) (owner: 10Paladox) [08:03:29] (03PS10) 10Dzahn: Gerrit: Add icinga check to use healthcheck endpoint [puppet] - 10https://gerrit.wikimedia.org/r/489457 (https://phabricator.wikimedia.org/T215457) (owner: 10Paladox) [08:04:46] (03CR) 10Dzahn: "it already uses the same LDAP server as labs does though" [puppet] - 10https://gerrit.wikimedia.org/r/494811 (owner: 10Paladox) [08:06:20] (03PS1) 10Marostegui: db-eqiad.php: Give more traffic to db1122 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494876 [08:06:36] (03CR) 10Muehlenhoff: [C: 03+1] Update file path to match debdeploy-restarts [puppet] - 10https://gerrit.wikimedia.org/r/494763 (owner: 10Jbond) [08:06:55] (03CR) 10Elukey: "> The idea is ok, but please don't touch the wikireplica profile or" [puppet] - 
10https://gerrit.wikimedia.org/r/494874 (https://phabricator.wikimedia.org/T215231) (owner: 10Elukey) [08:07:30] (03CR) 10Dzahn: [C: 04-1] Gerrit: Support switching ldap servers (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/494811 (owner: 10Paladox) [08:07:59] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Give more traffic to db1122 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494876 (owner: 10Marostegui) [08:08:57] (03Merged) 10jenkins-bot: db-eqiad.php: Give more traffic to db1122 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494876 (owner: 10Marostegui) [08:09:10] (03CR) 10jenkins-bot: db-eqiad.php: Give more traffic to db1122 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494876 (owner: 10Marostegui) [08:10:11] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1122 (duration: 00m 55s) [08:10:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:50] (03CR) 10Smalyshev: [C: 03+1] "lgtm for wdqs parts" [puppet] - 10https://gerrit.wikimedia.org/r/494729 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [08:12:03] 10Operations, 10Gerrit, 10Icinga, 10monitoring, and 3 others: gerrit: Add a icinga check that uses the healthcheck endpoint - https://phabricator.wikimedia.org/T215457 (10Dzahn) Amended the patch to use the regular check_https_url check command and to link to the full output at https://gerrit.wikimedia.org... 
[08:14:34] ACKNOWLEDGEMENT - Gerrit Health Check on gerrit.wikimedia.org is CRITICAL: (null) daniel_zahn new check just added https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [08:19:25] !log reloading icinga service [08:19:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:58] (03PS1) 10Dzahn: icinga/gerrit: add missing closing quote in health check [puppet] - 10https://gerrit.wikimedia.org/r/494877 (https://phabricator.wikimedia.org/T215457) [08:22:45] (03CR) 10Dzahn: [C: 03+2] icinga/gerrit: add missing closing quote in health check [puppet] - 10https://gerrit.wikimedia.org/r/494877 (https://phabricator.wikimedia.org/T215457) (owner: 10Dzahn) [08:22:55] (03PS2) 10Dzahn: icinga/gerrit: add missing closing quote in health check [puppet] - 10https://gerrit.wikimedia.org/r/494877 (https://phabricator.wikimedia.org/T215457) [08:35:11] 10Operations, 10Gerrit, 10Icinga, 10monitoring, and 3 others: gerrit: Add a icinga check that uses the healthcheck endpoint - https://phabricator.wikimedia.org/T215457 (10Dzahn) 05Open→03Resolved works now: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=gerrit.wikimedia.org&servic... [08:38:57] (03PS6) 10Elukey: Introduce role::labs::db::wikireplica_analytics::dedicated [puppet] - 10https://gerrit.wikimedia.org/r/494874 (https://phabricator.wikimedia.org/T215231) [08:39:12] ACKNOWLEDGEMENT - puppet last run on analytics-tool1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 21 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[init_superset] daniel_zahn https://phabricator.wikimedia.org/T217640#5005554 [08:39:12] ACKNOWLEDGEMENT - superset on analytics-tool1004 is CRITICAL: connect to address 10.64.36.116 and port 9080: Connection refused daniel_zahn https://phabricator.wikimedia.org/T217640#5005554 [08:39:28] elukey: superset is broken on the new server [08:39:52] (03CR) 10jerkins-bot: [V: 04-1] Introduce role::labs::db::wikireplica_analytics::dedicated [puppet] - 10https://gerrit.wikimedia.org/r/494874 (https://phabricator.wikimedia.org/T215231) (owner: 10Elukey) [08:40:34] (03PS7) 10Elukey: Introduce role::labs::db::wikireplica_analytics::dedicated [puppet] - 10https://gerrit.wikimedia.org/r/494874 (https://phabricator.wikimedia.org/T215231) [08:40:48] mutante: thanks! Sorry I meant to ack it, it needs a new deployment and I have to work on it :( [08:40:57] 10Operations, 10MobileFrontend, 10TechCom, 10Traffic, 10Readers-Web-Backlog (Tracking): Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (10Tbayer) >>! In T214998#5005391, @Krinkle wrote: >>>! In T214998#4929700, @Jdlrobson wrote: >>... 
[08:41:08] elukey: np, yep [08:41:26] (03CR) 10jerkins-bot: [V: 04-1] Introduce role::labs::db::wikireplica_analytics::dedicated [puppet] - 10https://gerrit.wikimedia.org/r/494874 (https://phabricator.wikimedia.org/T215231) (owner: 10Elukey) [08:43:49] (03PS8) 10Elukey: Introduce role::labs::db::wikireplica_analytics::dedicated [puppet] - 10https://gerrit.wikimedia.org/r/494874 (https://phabricator.wikimedia.org/T215231) [08:44:41] (03CR) 10jerkins-bot: [V: 04-1] Introduce role::labs::db::wikireplica_analytics::dedicated [puppet] - 10https://gerrit.wikimedia.org/r/494874 (https://phabricator.wikimedia.org/T215231) (owner: 10Elukey) [08:45:48] Sorry I am duplicating a role that already contains violations, some -1s spam will happen [08:45:52] 10Operations, 10Wikidata, 10Wikidata-Termbox-Hike, 10serviceops, and 4 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10WMDE-leszek) @akosiaris thanks for listing up information needed by SRE. This is very helpful. Before I add those to the task description,... [08:45:53] (03PS9) 10Elukey: Introduce role::labs::db::wikireplica_analytics::dedicated [puppet] - 10https://gerrit.wikimedia.org/r/494874 (https://phabricator.wikimedia.org/T215231) [08:46:42] (03CR) 10jerkins-bot: [V: 04-1] Introduce role::labs::db::wikireplica_analytics::dedicated [puppet] - 10https://gerrit.wikimedia.org/r/494874 (https://phabricator.wikimedia.org/T215231) (owner: 10Elukey) [08:48:12] (03CR) 10Vgutierrez: [C: 03+2] aptrepo: Get rid of the no longer needed component/kernel-proposed-updates [puppet] - 10https://gerrit.wikimedia.org/r/494690 (https://phabricator.wikimedia.org/T203194) (owner: 10Vgutierrez) [08:48:14] PROBLEM - puppet last run on labtestservices2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [08:48:20] (03PS2) 10Vgutierrez: aptrepo: Get rid of the no longer needed component/kernel-proposed-updates [puppet] - 10https://gerrit.wikimedia.org/r/494690 (https://phabricator.wikimedia.org/T203194) [08:49:05] (03PS10) 10Elukey: Introduce role::labs::db::wikireplica_analytics::dedicated [puppet] - 10https://gerrit.wikimedia.org/r/494874 (https://phabricator.wikimedia.org/T215231) [08:50:00] (03CR) 10jerkins-bot: [V: 04-1] Introduce role::labs::db::wikireplica_analytics::dedicated [puppet] - 10https://gerrit.wikimedia.org/r/494874 (https://phabricator.wikimedia.org/T215231) (owner: 10Elukey) [08:51:48] RECOVERY - Long running screen/tmux on snapshot1005 is OK: OK: No SCREEN or tmux processes detected. [08:52:56] 10Operations, 10ops-eqiad, 10Analytics, 10decommission, 10User-Elukey: Decommission analytics100[1,2] - https://phabricator.wikimedia.org/T205507 (10elukey) Proposal for fix: ` elukey@asw2-d-eqiad# show | compare [edit interfaces interface-range vlan-analytics1-d-eqiad] member ge-9/0/5 { ... } +... 
[08:53:17] (03PS1) 10Marostegui: db-eqiad.php: Fully repool db1122 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494878 [08:54:25] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Fully repool db1122 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494878 (owner: 10Marostegui) [08:54:45] !log depooled mw2151 - nutcracker failing [08:54:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:27] (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool db1122 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494878 (owner: 10Marostegui) [08:56:09] (03CR) 10Elukey: "Ready for review: https://puppet-compiler.wmflabs.org/compiler1002/15024/" [puppet] - 10https://gerrit.wikimedia.org/r/494874 (https://phabricator.wikimedia.org/T215231) (owner: 10Elukey) [08:57:10] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1122 (duration: 00m 55s) [08:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:50] RECOVERY - nutcracker port on mw2151 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [09:02:58] RECOVERY - Check systemd state on mw2151 is OK: OK - running: The system is fully operational [09:03:32] RECOVERY - nutcracker process on mw2151 is OK: PROCS OK: 1 process with UID = 112 (nutcracker), command name nutcracker [09:03:44] !log mw2151 - mkdir /var/run/nutcracker ; chown nutcracker:nutcracker /var/run/nutcracker ; systemctl start nutcracker - runs again - pooling server [09:03:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:52] 10Operations, 10MobileFrontend, 10TechCom, 10Traffic, 10Readers-Web-Backlog (Tracking): Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (10Tbayer) >>! In T214998#4929968, @tstarling wrote: > It complicates SEO in the sense that, whe... 
[09:04:28] (03CR) 10Gehel: [C: 03+1] "LGTM for WDQS" [puppet] - 10https://gerrit.wikimedia.org/r/494729 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [09:05:42] (03CR) 10jenkins-bot: db-eqiad.php: Fully repool db1122 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494878 (owner: 10Marostegui) [09:06:48] (03CR) 10Jcrespo: "May I ask for further documentation on the role, just a few lines at the beginning saying what a dedicated analytics replica is. We know n" [puppet] - 10https://gerrit.wikimedia.org/r/494874 (https://phabricator.wikimedia.org/T215231) (owner: 10Elukey) [09:09:24] (03PS3) 10Dzahn: icinga: add notes URLs to various monitoring checks, part 3 [puppet] - 10https://gerrit.wikimedia.org/r/494729 (https://phabricator.wikimedia.org/T197873) [09:10:38] (03PS1) 10Marostegui: mariadb: Remove mariadb::dbstore [puppet] - 10https://gerrit.wikimedia.org/r/494880 (https://phabricator.wikimedia.org/T216491) [09:10:59] (03CR) 10Dzahn: [C: 03+2] icinga: add notes URLs to various monitoring checks, part 3 [puppet] - 10https://gerrit.wikimedia.org/r/494729 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [09:13:02] (03CR) 10Marostegui: "https://puppet-compiler.wmflabs.org/compiler1002/15025/" [puppet] - 10https://gerrit.wikimedia.org/r/494880 (https://phabricator.wikimedia.org/T216491) (owner: 10Marostegui) [09:14:04] RECOVERY - Host analytics1044 is UP: PING OK - Packet loss = 0%, RTA = 36.12 ms [09:14:04] RECOVERY - Host analytics1045 is UP: PING OK - Packet loss = 0%, RTA = 36.35 ms [09:14:04] RECOVERY - Host analytics1042 is UP: PING OK - Packet loss = 0%, RTA = 36.24 ms [09:14:04] RECOVERY - Host analytics1043 is UP: PING OK - Packet loss = 0%, RTA = 37.67 ms [09:14:16] RECOVERY - puppet last run on labtestservices2001 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [09:14:31] \o/ [09:15:10] !log fixed vlan-analytics1-d-eqiad members on asw2-d-eqiad - T205507 [09:15:13] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:13] T205507: Decommission analytics100[1,2] - https://phabricator.wikimedia.org/T205507 [09:15:34] 10Operations, 10ops-codfw: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T217755 (10Marostegui) p:05Triage→03Normal [09:16:13] PROBLEM - puppet last run on analytics1043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:16:59] PROBLEM - puppet last run on analytics1044 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:16:59] PROBLEM - puppet last run on analytics1045 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:17:07] (03CR) 10Gehel: "super minor nitpick inline." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/494869 (https://phabricator.wikimedia.org/T215587) (owner: 10Mathew.onipe) [09:19:12] (03CR) 10Marostegui: "Thanks for clarifying." [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/493664 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [09:21:13] RECOVERY - puppet last run on analytics1043 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:21:19] !log upgrading mwdebug servers in codfw to component/php72 (T216712) [09:21:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:21] T216712: Switch PHP 7.2 packages to an internal component - https://phabricator.wikimedia.org/T216712 [09:22:01] RECOVERY - puppet last run on analytics1044 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:22:01] RECOVERY - puppet last run on analytics1045 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:24:24] (03CR) 10Marostegui: [C: 03+1] "dbstore1002 is gone, but as Jaime said, it would be good to run a puppet compiler check just to be sure it won't break anything on the db-" [puppet] - 
10https://gerrit.wikimedia.org/r/492321 (owner: 10Muehlenhoff) [09:28:09] (03CR) 10Jcrespo: [C: 03+1] "A few extra comments, there may be other changes to do, although not related to trusty." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/492321 (owner: 10Muehlenhoff) [09:34:24] (03PS1) 10Dzahn: doc.wikimedia.org: switch from PHP 7.2 back to standard stretch/7.0 [puppet] - 10https://gerrit.wikimedia.org/r/494884 [09:35:09] 10Operations, 10Traffic, 10Patch-For-Review: Make cp1099 the new pinkunicorn - https://phabricator.wikimedia.org/T202966 (10ema) >>! In T202966#5007017, @ayounsi wrote: > cp1099 is the last standing host between me and powering off asw-c-eqiad. > > From this task and the prompt `cp1099 is a Unpuppetised sys... [09:35:10] (03CR) 10Muehlenhoff: mariadb: Remove support for trusty (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/492321 (owner: 10Muehlenhoff) [09:37:24] !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: dc=eqiad,service=citoid,cluster=scb,name=scb.* [09:37:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:36] !log ramp up traffic to citoid kubernetes to 100% [09:37:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:50] !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: dc=codfw,service=citoid,cluster=scb,name=scb.* [09:37:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:17] 10Operations, 10Cloud-VPS, 10Toolforge, 10LDAP: LDAP server running out of memory frequently and disrupting Cloud VPS clients - https://phabricator.wikimedia.org/T217280 (10hashar) The issue is noticeable on the Jenkins job https://integration.wikimedia.org/ci/job/beta-scap-eqiad/ since it runs every 10 mi... 
[09:39:42] (03CR) 10Jcrespo: ":-P" (031 comment) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/493664 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [09:40:43] (03CR) 10Marostegui: [C: 03+1] mariadb: Refactor dump_section.py and rename to match functionality (031 comment) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/493664 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [09:41:18] (03CR) 10Jcrespo: [C: 03+1] mariadb: Remove mariadb::dbstore [puppet] - 10https://gerrit.wikimedia.org/r/494880 (https://phabricator.wikimedia.org/T216491) (owner: 10Marostegui) [09:41:39] (03PS2) 10Marostegui: mariadb: Remove mariadb::dbstore [puppet] - 10https://gerrit.wikimedia.org/r/494880 (https://phabricator.wikimedia.org/T216491) [09:42:56] (03CR) 10Marostegui: [C: 03+2] mariadb: Remove mariadb::dbstore [puppet] - 10https://gerrit.wikimedia.org/r/494880 (https://phabricator.wikimedia.org/T216491) (owner: 10Marostegui) [09:45:09] (03CR) 10Muehlenhoff: "Could you also remove it from the Cumin aliases:" [puppet] - 10https://gerrit.wikimedia.org/r/494880 (https://phabricator.wikimedia.org/T216491) (owner: 10Marostegui) [09:46:33] (03CR) 10Jcrespo: ">" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/492321 (owner: 10Muehlenhoff) [09:46:55] jouncebot: next [09:46:55] In 2 hour(s) and 13 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190307T1200) [09:47:55] !log upgrading mwdebug servers in eqiad to component/php72 (T216712) [09:47:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:58] T216712: Switch PHP 7.2 packages to an internal component - https://phabricator.wikimedia.org/T216712 [09:49:55] (03PS1) 10Dzahn: phabricator: switch PHP on stretch from thirdparty/php72 to component/php72 [puppet] - 10https://gerrit.wikimedia.org/r/494885 [09:51:08] (03PS1) 10Marostegui: aliases.yaml.erb: Remove dbstore [puppet] - 
10https://gerrit.wikimedia.org/r/494886 (https://phabricator.wikimedia.org/T216491) [09:57:01] (03CR) 10Muehlenhoff: [C: 03+1] "thanks" [puppet] - 10https://gerrit.wikimedia.org/r/494886 (https://phabricator.wikimedia.org/T216491) (owner: 10Marostegui) [09:57:20] (03CR) 10Marostegui: [C: 03+2] aliases.yaml.erb: Remove dbstore [puppet] - 10https://gerrit.wikimedia.org/r/494886 (https://phabricator.wikimedia.org/T216491) (owner: 10Marostegui) [09:58:56] (03PS1) 10Jforrester: [BETA] Initial configuration of depict property for WBMI on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494887 (https://phabricator.wikimedia.org/T217153) [10:06:42] 10Operations, 10Traffic: esams cache layer mangles downloads of specific url - https://phabricator.wikimedia.org/T215389 (10ema) [10:06:46] 10Operations, 10Traffic, 10Wikidata, 10Wikidata-Campsite, 10User-Addshore: Wikidata sometimes cuts off entity RDF - https://phabricator.wikimedia.org/T216006 (10ema) [10:08:02] 10Operations, 10Traffic, 10Wikidata, 10Wikidata-Campsite, 10User-Addshore: Some esams<->eqiad varnish backend connections closed by peer - https://phabricator.wikimedia.org/T216006 (10ema) [10:10:14] (03PS1) 10Marostegui: db-eqiad.php: Depool db1075 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494891 [10:11:48] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1075 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494891 (owner: 10Marostegui) [10:12:49] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1075 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494891 (owner: 10Marostegui) [10:13:26] !log upgrading mediawiki canaries to component/php72 (T216712) [10:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:29] T216712: Switch PHP 7.2 packages to an internal component - https://phabricator.wikimedia.org/T216712 [10:14:09] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1075 for schema change and mysql upgrade 
(duration: 00m 56s) [10:14:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:45] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1075 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494891 (owner: 10Marostegui) [10:14:51] (03PS1) 10Jforrester: Initial configuration of depict property for WBMI on Commons, TestCommons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494895 (https://phabricator.wikimedia.org/T217153) [10:14:54] (03PS1) 10Jforrester: WBMI: Stop using wgMediaInfoEnable, we're scrapping it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494896 [10:14:56] (03PS1) 10Jforrester: WBMI: Drop use of temporary wgMediaInfoEnable, being removed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494897 [10:15:30] (03CR) 10Nikerabbit: [C: 03+1] Enable edittag for ExternalGuidance in CX and VE [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494477 (https://phabricator.wikimedia.org/T216123) (owner: 10KartikMistry) [10:15:41] (03CR) 10jerkins-bot: [V: 04-1] Initial configuration of depict property for WBMI on Commons, TestCommons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494895 (https://phabricator.wikimedia.org/T217153) (owner: 10Jforrester) [10:15:51] (03CR) 10jerkins-bot: [V: 04-1] WBMI: Stop using wgMediaInfoEnable, we're scrapping it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494896 (owner: 10Jforrester) [10:16:07] (03CR) 10jerkins-bot: [V: 04-1] WBMI: Drop use of temporary wgMediaInfoEnable, being removed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494897 (owner: 10Jforrester) [10:17:16] (03CR) 10Matthias Mullie: [C: 03+1] [BETA] Initial configuration of depict property for WBMI on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494887 (https://phabricator.wikimedia.org/T217153) (owner: 10Jforrester) [10:19:11] 10Operations, 10Traffic, 10Wikidata, 10Wikidata-Campsite, 10User-Addshore: Some esams<->eqiad varnish backend connections closed by peer - 
https://phabricator.wikimedia.org/T216006 (10ema) Varnishlog of the varnish **backend** instance serving the request in esams reports the following: ` - ReqMethod... [10:26:40] I'm about to push out a beta-only config change. [10:26:48] (03CR) 10Jforrester: [C: 03+2] [BETA] Initial configuration of depict property for WBMI on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494887 (https://phabricator.wikimedia.org/T217153) (owner: 10Jforrester) [10:26:54] (03PS2) 10Jforrester: Initial configuration of depict property for WBMI on Commons, TestCommons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494895 (https://phabricator.wikimedia.org/T217153) [10:26:57] (03PS2) 10Jforrester: WBMI: Stop using wgMediaInfoEnable, we're scrapping it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494896 [10:26:59] (03PS2) 10Jforrester: WBMI: Drop use of temporary wgMediaInfoEnable, being removed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494897 [10:27:49] (03Merged) 10jenkins-bot: [BETA] Initial configuration of depict property for WBMI on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494887 (https://phabricator.wikimedia.org/T217153) (owner: 10Jforrester) [10:33:18] (03PS2) 10Jbond: Update file path to match debdeploy-restarts [puppet] - 10https://gerrit.wikimedia.org/r/494763 [10:34:42] (03PS2) 10Jbond: Add config file and exclude_mounts options to debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/494764 (https://phabricator.wikimedia.org/T217646) [10:34:56] !log restarting HHVM/Apache on mediawiki canaries to pick up OpenSSL security update [10:34:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:39] (03PS3) 10Jbond: Update file path to match debdeploy-restarts [puppet] - 10https://gerrit.wikimedia.org/r/494763 [10:37:46] (03CR) 10jenkins-bot: [BETA] Initial configuration of depict property for WBMI on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494887 
(https://phabricator.wikimedia.org/T217153) (owner: 10Jforrester) [10:38:49] (03CR) 10Jbond: Update file path to match debdeploy-restarts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/494763 (owner: 10Jbond) [10:52:38] 10Operations, 10SRE-Access-Requests: Grant root on MediaWiki maintenance hosts to perf-roots - https://phabricator.wikimedia.org/T217813 (10jbond) [10:54:17] 10Operations, 10SRE-Access-Requests: Grant root on MediaWiki maintenance hosts to perf-roots - https://phabricator.wikimedia.org/T217813 (10jbond) [10:54:17] (03PS1) 10Jcrespo: mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292) [10:54:45] 10Operations, 10SRE-Access-Requests: Grant root on MediaWiki maintenance hosts to perf-roots - https://phabricator.wikimedia.org/T217813 (10jbond) [10:54:52] (03CR) 10jerkins-bot: [V: 04-1] mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [10:57:56] (03PS2) 10Jcrespo: mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292) [10:58:30] !log upgrading seaborgium to Stretch (so it's running the same distro as serpens/codfw) [10:58:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:49] (03CR) 10jerkins-bot: [V: 04-1] mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [11:01:17] PROBLEM - High CPU load on API appserver on mw1313 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.16.194: Connection reset by peer [11:02:23] RECOVERY - High CPU load on API appserver on mw1313 is OK: OK - load average: 25.42, 19.68, 
16.20 [11:09:49] PROBLEM - puppet last run on mw1263 is CRITICAL: CRITICAL: Puppet has 15 failures. Last run 6 minutes ago with 15 failures. Failed resources (up to 3 shown): Package[php7.2-common],Package[php7.2-opcache],Package[php7.2-bcmath],Package[php7.2-bz2] [11:11:26] ^ that's me [11:11:28] (03CR) 10Dzahn: [C: 03+2] doc.wikimedia.org: switch from PHP 7.2 back to standard stretch/7.0 [puppet] - 10https://gerrit.wikimedia.org/r/494884 (owner: 10Dzahn) [11:11:37] (03PS2) 10Dzahn: doc.wikimedia.org: switch from PHP 7.2 back to standard stretch/7.0 [puppet] - 10https://gerrit.wikimedia.org/r/494884 [11:12:05] 10Operations, 10SRE-Access-Requests: Grant root on MediaWiki maintenance hosts to perf-roots - https://phabricator.wikimedia.org/T217813 (10jbond) @kchapman can you approve this request Thanks [11:13:06] 10Operations, 10SRE-Access-Requests: Grant root on MediaWiki maintenance hosts to perf-roots - https://phabricator.wikimedia.org/T217813 (10jbond) p:05Triage→03Normal [11:14:10] (03CR) 10Dzahn: "Notice: /Stage[main]/Apt/File[/etc/apt/sources.list.d/wikimedia-php72.list]/ensure: removed" [puppet] - 10https://gerrit.wikimedia.org/r/494884 (owner: 10Dzahn) [11:15:53] !log doc1001 - apt-get remove --purge php7.2* (the same packages with 7.0 were previously installed in parallel) [11:15:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:55] !log Stop MySQL on db1075 for upgrade [11:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:43] 10Operations, 10Cloud-VPS, 10Toolforge, 10LDAP: LDAP server running out of memory frequently and disrupting Cloud VPS clients - https://phabricator.wikimedia.org/T217280 (10hashar) I had a closer look at [[ https://grafana.wikimedia.org/d/000000181/openldap-labs?panelId=3&fullscreen&orgId=1&from=now-6h&to=... 
[11:18:28] !log doc.wikimedia.org down and being worked on - package downgrade exposed an issue [11:18:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:56] (03CR) 10Gehel: [C: 04-1] Add wdqs data transfer cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/488256 (https://phabricator.wikimedia.org/T213401) (owner: 10Mathew.onipe) [11:21:34] !log doc.wikimedia.org - back up, manually fixed path to php-fpm.sock to 7.0 - puppet disabled, fix coming [11:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:29] (03PS1) 10Dzahn: doc.wikimedia.org: adjust path to php-fpm socket to 7.0 [puppet] - 10https://gerrit.wikimedia.org/r/494903 [11:25:51] (03CR) 10Dzahn: "< mutante> !log doc1001 - apt-get remove --purge php7.2* (the same packages with 7.0 were previously installed in parallel by puppet)" [puppet] - 10https://gerrit.wikimedia.org/r/494884 (owner: 10Dzahn) [11:26:10] (03CR) 10Dzahn: [C: 03+2] doc.wikimedia.org: adjust path to php-fpm socket to 7.0 [puppet] - 10https://gerrit.wikimedia.org/r/494903 (owner: 10Dzahn) [11:26:51] (03CR) 10Muehlenhoff: Update file path to match debdeploy-restarts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/494763 (owner: 10Jbond) [11:28:30] !log updated seaborgium to stretch (T217280) [11:28:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:33] T217280: LDAP server running out of memory frequently and disrupting Cloud VPS clients - https://phabricator.wikimedia.org/T217280 [11:28:41] (03Abandoned) 10Muehlenhoff: Enable base::service_auto_restart for atftpd [puppet] - 10https://gerrit.wikimedia.org/r/424584 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:29:26] (03PS4) 10Muehlenhoff: Fix package name for ack-grep in buster [puppet] - 10https://gerrit.wikimedia.org/r/494684 [11:31:08] (03CR) 10Muehlenhoff: [C: 03+2] Fix package name for ack-grep in buster [puppet] - 
10https://gerrit.wikimedia.org/r/494684 (owner: 10Muehlenhoff) [11:35:45] RECOVERY - puppet last run on mw1263 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:38:12] (03PS2) 10Muehlenhoff: mariadb: Remove support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/492321 [11:44:00] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1075 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494909 [11:44:55] (03CR) 10jerkins-bot: [V: 04-1] db-eqiad.php: Slowly repool db1075 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494909 (owner: 10Marostegui) [11:45:37] !log temporarily disabled puppet on seaborgium/serpens to try slapd config changes [11:45:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:12] (03PS2) 10Marostegui: db-eqiad.php: Slowly repool db1075 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494909 [11:47:18] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Slowly repool db1075 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494909 (owner: 10Marostegui) [11:48:19] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1075 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494909 (owner: 10Marostegui) [11:49:36] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1075 after schema change and mysql upgrade (duration: 00m 56s) [11:49:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:05] (03CR) 10WMDE-Fisch: Set up exceptions for rollback confirmation (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494270 (https://phabricator.wikimedia.org/T217436) (owner: 10Tim Eulitz) [11:50:49] (03CR) 10Jbond: Update file path to match debdeploy-restarts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/494763 (owner: 10Jbond) [11:51:52] (03PS3) 10Jcrespo: mariadb: Remove support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/492321 (owner: 10Muehlenhoff) [11:52:28] mutante: got 
a minute? [11:55:00] (03PS4) 10Jcrespo: mariadb: Remove support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/492321 (owner: 10Muehlenhoff) [11:55:14] (03CR) 10Jcrespo: "What do you think, like that?" [puppet] - 10https://gerrit.wikimedia.org/r/492321 (owner: 10Muehlenhoff) [11:56:08] (03PS5) 10Jcrespo: mariadb: Remove support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/492321 (owner: 10Muehlenhoff) [11:57:22] (03PS1) 10Marostegui: db-eqiad.php: More traffic to db1075 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494910 [11:57:27] (03PS6) 10Jcrespo: mariadb: Remove support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/492321 (owner: 10Muehlenhoff) [11:57:29] (03PS1) 10GTirloni: openldap: Set thread pool based on processor count [puppet] - 10https://gerrit.wikimedia.org/r/494911 (https://phabricator.wikimedia.org/T217280) [11:57:31] (03CR) 10Marostegui: mariadb: Remove support for trusty (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/492321 (owner: 10Muehlenhoff) [11:58:06] (03PS2) 10Mathew.onipe: icinga: merge https and http checks [puppet] - 10https://gerrit.wikimedia.org/r/494869 (https://phabricator.wikimedia.org/T215587) [11:58:28] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: More traffic to db1075 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494910 (owner: 10Marostegui) [11:58:30] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1075 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494909 (owner: 10Marostegui) [11:59:27] (03Merged) 10jenkins-bot: db-eqiad.php: More traffic to db1075 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494910 (owner: 10Marostegui) [11:59:30] 10Operations, 10MobileFrontend, 10TechCom, 10Traffic, 10Readers-Web-Backlog (Tracking): Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (10ovasileva) Just wanted to chime in with a product perspective on this. This change is not cu... 
[11:59:40] (03CR) 10jenkins-bot: db-eqiad.php: More traffic to db1075 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494910 (owner: 10Marostegui) [12:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190307T1200). [12:00:04] kart_, sau226, and hauskatze: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:19] * kart_ is here [12:00:26] (03CR) 10GTirloni: "Puppet compiler output (looks like it still is working with the old 4-CPU value): https://puppet-compiler.wmflabs.org/compiler1002/15028/" [puppet] - 10https://gerrit.wikimedia.org/r/494911 (https://phabricator.wikimedia.org/T217280) (owner: 10GTirloni) [12:00:26] kart_: deploying your own change? [12:00:31] I'm here [12:00:31] zeljkof: sure [12:00:33] I'm around to help [12:00:39] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1075 after schema change and mysql upgrade (duration: 00m 56s) [12:00:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:42] kart_: go ahead then, let me know if you need help [12:00:47] zeljkof: I'm starting with +2. [12:00:51] zeljkof: yes. Thanks! 
[12:01:10] (03CR) 10Jcrespo: mariadb: Remove support for trusty (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/492321 (owner: 10Muehlenhoff) [12:01:10] zeljkof: let's see if we can make it work today :) [12:01:53] (03CR) 10KartikMistry: [C: 03+2] Enable edittag for ExternalGuidance in CX and VE [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494477 (https://phabricator.wikimedia.org/T216123) (owner: 10KartikMistry) [12:02:40] (03Merged) 10jenkins-bot: Enable edittag for ExternalGuidance in CX and VE [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494477 (https://phabricator.wikimedia.org/T216123) (owner: 10KartikMistry) [12:02:56] hauskatze: :) [12:04:32] Deploying in canary.. [12:05:21] * sau226 is here [12:05:44] OK. seems good. [12:05:51] zeljkof: Going for full deploy. [12:06:17] (03PS2) 10Addshore: Enable musical notation datatype on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493010 (https://phabricator.wikimedia.org/T216730) (owner: 10Ladsgroup) [12:06:37] * addshore is going to add ^^ to this swat [12:07:13] (03PS3) 10Mathew.onipe: icinga: merge https and http checks [puppet] - 10https://gerrit.wikimedia.org/r/494869 (https://phabricator.wikimedia.org/T215587) [12:08:05] zeljkof: Is scap slow to show log message? [12:08:21] !log kartik@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:494477]] Enable edittag for ExternalGuidance in CX and VE (T216123) (duration: 00m 57s) [12:08:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:24] T216123: Tag is not added when new page is created through External Guidance - https://phabricator.wikimedia.org/T216123 [12:08:39] OK. It only shows after it is finished! 
[12:09:24] kart_: yes, when finished [12:09:32] and it takes a minute or so [12:09:37] zeljkof: yep [12:10:48] (03CR) 10jenkins-bot: Enable edittag for ExternalGuidance in CX and VE [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494477 (https://phabricator.wikimedia.org/T216123) (owner: 10KartikMistry) [12:11:39] (03CR) 10Mathew.onipe: "PCC is Ok (well until we merge): https://puppet-compiler.wmflabs.org/compiler1002/15032/" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/494869 (https://phabricator.wikimedia.org/T215587) (owner: 10Mathew.onipe) [12:12:32] zeljkof: I'm done with my patch. Monitoring Logstash for a while. [12:13:03] (03CR) 10Lucas Werkmeister (WMDE): "very minor nitpick" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493010 (https://phabricator.wikimedia.org/T216730) (owner: 10Ladsgroup) [12:13:14] 10Operations, 10Cloud-VPS, 10Toolforge, 10LDAP, 10Patch-For-Review: LDAP server running out of memory frequently and disrupting Cloud VPS clients - https://phabricator.wikimedia.org/T217280 (10GTirloni) @hashar nice research, thank you! I've submitted a change to increase the number of threads. It does s... [12:13:22] kart_: want to practice on other patches or should I continue with swat? :) [12:14:11] (03PS3) 10Addshore: Enable musical notation datatype on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493010 (https://phabricator.wikimedia.org/T216730) (owner: 10Ladsgroup) [12:14:19] sau226 around for swat? [12:14:25] Go for it [12:14:36] (03PS4) 10Addshore: Enable musical notation datatype on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493010 (https://phabricator.wikimedia.org/T216730) (owner: 10Ladsgroup) [12:14:40] zeljkof: I can deploy one more. 
[12:15:18] kart_: go ahead with the next one then :) 492447 from sau226 [12:15:26] Sure [12:15:48] (03PS8) 10KartikMistry: Restore bureaucrat rights on hi.wiktionary to default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492447 (https://phabricator.wikimedia.org/T214765) (owner: 10Sau226) [12:16:02] rebasing.. [12:17:48] (03CR) 10KartikMistry: [C: 03+2] Restore bureaucrat rights on hi.wiktionary to default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492447 (https://phabricator.wikimedia.org/T214765) (owner: 10Sau226) [12:18:18] zeljkof: which canary / mwdebug are we using? [12:18:45] (03Merged) 10jenkins-bot: Restore bureaucrat rights on hi.wiktionary to default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492447 (https://phabricator.wikimedia.org/T214765) (owner: 10Sau226) [12:19:16] (03CR) 10Dzahn: [C: 03+1] Install ack instead of ack-grep [puppet] - 10https://gerrit.wikimedia.org/r/494718 (https://phabricator.wikimedia.org/T213527) (owner: 10Muehlenhoff) [12:19:31] hauskatze: https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#Canary [12:19:33] 10Operations: Document service owner in Netbox - https://phabricator.wikimedia.org/T217686 (10faidon) [12:19:36] 10Operations: Mapping of servers to stakeholders - https://phabricator.wikimedia.org/T216088 (10faidon) [12:19:45] mwdebug1002.eqiad.wmnet [12:20:11] zeljkof: yesterday I was looking at mwdebug1001..., sigh [12:20:24] not today obviously :) [12:20:33] Just used X-Wikimedia-Debug for mw1002.eqiad for patch [12:20:35] hauskatze: ouch, maybe I forgot to make it explicit [12:21:25] zeljkof: np, we'll see today [12:21:28] sau226: syncing in mwdebug1002. Can you test it? [12:21:28] 10Operations: Mapping of servers to stakeholders - https://phabricator.wikimedia.org/T216088 (10faidon) [12:21:29] it's always mwdebug1002.eqiad.wmnet, I make it explicit with new people, but usually just say mwdebug with people that have been around for a while [12:21:43] zeljkof: patch on mwdebug1002. 
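For reference, testing a patch on the debug host as described above means sending requests with the X-Wikimedia-Debug header. A minimal sketch of building that header follows; the `backend=<host>` value format and the helper name `debug_headers` are assumptions, since the log only mentions the header name and the host mwdebug1002.eqiad.wmnet.

```python
def debug_headers(backend="mwdebug1002.eqiad.wmnet"):
    """Build request headers that pin a request to one debug backend.

    Assumption: the header value has the form "backend=<host>"; the log
    only names the header and the host, not the value syntax.
    """
    if not backend or any(ch.isspace() for ch in backend):
        raise ValueError("backend must be a non-empty host name without whitespace")
    return {"X-Wikimedia-Debug": f"backend={backend}"}
```

The returned dict can be passed as extra headers to whatever HTTP client is used for the manual check; dropping the header again gives the "test without X-Wikimedia-Debug" step that follows a full sync.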
[12:21:49] is mwdebug1001 used for anything? [12:21:56] 10Operations, 10ops-eqiad, 10DC-Ops, 10Parsoid, 10decommission: decom ruthenium - https://phabricator.wikimedia.org/T216062 (10MoritzMuehlenhoff) The host is still in puppetdb/cumin, see e.g, puppetboard [12:22:03] Testing now [12:22:14] kart_: tested, works, waiting on sau226 doublecheck [12:22:24] Successful for me [12:22:27] hauskatze: probably scap uses it [12:22:27] hauskatze: cool. [12:22:30] sau226: nice. [12:22:33] 10Operations: Document service owner in Netbox - https://phabricator.wikimedia.org/T217686 (10faidon) This seems like a duplicate (and subset of) T216088. I've added the custom field proposal as one of the many options listed in its task description and closing this as duplicate to keep the discussion in one pla... [12:22:33] deploying.. [12:23:42] (03CR) 10Volans: [C: 04-1] "Thanks for the effort, this is not a trivial one to get right." (0316 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/488256 (https://phabricator.wikimedia.org/T213401) (owner: 10Mathew.onipe) [12:23:46] ah. Forgot to add task name :/ [12:23:50] zeljkof: ^ [12:23:57] task number. [12:24:27] !log kartik@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:492447]] Restore bureaucrat rights on hi.wiktionary to default () (duration: 00m 56s) [12:24:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:29] (03CR) 10jenkins-bot: Restore bureaucrat rights on hi.wiktionary to default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492447 (https://phabricator.wikimedia.org/T214765) (owner: 10Sau226) [12:24:50] I don't think it matters much as the change has the task on the commit message [12:24:59] kart_: yeah, not a big deal [12:25:02] mostly for reference [12:25:11] sau226: deployed. Thanks! [12:25:33] I'm testing without X-Wikimedia-Debug per the guide [12:26:27] Good on my end [12:26:35] kart_: want to deploy more or should I take over swat? 
[12:26:49] zeljkof: Please take over :) [12:27:03] kart_: ok, thanks for the deployments, taking over :) [12:27:21] hauskatze: please stand by, I'll ping you when the first patch is ready for testing [12:27:23] zeljkof: I can be in SWAT again next week for multiple days to learn more. Will deploy more that time! [12:27:39] zeljkof: ack, I'm here [12:27:40] kart_: great, I'm almost always around :) [12:28:08] I'm too. When not cycling/running ;) [12:28:33] (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494806 (https://phabricator.wikimedia.org/T217523) (owner: 10MarcoAurelio) [12:29:31] (03Merged) 10jenkins-bot: Restrict local uploads on mediawiki.org, take 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494806 (https://phabricator.wikimedia.org/T217523) (owner: 10MarcoAurelio) [12:29:47] (03CR) 10jenkins-bot: Restrict local uploads on mediawiki.org, take 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494806 (https://phabricator.wikimedia.org/T217523) (owner: 10MarcoAurelio) [12:29:57] zeljkof: when on mwdebug, can we touch CommonSettings and InitialiseSettings as well? [12:30:31] hauskatze: no, that's not normal procedure [12:30:47] zeljkof: well, let's try the normal procedure then [12:30:50] hauskatze: 494806 is at mwdebug1002, please test [12:30:54] ok, going [12:31:23] hauskatze: I really don't want to break stuff, that's my priority, even if that means not deploying something [12:31:48] touching files is not documented as needed, so I don't do it during deployment [12:31:48] (03CR) 10Dzahn: [C: 04-1] "you should be able to set something like this in project Hiera:" [puppet] - 10https://gerrit.wikimedia.org/r/494811 (owner: 10Paladox) [12:32:02] if it's needed, we need to document it [12:32:16] zeljkof: doesn't appear to be applying. I think we need to `touch` those files for the dblist patch to be applied but if you don't want to, I respect it [12:32:30] maybe we can deploy and see if it applies? 
[12:32:38] it is really weird [12:32:49] (03CR) 10Alexandros Kosiaris: openldap: Set thread pool based on processor count (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/494911 (https://phabricator.wikimedia.org/T217280) (owner: 10GTirloni) [12:32:52] hauskatze: I'm ok with deploying it, it's easy to revert in case of trouble [12:33:08] zeljkof: then please deploy, maybe scap touches everything [12:33:16] and certainly the next patch does [12:34:19] !log zfilipin@deploy1001 Synchronized dblists/commonsuploads.dblist: SWAT: [[gerrit:494806|Restrict local uploads on mediawiki.org, take 2 (T217523)]] (duration: 00m 56s) [12:34:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:22] T217523: Restrict local uploads to commons for MediaWiki.org - https://phabricator.wikimedia.org/T217523 [12:34:23] (03PS3) 10MarcoAurelio: Create an 'uploader' group on mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494225 (https://phabricator.wikimedia.org/T217523) [12:34:29] hauskatze: it's deployed, please test [12:35:29] zeljkof: still nothing, is it possible to revert after deploying 494225? [12:35:37] hauskatze: sure [12:35:41] no conflicts as it's different file paths [12:35:52] hauskatze: so 494225 is ok to deploy? 
[12:36:21] Kvragu patch :P [12:36:21] anyway, let's merge and push to mwdebug :) [12:36:38] 225 is okay to pick yes zel [12:36:45] * zeljkof [12:36:59] (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494225 (https://phabricator.wikimedia.org/T217523) (owner: 10MarcoAurelio) [12:37:55] (03Merged) 10jenkins-bot: Create an 'uploader' group on mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494225 (https://phabricator.wikimedia.org/T217523) (owner: 10MarcoAurelio) [12:38:26] (03PS2) 10Dzahn: dumps::nfs: set notes URL [puppet] - 10https://gerrit.wikimedia.org/r/494741 [12:38:35] (03CR) 10Dzahn: [C: 03+2] "per IRC talk" [puppet] - 10https://gerrit.wikimedia.org/r/494741 (owner: 10Dzahn) [12:39:04] hauskatze: 494225 is at mwdebug1002 [12:39:08] checking [12:39:35] zeljkof: works and the previous patch does work now as well [12:39:58] (03CR) 10Muehlenhoff: [C: 03+1] Update file path to match debdeploy-restarts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/494763 (owner: 10Jbond) [12:40:00] I suspected InitialiseSettings had to be touched :) [12:40:01] (03CR) 10jenkins-bot: Create an 'uploader' group on mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494225 (https://phabricator.wikimedia.org/T217523) (owner: 10MarcoAurelio) [12:40:05] hauskatze: I'll deploy this one, check both, I can revert the first one then [12:40:25] zeljkof: both seem to be working, but I'll check off mwdebug after deployment [12:40:46] (03PS4) 10Jbond: Update file path to match debdeploy-restarts [puppet] - 10https://gerrit.wikimedia.org/r/494763 [12:40:50] ah, _does_ work, I thought you said _does not_ :) [12:41:17] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:494225|Create an uploader group on mediawiki.org (T217523)]] (duration: 00m 55s) [12:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:20] T217523: Restrict local uploads 
to commons for MediaWiki.org - https://phabricator.wikimedia.org/T217523 [12:41:34] hauskatze: deployed, please check both patches [12:41:41] zeljkof: on it [12:42:03] (03CR) 10ArielGlenn: [C: 03+2] handle failed xml content jobs correctly [dumps] - 10https://gerrit.wikimedia.org/r/494722 (https://phabricator.wikimedia.org/T217744) (owner: 10ArielGlenn) [12:42:08] addshore: please stand by, you're next :) deploying your own patch? [12:42:18] zeljkof: it would be great if someone else could! [12:42:22] I'm ready to test etc though! [12:42:31] addshore: sure, I can deploy [12:42:35] zeljkof: everything seems alright [12:42:40] (03PS5) 10Zfilipin: Enable musical notation datatype on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493010 (https://phabricator.wikimedia.org/T216730) (owner: 10Ladsgroup) [12:42:48] !log ariel@deploy1001 Started deploy [dumps/dumps@3a25aa0]: handle failed xml content jobs correctly (fix regression) [12:42:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:53] !log ariel@deploy1001 Finished deploy [dumps/dumps@3a25aa0]: handle failed xml content jobs correctly (fix regression) (duration: 00m 05s) [12:42:53] hauskatze: great, thanks for deploying with #releng :) [12:42:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:07] I cannot deploy with anyone else ;) [12:43:17] but if I could, I wouldn't :) [12:43:24] releng ftw [12:43:28] hauskatze: you can't deploy, yet ;) [12:43:34] we're working on it [12:43:37] hauskatze: I'm not sure what to do about deploying dblists/commonsuploads.dblist [12:43:57] zeljkof: that patch works [12:44:05] both of them [12:44:35] (03CR) 10Jbond: "add compile log" [puppet] - 10https://gerrit.wikimedia.org/r/494763 (owner: 10Jbond) [12:44:38] hashar, addshore: do you know why dblists/commonsuploads.dblist does not work by itself, but only after wmf-config/InitialiseSettings.php is also deployed? 
[12:44:40] (03CR) 10Jbond: [C: 03+2] Update file path to match debdeploy-restarts [puppet] - 10https://gerrit.wikimedia.org/r/494763 (owner: 10Jbond) [12:44:54] hauskatze: works now, but if we were just deploying that one patch, it would not work [12:45:01] (03CR) 10GTirloni: openldap: Set thread pool based on processor count (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/494911 (https://phabricator.wikimedia.org/T217280) (owner: 10GTirloni) [12:45:10] and the current process does not mention touching files [12:45:23] zeljkof: i dont quite follow, a change to the dblist should get deployed without other changes being needed [12:45:32] hauskatze: can you please create a phab ticket and add #releng so we can discuss it? [12:45:36] hauskatze: please cc me [12:45:42] zeljkof: my guess is that given that the wikis using the dblist are defined in InitialiseSettings and/or CommonSettings, if those files ain't touched, it is not detected [12:46:10] (03PS3) 10Dzahn: icinga/toollabs: set notes URLs for toolforge related checks [puppet] - 10https://gerrit.wikimedia.org/r/494490 (https://phabricator.wikimedia.org/T197873) [12:46:11] zeljkof: sure thing, I'll do that. But after lunch if possible [12:46:13] addshore: it got deployed, but it didn't work, started working after I've deployed another patch that updates IS.php :( [12:46:13] I got to go [12:46:22] hauskatze: sure, no rush [12:46:26] ty [12:46:44] as far as I know IS.php is always touched by scap when syncing, but maybe not?
hmm [12:47:34] * zeljkof shrugs [12:47:53] hauskatze will create a task and we can discuss it there [12:48:42] (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493010 (https://phabricator.wikimedia.org/T216730) (owner: 10Ladsgroup) [12:49:39] (03Merged) 10jenkins-bot: Enable musical notation datatype on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493010 (https://phabricator.wikimedia.org/T216730) (owner: 10Ladsgroup) [12:49:41] (03PS4) 10Dzahn: icinga/toollabs: set notes URLs for toolforge related checks [puppet] - 10https://gerrit.wikimedia.org/r/494490 (https://phabricator.wikimedia.org/T197873) [12:51:34] addshore: 493010 is at mwdebug1002 [12:51:46] (03CR) 10jenkins-bot: Enable musical notation datatype on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493010 (https://phabricator.wikimedia.org/T216730) (owner: 10Ladsgroup) [12:51:50] ack, testing now [12:53:21] (03PS6) 10Dzahn: icinga: add notes URLs to various monitoring checks, part 2 [puppet] - 10https://gerrit.wikimedia.org/r/494511 [12:53:54] zeljkof: looks good to me [12:54:16] addshore: ok, deploying [12:55:29] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:493010|Enable musical notation datatype on testwikidatawiki (T216730)]] (duration: 00m 56s) [12:55:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:32] T216730: Enable musical notation datatype - https://phabricator.wikimedia.org/T216730 [12:55:47] addshore: deployed, thanks for deploying with #releng ;) [12:55:57] (03CR) 10Dzahn: [C: 03+2] icinga: add notes URLs to various monitoring checks, part 2 [puppet] - 10https://gerrit.wikimedia.org/r/494511 (owner: 10Dzahn) [12:56:20] !log re-enabled puppet on seaborgium/serpens [12:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:53] (03PS5) 10Dzahn: icinga/toollabs: set notes URLs for toolforge
related checks [puppet] - 10https://gerrit.wikimedia.org/r/494490 (https://phabricator.wikimedia.org/T197873) [12:57:12] (03CR) 10Dzahn: [C: 03+2] icinga/toollabs: set notes URLs for toolforge related checks [puppet] - 10https://gerrit.wikimedia.org/r/494490 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [12:58:56] (03CR) 10Muehlenhoff: "It's worth a try, but the time frames where threads are stalled a very short and more of a sign of temporary spikes/overloads. Also, the F" [puppet] - 10https://gerrit.wikimedia.org/r/494911 (https://phabricator.wikimedia.org/T217280) (owner: 10GTirloni) [12:59:15] !log EU SWAT finished [12:59:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190307T1300) [13:00:24] 10Operations, 10Cloud-VPS, 10Toolforge, 10LDAP, 10Patch-For-Review: LDAP server running out of memory frequently and disrupting Cloud VPS clients - https://phabricator.wikimedia.org/T217280 (10GTirloni) {F28341972} This is impact of the proposed change to thread count, manually applied to both LDAP serv... [13:14:53] (03PS1) 10Marostegui: db-eqiad.php: Increase weight for db1075 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494920 [13:15:19] (03PS3) 10Dzahn: xhgui: setup git cloning and apache site [puppet] - 10https://gerrit.wikimedia.org/r/494425 (https://phabricator.wikimedia.org/T180761) [13:16:32] (03PS4) 10Dzahn: xhgui: setup git cloning and apache site [puppet] - 10https://gerrit.wikimedia.org/r/494425 (https://phabricator.wikimedia.org/T180761) [13:16:58] (03CR) 10Dzahn: "ah, sure. that is possible. 
done" [puppet] - 10https://gerrit.wikimedia.org/r/494425 (https://phabricator.wikimedia.org/T180761) (owner: 10Dzahn) [13:17:05] (03CR) 10Dzahn: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/494425 (https://phabricator.wikimedia.org/T180761) (owner: 10Dzahn) [13:17:22] (03CR) 10jerkins-bot: [V: 04-1] xhgui: setup git cloning and apache site [puppet] - 10https://gerrit.wikimedia.org/r/494425 (https://phabricator.wikimedia.org/T180761) (owner: 10Dzahn) [13:17:56] (03CR) 10jerkins-bot: [V: 04-1] xhgui: setup git cloning and apache site [puppet] - 10https://gerrit.wikimedia.org/r/494425 (https://phabricator.wikimedia.org/T180761) (owner: 10Dzahn) [13:19:35] (03PS3) 10Dzahn: icinga/restbase/eventbus: add notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/494485 (https://phabricator.wikimedia.org/T197873) [13:20:28] (03PS5) 10Dzahn: xhgui: setup git cloning and apache site [puppet] - 10https://gerrit.wikimedia.org/r/494425 (https://phabricator.wikimedia.org/T180761) [13:21:18] (03CR) 10jerkins-bot: [V: 04-1] xhgui: setup git cloning and apache site [puppet] - 10https://gerrit.wikimedia.org/r/494425 (https://phabricator.wikimedia.org/T180761) (owner: 10Dzahn) [13:21:45] 10Operations, 10Core Platform Team, 10Performance-Team, 10TechCom-RFC, and 3 others: Serve Main Page of WMF wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (10jbond) p:05Triage→03Normal [13:21:54] 10Operations, 10monitoring, 10LDAP: prometheus-openldap-exporter: Request.write called on a request after Request.finish was called - https://phabricator.wikimedia.org/T217758 (10jbond) p:05Triage→03Normal [13:23:21] (03PS6) 10Dzahn: xhgui: setup git cloning and apache site [puppet] - 10https://gerrit.wikimedia.org/r/494425 (https://phabricator.wikimedia.org/T180761) [13:25:05] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Increase weight for db1075 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494920 (owner: 10Marostegui) [13:26:05] (03Merged) 10jenkins-bot: 
db-eqiad.php: Increase weight for db1075 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494920 (owner: 10Marostegui) [13:26:20] (03CR) 10jenkins-bot: db-eqiad.php: Increase weight for db1075 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494920 (owner: 10Marostegui) [13:26:32] 10Operations, 10monitoring, 10Patch-For-Review: link Icinga checks to runbook / notes URLs - https://phabricator.wikimedia.org/T197873 (10Dzahn) [13:26:50] (03CR) 10Dzahn: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/494425 (https://phabricator.wikimedia.org/T180761) (owner: 10Dzahn) [13:26:59] (03CR) 10Dzahn: [C: 03+2] icinga/restbase/eventbus: add notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/494485 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [13:27:08] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1075 after schema change and mysql upgrade (duration: 00m 52s) [13:27:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:16] (03PS1) 10Gilles: Oversample navtiming on ruwiki and eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494921 (https://phabricator.wikimedia.org/T187299) [13:29:31] (03CR) 10Dzahn: "fails to parse template even though we just move it?? https://puppet-compiler.wmflabs.org/compiler1002/126/webperf1002.eqiad.wmnet/change." 
[puppet] - 10https://gerrit.wikimedia.org/r/494425 (https://phabricator.wikimedia.org/T180761) (owner: 10Dzahn) [13:29:51] (03Abandoned) 10Gilles: Oversample navtiming on ruwiki and eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494921 (https://phabricator.wikimedia.org/T187299) (owner: 10Gilles) [13:30:05] (03Restored) 10Gilles: Oversample navtiming on ruwiki and eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494921 (https://phabricator.wikimedia.org/T187299) (owner: 10Gilles) [13:31:14] (03PS2) 10Gilles: Enable Priority Hints origin trial on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494921 (https://phabricator.wikimedia.org/T216499) [13:32:26] (03CR) 10Gehel: [C: 04-1] Add wdqs data transfer cookbook (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/488256 (https://phabricator.wikimedia.org/T213401) (owner: 10Mathew.onipe) [13:33:22] (03PS1) 10GTirloni: ldap: increase group TTL from 60 to 3600 seconds in labs [puppet] - 10https://gerrit.wikimedia.org/r/494922 (https://phabricator.wikimedia.org/T217280) [13:34:59] (03CR) 10GTirloni: "This was pointed out by hashar and it seems worth twaking for Cloud VPS / Toolforge. Any drawbacks or security concerns? 
It's aligning gro" [puppet] - 10https://gerrit.wikimedia.org/r/494922 (https://phabricator.wikimedia.org/T217280) (owner: 10GTirloni) [13:35:34] ahah [13:37:14] (03PS1) 10Marostegui: dbproxy1011: Depool labsdb1009 [puppet] - 10https://gerrit.wikimedia.org/r/494924 [13:37:18] (03PS1) 10Marostegui: db-eqiad.php: Fully repool db1075 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494925 [13:38:03] (03PS2) 10Marostegui: dbproxy1011: Depool labsdb1009 [puppet] - 10https://gerrit.wikimedia.org/r/494924 [13:38:59] (03PS2) 10Dzahn: phabricator: switch PHP on stretch from thirdparty/php72 to component/php72 [puppet] - 10https://gerrit.wikimedia.org/r/494885 [13:39:06] (03CR) 10Marostegui: [C: 03+2] dbproxy1011: Depool labsdb1009 [puppet] - 10https://gerrit.wikimedia.org/r/494924 (owner: 10Marostegui) [13:39:17] (03CR) 10Volans: [C: 04-1] Add wdqs data transfer cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/488256 (https://phabricator.wikimedia.org/T213401) (owner: 10Mathew.onipe) [13:39:27] (03PS2) 10Marostegui: db-eqiad.php: Fully repool db1075 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494925 [13:39:55] !log Reload haproxy on dbproxy1010 to depool labsdb1009 [13:39:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:28] (03CR) 10Dzahn: [C: 03+2] phabricator: switch PHP on stretch from thirdparty/php72 to component/php72 [puppet] - 10https://gerrit.wikimedia.org/r/494885 (owner: 10Dzahn) [13:40:31] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Fully repool db1075 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494925 (owner: 10Marostegui) [13:40:41] (03PS3) 10Dzahn: phabricator: switch PHP on stretch from thirdparty/php72 to component/php72 [puppet] - 10https://gerrit.wikimedia.org/r/494885 [13:41:34] !log Stop mysql on labsdb1009 for upgrade (this will trigger an haproxy IRC alert) [13:41:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:36] (03Merged) 
10jenkins-bot: db-eqiad.php: Fully repool db1075 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494925 (owner: 10Marostegui) [13:42:50] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1075 after schema change and mysql upgrade (duration: 00m 55s) [13:42:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:46] PROBLEM - haproxy failover on dbproxy1011 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [13:45:50] ^ expected as logged before [13:47:15] (03CR) 10Volans: [C: 04-1] "Change looks sane, few minor comments inline." (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/493348 (https://phabricator.wikimedia.org/T215229) (owner: 10CRusnov) [13:47:32] !log phab1002 - removing all php-7.2 packages and letting puppet reinstall them after component change [13:47:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:13] (03CR) 10Hashar: "I was looking at nscd "group cache" and how its TTL is only 60 seconds. 
I live hacked an instance to tweak it and indeed that makes some r" [puppet] - 10https://gerrit.wikimedia.org/r/494922 (https://phabricator.wikimedia.org/T217280) (owner: 10GTirloni) [13:48:22] (03CR) 10Volans: "This change depends on Ie958bf4f0a0374bbfe8641389cebecf67e069fa2 , you can rebase this on top of the other and keep them in a chained rela" [puppet] - 10https://gerrit.wikimedia.org/r/493774 (https://phabricator.wikimedia.org/T215229) (owner: 10CRusnov) [13:48:38] (03CR) 10jenkins-bot: db-eqiad.php: Fully repool db1075 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494925 (owner: 10Marostegui) [13:50:22] PROBLEM - HHVM rendering on mw2207 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:50:50] RECOVERY - haproxy failover on dbproxy1011 is OK: OK check_failover servers up 2 down 0 [13:51:14] RECOVERY - HHVM rendering on mw2207 is OK: HTTP OK: HTTP/1.1 200 OK - 74921 bytes in 0.168 second response time [13:56:34] (03PS4) 10Gehel: icinga: merge https and http checks [puppet] - 10https://gerrit.wikimedia.org/r/494869 (https://phabricator.wikimedia.org/T215587) (owner: 10Mathew.onipe) [13:58:25] (03CR) 10Gehel: [C: 03+2] icinga: merge https and http checks [puppet] - 10https://gerrit.wikimedia.org/r/494869 (https://phabricator.wikimedia.org/T215587) (owner: 10Mathew.onipe) [14:00:05] hashar: That opportune time is upon us again. Time for a MediaWiki train - European version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190307T1400). 
[14:04:23] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1012 is CRITICAL: (null) [14:04:27] PROBLEM - ElasticSearch health check for shards on 9200 on logstash2006 is CRITICAL: (null) [14:04:37] PROBLEM - ElasticSearch health check for shards on 9200 on logstash2004 is CRITICAL: (null) [14:04:53] PROBLEM - ElasticSearch health check for shards on 9200 on logstash2003 is CRITICAL: (null) [14:04:57] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1008 is CRITICAL: (null) [14:05:01] crap [14:05:05] PROBLEM - ElasticSearch health check for shards on 9200 on relforge1001 is CRITICAL: (null) [14:05:05] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1011 is CRITICAL: (null) [14:05:05] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1009 is CRITICAL: (null) [14:05:07] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1007 is CRITICAL: (null) [14:05:24] oops [14:05:33] I guess I will wait before doing the train [14:05:55] onimisionipe: have we lost logstash ? :( [14:06:08] hashar: nope [14:06:22] the check is failing cos of the last patch gehel merged [14:06:37] crap, rolling back [14:06:43] thanks! [14:06:51] * hashar refrains emitting a bad comment about java developers [14:06:52] ;) [14:06:55] (03CR) 10GTirloni: "> However, the low ttl for positive group cache hits is intentional. 
It comes from https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/" [puppet] - 10https://gerrit.wikimedia.org/r/494922 (https://phabricator.wikimedia.org/T217280) (owner: 10GTirloni) [14:07:07] (03PS1) 10Gehel: Revert "icinga: merge https and http checks" [puppet] - 10https://gerrit.wikimedia.org/r/494927 [14:07:13] ah that is just the icinga check being broken good [14:07:17] I will proceed with the train [14:07:35] PROBLEM - ElasticSearch health check for shards on 9200 on logstash2005 is CRITICAL: (null) [14:07:35] PROBLEM - ElasticSearch health check for shards on 9200 on logstash2002 is CRITICAL: (null) [14:07:43] hashar: please don't refrain on my behalf! [14:07:58] (03CR) 10Gehel: [V: 03+2 C: 03+2] Revert "icinga: merge https and http checks" [puppet] - 10https://gerrit.wikimedia.org/r/494927 (owner: 10Gehel) [14:08:47] PROBLEM - ElasticSearch health check for shards on 9200 on logstash2001 is CRITICAL: (null) [14:08:49] (03CR) 10Muehlenhoff: Update wmf-auto-restarts to read exclude mounts from debdeploy config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/494765 (https://phabricator.wikimedia.org/T217646) (owner: 10Jbond) [14:09:55] (03PS1) 10Hashar: all wikis to 1.33.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494928 [14:09:57] (03CR) 10Hashar: [C: 03+2] all wikis to 1.33.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494928 (owner: 10Hashar) [14:10:01] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga2001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [14:11:01] (03Merged) 10jenkins-bot: all wikis to 1.33.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494928 (owner: 10Hashar) [14:11:28] (03CR) 10jenkins-bot: all wikis to 1.33.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494928 (owner: 10Hashar) [14:13:21] RECOVERY - HTTP availability for Nginx -SSL 
terminators- at esams on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [14:13:25] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1008 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 86, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0, [14:13:25] 2, initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards: 0 [14:13:35] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1009 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 86, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0, [14:13:35] 2, initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards: 0 [14:13:35] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1011 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 86, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0, [14:13:35] 2, initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards: 0 [14:13:35] RECOVERY - ElasticSearch health check for shards on 9200 on relforge1001 is OK: OK - elasticsearch status relforge-eqiad: status: green, number_of_nodes: 2, unassigned_shards: 0, number_of_pending_tasks: 0, 
number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 83, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0, active_shards: 104, in [14:13:36] : 0, number_of_data_nodes: 2, delayed_unassigned_shards: 0 [14:13:36] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1007 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 86, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0, [14:13:37] 2, initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards: 0 [14:13:53] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1012 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 86, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0, [14:13:53] 2, initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards: 0 [14:15:24] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.33.0-wmf.20 [14:15:35] PROBLEM - ElasticSearch health check for shards on 9200 on relforge1002 is CRITICAL: CRITICAL - elasticsearch http://10.64.37.21:http/_cluster/health error while fetching: Failed to parse: 10.64.37.21:http [14:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:57] (03PS1) 10Marostegui: Revert "dbproxy1011: Depool labsdb1009" [puppet] - 10https://gerrit.wikimedia.org/r/494929 [14:17:19] PROBLEM - HHVM rendering on mw1314 is CRITICAL: CRITICAL - Socket timeout after 10 
seconds [14:17:27] PROBLEM - Apache HTTP on mw1314 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:17:53] PROBLEM - Nginx local proxy to apache on mw1314 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:18:19] RECOVERY - Apache HTTP on mw1314 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.106 second response time [14:18:45] RECOVERY - Nginx local proxy to apache on mw1314 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.229 second response time [14:20:15] RECOVERY - HHVM rendering on mw1314 is OK: HTTP OK: HTTP/1.1 200 OK - 74965 bytes in 1.731 second response time [14:22:37] (03CR) 10Jbond: Update wmf-auto-restarts to read exclude mounts from debdeploy config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/494765 (https://phabricator.wikimedia.org/T217646) (owner: 10Jbond) [14:24:55] RECOVERY - ElasticSearch health check for shards on 9200 on logstash2006 is OK: OK - elasticsearch status production-logstash-codfw: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 49, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-codfw, relocating_shards: 0, active_shards_percent_as_number: 100.0, [14:24:55] 8, initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards: 0 [14:31:47] RECOVERY - ElasticSearch health check for shards on 9200 on logstash2002 is OK: OK - elasticsearch status production-logstash-codfw: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 49, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-codfw, relocating_shards: 0, active_shards_percent_as_number: 100.0, [14:31:47] 8, initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards: 0 [14:36:57] (03CR) 10Muehlenhoff: Update wmf-auto-restarts to read exclude 
mounts from debdeploy config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/494765 (https://phabricator.wikimedia.org/T217646) (owner: 10Jbond) [14:39:11] (03PS2) 10Gehel: logstash: upgrade to 5.6.14 [puppet] - 10https://gerrit.wikimedia.org/r/494735 (https://phabricator.wikimedia.org/T216052) [14:40:37] (03CR) 10Gehel: [C: 03+2] logstash: upgrade to 5.6.14 [puppet] - 10https://gerrit.wikimedia.org/r/494735 (https://phabricator.wikimedia.org/T216052) (owner: 10Gehel) [14:41:25] (03PS2) 10Marostegui: Revert "dbproxy1011: Depool labsdb1009" [puppet] - 10https://gerrit.wikimedia.org/r/494929 [14:42:47] PROBLEM - HHVM rendering on mw2186 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:42:47] PROBLEM - HHVM rendering on mw2181 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:43:47] RECOVERY - HHVM rendering on mw2186 is OK: HTTP OK: HTTP/1.1 200 OK - 74901 bytes in 0.150 second response time [14:43:47] RECOVERY - HHVM rendering on mw2181 is OK: HTTP OK: HTTP/1.1 200 OK - 74901 bytes in 0.162 second response time [14:45:13] (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy1011: Depool labsdb1009" [puppet] - 10https://gerrit.wikimedia.org/r/494929 (owner: 10Marostegui) [14:45:33] RECOVERY - ElasticSearch health check for shards on 9200 on logstash2004 is OK: OK - elasticsearch status production-logstash-codfw: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 49, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-codfw, relocating_shards: 0, active_shards_percent_as_number: 100.0, [14:45:33] 8, initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards: 0 [14:45:43] RECOVERY - ElasticSearch health check for shards on 9200 on logstash2001 is OK: OK - elasticsearch status production-logstash-codfw: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, 
number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 49, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-codfw, relocating_shards: 0, active_shards_percent_as_number: 100.0, [14:45:43] 8, initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards: 0 [14:45:49] RECOVERY - ElasticSearch health check for shards on 9200 on logstash2005 is OK: OK - elasticsearch status production-logstash-codfw: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 49, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-codfw, relocating_shards: 0, active_shards_percent_as_number: 100.0, [14:45:49] 8, initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards: 0 [14:46:01] RECOVERY - ElasticSearch health check for shards on 9200 on logstash2003 is OK: OK - elasticsearch status production-logstash-codfw: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 49, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-codfw, relocating_shards: 0, active_shards_percent_as_number: 100.0, [14:46:01] 8, initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards: 0 [14:46:30] !log Reload haproxy on dbproxy1011 to repool labsdb1009 [14:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:51] !log 1.33.0-wmf.20 seems all good [14:54:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:22] (03PS2) 10Muehlenhoff: Install ack instead of ack-grep [puppet] - 10https://gerrit.wikimedia.org/r/494718 (https://phabricator.wikimedia.org/T213527) [14:58:08] (03CR) 10Muehlenhoff: [C: 03+2] Install ack instead of ack-grep [puppet] - 10https://gerrit.wikimedia.org/r/494718 (https://phabricator.wikimedia.org/T213527) 
(owner: 10Muehlenhoff) [15:00:38] (03PS1) 10ArielGlenn: fix broken pages content job retries [dumps] - 10https://gerrit.wikimedia.org/r/494936 [15:00:41] RECOVERY - ElasticSearch health check for shards on 9200 on relforge1002 is OK: OK - elasticsearch status relforge-eqiad: status: green, number_of_nodes: 2, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 83, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0, active_shards: 104, in [15:00:41] : 0, number_of_data_nodes: 2, delayed_unassigned_shards: 0 [15:02:15] 10Operations, 10Traffic, 10Wikidata, 10Wikidata-Campsite, 10User-Addshore: Some esams<->eqiad varnish backend connections closed by peer - https://phabricator.wikimedia.org/T216006 (10ema) Those connection resets on the varnish backend layer happen when frontend caches are full and varnish cannot make sp... [15:02:39] (03PS1) 10Ema: varnish: apply frontend size-based cutoff to text too [puppet] - 10https://gerrit.wikimedia.org/r/494937 (https://phabricator.wikimedia.org/T216006) [15:03:17] 10Operations, 10Cloud-VPS, 10Toolforge, 10LDAP, 10Patch-For-Review: LDAP server running out of memory frequently and disrupting Cloud VPS clients - https://phabricator.wikimedia.org/T217280 (10GTirloni) Most active talkers to LDAP (in bytes, ~10min packet capture): # 68M deployment-deploy01.deployment-... [15:04:09] 10Operations, 10Traffic, 10Wikidata, 10Wikidata-Campsite, and 2 others: nuke_limit often reached on esams varnish frontends - https://phabricator.wikimedia.org/T216006 (10ema) [15:12:14] (03CR) 10Hashar: "Have to try it a bit. 
But already spotted one case worth updating the change ,)" (031 comment) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/487793 (owner: 10Giuseppe Lavagetto) [15:18:50] (03CR) 10CDanis: [C: 03+1] icinga: set notes_url for Icinga meta checks [puppet] - 10https://gerrit.wikimedia.org/r/494738 (owner: 10Dzahn) [15:20:46] (03PS2) 10Ema: varnish: apply frontend size-based cutoff to text too [puppet] - 10https://gerrit.wikimedia.org/r/494937 (https://phabricator.wikimedia.org/T216006) [15:26:37] !log rolling upgrade of elasticsearch on logstash clusters - T216052 [15:26:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:40] T216052: upgrade logstash and the logstash elasticsearch cluster to 5.6.14 - https://phabricator.wikimedia.org/T216052 [15:29:37] 10Operations, 10ops-codfw: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T217755 (10Papaul) a:05Papaul→03jcrespo @jcrespo disk replacement complete. [15:29:48] (03PS1) 10Vgutierrez: acme-chief: Store certificates in unique version certificates [software/acme-chief] - 10https://gerrit.wikimedia.org/r/494956 (https://phabricator.wikimedia.org/T207295) [15:29:52] (03PS1) 10Vgutierrez: acme-chief-api: Add support for puppet HTTP API search operation [software/acme-chief] - 10https://gerrit.wikimedia.org/r/494957 (https://phabricator.wikimedia.org/T207295) [15:30:58] (03PS2) 10Vgutierrez: acme-chief: Store certificates in unique directories [software/acme-chief] - 10https://gerrit.wikimedia.org/r/494956 (https://phabricator.wikimedia.org/T207295) [15:31:02] (03PS2) 10Vgutierrez: acme-chief-api: Add support for puppet HTTP API search operation [software/acme-chief] - 10https://gerrit.wikimedia.org/r/494957 (https://phabricator.wikimedia.org/T207295) [15:32:30] (03CR) 10jerkins-bot: [V: 04-1] acme-chief: Store certificates in unique directories [software/acme-chief] - 10https://gerrit.wikimedia.org/r/494956 (https://phabricator.wikimedia.org/T207295) (owner: 
10Vgutierrez) [15:32:32] (03CR) 10jerkins-bot: [V: 04-1] acme-chief-api: Add support for puppet HTTP API search operation [software/acme-chief] - 10https://gerrit.wikimedia.org/r/494957 (https://phabricator.wikimedia.org/T207295) (owner: 10Vgutierrez) [15:35:12] (03PS3) 10Vgutierrez: acme-chief: Store certificates in unique directories [software/acme-chief] - 10https://gerrit.wikimedia.org/r/494956 (https://phabricator.wikimedia.org/T207295) [15:35:14] (03PS3) 10Vgutierrez: acme-chief-api: Add support for puppet HTTP API search operation [software/acme-chief] - 10https://gerrit.wikimedia.org/r/494957 (https://phabricator.wikimedia.org/T207295) [15:35:50] (03PS1) 10Gehel: elasticsearch: fix typo in ExitOnOutOfMemoryError [puppet] - 10https://gerrit.wikimedia.org/r/494962 [15:35:59] (03CR) 10Gehel: [V: 03+2 C: 03+2] elasticsearch: fix typo in ExitOnOutOfMemoryError [puppet] - 10https://gerrit.wikimedia.org/r/494962 (owner: 10Gehel) [15:36:42] 10Operations, 10TechCom, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Backlog (Later), and 4 others: Establish an SLA for session storage - https://phabricator.wikimedia.org/T211721 (10EvanProdromou) So, last level of discussion had these values: | Action | Redis (ms) |... 
[15:37:11] 10Operations, 10ops-codfw: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T217755 (10jcrespo) Thanks, ` physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, Rebuilding) ` [15:38:08] (03CR) 10CDanis: [C: 03+1] icinga: add check_icinga script (031 comment) [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/493298 (https://phabricator.wikimedia.org/T217599) (owner: 10Volans) [15:38:12] (03CR) 10CDanis: [C: 03+1] icinga: add check_icinga script [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/493298 (https://phabricator.wikimedia.org/T217599) (owner: 10Volans) [15:38:22] (03CR) 10jerkins-bot: [V: 04-1] acme-chief-api: Add support for puppet HTTP API search operation [software/acme-chief] - 10https://gerrit.wikimedia.org/r/494957 (https://phabricator.wikimedia.org/T207295) (owner: 10Vgutierrez) [15:40:27] (03PS4) 10Volans: icinga: add check_icinga script [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/493298 (https://phabricator.wikimedia.org/T217599) [15:40:34] (03CR) 10Volans: icinga: add check_icinga script (031 comment) [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/493298 (https://phabricator.wikimedia.org/T217599) (owner: 10Volans) [15:41:47] (03CR) 10WMDE-leszek: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490587 (https://phabricator.wikimedia.org/T213483) (owner: 10WMDE-leszek) [15:42:13] (03PS4) 10Vgutierrez: acme-chief-api: Add support for puppet HTTP API search operation [software/acme-chief] - 10https://gerrit.wikimedia.org/r/494957 (https://phabricator.wikimedia.org/T207295) [15:44:29] (03CR) 10jerkins-bot: [V: 04-1] acme-chief-api: Add support for puppet HTTP API search operation [software/acme-chief] - 10https://gerrit.wikimedia.org/r/494957 (https://phabricator.wikimedia.org/T207295) (owner: 10Vgutierrez) [15:44:36] FFS [15:45:58] (03PS1) 10Gehel: logstash: upgrade ELK to 5.6.14$ [puppet] - 10https://gerrit.wikimedia.org/r/494966 
(https://phabricator.wikimedia.org/T216052) [15:46:45] (03PS2) 10Gehel: logstash: upgrade ELK to 5.6.14 [puppet] - 10https://gerrit.wikimedia.org/r/494966 (https://phabricator.wikimedia.org/T216052) [15:49:25] (03CR) 10Mathew.onipe: [C: 03+1] logstash: upgrade ELK to 5.6.14 [puppet] - 10https://gerrit.wikimedia.org/r/494966 (https://phabricator.wikimedia.org/T216052) (owner: 10Gehel) [15:49:38] (03CR) 10Gehel: [C: 03+2] logstash: upgrade ELK to 5.6.14 [puppet] - 10https://gerrit.wikimedia.org/r/494966 (https://phabricator.wikimedia.org/T216052) (owner: 10Gehel) [15:50:01] (03PS5) 10Vgutierrez: acme-chief-api: Add support for puppet HTTP API search operation [software/acme-chief] - 10https://gerrit.wikimedia.org/r/494957 (https://phabricator.wikimedia.org/T207295) [15:50:22] (03CR) 10Volans: [C: 04-1] "Looks ok in general, one detail to fix apart from the config handling, see inline." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/494765 (https://phabricator.wikimedia.org/T217646) (owner: 10Jbond) [15:52:11] (03CR) 10CDanis: [C: 03+1] "LGTM" (031 comment) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/492007 (https://phabricator.wikimedia.org/T215229) (owner: 10CRusnov) [16:03:19] (03PS7) 10Krinkle: errorpages: Remove unused hhvm-fatal-error.php file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412829 (https://phabricator.wikimedia.org/T113114) [16:03:24] 10Operations, 10ops-eqiad, 10DC-Ops, 10Parsoid, 10decommission: decom ruthenium - https://phabricator.wikimedia.org/T216062 (10RobH) >>! In T216062#5007976, @MoritzMuehlenhoff wrote: > The host is still in puppetdb/cumin, see e.g, puppetboard Sorry about that, I thought I ran the decom script, but since... 
[16:03:35] (03CR) 10jerkins-bot: [V: 04-1] errorpages: Remove unused hhvm-fatal-error.php file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412829 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [16:06:05] (03PS8) 10Krinkle: errorpages: Remove unused hhvm-fatal-error.php file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412829 (https://phabricator.wikimedia.org/T113114) [16:07:49] RECOVERY - Device not healthy -SMART- on db2044 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2044&var-datasource=codfw+prometheus/ops [16:10:06] (03PS3) 10Ema: varnish: text/misc frontend size-based cutoffs [puppet] - 10https://gerrit.wikimedia.org/r/494937 (https://phabricator.wikimedia.org/T216006) [16:12:15] 10Operations, 10Cloud-VPS, 10Toolforge, 10LDAP, 10Patch-For-Review: LDAP server running out of memory frequently and disrupting Cloud VPS clients - https://phabricator.wikimedia.org/T217280 (10hashar) I added nslcd debug logging on deployment-deploy01: ` name=/etc/nslcd.conf log /var/log/nslcd.log debug... 
[16:12:18] (03PS4) 10Ema: varnish: text/misc frontend size-based cutoffs [puppet] - 10https://gerrit.wikimedia.org/r/494937 (https://phabricator.wikimedia.org/T216006) [16:12:30] (03CR) 10Vgutierrez: [C: 04-2] "I think the approach followed in I286e5b65ec574ea38336a31aeab52599558d5c84 and Ib8a40a049486bc0e4a861041e56d1451c8ecef71 has way more sen" [software/acme-chief] - 10https://gerrit.wikimedia.org/r/494506 (https://phabricator.wikimedia.org/T207295) (owner: 10Vgutierrez) [16:12:51] (03PS1) 10Jbond: Cookbook to reset ipmi passwords [cookbooks] - 10https://gerrit.wikimedia.org/r/494976 [16:14:28] (03CR) 10jerkins-bot: [V: 04-1] Cookbook to reset ipmi passwords [cookbooks] - 10https://gerrit.wikimedia.org/r/494976 (owner: 10Jbond) [16:15:14] 10Operations, 10ops-codfw: rack/setup/deploy restbase2019 and restbase2020 - https://phabricator.wikimedia.org/T217368 (10Papaul) [16:16:00] 10Operations, 10Performance-Team (Radar): PHP fatal error handler not working on mwdebug servers - https://phabricator.wikimedia.org/T217846 (10Krinkle) [16:17:16] PROBLEM - HHVM rendering on mw2285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:18:08] RECOVERY - HHVM rendering on mw2285 is OK: HTTP OK: HTTP/1.1 200 OK - 74870 bytes in 0.179 second response time [16:18:54] 10Operations, 10Graphite, 10Patch-For-Review: Graphite returning server errors (out of memory?) - https://phabricator.wikimedia.org/T217679 (10CDanis) a:03Lucas_Werkmeister_WMDE Lucas, can you verify that this is resolved? [16:19:00] (03CR) 10Alexandros Kosiaris: [C: 03+1] "I'll do the private git part" [puppet] - 10https://gerrit.wikimedia.org/r/487885 (https://phabricator.wikimedia.org/T215883) (owner: 10Eevans) [16:20:40] cdanis: do you think it’s okay to run the expensive Graphite queries now? 
[16:20:53] Lucas_WMDE: yes, timeouts and max RSS are in place [16:20:56] if yes, I can try to do that and we’ll see if load explodes again [16:20:59] (03CR) 10Krinkle: "Currently blocked by T217846 to be able to verify this end-to-end." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412829 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [16:21:01] okay [16:21:07] didn’t want to try it out on my own earlier ^^ [16:21:09] so they should fail after a minute, or after they allocate too much memory [16:24:36] (03CR) 10Volans: [C: 04-1] "Much cleaner and simpler, almost there. There are a couple of errors/typo, see inline. All the rest are nitpicks." (0321 comments) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/492007 (https://phabricator.wikimedia.org/T215229) (owner: 10CRusnov) [16:25:13] okay, I got a 503 response from varnish [16:25:38] and the graphite-eqiad board looks okay so far [16:26:57] yep, looks like it timed out after 60 seconds [16:28:26] and I also figured out what was actually broken in my query, yay [16:28:29] (one missing . in a replacement) [16:29:28] 10Operations, 10Graphite, 10Patch-For-Review: Graphite returning server errors (out of memory?) - https://phabricator.wikimedia.org/T217679 (10Lucas_Werkmeister_WMDE) 05Open→03Resolved Seems so, yes. I got an error from Varnish after 60 seconds and the [graphite-eqiad board](https://grafana.wikimedia.org... [16:30:44] great! [16:35:25] (03PS2) 10Jbond: Cookbook to reset ipmi passwords [cookbooks] - 10https://gerrit.wikimedia.org/r/494976 [16:35:50] 10Operations, 10monitoring, 10Patch-For-Review: limit the impact of heavy/large graphite queries - https://phabricator.wikimedia.org/T116767 (10CDanis) 05Open→03Resolved a:03CDanis Just saw the new timeout work -- query returned a 500 status after ~60 seconds. Boldly going to call this resolved; of co... 
[16:36:52] (03CR) 10jerkins-bot: [V: 04-1] Cookbook to reset ipmi passwords [cookbooks] - 10https://gerrit.wikimedia.org/r/494976 (owner: 10Jbond) [16:38:30] (03PS3) 10Jcrespo: mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292) [16:39:23] (03CR) 10jerkins-bot: [V: 04-1] mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [16:39:35] (03Abandoned) 10Jcrespo: dbproxy: Reload automatically haproxy on configuration update [puppet] - 10https://gerrit.wikimedia.org/r/491476 (owner: 10Jcrespo) [16:42:36] (03PS1) 10Jcrespo: labsdb: Depool labsdb1009, lagging behind [puppet] - 10https://gerrit.wikimedia.org/r/494981 [16:43:13] (03PS2) 10Jcrespo: labsdb: Depool labsdb1009, lagging behind [puppet] - 10https://gerrit.wikimedia.org/r/494981 [16:44:44] (03CR) 10Jcrespo: [C: 03+2] labsdb: Depool labsdb1009, lagging behind [puppet] - 10https://gerrit.wikimedia.org/r/494981 (owner: 10Jcrespo) [16:45:27] 10Operations, 10Traffic, 10Patch-For-Review: Make cp1099 the new pinkunicorn - https://phabricator.wikimedia.org/T202966 (10Cmjohnson) @ayounsi asw2-c8 is a 1G switch....does this need to go to a 10G rack? I can move to asw2-c7 xe-7/0/6 looks open [16:45:56] (03CR) 10Marostegui: "I guess it is not warm enough after the reboot :-(" [puppet] - 10https://gerrit.wikimedia.org/r/494981 (owner: 10Jcrespo) [16:46:32] (03PS1) 10Dbarratt: Enable Partial Blocks on Arabic Wikipedia. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494982 (https://phabricator.wikimedia.org/T217283) [16:46:47] (03CR) 10Jcrespo: [C: 03+2] "No, it is not that, it started at 7:30, before the restart." 
[puppet] - 10https://gerrit.wikimedia.org/r/494981 (owner: 10Jcrespo) [16:47:25] (03CR) 10Jcrespo: [C: 03+2] "https://grafana.wikimedia.org/d/000000274/prometheus-machine-stats?panelId=3&fullscreen&orgId=1&var-server=labsdb1009&var-datasource=eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/494981 (owner: 10Jcrespo) [16:50:02] 10Operations, 10ops-codfw: rack/setup/deploy restbase2019 and restbase2020 - https://phabricator.wikimedia.org/T217368 (10Papaul) [16:52:08] (03PS2) 10Dbarratt: Enable Partial Blocks on Arabic Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494982 (https://phabricator.wikimedia.org/T217283) [16:54:38] !log powering off cp1099 to move to different rack T202966 [16:54:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:44] T202966: Make cp1099 the new pinkunicorn - https://phabricator.wikimedia.org/T202966 [16:59:25] (03CR) 10Volans: "Nice cookbook, few comments inline." (0312 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/494976 (owner: 10Jbond) [17:00:04] godog and _joe_: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190307T1700). [17:00:04] No GERRIT patches in the queue for this window AFAICS. [17:01:07] 10Operations, 10ops-eqiad, 10Traffic, 10decommission: Decommission lvs1007-1012 - https://phabricator.wikimedia.org/T208586 (10RobH) [17:05:05] (03CR) 10Marostegui: "That CPU usage pattern isn't uncommon according to this: https://grafana.wikimedia.org/d/000000274/prometheus-machine-stats?panelId=3&full" [puppet] - 10https://gerrit.wikimedia.org/r/494981 (owner: 10Jcrespo) [17:05:40] 10Operations, 10Traffic, 10Wikidata, 10Wikidata-Campsite, and 2 others: nuke_limit often reached on esams varnish frontends - https://phabricator.wikimedia.org/T216006 (10ema) >>! 
In T216006#5008346, @ema wrote: > Interestingly, the problem is not reproducible with larger objects, as varnish autonomously d... [17:05:51] (03PS3) 10Jbond: Cookbook to reset ipmi passwords [cookbooks] - 10https://gerrit.wikimedia.org/r/494976 [17:06:40] (03PS1) 10Papaul: DNS: Add mgmt and production DNS for restbase2019,2020 [dns] - 10https://gerrit.wikimedia.org/r/494984 (https://phabricator.wikimedia.org/T217368) [17:06:44] 10Operations, 10Traffic, 10Patch-For-Review: Make cp1099 the new pinkunicorn - https://phabricator.wikimedia.org/T202966 (10Cmjohnson) @ayounsi server moved [17:07:03] (03CR) 10jerkins-bot: [V: 04-1] DNS: Add mgmt and production DNS for restbase2019,2020 [dns] - 10https://gerrit.wikimedia.org/r/494984 (https://phabricator.wikimedia.org/T217368) (owner: 10Papaul) [17:07:11] 10Operations, 10ops-eqiad, 10Traffic, 10decommission: Decommission lvs1007-1012 - https://phabricator.wikimedia.org/T208586 (10RobH) lvs100[789] network port disabling: ` robh@asw-c-eqiad# show | compare [edit interfaces interface-range LVS-cross-row] - member-range xe-8/0/26 to xe-8/0/28; [edit inte... 
[17:07:16] (03CR) 10jerkins-bot: [V: 04-1] Cookbook to reset ipmi passwords [cookbooks] - 10https://gerrit.wikimedia.org/r/494976 (owner: 10Jbond) [17:07:30] (03CR) 10BBlack: [C: 03+1] varnish: text/misc frontend size-based cutoffs [puppet] - 10https://gerrit.wikimedia.org/r/494937 (https://phabricator.wikimedia.org/T216006) (owner: 10Ema) [17:08:16] (03CR) 10ArielGlenn: [C: 03+2] fix broken pages content job retries [dumps] - 10https://gerrit.wikimedia.org/r/494936 (owner: 10ArielGlenn) [17:08:45] (03PS2) 10Papaul: DNS: Add mgmt and production DNS for restbase2019,2020 [dns] - 10https://gerrit.wikimedia.org/r/494984 [17:09:05] (03CR) 10jerkins-bot: [V: 04-1] DNS: Add mgmt and production DNS for restbase2019,2020 [dns] - 10https://gerrit.wikimedia.org/r/494984 (owner: 10Papaul) [17:09:16] !log ariel@deploy1001 Started deploy [dumps/dumps@3e25558]: fix broken page-content job retries [17:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:20] !log ariel@deploy1001 Finished deploy [dumps/dumps@3e25558]: fix broken page-content job retries (duration: 00m 04s) [17:09:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:54] RECOVERY - HP RAID on db2044 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Controller: OK - Battery/Capacitor: OK [17:10:47] (03CR) 10Dmaza: [C: 03+1] Enable Partial Blocks on Arabic Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494982 (https://phabricator.wikimedia.org/T217283) (owner: 10Dbarratt) [17:11:16] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/deploy restbase2019 and restbase2020 - https://phabricator.wikimedia.org/T217368 (10Papaul) [17:12:59] 10Operations, 10ops-eqiad, 10Traffic, 10decommission: Decommission lvs1007-1012 - https://phabricator.wikimedia.org/T208586 (10RobH) [17:13:12] (03PS1) 10RobH: lvs1007-lvs1012 decommission [puppet] - 
10https://gerrit.wikimedia.org/r/494985 (https://phabricator.wikimedia.org/T208586) [17:13:38] (03CR) 10RobH: [C: 03+2] lvs1007-lvs1012 decommission [puppet] - 10https://gerrit.wikimedia.org/r/494985 (https://phabricator.wikimedia.org/T208586) (owner: 10RobH) [17:15:03] 10Operations, 10ops-eqiad, 10Traffic, 10decommission, 10Patch-For-Review: Decommission lvs1007-1012 - https://phabricator.wikimedia.org/T208586 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for lvs1007.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Rem... [17:15:15] 10Operations, 10ops-eqiad, 10Traffic, 10decommission, 10Patch-For-Review: Decommission lvs1007-1012 - https://phabricator.wikimedia.org/T208586 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for lvs1008.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Rem... [17:15:28] 10Operations, 10ops-eqiad, 10Traffic, 10decommission, 10Patch-For-Review: Decommission lvs1007-1012 - https://phabricator.wikimedia.org/T208586 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for lvs1009.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Rem... [17:16:11] (03PS4) 10Jbond: Cookbook to reset ipmi passwords [cookbooks] - 10https://gerrit.wikimedia.org/r/494976 [17:16:12] !log rolling upgrade of elasticsearch on logstash clusters completed - T216052 [17:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:15] T216052: upgrade logstash and the logstash elasticsearch cluster to 5.6.14 - https://phabricator.wikimedia.org/T216052 [17:17:45] 10Operations, 10ops-eqiad, 10Traffic, 10decommission, 10Patch-For-Review: Decommission lvs1007-1012 - https://phabricator.wikimedia.org/T208586 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for lvs1010.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Rem... 
[17:18:00] 10Operations, 10ops-eqiad, 10Traffic, 10decommission, 10Patch-For-Review: Decommission lvs1007-1012 - https://phabricator.wikimedia.org/T208586 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for lvs1011.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Rem... [17:18:13] 10Operations, 10ops-eqiad, 10Traffic, 10decommission, 10Patch-For-Review: Decommission lvs1007-1012 - https://phabricator.wikimedia.org/T208586 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for lvs1012.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Rem... [17:19:47] (03CR) 10Bstorm: "What if we did this temporarily, pending the shutdown of the old grid and noted it in related tasks (adding Bug entries)? We'll likely ne" [puppet] - 10https://gerrit.wikimedia.org/r/494922 (https://phabricator.wikimedia.org/T217280) (owner: 10GTirloni) [17:20:15] (03PS11) 10Elukey: Introduce role::labs::db::wikireplica_analytics::dedicated [puppet] - 10https://gerrit.wikimedia.org/r/494874 (https://phabricator.wikimedia.org/T215231) [17:20:32] (03CR) 10Elukey: "> May I ask for further documentation on the role, just a few lines" [puppet] - 10https://gerrit.wikimedia.org/r/494874 (https://phabricator.wikimedia.org/T215231) (owner: 10Elukey) [17:21:13] (03CR) 10jerkins-bot: [V: 04-1] Introduce role::labs::db::wikireplica_analytics::dedicated [puppet] - 10https://gerrit.wikimedia.org/r/494874 (https://phabricator.wikimedia.org/T215231) (owner: 10Elukey) [17:21:50] (03CR) 10Alaa Sarhan: [C: 03+1] Disable RDF output of mediainfo Wikibase entities [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490587 (https://phabricator.wikimedia.org/T213483) (owner: 10WMDE-leszek) [17:22:24] (03CR) 10Alaa Sarhan: [C: 03+1] Added a setting to define Wikibase entity types that have no RDF output [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490586 (https://phabricator.wikimedia.org/T213483) (owner: 10WMDE-leszek) 
[17:24:40] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [17:24:50] (03PS1) 10Jcrespo: Revert "labsdb: Depool labsdb1009, lagging behind" [puppet] - 10https://gerrit.wikimedia.org/r/494989 [17:24:59] (03PS2) 10Jcrespo: Revert "labsdb: Depool labsdb1009, lagging behind" [puppet] - 10https://gerrit.wikimedia.org/r/494989 [17:25:08] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] Revert "labsdb: Depool labsdb1009, lagging behind" [puppet] - 10https://gerrit.wikimedia.org/r/494989 (owner: 10Jcrespo) [17:25:43] (03PS2) 10RobH: decom analytics100[12] production dns [dns] - 10https://gerrit.wikimedia.org/r/494857 (https://phabricator.wikimedia.org/T205507) [17:25:44] robh: deploy? [17:26:00] lvs1007-lvs1012 decommission (d07365687b) [17:26:11] (03CR) 10RobH: [C: 03+2] decom analytics100[12] production dns [dns] - 10https://gerrit.wikimedia.org/r/494857 (https://phabricator.wikimedia.org/T205507) (owner: 10RobH) [17:26:20] I am guessing yes, but prefer your ok [17:27:44] (03PS1) 10RobH: decom lvs1007-lvs1012 production dns entries [dns] - 10https://gerrit.wikimedia.org/r/494990 (https://phabricator.wikimedia.org/T208586) [17:28:40] PROBLEM - HTTP availability for Varnish at esams on icinga2001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [17:28:45] (03CR) 10RobH: [C: 03+2] decom lvs1007-lvs1012 production dns entries [dns] - 10https://gerrit.wikimedia.org/r/494990 (https://phabricator.wikimedia.org/T208586) (owner: 10RobH) [17:29:44] robh, look at the channel! :-D [17:29:48] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga2001 is CRITICAL: cluster={cache_text,cache_upload} site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:29:50] ? [17:30:03] robh: deploy? 
lvs1007-lvs1012 decommission (d07365687b) [17:30:13] WARNING: Revision range includes commits from multiple committers! [17:30:13] ? [17:30:15] (03CR) 10BryanDavis: Introduce role::labs::db::wikireplica_analytics::dedicated (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/494874 (https://phabricator.wikimedia.org/T215231) (owner: 10Elukey) [17:30:22] jynus: sorry, what exactly is the problem? [17:30:37] puppet-merge [17:30:45] undeployed commit [17:30:48] oh, you got to it before me [17:30:49] sorry, yes [17:30:50] deploy [17:31:03] i left it half merged on puppetmaster, sorry! [17:31:10] i had it pending yes/no on my terminal [17:31:10] heh [17:31:12] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. [17:31:21] sorry about that! [17:31:24] no problem I do it sometimes, I just wanted to ask you [17:31:29] yeah, totally makes sense [17:31:31] =] [17:31:39] and was seeing your activity so you were not idle [17:31:43] (03PS12) 10Elukey: Introduce role::labs::db::wikireplica_analytics::dedicated [puppet] - 10https://gerrit.wikimedia.org/r/494874 (https://phabricator.wikimedia.org/T215231) [17:31:45] :-P [17:31:49] im killing servers! 
\o/ [17:32:00] (03CR) 10Elukey: ">" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/494874 (https://phabricator.wikimedia.org/T215231) (owner: 10Elukey) [17:32:04] so I was like, look here o/ [17:32:15] yeah, i just was moving too fast for my own good [17:32:39] (03CR) 10jerkins-bot: [V: 04-1] Introduce role::labs::db::wikireplica_analytics::dedicated [puppet] - 10https://gerrit.wikimedia.org/r/494874 (https://phabricator.wikimedia.org/T215231) (owner: 10Elukey) [17:32:42] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [17:32:48] and I needed my change to go through because we have some issues on dbs [17:33:29] 10Operations, 10ops-eqiad, 10Traffic, 10decommission: Decommission lvs1007-1012 - https://phabricator.wikimedia.org/T208586 (10RobH) [17:33:41] 10Operations, 10ops-eqiad, 10Traffic, 10decommission: Decommission lvs1007-1012 - https://phabricator.wikimedia.org/T208586 (10RobH) a:05RobH→03Cmjohnson [17:33:50] (03PS1) 10ArielGlenn: gather (almost) all maxretries vars under one config setting [dumps] - 10https://gerrit.wikimedia.org/r/494991 (https://phabricator.wikimedia.org/T217744) [17:33:54] !log gehel@deploy1001 Started deploy [logstash/plugins@7c4c5ea]: upgrade logstash plugins to 5.6.14 - T216052 [17:33:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:57] T216052: upgrade logstash and the logstash elasticsearch cluster to 5.6.14 - https://phabricator.wikimedia.org/T216052 [17:34:03] !log gehel@deploy1001 Finished deploy [logstash/plugins@7c4c5ea]: upgrade logstash plugins to 5.6.14 - T216052 (duration: 00m 08s) [17:34:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:16] RECOVERY - HTTP availability for Varnish at esams on icinga2001 is 
OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [17:34:20] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:34:40] !log gehel@deploy1001 Started deploy [logstash/plugins@7c4c5ea]: upgrade logstash plugins to 5.6.14 - T216052 [17:34:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:47] !log gehel@deploy1001 Finished deploy [logstash/plugins@7c4c5ea]: upgrade logstash plugins to 5.6.14 - T216052 (duration: 00m 07s) [17:34:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:23] (03PS1) 10Bstorm: Revert "wiki replicas: Remove reference to old comment fields" [puppet] - 10https://gerrit.wikimedia.org/r/494992 [17:35:43] (03CR) 10jerkins-bot: [V: 04-1] Revert "wiki replicas: Remove reference to old comment fields" [puppet] - 10https://gerrit.wikimedia.org/r/494992 (owner: 10Bstorm) [17:36:21] !log rolling upgrade of logstash on logstash clusters - T216052 [17:36:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:39] (03PS1) 10Paladox: Update healthcheck url [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/494994 [17:37:05] (03PS2) 10Paladox: Update healthcheck url [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/494994 [17:37:57] (03CR) 10Paladox: [V: 03+2 C: 03+2] Update healthcheck url [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/494994 (owner: 10Paladox) [17:38:45] (03CR) 10Volans: "recheck" [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/493298 (https://phabricator.wikimedia.org/T217599) (owner: 10Volans) [17:39:44] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [17:40:12] (03CR) 10Anomie: "How much longer are you planning on delaying?" [puppet] - 10https://gerrit.wikimedia.org/r/494992 (owner: 10Bstorm) [17:41:24] (03PS7) 10CRusnov: Add configuration for the ganeti->netbox sync. [puppet] - 10https://gerrit.wikimedia.org/r/493348 (https://phabricator.wikimedia.org/T215229) [17:41:40] (03CR) 10Bstorm: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/494992 (owner: 10Bstorm) [17:41:58] (03CR) 10jerkins-bot: [V: 04-1] Add configuration for the ganeti->netbox sync. [puppet] - 10https://gerrit.wikimedia.org/r/493348 (https://phabricator.wikimedia.org/T215229) (owner: 10CRusnov) [17:43:02] 10Operations: Mapping of servers to stakeholders - https://phabricator.wikimedia.org/T216088 (10herron) A pretty accurate list of stakeholders for a given host can be gleaned from the users, groups, and sudoers config deployed to it. As an alternative to keeping a separate list of stakeholders in sync manually,... [17:43:04] (03PS1) 10Paladox: Update plugins [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/494995 [17:43:59] !log rolling upgrade of logstash on logstash clusters completed - T216052 [17:44:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:03] (03PS2) 10Bstorm: Revert "wiki replicas: Remove reference to old comment fields" [puppet] - 10https://gerrit.wikimedia.org/r/494992 [17:44:03] T216052: upgrade logstash and the logstash elasticsearch cluster to 5.6.14 - https://phabricator.wikimedia.org/T216052 [17:44:03] and only kibana to go! [17:44:40] nice! 
[17:44:55] (03CR) 10Paladox: [V: 03+2 C: 03+2] Update plugins [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/494995 (owner: 10Paladox) [17:45:22] (03PS2) 10Paladox: Update plugins [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/494995 [17:45:38] (03CR) 10Paladox: [V: 03+2 C: 03+2] Update plugins [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/494995 (owner: 10Paladox) [17:46:13] (03PS3) 10Paladox: Add "image-diff" plugin [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/494631 [17:48:00] (03PS8) 10CRusnov: Add configuration for the ganeti->netbox sync. [puppet] - 10https://gerrit.wikimedia.org/r/493348 (https://phabricator.wikimedia.org/T215229) [17:48:10] 10Operations, 10Analytics, 10Analytics-Kanban, 10vm-requests, and 2 others: Create an-tool1005 (Staging environment for Superset) - https://phabricator.wikimedia.org/T217738 (10fdans) [17:48:27] !log rolling upgrade of kibana on logstash clusters - T216052 [17:48:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:02] (03CR) 10jerkins-bot: [V: 04-1] Add configuration for the ganeti->netbox sync. [puppet] - 10https://gerrit.wikimedia.org/r/493348 (https://phabricator.wikimedia.org/T215229) (owner: 10CRusnov) [17:49:21] (03PS2) 10Paladox: Add "multi-site" plugin so gerrit can have multi masters [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/494865 [17:49:23] (03CR) 10Bstorm: [C: 03+2] Revert "wiki replicas: Remove reference to old comment fields" [puppet] - 10https://gerrit.wikimedia.org/r/494992 (owner: 10Bstorm) [17:51:11] (03PS9) 10CRusnov: Add configuration for the ganeti->netbox sync. 
[puppet] - 10https://gerrit.wikimedia.org/r/493348 (https://phabricator.wikimedia.org/T215229) [17:55:13] (03PS5) 10Jbond: Cookbook to reset ipmi passwords [cookbooks] - 10https://gerrit.wikimedia.org/r/494976 [17:55:23] !log rolling upgrade of kibana on logstash clusters completed - T216052 [17:55:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:26] T216052: upgrade logstash and the logstash elasticsearch cluster to 5.6.14 - https://phabricator.wikimedia.org/T216052 [17:56:06] herron, godog: ^ ELK stack fully upgraded to 5.6.14, no issues seen or reported (afaik) [17:56:17] excellent! [17:58:44] (03CR) 10Jcrespo: "@BryanDavis: Do you know if there are wikireplica management changes that have to happen outside of the database hosts? That would be my on" [puppet] - 10https://gerrit.wikimedia.org/r/494874 (https://phabricator.wikimedia.org/T215231) (owner: 10Elukey) [17:59:38] (03PS6) 10Jbond: Cookbook to reset ipmi passwords [cookbooks] - 10https://gerrit.wikimedia.org/r/494976 [18:00:04] cscott, arlolra, subbu, halfak, and Amir1: It is that lovely time of the day again! You are hereby commanded to deploy Services – Graphoid / Parsoid / Citoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190307T1800). 
[18:01:00] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work), 10Patch-For-Review: upgrade logstash and the logstash elasticsearch cluster to 5.6.14 - https://phabricator.wikimedia.org/T216052 (10Gehel) a:03Gehel [18:06:54] (03CR) 10Jbond: Cookbook to reset ipmi passwords (0311 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/494976 (owner: 10Jbond) [18:07:15] (03CR) 10Bstorm: "> Patch Set 12:" [puppet] - 10https://gerrit.wikimedia.org/r/494874 (https://phabricator.wikimedia.org/T215231) (owner: 10Elukey) [18:07:32] 10Operations, 10Cloud-VPS, 10Toolforge, 10LDAP, 10Patch-For-Review: LDAP server running out of memory frequently and disrupting Cloud VPS clients - https://phabricator.wikimedia.org/T217280 (10hashar) nslcd debug log dumps each of the fields returned for a given query. That can be found by grepping for `... [18:09:35] (03PS1) 10Bstorm: wiki replicas: Remove reference to old comment fields [puppet] - 10https://gerrit.wikimedia.org/r/494999 (https://phabricator.wikimedia.org/T212972) [18:10:33] (03CR) 10jerkins-bot: [V: 04-1] wiki replicas: Remove reference to old comment fields [puppet] - 10https://gerrit.wikimedia.org/r/494999 (https://phabricator.wikimedia.org/T212972) (owner: 10Bstorm) [18:10:41] (03PS10) 10CRusnov: Add configuration for the ganeti->netbox sync. [puppet] - 10https://gerrit.wikimedia.org/r/493348 (https://phabricator.wikimedia.org/T215229) [18:11:50] (03PS2) 10Bstorm: wiki replicas: Remove reference to old comment fields [puppet] - 10https://gerrit.wikimedia.org/r/494999 (https://phabricator.wikimedia.org/T212972) [18:13:25] (03CR) 10Elukey: "Any particular concern about this code change?" 
[puppet] - 10https://gerrit.wikimedia.org/r/494874 (https://phabricator.wikimedia.org/T215231) (owner: 10Elukey) [18:13:34] (03CR) 10Dzahn: [C: 04-1] "10.193.2.201 is already used for cloudvirt2003-dev:" [dns] - 10https://gerrit.wikimedia.org/r/494984 (owner: 10Papaul) [18:13:46] 10Operations, 10ops-eqiad, 10netops: Decommission asw-c-eqiad - https://phabricator.wikimedia.org/T208734 (10ayounsi) [18:14:41] (03PS2) 10Dzahn: icinga: set notes_url for Icinga meta checks [puppet] - 10https://gerrit.wikimedia.org/r/494738 [18:15:49] !log disable asw2-c-eqiad <-> asw-c-eqiad link - T208734 [18:15:52] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on icinga2001 is CRITICAL: 56.46 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [18:15:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:53] T208734: Decommission asw-c-eqiad - https://phabricator.wikimedia.org/T208734 [18:16:01] (03CR) 10Paladox: [V: 03+2 C: 03+2] "Verified locally that it builds." [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/494631 (owner: 10Paladox) [18:16:11] (03CR) 10Dzahn: [C: 04-1] DNS: Add mgmt and production DNS for restbase2019,2020 (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/494984 (owner: 10Papaul) [18:16:56] (03CR) 10Dzahn: [C: 03+2] icinga: set notes_url for Icinga meta checks [puppet] - 10https://gerrit.wikimedia.org/r/494738 (owner: 10Dzahn) [18:18:16] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on icinga2001 is OK: (C)60 le (W)70 le 84.86 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [18:18:44] 10Operations, 10Cloud-VPS, 10Toolforge, 10LDAP, 10Patch-For-Review: LDAP server running out of memory frequently and disrupting Cloud VPS clients - https://phabricator.wikimedia.org/T217280 (10hashar) I mentioned it previously sorry. 
LDAP spam from deployment-deploy01 is due to #Keyholder / T204681. It... [18:19:56] i will be doing a parsoid deploy in about 15 mins or so. [18:23:13] (03CR) 10CRusnov: "https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/15036/" [puppet] - 10https://gerrit.wikimedia.org/r/493348 (https://phabricator.wikimedia.org/T215229) (owner: 10CRusnov) [18:24:52] (03CR) 10Jcrespo: "> > Patch Set 12:" [puppet] - 10https://gerrit.wikimedia.org/r/494874 (https://phabricator.wikimedia.org/T215231) (owner: 10Elukey) [18:25:14] !log cleaning kernel-proposed-updates component on reprepro (install1002) [18:25:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:22] (03CR) 10Jcrespo: [C: 03+1] Introduce role::labs::db::wikireplica_analytics::dedicated [puppet] - 10https://gerrit.wikimedia.org/r/494874 (https://phabricator.wikimedia.org/T215231) (owner: 10Elukey) [18:25:50] vgutierrez: ^^ [18:25:59] (03PS3) 10Papaul: DNS: Add mgmt and production DNS for restbase2019,2020 [dns] - 10https://gerrit.wikimedia.org/r/494984 [18:26:03] (03PS1) 10Ayounsi: Remove asw-c-eqiad from monitoring [puppet] - 10https://gerrit.wikimedia.org/r/495007 (https://phabricator.wikimedia.org/T208734) [18:26:21] (03PS1) 10Dzahn: icinga: add notes URLs to various monitoring checks, part 4 [puppet] - 10https://gerrit.wikimedia.org/r/495008 [18:26:30] !log mbsantos@deploy1001 Started deploy [tilerator/deploy@fac7e5e] (stretch): Updating eqiad cluster before repool of maps2004.codfw.wmnet [18:26:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:32] thx gehel <3 [18:27:03] (03CR) 10jerkins-bot: [V: 04-1] icinga: add notes URLs to various monitoring checks, part 4 [puppet] - 10https://gerrit.wikimedia.org/r/495008 (owner: 10Dzahn) [18:27:16] (03PS2) 10Ayounsi: Remove asw-c-eqiad from monitoring [puppet] - 10https://gerrit.wikimedia.org/r/495007 (https://phabricator.wikimedia.org/T208734) [18:28:23] (03PS11) 10CRusnov: Add 
configuration for the ganeti->netbox sync. [puppet] - 10https://gerrit.wikimedia.org/r/493348 (https://phabricator.wikimedia.org/T215229) [18:30:16] !log mbsantos@deploy1001 Finished deploy [tilerator/deploy@fac7e5e] (stretch): Updating eqiad cluster before repool of maps2004.codfw.wmnet (duration: 03m 46s) [18:30:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:51] !log mbsantos@deploy1001 Started deploy [kartotherian/deploy@248b8c4] (stretch): Updating eqiad cluster before repool of maps2004.codfw.wmnet [18:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:53] (03CR) 10Ayounsi: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/15037/" [puppet] - 10https://gerrit.wikimedia.org/r/495007 (https://phabricator.wikimedia.org/T208734) (owner: 10Ayounsi) [18:32:16] !log mbsantos@deploy1001 Finished deploy [kartotherian/deploy@248b8c4] (stretch): Updating eqiad cluster before repool of maps2004.codfw.wmnet (duration: 01m 25s) [18:32:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:33] gehel: ^ [18:33:38] (03Abandoned) 10Paladox: Update healthcheck [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/489734 (owner: 10Paladox) [18:35:26] (03CR) 10Bstorm: "> Patch Set 12:" [puppet] - 10https://gerrit.wikimedia.org/r/494874 (https://phabricator.wikimedia.org/T215231) (owner: 10Elukey) [18:37:12] (03CR) 10Bstorm: [C: 03+1] Introduce role::labs::db::wikireplica_analytics::dedicated [puppet] - 10https://gerrit.wikimedia.org/r/494874 (https://phabricator.wikimedia.org/T215231) (owner: 10Elukey) [18:38:44] I removed asw-c-eqiad from Icinga, should not have any impact, but last time it caused an icinga issue so let me know if you see something suspect [18:39:24] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=maps,name=maps2004.codfw.wmnet [18:39:25] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:44] 10Operations, 10ops-eqiad, 10decommission: Decommission conf100[1-3] - https://phabricator.wikimedia.org/T206626 (10RobH) [18:41:52] (03PS1) 10Dzahn: lvs/icinga/services: add notes_urls [puppet] - 10https://gerrit.wikimedia.org/r/495009 [18:42:56] (03CR) 10jerkins-bot: [V: 04-1] lvs/icinga/services: add notes_urls [puppet] - 10https://gerrit.wikimedia.org/r/495009 (owner: 10Dzahn) [18:43:09] 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: Decommission asw-c-eqiad - https://phabricator.wikimedia.org/T208734 (10ayounsi) [18:43:30] 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: Decommission asw-c-eqiad - https://phabricator.wikimedia.org/T208734 (10ayounsi) a:05ayounsi→03Cmjohnson [18:45:32] (03CR) 10Dzahn: [C: 03+2] DNS: Add mgmt and production DNS for restbase2019,2020 [dns] - 10https://gerrit.wikimedia.org/r/494984 (owner: 10Papaul) [18:46:12] (03PS4) 10Dzahn: DNS: Add mgmt and production DNS for restbase2019,2020 [dns] - 10https://gerrit.wikimedia.org/r/494984 (owner: 10Papaul) [18:47:24] (03CR) 10Volans: "Looks mostly good to me. 
Voting 0 as I think there are still a couple of improvements possible but technically it should already work as e" (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/494976 (owner: 10Jbond) [18:49:22] (03PS2) 10Dzahn: lvs/icinga/services: add notes_urls [puppet] - 10https://gerrit.wikimedia.org/r/495009 [18:50:11] (03CR) 10jerkins-bot: [V: 04-1] lvs/icinga/services: add notes_urls [puppet] - 10https://gerrit.wikimedia.org/r/495009 (owner: 10Dzahn) [18:51:02] (03CR) 10Volans: "recheck" [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/493298 (https://phabricator.wikimedia.org/T217599) (owner: 10Volans) [18:51:18] !log arlolra@deploy1001 Started deploy [parsoid/deploy@766a920]: Updating Parsoid to d4e76d5 [18:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:57] 10Operations, 10ops-eqiad, 10decommission: Decommission conf100[1-3] - https://phabricator.wikimedia.org/T206626 (10RobH) network ports disabled and noted here on task for later removal: ` robh@asw2-a-eqiad> show interfaces descriptions | grep conf1001 ge-2/0/20 up down conf1001 robh@asw2-a-eqiad... 
[18:52:50] (03PS3) 10Dzahn: lvs/icinga/services: add notes_urls [puppet] - 10https://gerrit.wikimedia.org/r/495009 [18:54:27] (03PS1) 10RobH: decom conf100[123] production dns [dns] - 10https://gerrit.wikimedia.org/r/495010 (https://phabricator.wikimedia.org/T206626) [18:55:34] (03PS1) 10RobH: conf100[123] decom [puppet] - 10https://gerrit.wikimedia.org/r/495011 (https://phabricator.wikimedia.org/T206626) [18:55:47] (03CR) 10RobH: [C: 03+2] decom conf100[123] production dns [dns] - 10https://gerrit.wikimedia.org/r/495010 (https://phabricator.wikimedia.org/T206626) (owner: 10RobH) [18:56:19] !log arlolra@deploy1001 Finished deploy [parsoid/deploy@766a920]: Updating Parsoid to d4e76d5 (duration: 05m 01s) [18:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:03] (03PS2) 10RobH: conf100[123] decom [puppet] - 10https://gerrit.wikimedia.org/r/495011 (https://phabricator.wikimedia.org/T206626) [18:57:16] (03PS1) 10Paladox: WIP: Update gerrit to 2.16.6 [software/gerrit] (deploy/wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/495012 [18:57:18] (03CR) 10Hashar: "recheck" [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/493298 (https://phabricator.wikimedia.org/T217599) (owner: 10Volans) [18:57:30] (03CR) 10RobH: [C: 03+2] conf100[123] decom [puppet] - 10https://gerrit.wikimedia.org/r/495011 (https://phabricator.wikimedia.org/T206626) (owner: 10RobH) [18:59:35] (03PS3) 10Paladox: Add "multi-site" plugin so gerrit can have multi masters [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/494865 [19:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Morning SWAT (Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190307T1900). 
[19:00:04] davidwbarratt: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:53] 10Operations, 10ops-eqiad, 10decommission: Decommission conf100[1-3] - https://phabricator.wikimedia.org/T206626 (10RobH) a:05RobH→03Cmjohnson [19:01:12] here! [19:02:11] (03PS2) 10Paladox: WIP: Update gerrit to 2.16.6 [software/gerrit] (deploy/wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/495012 [19:02:13] who's swatting? [19:03:25] I can do it. [19:03:33] hauskatze: You got something too? [19:03:51] Niharika: I had one, yes [19:04:03] !log Updated Parsoid to d4e76d5 (T202905) [19:04:03] (03CR) 10Volans: [C: 03+2] icinga: add check_icinga script [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/493298 (https://phabricator.wikimedia.org/T217599) (owner: 10Volans) [19:04:04] yet to be made, but very very simple [19:04:05] (03CR) 10Niharika29: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494982 (https://phabricator.wikimedia.org/T217283) (owner: 10Dbarratt) [19:04:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:08] T202905: Outreach-17 Project: Add a new Linter Category: Links-in-Links - https://phabricator.wikimedia.org/T202905 [19:04:13] hauskatze: Add it to the queue.
:) [19:04:15] I'll add it soon [19:05:08] (03Merged) 10jenkins-bot: Enable Partial Blocks on Arabic Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494982 (https://phabricator.wikimedia.org/T217283) (owner: 10Dbarratt) [19:05:49] (03PS14) 10CRusnov: Add ganeti->netbox sync script [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/492007 (https://phabricator.wikimedia.org/T215229) [19:06:04] (03CR) 10jenkins-bot: Enable Partial Blocks on Arabic Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494982 (https://phabricator.wikimedia.org/T217283) (owner: 10Dbarratt) [19:06:37] (03PS1) 10MarcoAurelio: Grant 'reupload-shared' to mediawiki uploaders [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495015 (https://phabricator.wikimedia.org/T217523) [19:06:41] Niharika: I'm on it [19:06:47] 5' or less [19:07:29] Niharika: please ping me when the SWAT is over [19:07:35] (03PS3) 10Jbond: Add config file and exclude_mounts options to debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/494764 (https://phabricator.wikimedia.org/T217646) [19:07:37] (03PS3) 10Jbond: Update wmf-auto-restarts to read exclude mounts from debdeploy config [puppet] - 10https://gerrit.wikimedia.org/r/494765 (https://phabricator.wikimedia.org/T217646) [19:08:38] davidwbarratt: Your change is on mwdebug1002. [19:08:56] (03PS2) 10MarcoAurelio: Grant 'reupload-shared' to mediawiki uploaders [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495015 (https://phabricator.wikimedia.org/T217523) [19:09:13] Niharika great! let me take a look [19:09:57] Niharika looks good to me! [19:10:15] davidwbarratt: Okie dokie. Syncing it out in a sec. [19:10:22] Niharika Thanks! 
[19:10:55] Niharika: added to the queue [19:11:24] ooops [19:11:32] (03CR) 10MarcoAurelio: [C: 04-1] Grant 'reupload-shared' to mediawiki uploaders [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495015 (https://phabricator.wikimedia.org/T217523) (owner: 10MarcoAurelio) [19:11:56] hauskatze: to swat or not to swat? [19:12:07] (03PS3) 10MarcoAurelio: Grant 'reupload-shared' to mediawiki uploaders [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495015 (https://phabricator.wikimedia.org/T217523) [19:12:07] !log niharika29@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable Partial Blocks on Arabic Wikipedia T217283 (duration: 00m 50s) [19:12:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:10] T217283: Enable partial Blocks on Arabic Wikipedia on March 12, 2019 - https://phabricator.wikimedia.org/T217283 [19:12:10] davidwbarratt: You're all set! [19:12:32] Niharika: fixed, it's good to go [19:12:40] commonswiki -> mediawikiwiki [19:12:49] (03CR) 10Jbond: Update wmf-auto-restarts to read exclude mounts from debdeploy config (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/494765 (https://phabricator.wikimedia.org/T217646) (owner: 10Jbond) [19:12:50] rush not good counselor [19:13:14] (03CR) 10Niharika29: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495015 (https://phabricator.wikimedia.org/T217523) (owner: 10MarcoAurelio) [19:13:46] Niharika YAY! Thanks! [19:16:13] (03PS4) 10Jbond: Add config file and exclude_mounts options to debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/494764 (https://phabricator.wikimedia.org/T217646) [19:16:15] hauskatze: It's not merged for some reason. Did you mark it as WIP? 
[19:16:44] Niharika: ah, damn, yes [19:16:48] created via the UI [19:16:53] == WIP by default [19:16:58] un-WIP-ed [19:17:04] (03CR) 10Niharika29: [C: 03+2] Grant 'reupload-shared' to mediawiki uploaders [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495015 (https://phabricator.wikimedia.org/T217523) (owner: 10MarcoAurelio) [19:17:15] Ah, got it. [19:18:04] (03Merged) 10jenkins-bot: Grant 'reupload-shared' to mediawiki uploaders [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495015 (https://phabricator.wikimedia.org/T217523) (owner: 10MarcoAurelio) [19:18:19] (03CR) 10jenkins-bot: Grant 'reupload-shared' to mediawiki uploaders [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495015 (https://phabricator.wikimedia.org/T217523) (owner: 10MarcoAurelio) [19:19:34] Niharika: let me know when it's on the canary to test it :) [19:20:03] hauskatze: It is on mwdebug1002 now. [19:20:13] checking! [19:20:38] Niharika: checked, working as expected [19:22:19] !log niharika29@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Grant 'reupload-shared' to mediawiki uploaders and fix T217523 (duration: 00m 49s) [19:22:20] hauskatze: Shipped. 🛳 [19:22:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:21] T217523: Restrict local uploads to commons for MediaWiki.org - https://phabricator.wikimedia.org/T217523 [19:22:31] gilles: All done with the SWAT. [19:22:37] Niharika: thanks! [19:22:57] (03PS3) 10Gilles: Enable Priority Hints origin trial on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494921 (https://phabricator.wikimedia.org/T216499) [19:25:35] (03CR) 10Anomie: [C: 03+1] "Seems sane. Haven't tested." 
[puppet] - 10https://gerrit.wikimedia.org/r/494999 (https://phabricator.wikimedia.org/T212972) (owner: 10Bstorm) [19:25:44] 10Operations, 10decommission, 10User-jijiki: Reclaim rdb2001, rdb2002 - https://phabricator.wikimedia.org/T209425 (10RobH) a:03RobH [19:26:53] (03CR) 10Gilles: [C: 03+2] Enable Priority Hints origin trial on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494921 (https://phabricator.wikimedia.org/T216499) (owner: 10Gilles) [19:27:56] (03Merged) 10jenkins-bot: Enable Priority Hints origin trial on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494921 (https://phabricator.wikimedia.org/T216499) (owner: 10Gilles) [19:28:57] (03CR) 10jenkins-bot: Enable Priority Hints origin trial on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494921 (https://phabricator.wikimedia.org/T216499) (owner: 10Gilles) [19:30:32] (03PS1) 10Jdlrobson: Enable advanced mobile contributions mode on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495023 [19:31:52] 10Operations, 10decommission, 10User-jijiki: Reclaim rdb2001, rdb2002 - https://phabricator.wikimedia.org/T209425 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for rdb2001.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Removed from PuppetDB - Downtimed hos... [19:32:12] 10Operations, 10decommission, 10User-jijiki: Reclaim rdb2001, rdb2002 - https://phabricator.wikimedia.org/T209425 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for rdb2002.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Removed from PuppetDB - Downtimed hos... 
[19:33:20] !log gilles@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T216499 Enable Priority Hints origin trial on ruwiki (duration: 00m 48s) [19:33:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:23] T216499: Priority Hints origin trial - https://phabricator.wikimedia.org/T216499 [19:34:05] 10Operations, 10decommission, 10User-jijiki: Reclaim rdb2001, rdb2002 - https://phabricator.wikimedia.org/T209425 (10RobH) [19:34:49] (03PS1) 10Jdlrobson: Cleanup beta cluster config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495024 (https://phabricator.wikimedia.org/T213599) [19:36:44] (03PS1) 10RobH: rdb200[12] prod dns decom [dns] - 10https://gerrit.wikimedia.org/r/495026 (https://phabricator.wikimedia.org/T209425) [19:39:23] (03PS1) 10RobH: rdb200[12] decom [puppet] - 10https://gerrit.wikimedia.org/r/495027 (https://phabricator.wikimedia.org/T209425) [19:39:29] (03CR) 10RobH: [C: 03+2] rdb200[12] prod dns decom [dns] - 10https://gerrit.wikimedia.org/r/495026 (https://phabricator.wikimedia.org/T209425) (owner: 10RobH) [19:40:00] (03PS3) 10Paladox: WIP: Update gerrit to 2.16.6 [software/gerrit] (deploy/wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/495012 [19:40:25] (03CR) 10RobH: [C: 03+2] rdb200[12] decom [puppet] - 10https://gerrit.wikimedia.org/r/495027 (https://phabricator.wikimedia.org/T209425) (owner: 10RobH) [19:41:15] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10User-jijiki: Reclaim rdb2001, rdb2002 - https://phabricator.wikimedia.org/T209425 (10RobH) a:05RobH→03Papaul [19:49:28] (03CR) 10CRusnov: Add ganeti->netbox sync script (0321 comments) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/492007 (https://phabricator.wikimedia.org/T215229) (owner: 10CRusnov) [19:56:09] (03CR) 10CRusnov: [C: 03+1] "Looks fine to me." 
(032 comments) [dns] - 10https://gerrit.wikimedia.org/r/481833 (owner: 10Volans) [20:00:04] Deploy window MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190307T2000) [20:00:04] thcipriani: #bothumor My software never has bugs. It just develops random features. Rise for Gerrit 2.15.11 Upgrade. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190307T2000). [20:02:25] * thcipriani preps gerrit upgrade [20:02:48] :) [20:03:10] (03CR) 10Thcipriani: [V: 03+2] Gerrit 2.15.11 release [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/494858 (https://phabricator.wikimedia.org/T214359) (owner: 10Thcipriani) [20:07:35] !log thcipriani@deploy1001 Started deploy [gerrit/gerrit@5800deb]: Gerrit to 2.15.11 on gerrit2001 only [20:07:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:48] !log thcipriani@deploy1001 Finished deploy [gerrit/gerrit@5800deb]: Gerrit to 2.15.11 on gerrit2001 only (duration: 00m 12s) [20:07:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:33] looks like the deploy is ok on gerrit2001, doing cobalt now [20:08:38] followed by restart [20:09:26] :) [20:09:53] !log thcipriani@deploy1001 Started deploy [gerrit/gerrit@5800deb]: Gerrit to 2.15.11 on cobalt (production) [20:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:04] !log thcipriani@deploy1001 Finished deploy [gerrit/gerrit@5800deb]: Gerrit to 2.15.11 on cobalt (production) (duration: 00m 11s) [20:10:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:42] !log restarting gerrit on cobalt for 2.15.11 upgrade [20:10:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:59] PROBLEM - Gerrit Health Check on gerrit.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Server Error - 961 bytes in 0.300 second response time 
https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [20:15:03] PROBLEM - puppet last run on stat1004 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 3 minutes ago with 3 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config],Exec[git_pull_mediawiki/event-schemas],Exec[git_pull_statistics_mediawiki] [20:15:04] hrm, gerrit seems fine, the healthcheck plugin seems to behave differently [20:16:55] failing on auth [20:17:00] which is a new feature [20:19:17] (03PS7) 10Jbond: Cookbook to reset ipmi passwords [cookbooks] - 10https://gerrit.wikimedia.org/r/494976 [20:22:58] (03CR) 10Jbond: "updated, I also wonder if i should update the file name to use hyphen instead of underscore as this seems to be the current trend?" (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/494976 (owner: 10Jbond) [20:28:47] I'm writing a patch to disable the auth healthcheck now [20:30:04] 10Operations, 10ops-codfw: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T217755 (10Marostegui) 05Open→03Resolved a:05jcrespo→03Papaul Thanks @Papaul - this is now fixed ` logicaldrive 1 (3.3 TB, RAID 1+0, OK) physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK)... [20:32:26] (03CR) 10BryanDavis: "> What if we did this temporarily, pending the shutdown of the old" [puppet] - 10https://gerrit.wikimedia.org/r/494922 (https://phabricator.wikimedia.org/T217280) (owner: 10GTirloni) [20:38:38] of course disabling the auth check of the healthcheck plugin still results in a 500 error. 
[20:39:01] * thcipriani rolls back gerrit [20:40:39] !log thcipriani@deploy1001 Started deploy [gerrit/gerrit@5800deb]: Revert "Gerrit to 2.15.11" on gerrit2001 only [20:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:50] !log thcipriani@deploy1001 Finished deploy [gerrit/gerrit@5800deb]: Revert "Gerrit to 2.15.11" on gerrit2001 only (duration: 00m 10s) [20:40:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:53] !log thcipriani@deploy1001 Started deploy [gerrit/gerrit@5800deb]: Revert "Gerrit to 2.15.11" on cobalt (production) [20:41:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:01] !log thcipriani@deploy1001 Finished deploy [gerrit/gerrit@5800deb]: Revert "Gerrit to 2.15.11" on cobalt (production) (duration: 00m 07s) [20:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:22] !log restarting gerrit on cobalt for 2.15.11 rollback [20:42:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:53] RECOVERY - Gerrit Health Check on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 868 bytes in 0.304 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [20:44:10] !log explicitly disable sampling on non-eqiad routers [20:44:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:05] RECOVERY - puppet last run on stat1004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [20:50:11] PROBLEM - puppet last run on db1125 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures.
Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [21:16:13] RECOVERY - puppet last run on db1125 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [21:34:10] 10Operations, 10ops-esams, 10DC-Ops, 10decommission: decom bast3003 (65R8Q4J, formerly amslvs4) - https://phabricator.wikimedia.org/T216199 (10RobH) a:05RobH→03None [21:41:02] 10Operations, 10ops-eqiad, 10DC-Ops, 10Wikimedia-Logstash, and 2 others: Decommission old eqiad logstash hardware hosts logstash100[456] - https://phabricator.wikimedia.org/T217556 (10RobH) a:03RobH [21:41:22] 10Operations, 10ops-eqiad, 10DC-Ops, 10Wikimedia-Logstash, and 2 others: Decommission old eqiad logstash hardware hosts logstash100[456] - https://phabricator.wikimedia.org/T217556 (10RobH) Decision on reclaim or decommission: These hosts were purchased on April 13, 2015, and support expired in April 2018.... [21:42:45] 10Operations, 10ops-eqiad, 10DC-Ops, 10Wikimedia-Logstash, and 2 others: Decommission old eqiad logstash hardware hosts logstash100[456] - https://phabricator.wikimedia.org/T217556 (10RobH) [21:45:12] 10Operations, 10ops-eqiad, 10DC-Ops, 10Wikimedia-Logstash, and 2 others: Decommission old eqiad logstash hardware hosts logstash100[456] - https://phabricator.wikimedia.org/T217556 (10RobH) Chatted with @faidon about this over IRC, we can dispose of these rather than reclaim to spares. So they'll get adde... 
[21:46:10] 10Operations, 10ops-eqiad, 10DC-Ops, 10Wikimedia-Logstash, and 2 others: Decommission old eqiad logstash hardware hosts logstash100[456] - https://phabricator.wikimedia.org/T217556 (10RobH) [21:54:17] (03PS1) 10Papaul: DHCP Partman: Add DHCP MAC and partman for restbase2018,2020 [puppet] - 10https://gerrit.wikimedia.org/r/495138 [21:55:12] (03CR) 10jerkins-bot: [V: 04-1] DHCP Partman: Add DHCP MAC and partman for restbase2018,2020 [puppet] - 10https://gerrit.wikimedia.org/r/495138 (owner: 10Papaul) [22:00:36] 10Operations, 10ops-eqiad, 10DC-Ops, 10Wikimedia-Logstash, and 2 others: Decommission old eqiad logstash hardware hosts logstash100[456] - https://phabricator.wikimedia.org/T217556 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for logstash1004.eqiad.wmnet and performed the following act... [22:00:44] 10Operations, 10ops-eqiad, 10DC-Ops, 10Wikimedia-Logstash, and 2 others: Decommission old eqiad logstash hardware hosts logstash100[456] - https://phabricator.wikimedia.org/T217556 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for logstash1005.eqiad.wmnet and performed the following act... [22:00:56] 10Operations, 10ops-eqiad, 10DC-Ops, 10Wikimedia-Logstash, and 2 others: Decommission old eqiad logstash hardware hosts logstash100[456] - https://phabricator.wikimedia.org/T217556 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for logstash1006.eqiad.wmnet and performed the following act... 
[22:01:09] (03PS2) 10Papaul: DHCP Partman: Add DHCP MAC and partman for restbase2018,2020 [puppet] - 10https://gerrit.wikimedia.org/r/495138 (https://phabricator.wikimedia.org/T217368) [22:03:23] (03PS1) 10RobH: logstash100[456] decommission [puppet] - 10https://gerrit.wikimedia.org/r/495142 (https://phabricator.wikimedia.org/T217556) [22:03:27] 10Operations, 10Keyholder, 10Release-Engineering-Team (Backlog): Keyholder phab repo duplicate work - https://phabricator.wikimedia.org/T203003 (10hashar) [22:04:45] (03PS1) 10RobH: decom logstash100[456] prod dns [dns] - 10https://gerrit.wikimedia.org/r/495143 (https://phabricator.wikimedia.org/T217556) [22:05:10] (03CR) 10RobH: [C: 03+2] decom logstash100[456] prod dns [dns] - 10https://gerrit.wikimedia.org/r/495143 (https://phabricator.wikimedia.org/T217556) (owner: 10RobH) [22:05:15] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/deploy restbase2019 and restbase2020 - https://phabricator.wikimedia.org/T217368 (10Papaul) [22:05:35] (03CR) 10RobH: [C: 03+2] logstash100[456] decommission [puppet] - 10https://gerrit.wikimedia.org/r/495142 (https://phabricator.wikimedia.org/T217556) (owner: 10RobH) [22:09:19] 10Operations, 10ops-eqiad, 10DC-Ops, 10Wikimedia-Logstash, and 2 others: Decommission old eqiad logstash hardware hosts logstash100[456] - https://phabricator.wikimedia.org/T217556 (10RobH) [22:09:56] 10Operations, 10ops-eqiad, 10DC-Ops, 10Wikimedia-Logstash, and 2 others: Decommission old eqiad logstash hardware hosts logstash100[456] - https://phabricator.wikimedia.org/T217556 (10RobH) a:05RobH→03Cmjohnson [22:34:41] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/deploy restbase2019 and restbase2020 - https://phabricator.wikimedia.org/T217368 (10Papaul) ` papaul@asw-a-codfw# run show interfaces ge-5/0/8 descriptions Interface Admin Link Description ge-5/0/8 up up restbase2019 {master:7}[... 
[22:34:56] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/deploy restbase2019 and restbase2020 - https://phabricator.wikimedia.org/T217368 (10Papaul) [22:42:12] 10Operations, 10TechCom, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Backlog (Later), and 4 others: Establish an SLA for session storage - https://phabricator.wikimedia.org/T211721 (10aaron) The SET metric for redis is very slow, so wouldn't use 10x that figure. Central... [23:40:47] !log depool dns2001 - T209989 [23:40:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:40:50] T209989: Bird multihop BFD - https://phabricator.wikimedia.org/T209989 [23:43:26] !log ayounsi@puppetmaster1001 conftool action : set/pooled=no; selector: name=dns2001.wikimedia.org,service=pdns_recursor [23:43:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:46:38] !log set net.ipv4.ip_local_port_range="49152 65535" on dns2001 - T209989 [23:46:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:46:41] T209989: Bird multihop BFD - https://phabricator.wikimedia.org/T209989 [23:53:16] !log set net.ipv4.ip_local_port_range="32768 60999" on dns2001 and repool server - T209989 [23:53:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:53:18] T209989: Bird multihop BFD - https://phabricator.wikimedia.org/T209989 [23:54:21] (03PS1) 10Thcipriani: Revert "Gerrit 2.15.11 release" [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/495151 [23:54:51] (03CR) 10Thcipriani: [V: 03+2 C: 03+2] Revert "Gerrit 2.15.11 release" [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/495151 (owner: 10Thcipriani) [23:55:51] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga2001 is CRITICAL: 58.67 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1