[00:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Your horoscope predicts another unfortunate Evening SWAT (Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190306T0000).
[00:00:04] Smalyshev, bmansurov, Urbanecm, kemayo, and kostajh: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:08] here
[00:00:10] here
[00:00:16] Present.
[00:00:24] I can swat
[00:03:00] ok bmansurov it looks like yours is first :)
[00:03:06] ok cool
[00:03:08] (PS11) CDanis: add uwsgi worker timeouts + max RSS for graphite [puppet] - https://gerrit.wikimedia.org/r/494620 (https://phabricator.wikimedia.org/T116767)
[00:03:57] (CR) 20after4: [C: +2] "Merging for SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/493236 (https://phabricator.wikimedia.org/T217080) (owner: Bmansurov)
[00:05:06] (Merged) jenkins-bot: Disable reader demographics survey [mediawiki-config] - https://gerrit.wikimedia.org/r/493236 (https://phabricator.wikimedia.org/T217080) (owner: Bmansurov)
[00:07:05] bmansurov: is this even testable on mwdebug or should we just do it?
[00:07:08] (PS12) CDanis: add uwsgi worker timeouts + max RSS for graphite [puppet] - https://gerrit.wikimedia.org/r/494620 (https://phabricator.wikimedia.org/T116767)
[00:07:14] twentyafterfour: yes it's testable
[00:07:18] ok
[00:08:47] (CR) jenkins-bot: Disable reader demographics survey [mediawiki-config] - https://gerrit.wikimedia.org/r/493236 (https://phabricator.wikimedia.org/T217080) (owner: Bmansurov)
[00:09:01] ok it's sync'd to mwdebug1001
[00:09:11] bmansurov: does that look ok?
[00:09:16] twentyafterfour: checking
[00:10:02] twentyafterfour: looks good, please ship it.
[00:10:27] (PS13) CDanis: graphite: uwsgi workers: set timeouts + max RSS [puppet] - https://gerrit.wikimedia.org/r/494620 (https://phabricator.wikimedia.org/T116767)
[00:10:35] (CR) CDanis: "https://puppet-compiler.wmflabs.org/compiler1002/14986/" [puppet] - https://gerrit.wikimedia.org/r/494620 (https://phabricator.wikimedia.org/T116767) (owner: CDanis)
[00:10:36] ok shipping it ;)
[00:11:03] (CR) CDanis: "I still am Puppet-clueless, so reviewers, please tell me if there is a better way to do this." [puppet] - https://gerrit.wikimedia.org/r/494620 (https://phabricator.wikimedia.org/T116767) (owner: CDanis)
[00:12:39] !log twentyafterfour@deploy1001 Synchronized wmf-config/InitialiseSettings.php: sync https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/493236/ for SWAT. refs T217080 (duration: 00m 56s)
[00:12:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:12:42] T217080: Deploy/Undeploy Quicksurvey for reader demographics pilot - https://phabricator.wikimedia.org/T217080
[00:12:56] kostajh: yours is next :)
[00:13:10] twentyafterfour: sounds good
[00:13:19] twentyafterfour: thank you!
[00:14:49] ok that's going to take a few minutes in CI so I'm going to deploy kemayo's in the meantime
[00:14:58] (CR) 20after4: [C: +2] Oversample metrics for mobile visualeditor [mediawiki-config] - https://gerrit.wikimedia.org/r/494271 (https://phabricator.wikimedia.org/T212253) (owner: DLynch)
[00:14:59] 🎉
[00:16:03] (Merged) jenkins-bot: Oversample metrics for mobile visualeditor [mediawiki-config] - https://gerrit.wikimedia.org/r/494271 (https://phabricator.wikimedia.org/T212253) (owner: DLynch)
[00:17:11] (PS1) Paladox: Add "image-diff" plugin [software/gerrit] (wmf/stable-2.16) - https://gerrit.wikimedia.org/r/494631
[00:17:14] Kemayo: that should be live on mwdebug1001 now can you test?
[00:17:25] (PS2) Paladox: Add "image-diff" plugin [software/gerrit] (wmf/stable-2.16) - https://gerrit.wikimedia.org/r/494631
[00:18:14] Sure, one second.
[00:20:10] (CR) jenkins-bot: Oversample metrics for mobile visualeditor [mediawiki-config] - https://gerrit.wikimedia.org/r/494271 (https://phabricator.wikimedia.org/T212253) (owner: DLynch)
[00:22:01] twentyafterfour: Seems okay.
[00:22:47] Kemayo: ok deploying
[00:22:53] 👍
[00:23:28] twentyafterfour: sorry, I got distracted - can we deploy my patch too?
[00:23:49] !log twentyafterfour@deploy1001 Synchronized wmf-config/mobile.php: sync https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/494271/ for SWAT refs T212253 (duration: 00m 56s)
[00:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:23:52] T212253: Start oversampling all mobile visual editor EditAttemptStep events - https://phabricator.wikimedia.org/T212253
[00:26:20] SMalyshev: sure right after kostajh's patch
[00:26:26] coolio
[00:26:33] kostajh: can you test? it's live on mwdebug1001
[00:26:40] twentyafterfour: testing now
[00:29:54] twentyafterfour: looks good!
[00:32:27] kostajh: syncing
[00:33:05] (CR) 20after4: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/494524 (https://phabricator.wikimedia.org/T217276) (owner: Smalyshev)
[00:34:19] SMalyshev: now yours ...
[00:34:46] (PS2) 20after4: Enable WikibaseCirrusSearch loading on testwikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/494524 (https://phabricator.wikimedia.org/T217276) (owner: Smalyshev)
[00:36:12] (CR) 20after4: [C: +2] "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/494524 (https://phabricator.wikimedia.org/T217276) (owner: Smalyshev)
[00:37:29] SMalyshev: can you test on mwdebug1001 ?
[00:37:35] should be ready to test
[00:37:42] ok testing
[00:39:27] twentyafterfour: seems to be working fine
[00:39:48] I think you can deploy everywhere
[00:40:18] (Abandoned) Smalyshev: Enable WikibaseCirrusSearch loading on testwikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/489598 (https://phabricator.wikimedia.org/T217276) (owner: Smalyshev)
[00:40:24] (PS3) Paladox: Scap: upgrade cloud VPS to 3.9.2-1 [puppet] - https://gerrit.wikimedia.org/r/493317 (https://phabricator.wikimedia.org/T217287) (owner: Thcipriani)
[00:40:32] syncing it
[00:40:43] (CR) jerkins-bot: [V: -1] Scap: upgrade cloud VPS to 3.9.2-1 [puppet] - https://gerrit.wikimedia.org/r/493317 (https://phabricator.wikimedia.org/T217287) (owner: Thcipriani)
[00:41:15] !log twentyafterfour@deploy1001 Synchronized wmf-config/InitialiseSettings.php: sync https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/494524/ for SWAT refs T217276 (duration: 00m 55s)
[00:41:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:41:18] T217276: Test WikibaseCirrusSearch on testwikidata - https://phabricator.wikimedia.org/T217276
[00:42:24] ok unless Urbanecm is here then I guess that's all for SWAT
[00:42:34] Urbanecm: are you available to test?
[00:43:11] actually I don't know how we can even test throttle rules ... https://gerrit.wikimedia.org/r/#/c/494500/
[00:43:37] (CR) jenkins-bot: Enable WikibaseCirrusSearch loading on testwikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/494524 (https://phabricator.wikimedia.org/T217276) (owner: Smalyshev)
[00:46:05] (PS2) Smalyshev: Run WikibaseCirrusSearch code for search on testwikidatawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/493629 (https://phabricator.wikimedia.org/T217276)
[00:46:36] any reason I shouldn't just go ahead and deploy this throttle rule exception? seems safe enough
[00:47:18] (PS1) Smalyshev: Enable loading WikibaseCirrusSearch (disabled) on production wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/494632
[00:49:01] (CR) 20after4: [C: +2] New throttle rule for Czech Senior Citizen Write Wikipedia course [mediawiki-config] - https://gerrit.wikimedia.org/r/494500 (https://phabricator.wikimedia.org/T217663) (owner: Urbanecm)
[00:50:08] (Merged) jenkins-bot: New throttle rule for Czech Senior Citizen Write Wikipedia course [mediawiki-config] - https://gerrit.wikimedia.org/r/494500 (https://phabricator.wikimedia.org/T217663) (owner: Urbanecm)
[00:52:11] well nothing seems broken on mwdebug ... deploying https://gerrit.wikimedia.org/r/#/c/494500/
[00:55:01] (CR) jenkins-bot: New throttle rule for Czech Senior Citizen Write Wikipedia course [mediawiki-config] - https://gerrit.wikimedia.org/r/494500 (https://phabricator.wikimedia.org/T217663) (owner: Urbanecm)
[00:55:01] !log finished US Evening SWAT.
[00:55:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:23:50] (CR) BryanDavis: [C: +1] package_builder: Add docs for BUILD_HOME [puppet] - https://gerrit.wikimedia.org/r/494419 (owner: Alexandros Kosiaris)
[01:25:37] (CR) BryanDavis: "> Hm in that case, it's probably better to follow the docs and per" [puppet] - https://gerrit.wikimedia.org/r/494155 (owner: BryanDavis)
[02:01:59] mobrovac: any reason that JobQueueEventBus always uses DeferredUpdates::addCallableUpdate? We already have JobQueueGroup::lazyPush() for when callers do not care about whether enqueue failed.
[02:02:43] I think renameuser is meant to bail/rollback on such failures.
[03:17:10] (Abandoned) BryanDavis: pbuilder: Ensure ~pbuilder exists and is writable [puppet] - https://gerrit.wikimedia.org/r/494155 (owner: BryanDavis)
[03:31:31] (PS1) Andrew Bogott: bootstrapvz: work around Buster not yet identifying itself as Debian 10 [puppet] - https://gerrit.wikimedia.org/r/494639
[03:32:26] (CR) Andrew Bogott: [C: +2] bootstrapvz: work around Buster not yet identifying itself as Debian 10 [puppet] - https://gerrit.wikimedia.org/r/494639 (owner: Andrew Bogott)
[04:00:04] kart_: I, the Bot under the Fountain, allow thee, The Deployer, to do deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190306T0400).
[04:00:50] !log Started manual run of unpublished ContentTranslation draft purge script (T217310)
[04:00:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:00:53] T217310: Run unpublished draft purge script for CX (Week of 03/03) - https://phabricator.wikimedia.org/T217310
[04:24:32] (PS1) Andrew Bogott: bootstrapvz: don't use the 'puppet' plugin for buster [puppet] - https://gerrit.wikimedia.org/r/494643 (https://phabricator.wikimedia.org/T216781)
[04:29:00] (CR) Andrew Bogott: [C: +2] bootstrapvz: don't use the 'puppet' plugin for buster [puppet] - https://gerrit.wikimedia.org/r/494643 (https://phabricator.wikimedia.org/T216781) (owner: Andrew Bogott)
[04:40:52] Operations, monitoring: INCIDENT: k8s@codfw prometheus queries disabled -- very slow to execute some queries - https://phabricator.wikimedia.org/T217715 (CDanis)
[04:49:34] (PS11) CRusnov: Add ganeti->netbox sync script [software/netbox-deploy] - https://gerrit.wikimedia.org/r/492007 (https://phabricator.wikimedia.org/T215229)
[05:50:45] !log Finished manual run of unpublished ContentTranslation draft purge script (T217310)
[05:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:50:49] T217310: Run unpublished draft purge script for CX (Week of 03/03) - https://phabricator.wikimedia.org/T217310
[06:05:45] Operations, Cloud-VPS, Toolforge, LDAP: groups: cannot find name for group ID - https://phabricator.wikimedia.org/T217280 (hashar) I have additionally mailed `ops-l` to raise awareness :)
[06:17:35] (Abandoned) Hashar: aptrepo: change Jenkins upstream URL [puppet] - https://gerrit.wikimedia.org/r/485685 (owner: Hashar)
[06:18:50] (Abandoned) Hashar: gerrit: Increase httpd.threads in gerrit config [puppet] - https://gerrit.wikimedia.org/r/489475 (owner: Paladox)
[06:22:05] Operations, ops-eqiad, Analytics, Patch-For-Review, User-Elukey: rack/setup/install labsdb1012.eqiad.wmnet - https://phabricator.wikimedia.org/T215231 (Marostegui) The data transfer from labsdb1011 to labsdb1012 finished. I have deleted all the following files on labsdb1012 before starting my...
[06:27:06] (PS1) Marostegui: db-eqiad.php: Depool db1105 [mediawiki-config] - https://gerrit.wikimedia.org/r/494644
[06:27:28] !log Add labsdb1012 to tendril and zarcillo - T215231
[06:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:27:31] T215231: rack/setup/install labsdb1012.eqiad.wmnet - https://phabricator.wikimedia.org/T215231
[06:27:52] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:29:00] PROBLEM - puppet last run on an-worker1084 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/bash_autologout.sh]
[06:29:04] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1105 [mediawiki-config] - https://gerrit.wikimedia.org/r/494644 (owner: Marostegui)
[06:30:18] (Merged) jenkins-bot: db-eqiad.php: Depool db1105 [mediawiki-config] - https://gerrit.wikimedia.org/r/494644 (owner: Marostegui)
[06:31:57] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1105 for MySQL upgrade (duration: 01m 14s)
[06:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:32:09] !log Stop MySQL on db1105 for MySQL upgrade
[06:32:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:32:16] PROBLEM - puppet last run on ms-be1049 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ferm/conf.d/00_main]
[06:32:48] (CR) jenkins-bot: db-eqiad.php: Depool db1105 [mediawiki-config] - https://gerrit.wikimedia.org/r/494644 (owner: Marostegui)
[06:36:41] marostegui: \o/
[06:37:14] :)
[06:38:46] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational
[06:46:09] (PS1) Marostegui: db-eqiad.php: Slowly repool db1105 [mediawiki-config] - https://gerrit.wikimedia.org/r/494645
[06:49:19] (CR) Marostegui: [C: +2] db-eqiad.php: Slowly repool db1105 [mediawiki-config] - https://gerrit.wikimedia.org/r/494645 (owner: Marostegui)
[06:50:31] (Merged) jenkins-bot: db-eqiad.php: Slowly repool db1105 [mediawiki-config] - https://gerrit.wikimedia.org/r/494645 (owner: Marostegui)
[06:53:03] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1105 after MySQL upgrade (duration: 00m 56s)
[06:53:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:04] (CR) jenkins-bot: db-eqiad.php: Slowly repool db1105 [mediawiki-config] - https://gerrit.wikimedia.org/r/494645 (owner: Marostegui)
[06:58:14] RECOVERY - puppet last run on ms-be1049 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:00:16] RECOVERY - puppet last run on an-worker1084 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[07:00:54] (PS1) Marostegui: db-eqiad.php: Increase traffic for db1105 [mediawiki-config] - https://gerrit.wikimedia.org/r/494646
[07:03:46] (CR) Marostegui: [C: +2] db-eqiad.php: Increase traffic for db1105 [mediawiki-config] - https://gerrit.wikimedia.org/r/494646 (owner: Marostegui)
[07:05:35] (Merged) jenkins-bot: db-eqiad.php: Increase traffic for db1105 [mediawiki-config] - https://gerrit.wikimedia.org/r/494646 (owner: Marostegui)
[07:06:48] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Increase traffic db1105 after MySQL upgrade (duration: 00m 56s)
[07:06:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:38] (CR) jenkins-bot: db-eqiad.php: Increase traffic for db1105 [mediawiki-config] - https://gerrit.wikimedia.org/r/494646 (owner: Marostegui)
[07:09:43] !log raised analytics user's max_user_connection from 10 to 100 on labsdb1012 - T215231
[07:09:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:09:46] T215231: rack/setup/install labsdb1012.eqiad.wmnet - https://phabricator.wikimedia.org/T215231
[07:13:54] Operations, ops-eqiad, Analytics, Patch-For-Review, User-Elukey: rack/setup/install labsdb1012.eqiad.wmnet - https://phabricator.wikimedia.org/T215231 (elukey) Open→Resolved
[07:14:25] (PS1) Marostegui: db-eqiad.php: Increase traffic for db1105 [mediawiki-config] - https://gerrit.wikimedia.org/r/494647
[07:16:08] Operations, ops-eqiad, Analytics, Patch-For-Review: Decommission dbstore1002 - https://phabricator.wikimedia.org/T216491 (elukey)
[07:17:13] (PS1) Marostegui: labsdb1012: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/494648 (https://phabricator.wikimedia.org/T215231)
[07:17:19] elukey: ^
[07:17:46] Operations, ops-eqiad, Analytics, Patch-For-Review: Decommission dbstore1002 - https://phabricator.wikimedia.org/T216491 (elukey) No complaints or outages after the shutdown of dbstore1002, I think that we are good to keep going with the decom. "@Marostegui this is control tower, you are clear t...
[07:18:31] (CR) Elukey: [C: +2] labsdb1012: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/494648 (https://phabricator.wikimedia.org/T215231) (owner: Marostegui)
[07:18:41] marostegui: thanks!
[07:18:58] thanks!
[07:18:59] merged!
[07:19:18] ouch I was merging as well
[07:19:41] I have merged on 2001 successfully, failed on the rest
[07:19:46] you have probably the opposite?
[07:20:30] failed on 2001
[07:21:04] buen
[07:21:09] *bueno
[07:21:20] XD
[07:21:59] Operations, ops-eqiad, Analytics, Patch-For-Review: Decommission dbstore1002 - https://phabricator.wikimedia.org/T216491 (Marostegui) \o/ Clear to decommission! :)
[07:25:45] (PS1) Marostegui: dbstore1002: Set it to spare [puppet] - https://gerrit.wikimedia.org/r/494649 (https://phabricator.wikimedia.org/T216491)
[07:25:52] (PS4) Dzahn: Scap: upgrade cloud VPS to 3.9.2-1 [puppet] - https://gerrit.wikimedia.org/r/493317 (https://phabricator.wikimedia.org/T217287) (owner: Thcipriani)
[07:27:10] (CR) Dzahn: [C: +2] Scap: upgrade cloud VPS to 3.9.2-1 [puppet] - https://gerrit.wikimedia.org/r/493317 (https://phabricator.wikimedia.org/T217287) (owner: Thcipriani)
[07:27:23] (CR) Elukey: "There is also a dbstore.my.cnf.erb mentioning dbstore1002, shall we delete it as well?" [puppet] - https://gerrit.wikimedia.org/r/494649 (https://phabricator.wikimedia.org/T216491) (owner: Marostegui)
[07:27:56] (CR) Marostegui: "I will delete it once I delete the whole role" [puppet] - https://gerrit.wikimedia.org/r/494649 (https://phabricator.wikimedia.org/T216491) (owner: Marostegui)
[07:28:08] (CR) Marostegui: "https://puppet-compiler.wmflabs.org/compiler1002/14987/" [puppet] - https://gerrit.wikimedia.org/r/494649 (https://phabricator.wikimedia.org/T216491) (owner: Marostegui)
[07:28:11] ah!
[07:28:22] (CR) Elukey: [C: +1] dbstore1002: Set it to spare [puppet] - https://gerrit.wikimedia.org/r/494649 (https://phabricator.wikimedia.org/T216491) (owner: Marostegui)
[07:28:27] yeah, let's put the host down first, and then delete the whole role
[07:28:40] (PS2) Marostegui: dbstore1002: Set it to spare [puppet] - https://gerrit.wikimedia.org/r/494649 (https://phabricator.wikimedia.org/T216491)
[07:32:04] (PS7) Dzahn: webperf: Rename /xenon to /arclamp on performance.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/493247 (https://phabricator.wikimedia.org/T176916) (owner: Krinkle)
[07:32:18] (CR) Marostegui: [C: +2] dbstore1002: Set it to spare [puppet] - https://gerrit.wikimedia.org/r/494649 (https://phabricator.wikimedia.org/T216491) (owner: Marostegui)
[07:34:07] !log Remove dbstore1002 from tendril and zarcillo T216491
[07:34:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:10] T216491: Decommission dbstore1002 - https://phabricator.wikimedia.org/T216491
[07:36:49] (CR) Marostegui: [C: +2] db-eqiad.php: Increase traffic for db1105 [mediawiki-config] - https://gerrit.wikimedia.org/r/494647 (owner: Marostegui)
[07:37:28] Operations, ops-eqiad, Analytics, Patch-For-Review: Decommission dbstore1002 - https://phabricator.wikimedia.org/T216491 (Marostegui)
[07:38:05] (Merged) jenkins-bot: db-eqiad.php: Increase traffic for db1105 [mediawiki-config] - https://gerrit.wikimedia.org/r/494647 (owner: Marostegui)
[07:38:59] Operations, ops-eqiad, Analytics, Patch-For-Review: Decommission dbstore1002 - https://phabricator.wikimedia.org/T216491 (Marostegui) a:RobH Ready for @RobH to do the next steps. @RobH can you give this some priority for the steps that include the power down of this host? It is a trusty host...
[07:40:15] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Increase traffic db1105 after MySQL upgrade (duration: 00m 56s)
[07:40:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:52] (CR) jenkins-bot: db-eqiad.php: Increase traffic for db1105 [mediawiki-config] - https://gerrit.wikimedia.org/r/494647 (owner: Marostegui)
[07:44:22] (PS1) Marostegui: db-eqiad.php: Fully repool db1105 [mediawiki-config] - https://gerrit.wikimedia.org/r/494653
[07:45:52] (CR) Marostegui: [C: +2] db-eqiad.php: Fully repool db1105 [mediawiki-config] - https://gerrit.wikimedia.org/r/494653 (owner: Marostegui)
[07:46:56] (Merged) jenkins-bot: db-eqiad.php: Fully repool db1105 [mediawiki-config] - https://gerrit.wikimedia.org/r/494653 (owner: Marostegui)
[07:47:10] (PS8) Dzahn: webperf: Rename /xenon to /arclamp on performance.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/493247 (https://phabricator.wikimedia.org/T176916) (owner: Krinkle)
[07:48:06] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1105 after MySQL upgrade (duration: 00m 56s)
[07:48:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:13] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/14988/webperf1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/493247 (https://phabricator.wikimedia.org/T176916) (owner: Krinkle)
[07:50:28] (PS1) Marostegui: db-eqiad.php: Depool db1090 [mediawiki-config] - https://gerrit.wikimedia.org/r/494655
[07:54:38] (CR) jenkins-bot: db-eqiad.php: Fully repool db1105 [mediawiki-config] - https://gerrit.wikimedia.org/r/494653 (owner: Marostegui)
[07:54:48] (CR) Dzahn: "deployed. https://performance.wikimedia.org/xenon redirects to https://performance.wikimedia.org/arclamp/ as expected" [puppet] - https://gerrit.wikimedia.org/r/493247 (https://phabricator.wikimedia.org/T176916) (owner: Krinkle)
[07:55:52] (PS3) Dzahn: xhgui: require php-mongodb package [puppet] - https://gerrit.wikimedia.org/r/494422 (https://phabricator.wikimedia.org/T180761)
[07:59:36] (CR) Dzahn: [C: +2] xhgui: require php-mongodb package [puppet] - https://gerrit.wikimedia.org/r/494422 (https://phabricator.wikimedia.org/T180761) (owner: Dzahn)
[08:01:26] (CR) Dzahn: "on webperf1002: Notice: /Stage[main]/Packages::Php_mongodb/Package[php-mongodb]/ensure: created" [puppet] - https://gerrit.wikimedia.org/r/494422 (https://phabricator.wikimedia.org/T180761) (owner: Dzahn)
[08:05:32] (PS7) Dzahn: noc: apache -> httpd module [puppet] - https://gerrit.wikimedia.org/r/416751
[08:11:21] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1090 [mediawiki-config] - https://gerrit.wikimedia.org/r/494655 (owner: Marostegui)
[08:11:57] (PS1) Marostegui: Revert "dbproxy1010: Depool labsdb1011" [puppet] - https://gerrit.wikimedia.org/r/494658
[08:12:04] (PS2) Marostegui: Revert "dbproxy1010: Depool labsdb1011" [puppet] - https://gerrit.wikimedia.org/r/494658
[08:12:51] (CR) Marostegui: [C: +2] Revert "dbproxy1010: Depool labsdb1011" [puppet] - https://gerrit.wikimedia.org/r/494658 (owner: Marostegui)
[08:13:12] (Merged) jenkins-bot: db-eqiad.php: Depool db1090 [mediawiki-config] - https://gerrit.wikimedia.org/r/494655 (owner: Marostegui)
[08:14:01] !log Reload haproxy on dbproxy1010 to repool labsdb1011
[08:14:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:34] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1090 for MySQL upgrade (duration: 00m 56s)
[08:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:03] !log Stop MySQL on db1090 for mysql upgrade
[08:15:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:15] Operations, MediaWiki-Cache, Performance-Team (Radar), User-Elukey: Consider raising Memcached MWObject cache memory size limit - https://phabricator.wikimedia.org/T217731 (elukey) p:Triage→Normal
[08:18:16] (CR) jenkins-bot: db-eqiad.php: Depool db1090 [mediawiki-config] - https://gerrit.wikimedia.org/r/494655 (owner: Marostegui)
[08:18:27] Operations, MediaWiki-Cache, Performance-Team (Radar), User-Elukey: Consider raising Memcached MWObject cache memory size limit - https://phabricator.wikimedia.org/T217731 (elukey)
[08:19:27] Operations, MediaWiki-Cache, Performance-Team (Radar), User-Elukey: Consider raising Memcached MWObject cache memory size limit - https://phabricator.wikimedia.org/T217731 (elukey) a:aaron→None
[08:20:16] (PS2) Alexandros Kosiaris: package_builder: Add docs for BUILD_HOME [puppet] - https://gerrit.wikimedia.org/r/494419
[08:20:29] (CR) Alexandros Kosiaris: [C: +2] package_builder: Add docs for BUILD_HOME [puppet] - https://gerrit.wikimedia.org/r/494419 (owner: Alexandros Kosiaris)
[08:20:40] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1090" [mediawiki-config] - https://gerrit.wikimedia.org/r/494661
[08:27:12] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool db1090" [mediawiki-config] - https://gerrit.wikimedia.org/r/494661 (owner: Marostegui)
[08:28:16] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1090" [mediawiki-config] - https://gerrit.wikimedia.org/r/494661 (owner: Marostegui)
[08:29:47] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1090" [mediawiki-config] - https://gerrit.wikimedia.org/r/494661 (owner: Marostegui)
[08:30:37] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1090 after MySQL upgrade (duration: 00m 59s)
[08:30:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:45] Operations, monitoring, Wikimedia-Incident: INCIDENT: k8s@codfw prometheus queries disabled -- very slow to execute some queries - https://phabricator.wikimedia.org/T217715 (Aklapper)
[08:40:54] (PS8) Dzahn: noc: apache -> httpd module [puppet] - https://gerrit.wikimedia.org/r/416751
[08:41:28] (CR) jerkins-bot: [V: -1] noc: apache -> httpd module [puppet] - https://gerrit.wikimedia.org/r/416751 (owner: Dzahn)
[08:42:06] !log lvs300[12]: upgrade linux to 4.9.144-3.1, reboot for L1TF kernel/microcode updates T203011
[08:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:20] !log Deploy schema change on s3 codfw, this will generate lag on codfw - T86342
[08:47:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:23] T86342: Dropping page.page_no_title_convert on wmf databases - https://phabricator.wikimedia.org/T86342
[08:48:06] !log increase citoid traffic to kubernetes infrastructure to 50%
[08:48:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:25] !log increase citoid traffic to kubernetes infrastructure to 50% T213194
[08:48:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:28] T213194: Migrate citoid to kubernetes - https://phabricator.wikimedia.org/T213194
[08:48:39] !log akosiaris@cumin1001 conftool action : set/weight=15; selector: dc=codfw,service=citoid,cluster=scb,name=kubernetes.*
[08:48:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:53] !log akosiaris@cumin1001 conftool action : set/weight=12; selector: dc=eqiad,service=citoid,cluster=scb,name=kubernetes.*
[08:48:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:11] (PS9) Dzahn: noc: apache -> httpd module [puppet] - https://gerrit.wikimedia.org/r/416751
[08:51:39] (CR) Filippo Giunchedi: [C: +1] "nit inline, otherwise LGTM, thanks for tackling this!" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/494620 (https://phabricator.wikimedia.org/T116767) (owner: CDanis)
[08:54:32] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/14990/" [puppet] - https://gerrit.wikimedia.org/r/416751 (owner: Dzahn)
[08:54:51] (PS10) Dzahn: noc: apache -> httpd module [puppet] - https://gerrit.wikimedia.org/r/416751
[09:00:42] (CR) Sau226: "I've scheduled this for EU midday SWAT on March 7th and will be rebasing this patch soon after this comment is posted." [mediawiki-config] - https://gerrit.wikimedia.org/r/492447 (https://phabricator.wikimedia.org/T214765) (owner: Sau226)
[09:00:48] (PS7) Sau226: Restore bureaucrat rights on hi.wiktionary to default [mediawiki-config] - https://gerrit.wikimedia.org/r/492447 (https://phabricator.wikimedia.org/T214765)
[09:01:18] switching noc.wm.org to httpd module (mwmaint1002)
[09:01:49] !log switching noc.wikimedia.org from apache to httpd module (mwmaint2001, then mwmaint1002)
[09:01:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:37] (CR) Dzahn: "noop on mwmaint2001 and mwmaint1002" [puppet] - https://gerrit.wikimedia.org/r/416751 (owner: Dzahn)
[09:05:13] !log removed debmonitor host entry for ruthenium (T216062)
[09:05:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:16] T216062: decom ruthenium - https://phabricator.wikimedia.org/T216062
[09:06:21] (PS1) Ema: ATS: re-enable icinga notifications [puppet] - https://gerrit.wikimedia.org/r/494665
[09:06:41] moritzm: removing ruthenium also meant removing one exception in enforce-users-groups https://gerrit.wikimedia.org/r/c/operations/puppet/+/490407
[09:08:05] (CR) Ema: [C: +2] ATS: re-enable icinga notifications [puppet] - https://gerrit.wikimedia.org/r/494665 (owner: Ema)
[09:10:09] (PS1) Urbanecm: Add "how to add a photo" link instead of notability [mediawiki-config] - https://gerrit.wikimedia.org/r/494668 (https://phabricator.wikimedia.org/T217391)
[09:11:13] !log lvs200[123]: upgrade linux to 4.9.144-3.1, reboot for L1TF kernel/microcode updates T203011
[09:11:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:29] Operations, monitoring, Wikimedia-Incident: INCIDENT: k8s@codfw prometheus queries disabled -- very slow to execute some queries - https://phabricator.wikimedia.org/T217715 (fgiunchedi) Thanks @CDanis and other folks that helped investigate! Off top of my head other problems that will need followup...
[09:12:22] (PS1) Urbanecm: Throttle Exception for Art+Feminism event Eindhoven 8th March [mediawiki-config] - https://gerrit.wikimedia.org/r/494669 (https://phabricator.wikimedia.org/T217676)
[09:14:31] (PS3) Vgutierrez: acme-chief-api: Add support for puppet HTTP API search operation [software/acme-chief] - https://gerrit.wikimedia.org/r/494506 (https://phabricator.wikimedia.org/T207295)
[09:16:37] (CR) jerkins-bot: [V: -1] acme-chief-api: Add support for puppet HTTP API search operation [software/acme-chief] - https://gerrit.wikimedia.org/r/494506 (https://phabricator.wikimedia.org/T207295) (owner: Vgutierrez)
[09:20:23] ema: 4.9.144 is already in stretch/stretch-updates?
[09:20:38] (good morning BTW) [09:20:58] (03PS2) 10Urbanecm: Change links in cswiki Help Panel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494668 (https://phabricator.wikimedia.org/T217391) [09:21:40] joal, next [09:21:43] jouncebot, next [09:21:43] In 2 hour(s) and 38 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190306T1200) [09:21:44] sorry joal [09:22:02] 10Operations, 10Wikimedia-Mailing-lists: Please create an All Affiliates mailing list - https://phabricator.wikimedia.org/T217736 (10Elitre) [09:22:04] vgutierrez: morning! Yes, kindly provided by linux-image-4.9.0-8-amd64 from stretch-updates [09:22:11] lovely [09:22:23] well, by moritz really :) [09:22:35] I'm going to revert https://gerrit.wikimedia.org/r/c/operations/puppet/+/484199 then [09:22:49] 10Operations, 10Wikimedia-Mailing-lists: Please create an All Affiliates mailing list - https://phabricator.wikimedia.org/T217736 (10Elitre) I understand that @herron may be the person in charge here. Many thanks in advance! [09:22:54] ema: we needed a kernel >= 4.9.134 to get rid of the cp1075-1099 NIC issues [09:25:36] (03PS1) 10Vgutierrez: Revert "cache: Add kernel-proposed-updates component for cp1075-99" [puppet] - 10https://gerrit.wikimedia.org/r/494671 (https://phabricator.wikimedia.org/T203194) [09:25:40] vgutierrez: yup, I see "bnxt_en: Fix TX timeout during netpoll" in changelog.Debian.gz [09:26:09] ema: indeed, that + FW upgrade on the NICs made the trick [09:26:35] (03CR) 10jerkins-bot: [V: 04-1] Revert "cache: Add kernel-proposed-updates component for cp1075-99" [puppet] - 10https://gerrit.wikimedia.org/r/494671 (https://phabricator.wikimedia.org/T203194) (owner: 10Vgutierrez) [09:26:45] uh? 
[09:27:15] I love our commit validator [09:27:21] but you already knew that :) [09:27:39] (03CR) 10Elukey: [C: 04-1] "Will create a more specific role" [puppet] - 10https://gerrit.wikimedia.org/r/494473 (https://phabricator.wikimedia.org/T212243) (owner: 10Elukey) [09:27:42] (03Abandoned) 10Elukey: Assign role::analytics_cluster::superset to analytics-tool1004 [puppet] - 10https://gerrit.wikimedia.org/r/494473 (https://phabricator.wikimedia.org/T212243) (owner: 10Elukey) [09:28:21] (03PS2) 10Vgutierrez: Revert "cache: Add kernel-proposed-updates component for cp1075-99" [puppet] - 10https://gerrit.wikimedia.org/r/494671 (https://phabricator.wikimedia.org/T203194) [09:29:03] (03Restored) 10Elukey: Assign role::analytics_cluster::superset to analytics-tool1004 [puppet] - 10https://gerrit.wikimedia.org/r/494473 (https://phabricator.wikimedia.org/T212243) (owner: 10Elukey) [09:31:23] (03CR) 10Ema: [C: 03+1] Revert "cache: Add kernel-proposed-updates component for cp1075-99" [puppet] - 10https://gerrit.wikimedia.org/r/494671 (https://phabricator.wikimedia.org/T203194) (owner: 10Vgutierrez) [09:31:25] (03Abandoned) 10Vgutierrez: Revert "cache: Add kernel-proposed-updates component for cp1075-99" [puppet] - 10https://gerrit.wikimedia.org/r/494671 (https://phabricator.wikimedia.org/T203194) (owner: 10Vgutierrez) [09:31:34] ema: sorry about that :( [09:31:51] vgutierrez: no worries! :) [09:32:02] ema: let's make sure that puppet runs first with ensure => absent on cp1075-99 to effectively get rid of the apt component [09:32:32] 10Operations, 10Wikimedia-Mailing-lists: Please create an All Affiliates mailing list - https://phabricator.wikimedia.org/T217736 (10Elitre) p:05Triage→03Unbreak! 
[09:32:36] vgutierrez: sounds good, yes [09:36:09] (03PS1) 10Vgutierrez: cache: get rid of the wikimedia-kernel-updates apt component (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/494673 (https://phabricator.wikimedia.org/T203194) [09:36:11] (03PS1) 10Vgutierrez: cache: get rid of the wikimedia-kernel-updates apt component (2/2) [puppet] - 10https://gerrit.wikimedia.org/r/494674 (https://phabricator.wikimedia.org/T203194) [09:37:53] (03CR) 10Muehlenhoff: [C: 03+1] cache: get rid of the wikimedia-kernel-updates apt component (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/494673 (https://phabricator.wikimedia.org/T203194) (owner: 10Vgutierrez) [09:38:00] 10Operations, 10Analytics, 10vm-requests, 10User-Elukey: Create an-tool1005 (Staging environment for Superset) - https://phabricator.wikimedia.org/T217738 (10elukey) p:05Triage→03Normal [09:38:23] (03CR) 10Vgutierrez: [C: 03+2] cache: get rid of the wikimedia-kernel-updates apt component (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/494673 (https://phabricator.wikimedia.org/T203194) (owner: 10Vgutierrez) [09:38:24] (03PS1) 10Elukey: Allocate AAAA/A/PTR records for an-tool1005 [dns] - 10https://gerrit.wikimedia.org/r/494676 (https://phabricator.wikimedia.org/T217738) [09:38:37] (03PS2) 10Vgutierrez: cache: get rid of the wikimedia-kernel-updates apt component (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/494673 (https://phabricator.wikimedia.org/T203194) [09:39:54] (03CR) 10Muehlenhoff: cache: get rid of the wikimedia-kernel-updates apt component (2/2) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/494674 (https://phabricator.wikimedia.org/T203194) (owner: 10Vgutierrez) [09:41:28] (03CR) 10Vgutierrez: cache: get rid of the wikimedia-kernel-updates apt component (2/2) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/494674 (https://phabricator.wikimedia.org/T203194) (owner: 10Vgutierrez) [09:43:19] (03CR) 10Muehlenhoff: [C: 03+1] cache: get rid of the wikimedia-kernel-updates apt 
component (2/2) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/494674 (https://phabricator.wikimedia.org/T203194) (owner: 10Vgutierrez) [09:45:03] (03CR) 10Elukey: [C: 03+2] Allocate AAAA/A/PTR records for an-tool1005 [dns] - 10https://gerrit.wikimedia.org/r/494676 (https://phabricator.wikimedia.org/T217738) (owner: 10Elukey) [09:47:51] (03PS2) 10Gehel: elasticsearch: exit the JVM on OutOfMemoryError [puppet] - 10https://gerrit.wikimedia.org/r/487787 (https://phabricator.wikimedia.org/T76090) [09:49:10] (03CR) 10Gehel: [C: 03+2] elasticsearch: exit the JVM on OutOfMemoryError [puppet] - 10https://gerrit.wikimedia.org/r/487787 (https://phabricator.wikimedia.org/T76090) (owner: 10Gehel) [09:51:40] PROBLEM - puppet last run on acmechief2001 is CRITICAL: CRITICAL: Puppet has 5 failures. Last run 5 minutes ago with 5 failures. Failed resources (up to 3 shown): Package[ack-grep],Package[nagios-plugins],Package[nagios-plugins-basic],Package[nagios-plugins-standard] [09:52:38] moritzm: ^^ :) [09:52:52] (03CR) 10Dzahn: "> Yeah, there's T213769 and other tasks for those bits. The stack is C-2'ed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482100 (https://phabricator.wikimedia.org/T212865) (owner: 10Jforrester) [09:53:47] vgutierrez: puppet tried its usual nagios-plugins->ensure dance while acmechief2001 was upgraded, this should recover soon [09:54:00] ack :D [09:55:18] 10Operations, 10Analytics, 10RESTBase, 10Traffic, and 2 others: Verify that hit/miss stats in WebRequest are correct - https://phabricator.wikimedia.org/T215987 (10ema) Is there anything to do here? 
:-) [09:56:41] (03PS2) 10Dzahn: xhgui: setup git cloning and apache site [puppet] - 10https://gerrit.wikimedia.org/r/494425 (https://phabricator.wikimedia.org/T180761) [09:56:44] RECOVERY - puppet last run on acmechief2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:59:24] (03PS4) 10Dzahn: icinga: make notes_url a required parameter of monitoring::service [puppet] - 10https://gerrit.wikimedia.org/r/459659 (https://phabricator.wikimedia.org/T197873) [10:01:07] (03CR) 10jerkins-bot: [V: 04-1] icinga: make notes_url a required parameter of monitoring::service [puppet] - 10https://gerrit.wikimedia.org/r/459659 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [10:01:10] 10Operations, 10Analytics, 10vm-requests, 10Patch-For-Review, 10User-Elukey: Create an-tool1005 (Staging environment for Superset) - https://phabricator.wikimedia.org/T217738 (10elukey) ` If you need a private IP, do you need it to be inside the Analytics VLAN? (y/n) y Please enter the correct row. (A,... [10:02:57] (03PS1) 10Elukey: Add an-tool1005 to site.pp/DHCP/partman configs [puppet] - 10https://gerrit.wikimedia.org/r/494680 (https://phabricator.wikimedia.org/T217738) [10:03:49] (03CR) 10Elukey: [C: 03+2] Add an-tool1005 to site.pp/DHCP/partman configs [puppet] - 10https://gerrit.wikimedia.org/r/494680 (https://phabricator.wikimedia.org/T217738) (owner: 10Elukey) [10:04:38] 10Operations: wmf-auto-restart occasionally errors on fuse mounts - https://phabricator.wikimedia.org/T217646 (10MoritzMuehlenhoff) -w sounds good, but let's check first what kind of errors lsof potentially warns about, not that we miss something important in the future. 
[10:05:19] 10Operations, 10Analytics, 10vm-requests, 10Patch-For-Review, 10User-Elukey: Create an-tool1005 (Staging environment for Superset) - https://phabricator.wikimedia.org/T217738 (10elukey) To keep archives happy: @MoritzMuehlenhoff is currently testing the Buster debian installer for Ganeti VMs, so please d... [10:06:14] 10Operations: wmf-auto-restart occasionally errors on fuse mounts - https://phabricator.wikimedia.org/T217646 (10MoritzMuehlenhoff) Another possible angle (if lsof supports that, didn't check) would be to exclude some directories entirely from scanning, the HDFS mount point doesn't contain and executables which... [10:07:36] (03PS1) 10Muehlenhoff: Stop using transitional package names for Icinga plugins [puppet] - 10https://gerrit.wikimedia.org/r/494681 (https://phabricator.wikimedia.org/T213527) [10:08:30] (03CR) 10jerkins-bot: [V: 04-1] Stop using transitional package names for Icinga plugins [puppet] - 10https://gerrit.wikimedia.org/r/494681 (https://phabricator.wikimedia.org/T213527) (owner: 10Muehlenhoff) [10:09:02] (03PS1) 10Volans: CHANGELOG: add changelogs for release v0.0.20 [software/spicerack] - 10https://gerrit.wikimedia.org/r/494682 [10:09:57] (03CR) 10Dzahn: "let's put "if >= trusty" around it so we can merge without waiting for shinken to be upgraded?" [puppet] - 10https://gerrit.wikimedia.org/r/494681 (https://phabricator.wikimedia.org/T213527) (owner: 10Muehlenhoff) [10:10:16] (03PS1) 10Muehlenhoff: Fix package name for ack-grep in buster [puppet] - 10https://gerrit.wikimedia.org/r/494684 [10:11:07] 10Operations, 10Elasticsearch, 10Wikimedia-Logstash, 10monitoring, and 2 others: Icinga monitoring for elasticsearch doesn't notice OOM conditions (this is happening on cloud) - https://phabricator.wikimedia.org/T76090 (10Gehel) Merged, will take effect with the next cluster restarts. 
[10:11:15] 10Operations, 10ops-eqsin, 10Traffic: cp5006 correctable mem errors - https://phabricator.wikimedia.org/T216717 (10ema) p:05Triage→03Normal [10:11:29] 10Operations, 10ops-eqsin, 10Traffic: cp5006 correctable mem errors - https://phabricator.wikimedia.org/T216717 (10ema) Anything else to be done here? [10:12:04] 10Operations, 10ops-eqsin, 10Traffic: cp5007 correctable mem errors - https://phabricator.wikimedia.org/T216716 (10ema) p:05Triage→03Normal [10:12:27] 10Operations, 10ops-eqsin, 10Traffic: cp5007 correctable mem errors - https://phabricator.wikimedia.org/T216716 (10ema) Can this be closed? [10:12:32] jouncebot: next [10:12:32] In 1 hour(s) and 47 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190306T1200) [10:15:20] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v0.0.20 [software/spicerack] - 10https://gerrit.wikimedia.org/r/494682 (owner: 10Volans) [10:16:06] (03PS2) 10Volans: CHANGELOG: add changelogs for release v0.0.20 [software/spicerack] - 10https://gerrit.wikimedia.org/r/494682 [10:16:31] 10Operations, 10Parsoid, 10RESTBase, 10Traffic, and 5 others: Consider stashing data-parsoid for VE - https://phabricator.wikimedia.org/T215956 (10ema) p:05Triage→03Normal >>! In T215956#4977137, @mobrovac wrote: > it seems like we will need to add a rule to Varnish to pass on these requests Would it... [10:19:39] 10Operations, 10Traffic, 10Wikidata, 10Wikidata-Campsite, 10User-Addshore: Wikidata sometimes cuts off entity RDF - https://phabricator.wikimedia.org/T216006 (10ema) [10:20:58] 10Operations, 10Traffic, 10Wikidata, 10Wikidata-Campsite, 10User-Addshore: Wikidata sometimes cuts off entity RDF - https://phabricator.wikimedia.org/T216006 (10ema) I haven't done any investigation yet, but it sounds similar to T215389. 
[10:22:38] !log lvs100[12],lvs1016: upgrade linux to 4.9.144-3.1, reboot for L1TF kernel/microcode updates T203011 [10:22:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:12] (03PS1) 10Filippo Giunchedi: prometheus: introduce query/connection limits parameters [puppet] - 10https://gerrit.wikimedia.org/r/494685 (https://phabricator.wikimedia.org/T217715) [10:24:20] (03CR) 10Muehlenhoff: "shinken is already running on jessie VMs. This affects any production host running trusty as well, as we install the plugins on all hosts." [puppet] - 10https://gerrit.wikimedia.org/r/494681 (https://phabricator.wikimedia.org/T213527) (owner: 10Muehlenhoff) [10:24:34] (03PS2) 10Muehlenhoff: Stop using transitional package names for Icinga plugins [puppet] - 10https://gerrit.wikimedia.org/r/494681 (https://phabricator.wikimedia.org/T213527) [10:26:18] (03CR) 10jerkins-bot: [V: 04-1] CHANGELOG: add changelogs for release v0.0.20 [software/spicerack] - 10https://gerrit.wikimedia.org/r/494682 (owner: 10Volans) [10:26:48] (03CR) 10Volans: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/494682 (owner: 10Volans) [10:27:16] 10Operations, 10Elasticsearch, 10Wikimedia-Logstash, 10Core Platform Team Backlog (Watching / External), 10Services (watching): logs sent to logstash are lost when the elasticsearch cirrus cluster is unavailable - https://phabricator.wikimedia.org/T176335 (10Gehel) Removing the Discovery team from this,... 
[10:28:29] (03PS1) 10Gehel: logstash: Upgrade deployment-logstash2 to 5.6.14 [puppet] - 10https://gerrit.wikimedia.org/r/494686 (https://phabricator.wikimedia.org/T216052) [10:32:11] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v0.0.20 [software/spicerack] - 10https://gerrit.wikimedia.org/r/494682 (owner: 10Volans) [10:33:20] (03CR) 10jenkins-bot: CHANGELOG: add changelogs for release v0.0.20 [software/spicerack] - 10https://gerrit.wikimedia.org/r/494682 (owner: 10Volans) [10:33:47] (03PS1) 10Volans: Upstream release v0.0.20 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/494687 [10:34:13] 10Operations: wmf-auto-restart occasionally errors on fuse mounts - https://phabricator.wikimedia.org/T217646 (10jbond) the last option excludes mount points which would work for this case. As far as i can see you can only remove directories from the output which wouldn't stop the warning from triggering also... [10:36:26] !log Deploying a hotfix for Translate https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/Translate/+/494659/ [10:36:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:15] !log hashar@deploy1001 Synchronized php-1.33.0-wmf.20/extensions/Translate/TranslateUtils.php: Revert "TranslateUtils: Avoid use of deprecated class Revision" - T217689 (duration: 00m 59s) [10:38:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:18] T217689: Translate/RevisionStore: $pageId and $revId cannot both be 0 or null - https://phabricator.wikimedia.org/T217689 [10:38:42] (03CR) 10Volans: [C: 03+2] Upstream release v0.0.20 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/494687 (owner: 10Volans) [10:40:26] 10Operations, 10Cloud-VPS, 10Toolforge, 10LDAP: groups: cannot find name for group ID - https://phabricator.wikimedia.org/T217280 (10jbond) 05Open→03Resolved a:03jbond This should be in place now, let me know if there are any issues [10:40:52] (03CR) 
10Vgutierrez: [C: 03+2] "the apt component has been removed from the affected nodes" [puppet] - 10https://gerrit.wikimedia.org/r/494674 (https://phabricator.wikimedia.org/T203194) (owner: 10Vgutierrez) [10:41:01] 10Operations, 10Cloud-VPS, 10Toolforge, 10LDAP: groups: cannot find name for group ID - https://phabricator.wikimedia.org/T217280 (10jbond) 05Resolved→03Open sorry updated wrong ticket [10:41:06] (03PS2) 10Vgutierrez: cache: get rid of the wikimedia-kernel-updates apt component (2/2) [puppet] - 10https://gerrit.wikimedia.org/r/494674 (https://phabricator.wikimedia.org/T203194) [10:41:42] 10Operations, 10LDAP-Access-Requests: Add bmansurov to archiva-deployers LDAP group - https://phabricator.wikimedia.org/T217447 (10jbond) 05Open→03Resolved a:03jbond This should be in place now, let me know if there are any issues [10:42:56] (03Merged) 10jenkins-bot: Upstream release v0.0.20 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/494687 (owner: 10Volans) [10:43:11] 10Operations, 10Graphite, 10Patch-For-Review: Graphite returning server errors (out of memory?) 
- https://phabricator.wikimedia.org/T217679 (10jbond) p:05Triage→03High [10:46:37] !log uploaded spicerack_0.0.20-1_amd64.deb to apt.wikimedia.org stretch-wikimedia [10:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:59] (03PS1) 10Vgutierrez: aptrepo: Get rid of the no longer needed component/kernel-proposed-updates [puppet] - 10https://gerrit.wikimedia.org/r/494690 (https://phabricator.wikimedia.org/T203194) [10:48:18] !log upgraded spicerack to 0.0.20 on cumin[12]001 [10:48:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:39] (03CR) 10Mathew.onipe: [C: 03+1] logstash: Upgrade deployment-logstash2 to 5.6.14 [puppet] - 10https://gerrit.wikimedia.org/r/494686 (https://phabricator.wikimedia.org/T216052) (owner: 10Gehel) [10:52:25] (03PS2) 10Gehel: logstash: Upgrade deployment-logstash2 to 5.6.14 [puppet] - 10https://gerrit.wikimedia.org/r/494686 (https://phabricator.wikimedia.org/T216052) [10:53:03] (03PS1) 10Marostegui: db-eqiad.php: Depool db1123 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494691 [10:54:29] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1123 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494691 (owner: 10Marostegui) [10:54:42] (03CR) 10Gehel: [C: 03+2] logstash: Upgrade deployment-logstash2 to 5.6.14 [puppet] - 10https://gerrit.wikimedia.org/r/494686 (https://phabricator.wikimedia.org/T216052) (owner: 10Gehel) [10:55:33] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1123 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494691 (owner: 10Marostegui) [10:56:20] (03PS7) 10Jcrespo: mariadb: Refactor dump_section.py and rename to match functionality [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/493664 (https://phabricator.wikimedia.org/T206203) [10:56:39] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1123 (duration: 00m 53s) [10:56:40] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:44] !log Deploy schema change on db1123 [10:56:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:47] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Refactor dump_section.py and rename to match functionality [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/493664 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [10:57:18] (03PS4) 10Vgutierrez: acme-chief-api: Add support for puppet HTTP API search operation [software/acme-chief] - 10https://gerrit.wikimedia.org/r/494506 (https://phabricator.wikimedia.org/T207295) [10:59:04] (03CR) 10jerkins-bot: [V: 04-1] acme-chief-api: Add support for puppet HTTP API search operation [software/acme-chief] - 10https://gerrit.wikimedia.org/r/494506 (https://phabricator.wikimedia.org/T207295) (owner: 10Vgutierrez) [11:01:34] (03PS5) 10Vgutierrez: acme-chief-api: Add support for puppet HTTP API search operation [software/acme-chief] - 10https://gerrit.wikimedia.org/r/494506 (https://phabricator.wikimedia.org/T207295) [11:03:07] (03CR) 10jerkins-bot: [V: 04-1] acme-chief-api: Add support for puppet HTTP API search operation [software/acme-chief] - 10https://gerrit.wikimedia.org/r/494506 (https://phabricator.wikimedia.org/T207295) (owner: 10Vgutierrez) [11:04:52] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1123 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494691 (owner: 10Marostegui) [11:07:40] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" (031 comment) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/493463 (owner: 10Jbond) [11:15:12] 10Operations: Document service owner in Netbox - https://phabricator.wikimedia.org/T217686 (10Volans) My main concern here is that the concept of a single service owner is limited and doesn't reflect reality. We have multiple roles for each server/service, not in all cases we have all those "roles" but I think i... 
[11:15:44] _joe_: got more redis client library issues on beta (I'm guessing the same thing as last time), should I just reopen the same ticket? [11:17:03] (03CR) 10Dzahn: "oooh, well that's a nice outcome too :)" [puppet] - 10https://gerrit.wikimedia.org/r/485685 (owner: 10Hashar) [11:18:58] !log updated and rebooted serpens (T217280) [11:19:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:01] T217280: groups: cannot find name for group ID - https://phabricator.wikimedia.org/T217280 [11:20:19] (03CR) 10Dzahn: "looks like it's still using the new code though" [puppet] - 10https://gerrit.wikimedia.org/r/486454 (https://phabricator.wikimedia.org/T189091) (owner: 10KartikMistry) [11:20:51] 10Operations, 10monitoring, 10Patch-For-Review, 10Wikimedia-Incident: INCIDENT: k8s@codfw prometheus queries disabled -- very slow to execute some queries - https://phabricator.wikimedia.org/T217715 (10fgiunchedi) I've examined apache logs on `prometheus2004` for the `k8s` instance looking for clues, even... [11:21:02] !log updated and rebooted seaborgium (T217280) [11:21:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:11] (03CR) 10Dzahn: "feel free to re-add me once it's out of WIP state and the manual purge run has been finished" [puppet] - 10https://gerrit.wikimedia.org/r/486454 (https://phabricator.wikimedia.org/T189091) (owner: 10KartikMistry) [11:21:31] I reopened https://phabricator.wikimedia.org/T217323 [11:23:03] onimisionipe: let's merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/494471 ? [11:23:31] 10Operations: Document service owner in Netbox - https://phabricator.wikimedia.org/T217686 (10MoritzMuehlenhoff) @ayounsi : There's the more generic task https://phabricator.wikimedia.org/T216088 for this discussion. (Personally I think this is better tracked in puppet on a per role base and not per server) [11:23:39] mutante: Nope [11:23:44] zeljkof: will you be around for EU / Morning SWAT? 
[11:24:02] hauskatze: I'm around for eu swat [11:24:35] zeljkof: alright [11:25:03] mutante: we can merge yes [11:25:33] there's a heavy refactoring following that anyway [11:27:53] onimisionipe: ack, per the latest comment on it [11:28:04] (03PS4) 10Dzahn: elasticsearch: move nagios check to profile [puppet] - 10https://gerrit.wikimedia.org/r/494471 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [11:28:17] (03CR) 10Dzahn: [C: 03+2] elasticsearch: move nagios check to profile [puppet] - 10https://gerrit.wikimedia.org/r/494471 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [11:28:50] (03CR) 10Ammarpad: Add editcontentmodel right to the templateeditor group on testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494016 (https://phabricator.wikimedia.org/T217499) (owner: 10Ammarpad) [11:31:09] 10Operations, 10Wikimedia-Mailing-lists: Please create an All Affiliates mailing list - https://phabricator.wikimedia.org/T217736 (10jbond) Hi Erica, As I think you have noticed, the list has been created and you should have received the admin password. I'm just holding off resolving the ticket as this is the fi... 
[11:32:13] !log bounce prometheus@k8s on prometheus2004 to test limiting concurrent connections [11:32:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:37] 10Operations, 10Wikimedia-Mailing-lists: Please create an All Affiliates mailing list - https://phabricator.wikimedia.org/T217736 (10jbond) a:03jbond [11:33:49] (03PS8) 10Jcrespo: mariadb: Refactor dump_section.py and rename to match functionality [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/493664 (https://phabricator.wikimedia.org/T206203) [11:34:17] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Refactor dump_section.py and rename to match functionality [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/493664 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [11:35:01] (03PS8) 10Jbond: Add ability to filter out auto restarts [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/493463 [11:35:28] (03CR) 10Ladsgroup: "Interesting, I thought it's already there for production." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493011 (https://phabricator.wikimedia.org/T216730) (owner: 10Ladsgroup) [11:36:28] (03PS9) 10Jcrespo: mariadb: Refactor dump_section.py and rename to match functionality [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/493664 (https://phabricator.wikimedia.org/T206203) [11:36:30] mutante: Thanks! [11:36:53] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Refactor dump_section.py and rename to match functionality [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/493664 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [11:37:05] (03CR) 10Jbond: [C: 03+2] Add ability to filter out auto restarts [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/493463 (owner: 10Jbond) [11:37:33] onimisionipe: yw. so do we have to run puppet on cloudelastic? they are not setup yet, right [11:37:52] everything is ok on icinga side [11:38:24] <_joe_> addshore: on what server? [11:38:35] mutante: no need. 
They are not setup yet. there's another patch for that [11:38:58] _joe_: https://phabricator.wikimedia.org/T217323#5004710, but lucas may have already fixed it! [11:39:55] hehe, fortunately _joe_ had provided enough info the first time the task was fixed ^^ [11:43:02] :D [11:44:39] onimisionipe: ack, cool [11:46:30] 10Operations, 10Analytics, 10Documentation, 10Patch-For-Review: Remove data from Hadoop's HDFS as part of the user offboard workflow - https://phabricator.wikimedia.org/T200312 (10elukey) 05Open→03Resolved [11:47:37] (03PS1) 10Sbisson: Welcome survey: send all newcomers to variation A (cs, ko) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494698 (https://phabricator.wikimedia.org/T217625) [11:49:13] <_joe_> Lucas_WMDE: thanks, and sorry, it's mostly my fault for not cleaning up properly [11:50:07] PROBLEM - DPKG on serpens is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:51:11] RECOVERY - DPKG on serpens is OK: All packages OK [11:52:11] ACKNOWLEDGEMENT - HP RAID on db2044 is CRITICAL: CRITICAL: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Failed: 1I:1:3 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T217755 [11:52:18] 10Operations, 10ops-codfw: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T217755 (10ops-monitoring-bot) [11:54:08] 10Operations, 10Wikimedia-Mailing-lists: Please create an All Affiliates mailing list - https://phabricator.wikimedia.org/T217736 (10jbond) 05Open→03Resolved Here are URLs for [https://lists.wikimedia.org/mailman/listinfo/all-affiliates listinfo], [https://lists.wikimedia.org/mailman/admin/all-affiliates a... 
[11:54:18] PROBLEM - toolschecker: Test LDAP for query on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/ldap - 237 bytes in 0.632 second response time [11:54:31] ack [11:54:41] * volans acks the ack :) [11:55:09] <_joe_> might it be tied to the alert on serpens earlier? [11:55:12] paged for toolschecker [11:55:18] ACKNOWLEDGEMENT - toolschecker: Test LDAP for query on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/ldap - 237 bytes in 0.632 second response time GTirloni T217280 [11:55:41] <_joe_> gtirloni: remember to remove "send notification" when acking [11:55:46] <_joe_> :) [11:56:09] oh, I didn't know that was a thing. thanks for the heads up, will do next time [11:56:57] <_joe_> that prevents a page to be sent [11:57:12] <_joe_> let us know if you need help [11:57:14] gtirloni: are you upgrading this to stretch? [11:57:24] yep [11:57:34] starting with serpens, i'm fixing some pinning issues [11:57:39] actually is not a thing, we've different opinions on ACK pages, I think they are useful [11:58:32] during off hours an ack page will tell me I don't have to go find a computer and check in.... [11:58:35] the LDAP servers are used by various production services, this should have been coordinated [11:58:42] PROBLEM - Device not healthy -SMART- on db2044 is CRITICAL: cluster=mysql device=cciss,11 instance=db2044:9100 job=node site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2044&var-datasource=codfw+prometheus/ops [11:58:58] moritzm: seaborgium should be working [11:59:38] this still has the potential to disrupt various things, so tell us beforehand [11:59:49] is this related to https://phabricator.wikimedia.org/T217280 ? 
[12:00:01] note that gerrit uses that [12:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Your horoscope predicts another unfortunate European Mid-day SWAT(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190306T1200). [12:00:04] _joe_, Urbanecm, and hauskatze: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:04] <_joe_> akosiaris: yes [12:00:23] o/ [12:00:27] o/ [12:00:33] here [12:00:38] well an upgrade to stretch would probably not help find the root cause [12:00:49] unless the root cause has already been found and I missed that [12:00:57] _joe_: you can deploy your patches while I get ready for the rest of the patches [12:01:13] (03CR) 10Volans: "It's quite a lot of code, I'll need a bit more time to go over it." (038 comments) [software/acme-chief] - 10https://gerrit.wikimedia.org/r/494506 (https://phabricator.wikimedia.org/T207295) (owner: 10Vgutierrez) [12:01:18] ping alaa_wmde ^^ :P the bot didnt ping you for some reason [12:01:18] ? [12:01:25] oh alaa_wmde you scheduled it for later, my bad! [12:01:29] <_joe_> zeljkof: yeah wait a sec, I shouldn't +2 my own patches I think :) [12:01:32] ack, we're probably hitting some size limits due to increased usage, the stretch update will not make a difference [12:01:50] <_joe_> moritzm: do you think we can go on with SWAT? 
[12:01:51] _joe_: you should in this case :) [12:02:01] akosiaris: various mentions to memory leaks fixed in openldap that ships with stretch [12:02:10] _joe_: should be fine [12:02:17] <_joe_> ack, ok [12:02:43] gtirloni: we're restarting slapd already via cron, whatever mem leaks was fixed in there has no effect on this current bug [12:02:47] 10Operations, 10ops-codfw: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T217755 (10jcrespo) @Papaul please substitute under warranty or otherwise with a spare, if available (600GB disk)- probably the second. [12:02:53] <_joe_> zeljkof: I might need some assistance on the second one [12:03:09] (03PS2) 10Giuseppe Lavagetto: Set wgWMEPhp7SamplingRate to 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493431 [12:03:31] 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10jcrespo) [12:03:58] _joe_: sure, I'm around, but it's all documented [12:04:16] https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers [12:04:28] ACKNOWLEDGEMENT - Device not healthy -SMART- on db2044 is CRITICAL: cluster=mysql device=cciss,11 instance=db2044:9100 job=node site=codfw Jcrespo T217755 https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2044&var-datasource=codfw+prometheus/ops [12:04:53] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "Deploying for SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493431 (owner: 10Giuseppe Lavagetto) [12:05:01] https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#mediawiki/extensions_and_mediawiki/skins [12:05:06] https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#mediawiki/extensions_and_mediawiki/skins_2 [12:05:11] <_joe_> zeljkof: thanks [12:05:16] (03PS2) 10Filippo Giunchedi: prometheus: introduce query/connection limits parameters [puppet] - 10https://gerrit.wikimedia.org/r/494685 (https://phabricator.wikimedia.org/T217715) [12:05:26] _joe_: I'm around if you have any questions [12:05:57] 
(03Merged) 10jenkins-bot: Set wgWMEPhp7SamplingRate to 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493431 (owner: 10Giuseppe Lavagetto) [12:08:45] <_joe_> ok, tested on mwdebug1002, all works well apparently [12:08:47] (03PS4) 10Dzahn: icinga: add notes URLs to various monitoring checks, part 1 [puppet] - 10https://gerrit.wikimedia.org/r/494472 (https://phabricator.wikimedia.org/T197873) [12:08:52] <_joe_> syncing [12:10:31] !log oblivian@deploy1001 Synchronized wmf-config/CommonSettings.php: Setting php7 sample rate for anonymous users to 0 (duration: 00m 57s) [12:10:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:21] <_joe_> ok I can go with the next one [12:13:26] (03CR) 10Nikerabbit: [C: 04-1] Enable edittag for ExternalGuidance in CX and VE (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494477 (https://phabricator.wikimedia.org/T216123) (owner: 10KartikMistry) [12:13:28] <_joe_> I did all my verifications :) [12:14:49] (03CR) 10jenkins-bot: Set wgWMEPhp7SamplingRate to 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493431 (owner: 10Giuseppe Lavagetto) [12:15:06] oh that deployers page is nice, thanks for the link! 
[12:15:45] <_joe_> zeljkof: still waiting for CI, sorry [12:15:52] RECOVERY - toolschecker: Test LDAP for query on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 16.698 second response time [12:16:01] _joe_: no problem, it can take a while for extensions/core [12:16:48] (03CR) 10Vgutierrez: "> Patch Set 5:" (031 comment) [software/acme-chief] - 10https://gerrit.wikimedia.org/r/494506 (https://phabricator.wikimedia.org/T207295) (owner: 10Vgutierrez) [12:17:17] <_joe_> zeljkof: if you want to go on with one of the other patches [12:17:26] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/14992/" [puppet] - 10https://gerrit.wikimedia.org/r/494472 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [12:17:42] _joe_: that might make sense, I don't think we'll step on each other feet [12:17:56] <_joe_> yep, I can wait [12:18:20] Urbanecm: you're next, I'll deploy the first patch and let you know when the second is ready for testing [12:19:13] (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494669 (https://phabricator.wikimedia.org/T217676) (owner: 10Urbanecm) [12:19:23] <_joe_> zeljkof: just lmk when you're done with Urbanecm :) [12:19:33] _joe_: sure [12:20:12] (03Merged) 10jenkins-bot: Throttle Exception for Art+Feminism event Eindhoven 8th March [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494669 (https://phabricator.wikimedia.org/T217676) (owner: 10Urbanecm) [12:20:25] (03CR) 10jenkins-bot: Throttle Exception for Art+Feminism event Eindhoven 8th March [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494669 (https://phabricator.wikimedia.org/T217676) (owner: 10Urbanecm) [12:21:19] (03PS1) 10Filippo Giunchedi: prometheus: add recording rules for service runner percentiles [puppet] - 10https://gerrit.wikimedia.org/r/494700 (https://phabricator.wikimedia.org/T217715) [12:21:53] Urbanecm: around for swat? 
[12:22:43] !log zfilipin@deploy1001 Synchronized wmf-config/throttle.php: SWAT: [[gerrit:494669|Throttle Exception for Art+Feminism event Eindhoven 8th March (T217676)]] (duration: 00m 56s) [12:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:48] T217676: Throttle Exception for Art+Feminism event Eindhoven 8th March - https://phabricator.wikimedia.org/T217676 [12:22:56] !log updated serpens to stretch (T217280) [12:22:56] Urbanecm: 494669 deployed [12:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:58] T217280: groups: cannot find name for group ID - https://phabricator.wikimedia.org/T217280 [12:23:43] _joe_: go ahead, your patch is merged, and looks like Urbanecm is not around [12:24:04] <_joe_> zeljkof: uhm ok [12:26:39] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1002/14993/" [puppet] - 10https://gerrit.wikimedia.org/r/494685 (https://phabricator.wikimedia.org/T217715) (owner: 10Filippo Giunchedi) [12:27:17] <_joe_> pulling on mwdebug1002 [12:27:24] (03PS2) 10Dzahn: icinga: add notes URLs to various monitoring checks, part 2 [puppet] - 10https://gerrit.wikimedia.org/r/494511 [12:28:59] <_joe_> all good apparently [12:30:23] sorry, went away for a while [12:30:25] I'm back zeljkof [12:30:32] <_joe_> zeljkof: just to be sure [12:30:39] Urbanecm: no problemo, you're next :) after _joe_ [12:30:40] <_joe_> If i sync-file a directory [12:30:48] sure [12:30:51] <_joe_> should I include the trailing slash? 
[12:31:02] _joe_: I don't think it matters [12:31:25] I think I usually do, because that's what bash will tab-complete [12:32:10] !log oblivian@deploy1001 Synchronized php-1.33.0-wmf.19/extensions/WikimediaEvents: SWAT: Allow directing a sample of users to PHP 7 backport to wmf.19 T216676 (duration: 00m 57s) [12:32:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:13] T216676: Set up A/B testing mechanism for PHP7, - https://phabricator.wikimedia.org/T216676 [12:32:37] <_joe_> zeljkof: gimme another 2 minutes and I'll have tested everything [12:32:46] ok [12:33:35] <_joe_> uhm [12:34:55] <_joe_> ok go on [12:35:23] _joe_: ok, I'll continue with Urbanecm's patches [12:35:29] <_joe_> I'm not sure how caching works for js scripts [12:35:31] <_joe_> sure [12:35:33] * Urbanecm is still here [12:35:59] _joe_: I don't think there's any caching at mwdebug [12:36:07] might be client side [12:36:13] <_joe_> not on mwdebug, I was testing on the actual wikis now [12:36:14] (03PS3) 10Zfilipin: Change links in cswiki Help Panel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494668 (https://phabricator.wikimedia.org/T217391) (owner: 10Urbanecm) [12:36:53] (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494668 (https://phabricator.wikimedia.org/T217391) (owner: 10Urbanecm) [12:37:34] _joe_: ah, for that I don't really know, but adding something like debug=true to the url should remove all caching [12:38:03] <_joe_> yeah it works with that :) [12:38:10] (03Merged) 10jenkins-bot: Change links in cswiki Help Panel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494668 (https://phabricator.wikimedia.org/T217391) (owner: 10Urbanecm) [12:38:14] _joe_, might https://wikitech.wikimedia.org/wiki/Multicast_HTCP_purging#One-off_purge be related? 
[12:38:43] https://www.mediawiki.org/wiki/ResourceLoader/Features#Debug_mode [12:39:15] <_joe_> Urbanecm: it's not really an issue, I plan on rolling out php7 to 1 out of 1000 users tomorrow, it's not a problem for that [12:39:36] Urbanecm: 494668 is at mwdebug1002 [12:39:42] thanks zeljkof [12:39:50] ok then _joe_ :) [12:39:55] 10Operations, 10Cloud-VPS, 10Toolforge, 10LDAP: groups: cannot find name for group ID - https://phabricator.wikimedia.org/T217280 (10GTirloni) Here are all the times when slapd was restarted by [[ https://github.com/wikimedia/puppet/blob/production/modules/role/manifests/openldap/labs.pp#L52-L56 | cron ]]... [12:40:36] zeljkof, working, please deploy [12:40:37] hauskatze: please stand by, you're next [12:40:40] Urbanecm: ok [12:41:04] zeljkof: please skip my patches. I got to go. [12:41:15] hauskatze: ok, see you next time then [12:41:46] hauskatze, should I take them over? [12:41:48] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:494668|Change links in cswiki Help Panel (T217391)]] (duration: 00m 55s) [12:41:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:51] T217391: Help panel: improve links in Czech help panel - https://phabricator.wikimedia.org/T217391 [12:41:55] zeljkof: sorry, just got an innopportune work call [12:42:01] Urbanecm: if you wish [12:42:23] Urbanecm: 494668 is deployed [12:42:23] looks like standard ones, zeljkof, you can deploy them as if they were mine :) [12:42:26] thanks zeljkof [12:42:34] Urbanecm: ok, continuing then [12:42:51] I can keep watching until I leave in a minute or two [12:42:54] 10Operations, 10Patch-For-Review: Prepare our base system layer for Debian buster - https://phabricator.wikimedia.org/T213527 (10MoritzMuehlenhoff) Installations in Ganeti are currently blocked for a long time waiting for entropy in the d-i step which generates an SSH host key. This is resolved once 4.9.20-1 i... 
[12:43:00] and thanks Urbanecm [12:43:05] yw hauskatze [12:44:13] 10Operations, 10monitoring, 10LDAP: prometheus-openldap-exporter: Request.write called on a request after Request.finish was called - https://phabricator.wikimedia.org/T217758 (10GTirloni) [12:44:42] (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494218 (https://phabricator.wikimedia.org/T217523) (owner: 10MarcoAurelio) [12:45:11] (03CR) 10Urbanecm: [C: 03+1] "LGTM, once https://gerrit.wikimedia.org/r/494218 is merged." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494225 (https://phabricator.wikimedia.org/T217523) (owner: 10MarcoAurelio) [12:45:51] (03Merged) 10jenkins-bot: Restrict local uploads on mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494218 (https://phabricator.wikimedia.org/T217523) (owner: 10MarcoAurelio) [12:47:39] (03PS3) 10Dzahn: icinga: add notes URLs to various monitoring checks, part 2 [puppet] - 10https://gerrit.wikimedia.org/r/494511 [12:47:42] Urbanecm: 494218 is at mwdebug [12:48:26] zeljkof: cannot see any change in listgrouprights on mwdebug1001 [12:49:04] hauskatze: that one should be deployed, no testing possible? [12:49:26] both are testable [12:49:40] if 494218 is on mwdebug or deployed, something is wrong [12:49:42] let me have a quick look [12:49:52] indeed, and adding the wiki to commonsuploads should remove upload-related permissions from non-admins [12:50:00] not sure why it is not showing [12:50:06] have an idea, verifying [12:50:08] (03PS1) 10Corinna Hillebrand: Remove disable confirmation prompt configuration. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494706 (https://phabricator.wikimedia.org/T215019) [12:50:23] zeljkof, you can deploy the other patch in the meantime [12:50:58] 494218 is rebased at deploy1001, can I deploy it? [12:51:19] or should I deploy 494225 first? 
[12:51:22] * zeljkof is confused [12:51:26] doesn't seem to be working [12:51:36] (03CR) 10jenkins-bot: Change links in cswiki Help Panel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494668 (https://phabricator.wikimedia.org/T217391) (owner: 10Urbanecm) [12:51:38] (03CR) 10jenkins-bot: Restrict local uploads on mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494218 (https://phabricator.wikimedia.org/T217523) (owner: 10MarcoAurelio) [12:51:52] and my partner has arrived so we got to go [12:52:06] Urbanecm: should I revert 494218? [12:52:07] 494225 is independent, doesn't require 494218, so you can deploy the other one [12:52:25] probably, don't see the cause [12:52:29] better safe than sorr [12:52:30] y [12:52:54] Urbanecm: ok, so reverting 494218, deploying 494225? correct? [12:53:07] yes, thanks and sorry for confusing you [12:53:18] no problem, just wanted to make sure [12:53:21] sure [12:53:43] I suspect something related to commonsettings not touched or something [12:53:49] where dblists live [12:53:54] good idea [12:54:06] zeljkof, can you try to touch IS.php and run sync file on it? [12:54:13] zeljkof: before reverting, maybe touch CS and IS? [12:54:28] (03PS1) 10Zfilipin: Revert "Restrict local uploads on mediawiki.org" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494707 (https://phabricator.wikimedia.org/T217523) [12:55:08] Urbanecm, hauskatze: the next patch changes IS, would that help? 
[12:55:16] yes [12:55:36] I don't think I've ever edited the files at deploy1001, that doesn't sound like a good idea :/ [12:56:11] ehh [12:56:13] (03PS2) 10Zfilipin: Create an 'uploader' group on mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494225 (https://phabricator.wikimedia.org/T217523) (owner: 10MarcoAurelio) [12:56:25] just run touch /path/to/is.php [12:56:32] that'll update the modified timestamp [12:56:57] without actually modifying something [12:57:48] I hesitate to do anything that's not documented, deployments are scary enough :/ [12:58:26] ok [12:58:39] then just revert, we [12:58:46] will solve it later [12:58:54] Urbanecm: ok, reverting then [12:59:02] (03CR) 10Zfilipin: [C: 03+2] Revert "Restrict local uploads on mediawiki.org" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494707 (https://phabricator.wikimedia.org/T217523) (owner: 10Zfilipin) [13:00:05] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190306T1300) [13:00:06] (03Merged) 10jenkins-bot: Revert "Restrict local uploads on mediawiki.org" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494707 (https://phabricator.wikimedia.org/T217523) (owner: 10Zfilipin) [13:00:15] Urbanecm: we ran out of time, I'll revert, but there's no time for the second patch :( [13:00:26] ok [13:00:42] someone will reschedule it for another SWAT then :) [13:00:46] thanks for your deployments [13:01:05] Urbanecm: thanks for deploying with #releng :) [13:01:10] yw :) [13:02:13] (03CR) 10Zfilipin: "Please move to another SWAT window, we ran out of time during EU SWAT." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/494225 (https://phabricator.wikimedia.org/T217523) (owner: 10MarcoAurelio) [13:02:50] !log EU SWAT finished [13:02:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:29] (03CR) 10jenkins-bot: Revert "Restrict local uploads on mediawiki.org" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494707 (https://phabricator.wikimedia.org/T217523) (owner: 10Zfilipin) [13:12:07] (03PS2) 10Ema: Add new conftool service "ats-be" [puppet] - 10https://gerrit.wikimedia.org/r/483094 (https://phabricator.wikimedia.org/T213263) [13:12:09] (03PS2) 10Ema: cache: define ATS nodes in hiera [puppet] - 10https://gerrit.wikimedia.org/r/483095 (https://phabricator.wikimedia.org/T213263) [13:12:11] (03PS4) 10Ema: cache: hiera flag to use ATS as local backend [puppet] - 10https://gerrit.wikimedia.org/r/482024 (https://phabricator.wikimedia.org/T213263) [13:18:11] !log rolling security updates for file on jessie [13:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:44] 10Operations, 10LDAP-Access-Requests: Add bmansurov to archiva-deployers LDAP group - https://phabricator.wikimedia.org/T217447 (10bmansurov) Thanks, everyone! [13:23:18] (03CR) 10Gehel: "Side note: this patch seems to be cherry-picked on deployment-puppetmaster03 and is making puppet fail on deployment-logstash1003. 
That mi" [puppet] - 10https://gerrit.wikimedia.org/r/492390 (https://phabricator.wikimedia.org/T126989) (owner: 10Herron) [13:32:29] 10Operations, 10ops-codfw: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T217755 (10Marostegui) a:03Papaul [13:35:02] !log Upgrade MySQL on db1123 [13:35:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:11] (03PS1) 10Muehlenhoff: Install ack instead of ack-grep [puppet] - 10https://gerrit.wikimedia.org/r/494718 (https://phabricator.wikimedia.org/T213527) [13:50:44] (03CR) 10WMDE-Fisch: "To get this merged you need to SWAT that change :-)." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494706 (https://phabricator.wikimedia.org/T215019) (owner: 10Corinna Hillebrand) [14:00:05] hashar: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - European version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190306T1400). [14:00:50] (03PS1) 10ArielGlenn: handle failed xml content jobs correctly [dumps] - 10https://gerrit.wikimedia.org/r/494722 (https://phabricator.wikimedia.org/T217744) [14:11:36] (03CR) 10Ema: [C: 03+2] Add new conftool service "ats-be" [puppet] - 10https://gerrit.wikimedia.org/r/483094 (https://phabricator.wikimedia.org/T213263) (owner: 10Ema) [14:17:06] (03PS14) 10CDanis: graphite: uwsgi workers: set timeouts + max RSS [puppet] - 10https://gerrit.wikimedia.org/r/494620 (https://phabricator.wikimedia.org/T116767) [14:17:30] (03CR) 10Dzahn: "looking at packages.debian.org it looks like "ack" is still in buster and "ack-grep" is a virtual package. 
but we usually avoid using virt" [puppet] - 10https://gerrit.wikimedia.org/r/494718 (https://phabricator.wikimedia.org/T213527) (owner: 10Muehlenhoff) [14:20:51] (03CR) 10Muehlenhoff: "Yep, it still gets installed on the dpkg level, but Puppet tries to re-install it with every puppet run like this:" [puppet] - 10https://gerrit.wikimedia.org/r/494718 (https://phabricator.wikimedia.org/T213527) (owner: 10Muehlenhoff) [14:21:29] (03CR) 10CDanis: "updated PCC looks good https://puppet-compiler.wmflabs.org/compiler1002/14997/" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/494620 (https://phabricator.wikimedia.org/T116767) (owner: 10CDanis) [14:21:32] (03PS1) 10Gilles: Have coal watch the PaintTiming schema [puppet] - 10https://gerrit.wikimedia.org/r/494726 (https://phabricator.wikimedia.org/T217395) [14:21:43] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/494262 (https://phabricator.wikimedia.org/T214594) (owner: 10Cwhite) [14:22:47] (03CR) 10Volans: [C: 04-1] "The last PS shows a lot of refactoring compared to the previous one with the addition of new layers that I don't think are actually needed" (0312 comments) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/492007 (https://phabricator.wikimedia.org/T215229) (owner: 10CRusnov) [14:23:41] (03PS15) 10CDanis: graphite: uwsgi workers: set timeouts + max RSS [puppet] - 10https://gerrit.wikimedia.org/r/494620 (https://phabricator.wikimedia.org/T116767) [14:24:34] (03CR) 10Ema: [C: 03+2] cache: define ATS nodes in hiera [puppet] - 10https://gerrit.wikimedia.org/r/483095 (https://phabricator.wikimedia.org/T213263) (owner: 10Ema) [14:26:57] jbond42: who could I talk to in order to increase the vCPU count for seaborgium/serpens in ganeti? I don't know much about ganeti (except it exists). 
Until we found what's the issue with slapd's memory consumption, it's also pretty CPU constrained in eqiad (seaborgium) so if we have the spare capacity, that would be a nice relief [14:27:40] gtirloni: https://wikitech.wikimedia.org/wiki/Ganeti#Resize_a_VM and akosiaris ;) [14:28:26] (03CR) 10Andrew Bogott: [C: 03+1] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/494684 (owner: 10Muehlenhoff) [14:29:01] ah, I read that but "Make sure first that the cluster has adequate space" scared me :) as I don't know how capacity planning is done there [14:31:37] (03CR) 10Ema: [C: 03+2] cache: hiera flag to use ATS as local backend [puppet] - 10https://gerrit.wikimedia.org/r/482024 (https://phabricator.wikimedia.org/T213263) (owner: 10Ema) [14:31:39] I've got to go out for lunch now (and run an errand) but I'll poke him later, thanks volans! [14:32:34] gtirloni: ack, you can see current usage with https://wikitech.wikimedia.org/wiki/Ganeti#Listing_cluster_nodes [14:32:55] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, PCC https://puppet-compiler.wmflabs.org/compiler1001/14999/" [puppet] - 10https://gerrit.wikimedia.org/r/494262 (https://phabricator.wikimedia.org/T214594) (owner: 10Cwhite) [14:33:02] thanks [14:33:49] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1123" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494728 [14:36:07] (03PS1) 10Dzahn: icinga: add notes URLs to various monitoring checks, part 3 [puppet] - 10https://gerrit.wikimedia.org/r/494729 (https://phabricator.wikimedia.org/T197873) [14:36:20] (03PS16) 10CDanis: graphite: uwsgi workers: set timeouts + max RSS [puppet] - 10https://gerrit.wikimedia.org/r/494620 (https://phabricator.wikimedia.org/T116767) [14:37:43] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1123" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494728 (owner: 10Marostegui) [14:38:43] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1123" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/494728 (owner: 10Marostegui) [14:39:45] (03CR) 10Vgutierrez: [C: 04-2] "let's see how implementation on acme-chief-backend side would look like" [software/acme-chief] - 10https://gerrit.wikimedia.org/r/494506 (https://phabricator.wikimedia.org/T207295) (owner: 10Vgutierrez) [14:39:47] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1123" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494728 (owner: 10Marostegui) [14:40:37] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1123 (duration: 00m 56s) [14:40:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:37] (03PS17) 10CDanis: graphite: uwsgi workers: set timeouts + max RSS [puppet] - 10https://gerrit.wikimedia.org/r/494620 (https://phabricator.wikimedia.org/T116767) [14:44:42] (03CR) 10CDanis: [C: 03+2] graphite: uwsgi workers: set timeouts + max RSS [puppet] - 10https://gerrit.wikimedia.org/r/494620 (https://phabricator.wikimedia.org/T116767) (owner: 10CDanis) [14:44:50] (03PS1) 10Ema: profile::trafficserver::backend: install conftool scripts [puppet] - 10https://gerrit.wikimedia.org/r/494731 (https://phabricator.wikimedia.org/T213263) [14:45:33] (03PS2) 10Ema: profile::trafficserver::backend: install conftool scripts [puppet] - 10https://gerrit.wikimedia.org/r/494731 (https://phabricator.wikimedia.org/T213263) [14:47:53] (03CR) 10Ema: [C: 03+2] profile::trafficserver::backend: install conftool scripts [puppet] - 10https://gerrit.wikimedia.org/r/494731 (https://phabricator.wikimedia.org/T213263) (owner: 10Ema) [14:48:59] (03PS1) 10Dzahn: xmldumps: set notes URL and contact group to wmcs [puppet] - 10https://gerrit.wikimedia.org/r/494733 (https://phabricator.wikimedia.org/T197873) [14:49:18] (03PS2) 10Dzahn: xmldumps: set notes URL and contact group to wmcs [puppet] - 10https://gerrit.wikimedia.org/r/494733 (https://phabricator.wikimedia.org/T197873) [14:49:20] (03PS4) 10Dzahn: icinga: add notes URLs to various 
monitoring checks, part 2 [puppet] - 10https://gerrit.wikimedia.org/r/494511 (https://phabricator.wikimedia.org/T197873) [14:49:55] (03CR) 10jerkins-bot: [V: 04-1] xmldumps: set notes URL and contact group to wmcs [puppet] - 10https://gerrit.wikimedia.org/r/494733 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [14:51:25] (03PS1) 10Gehel: logstash: upgrade to 5.6.14 [puppet] - 10https://gerrit.wikimedia.org/r/494735 (https://phabricator.wikimedia.org/T216052) [14:55:44] (03PS3) 10Dzahn: xmldumps: set notes URL and contact group to wmcs [puppet] - 10https://gerrit.wikimedia.org/r/494733 (https://phabricator.wikimedia.org/T197873) [14:59:43] 10Operations, 10Icinga: The icinga web interface can't read the icinga log file - https://phabricator.wikimedia.org/T209568 (10Dzahn) [15:01:51] (03CR) 10CDanis: [C: 03+1] prometheus: add recording rules for service runner percentiles [puppet] - 10https://gerrit.wikimedia.org/r/494700 (https://phabricator.wikimedia.org/T217715) (owner: 10Filippo Giunchedi) [15:02:16] (03PS1) 10Dzahn: icinga: set notes_url for Icinga meta checks [puppet] - 10https://gerrit.wikimedia.org/r/494738 [15:03:34] (03PS1) 10Elukey: hadoop::ssl_config: remove redundant xml.erb files [puppet/cdh] - 10https://gerrit.wikimedia.org/r/494739 (https://phabricator.wikimedia.org/T217412) [15:04:08] (03CR) 10Elukey: [V: 03+2 C: 03+2] hadoop::ssl_config: remove redundant xml.erb files [puppet/cdh] - 10https://gerrit.wikimedia.org/r/494739 (https://phabricator.wikimedia.org/T217412) (owner: 10Elukey) [15:04:59] (03PS10) 10Elukey: hadoop: allow the configuration of ssl-(server|client).xml configs [puppet] - 10https://gerrit.wikimedia.org/r/493693 (https://phabricator.wikimedia.org/T217412) [15:06:54] (03CR) 10Krinkle: "is it possible to move instead of copy the file and use in both places for a short time (e.g. old module using the new file). 
That would m" [puppet] - 10https://gerrit.wikimedia.org/r/494425 (https://phabricator.wikimedia.org/T180761) (owner: 10Dzahn) [15:08:12] (03PS2) 10Tonina Zhelyazkova: Remove disable confirmation prompt configuration. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494706 (https://phabricator.wikimedia.org/T215019) (owner: 10Corinna Hillebrand) [15:10:27] (03PS1) 10Dzahn: dumps::nfs: set notes URL [puppet] - 10https://gerrit.wikimedia.org/r/494741 [15:11:32] 10Operations, 10MobileFrontend, 10TechCom, 10Traffic, 10Readers-Web-Backlog (Tracking): Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (10Krinkle) [15:12:52] (03CR) 10CDanis: [C: 03+1] "I think this looks good. Maybe we want to test the impact of these changes once rolled out? Loading the citoid dashboard ought to be a g" [puppet] - 10https://gerrit.wikimedia.org/r/494685 (https://phabricator.wikimedia.org/T217715) (owner: 10Filippo Giunchedi) [15:13:16] (03PS1) 10Ema: profile::trafficserver::backend: add conftool client [puppet] - 10https://gerrit.wikimedia.org/r/494742 (https://phabricator.wikimedia.org/T213263) [15:13:56] (03PS1) 10Dzahn: puppetmaster/puppetboard: set notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/494743 [15:14:12] 10Operations, 10MobileFrontend, 10TechCom, 10Traffic, 10Readers-Web-Backlog (Tracking): Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (10Krinkle) >>! In T214998#4929700, @Jdlrobson wrote: > [..] This feels like an RFC to me. [..]... 
[15:14:15] (03PS8) 10Jcrespo: mysqld-prometheus-exporter: Change the default arguments for buster [puppet] - 10https://gerrit.wikimedia.org/r/494236 (https://phabricator.wikimedia.org/T161296) [15:18:49] (03PS1) 10Dzahn: trafficserver: set notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/494745 [15:19:13] 10Operations, 10Cloud-Services, 10Cloud-VPS: investigate slapd memory leak - https://phabricator.wikimedia.org/T130593 (10Bstorm) [15:19:20] 10Operations, 10Cloud-VPS, 10Toolforge, 10LDAP: groups: cannot find name for group ID - https://phabricator.wikimedia.org/T217280 (10Bstorm) [15:19:33] 10Operations, 10Wikimedia-Mailing-lists: Please create an All Affiliates mailing list - https://phabricator.wikimedia.org/T217736 (10Elitre) TY very much. Adding @Quiddity so that, once we are ready, he can proceed with the due diligence in what you recommend. [15:19:46] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/15003/" [puppet] - 10https://gerrit.wikimedia.org/r/493693 (https://phabricator.wikimedia.org/T217412) (owner: 10Elukey) [15:20:23] 10Operations, 10Cloud-VPS, 10Toolforge, 10LDAP: groups: cannot find name for group ID - https://phabricator.wikimedia.org/T217280 (10Bstorm) Apparently, according to T130593, this issue has been around a while and it was worked around with those crons. Now we seem to have a large enough data set that the... [15:30:57] (03PS11) 10Elukey: hadoop: allow the configuration of ssl-(server|client).xml configs [puppet] - 10https://gerrit.wikimedia.org/r/493693 (https://phabricator.wikimedia.org/T217412) [15:34:45] (03CR) 10Jcrespo: [C: 03+2] mysqld-prometheus-exporter: Change the default arguments for buster [puppet] - 10https://gerrit.wikimedia.org/r/494236 (https://phabricator.wikimedia.org/T161296) (owner: 10Jcrespo) [15:36:20] (03CR) 10Elukey: "Note: a follow up patch will add the java.security bits. 
Sharing them with the kafka ones would make our life easier to avoid duplication " [puppet] - 10https://gerrit.wikimedia.org/r/493693 (https://phabricator.wikimedia.org/T217412) (owner: 10Elukey) [15:36:37] (03PS2) 10Ema: role::trafficserver::backend: add conftool client [puppet] - 10https://gerrit.wikimedia.org/r/494742 (https://phabricator.wikimedia.org/T213263) [15:37:52] 10Operations, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install labvirt10(19|20).eqiad.wmnet - https://phabricator.wikimedia.org/T172538 (10Bstorm) [15:38:29] (03CR) 10Ema: [C: 03+2] role::trafficserver::backend: add conftool client [puppet] - 10https://gerrit.wikimedia.org/r/494742 (https://phabricator.wikimedia.org/T213263) (owner: 10Ema) [15:41:08] (03PS4) 10Elukey: Assign role::analytics_cluster::superset to analytics-tool1004 [puppet] - 10https://gerrit.wikimedia.org/r/494473 (https://phabricator.wikimedia.org/T212243) [15:42:04] (03CR) 10Ottomata: hadoop: allow the configuration of ssl-(server|client).xml configs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/493693 (https://phabricator.wikimedia.org/T217412) (owner: 10Elukey) [15:42:06] (03PS2) 10Dzahn: trafficserver: set notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/494745 [15:42:39] (03PS1) 10Andrew Bogott: Revert "shinken: temporarily remove monitoring for deployment-prep" [puppet] - 10https://gerrit.wikimedia.org/r/494753 [15:42:53] (03PS2) 10Andrew Bogott: Revert "shinken: temporarily remove monitoring for deployment-prep" [puppet] - 10https://gerrit.wikimedia.org/r/494753 [15:43:44] (03CR) 10Andrew Bogott: [C: 03+2] Revert "shinken: temporarily remove monitoring for deployment-prep" [puppet] - 10https://gerrit.wikimedia.org/r/494753 (owner: 10Andrew Bogott) [15:45:27] (03CR) 10Elukey: hadoop: allow the configuration of ssl-(server|client).xml configs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/493693 (https://phabricator.wikimedia.org/T217412) (owner: 10Elukey) [15:46:09] 
(03PS5) 10Elukey: Assign role::analytics_cluster::superset to analytics-tool1004 [puppet] - 10https://gerrit.wikimedia.org/r/494473 (https://phabricator.wikimedia.org/T212243) [15:46:25] (03CR) 10Elukey: [C: 03+2] "No op for analytics-tool1003 - https://puppet-compiler.wmflabs.org/compiler1002/15006/" [puppet] - 10https://gerrit.wikimedia.org/r/494473 (https://phabricator.wikimedia.org/T212243) (owner: 10Elukey) [15:47:09] (03PS1) 10Giuseppe Lavagetto: Use the local proxy for search under php7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494755 (https://phabricator.wikimedia.org/T215491) [15:47:11] (03PS1) 10Giuseppe Lavagetto: Direct 0.1% of anonymous users to php7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494756 (https://phabricator.wikimedia.org/T216676) [15:48:04] (03CR) 10jerkins-bot: [V: 04-1] Use the local proxy for search under php7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494755 (https://phabricator.wikimedia.org/T215491) (owner: 10Giuseppe Lavagetto) [15:48:21] (03CR) 10jerkins-bot: [V: 04-1] Direct 0.1% of anonymous users to php7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494756 (https://phabricator.wikimedia.org/T216676) (owner: 10Giuseppe Lavagetto) [15:49:23] (03PS1) 10Jcrespo: mysqld-prometheus-exporter: Fix typo on configuration [puppet] - 10https://gerrit.wikimedia.org/r/494759 (https://phabricator.wikimedia.org/T161296) [15:51:15] (03PS2) 10Jcrespo: mysqld-prometheus-exporter: Fix typo on configuration [puppet] - 10https://gerrit.wikimedia.org/r/494759 (https://phabricator.wikimedia.org/T161296) [15:52:17] (03PS2) 10Giuseppe Lavagetto: Use the local proxy for search under php7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494755 (https://phabricator.wikimedia.org/T215491) [15:52:19] (03PS2) 10Giuseppe Lavagetto: Direct 0.1% of anonymous users to php7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494756 (https://phabricator.wikimedia.org/T216676) [15:52:59] (03CR) 10Jcrespo: [C: 03+2] 
mysqld-prometheus-exporter: Fix typo on configuration [puppet] - 10https://gerrit.wikimedia.org/r/494759 (https://phabricator.wikimedia.org/T161296) (owner: 10Jcrespo) [15:53:14] (03CR) 10jerkins-bot: [V: 04-1] Use the local proxy for search under php7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494755 (https://phabricator.wikimedia.org/T215491) (owner: 10Giuseppe Lavagetto) [15:53:17] (03CR) 10jerkins-bot: [V: 04-1] Direct 0.1% of anonymous users to php7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494756 (https://phabricator.wikimedia.org/T216676) (owner: 10Giuseppe Lavagetto) [15:54:23] (03CR) 10Cwhite: [C: 03+1] Stop using transitional package names for Icinga plugins [puppet] - 10https://gerrit.wikimedia.org/r/494681 (https://phabricator.wikimedia.org/T213527) (owner: 10Muehlenhoff) [15:55:06] (03PS1) 10Ema: cp2002: use ATS backends instead of Varnish [puppet] - 10https://gerrit.wikimedia.org/r/494761 (https://phabricator.wikimedia.org/T213263) [15:57:01] jouncebot: next [15:57:01] In 1 hour(s) and 2 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190306T1700) [15:58:48] marostegui: Got a small mw patch to roll out. You deploying something now? [15:58:52] (03PS6) 10Alexandros Kosiaris: Introduce cxserver helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/492301 (https://phabricator.wikimedia.org/T213195) [16:00:01] Krinkle: No, you can go ahead :) [16:02:54] 10Operations, 10Analytics, 10vm-requests, 10Patch-For-Review, 10User-Elukey: Create a ganeti VM identical to analytics-tool1003 with Debian Buster - https://phabricator.wikimedia.org/T217640 (10elukey) [16:03:28] 10Operations, 10Analytics, 10vm-requests, 10Patch-For-Review, 10User-Elukey: Create a ganeti VM identical to analytics-tool1003 with Debian Buster - https://phabricator.wikimedia.org/T217640 (10elukey) 05Open→03Resolved analytics-tool1004 is up and running, will close this task now and re-open anothe... 
[16:04:51] (03PS3) 10Giuseppe Lavagetto: Use the local proxy for search under php7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494755 (https://phabricator.wikimedia.org/T215491) [16:04:53] (03PS3) 10Giuseppe Lavagetto: Direct 0.1% of anonymous users to php7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494756 (https://phabricator.wikimedia.org/T216676) [16:05:20] !log imported scap for buster-wikimedia (T213527) [16:05:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:23] T213527: Prepare our base system layer for Debian buster - https://phabricator.wikimedia.org/T213527 [16:05:27] ^ elukey [16:06:54] * Krinkle staging on mwdebug1002 [16:07:58] (03PS1) 10Jbond: Update file path to match debdeploy-restarts [puppet] - 10https://gerrit.wikimedia.org/r/494763 [16:08:00] (03PS1) 10Jbond: Add config file and exclude_mounts options to debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/494764 (https://phabricator.wikimedia.org/T217646) [16:08:02] (03PS1) 10Jbond: Update wmf-auto-restarts to read exclude mounts from debdeploy config [puppet] - 10https://gerrit.wikimedia.org/r/494765 (https://phabricator.wikimedia.org/T217646) [16:08:07] (03CR) 10Dzahn: [C: 03+2] "discussed on IRC traffic channel" [puppet] - 10https://gerrit.wikimedia.org/r/494745 (owner: 10Dzahn) [16:08:25] (03PS3) 10Dzahn: trafficserver: set notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/494745 [16:08:27] 10Operations, 10ops-eqiad, 10Analytics, 10decommission: Decommission dbstore1002 - https://phabricator.wikimedia.org/T216491 (10RobH) [16:08:57] (03CR) 10Alexandros Kosiaris: [C: 03+1] prometheus: add recording rules for service runner percentiles [puppet] - 10https://gerrit.wikimedia.org/r/494700 (https://phabricator.wikimedia.org/T217715) (owner: 10Filippo Giunchedi) [16:09:30] (03CR) 10jerkins-bot: [V: 04-1] Update wmf-auto-restarts to read exclude mounts from debdeploy config [puppet] - 10https://gerrit.wikimedia.org/r/494765 
(https://phabricator.wikimedia.org/T217646) (owner: 10Jbond) [16:10:05] (03CR) 10WMDE-Fisch: [C: 03+1] Remove disable confirmation prompt configuration. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494706 (https://phabricator.wikimedia.org/T215019) (owner: 10Corinna Hillebrand) [16:10:31] 10Operations, 10monitoring, 10Patch-For-Review, 10Wikimedia-Incident: INCIDENT: k8s@codfw prometheus queries disabled -- very slow to execute some queries - https://phabricator.wikimedia.org/T217715 (10akosiaris) Ouch, sorry about that. > > In terms of mitigations I'm thinking of starting on the Promethe... [16:10:58] !log krinkle@deploy1001 Synchronized php-1.33.0-wmf.20/includes/api/ApiBase.php: I921777089fb8cfb (duration: 00m 58s) [16:10:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:16] moritzm: \o/ [16:12:35] (03PS2) 10Jbond: Update wmf-auto-restarts to read exclude mounts from debdeploy config [puppet] - 10https://gerrit.wikimedia.org/r/494765 (https://phabricator.wikimedia.org/T217646) [16:13:44] moritzm: I know you'll hate me, but scap : Depends: python-conftool but it is not installable [16:13:47] :D [16:14:38] that should be something that we can copy without rebuild but not super sure [16:15:12] (03CR) 10Krinkle: [C: 03+1] Direct 0.1% of anonymous users to php7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494756 (https://phabricator.wikimedia.org/T216676) (owner: 10Giuseppe Lavagetto) [16:15:22] yeah it's one of our packages so I guess it just needs to be built for buster [16:17:53] elukey: on it :-) [16:18:04] PROBLEM - puppet last run on analytics-tool1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap],Package[analytics/superset/deploy] [16:18:08] buuuuu [16:18:11] :D [16:18:24] (the an-tool1004 failure) [16:21:54] elukey: hey, it's running buster! :) [16:22:14] it is! 
[16:23:36] !log imported conftool 1.0.2-1+deb10u1 for buster-wikimedia [16:23:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:42] ^ elukey [16:23:54] ooooh [16:24:03] wait is buster released already? [16:24:09] nope! [16:24:18] ooooohhhh cutting edge! [16:24:42] apergos: preparations with a few systems, see https://phabricator.wikimedia.org/T213527 [16:25:15] gtk [16:25:28] bookmarked (I don't want to get all the mails, just to check it once in a while) [16:25:49] https://phabricator.wikimedia.org/T193224 that looks very scary [16:26:00] 10Operations, 10Gerrit, 10Release-Engineering-Team, 10Patch-For-Review: Disable jgit gc on gerrit - https://phabricator.wikimedia.org/T217497 (10Paladox) Just noting that a second bug has been identified in jgit gc. [16:30:59] moritzm: works! [16:33:26] (03PS1) 10Ema: ATS: do not return X-Cache-Status [puppet] - 10https://gerrit.wikimedia.org/r/494770 (https://phabricator.wikimedia.org/T213263) [16:35:16] 10Operations, 10Patch-For-Review: Prepare our base system layer for Debian buster - https://phabricator.wikimedia.org/T213527 (10jcrespo) After the buster upgrade, what appears to be the debmonitor hook fails on apt update, upgrade at db1114 with: ` su: warning: cannot change directory to /nonexistent: No suc... [16:37:03] (03PS1) 10Bstorm: osm: Add a cloud-internal address for the osmdb cluster [puppet] - 10https://gerrit.wikimedia.org/r/494771 (https://phabricator.wikimedia.org/T193264) [16:37:54] elukey: \o/ [16:37:58] 10Operations, 10DBA, 10monitoring, 10Patch-For-Review: MySQL metrics monitoring - https://phabricator.wikimedia.org/T143896 (10jcrespo) [16:38:06] 10Operations, 10DBA, 10Patch-For-Review, 10User-fgiunchedi: Upgrade mysqld_exporter in production - https://phabricator.wikimedia.org/T161296 (10jcrespo) 05Open→03Stalled a:05jcrespo→03None Fixed configuration for buster, but with no additional metrics (same metrics as before). We can think of ena...
[16:38:21] 10Operations, 10Patch-For-Review: Prepare our base system layer for Debian buster - https://phabricator.wikimedia.org/T213527 (10Volans) @jcrespo that's T216832 and we were thinking to just create a home for the user (cc @MoritzMuehlenhoff ) [16:39:46] PROBLEM - superset on analytics-tool1004 is CRITICAL: connect to address 10.64.36.116 and port 9080: Connection refused [16:40:49] (03CR) 10Marostegui: mariadb: Refactor dump_section.py and rename to match functionality (034 comments) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/493664 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [16:41:38] (03Abandoned) 10Zoranzoki21: Added new protection levels for dewiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492534 (https://phabricator.wikimedia.org/T216885) (owner: 10Zoranzoki21) [16:42:50] (03PS4) 10Zoranzoki21: wgCopyUploadDomains: Changed domain for mehrnews.com [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492448 (https://phabricator.wikimedia.org/T213961) [16:47:30] (03CR) 10Bstorm: "If this looks right, I may still wait to merge it until the migration is closer to change over just in case it causes confusion." [puppet] - 10https://gerrit.wikimedia.org/r/494771 (https://phabricator.wikimedia.org/T193264) (owner: 10Bstorm) [16:49:21] (03CR) 10Ema: [C: 03+2] ATS: do not return X-Cache-Status [puppet] - 10https://gerrit.wikimedia.org/r/494770 (https://phabricator.wikimedia.org/T213263) (owner: 10Ema) [16:49:30] (03CR) 10Gehel: "PCC looks good: https://puppet-compiler.wmflabs.org/compiler1002/15001/" [puppet] - 10https://gerrit.wikimedia.org/r/494735 (https://phabricator.wikimedia.org/T216052) (owner: 10Gehel) [16:50:56] 10Operations, 10ops-eqiad, 10Cognate, 10Growth-Team, and 6 others: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (10Tgr) >>! 
In T187960#4997875, @Marostegui wrote: > #reading-infrastructure-team-backlog tagging you here as this affects x1 master (T187960#4997790... [16:51:21] !log upgrade ATS to 8.0.2-1wm1 [16:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:03] !log built prometheus-openldap-exporter for stretch [16:53:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:24] 10Operations, 10ops-eqiad, 10Cognate, 10Growth-Team, and 6 others: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (10Marostegui) [16:53:33] (03PS8) 10Dzahn: Gerrit: Add icinga check to use healthcheck endpoint [puppet] - 10https://gerrit.wikimedia.org/r/489457 (https://phabricator.wikimedia.org/T215457) (owner: 10Paladox) [16:54:23] (03CR) 10jerkins-bot: [V: 04-1] Gerrit: Add icinga check to use healthcheck endpoint [puppet] - 10https://gerrit.wikimedia.org/r/489457 (https://phabricator.wikimedia.org/T215457) (owner: 10Paladox) [16:57:04] (03CR) 10Andrew Bogott: [C: 03+1] osm: Add a cloud-internal address for the osmdb cluster [puppet] - 10https://gerrit.wikimedia.org/r/494771 (https://phabricator.wikimedia.org/T193264) (owner: 10Bstorm) [17:00:05] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear deployers, time to do the Morning SWAT (Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190306T1700). [17:00:05] kostajh, stephanebisson, alaa_wmde, and Zoranzoki21: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [17:00:15] hi [17:01:42] I can SWAT [17:01:52] Hi :) [17:01:52] kostajh: are you around? 
[17:02:20] Yes [17:03:45] 10Operations, 10ops-codfw: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T217755 (10RobH) This system is no longer under warranty (expired on Jan. 9, 2018), and any disks will need replacement with on site spares. [17:05:12] (03PS2) 10Sbisson: Welcome survey: send all newcomers to variation A (cs, ko) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494698 (https://phabricator.wikimedia.org/T217625) [17:05:17] 10Operations, 10ops-eqsin, 10Traffic: amber light on cp5006/5007 - https://phabricator.wikimedia.org/T216691 (10RobH) [17:05:21] (03PS1) 10Ema: ATS: listen on port 3128 by default [puppet] - 10https://gerrit.wikimedia.org/r/494780 [17:05:22] 10Operations, 10ops-eqsin, 10Traffic: cp5007 correctable mem errors - https://phabricator.wikimedia.org/T216716 (10RobH) 05Open→03Stalled p:05Normal→03Low I'm keeping them open for a month after the memory swap for followup. [17:05:35] 10Operations, 10ops-eqsin, 10Traffic: amber light on cp5006/5007 - https://phabricator.wikimedia.org/T216691 (10RobH) 05Open→03Resolved a:03RobH [17:06:09] (03CR) 10Sbisson: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494698 (https://phabricator.wikimedia.org/T217625) (owner: 10Sbisson) [17:06:40] 10Operations, 10ops-eqsin, 10Traffic: amber light on cp5006/5007 - https://phabricator.wikimedia.org/T216691 (10RobH) [17:06:42] 10Operations, 10ops-eqsin, 10Traffic: cp5006 correctable mem errors - https://phabricator.wikimedia.org/T216717 (10RobH) 05Open→03Stalled p:05Normal→03Low I'm keeping this open for a month after the swap. If no further errors are logged (need to manually check the SEL) by March 25th, this can be res... 
[17:07:43] (03Merged) 10jenkins-bot: Welcome survey: send all newcomers to variation A (cs, ko) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494698 (https://phabricator.wikimedia.org/T217625) (owner: 10Sbisson) [17:07:58] (03CR) 10jenkins-bot: Welcome survey: send all newcomers to variation A (cs, ko) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494698 (https://phabricator.wikimedia.org/T217625) (owner: 10Sbisson) [17:08:55] (03CR) 10Ema: [C: 03+2] ATS: listen on port 3128 by default [puppet] - 10https://gerrit.wikimedia.org/r/494780 (owner: 10Ema) [17:09:32] !log sbisson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:494698|Welcome survey: send all newcomers to variation A (cs, ko)]] (duration: 00m 56s) [17:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:02] !log elukey@deploy1001 Started deploy [analytics/superset/deploy@911ad13]: First deploy to new host [17:10:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:30] !log elukey@deploy1001 Finished deploy [analytics/superset/deploy@911ad13]: First deploy to new host (duration: 00m 27s) [17:10:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:38] (03CR) 10Sbisson: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493009 (https://phabricator.wikimedia.org/T216730) (owner: 10Ladsgroup) [17:11:25] (03CR) 10Sbisson: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492448 (https://phabricator.wikimedia.org/T213961) (owner: 10Zoranzoki21) [17:11:32] (03CR) 10Sbisson: wgCopyUploadDomains: Changed domain for mehrnews.com [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492448 (https://phabricator.wikimedia.org/T213961) (owner: 10Zoranzoki21) [17:11:36] (03PS5) 10Sbisson: wgCopyUploadDomains: Changed domain for mehrnews.com [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492448 
(https://phabricator.wikimedia.org/T213961) (owner: 10Zoranzoki21) [17:11:44] (03Merged) 10jenkins-bot: labs: Enable musical notation datatype in wikidatawiki in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493009 (https://phabricator.wikimedia.org/T216730) (owner: 10Ladsgroup) [17:11:46] (03CR) 10Sbisson: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492448 (https://phabricator.wikimedia.org/T213961) (owner: 10Zoranzoki21) [17:11:49] (03PS1) 10Zoranzoki21: Added new throttle rules, removed expired [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494782 (https://phabricator.wikimedia.org/T217485) [17:12:57] PROBLEM - Ensure trafficserver_exporter is running on cp2003 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [17:13:01] PROBLEM - Check systemd state on cp1074 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:13:03] PROBLEM - Check systemd state on cp1073 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:13:03] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp2009 is CRITICAL: connect to address 10.192.16.135 and port 9122: Connection refused [17:13:07] PROBLEM - Check systemd state on cp2015 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[17:13:19] PROBLEM - Ensure trafficserver_exporter is running on cp2015 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter [17:13:20] (03Merged) 10jenkins-bot: wgCopyUploadDomains: Changed domain for mehrnews.com [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492448 (https://phabricator.wikimedia.org/T213961) (owner: 10Zoranzoki21) [17:13:21] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp1073 is CRITICAL: connect to address 10.64.48.107 and port 9122: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [17:13:25] PROBLEM - Ensure trafficserver_exporter is running on cp2021 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [17:13:25] PROBLEM - Ensure trafficserver_exporter is running on cp1073 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [17:13:31] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp1072 is CRITICAL: connect to address 10.64.48.106 and port 9122: Connection refused [17:13:31] PROBLEM - Check systemd state on cp2009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:13:35] PROBLEM - Check systemd state on cp2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:13:41] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp2021 is CRITICAL: connect to address 10.192.48.25 and port 9122: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [17:13:47] downtime expired? 
[17:13:47] ema: --^ [17:13:51] PROBLEM - Ensure trafficserver_exporter is running on cp2009 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter [17:13:51] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp2003 is CRITICAL: connect to address 10.192.0.124 and port 9122: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [17:13:51] I guess ATS? [17:14:04] yes sorry about that! [17:14:16] PEBKAC, no impact [17:14:43] kostajh: Your patch is on mwdebug1002 for you to test [17:14:59] stephanebisson: looking [17:15:17] (03PS6) 10Ottomata: Sync /srv/published-datasets from SWAP hosts [puppet] - 10https://gerrit.wikimedia.org/r/494501 (https://phabricator.wikimedia.org/T217619) [17:15:35] stephanebisson: looks good [17:15:45] alaa_wmde: Your patch was merged. It will be automatically synced on beta shortly, if not already [17:15:56] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp1072 is OK: HTTP OK: HTTP/1.0 200 OK - 12784 bytes in 0.111 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [17:16:09] stephanebisson: thanks will be looking there [17:17:21] !log sbisson@deploy1001 Synchronized php-1.33.0-wmf.20/extensions/GrowthExperiments/extension.json: SWAT: [[gerrit:494531|Use schema version where reading is a valid editor_interface]] (duration: 00m 56s) [17:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:50] (03CR) 10jenkins-bot: labs: Enable musical notation datatype in wikidatawiki in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493009 (https://phabricator.wikimedia.org/T216730) (owner: 10Ladsgroup) [17:17:51] (03CR) 10jenkins-bot: wgCopyUploadDomains: Changed domain for mehrnews.com [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492448 (https://phabricator.wikimedia.org/T213961) (owner: 10Zoranzoki21) [17:18:02] 
RECOVERY - Check systemd state on cp2009 is OK: OK - running: The system is fully operational [17:18:04] RECOVERY - Check systemd state on cp2003 is OK: OK - running: The system is fully operational [17:18:07] Zoranzoki21: Your patch is on mwdebug1002. Can you test? [17:18:24] RECOVERY - Ensure trafficserver_exporter is running on cp2003 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [17:18:28] RECOVERY - Check systemd state on cp1074 is OK: OK - running: The system is fully operational [17:18:32] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp2009 is OK: HTTP OK: HTTP/1.0 200 OK - 12781 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [17:18:32] RECOVERY - Check systemd state on cp1073 is OK: OK - running: The system is fully operational [17:18:36] RECOVERY - Check systemd state on cp2015 is OK: OK - running: The system is fully operational [17:18:50] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp1073 is OK: HTTP OK: HTTP/1.0 200 OK - 12776 bytes in 0.082 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [17:18:50] RECOVERY - Ensure trafficserver_exporter is running on cp2015 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [17:18:56] RECOVERY - Ensure trafficserver_exporter is running on cp2021 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [17:18:56] RECOVERY - Ensure trafficserver_exporter is running on cp1073 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [17:19:43] Zoranzoki21: Are 
you around? [17:20:17] 10Operations: Document service owner in Netbox - https://phabricator.wikimedia.org/T217686 (10ayounsi) >>! In T217686#5004689, @Volans wrote: > My main concern here is that the concept of a single service owner is limited and doesn't reflect reality. Or can be "1st point of contact". I'd argue that having *some... [17:21:13] (03CR) 10Gehel: icinga: add notes URLs to various monitoring checks, part 3 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/494729 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [17:23:09] yes I am [17:23:18] It works as expected [17:23:26] PROBLEM - puppet last run on stat1007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_analytics/reportupdater] [17:23:29] Zoranzoki21: ok, thanks, deploying... [17:23:38] (03PS7) 10Ottomata: Sync /srv/published-datasets from SWAP hosts [puppet] - 10https://gerrit.wikimedia.org/r/494501 (https://phabricator.wikimedia.org/T217619) [17:23:40] I have one more [17:23:45] Can I add in calendar: [17:24:51] !log sbisson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:492448|wgCopyUploadDomains: Changed domain for mehrnews.com]] (duration: 00m 56s) [17:24:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:05] I have one more patch. Can I add it in calendar? [17:25:38] Zoranzoki21: sure [17:26:09] ok, adding [17:26:33] 10Operations: Document service owner in Netbox - https://phabricator.wikimedia.org/T217686 (10RobH) So DC Ops already partially does this. When we rack a server, and install the OS, we have to hand it off to someone. I'm not sure if the best answer is tracking via puppet, via netbox, or elsewhere, but I like t... [17:26:38] could someone remind me how to get a integration/config patch deployed? 
( https://gerrit.wikimedia.org/r/#/c/integration/config/+/494609/ ) [17:26:45] (03PS2) 10Zoranzoki21: Added new throttle rules, removed expired [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494782 (https://phabricator.wikimedia.org/T217485) [17:27:49] cscott: The way I do it is by asking legoktm [17:27:51] stephandebisson: Ok, added it [17:28:01] cscott: ask releng, #wikimedia-releng ;) [17:28:04] *stephanebisson: Ok, added there [17:28:10] It is https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/494782/ [17:28:23] volans: oh, thanks. i get ops and releng mixed up in my head sometimes. [17:28:30] RECOVERY - puppet last run on stat1007 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [17:28:35] no worries, we don't have +2 on that repo ;) [17:28:42] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp2021 is OK: HTTP OK: HTTP/1.0 200 OK - 12777 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [17:29:53] (03CR) 10Sbisson: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494782 (https://phabricator.wikimedia.org/T217485) (owner: 10Zoranzoki21) [17:29:58] 10Operations: Mapping of servers to stakeholders - https://phabricator.wikimedia.org/T216088 (10ayounsi) I got pointed to this task from T217686, where I suggest another option. Puppet also seems a good location, but dunno how the implementation would go. Maybe leverage git history/blame to backfill existing cl... 
[17:30:24] RECOVERY - Ensure trafficserver_exporter is running on cp2009 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [17:30:46] (03CR) 10Ottomata: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/15011/" [puppet] - 10https://gerrit.wikimedia.org/r/494501 (https://phabricator.wikimedia.org/T217619) (owner: 10Ottomata) [17:30:54] (03CR) 10Ottomata: [C: 03+2] Sync /srv/published-datasets from SWAP hosts [puppet] - 10https://gerrit.wikimedia.org/r/494501 (https://phabricator.wikimedia.org/T217619) (owner: 10Ottomata) [17:31:33] (03Merged) 10jenkins-bot: Added new throttle rules, removed expired [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494782 (https://phabricator.wikimedia.org/T217485) (owner: 10Zoranzoki21) [17:32:10] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp2003 is OK: HTTP OK: HTTP/1.0 200 OK - 12790 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [17:32:35] (03PS6) 10Bstorm: labstore: convert our first systemd timer to the new format [puppet] - 10https://gerrit.wikimedia.org/r/490112 (https://phabricator.wikimedia.org/T210818) [17:32:44] (03CR) 10jenkins-bot: Added new throttle rules, removed expired [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494782 (https://phabricator.wikimedia.org/T217485) (owner: 10Zoranzoki21) [17:32:45] stephanebisson: It can go directly, because it is a throttle rules patch [17:32:59] Zoranzoki21: ok [17:33:24] !log sbisson@deploy1001 sync-file aborted: SWAT: [[gerrit:494782|Added new throttle rules, removed expired]] (duration: 00m 01s) [17:33:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:21] stephanebisson: Is there a problem? [17:34:22] 10Operations: Mapping of servers to stakeholders - https://phabricator.wikimedia.org/T216088 (10MoritzMuehlenhoff) >>!
In T216088#5005884, @ayounsi wrote: > I got pointed to this task from T217686, where I suggest another option. > > Puppet also seems a good location, but dunno how the implementation would go.... [17:34:41] !log sbisson@deploy1001 Synchronized wmf-config/throttle.php: SWAT: [[gerrit:494782|Added new throttle rules, removed expired]] (duration: 00m 55s) [17:34:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:44] Zoranzoki21: no worries, I hit enter to fast [17:34:49] *too fast [17:35:02] stephanebisson: Ok, everything works as expected. Thanks much! [17:35:10] Zoranzoki21: It is synced now [17:35:20] And that concludes SWAT. Thanks everybody [17:35:20] stephanebisson: Yes, I see [17:35:49] :) [17:35:52] I'm going to eat now [17:35:54] Bye [17:35:55] (03CR) 10BryanDavis: [C: 03+1] osm: Add a cloud-internal address for the osmdb cluster [puppet] - 10https://gerrit.wikimedia.org/r/494771 (https://phabricator.wikimedia.org/T193264) (owner: 10Bstorm) [17:36:20] !log joal@deploy1001 Started deploy [analytics/refinery@fef9181]: Regular analytics weekly deploy train [17:36:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:03] 10Operations, 10ops-eqiad, 10Analytics, 10decommission: Decommission dbstore1002 - https://phabricator.wikimedia.org/T216491 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for dbstore1002.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Removed from PuppetD...
[17:41:14] 10Operations, 10ops-eqiad, 10Analytics, 10decommission: Decommission dbstore1002 - https://phabricator.wikimedia.org/T216491 (10RobH) [17:43:47] (03PS1) 10RobH: decom dbstore1002 [puppet] - 10https://gerrit.wikimedia.org/r/494794 (https://phabricator.wikimedia.org/T216491) [17:44:26] (03PS1) 10RobH: decom dbstore1002 [dns] - 10https://gerrit.wikimedia.org/r/494795 (https://phabricator.wikimedia.org/T216491) [17:45:31] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:45:32] (03CR) 10RobH: [C: 03+2] decom dbstore1002 [dns] - 10https://gerrit.wikimedia.org/r/494795 (https://phabricator.wikimedia.org/T216491) (owner: 10RobH) [17:45:53] (03CR) 10RobH: [C: 03+2] decom dbstore1002 [puppet] - 10https://gerrit.wikimedia.org/r/494794 (https://phabricator.wikimedia.org/T216491) (owner: 10RobH) [17:46:17] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:47:02] 10Operations, 10ops-eqiad, 10Analytics, 10decommission: Decommission dbstore1002 - https://phabricator.wikimedia.org/T216491 (10RobH) a:05RobH→03Cmjohnson [17:47:13] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:48:58] 10Operations, 10monitoring, 10Patch-For-Review, 10Wikimedia-Incident: INCIDENT: k8s@codfw prometheus queries disabled -- very slow to execute some queries - https://phabricator.wikimedia.org/T217715 (10CDanis) >>! In T217715#5005574, @akosiaris wrote: >> Also making sure `url` label can't grow with unbound... [17:50:35] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:51:17] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:51:25] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:52:23] PROBLEM - Disk space on notebook1003 is CRITICAL: DISK CRITICAL - free space: /srv 335 MB (0% inode=86%) [17:54:37] https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/493009/ is good on beta [17:56:39] (03PS2) 10Filippo Giunchedi: prometheus: add recording rules for service runner percentiles [puppet] - 10https://gerrit.wikimedia.org/r/494700 (https://phabricator.wikimedia.org/T217715) [17:57:29] (03CR) 10Filippo Giunchedi: [C: 03+2] "Deploying this now as no Prometheus restart/reload is needed" [puppet] - 10https://gerrit.wikimedia.org/r/494700 (https://phabricator.wikimedia.org/T217715) (owner: 10Filippo Giunchedi) [17:59:58] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission oxygen.eqiad.wmnet - https://phabricator.wikimedia.org/T211826 (10RobH) [18:00:37] (03PS7) 10Bstorm: labstore: convert our first systemd timer to the new format [puppet] - 10https://gerrit.wikimedia.org/r/490112 (https://phabricator.wikimedia.org/T210818) [18:03:18] 10Operations, 10Cloud-VPS, 10Toolforge, 10LDAP: groups: cannot find name for group ID - https://phabricator.wikimedia.org/T217280 (10GTirloni) @MoritzMuehlenhoff @akosiaris do you have any concerns with us adding an extra 4 vCPUs to seaborgium and serpens? Seaborgium is particular is almost always close to... 
[18:03:22] (03PS1) 10RobH: decom dns entries for oxygen [dns] - 10https://gerrit.wikimedia.org/r/494799 (https://phabricator.wikimedia.org/T211826) [18:03:44] (03CR) 10RobH: [C: 03+2] decom dns entries for oxygen [dns] - 10https://gerrit.wikimedia.org/r/494799 (https://phabricator.wikimedia.org/T211826) (owner: 10RobH) [18:04:46] !log disabled puppet and downtimed labstore2004 while deploying a change for T210818 [18:04:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:49] T210818: Move admin cron jobs to systemd timers - https://phabricator.wikimedia.org/T210818 [18:05:00] (03PS2) 10Andrew Bogott: Fix package name for ack-grep in buster [puppet] - 10https://gerrit.wikimedia.org/r/494684 (owner: 10Muehlenhoff) [18:05:03] (03CR) 10Bstorm: [C: 03+2] labstore: convert our first systemd timer to the new format [puppet] - 10https://gerrit.wikimedia.org/r/490112 (https://phabricator.wikimedia.org/T210818) (owner: 10Bstorm) [18:05:41] (03PS3) 10Filippo Giunchedi: prometheus: introduce query/connection limits parameters [puppet] - 10https://gerrit.wikimedia.org/r/494685 (https://phabricator.wikimedia.org/T217715) [18:06:05] (03PS3) 10Andrew Bogott: Fix package name for ack-grep in buster [puppet] - 10https://gerrit.wikimedia.org/r/494684 (owner: 10Muehlenhoff) [18:06:18] 10Operations, 10Cloud-VPS, 10Toolforge, 10LDAP: groups: cannot find name for group ID - https://phabricator.wikimedia.org/T217280 (10jbond) a:05jbond→03None [18:06:55] (03CR) 10Filippo Giunchedi: "> Patch Set 2: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/494685 (https://phabricator.wikimedia.org/T217715) (owner: 10Filippo Giunchedi) [18:07:23] !log joal@deploy1001 Finished deploy [analytics/refinery@fef9181]: Regular analytics weekly deploy train (duration: 31m 02s) [18:07:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:31] (03PS1) 10RobH: decom oxygen [puppet] - 10https://gerrit.wikimedia.org/r/494800 
(https://phabricator.wikimedia.org/T211826) [18:07:49] (03CR) 10RobH: [C: 03+2] decom oxygen [puppet] - 10https://gerrit.wikimedia.org/r/494800 (https://phabricator.wikimedia.org/T211826) (owner: 10RobH) [18:08:38] (03PS2) 10RobH: decom oxygen [puppet] - 10https://gerrit.wikimedia.org/r/494800 (https://phabricator.wikimedia.org/T211826) [18:08:52] !log re-enabled puppet after observing the change works well on the partner for labstore2004 and T210818 [18:08:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:07] 10Operations, 10Puppet, 10Packaging, 10Patch-For-Review: Prepare puppet for Debian buster - https://phabricator.wikimedia.org/T213546 (10Andrew) If you would like to test, we now have a VPS base image for buster in wmcs. You can use it in the 'testlabs' project, or I can add it to a project of your choice. [18:11:54] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/494735 (https://phabricator.wikimedia.org/T216052) (owner: 10Gehel) [18:19:27] 10Operations, 10monitoring, 10Patch-For-Review, 10Wikimedia-Incident: INCIDENT: k8s@codfw prometheus queries disabled -- very slow to execute some queries - https://phabricator.wikimedia.org/T217715 (10CDanis) Turns out those queries weren't actually to execute at all. Here's the highest-cardinality metri... [18:19:52] 10Operations, 10monitoring, 10Patch-For-Review, 10Wikimedia-Incident: INCIDENT: k8s@codfw prometheus queries disabled -- very slow to execute some queries - https://phabricator.wikimedia.org/T217715 (10CDanis) 05Open→03Resolved a:03CDanis Resolving this; there's followup work to be done but the 'inci... 
[18:24:56] (03CR) 10Cwhite: [C: 03+2] prometheus: change escaped to character classes to work around systemd bug [puppet] - 10https://gerrit.wikimedia.org/r/494262 (https://phabricator.wikimedia.org/T214594) (owner: 10Cwhite) [18:25:03] (03PS2) 10Cwhite: prometheus: change escaped to character classes to work around systemd bug [puppet] - 10https://gerrit.wikimedia.org/r/494262 (https://phabricator.wikimedia.org/T214594) [18:27:09] 10Operations, 10monitoring, 10LDAP: prometheus-openldap-exporter: Request.write called on a request after Request.finish was called - https://phabricator.wikimedia.org/T217758 (10jbond) [18:27:12] 10Operations, 10Cloud-Services, 10Cloud-VPS: investigate slapd memory leak - https://phabricator.wikimedia.org/T130593 (10jbond) [18:28:09] 10Operations, 10monitoring, 10LDAP: prometheus-openldap-exporter: Request.write called on a request after Request.finish was called - https://phabricator.wikimedia.org/T217758 (10jbond) i suspect this is related to [[ https://phabricator.wikimedia.org/T130593 | T130593: investigate slapd memory leak. 
]] [18:32:18] (03CR) 10Jcrespo: "This is technically not new code, but it has been moved around" (032 comments) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/493664 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [18:37:00] (03PS1) 10MarcoAurelio: Restrict local uploads on mediawiki.org, take 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494806 (https://phabricator.wikimedia.org/T217523) [18:37:10] (03PS2) 10MarcoAurelio: Restrict local uploads on mediawiki.org, take 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494806 (https://phabricator.wikimedia.org/T217523) [18:37:43] (03PS10) 10Jcrespo: mariadb: Refactor dump_section.py and rename to match functionality [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/493664 (https://phabricator.wikimedia.org/T206203) [18:38:06] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Refactor dump_section.py and rename to match functionality [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/493664 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [18:41:17] 10Operations, 10Documentation: Wikitech: update Bacula article - https://phabricator.wikimedia.org/T100954 (10jcrespo) I made some updates to the databases section and point to the place where most updates of that are happening. [18:52:59] bringing seaborgium (LDAP eqiad) down for maintenance (downtime expected in the 1-3min), LDAP connections will go to serpens (codfw) and there could be some timeouts [18:55:45] !log increased seaborgium vCPUs from 4 to 8 (T217280) [18:55:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:48] T217280: groups: cannot find name for group ID - https://phabricator.wikimedia.org/T217280 [18:56:34] seaborgium is back, already at 600 ldap connections. 
giving it 10min before doing the same to serpens (LDAP codfw) [18:57:35] (03CR) 10Jcrespo: "Changes changes changes" (031 comment) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/493664 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [19:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190306T1900) [19:01:45] bringing serpens (LDAP codfw) down for maintenance as well (seaborgium is at ~1000 connections, serpens has ~2700) [19:03:45] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Offload pings to dedicated server - https://phabricator.wikimedia.org/T190090 (10ayounsi) Here is the diff for codfw: `lang=diff [edit interfaces ae1 unit 2017 family inet] + filter { + output private-out4; + } [edit interfaces ae... [19:03:47] serpens is back, ~500 ldap connections. seaborgium has ~1700 [19:03:53] !log increased serpens vCPUs from 4 to 8 (T217280) [19:03:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:56] T217280: groups: cannot find name for group ID - https://phabricator.wikimedia.org/T217280 [19:06:09] jouncebot: next [19:06:09] In 0 hour(s) and 53 minute(s): MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190306T2000) [19:11:46] (03CR) 10EBernhardson: [C: 03+1] Use the local proxy for search under php7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494755 (https://phabricator.wikimedia.org/T215491) (owner: 10Giuseppe Lavagetto) [19:11:47] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install labsdb1012.eqiad.wmnet - https://phabricator.wikimedia.org/T215231 (10elukey) 05Resolved→03Open [19:13:15] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install labsdb1012.eqiad.wmnet - https://phabricator.wikimedia.org/T215231 (10elukey) I added the following bit on cr1/cr2: ` elukey@re0.cr1-eqiad# show 
| compare [edit firewall family inet filter analytics-in4 term my... [19:14:04] !log apply ping-offload redirect to private1-a-codfw - T190090 [19:14:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:07] T190090: Offload pings to dedicated server - https://phabricator.wikimedia.org/T190090 [19:15:47] 10Operations, 10ops-eqiad, 10Analytics, 10DBA, and 2 others: rack/setup/install labsdb1012.eqiad.wmnet - https://phabricator.wikimedia.org/T215231 (10elukey) [19:16:54] 10Operations, 10ops-eqiad, 10Analytics, 10DBA, and 2 others: rack/setup/install labsdb1012.eqiad.wmnet - https://phabricator.wikimedia.org/T215231 (10elukey) @jcrespo @Marostegui thoughts? What would it be best in your opinion? I'd prefer another dbproxy-based domain but not sure how complicated to create/... [19:20:45] (03PS1) 10Paladox: Gerrit: Support switching ldap servers [puppet] - 10https://gerrit.wikimedia.org/r/494811 [19:22:29] dr0ptp4kt: still a go on the varnish patch today? [19:28:52] bblack able to do this at the same time next week? i can do it now if not. The date for the extension got pushed back one week but I have authorization to do the change now if you like [19:29:40] Or I should say 3-4 next week [19:33:06] dr0ptp4kt: next week I will only be here Tuesday I think. [19:33:45] 10Operations, 10ops-eqiad, 10Traffic, 10decommission: Decommission lvs1007-1012 - https://phabricator.wikimedia.org/T208586 (10RobH) a:03RobH [19:33:59] oh one sec bblack [19:34:05] checking calendar [19:34:16] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Offload pings to dedicated server - https://phabricator.wikimedia.org/T190090 (10ayounsi) After applying it only to cr1-codfw, I noticed an increase of ICMP errors to eqiad's LVS, see https://grafana.wikimedia.org/d/000000513/ping-offload?orgId=1&from=... [19:35:06] bblack: i say let's do it now. want me to take the warning off the patch and rebase and such? 
[19:37:09] 10Operations, 10ops-eqiad, 10Traffic, 10decommission: Decommission lvs1007-1012 - https://phabricator.wikimedia.org/T208586 (10RobH) lvs101[012] all exist on asw-c-eqiad (but have ports also reserved on asw2-c-eqiad): ` robh@asw-c-eqiad> show interfaces descriptions | grep lvs1010 xe-8/0/23 up... [19:37:14] dr0ptp4kt: ok, yes pls [19:42:35] (03PS2) 10Dr0ptp4kt: Redirect Google Translate enwiki source to mobile [puppet] - 10https://gerrit.wikimedia.org/r/490120 (https://phabricator.wikimedia.org/T212197) [19:44:00] ^ bblack [19:44:11] (03CR) 10BBlack: [C: 03+2] Redirect Google Translate enwiki source to mobile [puppet] - 10https://gerrit.wikimedia.org/r/490120 (https://phabricator.wikimedia.org/T212197) (owner: 10Dr0ptp4kt) [19:45:29] (03PS2) 10Paladox: Gerrit: Support switching ldap servers [puppet] - 10https://gerrit.wikimedia.org/r/494811 [19:45:59] (03PS3) 10Paladox: Gerrit: Support switching ldap servers [puppet] - 10https://gerrit.wikimedia.org/r/494811 [19:46:04] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/494811 (owner: 10Paladox) [19:47:04] 10Operations, 10Office-IT, 10Wikimedia-Mailing-lists, 10CommRel-Specialists-Support (Jan-Mar-2019): Mailing list migration for Arbitration Committee to Google Group - https://phabricator.wikimedia.org/T215940 (10eross) Hi All, Wanted to follow up formally on here. Arbcom-en-b and Arbcom-en-c is finished.... [19:47:38] (03PS4) 10Paladox: Gerrit: Support switching ldap servers [puppet] - 10https://gerrit.wikimedia.org/r/494811 [19:47:44] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/494811 (owner: 10Paladox) [19:49:36] 10Operations, 10ops-eqiad, 10Traffic, 10decommission: Decommission lvs1007-1012 - https://phabricator.wikimedia.org/T208586 (10RobH) Since this is lvs, they are on every switch stack =P Row A: lvs101[012] don't show on either asw-a-eqiad or asw2-a-eqiad. Row B: doesnt show on asw2-b-eqiad, asw-b-eqiad i... 
[19:50:22] 10Operations, 10Office-IT, 10Wikimedia-Mailing-lists, 10CommRel-Specialists-Support (Jan-Mar-2019): Mailing list migration for Arbitration Committee to Google Group - https://phabricator.wikimedia.org/T215940 (10MarcoAurelio) Are those google groups hosted and/or managed by the Wikimedia Foundation or it i... [19:51:33] (03PS2) 10Filippo Giunchedi: WIP: mirror udp2log data into the logging pipeline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494254 (https://phabricator.wikimedia.org/T126989) [19:51:36] !log hashar@deploy1001 Synchronized php-1.33.0-wmf.20/extensions/LdapAuthentication/LdapPrimaryAuthenticationProvider.php: Remove calls to no-longer-implemented methods after I2eeaeed1 - T217692 (duration: 00m 58s) [19:51:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:39] T217692: labtestweb2001: Fatal error: unknown class AuthPlugin in /srv/mediawiki/php-1.33.0-wmf.20/extensions/LdapAuthentication/LdapAuthenticationPlugin.php on line 21 - https://phabricator.wikimedia.org/T217692 [19:52:32] (03CR) 10jerkins-bot: [V: 04-1] WIP: mirror udp2log data into the logging pipeline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494254 (https://phabricator.wikimedia.org/T126989) (owner: 10Filippo Giunchedi) [19:52:35] dr0ptp4kt: should be live everywhere now [19:52:42] bblack, thx will check [19:58:06] (03PS5) 10Paladox: gerrit: Switch db from mysql to H2 [puppet] - 10https://gerrit.wikimedia.org/r/488093 (https://phabricator.wikimedia.org/T211139) [19:59:26] (03Abandoned) 10Paladox: Test do not merge [puppet] - 10https://gerrit.wikimedia.org/r/355894 (owner: 10Paladox) [20:00:04] Deploy window MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190306T2000) [20:01:05] (03PS4) 10Paladox: Gerrit: Support git protocol version 2 [puppet] - 10https://gerrit.wikimedia.org/r/473643 [20:02:39] bblack: it seems to be doing what is expected [20:03:27] I am delaying the 
train by a few minutes sorry. Errand happening at home [20:06:40] and ready [20:07:06] (03PS1) 10Hashar: group1 wikis to 1.33.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494821 [20:07:08] (03CR) 10Hashar: [C: 03+2] group1 wikis to 1.33.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494821 (owner: 10Hashar) [20:08:30] (03Merged) 10jenkins-bot: group1 wikis to 1.33.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494821 (owner: 10Hashar) [20:08:55] dr0ptp4kt: awesome :) [20:12:49] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.33.0-wmf.20 [20:13:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:03] (03CR) 10jenkins-bot: group1 wikis to 1.33.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494821 (owner: 10Hashar) [20:13:48] thanks bblack ! [20:14:33] !log hashar@deploy1001 Synchronized php: group1 wikis to 1.33.0-wmf.20 (duration: 01m 43s) [20:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:56] 10Operations, 10Office-IT, 10Wikimedia-Mailing-lists, 10CommRel-Specialists-Support (Jan-Mar-2019): Mailing list migration for Arbitration Committee to Google Group - https://phabricator.wikimedia.org/T215940 (10eross) These are under Wikimedia Foundation and they are private. However, OIT will not manage... [20:21:17] 10Operations, 10Office-IT, 10Wikimedia-Mailing-lists, 10CommRel-Specialists-Support (Jan-Mar-2019): Mailing list migration for Arbitration Committee to Google Group - https://phabricator.wikimedia.org/T215940 (10MarcoAurelio) >>! In T215940#5006511, @eross wrote: > These are under Wikimedia Foundation and... 
[20:23:04] (03PS12) 10Paladox: Gerrit: Add flogger javaopts [puppet] - 10https://gerrit.wikimedia.org/r/463519 (https://phabricator.wikimedia.org/T200739) [20:30:43] !log 1.33.0-wmf.20 looks fine with group0 and group1 [20:30:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:46] 10Operations, 10Office-IT, 10Wikimedia-Mailing-lists, 10CommRel-Specialists-Support (Jan-Mar-2019): Mailing list migration for Arbitration Committee to Google Group - https://phabricator.wikimedia.org/T215940 (10WormTT) The [20:52:45] 10Operations, 10Cloud-VPS, 10Toolforge, 10LDAP: groups: cannot find name for group ID - https://phabricator.wikimedia.org/T217280 (10Bstorm) Just a note: since the upgrade, serpens' memory graph looks a lot better. https://grafana.wikimedia.org/d/000000377/host-overview?refresh=5m&panelId=4&fullscreen&orgI... [20:59:50] 10Operations, 10Office-IT, 10Wikimedia-Mailing-lists, 10CommRel-Specialists-Support (Jan-Mar-2019): Mailing list migration for Arbitration Committee to Google Group - https://phabricator.wikimedia.org/T215940 (10eross) You are now the owner Dave feel free to add people. [21:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190306T2100). [21:01:47] 10Operations: tmpreaper possible race condition - https://phabricator.wikimedia.org/T151304 (10jbond) when monitoring the tmp dir one sees many short lived tmp files and a few long long lived files. running lsof shows that the short lived files get created by hhvm (at least on mw1347). My assumption is that fo... 
[21:16:32] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Offload pings to dedicated server - https://phabricator.wikimedia.org/T190090 (10ayounsi) Redirect test with unused .225 IP `lang=diff [edit interfaces ae1 unit 2017 family inet] + filter { + output private-out4; + } [edit firewa... [21:23:18] !log test ping-offload with unused IP 208.80.153.225 - T190090 [21:23:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:21] T190090: Offload pings to dedicated server - https://phabricator.wikimedia.org/T190090 [21:32:04] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission oxygen.eqiad.wmnet - https://phabricator.wikimedia.org/T211826 (10RobH) a:05RobH→03Cmjohnson [21:43:23] 10Operations, 10Office-IT, 10Wikimedia-Mailing-lists, 10CommRel-Specialists-Support (Jan-Mar-2019): Mailing list migration for Arbitration Committee to Google Group - https://phabricator.wikimedia.org/T215940 (10WormTT) That's brilliant - I've added the other members too. I understand there's still a lot... [21:46:17] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Offload pings to dedicated server - https://phabricator.wikimedia.org/T190090 (10ayounsi) Everything has been rolled back for now. I also added a logging term: ` then { count ping-redirected; next-ip 10.192.0.22/32; } ` Wh... 
[21:53:17] (03PS1) 10Thcipriani: Update branched submodules to 2.15.11 [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/494829 [21:53:36] (03CR) 10EBernhardson: Add support for elasticsearch 6 (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/493234 (https://phabricator.wikimedia.org/T217196) (owner: 10DCausse) [21:55:40] (03CR) 10EBernhardson: [C: 03+1] Plugins for elasticsearch 6.5.4 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/491297 (https://phabricator.wikimedia.org/T199791) (owner: 10DCausse) [21:55:43] (03PS2) 10Thcipriani: Update branched submodules to 2.15.11 [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/494829 [21:57:18] (03CR) 10Paladox: [C: 03+2] "This all looks good to me (i was speaking to David P when he updated these plugins with the bazlets fixes)" [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/494829 (owner: 10Thcipriani) [22:30:25] 10Operations, 10Office-IT, 10Wikimedia-Mailing-lists, 10CommRel-Specialists-Support (Jan-Mar-2019): Mailing list migration for Arbitration Committee to Google Group - https://phabricator.wikimedia.org/T215940 (10eross) No problem, glad I could help out and learn something new :) [22:32:01] (03PS12) 10CRusnov: Add ganeti->netbox sync script [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/492007 (https://phabricator.wikimedia.org/T215229) [22:38:43] (03PS13) 10CRusnov: Add ganeti->netbox sync script [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/492007 (https://phabricator.wikimedia.org/T215229) [22:40:21] (03CR) 10CRusnov: "Hello!" 
(038 comments) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/492007 (https://phabricator.wikimedia.org/T215229) (owner: 10CRusnov) [22:44:53] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate cloudvirt1024 from b8-eqiad:u24 to b2-eqiad:u17 - https://phabricator.wikimedia.org/T216724 (10RobH) p:05Triage→03Normal a:05RobH→03Cmjohnson [22:47:30] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate cloudvirt1024 from b8-eqiad:u24 to b2-eqiad:u17 - https://phabricator.wikimedia.org/T216724 (10RobH) [22:51:25] (03PS1) 10Bstorm: osmdb: refactor the password framework to not use the module [puppet] - 10https://gerrit.wikimedia.org/r/494843 (https://phabricator.wikimedia.org/T193264) [22:52:47] (03CR) 10Bstorm: [C: 03+2] osmdb: refactor the password framework to not use the module [puppet] - 10https://gerrit.wikimedia.org/r/494843 (https://phabricator.wikimedia.org/T193264) (owner: 10Bstorm) [22:55:05] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate cloudvirt1024 from b8-eqiad:u24 to b2-eqiad:u17 - https://phabricator.wikimedia.org/T216724 (10RobH) [23:03:05] (03PS5) 10CRusnov: Add configuration for the ganeti->netbox sync. [puppet] - 10https://gerrit.wikimedia.org/r/493348 (https://phabricator.wikimedia.org/T215229) [23:13:17] 10Operations, 10netops: Bird multihop BFD - https://phabricator.wikimedia.org/T209989 (10ayounsi) One suggestion from the [[ https://bird.network.cz/pipermail/bird-users/2019-March/013155.html | Bird mailing list ]] (and doc) is to change the dynamic port range on the server side. From the current: `cat /proc/s... 
[23:13:44] 10Operations, 10ops-eqiad, 10Analytics, 10decommission, and 2 others: Decommission analytics100[1,2] - https://phabricator.wikimedia.org/T205507 (10RobH) analytics1001:asw2-c-eqiad:ge-4/0/16 analytics1002:asw2-d-eqiad:ge-8/0/7 [23:14:08] 10Operations, 10Core Platform Team, 10Performance-Team, 10TechCom-RFC, and 3 others: Serve Main Page of WMF wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (10Krinkle) [23:14:19] 10Operations, 10Core Platform Team, 10Performance-Team, 10TechCom-RFC, and 3 others: Serve Main Page of WMF wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (10Krinkle) [23:15:07] 10Operations, 10Core Platform Team, 10Performance-Team, 10TechCom-RFC, and 3 others: Serve Main Page of WMF wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (10Krinkle) [23:17:02] 10Operations, 10Core Platform Team, 10Performance-Team, 10TechCom-RFC, and 3 others: Serve Main Page of WMF wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (10Krinkle) [23:19:23] 10Operations, 10ops-eqiad, 10Analytics, 10decommission, and 2 others: Decommission analytics100[1,2] - https://phabricator.wikimedia.org/T205507 (10RobH) ` [edit interfaces interface-range vlan-analytics1-d-eqiad] member xe-7/0/3 { ... } + member ge-9/0/4; + member ge-9/0/5; + member ge-9/0/6... [23:20:20] PROBLEM - Host analytics1044 is DOWN: PING CRITICAL - Packet loss = 100% [23:20:38] PROBLEM - Host analytics1042 is DOWN: PING CRITICAL - Packet loss = 100% [23:20:45] 10Operations, 10ops-eqiad, 10Analytics, 10decommission, and 2 others: Decommission analytics100[1,2] - https://phabricator.wikimedia.org/T205507 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for analytics1001.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate... 
[23:20:48] PROBLEM - Host analytics1043 is DOWN: PING CRITICAL - Packet loss = 100% [23:20:58] PROBLEM - Host analytics1045 is DOWN: PING CRITICAL - Packet loss = 100% [23:21:00] 10Operations, 10ops-eqiad, 10Analytics, 10decommission, and 2 others: Decommission analytics100[1,2] - https://phabricator.wikimedia.org/T205507 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for analytics1002.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate... [23:21:19] presumably those are related? [23:22:01] 10Operations, 10ops-eqiad, 10Analytics, 10decommission, 10User-Elukey: Decommission analytics100[1,2] - https://phabricator.wikimedia.org/T205507 (10RobH) a:05RobH→03None [23:26:45] (03PS1) 10RobH: decom oxygen [puppet] - 10https://gerrit.wikimedia.org/r/494855 (https://phabricator.wikimedia.org/T211826) [23:26:48] (03PS1) 10RobH: decom analytics100[12] [puppet] - 10https://gerrit.wikimedia.org/r/494856 (https://phabricator.wikimedia.org/T205507) [23:27:19] (03CR) 10RobH: [C: 03+2] decom oxygen [puppet] - 10https://gerrit.wikimedia.org/r/494855 (https://phabricator.wikimedia.org/T211826) (owner: 10RobH) [23:28:55] (03CR) 10RobH: [C: 03+2] decom analytics100[12] [puppet] - 10https://gerrit.wikimedia.org/r/494856 (https://phabricator.wikimedia.org/T205507) (owner: 10RobH) [23:29:14] 10Operations, 10Traffic, 10Patch-For-Review: Make cp1099 the new pinkunicorn - https://phabricator.wikimedia.org/T202966 (10ayounsi) cp1099 is the last standing host between me and powering off asw-c-eqiad. From this task and the prompt `cp1099 is a Unpuppetised system for testing (test)` it should be fine... 
[23:31:52] (03PS1) 10RobH: decom analytics100[12] production dns [dns] - 10https://gerrit.wikimedia.org/r/494857 (https://phabricator.wikimedia.org/T205507) [23:32:16] (03CR) 10jerkins-bot: [V: 04-1] decom analytics100[12] production dns [dns] - 10https://gerrit.wikimedia.org/r/494857 (https://phabricator.wikimedia.org/T205507) (owner: 10RobH) [23:34:34] 10Operations, 10ops-eqiad, 10Analytics, 10decommission, and 2 others: Decommission analytics100[1,2] - https://phabricator.wikimedia.org/T205507 (10RobH) a:03Cmjohnson [23:38:20] 10Operations, 10ops-eqiad, 10Analytics, 10decommission, 10User-Elukey: Decommission analytics100[1,2] - https://phabricator.wikimedia.org/T205507 (10RobH) [23:54:25] (03CR) 10Thcipriani: [V: 03+2] Update branched submodules to 2.15.11 [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/494829 (owner: 10Thcipriani) [23:55:15] 10Operations, 10Traffic, 10Patch-For-Review: Make cp1099 the new pinkunicorn - https://phabricator.wikimedia.org/T202966 (10ayounsi) [23:55:18] 10Operations, 10ops-eqiad, 10netops: Decommission asw-c-eqiad - https://phabricator.wikimedia.org/T208734 (10ayounsi) [23:55:53] (03PS6) 10CRusnov: Add configuration for the ganeti->netbox sync. [puppet] - 10https://gerrit.wikimedia.org/r/493348 (https://phabricator.wikimedia.org/T215229) [23:56:54] (03PS1) 10Thcipriani: Gerrit 2.15.11 release [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/494858 (https://phabricator.wikimedia.org/T214359) [23:57:24] (03Abandoned) 10Thcipriani: Gerrit 2.15.10 release [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/492025 (https://phabricator.wikimedia.org/T214359) (owner: 10Thcipriani) [23:58:13] (03CR) 10Paladox: [C: 03+2] Gerrit 2.15.11 release [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/494858 (https://phabricator.wikimedia.org/T214359) (owner: 10Thcipriani)