[00:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: (Dis)respected human, time to deploy Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181128T0000). Please do the needful. [00:00:04] bmansurov, arlolra, arlolra, and ebernhardson: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:00:09] here [00:01:06] also [00:01:14] \o [00:02:06] i suppose i can run swat today [00:02:16] arlolra: can your patches ship together? [00:02:29] yes [00:02:56] (03CR) 10EBernhardson: [C: 032] Labs: display reader trust survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476027 (https://phabricator.wikimedia.org/T209882) (owner: 10Bmansurov) [00:04:02] (03Merged) 10jenkins-bot: Labs: display reader trust survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476027 (https://phabricator.wikimedia.org/T209882) (owner: 10Bmansurov) [00:04:52] (03CR) 10jenkins-bot: Labs: display reader trust survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476027 (https://phabricator.wikimedia.org/T209882) (owner: 10Bmansurov) [00:06:50] !log ebernhardson@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: SWAT: no-op labs sync for 476027 (duration: 00m 55s) [00:06:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:07:57] !log ebernhardson@deploy1001 Synchronized wmf-config/CommonSettings-labs.php: SWAT: no-op labs sync for 476027 (duration: 00m 53s) [00:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:09:55] ebernhardson: is it too late to do wmf.4 for this one? [00:09:56] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ParsoidBatchAPI/+/476180 [00:10:35] nah we can do it. Your core patch though failed a test but it doesn't look related. 
Some sort of npm intermittent wonkery? https://integration.wikimedia.org/ci/job/release-quibble-vendor-mysql-php72-docker/19/console
[00:10:48] i can try and re +2 after it finishes failing
[00:11:14] yeah, npm install is unrelated
[00:11:19] recheck
[00:11:41] Hey. Please ignore high traffic on PDF rendering services (new Proton service). I'll be doing a bit of stress testing for the next 24 hours. Service is fully built and I want to verify that it works properly. For the next 10h I'll be running 5 concurrent render requests for the most popular articles.
[00:12:15] arlolra: the other patch, to ParsoidBatchAPI, is up on mwdebug1001
[00:12:39] o/ raynor good luck with the test
[00:12:55] bmansurov: you are synced out, i didn't test since -labs are noop in prod
[00:13:18] ebernhardson: thanks, i'm testing on my end, but I don't see the change yet.
[00:13:37] bmansurov: i don't remember how deployment-prep works these days, at one time it synced every 5 minutes with gerrit
[00:14:01] ebernhardson: yeah, yesterday i had to wait 35 mins
[00:14:05] i guess i'll test tomorrow
[00:14:11] ebernhardson: thanks for deploying!
[00:14:38] 10Operations: Netbox should use CN rather than UID for LDAP login username - https://phabricator.wikimedia.org/T210566 (10bd808) [00:17:09] ebernhardson: tested the ParsoidBatchAPI patch on mwdebug1001, looks good [00:18:56] arlolra: ok, syncing [00:19:48] !log ebernhardson@deploy1001 Synchronized php-1.33.0-wmf.6/extensions/ParsoidBatchAPI/includes/ApiParsoidBatch.php: SWAT I4e4373a7 revert Modernize ApiParsoidBatch using ApiResult to generate prettier output (duration: 00m 54s) [00:19:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:20:00] arlolra: deployed [00:20:08] the other patch is working through CI again [00:20:22] looks good [00:22:36] (03CR) 10EBernhardson: [C: 032] Start wbsearchentities AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476141 (https://phabricator.wikimedia.org/T209402) (owner: 10EBernhardson) [00:23:43] (03Merged) 10jenkins-bot: Start wbsearchentities AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476141 (https://phabricator.wikimedia.org/T209402) (owner: 10EBernhardson) [00:27:50] (03PS1) 10Andrew Bogott: Horizon: move utrs project to eqiad1-r [puppet] - 10https://gerrit.wikimedia.org/r/476181 (https://phabricator.wikimedia.org/T204745) [00:30:50] (03CR) 10Andrew Bogott: [C: 032] Horizon: move utrs project to eqiad1-r [puppet] - 10https://gerrit.wikimedia.org/r/476181 (https://phabricator.wikimedia.org/T204745) (owner: 10Andrew Bogott) [00:31:17] (03CR) 10jenkins-bot: Start wbsearchentities AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476141 (https://phabricator.wikimedia.org/T209402) (owner: 10EBernhardson) [00:32:37] (03PS8) 10CRusnov: Make the puppetdb backend process primitive types for queries. [software/cumin] - 10https://gerrit.wikimedia.org/r/474087 (https://phabricator.wikimedia.org/T207037) [00:36:16] PROBLEM - Check systemd state on ms-be2021 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[00:42:47] arlolra: your core patch is now on mwdebug1001 [00:45:01] hmm [00:46:27] i'm not getting what I would expect from that [00:46:28] curl https://www.mediawiki.org/wiki/User:Arlolra/sandbox -H 'X-Wikimedia-Debug: backend=mwdebug1001.eqiad.wmnet' | grep mw-parser-out [00:47:27] arlolra: `grep T210437 /srv/mediawiki/php-1.33.0-wmf.6/includes/parser/Parser.php` on mwdebug1001 claims the update is in place :S [00:47:27] T210437: Sanitizer::stripAllTags shouldn't expand legacy "semicolon-less" HTML5 entities - https://phabricator.wikimedia.org/T210437 [00:49:26] hmm [00:49:32] ok, let it be [00:49:39] i'll notify cscott that there's work left to do [00:49:49] arlolra: revert it? [00:49:55] please [00:49:58] ok [00:51:47] (03CR) 10Volans: [C: 04-1] "I think there is an error, see inline." (034 comments) [software/cumin] - 10https://gerrit.wikimedia.org/r/474087 (https://phabricator.wikimedia.org/T207037) (owner: 10CRusnov) [01:02:01] 10Operations, 10ops-codfw, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Backlog (Watching / External), and 2 others: rack/setup/install sessionstore200[123].codfw.wmnet - https://phabricator.wikimedia.org/T209389 (10Papaul) a:05Papaul>03RobH This is complete at my end.... [01:02:38] 10Operations, 10ops-codfw, 10Patch-For-Review, 10Services (watching): rack/setup/install restbase201[3-8].codfw.wmnet - https://phabricator.wikimedia.org/T209615 (10Papaul) [01:02:47] ebernhardson: still time for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ParsoidBatchAPI/+/476180 [01:02:50] ? [01:03:52] 10Operations, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Backlog (Watching / External), 10Services (watching): rack/setup/install sessionstore200[123].codfw.wmnet - https://phabricator.wikimedia.org/T209389 (10RobH) a:05RobH>03Eevans @eevans, You were the initial re... [01:05:49] arlolra: sure we can run a little over. 
I'll sync out my patches in one sec here and then that one
[01:06:26] !log ebernhardson@deploy1001 Synchronized php-1.33.0-wmf.4/extensions/Wikibase/view/resources/jquery/wikibase/jquery.wikibase.entityselector.js: Allow AB test to modify entityselector api request (duration: 00m 56s)
[01:06:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:07:01] ebernhardson: great, thanks for your time and effort, and sorry for being disorganized
[01:07:21] arlolra: no worries, i had to grab an extra patch to sync mine out because i forgot we didn't roll the train last week too
[01:08:44] !log ebernhardson@deploy1001 Synchronized php-1.33.0-wmf.6/extensions/WikimediaEvents/modules/wikibase/ext.wikimediaEvents.completionClicks.js: T209402: AB testing support for wbsearchentities (duration: 00m 53s)
[01:08:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:08:48] T209402: A/B testing plan for wbsearchentities, context=item - https://phabricator.wikimedia.org/T209402
[01:09:49] !log ebernhardson@deploy1001 Synchronized php-1.33.0-wmf.4/extensions/WikimediaEvents/modules/wikibase/ext.wikimediaEvents.completionClicks.js: T209402: AB testing support for wbsearchentities (duration: 00m 52s)
[01:09:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:12:39] !log ebernhardson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T209402: Configuration for wbsearchentities AB test (duration: 00m 53s)
[01:12:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:13:34] Operations, ops-codfw, Patch-For-Review, Services (watching): rack/setup/install restbase201[3-8].codfw.wmnet - https://phabricator.wikimedia.org/T209615 (Papaul)
[01:14:56] arlolra: did you purge the parser cache?
[01:15:24] i don't think curl alone will work.
you need to do an action=purge
[01:15:32] cscott: i haven't shipped the revert yet, you can still test the patch in prod in mwdebug1001
[01:15:40] ok, let me check.
[01:15:59] https://en.wikipedia.org/wiki/User:Cscott/T209236 is my test case, fwiw
[01:17:25] cscott: oh, no, i did not
[01:17:28] :(
[01:18:46] arlolra: the ParsoidBatchAPI revert is up on mwdebug1001, should be fine to ship since it's where we were an hour ago?
[01:19:33] ebernhardson: i tested it and looks good to go
[01:19:42] i just wasted a bunch of time trying to test it on en.wikipedia instead of on mediawiki :(
[01:19:53] arlolra: your test page is better. let me try that!
[01:22:09] !log ebernhardson@deploy1001 Synchronized php-1.33.0-wmf.4/extensions/ParsoidBatchAPI/includes/ApiParsoidBatch.php: SWAT: Revert ApiParsoidBatch update (duration: 00m 54s)
[01:22:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:22:14] arlolra ebernhardson : https://www.mediawiki.org/wiki/User:Arlolra/sandbox looks right now that i've purged the parser cache
[01:22:48] ugh
[01:22:50] (and now that i'm testing on a wiki which actually has the patch deployed)
[01:23:14] cscott: great! the revert hadn't made it through gerrit yet, so i can simply ship this out to the app servers
[01:24:06] how does it work when I ask mwdebug1001 for enwiki? i'm not sure I understand the interaction of X-Wikimedia-Debug and the group0/1/2 deploys
[01:24:38] cscott: your request still runs on mwdebug1001, but enwiki still runs the configured version of mediawiki
[01:25:01] that's probably still cryptic :P
[01:25:33] basically mediawiki versions all get their own directory. Enwiki will still run from the php-1.33.0-wmf.4 directory when using X-Wikimedia-Debug, but it will run on the mwdebug1001 server
[01:25:58] yeah. at any rate, i confused myself by trying to test on enwiki, but once I tested arlo's page on mediawiki.org (and purged the parser cache) everything worked as I expected
[01:26:20] it's syncing out now
[01:27:16] (PS1) EBernhardson: Start wbsearchentities AB test [mediawiki-config] - https://gerrit.wikimedia.org/r/476188 (https://phabricator.wikimedia.org/T209402)
[01:27:20] Operations, ops-codfw: rack/setup/install elastic203[7-9], elastic204[0-9], elastic205[0-4] - https://phabricator.wikimedia.org/T210450 (Papaul) @Gehel My racking proposal was to rack some of those server in rack 4 of row A,B and D. @ayounsi and I will be working on replacing all the 1G switches in thos...
[01:27:28] (CR) jerkins-bot: [V: -1] Start wbsearchentities AB test [mediawiki-config] - https://gerrit.wikimedia.org/r/476188 (https://phabricator.wikimedia.org/T209402) (owner: EBernhardson)
[01:27:51] (PS2) EBernhardson: Start wbsearchentities AB test [mediawiki-config] - https://gerrit.wikimedia.org/r/476188 (https://phabricator.wikimedia.org/T209402)
[01:28:37] hmm, i dunno what happened to logmsg bot but it synced
[01:28:58] oh i bet ... i pasted the description from phab and it included a \n inside ''. I bet the bot didn't like that...
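The exchange above about X-Wikimedia-Debug, per-version directories, and purging the parser cache can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the header value and the `/srv/mediawiki/php-<version>` layout are taken from the commands quoted in this log, while the `/w/index.php` script path and the helper names `debug_headers`, `purge_url`, and `version_dir` are hypothetical choices made for the example. Nothing is sent over the network, and a real purge may need a POST or a confirmation step.

```python
from urllib.parse import urlencode

# Debug backend named in the log; other hosts would be an assumption.
DEBUG_BACKEND = "mwdebug1001.eqiad.wmnet"


def debug_headers(backend: str = DEBUG_BACKEND) -> dict:
    """Header that pins a request to one debug server (as in the curl command in the log)."""
    return {"X-Wikimedia-Debug": f"backend={backend}"}


def purge_url(wiki_host: str, title: str) -> str:
    """URL asking the wiki to purge one page's cached rendering.

    The /w/index.php path is an assumption for this sketch.
    """
    query = urlencode({"title": title, "action": "purge"})
    return f"https://{wiki_host}/w/index.php?{query}"


def version_dir(version: str) -> str:
    """Directory a given MediaWiki version is served from on the server.

    The debug header picks the server; the wiki's configured version
    still picks the directory.
    """
    return f"/srv/mediawiki/php-{version}"


if __name__ == "__main__":
    print(debug_headers())
    print(purge_url("www.mediawiki.org", "User:Arlolra/sandbox"))
    print(version_dir("1.33.0-wmf.4"))
```

The point of the split into three helpers is the one made in the chat: the header only chooses which server handles the request, the wiki's configured version chooses the code directory, and a cached page has to be purged before a parser change becomes visible.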
[01:29:33] !log ebernhardson@deploy1001 Synchronized php-1.33.0-wmf.6/includes/parser/Parser.php: SWAT: T209236 Protect legacy URL parameter syntax in link and alt options (duration: 00m 51s)
[01:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:29:37] T209236: "&params" URL parameter (used in a link parameter in [[File]] markup) incorrectly parsed as "¶ms" (%C2%B6ms) - https://phabricator.wikimedia.org/T209236
[01:30:07] (CR) EBernhardson: [C: 2] Start wbsearchentities AB test [mediawiki-config] - https://gerrit.wikimedia.org/r/476188 (https://phabricator.wikimedia.org/T209402) (owner: EBernhardson)
[01:30:17] final patch to turn on my ab test, and this SWAT will be complete
[01:31:19] (Merged) jenkins-bot: Start wbsearchentities AB test [mediawiki-config] - https://gerrit.wikimedia.org/r/476188 (https://phabricator.wikimedia.org/T209402) (owner: EBernhardson)
[01:31:24] ebernhardson: parsing-teama appreciates the hard work you're putting in here
[01:31:40] arlolra: no worries, it pumps up my gerrit +2 numbers ;)
[01:31:54] gotta have magic internet points
[01:32:59] RECOVERY - Check systemd state on ms-be2021 is OK: OK - running: The system is fully operational
[01:34:46] !log ebernhardson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T209402: Start wbsearchentities ab test at 10% (duration: 00m 54s)
[01:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:34:49] T209402: A/B testing plan for wbsearchentities, context=item - https://phabricator.wikimedia.org/T209402
[01:35:01] alright, that should be everything complete
[01:36:17] * arlolra claps
[01:37:43] (CR) jenkins-bot: Start wbsearchentities AB test [mediawiki-config] - https://gerrit.wikimedia.org/r/476188 (https://phabricator.wikimedia.org/T209402) (owner: EBernhardson)
[01:39:31] thank you!
[01:44:53] !log ebernhardson@deploy1001 Synchronized php-1.33.0-wmf.4/extensions/WikimediaEvents/: wbsearchentities needed extension.json to be deployed as well. Sync the whole directory (duration: 00m 53s) [01:44:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:45:58] !log ebernhardson@deploy1001 Synchronized php-1.33.0-wmf.6/extensions/WikimediaEvents/: wbsearchentities needed extension.json to be deployed as well. Sync the whole directory (duration: 00m 53s) [01:45:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:08:23] turns out i broke eventlogging of this for everyone not in the test bucket .. one line fix incoming for wmf.4 and wmf.6 [02:08:35] (this=wbsearchentities) [02:09:51] 10Operations, 10ops-codfw, 10Patch-For-Review, 10Services (watching): rack/setup/install restbase201[3-8].codfw.wmnet - https://phabricator.wikimedia.org/T209615 (10Papaul) [02:12:57] 10Operations, 10ops-codfw, 10Patch-For-Review, 10Services (watching): rack/setup/install restbase201[3-8].codfw.wmnet - https://phabricator.wikimedia.org/T209615 (10Papaul) a:05Papaul>03RobH Complete [02:16:15] !log ebernhardson@deploy1001 Synchronized php-1.33.0-wmf.4/extensions/WikimediaEvents/modules/wikibase/ext.wikimediaEvents.completionClicks.js: Fix wbsearchentities collection of non-test bucketed data (duration: 00m 54s) [02:16:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:17:50] !log ebernhardson@deploy1001 Synchronized php-1.33.0-wmf.6/extensions/WikimediaEvents/modules/wikibase/ext.wikimediaEvents.completionClicks.js: Fix wbsearchentities collection of non-test bucketed data (duration: 00m 56s) [02:17:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:00:13] (03PS5) 10Mathew.onipe: profile::maps::osm_master: change osmupdater and osmimporter auth method to peer [puppet] - 10https://gerrit.wikimedia.org/r/475093 (https://phabricator.wikimedia.org/T206639) 
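The `!log ... Synchronized` entries in this log all share one shape: deployer and host, the synced path, a free-text message, and a duration. As a rough sketch, assuming that shape holds and that the parser is given one message at a time, they could be pulled apart as below. The regular expression and the name `parse_sync` are hypothetical and written only for this illustration; they are not part of any deployment tooling.

```python
import re

# Pattern inferred from the "Synchronized" entries in this log (an assumption,
# not a documented format). Expects a single log message per call.
SYNC_RE = re.compile(
    r"!log (?P<user>\S+)@(?P<host>\S+) Synchronized (?P<path>\S+): "
    r"(?P<message>.*) \(duration: (?P<min>\d+)m (?P<sec>\d+)s\)"
)


def parse_sync(line: str):
    """Return the fields of one sync entry as a dict, or None if it does not match."""
    m = SYNC_RE.search(line)
    if m is None:
        return None
    return {
        "user": m["user"],
        "host": m["host"],
        "path": m["path"],
        "message": m["message"],
        # Duration converted to whole seconds.
        "seconds": int(m["min"]) * 60 + int(m["sec"]),
    }


if __name__ == "__main__":
    sample = (
        "[01:44:53] !log ebernhardson@deploy1001 Synchronized "
        "php-1.33.0-wmf.4/extensions/WikimediaEvents/: wbsearchentities needed "
        "extension.json to be deployed as well. Sync the whole directory "
        "(duration: 00m 53s)"
    )
    print(parse_sync(sample))
```

Because the message group is greedy, feeding it several concatenated messages at once would merge them, so splitting the log on timestamps first is the safer order of operations.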
[03:00:44] (03CR) 10Mathew.onipe: profile::maps::osm_master: change osmupdater and osmimporter auth method to peer (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/475093 (https://phabricator.wikimedia.org/T206639) (owner: 10Mathew.onipe) [03:00:48] (03CR) 10jerkins-bot: [V: 04-1] profile::maps::osm_master: change osmupdater and osmimporter auth method to peer [puppet] - 10https://gerrit.wikimedia.org/r/475093 (https://phabricator.wikimedia.org/T206639) (owner: 10Mathew.onipe) [03:03:41] (03PS6) 10Mathew.onipe: profile::maps::osm_master: change osmupdater and osmimporter auth method to peer [puppet] - 10https://gerrit.wikimedia.org/r/475093 (https://phabricator.wikimedia.org/T206639) [03:08:45] prtksxna: Best wishes. [03:08:59] Waggie: Thanks again! [03:09:04] np [03:34:16] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 935.52 seconds [03:37:40] (03PS3) 10Mathew.onipe: elasticsearch: add new elastic2037-elastic2054 [puppet] - 10https://gerrit.wikimedia.org/r/475942 (https://phabricator.wikimedia.org/T210265) [04:12:58] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:16:26] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 277 bytes in 1.910 second response time [04:19:33] (03CR) 10BBlack: [C: 04-1] "Seems like it would be simpler to vary "network::external" data based on $realm (and arguably, also a slightly larger list that includes p" [puppet] - 10https://gerrit.wikimedia.org/r/475714 (owner: 10Alexandros Kosiaris) [04:26:44] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 295.54 seconds [05:35:34] PROBLEM - puppet last run on icinga1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [05:51:14] RECOVERY - puppet last run on icinga1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [06:02:58] !log Deploy schema change on s6 codfw - T86338 T202167 [06:03:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:03:04] T86338: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338 [06:03:04] T202167: Schema change for rc_this_oldid index - https://phabricator.wikimedia.org/T202167 [06:10:05] (03PS1) 10Marostegui: db-eqiad.php: Depool pc1005 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476200 (https://phabricator.wikimedia.org/T208383) [06:11:14] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool pc1005 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476200 (https://phabricator.wikimedia.org/T208383) (owner: 10Marostegui) [06:12:18] (03Merged) 10jenkins-bot: db-eqiad.php: Depool pc1005 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476200 (https://phabricator.wikimedia.org/T208383) (owner: 10Marostegui) [06:13:45] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool pc1005 - T208383 (duration: 01m 04s) [06:13:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:13:48] T208383: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 [06:14:20] !log Stop MySQL on pc1005 to clone pc1008 - T208383 [06:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:16:04] (03CR) 10jenkins-bot: db-eqiad.php: Depool pc1005 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476200 (https://phabricator.wikimedia.org/T208383) (owner: 10Marostegui) [06:35:54] (03PS1) 10Marostegui: db-eqiad.php: Depool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476201 (https://phabricator.wikimedia.org/T86338) [06:37:38] (03CR) 10Marostegui: [C: 032] 
db-eqiad.php: Depool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476201 (https://phabricator.wikimedia.org/T86338) (owner: 10Marostegui) [06:39:10] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476201 (https://phabricator.wikimedia.org/T86338) (owner: 10Marostegui) [06:39:41] 10Operations, 10ORES, 10Scoring-platform-team, 10Release Pipeline (Blubber): Build blubber file for ORES - https://phabricator.wikimedia.org/T210268 (10Ladsgroup) p:05Low>03Triage [06:40:27] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1093 T86338 T202167 (duration: 00m 53s) [06:40:31] !log Deploy schema change db1093 T86338 T202167 [06:40:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:38] T86338: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338 [06:40:39] T202167: Schema change for rc_this_oldid index - https://phabricator.wikimedia.org/T202167 [06:40:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:42:24] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476201 (https://phabricator.wikimedia.org/T86338) (owner: 10Marostegui) [06:45:20] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1093" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476203 [06:47:43] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1093" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476203 (owner: 10Marostegui) [06:48:43] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1093" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476203 (owner: 10Marostegui) [06:49:46] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1093 T86338 T202167 (duration: 00m 52s) [06:49:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:49:52] T86338: Dropping page.page_counter on wmf 
databases - https://phabricator.wikimedia.org/T86338 [06:49:52] T202167: Schema change for rc_this_oldid index - https://phabricator.wikimedia.org/T202167 [06:50:33] (03PS3) 10Marostegui: filtered_tables.txt: Remove ss_total_views column [puppet] - 10https://gerrit.wikimedia.org/r/475975 (https://phabricator.wikimedia.org/T86339) [06:51:33] (03CR) 10Marostegui: [C: 032] filtered_tables.txt: Remove ss_total_views column [puppet] - 10https://gerrit.wikimedia.org/r/475975 (https://phabricator.wikimedia.org/T86339) (owner: 10Marostegui) [06:55:37] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1093" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476203 (owner: 10Marostegui) [06:57:00] PROBLEM - MariaDB Slave Lag: pc3 on pc2009 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.84 seconds [06:57:13] I thought I silenced that one [06:57:14] checking [06:58:18] Right, it is part of another pc section, I guess it is lagging because of the extra load coming from pc1006 [06:59:40] (03PS1) 10Marostegui: db-eqiad.php: Depool db1098:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476204 (https://phabricator.wikimedia.org/T86338) [07:11:13] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1098:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476204 (https://phabricator.wikimedia.org/T86338) (owner: 10Marostegui) [07:12:12] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1098:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476204 (https://phabricator.wikimedia.org/T86338) (owner: 10Marostegui) [07:13:48] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1098:3316 T86338 T202167 (duration: 00m 53s) [07:13:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:53] T86338: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338 [07:13:54] T202167: Schema change for rc_this_oldid index - https://phabricator.wikimedia.org/T202167 [07:17:02] !log 
Deploy schema change db1098:3316 T86338 T202167 [07:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:44] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1098:3316" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476205 [07:22:17] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1098:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476204 (https://phabricator.wikimedia.org/T86338) (owner: 10Marostegui) [07:33:14] (03CR) 10Vgutierrez: [C: 032] tendril: sslcert::dhparam needs to be included especifically now [puppet] - 10https://gerrit.wikimedia.org/r/475978 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [07:33:21] (03PS3) 10Vgutierrez: tendril: sslcert::dhparam needs to be included especifically now [puppet] - 10https://gerrit.wikimedia.org/r/475978 (https://phabricator.wikimedia.org/T207050) [07:37:05] (03CR) 10Vgutierrez: [C: 032] archiva: sslcert::dhparam needs to be included especifically now [puppet] - 10https://gerrit.wikimedia.org/r/475981 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [07:37:13] (03PS5) 10Vgutierrez: archiva: sslcert::dhparam needs to be included especifically now [puppet] - 10https://gerrit.wikimedia.org/r/475981 (https://phabricator.wikimedia.org/T207050) [07:37:23] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1098:3316" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476205 (owner: 10Marostegui) [07:38:31] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1098:3316" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476205 (owner: 10Marostegui) [07:39:35] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1098:3316 T86338 T202167 (duration: 00m 54s) [07:39:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:40] T86338: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338 [07:39:40] T202167: Schema change for rc_this_oldid 
index - https://phabricator.wikimedia.org/T202167 [07:40:43] (03PS1) 10Marostegui: db-eqiad.php: Depool db1096:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476206 (https://phabricator.wikimedia.org/T86338) [07:40:46] (03CR) 10Vgutierrez: [C: 032] netmon: sslcert::dhparam needs to be included especifically now [puppet] - 10https://gerrit.wikimedia.org/r/476025 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [07:40:54] (03PS3) 10Vgutierrez: netmon: sslcert::dhparam needs to be included especifically now [puppet] - 10https://gerrit.wikimedia.org/r/476025 (https://phabricator.wikimedia.org/T207050) [07:42:18] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1096:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476206 (https://phabricator.wikimedia.org/T86338) (owner: 10Marostegui) [07:43:19] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1096:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476206 (https://phabricator.wikimedia.org/T86338) (owner: 10Marostegui) [07:44:57] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1096:3316 T86338 (duration: 00m 53s) [07:45:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:01] T86338: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338 [07:45:01] !log Deploy schema change db1096:3316 T86338 [07:45:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:38] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1096:3316" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476207 [07:48:35] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1098:3316" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476205 (owner: 10Marostegui) [07:48:37] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1096:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476206 (https://phabricator.wikimedia.org/T86338) (owner: 10Marostegui) [07:51:23] (03CR) 10Filippo 
Giunchedi: [C: 031] Remove Diamond on additional DB roles [puppet] - 10https://gerrit.wikimedia.org/r/476001 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [07:57:59] (03PS2) 10Muehlenhoff: Absent Redis Diamond collector on Redis slaves [puppet] - 10https://gerrit.wikimedia.org/r/475967 (https://phabricator.wikimedia.org/T183454) [07:59:26] (03CR) 10Muehlenhoff: [C: 032] Absent Redis Diamond collector on Redis slaves [puppet] - 10https://gerrit.wikimedia.org/r/475967 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [08:01:32] (03PS1) 10Vgutierrez: certcentral: Provide TLS certificates for icinga.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/476208 (https://phabricator.wikimedia.org/T207050) [08:01:35] (03PS1) 10Vgutierrez: icinga: Deploy the certcentral managed TLS certificate [puppet] - 10https://gerrit.wikimedia.org/r/476209 (https://phabricator.wikimedia.org/T207050) [08:03:25] (03PS3) 10Muehlenhoff: Remove Diamond on additional DB roles [puppet] - 10https://gerrit.wikimedia.org/r/476001 (https://phabricator.wikimedia.org/T183454) [08:03:52] (03PS1) 10Elukey: Apply -R 200 to memcached running on mc1022 [puppet] - 10https://gerrit.wikimedia.org/r/476210 (https://phabricator.wikimedia.org/T208844) [08:05:06] (03CR) 10Muehlenhoff: [C: 032] Remove Diamond on additional DB roles [puppet] - 10https://gerrit.wikimedia.org/r/476001 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [08:07:57] dsaez: no phab task, it was just notebook1003 being under stress due to some running jupyter kernels. 
I believe one of those was yours [08:10:03] (03CR) 10Elukey: [C: 032] Apply -R 200 to memcached running on mc1022 [puppet] - 10https://gerrit.wikimedia.org/r/476210 (https://phabricator.wikimedia.org/T208844) (owner: 10Elukey) [08:10:11] (03PS2) 10Elukey: Apply -R 200 to memcached running on mc1022 [puppet] - 10https://gerrit.wikimedia.org/r/476210 (https://phabricator.wikimedia.org/T208844) [08:10:41] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1096:3316" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476207 (owner: 10Marostegui) [08:11:42] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1096:3316" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476207 (owner: 10Marostegui) [08:12:20] !log apply -R 200 to memcached on mc1022 (cache wipe) - T208844 [08:12:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:24] T208844: Apply -R 200 to all the memcached mw object cache instances running in eqiad/codfw - https://phabricator.wikimedia.org/T208844 [08:12:46] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1096:3316 T86338 (duration: 00m 53s) [08:12:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:49] T86338: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338 [08:13:02] 10Operations, 10MediaWiki-Cache, 10Patch-For-Review, 10Performance-Team (Radar), 10User-Elukey: Apply -R 200 to all the memcached mw object cache instances running in eqiad/codfw - https://phabricator.wikimedia.org/T208844 (10elukey) [08:15:09] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1096:3316" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476207 (owner: 10Marostegui) [08:15:13] (03PS1) 10Alexandros Kosiaris: releases: Set no-cache Cache-control [puppet] - 10https://gerrit.wikimedia.org/r/476211 [08:15:26] 10Puppet, 10Beta-Cluster-Infrastructure: Puppet fail on deployment-mediawiki-07, missing private hiera 
variable - https://phabricator.wikimedia.org/T210497 (10Joe) So to explain what happened here: - `labs/private` works a double function, as a dupe for the production puppet repo for the compiler (which uses... [08:15:55] (03PS2) 10Vgutierrez: icinga: Deploy the certcentral managed TLS certificate [puppet] - 10https://gerrit.wikimedia.org/r/476209 (https://phabricator.wikimedia.org/T207050) [08:15:57] (03PS1) 10Vgutierrez: certcentral: check for already declared LE Intermediate certs [puppet] - 10https://gerrit.wikimedia.org/r/476212 (https://phabricator.wikimedia.org/T207050) [08:19:57] 10Operations, 10ops-codfw: rack/setup/install elastic203[7-9], elastic204[0-9], elastic205[0-4] - https://phabricator.wikimedia.org/T210450 (10Gehel) >>! In T210450#4779871, @Papaul wrote: > Are you going to use the 10G NIC or 1G NIC? We're planning on using the 10G NIC. [08:22:48] 10Puppet, 10ORES, 10Scoring-platform-team (Current): Write puppet for redis-sentinel - https://phabricator.wikimedia.org/T210580 (10Ladsgroup) p:05Triage>03High [08:28:10] !log installing samba security updates (client libs) [08:28:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:11] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: Migrate >=90% of existing Logstash traffic to the logging pipeline - https://phabricator.wikimedia.org/T205851 (10fgiunchedi) Testing on `deployment-mediawiki-07` the endpoint above yields e.g. this message being produced on kafka: ` { "@timestamp":... [08:34:22] PROBLEM - puppet last run on icinga1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [08:38:20] PROBLEM - Disk space on elastic1017 is CRITICAL: DISK CRITICAL - free space: /srv 25899 MB (5% inode=99%) [08:40:12] (03CR) 10Vgutierrez: [C: 032] certcentral: Provide TLS certificates for icinga.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/476208 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [08:40:15] (03CR) 10Vgutierrez: [C: 032] certcentral: check for already declared LE Intermediate certs [puppet] - 10https://gerrit.wikimedia.org/r/476212 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [08:40:28] (03PS2) 10Vgutierrez: certcentral: Provide TLS certificates for icinga.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/476208 (https://phabricator.wikimedia.org/T207050) [08:41:08] !log installing git security updates on trusty (Debian already fixed) [08:41:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:05] (03PS2) 10Vgutierrez: certcentral: check for already declared LE Intermediate certs [puppet] - 10https://gerrit.wikimedia.org/r/476212 (https://phabricator.wikimedia.org/T207050) [08:49:52] RECOVERY - puppet last run on icinga1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [08:51:26] RECOVERY - Disk space on elastic1017 is OK: DISK OK [08:55:14] (03PS1) 10Marostegui: db-eqiad.php: Depool db1113:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476214 (https://phabricator.wikimedia.org/T202167) [08:57:40] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1113:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476214 (https://phabricator.wikimedia.org/T202167) (owner: 10Marostegui) [08:58:49] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1113:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476214 (https://phabricator.wikimedia.org/T202167) (owner: 10Marostegui) [09:00:34] (03PS1) 10Vgutierrez: icinga: Use certcentral managed TLS certificate for 
icinga.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/476215 (https://phabricator.wikimedia.org/T207050) [09:00:38] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1113:3316 T86338 T202167 (duration: 00m 53s) [09:00:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:43] T86338: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338 [09:00:44] T202167: Schema change for rc_this_oldid index - https://phabricator.wikimedia.org/T202167 [09:00:44] !log Deploy schema change db1113:3316 T86338 T202167 [09:00:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:13] (03PS1) 10Elukey: hive-site.xml: fix unclosed xml tag [puppet/cdh] - 10https://gerrit.wikimedia.org/r/476216 [09:04:29] (03CR) 10Elukey: [V: 032 C: 032] hive-site.xml: fix unclosed xml tag [puppet/cdh] - 10https://gerrit.wikimedia.org/r/476216 (owner: 10Elukey) [09:04:40] (03CR) 10Vgutierrez: [C: 032] icinga: Deploy the certcentral managed TLS certificate [puppet] - 10https://gerrit.wikimedia.org/r/476209 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [09:04:49] (03PS3) 10Vgutierrez: icinga: Deploy the certcentral managed TLS certificate [puppet] - 10https://gerrit.wikimedia.org/r/476209 (https://phabricator.wikimedia.org/T207050) [09:05:36] (03PS1) 10Elukey: Update cdh module to latest SHA [puppet] - 10https://gerrit.wikimedia.org/r/476217 [09:07:40] (03PS2) 10Muehlenhoff: Absent Redis Diamond collector on Redis masters [puppet] - 10https://gerrit.wikimedia.org/r/475969 (https://phabricator.wikimedia.org/T183454) [09:07:59] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1113:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476214 (https://phabricator.wikimedia.org/T202167) (owner: 10Marostegui) [09:08:18] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install backup1001 - https://phabricator.wikimedia.org/T196478 (10akosiaris) @Cmjohnson, I think 
we can proceed with this. I did just try to reimage the server but mgmt is not responding ` akosiaris@bast1002:~$ ping backup1001.mgmt.eqiad.wmnet PING b... [09:10:01] (03CR) 10Elukey: [C: 032] Update cdh module to latest SHA [puppet] - 10https://gerrit.wikimedia.org/r/476217 (owner: 10Elukey) [09:10:08] (03PS2) 10Elukey: Update cdh module to latest SHA [puppet] - 10https://gerrit.wikimedia.org/r/476217 [09:10:10] (03CR) 10Elukey: [V: 032 C: 032] Update cdh module to latest SHA [puppet] - 10https://gerrit.wikimedia.org/r/476217 (owner: 10Elukey) [09:13:13] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1113:3316" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476218 [09:14:06] PROBLEM - Check systemd state on labmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:14:52] !log Use a TLS certificate managed by certcentral in icinga.wm.o - T207050 [09:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:56] T207050: Migrate most standard public TLS certificates to CertCentral issuance - https://phabricator.wikimedia.org/T207050 [09:15:07] (03CR) 10Vgutierrez: [C: 032] icinga: Use certcentral managed TLS certificate for icinga.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/476215 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [09:15:10] (03CR) 10Alexandros Kosiaris: [C: 032] releases: Set no-cache Cache-control [puppet] - 10https://gerrit.wikimedia.org/r/476211 (owner: 10Alexandros Kosiaris) [09:15:16] (03PS2) 10Vgutierrez: icinga: Use certcentral managed TLS certificate for icinga.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/476215 (https://phabricator.wikimedia.org/T207050) [09:15:24] (03PS2) 10Alexandros Kosiaris: releases: Set no-cache Cache-control [puppet] - 10https://gerrit.wikimedia.org/r/476211 [09:15:46] (03PS3) 10Muehlenhoff: Absent Redis Diamond collector on Redis masters [puppet] - 10https://gerrit.wikimedia.org/r/475969 
(https://phabricator.wikimedia.org/T183454) [09:15:59] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] releases: Set no-cache Cache-control [puppet] - 10https://gerrit.wikimedia.org/r/476211 (owner: 10Alexandros Kosiaris) [09:16:24] sigh.. that's cheating alex ;P [09:16:37] (03PS5) 10Banyek: mariadb: productionize dbproxy1015 and dbproxy1016 [puppet] - 10https://gerrit.wikimedia.org/r/473546 (https://phabricator.wikimedia.org/T202367) [09:16:43] (03PS4) 10Muehlenhoff: Absent Redis Diamond collector on Redis masters [puppet] - 10https://gerrit.wikimedia.org/r/475969 (https://phabricator.wikimedia.org/T183454) [09:18:14] (03CR) 10Muehlenhoff: [C: 032] Absent Redis Diamond collector on Redis masters [puppet] - 10https://gerrit.wikimedia.org/r/475969 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [09:18:34] (03PS3) 10Vgutierrez: icinga: Use certcentral managed TLS certificate for icinga.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/476215 (https://phabricator.wikimedia.org/T207050) [09:18:50] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:19:18] (03PS4) 10Vgutierrez: icinga: Use certcentral managed TLS certificate for icinga.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/476215 (https://phabricator.wikimedia.org/T207050) [09:20:14] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1113:3316" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476218 (owner: 10Marostegui) [09:21:15] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1113:3316" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476218 (owner: 10Marostegui) [09:21:29] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1113:3316" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476218 (owner: 10Marostegui) [09:21:54] (03CR) 10Jcrespo: "spelling (see below). 
You should consider my suggestion of setting up all hosts exactly the same (at least in groups)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/473546 (https://phabricator.wikimedia.org/T202367) (owner: 10Banyek) [09:22:26] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:22:41] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1113:3316 T86338 T202167 (duration: 00m 53s) [09:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:46] T86338: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338 [09:22:47] T202167: Schema change for rc_this_oldid index - https://phabricator.wikimedia.org/T202167 [09:24:50] RECOVERY - Check systemd state on labmon1002 is OK: OK - running: The system is fully operational [09:25:20] (03CR) 10Banyek: ">" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/473546 (https://phabricator.wikimedia.org/T202367) (owner: 10Banyek) [09:25:30] PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100% [09:25:59] (03PS6) 10Banyek: mariadb: productionize dbproxy1015 and dbproxy1016 [puppet] - 10https://gerrit.wikimedia.org/r/473546 (https://phabricator.wikimedia.org/T202367) [09:26:27] (03PS1) 10Vgutierrez: icinga: Get rid of old LE puppetization [puppet] - 10https://gerrit.wikimedia.org/r/476219 (https://phabricator.wikimedia.org/T207050) [09:27:35] (03CR) 10jerkins-bot: [V: 04-1] icinga: Get rid of old LE puppetization [puppet] - 10https://gerrit.wikimedia.org/r/476219 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [09:27:50] RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 36.15 ms [09:29:41] (03PS2) 10Muehlenhoff: Add mapped IPv6 to labmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/475462 [09:30:49] (03PS2) 
10Vgutierrez: icinga: Get rid of old LE puppetization [puppet] - 10https://gerrit.wikimedia.org/r/476219 (https://phabricator.wikimedia.org/T207050) [09:33:17] (03CR) 10Jcrespo: "> >" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/473546 (https://phabricator.wikimedia.org/T202367) (owner: 10Banyek) [09:33:26] (03CR) 10Vgutierrez: [C: 031] "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1001/13750/" [puppet] - 10https://gerrit.wikimedia.org/r/476219 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [09:35:06] PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100% [09:35:11] (03CR) 10Banyek: "> > >" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/473546 (https://phabricator.wikimedia.org/T202367) (owner: 10Banyek) [09:36:59] 10Operations, 10Traffic, 10Patch-For-Review: Migrate most standard public TLS certificates to CertCentral issuance - https://phabricator.wikimedia.org/T207050 (10Vgutierrez) [09:37:09] (03PS7) 10Banyek: mariadb: productionize dbproxy1015 and dbproxy1016 [puppet] - 10https://gerrit.wikimedia.org/r/473546 (https://phabricator.wikimedia.org/T202367) [09:37:24] heads up, labsdb1009 maintenance begins in 20 minutes [09:37:26] RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 36.14 ms [09:38:02] (03CR) 10jerkins-bot: [V: 04-1] mariadb: productionize dbproxy1015 and dbproxy1016 [puppet] - 10https://gerrit.wikimedia.org/r/473546 (https://phabricator.wikimedia.org/T202367) (owner: 10Banyek) [09:38:09] (03CR) 10Vgutierrez: [C: 032] icinga: Get rid of old LE puppetization [puppet] - 10https://gerrit.wikimedia.org/r/476219 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [09:38:16] (03CR) 10Marostegui: "> I like the idea, but I'd like to hear @marostegui too about it. 
I" [puppet] - 10https://gerrit.wikimedia.org/r/473546 (https://phabricator.wikimedia.org/T202367) (owner: 10Banyek) [09:38:18] (03CR) 10Muehlenhoff: [C: 032] Add mapped IPv6 to labmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/475462 (owner: 10Muehlenhoff) [09:38:24] (03PS3) 10Muehlenhoff: Add mapped IPv6 to labmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/475462 [09:38:43] 10Operations, 10ops-codfw: ms-be2047 spontaneous reboots - https://phabricator.wikimedia.org/T209921 (10fgiunchedi) Looks like it just happened again (timestamp UTC) ` 09:25 -icinga-wm:#wikimedia-operations- PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100% 09:27 -icinga-wm:#wikimedia-opera... [09:40:48] (03PS8) 10Banyek: mariadb: productionize dbproxy1015 and dbproxy1016 [puppet] - 10https://gerrit.wikimedia.org/r/473546 (https://phabricator.wikimedia.org/T202367) [09:55:56] !log Update tendril topology for pc1 - T208383 [09:55:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:53] (03CR) 10Banyek: [C: 031] "banyek@cumin2001:~ $ host 10.64.32.72" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476222 (https://phabricator.wikimedia.org/T208383) (owner: 10Marostegui) [10:00:10] !log depooling labsdb1009 (T209517) [10:00:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:11] T209517: Upgrade/reboot labsdb* servers - https://phabricator.wikimedia.org/T209517 [10:00:24] (03PS1) 10Marostegui: pc1008: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/476225 (https://phabricator.wikimedia.org/T208383) [10:00:55] (03CR) 10Banyek: [C: 032] wiki replicas: depool labsdb1009 for upgrades [puppet] - 10https://gerrit.wikimedia.org/r/476079 (https://phabricator.wikimedia.org/T209517) (owner: 10Bstorm) [10:01:00] (03PS2) 10Banyek: wiki replicas: depool labsdb1009 for upgrades [puppet] - 10https://gerrit.wikimedia.org/r/476079 (https://phabricator.wikimedia.org/T209517) (owner: 10Bstorm) 
[10:01:01] (03CR) 10Banyek: [V: 032 C: 032] wiki replicas: depool labsdb1009 for upgrades [puppet] - 10https://gerrit.wikimedia.org/r/476079 (https://phabricator.wikimedia.org/T209517) (owner: 10Bstorm) [10:01:18] (03CR) 10Marostegui: [C: 032] pc1008: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/476225 (https://phabricator.wikimedia.org/T208383) (owner: 10Marostegui) [10:01:23] (03PS2) 10Marostegui: pc1008: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/476225 (https://phabricator.wikimedia.org/T208383) [10:02:47] banyek: did you get errors when merging your change? [10:03:06] marostegui: no [10:03:20] did you? [10:03:22] yep [10:03:54] I checked again, all clean [10:04:22] just the one which is always there "2018-11-28 10:02:29 [INFO] conftool::yaml_log_error: Error parsing yaml file /etc/conftool/etcdrc: [Errno 2] No such file or directory: '/etc/conftool/etcdrc'" [10:05:52] (03CR) 10Alex Monk: [C: 031] certcentral: Ensure that the service gets reloaded instead of restarted [puppet] - 10https://gerrit.wikimedia.org/r/476223 (https://phabricator.wikimedia.org/T209976) (owner: 10Vgutierrez) [10:08:11] (03PS1) 10Muehlenhoff: Remove Diamond from redis::misc systems [puppet] - 10https://gerrit.wikimedia.org/r/476226 (https://phabricator.wikimedia.org/T183454) [10:08:15] !log disable mailing list mediation-en-l [10:08:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:59] I'm deploying something for ores [10:10:11] rev to rollback, just in case, e957b244afaf4338856d279313a66a87ba51b55d [10:10:27] 10Operations, 10Discovery-Search, 10Elasticsearch: Fix prometheus elasticsearch exporter to show all the metrics - https://phabricator.wikimedia.org/T210592 (10Mathew.onipe) [10:11:05] 10Operations, 10Elasticsearch, 10Maps, 10Discovery-Search (Current work): Review Elastic/maps Grafana dashboards - https://phabricator.wikimedia.org/T209812 (10Mathew.onipe) [10:11:25] !log ladsgroup@deploy1001 Started 
deploy [ores/deploy@9b9ba06]: T206333 [10:11:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:29] T206333: Change default serializer of celery from pickle to json - https://phabricator.wikimedia.org/T206333 [10:12:28] PROBLEM - Unmerged changes on repository puppet on puppetmaster2002 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [10:13:12] PROBLEM - Unmerged changes on repository puppet on rhodium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [10:14:14] RECOVERY - Unmerged changes on repository puppet on rhodium is OK: No changes to merge. [10:14:18] 10Operations, 10Elasticsearch, 10Discovery-Search (Current work): Fix prometheus elasticsearch exporter to show all the metrics - https://phabricator.wikimedia.org/T210592 (10Mathew.onipe) p:05Triage>03Normal [10:15:36] RECOVERY - Unmerged changes on repository puppet on puppetmaster2002 is OK: No changes to merge. [10:16:12] (03CR) 10Vgutierrez: [C: 032] certcentral: Ensure that the service gets reloaded instead of restarted [puppet] - 10https://gerrit.wikimedia.org/r/476223 (https://phabricator.wikimedia.org/T209976) (owner: 10Vgutierrez) [10:16:22] (03PS2) 10Vgutierrez: certcentral: Ensure that the service gets reloaded instead of restarted [puppet] - 10https://gerrit.wikimedia.org/r/476223 (https://phabricator.wikimedia.org/T209976) [10:16:33] 10Operations, 10Wikimedia-Mailing-lists: Need to shut down a list, mediation-en-l - https://phabricator.wikimedia.org/T209726 (10jcrespo) 05Open>03Resolved p:05Triage>03High Disabling is done * All messages moderated by default * Removed all administrators * Set admin password to a random string I see... 
[10:18:53] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:19:45] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time [10:22:43] PROBLEM - haproxy failover on dbproxy1011 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [10:24:20] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Pool pc1008 in pc2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476222 (https://phabricator.wikimedia.org/T208383) (owner: 10Marostegui) [10:25:26] (03Merged) 10jenkins-bot: db-eqiad.php: Pool pc1008 in pc2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476222 (https://phabricator.wikimedia.org/T208383) (owner: 10Marostegui) [10:26:03] banyek: is that you? ^ [10:26:13] !log ladsgroup@deploy1001 Finished deploy [ores/deploy@9b9ba06]: T206333 (duration: 14m 48s) [10:26:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:18] T206333: Change default serializer of celery from pickle to json - https://phabricator.wikimedia.org/T206333 [10:26:36] yes :( [10:26:45] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Pool pc1008 - T208383 (duration: 00m 50s) [10:26:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:48] T208383: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 [10:27:04] (03CR) 10jenkins-bot: db-eqiad.php: Pool pc1008 in pc2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476222 (https://phabricator.wikimedia.org/T208383) (owner: 10Marostegui) [10:27:12] ACKNOWLEDGEMENT - haproxy failover on dbproxy1011 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Banyek T209517 [10:30:08] (03CR) 10Alexandros Kosiaris: [C: 04-1] "I 've left an inline comment, rest LGTM" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/475093 (https://phabricator.wikimedia.org/T206639) (owner: 10Mathew.onipe) 
[10:30:18] 10Operations, 10DBA, 10Patch-For-Review, 10User-Banyek: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 (10Marostegui) pc1008 has been pooled in pc2 - T208383#4780657 [10:30:21] RECOVERY - haproxy failover on dbproxy1011 is OK: OK check_failover servers up 2 down 0 [10:30:56] 10Operations, 10DBA, 10Patch-For-Review, 10User-Banyek: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 (10Marostegui) [10:31:08] (03PS5) 10Filippo Giunchedi: WIP rsyslog: udp input json_lines shim [puppet] - 10https://gerrit.wikimedia.org/r/475352 (https://phabricator.wikimedia.org/T205851) [10:34:13] (03CR) 10Alexandros Kosiaris: [C: 032] "PCC says ok at https://puppet-compiler.wmflabs.org/compiler1002/13746/" [puppet] - 10https://gerrit.wikimedia.org/r/475259 (owner: 10Dzahn) [10:34:26] (03PS2) 10Alexandros Kosiaris: upgrade puppet stdlib from 4.19.0 to 4.22.0 [puppet] - 10https://gerrit.wikimedia.org/r/475259 (owner: 10Dzahn) [10:34:29] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] upgrade puppet stdlib from 4.19.0 to 4.22.0 [puppet] - 10https://gerrit.wikimedia.org/r/475259 (owner: 10Dzahn) [10:34:57] RECOVERY - MariaDB Slave Lag: pc3 on pc2009 is OK: OK slave_sql_lag Replication lag: 0.85 seconds [10:36:29] PROBLEM - puppet last run on icinga1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:40:17] 10Operations, 10Wikimedia-Mailing-lists: Wikimedia-GIN - https://phabricator.wikimedia.org/T210299 (10jcrespo) 05Open>03Resolved The list has been created; you have received the password, which you can use at: https://lists.wikimedia.org/mailman/admin/wikimedia-gin to modify the configuration or add more...
[10:40:19] (03PS1) 10Elukey: hive-site.xml: update kerberos/sasl properties [puppet/cdh] - 10https://gerrit.wikimedia.org/r/476227 [10:40:21] (03PS1) 10Filippo Giunchedi: LabsServices: ship logs locally [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476228 (https://phabricator.wikimedia.org/T205851) [10:40:47] (03CR) 10Elukey: [V: 032 C: 032] hive-site.xml: update kerberos/sasl properties [puppet/cdh] - 10https://gerrit.wikimedia.org/r/476227 (owner: 10Elukey) [10:42:02] (03PS1) 10Elukey: Update cdh module with the latest sha [puppet] - 10https://gerrit.wikimedia.org/r/476229 [10:43:12] (03CR) 10Elukey: [C: 032] Update cdh module with the latest sha [puppet] - 10https://gerrit.wikimedia.org/r/476229 (owner: 10Elukey) [10:46:01] !log rolling reboot of logstash1007-1009 to pick up new SSBD instructions and OpenJDK security updates [10:46:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:15] (03PS1) 10Arturo Borrero Gonzalez: hieradata: openstack: reallocate keys from common to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/476231 [10:47:17] (03PS1) 10Arturo Borrero Gonzalez: openstack: eqiad1: activate keystone extra services [puppet] - 10https://gerrit.wikimedia.org/r/476232 (https://phabricator.wikimedia.org/T201504) [10:48:30] (03CR) 10jerkins-bot: [V: 04-1] openstack: eqiad1: activate keystone extra services [puppet] - 10https://gerrit.wikimedia.org/r/476232 (https://phabricator.wikimedia.org/T201504) (owner: 10Arturo Borrero Gonzalez) [10:49:31] 10Operations, 10Wikimedia-Mailing-lists: Post hold because of "invalid headers" in wikimediacz-l - https://phabricator.wikimedia.org/T210223 (10jcrespo) Urbanecm, thank you for reporting. Sadly, there is not much information we can infer with a single occurrence, it could be the spam filter detecting something... 
[10:50:40] RECOVERY - puppet last run on icinga1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:54:38] (03PS9) 10Banyek: mariadb: productionize dbproxy1015 and dbproxy1016 [puppet] - 10https://gerrit.wikimedia.org/r/473546 (https://phabricator.wikimedia.org/T202367) [10:55:32] (03CR) 10jerkins-bot: [V: 04-1] mariadb: productionize dbproxy1015 and dbproxy1016 [puppet] - 10https://gerrit.wikimedia.org/r/473546 (https://phabricator.wikimedia.org/T202367) (owner: 10Banyek) [10:57:21] (03CR) 10Arturo Borrero Gonzalez: [C: 032] hieradata: openstack: reallocate keys from common to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/476231 (owner: 10Arturo Borrero Gonzalez) [10:57:29] (03PS10) 10Banyek: mariadb: productionize dbproxy1012 - dbproxy1016 [puppet] - 10https://gerrit.wikimedia.org/r/473546 (https://phabricator.wikimedia.org/T202367) [10:59:24] who should I be adding to https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/476228 for signoff ? 
[11:00:40] (03PS1) 10Marostegui: install_server: Only allow reimage pc1007 [puppet] - 10https://gerrit.wikimedia.org/r/476234 (https://phabricator.wikimedia.org/T208383) [11:00:57] (03PS2) 10Marostegui: install_server: Only allow reimage pc1007 [puppet] - 10https://gerrit.wikimedia.org/r/476234 (https://phabricator.wikimedia.org/T208383) [11:01:42] (03CR) 10Marostegui: [C: 032] install_server: Only allow reimage pc1007 [puppet] - 10https://gerrit.wikimedia.org/r/476234 (https://phabricator.wikimedia.org/T208383) (owner: 10Marostegui) [11:03:21] (03PS2) 10Arturo Borrero Gonzalez: openstack: eqiad1: activate keystone extra services [puppet] - 10https://gerrit.wikimedia.org/r/476232 (https://phabricator.wikimedia.org/T210595) [11:04:11] (03CR) 10jerkins-bot: [V: 04-1] openstack: eqiad1: activate keystone extra services [puppet] - 10https://gerrit.wikimedia.org/r/476232 (https://phabricator.wikimedia.org/T210595) (owner: 10Arturo Borrero Gonzalez) [11:06:47] (03CR) 10Arturo Borrero Gonzalez: "PTR?" [dns] - 10https://gerrit.wikimedia.org/r/476224 (owner: 10Muehlenhoff) [11:10:54] (03PS3) 10Arturo Borrero Gonzalez: openstack: eqiad1: activate keystone extra services [puppet] - 10https://gerrit.wikimedia.org/r/476232 (https://phabricator.wikimedia.org/T210595) [11:10:58] PROBLEM - puppet last run on db2037 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:13:40] (03PS1) 10Marostegui: db-eqiad.php: Depool db1085 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476236 (https://phabricator.wikimedia.org/T202167) [11:14:45] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1085 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476236 (https://phabricator.wikimedia.org/T202167) (owner: 10Marostegui) [11:15:32] !log repooling labsdb1009 after maintenance [11:15:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:47] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1085 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476236 (https://phabricator.wikimedia.org/T202167) (owner: 10Marostegui) [11:16:02] (03PS1) 10Banyek: Revert "wiki replicas: depool labsdb1009 for upgrades" [puppet] - 10https://gerrit.wikimedia.org/r/476238 [11:16:40] (03CR) 10Banyek: [C: 032] Revert "wiki replicas: depool labsdb1009 for upgrades" [puppet] - 10https://gerrit.wikimedia.org/r/476238 (owner: 10Banyek) [11:17:04] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1085 T202167 (duration: 00m 53s) [11:17:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:07] !log Deploy schema change on db1085 (s6) (sanitarium master) with replication - T202167 [11:17:07] T202167: Schema change for rc_this_oldid index - https://phabricator.wikimedia.org/T202167 [11:17:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:37] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1085" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476239 [11:18:11] (03PS2) 10Banyek: Revert "wiki replicas: depool labsdb1009 for upgrades" [puppet] - 10https://gerrit.wikimedia.org/r/476238 [11:18:14] (03CR) 10Banyek: [V: 032 C: 032] Revert "wiki replicas: depool labsdb1009 for upgrades" [puppet] - 10https://gerrit.wikimedia.org/r/476238 (owner: 10Banyek) [11:19:16] (03CR) 
10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1085" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476239 (owner: 10Marostegui) [11:19:18] PROBLEM - puppet last run on db2078 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:19:25] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1085 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476236 (https://phabricator.wikimedia.org/T202167) (owner: 10Marostegui) [11:19:40] (03PS1) 10Arturo Borrero Gonzalez: hieradata: openstack: cumin: re-allocate hiera keys [puppet] - 10https://gerrit.wikimedia.org/r/476240 [11:20:12] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1085" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476239 (owner: 10Marostegui) [11:20:23] (03CR) 10Arturo Borrero Gonzalez: [C: 032] hieradata: openstack: cumin: re-allocate hiera keys [puppet] - 10https://gerrit.wikimedia.org/r/476240 (owner: 10Arturo Borrero Gonzalez) [11:20:26] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1085" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476239 (owner: 10Marostegui) [11:20:28] arturo: there are puppet failures on the DBs: https://phabricator.wikimedia.org/P7856 [11:20:33] is that something related to your changes? [11:21:04] marostegui: could be [11:21:24] will take a look in a minute [11:21:28] thanks [11:21:33] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1085 T202167 (duration: 00m 53s) [11:21:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:44] arturo: it's everywhere [11:24:01] puppetmaster has unmerged private repo changes it seems [11:24:14] ? 
[11:25:34] o_o [11:27:18] (03PS7) 10Mathew.onipe: profile::maps::osm_master: change osmupdater and osmimporter auth method to peer [puppet] - 10https://gerrit.wikimedia.org/r/475093 (https://phabricator.wikimedia.org/T206639) [11:28:04] That was icinga's scream [11:28:20] I can look it up but there is so much stuff [11:28:33] in which puppetmaster? [11:28:38] (03CR) 10Mathew.onipe: profile::maps::osm_master: change osmupdater and osmimporter auth method to peer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/475093 (https://phabricator.wikimedia.org/T206639) (owner: 10Mathew.onipe) [11:29:15] marostegui: problem is that db2078, which is in codfw, is using hiera keys from eqiad [11:29:15] PROBLEM - Unmerged changes on repository puppet on puppetmaster2002 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [11:29:17] <_joe_> sorry what is going on with the puppetmasters? [11:29:23] 11:12 (Berlin time) [11:29:43] <_joe_> marostegui: do i need to take a look? [11:30:02] I think puppetmaster1001 and some other had it too [11:30:18] _joe_: I don't think so, the db puppet issue is a hiera reallocation I just did [11:30:37] which uncovered other issues [11:30:43] <_joe_> well there are unmerged changes right now?
[11:30:54] I know nothing about these unmerged changes [11:30:57] that's not me AFAIK [11:31:06] <_joe_> ok i'll take a look [11:31:22] _joe_: No idea bout those either [11:31:36] <_joe_> if a puppet change creates problems and you're not sure how to fix them, just revert it [11:31:55] <_joe_> you'll have more time for the fix afterwards [11:31:59] yes I'm doing already [11:32:00] _joe_: I had an error before while syncing to other puppet masters, but it was already solved by running puppet merge on those failed hosts as per v0lans advise [11:33:06] <_joe_> ok [11:33:33] But those new unmerged changes alert, I don't know [11:34:41] (03PS1) 10Arturo Borrero Gonzalez: Revert "hieradata: openstack: reallocate keys from common to eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/476245 [11:34:44] <_joe_> which one? there is none :P [11:35:03] ˜/Amir1 12:29> PROBLEM - Unmerged changes on repository puppet on puppetmaster2002 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [11:35:11] I checked too and I saw nothing [11:35:15] So not sure where does that come from :) [11:35:38] (03CR) 10Arturo Borrero Gonzalez: [C: 032] Revert "hieradata: openstack: reallocate keys from common to eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/476245 (owner: 10Arturo Borrero Gonzalez) [11:36:53] (03PS2) 10Muehlenhoff: Add AAAA record for labmon1001 [dns] - 10https://gerrit.wikimedia.org/r/476224 [11:37:20] <_joe_> it was amir re-pasting an error [11:37:31] <_joe_> marostegui: so dbs are failing puppet? [11:37:41] marostegui: please check db2078 now [11:38:13] arturo: works [11:38:19] _joe_: just two of the misc ones [11:38:38] It's not just dbs, I see this too [11:38:39] PROBLEM - puppet on ORES-worker01.experimental is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:38:45] (in #wikimedia-ai) [11:38:55] cool [11:39:15] on all cloud vps nodes we have monitoring on [11:39:56] RECOVERY - puppet last run on db2078 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:40:13] Amir1: I reverted my change, should be fixed now. Sorry for the noise. [11:41:10] No worries, it's fine, I was worried it might be a symptom of a huge underlying issue [11:45:09] (03PS11) 10Banyek: mariadb: productionize dbproxy1012 - dbproxy1016 [puppet] - 10https://gerrit.wikimedia.org/r/473546 (https://phabricator.wikimedia.org/T202367) [11:46:53] (03CR) 10Arturo Borrero Gonzalez: [C: 031] Add AAAA record for labmon1001 [dns] - 10https://gerrit.wikimedia.org/r/476224 (owner: 10Muehlenhoff) [11:53:14] (03PS1) 10Ladsgroup: ores: Remove added celery configs [puppet] - 10https://gerrit.wikimedia.org/r/476250 [11:55:05] (03PS12) 10Banyek: mariadb: new dbproxy role & profile [puppet] - 10https://gerrit.wikimedia.org/r/473546 (https://phabricator.wikimedia.org/T202367) [11:56:16] (03PS2) 10Alexandros Kosiaris: varnish: move $all_networks to $trusted_networks [puppet] - 10https://gerrit.wikimedia.org/r/475714 [12:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: That opportune time is upon us again. Time for a European Mid-day SWAT(Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181128T1200). [12:00:05] CFisch_WMDE: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:21] \o/ [12:02:41] Anyone doing the SWAT? 
;-) [12:04:34] (03CR) 10Alexandros Kosiaris: [C: 032] ores: Remove added celery configs [puppet] - 10https://gerrit.wikimedia.org/r/476250 (owner: 10Ladsgroup) [12:04:40] (03PS2) 10Alexandros Kosiaris: ores: Remove added celery configs [puppet] - 10https://gerrit.wikimedia.org/r/476250 (owner: 10Ladsgroup) [12:04:43] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] ores: Remove added celery configs [puppet] - 10https://gerrit.wikimedia.org/r/476250 (owner: 10Ladsgroup) [12:05:20] CFisch_WMDE: ooops, sorry, forgot about swat [12:05:35] CFisch_WMDE: there are a few new deployers, are you one of them? [12:05:52] nope not yet [12:06:17] ok, in that case, I can SWAT today! [12:06:27] So go for it! \o/ [12:07:54] RECOVERY - puppet last run on db2037 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [12:08:37] CFisch_WMDE: do the patches need to be deployed in order? [12:08:42] (03PS3) 10Alexandros Kosiaris: varnish: move $all_networks to $trusted_networks [puppet] - 10https://gerrit.wikimedia.org/r/475714 [12:08:55] the first two would be nice [12:08:55] since the extension patches will need 10-20 minutes to merge, I can deploy config patch first [12:09:00] the third one does not matter [12:09:18] (03CR) 10Muehlenhoff: [C: 032] Add AAAA record for labmon1001 [dns] - 10https://gerrit.wikimedia.org/r/476224 (owner: 10Muehlenhoff) [12:10:04] Thinking about it [12:10:09] does not matter at all zeljkof [12:10:11] ;-) [12:10:23] CFisch_WMDE: ok, merging them then all, deploying as they get merged [12:10:28] kk [12:11:08] PROBLEM - graphite-labs.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.001 second response time [12:12:18] RECOVERY - graphite-labs.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1547 bytes in 0.010 second response time [12:14:35] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475759 (https://phabricator.wikimedia.org/T207639) 
(owner: 10WMDE-Fisch) [12:17:19] (03PS1) 10Ladsgroup: ores: labs celery4 config update [puppet] - 10https://gerrit.wikimedia.org/r/476255 (https://phabricator.wikimedia.org/T209587) [12:18:02] (03CR) 10jerkins-bot: [V: 04-1] ores: labs celery4 config update [puppet] - 10https://gerrit.wikimedia.org/r/476255 (https://phabricator.wikimedia.org/T209587) (owner: 10Ladsgroup) [12:18:13] (03CR) 10Zfilipin: Make AdvancedSearch default on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475759 (https://phabricator.wikimedia.org/T207639) (owner: 10WMDE-Fisch) [12:18:20] (03PS2) 10Zfilipin: Make AdvancedSearch default on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475759 (https://phabricator.wikimedia.org/T207639) (owner: 10WMDE-Fisch) [12:18:30] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475759 (https://phabricator.wikimedia.org/T207639) (owner: 10WMDE-Fisch) [12:19:48] CFisch_WMDE: couple of jobs failed for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/AdvancedSearch/+/475985 [12:19:50] (03Merged) 10jenkins-bot: Make AdvancedSearch default on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475759 (https://phabricator.wikimedia.org/T207639) (owner: 10WMDE-Fisch) [12:19:52] * zeljkof is looking [12:20:38] zeljkof: oh yeah this is a test that's very unstable [12:20:51] CFisch_WMDE: 475759 is at mwdebug1002 [12:20:52] on master I had several rechecks with different results [12:21:04] and needed to force it into master ^^' [12:21:12] you should fix or delete that test then :/ [12:21:31] yeah ;-/ [12:21:37] (03PS2) 10Ladsgroup: ores: labs celery4 config update [puppet] - 10https://gerrit.wikimedia.org/r/476255 (https://phabricator.wikimedia.org/T209587) [12:22:14] CFisch_WMDE: 475759 is at mwdebug1002 <- works fine [12:22:24] zeljkof: [12:22:57] CFisch_WMDE: ok, deploying [12:23:47] (03PS1) 10Arturo Borrero Gonzalez: openstack: main: keystone: cleanup unused parameters 
[puppet] - 10https://gerrit.wikimedia.org/r/476256 [12:24:03] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:475759|Make AdvancedSearch default on all wikis (T207639)]] (duration: 00m 54s) [12:24:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:08] T207639: Make AdvancedSearch default on all wikis - https://phabricator.wikimedia.org/T207639 [12:24:16] CFisch_WMDE: 475759 deployed, please check [12:24:36] (03CR) 10jenkins-bot: Make AdvancedSearch default on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475759 (https://phabricator.wikimedia.org/T207639) (owner: 10WMDE-Fisch) [12:24:59] zeljkof: nice works [12:28:34] (03CR) 10Arturo Borrero Gonzalez: [C: 032] "Compilation is OK: https://puppet-compiler.wmflabs.org/compiler1002/13757/" [puppet] - 10https://gerrit.wikimedia.org/r/476256 (owner: 10Arturo Borrero Gonzalez) [12:30:59] CFisch_WMDE: 476241 merged, I'll ping you when it's ready for testing, in a minute or two [12:31:09] 475985 failed to merge, again :/ [12:31:19] oh man -.- [12:31:38] that's a horrible test, if it's that unstable :P [12:31:50] yeah normally it's not my project [12:32:11] we just jumped in here to do this last minute thing before the extension leaves beta [12:32:24] they are aware at least ^^ [12:32:27] (03CR) 10Alexandros Kosiaris: [C: 032] ores: labs celery4 config update [puppet] - 10https://gerrit.wikimedia.org/r/476255 (https://phabricator.wikimedia.org/T209587) (owner: 10Ladsgroup) [12:32:34] (03PS3) 10Alexandros Kosiaris: ores: labs celery4 config update [puppet] - 10https://gerrit.wikimedia.org/r/476255 (https://phabricator.wikimedia.org/T209587) (owner: 10Ladsgroup) [12:32:36] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] ores: labs celery4 config update [puppet] - 10https://gerrit.wikimedia.org/r/476255 (https://phabricator.wikimedia.org/T209587) (owner: 10Ladsgroup) [12:34:59] !log installing ghostscript security updates [12:35:01] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:44] CFisch_WMDE: 476241 is at mwdebug1002 [12:40:17] *testing* [12:41:15] zeljkof: works [12:41:33] CFisch_WMDE: ok, deploying [12:43:12] !log zfilipin@deploy1001 Synchronized php-1.33.0-wmf.6/extensions/Cite: SWAT: [[gerrit:476241|Make backlink highlighting robust for community customized HTML (T205270 T210520)]] (duration: 00m 55s) [12:43:12] about the other thing ... it would be really good if that could be deployed [12:43:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:17] T205270: Highlight the important jump mark letter by making it bold - https://phabricator.wikimedia.org/T205270 [12:43:18] T210520: Customized messages mess order of DOM elements up - https://phabricator.wikimedia.org/T210520 [12:43:21] it failed again though -.- [12:43:42] CFisch_WMDE: 476241 deployed, please check [12:43:59] (03PS4) 10Alexandros Kosiaris: varnish: move $all_networks to $trusted_networks [puppet] - 10https://gerrit.wikimedia.org/r/475714 [12:44:02] CFisch_WMDE: yeah, test fails constantly, 3 times in a row for 475972 [12:44:30] 476241 works fine, thanks [12:44:50] but this time, it's a new failure :/ [12:44:57] o.O [12:45:23] how does that repo merge anything if the test fails so much?! [12:46:40] either damn luck or user-forced submit [12:46:53] that's my guess [12:46:54] (03CR) 10Alexandros Kosiaris: "@bblack: Ah yes, I see your point. After pondering a bit about it and studying the uses of $all_networks in our tree, it struck me that it" [puppet] - 10https://gerrit.wikimedia.org/r/475714 (owner: 10Alexandros Kosiaris) [12:47:14] CFisch_WMDE: you think it's fine if I force merge the commit? 
[12:47:24] yes [12:47:32] ok, merging [12:48:08] I will try to talk to the guys to fix their stuff [12:48:25] or I might end up deactivating that one test for now [12:48:40] swating this now was important though [12:48:44] PROBLEM - DPKG on labtestweb2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:49:58] RECOVERY - DPKG on labtestweb2001 is OK: All packages OK [12:50:12] CFisch_WMDE: could you please report a bug about failed test, so I can reference it? [12:50:31] yes I will do so [12:50:44] hashar: FYI I'll force-merge https://gerrit.wikimedia.org/r/c/mediawiki/extensions/AdvancedSearch/+/475985 [12:51:00] there is a failing test, but CFisch_WMDE says it's a known problem [12:51:07] and the patch should be deployed now [12:51:17] hehe fair enough [12:51:32] zeljkof: :) [12:51:44] ok, merged [12:54:13] !log installing serf update from stretch point release [12:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:39] zeljkof: Quickly created https://phabricator.wikimedia.org/T210599 for the failing test [12:54:56] thanks! [12:56:12] (03CR) 10Alexandros Kosiaris: [C: 031] profile::maps::osm_master: change osmupdater and osmimporter auth method to peer [puppet] - 10https://gerrit.wikimedia.org/r/475093 (https://phabricator.wikimedia.org/T206639) (owner: 10Mathew.onipe) [12:56:31] 475985 is not deployed anywhere yet, right? [12:56:34] zeljkof: [12:56:45] not yet [12:57:22] k, just checking [12:57:25] ;-) [12:57:49] anybody knows why is there a new commit in GrowthExperiments? 
[12:58:03] and CentralNotice [13:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181128T1300) [13:00:05] https://gerrit.wikimedia.org/r/#/q/project:mediawiki/extensions/CentralNotice+NOT+branch:master has some commits in wmf_deploy branch [13:00:22] (03PS5) 10Alexandros Kosiaris: varnish: move $all_networks to $trusted_networks [puppet] - 10https://gerrit.wikimedia.org/r/475714 [13:00:25] ditto for https://gerrit.wikimedia.org/r/#/q/project:mediawiki/extensions/GrowthExperiments+NOT+branch:master [13:00:34] hashar: I need a few more minutes for SWAT [13:01:23] CFisch_WMDE: 475985 is at mwdebug1002 [13:02:07] zeljkof: there are lang files included [13:02:13] these seem to be missing [13:02:16] :-/ [13:02:38] I just see the keys [13:02:47] ( apart from that it's fine ) [13:02:48] * 9661bd7 - (origin/wmf/1.33.0-wmf.4) WelcomeSurvey: indicate that the special page does write (8 days ago) [13:02:58] but the HEAD does not match bah [13:03:05] hm, not sure how lang files get synced [13:03:17] hashar: do you know why lang files are not at mwdebug? [13:03:32] do you think it's fine to deploy 475985? [13:03:41] hashar, CFisch_WMDE ^ [13:04:05] what do you mean by lang files? [13:04:11] languages/*.php ? [13:04:17] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/AdvancedSearch/+/475985 [13:04:21] so we just backported a new feature as part of an extension [13:04:24] i18n files [13:04:29] the stuff is there on debug [13:04:43] but I see the keys instead of the texts [13:05:18] ah so yeah [13:05:23] new language keys got added [13:05:36] but mediawiki reads them from the cache entries (cdb files) [13:05:38] yes [13:05:40] so we gotta regenerate them [13:05:42] so, ok to deploy? [13:05:51] we used to have a scap command to rebuild l10n but it is no longer available afaik [13:05:51] or will be done later, during train? [13:05:58] what? [13:06:00] so no [13:06:02] or, should I revert? 
[13:06:06] it is not ok to deploy since the messages are missing [13:06:09] * zeljkof is confused [13:06:14] we just have to rebuild the l10n files [13:06:23] which I am afraid requires a full scap [13:06:34] err a full sync [13:07:20] hmm :-/ [13:07:27] zeljkof hashar hi! the commits on the CN deploy branch should be stuff that's already deployed on prod [13:08:36] hashar: so I run `scap sync` instead of `scap sync-file`? [13:08:43] should I do it now? [13:08:59] or cancel the deployment, since swat window is already over? [13:09:10] zeljkof: hashar if the scap or whatever is needed is not possible to do now, it would be fine to do that later today, after the european train or in the morning swat [13:09:20] scap sync [13:09:26] that is all you need [13:09:28] but at the latest during morning swat would be good [13:09:30] that will rebuild the l10n cache [13:09:50] but also sync everywhere [13:09:55] I am looking for a workaround [13:10:07] hashar: ok, can I do it now? or should it be moved to the next swat window? [13:10:29] pls do ping if anything seems amiss [13:10:30] we're already into "Pre MediaWiki train sanity break" [13:11:21] so [13:11:23] https://phabricator.wikimedia.org/D983 [13:11:27] is what removed the command [13:11:39] AndyRussG: at deploy1001, in /srv/mediawiki-staging/php-1.33.0-wmf.4, git says: Submodules changed but not updated: [13:12:08] CentralNotice and GrowthExperiments [13:12:08] https://phabricator.wikimedia.org/T208196#4703264 : we no longer have the code to just update the localisation cache on the master, but a full sync should get everything back in order. [13:13:29] hashar: ok, so should I do `scap sync` now? or should I move the commit to the next swat window? [13:13:51] the docs say it takes 20 minutes, and we are already 15 into pre-train window [13:14:49] yeah just full sync [13:14:59] \o/ [13:15:26] hashar: ok, deploying [13:16:35] hashar: just to make it clear, this is what I need to do, right? 
[13:16:52] FOR GOD SAKE YES [13:16:59] zfilipin@deploy1001/mediawiki-staging$ scap sync 'SWAT: [[gerrit:475985|Add user preference to disable the advanced interface (T210479)]]' [13:16:59] T210479: User setting for disabling AdvancedSearch - https://phabricator.wikimedia.org/T210479 [13:17:00] ;) [13:17:08] hashar: you got me all confused [13:17:10] :P [13:17:24] if it breaks it's all hashar's fault now [13:17:40] ;-) [13:18:33] jouncebot: next [13:18:33] In 0 hour(s) and 41 minute(s): MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181128T1400) [13:19:19] !log zfilipin@deploy1001 Started scap: SWAT: [[gerrit:475985|Add user preference to disable the advanced interface (T210479)]] [13:19:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:41] CFisch_WMDE, hashar: see ^ [13:19:59] I'll ping you in 20 minutes or so, when it's done [13:20:10] nice, thanks [13:21:21] 10Operations, 10Elasticsearch, 10Discovery-Search (Current work): Fix prometheus elasticsearch exporter to show all the metrics - https://phabricator.wikimedia.org/T210592 (10Mathew.onipe) After several investigation as to why we are not seeing some already exposed metrics, I detected that we have to run the... 
[13:26:14] Updated 0 CDB files(s) in /srv/mediawiki/php-1.33.0-wmf.4/cache/l10n [13:26:19] Updated 0 CDB files(s) in /srv/mediawiki/php-1.33.0-wmf.6/cache/l10n [13:26:20] strange [13:27:04] Updating LocalisationCache for 1.33.0-wmf.4 using 30 thread(s) :))) [13:27:16] Updated 415 JSON file(s) in /srv/mediawiki-staging/php-1.33.0-wmf.4/cache/l10n [13:27:17] nice [13:32:13] Updated 415 JSON file(s) in /srv/mediawiki-staging/php-1.33.0-wmf.6/cache/l10n [13:37:05] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/466830 (owner: 10Muehlenhoff) [13:45:51] 10Operations, 10DBA, 10Patch-For-Review, 10User-Banyek: BBU Fail on dbstore2002 - https://phabricator.wikimedia.org/T208320 (10Banyek) What are we doing with this host at the end? It still has BBU error, and the host will be decommisioned, but until that I don't see any reason to keep this open, if we don'... [13:50:02] PROBLEM - HHVM rendering on mw1238 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1309 bytes in 0.002 second response time [13:50:30] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received [13:51:12] RECOVERY - HHVM rendering on mw1238 is OK: HTTP OK: HTTP/1.1 200 OK - 76148 bytes in 0.120 second response time [13:51:36] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy [13:52:55] scap stuck at this for a while now :/`scap-cdb-rebuild: 98% (ok: 281; fail: 0; left: 4) ` [13:53:04] hashar: ^ [13:53:19] guess some hosts are slower than others [13:53:34] 4 of them seem to be really slow :/ [13:54:25] ok, down to 3, finally [13:54:32] 2 [13:54:48] (03CR) 10Alexandros Kosiaris: "I should note I am not particularly in love with the name of the variable. Perhaps $trusted_aggregates would sound better?" 
[puppet] - 10https://gerrit.wikimedia.org/r/475714 (owner: 10Alexandros Kosiaris) [13:55:03] hopefully it will finish by train window :/ [13:55:48] *meow* [13:56:11] dont worry about train [13:56:20] I will do it once the full sync is done [13:56:42] zeljkof: I see the messages now ^^ [13:57:01] still 2 left... [13:57:21] down to 1 only... [13:57:27] * zeljkof is holding breath [13:57:31] !log zfilipin@deploy1001 Finished scap: SWAT: [[gerrit:475985|Add user preference to disable the advanced interface (T210479)]] (duration: 38m 12s) [13:57:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:35] T210479: User setting for disabling AdvancedSearch - https://phabricator.wikimedia.org/T210479 [13:57:37] boom! done! [13:57:47] hashar: it's done! [13:57:52] CFisch_WMDE: please test [13:57:55] cool [13:58:20] huh, and 2 minutes to spare [13:59:05] zeljkof: works perfectly fine [13:59:09] CFisch_WMDE: yeah! [13:59:12] 10Operations, 10Elasticsearch, 10Discovery-Search (Current work): Fix prometheus elasticsearch exporter to show all the metrics - https://phabricator.wikimedia.org/T210592 (10dcausse) [13:59:18] !log EU SWAT (finally) done [13:59:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:43] P.S.: AdvancedSearch is now out of Beta and a default https://en.wikipedia.org/w/index.php?search=&title=Special%3ASearch [14:00:03] (03PS1) 10Hashar: group1 wikis to 1.33.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476265 [14:00:04] hashar: Dear deployers, time to do the MediaWiki train - European version deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181128T1400). 
[14:00:05] (03CR) 10Hashar: [C: 032] group1 wikis to 1.33.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476265 (owner: 10Hashar) [14:00:06] ;) [14:00:19] But there's also a user preference to deactivate it [14:00:19] https://en.wikipedia.org/wiki/Special:Preferences#mw-prefsection-searchoptions [14:00:21] ;-) [14:00:45] \o/ [14:01:47] (03Merged) 10jenkins-bot: group1 wikis to 1.33.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476265 (owner: 10Hashar) [14:03:14] thanks zeljkof and hashar [14:03:18] :-) [14:04:50] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.33.0-wmf.6 [14:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:43] !log hashar@deploy1001 Synchronized php: group1 wikis to 1.33.0-wmf.6 (duration: 00m 52s) [14:05:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:05] (03PS1) 10KartikMistry: cxserver: Added update Youdao config [puppet] - 10https://gerrit.wikimedia.org/r/476266 (https://phabricator.wikimedia.org/T210578) [14:06:37] (03CR) 10jerkins-bot: [V: 04-1] cxserver: Added update Youdao config [puppet] - 10https://gerrit.wikimedia.org/r/476266 (https://phabricator.wikimedia.org/T210578) (owner: 10KartikMistry) [14:08:31] (03PS2) 10KartikMistry: cxserver: Added update Youdao config [puppet] - 10https://gerrit.wikimedia.org/r/476266 (https://phabricator.wikimedia.org/T210578) [14:09:42] (03CR) 10jenkins-bot: group1 wikis to 1.33.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476265 (owner: 10Hashar) [14:11:12] seems good [14:11:16] i will monitor the logs [14:15:09] 10Operations, 10Elasticsearch, 10Discovery-Search (Current work): Fix prometheus elasticsearch exporter to show all the metrics - https://phabricator.wikimedia.org/T210592 (10dcausse) There should be no need to fetch indices stats for this I think there was a misunderstanding in https://github.com/justwatchc... 
[14:20:30] (03PS1) 10Lucas Werkmeister (WMDE): Don’t send SPARQL prefixes in WikibaseQualityConstraints [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476267 (https://phabricator.wikimedia.org/T204317) [14:25:31] (03CR) 10Lucas Werkmeister (WMDE): "Scheduled for Monday’s EU SWAT." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476267 (https://phabricator.wikimedia.org/T204317) (owner: 10Lucas Werkmeister (WMDE)) [14:25:55] (03PS1) 10Marostegui: db-eqiad.php: Depool db1085 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476268 (https://phabricator.wikimedia.org/T86338) [14:32:12] (03PS1) 10Andrew Bogott: Horizon: move projects to eqiad1-r [puppet] - 10https://gerrit.wikimedia.org/r/476269 (https://phabricator.wikimedia.org/T204745) [14:32:38] (03PS2) 10Andrew Bogott: Horizon: move projects to eqiad1-r [puppet] - 10https://gerrit.wikimedia.org/r/476269 (https://phabricator.wikimedia.org/T204745) [14:33:30] (03CR) 10Andrew Bogott: [C: 032] Horizon: move projects to eqiad1-r [puppet] - 10https://gerrit.wikimedia.org/r/476269 (https://phabricator.wikimedia.org/T204745) (owner: 10Andrew Bogott) [14:34:58] (03Abandoned) 10DCausse: [cirrus] Start using psi&omega in codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475751 (owner: 10DCausse) [14:35:07] (03Abandoned) 10DCausse: [cirrus] Start using psi&omega in eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475752 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [14:35:20] (03PS6) 10DCausse: [cirrus] Allow configuration arrays in production services [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475747 (https://phabricator.wikimedia.org/T210381) [14:35:22] (03PS6) 10DCausse: [cirrus] switch to explicit config in production services [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475748 (https://phabricator.wikimedia.org/T210381) [14:35:24] (03PS6) 10DCausse: [cirrus] prepare multi-instance services [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/475749 (https://phabricator.wikimedia.org/T210381) [14:35:26] (03PS11) 10DCausse: [cirrus] Add temp clusters but still write to the old ones [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475750 (https://phabricator.wikimedia.org/T210381) [14:35:28] (03PS1) 10DCausse: [cirrus] Start writing to psi & omega [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476271 (https://phabricator.wikimedia.org/T210381) [14:35:30] (03PS1) 10DCausse: [cirrus] Start using replica group settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476272 (https://phabricator.wikimedia.org/T210381) [14:35:32] (03PS1) 10DCausse: [cirrus] Cleanup transitional states [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381) [14:36:35] PROBLEM - ensure kvm processes are running on labvirt1011 is CRITICAL: PROCS CRITICAL: 0 processes with regex args /usr/bin/kvm [14:37:09] (03CR) 10jerkins-bot: [V: 04-1] [cirrus] Add temp clusters but still write to the old ones [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475750 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [14:37:21] (03CR) 10jerkins-bot: [V: 04-1] [cirrus] Start writing to psi & omega [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476271 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [14:37:22] hashar: Can I hold you for a moment [14:37:43] (03CR) 10jerkins-bot: [V: 04-1] [cirrus] Start using replica group settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476272 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [14:37:57] The WikimediaIncubator extension got updated to wmf.6 but caused some new bugs so I'm preparing a patch to fix them [14:38:20] (03CR) 10jerkins-bot: [V: 04-1] [cirrus] Cleanup transitional states [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [14:38:26] ACKNOWLEDGEMENT - ensure 
kvm processes are running on labvirt1011 is CRITICAL: PROCS CRITICAL: 0 processes with regex args /usr/bin/kvm andrew bogott This server is about to be decommissioned. [14:44:04] (03PS1) 10Vgutierrez: certcentral: Provide TLS certificates for dumps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/476275 (https://phabricator.wikimedia.org/T207050) [14:44:15] (03PS3) 10Muehlenhoff: Remove sarin/neodymium from network constants/tcpircbot config [puppet] - 10https://gerrit.wikimedia.org/r/466830 [14:45:13] Hydriz: cant sorry. But you can craft a patch and add it to the next swat deploy slot! [14:46:18] hashar: oh man alright! [14:47:41] group1 looks good. I have filled a couple tasks but they dont seem too bad [14:47:55] I am out for a couple hours [14:49:24] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received [14:49:58] 10Operations, 10netops: Remove neodymium/sarin from router ACLs - https://phabricator.wikimedia.org/T210612 (10MoritzMuehlenhoff) [14:50:32] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy [14:51:00] (03CR) 10Muehlenhoff: [C: 032] Remove sarin/neodymium from network constants/tcpircbot config [puppet] - 10https://gerrit.wikimedia.org/r/466830 (owner: 10Muehlenhoff) [14:51:20] (03PS1) 10Vgutierrez: dumps: Deploy the certcentral managed TLS certificate [puppet] - 10https://gerrit.wikimedia.org/r/476276 (https://phabricator.wikimedia.org/T207050) [14:52:38] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation=create https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:53:52] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:57:16] (03CR) 10Vgutierrez: [C: 032] certcentral: Provide TLS certificates for dumps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/476275 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [14:57:25] (03PS2) 10Vgutierrez: certcentral: Provide TLS certificates for dumps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/476275 (https://phabricator.wikimedia.org/T207050) [15:00:40] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=DELETE https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:04:32] (03PS2) 10DCausse: [cirrus] multi-instance: add cirrussearch-big-indices.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475746 (https://phabricator.wikimedia.org/T210381) [15:04:34] (03PS7) 10DCausse: [cirrus] Allow configuration arrays in production services [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475747 (https://phabricator.wikimedia.org/T210381) [15:04:36] (03PS7) 10DCausse: [cirrus] switch to explicit config in production services [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475748 (https://phabricator.wikimedia.org/T210381) [15:04:38] (03PS7) 10DCausse: [cirrus] prepare multi-instance services [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475749 (https://phabricator.wikimedia.org/T210381) [15:04:40] (03PS12) 10DCausse: [cirrus] Add temp clusters but still write to the old ones [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475750 (https://phabricator.wikimedia.org/T210381) [15:04:42] (03PS2) 10DCausse: [cirrus] Start writing to psi & omega [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476271 (https://phabricator.wikimedia.org/T210381) [15:04:44] (03PS2) 10DCausse: [cirrus] Start using replica group settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476272 (https://phabricator.wikimedia.org/T210381) [15:04:46] (03PS2) 10DCausse: [cirrus] Cleanup transitional states 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381) [15:06:09] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1085 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476268 (https://phabricator.wikimedia.org/T86338) (owner: 10Marostegui) [15:06:15] (03CR) 10jerkins-bot: [V: 04-1] [cirrus] Add temp clusters but still write to the old ones [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475750 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [15:06:42] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:07:04] (03CR) 10jerkins-bot: [V: 04-1] [cirrus] Start writing to psi & omega [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476271 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [15:07:06] (03CR) 10jerkins-bot: [V: 04-1] [cirrus] Start using replica group settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476272 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [15:07:42] (03CR) 10jerkins-bot: [V: 04-1] [cirrus] Cleanup transitional states [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [15:07:54] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1085 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476268 (https://phabricator.wikimedia.org/T86338) (owner: 10Marostegui) [15:09:07] do I understand correctly that the train is over? 
[15:09:18] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1085 T86338 (duration: 00m 56s) [15:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:24] T86338: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338 [15:09:30] !log Deploy schema change on db1085 with replication - T86338 [15:09:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:46] !log depooling db1122 due schema change (T85757) [15:10:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:51] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [15:11:15] (03CR) 10Vgutierrez: [C: 032] "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1002/13759/" [puppet] - 10https://gerrit.wikimedia.org/r/476276 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [15:11:23] (03PS2) 10Vgutierrez: dumps: Deploy the certcentral managed TLS certificate [puppet] - 10https://gerrit.wikimedia.org/r/476276 (https://phabricator.wikimedia.org/T207050) [15:13:18] (03CR) 10Banyek: [C: 032] mariadb: depool db1122 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475744 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [15:13:26] (03PS3) 10Banyek: mariadb: depool db1122 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475744 (https://phabricator.wikimedia.org/T85757) [15:13:28] (03CR) 10Banyek: [V: 032 C: 032] mariadb: depool db1122 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475744 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [15:15:09] (03Merged) 10jenkins-bot: mariadb: depool db1122 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475744 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [15:16:02] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1085 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476268 
(https://phabricator.wikimedia.org/T86338) (owner: 10Marostegui) [15:16:06] (03CR) 10jenkins-bot: mariadb: depool db1122 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475744 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [15:16:50] (03PS1) 10Anomie: Avoid putting Message objects in sidebar cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476280 (https://phabricator.wikimedia.org/T210528) [15:17:35] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: T85757: depool db1122 (duration: 00m 53s) [15:17:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:39] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [15:19:12] !log Deploy schema change on db1122 - T85757 [15:19:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:58] (03PS1) 10Vgutierrez: dumps: Use the certcentral managed TLS certificate [puppet] - 10https://gerrit.wikimedia.org/r/476281 (https://phabricator.wikimedia.org/T207050) [15:23:28] (03CR) 10Vgutierrez: [C: 032] dumps: Use the certcentral managed TLS certificate [puppet] - 10https://gerrit.wikimedia.org/r/476281 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [15:24:03] !log use a certcentral managed TLS certificate in dumps.wm.o - T207050 [15:24:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:07] T207050: Migrate most standard public TLS certificates to CertCentral issuance - https://phabricator.wikimedia.org/T207050 [15:27:52] openssl s_client -servername dumps.wikimedia.org -connect dumps.wikimedia.org:443 2>/dev/null | openssl x509 -noout -dates [15:27:52] notBefore=Nov 28 13:59:40 2018 GMT [15:27:56] \o/ [15:29:31] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1085" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476282 [15:31:26] (03CR) 10Herron: WIP rsyslog: udp input json_lines shim (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/475352 (https://phabricator.wikimedia.org/T205851) (owner: 10Filippo Giunchedi) [15:31:37] (03PS1) 10Vgutierrez: dumps: Get rid of the old LE puppetization [puppet] - 10https://gerrit.wikimedia.org/r/476283 (https://phabricator.wikimedia.org/T207050) [15:31:41] greg-g: hey around? [15:32:55] 10Operations, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Backlog (Watching / External), 10Services (watching): rack/setup/install sessionstore200[123].codfw.wmnet - https://phabricator.wikimedia.org/T209389 (10Eevans) a:05Eevans>03RobH @Papaul, @RobH Should I be able... [15:34:36] (03PS13) 10DCausse: [cirrus] Add temp clusters but still write to the old ones [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475750 (https://phabricator.wikimedia.org/T210381) [15:34:38] (03PS3) 10DCausse: [cirrus] Start writing to psi & omega [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476271 (https://phabricator.wikimedia.org/T210381) [15:34:40] (03PS3) 10DCausse: [cirrus] Start using replica group settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476272 (https://phabricator.wikimedia.org/T210381) [15:34:42] (03PS3) 10DCausse: [cirrus] Cleanup transitional states [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381) [15:35:44] (03CR) 10DCausse: [cirrus] multi-instance: add cirrussearch-big-indices.dblist (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475746 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [15:35:46] (03CR) 10Vgutierrez: [C: 032] "pcc is happy and shows the expected changes: https://puppet-compiler.wmflabs.org/compiler1002/13760/" [puppet] - 10https://gerrit.wikimedia.org/r/476283 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [15:36:04] (03CR) 10DCausse: [cirrus] Allow configuration arrays in production services (031 comment) [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/475747 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [15:36:23] (03CR) 10jerkins-bot: [V: 04-1] [cirrus] Start writing to psi & omega [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476271 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [15:36:41] (03CR) 10DCausse: [cirrus] prepare multi-instance services (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475749 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [15:36:43] (03CR) 10jerkins-bot: [V: 04-1] [cirrus] Start using replica group settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476272 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [15:36:59] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: Ship PuppetDB logs to ELK - https://phabricator.wikimedia.org/T210458 (10jcrespo) a:03herron Assigning it to you as you seem to be working on it, unclaim it if this is wrong. [15:37:05] (03CR) 10jerkins-bot: [V: 04-1] [cirrus] Cleanup transitional states [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [15:37:16] Reedy and others with access: Can somebody create an account for Satdeep Gill (it's their SUL username) on punjabi.wikimedia.org? 
[15:37:35] It's fishbowl wiki and it's unusable till an account is created [15:38:26] 10Operations, 10Traffic, 10Patch-For-Review: Migrate most standard public TLS certificates to CertCentral issuance - https://phabricator.wikimedia.org/T207050 (10Vgutierrez) [15:40:02] (03PS1) 10Vgutierrez: certcentral: Provide TLS certificates for gerrit/gerrit-slave.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/476284 (https://phabricator.wikimedia.org/T207050) [15:40:32] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1085" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476282 [15:41:31] (03PS1) 10Vgutierrez: gerrit: Deploy the certcentral managed TLS certificate [puppet] - 10https://gerrit.wikimedia.org/r/476285 (https://phabricator.wikimedia.org/T207050) [15:41:46] Lydia_WMDE: ok, I'll rollback wikidata wiki to wmf.4. Once you've got tasks for problems please be sure to attach them to T206660 [15:41:46] T206660: 1.33.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T206660 [15:41:51] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1085" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476282 (owner: 10Marostegui) [15:42:20] thcipriani: thank you! and sorry :( [15:42:22] 10Operations, 10Wikimedia-Apache-configuration, 10Patch-For-Review: Redirect from zh-yue.wiktionary.org is not working properly - https://phabricator.wikimedia.org/T209693 (10jcrespo) 05Open>03Resolved a:03ArielGlenn @Hello903hello @Urbanecm this seems to be working after Ariel's patch, resolving. Plea... [15:42:55] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1085" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476282 (owner: 10Marostegui) [15:43:27] 10Operations, 10Patch-For-Review: Decommission servermon - https://phabricator.wikimedia.org/T198939 (10jcrespo) Please, please, if this happens in anyway, remember not to close this as resolved without doing lots of cleanup related to the database. 
[15:43:29] (03CR) 10Vgutierrez: [C: 032] certcentral: Provide TLS certificates for gerrit/gerrit-slave.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/476284 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [15:43:59] marostegui: looks like you're doing database config on deploy1001, is that right? [15:44:01] thcipriani: There is a modified wikiversions.json on deploy1001 so I cannot rebase :) [15:44:07] thcipriani: good timing :) [15:44:23] heh, yeah, I'll be out of your way in just one second [15:44:29] sure! thanks :) [15:44:35] no rush [15:46:04] (03PS1) 10Thcipriani: wikidatawiki back to 1.33.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476293 [15:46:22] marostegui: ok, I'm going to rollback one wiki really quickly, should only take a minute, I've pulled down your patch and I'll let you know when all clear, is that ok? [15:46:31] thcipriani: sure, no rush :) [15:46:49] great, thanks :) [15:47:09] (03CR) 10Thcipriani: [C: 032] wikidatawiki back to 1.33.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476293 (owner: 10Thcipriani) [15:48:21] (03Merged) 10jenkins-bot: wikidatawiki back to 1.33.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476293 (owner: 10Thcipriani) [15:49:24] !log thcipriani@deploy1001 rebuilt and synchronized wikiversions files: wikidatawiki back to 1.33.0-wmf.4 [15:49:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:52] Lydia_WMDE: ^ wikidatawiki should be reverted [15:50:11] marostegui: all yours! I made my patch on top of yours so you should just need to sync. [15:50:18] thcipriani: excellent! thanks! [15:51:29] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1085 T86338 (duration: 00m 52s) [15:51:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:33] thcipriani: I am done for the day with scap! Thank you! 
[15:51:33] T86338: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338 [15:51:59] marostegui: sorry to step on your toes! :) [15:51:59] (03PS1) 10Banyek: Revert "mariadb: depool db1122" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476296 [15:52:10] !log repooling db1122 after schema change (T85757) [15:52:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:14] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [15:53:12] (03PS14) 10DCausse: [cirrus] Add temp clusters but still write to the old ones [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475750 (https://phabricator.wikimedia.org/T210381) [15:53:14] (03PS4) 10DCausse: [cirrus] Start writing to psi & omega [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476271 (https://phabricator.wikimedia.org/T210381) [15:53:16] (03PS4) 10DCausse: [cirrus] Start using replica group settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476272 (https://phabricator.wikimedia.org/T210381) [15:53:18] (03PS4) 10DCausse: [cirrus] Cleanup transitional states [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381) [15:53:44] (03CR) 10Banyek: [C: 032] Revert "mariadb: depool db1122" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476296 (owner: 10Banyek) [15:53:50] (03PS2) 10Banyek: Revert "mariadb: depool db1122" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476296 [15:53:53] (03CR) 10Banyek: [V: 032 C: 032] Revert "mariadb: depool db1122" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476296 (owner: 10Banyek) [15:55:18] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1085" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476282 (owner: 10Marostegui) [15:55:20] (03CR) 10jenkins-bot: wikidatawiki back to 1.33.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476293 (owner: 10Thcipriani) [15:55:24] !log 
banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: T85757: repool db1122 (duration: 00m 53s) [15:55:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:59] exit [15:57:07] nothere [15:58:50] thcipriani: thank you so much [16:00:14] Lydia_WMDE: sure thing! yw :) [16:00:33] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-herron: Onboard at least 10 new non-sensitive log producers to the logging pipeline - https://phabricator.wikimedia.org/T205852 (10herron) [16:02:43] 10Operations, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Backlog (Watching / External), 10Services (watching): rack/setup/install sessionstore200[123].codfw.wmnet - https://phabricator.wikimedia.org/T209389 (10RobH) @eevens, Nope! Not until we know and apply a role, ri... [16:03:37] (03CR) 10GTirloni: [C: 032] openstack: eqiad1: activate keystone extra services [puppet] - 10https://gerrit.wikimedia.org/r/476232 (https://phabricator.wikimedia.org/T210595) (owner: 10Arturo Borrero Gonzalez) [16:05:50] (03PS4) 10Arturo Borrero Gonzalez: openstack: eqiad1: activate keystone extra services [puppet] - 10https://gerrit.wikimedia.org/r/476232 (https://phabricator.wikimedia.org/T210595) [16:06:17] (03CR) 10Ayounsi: varnish: move $all_networks to $trusted_networks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/475714 (owner: 10Alexandros Kosiaris) [16:06:31] (03PS2) 10Andrew Bogott: ci: stop monitoring zmq on Jenkins [puppet] - 10https://gerrit.wikimedia.org/r/473846 (https://phabricator.wikimedia.org/T209361) (owner: 10Hashar) [16:07:39] (03CR) 10Andrew Bogott: [C: 032] ci: stop monitoring zmq on Jenkins [puppet] - 10https://gerrit.wikimedia.org/r/473846 (https://phabricator.wikimedia.org/T209361) (owner: 10Hashar) [16:08:37] (03CR) 10jenkins-bot: Revert "mariadb: depool db1122" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476296 (owner: 10Banyek) [16:12:15] 10Operations, 10ops-codfw: rack/setup/install 
elastic203[7-9], elastic204[0-9], elastic205[0-4] - https://phabricator.wikimedia.org/T210450 (10Papaul) @Gehel In this case the racking proposal will not work since those racks are 1G rack. I will update the task description with the new racking proposal. [16:12:34] (03CR) 10Vgutierrez: [C: 032] gerrit: Deploy the certcentral managed TLS certificate [puppet] - 10https://gerrit.wikimedia.org/r/476285 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [16:12:42] (03PS2) 10Vgutierrez: gerrit: Deploy the certcentral managed TLS certificate [puppet] - 10https://gerrit.wikimedia.org/r/476285 (https://phabricator.wikimedia.org/T207050) [16:13:22] 10Operations, 10ops-codfw: rack/setup/install elastic203[7-9], elastic204[0-9], elastic205[0-4] - https://phabricator.wikimedia.org/T210450 (10Papaul) [16:15:15] (03PS1) 10Alex Monk: labs hieradata: Rm contintcloud project common.yaml file [puppet] - 10https://gerrit.wikimedia.org/r/476298 (https://phabricator.wikimedia.org/T209644) [16:18:02] (03PS1) 10Alex Monk: labs hieradata: Rm puppet3-diffs project common.yaml file [puppet] - 10https://gerrit.wikimedia.org/r/476299 [16:20:27] (03CR) 10BBlack: varnish: move $all_networks to $trusted_networks (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/475714 (owner: 10Alexandros Kosiaris) [16:21:42] (03PS1) 10Vgutierrez: gerrit: Use the certcentral managed TLS certificate [puppet] - 10https://gerrit.wikimedia.org/r/476301 (https://phabricator.wikimedia.org/T207050) [16:23:14] 10Operations, 10monitoring, 10Patch-For-Review: Icinga downtime script should fail on the passive hosts - https://phabricator.wikimedia.org/T210380 (10Dzahn) also manually copied motd snippet and modified icinga-downtime to einsteinium so that people get the warnings there until the host is fully decomed. an... 
[16:24:30] (03PS1) 10Alex Monk: labs hieradata: Rm restbase project common.yaml file [puppet] - 10https://gerrit.wikimedia.org/r/476302 [16:25:46] (03CR) 10Alex Monk: "See T204532" [puppet] - 10https://gerrit.wikimedia.org/r/476299 (owner: 10Alex Monk) [16:27:01] (03PS1) 10Lucas Werkmeister (WMDE): Disable classic_entity wbsearchentities AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476303 (https://phabricator.wikimedia.org/T209402) [16:27:23] (03PS5) 10Arturo Borrero Gonzalez: openstack: eqiad1: activate keystone extra services [puppet] - 10https://gerrit.wikimedia.org/r/476232 (https://phabricator.wikimedia.org/T210595) [16:27:27] (03CR) 10Arturo Borrero Gonzalez: [V: 032] openstack: eqiad1: activate keystone extra services [puppet] - 10https://gerrit.wikimedia.org/r/476232 (https://phabricator.wikimedia.org/T210595) (owner: 10Arturo Borrero Gonzalez) [16:27:57] (03CR) 10Alex Monk: "The old node is shut down, some review here would be nice." [puppet] - 10https://gerrit.wikimedia.org/r/475227 (owner: 10Alex Monk) [16:28:06] 10Operations, 10Core Platform Team Backlog (Watching / External): Create email alias for CPT Leads - https://phabricator.wikimedia.org/T210624 (10mobrovac) [16:30:54] (03Abandoned) 10Alex Monk: network::constants: Include cloud private range in all_networks [puppet] - 10https://gerrit.wikimedia.org/r/475150 (owner: 10Alex Monk) [16:31:51] ebernhardson, SMalyshev: if one of you is online, I’d appreciate a review on https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/476303 [16:31:58] Lydia_WMDE: looks like you're good now? Sorry, I just got on my laptop. 
[16:32:29] (03PS3) 10Andrew Bogott: deployment-prep: Clean up from cache-text04 -> cache-text05 migration [puppet] - 10https://gerrit.wikimedia.org/r/475227 (owner: 10Alex Monk) [16:32:32] (03CR) 10Herron: [C: 032] "lgtm, and fwiw the newer puppet-diffs project has these set through hiera in horizon" [puppet] - 10https://gerrit.wikimedia.org/r/476299 (owner: 10Alex Monk) [16:32:38] greg-g: yeah thanks :) [16:32:41] (03PS2) 10Herron: labs hieradata: Rm puppet3-diffs project common.yaml file [puppet] - 10https://gerrit.wikimedia.org/r/476299 (owner: 10Alex Monk) [16:34:44] Lydia_WMDE: I can sync up with WMDE tomorrow to push wikidatawiki back to wmf.6 :) [16:34:50] (03PS2) 10Dzahn: check_long_procs: fix shellcheck warnings [puppet] - 10https://gerrit.wikimedia.org/r/475283 (owner: 10Ema) [16:35:01] (03CR) 10Dzahn: [C: 032] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/475283 (owner: 10Ema) [16:35:04] hashar: thanks. i hope we have everything solved by then [16:35:06] (03PS2) 10Lucas Werkmeister (WMDE): Disable classic_entity wbsearchentities AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476303 (https://phabricator.wikimedia.org/T209402) [16:35:13] (03CR) 10Vgutierrez: [C: 04-2] "To be merged tomorrow at 08:00 UTC" [puppet] - 10https://gerrit.wikimedia.org/r/476301 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [16:36:07] greg-g: https://phabricator.wikimedia.org/T210618 (entity search case sensitive) is probably not related to the train after all, FYI [16:36:21] if you still want to have it as a train blocker, that’s okay, but there was a reason I removed the parent task :) [16:37:10] not sure if this 1time failure but logging just in case it reoccurs https://www.irccloud.com/pastebin/197it6JK/internal-error.txt [16:37:28] [W-7EBgpAMEkAAD3s49AAAADE] 2018-11-28 16:37:19: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" [16:37:38] I was trying to lock and oversight an account on Meta [16:37:46] (03CR) 10Dzahn: 
[C: 032] wikistats: remove jessie/php5 support [puppet] - 10https://gerrit.wikimedia.org/r/475031 (owner: 10Dzahn) [16:37:54] (03PS2) 10Dzahn: wikistats: remove jessie/php5 support [puppet] - 10https://gerrit.wikimedia.org/r/475031 [16:38:00] Lucas_WMDE: gotcha, I'll re-read in a second, in a 1:1 now :) thanks [16:38:06] alright :) [16:38:53] was there any patch that affects CentralAuth? [16:39:12] >To avoid creating high replication lag, this transaction was aborted because the write duration (50.556979894638) exceeded the 3 second limit. If you are changing many items at once, try doing multiple smaller operations instead. [16:39:38] [W-7EcQpAADoAAJJiarsAAACP] 2018-11-28 16:39:06: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" [16:41:36] * revi feels lost [16:41:39] [W-7E5wpAAEQAAD2UK4YAAABE] 2018-11-28 16:41:04: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" [16:42:31] (03CR) 10Cwhite: [C: 031] Remove Diamond from redis::misc systems [puppet] - 10https://gerrit.wikimedia.org/r/476226 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [16:46:51] (03PS4) 10Dzahn: Revert "gerrit: Set log level for com.google.gerrit.server.plugins.PluginLoader to ERROR" [puppet] - 10https://gerrit.wikimedia.org/r/475226 (owner: 10Hashar) [16:47:38] hi folks? [16:47:49] Hydriz: https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/WikimediaIncubator/+/476304/ is live on mwdebug1002, check please and let me know if it is ok to proceed [16:48:40] I need to lock and oversight an account, and the operations are timing out [16:49:25] did you file a task with the exception hash? [16:49:29] hmm [16:49:29] k [16:49:50] I didn't want to file a task since oversight [16:49:57] PROBLEM - puppet last run on cloudcontrol1003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:50:13] revi, if oversight is broken file a security task [16:50:16] (03CR) 10Dzahn: "plugin_log: cannot open `plugin_log' (No such file or directory) ?" [puppet] - 10https://gerrit.wikimedia.org/r/475226 (owner: 10Hashar) [16:50:24] !log T206916 created shnwiki views/index in labsdb replicas [16:50:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:28] T206916: Prepare and check storage layer for shnwiki - https://phabricator.wikimedia.org/T206916 [16:50:36] k Krenair [16:50:41] thcipriani: Alright yes it's correct [16:51:27] Hydriz: ok, syncing everywhere. [16:52:15] (03CR) 10Dzahn: [C: 032] Revert "gerrit: Set log level for com.google.gerrit.server.plugins.PluginLoader to ERROR" [puppet] - 10https://gerrit.wikimedia.org/r/475226 (owner: 10Hashar) [16:52:20] https://phabricator.wikimedia.org/T210628 in case someone see this [16:53:02] jouncebot: now [16:53:02] No deployments scheduled for the next 0 hour(s) and 6 minute(s) [16:53:07] jouncebot: next [16:53:07] In 0 hour(s) and 6 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181128T1700) [16:53:24] I think I did my best, goodnight [16:53:34] sneaks in a gerrit restart before that ^ [16:53:44] !log gerrit about to restart for logging config change [16:53:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:47] !log thcipriani@deploy1001 Synchronized php-1.33.0-wmf.6/extensions/WikimediaIncubator/extension.json: [[gerrit:476304|Revert "Replace wiki with wikipedia as wmf-config has been updated"]] T117023 (duration: 00m 54s) [16:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:51] T117023: Auto-generated infopages incorrectly links to the current page for Wikipedia - https://phabricator.wikimedia.org/T117023 [16:54:57] unfortunately I can’t be there for the SWAT, but if Someone™ could backport 
https://gerrit.wikimedia.org/r/476292 and/or merge+deploy https://gerrit.wikimedia.org/r/476303, that would be swell [16:54:57] ^ Hydriz should be live now [16:55:09] but otherwise I’m sure we’ll make it happen some other way :) [16:55:17] thcipriani: Nice, thanks a lot :) [16:57:02] !log restarting gerrit [16:57:02] Lucas_WMDE: I am doing a bit of an impromptu early SWAT currently, I can get those patches out. [16:57:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:55] thcipriani: thanks! the Kartographer thing should be testable, try loading https://www.wikidata.org/wiki/Q499158 on wmf.6 with/without the backport [16:58:05] gerrit back to normal [16:58:06] I don’t see how the search patch could be tested though :/ [16:58:59] Lucas_WMDE: are you around for a few and able to test? [16:59:18] unfortunately not, I have to leave soon [16:59:59] well, perhaps five or ten minutes [17:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Morning SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181128T1700). [17:00:04] No GERRIT patches in the queue for this window AFAICS. [17:00:08] uh, did someone deploy certcentral to gerrit? [17:00:41] paladox: mutante just did a restart for logging stuff as I understand it [17:00:59] thcipriani yep, I'm getting puppet errors in the cloud. [17:01:05] (not related to that change) [17:01:20] File[/etc/centralcerts/gerrit.rsa-2048.crt],File[/etc/centralcerts/gerrit.rsa-2048.chain.crt],File[/etc/centralcerts/gerrit.rsa-2048.chained.crt],File[/etc/centralcerts/gerrit.rsa-2048.key] [17:01:22] Lucas_WMDE: ok, if you have a few, I guess let's do the IS.php change since it'll be quick and I don't know how to test. 
[17:01:32] vgutierrez, ^ [17:01:45] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476303 (https://phabricator.wikimedia.org/T209402) (owner: 10Lucas Werkmeister (WMDE)) [17:01:47] !log stat1004:~# aptitude install exfat-fuse exfat-utils (elukey fyi) [17:01:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:57] chasemp: ack :) [17:02:10] 10Operations, 10Core Platform Team Backlog (Watching / External): Create email alias for CPT Leads - https://phabricator.wikimedia.org/T210624 (10Joe) p:05Triage>03Low a:03Joe [17:02:51] (03Merged) 10jenkins-bot: Disable classic_entity wbsearchentities AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476303 (https://phabricator.wikimedia.org/T209402) (owner: 10Lucas Werkmeister (WMDE)) [17:02:53] 10Operations, 10Core Platform Team Backlog (Watching / External): Create email alias for CPT Leads - https://phabricator.wikimedia.org/T210624 (10mobrovac) [17:03:08] thcipriani: sorry, computer crashed, I’m back now [17:03:14] we can try the IS.php change, yeah [17:03:19] the other one would take too long in Zuul I guess [17:03:20] i get: Error: /Stage[main]/Profile::Gerrit::Server/Certcentral::Cert[gerrit]/File[/etc/centralcerts/gerrit.rsa-2048.crt]: Could not evaluate: Could not retrieve file metadata for puppet://certcentral1001.eqiad.wmnet/acmedata/gerrit/rsa-2048.crt: execution expired (to be precise) [17:03:29] yeah, that's my fault [17:03:38] Lucas_WMDE: IS.php live on mwdebug1002, check please [17:03:41] (if possible) [17:03:44] I’ll try [17:04:08] 10Operations, 10Wikimedia-Mailing-lists: Post hold because of "invalid headers" in wikimediacz-l - https://phabricator.wikimedia.org/T210223 (10jcrespo) I will also add @herron here, as our resident mail expert, in case he has some suggestion on why this could fail, or he can maybe debug with you the mail or m... 
[17:04:21] paladox, yeah you probably need certcentral to use the gerrit puppet manifest now [17:04:32] hmm [17:04:38] how does one even set that up? [17:04:48] well which way do you want to set it up? [17:06:15] (03PS1) 10Ema: cache: stop using nhw admission policy [puppet] - 10https://gerrit.wikimedia.org/r/476311 (https://phabricator.wikimedia.org/T144187) [17:07:02] Krenair could it be made so certcentral only applies to prod? [17:07:47] if you have a puppetmaster you could local hack that [17:07:56] would not expect such a change to get merged [17:08:50] it's a pandora's box of deep problems about the labs/prod barrier here [17:09:10] Amir1: So… it looks like your config changes to Beta Commons for federation broke WBMI. :-( MWException on any WBMI-enabled file page, e.g. https://commons.wikimedia.beta.wmflabs.org/wiki/File:Redsq.png [17:09:14] thcipriani: I think we managed to test it and on mwdebug1002 the issue seems resolved [17:09:24] would it be reasonable to put some kind of live certcentral in wmcs that can service random labs' instances requests? (I think not, esp for private puppetmasters) [17:09:26] (with special thanks to Lydia) [17:09:29] PROBLEM - Host lvs1006 is DOWN: PING CRITICAL - Packet loss = 100% [17:09:45] Lucas_WMDE: ok, thank you for testing, deploying now [17:10:21] RECOVERY - puppet last run on cloudcontrol1003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:10:30] lvs1006 is currently a backup, not super critical, will look shortly [17:11:12] !log thcipriani@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:476303|Disable classic_entity wbsearchentities AB test]] T209402 T210618 (duration: 00m 55s) [17:11:15] bblack i wouldn't oppose a cloud certcentral. But it seems it would fail for anyone else installing gerrit using the class in the cloud. 
(due to the path (it will conflict)) [17:11:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:17] T209402: A/B testing plan for wbsearchentities, context=item - https://phabricator.wikimedia.org/T209402 [17:11:18] T210618: Wikidata entity search sometimes case-sensitive, uses wb_terms instead of CirrusSearch - https://phabricator.wikimedia.org/T210618 [17:11:21] ^ Lucas_WMDE should be live now [17:11:29] thcipriani: thanks! [17:11:36] the simplest solution for now seems to be to do "if $realm" and do the previous setup in cloud and else the new one [17:11:57] indeed, fixed now, thank you again <3 [17:11:58] https://github.com/wikimedia/puppet/commit/d66aa68407caeaeb24a16e9ffa53c71c7124cb53#diff-6a500e5a9001daa876354f5d078f4059 only needs a if prod. [17:11:59] unless/until we can make certcentral work in labs [17:12:06] Let me clear some things up. [17:12:16] * Certcentral can work in labs. [17:12:17] paladox: there will be https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/476301/ after that [17:12:21] Krenair: :)) [17:12:25] Lucas_WMDE: I can try to test the Wikibase change myself, fair warning I may fail to notice changes :) [17:12:27] ** Certcentral can work in labs over http-01 [17:12:30] oh [17:12:34] cool, that sounds good [17:12:52] ** Certcentral can work in labs over dns-01 if you get the right credentials set up and agreed with Andrew, this has some implications [17:13:03] (03CR) 10Paladox: [C: 04-1] "This will break for users who have installed this module in the cloud." [puppet] - 10https://gerrit.wikimedia.org/r/476301 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [17:13:10] thcipriani: if you go to wmf.6 without the backport, then opening https://www.wikidata.org/wiki/Q499158 should give you an internal error… hard to miss ;) [17:13:13] probably scary ones, re: random instances getting random SNIs... 
[17:13:27] Lucas_WMDE: you might be surprised what I can miss :P [17:13:31] * No, you can't use certcentral in a project without it's own puppetmaster and puppetdb setup. [17:13:34] Lucas_WMDE: ok, I'll give it a shot [17:14:28] (03CR) 10jenkins-bot: Disable classic_entity wbsearchentities AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476303 (https://phabricator.wikimedia.org/T209402) (owner: 10Lucas Werkmeister (WMDE)) [17:14:30] James_F: hmm, I have a feeling some code need to be changed, let me check logstash [17:14:53] i think a bunch of "if $realm" is still better than making it so that you cant use any of those modules in cloud VPS [17:15:02] ^^ [17:15:09] (we should have more than just gerrit) [17:15:12] yeah but that's a temporary hack [17:15:19] yea, meant to be temp. ack [17:15:34] i think we did the same when LE stuff was new? [17:15:52] will we keep the old letsencrypt::cert::integrated() around in prod puppet just for labs $realm blocks that prod doesn't use anymore? the idea was to kill that old junk. [17:16:08] so... we can move the cert paths in the template to some variables and fill them based on a hiera variable signaling if certcentral should be used or not? [17:16:10] (that's what gerrit presently uses successfully in labs) [17:16:27] the cert paths yes, but the puppetization of what fetches the certs, is harder [17:16:29] yea, the Hiera option should also work [17:16:48] uhm.. yea.. once you have certs [17:16:52] !log rebooting lvs1006, console was unresponsive [17:16:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:45] hiera could differentiate the cert path, but something else will still need to do "if $realm" to pick the old letsencrypt::cert::integrated vs the new certcentral::cert to fetch that path and renew it. 
[17:17:49] bblack if we can get certcentral to support a way for multiple certs with for example puppet://certcentral1001.eqiad.wmnet/acmedata/gerrit/-rsa-2048.crt then we can deploy that to the cloud, otherwise certcentral will not work. [17:18:24] multiple certs? I don't follow you [17:18:26] arguably that part of the problem lies at a different layer [17:18:34] Lucas_WMDE: ugh, looks like tests are failing for that wikibase patch: https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php70-docker/13724/console [17:18:40] yeah :( [17:18:44] vgutierrez multiple certs as in one for one instance and another for another instance. [17:18:52] vgutierrez: he's saying we might have 20 different test gerrit instances in labs I think, that would want unique variant test certs [17:19:00] ^^ [17:19:48] but I think that problem is really just fallout of a deeper problem [17:20:23] thcipriani: that lua assertion has been failing on a few CI builds today (and I think yesterday evening?), seems to be random [17:21:03] If you want multiple gerrit instances in a labs project you can just tell certcentral all of their SANs [17:21:07] have it generate one cert [17:21:12] distribute that cert to all of them [17:21:28] in a perfect world every single one of those modules would have a cloud VPS-equivalent-to-prod where we test all the things. 2 per module, one that is always like prod and one that is running the next version [17:21:31] but the path it gave me was: puppet://certcentral1001.eqiad.wmnet/acmedata/gerrit/rsa-2048.crt [17:21:39] but they'd have to use http-01, which is something we're not planning to puppetize in prod for this (as in, not puppetize all the apache/nginx changes needed for that) [17:21:53] Wait what? Why? [17:21:56] oh then certcentral will not work in cloud [17:22:22] Krenair: I assume we're not going to have a general solution where random $labs instances can all ask for dns-01 via the wmcs public dns servers for random names? 
[17:22:27] I don't know though! [17:22:57] bblack, there is a designate API... we can set up service users in keystone for instances to be able to administer DNS records within their projects [17:23:05] Lucas_WMDE: ok, I'll try +2 and it should re-run the test before merge, hopefully it'll flap the other way; however, if that test is unreliable/not useful we should remove it. [17:23:12] thcipriani: I filed https://phabricator.wikimedia.org/T210634 for it [17:23:19] thanks [17:23:24] but let’s hope a recheck works for now, yeah [17:23:28] the only missing piece of the puzzle is the script that contacts the designate API to make the change, I haven't got around to writing that yet [17:23:33] I have an open task for that [17:23:36] ok [17:23:53] and I guess, some integration of permissions on what instances are allowed to issue what names? [17:24:03] or is that implicit in projects' scope somehow? [17:24:31] The instances with the credentials would be able to modify DNS for their entire project. [17:24:48] You'd have a certcentral server and it'd have the credentials [17:25:01] so you'll need to spawn a certcentral instance per project [17:25:04] Your puppet setup would control which nodes get what certs from certcentral as usual [17:25:07] of course [17:25:29] "modify DNS for their entire project" - projects have subdomains or something? [17:25:38] (I have no idea!) [17:25:40] you will need a puppetmaster, a puppetdb, a certcentral host, and credentials with the ability to modify DNS [17:25:41] yes? 
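For context on what a dns-01 integration script would have to publish, here is a minimal Python sketch of the TXT record defined by the ACME specification (RFC 8555): the record lives at `_acme-challenge.<domain>` and its value is the unpadded base64url encoding of the SHA-256 digest of `token.thumbprint`. The function names and the sample token and thumbprint are assumptions for illustration. The step that would actually write the record through the Designate API is deliberately left out, since that script is described above as not yet written.

```python
import base64
import hashlib
from typing import Tuple


def b64url(data: bytes) -> str:
    """Base64url-encode without trailing '=' padding, as ACME requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def dns01_txt_record(domain: str, token: str, account_thumbprint: str) -> Tuple[str, str]:
    """Return (record name, TXT value) for an ACME dns-01 challenge.

    `token` comes from the ACME server's challenge object and
    `account_thumbprint` is the base64url JWK thumbprint of the account
    key; both are plain string inputs here.
    """
    # A wildcard order is validated against the base domain.
    if domain.startswith("*."):
        domain = domain[2:]
    key_authorization = f"{token}.{account_thumbprint}"
    digest = hashlib.sha256(key_authorization.encode("ascii")).digest()
    return f"_acme-challenge.{domain.rstrip('.')}", b64url(digest)


if __name__ == "__main__":
    # Hypothetical token and thumbprint, for demonstration only.
    name, value = dns01_txt_record("gerrit.example.org", "sample-token", "sample-thumbprint")
    print(name, "TXT", value)
```

Whatever holds the DNS-editing credentials would create this record, wait for it to be visible on the authoritative servers, ask the ACME server to validate, and then remove the record.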
[17:25:47] ok [17:25:53] IMHO certcentral is not that mature yet, we need to solve certain issues with the current cert deployment puppetization [17:25:55] so in deployment-prep we own deployment-prep.wmflabs.org and beta.wmflabs.org [17:26:13] there is a set of credentials that can be used to modify any records under those domains [17:26:21] Designate's API lets you change anything by the way [17:26:29] (and get it renamed before spawning X instances across labs) [17:26:36] We're not restricted to only ACME's TXT records like with gdnsd [17:27:25] ok [17:27:39] is somebody swatting? [17:27:51] right now deployment-prep is the only project with a dns-management service user set up [17:28:03] mobrovac: I am. slowly but surely. [17:28:23] it doesn't sound like we're going to fix this for the labs gerrit instance quickly on the certcentral end, in any case [17:28:32] mobrovac: I'm waiting on CI stuff though [17:28:33] unlike AWS, our OpenStack setup doesn't just allow us to give instances permissions. Gotta have a keystone user and distribute its credentials [17:28:39] thcipriani: ok, i don't see any patches on the calendar, but if i can, i'd like to get a revert in that would fix one train blocker [17:28:50] so I guess shove "if $realm" around this for the gerrit case, which is hopefully the only one for now, and open a ticket about dealing with it better [17:29:02] (03CR) 10GTirloni: [C: 032] labs hieradata: Rm contintcloud project common.yaml file [puppet] - 10https://gerrit.wikimedia.org/r/476298 (https://phabricator.wikimedia.org/T209644) (owner: 10Alex Monk) [17:29:06] mobrovac: yeah, sure! 
[17:29:11] (03PS2) 10GTirloni: labs hieradata: Rm contintcloud project common.yaml file [puppet] - 10https://gerrit.wikimedia.org/r/476298 (https://phabricator.wikimedia.org/T209644) (owner: 10Alex Monk) [17:29:37] If I were writing a certcentral DNS handler to run in AWS against route53 it'd be much easier :/ [17:30:21] Actually until relatively recently, labs instances couldn't really talk to most OpenStack APIs [17:30:28] (03PS9) 10CRusnov: Make the puppetdb backend process primitive types for queries [software/cumin] - 10https://gerrit.wikimedia.org/r/474087 (https://phabricator.wikimedia.org/T207037) [17:30:44] right now they can log in with the guest 'novaobserver' credentials and read most stuff, plus deployment-prep has this special service user which can write DNS records [17:32:50] (03PS2) 10Vgutierrez: gerrit: Use the certcentral managed TLS certificate [puppet] - 10https://gerrit.wikimedia.org/r/476301 (https://phabricator.wikimedia.org/T207050) [17:32:52] (03PS1) 10Vgutierrez: gerrit: Avoid certcentral::cert interfering with labs instances [puppet] - 10https://gerrit.wikimedia.org/r/476315 (https://phabricator.wikimedia.org/T207050) [17:33:35] (03CR) 10Smalyshev: Disable classic_entity wbsearchentities AB test (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476303 (https://phabricator.wikimedia.org/T209402) (owner: 10Lucas Werkmeister (WMDE)) [17:33:42] For now I see a few things to do [17:33:50] (03PS1) 10Paladox: Gerrit: Add relm check for certcentral [puppet] - 10https://gerrit.wikimedia.org/r/476316 [17:34:02] * Shove realm branching around the traditional LE puppetisation [17:34:38] (03PS2) 10Paladox: Gerrit: Add relm check for certcentral [puppet] - 10https://gerrit.wikimedia.org/r/476316 [17:34:40] * Figure out how to make it work with certcentral and http-01, likely involving more realm branching and stuff [17:34:51] paladox: I think we're colliding.. 
:) [17:34:53] * I need to get a move on with T206922 [17:34:54] T206922: Write designate integration script for certcentral DNS challenges - https://phabricator.wikimedia.org/T206922 [17:34:56] oh [17:35:02] * paladox abandons [17:35:13] thcipriani: I have to leave now, sorry [17:35:13] (03Abandoned) 10Paladox: Gerrit: Add relm check for certcentral [puppet] - 10https://gerrit.wikimedia.org/r/476316 (owner: 10Paladox) [17:35:16] but thanks for everything you did so far [17:35:33] thcipriani: it's https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/EventBus/+/476314/, i added it to the calendar so whenever you have time [17:35:34] and thanks in advance for however far you manage to get with the other change ^^ [17:35:38] Lucas_WMDE: sure, I'm still waiting on the backport of your patch to 1.33.0-wmf.6, will do the check you suggested. [17:35:41] (03CR) 10Vgutierrez: [C: 031] "pcc shows the expected NOOP in gerrit production instances: https://puppet-compiler.wmflabs.org/compiler1002/13764/" [puppet] - 10https://gerrit.wikimedia.org/r/476315 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [17:35:47] (03CR) 10Paladox: [C: 031] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/476315 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [17:35:55] mobrovac: thanks, looking [17:36:02] Another option I guess is delegating out of designate to a gdnsd server and use the dns-01 mechanism in certcentral through that. You'd need help to get that set up.
[17:36:07] (03CR) 10Paladox: [C: 04-1] gerrit: Use the certcentral managed TLS certificate (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/476301 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [17:36:33] (03CR) 10Vgutierrez: [C: 032] gerrit: Avoid certcentral::cert interfering with labs instances [puppet] - 10https://gerrit.wikimedia.org/r/476315 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [17:36:40] (03PS2) 10Vgutierrez: gerrit: Avoid certcentral::cert interfering with labs instances [puppet] - 10https://gerrit.wikimedia.org/r/476315 (https://phabricator.wikimedia.org/T207050) [17:36:51] (you can't self-service create an NS record in designate via the version of horizon we run, that'll be in a future horizon release) [17:36:59] (03PS1) 10Cmjohnson: Adding mgmt dns for cloudvirtan100[1-5} [dns] - 10https://gerrit.wikimedia.org/r/476317 (https://phabricator.wikimedia.org/T207194) [17:37:10] vgutierrez thank you! [17:37:14] paladox: so sorry about the mess with labs, TBH I wasn't aware of that gerrit instance [17:37:26] no worries! :) [17:37:58] ((but you might be able to achieve it with a dns-editing service user, you might be able to get cloud services to do it, you might be able to wait for Wikimedia to get Horizon updated to Stein in probably a few years :P)) [17:38:48] merged.. it should be OK in your side as well (hopefully) [17:39:20] yep works again :) [17:39:26] <3 [17:40:03] (03CR) 10RobH: [C: 031] Adding mgmt dns for cloudvirtan100[1-5} [dns] - 10https://gerrit.wikimedia.org/r/476317 (https://phabricator.wikimedia.org/T207194) (owner: 10Cmjohnson) [17:40:28] (03CR) 10Cmjohnson: [C: 032] Adding mgmt dns for cloudvirtan100[1-5} [dns] - 10https://gerrit.wikimedia.org/r/476317 (https://phabricator.wikimedia.org/T207194) (owner: 10Cmjohnson) [17:41:06] thcipriani: do we need to hit "submit" on there? 
jenkins V+2'ed but the patch hasn't been submitted [17:44:09] mobrovac: jenkins voted test+2, still running gate and submit though [17:44:19] ah kk [17:44:30] although it seems to have a failing test in gate-and-submit :( [17:44:44] https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php72-docker/2828/console [17:45:43] thcipriani: i've been seeing this lately, but this is hardly connected to this patch, as this one just removes a couple of lines of code [17:45:46] looks like: https://phabricator.wikimedia.org/T203506 [17:45:49] no vendoring involved [17:45:49] yeah [17:46:03] thcipriani: please wikidata not back to .6 (relay from lucas) [17:46:16] thcipriani: we need to sort out an announcement first [17:46:40] Lydia_WMDE: ok, so once I get that patch out, don't put wikidata back on wmf.6? [17:47:03] thcipriani: yeah until tomorrow and then i can sort it out with hashar [17:47:10] Lydia_WMDE: sounds good. [17:47:13] :) [17:47:20] mobrovac: I'll just submit it once this wikibase merges (soooo close.) [17:47:26] hehe [17:47:46] gr8 thnx [17:51:28] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: rack/setup/install dbstore100[3-5].eqiad.wmnet - https://phabricator.wikimedia.org/T209620 (10RobH) [17:52:03] 10Operations, 10Analytics, 10User-Elukey: rack/setup/install dbstore100[3-5].eqiad.wmnet - https://phabricator.wikimedia.org/T209620 (10RobH) a:05Cmjohnson>03elukey These are now ready for @elukey to take over. Assigning this task to him. You can resolve this as you see fit! [17:52:33] 10Operations, 10Analytics, 10User-Elukey: rack/setup/install dbstore100[3-5].eqiad.wmnet - https://phabricator.wikimedia.org/T209620 (10elukey) 05Open>03Resolved Thanks a lot! We are going to follow up in https://phabricator.wikimedia.org/T210478 [17:55:43] vgutierrez: thanks for the fix! 
btw, i checked on the other ones that had the change, dumps = not using a cert in cloud, archiva = not in cloud, pretty sure, icinga = there is shinken and icinga2 in cloud, but not icinga 1.x (as of now). [17:56:00] so i think they are just fine as is [17:56:12] any other collision taking into account our roadmap? [17:56:32] phabricator should have the fix [17:56:36] like gerrit did [17:56:41] the one listed in T207050 I mean [17:56:41] T207050: Migrate most standard public TLS certificates to CertCentral issuance - https://phabricator.wikimedia.org/T207050 [17:56:44] shinken wouldn't be affected by this AFAIK, it sits behind the usual central proxy [17:56:53] mobrovac: for reference I am just waiting for https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-hhvm-docker/26171/console to finish...very patiently waiting. [17:56:55] icinga2 is likely unpuppetised or at least not using prod puppet manifests [17:57:06] hehe thcipriani, oki [17:57:15] * mobrovac jumps on the waiting wagon [17:57:30] phabricator sits behind cache-misc in prod, and probably sits behind the central labs proxy when being tested within labs [17:57:36] gerrit / gerrit-slave = need cloud fix, lists = not in labs afaict, never saw it [17:57:37] (right?) [17:57:52] mirrors = also pretty sure not needed, cc: apergos [17:57:57] mx = no clue [17:58:06] I think there historically was mailman running inside labs ages ago, doubt it still is [17:58:29] though you could argue that somebody might want to start installing it [17:58:33] for example to test changes for an upgrade [17:59:03] i mean.. theoretically.. 
all of them would not suffer from having a staging env [18:00:11] theoretically [18:00:59] librenms/netbox/tendril: i would have actually used them in the past when i was working on stuff, but then also delete them again [18:01:37] it's a nice thing that somebody might notice in the future when they apply those roles..that are not currently used [18:02:23] but it can also be fixed "on demand" if actually needed [18:03:18] there is https://tools.wmflabs.org/openstack-browser/puppetclass/ to check which roles are used [18:04:58] includes role::mx https://tools.wmflabs.org/openstack-browser/puppetclass/role::mail::mx [18:06:47] (03CR) 10GTirloni: "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/476298 (https://phabricator.wikimedia.org/T209644) (owner: 10Alex Monk) [18:08:13] 10Operations, 10Core Platform Team Backlog (Watching / External): Create email alias for CPT Leads - https://phabricator.wikimedia.org/T210624 (10mobrovac) 05Open>03Resolved Created and works. Thank you @Joe [18:12:05] (03CR) 10Dzahn: [C: 031] "anytime. looks good and not actively working on icinga anymore." [puppet] - 10https://gerrit.wikimedia.org/r/461498 (https://phabricator.wikimedia.org/T83992) (owner: 10Ayounsi) [18:12:48] XioNoX: want me to merge now? [18:13:40] mutante: discovered a new issue, need to solve https://phabricator.wikimedia.org/T209989 first [18:14:20] oh, ok [18:14:30] mutante: but I can probably move forward on https://gerrit.wikimedia.org/r/c/operations/puppet/+/458850 :) [18:15:00] I have 5% battery life on my laptop, will have a look later today [18:16:35] mobrovac: yay, merged. 
OK your eventbus change is on mwdebug1002, check please [18:17:03] (03CR) 10Dzahn: [C: 031] Icinga: add check_vcp (part 1) [puppet] - 10https://gerrit.wikimedia.org/r/458850 (https://phabricator.wikimedia.org/T201097) (owner: 10Ayounsi) [18:17:22] XioNoX: yea, looks good to me as well:) [18:20:11] (03CR) 10Dzahn: [C: 031] "https://puppet-compiler.wmflabs.org/compiler1002/13765/icinga1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/458850 (https://phabricator.wikimedia.org/T201097) (owner: 10Ayounsi) [18:22:27] thcipriani: can't be checked [18:22:32] it's for the job runners [18:22:37] also, this is a train blocker ticket [18:23:22] mobrovac: okie doke, I'll sync it out [18:23:27] thnx [18:23:40] 10Operations, 10ops-codfw: rack/setup/install elastic203[7-9], elastic204[0-9], elastic205[0-4] - https://phabricator.wikimedia.org/T210450 (10Papaul) [18:26:08] !log thcipriani@deploy1001 Synchronized php-1.33.0-wmf.6/extensions/EventBus/includes/EventBus.php: SWAT: [[gerrit:476314|Revert "Revert "Revert "Set event datetime with microsecond resolution."""]] T210608 (duration: 00m 55s) [18:26:12] ^ mobrovac live now [18:26:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:13] T210608: EventBus::createEvent Call to a member function format() on a non-object (boolean) - https://phabricator.wikimedia.org/T210608 [18:26:16] thnx thcipriani! 
[18:26:21] yw :) [18:28:21] (03PS3) 10Daimona Eaytoy: Clarify docs for AbuseFilter emergency threshold [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475337 [18:30:31] (03PS1) 10Dzahn: add maintenance.eqiad CNAME, point to mwmaint1002 [dns] - 10https://gerrit.wikimedia.org/r/476330 [18:31:38] (03PS2) 10Dzahn: add maintenance.eqiad CNAME, point to mwmaint1002 [dns] - 10https://gerrit.wikimedia.org/r/476330 [18:33:10] !log thcipriani@deploy1001 Synchronized php-1.33.0-wmf.6/extensions/Wikibase/lib/includes/Formatters/CachingKartographerEmbeddingHandler.php: [[gerrit:476312|Never return null in CachingKartographerEmbeddingHandler::getParserOutput]] T210617 (duration: 00m 53s) [18:33:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:13] T210617: BadMethodCallException on Wikidata item pages containing coordinates with non-Earth globes - https://phabricator.wikimedia.org/T210617 [18:33:34] (03CR) 10Dzahn: "if you are a user of mwmaint* servers, please comment :)" [dns] - 10https://gerrit.wikimedia.org/r/476330 (owner: 10Dzahn) [18:35:50] okay to deploy this? https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ORES/+/476334 [18:35:53] UBN task [18:35:57] thcipriani: ^ [18:36:12] Amir1: yes please! [18:36:47] (03PS1) 10Dzahn: lower TTL of people.wikimedia.org to 5M [dns] - 10https://gerrit.wikimedia.org/r/476335 (https://phabricator.wikimedia.org/T210036) [18:41:14] (03CR) 10Dzahn: [C: 032] lower TTL of people.wikimedia.org to 5M [dns] - 10https://gerrit.wikimedia.org/r/476335 (https://phabricator.wikimedia.org/T210036) (owner: 10Dzahn) [18:41:53] (03CR) 10Hashar: "cobalt now has a file /var/log/gerrit/plugin_log" [puppet] - 10https://gerrit.wikimedia.org/r/475226 (owner: 10Hashar) [18:42:36] (03CR) 10Dzahn: [C: 032] "oh yea, i forgot to comment that. restarted gerrit and confirmed that file existed but it was still empty earlier." 
[puppet] - 10https://gerrit.wikimedia.org/r/475226 (owner: 10Hashar) [18:45:42] (03PS2) 10Dzahn: shinken: Don't try to exec shinkengen until shinken is installed [puppet] - 10https://gerrit.wikimedia.org/r/461962 (owner: 10Alex Monk) [18:46:01] (03CR) 10Dzahn: [C: 031] "looks like the dependency was merged. rebasing! can merge now" [puppet] - 10https://gerrit.wikimedia.org/r/461962 (owner: 10Alex Monk) [18:47:37] (03CR) 10Dzahn: [C: 032] shinken: Don't try to exec shinkengen until shinken is installed [puppet] - 10https://gerrit.wikimedia.org/r/461962 (owner: 10Alex Monk) [18:48:25] (03CR) 10Dzahn: [C: 032] "added gtirloni because of https://phabricator.wikimedia.org/T204562#4689957 fyi,, this was pending for quite some time. merged" [puppet] - 10https://gerrit.wikimedia.org/r/461962 (owner: 10Alex Monk) [18:52:13] (03PS5) 10Dzahn: swift: Fix checks on drive/filesystem titles to allow for labs ones [puppet] - 10https://gerrit.wikimedia.org/r/402758 (https://phabricator.wikimedia.org/T184236) (owner: 10Alex Monk) [18:52:52] (03PS1) 10Hashar: gerrit: log reviewers-by-blame plugin at DEBUG level [puppet] - 10https://gerrit.wikimedia.org/r/476343 (https://phabricator.wikimedia.org/T101131) [18:54:24] (03PS2) 10Hashar: gerrit: log reviewers-by-blame plugin at DEBUG level [puppet] - 10https://gerrit.wikimedia.org/r/476343 (https://phabricator.wikimedia.org/T101131) [18:55:05] (03CR) 10Hashar: "PS2 dropped module/cdh bump" [puppet] - 10https://gerrit.wikimedia.org/r/476343 (https://phabricator.wikimedia.org/T101131) (owner: 10Hashar) [18:55:33] (03CR) 10GTirloni: [C: 031] shinken: Don't try to exec shinkengen until shinken is installed [puppet] - 10https://gerrit.wikimedia.org/r/461962 (owner: 10Alex Monk) [18:56:01] (03CR) 10Mobrovac: [C: 031] labs hieradata: Rm restbase project common.yaml file [puppet] - 10https://gerrit.wikimedia.org/r/476302 (owner: 10Alex Monk) [18:56:12] (03CR) 10Dzahn: [C: 032] "thanks for confirming :) i was about to check -cloud-feed to make 
sure" [puppet] - 10https://gerrit.wikimedia.org/r/461962 (owner: 10Alex Monk) [18:56:39] (03PS3) 10GTirloni: openstack: Move Keystone DB credentials to my.cnf file [puppet] - 10https://gerrit.wikimedia.org/r/476109 (https://phabricator.wikimedia.org/T210404) [18:58:03] !log ladsgroup@deploy1001 Synchronized php-1.33.0-wmf.6/extensions/ORES/includes/Hooks/ApiHooksHandler.php: [[gerrit:476334|Don't try to add scores in API where there is nothing to add (T210610)]] (duration: 00m 55s) [18:58:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:07] T210610: PHP Fatal Error: Argument 2 passed to ORES\Hooks\ApiHooksHandler::addScoresForAPI() must be an instance of array, null given - https://phabricator.wikimedia.org/T210610 [18:58:12] (03Abandoned) 10Hashar: gerrit: log reviewers-by-blame plugin at DEBUG level [puppet] - 10https://gerrit.wikimedia.org/r/476343 (https://phabricator.wikimedia.org/T101131) (owner: 10Hashar) [18:59:04] (03PS6) 10Dzahn: swift: Fix checks on drive/filesystem titles to allow for labs ones [puppet] - 10https://gerrit.wikimedia.org/r/402758 (https://phabricator.wikimedia.org/T184236) (owner: 10Alex Monk) [19:00:51] (03CR) 10Dzahn: "Krenair: good?" 
[puppet] - 10https://gerrit.wikimedia.org/r/402758 (https://phabricator.wikimedia.org/T184236) (owner: 10Alex Monk) [19:04:48] (03PS4) 10GTirloni: openstack: Move Keystone DB credentials to my.cnf file [puppet] - 10https://gerrit.wikimedia.org/r/476109 (https://phabricator.wikimedia.org/T210404) [19:05:41] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:08:04] thcipriani: The ORES errors are gone now but there is a huge number of ErrorException entries from line 1159 of /srv/mediawiki/php-1.33.0-wmf.4/includes/Message.php: PHP Warning: Invalid argument supplied for foreach() [19:08:16] wmf.4 but it's 160k in the last 15 minutes [19:08:23] https://logstash.wikimedia.org/goto/07ff8552be0a2beb7ba1245063dd8349 [19:09:12] Amir1: that one has a task and folks are looking at it [19:09:14] * thcipriani digs [19:10:06] https://phabricator.wikimedia.org/T210499 was merged into https://phabricator.wikimedia.org/T210528 [19:10:16] (03CR) 10Dzahn: [C: 031] swift: Fix checks on drive/filesystem titles to allow for labs ones [puppet] - 10https://gerrit.wikimedia.org/r/402758 (https://phabricator.wikimedia.org/T184236) (owner: 10Alex Monk) [19:12:23] PROBLEM - Apache HTTP on mw1323 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [19:12:47] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:13:21] 10Operations, 10monitoring, 10Patch-For-Review: Icinga downtime script should fail on the passive hosts - https://phabricator.wikimedia.org/T210380 (10Volans) @Dzahn thanks for all the fixes! [19:13:33] RECOVERY - Apache HTTP on mw1323 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.043 second response time [19:13:54] (03CR) 10Dzahn: "i'll second what Volans already said.
please use require_package('python3-requests') to install the package, and then you can remove the a" [puppet] - 10https://gerrit.wikimedia.org/r/475579 (https://phabricator.wikimedia.org/T210312) (owner: 10Legoktm) [19:15:23] Amir1: Did you find out anything re. WBMI on Beta? [19:17:10] James_F: sorry, not yet. Dealing with a UBN task [19:20:35] 10Operations, 10Product-Analytics: Upload shiny-server .deb to our Stretch apt repository - https://phabricator.wikimedia.org/T168967 (10debt) @aborrero I believe that @mpopov is still working on this issue, and the #discovery-search team is well aware of the timeline, we're concerned that if we don't get this... [19:21:48] (03PS1) 10GTirloni: openstack: Move spread_check password to eqiad1 [labs/private] - 10https://gerrit.wikimedia.org/r/476360 (https://phabricator.wikimedia.org/T210595) [19:23:38] (03PS2) 10GTirloni: openstack: Move spread_check password to eqiad1 [labs/private] - 10https://gerrit.wikimedia.org/r/476360 (https://phabricator.wikimedia.org/T210595) [19:24:05] James_F: is there a ticket? [19:24:37] Amir1: Not yet. Should I just revert your config patch and re-open the federation one? [19:25:42] James_F: no, I'm making the fix [19:25:47] Kk. [19:26:20] James_F: https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/WikibaseMediaInfo/+/476366 [19:28:27] Amir1: OK, but why is it throwing the exception in the first place? :-) [19:29:35] We'll get to that :D [19:29:42] * James_F grins. [19:30:43] James_F: oh of course, it's due to MCR, it checks if the ns is int which is not, it's 6/mediainfo [19:30:58] !log start goreplay logging of port 9200 across eqiad elastic cluster to track down T208248 [19:31:01] We need to completely remove that part :D [19:31:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:02] T208248: Intermittent json parse failures in comp suggest - https://phabricator.wikimedia.org/T208248 [19:31:35] Amir1: The bit of WBMI? 
[19:31:54] (03PS1) 10Bmansurov: Enable reader trust survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476368 (https://phabricator.wikimedia.org/T209882) [19:32:14] James_F: the check for namespace being int, because that assumption doesn't hold true anymore [19:32:37] What's even calling getMediaInfoIdLookup? [19:33:07] content-handler-factory-callback for MissingMediaInfoHandler [19:33:20] So… why is this only getting called now, and wasn't beforehand? [19:34:14] * James_F shrugs. [19:34:17] Quick patch anyway. [19:35:38] James_F: https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/WikibaseMediaInfo/+/476369 [19:35:57] I hope we don't have tests for that [19:36:19] (03PS1) 10Bmansurov: Disable reader trust survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476370 (https://phabricator.wikimedia.org/T209882) [19:36:45] also we might use some sort of ns validity check anyway, Daniel probably knows better [19:36:48] I'm done for the day [19:36:57] (03PS5) 10Dzahn: create profile::research::article_recommender [puppet] - 10https://gerrit.wikimedia.org/r/476098 (https://phabricator.wikimedia.org/T208622) [19:38:06] (03CR) 10jerkins-bot: [V: 04-1] create profile::research::article_recommender [puppet] - 10https://gerrit.wikimedia.org/r/476098 (https://phabricator.wikimedia.org/T208622) (owner: 10Dzahn) [19:39:30] (03PS5) 10GTirloni: openstack: Move Keystone DB credentials to my.cnf file [puppet] - 10https://gerrit.wikimedia.org/r/476109 (https://phabricator.wikimedia.org/T210404) [19:40:37] !log rebooting logstash1004 to pick up security updates [19:40:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:10] when you get "Whoops! It looks like puppet-lint has encountered an error that it doesn't know how to handle. Please open an issue at https://github.com/rodjek/puppet-lint" .. 
that's like a bonus , double whammy -1 [19:47:28] (03PS15) 10DCausse: [cirrus] Add temp clusters but still write to the old ones [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475750 (https://phabricator.wikimedia.org/T210381) [19:47:30] (03PS5) 10DCausse: [cirrus] Start writing to psi & omega [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476271 (https://phabricator.wikimedia.org/T210381) [19:47:32] (03PS5) 10DCausse: [cirrus] Start using replica group settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476272 (https://phabricator.wikimedia.org/T210381) [19:47:34] (03PS5) 10DCausse: [cirrus] Cleanup transitional states [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381) [19:48:29] mutante: lol [19:48:45] I’ll start chanting “no whammies” when pushing patches now [19:48:54] hahaa [19:51:39] (03PS6) 10Dzahn: create profile::research::article_recommender [puppet] - 10https://gerrit.wikimedia.org/r/476098 (https://phabricator.wikimedia.org/T208622) [19:52:15] (03CR) 10jerkins-bot: [V: 04-1] create profile::research::article_recommender [puppet] - 10https://gerrit.wikimedia.org/r/476098 (https://phabricator.wikimedia.org/T208622) (owner: 10Dzahn) [20:00:04] Deploy window MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181128T2000) [20:04:56] (03PS7) 10Dzahn: create profile::research::article_recommender [puppet] - 10https://gerrit.wikimedia.org/r/476098 (https://phabricator.wikimedia.org/T208622) [20:05:28] (03CR) 10jerkins-bot: [V: 04-1] create profile::research::article_recommender [puppet] - 10https://gerrit.wikimedia.org/r/476098 (https://phabricator.wikimedia.org/T208622) (owner: 10Dzahn) [20:05:40] (03PS8) 10Dzahn: create profile::research::article_recommender [puppet] - 10https://gerrit.wikimedia.org/r/476098 (https://phabricator.wikimedia.org/T208622) [20:06:58] (03CR) 10jerkins-bot: [V: 04-1] create 
profile::research::article_recommender [puppet] - 10https://gerrit.wikimedia.org/r/476098 (https://phabricator.wikimedia.org/T208622) (owner: 10Dzahn) [20:08:12] (03PS9) 10Dzahn: create profile::research::article_recommender [puppet] - 10https://gerrit.wikimedia.org/r/476098 (https://phabricator.wikimedia.org/T208622) [20:09:21] (03CR) 10jerkins-bot: [V: 04-1] create profile::research::article_recommender [puppet] - 10https://gerrit.wikimedia.org/r/476098 (https://phabricator.wikimedia.org/T208622) (owner: 10Dzahn) [20:09:33] (03CR) 10DCausse: [cirrus] Start using replica group settings (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476272 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [20:11:15] (03PS10) 10Dzahn: create profile::research::article_recommender [puppet] - 10https://gerrit.wikimedia.org/r/476098 (https://phabricator.wikimedia.org/T208622) [20:13:03] 10Operations, 10Puppet, 10Beta-Cluster-Infrastructure, 10media-storage, 10Patch-For-Review: Puppet broken on deployment-ms-be0[34] with evaluation error in swift module - https://phabricator.wikimedia.org/T184236 (10Dzahn) rebased, amended. is it like the cherry-pick though? https://gerrit.wikimedia.org... [20:16:33] 10Operations: Audit "misc" cluster hosts - https://phabricator.wikimedia.org/T210486 (10colewhite) In the Foundations meeting, we considered removing "misc" as the default cluster and have puppet fail if there is no cluster set. [20:22:32] !log neodymium: mv /srv/jnt{,.old} (use cumin1001 instead!) [20:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:35] XioNoX: ^ [20:23:26] paravoid: rgr! [20:26:32] 10Operations, 10DBA, 10StructuredDiscussions, 10Growth-Team (Current Sprint), and 2 others: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610 (10Catrope) @Banyek That sounds right to me. The migration script already exists and was used successfully in... 
[20:36:09] 10Operations, 10Wikimedia-Mailing-lists: Post hold because of "invalid headers" in wikimediacz-l - https://phabricator.wikimedia.org/T210223 (10Urbanecm) @Blahma sent two messages, both were stopped because "posting to moderated list", not because "invalid header". Looks like the flag is being considered befor... [20:41:54] !log rebooting logstash1005 for security updates [20:41:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:34] (03PS2) 10Herron: logstash: ship prometheus logs to ELK [puppet] - 10https://gerrit.wikimedia.org/r/475879 (https://phabricator.wikimedia.org/T210455) [20:55:58] (03CR) 10jerkins-bot: [V: 04-1] logstash: ship prometheus logs to ELK [puppet] - 10https://gerrit.wikimedia.org/r/475879 (https://phabricator.wikimedia.org/T210455) (owner: 10Herron) [20:58:34] (03PS3) 10Herron: logstash: ship prometheus logs to ELK [puppet] - 10https://gerrit.wikimedia.org/r/475879 (https://phabricator.wikimedia.org/T210455) [21:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Parsoid / Citoid / Mobileapps / ORES / …. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181128T2100). 
[21:00:24] (03CR) 10Herron: [C: 032] logstash: ship prometheus logs to ELK (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/475879 (https://phabricator.wikimedia.org/T210455) (owner: 10Herron) [21:00:31] (03PS4) 10Herron: logstash: ship prometheus logs to ELK [puppet] - 10https://gerrit.wikimedia.org/r/475879 (https://phabricator.wikimedia.org/T210455) [21:01:33] (03CR) 10Dzahn: [C: 032] "yep, no project "restbase" at https://tools.wmflabs.org/openstack-browser/project/" [puppet] - 10https://gerrit.wikimedia.org/r/476302 (owner: 10Alex Monk) [21:01:42] (03PS2) 10Dzahn: labs hieradata: Rm restbase project common.yaml file [puppet] - 10https://gerrit.wikimedia.org/r/476302 (owner: 10Alex Monk) [21:06:56] !log arlolra@deploy1001 Started deploy [parsoid/deploy@9ed8c47]: Updating Parsoid to 18a98af [21:06:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:01] (03PS3) 10Dzahn: labs hieradata: Rm restbase project common.yaml file [puppet] - 10https://gerrit.wikimedia.org/r/476302 (owner: 10Alex Monk) [21:08:30] 10Operations, 10WMF-Legal, 10Software-Licensing: Non-free software installed on stat1004 outside of puppet - https://phabricator.wikimedia.org/T210667 (10Legoktm) p:05Triage>03Unbreak! [21:10:27] 10Operations, 10Discovery-Search, 10Elasticsearch, 10monitoring: Elasticsearch health check for shards icinga check shows OK status when cluster health is yellow - https://phabricator.wikimedia.org/T210668 (10herron) [21:11:07] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received [21:11:22] (03CR) 10Dzahn: [C: 031] "nowadays needs manual rebase. should still be done, right?" 
[puppet] - 10https://gerrit.wikimedia.org/r/376024 (owner: 10Giuseppe Lavagetto) [21:13:18] (03PS1) 10Jforrester: Revert "labs: Add mediainfo to federation config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476391 (https://phabricator.wikimedia.org/T204748) [21:13:49] (03PS3) 10Dzahn: profile::mediawiki::jobrunner: restrict firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/376024 (owner: 10Giuseppe Lavagetto) [21:13:56] 10Operations, 10WMF-Legal, 10Software-Licensing: Non-free software installed on stat1004 outside of puppet - https://phabricator.wikimedia.org/T210667 (10chasemp) p:05Unbreak!>03Normal I am under the impression anything in Debian main is ok to install in prod, but this is based on adhoc conversations dur... [21:14:10] 10Operations, 10WMF-Legal, 10Software-Licensing: Can exfat be used in WMF production? - https://phabricator.wikimedia.org/T210667 (10chasemp) [21:14:48] 10Operations, 10Analytics, 10WMF-Legal, 10Software-Licensing: Can exfat be used in WMF production? 
- https://phabricator.wikimedia.org/T210667 (10chasemp) [21:14:51] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-herron: Onboard at least 10 new non-sensitive log producers to the logging pipeline - https://phabricator.wikimedia.org/T205852 (10herron) [21:15:13] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-herron: Onboard at least 10 new non-sensitive log producers to the logging pipeline - https://phabricator.wikimedia.org/T205852 (10herron) [21:15:13] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:15:15] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: Ship prometheus logs to ELK - https://phabricator.wikimedia.org/T210455 (10herron) 05Open>03Resolved a:03herron Prometheus syslogs are now flowing into logstash [21:15:37] PROBLEM - puppet last run on proton1002 is CRITICAL: connect to address 10.64.32.61 port 5666: Connection refused [21:15:44] Deploy clear? Can I sling out a Beta-Cluster-only patch? 
[21:15:45] PROBLEM - proton endpoints health on proton1002 is CRITICAL: connect to address 10.64.32.61 port 5666: Connection refused [21:16:31] !log arlolra@deploy1001 Finished deploy [parsoid/deploy@9ed8c47]: Updating Parsoid to 18a98af (duration: 09m 35s) [21:16:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:07] PROBLEM - configured eth on proton1002 is CRITICAL: connect to address 10.64.32.61 port 5666: Connection refused [21:17:07] PROBLEM - dhclient process on proton1002 is CRITICAL: connect to address 10.64.32.61 port 5666: Connection refused [21:17:15] PROBLEM - Disk space on proton1002 is CRITICAL: connect to address 10.64.32.61 port 5666: Connection refused [21:17:25] PROBLEM - Check whether ferm is active by checking the default input chain on proton1002 is CRITICAL: connect to address 10.64.32.61 port 5666: Connection refused [21:17:33] PROBLEM - Check size of conntrack table on proton1002 is CRITICAL: connect to address 10.64.32.61 port 5666: Connection refused [21:17:49] PROBLEM - Check systemd state on proton1002 is CRITICAL: connect to address 10.64.32.61 port 5666: Connection refused [21:17:59] PROBLEM - DPKG on proton1002 is CRITICAL: connect to address 10.64.32.61 port 5666: Connection refused [21:18:32] looks like nagios-nrpe-server is in failed state on proton1002 [21:18:41] 10Operations, 10Analytics, 10WMF-Legal, 10Software-Licensing: Can exfat be used in WMF production? - https://phabricator.wikimedia.org/T210667 (10chasemp) [21:18:49] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:18:58] proton1002 systemd[1]: nagios-nrpe-server.service: Main process exited, code=exited, status=2/INVALIDARGUMENT [21:19:27] RECOVERY - configured eth on proton1002 is OK: OK - interfaces up [21:19:29] RECOVERY - dhclient process on proton1002 is OK: PROCS OK: 0 processes with command name dhclient [21:19:32] !log restarted nagios-nrpe-server on proton1002 [21:19:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:37] RECOVERY - Disk space on proton1002 is OK: DISK OK [21:19:47] RECOVERY - Check whether ferm is active by checking the default input chain on proton1002 is OK: OK ferm input default policy is set [21:19:55] RECOVERY - Check size of conntrack table on proton1002 is OK: OK: nf_conntrack is 0 % full [21:19:56] (03PS1) 10Cwhite: hiera: add cluster definition to recursor role [puppet] - 10https://gerrit.wikimedia.org/r/476393 (https://phabricator.wikimedia.org/T210486) [21:20:09] RECOVERY - Check systemd state on proton1002 is OK: OK - running: The system is fully operational [21:21:31] PROBLEM - DPKG on proton1002 is CRITICAL: connect to address 10.64.32.61 port 5666: Connection refused [21:21:40] 10Operations, 10Analytics, 10WMF-Legal, 10Software-Licensing: Can exfat be used in WMF production? - https://phabricator.wikimedia.org/T210667 (10chasemp) @fgiunchedi I need to sync up with you here for other reasons, but if you could take a look at this that would be great [21:21:56] (03CR) 10Cwhite: "This definition may be the wrong place." 
[puppet] - 10https://gerrit.wikimedia.org/r/476393 (https://phabricator.wikimedia.org/T210486) (owner: 10Cwhite) [21:23:01] PROBLEM - configured eth on proton1002 is CRITICAL: connect to address 10.64.32.61 port 5666: Connection refused [21:23:01] PROBLEM - dhclient process on proton1002 is CRITICAL: connect to address 10.64.32.61 port 5666: Connection refused [21:23:09] PROBLEM - Disk space on proton1002 is CRITICAL: connect to address 10.64.32.61 port 5666: Connection refused [21:23:21] PROBLEM - Check whether ferm is active by checking the default input chain on proton1002 is CRITICAL: connect to address 10.64.32.61 port 5666: Connection refused [21:23:29] PROBLEM - Check size of conntrack table on proton1002 is CRITICAL: connect to address 10.64.32.61 port 5666: Connection refused [21:23:43] PROBLEM - Check systemd state on proton1002 is CRITICAL: connect to address 10.64.32.61 port 5666: Connection refused [21:23:50] (03PS1) 10Cwhite: hiera: add cluster definition to spare role [puppet] - 10https://gerrit.wikimedia.org/r/476396 (https://phabricator.wikimedia.org/T210486) [21:24:06] the proton1002 issues -> that might be me, I think I killed that server when testing :( [21:24:53] herron: raynor: it's swapping a lot, ^ let's just reboot it [21:24:54] I could use a mw dev (legoktm? Reedy?) to help me look at a fairly urgent wikitech issue: T210669 [21:24:55] T210669: Wikitech fails to enable 2fa - https://phabricator.wikimedia.org/T210669 [21:24:55] (03CR) 10Jforrester: [C: 032] "Beta-Cluster-only." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/476391 (https://phabricator.wikimedia.org/T204748) (owner: 10Jforrester) [21:25:31] mutante, sure, do it please [21:26:09] (03Merged) 10jenkins-bot: Revert "labs: Add mediainfo to federation config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476391 (https://phabricator.wikimedia.org/T204748) (owner: 10Jforrester) [21:26:14] 10Operations, 10Analytics, 10WMF-Legal, 10Software-Licensing: Can exfat be used in WMF production? - https://phabricator.wikimedia.org/T210667 (10MoritzMuehlenhoff) exfat-fuse itself is free software (GPL) and part of Debian main. Debian's approach on patents is written up at https://www.debian.org/reports... [21:26:19] also, mutante do we have any place where I can check the CPU + memory + swap usage for proton[1|2]00[1|2]? [21:26:28] !log rebooting proton1002 [21:26:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:42] 10Operations, 10Analytics, 10Security-Team, 10WMF-Legal, 10Software-Licensing: Can exfat be used in WMF production? 
- https://phabricator.wikimedia.org/T210667 (10chasemp) [21:26:53] RECOVERY - Check whether ferm is active by checking the default input chain on proton1002 is OK: OK ferm input default policy is set [21:27:01] RECOVERY - Check size of conntrack table on proton1002 is OK: OK: nf_conntrack is 0 % full [21:27:06] raynor: it's back already [21:27:17] RECOVERY - Check systemd state on proton1002 is OK: OK - running: The system is fully operational [21:27:27] RECOVERY - DPKG on proton1002 is OK: All packages OK [21:27:36] awesome, thx, for now I'll skip testing, I'll try to recreate a similar issue locally, I didn't expect it to fail [21:27:39] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy [21:27:43] RECOVERY - configured eth on proton1002 is OK: OK - interfaces up [21:27:45] RECOVERY - dhclient process on proton1002 is OK: PROCS OK: 0 processes with command name dhclient [21:27:53] RECOVERY - Disk space on proton1002 is OK: DISK OK [21:28:02] 10Operations, 10Patch-For-Review: Audit "misc" cluster hosts - https://phabricator.wikimedia.org/T210486 (10colewhite) I think it will break things and otherwise be frustrating to fail on no cluster definition unless we could somehow limit it to production only. It could easily be renamed though: https://gith... 
[21:28:23] raynor: go to https://grafana.wikimedia.org/dashboard/db/host-overview?refresh=5m&orgId=1 see the "server" in the upper left corner [21:28:30] !log Updated Parsoid to 18a98af (T209236, T210437, T184755, T187142, T208470, T207286, T206777, T205710, T205546, T204477) [21:28:33] click that and start to type "proton" [21:28:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:50] T184755: Consider not removing multiple blank lines/white space between paragraphs - https://phabricator.wikimedia.org/T184755 [21:28:51] T208470: Parsoid should nowiki-escape '}' in a table cell or insert a whitespace character, as appropriate - https://phabricator.wikimedia.org/T208470 [21:28:51] T206777: Create Wikipedia Shan - https://phabricator.wikimedia.org/T206777 [21:28:52] T207286: Time profiling: Replace millisecond granularity timers with microsecond granularity timers - https://phabricator.wikimedia.org/T207286 [21:28:52] T210437: Sanitizer::stripAllTags shouldn't expand legacy "semicolon-less" HTML5 entities - https://phabricator.wikimedia.org/T210437 [21:28:52] T209236: "&params" URL parameter (used in a link parameter in [[File]] markup) incorrectly parsed as "¶ms" (%C2%B6ms) - https://phabricator.wikimedia.org/T209236 [21:28:53] T205710: Create Wikinews Limburgish - https://phabricator.wikimedia.org/T205710 [21:28:53] T205546: Create Wiktionary Cantonese - https://phabricator.wikimedia.org/T205546 [21:28:54] T204477: Create punjabi.wikimedia.org for Punjabi Wikimedians User Group - https://phabricator.wikimedia.org/T204477 [21:28:58] T187142: Deduplicate template styles in Parsoid - https://phabricator.wikimedia.org/T187142 [21:29:21] 10Operations, 10Discovery-Search, 10Elasticsearch, 10monitoring: Elasticsearch health check for shards icinga check shows OK status when cluster health is yellow - https://phabricator.wikimedia.org/T210668 (10EBernhardson) the problem is yellow cluster status is part of normal operations. 
For example when... [21:29:34] mutante, thanks, got it! [21:31:07] RECOVERY - puppet last run on proton1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [21:31:10] mutante, what's up with proton1001? [21:32:12] proton1002 kept reporting to grafana, the proton1001 just stopped logging stuff at 21:20:32 [21:32:22] raynor: icinga says it's state "unknown" for some things [21:32:33] but since unknown isnt CRIT.. it didnt show up in here [21:32:35] unlike the other one [21:32:57] tries to SSH to it [21:33:20] can't [21:33:47] can you restart it? [21:33:52] let me restart that another way. from ganeti [21:33:57] it's a VM [21:34:23] same as proton1002, also it would be nice to check dmesg to see what happened [21:34:49] yes, but that one i could still normally connect to [21:35:36] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install cloudvirtan100[1-5].eqiad.wmnet - https://phabricator.wikimedia.org/T207194 (10Cmjohnson) [21:35:37] !log gnt-instance reboot proton1001.eqiad (stopped working, no SSH) [21:35:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:04] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install cloudvirtan100[1-5].eqiad.wmnet - https://phabricator.wikimedia.org/T207194 (10Cmjohnson) @robh these are ready for installs, I changed the primary nic to boot from the 10G NIC. The raid was set up exactly like... 
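Note on the proton1002 alerts above: every check failed with "port 5666: Connection refused" while nagios-nrpe-server was in a failed state. A refused TCP connection means the host answered but nothing was listening on that port, which is different from a host that does not answer at all. The sketch below is only an illustration of that distinction, not the check Icinga runs; the function name probe_tcp and the 3 second timeout are assumptions made for the example.

```python
import socket

NRPE_PORT = 5666  # port shown in the "Connection refused" alerts


def probe_tcp(host, port=NRPE_PORT, timeout=3.0):
    """Classify a plain TCP connect attempt (hypothetical helper).

    Returns "open" if something accepted the connection, "refused" if the
    host answered but no service listens on the port, "timeout" if nothing
    answered in time, and "error" for any other socket failure.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError:
        return "error"


if __name__ == "__main__":
    # Probing localhost only; substitute a real host name as needed.
    print(probe_tcp("127.0.0.1"))
```

A "refused" result points at the monitoring daemon rather than the machine itself, which fits a host that still accepted SSH while all of its NRPE based checks went critical.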
[21:37:21] (03CR) 10jenkins-bot: Revert "labs: Add mediainfo to federation config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476391 (https://phabricator.wikimedia.org/T204748) (owner: 10Jforrester) [21:37:34] raynor: on proton1002 it was: Out of memory: Kill process 20847 (chromium) [21:37:59] raynor: proton1001 is back for me [21:39:25] rescheduled monitoring checks, icinga now says all services are OK again [21:39:44] andrewbogott: sorry, I'm in class right now [21:39:54] but why did it stop responding? just out of memory? [21:40:11] it should kill chromium but keep services like ssh alive [21:40:38] out of memory meant the nagios-nrpe-server failed. and that meant all the Icinga alerts started [21:40:59] ssh was still alive on 1002 [21:41:13] it was just all the things that monitoring checks via NRPE [21:42:14] note the " port 5666: Connection refused" in those messages up there [21:42:33] PROBLEM - puppet last run on db1068 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:42:45] port 5666 is the service that accepts icinga to connect to run things locally for monitoring [21:43:14] ok, thanks for the info, I'll leave service testing for now, looks like on high concurrency it just ran out of memory [21:43:16] if it was also SSH then we'd get "host is down" additionally , not just services on the host [21:43:30] and started to ride swap [21:43:44] eh, i should say "if ping fails", SSH is another check [21:43:56] yes, raynor [21:44:07] (03CR) 10Effie Mouzeli: [C: 031] "https://puppet-compiler.wmflabs.org/compiler1002/13769/rdb1009.eqiad.wmnet/ looks valid:)" [puppet] - 10https://gerrit.wikimedia.org/r/476226 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [21:45:42] raynor: regarding 1001.. 
i dont see a reason in syslog [21:48:56] (03CR) 10Dzahn: "did the manual rebase" [puppet] - 10https://gerrit.wikimedia.org/r/376024 (owner: 10Giuseppe Lavagetto) [21:53:43] PROBLEM - puppet last run on db1102 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:55:24] ok, thanks for your help [21:55:35] you're welcome [22:13:41] RECOVERY - puppet last run on db1068 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [22:15:13] (03PS2) 10Herron: rsyslog:input:file add multiline handling and ship gerrit logs to ELK [puppet] - 10https://gerrit.wikimedia.org/r/475840 (https://phabricator.wikimedia.org/T141324) [22:15:37] (03CR) 10jerkins-bot: [V: 04-1] rsyslog:input:file add multiline handling and ship gerrit logs to ELK [puppet] - 10https://gerrit.wikimedia.org/r/475840 (https://phabricator.wikimedia.org/T141324) (owner: 10Herron) [22:22:31] 10Operations, 10Release-Engineering-Team (Backlog): Point keyholder github mirror to gerrit - https://phabricator.wikimedia.org/T210674 (10thcipriani) p:05Triage>03Normal [22:24:49] RECOVERY - puppet last run on db1102 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [22:25:19] 10Operations, 10Release-Engineering-Team (Kanban): Point keyholder github mirror to gerrit - https://phabricator.wikimedia.org/T210674 (10thcipriani) a:03thcipriani [22:25:51] (03PS1) 10Dzahn: peopleweb: allow deployment server to connect to port 80 [puppet] - 10https://gerrit.wikimedia.org/r/476411 (https://phabricator.wikimedia.org/T210036) [22:26:48] (03CR) 10jerkins-bot: [V: 04-1] peopleweb: allow deployment server to connect to port 80 [puppet] - 10https://gerrit.wikimedia.org/r/476411 (https://phabricator.wikimedia.org/T210036) (owner: 10Dzahn) [22:28:51] (03PS2) 10Dzahn: peopleweb: allow deployment server to connect to port 80 [puppet] - 10https://gerrit.wikimedia.org/r/476411 (https://phabricator.wikimedia.org/T210036) [22:29:25] (03CR) 
10jerkins-bot: [V: 04-1] peopleweb: allow deployment server to connect to port 80 [puppet] - 10https://gerrit.wikimedia.org/r/476411 (https://phabricator.wikimedia.org/T210036) (owner: 10Dzahn) [22:29:35] (03CR) 10Dzahn: "gave up on this approach, instead going for opening firewall holes on deployment_server to talk to backend webservers -> https://gerrit.wi" [puppet] - 10https://gerrit.wikimedia.org/r/423557 (owner: 10Dzahn) [22:31:30] (03PS1) 10Bstorm: wiki replicas: depool labsdb1010 for upgrades [puppet] - 10https://gerrit.wikimedia.org/r/476412 (https://phabricator.wikimedia.org/T209517) [22:32:18] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1002/13770/rutherfordium.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/476411 (https://phabricator.wikimedia.org/T210036) (owner: 10Dzahn) [22:36:41] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10bd808) [22:36:43] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1020 - https://phabricator.wikimedia.org/T194855 (10bd808) [22:41:43] (03PS3) 10Dzahn: peopleweb: allow deployment server to connect to port 80 [puppet] - 10https://gerrit.wikimedia.org/r/476411 (https://phabricator.wikimedia.org/T210036) [22:42:28] 10Operations, 10cloud-services-team (Kanban): Reboot WMCS servers for L1TF - https://phabricator.wikimedia.org/T207377 (10Bstorm) [22:43:00] (03PS4) 10Dzahn: peopleweb: allow deployment server to connect to port 80 [puppet] - 10https://gerrit.wikimedia.org/r/476411 (https://phabricator.wikimedia.org/T210036) [22:45:02] (03CR) 10Dzahn: [C: 032] "https://puppet-compiler.wmflabs.org/compiler1002/13771/rutherfordium.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/476411 (https://phabricator.wikimedia.org/T210036) (owner: 10Dzahn) [22:48:34] (03CR) 10Dzahn: [C: 032] "with this it's now possible to test sites on people.wikimedia.org from the deployment-server like so: [deploy1001:~] $ apache-fast-test p" [puppet] - 
10https://gerrit.wikimedia.org/r/476411 (https://phabricator.wikimedia.org/T210036) (owner: 10Dzahn) [22:51:18] 10Operations, 10Analytics, 10Security-Team, 10WMF-Legal, 10Software-Licensing: Can exfat be used in WMF production? - https://phabricator.wikimedia.org/T210667 (10Legoktm) >>! In T210667#4783289, @MoritzMuehlenhoff wrote: > exfat-fuse itself is free software (GPL) and part of Debian main. Debian's approa... [23:16:05] (03PS2) 10Dzahn: cache/trafficserver: replace rutherfordium with people1001, backend and director [puppet] - 10https://gerrit.wikimedia.org/r/475236 (https://phabricator.wikimedia.org/T210036) [23:17:32] bblack: ^ could you possibly check that for me? [23:17:58] i am renaming a director and change the backend but it's only a few lines [23:23:15] !log changing a few passwords for compromised accounts [23:23:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:14] (03PS1) 10Faidon Liambotis: Use os.getgrouplist() for listing a user's groups [software/keyholder] - 10https://gerrit.wikimedia.org/r/476424 (https://phabricator.wikimedia.org/T204681) [23:30:03] I need to swat deploy a couple code changes patches. They are all +2ed but do I need them to be merged before SWATTING? [23:30:47] PROBLEM - Backup of s2 in eqiad on db1115 is CRITICAL: Backup for s2 at eqiad taken more than 8 days ago: Most recent backup 2018-11-20 23:04:07 [23:38:06] dmaza: something seems pretty stuck in zuul's gate-and-submit queue (where merges happen) [23:38:36] there are 2 patches at the top that have been trying to run tests for well over an hour [23:38:52] they are making progress, but very slowly [23:38:53] ugg.. that might explain why it is taking so long [23:39:14] there are 21 patches stacked up there right now -- https://integration.wikimedia.org/zuul/ [23:41:31] bd808: thank you.. 
I'll schedule the swat for tomorrow then [23:41:40] jouncebot, next [23:41:48] In 0 hour(s) and 18 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181129T0000) [23:41:49] or at least the ones that are not merged :( [23:42:00] jouncebot is asleep on the job too [23:42:21] (03CR) 10Dzahn: "i added you (reviewers) because you are listed on https://tools.wmflabs.org/openstack-browser/project/phragile and i got there from checki" [puppet] - 10https://gerrit.wikimedia.org/r/475032 (owner: 10Dzahn) [23:43:09] (03PS1) 10Faidon Liambotis: Misc pylint fixes [software/keyholder] - 10https://gerrit.wikimedia.org/r/476429 [23:44:03] (03CR) 10jerkins-bot: [V: 04-1] Misc pylint fixes [software/keyholder] - 10https://gerrit.wikimedia.org/r/476429 (owner: 10Faidon Liambotis) [23:44:09] gah [23:44:15] can't we use a newer version of pylint? [23:45:15] hrm [23:45:18] not that [23:45:25] (03CR) 10Dzahn: "doing this because i was searching puppet repo for what still hardcodes php5, to add support for PHP7. 
and per "debian-8.1-jessie (depreca" [puppet] - 10https://gerrit.wikimedia.org/r/475032 (owner: 10Dzahn) [23:50:42] (03CR) 10Aaron Schulz: [C: 031] errorpages: Use service discovery for statsd in hhvm-fatal-error.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467239 (https://phabricator.wikimedia.org/T206963) (owner: 10Krinkle) [23:50:56] (03CR) 10Thcipriani: [C: 032] "Old method took 0.9280140399932861s to find my groups on beta, new method takes 0.00141143798828125s, well done :)" [software/keyholder] - 10https://gerrit.wikimedia.org/r/476424 (https://phabricator.wikimedia.org/T204681) (owner: 10Faidon Liambotis) [23:51:48] (03Merged) 10jenkins-bot: Use os.getgrouplist() for listing a user's groups [software/keyholder] - 10https://gerrit.wikimedia.org/r/476424 (https://phabricator.wikimedia.org/T204681) (owner: 10Faidon Liambotis) [23:52:17] (03PS1) 10Bstorm: sonofgridengine: set up shadow_master profile [puppet] - 10https://gerrit.wikimedia.org/r/476430 (https://phabricator.wikimedia.org/T200557) [23:53:11] (03CR) 10jerkins-bot: [V: 04-1] sonofgridengine: set up shadow_master profile [puppet] - 10https://gerrit.wikimedia.org/r/476430 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [23:53:24] (03PS1) 10Cwhite: role, profile: install, run, and collect icinga exporter metrics [puppet] - 10https://gerrit.wikimedia.org/r/476431 (https://phabricator.wikimedia.org/T208066) [23:54:03] (03CR) 10Cwhite: "Blocked on debian build and deploy." 
[puppet] - 10https://gerrit.wikimedia.org/r/476431 (https://phabricator.wikimedia.org/T208066) (owner: 10Cwhite) [23:55:58] (03PS2) 10Bstorm: sonofgridengine: set up shadow_master profile [puppet] - 10https://gerrit.wikimedia.org/r/476430 (https://phabricator.wikimedia.org/T200557) [23:56:16] (03PS2) 10Dzahn: noc: apache -> httpd module [puppet] - 10https://gerrit.wikimedia.org/r/416751 [23:56:58] (03CR) 10jerkins-bot: [V: 04-1] noc: apache -> httpd module [puppet] - 10https://gerrit.wikimedia.org/r/416751 (owner: 10Dzahn) [23:57:02] (03CR) 10Dzahn: "rebased, there have been changes to the mediawiki module since my comment back in March. it might be easier already" [puppet] - 10https://gerrit.wikimedia.org/r/416751 (owner: 10Dzahn) [23:58:00] (03PS3) 10Dzahn: noc: apache -> httpd module [puppet] - 10https://gerrit.wikimedia.org/r/416751 [23:58:02] (03CR) 10Dzahn: "also dbtree is not in this class anymore" [puppet] - 10https://gerrit.wikimedia.org/r/416751 (owner: 10Dzahn) [23:58:17] (03PS2) 10Faidon Liambotis: Misc pylint fixes [software/keyholder] - 10https://gerrit.wikimedia.org/r/476429 [23:59:15] PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100%
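Note on the keyholder change merged above (gerrit 476424, "Use os.getgrouplist() for listing a user's groups"): the sketch below shows how os.getgrouplist() can be used to list a user's groups with the Python standard library on Unix. It is not the keyholder code; the helper names user_group_ids and user_group_names are made up for this example, and the fallback to a numeric id for groups without a name entry is an assumption.

```python
import grp
import os
import pwd


def user_group_ids(username):
    """Return the numeric group ids for a user (hypothetical helper).

    os.getgrouplist() needs the user's primary gid, looked up via pwd.
    """
    primary_gid = pwd.getpwnam(username).pw_gid
    return os.getgrouplist(username, primary_gid)


def user_group_names(username):
    """Map the user's group ids to names, keeping the id if no name exists."""
    names = []
    for gid in user_group_ids(username):
        try:
            names.append(grp.getgrgid(gid).gr_name)
        except KeyError:
            # Assumption for this sketch: show unnamed groups by number.
            names.append(str(gid))
    return names


if __name__ == "__main__":
    print(user_group_names("root"))
```

The lookup asks the system for one user's groups in a single call instead of walking every group entry, which is one plausible reason for the speedup quoted in the review comment.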