[00:00:00] please [00:00:01] {{sust:Votar bibliotecario1|Lou123456|Lou123456|{{subst:CURRENTDAY}}|{{subst:CURRENTMONTH}}|{{subst:CURRENTYEAR}}|12/07/2016|Varias|Votación abierta|{{subst:CURRENTTIME}}}} [00:01:02] PROBLEM - puppet last run on ms-be2026 is CRITICAL: CRITICAL: Puppet has 1 failures [00:04:38] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [00:06:48] Dereckson, no, I don't see anything caused by this, thanks. [00:07:01] okay sending to prod, thanks for checking [00:07:14] I'm not sure if I'm doing logstash right (I did message:"getCentralAuthToken"), but I also did greps. [00:07:22] ^ bd808 [00:07:23] !log dereckson@tin Synchronized php-1.28.0-wmf.8/extensions/Echo/includes/ForeignWikiRequest.php: getCentralAuthToken visibility back to protected ([[Gerrit:298661]]) (duration: 00m 27s) [00:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:07:28] Here you are. [00:07:52] There are some stacktraces from the other issues I already know about, but not issues about calling something with the wrong accessibility. [00:08:29] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [00:09:13] Dereckson: that should be easy since you already have deployer shell. Open a phab task requesting "nda" in LDAP? [00:10:38] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [00:11:49] Dereckson, looks good, mw1017 off. [00:16:09] RECOVERY - gerrit process on lead is OK: PROCS OK: 1 process with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war [00:16:48] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:17:58] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[00:17:58] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:18:19] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:18:49] PROBLEM - aqs endpoints health on aqs1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:19:58] PROBLEM - aqs endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:20:18] uh [00:21:23] matt_flaschen: there is another change for Echo in operations/mediawiki-config repository by the way: https://gerrit.wikimedia.org/r/#/c/289395/ [00:21:48] RECOVERY - aqs endpoints health on aqs1002 is OK: All endpoints are healthy [00:21:48] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [00:21:48] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [00:23:44] Dereckson, thanks. I think it's just dead code, I will ask him to schedule it. [00:24:06] yes it's a no op removing wmgUseClusterSession [00:24:31] RECOVERY - aqs endpoints health on aqs1003 is OK: All endpoints are healthy [00:25:30] Dereckson, it's removing a couple things. I think you mean wmgUseClusterJobQueue. [00:26:09] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [00:26:29] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [00:26:38] RECOVERY - puppet last run on ms-be2026 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [00:27:39] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:27:39] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[00:29:39] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [00:29:39] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [00:31:36] Dereckson, are you done deploying? I need to push a fix for fatal [00:32:01] MaxSem: yes, I'm done [00:32:09] thx [00:41:39] !log maxsem@tin Synchronized php-1.28.0-wmf.8/extensions/VisualEditor/: https://gerrit.wikimedia.org/r/#/c/298677/ (duration: 01m 01s) [00:41:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:48:03] can someone restart gerrit-wm? [00:49:40] RECOVERY - puppet last run on lead is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:51:59] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:53:59] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [00:56:30] !log Restarted grrrit-wm [00:56:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:57:10] thanks ori. I was just looking up the instructions for that [00:57:35] it's on kubernetes now, cool [00:59:20] I think its the only bot on k8s for now [00:59:29] lots of webservices are switching though [01:02:14] (03PS3) 10Chad: Gerrit: letsencrypt cert names are called gerrit not based on host [puppet] - 10https://gerrit.wikimedia.org/r/298681 [01:03:55] (03CR) 10Dzahn: [C: 032] Gerrit: letsencrypt cert names are called gerrit not based on host [puppet] - 10https://gerrit.wikimedia.org/r/298681 (owner: 10Chad) [01:04:26] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:05:26] RECOVERY - HTTPS on lead is OK: SSL OK - Certificate gerrit-new.wikimedia.org valid until 2016-10-10 23:47:00 +0000 (expires in 89 days) [01:06:16] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [01:06:19] ^ yay! 
[01:06:36] re: lead [01:11:32] (03PS1) 10Chad: Gerrit: Turn off rsync cron for now [puppet] - 10https://gerrit.wikimedia.org/r/298683 [01:11:46] bd808, IIRC it's a page called grrrit-wm, on either wikitech or mw.o [01:11:50] might be wikibugs on mw.o [01:12:09] Krenair: yeah. I had just found the page when ori got it started [01:12:18] https://wikitech.wikimedia.org/wiki/Grrrit-wm [01:12:27] (03CR) 10Dzahn: [C: 032] Gerrit: Turn off rsync cron for now [puppet] - 10https://gerrit.wikimedia.org/r/298683 (owner: 10Chad) [01:16:46] PROBLEM - aqs endpoints health on aqs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:16:56] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:17:37] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:18:05] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:18:58] PROBLEM - MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1129.78 seconds [01:23:35] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [01:23:55] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [01:26:36] PROBLEM - aqs endpoints health on aqs1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[01:26:46] PROBLEM - puppet last run on mw2069 is CRITICAL: CRITICAL: puppet fail [01:28:25] RECOVERY - aqs endpoints health on aqs1001 is OK: All endpoints are healthy [01:28:26] RECOVERY - aqs endpoints health on aqs1003 is OK: All endpoints are healthy [01:28:36] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [01:43:26] PROBLEM - puppet last run on ms-be2001 is CRITICAL: CRITICAL: puppet fail [01:46:17] RECOVERY - MariaDB Slave Lag: m3 on db1048 is OK: OK slave_sql_lag Replication lag: 0.09 seconds [01:49:26] PROBLEM - puppet last run on maps2002 is CRITICAL: CRITICAL: puppet fail [01:56:07] RECOVERY - puppet last run on mw2069 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [02:10:56] RECOVERY - puppet last run on ms-be2001 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [02:16:46] RECOVERY - puppet last run on maps2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [02:28:44] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.8) (duration: 10m 11s) [02:28:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:34:52] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Jul 13 02:34:52 UTC 2016 (duration 6m 8s) [02:34:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:42:33] (03PS1) 10Bartosz Dziewoński: Workaround for really broken browser detection in Gerrit code [puppet] - 10https://gerrit.wikimedia.org/r/298688 [02:50:07] ostriches: is gerrit-new slow because its test environment, or is that a issue? [02:51:09] Should be faster since its a real machine with better specs 😃 [02:51:16] it's not slow… [02:51:21] well, not any slower than the old one [02:51:31] it *is* slow. but not worse than before :P [02:51:39] ostriches: also, https://gerrit.wikimedia.org/r/#/c/298688/ is up [02:52:19] (and i think i'm off to sleep. 
it would be a great thing to wake up to that being merged and deployed, if you see what i'm saying here.) [02:52:31] ;) [02:53:15] We might need to tweak something here or there too [02:53:22] I did zero tuning on the new lucene index, for example [02:53:28] going from front page -> patchset is slower, not much but noticable, loading diffs also appears to have differnce [02:53:40] looks like Danny_B has also noticed it based on the mailing list [02:54:19] (03CR) 10Chad: [C: 031] Workaround for really broken browser detection in Gerrit code (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/298688 (owner: 10Bartosz Dziewoński) [02:54:32] diffs seemed faster to me imho [03:05:51] for me it is enough slower than old version to notice that. [03:06:23] I'm wondering if it has something to do with tuning stuff server-side. [03:06:26] but i admit some slow parts are related to search, so if you haven't polished indices... [03:06:34] Or if there's a performance degradation in the JS. [03:08:26] i can do some manual benchmarks for comparison tomorrow. [03:08:34] Danny_B: One of the best things that ever happened to Gerrit's UI is when they allowed us to upstream our white/blue/grey theme to replace the puke yellow & green :p [03:09:00] (from what I heard, there was much rejoicing by other gerrit users than us too :p) [03:09:32] i just googled up that you can have bootstrap-like ui [03:09:59] sshd.enableCompression might be useful (although not our immediate problem) [03:10:24] it costs some 300k of extra data for necessary stuff [03:11:31] anyway, treat my maillist voice as minor, since i am just occasional user (and will obviously become yet more occasional now ;-)) [03:11:47] (about the ui/ux) [03:12:00] the speed is issue though [03:13:12] Yeah the speed thing worries me. [03:13:20] Like I said, you're the 2nd or 3rd person to mention it [03:14:11] httpd.maxThreads [03:14:11] Maximum number of threads to permit in the worker thread pool. 
[03:14:11] By default 25, suitable for most lower-volume traffic sites. [03:14:15] That seems worrying ^ [03:15:33] what is the actual plan of gerrit to phabricator migration? [03:15:50] Eh, we have set to 60 [03:16:05] Danny_B: There's no timeline anymore. [03:17:50] (03CR) 10Legoktm: "We should also file this upstream?" [puppet] - 10https://gerrit.wikimedia.org/r/298688 (owner: 10Bartosz Dziewoński) [03:18:15] (03CR) 10Legoktm: "(Now that we're on a supported upstream version ;-))" [puppet] - 10https://gerrit.wikimedia.org/r/298688 (owner: 10Bartosz Dziewoński) [03:18:35] (03CR) 10Chad: "I'm not sure upstream will treat is as a bug actually. This is kind of how GWT has always behaved on browsers it can't handle..." [puppet] - 10https://gerrit.wikimedia.org/r/298688 (owner: 10Bartosz Dziewoński) [03:19:33] :/ [03:20:39] :-( [03:21:39] legoktm: I mean it can't hurt to try :) [03:22:02] Anyway, back to pokemons and such! Follow-up on Phab or the list. And thanks guys for testing things, it helps!! [03:32:12] 06Operations, 10MediaWiki-Export-or-Import, 10Wikimedia-Site-requests: Transwiki import not working in production - https://phabricator.wikimedia.org/T140206#2456605 (10TTO) [03:32:36] ^^ Transwiki import seems to have stopped working. Any idea why? Possibly an issue with internal proxies? [03:37:48] ohai tto [03:38:01] legoktm: hey :) [03:38:30] would this be in any logs? [03:39:01] It's just return Status::newFatal( 'importcantopen' ); [03:39:04] So I guess not [03:39:33] Unless Http::request writes to a log upon failure [03:40:05] $logger = LoggerFactory::getInstance( 'http' ); [03:40:06] $logger->warning( $status->getWikiText( false, false, 'en' ), [03:40:06] [ 'error' => $errors, 'caller' => $caller, 'content' => $req->getContent() ] ); [03:40:08] it does! [03:40:13] Yes, just saw that :) [03:40:22] of course, we don't have that log enabled [03:40:44] of course!... 
[03:41:14] one minute [03:42:32] (03PS1) 10Legoktm: Log 'http' at warning level to debug transwiki import errors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298692 [03:43:03] (03CR) 10Legoktm: [C: 032] Log 'http' at warning level to debug transwiki import errors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298692 (owner: 10Legoktm) [03:43:40] (03Merged) 10jenkins-bot: Log 'http' at warning level to debug transwiki import errors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298692 (owner: 10Legoktm) [03:44:09] 06Operations, 07Puppet, 06Labs, 10Phabricator, and 2 others: Install Arcanist in toollabs::dev_environ - https://phabricator.wikimedia.org/T139738#2456611 (10Dereckson) According @mmodell, the Arcanist package is distro agnostic . We need it for every distro we support in apt.wm.o [03:44:17] 06Operations, 10MediaWiki-Export-or-Import, 10Wikimedia-Site-requests: Transwiki import not working in production - https://phabricator.wikimedia.org/T140206#2456553 (10Peachey88) >>! In T140206#2456602, @TTO wrote: > OK, seems to be a global issue. Confirmed, attempted Meta->MwWiki [03:44:44] !log legoktm@tin Synchronized wmf-config/InitialiseSettings.php: Log 'http' at warning level to debug transwiki import errors (duration: 00m 29s) [03:44:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:44:51] tto: try import now? [03:44:53] Aha, nice. I'll try importing en:Frankenfield Glacier to testwiki [03:45:18] Done, should be some errors there now with any luck [03:46:20] legoktm: ^ [03:46:38] HAHAHA [03:46:49] this is hilarious [03:46:50] Error: 403, Insecure Request Forbidden - use HTTPS - https://lists.wikimedia.org/pipermail/mediawiki-api-announce/2016-May/000110.html [03:46:59] Really! [03:47:09] Why on earth is it using HTTP? [03:48:02] probably because it's an internal request and no one noticed [03:48:08] not enough traffic to make the logs? 
[03:48:37] $link = $firstIw->getURL( strtr( "${additionalIwPrefixes}Special:Export/$page", [03:48:37] ' ', '_' ) ); [03:48:54] from ImportStreamSource::newFromInterwiki [03:50:03] > $i=Interwiki::fetch('mediawikiwiki'); [03:50:03] > var_dump($i->getURL("Test")); [03:50:03] string(29) "//www.mediawiki.org/wiki/Test" [03:50:11] okay, and somewhere an HTTP prefix is being dropped in front of it [03:50:17] Yes, the interwiki map is protocol relative [03:50:52] lets make it HTTPS-only? [03:51:14] That would probably be the simplest solution [03:51:20] is that a meta thing or WikimediaMaintenance? [03:51:30] WikimediaMaintenace dumpInterwiki.php [03:51:48] There's a convenient $urlprotocol member that can be set [03:51:55] do you think you could put up a patch for that? I need to do something IRL for ~10 minutes [03:52:02] Sure, I'll take a look [03:52:50] btw, it's MWHttpRequest::__construct which is to blame, it expands the URL to PROTO_HTTP by default [03:57:20] legoktm: Is HTTPS compulsory everywhere now? I remember for a while there was an exception for China and Iran [03:57:41] If they still browse via HTTP then we will break things for them by forcing the interwiki map to HTTPS [04:03:40] tto: yes, it is required for everyone [04:04:20] ok, then the patch I uploaded ought to be fine (haven't tested though! It's a pain to test WikimediaMaintenance patches on a local machine because of the hardcoded file paths) [04:04:36] lets make it HTTPS-only? [04:04:46] Beta still has broken HTTPS [04:04:51] ugh [04:04:57] damn, forgot about labs [04:05:02] well, lets fix beta after fixing prod [04:05:04] Let's fix production first? [04:05:10] I -1'd but it still went through [04:05:25] sorry, i couldnt cancel my +2 in time [04:05:31] I could have just cherry-picked [04:06:38] uhhh [04:06:44] Krenair: do you know where the updateinterwikicache script went? [04:07:46] wasn't that deleted? [04:07:50] it was?? 
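Editor's note on the 03:46 to 03:52 exchange: the interwiki map stores protocol-relative URLs (`//www.mediawiki.org/wiki/Test`), and the HTTP client expanded them with an HTTP default, which the API then rejected with the 403 quoted above. A minimal Python sketch of that expansion step follows. The function name `expand_url` and its `default_proto` parameter are hypothetical and only illustrate the behaviour described in the log; this is not MediaWiki's actual implementation.

```python
def expand_url(url: str, default_proto: str = "http") -> str:
    """Turn a protocol-relative URL ("//host/path") into an absolute one.

    Hypothetical helper: the "http" default mirrors the PROTO_HTTP default
    that the log blames for the failing import requests.
    """
    if url.startswith("//"):
        return f"{default_proto}:{url}"
    # Already absolute (or not protocol-relative): leave untouched.
    return url


# The URL shape shown in the log's var_dump output, with an export page name.
interwiki_url = "//www.mediawiki.org/wiki/Special:Export/Test"

print(expand_url(interwiki_url))           # expanded with the HTTP default
print(expand_url(interwiki_url, "https"))  # expanded with an explicit HTTPS scheme
```

Storing `https://` URLs in the interwiki map, as the patch discussed here does, sidesteps the default entirely because the URL is no longer protocol-relative.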
[04:08:06] When the cache file started getting versioned in git [04:08:22] yeah but...you just pushed it from tin [04:08:26] how do I do it now? [04:08:26] https://gerrit.wikimedia.org/r/#/c/269222/ [04:09:03] (03CR) 10Legoktm: "So....how is it supposed to be updated now? was never updated, please ann" [puppet] - 10https://gerrit.wikimedia.org/r/269222 (owner: 10Ori.livneh) [04:10:07] I'll have to get into SSL in deployment-prep at some point [04:10:19] see how much of the prod LE stuff is reusable [04:11:19] okay, I think I got it [04:12:25] (03PS1) 10Legoktm: Update interwiki map, make everything HTTPS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298696 [04:12:30] tto, Krenair: sanity review please ^ [04:13:21] changing >10k lines, fun [04:13:32] gerrit can't handle it [04:13:39] lol [04:13:49] I looked at the diff up to the br projects [04:13:51] (03CR) 10TTO: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298696 (owner: 10Legoktm) [04:14:19] oh, it changes short array syntax to the old syntax [04:14:33] LGTM [04:15:01] oh right, I merged that change to the script, but it hasn't been deployed yet [04:15:05] (03CR) 10Legoktm: [C: 032] Update interwiki map, make everything HTTPS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298696 (owner: 10Legoktm) [04:15:23] tto: do you have the X-Wikimedia-Debug stuff set up? [04:15:43] (03Merged) 10jenkins-bot: Update interwiki map, make everything HTTPS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298696 (owner: 10Legoktm) [04:15:47] legoktm, no, I don't [04:16:03] Would it come in handy? [04:16:24] https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug [04:16:25] yes [04:16:48] so I can sync this change to mw1017, you enable the browser extension which sets the header, sending all your requests to mw1017, and you can test it without me breaking all of production :) [04:17:09] I'll install the extension now if you like! 
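Editor's note on the 04:16 message: the testing approach described there relies on a request header that routes a single client's traffic to the debug host (mw1017) while everyone else stays on the normal pool. A rough Python sketch of attaching such a header to a request is below. The header name comes from the log; the header value format is an assumption (the log does not state it), and `debug_headers` is a hypothetical helper. Nothing is sent over the network here.

```python
import urllib.request


def debug_headers(backend: str) -> dict:
    """Build the extra header for a debug-routed request.

    Assumption: the value format below is a placeholder; consult the
    X-Wikimedia-Debug documentation linked in the log for the real one.
    """
    return {"X-Wikimedia-Debug": f"backend={backend}"}


# Construct (but do not send) a request for the page imported in the log.
request = urllib.request.Request(
    "https://test.wikipedia.org/wiki/Frankenfield_Glacier",
    headers=debug_headers("mw1017"),
)

print(request.full_url)
print(debug_headers("mw1017"))
```

The browser extension mentioned in the log does the equivalent for every request the browser makes, which is why a change synced only to mw1017 can be checked before a full deployment.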
[04:17:15] yes please :) [04:18:22] https://test.wikipedia.org/wiki/Frankenfield_Glacier [04:19:02] 04:18, 13 July 2016 Legoktm (talk | contribs | block) imported Frankenfield Glacier from en:Frankenfield Glacier (1 revision) [04:19:34] Excellent! Thanks for all your help legoktm, as always :) [04:20:07] !log legoktm@tin Synchronized wmf-config/interwiki.php: Update interwiki map, make them HTTPS (duration: 00m 39s) [04:20:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:20:30] (03PS1) 10Legoktm: Revert "Log 'http' at warning level to debug transwiki import errors" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298697 [04:20:36] (03CR) 10Legoktm: [C: 032] Revert "Log 'http' at warning level to debug transwiki import errors" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298697 (owner: 10Legoktm) [04:20:41] (03PS2) 10Legoktm: Revert "Log 'http' at warning level to debug transwiki import errors" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298697 [04:20:46] (03CR) 10Legoktm: Revert "Log 'http' at warning level to debug transwiki import errors" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298697 (owner: 10Legoktm) [04:20:54] (03CR) 10Legoktm: [C: 032] Revert "Log 'http' at warning level to debug transwiki import errors" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298697 (owner: 10Legoktm) [04:21:52] (03Merged) 10jenkins-bot: Revert "Log 'http' at warning level to debug transwiki import errors" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298697 (owner: 10Legoktm) [04:22:52] !log legoktm@tin Synchronized wmf-config/InitialiseSettings.php: revert http logging change (duration: 00m 31s) [04:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:24:31] 06Operations, 10MediaWiki-Export-or-Import, 10Wikimedia-Site-requests, 13Patch-For-Review: Transwiki import not working in production - https://phabricator.wikimedia.org/T140206#2456629 (10TTO) 
05Open>03Resolved a:03TTO The import process was still fetching the pages over HTTP, and after recent chang... [04:25:59] 06Operations, 10MediaWiki-Export-or-Import, 10Wikimedia-Site-requests, 13Patch-For-Review: Transwiki import not working in production - https://phabricator.wikimedia.org/T140206#2456634 (10Legoktm) This was caused by the change to break HTTP POST requests to the API, we just didn't notice that transwiki im... [04:26:33] ok, I'm going afk for a bit, but will have my laptop open [04:26:49] * legoktm huggles tto [04:27:00] debugging import issues is always fun ;) [04:27:07] legoktm: Gotta love it :) [04:45:51] 06Operations, 07Puppet, 06Labs, 10Phabricator, and 2 others: Install Arcanist in toollabs::dev_environ - https://phabricator.wikimedia.org/T139738#2456643 (10mmodell) @dereckson: According to T137770#2426038, the package was uploaded to Trusty and Jessie. [05:08:53] 06Operations, 10ArchCom-RfC, 06Services, 07Archcom-has-shepherd , 07RfC: Service Ownership and Maintenance - https://phabricator.wikimedia.org/T122825#2456667 (10RobLa-WMF) [05:09:26] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 15 others: Use content hash based image / thumb URLs & define an official thumb API - https://phabricator.wikimedia.org/T66214#2456677 (10RobLa-WMF) [05:17:30] PROBLEM - Disk space on lithium is CRITICAL: DISK CRITICAL - free space: /srv/syslog 13348 MB (3% inode=99%) [05:29:48] 06Operations, 07Puppet, 06Labs, 10Phabricator, and 2 others: Install Arcanist in toollabs::dev_environ - https://phabricator.wikimedia.org/T139738#2456708 (10mmodell) And yet I don't see the package on https://apt.wikimedia.org... [05:43:05] Krenair: LE should be usable in labs. I did have some hoops to jump through to getting it working on prod tho. 
Feel free to pick my brain at another time [05:43:28] * ostriches is off to bed though so not now [05:59:34] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:01:35] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [06:30:32] PROBLEM - puppet last run on mw2081 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:02] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 3 failures [06:31:12] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:22] PROBLEM - puppet last run on db1028 is CRITICAL: CRITICAL: Puppet has 3 failures [06:31:42] PROBLEM - Disk space on lithium is CRITICAL: DISK CRITICAL - free space: /srv/syslog 13271 MB (3% inode=99%) [06:32:02] PROBLEM - puppet last run on restbase2006 is CRITICAL: CRITICAL: Puppet has 2 failures [06:33:51] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 3 failures [06:34:41] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 2 failures [06:41:42] RECOVERY - Disk space on lithium is OK: DISK OK [06:51:30] PROBLEM - puppet last run on elastic2008 is CRITICAL: CRITICAL: Puppet has 2 failures [06:56:42] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [06:56:51] RECOVERY - puppet last run on mw2081 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:31] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:40] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:41] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:41] RECOVERY - puppet last run on restbase2006 is OK: OK: Puppet is currently enabled, last run 1 minute ago 
with 0 failures [06:57:43] (03PS1) 10ArielGlenn: if no job specified to run, don't try to find next wiki to run based on job [dumps] - 10https://gerrit.wikimedia.org/r/298698 [06:58:11] RECOVERY - puppet last run on db1028 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:30] PROBLEM - puppet last run on elastic2006 is CRITICAL: CRITICAL: Puppet has 1 failures [06:59:11] (03CR) 10ArielGlenn: [C: 032] if no job specified to run, don't try to find next wiki to run based on job [dumps] - 10https://gerrit.wikimedia.org/r/298698 (owner: 10ArielGlenn) [07:07:46] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 222, down: 1, dormant: 0, excluded: 0, unused: 0; xe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave] [07:19:18] RECOVERY - puppet last run on elastic2008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:26:06] RECOVERY - puppet last run on elastic2006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:30:27] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [07:33:24] 06Operations, 10Phabricator: Phabricator weekly report not generated (or at least sent) - https://phabricator.wikimedia.org/T139950#2456823 (10greg) >>! In T139950#2455764, @Danny_B wrote: > @greg That's exactly what I was asking earlier... but now it is possible! :) I'd be willing to do it, but I'll leave it... 
[07:34:47] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [07:35:17] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [07:46:27] PROBLEM - HHVM rendering on mw1257 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.005 second response time [07:48:17] RECOVERY - HHVM rendering on mw1257 is OK: HTTP OK: HTTP/1.1 200 OK - 67672 bytes in 0.093 second response time [07:51:59] 06Operations, 07Puppet, 06Labs, 10Phabricator, and 2 others: Install Arcanist in toollabs::dev_environ - https://phabricator.wikimedia.org/T139738#2441209 (10valhallasw) https://apt.wikimedia.org/wikimedia/pool/universe/libp/libphutil/ Would it be possible to also upload the package for precise? [07:55:17] PROBLEM - Apache HTTP on mw1220 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.033 second response time [07:55:26] FYI, I just got a transient 503 accessing my watchlist on enwiki [07:55:34] Oh, hey [07:57:16] RECOVERY - Apache HTTP on mw1220 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.044 second response time [07:57:17] PROBLEM - Apache HTTP on mw1274 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 8.275 second response time [07:59:00] checking --^ [07:59:06] RECOVERY - Apache HTTP on mw1274 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.033 second response time [07:59:34] mmmm [08:00:56] PROBLEM - HHVM rendering on mw1020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.003 second response time [08:02:56] RECOVERY - HHVM rendering on mw1020 is OK: HTTP OK: HTTP/1.1 200 OK - 67674 bytes in 0.371 second response time [08:02:56] this one has init: hhvm main process (23072) killed by SEGV signal in the dmesg [08:05:05] 06Operations, 07Puppet, 06Labs, 10Phabricator, and 2 others: Install 
Arcanist in toollabs::dev_environ - https://phabricator.wikimedia.org/T139738#2441209 (10hashar) The package has been created via {T137770} for both Trusty and Jessie: **Trusty** ``` $ apt-cache madison arcanist arcanist | 0~git2016062... [08:05:10] so up to now we have seen flapping: mw1257 mw1220 mw127 mw1020 [08:05:22] (third one is mw1274) [08:06:27] PROBLEM - Apache HTTP on mw1213 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.013 second response time [08:08:06] PROBLEM - Apache HTTP on mw1180 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.005 second response time [08:08:07] PROBLEM - HHVM rendering on mw1180 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.880 second response time [08:08:18] RECOVERY - Apache HTTP on mw1213 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.064 second response time [08:08:24] <_joe_> wat? [08:08:39] <_joe_> something bad is happening [08:08:55] yeah [08:09:07] there are some stacktraces in /var/log/hhvm [08:09:16] PROBLEM - Apache HTTP on mw1240 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.011 second response time [08:09:19] <_joe_> it's at least 1 hour that we have a higher than normal amount of 5xx [08:09:20] we had a spam of DB error for zhwiki [08:09:28] <_joe_> hashar: not relevant [08:09:35] <_joe_> elukey: can get what that's about? 
[08:09:57] RECOVERY - Apache HTTP on mw1180 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.051 second response time [08:10:06] RECOVERY - HHVM rendering on mw1180 is OK: HTTP OK: HTTP/1.1 200 OK - 67674 bytes in 0.140 second response time [08:10:34] _joe_ I found one only one mw1020 and it seemed related to a db query [08:10:41] <_joe_> elukey: ok looking [08:10:48] <_joe_> because other stack traces are empty [08:11:07] RECOVERY - Apache HTTP on mw1240 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.032 second response time [08:14:47] PROBLEM - Apache HTTP on mw1187 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.013 second response time [08:16:47] RECOVERY - Apache HTTP on mw1187 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.035 second response time [08:19:06] PROBLEM - Apache HTTP on mw1240 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50407 bytes in 0.613 second response time [08:20:58] RECOVERY - Apache HTTP on mw1240 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.048 second response time [08:23:18] PROBLEM - HHVM rendering on mw1173 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.011 second response time [08:25:01] yikes [08:25:19] <_joe_> sjoerddebruin: we know, we're trying to figure out what's happening exactly [08:27:27] RECOVERY - HHVM rendering on mw1173 is OK: HTTP OK: HTTP/1.1 200 OK - 67681 bytes in 0.112 second response time [08:31:58] PROBLEM - HHVM rendering on mw1211 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.014 second response time [08:33:57] RECOVERY - HHVM rendering on mw1211 is OK: HTTP OK: HTTP/1.1 200 OK - 67679 bytes in 0.091 second response time [08:35:26] PROBLEM - HHVM rendering on mw1219 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.019 second response time [08:35:47] PROBLEM - Apache HTTP on mw1269 is CRITICAL: HTTP 
CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.004 second response time [08:36:27] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [08:37:18] RECOVERY - HHVM rendering on mw1219 is OK: HTTP OK: HTTP/1.1 200 OK - 67672 bytes in 0.107 second response time [08:37:47] RECOVERY - Apache HTTP on mw1269 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.041 second response time [08:39:27] PROBLEM - HHVM rendering on mw1184 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.013 second response time [08:41:26] RECOVERY - HHVM rendering on mw1184 is OK: HTTP OK: HTTP/1.1 200 OK - 67672 bytes in 0.097 second response time [08:42:48] (03CR) 10Mobrovac: [C: 031] Move node-specific versions to a cluster-wide setting [puppet] - 10https://gerrit.wikimedia.org/r/298631 (https://phabricator.wikimedia.org/T139639) (owner: 10Eevans) [08:43:47] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [08:44:58] PROBLEM - Apache HTTP on mw1220 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.013 second response time [08:46:58] RECOVERY - Apache HTTP on mw1220 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.630 second response time [08:50:02] (03CR) 10Addshore: Move stats::wmde cron files to analytics/wmde/scripts repo (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/298487 (https://phabricator.wikimedia.org/T140095) (owner: 10Addshore) [08:50:06] PROBLEM - Apache HTTP on mw1180 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.010 second response time [08:50:07] PROBLEM - HHVM rendering on mw1180 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.009 second response time [08:51:57] RECOVERY - Apache HTTP on mw1180 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes 
in 0.035 second response time [08:52:07] RECOVERY - HHVM rendering on mw1180 is OK: HTTP OK: HTTP/1.1 200 OK - 67716 bytes in 0.214 second response time [08:53:16] PROBLEM - Apache HTTP on mw1268 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 3.512 second response time [08:55:07] RECOVERY - Apache HTTP on mw1268 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.042 second response time [08:59:46] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [09:03:37] (03PS7) 10Addshore: Move stats::wmde cron files to analytics/wmde/scripts repo [puppet] - 10https://gerrit.wikimedia.org/r/298487 (https://phabricator.wikimedia.org/T140095) [09:03:58] PROBLEM - Apache HTTP on mw1176 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50406 bytes in 0.094 second response time [09:04:33] (03PS8) 10Addshore: Move stats::wmde cron files to analytics/wmde/scripts repo [puppet] - 10https://gerrit.wikimedia.org/r/298487 (https://phabricator.wikimedia.org/T140095) [09:05:58] RECOVERY - Apache HTTP on mw1176 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.032 second response time [09:06:19] PROBLEM - Apache HTTP on mw1270 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 2.180 second response time [09:08:18] RECOVERY - Apache HTTP on mw1270 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.024 second response time [09:10:12] hi CP678|Laptop [09:10:12] legoktm: okay? [09:10:14] I had to block your bot [09:10:24] but can you turn it off too? it's still making read requests [09:10:26] _joe_: ^^ [09:10:49] <_joe_> legoktm: let me check, but I am pretty usre it is [09:10:54] legoktm: how's it causing server outages? [09:11:09] <_joe_> CP678|Laptop: we are not sure, probably not, but we have ongoing issues [09:11:28] <_joe_> and I'd like to exclude possible causes [09:11:29] So my bot was blocked on a whim? 
[09:11:29] no [09:11:34] we have a partial outage right now [09:11:40] I noticed. [09:11:43] yor bot's requests happen around each crash [09:11:49] *your [09:11:51] I see. [09:11:55] (03PS5) 10Mobrovac: WIP: Parsoid: Move to service::node [puppet] - 10https://gerrit.wikimedia.org/r/298436 [09:12:00] please just stop it for the moment [09:12:08] My bot has been operating in it's current state for over 3 months now. [09:12:10] as soon as we know what;s going on and can fix it/absolve you [09:12:13] we will let you know [09:12:51] <_joe_> CP678|Laptop: i don't think your bot is the cause of the crash [09:12:52] it's not about blame, it's about trying to eliminate causes of the outage as fast as we can [09:12:52] CP678|Laptop: it's unlikely to be a problem in your bot, more likely it's triggering some state that the servers can't handle atm, and keeping the site up is more important than a single bot running [09:12:56] <_joe_> it could be triggerig it [09:13:05] I understand. [09:13:40] well. It take a bit. I have limited access right now. [09:13:50] 06Operations, 10MediaWiki-General-or-Unknown, 07HHVM: Application servers in constant crash - https://phabricator.wikimedia.org/T140223#2457012 (10jcrespo) [09:14:03] 06Operations, 10MediaWiki-General-or-Unknown, 07HHVM: Application servers in constant crash - https://phabricator.wikimedia.org/T140223#2457024 (10jcrespo) p:05Triage>03Unbreak! [09:14:14] ok, please ping here when you're able to shut it down [09:14:52] <_joe_> CP678|Laptop: thanks, really appreciated [09:16:28] (03CR) 10Gehel: "Production clsuter updated, but this change isn't merged. What's the status? Does this still need merging?" 
[puppet] - 10https://gerrit.wikimedia.org/r/298295 (https://phabricator.wikimedia.org/T136001) (owner: 10BryanDavis) [09:16:28] It should be shutting down now [09:16:56] 06Operations, 10MediaWiki-General-or-Unknown, 07HHVM: Application servers in constant crash - https://phabricator.wikimedia.org/T140223#2457029 (10jcrespo) By raw numbers, Cyberpower's bot has been requested to stop to see if it could be triggering a problematic App state, to a) debug the issue- see if it is... [09:17:03] _joe_: what exactly is triggering the bot. [09:17:10] *crash [09:17:21] What is my bot doing that's causing this? [09:17:42] well it's got to be the read requests [09:17:42] <_joe_> CP678|Laptop: calling a page that triggers the crash [09:17:51] which means that the content being requested is a problem [09:17:53] Any particular page. [09:18:11] could be a template transcluded in a pile of things [09:18:12] <_joe_> probably a specific one [09:18:37] CP678|Laptop: at the moment, we see correlation, not yet causality... [09:18:40] ah thanks for the shutdown, I just now saw you were able to stop the bot [09:18:52] Gladly. [09:19:32] I guess pin this gem on my userpage. "My bot crashed Wikipedia. :p" [09:19:45] well only if we verify that it really did :-P [09:20:32] I'm curious to what page, if there are specific ones, is acting as the trigger. [09:20:45] *to know [09:20:48] so are we! [09:21:00] Well that answers my question. :p [09:21:13] <_joe_> CP678|Laptop: me too, the issue seems similar to a past crash [09:21:16] https://phabricator.wikimedia.org/T135483 this is a past example which folks believe is the same bug [09:22:03] <_joe_> icinga-wm is not reporting the erros anymore, why? [09:22:13] _joe_: is there any correlation to the recent centralauth crash when calling a specific user? [09:22:16] legoktm: ^ [09:22:33] I noticed it worked again this morning and was able to complete the rename. 
[09:23:00] 06Operations, 10MediaWiki-General-or-Unknown, 07HHVM: Application servers in constant crash - https://phabricator.wikimedia.org/T140223#2457012 (10Joe) looking at the stack traces, this seems similar to T135483 [09:23:18] CP678|Laptop: no, unrelated [09:23:22] ok [09:26:48] apergos: _joe_: At least the timing is optimal. I wanted to do some maintenance to the exec node host InternetArchiveBot, and I was going to have to shut it down anyway. [09:27:06] heh [09:27:57] apergos: It's only IABot right, or does the other Cyberbot tasks need to be shutdown too? [09:28:02] I don't see the warnings in icinga but I don't see disabled notifications either, weird [09:28:07] PROBLEM - HHVM rendering on mw1239 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.030 second response time [09:28:28] IA is the one that deals with sources that have expired or something right? [09:28:49] CP678|Laptop: [09:29:00] Yes [09:29:13] <_joe_> CP678|Laptop: it hasn't solved the issue but surely mitigated it, see https://grafana-admin.wikimedia.org/dashboard/db/varnish-http-errors [09:29:16] afaik that's the one. if you want to be on the safe side you could stop them all for about 15 minutes [09:29:26] and we could see if that changes anything [09:30:08] RECOVERY - HHVM rendering on mw1239 is OK: HTTP OK: HTTP/1.1 200 OK - 67660 bytes in 0.096 second response time [09:30:11] and as soon as I slur icinga it whines. *eyeroll* [09:30:19] https://grafana.wikimedia.org/dashboard/db/varnish-http-errors is the public link [09:30:53] apergos: I unfortunately don't have that much time. I have to assist my currently handicapped uncle. [09:31:54] CP678|Laptop_: would you be able to slow their rate of requests drastically? [09:32:05] Not really. [09:32:12] They don't have throttles. [09:32:17] ah [09:32:24] in doubt, maybe shut down all your bots? [09:32:24] They're also very old. 
[09:32:30] and can resume them later on [09:32:50] when you have more time [09:34:57] Shutting down [09:35:42] * apergos keeps an eye on that graph [09:35:57] The bot should be completely off now. [09:36:23] ok, thank you very much [09:37:09] PROBLEM - Apache HTTP on mw1174 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.012 second response time [09:37:54] I have to go now [09:38:05] ok [09:38:09] thanks for your help [09:39:18] RECOVERY - Apache HTTP on mw1174 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.048 second response time [09:44:35] (03PS1) 10Ladsgroup: changeprop: add new precaching for ores new models [puppet] - 10https://gerrit.wikimedia.org/r/298707 [09:52:46] (03CR) 10Mobrovac: [C: 031] changeprop: add new precaching for ores new models [puppet] - 10https://gerrit.wikimedia.org/r/298707 (owner: 10Ladsgroup) [10:03:16] PROBLEM - Apache HTTP on mw1211 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.006 second response time [10:03:36] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[10:04:57] RECOVERY - Apache HTTP on mw1211 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.035 second response time [10:06:01] (03PS1) 10Paladox: Fix changeid link in gerrit 2.12 [puppet] - 10https://gerrit.wikimedia.org/r/298709 [10:06:42] (03CR) 10Paladox: "https://gerrit.wikimedia.org/r/#/q/I048f617eae1aea57d21cf77fd4a291079bd2f97b" [puppet] - 10https://gerrit.wikimedia.org/r/298709 (owner: 10Paladox) [10:08:36] PROBLEM - Apache HTTP on mw1273 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 5.330 second response time [10:09:06] PROBLEM - Apache HTTP on mw1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:09:46] PROBLEM - HHVM rendering on mw1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:10:37] RECOVERY - Apache HTTP on mw1273 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.048 second response time [10:10:57] RECOVERY - Apache HTTP on mw1019 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.102 second response time [10:11:26] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [10:11:37] RECOVERY - HHVM rendering on mw1019 is OK: HTTP OK: HTTP/1.1 200 OK - 67674 bytes in 0.128 second response time [10:12:56] legoktm: would you mind unblocking my bot? It's off now. [10:14:59] (03CR) 10Mobrovac: [C: 04-1] changeprop: add new precaching for ores new models (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/298707 (owner: 10Ladsgroup) [10:16:27] (03PS2) 10Ladsgroup: changeprop: add new precaching for ores new models [puppet] - 10https://gerrit.wikimedia.org/r/298707 [10:17:28] apergos: _joe_: Any news? 
[10:18:09] (03PS1) 10Paladox: Add missing roottree, file configs to gerrit.config.erb [puppet] - 10https://gerrit.wikimedia.org/r/298710 [10:18:10] still trying to track down the page(s) [10:18:15] <_joe_> CP678|Laptop: I got a repro case, but still not enough to be sure what's the root cause, sorry [10:18:16] some possible progress [10:19:08] (03CR) 10Paladox: "I'm not sure if this will break gerrit 2.8 or not." [puppet] - 10https://gerrit.wikimedia.org/r/298710 (owner: 10Paladox) [10:19:18] <_joe_> CP678|Laptop: clearly not (only) your bots :) [10:19:51] Lol [10:20:02] You could temporarily disable the API. [10:20:12] <_joe_> API is not crashing [10:20:12] That would stop all of the bots. [10:20:23] <_joe_> it's the normal appservers that do [10:20:38] And mitigate the crashing some more. [10:21:14] 06Operations, 05WMF-NDA: Add zeljkof to #mediawiki_security IRC channel - https://phabricator.wikimedia.org/T140225#2457091 (10zeljkofilipin) [10:21:23] Niharika: Don't worry, IABot will be up again soon enough. :p [10:21:57] PROBLEM - HHVM rendering on mw1209 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.016 second response time [10:22:11] Is T140223 really causing almost 2K 5xx errors/min? [10:22:11] T140223: Application servers in constant crash - https://phabricator.wikimedia.org/T140223 [10:22:22] <_joe_> SPF|Cloud: yes [10:22:33] heh. good luck then [10:23:36] PROBLEM - HHVM rendering on mw1178 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.026 second response time [10:23:53] 06Operations, 15User-zeljkofilipin, 05WMF-NDA: Add zeljkof to #mediawiki_security IRC channel - https://phabricator.wikimedia.org/T140225#2457104 (10zeljkofilipin) [10:23:56] RECOVERY - HHVM rendering on mw1209 is OK: HTTP OK: HTTP/1.1 200 OK - 67673 bytes in 0.116 second response time [10:24:11] _joe_: about the API idea? 
[10:24:26] <_joe_> CP678|Laptop: API is ok at the moment [10:24:42] <_joe_> CP678|Laptop: api and "wikis" are separated clusters, different machines [10:24:47] You said bots are acting like triggers to the crash, why not disable the API at the moment, if you want to mitigate the crash? [10:25:32] _joe_: ^ [10:25:36] RECOVERY - HHVM rendering on mw1178 is OK: HTTP OK: HTTP/1.1 200 OK - 67673 bytes in 0.119 second response time [10:25:40] The regular appservers are crashing, not the API appservers [10:25:51] (03CR) 10Mobrovac: [C: 031] "GTG - https://puppet-compiler.wmflabs.org/3323/" [puppet] - 10https://gerrit.wikimedia.org/r/298707 (owner: 10Ladsgroup) [10:26:19] <_joe_> CP678|Laptop: there are bots (esp external ones, plus yours) NOT hitting /w/api.php [10:26:43] Ahhh. That was the missing piece I was looking for [10:29:07] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [10:31:07] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5159475 keys - replication_delay is 0 [10:33:06] would an enwiki admin kindly unblock Cyberbot II. Both exec nodes driving Cyberbot are no idle. [10:33:10] *now [10:33:26] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 224, down: 0, dormant: 0, excluded: 0, unused: 0 [10:34:32] 06Operations, 15User-zeljkofilipin, 05WMF-NDA: Add zeljkof to #mediawiki_security IRC channel - https://phabricator.wikimedia.org/T140225#2457115 (10hashar) @zeljkofilipin is leveling up on deployment MediaWiki changes on the Wikimedia infra. 
[10:36:48] PROBLEM - Apache HTTP on mw1242 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.018 second response time [10:37:07] we can ask lego ktm to unblock it later when he's back on line, CP678|Laptop [10:37:48] PROBLEM - Apache HTTP on mw1265 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.008 second response time [10:38:47] RECOVERY - Apache HTTP on mw1242 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.044 second response time [10:39:56] RECOVERY - Apache HTTP on mw1265 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.034 second response time [10:40:36] PROBLEM - Apache HTTP on mw1272 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.009 second response time [10:42:36] RECOVERY - Apache HTTP on mw1272 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.033 second response time [10:51:08] PROBLEM - Host cr1-ulsfo is DOWN: CRITICAL - Network Unreachable (198.35.26.192) [10:52:57] 06Operations, 06Services, 13Patch-For-Review, 15User-mobrovac: Updates various services to nodejs 4.4.6 - https://phabricator.wikimedia.org/T138561#2457155 (10mobrovac) >>! In T138561#2447442, @Gehel wrote: > Maps servers tested and updated to nodejs 4.4.6. @Yurik, @MaxSem let me know if you see anything u... 
[10:53:17] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [10:53:27] <_joe_> it's "solved" for now [10:53:37] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [10:53:46] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-0/0/1: down - Core: cr1-ulsfo:xe-1/2/0 (Telia, IC-313592, 51ms) {#11372} [10Gbps wave]BR [10:54:07] 06Operations, 06Services, 13Patch-For-Review, 15User-mobrovac: Updates various services to nodejs 4.4.6 - https://phabricator.wikimedia.org/T138561#2457157 (10mobrovac) @ssastry OK to proceed with the upgrade of ruthenium to node 4.4.6? There shouldn't be any impact as it's a security update, and node 4.4.... [10:54:16] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 65, down: 3, dormant: 0, excluded: 0, unused: 0BRxe-0/0/0: down - Core: cr1-ulsfo:xe-0/0/0 [10Gbps DF]BRxe-1/0/0: down - Core: cr1-ulsfo:xe-1/0/0 [10Gbps DF]BRae0: down - BR [10:54:57] PROBLEM - Host cr1-ulsfo IPv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:863:ffff::1 [10:55:36] (03CR) 10Faidon Liambotis: "Good one! This is just historical artifacts from our previous system :)" [dns] - 10https://gerrit.wikimedia.org/r/298513 (owner: 10BBlack) [10:55:46] wait what now? [10:55:49] did we lose ulsfo? [10:56:15] oh no, just cr1 just IPv6? 
[10:56:17] wth [10:56:47] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 66 probes of 424 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map [10:57:20] oh no [10:57:26] we just lost cr1-ulsfo in general [10:57:27] fun [10:57:40] today is not a lucky one [11:00:26] ok, it's booted now [11:01:56] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [11:02:16] 06Operations, 10MediaWiki-General-or-Unknown, 07HHVM: Application servers in constant crash - https://phabricator.wikimedia.org/T140223#2457184 (10jcrespo) The source has probably been identified, as it is a potentially sensitive bug, we should use T135483 to discuss how to fix it. Things for the moments lo... [11:02:26] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 [11:02:52] <_joe_> when a day is the right day... [11:03:07] RECOVERY - Host cr1-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 77.39 ms [11:05:25] yeah [11:05:35] I think it may have been related to my config push for the blacklist [11:06:04] it just crashed just after that [11:06:26] PROBLEM - Disk space on elastic2008 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 109018 MB (15% inode=99%) [11:07:26] RECOVERY - Host cr1-ulsfo IPv6 is UP: PING OK - Packet loss = 0%, RTA = 74.96 ms [11:07:27] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: The requested table is empty or does not exist for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [11:08:30] what now? [11:08:47] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 2 probes of 425 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map [11:08:56] <_joe_> should I decom ulsfo? 
[11:10:34] no [11:10:40] all is good [11:10:43] <_joe_> ok [11:10:45] cr1 rebooted [11:10:50] <_joe_> I was about to push a review [11:13:50] hey, I know you are busy. Just a quick question regarding LVS. Is this line means if the web service on one node (not all) is down we would notice or LVS just sends requests to other node(s) https://github.com/wikimedia/operations-puppet/blob/production/hieradata/common/lvs/configuration.yaml#L967 [11:15:56] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:17:48] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [11:26:36] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [11:27:36] (03PS18) 10Gehel: Script to do the initial data load from OSM for Maps project [puppet] - 10https://gerrit.wikimedia.org/r/293105 [11:28:07] (03CR) 10jenkins-bot: [V: 04-1] Script to do the initial data load from OSM for Maps project [puppet] - 10https://gerrit.wikimedia.org/r/293105 (owner: 10Gehel) [11:30:13] (03CR) 10Gehel: Script to do the initial data load from OSM for Maps project (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/293105 (owner: 10Gehel) [11:30:48] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 25 probes of 425 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map [11:36:43] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 1 probes of 425 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map [11:37:51] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 [11:38:07] !log cr1/2-ulsfo: disabling flow monitoring [11:38:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:38:19] !log cr1-ulsfo: "restart snmp" to fix SNMP hiccup after reboot [11:38:23] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:38:40] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:40:10] mobrovac: ^^ [11:40:31] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [11:40:53] k [11:52:34] !log installing varnishkafka_1.0.11-1 on cp3008.esams to test it before the complete rollout [11:52:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:58:13] (03PS14) 10Yuvipanda: prometheus: add node_exporter support [puppet] - 10https://gerrit.wikimedia.org/r/276243 (https://phabricator.wikimedia.org/T92813) (owner: 10Filippo Giunchedi) [11:58:20] (03CR) 10Yuvipanda: [C: 032 V: 032] prometheus: add node_exporter support [puppet] - 10https://gerrit.wikimedia.org/r/276243 (https://phabricator.wikimedia.org/T92813) (owner: 10Filippo Giunchedi) [12:04:35] (03CR) 10Paladox: "I tested this at http://gerrit-test.wmflabs.org/gerrit/#/c/10/ and it works, without this patch it woulden work, with the patch applied it" [puppet] - 10https://gerrit.wikimedia.org/r/298710 (owner: 10Paladox) [12:05:33] (03CR) 10Faidon Liambotis: Create a new grub module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/296729 (owner: 10Faidon Liambotis) [12:07:20] (03PS2) 10Faidon Liambotis: base: do not include grub in Labs Ubuntus [puppet] - 10https://gerrit.wikimedia.org/r/298490 [12:07:22] (03PS4) 10Faidon Liambotis: base: remove ioscheduler setting from non-augeas codepath [puppet] - 10https://gerrit.wikimedia.org/r/296727 [12:07:24] (03PS4) 10Faidon Liambotis: labstore: un-hieraize elevator/ioscheduler boot-setting [puppet] - 10https://gerrit.wikimedia.org/r/296731 [12:07:26] (03PS4) 10Faidon Liambotis: cache: un-hieraize tcpmhash_entries boot setting [puppet] - 10https://gerrit.wikimedia.org/r/296730 [12:07:28] (03PS4) 10Faidon Liambotis: Create a new grub module [puppet] - 10https://gerrit.wikimedia.org/r/296729 [12:07:30] 
(03PS3) 10Faidon Liambotis: base: reenable augeas codepath on trustys [puppet] - 10https://gerrit.wikimedia.org/r/296728 [12:07:32] (03PS4) 10Faidon Liambotis: mediawiki: un-hieraize cgroup_enable boot-settings [puppet] - 10https://gerrit.wikimedia.org/r/296732 [12:09:11] RECOVERY - Disk space on elastic2008 is OK: DISK OK [12:10:44] (03CR) 10Paladox: [C: 031] "We should give this a try with gerrit 2.8 see if it breaks it or cause problems. This will fix gerrit 2.12 since you have to add all the g" [puppet] - 10https://gerrit.wikimedia.org/r/298710 (owner: 10Paladox) [12:13:45] (03PS1) 10Mobrovac: MobileApps: Increase memory limit to 450MB [puppet] - 10https://gerrit.wikimedia.org/r/298714 (https://phabricator.wikimedia.org/T140215) [12:14:51] (03CR) 10Faidon Liambotis: [C: 032] base: do not include grub in Labs Ubuntus [puppet] - 10https://gerrit.wikimedia.org/r/298490 (owner: 10Faidon Liambotis) [12:15:22] (03CR) 10Faidon Liambotis: [C: 032] base: reenable augeas codepath on trustys [puppet] - 10https://gerrit.wikimedia.org/r/296728 (owner: 10Faidon Liambotis) [12:15:42] 06Operations, 10Phabricator: Phabricator weekly report not generated (or at least sent) - https://phabricator.wikimedia.org/T139950#2457319 (10jcrespo) > Can/Should this (the weekly report script) be run manually now but not enable the crons (there's other things there) until after the failover? I do not know... [12:16:40] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:17:28] mobrovac: what's with the mobileapps alerts? [12:17:41] paravoid: mem-related, it seems [12:17:45] paravoid: and ores-related [12:17:47] :( [12:17:50] (03PS1) 10Ladsgroup: Enable ORES review tool for Turkish Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298715 (https://phabricator.wikimedia.org/T139992) [12:17:50] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[12:17:54] ores is putting too much pressure on scb [12:18:04] Amir1: ^ [12:18:32] paravoid: https://gerrit.wikimedia.org/r/#/c/298714/ should help us out in the short term [12:18:33] (03PS3) 10Giuseppe Lavagetto: puppet: add a function for performing conftool lookups [puppet] - 10https://gerrit.wikimedia.org/r/283151 [12:18:35] mobrovac: I disagree [12:18:39] paravoid: mind reviewing? [12:18:40] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [12:18:47] https://grafana.wikimedia.org/dashboard/db/ores [12:19:02] <_joe_> mobrovac: what? [12:19:04] see the available memory [12:19:06] <_joe_> I can help [12:19:37] Amir1: i've been looking at the cpu and mem utilisation of scb for the last hour or so, and the footprint is rather high from time to time [12:19:56] it correlates with mobileapps alerts [12:19:59] (03CR) 10jenkins-bot: [V: 04-1] puppet: add a function for performing conftool lookups [puppet] - 10https://gerrit.wikimedia.org/r/283151 (owner: 10Giuseppe Lavagetto) [12:20:48] _joe_: https://gerrit.wikimedia.org/r/#/c/298714/ ought to help for now [12:21:13] _joe_: Amir1: paravoid: i think we seriously need to discuss getting some proper hardware for ores [12:21:24] s/proper/dedicated/ [12:21:45] <_joe_> mobrovac: I agree [12:22:04] _joe_: Let's wait for a week or so until we fix one (or even two) memory consuming parts [12:22:17] let me get you phab cards for them [12:22:44] https://phabricator.wikimedia.org/T139407 [12:23:02] <_joe_> mobrovac: it that the memory limit per worker? 
[12:23:13] this one would increase 72 * 100 MB (7 GB) of memoery [12:23:18] *memory [12:23:48] oh, let me check again, I think that's even more [12:23:50] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [12:23:54] _joe_: yes, but it's the max allowed [12:23:55] (03CR) 10Giuseppe Lavagetto: [C: 031] Create a new grub module [puppet] - 10https://gerrit.wikimedia.org/r/296729 (owner: 10Faidon Liambotis) [12:24:17] nope, 20 GB will be freed after these changes [12:24:21] <_joe_> mobrovac: per worker is a lot [12:24:27] <_joe_> Amir1: 20 GB of ram? [12:24:32] _joe_: so a max of 24 * 100MB per host, but it never happens that all of the workers are saturated [12:24:35] practically ores won't take up any ram [12:24:38] _joe_: yup [12:24:47] <_joe_> Amir1: well that's what I call optimizing [12:24:58] <_joe_> mobrovac: ok let me do a few checks and we're GTG [12:25:09] kk thnx [12:25:38] yeah, unforgettably the change is super big (it's a huge refactoring) so we need to test everything before going to pord [12:25:41] *prod [12:25:58] <_joe_> it's 3 gb/machine more [12:26:04] <_joe_> so it's ok-ish [12:26:34] Amir1: could we get a strong ETA on that? [12:26:53] _joe_: Also I suspect we have a memory leak somewhere in uwsgi part of the code. 
we are looking for someone from ops to help us looking into it: https://phabricator.wikimedia.org/T140020 [12:26:56] <_joe_> mobrovac: I don't require ETAs [12:27:00] <_joe_> mobrovac: I set deadlines [12:27:11] <_joe_> Amir1: you won't lure me in it :P [12:27:24] mobrovac: I can't say for sure but maybe the next week or the week after [12:27:42] good point _joe_ [12:27:49] _joe_: :D [12:28:08] (03PS2) 10Giuseppe Lavagetto: MobileApps: Increase memory limit to 450MB [puppet] - 10https://gerrit.wikimedia.org/r/298714 (https://phabricator.wikimedia.org/T140215) (owner: 10Mobrovac) [12:28:18] too many maybes for a situation like this if you ask me [12:28:46] it would be great if we get someone with uwsgi experience [12:28:56] (03PS2) 10Paladox: Add missing roottree, file configs to gerrit.config.erb [puppet] - 10https://gerrit.wikimedia.org/r/298710 [12:29:30] (03PS3) 10Paladox: Add missing roottree, file configs to gerrit.config.erb [puppet] - 10https://gerrit.wikimedia.org/r/298710 [12:29:49] <_joe_> Amir1: put simply, it's all good and great that we do find the bugs and all [12:30:18] (03PS1) 10ArielGlenn: fix age check of cirrus dump files for cleanup [puppet] - 10https://gerrit.wikimedia.org/r/298716 (https://phabricator.wikimedia.org/T138176) [12:30:21] <_joe_> but if by wednesday next week you don't have a release date yet, we'll start thinking about how to relocate ores [12:30:39] _joe_: okay, point taken :) [12:30:40] (03CR) 10Giuseppe Lavagetto: [C: 032] MobileApps: Increase memory limit to 450MB [puppet] - 10https://gerrit.wikimedia.org/r/298714 (https://phabricator.wikimedia.org/T140215) (owner: 10Mobrovac) [12:34:30] 06Operations, 10Ops-Access-Requests: Allow *-admin groups to see systemd logs for their units - https://phabricator.wikimedia.org/T137878#2457430 (10Joe) 05stalled>03Open [12:36:07] !log T137525 Upgrading Zuul 2.1.0-95-g66c8e52-wmf1precise1 ... 
zuul_2.1.0-151-g30a433b-wmf1precise1_amd64.deb [12:36:08] T137525: Investigate Zuul 2.1.0-151-g30a433b that stops processing Gerrit events - https://phabricator.wikimedia.org/T137525 [12:36:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:37:23] 06Operations, 10Ops-Access-Requests: Allow *-admin groups to see systemd logs for their units - https://phabricator.wikimedia.org/T137878#2457445 (10Joe) So my point is basically that for every service (at least the standard ones that we define via service::node, we can: # ensure that journald forwards the in... [12:39:15] (03CR) 10BBlack: [C: 04-1] "One final nit in the documentation!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/296729 (owner: 10Faidon Liambotis) [12:40:09] * paravoid hides [12:40:42] <_joe_> lolol [12:40:50] <_joe_> how the hell didn't I notice? [12:41:42] (03PS5) 10Faidon Liambotis: labstore: un-hieraize elevator/ioscheduler boot-setting [puppet] - 10https://gerrit.wikimedia.org/r/296731 [12:41:44] (03PS5) 10Faidon Liambotis: cache: un-hieraize tcpmhash_entries boot setting [puppet] - 10https://gerrit.wikimedia.org/r/296730 [12:41:46] (03PS5) 10Faidon Liambotis: Create a new grub module [puppet] - 10https://gerrit.wikimedia.org/r/296729 [12:41:48] (03PS5) 10Faidon Liambotis: mediawiki: un-hieraize cgroup_enable boot-settings [puppet] - 10https://gerrit.wikimedia.org/r/296732 [12:43:34] (03CR) 10BBlack: [C: 031] Create a new grub module [puppet] - 10https://gerrit.wikimedia.org/r/296729 (owner: 10Faidon Liambotis) [12:43:45] thanks guys :) [12:44:03] (03PS2) 10ArielGlenn: fix age check of cirrus dump files for cleanup [puppet] - 10https://gerrit.wikimedia.org/r/298716 (https://phabricator.wikimedia.org/T138176) [12:50:34] apergos: _joe_: What's the latest? 
[12:50:59] looks like the specific page/revision was finally tracked down but there are likely to be others [12:52:28] I believe someone is looking at how to fix this up in the MW backend [12:52:59] !log CI is processing with Zuul 2.1.0-151-g30a433b. It might stop processing events at any time though due to T137525 [12:53:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:53:14] T137525: Investigate Zuul 2.1.0-151-g30a433b that stops processing Gerrit events - https://phabricator.wikimedia.org/T137525 [12:53:38] <_joe_> CP678|Laptop: you can safely restart your bot [12:54:10] Yay. :D [12:54:17] <_joe_> :) [12:54:23] <_joe_> thanks for the help btw [12:54:31] But I'm still doing the maintenance. :p [12:54:42] <_joe_> hehe ok, no rush [12:54:50] How about unblocking my bot, or can you have another admin do that for me? [12:55:02] (PS4) Giuseppe Lavagetto: puppet: add a function for performing conftool lookups [puppet] - https://gerrit.wikimedia.org/r/283151 [12:55:03] <_joe_> well, any admin is ok [12:55:05] any en wp admin can do that [12:55:16] <_joe_> I can vouch for it, but I'm not an admin on enwiki [12:55:19] I don't have the admin flag or I would have fixed that already [12:55:59] (PS1) Mforns: Add white-list for EventLogging auto-purging [puppet] - https://gerrit.wikimedia.org/r/298721 (https://phabricator.wikimedia.org/T108850) [12:56:02] <_joe_> sorry, brb [12:57:40] if it hasn't been done by the time lego ktm is back around then I will be pinging him [12:58:04] _joe_: always happy to cooperate. What was the problem?
[13:00:31] !log uploaded varnishkafka 1.0.11-1 to jessie-wikimedia experimental [13:00:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:00:39] <_joe_> CP678|Laptop: I can't get into the details atm [13:00:52] k [13:01:13] <_joe_> not because I don't want to, it's a potential security risk [13:01:33] !log upgrading cache maps to varnishkafka 1.0.11-1 [13:01:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:01:50] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/296730 (owner: 10Faidon Liambotis) [13:01:54] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/296731 (owner: 10Faidon Liambotis) [13:01:58] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/296732 (owner: 10Faidon Liambotis) [13:02:01] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/296729 (owner: 10Faidon Liambotis) [13:03:45] (03CR) 10BBlack: [C: 031] cache: un-hieraize tcpmhash_entries boot setting [puppet] - 10https://gerrit.wikimedia.org/r/296730 (owner: 10Faidon Liambotis) [13:05:30] paravoid: so I was nitpicking about how we use $ORIGIN in ops/dns the other day, which is what led to those simpler cleanups. In the reverse zones, $ORIGIN makes sense I think because there's not much better way to do it. The other remaining case is wmnet. [13:06:35] I had two ideas about making wmnet cleaner that are inter-related, but I dunno if you'd agree since it's all style/usability... [13:07:56] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [13:08:02] 1) for the case of eqiad.wmnet vs mgmt.eqiad.wmnet (well all the mgmt subdomains) - making those explicit rather than origin. In other words, there would be $ORIGIN eqiad.wmnet, and inside it "cp1055" and "cp1055.mgmt".
I like this because mgmt usually has duplicate hostnames with the upper subdomain, which is confusing and you have to backtrack to find origin if you're grepping [13:08:27] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [13:08:49] (03CR) 10ArielGlenn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/298716 (https://phabricator.wikimedia.org/T138176) (owner: 10ArielGlenn) [13:09:46] 2) Splitting wmnet into separate zonefiles with delegation, so we have zonefiles for "wmnet", "eqiad.wmnet", "codfw.wmnet", etc... it breaks up a huge file and seems cleaner to me, and gets rid of the rest of the ORIGIN statements for those cases. [13:10:21] (03CR) 10ArielGlenn: [C: 032] fix age check of cirrus dump files for cleanup [puppet] - 10https://gerrit.wikimedia.org/r/298716 (https://phabricator.wikimedia.org/T138176) (owner: 10ArielGlenn) [13:11:08] to me the upsides of reducing/eliminating $ORIGIN in there and making mgmt explicit within is organization and greppability. "git grep foohost" tells you in its output alone which subdomain it's in and which IP is mgmt vs non-mgmt. under the current system that's not clear without far-away (in linecount terms) $ORIGIN context. 
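The first proposal above (one $ORIGIN per site, with mgmt spelled out in the owner name) can be sketched as follows. This is only an illustration under stated assumptions: the `expand` helper is hypothetical, the records use a simplified "name type value" layout rather than full zone-file syntax, and the 192.0.2.x addresses are documentation placeholders, not real wmnet IPs.

```python
def expand(lines):
    """Expand relative owner names against the most recent $ORIGIN directive.

    Assumes simplified records of the form "name type value"; real zone
    files also allow TTLs, classes, comments and multi-line records.
    """
    origin = ""
    records = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("$ORIGIN"):
            # Remember the new origin, without its trailing dot.
            origin = line.split()[1].rstrip(".")
            continue
        name, rtype, value = line.split()
        records.append((f"{name}.{origin}", rtype, value))
    return records


# Layout as described for the current file: a separate $ORIGIN for mgmt,
# so the same bare hostname appears twice (placeholder addresses).
current = [
    "$ORIGIN eqiad.wmnet.",
    "cp1055 A 192.0.2.10",
    "$ORIGIN mgmt.eqiad.wmnet.",
    "cp1055 A 192.0.2.110",
]

# Proposed layout: one $ORIGIN, with mgmt explicit in the owner name.
proposed = [
    "$ORIGIN eqiad.wmnet.",
    "cp1055 A 192.0.2.10",
    "cp1055.mgmt A 192.0.2.110",
]

# Both layouts resolve to the same fully qualified names.
assert expand(current) == expand(proposed)
```

In the proposed layout a plain-text search for "cp1055" returns one line per record, and the line itself shows whether it is the mgmt address; in the current layout both matches read "cp1055" and only the distant $ORIGIN line tells them apart.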
[13:15:56] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:20:07] !log scheduling icinga downtime on elastic1001-1016 prior to decommissioning (T139758) [13:20:08] T139758: Decommission elastic1001-1016 - https://phabricator.wikimedia.org/T139758 [13:20:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:20:36] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:23:39] CP678|Laptop: Reedy has sysop on en.wiki iirc and can unblock [13:23:55] (if you ask nicely) [13:24:57] 06Operations, 06Services, 13Patch-For-Review, 15User-mobrovac: Updates various services to nodejs 4.4.6 - https://phabricator.wikimedia.org/T138561#2457703 (10ssastry) >>! In T138561#2457157, @mobrovac wrote: > @ssastry OK to proceed with the upgrade of ruthenium to node 4.4.6? There shouldn't be any impac... [13:29:23] !log disabling puppet on eqiad high-traffic2 lvs for rcstream cleanup [13:29:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:30:54] !log disabling puppet on rcs100[12] for rcstream cleanup [13:30:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:31:37] (03CR) 10Ottomata: "One small nit, other than that looks good to me, we can merge today." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/298487 (https://phabricator.wikimedia.org/T140095) (owner: 10Addshore) [13:31:44] Zuul/CI looks ok. 
I am having a short break [13:32:03] (03PS1) 10Muehlenhoff: Update to Linux 4.4.15 [debs/linux44] - 10https://gerrit.wikimedia.org/r/298727 [13:33:35] 06Operations, 06Services, 13Patch-For-Review, 15User-mobrovac: Updates various services to nodejs 4.4.6 - https://phabricator.wikimedia.org/T138561#2457740 (10MoritzMuehlenhoff) Now upgraded to 4.4.6 [13:34:10] (03PS1) 10Ottomata: Update cdh module with log aggregation retention check interval param [puppet] - 10https://gerrit.wikimedia.org/r/298728 [13:34:26] !log disabling puppet and stopping elasticsearch on elastic1001-1016 (T139758) [13:34:27] T139758: Decommission elastic1001-1016 - https://phabricator.wikimedia.org/T139758 [13:34:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:34:49] (03CR) 10Ottomata: [C: 032 V: 032] Update cdh module with log aggregation retention check interval param [puppet] - 10https://gerrit.wikimedia.org/r/298728 (owner: 10Ottomata) [13:36:15] 06Operations, 06Services, 13Patch-For-Review, 15User-mobrovac: Updates various services to nodejs 4.4.6 - https://phabricator.wikimedia.org/T138561#2457764 (10MoritzMuehlenhoff) 05Open>03Resolved Ah, so we can actually close the bug... [13:36:24] (03PS5) 10BBlack: Remove old rcstream public LVS config [puppet] - 10https://gerrit.wikimedia.org/r/298525 (https://phabricator.wikimedia.org/T134871) [13:36:32] (03CR) 10BBlack: [C: 032 V: 032] Remove old rcstream public LVS config [puppet] - 10https://gerrit.wikimedia.org/r/298525 (https://phabricator.wikimedia.org/T134871) (owner: 10BBlack) [13:41:08] 06Operations, 10DBA, 10Phabricator, 13Patch-For-Review: Upgrade m3 (phabricator) db servers - https://phabricator.wikimedia.org/T138460#2457774 (10mmodell) @jcrespo I think the phabricator_search tables are non-critical as they can be re-generated from other data. I think that at least some of the migrati... 
[13:41:09] !log upgrading hhvm on remaining appservers in eqiad and codfw [13:41:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:43:12] !log restarting pybal on primary eqiad high-traffic2 (lvs1002) [13:43:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:44:46] (03PS1) 10Yuvipanda: Do not use longform options to bash on webservice shell [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/298729 [13:44:53] (03PS2) 10Andrew Bogott: Replace labvirt1010 in the nova scheduling pool. [puppet] - 10https://gerrit.wikimedia.org/r/298653 [13:45:22] !log restarting hadoop nodemanagers to apply log aggregation retention check interval change [13:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:46:54] (03PS5) 10Gehel: Decommission old elasticsearch servers [puppet] - 10https://gerrit.wikimedia.org/r/295649 (https://phabricator.wikimedia.org/T138329) [13:47:39] (03PS6) 10Gehel: Decommission old elasticsearch servers [puppet] - 10https://gerrit.wikimedia.org/r/295649 (https://phabricator.wikimedia.org/T139758) [13:48:18] (03CR) 10Yuvipanda: [C: 032 V: 032] Do not use longform options to bash on webservice shell [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/298729 (owner: 10Yuvipanda) [13:49:23] 06Operations, 10DBA, 10Phabricator, 13Patch-For-Review: Upgrade m3 (phabricator) db servers - https://phabricator.wikimedia.org/T138460#2457801 (10chasemp) The idea was for `rt_migration` and `bugzilla_migration` to live "forever" or at least as long as anyone cared that at one time we used either. Those... 
[13:51:51] (03PS2) 10Muehlenhoff: pybal_config: Use PRODUCTION_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/297601 [13:53:25] (03PS4) 10BBlack: Remove old rcstream public LVS config in conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/298564 (https://phabricator.wikimedia.org/T134871) [13:53:32] (03CR) 10BBlack: [C: 032 V: 032] Remove old rcstream public LVS config in conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/298564 (https://phabricator.wikimedia.org/T134871) (owner: 10BBlack) [13:53:43] (03PS2) 10BBlack: remove rcstream lvs::realserver config [puppet] - 10https://gerrit.wikimedia.org/r/298566 (https://phabricator.wikimedia.org/T134871) [13:54:32] (03PS7) 10Gehel: Decommission old elasticsearch servers [puppet] - 10https://gerrit.wikimedia.org/r/295649 (https://phabricator.wikimedia.org/T139758) [13:57:10] (03CR) 10BBlack: [C: 032 V: 032] remove rcstream lvs::realserver config [puppet] - 10https://gerrit.wikimedia.org/r/298566 (https://phabricator.wikimedia.org/T134871) (owner: 10BBlack) [13:58:26] !log T137525 reverted Zuul back to zuul_2.1.0-95-g66c8e52-wmf1precise1_amd64.deb . 
It could not connect to Gerrit reliably [13:58:27] T137525: Investigate Zuul 2.1.0-151-g30a433b that stops processing Gerrit events - https://phabricator.wikimedia.org/T137525 [13:58:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:58:43] !log cleanup puppet / salt from old elasticsearch servers elastic1001-1016 (T139758) [13:58:44] T139758: Decommission elastic1001-1016 - https://phabricator.wikimedia.org/T139758 [13:58:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:03:45] PROBLEM - Disk space on elastic2006 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 109525 MB (15% inode=99%) [14:05:54] !log rcstream cleanup done, puppet re-enabled on relevant lvs and rcs100x [14:05:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:06:56] 06Operations, 10DBA, 10Phabricator, 13Patch-For-Review: Upgrade m3 (phabricator) db servers - https://phabricator.wikimedia.org/T138460#2457840 (10jcrespo) > The idea was for rt_migration and bugzilla_migration to live "forever" or at least as long as anyone cared that at one time we used either. That is... 
[14:07:56] (03CR) 10Addshore: Move stats::wmde cron files to analytics/wmde/scripts repo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/298487 (https://phabricator.wikimedia.org/T140095) (owner: 10Addshore) [14:07:58] ottomata: ^^ [14:08:40] (03PS9) 10Ottomata: Move stats::wmde cron files to analytics/wmde/scripts repo [puppet] - 10https://gerrit.wikimedia.org/r/298487 (https://phabricator.wikimedia.org/T140095) (owner: 10Addshore) [14:08:55] (03PS8) 10Gehel: Decommission old elasticsearch servers [puppet] - 10https://gerrit.wikimedia.org/r/295649 (https://phabricator.wikimedia.org/T139758) [14:09:04] (03CR) 10Ottomata: [C: 032 V: 032] Move stats::wmde cron files to analytics/wmde/scripts repo [puppet] - 10https://gerrit.wikimedia.org/r/298487 (https://phabricator.wikimedia.org/T140095) (owner: 10Addshore) [14:09:24] !log Dropping legacy system_auth tables in staging to complete RBAC conversion : T139639 [14:09:25] T139639: Cassandra 2.2.6 post-upgrade checklist - https://phabricator.wikimedia.org/T139639 [14:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:09:45] 06Operations, 10ops-codfw, 06Labs: labtestneutron2001.codfw.wmnet does not appear to be reachable - https://phabricator.wikimedia.org/T132302#2457845 (10MoritzMuehlenhoff) Why are we keeping this open? The host is reimaged and I'm able to connect to it? [14:11:34] (03PS1) 10KartikMistry: Deploy Compact Language Links as default (Stage 4.5) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298735 (https://phabricator.wikimedia.org/T136677) [14:13:39] !log about to deploy updated kartotherian & tilerator for node 4.4.6 [14:13:43] gehel, ^ [14:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:13:49] thcipriani, ^ [14:13:55] yurik: yay! [14:14:07] gehel, the server is already 446, right? 
[14:14:15] yurik: yes [14:14:43] yurik: and you have canaries anyway, so we'll know before we crash the whole cluster [14:16:09] (03PS3) 10Andrew Bogott: Replace labvirt1010 in the nova scheduling pool. [puppet] - 10https://gerrit.wikimedia.org/r/298653 [14:16:28] !log oblivian@palladium conftool action : delete; selector: cluster=rcstream [14:16:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:16:34] !log shutting down elastic1001-1016 (T139758) [14:16:35] T139758: Decommission elastic1001-1016 - https://phabricator.wikimedia.org/T139758 [14:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:17:37] !log oblivian@palladium conftool action : delete; selector: cluster=rcstream [14:17:55] _joe_: around? [14:18:03] <_joe_> Amir1: yes [14:18:05] (03PS1) 10Muehlenhoff: Use a ferm service for contint [puppet] - 10https://gerrit.wikimedia.org/r/298737 [14:18:33] _joe_: I talked with Aaron, for now. I want to make a patch to fix the issue [14:18:42] please merge it [14:19:07] <_joe_> a patch to what to fix which issue? [14:20:10] (03PS1) 10Ladsgroup: ores: reduce uwsgi workers to 2/3 [puppet] - 10https://gerrit.wikimedia.org/r/298739 [14:20:13] (03CR) 10Andrew Bogott: [C: 032] Replace labvirt1010 in the nova scheduling pool. 
[puppet] - 10https://gerrit.wikimedia.org/r/298653 (owner: 10Andrew Bogott) [14:20:34] _joe_: https://gerrit.wikimedia.org/r/298739 to fix memory pressure on scb [14:20:57] <_joe_> s/fix/contain/ [14:21:10] (03PS2) 10Giuseppe Lavagetto: ores: reduce uwsgi workers to 2/3 [puppet] - 10https://gerrit.wikimedia.org/r/298739 (owner: 10Ladsgroup) [14:21:30] !log reboot ms-be1012, many mkfs.xfs stuck on broken sdh [14:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:21:36] thanks :) [14:21:55] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 10Elasticsearch, and 2 others: Decommission elastic1001-1016 - https://phabricator.wikimedia.org/T139758#2457870 (10Gehel) a:05Gehel>03RobH elastic1001-1016 have been shutdown. I followed the [[ https://wikitech.wikimedia.org/wiki/Server_Lifecycle#... [14:22:06] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] ores: reduce uwsgi workers to 2/3 [puppet] - 10https://gerrit.wikimedia.org/r/298739 (owner: 10Ladsgroup) [14:22:37] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 10Elasticsearch, and 3 others: Decommission elastic1001-1016 - https://phabricator.wikimedia.org/T139758#2457873 (10Gehel) [14:25:44] <_joe_> Amir1: done [14:25:44] !log depooling mw1298 (image scaler) for some tests [14:25:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:26:02] _joe_: thanks. Do you want to run puppet agent in scb nodes? 
[14:26:07] (03PS2) 10KartikMistry: Deploy Compact Language Links as default (Stage 4.5) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298735 (https://phabricator.wikimedia.org/T136677) [14:26:14] <_joe_> Amir1: I already did that [14:26:21] thanks [14:26:21] <_joe_> that's what I meant with "done" [14:26:38] https://grafana.wikimedia.org/dashboard/db/ores [14:26:52] the memory started to go up [14:28:29] (03CR) 10Andrew Bogott: "> Doesn't deleting instances in horizon and creating them in horizon also break wikitech compatibility?" [puppet] - 10https://gerrit.wikimedia.org/r/298538 (https://phabricator.wikimedia.org/T139768) (owner: 10Andrew Bogott) [14:29:37] !log deployed kartotherian https://gerrit.wikimedia.org/r/#/c/298731/ & tilerator https://gerrit.wikimedia.org/r/#/c/298732/ [14:29:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:29:46] cc mobrovac gehel [14:30:03] (node 4.4.6) [14:30:45] (03PS1) 10Addshore: stats:wmde pass ${scripts_dir} into cron scripts [puppet] - 10https://gerrit.wikimedia.org/r/298743 [14:30:47] 06Operations, 10Incident-20151216-Labs-NFS, 06Labs: Investigate need and candidate for labstore100(1|2) kernel upgrade - https://phabricator.wikimedia.org/T121903#2457900 (10MoritzMuehlenhoff) Both hosts should be migrated to Linux 4.4, 3.19 is deprecated at this point. [14:31:26] ottomata: I literally don't believe this but no matter how much I looked at that last patch I missed this https://gerrit.wikimedia.org/r/#/c/298743/ (which was kind of the whole point of the patch).... [14:32:19] Hmm... 
looks like something else is increasing the memory pressure too [14:32:35] !log Restarting RESTBase on xenon.eqiad.wmnet : T139639 [14:32:36] T139639: Cassandra 2.2.6 post-upgrade checklist - https://phabricator.wikimedia.org/T139639 [14:32:39] We didn't recover as much memory as last time with the restart [14:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:32:47] About 10% difference [14:33:40] 06Operations, 10Traffic: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502#2457902 (10ema) a:03ema [14:34:36] (03PS1) 10Ema: cache_upload VCL forward-port to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/298744 (https://phabricator.wikimedia.org/T131502) [14:34:53] <_joe_> halfak: yes, we raised the memory used by mobileapps [14:35:19] (03PS3) 10Giuseppe Lavagetto: puppetmaster: correct puppettization of the private repo [puppet] - 10https://gerrit.wikimedia.org/r/298258 (https://phabricator.wikimedia.org/T98173) [14:35:42] _joe_, gotcha [14:35:46] (03CR) 10Chad: [C: 031] "lgtm, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/298737 (owner: 10Muehlenhoff) [14:36:08] _joe_, I heard that there was an unofficial deadline put on ORES memory issues [14:36:21] That deadline lands square in the middle of my upcoming vacation. [14:36:37] <_joe_> halfak: before we think of offloading it off of scb [14:36:47] Oh. 
Let's think about that anyway :) [14:36:59] _joe_, this is only going to get worse over time [14:37:12] We'll be implementing some strategies for mitigating, but ORES is memory hungry [14:37:18] <_joe_> I mean we can wait a week or so before doing that I think, if you're confident you can reduce the memory footprint [14:37:22] <_joe_> halfak: understood [14:37:23] As we support more wikis with more prediction models, we'll need more memory [14:37:42] <_joe_> ok, then we'll be on the quest to find the budget for that :) [14:38:20] !log Restarting Cassandra on xenon.eqiad.wmnet : T139639 [14:38:20] I'll talk to Dario then too. [14:38:21] T139639: Cassandra 2.2.6 post-upgrade checklist - https://phabricator.wikimedia.org/T139639 [14:38:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:38:38] I'm guessing that this can' [14:38:44] t be a spare-hardware hack [14:38:55] We'll need to purchase some new machines? [14:38:56] <_joe_> halfak: I guess not [14:39:08] (03PS2) 10Ema: cache_upload VCL forward-port to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/298744 (https://phabricator.wikimedia.org/T131502) [14:39:20] <_joe_> or use some spares, but those don't come free either, IIRC :) [14:39:31] <_joe_> halfak: I have to go afk for a few [14:39:40] <_joe_> actually I was leaving when you pinged me [14:39:46] <_joe_> can we continue this later? [14:41:38] (03CR) 10Ema: "noop on v3 https://puppet-compiler.wmflabs.org/3325/" [puppet] - 10https://gerrit.wikimedia.org/r/298744 (https://phabricator.wikimedia.org/T131502) (owner: 10Ema) [14:42:19] _joe_, totally. 
[14:42:21] Thanks :) [14:44:13] !log Starting offset dump runs from {xenon,cerium,praseodymium}.eqiad.wmnet : T139639 [14:44:14] T139639: Cassandra 2.2.6 post-upgrade checklist - https://phabricator.wikimedia.org/T139639 [14:44:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:44:38] RECOVERY - MegaRAID on ms-be1012 is OK: OK: optimal, 13 logical, 13 physical [14:46:57] mutante: hey, tell me when you have some time. I have two questions [14:47:04] 06Operations, 06Discovery, 10Elasticsearch, 03Discovery-Search-Sprint, 13Patch-For-Review: Install and configure new elasticsearch servers in eqiad - https://phabricator.wikimedia.org/T138329#2457963 (10Gehel) [14:49:12] zhuyifei1999_ and andrewbogott: hi, so you plan SSD migration right now? [14:49:29] it's not urgent [14:50:10] Dereckson: I think some files need to be transferred first? [14:50:29] andrewbogott: I'm currently fetching them to terbium [14:50:37] will ping when done [14:50:40] (03CR) 10Paladox: [C: 031] Add missing roottree, file configs to gerrit.config.erb [puppet] - 10https://gerrit.wikimedia.org/r/298710 (owner: 10Paladox) [14:50:41] cool [14:50:43] k [14:50:47] (03PS4) 10Paladox: Add missing roottree, file configs to gerrit.config.erb [puppet] - 10https://gerrit.wikimedia.org/r/298710 [14:51:38] Terbium through webproxy.eqiad.wmnet:8080 downloads at 20 Mbps [14:52:25] 90 seconds for 1926M, 20 MBps [14:52:33] (03CR) 10EBernhardson: [C: 031] "looks good, so unimportant suggestions on simplifying regex's" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/295649 (https://phabricator.wikimedia.org/T139758) (owner: 10Gehel) [14:53:30] 06Operations, 10ops-codfw, 06Labs: labtestneutron2001.codfw.wmnet does not appear to be reachable - https://phabricator.wikimedia.org/T132302#2457977 (10Andrew) 05Open>03Resolved [14:54:57] PROBLEM - HHVM rendering on mw1153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:55:26] jouncebot: next [14:55:27] In 0
hour(s) and 4 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160713T1500) [14:55:38] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:56:19] (03CR) 10Mobrovac: "Some minor nits, LGTM otherwise" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/283151 (owner: 10Giuseppe Lavagetto) [14:56:49] RECOVERY - HHVM rendering on mw1153 is OK: HTTP OK: HTTP/1.1 200 OK - 67988 bytes in 0.152 second response time [14:57:30] zhuyifei1999_: andrewbogott: done, and files are on terbium and the checksums have been checked [14:57:46] ok [14:58:28] <_joe_> morebots: thanks [14:58:28] I am a logbot running on tools-exec-1211. [14:58:28] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [14:58:28] To log a message, type !log . [14:58:33] <_joe_> argh [14:58:36] <_joe_> mobrovac: [14:58:39] <_joe_> :) [14:58:46] :) [14:58:53] Dereckson: ok! I'll move that VM in a moment then [14:58:59] well by 'move' I mean 'destroy and recreate' :) [14:59:04] _joe_: we could just do a The Simpsons and rename mobrovac [14:59:13] >.> [14:59:38] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [14:59:54] <_joe_> p858snake: lol [15:00:05] anomie, ostriches, thcipriani, hashar, twentyafterfour, and zeljkof: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160713T1500). Please do the needful. [15:00:05] Amir1: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:00:24] o/ [15:00:40] very nice, high five [15:01:02] <_joe_> Reedy: ? 
[15:01:10] high five :) [15:01:14] borat/ o/ [15:02:38] RECOVERY - Disk space on elastic2006 is OK: DISK OK [15:02:59] I can SWAT today, pairing with zeljkof :) [15:03:17] thcipriani: hey, thanks [15:03:44] you know already about the process of deploying ores review tool :D [15:05:42] Amir1: yup :) [15:07:51] (03CR) 10Gehel: Decommission old elasticsearch servers (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/295649 (https://phabricator.wikimedia.org/T139758) (owner: 10Gehel) [15:11:37] !log Stopping Staging dumps : T139639 [15:11:39] T139639: Cassandra 2.2.6 post-upgrade checklist - https://phabricator.wikimedia.org/T139639 [15:11:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:12:17] _joe_: a heads up. We might be able to fix the "memory leak" issue with ores sooner than we thought. it would free about 3 GBs (10%) [15:14:24] (03PS7) 10Gehel: logstash: Update default mappings for Elasticsearch 2.x [puppet] - 10https://gerrit.wikimedia.org/r/298295 (https://phabricator.wikimedia.org/T136001) (owner: 10BryanDavis) [15:14:39] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298715 (https://phabricator.wikimedia.org/T139992) (owner: 10Ladsgroup) [15:14:51] (03CR) 10Ottomata: [C: 032] stats:wmde pass ${scripts_dir} into cron scripts [puppet] - 10https://gerrit.wikimedia.org/r/298743 (owner: 10Addshore) [15:15:28] (03Merged) 10jenkins-bot: Enable ORES review tool for Turkish Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298715 (https://phabricator.wikimedia.org/T139992) (owner: 10Ladsgroup) [15:16:24] (03PS1) 10MarcoAurelio: Closing wikimania2015wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298772 (https://phabricator.wikimedia.org/T139032) [15:17:31] Amir1: pulled down code to mw1017, creating tables now. 
[15:17:39] thcipriani: thanks [15:17:53] (03CR) 10Gehel: "Puppet compiler: https://puppet-compiler.wmflabs.org/3326/" [puppet] - 10https://gerrit.wikimedia.org/r/298295 (https://phabricator.wikimedia.org/T136001) (owner: 10BryanDavis) [15:19:24] (03CR) 10MarcoAurelio: "Can this be abandoned? This change ain't going to happen." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239455 (https://phabricator.wikimedia.org/T113096) (owner: 10Platonides) [15:19:32] 06Operations, 15User-zeljkofilipin, 05WMF-NDA: Add zeljkof to #mediawiki_security IRC channel - https://phabricator.wikimedia.org/T140225#2458078 (10RobH) 05Open>03Resolved Done! It is based on your cloak "wikimedia/zeljko-filipin-wmf". If this changes, please let us know, since you'll find yourself un... [15:19:57] (03CR) 10Gehel: "Puppet compiler: https://puppet-compiler.wmflabs.org/3327/" [puppet] - 10https://gerrit.wikimedia.org/r/298381 (owner: 10BryanDavis) [15:20:21] (03PS1) 10Yuvipanda: tools: Add checks for k8s and flannel etcds [puppet] - 10https://gerrit.wikimedia.org/r/298774 (https://phabricator.wikimedia.org/T140247) [15:20:24] (03CR) 10Gehel: "Puppet compiler: https://puppet-compiler.wmflabs.org/3328/" [puppet] - 10https://gerrit.wikimedia.org/r/298382 (owner: 10BryanDavis) [15:20:30] thcipriani: I can confirm, non-db parts work just fine [15:20:48] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] [15:21:05] Amir1: ack, so trying to run maintenance script I see: Starting...Fatal error: Class 'ORES\Api' not found in /srv/mediawiki/php-1.28.0-wmf.8/extensions/ORES/maintenance/CheckModelVersions.php on line 54 ?
[15:21:20] running: thcipriani@terbium:~$ mwscript extensions/ORES/maintenance/CheckModelVersions.php trwiki [15:21:46] 5xx at 3K/m [15:21:46] 06Operations, 10RESTBase-Cassandra: provide restbase with systemd unit files - https://phabricator.wikimedia.org/T106806#2458086 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi this happened ``` restbase1014:~$ systemctl status restbase ● restbase.service - "restbase service" Loaded: loaded (/lib/sys... [15:21:49] should not happen [15:22:05] strange [15:22:14] maybe you need to run from tin? [15:22:28] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] [15:23:28] thcipriani: is it synced? [15:23:47] logstash didn't say anything [15:24:00] Amir1: ack, new code was not on terbium, running from tin worked fine. Trying to do mw1017 only deploy first. [15:24:34] Amir1: populateDatabase is running currently [15:24:43] nice [15:25:05] thcipriani: I can confirm it's working: https://tr.wikipedia.org/w/index.php?title=%C3%96zel:SonDe%C4%9Fi%C5%9Fiklikler&hidenondamaging=1 [15:25:14] (enable it as a beta feature in mw1017) [15:25:37] Amir1: kk, I will sync everywhere when PopulateDatabase completes. [15:25:48] awesome [15:25:50] (03PS8) 10Gehel: logstash: Update default mappings for Elasticsearch 2.x [puppet] - 10https://gerrit.wikimedia.org/r/298295 (https://phabricator.wikimedia.org/T136001) (owner: 10BryanDavis) [15:27:08] (03CR) 10Alex Monk: "Shouldn't those same mechanisms handle renames then?" 
[puppet] - 10https://gerrit.wikimedia.org/r/298538 (https://phabricator.wikimedia.org/T139768) (owner: 10Andrew Bogott) [15:27:10] (03PS2) 10Ottomata: Fix the output directory for multimedia reports [puppet] - 10https://gerrit.wikimedia.org/r/298605 (https://phabricator.wikimedia.org/T140121) (owner: 10Mforns) [15:27:21] (03CR) 10Ottomata: [C: 032 V: 032] Fix the output directory for multimedia reports [puppet] - 10https://gerrit.wikimedia.org/r/298605 (https://phabricator.wikimedia.org/T140121) (owner: 10Mforns) [15:28:25] (03PS2) 10Chad: Fix changeid link in gerrit 2.12 [puppet] - 10https://gerrit.wikimedia.org/r/298709 (owner: 10Paladox) [15:28:51] (03PS6) 10Mobrovac: Parsoid: Move to service::node [puppet] - 10https://gerrit.wikimedia.org/r/298436 (https://phabricator.wikimedia.org/T90668) [15:30:00] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:298715|Enable ORES review tool for Turkish Wikipedia (T139992)]] (duration: 00m 28s) [15:30:01] T139992: Deploy ORES review tool in Turkish Wikipedia - https://phabricator.wikimedia.org/T139992 [15:30:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:30:07] ^ Amir1 check live please [15:30:13] \o/ [15:30:19] sure [15:30:56] thcipriani: it's okay [15:30:58] thanks [15:31:05] Amir1: cool, thanks for checking :) [15:31:07] 06Operations, 06DC-Ops, 10Continuous-Integration-Infrastructure (phase-out-gallium): Can scandium.eqiad.wmnet receives a couple 500G hard drive in a RAID 1 array? - https://phabricator.wikimedia.org/T138955#2458103 (10hashar) [15:31:13] 06Operations, 06DC-Ops, 10Continuous-Integration-Infrastructure (phase-out-gallium): Can scandium.eqiad.wmnet receives a couple 500G hard drive in a RAID 1 array? - https://phabricator.wikimedia.org/T138955#2414901 (10hashar) 05stalled>03declined Change of .plan. We are heading toward using contint1001 t... 
[15:32:39] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:32:58] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:34:37] 06Operations: Jessie imaging installs nfs-common needlessly - https://phabricator.wikimedia.org/T107412#1494262 (10fgiunchedi) there's also a puppet class now to explicitly disable nfs client services, `base::no_nfs_client` added in https://gerrit.wikimedia.org/r/#/c/271274/ though for jessie it should be tracke... [15:35:41] (03PS5) 10Paladox: Add missing roottree, file configs to gerrit.config.erb [puppet] - 10https://gerrit.wikimedia.org/r/298710 [15:36:05] (03PS2) 10Yuvipanda: tools: Add checks for k8s and flannel etcds [puppet] - 10https://gerrit.wikimedia.org/r/298774 (https://phabricator.wikimedia.org/T140247) [15:36:21] (03PS3) 10Yuvipanda: tools: Add checks for k8s and flannel etcds [puppet] - 10https://gerrit.wikimedia.org/r/298774 (https://phabricator.wikimedia.org/T140247) [15:36:27] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Add checks for k8s and flannel etcds [puppet] - 10https://gerrit.wikimedia.org/r/298774 (https://phabricator.wikimedia.org/T140247) (owner: 10Yuvipanda) [15:38:19] greg-g: hi! Am I correct in understanding that today the train actually moves wmf.8 -> wmf.10 for groups 0 and 1, while group 2 will be on wmf.8 until tomorrow? [15:38:40] AndyRussG: yep [15:38:54] wmf.9 had a good run from Thurs to Mon, but its time is done :) [15:39:16] (03CR) 10Andrew Bogott: "> Shouldn't those same mechanisms handle renames then?" [puppet] - 10https://gerrit.wikimedia.org/r/298538 (https://phabricator.wikimedia.org/T139768) (owner: 10Andrew Bogott) [15:40:18] greg-g: cool, thx!
[15:40:59] (03PS1) 10Elukey: Remove cronspam coming from Gerrit log deletion [puppet] - 10https://gerrit.wikimedia.org/r/298779 (https://phabricator.wikimedia.org/T132324) [15:42:26] (03CR) 10Alex Monk: [C: 031] "Alright, well, this looks like it might work." [puppet] - 10https://gerrit.wikimedia.org/r/298538 (https://phabricator.wikimedia.org/T139768) (owner: 10Andrew Bogott) [15:43:03] (03PS1) 10Mobrovac: [Beta] Parsoid: direct traffic to deployment-parsoid07 [puppet] - 10https://gerrit.wikimedia.org/r/298780 (https://phabricator.wikimedia.org/T90668) [15:44:38] (03PS3) 10Chad: Fix changeid link in gerrit 2.12 [puppet] - 10https://gerrit.wikimedia.org/r/298709 (owner: 10Paladox) [15:45:33] (03PS1) 10Yuvipanda: tools: Add flannel/k8s etcd checks to icinga [puppet] - 10https://gerrit.wikimedia.org/r/298781 (https://phabricator.wikimedia.org/T140247) [15:45:40] (03PS1) 10Mobrovac: [Beta] Parsoid: direct traffic to deployment-parsoid07 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298782 (https://phabricator.wikimedia.org/T90668) [15:46:34] (03PS2) 10Yuvipanda: tools: Add flannel/k8s etcd checks to icinga [puppet] - 10https://gerrit.wikimedia.org/r/298781 (https://phabricator.wikimedia.org/T140247) [15:47:11] thcipriani: ami too late for the swat party? [15:47:33] mobrovac: still have a little bit of this window left [15:47:34] i've got a beta-only change [15:47:45] https://gerrit.wikimedia.org/r/#/c/298782/ ? [15:48:03] indeed :) [15:48:11] np [15:48:16] i'll put it on the calendar [15:48:18] thnx thcipriani! [15:48:29] thcipriani: I think you can take a little longer if needed [15:48:30] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298782 (https://phabricator.wikimedia.org/T90668) (owner: 10Mobrovac) [15:50:53] (03CR) 10Andrew Bogott: [C: 032] "yep, tested on labtest." 
[puppet] - 10https://gerrit.wikimedia.org/r/298538 (https://phabricator.wikimedia.org/T139768) (owner: 10Andrew Bogott) [15:51:00] (03PS2) 10Andrew Bogott: Disable the UpdateInstanceInfo tab. [puppet] - 10https://gerrit.wikimedia.org/r/298538 (https://phabricator.wikimedia.org/T139768) [15:51:05] hmm zuul is thinking about that one for a while, don't think I missed any dependencies... [15:53:27] 06Operations, 10media-storage: investigate swift used space spikes since June 2016 - https://phabricator.wikimedia.org/T140075#2458231 (10fgiunchedi) I've ran a series of queries against commonswiki.image for biggest uploads by day by user, e.g. `select img_user_text, sum(img_size) from image where img_timesta... [15:53:45] (03PS2) 10Mforns: Add white-list for EventLogging auto-purging [puppet] - 10https://gerrit.wikimedia.org/r/298721 (https://phabricator.wikimedia.org/T108850) [15:54:17] (03PS1) 10Elukey: Reduce cronspam from terbium related to the echo_mail_batch cron script [puppet] - 10https://gerrit.wikimedia.org/r/298785 (https://phabricator.wikimedia.org/T132324) [15:55:44] (03PS1) 10Ladsgroup: Change ORES thresholds in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298786 [15:56:15] (03Merged) 10jenkins-bot: [Beta] Parsoid: direct traffic to deployment-parsoid07 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298782 (https://phabricator.wikimedia.org/T90668) (owner: 10Mobrovac) [15:56:28] thcipriani: I have this simple patch for beta (hard threshold should be lower than soft) https://gerrit.wikimedia.org/r/298786 just need a +2 [15:56:53] it would be great [15:56:57] (03CR) 10Giuseppe Lavagetto: puppet: add a function for performing conftool lookups (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/283151 (owner: 10Giuseppe Lavagetto) [15:57:01] (03CR) 10Alex Monk: [C: 04-1] "We probably want to log such errors, but should fix the issue with wikitech settings on hosts that aren't silver/labtestweb2001 instead." 
[puppet] - 10https://gerrit.wikimedia.org/r/298785 (https://phabricator.wikimedia.org/T132324) (owner: 10Elukey) [15:57:13] (03PS5) 10Giuseppe Lavagetto: puppet: add a function for performing conftool lookups [puppet] - 10https://gerrit.wikimedia.org/r/283151 [15:57:25] (03CR) 10Alex Monk: "On the other hand it probably shouldn't go to root..." [puppet] - 10https://gerrit.wikimedia.org/r/298785 (https://phabricator.wikimedia.org/T132324) (owner: 10Elukey) [15:57:58] (03PS6) 10Paladox: Add missing roottree, file configs to gerrit.config.erb [puppet] - 10https://gerrit.wikimedia.org/r/298710 [15:58:04] Amir1: looking [15:58:37] (03PS2) 10Thcipriani: Change ORES thresholds in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298786 (owner: 10Ladsgroup) [15:59:38] 06Operations: Debian repository supporting multiple package versions - https://phabricator.wikimedia.org/T115758#2458271 (10Eevans) Perhaps another alternative, would be to lightly automate the creation of [[ https://wiki.debian.org/HowToSetupADebianRepository#Debian_Repository_Types | trivial APT repos ]] using... [15:59:43] 06Operations, 10Monitoring: Icinga RAID check: monitor rebuild status - https://phabricator.wikimedia.org/T83476#2458272 (10fgiunchedi) [15:59:57] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298786 (owner: 10Ladsgroup) [16:00:01] (03CR) 10Mobrovac: [C: 031] "Cherry-picked in beta and works." [puppet] - 10https://gerrit.wikimedia.org/r/298780 (https://phabricator.wikimedia.org/T90668) (owner: 10Mobrovac) [16:00:04] ejegg and AndyRussG: Respected human, time to deploy CentralNotice update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160713T1600). Please do the needful. [16:00:45] thcipriani: thanks. Does jenkins deploy it to beta later? 
[16:00:48] (03CR) 10Rush: [C: 032] "beta only change and already verified, thanks mobrovac" [puppet] - 10https://gerrit.wikimedia.org/r/298780 (https://phabricator.wikimedia.org/T90668) (owner: 10Mobrovac) [16:01:06] oh, wikiversions-inuse is neat! [16:01:11] (03CR) 10Elukey: "I am happy with every solution, I've seen "> /dev/null" and I thought those notifications were not important :)" [puppet] - 10https://gerrit.wikimedia.org/r/298785 (https://phabricator.wikimedia.org/T132324) (owner: 10Elukey) [16:01:14] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to stat1003, stat1002 and bast1001 for JMinor - https://phabricator.wikimedia.org/T111872#2458283 (10fgiunchedi) [16:01:16] Amir1: yeah, the beta-scap-eqiad job will deploy to beta shortly [16:01:20] !log thcipriani@tin Synchronized wmf-config/LabsServices.php: SWAT: [[gerrit:298782|[Beta] Parsoid: direct traffic to deployment-parsoid07]] (duration: 00m 26s) [16:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:01:26] (03Merged) 10jenkins-bot: Change ORES thresholds in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298786 (owner: 10Ladsgroup) [16:01:38] mobrovac: your patch was merged and should go out with beta-scap-eqiad shortly :) [16:01:39] awesome [16:01:40] thanks [16:01:48] thnx thcipriani [16:01:51] (03CR) 10Alex Monk: "stdout likely isn't, stderr is more important" [puppet] - 10https://gerrit.wikimedia.org/r/298785 (https://phabricator.wikimedia.org/T132324) (owner: 10Elukey) [16:02:00] thcipriani: I think we're actually waiting for zuul... let us know when the SWAT is all complete? [16:02:18] AndyRussG: ack. One last noop sync. 
[16:02:20] elukey, people keep complaining about the issues with WikitechPrivateSettings outside of wikitech hosts [16:02:57] maybe we should only include it if it exists [16:03:08] PROBLEM - Disk space on elastic2006 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 107497 MB (15% inode=99%) [16:03:21] or make an empty one on non-wikitech hosts [16:03:23] not sure which [16:03:31] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings-labs.php: SWAT: [[gerrit:298786|[Beta] Change ORES thresholds in beta]] (duration: 00m 29s) [16:03:34] AndyRussG: SWAT done. [16:03:34] Krenair: ahhh okok will have a better look tomorrow then! [16:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:03:53] thcipriani: thx! [16:04:20] [abusive Spanish-language spam message, posted 11 times between 16:04:20 and 16:04:35; removed] [16:04:28] !ops 
[16:05:01] 06Operations, 06Performance-Team, 10Traffic: Split stats/metrics by cache cluster - https://phabricator.wikimedia.org/T109378#2458286 (10fgiunchedi) 05Open>03Invalid we do have these stats now under `varnish.` in graphite and grouped by cluster. Tentatively resolving, unless there are some missing stats. [16:05:02] I wonder why Sigyn doesn't appear to be working in here like it does -ops [16:05:25] mobrovac: all mobileapps alerts are now CRITICAL again -- why aren't we treating this like an outage? [16:05:39] Krenair: maybe someone needs to told the bot that? [16:05:47] *to tell [16:06:05] 06Operations, 10hardware-requests, 10Continuous-Integration-Infrastructure (phase-out-gallium): Allocate contint1001 to releng and reimage it in labs-support-network - https://phabricator.wikimedia.org/T140257#2458291 (10hashar) [16:06:10] 06Operations, 06Performance-Team, 10Traffic: Split stats/metrics by cache cluster - https://phabricator.wikimedia.org/T109378#2458304 (10BBlack) Yeah good call. We don't have TLS broken down, but IMHO it's not that important. We did get status codes broken down in https://grafana.wikimedia.org/dashboard/db... 
[16:06:21] paravoid: transient, just ran the check on the nodes and gotten All endpoints are healthy [16:06:44] (03PS3) 10Yuvipanda: tools: Add flannel/k8s etcd checks to icinga [puppet] - 10https://gerrit.wikimedia.org/r/298781 (https://phabricator.wikimedia.org/T140247) [16:06:50] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Add flannel/k8s etcd checks to icinga [puppet] - 10https://gerrit.wikimedia.org/r/298781 (https://phabricator.wikimedia.org/T140247) (owner: 10Yuvipanda) [16:07:51] 06Operations: Upgrade phpredis client on zend - https://phabricator.wikimedia.org/T112694#2458313 (10fgiunchedi) 05Open>03Invalid looks like related {T86081} is essentially completed, for all things production anyway. [16:07:54] root@neon:/var/log/icinga# grep mobileapps icinga.log |grep -c CRITICAL [16:07:57] 399 [16:08:00] "transient" [16:08:43] it's just being reported critical 399 times in a day by a 5min interval check.. [16:09:25] (03PS1) 10Yuvipanda: tools: Allow all users read only access to k8s node objects [puppet] - 10https://gerrit.wikimedia.org/r/298787 (https://phabricator.wikimedia.org/T140248) [16:09:41] (03PS2) 10Yuvipanda: tools: Allow all users read only access to k8s node objects [puppet] - 10https://gerrit.wikimedia.org/r/298787 (https://phabricator.wikimedia.org/T140248) [16:11:50] (03PS2) 10MarcoAurelio: Closing wikimania2015wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298772 (https://phabricator.wikimedia.org/T139032) [16:12:09] [abusive Spanish-language spam message; removed] [16:13:18] (03PS9) 10Gehel: Decommission old elasticsearch servers [puppet] - 10https://gerrit.wikimedia.org/r/295649 (https://phabricator.wikimedia.org/T139758) [16:13:33] (03PS1) 10Cmjohnson: Adding production dns for labsdb1009-1011 [dns] - 10https://gerrit.wikimedia.org/r/298788 [16:14:14] 06Operations: operations/software/conftool fails tox-py27-jessie - 
https://phabricator.wikimedia.org/T112853#1648216 (10fgiunchedi) I see `tox-jessie` passing now for conftool, e.g. in https://gerrit.wikimedia.org/r/#/c/288632/ so perhaps no longer an issue? [16:15:29] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:16:18] 06Operations, 10hardware-requests, 10Continuous-Integration-Infrastructure (phase-out-gallium): Allocate contint1001 to releng and reimage it in labs-support-network - https://phabricator.wikimedia.org/T140257#2458352 (10chasemp) p:05Triage>03High [16:17:29] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [16:19:23] (03CR) 10Faidon Liambotis: [C: 04-1] puppetmaster: correct puppettization of the private repo (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/298258 (https://phabricator.wikimedia.org/T98173) (owner: 10Giuseppe Lavagetto) [16:19:56] !log CI slightly overloaded / backloaded due to a long tail of Wikibase changes sent in Gerrit. [16:20:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:20:15] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Allow all users read only access to k8s node objects [puppet] - 10https://gerrit.wikimedia.org/r/298787 (https://phabricator.wikimedia.org/T140248) (owner: 10Yuvipanda) [16:20:17] Hi ops! Doing my first CentralNotice deploy in many months, and I'm seeing a lot of changes when I diff with origin/wmf/1.28.0-wmf.8 [16:20:58] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [16:21:08] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [16:22:04] (03CR) 10Cmjohnson: [C: 032] Adding production dns for labsdb1009-1011 [dns] - 10https://gerrit.wikimedia.org/r/298788 (owner: 10Cmjohnson) [16:24:11] (03CR) 10Chad: "Do not enable linkDrafts. We don't want them in Diffusion remember?" 
[puppet] - 10https://gerrit.wikimedia.org/r/298710 (owner: 10Paladox) [16:24:18] oop, nvm, just the usual rebase needed [16:30:21] (03PS7) 10Paladox: Add missing roottree, file configs to gerrit.config.erb [puppet] - 10https://gerrit.wikimedia.org/r/298710 [16:31:04] 06Operations, 06Commons, 10media-storage, 07User-notice: Some fonts not anti-aliasing in SVG thumbnails after upgrade of scaling servers - https://phabricator.wikimedia.org/T139543#2458435 (10MoritzMuehlenhoff) Status update: Using Times as the sample and scaling to only 675 I can reproduce this. https://p... [16:31:08] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:31:18] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:32:17] RECOVERY - Disk space on elastic2006 is OK: DISK OK [16:32:25] I have tried to trigger an alert now [16:32:58] 06Operations, 10ops-eqiad: Rack/Setup new memcache servers mc1019-36 - https://phabricator.wikimedia.org/T137345#2458445 (10Cmjohnson) [16:33:37] (03PS1) 10Gehel: Remove old elasticsearch servers from DNS [dns] - 10https://gerrit.wikimedia.org/r/298790 (https://phabricator.wikimedia.org/T139758) [16:33:58] 06Operations, 10ops-eqiad: Rack/Setup new memcache servers mc1019-36 - https://phabricator.wikimedia.org/T137345#2365691 (10Cmjohnson) Tried installing and a few boxes installed but mc1020-21, mc1023-25 30-33,35 give the below message. The highlighted entry will be executed automatically in 0s. Unable to f... [16:34:06] (03CR) 10Muehlenhoff: [C: 032 V: 032] Update to Linux 4.4.15 [debs/linux44] - 10https://gerrit.wikimedia.org/r/298727 (owner: 10Muehlenhoff) [16:34:08] (03CR) 10Filippo Giunchedi: "what's the output that gets into cron?" 
[puppet] - 10https://gerrit.wikimedia.org/r/298779 (https://phabricator.wikimedia.org/T132324) (owner: 10Elukey) [16:35:23] 06Operations, 10hardware-requests, 10Continuous-Integration-Infrastructure (phase-out-gallium): Allocate contint1001 to releng and reimage it in labs-support-network - https://phabricator.wikimedia.org/T140257#2458466 (10RobH) a:03mark Since this was allocated during an emergency situation for a specific u... [16:35:59] 06Operations, 10netops: Turn up new eqiad-esams wave (Level3) - https://phabricator.wikimedia.org/T136717#2458474 (10faidon) The esams side is apparently ready, we're waiting for the LOA; the eqiad side is being actively worked on. Commit date is the 26th, firm date is the 22nd, optimistic delivery is on the... [16:37:22] !log demon@tin Purged l10n cache for 1.28.0-wmf.5 [16:37:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:38:33] !log demon@tin Purged l10n cache for 1.28.0-wmf.6 [16:38:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:38:46] !log demon@tin Purged l10n cache for 1.28.0-wmf.7 [16:38:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:39:12] !log demon@tin Purged l10n cache for 1.28.0-wmf.9 [16:39:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:39:25] (03CR) 10Elukey: "Ah yes sorry.. stuff like:" [puppet] - 10https://gerrit.wikimedia.org/r/298779 (https://phabricator.wikimedia.org/T132324) (owner: 10Elukey) [16:40:04] 06Operations, 10DBA, 10Phabricator, 13Patch-For-Review: Upgrade m3 (phabricator) db servers - https://phabricator.wikimedia.org/T138460#2458516 (10jcrespo) I got `fab_migration` and `rt_migration` mixed. No need to ask anything else- I will archive and drop fab_migration, percona and test. I will be waiti... 
[16:40:18] 06Operations, 10DBA, 10Phabricator, 13Patch-For-Review: Upgrade m3 (phabricator) db servers - https://phabricator.wikimedia.org/T138460#2458517 (10chasemp) no worries, noted in irc I meant to keep rt_migration too :) This go hand in hand w/ https://phabricator.wikimedia.org/diffusion/OPUP/browse/productio... [16:40:47] !log demon@tin Started scap: wmf.10 code sync + testwiki to wmf.10 for l10n cache gen [16:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:41:24] !log demon@tin scap aborted: wmf.10 code sync + testwiki to wmf.10 for l10n cache gen (duration: 00m 37s) [16:41:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:41:37] (on purpose, no worries, forgot something) [16:41:44] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 10Elasticsearch, and 3 others: Decommission elastic1001-1016 - https://phabricator.wikimedia.org/T139758#2458521 (10RobH) I've synced up wtih @gehel about this via irc, and I'll go ahead and complete the switch port steps onward for the decommissioning... [16:44:17] paravoid: I guess you can now remove your op-status ;) [16:44:42] paravoid: we identified a perf issue in MCS, mdholloway and bearND will deploy it asap (in a couple of hours likely) [16:46:46] (03PS3) 10Andrew Bogott: Disable the UpdateInstanceInfo tab. [puppet] - 10https://gerrit.wikimedia.org/r/298538 (https://phabricator.wikimedia.org/T139768) [16:47:28] !log demon@tin scap failed: OSError [Errno 1] Operation not permitted: '/var/lock/scap' (duration: 00m 00s) [16:47:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:47:40] Herp derp? [16:47:51] Aborting scap didn't clean up my lock file? 
[16:47:54] No bueno scap [16:47:57] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 10Elasticsearch, and 3 others: Decommission elastic1001-1016 - https://phabricator.wikimedia.org/T139758#2458557 (10Gehel) Still 2 patches waiting to be merged: * DNS - https://gerrit.wikimedia.org/r/#/c/298790/ * Puppet cleanup - https://gerrit.wikim... [16:48:13] Errr...... [16:48:33] ejegg: Are you trying to scap something? [16:48:39] You seem to have a scap lock file.... [16:48:51] ostriches: yep! [16:48:51] ostriches: we are deploying yeah [16:48:55] !log ejegg@tin Synchronized php-1.28.0-wmf.8/extensions/CentralNotice/: (no message) (duration: 01m 52s) [16:48:58] deploying CentralNotice [16:48:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:49:09] jouncebot: next [16:49:09] In 2 hour(s) and 10 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160713T1900) [16:49:18] jouncebot: now [16:49:23] it needs a now [16:49:32] What do you know, you did have a window and I stepped on you :p [16:49:40] * ostriches walks away sheepishly [16:49:49] ostriches: np! [16:49:52] lock files, they're good [16:50:22] Heh, funny thing is, had I not aborted my initial scap, CN couldn't have sync'd :p [16:50:25] 06Operations, 10media-storage: investigate swift used space spikes since June 2016 - https://phabricator.wikimedia.org/T140075#2458566 (10Fae) This is due to the NYPL uploads, which are coming to an end, though maybe with some final runs and housekeeping. As it happens I've started a wikibreak and packing to t... 
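The lock-file exchange above (a second scap run refused on `/var/lock/scap` while the CentralNotice sync was still holding the lock) can be sketched roughly as follows. This is only an illustration of the general lock-file pattern, not scap's actual implementation: the `deploy_lock` and `try_deploy` helpers are hypothetical names, the real tool reported `OSError [Errno 1] Operation not permitted` rather than the `FileExistsError` used here, and the lock path is an assumption supplied by the caller.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def deploy_lock(path):
    """Hold an exclusive lock file for the duration of a deployment (hypothetical helper)."""
    # O_CREAT | O_EXCL makes creation atomic: os.open raises FileExistsError
    # if another deployer already created the file.
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    try:
        # Record the owner's PID so a blocked deployer can see who holds the lock.
        os.write(fd, str(os.getpid()).encode())
        yield path
    finally:
        # Runs on normal exit and when the deployment is aborted by an exception
        # (including KeyboardInterrupt), so an abort does not leave a stale lock.
        os.close(fd)
        os.unlink(path)


def try_deploy(path, name, log):
    """Attempt a (pretend) deployment; return False if someone else holds the lock."""
    try:
        with deploy_lock(path):
            log.append(f"{name}: deploying")
            return True
    except FileExistsError:
        log.append(f"{name}: lock held by another deployer, not deploying")
        return False
```

A caller would pass a path both deployers agree on, for example one created under `tempfile.mkdtemp()` for experimentation. The point of the `finally` clause is the one raised in the chat: an aborted run should release its own lock, while a lock owned by someone else's in-progress window should keep blocking.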
[16:50:26] !log updated CentralNotice for cookie cleanup [16:50:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:52:14] 06Operations, 10Ops-Access-Requests: MediaWiki deployment shell access request for zfilipin - https://phabricator.wikimedia.org/T140264#2458570 (10thcipriani) [16:52:57] (03PS1) 10Thcipriani: Add zfilipin to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/298792 (https://phabricator.wikimedia.org/T140264) [16:53:44] ejegg, AndyRussG: You guys done? [16:54:24] ostriches: looks like it, just monitoring for regressions [16:54:35] Ok thanks! I'm gonna do a happy fun scaptime [16:54:38] !log demon@tin Started scap: wmf.10 code sync + testwiki to wmf.10 for l10n cache gen (once more with feeling) [16:54:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:55:22] 07Puppet, 10ORES, 06Revision-Scoring-As-A-Service, 07Easy: Puppet fails on new web node - https://phabricator.wikimedia.org/T140265#2458608 (10Ladsgroup) [16:59:24] (03PS1) 10Rush: phab: notes on DB dependencies for rt & bz update jobs [puppet] - 10https://gerrit.wikimedia.org/r/298794 [17:00:30] 06Operations: Upgrade phpredis client on zend - https://phabricator.wikimedia.org/T112694#2458627 (10Legoktm) >>! In T112694#2458313, @fgiunchedi wrote: > looks like related {T86081} is essentially completed, for all things production anyway. FWIW wikitech still runs zend PHP... [17:02:26] ejegg: confirmed working on mobile site too! [17:02:55] https://en.m.wikipedia.org/?country=ES&randomcampaign=0.1&force=true&device=ipad [17:08:52] (03CR) 10jenkins-bot: [V: 04-1] phab: notes on DB dependencies for rt & bz update jobs [puppet] - 10https://gerrit.wikimedia.org/r/298794 (owner: 10Rush) [17:09:49] paravoid: the mobileapps deploy is on hold since we don't have a fix yet. The patch we thought would fix things are for another patch we hadn't even deployed yet. :(. So, we'll have to look more into it. 
[17:09:58] Amir1: what's up [17:10:12] chasemp: Lol, no comments for you! [17:10:20] seriously wth [17:10:28] ah [17:10:32] Something else in the file failed lint I'm guessing. [17:10:40] (from before, less strict linting at the time) [17:10:50] mutante: hey, first, I want to know if this one means each node has their own monitoring too https://github.com/wikimedia/operations-puppet/blob/production/hieradata/common/lvs/configuration.yaml#L922 [17:11:18] if web in scb1001 goes down we would know or LVS just send requests to scb1002 [17:12:44] paravoid: _joe_ : has the puppet patch https://gerrit.wikimedia.org/r/#/c/298714/ been deployed? Not sure how to check for it. The Gerrit page says "aborted" and the next puppet swat would be tomorrow. [17:13:23] it says "merged" [17:14:10] Amir1: i dont know about the LVS config, but from Icinga point of view, i can tell that we have monitoring checks for the web app on each separate node [17:14:26] Amir1: so in Icinga there is one " [17:14:28] ores uWSGI web app" [17:14:37] on scb1001, scb1002 and so on [17:14:47] oh, amazing [17:14:53] thanks mutante [17:15:04] Amir1: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=ores [17:15:14] you should be able to login there with the labs user [17:15:29] and then scroll down all the way [17:16:05] yeah, nice [17:16:06] thanks [17:16:24] and yea, it currently says that in labs, the node -03 is ok but -05 gets a 404 [17:16:42] paravoid: does that mean it's deployed right when it's merged? 
The next entry on that Gerrit page says aborted / post-merge build failed [17:16:58] yeah, we are working on it [17:17:04] (see SAL for ores) [17:17:06] (03PS10) 10Gehel: Decommission old elasticsearch servers [puppet] - 10https://gerrit.wikimedia.org/r/295649 (https://phabricator.wikimedia.org/T139758) [17:17:14] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.32.80, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [17:17:46] PROBLEM - Restbase root url on restbase1013 is CRITICAL: Connection refused [17:18:01] bearND: yes, ignore that [17:18:32] (03CR) 10Gehel: [C: 032] Decommission old elasticsearch servers [puppet] - 10https://gerrit.wikimedia.org/r/295649 (https://phabricator.wikimedia.org/T139758) (owner: 10Gehel) [17:18:52] ^^^ looking into it [17:19:51] (03PS9) 10Gehel: logstash: Update default mappings for Elasticsearch 2.x [puppet] - 10https://gerrit.wikimedia.org/r/298295 (https://phabricator.wikimedia.org/T136001) (owner: 10BryanDavis) [17:20:14] ostriches: could I squeeze 2 patches from master of the WikimediaEvents extension into the branch for the train? or am I too late? ;) [17:21:01] Already branched and sync'ing now. [17:21:04] Would need a backport. [17:21:23] (you can do the backport between now and noon PDT if you'd like so it can ride the rest of wmf.10 today if you'd like) [17:21:33] (03CR) 10Gehel: [C: 032] logstash: Update default mappings for Elasticsearch 2.x [puppet] - 10https://gerrit.wikimedia.org/r/298295 (https://phabricator.wikimedia.org/T136001) (owner: 10BryanDavis) [17:21:50] ostriches: that would be good! :) *goes to do that* [17:21:54] I BEG FOR MERCY, I HAVE CHANGED, I WILL BE GOOD.. I PROMISE !!!!!! 
[17:21:59] (03PS3) 10Gehel: logstash: Remove all _* fields from gelf records [puppet] - 10https://gerrit.wikimedia.org/r/298382 (owner: 10BryanDavis) [17:22:04] PROBLEM - HHVM rendering on mw1229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:22:10] !log Starting restbase on restbase1013.eqiad.wmnet [17:22:14] <_joe_> uhm again? [17:22:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:23:05] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy [17:23:33] urandom: do we have an explanation for 1013? [17:23:34] PROBLEM - Apache HTTP on mw1229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:23:35] RECOVERY - Restbase root url on restbase1013 is OK: HTTP OK: HTTP/1.1 200 - 15273 bytes in 0.017 second response time [17:23:38] (03CR) 10Gehel: [C: 032] logstash: Remove all _* fields from gelf records [puppet] - 10https://gerrit.wikimedia.org/r/298382 (owner: 10BryanDavis) [17:23:57] gwicke: i'm opening an issue (TL;DR no, but we've seen this before) [17:23:59] 06Operations: Determine a core set or a checklist of permissions for deployment purpose - https://phabricator.wikimedia.org/T140270#2458739 (10Dereckson) [17:24:20] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06WMF-NDA-Requests: Determine a core set or a checklist of permissions for deployment purpose - https://phabricator.wikimedia.org/T140270#2458754 (10Dereckson) [17:24:32] anything in the syslog? 
[17:25:00] gwicke: nothing helpful, no [17:25:08] <_joe_> !log restarting hhvm on mw1229 (stuck in HPHP::Treadmill::getAgeOldestRequest) [17:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:25:33] (03PS2) 10Gehel: logstash: Remove normalize_fields fitler [puppet] - 10https://gerrit.wikimedia.org/r/298381 (owner: 10BryanDavis) [17:25:45] RECOVERY - HHVM rendering on mw1229 is OK: HTTP OK: HTTP/1.1 200 OK - 68622 bytes in 0.182 second response time [17:26:43] (03PS1) 10Cmjohnson: Adding dhcpd file entries for labsdb1009-11 [puppet] - 10https://gerrit.wikimedia.org/r/298798 [17:27:15] RECOVERY - Apache HTTP on mw1229 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.361 second response time [17:27:32] jouncebot: next [17:27:33] In 1 hour(s) and 32 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160713T1900) [17:27:36] urandom, is this an instance of https://phabricator.wikimedia.org/T136957 ? [17:27:39] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 10Elasticsearch, and 3 others: Decommission elastic1001-1016 - https://phabricator.wikimedia.org/T139758#2458765 (10RobH) a:05RobH>03mark Actually, before I go into the switches and disable all the ports, I'll get @mark's feedback on decommission o... [17:27:39] (03CR) 10Gehel: [C: 032] logstash: Remove normalize_fields fitler [puppet] - 10https://gerrit.wikimedia.org/r/298381 (owner: 10BryanDavis) [17:27:46] !log drop databases fab_migration, percona and test from m3 T138460 [17:27:47] T138460: Upgrade m3 (phabricator) db servers - https://phabricator.wikimedia.org/T138460 [17:27:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:27:54] gwicke: crap, a moment too late [17:27:57] gwicke: it is [17:28:20] ostriches: I'm going to deploy addshore's patches in a few minutes [17:28:29] When my scap is done :) [17:28:47] Thanks both! 
:) [17:28:50] sync-apaches: 71% (ok: 275; fail: 0; left: 108) [17:30:07] 06Operations, 10RESTBase, 06Services: RESTBase shutting down spontaneously - https://phabricator.wikimedia.org/T136957#2353511 (10Eevans) This continues to happen; This occurred just now on restbase1013: {P3418} [17:31:02] the line with "Parent pid 16945, child pid 16946" is odd [17:31:16] is firejail doing some weird ptrace thing like upstart used to? [17:31:17] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: Rack and setup payments1005-8 - https://phabricator.wikimedia.org/T136881#2458807 (10Cmjohnson) Spoke with Jeff G and he prefers using the same names for payments. We shutdown the old payments1003 and connected to the new payments1003. Changed the idrac to... [17:31:33] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: Rack and setup payments1005-8 - https://phabricator.wikimedia.org/T136881#2458808 (10Cmjohnson) racktables updated with new payments1003 information. [17:31:51] (03CR) 10RobH: [C: 031] Remove old elasticsearch servers from DNS [dns] - 10https://gerrit.wikimedia.org/r/298790 (https://phabricator.wikimedia.org/T139758) (owner: 10Gehel) [17:32:17] (03PS1) 10Yurik: Re-enable geoshapes for maps [puppet] - 10https://gerrit.wikimedia.org/r/298800 [17:32:17] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06WMF-NDA-Requests: Determine a core set or a checklist of permissions for deployment purpose - https://phabricator.wikimedia.org/T140270#2458809 (10Dereckson) [17:32:28] I broke db1048 replication [17:32:59] PROBLEM - MariaDB Slave SQL: m3 on db2012 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1008, Errmsg: Error Cant drop database test: database doesnt exist on query. Default database: test. 
Query: [snipped] [17:33:05] "Can't drop database 'test'; database doesn't exist" [17:33:17] (03PS2) 10Cmjohnson: Adding dhcpd file entries for labsdb1009-11 [puppet] - 10https://gerrit.wikimedia.org/r/298798 [17:33:42] fixed [17:34:43] (03CR) 10Cmjohnson: [C: 032] Adding dhcpd file entries for labsdb1009-11 [puppet] - 10https://gerrit.wikimedia.org/r/298798 (owner: 10Cmjohnson) [17:35:04] (03CR) 10Dzahn: [C: 031] Fix changeid link in gerrit 2.12 [puppet] - 10https://gerrit.wikimedia.org/r/298709 (owner: 10Paladox) [17:35:20] (03CR) 10Paladox: [C: 031] "I spoke with @Chad and explaned why I did it." [puppet] - 10https://gerrit.wikimedia.org/r/298710 (owner: 10Paladox) [17:36:18] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06WMF-NDA-Requests: Determine a core set or a checklist of permissions for deployment purpose - https://phabricator.wikimedia.org/T140270#2458836 (10Dereckson) [17:36:21] RECOVERY - MariaDB Slave SQL: m3 on db2012 is OK: OK slave_sql_state Slave_SQL_Running: Yes [17:38:38] 06Operations, 10Ops-Access-Requests: MediaWiki deployment shell access request for addshore - https://phabricator.wikimedia.org/T140276#2458869 (10Addshore) [17:38:49] (03PS8) 10Paladox: Add missing roottree, file configs to gerrit.config.erb [puppet] - 10https://gerrit.wikimedia.org/r/298710 [17:39:07] 06Operations, 06Performance-Team, 10Thumbor: Package Thumbor for Debian - https://phabricator.wikimedia.org/T134485#2458881 (10Gilles) Upstream issued a new major version at my request, the current Debian package I've prepared needs to be updated to Thumbor 6.1.0 [17:39:08] (03PS9) 10Paladox: Add missing roottree, file configs to gerrit.config.erb [puppet] - 10https://gerrit.wikimedia.org/r/298710 [17:39:22] (03CR) 10Chad: [C: 031] "Compiler gave me what I wanted: https://puppet-compiler.wmflabs.org/3329/, no actual config change on ytterbium, just the 2 line fix on le" [puppet] - 10https://gerrit.wikimedia.org/r/298709 (owner: 10Paladox) [17:40:16] (03PS4) 
10Dzahn: Fix changeid link in gerrit 2.12 [puppet] - 10https://gerrit.wikimedia.org/r/298709 (owner: 10Paladox) [17:40:48] (03CR) 10Dzahn: "gerrit restart or live hack to avoid it (on each config change)" [puppet] - 10https://gerrit.wikimedia.org/r/298709 (owner: 10Paladox) [17:41:37] (03CR) 10Dzahn: [C: 032] "oh, nevermind, yes "no actual config change on ytterbium" like you say" [puppet] - 10https://gerrit.wikimedia.org/r/298709 (owner: 10Paladox) [17:41:45] !log demon@tin Finished scap: wmf.10 code sync + testwiki to wmf.10 for l10n cache gen (once more with feeling) (duration: 47m 06s) [17:41:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:42:13] (03PS1) 10Chad: Moving all group0 to wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298801 [17:42:42] well, that looks like it's related but it's not [17:43:20] ostriches: should I sync now or let you finish? [17:43:26] I'm done until noon. [17:43:38] * ostriches hands the deployment baton [17:43:39] ok [17:43:51] (03PS2) 10Gehel: Re-enable geoshapes for maps [puppet] - 10https://gerrit.wikimedia.org/r/298800 (owner: 10Yurik) [17:44:12] !log legoktm@tin Synchronized php-1.28.0-wmf.10/extensions/WikimediaEvents/WikimediaEventsHooks.php: Include the namespace for all pages & Include the resolved special page name for special pages - T138500 (duration: 00m 36s) [17:44:14] T138500: [Task] Add Special:AboutTopic view stats to grafana for ArticlePlaceholder - https://phabricator.wikimedia.org/T138500 [17:44:15] 06Operations, 10Ops-Access-Requests: MediaWiki deployment shell access request for addshore - https://phabricator.wikimedia.org/T140276#2458869 (10Legoktm) As one of the people who currently deploys code for addshore out of the normal processes, I endorse this request! [17:44:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:44:22] addshore: ^^ [17:44:36] (03CR) 10Dzahn: "yep, no-op on ytterbium. 
change applied and service restart on lead. should work now" [puppet] - 10https://gerrit.wikimedia.org/r/298709 (owner: 10Paladox) [17:46:15] (03CR) 10Gehel: [C: 032] "No objection were raised, merging this as discussed." [puppet] - 10https://gerrit.wikimedia.org/r/298800 (owner: 10Yurik) [17:48:05] 06Operations, 10RESTBase, 06Services: RESTBase shutting down spontaneously - https://phabricator.wikimedia.org/T136957#2353511 (10GWicke) The parent / child pid line seems to come from https://github.com/netblue30/firejail/blob/a344c555ff282c23a8274d10ad0f75eb4fae6836/src/firejail/main.c#L2180, and seems to... [17:48:15] mediawiki down [17:48:44] Steinsplitter: works for me actually [17:48:46] what error do you get? [17:49:11] Request from via cp1068 cp1068, Varnish XID 1059329671 [17:49:11] Error: 503, Service Unavailable at Wed, 13 Jul 2016 17:48:07 GMT [17:49:18] wfm [17:49:26] Steinsplitter: try again? [17:49:51] https://www.mediawiki.org/wiki/Help:Extension:ParserFunctions [17:50:10] PROBLEM - Disk space on elastic2008 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 109250 MB (15% inode=99%) [17:50:26] 06Operations, 10RESTBase, 06Services: RESTBase shutting down spontaneously - https://phabricator.wikimedia.org/T136957#2458922 (10Eevans) >>! In T136957#2458919, @GWicke wrote: > The parent / child pid line seems to come from https://github.com/netblue30/firejail/blob/a344c555ff282c23a8274d10ad0f75eb4fae6836... 
[17:50:28] hrmmm, can confirm on that url [17:50:45] yep, me too [17:51:25] greg-g: Steinsplitter Searching for "Help:Extension:" fails too [17:51:33] via cp1068 cp1068, Varnish XID 1059885336 [17:51:34] Error: 503, Service Unavailable at Wed, 13 Jul 2016 17:51:08 GMT [17:52:09] RECOVERY - Disk space on elastic2008 is OK: DISK OK [17:52:43] https://www.mediawiki.org/w/index.php?search=Help%3Aextensions&title=Special%3ASearch&go=Go [17:52:44] fails [17:52:54] that's weird: the sear works [17:53:02] Request from 86.177.45.123 via cp1055 cp1055, Varnish XID 4040168863 [17:53:02] Error: 503, Service Unavailable at Wed, 13 Jul 2016 17:52:29 GMT [17:53:02] *search for existing pages [17:53:08] but for a non existing page, it fails [17:53:17] ah shit [17:53:18] 2016-07-13 17:53:08 [V4aABApAMEoAAFRU2eIAAABO] mw1239 mediawikiwiki 1.28.0-wmf.10 fatal ERROR: [5c5c2c0c] PHP Fatal Error: Class undefined: PoolCounter_Client [17:53:22] Reedy: ^ [17:53:37] ostriches: CC, for wmf.10 specific error, I guess? [17:53:51] (03PS1) 10Cmjohnson: Adding labsdb1009-11 to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/298805 [17:54:31] give me a minute [17:54:34] https://github.com/wikimedia/mediawiki/search?utf8=%E2%9C%93&q=PoolCounter_Client [17:54:44] (03CR) 10Cmjohnson: [C: 032] Adding labsdb1009-11 to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/298805 (owner: 10Cmjohnson) [17:54:58] yes, I'm fixing it [17:55:20] legoktm is awesome :-) [17:55:26] <_joe_> are we having an outage? [17:55:32] <_joe_> I was at SoS [17:55:42] <_joe_> who is releasing?
we need to roll back [17:55:44] on group0 wikis [17:55:45] no one [17:55:47] I'm fixing [17:55:54] <_joe_> and rollout again once it's fixed [17:55:56] er, group0, yeah, I was confused by the wmf.10 [17:56:13] <_joe_> but well if it's a matter of minutes, it's ok too [17:56:17] (03PS3) 10Legoktm: PoolCounterClient.php -> extension.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298096 (https://phabricator.wikimedia.org/T139800) (owner: 10Reedy) [17:56:27] (03CR) 10Legoktm: [C: 032] PoolCounterClient.php -> extension.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298096 (https://phabricator.wikimedia.org/T139800) (owner: 10Reedy) [17:56:29] o_O [17:56:31] yeah, group0 is slightly less important, especially since it's just a subset of pages [17:57:08] (03Merged) 10jenkins-bot: PoolCounterClient.php -> extension.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298096 (https://phabricator.wikimedia.org/T139800) (owner: 10Reedy) [17:58:57] another deploy outage? [17:58:57] !log legoktm@tin Synchronized wmf-config/: PoolCounterClient.php -> extension.json (duration: 00m 32s) [17:59:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:59:05] Hi. Accessing some flow posts from notifications gives me a consistent HTTP 503 error. Are you aware of this? It may be tracked in https://phabricator.wikimedia.org/T140223 [17:59:36] Vulpix: it is known. :) [17:59:39] should be fixed now [18:00:07] legoktm works for me now [18:00:10] <_joe_> thanks legoktm :) [18:00:16] ah, good, it works now :) [18:00:18] thx [18:00:18] thanks for fixing it [18:00:21] Bleh, how did that not blow up during deployment? [18:00:23] can we do a real postmortem on this and possibly a meta-postmortem?
[18:00:23] <_joe_> legoktm is always working, paladox [18:00:24] summary: we removed the entry point for PoolCounter except it was using include() and not require() so it only fatal'd later on when something tried to use PoolCounter [18:00:25] That's annoying af. [18:00:32] Yep [18:01:06] we've been getting a lot of outages during deployments lately, this is evidence of something systemic being wrong [18:01:08] paravoid: I can write an incident report, but what do you mean by a meta one? [18:01:56] I mean looking at the broader picture of how we CI and deploy code perhaps [18:02:05] 06Operations, 10Parsoid: Delete Parsoid deb 0.4.0 package from releases wikimedia.org - https://phabricator.wikimedia.org/T140279#2458962 (10ssastry) [18:02:26] this one was another one that could be caught by a canary deploy, but it needs organic traffic (hence not just mw1017) [18:02:34] s/hence/thus/ [18:02:40] why wasn't it caught in beta? [18:02:45] why don't we do canary deploys yet? [18:02:47] (etc.) [18:02:49] (03PS1) 10Chad: Gerrit: Vary extension listing cron on git_dir [puppet] - 10https://gerrit.wikimedia.org/r/298808 [18:03:02] good question re beta [18:03:13] legoktm: idea why it wasn't caught in beta? just no one noticed? [18:03:23] does beta have poolcounter set up? [18:03:27] I find it _really_ embarassing that this kind of thing was user-visible [18:03:28] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:03:44] legoktm: I think so? We have poolcounter nodes setup [18:03:49] '-wmgUsePoolCounter' => [ [18:03:49] 'default' => false, // T38891 [18:03:49] ], [18:03:49] T38891: setup poolcounter daemon - https://phabricator.wikimedia.org/T38891 [18:03:50] and it's been happening often too [18:03:50] nope [18:03:58] Ah yes, so no, beta would not have caught it. 
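The include()/require() failure mode summarized at 18:00:24 can be sketched as a loose Python analogy. This is not the MediaWiki or PoolCounter code: `soft_include`, `hard_require`, `load_config`, `handle_request` and the module name are all hypothetical stand-ins. The only PHP fact relied on is that include() emits a warning and continues when the file is missing, while require() stops with a fatal error.

```python
import importlib
import warnings


def soft_include(module_name):
    """Rough analogue of PHP include(): warn and keep going if the target is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        warnings.warn(f"include failed: {exc}")
        return None


def hard_require(module_name):
    """Rough analogue of PHP require(): a missing target fails immediately."""
    return importlib.import_module(module_name)


def load_config(loader):
    # Hypothetical module name standing in for an entry point that was removed.
    return loader("poolcounter_entry_point_that_was_removed")


def handle_request(client_module):
    # Only requests that actually touch the client hit the missing code.
    return client_module.Client()
```

With the soft loader, configuration loading succeeds and the failure only surfaces when a request reaches `handle_request`, which roughly matches why only some pages returned 503. With the hard loader, the same mistake would fail at load time, where a deploy check has a chance to see it.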
[18:04:15] also the use of include vs require is really dumb [18:04:19] all extensions need to be require'd [18:04:40] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [18:05:27] ok, I'm off [18:05:30] bye [18:05:31] paravoid, legoktm: https://phabricator.wikimedia.org/D288 would've prevented this from happening again [18:05:39] paravoid: Fix for that cron will land shortly, thanks for pinging me. [18:05:47] I don't think so [18:05:47] (03PS10) 10Paladox: Add missing roottree, file configs to gerrit.config.erb [puppet] - 10https://gerrit.wikimedia.org/r/298710 [18:05:51] it wasn't extension-list [18:05:54] it was the config itself [18:06:08] so FWIW, there is a patch that recently merged to make scap aware of "canary" machines. [18:06:14] 06Operations, 10DBA, 10Phabricator, 13Patch-For-Review: Upgrade m3 (phabricator) db servers - https://phabricator.wikimedia.org/T138460#2458980 (10mmodell) @jcrespo: There is a scheduled downtime every Thursday at 01:00 (AM) GMT. If you want to pick a better time for your convenience then let me know when... [18:06:21] also there is a place-holder patch for adding canary deployments to mediawiki [18:06:31] reopened the poolcounter in beta task: https://phabricator.wikimedia.org/T38891#2458977 [18:06:33] 06Operations, 10RESTBase, 06Services: RESTBase shutting down spontaneously - https://phabricator.wikimedia.org/T136957#2458983 (10Eevans) Another thing to add here. The timing of these log entries is very suspicious: ``` Jul 13 17:12:30 restbase1013 firejail[16945]: Parent pid 16945, child pid 16946 Jul 13... [18:06:38] there are some blocking tasks here, however: https://phabricator.wikimedia.org/T136839 [18:06:57] well, probably quicker to paste the parent task: https://phabricator.wikimedia.org/T136883 [18:06:57] legoktm: Fair enough. But it certainly didn't help :) [18:07:42] (03CR) 10Paladox: "Rebased." 
[puppet] - 10https://gerrit.wikimedia.org/r/298710 (owner: 10Paladox) [18:07:52] (03PS11) 10Paladox: Add missing roottree, file configs to gerrit.config.erb [puppet] - 10https://gerrit.wikimedia.org/r/298710 [18:08:11] it would be nice to determine *what* and *how* we would like to check a canary. The work to plug that into scap is underway and was the result of a different deployment outage incident report. [18:08:48] 06Operations, 10RESTBase, 06Services: RESTBase shutting down spontaneously - https://phabricator.wikimedia.org/T136957#2458989 (10GWicke) > FWIW, firejail exits with the status passed to myexit(int), and I think systemd is telling us that is 0. Yup, which is why I suspected that we are hitting [the fall-thr... [18:09:58] (03PS1) 10Cmjohnson: Adding production dns entries for ms-be102[2-7] [dns] - 10https://gerrit.wikimedia.org/r/298810 [18:10:55] (03CR) 10Cmjohnson: [C: 032] Adding production dns entries for ms-be102[2-7] [dns] - 10https://gerrit.wikimedia.org/r/298810 (owner: 10Cmjohnson) [18:12:33] (03PS2) 10Chad: Gerrit: Vary extension listing cron on git_dir [puppet] - 10https://gerrit.wikimedia.org/r/298808 [18:12:35] (03PS3) 10Chad: Gerrit: Go ahead and swap git directory locations to simplify lead [puppet] - 10https://gerrit.wikimedia.org/r/298672 [18:12:39] 06Operations, 10DBA, 10Phabricator, 13Patch-For-Review: Upgrade m3 (phabricator) db servers - https://phabricator.wikimedia.org/T138460#2459003 (10jcrespo) Assuming everybody knows this is happening- This is master failover- we are going to take m3 and point it somewhere else. While connections finish (an... [18:12:56] if anyone has any further thoughts on https://phabricator.wikimedia.org/T136839 or https://phabricator.wikimedia.org/T110068 that would help move the mw canary deploy forward. 
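One possible shape for the "what and how" of a canary check raised at 18:08:11, as a minimal sketch only. It is not scap's implementation and not the logstash_checker script under review; the function name, the thresholds and the idea of comparing errors per minute before and after the sync are assumptions made for illustration.

```python
def canary_healthy(baseline_errors_per_min, canary_errors_per_min,
                   max_ratio=2.0, min_errors=5.0):
    """Decide whether a canary host looks healthy enough to continue a deploy.

    baseline_errors_per_min: error rate on the canary before the sync.
    canary_errors_per_min: error rate on the canary after the sync.
    max_ratio, min_errors: hypothetical thresholds, to be tuned per service.
    """
    # Very low absolute error counts are treated as noise.
    if canary_errors_per_min < min_errors:
        return True
    # Errors appearing where there were none before is a clear regression.
    if baseline_errors_per_min <= 0:
        return False
    return canary_errors_per_min / baseline_errors_per_min <= max_ratio
```

As noted at 18:02:26, a check like this only helps if the canary receives organic traffic, since a fatal that triggers on specific code paths will not show up on a host that serves no real requests.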
[18:13:30] 06Operations, 10DBA, 10Phabricator, 13Patch-For-Review: Upgrade m3 (phabricator) db servers - https://phabricator.wikimedia.org/T138460#2459022 (10jcrespo) Assuming everybody knows this is happening (disruptive maintenance happens every week)- we can do it tomorrow. [18:18:44] 06Operations, 06Discovery, 10Wikimedia-Logstash, 03Discovery-Search-Sprint, and 2 others: Create and test a dump+filter+load process to reindex logstash data that is not ES 2.x safe - https://phabricator.wikimedia.org/T140283#2459064 (10bd808) [18:18:49] (03CR) 10Chad: [C: 031] "This will work, but will restart gerrit unless we do the "don't actually restart" dance." [puppet] - 10https://gerrit.wikimedia.org/r/298672 (owner: 10Chad) [18:19:10] (03CR) 10Chad: [C: 031] "Needs to land because cronspam, works from compiler." [puppet] - 10https://gerrit.wikimedia.org/r/298808 (owner: 10Chad) [18:19:26] (03PS1) 10Cmjohnson: Adding dhcp file and netboot cfg for ms-be1022-27 [puppet] - 10https://gerrit.wikimedia.org/r/298811 [18:20:38] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
[18:21:19] 06Operations, 06Discovery, 10Wikimedia-Logstash, 03Discovery-Search-Sprint, and 2 others: Create and test a dump+filter+load process to reindex logstash data that is not ES 2.x safe - https://phabricator.wikimedia.org/T140283#2459099 (10bd808) [18:22:10] !log checkLocalUser.php finished, starting run #2 now [18:22:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:23:59] (03CR) 10Cmjohnson: [C: 032] Adding dhcp file and netboot cfg for ms-be1022-27 [puppet] - 10https://gerrit.wikimedia.org/r/298811 (owner: 10Cmjohnson) [18:24:14] (03PS1) 10Alex Monk: [labs/deployment-prep] Remove old pre-Swift directory variables referencing /data/project/upload7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298812 (https://phabricator.wikimedia.org/T64835) [18:24:24] a gerrit restart due to config change is coming up [18:24:38] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [18:25:38] (03CR) 10Alex Monk: [C: 032] [labs/deployment-prep] Remove old pre-Swift directory variables referencing /data/project/upload7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298812 (https://phabricator.wikimedia.org/T64835) (owner: 10Alex Monk) [18:26:10] (03CR) 10Dzahn: [C: 032] Gerrit: Go ahead and swap git directory locations to simplify lead [puppet] - 10https://gerrit.wikimedia.org/r/298672 (owner: 10Chad) [18:26:18] aww. rebase [18:26:22] (03PS4) 10Dzahn: Gerrit: Go ahead and swap git directory locations to simplify lead [puppet] - 10https://gerrit.wikimedia.org/r/298672 (owner: 10Chad) [18:26:42] ostriches: if I need to backport a core patch, it needs to go to wmf.8 and not wmf.9? 
[18:26:53] (03PS5) 10Dzahn: Gerrit: Go ahead and swap git directory locations to simplify lead [puppet] - 10https://gerrit.wikimedia.org/r/298672 (https://phabricator.wikimedia.org/T125018) (owner: 10Chad) [18:27:44] (03PS2) 10Alex Monk: Remove old pre-Swift directory variables referencing upload7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298404 (https://phabricator.wikimedia.org/T64835) [18:27:46] greg-g, note those dates are the date of the rotation, not the date of occurence. So there are none of my troubleshooting commit in the 12 logs, despite me doing the commit yesterday and getting immediate results. [18:27:47] (03PS12) 10Paladox: Add missing roottree, file configs to gerrit.config.erb [puppet] - 10https://gerrit.wikimedia.org/r/298710 [18:27:52] legoktm: Yes, wmf.8 and wmf.10 [18:27:54] Skipping 9 [18:28:19] (03PS2) 10Alex Monk: [labs/deployment-prep] Remove old pre-Swift directory variables referencing /data/project/upload7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298812 (https://phabricator.wikimedia.org/T64835) [18:28:34] (03CR) 10Alex Monk: [labs/deployment-prep] Remove old pre-Swift directory variables referencing /data/project/upload7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298812 (https://phabricator.wikimedia.org/T64835) (owner: 10Alex Monk) [18:28:43] (03CR) 10Alex Monk: [C: 032] [labs/deployment-prep] Remove old pre-Swift directory variables referencing /data/project/upload7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298812 (https://phabricator.wikimedia.org/T64835) (owner: 10Alex Monk) [18:28:45] greg-g: https://phabricator.wikimedia.org/P3421 [18:31:50] (03CR) 10Dzahn: [V: 032] "already verified" [puppet] - 10https://gerrit.wikimedia.org/r/298672 (https://phabricator.wikimedia.org/T125018) (owner: 10Chad) [18:31:54] 06Operations, 10DBA, 10Phabricator, 13Patch-For-Review: Upgrade m3 (phabricator) db servers - https://phabricator.wikimedia.org/T138460#2459164 (10mmodell) @jcrespo: Yes there is a 
one hour scheduled maintenance which is planned downtime although usually it takes no more than a few minutes. I'm game when y... [18:32:12] !log gerrit will restart shortly for a config change. expect a very short downtime [18:32:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:32:59] (03Merged) 10jenkins-bot: [labs/deployment-prep] Remove old pre-Swift directory variables referencing /data/project/upload7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298812 (https://phabricator.wikimedia.org/T64835) (owner: 10Alex Monk) [18:34:05] !log krenair@tin Synchronized wmf-config: labs-only change, should be a noop here: https://gerrit.wikimedia.org/r/298812 (duration: 00m 27s) [18:34:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:36:22] PROBLEM - HHVM rendering on mw1170 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:36:30] .. the restart did not work as expected [18:36:49] mutante oh did it fail [18:36:53] PROBLEM - Apache HTTP on mw1170 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:37:08] not sure what's up with mw1170 [18:37:41] Jul 13 18:37:26 mw1270: Lost parent, LightProcess exiting [18:37:42] this again [18:37:59] hhvm locked up again [18:38:01] paladox: yes, there is an issue with indices.. investigated [18:38:09] Oh [18:38:10] ok [18:38:32] PROBLEM - gerrit process on ytterbium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war [18:40:04] ACKNOWLEDGEMENT - gerrit process on ytterbium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daniel_zahn investigation ongoing [18:40:39] !log gerrit has a temp problem. 
maintenance going on [18:40:43] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [18:40:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:41:52] !log gerrit/ytterbium: flapped for a minute because of incompat 2.12/2.8 config. Working, puppet disabled pending real fix. [18:41:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:42:33] RECOVERY - gerrit process on ytterbium is OK: PROCS OK: 1 process with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war [18:45:39] 06Operations, 10Ops-Access-Requests: MediaWiki deployment shell access request for addshore - https://phabricator.wikimedia.org/T140276#2458869 (10aude) addshore knows what he's doing (or can figure it out, knows when to ask for help). It would be helpful to have another deployer from WMDE [18:48:47] can someone restart grrrit-wm [18:48:48] ? [18:49:08] I will pay someone $10 to make it restart itself :p [18:51:05] i am adding some dogecoin to that [18:54:33] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:55:21] 06Operations, 10ops-eqiad: rack/setup/install/deploy labsdb1009-labsdb1011 - https://phabricator.wikimedia.org/T136860#2459225 (10Cmjohnson) [18:56:17] 06Operations, 10media-storage: authoritative copy of 'root' files for upload.wikimedia.org is only in swift - https://phabricator.wikimedia.org/T130709#2459241 (10AlexMonk-WMF) They should be in VCS so people can propose changes to them, and have Puppet apply the changes when approved. It'd also be helpful for... 
[18:57:03] 06Operations, 10Ops-Access-Requests: Give bawolff access to #mediawiki_security - https://phabricator.wikimedia.org/T140287#2459242 (10Bawolff) [18:57:09] ok, i did it [18:57:21] well, if it comes back now [19:00:04] (03PS1) 10Urbanecm: Logo update for trwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298819 (https://phabricator.wikimedia.org/T140015) [19:00:04] ostriches: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160713T1900). Please do the needful. [19:00:34] Gimmie 5 mins jouncebot [19:01:25] (03PS2) 10Chad: Moving all group0 to wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298801 [19:01:37] (03CR) 10Chad: [C: 032] Moving all group0 to wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298801 (owner: 10Chad) [19:01:40] 06Operations, 10Ops-Access-Requests: Give bawolff access to #mediawiki_security - https://phabricator.wikimedia.org/T140287#2459270 (10RobH) 05Open>03Resolved a:03RobH This is a simple change, I just asked for a task to have an audit trail. Since Brian is already on staff/security team/etc, its pretty s... [19:02:43] (03PS16) 10Thcipriani: Logstash_checker script for canary deploys [puppet] - 10https://gerrit.wikimedia.org/r/292505 (https://phabricator.wikimedia.org/T110068) (owner: 10GWicke) [19:03:02] (03Merged) 10jenkins-bot: Moving all group0 to wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298801 (owner: 10Chad) [19:04:33] You guys saw alls these RL warnings on mw1272? 
https://logstash.wikimedia.org/#/dashboard/elasticsearch/mediawiki-errors [19:04:47] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: all group0 to wmf.10 [19:04:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:06:38] 06Operations, 10Ops-Access-Requests: Platonides access to #mediawiki_security - https://phabricator.wikimedia.org/T140288#2459280 (10Platonides) [19:07:36] AndyRussG: Looks like it passed? [19:07:43] (I'm seeing the # going down) [19:08:22] (03PS2) 10Dzahn: Use a ferm service for contint [puppet] - 10https://gerrit.wikimedia.org/r/298737 (owner: 10Muehlenhoff) [19:09:32] (03CR) 10Dzahn: [C: 032] Use a ferm service for contint [puppet] - 10https://gerrit.wikimedia.org/r/298737 (owner: 10Muehlenhoff) [19:09:52] ostriches: ah hmmmm yeah [19:10:34] (03CR) 10Dzahn: "this will just influence the new server that is still being tested anyways. no change on current gerrit" [puppet] - 10https://gerrit.wikimedia.org/r/298737 (owner: 10Muehlenhoff) [19:11:26] mutante: It won't affect old or new gerrit, it's ferm for CI box [19:11:29] (to talk to gerrit) [19:12:55] (03CR) 10Urbanecm: "Yep, I know. I'm going to abandone this and create a new one." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/298444 (https://phabricator.wikimedia.org/T140015) (owner: 10Urbanecm) [19:13:08] (03Abandoned) 10Urbanecm: HD logos for trwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298444 (https://phabricator.wikimedia.org/T140015) (owner: 10Urbanecm) [19:13:33] (03PS1) 10Urbanecm: HD logos for trwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298822 (https://phabricator.wikimedia.org/T140015) [19:13:41] ostriches: i know, and the ferm rules that is changed influences only new [19:14:02] I misread you then :) [19:14:11] !log legoktm@tin Synchronized php-1.28.0-wmf.8/includes/api/ApiQueryRecentChanges.php: API: Remove index forcing in ApiQueryRecentChanges - T140108 (duration: 00m 26s) [19:14:11] T140108: ApiQueryRecentChanges::run is spiking, nuking API servers - https://phabricator.wikimedia.org/T140108 [19:14:15] i mean the access for gerrit-new to CI [19:14:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:14:37] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/298737 (owner: 10Muehlenhoff) [19:21:13] (03PS4) 10Andrew Bogott: Disable the UpdateInstanceInfo tab. 
[puppet] - 10https://gerrit.wikimedia.org/r/298538 (https://phabricator.wikimedia.org/T139768) [19:23:03] !log anomie@tin Synchronized php-1.28.0-wmf.8/includes/auth/AuthManager.php: Add timing data logging for T119736 (duration: 00m 27s) [19:23:04] T119736: Could not find local user data for {Username}@{wiki} - https://phabricator.wikimedia.org/T119736 [19:23:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:23:56] !log anomie@tin Synchronized php-1.28.0-wmf.10/includes/auth/AuthManager.php: Add timing data logging for T119736 (duration: 00m 28s) [19:23:57] T119736: Could not find local user data for {Username}@{wiki} - https://phabricator.wikimedia.org/T119736 [19:24:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:24:09] (03CR) 10Dzahn: [V: 032] Gerrit: Vary index type. We don't want LUCENE on 2.8 [puppet] - 10https://gerrit.wikimedia.org/r/298816 (owner: 10Chad) [19:25:53] 06Operations, 10ops-eqiad: rack/setup/install/deploy labsdb1009-labsdb1011 - https://phabricator.wikimedia.org/T136860#2459318 (10Cmjohnson) [19:26:41] 06Operations, 10ops-eqiad: rack/setup/install/deploy labsdb1009-labsdb1011 - https://phabricator.wikimedia.org/T136860#2350219 (10Cmjohnson) These are installed but puppet and salt keys have not been completed yet. @jcrespo let me know if it's okay to continue. [19:28:23] 06Operations, 10ops-eqiad, 06DC-Ops, 10Traffic, and 2 others: rack/setup new eqiad lvs machines - https://phabricator.wikimedia.org/T104458#2459327 (10Cmjohnson) 05Open>03Resolved The servers have been installed and setup. the only issue is the snmp errors but that has a separate linked task. resolvin... [19:28:49] !log gerrit: restarting, puppet back on, issue fixed. [19:28:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:30:46] ostriches: :) cool! 
be back after lunch [19:31:03] 06Operations, 10ops-eqiad: eqiad: Install SSD's into ganeti hosts - https://phabricator.wikimedia.org/T138414#2459331 (10Cmjohnson) p:05Low>03Triage [19:31:35] 06Operations, 10ops-eqiad: labsdb1001: Swap eth0 cable - https://phabricator.wikimedia.org/T137555#2459332 (10Cmjohnson) p:05Triage>03Lowest [19:32:36] PROBLEM - puppet last run on mw2136 is CRITICAL: CRITICAL: Puppet has 1 failures [19:33:40] 06Operations, 10ops-eqiad: db1054 degraded RAID (failed disk) - https://phabricator.wikimedia.org/T139026#2459336 (10Cmjohnson) swapped disk [19:35:41] 06Operations, 10ops-eqiad: Rack/Setup Carbon/Apt Server Replacement - https://phabricator.wikimedia.org/T139171#2459355 (10Cmjohnson) going to call this mirror1001 [19:36:51] 06Operations, 10ops-eqiad, 10Analytics-Cluster, 06Analytics-Kanban: analytics1049.eqiad.wmnet disk failure - https://phabricator.wikimedia.org/T137273#2459359 (10Cmjohnson) 05Open>03Resolved [19:37:30] 06Operations, 10ops-eqiad: Rack/Setup Carbon/Apt Server Replacement - https://phabricator.wikimedia.org/T139171#2459361 (10RobH) [19:37:32] 06Operations, 10ops-eqiad, 10DBA: db1034 lag - https://phabricator.wikimedia.org/T139280#2459362 (10Cmjohnson) [19:38:41] 06Operations, 10ops-eqiad: ms-be1012.eqiad.wmnet: slot=7 dev=sdh failed - https://phabricator.wikimedia.org/T140101#2459364 (10Cmjohnson) p:05Normal>03Unbreak! [19:41:26] PROBLEM - Disk space on elastic2008 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 108064 MB (15% inode=99%) [19:45:29] need another grrrit-wm restart :( [19:46:17] 06Operations, 10ops-eqiad: ms-be1012.eqiad.wmnet: slot=7 dev=sdh failed - https://phabricator.wikimedia.org/T140101#2453018 (10Cmjohnson) @fgiunchedi Replaced the disk in slot 7 shows as unconfigured good but could not add back. Please take a look. [19:54:26] legoktm: I can do that [19:54:46] thanks [19:55:06] done [19:55:32] thanks Krenair. 
I thought I was a member of lolrrit-wm but apparently not [19:57:27] (03PS1) 10Chad: Group1 to wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298830 [19:59:46] RECOVERY - puppet last run on mw2136 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:59:54] (03CR) 10Thcipriani: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/292505 (https://phabricator.wikimedia.org/T110068) (owner: 10GWicke) [20:00:04] gwicke, cscott, arlolra, subbu, bearND, and mdholloway: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160713T2000). Please do the needful. [20:01:27] (03PS1) 10Chad: Contint: remove contint-users group [puppet] - 10https://gerrit.wikimedia.org/r/298832 [20:02:07] ostriches: is group1 still going to wmf.10 today? [20:02:54] bd808: Thaz the plan! [20:03:16] !log Update codfw elasticsearch cluster sttings with cluster.routing.allocation.disk.watermark.low: 70% to match eqiad and reduce free space icinga warnings [20:03:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:04:02] bd808: Prolly about 20 mins, gonna run to the store real quick for gatorade & shit. [20:04:26] ostriches: awesome. hydrate or die! [20:04:55] no parsoid deploy today [20:12:27] RECOVERY - Disk space on elastic2008 is OK: DISK OK [20:12:36] ostriches, hi, what's the status of the train at this point? We need to update WV maps as soon as the new code gets deployed [20:12:40] cc jgirault ^ [20:13:18] yurik: read up 5 lines [20:13:42] greg-g, thx! :) [20:13:53] :) [20:17:12] I could use some help. We're throwing errors from ores.wikimedia.org and I need to find the log that they end up in. We should be able to see them wherever uwsgi logs end up. [20:17:27] Oh! 
I just found /srv/log
[20:17:31] Looks like this will work
[20:17:52] (CR) Chad: [C: 2] Group1 to wmf.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/298830 (owner: Chad)
[20:18:34] (Merged) jenkins-bot: Group1 to wmf.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/298830 (owner: Chad)
[20:19:03] !log starting mobileapps deploy
[20:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:19:44] Weird. It looks like we're timing out while requesting something from plwiki's API
[20:20:12] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 to wmf.10
[20:20:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:22:54] Sure enough, this request times out: https://pl.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=ids|user|timestamp|userid|comment|size|contentmodel|content&revids=21914234230|3243242|234324
[20:23:21] It takes a *looong* time rather
[20:24:30] Not wmf.10's fault at least ^ :D
[20:24:42] halfak: only with all 3 variables
[20:24:47] the first revid is bad though
[20:24:53] it looks a few digits too long
[20:25:36] Yeah. Intentionally including a bad revid for testing something else.
[20:25:43] And stumbled across the slowness issue
[20:26:13] * halfak files some tasks for ORES to handle this kind of slowness better.
[20:26:14] interesting how it's fast with each separate
[20:27:22] mutante, filed a task for you, btw.
[20:27:57] !log deployed mobileapps d1eb1da
[20:28:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:28:09] reedy, that is weird.
[20:28:15] Do you think I should flag the issue to someone?
[20:28:24] Yeah
[20:28:31] I'd file an api/db/perf bug
[20:28:38] Something feels skewy
[20:29:17] Hmm... Response is fast now.
Maybe it pulled something into a cache
[20:29:39] Oh it works better if I include the bogus revid :)
[20:31:38] (CR) Thcipriani: [C: 1] "LGTM. Would like to use in the MW Deploy process soon." [puppet] - https://gerrit.wikimedia.org/r/292505 (https://phabricator.wikimedia.org/T110068) (owner: GWicke)
[20:35:57] Reedy, FYI: https://phabricator.wikimedia.org/T140302
[20:36:07] Yikes! It happens with enwiki too
[20:37:26] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master).
[20:37:31] To-do for ORES: Report timeout issues from the MediaWiki API so that users don't hate on us.
[20:38:03] Should get the ball rolling
[20:38:10] jynus, see https://phabricator.wikimedia.org/T140302 if you're around.
[20:38:18] this might be "Unbreak now" worthy :/
[20:39:54] Why are there unmerged mira changes?
[20:40:08] Wait does sync-wikiversions do a co-master sync?
[20:40:13] I don't think so
[20:40:24] o_0
[20:40:53] !log demon@tin Synchronized README: no-op to bring co-masters in sync (duration: 00m 28s)
[20:41:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:41:04] Lol, it's not even done and it logged?
[20:41:15] Er, or my terminal just lagged.
[20:41:15] w/e
[20:41:16] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge.
[20:41:19] updating terminal is slow
[20:41:25] thx icinga you're the best.
[20:42:08] halfak: May be a query killer
[20:42:44] Could be. Shouldn't be a slow query though
[20:42:55] This seems like new behavior to me
[20:43:15] There were some API server issues... And possibly killer changes
[20:43:23] ostriches: any reason sync-wikiversions shouldn't sync the co-masters?
[20:43:33] I thought it did.
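Editorial sketch of the ORES to-do mentioned above (report MediaWiki API timeouts to the caller instead of failing opaquely). This is an illustration only, not ORES code: `fetch_revisions`, `UpstreamTimeout`, the 5-second default, and the injectable `opener` parameter are all assumptions made for the example.

```python
import json
import socket
import urllib.error
import urllib.parse
import urllib.request


class UpstreamTimeout(Exception):
    """Hypothetical error type: the MediaWiki API did not answer in time."""


def fetch_revisions(api_url, rev_ids, timeout=5.0, opener=urllib.request.urlopen):
    """Query prop=revisions for rev_ids; raise UpstreamTimeout on a timeout.

    `opener` is injectable so the function can be exercised without network access.
    """
    query = urllib.parse.urlencode({
        "action": "query",
        "prop": "revisions",
        "rvprop": "ids|user|timestamp",
        "revids": "|".join(str(r) for r in rev_ids),
        "format": "json",
    })
    message = (f"MediaWiki API at {api_url} timed out after {timeout}s "
               f"for {len(rev_ids)} revids")
    try:
        with opener(f"{api_url}?{query}", timeout=timeout) as response:
            return json.load(response)
    except (socket.timeout, TimeoutError) as exc:
        # Timeout while connecting or reading the response body.
        raise UpstreamTimeout(message) from exc
    except urllib.error.URLError as exc:
        # urllib may wrap a timeout inside URLError.reason.
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            raise UpstreamTimeout(message) from exc
        raise
```

A caller could catch `UpstreamTimeout` and return a message naming the upstream API, so a slow `revids` query like the one above is reported as such rather than as a generic error.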
[20:43:40] SELECT rev_id,rev_page FROM `revision`,`page` WHERE rev_id IN ('21914234230','3243242','234324') AND (rev_page = page_id)
[20:43:42] But mira started complaining so I started suspecting.
[20:44:04] (PS2) Gehel: Remove old elasticsearch servers from DNS [dns] - https://gerrit.wikimedia.org/r/298790 (https://phabricator.wikimedia.org/T139758)
[20:45:45] (CR) Gehel: [C: 2] Remove old elasticsearch servers from DNS [dns] - https://gerrit.wikimedia.org/r/298790 (https://phabricator.wikimedia.org/T139758) (owner: Gehel)
[20:46:14] is it done with the WMDE patches yet?
[20:46:27] OMG. Reedy, why are those revids strings?
[20:46:28] looks like no https://gerrit.wikimedia.org/r/#/q/status:open+project:mediawiki/extensions/Wikibase,n,z
[20:46:45] I wonder if that would cause problems all by itself.
[20:47:03] do we wait for all of these having a green jenkins-bot column?
[20:47:13] halfak: it makes a difference on EXPLAIN
[20:47:33] I'll note that in the phab card.
[20:47:40] i just did
[20:47:46] Cool
[20:48:11] Well that is a *very* different query plan :)
[20:48:14] i suspect, it's the API/DB wrapper unconditionally quoting values from the internet
[20:49:15] mutante: No, we killed zuul.
[20:49:18] Queue died.
[20:49:31] Er, got too clogged and literally couldn't even
[20:49:58] So things will need rechecks
[20:50:03] ostriches: gotcha, ok
[20:50:06] (preferably not all of those en masse lol)
[20:50:15] (CR) Dzahn: "recheck" [puppet] - https://gerrit.wikimedia.org/r/298737 (owner: Muehlenhoff)
[20:50:26] heh, no
[20:52:01] (PS3) Dzahn: Use a ferm service for contint [puppet] - https://gerrit.wikimedia.org/r/298737 (owner: Muehlenhoff)
[20:52:05] ok, it worked pretty fast
[20:53:45] subbu: oh yes, i will definitely get to that, sorry
[20:53:52] ticket is good, saw it
[20:54:16] no worries.
[20:55:49] ostriches: "polygerrit" ?
https://groups.google.com/group/repo-discuss/attach/127c2bd2eb0799/Screen%20Shot%202016-01-29%20at%2012.03.07.png?part=0.1.1&authuser=0
[20:57:11] https://github.com/gerrit-review/gerrit/search?utf8=%E2%9C%93&q=enablePolyGerrit
[20:57:50] https://groups.google.com/forum/#!topic/repo-discuss/Zz8sEbsO36s
[20:58:30] (CR) Hashar: "recheck" [puppet] - https://gerrit.wikimedia.org/r/298538 (https://phabricator.wikimedia.org/T139768) (owner: Andrew Bogott)
[21:06:56] Woops i gave wrong link
[21:06:57] it is
[21:06:58] https://groups.google.com/forum/#!topic/repo-discuss/vBirC0v3ihE/discussion
[21:07:41] Considering it's only in master and none of the release branches we can't really test it.
[21:08:04] ostriches it is in 2.12
[21:08:20] No it isn't.
[21:08:25] I just looked at my clone.
[21:08:46] https://gerrit-review.googlesource.com/?polygerrit=1
[21:08:54] ostriches yeh it is ^^
[21:09:01] is running gerrit 2.12.3
[21:09:10] They're running master.
[21:09:13] No
[21:09:21] Yes they are.
[21:09:28] Powered by Gerrit Code Review (2.12.3-3154-g79ecd50) | Report Bug | Press '?' to view keyboard shortcuts
[21:09:36] It says here https://gerrit-review.googlesource.com/#/c/80017/
[21:09:38] that ^^
[21:09:59] They're still running master.
[21:10:10] Or they've cherry-picked it.
[21:10:15] Oh.
[21:10:23] Either way: it's not in stable-2.12 or the v2.12.x tags
[21:10:40] Oh, yep i just saw that
[21:10:42] sorry
[21:10:49] they backported it
[21:12:01] ostriches but i can build it in buck and deploy it to gerrit-test to test it.
[21:12:28] (PS5) Andrew Bogott: Disable the UpdateInstanceInfo tab. [puppet] - https://gerrit.wikimedia.org/r/298538 (https://phabricator.wikimedia.org/T139768)
[21:12:34] let's not worry about UI improvements until after we get gerrit moved/upgraded
[21:12:42] (is my opinion)
[21:12:51] paladox: No. I don't like deviating from actual releases.
[21:12:58] Building and deploying custom versions is a pain.
[21:13:09] I'm not saying doing that on lead
[21:13:20] i mean i will do it on the gerrit-test instance i setup
[21:13:22] in labs
[21:13:33] I mean you can do whatever you want with gerrit-test, it doesn't mean we'll cherry-pick it into the production version.
[21:14:18] Ok
[21:14:33] let's not distract ostriches with that stuff right now then
[21:14:42] he has plenty of other things to worry about :)
[21:16:10] Operations, Mail: not being able to send emails via Special:EmailUser - https://phabricator.wikimedia.org/T137337#2365463 (Huji) If it helps, I received emails on a GMail account and always receive them, but I learn about having an email through Flow hours before I actually see it in my inbox.
[21:17:09] Operations, Ops-Access-Requests, Patch-For-Review: MediaWiki deployment shell access request for zfilipin - https://phabricator.wikimedia.org/T140264#2459808 (greg) +1 from me
[21:17:27] Sorry
[21:18:38] (CR) Hashar: [C: 1] Add zfilipin to deployment group [puppet] - https://gerrit.wikimedia.org/r/298792 (https://phabricator.wikimedia.org/T140264) (owner: Thcipriani)
[21:18:56] Operations, Puppet, Beta-Cluster-Infrastructure, Tracking: Minimize differences between beta and production (Tracking) - https://phabricator.wikimedia.org/T87220#2459810 (greg)
[21:20:40] Operations, Puppet, Beta-Cluster-Infrastructure, Technical-Debt, Tracking: Minimize differences between beta and production (Tracking) - https://phabricator.wikimedia.org/T87220#986422 (greg)
[21:22:16] (PS3) Dzahn: Gerrit: Vary extension listing cron on git_dir [puppet] - https://gerrit.wikimedia.org/r/298808 (owner: Chad)
[21:22:19] (PS1) Smalyshev: Move updater logs config to /etc/wdqs [puppet] - https://gerrit.wikimedia.org/r/298880 (https://phabricator.wikimedia.org/T139434)
[21:22:52] (CR) Dzahn: [C: 2] Gerrit: Vary extension listing cron on git_dir [puppet] - https://gerrit.wikimedia.org/r/298808 (owner: Chad)
[21:23:38]
(CR) jenkins-bot: [V: -1] Move updater logs config to /etc/wdqs [puppet] - https://gerrit.wikimedia.org/r/298880 (https://phabricator.wikimedia.org/T139434) (owner: Smalyshev)
[21:28:12] (PS2) Smalyshev: Move updater logs config to /etc/wdqs [puppet] - https://gerrit.wikimedia.org/r/298880 (https://phabricator.wikimedia.org/T139434)
[21:29:19] (CR) jenkins-bot: [V: -1] Move updater logs config to /etc/wdqs [puppet] - https://gerrit.wikimedia.org/r/298880 (https://phabricator.wikimedia.org/T139434) (owner: Smalyshev)
[21:30:25] (PS3) Smalyshev: Move updater logs config to /etc/wdqs [puppet] - https://gerrit.wikimedia.org/r/298880 (https://phabricator.wikimedia.org/T139434)
[21:33:17] (CR) Dzahn: [C: 1] "scheduled for merge tomorrow (will cause another gerrit restart, we are done with them for today)" [puppet] - https://gerrit.wikimedia.org/r/298688 (owner: Bartosz Dziewoński)
[21:34:05] (CR) Dzahn: "eh, ignore last comment, this is another patch" [puppet] - https://gerrit.wikimedia.org/r/298688 (owner: Bartosz Dziewoński)
[21:39:19] (PS2) Dzahn: Workaround for really broken browser detection in Gerrit code [puppet] - https://gerrit.wikimedia.org/r/298688 (owner: Bartosz Dziewoński)
[21:39:41] (PS3) Alex Monk: Remove old pre-Swift directory variables referencing upload7 [mediawiki-config] - https://gerrit.wikimedia.org/r/298404 (https://phabricator.wikimedia.org/T64835)
[21:40:00] (CR) Dzahn: [C: 2] Workaround for really broken browser detection in Gerrit code [puppet] - https://gerrit.wikimedia.org/r/298688 (owner: Bartosz Dziewoński)
[21:48:24] !log ytterbium, disabled puppet, started apache, needs fix
[21:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:50:15] PROBLEM - puppet last run on kafka2001 is CRITICAL: CRITICAL: Puppet has 1 failures
[21:50:23] (PS1) Dzahn: gerrit: only include acme challenge conf on 2.4 [puppet] -
https://gerrit.wikimedia.org/r/298885
[21:50:44] (PS2) Dzahn: gerrit: only include acme challenge conf on 2.4 [puppet] - https://gerrit.wikimedia.org/r/298885
[21:50:54] (CR) jenkins-bot: [V: -1] gerrit: only include acme challenge conf on 2.4 [puppet] - https://gerrit.wikimedia.org/r/298885 (owner: Dzahn)
[21:51:39] (PS3) Dzahn: gerrit: only include acme challenge conf on 2.4 [puppet] - https://gerrit.wikimedia.org/r/298885
[21:54:06] (CR) Dzahn: [C: 2] gerrit: only include acme challenge conf on 2.4 [puppet] - https://gerrit.wikimedia.org/r/298885 (owner: Dzahn)
[21:55:49] !log ytterbium - puppet enabled again, fix deployed
[21:55:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:58:01] (CR) Dzahn: [C: 1] Allow aklapper to delete files in Phabricator [puppet] - https://gerrit.wikimedia.org/r/298494 (owner: Aklapper)
[22:02:20] (PS2) Dzahn: xhgui: move role to module structure [puppet] - https://gerrit.wikimedia.org/r/298678
[22:02:48] (CR) Dzahn: [C: 2] "ori, just fyi that i changed the role name slightly. no-op on tungsten. http://puppet-compiler.wmflabs.org/3322/tungsten.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/298678 (owner: Dzahn)
[22:04:36] Operations, Traffic, HTTPS, Patch-For-Review: Enforce HTTPS+HSTS on remaining one-off sites in wikimedia.org that don't use standard cache cluster termination - https://phabricator.wikimedia.org/T132521#2459936 (Boshomi) >>! In T132521#2454541, @demon wrote: > Where would such an deprecation be...
[22:07:39] (CR) Dzahn: "also checked with "watroles" tool for labs instances using it" [puppet] - https://gerrit.wikimedia.org/r/298678 (owner: Dzahn)
[22:09:23] (CR) Dzahn: "adding Alex K. for review" [puppet] - https://gerrit.wikimedia.org/r/272613 (owner: BryanDavis)
[22:15:51] (CR) Paladox: [C: 1] "This is needed to have the diffusion links show again."
[puppet] - https://gerrit.wikimedia.org/r/298710 (owner: Paladox)
[22:16:57] Operations, Fundraising Tech Backlog: Add granularity limiter (g=) to wikimedia.org DKIM record(s) - https://phabricator.wikimedia.org/T140316#2460001 (CCogdill_WMF)
[22:17:07] RECOVERY - puppet last run on kafka2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[22:17:28] Operations, Fundraising Tech Backlog: Add granularity limiter (g=) to wikimedia.org DKIM record(s) - https://phabricator.wikimedia.org/T140316#2460017 (CCogdill_WMF)
[22:17:43] ostriches: seeing a bunch of those RL module hash warnings again... Krinkle?
[22:19:08] I'm not by my laptop at the moment so I can't help much
[22:19:20] Operations, Fundraising-Backlog, fundraising-tech-ops, Patch-For-Review: Allow Fundraising to A/B test wikipedia.org as send domain - https://phabricator.wikimedia.org/T135410#2460021 (CCogdill_WMF) To follow up on my last comment, I created T140316 which addresses the granularity limiter for wik...
[22:20:15] ostriches: K thx :)
[22:25:55] Krinkle: greg-g (not sure whom else to ping) there is a spike in "Module '...' produced an invalid version hash" warnings starting around 20:20 UTC today, coinciding with some wikiversions deploy to wmf.10
[22:26:04] ongoing
[22:26:16] AndyRussG: That sounds like a major problem.
[22:26:26] AndyRussG: When did this start? Link? Any impact so far?
[22:26:26] Yeah
[22:26:41] All I know about is what I see in logstash
[22:26:48] https://logstash.wikimedia.org/#/dashboard/elasticsearch/default
[22:26:52] Not limited to centralnotice?
[22:26:58] It looked like it was a spike earlier
[22:27:11] Oh nevermind.
[22:27:13] Add the filter "invalid module hash"
[22:27:17] Ori forgot to update a strlen check
[22:27:18] Yeah not limited at all to CN
[22:27:28] It means it'll hash it again and then work fine.
[22:27:57] ostriches: Was part of the new wmf branch
[22:28:06] ostriches: Only on group0?
[22:28:06] If so, please block roll out, but no need to revert.
[22:28:19] Should be fixed by tomorrow.
[22:28:19] No more rollout today
[22:28:26] Krinkle: ah cool, that would explain the lack of real impact and the fact that it says "warning"...
[22:28:28] And today was group0?
[22:28:37] And 1
[22:28:40] AndyRussG: See StartupModule.php
[22:28:46] It coincides with a deploy to group 1
[22:28:58] ostriches: Oh, that's... unfortunate.
[22:29:11] There's a use in having those 24 hours...
[22:29:13] Can we swat the fix?
[22:29:16] Sure
[22:29:46] I know but the train missed yesterday because of the other thing
[22:30:11] Well, we could skip a week, or do the last step on friday, or on monday.
[22:30:42] That's 2 weeks missed if we skip
[22:31:34] Given the state of our test confidence, I'd say that sounds better than just taking the risk on group1 wikis. This could've gone wrong in hard to reverse ways.
[22:32:36] I did mention two other options than skipping - either way not my responsibility to come up with a solution. If this isn't part of documented practice, I'd recommend we keep the extra day in the future. Or else I'd be happy to stay quiet and file a task instead.
[22:32:36] No. Skipping weeks is bad because the finally deployed delta is too big
[22:32:51] what's up?
[22:33:01] strlen( $versionHash ) !== 8.
[22:33:04] :)
[22:33:08] Friday sounds risky for Wikipedias
[22:33:19] I'd prefer following Monday
[22:33:36] I'll cherry-pick a fix
[22:33:51] ori: Okay. I'll +2.
[22:33:59] Thanks guys
[22:36:29] (PS13) Paladox: Add missing roottree, file configs to gerrit.config.erb [puppet] - https://gerrit.wikimedia.org/r/298710
[22:36:46] PROBLEM - Disk space on elastic2008 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 108546 MB (15% inode=99%)
[22:37:00] Operations, Parsoid: Delete Parsoid deb 0.4.0 package from releases wikimedia.org - https://phabricator.wikimedia.org/T140279#2460075 (Dzahn) I am not sure how to do this right.
The current situation is: reprepro ls parsoid parsoid | 0.5.1all | jessie-mediawiki | amd64, i386 parsoid | 0.4.0 | trust...
[22:37:38] * AndyRussG wipes his brow w/ an old discarded bannerLoader version hash
[22:38:20] Krinkle: options A: strlen( $versionHash ) !== 7 B: strlen( $versionHash ) !== strlen( ResourceLoader::makeHash( '' ) ) C: strlen( $versionHash ) !== ResourceLoader::HASH_LENGTH, D: always unconditionally rehash
[22:38:38] I'm leaning toward B.
[22:40:29] I'll just do A for now and we can debate this later
[22:40:34] ori: D+E: Make makeHash() do weird stuff if already length-and-hash-like (or just length-like)
[22:40:39] Could also be an option
[22:40:49] that could backfire spectacularly
[22:42:42] it could work, needs some thought
[22:42:48] i'll just update 8 to 7 for now
[22:43:07] k
[22:48:06] PROBLEM - puppet last run on mw2105 is CRITICAL: CRITICAL: puppet fail
[22:48:52] oh, Krinkle, did you have a chance to talk about the SWAT (and potentially train's) usage of mw1017 vs mw1099?
[22:53:16] PROBLEM - puppet last run on wasat is CRITICAL: CRITICAL: puppet fail
[22:53:55] RECOVERY - Disk space on elastic2008 is OK: DISK OK
[22:54:29] greg-g: Not yet.
[22:54:48] no worries
[22:55:03] ori: SWAT now uses staging on mw1017 as part of their process (yay!) - any preference with regards to which one? E.g. to avoid conflicts when we're debugging stuff.
[22:55:11] I still use mw1017 most of the time, but not sure about others
[22:55:29] which one what?
[22:55:47] I guess it also wouldn't be too difficult to move another app server from the pool into the debug pool for swat if conflicts become common
[22:55:50] picking a different host for SWAT might be good, just because people already have a habit of using mw1017
[22:55:56] Yeah
[22:56:17] mw1099?
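Editorial sketch of the version-hash length check discussed at 22:38 to 22:42 above, written in Python rather than PHP. It illustrates option C (compare against a shared constant, so the check cannot drift from the hash function). `HASH_LENGTH`, `make_hash`, and `normalize_version_hash` are hypothetical stand-ins, not MediaWiki's actual code; the value 7 is taken from "update 8 to 7" in the chat, and the SHA-1 truncation is an assumption.

```python
import hashlib

# Assumed value, taken from "i'll just update 8 to 7" in the chat.
HASH_LENGTH = 7


def make_hash(value):
    """Hypothetical stand-in for ResourceLoader::makeHash(): a truncated digest."""
    return hashlib.sha1(value.encode("utf-8")).hexdigest()[:HASH_LENGTH]


def normalize_version_hash(version_hash):
    """Return (hash, was_rehashed).

    Option C from the discussion: compare against the shared constant rather
    than a literal, so changing HASH_LENGTH updates producer and check together.
    """
    if len(version_hash) != HASH_LENGTH:
        # Mirrors the described behavior: a wrong-length value is hashed
        # again (and would be logged as a warning) instead of failing.
        return make_hash(version_hash), True
    return version_hash, False
```

With a stale literal of 8 in the check, every correctly produced 7-character hash would take the rehash branch, which matches the spike of warnings without user-visible breakage described above.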
[22:56:18] I'm having to tether through my phone and it's running out of battery
[22:56:20] I may or may not be able to be around for swat
[22:56:42] that's the only other eqiad canary I think
[22:56:57] wfm
[22:56:58] yeah, the others are mw2017 and mw2099
[22:57:14] ori: Krinkle cool, thanks both!
[22:57:25] If concurrency is still an issue (e.g. 2 non-swat canaries), we can always expand the pool.
[22:57:27] * greg-g goes to update docs and send an email to known swatters
[22:57:33] thanks!
[23:00:04] RoanKattouw, ostriches, MaxSem, and Dereckson: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160713T2300). Please do the needful.
[23:00:04] Krenair: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[23:05:20] (docs updated, swatters emailed)
[23:06:48] Looks like I'll have to do this one myself though
[23:08:05] Let's hope my connection lasts
[23:08:52] (PS6) Alex Monk: Add basic contact form for stewards [mediawiki-config] - https://gerrit.wikimedia.org/r/225509 (https://phabricator.wikimedia.org/T98625)
[23:08:58] (CR) Alex Monk: [C: 2] Add basic contact form for stewards [mediawiki-config] - https://gerrit.wikimedia.org/r/225509 (https://phabricator.wikimedia.org/T98625) (owner: Alex Monk)
[23:09:40] (Merged) jenkins-bot: Add basic contact form for stewards [mediawiki-config] - https://gerrit.wikimedia.org/r/225509 (https://phabricator.wikimedia.org/T98625) (owner: Alex Monk)
[23:11:38] okay, I need to change it a little bit
[23:13:51] Krenair: while you do that, can I sneak in a 1-char fix to wmf10?
[23:14:18] yes
[23:15:06] (PS1) Alex Monk: Change steward contact form back to vform instead of ooui [mediawiki-config] - https://gerrit.wikimedia.org/r/298899 (https://phabricator.wikimedia.org/T98625)
[23:15:32] RECOVERY - puppet last run on mw2105 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[23:15:34] (CR) Alex Monk: [C: 2] "already in prod" [mediawiki-config] - https://gerrit.wikimedia.org/r/298899 (https://phabricator.wikimedia.org/T98625) (owner: Alex Monk)
[23:16:30] (Merged) jenkins-bot: Change steward contact form back to vform instead of ooui [mediawiki-config] - https://gerrit.wikimedia.org/r/298899 (https://phabricator.wikimedia.org/T98625) (owner: Alex Monk)
[23:17:17] !log ori@tin Synchronized php-1.28.0-wmf.10/includes/resourceloader/ResourceLoaderStartUpModule.php: I882bf7075: ResourceLoader: Update expected length of module version hash (duration: 00m 25s)
[23:17:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:17:57] !log krenair@tin Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/225509 + https://gerrit.wikimedia.org/r/298899 - create https://meta.wikimedia.org/wiki/Special:Contact/stewards (duration: 00m 26s)
[23:18:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:20:21] RECOVERY - puppet last run on wasat is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[23:21:45] (PS2) Alex Monk: Logo update for trwikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/298819 (https://phabricator.wikimedia.org/T140015) (owner: Urbanecm)
[23:21:50] (CR) Alex Monk: [C: 2] Logo update for trwikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/298819 (https://phabricator.wikimedia.org/T140015) (owner: Urbanecm)
[23:22:33] (Merged) jenkins-bot: Logo update for trwikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/298819
(https://phabricator.wikimedia.org/T140015) (owner: Urbanecm)
[23:22:55] (PS2) Alex Monk: HD logos for trwikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/298822 (https://phabricator.wikimedia.org/T140015) (owner: Urbanecm)
[23:23:01] (CR) Alex Monk: [C: 2] HD logos for trwikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/298822 (https://phabricator.wikimedia.org/T140015) (owner: Urbanecm)
[23:23:43] (Merged) jenkins-bot: HD logos for trwikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/298822 (https://phabricator.wikimedia.org/T140015) (owner: Urbanecm)
[23:25:09] !log krenair@tin Synchronized static/images/project-logos: update for https://gerrit.wikimedia.org/r/298819 and https://gerrit.wikimedia.org/r/298822 (duration: 00m 24s)
[23:25:11] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:25:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:26:31] done
[23:26:37] oh, wait
[23:27:01] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy
[23:27:02] almost forgot the second part
[23:27:17] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/298822 (duration: 00m 26s)
[23:27:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:29:29] Operations, ops-eqiad: Rack/Setup Carbon/Apt Server Replacement - https://phabricator.wikimedia.org/T139171#2460215 (faidon) No need to — I wouldn't expect more of those. An element name for it would be fine IMHO.
[23:29:51] okay that appears to have worked
[23:29:58] now I'm done
[23:30:42] RECOVERY - MegaRAID on db1054 is OK: OK: optimal, 1 logical, 2 physical
[23:36:45] (PS1) Dzahn: ipmi: move role to module structure [puppet] - https://gerrit.wikimedia.org/r/298902
[23:43:48] (PS1) Dzahn: servermon: move role to module, add system::role [puppet] - https://gerrit.wikimedia.org/r/298904
[23:45:16] (PS2) Dzahn: servermon: move role to module, add system::role [puppet] - https://gerrit.wikimedia.org/r/298904
[23:46:02] (PS3) Dzahn: servermon: move role to module, add system::role [puppet] - https://gerrit.wikimedia.org/r/298904
[23:48:41] (PS1) Dzahn: extdist: move role to module, fix class name, labs-only [puppet] - https://gerrit.wikimedia.org/r/298906
[23:53:13] (PS1) Dzahn: installserver: move role to module, rename to ::wmf [puppet] - https://gerrit.wikimedia.org/r/298907
[23:54:04] (PS2) Dzahn: installserver: move role to module, rename to ::wmf [puppet] - https://gerrit.wikimedia.org/r/298907
[23:54:35] jynus: Did you still want to do the phabricator database failover during tonight's phab maintenance window?