[00:00:10] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Enable ORES review tool as a beta feature in ptwiki (T139692) (duration: 00m 28s)
[00:00:11] T139692: Deploy ORES review tool in ptwiki - https://phabricator.wikimedia.org/T139692
[00:00:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:00:55] checked without mw17, works as expected
[00:01:00] Dereckson: thank you :)
[00:01:01] ptwiki is on s2
[00:01:13] db2035
[00:01:37] mw1017? we're not in pmtpa anymore :)
[00:01:38] I mean "Select count(*) from ores_classification"
[00:01:54] Krenair: yeah :D
[00:02:06] Amir1: 5019
[00:02:07] you can pull table_rows out of one of the information_schema tables
[00:02:16] Dereckson: that's perfect
[00:02:23] but I suppose with only 5k rows it doesn't matter much
[00:02:47] Krenair: the best solution is to have replicas in labs IMO
[00:03:03] we do
[00:03:07] so I can test them easily and play with them, etc.
[00:03:15] but if this is a new table you need to request it be whitelisted as appropriate
[00:03:21] Krenair: for ORES tables, it's not replicated
[00:03:34] should they be visible in labs?
[00:03:53] I already talked to yuvipanda, he thinks it's okay but the script to do that is bogus and might leak data
[00:04:01] the ores table data is public
[00:04:17] * Krenair rolls eyes
[00:04:21] it's the maintain-replicas script again
[00:04:21] Krenair: mw1017 is the chosen canary server to test changes before sending them to prod
[00:04:23] I'm rewriting it
[00:04:33] I know Dereckson
[00:04:41] we also have mw1099, mw2017 and mw2099
[00:04:51] not quite sure why
[00:05:11] two eqiad and two codfw
[00:05:15] Krenair: nice, tell me when you are done with the script. I definitely need it
[00:05:30] Dereckson, right but two?
[00:05:45] Amir1, you can follow along https://gerrit.wikimedia.org/r/#/c/295607/
[00:05:47] along at *
[00:06:46] Krenair: the script being written in Perl is considered "an issue" :)))
[00:06:52] like
[00:07:10] I had a particularly fun time trying to understand what it was doing.
[00:07:29] Amir1: indeed, should be in LOLCAT >.>
[00:07:57] I think the part I particularly liked was: next if ($customviews{$view}->{'limit'}//1) > ($db{$dbk}->{'size'}//1);
[00:08:16] MatmaRex: there?
[00:08:20] Dereckson: yeah-ish
[00:08:25] sup?
[00:08:34] :D
[00:08:45] and I have no idea what that part means
[00:08:48] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[00:09:28] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[00:09:41] well
[00:10:18] PROBLEM - puppet last run on cp3049 is CRITICAL: CRITICAL: Puppet has 1 failures
[00:10:20] It continues to the next iteration if the view's limit number is greater than the database's size number
[00:10:58] so particularly expensive views don't show on huge DBs, I guess
[00:10:59] Dereckson, we're going to need a small follow up to the Echo patch.
[00:11:07] not sure about the //1
[00:12:28] (CR) Ladsgroup: "It needs flake8 specially about long lines." [software] - https://gerrit.wikimedia.org/r/295607 (https://phabricator.wikimedia.org/T138450) (owner: Alex Monk)
[00:13:27] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[00:13:38] (CR) Ladsgroup: "Thank you!" (1 comment) [software] - https://gerrit.wikimedia.org/r/295607 (https://phabricator.wikimedia.org/T138450) (owner: Alex Monk)
[00:14:07] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[00:14:27] matt_flaschen: k
[00:15:23] AndyRussG: current deployed HEAD is 29321966fa
[00:15:59] Dereckson: yep
[00:16:10] Didja see the core patch I made? (Didn't +2 it)
[00:16:12] ?
[00:16:23] https://gerrit.wikimedia.org/r/#/c/297926/
[00:16:57] PROBLEM - Start and verify pages via webservices on kubernetes on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/webservice/kubernetes - 185 bytes in 41.833 second response time
[00:17:31] AndyRussG: so it will add Add object caching to ChoiceDataProvider / Enforce jscs, make it pass / jshint, gruntfile, and compatibility fixes / Minor clean up in CNChoiceDataResourceLoaderModule / Update mediawiki_api gem to 1.7.1 / Minor clean up in CNChoiceDataResourceLoaderModule / ext.centralNotice.display: API for registering tests / etc.
[00:17:41] yuvipanda: ^ kubernetes
[00:17:44] AndyRussG: SWAT is normally for one patch, not for a train of patches
[00:17:47] Dereckson, followup is https://gerrit.wikimedia.org/r/#/c/297932/
[00:18:02] Dereckson: it's a bunch of accumulated changes, it's true
[00:18:06] No major features
[00:18:17] Dereckson: https://phabricator.wikimedia.org/rEORSd9780062a0a7aa2950f230371b0afdc6f84ef107 this one is not deployed yet. I thought it wouldn't be nice to put it in SWAT window :D
[00:19:00] AndyRussG: live on mw1017
[00:19:14] AndyRussG: er hold on
[00:19:29] ACKNOWLEDGEMENT - Start and verify pages via webservices on kubernetes on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/webservice/kubernetes - 185 bytes in 56.120 second response time Yuvi Panda Flaky check is flaky
[00:19:50] AndyRussG: now live on mw1017
[00:19:56] Dereckson: testing! thx!
[00:20:32] Amir1: before it would have been useful, now it can wait next MediaWiki wmf branch
[00:20:49] Amir1: we'll run it at wmf10
[00:22:05] AndyRussG: perhaps in the future could you request deployment windows one per for your team only to update this extension if you've a lot of changes to land?
[00:22:05] yeah
[00:23:35] Dereckson: OK
[00:24:28] Dereckson: I don't see the new code on mw1017
[00:24:54] Dereckson: you didn't merge the core patch?
[00:25:12] legoktm: your change API: Generate head items in the context of the given title is live on mw1017
[00:25:17] (CR) Alex Monk: [WIP/POC/POS] Add python version of maintain-replicas script (1 comment) [software] - https://gerrit.wikimedia.org/r/295607 (https://phabricator.wikimedia.org/T138450) (owner: Alex Monk)
[00:25:25] Dereckson: testing
[00:25:47] AndyRussG: ah, not yet, url of the core Gerrit change?
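Editor's note on the Perl guard quoted at 00:07:57 and the open question about `//1` at 00:11:07: in Perl, `//` is the defined-or operator, so a missing (undef) `limit` or `size` falls back to 1, while an explicit 0 is kept. A minimal Python sketch of the same test follows; the function name and the dict-based inputs are hypothetical, and this is not code from the maintain-replicas rewrite under review.

```python
def should_skip_view(view: dict, db: dict) -> bool:
    """Python rendering of the Perl guard:
    next if ($customviews{$view}->{'limit'}//1) > ($db{$dbk}->{'size'}//1);
    """
    # Perl's // (defined-or) only substitutes the default when the left side
    # is undef, so a missing key becomes 1 but an explicit 0 stays 0.
    limit = view.get("limit")
    size = db.get("size")
    limit = 1 if limit is None else limit
    size = 1 if size is None else size
    # True means "skip this view for this database".
    return limit > size
```

Read this way, a view with no `limit` is skipped only for a database whose `size` is explicitly below 1, and a database with no `size` is treated as size 1.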
[00:25:52] Dereckson: confirmed
[00:25:56] The core patch updates the CN submodule to 07e047614d5
[00:26:07] RECOVERY - Start and verify pages via webservices on kubernetes on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 31.458 second response time
[00:26:16] Dereckson: https://gerrit.wikimedia.org/r/#/c/297926/
[00:26:41] It doesn't get automerged like other extensions
[00:27:27] PROBLEM - Disk space on terbium is CRITICAL: DISK CRITICAL - free space: / 2550 MB (3% inode=79%)
[00:27:28] !log dereckson@tin Synchronized php-1.28.0-wmf.9/includes/api/ApiParse.php: API: Generate head items in the context of the given title (T139565) (duration: 00m 30s)
[00:27:29] T139565: API action=parse&prop=headhtml doesn't operate in the context of the page being requested - https://phabricator.wikimedia.org/T139565
[00:27:32] legoktm: okay, here you are in prod ^
[00:27:32] Dereckson: Should I +2? Last time I was told the deployer should do that
[00:27:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:28:08] AndyRussG: yes I'm going to do but that doesn't matter for the code on mw1017
[00:28:30] Dereckson: what sha did you update mw1017 to?
[00:28:43] AndyRussG: give me the full path of one of the files your commit changed
[00:28:46] Dereckson: confirmed, thank you
[00:28:53] legoktm: you're welcome
[00:29:04] I'm going to checksum them both on Tin and mw1017, we'll see
[00:30:22] (and yes, +2 on wmf branch should be at deploy time not before)
[00:31:30] Dereckson, sorry for the last minute change (per above). It's https://gerrit.wikimedia.org/r/#q,297934,n,z .
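Editor's note: the checksum comparison announced at 00:29:04 can be sketched as follows. This is a minimal illustration that assumes both copies of the file are readable as local paths; across two hosts one would instead run the digest on each host and compare the hex strings. The helper names are hypothetical, and the log does not say which tool was actually used.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True when both files hash to the same MD5 digest."""
    return md5_of_file(path_a) == md5_of_file(path_b)
```

A matching digest only shows that the two copies are byte-identical; it says nothing about whether the deployed revision is the intended one.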
[00:32:05] AndyRussG: checksum matches for this file:
[00:32:10] mw1017 1a5ed48f0cdae212b5de30c0bd29b662 /srv/mediawiki/php-1.28.0-wmf.9/extensions/CentralNotice/api/ApiCentralNoticeChoiceData.php
[00:32:18] tin 1a5ed48f0cdae212b5de30c0bd29b662 api/ApiCentralNoticeChoiceData.php
[00:32:59] Dereckson: problem.....aaaarrrg
[00:33:55] Something bad happened in the merge of master into wmf_deploy
[00:33:56] RECOVERY - puppet last run on cp3049 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:33:58] AndyRussG: we revert back to 2932196?
[00:34:03] Yeah pls do
[00:34:05] k
[00:34:11] Dereckson: thx and apologies!!!!
[00:35:53] Dereckson: If you have time, I could maybe fix it and we could try again after the other patches? The code is fully tested, I don't know what went wrong with the merge to our deploy branch
[00:37:33] AndyRussG: if there isn't any emergency, you've all the week-end and Monday morning to fix that
[00:37:50] Dereckson: we do have folks standing by specifically to watch this patch....
[00:38:32] AndyRussG: it fixes the 5 users blocked by rename? sure fix your merge now in this case
[00:39:28] (CR) Ladsgroup: [WIP/POC/POS] Add python version of maintain-replicas script (1 comment) [software] - https://gerrit.wikimedia.org/r/295607 (https://phabricator.wikimedia.org/T138450) (owner: Alex Monk)
[00:39:30] it fixes a bunch of things, including improving performance of the whole site
[00:40:00] which patch is this we're discussing making an exception for?
[00:42:53] Dereckson, AndyRussG: ^
[00:44:16] PROBLEM - puppet last run on cp4001 is CRITICAL: CRITICAL: Puppet has 1 failures
[00:46:04] Operations: Rename 'restricted' group? - https://phabricator.wikimedia.org/T104671#2440374 (Peachey88)
[00:46:29] Krenair: https://gerrit.wikimedia.org/r/297817
[00:47:00] Krenair: we've a bunch of users stuck in renames, like binapland
[00:47:02] https://phabricator.wikimedia.org/T137973
[00:47:25] They can't login to their accounts as long as the rename process isn't finished.
[00:47:48] (CR) Alex Monk: [WIP/POC/POS] Add python version of maintain-replicas script (1 comment) [software] - https://gerrit.wikimedia.org/r/295607 (https://phabricator.wikimedia.org/T138450) (owner: Alex Monk)
[00:48:35] Dereckson, that's CentralAuth, I thought AndyRussG was here about CentralNotice?
[00:49:48] Operations, Commons, media-storage, User-notice: Some fonts not anti-aliasing in SVG thumbnails after upgrade of scaling servers - https://phabricator.wikimedia.org/T139543#2440380 (Liuxinyu970226)
[00:49:50] ...
[00:50:53] AndyRussG: so does your new branch HEAD solve an unbreak now feature?
[00:51:14] what urgent things? < "it fixes a bunch of things, including improving performance of the whole site"
[00:51:33] Dereckson: I found the problem, creating a new merge patch now
[00:51:48] Dereckson: why don't you go ahead with the other SWAT patches meanwhile?
[00:52:00] because I've to do a full scap
[00:52:43] One of the patches has some l10n files to update.
[00:53:55] Dereckson, what's with the ...?
[00:54:03] Dereckson: here's the merge patch on the CN wmf_deploy branch: https://gerrit.wikimedia.org/r/297938
[00:54:15] I can make the core patch then
[00:54:33] Am I missing something obvious and making myself look silly here?
[00:54:39] I don't get it
[00:55:12] Krenair: ... = "I acked the AndyRussG's answer as a 'Yes, that fixes the rename issue'" : 00:38:32 < Dereckson> AndyRussG: it fixes the 5 users blocked by rename? 00:39:30 < AndyRussG> it fixes a bunch of things
[00:56:27] AndyRussG: if there is no unbreak now! or very important feature in your branch, I suggest you reschedule it for next SWAT
[00:56:28] (CR) Ladsgroup: [WIP/POC/POS] Add python version of maintain-replicas script (1 comment) [software] - https://gerrit.wikimedia.org/r/295607 (https://phabricator.wikimedia.org/T138450) (owner: Alex Monk)
[00:56:30] It doesn't fix rename anything, it improves site performance in several ways
[00:56:58] Well I'm not one to rock any boats
[00:57:46] It's ready to go and it is important regardless of the bug triaged. But it's your call
[00:58:06] I could also wait until the end of the Scap
[01:01:00] Dereckson: here is the core patch https://gerrit.wikimedia.org/r/297939
[01:01:29] aaarg
[01:01:39] no
[01:02:14] matt_flaschen: sorry for the delay, both Flow and Echo code is live on mw1017
[01:03:11] AndyRussG: any idea why Popup invited itself when I tried to revert your core patch at https://gerrit.wikimedia.org/r/297937?
[01:04:15] Dereckson, no problem. I had that last minute patch I mentioned before (https://gerrit.wikimedia.org/r/#/c/297934/).
[01:04:19] Will test the other two first.
[01:05:25] Dereckson: no Idea. But here is the new core patch, for realz: https://gerrit.wikimedia.org/r/#/c/297940/
[01:06:35] AndyRussG: we first need to revert the former one
[01:07:26] RECOVERY - puppet last run on cp4001 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[01:07:44] Dereckson: we can't just merge and deploy the new one? It just changes the submodule pointer for CentralNotice
[01:08:55] Should be just as easy
[01:09:50] The problem with the previous merge patch was a forgotten git pull (or I should've done git merge origin/master instead of git merge master)
[01:09:59] Flow one looks good.
[01:10:58] Dereckson: at least one or the other (revert to previous submodule pointer, or update to the one in the last patch I just linked here) should be done
[01:11:12] Dereckson, I can't test the Echo one on mw1017, since it's primarily about i18n. So just go ahead and do that (but be sure to also do https://gerrit.wikimedia.org/r/#/c/297934/ please).
[01:13:09] matt_flaschen: we wait last Jenkins tests for 297934
[01:13:30] Dereckson, great, thanks.
[01:14:33] Merged.
[01:17:12] matt_flaschen: live on mw1017 too, but same thing for l10n I guess
[01:17:31] Yeah
[01:18:21] matt_flaschen: we scap?
[01:18:30] Yep
[01:20:33] !log dereckson@tin Started scap: Flow 297914, Echo 297919 297934, ORES 297916
[01:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:26:19] !log gerrit: readded robots.txt to ytterbium for now
[01:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:30:34] 01:28:46 Started sync-masters
[01:30:34] sync-masters: 0% (ok: 0; fail: 0; left: 1)
[01:33:05] AndyRussG: so the new wmf_deploy HEAD is well c448174da935b3?
[01:33:23] Dereckson: yessir!!! :D
[01:39:33] Operations: Rename 'restricted' group? - https://phabricator.wikimedia.org/T104671#2440477 (Dzahn) So how are we doing this? I guess let's first check if we have any inactive users in this list and remove them. Anyone in this group who says they don't need access anymore?
[01:39:45] PROBLEM - HHVM jobrunner on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:46:59] thcipriani: bd808: could you check https://phabricator.wikimedia.org/P3365 log please?
[01:47:43] rsync: write failed on "/srv/mediawiki/php-1.28.0-wmf.9/cache/l10n/upstream/l10n_cache-qu.cdb.json": No space left on device (28)
[01:48:09] Sooner: 00:27:27 < icinga-wm> PROBLEM - Disk space on terbium is CRITICAL: DISK CRITICAL - free space: / 2550 MB (3% inode=79%)
[01:48:22] !log dereckson@tin Finished scap: Flow 297914, Echo 297919 297934, ORES 297916 (duration: 27m 48s)
[01:48:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:48:40] matt_flaschen: Amir1: could you test please?
[01:49:07] Terbium: /dev/mapper/terbium--vg-root 67G 64G 0 100% /
[01:49:11] (CR) Dzahn: "puppet/files/apache$ git log ports.conf.ssl" [puppet] - https://gerrit.wikimedia.org/r/297727 (https://phabricator.wikimedia.org/T132661) (owner: Dzahn)
[01:49:12] we're really out of space there
[01:50:23] looking. freed 120M
[01:50:29] Operations, Project-Admins, codfw-rollout, codfw-rollout-Jan-Mar-2016: Archive #codfw-rollout-Jan-Mar-2016 - https://phabricator.wikimedia.org/T139711#2440481 (Danny_B)
[01:51:56] Dereckson: test what?
[01:52:48] Amir1: Remove oresc_is_predicted = 1 in db queries
[01:53:06] oh, I thought you did that already
[01:53:10] in mw1017?
[01:53:12] on prod
[01:53:20] (you've already confirmed it's okay on mw1017)
[01:53:56] RECOVERY - Disk space on terbium is OK: DISK OK
[01:54:07] !log terbium ran out of disk, deleted rotated nutcracker log
[01:54:11] Dereckson: it's okay
[01:54:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:54:17] Amir1: thanks for testing
[01:54:29] thank you for deploying :)
[01:54:32] You're welcome.
[01:54:36] Dereckson: ^ nutcracker.log is to blame
[01:54:51] there is still a huge current one but space now
[01:54:53] I was typing the same
[01:55:04] Dereckson, i18n is not working.
[01:56:22] E.g if you click the right Echo button (near the username) it shows broken i18n.
[01:56:26] Let me try refreshing the startup module.
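Editor's note: the "No space left on device" failure at 01:47:43 came about 80 minutes after the disk-space alert at 00:27:27. A pre-flight check of the kind sketched below could flag a nearly full target before a large sync starts. The helper name, the byte estimate and the 5% reserve are assumptions made for illustration; nothing here is part of scap.

```python
import shutil


def has_room(path: str, needed_bytes: int, reserve_fraction: float = 0.05) -> bool:
    """Return True if the filesystem holding `path` can take `needed_bytes`
    while still leaving `reserve_fraction` of its total size free."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    reserve = int(usage.total * reserve_fraction)
    return usage.free - reserve >= needed_bytes
```

For example, `has_room("/srv/mediawiki", 2 * 1024**3)` would ask whether 2 GiB can be written under that path without eating into the reserve.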
[01:57:12] matt_flaschen: expected, we had an issue during the scap
[01:57:27] let's run l10nupdate
[01:58:03] Starting l10nupdate at Fri Jul 8 01:57:58 UTC 2016.
[01:59:57] matt_flaschen: so, Terbium ran out of disk space on the partition needed by the l10n update part
[02:00:04] !log l10nupdate@tin LocalisationUpdate failed: git pull of extensions failed
[02:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:00:22] ^ it failed at WikimediaMessages
[02:00:26] Yeah, I see now (read scrollback)
[02:00:35] Your configuration specifies to merge with the ref 'master' from the remote, but no such ref was fetched.
[02:05:46] I'm trying again l10nupdate, if it fails, we'll sync-l10nupdate manually only for Echo
[02:06:36] Dereckson: isn't terbium all that failed the scap?
[02:06:50] bd808: yes, terbium ran out of space
[02:06:57] that's fine
[02:07:08] Operations, GlobalRename, MediaWiki-extensions-CentralAuth, Patch-For-Review, and 2 others: GlobalRename gets stuck sometimes - https://phabricator.wikimedia.org/T137973#2440498 (Pokefan95) >>! In T137973#2436343, @biplabanand wrote: > Got Problem once again:) my account is not attached with more...
[02:07:18] scap pull there will catch it up
[02:08:42] bd808: by the way, we've another issue: we need an l10n update for Echo, and running l10nupdate I got a 'git pull of extensions failed' (for WikimediaMessages). I'm currently running it again after I checked all looks good at /var/lib/l10nupdate/mediawiki/extensions/WikimediaMessages
[02:09:45] running l10nupdate manually?
[02:09:55] yes
[02:09:59] as the l10nupdate user or as yourself?
[02:10:29] as myself
[02:11:05] oh /usr/local/bin/l10nupdate has the right sudo in it
[02:11:28] yep it works as an alias for sudo -u l10nupdate /usr/local/bin/l10nupdate-1
[02:11:47] Seems to work now by the way
[02:11:49] Updated extensions
[02:11:49] Already up-to-date.
[02:12:53] !log scap pull on terbium (was out of disk space during previous full scap)
[02:12:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:13:20] matt_flaschen: would this be causing https://phabricator.wikimedia.org/T139712 ?
[02:13:27] a full run of l10nupdate takes a long time as I recall
[02:14:02] er, why not scap?
[02:14:10] also, we're like 2 hours over the window :S
[02:14:32] legoktm: no idea really. I think Dereckson is hoping for magic
[02:14:42] okay.... :/
[02:14:57] * AndyRussG cowers for having contributed to delayz
[02:15:10] legoktm: bd808: I was under the impression l10nupdate was quicker than a full scap
[02:15:15] legoktm, yes.
[02:15:28] legoktm, scap originally failed due to out-of-disk, now other complications.
[02:15:35] Dereckson: I think that may be a mistaken impression
[02:15:59] Can someone remind me where RL puts the messages?
[02:16:01] l10nupdate is a completely different thing that merges messages from master with the branches
[02:16:03] bd808: I cancel it and switch to a normal full scap? Rebuilding localization cache at 2016-07-08 02:06:43+00:00
[02:16:05] Just trying to hard-refresh that URL manually.
[02:16:34] Dereckson: what exactly are you trying to fix?
[02:16:43] bd808: l10n for Echo
[02:16:51] there was a patch that added new messages I think
[02:17:16] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:17:19] Dereckson: the only failure in the scap was for terbioum correct?
[02:17:22] Yes, because we changed the sorting from Alerts/Messages to Alerts/Notices, and we had to do a followup to change the i18n.
[02:17:46] bd808, what does that mean effect-wise? Since it's also wrong in the web UI.
[02:17:46] So presumably wrong in the MW app servers.
[02:17:50] changes like that should really ride the train :/
[02:17:55] bd808: some l10n stuff is done on Terbium: https://phabricator.wikimedia.org/P3365
[02:17:58] rsync: write failed on "/srv/mediawiki/php-1.28.0-wmf.9/cache/l10n/upstream/l10n_cache-qu.cdb.json": No space left on device (28) rsync error: error in file IO (code 11) at receiver.c(389) [receiver=3.1.0] rsync error: error in file IO (code 11) at io.c(1642) [sender=3.1.0]
[02:18:33] terbium is just a misc task server that has MW installed
[02:18:39] bd808, yeah, we would have, but unfortunately this was overlooked. I checked, and there does not seem to be a policy of "no scap" for SWAT currently. I try to avoid it, but it was confusing users about the new sorting scheme.
[02:18:42] it's got nothing at all to do with the app servers
[02:18:55] bd808, I know, so I'm saying the i18n is also wrong for ResourceLoader and thus on the app servers.
[02:19:21] matt_flaschen: you mean RL cache?
[02:19:36] bd808, yes, the messages ResourceLoader is serving to JS
[02:20:08] Maybe it needs a touch or something?
[02:20:11] Seems right on the server-side.
[02:21:47] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[02:22:25] there's a step at the end of l10nupdate that purges some RL cache
[02:22:58] but generally the solution to messed up RL cache is to poke RoanKattouw or Krinkle I think
[02:23:26] * Krinkle sticks head out of the bushes
[02:23:36] there is a race condition that bites the RL cache on some deploys
[02:24:08] and that's one of the reasons historically that swat scaps are troublesome
[02:24:10] Afaik deploy vs rl-cache issues relate to load.php requests and js/css files. Not l10n.
[02:24:25] Krinkle, scap originally failed due to out-of-disk. See above. I'm not sure if that only affected terbium or other machines.
[02:24:26] the l10n messageblobstore purge should be fine
[02:24:55] sync-apaches: 100% (ok: 408; fail: 1; left: 0)
[02:24:55] bd808, I thought scap should always work, it just takes longer so it's somewhat discouraged for SWAT.
[02:24:55] only terbioum
[02:25:00] * bd808 can't spell that
[02:25:12] Krinkle, can you remind me where RL serves the messages? Did it change?
[02:25:13] terribioum
[02:25:29] What URL I mean.
[02:25:40] some news about l10nupdate: 02:23:19 Updated 393 JSON file(s) in /srv/mediawiki-staging/php-1.28.0-wmf.9/cache/l10n it's syncing the cache. sync-proxies: 0% (ok: 0; fail: 0; left: 9)
[02:25:43] matt_flaschen: serves? no. stores? yes. It used to be in a mysql db table. It's now in general object cache (memcached)
[02:25:43] do we have a user-facing problem?
[02:25:54] Since about 6 months ago
[02:25:57] https://phabricator.wikimedia.org/T139712
[02:25:58] Krenair, yes.
[02:25:58] * Krinkle reads backscroll
[02:26:25] Thanks, Dereckson. Hopefully that does it.
[02:27:03] matt_flaschen: what exact messages by the way aren't served? the new ones your changes introduced or the whole set for JS?
[02:27:24] Dereckson, just the new ones.
[02:27:27] (PS2) Dzahn: (WIP) apache, icinga: make ports.conf handling config option [puppet] - https://gerrit.wikimedia.org/r/297727 (https://phabricator.wikimedia.org/T132661)
[02:27:29] Dereckson, but it's high impact
[02:28:19] Dereckson, e.g. echo-notification-notice-text-only
[02:28:56] matt_flaschen: okay, so in a few moment we'll see if the cache update fixed that. if not, I'd recommend you submit a quick fix to get rid of these new keys or a straight revert
[02:29:02] sync-apaches: 82% (ok: 338; fail: 0; left: 71)
[02:29:03] by the looks of the scap sync log, wouldn't a re-run fix this?
[02:29:06] (CR) jenkins-bot: [V: -1] (WIP) apache, icinga: make ports.conf handling config option [puppet] - https://gerrit.wikimedia.org/r/297727 (https://phabricator.wikimedia.org/T132661) (owner: Dzahn)
[02:29:22] (PS3) Dzahn: (WIP) apache, icinga: make ports.conf handling config option [puppet] - https://gerrit.wikimedia.org/r/297727 (https://phabricator.wikimedia.org/T132661)
[02:29:25] Krenair: bd808 thinks it only impacted Terbium
[02:29:45] krinkle@mw1099:~$ mwscript eval.php --wiki enwiki
[02:29:45] echo wfMessage('echo-notification-notice-text-only')->plain();
[02:29:46] > notices
[02:29:47] and the user-visible issue is something else?
[02:29:48] Notices *
[02:29:53] I really don't see that scap messed up except on terbium which has already had a couple of follow up `scap pull` runs
[02:29:57] Krinkle, yeah, it's just wrong in RL.
[02:30:03] So the msgcache as synced is fine
[02:30:20] Order of deployments is important
[02:30:34] (CR) jenkins-bot: [V: -1] (WIP) apache, icinga: make ports.conf handling config option [puppet] - https://gerrit.wikimedia.org/r/297727 (https://phabricator.wikimedia.org/T132661) (owner: Dzahn)
[02:30:42] scap-cdb-rebuild: 71% (ok: 300; fail: 0; left: 118)
[02:30:47] new cache is propagating
[02:30:50] (PS4) Dzahn: (WIP) apache, icinga: make ports.conf handling config option [puppet] - https://gerrit.wikimedia.org/r/297727 (https://phabricator.wikimedia.org/T132661)
[02:31:00] Dereckson, can you try touching ui/mw.echo.ui.js and syncing that file.
[02:31:08] !log dereckson@tin scap sync-l10n completed (1.28.0-wmf.9) (duration: 07m 47s)
[02:31:10] Krinkle, will that invalidate the RL module?
[02:31:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:31:23] Refreshing ResourceLoader caches
[02:32:02] matt_flaschen: could you check on a wiki in a or b ady. for example if the issue is still there?
[02:32:20] matt_flaschen: Dereckson: Most likely what happened is that new l10n cache rolled out, during this part someone requested 'startup' manifest from a server with the new l10n cache values. This server responded with a module version that includes the new messages (and working). Then another request comes in (to an older server this time) responding to a
[02:32:20] request for Echo JS with the new version in the query string. It responds without the message as it doesn't have it yet.
[02:32:31] Then, the l10n sync completes, but the new url is now populated in varnish
[02:32:36] so it won't repair itself
[02:32:43] (CR) Dzahn: [C: -1] (WIP) apache, icinga: make ports.conf handling config option [puppet] - https://gerrit.wikimedia.org/r/297727 (https://phabricator.wikimedia.org/T132661) (owner: Dzahn)
[02:32:53] Dereckson, what is "a or b ady"?
[02:33:01] Oooh I remember, that bit me once, too...
[02:33:10] matt_flaschen: any wiki starting by a or b
[02:33:14] Dereckson: RL uses content hashes. Touching won't do anything.
[02:33:31] timestamps are irrelevant (on purpose, because each new brach resets timestamps to 'now', since Git doesn't store timestamps)
[02:33:35] branch*
[02:33:36] You can just add a useless space or comment somewhere, no?
[02:33:45] Krinkle: l10nupdate is at the Refreshing ResourceLoader caches, letter h now
[02:33:49] Yeah, but then when you sync it later, it'll go back to the old version.
[02:34:21] matt_flaschen: https://en.wikipedia.org/wiki/Orange_Catholic_Bible I've "Notices" and "Alertes" as titels
[02:34:47] Before you continue, 1) Figure out a request url for Echo (from the browser dev tools) and verify that the message is indeed broken in the JS response and that adding a garbage query parameter makes it work.
[02:34:59] That way you know it's just Varnish cache thats the problem at this point.
[02:35:05] Dereckson, looks right on ab.
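Editor's note on "RL uses content hashes. Touching won't do anything." (02:33:14): the point is that a version derived from file contents and message texts ignores modification times. The sketch below is only an illustration of that idea; the hashing scheme, the function name and the 12-character truncation are assumptions and are not ResourceLoader's actual algorithm.

```python
import hashlib
import os
import tempfile


def content_version(path, messages):
    """Derive a short version string from a script's bytes and its message texts.
    File timestamps are never consulted."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        digest.update(fh.read())
    for key in sorted(messages):
        digest.update(f"\0{key}\0{messages[key]}".encode("utf-8"))
    return digest.hexdigest()[:12]


with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "mw.echo.ui.js")
    with open(script, "w", encoding="utf-8") as fh:
        fh.write("mw.echo = {};\n")

    before = content_version(script, {})
    os.utime(script, (0, 0))  # "touch": only the timestamps change
    after_touch = content_version(script, {})
    after_message = content_version(
        script, {"echo-notification-notice-text-only": "Notices"}
    )
```

Here `before` and `after_touch` are equal, while `after_message` differs, which is why a touch-and-sync cannot force a new module URL but a changed message text does.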
[02:35:06] (CR) Dzahn: [C: -1] Phab: make sure the mail crons have mysql-client installed [puppet] - https://gerrit.wikimedia.org/r/297803 (owner: Chad)
[02:35:40] okay and the issue is still at yue
[02:35:47] (and the script is at p)
[02:35:59] so that means l10nupdate is successfully refreshing the RL cache
[02:36:06] Dereckson, Krinkle, it's fixed.
[02:36:11] mediawiki.org showed for me at first, but it shows it correctly now.
[02:36:21] matt_flaschen: not on all wikis
[02:36:26] we're at s
[02:36:44] Dereckson, I know, but presumably when the current script gets through all. It's fixed on the ones it's made it through alphabetically I guess.
[02:36:56] * Dereckson nods.
[02:36:59] we're at u
[02:37:13] Dereckson: Yeah, if l10n update was not synced at all, and is being synced now, then the problem wasn't cache poisoning. It was just missing and the version number of the broken version was of the broken version.
[02:37:27] The race condition I referred to could still be happening now during the actual l10n sync.
[02:37:29] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Jul 8 02:37:29 UTC 2016 (duration 6m 22s)
[02:37:32] On one or more wikis.
[02:37:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:38:25] And the working version will have a different version hash. It'll naturally roll over and update globally within 5 minutes.
[02:38:33] Krinkle, did we do something wrong originally and/or now? Originally he just ran a scap.
[02:38:43] (since sha1() of is not the same as sha1() of 'Notices', naturally)
[02:38:46] URL for reference: https://en.wikipedia.org/w/load.php?debug=false&lang=en&modules=ext.echo.controller%2Cdm%2Clogger%2Cui%7Cext.echo.styles.badge%2Cnotifications%7Cext.echo.ui.desktop%7Coojs-ui-core%2Coojs-ui-widgets%7Coojs-ui-core.styles%7Coojs-ui.styles.icons%2Cicons-alerts%2Cicons-content%2Cicons-interactions%2Cicons-user%2Cindicators%2Ctextures%7Cschema.EchoInteraction&skin=vector&version=9c06a8a051bb
[02:39:06] "echo-notification-notice-text-only":"Notices"
[02:39:36] matt_flaschen: you've Notifications at https://zh-yue.wikipedia.org/wiki/Special:Notifications ?
[02:40:13] I've a nice 提示 for the first, but still a echo-notification-notice-text-only
[02:40:31] Dereckson, I don't have any notifications there, so I can't test.
[02:40:43] One sec
[02:40:43] JS is loaded based on permalinks with version hashes. The manifest with all the version hashes is cached in varnish for 5 minutes.
[02:41:07] So whatever the fix, it will happen within 5 minutes. Depending on when that window started, you may not be able to see it yet.
[02:41:34] Dereckson, I don't think that page has the words Alerts/Messages anyway (except in the upper right where it is in every page)
[02:41:37] Dereckson, screenshot?
[02:42:17] ah works now
[02:42:55] matt_flaschen: https://s3.amazonaws.com/upload.screenshot.co/e3039784e5
[02:43:16] matt_flaschen: Dereckson: I do recall problems with scap doing the l10n sync at time that is conceptually not entirely compatible with what MediaWiki and ResourceLoader really want. What it should do is sync l10n and files together and apply to all servers. If it's doing l10n first to all servers and then files, that will cause problems. Most of which will
[02:43:16] recover automatically in 5 minutes.
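Editor's note: the cache-poisoning race described at 02:32:20 (and judged at 02:37:13 not to be the cause this time) can be simulated in a few lines. This is a toy model under stated assumptions: `AppServer` and `EdgeCache` are hypothetical stand-ins for a MediaWiki app server and Varnish, the URL is shortened, and `<key>` merely stands for however a missing message is rendered.

```python
import hashlib

KEY = "echo-notification-notice-text-only"


class AppServer:
    """Toy app server holding its own copy of the l10n messages."""

    def __init__(self, messages):
        self.messages = dict(messages)

    def version(self):
        # Version hash derived from the messages this server currently has.
        blob = "\n".join(f"{k}={v}" for k, v in sorted(self.messages.items()))
        return hashlib.sha1(blob.encode("utf-8")).hexdigest()[:12]

    def render_module(self):
        # A server can only embed the messages it has locally.
        return self.messages.get(KEY, f"<{KEY}>")


class EdgeCache:
    """Toy URL-keyed cache: the first response for a URL is kept."""

    def __init__(self):
        self.store = {}

    def fetch(self, url, backend):
        if url not in self.store:
            self.store[url] = backend.render_module()
        return self.store[url]


old = AppServer({})                # not yet synced
new = AppServer({KEY: "Notices"})  # already has the new l10n cache
cache = EdgeCache()

version = new.version()            # startup manifest answered by the new server
url = f"/w/load.php?modules=ext.echo.ui&version={version}"
poisoned = cache.fetch(url, old)   # module request lands on the stale server

old.messages[KEY] = "Notices"      # the sync completes everywhere
still_poisoned = cache.fetch(url, old)        # same URL: cached broken body
bypass = cache.fetch(url + "&garbage=1", old)  # new URL: fresh, correct body
```

In this model the broken body stays cached under the versioned URL after the sync, and only a different URL (the "garbage query parameter" check suggested at 02:34:47) reaches a backend again.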
[02:43:47] (after it worked everywhere else, it printed before the )
[02:44:00] Dereckson, okay, I thought you meant something specifically on Special:Notifications (that upper right stuff is on every page)
[02:44:25] The word Notifications is not new, only 'notices' is new.
[02:45:51] Krinkle: bd808: the full scap doesn't refresh RL cache, does it?
[02:46:44] Dereckson: If by RL cache you mean l10n-cache-in-memcached-for-RL, then yes, that is purged by the l10n update script.
[02:47:43] Krinkle: yes but if we'd instead run scap sync, would it have been purged too?
[02:49:23] afaik 'scap sync' just syncs /srv/mediawiki. It doesn't rebuild l10n files, it doesn't sync l10n files, it doesn't purge l10n-related caches
[02:50:07] Even though those .json files are in /srv/mediawiki, production MediaWiki is not allowed to read those files off disk. We exclusively read from the cdb files. Which have to be rebuilt on tin if changes are in the json files.
[02:51:22] Krinkle, are you sure? https://wikitech.wikimedia.org/wiki/How_to_deploy_code#More_complex_changes:_sync_everything makes it sound like you're free to change i18n files as long as you run scap sync (formerly scap).
[02:51:47] Oh, the new 'scap sync' is not like sync-common-all
[02:51:50] it's like scap itself
[02:51:53] ok, that's confusing
[02:53:01] Yeah, running full scap will rebuild l10n cache files, sync them as part of the sync, and purge caches presumably
[02:55:19] Operations, GlobalRename, MediaWiki-extensions-CentralAuth, Patch-For-Review, and 2 others: GlobalRename gets stuck sometimes - https://phabricator.wikimedia.org/T137973#2440540 (biplabanand) >>! In T137973#2440498, @Pokefan95 wrote: >>>! In T137973#2436343, @biplabanand wrote: >> Got Problem onc...
[02:55:52] Yes, 'scap sync', just like 'scap sync-l10n', calls update_localization_cache() which uses rebuildLocalisationCache.php which will purge MessageBlobStore for all wikis [02:57:57] Although it triggers all that before it even starts syncing [02:58:08] so it's not likely for some of those values to be re-filled before it finishes syncing [02:58:10] This is flawed [02:58:12] :/ [02:59:32] Anyway, to be investigated another day [03:09:36] Krinkle, you mean "it is likely for some [...] to be re-filled"? [03:11:05] it is likely or not likely? [03:14:13] Dereckson: hi! So reverting, I guess...? [03:14:30] I see it's been a long haul [03:14:43] So if that works it's fine of course [03:14:52] AndyRussG: I was only documenting on wikitech and Gerrit what we reverted earlier [03:14:58] Ah OK [03:15:02] Next step is you schedule https://gerrit.wikimedia.org/r/#/c/297940/ at an upcoming SWAT [03:15:05] Whatever works for you is good [03:15:21] or better, ask for a weekly/monthly window only for your extension [03:16:43] OK [03:17:35] AndyRussG: if no one from your team is really comfortable deploying, there is probably a way to have someone from the MediaWiki train or the SWAT team be present at your window too. [03:19:32] Dereckson: well, several of us can deploy and awight actually does so a lot but he and others have a lot on their plates. I recently had to update my access (hard disk died) and haven't set up prod access yet. But I generally feel more nervous about deploying and prefer to see it in expert hands [03:24:30] Dereckson: so just to confirm, all of prod now has CentralNotice commit 29321966f, as per https://phabricator.wikimedia.org/rMWdb5eb37d82c598115f984b829f44793c469f1074 , correct? [03:24:41] aye [03:24:47] OK gotcha [03:25:02] K thanks for working on this in any case! [03:25:13] and I submitted a PS2 to 297940 to do 29321966f..wmf_deploy current HEAD [03:25:41] You're welcome. [03:26:05] K [03:26:48] Well, I'll call it a day then....... bye all!
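Editor's sketch (not part of the channel traffic): the concern raised at 02:57-02:58 is about ordering. If the message cache is purged before the new files are synced, a request arriving in between can re-fill the cache from the old files. The toy model below, in Python, shows only that ordering effect. The Server class, the deploy function and the single shared dictionary standing in for the cache are all hypothetical, and nothing here reflects how scap or MessageBlobStore is actually implemented.

```python
KEY = "echo-notification-notice-text-only"


class Server:
    """Toy app server (hypothetical): serves messages through a cache."""

    def __init__(self, files, cache):
        self.files = files  # local localisation data
        self.cache = cache  # stand-in for the shared message cache

    def get_message(self, key):
        # On a cache miss, fill from whatever files are on disk right now.
        if key not in self.cache:
            self.cache[key] = self.files.get(key, "<" + key + ">")
        return self.cache[key]


def deploy(server, new_files, purge_first):
    """Run purge and sync in the given order, with one request in between."""
    steps = ["purge", "sync"] if purge_first else ["sync", "purge"]
    for index, step in enumerate(steps):
        if step == "purge":
            server.cache.clear()
        else:
            server.files = dict(new_files)
        if index == 0:
            server.get_message(KEY)  # a request lands between the two steps
    return server.get_message(KEY)


new_files = {KEY: "Notices"}
purge_then_sync = deploy(Server({}, {}), new_files, purge_first=True)
sync_then_purge = deploy(Server({}, {}), new_files, purge_first=False)

print(purge_then_sync)  # <echo-notification-notice-text-only>
print(sync_then_purge)  # Notices
```

The model has no cache expiry, so the stale value never recovers here; in the log the stale state is said to clear on its own once the 5-minute cache window passes.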
[03:28:41] bye AndyRussG|away [03:29:32] matt_flaschen: do we need to write an incident report tomorrow? [03:30:36] Dereckson, I wasn't planning on it. But on the other hand, I don't have a 100% understanding of what happened, so that's probably a good idea. [03:32:48] Dereckson, I will tomorrow. Heading out, my contact info is on officewiki if anything goes wrong. [03:34:51] Have an enjoyable evening. [03:35:23] * Dereckson heads out too. [03:50:55] 06Operations, 10GlobalRename, 10MediaWiki-extensions-CentralAuth, 13Patch-For-Review, and 2 others: GlobalRename gets stuck sometimes - https://phabricator.wikimedia.org/T137973#2440559 (10Pokefan95) >>! In T137973#2440540, @biplabanand wrote: >>>! In T137973#2440498, @Pokefan95 wrote: >>>>! In T137973#243... [04:07:14] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:09:26] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [04:35:32] 06Operations, 10ops-codfw, 10DBA: BIOS upgrade on certain codfw machines - https://phabricator.wikimedia.org/T139714#2440561 (10jcrespo) [04:36:18] 06Operations, 10ops-codfw, 10DBA: pc2006 down - https://phabricator.wikimedia.org/T139283#2440574 (10jcrespo) 05Open>03Resolved Tracked on T139714 [04:39:12] 06Operations, 10ops-codfw, 10DBA: es2017 and es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2440590 (10jcrespo) 05Open>03Resolved I consider this as resolved, let's track BIOS and followup on T139714. [05:07:50] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:10:09] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [05:10:21] PROBLEM - Disk space on terbium is CRITICAL: DISK CRITICAL - free space: / 2554 MB (3% inode=79%) [05:48:39] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:50:50] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [06:13:13] PROBLEM - puppet last run on ms-be3004 is CRITICAL: CRITICAL: puppet fail [06:31:52] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:23] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:34] 06Operations, 10DBA, 13Patch-For-Review: All s4 (commons) slaves failed replication - https://phabricator.wikimedia.org/T139346#2440643 (10jcrespo) 05Open>03Resolved Incident documented on https://wikitech.wikimedia.org/wiki/Incident_documentation/20160705-commons-replication [06:32:41] PROBLEM - puppet last run on mc2015 is CRITICAL: CRITICAL: puppet fail [06:33:13] PROBLEM - puppet last run on neodymium is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:22] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:11] PROBLEM - puppet last run on aqs1002 is CRITICAL: CRITICAL: Puppet has 2 failures [06:34:41] PROBLEM - puppet last run on wtp1008 is CRITICAL: CRITICAL: Puppet has 2 failures [06:35:03] PROBLEM - puppet last run on labnet1002 is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:22] PROBLEM - puppet last run on eventlog2001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:38:12] RECOVERY - puppet last run on ms-be3004 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [06:56:02] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [06:56:22] RECOVERY - puppet last run on mc2015 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [06:56:52] RECOVERY - puppet last run on wtp1008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:11] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [06:57:22] RECOVERY - puppet last run on neodymium is OK: 
OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:22] RECOVERY - puppet last run on labnet1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:41] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:42] RECOVERY - puppet last run on eventlog2001 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [06:58:42] RECOVERY - puppet last run on aqs1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:03:41] (03PS1) 10Giuseppe Lavagetto: nutcracker: lower verbosity on the maintenance hosts [puppet] - 10https://gerrit.wikimedia.org/r/297954 [07:05:20] (03CR) 10Muehlenhoff: "Now accepted by backports FTP masters." [puppet] - 10https://gerrit.wikimedia.org/r/297236 (https://phabricator.wikimedia.org/T138136) (owner: 10Muehlenhoff) [07:09:46] (03CR) 10Giuseppe Lavagetto: [C: 032] nutcracker: lower verbosity on the maintenance hosts [puppet] - 10https://gerrit.wikimedia.org/r/297954 (owner: 10Giuseppe Lavagetto) [07:15:12] <_joe_> !log removing 20 gb logfile from terbium, only useless debug info [07:15:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:17:57] RECOVERY - Disk space on terbium is OK: DISK OK [07:22:46] (03CR) 10Muehlenhoff: [C: 04-1] "In addition to what I wrote on T139598, we also need to raise the bucket size, see commit 51223efe in operations/puppet. We should ship a " [puppet] - 10https://gerrit.wikimedia.org/r/297897 (https://phabricator.wikimedia.org/T139598) (owner: 10Andrew Bogott) [07:23:36] (03CR) 10Muehlenhoff: "Also, these sysctl values are only set on boot time, so we need to fix the running labvirt* hosts manually." 
[puppet] - 10https://gerrit.wikimedia.org/r/297897 (https://phabricator.wikimedia.org/T139598) (owner: 10Andrew Bogott) [07:26:14] 06Operations, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Depleted connection tracking table on labvirt1010 - https://phabricator.wikimedia.org/T139598#2440759 (10MoritzMuehlenhoff) BTW, base::firewall provides an Icinga check for conntrack tables, which has been really useful to notice problems wit... [07:53:26] (03Abandoned) 10Filippo Giunchedi: [STRAWMAN] let puppet-lint fix arrow alignment [puppet] - 10https://gerrit.wikimedia.org/r/294282 (https://phabricator.wikimedia.org/T137763) (owner: 10Filippo Giunchedi) [08:06:24] 06Operations, 05codfw-rollout, 03codfw-rollout-Jan-Mar-2016: url-downloader should be set up more redundantly - https://phabricator.wikimedia.org/T122134#2440785 (10akosiaris) p:05Normal>03Low The tasks above have been completed, so per DC url-downloader instances are now being used. The task of getting... [08:08:19] !log oblivian@palladium conftool action : set/pooled=no; selector: name=mw1261.eqiad.wmnet [08:08:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:08:43] !log rebooting tin for kernel security update [08:08:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:12:29] !log rearming keyholder on tin [08:12:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:14:06] PROBLEM - Apache HTTP on mw1261 is CRITICAL: Connection refused [08:17:37] PROBLEM - puppet last run on mw2246 is CRITICAL: CRITICAL: Puppet has 1 failures [08:17:46] PROBLEM - puppet last run on mw1161 is CRITICAL: CRITICAL: Puppet has 1 failures [08:18:33] !log rebooting terbium for kernel security update [08:18:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:19:31] (03CR) 10Alexandros Kosiaris: [C: 031] Mark off a block of public IPs for labtest [dns] - 
10https://gerrit.wikimedia.org/r/284491 (https://phabricator.wikimedia.org/T115491) (owner: 10Andrew Bogott) [08:20:26] (03CR) 10Alexandros Kosiaris: [C: 031] backup: Use PRODUCTION_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/295786 (owner: 10Muehlenhoff) [08:21:34] (03CR) 10Alexandros Kosiaris: [C: 031] ocg: Restrict to DOMAIN_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/297840 (owner: 10Muehlenhoff) [08:30:17] (03PS2) 10Merlijn van Deen: Set up labs realm (ldap classifier and hiera) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/297902 (https://phabricator.wikimedia.org/T97081) [08:31:00] (03CR) 10jenkins-bot: [V: 04-1] Set up labs realm (ldap classifier and hiera) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/297902 (https://phabricator.wikimedia.org/T97081) (owner: 10Merlijn van Deen) [08:35:17] 06Operations, 06Commons, 10Wikimedia-SVG-rendering: SVG files larger than 10 MB cannot be thumbnailed - https://phabricator.wikimedia.org/T111815#2440812 (10MoritzMuehlenhoff) Sounds good to me. I'll run some tests with "--unlimited" next week and if all if fine, I'll enable it on the image scalers. [08:35:32] !log gallium: restarting Zuul to apply logging configuration change https://gerrit.wikimedia.org/r/#/c/291913/ [08:35:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:36:01] PROBLEM - puppet last run on mw1261 is CRITICAL: CRITICAL: Puppet has 1 failures [08:36:40] <_joe_> that's known ^^; my fault [08:40:03] (03CR) 10Hashar: "Thanks Daniel! I had completely forgot about this change and it is quite useful to have. 
I have restarted Zuul and it looks all fine :)" [puppet] - 10https://gerrit.wikimedia.org/r/291913 (owner: 10Hashar) [08:40:45] !log gallium: deleting old log files /var/log/zuul/gearman-server-debug.log* [08:40:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:41:06] 06Operations, 10Wikimedia-Apache-configuration, 07HHVM, 07Wikimedia-log-errors: Fix Apache proxy_fcgi error "Invalid argument: AH01075: Error dispatching request to" (Causing HTTP 503) - https://phabricator.wikimedia.org/T73487#2440815 (10elukey) Finally we found a repro thanks to https://bz.apache.org/bug... [08:41:42] RECOVERY - puppet last run on mw1161 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [08:42:13] elukey: good morning!! [08:42:39] hashar: o/ [08:42:46] elukey: all the investigation on the AH01075 log spam is quite fascinating :D [08:43:15] elukey: the beta cluster on labs has some hhvm/apache and if we have the same error there we could probably use beta cluster as a playground test area [08:43:19] <_joe_> elukey: new package in the oven, let's see if I backported everything that was needed [08:43:51] RECOVERY - puppet last run on mw2246 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [08:44:15] <_joe_> hashar: https://blogs.msdn.microsoft.com/seliot/2011/04/25/i-dont-always-test-my-code-but-when-i-do-i-do-it-in-production/ [08:44:36] ahahahahhahaahah [08:44:42] yeah I guess nowadays it is quite easy to isolate a random prod box to test with [08:50:14] (03PS3) 10Merlijn van Deen: Set up labs realm (ldap classifier and hiera) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/297902 (https://phabricator.wikimedia.org/T97081) [08:50:40] * valhallasw`cloud eyes jenkins suspiciously [08:51:04] (03CR) 10jenkins-bot: [V: 04-1] Set up labs realm (ldap classifier and hiera) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/297902 
(https://phabricator.wikimedia.org/T97081) (owner: 10Merlijn van Deen) [08:51:36] (03PS4) 10Merlijn van Deen: Set up labs realm (ldap classifier and hiera) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/297902 (https://phabricator.wikimedia.org/T97081) [08:56:12] RECOVERY - Apache HTTP on mw1261 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.058 second response time [09:00:59] !log rebooting ruthenium for update to Linux 4.4 [09:01:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:02:01] RECOVERY - puppet last run on mw1261 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:02:25] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate: Clones from git.wikimedia.org are not redirected - https://phabricator.wikimedia.org/T139206#2440827 (10Danny_B) [09:03:00] !log elukey@palladium conftool action : set/pooled=yes:weight=5; selector: mw1261.eqiad.wmnet [09:03:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:04:22] RECOVERY - cassandra-c CQL 10.64.48.131:9042 on restbase1009 is OK: TCP OK - 0.007 second response time on port 9042 [09:06:39] !log elukey@palladium conftool action : set/pooled=no:weight=5; selector: mw1261.eqiad.wmnet [09:06:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:10:18] !log elukey@palladium conftool action : set/pooled=yes:weight=5; selector: mw1261.eqiad.wmnet [09:26:41] godog: 1009-c is up!!! 
^^^ [09:26:46] \o/ [09:30:33] !log upgrading nodejs packages on aqs100[23] [09:30:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:30:55] !log elukey@palladium conftool action : set/pooled=no; selector: aqs1002.eqiad.wmnet [09:30:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:31:25] oh, it's still joining the cluster [09:32:09] mobrovac: yeah, not sure it will complete :( [09:32:17] :/ [09:33:21] sorry for the conftool spam, I forgot --quiet [09:33:45] !log elukey@palladium conftool action : set/pooled=yes; selector: aqs1002.eqiad.wmnet [09:33:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:34:39] mobrovac: I've briefly looked at the errors and they look like sockettimeout, will update T139362 later as well [09:34:40] T139362: High storage utilization on restbase1014.eqiad.wmnet - https://phabricator.wikimedia.org/T139362 [09:37:17] damn [09:38:16] I guess we'll be forced to do removenode after all [09:38:18] :S [09:40:54] 06Operations, 06Services, 13Patch-For-Review, 15User-mobrovac: Updates various services to nodejs 4.4.6 - https://phabricator.wikimedia.org/T138561#2440893 (10elukey) Aqs* hosts upgraded!
[09:51:59] 06Operations, 10Traffic, 13Patch-For-Review: Investigate TCP Fast Open for tlsproxy - https://phabricator.wikimedia.org/T108827#1531608 (10ema) a:03ema [09:53:18] 06Operations, 06Services, 13Patch-For-Review, 15User-mobrovac: Updates various services to nodejs 4.4.6 - https://phabricator.wikimedia.org/T138561#2440926 (10MoritzMuehlenhoff) [09:54:22] (03PS2) 10Muehlenhoff: Install fonts-sil-lateef on scalers [puppet] - 10https://gerrit.wikimedia.org/r/297236 (https://phabricator.wikimedia.org/T138136) [09:58:56] 06Operations, 13Patch-For-Review: install/setup/deploy server rhodium as puppetmaster (scaling out) - https://phabricator.wikimedia.org/T98173#1261185 (10Joe) So the new goal is to use rhodium as a test system for installing the puppetmasters on jessie. I will work on this in the next weeks as part of this qu... [10:01:11] (03PS3) 10Muehlenhoff: Install fonts-sil-lateef on scalers [puppet] - 10https://gerrit.wikimedia.org/r/297236 (https://phabricator.wikimedia.org/T138136) [10:04:06] (03CR) 10Muehlenhoff: [C: 032 V: 032] Install fonts-sil-lateef on scalers [puppet] - 10https://gerrit.wikimedia.org/r/297236 (https://phabricator.wikimedia.org/T138136) (owner: 10Muehlenhoff) [10:06:26] (03PS1) 10Giuseppe Lavagetto: puppetmaster: switch rhodium to install jessie [puppet] - 10https://gerrit.wikimedia.org/r/297962 (https://phabricator.wikimedia.org/T98173) [10:07:38] (03PS2) 10Giuseppe Lavagetto: puppetmaster: switch rhodium to install jessie [puppet] - 10https://gerrit.wikimedia.org/r/297962 (https://phabricator.wikimedia.org/T98173) [10:08:06] 06Operations, 10MediaWiki-extensions-UniversalLanguageSelector, 10Wikimedia-SVG-rendering, 07I18n, 13Patch-For-Review: MB Lateefi Fonts for Sindhi Wikipedia. - https://phabricator.wikimedia.org/T138136#2440935 (10MoritzMuehlenhoff) @mehtab.ahmed The Lateef font for Sindhi is now installed on all image sc... 
[10:09:33] (03CR) 10Gehel: [C: 031] "LGTM, deployment scheduled for Monday July 11th" [puppet] - 10https://gerrit.wikimedia.org/r/296279 (https://phabricator.wikimedia.org/T129138) (owner: 10EBernhardson) [10:10:30] 06Operations, 10Traffic, 06Wikipedia-iOS-App-Backlog: Wikipedia app hits loads.php on bits.wikimedia.org - https://phabricator.wikimedia.org/T132969#2440936 (10MoritzMuehlenhoff) p:05Triage>03Normal [10:12:16] 06Operations, 10Analytics, 06Performance-Team, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2440940 (10MoritzMuehlenhoff) p:05Triage>03Normal [10:12:36] 06Operations: mod_deflate + mod_uwsgi causing mangled apache responses - https://phabricator.wikimedia.org/T135595#2440941 (10MoritzMuehlenhoff) p:05Triage>03Normal [10:14:54] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: Tune WDQS caching headers - https://phabricator.wikimedia.org/T137238#2440942 (10MoritzMuehlenhoff) p:05Triage>03Normal [10:15:56] 06Operations: Install puppetDB at WMF - https://phabricator.wikimedia.org/T139476#2440956 (10MoritzMuehlenhoff) p:05Triage>03High [10:17:58] 06Operations, 10Traffic, 10Wikimedia-Blog, 07HTTPS: Switch blog to HTTPS-only - https://phabricator.wikimedia.org/T105905#2440958 (10ema) p:05Triage>03Normal [10:18:59] 06Operations, 10Traffic, 10Wikimedia-Shop, 07HTTPS: Canonical URL in Store points to HTTP address, should be HTTPS - https://phabricator.wikimedia.org/T131131#2440959 (10ema) p:05Triage>03Normal [10:19:11] (03CR) 10Giuseppe Lavagetto: [C: 032] puppetmaster: switch rhodium to install jessie [puppet] - 10https://gerrit.wikimedia.org/r/297962 (https://phabricator.wikimedia.org/T98173) (owner: 10Giuseppe Lavagetto) [10:19:54] 06Operations, 10DNS, 10Traffic, 13Patch-For-Review: Set SPF (... 
-all) for toolserver.org - https://phabricator.wikimedia.org/T131930#2440962 (10ema) p:05Triage>03Normal [10:20:28] 06Operations, 10Traffic: Set up LVS connection sync - https://phabricator.wikimedia.org/T136944#2440963 (10ema) p:05Triage>03Normal [10:21:02] 06Operations, 10Traffic, 13Patch-For-Review: Make upload.wikimedia.org cookieless - https://phabricator.wikimedia.org/T137609#2440964 (10ema) p:05Triage>03Normal [10:26:38] (03PS1) 10Thiemo Mättig (WMDE): Disable PDF export in the Wikidata Item namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/297964 (https://phabricator.wikimedia.org/T136814) [10:46:15] 06Operations, 10Wikimedia-SVG-rendering, 07Upstream: Filter effect Gaussian blur filter not rendered correctly for small to medium thumbnail sizes - https://phabricator.wikimedia.org/T44090#2440988 (10Dvorapa) @MoritzMuehlenhoff When could we await the fix to be in a release on Wikipedias (Czech)? In yesterd... [10:50:03] 06Operations, 10Wikimedia-SVG-rendering, 07Upstream: Filter effect Gaussian blur filter not rendered correctly for small to medium thumbnail sizes - https://phabricator.wikimedia.org/T44090#2440990 (10MoritzMuehlenhoff) @Dvorapa This change is not a change in Mediawiki, but in the librsvg software component... [10:59:28] 06Operations, 10Wikimedia-Apache-configuration, 07HHVM, 07Wikimedia-log-errors: Fix Apache proxy_fcgi error "Invalid argument: AH01075: Error dispatching request to" (Causing HTTP 503) - https://phabricator.wikimedia.org/T73487#2441042 (10elukey) The two patches needed are the following (even simpler than... [10:59:42] hashar: ---^ these are the patches needed [10:59:48] it turned out to be even simpler [11:01:25] (03CR) 10PleaseStand: "Are you sure this will work? According to the Apache documentation:" [puppet] - 10https://gerrit.wikimedia.org/r/277904 (owner: 10Paladox) [11:02:10] (03CR) 10Paladox: "Nope I am not sure, but I'm guessing it will work." 
[puppet] - 10https://gerrit.wikimedia.org/r/277904 (owner: 10Paladox) [11:04:39] we should get rid of spurious 503s in the access logs too [11:07:36] we have a lot of clients aborting requests apparently [11:09:00] 06Operations, 10Wikimedia-SVG-rendering, 07Upstream: Filter effect Gaussian blur filter not rendered correctly for small to medium thumbnail sizes - https://phabricator.wikimedia.org/T44090#2441059 (10Dvorapa) @MoritzMuehlenhoff I see, then it is only a matter of time until it arrives at other Wikis? [11:21:42] elukey: great!!! [11:21:44] https://svn.apache.org/viewvc/httpd/httpd/branches/2.4.x/modules/proxy/mod_proxy_fcgi.c?r1=1726019&r2=1726018&pathrev=1726019 [11:21:59] that really looks like if (bug) { donot(503); continue; } [11:24:58] 06Operations, 10Wikimedia-SVG-rendering, 07Upstream: Filter effect Gaussian blur filter not rendered correctly for small to medium thumbnail sizes - https://phabricator.wikimedia.org/T44090#2441093 (10MoritzMuehlenhoff) If any other Mediawiki installation wants to use the fixed version, it needs to install u... [11:35:40] 06Operations: Several mw* servers missing in conftool-data, but present in site.pp - https://phabricator.wikimedia.org/T139154#2441105 (10MoritzMuehlenhoff) 05Open>03Resolved a:03MoritzMuehlenhoff These are in fact intentionally not pooled, so closing. [11:35:52] hashar: yeah it does but it makes sense, if the connection has been aborted by the client it is not a 503...
[11:38:02] 06Operations, 10ops-codfw, 10DBA: BIOS upgrade on certain codfw machines - https://phabricator.wikimedia.org/T139714#2441108 (10MoritzMuehlenhoff) p:05Triage>03Normal [11:50:29] PROBLEM - puppet last run on ms-fe1001 is CRITICAL: CRITICAL: Puppet has 1 failures [11:58:40] PROBLEM - puppet last run on mw2144 is CRITICAL: CRITICAL: Puppet has 1 failures [11:59:49] PROBLEM - puppet last run on mw2068 is CRITICAL: CRITICAL: Puppet has 1 failures [12:00:30] (03PS3) 10Yuvipanda: dynamicproxy: do not override nginx.conf [puppet] - 10https://gerrit.wikimedia.org/r/297829 (https://phabricator.wikimedia.org/T134383) [12:00:40] (03CR) 10Yuvipanda: [C: 032 V: 032] dynamicproxy: do not override nginx.conf [puppet] - 10https://gerrit.wikimedia.org/r/297829 (https://phabricator.wikimedia.org/T134383) (owner: 10Yuvipanda) [12:00:52] (03PS3) 10Yuvipanda: dynamicproxy: Use http2 rather than spdy [puppet] - 10https://gerrit.wikimedia.org/r/297832 (https://phabricator.wikimedia.org/T134383) [12:01:05] (03CR) 10Yuvipanda: [C: 032 V: 032] dynamicproxy: Use http2 rather than spdy [puppet] - 10https://gerrit.wikimedia.org/r/297832 (https://phabricator.wikimedia.org/T134383) (owner: 10Yuvipanda) [12:03:47] !log stopping replication on db1043 (m3-slave) for maintenance [12:03:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:11:00] (03PS1) 10Yuvipanda: dynamicproxy: s/spdy/http2/ for domainproxy as well [puppet] - 10https://gerrit.wikimedia.org/r/297971 (https://phabricator.wikimedia.org/T134383) [12:11:29] (03PS2) 10Yuvipanda: dynamicproxy: s/spdy/http2/ for domainproxy as well [puppet] - 10https://gerrit.wikimedia.org/r/297971 (https://phabricator.wikimedia.org/T134383) [12:11:43] (03CR) 10Yuvipanda: [C: 032 V: 032] dynamicproxy: s/spdy/http2/ for domainproxy as well [puppet] - 10https://gerrit.wikimedia.org/r/297971 (https://phabricator.wikimedia.org/T134383) (owner: 10Yuvipanda) [12:16:38] RECOVERY - puppet last run on ms-fe1001 is OK: 
OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:18:01] (03CR) 10Aude: [C: 031] "@note this patch also applies to test.wikidata, which is okay and desired" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/297964 (https://phabricator.wikimedia.org/T136814) (owner: 10Thiemo Mättig (WMDE)) [12:21:36] !log installing glib updates from jessie point release [12:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:22:13] 06Operations, 10Cassandra, 06Services, 13Patch-For-Review: High storage utilization on restbase1014.eqiad.wmnet - https://phabricator.wikimedia.org/T139362#2441203 (10fgiunchedi) looks like the latest bootstrap failed too {P3366} and the "memory was freed" + sockettimeout seem to point to bugs like http... [12:22:58] RECOVERY - puppet last run on mw2144 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [12:24:17] RECOVERY - puppet last run on mw2068 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:25:08] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[12:25:45] 07Puppet, 06Labs, 10Phabricator, 10Tool-Labs: Install Arcanist in toollabs::dev_environ - https://phabricator.wikimedia.org/T139738#2441211 (10Dereckson) [12:27:02] (03CR) 10Peachey88: "I'm not sure we should even really both, The task for this proxy block in the first place was declined (T130156) so i'm not sure why it wa" [puppet] - 10https://gerrit.wikimedia.org/r/277904 (owner: 10Paladox) [12:27:20] 07Puppet, 06Labs, 10Phabricator, 10Tool-Labs: Install Arcanist in toollabs::dev_environ - https://phabricator.wikimedia.org/T139738#2441214 (10Dereckson) [12:27:28] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [12:28:16] (03CR) 10Peachey88: "(Added in rOPUPda81f55c04852b4008a63c17502fd05b30b3484c)" [puppet] - 10https://gerrit.wikimedia.org/r/277904 (owner: 10Paladox) [12:31:36] (03PS1) 10Dereckson: Install arcanist in toollabs::dev_environ [puppet] - 10https://gerrit.wikimedia.org/r/297975 (https://phabricator.wikimedia.org/T139738) [12:32:20] (03CR) 10Andrew Bogott: "> we can avoid the exec() and just fix the labvirt* systems via salt" [puppet] - 10https://gerrit.wikimedia.org/r/297897 (https://phabricator.wikimedia.org/T139598) (owner: 10Andrew Bogott) [12:34:53] (03PS1) 10Yuvipanda: tools: Move static site to use http2 as well [puppet] - 10https://gerrit.wikimedia.org/r/297976 (https://phabricator.wikimedia.org/T134383) [12:36:12] (03PS2) 10Yuvipanda: tools: Move static site to use http2 as well [puppet] - 10https://gerrit.wikimedia.org/r/297976 (https://phabricator.wikimedia.org/T134383) [12:36:28] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Move static site to use http2 as well [puppet] - 10https://gerrit.wikimedia.org/r/297976 (https://phabricator.wikimedia.org/T134383) (owner: 10Yuvipanda) [12:36:56] (03CR) 10Muehlenhoff: "This is set during boot via modprobe and any fresh installation will need another reboot to get the latest kernel etc, so I don't think th" [puppet] - 10https://gerrit.wikimedia.org/r/297897 
(https://phabricator.wikimedia.org/T139598) (owner: 10Andrew Bogott) [12:38:08] PROBLEM - MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1363.46 seconds [12:42:48] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: puppet fail [12:45:35] ^Error: Could not find a suitable provider for etcd_role [12:45:47] Error: Could not find a suitable provider for etcd_user [12:47:27] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [12:50:07] (03PS2) 10Andrew Bogott: Increase conntrack limits for nova compute nodes. [puppet] - 10https://gerrit.wikimedia.org/r/297897 (https://phabricator.wikimedia.org/T139598) [12:54:08] (03PS2) 10Yuvipanda: tools: Do not have static class inherit from toollabs [puppet] - 10https://gerrit.wikimedia.org/r/278431 (https://phabricator.wikimedia.org/T128411) [12:55:08] (03PS3) 10Yuvipanda: tools: Do not have static class inherit from toollabs [puppet] - 10https://gerrit.wikimedia.org/r/278431 (https://phabricator.wikimedia.org/T128411) [12:55:41] (03CR) 10Yuvipanda: [C: 031] "This is now blocking T139743 as well. I don't think blocking this on refactoring the whole toollabs base class is a good idea." [puppet] - 10https://gerrit.wikimedia.org/r/278431 (https://phabricator.wikimedia.org/T128411) (owner: 10Yuvipanda) [13:03:01] (03CR) 10Tim Landscheidt: [C: 04-1] "The arcanist package is not available in Ubuntu Precise, so this must be moved up in the "if os_version('ubuntu trusty')" block." 
[puppet] - 10https://gerrit.wikimedia.org/r/297975 (https://phabricator.wikimedia.org/T139738) (owner: 10Dereckson) [13:04:40] (03CR) 10Dereckson: "And for Debian; only Sid" [puppet] - 10https://gerrit.wikimedia.org/r/297975 (https://phabricator.wikimedia.org/T139738) (owner: 10Dereckson) [13:05:17] (03PS1) 10Yuvipanda: Revert "tools: Move static site to use http2 as well" [puppet] - 10https://gerrit.wikimedia.org/r/297978 (https://phabricator.wikimedia.org/T139743) [13:07:07] PROBLEM - puppet last run on rhodium is CRITICAL: CRITICAL: Puppet has 4 failures [13:07:27] PROBLEM - puppetmaster backend https on rhodium is CRITICAL: Connection refused [13:07:38] (03CR) 10Yuvipanda: [C: 032] Revert "tools: Move static site to use http2 as well" [puppet] - 10https://gerrit.wikimedia.org/r/297978 (https://phabricator.wikimedia.org/T139743) (owner: 10Yuvipanda) [13:07:41] 06Operations, 10Wikimedia-SVG-rendering, 07Upstream: Filter effect Gaussian blur filter not rendered correctly for small to medium thumbnail sizes - https://phabricator.wikimedia.org/T44090#2441331 (10Dvorapa) @MoritzMuehlenhoff I just want to fix [[https://cs.wikipedia.org/wiki/Wikipedista:Dvorapa/P%C3%ADsk... [13:09:15] (03PS2) 10Dereckson: Install arcanist in toollabs::dev_environ [puppet] - 10https://gerrit.wikimedia.org/r/297975 (https://phabricator.wikimedia.org/T139738) [13:10:04] (03CR) 10Dereckson: "PS2: trusty only" [puppet] - 10https://gerrit.wikimedia.org/r/297975 (https://phabricator.wikimedia.org/T139738) (owner: 10Dereckson) [13:10:06] 06Operations, 10Wikimedia-SVG-rendering, 07Upstream: Filter effect Gaussian blur filter not rendered correctly for small to medium thumbnail sizes - https://phabricator.wikimedia.org/T44090#461916 (10jcrespo) @Dvorapa I see the [[ https://cs.wikipedia.org/wiki/Wikipedista:Dvorapa/P%C3%ADskovi%C5%A1t%C4%9B/Br... 
[13:14:10] 06Operations, 10Wikimedia-SVG-rendering, 07Upstream: Filter effect Gaussian blur filter not rendered correctly for small to medium thumbnail sizes - https://phabricator.wikimedia.org/T44090#2441351 (10Dvorapa) @jcrespo I see, thank you, it must be from today, because yesterday purging server and browser cach... [13:14:34] 06Operations, 10Wikimedia-SVG-rendering, 07Upstream: Filter effect Gaussian blur filter not rendered correctly for small to medium thumbnail sizes - https://phabricator.wikimedia.org/T44090#2441352 (10TheDJ) @Dvorapa The confusion we are seeing here is probably a bit logical. The issues is fixed, BUT any ima... [13:16:56] (03PS1) 10Elukey: Remove Websocket upgrade request from Varnishkafka webrequest config [puppet] - 10https://gerrit.wikimedia.org/r/297980 (https://phabricator.wikimedia.org/T136314) [13:19:14] (03PS1) 10Giuseppe Lavagetto: puppetmaster::passenger: use sslcert::dhparam on jessie [puppet] - 10https://gerrit.wikimedia.org/r/297981 (https://phabricator.wikimedia.org/T98173) [13:20:01] (03PS2) 10Elukey: Remove Websocket upgrade request from Varnishkafka webrequest config [puppet] - 10https://gerrit.wikimedia.org/r/297980 (https://phabricator.wikimedia.org/T136314) [13:24:34] (03CR) 10Ottomata: [C: 031] Remove Websocket upgrade request from Varnishkafka webrequest config [puppet] - 10https://gerrit.wikimedia.org/r/297980 (https://phabricator.wikimedia.org/T136314) (owner: 10Elukey) [13:25:34] (03CR) 10Giuseppe Lavagetto: [C: 032] puppetmaster::passenger: use sslcert::dhparam on jessie [puppet] - 10https://gerrit.wikimedia.org/r/297981 (https://phabricator.wikimedia.org/T98173) (owner: 10Giuseppe Lavagetto) [13:27:01] (03CR) 10Ema: [C: 031] Remove Websocket upgrade request from Varnishkafka webrequest config [puppet] - 10https://gerrit.wikimedia.org/r/297980 (https://phabricator.wikimedia.org/T136314) (owner: 10Elukey) [13:28:58] PROBLEM - Unmerged changes on repository puppet on rhodium is CRITICAL: There are 2 unmerged 
changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [13:33:06] <_joe_> ah, obviously [13:33:54] (03PS1) 10Muehlenhoff: debug_proxy: Limit to production networks [puppet] - 10https://gerrit.wikimedia.org/r/297982 [13:35:46] RECOVERY - puppetmaster backend https on rhodium is OK: HTTP OK: Status line output matched 400 - 333 bytes in 1.764 second response time [13:38:07] 06Operations, 10Wikimedia-SVG-rendering, 07Upstream: Filter effect Gaussian blur filter not rendered correctly for small to medium thumbnail sizes - https://phabricator.wikimedia.org/T44090#2441422 (10Dvorapa) @TheDJ I see, thank you [13:39:39] (03PS1) 10Muehlenhoff: role::horizon: Limit to production networks [puppet] - 10https://gerrit.wikimedia.org/r/297983 [13:43:57] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-0/0/0: down - Core: cr2-codfw:xe-5/2/1 (Telia, IC-314534, 29ms) {#11375} [10Gbps wave]BR [13:43:57] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/2/1: down - Core: cr1-eqord:xe-0/0/0 (Telia, IC-314534, 24ms) {#10694} [10Gbps wave]BR [13:50:04] (03CR) 10Muehlenhoff: [C: 04-1] "One thing, rest looks fine" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/297897 (https://phabricator.wikimedia.org/T139598) (owner: 10Andrew Bogott) [13:51:27] (03PS3) 10Andrew Bogott: Increase conntrack limits for nova compute nodes. [puppet] - 10https://gerrit.wikimedia.org/r/297897 (https://phabricator.wikimedia.org/T139598) [13:54:28] (03PS4) 10Andrew Bogott: Increase conntrack limits for nova compute nodes. 
[puppet] - 10https://gerrit.wikimedia.org/r/297897 (https://phabricator.wikimedia.org/T139598) [14:04:14] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/297897 (https://phabricator.wikimedia.org/T139598) (owner: 10Andrew Bogott) [14:04:47] (03CR) 10Andrew Bogott: [C: 032] Increase conntrack limits for nova compute nodes. [puppet] - 10https://gerrit.wikimedia.org/r/297897 (https://phabricator.wikimedia.org/T139598) (owner: 10Andrew Bogott) [14:05:02] (03Abandoned) 10Ema: Package 4.1.3-1wm1 [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/297603 (owner: 10Ema) [14:06:13] AaronSchulz: hiii, would love to get this merged soon: https://gerrit.wikimedia.org/r/#/c/293628/ [14:08:18] I can't +2 my own changes generally [14:08:24] (03PS1) 10Ottomata: Initial debianization and release [debs/python-confluent-kafka] - 10https://gerrit.wikimedia.org/r/297985 [14:09:02] maybe ping Krinkle [14:10:53] (03CR) 10Ottomata: [C: 032 V: 032] Initial debianization and release [debs/python-confluent-kafka] (debian) - 10https://gerrit.wikimedia.org/r/293280 (owner: 10Ottomata) [14:12:34] (03PS1) 10Ottomata: Release 0.9.1.2 [debs/python-confluent-kafka] (debian) - 10https://gerrit.wikimedia.org/r/297986 [14:13:15] AaronSchulz: you might want a release notes entry for https://gerrit.wikimedia.org/r/#/c/293628/2 :) [14:17:10] 06Operations, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Depleted connection tracking table on labvirt1010 - https://phabricator.wikimedia.org/T139598#2441481 (10Andrew) 05Open>03Resolved a:03Andrew [14:18:04] PROBLEM - puppet last run on mw1146 is CRITICAL: CRITICAL: Puppet has 2 failures [14:18:10] (03PS1) 10Giuseppe Lavagetto: puppetmaster: add rhodium as an inactive backend [puppet] - 10https://gerrit.wikimedia.org/r/297987 (https://phabricator.wikimedia.org/T98173) [14:18:11] ok thanks AaronSchulz will do [14:18:12] (03PS1) 10Giuseppe Lavagetto: puppetmaster: perform git init in the private repo 
dir [puppet] - 10https://gerrit.wikimedia.org/r/297988 (https://phabricator.wikimedia.org/T98173) [14:18:28] <_joe_> akosiaris: before you go, an opinion on ^^ :) [14:19:32] (03CR) 10jenkins-bot: [V: 04-1] puppetmaster: add rhodium as an inactive backend [puppet] - 10https://gerrit.wikimedia.org/r/297987 (https://phabricator.wikimedia.org/T98173) (owner: 10Giuseppe Lavagetto) [14:19:42] (03CR) 10jenkins-bot: [V: 04-1] puppetmaster: perform git init in the private repo dir [puppet] - 10https://gerrit.wikimedia.org/r/297988 (https://phabricator.wikimedia.org/T98173) (owner: 10Giuseppe Lavagetto) [14:20:05] (03PS3) 10Elukey: Remove Websocket upgrade request from Varnishkafka webrequest config [puppet] - 10https://gerrit.wikimedia.org/r/297980 (https://phabricator.wikimedia.org/T136314) [14:24:40] (03CR) 10Elukey: "Puppet compiler looks good: https://puppet-compiler.wmflabs.org/3294/" [puppet] - 10https://gerrit.wikimedia.org/r/297980 (https://phabricator.wikimedia.org/T136314) (owner: 10Elukey) [14:24:49] (03CR) 10Alexandros Kosiaris: [C: 031] puppetmaster: perform git init in the private repo dir [puppet] - 10https://gerrit.wikimedia.org/r/297988 (https://phabricator.wikimedia.org/T98173) (owner: 10Giuseppe Lavagetto) [14:24:51] (03CR) 10Elukey: [C: 032] "Puppet compiler looks good: https://puppet-compiler.wmflabs.org/3294/" [puppet] - 10https://gerrit.wikimedia.org/r/297980 (https://phabricator.wikimedia.org/T136314) (owner: 10Elukey) [14:25:40] (03CR) 10Alexandros Kosiaris: [C: 031] "keep in mind apache actually needs to be restarted for this change to take effect. 
A reload for some reason won't do it" [puppet] - 10https://gerrit.wikimedia.org/r/297987 (https://phabricator.wikimedia.org/T98173) (owner: 10Giuseppe Lavagetto) [14:35:21] (03PS1) 10Ema: Release 4.1.3-1wm1 [debs/varnish4] - 10https://gerrit.wikimedia.org/r/297989 [14:37:23] (03Abandoned) 10Ema: Release 4.1.3-1wm1 [debs/varnish4] - 10https://gerrit.wikimedia.org/r/297989 (owner: 10Ema) [14:41:45] RECOVERY - MariaDB Slave Lag: m3 on db1048 is OK: OK slave_sql_lag Replication lag: 0.29 seconds [14:42:57] (03PS1) 10Yuvipanda: dynamicproxy: Remove redundant redundantproxy.conf file [puppet] - 10https://gerrit.wikimedia.org/r/297991 [14:43:12] (03PS2) 10Yuvipanda: dynamicproxy: Remove redundant redundantproxy.conf file [puppet] - 10https://gerrit.wikimedia.org/r/297991 [14:43:32] !log rechecking data consistency after m3 table fixes (could cause lag) [14:43:33] (03CR) 10Yuvipanda: [C: 032 V: 032] dynamicproxy: Remove redundant redundantproxy.conf file [puppet] - 10https://gerrit.wikimedia.org/r/297991 (owner: 10Yuvipanda) [14:43:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:44:02] (03PS1) 10Ema: Release 4.1.3-1wm1 [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/297992 [14:44:05] RECOVERY - puppet last run on mw1146 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:44:30] phabricator has some non-innodb tables, those had missing data on the slave [14:45:28] (03CR) 10Ottomata: [C: 032 V: 032] Release 0.9.1.2 [debs/python-confluent-kafka] (debian) - 10https://gerrit.wikimedia.org/r/297986 (owner: 10Ottomata) [14:46:30] (03CR) 10Ema: [C: 032 V: 032] Release 4.1.3-1wm1 [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/297992 (owner: 10Ema) [14:47:19] it could have happened because of a crash, as it was still on 5.5 and that version didn't have safe replication [14:48:54] 06Operations, 06Commons, 10media-storage, 07User-notice: Some fonts not anti-aliasing in SVG thumbnails 
after upgrade of scaling servers - https://phabricator.wikimedia.org/T139543#2435518 (10TheDJ) Yeah I agree, something seems broken with the font setup since the upgrade. [14:50:04] 06Operations, 10Cassandra, 06Services, 13Patch-For-Review: High storage utilization on restbase1014.eqiad.wmnet - https://phabricator.wikimedia.org/T139362#2441575 (10Eevans) >>! In T139362#2441203, @fgiunchedi wrote: > looks like the latest bootstrap failed too [ ... ] > and the "memory was freed" + so... [14:50:26] urandom: o/ didn't see your ping yesterday, did you need me? [14:51:12] elukey: s'okay [14:51:18] elukey: i found someone else :) [14:52:18] ahhh okok :) [14:52:28] elukey: given the late hour, i didn't expect that you'd answer [14:52:32] 06Operations, 10hardware-requests: Find and rack 2 EX4200s in rack c1-eqiad - https://phabricator.wikimedia.org/T139752#2441576 (10mark) [14:59:24] 06Operations, 10Wikimedia-SVG-rendering: SVG marker-mid with orient auto don't work (stops rendering subsequent elements) - https://phabricator.wikimedia.org/T117530#2441595 (10TheDJ) 05Open>03Resolved a:03TheDJ Also: https://commons.wikimedia.org/wiki/File:T117530_Chinatown_map_WV.svg Indeed seems fixe... [15:00:54] 06Operations, 10Cassandra, 06Services, 13Patch-For-Review: High storage utilization on restbase1014.eqiad.wmnet - https://phabricator.wikimedia.org/T139362#2441603 (10fgiunchedi) indeed sounds like 1014 won't be able to recover the disk space anyway on its own, I'm ok to go `removenode` [15:02:29] 06Operations, 10Traffic: Install XKey vmod - https://phabricator.wikimedia.org/T122881#2441604 (10ema) The XKey vmod is part of a collection of VMODs called [[https://github.com/varnish/varnish-modules/blob/master/src/vmod_xkey.c|varnish-modules]], already packaged in Debian. I've just [[http://anonscm.debian... 
[15:03:01] 06Operations, 06Analytics-Kanban, 10Traffic, 13Patch-For-Review: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2441605 (10elukey) Websocket upgrades filtered out from Varnishkafka. The last step is to wait for https://gerrit.wikimedia.org/r/#/c/... [15:03:54] (03PS1) 10Eevans: Deinit 1009-c and 1014-c [puppet] - 10https://gerrit.wikimedia.org/r/297993 (https://phabricator.wikimedia.org/T139362) [15:04:51] 06Operations, 10Cassandra, 06Services, 13Patch-For-Review: High storage utilization on restbase1014.eqiad.wmnet - https://phabricator.wikimedia.org/T139362#2441621 (10mobrovac) >>! In T139362#2441603, @fgiunchedi wrote: > indeed sounds like 1014 won't be able to recover the disk space anyway on its own, I'... [15:05:40] (03CR) 10Eevans: "This can be merged at any time since it doesn't result in any change server-side (removal of config, systemd units, etc, all require manua" [puppet] - 10https://gerrit.wikimedia.org/r/297993 (https://phabricator.wikimedia.org/T139362) (owner: 10Eevans) [15:12:06] lag on m3 may happen again with the checks [15:13:19] (03PS1) 10Rush: labs: break out puppet alert logic to generic notify handler [puppet] - 10https://gerrit.wikimedia.org/r/297996 [15:13:40] 06Operations: Rename 'restricted' group? - https://phabricator.wikimedia.org/T104671#2441642 (10leila) @Krenair I don't know what kind of access/privileges I've been getting by being in "restricted". :( I double-checked with Analytics to make sure by not being in that group I don't loose access to Bastion and th... [15:14:33] 06Operations: Rename 'restricted' group? - https://phabricator.wikimedia.org/T104671#2441643 (10leila) [15:15:33] (03CR) 10jenkins-bot: [V: 04-1] labs: break out puppet alert logic to generic notify handler [puppet] - 10https://gerrit.wikimedia.org/r/297996 (owner: 10Rush) [15:16:48] 06Operations, 10Wikimedia-SVG-rendering, 07I18n: MB Lateefi Fonts for Sindhi Wikipedia. 
- https://phabricator.wikimedia.org/T138136#2441644 (10MoritzMuehlenhoff) [15:17:04] PROBLEM - puppet last run on mw1146 is CRITICAL: CRITICAL: Puppet has 109 failures [15:17:44] 06Operations, 10DBA, 10Gerrit, 06Release-Engineering-Team: Need snapshot of 'reviewdb' on spare machine to test gerrit schema upgrades - https://phabricator.wikimedia.org/T139755#2441652 (10demon) [15:18:24] (03PS1) 10Muehlenhoff: Also install conntrack on labvirt* hosts [puppet] - 10https://gerrit.wikimedia.org/r/297998 [15:18:52] 06Operations, 10DBA, 10Gerrit, 06Release-Engineering-Team: Need snapshot of 'reviewdb' on spare machine to test gerrit schema upgrades - https://phabricator.wikimedia.org/T139755#2441666 (10demon) [15:24:24] (03PS2) 10Rush: labs: break out puppet alert logic to generic notify handler [puppet] - 10https://gerrit.wikimedia.org/r/297996 [15:27:20] 06Operations, 10DBA, 10Gerrit, 06Release-Engineering-Team: Need snapshot of 'reviewdb' on spare machine to test gerrit schema upgrades - https://phabricator.wikimedia.org/T139755#2441690 (10demon) p:05Normal>03Unbreak! Because of {T139669} [15:27:24] (03CR) 10jenkins-bot: [V: 04-1] labs: break out puppet alert logic to generic notify handler [puppet] - 10https://gerrit.wikimedia.org/r/297996 (owner: 10Rush) [15:29:38] 06Operations, 10DBA, 10Gerrit, 06Release-Engineering-Team: Need snapshot of 'reviewdb' on spare machine to test gerrit schema upgrades - https://phabricator.wikimedia.org/T139755#2441652 (10jcrespo) a:03jcrespo [15:31:34] 06Operations, 06Services, 13Patch-For-Review, 15User-mobrovac: Updates various services to nodejs 4.4.6 - https://phabricator.wikimedia.org/T138561#2441717 (10mobrovac) Since all of the SCB services have been tested with the new version of node, @MoritzMuehlenhoff and I decided to switch SCB to Node 4.4.6... 
[15:32:44] 06Operations, 06Discovery, 10Elasticsearch, 03Discovery-Search-Sprint, 13Patch-For-Review: Install and configure new elasticsearch servers in eqiad - https://phabricator.wikimedia.org/T138329#2441719 (10Gehel) [15:34:22] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 10Elasticsearch, 03Discovery-Search-Sprint: Decommission elastic1001-1016 - https://phabricator.wikimedia.org/T139758#2441721 (10Gehel) [15:34:23] (03PS1) 10Ottomata: Update eventlogging kafka consumer args to match with python-confluent-kafka consumer [puppet] - 10https://gerrit.wikimedia.org/r/298000 (https://phabricator.wikimedia.org/T133779) [15:35:39] (03PS2) 10Ottomata: Update eventlogging kafka consumer args to match with python-confluent-kafka consumer [puppet] - 10https://gerrit.wikimedia.org/r/298000 (https://phabricator.wikimedia.org/T133779) [15:35:44] (03CR) 10jenkins-bot: [V: 04-1] Update eventlogging kafka consumer args to match with python-confluent-kafka consumer [puppet] - 10https://gerrit.wikimedia.org/r/298000 (https://phabricator.wikimedia.org/T133779) (owner: 10Ottomata) [15:36:58] (03CR) 10jenkins-bot: [V: 04-1] Update eventlogging kafka consumer args to match with python-confluent-kafka consumer [puppet] - 10https://gerrit.wikimedia.org/r/298000 (https://phabricator.wikimedia.org/T133779) (owner: 10Ottomata) [15:37:35] 06Operations, 10hardware-requests: Find and rack 2 EX4200s in rack c1-eqiad - https://phabricator.wikimedia.org/T139752#2441576 (10RobH) Unfortunately, the spares tracking pages doesn't list off EX4200s, just their extra modules. We should have a few spare 4200 48 port switches from the Tampa shutdown. I'm g... 
[15:40:42] PROBLEM - HP RAID on ms-be1021 is CRITICAL: CRITICAL: Slot 3: Failed: 1I:1:2 - OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [15:41:15] (03PS2) 10Eevans: Deinit 1009-c and 1014-c [puppet] - 10https://gerrit.wikimedia.org/r/297993 (https://phabricator.wikimedia.org/T139362) [15:41:31] PROBLEM - Disk space on ms-be1021 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdh1 is not accessible: Input/output error [15:42:34] (03PS3) 10Rush: labs: break out puppet alert logic to generic notify handler [puppet] - 10https://gerrit.wikimedia.org/r/297996 [15:45:14] (03CR) 10jenkins-bot: [V: 04-1] Update eventlogging kafka consumer args to match with python-confluent-kafka consumer [puppet] - 10https://gerrit.wikimedia.org/r/298000 (https://phabricator.wikimedia.org/T133779) (owner: 10Ottomata) [15:48:51] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:51:11] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [15:54:43] PROBLEM - puppet last run on ms-be1021 is CRITICAL: CRITICAL: Puppet has 1 failures [15:55:12] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:57:31] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [15:58:59] (03CR) 10Filippo Giunchedi: [C: 031] Deinit 1009-c and 1014-c [puppet] - 10https://gerrit.wikimedia.org/r/297993 (https://phabricator.wikimedia.org/T139362) (owner: 10Eevans) [16:00:02] RECOVERY - Disk space on ms-be1021 is OK: DISK OK [16:00:17] 06Operations, 10Wikimedia-Apache-configuration, 07HHVM, 07Wikimedia-log-errors: Fix Apache proxy_fcgi error "Invalid argument: AH01075: Error dispatching request to" (Causing HTTP 503) - https://phabricator.wikimedia.org/T73487#2441866 (10elukey) Deployed the patched version of httpd but the errors are sti... 
[16:01:21] hashar: I am really sad --^ [16:02:05] (03CR) 10Andrew Bogott: [C: 04-1] "This is a great idea! One comment inline." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/297996 (owner: 10Rush) [16:05:18] 06Operations, 10Ops-Access-Requests: analytics server access request for three users from CPS Data Consulting - https://phabricator.wikimedia.org/T139764#2441876 (10Jgreen) [16:05:31] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:06:18] 06Operations, 10Ops-Access-Requests: analytics server access request for three users from CPS Data Consulting - https://phabricator.wikimedia.org/T139764#2441890 (10Jgreen) [16:07:51] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [16:11:02] PROBLEM - cassandra-c CQL 10.64.48.131:9042 on restbase1009 is CRITICAL: Connection refused [16:11:10] ^^^ got it. [16:11:22] PROBLEM - cassandra-c service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [16:11:27] ^^^ and that. [16:11:42] (03PS2) 10Andrew Bogott: Also install conntrack on labvirt* hosts [puppet] - 10https://gerrit.wikimedia.org/r/297998 (owner: 10Muehlenhoff) [16:12:06] ACKNOWLEDGEMENT - cassandra-c CQL 10.64.48.131:9042 on restbase1009 is CRITICAL: Connection refused eevans Removing - The acknowledgement expires at: 2016-07-09 16:11:44. [16:12:42] 06Operations, 10DBA, 10Gerrit, 06Release-Engineering-Team: Need snapshot of 'reviewdb' on spare machine to test gerrit schema upgrades - https://phabricator.wikimedia.org/T139755#2441652 (10Krenair) Actually, reviewdb lives on m2, I believe since 2012: ```alex@alex-laptop:~$ ssh gerrit gerrit gsql Welcome... 
[16:12:44] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] "merging as Eric mentioned this is a noop server-side, though it will line up icinga configuration to prevent spurious alerts when the inst" [puppet] - 10https://gerrit.wikimedia.org/r/297993 (https://phabricator.wikimedia.org/T139362) (owner: 10Eevans) [16:13:06] ACKNOWLEDGEMENT - cassandra-c service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed eevans Removing - The acknowledgement expires at: 2016-07-09 16:12:52. [16:13:07] (03PS1) 10Dereckson: Serve jQuery locally [software/dbtree] - 10https://gerrit.wikimedia.org/r/298011 (https://phabricator.wikimedia.org/T139762) [16:13:44] godog: thanks! [16:14:55] urandom: yw! [16:15:26] 06Operations, 10DBA, 10Gerrit, 06Release-Engineering-Team: Need snapshot of 'reviewdb' on spare machine to test gerrit schema upgrades - https://phabricator.wikimedia.org/T139755#2441932 (10jcrespo) Yes, I didn't even read the description, I am copying from db. I had already dropped old version that used t... 
[16:15:51] (03PS4) 10Rush: labs: break out puppet alert logic to generic notify handler [puppet] - 10https://gerrit.wikimedia.org/r/297996 [16:17:13] (03PS2) 10Dereckson: Serve jQuery locally [software/dbtree] - 10https://gerrit.wikimedia.org/r/298011 (https://phabricator.wikimedia.org/T139762) [16:17:29] 06Operations, 10DBA, 10Gerrit, 06Release-Engineering-Team: Need snapshot of 'reviewdb' on spare machine to test gerrit schema upgrades - https://phabricator.wikimedia.org/T139755#2441935 (10demon) [16:17:50] (03CR) 10Dereckson: "PS2: don't fix \n at EOF" [software/dbtree] - 10https://gerrit.wikimedia.org/r/298011 (https://phabricator.wikimedia.org/T139762) (owner: 10Dereckson) [16:19:06] (03CR) 10Andrew Bogott: [C: 031] labs: break out puppet alert logic to generic notify handler [puppet] - 10https://gerrit.wikimedia.org/r/297996 (owner: 10Rush) [16:21:44] (03PS5) 10Rush: labs: break out puppet alert logic to generic notify handler [puppet] - 10https://gerrit.wikimedia.org/r/297996 [16:31:41] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:32:46] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [16:40:19] 06Operations, 10DBA, 10Gerrit, 06Release-Engineering-Team: Need snapshot of 'reviewdb' on spare machine to test gerrit schema upgrades - https://phabricator.wikimedia.org/T139755#2441986 (10jcrespo) 05Open>03Resolved ``` root@db1042:~$ my reviewdb -e "SHOW TABLES" +-----------------------------+ | Tabl... 
[16:40:22] 06Operations, 10ops-eqiad: ms-be1012.eqiad.wmnet: slot=1I:1:2 dev=sdh - https://phabricator.wikimedia.org/T139767#2441989 (10fgiunchedi) [16:40:37] 06Operations, 10ops-eqiad: ms-be1012.eqiad.wmnet: slot=1I:1:2 dev=sdh failed - https://phabricator.wikimedia.org/T139767#2442001 (10fgiunchedi) [16:40:58] RECOVERY - HP RAID on ms-be1021 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [16:42:38] ACKNOWLEDGEMENT - puppet last run on ms-be1021 is CRITICAL: CRITICAL: Puppet has 1 failures Filippo Giunchedi T139767 [16:48:37] (03CR) 10Andrew Bogott: [C: 04-1] labs: break out puppet alert logic to generic notify handler (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/297996 (owner: 10Rush) [16:49:17] (03PS1) 10Yuvipanda: labs: Allow explicitly specifying lookupcache via hiera [puppet] - 10https://gerrit.wikimedia.org/r/298021 (https://phabricator.wikimedia.org/T139769) [16:49:27] (03PS3) 10Andrew Bogott: Also install conntrack on labvirt* hosts [puppet] - 10https://gerrit.wikimedia.org/r/297998 (owner: 10Muehlenhoff) [16:51:49] (03CR) 10Andrew Bogott: [C: 032] Also install conntrack on labvirt* hosts [puppet] - 10https://gerrit.wikimedia.org/r/297998 (owner: 10Muehlenhoff) [16:53:08] (03CR) 10Andrew Bogott: [C: 031] role::horizon: Limit to production networks [puppet] - 10https://gerrit.wikimedia.org/r/297983 (owner: 10Muehlenhoff) [16:57:45] (03PS2) 10Rush: labs: Allow explicitly specifying lookupcache via hiera [puppet] - 10https://gerrit.wikimedia.org/r/298021 (https://phabricator.wikimedia.org/T139769) (owner: 10Yuvipanda) [16:58:14] 06Operations, 10DBA, 10Phabricator, 13Patch-For-Review: Upgrade m3 (phabricator) db servers - https://phabricator.wikimedia.org/T138460#2442059 (10jcrespo) I corrected some, but there is still some drift between db1043 and db1048. We cannot progress until that is fixed. 
[17:01:29] (03CR) 10Rush: [C: 031] "yeah should be good, nfs refactor work paying off too :)" [puppet] - 10https://gerrit.wikimedia.org/r/298021 (https://phabricator.wikimedia.org/T139769) (owner: 10Yuvipanda) [17:01:54] 06Operations, 10DBA, 10Phabricator, 13Patch-For-Review: Upgrade m3 (phabricator) db servers - https://phabricator.wikimedia.org/T138460#2442065 (10mmodell) @jcrespo is there anything I can do to help? [17:01:57] (03CR) 10Yuvipanda: [C: 032 V: 032] labs: Allow explicitly specifying lookupcache via hiera [puppet] - 10https://gerrit.wikimedia.org/r/298021 (https://phabricator.wikimedia.org/T139769) (owner: 10Yuvipanda) [17:05:07] 06Operations, 10DBA, 10Gerrit, 06Release-Engineering-Team: Need snapshot of 'reviewdb' on spare machine to test gerrit schema upgrades - https://phabricator.wikimedia.org/T139755#2442077 (10demon) `db1042`, got it! [17:08:04] (03CR) 10Rush: labs: break out puppet alert logic to generic notify handler (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/297996 (owner: 10Rush) [17:13:14] (03PS6) 10Rush: labs: break out puppet alert logic to generic notify handler [puppet] - 10https://gerrit.wikimedia.org/r/297996 [17:18:37] PROBLEM - Start and verify pages via webservices on kubernetes on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/webservice/kubernetes - 301 bytes in 0.234 second response time [17:21:55] 06Operations, 10Wikimedia-Apache-configuration, 07HHVM, 07Wikimedia-log-errors: Fix Apache proxy_fcgi error "Invalid argument: AH01075: Error dispatching request to" (Causing HTTP 503) - https://phabricator.wikimedia.org/T73487#2442162 (10elukey) I copied an example of 304 in mw1061:/home/elukey/error_log_... 
[17:22:38] PROBLEM - puppet last run on mw2090 is CRITICAL: CRITICAL: puppet fail [17:23:47] RECOVERY - Start and verify pages via webservices on kubernetes on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 31.451 second response time [17:33:16] 06Operations, 06Commons, 10media-storage, 07User-notice: Some fonts not anti-aliasing in SVG thumbnails after upgrade of scaling servers - https://phabricator.wikimedia.org/T139543#2442184 (10kaldari) I wonder if this is related to Ghostscript, as most of the fonts that are broken are fonts that are typica... [17:33:26] PROBLEM - Start and verify pages via webservices on kubernetes on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/webservice/kubernetes - 185 bytes in 18.694 second response time [17:37:57] 06Operations, 10ops-eqiad, 10hardware-requests: eqiad: add all spare network switches to hardware spares tracking - https://phabricator.wikimedia.org/T139775#2442219 (10RobH) [17:37:59] 06Operations, 10ops-codfw, 10hardware-requests: codfw: add all spare network switches to hardware spares tracking - https://phabricator.wikimedia.org/T139776#2442235 (10RobH) [17:38:35] 06Operations, 10ops-eqiad, 10hardware-requests: eqiad: add all spare network switches to hardware spares tracking - https://phabricator.wikimedia.org/T139775#2442252 (10RobH) [17:38:37] 06Operations, 10hardware-requests: Find and rack 2 EX4200s in rack c1-eqiad - https://phabricator.wikimedia.org/T139752#2441576 (10RobH) [17:38:49] 06Operations, 10ops-eqiad, 10hardware-requests: eqiad: add all spare network switches to hardware spares tracking - https://phabricator.wikimedia.org/T139775#2442219 (10RobH) [17:38:58] 06Operations, 10ops-eqiad, 10hardware-requests: eqiad: add all spare network switches to hardware spares tracking - https://phabricator.wikimedia.org/T139775#2442219 (10RobH) [17:39:00] 06Operations, 10hardware-requests: 
Find and rack 2 EX4200s in rack c1-eqiad - https://phabricator.wikimedia.org/T139752#2441576 (10RobH) [17:39:09] man i don't like you sometimes wikibugs. [17:40:31] i need https://meta.wikimedia.org/wiki/Interwiki_map synchronized please :) [17:40:46] robh: heh, view the positve side: You know, that your comment was saved now :D [17:40:52] *positive [17:41:10] Steinsplitter: what do we need, a https://phabricator.wikimedia.org/diffusion/EWMA/browse/master/dumpInterwiki.php run? [17:41:44] yeah i just know it's my task update pings, but i still have to check obsessively, heh [17:42:24] Dereckson: yes :) [17:48:07] petscan looks new [17:48:21] that's correct? [17:48:34] it looks new, it is the replacement for catscan. [17:48:37] RECOVERY - puppet last run on mw2090 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [17:50:23] (03PS1) 10Dereckson: Update interwiki map [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298028 [17:50:24] Steinsplitter: here you are ^ [17:51:42] Dereckson: thx :-) [17:52:06] (03CR) 10Steinsplitter: [C: 031] Update interwiki map [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298028 (owner: 10Dereckson) [17:52:18] Do you need that deployed now or could that wait for our next SWAT window Monday 17:00 UTC? 
[17:52:20] (i should learn to run the script myself) [17:52:57] mwscript extensions/WikimediaMaintenance/dumpInterwiki.php --protocolrelative > wmf-config/interwiki.php [17:53:02] it would be better now, but if we wait until Monday nothing serious should happen [17:54:53] Steinsplitter: I'm not sure dumpInterwiki.php works just by cloning the WikimediaMaintenance extension [17:56:01] Steinsplitter: paths are hardcoded: https://github.com/wikimedia/mediawiki-extensions-WikimediaMaintenance/blob/master/dumpInterwiki.php#L224 [17:57:01] So it could work if you have a symlink /srv/mediawiki pointing to an up-to-date clone of operations/mediawiki-config [17:59:46] PROBLEM - puppet last run on mw2157 is CRITICAL: CRITICAL: puppet fail [18:00:41] Scheduled for Monday @ https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160711T1500 [18:19:37] PROBLEM - puppet last run on cp3047 is CRITICAL: CRITICAL: puppet fail [18:20:24] Steinsplitter: running dumpInterwiki.php also requires getRealmSpecificFilename, which is provided by multiversion/MWRealm.php in the config repo, so it's probably easier to ask here. [18:20:30] (03PS4) 10Ottomata: Update eventlogging kafka consumer args to match with python-confluent-kafka consumer [puppet] - 10https://gerrit.wikimedia.org/r/298000 (https://phabricator.wikimedia.org/T133779) [18:20:52] Dereckson: okay, thanks again. 
[18:23:02] (03PS5) 10Ottomata: Update eventlogging kafka consumer args to match with python-confluent-kafka consumer [puppet] - 10https://gerrit.wikimedia.org/r/298000 (https://phabricator.wikimedia.org/T133779) [18:24:48] RECOVERY - puppet last run on mw2157 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [18:29:01] !log db1042 - temp stop puppet, edit ferm rules to allow testing from lead [18:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:34:41] (03PS1) 10Dzahn: typos file: add 'mariabd' and 'eqad' [puppet] - 10https://gerrit.wikimedia.org/r/298033 [18:35:55] (03CR) 10jenkins-bot: [V: 04-1] typos file: add 'mariabd' and 'eqad' [puppet] - 10https://gerrit.wikimedia.org/r/298033 (owner: 10Dzahn) [18:39:11] !log Stopping restbase1014-c.eqiad.wmnet : T139362 [18:39:11] T139362: High storage utilization on restbase1014.eqiad.wmnet - https://phabricator.wikimedia.org/T139362 [18:39:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:45:37] RECOVERY - puppet last run on cp3047 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [18:46:21] (03PS6) 10Ottomata: Update eventlogging kafka consumer args to match with python-confluent-kafka consumer [puppet] - 10https://gerrit.wikimedia.org/r/298000 (https://phabricator.wikimedia.org/T133779) [18:48:38] !log "This is going to hurt me more than it does you."; `nodetool removenode' of restbase1014-c.eqiad.wmnet : T139362 [18:48:39] T139362: High storage utilization on restbase1014.eqiad.wmnet - https://phabricator.wikimedia.org/T139362 [18:48:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:50:26] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [18:50:56] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical 
threshold [1000.0] [18:52:01] !log Attempting to resubmit LocalRenameUserJobs for T137973 [18:52:02] T137973: GlobalRename gets stuck sometimes - https://phabricator.wikimedia.org/T137973 [18:52:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:55:06] RECOVERY - Start and verify pages via webservices on kubernetes on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 4.682 second response time [18:55:36] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:57:07] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:57:07] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:00:47] mobrovac: ^ [19:02:28] PROBLEM - Start and verify pages via webservices on kubernetes on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/webservice/kubernetes - 185 bytes in 11.228 second response time [19:02:33] Dereckson, writing the incident report now. [19:02:46] Dereckson, so, you ran the original scap on tin, but the l10nupdate on terbium? Or both on tin? [19:03:34] both on Tin [19:04:13] output of initial scap is at https://phabricator.wikimedia.org/P3365 [19:04:27] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [19:04:27] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [19:04:47] RECOVERY - Disk space on restbase1014 is OK: DISK OK [19:05:45] 06Operations, 10GlobalRename, 10MediaWiki-extensions-CentralAuth, 13Patch-For-Review, and 2 others: GlobalRename gets stuck sometimes - https://phabricator.wikimedia.org/T137973#2442609 (10Anomie) Oh... I didn't pay close enough attention, the backport was submitted as https://gerrit.wikimedia.org/r/#/c/29... 
[19:07:50] (03PS7) 10Ottomata: Update eventlogging kafka consumer args to match with python-confluent-kafka consumer [puppet] - 10https://gerrit.wikimedia.org/r/298000 (https://phabricator.wikimedia.org/T133779) [19:09:27] RECOVERY - Start and verify pages via webservices on kubernetes on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 5.416 second response time [19:10:10] (03CR) 10Andrew Bogott: [C: 031] labs: break out puppet alert logic to generic notify handler [puppet] - 10https://gerrit.wikimedia.org/r/297996 (owner: 10Rush) [19:10:21] (03Abandoned) 10Ottomata: Initial debianization and release [debs/python-confluent-kafka] - 10https://gerrit.wikimedia.org/r/297985 (owner: 10Ottomata) [19:11:56] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [19:19:07] PROBLEM - Start and verify pages via webservices on kubernetes on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/webservice/kubernetes - 185 bytes in 11.175 second response time [19:26:12] (03PS7) 10Rush: labs: break out puppet alert logic to generic notify handler [puppet] - 10https://gerrit.wikimedia.org/r/297996 [19:28:02] 06Operations: Rotate (nutcracker) logs more frequently on terbium to save disk space - https://phabricator.wikimedia.org/T139786#2442645 (10Mattflaschen-WMF) [19:30:36] RECOVERY - Start and verify pages via webservices on kubernetes on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 3.122 second response time [19:31:48] yuvipanda: ^ k8s things flapping? [19:31:53] 06Operations: Rotate (nutcracker) logs more frequently on terbium to save disk space - https://phabricator.wikimedia.org/T139786#2442665 (10Mattflaschen-WMF) [19:32:37] Dereckson, initial version at https://wikitech.wikimedia.org/wiki/Incident_documentation/20160707-Echo . 
It's not conclusive (that's partly why I thought it was worth writing, so collaboratively we can fully figure it out). Please edit boldly if you have anything to add/fix. [19:37:47] PROBLEM - Start and verify pages via webservices on kubernetes on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/webservice/kubernetes - 185 bytes in 11.121 second response time [19:39:30] (03PS1) 10Dzahn: (WIP) make gerrit compatible with Apache 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/298041 [19:43:29] Dereckson, thanks for editing. Are you going to create a task for the l10nupdate one? If not, I can. [19:44:02] (03PS1) 10Tpt: Deploys the Kartographer extension to Metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298042 (https://phabricator.wikimedia.org/T139787) [19:48:10] Maybe I should write an IRC bot to create tasks at phabricator via a simple IRC-Command, what do you think? :) [19:48:10] (03PS1) 10Chad: Gerrit: Don't use SSH to connect to gsql, we can do it from shell itself [puppet] - 10https://gerrit.wikimedia.org/r/298043 [19:48:16] I can create it, I'm looking at l10nupdate code to see exactly the steps involved. [19:48:19] mutante: Whee ^ [19:48:48] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:49:23] Luke081515: interesting. would it send email? [19:49:34] ostriches: oh, looking.(also i made one for Apache 2.4) [19:49:46] Yeah I saw :) [19:51:07] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:51:54] mutante: I didn't write it yet, but I can imagine something like !task :<project>:<short desc>, and then a bot account at phabricator would create a task, and you can add more data later.
So yes, phabricator would send mail [19:53:27] <icinga-wm> RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [19:55:51] <grrrit-wm> (03PS2) 10Chad: Gerrit: Don't use SSH to connect to gsql, we can do it from shell itself [puppet] - 10https://gerrit.wikimedia.org/r/298043 [19:55:56] <icinga-wm> RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [19:55:59] <mutante> Luke081515: ah.. what i meant was if the bot creates the task by just mailing. because there is task@phabricator already [19:56:17] <urandom> !log Throttle RESTBase Cassandra outgoing streams to 3mbit cluster-wide : T139362 [19:56:18] <stashbot> T139362: High storage utilization on restbase1014.eqiad.wmnet - https://phabricator.wikimedia.org/T139362 [19:56:20] <mutante> Luke081515: which creates a task .. as long as it comes from an existing user [19:56:22] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:57:11] <matt_flaschen> Thanks, Dereckson [19:57:16] <mutante> Luke081515: it does sound nice to have.. but have to be careful about spam [19:57:20] <Luke081515> mutante: I think I would allow people with a cloak to use that function, and that bot would quote the IRC-Nick, I think [19:57:25] <Luke081515> so we can prevent spam [19:57:54] <Luke081515> (just wikimedia (and other WMF) cloaks), so no real spam, I think [19:58:18] <Luke081515> I think that would solve the "can you create a task" question at IRC :D [19:58:42] <mutante> Luke081515: yes, i thought about something like that, but that user auth part is making it more complex than a simple bot [19:59:19] <mutante> maybe an existing bot that already does that part. and has flags and all that .. like eggdrop [19:59:36] <mutante> then i thought one day i'm gonna puppetize eggdrop, lol [19:59:42] <Luke081515> mutante: I think the hardest point is the phabricator API, I have actually no experience there.
checking a cloak at IRC is easy [19:59:45] <mutante> to replace a few bots [20:01:56] <grrrit-wm> (03CR) 10Rush: [C: 032] labs: break out puppet alert logic to generic notify handler [puppet] - 10https://gerrit.wikimedia.org/r/297996 (owner: 10Rush) [20:08:27] <icinga-wm> RECOVERY - Start and verify pages via webservices on kubernetes on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 6.456 second response time [20:08:38] <Dereckson> matt_flaschen: so the quickest solution yesterday would have probably been to : 1. notice strings were available in PHP, but not in JS, and so it's a RL specific issue 2. skip the cache rebuild and refresh it directly with `foreachwiki extensions/WikimediaMaintenance/refreshMessageBlobs.php` [20:09:43] <wikibugs> 06Operations, 10Cassandra, 06Services, 13Patch-For-Review: High storage utilization on restbase1014.eqiad.wmnet - https://phabricator.wikimedia.org/T139362#2442781 (10Eevans) Status Update: The Good News: A `nodetool removenode` for restbase1014-c.eqiad.wmnet has been initiated, and the space consumed by... [20:10:11] <Dereckson> matt_flaschen: this script invalidates message blob cache keys [20:10:25] <matt_flaschen> Dereckson, thanks, that looks like a good solution to the RL problem. [20:10:35] <matt_flaschen> Dereckson, I'm not positive whether #1 was true initially though. At one point, it may have been broken for both PHP and JS. Unfortunately, I don't have records for that. [20:11:02] <matt_flaschen> Later it definitely was. [20:13:26] <Dereckson> 02:29:45 < Krinkle> krinkle@mw1099:~$ mwscript eval.php --wiki enwiki [20:13:29] <Dereckson> 02:29:02 < Dereckson> sync-apaches: 82% (ok: 338; fail: 0; left: 71) [20:13:56] <wikibugs> 06Operations, 10Deployment-Systems, 03Scap3 (Scap3-MediaWiki-MVP): Completely port l10nupdate to scap - https://phabricator.wikimedia.org/T133913#2248683 (10bd808) I'm not sure that there is any massive code/functionality duplication. 
[[https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modul... [20:14:44] <Dereckson> matt_flaschen: indeed, we can't be sure: K.rinkle looked while l10nupdate synced the cache [20:16:02] <wikibugs> 06Operations, 10Deployment-Systems, 03Scap3 (Scap3-MediaWiki-MVP): Completely port l10nupdate to scap - https://phabricator.wikimedia.org/T133913#2248683 (10demon) >>! In T133913#2442793, @bd808 wrote: > With the train running at a weekly cadence the benefit of l10nupdate runs is also arguable. There was a t... [20:16:22] <Dereckson> Another interesting question, with an answer granting us possibility to deploy without a full scap: how to update the l10n cache for one extension, for one wmf version? [20:17:06] <bd808> !log Deleted old l10nupdate caches manually on tin (T130317) [20:17:07] <stashbot> T130317: setup automatic deletion of old l10nupdate - https://phabricator.wikimedia.org/T130317 [20:17:10] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:17:28] <bd808> Dereckson: "build a completely new l10n cache system" [20:19:30] <wikibugs> 06Operations, 10Deployment-Systems, 03Scap3 (Scap3-MediaWiki-MVP): Completely port l10nupdate to scap - https://phabricator.wikimedia.org/T133913#2248683 (10Dereckson) I don't share these opinions. Translations could be a typo, a clumsy formulation. We currently have a working solution to allow the interface... [20:21:39] <wikibugs> 06Operations, 10Deployment-Systems, 03Scap3 (Scap3-MediaWiki-MVP): Completely port l10nupdate to scap - https://phabricator.wikimedia.org/T133913#2442810 (10demon) >>! In T133913#2442804, @Dereckson wrote: > I don't share these opinions. Translations could be a typo, a clumsy formulation. We currently have a... 
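The `!task :<project>:<short desc>` idea discussed by Luke081515 and mutante (accept the command only from cloaked users, and quote the requester's IRC nick) could be sketched roughly as below. This is a minimal illustration, not an existing bot: the function names, the `TaskRequest` type, and the list of trusted cloak prefixes are all assumptions made for the example. The actual task creation through the Phabricator API is deliberately left out, since the log does not describe it.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical list of trusted cloak prefixes; the log only says
# "wikimedia (and other WMF) cloaks", so these values are an assumption.
TRUSTED_CLOAK_PREFIXES = ("wikimedia/", "wikipedia/", "mediawiki/")

# Matches "!task :<project>:<short desc>". The project may not contain a
# colon; the description may.
TASK_RE = re.compile(r"^!task\s+:(?P<project>[^:\s][^:]*):(?P<title>.+)$")


class TaskRequest(NamedTuple):
    project: str
    title: str
    requested_by: str  # IRC nick, to be quoted in the created task


def has_trusted_cloak(hostmask: str) -> bool:
    """Check the host part of a nick!user@host mask against trusted cloaks."""
    host = hostmask.rpartition("@")[2]
    return host.startswith(TRUSTED_CLOAK_PREFIXES)


def parse_task_command(hostmask: str, message: str) -> Optional[TaskRequest]:
    """Return a TaskRequest for a valid command from a cloaked user, else None."""
    if not has_trusted_cloak(hostmask):
        return None  # spam guard: ignore uncloaked users
    match = TASK_RE.match(message.strip())
    if match is None:
        return None
    project = match.group("project").strip()
    title = match.group("title").strip()
    if not project or not title:
        return None
    nick = hostmask.partition("!")[0]
    return TaskRequest(project, title, nick)
```

A real bot would pass the resulting `TaskRequest` to whatever creates the task (a bot account on Phabricator, or mail to the existing task address mutante mentions); that step is not shown here.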
[20:22:36] <icinga-wm> PROBLEM - Start and verify pages via webservices on kubernetes on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/webservice/kubernetes - 185 bytes in 14.153 second response time [20:24:44] <wikibugs> 06Operations, 10Cassandra, 06Services: Remove obsolete metrics - https://phabricator.wikimedia.org/T139792#2442835 (10Eevans) [20:24:53] <wikibugs> 06Operations, 10Deployment-Systems, 03Scap3: setup automatic deletion of old l10nupdate - https://phabricator.wikimedia.org/T130317#2132321 (10bd808) See also {T119747} [20:26:14] <urandom> Throttle RESTBase Cassandra outgoing streams to 1mbit cluster-wide : T139362 [20:26:14] <stashbot> T139362: High storage utilization on restbase1014.eqiad.wmnet - https://phabricator.wikimedia.org/T139362 [20:32:49] <grrrit-wm> (03PS1) 10Paladox: Update gerrit to 2.12.3 [debs/gerrit] - 10https://gerrit.wikimedia.org/r/298047 [20:32:58] <icinga-wm> PROBLEM - puppet last run on wtp2004 is CRITICAL: CRITICAL: puppet fail [20:33:39] <grrrit-wm> (03PS2) 10Dereckson: Deploy the Kartographer extension to meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298042 (https://phabricator.wikimedia.org/T139787) (owner: 10Tpt) [20:34:36] <icinga-wm> RECOVERY - Start and verify pages via webservices on kubernetes on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 5.973 second response time [20:36:13] <grrrit-wm> (03PS2) 10Paladox: Update gerrit to 2.12.3 [debs/gerrit] - 10https://gerrit.wikimedia.org/r/298047 [20:36:19] <grrrit-wm> (03CR) 10Dereckson: Deploy the Kartographer extension to meta (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298042 (https://phabricator.wikimedia.org/T139787) (owner: 10Tpt) [20:36:50] <grrrit-wm> (03PS1) 10Chad: Gerrit: Add config stanzas for new Gerrit [puppet] - 10https://gerrit.wikimedia.org/r/298049 [20:39:53] <grrrit-wm> (03Abandoned) 
10Paladox: Update gerrit to 2.12.3 [debs/gerrit] - 10https://gerrit.wikimedia.org/r/298047 (owner: 10Paladox) [20:46:02] <yuvipanda> chasemp yeah, flapping because it's somehow being interwoven between calls to gridengine webservice check [20:46:03] <yuvipanda> I'm going to silence it until monday [20:47:52] <grrrit-wm> (03PS1) 10Chad: WIP: Gerrit: Remove SSH public key and last user of it [puppet] - 10https://gerrit.wikimedia.org/r/298052 [20:48:05] <grrrit-wm> (03CR) 10Chad: [C: 04-1] WIP: Gerrit: Remove SSH public key and last user of it [puppet] - 10https://gerrit.wikimedia.org/r/298052 (owner: 10Chad) [20:48:33] <grrrit-wm> (03PS2) 10Chad: WIP: Gerrit: Remove SSH public key and last user of it [puppet] - 10https://gerrit.wikimedia.org/r/298052 [20:51:32] <urandom> !log Throttle RESTBase Cassandra outgoing streams to 1mbit cluster-wide : T139362 (actually happened at 21:26) [20:51:33] <stashbot> T139362: High storage utilization on restbase1014.eqiad.wmnet - https://phabricator.wikimedia.org/T139362 [20:51:36] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:59:07] <icinga-wm> RECOVERY - puppet last run on wtp2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:59:50] <urandom> !log Forcing node removal (restbase1014-c.eqiad.wmnet) : T139362 [20:59:51] <stashbot> T139362: High storage utilization on restbase1014.eqiad.wmnet - https://phabricator.wikimedia.org/T139362 [20:59:54] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:01:06] <icinga-wm> PROBLEM - Start and verify pages via webservices on kubernetes on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/webservice/kubernetes - 185 bytes in 16.030 second response time [21:02:23] <grrrit-wm> (03PS1) 10Reedy: SpecialNuke -> Nuke [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/298054 [21:04:28] <urandom> !log Restarting restbase1009-a.eqiad.wmnet to cancel running streams : T139362 [21:04:29] <stashbot> T139362: High storage utilization on restbase1014.eqiad.wmnet - https://phabricator.wikimedia.org/T139362 [21:04:32] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:08:26] <grrrit-wm> (03PS2) 10Reedy: SpecialNuke -> Nuke/extension.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298054 [21:10:59] <urandom> !log Restarting restbase1014-a.eqiad.wmnet to cancel running streams : T139362 [21:11:00] <stashbot> T139362: High storage utilization on restbase1014.eqiad.wmnet - https://phabricator.wikimedia.org/T139362 [21:11:03] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:11:28] <icinga-wm> PROBLEM - Apache HTTP on mw1146 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:12:39] <icinga-wm> PROBLEM - HHVM rendering on mw1146 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:13:59] <icinga-wm> PROBLEM - nutcracker port on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:14:09] <icinga-wm> PROBLEM - Check size of conntrack table on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:14:18] <icinga-wm> PROBLEM - nutcracker process on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:14:28] <icinga-wm> PROBLEM - SSH on mw1146 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:14:39] <icinga-wm> PROBLEM - DPKG on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:14:49] <icinga-wm> PROBLEM - salt-minion processes on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:15:08] <icinga-wm> PROBLEM - Disk space on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:15:18] <icinga-wm> PROBLEM - dhclient process on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:15:19] <icinga-wm> PROBLEM - HHVM processes on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:15:39] <icinga-wm> PROBLEM - configured eth on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:17:18] <urandom> !log Restarting restbase1015-a.eqiad.wmnet to cancel running streams : T139362 [21:17:19] <stashbot> T139362: High storage utilization on restbase1014.eqiad.wmnet - https://phabricator.wikimedia.org/T139362 [21:17:22] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:27:19] <grrrit-wm> (03PS1) 10Reedy: Fix loaded entry point for RestBaseUpdateJobs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298094 [21:28:08] <wikibugs> 06Operations, 10Wikimedia-Apache-configuration, 07HHVM, 07Wikimedia-log-errors: Fix Apache proxy_fcgi error "Invalid argument: AH01075: Error dispatching request to" (Causing HTTP 503) - https://phabricator.wikimedia.org/T73487#2442995 (10elukey) I found a repro (If-Modified needs to be modified accordingl... [21:28:22] <greg-g> :( I don't have permission to make snapshots in grafana? [21:28:45] <greg-g> "You don't have permission to access /api/snapshots on this server." [21:28:49] <grrrit-wm> (03PS1) 10Reedy: Load RestBaseUpdateJobs via wfLoadExtension() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298095 [21:29:18] <grrrit-wm> (03CR) 10Reedy: [C: 04-1] "Depends-On: I0e3a2481d575c5e98844989e70e33b6f69c0e2aa" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298095 (owner: 10Reedy) [21:31:27] <elukey> !log mw1146 powercycled (memory pressure, no ssh/root login) [21:31:31] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:32:04] <Luke081515> greg-g: just make a oldschool screenshot from your PC window? 
:D [21:32:07] <urandom> elukey: wow, you on late [21:32:13] <urandom> s/you/your/ [21:32:31] <elukey> I knooow I have a httpd thing that bugs me too much :P [21:32:57] <urandom> :( [21:33:16] <icinga-wm> RECOVERY - nutcracker port on mw1146 is OK: TCP OK - 0.000 second response time on port 11212 [21:33:18] <icinga-wm> RECOVERY - nutcracker process on mw1146 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [21:33:22] <grrrit-wm> (03PS1) 10Reedy: PoolCounterClient.php -> extension.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298096 [21:33:37] <bd808> greg-g: are you on grafana-admin.wikimedia.org or grafana.wikimedia.org? [21:33:37] <icinga-wm> RECOVERY - DPKG on mw1146 is OK: All packages OK [21:33:50] <greg-g> bd808: grafana, and yeah, forgot to try that :/ [21:33:54] <bd808> I think write operations on work on the former [21:33:56] <icinga-wm> RECOVERY - SSH on mw1146 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.7 (protocol 2.0) [21:34:04] <grrrit-wm> (03PS1) 1020after4: Add conduit_token to the .arcrc on nodepool slaves [puppet] - 10https://gerrit.wikimedia.org/r/298097 [21:34:07] <bd808> *only work [21:34:17] <icinga-wm> RECOVERY - configured eth on mw1146 is OK: OK - interfaces up [21:34:18] <icinga-wm> RECOVERY - salt-minion processes on mw1146 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:34:27] <icinga-wm> RECOVERY - Apache HTTP on mw1146 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.595 second response time [21:34:30] <greg-g> what happened to mw1146? 
[21:34:35] <greg-g> oh [21:34:37] <icinga-wm> RECOVERY - dhclient process on mw1146 is OK: PROCS OK: 0 processes with command name dhclient [21:34:39] <greg-g> I see the !log now [21:34:46] <icinga-wm> RECOVERY - Check size of conntrack table on mw1146 is OK: OK: nf_conntrack is 6 % full [21:34:57] <icinga-wm> RECOVERY - Disk space on mw1146 is OK: DISK OK [21:35:17] <icinga-wm> RECOVERY - HHVM processes on mw1146 is OK: PROCS OK: 6 processes with command name hhvm [21:35:37] <icinga-wm> RECOVERY - HHVM rendering on mw1146 is OK: HTTP OK: HTTP/1.1 200 OK - 71707 bytes in 0.790 second response time [21:35:45] <elukey> greg-g: it was me :) [21:36:00] <elukey> memory pressure again, not reachable [21:37:47] <icinga-wm> RECOVERY - puppet last run on mw1146 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [21:48:05] <urandom> !log Restarting restbase1009-b.eqiad.wmnet to cancel running streams : T139362 [21:48:06] <grrrit-wm> (03CR) 10Reedy: [C: 032] Fix loaded entry point for RestBaseUpdateJobs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298094 (owner: 10Reedy) [21:48:07] <stashbot> T139362: High storage utilization on restbase1014.eqiad.wmnet - https://phabricator.wikimedia.org/T139362 [21:48:10] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:48:45] <grrrit-wm> (03Merged) 10jenkins-bot: Fix loaded entry point for RestBaseUpdateJobs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298094 (owner: 10Reedy) [21:48:49] <grrrit-wm> (03PS2) 10Reedy: else if -> elseif [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296567 [21:48:56] <grrrit-wm> (03CR) 10Reedy: [C: 032] else if -> elseif [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296567 (owner: 10Reedy) [21:49:36] <grrrit-wm> (03Merged) 10jenkins-bot: else if -> elseif [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296567 (owner: 10Reedy) [21:50:38] <logmsgbot> !log reedy@tin Synchronized 
multiversion/MWWikiversions.php: Remove some else if spaces (duration: 00m 46s) [21:50:41] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:51:07] <icinga-wm> PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:51:49] <logmsgbot> !log reedy@tin Synchronized wmf-config/CommonSettings.php: Use Canonical entry point for RestBaseUpdateJobs (duration: 00m 33s) [21:51:53] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:52:05] <legoktm> jouncebot: next [21:52:05] <jouncebot> In 65 hour(s) and 7 minute(s): Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160711T1500) [21:52:45] <legoktm> I'm gonna deploy the patch to hopefully unbreak global rename right now (cc: greg-g) [21:53:01] <greg-g> task? [21:55:24] <wikibugs> 06Operations: Rotate (nutcracker) logs more frequently on terbium to save disk space - https://phabricator.wikimedia.org/T139786#2442645 (10hashar) terbium has a 500G disk with LVM so `/` can be extended: ``` $ lsblk NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT sda... [21:55:46] <icinga-wm> RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [21:57:38] <legoktm> greg-g: the giant one: https://phabricator.wikimedia.org/T137973 patch is https://gerrit.wikimedia.org/r/297941 [21:57:50] <grrrit-wm> (03PS1) 10Reedy: Update entry point for RestBaseUpdateJobs in extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298099 [21:58:07] <James_F> Reedy: Oh dear. When did that break? [21:58:13] <Reedy> Few minutes ago [21:58:18] <James_F> Ick. 
[21:58:20] <Reedy> When legoktm merged my patch [21:58:26] <grrrit-wm> (03CR) 10Reedy: [C: 032] Update entry point for RestBaseUpdateJobs in extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298099 (owner: 10Reedy) [21:58:58] <grrrit-wm> (03Merged) 10jenkins-bot: Update entry point for RestBaseUpdateJobs in extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298099 (owner: 10Reedy) [21:59:28] <greg-g> legoktm: kk [21:59:55] <logmsgbot> !log reedy@tin Synchronized wmf-config/extension-list: Fix RestBaseUpdateJobs in extension-list (duration: 00m 36s) [21:59:59] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:00:10] <James_F> Reedy: But https://gerrit.wikimedia.org/r/#/c/298093/ isn't in a cut version yet? [22:00:25] <Reedy> James_F: Indeed [22:00:25] <grrrit-wm> (03PS2) 10Reedy: Load RestBaseUpdateJobs via wfLoadExtension() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298095 [22:00:32] <Reedy> ^ Which is that patch, which I -1'd [22:00:51] <James_F> Yeah, but I was mostly asking about https://gerrit.wikimedia.org/r/#/c/298094/1/wmf-config/CommonSettings.php [22:00:52] <legoktm> Reedy: I'm gonna deploy a CA patch now [22:01:10] <wikibugs> 06Operations, 06Commons, 10media-storage, 07User-notice: Some fonts not anti-aliasing in SVG thumbnails after upgrade of scaling servers - https://phabricator.wikimedia.org/T139543#2443055 (10MoritzMuehlenhoff) Hmm, for some reason Debian jessie (frozen in November 2014) has an older version of Ghostscript... [22:01:16] <grrrit-wm> (03CR) 10Reedy: [C: 04-1] Load RestBaseUpdateJobs via wfLoadExtension() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298095 (owner: 10Reedy) [22:01:41] <Reedy> James_F: And magically, because the file has been in that repo for a long time to fix jenkins... 
It still works fine on the wikis :) [22:01:54] <Reedy> RestBaseUpdateJobs.php [22:01:55] <Reedy> * Entry point with same name as extension so jenkins can [22:01:55] <Reedy> * load it [22:01:55] <Reedy> */ [22:01:56] <Reedy> require_once __DIR__ . '/RestbaseUpdate.php'; [22:02:09] <James_F> Ha. [22:02:17] <Reedy> Someone fixed it one place [22:02:17] * James_F sighs at hacks upon hacks. ;-) [22:02:21] <Reedy> But not the MANY other places [22:02:27] <Reedy> so, I was untangling them [22:02:49] <logmsgbot> !log legoktm@tin Synchronized php-1.28.0-wmf.9/extensions/CentralAuth/: Fix job serializing (and status display on Special:GlobalRenameProgress) - T137973 (duration: 00m 32s) [22:02:50] <stashbot> T137973: GlobalRename gets stuck sometimes - https://phabricator.wikimedia.org/T137973 [22:02:54] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:04:05] <Reedy> James_F: But yeah. Getting burned trying to clean it up :D [22:04:14] <Reedy> Luckily, it only broke beta for a few minutes [22:04:31] <Reedy> beta scap [22:04:34] <Reedy> so even narrower scope [22:04:36] * James_F nods. [22:05:27] <grrrit-wm> (03PS3) 10Dzahn: Gerrit: Don't use SSH to connect to gsql, we can do it from shell itself [puppet] - 10https://gerrit.wikimedia.org/r/298043 (owner: 10Chad) [22:07:45] <grrrit-wm> (03CR) 10Dzahn: [C: 032] "query works when running it manually like that (minus escaping)" [puppet] - 10https://gerrit.wikimedia.org/r/298043 (owner: 10Chad) [22:09:04] <wikibugs> 06Operations, 10Wikimedia-Apache-configuration, 07HHVM, 07Wikimedia-log-errors: Fix Apache proxy_fcgi error "Invalid argument: AH01075: Error dispatching request to" (Causing HTTP 503) - https://phabricator.wikimedia.org/T73487#2443063 (10elukey) Interesting fact: I tried the same curl request on mw2244 (t... [22:10:46] <grrrit-wm> (03CR) 10Hashar: "The Nodepool images do not use any secret() or passwords since really anyone could read it. 
Just to generate the image we would need to ha" [puppet] - 10https://gerrit.wikimedia.org/r/298097 (owner: 1020after4) [22:13:16] <grrrit-wm> (03PS4) 10Krinkle: Remove unused file 'docroot/foundation/index.html' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/297909 [22:13:19] <grrrit-wm> (03PS3) 10Reedy: Load RestBaseUpdateJobs via wfLoadExtension() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298095 (https://phabricator.wikimedia.org/T139800) [22:13:42] <urandom> !log Restarting restbase1014-b.eqiad.wmnet to cancel running streams : T139362 [22:13:43] <stashbot> T139362: High storage utilization on restbase1014.eqiad.wmnet - https://phabricator.wikimedia.org/T139362 [22:13:46] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:13:52] <grrrit-wm> (03PS2) 10Reedy: PoolCounterClient.php -> extension.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298096 (https://phabricator.wikimedia.org/T139800) [22:14:09] <legoktm> Reedy: https://phabricator.wikimedia.org/T119117 [22:14:26] <Reedy> That too :D [22:15:10] <grrrit-wm> (03CR) 1020after4: "@hashar: diskimage-builder runs on a labs instance, not production?" [puppet] - 10https://gerrit.wikimedia.org/r/298097 (owner: 1020after4) [22:15:37] <grrrit-wm> (03CR) 1020after4: "The jenkins phabricator plugin doesn't expose the secret anywhere that I can find it." [puppet] - 10https://gerrit.wikimedia.org/r/298097 (owner: 1020after4) [22:17:17] <icinga-wm> PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:21:17] <icinga-wm> PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[22:21:47] <icinga-wm> RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [22:23:35] <wikibugs> 06Operations, 10Wikimedia-Apache-configuration, 07HHVM, 07Wikimedia-log-errors: Fix Apache proxy_fcgi error "Invalid argument: AH01075: Error dispatching request to" (Causing HTTP 503) - https://phabricator.wikimedia.org/T73487#2443135 (10hashar) So you at least have a proof `AH01070: Error parsing script... [22:25:47] <icinga-wm> RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [22:34:45] <Amir1> legoktm: hey, since it's backported: https://phabricator.wikimedia.org/T137973#2443058 can we test? ;) [22:35:03] <legoktm> Amir1: yes! I've been looking for a global renamer for like half an hour now :P [22:35:14] <Amir1> :))) [22:35:24] <legoktm> Amir1: please start 2 renames :) [22:35:25] <grrrit-wm> (03PS1) 10Chad: Gerrit: Require specifing IPv4 and IPv6 addresses to role [puppet] - 10https://gerrit.wikimedia.org/r/298101 [22:35:28] <Amir1> sure [22:35:32] <ostriches> mutante: ^^^ [22:37:03] <grrrit-wm> (03PS1) 10Chad: WIP: Gerrit: Add lead's hiera overrides we need [puppet] - 10https://gerrit.wikimedia.org/r/298102 [22:37:34] <Amir1> https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress [22:37:39] <Amir1> legoktm: ^ [22:38:09] <Amir1> I've got big ones [22:42:45] <Amir1> legoktm: both of them are done [22:43:03] <legoktm> Amir1: \o/ 3-4 more? :) [22:43:13] <Amir1> yeaaah [22:44:08] <icinga-wm> RECOVERY - Start and verify pages via webservices on kubernetes on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 5.538 second response time [22:45:25] <grrrit-wm> (03PS2) 10Chad: Gerrit: Require specifing IPv4 and IPv6 addresses to role [puppet] - 10https://gerrit.wikimedia.org/r/298101 [22:46:57] <icinga-wm> PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:48:17] <Amir1> legoktm: 4 are done. 
I have a bonus https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress/Level91 [22:49:16] <icinga-wm> RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [22:51:06] <icinga-wm> PROBLEM - Start and verify pages via webservices on kubernetes on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/webservice/kubernetes - 185 bytes in 11.245 second response time [22:52:00] <legoktm> Amir1: submit more? [22:52:16] <Amir1> yeah [22:57:08] <grrrit-wm> (03PS1) 10Dzahn: add gerrit-new records, point to lead [dns] - 10https://gerrit.wikimedia.org/r/298106 (https://phabricator.wikimedia.org/T125018) [22:57:46] <grrrit-wm> (03CR) 10Paladox: ":)" [dns] - 10https://gerrit.wikimedia.org/r/298106 (https://phabricator.wikimedia.org/T125018) (owner: 10Dzahn) [22:58:37] <grrrit-wm> (03PS2) 10Dzahn: add gerrit-new records, point to lead [dns] - 10https://gerrit.wikimedia.org/r/298106 (https://phabricator.wikimedia.org/T125018) [22:59:07] <grrrit-wm> (03PS3) 10Dzahn: add gerrit-new records, point to lead [dns] - 10https://gerrit.wikimedia.org/r/298106 (https://phabricator.wikimedia.org/T125018) [22:59:35] <urandom> !log Restarting restbase1015-b.eqiad.wmnet to cancel running streams : T139362 [22:59:36] <stashbot> T139362: High storage utilization on restbase1014.eqiad.wmnet - https://phabricator.wikimedia.org/T139362 [22:59:39] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:03:01] <grrrit-wm> (03PS1) 10Dzahn: gerrit: add missing reverse DNS for IPv6 [dns] - 10https://gerrit.wikimedia.org/r/298108 [23:04:29] <grrrit-wm> (03PS2) 10Dzahn: gerrit: add missing reverse DNS for IPv6 [dns] - 10https://gerrit.wikimedia.org/r/298108 [23:05:13] <Amir1> legoktm: I lost count of how many requests I got accepted (all of them big) and no issues so far [23:05:17] <Amir1> all done now [23:05:31] <Amir1> Level9 was the 
biggest [23:05:35] <Amir1> 137 wikis [23:06:04] <Amir1> I can go on [23:06:09] <Amir1> it's fun [23:06:38] <grrrit-wm> (03CR) 10Paladox: [C: 031] gerrit: add missing reverse DNS for IPv6 [dns] - 10https://gerrit.wikimedia.org/r/298108 (owner: 10Dzahn) [23:11:49] <wikibugs> 06Operations, 10Datasets-General-or-Unknown, 10netops, 07TestMe: dumps.wikimedia.org seems to have poor networking towards Telia - https://phabricator.wikimedia.org/T120425#2443295 (10Danny_B) Still an issue? [23:12:46] <grrrit-wm> (03CR) 10Chad: [C: 031] "No changes! https://puppet-compiler.wmflabs.org/3297/" [puppet] - 10https://gerrit.wikimedia.org/r/298101 (owner: 10Chad) [23:13:46] <icinga-wm> RECOVERY - Start and verify pages via webservices on kubernetes on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.725 second response time [23:15:34] <legoktm> Amir1: let's stop for now? I'm fairly confident, but I don't want anything breaking over the weekend [23:15:39] <legoktm> emailing the renamers now [23:16:02] <Amir1> legoktm: sure [23:16:03] <Amir1> :) [23:16:23] <Amir1> thanks for backporting and thanks anomie for fixing [23:17:05] <Amir1> legoktm: also, do you have time to review a patch in core? 
[23:17:05] <Amir1> https://gerrit.wikimedia.org/r/#/c/297959/ [23:17:15] <Amir1> I have tested it and it worked fine [23:17:46] <grrrit-wm> (03PS1) 10Dzahn: gerrit: move service IP (v4 and v6) into hiera per host [puppet] - 10https://gerrit.wikimedia.org/r/298110 (https://phabricator.wikimedia.org/T125018) [23:18:24] <grrrit-wm> (03PS2) 10Dzahn: gerrit: move service IP (v4 and v6) into hiera per host [puppet] - 10https://gerrit.wikimedia.org/r/298110 (https://phabricator.wikimedia.org/T125018) [23:18:47] <mutante> arg [23:19:48] <grrrit-wm> (03CR) 10jenkins-bot: [V: 04-1] gerrit: move service IP (v4 and v6) into hiera per host [puppet] - 10https://gerrit.wikimedia.org/r/298110 (https://phabricator.wikimedia.org/T125018) (owner: 10Dzahn) [23:20:46] <icinga-wm> PROBLEM - Start and verify pages via webservices on kubernetes on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/webservice/kubernetes - 185 bytes in 11.067 second response time [23:21:01] <grrrit-wm> (03PS3) 10Dzahn: gerrit: move service IP (v4 and v6) into hiera per host [puppet] - 10https://gerrit.wikimedia.org/r/298110 (https://phabricator.wikimedia.org/T125018) [23:21:55] <mutante> ostriches: we both did the same thing.. kind of . i just added lead too. 
but yours works :) [23:22:07] <grrrit-wm> (03CR) 10Dzahn: [C: 032] Gerrit: Require specifing IPv4 and IPv6 addresses to role [puppet] - 10https://gerrit.wikimedia.org/r/298101 (owner: 10Chad) [23:22:21] <mutante> i will rebase and use mine to add the one for lead .85 [23:22:45] <ostriches> Yours is nicer [23:22:55] <ostriches> Heh, it's all good [23:24:40] <grrrit-wm> (03CR) 10jenkins-bot: [V: 04-1] gerrit: move service IP (v4 and v6) into hiera per host [puppet] - 10https://gerrit.wikimedia.org/r/298110 (https://phabricator.wikimedia.org/T125018) (owner: 10Dzahn) [23:32:49] <Amir1> !log manually restarting uwsgi-ores on scb1002 [23:32:54] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:33:02] <Amir1> done [23:34:00] <Amir1> everything works as expected [23:37:33] <grrrit-wm> (03PS4) 10Dzahn: gerrit: add service IP for gerrit-new (lead) [puppet] - 10https://gerrit.wikimedia.org/r/298110 (https://phabricator.wikimedia.org/T125018) [23:38:50] <grrrit-wm> (03PS5) 10Dzahn: gerrit: add service IP for gerrit-new (lead) [puppet] - 10https://gerrit.wikimedia.org/r/298110 (https://phabricator.wikimedia.org/T125018) [23:40:29] <grrrit-wm> (03CR) 10Dzahn: [C: 032] gerrit: add service IP for gerrit-new (lead) [puppet] - 10https://gerrit.wikimedia.org/r/298110 (https://phabricator.wikimedia.org/T125018) (owner: 10Dzahn) [23:41:58] <wikibugs> 06Operations, 10ops-codfw, 10hardware-requests: codfw: add all spare network switches to hardware spares tracking - https://phabricator.wikimedia.org/T139776#2443352 (10Peachey88) [23:41:59] <wikibugs> 06Operations, 10hardware-requests: Find and rack 2 EX4200s in rack c1-eqiad - https://phabricator.wikimedia.org/T139752#2443351 (10Peachey88) [23:43:16] <grrrit-wm> (03CR) 10Dzahn: [C: 032] gerrit: add missing reverse DNS for IPv6 [dns] - 10https://gerrit.wikimedia.org/r/298108 (owner: 10Dzahn) [23:44:40] <wikibugs> 06Operations, 10Deployment-Systems, 03Scap3 (Scap3-MediaWiki-MVP): Completely 
port l10nupdate to scap - https://phabricator.wikimedia.org/T133913#2443360 (10mmodell) Full scap of a new branch is ~30 minutes. Updating localization without syncing a new branch is probably less than 15 minutes and could easily... [23:44:43] <wikibugs> 06Operations, 10Traffic, 06Community-Liaisons (Jul-Sep-2016): Help contact bot owners about the end of HTTP access to the API - https://phabricator.wikimedia.org/T136674#2443362 (10MaxSem) [23:44:47] <wikibugs> 06Operations, 10Traffic, 07HTTPS, 05MW-1.27-release-notes, 13Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#2443361 (10MaxSem) [23:44:57] <grrrit-wm> (03CR) 10Krinkle: [C: 032] Remove unused file 'docroot/foundation/index.html' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/297909 (owner: 10Krinkle) [23:45:33] <grrrit-wm> (03Merged) 10jenkins-bot: Remove unused file 'docroot/foundation/index.html' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/297909 (owner: 10Krinkle) [23:45:39] <grrrit-wm> (03CR) 10Dzahn: "[radon:~] $ host gerrit.wikimedia.org" [dns] - 10https://gerrit.wikimedia.org/r/298108 (owner: 10Dzahn) [23:46:53] <grrrit-wm> (03PS4) 10Dzahn: add gerrit-new records, point to lead [dns] - 10https://gerrit.wikimedia.org/r/298106 (https://phabricator.wikimedia.org/T125018) [23:47:52] <grrrit-wm> (03CR) 10Dzahn: [C: 032] "these will be added to the interface on the server by puppet via the role" [dns] - 10https://gerrit.wikimedia.org/r/298106 (https://phabricator.wikimedia.org/T125018) (owner: 10Dzahn) [23:48:56] <logmsgbot> !log krinkle@tin Synchronized docroot/foundation/: Remove unused docroot/foundation/index.html (duration: 00m 30s) [23:49:00] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:50:54] <grrrit-wm> (03CR) 10Dzahn: ";; ANSWER SECTION:" [dns] - 10https://gerrit.wikimedia.org/r/298106 (https://phabricator.wikimedia.org/T125018) (owner: 10Dzahn) [23:52:12] <grrrit-wm> (03PS2) 10Dzahn: Gerrit: 
Add config stanzas for new Gerrit [puppet] - 10https://gerrit.wikimedia.org/r/298049 (owner: 10Chad) [23:52:32] <Amir1> !log manually restarting uwsgi-ores in scb1001 [23:52:35] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:53:08] <mutante> gerrit config change incoming .. which will restart it .. but you can stop me [23:53:22] <mutante> if you are in the middle of something.. can totally wait [23:53:29] <mutante> just feels there is never a good time anyways [23:54:04] <Amir1> mutante: hey, are you talking to me? [23:55:07] <mutante> to anyone who reads it and is using gerrit :) [23:55:51] <Amir1> oh, it's okay. I was restarting a service in scb I thought you imply that [23:56:04] <Amir1> thanks [23:56:22] <mutante> no, not really. but hi! [23:56:50] <grrrit-wm> (03PS1) 10BryanDavis: Fix de_dot to process keys with falsey values [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/298115 (https://phabricator.wikimedia.org/T136001) [23:56:52] <grrrit-wm> (03CR) 10Dzahn: [C: 032] Gerrit: Add config stanzas for new Gerrit [puppet] - 10https://gerrit.wikimedia.org/r/298049 (owner: 10Chad) [23:57:02] <Amir1> :) [23:57:36] <mutante> !log behold, gerrit might restart now for config change [23:57:40] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:58:12] <wikibugs> 06Operations, 06Commons, 10media-storage, 07User-notice: Some fonts not anti-aliasing in SVG thumbnails after upgrade of scaling servers - https://phabricator.wikimedia.org/T139543#2443397 (10kaldari) @MoritzMuehlenhoff: That would be good to try. The sooner the better.