[00:00:28] ebernhardson: greg-g bearND Deskana i'm switching connections. brb [00:01:39] ebernhardson: greg-g bearND Deskana i'm back [00:02:17] dr0ptp4kt: nothing new while you were gone :) [00:03:14] greg-g: thx [00:04:11] still waiting on the code to make it to deployment-mediawiki01 so i can test again [00:04:15] any moment now :) [00:05:36] I'm waiting for my last few things to clear Jenkins [00:06:08] you'll get there before me, i'm also waiting on jenkins and then i have to make 2 core bumps [00:06:40] 6operations, 7Performance: Optimize prod's resource domains for SPDY/HTTP2 - https://phabricator.wikimedia.org/T94896#1176439 (10BBlack) I've been looking forward to folding bits back into text for a while now anyways. On many levels, it's an appropriate move at this point in time regardless of SPDY, IMHO. I... [00:11:52] ebernhardson: can I add a core patch to your queue? [00:12:23] patches are ready: https://gerrit.wikimedia.org/r/#/c/201442/ (1.25wmf24) and https://gerrit.wikimedia.org/r/#/c/201443/ (1.24wmf25) [00:12:56] ori: yea looks like i can ship those after [00:13:05] sweet, thanks [00:16:51] 00:12:19 1) Flow\Tests\Api\ApiFlowLockTopicTest::testLockTopic [00:16:53] 00:12:19 Argument 4 passed to SimpleCaptcha::shouldCheck() must implement interface IContextSource, bool given [00:16:54] ebernhardson: Is that what you're fixing? [00:17:04] RoanKattouw: yes [00:17:15] RoanKattouw: well, reverse [00:17:17] OK [00:17:33] Cause Jenkins is getting that while trying to process https://gerrit.wikimedia.org/r/#/c/201625/ [00:17:33] RoanKattouw: i'm reverting the code in both ConfirmEdit and Flow that add that parameter [00:17:43] OK [00:18:03] ebernhardson: But does that affect wmf24 only? [00:18:09] RECOVERY - puppet last run on amssq49 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:18:18] If so, then I'll merge Ori's wmf23 patch and do the wmf23 sync [00:18:39] RoanKattouw: reverting from 23 as well [00:18:45] Oh OK [00:19:00] I ask because Jenkins merged my 23 patch but not the 24 one [00:19:04] I'll hold off then, until you're done [00:19:06] i guess the mobile apps might not talk to mediawiki.org? because noone noticed account creation was broken the week 23 was there [00:19:13] (or not commonly anyways) [00:19:36] (03CR) 10Dzahn: [C: 04-2] "@John i have added these to hiera now: https://gerrit.wikimedia.org/r/#/c/201404/" [puppet] - 10https://gerrit.wikimedia.org/r/189196 (owner: 10John F. Lewis) [00:20:54] ebernhardson: right, they use .m.wikipedia.org [00:25:31] (03CR) 10Dzahn: "needs SWAT?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187278 (https://phabricator.wikimedia.org/T369) (owner: 10Spage) [00:27:56] (03PS5) 10Dzahn: Ignore some warnings about case statements without default matches [puppet] - 10https://gerrit.wikimedia.org/r/197759 (https://phabricator.wikimedia.org/T87132) (owner: 10Tim Landscheidt) [00:29:18] (03PS6) 10Dzahn: Ignore some warnings about case statements without default matches [puppet] - 10https://gerrit.wikimedia.org/r/197759 (https://phabricator.wikimedia.org/T87132) (owner: 10Tim Landscheidt) [00:29:54] (03CR) 10Dzahn: [C: 032] Ignore some warnings about case statements without default matches [puppet] - 10https://gerrit.wikimedia.org/r/197759 (https://phabricator.wikimedia.org/T87132) (owner: 10Tim Landscheidt) [00:32:34] (03CR) 10Yuvipanda: "This would fail, no? Because of the labsdebrepo being a define, and misc::labsdebrepo still being needed?" 
[puppet] - 10https://gerrit.wikimedia.org/r/194796 (owner: 10Dzahn) [00:34:11] (03CR) 10Dzahn: [C: 032] Add txt2yaml [software] - 10https://gerrit.wikimedia.org/r/191846 (owner: 10Tim Landscheidt) [00:35:32] RoanKattouw: haven't deployed yet (~5min more for bumps to get through jenkins), but tests should pass now as all the repo branches are in line [00:35:44] Awesome [00:35:53] longest. swat. ever. [00:35:59] I can't merge things anyway until your things come through, so I'll wait for you [00:36:06] (not you guys' fault, just a comedy) [00:36:45] maybe google has the right idea having 1 giant repo with everything :P half of this annoyance was just having tests in 2 repositories that fail unless patches in both repositories are merged [00:37:01] (I really want to turn off the phpunit-zend jobs pre-merge) [00:38:07] phpunit-zend is soooo sloooow [00:38:25] After this is done I also need to do an OOUI release [00:38:40] Which involves breaking all MW core tests for ~10 mins [00:38:42] :S [00:38:46] woo :) [00:39:28] ebernhardson: Is it OK if I take over this deployment from here? [00:39:44] I'll merge Ori's patches and then sync everything [00:39:48] RoanKattouw: sure if you want, i also promised ori i would push out patches too [00:39:59] works for me, thanks guys [00:40:28] * Deskana is ready to test the apps [00:40:32] Say when! [00:40:38] Deskana: it's not quite out yet :P but now roan will tell you :) [00:40:51] zuul says "ETA: 0 min" [00:40:57] https://integration.wikimedia.org/ci/job/mediawiki-phpunit-zend/4410/console [00:40:57] it usually says that for 3 or 5 minutes :P [00:41:07] Deskana: I'm still lurking as well [00:41:09] yay! [00:41:48] ebernhardson: the last 10% (eg: 00:40:05 ............................................................. 8723 / 9645 ( 90%)) takes longer than jenkins/zuul/whoever thinks [00:42:37] OK I'm gonna do some syncs now [00:42:40] And the rest later [00:42:42] It's like a metaphor for software development [00:43:05] Because Jon's thing is bad and he needs to leave soon [00:43:06] Deskana: keeps ya honest [00:43:44] well, he's gone now anyways :) [00:44:13] My thing is bad too, the apps are broken :( [00:44:24] right [00:44:30] s'all bad tonight [00:44:50] Deskana: Which one is yours? [00:44:51] !log catrope Synchronized php-1.25wmf23/extensions/Gather: SWAT (duration: 00m 11s) [00:44:51] i weep [00:44:55] Logged the message, Master [00:44:59] * Deskana hopes the angry people on Google Play don't notice I'm copy-pasting the same message to all the other angry people on Google Play [00:45:05] RoanKattouw: The thing that ebernhardson was working on. [00:45:11] Sweet, OK, https://en.wikipedia.org/wiki/Special:Preferences is now styled again [00:45:11] greg-g: ya, jon just left the office [00:45:16] No he went to the restroom [00:45:26] Anyways, his thing works [00:46:24] I'm going to go now, actually, I'll assume y'all got it from here [00:46:41] greg-g: Thank you.
:-) [00:46:44] thanks all [00:47:07] !log catrope Synchronized php-1.25wmf23/extensions/ConfirmEdit: SWAT (duration: 00m 11s) [00:47:10] OK here go the ConfirmEdit/Flow ones [00:47:15] Logged the message, Master [00:47:18] Those should fix the apps issue AIUI [00:47:20] !log catrope Synchronized php-1.25wmf23/extensions/Flow: SWAT (duration: 00m 12s) [00:47:23] Logged the message, Master [00:47:24] * Deskana checks [00:47:33] !log catrope Synchronized php-1.25wmf24/extensions/ConfirmEdit: SWAT (duration: 00m 13s) [00:47:36] Logged the message, Master [00:47:47] !log catrope Synchronized php-1.25wmf24/extensions/Flow: SWAT (duration: 00m 14s) [00:47:50] Logged the message, Master [00:47:58] FIXED [00:48:06] Sweet [00:48:12] ebernhardson, RoanKattouw: Thank you! [00:48:55] !log catrope Synchronized php-1.25wmf24/extensions/VisualEditor: SWAT (duration: 00m 12s) [00:48:56] ebernhardson did all the hard work :) [00:48:58] Logged the message, Master [00:49:05] superm401: And that's your VE parsefragment change there ---^^ [00:49:29] RoanKattouw, thanks. [00:49:58] I'm still waiting for Jenkins for the wmf24 version of the Gather fix, and Ori's patches [00:50:43] Deskana: worked here for me on dev as well. thx ebernhardson RoanKattouw greg-g bearND and others. my captcha this time was 'yurilight' [00:50:52] (03PS6) 10BryanDavis: [WIP] Add role::mediawiki_vagrant_lxc [puppet] - 10https://gerrit.wikimedia.org/r/193665 (https://phabricator.wikimedia.org/T90892) [00:51:08] thx bd808! [00:51:38] PROBLEM - HHVM rendering on mw1120 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:52:00] Sweet. Glad we got it sorted out. Thanks for the revert work ebernhardson [00:52:11] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Add role::mediawiki_vagrant_lxc [puppet] - 10https://gerrit.wikimedia.org/r/193665 (https://phabricator.wikimedia.org/T90892) (owner: 10BryanDavis) [00:52:17] PROBLEM - Apache HTTP on mw1120 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:52:19] yeah! Worked for me as well! Thank you Thanks you Thank you!!!! [00:52:32] Thanks to all, indeed. [00:52:49] sweet. worked here too (captcha: toddywiper). [00:53:03] thanks all [00:53:37] Mine was "pingtara" [00:53:38] tara: Thanks [00:53:45] hahaha [00:53:58] PROBLEM - HHVM busy threads on mw1120 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [86.4] [00:54:16] all right, catch you all on the flipside. 
[00:54:47] PROBLEM - HHVM queue size on mw1120 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [80.0] [00:59:57] (03CR) 10BryanDavis: [WIP] Add role::mediawiki_vagrant_lxc (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/193665 (https://phabricator.wikimedia.org/T90892) (owner: 10BryanDavis) [01:00:00] mw1120 is struggling, i'm going to restart hhvm [01:00:08] (03PS7) 10BryanDavis: [WIP] Add role::mediawiki_vagrant_lxc [puppet] - 10https://gerrit.wikimedia.org/r/193665 (https://phabricator.wikimedia.org/T90892) [01:00:24] !log restart HHVM on mw1120 [01:00:29] Logged the message, Master [01:00:37] RECOVERY - Apache HTTP on mw1120 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.785 second response time [01:00:59] !log catrope Synchronized php-1.25wmf24/extensions/Gather: SWAT (duration: 00m 13s) [01:01:00] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Add role::mediawiki_vagrant_lxc [puppet] - 10https://gerrit.wikimedia.org/r/193665 (https://phabricator.wikimedia.org/T90892) (owner: 10BryanDavis) [01:01:03] Logged the message, Master [01:01:49] RECOVERY - HHVM rendering on mw1120 is OK: HTTP OK: HTTP/1.1 200 OK - 70589 bytes in 1.289 second response time [01:03:37] 6operations: Enable the usage of `hhvm -m debug --debug-host ::1` from mw1017 so developers can step through code (think gdb) in production to see what is going wrong. - https://phabricator.wikimedia.org/T94951#1176625 (10EBernhardson) 3NEW [01:03:48] !log catrope Synchronized php-1.25wmf24/autoload.php: (no message) (duration: 00m 11s) [01:03:52] Logged the message, Master [01:04:31] !log catrope Synchronized php-1.25wmf24/includes: SWAT (duration: 00m 15s) [01:04:34] Logged the message, Master [01:06:54] !log catrope Synchronized php-1.25wmf23/autoload.php: SWAT (duration: 00m 12s) [01:06:58] Logged the message, Master [01:08:18] RECOVERY - HHVM queue size on mw1120 is OK: OK: Less than 30.00% above the threshold [10.0] [01:08:51] !log catrope Synchronized php-1.25wmf23/includes: SWAT (duration: 00m 15s) [01:08:57] Logged the message, Master [01:08:59] ori: Your patches just went out [01:09:00] SWAT DONE [01:09:03] Well that only took two hours :S [01:09:17] RECOVERY - HHVM busy threads on mw1120 is OK: OK: Less than 30.00% above the threshold [57.6] [01:10:07] (03PS8) 10BryanDavis: [WIP] Add role::mediawiki_vagrant_lxc [puppet] - 10https://gerrit.wikimedia.org/r/193665 (https://phabricator.wikimedia.org/T90892) [01:11:01] !log catrope Synchronized php-1.25wmf23/extensions/ContentTranslation/modules/campaigns/ext.cx.campaigns.contributionsmenu.js: touch (duration: 00m 12s) [01:11:05] Logged the message, Master [01:11:15] !log catrope Synchronized php-1.25wmf24/extensions/ContentTranslation/modules/campaigns/ext.cx.campaigns.contributionsmenu.js: touch (duration: 00m 13s) [01:11:18] Logged the message, Master [01:16:02] ebernhardson: So that ConfirmEdit thing you fixed in prod, is that fixed in master? 
[01:16:06] ebernhardson: Because I'm seeing errors in beta [01:16:36] [aab71a07] /w/index.php?title=Main_Page&action=edit BadMethodCallException from line 315 of /srv/mediawiki/php-master/extensions/ConfirmEdit/Captcha.php: Call to a member function getRequest() on a non-object (NULL) [01:16:43] When trying to edit any page [01:17:58] (03PS9) 10BryanDavis: [WIP] Add role::mediawiki_vagrant_lxc [puppet] - 10https://gerrit.wikimedia.org/r/193665 (https://phabricator.wikimedia.org/T90892) [01:18:52] ebernhardson: Aaaahm [01:18:56] ebernhardson: Your revert is fatally broken [01:19:04] ebernhardson: You left a usage of $context in [01:19:35] ebernhardson: Broken in wmf24 and master but seemingly not in 23 [01:20:47] (03PS1) 10Thcipriani: Merge parsoid beta and production roles [puppet] - 10https://gerrit.wikimedia.org/r/201636 [01:22:45] (03PS2) 10Thcipriani: Merge parsoid beta and production roles [puppet] - 10https://gerrit.wikimedia.org/r/201636 (https://phabricator.wikimedia.org/T91549) [01:24:19] RoanKattouw: which patch? [01:25:09] PROBLEM - Disk space on ms-be1005 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdf1 is not accessible: Input/output error [01:25:48] PROBLEM - RAID on ms-be1005 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) [01:26:56] dmesg on ms-be1005 looks like hardware error :( [01:30:14] !log ori Synchronized php-1.25wmf24/extensions/ConfirmEdit: 7cb7ef4e6f: Update ConfirmEdit for Id4798364d (duration: 00m 12s) [01:30:18] Logged the message, Master [01:42:28] PROBLEM - puppet last run on ms-be1005 is CRITICAL: CRITICAL: Puppet has 1 failures [01:47:57] who's familiar with this procedure? i haven't touched swift before: https://wikitech.wikimedia.org/wiki/Swift/How_To#Remove_.28fail_out.29_a_drive_from_a_ring [01:52:14] (03PS1) 10Catrope: Remove resources symlink [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201643 [01:53:12] (03CR) 10Yuvipanda: [C: 031] Remove resources symlink [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201643 (owner: 10Catrope) [01:53:21] ori, yt? can you tell me the names of the couple boxes provisioned to replace vanadium? [01:53:35] (03CR) 10Ori.livneh: [C: 032] "Did not find requests for this URL pattern in the Apache log file on $RANDOM_APPSERVER." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201643 (owner: 10Catrope) [01:53:40] (03Merged) 10jenkins-bot: Remove resources symlink [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201643 (owner: 10Catrope) [01:54:25] !log catrope Synchronized w: (no message) (duration: 00m 12s) [01:54:31] Logged the message, Master [01:58:08] PROBLEM - Apache HTTP on mw1209 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:58:27] PROBLEM - HHVM rendering on mw1065 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:58:28] PROBLEM - HHVM rendering on mw1209 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:58:48] PROBLEM - Apache HTTP on mw1065 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:59:19] PROBLEM - HHVM rendering on mw1249 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:59:28] PROBLEM - Apache HTTP on mw1249 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:00:36] uh [02:00:38] * YuviPanda looks at ^ [02:02:03] hmm, hitting localhost works but not with a host header [02:02:39] RECOVERY - HHVM rendering on mw1249 is OK: HTTP OK: HTTP/1.1 200 OK - 70566 bytes in 0.266 second response time [02:02:47] RoanKattouw, is the ConfirmEdit thing still broken? 
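For readers following the ConfirmEdit thread above: the two failures quoted in the log (the "Argument 4 passed to SimpleCaptcha::shouldCheck() must implement interface IContextSource, bool given" error and the later "Call to a member function getRequest() on a non-object (NULL)" from Captcha.php) are two sides of the same context-plumbing change that was being reverted. The sketch below is illustrative only and assumes a MediaWiki runtime; SimpleCaptcha, IContextSource, RequestContext::getMain() and WebRequest::getCheck() come from the log or MediaWiki core, but the method body, the other parameter names and the wpCaptchaWord field are assumptions, not the actual extension code.

```php
<?php
// Illustrative sketch, not the actual ConfirmEdit code. Assumes a MediaWiki
// environment where IContextSource, RequestContext and WebRequest exist.

class SketchCaptcha /* stand-in for SimpleCaptcha */ {
    /**
     * Type-hinting argument 4 as IContextSource is what makes a caller that
     * still passes the old boolean flag fail with:
     *   "Argument 4 ... must implement interface IContextSource, bool given"
     */
    public function shouldCheck( $page, $content, $section, IContextSource $context = null ) {
        // If a revert removes the parameter but leaves a bare $context
        // dereference behind, $context is null and the next line throws
        // "Call to a member function getRequest() on a non-object (NULL)".
        // Falling back to the global context is the usual defensive pattern:
        $context = $context ?: RequestContext::getMain();
        $request = $context->getRequest();
        return $request->getCheck( 'wpCaptchaWord' ); // field name assumed
    }
}
```

The production fix went the other way (reverting to the pre-IContextSource signature in both ConfirmEdit and Flow); the sketch only shows why a leftover $context reference, or a caller still passing the old boolean, produces exactly the two errors seen above.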
[02:02:48] RECOVERY - Apache HTTP on mw1249 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.038 second response time [02:02:54] I think he left. [02:02:55] !log restarted hhvm on mw1249 and mw1065 [02:03:01] Logged the message, Master [02:03:04] superm401: People are saying account creation is broken on group0 wikis [02:03:16] superm401: At least going to the edit form shouldn't be broken there any more [02:03:27] RECOVERY - HHVM rendering on mw1065 is OK: HTTP OK: HTTP/1.1 200 OK - 70566 bytes in 0.175 second response time [02:03:29] Fiona: ^ [02:03:37] RECOVERY - HHVM rendering on mw1209 is OK: HTTP OK: HTTP/1.1 200 OK - 70566 bytes in 0.961 second response time [02:03:39] RoanKattouw, alright, looking at it. [02:03:49] !log restarted hhvm on mw1209 [02:03:50] RECOVERY - Apache HTTP on mw1065 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.052 second response time [02:03:52] Logged the message, Master [02:04:57] RECOVERY - Apache HTTP on mw1209 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.045 second response time [02:06:04] RoanKattouw, for me, it's alternating between The name "Test Account April 2 2015 1" is not allowed to prevent confusing or spoofed usernames: Canonicalized name too short. Please choose another name." (or whatever name I put) and CAPTCHA errors. [02:06:32] What are the CAPTCHA errors? [02:06:52] CAPTCHA? [02:06:59] I didn't even see a CAPTCHA on the form. [02:07:16] I'm a bit surprised I didn't notice that. [02:07:49] RoanKattouw, I think it just said it was incorrect, but now I'm getting the "confusing or spoofed" every time. [02:08:04] I'll look at the patches. [02:08:54] Why are the requests different, anyway? What are the two objects representing? [02:10:55] superm401: Do you see a CAPTCHA at https://test.wikipedia.org/w/index.php?title=Special:UserLogin/signup&error=&fromhttp=1 ? [02:10:57] RECOVERY - OCG health on ocg1001 is OK: OK: ocg_job_status 796945 msg: ocg_render_job_queue 0 msg [02:10:57] RECOVERY - OCG health on ocg1002 is OK: OK: ocg_job_status 796950 msg: ocg_render_job_queue 0 msg [02:10:58] RECOVERY - OCG health on ocg1003 is OK: OK: ocg_job_status 796957 msg: ocg_render_job_queue 0 msg [02:11:14] No [02:11:18] I tested on my phone and in a desktop browser and I haven't seen a CAPTCHA yet. I think we might've disabled CAPTCHAs there. [02:11:32] I'm testing on MediaWiki.org. [02:11:43] Aha. [02:11:46] RoanKattouw, which files are you looking at? [02:13:00] None, I'm not investigating anything [02:13:20] superm401: I'm heading off, but if you figure out the cause, you can deploy a fix. From TechOps, YuviPanda is around, and you should feel free to page anyone else you want to look over anything. This is sufficiently urgent. [02:13:36] Okay [02:14:14] Actually I'm going to also call greg-g and let him know just so he's aware [02:14:38] Is it just group0 wikis? [02:15:16] I mean, they're group0 for a reason. :-) [02:16:08] It works for me locally with Flow and ConfirmEdit at 1.25wmf14 [02:17:01] Has anything been deployed recently related to Unicode normalization? [02:18:00] We had a refactor in ConfirmEdit that was reverted [02:18:10] Hmmm. [02:18:14] Then that revert missed a spot, so we had lots of exceptions in group0, I fixed that [02:18:15] It's AntiSpoof that's throwing the message. [02:18:20] Oooh [02:18:20] The one I'm seeing, anyway. 
[02:18:27] Unicode normalization, yes of course [02:18:39] legoktm and brion worked on splitting out UtfNormal into a separate package [02:18:42] Pulled in by composer [02:19:10] Fiona: Could you add that info (that it's AntiSpoof throwing the error and that I suspect the UtfNormal move) to the task? [02:19:20] RoanKattouw, hmm, maybe they forgot to do vendor [02:19:59] Hmm but I think that should make things explode loudly (missing functions etc), not like this [02:20:10] Yeah, plus it's there: https://git.wikimedia.org/commit/mediawiki%2Fvendor/40d9f266d8ccf9254f03e1a542e929f7b89216ec [02:21:40] It still works with 1.25wmf24 of AntiSpoof locally. [02:21:59] Hmm, forgot to do core. [02:22:28] RoanKattouw: I added a note to https://phabricator.wikimedia.org/T94958 [02:25:38] PROBLEM - puppet last run on mw2133 is CRITICAL: CRITICAL: Puppet has 1 failures [02:25:51] hmm [02:25:56] well, it's eqiad [02:26:24] ^ getting of the outdated "esams in scheduled downtime" bit [02:26:35] there should have been a "rid" in there somewhere [02:31:04] mattflaschen@tin:~$ mwscript eval.php --wiki=mediawikiwiki [02:31:05] > $u = new SpoofUser( 'Matt 2015-04-02' ); [02:31:07] > var_export( $u->isLegal() ); [02:31:08] false [02:31:10] Easily reproducible, just need to find the cause. [02:31:43] !log l10nupdate Synchronized php-1.25wmf23/cache/l10n: (no message) (duration: 09m 08s) [02:31:51] Logged the message, Master [02:38:05] Changing some AntiSpoof protection levels on tin so I can call the methods from the shell. [02:38:24] !log LocalisationUpdate completed (1.25wmf23) at 2015-04-03 02:37:21+00:00 [02:38:31] Logged the message, Master [02:40:48] RECOVERY - puppet last run on mw2133 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [02:45:13] It's something weird in listtostring. [02:46:00] codepointToUtf8 is totally broken. [02:46:04] Is that provided by UtfNormal [02:47:31] ^ legoktm, could I get a hand, since you're an author in UtfNormal? [02:48:24] I don't know if he's around? haven't seen in him in a bit [02:51:35] Weird, it has a namespaced function that works, and a non-namespaced one that returns blank (but doesn't throw). [02:52:30] 6operations, 3Interdatacenter-IPsec: Kernel panics on Jessie (3.16.0-4-amd64) during IPsec load test - https://phabricator.wikimedia.org/T94820#1176783 (10Gage) Got another one: I changed ciphers on berkelium & restarted the daemon there; before I had a chance to restart the daemon on curium for corresponding... [02:52:57] Figured it out [02:53:12] Nice. [02:55:04] https://gerrit.wikimedia.org/r/#/c/201646/ [02:55:10] Self-merged it since it's 100% obvious. [02:55:33] hahah [02:55:50] In PHP, we believe types get in the way of productive programming. [02:55:54] Instead, we give the people what they want. [02:56:08] Which in this case, is letting any function's return value convert to a string. [02:56:15] PHP: Why Worry About Types? [02:56:20] lol [02:56:42] PHP: These aren't the types you are looking for [02:57:44] In PHP, we also believe that echo NULL should echo a blank string, and echo array(1,2,3,4) should output Array. [02:57:56] PHP: Why Worry About Debugging? [02:58:39] 6operations, 3Interdatacenter-IPsec: Kernel panics on Jessie (3.16.0-4-amd64) during IPsec load test - https://phabricator.wikimedia.org/T94820#1176792 (10Gage) Ok, this is reproducible and seems to be the primary problem I was having yesterday: enabling Extended Sequence Numbers (ESN, http://kernelnewbies.org... [02:58:56] account creation is broken where? 
we just fixed it for apps [02:58:57] oh wow.. that's kinda awesome superm401 :) [02:58:58] Also, it's pretty messed up that we have global functions that don't even start with wf (which is already lowering the bar). [02:58:58] !log l10nupdate Synchronized php-1.25wmf24/cache/l10n: (no message) (duration: 06m 00s) [02:59:04] Logged the message, Master [02:59:27] greg-g: group0 wikis, almost fixed again (superm found it!) [02:59:33] csteipp, my prod troubleshooting, or my PHP philosophizing? [02:59:34] thanks a ton superm401 [02:59:45] superm401: ALL OF IT :) [02:59:52] ^ that [03:00:00] [object Object] [03:00:10] At least it works for arrays. ;) [03:00:17] In JS [03:00:50] Should I force it? It got to "Main test build succeeded." [03:01:15] what, jenkins? [03:01:22] Yeah [03:01:35] it's almost done [03:01:49] Yeah [03:01:50] just mediawiki-phpunit-zend [03:02:02] It would be good to test with the extensions (although I think it's done that part) [03:02:12] don't force please [03:02:18] Okay [03:02:27] we've waited soooo long, might as well wait for zeeend [03:02:34] it causes zuul to get confused (a ref already merged it thinks it should have to merge) [03:02:40] effing zend [03:02:51] kill zend, who cares? :) [03:03:13] Oh, I didn't know it confused Zuul when you forced. [03:03:14] kill zend, switch to hack, done [03:03:30] yeah, we just figured that out last week I think (or at least, I just learned about it last week) [03:03:33] !log LocalisationUpdate completed (1.25wmf24) at 2015-04-03 03:02:30+00:00 [03:03:39] Logged the message, Master [03:03:43] ohgood, l10n probably slowed things down [03:03:51] in jenkins [03:05:27] I'm going to go back to eating dinner [03:05:43] Other users of mediawiki probably wouldn't be too happy with use of hack :) [03:05:44] much love! [03:06:00] Krenair: I was 99% joking :) [03:06:28] but, we should probably make the zend builds only run after merge to save us time [03:06:45] anywho, dinner is getting cold [03:09:06] We have almost 10,000 test functions. [03:10:33] And now we wait for Jenkins again: https://gerrit.wikimedia.org/r/#/c/201654/ [03:29:25] !log mattflaschen Synchronized php-1.25wmf24/includes/libs/normal/UtfNormalUtil.php: Fix UtfNormal shim so account creations work (duration: 00m 12s) [03:29:32] Logged the message, Master [03:30:20] Darn it, I forgot to pull [03:31:34] !log mattflaschen Synchronized php-1.25wmf24/includes/libs/normal/UtfNormalUtil.php: Fix UtfNormal shim so account creations work (duration: 00m 12s) [03:32:04] Works [03:32:14] superm401: I am surprised that that was the only thing that broke [03:32:47] Well, the only thing that broke account creations the second time account creations was broken today. [03:33:19] true, but for different reasons [03:33:26] The users don't care. :) [03:33:32] totally :) [03:33:42] superm401: more like, I was expecting something like that to break a lot more things... [03:33:48] not that this is trivial breakage, etc [03:34:24] YuviPanda, well, some of them they probably changed to call UtfNormal\Utils::codepointToUtf8 . Or maybe that's the only caller, who knows. [03:34:38] YuviPanda, I more thought you meant "I am surprised you didn't have to repair multiple files." [03:34:46] To fix the current problem [03:34:59] ah, right. no, I was surprised that more things in the site didn't break on an accidental null [03:36:28] Interesting, git doesn't show you the affected files when you do git pull --rebase, unlike git pull. 
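To make the root cause superm401 describes above concrete, here is a minimal sketch of the failure class (an assumption about the shape of the bug, not the contents of the actual fix in Gerrit change 201646): a global back-compat shim that forwards to the namespaced UtfNormal\Utils::codepointToUtf8() from the split-out composer package but drops the return value. Every caller then receives NULL, which PHP silently coerces to an empty string, matching "a non-namespaced one that returns blank (but doesn't throw)".

```php
<?php
// Sketch of the bug class only -- the real shim lives in MediaWiki's
// includes/libs/normal/UtfNormalUtil.php and the exact diff is not shown here.
// UtfNormal\Utils::codepointToUtf8() is the namespaced function mentioned in
// the log; this assumes the split-out UtfNormal composer package is loaded.

// Broken shim: forwards the call but never returns the result.
function codepointToUtf8_broken( $codepoint ) {
    UtfNormal\Utils::codepointToUtf8( $codepoint ); // value silently discarded
}

// Fixed shim: the single missing keyword.
function codepointToUtf8_fixed( $codepoint ) {
    return UtfNormal\Utils::codepointToUtf8( $codepoint );
}

// Why nothing "exploded loudly": string context swallows the NULL.
echo "broken: '" . codepointToUtf8_broken( 0x41 ) . "'\n"; // broken: ''
echo "fixed:  '" . codepointToUtf8_fixed( 0x41 ) . "'\n";  // fixed:  'A'
```

An empty string coming back from the canonicalization path would also explain the AntiSpoof symptom reported earlier ("Canonicalized name too short"), consistent with the listtostring/codepointToUtf8 trail in the log.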
[03:37:29] (03PS1) 10Yuvipanda: tools: Have webservice write out a service.manifest file [puppet] - 10https://gerrit.wikimedia.org/r/201656 (https://phabricator.wikimedia.org/T94964) [03:42:42] was there a task for the second account creation fail today? [03:43:15] found it: https://phabricator.wikimedia.org/T94958 [04:04:57] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:06:27] superm401: well done! Thank you! [04:06:48] Thanks. :) [04:14:57] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 59674 bytes in 7.573 second response time [04:28:28] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:36:48] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 59654 bytes in 9.909 second response time [04:38:22] (03CR) 10Tim Landscheidt: [C: 04-1] "Doesn't work yet:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/201656 (https://phabricator.wikimedia.org/T94964) (owner: 10Yuvipanda) [04:43:47] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:54:52] superm401: arghhhh :| [04:55:02] superm401: thanks for debugging and fixing :) [04:55:27] legoktm, it happens. [04:56:11] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Apr 3 04:55:08 UTC 2015 (duration 55m 7s) [04:56:20] Logged the message, Master [04:56:36] looking back, I should have left the tests in the add back-compat layer patch, made sure they passed, and then removed them in a follow-up. [04:56:41] that would have caught this bug [04:59:20] (03CR) 10Tim Landscheidt: [C: 04-1] "This causes errors on Toolsbeta (others not tested):" [puppet] - 10https://gerrit.wikimedia.org/r/194796 (owner: 10Dzahn) [05:03:48] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 59653 bytes in 0.438 second response time [05:09:16] (03PS2) 10Yuvipanda: tools: Have webservice write out a service.manifest file [puppet] - 10https://gerrit.wikimedia.org/r/201656 (https://phabricator.wikimedia.org/T94964) [05:11:04] (03CR) 10Yuvipanda: "Fixed." [puppet] - 10https://gerrit.wikimedia.org/r/201656 (https://phabricator.wikimedia.org/T94964) (owner: 10Yuvipanda) [05:32:19] (03CR) 10Tim Landscheidt: [C: 031] "Tested; "webservice2 start" sets "web: lighttpd", "webservice2 start --release=precise" sets "web: lighttpd-precise", "webservice2 stop" r" [puppet] - 10https://gerrit.wikimedia.org/r/201656 (https://phabricator.wikimedia.org/T94964) (owner: 10Yuvipanda) [05:33:29] (03CR) 10Yuvipanda: "It shouldn't particularly be hidden, I think." [puppet] - 10https://gerrit.wikimedia.org/r/201656 (https://phabricator.wikimedia.org/T94964) (owner: 10Yuvipanda) [05:33:43] (03CR) 10Yuvipanda: [C: 032] tools: Have webservice write out a service.manifest file [puppet] - 10https://gerrit.wikimedia.org/r/201656 (https://phabricator.wikimedia.org/T94964) (owner: 10Yuvipanda) [05:36:41] (03PS1) 10KartikMistry: CX: Enable Content Translation in guwiki and viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201666 [05:41:18] (03PS1) 10KartikMistry: CX: Add 'gu' and 'vi' in language selector [puppet] - 10https://gerrit.wikimedia.org/r/201667 [05:43:04] (03CR) 10KartikMistry: [C: 04-1] "Not to merge until https://gerrit.wikimedia.org/r/#/c/201666 is done." 
[puppet] - 10https://gerrit.wikimedia.org/r/201667 (owner: 10KartikMistry) [06:10:48] (03PS1) 10Tim Landscheidt: Tools: Puppetize webservice2 requirement [puppet] - 10https://gerrit.wikimedia.org/r/201671 [06:16:09] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:19:28] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 59653 bytes in 0.568 second response time [06:21:13] (03CR) 10Yuvipanda: "Whoops :)" [puppet] - 10https://gerrit.wikimedia.org/r/201671 (owner: 10Tim Landscheidt) [06:26:27] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:29:19] PROBLEM - puppet last run on mw2056 is CRITICAL: CRITICAL: puppet fail [06:29:48] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:08] PROBLEM - puppet last run on cp3042 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:28] PROBLEM - puppet last run on mw2083 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:37] PROBLEM - puppet last run on amssq35 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:39] PROBLEM - puppet last run on wtp2012 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:28] PROBLEM - puppet last run on lvs2004 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:37] PROBLEM - puppet last run on mw2163 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:18] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 2 failures [06:36:38] PROBLEM - puppet last run on mw2097 is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:58] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures [06:37:08] PROBLEM - puppet last run on mw2016 is CRITICAL: CRITICAL: Puppet has 1 failures [06:39:07] can someone tell me where I can get the db list of wmf production cluster ? [06:39:51] all.dblist in mediawiki-config repo [06:40:30] turbocat: thanks. checking that now [06:40:48] PROBLEM - puppet last run on amssq59 is CRITICAL: CRITICAL: puppet fail [06:41:52] and is there are default $wgDBprefix ? 
[06:42:07] ( it's there, I just missed the link :\ ) [06:45:47] RECOVERY - puppet last run on mw2083 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [06:45:47] RECOVERY - puppet last run on mw2016 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [06:45:48] RECOVERY - puppet last run on amssq35 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:46:47] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [06:46:47] RECOVERY - puppet last run on lvs2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:46:48] RECOVERY - puppet last run on mw2163 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:46:49] RECOVERY - puppet last run on mw2097 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [06:47:07] RECOVERY - puppet last run on cp3042 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:17] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [06:47:28] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:39] RECOVERY - puppet last run on wtp2012 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [06:47:58] RECOVERY - puppet last run on mw2056 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [06:56:48] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 59653 bytes in 0.309 second response time [06:57:38] RECOVERY - puppet last run on amssq59 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [07:48:08] PROBLEM - Apache HTTP on mw1147 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:48:08] PROBLEM - HHVM rendering on mw1197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:48:09] PROBLEM - HHVM rendering on mw1200 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:48:17] PROBLEM - HHVM rendering on mw1190 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:48:17] PROBLEM - Apache HTTP on mw1196 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:48:17] PROBLEM - HHVM rendering on mw1196 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:48:17] PROBLEM - HHVM rendering on mw1146 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:48:17] PROBLEM - Apache HTTP on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:48:18] PROBLEM - Apache HTTP on mw1125 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:48:18] PROBLEM - HHVM rendering on mw1143 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:48:19] PROBLEM - HHVM rendering on mw1139 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:48:19] PROBLEM - HHVM rendering on mw1142 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:48:19] PROBLEM - Apache HTTP on mw1138 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:48:19] PROBLEM - HHVM rendering on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:48:20] PROBLEM - HHVM rendering on mw1191 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:48:37] PROBLEM - Apache HTTP on mw1199 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:48:38] PROBLEM - Apache HTTP on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
[07:48:38] PROBLEM - Apache HTTP on mw1121 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:48:39] PROBLEM - Apache HTTP on mw1226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:48:39] PROBLEM - Apache HTTP on mw1142 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:48:39] PROBLEM - Apache HTTP on mw1131 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:48:47] PROBLEM - Apache HTTP on mw1197 is CRITICAL: Connection timed out [07:48:48] PROBLEM - HHVM rendering on mw1127 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:48:48] PROBLEM - HHVM rendering on mw1135 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:48:48] PROBLEM - Apache HTTP on mw1230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:48:48] PROBLEM - Apache HTTP on mw1221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:48:48] PROBLEM - Apache HTTP on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:48:48] PROBLEM - Apache HTTP on mw1128 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:48:49] PROBLEM - HHVM rendering on mw1231 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:48:49] PROBLEM - Apache HTTP on mw1115 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:48:50] PROBLEM - HHVM rendering on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:48:50] PROBLEM - Apache HTTP on mw1191 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:48:51] PROBLEM - HHVM rendering on mw1195 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:48:51] PROBLEM - Apache HTTP on mw1229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:48:52] PROBLEM - HHVM rendering on mw1202 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:49:03] PROBLEM - Apache HTTP on mw1190 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:49:03] PROBLEM - Apache HTTP on mw1206 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:49:04] PROBLEM - HHVM rendering on mw1206 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:49:07] PROBLEM - Apache HTTP on mw1135 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:49:07] PROBLEM - HHVM rendering on mw1201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:49:07] PROBLEM - HHVM rendering on mw1131 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:49:07] PROBLEM - HHVM rendering on mw1132 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:49:07] PROBLEM - Apache HTTP on mw1235 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:49:08] PROBLEM - HHVM rendering on mw1137 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:49:08] PROBLEM - Apache HTTP on mw1122 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:49:09] PROBLEM - HHVM rendering on mw1125 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:49:09] PROBLEM - HHVM rendering on mw1134 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:49:09] PROBLEM - Apache HTTP on mw1119 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:49:09] PROBLEM - Apache HTTP on mw1198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:49:10] PROBLEM - Apache HTTP on mw1222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:49:17] PROBLEM - HHVM rendering on mw1222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:49:17] PROBLEM - HHVM rendering on mw1229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:49:17] PROBLEM - Apache HTTP on mw1203 is CRITICAL: CRITICAL - 
Socket timeout after 10 seconds [07:49:17] PROBLEM - HHVM rendering on mw1228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:49:18] PROBLEM - HHVM rendering on mw1203 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:49:18] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 23.08% of data above the critical threshold [500.0] [07:49:18] PROBLEM - Apache HTTP on mw1204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:49:18] PROBLEM - HHVM rendering on mw1198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:49:29] PROBLEM - HHVM rendering on mw1207 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:49:30] PROBLEM - HHVM rendering on mw1128 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:49:38] PROBLEM - Apache HTTP on mw1201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:49:38] PROBLEM - HHVM rendering on mw1199 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:49:38] PROBLEM - Apache HTTP on mw1195 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:49:38] PROBLEM - HHVM rendering on mw1126 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:49:38] PROBLEM - Apache HTTP on mw1194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:49:39] PROBLEM - HHVM rendering on mw1120 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:49:39] PROBLEM - HHVM rendering on mw1145 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:49:39] PROBLEM - HHVM rendering on mw1122 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:49:40] PROBLEM - HHVM rendering on mw1130 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:49:40] PROBLEM - HHVM rendering on mw1117 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:49:47] PROBLEM - Apache HTTP on mw1232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:49:47] PROBLEM - Apache HTTP on mw1231 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:49:48] PROBLEM - Apache HTTP on mw1137 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:49:48] PROBLEM - Apache HTTP on mw1133 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:49:48] PROBLEM - Apache HTTP on mw1120 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:49:49] PROBLEM - Apache HTTP on mw1117 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:49:49] PROBLEM - Apache HTTP on mw1148 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:49:54] uh oh [07:49:57] PROBLEM - Apache HTTP on mw1233 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:49:57] PROBLEM - Apache HTTP on mw1223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:49:57] PROBLEM - Apache HTTP on mw1126 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:49:58] PROBLEM - HHVM rendering on mw1221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:49:58] PROBLEM - Apache HTTP on mw1193 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:49:58] PROBLEM - HHVM rendering on mw1232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:49:58] PROBLEM - HHVM rendering on mw1225 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:49:58] PROBLEM - Apache HTTP on mw1136 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:49:58] PROBLEM - HHVM busy threads on mw1196 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [115.2] [07:49:59] PROBLEM - HHVM rendering on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:49:59] PROBLEM - Apache HTTP on 
mw1130 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:50:09] PROBLEM - HHVM rendering on mw1136 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:50:09] PROBLEM - HHVM rendering on mw1133 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:50:09] PROBLEM - Apache HTTP on mw1144 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:50:09] PROBLEM - HHVM rendering on mw1119 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:50:09] PROBLEM - Apache HTTP on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:50:17] PROBLEM - Apache HTTP on mw1205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:50:18] PROBLEM - HHVM busy threads on mw1143 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [86.4] [07:50:18] PROBLEM - Apache HTTP on mw1224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:50:48] PROBLEM - HHVM busy threads on mw1146 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [86.4] [07:50:58] PROBLEM - HHVM busy threads on mw1126 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [86.4] [07:50:58] PROBLEM - HHVM queue size on mw1194 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [80.0] [07:51:07] PROBLEM - HHVM busy threads on mw1235 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [115.2] [07:51:07] PROBLEM - HHVM busy threads on mw1221 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [115.2] [07:51:07] PROBLEM - HHVM busy threads on mw1200 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [115.2] [07:51:07] PROBLEM - HHVM busy threads on mw1117 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [86.4] [07:51:07] PROBLEM - HHVM queue size on mw1206 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [80.0] [07:51:08] PROBLEM - HHVM busy threads on mw1121 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [86.4] [07:51:08] PROBLEM - HHVM busy threads on mw1227 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [115.2] [07:51:09] PROBLEM - HHVM busy threads on mw1224 is CRITICAL: CRITICAL: 75.00% of data above the critical threshold [115.2] [07:51:28] PROBLEM - HHVM busy threads on mw1147 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [86.4] [07:51:28] PROBLEM - HHVM busy threads on mw1137 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [86.4] [07:51:28] PROBLEM - HHVM queue size on mw1200 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [80.0] [07:51:29] PROBLEM - HHVM queue size on mw1145 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [80.0] [07:51:37] PROBLEM - HHVM queue size on mw1142 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [80.0] [07:51:37] PROBLEM - HHVM busy threads on mw1122 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [86.4] [07:51:37] PROBLEM - HHVM queue size on mw1192 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [80.0] [07:51:37] PROBLEM - HHVM busy threads on mw1199 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [115.2] [07:51:37] PROBLEM - HHVM busy threads on mw1222 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [115.2] [07:51:37] PROBLEM - HHVM busy threads on mw1198 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [115.2] [07:51:37] PROBLEM - HHVM queue size on mw1201 is CRITICAL: CRITICAL: 66.67% of data 
above the critical threshold [80.0] [07:51:38] PROBLEM - HHVM queue size on mw1144 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [80.0] [07:51:47] PROBLEM - HHVM queue size on mw1222 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [80.0] [07:51:48] PROBLEM - HHVM busy threads on mw1125 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [86.4] [07:51:48] PROBLEM - HHVM queue size on mw1228 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [80.0] [07:51:48] PROBLEM - HHVM queue size on mw1117 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [80.0] [07:51:48] PROBLEM - HHVM queue size on mw1226 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [80.0] [07:51:48] PROBLEM - HHVM busy threads on mw1223 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [115.2] [07:51:49] PROBLEM - HHVM queue size on mw1120 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [80.0] [07:51:49] PROBLEM - HHVM busy threads on mw1194 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [115.2] [07:51:49] PROBLEM - HHVM busy threads on mw1226 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [115.2] [07:51:57] PROBLEM - HHVM busy threads on mw1193 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [115.2] [07:51:58] PROBLEM - HHVM queue size on mw1233 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [80.0] [07:51:58] PROBLEM - HHVM busy threads on mw1234 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [115.2] [07:51:58] PROBLEM - HHVM busy threads on mw1190 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [115.2] [07:51:58] PROBLEM - HHVM busy threads on mw1204 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [115.2] [07:51:59] PROBLEM - HHVM busy threads on mw1129 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [86.4] [07:51:59] PROBLEM - HHVM queue size on mw1235 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [80.0] [07:51:59] PROBLEM - HHVM busy threads on mw1144 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [86.4] [07:52:07] PROBLEM - HHVM busy threads on mw1133 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [86.4] [07:52:07] PROBLEM - HHVM busy threads on mw1232 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [115.2] [07:52:07] PROBLEM - HHVM queue size on mw1196 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [80.0] [07:52:07] PROBLEM - HHVM queue size on mw1199 is CRITICAL: CRITICAL: 75.00% of data above the critical threshold [80.0] [07:52:07] PROBLEM - HHVM busy threads on mw1201 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [115.2] [07:52:08] PROBLEM - HHVM busy threads on mw1189 is CRITICAL: CRITICAL: 75.00% of data above the critical threshold [115.2] [07:52:08] PROBLEM - HHVM busy threads on mw1139 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [86.4] [07:52:08] PROBLEM - HHVM busy threads on mw1225 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [115.2] [07:52:08] PROBLEM - HHVM busy threads on mw1136 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [86.4] [07:52:09] PROBLEM - HHVM busy threads on mw1116 is CRITICAL: CRITICAL: 75.00% of data above the critical threshold [86.4] [07:52:09] PROBLEM - HHVM queue size on mw1136 is CRITICAL: CRITICAL: 62.50% of 
data above the critical threshold [80.0] [07:52:10] PROBLEM - HHVM busy threads on mw1205 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [115.2] [07:52:27] PROBLEM - HHVM busy threads on mw1119 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [86.4] [07:52:28] PROBLEM - HHVM queue size on mw1208 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [80.0] [07:52:28] PROBLEM - HHVM busy threads on mw1230 is CRITICAL: CRITICAL: 75.00% of data above the critical threshold [115.2] [07:52:28] PROBLEM - HHVM busy threads on mw1148 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [86.4] [07:52:28] PROBLEM - HHVM queue size on mw1127 is CRITICAL: CRITICAL: 75.00% of data above the critical threshold [80.0] [07:52:28] PROBLEM - HHVM busy threads on mw1114 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [86.4] [07:52:28] PROBLEM - HHVM queue size on mw1225 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [80.0] [07:52:28] PROBLEM - HHVM queue size on mw1231 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [80.0] [07:52:29] PROBLEM - HHVM busy threads on mw1142 is CRITICAL: CRITICAL: 75.00% of data above the critical threshold [86.4] [07:52:29] PROBLEM - HHVM busy threads on mw1208 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [115.2] [07:52:30] PROBLEM - HHVM busy threads on mw1206 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [115.2] [07:52:30] PROBLEM - HHVM busy threads on mw1127 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [86.4] [07:52:37] PROBLEM - HHVM queue size on mw1227 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [80.0] [07:52:37] PROBLEM - HHVM busy threads on mw1132 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [86.4] [07:52:37] PROBLEM - HHVM queue size on mw1131 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [80.0] [07:52:38] PROBLEM - HHVM queue size on mw1124 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [80.0] [07:52:38] PROBLEM - HHVM busy threads on mw1134 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [86.4] [07:52:38] PROBLEM - HHVM busy threads on mw1123 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [86.4] [07:52:38] PROBLEM - HHVM queue size on mw1140 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [80.0] [07:52:47] PROBLEM - HHVM busy threads on mw1130 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [86.4] [07:52:47] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 1 below the confidence bounds [07:52:48] PROBLEM - HHVM busy threads on mw1191 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [115.2] [07:52:48] PROBLEM - HHVM queue size on mw1202 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [80.0] [07:52:48] PROBLEM - HHVM queue size on mw1195 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [80.0] [07:52:48] PROBLEM - HHVM busy threads on mw1128 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [86.4] [07:52:48] PROBLEM - HHVM queue size on mw1119 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [80.0] [07:52:48] PROBLEM - HHVM queue size on mw1125 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [80.0] [07:52:49] PROBLEM - HHVM queue size on mw1198 is 
CRITICAL: CRITICAL: 40.00% of data above the critical threshold [80.0] [07:52:49] PROBLEM - HHVM queue size on mw1114 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [80.0] [07:52:50] PROBLEM - HHVM queue size on mw1134 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [80.0] [07:52:50] PROBLEM - HHVM busy threads on mw1192 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [115.2] [07:52:57] PROBLEM - HHVM busy threads on mw1131 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [86.4] [07:52:58] PROBLEM - HHVM busy threads on mw1233 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [115.2] [07:52:58] PROBLEM - HHVM queue size on mw1234 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [80.0] [07:52:58] PROBLEM - HHVM queue size on mw1148 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [80.0] [07:52:59] PROBLEM - HHVM queue size on mw1138 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [80.0] [07:52:59] PROBLEM - HHVM busy threads on mw1120 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [86.4] [07:52:59] PROBLEM - HHVM busy threads on mw1228 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [115.2] [07:52:59] PROBLEM - HHVM queue size on mw1132 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [80.0] [07:53:07] PROBLEM - HHVM queue size on mw1146 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [80.0] [07:53:07] PROBLEM - HHVM busy threads on mw1115 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [86.4] [07:53:07] PROBLEM - HHVM queue size on mw1129 is CRITICAL: CRITICAL: 75.00% of data above the critical threshold [80.0] [07:53:08] PROBLEM - HHVM queue size on mw1189 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [80.0] [07:53:08] PROBLEM - HHVM queue size on mw1191 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [80.0] [07:53:08] PROBLEM - HHVM queue size on mw1221 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [80.0] [07:53:08] PROBLEM - HHVM busy threads on mw1207 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [115.2] [07:53:17] PROBLEM - HHVM queue size on mw1137 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [80.0] [07:53:17] PROBLEM - HHVM queue size on mw1230 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [80.0] [07:53:17] PROBLEM - HHVM busy threads on mw1203 is CRITICAL: CRITICAL: 75.00% of data above the critical threshold [115.2] [07:53:17] PROBLEM - HHVM queue size on mw1126 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [80.0] [07:53:17] PROBLEM - HHVM queue size on mw1203 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [80.0] [07:53:18] PROBLEM - HHVM busy threads on mw1135 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [86.4] [07:53:18] PROBLEM - HHVM busy threads on mw1124 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [86.4] [07:53:18] PROBLEM - HHVM busy threads on mw1197 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [115.2] [07:53:18] PROBLEM - HHVM queue size on mw1207 is CRITICAL: CRITICAL: 75.00% of data above the critical threshold [80.0] [07:53:27] PROBLEM - HHVM busy threads on mw1145 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [86.4] [07:53:27] PROBLEM - HHVM queue size on mw1121 is CRITICAL: 
CRITICAL: 44.44% of data above the critical threshold [80.0] [07:53:29] PROBLEM - HHVM busy threads on mw1231 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [115.2] [07:53:38] PROBLEM - HHVM queue size on mw1143 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [80.0] [07:53:38] PROBLEM - HHVM busy threads on mw1140 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [86.4] [07:53:47] PROBLEM - HHVM queue size on mw1197 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [80.0] [07:53:48] PROBLEM - HHVM queue size on mw1130 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [80.0] [07:53:48] PROBLEM - HHVM queue size on mw1122 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [80.0] [07:53:48] PROBLEM - HHVM queue size on mw1204 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [80.0] [07:53:48] PROBLEM - HHVM queue size on mw1135 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [80.0] [07:53:58] PROBLEM - HHVM queue size on mw1139 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [80.0] [07:54:08] PROBLEM - HHVM queue size on mw1116 is CRITICAL: CRITICAL: 87.50% of data above the critical threshold [80.0] [07:54:18] PROBLEM - HHVM queue size on mw1223 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [80.0] [07:54:28] PROBLEM - HHVM queue size on mw1205 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [80.0] [07:54:38] PROBLEM - HHVM queue size on mw1224 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [80.0] [07:54:47] PROBLEM - HHVM queue size on mw1128 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [80.0] [07:54:57] PROBLEM - HHVM queue size on mw1115 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [80.0] [07:54:58] PROBLEM - HHVM queue size on mw1123 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [80.0] [07:55:32] easter attack on HHVM ? [07:55:49] RECOVERY - Apache HTTP on mw1235 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.108 second response time [07:55:49] are ops not getting paged? [07:55:56] dunno [07:56:11] http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=LVS%20loadbalancers%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&st=1428047719&g=network_report&z=large not good at all [07:56:13] afaik, most EU opsens are out for the week [07:56:28] RECOVERY - Apache HTTP on mw1117 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.037 second response time [07:56:38] PROBLEM - RAID on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:56:40] nooop [07:56:47] PROBLEM - dhclient process on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:56:58] PROBLEM - HHVM processes on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:56:58] PROBLEM - DPKG on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:57:01] all sockets time-out [07:57:03] interesting [07:57:07] PROBLEM - HHVM queue size on mw1147 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [80.0] [07:57:08] PROBLEM - nutcracker port on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:57:08] RECOVERY - HHVM rendering on mw1127 is OK: HTTP OK: HTTP/1.1 200 OK - 69502 bytes in 1.833 second response time [07:57:08] PROBLEM - HHVM queue size on mw1190 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [80.0] [07:57:18] PROBLEM - Disk space on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:57:27] RECOVERY - Apache HTTP on mw1127 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.073 second response time [07:57:37] PROBLEM - SSH on mw1147 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:57:38] PROBLEM - salt-minion processes on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:57:48] PROBLEM - puppet last run on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:57:50] that means a meltdown or connectivity problems [07:58:07] RECOVERY - HHVM rendering on mw1117 is OK: HTTP OK: HTTP/1.1 200 OK - 69502 bytes in 2.017 second response time [07:58:07] PROBLEM - configured eth on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:58:08] PROBLEM - nutcracker process on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:58:38] RECOVERY - HHVM rendering on mw1227 is OK: HTTP OK: HTTP/1.1 200 OK - 69501 bytes in 7.846 second response time [07:58:47] RECOVERY - Apache HTTP on mw1121 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.917 second response time [07:59:17] good morning [07:59:38] RECOVERY - HHVM rendering on mw1115 is OK: HTTP OK: HTTP/1.1 200 OK - 69501 bytes in 8.985 second response time [07:59:59] MaxSem: mobrovac they are most probably [08:00:17] RECOVERY - dhclient process on mw1147 is OK: PROCS OK: 0 processes with command name dhclient [08:00:18] RECOVERY - HHVM processes on mw1147 is OK: PROCS OK: 1 process with command name hhvm [08:00:21] does it have any impact anyway? I am not sure what the mw11xx machines are [08:00:27] RECOVERY - DPKG on mw1147 is OK: All packages OK [08:00:29] RECOVERY - nutcracker port on mw1147 is OK: TCP OK - 0.000 second response time on port 11212 [08:00:36] api [08:00:38] RECOVERY - Apache HTTP on mw1114 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 9.050 second response time [08:00:47] here, debugging [08:00:47] RECOVERY - Disk space on mw1147 is OK: DISK OK [08:00:58] RECOVERY - SSH on mw1147 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [08:01:08] RECOVERY - salt-minion processes on mw1147 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [08:01:08] RECOVERY - HHVM rendering on mw1132 is OK: HTTP OK: HTTP/1.1 200 OK - 69501 bytes in 0.170 second response time [08:01:08] RECOVERY - puppet last run on mw1147 is OK: OK: Puppet is currently enabled, last run 20 minutes ago with 0 failures [08:01:17] PROBLEM - Apache HTTP on mw1235 is CRITICAL: Connection timed out [08:01:27] RECOVERY - configured eth on mw1147 is OK: NRPE: Unable to read output [08:01:28] RECOVERY - Apache HTTP on mw1116 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 7.166 second response time [08:01:30] cool, cause I realised that my us phone has no international calls option [08:01:38] RECOVERY - nutcracker process on mw1147 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [08:01:41] insane wikidata database queries [08:01:42] paravoid: am here too, only partly. 
[08:01:48] RECOVERY - RAID on mw1147 is OK: OK: no RAID installed [08:01:48] RECOVERY - Apache HTTP on mw1147 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.248 second response time [08:01:50] db1071 has 230k qps alone [08:01:58] eek [08:02:01] + 60-70k qps for other s5 servers [08:02:08] RECOVERY - Apache HTTP on mw1125 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 4.683 second response time [08:02:10] what? [08:02:17] RECOVERY - HHVM rendering on mw1147 is OK: HTTP OK: HTTP/1.1 200 OK - 67406 bytes in 0.456 second response time [08:02:31] SELECT /* DatabaseBase::selectRow 10.64.32.84 */ ips_item_id FROM `wb_items_per_site` WHERE ips_site_id = 'svwikivoyage' AND ips_site_page = 'API' LIMIT 1 [08:02:32] * YuviPanda decides to go to sleep instead, enough Europeans seem around [08:02:38] RECOVERY - Apache HTTP on mw1229 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.903 second response time [08:02:40] also [08:02:42] WHERE ips_site_id = 'svwikivoyage' AND ips_site_page = 'Dwimmerlaik' LIMIT [08:02:47] RECOVERY - HHVM rendering on mw1116 is OK: HTTP OK: HTTP/1.1 200 OK - 69502 bytes in 1.749 second response time [08:02:48] PROBLEM - SSH on mw1135 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:02:48] PROBLEM - RAID on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:02:57] RECOVERY - Apache HTTP on mw1198 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.203 second response time [08:02:57] PROBLEM - Disk space on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:02:57] PROBLEM - nutcracker port on mw1130 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:02:58] PROBLEM - SSH on mw1130 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:02:58] RECOVERY - HHVM rendering on mw1229 is OK: HTTP OK: HTTP/1.1 200 OK - 69501 bytes in 6.018 second response time [08:03:02] what the hell is this? [08:03:07] RECOVERY - HHVM rendering on mw1198 is OK: HTTP OK: HTTP/1.1 200 OK - 69501 bytes in 1.584 second response time [08:03:08] PROBLEM - puppet last run on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:03:17] PROBLEM - RAID on mw1130 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:03:18] PROBLEM - nutcracker process on mw1122 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:03:18] PROBLEM - salt-minion processes on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:03:18] PROBLEM - configured eth on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:03:18] PROBLEM - SSH on mw1122 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:03:19] PROBLEM - salt-minion processes on mw1122 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:03:27] PROBLEM - RAID on mw1122 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:03:28] PROBLEM - Disk space on mw1122 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:03:32] <_joe_> whoa [08:03:36] <_joe_> what's happening? [08:03:38] PROBLEM - dhclient process on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:03:39] these queries themselves are harmless, must be volume or other, heavier ones [08:03:42] <_joe_> I just opened IRC now [08:03:48] PROBLEM - DPKG on mw1122 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:03:48] PROBLEM - DPKG on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[08:03:48] PROBLEM - nutcracker port on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:03:58] PROBLEM - configured eth on mw1122 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:03:58] PROBLEM - HHVM rendering on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:04:07] RECOVERY - Apache HTTP on mw1200 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 7.585 second response time [08:04:08] PROBLEM - HHVM processes on mw1122 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:04:12] api outage [08:04:14] ongoing [08:04:16] <_joe_> yes [08:04:17] PROBLEM - HHVM processes on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:04:17] PROBLEM - Apache HTTP on mw1121 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:04:18] RECOVERY - Apache HTTP on mw1221 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 3.577 second response time [08:04:18] PROBLEM - dhclient process on mw1122 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:04:19] PROBLEM - nutcracker port on mw1122 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:04:19] RECOVERY - Apache HTTP on mw1128 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 6.570 second response time [08:04:24] eeek [08:04:27] PROBLEM - nutcracker process on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:04:27] RECOVERY - Apache HTTP on mw1228 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 7.000 second response time [08:04:28] PROBLEM - configured eth on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:04:28] PROBLEM - puppet last run on mw1122 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:04:28] RECOVERY - nutcracker port on mw1130 is OK: TCP OK - 0.000 second response time on port 11212 [08:04:28] PROBLEM - Disk space on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:04:36] aude: what is this? [08:04:37] RECOVERY - SSH on mw1130 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [08:04:37] RECOVERY - Apache HTTP on mw1235 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.053 second response time [08:04:38] PROBLEM - DPKG on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:04:42] paravoid: no idea [08:04:47] PROBLEM - nutcracker process on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:04:47] RECOVERY - Apache HTTP on mw1222 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 8.822 second response time [08:04:47] RECOVERY - HHVM rendering on mw1222 is OK: HTTP OK: HTTP/1.1 200 OK - 69500 bytes in 8.916 second response time [08:04:48] RECOVERY - RAID on mw1130 is OK: OK: no RAID installed [08:04:48] RECOVERY - HHVM rendering on mw1235 is OK: HTTP OK: HTTP/1.1 200 OK - 69501 bytes in 0.144 second response time [08:04:48] RECOVERY - HHVM rendering on mw1228 is OK: HTTP OK: HTTP/1.1 200 OK - 69501 bytes in 9.999 second response time [08:04:49] restarting hhvm on these servers seems to help [08:04:57] PROBLEM - HHVM rendering on mw1115 is CRITICAL: Connection timed out [08:04:58] RECOVERY - HHVM rendering on mw1224 is OK: HTTP OK: HTTP/1.1 200 OK - 69500 bytes in 8.674 second response time [08:04:59] infinite loop? 
[08:05:00] that's all I know so far [08:05:07] RECOVERY - HHVM rendering on mw1130 is OK: HTTP OK: HTTP/1.1 200 OK - 69501 bytes in 0.820 second response time [08:05:07] PROBLEM - configured eth on mw1129 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:05:07] PROBLEM - HHVM processes on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:05:07] PROBLEM - salt-minion processes on mw1129 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:05:08] RECOVERY - Apache HTTP on mw1194 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 8.882 second response time [08:05:08] PROBLEM - HHVM processes on mw1129 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:05:18] PROBLEM - dhclient process on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:05:19] RECOVERY - Apache HTTP on mw1130 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.217 second response time [08:05:19] RECOVERY - HHVM rendering on mw1225 is OK: HTTP OK: HTTP/1.1 200 OK - 69495 bytes in 0.734 second response time [08:05:27] SELECT /* DatabaseBase::selectRow 10.64.32.84 */ ips_item_id FROM `wb_items_per_site` WHERE ips_site_id = 'svwikivoyage' AND ips_site_page = 'API' LIMIT 1 [08:05:27] RECOVERY - HHVM rendering on mw1221 is OK: HTTP OK: HTTP/1.1 200 OK - 69495 bytes in 0.834 second response time [08:05:28] RECOVERY - HHVM rendering on mw1200 is OK: HTTP OK: HTTP/1.1 200 OK - 69496 bytes in 2.370 second response time [08:05:28] PROBLEM - puppet last run on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:05:28] PROBLEM - puppet last run on mw1129 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:05:28] RECOVERY - HHVM rendering on mw1196 is OK: HTTP OK: HTTP/1.1 200 OK - 69496 bytes in 3.686 second response time [08:05:28] RECOVERY - Apache HTTP on mw1196 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 4.648 second response time [08:05:29] RECOVERY - Apache HTTP on mw1227 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 8.631 second response time [08:05:29] RECOVERY - HHVM rendering on mw1190 is OK: HTTP OK: HTTP/1.1 200 OK - 69496 bytes in 6.185 second response time [08:05:33] returns immediately, fwiw [08:05:37] PROBLEM - salt-minion processes on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:05:37] RECOVERY - Apache HTTP on mw1225 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.036 second response time [08:05:38] RECOVERY - HHVM queue size on mw1117 is OK: OK: Less than 30.00% above the threshold [10.0] [08:05:38] RECOVERY - HHVM rendering on mw1227 is OK: HTTP OK: HTTP/1.1 200 OK - 69496 bytes in 7.527 second response time [08:05:38] PROBLEM - nutcracker port on mw1129 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:05:38] PROBLEM - DPKG on mw1129 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:05:38] PROBLEM - RAID on mw1129 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:05:38] PROBLEM - SSH on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:05:39] PROBLEM - dhclient process on mw1129 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:05:43] <_joe_> paravoid: returns what? [08:05:47] RECOVERY - Apache HTTP on mw1123 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.641 second response time [08:05:48] PROBLEM - Disk space on mw1129 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[08:05:48] PROBLEM - RAID on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:05:48] RECOVERY - Apache HTTP on mw1224 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 6.053 second response time [08:05:57] PROBLEM - Apache HTTP on mw1114 is CRITICAL: Connection timed out [08:05:58] PROBLEM - SSH on mw1129 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:06:00] nothing [08:06:07] RECOVERY - HHVM rendering on mw1234 is OK: HTTP OK: HTTP/1.1 200 OK - 69496 bytes in 5.490 second response time [08:06:08] PROBLEM - nutcracker port on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:06:08] PROBLEM - nutcracker process on mw1129 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:06:17] databases are just getting hammered by this and the Dwimmerlaik query [08:06:17] RECOVERY - HHVM rendering on mw1123 is OK: HTTP OK: HTTP/1.1 200 OK - 69496 bytes in 8.116 second response time [08:06:18] RECOVERY - HHVM queue size on mw1127 is OK: OK: Less than 30.00% above the threshold [10.0] [08:06:18] RECOVERY - HHVM rendering on mw1194 is OK: HTTP OK: HTTP/1.1 200 OK - 69496 bytes in 3.990 second response time [08:06:19] RECOVERY - HHVM busy threads on mw1127 is OK: OK: Less than 30.00% above the threshold [57.6] [08:06:19] RECOVERY - Apache HTTP on mw1190 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 5.493 second response time [08:06:27] PROBLEM - HHVM rendering on mw1132 is CRITICAL: Connection timed out [08:06:37] RECOVERY - HHVM busy threads on mw1117 is OK: OK: Less than 30.00% above the threshold [57.6] [08:06:38] RECOVERY - HHVM rendering on mw1233 is OK: HTTP OK: HTTP/1.1 200 OK - 69495 bytes in 3.871 second response time [08:06:38] RECOVERY - HHVM rendering on mw1226 is OK: HTTP OK: HTTP/1.1 200 OK - 69496 bytes in 4.281 second response time [08:06:38] RECOVERY - HHVM rendering on mw1128 is OK: HTTP OK: HTTP/1.1 200 OK - 69496 bytes in 1.777 second response time [08:06:47] <_joe_> some of the servers are unreachable btw [08:06:49] PROBLEM - HHVM processes on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:06:56] which? [08:06:57] <_joe_> or I can't connect [08:06:58] PROBLEM - nutcracker port on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:06:58] PROBLEM - puppet last run on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:07:01] <_joe_> mw1114? [08:07:07] RECOVERY - Apache HTTP on mw1233 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.488 second response time [08:07:07] RECOVERY - Apache HTTP on mw1223 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 4.719 second response time [08:07:07] PROBLEM - SSH on mw1143 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:07:08] RECOVERY - HHVM rendering on mw1197 is OK: HTTP OK: HTTP/1.1 200 OK - 69496 bytes in 3.860 second response time [08:07:18] RECOVERY - HHVM rendering on mw1146 is OK: HTTP OK: HTTP/1.1 200 OK - 69496 bytes in 8.448 second response time [08:07:18] PROBLEM - RAID on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[08:07:18] RECOVERY - HHVM rendering on mw1142 is OK: HTTP OK: HTTP/1.1 200 OK - 69496 bytes in 1.139 second response time [08:07:18] RECOVERY - HHVM rendering on mw1133 is OK: HTTP OK: HTTP/1.1 200 OK - 69496 bytes in 1.568 second response time [08:07:18] RECOVERY - Apache HTTP on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.376 second response time [08:07:18] RECOVERY - HHVM rendering on mw1223 is OK: HTTP OK: HTTP/1.1 200 OK - 69495 bytes in 3.055 second response time [08:07:28] RECOVERY - Apache HTTP on mw1205 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.045 second response time [08:07:28] PROBLEM - Apache HTTP on mw1125 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:07:28] RECOVERY - Apache HTTP on mw1146 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 4.977 second response time [08:07:31] let me know if we should send out a tweet (https://wikitech.wikimedia.org/wiki/Incident_response#Communicating_with_the_public ) [08:07:37] there a bunch of SlowTimer [10000ms] at runtime/ext_mysql: slow query: SELECT MASTER_POS_WAIT('db1058-bin.002580', [08:07:38] PROBLEM - dhclient process on mw1126 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:07:38] RECOVERY - Apache HTTP on mw1226 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.378 second response time [08:07:38] RECOVERY - Apache HTTP on mw1142 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.221 second response time [08:07:43] but i guess that is a side effect [08:07:47] RECOVERY - Apache HTTP on mw1197 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.128 second response time [08:07:47] PROBLEM - configured eth on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:07:47] PROBLEM - puppet last run on mw1126 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:07:47] PROBLEM - DPKG on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:07:47] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.625 second response time [08:07:48] RECOVERY - Apache HTTP on mw1208 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.031 second response time [08:07:48] RECOVERY - Apache HTTP on mw1202 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.821 second response time [08:07:48] RECOVERY - HHVM rendering on mw1202 is OK: HTTP OK: HTTP/1.1 200 OK - 69496 bytes in 1.921 second response time [08:07:57] RECOVERY - SSH on mw1135 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [08:07:57] PROBLEM - salt-minion processes on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:07:57] PROBLEM - RAID on mw1126 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:07:57] PROBLEM - Disk space on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:07:58] PROBLEM - dhclient process on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:07:58] RECOVERY - Disk space on mw1135 is OK: DISK OK [08:07:58] PROBLEM - SSH on mw1126 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:07:58] RECOVERY - HHVM rendering on mw1138 is OK: HTTP OK: HTTP/1.1 200 OK - 69496 bytes in 1.852 second response time [08:07:58] PROBLEM - nutcracker process on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[08:07:59] RECOVERY - Apache HTTP on mw1189 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.860 second response time [08:07:59] RECOVERY - HHVM rendering on mw1205 is OK: HTTP OK: HTTP/1.1 200 OK - 69503 bytes in 0.681 second response time [08:08:00] <_joe_> mh, ganglia reports the whole cluster "down" [08:08:00] RECOVERY - HHVM rendering on mw1208 is OK: HTTP OK: HTTP/1.1 200 OK - 69496 bytes in 0.812 second response time [08:08:09] HaeB, doesn't seem very user-noticeable so far [08:08:17] RECOVERY - Apache HTTP on mw1203 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 5.381 second response time [08:08:17] RECOVERY - HHVM rendering on mw1203 is OK: HTTP OK: HTTP/1.1 200 OK - 69496 bytes in 5.491 second response time [08:08:17] RECOVERY - salt-minion processes on mw1135 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [08:08:17] PROBLEM - nutcracker port on mw1126 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:08:18] RECOVERY - LVS HTTP IPv4 on api.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 17192 bytes in 0.187 second response time [08:08:18] the cause is QPS [08:08:21] RECOVERY - Apache HTTP on mw1124 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.444 second response time [08:08:21] RECOVERY - Apache HTTP on mw1132 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.065 second response time [08:08:22] RECOVERY - Apache HTTP on mw1140 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.061 second response time [08:08:22] RECOVERY - configured eth on mw1135 is OK: NRPE: Unable to read output [08:08:22] RECOVERY - HHVM rendering on mw1144 is OK: HTTP OK: HTTP/1.1 200 OK - 69496 bytes in 1.048 second response time [08:08:22] to databases [08:08:25] HaeB: limited impact, and seems to be almost over [08:08:27] RECOVERY - HHVM rendering on mw1121 is OK: HTTP OK: HTTP/1.1 200 OK - 69496 bytes in 9.657 second response time [08:08:27] RECOVERY - HHVM rendering on mw1207 is OK: HTTP OK: HTTP/1.1 200 OK - 69496 bytes in 0.875 second response time [08:08:30] it's not over, no [08:08:37] RECOVERY - Apache HTTP on mw1195 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.041 second response time [08:08:37] RECOVERY - Apache HTTP on mw1232 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.032 second response time [08:08:37] RECOVERY - Apache HTTP on mw1201 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.050 second response time [08:08:37] RECOVERY - HHVM rendering on mw1120 is OK: HTTP OK: HTTP/1.1 200 OK - 69496 bytes in 0.150 second response time [08:08:37] RECOVERY - HHVM rendering on mw1199 is OK: HTTP OK: HTTP/1.1 200 OK - 69496 bytes in 0.181 second response time [08:08:37] RECOVERY - HHVM rendering on mw1145 is OK: HTTP OK: HTTP/1.1 200 OK - 69496 bytes in 1.343 second response time [08:08:37] RECOVERY - Apache HTTP on mw1231 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.052 second response time [08:08:38] RECOVERY - Apache HTTP on mw1133 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.092 second response time [08:08:47] RECOVERY - Apache HTTP on mw1120 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.062 second response time [08:08:48] RECOVERY - dhclient process on mw1135 is OK: PROCS OK: 0 processes with command name dhclient [08:08:48] RECOVERY - Apache HTTP on mw1148 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 3.744 second response time [08:08:49] RECOVERY - 
Apache HTTP on mw1136 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.059 second response time [08:08:49] RECOVERY - Apache HTTP on mw1193 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.052 second response time [08:08:49] RECOVERY - HHVM rendering on mw1232 is OK: HTTP OK: HTTP/1.1 200 OK - 69495 bytes in 0.107 second response time [08:08:49] RECOVERY - nutcracker port on mw1135 is OK: TCP OK - 0.000 second response time on port 11212 [08:08:49] RECOVERY - HHVM rendering on mw1192 is OK: HTTP OK: HTTP/1.1 200 OK - 69496 bytes in 0.146 second response time [08:08:57] RECOVERY - DPKG on mw1135 is OK: All packages OK [08:08:57] PROBLEM - configured eth on mw1126 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:08:58] PROBLEM - DPKG on mw1126 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:08:58] RECOVERY - Apache HTTP on mw1125 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.075 second response time [08:08:58] RECOVERY - Apache HTTP on mw1144 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.079 second response time [08:08:59] RECOVERY - HHVM rendering on mw1139 is OK: HTTP OK: HTTP/1.1 200 OK - 69496 bytes in 0.208 second response time [08:08:59] RECOVERY - HHVM rendering on mw1136 is OK: HTTP OK: HTTP/1.1 200 OK - 69496 bytes in 0.209 second response time [08:08:59] RECOVERY - HHVM rendering on mw1191 is OK: HTTP OK: HTTP/1.1 200 OK - 69504 bytes in 0.172 second response time [08:09:07] RECOVERY - Apache HTTP on mw1138 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.064 second response time [08:09:08] RECOVERY - HHVM rendering on mw1124 is OK: HTTP OK: HTTP/1.1 200 OK - 69496 bytes in 0.153 second response time [08:09:08] MaxSem: i know, it's just the API... 
but then again i came here because i can't save edits on commons right now (because of 503 errors for the API) [08:09:09] RECOVERY - Apache HTTP on mw1145 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.055 second response time [08:09:09] RECOVERY - Apache HTTP on mw1139 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.051 second response time [08:09:17] RECOVERY - Apache HTTP on mw1199 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.048 second response time [08:09:17] RECOVERY - Apache HTTP on mw1121 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.080 second response time [08:09:17] RECOVERY - HHVM processes on mw1135 is OK: PROCS OK: 1 process with command name hhvm [08:09:18] RECOVERY - HHVM queue size on mw1147 is OK: OK: Less than 30.00% above the threshold [10.0] [08:09:27] RECOVERY - Apache HTTP on mw1131 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.052 second response time [08:09:27] RECOVERY - HHVM rendering on mw1135 is OK: HTTP OK: HTTP/1.1 200 OK - 69496 bytes in 0.679 second response time [08:09:28] RECOVERY - HHVM busy threads on mw1116 is OK: OK: Less than 30.00% above the threshold [57.6] [08:09:28] RECOVERY - Apache HTTP on mw1191 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.042 second response time [08:09:28] RECOVERY - Apache HTTP on mw1115 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.051 second response time [08:09:28] RECOVERY - Apache HTTP on mw1230 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.041 second response time [08:09:28] RECOVERY - nutcracker process on mw1135 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [08:09:28] RECOVERY - HHVM rendering on mw1231 is OK: HTTP OK: HTTP/1.1 200 OK - 69495 bytes in 0.125 second response time [08:09:29] RECOVERY - HHVM rendering on mw1195 is OK: HTTP OK: HTTP/1.1 200 OK - 69496 bytes in 0.149 second response time [08:09:37] RECOVERY - RAID on mw1135 is OK: OK: no RAID installed [08:09:37] RECOVERY - HHVM rendering on mw1230 is OK: HTTP OK: HTTP/1.1 200 OK - 69495 bytes in 0.114 second response time [08:09:38] RECOVERY - HHVM rendering on mw1140 is OK: HTTP OK: HTTP/1.1 200 OK - 69496 bytes in 0.230 second response time [08:09:39] RECOVERY - HHVM rendering on mw1148 is OK: HTTP OK: HTTP/1.1 200 OK - 69496 bytes in 1.597 second response time [08:09:41] why are these servers coming back? did anyone restart hhvm or similar? 
[08:09:47] RECOVERY - Apache HTTP on mw1206 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.054 second response time [08:09:48] RECOVERY - HHVM rendering on mw1193 is OK: HTTP OK: HTTP/1.1 200 OK - 69496 bytes in 0.148 second response time [08:09:48] RECOVERY - HHVM rendering on mw1206 is OK: HTTP OK: HTTP/1.1 200 OK - 69496 bytes in 0.171 second response time [08:09:48] RECOVERY - Apache HTTP on mw1135 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.076 second response time [08:09:48] RECOVERY - HHVM queue size on mw1116 is OK: OK: Less than 30.00% above the threshold [10.0] [08:09:48] RECOVERY - HHVM rendering on mw1132 is OK: HTTP OK: HTTP/1.1 200 OK - 69503 bytes in 0.158 second response time [08:09:48] RECOVERY - HHVM rendering on mw1137 is OK: HTTP OK: HTTP/1.1 200 OK - 69496 bytes in 0.440 second response time [08:09:49] RECOVERY - HHVM rendering on mw1134 is OK: HTTP OK: HTTP/1.1 200 OK - 69496 bytes in 0.441 second response time [08:09:49] PROBLEM - puppet last run on mw1119 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:09:58] PROBLEM - salt-minion processes on mw1119 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:09:58] RECOVERY - HHVM rendering on mw1115 is OK: HTTP OK: HTTP/1.1 200 OK - 69496 bytes in 0.189 second response time [08:10:07] PROBLEM - MySQL Slave Delay on db1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:10:09] RECOVERY - Apache HTTP on mw1134 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.259 second response time [08:10:09] PROBLEM - MySQL Replication Heartbeat on db1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:10:09] PROBLEM - HHVM processes on mw1119 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:10:14] I did not [08:10:18] RECOVERY - nutcracker port on mw1143 is OK: TCP OK - 0.000 second response time on port 11212 [08:10:18] RECOVERY - Apache HTTP on mw1137 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.082 second response time [08:10:18] RECOVERY - puppet last run on mw1143 is OK: OK: Puppet is currently enabled, last run 15 minutes ago with 0 failures [08:10:27] RECOVERY - SSH on mw1143 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [08:10:28] PROBLEM - SSH on mw1119 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:10:28] PROBLEM - MySQL InnoDB on db1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:10:35] T=1422ms action=wbgetclaims format=json entity=Q5296 property=P373 callback=jQuery111201273..... [08:10:38] PROBLEM - DPKG on mw1119 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:10:38] RECOVERY - RAID on mw1143 is OK: OK: no RAID installed [08:10:38] PROBLEM - HHVM processes on mw1126 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:10:38] PROBLEM - RAID on mw1119 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:10:38] PROBLEM - Disk space on mw1119 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:10:38] PROBLEM - nutcracker process on mw1126 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:10:49] PROBLEM - nutcracker process on mw1119 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[08:10:58] RECOVERY - HHVM rendering on mw1143 is OK: HTTP OK: HTTP/1.1 200 OK - 69509 bytes in 9.150 second response time [08:10:58] RECOVERY - Apache HTTP on mw1143 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.105 second response time [08:10:58] PROBLEM - salt-minion processes on mw1126 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:10:59] PROBLEM - MySQL Recent Restart on db1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:11:07] RECOVERY - configured eth on mw1143 is OK: NRPE: Unable to read output [08:11:07] RECOVERY - DPKG on mw1143 is OK: All packages OK [08:11:08] PROBLEM - dhclient process on mw1119 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:11:17] RECOVERY - nutcracker port on mw1122 is OK: TCP OK - 0.000 second response time on port 11212 [08:11:17] RECOVERY - dhclient process on mw1122 is OK: PROCS OK: 0 processes with command name dhclient [08:11:17] PROBLEM - MySQL Slave Running on db1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:11:17] PROBLEM - nutcracker port on mw1119 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:11:17] RECOVERY - salt-minion processes on mw1143 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [08:11:18] RECOVERY - Disk space on mw1143 is OK: DISK OK [08:11:18] RECOVERY - dhclient process on mw1143 is OK: PROCS OK: 0 processes with command name dhclient [08:11:18] RECOVERY - nutcracker process on mw1143 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [08:11:18] RECOVERY - puppet last run on mw1122 is OK: OK: Puppet is currently enabled, last run 18 minutes ago with 0 failures [08:11:19] PROBLEM - MySQL Processlist on db1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:11:28] PROBLEM - configured eth on mw1119 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:11:37] PROBLEM - MySQL Idle Transactions on db1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[08:11:38] RECOVERY - Apache HTTP on mw1122 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 4.342 second response time [08:11:47] RECOVERY - nutcracker process on mw1122 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [08:11:57] RECOVERY - SSH on mw1122 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [08:11:57] RECOVERY - salt-minion processes on mw1122 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [08:11:57] RECOVERY - HHVM processes on mw1143 is OK: PROCS OK: 1 process with command name hhvm [08:11:58] RECOVERY - RAID on mw1122 is OK: OK: no RAID installed [08:11:59] RECOVERY - HHVM rendering on mw1122 is OK: HTTP OK: HTTP/1.1 200 OK - 69501 bytes in 1.368 second response time [08:12:07] RECOVERY - Disk space on mw1122 is OK: DISK OK [08:12:18] RECOVERY - DPKG on mw1122 is OK: All packages OK [08:12:28] RECOVERY - configured eth on mw1122 is OK: NRPE: Unable to read output [08:12:39] RECOVERY - HHVM processes on mw1122 is OK: PROCS OK: 1 process with command name hhvm [08:12:48] RECOVERY - HHVM queue size on mw1235 is OK: OK: Less than 30.00% above the threshold [10.0] [08:12:48] RECOVERY - HHVM queue size on mw1130 is OK: OK: Less than 30.00% above the threshold [10.0] [08:13:17] RECOVERY - configured eth on mw1119 is OK: NRPE: Unable to read output [08:13:18] RECOVERY - salt-minion processes on mw1119 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [08:13:27] RECOVERY - HHVM processes on mw1119 is OK: PROCS OK: 1 process with command name hhvm [08:13:28] RECOVERY - HHVM busy threads on mw1130 is OK: OK: Less than 30.00% above the threshold [57.6] [08:13:28] RECOVERY - HHVM busy threads on mw1235 is OK: OK: Less than 30.00% above the threshold [76.8] [08:13:47] RECOVERY - SSH on mw1119 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [08:13:48] RECOVERY - DPKG on mw1119 is OK: All packages OK [08:13:57] RECOVERY - Disk space on mw1119 is OK: DISK OK [08:13:58] RECOVERY - RAID on mw1119 is OK: OK: no RAID installed [08:14:08] RECOVERY - nutcracker process on mw1119 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [08:14:08] RECOVERY - HHVM rendering on mw1119 is OK: HTTP OK: HTTP/1.1 200 OK - 69501 bytes in 0.216 second response time [08:14:27] RECOVERY - dhclient process on mw1119 is OK: PROCS OK: 0 processes with command name dhclient [08:14:28] RECOVERY - nutcracker port on mw1119 is OK: TCP OK - 0.000 second response time on port 11212 [08:14:28] RECOVERY - HHVM queue size on mw1197 is OK: OK: Less than 30.00% above the threshold [10.0] [08:14:58] RECOVERY - Apache HTTP on mw1119 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.076 second response time [08:15:09] RECOVERY - HHVM queue size on mw1205 is OK: OK: Less than 30.00% above the threshold [10.0] [08:15:17] RECOVERY - HHVM queue size on mw1202 is OK: OK: Less than 30.00% above the threshold [10.0] [08:15:17] RECOVERY - HHVM queue size on mw1125 is OK: OK: Less than 30.00% above the threshold [10.0] [08:15:28] RECOVERY - HHVM queue size on mw1128 is OK: OK: Less than 30.00% above the threshold [10.0] [08:15:38] RECOVERY - HHVM queue size on mw1191 is OK: OK: Less than 30.00% above the threshold [10.0] [08:15:39] RECOVERY - HHVM queue size on mw1221 is OK: OK: Less than 30.00% above the threshold [10.0] [08:15:39] RECOVERY - HHVM queue size on mw1200 is OK: OK: Less than 30.00% above the threshold [10.0] [08:15:39] RECOVERY - HHVM queue size 
on mw1203 is OK: OK: Less than 30.00% above the threshold [10.0] [08:15:47] RECOVERY - HHVM queue size on mw1142 is OK: OK: Less than 30.00% above the threshold [10.0] [08:15:48] RECOVERY - HHVM queue size on mw1201 is OK: OK: Less than 30.00% above the threshold [10.0] [08:15:58] RECOVERY - HHVM queue size on mw1226 is OK: OK: Less than 30.00% above the threshold [10.0] [08:16:08] RECOVERY - HHVM queue size on mw1233 is OK: OK: Less than 30.00% above the threshold [10.0] [08:16:17] RECOVERY - HHVM queue size on mw1196 is OK: OK: Less than 30.00% above the threshold [10.0] [08:16:18] RECOVERY - HHVM queue size on mw1199 is OK: OK: Less than 30.00% above the threshold [10.0] [08:16:18] RECOVERY - HHVM queue size on mw1204 is OK: OK: Less than 30.00% above the threshold [10.0] [08:16:18] RECOVERY - HHVM queue size on mw1135 is OK: OK: Less than 30.00% above the threshold [10.0] [08:16:27] RECOVERY - HHVM queue size on mw1193 is OK: OK: Less than 30.00% above the threshold [10.0] [08:16:28] RECOVERY - HHVM queue size on mw1229 is OK: OK: Less than 30.00% above the threshold [10.0] [08:16:29] RECOVERY - HHVM queue size on mw1133 is OK: OK: Less than 30.00% above the threshold [10.0] [08:16:29] RECOVERY - HHVM queue size on mw1232 is OK: OK: Less than 30.00% above the threshold [10.0] [08:16:38] RECOVERY - HHVM queue size on mw1208 is OK: OK: Less than 30.00% above the threshold [10.0] [08:16:38] RECOVERY - HHVM queue size on mw1231 is OK: OK: Less than 30.00% above the threshold [10.0] [08:16:39] RECOVERY - HHVM queue size on mw1225 is OK: OK: Less than 30.00% above the threshold [10.0] [08:16:47] aude: so, pointers to which piece of code would generate these queries? [08:16:48] RECOVERY - HHVM queue size on mw1227 is OK: OK: Less than 30.00% above the threshold [10.0] [08:16:48] RECOVERY - HHVM queue size on mw1131 is OK: OK: Less than 30.00% above the threshold [10.0] [08:16:48] RECOVERY - HHVM queue size on mw1223 is OK: OK: Less than 30.00% above the threshold [10.0] [08:16:54] aude: how would a URL look like? 
[08:16:57] RECOVERY - HHVM queue size on mw1194 is OK: OK: Less than 30.00% above the threshold [10.0] [08:16:57] RECOVERY - HHVM queue size on mw1140 is OK: OK: Less than 30.00% above the threshold [10.0] [08:16:57] RECOVERY - HHVM queue size on mw1195 is OK: OK: Less than 30.00% above the threshold [10.0] [08:16:58] RECOVERY - HHVM queue size on mw1198 is OK: OK: Less than 30.00% above the threshold [10.0] [08:17:08] RECOVERY - HHVM queue size on mw1234 is OK: OK: Less than 30.00% above the threshold [10.0] [08:17:08] RECOVERY - HHVM queue size on mw1138 is OK: OK: Less than 30.00% above the threshold [10.0] [08:17:08] RECOVERY - HHVM queue size on mw1224 is OK: OK: Less than 30.00% above the threshold [10.0] [08:17:08] RECOVERY - HHVM queue size on mw1132 is OK: OK: Less than 30.00% above the threshold [10.0] [08:17:17] RECOVERY - HHVM queue size on mw1146 is OK: OK: Less than 30.00% above the threshold [10.0] [08:17:18] RECOVERY - HHVM queue size on mw1189 is OK: OK: Less than 30.00% above the threshold [10.0] [08:17:27] RECOVERY - HHVM queue size on mw1230 is OK: OK: Less than 30.00% above the threshold [10.0] [08:17:27] RECOVERY - HHVM queue size on mw1137 is OK: OK: Less than 30.00% above the threshold [10.0] [08:17:27] RECOVERY - HHVM queue size on mw1115 is OK: OK: Less than 30.00% above the threshold [10.0] [08:17:27] RECOVERY - HHVM queue size on mw1123 is OK: OK: Less than 30.00% above the threshold [10.0] [08:17:28] RECOVERY - HHVM queue size on mw1145 is OK: OK: Less than 30.00% above the threshold [10.0] [08:17:28] RECOVERY - HHVM busy threads on mw1135 is OK: OK: Less than 30.00% above the threshold [57.6] [08:17:28] RECOVERY - HHVM queue size on mw1192 is OK: OK: Less than 30.00% above the threshold [10.0] [08:17:28] RECOVERY - HHVM queue size on mw1207 is OK: OK: Less than 30.00% above the threshold [10.0] [08:17:28] RECOVERY - HHVM queue size on mw1144 is OK: OK: Less than 30.00% above the threshold [10.0] [08:17:38] RECOVERY - HHVM queue size on mw1121 is OK: OK: Less than 30.00% above the threshold [10.0] [08:17:38] RECOVERY - HHVM queue size on mw1222 is OK: OK: Less than 30.00% above the threshold [10.0] [08:17:38] RECOVERY - HHVM queue size on mw1228 is OK: OK: Less than 30.00% above the threshold [10.0] [08:17:41] paravoid: don't think it's anything in wikibase [08:17:48] RECOVERY - HHVM queue size on mw1120 is OK: OK: Less than 30.00% above the threshold [10.0] [08:17:58] RECOVERY - HHVM queue size on mw1190 is OK: OK: Less than 30.00% above the threshold [10.0] [08:18:07] RECOVERY - HHVM queue size on mw1136 is OK: OK: Less than 30.00% above the threshold [10.0] [08:18:15] if it's api calls [08:18:22] can be a bot or other thing [08:18:28] RECOVERY - HHVM queue size on mw1124 is OK: OK: Less than 30.00% above the threshold [10.0] [08:18:38] RECOVERY - HHVM queue size on mw1206 is OK: OK: Less than 30.00% above the threshold [10.0] [08:18:38] RECOVERY - HHVM queue size on mw1134 is OK: OK: Less than 30.00% above the threshold [10.0] [08:18:48] RECOVERY - configured eth on mw1129 is OK: NRPE: Unable to read output [08:18:49] RECOVERY - salt-minion processes on mw1129 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [08:18:49] RECOVERY - Apache HTTP on mw1129 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 8.721 second response time [08:18:49] RECOVERY - HHVM processes on mw1129 is OK: PROCS OK: 1 process with command name hhvm [08:18:57] RECOVERY - MySQL InnoDB on db1021 is OK: OK longest blocking idle transaction 
sleeps for 0 seconds [08:19:07] RECOVERY - puppet last run on mw1129 is OK: OK: Puppet is currently enabled, last run 31 minutes ago with 0 failures [08:19:18] RECOVERY - nutcracker port on mw1129 is OK: TCP OK - 0.000 second response time on port 11212 [08:19:18] RECOVERY - DPKG on mw1129 is OK: All packages OK [08:19:27] RECOVERY - RAID on mw1129 is OK: OK: no RAID installed [08:19:27] RECOVERY - dhclient process on mw1129 is OK: PROCS OK: 0 processes with command name dhclient [08:19:27] RECOVERY - MySQL Recent Restart on db1021 is OK: OK 24371810 seconds since restart [08:19:28] RECOVERY - Disk space on mw1129 is OK: DISK OK [08:19:38] RECOVERY - HHVM queue size on mw1143 is OK: OK: Less than 30.00% above the threshold [10.0] [08:19:38] RECOVERY - HHVM busy threads on mw1143 is OK: OK: Less than 30.00% above the threshold [57.6] [08:19:47] RECOVERY - SSH on mw1129 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [08:19:47] RECOVERY - MySQL Slave Running on db1021 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [08:19:48] RECOVERY - HHVM queue size on mw1122 is OK: OK: Less than 30.00% above the threshold [10.0] [08:19:48] RECOVERY - MySQL Processlist on db1021 is OK: OK 0 unauthenticated, 0 locked, 0 copy to table, 3 statistics [08:19:57] RECOVERY - nutcracker process on mw1129 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [08:20:07] RECOVERY - MySQL Idle Transactions on db1021 is OK: OK longest blocking idle transaction sleeps for 0 seconds [08:20:18] RECOVERY - MySQL Slave Delay on db1021 is OK: OK replication delay 126 seconds [08:20:19] RECOVERY - MySQL Replication Heartbeat on db1021 is OK: OK replication delay 126 seconds [08:20:38] RECOVERY - HHVM rendering on mw1129 is OK: HTTP OK: HTTP/1.1 200 OK - 69479 bytes in 1.334 second response time [08:20:59] RECOVERY - HHVM busy threads on mw1122 is OK: OK: Less than 30.00% above the threshold [57.6] [08:22:17] RECOVERY - HHVM queue size on mw1119 is OK: OK: Less than 30.00% above the threshold [10.0] [08:22:27] RECOVERY - HHVM queue size on mw1148 is OK: OK: Less than 30.00% above the threshold [10.0] [08:22:38] RECOVERY - HHVM busy threads on mw1147 is OK: OK: Less than 30.00% above the threshold [57.6] [08:25:08] RECOVERY - HHVM queue size on mw1139 is OK: OK: Less than 30.00% above the threshold [10.0] [08:25:38] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [08:26:57] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [08:27:38] PROBLEM - Host mw2027 is DOWN: PING CRITICAL - Packet loss = 100% [08:27:38] RECOVERY - HHVM queue size on mw1129 is OK: OK: Less than 30.00% above the threshold [10.0] [08:27:39] RECOVERY - HHVM processes on mw1126 is OK: PROCS OK: 1 process with command name hhvm [08:27:47] RECOVERY - configured eth on mw1126 is OK: NRPE: Unable to read output [08:27:47] RECOVERY - nutcracker process on mw1126 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [08:27:48] RECOVERY - DPKG on mw1126 is OK: All packages OK [08:28:07] RECOVERY - salt-minion processes on mw1126 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [08:28:08] RECOVERY - dhclient process on mw1126 is OK: PROCS OK: 0 processes with command name dhclient [08:28:09] RECOVERY - HHVM busy threads on mw1129 is OK: OK: Less than 30.00% above the threshold [57.6] [08:28:18] RECOVERY - puppet last run on 
mw1126 is OK: OK: Puppet is currently enabled, last run 41 minutes ago with 0 failures [08:28:29] RECOVERY - SSH on mw1126 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [08:28:37] RECOVERY - RAID on mw1126 is OK: OK: no RAID installed [08:28:48] RECOVERY - Disk space on mw1126 is OK: DISK OK [08:28:49] RECOVERY - nutcracker port on mw1126 is OK: TCP OK - 0.000 second response time on port 11212 [08:28:49] RECOVERY - Host mw2027 is UP: PING WARNING - Packet loss = 28%, RTA = 120.80 ms [08:29:17] RECOVERY - HHVM rendering on mw1126 is OK: HTTP OK: HTTP/1.1 200 OK - 69479 bytes in 0.770 second response time [08:29:28] RECOVERY - Apache HTTP on mw1126 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.068 second response time [08:31:37] RECOVERY - HHVM busy threads on mw1144 is OK: OK: Less than 30.00% above the threshold [57.6] [08:32:38] i do see https://ru.wikipedia.org/wiki/MediaWiki:Sidebar-related.js [08:33:02] aude: Dwimmerlaik is apparently coming from includes/cache/MessageCache.php [08:33:49] and these were requests originating from parsoid [08:34:28] PROBLEM - NTP on mw1114 is CRITICAL: NTP CRITICAL: No response from NTP server [08:34:52] interesting... [08:37:56] RECOVERY - HHVM queue size on mw1126 is OK: OK: Less than 30.00% above the threshold [10.0] [08:37:57] RECOVERY - HHVM busy threads on mw1126 is OK: OK: Less than 30.00% above the threshold [57.6] [08:47:35] aude: https://sv.wikivoyage.org/w/index.php?namespace=828&tagfilter=&title=Special%3ASenaste+%C3%A4ndringar (via MaxSem) [08:47:48] I'm specifically looking at https://sv.wikivoyage.org/w/index.php?title=Modul:Wikibase&curid=10456&diff=69028&oldid=68945 [08:47:54] if is_defined(dbname) and item and item:getSitelink(dbname) then [08:48:28] oh wait that's not very new [08:48:34] doesn't look remarkable to me [08:48:38] dbname = siteid [08:48:44] mmm, italian comments [08:49:08] a real bazaar-style open source! :P [08:49:16] *sigh* [08:49:20] yay broken monitoring https://gdash.wikimedia.org/dashboards/jobq/ [08:50:27] http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20eqiad&h=terbium.eqiad.wmnet&r=hour&z=default&jr=&js=&st=1428051005&v=52865&m=Global%20JobQueue%20length&z=large [08:50:31] MaxSem: confirmed [08:51:04] wow [08:51:22] confirmed what? [08:51:39] the job queue -> parsoid etc. theory [08:55:42] -- Restituisce il nome della capitale o del capoluogo attuale dell'elemento. [08:55:42] function p.huvudstad(frame) [08:56:14] do they even understand what it is, or they just do shotgun debugging all the time? [08:56:39] so uhm [08:56:40] https://sv.wikivoyage.org/w/index.php?title=Modul:Wikibase&curid=10456&diff=69028&oldid=68945 [08:56:46] local function is_defined(s) [08:56:52] if is_defined(s) then return s end [08:57:05] how's that not a loop? 
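A minimal sketch of the broken helper, pieced together from the Modul:Wikibase fragments quoted above (a reconstruction of just this function, not the full module):

    -- Broken guard as quoted in the diff: the if-condition calls the function
    -- itself unconditionally, so any call recurses until Scribunto aborts the
    -- script with a Lua error instead of ever returning.
    local function is_defined(s)
        if is_defined(s) then return s end
    end

    -- Call site quoted from the same diff:
    -- if is_defined(dbname) and item and item:getSitelink(dbname) then ... end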
[08:57:19] that's from https://sv.wikivoyage.org/w/index.php?title=Modul:Wikibase&curid=10456&diff=69028&oldid=68945 [08:57:47] hm [08:57:55] (infinite recursion, that is) [08:58:34] 6operations, 6MediaWiki-Core-Team, 7Wikimedia-log-errors: rbf1001 and rbf1002 are timing out / dropping clients for Redis - https://phabricator.wikimedia.org/T92591#1177327 (10aaron) [08:58:55] 6operations, 6MediaWiki-Core-Team, 7Wikimedia-log-errors: rbf1001 and rbf1002 are timing out / dropping clients for Redis - https://phabricator.wikimedia.org/T92591#1177334 (10aaron) 5Open>3Resolved [08:58:58] yup, it's reverted a few minutes later [08:59:04] lol [08:59:07] then reintroduced differently [08:59:12] if s and s ~= '' then return s end [08:59:17] then changed again [08:59:26] so, Lua sandboxing is broken? [08:59:31] nah [08:59:37] this was a broken change but I doubt it mattered [09:00:28] well, it was, sort of [09:00:43] I think it resulted in all those jobs being inserted into the job queue [09:00:56] which caused parsoid to ask for that page from the API [09:00:58] which resulted in: [09:01:02] 11:38 < MaxSem> 2015-04-03 08:38:28 mw1203 svwikivoyage: MessageCache::parse called by Scribunto_LuaError::getScriptTraceHtml/Message::parse/Message::toString/Message::parseText/MessageCache::parse with no title set. [09:01:15] this is the LuaError [09:01:28] so presumably this was generated by Scribunto saying that this infinite loop is in error [09:01:33] makes sense [09:01:34] right, there should be better job de-duplication in the parsoid update extension [09:01:39] a few jobs wouldn't hurt by itself [09:01:45] for each of these MessageCache::parse did the Dwimmercrap query [09:01:48] that's just a small wikivoyage [09:02:09] which overloaded s5 slaves [09:02:11] I've edited templates and modules on enwiki used on millions of pages and didn't melt the cluster [09:02:53] that must have been caused by lua errors [09:03:34] lua error called MessageCache::parse [09:03:50] no title set(?), therefore Title::newFromText( 'Dwimmerlaik' ); [09:04:16] alright, I'm gonna post a draft incident report to wikitech & the ops list [09:04:58] paravoid: where did the ="API" form come from? Dwimmercrap was ~ 10% [09:05:53] also, db1049 is still getting hammered by jobrunners, but the other s5 slaves have returned to normal. idk why db1049 is still in the hotseat [09:05:56] springle, it's set by the API [09:06:09] that's from includes/api/ApiQueryAllMessages.php [09:08:01] right, but why does it exist at all, and is it uncached? =API is also an empty result set, like paravoid found with dwimmerlaik [09:16:22] good question... [09:16:35] I think it's a placeholder, I don't know why it would result in a wikidata query? [09:17:12] (03CR) 10Gilles: "I've added a readme file, as you requested." [software/sentry] - 10https://gerrit.wikimedia.org/r/201006 (https://phabricator.wikimedia.org/T84956) (owner: 10Gilles) [09:18:18] !log springle Synchronized wmf-config/db-eqiad.php: reduce db1049 load (duration: 06m 26s) [09:18:23] Logged the message, Master [09:18:53] duration: 06m 26s :O [09:19:25] !log tin sync-file: mw1114.eqiad.wmnet returned [-15] [09:19:28] Logged the message, Master [09:22:02] quick random question: i'd need to find out and log onto a labs instance pointed to by graphoid.wmflabs.org [09:22:09] how can i do that?
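And a sketch of the corrected guard that replaced it, per the reintroduced line quoted above at 08:59:12 (again, only this helper is reconstructed):

    -- Fixed guard: test the value directly instead of recursing;
    -- nil and the empty string count as "not defined".
    local function is_defined(s)
        if s and s ~= '' then return s end
    end

    -- e.g. is_defined('')             --> nil (falsy)
    --      is_defined('svwikivoyage') --> 'svwikivoyage'

As pieced together above, each render that hit the broken version raised a Lua error whose traceback was rendered via MessageCache::parse with no title set, falling back to the 'Dwimmerlaik' placeholder title and producing the wb_items_per_site lookups that overloaded the s5 slaves once the Parsoid jobs fanned the re-renders out across the API cluster.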
[09:22:25] (more of a #-labs question) [09:23:10] mobrovac: that is most probably pointing to the shared web proxy [09:23:19] yep [09:23:22] mobrovac, https://wikitech.wikimedia.org/wiki/Special:NovaProxy [09:23:29] mobrovac: so you would have to be a member of the project to list the web proxies setup and find out the underlying instance [09:23:31] or https://wikitech.wikimedia.org/wiki/Special:NovaAddress [09:23:57] MaxSem: thnx i know, the problem is that that's not in my projects [09:24:04] don't even know in which one, tbh [09:24:04] heh [09:24:19] !log mw1114 critical, no ssh, no console, powercycle [09:24:21] must be a way to look it up in ldap [09:24:23] Logged the message, Master [09:25:38] RECOVERY - nutcracker port on mw1114 is OK: TCP OK - 0.000 second response time on port 11212 [09:25:48] RECOVERY - configured eth on mw1114 is OK: NRPE: Unable to read output [09:25:49] RECOVERY - Disk space on mw1114 is OK: DISK OK [09:25:58] RECOVERY - nutcracker process on mw1114 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [09:25:59] RECOVERY - DPKG on mw1114 is OK: All packages OK [09:26:10] (03CR) 10Hashar: "Can we somehow trick puppet in having base::standard-packages to be executed before? Puppet stages would do but that is a rather intrusi" [puppet] - 10https://gerrit.wikimedia.org/r/201598 (https://phabricator.wikimedia.org/T94917) (owner: 10Krinkle) [09:26:27] RECOVERY - HHVM processes on mw1114 is OK: PROCS OK: 25 processes with command name hhvm [09:26:28] RECOVERY - dhclient process on mw1114 is OK: PROCS OK: 0 processes with command name dhclient [09:26:28] RECOVERY - puppet last run on mw1114 is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures [09:26:47] RECOVERY - salt-minion processes on mw1114 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [09:26:57] RECOVERY - SSH on mw1114 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [09:27:07] RECOVERY - Apache HTTP on mw1114 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.087 second response time [09:27:07] RECOVERY - RAID on mw1114 is OK: OK: no RAID installed [09:27:38] RECOVERY - HHVM rendering on mw1114 is OK: HTTP OK: HTTP/1.1 200 OK - 69510 bytes in 1.249 second response time [09:28:18] RECOVERY - HHVM busy threads on mw1198 is OK: OK: Less than 30.00% above the threshold [76.8] [09:28:57] RECOVERY - HHVM busy threads on mw1133 is OK: OK: Less than 30.00% above the threshold [57.6] [09:30:17] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 621 [09:32:43] !log springle Synchronized wmf-config/db-eqiad.php: depool db1049 (duration: 00m 20s) [09:32:48] Logged the message, Master [09:34:27] RECOVERY - HHVM busy threads on mw1114 is OK: OK: Less than 30.00% above the threshold [57.6] [09:34:47] RECOVERY - HHVM queue size on mw1114 is OK: OK: Less than 30.00% above the threshold [10.0] [09:35:17] RECOVERY - check_mysql on db1008 is OK: Uptime: 1881949 Threads: 1 Questions: 12395217 Slow queries: 12597 Opens: 37849 Flush tables: 2 Open tables: 64 Queries per second avg: 6.586 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [09:41:59] springle: is the "Lag" on https://dbtree.wikimedia.org/ rounded to seconds? [09:42:19] would be nice if it was x.xx [09:44:16] turbocat: seconds is a good as it gets [09:44:28] paravoid: still around? [09:47:18] 3~yes [09:48:07] what's up? 
[09:48:26] at least api and jobrunner app servers are fixated on db1049 somehow. it's depooled, but connections continue. i tried a round of pt-kill, but they return. any ideas? [09:48:40] mw1123 for eg [09:50:05] usually everything responds quite quickly to a sync-file, except jobrunners which take time [09:52:01] currently 4001 wikiuser connections to db1049, all still running the same wikidata query from the outage, despite depooling db1049, and other s5 slaves lightly loaded [09:52:20] 4001 is stable. no fluctuation [09:53:25] eventually the job runner coordinators will kill the workers (3600 sec) [09:53:38] RECOVERY - HHVM busy threads on mw1148 is OK: OK: Less than 30.00% above the threshold [57.6] [09:54:17] why would api servers be taking longer that normal? are connections from jobrunners > api > db held open somehow? [09:56:29] i suppose it isn't a showstopper. but it's unusual behavior == nervous [09:59:45] hm [10:00:05] so we got hundreds of thousands of connections right [10:00:23] but the jobs weren't that many [10:01:53] (03CR) 10Krinkle: "@Hashar See https://phabricator.wikimedia.org/T94927" [puppet] - 10https://gerrit.wikimedia.org/r/201598 (https://phabricator.wikimedia.org/T94917) (owner: 10Krinkle) [10:07:48] RECOVERY - HHVM busy threads on mw1123 is OK: OK: Less than 30.00% above the threshold [57.6] [10:08:08] RECOVERY - HHVM busy threads on mw1115 is OK: OK: Less than 30.00% above the threshold [57.6] [10:08:09] so... [10:08:17] mw1208 is depooled, but still emits those errors [10:10:30] 72 active connections from mw1208 to db1049 [10:10:53] ok, that's just crazy [10:11:06] smells like a luasandbox bug [10:11:07] none from mw1208 to other s5 slaves [10:14:03] still? [10:14:28] RECOVERY - HHVM busy threads on mw1208 is OK: OK: Less than 30.00% above the threshold [76.8] [10:14:46] seems so, I'll restart [10:15:07] mw1208 connections just died. i guess that was the restart? [10:15:59] these are the offending clients: http://aerosuidae.net/paste/16/551e669c [10:16:21] apis [10:16:23] alright [10:16:28] I'll restart all API servers but one [10:16:39] if that's okay with you [10:16:45] ok [10:16:47] might be useful to someone to live gdb debug [10:16:54] yep [10:19:08] PROBLEM - puppet last run on mw1128 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:19:38] PROBLEM - RAID on mw1128 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[10:20:33] !log staggered restart of the API cluster (sans mw1234, left for further debugging) [10:20:40] Logged the message, Master [10:20:49] RECOVERY - puppet last run on mw1128 is OK: OK: Puppet is currently enabled, last run 20 minutes ago with 0 failures [10:21:17] RECOVERY - RAID on mw1128 is OK: OK: no RAID installed [10:21:47] PROBLEM - HHVM busy threads on mw1128 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [86.4] [10:22:57] RECOVERY - HHVM busy threads on mw1232 is OK: OK: Less than 30.00% above the threshold [76.8] [10:23:18] RECOVERY - HHVM busy threads on mw1142 is OK: OK: Less than 30.00% above the threshold [57.6] [10:23:27] today I learned that having a puppetmaster on Trusty does not work :D [10:23:37] RECOVERY - HHVM busy threads on mw1200 is OK: OK: Less than 30.00% above the threshold [76.8] [10:23:47] RECOVERY - HHVM busy threads on mw1120 is OK: OK: Less than 30.00% above the threshold [57.6] [10:24:18] RECOVERY - HHVM busy threads on mw1196 is OK: OK: Less than 30.00% above the threshold [76.8] [10:24:38] RECOVERY - HHVM busy threads on mw1201 is OK: OK: Less than 30.00% above the threshold [76.8] [10:25:27] PROBLEM - puppet last run on mw1134 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:25:38] RECOVERY - HHVM busy threads on mw1137 is OK: OK: Less than 30.00% above the threshold [57.6] [10:25:39] PROBLEM - SSH on mw1134 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:26:58] RECOVERY - puppet last run on mw1134 is OK: OK: Puppet is currently enabled, last run 22 minutes ago with 0 failures [10:27:17] RECOVERY - SSH on mw1134 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [10:27:55] springle: should be much better, yes? [10:28:37] RECOVERY - HHVM busy threads on mw1128 is OK: OK: Less than 30.00% above the threshold [57.6] [10:28:40] paravoid: yep [10:28:48] RECOVERY - HHVM busy threads on mw1131 is OK: OK: Less than 30.00% above the threshold [57.6] [10:29:28] RECOVERY - HHVM busy threads on mw1125 is OK: OK: Less than 30.00% above the threshold [57.6] [10:29:30] mw1234 still querying though, right? [10:31:48] RECOVERY - HHVM busy threads on mw1119 is OK: OK: Less than 30.00% above the threshold [57.6] [10:32:03] !log depooled mw1234 [10:32:04] paravoid: no, mw1234 is quiet [10:32:06] Logged the message, Master [10:32:07] RECOVERY - HHVM busy threads on mw1221 is OK: OK: Less than 30.00% above the threshold [76.8] [10:32:07] RECOVERY - HHVM busy threads on mw1121 is OK: OK: Less than 30.00% above the threshold [57.6] [10:32:08] ah, shit [10:32:18] RECOVERY - HHVM busy threads on mw1224 is OK: OK: Less than 30.00% above the threshold [76.8] [10:32:28] but, i grabbed a gdb stack dump from mw1234 about 5min ago [10:32:36] oh! [10:32:37] for hhvm [10:32:37] awesome :) [10:32:48] RECOVERY - HHVM busy threads on mw1145 is OK: OK: Less than 30.00% above the threshold [57.6] [10:32:52] don't think i killed it. 
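The stack dump springle mentions grabbing from mw1234 above is the standard way to snapshot what every HHVM thread is doing without stopping the process. A minimal sketch of that kind of capture (the output path and the use of `pidof` are assumptions, not a record of the exact command used):
```
# Attach non-interactively, dump a backtrace of every thread, then detach.
gdb -p "$(pidof -s hhvm)" -batch \
    -ex 'set pagination off' \
    -ex 'thread apply all bt' > /tmp/hhvm-stacks-$(date +%s).txt
```
Because gdb only pauses the process while it walks the stacks, the target keeps running once gdb detaches.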
it did continue hammering afterwards [10:32:58] RECOVERY - HHVM busy threads on mw1193 is OK: OK: Less than 30.00% above the threshold [76.8] [10:33:08] RECOVERY - HHVM busy threads on mw1140 is OK: OK: Less than 30.00% above the threshold [57.6] [10:33:17] RECOVERY - HHVM busy threads on mw1189 is OK: OK: Less than 30.00% above the threshold [76.8] [10:33:17] RECOVERY - HHVM busy threads on mw1139 is OK: OK: Less than 30.00% above the threshold [57.6] [10:33:18] RECOVERY - HHVM busy threads on mw1136 is OK: OK: Less than 30.00% above the threshold [57.6] [10:33:18] RECOVERY - HHVM busy threads on mw1205 is OK: OK: Less than 30.00% above the threshold [76.8] [10:33:19] RECOVERY - HHVM busy threads on mw1202 is OK: OK: Less than 30.00% above the threshold [76.8] [10:33:28] RECOVERY - HHVM busy threads on mw1138 is OK: OK: Less than 30.00% above the threshold [57.6] [10:33:37] RECOVERY - HHVM busy threads on mw1132 is OK: OK: Less than 30.00% above the threshold [57.6] [10:33:38] RECOVERY - HHVM busy threads on mw1146 is OK: OK: Less than 30.00% above the threshold [57.6] [10:33:48] RECOVERY - HHVM busy threads on mw1192 is OK: OK: Less than 30.00% above the threshold [76.8] [10:33:57] RECOVERY - HHVM busy threads on mw1191 is OK: OK: Less than 30.00% above the threshold [76.8] [10:34:08] RECOVERY - HHVM busy threads on mw1233 is OK: OK: Less than 30.00% above the threshold [76.8] [10:34:08] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 83 data above and 9 below the confidence bounds [10:34:17] RECOVERY - HHVM busy threads on mw1207 is OK: OK: Less than 30.00% above the threshold [76.8] [10:34:18] RECOVERY - HHVM busy threads on mw1203 is OK: OK: Less than 30.00% above the threshold [76.8] [10:34:27] RECOVERY - HHVM busy threads on mw1197 is OK: OK: Less than 30.00% above the threshold [76.8] [10:34:28] RECOVERY - HHVM busy threads on mw1124 is OK: OK: Less than 30.00% above the threshold [57.6] [10:34:28] RECOVERY - HHVM busy threads on mw1199 is OK: OK: Less than 30.00% above the threshold [76.8] [10:34:28] RECOVERY - HHVM busy threads on mw1222 is OK: OK: Less than 30.00% above the threshold [76.8] [10:34:47] RECOVERY - HHVM busy threads on mw1190 is OK: OK: Less than 30.00% above the threshold [76.8] [10:34:47] RECOVERY - HHVM busy threads on mw1194 is OK: OK: Less than 30.00% above the threshold [76.8] [10:34:48] RECOVERY - HHVM busy threads on mw1234 is OK: OK: Less than 30.00% above the threshold [76.8] [10:35:07] RECOVERY - HHVM busy threads on mw1225 is OK: OK: Less than 30.00% above the threshold [76.8] [10:35:09] RECOVERY - HHVM busy threads on mw1229 is OK: OK: Less than 30.00% above the threshold [76.8] [10:35:09] RECOVERY - HHVM busy threads on mw1195 is OK: OK: Less than 30.00% above the threshold [76.8] [10:35:28] RECOVERY - HHVM busy threads on mw1206 is OK: OK: Less than 30.00% above the threshold [76.8] [10:35:29] RECOVERY - HHVM busy threads on mw1134 is OK: OK: Less than 30.00% above the threshold [57.6] [10:35:57] RECOVERY - HHVM busy threads on mw1228 is OK: OK: Less than 30.00% above the threshold [76.8] [10:36:27] RECOVERY - HHVM busy threads on mw1226 is OK: OK: Less than 30.00% above the threshold [76.8] [10:36:27] RECOVERY - HHVM busy threads on mw1231 is OK: OK: Less than 30.00% above the threshold [76.8] [10:36:27] RECOVERY - HHVM busy threads on mw1223 is OK: OK: Less than 30.00% above the threshold [76.8] [10:36:37] RECOVERY - HHVM busy threads on mw1204 is OK: OK: Less than 30.00% above the threshold 
[76.8] [10:36:58] RECOVERY - HHVM busy threads on mw1230 is OK: OK: Less than 30.00% above the threshold [76.8] [10:36:58] !log springle Synchronized wmf-config/db-eqiad.php: repool db1049, warm up (duration: 00m 12s) [10:37:03] Logged the message, Master [10:37:27] RECOVERY - HHVM busy threads on mw1227 is OK: OK: Less than 30.00% above the threshold [76.8] [10:38:21] 6operations: Photos of Servers - https://phabricator.wikimedia.org/T94694#1177434 (10Steinsplitter) a:5RobH>3None [10:39:57] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: puppet fail [10:40:36] 7Blocked-on-Operations, 10Ops-Access-Requests, 6operations: Access to francium - https://phabricator.wikimedia.org/T94093#1177436 (10Nemo_bis) > please provide us with some other means of testing this at realistic scale On this topic, see https://lists.wikimedia.org/pipermail/labs-l/2015-March/003557.html [10:57:07] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [11:14:03] !log springle Synchronized wmf-config/db-eqiad.php: db1049 to normal load (duration: 00m 11s) [11:14:11] Logged the message, Master [11:50:43] 7Puppet, 6operations, 6Labs, 5Patch-For-Review, 7Regression: Puppet: "Package[gdb] is already declared in file modules/java/manifests/tools.pp" - https://phabricator.wikimedia.org/T94917#1177523 (10hashar) Out of the four precise instances Timo created, only one has the problem: integration-slave-precise... [11:55:35] 7Puppet, 6operations, 10Continuous-Integration, 5Patch-For-Review, 7Regression: Puppet: "Package[git-core] is already declared in file modules/authdns/manifests/scripts.pp" - https://phabricator.wikimedia.org/T94921#1177531 (10hashar) Out of the four precise instances Timo created, only one has the probl... [11:57:28] (03CR) 10Hashar: [C: 031 V: 032] "Applied on the integration puppetmaster. That fixed the puppet run on the new Precise images." [puppet] - 10https://gerrit.wikimedia.org/r/201598 (https://phabricator.wikimedia.org/T94917) (owner: 10Krinkle) [11:57:37] (03CR) 10Hashar: [C: 031 V: 032] "Applied on the integration puppetmaster. That fixed the puppet run on the new Precise images." [puppet] - 10https://gerrit.wikimedia.org/r/201603 (https://phabricator.wikimedia.org/T94921) (owner: 10Krinkle) [12:08:42] 7Puppet, 6operations, 10Continuous-Integration, 7Regression: Puppet: "Could not find class role::ci::slave::labs" - https://phabricator.wikimedia.org/T94925#1177560 (10hashar) 5Open>3Resolved No idea what happened, maybe it was related to the puppetmaster being Trusty? [12:09:25] 6operations, 10Continuous-Integration, 10Wikimedia-Labs-General: role::puppet::self broken on new labs instances - https://phabricator.wikimedia.org/T94834#1177565 (10hashar) [12:26:57] 6operations, 10Continuous-Integration, 10Wikimedia-Labs-General: role::puppet::self broken on new labs instances - https://phabricator.wikimedia.org/T94834#1177607 (10hashar) I had an instance suffering of the issue, I had to recreate it. I can confirm puppet runs just fine now. Thank you! [12:35:33] 6operations, 7network: hook up and dns oob access - https://phabricator.wikimedia.org/T80847#1177633 (10faidon) 5Open>3Resolved a:3faidon Layer42 is tracked in T82323. Nothing else needed here, AFAIK. 
[12:36:06] 6operations, 7network: Establish IPsec tunnel between codfw and eqiad pfw - https://phabricator.wikimedia.org/T89294#1177641 (10faidon) [12:39:56] 6operations, 6Mobile-Web, 3Mobile-Web-Sprint-44-R_________: Spike: figure out the simplest possible way to apply tags to a large group of articles on en wikipedia - https://phabricator.wikimedia.org/T94755#1177653 (10phuedx) I feel that the simplest solution would be storing the `serialize`d lists in files i... [12:48:19] 6operations, 10ops-eqiad: check Temperature Alarm: asw-d-eqiad. - https://phabricator.wikimedia.org/T94997#1177660 (10Cmjohnson) 3NEW a:3Cmjohnson [13:00:30] 6operations, 3Interdatacenter-IPsec: Kernel panics on Jessie (3.16.0-4-amd64) during IPsec load test - https://phabricator.wikimedia.org/T94820#1177686 (10BBlack) Try with the 3.19 kernel in case that makes the problem go away? That's the kernel the caches run anyways, and is in our repo: `apt-get install lin... [13:11:26] 6operations, 10Continuous-Integration: Provide Jessie package to fullfil Mediawiki::Packages requirement - https://phabricator.wikimedia.org/T95002#1177707 (10hashar) 3NEW [13:12:05] Error: /Stage[main]/Xvfb/Service[xvfb]: Provider upstart is not functional on this host [13:12:05] bah [13:12:14] :) [13:12:59] hashar: _joe_ made us a nice definition called "service_unit" to handle the case where we're defining custom initscripts/service in puppet and need to gracefully handle sysvinit and/or upstart and/or systemd cases [13:13:22] https://github.com/wikimedia/operations-puppet/blob/21c72942dd7bf25dbe0759d2f867082e966bfb45/modules/base/manifests/service_unit.pp [13:13:27] I love the recent hires [13:13:42] bblack: thanks that is much helpful [13:16:44] and yet another task created ( https://phabricator.wikimedia.org/T95003 ) [13:17:17] PROBLEM - Incoming network saturation on labstore1001 is CRITICAL: CRITICAL: 10.34% of data above the critical threshold [100000000.0] [13:20:33] (03PS1) 10Hashar: contint: Jessie does not have openjdk-6-jdk [puppet] - 10https://gerrit.wikimedia.org/r/201701 (https://phabricator.wikimedia.org/T94999) [13:23:37] 6operations, 10Continuous-Integration: Provide Jessie package to fullfil Mediawiki::Packages requirement - https://phabricator.wikimedia.org/T95002#1177746 (10hashar) [13:24:53] (03CR) 10Hashar: [C: 031 V: 032] "Applied on integration puppetmaster. The package is no more required by the Jessie instance." [puppet] - 10https://gerrit.wikimedia.org/r/201701 (https://phabricator.wikimedia.org/T94999) (owner: 10Hashar) [13:35:47] 6operations, 10Continuous-Integration: Build Debian package jenkins-debian-glue for Jessie - https://phabricator.wikimedia.org/T95006#1177778 (10hashar) [13:45:07] 6operations, 10Continuous-Integration: Build Debian package ruby-jsduck for Jessie - https://phabricator.wikimedia.org/T95008#1177814 (10hashar) 3NEW [14:11:51] 6operations: Force https for archiva.wikimedia.org - https://phabricator.wikimedia.org/T88139#1177876 (10Ottomata) Any updates? 
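The `base::service_unit` define bblack links above wraps a service and ships whichever init flavour(s) the calling module provides, so one manifest can cover sysvinit, upstart and systemd hosts. A rough sketch of how a caller tends to use it for something like hashar's xvfb problem; the parameter names are recalled from the define linked above and should be checked against the actual interface rather than copied:
```
# Sketch only: declare an 'xvfb' service and tell base::service_unit which
# init templates the module ships (e.g. under modules/xvfb/templates/initscripts/).
base::service_unit { 'xvfb':
    ensure  => present,
    systemd => true,   # render a systemd unit on jessie
    upstart => true,   # render an upstart job on precise/trusty
}
```
This is the same mechanism the varnishncsa logging instances are moved onto later in this log ("varnish::logging: use service_unit").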
[14:13:53] 6operations, 10Continuous-Integration: Provide Jessie package to fullfil Mediawiki::Packages requirement - https://phabricator.wikimedia.org/T95002#1177880 (10hashar) [14:14:43] 6operations, 10Analytics-EventLogging, 6Analytics-Kanban, 10hardware-requests: deploy eventlog1001 - https://phabricator.wikimedia.org/T90904#1177882 (10Nuria) [14:15:06] ottomata: See the private RT at https://rt.wikimedia.org/Ticket/Display.html?id=9286 for updates :) [14:16:52] ah yeah, no updates there either [14:16:57] i saw it [14:16:58] thanks [14:17:37] 6operations, 10Continuous-Integration: Provide Jessie package to fullfil Mediawiki::Packages requirement - https://phabricator.wikimedia.org/T95002#1177889 (10hashar) [14:24:33] (03PS1) 10Hashar: contint: update browsers package names for Jessie [puppet] - 10https://gerrit.wikimedia.org/r/201711 (https://phabricator.wikimedia.org/T95000) [14:26:26] (03CR) 10Hashar: [C: 031 V: 032] "Applied on integration puppetmaster:" [puppet] - 10https://gerrit.wikimedia.org/r/201711 (https://phabricator.wikimedia.org/T95000) (owner: 10Hashar) [14:32:12] 6operations, 10Analytics-EventLogging, 6Analytics-Kanban: Upgrade box for EventLogging (vanadium) - https://phabricator.wikimedia.org/T90363#1177903 (10Ottomata) Found the machine, finally! https://phabricator.wikimedia.org/T90904 On it... [14:40:12] 7Blocked-on-Operations, 6operations, 10Continuous-Integration: Build Debian package ruby-jsduck for Jessie - https://phabricator.wikimedia.org/T95008#1177814 (10hashar) [14:41:31] 6operations, 10Continuous-Integration: Build Debian package jenkins-debian-glue for Jessie - https://phabricator.wikimedia.org/T95006#1177925 (10hashar) a:3hashar [14:41:38] 6operations, 10Continuous-Integration: Build Debian package jenkins-debian-glue for Jessie - https://phabricator.wikimedia.org/T95006#1177778 (10hashar) I have build the package using a cowbuilder image of jessie-wikimedia. Result is published on http://people.wikimedia.org/~hashar/debs/jenkins-debian-glue/... [14:42:25] any repreprop guru would mind pushing the jenkins-debian-glue package I have just build please ? We are missing it on Jessie [14:42:37] files are at http://people.wikimedia.org/~hashar/debs/jenkins-debian-glue/ or scp terbium.eqiad.wmnet:/home/hashar/public_html/debs/jenkins-debian-glue . [14:42:50] would be for jessie-wikimedia . The package is maintained by a debian developer :) [14:42:57] related task is https://phabricator.wikimedia.org/T95006 ! [14:46:19] (03PS1) 10BBlack: scale varnish->varnish backend weight for prod 2layer clusters [puppet] - 10https://gerrit.wikimedia.org/r/201714 [14:48:13] 6operations, 10Continuous-Integration: Build Debian package jenkins-debian-glue for Jessie - https://phabricator.wikimedia.org/T95006#1177941 (10hashar) I have manually installed the binary packages I needed on the integration-slave-jessie-1001.eqiad.wmflabs instance. ``` dpkg -i jenkins-debian-glue_0.11.0_al... 
[14:49:35] (03PS2) 10BBlack: scale varnish->varnish backend weight for prod 2layer clusters [puppet] - 10https://gerrit.wikimedia.org/r/201714 [14:50:08] RECOVERY - Incoming network saturation on labstore1001 is OK: OK: Less than 10.00% above the threshold [75000000.0] [14:50:51] (03PS3) 10BBlack: scale varnish->varnish backend weight for prod 2layer clusters [puppet] - 10https://gerrit.wikimedia.org/r/201714 (https://phabricator.wikimedia.org/T86663) [14:52:36] 6operations, 7HTTPS, 3HTTPS-by-default: Expand HTTP frontend clusters with new hardware - https://phabricator.wikimedia.org/T86663#1177948 (10BBlack) https://gerrit.wikimedia.org/r/#/c/201714 <- proposal for backend weighting [15:02:51] (03CR) 10JanZerebecki: "> I myself would always use a default that fail() the catalog." [puppet] - 10https://gerrit.wikimedia.org/r/197759 (https://phabricator.wikimedia.org/T87132) (owner: 10Tim Landscheidt) [15:11:22] greg-g: Hey. I've updated the EducationProgram patch. [15:11:24] https://gerrit.wikimedia.org/r/#/c/167199/ [15:11:55] Glaisher: cool [15:12:09] now, to find someone to own the deploy, can you test if it works or not? [15:12:40] Pretty sure it will. I recently did another similar patch. [15:13:21] greg-g: Maybe namespaceDupes.php will have to be run. [15:13:23] not sure [15:14:13] !log kartik Synchronized php-1.25wmf23/extensions/ContentTranslation: (no message) (duration: 00m 17s) [15:14:20] Logged the message, Master [15:14:44] !log kartik Synchronized php-1.25wmf24/extensions/ContentTranslation: (no message) (duration: 00m 14s) [15:14:47] Logged the message, Master [15:15:06] greg-g: thanks for quick reply, just done with fixing error. [15:17:01] kart_: cool [15:17:25] Glaisher: good point... will need someone who can more definitively answer that/do it [15:21:25] kart_: added the CX deploy for next week to the monday "morning" swat window, feel free to move earlier if you want [15:21:41] greg-g: I don't think it was run at ruwiki. It'll have been updated with 1.25wmf23 deploy there. [15:24:56] 10Ops-Access-Requests, 6operations, 10Analytics-Cluster: Requesting access to analytics-users (stat1002) for Jkatz - https://phabricator.wikimedia.org/T94939#1178002 (10Ottomata) Clarification: Jon is asking for 'analytics-privatedata-users' membership, not 'analytics-users'. [15:26:02] greg-g: new languages are just mw-config :) [15:26:13] greg-g: we do CX code update every Thu. [15:26:21] Thanks! [15:27:04] greg-g: you're right though :) [15:38:47] (03PS4) 10BBlack: scale varnish->varnish backend weight for prod 2layer clusters [puppet] - 10https://gerrit.wikimedia.org/r/201714 (https://phabricator.wikimedia.org/T86663) [15:44:47] (03PS1) 10Ottomata: Puppetize eventlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/201718 (https://phabricator.wikimedia.org/T90363) [15:46:27] (03CR) 10Ottomata: [C: 032] Puppetize eventlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/201718 (https://phabricator.wikimedia.org/T90363) (owner: 10Ottomata) [15:48:49] 6operations, 10ops-eqiad: check Temperature Alarm: asw-d-eqiad. - https://phabricator.wikimedia.org/T94997#1178036 (10Cmjohnson) I can not see anything abnormal on the switch itself. The fans are working and I do not see anything blocking ventilation. I removed the top blanking panel to allow more hot air to e... 
[15:58:33] 10Ops-Access-Requests, 6operations, 10Analytics-Cluster: Requesting access to analytics-users (stat1002) for Jkatz - https://phabricator.wikimedia.org/T94939#1178045 (10Dzahn) p:5Triage>3Normal [16:08:00] (03PS1) 10Dzahn: admins: update ssh key for tomasz [puppet] - 10https://gerrit.wikimedia.org/r/201722 (https://phabricator.wikimedia.org/T94934) [16:10:32] RECOVERY - RAID on db1035 is OK: OK: optimal, 1 logical, 2 physical [16:15:10] (03PS3) 10BryanDavis: composer.json: Set classmap-authoritative: true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188393 (https://phabricator.wikimedia.org/T85182) (owner: 10Legoktm) [16:15:57] 6operations, 10ops-eqiad: db1035 raid degraded - https://phabricator.wikimedia.org/T94805#1178122 (10Cmjohnson) 5Open>3Resolved replaced drive ...all spun up lot Number: 7 Drive's position: DiskGroup: 0, Span: 3, Arm: 1 Enclosure position: N/A Device Id: 7 WWN: 5000C50032408F84 Sequence Number: 12 Media E... [16:17:44] (03CR) 10BryanDavis: [C: 032] composer.json: Set classmap-authoritative: true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188393 (https://phabricator.wikimedia.org/T85182) (owner: 10Legoktm) [16:18:55] (03CR) 10Andrew Bogott: [C: 032] Add resolv.conf alternates for new dns server. [puppet] - 10https://gerrit.wikimedia.org/r/201448 (owner: 10Andrew Bogott) [16:19:03] PROBLEM - Check status of defined EventLogging jobs on eventlog1001 is CRITICAL: CRITICAL: Stopped EventLogging jobs: consumer/mysql-m4-master [16:19:31] (03Merged) 10jenkins-bot: composer.json: Set classmap-authoritative: true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188393 (https://phabricator.wikimedia.org/T85182) (owner: 10Legoktm) [16:27:25] is $wgDBPrefix set to null in wmf clusters ? [16:28:38] <^d> It's left the default which is '' I think [16:28:49] <^d> (we don't set it explicitly in mw-config) [16:29:48] (03PS1) 10Ottomata: Use eventlog1001 as main eventlogging host instead of vanadium [puppet] - 10https://gerrit.wikimedia.org/r/201724 [16:30:40] <^d> tonythomas: Yep, set to '' in DefaultSettings [16:30:44] <^d> And we don't override [16:30:46] (03PS2) 10Ottomata: Use eventlog1001 as main eventlogging host instead of vanadium [puppet] - 10https://gerrit.wikimedia.org/r/201724 (https://phabricator.wikimedia.org/T90363) [16:31:23] ^d: yeah. cool in that case! [16:31:39] (03CR) 10jenkins-bot: [V: 04-1] Use eventlog1001 as main eventlogging host instead of vanadium [puppet] - 10https://gerrit.wikimedia.org/r/201724 (https://phabricator.wikimedia.org/T90363) (owner: 10Ottomata) [16:31:53] PROBLEM - puppet last run on mw1208 is CRITICAL: CRITICAL: Puppet has 1 failures [16:32:00] <^d> tonythomas: Each wiki gets its own database so we don't have as much need for prefixes in prod [16:32:04] <^d> Extra typing ;-) [16:32:42] (03PS3) 10Ottomata: Use eventlog1001 as main eventlogging host instead of vanadium [puppet] - 10https://gerrit.wikimedia.org/r/201724 (https://phabricator.wikimedia.org/T90363) [16:32:52] ^d: exactly - I just found that our VERP regex https://github.com/wikimedia/operations-puppet/blob/e72f9710df4b46505cb081b2305c943824ace229/templates/exim/exim4.conf.SMTP_IMAP_MM.erb#L28 fails to go to the correct router when a dbprefix ( it come with a '-' right ) come in between [16:33:16] Jeff_Green: and I was checking why we kept like that firstly ! 
[16:33:54] ACKNOWLEDGEMENT - Check status of defined EventLogging jobs on eventlog1001 is CRITICAL: CRITICAL: Stopped EventLogging jobs: consumer/mysql-m4-master Nuria Ruiz swaping boxes [16:35:46] (03PS1) 10Dzahn: base: add nmap to standard packages [puppet] - 10https://gerrit.wikimedia.org/r/201725 [16:36:02] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [16:36:03] RECOVERY - Check status of defined EventLogging jobs on eventlog1001 is OK: OK: All defined EventLogging jobs are runnning. [16:44:33] PROBLEM - Incoming network saturation on labstore1001 is CRITICAL: CRITICAL: 10.34% of data above the critical threshold [100000000.0] [16:49:03] RECOVERY - puppet last run on mw1208 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:49:31] (03CR) 10Ottomata: [C: 032] Use eventlog1001 as main eventlogging host instead of vanadium [puppet] - 10https://gerrit.wikimedia.org/r/201724 (https://phabricator.wikimedia.org/T90363) (owner: 10Ottomata) [16:52:02] Coren_away: I'm going to go shower and move to office, it has my keyboard... [17:01:16] hey bblack, uh, i'm having some unexpected systemd trouble on the bits cachecs [17:01:25] i just applied this [17:01:26] https://gerrit.wikimedia.org/r/#/c/201724/3/manifests/role/cache.pp [17:01:37] trying to create a new varnishncsa instance, and replace an old one [17:01:47] but it wont' start, and i'm not sure why [17:03:11] do I need a /etc/systemd/system file?? [17:06:12] 6operations, 5Patch-For-Review: Update ssh key for 'tomasz' - https://phabricator.wikimedia.org/T94934#1178297 (10Tfinc) thank you [17:19:42] PROBLEM - puppet last run on cp1057 is CRITICAL: CRITICAL: Puppet has 1 failures [17:25:55] (03CR) 10Dzahn: [C: 032] Fix duplicate Package[gdb] declaration [puppet] - 10https://gerrit.wikimedia.org/r/201598 (https://phabricator.wikimedia.org/T94917) (owner: 10Krinkle) [17:26:24] (03PS1) 10BBlack: varnish::logging: use service_unit [puppet] - 10https://gerrit.wikimedia.org/r/201733 [17:27:22] (03CR) 10jenkins-bot: [V: 04-1] varnish::logging: use service_unit [puppet] - 10https://gerrit.wikimedia.org/r/201733 (owner: 10BBlack) [17:27:43] (03PS2) 10BBlack: varnish::logging: use service_unit [puppet] - 10https://gerrit.wikimedia.org/r/201733 [17:27:54] hehe semicolons :p [17:28:04] RECOVERY - puppet last run on cp1057 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:28:13] (03CR) 10Ottomata: [C: 031] "Bug: T90363" [puppet] - 10https://gerrit.wikimedia.org/r/201733 (owner: 10BBlack) [17:29:15] (03CR) 10BBlack: [C: 032] varnish::logging: use service_unit [puppet] - 10https://gerrit.wikimedia.org/r/201733 (owner: 10BBlack) [17:29:29] 6operations, 10Continuous-Integration: Provide Jessie package to fullfil Mediawiki::Packages requirement - https://phabricator.wikimedia.org/T95002#1178354 (10faidon) Preliminary analysis: - The libvips/libmemcached are puppet bugs. These are libraries with a SONAME suffix and we shouldn't hardcode their SONAM... 
[17:30:34] RECOVERY - Incoming network saturation on labstore1001 is OK: OK: Less than 10.00% above the threshold [75000000.0] [17:32:09] (03PS1) 10Ottomata: Use ensure => absent for varnish::logging vanadium [puppet] - 10https://gerrit.wikimedia.org/r/201734 (https://phabricator.wikimedia.org/T90363) [17:33:08] (03PS2) 10Ottomata: Use ensure => absent for varnish::logging vanadium [puppet] - 10https://gerrit.wikimedia.org/r/201734 (https://phabricator.wikimedia.org/T90363) [17:33:12] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: puppet fail [17:33:18] (03CR) 10Ottomata: [C: 032 V: 032] Use ensure => absent for varnish::logging vanadium [puppet] - 10https://gerrit.wikimedia.org/r/201734 (https://phabricator.wikimedia.org/T90363) (owner: 10Ottomata) [17:34:44] PROBLEM - puppet last run on cp1057 is CRITICAL: CRITICAL: puppet fail [17:34:52] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [17:34:52] PROBLEM - puppet last run on cp1056 is CRITICAL: CRITICAL: puppet fail [17:34:53] PROBLEM - puppet last run on cp1060 is CRITICAL: CRITICAL: puppet fail [17:34:59] heh [17:35:02] PROBLEM - puppet last run on amssq38 is CRITICAL: CRITICAL: puppet fail [17:35:29] (03PS1) 10Ottomata: Use ensure => present for default varnish::logging [puppet] - 10https://gerrit.wikimedia.org/r/201735 (https://phabricator.wikimedia.org/T90363) [17:35:43] PROBLEM - puppet last run on amssq60 is CRITICAL: CRITICAL: puppet fail [17:35:57] (03CR) 10Ottomata: [C: 032 V: 032] Use ensure => present for default varnish::logging [puppet] - 10https://gerrit.wikimedia.org/r/201735 (https://phabricator.wikimedia.org/T90363) (owner: 10Ottomata) [17:36:12] PROBLEM - puppet last run on cp1069 is CRITICAL: CRITICAL: puppet fail [17:36:33] PROBLEM - puppet last run on amssq58 is CRITICAL: CRITICAL: puppet fail [17:37:03] PROBLEM - puppet last run on cp4020 is CRITICAL: CRITICAL: puppet fail [17:37:23] PROBLEM - puppet last run on cp1067 is CRITICAL: CRITICAL: puppet fail [17:37:32] PROBLEM - puppet last run on amssq39 is CRITICAL: CRITICAL: puppet fail [17:37:33] PROBLEM - puppet last run on cp4012 is CRITICAL: CRITICAL: puppet fail [17:37:33] PROBLEM - puppet last run on amssq31 is CRITICAL: CRITICAL: puppet fail [17:37:33] PROBLEM - puppet last run on amssq33 is CRITICAL: CRITICAL: puppet fail [17:37:52] RECOVERY - puppet last run on cp1069 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [17:38:12] PROBLEM - puppet last run on cp1052 is CRITICAL: CRITICAL: puppet fail [17:38:12] RECOVERY - puppet last run on cp1057 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:38:13] RECOVERY - puppet last run on cp1056 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:38:13] PROBLEM - puppet last run on cp3018 is CRITICAL: CRITICAL: puppet fail [17:38:23] PROBLEM - puppet last run on cp4009 is CRITICAL: CRITICAL: puppet fail [17:38:42] PROBLEM - puppet last run on cp1054 is CRITICAL: CRITICAL: puppet fail [17:39:03] PROBLEM - puppet last run on amssq37 is CRITICAL: CRITICAL: puppet fail [17:39:12] PROBLEM - puppet last run on cp1068 is CRITICAL: CRITICAL: puppet fail [17:39:12] PROBLEM - puppet last run on amssq57 is CRITICAL: CRITICAL: puppet fail [17:39:13] PROBLEM - puppet last run on amssq55 is CRITICAL: CRITICAL: puppet fail [17:39:13] PROBLEM - puppet last run on amssq56 is CRITICAL: CRITICAL: puppet fail [17:39:32] PROBLEM - puppet last run on cp1053 is 
CRITICAL: CRITICAL: puppet fail [17:39:52] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: puppet fail [17:39:54] PROBLEM - puppet last run on cp4011 is CRITICAL: CRITICAL: puppet fail [17:39:54] PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: puppet fail [17:40:02] PROBLEM - puppet last run on cp1066 is CRITICAL: CRITICAL: puppet fail [17:40:12] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: puppet fail [17:40:52] PROBLEM - puppet last run on amssq49 is CRITICAL: CRITICAL: puppet fail [17:41:03] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: puppet fail [17:41:03] PROBLEM - puppet last run on amssq59 is CRITICAL: CRITICAL: puppet fail [17:41:43] RECOVERY - puppet last run on cp1066 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [17:42:46] (03PS1) 10BBlack: service_unit @name is already the full name [puppet] - 10https://gerrit.wikimedia.org/r/201736 [17:43:05] (03CR) 10BBlack: [C: 032 V: 032] service_unit @name is already the full name [puppet] - 10https://gerrit.wikimedia.org/r/201736 (owner: 10BBlack) [17:47:37] (03PS1) 10Ottomata: Send server side eventlogging logs to eventlog1001 instead of vanadium [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201737 (https://phabricator.wikimedia.org/T90363) [17:47:55] greg-g: i know nothing about mw deployment [17:47:59] how do I get this deployed? [17:47:59] https://gerrit.wikimedia.org/r/#/c/201737/ [17:48:26] (03CR) 10Ori.livneh: [C: 032] Send server side eventlogging logs to eventlog1001 instead of vanadium [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201737 (https://phabricator.wikimedia.org/T90363) (owner: 10Ottomata) [17:48:31] (03Merged) 10jenkins-bot: Send server side eventlogging logs to eventlog1001 instead of vanadium [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201737 (https://phabricator.wikimedia.org/T90363) (owner: 10Ottomata) [17:48:38] haha, ori, hears my call [17:48:39] that is how! [17:48:41] :p [17:49:23] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [17:49:32] !log ori Synchronized wmf-config/CommonSettings.php: I9c4de264: Send server side eventlogging logs to eventlog1001 instead of vanadium (duration: 00m 11s) [17:49:36] Logged the message, Master [17:49:39] awesome, taht did it! 
[17:49:42] thanks ori :) [17:49:48] np, thank you [17:51:43] RECOVERY - puppet last run on cp1060 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [17:53:23] RECOVERY - puppet last run on cp1052 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [17:53:24] RECOVERY - puppet last run on amssq58 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [17:53:42] RECOVERY - puppet last run on amssq38 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:53:54] RECOVERY - puppet last run on cp4020 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [17:54:24] RECOVERY - puppet last run on amssq60 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:54:24] RECOVERY - puppet last run on amssq31 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [17:55:04] RECOVERY - puppet last run on cp3018 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [17:55:22] RECOVERY - puppet last run on cp4009 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [17:55:54] RECOVERY - puppet last run on amssq37 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [17:56:02] RECOVERY - puppet last run on cp1067 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:56:03] RECOVERY - puppet last run on cp1068 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:56:03] RECOVERY - puppet last run on amssq39 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:56:04] RECOVERY - puppet last run on amssq55 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [17:56:04] RECOVERY - puppet last run on amssq57 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [17:56:04] RECOVERY - puppet last run on cp4012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:56:04] RECOVERY - puppet last run on amssq33 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:56:23] RECOVERY - puppet last run on cp1053 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:56:42] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [17:56:52] RECOVERY - puppet last run on cp3017 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [17:57:02] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [17:57:13] RECOVERY - puppet last run on cp1054 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:57:42] RECOVERY - puppet last run on amssq49 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [17:57:43] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [17:57:52] RECOVERY - puppet last run on amssq59 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [17:57:52] (03CR) 10Dzahn: [C: 032] "confirmed in person at office" [puppet] - 10https://gerrit.wikimedia.org/r/201722 (https://phabricator.wikimedia.org/T94934) (owner: 10Dzahn) [17:57:52] RECOVERY - puppet last run on amssq56 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures 
[17:57:55] (03PS1) 10Ori.livneh: Dedupe severity. Dedupe severity. [debs/ircecho] - 10https://gerrit.wikimedia.org/r/201738 [17:58:33] RECOVERY - puppet last run on cp4011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:03:09] 6operations, 5Patch-For-Review: Update ssh key for 'tomasz' - https://phabricator.wikimedia.org/T94934#1178533 (10Dzahn) 5Open>3Resolved a:3Dzahn confirmed in person at office. puppet already replaced the key on terbium and bast1001. should work now and very soon across all servers when puppet applied it. [18:08:00] (03Abandoned) 10John F. Lewis: add network variables for dumps rsync clients [puppet] - 10https://gerrit.wikimedia.org/r/189196 (owner: 10John F. Lewis) [18:12:12] legoktm: what if i just merged a config change in labs/tools/grrrit but didn't do anything else. would that good or bad (because it should also be deployed) [18:12:42] it may fix a problem for the bots trying to join an invite only channel [18:13:37] also labs/tools/wikibugs2 [18:24:06] debugging a problem with flows memcache usage, and i noticed the memcache-serious logs on fluorine have exploded in the last few days. The gzip'd logs have done from ~200kB per day to 1GB per day. Is this a wider problem perhaps? [18:25:18] turbocat: ^ [18:25:31] ebernhardson: any pattern to the messages that have increased? [18:26:08] ori: most look to be mw1147 taling to nutcracker unix socket, [18:26:14] 6operations, 10Analytics-EventLogging, 6Analytics-Kanban, 5Patch-For-Review: Upgrade box for EventLogging (vanadium) - https://phabricator.wikimedia.org/T90363#1178620 (10Ottomata) Things are looking good! eventlog1001 has fully taken over all vanadium duties. I will leave vanadium up for now, and decomm... [18:26:48] looking closer, but the logs are big enough that tools like grep and wc take a minute to run [18:28:00] (03PS1) 10Ottomata: Install jq in eventlogging role [puppet] - 10https://gerrit.wikimedia.org/r/201746 [18:28:03] 6operations, 6Labs, 10hardware-requests: eqiad: (6) labs virt nodes - https://phabricator.wikimedia.org/T89752#1044517 (10Andrew) These are now in DC but not yet actually at eqiad. I'm on clinic duty next week, but would like to start imaging them on the 13th. Thanks! [18:28:10] todays memcache-serious has ~64M lines so far, getting a count of non mw1147 lines in a min [18:28:23] thanks [18:28:39] ori: nothing new [18:29:16] looks to be all mw1147, only 26k lines that don't have mw1147 in them [18:29:32] (03CR) 10Ottomata: [C: 032] Install jq in eventlogging role [puppet] - 10https://gerrit.wikimedia.org/r/201746 (owner: 10Ottomata) [18:32:42] perhaps take mw1147 out of rotation, then look through the server to figure out what happened and how to set a monitor on the circumstances? or perhaps just a monitor on rate of messages going into memcache-serious [18:33:20] sounds legit [18:33:23] any ops on duty? [18:34:04] (03CR) 10Ori.livneh: "i'd add it to base::standard-packages if it's available on trusty and jessie both" [puppet] - 10https://gerrit.wikimedia.org/r/201746 (owner: 10Ottomata) [18:34:23] i'll depool [18:35:48] !log Depooled mw1147. Spamming fluorine:/a/mw-log/memcache-serious.log. Some nutcracker issue most likely. 
[18:35:51] Logged the message, Master [18:38:51] something with zhwiki just started spamming memcache-serious in the last few minutes as well, Memcached error for key "zhwiki:preprocess-hash:b878bc90c624257155bc0aa0e4b4e0c5:1" on server "/var/run/nutcracker/nutcracker.sock:0": SERVER ERRO [18:39:16] from a variety of servers, not any in particular [18:39:27] I've seen those errors before [18:39:28] probably a value over 1mb [18:39:46] tail -10000 memcached-serious.log | awk '{ print $3 }' | sort | uniq -c | sort -rn [18:40:23] !log Restarted nutcracker on HHVM and mw1147 and repooled [18:40:28] Logged the message, Master [18:41:57] ori: 1147 looks fine now, not coming up in memcached-serious (except the zhwiki thing). thanks [18:42:57] ebernhardson: thank you [18:43:48] (03CR) 10BBlack: [C: 031] Dedupe severity. Dedupe severity. [debs/ircecho] - 10https://gerrit.wikimedia.org/r/201738 (owner: 10Ori.livneh) [18:44:07] 6operations, 6MediaWiki-Core-Team, 10MediaWiki-Debug-Logging, 5Patch-For-Review: Store unsampled API and XFF logs - https://phabricator.wikimedia.org/T88393#1178685 (10bd808) [18:44:08] ebernhardson: you might like https://github.com/wikimedia/operations-puppet/blob/production/modules/admin/files/home/ori/.hosts/fluorine [18:44:26] ebernhardson: output: https://dpaste.de/TR2c/raw [18:44:56] it's my retro kibana [18:45:56] ori: nifty, looks like i'd need to pull in spark separatly [18:46:03] Hi, I know this might not be the right channel for it, but has someone experience with using the apt.wikimedia.org/wikimedia repo for third-party uses? I get missing keys errors, and installing wikimedia-keyring as described on the wikitech page about the apt repo didn't worked (package couldn't be found). [18:46:31] 6operations: Grant user 'tomasz' access to dbstore1002 for Event Logging data - https://phabricator.wikimedia.org/T95036#1178721 (10Tfinc) 3NEW [18:46:43] ebernhardson: yeah i have it in https://github.com/wikimedia/operations-puppet/blob/production/modules/admin/files/home/ori/.binned/spark [18:48:04] (03CR) 10Yuvipanda: [C: 031] "Hmm, should this also need to amend the changelog and bump the version?" [debs/ircecho] - 10https://gerrit.wikimedia.org/r/201738 (owner: 10Ori.livneh) [18:48:15] YuviPanda: dunno what the protocol is [18:48:30] ori: let me poke around. depends on if it's set to ensure present or latest, I guess. [18:48:37] hmm, I could just manually update it either way, maybe... [18:49:39] hmm, it's ensure present [18:49:42] also why is it a package... [18:50:06] old and no one updated [18:50:14] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [18:50:15] i'd depackage it and move it to puppet [18:50:22] ori: yeah, want to do that or shall I/ [18:50:23] ? [18:50:28] could you? [18:50:33] ori: totes. [18:50:36] <3 [18:50:52] it's an init script too, let me move that to an upstart script as well. 
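The init-to-upstart conversion floated here for ircecho does not actually happen in this log (see "doesn't convert the init script this time" below), but for context an upstart job for a small daemon like this is short. The job below is purely hypothetical: the paths, the variable names sourced from /etc/default/ircecho and the ircecho invocation are placeholders, not the deployed configuration.
```
# /etc/init/ircecho.conf -- hypothetical sketch, not the job actually deployed.
description "ircecho: relay log lines into IRC"
start on runlevel [2345]
stop on runlevel [!2345]
respawn

script
    # Reuse whatever the old sysvinit script read from the defaults file;
    # the variable names here are placeholders.
    . /etc/default/ircecho
    exec /usr/bin/ircecho $INFILES $NICK $SERVER
end script
```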
[18:51:44] (03PS1) 10Ori.livneh: Update my (=ori) dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/201750 [18:52:08] (03CR) 10Ori.livneh: [C: 032 V: 032] Update my (=ori) dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/201750 (owner: 10Ori.livneh) [18:52:25] 6operations: Move ircecho out of package into puppet repository - https://phabricator.wikimedia.org/T95038#1178752 (10yuvipanda) 3NEW [18:52:35] 6operations: Move ircecho out of package into puppet repository - https://phabricator.wikimedia.org/T95038#1178759 (10yuvipanda) a:3yuvipanda [18:52:53] (03PS4) 10Greg Grossmeier: Add interwiki-labs.cdb [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175755 (https://phabricator.wikimedia.org/T69931) (owner: 10Reedy) [18:52:58] (03CR) 10jenkins-bot: [V: 04-1] Add interwiki-labs.cdb [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175755 (https://phabricator.wikimedia.org/T69931) (owner: 10Reedy) [18:54:19] (03PS1) 10Ori.livneh: Set $wgStatsdServer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201751 [18:55:15] !log restarted grrrit-wm for config change [18:55:18] Logged the message, Master [18:57:08] (03CR) 10Ori.livneh: "@Filippo: Fire when ready :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201751 (owner: 10Ori.livneh) [18:57:34] godog: fyi ^ (next week i imagine) [18:59:56] chasemp: andrewbogott greg-g my first kid is still awake. Will be a few minutes late :( [19:00:07] hasharAway: kk [19:01:43] PROBLEM - DPKG on labmon1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [19:01:49] stupid kids [19:02:41] gotta love 'em [19:02:42] (03PS2) 10John F. Lewis: bacula: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/170476 [19:02:48] http://www.smbc-comics.com/index.php?id=3693 [19:05:03] RECOVERY - DPKG on labmon1001 is OK: All packages OK [19:05:22] (03CR) 10Faidon Liambotis: "Can't we just fix those checks to not emit CRITICAL etc. to their text output?" [debs/ircecho] - 10https://gerrit.wikimedia.org/r/201738 (owner: 10Ori.livneh) [19:08:58] (03PS3) 10John F. Lewis: mailman: SENDER_HEADERS use from only [puppet] - 10https://gerrit.wikimedia.org/r/154846 (https://bugzilla.wikimedia.org/46049) [19:10:26] bblack / akosiaris ^^ rebased and mergable [19:10:44] paravoid ^ you might also want to look? [19:11:38] (03CR) 10Ori.livneh: "@Faidon: Technically, yes; realistically, no. (But I'd be glad to be proven wrong.)" [debs/ircecho] - 10https://gerrit.wikimedia.org/r/201738 (owner: 10Ori.livneh) [19:14:17] (03CR) 10Dzahn: [C: 032] Fix duplicate Package[git-core] declaration [puppet] - 10https://gerrit.wikimedia.org/r/201603 (https://phabricator.wikimedia.org/T94921) (owner: 10Krinkle) [19:17:59] (03CR) 10Faidon Liambotis: [C: 031] "Yeah, I grepped the last 10k lines of icinga-wm messages. 62.5% matched "CRITICAL: CRITICAL" and "OK: OK". These are all kinds of differen" [debs/ircecho] - 10https://gerrit.wikimedia.org/r/201738 (owner: 10Ori.livneh) [19:19:38] (03PS1) 10Dzahn: Fix duplicate Package[libnet-dns-perl] declaration [puppet] - 10https://gerrit.wikimedia.org/r/201791 [19:19:56] hey JohnFLewis [19:20:03] hey YuviPanda [19:20:14] JohnFLewis: I see you're looking to add new checks for the wmt project? [19:20:17] JohnFLewis: what kind of checks? [19:20:37] YuviPanda: it was not so much for the project but for the docs mostly :) [19:21:08] JohnFLewis: alright, so we can have some form of active checks, or graphite checks. the 'active' check will basically just execute something *on* the shinken host, and check its output... 
[19:21:38] JohnFLewis: and how to set it up isn't different from icinga / shinken, although it's fairly still messy. I'm looking to see if there's a custom check I can point you to... [19:22:29] JohnFLewis: for 'graphite checks', you basically do what's done for beta. It has a 'beta::monitoring::shinken' class that's included in the labs shinken role. So you make a class like that, and just include it in the shinken role... [19:23:56] right, now the question is it seems they have to defined in puppet. What about the cases where things being ran on labs don't have a puppet module and so on? [19:25:12] JohnFLewis: for now, you're basically screwed :) [19:25:21] JohnFLewis: I'd say define them in puppet. make a stub module. [19:26:05] so literally a module for the project just with monitoring classes? interesting :P [19:27:14] JohnFLewis: or one module for unpuppetized tools that need monitoring. we haven't had that come up yet. [19:27:19] err, 'projects' not tools [19:27:40] that's when it needs specific monitoring, of course. If it just wants the generic monitoring adding code to shinkengen.yaml.erb is good enough [19:27:50] yeah [19:28:29] (03PS2) 10Dzahn: scap: remove perl packages, fix duplicate declare [puppet] - 10https://gerrit.wikimedia.org/r/201791 [19:31:29] ottomata: hey, icinga has tons of varnishncsa warnings [19:31:35] (03CR) 10Dzahn: [C: 032] "< bd808> mutante: It was used in the old perl ping script that we killed off over a year ago" [puppet] - 10https://gerrit.wikimedia.org/r/201791 (owner: 10Dzahn) [19:31:39] checking [19:31:42] PROCS WARNING: 4 processes with command name 'varnishncsa' [19:31:53] hmmm [19:32:14] really, I'm the only one checking icinga :P [19:32:16] uh oh, bblack [19:32:31] the doublenamed processes didn't get stopped [19:33:08] ^ that will fix one more icinga [19:33:20] mutante: libnet-dns-perl fail for iodine too, role/otrs.pp defines it too [19:33:38] andrewbogott: silver is happy again [19:33:49] paravoid: arrr, ok,i'll look [19:33:54] :) [19:34:13] RECOVERY - puppet last run on silver is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [19:35:27] mutante: context? I’m in a meeting, sorry [19:35:55] andrewbogott: puppet failed due to a duplicate package declaration and now it's fixed [19:36:05] great :) [19:39:03] ha, that kinda sucks. that means webrequest udp2logs were collecting duplicate events for the last hourish [19:39:10] good thing people hsould be using udp2log data anymore! [19:39:13] but they probably are anyway! [19:39:22] (ezachte...wherever you are, this means you!) [19:39:27] shoudln't* [19:40:27] so I need to do a quick security patch for T93543, am I clear to sync-file? [19:41:10] twentyafterfour: yes; there aren't any scheduled deployments [19:41:19] and security patches are fair game even on fridays [19:41:39] ottomata: login to icinga more often? *g* [19:42:11] paravoid: indeed! [19:42:18] i used to check it about once a day...haven't been recently [19:42:21] hmm, the ircecho defaults file is a mess…. [19:42:31] !log twentyafterfour Synchronized php-1.25wmf24/extensions/OpenStackManager/nova/OpenStackNovaUser.php: sync security patch (duration: 00m 12s) [19:42:36] Logged the message, Master [19:42:42] paravoid: why would I ever log in if I have such an excellent notification service? :D [19:42:52] haha [19:42:55] * YuviPanda doesn’t convert the init script this time. 
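Going back to the shinken discussion above: the "graphite checks" YuviPanda describes are just puppet-declared thresholds against Graphite metrics, collected into a per-project class the way beta::monitoring::shinken does it and then included from the labs shinken role. A heavily simplified sketch of what such a class for the wmt project might look like; the define name is recalled from operations-puppet at the time and may differ, and every metric path, threshold and description below is an invented placeholder:
```
# Hypothetical wmt::monitoring::shinken, modeled on beta::monitoring::shinken.
class wmt::monitoring::shinken {
    monitoring::graphite_threshold { 'wmt-response-time':
        description => 'wmt tool 95th percentile response time',
        metric      => 'wmt.placeholder-instance.latency.p95',  # placeholder path
        warning     => 2,
        critical    => 5,
        from        => '10min',
    }
}
```
Including a class like this from the shinken role is what actually generates the checks, which is why unpuppetised projects need at least a stub module.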
[19:43:20] twentyafterfour: export DOLOGMSGNOLOG=1 :) [19:43:42] heh I was about to ask :) [19:45:32] I don't think this one really warrants uber-secrecy [19:46:12] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 0 below the confidence bounds [19:47:38] (03PS1) 10Yuvipanda: ircecho: Move all files into repo [puppet] - 10https://gerrit.wikimedia.org/r/201827 (https://phabricator.wikimedia.org/T95038) [19:47:40] ori: paravoid^ [19:47:40] (03PS1) 10Dzahn: otrs: fix duplicate package definition [puppet] - 10https://gerrit.wikimedia.org/r/201828 [19:48:17] (03PS2) 10Dzahn: otrs: fix duplicate package definition [puppet] - 10https://gerrit.wikimedia.org/r/201828 [19:48:33] (03CR) 10jenkins-bot: [V: 04-1] ircecho: Move all files into repo [puppet] - 10https://gerrit.wikimedia.org/r/201827 (https://phabricator.wikimedia.org/T95038) (owner: 10Yuvipanda) [19:49:04] (03CR) 10Dzahn: [C: 032] otrs: fix duplicate package definition [puppet] - 10https://gerrit.wikimedia.org/r/201828 (owner: 10Dzahn) [19:49:33] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:50:30] !log restarting gitblit [19:50:35] Logged the message, Master [19:51:50] 19:47:54 Error: Could not parse for environment production: Syntax error at ']'; expected ']' at /srv/ssd/jenkins-slave/workspace/pplint-HEAD/modules/ircecho/manifests/init.pp:50 [19:51:51] (03PS3) 10Spage: Redirect dev.wikimedia.org URLs [puppet] - 10https://gerrit.wikimedia.org/r/199182 (https://phabricator.wikimedia.org/T372) [19:51:52] I don't understand that [19:51:53] RECOVERY - puppet last run on iodine is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:52:07] it says it expected ']' and it got ']' and hence it is complaining?! [19:52:09] * YuviPanda is confused. [19:53:12] (03PS2) 10Yuvipanda: ircecho: Move all files into repo [puppet] - 10https://gerrit.wikimedia.org/r/201827 (https://phabricator.wikimedia.org/T95038) [19:54:03] (03CR) 10jenkins-bot: [V: 04-1] ircecho: Move all files into repo [puppet] - 10https://gerrit.wikimedia.org/r/201827 (https://phabricator.wikimedia.org/T95038) (owner: 10Yuvipanda) [19:54:15] hmmm [19:54:23] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 59683 bytes in 0.347 second response time [19:54:26] (03CR) 10Dzahn: "12:52 < icinga-wm> RECOVERY - puppet last run on iodine is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures" [puppet] - 10https://gerrit.wikimedia.org/r/201828 (owner: 10Dzahn) [19:54:28] (03CR) 10Spage: "PS3 only redirects top-level requests." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/199182 (https://phabricator.wikimedia.org/T372) (owner: 10Spage) [19:55:25] mutante: can you take a look at the puppetline message for https://gerrit.wikimedia.org/r/#/c/201827/? I do not understand what it's saying and the line number looks fine to me... [19:56:44] (03PS1) 10BryanDavis: Add a logo banner to scap [tools/scap] - 10https://gerrit.wikimedia.org/r/201829 [19:56:50] YuviPanda: hmm. i was already looking actually.. is it the trailing comma on 49? [19:57:02] (03CR) 10jenkins-bot: [V: 04-1] Add a logo banner to scap [tools/scap] - 10https://gerrit.wikimedia.org/r/201829 (owner: 10BryanDavis) [19:57:18] YuviPanda: mind i just amend and try? [19:57:20] mutante: I was thinking of that, but that shouldn't cause it... [19:57:21] mutante: sure! 
[19:57:23] PROBLEM - puppet last run on amssq33 is CRITICAL: CRITICAL: puppet fail [19:58:16] (03PS3) 10Dzahn: ircecho: Move all files into repo [puppet] - 10https://gerrit.wikimedia.org/r/201827 (https://phabricator.wikimedia.org/T95038) (owner: 10Yuvipanda) [19:59:37] ottomata: woah, that passed. [19:59:39] YuviPanda: it likes that [19:59:42] ottomata: that feels wrong [19:59:49] we always have trailing commas... [19:59:53] (even though I don't like that) [20:00:15] ? [20:00:32] i guess not inside a "require" block with multiple files and packages [20:00:49] gah, I meant mutante, not ottomata [20:01:01] mutante: I think that's a puppetlint fail. [20:01:13] 'arrays should have trailing commas' should be an universal rule [20:01:23] not going to fight that atm, however. [20:02:29] YuviPanda: hmm, yea, i'm not sure what to blame [20:05:00] ruthenium, now what's up with you [20:06:09] !log ruthenium - running puppet, no issues (has not for 7 days but wasn't disabled either?) [20:06:14] Logged the message, Master [20:07:03] RECOVERY - puppet last run on ruthenium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:09:51] 6operations, 10hardware-requests, 3Continuous-Integration-Isolation: eqiad: 2 hardware access request for CI isolation on labsnet - https://phabricator.wikimedia.org/T93076#1178952 (10RobH) my apologies on how long a reply has been pending: https://wikitech.wikimedia.org/wiki/Server_Spares The above page l... [20:10:25] (03PS1) 10MaxSem: Remove my key while I'm travelling [puppet] - 10https://gerrit.wikimedia.org/r/201833 [20:11:59] (03CR) 10Dzahn: [C: 032] Remove my key while I'm travelling [puppet] - 10https://gerrit.wikimedia.org/r/201833 (owner: 10MaxSem) [20:13:30] (03PS4) 10Yuvipanda: ircecho: Move all files into repo [puppet] - 10https://gerrit.wikimedia.org/r/201827 (https://phabricator.wikimedia.org/T95038) [20:13:36] (03CR) 10Yuvipanda: [C: 032 V: 032] ircecho: Move all files into repo [puppet] - 10https://gerrit.wikimedia.org/r/201827 (https://phabricator.wikimedia.org/T95038) (owner: 10Yuvipanda) [20:15:13] (03PS1) 10Yuvipanda: ircecho: Typo fix [puppet] - 10https://gerrit.wikimedia.org/r/201835 [20:15:23] (03CR) 10jenkins-bot: [V: 04-1] ircecho: Typo fix [puppet] - 10https://gerrit.wikimedia.org/r/201835 (owner: 10Yuvipanda) [20:15:36] MaxSem: confirmed gone from bast1001, others per puppet [20:15:52] thanks:) [20:16:12] RECOVERY - puppet last run on amssq33 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [20:16:47] (03PS1) 10Yuvipanda: ircecho: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/201836 [20:16:51] (03PS2) 10BryanDavis: Add a logo banner to scap [tools/scap] - 10https://gerrit.wikimedia.org/r/201829 [20:17:07] (03CR) 10Yuvipanda: [C: 032 V: 032] ircecho: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/201836 (owner: 10Yuvipanda) [20:18:33] Error 400 on SERVER: Failed when searching for node i-00000a4d.eqiad.wmflabs: You must set the 'external_nodes' parameter to use the external node terminus [20:18:37] wut? [20:19:44] andrewbogott, ^ [20:20:10] whoah, mutante how’d you do that? [20:20:28] andrewbogott: puppet agent -tv [20:20:32] (03CR) 10BryanDavis: "This patch is silly to the point of being ridiculous, but I've had it laying around in my git repo for over a year and I decided that I wo" [tools/scap] - 10https://gerrit.wikimedia.org/r/201829 (owner: 10BryanDavis) [20:20:47] mutante: where? 
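For the record, the pplint failure above ("Syntax error at ']'; expected ']'" in ircecho/manifests/init.pp) went away once the trailing comma after the last element of a require array was dropped, even though nobody pinned down exactly which parser rule objected. A minimal before/after sketch of the pattern, with placeholder resource names rather than the real ircecho manifest:
```
# Rejected by the lint/parse check used here (note the comma after the last element):
service { 'ircecho':
    require => [
        File['/etc/default/ircecho'],
        Package['ircecho-dependency'],
    ],
}

# Accepted once the trailing comma after the final array element is removed:
service { 'ircecho':
    require => [
        File['/etc/default/ircecho'],
        Package['ircecho-dependency']
    ],
}
```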
[20:21:06] andrewbogott: integration-slave-precise-1011 , just wanted to confirm https://phabricator.wikimedia.org/T94921 is now fixed [20:22:05] (03PS1) 10Yuvipanda: ircecho: Fix permissions for init script [puppet] - 10https://gerrit.wikimedia.org/r/201838 [20:22:11] mutante: that box uses a local puppet master which means Not My Problem [20:22:48] (03CR) 10Yuvipanda: [C: 032 V: 032] ircecho: Fix permissions for init script [puppet] - 10https://gerrit.wikimedia.org/r/201838 (owner: 10Yuvipanda) [20:23:46] that node certainly does exist in ldap [20:23:52] Krinkle: hey, did i do that wrong running puppet agent on integration-slave-precise-1011 ? [20:24:25] Krinkle: just wanted to confirm the duplicate definition puppet bugs are fixed [20:24:38] the master looks pretty sick as well, puppet won’t run [20:25:23] the "external_nodes" parameter thing is an unusual error [20:25:35] mutante: I can dive in and try to fix that puppet master, but only if someone really wants me to :) [20:26:30] booo neon [20:26:37] mutante: those slaves are not pooled yet so there's not much that can go wrong. They're not yet fully provisioned [20:27:03] mutante: if the puppetmaster is re-created as precise, I'll just delete the new pool and do it again. [20:27:16] andrewbogott: i don't know, it's a CI thing, i'm just trying to help confirm puppet bugs are gone [20:27:27] Krinkle: well, i don't know what that failure is about [20:27:36] why does tcpircbot run on neon... [20:28:08] mutante: role::puppet::self is fragile; if you can get away with it I’d build a fresh one. Trusty even. [20:28:21] don't trusty puppetmasters, I think. [20:28:22] andrewbogott: i'm not building new CI instances [20:28:29] :) [20:28:36] Me neither, i hope! [20:28:47] puppetmasters on trusty are still untested and prone to fucking up [20:28:50] so should be on preicse for now [20:28:57] unless you really want to trail blaze [20:29:22] mutante: so is that instance is a puppetmaster that is trusty, I'm not surprised it is messed up [20:29:25] and it should be precise. [20:29:30] YuviPanda: really? news to me [20:29:33] ok, stop, guys, wait here [20:29:38] it's called "precise" [20:29:40] andrewbogott: basically all our current puppetmasters are precise [20:30:22] andrewbogott: so trusty might work, but might not as well, and considering the puppetmagics done initially to have puppet run similar versions of the client for trusty and precise... [20:30:55] 6operations, 10hardware-requests, 3Continuous-Integration-Isolation: eqiad: 2 hardware access request for CI isolation on labsnet - https://phabricator.wikimedia.org/T93076#1179012 (10hashar) Chase / Andrew / I just had a meeting. We followed the discussion on IRC with RobH. Seems we will get a machine in l... [20:31:52] mutante: The error for duplicate def showed up when re-creating the instance. The only difference since last time is several hundred puppet changes by ops (most of which I imagine have never been tested on a fresh instance, so it might be a genuine error) and that we switched to trusty. The puppetmaster is new back to precise. I'll delete the broken pool of new instances that was never finished [20:31:52] and start afresh later. [20:31:55] (03CR) 10Yuvipanda: [C: 031] "<3" [tools/scap] - 10https://gerrit.wikimedia.org/r/201829 (owner: 10BryanDavis) [20:33:05] Krinkle: hey. I think the precise instance you created early on are ready for pooling. 
Though it is friday :D [20:33:22] Krinkle: I have installed zuul-cloner via the DEbian package I have already pushed to the other labs Precise systems [20:33:45] Krinkle: will deploy Zuul as a debian package on Trusty instances next week [20:33:56] then look at migrating the prod slaves [20:34:00] and gallium :/ [20:34:06] (03CR) 10Greg Grossmeier: [C: 031] Add a logo banner to scap [tools/scap] - 10https://gerrit.wikimedia.org/r/201829 (owner: 10BryanDavis) [20:34:13] hashar: I'm not pooling them. They failed half-way the puppet run due to package def errors [20:34:20] I will delete and re-create them [20:34:25] It is not safe to try and let puppet fix it [20:34:30] Krinkle: I have cherry picked your change on the puppetmaster [20:34:34] our puppet manifests are not good enough to do that [20:34:36] your change*s* [20:34:39] ok, so... i reviewed the patch, merged it,was the only one who didnt ignore the bug and all i wanted is to confirm it's all good now [20:34:55] and we run a Precise puppetmaster [20:35:19] so the instances looks fine to me. Ieven deleted one you created this afternoon and rebuild from scracth to make sure it works fine [20:35:23] puppet passed on first run [20:35:35] mutante: Yeah, we'll know in a few days. I'll try again on monday. [20:35:40] Thank you so much! [20:35:42] 6operations, 10hardware-requests, 3Continuous-Integration-Isolation: eqiad: 2 hardware access request for CI isolation on labsnet - https://phabricator.wikimedia.org/T93076#1179038 (10RobH) After an IRC discussion, we will be allocating two of the old squid systems for these tasks: * WMF3095 (in row c) as l... [20:35:47] Krinkle: ok!:) [20:36:06] (03PS1) 10Yuvipanda: tcpircbot: Use ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/201841 [20:36:47] (03CR) 10Yuvipanda: [C: 032] tcpircbot: Use ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/201841 (owner: 10Yuvipanda) [20:37:06] YuviPanda: why not require_package? [20:37:32] (03CR) 10Yuvipanda: "(remove -2, investigate oriping)" [puppet] - 10https://gerrit.wikimedia.org/r/201841 (owner: 10Yuvipanda) [20:37:38] ori: ok, so I'm not sure what's require_package [20:37:39] * YuviPanda gresp [20:40:47] ori: Is require_package documented somewhere? [20:40:58] oooh, it's our own thing? [20:41:05] ori: is require_package superior to ensure_packages from stdlib? [20:41:05] yeah [20:41:16] or was it just because we didnt have stdlib [20:41:17] yes, it makes the package a requirement for the current class scope [20:41:21] ah [20:41:23] aaaah, perfect... [20:41:28] so you don't later need to requires => package everywhere [20:41:34] * YuviPanda likeyu [20:41:45] although it doesn't capture semantic dependency information... [20:41:50] well [20:41:50] it does [20:43:15] (03PS2) 10Yuvipanda: tcpircbot: Use require_package [puppet] - 10https://gerrit.wikimedia.org/r/201841 [20:43:18] ori: ^ can you take a very quick look? 
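As a sketch of the difference ori describes just above: ensure_packages() (from puppetlabs-stdlib) only guards against duplicate package declarations, while require_package() (a helper defined in operations/puppet) also makes the package a dependency of the current class scope, so individual resources no longer need their own require => Package[...]. The package name and file path below are placeholders, not the real tcpircbot manifest:

    # Variant A: stdlib's ensure_packages() declares the package unless
    # something else already has, but dependent resources still need to
    # state the ordering themselves.
    ensure_packages(['python-irclib'])

    file { '/srv/tcpircbot/tcpircbot.py':
        ensure  => present,
        require => Package['python-irclib'],
    }

    # Variant B: require_package() (as described in the log) additionally
    # orders the package before everything in the current class scope, so
    # the explicit require above can be dropped.
    # require_package('python-irclib')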
[20:43:28] 6operations: install/deploy labnodepool1001 - https://phabricator.wikimedia.org/T95045#1179075 (10RobH) 3NEW a:3RobH [20:44:01] (03CR) 10Ori.livneh: [C: 031] tcpircbot: Use require_package [puppet] - 10https://gerrit.wikimedia.org/r/201841 (owner: 10Yuvipanda) [20:44:09] ori: ty [20:44:10] 6operations: install/deploy scandium as zuul merger (ci) server - https://phabricator.wikimedia.org/T95046#1179083 (10RobH) 3NEW a:3RobH [20:44:23] 6operations, 10hardware-requests, 3Continuous-Integration-Isolation: eqiad: 2 hardware access request for CI isolation on labsnet - https://phabricator.wikimedia.org/T93076#1179092 (10RobH) [20:44:25] 6operations: install/deploy scandium as zuul merger (ci) server - https://phabricator.wikimedia.org/T95046#1179091 (10RobH) [20:44:25] (03CR) 10Yuvipanda: [C: 032] tcpircbot: Use require_package [puppet] - 10https://gerrit.wikimedia.org/r/201841 (owner: 10Yuvipanda) [20:44:31] 6operations, 10hardware-requests, 3Continuous-Integration-Isolation: eqiad: 2 hardware access request for CI isolation on labsnet - https://phabricator.wikimedia.org/T93076#1128751 (10RobH) [20:44:35] 6operations: install/deploy labnodepool1001 - https://phabricator.wikimedia.org/T95045#1179097 (10RobH) [20:44:52] 7Puppet, 6operations, 10Continuous-Integration, 5Patch-For-Review, 7Regression: Puppet: "Package[git-core] is already declared in file modules/authdns/manifests/scripts.pp" - https://phabricator.wikimedia.org/T94921#1179104 (10Dzahn) should be fixed now. wanna confirm? [20:45:43] 7Puppet, 6operations, 6Labs, 5Patch-For-Review, 7Regression: Puppet: "Package[gdb] is already declared in file modules/java/manifests/tools.pp" - https://phabricator.wikimedia.org/T94917#1179105 (10Dzahn) merged. should be fixed now. wanna confirm? [20:47:37] (03CR) 10Thcipriani: [C: 031] "http://tyler.zone/scap.png ← in case anyone was curious" [tools/scap] - 10https://gerrit.wikimedia.org/r/201829 (owner: 10BryanDavis) [20:48:04] (03CR) 10Ori.livneh: [C: 031] "Omfg finally yes." [tools/scap] - 10https://gerrit.wikimedia.org/r/201829 (owner: 10BryanDavis) [20:49:29] well that's clear :) [20:49:58] (03CR) 10Ori.livneh: [C: 032] Make vbench more generic [puppet] - 10https://gerrit.wikimedia.org/r/197240 (https://phabricator.wikimedia.org/T92701) (owner: 10Gergő Tisza) [20:50:38] YuviPanda: oops, merged your change as well [20:50:43] hope that's ok [20:50:51] ori: I did already before... [20:51:01] ori: oooh, it failed on strontium [20:51:04] ori: so yeah, thanks [20:51:17] (03CR) 10Dzahn: [C: 031] Add a logo banner to scap [tools/scap] - 10https://gerrit.wikimedia.org/r/201829 (owner: 10BryanDavis) [20:51:28] !log restart ircecho on neon to test) [20:51:32] Logged the message, Master [20:51:34] boom! [20:51:34] 6operations: install/deploy scandium as zuul merger (ci) server - https://phabricator.wikimedia.org/T95046#1179119 (10RobH) 5Open>3stalled This is being stalled until next Friday (2015-04-10). Then we will discuss with @hashar and proceed as needed. [20:52:05] YuviPanda: icinga -v /etc/icinga/icinga.conf [20:52:17] mutante: nope, that'll have no effect since I'm testing ircecho itself [20:52:20] 6operations, 10hardware-requests, 3Continuous-Integration-Isolation: eqiad: 2 hardware access request for CI isolation on labsnet - https://phabricator.wikimedia.org/T93076#1179128 (10hashar) Lets get WMF3095 (in row c) as labnodepool1001 installed. We need it to start the integration of Nodepool and play wi... 
[20:52:55] YuviPanda: ah, i thought "boom" meant you got icinga config error [20:53:29] ah, this was a 'boom, that went well' [20:53:33] not the most appropriate of sounds, perhaps [20:53:42] I think I picked that up from Ryan Faulkner... [20:53:54] 6operations, 3Continuous-Integration-Isolation: install/deploy labnodepool1001 - https://phabricator.wikimedia.org/T95045#1179129 (10hashar) [20:54:09] 6operations, 3Continuous-Integration-Isolation: install/deploy scandium as zuul merger (ci) server - https://phabricator.wikimedia.org/T95046#1179130 (10hashar) [20:54:41] (03Abandoned) 10Yuvipanda: ircecho: Typo fix [puppet] - 10https://gerrit.wikimedia.org/r/201835 (owner: 10Yuvipanda) [20:55:13] 6operations: install/deploy labnodepool1001 - https://phabricator.wikimedia.org/T95045#1179142 (10RobH) [20:55:22] (03CR) 10Dzahn: [C: 031] ferm: resource attributes quoting [puppet] - 10https://gerrit.wikimedia.org/r/195858 (https://phabricator.wikimedia.org/T91908) (owner: 10Matanya) [20:56:27] (03PS1) 10Yuvipanda: irecho: DEDUPLICATE: DEDUPLICATE [puppet] - 10https://gerrit.wikimedia.org/r/201846 [20:56:50] (03PS2) 10Yuvipanda: irecho: DEDUPLICATE: DEDUPLICATE [puppet] - 10https://gerrit.wikimedia.org/r/201846 [20:57:00] (03PS3) 10Yuvipanda: irecho: DEDUPLICATE: DEDUPLICATE [puppet] - 10https://gerrit.wikimedia.org/r/201846 [20:57:10] (03CR) 10Yuvipanda: [C: 032] irecho: DEDUPLICATE: DEDUPLICATE [puppet] - 10https://gerrit.wikimedia.org/r/201846 (owner: 10Yuvipanda) [20:57:22] (03CR) 10Yuvipanda: [V: 032] irecho: DEDUPLICATE: DEDUPLICATE [puppet] - 10https://gerrit.wikimedia.org/r/201846 (owner: 10Yuvipanda) [20:57:40] (03CR) 1020after4: [C: 032] "I'll merge this when pigs fly." [tools/scap] - 10https://gerrit.wikimedia.org/r/201829 (owner: 10BryanDavis) [20:58:01] (03Merged) 10jenkins-bot: Add a logo banner to scap [tools/scap] - 10https://gerrit.wikimedia.org/r/201829 (owner: 10BryanDavis) [20:58:29] (03Abandoned) 10Nemo bis: Add users_to_rename table to fullview in labsdb replica [software] - 10https://gerrit.wikimedia.org/r/197507 (owner: 10Nemo bis) [20:58:29] yay! [20:58:35] hehe [20:58:46] 6operations, 10ops-eqiad: labnodepool1001 setup tasks: labels/ports/racktables - https://phabricator.wikimedia.org/T95048#1179162 (10RobH) 3NEW a:3Cmjohnson [20:58:48] now I have to make shirts of course [20:59:01] ori: so I've moved ircecho to ops/puppet now, and now it just needs a lot of improvement, I think... [20:59:08] jesus what a convoluted plan.... [20:59:21] 'let us write to a plain file and use pyinotify' [20:59:24] just use a unix socket... [20:59:33] * YuviPanda resists just rewriting ircecho [20:59:54] YuviPanda: I see you had some package-related traffic, this isn't because of mysterious puppet errors in labs right? [20:59:55] and the morebots please :) [21:00:05] 6operations, 3Continuous-Integration-Isolation: install/deploy labnodepool1001 - https://phabricator.wikimedia.org/T95045#1179180 (10RobH) [21:00:51] bblack: nope :D Just moving irecho which is a package (with one file!) into ops/puppet [21:01:10] bblack: partially because that one file could totally use a lot of love... [21:01:17] oh ok [21:01:18] and I think this definitely lowers barrier-to-rewrite :P [21:01:27] I was worried this was echoes of yesterdays "trusty puppetmaster" thing [21:01:49] bblack: :D yeah, I vaguely saw that but not fully... 
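The "move all files into repo" change being discussed here boils down to shipping ircecho's one script from the puppet module instead of from a single-file Debian package, roughly as below; the paths and module layout are guesses rather than the merged patch:

    # Ship the script from the module's files/ directory instead of a deb,
    # and restart the bot whenever the script changes.
    file { '/usr/local/bin/ircecho':
        ensure => present,
        owner  => 'root',
        group  => 'root',
        mode   => '0555',
        source => 'puppet:///modules/ircecho/ircecho',
        notify => Service['ircecho'],
    }

    service { 'ircecho':
        ensure  => running,
        require => File['/usr/local/bin/ircecho'],
    }

Keeping the script in the puppet repo is also what lowers the barrier to the rewrite mentioned above, since a change to the file is just a normal gerrit patch.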
[21:01:56] bblack: I guess we'll want to move our puppetmasters to jessie at some point [21:02:11] yeah we need to, but it's pretty non-trivial [21:02:44] puppet (and basically every other software package in the world ever) could learn a lot about backwards-compatibility from glibc :P [21:02:50] :D [21:03:08] bblack: what, you mean we can't just install puppetmaster via gem anda bundler?! [21:03:30] 3.3 >= 2.7, therefore all our 2.7 code should work fine, right? [21:04:07] but yeah for scripting languages with modules, this really comes down to the need for, at the language level, symbol/API versioning [21:04:19] heh [21:04:25] yeah [21:04:38] your script should be able to "require foomodule api==2.7" or whatever the hell the syntax ends up looking like [21:04:53] well, virtualenv / bundler handles those things... [21:04:57] and foomodule 3.3 should still have 2.7-compatible entry points available that didn't change behavior [21:05:11] then you really can just install the latest deps and go [21:05:33] 6operations, 6Security, 10Wikimedia-Shop, 7HTTPS, 5Patch-For-Review: Changing the URL for the Wikimedia Shop - https://phabricator.wikimedia.org/T92438#1179201 (10Eloquence) The branding has already been updated, there's a Twitter account "wikipediastore", etc., so all that remains to be done is changing... [21:07:09] bblack: I personally think we should just have a proper solution for running virtualenv type things, and solve the security releases issues there... [21:07:38] well, the problem there is an explosion of security maintenance work that will never actually get done on time in practice [21:07:51] that's the nice thing about well-maintained distributions as bases [21:08:52] if we have 30 microservices each with 300 different npm packages installed in their virtualenvs, and each set is different and has different versions of the same things, and CERT says "hey node.js module foo 3.3 has a security bug"... [21:09:23] and then some of them can't upgrade to the upstream fixed version because they're not API-compatible forward to that version [21:09:51] just managing all of that is a damn nightmare, and in practice basically it doesn't happen, and out-of-date buggy modules abound [21:10:54] sorry, it's just a glass-half-empty kind of day :) [21:11:46] bblack: yeah, totally... 
[21:12:12] but still, one of the primary benefits a maintained OS distribution provides is that someone is keeping tabs on security and backporting fixes, and you can safely "apt-get upgrade" and get your fixes without new breakages, etc [21:12:46] yup, but theoretically with scripting languages too you should be able to require 2.7.x and still get all security releases, for example [21:12:51] when we decide to bundle a whole environment separately from the OS (like virtualenv of a node app + all its npm deps), we're having to take on that role fully [21:13:47] (03PS6) 10Yuvipanda: parsoid: Remove parsoid beta role [puppet] - 10https://gerrit.wikimedia.org/r/193082 (https://phabricator.wikimedia.org/T86633) [21:13:47] should, yes :) [21:13:50] theoretically [21:14:19] yeahhh [21:14:39] in practice every dynamic language community seems to be full of too-useful-to-ignore modules that you get dependant on, maintained by a person who believes in linear versioning of all changes and no bugfix backports [21:15:46] (if they're actively maintained at all) [21:15:56] (03PS7) 10Yuvipanda: parsoid: Remove parsoid beta role [puppet] - 10https://gerrit.wikimedia.org/r/193082 (https://phabricator.wikimedia.org/T86633) [21:15:58] thcipriani: ^ wanna go through this again? [21:16:11] * thcipriani looks [21:16:18] bblack: yeah, totally. but not sure how to 'fix' that, though... I guess just depending on debs is one way... [21:17:23] just depending on debs is one answer, but then you're tied to very old and stable software, because even the distro guys don't want to keep up with the bleeding edge of every module in npm or rubygems or CPAN or pypi [21:18:40] another way would be to standardize a sub-platform for ourselves, where we maintain a baseline of packages (or not-packages) of known-good versions and available modules, and do the security and backporting work, and the microservices have to adhere to what's available there or request for new stuff to be added and maintained [21:18:53] so that at least we only have to cover all the work once, instead of x30 [21:18:57] 6operations, 5Patch-For-Review: Move ircecho out of package into puppet repository - https://phabricator.wikimedia.org/T95038#1179255 (10yuvipanda) 5Open>3Resolved DONE! [21:19:25] bblack: yeah, basically have a pypi mirror of some sort (or equivalent for your language) and standarize on which ones you're going to depend on... [21:19:49] (03CR) 10Dzahn: bacula: lint fixes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/170476 (owner: 10John F. Lewis) [21:20:02] yeah I donno if a pypi mirror works. upstream might do stupid things and you'll have to diverge anyways [21:20:03] bblack: openstack has that kind of concept. They have a shared repo containing a list of python modules that all their projects rely on [21:20:04] (03PS3) 10Dzahn: bacula: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/170476 (owner: 10John F. Lewis) [21:20:06] 6operations, 10Analytics-EventLogging, 6Analytics-Kanban, 5Patch-For-Review: Upgrade box for EventLogging (vanadium) - https://phabricator.wikimedia.org/T90363#1179257 (10yuvipanda) 5Open>3Resolved Yay :) [21:20:38] bblack: mirror might be the wrong word, yeah. basically an internal 'vetted' repository of sorts. [21:20:45] yeah [21:20:48] like our debian repo but for python / node whatever [21:21:05] or a vetted build/installation of them all, that can be installed as the baseline w/ e.g. docker or whatever. 
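One low-tech way to express the "vetted baseline" idea with the tooling already in use here is to package the vetted module versions for an internal apt component and pin hosts to those exact builds, so a version only changes when someone consciously bumps it. A very rough sketch; the package name and version string are invented, and the internal repository is assumed to be configured on the host already:

    # Pin to the exact vetted build rather than whatever upstream publishes;
    # bumping this string is the explicit, reviewable "security update" step.
    package { 'python-foomodule':
        ensure => '2.7.3-1+wmf1',
    }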
[21:21:05] reason exposed at: https://github.com/openstack/requirements#global-requirements-for-openstack-projects [21:21:13] bblack: yup [21:21:28] (03CR) 10Dzahn: [C: 032] "rebasing reduced the number of changed files and what is left confirmed with puppet-lint 1.1.0" [puppet] - 10https://gerrit.wikimedia.org/r/170476 (owner: 10John F. Lewis) [21:21:49] bblack: but nobody's taking that on atm, I guess, so we're back to debs for now since we get that 'for free' in some form. [21:22:31] bblack: also now you're here, do you think https://gerrit.wikimedia.org/r/#/c/201618/ makes sense? :) [21:22:46] (03PS2) 10Yuvipanda: Tools: Puppetize webservice2 requirement [puppet] - 10https://gerrit.wikimedia.org/r/201671 (owner: 10Tim Landscheidt) [21:23:14] bblack: I know load is a bad metric, but having that on as a 'effect alert' rather than a 'cause alert' seems useful... [21:23:25] 6operations, 6Security, 10Wikimedia-Shop, 7HTTPS, 5Patch-For-Review: Changing the URL for the Wikimedia Shop - https://phabricator.wikimedia.org/T92438#1179261 (10Krenair) They can name it whatever they want, but wikipedia.org subdomains must run Wikimedia-approved-and-changeable code on WMF-owned hardwa... [21:23:27] YuviPanda: I forget, is that load already scaled per-cpu or not? [21:23:38] otherwise it's hard to have a good limit for 48-core that works for 2-core [21:23:38] bblack: IIRC it isn't, and that machine has 8CPUs [21:24:05] if it's just for specific machines yeah it makes sense [21:24:11] bblack: I baically picked that limit by looking at the graphite data, finding data for the outage that we got alerted to by users and tweaking so this would've caught it before it actually 'hit' [21:24:49] seems like a reasonable bandaid :) [21:24:49] YuviPanda: something you once reviewed and then was amended but it's from last year https://gerrit.wikimedia.org/r/#/c/169253/3 [21:26:03] bblack: :) ok! [21:27:19] (03CR) 10Dzahn: [C: 031] Tools: Puppetize webservice2 requirement [puppet] - 10https://gerrit.wikimedia.org/r/201671 (owner: 10Tim Landscheidt) [21:28:05] "fwconfigtool" from operations/software .. [21:28:06] (03PS3) 10Yuvipanda: tools: Puppetize webservice2 requirement [puppet] - 10https://gerrit.wikimedia.org/r/201671 (owner: 10Tim Landscheidt) [21:28:12] does anyone use that? 
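The labstore load alert merged a few lines up is essentially a Nagios check_load with hand-picked per-host thresholds, since (as noted above) the load figure is not scaled per CPU. Something along these lines, where the nrpe::monitor_service define name and the threshold numbers are assumptions, not the contents of change 201618:

    # check_load takes warning/critical triples for the 1-, 5- and 15-minute
    # load averages; the numbers here are illustrative, chosen the way the
    # log describes: from historical Graphite data, high enough for an 8-CPU
    # box, low enough to fire before NFS clients notice an outage.
    nrpe::monitor_service { 'labstore_load':
        description  => 'High load on labstore',
        nrpe_command => '/usr/lib/nagios/plugins/check_load -w 16,16,16 -c 24,24,24',
    }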
[21:28:46] oh, written by Leslie [21:29:23] (03CR) 10Yuvipanda: [C: 032] tools: Puppetize webservice2 requirement [puppet] - 10https://gerrit.wikimedia.org/r/201671 (owner: 10Tim Landscheidt) [21:30:20] 6operations, 7Tracking: Make ircecho much better (Tracking) - https://phabricator.wikimedia.org/T95052#1179271 (10yuvipanda) 3NEW [21:31:33] 6operations, 7Tracking: ircecho should accept input via unix sockets - https://phabricator.wikimedia.org/T95053#1179280 (10yuvipanda) 3NEW [21:31:58] 6operations: ircecho should accept input via unix sockets - https://phabricator.wikimedia.org/T95053#1179288 (10greg) [21:32:20] (03CR) 10Dzahn: [C: 032] fwconfigtool: Fix pyflakes warnings [software] - 10https://gerrit.wikimedia.org/r/169252 (owner: 10Tim Landscheidt) [21:33:13] (03Abandoned) 10Thcipriani: Merge parsoid beta and production roles [puppet] - 10https://gerrit.wikimedia.org/r/201636 (https://phabricator.wikimedia.org/T91549) (owner: 10Thcipriani) [21:33:52] 6operations: Move ircecho config file to be YAML - https://phabricator.wikimedia.org/T95054#1179295 (10yuvipanda) 3NEW [21:34:34] 6operations: Convert ircecho init script to an upstart job - https://phabricator.wikimedia.org/T95055#1179301 (10yuvipanda) 3NEW [21:34:51] ori: https://phabricator.wikimedia.org/T95052 tracking ticket for making ircecho better. [21:35:39] 6operations, 6Security, 10Wikimedia-Shop, 7HTTPS, 5Patch-For-Review: Changing the URL for the Wikimedia Shop - https://phabricator.wikimedia.org/T92438#1179308 (10Eloquence) Ah, I see. So, to recap -- the concern is: # We set user authentication cookies on the wikipedia.org domain (not its subdomains li... [21:36:49] (03PS2) 10Dzahn: Disable class_inherits_from_params_class puppet-lint checks [puppet] - 10https://gerrit.wikimedia.org/r/198170 (https://phabricator.wikimedia.org/T87132) (owner: 10Tim Landscheidt) [21:37:48] (03CR) 10Dzahn: [C: 032] Disable class_inherits_from_params_class puppet-lint checks [puppet] - 10https://gerrit.wikimedia.org/r/198170 (https://phabricator.wikimedia.org/T87132) (owner: 10Tim Landscheidt) [21:41:01] 6operations, 6Security, 10Wikimedia-Shop, 7HTTPS, 5Patch-For-Review: Changing the URL for the Wikimedia Shop - https://phabricator.wikimedia.org/T92438#1179317 (10Krenair) Vulnerability in the code, or the (unknown, unidentified, non-NDA) people who administrate whatever server it runs on, yes. I wouldn'... [21:42:21] 6operations, 6Security, 10Wikimedia-Shop, 7HTTPS, 5Patch-For-Review: Changing the URL for the Wikimedia Shop - https://phabricator.wikimedia.org/T92438#1179321 (10yuvipanda) A possibly terrifying third option is to setup a cookie stripping reverse proxy that we run on shop.wikimedia.org and then have tha... [21:43:01] YuviPanda: For goodness' sake don't give them ideas [21:43:04] YuviPanda, you suggested it, so ... ;) [21:43:11] I'm not 'it' [21:43:15] notice: I ran away [21:43:21] if only that worked [21:43:24] :P [21:43:35] I was gonna say 'make me!' and then realized 'ooh', so please do not make me :P [21:43:44] There are other ways you could work around it I'm sure [21:43:57] yeah, there's always going to be technical fixes [21:44:09] like whitelisting the domains we send cookies to. that's probably not going to fly either, however. [21:44:14] (03PS1) 10John F. 
Lewis: alphabetise site.pp [puppet] - 10https://gerrit.wikimedia.org/r/201850 [21:44:20] but larger question of 'if', etc that I don't want to get involved in :) [21:44:30] It's not like I'm about to suggest any of them because I don't want this change [21:44:40] I just like the idea of a cookie-eating reverse proxy, so we can call it cookiemonster. [21:44:40] : [21:44:48] YuviPanda: you now sit much closer to Eloquence physically than you did before, he can stand over you and make you now :) [21:44:57] greg-g: but I said 'please' [21:45:08] any ops fancy looking at the above commit - your future people will love you for it! [21:45:32] are there other security issues besides the cookie-stealing one? [21:45:37] JohnFLewis: you shold probably email ops@ too :P sorting isn't a one time operation... [21:45:49] Eloquence: CORS might be an issue as well, not sure. [21:45:51] YuviPanda: will do :p [21:47:05] (03PS2) 10Alex Monk: Alphabetise site.pp [puppet] - 10https://gerrit.wikimedia.org/r/201850 (owner: 10John F. Lewis) [21:47:34] Krenair: bah thanks [21:47:38] urgh, that's going to be fun to review, JohnFLewis [21:47:47] might be easier to review with a script >_> [21:48:23] indeed, it was fun doing it as well... I never knew the alphabet went t v u s h g etc :p [21:49:02] 6operations, 6Security, 10Wikimedia-Shop, 7HTTPS, 5Patch-For-Review: Changing the URL for the Wikimedia Shop - https://phabricator.wikimedia.org/T92438#1179326 (10Eloquence) Terrifying options may warrant some consideration, but I'll loop back with Victoria that this may be trickier than anticipated. @cs... [21:49:05] some of these operations/* things are best left to people who can self-+2 on the repository [21:49:47] strip comments at the top level on both sides, break the blocks into array elements, check everything present on the left is on the right and vice versa [21:50:52] PROBLEM - Disk space on eventlog1001 is CRITICAL: DISK CRITICAL - free space: / 329 MB (3% inode=86%): [21:51:21] JohnFLewis, considering some of these are regexes, I wonder how you would sort node /^[za]\.(codfw|eqiad)\.wmnet$/ :P [21:51:37] (03PS3) 10Yuvipanda: labs: Alert on high load in labstore* [puppet] - 10https://gerrit.wikimedia.org/r/201618 (https://phabricator.wikimedia.org/T94606) [21:52:01] Krenair: no idea :P [21:52:16] I just recongised it started with a z and went 'put it with z as opposed to q :p [21:53:21] YuviPanda: and emailed [21:53:25] with a nice subject :p [21:53:26] JohnFLewis: ty [21:53:29] saw :D [21:53:43] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [21:53:43] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [21:53:55] lovely. [21:54:18] (03CR) 10Yuvipanda: [C: 032] labs: Alert on high load in labstore* [puppet] - 10https://gerrit.wikimedia.org/r/201618 (https://phabricator.wikimedia.org/T94606) (owner: 10Yuvipanda) [21:54:40] mutante just left the office, so merged his patches as well [21:54:43] just puppetlint [21:55:23] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [21:55:23] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [22:08:49] (03CR) 10Gage: [C: 04-1] "Thanks John! 
This issue has bothered me for a while and I'd like to see this patch merged, with some fix-ups:" [puppet] - 10https://gerrit.wikimedia.org/r/201850 (owner: 10John F. Lewis) [22:09:42] 6operations, 6Security, 10Wikimedia-Shop, 7HTTPS, 5Patch-For-Review: Changing the URL for the Wikimedia Shop - https://phabricator.wikimedia.org/T92438#1179377 (10csteipp) >>! In T92438#1179321, @yuvipanda wrote: > A possibly terrifying third option is to setup a cookie stripping reverse proxy that we ru... [22:09:43] jgage: heh good catch with db1018 :P [22:09:52] :D [22:10:23] I thought the numbers differed too much for whitespace changes [22:12:15] (03PS3) 10John F. Lewis: alphabetise site.pp [puppet] - 10https://gerrit.wikimedia.org/r/201850 [22:13:07] jgage: readded + whitespaces seem accounted for from a glance now [22:14:30] yeah i'm trying to see where that 4 line difference is coming from [22:14:35] radium is still out of order [22:16:36] 128-130 is 3 of those lines, the other is 1118. i'm fine with that. [22:17:43] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [22:17:43] jgage: gah, hold on [22:18:38] (03PS4) 10John F. Lewis: alphabetise site.pp [puppet] - 10https://gerrit.wikimedia.org/r/201850 [22:18:44] in place now? [22:19:57] JohnFLewis, you broke the commit message again [22:20:13] Krenair: sorry :( [22:20:19] closer, but radium should be above rdb100* :) [22:20:32] jgage: that order still is messed up >.> [22:20:38] Krenair: I'll fix it with this commit [22:20:39] (03PS5) 10Alex Monk: Alphabetise site.pp [puppet] - 10https://gerrit.wikimedia.org/r/201850 (owner: 10John F. Lewis) [22:20:52] well, I'll ensure not to break it again [22:23:12] (03PS6) 10John F. Lewis: alphabetise site.pp [puppet] - 10https://gerrit.wikimedia.org/r/201850 [22:23:46] heh [22:24:01] (03PS7) 10MaxSem: Alphabetize site.pp [puppet] - 10https://gerrit.wikimedia.org/r/201850 (owner: 10John F. Lewis) [22:26:29] (03PS8) 10Alex Monk: Alphabetise site.pp [puppet] - 10https://gerrit.wikimedia.org/r/201850 (owner: 10John F. Lewis) [22:26:58] gerrit commit war :p [22:27:01] 6operations, 6Security, 10Wikimedia-Shop, 7HTTPS, 5Patch-For-Review: Changing the URL for the Wikimedia Shop - https://phabricator.wikimedia.org/T92438#1179403 (10csteipp) >>! In T92438#1179326, @Eloquence wrote: > Terrifying options may warrant some consideration, but I'll loop back with Victoria that t... [22:28:24] csteipp, "does some amount of javascript parsing, to ensure that scripts on their site can't change the javascript domain" [22:28:25] uhhhhh. [22:28:29] that doesn't sound very secure to me [22:28:47] unless you're going to run all JS in a sandbox [22:29:11] and detect whether it tries to mess with anything it shouldn't [22:30:23] which seems like more than a simple parser that could be fooled by some obfuscation [22:33:52] (03CR) 10Gage: [C: 031] "Discussed on IRC, LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/201850 (owner: 10John F. Lewis) [22:38:42] twentyafterfour: would it be possible to cherry-pick https://gerrit.wikimedia.org/r/#/c/186222/1 into the Sprint extension? 
[22:39:16] to release/2015-01-08/1 [22:39:41] Krenair: I assumed the, "and this all needs to be done secure, with about 3 hours support from my team" was a given :) [22:41:34] :) [22:41:41] "User 'iegapp' has exceeded the 'max_user_connections' resource" -- need a DBA to raise the connection limit [22:41:58] I assume springle is off having a weekend [22:42:27] This same thing happened to the Wikimania Scholarships app earlier this year. [22:42:37] Negative24: we would have to do it during a phab deployment window [22:43:19] twentyafterfour: or even better bring all the labs phabs to release/2015-02-18 [22:43:38] or is there a reason that labs is lagging way behind [22:45:26] Negative24: the reason is just that it requires ops to +2 a change to the puppet repo [22:46:14] Negative24: you can propose a change to the labs role in puppet [22:46:24] phab/labs [22:46:48] (03CR) 10Alex Monk: "GWicke, is this ready to go as far as you are concerned? Do we just need ops to +2 this?" [puppet] - 10https://gerrit.wikimedia.org/r/200206 (owner: 10Alex Monk) [22:47:13] twentyafterfour: Will do. I'm guessing it's just the git tag var [22:47:31] yep [22:48:32] (03CR) 10GWicke: "Yes, looks good to me. Once merged, puppet should restart restbase in labs automatically." [puppet] - 10https://gerrit.wikimedia.org/r/200206 (owner: 10Alex Monk) [22:48:37] twentyafterfour: I need that commit to be on all the machines running phab for my puppet patch to work or else the storage upgrade will fail and puppet won't start phd [22:48:52] (or I could get puppet to ignore error code 2) [22:49:48] storage upgrade is done automatically? I thought that was a manual step [22:49:57] Any ops want to +2 that labs-only change for me? [22:50:13] (03CR) 10Dzahn: [C: 032] Try to unbreak VE on http://ee-prototype.wikipedia.beta.wmflabs.org/ [puppet] - 10https://gerrit.wikimedia.org/r/200206 (owner: 10Alex Monk) [22:50:17] thanks mutante :) [22:50:20] twentyafterfour: Its going to be on auto with puppet exec [22:50:35] mutante: grazie! [22:51:21] twentyafterfour: epriestley said its safe to run with ever run since it load handles and shouldn't do much without changes but Sprint table errors will derail it [22:51:58] no problem [22:52:24] Krenair: https://gerrit.wikimedia.org/r/#/c/198433/ is technically ready to go as well, but lets wait until Monday [22:52:47] right, that's prod, not labs :) [22:53:00] I still think the aawiki* thing is silly [22:53:05] aawik* even [22:53:20] it's just the first thing in the list [22:53:34] Negative24: why are there sprint table errors? [22:53:36] (03PS1) 10Tim Landscheidt: Tools: Update toollabs::toolwatcher documentation [puppet] - 10https://gerrit.wikimedia.org/r/201855 [22:53:41] and doesn't work, so bad for users visiting https://rest.wikimedia.org/ [22:53:59] you should either include or exclude all locked wikis [22:54:08] agree that we should probably filter out the other closed wikis too [22:54:10] not just exclude the first ones on the list [22:54:37] twentyafterfour: Phab assumes that extensions that use LiskDAO have a table in the database but Sprint doesn't need one. 
See https://phabricator.wikimedia.org/T86773 [22:54:44] Krenair: patches accepted ;) [22:55:10] I thought you agreed [22:55:58] oh, btw gwicke [22:56:10] abwiktionary is closed too [22:57:32] !log Updated iegreview to 3813520 (Stop using persistent db connections) [22:57:38] Logged the message, Master [22:57:44] (03CR) 10Dzahn: [C: 032] "just comments" [puppet] - 10https://gerrit.wikimedia.org/r/201855 (owner: 10Tim Landscheidt) [22:57:52] Krenair: looking at the sitematrix output.. there is actually a 'closed' flag [22:58:03] so we can automate this [22:59:02] Umm [22:59:04] gwicke [22:59:13] where is wikisource? [22:59:29] !log Graceful'd Apache on Zirconium for change 3813520 to iegreview (Stop using persistent db connections) [22:59:33] Logged the message, Master [23:00:21] Thanks ori. App still works and no errors so far :) [23:00:40] Krenair: forgot that one it seems - was going through by project [23:01:02] haha [23:01:46] luckily I don't think VE runs on wikisource yet, due to their ... interesting editing system [23:02:53] yeah [23:03:12] afaik the main missing VE wikis are a few select wiktionaries [23:03:26] and sv.wikimedia.org [23:03:36] which is special anyway [23:03:48] plus private wikis [23:06:10] (03PS1) 10Dzahn: phab: update phab version in labs [puppet] - 10https://gerrit.wikimedia.org/r/201857 [23:10:18] !log updated iegreview to aef8b1e (Use proper label for campaign selector) [23:10:21] Logged the message, Master [23:12:47] gwicke, what about the existing closed wikipedias? [23:13:04] 7Puppet: Bashisms in various /bin/sh scripts - https://phabricator.wikimedia.org/T95064#1179518 (10scfc) 3NEW [23:13:11] twentyafterfour: Negative24 ^ i'd do that, but what is the sprint_tag vs the reulgar tag [23:13:15] regular [23:13:49] mutante: the difference between phabricator and the sprint extension [23:14:17] the regular tag updates phabricator itself and the sprint tag is the tag the sprint extension updates to [23:14:19] amending, i meant to change the current_tag [23:14:58] also noticed prod has a security tag which labs doesnt [23:15:05] mutante: I can just include it in my patch [23:15:05] (03PS2) 10Dzahn: phab: update phab version in labs to 2015-02-18 [puppet] - 10https://gerrit.wikimedia.org/r/201857 [23:15:36] Negative24: wanna review PS2 ? [23:15:40] mutante: Well to be honest it already is a part of my patch :P [23:15:45] but not pushed yet [23:15:48] is /1 also correct? [23:16:04] mutante: for the regular tag, yes [23:16:19] * Negative24 checks that [23:17:03] Krenair: I just updated my script to automatically filter out everything that's marked as 'closed' [23:18:31] hmm, is there a way to increase the ssh connection timeout for idrac? i need to capture a kernel panic from the console which can take a while to trigger, but i keep getting disconnected. [23:19:15] Krenair: https://gist.github.com/gwicke/1e415641dd8f5f58bef3 [23:20:26] we should probably generate the thing in the first place by script [23:20:32] that way we know we aren't missing anything [23:20:43] and all the criteria is defined by the script [23:21:56] jgage: racadm config -g cfgSerial -o cfgSerialConsoleIdleTimeout 0x708 [23:22:00] jgage: untested [23:22:06] but supposedly that is 15minutes [23:22:19] thanks! where'd you find that? 
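For reference, the "run storage upgrade from puppet" approach under discussion would look roughly like the sketch below; the phabricator install path is a guess, and returns => [0, 2] is the "ignore error code 2" idea from the log, tolerating the Sprint table error tracked in T86773:

    # Run the schema upgrade on agent runs (upstream says it is safe when
    # there is nothing to do) and make sure it happens before phd starts.
    exec { 'phab_storage_upgrade':
        command => '/srv/phab/phabricator/bin/storage upgrade --force',
        returns => [0, 2],
        before  => Service['phd'],
    }

Whether this should run automatically at all, or stay a manual deployment step, is exactly the disagreement in the review that follows.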
[23:22:30] (03PS3) 10GWicke: Enable group1 wikis in RESTBase [puppet] - 10https://gerrit.wikimedia.org/r/198433 (https://phabricator.wikimedia.org/T93452) [23:22:31] !log Updated scap to a1a5235 (Add a logo banner to scap) [23:22:32] Krenair: it is now script-generated [23:22:37] Logged the message, Master [23:22:41] cool. where is the script? :p [23:22:42] i'd like to set it to like 12 hours [23:22:51] in here http://www.thisisigi.com/2010/08/howto-configure-dell-poweredge-blade-servers-drac/ [23:22:56] danke [23:23:09] after i knew i wanted "IdleTimeout" which i found elsewhere [23:23:14] Krenair: let me check it into restbase [23:23:30] should be put into the puppet repo really, gwicke [23:23:34] not part of restbase itself [23:24:03] they use it here as well http://www.wikihow.com/Configure-Dell-Drac-Console-Redirection-for-SSH-Connections [23:24:14] Krenair: it's not really puppet either [23:24:24] it's a way to generate a restbase config [23:24:47] mutante: cool, i found http://downloads.dell.com/Manuals/Select/integrated-dell-remote-access-cntrllr-6-for-monolithic-srvr-v1.95_Reference%20Guide_en-us.pdf page 184 [23:24:48] a part of it, that is [23:26:01] jgage: ah, so 0 ?:) [23:26:12] or maybe we shouldnt [23:26:24] some kind of timeout is still sane [23:27:04] oh cool there's a serial history size setting too [23:27:21] maybe i could just connect after the crash and see the trace in the history buffer [23:28:11] Krenair: https://github.com/wikimedia/restbase/pull/226 [23:29:26] (03PS1) 10Negative24: Puppet run storage upgrade for phd service [puppet] - 10https://gerrit.wikimedia.org/r/201864 (https://phabricator.wikimedia.org/T95062) [23:30:27] mutante: ^ [23:33:49] (03CR) 1020after4: [C: 031] phab: update phab version in labs to 2015-02-18 [puppet] - 10https://gerrit.wikimedia.org/r/201857 (owner: 10Dzahn) [23:34:39] (03CR) 10Negative24: [C: 04-1] "Sprint extension would probably break since it also would need to be updated to a later release. My patch to puppet (I094f7b123eaffbee7318" [puppet] - 10https://gerrit.wikimedia.org/r/201857 (owner: 10Dzahn) [23:34:51] Negative24: one inline comment on yours [23:35:00] that the bot somehow missed [23:35:05] mutante: And one on yours :) [23:35:57] i would do a change in the labs role but i don't wanna merge a module change [23:36:14] that mixes 2 things afaict [23:37:00] (03CR) 1020after4: [C: 04-1] "I don't think we want puppet to run storage upgrade. Upgrades should be done manually." [puppet] - 10https://gerrit.wikimedia.org/r/201864 (https://phabricator.wikimedia.org/T95062) (owner: 10Negative24) [23:45:22] (03CR) 10Negative24: "I don't see why not. epriestley even said that it's "...safe (and desirable / required) to run it every time."" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/201864 (https://phabricator.wikimedia.org/T95062) (owner: 10Negative24) [23:47:31] mutante: But that would break things. Sprint is built very close to phab and updating one will probably break the other (personal experience) [23:48:09] 10Ops-Access-Requests, 6operations, 10Analytics-EventLogging: Grant user 'tomasz' access to dbstore1002 for Event Logging data - https://phabricator.wikimedia.org/T95036#1179580 (10Dzahn) [23:48:57] twentyafterfour: reply on my patch [23:50:56] Negative24: well, changing both versions is ok, but it's not related to the "run storage upgrade" part ? [23:51:44] (03CR) 1020after4: "I'm not sure, maybe @chasemp has an opinion? I'm pretty sure he intended the phab upgrade process to be manual." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/201864 (https://phabricator.wikimedia.org/T95062) (owner: 10Negative24) [23:51:58] mutante: But it is, and I said so in the commit message. The upgrade is required because of a Sprint commit that fixed an error in the database script. [23:54:25] Negative24: ok, got it ... and this is how things become more complicated than just a simple version number upgrade :p [23:58:09] PROBLEM - puppet last run on cp4002 is CRITICAL: CRITICAL: puppet fail