[01:13:15] PROBLEM - Host wikiversity-lb.pmtpa.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [01:13:17] PROBLEM - Host wiktionary-lb.pmtpa.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [01:13:35] RECOVERY - Host wiktionary-lb.pmtpa.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 26.68 ms [01:13:38] RECOVERY - Host wikiversity-lb.pmtpa.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 26.67 ms [01:19:05] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [02:03:51] !log LocalisationUpdate completed (1.22wmf1) at Mon Apr 15 02:03:51 UTC 2013 [02:14:02] PROBLEM - Puppet freshness on search1015 is CRITICAL: No successful Puppet run in the last 10 hours [02:16:02] PROBLEM - Puppet freshness on search1016 is CRITICAL: No successful Puppet run in the last 10 hours [02:51:04] Reedy_: want to get bug 47221? [02:51:23] (or anyone) [02:54:13] where????!????????????????????????????????????????? [02:54:17] Heh. [03:14:30] morebots is dead again [03:14:59] When did it die? [03:15:26] before my scrollback ends :/ [03:17:43] Interesting, it died on April 12. [03:20:49] apparently, gotta get down on a friday means death for morebots... [03:21:25] TimStarling Reedy_: feel like gently nudging morebots back to life? [03:26:44] Don't worry, I filed . [03:27:50] maybe someone should just fix the code so that it doesn't die [03:29:46] Boring. [03:30:16] file a bug and assign it to Susan! *runs* [03:58:20] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [03:58:20] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [03:58:20] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [04:05:19] wtf is up with https://gerrit.wikimedia.org/r/59074 ? it's like PS3 never came in? 
[04:05:55] PS2 is selected by default when first loading the page and there's no comment (under comments) about PS3 [04:06:36] legoktm mentioned he got a gerrit edit conflict with me [04:06:46] while trying to fix the bug number in the commit message [04:07:09] why are you awake? [04:07:10] yeah it was weird [04:07:12] ;-) [04:07:33] legoktm: was there a message of some sort? [04:07:44] Conflict error or something [04:07:48] I thought I hit cancel though [04:07:57] maybe repo and DB are not in sync? [04:11:29] jeremyb_: mmm? [04:14:12] PROBLEM - Puppet freshness on gallium is CRITICAL: No successful Puppet run in the last 10 hours [04:20:51] odder: to explain the current state... [04:22:26] 06:07 jeremyb_: why are you awake? [04:22:32] I meant that one :) [04:28:36] ahhh. well it's quite early :) [04:36:25] New patchset: Tim Starling; "Preserve timestamps when copying l10n cdb files" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58910 [04:36:32] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58910 [04:36:46] New patchset: Tim Starling; "l10nupdate: Use refreshMessageBlobs.php instead of clearMessageBlobs.php" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58911 [04:43:26] New patchset: Krinkle; "Ensure 'php5-parsekit' in contint" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59105 [04:43:49] PROBLEM - Puppet freshness on virt1005 is CRITICAL: No successful Puppet run in the last 10 hours [04:44:37] TimStarling: Could you help me install php5-parsekit on gallium? 
^ [04:45:25] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59105 [04:46:10] puppet run in progress [04:46:43] Thx [04:47:50] err: Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate definition: System_role[role::zuul::production] is already defined in file /var/lib/git/operations/puppet/manifests/role/jenkins.pp at line 4; cannot redefine at /var/lib/git/operations/puppet/manifests/role/zuul.pp:41 on node gallium.wikimedia.org [04:48:18] I just installed that package directly [04:48:27] if puppet is broken then it can't revert me, can it? [04:49:00] I suppose [04:49:06] The package is working [04:49:31] Looks like someone broke it, hashar worked on zuul last week [04:49:53] Does the jenkins job we have for puppet not validate it completely? Or does it just do a syntax check? [04:50:04] (assuming puppet is capable of doing such validation) [04:50:56] it can catch certain types of errors [04:50:58] probably not that one [04:52:17] That's weird [04:52:24] There is no 'production' in zuul.php [04:52:25] .pp* [04:52:41] Perhaps it is running a new file with an old version of another file? [04:53:34] shouldn't be possible [04:54:09] ah, there are two zuul.pp (picked the wrong one in quick-open file) [04:54:14] yeah, it's in the other one [04:59:29] TimStarling: I'm not too familiar with puppet, it seems both system_role resources by that name only carry an identifier and a description. Is that how a system_role usually is created? Not sure what it is supposed to do there. [05:00:25] the class it is in is already included on its own [05:00:48] system_role just adds a line to the motd file [05:01:10] you know, [05:01:12] $ ssh root@gallium [05:01:12] Welcome to Ubuntu 12.04.2 LTS (GNU/Linux 3.2.0-39-generic x86_64) [05:01:12] gallium is a Wikimedia continuous integration test server (misc::contint::test). 
[05:01:54] there are probably two classes trying to create the same role, that's not allowed [05:02:31] well, it's allowed as long as you don't have both on the same server [05:03:08] Right [05:03:09] I'm quite certain hashar intended manifests/role/jenkins.pp:4 to read system_role { 'role::jenkins::production': description => 'Jenkins master on production' } [05:03:24] he had probably just done it for zuul and copy/pasted, changing the description but not the name [05:03:46] I thought a role would be a more elaborate resource, like node. [05:03:57] with a description, include some classes etc. [05:04:36] now it is basically a property of a class. Seems a bit counterintuitive given the way puppet works otherwise (not that I'm an expert or anything) [05:05:08] puppet is insane [05:05:50] TimStarling: btw, it seems I have to specify the extension config from the command line for it to pick it up (like bin/lint does on fenari as well) [05:05:52] anyways, if you look at manifests/generic-definitions.pp you'll see what $title is used for [05:06:06] it's to generate a nice message in the MOTD [05:06:25] a bit funny that this should cause puppet to barf, but it doesn't have a sense of what's important [05:06:38] That's a bit annoying since the location of it is different on every system I know, so I can't test it locally. Don't packages normally add a file in /etc/php5/conf.d so that they're present and hide the package manager's implementation? [05:06:50] Or did you install it outside apt-get altogether? [05:08:27] Hm.. I suppose though that it would save memory by not loading it for every php process. But then again, I don't know any other extension we have that doesn't put itself in conf. 
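[editor's aside] To make the duplicate-definition error above concrete, here is a sketch of the fix being discussed. The system_role line itself is quoted verbatim from the conversation; the surrounding class name is an assumption based on the change title merged later (gerrit 59106), not the actual file contents:

```puppet
# manifests/role/jenkins.pp (hypothetical reconstruction)
# system_role titles must be unique per node; the bug was that this
# resource had been copy/pasted from zuul.pp with the title
# 'role::zuul::production' left unchanged, colliding with zuul.pp:41.
class role::jenkins::master::production {
    system_role { 'role::jenkins::production':
        description => 'Jenkins master on production',
    }
}
```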
[05:12:27] New patchset: Ori.livneh; "Fix system_role title for role::jenkins::master::production" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59106 [05:13:52] ^ TimStarling [05:15:13] New review: Ori.livneh; "(1 comment)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58892 [05:19:43] New review: Tim Starling; "Can be merged once the dependency is merged." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/58911 [05:21:42] New review: Tim Starling; "Merging immediately since puppet is broken on gallium." [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/59106 [05:21:43] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59106 [05:24:39] TimStarling: Did you confirm it now runs smoothly? I just want to make sure that if I switch something to use it, it won't break jobs later. [05:25:03] * jeremyb_ wonders if someone wants to merge/deploy gerrit 59074 too [05:25:05] Only testing one extension for now, but I'll likely switch more to it later. [05:25:27] Krinkle: /var/log/puppet.log is world-readable, iirc [05:25:39] ori-l: but runs are only every 30 mins :) [05:25:52] unless tim ran it manually [05:25:57] * jeremyb_ nominates ori-l to merge that :) [05:26:04] i can't [05:26:16] oh, wait [05:26:21] it's not a puppet change. 
[05:26:30] puppet finished on gallium [05:26:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:27:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [05:27:29] as for parsekit lacking configuration, I think that is just a statement about how awful that extension is [05:27:43] it's not the sort of thing you would want to use in mediawiki [05:28:05] you should be able to specify it on the command line without a fully qualified path [05:28:15] like php -d extension=parsekit.so eval.php [05:28:50] I don't know if there is any other reason, want me to check it for hooks? [05:30:32] right, it does actually hook zend_error_cb [05:30:43] which may or may not break wmerrors [05:30:47] or wmerrors might break it [05:31:28] I thought the conf.d file is something the debian package usually adds. [05:31:50] I guess this isn't a package of ours? [05:32:38] I remember a bug about it in our bug tracker. Since it is common for php5- packages to add a 2-line file in conf.d, I figured we did so as well. [05:32:49] I can't find a deb- repo for it though [05:33:16] it is our package [05:33:48] I'm interested in seeing that package because the default parsekit is broken on php 5.4. I'd like to create a homebrew (mac os package manager) package for it. [05:33:59] I assume our package has a patch for the source code [05:34:37] we use PHP 5.3 [05:34:43] Really [05:34:53] Oh, of course [05:35:03] well, it's not quite that obvious [05:35:14] it is our own package too, and we have discussed moving to PHP 5.4 at length [05:35:35] * Krinkle just checked default php in our ubuntu package channel. [05:35:40] making sure all these extensions work (wmerrors etc.) would be the main part of the migration project [05:35:55] precise is still at 5.3 too [05:39:05] Anyhow, I'll hardcode the -n option in the php call for now. 
Though it'd be nice if our package could add a conf.d/parsekit.ini file (like apc, curl, lua, pdo, tidy etc.) [05:40:31] jeremyb_: you assigned a wmf-config bug to yourself for an event that starts in 5 hours and neither you nor the three people you added as reviewers have +2 in that repo..? [05:43:47] ori-l: i assigned none of them as reviewers. been kinda busy with other stuff. actually i did that git push while on a moving nyc subway train underground [05:44:01] i don't even have a browser open right now [05:44:02] which train? [05:44:18] can't remember if it was the 2 or 3 [05:44:32] sigh [05:44:41] * ori-l used to live on the 1 [05:44:45] heh [05:44:56] anyways, velocity doesn't have terribly much to do with how smart that was [05:45:10] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/59074 [05:45:31] ori-l: the point was i've been preoccupied [05:45:34] and still am [05:45:36] TimStarling: Great, thanks for your input on the changes and help with deployment. Here's the first run on a real change from jenkins: https://integration.wikimedia.org/ci/job/mwext-UploadWizard-lint/110/console [05:46:14] ori-l: (also, i didn't assign the bug to myself either. but i would have) [05:48:03] ori-l: fwiw, i wouldn't feel terribly bad if it wasn't done in time... they should know better than to wait for the day before. but greasing the wheels a tiny bit doesn't hurt i guess [05:49:42] !log olivneh synchronized wmf-config/throttle.php 'Adding throttling exception for itwiki GLAM event which starts in four hours (Bug 47221, change Id9ae600ac)' [05:50:08] you got the bug # right! 
mazel tov [05:50:11] :-) [05:50:32] (see the earlier patchsets) [05:56:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:57:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.149 second response time [06:06:18] New patchset: Tim Starling; "dont duplicate wikimedia-task-appserver dependencies" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59004 [06:06:24] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59004 [06:15:53] PROBLEM - Puppet freshness on db1058 is CRITICAL: No successful Puppet run in the last 10 hours [06:26:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:26:59] TimStarling: Can you try running lint.php on the files in the code-utils repo (as an example). I'm getting a segmentation fault on Jenkins. Though only when it sees more than N files or a big file like check-vars.php. [06:27:03] Works locally for me. [06:27:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [06:27:34] Trying to narrow down what the cause could be. [06:29:04] with the latest lint? [06:29:33] yeah [06:29:40] from the repo itself [06:29:57] e.g. ./lint.php -v . 
[06:30:29] locally it works both with and without an error (when I modify one of the files to be incorrect) [06:30:43] on gallium it works with a valid file but segfaults when there is an error [06:30:47] https://integration.wikimedia.org/ci/job/mwext-UploadWizard-lint/115/console [06:31:03] RECOVERY - Puppet freshness on gallium is OK: puppet ran at Mon Apr 15 06:30:57 UTC 2013 [06:33:15] works for me [06:33:48] without an error it works, just trying with [06:35:00] ah, with an error it does indeed segfault [06:35:29] this is with quite an old PHP and an old parsekit [06:35:45] Program received signal SIGSEGV, Segmentation fault. [06:35:45] 0x00000000009b66b6 in _get_zval_ptr_ptr_cv (type=1, Ts=0x7ffff7eb3098, node=0x1415100) [06:35:46] at /home/tstarling/src/php/releases/php-5.3.8-slow/Zend/zend_execute.c:302 [06:35:46] 302 zval ***ptr = &CV_OF(node->u.var); [06:36:22] (gdb) print CV_OF(node->u.var) [06:36:22] Cannot access memory at address 0x40 [06:36:48] When I check out code-utils in my home directory on fenari it works both with and without an error. [06:37:03] Different parsekit version? [06:37:44] actually scrap that, it segfaults there too [06:38:04] both latest lint.php and the version that is on fenari [06:39:00] assuming we package parsekit 1.3.0, it must be related to php 5.3, because I have parsekit 1.3.0 locally (patched for php 5.4) and it works. Either that or something ubuntu related perhaps? [06:39:43] ugh, considering parsekit's history of issues and abandonment, perhaps we should look for an alternative [06:47:17] http://php.net/runkit_lint_file looks interesting, though its implementation seems very simple (using zend compile_filename/compile_string), which could be subject to the same problem as parsekit. 
[06:50:25] https://github.com/php/pecl-php-parsekit/blob/master/parsekit.c#L862-L948 [06:50:26] https://github.com/zenovich/runkit/blob/master/runkit_sandbox.c#L1947-L1974 [07:27:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:28:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [07:35:06] hey hashar, Krinkle [07:35:28] ori-l: hey :-] [07:35:35] sorry to dump this on you, but can one of you look at https://gerrit.wikimedia.org/r/#/c/59113/ and comments 9 & 12 on https://bugzilla.wikimedia.org/show_bug.cgi?id=46577 [07:35:48] i have to run, and this is causing fatals in prod [07:35:52] ori-l: I have noticed your mail about tracking stacktrace. Not sure yet how to reply to it nor whether I will have any time to help you with the project :( [07:36:26] ori-l: will take care of it with Jeroen [07:36:42] ori-l: get some sleep and enjoy tomorrow morning with your family =) [07:37:10] thanks :) [07:37:18] * Krinkle has to go as well [07:37:28] *wave* [07:37:34] cya [07:37:34] hashar: I'm done with the changes, all committed, merged and pushed to jenkins [07:37:39] ori-l: Krinkle you two should relocate to Europe :-] [07:37:45] Krinkle: great thank you! [07:39:12] ok, i'm off -- hashar, enjoy the rest of your day & ttyl [07:39:23] I am in Europe :P [07:39:25] still waiting for mediawiki-core-lint to get pushed, though I can abort that. The changes are in the repo, so when you push it it'll be the same. [07:39:29] It's been 9 minutes. 1 job. [07:39:31] Something is broken [07:40:30] Krinkle: maybe each time it updates a job it checks the freshness of all the jobs :-D [07:40:43] Krinkle: it is hard to say since python-jenkins has no log 
[08:24:14] morebots is dead :p [08:26:08] https://bugzilla.wikimedia.org/show_bug.cgi?id=47228 [08:27:23] PROBLEM - Puppet freshness on cp3003 is CRITICAL: No successful Puppet run in the last 10 hours [08:27:57] pooor morebots [08:28:38] I don't even know where it runs nowadays [08:31:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:32:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.133 second response time [08:33:33] PROBLEM - Backend Squid HTTP on sq33 is CRITICAL: Connection timed out [08:33:33] PROBLEM - SSH on sq33 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:35:40] New review: Hashar; "Yeah wrong copy paste. Sorry about that and thank you to have spotted and fixed it!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59106 [08:37:55] PROBLEM - Frontend Squid HTTP on sq33 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:41:46] damn [08:41:58] poor jenkins spurt errors [08:56:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:57:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [09:19:54] !log jenkins died :( [09:21:38] PROBLEM - Host sq33 is DOWN: PING CRITICAL - Packet loss = 100% [09:40:09] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [11:13:37] New patchset: Hashar; "beta: syslog-ng on deployment-bastion host" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51668 [11:15:12] New patchset: Hashar; "systemuser learned 'managehome' (default true)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53879 [11:15:22] New patchset: Hashar; "create jenkins user with systemuser" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53880 [11:16:37] New review: Hashar; "That is 
fully back compatible with the previous definition and will let us optionally declare system..." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/53879 [11:17:52] New review: Hashar; "This is merely normalizing the use of systemuser {} for applications users. It is equivalent to us..." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/53880 [11:19:12] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [11:20:52] New patchset: Hashar; "sql script no more need /etc/cluster" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55877 [11:21:50] New patchset: Mark Bergsma; "Add cp3008 to the pool" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59123 [11:22:27] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59123 [11:24:47] New patchset: Hashar; "zuul: support cloning from a different branch" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58737 [11:25:13] New review: Hashar; "Also get git_branch in manifests/zuul.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58737 [11:25:27] New patchset: Hashar; "zuul: in labs use the `labs` branch to install Zuul" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58738 [11:26:37] New review: Hashar; "I have moved support of $git_branch in manifests/zuul.pp to the dependent change https://gerrit.wiki..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/58738 [11:27:43] New patchset: Hashar; "zuul: support specifying the git directory" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58898 [11:29:38] New patchset: Hashar; "zuul: migrate git dir in production to the ssd" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58899 [11:31:28] New patchset: Hashar; "zuul: support specifying the git directory" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58898 [11:31:35] New patchset: Hashar; "zuul: migrate git dir in production to the ssd" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58899 [11:32:22] New review: Hashar; "This should be harmless, it just adds a new parameter to the classes." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/58898 [11:33:35] New review: Hashar; "The change to manifests/zuul.pp has been moved to dependent change https://gerrit.wikimedia.org/r/#/..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/58899 [11:35:04] New review: Hashar; "This should be harmless, it is just to let the labs instance use a different code than on production..." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/58738 [11:41:33] New patchset: Hashar; "correct jenkins master system role" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59124 [11:42:54] Change abandoned: Hashar; "not needed anymore, has been hacked differently." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53875 [12:04:25] !log Depooled knsq16-knsq22 frontend squids [12:15:01] PROBLEM - Puppet freshness on search1015 is CRITICAL: No successful Puppet run in the last 10 hours [12:15:53] yay! 
[12:17:01] PROBLEM - Puppet freshness on search1016 is CRITICAL: No successful Puppet run in the last 10 hours [12:23:56] !log Depooled amssq47-amssq62 frontend squids [12:52:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:53:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.146 second response time [13:05:29] New patchset: Hashar; "mobile always uses role::cache::configuration" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54864 [13:06:33] New review: Hashar; "* rebased" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54864 [13:06:49] New patchset: Mark Bergsma; "Double upload frontend cache size to 8 GB, on 96 GB boxes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59129 [13:07:31] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59129 [13:09:59] mark, looks like VCLs weren't reloaded after https://gerrit.wikimedia.org/r/#/c/32866/ was merged [13:10:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:11:10] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.135 second response time [13:12:20] MaxSem: that's possible [13:12:32] there's probably no notification/subscription on that file [13:12:51] yeah [13:13:19] as well as geoip [13:14:56] New patchset: Hashar; "upload cache in labs now uses role::cache::configuration" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54863 [13:15:54] New review: Hashar; "* rebased" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54863 [13:18:28] New patchset: Hashar; "(bug 44041) adapt role::cache::mobile for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44709 [13:19:14] New review: Hashar; "rebased, fixed the trivial conflict around 
varnish::logging statements" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44709 [13:19:54] New patchset: Mark Bergsma; "Reload VCL on varnish::common-vcl changes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59131 [13:20:31] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59131 [13:22:13] New patchset: Hashar; "Varnish rules for Beta cluster" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47567 [13:22:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:23:16] New review: Hashar; "rebased, fixed a trivial conflict related to some Analytics header." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47567 [13:24:10] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.140 second response time [13:33:09] New patchset: Hashar; "adapt role::cache::upload for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50064 [13:36:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [13:41:53] yeahhh that applies [13:41:55] \O/ [13:41:57] New patchset: Mark Bergsma; "Remove special storage backend config for esams upload caches" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59133 [13:42:38] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59133 [13:46:17] New patchset: Ottomata; "Subscribing self puppetmaster service to class puppet::self::client." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/59134 [13:46:29] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59134 [13:46:38] New patchset: Diederik; "Remove class 'analinterns' it is no longer used." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59135 [13:47:16] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59135 [13:50:15] New patchset: Mark Bergsma; "Revert "Remove class 'analinterns' it is no longer used."" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59136 [13:50:31] nostalgia [13:50:45] :D [13:51:32] analinterns LOL [13:54:08] ahhh [13:54:14] finally. That one was offending me. [13:57:11] it's still there :) [13:58:09] -:( [13:58:41] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [13:58:41] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [13:58:41] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [14:44:41] PROBLEM - Puppet freshness on virt1005 is CRITICAL: No successful Puppet run in the last 10 hours [14:48:05] New patchset: coren; "Split labstore[34] from [12] (gluster->nfs)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59143 [14:49:12] !log reedy synchronized php-1.22wmf2/ 'Initial sync of 1.22wmf2' [14:52:49] New patchset: coren; "Split labstore[34] from [12] (gluster->nfs)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59143 [14:53:46] !log reedy synchronized wmf-config/ [14:54:27] Reedy: can you check https://gerrit.wikimedia.org/r/#/c/59143/2 for me, this is causing issues. [14:54:37] !log reedy synchronized w [14:54:58] Coren: Did you mean me? [14:55:06] where is the bot who logs to https://wikitech.wikimedia.org/wiki/Server_Admin_Log ? 
[14:55:11] dead [14:55:49] !log reedy synchronized docroot [14:55:57] Reedy: I meant R[TAB]; that used to be Ryan all of 30 s ago. :-) [14:56:11] Reedy: But he fled, so if you /can/ review, I'd rather not self +2 [14:56:17] I thought it was tab fail ;) [14:56:26] Raymond_afk: https://bugzilla.wikimedia.org/show_bug.cgi?id=47228 [14:56:30] I could review it, but I can't approve it, nor do I really know what I'm looking at [14:56:48] Reedy: Heh. No worries. I'll try to find someone elsse. [14:57:05] odder: thanks for the link [14:58:52] New review: Mark Bergsma; "(1 comment)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59143 [14:58:53] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: test2wiki to 1.22wmf2, scap to come [15:00:25] New review: coren; "I don't like self +2, but this is causing issues right now." [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/59143 [15:00:26] Change merged: coren; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59143 [15:01:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:02:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.168 second response time [15:18:36] !log reedy Started syncing Wikimedia installation... : Build localisation cache for 1.22wmf2 [15:18:50] Reedy: feel like running the defib over morebots and bringing it back to life? [15:22:06] I don't think I can [15:30:07] New patchset: Ottomata; "Using a different ssldir for self hosted puppet clients." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59148 [15:33:54] New patchset: Ottomata; "Using a different ssldir for self hosted puppet clients." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/59148 [15:34:09] mw1101: rsync: failed to set times on "/usr/local/apache/common-local/live-1.5": Operation not permitted (1) [15:34:10] mw1035: rsync: failed to set times on "/usr/local/apache/common-local/live-1.5": Operation not permitted (1) [15:34:12] bleugh [15:34:30] Shall have to fix it post scap [15:34:32] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59148 [15:35:01] apergos: snapshot1: sudo: no tty present and no askpass program specified [15:36:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:37:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [15:52:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:52:30] New patchset: Ottomata; "Ensuring puppet::self::client ssldir exists," [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59152 [15:52:58] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59152 [15:53:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.143 second response time [15:53:37] !log reedy Finished syncing Wikimedia installation... : Build localisation cache for 1.22wmf2 [15:58:56] New patchset: Reedy; "1.22wmf2 setup" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/59153 [15:59:08] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/59153 [16:04:57] PROBLEM - NTP on cp1041 is CRITICAL: NTP CRITICAL: Offset unknown [16:08:51] Any ops around that could fix a permission problem for me? 
/usr/local/apache/common/live-1.5 is owned by root:root, should be mwdeploy:mwdeploy Taaa [16:08:52] ddsh -F30 -cM -g mediawiki-installation -o -oSetupTimeout=10 'chown mwdeploy:mwdeploy /usr/local/apache/common/live-1.5' [16:09:57] RECOVERY - NTP on cp1041 is OK: NTP OK: Offset -0.05912792683 secs [16:11:28] Reedy: running [16:11:35] Thanks [16:11:46] done [16:12:38] Hmm [16:12:43] Apparently that didn't fix it [16:15:57] PROBLEM - Puppet freshness on db1058 is CRITICAL: No successful Puppet run in the last 10 hours [16:18:09] sec [16:18:44] how about now? [16:18:54] Reedy: ^ [16:20:02] Yup! thanks [16:20:21] it needed -h [16:20:26] as live-1.5 is a symlink [16:20:55] aha [16:21:03] paravoid, there's a request at https://bugzilla.wikimedia.org/show_bug.cgi?id=47197 to delete a mailing list with no archives - do you (ops) need an RT ticket for that? [16:21:10] Hopefully that should fix most of the error spam during scap :) [16:21:46] Thehelpfulone: yes, preferably [16:21:57] okay, I'll create one and link it :) [16:21:59] sometimes andre__ brings BZ to our attention [16:22:40] :) [16:34:33] bblack, ping [16:34:50] yurik: pong [16:34:54] hi! [16:35:01] what's up? [16:35:23] you are still not in the https://office.wikimedia.org/wiki/Contact_list list :) [16:35:34] but that's not what i wanted to ask :) [16:35:43] i have looked at some of the code you posted [16:35:59] you mean the vmod_netmapper thing? [16:36:02] had some questions about it - are you sure you want to deal with the locking and all the other issues? [16:36:04] yep [16:36:28] what locking and other issues are you concerned about? [16:36:31] i mean - wouldn't it be easier for the external shell script/cron job to issue a vagrant - reload command to reload scripts? [16:37:01] because if you will have to perform all the locking and other stuff in code, it might be fairly expensive [16:37:04] just a thought [16:37:19] sorry, varnish, not vagrant [16:37:30] and restarting varnish is not expensive at all? 
;) [16:37:36] I'm not sure if "reload" reloads vmods, but I suspect it's just VCLs [16:37:44] hmm, that might be true [16:37:57] well, varnish restart should not be required, at least i hope its not :) [16:38:01] the upside of doing the reload in-process though, regardless, is zero downtime [16:38:19] or zero "delays in processing the next request" or whatever you want to call it [16:38:21] bblack: I've added it to the ops meeting agenda, I'll ask you to tell a little bit about it [16:38:23] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:38:44] well, you will always have a delay i guess :) [16:38:51] actually, no, we won't :) [16:39:06] with a lock? [16:39:06] how [16:39:08] look closer, there is no lock [16:39:13] (there is no spoon) [16:39:17] i mean - if you had no lock at all, you can do reference change [16:39:30] hhmmm.. maybe you already implemented it that way :) [16:39:53] i briefly looked at it, so can't claim i understood it [16:40:02] this is where you want to read: http://lttng.org/urcu [16:40:05] i thought you were doing sync locking [16:40:22] RECOVERY - Disk space on db1033 is OK: DISK OK [16:41:15] the pthread calls in there are just to sync up the initialization of the vmod. that stuff may not be perfectly placed yet. in its current form it may pointlessly reinitialize the netmapper stuff on VCL reload (because all the VCL threads do their finishing call). but that can be figured out once the rest is working. [16:42:07] thx for the lib pointer, will read about it. 
As for the rest, if there is no threadsync for data access, its all great :) [16:46:27] yurik: yeah, the basic urcu data model is kind of like "have the readers access all the critical data through this one pointer, and have the writer load a new data update on the side, replace that pointer so it points at new data, and then free up the old copy" [16:46:41] except, if you did that without liburcu, it wouldn't really work. [16:47:12] (because sometimes CPU cache would keep the older pointer "valid" for the reader after the writer swapped it, and sometimes the reader would still be using the old structure when the writer frees it) [16:47:22] liburcu solves those issues locklessly though [16:51:44] bblack: btw, liburcu maintainer replied, he's taking care of my suggestions :) [16:52:06] awesome :) [16:52:35] but because he wasn't careful enough, wheezy is going to be released with 0.6.7 instead of 0.7.6 :( [16:53:21] New patchset: Andrew Bogott; "Added 'pythonfile' module." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59161 [16:54:07] bwahahahahahahahaa [16:54:09] Breaking news: Debian gets released with a backdated version of some software package! More shocking news at 9! [16:54:09] BWAHAHAHAHAHAHAHAHA [16:54:13] my week in hell is over! 
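bblack's description of the urcu data model above maps onto only a few lines of code. This Python sketch shows just the pointer-swap pattern he describes (readers go through a single reference, the writer builds a copy on the side and swaps the reference in one step); it is not RCU itself, because in Python the GIL makes the rebinding atomic and the garbage collector frees the old copy, which is exactly the hard part liburcu solves locklessly in C with grace periods.

```python
# The "one pointer" readers use; the netmap name and entries are
# made up for illustration, not vmod_netmapper's real data.
netmap = {"10.0.0.0/8": "internal"}

def reader(ip_block):
    data = netmap              # grab the current snapshot once...
    return data.get(ip_block)  # ...and use only that snapshot

def writer(new_entries):
    global netmap
    updated = dict(netmap)     # load the new data "on the side"
    updated.update(new_entries)
    netmap = updated           # swap the pointer; old copy is GC'd

writer({"192.0.2.0/24": "external"})
print(reader("192.0.2.0/24"))  # → external
```

In C, the swap and the later free are the dangerous steps (stale CPU caches, readers still inside the old structure, as noted above); liburcu's rcu_assign_pointer/synchronize_rcu pair exists to make them safe without readers taking a lock.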
[16:54:14] \o/ [16:54:25] now apergos must suffer as I have suffered [16:56:58] RobH, I sent you that email before you changed the topic :p [16:57:57] heh, no worries [16:58:12] monday before the handoff is a vague area [16:58:32] I'll end up handing things off to ariel personally, including that email, but until ops meeting there isn't a lot of RT movement [17:03:22] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [17:05:54] bblack, had a connection drop in case you replied [17:06:44] last thing I said on that topic was: 11:47 < bblack> liburcu solves those issues locklessly though [17:07:01] I didn't see any other question from you, if it was lost in your dead connection :) [17:19:40] New patchset: Andrew Bogott; "Added 'pythonfile' module." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59161 [17:19:40] New patchset: Andrew Bogott; "Install ldapsupportlib using pythonfile." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59163 [17:21:15] New review: Andrew Bogott; "recheck" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59161 [17:44:34] New patchset: Ottomata; "Setting X-Analytics header in vcl_deliver." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59166 [17:57:58] linode got hacked: http://slashdot.org/firehose.pl?op=view&type=submission&id=2603667 [18:03:17] ori-l: :-( [18:06:57] Oh, yeah right. 
Deployment time [18:07:31] Reedy: :) [18:09:55] PROBLEM - NTP on cp1041 is CRITICAL: NTP CRITICAL: Offset unknown [18:13:35] PROBLEM - search indices - check lucene status page on search1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:14:43] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: testwiki and mediawikiwiki to 1.22wmf2 [18:14:55] RECOVERY - NTP on cp1041 is OK: NTP OK: Offset 0.04932034016 secs [18:15:05] New patchset: Reedy; "testwiki and mediawikiwiki to 1.22wmf2" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/59170 [18:15:19] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/59170 [18:27:55] PROBLEM - Puppet freshness on cp3003 is CRITICAL: No successful Puppet run in the last 10 hours [18:28:25] RECOVERY - search indices - check lucene status page on search1016 is OK: HTTP OK: HTTP/1.1 200 OK - 52993 bytes in 0.010 second response time [18:28:36] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:34:26] thanks apergos :) [18:34:31] yw [18:34:35] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [18:40:33] preilly: wikimedia alumni! :) [18:42:25] rfaulkner: Okay [18:43:28] rfaulkner: invitation sent [18:44:52] preilly: great ty [18:45:57] rfaulkner: Are you having a good trip? [18:48:15] preilly: yeah, whistler was a lot of fun … just flew in to Toronto yesterday [18:48:28] rfaulkner: ah cool [18:48:34] rfaulkner: when are you back in SF? 
[18:48:48] here until Saturday, back sat night [18:49:22] maybe i'll even see drdee while I'm here ;) [18:49:45] unless you keep hiding rfaulkner [18:49:53] haha [19:13:01] New patchset: ArielGlenn; "redirect wikimaps.net to wikipedia.org (RT 4945)" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/59179 [19:14:47] Change merged: ArielGlenn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/59179 [19:33:58] paravoid, is this how you would accomplish this? https://gerrit.wikimedia.org/r/#/c/59161/ [19:34:31] no :) [19:34:42] you shouldn't install files in /usr/lib with puppet [19:34:54] where, then? [19:34:58] as for dist-packages, this is... well, for distribution packages :) [19:35:18] puppet is "local" changes, so /usr/local [19:35:23] but not in this case [19:35:26] why do you need this? [19:35:48] For this, at the moment: https://gerrit.wikimedia.org/r/#/c/59163/ [19:35:52] But, seems generally useful. [19:36:03] Unless you want to argue that those .py files should always be in a .deb [19:36:16] ariel is doing a graceful restart of all apaches [19:36:39] !log ariel gracefulled all apaches [19:37:11] yeah, this looks like something that needs packaging [19:37:22] RECOVERY - Apache HTTP on mw27 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 742 bytes in 0.110 second response time [19:37:23] hm, or not [19:37:34] if it's a script in /usr/local/sbin, I think it's fine to just be there [19:37:54] if it needs libraries in PYTHONPATH, then it needs packaging [19:38:48] I am working on a bot that wants to use ldapsupportlib.py. That's what prompted all this. [19:38:59] I can hardcode the weird path in the bot, but that seems worse. 
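The ldapsupportlib question above (a single shared .py file that several tools need, with packaging felt to be overkill) is usually resolved with an import fallback rather than hardcoding the path at every call site. A sketch of that approach; the /usr/local/sbin location comes from the later "Look for ldapsupportlib in /usr/local/sbin." adminbot patchset, but this code is a guess at the shape of that change, not the change itself.

```python
# Fallback import: try the normal sys.path first, then retry with
# the known install location appended.
import sys

LDAPLIB_DIR = "/usr/local/sbin"  # where puppet drops the .py file

def import_ldapsupportlib():
    try:
        import ldapsupportlib
    except ImportError:
        if LDAPLIB_DIR not in sys.path:
            sys.path.append(LDAPLIB_DIR)
        import ldapsupportlib  # still raises if the host lacks it
    return ldapsupportlib
```

This keeps the weird path in one place, which was andrewbogott's objection to hardcoding it in the bot itself.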
[19:40:52] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [19:52:12] New patchset: Pyoungmeister; "adding db72 to s4 and db73 to s5" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59186 [19:55:12] PROBLEM - mysqld processes on db59 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [19:57:12] RECOVERY - mysqld processes on db59 is OK: PROCS OK: 1 process with command name mysqld [19:58:21] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59186 [19:59:04] andrewbogott: if it's a single .py file and the license is compatible, why not simply include it in the same directory as the bot? [19:59:39] ori-l: It's used by multiple tools. I could put it in two places, of course… lots of messy options :) [20:02:03] andrewbogott: hrm, i see [20:02:05] New patchset: Asher; "binary character and collation defaults for mariadb" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59187 [20:02:42] ori-l, it's not a complicated problem. I'm just vaguely disposed against creating ever more tiny .deb packages [20:03:06] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59187 [20:09:48] RECOVERY - Puppet freshness on search1016 is OK: puppet ran at Mon Apr 15 20:09:39 UTC 2013 [20:10:18] RECOVERY - Puppet freshness on search1015 is OK: puppet ran at Mon Apr 15 20:10:13 UTC 2013 [20:12:48] hrm, how is the site doing ? [20:13:38] LeslieCarr: ?? [20:13:52] big news event usually equals some site issues [20:13:57] i am checking network stuff [20:13:57] LeslieCarr: OK [20:14:05] linode = big news or is there another one? 
[20:14:14] apergos: http://en.wikipedia.org/wiki/Boston_Marathon_explosions [20:14:20] oh no [20:14:37] wtf [20:14:45] LeslieCarr: I think that it will only be big news in the US, tbh [20:14:50] pope is big news in lots of the world [20:15:01] notpeter, don't be silly [20:15:01] i'm in boston right now, my buddy and I almost biked down to see the marathon an hour ago! [20:15:02] turns out, lots of people live in places that get bombed every day :/ [20:15:07] it's a perfect time for an explosion in Europe [20:15:20] 10 PM, lots of folks are in front of their computers [20:15:30] I guess there'll be a lot of traffic coming [20:16:07] I mean, maybe wikinews traffic will increase by 10x [20:18:21] binasher: https://ishmael.wikimedia.org/?hours=3&host=db1017&sort=time ;) [20:25:37] New patchset: Andrew Bogott; "Look for ldapsupportlib in /usr/local/sbin." [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/59193 [20:25:42] notpeter: it's already top story on two of the top greek news sites [20:26:18] Change merged: Andrew Bogott; [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/59193 [20:29:12] hiiii paravoid, you around? i'm getting close to submitting a new patchset for the kafka .deb [20:29:21] wanted to talk with you about the $KAFKA_HOME/bin/kafka-server-start.sh bit [20:29:28] vs /usr/sbin, etc. [20:30:05] apergos: huh [20:30:23] !log DNS update - add wikpedia.org [20:31:53] !log restarting pdns on ns1 [20:34:20] New review: Krinkle; "@Peachey88 I assume that's a reply to @Hashar? Since the people that the are discouraged by the bots..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57752 [20:38:54] cmjohnson1: did the new cards and ex4200 modules get in yet ? [20:40:55] ottomata: hey [20:40:57] New review: Hashar; "Peachey wrote:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57752 [20:41:07] heya [20:41:18] ja so, i'm not sure what to do about that. 
[20:41:32] the shell scripts in kafka/bin use relative paths [20:41:34] things like [20:41:39] $(dirname $0)/.. [20:41:50] i could patch them? [20:42:24] the cdh4 packages all make use of a *-env.sh file that the scripts all know how to load [20:42:47] this could set something like $KAFKA_HOME, and then the other kafka script could all use that var [20:43:04] not sure though [20:45:43] leslicarr: i don't know yet..i would assume yes but I have not seen anything from equinix yet...i will not be able to check until Wednesday [20:47:17] i keep forgetting, sorry! [20:48:23] PROBLEM - DPKG on db73 is CRITICAL: NRPE: Command check_dpkg not defined [20:48:23] PROBLEM - MySQL Slave Delay on db73 is CRITICAL: NRPE: Command check_mysql_slave_delay not defined [20:48:23] PROBLEM - mysqld processes on db73 is CRITICAL: NRPE: Command check_mysqld not defined [20:48:23] PROBLEM - MySQL Slave Running on db72 is CRITICAL: NRPE: Command check_mysql_slave_running not defined [20:48:23] PROBLEM - Disk space on db72 is CRITICAL: NRPE: Command check_disk_space not defined [20:48:33] PROBLEM - Full LVS Snapshot on db72 is CRITICAL: NRPE: Command check_lvs not defined [20:48:33] PROBLEM - MySQL disk space on db72 is CRITICAL: NRPE: Command check_disk_6_3 not defined [20:48:33] PROBLEM - Disk space on db73 is CRITICAL: NRPE: Command check_disk_space not defined [20:48:33] PROBLEM - MySQL Slave Running on db73 is CRITICAL: NRPE: Command check_mysql_slave_running not defined [20:48:43] PROBLEM - MySQL Idle Transactions on db72 is CRITICAL: NRPE: Command check_mysql_idle_transactions not defined [20:48:43] PROBLEM - Full LVS Snapshot on db73 is CRITICAL: NRPE: Command check_lvs not defined [20:48:43] PROBLEM - MySQL disk space on db73 is CRITICAL: NRPE: Command check_disk_6_3 not defined [20:48:53] PROBLEM - MySQL Recent Restart on db72 is CRITICAL: NRPE: Command check_mysql_recent_restart not defined [20:48:53] PROBLEM - RAID on db72 is CRITICAL: NRPE: Command check_raid not defined 
[20:48:53] PROBLEM - MySQL Idle Transactions on db73 is CRITICAL: NRPE: Command check_mysql_idle_transactions not defined [20:49:03] PROBLEM - MySQL Replication Heartbeat on db72 is CRITICAL: NRPE: Command check_mysql_slave_heartbeat not defined [20:49:03] PROBLEM - RAID on db73 is CRITICAL: NRPE: Command check_raid not defined [20:49:03] PROBLEM - MySQL Recent Restart on db73 is CRITICAL: NRPE: Command check_mysql_recent_restart not defined [20:49:13] PROBLEM - DPKG on db72 is CRITICAL: NRPE: Command check_dpkg not defined [20:49:13] PROBLEM - MySQL Slave Delay on db72 is CRITICAL: NRPE: Command check_mysql_slave_delay not defined [20:49:13] PROBLEM - MySQL Replication Heartbeat on db73 is CRITICAL: NRPE: Command check_mysql_slave_heartbeat not defined [20:49:13] PROBLEM - mysqld processes on db72 is CRITICAL: NRPE: Command check_mysqld not defined [20:50:13] RECOVERY - DPKG on db72 is OK: All packages OK [20:50:13] RECOVERY - MySQL Slave Delay on db72 is OK: OK replication delay seconds [20:50:23] RECOVERY - MySQL Slave Delay on db73 is OK: OK replication delay seconds [20:50:23] RECOVERY - MySQL Slave Running on db72 is OK: OK replication [20:50:23] RECOVERY - Disk space on db72 is OK: DISK OK [20:50:23] RECOVERY - DPKG on db73 is OK: All packages OK [20:50:33] RECOVERY - MySQL disk space on db72 is OK: DISK OK [20:50:33] RECOVERY - Disk space on db73 is OK: DISK OK [20:50:33] RECOVERY - MySQL Slave Running on db73 is OK: OK replication [20:50:33] RECOVERY - Full LVS Snapshot on db72 is OK: OK no full LVM snapshot volumes [20:50:44] RECOVERY - MySQL disk space on db73 is OK: DISK OK [20:50:44] RECOVERY - MySQL Idle Transactions on db72 is OK: OK longest blocking idle transaction sleeps for seconds [20:50:44] RECOVERY - Full LVS Snapshot on db73 is OK: OK no full LVM snapshot volumes [20:50:53] RECOVERY - MySQL Recent Restart on db72 is OK: OK seconds since restart [20:50:53] RECOVERY - MySQL Idle Transactions on db73 is OK: OK longest blocking idle transaction 
sleeps for seconds [20:50:53] RECOVERY - RAID on db72 is OK: OK: State is Optimal, checked 2 logical device(s) [20:51:03] RECOVERY - MySQL Recent Restart on db73 is OK: OK seconds since restart [20:51:03] RECOVERY - MySQL Replication Heartbeat on db72 is OK: OK replication delay seconds [20:51:03] RECOVERY - RAID on db73 is OK: OK: State is Optimal, checked 6 logical device(s) [20:51:13] RECOVERY - MySQL Replication Heartbeat on db73 is OK: OK replication delay seconds [20:52:36] ottomata: https://gist.github.com/atdt/9fa079d35f72c36e6b80/raw/3172b1e41e22c048e4d98dd6462aae2217ed1c3d/kafka_path.pp [20:52:59] that's what i usually do; nice and debianish [20:53:45] oh for the kafka user? [20:53:52] or is that for everyone? [20:54:00] hmmmMmmmm [20:54:08] hmmmm, yeah [20:54:14] i'm building the .deb right now [20:54:16] so I could do that [20:54:43] i think it might only be for login shells; there's (iirc) /etc/environment for shell-independent stuff [20:54:52] do you mind if we pick up this tomorrow? [20:59:50] sure no probs [21:01:16] mutante, we might need another private wiki soon, if it's decided that we (IEGCom) need one, should the DNS stuff go in an RT ticket and the actual wiki creation in a Bugzilla ticket, or both in the same RT ticket? ( I don't expect you to volunteer :P ) [21:01:55] Thehelpfulone: ehm.. 
start out with putting it all on a single ticket [21:02:12] if we feel we want to we can still split it out [21:03:59] heh, sure [21:07:00] New patchset: Dzahn; "add wikpedia.org to redirect to wikipedia (RT-4803)" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/59327 [21:09:55] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/59327 [21:10:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:11:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.131 second response time [21:12:39] dzahn is doing a graceful restart of all apaches [21:13:22] !log dzahn gracefulled all apaches [21:14:22] Nemo_bis: re: "< Nemo_bis> mutante: is wikpedia.org supposed to work? [21:14:27] now.. yes:) [21:15:36] mutante: wonderful :) [21:15:50] who knows, maybe in the next few decades legals may even recover wikipedia.it [21:16:21] perhaps if the community camps at WIPO or demonstrates naked in the snow like Femens during Davos [21:17:00] LOL [21:19:20] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [21:21:58] !log linode wikitech is now shut down \o/ [21:24:04] !log upgrading mw1060-mw1069 [21:25:35] linode is dead. 
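Returning to the $KAFKA_HOME/bin versus /usr/sbin question ottomata raised earlier: one common way out, since Kafka's bundled scripts locate each other via relative paths like $(dirname $0)/.., is a thin wrapper that pins KAFKA_HOME and builds absolute paths into the install tree, so those relative references keep resolving. A sketch only; the /usr/share/kafka default and the wrapper itself are hypothetical, not what the final .deb shipped.

```python
import os

# Hypothetical install prefix; the real .deb layout wasn't settled
# in this conversation. Overridable via the environment, matching
# the profile.d / /etc/environment options discussed above.
KAFKA_HOME = os.environ.get("KAFKA_HOME", "/usr/share/kafka")

def build_command(script, args):
    """Absolute command line for a bundled Kafka script, e.g.
    kafka-server-start.sh, so its internal $(dirname $0)/..
    still resolves inside KAFKA_HOME."""
    return [os.path.join(KAFKA_HOME, "bin", script)] + list(args)

print(build_command("kafka-server-start.sh",
                    ["/etc/kafka/server.properties"]))
```

An init script or a stub in /usr/sbin could exec the result of build_command, which avoids patching the upstream shell scripts at all.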
[21:26:20] PROBLEM - Apache HTTP on mw106 is CRITICAL: Connection refused [21:26:30] PROBLEM - DPKG on mw1067 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:26:30] PROBLEM - DPKG on mw1066 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:26:30] PROBLEM - DPKG on mw1069 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:26:30] PROBLEM - DPKG on mw1064 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:26:30] PROBLEM - DPKG on mw1061 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:26:30] PROBLEM - DPKG on mw1062 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:26:30] PROBLEM - DPKG on mw1065 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:28:20] RECOVERY - Apache HTTP on mw106 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.149 second response time [21:28:30] RECOVERY - DPKG on mw1069 is OK: All packages OK [21:28:30] RECOVERY - DPKG on mw1065 is OK: All packages OK [21:29:10] RobH: Oh finally I always hated that damn thing [21:29:30] RECOVERY - DPKG on mw1066 is OK: All packages OK [21:29:30] RECOVERY - DPKG on mw1061 is OK: All packages OK [21:29:30] RECOVERY - DPKG on mw1067 is OK: All packages OK [21:29:31] RECOVERY - DPKG on mw1064 is OK: All packages OK [21:29:31] RECOVERY - DPKG on mw1062 is OK: All packages OK [21:30:51] RoanKattouw: sent a test message to your phone [21:31:21] Yeah I guessed so :) [21:32:29] !log upgrading mw1070-mw1079 [21:32:38] alright, so you got a check on http://cerium.wikimedia.org:6081 with that phone number as a contact [21:32:47] OK good [21:32:55] Does it at least text instead of call [21:32:57] ? 
btw, we could even send custom POST variables, request headers, other timeouts..bla [21:33:32] RoanKattouw: we all did [21:33:35] wait, this was supposed to be a text when i hit the test button :p [21:33:36] but now its dead \o/ [21:33:50] mutante: It called me instead [21:35:04] oooh, type "phone" versus type "sms" [21:35:30] PROBLEM - DPKG on mw1078 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:35:30] PROBLEM - DPKG on mw1071 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:35:30] PROBLEM - DPKG on mw1076 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:35:30] PROBLEM - DPKG on mw1073 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:35:30] PROBLEM - DPKG on mw1070 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:35:30] PROBLEM - DPKG on mw1072 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:35:50] working hours 00:00 to 23:59 sounds right when it's Roan, heh [21:36:22] RoanKattouw: changed to type "sms". it says it sent a confirmation code [21:36:24] And that was a text message [21:36:28] Yes it did [21:36:32] PROBLEM - DPKG on mw1074 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:36:32] RECOVERY - DPKG on mw1072 is OK: All packages OK [21:36:32] PROBLEM - DPKG on mw1079 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:36:32] PROBLEM - DPKG on mw1075 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:36:36] actually i need the code [21:36:49] (PMed) [21:37:52] thx, done [21:38:32] RECOVERY - DPKG on mw1074 is OK: All packages OK [21:38:32] RECOVERY - DPKG on mw1071 is OK: All packages OK [21:38:32] RECOVERY - DPKG on mw1076 is OK: All packages OK [21:38:32] RECOVERY - DPKG on mw1070 is OK: All packages OK [21:38:32] RECOVERY - DPKG on mw1079 is OK: All packages OK [21:38:32] RECOVERY - DPKG on mw1073 is OK: All packages OK [21:38:32] RECOVERY - DPKG on mw1078 is OK: All packages OK [21:38:33] RECOVERY - DPKG on mw1075 is OK: All packages OK [21:41:42] 
PROBLEM - DPKG on sockpuppet is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:42:22] PROBLEM - NTP on cp1041 is CRITICAL: NTP CRITICAL: Offset unknown [21:42:53] !log doing a few more random host upgrades of misc hosts [21:44:26] hi Ryan_Lane [21:44:31] howdy [21:48:22] RECOVERY - NTP on cp1041 is OK: NTP OK: Offset -0.09200155735 secs [22:02:47] New patchset: Dzahn; "add webhostingwikipedia.com redirect (RT-4678)" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/59332 [22:02:52] New patchset: Andrew Bogott; "Added a configurable cachedir." [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/59333 [22:02:56] notpeter: I hear you've done some work recently on making search more reliable during index updates [22:03:31] binasher: so User::invalidateCache should be better in wmf2, but I don't know by how much [22:03:42] RECOVERY - DPKG on sockpuppet is OK: All packages OK [22:03:49] maybe there should be a bug about using memcached or something for that [22:03:49] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/59332 [22:04:02] New patchset: Andrew Bogott; "Added a configurable cachedir." [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/59333 [22:04:13] Change merged: Andrew Bogott; [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/59333 [22:04:41] New review: Kaldari; "First of all, it's very unlikely that MediaWiki:Contact would be defined, but not MediaWiki:Contact-..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57649 [22:05:14] xyzram: so on the logging side, we have https://gerrit.wikimedia.org/r/#/c/57337/ [22:05:33] yes [22:06:14] that probably needs to be improved a bit for logging [22:06:43] "mwsearch" is already used for logging every request, we probably want a separate channel for errors [22:06:54] and timeout errors need to be logged, not just 500s [22:07:02] ^demon: ^ [22:07:09] Chad was doing that. 
[22:07:11] can be in a separate commit though [22:07:28] New review: Kaldari; "To clarify, neither 'contact', nor 'contact-url' are defined in MediaWiki core, nor do I think they ..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57649 [22:07:35] did you say something about preloading index files? was that just an idea, or is that code written? [22:08:09] There are places in the Java code that use the term "warmup" [22:08:26] <^demon> We disabled that logging since it was overly spammy and was serving a one-off debugging we were doing. [22:08:34] <^demon> But yeah, we can narrow that down to 500s only. [22:08:50] 500s and all other HTTP request errors [22:08:55] in particular timeouts [22:09:17] I want to have timeout logging because that is the main thing that should spike with each index update [22:09:39] we won't really know whether we have fixed the problem of rsync causing downtime until we start logging timeouts [22:10:30] I'm fairly skeptical that splitting rsync in half on s4 would completely fix the problem [22:10:37] since the problem predates the existence of s4 :) [22:10:43] src/org/wikimedia/lsearch/search/Warmup.java [22:11:26] Agree. [22:11:49] But I'm not sure a "complete" fix is within reach. [22:12:33] ok [22:12:33] If an rsync of any significant size is going to cause everything else to slow waaaay down. [22:14:19] ^demon: can you look at the rest of the logging issues on the PHP side ? [22:14:45] <^demon> Yeah, I'll look at getting wfDebugLog() calls in the right places. [22:15:04] Ok, I didn't want to be duplicating effort ... thanks. [22:15:30] !log Moved labs-morebots to tools project. [22:15:40] have you seen Warmup.getWarmupCount()? [22:15:46] it looks like that feature has to be configured to work [22:18:01] yeah, and only enwiki spell has it configured [22:18:08] enwiki : (spell,40,10) (warmup,500) [22:19:15] Right, we can configure it for others but I'd like to have a better idea of which wikis actually need it. 
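For reference, the per-wiki search group line quoted above, "enwiki : (spell,40,10) (warmup,500)", reads as a wiki name plus typed parameter groups. The parser below is a guess at that grammar from this single sample (first token of each group is a type name, the rest numeric parameters); it is not lucene-search-2's actual config reader.

```python
# Parse an "enwiki : (spell,40,10) (warmup,500)" style line into
# a (wiki, {type: [params]}) pair. Grammar inferred from one sample.
import re

def parse_search_groups(line):
    wiki, _, rest = line.partition(":")
    groups = {}
    for m in re.finditer(r"\(([^)]*)\)", rest):
        parts = [p.strip() for p in m.group(1).split(",")]
        groups[parts[0]] = [int(p) for p in parts[1:]]
    return wiki.strip(), groups

print(parse_search_groups("enwiki : (spell,40,10) (warmup,500)"))
# → ('enwiki', {'spell': [40, 10], 'warmup': [500]})
```

On this reading, "(warmup,0)" for non-enwiki wikis is an explicit zero warmup count, matching the observation that the feature is effectively disabled everywhere else.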
[22:19:52] ok [22:20:15] I'm going to go have breakfast now [22:20:45] OK, I'll be around, meeting with Robla 4-4:30 [22:23:21] RobH: https://rt.wikimedia.org/Ticket/Display.html?id=4951 is filed. [22:23:51] Is there something we can figure out regarding permissions? I mean that ticket is probably gonna be moved to the procurement queue at which point Gabriel, James and myself can't see it any more [22:24:29] Maybe, like, have separate tickets for the private procurement stuff and the public discussion? [22:24:31] Pretty sure if you all list yourselves as requestors you will [22:24:34] (i thought) [22:24:46] but not sure lemme check queue permission [22:24:49] I am the only requestor, the other two are CC and AdminCC respectively [22:24:56] I suppose we can make them co-requestors? [22:25:16] yep, but lemme confirm requestors can see in that queue [22:25:42] RoanKattouw: nevermind.... [22:25:50] procurement is specifically locked down. [22:26:29] RoanKattouw: So yea, in the ops-requests we will figure out all the details, and then I'll link the actual purchase ticket [22:26:33] that way you guys can keep in loop [22:27:17] dzahn is doing a graceful restart of all apaches [22:28:39] OK good [22:30:00] dzahn is doing a graceful restart of all apaches [22:30:04] key issue, gotta do that again [22:30:21] re: RT permissions, yea, we just do the role based stuff for requestor on access-requests so far [22:30:43] !log dzahn gracefulled all apaches [22:54:06] RECOVERY - mysqld processes on db73 is OK: PROCS OK: 1 process with command name mysqld [22:54:35] New patchset: Dzahn; "remove from singer: account awjrichards, group wikidev, svn client" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58844 [22:54:59] New review: Dzahn; "recheck" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/58844 [22:56:06] TimStarling: I think the biggest win in that arena recently has been asher further splitting up the indexes in pool4 to a new pool5 
[22:56:06] RECOVERY - mysqld processes on db72 is OK: PROCS OK: 1 process with command name mysqld [22:56:07] TimStarling: " so on the logging side, we have https://gerrit.wikimedia.org/r/#/c/57337/" yes and additionally https://gerrit.wikimedia.org/r/#/c/56354/ improves logging on the Java side. [22:56:42] notpeter: when did that happen ? [22:57:40] 2/28 slash 3/1 [23:00:44] notpeter: xyzram was just telling me that lucene-search-2 has a "warmup" feature which seems specifically designed to avoid this issue of slowness after index updates [23:00:59] it's just not enabled [23:02:28] enabled for enwiki and seems specifically disabled for others: (warmup,0) [23:02:59] New review: Dzahn; "yes, manually removed from singer. also manually removed other old accounts that weren't puppetized." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58844 [23:02:59] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58844 [23:05:37] New patchset: Krinkle; "noc: Add missing entries to createTxtFileSymlinks.sh" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/59033 [23:05:38] New patchset: Dzahn; "icinga: fix jenkins monitoring" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58906 [23:07:04] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58906 [23:11:32] RECOVERY - jenkins_service_running on gallium is OK: PROCS OK: 1 process with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war [23:12:40] !log mwalker synchronized php-1.22wmf1/extensions/CentralNotice/ 'CentralNotice banner= fix' [23:13:19] TimStarling: xyzram want to try enabling it on a relatively stable pool, like pool2, and then proceed from there? [23:13:24] New review: Dzahn; "fixed it:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58906 [23:14:34] notpeter: do we currently have an issue though ? 
[23:15:21] currently, no [23:15:49] but if this buys us more time before we start getting issues again, then I'd love to act proactively [23:16:18] we don't know if we have an issue [23:17:03] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/59033 [23:17:05] nagios won't alert on LVS downtime for 3 minutes, we could have daily downtime that lasts for less than that [23:17:45] and it won't alert on individual search nodes for 6 minutes, individual search nodes could be generating timeouts for users without the pool as a whole being down [23:18:05] !log mwalker synchronized php-1.22wmf2/extensions/CentralNotice [23:18:16] true [23:25:35] New patchset: Krinkle; "noc: Refactor highlight.php to be simpler and more secure" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/59034 [23:31:19] If 3/6 is too high, what is a reasonable timeout pair ? 1/2 ? [23:57:24] !log csteipp synchronized php-1.22wmf2/includes/ 'Use gerrit versions of security fixes' [23:58:46] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [23:58:46] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [23:58:46] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours
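Tim's 3-minute/6-minute point can be made concrete: with Nagios-style checks, the worst-case time from outage start to alert is roughly one check interval plus the retries needed to reach a HARD state, so any outage shorter than that window can come and go unreported. The split into interval and retry values below is illustrative arithmetic only, not the production Nagios configuration.

```python
# Worst-case time-to-alert for a Nagios-style check:
# the first failed check can land a full check_interval after the
# outage begins, then (max_attempts - 1) retries confirm HARD state.
def worst_case_alert_delay(check_interval, retry_interval, max_attempts):
    return check_interval + (max_attempts - 1) * retry_interval

# The 3/6 minute windows from the conversation, one plausible split:
lvs_delay = worst_case_alert_delay(1, 1, 3)   # LVS pool checks
node_delay = worst_case_alert_delay(1, 1, 6)  # individual search nodes
print(lvs_delay, node_delay)  # → 3 6
```

Shrinking either the interval or the retry count tightens the blind spot, at the cost of more false alarms from transient blips, which is the trade-off behind the closing "is 1/2 reasonable?" question.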