[01:13:15] PROBLEM - Host wikiversity-lb.pmtpa.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [01:13:17] PROBLEM - Host wiktionary-lb.pmtpa.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [01:13:35] RECOVERY - Host wiktionary-lb.pmtpa.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 26.68 ms [01:13:38] RECOVERY - Host wikiversity-lb.pmtpa.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 26.67 ms [01:19:05] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [02:03:51] !log LocalisationUpdate completed (1.22wmf1) at Mon Apr 15 02:03:51 UTC 2013 [02:14:02] PROBLEM - Puppet freshness on search1015 is CRITICAL: No successful Puppet run in the last 10 hours [02:16:02] PROBLEM - Puppet freshness on search1016 is CRITICAL: No successful Puppet run in the last 10 hours [02:51:04] Reedy_: want to get bug 47221? [02:51:23] (or anyone) [02:54:13] where????!????????????????????????????????????????? [02:54:17] Heh. [03:14:30] morebots is dead again [03:14:59] When did it die? [03:15:26] before my scrollback ends :/ [03:17:43] Interesting, it died on April 12. [03:20:49] apparently, gotta get down on a friday means death for morebots... [03:21:25] TimStarling Reedy_: feel like gently nudging morebots back to life? [03:26:44] Don't worry, I filed . [03:27:50] maybe someone should just fix the code so that it doesn't die [03:29:46] Boring. [03:30:16] file a bug and assign it to Susan! *runs* [03:58:20] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [03:58:20] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [03:58:20] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [04:05:19] wtf is up with https://gerrit.wikimedia.org/r/59074 ? it's like PS3 never came in? 
[04:05:55] PS2 is selected by default when first loading the page and there's no comment (under comments) about PS3 [04:06:36] legoktm mentioned he got a gerrit edit conflict with me [04:06:46] while trying to fix the bug number in the commit message [04:07:09] why are you awake? [04:07:10] yeah it was weird [04:07:12] ;-) [04:07:33] legoktm: was there a message of some sort? [04:07:44] Conflict error or something [04:07:48] I thought I hit cancel though [04:07:57] maybe repo and DB are not in sync? [04:11:29] jeremyb_: mmm? [04:14:12] PROBLEM - Puppet freshness on gallium is CRITICAL: No successful Puppet run in the last 10 hours [04:20:51] odder: to explain the current state... [04:22:26] 06:07 jeremyb_: why are you awake? [04:22:32] I meant that one :) [04:28:36] ahhh. well it's quite early :) [04:36:25] New patchset: Tim Starling; "Preserve timestamps when copying l10n cdb files" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58910 [04:36:32] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58910 [04:36:46] New patchset: Tim Starling; "l10nupdate: Use refreshMessageBlobs.php instead of clearMessageBlobs.php" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58911 [04:43:26] New patchset: Krinkle; "Ensure 'php5-parsekit' in contint" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59105 [04:43:49] PROBLEM - Puppet freshness on virt1005 is CRITICAL: No successful Puppet run in the last 10 hours [04:44:37] TimStarling: Could you help me install php5-parsekit on gallium? 
^ [04:45:25] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59105 [04:46:10] puppet run in progress [04:46:43] Thx [04:47:50] err: Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate definition: System_role[role::zuul::production] is already defined in file /var/lib/git/operations/puppet/manifests/role/jenkins.pp at line 4; cannot redefine at /var/lib/git/operations/puppet/manifests/role/zuul.pp:41 on node gallium.wikimedia.org [04:48:18] I just installed that package directly [04:48:27] if puppet is broken then it can't revert me, can it? [04:49:00] I suppose [04:49:06] The package is working [04:49:31] Looks like someone broke it, hashar worked on zuul last week [04:49:53] Does the jenkins job we have for puppet not validate it completely? Or does it just do a syntax check? [04:50:04] (assuming puppet is capable of doing such validation) [04:50:56] it can catch certain types of errors [04:50:58] probably not that one [04:52:17] That's weird [04:52:24] There is no 'production' in zuul.php [04:52:25] .pp* [04:52:41] Perhaps it is running a new file with an old version of another file? [04:53:34] shouldn't be possible [04:54:09] ah, there are two zuul.pp (picked the wrong one in quick-open file) [04:54:14] yeah, it's in the other one [04:59:29] TimStarling: I'm not too familiar with puppet, it seems both system_role resources by that name only carry an identifier and a description. Is that how a system_role usually is created? Not sure what it is supposed to do there. [05:00:25] the class it is in is already included on its own [05:00:48] system_role just adds a line to the motd file [05:01:10] you know, [05:01:12] $ ssh root@gallium [05:01:12] Welcome to Ubuntu 12.04.2 LTS (GNU/Linux 3.2.0-39-generic x86_64) [05:01:12] gallium is a Wikimedia continuous integration test server (misc::contint::test). 
[05:01:54] there are probably two classes trying to create the same role, that's not allowed [05:02:31] well, it's allowed as long as you don't have both on the same server [05:03:08] Right [05:03:09] I'm quite certain hashar intended manifests/role/jenkins.pp:4 to read system_role { 'role::jenkins::production': description => 'Jenkins master on production' } [05:03:24] he had probably just done it for zuul and copy/pasted, changing the description but not the name [05:03:46] I thought a role would be a more elaborate resource, like node. [05:03:57] with a description, include some classes etc. [05:04:36] now it is basically a property of a class. Seems a bit counterintuitive given the way puppet works otherwise (not that I'm an expert or anything) [05:05:08] puppet is insane [05:05:50] TimStarling: btw, it seems I have to specify the extension config from the command line for it to pick it up (like bin/lint does on fenari as well) [05:05:52] anyways, if you look at manifests/generic-definitions.pp you'll see what $title is used for [05:06:06] it's to generate a nice message in the MOTD [05:06:25] a bit funny that this should cause puppet to barf, but it doesn't have a sense of what's important [05:06:38] That's a bit annoying since the location of it is different on every system I know, so I can't test it locally. Don't packages normally add a file in /etc/php5/conf.d so that they're present and hide the package manager's implementation? [05:06:50] Or did you install it outside apt-get altogether? [05:08:27] Hm.. I suppose though that it would save memory by not loading it for every php process. But then again, I don't know any other extension we have that doesn't put itself in conf. 
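[editor's aside] To make the duplicate-definition error above concrete, here is a sketch of the fix being discussed. The system_role line itself is quoted verbatim from the conversation; the surrounding class name is an assumption based on the change title merged later (gerrit 59106), not the actual file contents:

```puppet
# manifests/role/jenkins.pp (hypothetical reconstruction)
# system_role titles must be unique per node; the bug was that this
# resource had been copy/pasted from zuul.pp with the title
# 'role::zuul::production' left unchanged, colliding with zuul.pp:41.
class role::jenkins::master::production {
    system_role { 'role::jenkins::production':
        description => 'Jenkins master on production',
    }
}
```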
[05:12:27] New patchset: Ori.livneh; "Fix system_role title for role::jenkins::master::production" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59106 [05:13:52] ^ TimStarling [05:15:13] New review: Ori.livneh; "(1 comment)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58892 [05:19:43] New review: Tim Starling; "Can be merged once the dependency is merged." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/58911 [05:21:42] New review: Tim Starling; "Merging immediately since puppet is broken on gallium." [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/59106 [05:21:43] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59106 [05:24:39] TimStarling: Did you confirm it now runs smoothly? I just want to make sure that if I switch something to use it, it won't break jobs later. [05:25:03] * jeremyb_ wonders if someone wants to merge/deploy gerrit 59074 too [05:25:05] Only testing one extension for now, but I'll likely switch more to it later. [05:25:27] Krinkle: /var/log/puppet.log is world-readable, iirc [05:25:39] ori-l: but runs are only every 30 mins :) [05:25:52] unless tim ran it manually [05:25:57] * jeremyb_ nominates ori-l to merge that :) [05:26:04] i can't [05:26:16] oh, wait [05:26:21] it's not a puppet change. 
[05:26:30] puppet finished on gallium [05:26:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:27:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [05:27:29] as for parsekit lacking configuration, I think that is just a statement about how awful that extension is [05:27:43] it's not the sort of thing you would want to use in mediawiki [05:28:05] you should be able to specify it on the command line without a fully qualified path [05:28:15] like php -d extension=parsekit.so eval.php [05:28:50] I don't know if there is any other reason, want me to check it for hooks? [05:30:32] right, it does actually hook zend_error_cb [05:30:43] which may or may not break wmerrors [05:30:47] or wmerrors might break it [05:31:28] I thought the conf.d file is something the debian package usually adds. [05:31:50] I guess this isn't a package of ours? [05:32:38] I remember a bug about it in our bug tracker. Since it is common for php5- packages to add a 2-line file in conf.d, I figured we did so as well. [05:32:49] I can't find a deb- repo for it though [05:33:16] it is our package [05:33:48] I'm interested in seeing that package because the default parsekit is broken on php 5.4. I'd like to create a homebrew (mac os package manager) package for it. [05:33:59] I assume our package has a patch for the source code [05:34:37] we use PHP 5.3 [05:34:43] Really [05:34:53] Oh, of course [05:35:03] well, it's not quite that obvious [05:35:14] it is our own package too, and we have discussed moving to PHP 5.4 at length [05:35:35] * Krinkle just checked default php in our ubuntu package channel. [05:35:40] making sure all these extensions work (wmerrors etc.) would be the main part of the migration project [05:35:55] precise is still at 5.3 too [05:39:05] Anyhow, I'll hardcode the -n option in the php call for now. 
Though it'd be nice if our package could add a conf.d/parsekit.ini file (like apc, curl, lua, pdo, tidy etc.) [05:40:31] jeremyb_: you assigned a wmf-config bug to yourself for an event that starts in 5 hours and neither you nor the three people you added as reviewers have +2 in that repo..? [05:43:47] ori-l: i assigned none of them as reviewers. been kinda busy with other stuff. actually i did that git push while on a moving nyc subway train underground [05:44:01] i don't even have a browser open right now [05:44:02] which train? [05:44:18] can't remember if it was the 2 or 3 [05:44:32] sigh [05:44:41] * ori-l used to live on the 1 [05:44:45] heh [05:44:56] anyways, velocity doesn't have terribly much to do with how smart that was [05:45:10] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/59074 [05:45:31] ori-l: the point was i've been preoccupied [05:45:34] and still am [05:45:36] TimStarling: Great, thanks for your input on the changes and help with deployment. Here's the first run on a real change from jenkins: https://integration.wikimedia.org/ci/job/mwext-UploadWizard-lint/110/console [05:46:14] ori-l: (also, i didn't assign the bug to myself either. but i would have) [05:48:03] ori-l: fwiw, i wouldn't feel terribly bad if it wasn't done in time... they should know better than to wait for the day before. but greasing the wheels a tiny bit doesn't hurt i guess [05:49:42] !log olivneh synchronized wmf-config/throttle.php 'Adding throttling exception for itwiki GLAM event which starts in four hours (Bug 47221, change Id9ae600ac)' [05:50:08] you got the bug # right! 
mazel tov [05:50:11] :-) [05:50:32] (see the earlier patchsets) [05:56:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:57:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.149 second response time [06:06:18] New patchset: Tim Starling; "dont duplicate wikimedia-task-appserver dependencies" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59004 [06:06:24] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59004 [06:15:53] PROBLEM - Puppet freshness on db1058 is CRITICAL: No successful Puppet run in the last 10 hours [06:26:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:26:59] TimStarling: Can you try running lint.php on the files in the code-utils repo (as an example). I'm getting a segmentation fault on Jenkins. Though only when it sees more than N files or a big file like check-vars.php. [06:27:03] Works locally for me. [06:27:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [06:27:34] Trying to narrow down what the cause could be. [06:29:04] with the latest lint? [06:29:33] yeah [06:29:40] from the repo itself [06:29:57] e.g. ./lint.php -v . 
[06:30:29] locally it works both with and without an error (when I modify one of the files to be incorrect) [06:30:43] on gallium it works with a valid file but segfaults when there is an error [06:30:47] https://integration.wikimedia.org/ci/job/mwext-UploadWizard-lint/115/console [06:31:03] RECOVERY - Puppet freshness on gallium is OK: puppet ran at Mon Apr 15 06:30:57 UTC 2013 [06:33:15] works for me [06:33:48] without an error it works, just trying with [06:35:00] ah, with an error it does indeed segfault [06:35:29] this is with quite an old PHP and an old parsekit [06:35:45] Program received signal SIGSEGV, Segmentation fault. [06:35:45] 0x00000000009b66b6 in _get_zval_ptr_ptr_cv (type=1, Ts=0x7ffff7eb3098, node=0x1415100) [06:35:46] at /home/tstarling/src/php/releases/php-5.3.8-slow/Zend/zend_execute.c:302 [06:35:46] 302 zval ***ptr = &CV_OF(node->u.var); [06:36:22] (gdb) print CV_OF(node->u.var) [06:36:22] Cannot access memory at address 0x40 [06:36:48] When I check out code-utils in my home directory on fenari it works both with and without an error. [06:37:03] Different parsekit version? [06:37:44] actually scrap that, it segfaults there too [06:38:04] both latest lint.php and the version that is on fenari [06:39:00] assuming we package parsekit 1.3.0, it must be related to php 5.3, because I have parsekit 1.3.0 locally (patched for php 5.4) and it works. Either that or something ubuntu related perhaps? [06:39:43] ugh, considering parsekit's history of issues and abandonment, perhaps we should look for an alternative [06:47:17] http://php.net/runkit_lint_file looks interesting, though its implementation seems very simple (using zend compile_filename/compile_string), which could be subject to the same problem as parsekit. 
[06:50:25] https://github.com/php/pecl-php-parsekit/blob/master/parsekit.c#L862-L948 [06:50:26] https://github.com/zenovich/runkit/blob/master/runkit_sandbox.c#L1947-L1974 [07:27:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:28:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [07:35:06] hey hashar, Krinkle [07:35:28] ori-l: hey :-] [07:35:35] sorry to dump this on you, but can one of you look at https://gerrit.wikimedia.org/r/#/c/59113/ and comments 9 & 12 on https://bugzilla.wikimedia.org/show_bug.cgi?id=46577 [07:35:48] i have to run, and this is causing fatals in prod [07:35:52] ori-l: I have noticed your mail about tracking stacktrace. Not sure yet how to reply to it nor whether I will have any time to help you with the project :( [07:36:26] ori-l: will take care of it with Jeroen [07:36:42] ori-l: get some sleep and enjoy tomorrow morning with your family =) [07:37:10] thanks :) [07:37:18] * Krinkle has to go as well [07:37:28] *wave* [07:37:34] cya [07:37:34] hashar: I'm done with the changes, all committed, merged and pushed to jenkins [07:37:39] ori-l: Krinkle you two should relocate to Europe :-] [07:37:45] Krinkle: great thank you! [07:39:12] ok, i'm off -- hashar, enjoy the rest of your day & ttyl [07:39:23] I am in Europe :P [07:39:25] still waiting for mediawiki-core-lint to get pushed, though I can abort that. The changes are in the repo, so when you push it it'll be the same. [07:39:29] It's been 9 minutes. 1 job. [07:39:31] Something is broken [07:40:30] Krinkle: maybe each time it updates a job it checks the freshness of all the jobs :-D [07:40:43] Krinkle: it is hard to say since python-jenkins has no log 
[08:24:14] morebots is dead :p [08:26:08] https://bugzilla.wikimedia.org/show_bug.cgi?id=47228 [08:27:23] PROBLEM - Puppet freshness on cp3003 is CRITICAL: No successful Puppet run in the last 10 hours [08:27:57] pooor morebots [08:28:38] I don't even know where it runs nowadays [08:31:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:32:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.133 second response time [08:33:33] PROBLEM - Backend Squid HTTP on sq33 is CRITICAL: Connection timed out [08:33:33] PROBLEM - SSH on sq33 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:35:40] New review: Hashar; "Yeah wrong copy paste. Sorry about that and thank you to have spotted and fixed it!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59106 [08:37:55] PROBLEM - Frontend Squid HTTP on sq33 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:41:46] damn [08:41:58] poor jenkins spurt errors [08:56:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:57:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [09:19:54] !log jenkins died :( [09:21:38] PROBLEM - Host sq33 is DOWN: PING CRITICAL - Packet loss = 100% [09:40:09] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [11:13:37] New patchset: Hashar; "beta: syslog-ng on deployment-bastion host" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51668 [11:15:12] New patchset: Hashar; "systemuser learned 'managehome' (default true)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53879 [11:15:22] New patchset: Hashar; "create jenkins user with systemuser" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53880 [11:16:37] New review: Hashar; "That is 
fully back compatible with the previous definition and will let us optionally declare system..." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/53879 [11:17:52] New review: Hashar; "This is merely normalizing the use of systemuser {} for applications users. It is equivalent to us..." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/53880 [11:19:12] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [11:20:52] New patchset: Hashar; "sql script no more need /etc/cluster" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55877 [11:21:50] New patchset: Mark Bergsma; "Add cp3008 to the pool" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59123 [11:22:27] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59123 [11:24:47] New patchset: Hashar; "zuul: support cloning from a different branch" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58737 [11:25:13] New review: Hashar; "Also get git_branch in manifests/zuul.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58737 [11:25:27] New patchset: Hashar; "zuul: in labs use the `labs` branch to install Zuul" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58738 [11:26:37] New review: Hashar; "I have moved support of $git_branch in manifests/zuul.pp to the dependent change https://gerrit.wiki..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/58738 [11:27:43] New patchset: Hashar; "zuul: support specifying the git directory" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58898 [11:29:38] New patchset: Hashar; "zuul: migrate git dir in production to the ssd" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58899 [11:31:28] New patchset: Hashar; "zuul: support specifying the git directory" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58898 [11:31:35] New patchset: Hashar; "zuul: migrate git dir in production to the ssd" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58899 [11:32:22] New review: Hashar; "This should be harmless, it just adds a new parameter to the classes." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/58898 [11:33:35] New review: Hashar; "The change to manifests/zuul.pp has been moved to dependent change https://gerrit.wikimedia.org/r/#/..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/58899 [11:35:04] New review: Hashar; "This should be harmless, it is just to let the labs instance use a different code than on production..." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/58738 [11:41:33] New patchset: Hashar; "correct jenkins master system role" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59124 [11:42:54] Change abandoned: Hashar; "not needed anymore, has been hacked differently." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53875 [12:04:25] !log Depooled knsq16-knsq22 frontend squids [12:15:01] PROBLEM - Puppet freshness on search1015 is CRITICAL: No successful Puppet run in the last 10 hours [12:15:53] yay! 
[12:17:01] PROBLEM - Puppet freshness on search1016 is CRITICAL: No successful Puppet run in the last 10 hours [12:23:56] !log Depooled amssq47-amssq62 frontend squids [12:52:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:53:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.146 second response time [13:05:29] New patchset: Hashar; "mobile always uses role::cache::configuration" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54864 [13:06:33] New review: Hashar; "* rebased" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54864 [13:06:49] New patchset: Mark Bergsma; "Double upload frontend cache size to 8 GB, on 96 GB boxes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59129 [13:07:31] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59129 [13:09:59] mark, looks like VCLs weren't reloaded after https://gerrit.wikimedia.org/r/#/c/32866/ was merged [13:10:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:11:10] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.135 second response time [13:12:20] MaxSem: that's possible [13:12:32] there's probably no notification/subscription on that file [13:12:51] yeah [13:13:19] as well as geoip [13:14:56] New patchset: Hashar; "upload cache in labs now uses role::cache::configuration" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54863 [13:15:54] New review: Hashar; "* rebased" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54863 [13:18:28] New patchset: Hashar; "(bug 44041) adapt role::cache::mobile for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44709 [13:19:14] New review: Hashar; "rebased, fixed the trivial conflict around 
varnish::logging statements" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44709 [13:19:54] New patchset: Mark Bergsma; "Reload VCL on varnish::common-vcl changes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59131 [13:20:31] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59131 [13:22:13] New patchset: Hashar; "Varnish rules for Beta cluster" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47567 [13:22:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:23:16] New review: Hashar; "rebased, fixed a trivial conflict related to some Analytics header." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47567 [13:24:10] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.140 second response time [13:33:09] New patchset: Hashar; "adapt role::cache::upload for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50064 [13:36:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [13:41:53] yeahhh that applies [13:41:55] \O/ [13:41:57] New patchset: Mark Bergsma; "Remove special storage backend config for esams upload caches" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59133 [13:42:38] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59133 [13:46:17] New patchset: Ottomata; "Subscribing self puppetmaster service to class puppet::self::client." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/59134 [13:46:29] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59134 [13:46:38] New patchset: Diederik; "Remove class 'analinterns' it is no longer used." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59135 [13:47:16] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59135 [13:50:15] New patchset: Mark Bergsma; "Revert "Remove class 'analinterns' it is no longer used."" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59136 [13:50:31] nostalgia [13:50:45] :D [13:51:32] analinterns LOL [13:54:08] ahhh [13:54:14] finally. That one was offending me. [13:57:11] it's still there :) [13:58:09] -:( [13:58:41] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [13:58:41] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [13:58:41] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [14:44:41] PROBLEM - Puppet freshness on virt1005 is CRITICAL: No successful Puppet run in the last 10 hours [14:48:05] New patchset: coren; "Split labstore[34] from [12] (gluster->nfs)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59143 [14:49:12] !log reedy synchronized php-1.22wmf2/ 'Initial sync of 1.22wmf2' [14:52:49] New patchset: coren; "Split labstore[34] from [12] (gluster->nfs)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59143 [14:53:46] !log reedy synchronized wmf-config/ [14:54:27] Reedy: can you check https://gerrit.wikimedia.org/r/#/c/59143/2 for me, this is causing issues. [14:54:37] !log reedy synchronized w [14:54:58] Coren: Did you mean me? [14:55:06] where is the bot who logs to https://wikitech.wikimedia.org/wiki/Server_Admin_Log ? 
[14:55:11] dead [14:55:49] !log reedy synchronized docroot [14:55:57] Reedy: I meant R[TAB]; that used to be Ryan all of 30 s ago. :-) [14:56:11] Reedy: But he fled, so if you /can/ review, I'd rather not self +2 [14:56:17] I thought it was tab fail ;) [14:56:26] Raymond_afk: https://bugzilla.wikimedia.org/show_bug.cgi?id=47228 [14:56:30] I could review it, but I can't approve it, nor do I really know what I'm looking at [14:56:48] Reedy: Heh. No worries. I'll try to find someone elsse. [14:57:05] odder: thanks for the link [14:58:52] New review: Mark Bergsma; "(1 comment)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59143 [14:58:53] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: test2wiki to 1.22wmf2, scap to come [15:00:25] New review: coren; "I don't like self +2, but this is causing issues right now." [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/59143 [15:00:26] Change merged: coren; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59143 [15:01:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:02:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.168 second response time [15:18:36] !log reedy Started syncing Wikimedia installation... : Build localisation cache for 1.22wmf2 [15:18:50] Reedy: feel like running the defib over morebots and bringing it back to life? [15:22:06] I don't think I can [15:30:07] New patchset: Ottomata; "Using a different ssldir for self hosted puppet clients." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59148 [15:33:54] New patchset: Ottomata; "Using a different ssldir for self hosted puppet clients." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/59148 [15:34:09] mw1101: rsync: failed to set times on "/usr/local/apache/common-local/live-1.5": Operation not permitted (1) [15:34:10] mw1035: rsync: failed to set times on "/usr/local/apache/common-local/live-1.5": Operation not permitted (1) [15:34:12] bleugh [15:34:30] Shall have to fix it post scap [15:34:32] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59148 [15:35:01] apergos: snapshot1: sudo: no tty present and no askpass program specified [15:36:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:37:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [15:52:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:52:30] New patchset: Ottomata; "Ensuring puppet::self::client ssldir exists," [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59152 [15:52:58] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59152 [15:53:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.143 second response time [15:53:37] !log reedy Finished syncing Wikimedia installation... : Build localisation cache for 1.22wmf2 [15:58:56] New patchset: Reedy; "1.22wmf2 setup" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/59153 [15:59:08] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/59153 [16:04:57] PROBLEM - NTP on cp1041 is CRITICAL: NTP CRITICAL: Offset unknown [16:08:51] Any ops around that could fix a permission problem for me? 
/usr/local/apache/common/live-1.5 is owned by root:root, should be mwdeploy:mwdeploy Taaa [16:08:52] ddsh -F30 -cM -g mediawiki-installation -o -oSetupTimeout=10 'chown mwdeploy:mwdeploy /usr/local/apache/common/live-1.5' [16:09:57] RECOVERY - NTP on cp1041 is OK: NTP OK: Offset -0.05912792683 secs [16:11:28] Reedy: running [16:11:35] Thanks [16:11:46] done [16:12:38] Hmm [16:12:43] Apparently that didn't fix it [16:15:57] PROBLEM - Puppet freshness on db1058 is CRITICAL: No successful Puppet run in the last 10 hours [16:18:09] sec [16:18:44] how about now? [16:18:54] Reedy: ^ [16:20:02] Yup! thanks [16:20:21] it needed -h [16:20:26] as live-1.5 is a symlink [16:20:55] aha [16:21:03] paravoid, there's a request at https://bugzilla.wikimedia.org/show_bug.cgi?id=47197 to delete a mailing list with no archives - do you (ops) need an RT ticket for that? [16:21:10] Hopefully that should fix most of the error spam during scap :) [16:21:46] Thehelpfulone: yes, preferably [16:21:57] okay, I'll create one and link it :) [16:21:59] sometimes andre__ brings BZ to our attention [16:22:40] :) [16:34:33] bblack, ping [16:34:50] yurik: pong [16:34:54] hi! [16:35:01] what's up? [16:35:23] you are still not in the https://office.wikimedia.org/wiki/Contact_list list :) [16:35:34] but that's not what i wanted to ask :) [16:35:43] i have looked at some of the code you posted [16:35:59] you mean the vmod_netmapper thing? [16:36:02] had some questions about it - are you sure you want to deal with the locking and all the other issues? [16:36:04] yep [16:36:28] what locking and other issues are you concerned about? [16:36:31] i mean - wouldn't it be easier for the external shell script/cron job to issue a vagrant - reload command to reload scripts? [16:37:01] because if you will have to perform all the locking and other stuff in code, it might be fairly expensive [16:37:04] just a thought [16:37:19] sorry, varnish, not vagrant [16:37:30] and restarting varnish is not expensive at all? 
;) [16:37:36] I'm not sure if "reload" reloads vmods, but I suspect it's just VCLs [16:37:44] hmm, that might be true [16:37:57] well, varnish restart should not be required, at least i hope its not :) [16:38:01] the upside of doing the reload in-process though, regardless, is zero downtime [16:38:19] or zero "delays in processing the next request" or whatever you want to call it [16:38:21] bblack: I've added it to the ops meeting agenda, I'll ask you to tell a little bit about it [16:38:23] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:38:44] well, you will always have a delay i guess :) [16:38:51] actually, no, we won't :) [16:39:06] with a lock? [16:39:06] how [16:39:08] look closer, there is no lock [16:39:13] (there is no spoon) [16:39:17] i mean - if you had no lock at all, you can do reference change [16:39:30] hhmmm.. maybe you already implemented it that way :) [16:39:53] i briefly looked at it, so can't claim i understood it [16:40:02] this is where you want to read: http://lttng.org/urcu [16:40:05] i thought you were doing sync locking [16:40:22] RECOVERY - Disk space on db1033 is OK: DISK OK [16:41:15] the pthread calls in there are just to sync up the initialization of the vmod. that stuff may not be perfectly placed yet. in its current form it may pointlessly reinitialize the netmapper stuff on VCL reload (because all the VCL threads do their finishing call). but that can be figured out once the rest is working. [16:42:07] thx for the lib pointer, will read about it. 
As for the rest, if there is no threadsync for data access, its all great :) [16:46:27] yurik: yeah, the basic urcu data model is kind of like "have the readers access all the critical data through this one pointer, and have the writer load a new data update on the side, replace that pointer so it points at new data, and then free up the old copy" [16:46:41] except, if you did that without liburcu, it wouldn't really work. [16:47:12] (because sometimes CPU cache would keep the older pointer "valid" for the reader after the writer swapped it, and sometimes the reader would still be using the old structure when the writer frees it) [16:47:22] liburcu solves those issues locklessly though [16:51:44] bblack: btw, liburcu maintainer replied, he's taking care of my suggestions :) [16:52:06] awesome :) [16:52:35] but because he wasn't careful enough, wheezy is going to be released with 0.6.7 instead of 0.7.6 :( [16:53:21] New patchset: Andrew Bogott; "Added 'pythonfile' module." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59161 [16:54:07] bwahahahahahahahaa [16:54:09] Breaking news: Debian gets released with a backdated version of some software package! More shocking news at 9! [16:54:09] BWAHAHAHAHAHAHAHAHA [16:54:13] my week in hell is over! 
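bblack's description of the urcu data model above maps onto only a few lines of code. This Python sketch shows just the pointer-swap pattern he describes (readers go through a single reference, the writer builds a copy on the side and swaps the reference in one step); it is not RCU itself, because in Python the GIL makes the rebinding atomic and the garbage collector frees the old copy, which is exactly the hard part liburcu solves locklessly in C with grace periods.

```python
# The "one pointer" readers use; the netmap name and entries are
# made up for illustration, not vmod_netmapper's real data.
netmap = {"10.0.0.0/8": "internal"}

def reader(ip_block):
    data = netmap              # grab the current snapshot once...
    return data.get(ip_block)  # ...and use only that snapshot

def writer(new_entries):
    global netmap
    updated = dict(netmap)     # load the new data "on the side"
    updated.update(new_entries)
    netmap = updated           # swap the pointer; old copy is GC'd

writer({"192.0.2.0/24": "external"})
print(reader("192.0.2.0/24"))  # → external
```

In C, the swap and the later free are the dangerous steps (stale CPU caches, readers still inside the old structure, as noted above); liburcu's rcu_assign_pointer/synchronize_rcu pair exists to make them safe without readers taking a lock.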
[16:54:14] \o/ [16:54:25] now apergos must suffer as I have suffered [16:56:58] RobH, I sent you that email before you changed the topic :p [16:57:57] heh, no worries [16:58:12] monday before the handoff is a vague area [16:58:32] I'll end up handing things off to ariel personally, including that email, but until ops meeting there isn't a lot of RT movement [17:03:22] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [17:05:54] bblack, had a connection drop in case you replied [17:06:44] last thing I said on that topic was: 11:47 < bblack> liburcu solves those issues locklessly though [17:07:01] I didn't see any other question from you, if it was lost in your dead connection :) [17:19:40] New patchset: Andrew Bogott; "Added 'pythonfile' module." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59161 [17:19:40] New patchset: Andrew Bogott; "Install ldapsupportlib using pythonfile." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59163 [17:21:15] New review: Andrew Bogott; "recheck" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59161 [17:44:34] New patchset: Ottomata; "Setting X-Analytics header in vcl_deliver." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59166 [17:57:58] linode got hacked: http://slashdot.org/firehose.pl?op=view&type=submission&id=2603667 [18:03:17] ori-l: :-( [18:06:57] Oh, yeah right. 
Deployment time [18:07:31] Reedy: :) [18:09:55] PROBLEM - NTP on cp1041 is CRITICAL: NTP CRITICAL: Offset unknown [18:13:35] PROBLEM - search indices - check lucene status page on search1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:14:43] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: testwiki and mediawikiwiki to 1.22wmf2 [18:14:55] RECOVERY - NTP on cp1041 is OK: NTP OK: Offset 0.04932034016 secs [18:15:05] New patchset: Reedy; "testwiki and mediawikiwiki to 1.22wmf2" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/59170 [18:15:19] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/59170 [18:27:55] PROBLEM - Puppet freshness on cp3003 is CRITICAL: No successful Puppet run in the last 10 hours [18:28:25] RECOVERY - search indices - check lucene status page on search1016 is OK: HTTP OK: HTTP/1.1 200 OK - 52993 bytes in 0.010 second response time [18:28:36] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:34:26] thanks apergos :) [18:34:31] yw [18:34:35] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [18:40:33] preilly: wikimedia alumni! :) [18:42:25] rfaulkner: Okay [18:43:28] rfaulkner: invitation sent [18:44:52] preilly: great ty [18:45:57] rfaulkner: Are you having a good trip? [18:48:15] preilly: yeah, whistler was a lot of fun … just flew in to Toronto yesterday [18:48:28] rfaulkner: ah cool [18:48:34] rfaulkner: when are you back in SF? 
[18:48:48] here until Saturday, back sat night [18:49:22] maybe i'll even see drdee while I'm here ;) [18:49:45] unless you keep hiding rfaulkner [18:49:53] haha [19:13:01] New patchset: ArielGlenn; "redirect wikimaps.net to wikipedia.org (RT 4945)" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/59179 [19:14:47] Change merged: ArielGlenn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/59179 [19:33:58] paravoid, is this how you would accomplish this? https://gerrit.wikimedia.org/r/#/c/59161/ [19:34:31] no :) [19:34:42] you shouldn't install files in /usr/lib with puppet [19:34:54] where, then? [19:34:58] as for dist-packages, this is... well, for distribution packages :) [19:35:18] puppet is "local" changes, so /usr/local [19:35:23] but not in this case [19:35:26] why do you need this? [19:35:48] For this, at the moment: https://gerrit.wikimedia.org/r/#/c/59163/ [19:35:52] But, seems generally useful. [19:36:03] Unless you want to argue that those .py files should always be in a .deb [19:36:16] ariel is doing a graceful restart of all apaches [19:36:39] !log ariel gracefulled all apaches [19:37:11] yeah, this looks like something that needs packaging [19:37:22] RECOVERY - Apache HTTP on mw27 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 742 bytes in 0.110 second response time [19:37:23] hm, or not [19:37:34] if it's a script in /usr/local/sbin, I think it's fine to just be there [19:37:54] if it needs libraries in PYTHONPATH, then it needs packaging [19:38:48] I am working on a bot that wants to use ldapsupportlib.py. That's what prompted all this. [19:38:59] I can hardcode the weird path in the bot, but that seems worse. 
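The ldapsupportlib question above (a single shared .py file that several tools need, with packaging felt to be overkill) is usually resolved with an import fallback rather than hardcoding the path at every call site. A sketch of that approach; the /usr/local/sbin location comes from the later "Look for ldapsupportlib in /usr/local/sbin." adminbot patchset, but this code is a guess at the shape of that change, not the change itself.

```python
# Fallback import: try the normal sys.path first, then retry with
# the known install location appended.
import sys

LDAPLIB_DIR = "/usr/local/sbin"  # where puppet drops the .py file

def import_ldapsupportlib():
    try:
        import ldapsupportlib
    except ImportError:
        if LDAPLIB_DIR not in sys.path:
            sys.path.append(LDAPLIB_DIR)
        import ldapsupportlib  # still raises if the host lacks it
    return ldapsupportlib
```

This keeps the weird path in one place, which was andrewbogott's objection to hardcoding it in the bot itself.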
[19:40:52] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [19:52:12] New patchset: Pyoungmeister; "adding db72 to s4 and db73 to s5" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59186 [19:55:12] PROBLEM - mysqld processes on db59 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [19:57:12] RECOVERY - mysqld processes on db59 is OK: PROCS OK: 1 process with command name mysqld [19:58:21] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59186 [19:59:04] andrewbogott: if it's a single .py file and the license is compatible, why not simply include it in the same directory as the bot? [19:59:39] ori-l: It's used by multiple tools. I could put it in two places, of course… lots of messy options :) [20:02:03] andrewbogott: hrm, i see [20:02:05] New patchset: Asher; "binary character and collation defaults for mariadb" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59187 [20:02:42] ori-l, it's not a complicated problem. I'm just vaguely disposed against creating ever more tiny .deb packages [20:03:06] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59187 [20:09:48] RECOVERY - Puppet freshness on search1016 is OK: puppet ran at Mon Apr 15 20:09:39 UTC 2013 [20:10:18] RECOVERY - Puppet freshness on search1015 is OK: puppet ran at Mon Apr 15 20:10:13 UTC 2013 [20:12:48] hrm, how is the site doing ? [20:13:38] LeslieCarr: ?? [20:13:52] big news event usually equals some site issues [20:13:57] i am checking network stuff [20:13:57] LeslieCarr: OK [20:14:05] linode = big news or is there another one? 
[20:14:14] apergos: http://en.wikipedia.org/wiki/Boston_Marathon_explosions [20:14:20] oh no [20:14:37] wtf [20:14:45] LeslieCarr: I think that it will only be big news in the US, tbh [20:14:50] pope is big news in lots of the world [20:15:01] notpeter, don't be silly [20:15:01] i'm in boston right now, my buddy and I almost biked down to see the marathon an hour ago! [20:15:02] turns out, lots of people live in places that get bombed every day :/ [20:15:07] it's a perfect time for an explosion in Europe [20:15:20] 10 PM, lots of folks are in front of their computers [20:15:30] I guess there'll be a lot of traffic coming [20:16:07] I mean, maybe wikinews traffic will increase by 10x [20:18:21] binasher: https://ishmael.wikimedia.org/?hours=3&host=db1017&sort=time ;) [20:25:37] New patchset: Andrew Bogott; "Look for ldapsupportlib in /usr/local/sbin." [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/59193 [20:25:42] notpeter: it's already top story on two of the top greek news sites [20:26:18] Change merged: Andrew Bogott; [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/59193 [20:29:12] hiiii paravoid, you around? i'm getting close to submitting a new patchset for the kafka .deb [20:29:21] wanted to talk with you about the $KAFKA_HOME/bin/kafka-server-start.sh bit [20:29:28] vs /usr/sbin, etc. [20:30:05] apergos: huh [20:30:23] !log DNS update - add wikpedia.org [20:31:53] !log restarting pdns on ns1 [20:34:20] New review: Krinkle; "@Peachey88 I assume that's a reply to @Hashar? Since the people that the are discouraged by the bots..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57752 [20:38:54] cmjohnson1: did the new cards and ex4200 modules get in yet ? [20:40:55] ottomata: hey [20:40:57] New review: Hashar; "Peachey wrote:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57752 [20:41:07] heya [20:41:18] ja so, i'm not sure what to do about that. 
[20:41:32] the shell scripts in kafka/bin use relative paths [20:41:34] things like [20:41:39] $(dirname $0)/.. [20:41:50] i could patch them? [20:42:24] the cdh4 packages all make use of a *-env.sh file that the scripts all know how to load [20:42:47] this could set something like $KAFKA_HOME, and then the other kafka script could all use that var [20:43:04] not sure though [20:45:43] leslicarr: i don't know yet..i would assume yes but I have not seen anything from equinix yet...i will not be able to check until Wednesday [20:47:17] i keep forgetting, sorry! [20:48:23] PROBLEM - DPKG on db73 is CRITICAL: NRPE: Command check_dpkg not defined [20:48:23] PROBLEM - MySQL Slave Delay on db73 is CRITICAL: NRPE: Command check_mysql_slave_delay not defined [20:48:23] PROBLEM - mysqld processes on db73 is CRITICAL: NRPE: Command check_mysqld not defined [20:48:23] PROBLEM - MySQL Slave Running on db72 is CRITICAL: NRPE: Command check_mysql_slave_running not defined [20:48:23] PROBLEM - Disk space on db72 is CRITICAL: NRPE: Command check_disk_space not defined [20:48:33] PROBLEM - Full LVS Snapshot on db72 is CRITICAL: NRPE: Command check_lvs not defined [20:48:33] PROBLEM - MySQL disk space on db72 is CRITICAL: NRPE: Command check_disk_6_3 not defined [20:48:33] PROBLEM - Disk space on db73 is CRITICAL: NRPE: Command check_disk_space not defined [20:48:33] PROBLEM - MySQL Slave Running on db73 is CRITICAL: NRPE: Command check_mysql_slave_running not defined [20:48:43] PROBLEM - MySQL Idle Transactions on db72 is CRITICAL: NRPE: Command check_mysql_idle_transactions not defined [20:48:43] PROBLEM - Full LVS Snapshot on db73 is CRITICAL: NRPE: Command check_lvs not defined [20:48:43] PROBLEM - MySQL disk space on db73 is CRITICAL: NRPE: Command check_disk_6_3 not defined [20:48:53] PROBLEM - MySQL Recent Restart on db72 is CRITICAL: NRPE: Command check_mysql_recent_restart not defined [20:48:53] PROBLEM - RAID on db72 is CRITICAL: NRPE: Command check_raid not defined 
[20:48:53] PROBLEM - MySQL Idle Transactions on db73 is CRITICAL: NRPE: Command check_mysql_idle_transactions not defined [20:49:03] PROBLEM - MySQL Replication Heartbeat on db72 is CRITICAL: NRPE: Command check_mysql_slave_heartbeat not defined [20:49:03] PROBLEM - RAID on db73 is CRITICAL: NRPE: Command check_raid not defined [20:49:03] PROBLEM - MySQL Recent Restart on db73 is CRITICAL: NRPE: Command check_mysql_recent_restart not defined [20:49:13] PROBLEM - DPKG on db72 is CRITICAL: NRPE: Command check_dpkg not defined [20:49:13] PROBLEM - MySQL Slave Delay on db72 is CRITICAL: NRPE: Command check_mysql_slave_delay not defined [20:49:13] PROBLEM - MySQL Replication Heartbeat on db73 is CRITICAL: NRPE: Command check_mysql_slave_heartbeat not defined [20:49:13] PROBLEM - mysqld processes on db72 is CRITICAL: NRPE: Command check_mysqld not defined [20:50:13] RECOVERY - DPKG on db72 is OK: All packages OK [20:50:13] RECOVERY - MySQL Slave Delay on db72 is OK: OK replication delay seconds [20:50:23] RECOVERY - MySQL Slave Delay on db73 is OK: OK replication delay seconds [20:50:23] RECOVERY - MySQL Slave Running on db72 is OK: OK replication [20:50:23] RECOVERY - Disk space on db72 is OK: DISK OK [20:50:23] RECOVERY - DPKG on db73 is OK: All packages OK [20:50:33] RECOVERY - MySQL disk space on db72 is OK: DISK OK [20:50:33] RECOVERY - Disk space on db73 is OK: DISK OK [20:50:33] RECOVERY - MySQL Slave Running on db73 is OK: OK replication [20:50:33] RECOVERY - Full LVS Snapshot on db72 is OK: OK no full LVM snapshot volumes [20:50:44] RECOVERY - MySQL disk space on db73 is OK: DISK OK [20:50:44] RECOVERY - MySQL Idle Transactions on db72 is OK: OK longest blocking idle transaction sleeps for seconds [20:50:44] RECOVERY - Full LVS Snapshot on db73 is OK: OK no full LVM snapshot volumes [20:50:53] RECOVERY - MySQL Recent Restart on db72 is OK: OK seconds since restart [20:50:53] RECOVERY - MySQL Idle Transactions on db73 is OK: OK longest blocking idle transaction 
sleeps for seconds [20:50:53] RECOVERY - RAID on db72 is OK: OK: State is Optimal, checked 2 logical device(s) [20:51:03] RECOVERY - MySQL Recent Restart on db73 is OK: OK seconds since restart [20:51:03] RECOVERY - MySQL Replication Heartbeat on db72 is OK: OK replication delay seconds [20:51:03] RECOVERY - RAID on db73 is OK: OK: State is Optimal, checked 6 logical device(s) [20:51:13] RECOVERY - MySQL Replication Heartbeat on db73 is OK: OK replication delay seconds [20:52:36] ottomata: https://gist.github.com/atdt/9fa079d35f72c36e6b80/raw/3172b1e41e22c048e4d98dd6462aae2217ed1c3d/kafka_path.pp [20:52:59] that's what i usually do; nice and debianish [20:53:45] oh for the kafka user? [20:53:52] or is that for everyone? [20:54:00] hmmmMmmmm [20:54:08] hmmmm, yeah [20:54:14] i'm building the .deb right now [20:54:16] so I could do that [20:54:43] i think it might only be for login shells; there's (iirc) /etc/environment for shell-independent stuff [20:54:52] do you mind if we pick up this tomorrow? [20:59:50] sure no probs [21:01:16] mutante, we might need another private wiki soon, if it's decided that we (IEGCom) need one, should the DNS stuff go in an RT ticket and the actual wiki creation in a Bugzilla ticket, or both in the same RT ticket? ( I don't expect you to volunteer :P ) [21:01:55] Thehelpfulone: ehm.. 
start out with putting it all on a single ticket [21:02:12] if we feel we want to we can still split it out [21:03:59] heh, sure [21:07:00] New patchset: Dzahn; "add wikpedia.org to redirect to wikipedia (RT-4803)" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/59327 [21:09:55] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/59327 [21:10:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:11:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.131 second response time [21:12:39] dzahn is doing a graceful restart of all apaches [21:13:22] !log dzahn gracefulled all apaches [21:14:22] Nemo_bis: re: "< Nemo_bis> mutante: is wikpedia.org supposed to work? [21:14:27] now.. yes:) [21:15:36] mutante: wonderful :) [21:15:50] who knows, maybe in the next few decades legals may even recover wikipedia.it [21:16:21] perhaps if the community camps at WIPO or demonstrates naked in the snow like Femens during Davos [21:17:00] LOL [21:19:20] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [21:21:58] !log linode wikitech is now shut down \o/ [21:24:04] !log upgrading mw1060-mw1069 [21:25:35] linode is dead. 
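Returning to the $KAFKA_HOME/bin versus /usr/sbin question ottomata raised earlier: one common way out, since Kafka's bundled scripts locate each other via relative paths like $(dirname $0)/.., is a thin wrapper that pins KAFKA_HOME and builds absolute paths into the install tree, so those relative references keep resolving. A sketch only; the /usr/share/kafka default and the wrapper itself are hypothetical, not what the final .deb shipped.

```python
import os

# Hypothetical install prefix; the real .deb layout wasn't settled
# in this conversation. Overridable via the environment, matching
# the profile.d / /etc/environment options discussed above.
KAFKA_HOME = os.environ.get("KAFKA_HOME", "/usr/share/kafka")

def build_command(script, args):
    """Absolute command line for a bundled Kafka script, e.g.
    kafka-server-start.sh, so its internal $(dirname $0)/..
    still resolves inside KAFKA_HOME."""
    return [os.path.join(KAFKA_HOME, "bin", script)] + list(args)

print(build_command("kafka-server-start.sh",
                    ["/etc/kafka/server.properties"]))
```

An init script or a stub in /usr/sbin could exec the result of build_command, which avoids patching the upstream shell scripts at all.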
[21:26:20] PROBLEM - Apache HTTP on mw106 is CRITICAL: Connection refused [21:26:30] PROBLEM - DPKG on mw1067 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:26:30] PROBLEM - DPKG on mw1066 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:26:30] PROBLEM - DPKG on mw1069 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:26:30] PROBLEM - DPKG on mw1064 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:26:30] PROBLEM - DPKG on mw1061 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:26:30] PROBLEM - DPKG on mw1062 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:26:30] PROBLEM - DPKG on mw1065 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:28:20] RECOVERY - Apache HTTP on mw106 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.149 second response time [21:28:30] RECOVERY - DPKG on mw1069 is OK: All packages OK [21:28:30] RECOVERY - DPKG on mw1065 is OK: All packages OK [21:29:10] RobH: Oh finally I always hated that damn thing [21:29:30] RECOVERY - DPKG on mw1066 is OK: All packages OK [21:29:30] RECOVERY - DPKG on mw1061 is OK: All packages OK [21:29:30] RECOVERY - DPKG on mw1067 is OK: All packages OK [21:29:31] RECOVERY - DPKG on mw1064 is OK: All packages OK [21:29:31] RECOVERY - DPKG on mw1062 is OK: All packages OK [21:30:51] RoanKattouw: sent a test message to your phone [21:31:21] Yeah I guessed so :) [21:32:29] !log upgrading mw1070-mw1079 [21:32:38] alright, so you got a check on http://cerium.wikimedia.org:6081 with that phone number as a contact [21:32:47] OK good [21:32:55] Does it at least text instead of call [21:32:57] ? 
btw, we could even send custom POST variables, request headers, other timeouts..bla [21:33:32] RoanKattouw: we all did [21:33:35] wait, this was supposed to be a text when i hit the test button :p [21:33:36] but now its dead \o/ [21:33:50] mutante: It called me instead [21:35:04] oooh, type "phone" versus type "sms" [21:35:30] PROBLEM - DPKG on mw1078 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:35:30] PROBLEM - DPKG on mw1071 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:35:30] PROBLEM - DPKG on mw1076 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:35:30] PROBLEM - DPKG on mw1073 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:35:30] PROBLEM - DPKG on mw1070 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:35:30] PROBLEM - DPKG on mw1072 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:35:50] working hours 00:00 to 23:59 sounds right when it's Roan, heh [21:36:22] RoanKattouw: changed to type "sms". it says it sent a confirmation code [21:36:24] And that was a text message [21:36:28] Yes it did [21:36:32] PROBLEM - DPKG on mw1074 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:36:32] RECOVERY - DPKG on mw1072 is OK: All packages OK [21:36:32] PROBLEM - DPKG on mw1079 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:36:32] PROBLEM - DPKG on mw1075 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:36:36] actually i need the code [21:36:49] (PMed) [21:37:52] thx, done [21:38:32] RECOVERY - DPKG on mw1074 is OK: All packages OK [21:38:32] RECOVERY - DPKG on mw1071 is OK: All packages OK [21:38:32] RECOVERY - DPKG on mw1076 is OK: All packages OK [21:38:32] RECOVERY - DPKG on mw1070 is OK: All packages OK [21:38:32] RECOVERY - DPKG on mw1079 is OK: All packages OK [21:38:32] RECOVERY - DPKG on mw1073 is OK: All packages OK [21:38:32] RECOVERY - DPKG on mw1078 is OK: All packages OK [21:38:33] RECOVERY - DPKG on mw1075 is OK: All packages OK [21:41:42] 
PROBLEM - DPKG on sockpuppet is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:42:22] PROBLEM - NTP on cp1041 is CRITICAL: NTP CRITICAL: Offset unknown [21:42:53] !log doing a few more random host upgrades of misc hosts [21:44:26] hi Ryan_Lane [21:44:31] howdy [21:48:22] RECOVERY - NTP on cp1041 is OK: NTP OK: Offset -0.09200155735 secs [22:02:47] New patchset: Dzahn; "add webhostingwikipedia.com redirect (RT-4678)" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/59332 [22:02:52] New patchset: Andrew Bogott; "Added a configurable cachedir." [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/59333 [22:02:56] notpeter: I hear you've done some work recently on making search more reliable during index updates [22:03:31] binasher: so User::invalidateCache should be better in wmf2, but I don't know by how much [22:03:42] RECOVERY - DPKG on sockpuppet is OK: All packages OK [22:03:49] maybe there should be a bug about using memcached or something for that [22:03:49] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/59332 [22:04:02] New patchset: Andrew Bogott; "Added a configurable cachedir." [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/59333 [22:04:13] Change merged: Andrew Bogott; [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/59333 [22:04:41] New review: Kaldari; "First of all, it's very unlikely that MediaWiki:Contact would be defined, but not MediaWiki:Contact-..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57649 [22:05:14] xyzram: so on the logging side, we have https://gerrit.wikimedia.org/r/#/c/57337/ [22:05:33] yes [22:06:14] that probably needs to be improved a bit for logging [22:06:43] "mwsearch" is already used for logging every request, we probably want a separate channel for errors [22:06:54] and timeout errors need to be logged, not just 500s [22:07:02] ^demon: ^ [22:07:09] Chad was doing that. 
[22:07:11] can be in a separate commit though [22:07:28] New review: Kaldari; "To clarify, neither 'contact', nor 'contact-url' are defined in MediaWiki core, nor do I think they ..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57649 [22:07:35] did you say something about preloading index files? was that just an idea, or is that code written? [22:08:09] There are places in the Java code that use the term "warmup" [22:08:26] <^demon> We disabled that logging since it was overly spammy and was serving a one-off debugging we were doing. [22:08:34] <^demon> But yeah, we can narrow that down to 500s only. [22:08:50] 500s and all other HTTP request errors [22:08:55] in particular timeouts [22:09:17] I want to have timeout logging because that is the main thing that should spike with each index update [22:09:39] we won't really know whether we have fixed the problem of rsync causing downtime until we start logging timeouts [22:10:30] I'm fairly skeptical that splitting rsync in half on s4 would completely fix the problem [22:10:37] since the problem predates the existence of s4 :) [22:10:43] src/org/wikimedia/lsearch/search/Warmup.java [22:11:26] Agree. [22:11:49] But I'm not sure a "complete" fix is within reach. [22:12:33] ok [22:12:33] If an rsync of any significant size is going to cause everything else to slow waaaay down. [22:14:19] ^demon: can you look at the rest of the logging issues on the PHP side ? [22:14:45] <^demon> Yeah, I'll look at getting wfDebugLog() calls in the right places. [22:15:04] Ok, I didn't want to be duplicating effort ... thanks. [22:15:30] !log Moved labs-morebots to tools project. [22:15:40] have you seen Warmup.getWarmupCount()? [22:15:46] it looks like that feature has to be configured to work [22:18:01] yeah, and only enwiki spell has it configured [22:18:08] enwiki : (spell,40,10) (warmup,500) [22:19:15] Right, we can configure it for others but I'd like to have a better idea of which wikis actually need it. 
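For reference, the per-wiki search group line quoted above, "enwiki : (spell,40,10) (warmup,500)", reads as a wiki name plus typed parameter groups. The parser below is a guess at that grammar from this single sample (first token of each group is a type name, the rest numeric parameters); it is not lucene-search-2's actual config reader.

```python
# Parse an "enwiki : (spell,40,10) (warmup,500)" style line into
# a (wiki, {type: [params]}) pair. Grammar inferred from one sample.
import re

def parse_search_groups(line):
    wiki, _, rest = line.partition(":")
    groups = {}
    for m in re.finditer(r"\(([^)]*)\)", rest):
        parts = [p.strip() for p in m.group(1).split(",")]
        groups[parts[0]] = [int(p) for p in parts[1:]]
    return wiki.strip(), groups

print(parse_search_groups("enwiki : (spell,40,10) (warmup,500)"))
# → ('enwiki', {'spell': [40, 10], 'warmup': [500]})
```

On this reading, "(warmup,0)" for non-enwiki wikis is an explicit zero warmup count, matching the observation that the feature is effectively disabled everywhere else.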
[22:19:52] ok [22:20:15] I'm going to go have breakfast now [22:20:45] OK, I'll be around, meeting with Robla 4-4:30 [22:23:21] RobH: https://rt.wikimedia.org/Ticket/Display.html?id=4951 is filed. [22:23:51] Is there something we can figure out regarding permissions? I mean that ticket is probably gonna be moved to the procurement queue at which point Gabriel, James and myself can't see it any more [22:24:29] Maybe, like, have separate tickets for the private procurement stuff and the public discussion? [22:24:31] Pretty sure if you all list yourselves as requestors you will [22:24:34] (i thought) [22:24:46] but not sure lemme check queue permission [22:24:49] I am the only requestor, the other two are CC and AdminCC respectively [22:24:56] I suppose we can make them co-requestors? [22:25:16] yep, but lemme confirm requestors can see in that queue [22:25:42] RoanKattouw: nevermind.... [22:25:50] procurement is specifically locked down. [22:26:29] RoanKattouw: So yea, in the ops-requests we will figure out all the details, and then I'll link the actual purchase ticket [22:26:33] that way you guys can keep in loop [22:27:17] dzahn is doing a graceful restart of all apaches [22:28:39] OK good [22:30:00] dzahn is doing a graceful restart of all apaches [22:30:04] key issue, gotta do that again [22:30:21] re: RT permissions, yea, we just do the role based stuff for requestor on access-requests so far [22:30:43] !log dzahn gracefulled all apaches [22:54:06] RECOVERY - mysqld processes on db73 is OK: PROCS OK: 1 process with command name mysqld [22:54:35] New patchset: Dzahn; "remove from singer: account awjrichards, group wikidev, svn client" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58844 [22:54:59] New review: Dzahn; "recheck" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/58844 [22:56:06] TimStarling: I think the biggest win in that arena recently has been asher further splitting up the indexes in pool4 to a new pool5 
[22:56:06] RECOVERY - mysqld processes on db72 is OK: PROCS OK: 1 process with command name mysqld [22:56:07] TimStarling: " so on the logging side, we have https://gerrit.wikimedia.org/r/#/c/57337/" yes and additionally https://gerrit.wikimedia.org/r/#/c/56354/ improves logging on the Java side. [22:56:42] notpeter: when did that happen ? [22:57:40] 2/28 slash 3/1 [23:00:44] notpeter: xyzram was just telling me that lucene-search-2 has a "warmup" feature which seems specifically designed to avoid this issue of slowness after index updates [23:00:59] it's just not enabled [23:02:28] enabled for enwiki and seems specifically disabled for others: (warmup,0) [23:02:59] New review: Dzahn; "yes, manually removed from singer. also manually removed other old accounts that weren't puppetized." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58844 [23:02:59] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58844 [23:05:37] New patchset: Krinkle; "noc: Add missing entries to createTxtFileSymlinks.sh" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/59033 [23:05:38] New patchset: Dzahn; "icinga: fix jenkins monitoring" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58906 [23:07:04] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58906 [23:11:32] RECOVERY - jenkins_service_running on gallium is OK: PROCS OK: 1 process with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war [23:12:40] !log mwalker synchronized php-1.22wmf1/extensions/CentralNotice/ 'CentralNotice banner= fix' [23:13:19] TimStarling: xyzram want to try enabling it on a relatively stable pool, like pool2, and then proceed from there? [23:13:24] New review: Dzahn; "fixed it:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58906 [23:14:34] notpeter: do we currently have an issue though ? 
[23:15:21] currently, no [23:15:49] but if this buys us more time before we start getting issues again, then I'd love to act proactively [23:16:18] we don't know if we have an issue [23:17:03] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/59033 [23:17:05] nagios won't alert on LVS downtime for 3 minutes, we could have daily downtime that lasts for less than that [23:17:45] and it won't alert on individual search nodes for 6 minutes, individual search nodes could be generating timeouts for users without the pool as a whole being down [23:18:05] !log mwalker synchronized php-1.22wmf2/extensions/CentralNotice [23:18:16] true [23:25:35] New patchset: Krinkle; "noc: Refactor highlight.php to be simpler and more secure" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/59034 [23:31:19] If 3/6 is too high, what is a reasonable timeout pair ? 1/2 ? [23:57:24] !log csteipp synchronized php-1.22wmf2/includes/ 'Use gerrit versions of security fixes' [23:58:46] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [23:58:46] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [23:58:46] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours
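Tim's 3-minute/6-minute point can be made concrete: with Nagios-style checks, the worst-case time from outage start to alert is roughly one check interval plus the retries needed to reach a HARD state, so any outage shorter than that window can come and go unreported. The split into interval and retry values below is illustrative arithmetic only, not the production Nagios configuration.

```python
# Worst-case time-to-alert for a Nagios-style check:
# the first failed check can land a full check_interval after the
# outage begins, then (max_attempts - 1) retries confirm HARD state.
def worst_case_alert_delay(check_interval, retry_interval, max_attempts):
    return check_interval + (max_attempts - 1) * retry_interval

# The 3/6 minute windows from the conversation, one plausible split:
lvs_delay = worst_case_alert_delay(1, 1, 3)   # LVS pool checks
node_delay = worst_case_alert_delay(1, 1, 6)  # individual search nodes
print(lvs_delay, node_delay)  # → 3 6
```

Shrinking either the interval or the retry count tightens the blind spot, at the cost of more false alarms from transient blips, which is the trade-off behind the closing "is 1/2 reasonable?" question.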