[00:00:17] hmmm [00:19:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:30:27] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours [00:35:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.028 seconds [00:48:27] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [01:07:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:21:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.946 seconds [01:27:18] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 188 seconds [01:34:25] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: (Return code of 255 is out of bounds) [01:34:43] PROBLEM - mysqld processes on storage3 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [01:35:10] PROBLEM - MySQL disk space on storage3 is CRITICAL: DISK CRITICAL - /a is not accessible: Success [01:54:40] PROBLEM - Puppet freshness on mw49 is CRITICAL: Puppet has not run in the last 10 hours [01:56:37] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [01:56:55] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:09:31] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.423 seconds [02:27:46] !log LocalisationUpdate completed (1.21wmf1) at Mon Oct 8 02:27:45 UTC 2012 [02:28:05] Logged the message, Master [02:30:40] RECOVERY - MySQL disk space on storage3 is OK: DISK OK [02:45:40] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [02:52:16] !log LocalisationUpdate completed (1.20wmf12) at Mon Oct 8 02:52:16 UTC 2012 [02:52:27] Logged the message, Master [03:19:43] PROBLEM - check_job_queue on neon is 
CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , plwiktionary (51735) [03:34:34] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , plwiktionary (56015) [04:08:37] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [05:31:45] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [05:31:45] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [05:35:48] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [05:36:06] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [05:38:30] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 17 seconds [07:28:35] New review: Dereckson; "Comments aren't the clearer of the world these namespace should let disabled if we've new entries fo..." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/27123 [07:30:45] New patchset: Dereckson; "(bug 40838) Namespace configuration for es.wikibooks" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27123 [07:33:06] New review: Dereckson; "PS1: Initial change" [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/27123 [08:21:01] mornin' [08:21:40] yo [08:22:53] finally home [08:23:03] what the hell happened last night [08:24:19] dunno [08:24:22] I was out for most of it [08:24:33] domas figures some nfs client somewhere borked things [08:32:53] ah welcome back btw [08:48:23] thanks [08:48:33] I guess we don't have a meeting today [08:48:44] techops I mean [08:49:14] wmf holiday so guess not [09:26:44] New review: Hashar; "Well I wrote the tests anyway so we can get them in for no additional costs :-] That makes it easy ..." 
[operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/26040 [09:33:19] New review: Hashar; "> I think it should be this script what runs as mwdeploy. The solution isn't really to add a sudo at..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/22116 [09:49:31] hi [09:49:53] hello [09:49:58] you're not on the hook for today you know [09:50:35] what do you mean? [09:51:02] I mean when we set the window for retrying we didn't ask you so you don't have to be here [09:51:17] i need to work today don't I? :) [09:51:27] well [09:51:33] in theory it's a wmf holiday [09:51:33] haha [09:51:36] in practice too [09:51:37] no [09:51:39] it's a US holiday [09:51:46] I follow dutch holidays [09:51:51] I suggest you follow greek ones ;) [09:52:07] I try to (I'm working today) [09:52:19] why would I follow greek holidays? [09:52:19] when is your maintenance? [09:52:23] but I'm not sure the number matches [09:52:35] why would you want to work while all your friends are off? [09:52:36] the same time, 11 am til 2 pm utc [09:52:46] what's the point of being all off on the same time in the organization?:) [09:53:10] I'm not suggesting to follow US holidays either :) [09:53:19] you're suggesting to always work [09:53:27] you're greek, you should work 6 days a week, that's true [09:53:38] hahaha [09:55:43] that's pretty lax [09:55:46] so, apergos you seem to have everything under control [09:55:48] the week has 7 days in it [09:55:51] yes apergos [09:55:54] we'll all take off [09:55:55] you can handle it [09:55:59] but in case I can do something [09:56:09] shoot [09:56:20] ok. it should be pretty straightforward [09:56:22] are the ownerships ok now? 
[09:56:28] yes [09:56:34] excellent [09:57:21] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [09:57:22] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [09:57:22] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [09:57:50] someday I will actually learn that lesson: [09:57:58] check *everything* [09:58:05] but it seems I haven't yet [09:58:23] that's impossible you know [09:58:34] that's why people do mandatory code reviews and staging environments [09:59:10] mark: just to follow along, what's the deal with the geoip stuff? [09:59:27] that's entirely different from our DNS geoip, correct? [09:59:41] (which isn't based on maxmind, that much I know) [09:59:47] correct [09:59:59] so, here's the story [10:00:04] so far we're just using the ipv4 downloaded databases [10:00:09] we could adapt that to use ipv6 as well [10:00:33] until recently, i circumvented the problem by not adding an AAAA record to geoiplookup.wikimedia.org [10:00:35] using it how/where? [10:00:43] sorry, for varnish geoiplookup [10:00:47] geoiplookup.wikimedia.org [10:00:52] so so far, everyone used it over v4 only [10:00:53] worked fine [10:01:02] then someone made it work on bits.wikimedia.org/geoip [10:01:06] which does have an AAAA record [10:01:44] is that being sourced from javascript or something? [10:01:47] yes [10:01:50] for fundraising etc [10:01:59] it's just json output [10:02:07] yeah, I just noticed [10:03:05] and why do we need to switch to bits? 
[10:03:21] someone said it delayed requests because of the extra dns lookup [10:03:24] aha [10:03:33] by 200ms [10:03:35] which seems a lot to me [10:03:47] but anyway, it's true that it's one less dns lookup [10:03:52] but it has this disadvantage [10:04:13] dual-stacking the geoip lookup doesn't seem a good idea to me tbh [10:04:19] even if we add maxmind's ip [10:04:48] tunneling is so widespread in ipv6 that a big percentage will be bogus results [10:04:55] that's what I say as well [10:05:15] I think they're much better off with v4-only, for now [10:05:17] well, maybe not big, but still [10:05:28] ipv6-only hosts are certainly much smaller :) [10:05:32] yes [10:05:34] er, less even [10:06:01] virtually non-existent I'd say [10:06:24] the few ones that are, I expect them to have nat64 anyway [10:06:45] so, I was talking to a person at VUG [10:06:54] who it turns out has implemented something similar to chash [10:07:10] but does a few extra things [10:07:19] one of them is that it kind of merges frontends/backends [10:07:51] i.e. you hit one varnish randomly (e.g. 
LVS) and it determines if it's the "master" and serves it, otherwise it passes it to the right "master" [10:08:01] that's what I wanted to do as well [10:08:12] but asher implemented it for mobile and thought it wouldn't be possible ;) [10:08:17] heh [10:08:25] the other nice thing is that it has a ramp-up [10:08:40] for masters that were down and now are up again [10:08:46] it starts giving them traffic gradually [10:09:05] yeah that's nice [10:09:21] so I mentioned your work [10:09:51] he said his work is kinda messy and for varnish 2.x and he's been reluctant to spend time to communicate with the varnish people to merge it [10:10:12] but if yours get merged he said he'll look into porting his features on top of yours [10:10:30] I got the contact in case you're interested [10:10:50] alright [10:12:33] he was *very* impressed when I told him that you've made a chash director [10:12:45] and that it's in production at wikipedia [10:13:39] nice guy too, they are a company who does Varnish consulting, it seems Varnish is getting an ecosystem around it [10:14:22] New patchset: Hashar; "(bug 39701) beta: automatic MediaWiki update" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22116 [10:15:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22116 [10:16:44] New review: Hashar; "Tim Starling wrote:" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/22116 [10:18:59] New patchset: Hashar; "(bug 39701) beta: automatic MediaWiki update" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22116 [10:19:55] New review: Hashar; "Rebased on top of I0fe0f3cf - "run mergeMessageFileList.php as 'mwdeploy' user"" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/22116 [10:19:55] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22116 [10:23:28] haha [10:23:35] chash was a couple hours of work [10:25:02] well I think he was more impressed that someone else had done (kind of) the same [10:25:49] just to get feature parity with squid really ;) [10:31:16] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours [10:31:37] New patchset: Hashar; "wikibugs migrated from svn to git" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26325 [10:32:39] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26325 [10:32:46] anyway, I would think that you can do the frontend/backend merging in VCL [10:33:15] check source ip, if it's coming from another varnish, act like a backend, else a frontend [10:33:18] something like that [10:37:08] New patchset: Hashar; "wikibugs migrated from svn to git" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26325 [10:38:11] yeah, he was saying something to that effect, although he added another step before that [10:38:16] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26325 [10:38:22] to shortcut it if the master is the local instance [10:38:27] to avoid a proxying to itself [10:38:30] yes of course [10:40:45] brb [10:49:16] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [11:00:23] New patchset: ArielGlenn; "move from upload6 on ms7 to upload7 on nas1" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27159 [11:06:44] * apergos looks around for Reedy, it's about that time [11:06:57] Yup [11:06:58] I'm here [11:07:09] wanna eyeball and merge um [11:07:23] the link right above in the scrollback? :-) [11:07:39] in gerrit. 
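The routing scheme discussed in the conversation above (consistent hashing picks one "master" cache per URL; any node that receives the request serves it if it is the master, otherwise passes it on; and a master that just came back up is ramped in gradually) can be modelled in a few lines. This is only an illustrative sketch, not the actual Varnish chash director: the class, its parameters, and the hash salting are all invented here.

```python
import bisect
import hashlib
import time

def _hash(key: str) -> int:
    # Stable 32-bit hash used for both ring positions and URLs.
    return int(hashlib.md5(key.encode()).hexdigest()[:8], 16)

class ChashDirector:
    """Toy consistent-hash director: every URL maps to one 'master'
    backend; recently revived masters are ramped back in gradually."""

    def __init__(self, backends, replicas=100, rampup=60.0):
        self.rampup = rampup                        # seconds until full share
        self.up_since = {b: 0.0 for b in backends}  # 0.0 = up "forever"
        self.ring = sorted((_hash(f"{b}-{i}"), b)
                           for b in backends for i in range(replicas))
        self.keys = [k for k, _ in self.ring]

    def mark_up(self, backend, now=None):
        # Called when a backend transitions from down to up.
        self.up_since[backend] = time.time() if now is None else now

    def pick(self, url, local, now=None):
        now = time.time() if now is None else now
        i = bisect.bisect(self.keys, _hash(url)) % len(self.ring)
        master = self.ring[i][1]
        # Ramp-up: a just-revived master takes only a growing,
        # deterministic fraction of its URLs; the rest stay local.
        share = min((now - self.up_since[master]) / self.rampup, 1.0)
        if share < 1.0 and (_hash(url + "|salt") % 1000) / 1000.0 >= share:
            master = local
        # Shortcut: don't proxy to ourselves if we are the master.
        return "serve-local" if master == local else f"pass-to:{master}"
```

The ramp-up routes only a growing, deterministic fraction of a revived master's URLs to it, leaving the rest on the receiving node until the window elapses, so the recovered cache warms up instead of taking a full miss storm at once.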
then I'll do the merge/deploy on fenari [11:11:10] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27159 [11:13:04] ok, here goes the sync [11:14:54] !log ariel synchronized wmf-config 'move from ms7 upload6 to nas1 upload7' [11:15:00] all righty [11:15:06] Logged the message, Master [11:15:09] let's see what they say in wikimedia-tech [11:19:06] goodie [11:21:27] bleh, loads of php warnings [11:22:03] oh? such as? [11:22:21] PHP Warning: unlink() [function.unlink]: No such file or directory in /usr/local/apache/common-local/php-1.20wmf12/includes/GlobalFunctions.php on line 2995 [11:22:23] I see rather more GETs for images today than I did on Friday [11:22:28] on ms7 [11:22:33] unlink, fwrite, fopen, fclose... [11:23:01] weird [11:23:15] what's it being called from? [11:23:43] Hmm [11:23:47] That's seemingly in wfDiff() [11:24:11] next q: see the same errors in yesterday's log? [11:24:32] or heck earlier today [11:27:23] Not that I remember.. [11:27:49] there's write to temp space to do file diffs.. [11:28:18] but that's /tmp locally [11:28:49] apergos: It's seemingly only one apache... [11:28:50] 10.0.11.49 [11:28:56] huh [11:29:25] Oct 8 11:25:58 10.0.11.49 apache2[25400]: PHP Warning: fclose() expects parameter 1 to be resource, boolean given in /usr/local/apache/common-local/php-1.20wmf12/includes/GlobalFunctions.php on line 2973 [11:29:25] Oct 8 11:25:58 10.0.11.49 apache2[25400]: PHP Warning: fwrite() expects parameter 1 to be resource, boolean given in /usr/local/apache/common-local/php-1.20wmf12/includes/GlobalFunctions.php on line 2974 [11:29:25] Oct 8 11:25:58 10.0.11.49 apache2[25400]: PHP Warning: fclose() expects parameter 1 to be resource, boolean given in /usr/local/apache/common-local/php-1.20wmf12/includes/GlobalFunctions.php on line 2975 [11:29:43] all mw49 [11:30:00] Is it worth just depooling it for a bit? 
[11:30:30] it didn't get the sync [11:30:40] no idea why [11:30:51] I'll run sync-common on it [11:31:07] !log Manually running sync-common on mw49 [11:31:16] ok [11:31:18] Logged the message, Master [11:31:30] it seems to have quite a few old cache dirs in its /tmp [11:31:34] hm [11:31:39] 1.20wmf1 etc [11:31:58] ok maybe I was wrong about this, it has a bunch of old ~ files with upload6 [11:32:02] the current ones seem to be ok [11:32:26] the sync was a good idea regardless [11:32:35] still see errors? [11:33:05] Yup [11:33:30] wonder if I should restart memcached over there, maybe something screwed up locally [11:33:58] * apergos does so [11:36:18] /tmp was very full over there [11:36:19] fixing that [11:36:40] that sounds suspect [11:37:11] how many of the old mw-cache-1.19 etc do I need? [11:37:31] I was going to toss all but mw-cache-1.20wmf10 11 12 in /tmp [11:39:43] we're only using 1.20wmf12 and 1.20wmf1 at the moment [11:39:51] **1.21wmf1 [11:39:53] wmf1 ? [11:39:55] ahh [11:40:09] the log doesn't seem to be gaining any new warnings [11:40:43] RECOVERY - Puppet freshness on mw49 is OK: puppet ran at Mon Oct 8 11:40:26 UTC 2012 [11:40:44] * apergos forces a puppet run over there for good measure (it seemed to be running out of space in /tmp :-P) [11:42:12] ok, that should be that for now [11:42:29] lemme see how GETs are on ms7 again [11:43:48] still some [11:44:27] they are from various squids, there is never a referer [11:45:01] I gotta guess that's a cache issue [11:46:11] I'm guessing it's a minority though? [11:46:18] sure [11:46:56] 3 to 5 a second maybe, from all the squids combined [11:47:43] or maybe 5 to 8, can't actually count :-P [11:47:48] then there are the math images [11:47:48] haha [11:47:52] still get some of those too [11:49:25] What do we do about those then? Wait for expiry?
[11:49:28] !log manually cleaned up /tmp on mw49, forced a puppet run too (isn't there a cron job to clean up /tmp?), stopped job runner afterwards since it had been stopped already by someone [11:49:39] Logged the message, Master [11:49:59] for now, I guess so [11:50:55] I should check in on it though to see if those requests start dropping off [11:51:00] if not we'll have to look at it again [11:51:16] ganglia shows incoming network to ms7 has dropped to just over 0 [11:51:59] * apergos goes to look [11:52:05] http://ganglia.wikimedia.org/latest/graph.php?r=2hr&z=xlarge&h=ms7.pmtpa.wmnet&m=load_one&s=by+name&mc=2&g=network_report&c=Miscellaneous+pmtpa [11:52:19] awww it's so cute.... [11:52:21] and out has dropped by 2 thirds [11:52:42] well I also stopped the last rsync just before we did this push [11:53:07] heh [11:53:52] http://ganglia.wikimedia.org/latest/graph.php?r=month&z=xlarge&h=ms7.pmtpa.wmnet&m=load_one&s=by+name&mc=2&g=network_report&c=Miscellaneous+pmtpa [11:55:54] now the next q is whether we can safely unmount /mnt/upload6 from the apaches [11:56:20] I was going to ask if there was a way to loook at nfs io [11:56:25] but then i remember from yesterday, there isn't [11:56:30] not a nice way [11:56:35] I'll snoop some [11:57:18] 451115c8888738207f44f60ed8c82adc.png [11:57:22] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [11:57:24] some either math or timeline s stuff, hard to tell [11:57:42] 11731 0.00000 srv298.pmtpa.wmnet -> ms7 NFS C LOOKUP3 FH=E850 077_Sardanes_a_la_plaça_de_l'Ajuntament.jpg [11:58:05] 245724 0.00000 hume.wikimedia.org -> ms7 NFS C LOOKUP3 FH=AC17 Braunlage2011_LCOC2_Mattel_Iraschko_Faisst_Seyfarth_Log [11:58:05] ar_Van_Seifriedsberger_117.JPG [11:58:07] hume? really? [11:58:31] wtf is it doing? 
[11:58:40] 433225 0.00000 srv226.pmtpa.wmnet -> ms7 NFS C LOOKUP3 FH=3698 Budge_Magraw.jpg [11:58:42] hmmmmm [11:58:59] ok well one at a time, let's see about hume, there's lots of those [12:01:50] these resync-nfs-s* make me suspicious [12:04:13] good, the apache logs are down to the usuals that come and go [12:04:21] yay for that [12:10:50] well if nothig is actually broken I guess we can call it a day, I'll have to track down these nfs users over the next few days and close em down [12:12:28] I think so too [12:12:36] Quick scan of the other logs show nothing of any interest [12:12:41] excellent [12:14:03] New patchset: Dereckson; "(bug 40795) Subpages namespace configuration for kk.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27166 [12:15:04] thanks for showing up on your holiday [12:21:50] heh [12:22:02] It's not exactly a holiday here [12:22:15] well your wmf holiday [12:22:17] ;-) [12:28:44] New patchset: Hashar; "fix doc for git::clone" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27170 [12:29:10] ^^^ a suppppeeeer easy change to review/merge :-] [12:29:15] it just changes a documentation block [12:29:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27170 [12:36:01] apergos, it seems that you're the only one talking about a wmf holiday [12:36:25] that's cause all the ones actually on it don't wake up for somehours [12:36:46] heh, yes [12:41:48] apergos: could you review a doc change in puppet : https://gerrit.wikimedia.org/r/27170 did some rdoc formatting there. 
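Tracking down the remaining NFS users seen in the snoop output above is easier with a quick per-client tally. A minimal sketch, assuming snoop-style lines in the format quoted in the log; the function name is invented:

```python
import re
from collections import Counter

# Matches snoop NFS call lines like:
#   11731 0.00000 srv298.pmtpa.wmnet -> ms7 NFS C LOOKUP3 FH=E850 Foo.jpg
SNOOP_NFS = re.compile(
    r"^\s*\d+\s+[\d.]+\s+(?P<src>\S+)\s+->\s+(?P<dst>\S+)\s+NFS\s+C\s+(?P<op>\S+)"
)

def tally_nfs_clients(lines):
    """Count NFS call lines per source host in snoop-style output."""
    counts = Counter()
    for line in lines:
        m = SNOOP_NFS.match(line)
        if m:
            counts[m.group("src")] += 1
    return counts
```

Fed a captured snoop stream, `counts.most_common()` would rank the clients (the app servers, hume, etc.) still hitting ms7, which is exactly the list needed before unmounting /mnt/upload6.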
[12:46:16] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [12:47:21] hashar: lin 719 s/Will clones/Will clone/ [12:52:19] New patchset: Hashar; "fix doc for git::clone" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27170 [12:52:20] apergos: done in PS2 :-] [12:53:13] New review: Hashar; "Fix a grammar typo reported by Apergos : s/Will clones/Will clone/" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/27170 [12:53:13] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27170 [12:53:34] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27170 [12:53:47] I forget if I need to merge it on sockpuppet for you or not [12:53:51] can you do those yet? [12:55:04] hashar: [12:56:48] apergos: sorry, I can't merge on sock puppet :/ [12:56:56] ok, no worries [12:57:09] that would grant me root access I guess [12:57:21] since I would be able to add some nasty content in a puppet manifest [12:57:54] I guess I think of you as honorary ops or something cause you're always in here doing work on these [12:57:55] :-D [12:58:02] # WIKIPEDiA SMELLS [12:58:11] sure does [12:58:15] the peasants are revolting too [13:01:04] apergos: I am probably one of the few non-ops being active in operations/puppet [13:01:14] apergos: together with otto at least for the analytic stuff [13:01:19] spread the virus :-D [13:02:33] elp [13:02:43] well [13:03:07] I need to craft a "I do some review in Gerrit" virus to the ops team :-] Some people seems to be immune to it hehe [13:03:44] I'm pretty resistant, but not 100% [13:26:34] New patchset: Hashar; "git::clone now support a specific sha1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27175 [13:27:30] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27175 [13:28:24] New patchset: Hashar; "wikibugs migrated from svn to git" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26325 [13:29:20] New review: Hashar; "Rebased on top of If6676ffa - "git::clone now support a specific sha1"" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/26325 [13:29:20] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26325 [13:29:41] New patchset: Hashar; "wikibugs migrated from svn to git" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26325 [13:30:37] New review: Hashar; "git::clone now uses:" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/26325 [13:30:37] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26325 [13:30:55] have to do that in puppet class : [13:30:56] ( [13:58:43] heyaaaa, anybody working today want to give me a puppet review? [13:58:46] it is an easy one: [13:58:50] https://gerrit.wikimedia.org/r/#/c/26799/ [13:59:30] notpeter how U doin? :) [14:02:43] installing mongodb is "an easy one"? :) [14:03:31] mark said it was cool [14:03:36] but didn't have time to review my change [14:03:43] it is easy to install, so yes [14:03:49] my main q is [14:03:58] should I make this more generic, so that it can be used out side of stats stuff [14:04:08] or, should I just wait until someone else actually wants it [14:04:40] i'm leaning toward the latter on this one, because I doubt that there will be a need for it outside of analytics/stats stuff [14:06:36] ottomata: maybe consider making it a module ? [14:06:57] yeah was thikning about that, if I do that I might google around for an already done mongodb module [14:07:00] i'm sure one must exist [14:07:13] use the forge? 
[14:07:18] yeah [14:07:27] the reason I didn't write it, is because I don't really have experience with MongoDB, so I betcha there are a bunch of settings that should be puppetized that I'm not familiar with [14:07:41] aaaand, I got the ok from mark to use Mongo, but I wasn't sure if adding a module would cause more of a stink [14:07:42] well just make a very simple module that fits your needs [14:07:50] since that might look like we are endorsing using it or something [14:07:51] yeah [14:08:00] I think ops want to migrate everything to modules eventually [14:08:03] paravoid do you think that's better too? [14:08:06] yeah [14:08:30] well, in my opinion it would be better to migrate anything that has nothing to do with WMF specific stuff to modules [14:08:33] but it seems that we are not doing that [14:08:50] manifests that have WMF stuff in it could be used to include and configure modules [14:08:53] buuut, meh well [14:08:56] annyyyyywayyyyy [14:09:07] paravoid, if you think I should do this in a module, I am happy to [14:10:00] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [14:11:57] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiki (35786) [14:12:15] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiki (35639) [14:14:08] damn enwiki [14:17:20] they need to delete more stuff, it's just too big [14:18:11] ;-) [14:20:11] I am ultra tired of puppet [14:20:19] I should just write a shell script instead of manifests :-D [14:20:30] it can't run git clone hehe [14:21:02] no?! [14:21:08] i wrote a git define, does it not work for you?
err: /Stage[main]/Misc::Irc::Wikibugs/Git::Clone[Clone-wikibugs]/Exec[git_clone_Clone-wikibugs]/returns: change from notrun to 0 failed: git clone https://gerrit.wikimedia.org/r/p/wikimedia/bugzilla/wikibugs.git /var/lib/wikibugs/script returned 1 instead of one of [0] at /etc/puppet/manifests/generic-definitions.pp:776 [14:21:18] ahhh rats [14:21:19] that is the one I wrote [14:21:21] "return 1 instead of [0]" … sooooo useful [14:21:33] can you run the command itself manually? [14:21:38] yeah that works [14:21:41] hmmm [14:21:50] even when you run as root? [14:22:10] puppet runs commands as root unless you tell it not to [14:22:14] yeah I do run puppetd -tv as root [14:22:21] no, i mean the git clone command [14:22:23] but maybe it uses a different user [14:22:33] OH MY FUCKING GOD [14:22:35] I am dumb [14:22:39] seriously [14:22:42] thanks ottomata [14:23:13] must be a perm error :-] [14:23:19] yeah, if you set [14:23:20] owner => [14:23:22] on the git::clone [14:23:28] it will run the command as that user [14:26:30] New patchset: Ottomata; "Installing mercurial on in statistics::packages." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27184 [14:27:25] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27184 [14:27:50] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27184 [14:32:05] New patchset: Hashar; "wikibugs migrated from svn to git" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26325 [14:43:29] I am doomed [14:47:57] haha [14:51:50] I found out the root cause [14:52:00] I specified a wrong owner in git::clone [14:52:07] which was not allowed to write to the destination directory [14:52:07] lame [14:52:17] spent like 2 hours on it :( [14:52:22] what a waste of time [14:52:23] New patchset: Hashar; "wikibugs migrated from svn to git" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26325 [14:53:29] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26325 [14:57:05] New review: Hashar; "The reason for this change is that we sometime need to fetch an arbitrary commit, for example when u..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/27175 [14:57:28] I'll be afk for awhile, can't imagine that the netapp will explode or anything while I'm gone... I'll check back later tonight [14:57:42] apergos: have a good dinner ;) [14:58:00] thanks! [15:20:39] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [15:20:57] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [15:32:08] New patchset: Hashar; "wikibugs migrated from svn to git" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26325 [15:33:15] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [15:33:15] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [15:33:16] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26325 [15:37:36] New review: Hashar; "PS 8 is the one to review. I have added some inline comments to guide reviewers." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/26325 [15:41:27] New patchset: Hashar; "wikibugs migrated from svn to git" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26325 [15:42:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26325 [15:53:35] New review: Hashar; "PS9 fix "recurse => yes" to "recurse => true"" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/26325 [16:00:48] New patchset: Hashar; "wikibugs migrated from svn to git" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26325 [16:01:48] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26325 [16:39:34] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/26920 [16:40:16] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/26747 [16:40:50] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27166 [16:41:00] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27034 [16:41:24] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27123 [16:43:00] !log reedy synchronized wmf-config/ [16:43:11] Logged the message, Master [17:27:39] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , commonswiki (39850), plwiki (30266) [17:27:57] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , commonswiki (39693), plwiki (30094) [17:36:03] ouch [17:48:15] MaxSem: they're getting processed 
quicker than ever [17:48:20] Like the millions from zhwiki [17:56:30] Reedy, are you working today? [17:56:46] Yeah [18:09:19] New patchset: Ottomata; "Adding mongodb module." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27195 [18:10:20] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27195 [18:36:47] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: enwiki to 1.21wmf1 [18:36:58] Logged the message, Master [18:40:00] PROBLEM - Apache HTTP on srv194 is CRITICAL: HTTP CRITICAL - No data received from host [18:44:29] !log Updated liquidthreads_labswikimedia CategoryLinks schema with patch-categorylinks-better-collation.sql [18:44:40] Logged the message, Master [18:58:45] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [18:59:03] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [19:05:56] !log reedy synchronized wmf-config/CommonSettings.php [19:06:07] Logged the message, Master [19:07:50] !log reedy synchronized wmf-config/CommonSettings.php [19:08:00] Logged the message, Master [19:13:03] !log reedy synchronized wmf-config/InitialiseSettings.php 'wgOldChangeTagsIndex to false for outreachwiki' [19:13:14] Logged the message, Master [19:46:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:47:49] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.029 seconds [19:57:52] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [19:57:52] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [19:57:52] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [20:02:05] !log reedy synchronized wmf-config/InitialiseSettings.php 'Set wgOldChangeTagsIndex to false for all wikis running with new 
indexes' [20:02:16] Logged the message, Master [20:08:21] New patchset: Dereckson; "(bug 40794) Namespace configuration for kk.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27212 [20:09:52] New patchset: Dereckson; "(bug 40794) Namespace configuration for kk.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27212 [20:10:33] New review: Dereckson; "PS2: adding bug reference" [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/27212 [20:11:02] New patchset: Dereckson; "(bug 40794) Namespace configuration for kk.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27212 [20:20:49] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:22:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.342 seconds [20:31:33] Reedy, have there been any updates on the OTRS RT tickets as of yet, or are we still waiting for a response? [20:31:55] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours [20:36:40] !log reedy synchronized php-1.21wmf1/extensions/ConfirmEdit/FancyCaptcha.class.php 'Revert cache key change live hack' [20:36:51] Logged the message, Master [20:36:58] Thehelpfulone: I think it's still in the Ops court.. [20:41:19] !log maxsem synchronized php-1.21wmf1/extensions/MobileFrontend/MobileFrontend.php 'Live hack to unbreak prop=extracts on 1.21wmf1' [20:41:30] Logged the message, Master [20:42:01] hmm, in that case woosters did you manage to have that chat with the ops to see who would be able to help? I know Jeff who usually does it is busy with the fundraiser... 
[20:49:01] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 195 seconds [20:49:46] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 226 seconds [20:49:55] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [20:56:49] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:00:13] Thehelpfulone: it's a US holiday today [21:01:39] Jeff doesn't "usually" do it either ;) [21:02:31] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [21:02:51] ah I didn't realise, and well he's the one that seems to have commented recently on the bugs, that's "usual" enough for me ;) [21:02:58] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [21:03:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.154 seconds [21:38:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:54:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.039 seconds [21:58:56] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [22:25:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:39:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.891 seconds [22:47:40] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [23:13:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:27:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.986 seconds
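On the geoip dual-stacking concern from earlier in the log (tunneled IPv6 clients yielding bogus geolocation): the two common tunnel mechanisms are easy to flag, because 6to4 (2002::/16) and Teredo (2001::/32) both embed the client's real IPv4 address inside the IPv6 address. A sketch using Python's standard `ipaddress` module; the function names are invented:

```python
import ipaddress

def looks_tunneled(addr: str) -> bool:
    """True if `addr` is an IPv6 address from a well-known tunnel range
    (6to4 or Teredo), where geolocating the v6 address itself would
    give bogus results."""
    ip = ipaddress.ip_address(addr)
    if ip.version != 6:
        return False
    return ip.sixtofour is not None or ip.teredo is not None

def embedded_v4(addr: str):
    """Recover the client's real IPv4 address from a 6to4 or Teredo
    address, or None if there is none to recover."""
    ip = ipaddress.ip_address(addr)
    if ip.version != 6:
        return None
    if ip.sixtofour is not None:
        return ip.sixtofour
    if ip.teredo is not None:
        return ip.teredo[1]  # (server, client) -> client address
    return None
```

For these two tunnel types a lookup service could geolocate the embedded IPv4 address instead of the IPv6 source, sidestepping the bogus-result problem, though not for generic tunnel brokers whose addresses carry no such hint.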