[00:01:05] RECOVERY - Puppet freshness on nfs1 is OK: puppet ran at Thu Jul 26 00:00:53 UTC 2012 [00:01:32] RECOVERY - Puppet freshness on nfs2 is OK: puppet ran at Thu Jul 26 00:01:29 UTC 2012 [00:02:35] RECOVERY - LDAPS on nfs2 is OK: TCP OK - 0.002 second response time on port 636 [00:03:11] RECOVERY - LDAP on nfs2 is OK: TCP OK - 0.000 second response time on port 389 [00:06:11] maplebed, RoanKattouw: I was looking at srv281 (it's on the server admin log) [00:06:18] notpeter: too [00:06:21] Yeah [00:06:24] it was disabled for quite some time due to disk full [00:06:27] It seems to have missed the repartitioning [00:06:37] I reformatted it, it was partioned with 7gb again [00:06:37] That's why its disk is still full [00:06:49] and now there's an additional problem [00:06:57] paravoid: I'm having an issue with lab [00:06:59] s [00:07:00] I reformatted it with our default distro, which is now precise [00:07:10] paravoid: I can't get to my mobile-testing instance [00:07:12] heh, we didn't fix partman, did we ? [00:07:19] which fails in some ways, so I have to fix that [00:07:22] paravoid: and I don't see anything on the console output page [00:07:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:07:29] New Apache installs are probably still partitioned wrong [00:07:34] I was looking at it but sidestepped it for something else [00:07:39] I think Peter just live-remounted the existing ones [00:07:46] preilly: is Ryan there? :) it's 3am here [00:07:52] paravoid: nope [00:07:58] paravoid: but don't worry about it I guess [00:07:59] okay, let me have a look [00:08:50] paravoid: Compare the output of 'mount' on srv281 and srv280 for instance [00:09:20] RoanKattouw: I'll have a look tomorrow, I just wanted to ping you because I saw you were wondering [00:09:29] you and maplebed [00:09:40] OK [00:09:53] it's disabled in pybal for quite some time [00:10:18] even if apache comes up, which I don't think it even can right now [00:12:36] paravoid: any idea why https://labsconsole.wikimedia.org/w/index.php?title=Special:NovaInstance&action=consoleoutput&project=mobile&instanceid=i-00000271 is blank? [00:12:53] none whatsoever :) [00:12:59] I'm trying to look at nova logs now [00:14:36] preilly: forgive me if it's a silly question: have you tried rebooting it? [00:14:51] paravoid: yes [00:16:05] PROBLEM - Host maerlant is DOWN: PING CRITICAL - Packet loss = 100% [00:18:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.723 seconds [00:19:02] that's strange [00:19:04] it's stuck in GRUB [00:20:07] preilly: seems to be up now [00:21:31] paravoid: you working right now or about to head out and enjoy the evening ? [00:21:48] LeslieCarr: "evening"? [00:21:51] "Evening" [00:21:54] 03:21 :) [00:21:57] (am) [00:21:59] hehe [00:22:07] s/evening/morning ? [00:22:10] paravoid: weird [00:23:17] preilly: mind if you pass it to Ryan or I investigate tomorrow? [00:23:35] paravoid: sure [00:23:37] the immediate problem should be fixed [00:23:41] i.e. the VM is up [00:23:49] let me find the root cause with a clear head :) [00:42:38] PROBLEM - Puppet freshness on ms-be10 is CRITICAL: Puppet has not run in the last 10 hours [00:54:02] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:00:38] paravoid or anyone else who might know - i've made a bunch of changes in a labs instance with the self-hosted puppetmaster. 
i've made some local commits but now i want to push to gerrit [01:01:06] how do i do that with the clone in /var/lib/git/opperations/puppet? [01:03:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.146 seconds [01:15:33] sooo [01:15:38] why do so many memcached requests time out? [01:15:46] I guess need a decent testing for that [01:24:36] packet loss [01:24:54] that's my theory anyway [01:38:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:41:23] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 216 seconds [01:43:02] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 261 seconds [01:49:47] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 668s [01:49:56] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.766 seconds [01:52:38] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 17 seconds [01:55:29] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 8s [01:55:56] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 11 seconds [02:22:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:24:31] New patchset: preilly; "switch back to strtok instead of strtok_r" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16716 [02:25:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16716 [02:25:22] New patchset: preilly; "remove carrier acl block and switch back to strtok instead of strtok_r" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16716 [02:26:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16716 [02:27:31] any operations people actually here right now? [02:30:28] mark: ping [02:30:40] paravoid: ping [02:30:45] notpeter: ping [02:34:02] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.686 seconds [02:44:31] RECOVERY - Puppet freshness on lvs5 is OK: puppet ran at Thu Jul 26 02:44:05 UTC 2012 [03:11:58] RECOVERY - Puppet freshness on srv198 is OK: puppet ran at Thu Jul 26 03:11:48 UTC 2012 [04:04:25] gerrit acct creation, already has SVN (so I can't do it myself): https://www.mediawiki.org/w/index.php?title=Developer_access&diff=565442&oldid=565340 [04:14:28] * jeremyb waves Ryan_Lane [04:15:23] preilly was having a labs problem. paravoid poked it briefly and got the instance up but didn't really investigate because it was 3am. then preilly came back again looking for ops later but didn't say why and then he /quit [04:16:44] instance was 'mobile-testing' [04:26:54] anyone around? [04:39:52] preilly: mobile-testing still? [04:40:17] jeremyb: nope [04:50:08] PROBLEM - Puppet freshness on potassium is CRITICAL: Puppet has not run in the last 10 hours [05:56:51] anyone around? [07:09:52] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [07:33:52] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [08:04:37] hello hashar [08:04:45] good morning :-) [08:05:37] hashar: is there a quick query you can make to see how many reviews (excluding bot reviews) we're having on gerrit compared to CodeReview? 
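One quick-and-dirty way to get numbers like Nemo_bis is asking for is Gerrit's SSH query interface (which, as hashar notes just below, not everyone has access to). A rough sketch, assuming your SSH config already carries your Gerrit username for port 29418, as in the clone command further down; the project name and search operators here are only examples:

    # Count merged changes in mediawiki/core (the output ends with one extra
    # "stats" row, so subtract one from the line count).
    ssh -p 29418 gerrit.wikimedia.org gerrit query \
        --format=JSON status:merged project:mediawiki/core | wc -l

    # Dump recent changes with their per-patchset approvals as JSON; filtering
    # out bot reviews (gerrit2, jenkins) would then be done client-side.
    ssh -p 29418 gerrit.wikimedia.org gerrit query \
        --format=JSON --all-approvals project:mediawiki/core limit:100 > changes.json

For people without query access, the pre-aggregated CSVs in the analytics/gerrit-stats data repository mentioned below are the easier route.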
[08:05:59] Nemo_bis: I don't have access to gerrit query engine :-D [08:06:07] uh [08:06:14] Nemo_bis: I know the analytics team is polishing a tool that will generate statistics out of Gerrit [08:06:18] who does, only demon? [08:06:24] like the number of changes merged per day and per repo [08:06:32] the time between patch submission and its +2 [08:06:34] yep, I hoped for a quick and dirty answer [08:07:26] there are some experimental data in analytics/gerrit-stats/data.git (though you might not have access) [08:07:54] git clone ssh://gerrit.wikimedia.org:29418/analytics/gerrit-stats/data.git [08:07:54] :) [08:08:20] then you get a datafiles/mediawiki/core/core.csv file [08:08:24] might get what you want [08:26:43] hashar: I do have permissions [08:27:18] great!!! [08:27:50] and also sent wikitech email as a plus [08:33:58] Change merged: Nikerabbit; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16498 [08:35:01] PROBLEM - Puppet freshness on cp1020 is CRITICAL: Puppet has not run in the last 10 hours [08:36:01] hashar: contains the following columns, not so useful IMHO: date,commits,self_review,time_first_review_staff,time_first_review_total,time_first_review_volunteer,time_plus2_staff,time_plus2_total,time_plus2_volunteer [08:36:04] PROBLEM - Puppet freshness on mw58 is CRITICAL: Puppet has not run in the last 10 hours [08:36:13] Nemo_bis: sorry that is all I got for now :-D [08:36:20] hashar: yeah [08:36:41] also probably broken, or it would mean we have no reviews at all or so [08:36:52] so we have to wait a bit more :) [08:36:55] Nemo_bis: you can talk about it with Diederik van Liere :-D [08:36:58] he wrote the code [08:37:31] I think the Gerrit report card will be released in the next few weeks. They will be able to tell you [08:37:50] he said "next week" 23 days ago :) [08:38:00] ask him again so :-] [08:38:02] PROBLEM - Puppet freshness on db35 is CRITICAL: Puppet has not run in the last 10 hours [08:38:02] PROBLEM - Puppet freshness on srv209 is CRITICAL: Puppet has not run in the last 10 hours [08:38:05] though he is sleeping right now hehe [08:38:08] but yes it's difficult [08:38:24] I don't think he needs a reminder :) [08:38:53] at least he can gives you an updated deadline :) [08:39:23] oh well, I think a talk comment and a wikitech email is enough [09:44:11] morning [09:46:23] hello paravoid :-] Had a good night? [09:46:44] * hashar looks at Athens weather http://www.bbc.co.uk/weather/264371 [09:47:24] ouch low of 30°C during night, which is what we got during the afternoon here and that is basically rendering everyone useless :-/ [09:48:13] paravoid: I got a fresh easy hack for you to review https://gerrit.wikimedia.org/r/#/c/16661/ :-D [09:48:28] then we can do the NFS to /data/project migration if you feel like doing it on wake up [10:13:21] New patchset: Ori.livneh; "*UNTESTED* Calls to bits-lb.eqiad/event.gif 204'd" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16724 [10:13:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16724 [10:26:47] mark: around? 
[10:27:02] mark: 1: cs:Connected ro:Secondary/Secondary ds:UpToDate/Diskless A r---- [10:27:17] mark: diskless means broken I think :) [10:32:25] hashar: it's a horrible *horrible* hack [10:32:30] 16661 that is [10:35:33] paravoid: not nastier than the existing one :-D [10:36:25] maybe that could be set at the realm.pp level, but I am not really willing to clean that out [10:43:59] PROBLEM - Puppet freshness on ms-be10 is CRITICAL: Puppet has not run in the last 10 hours [10:55:51] paravoid: yeah [10:55:54] not saying that isn't broken [10:55:59] just saying, it's normal that isn't mounted [10:56:12] ok [11:00:54] http://www.theregister.co.uk/2008/07/18/hp_packaging/ [11:00:55] this is so true [11:00:57] I hate HP [11:01:12] once we got like 2 pallets of HP boxes for toolserver [11:01:16] Dell is pretty good for packaging [11:25:53] hahaha that's so very true [11:37:28] New review: Alex Monk; "Almost perfect, just a couple of nitpicks." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/16035 [12:25:33] I'm seriously thinking of sending all labs mail to /dev/null until we have a proper relay [12:25:45] we're essentially blackholing them in our mailboxes anyway [12:32:22] ah you enjoyed the fcron mails? [12:33:13] yes [12:34:09] hehe [12:55:42] hashar: nfs? [12:56:00] paravoid: sure :-) [13:00:04] mark: if you are in a review mood, I have updated/rebased my two patches for bits.beta.wmflabs.org https://gerrit.wikimedia.org/r/#/c/15445 (fix bits when enable_geoiplookup is enabled) https://gerrit.wikimedia.org/r/#/c/13304/ (use the new cluster_options hash to pass the test hostname). [13:00:29] mark: oh and it is deployed on labs via puppetmaster:self :) [13:00:59] paravoid: ready for it ? change is https://gerrit.wikimedia.org/r/#/c/15545/ [13:05:29] New review: Faidon; "finally :)" [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/15545 [13:05:34] what could possibly go wrong [13:06:34] ah, I have to do 16632 first [13:06:49] yup that one could be nasty [13:06:55] but I think it fix an issue we have currently [13:07:38] some servers not having lvsrealserver end up with no nfs::upload [13:07:50] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16632 [13:07:54] hasher, Nemo_bis: are you around? [13:07:56] though I guess puppet is not smart enough to actually magically umount upload :D [13:08:01] drdee: I am there [13:08:19] yes, things have been slow with gerrit-stats, main reason was the work that needed to be done on limn [13:08:35] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15545 [13:08:40] now that's out of the door so to say, it's on github, we are now focusing on gerrit-stats [13:08:57] drdee_: I am fine with it :-D Maybe you could write a short message on wikitech-l so volunteer know about it ? 
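Back on the DRBD pair mark and paravoid were reading at 10:27: the ds: field in that status line is the local/peer disk-state pair, so UpToDate/Diskless means the local copy is fine but the peer has no backing disk attached. A read-only sketch for confirming that before touching anything; the resource name r1 is a placeholder taken from the "1:" device number above:

    # /proc/drbd shows connection state (cs), roles (ro) and disk states (ds)
    # as local/peer pairs; "Diskless" on the peer side means its backing device
    # is detached, not that replication as a whole is down.
    cat /proc/drbd

    drbdadm cstate r1   # connection state, e.g. Connected
    drbdadm dstate r1   # disk states, e.g. UpToDate/Diskless

Reattaching the backing device on the peer (drbdadm attach r1 over there, once its disk is known good) is the usual repair step, but that is the fix rather than the diagnosis and is left out of the sketch.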
[13:09:13] we deployed a first version on labs, but there are some issues i need to iron out first [13:09:16] drdee_: I am myself 100% confident that project will be successfull and don't really care when it will land :-D [13:09:42] paravoid: running puppet on the instances [13:12:05] paravoid: the f**** stupid puppet is file bucking everything :-D [13:12:16] info: /Stage[main]/Nfs::Apache::Labs/File[/usr/local/apache]: Filebucketed /usr/local/apache/common-back/php-1.19/languages/messages/.svn/text-base/MessagesPrg.php.svn-base to puppet with sum 3b2f65a59d80da76a8ce3220f339c6bd [13:13:04] uh oh [13:13:17] notice: Finished catalog run in 115.29 seconds [13:13:19] anyway :) [13:13:35] there's a way to tell it to not do that [13:13:40] don't remember the details [13:13:43] anyway, did it work? [13:13:48] not sure [13:14:01] need to manually umount the old point [13:14:02] I didn't push it to production yet [13:16:03] I have manually umounted /mnt/upload6 and rerunning puppet [13:16:58] hello drdee_ [13:17:10] hey Nemo_bis [13:17:25] oh focus on gerrit-stats, nice [13:17:29] yes we kept you waiting :) [13:17:31] or :( [13:17:35] haha [13:17:37] blaahhh info: /Stage[main]/Apaches::Service/Exec[apache-trigger-mw-sync]: Scheduling refresh of Exec[mw-sync] [13:17:42] never going to work on lab I guess [13:18:00] the work on limn was taking more time: github.com/wikimedia/limn [13:18:14] but that's done (for the moment) [13:18:19] drdee_: are you including per-reviewer stats for code review? (to answer questions like the last one I sent to wikitech-l)? [13:18:35] not in the initial version, but i would like to add that [13:18:51] because now with gated trunk we need social pressure for code review or it will never get done I think [13:18:51] what per reviewer stats are you thinking about? [13:19:06] well, who reviews how much for instance [13:19:33] maybe even how responsive people are to review requests (bt also how many requests do they receive) [13:19:43] paravoid: lets move to labs :) [13:20:38] drdee_: I suppose you're alredy going to make nice graphs of the number of unreviewed commits and such of course [13:20:55] ok, i'll note those per reviewer stats [13:21:05] yes, everything will be visualized using limn [13:21:38] the "etc." on (3) in http://lists.wikimedia.org/pipermail/wikitech-l/2012-April/060248.html is quite broad :) [13:21:59] ohhh about the DOI URI support [13:22:31] uh you were in cc I guess [13:22:37] you know that DOI standard is crazy? and that there are many more DOI formats then just doi:10.1000/186 ? [13:22:52] sigh [13:22:54] yeah, in berlin i was pushing for this as well [13:23:01] it's actually not easy [13:23:12] because basically any character is allowed in a DOI [13:23:18] someone mentioned in the bug that the other forms are nonstandard? 
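On the filebucket noise paravoid hit at 13:12: the thing he half-remembers is Puppet's per-resource backup parameter, and a resource default switches it off everywhere. A minimal sketch for a puppetmaster::self instance; the site.pp path below is an assumption, so adjust it to wherever the instance's checkout actually keeps its manifests:

    # Assumed path for a labs self-hosted puppetmaster's manifests.
    sudo tee -a /var/lib/git/operations/puppet/manifests/site.pp <<'EOF'
    # Don't copy every replaced file into the filebucket; recursive file
    # resources like /usr/local/apache otherwise bucket thousands of files.
    File { backup => false }
    EOF

The same parameter can also go on an individual file resource if bucketing should stay on elsewhere.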
[13:23:25] Logged the message, Master [13:25:09] drdee_: also things like number of reviews (not only of -1/-2/+1/+2), maybe even inline comments [13:25:23] * Nemo_bis just thinking out loud [13:26:51] so what i have now is: number of commits per day, number of days until first review (excluding bot reviews), number of days until +2, and this is a breakdown by staff and volunteers [13:27:24] but i am sure those metrics need to be refined, and we wil want to add more [13:28:53] nemo_bis: http://www.doi.org/doi_handbook/2_Numbering.html#2.5 and read the first line [13:30:37] New patchset: Nikerabbit; "Initial version of solr for ttmserver" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16732 [13:30:51] drdee_: those are general metrics on how well we're doing to verify there aren't too many forgotten commits [13:31:09] but not really a way to measure how much code review activity we have [13:31:13] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16732 [13:31:17] yes, but that is , AFAIK, the biggest concern right now [13:31:30] adding new metrics is quite straightforward [13:31:30] and even less a way to show who is doing such activity [13:31:59] oh and there is a metric for the percentage of self-review [13:32:00] I know that's the concern, but as I said we probably have a need for social pressure towards code review [13:32:29] haha we need a leader board :) and hand out badgges [13:34:26] drdee_: also, I assume basic stuff like number of commits per committer (besides reviews and merges) are in the plan? [13:34:43] drdee_: is there a mediawiki.org page or something for these specifications? [13:34:53] well the unit of analysis is the repo, not an individual [13:35:19] drdee_: but Erik mentioned also the individual as possible target, although with lower priority [13:35:32] yep, lower priority [13:35:56] but i am not sure that i like this carrot/ stick approach [13:35:57] drdee_: which still needs to be tracked somewhere so that some day someone does it :) [13:36:21] drdee_: carrot/stick is "how many -1 you have received, bad boy?" [13:36:46] just the number of reviews/merge/commits is normal stats [13:37:05] (basic) [13:37:19] so number of commits is present [13:37:32] Nemo_bis: about the gate trunk, the changes that are reviewed are now deployed in the next 2 weeks. So that is at least an improvement :-] [13:37:49] RECOVERY - Host search32 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [13:37:56] number of reviews is not necessarily useful, you have multiple patch sets per commit, you can have multiple reviewers per commit [13:38:31] the focus right now is how long do you have to wait for feedback [13:38:35] hashar: of course :) [13:38:49] and how long does it take to get your commit merged [13:38:59] and is there a difference between volunteers and staff [13:39:13] and this is on a per repo basis [13:39:16] drdee_: sure sure, I only want to understand if you think it makes sense to have stats on individual contributions too, at some point [13:39:25] platform eng does a lot of pair to pair review so we probably have a lower latency. [13:39:46] say I exchange 3 reviews from aaron vs me reviewing two of is changes. [13:39:46] hashar: uh, like citation market in scientific publishing? 
:D [13:40:04] we can count how often a dev does a review but that needs more context to be actionable [13:40:13] like we are closely collaborating and are available everyday [13:40:33] :p [13:40:34] whereas with a volunteer you get some inherent latency cause volunteers are not there everyday [13:40:42] and of course staff don't do that much review during the week-end :-] [13:40:58] drdee_: yes, stats don't need to be directly actionable do they [13:41:02] that is why I do my review on monday, I can apply the volunteer work that has been done while I am enjoying some fresh air and my family :-]]]]]]] [13:41:11] so yeah we will get a difference for sure [13:41:32] well if stats are not actionable, what is the purpose? [13:42:51] well if stat mean time for merge is 3 days and volunteers is 5, I don't think we have a problem [13:43:04] if staff is 2days and volunteers is 25 days that is entirely different :-] [13:43:04] drdee_: I mean that they can be neutral, don't suggest any action per se [13:43:16] stats are never neutral :D [13:43:28] well they can be more or less so [13:43:42] drdee_: is there any place we can open feature requests ? [13:43:58] one interesting thing would be an aggregate of extensions deployed on wmf vs non wmf extensions [13:44:06] is already done [13:44:18] drdee_: do you think that a "rank" of committers by number of commits/merges/reviews is too biased/wrong in some way [13:44:20] but yes, we need a place to file feature requests [13:44:25] !log deployment-prep rsync finished for both apache and upload6. Remounting and restarting apaches [13:44:32] Logged the message, Master [13:45:08] hashar: wrong channel? [13:45:09] :) [13:45:12] Nemo_bis i wouldn't want to go into that direction personally, but i am open for debate [13:45:50] grmblblblbl [13:45:57] I need to set different background colors [13:45:59] :) [13:47:18] drdee_: ok where should the debate happen? :) [13:47:40] where is the direction we're going to tracked? [13:47:59] probably as a follow up to the initial announcement of gerrit-stats [13:48:08] drdee_: I think the best thing is just to allow people to get their stats by themselves, which AFAICS is the way you're going [13:48:13] ok [13:48:17] not sure if i understand your question [13:48:43] so, code review DB was replicated to toolserver and people extracted stats there [13:49:22] drdee_: which question? I mean if a list like http://lists.wikimedia.org/pipermail/wikitech-l/2012-April/060248.html has been placed somewhere on wiki with subsequent updates etc. [13:49:34] mark: srv/mw partioning; needs fixing; while at it, do you want me to make any larger changes? [13:49:44] mark: like replace jfs or…? [13:50:43] yeah replace jfs [13:50:58] but we no longer need /a at all [13:51:09] perhaps make a larger /, and keep a lot of free space [13:51:16] we can always add partitions or LVM later if we suddenly need to [13:51:36] so put everything into a large / ? [13:51:36] image/video scalers can then use that space [13:51:47] depends on what you mean by 'everything' [13:51:57] and /a? not /tmp? [13:52:11] oh is that what peter renamed it to? [13:52:12] no idea [13:52:23] yeah apaches need a /tmp space, perhaps make that a special partition [13:52:45] i wonder if /usr/local should be separate, where mediawiki lives [13:53:13] curerently there's a 7gb / (that's filled up), a 65g /a and a 2g /tmp [13:53:24] uh ok [13:53:30] ext3 jfs ext3 respectively [13:53:34] make / 30 GB or thereabouts [13:53:48] 30g / ext3, 2g /tmp ext3 sounds good? 
[13:53:48] make a decent sized /tmp, 10 GB or so [13:53:52] 10? [13:54:05] if we want image scalers to use that yes [13:54:07] well, I guess we have 250g disks there [13:54:11] yeah [13:54:23] the least you would possibly find is 80 GB [13:54:26] but I think we don't have those [13:54:48] okay then [13:54:52] anyway, keep the remaining space unpartitioned or unused in LVM I think [13:54:54] then we have flexibility [13:55:04] okay! [13:56:37] hm, interesting, srv280 has a separate /usr/local/apache and no /a [13:56:44] but that's not in autoinstall [14:02:10] hashar: do you know what the current process is for apache config changes? I recall something about them being in both git and svn. [14:03:53] maplebed: I think we have almost everything in git now [14:04:03] Tim did cleanup them [14:04:35] and do you know what the deploy process is then? the wiki still says to run sync on fenari without any mention of git. [14:04:58] maplebed: yeah the local svn got archived [14:05:07] in /h/w/conf/httpd/archive [14:05:14] so it is 100% git / gerrit [14:05:15] oh wait, I was looking at the wrong page. [14:05:26] which page were you looking at ? [14:05:30] http://wikitech.wikimedia.org/view/Sync_scripts#Operating_on_apaches_and_image_scalers_dsh_groups [14:05:53] http://wikitech.wikimedia.org/view/Apaches#Deploying_config does say do it in gerrit first. [14:06:08] [[Apaches]] should be the reference now [14:06:14] I think I got proofread by mutante [14:06:35] maplebed: we're supposed to talk about an RT ticket :) [14:06:43] so we are. [14:06:52] maplebed: good morning btw, you're an early riser! [14:06:56] though ^^^ is relevant too. [14:07:04] (or a party animal) [14:07:09] so anyway [14:07:12] since part of the thing in between me and doing that ticket was figuring out how to deploy apache configs. [14:07:13] :P [14:07:21] there might be files that are ignored by git [14:07:22] paravoid: sadly not so much. It's already past 10am. [14:07:22] yeah, that's how I remembered it [14:07:23] havent checked [14:07:35] hmm no there is not :-) [14:07:36] maplebed: oh? not in SF? [14:07:43] we are 100% in git!!! yeah [14:07:57] thanks hashar. I think you've gotten me far enough that I'll be able to stage the change I want to make. [14:07:59] :) [14:08:09] so basically [14:08:26] copy the repo locally, submit change, have it reviewed/merged, git pull on fenari, sync-apache [14:08:27] paravoid: on cape cod (MA) [14:08:36] ohhh sneak the test suite before running sync-apache [14:08:47] can't remember where it is though but Jeff Green wrote a mail about is perl script [14:10:16] paravoid: I got as far as confirming that the rewrite rule should just send everything to index.php and that it should go in main.conf. [14:10:34] I didn't get to test it and just now got to figure out how to stage / review it. [14:10:43] I think that's actually about it. [14:11:02] (the hardest part was sorting through the tickets to find out what folks actually wanted) [14:11:45] if you find bugzilla easier, maybe you could have the apache-config request set there instead of RT ? [14:11:54] + you will get support from the volunteers :-) [14:12:02] (re: index.php - the rewrite rule is not supposed to send traffice to Special:ShortURL) [14:12:28] hashar: the problem was that there were 2 or 3 bugzilla tickets and an RT ticket, all with different versions of the request, and within each ticket, each request morphed through the comments. 
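For reference, the srv layout mark and paravoid settle on above (roughly a 30 GB ext3 /, a ~10 GB /tmp for the scalers, swap, and the rest of the disk left unallocated) would look something like this as a debian-installer recipe of the kind kept with the autoinstall files in the puppet repo. This is an untested sketch of the recipe format, not the change that actually lands; the three numbers per partition are min/priority/max in MB:

    d-i partman-auto/method string regular
    d-i partman-auto/expert_recipe string \
        apache-srv :: \
            30000 30000 30000 ext3 \
                $primary{ } $bootable{ } method{ format } format{ } \
                use_filesystem{ } filesystem{ ext3 } mountpoint{ / } \
            . \
            10000 10000 10000 ext3 \
                method{ format } format{ } \
                use_filesystem{ } filesystem{ ext3 } mountpoint{ /tmp } \
            . \
            1000 1000 200% linux-swap \
                method{ swap } format{ } \
            .

Because every partition has a bounded maximum, the remainder of the disk should stay unallocated, which is the "keep a lot of free space, add partitions or LVM later" idea above.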
[14:13:18] having two issue tracker surely does not help [14:13:31] maybe those non private requests can be full filed via bugzilla [14:13:41] but then ops might not want to track bugs in both bugzilla and RT [14:13:59] what I mean: the biggest problem wasn't the choice of ticket tracking system [14:14:20] sure, it didn't help. but it wasn't the source of confusion. [14:15:35] paravoid: can we get the syslog hack in please? https://gerrit.wikimedia.org/r/#/c/16661/ [14:26:25] PROBLEM - Host search32 is DOWN: PING CRITICAL - Packet loss = 100% [14:29:34] RECOVERY - Host search32 is UP: PING OK - Packet loss = 0%, RTA = 0.43 ms [14:50:44] PROBLEM - Puppet freshness on potassium is CRITICAL: Puppet has not run in the last 10 hours [14:53:49] 26 04:04:24 < jeremyb> gerrit acct creation, already has SVN (so I can't do it myself): https://www.mediawiki.org/w/index.php?title=Developer_access&diff=565442&oldid=565340 [14:55:31] New patchset: Mark Bergsma; "cp1041 (Precise) mobile disk cache back to 100G" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16739 [14:56:13] New patchset: Mark Bergsma; "Add asw-c-eqiad to Torrus" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16641 [14:56:53] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16641 [14:56:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16739 [14:56:53] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16739 [14:58:39] maplebed: so, I presume you don't know more about that short url thing [14:58:45] and I should ask the requesters [14:58:56] mark: how's the varnish stuff? working so far? [14:59:12] yes [14:59:15] cool [14:59:16] the persistent storage is [14:59:19] right [14:59:20] streaming hasn't been tested yet [14:59:28] right, yeah, I got that [14:59:53] we need persistent storage basically everywhere and streaming in upload, right? [15:07:12] yes [15:21:42] hiiiii domas, you around? [15:39:04] paravoid: I might. what're you wondering? [15:41:28] maplebed: I'm not sure I understand what needs to be done... [15:41:49] is it a RewriteRule in files/apache/rewrite.conf? [15:41:59] yup. [15:42:04] now that I've actually got the apache configs out, [15:42:08] I can stage it... [15:42:36] one sec. [15:42:38] ottomata: whatsup [15:44:45] heya domas [15:44:57] domas: there was a question about scribe. do ya'll still use it, why hasn't it been updated to newest thrift version, etc. [15:44:59] yo [15:45:30] perhaps we can hire jeremyb as our secretary [15:45:37] hahahaha [15:45:57] last time I checked, scribe is very much in use [15:45:58] * jeremyb wouldn't be a very good one [15:46:02] yeah you would [15:46:05] you're now doing it for free [15:46:11] also, projects on apache get their own life [15:46:43] paravoid: http://pastebin.com/74dd58ik <-- that's the change I'm making, though I haven't got git convinced it should commit to gerrit for the apache configs yet. [15:47:15] maplebed: why not rewrite.conf? [15:47:30] New patchset: Hashar; "basic README introducing our files" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16035 [15:47:31] that's where /wiki is apparently [15:47:37] but I know little about our apache configs [15:48:03] I don't remember why I chose main.conf. maybe because it's project-specific? [15:48:11] I read through them all trying to decide where looked appropriate. 
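Pulling hashar's apache-config deploy flow from around 14:08 into one place, as a sketch: it assumes a Gerrit account plus deploy access on fenari, and the test-script name and the fenari checkout path in the comments are from memory and inferred from the archive path above rather than stated in this log.

    # 1. Local working copy of the Apache configs.
    git clone ssh://gerrit.wikimedia.org:29418/operations/apache-config
    cd apache-config
    # ... edit main.conf / rewrite.conf ...
    git commit -a -m "Special:ShortUrl rewrite (RT-2121)"

    # 2. Submit for review (plain Gerrit push; git-review does the same thing).
    git push origin HEAD:refs/for/master

    # 3. Once reviewed and merged, on fenari, in the deployed checkout
    #    (under /h/w/conf/httpd, going by the archive path above):
    git pull
    # (optional) run apache-fast-test, the Perl script Jeff Green wrote,
    # against a depooled apache such as srv193 before pushing it everywhere.
    sync-apache
    # ...followed by a graceful restart of the apaches if the change needs one.

[[Apaches]] on wikitech is the page hashar points to as the current reference for this.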
[15:49:17] maplebed: and where's main.conf? [15:49:24] not in puppet [15:49:41] it is in puppet. (operations/apache-conf project) [15:49:44] err.. [15:49:44] no, [15:49:46] you're right. [15:49:50] sorry. it is in git, not in puppet. [15:52:18] New patchset: Hashar; "basic README introducing our files" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16035 [15:52:30] op, hey domas, just saw you responded [15:53:18] yeah, so , we are evaluating options for udp2log replacement [15:53:30] also considering ways to get log and other data into the analytics cluster [15:53:34] hadoop most likely [15:53:47] there's scribe, then there's flume and kafka and some other java options [15:53:55] scribe is comparably enice because it is not java [15:53:58] but [15:54:23] as far as I can tell, development and maintenance on scribe is pretty much non existent [15:54:30] at least on the open source side [15:54:40] and the community seems kind of inactive [16:17:55] New patchset: Bhartshorne; "Special:ShortURL redirect RT-2121" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/16742 [16:20:51] paravoid: ^^^ does that match what you think we need to do ? [16:22:38] New patchset: Bhartshorne; "apache rewrite rules to allow swift to call image scaler directly" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/16743 [16:24:07] maplebed: I don't really know enough about it to be able to judge it [16:24:23] the blind leading the blind. [16:24:47] well, I suppose we can always just deploy it to the host serving test wiki and try it out. [16:25:15] I suppose so :) [16:25:31] (not that I've ever done that before) [16:26:38] msg from OTRS tells us switching off of apache to something else will help performance ;) http://dpaste.com/775797/plain/ [16:27:04] of course we do already use all of the alternatives he mentioned [16:27:41] that's what you get when you use try to convince people to donate by using useless metrics and comparing apples & oranges [16:28:06] paravoid: oh, right you missed last year's fundraiser ;) [16:32:31] here is an ignorant un apropos question [16:32:34] what does -ng stand for [16:32:36] in thing slike [16:32:39] syslog-ng [16:32:39] etc. [16:32:41] ? [16:32:53] new generation [16:33:07] ahhhhh [16:33:09] I thought it was next not new [16:33:14] maybe next yeah [16:33:17] usually it just means "I suck at naming things" though [16:33:17] picard vs kirk [16:33:50] Nah, sucking at naming this is v2 is called the same as v1 with a totally different codebase :D [16:36:29] maplebed: so, we can directly edit srv193's config [16:36:43] (srv193 is out of rotation and serves test.wp.org) [16:36:52] but I don't really know what should follow after a /s/ :) [16:36:53] cool.- [16:36:54] i.e. a short url [16:36:57] to be able to test it [16:37:09] I think we should just wait for Roan or Reedy [16:37:15] right - the ShortURL extension has to be enabled (which it might be alread) [16:37:18] Wait? [16:37:26] then we can create one using teh extension [16:37:26] I think it is on testwiki.. [16:37:29] then test it with /s/ [16:38:17] Reedy: I don't see it in http://test.wikipedia.org/wiki/Special:SpecialPages [16:38:18] oh hi Reedy [16:38:53] paravoid: but also - I've got to bail for a few hours in about 8 minutes. [16:39:25] but don't let that stop you from carrying on! 
:) [16:39:36] heh [16:39:37] 'wmgUseShortUrl' => array( [16:39:37] 'default' => false, [16:39:37] #'testwiki' => true, #temp disable for testing AFTv5 --catrope [16:39:37] ), [16:39:43] It was there on testwiki [16:39:46] #'testwiki' => true, #temp disable for testing AFTv5 --catrope [16:39:50] * jeremyb is so slow [16:39:56] I'm AFK for dinner in a few minutes [16:40:19] can you enable it at some point? [16:40:51] [ Error writing wmf-config/InitialiseSettings.php: Permission denied ] [16:40:53] :( [16:41:10] -rw-r--r-- 1 nikerabbit wikidev 391609 2012-07-26 08:42 wmf-config/InitialiseSettings.php [16:41:29] Can someone fix write for wikidev on the file please? :p [16:41:49] Reedy: where is that located and how's it installed ? [16:41:55] where? [16:42:01] /h/w/c/wmf-config/InitialiseSettings.php [16:42:03] * jeremyb assumes fenari [16:42:06] yeah [16:42:16] I wonder how he managed that.. [16:42:38] !log Created ShortUrl tables on test2wiki [16:42:46] Logged the message, Master [16:42:59] Reedy: done [16:43:12] and verified it's the only file in there without g+w [16:44:33] maplebed: paravoid: enabled on both testwiki and test2wiki [16:44:55] Or not [16:44:55] thanks a lot [16:44:56] PHP fatal error in /usr/local/apache/common-local/wmf-config/CommonSettings.php line 2350: [16:44:56] Class 'Special' not found [16:45:06] oh? [16:45:41] also, how do you actually create a short url? :) [16:45:42] lol [16:46:18] * Reedy waits [16:46:24] for? :) [16:46:38] master [16:47:10] Change abandoned: RobH; "this was fixed by chris already" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16418 [16:47:35] Change abandoned: RobH; "this was already applied in another patch set" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4429 [16:47:43] paravoid: [16:47:43] Rue (The Hunger Games) [16:47:43] https://test.wikipedia.org/wiki/Special:ShortUrl/8bc [16:48:07] Is it supposed ot be /s/? [16:48:09] https://test.wikipedia.org/s/8bc no work [16:48:56] same on test2wiki [16:49:01] Reedy: is the last token case sensitive? [16:49:15] No idea without looking at the code [16:49:16] Right, afk for dinner before I get shouted at again [16:49:25] thanks a ton [16:49:42] haha [16:49:56] $wgShortUrlPrefix = $wmgShortUrlPrefix; [16:49:56] $wgShortUrlPath = "/s/$1"; [16:50:31] 'wmgShortUrlPrefix' => array( [16:50:31] 'default' => false, [16:50:32] ), [16:51:12] The globals look wrong/out of date [16:51:16] Like they've been renamed.. [16:52:42] Ah, that just changes the visible name.. [16:52:43] https://test2.wikipedia.org/s/4 [16:53:21] but those links don't work still either [16:53:24] Enjoy ;) [16:53:28] that's what we're trying to fix [16:54:04] but we couldn't test until you came to rescue [16:54:09] it's on both srv193/testwiki and test2wiki, so you can test it a few places [16:54:20] what is test2wiki? [16:54:33] like testwiki, but runs on any random apache, like the rest of the wikis [16:55:08] aha [16:55:15] great [16:55:17] thanks again [16:56:47] New patchset: RobH; "adding in the basic support for smokeping, will be using local puppetmaster in labs to further refine" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16748 [16:57:25] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16748 [16:57:48] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16748 [16:58:28] it's not case sensitive. so we should make the /s/ part also not case sensitive? 
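maplebed's actual change is the pastebin above, but for context, the shape being discussed is a rule in the apache-config repo that hands the extension's /s/$1 path over to index.php and lets MediaWiki resolve the code (per maplebed at 14:12, it is not meant to rewrite to Special:ShortURL directly). A hypothetical sketch only, not the committed rule; and since, as the base_convert check just below shows, the base-36 codes themselves compare equal regardless of case, an [NC] flag is all it would take to make the /s/ prefix case-insensitive as well:

    # Hypothetical sketch -- the real change is the pastebin / gerrit 16742.
    # Hand /s/<code> to MediaWiki; the ShortUrl extension resolves the code.
    RewriteRule ^/s/(.*)$ /w/index.php [L]
    # or, if the prefix itself should match case-insensitively:
    # RewriteRule ^/s/(.*)$ /w/index.php [L,NC]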
[16:58:31] $ php -r 'print base_convert("ABCD",36,10) . "\n";print base_convert("abcd",36,10) . "\n";' 2>/dev/null | uniq -c 2 481261 [16:58:50] (those last 2 tokens were on the next line, idk why my paste failed) [17:01:52] awjr: figured out puppetmaster::self? [17:02:44] jeremyb enough to get a working manifest set up, but i haven't figured out how to push my changes to review. i was about to just start porting patches to another repo clone [17:03:04] that was my question [17:03:39] jeremyb is it possible to push from the puppetmaster:self clone? [17:06:45] awjr: so, i think there's little special about the puppet repo. just get it to gerrit the same way you would anything else on labs. [17:06:51] awjr: you can either just push directly but that's a bit of a security issue (having your gerrit private key on a labs box). or you could pull down to another clone and push from there. (add the ::self instance as a remote for the repo on your local workstation or do a one-off pull. or do git-format-patch and copy the file down and then git am on the other repo. (maybe something like git format-patch -k --stdout origin/master..master)) [17:07:16] ok cool thanks jeremyb [17:08:41] omg twitter [17:09:28] domas: omg twitter fail ? ;) [17:09:40] no fail whale even [17:09:48] Twitter is currently down for <%= reason %>. [17:09:48] We expect to be back in <%= deadline %> [17:09:51] New patchset: preilly; "remove carrier acl block and switch back to strtok instead of strtok_r" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16716 [17:09:58] hehehe [17:10:02] oh i didn't see that bit [17:10:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16716 [17:10:33] LeslieCarr: can you approve https://gerrit.wikimedia.org/r/#/c/16750/1/templates/varnish/mobile-frontend.inc.vcl.erb and merge [17:11:13] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [17:11:24] cmjohnson1: yes [17:11:30] cmjohnson1: in 5 minutes ? [17:11:49] preilly: hey, it relies on a previous commit https://gerrit.wikimedia.org/r/#/c/16716/3 -- do you want to rebase or … ? [17:12:26] Anyone else have an issue on labsconsole where loading the manage instances list will result in only showing the group title names, and not the actual instances grid? [17:12:40] It works for me on initial load, then it stops working on reloads and navigating to it. [17:13:10] The only way I can work around it is uncheckign the filter for testlabs project, submitting, then rechecking it. [17:14:32] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16716 [17:14:47] RobH: something's off. 
and slow too [17:14:50] LeslieCarr: no I need the other change as well https://gerrit.wikimedia.org/r/#/c/16716/2 [17:14:53] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16750 [17:15:09] jeremyb: as long as its just not me [17:15:21] At first I assumed it was just me [17:15:33] but its happening too often, on too many browsers, and I tried on two computers ;] [17:15:44] it is definitely not just you [17:16:01] As soon as ryan is about im gonna ping the hell out of him, heh [17:16:29] although i'm trying to repro again and try something else [17:16:44] yeah, broken [17:16:49] its also building super slow on instacnes [17:16:51] instances even [17:18:15] the list instances link in the sidebar is relatively fast though [17:18:28] i dont think that lets me apply changes to them though [17:19:05] yea, just lists, but better than nothing [17:20:25] anyone here who knows whom is handling the wikibugs bot (in #mediawiki)? [17:21:08] AzaToth: kinda [17:21:12] just noticed someone seems to have pulled some glue into it [17:21:14] RobH: have at it ;) [17:24:22] AzaToth: you mean behavior changed? [17:25:01] oh, wow 4 hrs idle [17:25:06] maybe it needs a boot [17:25:31] jeremyb: issues with labsconsole are known [17:25:36] it's due to scaling issues with nova [17:25:43] it'll go away when we upgrade [17:26:24] Ryan_Lane: okey. first i've heard of it but I don't read everything in #-labs [17:26:34] oh [17:26:40] RobH mentioned you saw some issues [17:26:58] only because RobH was asking if it was broke (or it was just him) [17:27:14] so i checked and it was broke [17:27:46] cmjohnson1: hey [17:27:54] jeremyb: seems to not work atm [17:28:04] AzaToth: right [17:28:12] it hasn't reported any bugs in a long long time [17:28:25] 26 17:24:50 [freenode] -!- idle : 0 days 4 hours 33 mins 44 secs [signon: Thu Jul 26 12:51:05 2012] [17:28:36] maybe Ryan_Lane wants to give it a boot ;) [17:28:40] seems to be on mchenry [17:28:55] cmjohnson1: sounds like time to party with the ex's [17:28:57] eh? [17:29:00] what's broken? [17:29:00] 4.5 hours is a loooooooong time in modern days [17:29:02] let me turn the ports down... [17:29:06] Ryan_Lane: wikibugs irc bot [17:29:09] oh [17:34:43] cmjohnson1: 16/1 ? [17:35:08] 15 is a gige linecard [17:35:13] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [17:35:21] foundry counts from 1, unlike juniper which counts from 0 [17:35:41] because all network gear has to be different ;) [17:37:30] I'm back [17:37:46] so, who wants to fix wikibugs then? (irc bot) [17:49:17] paravoid: Any luck? [17:49:18] cmjohnson1: doh, what's the switch called on the console server ? [17:49:25] cmjohnson1: but its ok as i downed the port [17:50:34] which scs is it on ? [17:53:05] i think i'm blind ... [17:53:09] not seeing it [17:53:10] oh [17:53:11] um [17:53:12] yes [17:53:15] the one labeled asw2 [17:53:20] * LeslieCarr hides [17:53:29] * Damianz finds LeslieCarr some redbull [17:53:36] It gives you wings apparently so you can fly away :D [17:53:40] hehehe [17:53:57] It's like 7pm here :P Damn west cost people [17:54:00] mmm, need to make chai.... [17:54:04] Damianz: where are you at again ? [17:54:19] England at the moment, sometimes Scotland. [17:54:50] It's always 17:00 somewhere [17:54:53] cool [17:55:17] i rarely hit up england … mainly the cold thing :) [17:55:33] Cold!? It's ruddy boiling atm, well not boiling but muggy and horrid. 
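Circling back to awjr's question at 17:02 about getting commits off a puppetmaster::self instance without parking a Gerrit key on it, jeremyb's format-patch route looks roughly like this. A sketch: the instance hostname, the wlm-api topic branch and the branch names are placeholders, and the labs clone may track master rather than production depending on how it was set up:

    # On the labs instance, in its operations/puppet clone:
    cd /var/lib/git/operations/puppet
    git format-patch -k --stdout origin/production..HEAD > /tmp/my-changes.patch

    # On your workstation (copying via the labs bastion), in a normal Gerrit
    # clone of operations/puppet:
    scp mobile-testing.pmtpa.wmflabs:/tmp/my-changes.patch .
    git checkout -b wlm-api origin/production
    git am -3 -k my-changes.patch
    git push origin HEAD:refs/for/production    # or: git review

Pushing straight from the instance works too, but as jeremyb says that means keeping your Gerrit private key on a labs box.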
[17:55:52] Totally broke 25C today =/ [17:56:09] well then, for the middle of summer ;) [17:57:43] I'd totally do Florida, could wear shorts and sunglasses without risking rain every 5min... apparently it's 32C over there atm, jealous [18:00:19] It was into the 30s earlier this week [18:06:58] http://forecast.weather.gov/MapClick.php?lat=40.6498&lon=-73.9488&FcstType=text&unit=1&lg=en [18:16:21] New patchset: awjrichards; "Adds WLM api host config in misc-servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16755 [18:16:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16755 [18:19:10] New patchset: awjrichards; "Adds WLM api host config in misc-servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16755 [18:19:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16755 [18:20:53] New review: awjrichards; "This change set should be abandoned as it was replaced with Change-Id: I28d60135ff8a1286e3f7e44cbb3b..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/16530 [18:22:09] can a puppet wizard please review https://gerrit.wikimedia.org/r/#/c/16755/2? [18:24:48] awjr: Does wlm.wm.o need HTTPS? [18:25:08] Change abandoned: MaxSem; "Abandoning in favor of https://gerrit.wikimedia.org/r/#/c/16755/" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16530 [18:25:18] awjr: Also, where does /var/wlm/erfgoed come from? [18:25:38] I mean I know what that directory name means better than you :D but what installs it [18:26:21] Looks fine to me otherwise but I'm not all that familiar with this [18:26:27] RoanKattouw: per https, no i dont think so; the /var/wlm/erfgoed will get created manually atm [18:26:46] RoanKattouw erfgoed is the name of a bot that does a bunch of fancy stuff for WLM [18:27:04] we're hosting a portion of the API that's included with erfgoed for the WLM app that we're putting together [18:27:12] atm it's all hosted in a TS svn repository [18:27:28] OK [18:28:00] RoanKattouw erfgoed == heritage? [18:28:06] Yes [18:28:09] \o/ [18:28:17] Well, kind of [18:28:37] Heritage specifically within the meaning of cultural heritage [18:28:45] cool [18:35:42] PROBLEM - Puppet freshness on cp1020 is CRITICAL: Puppet has not run in the last 10 hours [18:36:45] PROBLEM - Puppet freshness on mw58 is CRITICAL: Puppet has not run in the last 10 hours [18:38:42] PROBLEM - Puppet freshness on srv209 is CRITICAL: Puppet has not run in the last 10 hours [18:38:43] PROBLEM - Puppet freshness on db35 is CRITICAL: Puppet has not run in the last 10 hours [18:39:16] RoanKattouw if it looks sane to you, can you approve it? We're trying to get this locked down asap [18:39:39] i don't think he can [18:39:50] o [18:39:52] then never mind [18:39:53] :p [18:54:01] PROBLEM - SSH on srv286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:54:19] PROBLEM - Apache HTTP on srv286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:54:46] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.8.36:11000 (Connection timed out) [18:56:43] RECOVERY - SSH on srv286 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [18:57:01] RECOVERY - Apache HTTP on srv286 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.641 second response time [18:57:29] awjr: hey, is this urgent ? 
[18:57:37] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [19:09:01] LeslieCarr: semi-urgent; it would be great to get the chnges merged tody [19:09:03] *today [19:09:38] awjr: do you know about puppetmaster:self ? [19:09:57] we don't put lab nodes in prod puppet [19:10:07] LeslieCarr yes - we put that manifest together in labs with puppetmaster::self [19:10:29] LeslieCarr ok - we don't have a prod box yet [19:10:40] i can take out the labs node and just add the prod node once we hve it set up [19:10:49] sounds good :) i await patch set 3 [19:10:49] PROBLEM - Apache HTTP on srv286 is CRITICAL: Connection refused [19:11:32] circular dependency detected:P [19:12:03] LeslieCarr: maybe you could poke wikibugs @ mchenry? [19:12:22] New patchset: awjrichards; "Adds WLM api host config in misc-servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16755 [19:12:36] (irc bot) [19:12:39] LeslieCarr: https://gerrit.wikimedia.org/r/#/c/16755/ [19:13:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16755 [19:23:12] !log authdns update for new services [19:23:19] Logged the message, RobH [19:31:40] RECOVERY - Apache HTTP on srv286 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.030 second response time [19:37:13] !log claiming yttrium for smokeping install [19:37:20] Logged the message, RobH [19:40:27] !log changed to calcium instead, as yttrium is a 610 and i only need a 310 [19:40:35] Logged the message, RobH [19:48:32] Interesting [19:48:51] I get the err: Could not retrieve catalog from remote server: Could not intern from pson: unexpected token in array at ''. error on my labs instance half the puppet runs [19:48:56] yet half of them work.... [19:49:08] seems odd a labs server being its own puppetmaster would have issues like that [19:55:38] New patchset: Ori.livneh; "Enable E3Experiments in enwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16764 [19:57:40] heh [20:00:40] yep [20:04:23] well the port is admin up [20:04:56] it's up/up [20:06:11] cmjohnson1: actually thinking about it, can you run a second fiber between asw2-d3 and csw1-sdtpa ? [20:07:00] cmjohnson1: fyi, db64 isn't being responsive on its management console... [20:07:10] cool [20:07:12] and eep [20:07:13] at the same time [20:07:31] weird [20:09:22] New patchset: RobH; "claiming calcium for smokeping use" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16766 [20:09:54] streber is in some half puppetized half unpuppetized state [20:10:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16766 [20:10:02] im not comfortable installing my smokeping puppetization over it. [20:11:47] New review: RobH; "nothing to see here, no self review....." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/16766 [20:11:47] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16766 [20:13:24] hehehe [20:22:34] someone please cycle wikibugs! we miss the chatter ツ [20:23:50] AzaToth: find someone to do it? ;) [20:24:15] jeremyb: doh, let me look at mchenry now [20:24:22] heh [20:24:31] jeremyb: no one :( [20:24:41] AzaToth: leslie is ^ [20:24:46] ah ツ [20:26:26] jeremyb: people might start to think there are no new bugs in MW [20:26:39] AzaToth: are there any? [20:27:00] offcourse it's bugfree™ [20:27:07] is it up now ? 
:) [20:27:21] we'll have to wait and see [20:28:36] I just opened a test bug.. [20:29:06] changed https://bugzilla.wikimedia.org/show_bug.cgi?id=38627 to minor, but no notice yet [20:29:22] LeslieCarr: so, no [20:29:27] grrr [20:29:40] Who do we know who is subscribed to wikibugs-l.. [20:29:42] someone (unamed) said it might be email issue [20:30:23] Reedy: someone who complained when it stopped working? ;) [20:30:33] Reedy: I am... [20:30:40] Are you getting any mails for it? [20:30:45] Reedy: and I got your email [20:30:53] Right [20:30:58] [Bug 38728] New: Test wikibugs [20:31:07] sohrm, yeah, nothing for the last 10 hours [20:31:14] written to the log that it reads from [20:31:19] let me try something [20:31:33] did test123 come through ? [20:32:06] ok, so something with bugzilla emailing them then [20:32:07] yup [20:32:16] hrm, need food [20:32:29] I should actually make food, too tired [20:35:39] PROBLEM - Host db63 is DOWN: PING CRITICAL - Packet loss = 100% [20:38:07] ugh, i thought it just needed a restart [20:44:57] PROBLEM - Puppet freshness on ms-be10 is CRITICAL: Puppet has not run in the last 10 hours [20:45:42] RECOVERY - Host db63 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [20:48:42] PROBLEM - Host db63 is DOWN: PING CRITICAL - Packet loss = 100% [21:05:17] New patchset: RobH; "adding in a new partman for 500gb misc servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16774 [21:05:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16774 [21:06:08] Anyone know partman magic? [21:06:16] I changed that to raid the swap since we were doing it wrong all this time [21:06:31] (loss of a disk on current non raided swaps that span two disks can result in data loss if a disk dies) [21:06:53] hrmm, peter and daniel are the other partman folks, and they are at defcon =P [21:07:14] Ryan_Lane: You happen to understand partman recipes? [21:09:29] New review: RobH; "not quite sure my swap raid entry will work, one way to find out is to do an install" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/16774 [21:09:30] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16774 [21:09:44] ahh damn it i forgot to commit netboot [21:12:12] New patchset: RobH; "calcium moved to raid1 450gb partman recipe" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16776 [21:12:49] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16776 [21:14:21] RECOVERY - Host db63 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [21:14:38] New patchset: RobH; "calcium moved to raid1 450gb partman recipe" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16776 [21:15:15] awjr: hey, one question on that [21:15:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16776 [21:15:33] awjr: on that change -- (inlined) but confirming that you do not want ssl ? 
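On RobH's partman question at 21:06, which nobody picks up in-channel: putting swap on RAID1 in a preseed takes two pieces, raw "raid" partitions in the expert recipe plus a partman-auto-raid/recipe line pairing them up, swap included. An untested sketch modelled on the stock debian-installer preseed example, not the recipe RobH committed; sizes and partition numbers are guesses, and the usual partman/partman-md confirmation booleans are omitted:

    d-i partman-auto/method string raid
    d-i partman-auto/expert_recipe string \
        raid1-misc :: \
            30000 30000 30000 raid $primary{ } $bootable{ } method{ raid } . \
            4000 4000 4000 raid $primary{ } method{ raid } .

    # Format: <raidtype> <devcount> <sparecount> <fstype> <mountpoint> <devices>
    d-i partman-auto-raid/recipe string \
        1 2 0 ext3 / /dev/sda1#/dev/sdb1 . \
        1 2 0 swap - /dev/sda2#/dev/sdb2 .

With both members of each pair on different disks, losing one disk no longer takes the swap (and whatever was paged out to it) with it, which is the failure mode RobH describes above.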
[21:16:55] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16776 [21:17:48] PROBLEM - MySQL disk space on db63 is CRITICAL: Connection refused by host [21:18:10] LeslieCarr correct - at least as of now, there will be no sensitive data transacted [21:18:24] PROBLEM - SSH on db63 is CRITICAL: Connection refused [21:19:10] okay, it's not much of a hit, and people do like ssl [21:19:20] plus, https everywhere will break users on that site [21:19:29] just making sure you know [21:22:22] LeslieCarr: any news with the buggy bug [21:22:24] ? [21:24:37] AzaToth: i have no idea honestly [21:25:00] ;( [21:25:02] damn [21:25:03] something's wrong with bugzilla, the only thing i can think to do is stop and start it [21:25:15] i don't see any errors. [21:25:29] but bugzilla is sending out emails [21:25:32] oh [21:25:33] hrm [21:25:54] PROBLEM - SSH on calcium is CRITICAL: Connection refused [21:26:12] so wikibugs-l is working fine ? [21:26:48] I asked this earlier ;) [21:26:53] jupp [21:28:15] ah [21:28:16] first order of buisness is to find a scapegoat [21:28:16] heh [21:28:18] grrr [21:28:23] i need to find out what to hit [21:28:25] with a hammer [21:28:33] what or whom? [21:29:40] what the fuck [21:29:41] I just got emails dated 10/06/11 and 20/12/11 [21:29:41] From bugzilla [21:29:50] LeslieCarr: did you do something with the zilla? [21:30:06] They were to wikibugs-l [21:30:13] ahha [21:30:18] oh [21:30:26] yeah i posted some held messages that looked real [21:30:27] sorry ? [21:30:30] haha [21:30:33] hehehe [21:30:36] ahha, wikibugs-irc was bouncing [21:30:38] yay mailman [21:30:51] feel free to tell him that old emails are the first signs of skynet [21:31:30] LeslieCarr: tell him yourself ツ (didn't see he was in here) [21:31:35] he/she/it [21:31:49] I'm here [21:32:12] them [21:32:51] Krenair: yep, old emails are first sign of skynet [21:33:02] so, want to test again with doing something to a bug ? [21:33:26] I'll test [21:34:19] works [21:34:26] (NEW) testing123 - https://bugzilla.wikimedia.org/38732 blocker; Spam: Spam; (azatoth) [21:34:41] yay [21:34:42] hehe [21:34:48] ok, good to note [21:35:00] are you gonna send out last 10 hours worth of bugs to the irc? [21:35:04] i'll update [21:35:12] that's for skynet to do ! [21:35:15] hehe [21:35:21] they weren't queued, just not being delivered at all [21:35:45] can you close the bug for me (can only set it to resolved) [21:35:49] ok [21:36:44] !bug 38732 [21:36:44] https://bugzilla.wikimedia.org/38732 [21:37:09] PROBLEM - Host calcium is DOWN: PING CRITICAL - Packet loss = 100% [21:37:32] heh, turns out i don't have perms on bugzilla [21:37:36] awesome [21:37:38] hehe [21:37:45] * RoanKattouw closes [21:38:04] RoanKattouw: jeremyb did it [21:38:12] unless you are jeremyb [21:38:15] New patchset: Alchimista; "blogs update" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16778 [21:38:18] Oh, no heh [21:38:24] idk [21:38:29] I thought you meant it was already RESOLVED and you wanted it to be CLOSED [21:38:34] I managed to set it to VERIFIED [21:38:37] hehe [21:38:47] LeslieCarr: given you bugzilla admin (incase you need it someday ;)) [21:38:51] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16778 [21:38:54] thanks Reedy [21:39:18] jeremyb: you don't know if you are roan? [21:39:23] nope! 
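Closing the loop on the wikibugs outage for the log: the wikibugs-irc subscriber on wikibugs-l had been bouncing, so Bugzilla mail never reached the bot, and LeslieCarr released the held posts by hand. A rough sketch of how one might spot that state, assuming shell access to the Mailman host and stock Debian/Ubuntu install paths; the list and member names are as discussed above:

    # Members of wikibugs-l whose delivery has been disabled by bounces:
    /usr/lib/mailman/bin/list_members --nomail=bybounce wikibugs-l

Held messages themselves are easiest to release from the list's admindb (moderation) web page, which is what was done here.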
[21:39:25] * jeremyb runs away [21:39:28] creepy [21:42:10] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16764 [21:44:55] New review: Lcarr; "I would still like to see a changelist with SSL enabled in the future (this will break https everywh..." [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/16755 [21:44:56] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16755 [21:48:15] PROBLEM - NTP on db63 is CRITICAL: NTP CRITICAL: No response from NTP server [21:51:51] New review: Reedy; "I've added it to the HttpsEverywhere repo, and submitted it to my github fork." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16755 [21:52:17] LeslieCarr, that WLM thingie is for our app only, people with HTTP everywhere aren't supposed to visit it at all [21:52:26] ah okay [21:53:20] MaxSem: but we know full well some will try [21:53:45] they can use TS [21:54:03] if there will be a requirement for the app to work on HTTPS, we'll amend this manifest but now we just need to get things going [21:58:55] RobH: i don't have backlog to see if you guys answered this but do you know http://wikitech.wikimedia.org/view/File:Wikimania_2012_-_The_Wikipedia_Mobile_Experience_%E2%80%94_Where_We%27ve_Been_and_Where_We%27re_Going_-_Tomasz_and_Jon.pdf - errors within the thumbnail are happening? [21:59:07] RobH: i can do a shorter filename if that'll help [22:03:12] Arrrgh [22:03:16] paravoid: still about? [22:03:31] srv281 is now serving traffic [22:03:32] /dev/sda1 7.9G 7.5G 0 100% / [22:03:45] Jul 26 22:03:08 10.0.8.31 apache2[30228]: PHP Warning: include_once(/apache/common/wmf-config/CommonSettings.php) [function.inc [22:03:45] lude-once]: failed to open stream: Permission denied in /usr/local/apache/common-local/php-1.20wmf7/LocalSettings.php on line 11 [22:03:48] huh [22:03:57] tfinc: thumbnail generation is by swift now i think.... [22:04:04] ben isnt online =[ [22:04:12] Not on wikitech ;) [22:04:14] RobH: and apparently it doesn't work very well [22:04:19] nope, swift is new [22:04:26] its documented somewhere there though [22:04:30] just not sure if its change,d lemme see [22:04:43] Why would we need swift on wikitech? [22:05:16] swift docs =p [22:05:29] the docs are on how to add and remove nodes, nothing on troubleshooting thumbnails [22:05:35] but the problem is making thumbnails on wikitech [22:05:37] not on the cluster [22:05:40] tfinc: I take it Ben isnt sitting around the office there? [22:05:42] oh [22:05:45] :p [22:05:46] ....we care? [22:05:51] * Reedy grins [22:05:55] RobH: i'm in a conf room so i don't know [22:06:03] tfinc: nm, i misunderstood [22:06:08] i have far too many windows open for my own good [22:06:23] i thought ya meant thumbs on cluster, didnt realize it was on wikitech you meant [22:06:30] maplebed is also on vacation, just fyi [22:06:33] LeslieCarr: any chance you could depool srv281? It's serving traffic with a full / [22:06:36] RobH: its on wikitech ;) [22:06:40] grrrr [22:06:48] i hate wikitech [22:06:54] its versions behind and borked. [22:07:11] if we can't make wikitech work then i'll just have to use commons [22:07:51] 1.17wmf1 o.O [22:08:00] FAIL [22:08:43] RobH: so should i 1) upload a shorter file name 2) upload it to commones 3) other ? [22:09:56] if its shorter name it thumbs ok? [22:10:14] i would do that...... because wikitech is a mess that is going to require hours to properly fix [22:10:18] RobH: i don't know. 
[22:10:24] hrmm, lemme see
[22:10:35] labsconsole is also out of date: 1.20wmf2
[22:10:54] not quite as bad as wikitech though.
[22:11:07] awjr: with wikitech this borked, my desire to get it mobile-ready is dropping fast
[22:11:11] Reedy, I just found a reference to wmf4
[22:11:41] Krenair: When Ryan is less busy, he'll update it. It should be fine
[22:11:43] Platonides: where?
[22:12:06] yeah. it's a little out of date
[22:12:24] it's not actively causing issues, though
[22:12:30] Powered by MediaWiki at http://en.wikipedia.org/wiki/Iflavirus
[22:12:39] mmh... it may be a page cached in squid
[22:12:50] I get Powered by MediaWiki
[22:12:54] tfinc: i was going to throw a photo on wikitech to see if it resized for you
[22:13:05] but my internet is so slow here it's failing to push to gmail, wikitech, anything
[22:13:12] I see wmf4
[22:13:13] http://bits.wikimedia.org/static-1.20wmf4/skins/common/images/poweredby_mediawiki_88x31.png
[22:13:17] it may be faster for you to shorten and give it a shot than wait on me to try it
[22:13:27] RobH: i can easily do that
[22:13:29] purge ftw
[22:13:49] if that doesn't work lemme know cuz we will drop a ticket and get it on the roadmap, in fact i'm going to see if we have a wikitech update one
[22:14:10] there isn't...
[22:14:11] the whole page I was served has wmf4 references: behavior:url("/w/skins-1.20wmf4/
[22:14:39] PROBLEM - Host search32 is DOWN: PING CRITICAL - Packet loss = 100%
[22:14:48] parseroutput of 20120304024107 :P
[22:15:00] That should have expired by now..
[22:15:18] RobH: FAIL. i can't upload it till we delete the old copy. i'm getting a duplicate error
[22:15:26] I was served it
[22:15:34] just now
[22:15:40] As was I
[22:15:54] lemme delete it
[22:16:05] heh, you already did
[22:16:29] Couldn't it just be moved? :/
[22:17:39] RobH: FAIL - A file identical to this file (File:Wikimania 2012 - The Wikipedia Mobile Experience — Where We've Been and Where We're Going - Tomasz and Jon.pdf) has previously been deleted. You should check that file's deletion history before proceeding to re-upload it.
[22:18:06] RobH: i'm just going to give up on wikitech and use commons
[22:18:08] what the hell
[22:18:12] wikitech is too broken
[22:18:15] yea, sorry about that, i dropped a ticket for it
[22:18:23] tfinc: fixed
[22:18:23] http://wikitech.wikimedia.org/view/File:Wikimania_2012_-_The_Wikipedia_Mobile_Experience.pdf
[22:18:31] Reedy: awesome! what did you do ?
[22:18:37] moved it
[22:19:12] Reedy: moved it to a shorter file name ?
[22:19:15] yeah
[22:19:42] tfinc, why would you not want to use commons?
[22:21:21] Platonides: because many years ago we decided to host these in one place on wikitech and now we have six years of archives at http://wikitech.wikimedia.org/view/Presentations
[22:24:17] ok, checking out srv281
[22:30:57] New patchset: Alchimista; "blogs update" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16778
[22:31:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16778
[22:32:39] LeslieCarr: It needs repartitioning in line with all the other srv* hosts
[22:33:08] It sounds like new srv installs may still use the old partitioning; if so, that would explain how it got messed up when Faidon reinstalled it
[22:33:33] RoanKattouw: yep, i was just making sure it's not serving
[22:33:39] has anyone put in a ticket for srv281 ?
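
For readers unfamiliar with "depooling" srv281: at the time this meant flipping the host's entry in the PyBal server list that the LVS balancers read. A minimal sketch of what such an entry looks like; the file path, weight, and pmtpa.wmnet domain here are assumptions rather than the actual production values:

    # one Python-dict-style line per backend in the apache pool file
    # (e.g. somewhere like conf/pybal/pmtpa/apaches on the config host);
    # setting enabled to False takes the host out of rotation
    { 'host': 'srv281.pmtpa.wmnet', 'weight': 10, 'enabled': False }
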
[22:33:42] OK
[22:33:44] I don't think so
[22:33:57] well, that would be step #1 to getting it fixed ;)
[22:34:10] Well Nagios warns against it
[22:34:13] *about it
[22:34:30] You mean like srv190
[22:34:32] /dev/sda1 7.9G 3.6G 4.0G 48% /
[22:34:34] I would think step #1 would be ops paying attention to its own monitoring system
[22:34:36] (ie not)
[22:34:47] * RoanKattouw has a longer rant about how Nagios is used currently but will save that for some other time
[22:36:14] Just move the contents of /usr/local/apache onto /a, and remount that to /usr/local/apache
[22:36:23] /dev/sda7 63G 4.0G 59G 7% /usr/local/apache
[22:36:23] Peter has a script
[22:36:26] Lemme dig it up
[22:36:27] mmm
[22:36:42] is anyone working on switching the appservers to precise?
[22:36:45] I also thought he fixed the install recipes, seemingly not
[22:36:59] AFAIK there's a couple already upgraded for testing
[22:37:23] All Apaches have been migrated save a few
[22:37:29] And those few are causing problems now
[22:37:39] LeslieCarr: See bast1001:/home/py/apache-mover.sh
[22:38:25] hmm, where do they get php5-parsekit from?
[22:38:45] it doesn't seem to be present in the precise packages
[22:38:59] RoanKattouw: not enough time right now to talk about all the things that we could do better ;) just make a ticket please so it's not forgotten :(
[22:39:06] Filing
[22:39:18] If I file a ticket, will you acknowledge the alert in Nagios with a link to the ticket?
[22:42:09] https://rt.wikimedia.org/Ticket/Display.html?id=3336 files
[22:42:11] *filed
[22:45:49] Hmm
[22:45:54] Is disk space not monitored on that box?
[22:46:02] MaxSem: it's in our apt repo
[22:46:07] paravoid added/made it
[22:46:39] https://bugzilla.wikimedia.org/show_bug.cgi?id=37076
[22:47:09] heh, I'll ack in nagios
[22:47:17] Thanks
[22:47:19] RoanKattouw: is aft5hide in the right groups for requesting oversight ?
[22:47:22] Yes
[22:47:46] Also, why do I not see a disk space alert for srv281 in Nagios? Is there no check for that box or am I just not looking in the right place?
[22:48:10] I thought we had Nagios checks for root partition disk space on all machines
[22:48:49] i don't believe that we have that
[22:48:54] Hmm OK
[22:49:09] I was clicking around Nagios convinced that there had to be a disk space alert somewhere
[22:49:20] maybe on ms etc?
[22:49:24] ACKNOWLEDGEMENT - Apache HTTP on srv281 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception LeslieCarr RT 3336
[22:49:25] Yes, db* has them
[22:49:28] Yay thanks
[22:49:39] Lack of use of acks is one of my gripes
[22:49:45] there are some disk space alerts but i don't believe it's by default
[22:50:17] is fabrice in the SF office ?
[22:50:30] Yes
[22:50:32] 6th floor
[22:51:07] Right across from the main entrance
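
A rough outline of the /usr/local/apache remount described above, for context. This is only a sketch: the authoritative steps live in Peter's script (bast1001:/home/py/apache-mover.sh), and it assumes /a is the large /dev/sda7 partition, as on the already-converted srv hosts:

    service apache2 stop                          # don't serve while files move
    cp -a /usr/local/apache/. /a/                 # copy the apache tree onto the big partition
    umount /a
    mv /usr/local/apache /usr/local/apache.orig   # keep the original until verified
    mkdir /usr/local/apache
    mount /dev/sda7 /usr/local/apache
    # update /etc/fstab so /dev/sda7 mounts on /usr/local/apache at boot,
    # then remove /usr/local/apache.orig to reclaim space on /
    service apache2 start
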
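
On the question of root-partition disk space checks: where such a check does exist it is typically an NRPE check_disk call plus a Nagios service definition. The names below (check_root_disk, generic-service) are illustrative only, not the actual puppet-generated configuration in use:

    # on the host, in nrpe.cfg:
    command[check_root_disk]=/usr/lib/nagios/plugins/check_disk -w 10% -c 5% -p /

    # on the Nagios server, assuming a standard check_nrpe command definition:
    define service {
        use                  generic-service
        host_name            srv281
        service_description  Root disk space
        check_command        check_nrpe!check_root_disk
    }
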