[00:11:43] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [00:36:07] New patchset: Ori.livneh; "Puppetize Bugzilla" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62404 [00:38:59] New review: Ori.livneh; "If this passes muster, the next step will be to get it running on labs, I guess. (There's already a ..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62404 [00:47:58] did you know bugzilla depends on everything in CPAN? i think it's a kind of performance art or a scavenger hunt -- depend on as many perl modules as possible, and then add a few extra. [00:51:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:52:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [00:58:28] PROBLEM - Host es10 is DOWN: PING CRITICAL - Packet loss = 100% [00:59:28] RECOVERY - Host es10 is UP: PING OK - Packet loss = 0%, RTA = 26.56 ms [01:44:21] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [01:44:21] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [01:44:21] PROBLEM - Puppet freshness on virt1005 is CRITICAL: No successful Puppet run in the last 10 hours [01:44:21] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [02:18:19] !log LocalisationUpdate failed: git pull of extensions failed [02:18:27] Logged the message, Master [02:18:35] !log restarting gerrit [02:18:43] Logged the message, Master [04:07:42] New patchset: Andrew Bogott; "First pass at a labsconsole puppet setup" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53989 [04:07:43] New patchset: Andrew Bogott; "Switch the openstack manifest to use webserver::php5." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/51798 [04:07:43] New patchset: Andrew Bogott; "Allow overriding of apache site." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62416 [04:07:43] New patchset: Andrew Bogott; "A few tuneups of LocalSettings.php:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62417 [04:26:05] New review: Ori.livneh; "The 'setup-swm' and 'import_labsconsole_initial_pages' need some satisfaction criteria so you don't ..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53989 [05:29:01] New patchset: Faidon; "swiftrepl: add a -o (once) option" [operations/software] (master) - https://gerrit.wikimedia.org/r/62418 [05:29:01] New patchset: Faidon; "swiftrepl: add a much faster sync_deletes() method" [operations/software] (master) - https://gerrit.wikimedia.org/r/62419 [05:36:31] New review: Andrew Bogott; "WIP" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/53989 [06:14:45] !log pushing new swift rings: ms-be1 weight 50->0, ms-be2 33->75 [06:14:54] Logged the message, Master [06:18:46] PROBLEM - Puppet freshness on db45 is CRITICAL: No successful Puppet run in the last 10 hours [06:23:36] apergos: good morning :) [06:23:38] apergos: had fun? [06:23:46] morning [06:23:50] it was a nice if oo short break [06:24:09] a sort of forced cutoff from the computer [06:24:18] heh [06:24:26] I didn't work much on Friday either [06:24:28] I was in a house full of kids that were all begging to "can I pleeeeease have the laptop? please???" etc [06:24:31] I wanted to take a break from Fri-Tue [06:24:45] and since the parents said no to that we couldn't very well then camp in front of ours :-) [06:24:54] haha [06:25:18] how were thingshere? I was on a tiny bit friday just to check in on gsoc things [06:25:27] then it was run around and see people [06:25:43] heh [06:25:56] I really need a break but something happens all the time, so... 
:) [06:26:00] ugh [06:26:05] what was it this time? [06:26:22] nah, just a meeting that couldn't be postponed today [06:26:25] ah [06:26:28] plus the ceph dev summit tomorrow [06:26:36] so it didn't make much sense to work half days [06:26:53] yeah, take some days later in the summer [06:27:00] I hate summer [06:27:06] I hate the heat [06:27:20] I'm a weird kind of greek I guess :) [06:27:41] 09:14 < paravoid> !log pushing new swift rings: ms-be1 weight 50->0, ms-be2 33->75 [06:28:02] should have done this since fri/sat, but I forgot [06:28:08] I just finished force-running puppet [06:28:12] we have 3 broken disks, did you know? [06:28:18] I knew about two of them [06:28:22] ms-be10 sdc, ms-be5 sdb, ms-be11 sdh [06:28:33] I think 5 I didn't know about [06:28:37] are there tickets? [06:28:40] * paravoid checks [06:29:12] I never remember to open tickets, I always just ask on irc [06:29:15] apparently not [06:29:17] I should get into the habit [06:29:22] does steven know about them? [06:30:29] meh if I had a memory I could tell you (this was a little while ago) [06:30:35] this is why I should open tickets [06:30:57] I need to look at 'em and see [06:31:11] ok [06:31:51] it's three different zones, so this isn't very good [06:32:07] too late to rectify this now since I just pushed rings [06:32:23] but let's notify steven today to have those replaced asap [06:33:01] yeah let me check what was the deal on the two of those and I'll ask about the third as well [06:36:19] they're all broken [06:36:20] no need to ask :) [06:38:25] and since the parents said no to that we couldn't very well then camp in front of ours :-) [06:39:28] you kind of had to be there... 
there was nonstop begging for at laeast an hour [06:39:35] and then begging with breaks for another couple [06:40:36] New patchset: Ori.livneh; "Puppetize Bugzilla" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62404 [06:43:59] hey paravoid, i have a puppet question for you [06:44:21] shoot :) [06:44:31] i regularly find myself wanting to write a manifest for a service that, say, runs under a particular uid / gid [06:45:00] i want to create it if it doesn't exist, and not create it if it does [06:45:01] !log Translate search on vanadium keeps being broken https://toolserver.org/~nemobis/tmp/SearchTranslations.log [06:45:09] Logged the message, Master [06:45:13] slacker [06:45:25] Nemo_bis: who, morebots? [06:45:49] anyways, the pattern i reached for is if !defined(User['foo']), but that's not a very good pattern, I know realize [06:45:58] because you can't assume the other definition is also conditional [06:46:12] ori-l: yes [06:46:24] and so if your class happens to be instantiated first, you end up creating the resource and causing the other class to fail [06:46:55] * I now [06:47:17] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=large&g=cpu_report&h=ms-be10.pmtpa.wmnet&c=Swift+pmtpa [06:47:25] apergos: swift processes busy looping... [06:47:43] I was wondering why that box had load 45% load while the others did not [06:48:07] what is going on? [06:49:08] nothing, just a few runaway processes busy looping [06:49:12] swift-init all restart fixed it [06:49:16] ok [06:49:24] I'm not gonna bother debugging it, there's a much newer swift version out there [06:49:44] ah when do reads go for ceph anyways? 
[06:49:45] 1.8, compared to 1.7 that we're running [06:49:57] but we need to get rid of 1.5 first :) [06:50:07] uh huh (almost there) [06:50:07] (ms-be1 & ms-be4) [06:50:17] so, ms-be1 is at 0% now [06:51:37] when this rebalancing round finishes [06:51:58] my wild guess would be Thursday [06:52:20] we can replace ms-be1 & ms-be4 [06:52:29] hm, those three broken disks might make it a bit more risky [06:52:42] but hopefully we can replace those today and they'll sync up until thursday [06:53:12] so maybe friday we can ship those c2100s to dell [06:56:01] (or even thursday :) [07:01:13] wait how can we pull ms-be4 wthout pushing rings for it to go to 0 and waiting for that to complete? [07:01:47] what do you mean? [07:01:50] we'll lose one replica [07:01:52] that's fine [07:02:00] like having the box go down because it failed or whatever [07:02:16] if otherwise we're synced up and have 3 replicas for everything [07:02:29] we can be with two for a week while it syncs up [07:04:18] (also as long we have the I/O bandwidth to fill it up again, which I think we do) [07:06:57] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [07:12:07] PROBLEM - SSH on lvs1001 is CRITICAL: Server answer: [07:12:23] New patchset: Ori.livneh; "Puppetize Bugzilla" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62404 [07:16:33] New review: Faidon; "This isn't a python project, this is a puppet repository which has scattered python files for entire..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/61999 [07:20:07] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [07:21:36] it's amazing how many thumbs we keep [07:21:58] we have ~250 million files, 213 of which are thumbs [07:22:24] that's just crazy [07:22:26] it will only get worse unless we start whacking them [07:23:00] http://ganglia.wikimedia.org/latest/graph.php?r=year&z=xlarge&m=swift_object_count&h=Swift+pmtpa+prod&c=Swift+pmtpa&trend=1 [07:23:28] we'll get to 320 million in 6 months time [07:23:56] joy [07:24:24] I'm really inclined to write something that just randomly deletes 10% or so [07:25:01] but let's wait on ceph first I guess [07:26:04] weren't we just going to have varnish keep em (all lru and stuff) and be done with it? [07:26:07] PROBLEM - SSH on lvs1001 is CRITICAL: Server answer: [07:26:39] yeah we need to think about that a bit [07:27:09] we should probably talk about it at the hackathon [07:29:08] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [07:29:53] ceph seems to handle DELETEs much better than swift [07:30:03] I've deleted 500k files in the past 3 hours [07:30:13] someitme I should learn about the ceph architecture [07:30:19] so I would know why that is [07:30:25] I'll put up some documentation [07:30:44] preferrably before we switch everything to it :) [07:30:54] :-D [07:35:07] PROBLEM - SSH on lvs1001 is CRITICAL: Server answer: [07:37:05] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [08:04:11] hello [08:10:02] hashar: hello, https://review.openstack.org/#/c/28128/ , good night :) [08:10:17] ori-l: sleep fell :-] [08:10:29] ori-l: kudos :-] [08:10:55] ori-l: will deploy that this morning [08:23:29] PROBLEM - search indices - check lucene status page on search35 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 723 bytes in 0.055 second response time [08:25:18] !log Zuul 
upgrading to upstream version 7191ee8 + 1 wmf hack. [08:25:22] Logged the message, Master [08:39:13] PROBLEM - Puppet freshness on db44 is CRITICAL: No successful Puppet run in the last 10 hours [08:58:02] New review: Hashar; "cant we just copy the favicon.ico from bits?" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62125 [08:59:53] New review: Hashar; "I would prefer we stick all apache logs under the same directory for consistency. Splitting logs al..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61997 [09:05:12] New review: Hashar; "Ping Jeremyb, what is the reason for this change ? Can you please update the commit summary to expl..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54692 [09:48:33] New patchset: Faidon; "Update Ganglia to 3.5.7" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62423 [09:49:55] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62423 [09:51:07] !log upgrading Ganglia Web to 3.5.7 [09:51:14] Logged the message, Master [09:51:53] lol, sockpuppet has 17 kernels installed [09:56:10] PROBLEM - DPKG on sockpuppet is CRITICAL: DPKG CRITICAL dpkg reports broken packages [09:58:10] RECOVERY - DPKG on sockpuppet is OK: All packages OK [10:07:08] ah gdash is dead [10:07:14] http://gdash.wikimedia.org [10:07:26] that points to fenari … grbmbl [10:12:16] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [10:50:42] mark: do you have access to vanadium? [10:50:55] should an RT ticket be opened for the broken Translate Solr search? [10:52:32] if it's broken, probably yes [11:02:58] lunch [11:11:37] mark: can anyone open a ticket on RT emailing the address? 
I don't rememebr [11:16:09] Nemo_bis, yes [11:16:44] ok [11:24:36] mark: files [11:24:40] *filed [11:34:32] well, I know nothing about the translate solr search and there seems to be very little documention on it [11:34:35] so i'm not gonna fix it [11:45:02] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [11:45:02] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [11:45:02] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [11:45:02] PROBLEM - Puppet freshness on virt1005 is CRITICAL: No successful Puppet run in the last 10 hours [12:07:45] odder: did you want to count closed wikis? [12:17:03] jeremyb_: all wikis, but it was more about the scale, not an exact number [12:17:08] < 900 is OK with me [12:18:33] odder: nic.wikimedia.org, then count all.db :) [12:18:38] *noc.* [12:26:18] 15:58 odder: wc -l says 870 [12:26:18] 15:58 odder: 871 after pulling the current version ;) [12:36:34] New review: Demon; "As long as that's not still a problem then this is ok." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62336 [12:49:47] New patchset: Jeremyb; "reenable rfaulkner round 3" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62131 [12:56:22] New patchset: Jeremyb; "reenable rfaulkner round 3" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62131 [12:58:39] New review: Jeremyb; "lgtm." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/62131 [12:59:14] mark: it's your week so I added you to that one ^^^ [13:06:15] New review: Jeremyb; "I really don't know what more you want from me... (but I'm inclined to just let this sit here until ..." 
[operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/54692 [13:06:28] hashar: ^ [13:08:47] jeremyb_: basically, explain what is broken and what this fix :-] [13:09:04] hashar: you're welcome to :) [13:09:13] jeremyb_: I don't know what is broken :-] [13:09:21] hashar: did you read what i wrote? [13:09:25] yup [13:09:25] New review: Daniel Kinzler; "Looks good conceptually. No idea about the technical implications. Didn't test." [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/60978 [13:09:32] jeremyb_: and that should be in the commit summary [13:09:40] hashar: so feel free to put it there [13:09:56] jeremyb_: do it, I +1 and we get it merged :-] [13:10:23] I can't babysit all the changes, unit tests errors I am asked to review / investigate :D [13:10:36] and I don't even know ruby! [13:11:10] i'm not asking you to babysit. but AFAIK the only reason it's not already merged is you set a crazy high bar. (beyond what i've seen anywhere else) [13:11:32] * jeremyb_ isn't interested in figuring how to satisfy that bar right now [13:12:21] it is not crazy [13:12:35] like you do foo and in the commit summary you says foo [13:12:47] that does not explain why foo is needed or what it fix / implements / whatever [13:13:37] so the Array() cast is probably obvious since the .map() call is a few lines below [13:13:45] still does not explain why it works in 1.8 and not in 1.9 [13:14:11] are you saying i should dig up a ruby bug # or upstream commit id? [13:14:12] I would expect "my string".map{} to yeld one item "my string" [13:14:21] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62131 [13:15:05] jeremyb_: from a non ruby outsider "futureproof for ruby1.9.x (not in use yet)" isn't descriptive [13:15:24] and from my understanding, most of the ops people aren't ruby people [13:15:28] peachey|laptop__: *I* am a non-ruby person... 
[13:15:45] > "my strin­g".map{ |stri­ng| retur­n strin­g } [13:15:45] => # [13:15:46] at http://tryruby.org/levels/1/challenges/0 [13:15:48] peachey|laptop__: in any case, I welcome amendments... [13:15:54] while it works for me with 1.8 : [13:15:55] >> "my string".map{ |string| puts string } [13:15:56] my string [13:16:12] so just explain that strings in ruby 1.9 do not have map() [13:16:17] which force us to cast to an array [13:16:18] done. [13:18:35] New patchset: Jeremyb; "futureproof for ruby1.9.x (not in use yet)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54692 [13:18:52] rofl [13:18:59] I guess I have to do it [13:20:44] well i'm impressed that jenkins could rebase a commit based on ~ production~1115 [13:20:56] I guess nobody changed that file :-] [13:21:02] * jeremyb_ guessed that too [13:25:30] New patchset: Hashar; "erb: cast string to array for ruby 1.9" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54692 [13:25:37] jeremyb_: that is what I mean by a commit summary. [13:27:42] hashar: what's the deal with the extra spaces? [13:27:50] > >> Array( "my strin­g" ).map­{ |stri­ng| strin­g } [13:30:19] nice formatting ? :D [13:30:33] I took the habit to indent code snippets [13:30:36] hashar: i mean in the middle of words [13:30:37] ala markdown [13:30:42] ahh [13:30:47] strin [13:30:47] g [13:30:51] "my string" <- that one you mean? 
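The Array() cast discussed above can be sketched in a few lines of Ruby. This is only an illustration under stated assumptions, not the template code from change 54692: the helper name `normalize_values` is hypothetical. On Ruby 1.9 and later, String no longer responds to `map`, while `Kernel#Array` wraps a bare string in a one-element array and returns an existing array unchanged.

```ruby
# Hypothetical helper (not from the actual patch): accept either a single
# String or an Array of Strings and always iterate over an Array.
def normalize_values(value)
  # Kernel#Array leaves an Array untouched and wraps a bare String
  # in a one-element Array on Ruby 1.9 and later.
  Array(value).map { |item| item.to_s }
end

single = normalize_values("my string")
many   = normalize_values(["a", "b"])

puts single.inspect
puts many.inspect

# On Ruby 1.9+ a String is not Enumerable, so calling map on it directly
# raises NoMethodError; that is the failure the cast avoids.
puts "my string".respond_to?(:map)
```

Note that the string is kept whole as one element; it is not split on whitespace, which is the point hashar makes for readers coming from PHP.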
[13:31:09] it's broken right where i just broke it [13:31:10] cause PHP people would probably assume that "my string" will yield array( 'my', 'string' ); [13:32:30] anyway +1ed :-] [13:32:48] (and no I am not ops, I can't merge it hehe) [13:32:56] i know you can't [13:33:01] but i still wonder about the spacing [13:33:09] not between "my" and "string" [13:33:15] I don't understand what spacing you are talking about [13:33:19] between "my strin" and "g" [13:33:25] I don't see it [13:33:27] and between "stri" and "ng" [13:33:31] i pasted it above [13:33:40] i don't see it at https://gerrit.wikimedia.org/r/#/c/54692/ [13:34:15] blame gerrit? [13:34:15] nor at https://gerrit.wikimedia.org/r/#/c/54692/5//COMMIT_MSG,unified [13:34:20] nor in my vim [13:34:27] * hashar points at emac [13:35:16] $ git log -p --decorate --stat --pretty=fuller -n 1 refs/changes/92/54692/5 | sed -n 31l >> Array( "my strin\302\255g" ).map\302\255{ |stri\302\255ng| s\ [13:35:19] trin\302\255g }$ [13:41:33] Ohai. Hey, I need to be able to test some behaviour with OS with my WMF hat. Do we have a test wiki where I can nab that flag that behaves exactly like other wmf wikis? [13:41:55] I.e.: is testwiki configured basically like *wp? [13:41:58] OpenStack ? [13:42:00] or what's OS? [13:42:07] Oversight. Sorry. :-) [13:42:13] ahh [13:42:43] so, you can obviously compare in LocalSettings.php, etc. but they're not exactly the same [13:43:00] I'm only looking for "close enough, same core" :-) [13:43:01] testwiki runs the latest code by NFS from fenari [13:43:17] and runs on only one dedicated host that does nothing else [13:43:18] jeremyb_: Yeah, but that's /more/ recent than deployed as a rule. [13:43:26] Isn't it? [13:43:39] i guess? i'm just telling you how it works :) [13:43:48] test2wiki is a normal wiki that runs on all the normal hosts [13:44:00] Heh. Should suffice. Can you give me bits on one of them? 
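The `\302\255` bytes in the sed output above are the UTF-8 encoding of U+00AD (soft hyphen), an invisible character that most editors and web views do not render. A minimal Ruby sketch for finding and removing it from a commit message follows; the method names and the sample text are assumptions made for illustration, not part of any Wikimedia tooling.

```ruby
# U+00AD SOFT HYPHEN, encoded in UTF-8 as the bytes 0xC2 0xAD
# (octal \302\255, as printed by `sed -n l`).
SOFT_HYPHEN = "\u00AD"

# Return the 1-based numbers of the lines that contain a soft hyphen.
def soft_hyphen_lines(text)
  text.each_line.with_index(1)
      .select { |line, _number| line.include?(SOFT_HYPHEN) }
      .map { |_line, number| number }
end

# Return a copy of the text with every soft hyphen removed.
def strip_soft_hyphens(text)
  text.delete(SOFT_HYPHEN)
end

# Sample modelled on the snippet pasted in the channel.
message = "erb: cast string to array\n" \
          ">> Array( \"my strin\u00ADg\" ).map\u00AD{ |stri\u00ADng| strin\u00ADg }\n"

puts soft_hyphen_lines(message).inspect
puts strip_soft_hyphens(message)
```

Checking for the character explicitly is more reliable than eyeballing the text, since the affected words look identical to clean ones in vim and in Gerrit.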
[13:44:29] so that means test2wiki has to be one of the widely deployed versions and testwiki can be a custom version that exists nowhere else [13:44:41] you might have to talk to a steward? or use a staff group'd account [13:44:56] to change rights @ a test wiki [13:45:06] When Oliver asked me "Do you need +staff" I said "No, I won't". :-) [13:45:13] heh [13:45:21] Which, of course, garanteed that I would. :-) [13:45:31] you can just pop into #wikimedia-stewards [13:45:35] Will do. [13:49:44] New patchset: Jeremyb; "erb: cast string to array for ruby 1.9" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54692 [13:49:50] hashar: ^ [13:51:45] New patchset: Hashar; "contint: libs to build udp-filters" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62434 [13:51:57] $ diff -U0 <(git log -p -n 1 refs/changes/92/54692/5) <(git log -p -n 1 refs/changes/92/54692/6) | python -c $'import sys\nfor l in sys.stdin.readlines(): print repr(l);' | tail -n 2 [13:52:04] '- >> Array( "my strin\xc2\xadg" ).map\xc2\xad{ |stri\xc2\xadng| strin\xc2\xadg }\n' [13:52:07] '+ >> Array( "my string" ).map { |string| string }\n' [13:53:27] New review: Hashar; "$ apt-cache policy libcidr0-dev libanon0-dev" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62434 [13:54:35] !log gallium: installed manually libcidr0-dev libanon0-dev . Puppet change is {{gerrit|62434}} [13:54:43] Logged the message, Master [14:09:37] jeremyb_: Hm. How do I find which shard test2 wiki is on? [14:10:07] * Coren tries to find a list. [14:10:55] <^demon> Coren: You mean which db cluster? (missed context) [14:11:01] * Coren nods. [14:11:11] <^demon> That'd be db-*.php in wmf-config. [14:11:16] <^demon> eg: https://noc.wikimedia.org/conf/highlight.php?file=db-eqiad.php [14:11:31] And /there's/ the list. Thanks ^demon [14:11:40] <^demon> yw [14:12:07] I'm reviewing feedback that WLM participants left in a survey we did last year... 
[14:12:40] is it possible that they felt the uploads slow down when there was more traffic coming from the US? [14:13:04] I see a few people complaining about this... [14:15:21] We had numerous upload issues last year [14:15:55] And uploads were generally slow until mark fixed them [14:15:58] I know, but this is a very specific WLM issue, so September [14:16:41] select * from filearchive where fa_name='519steeldoor.jpg' \G [14:16:51] /me grumbles. [14:19:02] Brion! What is it with you and undocumented bitmaps? [14:36:08] Hi; I'm getting "ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock'" on ee-prototype (labs instance) [14:36:44] I've tried rebooting the instance twice, but that did not seem to change much :) [15:00:13] is Ryan Lane around? [15:09:14] matanya: no rlane, so probably not (yet) [15:09:29] thanks andre__ :) [15:10:18] BTW andre__ regarding bugs related to IT, where do they go in bugzilla? (or do they go to RT?) [15:10:50] matanya, isn't it all IT somehow? :P [15:10:53] depends on which area [15:11:03] if it's operations and low-level, RT is the best place, yeah [15:13:54] so is there an RT ticket for https://bugzilla.wikimedia.org/show_bug.cgi?id=22622, is it still 452? [15:14:21] matanya, https://bugzilla.wikimedia.org/show_bug.cgi?id=22622#c41 [15:15:08] oh, missed that. so can you please clarify the status of this? [15:18:42] matanya, well, the status is publicly visible in the bug report. not sure what you mean. [15:18:50] it's visible to everybody. [15:19:16] i mean is there any other RT ticket on this? any progress? [15:19:37] matanya, no, there is no other RT ticket about this. [15:19:45] thank you [15:20:06] matanya, progress should be exposed in the public bug report. theoretically. 
:-/ [15:20:49] matanya, I think OTRS admins could have some more information [15:20:54] I've heard that they've been testing the new system [15:21:05] theory is very nice, but from my experiance only bugging people pushes stuff forward [15:24:34] New patchset: Nikerabbit; "ULS configuration for beta labs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62446 [15:55:15] New review: Andrew Bogott; "> If we go the single .pep8 route we'll end up summarizing " [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/61999 [16:02:54] off be back at 9pm (UTC+2) [16:08:44] New review: Krinkle; "Why? We went through the initial setup to avoid duplication everywhere else, why not on integration ..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62125 [16:11:22] !log reedy synchronized docroot [16:11:30] Logged the message, Master [16:19:26] PROBLEM - Puppet freshness on db45 is CRITICAL: No successful Puppet run in the last 10 hours [16:21:13] New patchset: Reedy; "Fix path to wikimania2014wiki docroot" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/62451 [16:23:23] Change abandoned: Reedy; "(no reason)" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/62451 [16:28:27] New patchset: Nemo bis; "ULS configuration for beta labs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62446 [16:31:35] New review: Nemo bis; "Checked docs for - syntax; looks ok." [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/62446 [16:38:52] @notify binasher [16:38:53] I will notify you, when I see binasher around here [16:41:48] oi, the job queue is so lagged [16:54:00] hey Coren [16:54:17] Hola, ori. [16:54:29] got a minute to help me plug a security hole? https://gerrit.wikimedia.org/r/#/c/62214/ [16:55:10] logsmsgbot accepts connections from anywhere. it's fixed, but it wasn't restarted with the updated config because -- see that patch. 
[16:56:36] * Coren checks. [16:57:41] ori-l: Don't we have a convention to notify the service from the file rather than subscribe to the file from the service? [16:58:02] Or are we just not consistent with that? [16:58:15] we do? it's fine if we do, i can update the patch [16:58:29] it's entirely possible that we do and i just didn't know [16:59:01] conceptually it seems more intuitive to think of the relationship as the service reacting to changes in the file rather than the file updating the service, but i don't want to put too fine a point on it [16:59:52] Coren: FWIW, I agree with ori-l. [17:00:20] Works for me either way. [17:00:40] well, let's do it this way then and save me the trouble of updating it :) [17:00:48] New review: coren; "Ze change, she is goode." [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/62214 [17:00:49] Change merged: coren; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62214 [17:01:28] ori-l: Merged. [17:01:46] thanks -- the service on neon will need to be restarted too, i think - -see my note on that patch [17:02:04] 'service tcpircbot-logmsgbot restart' on neon should do it, after puppet runs [17:02:41] Someone's already on neon. Doing it already? [17:02:59] 65 days idle. Probably not. [17:03:01] :-) [17:04:39] Coren: actually, you don't need to wait for puppet to run [17:04:44] since the config file was updated by a previous patch [17:05:21] but the service still needs to be restarted, I think [17:05:26] I was going to force a puppet run, but it looks like there's already one running that is quite wedged. [17:05:41] Ah, no, it finally finished. [17:06:00] ori-l: Restarted. [17:06:33] Coren: confirmed, can't connect to it now except from the designated hosts [17:06:42] thank you! 
[17:07:01] np [17:07:11] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [17:11:48] New patchset: Aude; "Update Wikidata settings to use new variables for client and repo" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62454 [17:15:27] !log turned off marmontel's network port [17:15:35] Logged the message, Mistress of the network gear. [17:16:42] Change abandoned: Aude; "(no reason)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62454 [17:17:11] PROBLEM - Host marmontel is DOWN: PING CRITICAL - Packet loss = 100% [17:17:42] !log marmontel's interface is now in the sandbox vlan [17:17:49] Logged the message, Mistress of the network gear. [17:22:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:23:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.131 second response time [17:24:08] sbernardin: hi [17:26:39] New patchset: Yurik; "Deployment rights for Yuri Astrakhan per RT 5069" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62457 [17:28:00] hi, could someone review https://gerrit.wikimedia.org/r/#/c/62457/ [17:28:17] this is per https://rt.wikimedia.org/Ticket/Display.html?id=5069 [17:32:12] Hey parvoid: [17:32:39] paravoid: what's up? [17:33:23] hello Wikimedia Operations [17:33:27] small request [17:33:49] the Debian repository of Wikimedia is missing libdclass-dev http://apt.wikimedia.org/wikimedia/pool/ [17:33:58] the deb is located here http://garage-coding.com/releases/libdclass-dev/libdclass-dev_2.0.12_amd64.deb [17:34:12] please maybe import it in the Debian repo ? [17:34:25] we in the Analytics team need this package [17:34:32] average: RT ticket [17:34:55] Why do we need that in our apt repo? [17:35:18] Reedy: because our software depends on it [17:35:36] sbernardin: did you see the ticket about the three swift bad drives? 
[17:35:56] ori-l: cannot log in to rt.wikimedia.org [17:36:38] paravoid: yes I did.... [17:37:02] sbernardin: do you have spare disks or are we waiting on Dell? [17:37:03] Reedy , ori-l how can I get an account on rt.wikimedia.org ? [17:37:41] paravoid: will have to get with Dell for replacement drives [17:37:43] Ask Ops [17:37:47] But there's a way to just file tickets via email [17:38:29] average: just mail ops-requests@rt.wikimedia.org . [17:38:37] New patchset: Aaron Schulz; "Remove old secure.wikimedia.org code." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62458 [17:40:42] New review: Reedy; "Unrelated changes to docroot/bits/WikipediaMobileFirefoxOS" [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/62458 [17:40:43] New review: Faidon; "I think the *.prototype.wikimedia.org stanza can be removed too. en.prototype.wikimedia.org doesn't ..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62458 [17:41:30] New review: Aaron Schulz; "Odd, that didn't show up in the diff before I committed" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62458 [17:42:13] paravoid: what is your e-mail please, I would to Cc you [17:42:19] average: why? [17:43:08] paravoid: because you pointed me to the mailing list [17:43:13] uhm, ok I will send it like this [17:43:20] Reedy: I'm not even sure how to fix that [17:43:32] Can't you amend and unstage it? 
[17:43:40] paravoid: sent [17:43:49] paravoid: to the e-mail you mentioned [17:49:05] sbernardin and paravoid: for ms-be10...the bad disk is sdb which is one of the ssds...we do have spares of those...of course on swift these are sda/sdb is md0..so may be hot swappable [17:49:57] cmjohnson1: this one is an H310 I think, so SSDs are sdm/sdn [17:50:35] paravoid: you are right..that was swapped out [17:52:56] Reedy: still messing around [17:53:06] fuck it, it would faster to just copy the file again to a new commit [17:53:09] lol [17:53:29] it's really annoying when random submodule changes get tossed into commits [17:53:53] New patchset: Dr0ptp4kt; "Constrain default language list for Dialog Sri Lanka to en, ta, si." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62460 [17:54:33] AaronSchulz: do you know what the wikimedia-commons-local-thumb container si? [17:54:35] paravoid and cmjohnson1: so you want me to replace the ssd on ms-be10? [17:54:36] *is [17:54:41] old test container? [17:54:50] sbernardin: no, it's not an SSD, it's a normal disk that is broken [17:54:55] New patchset: Aaron Schulz; "Remove old secure.wikimedia.org code." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62461 [17:55:06] ops, can i get a review and pull to production on ^ ? for future reference, are there specific reviewers i should add for the Wikipedia Zero Varnish updates? 
[17:55:13] Change abandoned: Aaron Schulz; "(no reason)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62458 [17:55:33] there we go [18:01:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:02:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [18:07:21] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: enwiki to 1.22wmf3 [18:07:28] Logged the message, Master [18:15:19] New patchset: Reedy; "enwiki to 1.22wmf3" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62463 [18:16:09] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62463 [18:16:10] AaronSchulz: did you see my comment on the secure changeset? [18:16:20] AaronSchulz: it was immediately before Reedy's, so easy to miss :) [18:21:19] mutante: is db9 decommissioned as well...i know db10 is but i was told that db9 and 10 work together [18:21:43] cmjohnson1: no! it's not. dont shut down !:) [18:21:56] not going to...wanted to ask [18:22:21] kk, yea [18:22:46] it's going to replicate to another db server in eqiad [18:22:48] RECOVERY - mysqld processes on db1042 is OK: PROCS OK: 1 process with command name mysqld [18:23:54] New review: Hashar; "Cause that had yet another layer of complexity (favicon.php) and a dependency upon bits when we coul..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62125 [18:24:25] New review: Aude; "works fine as far as I can test, with my CommonSettings.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/60978 [18:26:06] mutante: cool..thx for update [18:26:31] sbernardin: did you send the 2 rdb servers yet? 
please create a ticket when you do with shipping information [18:28:04] mark, hi, could you take a look at https://gerrit.wikimedia.org/r/#/c/62457/ [18:28:06] cmjohnson1, for mailman tickets there's a master ticket 3173 that we usually link tickets to by adding that number as "referred to by": on the individual tickets [18:28:58] RECOVERY - Host db1020 is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms [18:29:00] thehelpfulone okay, i did not know but will make certain to do that in the future [18:29:09] sure :) [18:29:20] appreciate the help! [18:32:54] no worries, actually one ticket that's been open for a while is https://rt.wikimedia.org/Ticket/Display.html?id=2905 - mutante knows more but with wheezy released I think Mailman 2.1.15 comes included, so would it be feasible to upgrade sodium? [18:32:55] !log authdns-update for new server homium [18:33:03] !log holmium [18:33:03] Logged the message, RobH [18:33:11] Logged the message, RobH [18:35:45] PROBLEM - Auth DNS on ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [18:36:30] !log authdns-update crashed on ns2, rerunning [18:36:43] Logged the message, RobH [18:37:16] !log ns2 appears offline [18:37:25] Logged the message, RobH [18:37:25] PROBLEM - mysqld processes on db1016 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [18:39:55] PROBLEM - Puppet freshness on db44 is CRITICAL: No successful Puppet run in the last 10 hours [18:41:37] !log ns2/nescio appears offline, investigating [18:41:43] New review: Ryan Lane; "(1 comment)" [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/62020 [18:41:45] Logged the message, RobH [18:43:15] PROBLEM - NTP on db1020 is CRITICAL: NTP CRITICAL: Offset unknown [18:43:16] New review: Krinkle; ""If bits is down", if bits is down we have bigger things to worry about than the favicon of integrat..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/62125 [18:47:18] New patchset: Reedy; "Initial config for login.wikimedia.org (loginwiki)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62020 [18:48:12] New patchset: Reedy; "Initial config for login.wikimedia.org (loginwiki)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62020 [18:48:15] RECOVERY - NTP on db1020 is OK: NTP OK: Offset -0.0008463859558 secs [18:51:05] New patchset: Reedy; "Add initial apache config for login.wikimedia.org" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/62021 [18:52:32] Reedy: favicon? [18:52:45] What about it? [18:53:03] Again, it's copy paste from above [18:56:25] New patchset: Reedy; "Initial config for login.wikimedia.org (loginwiki)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62020 [18:56:41] New patchset: Krinkle; "multiversion: Remove code for closed prototype.wikimedia.org" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62469 [18:57:32] New patchset: RobH; "adding holmium as blog server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62470 [18:58:35] New review: Krinkle; "You could've just submitted a different patch set. Anyhow, for the record, new version is at Ia29aba..." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62458 [18:58:52] New patchset: Reedy; "Initial config for login.wikimedia.org (loginwiki)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62020 [18:59:00] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62470 [19:02:25] RECOVERY - mysqld processes on db1016 is OK: PROCS OK: 1 process with command name mysqld [19:03:05] New patchset: Reedy; "Add wgUseSiteJs and wgUseSiteCss" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62472 [19:05:30] New patchset: Reedy; "Add wgUseSiteJs and wgUseSiteCss" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62472 [19:06:25] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62472 [19:06:52] New patchset: Reedy; "Initial config for login.wikimedia.org (loginwiki)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62020 [19:07:59] New patchset: Reedy; "Initial config for login.wikimedia.org (loginwiki)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62020 [19:13:15] !log rebuilt db1016 from hotbackup of db1001; pmtpa side of m1 needs to be rebuilt as well [19:13:22] Logged the message, Master [19:14:54] Change abandoned: Dzahn; "obsoleted and other repo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62108 [19:15:27] !log reedy synchronized wmf-config/ [19:15:34] Logged the message, Master [19:17:37] !log rebooting ns2 [19:17:44] Logged the message, Mistress of the network gear. [19:17:46] for some reason it's totally not arp'ing [19:18:05] urgent - is RU on the same code version as EN ? 
Zero is not showing any banners for EN, but i see them for RU [19:18:06] !log Reloading Zuul to deploy I259585e78ee [19:18:13] Logged the message, Master [19:18:55] yurik: No, it isn't [19:19:11] yurik: enwiki is on 1.22wmf3, ruwiki is on 1.22wmf2 [19:19:11] Reedy, english zero is broken [19:19:20] Reedy, what about spanish? [19:19:23] yurik: MediaWiki has this really useful page https://en.wikipedia.org/wiki/Special:Version [19:19:25] it also works [19:19:34] Reedy, hehe, thx :) [19:19:45] https://wikitech.wikimedia.org/wiki/Software_deployments [19:19:54] https://www.mediawiki.org/wiki/MediaWiki_1.22/Roadmap [19:21:40] Parsoid deployment ahead, please ignore related alerts in the next minutes [19:23:02] doing some slightly risky esams networking [19:23:17] RECOVERY - Host amssq47 is UP: PING OK - Packet loss = 0%, RTA = 86.95 ms [19:23:17] RECOVERY - Host amssq56 is UP: PING OK - Packet loss = 0%, RTA = 85.89 ms [19:23:17] RECOVERY - Host amssq52 is UP: PING OK - Packet loss = 0%, RTA = 87.07 ms [19:23:17] RECOVERY - Host amssq61 is UP: PING OK - Packet loss = 0%, RTA = 88.79 ms [19:23:17] RECOVERY - Host amssq50 is UP: PING OK - Packet loss = 0%, RTA = 86.07 ms [19:23:18] RECOVERY - Host amssq59 is UP: PING OK - Packet loss = 0%, RTA = 88.54 ms [19:23:27] RECOVERY - Host cp3003 is UP: PING OK - Packet loss = 0%, RTA = 87.30 ms [19:23:37] RECOVERY - Host amssq51 is UP: PING OK - Packet loss = 0%, RTA = 86.00 ms [19:23:37] RECOVERY - Host maerlant is UP: PING OK - Packet loss = 0%, RTA = 87.13 ms [19:24:45] !log moved gateway from csw1-esams to cr1-esams [19:24:53] Logged the message, Mistress of the network gear. 
[19:24:57] PROBLEM - Host knsq18 is DOWN: PING CRITICAL - Packet loss = 100% [19:24:57] PROBLEM - Host knsq22 is DOWN: PING CRITICAL - Packet loss = 100% [19:25:07] PROBLEM - Host knsq17 is DOWN: PING CRITICAL - Packet loss = 100% [19:25:27] PROBLEM - LVS HTTPS IPv6 on mediawiki-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection timed out [19:25:27] PROBLEM - LVS HTTPS IPv6 on upload-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection timed out [19:25:30] PROBLEM - LVS HTTPS IPv6 on wikipedia-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection timed out [19:25:34] PROBLEM - LVS HTTP IPv6 on foundation-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection timed out [19:25:34] PROBLEM - LVS HTTP IPv6 on mediawiki-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection timed out [19:25:34] PROBLEM - LVS HTTPS IPv6 on wikibooks-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection timed out [19:25:34] PROBLEM - LVS HTTP IPv6 on wikinews-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection timed out [19:25:34] PROBLEM - LVS HTTPS IPv6 on wikisource-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection timed out [19:25:35] PROBLEM - LVS HTTP IPv6 on wikiversity-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection timed out [19:25:35] PROBLEM - LVS HTTP IPv6 on wiktionary-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection timed out [19:25:36] PROBLEM - LVS HTTP IPv6 on wikisource-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection timed out [19:25:36] PROBLEM - LVS HTTP IPv6 on wikipedia-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection timed out [19:25:37] PROBLEM - LVS HTTPS IPv6 on foundation-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection timed out [19:25:37] PROBLEM - LVS HTTPS IPv6 on wikiquote-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection timed out [19:25:38] PROBLEM - LVS HTTP IPv6 on upload-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection timed out [19:25:42] PROBLEM - LVS HTTPS IPv6 on wikiversity-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection timed 
out [19:25:42] PROBLEM - LVS HTTP IPv6 on wikibooks-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection timed out [19:25:42] PROBLEM - LVS HTTPS IPv6 on bits-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection timed out [19:25:45] rolled back [19:26:01] RECOVERY - Host knsq17 is UP: PING OK - Packet loss = 0%, RTA = 88.48 ms [19:26:01] RECOVERY - Host knsq18 is UP: PING OK - Packet loss = 0%, RTA = 86.84 ms [19:26:01] RECOVERY - Host knsq22 is UP: PING OK - Packet loss = 0%, RTA = 89.15 ms [19:26:11] PROBLEM - Host amssq61 is DOWN: PING CRITICAL - Packet loss = 100% [19:26:21] RECOVERY - LVS HTTPS IPv6 on mediawiki-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 69430 bytes in 0.705 second response time [19:26:21] RECOVERY - LVS HTTPS IPv6 on upload-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 561 bytes in 0.447 second response time [19:26:24] RECOVERY - LVS HTTP IPv6 on foundation-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 69430 bytes in 0.436 second response time [19:26:24] RECOVERY - LVS HTTPS IPv6 on wikipedia-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 69430 bytes in 0.703 second response time [19:26:27] RECOVERY - LVS HTTP IPv6 on upload-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 559 bytes in 0.176 second response time [19:26:30] RECOVERY - LVS HTTP IPv6 on wikisource-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 69430 bytes in 0.347 second response time [19:26:30] RECOVERY - LVS HTTP IPv6 on wikiversity-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 69427 bytes in 0.350 second response time [19:26:30] RECOVERY - LVS HTTP IPv6 on wikipedia-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 69430 bytes in 0.350 second response time [19:26:32] RECOVERY - LVS HTTP IPv6 on mediawiki-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 69430 bytes in 0.384 second response time [19:26:32] RECOVERY - LVS HTTP IPv6 on 
wiktionary-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 69427 bytes in 0.440 second response time [19:26:32] RECOVERY - LVS HTTP IPv6 on wikinews-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 69430 bytes in 0.440 second response time [19:26:33] RECOVERY - LVS HTTPS IPv6 on wikibooks-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 69430 bytes in 0.707 second response time [19:26:33] RECOVERY - LVS HTTPS IPv6 on foundation-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 69430 bytes in 0.710 second response time [19:26:33] RECOVERY - LVS HTTPS IPv6 on wikisource-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 69430 bytes in 0.723 second response time [19:26:33] RECOVERY - LVS HTTPS IPv6 on wikiquote-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 69427 bytes in 0.719 second response time [19:26:34] PROBLEM - Host maerlant is DOWN: PING CRITICAL - Packet loss = 100% [19:26:42] RECOVERY - LVS HTTP IPv6 on wikibooks-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 69430 bytes in 0.446 second response time [19:26:42] RECOVERY - LVS HTTPS IPv6 on bits-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 3917 bytes in 0.455 second response time [19:26:51] RECOVERY - LVS HTTPS IPv6 on wikiversity-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 69430 bytes in 0.717 second response time [19:26:52] PROBLEM - Host amssq47 is DOWN: PING CRITICAL - Packet loss = 100% [19:27:02] PROBLEM - Host amssq52 is DOWN: PING CRITICAL - Packet loss = 100% [19:27:12] PROBLEM - Host amssq50 is DOWN: PING CRITICAL - Packet loss = 100% [19:27:57] New review: Hashar; "There is still the cost of maintenance :-]" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62125 [19:28:45] the parsoid update is done [19:29:02] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100% [19:29:22] !log reedy synchronized php-1.22wmf3/extensions/ZeroRatedMobileAccess/ 
[19:29:29] Logged the message, Master [19:34:59] New patchset: Catrope; "New cluster SSH key for me" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62475 [19:36:07] PROBLEM - Host amssq59 is DOWN: PING CRITICAL - Packet loss = 100% [19:38:38] LeslieCarr: did you see those amssq* down? [19:38:46] syeah [19:38:55] they're down [19:39:10] they were up before [19:39:12] basically 1/2 of the ams machines are happy, 1/4 are down if csw1 is master, 1/4 are down if cr1 is master [19:39:16] ah [19:39:18] nope, they were down all weekend [19:39:28] they came up for a minute when i switched mastery of the gw ip [19:39:31] but then back down [19:39:31] okay, network issue that you're aware of [19:39:34] that's enough for me :-) [19:39:37] PROBLEM - Host amssq56 is DOWN: PING CRITICAL - Packet loss = 100% [19:39:37] yeah [19:39:53] esp. after basically 18h of work :) [19:44:24] ah, time for you to stop working :) [19:46:51] LeslieCarr, how flexible is varnish cache purges at this point? We need to flush 90 minutes of problems :) [19:47:24] it may be possible, i don't know it off the top of my head [19:47:46] LeslieCarr, basically an incorrect version of Zero went live two hours ago [19:48:04] and ALL english Zero clients were not receiving Zero banners [19:48:21] i saw this page about purging -- [19:48:23] http://kly.no/posts/2010_02_02__Varnish_purges__.html [19:49:04] could we purge all "zero.en.* requests with X-CS header set" [19:49:21] sorry, " en.zero.* [19:49:42] who would be the varnsh god today? [19:49:45] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/62119 [19:51:19] mark? [19:54:18] mutante, are you available to talk about RT a bit today? [19:56:06] hello? 
[19:57:15] Probably late for him [19:57:32] Reedy: mutante is here in SF [19:58:24] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62469 [19:58:33] !log DNS update - switch all blog cnames over to holmium [19:58:41] Logged the message, Master [19:58:59] andrewbogott: in a little while.. there is quite a bit going on right now [19:59:06] ok [19:59:06] !log aaron synchronized multiversion/MWMultiVersion.php 'f257ddac39194f8bfcfd145104999fdba83d33f5' [19:59:10] could anybody kill 28120 on wtp1001? [19:59:13] Logged the message, Master [19:59:47] i can't finish that update .. ns2 being worked on [20:00:30] !log killing parsoid server.js on wtp1001 [20:00:33] gwicke: done [20:00:38] Logged the message, Master [20:00:38] mutante: thanks! [20:01:27] PROBLEM - Host hooft is DOWN: PING CRITICAL - Packet loss = 100% [20:01:41] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62475 [20:02:01] mutante: could you also do a kill -9 17126 && /etc/init.d/parsoid restart on wtp1001? [20:02:41] !log restarting pdns on ns0 [20:02:44] am trying to make sure the parsoid code only runs current code, and that process was started April 25th [20:02:49] Logged the message, Master [20:02:57] RECOVERY - Auth DNS on ns0.wikimedia.org is OK: DNS OK: 0.032 seconds response time. www.wikipedia.org returns 208.80.154.225 [20:04:10] * Restarting parsoid [ OK ] [20:04:11] RoanKattouw: Yurk was seemingly asking for a response from m ark [20:04:17] !log restarting parsoid on wtp1001 [20:04:20] Oh OK [20:04:25] Logged the message, Master [20:04:41] mutante: that seems to have worked, thanks! [20:04:57] RECOVERY - Parsoid on wtp1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.003 second response time [20:05:00] yw [20:05:17] PROBLEM - Host amssq51 is DOWN: PING CRITICAL - Packet loss = 100% [20:06:58] Reedy, I asked mark because he is supposedly on RT duty (topic). Maybe its stale. 
[20:07:12] Doesn't mean they are at their keyboard [20:07:22] Especially out of their working day ;) [20:07:43] heh, Reedy do you know who would be the varnish king today? [20:08:11] i guess i should just throw in the towel and not bother with the 2 hours of missing banners that might get our partners annoyed [20:09:03] unless there was a cache flush right before Reedy pushed out wmf3, in which case it will stay in cache for a long time for a large number of common articles [20:09:04] PROBLEM - Memcached on holmium is CRITICAL: Connection refused [20:09:36] "You must be new around here" [20:10:09] yurik: Usually it's a case of asking your question and waiting a little while [20:10:21] mark, banisher, paravoid: request to have https://gerrit.wikimedia.org/r/#/c/62460/ merged for Varnish configuration for a Wikipedia Zero partner... [20:10:51] OOPS! mark, binasher, paravoid ^ (silly spellchecker) [20:11:09] ^ s/banisher/binasher/ [20:11:34] RECOVERY - Puppet freshness on mc15 is OK: puppet ran at Mon May 6 20:11:31 UTC 2013 [20:12:54] RECOVERY - Host amssq51 is UP: PING OK - Packet loss = 0%, RTA = 85.97 ms [20:12:54] RECOVERY - Host amssq57 is UP: PING OK - Packet loss = 0%, RTA = 87.74 ms [20:12:54] RECOVERY - Host amssq61 is UP: PING OK - Packet loss = 0%, RTA = 89.59 ms [20:12:54] RECOVERY - Host amssq49 is UP: PING OK - Packet loss = 0%, RTA = 87.56 ms [20:12:54] RECOVERY - Host maerlant is UP: PING OK - Packet loss = 0%, RTA = 87.61 ms [20:12:54] RECOVERY - Host amssq53 is UP: PING OK - Packet loss = 0%, RTA = 87.47 ms [20:12:54] RECOVERY - Host amssq62 is UP: PING OK - Packet loss = 0%, RTA = 91.19 ms [20:12:55] RECOVERY - Host amssq60 is UP: PING OK - Packet loss = 0%, RTA = 88.82 ms [20:12:56] RECOVERY - Host hooft is UP: PING OK - Packet loss = 0%, RTA = 85.70 ms [20:12:56] RECOVERY - Host amssq54 is UP: PING OK - Packet loss = 0%, RTA = 87.16 ms [20:12:56] RECOVERY - Host amssq58 is UP: PING OK - Packet loss = 0%, RTA = 87.32 ms [20:12:57] RECOVERY - Host 
amssq55 is UP: PING OK - Packet loss = 0%, RTA = 88.93 ms [20:12:57] RECOVERY - Host amssq50 is UP: PING OK - Packet loss = 0%, RTA = 85.66 ms [20:12:58] RECOVERY - Host amssq47 is UP: PING OK - Packet loss = 0%, RTA = 87.14 ms [20:12:58] RECOVERY - Host amssq59 is UP: PING OK - Packet loss = 0%, RTA = 88.71 ms [20:12:59] RECOVERY - Host nescio is UP: PING OK - Packet loss = 0%, RTA = 87.46 ms [20:13:04] RECOVERY - Host 91.198.174.6 is UP: PING OK - Packet loss = 0%, RTA = 87.05 ms [20:13:04] RECOVERY - Host amssq56 is UP: PING OK - Packet loss = 0%, RTA = 85.64 ms [20:13:04] RECOVERY - Host amssq52 is UP: PING OK - Packet loss = 0%, RTA = 87.06 ms [20:13:04] RECOVERY - Host amssq48 is UP: PING OK - Packet loss = 0%, RTA = 86.86 ms [20:13:04] RECOVERY - Host cp3003 is UP: PING OK - Packet loss = 0%, RTA = 87.10 ms [20:13:04] RECOVERY - Host ms6 is UP: PING OK - Packet loss = 0%, RTA = 88.31 ms [20:13:05] RECOVERY - Host ms-fe3001 is UP: PING OK - Packet loss = 0%, RTA = 85.47 ms [20:13:44] RECOVERY - Host ns2.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 86.11 ms [20:14:34] !log DNS update - sync with ns2 [20:14:42] Logged the message, Master [20:22:54] PROBLEM - RAID on ms-be3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[20:23:54] PROBLEM - HTTP on holmium is CRITICAL: Connection refused [20:27:34] PROBLEM - NTP on nescio is CRITICAL: NTP CRITICAL: Offset unknown [20:28:27] New patchset: Lcarr; "fixing customized ports.conf for blog server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62515 [20:28:44] PROBLEM - Packetloss_Average on emery is CRITICAL: CRITICAL: packet_loss_average is 11.0011278632 (gt 8.0) [20:32:44] RECOVERY - Packetloss_Average on emery is OK: OK: packet_loss_average is 0.684491037736 [20:33:24] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62515 [20:33:34] PROBLEM - Packetloss_Average on analytics1005 is CRITICAL: CRITICAL: packet_loss_average is 10.4895393966 (gt 8.0) [20:36:56] New review: Hashar; "(1 comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61428 [20:37:03] New patchset: Hashar; "beta: configuration for Wikidata" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61428 [20:37:31] RECOVERY - Packetloss_Average on analytics1005 is OK: OK: packet_loss_average is 0.590788301887 [20:39:36] New patchset: Brion VIBBER; "Update Wikipedia FirefoxOS app: fix for langlinks menu" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62522 [20:46:18] !log swift replication scripts running on ms-fe1002 under screen -- kill if they have any effect on production [20:46:26] Logged the message, Master [20:46:28] and I'll call it a night :) [20:47:16] New review: Brion VIBBER; "Details can be see in the github repo for the app: https://github.com/wikimedia/WikipediaMobileFiref..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62522 [20:53:51] RECOVERY - HTTP on holmium is OK: HTTP OK: HTTP/1.1 200 OK - 86490 bytes in 0.509 second response time [20:59:42] !log holmium: bring up interface lo, comment "search" line in resolv.conf, restart varnish, ... [20:59:45] ! 
[20:59:50] Logged the message, Master [20:59:57] !log blog is back up [21:00:05] Logged the message, Master [21:01:20] New patchset: Ori.livneh; "Rsync public datasets for visualization to stat1001" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54116 [21:02:45] New review: Ori.livneh; "PS6 removes the Python modules, which aren't essential to this change (and deserve a proper commit)." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54116 [21:03:11] good night parav0id [21:04:31] New patchset: Ori.livneh; "Rsync public datasets for visualization to stat1001" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54116 [21:19:25] New patchset: Andrew Bogott; "A few more tuneups of mediawiki_singlenode:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62416 [21:20:14] New patchset: Reedy; "Move non geodata cron jobs to terbium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62526 [21:21:31] New patchset: Reedy; "Move non geodata cron jobs to terbium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62526 [21:27:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:28:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.152 second response time [21:33:47] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62416 [21:45:46] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [21:45:46] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [21:45:46] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [21:45:46] PROBLEM - Puppet freshness on virt1005 is CRITICAL: No successful Puppet run in the last 10 hours [21:52:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds [21:53:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [21:59:27] Change abandoned: Andrew Bogott; "Squashed into a different patch" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62417 [22:12:15] New review: Krinkle; "@Hashar: The plan is to have the mediawiki-core-frontend job, clear these files before every run, as..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61997 [22:15:11] New review: Andrew Bogott; "Ori, can you think of a way to detect whether or not an importDump has happened? I fear to embed a ..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53989 [22:16:05] New review: Hashar; "Yeah that might work though I would prefer an error handler in the backend. " [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61997 [22:16:19] I am off *wave* [22:16:25] bye hashar [22:17:23] New patchset: Ori.livneh; "Specify the correct path to helper script" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62530 [22:27:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:28:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.166 second response time [22:31:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:33:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [22:45:04] Mon May 6 10:00:28 UTC 2013 srv193 testwiki Banner::getMixins 10.0.6.76 1054 Unknown column 'mixin_name' in 'field list' (10.0.6.76) SELECT mixin_name FROM `cn_template_mixins` WHERE tmp_id = '90' [22:45:10] mwalker: any idea about that? 
[22:46:40] metawiki Banner::getHistoricalBanner 10.64.16.17 1064 You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near ') LIMIT 1' at line 1 (10.64.16.17) SELECT ... [22:47:36] !log apache-graceful-all, add iegcom virtual host [22:47:46] Logged the message, Master [22:58:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:59:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [22:59:24] New review: Ori.livneh; "@Andrew: this should work (I tested it):" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53989 [23:12:38] New patchset: Ryan Lane; "Disable php for wordpress upload directory" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62537 [23:14:55] New review: coren; "Clearly correct." [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/62537 [23:14:56] Change merged: coren; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62537 [23:15:10] New patchset: Dzahn; "add private wiki iegcom" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62538 [23:17:07] New review: Dzahn; "https://wikitech.wikimedia.org/wiki/Add_a_wiki#IMPORTANT:_For_Private_Wikis" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/62538 [23:17:08] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62538 [23:24:28] hey, possible issue with esams ssl [23:25:16] !log restarting prelabsdbdb on db1054 [23:25:24] Logged the message, Master [23:25:50] WFM over ipv6 [23:37:40] Undefined index: wgStyleSheetPath [23:38:28] jenkins-bot failed with this error [23:38:53] New patchset: Dr0ptp4kt; "Adding extra IPs for Mobilink Pakistan. Verified at http://wq.apnic.net/apnic-bin/whois.pl." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/62543 [23:41:42] can i ask someone to merge https://gerrit.wikimedia.org/r/#/c/53989/ ? it's a really simple change -- corrects the path reference to a file from /var/lib/ipython to /srv/ipython, where it is now located. [23:42:12] mark, paravoid, binasher: request for merge - https://gerrit.wikimedia.org/r/#/c/62543/ - change requested by dfoy. [23:46:24] dr0ptp4kt: it's nearly 2 AM for the first two, and the third indicated being unavailable on the list [23:46:39] they may not appreciate the ping :) [23:46:47] New patchset: Ryan Lane; "Also disable php for uploads on blog via 443" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62544 [23:47:32] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62544 [23:48:11] ori-l, thx. hope i didn't wake anybody up or cause unnecessary stress. would you recommend i send an e-mail instead, or something else? [23:48:48] rt ticket [23:49:00] LeslieCarr, thx [23:51:28] +7653, -4 <-- really simple, hehe [23:51:41] ori-l-away: maybe got the wrong number there :) [23:51:43] bbl [23:52:06] New patchset: Dr0ptp4kt; "Merge branch 'production' of https://gerrit.wikimedia.org/r/p/operations/puppet into zero-mobilink-pakistan-add-ips" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62545 [23:57:07] Change abandoned: Dr0ptp4kt; "Oops, need to go back and rebase properly." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62545 [23:57:20] Change abandoned: Dr0ptp4kt; "Oops, need to rebase." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62543