[00:30:32] !log tstarling synchronized php-1.21wmf4/extensions/WikimediaMaintenance/fixBug41778.php [00:30:40] Logged the message, Master [00:31:19] !log tstarling synchronized php-1.21wmf3/extensions/WikimediaMaintenance/fixBug41778.php [00:31:25] Logged the message, Master [00:49:52] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [00:49:52] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [01:39:59] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 260 seconds [01:40:16] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 279 seconds [01:43:16] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds [01:46:31] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 17 seconds [01:59:43] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 233 seconds [02:00:37] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 289 seconds [02:15:01] PROBLEM - Squid on brewster is CRITICAL: Connection refused [02:23:19] !log LocalisationUpdate completed (1.21wmf4) at Mon Nov 19 02:23:19 UTC 2012 [02:23:29] Logged the message, Master [02:24:37] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [02:31:41] PROBLEM - Puppet freshness on analytics1002 is CRITICAL: Puppet has not run in the last 10 hours [02:43:07] !log LocalisationUpdate completed (1.21wmf3) at Mon Nov 19 02:43:07 UTC 2012 [02:43:14] Logged the message, Master [03:01:40] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [03:48:37] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds [03:50:43] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 10 seconds [04:30:54] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [04:30:54] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [04:30:54] PROBLEM - Puppet freshness on ms-fe1 is CRITICAL: Puppet has not run in the last 10 hours [04:30:54] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [04:54:09] RECOVERY - Lucene on search1016 is OK: TCP OK - 0.027 second response time on port 8123 [05:49:04] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [06:36:58] PROBLEM - Lucene on search14 is CRITICAL: Connection timed out [06:37:07] RECOVERY - Squid on brewster is OK: TCP OK - 0.001 second response time on port 8080 [06:39:58] RECOVERY - Lucene on search14 is OK: TCP OK - 0.018 second response time on port 8123 [06:51:47] PROBLEM - Lucene on search14 is CRITICAL: Connection timed out [06:52:32] apergos: now search14 ? [06:52:40] it incremented ;P [06:53:53] just got on line [06:56:14] * apergos waits for nagios to wise up [06:56:26] RECOVERY - Lucene on search14 is OK: TCP OK - 0.008 second response time on port 8123 [06:56:31] thanks [06:56:43] !log restarted lucene search on search14 [06:56:50] Logged the message, Master [08:36:07] !log running sync-common on mw46 [08:36:14] Logged the message, Master [08:37:07] PROBLEM - Puppet freshness on dobson is CRITICAL: Puppet has not run in the last 10 hours [08:44:10] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [09:31:20] link fix [10:34:26] DanielK_WMDE: see topic :-] [10:34:33] ah, cool [10:34:50] ...or not ;) [10:34:56] so, who *is* on duty? 
[10:37:19] DanielK_WMDE: Faidon has nickname paravoid [10:37:35] he is in Greece :-] as well as apergos [10:38:32] on duty = working? me, eventually mark I suppose, and eventually paravoid (based on order when we usually show up active in the channel) [10:38:44] Hi all! [10:39:16] DanielK_WMDE: https://labsconsole.wikimedia.org/wiki/Help:Self-hosted_puppetmaster [10:39:47] DanielK_WMDE: once an instance is build simply check the box: [ ] puppetmaster::self [10:39:49] then run puppet [10:39:50] ;-) [10:47:24] Recently, I worked on puppet recipes for Wikidata on labs. I submitted some stuff for review which installs either a Wikidata client or a repo via puppet. Some people from the labs channel commented on it, but I'm somehow stuck: I am not a coder so I did not realize I have to ask for a style guide before starting. Now I got a lot of contradictory comments from the "tabs vs. spaces universe", but nobody has actually given me feedback [10:47:24] in the form of "grades" so merging would get any closer. Do you have any ideas how to get on with this? [10:50:41] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [10:50:41] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [10:51:32] Silke_WMDE: you're unlikely to get anywhere if you don't give links... [10:52:18] jeremyb :) Sorry! https://gerrit.wikimedia.org/r/#/c/30593/ [10:52:48] cat to vet :-( [12:25:25] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [12:26:54] Silke_WMDE_: indeed, the git tutorial and siblings should contain a visible warning: the likelihood of your commit to be actually reviewed is inversely proportional to the number of whitespace changes and to the square of the number of trailing whitespaces you added [12:32:28] PROBLEM - Puppet freshness on analytics1002 is CRITICAL: Puppet has not run in the last 10 hours [12:35:36] Silke_WMDE_: hello :-] I talked with Daniel Kinzler this morning about your puppet change [12:35:44] Silke_WMDE_: definitely ignore anything regarding whitespaces :-] [12:52:17] DanielK_WMDE: our PHPUnit code is really messy :-] [12:52:35] you don't say ;) [12:53:35] so the new setupTestDB and teardownTestDB you wrote have been added to MediaWikiTestCase which is the parent class for any of our test class [12:53:57] made static so they could be run from the MediaWikiPHPUnitCommand [12:54:10] I guess we should move them from MediaWikiTestCase to MediaWikiPHPUnitCommand [12:54:29] seems they are to be run before and after the whole test suite [12:56:20] DanielK_WMDE: I will move your stuff to the MediaWikiPHPUnitCommand [12:56:26] and refactor some stuff while I am at it [12:57:42] hashar: that means that MediaWikiTestCase has to know MediaWikiPHPUnitCommand. [12:58:10] setupTestDB is not done before all tests are run, it's done before the first test runs that needs the database (kind of lazy initialization) [12:58:18] ahh [12:58:23] so it is generally triggered by MediaWikiTestCase [12:58:53] it's not really a problem, except that I dislike circular dependencies like this. [12:59:21] (or... does MediaWikiPHPUnitCommand know about MediaWikiTestCase?) 
[13:00:33] my idea was to get the testDB logic in the PHPUnitCommand class [13:00:43] and make MediaWikiTest case to execute it whenever needed [13:00:49] that is simply moving static classes aroun [13:00:50] d [13:00:53] or maybe another class [13:02:06] naw, put it where you like it, I don't care :) [13:02:59] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [13:03:34] lets merge all the mess first :-] [13:12:21] DanielK_WMDE: merged both [13:15:49] re [13:16:23] hashar: OK. :) SO whom can ask to review/give feedback/merge this? [13:17:00] Silke_WMDE_: andrewbogott did have a look at it, I guess you can follow up [13:17:09] err .. HE can follow up [13:17:09] yes [13:17:10] I think [13:17:40] Silke_WMDE_: I also told Daniel I was going to have a look at it :-] [13:17:43] doing that right now [13:17:53] great, thanks! [13:18:12] Silke_WMDE_: I will comment in the inline diff [13:18:18] ok [13:18:38] hashar: thanks! [13:18:55] oh my [13:19:35] if only we had something like: mediawiki::extension { name => "Diff", dest => "/some/path" } [13:20:51] Should be possible to create something like that. [13:52:38] New review: Hashar; "I dont think the two templates are actually required:" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/30593 [13:52:50] Silke_WMDE_: did a quick review, untested though [13:52:57] the apache conf could use some cleanup [13:53:11] and I am not sure why you add two template files which do not seems to be used anywhere [14:03:13] hashar: True, I think are even three unneeded files which were still included in the first submission. When I did the patchsets I ... erm ... just didn't know how to throw them out. [14:03:24] thanks for the comments! [14:03:28] ahh you will need some git magic [14:03:43] get the patch then ask git to remove the files with: git rm [14:04:06] "git status" will show you the file are staged for deletion [14:04:18] then amend your commit, that will add the file deletion to the commit [14:04:20] git commit --amend [14:04:33] you might want to try out in a sandbox first ;-] [14:05:10] cd /tmp ; mkdir gittest ; cd gittest; git init; touch FOO; git commit -a -m "adding FOO" file; [14:05:26] then git show to see the commit [14:05:33] then delete the file with git rm FOO [14:05:38] git status <-- show what is staged [14:05:43] git commit --amend [14:05:49] and finally git show to see the new commit :-] [14:06:04] or poke Jeroen :-] [14:06:12] git might be a bit confusing [14:13:40] apergos: how's ms-be7? [14:15:45] paravoid: status is the same. 
we left it on friday w/the installer not writing to /sdm/n [14:17:05] apergos: did a work around where we could install the OS on /sdm and /sdn but I don't think it is something we would want to do on all of them [14:17:30] apergos believes are issues are bug related https://bugs.launchpad.net/ubuntu/+source/debian-installer/+bug/1012629 [14:20:08] I told apergos on Friday to try with a different bootdev and see what happens [14:20:27] and ping me if that didn't work [14:23:44] parvoid: i added this to the partman recipe d-i grub-installer/bootdev string /dev/sdm /dev/sdn [14:23:55] !log putting ms-fe1 back into the pool [14:24:02] Logged the message, Master [14:24:05] but the installer sill insisted on writing to /sda [14:32:13] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [14:32:13] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [14:32:13] PROBLEM - Puppet freshness on ms-fe1 is CRITICAL: Puppet has not run in the last 10 hours [14:32:13] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [14:32:55] why is ms-be7 different from ms-be6? [14:35:44] mark: they're not different. when the initial OS was loaded on be6 it was installed on /dev/sda and b. We had to fix it to load on sdm and sdn which it then did w/out a problem. [14:36:16] however, /sda/sdb did not get written over ...apergos had to go in and manualy add them to the file system. [14:38:21] ms-be7 works right now but we had to go in to fdisk and create a single primary partition first..than installer wrote to /sdm and n [14:38:35] i see [14:40:53] we can get it back to original state by dd'ing /dev/sda [14:48:06] !log reedy synchronized php-1.21wmf4/cache/interwiki.cdb 'Updating interwiki cache' [14:48:13] Logged the message, Master [14:48:47] !log reedy synchronized php-1.21wmf3/cache/interwiki.cdb 'Updating interwiki cache' [14:48:54] Logged the message, Master [14:51:49] mark: isn't brewster a SPOF for all apt updates/installs? 
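A consolidated, runnable version of the sandbox walkthrough hashar gives above (one correction: a brand-new file has to be staged with `git add` before the first commit, which the one-liner in the chat skips):

```sh
# Throwaway sandbox, as suggested in the chat
cd /tmp && mkdir gittest && cd gittest
git init
touch FOO
git add FOO                 # a new file must be staged before the first commit
git commit -m "adding FOO"
git show                    # inspect the commit

# Dropping a file from the latest commit
git rm FOO                  # deletes the file and stages the deletion
git status                  # shows FOO staged for deletion
git commit --amend          # folds the deletion into the previous commit
git show                    # verify the amended commit
```

For the actual Gerrit change, the same `git rm` plus `git commit --amend` is done on the fetched patch set, which is then re-pushed for review.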
[14:52:08] yes [14:52:12] how is that "critical" [14:52:48] if that's critical, then everything is critical [14:53:24] anyway, I believe iron is brewster's counterpart in eqiad [14:53:43] so it would be far more productive to make sure that's a brewster replica than to migrate brewster itself [14:55:31] indeed, this makes sense [14:56:09] sorry, carbon [14:56:10] not iron [14:56:48] but yeah, what I meant with critical was "spof, affects day-to-day ops" [14:57:30] every service affects day to day stuff of some dept or person ;) [14:57:41] sure, it's not unimportant [14:58:00] but putting that server on one with warranty is not really gonna help anything either is it [14:58:27] if it's gonna be down for a few days we'll have it migrated to another box by the time the new parts arrive [14:58:35] it would be good to ensure good backups of our apt repo though [14:58:40] hehe I guess so [14:58:42] I think we had them but not sure how well it's still working [14:59:19] don't get me wrong, builting a counterpart for redundancy is a much better idea in my opinion too [15:02:35] a large part of brewster is in puppet [15:03:11] meh [15:03:16] is that batch out of warranty too [15:03:22] I still think of brewster as a newish server [15:03:28] i've been here too long ;) [15:06:52] how interesting [15:07:12] almost all of the FIN_WAITs on ms-fe.svc are with imagescalers [15:07:55] I think because squids etc have shorter fin_timeouts [15:08:01] while most of the ESTABLISHED ones are with cp [15:10:36] well, you're right that we have fin_timeout set to 3 on squid/varnish/swift [15:10:57] but otoh, why would it timeout on internal traffic? [15:11:05] it's not like we have broken tcp stacks on our machines [15:12:12] so, neither of ms-fe or srvNNN have FIN_WAIT on their netstat [15:12:16] but LVS thinks so [15:16:33] I've put ms-fe1 back into the pool for almost an hour now and it still gets 1/10 of the traffic than the rest [15:16:42] 18 connections vs. 180 for each of the other three [15:18:00] New patchset: Silke Meyer; "Added puppet files for Wikidata on labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30593 [15:20:18] New review: Silke Meyer; "Removed the three obsolete template files (also the apache template), using the template from the ap..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/30593 [15:26:03] New review: Silke Meyer; "I forgot: the "keep_up_to_date" naming is Andrew's, he has used it in his mediawiki recipe, so I can..." 
[operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/30593 [15:35:13] New patchset: Hashar; "Gerrit notifications for Wikidata to their channel" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26042 [15:35:13] New patchset: Hashar; "cleanup/refactor gerrit logging" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8120 [15:35:13] New patchset: Hashar; "Gerrit hook tests now creates hookconfig.py" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26040 [15:35:14] New patchset: Hashar; "Gerrit hook tests extended coverage" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26041 [15:48:11] !log reedy synchronized php-1.21wmf4/extensions/OAI [15:48:13] DanielK_WMDE: ^^ [15:48:17] Logged the message, Master [15:50:12] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [15:58:02] Reedy: our hero :-) [15:58:03] !log powercycling ms-be3, unresponsive [15:58:08] off to get my daughter [15:58:09] Logged the message, Master [16:01:59] mark: nope, it's not fin_timeout [16:02:36] however I do see one possible reason: there are some long-lived cp/sq connections, they're probably using keepalives [16:02:45] and pipelining [16:03:04] while imagescalers do not/can't [16:03:14] so that part makes sense [16:03:33] RECOVERY - Host ms-be3 is UP: PING OK - Packet loss = 0%, RTA = 0.49 ms [16:04:15] yay [16:04:26] amazing, it came back up :) [16:11:18] New patchset: Mark Bergsma; "Create new Ganglia cluster(s), and add ms-be300x to it" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34093 [16:17:39] hm [16:17:41] come to think about it [16:17:52] this might be the entire reason behind the traffic being unbalanced [16:18:31] persistent connections from varnish/squid [16:19:08] new realservers won't get much traffic, since most of the traffic passes through the existing connections [16:19:43] am I blind or just losing it [16:19:56] https://gerrit.wikimedia.org/r/#/c/34093/1/manifests/ganglia.pp [16:19:59] what's wrong here [16:20:10] i don't see it [16:20:41] doh [16:20:42] I do see it [16:21:08] New patchset: Mark Bergsma; "Create new Ganglia cluster(s), and add ms-be300x to it" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34093 [16:21:30] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34093 [16:21:45] missing $cluster? [16:22:09] oh no, missing } [16:22:28] mark: see what I said above? [16:22:43] yes [16:22:43] about persistent http from backend proxies? 
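A sketch of how the imbalance being described (ms-fe1 holding a fraction of the connections because squid/varnish reuse persistent backend connections) could be observed; the tools, port and filtering here are assumptions, not commands taken from the chat:

```sh
# On the LVS director: per-realserver connection counts for the swift service
ipvsadm -L -n               # ActiveConn / InActConn per real server
ipvsadm -L -n --stats       # cumulative conns / packets / bytes per real server

# On a swift frontend: which proxies hold established connections to it (port 80 assumed)
netstat -tan | awk '$6 == "ESTABLISHED" && $4 ~ /:80$/ {split($5, p, ":"); print p[1]}' \
  | sort | uniq -c | sort -rn
```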
[16:22:51] they are using persistent http yes [16:23:17] yeah, so I think that's why ms-fe1 gets less than 3% of the traffic despite being pooled back for 2 hours now [16:23:28] right [16:23:34] might want to set a max on the nr of requests ;) [16:23:51] we can do balancing inside varnish as well of course [16:24:08] mmhmm [16:24:54] pybal is nice though :-) [16:27:06] so, varnish has no way to close backend connections after a number of requests [16:27:11] and neither does swift it seems [16:27:13] heh [16:28:17] hmm [16:28:19] I think you could in vcl [16:28:22] but it's hacky [16:29:29] let's not [16:29:40] amazing, bitten by optimizations [16:30:31] we had persistent connections disabled until half a year ago [16:30:35] when I enabled it again on apaches [16:30:45] knowing that it would reduce equal load balancing [16:30:55] but the distance of eqiad caches to tampa apaches was more important [16:31:35] indeed [16:32:36] it's the same for swift now that we have caches in eqiad [16:32:47] yes [16:35:57] robh: please send license for ms-be7 when you get a chance...ST 2FLGYV1 [16:36:08] apergos: around? [16:36:28] yes [16:36:36] saw the comments above about ms-be7? [16:37:04] want me to take it over? [16:37:08] I saw you asked how things went [16:37:31] cmjohnson1's update is pretty much on the money [16:38:06] checkd the logs and saw the bootdev did seem to get set to sdm, n by the installer [16:38:07] and yet [16:38:18] it tried grub to sda and failed [16:38:37] I tested your theory about the partition table, that indeed let it complete [16:39:26] but the first puppet run would I guess not complete properly, as the filesystem wouldn't getmade for that partition [16:39:49] so we'd have manual intervention a couple of times, not so exciting [16:40:04] did you want to look at it some to see if you can figue out something better? [16:41:04] cmjohnson1: silly question: are the SSDs are connected to the RAID controller? is it possible to connect them to the onboard controller like we had it in the C2100s? [16:41:11] also, I don't know what state it's in tirhgt now; this weekend, maybe Saturday, I saw it was bouncing according to nagios, so I checked and it was at some other random step in the installer, failed. I left it in a shell from the installer [16:42:04] but I'm not sure where cmjohnson1 left it, except that it had gotten through the grub step (it was quite late at that point, I crawled off to sleep) [16:42:45] paravoid: they are connected to the raid controller and no, we are not able to connect them directly to the onboard controller. [16:43:00] worth a shot :) thanks [16:43:02] the c2100's were more flexible than the 720's [16:43:15] apergos: i left it in the shell [16:43:59] hmm [16:44:15] who knows what happened, grrr [16:44:18] I tend to prefer inflexible 720s though :) [16:44:30] yeah, this is annoying but I much prefer these boxes [16:47:56] New patchset: Mark Bergsma; "Set ms-be3001/2 as ganglia aggregators" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34097 [16:47:58] paravoid apergos: the additional 10 720's are supposed to be delivered today. 
I am going to get ms-be8 and 10 up and once we figure out the install they will be ready [16:48:10] ok that's great [16:48:15] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34097 [16:48:16] the install should be fine for systems without SSDs [16:48:25] also, I'd like to reserve a few of them for my ceph testing [16:48:26] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [16:48:56] just let me know which ones are to be swift with ssds, which ones are to be swift without ssds [16:49:00] not too many though, we'll have enough to be replacing boxes for a monthr or two [16:49:05] and I can leave the rest or do a base install as you prefer [16:49:28] we used to have 4 boxes with SSDs in the C2100 cluster iirc [16:49:32] I think that's enough for now [16:49:53] only 4? huh [16:50:28] paravoid: so you don't want ssd's in all of them? [16:50:35] no [16:50:59] that was the case with C2100s too [16:51:39] 5-12 had them it says here (according to the netboot config) [16:51:48] right [16:51:59] they were not in them but i recall Ben wanted ssds in all of them...let me know know which ones you want to leave out? [16:52:29] right but i had an old ticket for 1-4...the c2100's were nixed before we ever got that point [16:53:14] I think it's a waste to have it in all of them, although I wasn't around when they did that performance testing [16:53:24] maybe mark knows? [16:54:00] it was only for the container objects I think? [16:54:05] yes [16:54:10] but ben generally liked consistency, so probably for that reason alone [16:54:22] containers and accounts, although we have a single account so... [16:56:51] cmjohnson1: trying to find the damned link for the drac software [16:57:44] found it, pulling the key [16:57:44] ok....still have like 25 days before it expires [16:58:07] ok..keep it handy will have a few more for you today [16:58:10] no problem, ill have dell email it to me and i will forward you the email [16:58:27] or you can drop an RT ticket for each one [16:58:30] and I can attach the key [16:58:37] that may be easier for you to keep track of [16:58:47] since the key doesnt really have info saying its for such and such [16:58:55] ok...let's do that so I can refer back to it [17:01:15] their site for this stuff is really crappy [17:01:26] (in any browser ive tested) [17:02:34] constant error popups and timeouts from whatever crap server side stuff they are running [17:02:39] paravoid: so do you want to go w/ the current set up on which ones have and do not have ssd's? be1-4 do not 6-12 do...let me know or add to ticket 3829 [17:02:49] yeah that's fine [17:03:04] ok [17:03:33] 5-12 [17:03:34] do you know the ETA for the eqiad cluster? [17:04:11] New review: Andrew Bogott; "Regarding tabs vs. spaces... I really would like you to change these files to follow the existing st..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/30593 [17:04:12] projected time is first week of December [17:04:27] nothing certain yet...still in production [17:05:29] sigh [17:07:06] https://bugzilla.wikimedia.org/show_bug.cgi?id=20409 - would NE be needed on all rewrites that might deal with special chars? [17:07:19] Krenair: I think so. [17:07:48] And it's apache-config redirects.conf? 
[17:08:15] I think so :) [17:08:46] cmjohnson1: https://rt.wikimedia.org/Ticket/Display.html?id=3939 [17:08:54] thats ms-be7 key file [17:08:59] cool thx [17:09:04] just drop one ticket per system in the pmtpa queue and assign to me [17:09:15] i'll assign and attach key and assign abck to you [17:09:31] okay [17:09:46] ah so paravoid I see you asked if I want you to take over. I am out of ideas for sure [17:09:52] !log reedy synchronized php-1.21wmf4/cache/interwiki.cdb [17:09:58] Logged the message, Master [17:10:00] but I"m happy to follow along on the backread during the times I'm not here [17:10:07] or follow along live when I am [17:10:18] I notice lots of these redirect to specifically HTTP versions, without considering HTTPS... [17:10:22] wow Silke's client sensed Andrew's comment and quit before it [17:10:58] !log reedy synchronized php-1.21wmf3/cache/interwiki.cdb [17:11:04] Logged the message, Master [17:11:26] Reedy: any idea on where to redirect https://secure.wikimedia.org/w/extensions/skins/Donate/images/banners/Banner_88x31_0000_A.jpg ? [17:12:19] <^demon> paravoid: Same path, but just use the host we're already on. [17:12:28] er sorry? [17:12:30] <^demon> eg: https://en.wikipedia.org/w/extensions/skins/Donate/images/banners/Banner_88x31_0000_A.jpg [17:12:53] <^demon> https://fr.wikiquote.org/w/extensions/skins/Donate/images/banners/Banner_88x31_0000_A.jpg [17:12:55] <^demon> Etc. [17:12:57] what do you mean "we're already on"? [17:12:59] Depending on how they're actually including it, fundraising should possibly fix their code [17:13:05] there's no host here [17:13:08] that's the URL as you see it. [17:13:21] New patchset: Reedy; "Kill static-master" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/34098 [17:13:23] ugh they're not actually including from secure.wm.o are they? [17:13:29] paravoid: Point it at meta [17:13:38] no, that's from a random blog that has sourced it [17:13:50] but I'm wondering about the general case here, not just this blog. [17:13:53] or URL [17:13:54] <^demon> Meta's prolly fine. [17:14:04] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/34098 [17:14:04] redirect what? /w/extensions/ ? [17:14:07] Yeah [17:14:07] or /w/ ? [17:14:23] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.001 second response time on port 11000 [17:14:24] <^demon> I'd say /w/ is probably fine. [17:14:26] <^demon> Reedy? [17:14:29] Ditto [17:14:39] thanks :-) [17:15:42] There shouldn't be that much traffic ging there [17:16:05] no [17:16:22] but I'd like to drop the proxying eventually [17:16:30] since it bypasses caches and everything [17:16:32] <^demon> And if there's any traffic from us going there, we should fix it. [17:16:36] ^ [17:16:39] ideally, I'd like to move it to the SSL cluster, rather than singer [17:16:41] Have we got reasonable access logging? [17:16:53] no [17:16:59] i.e. 
we log all secure.wm.org hits [17:17:07] which isn't reasonable, but it's useful in this case :-) [17:18:45] New patchset: Faidon; "secure.wikimedia.org: add redirect for /w/" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34100 [17:19:35] New patchset: Faidon; "secure.wikimedia.org: add redirect for /w/" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34100 [17:21:29] New patchset: Faidon; "secure.wikimedia.org: make the redirects permanent" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34101 [17:21:34] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34100 [17:21:42] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34101 [17:26:10] Nov 19 17:22:36 208.80.152.162 apache2: PHP Warning: Directive 'magic_quotes_gpc' is deprecated in PHP 5.3 and greater <-- looks like that server is out of sync with puppet [17:40:51] Is faidon on IRC? [17:44:42] http://pastebin.com/Gq46Bqjw - untested so far, basically just added NE to those which look like they would deal with special chars [17:45:03] paravoid, something like that? [17:50:32] RECOVERY - mysqld processes on es1 is OK: PROCS OK: 1 process with command name mysqld [17:54:13] New patchset: awjrichards; "Update technical feedback email address for mobile site contact page" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/34105 [17:55:38] New patchset: awjrichards; "Update technical feedback email address for mobile site contact page" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/34105 [17:57:09] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/34105 [17:57:40] eww, I shouldn't have hurried with it [17:58:38] Reedy, it's your time - should I revert and resubmit ^^^ after it or you don't need to change settings so it won't disrupt you? [17:58:52] Go ahead if you want [17:59:03] I've only got to change 1 line in wikiversions.dat for this deploy [17:59:29] you mean go ahead and push it? [18:00:20] yeah [18:06:34] oh, who is the ops on duty today ? [18:06:43] this week [18:08:35] LeslieCarr, topic says it's you:) [18:08:39] hehe [18:08:42] that's a lie [18:12:32] LeslieCarr: I was waiting until the ops meeting :) [18:12:37] Krenair: I'm Faidon. [18:12:37] oh [18:12:37] hehe [18:13:24] ... ah. hi :) [18:14:08] hi :) [18:14:41] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: enwiki to 1.21wmf4 [18:14:48] Logged the message, Master [18:15:00] so, yeah, something like this I guess. [18:15:03] is this 'd:' again? [18:16:05] paravoid, upoading to gerrit [18:16:53] apergos, this is about special chars being double-encoded on redirect - https://bugzilla.wikimedia.org/show_bug.cgi?id=20409 [18:18:24] New patchset: Alex Monk; "(bug 20409) Use NE flag for rewrites that probably need to deal with special chars." 
[operations/apache-config] (master) - https://gerrit.wikimedia.org/r/34113 [18:18:56] New review: Alex Monk; "Untested" [operations/apache-config] (master) C: 0; - https://gerrit.wikimedia.org/r/34113 [18:20:20] New patchset: Reedy; "enwiki to 1.21wmf4" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/34115 [18:20:31] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/34115 [18:20:45] οη μυ [18:20:47] er [18:20:47] oh my [18:21:16] Krenair: that list probably needs a cleanup/update, if you're feeling up to it :-) [18:36:49] PROBLEM - Host analytics1027 is DOWN: PING CRITICAL - Packet loss = 100% [18:37:34] PROBLEM - Puppet freshness on dobson is CRITICAL: Puppet has not run in the last 10 hours [18:38:08] paravoid, the redirect list? I'm not really sure what needs to be changed.. [18:38:37] RECOVERY - Host analytics1027 is UP: PING OK - Packet loss = 0%, RTA = 35.43 ms [18:44:46] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [18:45:55] !log maxsem synchronized wmf-config/mobile.php 'https://gerrit.wikimedia.org/r/#/c/34105/' [18:46:03] Logged the message, Master [18:53:46] New patchset: Pyoungmeister; "repooling es1" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/34119 [18:54:07] AaronSchulz: I saw a message regarding some errors; I guess that was referring to ms-be3? [18:54:22] I was wondering and rediscovered it myself accidentally. [18:59:18] New patchset: Ottomata; "Moving analytics kraken mysql server to analytics1027" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34120 [19:00:01] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34120 [19:10:43] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/34119 [19:11:52] !log py synchronized wmf-config/db.php 'repooling es1' [19:11:58] Logged the message, Master [19:20:01] !log depooling srv258-srv280 for upgrade to precise [19:20:09] Logged the message, notpeter [19:20:19] !log depooling srv290-srv301 for upgrade to precise [19:20:27] Logged the message, notpeter [19:38:37] PROBLEM - Host wikisource-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [19:39:06] oh? 
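On the NE change above (bug 20409): mod_rewrite escapes the substitution of a redirecting rule by default, which is what double-encodes already percent-encoded characters in the issued Location header; the NE (noescape) flag turns that escaping off. An illustrative rule only, not the actual contents of redirects.conf:

```apache
# Illustrative; the real rules live in operations/apache-config redirects.conf
RewriteRule ^/wiki/(.*)$ http://en.wikipedia.org/wiki/$1 [R=301,NE,L]
```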
[19:40:35] PROBLEM - Host wikisource-lb.esams.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [19:40:36] PROBLEM - Host wikiversity-lb.esams.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [19:40:52] PROBLEM - LVS HTTPS IPv4 on wikibooks-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:40:52] PROBLEM - LVS HTTPS IPv6 on mediawiki-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:41:10] PROBLEM - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:41:19] PROBLEM - LVS HTTP IPv6 on wikisource-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:41:19] PROBLEM - LVS HTTPS IPv6 on wikiversity-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:41:19] PROBLEM - LVS HTTPS IPv6 on wikimedia-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:42:22] RECOVERY - LVS HTTPS IPv4 on wikibooks-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 47144 bytes in 0.733 seconds [19:42:22] RECOVERY - LVS HTTPS IPv6 on mediawiki-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 66244 bytes in 0.847 seconds [19:42:22] RECOVERY - Host wikisource-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 118.14 ms [19:42:40] RECOVERY - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 43068 bytes in 0.477 seconds [19:42:50] RECOVERY - LVS HTTP IPv6 on wikisource-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 66241 bytes in 0.685 seconds [19:42:50] RECOVERY - LVS HTTPS IPv6 on wikiversity-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 66244 bytes in 0.879 seconds [19:42:50] RECOVERY - LVS HTTPS IPv6 on wikimedia-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 89621 bytes in 0.914 seconds [19:45:20] New review: Alex Monk; "This probably conflicts horribly with I2c6ab07d" [operations/apache-config] (master) C: 0; - https://gerrit.wikimedia.org/r/34113 [19:46:25] RECOVERY - Host wikiversity-lb.esams.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 152.95 ms [19:46:26] RECOVERY - Host wikisource-lb.esams.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 152.52 ms [19:46:26] PROBLEM - LVS HTTP IPv4 on bits.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:46:27] PROBLEM - Varnish HTTP bits on cp3021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:47:34] Well, that explains a lot [19:47:39] cp3021 is throwing out of socket memory [19:47:54] Reedy: oh sorry, we were on the middle of our ops call and were debugging this realtime [19:47:55] RECOVERY - Varnish HTTP bits on cp3021 is OK: HTTP OK HTTP/1.1 200 OK - 632 bytes in 4.037 seconds [19:48:12] heh [19:48:22] I was wondering if mobile had broken something for a few minutes [19:49:00] mark: varnish in cp3021 is going crazy [19:49:08] ok [19:49:10] load is 13290 [19:49:40] it's in the thread lockup thing you were debugging [19:49:43] yup [19:50:00] should I just restart it or do you want to debug it further? [19:50:17] i'm looking at it [19:50:39] k [19:56:18] !log Moving bits.esams traffic to pmtpa [19:56:24] Logged the message, Master [19:57:00] that bad? 
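On cp3021 "throwing out of socket memory": that kernel message normally means the TCP memory or orphan-socket limits have been exhausted. A quick way to check on the box, using standard Linux knobs; the values themselves are site-specific:

```sh
dmesg | grep -i 'out of socket memory'            # confirm the kernel complaint
cat /proc/net/sockstat                            # TCP pages in use, orphaned/timewait sockets
sysctl net.ipv4.tcp_mem net.ipv4.tcp_max_orphans  # current limits
# raising a limit would look like this (number purely illustrative):
# sysctl -w net.ipv4.tcp_max_orphans=262144
```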
[19:58:21] binasher: curl -i http://commons.wikimedia.org/w/thumb.php?f=Aqueduc_Luynes.jpg [19:58:39] * AaronSchulz chases heisenbugs [19:58:50] PROBLEM - Varnish HTTP bits on cp3020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:59:05] AaronSchulz: note that I've pooled ms-fe1 [19:59:15] with the new rewrite.py [19:59:18] which the rest don't have [19:59:32] oh, that's commons [19:59:33] nevermind [19:59:35] paravoid: moving the traffic is the easiest way to make varnish stop freaking out [20:00:12] Ryan_Lane: I know, mark has been debugging this for two weeks now or something [20:00:18] * Ryan_Lane nods [20:00:20] RECOVERY - Varnish HTTP bits on cp3020 is OK: HTTP OK HTTP/1.1 200 OK - 634 bytes in 0.240 seconds [20:00:22] we've had this issue in the past too [20:00:28] I know [20:00:36] usually due to packet loss [20:01:41] RECOVERY - LVS HTTP IPv4 on bits.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3909 bytes in 0.279 seconds [20:04:57] paravoid: it's like I request /w/thumb.php?f=Aqueduc_Luynes.jpg&w=405 and sometimes what reaches MW is just /w/thumb.php?f=Aqueduc_Luynes.jpg [20:07:29] why are 5% of all requests now for /wiki/Special:RecordImpression [20:08:04] what extension is that? [20:08:14] i don't know [20:08:49] isn't that some fundraising thing? [20:08:54] http://en.wikipedia.org/wiki/Special:RecordImpression [20:09:05] cannot be displayed, hmm [20:10:18] Reedy: Special:TimedMediaHandler taking forever on enwiki btw [20:10:38] And? :p [20:10:50] I'm not sure if anyone did any further fixes or whatever [20:14:00] centalnotice [20:14:05] *central [20:14:28] /home/w/common/php-1.21wmf4/extensions/CentralNotice/special/SpecialRecordImpression.php [20:14:47] !log Restoring bits.esams traffic [20:14:54] Logged the message, Master [20:16:32] PROBLEM - Host srv290 is DOWN: PING CRITICAL - Packet loss = 100% [20:16:32] PROBLEM - Host srv294 is DOWN: PING CRITICAL - Packet loss = 100% [20:16:46] // URL which is hit after a banner is loaded, for compatibility with analytics. [20:16:47] huh [20:16:50] that's me [20:16:55] the nagioses [20:16:57] ok [20:17:16] binasher: ^^ [20:19:05] PROBLEM - SSH on srv291 is CRITICAL: Connection refused [20:19:48] New patchset: awjrichards; "Remove support for all mobile contact form elements except technical problems" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/34183 [20:19:58] $this->getOutput()->disable(); [20:19:58] $this->sendHeaders(); [20:20:17] PROBLEM - Apache HTTP on srv291 is CRITICAL: Connection refused [20:20:35] PROBLEM - SSH on srv293 is CRITICAL: Connection refused [20:20:35] PROBLEM - SSH on srv295 is CRITICAL: Connection refused [20:21:02] PROBLEM - Apache HTTP on srv293 is CRITICAL: Connection refused [20:21:38] PROBLEM - Apache HTTP on srv292 is CRITICAL: Connection refused [20:21:47] PROBLEM - Apache HTTP on srv295 is CRITICAL: Connection refused [20:21:56] PROBLEM - SSH on srv292 is CRITICAL: Connection refused [20:22:14] RECOVERY - Host srv294 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [20:22:14] RECOVERY - Host srv290 is UP: PING OK - Packet loss = 0%, RTA = 0.46 ms [20:22:56] paravoid: how long before swift can be upgraded to 1.7.5? 
[20:23:02] * AaronSchulz would like that double GET fix [20:26:03] New patchset: Pyoungmeister; "setting srv290-srv295 to use applicationserver role classes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34209 [20:26:08] PROBLEM - SSH on srv294 is CRITICAL: Connection refused [20:26:08] PROBLEM - SSH on srv290 is CRITICAL: Connection refused [20:26:26] PROBLEM - Apache HTTP on srv290 is CRITICAL: Connection refused [20:26:44] PROBLEM - Apache HTTP on srv294 is CRITICAL: Connection refused [20:26:54] PROBLEM - Memcached on srv290 is CRITICAL: Connection refused [20:27:50] New patchset: Pyoungmeister; "setting srv290-srv295 to use applicationserver role classes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34209 [20:30:14] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34209 [20:30:29] RECOVERY - SSH on srv291 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:31:04] gonna make this channel loud [20:31:42] phone to vibrate [20:31:50] RECOVERY - SSH on srv293 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:31:50] RECOVERY - SSH on srv292 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:32:35] RECOVERY - SSH on srv290 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:32:44] RECOVERY - SSH on srv294 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:33:29] RECOVERY - SSH on srv295 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:37:05] PROBLEM - Host srv258 is DOWN: PING CRITICAL - Packet loss = 100% [20:40:05] PROBLEM - NTP on srv291 is CRITICAL: NTP CRITICAL: No response from NTP server [20:41:08] PROBLEM - NTP on srv293 is CRITICAL: NTP CRITICAL: No response from NTP server [20:41:44] PROBLEM - NTP on srv292 is CRITICAL: NTP CRITICAL: No response from NTP server [20:41:44] PROBLEM - NTP on srv295 is CRITICAL: NTP CRITICAL: No response from NTP server [20:42:47] RECOVERY - Host srv258 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [20:44:26] PROBLEM - Host srv259 is DOWN: PING CRITICAL - Packet loss = 100% [20:45:47] PROBLEM - NTP on srv290 is CRITICAL: NTP CRITICAL: No response from NTP server [20:45:47] PROBLEM - NTP on srv294 is CRITICAL: NTP CRITICAL: No response from NTP server [20:46:32] PROBLEM - SSH on srv258 is CRITICAL: Connection refused [20:46:41] PROBLEM - Apache HTTP on srv258 is CRITICAL: Connection refused [20:46:41] PROBLEM - Host srv260 is DOWN: PING CRITICAL - Packet loss = 100% [20:46:41] PROBLEM - Host srv264 is DOWN: PING CRITICAL - Packet loss = 100% [20:46:41] PROBLEM - Host srv262 is DOWN: PING CRITICAL - Packet loss = 100% [20:46:41] PROBLEM - Host srv261 is DOWN: PING CRITICAL - Packet loss = 100% [20:46:42] PROBLEM - Host srv263 is DOWN: PING CRITICAL - Packet loss = 100% [20:46:59] PROBLEM - Memcached on srv258 is CRITICAL: Connection refused [20:48:29] PROBLEM - Host srv265 is DOWN: PING CRITICAL - Packet loss = 100% [20:49:19] fyi, ganglia aggregator for pmtpa appservers is down. 
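Back on the Special:RecordImpression question from earlier: finding which extension registers a special page amounts to grepping the deployed branch, which is essentially how CentralNotice was identified above (the path is the one quoted in the chat):

```sh
grep -rl RecordImpression /home/w/common/php-1.21wmf4/extensions/ | head
# -> /home/w/common/php-1.21wmf4/extensions/CentralNotice/special/SpecialRecordImpression.php
```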
[20:50:08] RECOVERY - Host srv259 is UP: PING OK - Packet loss = 0%, RTA = 0.98 ms [20:50:22] Change merged: awjrichards; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33786 [20:50:26] PROBLEM - Host srv258 is DOWN: PING CRITICAL - Packet loss = 100% [20:51:20] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [20:51:20] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [20:51:29] RECOVERY - SSH on srv258 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:51:29] PROBLEM - Host srv268 is DOWN: PING CRITICAL - Packet loss = 100% [20:51:38] RECOVERY - Host srv258 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [20:51:47] PROBLEM - Host srv269 is DOWN: PING CRITICAL - Packet loss = 100% [20:52:03] New patchset: Pyoungmeister; "setting srv258-srv280 to use applicationserver role classes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34214 [20:52:14] PROBLEM - Host srv270 is DOWN: PING CRITICAL - Packet loss = 100% [20:52:23] RECOVERY - Apache HTTP on srv290 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.006 seconds [20:52:23] RECOVERY - Host srv260 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [20:52:23] RECOVERY - Host srv264 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms [20:52:23] RECOVERY - Host srv263 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [20:52:23] RECOVERY - Host srv261 is UP: PING OK - Packet loss = 0%, RTA = 1.66 ms [20:52:24] RECOVERY - Host srv262 is UP: PING OK - Packet loss = 0%, RTA = 0.45 ms [20:52:24] PROBLEM - Host srv272 is DOWN: PING CRITICAL - Packet loss = 100% [20:52:25] PROBLEM - Host srv271 is DOWN: PING CRITICAL - Packet loss = 100% [20:52:25] PROBLEM - Host srv273 is DOWN: PING CRITICAL - Packet loss = 100% [20:52:26] PROBLEM - Host srv274 is DOWN: PING CRITICAL - Packet loss = 100% [20:52:29] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34214 [20:53:35] PROBLEM - Memcached on srv259 is CRITICAL: Connection refused [20:53:44] PROBLEM - Apache HTTP on srv259 is CRITICAL: Connection refused [20:54:11] RECOVERY - Host srv265 is UP: PING OK - Packet loss = 0%, RTA = 1.03 ms [20:54:11] PROBLEM - Host srv275 is DOWN: PING CRITICAL - Packet loss = 100% [20:54:29] PROBLEM - SSH on srv259 is CRITICAL: Connection refused [20:55:05] PROBLEM - Memcached on srv267 is CRITICAL: Connection refused [20:55:06] PROBLEM - Host srv277 is DOWN: PING CRITICAL - Packet loss = 100% [20:55:23] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [20:55:50] PROBLEM - Memcached on srv260 is CRITICAL: Connection refused [20:55:50] PROBLEM - Host srv279 is DOWN: PING CRITICAL - Packet loss = 100% [20:55:59] PROBLEM - Apache HTTP on srv267 is CRITICAL: Connection refused [20:55:59] PROBLEM - SSH on srv267 is CRITICAL: Connection refused [20:55:59] PROBLEM - SSH on srv262 is CRITICAL: Connection refused [20:56:17] PROBLEM - SSH on srv264 is CRITICAL: Connection refused [20:56:17] PROBLEM - Apache HTTP on srv262 is CRITICAL: Connection refused [20:56:17] PROBLEM - SSH on srv261 is CRITICAL: Connection refused [20:56:26] PROBLEM - Apache HTTP on srv264 is CRITICAL: Connection refused [20:56:26] PROBLEM - Apache HTTP on srv261 is CRITICAL: Connection refused [20:56:26] PROBLEM - Apache HTTP on srv263 is CRITICAL: Connection refused [20:56:35] PROBLEM - Apache HTTP on srv260 is CRITICAL: Connection refused [20:56:35] PROBLEM - Host srv280 is DOWN: PING CRITICAL - Packet loss = 100% 
[20:56:44] PROBLEM - Memcached on srv261 is CRITICAL: Connection refused [20:56:44] PROBLEM - Memcached on srv263 is CRITICAL: Connection refused [20:56:44] PROBLEM - Memcached on srv264 is CRITICAL: Connection refused [20:56:44] PROBLEM - SSH on srv260 is CRITICAL: Connection refused [20:57:02] PROBLEM - SSH on srv263 is CRITICAL: Connection refused [20:57:11] PROBLEM - Memcached on srv262 is CRITICAL: Connection refused [20:57:11] RECOVERY - Host srv268 is UP: PING OK - Packet loss = 0%, RTA = 0.70 ms [20:57:29] RECOVERY - Host srv269 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [20:57:47] RECOVERY - SSH on srv259 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:57:56] RECOVERY - Host srv270 is UP: PING OK - Packet loss = 0%, RTA = 0.89 ms [20:58:05] PROBLEM - Apache HTTP on srv265 is CRITICAL: Connection refused [20:58:05] PROBLEM - Memcached on srv276 is CRITICAL: Connection refused [20:58:05] RECOVERY - Host srv271 is UP: PING OK - Packet loss = 0%, RTA = 1.16 ms [20:58:05] RECOVERY - Host srv272 is UP: PING OK - Packet loss = 0%, RTA = 1.29 ms [20:58:06] RECOVERY - Host srv274 is UP: PING OK - Packet loss = 0%, RTA = 1.12 ms [20:58:06] RECOVERY - Host srv273 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [20:58:13] noisy ;) [20:58:23] PROBLEM - SSH on srv265 is CRITICAL: Connection refused [20:58:50] PROBLEM - Memcached on srv265 is CRITICAL: Connection refused [20:58:55] quite :) [20:58:59] PROBLEM - SSH on srv276 is CRITICAL: Connection refused [20:59:35] PROBLEM - Apache HTTP on srv276 is CRITICAL: Connection refused [20:59:53] RECOVERY - Host srv275 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [21:00:02] PROBLEM - Host srv261 is DOWN: PING CRITICAL - Packet loss = 100% [21:00:11] RECOVERY - SSH on srv260 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [21:00:38] PROBLEM - SSH on srv268 is CRITICAL: Connection refused [21:00:47] RECOVERY - SSH on srv262 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [21:00:47] RECOVERY - Host srv277 is UP: PING OK - Packet loss = 0%, RTA = 0.73 ms [21:00:56] PROBLEM - Apache HTTP on srv269 is CRITICAL: Connection refused [21:01:05] PROBLEM - Apache HTTP on srv268 is CRITICAL: Connection refused [21:01:05] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.58 ms [21:01:14] RECOVERY - SSH on srv261 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [21:01:24] PROBLEM - Apache HTTP on srv270 is CRITICAL: Connection refused [21:01:24] PROBLEM - Memcached on srv272 is CRITICAL: Connection refused [21:01:24] PROBLEM - Memcached on srv273 is CRITICAL: Connection refused [21:01:24] RECOVERY - Host srv261 is UP: PING OK - Packet loss = 0%, RTA = 0.77 ms [21:01:32] RECOVERY - Host srv279 is UP: PING OK - Packet loss = 0%, RTA = 1.51 ms [21:01:41] PROBLEM - Apache HTTP on srv272 is CRITICAL: Connection refused [21:01:42] PROBLEM - Memcached on srv269 is CRITICAL: Connection refused [21:01:50] PROBLEM - Memcached on srv268 is CRITICAL: Connection refused [21:01:51] RECOVERY - SSH on srv263 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [21:01:51] PROBLEM - Memcached on srv270 is CRITICAL: Connection refused [21:01:59] PROBLEM - SSH on srv270 is CRITICAL: Connection refused [21:02:08] PROBLEM - SSH on srv273 is CRITICAL: Connection refused [21:02:17] PROBLEM - Apache HTTP on srv274 is CRITICAL: Connection refused [21:02:17] PROBLEM - SSH on srv271 is CRITICAL: Connection refused [21:02:17] RECOVERY - Host srv280 is UP: PING OK - Packet loss = 0%, RTA = 1.94 ms 
[21:02:26] PROBLEM - SSH on srv269 is CRITICAL: Connection refused [21:02:35] PROBLEM - SSH on srv274 is CRITICAL: Connection refused [21:02:35] PROBLEM - SSH on srv272 is CRITICAL: Connection refused [21:02:44] RECOVERY - SSH on srv264 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [21:02:53] PROBLEM - Memcached on srv274 is CRITICAL: Connection refused [21:02:53] PROBLEM - Apache HTTP on srv273 is CRITICAL: Connection refused [21:03:12] PROBLEM - Apache HTTP on srv271 is CRITICAL: Connection refused [21:03:12] PROBLEM - Memcached on srv271 is CRITICAL: Connection refused [21:03:20] RECOVERY - SSH on srv265 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [21:03:20] PROBLEM - Memcached on srv275 is CRITICAL: Connection refused [21:03:56] PROBLEM - SSH on srv275 is CRITICAL: Connection refused [21:04:05] PROBLEM - Apache HTTP on srv275 is CRITICAL: Connection refused [21:04:14] PROBLEM - Apache HTTP on srv277 is CRITICAL: Connection refused [21:04:14] RECOVERY - SSH on srv267 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [21:04:23] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [21:04:41] PROBLEM - Memcached on srv278 is CRITICAL: Connection refused [21:05:08] PROBLEM - SSH on srv278 is CRITICAL: Connection refused [21:05:09] PROBLEM - Memcached on srv279 is CRITICAL: Connection refused [21:05:09] PROBLEM - Host srv269 is DOWN: PING CRITICAL - Packet loss = 100% [21:05:17] PROBLEM - Memcached on srv277 is CRITICAL: Connection refused [21:05:17] PROBLEM - SSH on srv277 is CRITICAL: Connection refused [21:05:26] RECOVERY - SSH on srv268 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [21:05:35] RECOVERY - Apache HTTP on srv291 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.006 seconds [21:05:35] PROBLEM - SSH on srv280 is CRITICAL: Connection refused [21:05:44] RECOVERY - SSH on srv269 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [21:05:44] PROBLEM - SSH on srv279 is CRITICAL: Connection refused [21:05:53] RECOVERY - Host srv269 is UP: PING OK - Packet loss = 0%, RTA = 0.49 ms [21:06:02] PROBLEM - Apache HTTP on srv279 is CRITICAL: Connection refused [21:06:02] PROBLEM - Apache HTTP on srv280 is CRITICAL: Connection refused [21:06:38] PROBLEM - Memcached on srv280 is CRITICAL: Connection refused [21:06:47] RECOVERY - SSH on srv270 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [21:06:53] Year 2012 in progress?:P [21:07:05] RECOVERY - SSH on srv273 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [21:07:14] RECOVERY - SSH on srv271 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [21:07:32] RECOVERY - SSH on srv272 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [21:07:41] RECOVERY - SSH on srv274 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [21:08:26] RECOVERY - SSH on srv278 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [21:08:35] RECOVERY - SSH on srv277 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [21:08:44] RECOVERY - SSH on srv276 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [21:08:54] RECOVERY - SSH on srv275 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [21:09:03] heh [21:10:32] RECOVERY - SSH on srv280 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [21:10:32] RECOVERY - SSH on srv279 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [21:15:56] PROBLEM - NTP on srv258 is CRITICAL: NTP CRITICAL: No response from NTP server [21:15:56] PROBLEM - NTP 
on srv262 is CRITICAL: NTP CRITICAL: No response from NTP server [21:15:56] PROBLEM - NTP on srv264 is CRITICAL: NTP CRITICAL: No response from NTP server [21:16:14] PROBLEM - NTP on srv263 is CRITICAL: NTP CRITICAL: No response from NTP server [21:17:44] RECOVERY - NTP on srv291 is OK: NTP OK: Offset -0.1137701273 secs [21:19:05] RECOVERY - Apache HTTP on srv258 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.004 seconds [21:19:05] RECOVERY - Apache HTTP on srv270 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.004 seconds [21:19:59] RECOVERY - Apache HTTP on srv292 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.003 seconds [21:20:08] RECOVERY - NTP on srv290 is OK: NTP OK: Offset -0.04360198975 secs [21:21:29] PROBLEM - NTP on srv274 is CRITICAL: NTP CRITICAL: No response from NTP server [21:21:48] PROBLEM - NTP on srv273 is CRITICAL: NTP CRITICAL: No response from NTP server [21:21:56] PROBLEM - NTP on srv259 is CRITICAL: NTP CRITICAL: No response from NTP server [21:22:05] PROBLEM - NTP on srv271 is CRITICAL: NTP CRITICAL: No response from NTP server [21:22:23] PROBLEM - NTP on srv272 is CRITICAL: NTP CRITICAL: No response from NTP server [21:23:44] PROBLEM - NTP on srv260 is CRITICAL: NTP CRITICAL: No response from NTP server [21:25:23] PROBLEM - NTP on srv261 is CRITICAL: NTP CRITICAL: No response from NTP server [21:25:32] PROBLEM - NTP on srv265 is CRITICAL: NTP CRITICAL: No response from NTP server [21:28:14] PROBLEM - NTP on srv267 is CRITICAL: NTP CRITICAL: No response from NTP server [21:28:23] PROBLEM - NTP on srv268 is CRITICAL: NTP CRITICAL: No response from NTP server [21:29:26] PROBLEM - NTP on srv269 is CRITICAL: NTP CRITICAL: No response from NTP server [21:29:44] PROBLEM - NTP on srv270 is CRITICAL: NTP CRITICAL: Offset unknown [21:31:41] PROBLEM - NTP on srv277 is CRITICAL: NTP CRITICAL: No response from NTP server [21:31:50] PROBLEM - NTP on srv276 is CRITICAL: NTP CRITICAL: No response from NTP server [21:32:09] PROBLEM - NTP on srv275 is CRITICAL: NTP CRITICAL: No response from NTP server [21:32:09] PROBLEM - NTP on srv278 is CRITICAL: NTP CRITICAL: No response from NTP server [21:32:26] RECOVERY - Apache HTTP on srv271 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.009 seconds [21:32:35] RECOVERY - Apache HTTP on srv259 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.006 seconds [21:33:40] PROBLEM - NTP on srv280 is CRITICAL: NTP CRITICAL: No response from NTP server [21:33:47] PROBLEM - NTP on srv279 is CRITICAL: NTP CRITICAL: No response from NTP server [21:34:23] RECOVERY - Apache HTTP on srv293 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.009 seconds [21:40:08] New patchset: Brion VIBBER; "Enable mobilefrontend for wikivoyage -- need to test it before letting the mobile redirector go wild" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/34220 [21:44:28] RECOVERY - Apache HTTP on srv272 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.005 seconds [21:44:46] RECOVERY - Apache HTTP on srv260 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.002 seconds [21:45:13] RECOVERY - NTP on srv270 is OK: NTP OK: Offset -0.04301655293 secs [21:45:31] RECOVERY - NTP on srv258 is OK: NTP OK: Offset -0.0559180975 secs [21:47:28] RECOVERY - NTP on srv292 is OK: NTP OK: Offset -0.05022609234 secs [21:47:28] RECOVERY - Apache HTTP on srv294 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.001 seconds [21:55:16] RECOVERY - NTP on srv260 is OK: NTP OK: Offset -0.06375491619 secs [21:57:22] RECOVERY - Apache HTTP on srv261 is OK: HTTP OK HTTP/1.1 200 OK - 454 
bytes in 0.008 seconds [21:57:31] RECOVERY - Apache HTTP on srv273 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.011 seconds [21:58:25] RECOVERY - NTP on srv259 is OK: NTP OK: Offset -0.05052435398 secs [21:59:46] RECOVERY - NTP on srv271 is OK: NTP OK: Offset -0.04410970211 secs [22:01:16] RECOVERY - NTP on srv293 is OK: NTP OK: Offset -0.0575067997 secs [22:04:52] RECOVERY - Apache HTTP on srv295 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 2.996 seconds [22:10:16] RECOVERY - Apache HTTP on srv262 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.002 seconds [22:10:34] RECOVERY - Apache HTTP on srv274 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.004 seconds [22:11:01] RECOVERY - NTP on srv272 is OK: NTP OK: Offset -0.05880987644 secs [22:13:13] !log depooling ssl1003 [22:13:19] Logged the message, Master [22:15:31] RECOVERY - NTP on srv294 is OK: NTP OK: Offset -0.04527282715 secs [22:21:49] RECOVERY - NTP on srv262 is OK: NTP OK: Offset 0.02152311802 secs [22:21:49] RECOVERY - Apache HTTP on srv263 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.016 seconds [22:23:01] RECOVERY - NTP on srv261 is OK: NTP OK: Offset -0.04834794998 secs [22:23:06] New patchset: Dzahn; "RT-804, umask for wikidev users, overwritten by /etc/profile on <= lucid" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34223 [22:23:10] RECOVERY - Apache HTTP on srv275 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 1.347 seconds [22:23:19] RECOVERY - NTP on srv273 is OK: NTP OK: Offset -0.03881847858 secs [22:24:30] Change abandoned: awjrichards; "Resolved with a different approach taken in https://gerrit.wikimedia.org/r/#/c/34219/" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/34183 [22:26:08] New patchset: Dzahn; "RT-804, umask for wikidev users, overwritten by /etc/profile on <= lucid" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34223 [22:26:28] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [22:28:02] New patchset: Dzahn; "RT-804, umask for wikidev users, overwritten by /etc/profile on <= lucid" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34223 [22:31:05] binasher: im trying to find what file controls the invocation of the mobile redirector script but am failing - do you know? [22:32:01] RECOVERY - NTP on srv295 is OK: NTP OK: Offset -0.03044438362 secs [22:33:13] RECOVERY - NTP on srv275 is OK: NTP OK: Offset -0.06258249283 secs [22:33:31] PROBLEM - Puppet freshness on analytics1002 is CRITICAL: Puppet has not run in the last 10 hours [22:35:20] !log upgrading ssl1003 to precise [22:35:27] Logged the message, Master [22:35:52] !log depooling ssl4 [22:35:55] RECOVERY - NTP on srv274 is OK: NTP OK: Offset -0.05456626415 secs [22:35:58] Logged the message, Master [22:37:23] awjr: the squid config [22:37:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:37:52] RECOVERY - Apache HTTP on srv264 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.005 seconds [22:37:53] RECOVERY - Apache HTTP on srv276 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.006 seconds [22:38:07] binasher: what repo is that in? 
im not seeing it in puppet or debs/squid [22:39:49] awjr: check out wikitech, there's a conf section in the squid doc [22:40:19] oic [22:40:21] thanks binasher [22:41:09] New patchset: Dzahn; "RT-804, umask for wikidev users, overwritten by /etc/profile on <= lucid" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34223 [22:43:52] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.028 seconds [22:49:43] RECOVERY - NTP on srv263 is OK: NTP OK: Offset -0.04950976372 secs [22:50:12] LeslieCarr: Hi, I need 2 small things [22:50:35] LeslieCarr: one important, one not so important. Let me know if you can do it or if I need to something first. [22:51:05] LeslieCarr: I'm not in the "jenkins" group gallium.wikimedia.org. hashar is in it. I need it to deploy changes. [22:51:31] !log re-pooling ssl4 and depooling ssl3 [22:51:32] RECOVERY - NTP on srv264 is OK: NTP OK: Offset -0.006415367126 secs [22:51:32] since the git repo is owned by user "jenkins" there. [22:51:37] Logged the message, Master [22:52:23] Is the "IRC duty" up to date? [22:52:31] :D [22:52:34] RECOVERY - Apache HTTP on srv277 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.003 seconds [22:52:37] "not a lie" lol [22:53:03] New patchset: Hashar; "new jenkins sudoer: Timo "Krinkle" Tijhof" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34226 [22:53:46] !log rebooting ssl3 [22:53:52] Logged the message, Master [22:54:04] RECOVERY - Apache HTTP on srv265 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.003 seconds [22:54:45] https://rt.wikimedia.org/Ticket/Display.html?id=3942 [22:54:45] https://rt.wikimedia.org/Ticket/Display.html?id=3943 [22:56:10] PROBLEM - Host ssl3 is DOWN: PING CRITICAL - Packet loss = 100% [22:56:25] Krinkle: and then https://gerrit.wikimedia.org/r/#/c/34226/ [22:56:28] and you will be fine :-D [22:57:02] Krinkle: I have deployed our changes to gallium [22:57:07] ok [22:57:09] Thanks [22:57:15] the grunt symlink and Grunt: Add basic build file for linting with jshint. [22:57:22] RECOVERY - Host ssl3 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [22:57:29] so that should not block you too much [22:57:32] Since you added me manaully to jenkins admin, it means I can now add it to the job configuration for universal linter. [22:57:56] (not doing that yet though, need to make check how it plays with non-mediawiki/core) [22:58:10] i.e. repos that don't have a .jshintrc file [22:59:02] Krinkle: should probably look up in parent directory [22:59:06] or the parent again :-] [22:59:12] so you could use the MediaWiki core one [22:59:19] I think it does that already, [22:59:32] !log rebooting ssl1004 [22:59:34] and it has very tolerant defaults besides that [22:59:35] err [22:59:39] Logged the message, Master [22:59:40] !log make that rebooting ssl1003 [22:59:43] I just haven't verified yet how that plays out [22:59:47] Logged the message, Master [22:59:55] Krinkle: and if you pass --checkstyle-reporter to jshint, the xml result can be interpreted by Jenkins [23:00:19] hashar: so, is open registration an option yet? 
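A minimal sketch of the jshint-to-Jenkins idea hashar mentions just above (running the linter from a Jenkins "Execute shell" build step and emitting checkstyle XML for Jenkins to render). The target directory and output filename are illustrative, not taken from the real job configuration; the flag spelling follows the node-jshint of that era, as quoted in the conversation.

    #!/bin/bash
    # Hypothetical "Execute shell" build step for a lint job (a sketch, not the actual job).
    set -e
    cd "$WORKSPACE"                  # standard Jenkins workspace variable
    # --checkstyle-reporter is the flag mentioned above; newer jshint releases
    # spell this "--reporter checkstyle" instead.
    jshint --checkstyle-reporter resources/ > jshint-checkstyle.xml || true
    # A post-build "Publish Checkstyle analysis results" action would then pick up the XML.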
:) [23:00:26] No [23:00:30] :D [23:00:30] :( [23:00:37] * MaxSem looks around [23:00:47] * Ryan_Lane has been waiting for like 2 months now [23:00:47] yeah whenever sqlite, wikidata, random failure, people stop writing tests and so on [23:01:03] Ryan_Lane: got zuul in labs for real now [23:01:07] cool [23:01:11] Ryan_Lane: and polishing it up [23:01:28] We need Zuul to be able to run tests on merge, and then Vagrant to run builds in a sandboxed environment. [23:01:34] Ryan_Lane: I have been spending a long time in #openstack-infra, there are some smart and cool guys there :-] [23:01:39] yep [23:01:42] good people in there [23:01:45] afaik those are the 2 major prerequisites. [23:01:49] :( [23:02:12] Ryan_Lane: anyway, that should be easy to deploy now. Basically: deploy some role::zuul::production class on gallium [23:02:23] Ryan_Lane: and then deploy the updated jenkins jobs [23:02:27] * Ryan_Lane nods [23:02:34] I wrote the classes [23:02:36] the jobs too [23:02:48] been testing them last week while setting up a new labs instance from scratch [23:02:53] just finished it :-] [23:03:04] cool. glad to see there's progress, either way [23:03:11] so now I am going to play a bit with the new conf [23:03:13] yeah been slow [23:03:15] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/34220 [23:03:18] I need to stop multitasking [23:03:22] PROBLEM - HTTPS on ssl1003 is CRITICAL: Connection refused [23:03:29] and learned to keep IRC shutdown (too many IRQ comes from IRC) [23:03:36] s/learned/learn/ [23:04:11] hashar: Does ant only set a property if not set already? [23:04:21] Krinkle: an ant variable is only set once [23:04:25] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [23:04:26] hashar: because properties.defaults sets sourcedir to the GIT-Fetching workspace [23:04:37] hashar: but then it sets it to env.WORKSPACE inside ant.xml [23:04:42] so... [23:05:05] the ant script loads private.properties, then job.properties, then falls back to default.properties [23:05:10] RECOVERY - HTTPS on ssl1003 is OK: OK - Certificate will expire on 08/22/2015 22:23. [23:05:18] if job.properties sets a variable, it will override whatever value is set in default.properties [23:05:24] the job.properties are on a per-job basis [23:05:36] default.properties is global and symlinked in each job directory [23:05:44] Basically what I want to know: if I exec grunt lint from a shell exec for the "Universal-Linter" job, do I need to "cd" to anything, or is the current directory the correct one (correct one being the repo that the change was in, not the parent or child dir for mediawiki core)? [23:05:49] THEN, you can also pass a property to the ant script :-] [23:05:56] ah [23:06:06] !log repooling ssl1003 (upgrade to precise complete) [23:06:12] Logged the message, Master [23:06:31] RECOVERY - NTP on srv276 is OK: NTP OK: Offset -0.04197645187 secs [23:06:41] !log depooling ssl1002 for upgrade to precise [23:06:47] hashar: because if you exec lint from mediawiki/core it will ignore the extensions directory (as it should) [23:06:47] Logged the message, Master [23:06:49] Krinkle: that will be the job workspace, aka /var/lib/jenkins/jobs/MediaWiki-Universal-Linter/workspace [23:06:58] Okay [23:07:04] which is practically empty? [23:07:11] yup [23:07:14] So..
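To make the workspace point above concrete, a sketch of what a shell-exec lint step has to do given that the Universal-Linter job's own workspace is empty. The Universal-Linter path is the one hashar quotes; the GIT-Fetching workspace path is an assumption based on the job name mentioned in this conversation, not a path confirmed in the log.

    # The child job's own workspace holds nothing to lint:
    cd /var/lib/jenkins/jobs/MediaWiki-Universal-Linter/workspace && ls -A   # practically empty, per hashar
    # so the step has to run from the checkout that the fetching job produced, e.g. (assumed path):
    cd /var/lib/jenkins/jobs/MediaWiki-GIT-Fetching/workspace
    grunt lint    # per the discussion above, picks up the repo's .jshintrc, or tolerant defaults when absent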
[23:07:21] I originally wanted to share the mediawiki sourcedir between jobs [23:07:33] which led to much madness [23:07:43] we should just fetch a copy in each job [23:07:59] also, we shouldn't have the linter as a child job, because it is now tied to GIT-Fetcher anyway. [23:08:03] that is what the python script to generate the jobs will do [23:08:22] We have 1 job for one or more repos, and they do everything inside there by executing different tasks. [23:08:27] !log maxsem synchronized wmf-config 'https://gerrit.wikimedia.org/r/#/c/34220/ and https://gerrit.wikimedia.org/r/#/c/33786/' [23:08:34] Logged the message, Master [23:08:38] hashar: But for the here and now, where do I cd into? [23:09:07] Krinkle: I think you should simply set a new job that would use Gerrit Trigger (simply copy mediawiki-git-fetching and edit it) [23:09:32] Krinkle: then set the git repo to be test/mediawiki/core if that one still exists [23:09:44] Krinkle: this way you can test out without breaking existing jobs! [23:10:19] I know, but I mean after that. I'm pretty sure that will work, and you're gone in a few minutes. [23:10:35] test first :-] [23:10:38] be sure [23:10:43] I will [23:10:44] then we apply it :-] [23:10:52] I am confident it will work though [23:10:53] this is a jQuery.Promise() [23:11:03] but if you want to play with it, you are safer using a testing job [23:11:15] I will, but I want to know where to go next when I'm done with that [23:11:40] Since different repo jobs all trigger the Universal-Linter (right?), I need to know where to cd into to run the lint. [23:11:51] Or is it only triggered by mediawiki/core, now? [23:13:37] Krinkle: ahh sorry [23:13:41] Krinkle: so yeah that is only for core [23:13:58] okay, so there is no problem (yet) with navigating back to the extension dir for extension jobs [23:14:07] jshint should probably be added to each job [23:14:21] yeah you should be safe [23:14:22] :) [23:14:28] PROBLEM - Host ssl1002 is DOWN: PING CRITICAL - Packet loss = 100% [23:14:41] hashar: Ah, so we will (eventually) replace the universal-linter child job with a shell exec in the main job. That makes sense [23:14:53] since those tasks will all be grunt tasks at some point [23:14:54] Krinkle: there is an ant target that dumps properties to the console; something like echoproperties [23:15:02] no point in making new child jobs for a 1-line shell exec. [23:15:09] ;) [23:15:21] Krinkle: still have to write a 30000 feet overview of jenkins job builder [23:15:39] will start that once zuul is deployed and the new workflow is somewhat stable [23:15:53] I already got it to generate jobs, not sure they actually work though :] [23:16:08] then adding a jshint target would be all about adding a line in a yaml file \O/ [23:16:10] hashar: One last thing: what makes it comment on Gerrit? I don't see where that is triggered. I will run it on a test/* repo, but if I don't know where the comment is triggered, it might still be on mediawiki/core [23:16:16] PROBLEM - Host ssl3 is DOWN: PING CRITICAL - Packet loss = 100% [23:16:24] hashar: grunt, not jshint :) [23:16:30] abstraction [23:16:40] Krinkle: the comment is handled by the Jenkins Gerrit Trigger Plugin [23:16:43] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:16:44] later on, will be managed by zuul [23:16:45] Ah, okay. [23:16:56] I am not sure how one can modify it [23:17:50] Krinkle: i am out now [23:17:55] already got 2 private messages [23:18:02] need to rush or I will never sleep [23:18:05] cya!
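The jenkins-job-builder plan hashar sketches above boils down to regenerating Jenkins job XML from YAML definitions. A rough sketch of that workflow follows; the YAML path and ini filename are made up for illustration, since the log does not show the actual commands or config layout.

    # Render the YAML job definitions locally to inspect the generated XML:
    jenkins-jobs test config/jobs/ -o /tmp/jjb-out
    # Then push the generated jobs to the Jenkins master (credentials read from the ini file):
    jenkins-jobs --conf jenkins_jobs.ini update config/jobs/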
[23:24:04] RECOVERY - Host ssl1002 is UP: PING OK - Packet loss = 0%, RTA = 26.50 ms [23:27:49] RECOVERY - Host ssl3 is UP: PING OK - Packet loss = 0%, RTA = 0.89 ms [23:29:56] woosters: can you give my team an update for https://rt.wikimedia.org/Ticket/Display.html?id=2541 ? [23:31:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.595 seconds [23:32:10] PROBLEM - Host ssl3 is DOWN: PING CRITICAL - Packet loss = 100% [23:32:25] tfinc .. looks like robh has not gotten to it yet. Will send him a reminder [23:33:22] RECOVERY - Host ssl3 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [23:33:41] yes. we haven't seen any movement on this since march [23:33:50] and were taking heat on bugzilla for it https://bugzilla.wikimedia.org/show_bug.cgi?id=34788 [23:33:55] what else can we do to move it along ? [23:34:17] erm, I'm not sure how we'll make this [23:34:25] it's probably discussed before [23:34:36] but I do wonder how this will work [23:34:49] a single certificate with dozens of wildcard SANs for all possible projects? [23:35:37] it would be so much easier if we didn't need a .m sub domain [23:35:52] i wait for the day where you guys tell is we don't need it anymore and can cache separately [23:35:53] well, not dozens, maybe 14 or so [23:37:33] !log repooling ssl3 depooling ssl4 [23:37:34] New patchset: Reedy; "Make purgeList wikiless" [operations/mediawiki-multiversion] (master) - https://gerrit.wikimedia.org/r/34231 [23:37:40] Logged the message, Master [23:37:48] "RSA host key for srv280 has changed and you have requested strict checking." [23:37:57] reinstall fun? [23:38:10] yeah [23:38:21] puppet hasn't caught up yet and fixed the known keys entries [23:41:55] PROBLEM - Host ssl4 is DOWN: PING CRITICAL - Packet loss = 100% [23:42:13] RECOVERY - Host ssl4 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [23:43:04] Ok, so... [23:43:11] ExtensionDistributor is writing to upload7 [23:43:23] But upload.wikimedia.org still seems to be serving from upload6 [23:43:45] nas vs ms7 [23:43:56] meh, and stuff like srv264: rsync: mkdir "/apache/common-local/wmf-config" failed: No such file or directory (2) [23:44:12] guess not a suitable time for scap [23:44:26] Just ignore them [23:44:44] notpeter should ensure sync-common has been run on them before putting them back into rotation [23:46:18] Unless anyone says otherwise... [23:46:32] I'm going to change $wgExtDistTarDir = '/mnt/upload7/ext-dist'; for $wgExtDistTarDir = '/mnt/upload6/ext-dist'; in CommonSettings.php [23:47:34] that's correct [23:47:41] there's no web server pointed at nas right now [23:47:51] Right [23:48:01] Should I stop it using nas as the working copy too? [23:48:06] $wgExtDistWorkingCopy = '/mnt/upload7/private/ExtensionDistributor/mw-snapshot'; [23:48:13] or is that indifferent? [23:48:17] no idea what that is [23:48:28] but I'd say use upload6 for both, just to not be dependent on both [23:49:02] have I told you how much I'd like to see extdist replaced with something that doesn't use upload infrastructure? [23:49:07] heh [23:49:09] Yeaaah [23:49:42] a 10 line README file explaining how to fetch from Git/Svn yourself ;) [23:49:49] That's fine. I'll make sure the working copy on upload6 is svn upped/git pulled [23:50:07] curl | sh ? [23:50:08] :P [23:50:34] New patchset: Reedy; "Point extdist back at upload6" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/34233 [23:52:21] Oh, grr. 
I can't run things as extdist [23:53:16] though, requesting things should update them.. [23:54:18] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/34233 [23:58:06] New patchset: Reedy; "Add solarium to extension-list" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/34237 [23:58:22] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/34237 [23:59:47] !log reedy synchronized wmf-config/CommonSettings.php [23:59:54] Logged the message, Master
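A sketch pulling together the extdist move Reedy describes above. Only the CommonSettings.php edit and the final sync appear in the log; the sync command form and the upload6 working-copy path are assumptions modelled on the upload7 values quoted earlier.

    # 1. Point both extdist paths at upload6 in wmf-config/CommonSettings.php, then deploy the file:
    sync-file wmf-config/CommonSettings.php 'Point extdist back at upload6'
    # 2. Refresh the working copy that the extension tarballs are built from
    #    (would need to run as the extdist user, which is the permission problem noted above):
    cd /mnt/upload6/private/ExtensionDistributor/mw-snapshot && git pull   # or "svn up" for svn-hosted extensions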