[00:02:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.022 seconds
[00:28:24] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours
[00:34:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:49:51] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.031 seconds
[01:22:15] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:31:24] PROBLEM - Puppet freshness on ms-fe1 is CRITICAL: Puppet has not run in the last 10 hours
[01:36:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.189 seconds
[01:41:54] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 281 seconds
[01:42:30] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 256 seconds
[01:44:33] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 2 seconds
[01:45:54] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 17 seconds
[01:49:48] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[02:10:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:23:42] PROBLEM - Puppet freshness on srv237 is CRITICAL: Puppet has not run in the last 10 hours
[02:23:42] PROBLEM - Puppet freshness on db1022 is CRITICAL: Puppet has not run in the last 10 hours
[02:24:45] PROBLEM - Puppet freshness on db1034 is CRITICAL: Puppet has not run in the last 10 hours
[02:24:45] PROBLEM - Puppet freshness on cp1015 is CRITICAL: Puppet has not run in the last 10 hours
[02:24:45] PROBLEM - Puppet freshness on ms-be9 is CRITICAL: Puppet has not run in the last 10 hours
[02:24:45] PROBLEM - Puppet freshness on mw30 is CRITICAL: Puppet has not run in the last 10 hours
[02:24:45] PROBLEM - Puppet freshness on db26 is CRITICAL: Puppet has not run in the last 10 hours
[02:24:46] PROBLEM - Puppet freshness on db51 is CRITICAL: Puppet has not run in the last 10 hours
[02:24:46] PROBLEM - Puppet freshness on sanger is CRITICAL: Puppet has not run in the last 10 hours
[02:24:47] PROBLEM - Puppet freshness on srv249 is CRITICAL: Puppet has not run in the last 10 hours
[02:24:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.454 seconds
[02:25:48] PROBLEM - Puppet freshness on cp1029 is CRITICAL: Puppet has not run in the last 10 hours
[02:25:48] PROBLEM - Puppet freshness on db25 is CRITICAL: Puppet has not run in the last 10 hours
[02:25:48] PROBLEM - Puppet freshness on srv212 is CRITICAL: Puppet has not run in the last 10 hours
[02:39:09] RECOVERY - Puppet freshness on srv212 is OK: puppet ran at Mon Oct 1 02:38:34 UTC 2012
[02:40:03] RECOVERY - Puppet freshness on db1034 is OK: puppet ran at Mon Oct 1 02:39:37 UTC 2012
[02:40:03] RECOVERY - Puppet freshness on db25 is OK: puppet ran at Mon Oct 1 02:39:45 UTC 2012
[02:45:27] RECOVERY - Puppet freshness on cp1029 is OK: puppet ran at Mon Oct 1 02:45:11 UTC 2012
[02:46:03] RECOVERY - Puppet freshness on ms-be9 is OK: puppet ran at Mon Oct 1 02:45:57 UTC 2012
[02:46:39] RECOVERY - Puppet freshness on srv240 is OK: puppet ran at Mon Oct 1 02:46:23 UTC 2012
[02:48:09] RECOVERY - Puppet freshness on db1022 is OK: puppet ran at Mon Oct 1 02:47:49 UTC 2012
[02:50:06] RECOVERY - Puppet freshness on sq80 is OK: puppet ran at Mon Oct 1 02:50:02 UTC 2012
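A hypothetical sketch of what a "Puppet freshness" check like the ones paging above might do: compare the age of the agent's last-run state file against the 10-hour threshold mentioned in the alerts. The state-file path and the mechanism are assumptions; the real Nagios check may work differently (e.g. via passive results).

    #!/usr/bin/env python
    # Hedged sketch of a Nagios-style "Puppet freshness" check.
    import os
    import sys
    import time

    STATE_FILE = "/var/lib/puppet/state/state.yaml"  # path is an assumption
    THRESHOLD = 10 * 3600  # "has not run in the last 10 hours"

    try:
        age = time.time() - os.stat(STATE_FILE).st_mtime
    except OSError:
        print("CRITICAL: no Puppet state file found")
        sys.exit(2)

    if age > THRESHOLD:
        print("CRITICAL: Puppet has not run in the last 10 hours")
        sys.exit(2)
    print("OK: Puppet ran %d minutes ago" % (age // 60))
    sys.exit(0)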
[02:50:24] RECOVERY - Puppet freshness on mw61 is OK: puppet ran at Mon Oct 1 02:50:07 UTC 2012
[02:52:03] RECOVERY - Puppet freshness on srv237 is OK: puppet ran at Mon Oct 1 02:51:58 UTC 2012
[02:53:33] RECOVERY - Puppet freshness on palladium is OK: puppet ran at Mon Oct 1 02:53:11 UTC 2012
[02:54:45] PROBLEM - Puppet freshness on db9 is CRITICAL: Puppet has not run in the last 10 hours
[02:54:45] PROBLEM - Puppet freshness on ms-fe1002 is CRITICAL: Puppet has not run in the last 10 hours
[02:56:33] RECOVERY - Puppet freshness on mw30 is OK: puppet ran at Mon Oct 1 02:56:20 UTC 2012
[02:57:36] RECOVERY - Puppet freshness on srv249 is OK: puppet ran at Mon Oct 1 02:57:18 UTC 2012
[02:57:45] RECOVERY - Puppet freshness on ms-fe1002 is OK: puppet ran at Mon Oct 1 02:57:33 UTC 2012
[02:57:54] RECOVERY - Puppet freshness on cp1015 is OK: puppet ran at Mon Oct 1 02:57:38 UTC 2012
[02:59:33] RECOVERY - Puppet freshness on db51 is OK: puppet ran at Mon Oct 1 02:59:06 UTC 2012
[03:01:03] RECOVERY - Puppet freshness on sanger is OK: puppet ran at Mon Oct 1 03:00:53 UTC 2012
[03:02:06] RECOVERY - Puppet freshness on db9 is OK: puppet ran at Mon Oct 1 03:01:47 UTC 2012
[03:07:39] RECOVERY - Puppet freshness on db26 is OK: puppet ran at Mon Oct 1 03:07:22 UTC 2012
[03:12:45] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours
[03:12:45] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours
[04:02:41] PROBLEM - Puppet freshness on mw1 is CRITICAL: Puppet has not run in the last 10 hours
[04:02:41] PROBLEM - Puppet freshness on mw11 is CRITICAL: Puppet has not run in the last 10 hours
[04:02:41] PROBLEM - Puppet freshness on mw10 is CRITICAL: Puppet has not run in the last 10 hours
[04:02:41] PROBLEM - Puppet freshness on mw15 is CRITICAL: Puppet has not run in the last 10 hours
[04:02:41] PROBLEM - Puppet freshness on mw12 is CRITICAL: Puppet has not run in the last 10 hours
[04:02:42] PROBLEM - Puppet freshness on mw13 is CRITICAL: Puppet has not run in the last 10 hours
[04:02:42] PROBLEM - Puppet freshness on mw2 is CRITICAL: Puppet has not run in the last 10 hours
[04:02:43] PROBLEM - Puppet freshness on mw16 is CRITICAL: Puppet has not run in the last 10 hours
[04:02:43] PROBLEM - Puppet freshness on mw14 is CRITICAL: Puppet has not run in the last 10 hours
[04:02:44] PROBLEM - Puppet freshness on mw3 is CRITICAL: Puppet has not run in the last 10 hours
[04:02:44] PROBLEM - Puppet freshness on mw4 is CRITICAL: Puppet has not run in the last 10 hours
[04:02:45] PROBLEM - Puppet freshness on mw5 is CRITICAL: Puppet has not run in the last 10 hours
[04:02:45] PROBLEM - Puppet freshness on mw7 is CRITICAL: Puppet has not run in the last 10 hours
[04:02:46] PROBLEM - Puppet freshness on mw8 is CRITICAL: Puppet has not run in the last 10 hours
[04:02:46] PROBLEM - Puppet freshness on mw6 is CRITICAL: Puppet has not run in the last 10 hours
[04:02:47] PROBLEM - Puppet freshness on mw9 is CRITICAL: Puppet has not run in the last 10 hours
[04:34:10] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 271 seconds
[04:37:19] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 231 seconds
[04:39:25] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 184 seconds
[04:42:34] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 214 seconds
[04:50:04] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 18 seconds
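In the same hedged spirit, a minimal sketch of a "MySQL Slave Delay" style check like the ones flapping above. Assumptions: pymysql is available, the thresholds are illustrative, and the production check may read a heartbeat table instead of SHOW SLAVE STATUS.

    # Hedged sketch of a replication-delay check; not the production code.
    import sys
    import pymysql

    WARN, CRIT = 120, 180  # seconds; illustrative thresholds

    conn = pymysql.connect(read_default_file="/etc/nagios/my.cnf")  # assumption
    with conn.cursor(pymysql.cursors.DictCursor) as cur:
        cur.execute("SHOW SLAVE STATUS")
        row = cur.fetchone()

    delay = row and row["Seconds_Behind_Master"]
    if delay is None:
        print("CRITICAL: replication not running")
        sys.exit(2)
    if delay >= CRIT:
        print("CRIT replication delay %d seconds" % delay)
        sys.exit(2)
    if delay >= WARN:
        print("WARN replication delay %d seconds" % delay)
        sys.exit(1)
    print("OK replication delay %d seconds" % delay)
    sys.exit(0)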
[04:54:16] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 11 seconds
[04:56:22] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Puppet has not run in the last 10 hours
[05:10:19] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours
[05:51:15] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 230 seconds
[05:52:00] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 274 seconds
[05:54:33] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 12 seconds
[05:56:30] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 9 seconds
[06:09:19] New patchset: ArielGlenn; "initial commit of bz2 multistream toy offline reader" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/25851
[06:10:38] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/25851
[07:30:44] New patchset: Dereckson; "manpages for our misc scripts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16606
[07:32:42] New review: Dereckson; "PS8: improving clear-profile style according PS7 comments." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/16606
[07:32:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16606
[07:39:14] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours
[07:39:14] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[07:39:14] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours
[07:39:14] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours
[07:39:14] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[07:41:15] New patchset: Matthias Mullie; "Make abusefilter emergency disable more sensible" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25855
[08:13:08] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours
[08:13:51] heya
[08:15:09] hello
[08:15:17] how's sleep?
[08:15:32] fucked up
[08:15:36] woke up at 5am
[08:15:54] been there, finally rid of that.
[08:16:13] how's London?
[08:16:25] rainy :)
[08:16:29] ah!
[08:16:38] haven't seen much, sun rose at 7am and I came out for breakfast
[08:16:40] and now working
[08:16:59] hmm days are shorter there, forgot about that
[08:17:00] I'll try to go to the Mozilla Space, see if anyone's there and if they can kindly host me :)
[08:17:07] nice!
[08:17:16] yeah, sunrise 7am and sunset 6:47 pm or something
[08:17:21] I was very impressed
[08:17:39] I like that during the summer (ridiculously long days) but
[08:17:39] I opened my blinds at about 6am and it was very dark
[08:17:44] the short days of winter are a killer
[08:17:52] for a moment I thought I hadn't adjusted my clock properly
[08:18:51] heh heh
[08:19:12] it's like that in Athens only very deep into the winter
[08:19:19] and certainly not while still in DST
[08:19:27] exactly
[08:19:37] not for months and months and months
[08:19:56] (remember I grew up at *exactly* the same latitude, so I know how these things go)
[08:20:06] NY winters were a shock
[08:20:17] I didn't know you grew up in NY :)
[08:20:38] I think your answer to "where are you originally from" was "all around" or something :-)
[08:20:46] hehehe
[08:21:08] I didn't
[08:21:12] I grew up in California
[08:21:22] oh
[08:21:22] originally from usually means where were you born
[08:21:39] okay
[08:21:40] and the answer to that is so useless with regard to anything else that I usually don't even bother
[08:21:49] heheh
[08:22:05] it's hard enough with US people, you move a lot
[08:22:15] e.g. college
[08:22:25] well it's more that the country is large enough that a move really can be a large distance
[08:22:46] a few days of car travel
[08:22:58] I guess
[08:23:18] if all moves were for example within the state of California (for size) it wouldn't be such a deal
[08:23:35] my university was closer to my home than my high school... :)
[08:24:08] yeah but look where you live
[08:24:18] the other half of the population lives somewhere else :-P
[08:58:00] PROBLEM - Puppet freshness on ms-be6 is CRITICAL: Puppet has not run in the last 10 hours
[08:59:16] orilly :-/
[09:00:18] hi
[09:00:30] yo
[09:00:34] hi mark!
[09:00:40] welcome back
[09:00:43] finally into a European timezone
[09:00:50] although my sleep is not yet, but will be
[09:00:57] :)
[09:01:06] I don't have that problem when coming back
[09:01:09] especially not this time
[09:01:19] Sleep, what is this. Raise a feature request. :D
[09:01:22] yeah, I'm the opposite
[09:01:25] I didn't sleep during the night flight, arrived at 9 am in the morning, was active until midnight
[09:01:44] i was soldering my in-ears during the evening hehe
[09:01:57] adjusting coming back was better, I never woke up at 5 but I did wind up awake til 3 am :-D
[09:02:13] then slept until 10 am or so the next day
[09:02:55] hm 2.8 TB copied
[09:02:56] that's pathetic
[09:05:05] apergos: how can we tell if ms8 is still up to date?
[09:05:50] well I was trying to get a zfs list over there but it was taking forever
[09:05:53] I'll try that again
[09:06:07] it won't be up to date to the minute anyways, replication doesn't work that way
[09:06:14] that's ok
[09:06:19] as long as it's not a year behind
[09:07:09] PROBLEM - Host cp3001 is DOWN: PING CRITICAL - Packet loss = 100%
[09:07:57] it's a little over an hour behind, according to the last snap recorded as sent from ms7
[09:08:07] cool
[09:08:12] that's more than fine
[09:08:25] we can do an rsync with only-if-newer
[09:08:28] if that zfs list hangs things we could get longer
[09:08:42] sure. just no deletes and you're fine
[09:09:00] a final rsync from ms7 will fix things
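A minimal sketch of the "only-if-newer, no deletes" rsync just discussed; hosts and paths are placeholders. The key points are --update (never overwrite a newer destination file) and the deliberate absence of --delete, so a final rsync from ms7 can reconcile things later.

    # Hedged sketch; source/destination are made-up placeholders.
    import subprocess

    SRC = "ms7.wikimedia.org:/export/upload/"  # placeholder source
    DST = "/mnt/netapp/upload/"                # placeholder destination

    subprocess.check_call([
        "rsync",
        "-a",        # archive mode: recursion, permissions, times
        "--update",  # only-if-newer: skip files newer on the destination
        # note: no --delete, per "just no deletes and you're fine"
        SRC, DST,
    ])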
[09:09:22] I'm a bit concerned that the zfs list is hanging over there still, this could actually hang up the replication in the worst case
[09:09:23] so one of the two esams bits servers just went down
[09:09:24] that is problematic
[09:09:33] lemme restart it
[09:11:29] it's in LACP block
[09:11:49] lacp block?
[09:12:03] Port Link L2 State Dupl Speed Trunk Tag Priori MAC Name
[09:12:03] 2/21 Up LACP-BLOCK Full Auto 2 No level0 000c.dbfc.0b00 cp3001
[09:12:03] 2/23 Up LACP-BLOCK Full Auto 2 No level0 000c.dbfc.0b00 cp3001:eth1
[09:12:14] logging in on serial
[09:13:38] Slave Interface: eth0
[09:13:38] MII Status: up
[09:13:38] Link Failure Count: 2
[09:13:38] Permanent HW addr: 78:2b:cb:45:4a:31
[09:13:38] Aggregator ID: 2
[09:13:39] Slave Interface: eth1
[09:13:39] MII Status: up
[09:13:40] Link Failure Count: 2
[09:13:40] Permanent HW addr: 78:2b:cb:45:4a:32
[09:13:40] Aggregator ID: 1
[09:14:27] Oct 1 09:10:55:I:LACP: Port 2/21 rx state transition: defaulted -> current
[09:14:27] Oct 1 09:10:28:I:Interface ethernet2/21, state down -
[09:14:27] Oct 1 09:10:28:I:LACP: Port 2/21 mux state transition: aggregate -> not aggregate (reason: timeout
[09:14:27] Oct 1 09:10:28:I:LACP: Port 2/21 rx state transition: current -> expired (reason: timeout
[09:14:30] weird
[09:15:15] indeed
[09:15:23] I've never seen anything like that before
[09:15:29] it's probably foundry
[09:16:20] I think they'll kick me out of this cafe
[09:16:27] and the mozilla office manager isn't in yet
[09:17:30] !log Rebooting cp3001, both ports in LACP block
[09:17:41] Logged the message, Master
[09:18:41] this is another nice example which shows that redundancy measures adding complexity don't always increase uptime ;)
[09:18:54] heh
[09:19:06] box rebooted, lacp back up
[09:19:17] we're gonna replace this core switch any day now, so I won't spend more time on it
[09:19:18] RECOVERY - Host cp3001 is UP: PING OK - Packet loss = 0%, RTA = 108.17 ms
[09:24:38] nice
[09:24:41] a df -h on ms8 hangs
[09:27:21] tons of ganglia hung processes too
[09:27:32] that do df
[09:27:55] perhaps because of zfs list
[09:28:46] oldest gmetric is from an hour ago
[09:30:01] as is the zfs list
[09:30:09] so, yes, that's most likely what it is
[09:30:35] i guess we'll wait a bit, starting that copy
[09:30:43] we can use ms1002 too
[09:31:08] I guess those processes are waiting on the replication
[09:31:16] what I hope is that we don't have the mutual deadlock
[09:31:53] we have new mail in /var/mail/root!
[09:32:00] yeah you will
[09:32:02] lots of it
[09:32:03] THE mutual deadlock?
[09:32:14] "this replication didn't finish in time so we're not starting a new one", lots of those
[09:32:15] well
[09:32:43] hahahaha
[09:32:52] one of many, we've had an issue where zfs replication and zfs list wind up each waiting on a lock the other has, or some such thing
[09:33:07] quite some time ago, supposedly fixed but
[09:33:30] what's the cure? reboot?
[09:34:57] lemme find my notes
[09:38:19] okay, leaving this place
[09:38:25] I'll try to find something more permanent
[09:38:32] (my hotel is tiny and has no good wifi :/)
[09:38:34] my notes say that the zfs list will eventually exit, it just might be slow
[09:38:43] i.e. very slow
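For the "how far behind is ms8?" question from 09:05, one approach is to look at the age of the newest received snapshot. A hedged sketch — the dataset name is a placeholder, the creation-date format may vary by platform, and as noted above, `zfs list` itself can be very slow on this box while replication runs:

    # Hedged sketch: report the age of the newest snapshot on a dataset.
    import subprocess
    import time

    DATASET = "export/upload"  # placeholder dataset name

    out = subprocess.check_output([
        "zfs", "list", "-H", "-t", "snapshot",
        "-o", "name,creation", "-s", "creation",
        "-r", DATASET,
    ])
    # with -H, columns are tab-separated; last line is the newest snapshot
    name, creation = out.decode().strip().splitlines()[-1].split("\t")
    # creation typically looks like "Mon Oct  1 08:05 2012" (an assumption)
    ts = time.mktime(time.strptime(creation, "%a %b %d %H:%M %Y"))
    print("newest snapshot %s is %.0f minutes old" % (name, (time.time() - ts) / 60))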
[09:38:46] talk to you in a bit
[09:38:51] ok, good luck
[09:38:54] vyw
[09:38:56] bye
[09:41:03] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours
[09:49:16] replication is taking about 2.5 hours to complete, probably in part due to the pile of rsyncs on ms7
[09:49:43] so I should check back on ms8 in a couple hours and see how it is
[09:50:19] ok
[09:52:09] in the meantime ms10 is available for rsyncs, it's probably a few days out of date (like 3), again perfectly fine for an initial sync to the netapp
[09:52:35] yes
[10:28:59] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours
[10:33:24] New patchset: Mark Bergsma; "Apply settings to cp1029 as well" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25865
[10:34:24] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25865
[10:34:32] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25865
[10:37:11] back
[10:37:50] I keep learning machines like that
[10:37:59] I didn't know we also had ms10
[10:38:05] it's very recent
[10:38:10] apergos set it up a few weeks ago
[10:38:14] but was silent about it
[10:38:36] i also didn't know it had an rsync replica yet
[10:41:35] PROBLEM - Host cp1029 is DOWN: PING CRITICAL - Packet loss = 100%
[10:41:43] full mirror setup isn't done, of course it got dropped for swift stuff
[10:42:00] you didn't use the same as ms1002?
[10:42:19] that's not what I mean
[10:42:38] RECOVERY - Host cp1029 is UP: PING OK - Packet loss = 0%, RTA = 26.60 ms
[10:42:40] for example we are not generating image tarballs in house yet
[10:43:12] i really wonder why we'd need to do that
[10:43:19] who wants a huge tarball of images
[10:43:29] you'd be surprised
[10:43:35] idiots
[10:43:36] not of commons, but of the projects,
[10:43:56] people want to be able to retrieve the ones for their project in one or a few tarballs
[10:44:12] they should just get a list of images and then fetch separately
[10:44:53] you should just do that, most folks don't want to do that for all the images on en pedia
[10:45:18] anyways, the point is that luckily ms10 is not particularly loaded right now
[10:45:22] so rsync away
[10:45:28] we should provide a script which generates a tarball or download for them, locally
[10:45:38] instead of generating them on our side
[10:46:24] and then also is able to update it later
[10:46:41] much like rsync
[10:47:04] generating them is better for the end user (also these tarballs can be uploaded to archive.org, as I think someone may already be doing)
[10:47:49] there is an rsync mirror of these, and there are lists of the files but people were much more excited about the tarballs
[10:47:54] (currently generated off-site)
[10:47:56] it's trivial to generate a tarball on the fly once you have a downloaded set of files
[10:48:04] like I said, idiots :P
[10:48:15] it's so wasteful
[10:48:18] no. just regular users who want things to be convenient
[10:48:47] folks have requested these for years, we finally provide them
[10:48:54] so
[10:48:57] ms10 rsyncs?
[10:49:04] tarballs?!
[10:49:11] do we provide a 30T-sized tarball or what?
[10:49:15] that's so crazy
[10:49:17] no
[10:49:19] not of commons
[10:49:21] :-/
[10:49:26] that alone demonstrates why it's a stupid idea
[10:49:28] "not for commons"
[10:50:04] it's not a stupid idea, it's useful for endusers. they get one or a series (depending on the size) of tarballs of locally and remotely (= hosted on commons) image bundles
[10:50:19] they would also get that if we provided a script which would do it for them
[10:50:56] it would be a bit harder to mirror those tarballs or archive them though wouldn't it?
[10:51:01] not at all
[10:51:02] we could do that
[10:51:13] you can generate a tarball on the fly and upload it
[10:51:14] well if we are going to mirror and archive them we might as well create them
[10:51:18] or the script could do it or whatever
[10:51:25] are we gonna store history of them?
[10:51:40] no, we won't keep a history, we don't have that sort of space
[10:51:44] exactly
[10:51:45] "want your own Ubuntu mirror? here's a tarball of all .debs"
[10:51:45] but keeping the current run is fine
[10:51:46] so what's the point then
[10:52:31] what's the point? convenience for our mirrors, for the archive, and for endusers who would download directly from us (some will if we have them
[10:53:08] you could even generate the tarball on the fly server side if you wanted to provide that "convenience"
[10:53:28] eh no, on demand takes far too long
[10:53:38] best to do a monthly run and one or two incremental runs in the meantime
[10:53:39] no
[10:53:43] on the fly doesn't take any time
[10:53:45] I think he meant synchronously
[10:53:55] tar is an append-only format anyway
[10:53:57] it's just sequentially appended data
[10:54:09] heh :)
[10:54:17] but anyways
[10:54:20] I think we should take a lesson from distros in any case
[10:54:21] well it takes about 5 days running multiple parallel jobs to complete these on an offsite location with pretty decent hardware
[10:54:38] they're in the business of pushing regularly updated files to mirrors
[10:54:46] and they've optimized that
[10:54:50] yes, it's a solved problem
[10:54:50] do they have any that are 100gb?
[10:54:57] much larger than 100gb
[10:54:58] does it matter?
[10:55:00] (but yes)
[10:55:05] yes, it does matter
[10:55:18] do we have images that are 100GB?
[10:55:22] are you telling me they don't have a local copy of their own repo?
[10:55:24] or videos for that matter
[10:55:30] New patchset: Mark Bergsma; "Add cp1029 and cp1030 to the backend list" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25868
[10:55:33] they don't have tarballs of their own repos no
[10:55:39] I'm telling you that we don't ship a giant tarball of the whole repository
[10:55:53] but provide methods for people to mirror e.g. Debian
[10:55:54] because they realize that would be retarded
[10:56:05] no, but neither do their repos contain a few million separate files that would require a few million retrievals
[10:56:15] are you kidding me? :)
[10:56:27] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25868
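To illustrate the "tar is append-only / generate on the fly" point from 10:53: because a tar archive is just sequentially appended records, a bundle can be streamed to a client as it is built, with nothing staged on disk first. A hedged sketch; the media root is a placeholder.

    # Hedged sketch: stream a tarball while walking a directory tree.
    import os
    import sys
    import tarfile

    ROOT = "/export/upload/wikipedia/en"  # placeholder media root

    # mode "w|" streams an uncompressed tar; "w|gz" would gzip on the fly
    with tarfile.open(fileobj=sys.stdout.buffer, mode="w|") as tar:
        for dirpath, _dirnames, filenames in os.walk(ROOT):
            for name in filenames:
                path = os.path.join(dirpath, name)
                tar.add(path, arcname=os.path.relpath(path, ROOT))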
[10:56:33] last I looked debian did not have a few million packages in it, no
[10:56:51] it's less than 158 million, yes
[10:57:04] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25868
[10:57:08] but it's quite a lot of files, at least 4 per version, with multiple versions
[10:57:52] and most endusers don't want half of those, they only need a small portion which they get at install/update time
[10:57:54] also, if we ship per project and we don't ship commons, I think the individual wikis are averaging less in size than the Debian archive
[10:58:01] that's not what our endusers who want this data need
[10:58:17] because they want a tarball to do what exactly?
[10:58:35] the small projects are obviously much smaller, but they (I assume) are not the issue, since by definition they don't take up much space
[10:58:59] shall I go request use cases from our readers? I'm happy to do that
[10:59:08] because a tarball is such an efficient filesystem to work on or something?
[10:59:28] can I suggest shipping tapes instead?
[10:59:39] so I'll say it again: we had rsyncable available for some time, but people were much happier to get bundles
[10:59:51] you can suggest whatever you like, but that doesn't mean it's going to get implemented :-P
[11:00:21] fortunately much more useful stuff is being done now than generation of tarballs too
[11:00:35] yes, please start those rsyncs
[11:03:12] so, the people getting those tarballs, how do they keep them up to date?
[11:04:14] well given that it's an experimental service still (remember how I said we aren't running it in house?) I expect they pick up the incrementals from time to time, or that they don't care if the tarballs are out of date for a few months; folks that care will get a new full
[11:04:28] no one has asked for up to the minute access
[11:04:52] that may change if WMF really drops toolserver replication :)
[11:05:05] I assume you know some reasons WHY people would prefer tarballs over seemingly more sane methods
[11:05:28] because some people are crazy? :)
[11:05:42] as otherwise I assume you would have proposed some alternative methods to see if people would like those
[11:05:58] but perhaps I'm being naive
[11:06:31] I don't mind though, as long as someone else is taking care of it :-)
[11:07:00] that works until you can no longer hire some positions that are important for budget reasons
[11:07:24] do I prefer someone working on tarball generation, or someone working on the security of the cluster while dumps and images are already available through several means? :)
[11:07:47] so (1) there are already download scripts that do full retrieval, advertised on Meta iirc. (2) people specifically requested these files (3) you wanna recommend they hire someone else and fire me, go right ahead, I'm starting to get irritated
[11:07:52] I prefer staying out of people/time allocations :-)
[11:08:22] (but I do like optimizing our work, don't get me wrong :)
[11:09:57] apergos: :/ my objections have nothing to do with you, don't take it personally...
[11:10:03] no this is not personal at all
[11:10:15] i just don't understand the reasoning here
[11:11:51] !log Added cp1029 and cp1030 to the Varnish upload.eqiad pool (10 mins ago)
[11:12:02] Logged the message, Master
[11:12:48] btw, I liked the chash idea
[11:12:53] yeah
[11:13:02] can probably just alter the existing chash director with an extra option that does that
[11:13:05] although isn't the two-tiered architecture supposed to help with that?
[11:13:20] only for really hot objects, since the frontends have tiny caches
[11:13:29] aha
[11:13:32] if one entire disk cache drops empty, it's still a problem
[11:13:43] yeah, I remember ;)
[11:13:51] Platonides: about toolserver replication, let's cross that bridge when they get a lot closer to burning it...
[11:14:18] we do purges on all caches via the htcp reflector, right?
[11:14:24] yes
[11:15:00] are you planning on pushing chash to varnish upstream btw? :)
[11:15:05] they'll probably ask me on Friday
[11:15:24] they know about it and are supposed to merge it
[11:15:31] i don't think there's much I need to do
[11:15:34] perhaps write docs for it ;)
[11:15:38] hehe
[11:16:04] there's no reason why it shouldn't go upstream
[11:16:20] (don't want you to think I have anything against upstreaming stuff unless it causes me a lot of extra work ;)
[11:16:48] yeah, I remember
[11:17:12] so your hotel is tiny? ;)
[11:17:22] yes, why?
[11:17:35] i'm glad I didn't go and also bring my gf then hehe
[11:17:43] haha yes
[11:17:55] well I got the single bed, not the queen one
[11:18:01] right
[11:18:26] i was gonna have my car fixed while I was in london
[11:18:33] since I wouldn't use it anyway
[11:18:39] but yeah, it's so tiny there's no room for my luggage to stay open
[11:18:39] now i'll have to miss it for a few days :/
[11:18:52] no space for both sides of it on the floor
[11:18:52] (my neighbour downstairs bumped into it :( )
[11:19:00] really
[11:19:06] oh yeah
[11:19:07] it's like the mosser then hehe
[11:19:12] it's an old victorian house
[11:19:20] fortunately renovated, so it's not that bad
[11:19:41] my bed is against the wall on three out of the four sides
[11:20:13] one of them is also the window
[11:20:17] i should find my picture of my hotel in london in 2008
[11:20:38] I've stayed in small rooms before, mostly in Paris
[11:22:30] New patchset: Mark Bergsma; "Add cp103[12] to the backend pool" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25870
[11:23:25] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25870
[11:23:44] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25870
[11:24:32] you should send a picture of the current hotel (maybe from the outside if it's more interesting)
[11:26:32] from outside it's very nice
[11:30:42] its wifi also sucks
[11:30:50] it's a curse that follows me
[11:30:51] ugh, par for the course
[11:30:55] not just you
[11:31:03] wifi also sucks here at the british library where I'm sitting right now
[11:31:18] I've been trying to SSH to a box for three minutes now
[11:31:20] go to the apple store
[11:31:28] that's what I did in 2008, worked then ;)
[11:31:29] to do what?
[11:31:32] free wifi
[11:31:39] hahaaha
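Not the actual Varnish director code — just a minimal illustration of the consistent-hashing ("chash") idea from the 11:12–11:13 exchange above: each backend gets many points on a hash ring and a URL maps to the next point clockwise, so losing one backend only remaps that backend's share of objects instead of reshuffling every disk cache.

    # Hedged sketch of consistent hashing; backend names are made up.
    import bisect
    import hashlib

    def _h(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    class HashRing:
        def __init__(self, backends, replicas=100):
            # many virtual points per backend smooth out the distribution
            self._points = sorted(
                (_h("%s#%d" % (b, i)), b)
                for b in backends for i in range(replicas)
            )
            self._keys = [p[0] for p in self._points]

        def backend_for(self, url):
            i = bisect.bisect(self._keys, _h(url)) % len(self._points)
            return self._points[i][1]

    ring = HashRing(["cp1029", "cp1030", "cp1031", "cp1032"])
    print(ring.backend_for("http://upload.wikimedia.org/math/a/b/ab.png"))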
[11:32:05] PROBLEM - Puppet freshness on ms-fe1 is CRITICAL: Puppet has not run in the last 10 hours
[11:32:34] my power adapter died the day I arrived in london
[11:32:45] so I was afraid I'd have to spend a weekend without my laptop in london
[11:32:55] but I went to the apple store at regent street, and they replaced it under warranty
[11:32:58] that was nice
[11:33:36] aww
[11:48:16] New review: Hashar; "We were waiting for the files to be copied to docroot/noc which has been done (and merged) by I92551..." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/23425
[11:50:59] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[11:52:52] New review: Hashar; "I cant find the change... It is "" d482144 - Bug 40112 - Add noc.wikimedia.org (Wikimedia NOC) files..." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/23425
[11:55:53] New patchset: Hashar; "Move noc from /h/w/htdocs to /h/w/c/docroot" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23425
[11:56:47] New review: Hashar; "I have removed a whitespace at the end of the Apache configuration file files/apache/sites/noc.wikim..." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/23425
[11:56:48] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/23425
[13:10:03] !log powering down db62 to replace raid controller card
[13:10:15] Logged the message, Master
[13:13:39] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours
[13:13:39] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours
[13:18:52] New patchset: Mark Bergsma; "Add cp103[34] to the backends pool" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25876
[13:19:57] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25876
[13:30:05] apergos: so, did you start any extra rsyncs?
[13:31:55] I am setting them up now, sorry but it took a bit of digging around to figure out how to grant myself mount permissions from the netapp
[13:32:13] there will be some mail in a little bit
[13:32:48] so, does solaris have the equivalent of rsize/wsize?
[13:32:54] yes
[13:33:02] asher increased it I think
[13:33:03] ah, it does and it's already set in ms7
[13:33:10] okay, moving along
[13:34:04] i waited with snapmirror btw, as last time, with the fr_archive stuff, the quick copying sometimes made the snapshot transfers between the netapps fall behind
[13:34:18] not a big deal, but I figured i might as well wait as we don't urgently need the data in eqiad
[13:34:32] it's better to get it on the tampa netapp quickly and safely
[13:35:32] agreed
[13:40:13] apergos: the new card has been installed
[13:40:19] yay
[13:41:07] the config has been changed for non-raid
[13:41:18] cool
[13:41:47] for you and paravoid to work on
[13:42:02] which card? on the C2100s?
[13:43:04] no, we are testing db62 (720xd)...Dell sent us the h300 card. we need to see if it is a good replacement for the c2100's
[13:43:15] ah right
[13:43:18] this makes much more sense :)
[13:43:23] which box is that?
[13:43:35] db62?
[13:43:41] okay
[13:45:34] we are done testing on the c2100s right? as in no more disk stress tests, no more playing with cards, no more anything?
[13:46:27] I think so
[13:47:45] New review: Mark Bergsma; "Were these changes actually tested? Notably the NGINX one..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/12188
[13:49:38] note that there's some additional load on swift now as i'm bringing up 8 more varnish servers with empty caches
[13:50:18] apergos: yes, we are done testing the c2100 unless you really want to help Dell out with their troubleshooting
[13:50:41] mmm I think I'll leave that to their professionals
[13:51:05] mark: are you worried about the nginx udplog module?
[13:55:11] I just couldn't quickly verify from the docs whether it will work
[13:55:31] i assume it does, but for all I know, the creator of the patch thought the same
[13:55:55] yes and we've been bitten by that nginx module before
[13:56:22] segfaults with the third udp endpoint, I think that happened on my first or second week :)
[13:56:34] yeah
[14:03:16] PROBLEM - Puppet freshness on mw1 is CRITICAL: Puppet has not run in the last 10 hours
[14:03:16] PROBLEM - Puppet freshness on mw10 is CRITICAL: Puppet has not run in the last 10 hours
[14:03:16] PROBLEM - Puppet freshness on mw11 is CRITICAL: Puppet has not run in the last 10 hours
[14:03:16] PROBLEM - Puppet freshness on mw14 is CRITICAL: Puppet has not run in the last 10 hours
[14:03:16] PROBLEM - Puppet freshness on mw12 is CRITICAL: Puppet has not run in the last 10 hours
[14:03:17] PROBLEM - Puppet freshness on mw15 is CRITICAL: Puppet has not run in the last 10 hours
[14:03:17] PROBLEM - Puppet freshness on mw2 is CRITICAL: Puppet has not run in the last 10 hours
[14:03:18] PROBLEM - Puppet freshness on mw13 is CRITICAL: Puppet has not run in the last 10 hours
[14:03:18] PROBLEM - Puppet freshness on mw4 is CRITICAL: Puppet has not run in the last 10 hours
[14:03:19] PROBLEM - Puppet freshness on mw16 is CRITICAL: Puppet has not run in the last 10 hours
[14:03:19] PROBLEM - Puppet freshness on mw3 is CRITICAL: Puppet has not run in the last 10 hours
[14:03:20] PROBLEM - Puppet freshness on mw5 is CRITICAL: Puppet has not run in the last 10 hours
[14:03:20] PROBLEM - Puppet freshness on mw6 is CRITICAL: Puppet has not run in the last 10 hours
[14:03:21] PROBLEM - Puppet freshness on mw9 is CRITICAL: Puppet has not run in the last 10 hours
[14:03:21] PROBLEM - Puppet freshness on mw8 is CRITICAL: Puppet has not run in the last 10 hours
[14:03:22] PROBLEM - Puppet freshness on mw7 is CRITICAL: Puppet has not run in the last 10 hours
[14:04:02] New patchset: Mark Bergsma; "Add the final two varnish servers to the backend pool" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25878
[14:04:59] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25878
[14:06:05] !log Added cp1029 and cp1030 to the frontend varnish upload pool in eqiad (PyBal)
[14:06:15] Logged the message, Master
[14:10:08] New review: Ottomata; "Yes, all were tested extensively on log1.pmtpa.wmflabs. Even with tests :)" [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/12188
[14:11:56] New review: Mark Bergsma; "Go ahead whenever you're ready." [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/12188
[14:14:13] PROBLEM - Host ps1-b2-sdtpa is DOWN: PING CRITICAL - Packet loss = 100%
[14:17:55] sigh
[14:20:50] RECOVERY - Host ps1-b2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.68 ms
[14:44:09] New patchset: Demon; "Revert "Perform daily backups of gerrit for amanda to pick up"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25881
[14:44:29] no
[14:44:32] that doesn't remove it
[14:45:00] <^demon> Oh, whoops. Need ensure => absent.
[14:45:04] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25881
[14:45:11] <^demon> I'll redo that.
[14:45:15] thanks
[14:49:02] New patchset: Demon; "Remove backup stuff from puppet (reverting Ibf2e1d85)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25881
[14:49:36] <^demon> Ok, now should be correct.
[14:49:59] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25881
[14:53:00] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25881
[14:57:39] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Puppet has not run in the last 10 hours
[15:07:11] we are losing gerrit
[15:07:19] manganese is going to swap http://ganglia.wikimedia.org/latest/?c=Miscellaneous%20eqiad&h=manganese.wikimedia.org&m=cpu_report&r=hour&s=descending&hc=4&mc=2
[15:08:34] ^demon is coming back shortly
[15:11:45] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours
[15:14:27] PROBLEM - Apache HTTP on mw47 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:15:57] RECOVERY - Apache HTTP on mw47 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.999 second response time
[15:30:55] <^demon> hashar: Ok back. Seems puppet's eating up about 83% of the RAM.
[15:31:10] <^demon> Puppet's been running for 4h? Wtf?
[15:31:27] <^demon> Oh, minutes, had my digits off.
[15:31:32] maybe related to the revert of https://gerrit.wikimedia.org/r/25881
[15:31:40] err by https://gerrit.wikimedia.org/r/25881 which is a revert of some amanda stuff
[15:32:01] puppet has a concept of file buckets, maybe it is trying to use that right now
[15:32:13] <^demon> Lets see if puppet finishes.
[15:35:32] <^demon> Ok, I gracefully killed gerrit. Maybe someone can kill that puppet run.
[15:35:48] somebody broke gerrit :)
[15:36:10] <^demon> It's swapping and I don't know why yet, so I gracefully killed gerrit.
[15:36:15] heh
[15:36:39] <^demon> !log gracefully shut down gerrit until we figure out why manganese is swapping
[15:36:44] * aude goes to get cookies
[15:36:50] Logged the message, Master
[15:37:05] killed puppet
[15:37:15] <^demon> Ah, just saw it disappear.
[15:37:43] restarted puppet run
[15:38:05] <^demon> Mmk. It should start gerrit as a result as well, so I won't do that yet.
[15:38:41] shit, the mozilla space closes in 20' and I need to catch Aaron somewhere
[15:38:54] puppet didn't start gerrit
[15:39:09] <^demon> I'll start it by hand, no worries.
[15:40:58] <^demon> !log restarted gerrit on manganese.
[15:41:08] Logged the message, Master
[15:41:50] <^demon> `top` is looking much nicer now. Dunno why puppet was making it swap.
[15:43:08] <^demon> http://ganglia.wikimedia.org/latest/?c=Miscellaneous%20eqiad&h=manganese.wikimedia.org&m=cpu_report&r=hour&s=descending&hc=4&mc=2 looks nicer too.
[15:43:18] <^demon> hashar: Thanks for pinging me.
[15:48:56] ^demon: you are welcome :-]
[15:49:08] it might go to swap again next time puppet runs though
[15:50:14] <^demon> Dunno...usually works fine.
[15:50:21] <^demon> And it ran fine the second time.
[15:50:26] * ^demon will watch, like a hawk
[15:53:01] I had such an issue on gallium where puppet tried to file bucket a few thousand mediawiki/core.git clones :-D
[15:53:35] anyway out to grab my daughter. Will be back later
[15:54:15] <^demon> Later.
[16:36:11] New patchset: Jgreen; "switched aluminium:/srv/br backup from tarball to remote snapshot (rsync + hard links)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25886
[16:38:06] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25886
[16:47:29] New patchset: Jgreen; "remove rsync -v flag on fundraising backups" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25887
[16:48:23] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25887
[16:52:16] RECOVERY - Host analytics1007 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms
[17:05:37] PROBLEM - Host analytics1007 is DOWN: PING CRITICAL - Packet loss = 100%
[17:09:46] binasher: any chance you could add some tracking of the number of jobs on zhwiki somewhere please?
[17:11:37] paravoid: ping
[17:12:37] preilly: uh.. you have root?
[17:13:01] binasher: nope
[17:13:12] whats up with the vanadium thing?
[17:13:19] binasher: so rsync to be done in a week? :)
[17:13:34] * AaronSchulz yays
[17:13:36] binasher: I had LeslieCarr do it
[17:13:46] binasher: and I copied and pasted it
[17:13:58] AaronSchulz: give or take :)
[17:14:03] yeah, preilly gave me $20 and i gave him root, that's how it works, right ?
[17:14:22] I thought it was a great deal of money to waste on root
[17:14:34] but whatever you get what you pay for right?!?
[17:15:05] * Damianz just doesn't even start reading into 'root'
[17:16:20] speaking of vanadium stuffs.. LeslieCarr: any estimate on when esams will be tunneled to eqiad? if it isn't already
[17:17:02] yeah… there's a problem with that, mainly in asymmetrical routing to anything with external ip space
[17:17:03] :(
[17:17:12] well sort of problem
[17:17:14] solvable problem
[17:17:18] but like end of month
[17:17:33] really a "solvable when leslie and mark are not travelling around the world" problem
[17:18:52] hehe, ok
[17:18:57] are we actually doing tunneling after all?
[17:19:08] maybe!
[17:19:11] maybe not
[17:19:16] I thought mark's idea was to do IPsec transport mode
[17:19:24] between end hosts
[17:19:40] ugh, I hate MTU disparities
[17:19:46] LeslieCarr: Talking about networkishness, any chance you got to poke TS ranges being blocked from Labs ranges?
[17:19:55] which we won't avoid with either transport or tunnel
[17:19:57] the best idea i would love would be to just get a link and then just ebgp
[17:20:09] Damianz: there is a chance! and that chance happens to be 0
[17:20:14] yeah that'd be so great
[17:20:14] heh
[17:20:23] i will look now Damianz
[17:20:24] no chance and fat chance?
[17:20:34] :)
[17:20:41] any idea how expensive would that be? :)
[17:20:41] Not urgent just annoying heh
[17:22:29] this is the first day i am back
[17:22:32] yes, we have 1 quite
[17:22:33] quote
[17:22:39] i guess it depends on fundraising
[17:23:01] ;)
[17:23:45] hm, Equinix is an AMS-IX partner, interesting
[17:24:50] New patchset: ArielGlenn; "db62 pulled from db stanza for testing h310 controller" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25892
[17:25:00] paravoid: yeah - they connect to it in one of their totally reasonably priced datacenters in AMS
[17:25:06] uh, db stanza, so poetic
[17:25:45] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25892
[17:26:24] New patchset: Aaron Schulz; "swift: add support for timeline/math paths in rewrite.py" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24303
[17:26:55] er whatevs
[17:27:09] AaronSchulz: new patchset?
[17:27:20] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24303
[17:27:33] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25892
[17:28:05] ah, emptying shard
[17:29:21] I don't see why, but it doesn't hurt either
[17:31:17] Damianz: so was the specific issue with not being able to ssh to toolserver ?
[17:31:25] Yes
[17:31:28] plz to be refreshing memory as I have been passing through timezones like they were butter
[17:31:30] cool, i see that
[17:31:40] ori-l: ping
[17:31:42] telnet login.toolserver.org 22 from anything in labs = Network is unreachable basically
[17:31:45] yep
[17:31:51] LeslieCarr: we block SSH from labs -> production
[17:31:56] exactly
[17:32:04] and apparently the filter is overzealous and includes toolserver into production
[17:32:12] and toolserver is using space in our AS43821 /24
[17:32:20] we really need to reallocate them PA space
[17:32:25] where be toolserver admins ?
[17:32:28] good luck with that :P
[17:32:32] they'd love that
[17:34:36] yeah i haven't been able to get ahold of anyone
[17:34:39] me and mark tried
[17:35:37] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:36:28] oh?
[17:36:32] yeah
[17:37:07] RECOVERY - LVS HTTP IPv4 on wikiversity-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 47528 bytes in 0.150 seconds
[17:37:45] you in the #wikimedia-toolserver channel?
[17:38:20] i never hang out there, mark had emailed the sysadmins
[17:38:23] i guess it's worth asking
[17:38:49] New patchset: Andrew Bogott; "Move wikiupdates out from nova::common and into nova::compute" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25895
[17:38:52] half of the appservers are red in ganglia
[17:38:56] more than half
[17:39:13] my network connection sucks with many seconds of latency
[17:39:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25895
[17:40:36] can someone else take a look too? I'm completely crippled
[17:40:43] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[17:40:43] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours
[17:40:43] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[17:40:43] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours
[17:40:43] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours
[17:41:13] binasher: ping?
[17:41:28] New patchset: Cmjohnson; "Removing db62 from db cnfg and adding to ms-be1-4 config for swift test" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25896
[17:41:54] LeslieCarr:
[17:42:06] This isn't new
[17:42:25] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25895
[17:42:25] The day view shows me these Apaches have had high CPU utilization for at least the past 24h
[17:42:25] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25896
[17:42:29] I wonder if they're job runners
[17:43:00] Yeah looks like at least the majority of them are
[17:43:12] https://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&m=cpu_report&s=by+name&c=Application+servers+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
[17:43:25] RECOVERY - Puppet freshness on virt0 is OK: puppet ran at Mon Oct 1 17:43:09 UTC 2012
[17:43:25] sorry, it looks like they have been in trouble for a while
[17:43:49] indeed, the 4hr/1 day graphs are still loading here
[17:44:00] but I can see similar patterns already
[17:44:05] Most of the red ones have a blue graph going up and down (that's regular web requests, going up and down with traffic) with a yellow graph consistently topping that off to about 90% (those are the reniced job runners using whatever spare resources are available to them)
[17:44:49] i think we need to get the precise jobrunners back in there ?
[17:44:57] paravoid: Are we having any issues with the site or are you just worried because those Apaches are red in Ganglia?
[17:44:58] LeslieCarr: NO
[17:45:11] The precise job runners broke category sorting on ptwiki
[17:45:54] (Specifically, differences between precise and lucid and the fact that we were running jobs on both at the same time is what caused that breakage)
[17:45:59] yes i meant after the package is applied
[17:46:05] Right Ok
[17:46:26] RoanKattouw: there was an LVS flap a few minutes ago
[17:46:40] Oh Ok
[17:46:50] It looks like the red job runners have been like that all month
[17:47:04] zhwiki probably...
[17:47:04] The month view is pretty much the same as the other views, as far as those hosts are concerned
[17:47:10] Wouldn't surprise me
[17:47:12] and its 2 million jobs
[17:47:35] Although last time we blamed it on zhwiki, Tim discovered it was really frwiki
[17:48:58] RECOVERY - Puppet freshness on virt1000 is OK: puppet ran at Mon Oct 1 17:48:41 UTC 2012
[17:54:21] AaronSchulz: ping?
[17:54:39] ?
[17:54:53] how are you toady?
[17:55:03] ribbit!
[17:55:06] are you up for deploying that math/timeline change? :)
[17:55:18] ribbit!
[17:55:42] what's ribbit? :)
[17:56:04] <^demon|busy> AaronSchulz is doing the toad.
[17:57:09] I kind of expected you to say "kneedeep kneedeep" :-P
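Back to the 17:09 request to track the number of jobs on zhwiki: MediaWiki keeps pending jobs in the `job` table, so a crude tracker could just count rows per job type. A hedged sketch — connection details are placeholders, and a production version would presumably push the numbers into ganglia/graphite rather than print them.

    # Hedged sketch: count pending MediaWiki jobs per type on one wiki.
    import pymysql

    conn = pymysql.connect(read_default_file="/etc/mysql/my.cnf",  # assumption
                           db="zhwiki")
    with conn.cursor() as cur:
        cur.execute("SELECT job_cmd, COUNT(*) FROM job GROUP BY job_cmd")
        for cmd, count in cur.fetchall():
            print("%-30s %d" % (cmd, count))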
[17:58:17] paravoid: 1.21wmf1 deployment to testwiki and mediawikiwiki is due in 2 minutes
[17:58:22] shouldn't be a long job
[17:58:36] might also take 2 minutes ;)
[17:58:49] * AaronSchulz kicks his ssd
[17:59:06] New review: Siebrand; "Erik, apparently this has been waiting on you for a long time. Not sure if you even knew that, becau..." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/12188
[17:59:07] At least that won't upset it like it would a spindle drive :p
[17:59:21] * Damianz gives Leslie 1 cookie
[17:59:27] Just 1?
[17:59:28] Reedy: thanks. we still have some preparatory work anyway, and it shouldn't overlap with anything you're doing.
[17:59:45] yay cookie
[17:59:46] apt-get and any other install related thing is horribly slow
[17:59:56] Reedy: 1 packet obviously, only so she can share
[17:59:59] build from source then
[18:00:03] that'll help
[18:00:13] worse than my hard disk when spammed with rename,fsync
[18:00:21] time to just jfdi
[18:00:30] Reedy: maybe I should have got an intel ;)
[18:00:57] AaronSchulz: so, are you up for helping me test this before & after deployment?
[18:01:07] hehe you noticed it was working before i actually said it was, eh Damianz ?
[18:01:17] sure
[18:01:18] Yep :)
[18:01:22] Long deployment was long.
[18:01:36] "Fun toys are fun" -Ralph Wiggum
[18:01:44] Exactly.
[18:02:45] New patchset: Cmjohnson; "adding db62 to ms-fe cnfg in netboot.cfg" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25897
[18:03:43] paravoid: notpeter: where are we on getting precise apaches back in service?
[18:03:44] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25897
[18:04:07] Change abandoned: Cmjohnson; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25897
[18:04:59] Sooooooooon, hopefully
[18:11:25] New patchset: Cmjohnson; "fixing the partman recipe for db62" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25898
[18:12:20] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25898
[18:14:00] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours
[18:17:18] Change abandoned: Cmjohnson; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25898
[18:20:54] RECOVERY - mysqld processes on es10 is OK: PROCS OK: 1 process with command name mysqld
[18:22:24] es10 crashed over the weekend.. i just brought mysqld back up, and pulled it out of db.php
[18:23:18] PROBLEM - MySQL Replication Heartbeat on es10 is CRITICAL: CRIT replication delay 130323 seconds
[18:23:45] PROBLEM - MySQL Slave Delay on es10 is CRITICAL: CRIT replication delay 125021 seconds
[18:31:30] AaronSchulz: live on ms-fe1; wfm, try to test it
[18:31:44] do you still have those urls handy?
[18:31:48] AaronSchulz: also live is the redirect passthrough and the imagescaler error messages passthrough
[18:32:05] they are opened as tabs on my home laptop, heh
[18:32:12] well, I have a few URLs that I test with, but I know these work :-)
[18:32:20] math and timeline?
[18:32:27] RECOVERY - MySQL Replication Heartbeat on es10 is OK: OK replication delay 0 seconds
[18:33:03] RECOVERY - MySQL Slave Delay on es10 is OK: OK replication delay 0 seconds
[18:35:37] Change abandoned: Cmjohnson; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25896
[18:36:36] math, timeline, image original, image thumb
[18:37:00] I don't have anything to test image scaler errors or image scaler redirects though
[18:37:58] you can test errors by giving bad paths (includes ones with the wrong hash path)...redirects can only be tested with test wiki urls
[18:40:21] paravoid: http://pastebin.com/m7k2fAkn
[18:41:20] erm?
[18:41:57] that should be a redirect
[18:47:13] ok, it works
[18:47:18] if you try to run it against ms-fe1
[18:49:38] yep
[18:50:43] New patchset: ArielGlenn; "db62 moved to ms-fe partition layout for r720xd testing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25904
[18:51:00] !log stopping puppet on brewster
[18:51:06] apergos: is that ok ? ^
[18:51:11] Logged the message, notpeter
[18:51:32] yes but in a few we'll want to push out that change
[18:51:39] ok, want to go first?
[18:51:39] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25904
[18:51:40] let me know
[18:51:43] kk
[18:53:00] paravoid: I tested a math and a timeline url, and they work
[18:54:35] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25904
[18:55:34] apergos: lemme know when I can tinker
[18:56:45] well what I want is to get that last change onto brewster and then you can play
[18:56:57] care to do a test run while you're over there, then you can disable it again?
[18:57:05] notpeter:
[18:57:16] sure
[18:57:25] thanks
[18:58:22] apergos: your stuff is live
[18:58:54] thanks dude
[18:59:00] PROBLEM - Puppet freshness on ms-be6 is CRITICAL: Puppet has not run in the last 10 hours
[18:59:03] no prob!
[18:59:23] as long as you don't break pxe, play away
[18:59:30] nein
[18:59:33] just changing a mac
[19:01:12] well, probably a lot of macs...
[19:01:13] but whatever
[19:01:47] * AaronSchulz can't login to bz
[19:01:59] AaronSchulz: you're free!!!!
[19:07:23] "The username or password you entered is not valid. "
[19:07:31] * AaronSchulz just reset the password...
[19:07:54] paravoid: seems to look fine
[19:08:23] yeah I'm on the squid config atm
[19:09:03] ok, nvm
[19:12:13] acl swift_timeline url_regex ^http://upload\.wikimedia\.org/(wikibooks|wikinews|wikiquote|wikiversity|wikimedia|wikipedia|wikisource|wiktionary)/[^/]+/timeline/
[19:12:16] acl swift_math url_regex ^http://upload\.wikimedia\.org/math/
[19:12:17] look sane?
[19:13:58] * AaronSchulz checks
[19:14:58] yes
[19:19:17] New review: CSteipp; "I don't think you're setting those correctly for what you're trying to do. The logic to disable a ru..." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/25855
[19:30:50] New review: Faidon; "Looks good, passed testing." [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/24303
[19:30:52] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24303
[19:31:12] New review: Faidon; "Looks good, passed testing." [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/24514
[19:31:13] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24514
[19:31:34] New review: Faidon; "Looks good, passed testing." [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/24576
[19:31:35] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24576
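One way to sanity-check the two url_regex ACLs pasted at 19:12 before touching the squids is to replay them in Python against sample URLs. The patterns are copied from the paste above; the sample URLs are made up.

    # Hedged sketch: verify which sample URLs the squid ACL regexes match.
    import re

    swift_timeline = re.compile(
        r"^http://upload\.wikimedia\.org/"
        r"(wikibooks|wikinews|wikiquote|wikiversity|wikimedia|wikipedia|"
        r"wikisource|wiktionary)/[^/]+/timeline/"
    )
    swift_math = re.compile(r"^http://upload\.wikimedia\.org/math/")

    samples = [
        "http://upload.wikimedia.org/wikipedia/en/timeline/abc123.png",
        "http://upload.wikimedia.org/math/a/b/ab.png",
        "http://upload.wikimedia.org/wikipedia/commons/x/y/z.jpg",
    ]
    for url in samples:
        print(url, bool(swift_timeline.match(url)), bool(swift_math.match(url)))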
[operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/24576 [19:31:35] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24576 [19:33:40] !log pushing new rewrite.py to swift servers (r24303, r24514, r24576) [19:33:50] Logged the message, Master [19:34:52] RECOVERY - Puppet freshness on ms-fe1 is OK: puppet ran at Mon Oct 1 19:34:45 UTC 2012 [19:35:41] New patchset: Ryan Lane; "Fixing ranges" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25912 [19:36:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25912 [19:36:50] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25912 [19:36:50] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25465 [19:38:06] ^ =D [19:41:42] paravoid: are you messing with squids now? [19:42:03] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [19:43:15] no [19:43:17] not yet [19:43:32] * AaronSchulz is going to get lunch and come back [19:43:33] I've just deployed rewrite.py and I'm going through graphs/logs [19:43:41] see anything weird? [19:44:02] no [19:44:08] my internet really really reaaaally sucks [19:44:19] I'm kinda wondering if I should just push the squid change tomorrow [19:44:33] I have 1-2s lag and 10-15% packet loss [19:47:38] New patchset: Hashar; "beta: enable 'dnsblacklist' log group" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25913 [20:02:17] okay, too many disconnects [20:02:23] I'll call it a day [20:05:07] au revoir [20:05:56] notpeter: have an idea of the number of memcaches we need optics for ? [20:05:58] links for [20:07:51] you mean the mc boxes? [20:08:19] yeah [20:08:24] or any new 10g boxes you know of coming in [20:08:37] afaik, mc1-16 are all up and imaged and good, mc1001, 1003, 1005, 1007, and 1008 are all up and imaged [20:08:53] 1002, 1004 and 1006 had some kinda hardware issues (I believe bad dimms) [20:09:04] and 1009 to 1012 have good cables ? [20:09:07] and 1009-1016 didn't have networking up last time I touched them [20:09:20] 1009-1013 I'm not sure about [20:09:23] I think so [20:09:39] if you check the netwroking side on them, I'll try imaging them [20:09:51] I think last time I intkered with them, they didn't have mgmt netowrking set up... [20:09:58] New review: Hashar; "Thanks for the fixes :)" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/16606 [20:10:12] i can't help with mgmt networking - those are just open ports that if it's plugged in it works [20:10:43] mgmtm on 1014 was up when I tried just now [20:11:21] so mgmt is probably up [20:12:06] working with these boxes is very infuriating, so I only do it for as long as I can before my rage cup overfloweth [20:13:55] New patchset: Hashar; "remove 'configchange' script" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25939 [20:14:21] notpeter: ports up now [20:14:32] notpeter: i should say ports *configured* [20:14:42] mc 1011-13 are up/up [20:14:50] New patchset: Hashar; "remove 'configchange' script" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25939 [20:14:54] 9,10,14-16 don't believe they have cables plugged in [20:15:07] which if they do have anything physically there, it means that the juniper doesn't recognize it as an sfp [20:15:45] New review: gerrit2; "Lint check passed." 
[20:15:45] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25939
[20:15:53] pls see robh email on this. he set up one for testing before he opens/tears open the rest
[20:16:12] leslie / notpeter - on the cables
[20:16:31] woosters: was that sent to me?
[20:16:47] i think on rt
[20:17:20] lesliecarr: i had an issue w/1 sfp on the juniper...replaced it and worked
[20:17:33] woosters: so, the test cable was on mc1014
[20:17:36] and it isn't booting
[20:17:40] ah yes, that is the ticket we had been discussing woosters
[20:17:49] but I want him to move it to a known good host that images properly
[20:17:56] cmjohnson1: considering the price of the optics, just throw that one away, it's still worth it
[20:18:00] so that we can know for sure that it's the cable
[20:18:13] heh!
[20:18:18] though it's probably the cable ;) but it could possibly be a bad switch port
[20:18:26] unlikely but it is a possibility
[20:18:30] I believe in science!
[20:18:41] control group, ftw!
[20:19:16] :)
[20:19:45] but yeah, it's probably the cable :)
[20:26:54] PROBLEM - Host analytics1001 is DOWN: PING CRITICAL - Packet loss = 100%
[20:30:03] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours
[20:30:57] RECOVERY - Host analytics1001 is UP: PING OK - Packet loss = 0%, RTA = 26.51 ms
[20:32:32] heyaaahmm, hi guys
[20:32:32] question
[20:33:38] https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=blob;f=files/dhcpd/linux-host-entries.ttyS0-115200;h=2a3283fd7300962536c627e3b18f22ff65e09ccb;hb=HEAD#l6
[20:33:51] the correct hostname for analytic1001
[20:33:56] analytics1001
[20:34:00] is analytics1001.wikimedia.org
[20:34:04] can/should I change this here?
[20:34:16] yes, but also in dns
[20:34:24] i'm pretty sure that's already done
[20:34:39] i think it was just originally installed as .eqiad.wmnet
[20:34:48] and then after the fact leslie gave us a public IP/subdomain
[20:34:58] i'm installing precise on this now
[20:35:05] yes, it's in dns
[20:35:05] oh, yes, please do change
[20:35:06] and it didn't pxe boot like I thought it would
[20:35:09] or at least ns0 knows what it is
[20:35:25] ottomata: yeah, change in the dhcpd and it should be gtg
[20:35:52] ok cool
[20:36:02] so, I merge…and then run puppet on brewster?
[20:36:40] New patchset: Ottomata; "Fixing hostname for analytics1001" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25942
[20:37:37] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25942
[20:37:47] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25942
[20:39:37] ja
[20:39:52] which is good, because I forgot to restart it :)
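For anyone not familiar with the file being edited here: linux-host-entries.ttyS0-115200 is ISC dhcpd configuration with one host block per machine, so the fix above is just correcting one stanza. A sketch of what such an entry looks like (the MAC address is a placeholder, not analytics1001's real one):

    host analytics1001 {
        hardware ethernet 00:16:3e:aa:bb:cc;  # placeholder MAC
        fixed-address analytics1001.wikimedia.org;
    }

fixed-address can take a hostname, which dhcpd resolves when it loads the config; that is why the matching DNS record has to exist as well ("yes, but also in dns").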
[20:44:09] PROBLEM - Host analytics1001 is DOWN: PING CRITICAL - Packet loss = 100%
[20:45:23] hee hee
[20:49:33] RECOVERY - Host analytics1001 is UP: PING OK - Packet loss = 0%, RTA = 26.63 ms
[20:52:02] New patchset: Hashar; "manpages for our misc scripts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16606
[20:52:56] New review: Hashar; "PS9 adds doc for 'dologmsg' and 'lint'." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/16606
[20:52:56] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16606
[20:53:54] hmm, notpeter, nopers
[20:54:03] i can't get analytics1001 to PXE boot
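One plausible hand check on the install server at this point, given that puppet on brewster had been stopped at 18:51 and, per 20:39, not restarted right away: make sure the regenerated dhcpd config parses and that dhcpd actually reread it. The commands and paths below are assumptions for illustration, not taken from the log:

    # syntax-check the puppet-managed dhcpd config (path is a guess)
    sudo dhcpd -t -cf /etc/dhcp3/dhcpd.conf
    # restart dhcpd so the corrected host entry is live (service name varies by release)
    sudo service dhcp3-server restart
    # then watch for the host's DHCPDISCOVER while it retries the PXE boot
    tail -f /var/log/syslog | grep -i dhcpd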
[20:55:16] New patchset: awjrichards; "Support for wgMFEnableDesktopReousrces; enable on simplewiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25947
[20:57:12] New patchset: awjrichards; "Support for wgMFEnableDesktopReousrces; enable on simplewiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25947
[21:20:21] New patchset: Jgreen; "remove deprecated fundraising archiver script" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25951
[21:21:12] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25951
[21:21:54] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25951
[21:30:36] New patchset: Jgreen; "adjusting aluminium/grosley backups" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25954
[21:31:34] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25954
[21:36:43] New patchset: Ryan Lane; "Fix range for Labs on the apt-proxy" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25956
[21:37:46] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25956
[21:46:08] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25913
[21:48:07] PROBLEM - Host db62 is DOWN: PING CRITICAL - Packet loss = 100%
[21:50:32] New patchset: Ryan Lane; "Fix Labs instance range for puppet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25958
[21:51:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25958
[21:52:09] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25958
[21:52:29] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[21:54:47] Change merged: preilly; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25947
[21:55:19] RECOVERY - Host db62 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[22:09:50] New review: Diederik; "Hey Siebrand," [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/12188
[22:13:51] New review: Siebrand; "Thanks for pasting, Diederik." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/12188
[22:15:31] New patchset: Andrew Bogott; "Allow glustermanager to rmdir." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25960
[22:16:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25960
[22:17:56] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25960
[23:00:21] New patchset: Aaron Schulz; "Removed copy2() and friends from rewrite.py" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25410
[23:01:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25410
[23:14:28] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours
[23:14:28] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours
[23:32:00] New patchset: Parent5446; "(bug 39380) Enabling secure login (HTTPS)." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21322