[00:14:55] RECOVERY - RAID on mw1110 is OK: OK: no RAID installed
[00:15:05] RECOVERY - DPKG on mw1110 is OK: All packages OK
[00:18:05] RECOVERY - Disk space on mw1110 is OK: DISK OK
[00:20:55] RECOVERY - Disk space on srv287 is OK: DISK OK
[00:21:25] RECOVERY - Puppet freshness on spence is OK: puppet ran at Fri Jan 20 00:21:14 UTC 2012
[00:27:31] RECOVERY - Host virt1 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[00:28:01] RECOVERY - Puppet freshness on bast1001 is OK: puppet ran at Fri Jan 20 00:27:56 UTC 2012
[00:32:48] !log deployed new enwiki slave, db52
[00:32:49] Logged the message, Master
[00:43:41] RECOVERY - MySQL replication status on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s
[00:44:11] PROBLEM - DPKG on ms-fe1 is CRITICAL: Connection refused by host
[00:44:11] PROBLEM - DPKG on ms-fe2 is CRITICAL: Connection refused by host
[00:47:21] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s
[00:47:31] PROBLEM - Disk space on ms-fe1 is CRITICAL: Connection refused by host
[00:49:31] PROBLEM - RAID on ms-fe1 is CRITICAL: Connection refused by host
[00:49:31] PROBLEM - RAID on ms-fe2 is CRITICAL: Connection refused by host
[00:51:11] PROBLEM - Disk space on ms-fe2 is CRITICAL: Connection refused by host
[00:51:31] PROBLEM - SSH on ms-fe1 is CRITICAL: Connection refused
[00:51:31] PROBLEM - SSH on ms-fe2 is CRITICAL: Connection refused
[00:57:41] PROBLEM - RAID on virt1 is CRITICAL: CRITICAL: Degraded
[01:01:31] RECOVERY - MySQL replication status on db1025 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s
[01:03:39] maplebed: is your phone going off?
[01:03:48] no.
[01:03:51] hm
[01:03:55] I wonder what that song is
[01:03:57] *was
[01:04:40] Ryan_Lane: http://www.stumbleupon.com/su/5EBWks
[01:12:00] !log started another hotbackup of db38 to db52
[01:12:02] Logged the message, Master
[01:28:15] New patchset: Bhartshorne; "added new config for ms-fe hosts." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1985
[01:28:51] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1985
[01:28:52] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1985
[01:33:58] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - check plugin (check_job_queue) or PHP errors - --wiki
[01:41:30] I hate partman, but it at least appears that my hosts are building correctly.
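The "CHECK MySQL REPLICATION - lag" alerts above (storage3, db1025) report slave lag via Seconds_Behind_Master. A minimal sketch of that kind of check, assuming a hypothetical monitoring user, password, and warning/critical thresholds - not the values the real Nagios plugin uses:

    #!/usr/bin/env python
    # Minimal replication-lag check in the spirit of the alerts above.
    # Host, credentials, and thresholds are illustrative assumptions.
    import sys
    import pymysql
    import pymysql.cursors

    WARN_SECONDS = 600    # assumed warning threshold
    CRIT_SECONDS = 1200   # assumed critical threshold

    def check_lag(host):
        conn = pymysql.connect(host=host, user="nagios", password="secret",
                               cursorclass=pymysql.cursors.DictCursor)
        try:
            with conn.cursor() as cur:
                cur.execute("SHOW SLAVE STATUS")
                row = cur.fetchone()
        finally:
            conn.close()
        if not row or row["Seconds_Behind_Master"] is None:
            print("CRITICAL - Slave not running")
            return 2
        lag = int(row["Seconds_Behind_Master"])
        if lag >= CRIT_SECONDS:
            print("CRITICAL - Seconds_Behind_Master : %ds" % lag)
            return 2
        if lag >= WARN_SECONDS:
            print("WARNING - Seconds_Behind_Master : %ds" % lag)
            return 1
        print("OK - Seconds_Behind_Master : %ds" % lag)
        return 0

    if __name__ == "__main__":
        sys.exit(check_lag(sys.argv[1]))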
[01:44:48] RECOVERY - SSH on ms-fe1 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[01:54:38] RECOVERY - SSH on ms-fe2 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[01:57:18] RECOVERY - DPKG on ms-fe2 is OK: All packages OK
[01:57:28] RECOVERY - DPKG on ms-fe1 is OK: All packages OK
[01:57:48] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000
[02:00:28] RECOVERY - Disk space on ms-fe1 is OK: DISK OK
[02:01:08] RECOVERY - Disk space on ms-fe2 is OK: DISK OK
[02:45:15] PROBLEM - MySQL replication status on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 1338s
[02:45:25] PROBLEM - MySQL replication status on db1025 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 1348s
[02:51:25] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 1708s
[03:31:56] RECOVERY - Puppet freshness on mw1096 is OK: puppet ran at Fri Jan 20 03:31:35 UTC 2012
[04:16:54] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s
[04:18:15] RECOVERY - Disk space on es1004 is OK: DISK OK
[04:19:24] RECOVERY - MySQL disk space on es1004 is OK: DISK OK
[04:21:44] RECOVERY - MySQL replication status on db1025 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 33s
[04:34:46] PROBLEM - MySQL slave status on es1004 is CRITICAL: CRITICAL: Slave running: expected Yes, got No
[04:41:16] PROBLEM - MySQL replication status on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 1205s
[04:45:56] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 1485s
[04:50:06] PROBLEM - MySQL replication status on db1025 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 1735s
[05:31:16] RECOVERY - MySQL replication status on db1025 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s
[09:52:29] PROBLEM - Disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 443032 MB (3% inode=99%):
[09:54:19] PROBLEM - MySQL disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 433037 MB (3% inode=99%):
[10:44:07] RECOVERY - MySQL slave status on es1004 is OK: OK:
[11:23:47] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 189 MB (2% inode=60%): /var/lib/ureadahead/debugfs 189 MB (2% inode=60%):
[11:29:37] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 166 MB (2% inode=60%): /var/lib/ureadahead/debugfs 166 MB (2% inode=60%):
[11:41:32] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 189 MB (2% inode=60%): /var/lib/ureadahead/debugfs 189 MB (2% inode=60%):
[11:51:32] RECOVERY - Disk space on srv220 is OK: DISK OK
[11:57:42] RECOVERY - Disk space on srv223 is OK: DISK OK
[12:11:52] RECOVERY - Host srv199 is UP: PING OK - Packet loss = 0%, RTA = 1.01 ms
[12:11:52] ACKNOWLEDGEMENT - Host sq46 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn hardware problem - RT 2301
[12:30:42] PROBLEM - Apache HTTP on srv199 is CRITICAL: Connection refused
[12:30:42] PROBLEM - RAID on srv199 is CRITICAL: Connection refused by host
[12:34:42] PROBLEM - Disk space on srv199 is CRITICAL: Connection refused by host
[12:39:17] what is the story with srv199? is it being installed or something?
[12:41:12] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s
[12:42:42] PROBLEM - DPKG on srv199 is CRITICAL: Connection refused by host
[12:47:52] PROBLEM - Memcached on srv199 is CRITICAL: Connection refused
[12:57:22] RECOVERY - MySQL replication status on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s
[13:00:42] RECOVERY - Apache HTTP on srv199 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.005 seconds
[13:03:10] apergos: yes
[13:03:15] !log reinstalling srv199
[13:03:16] Logged the message, Master
[13:03:24] good I guessed it :-P
[13:03:26] apergos: i should have logged before, but i am in DC
[13:03:33] I saw yer email
[13:03:39] how are things there?
[13:04:01] WMF DE wants to put in a lot of new hardware
[13:04:08] but it wasnt even delivered until like an hour ago
[13:04:15] and now they wont get it done im afraid
[13:10:52] RECOVERY - RAID on srv199 is OK: OK: no RAID installed
[13:12:52] RECOVERY - DPKG on srv199 is OK: All packages OK
[13:14:32] RECOVERY - Disk space on srv199 is OK: DISK OK
[13:35:14] RECOVERY - Memcached on srv199 is OK: TCP OK - 0.003 second response time on port 11000
[14:22:34] PROBLEM - SSH on srv272 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:23:24] PROBLEM - Disk space on srv272 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:24:04] PROBLEM - DPKG on srv272 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:29:58] RECOVERY - SSH on srv272 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[14:29:59] RECOVERY - DPKG on srv272 is OK: All packages OK
[14:33:08] RECOVERY - Disk space on srv272 is OK: DISK OK
[14:53:58] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=60%): /var/lib/ureadahead/debugfs 0 MB (0% inode=60%):
[15:07:52] !log powercycling knsq30 after replacing cable
[15:07:53] Logged the message, Master
[15:11:08] !log knsq30 still has bad disk, powering down again
[15:11:09] Logged the message, Master
[15:13:58] RECOVERY - Disk space on srv224 is OK: DISK OK
[17:48:53] !log knsq9 is dead/overloaded
[17:48:54] Logged the message, Mistress of the network gear.
[17:51:40] ah hey leslie... you don't have notes from yer tech talk on networking do you?
[17:51:51] i don't , it was mostly just winging it :(
[17:52:09] phooey
[17:52:16] when it's beach weather in greece, i'll go and give you the talk :)
[17:52:23] see ya in May :-)
[17:52:28] yay!
[17:52:44] don't be like *all* the other staffers who are all talk and no visit!
[17:53:27] so, do you know how to get into the management of the esams hosts ? knsq9 is totally dead and i see there's a .ipmi name but it doesn't ssh ?
[17:53:30] haha
[17:54:05] PROBLEM - Backend Squid HTTP on knsq9 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:54:10] ummm
[17:54:11] :-D
[17:54:25] oh now you tell us nagios :p
[17:54:25] PROBLEM - Frontend Squid HTTP on knsq9 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:54:50] mark: you think we could do enotif testing now?
[17:55:16] knsq9.ipmi.knams.wikimedia.org
[17:55:28] that's from looking at dobson :-P
[17:56:50] Nemo_bis: I asked ;)
[17:56:58] hmmmm
[17:58:06] it's supposed to be a 1950
[17:58:12] so in theory ssh in ought to work
[17:58:37] yeah, so i got the ip address (yay searching the dns files)
[17:58:41] maybe it's just borked
[17:58:44] yeah that's how I got it :-D
[17:58:45] PROBLEM - SSH on knsq9 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:58:55] that's all I can figure
[18:00:11] but what worries me about that theory is
[18:00:39] that knsq16 for example
[18:00:49] is fine, you can ssh to the actual host but...not to mgmt
[18:00:52] so
[18:01:52] http://wikitech.wikimedia.org/view/Pascal I wonder if this is true
[18:01:56] and if so what it means
[18:04:24] i just realized that knsq9 was cuz i was trying to ssh from the states
[18:05:50] all righty then
[18:15:05] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:16:23] mark: around?
[18:18:25] PROBLEM - Host knsq9 is DOWN: PING CRITICAL - Packet loss = 100%
[18:21:00] !log knsq9 will be rebooted as it is dead, dead, dead
[18:21:01] Logged the message, Mistress of the network gear.
[18:25:25] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s)
[18:33:30] RECOVERY - SSH on knsq9 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[18:33:40] RECOVERY - Host knsq9 is UP: PING OK - Packet loss = 0%, RTA = 109.26 ms
[18:33:50] !log knsq9 has recovered post-reboot
[18:33:51] Logged the message, Mistress of the network gear.
[18:33:54] looks like it was completely frozen
[18:35:10] RECOVERY - Backend Squid HTTP on knsq9 is OK: HTTP OK HTTP/1.0 200 OK - 629 bytes in 0.440 seconds
[18:37:10] RECOVERY - Frontend Squid HTTP on knsq9 is OK: HTTP OK HTTP/1.0 200 OK - 650 bytes in 0.220 seconds
[18:38:56] we used to have to reboot the squids every so often just cause
[18:39:47] hehe, time for the monthly reboot ? :)
[18:40:24] there's an old version of force10 hardware that had a bug that every 500ish days (when the seconds had filled up the counter to its max) you had to reboot because it froze :)
[18:40:47] that would be an easy fix
[18:40:50] cron job. yearly :-P
[18:40:56] hehe
[18:50:01] maplebed: so cp3001/3002 actually have a 2gig bundle already
[18:50:14] hrmph.
[18:50:33] what about the LVS server feeding them? (I think I checked it and it was ok, but I'm not positive)
[18:51:01] i am curious that if during peak if there are any tcp queues or dropped packets (there's some showing up in ifconfig but i have no idea how old they are
[18:51:07] any chance we're saturating our outbound pipe? probably not, because then we'd see all hosts lopped instead of just bits.
[18:51:16] ganglia!
[18:51:18] :)
[18:51:43] lemme double check all the pipes, maybe some suboptimal routing is messing with stuff
[18:52:24] the worst part though is there could be "microspikes" (totally my made up word) on the lag'ed connections that cause some errors (i have seen it before)
[18:53:19] i can't wait until we get the new junipers...
[18:54:35] the srxes?
[18:55:28] mx 80's
[18:55:30] for amsterdam
[18:55:33] oh
[18:55:34] replacing those foundries
[18:55:38] cool!
[18:55:45] are we trying to convert over?
[18:55:50] maplebed: so this is weird, cp3001 is getting like 10x the traffic of cp3002
[18:56:04] * apergos lurks and follows the discussion
[18:56:08] eh?
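The "500ish days" Force10 reboot bug mentioned at 18:40 is consistent with an unsigned 32-bit uptime counter wrapping. A quick back-of-the-envelope check, assuming the counter ticks every 10 ms (100 Hz) - the tick rate is an assumption, not something stated in the log:

    # Rough arithmetic for the ~500-day counter-wrap bug mentioned above.
    # Assumption: an unsigned 32-bit counter incremented every 10 ms.
    ticks = 2 ** 32
    seconds_until_wrap = ticks / 100.0           # 42,949,672.96 s
    days_until_wrap = seconds_until_wrap / 86400
    print("counter wraps after %.1f days" % days_until_wrap)   # ~497.1 days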
[18:56:18] actually cp3001 has a few spikes on each interface getting up to about 800M per interface
[18:56:26] http://rt.wikimedia.org/Ticket/Display.html?id=2294
[18:56:43] LeslieCarr: they look the same to me.
[19:01:03] i was wrong, their bonded interfaces aren't equally sending the traffic
[19:01:40] http://observium.wikimedia.org/graphs/639/port_bits/1326999411// http://observium.wikimedia.org/graphs/641/port_bits/1326999282// (cp3001 eth0 and eth1 )
[19:15:56] who owns snapshot*?
[19:16:38] is it apergos?
[19:16:42] yes
[19:16:47] I saw the mail
[19:17:26] next time I stop the workers on a host I'll deal with it
[19:19:30] do you know why lucid-wikimedia was removed from sources.list in the first place?
[19:19:57] no
[19:20:17] sometimes pastebin has such interesting stuff, yet i really wish i could find out who created this file and ask them what they were doing (
[19:20:31] (also english needs a third person singular gender neutral pronoun)
[19:20:43] "they" :-P
[19:21:32] https://en.wikipedia.org/wiki/Singular_they
[19:21:57] but they isn't really singular
[19:22:31] Maybe you can think of 'they' as kind of a probability wave representing a set of potential people, and hence plural.
[19:22:32] you should read the article, there's a case to be made that it can be singular
[19:22:40] Even though it ultimately collapses to a single person...
[19:23:03] i like the probability wave theory
[19:23:09] But, yeah, also, what TimStarling said.
[19:23:48] since it is used widely that way, it can be used that way
[19:23:57] but then I'm not a prescriptivist
[19:24:26] Something tells me there are not a lot of prescriptivists involved in WM projects :)
[19:24:43] you'd be amazed :-D
[19:24:59] :)
[19:25:08] it's like the deletionists vs the inclusionists, there's enough to go round on both sides
[19:25:25] I'm not sure if it's on that article, but there's an interesting analogy with the word "you"
[19:26:34] "you" is an example of a word in english that's both singular and plural
[19:26:41] Hm... that page cites Pinker a lot. Pinker unsettles me.
[19:26:54] so, the argument is, singular "they" shouldn't be considered unnatural
[19:27:26] andrewbogott: how so?
[19:28:06] Um... it's not that profound. Just, I find him utterly convincing but also suspect that he's kind of a jerk.
[19:28:18] :-D
[19:28:29] I guess that's a problem with ev-psych generally. Convincing but obnoxious.
[19:30:20] He wrote this whole series of brilliant books about linguistics and psychology and then capped them off with a volume about how feminists were idiots :(
[19:30:46] well
[19:30:56] brilliant in one arena != brilliant in all
[19:31:04] Yeah.
[19:31:07] although we have a tendency to act like it
[19:32:47] TimStarling: the thing about singular vs. plural 'you' is that people are desperate to invent a different word for plural 'you'. So people clearly feel some kind of discomfort about it being dual-purpose. Maybe that same discomfort with 'they' will endure.
[19:33:03] "y'all"
[19:33:48] Yeah, but in the South "y'all" is singular now. so they resort to "All y'all"
[19:33:55] So there's tension in both directions.
[19:34:02] hehe
[19:34:09] (In Minnesota, at least, "y'all" is still plural)
[19:34:17] in australia we don't have y'all but we do have youse
[19:34:26] not in polite company though
[19:34:42] Heh, 'youse' sounds like 1920's gangster-speak to me.
[19:35:02] ...ya'll has always been plural to me.
[19:35:57] RobH: Where do you live/grow up?
[19:36:11] 13+ in florida
[19:36:15] Huh.
[19:36:20] before that ohio, but ya'll was more florida thing to me
[19:36:26] i say it a lot now, it gives away my roots
[19:36:41] Pretty sure it was singular (or at least dual-purpose) in TN. But maybe I'm confused.
[19:36:51] i use it for both
[19:37:00] how are ya'll works for single person or group
[19:37:08] Ah, right, that's what I was thinking.
[19:37:37] anyway yes singular they is controversial, but it's commonly used, widely understood, and has a long history in literature
[19:38:15] I suspect leslie has long since tuned us out.
[19:38:26] unlike, say, sie and hir
[19:38:48] though i have said all yall
[19:38:54] in particular if its a large group
[19:39:00] so hell if i know.
[19:39:18] I've often heard people claim that in gendered languages the whole issue of trying to use gender-neutral language is moot on account of being generally impossible.
[19:39:52] hence wikipedia is female
[19:39:55] =P
[19:40:22] RobH: That's just what I was saying -- that y'all was invented to distinguish between singular 'you' and plural y'all but then y'all /also/ became ambiguous so we had to invent all y'all.
[19:40:39] indeed, i see your point
[19:40:47] Which, I wonder if that means that "all y'all" will start to be ambiguous as well and we'll have to keep extending...
[19:41:16] I've often heard people claim that in gendered languages the whole issue of trying to use gender-neutral language is moot on account of being generally impossible.
[19:41:17] RobH: it depends on the language...
[19:41:18] I've often heard people claim that in gendered languages the whole issue of trying to use gender-neutral language is moot on account of being generally impossible.
[19:41:39] that's why the push for {{GENDER}} and gender preferences in MW came from russian, not english
[19:42:06] in english we can get by without knowing someone's gender, in russian not so much
[19:42:30] I don't know what {{GENDER}} is -- a tag that specifies the gender of a bio?
[19:42:33] using gender neutral language is pretty tough all right
[19:42:56] I haven't found any good approaches
[19:43:02] TimStarling: really? I think the push came when I implemented it
[19:44:34] Nikerabbit: you're saying you implemented it for some other reason?
[19:45:40] TimStarling: yes and no, I do speak some Russian, but there were lots of languages wanting it or taking use of it after it was implemented
[19:45:55] andrewbogott: it's a parser function for use in interface text, like "$1 had {{GENDER:$1|his|her}} user rights removed"
[19:46:23] Oh, sure. Makes sense.
[19:46:41] It presumes 2 options? Or n?
[19:47:04] http://translatewiki.net/wiki/Gender#Gender_in_languages some investigation we did
[19:47:24] I think it depends on the language
[19:47:58] but there are only 3 user options: male, female and unspecified
[19:48:29] so far it is the same for all languages, male|female|unspecified
[19:48:54] Ah, I see, it's when describing users only.
[19:49:00] Not for grammar generation generally.
[19:50:34] New patchset: Bhartshorne; "adding sharding to proxy config, sharding two containers in the eqiad cluster" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1986
[19:50:50] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/1986
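The {{GENDER:$1|his|her}} example discussed above picks a wording based on the target user's gender preference (male, female, or unspecified). A toy illustration of that selection logic only - this is not MediaWiki's implementation, and the fallback-to-first-form behaviour for unspecified users is an assumption:

    # Toy illustration of GENDER-style message selection. Not MediaWiki code;
    # the fallback behaviour for "unspecified" users is an assumption.
    def gender(preference, male_form, female_form, neutral_form=None):
        if preference == "male":
            return male_form
        if preference == "female":
            return female_form
        # Unspecified users get a neutral form if the message supplies one,
        # otherwise fall back to the first form.
        return neutral_form if neutral_form is not None else male_form

    print("Alice had %s user rights removed."
          % gender("female", "his", "her", "their"))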
[19:54:25] New patchset: Bhartshorne; "adding sharding to proxy config, sharding two containers in the eqiad cluster" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1986
[19:54:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1986
[19:54:57] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1986
[19:54:57] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1986
[20:02:18] sorry, i have been all busy with offline talking for like an hour, gah!
[20:04:50] New patchset: Asher; "attempt to fix db22 puppet breakage" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1987
[20:05:05] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1987
[20:05:16] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1987
[20:05:16] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1987
[20:10:13] TimStarling, also, Russian has been the last which implemented gender namespaces, and Polish seems to be the biggest user of the magic word
[20:10:47] although French translators seem to care enough too
[20:40:18] !log dns update for dataset1001
[20:40:20] Logged the message, RobH
[20:40:49] !log dns servers all still online after update
[20:40:50] Logged the message, RobH
[20:42:17] PROBLEM - SSH on knsq9 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:43:29] if knsq9 is dead again, it's gotta go outta rotation cuz it must have a hw problem
[20:45:08] does seem pretty suspect
[20:46:03] New patchset: Asher; "ensure nrpe.d directory exists" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1988
[20:46:19] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1988
[20:46:41] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1988
[20:46:42] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1988
[20:58:17] PROBLEM - Backend Squid HTTP on knsq9 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:02:07] RECOVERY - SSH on knsq9 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[21:34:54] New patchset: Lcarr; "adding in bonding information to correct hashing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1989
[21:35:07] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/1989
[21:38:15] New patchset: Lcarr; "adding in bonding information to correct hashing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1989
[21:38:27] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/1989
[21:47:32] New patchset: Lcarr; "adding in bonding information to correct hashing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1989
[21:49:14] New patchset: Lcarr; "adding in bonding information to correct hashing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1989
[21:49:21] maplebed: wanna check this out ?
[21:49:30] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1989
[21:49:35] it's already loaded.
[21:49:37] :)
[21:50:47] if they're allowed, you need comments describing what it's doing, why, and where you got the settings from in bonding.conf
[21:51:10] okay, i'll comment it up
[21:51:21] first i'll check and make sure it honors comments
[21:51:34] I'd also recommend changing the commit message from 'correct' hashing to something like 'use hashing appropriate to a single destination gateway'
[21:51:38] yes it does ignore them :)
[21:51:56] or maybe throw xmit_hash_policy in there to help searching the mail spool.
[21:52:02] or the commit log record
[21:56:29] New patchset: Lcarr; "Changing default bonding xmit_hash_policy to layer2+3" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1989
[22:02:50] New patchset: Lcarr; "Changing default bonding xmit_hash_policy to layer2+3" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1989
[22:03:06] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1989
[22:08:18] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/1989
[22:11:41] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1989
[22:11:41] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1989
[22:21:34] maplebed: if i do a services networking restart, will that read the modules.conf info ?
[22:22:32] honestly, any time I need to restart networking, I always reboot the box if at all possible. The sequence of events is different when you restart networking from when you boot the box, and those differences have caused me trouble in the past.
[22:22:41] sadly, I don't think that's reasonable in this case;
[22:22:50] one of those hosts (without bonding) can't handle the traffic alone.
[22:23:02] lemme look at their traffic right now
[22:23:05] (which is a separate problem - it means we've got a spof)
[22:23:07] it's lowish in europe...
[22:23:56] yeah, not quite low enough… almost
[22:24:09] dns switch over to the states then reload ?
[22:24:21] nah.
[22:24:25] reload in place on one of them.
[22:24:29] reboot if it doesn't work.
[22:24:45] !log restarting networking on cp3001
[22:24:46] Logged the message, Mistress of the network gear.
[22:24:48] we're already suffering from a saturated link; re-saturating one for a bit won't be any worse than 12 hours of the normal day.
[22:24:54] (on the ipmi console just in case...)
[22:24:58] +1
[22:25:06] restart: Unknown instance: ?
[22:25:16] huh?
[22:30:37] !log reloading cp3001
[22:30:38] Logged the message, Mistress of the network gear.
[22:33:10] I want a custom morebots message :P
[22:33:45] Derp... wrong channel
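The xmit_hash_policy change above matters because with the default layer2 policy the transmit hash only looks at MAC addresses, and every egress frame from cp3001/cp3002 goes to the same upstream gateway MAC, so all traffic collapses onto one slave of the bond. A toy model of the two policies - the real kernel formulas (see the bonding documentation shipped with the kernel) use specific address bytes and differ in detail; this only shows why layer2+3 spreads flows while layer2 does not:

    # Toy model of bonding transmit hash policies. Simplified on purpose;
    # not the exact Linux formulas, just the behavioural difference.
    import zlib

    def _h(value):
        # Stable toy hash of a string (the kernel uses address bytes directly).
        return zlib.crc32(value.encode("utf-8"))

    def layer2_slave(src_mac, dst_mac, n_slaves):
        # layer2: only MACs are hashed, so one gateway MAC => one slave.
        return (_h(src_mac) ^ _h(dst_mac)) % n_slaves

    def layer2_3_slave(src_mac, dst_mac, src_ip, dst_ip, n_slaves):
        # layer2+3: IPs join the hash, so flows to different remote IPs
        # spread across the slaves.
        return (_h(src_mac) ^ _h(dst_mac) ^ _h(src_ip) ^ _h(dst_ip)) % n_slaves

    GW_MAC = "00:11:22:33:44:55"   # hypothetical upstream gateway MAC
    for dst_ip in ("198.51.100.7", "203.0.113.9", "192.0.2.42"):
        print(dst_ip,
              "layer2 ->", layer2_slave("aa:bb:cc:dd:ee:01", GW_MAC, 2),
              "layer2+3 ->", layer2_3_slave("aa:bb:cc:dd:ee:01", GW_MAC,
                                            "10.0.0.5", dst_ip, 2))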
[22:34:47] maplebed: it doesn't seem to have done it
[22:34:50] New patchset: Asher; "running pt-heartbeat daemon on core cluster dbs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1990
[22:35:05] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1990
[22:35:58] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1990
[22:35:58] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1990
[22:38:09] maplebed: so in /var/log/messages it doesn't even look like it evaluated that bit - also it wants ethtool (which i sort of want) - i wonder if we can pass it as a parameter in the module (lots of other parameters seem to just prepend a bond- to themselves )
[22:38:32] hmm...
[22:40:11] New patchset: Asher; "fix path" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1991
[22:40:29] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1991
[22:43:19] New patchset: Lcarr; "adding in the bonding xmit-hash-policy to interface commands" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1992
[22:43:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1992
[22:43:48] maplebed: want to see if it looks too crazy? ;)
[22:45:55] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/1992
[22:47:47] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1992
[22:47:47] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1992
[22:50:51] hehe maplebed, sadly the puppet scheduled refresh of the interface doesn't work when the interface already is up
[22:51:08] I wouldn't have expected puppet to kick interfaces...
[22:51:17] it does for bonded ones
[22:51:19] supposedly...
[22:51:25] huh.
[22:51:41] oh, now it's just dead, thanks puppet
[22:52:02] maybe it did kick the interface.
[22:52:05] :P
[22:52:13] !log rebooting cp3001
[22:52:14] Logged the message, Mistress of the network gear.
[22:52:15] maybe
[22:55:02] aww yeah, maplebed watch --differences ifconfig is showing that it looks pretty balanced to me
[22:55:50] \o/
[22:55:55] so it needed the reboot?
[22:55:58] yep
[22:56:01] or was it the puppet changes?
[22:56:03] (or both)
[22:57:08] i think both
[22:57:16] puppet then rebooting
[22:58:38] it's started to spew an insane number (173?) of new ganglia metrics.
[22:58:46] I think they're from varnish,
[22:58:52] but they're making the graphs I want to see impossible.
[23:00:27] okay, i think i better do the same for cp3002 and then the other bonded interface machines before they puppet it up themselves
[23:00:49] +1
[23:01:28] LeslieCarr: please also send mail to ops@lists in case shit goes weird.
[23:03:51] PROBLEM - Host ms6 is DOWN: PING CRITICAL - Packet loss = 100%
[23:05:01] PROBLEM - Host sq70 is DOWN: PING CRITICAL - Packet loss = 100%
[23:05:11] PROBLEM - Host niobium is DOWN: PING CRITICAL - Packet loss = 100%
[23:05:22] checking out those
[23:07:32] who knows how to connect to a console on the sun boxes (wikitech's ipmi search has failed me)
[23:08:22] check platform documentation on wikitech
[23:08:28] it has it for each server type we have
[23:08:53] LeslieCarr: http://wikitech.wikimedia.org/view/Category:Platform-specific_documentation
[23:09:27] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1991
[23:09:27] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1991
[23:12:55] stupid sun boxes and their two kinds of console servers …
[23:13:00] hehe, mark what are you doing up ?
[23:13:41] RECOVERY - Host sq70 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms
[23:15:10] packing :(
[23:15:13] I hate packing
[23:15:23] :(
[23:15:26] it sucks
[23:15:33] see my email about the link aggregation and hashing ?
[23:15:38] yeah, good catch
[23:15:40] interesting/crazy stuff
[23:15:42] ohi
[23:15:50] LH landed half an hour later
[23:15:59] which means first passenger out of the aircraft had to wait for an hour
[23:16:03] in immigration queue
[23:16:06] so stupid that it's not the default
[23:16:10] because goddamn air china landed ahead
[23:16:30] i'd laugh
[23:16:35] but maybe i'll have similar problems tomorrow ;)
[23:16:48] aww
[23:16:51] otoh, I had a good long sleep in the flight
[23:17:01] got a new rubber duck too
[23:17:20] cutie: https://fbcdn-sphotos-a.akamaihd.net/hphotos-ak-snc7/p480x480/404961_10150499721346962_570396961_9380866_525696558_n.jpg
[23:17:52] PROBLEM - Host ms2 is DOWN: PING CRITICAL - Packet loss = 100%
[23:18:11] RECOVERY - Host niobium is UP: PING OK - Packet loss = 0%, RTA = 28.74 ms
[23:18:51] PROBLEM - Host sq69 is DOWN: PING CRITICAL - Packet loss = 100%
[23:18:51] PROBLEM - Host ms1 is DOWN: PING CRITICAL - Packet loss = 100%
[23:19:22] awww
[23:19:28] * apergos <3 rubber duckies
[23:19:33] I don't even have exit seats tomorrow :(
[23:20:11] PROBLEM - Host sq67 is DOWN: PING CRITICAL - Packet loss = 100%
[23:21:11] PROBLEM - Host cp3002 is DOWN: PING CRITICAL - Packet loss = 100%
[23:21:21] PROBLEM - Host sq68 is DOWN: PING CRITICAL - Packet loss = 100%
[23:21:21] RECOVERY - Host ms2 is UP: PING OK - Packet loss = 0%, RTA = 2.11 ms
[23:22:21] RECOVERY - Host sq69 is UP: PING OK - Packet loss = 0%, RTA = 0.74 ms
[23:23:21] uhoh cp3002 failed its fsck
[23:23:32] ruh roh
[23:26:04] binasher: did you install the varnish gmond python module?
[23:26:19] i did
[23:26:24] a long time ago
[23:26:50] mark: any chance we could tune or remove it? It creates 173 graphs, which make it rather annoying to use any of the other graphs on the host.
[23:27:03] we can tune it, not remove it
[23:27:07] our request statistics depend on it
[23:27:14] mark: we're having an outage
[23:27:15] maybe the top 5 or 10 metrics rather than all 17x...
[23:27:18] hey guys, big apologies if reported already but do we have a service outage?
[23:27:25] yes
[23:27:29] oop nevermind /me goes in the corner to watch
[23:27:34] it's related to bits
[23:27:40] bits where?
[23:27:48] is this related to your bonding/hashing changes?
[23:27:52] leslie's I mean
[23:27:55] yes
[23:28:10] yeah, my first outage :(
[23:28:23] well done
[23:28:24] so where is it broken?
[23:28:33] apparently everywhere?
[23:28:51] RECOVERY - Host sq67 is UP: PING OK - Packet loss = 0%, RTA = 1.16 ms
[23:29:23] bits.esams seems up?
[23:29:36] seems it's pmtpa
[23:29:47] bits.pmtpa is unused
[23:29:51] bits.eqiad serves US
[23:30:13] binasher suggests moving traffic to pmtpa
[23:30:17] for eqiad ones
[23:30:25] sounds like a good idea
[23:30:39] 23:30:13 up 73 days, 2:16, 3 users, load average: 19596.32, 16822.89, 9404.44
[23:30:40] wow
[23:30:42] haha
[23:30:50] !log reloading arsenic
[23:30:51] Logged the message, Mistress of the network gear.
[23:30:57] reloading?
[23:31:04] just stop varnish
[23:31:12] oh, i will do that on niobium
[23:31:17] the load went crazy for some reason …
[23:31:41] RECOVERY - Host sq68 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms
[23:31:45] so is anyone moving traffic?
[23:32:11] !log killed carnish on niobium , cpu load seems to be going down
[23:32:11] mark: asher is.
[23:32:12] Logged the message, Mistress of the network gear.
[23:32:33] would something with the network being temporarily out cause varnish to freak out and give a crazy load ?
[23:32:37] (with ryan over his shoulder)
[23:33:01] RECOVERY - Host ms1 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[23:33:12] yes it could
[23:33:21] PROBLEM - Host ms3 is DOWN: PING CRITICAL - Packet loss = 100%
[23:33:27] especially with packet loss we've seen it do this before
[23:33:37] many threads get stuck, thread pileup, scheduler overload, etc
[23:33:41] yep
[23:34:02] we've gotten that under control for the last year, until now
[23:34:22] us bits switched over
[23:34:22] sites up
[23:34:47] * mark goes back to packing
[23:34:50] heh
[23:34:51] !log moved bits eqiad to pmtpa (via scenarios/normal/bits-geo.wikimedia.org)
[23:34:52] Logged the message, Master
[23:34:53] looks good
[23:34:55] it's all fixed now :)
[23:34:56] hehe
[23:35:08] that was a quick fix. great job guys
[23:37:04] makes an interesting load graph - http://ganglia.wikimedia.org/2.2.0/?r=hour&cs=&ce=&m=load_one&s=by+name&c=Bits+caches+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 (look at the scale)
[23:39:06] * maplebed wants to see the esams bits network graph tomorrow.
[23:42:16] if you move traffic back to eqiad now it'll likely be fine
[23:42:28] but there's no need, we can do that next week
[23:42:38] ya, we are having that discussion now
[23:44:14] switched
[23:44:15] !log moving north america bits back to eqiad
[23:44:16] Logged the message, Master
[23:45:47] !log ms6 sdc is undergoing fsck due to wrong fs type, bad option, bad superblock, or other on /dev/sdc1,
[23:45:48] Logged the message, Mistress of the network gear.
[23:46:29] ms6? is that a thumb replica?
[23:46:34] i kinda doubt that... fsck for btrfs doesn't exist yet :P
[23:47:08] it's a caching thumbs server in esams
[23:47:51] RECOVERY - Host ms3 is UP: PING OK - Packet loss = 0%, RTA = 1.92 ms
[23:49:23] New patchset: Bhartshorne; "deploying new rewrite.py with sharded container support" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1993
[23:49:25] it looked like it was fsck'ing
[23:49:28] maybe it wasn't
[23:49:39] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1993
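The rewrite.py change above adds sharded container support for the swift clusters: objects for one logical container are spread across several real containers by hashing the object name. A toy sketch of that general idea only - the two-hex-digit shard suffix scheme here is an assumption for illustration, not necessarily what rewrite.py actually does:

    # Toy illustration of container sharding: map an object name to one of N
    # shard containers by hashing the name. The ".<2 hex digits>" suffix is
    # an assumption, not confirmed rewrite.py behaviour.
    import hashlib

    def sharded_container(base_container, object_name, shard_hex_digits=2):
        digest = hashlib.md5(object_name.encode("utf-8")).hexdigest()
        return "%s.%s" % (base_container, digest[:shard_hex_digits])

    print(sharded_container("wikipedia-commons-local-thumb",
                            "Example.jpg/120px-Example.jpg"))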
[23:49:44] yeah it is btrfs
[23:49:45] weird
[23:49:58] going to try to reboot again
[23:50:02] nooo
[23:50:08] doh
[23:50:11] i shouldn't have listened to ryan
[23:50:27] too late :(
[23:52:30] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1993
[23:52:31] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1993
[23:54:12] New review: Diederik; "Three suggestions:" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/1794
[23:54:14] mark: eh?
[23:54:33] mark: well, if there's no fsck, what else are you supposed to do?
[23:54:46] debug and send info to oracle?
[23:54:50] then reformat?
[23:55:08] mount with "degraded" perhaps
[23:55:12] depends on what the problem is
[23:55:28] hm. I guess I need to take some time to learn btrfs
[23:55:50] how about we just nuke ms6 and build out a swift cluster in esams?
[23:55:58] you don't need ms6
[23:56:04] you can just leave it off
[23:56:08] then you don't need to learn yet another obscure filesystem format.
[23:56:18] ms6 was more of a btrfs experiment of me
[23:56:23] I don't think btrfs in the long run, will be obscure
[23:56:24] squids will bypass it if it's down
[23:56:32] btrfs is nice
[23:56:38] it's likely to be the new default at some point
[23:56:46] well did a skip of mounting hdc
[23:56:48] it's been surprisingly stable on ms6, until now ;)
[23:56:50] just like betamax?
[23:56:54] so let's see what's up now :)
[23:56:58] no. seriously. btrfs is really, really nice
[23:57:00] haha, i killed ms6 + the site
[23:57:07] so was betamax.
[23:57:16] * Ryan_Lane shrugs
[23:57:23] aww, you're no fun.
[23:57:26] ubuntu was considering setting as the default this release
[23:57:31] for desktop
[23:57:35] that's a bit early
[23:57:38] agreed
[23:57:43] i've had btrfs fuck up my home server too ;)
[23:57:49] (but I was prepared for that)
[23:57:56] they backed off of that decision I think
[23:59:26] is ms6 accessible?
[23:59:43] seems it's being considered again for 12.10