[00:04:40] PROBLEM - Puppet freshness on sodium is CRITICAL: Puppet has not run in the last 10 hours [00:10:19] New patchset: Kaldari; "turning on PageTriage on test2" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9490 [00:10:25] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/9490 [00:16:39] I'm supposed to do a deployment tomorrow, but I'm not in the wmf-deployment group: https://gerrit.wikimedia.org/r/#/admin/groups/21,members :( [00:17:20] does anyone have access to edit that besides Chad? [00:17:38] Reedy: ^ [00:22:32] New review: awjrichards; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9490 [00:22:34] Change merged: awjrichards; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9490 [00:23:13] kaldari: done [00:23:22] oh, thanks! [00:24:05] AaronSchulz: Can you add Benny too while you're in there? [00:24:31] he's Bsitu in gerrit [00:25:23] ok [01:42:25] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 313 seconds [01:45:16] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 0 seconds [01:46:19] PROBLEM - Host bellin is DOWN: PING CRITICAL - Packet loss = 100% [01:48:07] RECOVERY - Host bellin is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [01:52:17] New patchset: Kaldari; "turning pagetriage off on test2 since the sql files need to be run there" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9493 [01:52:22] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/9493 [01:53:06] New review: Kaldari; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9493 [01:53:08] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9493 [02:36:07] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [02:41:58] PROBLEM - Host bellin is DOWN: PING CRITICAL - Packet loss = 100% [02:44:22] RECOVERY - Host bellin is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [02:48:16] RECOVERY - Puppet freshness on virt2 is OK: puppet ran at Thu May 31 02:48:02 UTC 2012 [02:53:04] PROBLEM - Puppet freshness on srv210 is CRITICAL: Puppet has not run in the last 10 hours [02:54:07] PROBLEM - Puppet freshness on srv296 is CRITICAL: Puppet has not run in the last 10 hours [02:55:01] PROBLEM - Puppet freshness on srv244 is CRITICAL: Puppet has not run in the last 10 hours [02:56:31] RECOVERY - Puppet freshness on srv210 is OK: puppet ran at Thu May 31 02:56:07 UTC 2012 [02:58:01] PROBLEM - Puppet freshness on srv300 is CRITICAL: Puppet has not run in the last 10 hours [02:58:01] PROBLEM - Puppet freshness on srv298 is CRITICAL: Puppet has not run in the last 10 hours [02:59:04] PROBLEM - Puppet freshness on srv236 is CRITICAL: Puppet has not run in the last 10 hours [03:00:07] PROBLEM - Puppet freshness on srv202 is CRITICAL: Puppet has not run in the last 10 hours [03:01:01] PROBLEM - Puppet freshness on srv254 is CRITICAL: Puppet has not run in the last 10 hours [03:04:37] RECOVERY - Puppet freshness on analytics1001 is OK: puppet ran at Thu May 31 03:04:12 UTC 2012 [03:05:04] PROBLEM - Puppet freshness on srv228 is CRITICAL: Puppet has not run in the last 10 hours [03:05:04] PROBLEM - Puppet freshness on srv284 is CRITICAL: Puppet has not run in the last 10 hours 
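For readers following along: the 01:52 back-out of PageTriage on test2 happened because the extension's tables had not been created there yet. A rough sketch of how that schema step is usually handled on the cluster follows; the SQL file path and the "test2wiki" dbname are guesses for illustration, not commands taken from this log.

```bash
# Hypothetical sketch (not the commands actually run): creating an extension's
# tables on one wiki before enabling it, using the cluster's mwscript wrapper
# around MediaWiki maintenance scripts. The PageTriage SQL path and the
# "test2wiki" dbname are assumptions for illustration.
mwscript sql.php --wiki=test2wiki extensions/PageTriage/PageTriage.sql

# If the extension registers its schema via the LoadExtensionSchemaUpdates
# hook, update.php covers the same ground:
mwscript update.php --wiki=test2wiki --quick
```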
[03:06:07] PROBLEM - Puppet freshness on srv292 is CRITICAL: Puppet has not run in the last 10 hours [03:06:07] PROBLEM - Puppet freshness on srv248 is CRITICAL: Puppet has not run in the last 10 hours [03:06:07] PROBLEM - Puppet freshness on srv297 is CRITICAL: Puppet has not run in the last 10 hours [03:07:01] PROBLEM - Puppet freshness on srv229 is CRITICAL: Puppet has not run in the last 10 hours [03:08:04] PROBLEM - Puppet freshness on srv208 is CRITICAL: Puppet has not run in the last 10 hours [03:09:07] PROBLEM - Puppet freshness on srv267 is CRITICAL: Puppet has not run in the last 10 hours [03:10:01] PROBLEM - Puppet freshness on srv259 is CRITICAL: Puppet has not run in the last 10 hours [03:10:01] PROBLEM - Puppet freshness on srv290 is CRITICAL: Puppet has not run in the last 10 hours [03:16:01] PROBLEM - Puppet freshness on srv241 is CRITICAL: Puppet has not run in the last 10 hours [03:19:01] PROBLEM - Puppet freshness on srv299 is CRITICAL: Puppet has not run in the last 10 hours [03:20:04] PROBLEM - Puppet freshness on srv231 is CRITICAL: Puppet has not run in the last 10 hours [03:20:04] PROBLEM - Puppet freshness on srv295 is CRITICAL: Puppet has not run in the last 10 hours [03:23:04] PROBLEM - Puppet freshness on srv226 is CRITICAL: Puppet has not run in the last 10 hours [03:23:04] PROBLEM - Puppet freshness on srv237 is CRITICAL: Puppet has not run in the last 10 hours [04:37:24] PROBLEM - Puppet freshness on bellin is CRITICAL: Puppet has not run in the last 10 hours [05:13:24] PROBLEM - Host bellin is DOWN: PING CRITICAL - Packet loss = 100% [05:18:21] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours [05:18:21] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [05:18:21] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [06:01:24] PROBLEM - Puppet freshness on search20 is CRITICAL: Puppet has not run in the last 10 hours [06:25:23] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [08:12:47] ping mark [08:12:54] Or someone else involved with the Amsterdam cluster [08:13:33] might be a little early, most people are probably in berlin or so [08:14:10] robh is the other possibility but also probably not here just yet [08:14:44] Berlin is in the same timezone as me, 10 am [08:15:08] yes. 
10 might be a bit on the early side for people to be showing up online [08:15:23] That might be true too :-) [08:15:27] :-) [08:17:41] I bet folks will be on within the next 30 min- 1 hour [08:40:16] New patchset: Hashar; "remove excutable bits from some regular files" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9500 [08:40:22] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/9500 [08:40:43] New review: Hashar; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9500 [08:40:45] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9500 [08:54:21] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours [09:00:21] PROBLEM - Puppet freshness on search17 is CRITICAL: Puppet has not run in the last 10 hours [09:04:15] PROBLEM - Puppet freshness on search13 is CRITICAL: Puppet has not run in the last 10 hours [09:04:15] PROBLEM - Puppet freshness on search15 is CRITICAL: Puppet has not run in the last 10 hours [09:11:18] PROBLEM - Puppet freshness on search16 is CRITICAL: Puppet has not run in the last 10 hours [09:11:18] PROBLEM - Puppet freshness on search19 is CRITICAL: Puppet has not run in the last 10 hours [09:11:18] PROBLEM - Puppet freshness on search14 is CRITICAL: Puppet has not run in the last 10 hours [09:21:21] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [09:54:03] PROBLEM - SSH on lvs1001 is CRITICAL: Server answer: [09:55:24] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [10:03:51] I'll be going home... Forgot to take my laptop's electricity cord with me this morning. [10:05:18] PROBLEM - Puppet freshness on sodium is CRITICAL: Puppet has not run in the last 10 hours [10:47:09] !log temporarily pulled srv199 from lvs for php testing [10:47:12] Logged the message, Master [10:47:53] binasher: you need a whole server for that? that's how i roll [11:19:32] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [11:22:09] test [11:22:58] tset [11:25:05] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [11:25:38] https://gerrit.wikimedia.org/r/#/c/7823/ https://gerrit.wikimedia.org/r/#/c/9130/ https://gerrit.wikimedia.org/r/#/c/9129/ https://gerrit.wikimedia.org/r/#/c/7831/ [11:25:48] ^ Could someone from ops approve and push those please? [11:26:14] I know there have been requests to remove the debian specific stuff (fair enough), but with inconsistencies as they currently are, it'd be nicer to have the level playing field to start with [11:30:25] New review: Reedy; "This can probable be abandoned in favour of https://gerrit.wikimedia.org/r/#/c/9126 going live to site" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/9012 [11:33:24] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/9129 [11:33:26] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9129 [11:35:27] New review: Mark Bergsma; "double -o -o ?" 
[operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/7831 [11:37:41] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [11:55:59] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [12:04:22] New review: Reedy; "Other scripts have it.. Such as sync-common-file" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/7831 [12:49:22] Change abandoned: Reedy; "Abandoning. mb_check_encoding code is now live, so this shouldn't be an issue (or at least, not in t..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9012 [12:54:29] PROBLEM - Puppet freshness on srv296 is CRITICAL: Puppet has not run in the last 10 hours [12:55:32] PROBLEM - Puppet freshness on srv244 is CRITICAL: Puppet has not run in the last 10 hours [12:58:32] PROBLEM - Puppet freshness on srv298 is CRITICAL: Puppet has not run in the last 10 hours [12:58:32] PROBLEM - Puppet freshness on srv300 is CRITICAL: Puppet has not run in the last 10 hours [12:59:35] PROBLEM - Puppet freshness on srv236 is CRITICAL: Puppet has not run in the last 10 hours [13:00:29] PROBLEM - Puppet freshness on srv202 is CRITICAL: Puppet has not run in the last 10 hours [13:01:17] New review: Reedy; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9469 [13:01:19] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9469 [13:01:32] PROBLEM - Puppet freshness on srv254 is CRITICAL: Puppet has not run in the last 10 hours [13:02:05] New review: Reedy; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/8326 [13:02:08] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/8326 [13:03:25] New review: Reedy; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9333 [13:03:27] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9333 [13:05:35] PROBLEM - Puppet freshness on srv228 is CRITICAL: Puppet has not run in the last 10 hours [13:05:35] PROBLEM - Puppet freshness on srv284 is CRITICAL: Puppet has not run in the last 10 hours [13:06:29] PROBLEM - Puppet freshness on srv248 is CRITICAL: Puppet has not run in the last 10 hours [13:06:29] PROBLEM - Puppet freshness on srv292 is CRITICAL: Puppet has not run in the last 10 hours [13:06:29] PROBLEM - Puppet freshness on srv297 is CRITICAL: Puppet has not run in the last 10 hours [13:07:32] PROBLEM - Puppet freshness on srv229 is CRITICAL: Puppet has not run in the last 10 hours [13:08:35] PROBLEM - Puppet freshness on srv208 is CRITICAL: Puppet has not run in the last 10 hours [13:09:29] PROBLEM - Puppet freshness on srv267 is CRITICAL: Puppet has not run in the last 10 hours [13:10:32] PROBLEM - Puppet freshness on srv290 is CRITICAL: Puppet has not run in the last 10 hours [13:10:32] PROBLEM - Puppet freshness on srv259 is CRITICAL: Puppet has not run in the last 10 hours [13:16:32] PROBLEM - Puppet freshness on srv241 is CRITICAL: Puppet has not run in the last 10 hours [13:19:18] PROBLEM - Puppet freshness on srv299 is CRITICAL: Puppet has not run in the last 10 hours [13:20:30] PROBLEM - Puppet freshness on srv231 is CRITICAL: Puppet has not run in the last 10 hours [13:20:30] PROBLEM - Puppet freshness 
on srv295 is CRITICAL: Puppet has not run in the last 10 hours [13:20:31] New patchset: Bhartshorne; "inserting smerritt's key for access to the hw swift test cluster" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9510 [13:20:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/9510 [13:21:29] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9510 [13:21:31] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9510 [13:23:17] Reedy!! [13:23:21] RECOVERY - Lucene on search22 is OK: TCP OK - 0.004 second response time on port 8123 [13:23:30] PROBLEM - Puppet freshness on srv226 is CRITICAL: Puppet has not run in the last 10 hours [13:23:30] PROBLEM - Puppet freshness on srv237 is CRITICAL: Puppet has not run in the last 10 hours [13:23:48] maplebed!! [13:24:05] your change is polluting my diff. [13:24:12] lol, which change? [13:24:19] I can access maganese now btw. Thanks for that [13:24:30] changinging the log message for scap, it looks like. [13:24:36] oh good. [13:28:02] anyway, it's merged now. [13:28:30] Reedy: lol, 1.20wmf4 was actually deployed to all non-Wikipedia wikis yesterday? [13:28:37] I can't believe it [13:28:38] awesome [13:28:43] Who did it [13:29:24] Ah Aaron did [13:30:27] Aaron and I [13:36:33] RECOVERY - Puppet freshness on search13 is OK: puppet ran at Thu May 31 13:36:16 UTC 2012 [13:36:48] binasher: am I reading your email correctly that you agree that I don't need to take the es servers out of db.php before stopping mysql, running an update, and rebooting? [13:37:46] no, i think you should comment out of db.php [13:37:54] RECOVERY - Lucene disk space on search13 is OK: DISK OK [13:38:17] ok, I'll do so. I remember reading the mediawiki stuff and deciding it wasn't necessary, but it also doesn't hurt. [13:38:29] I'm going to start on the eqiad es hosts though, [13:38:37] which already aren't in rotation. [13:40:23] !log upgrading and rebooting eqiad es hosts due to 210 day kernel bug thingy. [13:40:27] Logged the message, Master [13:41:10] mediawiki won't send queries to a downed slave but if a process gets a handle to a host that it thinks it can query, then mysql goes away, its going to fail. there's a little window of user impact [13:41:40] ah. [13:42:54] binasher: random other question - on 'aptitude upgrade' I've seen more than once that aptitude wants to uninstall sysstat. Do you know why? [13:43:33] or mark or Ryan_Lane ^^^ [13:43:37] or anybody really. [13:43:42] no clue [13:44:28] nope [13:44:57] PROBLEM - mysqld processes on es1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [13:45:33] ^^^ es1001 is me. [13:45:51] YOU BROKE IT [13:53:59] hey man, your merge was the last thing to hit sockpuppet. [13:55:12] * apergos gets popcorn [13:55:36] mmm.... popcorn.... [13:56:33] http://www.popcorncalories.info/images/popcorn.jpg [13:56:39] RECOVERY - NTP on search13 is OK: NTP OK: Offset -0.01105856895 secs [13:57:31] paravoid: I'm trying to use puppetmaster::self on instance mwreview-test4 and it's in an interesting state now... want to have a look? [13:57:59] NOMS! [13:58:22] what is that, 2 calories worth of popcorn and 10000 from the butter? [13:58:27] PROBLEM - Host es1001 is DOWN: PING CRITICAL - Packet loss = 100% [13:58:59] yeah, can't you smell the movie theater butter smell too? 
:-) [13:59:39] RECOVERY - Host es1001 is UP: PING OK - Packet loss = 0%, RTA = 26.86 ms [14:00:55] man, I love that stuff. that website is something else. [14:01:20] real winners like "Do you know how much movie popcorn calories you get from watching at theatres? Others might be sarcastic and say, “Who cares?” People with this attitude are prone to getting fat." [14:03:06] RECOVERY - mysqld processes on es1001 is OK: PROCS OK: 1 process with command name mysqld [14:03:55] !log returning srv199 to lvs (dialed back to slow / no longer running php 5.4) [14:03:58] Logged the message, Master [14:08:30] PROBLEM - mysqld processes on es1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [14:10:00] PROBLEM - MySQL Replication Heartbeat on db1018 is CRITICAL: CRIT replication delay 190 seconds [14:10:09] PROBLEM - MySQL Slave Delay on db1018 is CRITICAL: CRIT replication delay 192 seconds [14:10:21] ^^^ es1002, still me. db1018 - not me. [14:12:26] 1018 isn't in rotation [14:20:03] PROBLEM - Host es1002 is DOWN: PING CRITICAL - Packet loss = 100% [14:21:15] RECOVERY - Host es1002 is UP: PING OK - Packet loss = 0%, RTA = 26.43 ms [14:23:03] RECOVERY - Puppet freshness on es1002 is OK: puppet ran at Thu May 31 14:22:37 UTC 2012 [14:28:56] !log pulling srv199 from lvs again for further experimentation [14:29:00] Logged the message, Master [14:31:00] RECOVERY - MySQL Replication Heartbeat on db1018 is OK: OK replication delay 25 seconds [14:31:18] RECOVERY - MySQL Slave Delay on db1018 is OK: OK replication delay 22 seconds [14:52:27] RECOVERY - mysqld processes on es1002 is OK: PROCS OK: 1 process with command name mysqld [15:19:27] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [15:19:27] PROBLEM - Puppet freshness on es1003 is CRITICAL: Puppet has not run in the last 10 hours [15:19:28] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [15:49:27] mark: Have you ever run into trouble with partman failing due to the presence of existing partitions? [15:49:38] yes [15:49:40] (I'm guessing that that's my problem, although I have little to no evidence.) [15:49:40] annoying [15:49:57] the few times I ran into something like that I erased the partition table manually [15:50:09] (dd if=/dev/zero of=/dev/sda bs=512 count=1) [15:50:24] Hm... well, that's worth a try. [15:51:24] I take it 'd-i partman-md/device_remove_md boolean true' only works some or none of the time? [15:54:01] Interesting... [15:54:06] mark, when I run that command I get 'dd: can't open '/dev/sda': No medium found' [15:54:12] which fits with the error partman is giving me. [15:55:10] There is an sda link in /dev. [15:55:14] well [15:55:15] is it on the new ciscos ? [15:55:26] if you don't have /dev/sda then that's probably the reason why stuff doesn't work :) [15:55:33] indeed! [15:55:36] hey! [15:55:40] that is happening to me on the new ciscos! [15:55:41] LeslieCarr: Yes, on virt1002. [15:55:44] just because andrew otto was mentioning that a few machines he got set up for him were missing sda and sdb [15:55:47] for analyics [15:55:48] yeah! [15:55:49] damn, ottomata1 you beat me to it :) [15:55:51] well then. [15:56:15] i submitted an RT ticket [15:56:18] hoping someone could help [15:56:28] Oh, and it has sdi and sdj [15:56:41] So the drives are... 
c through j instead of a through h [15:57:20] ah, its at the bottom of this ticket [15:57:20] https://rt.wikimedia.org/Ticket/Display.html?id=1985 [15:57:31] I don't know what those letters mean. Does it reflect actual jumper settings and/or cable hookups? [15:57:34] is that right? and sda and sdb just show up in /dev because...? [15:59:49] ottomata: No idea why they appear. [16:00:09] I could just rewrite my partman to use c through j, except not /all/ of the servers are configured like that. [16:00:14] virt1001 has a through h. [16:00:41] d'you think this issue needs its own ticket? [16:00:53] maybe so, yeah [16:01:00] i was hoping mutante would have some insight [16:01:07] since I think he set up the drives for me [16:01:11] and had a partman recipe too [16:01:58] PROBLEM - Puppet freshness on search20 is CRITICAL: Puppet has not run in the last 10 hours [16:02:51] hm... would this issue be in eqiad, or ops-request? [16:11:44] ottomata: https://rt.wikimedia.org/Ticket/Display.html?id=3055 [16:12:50] cool, thanks [16:19:23] hrm, i mean we should have them all be a-h [16:19:33] i don't want to start having some starting at c .. [16:19:35] i'm curious why [16:22:27] time for some dmidecode? ;) [16:25:58] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [16:28:06] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/9487 [16:33:14] New patchset: Lcarr; "adding in snmptt init file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9487 [16:33:36] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9487 [16:33:36] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9487 [16:56:04] New patchset: Lcarr; "subscribing snmptt to all its config files" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9516 [16:56:26] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/9516 [16:57:08] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/9516 [16:57:10] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9516 [17:01:08] !log rebooted ms1004 [17:01:13] Logged the message, Mistress of the network gear. [17:01:47] !log rebooted ms1004 due to it being unresponsive [17:01:50] Logged the message, Mistress of the network gear. [17:01:52] (figured i should clarify) [17:02:55] RECOVERY - Host ms1004 is UP: PING OK - Packet loss = 0%, RTA = 26.39 ms [17:05:31] !log rebooted mw1091 due to being unresponsive [17:05:35] Logged the message, Mistress of the network gear. [17:08:51] !log rebooted mw1102 because it thinks it has no eth0 [17:08:55] Logged the message, Mistress of the network gear. [17:09:04] RECOVERY - Host mw1091 is UP: PING OK - Packet loss = 0%, RTA = 26.62 ms [17:09:51] !log rebooted ms1004 for kernel upgrade [17:09:55] Logged the message, Mistress of the network gear. [17:10:03] !log rebooted mw1091 for kernel upgrade [17:10:07] Logged the message, Mistress of the network gear. [17:10:15] so what about the sda/sdc thing? [17:12:31] PROBLEM - Host ms1004 is DOWN: PING CRITICAL - Packet loss = 100% [17:13:01] !log rebooting unresponsive mw1135 [17:13:04] Logged the message, Mistress of the network gear. 
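The partman failures discussed around 15:49-15:56 come down to stale partition tables and, on the new Ciscos, disks enumerating as sdc-sdj rather than sda-sdh. A sketch of the manual cleanup mark describes, plus the preseed lines for reference; the second dd is my own addition and the target device is an assumption, not something run in this log.

```bash
# Sketch of the manual cleanup mark describes when partman trips over an
# existing partition table: from the installer's shell (or rescue mode), zero
# the first sector so the stale table stops confusing it. /dev/sda is the
# usual target, but as this conversation shows, the device letters on the new
# Ciscos may start at sdc instead.
dd if=/dev/zero of=/dev/sda bs=512 count=1

# Zeroing a little more of the disk start can also clear stale LVM/RAID
# metadata in some layouts (overkill for a plain partition table):
dd if=/dev/zero of=/dev/sda bs=1M count=10

# The preseed knob quoted earlier belongs in the netboot config rather than a
# shell; for reference, such lines look like:
#   d-i partman-md/device_remove_md boolean true
#   d-i partman-lvm/device_remove_lvm boolean true
```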
[17:13:43] RECOVERY - Host ms1004 is UP: PING OK - Packet loss = 0%, RTA = 27.00 ms [17:14:22] !log rebooting unresponsive mw1143 [17:14:25] Logged the message, Mistress of the network gear. [17:15:04] RECOVERY - Host mw1135 is UP: PING OK - Packet loss = 0%, RTA = 26.47 ms [17:18:22] PROBLEM - SSH on mw1135 is CRITICAL: Connection refused [17:19:43] RECOVERY - SSH on mw1135 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [17:20:46] RECOVERY - Host mw1143 is UP: PING OK - Packet loss = 0%, RTA = 26.51 ms [17:20:55] RECOVERY - Host bellin is UP: PING OK - Packet loss = 0%, RTA = 0.50 ms [17:24:49] Logged the message, Master [17:32:52] !log rebooting mw1135 for kernel upgrade [17:32:56] Logged the message, Mistress of the network gear. [17:33:12] cmjohnson1: shit sorry, i think i messed up your memtest [17:33:19] and rebooted mw64 [17:34:07] jeremyb, good question, what about the sda-sdc thing? [17:34:48] PROBLEM - Host mw1135 is DOWN: PING CRITICAL - Packet loss = 100% [17:35:12] ottomata: i was saying do dmidecode. and compare [17:36:10] RECOVERY - Host mw1135 is UP: PING OK - Packet loss = 0%, RTA = 26.40 ms [17:46:21] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 181 seconds [17:47:24] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 211 seconds [18:18:09] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 182 seconds [18:19:27] New patchset: Hashar; "basic tests for .dblist files" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9524 [18:19:33] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/9524 [18:20:06] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 220 seconds [18:37:08] New review: Aaron Schulz; "(no comment)" [operations/mediawiki-config] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/9524 [18:38:27] New review: Aaron Schulz; "(no comment)" [operations/mediawiki-config] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/9384 [18:39:37] New review: Aaron Schulz; "(no comment)" [operations/mediawiki-config] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/9383 [18:49:55] it's pretty late in the day here (10pm). maybe I can pass you to LeslieCarr? [18:50:07] who I think is on a timezone closer to yours [18:54:31] no worries [18:54:44] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours [19:00:44] PROBLEM - Puppet freshness on search17 is CRITICAL: Puppet has not run in the last 10 hours [19:04:47] PROBLEM - Puppet freshness on search15 is CRITICAL: Puppet has not run in the last 10 hours [19:11:50] PROBLEM - Puppet freshness on search14 is CRITICAL: Puppet has not run in the last 10 hours [19:11:50] PROBLEM - Puppet freshness on search16 is CRITICAL: Puppet has not run in the last 10 hours [19:11:50] PROBLEM - Puppet freshness on search19 is CRITICAL: Puppet has not run in the last 10 hours [19:21:44] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [19:52:36] cmjohnson1: hey [19:52:42] sorry i was getting food, are you still around ? [19:53:26] ok [19:53:37] mw31 and mw30, eh ? [19:54:18] I'll uise this as the opportunity to give them a proper dist-upgrade [19:55:50] ok [19:56:10] so actually power off mw31 but the others are just getting recabled so they will be temporarily down, right ? 
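Earlier in the afternoon jeremyb suggested "dmidecode and compare" to chase down why virt1002 enumerates its disks starting at sdc while virt1001 starts at sda. One way that comparison could look; the hostnames are from the conversation, everything else is illustrative rather than what was actually run.

```bash
# One way to do the "dmidecode and compare" suggested earlier: dump the BIOS,
# system and slot tables from the Cisco that enumerates sda-sdh (virt1001) and
# the one that starts at sdc (virt1002), then diff them to spot firmware or
# slot-population differences.
for h in virt1001 virt1002; do
    ssh "$h" 'sudo dmidecode -t bios -t system -t slot' > "dmi-$h.txt"
done
diff -u dmi-virt1001.txt dmi-virt1002.txt

# It also helps to see how the kernel mapped controllers to device letters:
ssh virt1002 'ls -l /dev/disk/by-path/'
```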
[19:57:21] !log powering off mw31 [19:57:25] Logged the message, Mistress of the network gear. [19:57:46] nope, though i would like to reboot mw30, mw32, and mw33, so let me know when you're going to work on them and i'll reboot them at the same time [19:58:37] PROBLEM - Host mw31 is DOWN: PING CRITICAL - Packet loss = 100% [19:59:13] PROBLEM - Apache HTTP on mw30 is CRITICAL: Connection refused [19:59:13] PROBLEM - Apache HTTP on mw33 is CRITICAL: Connection refused [19:59:13] PROBLEM - Apache HTTP on mw32 is CRITICAL: Connection refused [19:59:47] !log powering down mw30 for maintenance [19:59:50] Logged the message, Mistress of the network gear. [20:00:03] !log powering down mw32 for maintenance [20:00:07] Logged the message, Mistress of the network gear. [20:00:10] cmjohnson1: also took down 32 :) [20:01:10] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.11.32:11000 (Connection timed out) 10.0.11.30:11000 (Connection timed out) 10.0.11.31:11000 (Connection timed out) [20:01:55] PROBLEM - Host mw30 is DOWN: PING CRITICAL - Packet loss = 100% [20:03:16] PROBLEM - Host mw32 is DOWN: PING CRITICAL - Packet loss = 100% [20:05:58] PROBLEM - Puppet freshness on sodium is CRITICAL: Puppet has not run in the last 10 hours [20:06:22] LeslieCarr: ping [20:06:37] jeremyb: pong [20:07:08] my ssh is hanging so i didn't check yet but that memcached alert is probably one of the ones you just took down? [20:07:13] yes [20:07:19] need to swap it in mc.php [20:07:25] oh [20:07:37] yeah, it's mw32 [20:07:48] where do you swap this ? [20:10:19] jeremyb: where's the file to change this ? [20:10:27] next to db.php [20:11:10] i don't have a browser up atm or i would have pastebinned [20:11:13] RECOVERY - Host mw32 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [20:11:49] RECOVERY - Host mw30 is UP: PING OK - Packet loss = 0%, RTA = 0.49 ms [20:16:19] RECOVERY - Apache HTTP on mw33 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.046 second response time [20:18:46] cmjohnson1: hey can we bring mw31 back up ? [20:18:56] cmjohnson1: i didn't realize the impact it would have (stupid memcache) [20:21:44] suppose it depends on how regular they happen [20:21:51] ahh, Reedy's here [20:21:59] Am I? [20:22:06] Reedy: can you walk through an mc.php swap? [20:22:19] Reedy: so, do you know which is worse, swapping out a memcache server or having one with memory errors ? [20:22:40] Well, as long as you actually create a replacement.... It should slowly get populated [20:23:00] If it's already been turned off, the cache is gone now, so turning that one back on now is indifferent [20:23:08] it's already off [20:23:14] the place should be taken by another server [20:23:18] ^ [20:23:21] That's the big thing [20:23:26] to keep the key distribution of the other [20:23:29] when a new server is there, it'll slowly get populated [20:23:54] so possibly a bit (relevant) more apache/mysql load as some other things will need computing again [20:24:15] Are any of the "spares" actually useable? [20:24:19] hrm, so i already turned it off :-/ and it is currently off physically, but not in mc.php [20:24:37] also, trying to find the wikimedia MW repo ... [20:24:46] /h/w/wmf-config/mc.php [20:24:59] bah [20:25:04] /h/w/c/wmf-config/mc.php [20:25:11] just commit directly from the directory [20:25:17] from fenari ? 
[20:25:17] otherwise it's operations/mediawiki-config IIRC [20:25:22] yeah, must easier [20:25:27] make changes, commit, git push origin [20:26:13] ok [20:27:06] ok, so it looks like srv192 is good to go [20:29:19] Reedy: got permission denied (publickey) when trying to push ? [20:29:31] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [20:29:46] ssh agent forwarded etc? [20:30:01] yeah [20:30:10] stupid thing [20:30:39] oh, it's as root [20:30:51] I suspect the root key isn't in gerrit [20:31:01] oh doh, i use bast1001 as my bastion so always wind up logging into fenari as root [20:31:02] hehe [20:31:26] hrm [20:31:56] RECOVERY - Apache HTTP on mw32 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.064 second response time [20:32:34] ok, looks like it is working now .. [20:32:51] did you sync? [20:32:59] oh wait nope [20:33:09] spence thinks all is ok though :p [20:33:34] and spence is LeslieCarr's domain [20:33:42] what does neon think? ;) [20:33:51] spence is a compulsive liar [20:34:11] RECOVERY - Host mw64 is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms [20:34:33] cmjohnson1: yeah, should be fine (as long as it's not up and running again. if so, needs shutting down etc) [20:34:34] ok [20:34:37] looks good right now ... [20:34:45] hehe [20:35:26] New patchset: Kaldari; "Adding LastModified and E3Experiments to config files for testwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9590 [20:35:32] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/9590 [20:36:34] Reedy: what's the next step ? [20:36:45] Have you run sync-file? [20:36:46] have the new mc.php set and pushed [20:36:53] nope [20:36:57] so run sync-file , no arguments ? [20:37:29] PROBLEM - Apache HTTP on mw64 is CRITICAL: Connection refused [20:37:33] New review: Kaldari; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9590 [20:37:35] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9590 [20:37:42] sync-file wmf-config/mc.php [20:37:51] can put a summary after if you want [20:38:07] ok, running sync-file now [20:38:14] RECOVERY - Apache HTTP on mw30 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.033 second response time [20:38:28] it's completed [20:38:42] so that's just for mw31. what about 32? [20:38:58] is that also memcached? [20:39:25] mw32 is already back up [20:39:32] but it's going down again? [20:39:46] cmjohnson1: can you confirm if mw32 is fixed or if it is going back down ? [20:40:09] looks like 31 is the memtest. so maybe 32 is the cable [20:40:43] Computers suck [20:40:44] cool [20:40:48] thanks chris :) [20:40:50] {{fact}}! [20:41:04] if only you had working wifi … :-/ [20:41:23] ok, so Reedy/jeremyb -- is there another step or are we magically happy ? [20:41:27] * Reedy hands cmjohnson1 a 100m cat 5e cable [20:41:30] That's good :) [20:42:53] magically happy except that it needs another update later to say 31 is back up? (or not if it fails the test?) [20:43:14] meh, or don't [20:43:23] just make sure 31 is in spares [20:43:37] right. i'm saying it's in /* DOWN */ [20:43:55] so if it's really a spare it should move to spares [20:44:01] well, currently it isn't a spare ;) [20:44:09] in an hour or 10 hours or whenever we know [20:44:17] manyana! [20:44:32] mañana* [20:44:40] ok [20:44:52] Reedy fails spanish! 
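Pulling the exchange above together, the memcached swap Reedy walks LeslieCarr through looks roughly like this when run as yourself (not root) on fenari. The file path and the sync-file invocation are from the log; reading "/h/w/c" as /home/wikipedia/common, the exact array entry and the commit message are my assumptions.

```bash
# Consolidated sketch of the mc.php swap described above. The spare (srv192)
# and mw31's slot are taken from the conversation; the entry being edited and
# the commit message are illustrative.
cd /home/wikipedia/common            # my reading of the "/h/w/c" shorthand used above
$EDITOR wmf-config/mc.php            # replace mw31's entry (10.0.11.31:11000) with srv192
git add wmf-config/mc.php
git commit -m "Swap mw31 out of the memcached pool for srv192 (memtest)"
git push origin                      # needs your own key; the root key is not in gerrit
sync-file wmf-config/mc.php 'swap mw31 -> srv192 in memcached pool'
```

The key point from the discussion is that the downed host's slot is taken over by a spare rather than removed, so the key distribution of the remaining servers is preserved and only the lost host's share of the cache has to repopulate.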
[20:44:55] thanks cmjohnson1 :) [20:45:02] and thanks Reedy and jeremyb [20:45:17] RECOVERY - Puppet freshness on srv290 is OK: puppet ran at Thu May 31 20:44:48 UTC 2012 [20:46:24] I don't claim any proficiency in spanish [20:46:49] Logged the message, Master [20:46:52] Logged the message, Master [20:46:56] Logged the message, Master [20:47:34] o_0 [20:47:54] pediapress? [20:59:01] Reedy: happened several other times too. we were saying they should get some rdns. or cloaks or something [20:59:12] mmm [20:59:19] or just enter the channel themselves.. [20:59:29] i wonder if that's one irc client on each pdf box.. [21:03:00] i assumed so [21:03:32] i don't have a clue who they are though so what diff does it make if they're a human. (does someone else know offhand?) [21:03:53] anyway, why are those services not run internally? [21:03:59] so they can iterate faster? [21:04:44] * jeremyb wonders if leslie is still around [21:05:06] oh hey [21:05:11] the pediapress stuff ? [21:05:24] nah [21:05:56] oh which services ? [21:06:03] bug 37089 / rt 3024. word is there was an ops list thread about it? [21:06:14] he's come back a few times asking (in this channel) [21:06:26] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 181 seconds [21:06:26] oh [21:06:30] he's not here now but i figured you could maybe say if the thread came to a resolution? [21:06:51] i don't know if we do that or what - no ops thread [21:07:03] well teeny tiny thread [21:07:20] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 187 seconds [21:07:21] which we're like "um, isn't that a wikipedia admin thing where they decide to ban people or not" ? [21:07:35] oh, heh [21:07:35] (also we have ip bans from many years back which worries me with the current ip crunch) [21:07:56] do we have a policy on bogons? ;) [21:08:02] basically the ops opinion is ignorance [21:08:16] anyway, i think the story is both apergos and mark separately banned at the same instant at different levels [21:08:18] manually set some of the previous martian range to unmartian [21:08:20] oh [21:08:23] PROBLEM - Host bellin is DOWN: PING CRITICAL - Packet loss = 100% [21:08:23] then i would ask them [21:08:29] mark especially works in mysterious ways [21:08:35] hah [21:08:41] mark is just mysterious [21:08:56] apergos said he was fine with lifting as long as the guy new that there would be no 3rd chances [21:09:03] idk if anyone asked mark [21:09:18] but there was supposed to have been an ops list thread too. with all of that info in it [21:09:22] i think [21:09:26] RECOVERY - Host bellin is UP: PING OK - Packet loss = 0%, RTA = 0.71 ms [21:09:28] incredibly tiny thread [21:09:47] Apparently it's not the size that matters [21:09:50] one person saying "anyone have input" and the second person saying "um, i don't know where the ban is or whose call" [21:10:14] i guess i have to remember to ask mark when he's around [21:12:05] what happened to gadgets? [21:14:20] ok, fixed by doing a null edit to MediaWiki:Gadgets-definition [21:14:39] it was probably stored in the "lost" memcached [21:16:38] RECOVERY - Puppet freshness on srv296 is OK: puppet ran at Thu May 31 21:16:20 UTC 2012 [21:19:02] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 20 seconds [21:20:12] bye [21:20:13] eep [21:20:17] eeep [21:20:23] let's keep a ping running! 
[21:20:26] ;) [21:20:41] :) [21:21:03] cmjohnson totally isn't a sysadmin - http://xkcd.com/705/ [21:23:59] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 22 seconds [21:25:03] ping? [21:25:11] just wait for people to complain they can't access wikipedia [21:30:10] :) [21:40:07] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 196 seconds [21:41:19] New review: Reedy; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9524 [21:41:21] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9524 [21:42:31] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 186 seconds [21:45:14] New review: Aaron Schulz; "(no comment)" [operations/mediawiki-config] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/9355 [21:45:16] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9355 [21:46:43] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 193 seconds [21:46:52] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:47:01] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 192 seconds [21:48:04] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.024 seconds [21:49:34] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 228 seconds [21:50:28] * jeremyb wonders who wants to review 8344 or 8120? [21:50:51] some reviewer? [21:50:55] or 5492 if you want a really easy one! [21:56:01] PROBLEM - Host bellin is DOWN: PING CRITICAL - Packet loss = 100% [21:57:04] RECOVERY - Host bellin is UP: PING OK - Packet loss = 0%, RTA = 0.57 ms [21:57:20] i just checked all three. they still apply cleanly on production HEAD [22:02:59] New patchset: Aaron Schulz; "Properly purge old versions." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9594 [22:03:05] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/9594 [22:03:32] New review: Aaron Schulz; "(no comment)" [operations/mediawiki-config] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/9594 [22:03:33] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9594 [22:05:01] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 199 seconds [22:05:28] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 181 seconds [22:18:04] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 199 seconds [22:19:23] New patchset: Aaron Schulz; "Removed unused hook handlers." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9597 [22:19:29] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/9597 [22:20:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:22:16] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 188 seconds [22:23:10] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 196 seconds [22:24:49] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.027 seconds [22:27:40] PROBLEM - Host bellin is DOWN: PING CRITICAL - Packet loss = 100% [22:28:52] RECOVERY - Host bellin is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [22:35:01] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 192 seconds [22:37:16] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 210 seconds [22:37:52] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 209 seconds [22:43:07] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 210 seconds [22:56:55] PROBLEM - Puppet freshness on srv244 is CRITICAL: Puppet has not run in the last 10 hours [22:57:13] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:59:46] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 13 seconds [22:59:55] PROBLEM - Puppet freshness on srv298 is CRITICAL: Puppet has not run in the last 10 hours [22:59:55] PROBLEM - Puppet freshness on srv300 is CRITICAL: Puppet has not run in the last 10 hours [23:00:04] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 0 seconds [23:00:58] PROBLEM - Puppet freshness on srv236 is CRITICAL: Puppet has not run in the last 10 hours [23:00:58] PROBLEM - Puppet freshness on srv202 is CRITICAL: Puppet has not run in the last 10 hours [23:02:55] PROBLEM - Puppet freshness on srv254 is CRITICAL: Puppet has not run in the last 10 hours [23:04:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.515 seconds [23:05:55] RECOVERY - Puppet freshness on srv236 is OK: puppet ran at Thu May 31 23:05:29 UTC 2012 [23:05:55] RECOVERY - Puppet freshness on srv244 is OK: puppet ran at Thu May 31 23:05:42 UTC 2012 [23:05:55] RECOVERY - Puppet freshness on srv202 is OK: puppet ran at Thu May 31 23:05:51 UTC 2012 [23:06:58] PROBLEM - Puppet freshness on srv228 is CRITICAL: Puppet has not run in the last 10 hours [23:06:58] PROBLEM - Puppet freshness on srv297 is CRITICAL: Puppet has not run in the last 10 hours [23:06:58] PROBLEM - Puppet freshness on srv248 is CRITICAL: Puppet has not run in the last 10 hours [23:06:58] PROBLEM - Puppet freshness on srv284 is CRITICAL: Puppet has not run in the last 10 hours [23:06:58] PROBLEM - Puppet freshness on srv292 is CRITICAL: Puppet has not run in the last 10 hours [23:08:55] PROBLEM - Puppet freshness on srv229 is CRITICAL: Puppet has not run in the last 10 hours [23:09:58] PROBLEM - Puppet freshness on srv208 is CRITICAL: Puppet has not run in the last 10 hours [23:09:58] PROBLEM - Puppet freshness on srv267 is CRITICAL: Puppet has not run in the last 10 hours [23:11:55] PROBLEM - Puppet freshness on srv259 is CRITICAL: Puppet has not run in the last 10 hours [23:17:55] PROBLEM - Puppet freshness on srv241 is CRITICAL: Puppet has not run in the last 10 hours 
[23:19:52] PROBLEM - Puppet freshness on srv299 is CRITICAL: Puppet has not run in the last 10 hours [23:20:55] PROBLEM - Puppet freshness on srv295 is CRITICAL: Puppet has not run in the last 10 hours [23:20:55] PROBLEM - Puppet freshness on srv231 is CRITICAL: Puppet has not run in the last 10 hours [23:23:55] PROBLEM - Puppet freshness on srv237 is CRITICAL: Puppet has not run in the last 10 hours [23:23:55] PROBLEM - Puppet freshness on srv226 is CRITICAL: Puppet has not run in the last 10 hours [23:38:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:45:31] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.510 seconds [23:47:45] AaronSchulz: in https://gerrit.wikimedia.org/r/#/c/9355/1/wmf-config/swift.php line 175, the for loop appends each element to the urls array? [23:48:11] (if I used that same syntax in other languages it would overwrite the entire array each time and you'd wind up with an array with one element at the end) [23:50:51] [] is append [23:51:24] excellent. [23:51:36] it's live now [23:51:39] in that case lgtm. [23:51:42] not sure how much it will help, since much of the time it just times out [23:51:42] is it working? [23:51:47] eh? [23:51:52] really? most of the time? [23:52:01] not the squid part of course, but the listing [23:52:14] hrmph. [23:52:58] tomorrow I'll test it on a smaller wiki (load a thumbnail, delete it from ms5 but not swift, purge it, see what happens) [23:53:07] the smaller wikis don't timeout like commons does. [23:53:21] well p50 are not timeouts, though all of p90 are [23:53:22] and hopefully we'll have the first fix in place sometime next week. [23:53:32] thanks for pushing that out. [23:54:13] maybe I should fire up that log-scanning, purge script I used before to cleanup after the timeouts [23:55:17] well, ms5 is still the canonical source, [23:55:18] I haven't used that since swift was taken out of use and put back in again [23:55:27] so there shouldn't be anything in swift that's not in ms5 (yet) [23:55:51] hey, in other news, have you tested the migrate originals between backends stuff ? [23:55:57] that'd be an interesting thing to start working on. [23:56:13] ms5 purges first, then swift [23:56:32] if the swift listing GET fails, nothing gets deleted from swift [23:56:38] oh, sorry, I was still thinking about purging from squid, not swift. [23:56:39] so swift has stuff ms5 doesn't anymore [23:56:49] right. I get what you mean. [23:57:05] I haven't tested those migrate scripts...heh, there are still a few patches in gerrit too ;) [23:58:39] ok. maybe that's something we can do in labs. [23:59:32] how is the ssd situation?
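On the 23:47 question about the swift.php loop: AaronSchulz's answer is that `[]` is PHP's append operator, not a whole-array assignment. A quick self-contained check, runnable from a shell; the paths are made up and this is the operator in isolation, not the real swift.php code.

```bash
# Demonstrates that "$urls[] = ..." appends to the array rather than
# overwriting it. Values are invented for illustration.
php -r '
    $urls = array();
    foreach (array("a/ab/Foo.jpg", "a/ab/Bar.jpg") as $path) {
        $urls[] = "http://upload.wikimedia.org/wikipedia/commons/thumb/" . $path;
    }
    print_r($urls);   // prints both elements, in insertion order
'
```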