[00:04:40] PROBLEM - Puppet freshness on sodium is CRITICAL: Puppet has not run in the last 10 hours [00:10:19] New patchset: Kaldari; "turning on PageTriage on test2" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9490 [00:10:25] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/9490 [00:16:39] I'm supposed to do a deployment tomorrow, but I'm not in the wmf-deployment group: https://gerrit.wikimedia.org/r/#/admin/groups/21,members :( [00:17:20] does anyone have access to edit that besides Chad? [00:17:38] Reedy: ^ [00:22:32] New review: awjrichards; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9490 [00:22:34] Change merged: awjrichards; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9490 [00:23:13] kaldari: done [00:23:22] oh, thanks! [00:24:05] AaronSchulz: Can you add Benny too while you're in there? [00:24:31] he's Bsitu in gerrit [00:25:23] ok [01:42:25] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 313 seconds [01:45:16] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 0 seconds [01:46:19] PROBLEM - Host bellin is DOWN: PING CRITICAL - Packet loss = 100% [01:48:07] RECOVERY - Host bellin is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [01:52:17] New patchset: Kaldari; "turning pagetriage off on test2 since the sql files need to be run there" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9493 [01:52:22] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/9493 [01:53:06] New review: Kaldari; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9493 [01:53:08] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9493 [02:36:07] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [02:41:58] PROBLEM - Host bellin is DOWN: PING CRITICAL - Packet loss = 100% [02:44:22] RECOVERY - Host bellin is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [02:48:16] RECOVERY - Puppet freshness on virt2 is OK: puppet ran at Thu May 31 02:48:02 UTC 2012 [02:53:04] PROBLEM - Puppet freshness on srv210 is CRITICAL: Puppet has not run in the last 10 hours [02:54:07] PROBLEM - Puppet freshness on srv296 is CRITICAL: Puppet has not run in the last 10 hours [02:55:01] PROBLEM - Puppet freshness on srv244 is CRITICAL: Puppet has not run in the last 10 hours [02:56:31] RECOVERY - Puppet freshness on srv210 is OK: puppet ran at Thu May 31 02:56:07 UTC 2012 [02:58:01] PROBLEM - Puppet freshness on srv300 is CRITICAL: Puppet has not run in the last 10 hours [02:58:01] PROBLEM - Puppet freshness on srv298 is CRITICAL: Puppet has not run in the last 10 hours [02:59:04] PROBLEM - Puppet freshness on srv236 is CRITICAL: Puppet has not run in the last 10 hours [03:00:07] PROBLEM - Puppet freshness on srv202 is CRITICAL: Puppet has not run in the last 10 hours [03:01:01] PROBLEM - Puppet freshness on srv254 is CRITICAL: Puppet has not run in the last 10 hours [03:04:37] RECOVERY - Puppet freshness on analytics1001 is OK: puppet ran at Thu May 31 03:04:12 UTC 2012 [03:05:04] PROBLEM - Puppet freshness on srv228 is CRITICAL: Puppet has not run in the last 10 hours [03:05:04] PROBLEM - Puppet freshness on srv284 is CRITICAL: Puppet has not run in the last 10 hours 
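For readers following along: the 01:52 back-out of PageTriage on test2 happened because the extension's tables had not been created there yet. A rough sketch of how that schema step is usually handled on the cluster follows; the SQL file path and the "test2wiki" dbname are guesses for illustration, not commands taken from this log.

```bash
# Hypothetical sketch (not the commands actually run): creating an extension's
# tables on one wiki before enabling it, using the cluster's mwscript wrapper
# around MediaWiki maintenance scripts. The PageTriage SQL path and the
# "test2wiki" dbname are assumptions for illustration.
mwscript sql.php --wiki=test2wiki extensions/PageTriage/PageTriage.sql

# If the extension registers its schema via the LoadExtensionSchemaUpdates
# hook, update.php covers the same ground:
mwscript update.php --wiki=test2wiki --quick
```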
[03:06:07] PROBLEM - Puppet freshness on srv292 is CRITICAL: Puppet has not run in the last 10 hours [03:06:07] PROBLEM - Puppet freshness on srv248 is CRITICAL: Puppet has not run in the last 10 hours [03:06:07] PROBLEM - Puppet freshness on srv297 is CRITICAL: Puppet has not run in the last 10 hours [03:07:01] PROBLEM - Puppet freshness on srv229 is CRITICAL: Puppet has not run in the last 10 hours [03:08:04] PROBLEM - Puppet freshness on srv208 is CRITICAL: Puppet has not run in the last 10 hours [03:09:07] PROBLEM - Puppet freshness on srv267 is CRITICAL: Puppet has not run in the last 10 hours [03:10:01] PROBLEM - Puppet freshness on srv259 is CRITICAL: Puppet has not run in the last 10 hours [03:10:01] PROBLEM - Puppet freshness on srv290 is CRITICAL: Puppet has not run in the last 10 hours [03:16:01] PROBLEM - Puppet freshness on srv241 is CRITICAL: Puppet has not run in the last 10 hours [03:19:01] PROBLEM - Puppet freshness on srv299 is CRITICAL: Puppet has not run in the last 10 hours [03:20:04] PROBLEM - Puppet freshness on srv231 is CRITICAL: Puppet has not run in the last 10 hours [03:20:04] PROBLEM - Puppet freshness on srv295 is CRITICAL: Puppet has not run in the last 10 hours [03:23:04] PROBLEM - Puppet freshness on srv226 is CRITICAL: Puppet has not run in the last 10 hours [03:23:04] PROBLEM - Puppet freshness on srv237 is CRITICAL: Puppet has not run in the last 10 hours [04:37:24] PROBLEM - Puppet freshness on bellin is CRITICAL: Puppet has not run in the last 10 hours [05:13:24] PROBLEM - Host bellin is DOWN: PING CRITICAL - Packet loss = 100% [05:18:21] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours [05:18:21] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [05:18:21] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [06:01:24] PROBLEM - Puppet freshness on search20 is CRITICAL: Puppet has not run in the last 10 hours [06:25:23] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [08:12:47] ping mark [08:12:54] Or someone else involved with the Amsterdam cluster [08:13:33] might be a little early, most people are probably in berlin or so [08:14:10] robh is the other possibility but also probably not here just yet [08:14:44] Berlin is in the same timezone as me, 10 am [08:15:08] yes. 
10 might be a bit on the early side for people to be showing up online [08:15:23] That might be true too :-) [08:15:27] :-) [08:17:41] I bet folks will be on within the next 30 min- 1 hour [08:40:16] New patchset: Hashar; "remove excutable bits from some regular files" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9500 [08:40:22] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/9500 [08:40:43] New review: Hashar; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9500 [08:40:45] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9500 [08:54:21] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours [09:00:21] PROBLEM - Puppet freshness on search17 is CRITICAL: Puppet has not run in the last 10 hours [09:04:15] PROBLEM - Puppet freshness on search13 is CRITICAL: Puppet has not run in the last 10 hours [09:04:15] PROBLEM - Puppet freshness on search15 is CRITICAL: Puppet has not run in the last 10 hours [09:11:18] PROBLEM - Puppet freshness on search16 is CRITICAL: Puppet has not run in the last 10 hours [09:11:18] PROBLEM - Puppet freshness on search19 is CRITICAL: Puppet has not run in the last 10 hours [09:11:18] PROBLEM - Puppet freshness on search14 is CRITICAL: Puppet has not run in the last 10 hours [09:21:21] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [09:54:03] PROBLEM - SSH on lvs1001 is CRITICAL: Server answer: [09:55:24] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [10:03:51] I'll be going home... Forgot to take my laptop's electricity cord with me this morning. [10:05:18] PROBLEM - Puppet freshness on sodium is CRITICAL: Puppet has not run in the last 10 hours [10:47:09] !log temporarily pulled srv199 from lvs for php testing [10:47:12] Logged the message, Master [10:47:53] binasher: you need a whole server for that? that's how i roll [11:19:32] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [11:22:09] test [11:22:58] tset [11:25:05] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [11:25:38] https://gerrit.wikimedia.org/r/#/c/7823/ https://gerrit.wikimedia.org/r/#/c/9130/ https://gerrit.wikimedia.org/r/#/c/9129/ https://gerrit.wikimedia.org/r/#/c/7831/ [11:25:48] ^ Could someone from ops approve and push those please? [11:26:14] I know there have been requests to remove the debian specific stuff (fair enough), but with inconsistencies as they currently are, it'd be nicer to have the level playing field to start with [11:30:25] New review: Reedy; "This can probable be abandoned in favour of https://gerrit.wikimedia.org/r/#/c/9126 going live to site" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/9012 [11:33:24] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/9129 [11:33:26] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9129 [11:35:27] New review: Mark Bergsma; "double -o -o ?" 
[operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/7831 [11:37:41] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [11:55:59] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [12:04:22] New review: Reedy; "Other scripts have it.. Such as sync-common-file" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/7831 [12:49:22] Change abandoned: Reedy; "Abandoning. mb_check_encoding code is now live, so this shouldn't be an issue (or at least, not in t..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9012 [12:54:29] PROBLEM - Puppet freshness on srv296 is CRITICAL: Puppet has not run in the last 10 hours [12:55:32] PROBLEM - Puppet freshness on srv244 is CRITICAL: Puppet has not run in the last 10 hours [12:58:32] PROBLEM - Puppet freshness on srv298 is CRITICAL: Puppet has not run in the last 10 hours [12:58:32] PROBLEM - Puppet freshness on srv300 is CRITICAL: Puppet has not run in the last 10 hours [12:59:35] PROBLEM - Puppet freshness on srv236 is CRITICAL: Puppet has not run in the last 10 hours [13:00:29] PROBLEM - Puppet freshness on srv202 is CRITICAL: Puppet has not run in the last 10 hours [13:01:17] New review: Reedy; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9469 [13:01:19] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9469 [13:01:32] PROBLEM - Puppet freshness on srv254 is CRITICAL: Puppet has not run in the last 10 hours [13:02:05] New review: Reedy; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/8326 [13:02:08] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/8326 [13:03:25] New review: Reedy; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9333 [13:03:27] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9333 [13:05:35] PROBLEM - Puppet freshness on srv228 is CRITICAL: Puppet has not run in the last 10 hours [13:05:35] PROBLEM - Puppet freshness on srv284 is CRITICAL: Puppet has not run in the last 10 hours [13:06:29] PROBLEM - Puppet freshness on srv248 is CRITICAL: Puppet has not run in the last 10 hours [13:06:29] PROBLEM - Puppet freshness on srv292 is CRITICAL: Puppet has not run in the last 10 hours [13:06:29] PROBLEM - Puppet freshness on srv297 is CRITICAL: Puppet has not run in the last 10 hours [13:07:32] PROBLEM - Puppet freshness on srv229 is CRITICAL: Puppet has not run in the last 10 hours [13:08:35] PROBLEM - Puppet freshness on srv208 is CRITICAL: Puppet has not run in the last 10 hours [13:09:29] PROBLEM - Puppet freshness on srv267 is CRITICAL: Puppet has not run in the last 10 hours [13:10:32] PROBLEM - Puppet freshness on srv290 is CRITICAL: Puppet has not run in the last 10 hours [13:10:32] PROBLEM - Puppet freshness on srv259 is CRITICAL: Puppet has not run in the last 10 hours [13:16:32] PROBLEM - Puppet freshness on srv241 is CRITICAL: Puppet has not run in the last 10 hours [13:19:18] PROBLEM - Puppet freshness on srv299 is CRITICAL: Puppet has not run in the last 10 hours [13:20:30] PROBLEM - Puppet freshness on srv231 is CRITICAL: Puppet has not run in the last 10 hours [13:20:30] PROBLEM - Puppet freshness 
on srv295 is CRITICAL: Puppet has not run in the last 10 hours [13:20:31] New patchset: Bhartshorne; "inserting smerritt's key for access to the hw swift test cluster" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9510 [13:20:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/9510 [13:21:29] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9510 [13:21:31] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9510 [13:23:17] Reedy!! [13:23:21] RECOVERY - Lucene on search22 is OK: TCP OK - 0.004 second response time on port 8123 [13:23:30] PROBLEM - Puppet freshness on srv226 is CRITICAL: Puppet has not run in the last 10 hours [13:23:30] PROBLEM - Puppet freshness on srv237 is CRITICAL: Puppet has not run in the last 10 hours [13:23:48] maplebed!! [13:24:05] your change is polluting my diff. [13:24:12] lol, which change? [13:24:19] I can access maganese now btw. Thanks for that [13:24:30] changinging the log message for scap, it looks like. [13:24:36] oh good. [13:28:02] anyway, it's merged now. [13:28:30] Reedy: lol, 1.20wmf4 was actually deployed to all non-Wikipedia wikis yesterday? [13:28:37] I can't believe it [13:28:38] awesome [13:28:43] Who did it [13:29:24] Ah Aaron did [13:30:27] Aaron and I [13:36:33] RECOVERY - Puppet freshness on search13 is OK: puppet ran at Thu May 31 13:36:16 UTC 2012 [13:36:48] binasher: am I reading your email correctly that you agree that I don't need to take the es servers out of db.php before stopping mysql, running an update, and rebooting? [13:37:46] no, i think you should comment out of db.php [13:37:54] RECOVERY - Lucene disk space on search13 is OK: DISK OK [13:38:17] ok, I'll do so. I remember reading the mediawiki stuff and deciding it wasn't necessary, but it also doesn't hurt. [13:38:29] I'm going to start on the eqiad es hosts though, [13:38:37] which already aren't in rotation. [13:40:23] !log upgrading and rebooting eqiad es hosts due to 210 day kernel bug thingy. [13:40:27] Logged the message, Master [13:41:10] mediawiki won't send queries to a downed slave but if a process gets a handle to a host that it thinks it can query, then mysql goes away, its going to fail. there's a little window of user impact [13:41:40] ah. [13:42:54] binasher: random other question - on 'aptitude upgrade' I've seen more than once that aptitude wants to uninstall sysstat. Do you know why? [13:43:33] or mark or Ryan_Lane ^^^ [13:43:37] or anybody really. [13:43:42] no clue [13:44:28] nope [13:44:57] PROBLEM - mysqld processes on es1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [13:45:33] ^^^ es1001 is me. [13:45:51] YOU BROKE IT [13:53:59] hey man, your merge was the last thing to hit sockpuppet. [13:55:12] * apergos gets popcorn [13:55:36] mmm.... popcorn.... [13:56:33] http://www.popcorncalories.info/images/popcorn.jpg [13:56:39] RECOVERY - NTP on search13 is OK: NTP OK: Offset -0.01105856895 secs [13:57:31] paravoid: I'm trying to use puppetmaster::self on instance mwreview-test4 and it's in an interesting state now... want to have a look? [13:57:59] NOMS! [13:58:22] what is that, 2 calories worth of popcorn and 10000 from the butter? [13:58:27] PROBLEM - Host es1001 is DOWN: PING CRITICAL - Packet loss = 100% [13:58:59] yeah, can't you smell the movie theater butter smell too? 
:-) [13:59:39] RECOVERY - Host es1001 is UP: PING OK - Packet loss = 0%, RTA = 26.86 ms [14:00:55] man, I love that stuff. that website is something else. [14:01:20] real winners like "Do you know how much movie popcorn calories you get from watching at theatres? Others might be sarcastic and say, “Who cares?” People with this attitude are prone to getting fat." [14:03:06] RECOVERY - mysqld processes on es1001 is OK: PROCS OK: 1 process with command name mysqld [14:03:55] !log returning srv199 to lvs (dialed back to slow / no longer running php 5.4) [14:03:58] Logged the message, Master [14:08:30] PROBLEM - mysqld processes on es1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [14:10:00] PROBLEM - MySQL Replication Heartbeat on db1018 is CRITICAL: CRIT replication delay 190 seconds [14:10:09] PROBLEM - MySQL Slave Delay on db1018 is CRITICAL: CRIT replication delay 192 seconds [14:10:21] ^^^ es1002, still me. db1018 - not me. [14:12:26] 1018 isn't in rotation [14:20:03] PROBLEM - Host es1002 is DOWN: PING CRITICAL - Packet loss = 100% [14:21:15] RECOVERY - Host es1002 is UP: PING OK - Packet loss = 0%, RTA = 26.43 ms [14:23:03] RECOVERY - Puppet freshness on es1002 is OK: puppet ran at Thu May 31 14:22:37 UTC 2012 [14:28:56] !log pulling srv199 from lvs again for further experimentation [14:29:00] Logged the message, Master [14:31:00] RECOVERY - MySQL Replication Heartbeat on db1018 is OK: OK replication delay 25 seconds [14:31:18] RECOVERY - MySQL Slave Delay on db1018 is OK: OK replication delay 22 seconds [14:52:27] RECOVERY - mysqld processes on es1002 is OK: PROCS OK: 1 process with command name mysqld [15:19:27] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [15:19:27] PROBLEM - Puppet freshness on es1003 is CRITICAL: Puppet has not run in the last 10 hours [15:19:28] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [15:49:27] mark: Have you ever run into trouble with partman failing due to the presence of existing partitions? [15:49:38] yes [15:49:40] (I'm guessing that that's my problem, although I have little to no evidence.) [15:49:40] annoying [15:49:57] the few times I ran into something like that I erased the partition table manually [15:50:09] (dd if=/dev/zero of=/dev/sda bs=512 count=1) [15:50:24] Hm... well, that's worth a try. [15:51:24] I take it 'd-i partman-md/device_remove_md boolean true' only works some or none of the time? [15:54:01] Interesting... [15:54:06] mark, when I run that command I get 'dd: can't open '/dev/sda': No medium found' [15:54:12] which fits with the error partman is giving me. [15:55:10] There is an sda link in /dev. [15:55:14] well [15:55:15] is it on the new ciscos ? [15:55:26] if you don't have /dev/sda then that's probably the reason why stuff doesn't work :) [15:55:33] indeed! [15:55:36] hey! [15:55:40] that is happening to me on the new ciscos! [15:55:41] LeslieCarr: Yes, on virt1002. [15:55:44] just because andrew otto was mentioning that a few machines he got set up for him were missing sda and sdb [15:55:47] for analyics [15:55:48] yeah! [15:55:49] damn, ottomata1 you beat me to it :) [15:55:51] well then. [15:56:15] i submitted an RT ticket [15:56:18] hoping someone could help [15:56:28] Oh, and it has sdi and sdj [15:56:41] So the drives are... 
c through j instead of a through h [15:57:20] ah, its at the bottom of this ticket [15:57:20] https://rt.wikimedia.org/Ticket/Display.html?id=1985 [15:57:31] I don't know what those letters mean. Does it reflect actual jumper settings and/or cable hookups? [15:57:34] is that right? and sda and sdb just show up in /dev because...? [15:59:49] ottomata: No idea why they appear. [16:00:09] I could just rewrite my partman to use c through j, except not /all/ of the servers are configured like that. [16:00:14] virt1001 has a through h. [16:00:41] d'you think this issue needs its own ticket? [16:00:53] maybe so, yeah [16:01:00] i was hoping mutante would have some insight [16:01:07] since I think he set up the drives for me [16:01:11] and had a partman recipe too [16:01:58] PROBLEM - Puppet freshness on search20 is CRITICAL: Puppet has not run in the last 10 hours [16:02:51] hm... would this issue be in eqiad, or ops-request? [16:11:44] ottomata: https://rt.wikimedia.org/Ticket/Display.html?id=3055 [16:12:50] cool, thanks [16:19:23] hrm, i mean we should have them all be a-h [16:19:33] i don't want to start having some starting at c .. [16:19:35] i'm curious why [16:22:27] time for some dmidecode? ;) [16:25:58] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [16:28:06] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/9487 [16:33:14] New patchset: Lcarr; "adding in snmptt init file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9487 [16:33:36] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9487 [16:33:36] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9487 [16:56:04] New patchset: Lcarr; "subscribing snmptt to all its config files" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9516 [16:56:26] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/9516 [16:57:08] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/9516 [16:57:10] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9516 [17:01:08] !log rebooted ms1004 [17:01:13] Logged the message, Mistress of the network gear. [17:01:47] !log rebooted ms1004 due to it being unresponsive [17:01:50] Logged the message, Mistress of the network gear. [17:01:52] (figured i should clarify) [17:02:55] RECOVERY - Host ms1004 is UP: PING OK - Packet loss = 0%, RTA = 26.39 ms [17:05:31] !log rebooted mw1091 due to being unresponsive [17:05:35] Logged the message, Mistress of the network gear. [17:08:51] !log rebooted mw1102 because it thinks it has no eth0 [17:08:55] Logged the message, Mistress of the network gear. [17:09:04] RECOVERY - Host mw1091 is UP: PING OK - Packet loss = 0%, RTA = 26.62 ms [17:09:51] !log rebooted ms1004 for kernel upgrade [17:09:55] Logged the message, Mistress of the network gear. [17:10:03] !log rebooted mw1091 for kernel upgrade [17:10:07] Logged the message, Mistress of the network gear. [17:10:15] so what about the sda/sdc thing? [17:12:31] PROBLEM - Host ms1004 is DOWN: PING CRITICAL - Packet loss = 100% [17:13:01] !log rebooting unresponsive mw1135 [17:13:04] Logged the message, Mistress of the network gear. 
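The partman failures discussed around 15:49-15:56 come down to stale partition tables and, on the new Ciscos, disks enumerating as sdc-sdj rather than sda-sdh. A sketch of the manual cleanup mark describes, plus the preseed lines for reference; the second dd is my own addition and the target device is an assumption, not something run in this log.

```bash
# Sketch of the manual cleanup mark describes when partman trips over an
# existing partition table: from the installer's shell (or rescue mode), zero
# the first sector so the stale table stops confusing it. /dev/sda is the
# usual target, but as this conversation shows, the device letters on the new
# Ciscos may start at sdc instead.
dd if=/dev/zero of=/dev/sda bs=512 count=1

# Zeroing a little more of the disk start can also clear stale LVM/RAID
# metadata in some layouts (overkill for a plain partition table):
dd if=/dev/zero of=/dev/sda bs=1M count=10

# The preseed knob quoted earlier belongs in the netboot config rather than a
# shell; for reference, such lines look like:
#   d-i partman-md/device_remove_md boolean true
#   d-i partman-lvm/device_remove_lvm boolean true
```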
[17:13:43] RECOVERY - Host ms1004 is UP: PING OK - Packet loss = 0%, RTA = 27.00 ms [17:14:22] !log rebooting unresponsive mw1143 [17:14:25] Logged the message, Mistress of the network gear. [17:15:04] RECOVERY - Host mw1135 is UP: PING OK - Packet loss = 0%, RTA = 26.47 ms [17:18:22] PROBLEM - SSH on mw1135 is CRITICAL: Connection refused [17:19:43] RECOVERY - SSH on mw1135 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [17:20:46] RECOVERY - Host mw1143 is UP: PING OK - Packet loss = 0%, RTA = 26.51 ms [17:20:55] RECOVERY - Host bellin is UP: PING OK - Packet loss = 0%, RTA = 0.50 ms [17:24:49] Logged the message, Master [17:32:52] !log rebooting mw1135 for kernel upgrade [17:32:56] Logged the message, Mistress of the network gear. [17:33:12] cmjohnson1: shit sorry, i think i messed up your memtest [17:33:19] and rebooted mw64 [17:34:07] jeremyb, good question, what about the sda-sdc thing? [17:34:48] PROBLEM - Host mw1135 is DOWN: PING CRITICAL - Packet loss = 100% [17:35:12] ottomata: i was saying do dmidecode. and compare [17:36:10] RECOVERY - Host mw1135 is UP: PING OK - Packet loss = 0%, RTA = 26.40 ms [17:46:21] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 181 seconds [17:47:24] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 211 seconds [18:18:09] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 182 seconds [18:19:27] New patchset: Hashar; "basic tests for .dblist files" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9524 [18:19:33] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/9524 [18:20:06] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 220 seconds [18:37:08] New review: Aaron Schulz; "(no comment)" [operations/mediawiki-config] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/9524 [18:38:27] New review: Aaron Schulz; "(no comment)" [operations/mediawiki-config] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/9384 [18:39:37] New review: Aaron Schulz; "(no comment)" [operations/mediawiki-config] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/9383 [18:49:55] it's pretty late in the day here (10pm). maybe I can pass you to LeslieCarr? [18:50:07] who I think is on a timezone closer to yours [18:54:31] no worries [18:54:44] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours [19:00:44] PROBLEM - Puppet freshness on search17 is CRITICAL: Puppet has not run in the last 10 hours [19:04:47] PROBLEM - Puppet freshness on search15 is CRITICAL: Puppet has not run in the last 10 hours [19:11:50] PROBLEM - Puppet freshness on search14 is CRITICAL: Puppet has not run in the last 10 hours [19:11:50] PROBLEM - Puppet freshness on search16 is CRITICAL: Puppet has not run in the last 10 hours [19:11:50] PROBLEM - Puppet freshness on search19 is CRITICAL: Puppet has not run in the last 10 hours [19:21:44] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [19:52:36] cmjohnson1: hey [19:52:42] sorry i was getting food, are you still around ? [19:53:26] ok [19:53:37] mw31 and mw30, eh ? [19:54:18] I'll uise this as the opportunity to give them a proper dist-upgrade [19:55:50] ok [19:56:10] so actually power off mw31 but the others are just getting recabled so they will be temporarily down, right ? 
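Earlier in the afternoon jeremyb suggested "dmidecode and compare" to chase down why virt1002 enumerates its disks starting at sdc while virt1001 starts at sda. One way that comparison could look; the hostnames are from the conversation, everything else is illustrative rather than what was actually run.

```bash
# One way to do the "dmidecode and compare" suggested earlier: dump the BIOS,
# system and slot tables from the Cisco that enumerates sda-sdh (virt1001) and
# the one that starts at sdc (virt1002), then diff them to spot firmware or
# slot-population differences.
for h in virt1001 virt1002; do
    ssh "$h" 'sudo dmidecode -t bios -t system -t slot' > "dmi-$h.txt"
done
diff -u dmi-virt1001.txt dmi-virt1002.txt

# It also helps to see how the kernel mapped controllers to device letters:
ssh virt1002 'ls -l /dev/disk/by-path/'
```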
[19:57:21] !log powering off mw31 [19:57:25] Logged the message, Mistress of the network gear. [19:57:46] nope, though i would like to reboot mw30, mw32, and mw33, so let me know when you're going to work on them and i'll reboot them at the same time [19:58:37] PROBLEM - Host mw31 is DOWN: PING CRITICAL - Packet loss = 100% [19:59:13] PROBLEM - Apache HTTP on mw30 is CRITICAL: Connection refused [19:59:13] PROBLEM - Apache HTTP on mw33 is CRITICAL: Connection refused [19:59:13] PROBLEM - Apache HTTP on mw32 is CRITICAL: Connection refused [19:59:47] !log powering down mw30 for maintenance [19:59:50] Logged the message, Mistress of the network gear. [20:00:03] !log powering down mw32 for maintenance [20:00:07] Logged the message, Mistress of the network gear. [20:00:10] cmjohnson1: also took down 32 :) [20:01:10] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.11.32:11000 (Connection timed out) 10.0.11.30:11000 (Connection timed out) 10.0.11.31:11000 (Connection timed out) [20:01:55] PROBLEM - Host mw30 is DOWN: PING CRITICAL - Packet loss = 100% [20:03:16] PROBLEM - Host mw32 is DOWN: PING CRITICAL - Packet loss = 100% [20:05:58] PROBLEM - Puppet freshness on sodium is CRITICAL: Puppet has not run in the last 10 hours [20:06:22] LeslieCarr: ping [20:06:37] jeremyb: pong [20:07:08] my ssh is hanging so i didn't check yet but that memcached alert is probably one of the ones you just took down? [20:07:13] yes [20:07:19] need to swap it in mc.php [20:07:25] oh [20:07:37] yeah, it's mw32 [20:07:48] where do you swap this ? [20:10:19] jeremyb: where's the file to change this ? [20:10:27] next to db.php [20:11:10] i don't have a browser up atm or i would have pastebinned [20:11:13] RECOVERY - Host mw32 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [20:11:49] RECOVERY - Host mw30 is UP: PING OK - Packet loss = 0%, RTA = 0.49 ms [20:16:19] RECOVERY - Apache HTTP on mw33 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.046 second response time [20:18:46] cmjohnson1: hey can we bring mw31 back up ? [20:18:56] cmjohnson1: i didn't realize the impact it would have (stupid memcache) [20:21:44] suppose it depends on how regular they happen [20:21:51] ahh, Reedy's here [20:21:59] Am I? [20:22:06] Reedy: can you walk through an mc.php swap? [20:22:19] Reedy: so, do you know which is worse, swapping out a memcache server or having one with memory errors ? [20:22:40] Well, as long as you actually create a replacement.... It should slowly get populated [20:23:00] If it's already been turned off, the cache is gone now, so turning that one back on now is indifferent [20:23:08] it's already off [20:23:14] the place should be taken by another server [20:23:18] ^ [20:23:21] That's the big thing [20:23:26] to keep the key distribution of the other [20:23:29] when a new server is there, it'll slowly get populated [20:23:54] so possibly a bit (relevant) more apache/mysql load as some other things will need computing again [20:24:15] Are any of the "spares" actually useable? [20:24:19] hrm, so i already turned it off :-/ and it is currently off physically, but not in mc.php [20:24:37] also, trying to find the wikimedia MW repo ... [20:24:46] /h/w/wmf-config/mc.php [20:24:59] bah [20:25:04] /h/w/c/wmf-config/mc.php [20:25:11] just commit directly from the directory [20:25:17] from fenari ? 
[20:25:17] otherwise it's operations/mediawiki-config IIRC [20:25:22] yeah, must easier [20:25:27] make changes, commit, git push origin [20:26:13] ok [20:27:06] ok, so it looks like srv192 is good to go [20:29:19] Reedy: got permission denied (publickey) when trying to push ? [20:29:31] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [20:29:46] ssh agent forwarded etc? [20:30:01] yeah [20:30:10] stupid thing [20:30:39] oh, it's as root [20:30:51] I suspect the root key isn't in gerrit [20:31:01] oh doh, i use bast1001 as my bastion so always wind up logging into fenari as root [20:31:02] hehe [20:31:26] hrm [20:31:56] RECOVERY - Apache HTTP on mw32 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.064 second response time [20:32:34] ok, looks like it is working now .. [20:32:51] did you sync? [20:32:59] oh wait nope [20:33:09] spence thinks all is ok though :p [20:33:34] and spence is LeslieCarr's domain [20:33:42] what does neon think? ;) [20:33:51] spence is a compulsive liar [20:34:11] RECOVERY - Host mw64 is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms [20:34:33] cmjohnson1: yeah, should be fine (as long as it's not up and running again. if so, needs shutting down etc) [20:34:34] ok [20:34:37] looks good right now ... [20:34:45] hehe [20:35:26] New patchset: Kaldari; "Adding LastModified and E3Experiments to config files for testwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9590 [20:35:32] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/9590 [20:36:34] Reedy: what's the next step ? [20:36:45] Have you run sync-file? [20:36:46] have the new mc.php set and pushed [20:36:53] nope [20:36:57] so run sync-file , no arguments ? [20:37:29] PROBLEM - Apache HTTP on mw64 is CRITICAL: Connection refused [20:37:33] New review: Kaldari; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9590 [20:37:35] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9590 [20:37:42] sync-file wmf-config/mc.php [20:37:51] can put a summary after if you want [20:38:07] ok, running sync-file now [20:38:14] RECOVERY - Apache HTTP on mw30 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.033 second response time [20:38:28] it's completed [20:38:42] so that's just for mw31. what about 32? [20:38:58] is that also memcached? [20:39:25] mw32 is already back up [20:39:32] but it's going down again? [20:39:46] cmjohnson1: can you confirm if mw32 is fixed or if it is going back down ? [20:40:09] looks like 31 is the memtest. so maybe 32 is the cable [20:40:43] Computers suck [20:40:44] cool [20:40:48] thanks chris :) [20:40:50] {{fact}}! [20:41:04] if only you had working wifi … :-/ [20:41:23] ok, so Reedy/jeremyb -- is there another step or are we magically happy ? [20:41:27] * Reedy hands cmjohnson1 a 100m cat 5e cable [20:41:30] That's good :) [20:42:53] magically happy except that it needs another update later to say 31 is back up? (or not if it fails the test?) [20:43:14] meh, or don't [20:43:23] just make sure 31 is in spares [20:43:37] right. i'm saying it's in /* DOWN */ [20:43:55] so if it's really a spare it should move to spares [20:44:01] well, currently it isn't a spare ;) [20:44:09] in an hour or 10 hours or whenever we know [20:44:17] manyana! [20:44:32] mañana* [20:44:40] ok [20:44:52] Reedy fails spanish! 
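Pulling the exchange above together, the memcached swap Reedy walks LeslieCarr through looks roughly like this when run as yourself (not root) on fenari. The file path and the sync-file invocation are from the log; reading "/h/w/c" as /home/wikipedia/common, the exact array entry and the commit message are my assumptions.

```bash
# Consolidated sketch of the mc.php swap described above. The spare (srv192)
# and mw31's slot are taken from the conversation; the entry being edited and
# the commit message are illustrative.
cd /home/wikipedia/common            # my reading of the "/h/w/c" shorthand used above
$EDITOR wmf-config/mc.php            # replace mw31's entry (10.0.11.31:11000) with srv192
git add wmf-config/mc.php
git commit -m "Swap mw31 out of the memcached pool for srv192 (memtest)"
git push origin                      # needs your own key; the root key is not in gerrit
sync-file wmf-config/mc.php 'swap mw31 -> srv192 in memcached pool'
```

The key point from the discussion is that the downed host's slot is taken over by a spare rather than removed, so the key distribution of the remaining servers is preserved and only the lost host's share of the cache has to repopulate.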
[20:44:55] thanks cmjohnson1 :) [20:45:02] and thanks Reedy and jeremyb [20:45:17] RECOVERY - Puppet freshness on srv290 is OK: puppet ran at Thu May 31 20:44:48 UTC 2012 [20:46:24] I don't claim any proficiency in spanish [20:46:49] Logged the message, Master [20:46:52] Logged the message, Master [20:46:56] Logged the message, Master [20:47:34] o_0 [20:47:54] pediapress? [20:59:01] Reedy: happened several other times too. we were saying they should get some rdns. or cloaks or something [20:59:12] mmm [20:59:19] or just enter the channel themselves.. [20:59:29] i wonder if that's one irc client on each pdf box.. [21:03:00] i assumed so [21:03:32] i don't have a clue who they are though so what diff does it make if they're a human. (does someone else know offhand?) [21:03:53] anyway, why are those services not run internally? [21:03:59] so they can iterate faster? [21:04:44] * jeremyb wonders if leslie is still around [21:05:06] oh hey [21:05:11] the pediapress stuff ? [21:05:24] nah [21:05:56] oh which services ? [21:06:03] bug 37089 / rt 3024. word is there was an ops list thread about it? [21:06:14] he's come back a few times asking (in this channel) [21:06:26] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 181 seconds [21:06:26] oh [21:06:30] he's not here now but i figured you could maybe say if the thread came to a resolution? [21:06:51] i don't know if we do that or what - no ops thread [21:07:03] well teeny tiny thread [21:07:20] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 187 seconds [21:07:21] which we're like "um, isn't that a wikipedia admin thing where they decide to ban people or not" ? [21:07:35] oh, heh [21:07:35] (also we have ip bans from many years back which worries me with the current ip crunch) [21:07:56] do we have a policy on bogons? ;) [21:08:02] basically the ops opinion is ignorance [21:08:16] anyway, i think the story is both apergos and mark separately banned at the same instant at different levels [21:08:18] manually set some of the previous martian range to unmartian [21:08:20] oh [21:08:23] PROBLEM - Host bellin is DOWN: PING CRITICAL - Packet loss = 100% [21:08:23] then i would ask them [21:08:29] mark especially works in mysterious ways [21:08:35] hah [21:08:41] mark is just mysterious [21:08:56] apergos said he was fine with lifting as long as the guy new that there would be no 3rd chances [21:09:03] idk if anyone asked mark [21:09:18] but there was supposed to have been an ops list thread too. with all of that info in it [21:09:22] i think [21:09:26] RECOVERY - Host bellin is UP: PING OK - Packet loss = 0%, RTA = 0.71 ms [21:09:28] incredibly tiny thread [21:09:47] Apparently it's not the size that matters [21:09:50] one person saying "anyone have input" and the second person saying "um, i don't know where the ban is or whose call" [21:10:14] i guess i have to remember to ask mark when he's around [21:12:05] what happened to gadgets? [21:14:20] ok, fixed by doing a null edit to MediaWiki:Gadgets-definition [21:14:39] it was probably stored in the "lost" memcached [21:16:38] RECOVERY - Puppet freshness on srv296 is OK: puppet ran at Thu May 31 21:16:20 UTC 2012 [21:19:02] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 20 seconds [21:20:12] bye [21:20:13] eep [21:20:17] eeep [21:20:23] let's keep a ping running! 
[21:20:26] ;) [21:20:41] :) [21:21:03] cmjohnson totally isn't a sysadmin - http://xkcd.com/705/ [21:23:59] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 22 seconds [21:25:03] ping? [21:25:11] just wait for people to complain they can't access wikipedia [21:30:10] :) [21:40:07] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 196 seconds [21:41:19] New review: Reedy; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9524 [21:41:21] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9524 [21:42:31] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 186 seconds [21:45:14] New review: Aaron Schulz; "(no comment)" [operations/mediawiki-config] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/9355 [21:45:16] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9355 [21:46:43] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 193 seconds [21:46:52] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:47:01] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 192 seconds [21:48:04] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.024 seconds [21:49:34] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 228 seconds [21:50:28] * jeremyb wonders who wants to review 8344 or 8120? [21:50:51] some reviewer? [21:50:55] or 5492 if you want a really easy one! [21:56:01] PROBLEM - Host bellin is DOWN: PING CRITICAL - Packet loss = 100% [21:57:04] RECOVERY - Host bellin is UP: PING OK - Packet loss = 0%, RTA = 0.57 ms [21:57:20] i just checked all three. they still apply cleanly on production HEAD [22:02:59] New patchset: Aaron Schulz; "Properly purge old versions." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9594 [22:03:05] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/9594 [22:03:32] New review: Aaron Schulz; "(no comment)" [operations/mediawiki-config] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/9594 [22:03:33] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9594 [22:05:01] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 199 seconds [22:05:28] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 181 seconds [22:18:04] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 199 seconds [22:19:23] New patchset: Aaron Schulz; "Removed unused hook handlers." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9597 [22:19:29] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/9597 [22:20:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:22:16] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 188 seconds [22:23:10] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 196 seconds [22:24:49] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.027 seconds [22:27:40] PROBLEM - Host bellin is DOWN: PING CRITICAL - Packet loss = 100% [22:28:52] RECOVERY - Host bellin is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [22:35:01] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 192 seconds [22:37:16] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 210 seconds [22:37:52] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 209 seconds [22:43:07] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 210 seconds [22:56:55] PROBLEM - Puppet freshness on srv244 is CRITICAL: Puppet has not run in the last 10 hours [22:57:13] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:59:46] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 13 seconds [22:59:55] PROBLEM - Puppet freshness on srv298 is CRITICAL: Puppet has not run in the last 10 hours [22:59:55] PROBLEM - Puppet freshness on srv300 is CRITICAL: Puppet has not run in the last 10 hours [23:00:04] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 0 seconds [23:00:58] PROBLEM - Puppet freshness on srv236 is CRITICAL: Puppet has not run in the last 10 hours [23:00:58] PROBLEM - Puppet freshness on srv202 is CRITICAL: Puppet has not run in the last 10 hours [23:02:55] PROBLEM - Puppet freshness on srv254 is CRITICAL: Puppet has not run in the last 10 hours [23:04:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.515 seconds [23:05:55] RECOVERY - Puppet freshness on srv236 is OK: puppet ran at Thu May 31 23:05:29 UTC 2012 [23:05:55] RECOVERY - Puppet freshness on srv244 is OK: puppet ran at Thu May 31 23:05:42 UTC 2012 [23:05:55] RECOVERY - Puppet freshness on srv202 is OK: puppet ran at Thu May 31 23:05:51 UTC 2012 [23:06:58] PROBLEM - Puppet freshness on srv228 is CRITICAL: Puppet has not run in the last 10 hours [23:06:58] PROBLEM - Puppet freshness on srv297 is CRITICAL: Puppet has not run in the last 10 hours [23:06:58] PROBLEM - Puppet freshness on srv248 is CRITICAL: Puppet has not run in the last 10 hours [23:06:58] PROBLEM - Puppet freshness on srv284 is CRITICAL: Puppet has not run in the last 10 hours [23:06:58] PROBLEM - Puppet freshness on srv292 is CRITICAL: Puppet has not run in the last 10 hours [23:08:55] PROBLEM - Puppet freshness on srv229 is CRITICAL: Puppet has not run in the last 10 hours [23:09:58] PROBLEM - Puppet freshness on srv208 is CRITICAL: Puppet has not run in the last 10 hours [23:09:58] PROBLEM - Puppet freshness on srv267 is CRITICAL: Puppet has not run in the last 10 hours [23:11:55] PROBLEM - Puppet freshness on srv259 is CRITICAL: Puppet has not run in the last 10 hours [23:17:55] PROBLEM - Puppet freshness on srv241 is CRITICAL: Puppet has not run in the last 10 hours 
[23:19:52] PROBLEM - Puppet freshness on srv299 is CRITICAL: Puppet has not run in the last 10 hours [23:20:55] PROBLEM - Puppet freshness on srv295 is CRITICAL: Puppet has not run in the last 10 hours [23:20:55] PROBLEM - Puppet freshness on srv231 is CRITICAL: Puppet has not run in the last 10 hours [23:23:55] PROBLEM - Puppet freshness on srv237 is CRITICAL: Puppet has not run in the last 10 hours [23:23:55] PROBLEM - Puppet freshness on srv226 is CRITICAL: Puppet has not run in the last 10 hours [23:38:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:45:31] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.510 seconds [23:47:45] AaronSchulz: in https://gerrit.wikimedia.org/r/#/c/9355/1/wmf-config/swift.php line 175, the for loop appends each element to the urls array? [23:48:11] (if I used that same syntax in other languages it would overwrite the entire array each time and you'd wind up with an array with one element at the end) [23:50:51] [] is append [23:51:24] excellent. [23:51:36] it's live now [23:51:39] in that case lgtm. [23:51:42] not sure how much it will help, since much of the time it just times out [23:51:42] is it working? [23:51:47] eh? [23:51:52] really? most of the time? [23:52:01] not the squid part of course, but the listing [23:52:14] hrmph. [23:52:58] tomorrow I'll test it on a smaller wiki (load a thumbnail, delete it from ms5 but not swift, purge it, see what happens) [23:53:07] the smaller wikis don't timeout like commons does. [23:53:21] well p50 are not timeouts, though all of p90 are [23:53:22] and hopefully we'll have the first fix in place sometime next week. [23:53:32] thanks for pushing that out. [23:54:13] maybe I should fire up that log-scanning, purge script I used before to cleanup after the timeouts [23:55:17] well, ms5 is still the canonical source, [23:55:18] I haven't used that since swift was taken out of use and put back in again [23:55:27] so there shouldn't be anything in swift that's not in ms5 (yet) [23:55:51] hey, in other news, have you tested the migrate originals between backends stuff ? [23:55:57] that'd be an interesting thing to start working on. [23:56:13] ms5 purges first, then swift [23:56:32] if the swift listing GET fails, nothing gets deleted from swift [23:56:38] oh, sorry, I was still thinking about purging from squid, not swift. [23:56:39] so swift has stuff ms5 doesn't anymore [23:56:49] right. I get what you mean. [23:57:05] I haven't tested those migrate scripts...heh, there are still a few patches in gerrit too ;) [23:58:39] ok. maybe that's something we can do in labs. [23:59:32] how is the ssd situation?
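On the 23:47 question about the swift.php loop: AaronSchulz's answer is that `[]` is PHP's append operator, not a whole-array assignment. A quick self-contained check, runnable from a shell; the paths are made up and this is the operator in isolation, not the real swift.php code.

```bash
# Demonstrates that "$urls[] = ..." appends to the array rather than
# overwriting it. Values are invented for illustration.
php -r '
    $urls = array();
    foreach (array("a/ab/Foo.jpg", "a/ab/Bar.jpg") as $path) {
        $urls[] = "http://upload.wikimedia.org/wikipedia/commons/thumb/" . $path;
    }
    print_r($urls);   // prints both elements, in insertion order
'
```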