[00:00:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.043 seconds [00:01:15] ^demon: what's this registerEmailPrivateKey thing? [00:01:19] it just got removed from the config [00:01:23] is it important? [00:01:39] <^demon> I didn't touch that in my change to gerrit. [00:02:04] it was added on the upgrade [00:02:10] when puppet ran it removed it [00:02:43] <^demon> Hmm. Don't know. It's not documented in [auth] [00:02:53] <^demon> But it is in an example config for secure.config :\ [00:03:06] <^demon> Inconsistent docs, whe [00:03:52] I'm looking at the code [00:04:26] seems it's needed for allowing people to register their email addresses [00:04:37] they can't do that anyway [00:04:39] since we're using ldap [00:05:40] <^demon> The UI tells them they can, then spews errors when you try. [00:05:46] <^demon> Finally makes sense why now [00:05:48] ugh [00:05:59] well, we'll need to add this to the private repo [00:06:17] lemme add this really quick [00:06:38] <^demon> You can register multiple user names under your account. The only one you can't set is preferred_email since that'll always revert back to LDAP version [00:07:32] New patchset: Ryan Lane; "Adding missing field for email registration in gerrit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14434 [00:08:05] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14434 [00:09:13] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14434 [00:09:25] !log force running puppet on manganese, it'll restart gerrit [00:09:33] Logged the message, Master [00:10:05] <^demon> 2.5 hasn't been branched yet, so we shouldn't have to repeat this soon. In the meantime, I'll work on gerrit it 100% working with puppet on labs, so we can iron out some of these issues. [00:10:33] <^demon> s/it/so it is/ [00:10:34] bleh [00:10:38] undefined [00:10:47] I fucked that up somehow [00:11:29] Are they still on track to add in the plugin interface to 2.5? [00:11:44] <^demon> Yeah, that's already in master. [00:11:49] New patchset: Ryan Lane; "Add email key to gerrit config class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14435 [00:11:50] Oh cool. [00:12:05] <^demon> So unless they branch some arbitrary point before that, yeah it'll be in 2.5 [00:12:21] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14435 [00:12:29] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14435 [00:12:52] <^demon> 2.5 is gonna be super awesome. I need to try writing a plugin :) [00:13:17] saved searches, saved searches! [00:13:20] heh [00:13:27] <^demon> Don't think that made it in yet :( [00:13:32] <^demon> Plugins are a Big Deal though. [00:13:38] write that as a plugin ;) [00:13:46] !log restarting gerrit [00:13:54] Logged the message, Master [00:14:24] well, we're done :) [00:14:28] <^demon> Ryan_Lane: My first idea for a plugin is actually getting drdee_'s stats stuff integrated. Having it as a dashboard like /stats/ would be super cool. [00:14:35] oh [00:14:36] yeah [00:14:37] that would be [00:15:25] <^demon> Ok, gonna mark this FIXED :) [00:15:32] Can we have a plugin that tracks how many times ryan has to restar it per upgrade? 
It's java so I'd expect a steady upwards line :D [00:15:40] -_- [00:17:16] <^demon> We'll resolve a bunch of these issues before next time :) [00:21:50] <^demon> Ryan_Lane: Thanks so much! [00:21:55] yw [00:30:56] Ryan_Lane: When you talk about per-project puppet branches requiring modules... are you envisioning one module per repo, or a bunch of modules all in one repo? [00:31:13] a bunch of modules all in one repo [00:31:17] 'k [00:31:25] otherwise we need to deal with submodules [00:31:35] and the workflow for that would suck [00:31:52] git submodules, you mean? ok, makes sense. [00:32:00] yeah [00:32:03] * andrewbogott doesn't hate git submodules, but understands why they are generally hated [00:32:12] it would be much nicer than one monolithic repo [00:32:29] Git submodules are awesome but are 100% not svn externals. [00:32:41] well, nice to everyone but us [00:33:18] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:33:24] Supposedly gerrit is moderately smart about submodules. Not that I've ever seen it done. [00:33:32] Gerrit is smart about it [00:33:37] See the mediawiki/extensions.git repo [00:35:58] andrewbogott: but, either way, we need to use puppet modules :) [00:36:07] it'll likely be a long time till we're fully using modules [00:36:46] email incoming [00:41:56] New patchset: Bhartshorne; "updating swift ring files putting ms-be6,7,8 into rotation for containers and accounts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14437 [00:42:30] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14437 [00:42:32] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14437 [00:43:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.032 seconds [00:44:41] hm... forgot that Faidon is on vacation forever :( [00:55:41] andrewbogott: ah. I see what you were asking about now [00:55:48] we'd really want to have modules for each key piece [00:55:58] like, openstack would be a module with a number of components [00:56:13] like keystone, nova, glance, cinder, swift, quantum, etc [00:56:21] gerrit would be a module [00:56:53] even if we went with one giant module, we'd need to rename every class [00:57:09] it's better to just break out pieces slowly over time and turn them into modules [01:05:57] Erik and I are just getting caught up on the db40 fun today....still ongoing, or in cleanup now? [01:06:15] Eloquence: I just asked :) [01:06:19] k [01:07:38] mutante: still around? [01:09:43] * robla may have to resort to random pinging [01:09:57] robla: asher and domas would likely know that answer [01:10:03] ask via email? [01:10:16] I'd imagine that it's done, or we'd see them around talking about it and working on it [01:16:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:21:02] PROBLEM - udp2log log age for oxygen on oxygen is CRITICAL: CRITICAL: log files /a/squid/zero-orange-kenya.log, have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days. [01:25:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.073 seconds [01:26:05] Ryan_Lane: Why would we have to rename ever class? 
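The registerEmailPrivateKey setting discussed above (from [00:01:15] onward) lives in Gerrit's secure.config, which uses git-config syntax, so it can be inspected or set with plain git. A minimal sketch, assuming a conventional review_site path on the Gerrit host — the path and the placeholder value are assumptions; in production the real value comes from the private puppet repo:

    # read the current key, if any, from Gerrit's secure.config
    sudo git config --file /var/lib/gerrit2/review_site/etc/secure.config --get auth.registerEmailPrivateKey
    # set it; Gerrit needs a restart afterwards, as was done at [00:13:46]
    sudo git config --file /var/lib/gerrit2/review_site/etc/secure.config auth.registerEmailPrivateKey 'PLACEHOLDER-KEY'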
[01:26:19] the class names need to start with the module name [01:26:46] so, openstack would need to start with openstack::nova, openstack::swift, etc [01:28:32] RECOVERY - udp2log log age for oxygen on oxygen is OK: OK: all log files active [01:29:15] New patchset: Ryan Lane; "Diablo ppa is gone, removing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14439 [01:29:20] ah, ok. Hm. [01:29:48] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14439 [01:31:02] it should be somewhat easy to convert a lot of it quickly [01:31:31] it's the spaghetti code-ish parts that will be difficult [01:35:44] !log updated squid redirector to cover wiki(quotes|books|versity) [01:35:53] Logged the message, Master [01:40:50] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14439 [01:40:50] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 224 seconds [01:42:47] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 251 seconds [01:49:32] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 655s [01:52:25] binasher: I think I got FlaggedRevs working with LocalRDBStore now in testing \o/ [01:52:38] * AaronSchulz should really go home now... [01:52:44] :O that's awesome!! [01:54:02] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 24s [01:54:29] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 23 seconds [01:54:56] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 1 seconds [01:58:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:09:02] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.026 seconds [02:34:58] PROBLEM - Puppet freshness on cp1017 is CRITICAL: Puppet has not run in the last 10 hours [02:34:58] PROBLEM - Puppet freshness on mw1102 is CRITICAL: Puppet has not run in the last 10 hours [02:48:55] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [02:53:01] http://upload.wikimedia.org/wikipedia/commons/thumb/c/c8/RTVE_Testcard.svg/200px-RTVE_Testcard.svg.png [02:53:10] ***MEMORY-ERROR***: rsvg-convert[9385]: GSlice: failed to allocate 496 bytes (alignment: 512): Cannot allocate memory [02:53:28] Error generating thumbnail Error creating thumbnail: [02:53:39] http://commons.wikimedia.org/wiki/File:RTVE_Testcard.svg [02:59:16] ToAruShiroiNeko: surprise [02:59:37] 24 megabytes of SVG are not easy to render [03:07:23] if its normal behaviour, please disregard [03:48:47] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [05:01:44] PROBLEM - Puppet freshness on ms3 is CRITICAL: Puppet has not run in the last 10 hours [05:11:47] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [05:21:48] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [05:34:51] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [06:18:59] New patchset: Tim Starling; "Remove some accumulated crap from live-1.5" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/14447 [06:38:26] PROBLEM - Puppet freshness on search31 is CRITICAL: Puppet has not run in the last 10 hours [06:38:26] PROBLEM - Puppet freshness on 
cp3002 is CRITICAL: Puppet has not run in the last 10 hours [06:40:32] PROBLEM - Puppet freshness on sq69 is CRITICAL: Puppet has not run in the last 10 hours [06:41:26] PROBLEM - Puppet freshness on search24 is CRITICAL: Puppet has not run in the last 10 hours [06:41:26] PROBLEM - Puppet freshness on search34 is CRITICAL: Puppet has not run in the last 10 hours [06:41:26] PROBLEM - Puppet freshness on strontium is CRITICAL: Puppet has not run in the last 10 hours [06:46:32] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [06:47:26] PROBLEM - Puppet freshness on search16 is CRITICAL: Puppet has not run in the last 10 hours [06:47:26] PROBLEM - Puppet freshness on search21 is CRITICAL: Puppet has not run in the last 10 hours [06:47:26] PROBLEM - Puppet freshness on search22 is CRITICAL: Puppet has not run in the last 10 hours [06:47:26] PROBLEM - Puppet freshness on search20 is CRITICAL: Puppet has not run in the last 10 hours [06:47:26] PROBLEM - Puppet freshness on search30 is CRITICAL: Puppet has not run in the last 10 hours [06:47:27] PROBLEM - Puppet freshness on search33 is CRITICAL: Puppet has not run in the last 10 hours [06:47:27] PROBLEM - Puppet freshness on search27 is CRITICAL: Puppet has not run in the last 10 hours [06:47:27] PROBLEM - Puppet freshness on search36 is CRITICAL: Puppet has not run in the last 10 hours [06:47:28] PROBLEM - Puppet freshness on search28 is CRITICAL: Puppet has not run in the last 10 hours [06:50:26] PROBLEM - Puppet freshness on sq70 is CRITICAL: Puppet has not run in the last 10 hours [06:52:57] PROBLEM - Puppet freshness on search26 is CRITICAL: Puppet has not run in the last 10 hours [06:54:00] PROBLEM - Puppet freshness on search18 is CRITICAL: Puppet has not run in the last 10 hours [06:54:00] PROBLEM - Puppet freshness on sq67 is CRITICAL: Puppet has not run in the last 10 hours [06:54:00] PROBLEM - Puppet freshness on sq68 is CRITICAL: Puppet has not run in the last 10 hours [06:56:06] PROBLEM - Puppet freshness on cp3001 is CRITICAL: Puppet has not run in the last 10 hours [06:57:09] PROBLEM - Puppet freshness on search13 is CRITICAL: Puppet has not run in the last 10 hours [06:57:09] PROBLEM - Puppet freshness on search17 is CRITICAL: Puppet has not run in the last 10 hours [06:57:09] PROBLEM - Puppet freshness on search25 is CRITICAL: Puppet has not run in the last 10 hours [07:00:00] PROBLEM - Puppet freshness on search15 is CRITICAL: Puppet has not run in the last 10 hours [07:04:03] PROBLEM - Puppet freshness on search19 is CRITICAL: Puppet has not run in the last 10 hours [07:04:03] PROBLEM - Puppet freshness on search29 is CRITICAL: Puppet has not run in the last 10 hours [07:04:03] PROBLEM - Puppet freshness on search23 is CRITICAL: Puppet has not run in the last 10 hours [07:04:03] PROBLEM - Puppet freshness on search14 is CRITICAL: Puppet has not run in the last 10 hours [07:06:00] PROBLEM - Puppet freshness on arsenic is CRITICAL: Puppet has not run in the last 10 hours [07:09:00] PROBLEM - Puppet freshness on palladium is CRITICAL: Puppet has not run in the last 10 hours [07:16:35] ACKNOWLEDGEMENT - Host srv266 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn RT-2896 - hardware fail [07:29:40] !log continue to restart and upgrade downed mw10xx servers [07:29:47] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [07:29:50] Logged the message, Master [07:36:14] RECOVERY - Host mw1015 is UP: PING OK - Packet loss = 0%, RTA = 30.91 ms [07:36:23] 
RECOVERY - Host mw1044 is UP: PING OK - Packet loss = 0%, RTA = 31.46 ms [07:36:23] RECOVERY - Host mw1040 is UP: PING OK - Packet loss = 0%, RTA = 30.88 ms [07:36:23] RECOVERY - Host mw1047 is UP: PING OK - Packet loss = 0%, RTA = 30.90 ms [07:41:56] RECOVERY - Host mw1048 is UP: PING OK - Packet loss = 0%, RTA = 30.90 ms [07:42:05] RECOVERY - Host mw1050 is UP: PING OK - Packet loss = 0%, RTA = 30.91 ms [07:42:32] RECOVERY - Host mw1085 is UP: PING WARNING - Packet loss = 80%, RTA = 30.90 ms [07:45:50] PROBLEM - SSH on mw1085 is CRITICAL: Connection refused [07:46:35] PROBLEM - swift-object-auditor on ms-be7 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [07:47:20] RECOVERY - SSH on mw1085 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [07:48:05] RECOVERY - Host mw1087 is UP: PING OK - Packet loss = 0%, RTA = 30.93 ms [07:48:15] RECOVERY - Host mw1089 is UP: PING OK - Packet loss = 0%, RTA = 30.91 ms [07:50:55] New patchset: Raimond Spekking; "Move generic wikisource/wikiversitry entries to top of the section" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/14448 [07:53:20] PROBLEM - Host mw1048 is DOWN: PING CRITICAL - Packet loss = 100% [07:53:29] RECOVERY - Host mw1092 is UP: PING OK - Packet loss = 0%, RTA = 30.88 ms [07:54:23] RECOVERY - Host mw1048 is UP: PING OK - Packet loss = 0%, RTA = 30.91 ms [07:58:08] RECOVERY - Host mw1093 is UP: PING OK - Packet loss = 0%, RTA = 30.94 ms [07:59:20] RECOVERY - Host mw1095 is UP: PING OK - Packet loss = 0%, RTA = 31.00 ms [07:59:20] PROBLEM - Host mw1085 is DOWN: PING CRITICAL - Packet loss = 100% [07:59:38] RECOVERY - Host mw1096 is UP: PING OK - Packet loss = 0%, RTA = 30.90 ms [07:59:38] RECOVERY - Host mw1105 is UP: PING OK - Packet loss = 0%, RTA = 30.95 ms [07:59:38] RECOVERY - Host mw1098 is UP: PING OK - Packet loss = 0%, RTA = 30.94 ms [07:59:47] RECOVERY - Host mw1110 is UP: PING WARNING - Packet loss = 50%, RTA = 42.77 ms [08:00:59] RECOVERY - Host mw1085 is UP: PING OK - Packet loss = 0%, RTA = 30.97 ms [08:03:14] PROBLEM - SSH on mw1110 is CRITICAL: Connection refused [08:03:23] RECOVERY - swift-object-auditor on ms-be7 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [08:04:45] RECOVERY - SSH on mw1110 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [08:16:45] RECOVERY - Host mw1117 is UP: PING OK - Packet loss = 0%, RTA = 30.92 ms [08:16:45] RECOVERY - Host mw1114 is UP: PING OK - Packet loss = 0%, RTA = 30.88 ms [08:16:45] RECOVERY - Host mw1115 is UP: PING OK - Packet loss = 0%, RTA = 30.91 ms [08:16:54] RECOVERY - Host mw1160 is UP: PING OK - Packet loss = 0%, RTA = 30.93 ms [08:16:54] RECOVERY - Host mw1132 is UP: PING OK - Packet loss = 0%, RTA = 30.98 ms [08:16:54] RECOVERY - Host mw1141 is UP: PING OK - Packet loss = 0%, RTA = 30.92 ms [08:17:03] RECOVERY - Host mw1154 is UP: PING OK - Packet loss = 0%, RTA = 30.88 ms [08:20:12] PROBLEM - SSH on mw1132 is CRITICAL: Connection refused [08:20:12] PROBLEM - SSH on mw1117 is CRITICAL: Connection refused [08:21:42] RECOVERY - SSH on mw1132 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [08:21:42] RECOVERY - SSH on mw1117 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [08:21:51] RECOVERY - Host mw1119 is UP: PING OK - Packet loss = 0%, RTA = 30.90 ms [08:22:36] RECOVERY - Host mw1128 is UP: PING OK - Packet loss = 0%, RTA = 30.93 ms [08:22:36] RECOVERY - Host mw1123 is UP: PING OK - Packet loss = 0%, 
RTA = 30.91 ms [08:35:57] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [08:39:15] PROBLEM - Host mw1141 is DOWN: PING CRITICAL - Packet loss = 100% [08:41:39] RECOVERY - Host mw1141 is UP: PING OK - Packet loss = 0%, RTA = 30.92 ms [08:46:27] PROBLEM - swift-object-auditor on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [08:46:54] PROBLEM - swift-object-auditor on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [08:58:01] !log dist-upgrading (unused) db10xx servers [08:58:09] RECOVERY - swift-object-auditor on ms-be8 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [08:58:10] Logged the message, Master [09:02:57] PROBLEM - Host db1009 is DOWN: PING CRITICAL - Packet loss = 100% [09:03:51] RECOVERY - Host db1009 is UP: PING OK - Packet loss = 0%, RTA = 30.93 ms [09:04:54] PROBLEM - Host db1010 is DOWN: PING CRITICAL - Packet loss = 100% [09:05:57] RECOVERY - Host db1010 is UP: PING OK - Packet loss = 0%, RTA = 30.96 ms [09:08:10] PROBLEM - Host db1027 is DOWN: PING CRITICAL - Packet loss = 100% [09:09:40] RECOVERY - Host db1027 is UP: PING OK - Packet loss = 0%, RTA = 30.95 ms [09:09:58] PROBLEM - Host mw1158 is DOWN: PING CRITICAL - Packet loss = 100% [09:10:16] PROBLEM - Host db1028 is DOWN: PING CRITICAL - Packet loss = 100% [09:11:28] RECOVERY - Host db1028 is UP: PING OK - Packet loss = 0%, RTA = 30.91 ms [09:11:55] PROBLEM - Host db1013 is DOWN: PING CRITICAL - Packet loss = 100% [09:12:58] RECOVERY - Host db1013 is UP: PING OK - Packet loss = 0%, RTA = 30.91 ms [09:13:34] PROBLEM - Host db1015 is DOWN: PING CRITICAL - Packet loss = 100% [09:14:01] RECOVERY - swift-object-auditor on ms-be6 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [09:15:13] RECOVERY - Host db1015 is UP: PING OK - Packet loss = 0%, RTA = 30.91 ms [09:15:31] RECOVERY - Host mw1158 is UP: PING OK - Packet loss = 0%, RTA = 30.92 ms [09:19:25] PROBLEM - Puppet freshness on nfs2 is CRITICAL: Puppet has not run in the last 10 hours [09:21:31] PROBLEM - SSH on mw1156 is CRITICAL: Connection refused [09:23:01] RECOVERY - SSH on mw1156 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [09:23:28] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours [09:23:46] PROBLEM - Host mw1085 is DOWN: PING CRITICAL - Packet loss = 100% [09:24:49] RECOVERY - Host mw1085 is UP: PING OK - Packet loss = 0%, RTA = 30.90 ms [09:41:01] PROBLEM - SSH on mw1152 is CRITICAL: Connection refused [09:42:31] RECOVERY - SSH on mw1152 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [09:43:16] PROBLEM - Host mw1151 is DOWN: PING CRITICAL - Packet loss = 100% [09:48:04] RECOVERY - Host mw1151 is UP: PING OK - Packet loss = 0%, RTA = 30.92 ms [10:04:55] New patchset: Mark Bergsma; "Redo monitor module imports, to not conflict" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14450 [10:06:23] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14450 [10:18:47] PROBLEM - Host db1046 is DOWN: PING CRITICAL - Packet loss = 100% [10:20:44] RECOVERY - Host db1046 is UP: PING OK - Packet loss = 0%, RTA = 30.92 ms [10:22:05] PROBLEM - Host db1045 is DOWN: PING CRITICAL - Packet loss = 100% [10:23:44] RECOVERY - Host db1045 is UP: PING OK - Packet loss = 0%, RTA = 30.92 ms [10:24:11] 
PROBLEM - Host db1030 is DOWN: PING CRITICAL - Packet loss = 100% [10:25:41] RECOVERY - Host db1030 is UP: PING OK - Packet loss = 0%, RTA = 30.95 ms [10:27:29] PROBLEM - Host db1016 is DOWN: PING CRITICAL - Packet loss = 100% [10:27:38] PROBLEM - SSH on mw1127 is CRITICAL: Connection refused [10:28:32] RECOVERY - Host db1016 is UP: PING OK - Packet loss = 0%, RTA = 30.91 ms [10:28:38] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/14447 [10:29:08] RECOVERY - SSH on mw1127 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [10:29:35] PROBLEM - Host db1014 is DOWN: PING CRITICAL - Packet loss = 100% [10:32:17] RECOVERY - Host db1014 is UP: PING OK - Packet loss = 0%, RTA = 30.92 ms [10:33:38] PROBLEM - Host db1011 is DOWN: PING CRITICAL - Packet loss = 100% [10:34:41] RECOVERY - Host db1011 is UP: PING OK - Packet loss = 0%, RTA = 30.93 ms [10:36:29] PROBLEM - MySQL Slave Delay on db12 is CRITICAL: CRIT replication delay 220 seconds [10:36:56] PROBLEM - MySQL Replication Heartbeat on db12 is CRITICAL: CRIT replication delay 242 seconds [10:39:56] PROBLEM - SSH on mw1129 is CRITICAL: Connection refused [10:41:26] RECOVERY - SSH on mw1129 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [10:42:47] PROBLEM - Host db1048 is DOWN: PING CRITICAL - Packet loss = 100% [10:43:59] RECOVERY - Host db1048 is UP: PING OK - Packet loss = 0%, RTA = 31.12 ms [10:45:20] PROBLEM - Host db1029 is DOWN: PING CRITICAL - Packet loss = 100% [10:47:44] PROBLEM - Host db1044 is DOWN: PING CRITICAL - Packet loss = 100% [10:49:05] RECOVERY - MySQL Replication Heartbeat on db12 is OK: OK replication delay 5 seconds [10:49:05] RECOVERY - Host db1044 is UP: PING OK - Packet loss = 0%, RTA = 30.94 ms [10:49:32] PROBLEM - Host db1031 is DOWN: PING CRITICAL - Packet loss = 100% [10:49:50] RECOVERY - MySQL Slave Delay on db12 is OK: OK replication delay 5 seconds [10:50:35] RECOVERY - Host db1031 is UP: PING OK - Packet loss = 0%, RTA = 30.91 ms [10:50:53] RECOVERY - Host db1029 is UP: PING OK - Packet loss = 0%, RTA = 30.90 ms [10:51:02] PROBLEM - SSH on mw1137 is CRITICAL: Connection refused [10:52:32] RECOVERY - SSH on mw1137 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [10:53:35] PROBLEM - Host db1026 is DOWN: PING CRITICAL - Packet loss = 100% [10:54:02] PROBLEM - SSH on db1029 is CRITICAL: Connection refused [10:55:05] PROBLEM - MySQL disk space on db1029 is CRITICAL: Connection refused by host [10:55:05] RECOVERY - Host db1026 is UP: PING OK - Packet loss = 0%, RTA = 31.16 ms [10:55:23] PROBLEM - Host db1012 is DOWN: PING CRITICAL - Packet loss = 100% [10:56:26] RECOVERY - Host db1012 is UP: PING OK - Packet loss = 0%, RTA = 31.20 ms [10:57:20] PROBLEM - Host db1003 is DOWN: PING CRITICAL - Packet loss = 100% [10:58:23] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [10:59:26] RECOVERY - Host db1003 is UP: PING OK - Packet loss = 0%, RTA = 31.03 ms [11:11:15] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [11:13:33] New patchset: Tim Starling; "Delete more junk files" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/14452 [11:17:29] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/14452 [11:21:00] RECOVERY - Host mw1069 is UP: PING OK - Packet loss = 0%, RTA = 30.98 
ms [11:26:42] RECOVERY - Host mw1076 is UP: PING OK - Packet loss = 0%, RTA = 30.91 ms [11:31:13] New patchset: Mark Bergsma; "Fix failure.check() invocation" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14457 [11:31:49] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14457 [11:32:24] RECOVERY - Host mw1071 is UP: PING OK - Packet loss = 0%, RTA = 30.94 ms [11:32:24] RECOVERY - Host mw1064 is UP: PING OK - Packet loss = 0%, RTA = 31.30 ms [11:32:33] RECOVERY - Host mw1082 is UP: PING WARNING - Packet loss = 80%, RTA = 584.91 ms [11:36:18] PROBLEM - SSH on mw1082 is CRITICAL: Connection refused [11:37:48] RECOVERY - SSH on mw1082 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [11:38:06] RECOVERY - Host mw1078 is UP: PING OK - Packet loss = 0%, RTA = 30.90 ms [11:38:06] PROBLEM - Host mw1071 is DOWN: PING CRITICAL - Packet loss = 100% [11:38:42] RECOVERY - Host mw1071 is UP: PING OK - Packet loss = 0%, RTA = 30.90 ms [12:07:57] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [12:08:06] If anyone's around labs currently has rather high load with what looks like a chunk of io wait, probably not much that can be done about it but just incase there is :) [12:08:12] * Damianz goes back to failing to login to bastion [12:09:24] New patchset: Mark Bergsma; "Fix calcStatus broken by a previous commit" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14459 [12:09:40] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [12:10:18] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14459 [12:25:49] New review: Hashar; "Thanks for cleaning all of the old files! I have been wondering myself if they were actually of any ..." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/14452 [12:36:04] PROBLEM - Puppet freshness on cp1017 is CRITICAL: Puppet has not run in the last 10 hours [12:36:04] PROBLEM - Puppet freshness on mw1102 is CRITICAL: Puppet has not run in the last 10 hours [12:45:42] New patchset: Reedy; "pngcrush everything" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13285 [12:50:10] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [12:56:47] New patchset: Mark Bergsma; "Merge branch 'master' into monitors/dns" [operations/debs/pybal] (monitors/dns) - https://gerrit.wikimedia.org/r/14463 [12:56:48] New patchset: Mark Bergsma; "Bug fixes" [operations/debs/pybal] (monitors/dns) - https://gerrit.wikimedia.org/r/14464 [12:56:48] New patchset: Mark Bergsma; "Fix successful result report" [operations/debs/pybal] (monitors/dns) - https://gerrit.wikimedia.org/r/14465 [12:56:49] New patchset: Mark Bergsma; "Improve DNS query error handling" [operations/debs/pybal] (monitors/dns) - https://gerrit.wikimedia.org/r/14466 [12:56:50] New patchset: Mark Bergsma; "Report up on NXDOMAIN" [operations/debs/pybal] (monitors/dns) - https://gerrit.wikimedia.org/r/14467 [12:56:50] New patchset: Mark Bergsma; "Fix DNS query error handling bugs" [operations/debs/pybal] (monitors/dns) - https://gerrit.wikimedia.org/r/14468 [12:56:51] New patchset: Mark Bergsma; "Rename DNS monitor to DNSQuery" [operations/debs/pybal] (monitors/dns) - https://gerrit.wikimedia.org/r/14469 [12:56:52] New patchset: Mark Bergsma; "Improve DNS query error messages" [operations/debs/pybal] (monitors/dns) - https://gerrit.wikimedia.org/r/14470 [12:56:53] New patchset: Mark Bergsma; "Allow configuration of down status on NXDOMAIN responses" [operations/debs/pybal] (monitors/dns) - https://gerrit.wikimedia.org/r/14471 [12:56:53] New patchset: Mark Bergsma; "Add DNSQuery monitor example" [operations/debs/pybal] (monitors/dns) - https://gerrit.wikimedia.org/r/14472 [12:56:56] New patchset: Mark Bergsma; "Configuration variables are enforced to be lower case" [operations/debs/pybal] (monitors/dns) - https://gerrit.wikimedia.org/r/14473 [12:56:56] New patchset: Mark Bergsma; "Shorten NXDOMAIN error message" [operations/debs/pybal] (monitors/dns) - https://gerrit.wikimedia.org/r/14474 [12:56:57] New patchset: Mark Bergsma; "Fix string expansion" [operations/debs/pybal] (monitors/dns) - https://gerrit.wikimedia.org/r/14475 [12:56:59] New patchset: Mark Bergsma; "Merge branch 'monitors/dns'" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14476 [12:57:34] Change merged: Mark Bergsma; [operations/debs/pybal] (monitors/dns) - https://gerrit.wikimedia.org/r/14385 [12:57:56] Change merged: Mark Bergsma; [operations/debs/pybal] (monitors/dns) - https://gerrit.wikimedia.org/r/14386 [12:58:13] Change merged: Mark Bergsma; [operations/debs/pybal] (monitors/dns) - https://gerrit.wikimedia.org/r/14463 [12:58:50] Change merged: Mark Bergsma; [operations/debs/pybal] (monitors/dns) - https://gerrit.wikimedia.org/r/14464 [12:59:17] Change merged: Mark Bergsma; [operations/debs/pybal] (monitors/dns) - https://gerrit.wikimedia.org/r/14465 [12:59:49] Change merged: Mark Bergsma; [operations/debs/pybal] (monitors/dns) - https://gerrit.wikimedia.org/r/14466 [13:00:24] Change merged: Mark Bergsma; [operations/debs/pybal] (monitors/dns) - https://gerrit.wikimedia.org/r/14467 [13:00:49] Change merged: Mark Bergsma; [operations/debs/pybal] (monitors/dns) - 
https://gerrit.wikimedia.org/r/14468 [13:01:23] Change merged: Mark Bergsma; [operations/debs/pybal] (monitors/dns) - https://gerrit.wikimedia.org/r/14469 [13:01:51] Change merged: Mark Bergsma; [operations/debs/pybal] (monitors/dns) - https://gerrit.wikimedia.org/r/14470 [13:02:18] Change merged: Mark Bergsma; [operations/debs/pybal] (monitors/dns) - https://gerrit.wikimedia.org/r/14471 [13:02:42] Change merged: Mark Bergsma; [operations/debs/pybal] (monitors/dns) - https://gerrit.wikimedia.org/r/14472 [13:03:06] Change merged: Mark Bergsma; [operations/debs/pybal] (monitors/dns) - https://gerrit.wikimedia.org/r/14473 [13:03:28] Change merged: Mark Bergsma; [operations/debs/pybal] (monitors/dns) - https://gerrit.wikimedia.org/r/14474 [13:04:06] Change merged: Mark Bergsma; [operations/debs/pybal] (monitors/dns) - https://gerrit.wikimedia.org/r/14475 [13:04:25] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14476 [13:15:54] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [13:22:04] !log Inserted new pybal 1.04 package in the precise-wikimedia APT repository, and upgraded all precise LVS servers [13:22:14] Logged the message, Master [13:25:39] PROBLEM - Host mw1085 is DOWN: PING CRITICAL - Packet loss = 100% [13:28:16] New patchset: Mark Bergsma; "Fix lvs manifest on hosts that don't have IPv6" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14481 [13:28:52] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14481 [13:28:59] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14481 [13:30:36] RECOVERY - Host mw1085 is UP: PING OK - Packet loss = 0%, RTA = 31.16 ms [13:31:03] RECOVERY - Puppet freshness on cp1017 is OK: puppet ran at Fri Jul 6 13:30:54 UTC 2012 [13:31:12] RECOVERY - Puppet freshness on search18 is OK: puppet ran at Fri Jul 6 13:30:59 UTC 2012 [13:31:48] RECOVERY - Frontend Squid HTTP on cp1017 is OK: HTTP OK HTTP/1.0 200 OK - 27535 bytes in 0.189 seconds [13:32:09] New patchset: Mark Bergsma; "Some hosts don't have eth0" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14482 [13:32:43] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14482 [13:32:49] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14482 [13:33:27] RECOVERY - Puppet freshness on search25 is OK: puppet ran at Fri Jul 6 13:33:11 UTC 2012 [13:34:21] RECOVERY - Puppet freshness on sq70 is OK: puppet ran at Fri Jul 6 13:34:10 UTC 2012 [13:34:57] RECOVERY - Puppet freshness on cp3001 is OK: puppet ran at Fri Jul 6 13:34:29 UTC 2012 [13:36:00] RECOVERY - Puppet freshness on search30 is OK: puppet ran at Fri Jul 6 13:35:54 UTC 2012 [13:36:27] RECOVERY - Puppet freshness on search24 is OK: puppet ran at Fri Jul 6 13:36:01 UTC 2012 [13:36:27] RECOVERY - Puppet freshness on search23 is OK: puppet ran at Fri Jul 6 13:36:02 UTC 2012 [13:37:03] RECOVERY - Puppet freshness on palladium is OK: puppet ran at Fri Jul 6 13:36:46 UTC 2012 [13:39:00] RECOVERY - Puppet freshness on sq68 is OK: puppet ran at Fri Jul 6 13:38:52 UTC 2012 [13:40:03] RECOVERY - Puppet freshness on cp3002 is OK: puppet ran at Fri Jul 6 13:39:34 UTC 2012 [13:40:21] RECOVERY - Puppet freshness on search28 is OK: puppet ran at Fri Jul 6 13:40:04 UTC 2012 [13:41:33] RECOVERY - Puppet freshness on arsenic is OK: puppet ran at Fri Jul 6 13:41:17 UTC 2012 [13:41:33] RECOVERY - Puppet freshness on search20 is OK: puppet ran at Fri Jul 6 13:41:22 UTC 2012 [13:42:27] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [13:46:57] RECOVERY - Puppet freshness on search36 is OK: puppet ran at Fri Jul 6 13:46:53 UTC 2012 [13:47:33] RECOVERY - Puppet freshness on strontium is OK: puppet ran at Fri Jul 6 13:47:07 UTC 2012 [13:49:03] RECOVERY - Puppet freshness on sq67 is OK: puppet ran at Fri Jul 6 13:48:36 UTC 2012 [13:49:21] RECOVERY - Puppet freshness on search34 is OK: puppet ran at Fri Jul 6 13:49:07 UTC 2012 [13:49:30] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [13:50:33] RECOVERY - Puppet freshness on search13 is OK: puppet ran at Fri Jul 6 13:50:08 UTC 2012 [13:50:42] RECOVERY - Puppet freshness on search16 is OK: puppet ran at Fri Jul 6 13:50:27 UTC 2012 [13:50:51] RECOVERY - Puppet freshness on search33 is OK: puppet ran at Fri Jul 6 13:50:35 UTC 2012 [13:51:00] RECOVERY - Puppet freshness on search29 is OK: puppet ran at Fri Jul 6 13:50:45 UTC 2012 [13:51:00] RECOVERY - Puppet freshness on search19 is OK: puppet ran at Fri Jul 6 13:50:49 UTC 2012 [13:51:21] New patchset: Mark Bergsma; "Add DNSQuery monitor to dns_rec LVS service" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14483 [13:51:55] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14483 [13:52:17] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14483 [13:52:30] RECOVERY - Puppet freshness on search15 is OK: puppet ran at Fri Jul 6 13:52:19 UTC 2012 [13:53:33] RECOVERY - Puppet freshness on search27 is OK: puppet ran at Fri Jul 6 13:53:18 UTC 2012 [13:53:51] RECOVERY - Puppet freshness on search31 is OK: puppet ran at Fri Jul 6 13:53:41 UTC 2012 [13:54:27] RECOVERY - Puppet freshness on search17 is OK: puppet ran at Fri Jul 6 13:54:11 UTC 2012 [13:55:57] RECOVERY - Puppet freshness on search22 is OK: puppet ran at Fri Jul 6 13:55:41 UTC 2012 [13:55:57] RECOVERY - Puppet freshness on search26 is OK: puppet ran at Fri Jul 6 13:55:55 UTC 2012 [13:56:10] yeah more puppet fixes :-) [13:57:00] RECOVERY - Puppet freshness on sq69 is OK: puppet ran at Fri Jul 6 13:56:52 UTC 2012 [13:58:03] RECOVERY - Puppet freshness on search21 is OK: puppet ran at Fri Jul 6 13:57:54 UTC 2012 [14:00:09] RECOVERY - Puppet freshness on search14 is OK: puppet ran at Fri Jul 6 13:59:57 UTC 2012 [14:04:39] RECOVERY - Backend Squid HTTP on cp1017 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.190 seconds [14:10:57] PROBLEM - Host mw1085 is DOWN: PING CRITICAL - Packet loss = 100% [14:20:58] labs again unusable [14:25:58] Has been most the day :( [14:28:45] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 214 seconds [14:38:55] !log authdns-update to correct typo in mgmt dns entry [14:39:03] Logged the message, Master [14:39:22] hrmm, someone updated morebots and yanked my name out of its confirmation.... [14:40:36] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 12 seconds [14:44:15] New patchset: RobH; "added vanadium to dhcp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14485 [14:44:48] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14485 [14:45:25] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14485 [14:49:13] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/14277 [14:50:26] New review: Reedy; "I think leaving it as was, and prefixing a + might work" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/14277 [14:50:48] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/14448 [15:03:15] PROBLEM - Puppet freshness on ms3 is CRITICAL: Puppet has not run in the last 10 hours [15:05:39] hey ^demon, what do you get as output when you run the gerrit ls-projects command through ssh? [15:05:48] PROBLEM - udp2log log age for oxygen on oxygen is CRITICAL: CRITICAL: log files /a/squid/zero-orange-kenya.log, have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days. [15:06:06] <^demon> drdee_: One second, lemme pastebin. [15:06:10] my suspicion is that gerrit has 'a one off' error when returning the projects list [15:06:49] I sent ya two, but indeed, one goes where the memcached are [15:07:03] you should work with mark or Leslie to get it setup [15:07:19] <^demon> drdee_: http://p.defau.lt/?g_L6c0NT7ENRJOm53FMXVg [15:07:20] but you can do the usual, rack it, power and serial wire, mgmt wire [15:07:28] what rack is it? 
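For the ls-projects question above, the command is normally run over Gerrit's SSH interface; a minimal sketch, assuming the standard Gerrit SSH port and a placeholder username:

    # list every project visible to the account, then count them to compare with the web UI
    ssh -p 29418 USERNAME@gerrit.wikimedia.org gerrit ls-projects
    ssh -p 29418 USERNAME@gerrit.wikimedia.org gerrit ls-projects | wc -l

If the count really comes out one short of what the UI shows, that would support the suspected off-by-one in the returned list.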
[15:08:09] ack, none of the servers are in it yet in racktables =P [15:08:10] orgchart is also missing for you [15:08:34] cmjohnson1: So yea, normally we rack the switches in rack top, so just slap that in u44/45 [15:08:37] ^demon pretty sure it's a gerrit bug :) [15:09:02] <^demon> Weird.... [15:09:05] <^demon> Let's file it. [15:09:10] or fix it? ;) [15:09:22] where in the source code is the ls-projects defined [15:09:29] we can have a look at that first [15:09:34] * RobH kicks vanadium [15:09:47] work damn you, i dont wanna have to drive 30 miles because today you stopped rebooting when yesterday you were fine [15:10:14] sigh....im gonna have to drive in and kick this thing. [15:10:15] <^demon> drdee_: Lemme check. [15:10:32] cmjohnson1: the other one is a spare [15:10:42] we [15:10:46] we'll need it if the active one fails [15:11:26] ok, heading into eqiad to poke at the analytics machine that is failing to work for me ;_; [15:12:14] RobH, is that stat1001? [15:13:00] RobH: Leslie opened a ticket about mgmt serial console redirect on neon (RT-3235), appears to work for me after a racreset so i closed it already, but just in case it is temporary or comes back after a while...hrmm [15:13:09] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [15:14:42] <^demon> drdee_: Should be mostly in gerrit-server/src/main/java/com/google/gerrit/server/project/ListProjects.java I think. It's injected into ListProjectsCommand in gerrit-sshd. [15:15:10] thx, i'll have a look [15:16:39] <^demon> If you end up seeing an easy fix, I've got push access upstream :) [15:17:18] drdee_: nope, analytics click tracking? [15:17:28] vanadium, its serial redirection is not working now for some reason [15:17:33] stat1001 i thought we fixed. [15:17:44] i thought you needed to call dell [15:17:53] it wouldn't powercycle [15:17:55] AFAIK [15:18:14] hrmm, it may be one of the ones i called dell on, my bad. [15:18:21] i have a notepad page of dell cases right now [15:18:30] ^demon: look at lines 178-181 [15:18:36] ok, headed in, back online shortly from eqiad [15:18:52] could that be the cause why the orgchart repo is not showing? [15:19:10] <^demon> I don't know why it wouldn't be visible [15:19:19] me neither :) [15:19:33] or maybe it's not visible to you and me [15:20:17] <^demon> But it is via the UI :\ [15:20:33] <^demon> https://gerrit.wikimedia.org/r/#/admin/projects/wikimedia/orgchart,access [15:21:10] robh: I like how you "liked" what I did on my 4th -- bugzilla4 packaging! ;) [15:21:34] Logged the message, Master [15:22:29] oh look, I'm talking to a person who left :{P [15:22:45] <^demon> drdee_: I wonder if line 163 is to blame. [15:22:45] hexmode: i like to hear that too. yay for the package [15:22:52] <^demon> drdee_: Like it's missing from the cache or something. [15:23:13] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [15:23:32] mmmmmm, possible although i have no clue what / how gerrit caches stuf [15:24:16] <^demon> Just tried flushing the project_list and projects caches, no change. [15:28:21] <^demon> drdee_: The comment there is kinda funny, "If we can't get it from the cache, pretend its not present." [15:29:12] yeah, so maybe not a bug but a feature :D [15:29:14] // lets home we have some background worker warming that caches [15:29:19] hope* [15:29:45] gerrit will be sloows [15:34:06] New patchset: Alex Monk; "(bug 38216) Let frwiki autoconfirmed users see private AbuseFilter log entries." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/14491 [15:36:15] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [15:56:02] New patchset: Ottomata; "filters.oxygen.erb - re-enabling Bangladesh filter with new range" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14495 [15:56:56] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14495 [15:57:38] New patchset: Ottomata; "check_udp2log_log_age - fixing name of orange kenya log in list of slow logs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14497 [15:58:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14497 [16:15:06] New patchset: RobH; "added precise to vanadium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14498 [16:15:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14498 [16:16:43] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14498 [16:17:12] RECOVERY - udp2log log age for oxygen on oxygen is OK: OK: all log files active [16:22:45] wtf vanadium. [16:22:55] all your serial stuff is right, why arent you redirecting. [16:27:09] New patchset: Bhartshorne; "moving swift ring files to volatile storage to get them out of git" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14501 [16:27:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14501 [16:28:40] how annoying, it needed a full power removal to fix the bad state. [16:29:03] mark: this kind of stuff makes me think we should start using all switched cdus. [16:29:15] no remote hands needed for odd power related issues like this. [16:33:44] New patchset: Bhartshorne; "moving swift ring files to volatile storage to get them out of git" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14501 [16:34:43] New patchset: RobH; "removed vanadium from autopart install" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14502 [16:35:29] come on lint check. [16:35:32] so slow. [16:35:32] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14501 [16:35:32] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14502 [16:35:49] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14502 [16:36:09] PROBLEM - swift-object-auditor on ms-be7 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [16:39:51] New patchset: RobH; "setting up vanadium like oxygen" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14503 [16:40:44] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14503 [16:43:38] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14503 [16:46:21] RECOVERY - SSH on neon is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [16:46:36] RobH: about vanadium, I don't think it needs the lucene upd2log instance, right ottomata? 
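The cache flushing mentioned above ([15:24:16]) goes through the same SSH interface; a minimal sketch, assuming admin rights and the standard Gerrit SSH port (ADMIN is a placeholder):

    # see which caches the server knows about
    ssh -p 29418 ADMIN@gerrit.wikimedia.org gerrit flush-caches --list
    # flush the two project-related caches named in the discussion
    ssh -p 29418 ADMIN@gerrit.wikimedia.org gerrit flush-caches --cache project_list
    ssh -p 29418 ADMIN@gerrit.wikimedia.org gerrit flush-caches --cache projects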
[16:47:51] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [16:50:59] drdee_: take a look at my commit, i can pull it out and change it if needed [16:51:11] but lemme know what in the site.pp manifest is wrong, i have not synced it with puppet yet [16:52:08] otherwise im setting it like oxygen, small 120gb boot raid, rest in lvm for data, and this has room for two additional disks to be added, raided and lvm'd for data if needed [16:53:37] RobH: ignore my comment, CT looped me in [16:53:45] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14501 [16:53:48] ok, cool, then its being deployed now [16:53:58] the OS install will finish in a bit, then I will do the initial puppet runs [16:54:05] then if all that works it will be time for lunch [16:57:38] New patchset: Jeremyb; "bug 38111 - rm space from mlwikiquote sitename" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/14506 [17:03:19] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/14491 [17:08:25] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13428 [17:09:40] Reedy: want another shell req while you're at it? [17:09:58] New review: Jeremyb; "has been sanity checked and consensus link reviewed by native speaker and steward Jyothis. (I don't ..." [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/14506 [17:24:54] RECOVERY - swift-object-auditor on ms-be7 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [17:28:49] !log adjusted swift ring files to move container listings off spinning disks (ms-be1-4) [17:28:58] Logged the message, Master [17:30:45] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [17:34:00] New patchset: Jeremyb; "bug 36884 - redirect labs.wm.o -> labsconsole" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/14509 [17:39:26] cmjohnson1: are you in the dc ? [17:45:58] PROBLEM - swift-object-auditor on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [17:47:30] uhoh, the eqiad dns recursor is borked [17:48:56] is it? [17:48:56] * mark looks [17:51:23] how do i ssh into a machine on eqiad w/o a public ip? [17:51:46] ori-l: do you have an account on said machine ? [17:52:06] LeslieCarr: yeah [17:52:07] RECOVERY - swift-object-auditor on ms-be6 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [17:52:25] LeslieCarr: though i'd be hard-pressed if you asked me to prove it now :) [17:52:28] ssh in via bast1001 or fenari [17:53:15] New patchset: Ottomata; "statistics.pp - installing python-mysqldb on stat1 for gerrit-stats" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14510 [17:53:18] cmjohnson1: ping [17:53:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14510 [17:54:05] * ori-l tries [17:55:06] New patchset: Mark Bergsma; "Add dns_rec LVS service IP to lvs1002" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14511 [17:55:50] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14511 [18:01:35] cmjohnson1: i figured out my own question, but thanks :) [18:05:53] cmjohnson1: hi! can you swap some drives around in pc1.pmtpa today? 
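For the question above about reaching an eqiad host with no public IP ([17:51:23]), the usual route is a hop through a bastion; a minimal sketch using the hosts named in the discussion:

    # log in via the bastion, then on to the internal host
    ssh -t bast1001.wikimedia.org ssh vanadium.eqiad.wmnet
    # fenari works the same way
    ssh -t fenari.wikimedia.org ssh vanadium.eqiad.wmnet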
[18:06:05] RECOVERY - Puppet freshness on neon is OK: puppet ran at Fri Jul 6 18:05:48 UTC 2012 [18:07:20] LeslieCarr: ok, i guess i don't have access [18:08:07] mark: yt? any chance you can push my key to vanadium? [18:10:12] cmjohnson1: thanks! also wondering about the status of a couple of SSD orders, and what's arrived so far. there were a couple of separate orders for intel 710 ssds and then one for 42 x 480GB intel 520 ssd's that are intended for tool labs… do you how many of the 480GB 520's (if any) have arrived? [18:11:08] ori-l: what's your account name ? [18:11:10] RECOVERY - NTP on neon is OK: NTP OK: Offset -0.05941545963 secs [18:12:41] LeslieCarr: 'olivneh'. my key is linked to from this ticket: https://rt.wikimedia.org/Ticket/Display.html?id=3116 [18:12:41] thanks, checking it out [18:12:41] LeslieCarr: erm, https://gerrit.wikimedia.org/r/#/c/13223/1/manifests/admins.pp [18:12:48] thanks [18:16:00] "We want to get this out of Labs eventually, so [18:16:00] we'll likely have to puppetize the current Etherpad Lite instance and [18:16:03] deploy it to a dedicated machine." [18:16:17] Hi, ops! I'm here to get some help with that :) [18:17:49] No rush, but whenever someone gets some time, I'd love the help [18:18:17] ori-l: hrm, trying to figure out why your account wasn't created on vanadium… it definitely should be there [18:20:09] marktraceur - sounds good! let us know once u have that puppetized [18:20:09] LeslieCarr: thanks! [18:20:59] woosters: How do I do that, I guess, is the question [18:29:22] PROBLEM - Host mw1100 is DOWN: PING CRITICAL - Packet loss = 100% [18:29:29] New patchset: RobH; "added Ori for access to vanadium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14513 [18:30:06] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14513 [18:30:34] RobH: thanks :) [18:30:59] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14513 [18:31:04] RobH: he was already in via admins::restricted [18:31:23] RobH: no accounts are showing up at all on vanadium [18:32:03] ori-l: quite welcome, its merged and I am forcing a puppet run on vanadium now [18:32:13] oh? [18:32:25] LeslieCarr: you workign on this already? [18:32:39] yeah [18:32:56] sorry then, lemme revert my change [18:32:59] well, trying.... [18:33:06] New patchset: RobH; "Revert "added Ori for access to vanadium"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14514 [18:33:56] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14514 [18:33:56] Change abandoned: RobH; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14514 [18:37:52] question for the opsen: should i able to connect to db10 from stat1 using mysql? [18:42:23] drdee: there is no db10, so no. [18:42:56] i am trying to access the gerrit db, on which host is that running? [18:43:11] db9? [18:43:20] oh, that! yeah, db9 [18:43:36] and you probably won't have an account there ;) [18:43:39] i doubt any db contains grants that would allow stat1 [18:45:05] cmjohnson1: woo. Ryan_Lane: those can go into ciscos as we discussed for tool labs dbs. 
let me know when you and faidon decide on os / container build details and i'll help [18:45:20] i am using the gerrit readonly account [18:45:31] so i do have access to the gerrit db [18:45:59] and through fenari i am accessing db10 to query gerrit db, so i am pretty sure it is db10 [18:46:00] it's db9/db10 [18:46:07] db10 is the replica [18:46:13] so, it's likely best to use that [18:46:22] it's db10.pmtpa.wmnet [18:46:30] you may need to use the fqdn [18:46:47] drdee: nevermind what i said before, the gerritro grant is for @% [18:46:56] Thanks Ryan_Lane!!!! [18:47:03] fqdn is the solution [18:47:04] yw [18:47:52] New patchset: Lcarr; "trying commenting out then uncommenting accounts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14515 [18:48:25] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14515 [18:48:29] binasher: I'd say let's go with precise [18:48:39] however the dbs are normally created [18:49:13] otherwise it'll just have nova-compute added, which will handle the containers [18:49:15] $gerrit_db_host = "db9.pmtpa.wmnet" [18:49:26] Reedy: ? [18:49:30] that's for writing [18:49:31] if you want the same mysql build as prod, we'll have to stick with lucid for now, but we don't have to do that necessarily [18:49:38] ah [18:49:41] Change abandoned: Lcarr; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14515 [18:49:55] we'll be upgrading nova soon [18:49:58] which will require precise [18:50:29] Change restored: Lcarr; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14514 [18:50:38] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14514 [18:50:52] Ryan_Lane: go ahead and build them as you want them [18:50:58] ok [18:51:19] i'll either build new mysql packages for both distros (on my general to-do list) or figure out something else [18:51:27] New patchset: Lcarr; "trying commenting out groups on vanadium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14516 [18:52:01] New patchset: Hashar; "(bug 38111) rm space from mlwikiquote sitename" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/14506 [18:52:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14516 [18:52:05] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14516 [18:52:08] New review: Hashar; "Good for me. I am not deploying on a friday night though." [operations/mediawiki-config] (master); V: 0 C: 1; - https://gerrit.wikimedia.org/r/14506 [18:52:11] sounds good [18:53:54] New patchset: Lcarr; "Revert "trying commenting out groups on vanadium"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14517 [18:54:07] hashar: i wonder why the rebase? [18:54:26] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14517 [18:55:01] jeremyb: I usually rebase to avoid a merge commit by gerrit [18:55:12] jeremyb: sometime I also tweak the commit message for minor typos [18:58:43] hashar: how odd [18:59:58] hashar: if we don't want merge commits can't we just tell gerrit not to do them? I think there was even a list thread on that very topic? 
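For the db10 access worked out above, the fix was simply to use the fully-qualified hostname; a minimal sketch with the read-only account mentioned in the discussion (the database name reviewdb is an assumption):

    # query the Gerrit replica from stat1 using the FQDN
    mysql -h db10.pmtpa.wmnet -u gerritro -p reviewdb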
[19:00:17] that can be done [19:00:24] that is how I have setup the integration/jenkins repo [19:00:31] "fast forward only" or something like that [19:00:47] now that we have a [rebase] button, we can probably use the same on other repositories [19:01:05] i haven't seen this button! [19:03:14] * ori-l brbs [19:03:20] maybe it is restricted to some people ? :/ [19:04:30] jeremyb: the button just happened last night with the upgrade [19:04:46] LeslieCarr: i figured. but i still haven't seen it! [19:04:50] weird [19:04:55] I don't see it either [19:04:58] hard reload ? [19:05:09] idk where to look for it [19:06:32] maybe on one of your changes [19:06:34] that is pending [19:06:39] along with the "review" button [19:06:46] Oh wait [19:06:47] I see it [19:06:54] 'Rebase Change' [19:07:02] It's available to change owner, change submitter, and users with the 'Rebase' permission [19:13:13] New patchset: Hashar; "pngcrush everything" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13285 [19:14:00] sigh, i can't figure out why accounts aren't being created .... [19:15:45] * jeremyb spies a rebase button [19:17:02] New patchset: Hashar; "pngcrush everything" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13285 [19:18:20] New review: Hashar; "Patchset 3 : removes tEXt fragments in gnu-fdl.png files" [operations/mediawiki-config] (master); V: 0 C: 1; - https://gerrit.wikimedia.org/r/13285 [19:19:46] anyone have any ideas as to why accounts aren't being created on precise ? [19:20:35] PROBLEM - Puppet freshness on nfs2 is CRITICAL: Puppet has not run in the last 10 hours [19:21:10] are they being made in search? [19:21:16] cuz that precise isnt it? [19:21:52] LeslieCarr: yea, search makes them [19:22:03] i wonder if its something else in the puppet run thats interrupting it. [19:22:21] weird [19:22:29] search includes the roots and mortals groups for mediawiki updates [19:22:45] though its a group inclusion not individual. [19:22:53] interesting ... [19:23:01] this isn't making group or individual [19:23:08] maybe if i only comment out the individual ... [19:24:05] running puppet with debug didn't help ….. stupid debug [19:24:29] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours [19:28:23] New patchset: Lcarr; "testing commenting out more accounts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14519 [19:28:57] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14519 [19:34:15] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14519 [19:38:14] LeslieCarr: yikes, sorry -- didn't realize this would be so complicated [19:38:28] ori-l: coming to wikimania? [19:38:42] jeremyb: yeah! [19:39:07] leslie can get revenge then ;) [19:39:21] cool [19:39:22] hehehe [19:39:26] :) [19:39:31] sigh... [19:40:44] LeslieCarr: why are all of the invidual accounts includes there? not just groups? [19:40:57] individual* [19:41:06] i guess maybe that's the same thing rob asked [19:41:13] i have no idea … [19:41:25] i'm just trying to make the accounts even happen ;) [19:41:42] and i have no clue... [19:51:48] notpeter: are you around ? [19:52:00] notpeter: did you have any troubles setting up user accounts on the search machines ? [19:52:41] LeslieCarr: so what's `getent passwd` and `getent group` look like? 
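For the question just above, getent is the quick way to see whether puppet actually created anything; a minimal sketch using the UID/GID values mentioned in the conversation:

    # look up the specific IDs asked about
    getent passwd 565
    getent group 500
    # or dump all accounts sorted by UID and eyeball the list
    getent passwd | sort -t: -k3 -n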
[19:53:14] they're not in the passwd list [19:53:55] New patchset: Lcarr; "reverting commenting of users" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14521 [19:54:29] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14521 [19:55:19] LeslieCarr: but is there e.g. a 565 in the list? [19:55:30] LeslieCarr: or group 500 even? [19:55:41] nope [19:55:50] nothing other than the default users :( [19:57:58] Is there documentation for puppetizing a service? [19:59:32] marktraceur: Like http://docs.puppetlabs.com/references/stable/type.html#service? Or do you mean for a 'service' rather than a service. [20:01:15] Damianz: he means etherpad-lite. it's already been packaged but idk if anyone reviewed the packaging [20:01:17] Damianz: I don't know, I haven't worked with puppet before. I basically have a bunch of code that I want to run on a dedicated server, and apparently need to "puppetize" in order to do that [20:01:47] marktraceur: you've seen the package, right? [20:01:59] Ah, probably not the best person to point you in that direction... still getting hang of puppet myself. [20:02:04] jeremyb: If someone already packaged, it's unlikely that they had the right version; I needed to get the develop branch of the upstream and add several plugins [20:02:31] marktraceur: still you can use their packaging [20:02:44] marktraceur: prerequisite for puppetizing is first packaging [20:03:07] Mmkay [20:03:35] Well, I haven't seen the package, and I don't know any of the process. Just resources for getting started would be helpful. [20:04:29] marktraceur: http://svn.wikimedia.org/viewvc/mediawiki/trunk/debs/etherpad-lite/ [20:04:52] marktraceur: make that build with whatever extra stuff you need to add [20:05:36] marktraceur: i'll not be around much until monday but feel free to poke me with further questions here or in person at the hackathon [20:07:10] jeremyb: /me will not be at the hackathon [20:07:50] Poking someone in person sounds less fun [20:07:53] marktraceur: is there some reason you're trying to do this right now? may be best to either 1) try to participate in hackathon remotely) or 2) wait 10 days and try again [20:08:13] (unless there's an immediate need for it) [20:08:15] Ha, it's just one of the few things on my TODO list [20:08:25] k [20:08:25] Other than "I need something to do", there's no huge need for it [20:08:31] hah [20:08:34] I can't get the stupid nagios host back up [20:08:43] Ryan_Lane: labs or prod? [20:08:46] labs [20:08:59] marktraceur: well i'll send you a mail about packaging but idk if it will help [20:21:05] marktraceur: check your mail [20:21:55] PROBLEM - Puppet freshness on db1029 is CRITICAL: Puppet has not run in the last 10 hours [20:22:08] jeremyb: I saw, thanks [20:34:00] mark: are you going ot be at the hackathon ? [20:34:21] marktraceur: i meant for above --- are you going to be at the hackathon ? [20:35:39] LeslieCarr: he won't [20:36:26] LeslieCarr: Not that I know of, and I suspect that if I don't know yet I'm not going :P [20:36:40] marktraceur: you should come! [20:36:45] marktraceur: where art thou? [20:36:53] jeremyb: San Francisco [20:36:59] too far away [20:37:01] *nod* [20:37:18] I wish I could come, but no luck [20:37:22] LeslieCarr: hence i said he should either try to participate remotely or he should wait ~10 days. (given it's not urgent). orrrrrrr he can start on his own and ask questions as he goes. but that's essentially just remote participation in the hackathon! 
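On the "packaging first, then puppetizing" point above: once a .deb exists, the puppet side of a simple service is usually just a package/config/service triple. A rough sketch follows; the class name, package name, config path, and source file are all placeholders rather than the real etherpad-lite manifest.

    # hypothetical sketch only -- not the actual etherpad-lite puppetization
    class etherpad_lite {
        package { 'etherpad-lite':
            ensure => present,
        }

        file { '/etc/etherpad-lite/settings.json':
            ensure  => present,
            source  => 'puppet:///files/etherpad/settings.json',
            require => Package['etherpad-lite'],
        }

        service { 'etherpad-lite':
            ensure    => running,
            enable    => true,
            require   => Package['etherpad-lite'],
            subscribe => File['/etc/etherpad-lite/settings.json'],
        }
    }

The require/subscribe ordering is the usual pattern: install the package, lay down the config, and restart the service whenever the config changes.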
[20:37:25] eh, think you could sneak into one of my suitcases ? [20:37:30] ;) [20:37:31] yeah [20:38:33] LeslieCarr: I fear the additional baggage charges would almost equal the cost of a plane ticket! [20:39:21] hehe that's true [20:40:09] <^demon> Driving 2h -> one of the best travel plans ever. [20:40:18] <^demon> Although RobH has got me beat--one metro stop. [20:42:06] ^demon: tiffany? [20:42:19] <^demon> Yeah, couple of people have me beat. [20:42:32] <^demon> But I'm looking forward to just...not flying [20:43:01] could you even fly it if you wanted? [20:44:17] <^demon> Not usually. I looked it up before and I had to go through Philly. [20:44:20] <^demon> Which is just stupid. [21:07:33] !log removed a bit more load from the swift spinning media container servers by adjusting the ring weights [21:07:42] Logged the message, Master [21:48:28] !log truncating pagetriage tables on enwiki (per bsitu) [21:48:36] Logged the message, Master [21:51:39] New patchset: Ryan Lane; "Change nrpe address for labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14595 [21:52:15] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14595 [21:52:54] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14595 [22:18:35] LeslieCarr: how's it going? (this isn't urgent, btw, so if you moved on to something else that's ok) [22:18:46] ori-l: i'm totally stuck [22:19:23] what's wrong? or is it going to be hopelessly over my head? [22:19:34] puppet's just not making the user accounts [22:19:41] i'm not sure if it's because this is a precise box or what [22:19:53] notpeter set up the search boxes with precise and they have user accounts [22:19:56] i'm hoping he has some fix [22:20:19] hm. user accounts should work on precise [22:20:21] what node is this? [22:21:36] LeslieCarr: ? [22:21:49] Ryan_Lane: vanadium [22:21:56] on eqiad [22:23:51] LeslieCarr: fqdn s vanadium.eqiad.wmnet [22:23:53] *is [22:24:03] node in puppet is vanadium.wikimedia.orf [22:24:04] yep [22:24:05] *org [22:24:06] ah [22:24:10] heh [22:24:13] hah, no wonder it had no effect [22:24:18] yep :) [22:25:39] New patchset: Lcarr; "switching fqdn to internal for vanadium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14598 [22:26:13] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14598 [22:26:21] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14598 [22:28:00] hehe, amazing that everything is working properly now ;) [22:29:50] \o/ [22:30:17] sweet! thanks LeslieCarr / Ryan_Lane [22:30:19] ori-l: how's it look now ? [22:30:42] like this: [22:30:42] olivneh@vanadium:~$ [22:30:44] :) [22:36:18] yay [22:36:45] LeslieCarr: not sure how to sudo, tho [22:37:08] PROBLEM - Puppet freshness on mw1102 is CRITICAL: Puppet has not run in the last 10 hours [22:38:03] oh, the request was for restricted shell access [22:38:04] not sudo [22:38:06] https://rt.wikimedia.org/Ticket/Display.html?id=3116 [22:38:27] LeslieCarr: that's for emery [22:38:39] LeslieCarr: sorry, that's my bad -- i just quoted it because it linked to my ssh public key [22:39:01] LeslieCarr: the ticket for vanadium is this (rather epic) one: https://rt.wikimedia.org/Ticket/Display.html?id=3152 [22:39:37] i don't see anything about sudo access in here though ? 
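The vanadium fix above reduces to how puppet matches nodes: the agent presents its certname (normally the machine's fqdn, here vanadium.eqiad.wmnet), and a site.pp entry keyed on vanadium.wikimedia.org simply never matches it, so the account classes were never applied at all. A sketch of the before and after, with the included class name being a placeholder:

    # before: never matches -- the agent's certname is the internal fqdn
    node 'vanadium.wikimedia.org' {
        include accounts::common    # placeholder for the real account classes
    }

    # after: matches vanadium's certname, so the users and groups get managed
    node 'vanadium.eqiad.wmnet' {
        include accounts::common
    }

This also explains why commenting accounts in and out had no visible effect: the host never picked up that node block in the first place.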
[22:51:14] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [22:54:44] ori-l: please don't tell me after all that, sudo is required ? [22:56:28] lol [22:58:53] PROBLEM - SSH on srv276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:59:38] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.8.26:11000 (Connection timed out) [23:01:35] PROBLEM - Memcached on srv276 is CRITICAL: Connection refused [23:01:44] RECOVERY - SSH on srv276 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [23:13:35] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:14:56] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 0 seconds [23:17:47] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [23:18:05] RECOVERY - Memcached on srv276 is OK: TCP OK - 0.004 second response time on port 11000 [23:29:06] LeslieCarr: sorry, i got pulled into a meeting. yes, i need to have sudo on vanadium -- this wouldn't make sense otherwise [23:31:24] … that's a different approval set, can you ask woosters for approval for that ? [23:31:35] sorry :( [23:32:57] i'll bbi10 [23:33:20] LeslieCarrafk: sure. [23:35:34] actually, RobH might be the person if he's around [23:47:49] back [23:50:12] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [23:54:21] LeslieCarr: do you need this instance anymore? https://labsconsole.wikimedia.org/wiki/Nova_Resource:I-000000b7 [23:54:45] I'm pretty sure it's corrupted due to broken kvm block migration [23:54:48] I was going to delete it [23:54:49] you can delete it [23:54:51] thanks [23:55:43] PROBLEM - udp2log log age for lucene on vanadium is CRITICAL: NRPE: Command check_udp2log_log_age-lucene not defined [23:56:01] PROBLEM - udp2log log age for oxygen on vanadium is CRITICAL: NRPE: Command check_udp2log_log_age-oxygen not defined
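On the two "Command ... not defined" criticals at the end: that message comes from the NRPE daemon on vanadium itself, and it means the Nagios server is asking for a check that the host's local nrpe configuration does not (yet) define, typically because the puppet class that ships the command definition has not been applied there or nrpe has not been reloaded since. What the missing definition would roughly look like is sketched below; the plugin path and arguments are guesses, not the real wikimedia check.

    # hypothetical snippet for an nrpe include file on vanadium;
    # the plugin path and its arguments are placeholders
    command[check_udp2log_log_age-lucene]=/usr/lib/nagios/plugins/check_udp2log_log_age lucene

Once puppet delivers that definition and nrpe is restarted, the alert should flip from the "not defined" CRITICAL to whatever the check itself reports.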