[00:06:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.985 seconds [00:36:25] PROBLEM - Puppet freshness on cp3002 is CRITICAL: Puppet has not run in the last 10 hours [00:36:25] PROBLEM - Puppet freshness on search31 is CRITICAL: Puppet has not run in the last 10 hours [00:38:22] PROBLEM - Puppet freshness on sq69 is CRITICAL: Puppet has not run in the last 10 hours [00:39:25] PROBLEM - Puppet freshness on search34 is CRITICAL: Puppet has not run in the last 10 hours [00:39:25] PROBLEM - Puppet freshness on strontium is CRITICAL: Puppet has not run in the last 10 hours [00:39:25] PROBLEM - Puppet freshness on search24 is CRITICAL: Puppet has not run in the last 10 hours [00:40:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:41:49] PROBLEM - Host mw1007 is DOWN: PING CRITICAL - Packet loss = 100% [00:44:22] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [00:44:22] PROBLEM - Puppet freshness on search21 is CRITICAL: Puppet has not run in the last 10 hours [00:44:22] PROBLEM - Puppet freshness on search22 is CRITICAL: Puppet has not run in the last 10 hours [00:44:22] PROBLEM - Puppet freshness on search27 is CRITICAL: Puppet has not run in the last 10 hours [00:44:22] PROBLEM - Puppet freshness on search36 is CRITICAL: Puppet has not run in the last 10 hours [00:44:23] PROBLEM - Puppet freshness on search28 is CRITICAL: Puppet has not run in the last 10 hours [00:45:25] PROBLEM - Puppet freshness on search16 is CRITICAL: Puppet has not run in the last 10 hours [00:45:25] PROBLEM - Puppet freshness on search30 is CRITICAL: Puppet has not run in the last 10 hours [00:45:25] PROBLEM - Puppet freshness on search20 is CRITICAL: Puppet has not run in the last 10 hours [00:45:25] PROBLEM - Puppet freshness on search33 is CRITICAL: Puppet has not run in the last 10 hours [00:48:25] PROBLEM - Puppet freshness on sq70 is CRITICAL: Puppet has not run in the last 10 hours [00:49:46] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.540 seconds [00:50:22] PROBLEM - Puppet freshness on search26 is CRITICAL: Puppet has not run in the last 10 hours [00:50:50] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/14200 [00:51:16] PROBLEM - Host mw1002 is DOWN: PING CRITICAL - Packet loss = 100% [00:51:25] PROBLEM - Puppet freshness on search18 is CRITICAL: Puppet has not run in the last 10 hours [00:51:25] PROBLEM - Puppet freshness on sq67 is CRITICAL: Puppet has not run in the last 10 hours [00:51:25] PROBLEM - Puppet freshness on sq68 is CRITICAL: Puppet has not run in the last 10 hours [00:53:22] PROBLEM - Puppet freshness on cp3001 is CRITICAL: Puppet has not run in the last 10 hours [00:54:25] PROBLEM - Puppet freshness on search13 is CRITICAL: Puppet has not run in the last 10 hours [00:54:25] PROBLEM - Puppet freshness on search17 is CRITICAL: Puppet has not run in the last 10 hours [00:54:25] PROBLEM - Puppet freshness on search25 is CRITICAL: Puppet has not run in the last 10 hours [00:57:25] PROBLEM - Puppet freshness on search15 is CRITICAL: Puppet has not run in the last 10 hours [00:59:22] PROBLEM - Puppet freshness on search19 is CRITICAL: Puppet has not run in the last 10 hours [00:59:22] PROBLEM - Puppet freshness on search14 is CRITICAL: Puppet has not run in the last 10 hours [00:59:22] PROBLEM - Puppet freshness on search29 is CRITICAL: Puppet has 
not run in the last 10 hours [01:00:25] PROBLEM - Puppet freshness on search23 is CRITICAL: Puppet has not run in the last 10 hours [01:02:22] PROBLEM - Puppet freshness on arsenic is CRITICAL: Puppet has not run in the last 10 hours [01:05:22] PROBLEM - Puppet freshness on palladium is CRITICAL: Puppet has not run in the last 10 hours [01:22:19] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:26:22] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [01:31:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.936 seconds [01:42:07] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 292 seconds [01:43:28] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 289 seconds [01:49:35] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 654s [01:51:05] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 9 seconds [01:52:53] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 26 seconds [01:53:56] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 1s [02:06:05] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:14:56] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.727 seconds [03:16:26] PROBLEM - Puppet freshness on nfs2 is CRITICAL: Puppet has not run in the last 10 hours [03:20:29] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours [03:53:46] can someone explain http://toolserver.org/~pathoschild/stalktoy/index.php?target=2000%3A%3A%2F4? [03:53:54] (why it keeps giving a db-related error) [03:53:58] is that server down? [03:56:20] hmm [03:56:22] looking [03:56:54] works fine here [03:56:59] maybe a temporary hickup? [03:59:58] try dropping the ? in the end though I dont think you had that in your orginal use [04:06:45] PROBLEM - Host mw1132 is DOWN: PING CRITICAL - Packet loss = 100% [05:16:30] Jasper_Deng: #wikimedia-toolserver [05:16:41] (issue is resolved now) [05:16:53] -operations is for the real servers, toolserver is a replicated read-only cluster separate from that. [05:16:56] just fyi :) [05:37:22] PROBLEM - Host mw1142 is DOWN: PING CRITICAL - Packet loss = 100% [05:38:21] wut? mediawiki.org is giving me dns not found [05:38:25] http://www.mediawiki.org/wiki/Template:Ombox [05:38:27] time out [05:38:40] Unable to resolve the server's DNS address. [05:39:01] anyone else? 
[05:39:34] loads fine here [05:39:37] that link I mean [05:39:43] I am in europe though [05:42:33] weird, works from curl for me but not in Chrome [05:42:42] dnsflush fied it [05:42:45] fixed* [05:44:42] Krinkle: if you're running a recent build and have the flag enabled, chrome doesn't use the os's getaddrinfo() [05:44:47] Krinkle: see https://plus.google.com/103382935642834907366/posts/FKot8mghkok [05:45:06] Krinkle: and "Built-in Asynchronous DNS" in chrome://flags [05:45:27] that might account for a host resolving correctly in curl but not in chrome (or vice versa) [05:46:42] os dns flush did fix it [05:46:48] could be a coincendence [05:55:40] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [06:13:50] New patchset: Raimond Spekking; "Fix for https://gerrit.wikimedia.org/r/#/c/14180/" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/14277 [06:20:35] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [06:32:17] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [06:33:29] PROBLEM - Puppet freshness on cp1017 is CRITICAL: Puppet has not run in the last 10 hours [06:33:29] PROBLEM - Puppet freshness on mw1102 is CRITICAL: Puppet has not run in the last 10 hours [06:47:31] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [06:51:25] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:47:08] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [08:30:38] PROBLEM - Host mw1040 is DOWN: PING CRITICAL - Packet loss = 100% [08:30:56] PROBLEM - Host mw1085 is DOWN: PING CRITICAL - Packet loss = 100% [09:00:13] PROBLEM - Puppet freshness on ms3 is CRITICAL: Puppet has not run in the last 10 hours [09:10:16] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [09:20:19] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [09:33:13] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [10:13:55] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [10:33:07] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [10:37:10] PROBLEM - Puppet freshness on cp3002 is CRITICAL: Puppet has not run in the last 10 hours [10:37:10] PROBLEM - Puppet freshness on search31 is CRITICAL: Puppet has not run in the last 10 hours [10:37:44] New patchset: Hashar; "nagios authdns now check nagiostest.beta.wmflabs.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14286 [10:38:20] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14286 [10:38:41] New review: Hashar; "The change fix 3 nagios errors." 
[operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/14286 [10:39:16] PROBLEM - Puppet freshness on sq69 is CRITICAL: Puppet has not run in the last 10 hours [10:40:10] PROBLEM - Puppet freshness on search24 is CRITICAL: Puppet has not run in the last 10 hours [10:40:10] PROBLEM - Puppet freshness on search34 is CRITICAL: Puppet has not run in the last 10 hours [10:40:10] PROBLEM - Puppet freshness on strontium is CRITICAL: Puppet has not run in the last 10 hours [10:45:16] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [10:45:16] PROBLEM - Puppet freshness on search21 is CRITICAL: Puppet has not run in the last 10 hours [10:45:16] PROBLEM - Puppet freshness on search22 is CRITICAL: Puppet has not run in the last 10 hours [10:45:16] PROBLEM - Puppet freshness on search27 is CRITICAL: Puppet has not run in the last 10 hours [10:45:16] PROBLEM - Puppet freshness on search36 is CRITICAL: Puppet has not run in the last 10 hours [10:45:17] PROBLEM - Puppet freshness on search28 is CRITICAL: Puppet has not run in the last 10 hours [10:46:10] PROBLEM - Puppet freshness on search16 is CRITICAL: Puppet has not run in the last 10 hours [10:46:10] PROBLEM - Puppet freshness on search20 is CRITICAL: Puppet has not run in the last 10 hours [10:46:10] PROBLEM - Puppet freshness on search30 is CRITICAL: Puppet has not run in the last 10 hours [10:46:10] PROBLEM - Puppet freshness on search33 is CRITICAL: Puppet has not run in the last 10 hours [10:49:10] PROBLEM - Puppet freshness on sq70 is CRITICAL: Puppet has not run in the last 10 hours [10:51:16] PROBLEM - Puppet freshness on search26 is CRITICAL: Puppet has not run in the last 10 hours [10:52:10] PROBLEM - Puppet freshness on sq67 is CRITICAL: Puppet has not run in the last 10 hours [10:52:10] PROBLEM - Puppet freshness on sq68 is CRITICAL: Puppet has not run in the last 10 hours [10:52:10] PROBLEM - Puppet freshness on search18 is CRITICAL: Puppet has not run in the last 10 hours [10:54:16] PROBLEM - Puppet freshness on cp3001 is CRITICAL: Puppet has not run in the last 10 hours [10:54:52] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14286 [10:55:10] PROBLEM - Puppet freshness on search17 is CRITICAL: Puppet has not run in the last 10 hours [10:55:10] PROBLEM - Puppet freshness on search13 is CRITICAL: Puppet has not run in the last 10 hours [10:55:10] PROBLEM - Puppet freshness on search25 is CRITICAL: Puppet has not run in the last 10 hours [10:55:34] New patchset: Hashar; "planet: comment that update-planets need to be changed too" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14287 [10:56:07] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14287 [10:58:10] PROBLEM - Puppet freshness on search15 is CRITICAL: Puppet has not run in the last 10 hours [11:00:00] New patchset: Mark Bergsma; "Allow servers to prefix nameservers (e.g. LVS)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14289 [11:00:34] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14289 [11:00:42] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14289 [11:01:39] PROBLEM - Puppet freshness on search14 is CRITICAL: Puppet has not run in the last 10 hours [11:01:39] PROBLEM - Puppet freshness on search23 is CRITICAL: Puppet has not run in the last 10 hours [11:01:39] PROBLEM - Puppet freshness on search19 is CRITICAL: Puppet has not run in the last 10 hours [11:01:39] PROBLEM - Puppet freshness on search29 is CRITICAL: Puppet has not run in the last 10 hours [11:03:45] PROBLEM - Puppet freshness on arsenic is CRITICAL: Puppet has not run in the last 10 hours [11:06:45] PROBLEM - Puppet freshness on palladium is CRITICAL: Puppet has not run in the last 10 hours [11:07:37] New patchset: Mark Bergsma; "Install a DNS recursor on new LVS servers after all" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14290 [11:08:07] hashar: wow, puppet has *inline* switch statements? [11:08:10] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14290 [11:08:14] directl to assignment [11:08:18] yes [11:08:19] https://gerrit.wikimedia.org/r/#/c/14289/1/manifests/realm.pp,unified [11:08:28] amazing [11:08:37] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14290 [11:08:40] beats javascript [11:09:05] <^demon|away> Don't say that too loud, the ruby enthusiasts might hear you ;-) [11:09:25] though js is not far behind [11:09:41] var nameservers = ({ "esams": [ .. ], "eqiad": [ .. ] })[ site ] || [ .. ] [11:10:04] object literals and the || default operator [11:10:12] which, contrary to php, returns the value, not boolean [11:10:53] do more with less [11:11:02] anyway, puppet can stay now ;P [11:11:20] no [11:11:27] you can't even concatenate arrays with it :P [11:11:55] its interesting though, its an odd category syntax [11:12:20] its not really a language for logic / execution. More like json/ini with with some logics built-in [11:12:36] but it looks lot like the java type of langauge [11:13:15] or C-family rather [11:13:21] it's declarative [11:13:25] yeah [11:13:34] but one wouldn't see switch statements in INI or JSON. [11:14:30] * Krinkle opens puppet for dummies [11:14:52] regexp, conditionals, inheritance, hashes, "in", "unless", nice :) [11:15:38] and then you also miss a ton of stuff [11:16:02] sure [11:16:03] and it's not consistently implemented and has many bugs [11:16:20] gonna be a bumpy ride [11:16:39] <^demon|away> Buckle up :) [11:16:43] mark: this is my "puppet" for now, during initial sketching/labs: https://labsconsole.wikimedia.org/wiki/Nova_Resource:Integration/Setup [11:16:54] good ol' shell executables [11:17:05] and some inline homebrew comments/syntax [11:17:50] !log Installed new pybal snapshot build for testing on lvs1005 [11:17:59] Logged the message, Master [11:18:40] New patchset: Dzahn; "add missing Russian locales for planet on singer" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14293 [11:19:13] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14293 [11:22:32] Krinkle: var nameservers = ({ "esams": [ .. ], "eqiad": [ .. ] })[ site ] || [ .. 
] [11:22:37] Krinkle: I like that syntax [11:22:44] Krinkle: really easy to figure out / read when properly indented [11:22:50] js :) [11:23:06] New patchset: Mark Bergsma; "Convert remaining $nameservers changes to $nameservers_prefix, install recursor on all LVS servers again" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14294 [11:23:11] yeah, javascript has 2 things that literally make out almost all of the languageL [11:23:13] objects and functions [11:23:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14294 [11:23:44] objects are arrays, hashes, and what not. functions are functions, methods, classes, modules, closures, scope.. [11:24:27] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14294 [11:24:44] mark: are you going to wikimania? [11:24:50] no [11:25:02] anyone else from the ops team as far as you know? [11:25:06] yeah many [11:25:13] okauy [11:26:20] mark: I still have to laugh when I think back about january 2011. only little over a year ago. [11:26:42] in amsterdam [11:26:56] what about it? :) [11:27:04] be trying to sounds smart about something with backbones, whatever I thought that was. [11:27:15] don't worry about it ;) [11:27:15] to you of all people [11:27:36] but I found my place, on the opposite side of the pipe, so to speak [11:27:45] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [11:27:59] i'll just be careful to not try to sound smart about javascript to you then [11:28:20] * hashar sends Krinkle in a datacenter with RobH so he learns about the pipe side that really matter :-] [11:28:23] (i'm not likely to do that anyway, I avoid web programming ;) [11:28:30] you do ruby already! [11:28:36] hehe, enjoy every minute of it [11:28:37] <^demon|away> Never admit to knowing anything :) [11:28:44] I do python whenever I can [11:28:56] * hashar knows about igniting a lighter [11:29:01] mark: well, the day comes we're going to have to install node js on wmf servers. prepare for the worst [11:29:14] won't be me then ;) [11:29:27] though ops wouldn't mind too much I suppose, that's still software side. [11:29:29] speaking of nodejs, we need to update the nodesjs -wm debian package :) [11:29:47] * ^demon|away hides server-side from our JS future [11:30:39] * Krinkle mumbles away to lunch about /usr/nodejs/common/docroot/index.js :P [11:30:40] brb later [11:52:48] PROBLEM - Host mw1141 is DOWN: PING CRITICAL - Packet loss = 100% [11:54:13] !log Inserted new pybal_1.02 package into APT distribution precise-wikimedia [11:54:22] Logged the message, Master [11:58:08] New patchset: Dzahn; "add missing Russian locales for planet on singer (ru_RU ISO-8859-5, -5 not -2)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14293 [11:58:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14293 [11:59:16] New review: Dzahn; "fix RT 3227 and bug 38198" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14293 [11:59:37] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14293 [12:03:15] RECOVERY - Auth DNS on labs-ns1.wikimedia.org is OK: DNS OK: 0.044 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.153.219 [12:04:45] RECOVERY - Auth DNS on labs-ns0.wikimedia.org is OK: DNS OK: 0.028 seconds response time. 
nagiostest.beta.wmflabs.org returns 208.80.153.219 [12:10:09] bbl [13:17:53] PROBLEM - Puppet freshness on nfs2 is CRITICAL: Puppet has not run in the last 10 hours [13:18:42] Database error: "SqlBagOStuff::set". Database returned error "1114: The table 'pc233' is full (10.0.6.50)". - https://bugzilla.wikimedia.org/38202 critical [13:18:50] @info 10.0.6.50 [13:18:52] Krinkle: [10.0.6.50: ] db40 [13:18:56] @info db40 [13:18:56] Krinkle: [db40: s7] 10.0.6.50 [13:19:01] @replag s7 [13:19:01] Krinkle: [s7] db37: 0s, db56: 0s, db58: 0s, db26: 0s [13:19:12] (obviously, not related to replag, jus checking) [13:19:17] Database error: "SqlBagOStuff::set". Database returned error "1114: The table 'pc233' is full (10.0.6.50)". - https://bugzilla.wikimedia.org/38202 critical [13:19:32] <^demon> Sounds like 10.0.6.50 might be out of space? [13:19:40] pc193 as well [13:19:44] see #wikimedia-tech [13:20:16] <^demon> pc[\d{3}] isn't actually the name of a table or server :) [13:20:35] whatever [13:20:40] O_O [13:20:45] dbbot-wm: [13:20:52] :( [13:21:56] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours [13:29:21] anyone? [13:29:23] Db error? [13:34:02] yeah, looking at it [13:34:14] sorry, I was tyiping about it but in the wrong channel [13:34:23] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [13:41:41] ^demon: should we disable the parser cache ? *grin* [13:42:02] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [13:42:59] (Cannot contact the database server: Unknown error (10.0.6.50)) [13:43:52] so what's going on now is that we're looking at whether we can change the innodb space constraint in my.cnf or whether that is going ot have undesireable side effects [13:45:02] apergos: site totally down [13:45:06] right [13:45:12] matanya: ops working on it [13:46:02] thanks [13:48:11] PROBLEM - LVS HTTP IPv4 on wiktionary-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [13:48:20] PROBLEM - LVS HTTPS IPv4 on wiktionary-lb.pmtpa.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error [13:48:38] PROBLEM - LVS HTTP IPv4 on wiktionary-lb.pmtpa.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [13:48:56] PROBLEM - LVS HTTPS IPv4 on wiktionary-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error [13:50:08] RECOVERY - LVS HTTP IPv4 on wiktionary-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 60158 bytes in 0.130 seconds [13:50:26] RECOVERY - LVS HTTPS IPv4 on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 60357 bytes in 0.933 seconds [13:51:11] RECOVERY - LVS HTTP IPv4 on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 60351 bytes in 0.884 seconds [13:51:20] RECOVERY - LVS HTTPS IPv4 on wiktionary-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 60167 bytes in 0.171 seconds [13:54:38] PROBLEM - LVS HTTP IPv4 on wiktionary-lb.pmtpa.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [13:54:56] PROBLEM - LVS HTTPS IPv4 on wiktionary-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error [13:55:41] PROBLEM - LVS HTTP IPv4 on wiktionary-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [13:55:50] PROBLEM - LVS HTTPS IPv4 on 
wiktionary-lb.pmtpa.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error [13:57:11] RECOVERY - LVS HTTP IPv4 on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 60351 bytes in 0.864 seconds [13:57:20] RECOVERY - LVS HTTPS IPv4 on wiktionary-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 60167 bytes in 0.272 seconds [13:57:38] RECOVERY - LVS HTTP IPv4 on wiktionary-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 60160 bytes in 0.130 seconds [13:57:56] RECOVERY - LVS HTTPS IPv4 on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 60358 bytes in 1.232 seconds [14:01:29] hehe http://xkcd.com/903/ comes to mind indeed [14:01:36] domas: mh? [14:01:53] YAY WIKTIONARY DOWN [14:02:02] FINALLY [14:02:17] domas: :P You did the migrate to the mysql parser cache no? [14:02:38] What was the impact on flushing it all? Will the Apaches take that? [14:03:20] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [14:03:47] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error [14:03:56] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [14:03:56] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error [14:04:50] RECOVERY - LVS HTTP IPv4 on wikiversity-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 48811 bytes in 0.140 seconds [14:05:17] RECOVERY - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 48816 bytes in 0.383 seconds [14:05:26] RECOVERY - LVS HTTP IPv4 on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 49004 bytes in 0.698 seconds [14:05:26] RECOVERY - LVS HTTPS IPv4 on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 49011 bytes in 0.917 seconds [14:06:45] http://flux.defau.lt/wikiwhat.png [14:08:17] http://wikimania2012.wikimedia.org/wiki/Google_Reception :( [14:08:22] Logged the message, Master [14:08:29] domas: yeah, randomized testing of some new banners [14:08:36] this is one of them. [14:08:37] back [14:08:42] !quote [14:08:49] domas: they're quite different, aren't they :D [14:09:03] hahaha [14:09:09] at least they are not mentioning facebook in these banners anymore [14:09:11] ha ha ha ha [14:09:31] Cause FB isn't very reliable, is it? :D [14:09:32] * hoo hids [14:10:15] so we should have asher make the determination about table space? [14:12:36] we should have people monitor stuff that matters [14:12:40] rather than wanking around [14:20:36] ok [14:21:58] I can do some innodb wizzardry or just reinitialize everything [14:22:40] what would the wizardry look like and how long would it take? [14:22:50] it can't happen anymore [14:23:12] it would be trying to engineer custom build that skips those assertions :) [14:23:23] ok hen [14:23:38] reinitialize it is [14:24:21] I'm not an expert in fsp code [14:24:24] :) [14:26:37] if anyone is feeling like it, can build a repro [14:27:12] anyway, RCA is pointing at this - http://flux.defau.lt/wikiwhat.png [14:27:25] "Google might have" [14:27:26] lol [14:28:16] surprised Yahoo still has 13000 staff. [14:31:58] domas: I am wondering which of the 5 servers are mine [14:32:42] is the site down? 
[14:32:55] it seems to work for me [14:33:25] here, see [14:33:26] http://en.wikipedia.org/wiki/User:Midom/test [14:33:37] wfm [14:33:41] ok, free space 10% initialized [14:33:52] worksforme [14:34:45] I like this though [14:34:49] every time it gets full, we just nuke it [14:34:52] with site down for a while [14:34:58] <3 [14:35:03] ok I see the change [14:35:04] thanks [14:35:05] we definitely want to monitor that so [14:35:08] domas: yeah, even wikilove is operational. and then some (!) [14:35:10] domas: maybe memsql! [14:35:11] specially if that happens from time to time [14:35:14] jeff_Green: yup [14:35:17] that would be wikipedia way [14:35:23] throwing lots of hardware at the problem [14:35:35] because paying few minutes of attention to site critical system is not necessary [14:35:36] :) [14:35:50] so if that lives in puppet someplace it would be nice to make it permanent [14:37:09] don't find it. hmm [14:37:23] <^demon> apergos: Whatcha looking for in puppet? [14:37:35] the my.cnf setting [14:37:55] <^demon> Ah [14:38:28] if the old setting isn't in there I guess it won't get overwritten :-/ [14:39:05] what setting? [14:39:15] templates/mysql$ vi generic_my.cnf.erb ? [14:39:22] why would you talk about settings [14:39:23] or does it not use the generic one [14:39:25] it is not about settings [14:39:30] it is about not allowing the site to go down [14:39:38] because of not caring about things [14:39:39] innodb_data_file_path=ibdata1:2000G [14:39:49] meh [14:40:11] t is about keepipng puppet in sync with what's live, but if this isn't in puppet then it's not a problem [14:41:26] oh ugh [14:41:27] I see [14:42:14] so we need to get table free space into our fine quality monitoring . . . [14:42:54] uh huh [14:43:13] so mutante I still don't see it in there [14:46:08] db40 fixed? [14:46:14] Note, it's got loads of disk space ;) [14:46:54] yes, domas fixed it and so now we can forget about the space issues for another several months :-P [14:47:15] * closedmouth cuddles domas [14:47:18] apergos: me neither, lots of innodb_ settings but not the one you were looking for [14:47:32] :-D [14:47:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:47:36] domas: truncating all the tables is cheating [14:47:53] created RT for monitoring [14:48:02] well t cleaned up the mess, which is what matters [14:48:54] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.187 seconds [14:49:34] I guess we could write something that collects from "show table status" [14:53:54] Jeff_Green: http://exchange.nagios.org/directory/Plugins/Databases/MySQL/MySQL-find-InnoDBs-and-check-free-space/details [14:57:12] thanks for the email update domas [15:00:01] is there a metric checking innodb table space in ganglia ? 
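[Editor's sketch] The check being asked about here (and the "show table status" idea at [14:49] plus the Nagios exchange plugin linked at [14:53]) could be approximated with a small script like the following. The host, database, threshold and available client credentials are assumptions, and Data_free is read from information_schema rather than SHOW TABLE STATUS only for convenience; on a shared ibdata file it reports free space in the common tablespace, not per table.

```bash
#!/bin/bash
# Minimal sketch of a free-space check for a shared InnoDB tablespace.
# HOST/DB/MIN_FREE_GB are placeholders; assumes mysql client credentials (e.g. ~/.my.cnf).
HOST=db40
DB=parsercache
MIN_FREE_GB=50

# DATA_FREE on an InnoDB table without file-per-table is the free space left in the
# shared ibdata tablespace, so a single row is enough to read it.
free_bytes=$(mysql -h "$HOST" -N -B -e \
  "SELECT DATA_FREE FROM information_schema.TABLES
   WHERE TABLE_SCHEMA='$DB' AND ENGINE='InnoDB' LIMIT 1")

free_gb=$(( free_bytes / 1024 / 1024 / 1024 ))
if [ "$free_gb" -lt "$MIN_FREE_GB" ]; then
    echo "CRITICAL: only ${free_gb}G free in the shared InnoDB tablespace on $HOST"
    exit 2
fi
echo "OK: ${free_gb}G free in the shared InnoDB tablespace on $HOST"
```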
[15:00:16] Cause I am pretty sure there is nagios plugin that check gmetad variables [15:00:35] I did wrote one, no idea if I open sourced it or not [15:00:38] definitely exist [15:00:51] so you could collect table space data in ganglia (yeahhh nice graphs) [15:00:58] then ask nagios/icinga to alarm on it [15:01:19] (or maybe Ganglia as a build in system to send a SNMP trap whenever a threshold is reached for a metric) [15:01:23] all the metrics i've found so far are table-specific which is a drag [15:01:45] 40% [15:02:32] Jeff_Green: you might want to raise the issue on ops list [15:02:43] ben / asher might able to set something up [15:02:48] (or you) [15:02:49] well whatever [15:02:49] domas: already up from 10%?? [15:03:10] it is definitely doable and will most probably avoid the nasty "oh we haven't seen it was going to be full" recurring issues :) [15:03:42] sure, even if we have to come up with some aggregation scheme [15:03:45] jeff_Green: thats just creating an empty ibdata [15:06:18] the nagios script just looks for a condition where any one table is over limit and notifies on that [15:06:57] in ganglia, I'm not sure what we'd do. but I don't think we'd want to graph every host+db+table combination [15:10:57] jeff_green: writing out a 2TB empty file takes a while, if you write out zeroes on top [15:11:20] yep [15:15:16] 55% [15:23:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:25:58] !log updated Jenkins configuration on gallium : Updating f407ebe..4b669b9 [15:26:07] Logged the message, Master [15:27:54] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.885 seconds [15:28:14] !log powercycling argon [15:28:22] Logged the message, Master [15:29:51] RECOVERY - Host argon is UP: PING OK - Packet loss = 0%, RTA = 30.93 ms [15:29:56] back later [15:33:09] RECOVERY - SSH on argon is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [15:33:34] !log argon (limesurvey) fscked, dist-upgrading [15:33:42] Logged the message, Master [15:34:01] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [15:37:37] !log argon back up with new kernel,mysql,grub,.. looks happy afaict [15:37:45] Logged the message, Master [15:38:07] <^demon> Happy other than the fact that limesurvey runs on it ;-) [15:38:08] Actually, that's a point [15:38:18] I logged a db40 diskspacce ticket back in October [15:38:19] https://rt.wikimedia.org/Ticket/Display.html?id=1663 [15:38:40] hashar mutante ^ [15:39:14] so did it took like 9 months to fill it ? :-D [15:39:43] ohh great [15:39:59] Reedy: thanks, merged with a new one :p [15:40:05] Jeff_Green: mutante apergos : so there is rt 1663 about monitoring table space instead of disk space :) [15:40:28] table space was the issue today [15:40:51] true. disk space != table space [15:40:58] we were under 80% for disk space [15:40:59] indeed [15:41:04] We had these issues back in October ;) [15:41:36] hashar: well, it says "db40 mysql disk space" :p [15:41:37] also, the innodb file will never shrink [15:41:53] ah, yeah we should rename [15:42:01] monitor all the disk spaces! [15:42:30] basically monitor everything :-D [15:42:36] then send ton of alerts [15:42:41] <^demon> If we use mongodb, these things wouldn't happen. 
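[Editor's sketch] For reference, the setting quoted at [14:39] (`innodb_data_file_path=ibdata1:2000G`) and the autoextend variant discussed at [15:43]–[15:44] would look roughly as below in my.cnf. The initial size, cap and file path are illustrative assumptions, not the values that were actually deployed.

```bash
# Illustrative only - sizes and path are assumptions, not the deployed configuration.
cat > /etc/mysql/conf.d/tablespace.cnf <<'EOF'
[mysqld]
# fixed-size shared tablespace, as quoted in the log:
#   innodb_data_file_path = ibdata1:2000G
# autoextending variant with a hard cap, so the file grows gradually instead of
# being pre-allocated (and still hits the same limit at the cap):
innodb_data_file_path = ibdata1:10G:autoextend:max:2000G
EOF
```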
[15:42:41] fwiw (and this is probably implied) we should monitor everywhere not just db40 [15:42:44] then have some tool to aggregate / sort them out :-] [15:42:57] ^demon: memsql [15:43:02] disk space exists: http://nagios.wikimedia.org/nagios/cgi-bin/extinfo.cgi?type=2&host=db40&service=MySQL+disk+space [15:43:10] ^demon: we still would have different issue with some limits being reached without us knowing about it [15:43:13] I hear cassandra would work too!! [15:43:25] but there's no pretty graphs? [15:43:29] <^demon> domas: Yes! Mongodb is dated already. Need the latest nosql toys :) [15:43:35] Jeff_Green: "mid-air collision" on renaming :) [15:43:41] ^demon: memsql is SQL [15:43:48] unless domas enabled autoexpand we probably won't max out disk space :-P [15:43:54] autoextend rather [15:43:58] mutante: oh noes! [15:44:03] I set it to 2000G [15:44:23] domas: but it'll cap there right? [15:44:30] yeh [15:44:40] and then hit the same bug! [15:44:41] :) [15:44:46] unless you guys get a repro! [15:44:52] in fact, when it's done filling disk with zeroes we can stop worrying about disk space! !! ! [15:51:51] re: "< Krinkle> hehe http://xkcd.com/903/ comes to mind indeed" that could work as a fundraising banner without further explanation :) [15:52:27] mutante: The image is to big probably [15:52:47] if you get it to SVG :D [15:54:53] mutante: zomg you're onto something there with the xkcd fundraising banner campaign idea [15:55:46] PROBLEM - BGP status on csw2-esams is CRITICAL: CRITICAL: host 91.198.174.244, sessions up: 3, down: 1, shutdown: 0BRPeering with AS64600 not established - BR [16:00:14] ask randall to draw a bunch of fundraising banners for us? [16:00:16] RECOVERY - BGP status on csw2-esams is OK: OK: host 91.198.174.244, sessions up: 4, down: 0, shutdown: 0 [16:01:35] nice idea maplebed [16:01:49] maplebed: yes [16:02:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:02:49] domas: who broke facebook? [16:02:57] "Your account is currently unavailable due to a site issue" [16:03:16] thats when i thought he is seriously multitasking :p [16:03:32] And it's back [16:04:17] at least one guy was like "omg, the net is down. fb and wp" earlier [16:08:46] New patchset: Dzahn; "RT #2841: decommission gilman" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14314 [16:09:23] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14314 [16:11:11] New review: Dzahn; "sigh, the gilman has to close :p" [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/14314 [16:11:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.037 seconds [16:11:31] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [16:14:04] ACKNOWLEDGEMENT - Host gilman is DOWN: CRITICAL - Host Unreachable (208.80.152.176) daniel_zahn #2841: decommission gilman [16:14:49] PROBLEM - BGP status on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, sessions up: 26, down: 1, shutdown: 1BRPeering with AS64600 not established - BR [16:18:33] !log powercycling mw1002 [16:18:41] Logged the message, Master [16:19:10] PROBLEM - Host ms-be5 is DOWN: PING CRITICAL - Packet loss = 100% [16:20:43] cmjohnson1: but nothing related to mw100* these days? [16:21:37] k,thx [16:21:40] cmjohnson1: is ms-be5 back on its way up? 
[16:22:04] I'm just glad it didn't crash on its own. [16:24:40] New patchset: Mark Bergsma; "Use 16 bit example ASNs for now, PyBal doesn't support 32 bit yet" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14316 [16:24:41] New patchset: Mark Bergsma; "New snapshot version" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14317 [16:24:42] New patchset: Mark Bergsma; "Merge branch 'malus/master'" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14318 [16:24:42] New patchset: Mark Bergsma; "Make init script wait 2 seconds for PyBal to stop" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14319 [16:24:43] New patchset: Mark Bergsma; "Merge branch 'malus/master'" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14320 [16:24:44] New patchset: Mark Bergsma; "Merge branch 'malus/master'" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14321 [16:24:44] New patchset: Mark Bergsma; "Merge branch 'malus/master'" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14322 [16:24:45] New patchset: Mark Bergsma; "Merge branch 'malus/master'" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14323 [16:24:46] New patchset: Mark Bergsma; "Merge branch 'malus/master'" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14324 [16:24:47] New patchset: Mark Bergsma; "Merge branch 'malus/master'" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14325 [16:24:47] New patchset: Mark Bergsma; "pybal (1.02~2.gbpa6789a) UNRELEASED; urgency=low" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14326 [16:24:48] New patchset: Mark Bergsma; "pybal (1.02) precise; urgency=low" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14327 [16:25:17] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14316 [16:25:41] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14317 [16:26:22] RECOVERY - Host mw1002 is UP: PING OK - Packet loss = 0%, RTA = 31.61 ms [16:27:16] RECOVERY - Host mw1007 is UP: PING OK - Packet loss = 0%, RTA = 30.96 ms [16:27:26] !log mw1002, mw1007,mw1009,mw1011 - crashed,powercycling,dist-upgrading+kernel,reboot [16:27:34] Logged the message, Master [16:27:55] New patchset: Mark Bergsma; "Use 16 bit example ASNs for now, PyBal doesn't support 32 bit yet" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14328 [16:27:56] New patchset: Mark Bergsma; "New snapshot version" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14329 [16:27:57] New patchset: Mark Bergsma; "Make init script wait 2 seconds for PyBal to stop" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14330 [16:27:58] New patchset: Mark Bergsma; "pybal (1.02~2.gbpa6789a) UNRELEASED; urgency=low" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14331 [16:27:58] New patchset: Mark Bergsma; "pybal (1.02) precise; urgency=low" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14332 [16:30:07] RECOVERY - Host ms-be5 is UP: PING OK - Packet loss = 0%, RTA = 0.50 ms [16:30:52] New patchset: Mark Bergsma; "Make init script wait 2 seconds for PyBal to stop" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14334 [16:30:53] New patchset: Mark Bergsma; "pybal (1.02~2.gbpa6789a) UNRELEASED; urgency=low" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14335 
[16:30:54] New patchset: Mark Bergsma; "pybal (1.02) precise; urgency=low" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14336 [16:31:01] PROBLEM - Host mw1002 is DOWN: PING CRITICAL - Packet loss = 100% [16:31:21] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14334 [16:31:45] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14335 [16:32:08] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14336 [16:32:28] Change abandoned: Mark Bergsma; "(no reason)" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14318 [16:32:43] Change abandoned: Mark Bergsma; "(no reason)" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14319 [16:32:53] Change abandoned: Mark Bergsma; "(no reason)" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14320 [16:32:58] RECOVERY - Host mw1009 is UP: PING OK - Packet loss = 0%, RTA = 30.96 ms [16:33:03] Change abandoned: Mark Bergsma; "(no reason)" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14321 [16:33:11] Change abandoned: Mark Bergsma; "(no reason)" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14322 [16:33:21] Change abandoned: Mark Bergsma; "(no reason)" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14323 [16:33:32] Change abandoned: Mark Bergsma; "(no reason)" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14324 [16:33:41] Change abandoned: Mark Bergsma; "(no reason)" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14325 [16:33:52] RECOVERY - Host mw1002 is UP: PING OK - Packet loss = 0%, RTA = 30.94 ms [16:33:53] Change abandoned: Mark Bergsma; "(no reason)" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14326 [16:33:57] GOOD MORNING GERRIT!!! [16:34:01] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [16:34:03] Change abandoned: Mark Bergsma; "(no reason)" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14327 [16:34:10] PROBLEM - Puppet freshness on cp1017 is CRITICAL: Puppet has not run in the last 10 hours [16:34:10] PROBLEM - Puppet freshness on mw1102 is CRITICAL: Puppet has not run in the last 10 hours [16:34:13] Change abandoned: Mark Bergsma; "(no reason)" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14328 [16:34:22] Change abandoned: Mark Bergsma; "(no reason)" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14329 [16:34:33] Change abandoned: Mark Bergsma; "(no reason)" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14330 [16:34:43] Change abandoned: Mark Bergsma; "(no reason)" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14331 [16:34:55] Change abandoned: Mark Bergsma; "(no reason)" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14332 [16:38:40] RECOVERY - Host mw1011 is UP: PING OK - Packet loss = 0%, RTA = 31.21 ms [16:42:52] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:44:22] RECOVERY - Host mw1017 is UP: PING OK - Packet loss = 0%, RTA = 31.29 ms [16:44:26] !log powercycling and upgrading more mw10xx servers, 1017,1023,1025 ... 
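[Editor's sketch] The recurring "powercycle, dist-upgrade + kernel, reboot" routine being logged for the crashed mw10xx hosts ([16:27:26] and the !log just above) amounts to something like the following once a box answers ssh again; the host name is a placeholder and this is only a rough approximation of the procedure, not the exact commands ops ran.

```bash
# Rough sketch of the post-powercycle routine mentioned in the !log entries above.
# HOST is a placeholder; run after the machine is reachable again.
HOST=mw1017
ssh root@"$HOST" '
  apt-get update &&
  DEBIAN_FRONTEND=noninteractive apt-get -y dist-upgrade &&   # pulls in the new kernel too
  reboot
'
```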
[16:44:26] who knows what the status 'n' means in gerrit, i know that A = abandoned, M is merged but lower case n is not clear to me [16:44:34] Logged the message, Master [16:44:46] RobHalsell: was ms-be1003 one of the ones you put SSDs into? [16:45:22] <^demon> drdee_: Example? [16:45:51] changeid 3694 [16:46:25] i can also look myself D [16:46:30] :D [16:46:59] <^demon> drdee_: Ah ok. One sec and I'll have the file that lists all the statuses :) [16:47:07] oh cool! [16:48:08] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [16:48:33] <^demon> drdee_: http://code.google.com/p/gerrit/source/browse/gerrit-reviewdb/src/main/java/com/google/gerrit/reviewdb/client/Change.java [16:48:43] PROBLEM - Host db1015 is DOWN: PING CRITICAL - Packet loss = 100% [16:48:52] <^demon> drdee_: Line ~180-190 [16:48:58] yep got it [16:49:02] thanks! [16:49:06] <^demon> You're welcome :) [16:49:43] so n = new [16:49:57] it's interesting that use of lowercase = open, uppercase = closed [16:50:05] RECOVERY - Host mw1020 is UP: PING OK - Packet loss = 0%, RTA = 30.97 ms [16:50:13] RECOVERY - Host mw1023 is UP: PING OK - Packet loss = 0%, RTA = 30.94 ms [16:50:13] RECOVERY - Host mw1025 is UP: PING OK - Packet loss = 0%, RTA = 30.93 ms [16:50:17] !log powercycling downed db1015 [16:50:26] Logged the message, Master [16:50:30] <^demon> Platonides: Yeah, interesting is one word for it :) [16:50:41] yep i just read that as well [16:50:46] it's good to know [16:51:43] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.397 seconds [16:53:22] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [16:53:58] PROBLEM - SSH on mw1025 is CRITICAL: Connection refused [16:54:16] RECOVERY - Host db1015 is UP: PING OK - Packet loss = 0%, RTA = 30.95 ms [16:56:17] RECOVERY - SSH on mw1025 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [16:56:35] PROBLEM - NTP on db1015 is CRITICAL: NTP CRITICAL: Offset unknown [16:56:44] RECOVERY - Host mw1036 is UP: PING OK - Packet loss = 0%, RTA = 30.93 ms [16:58:14] RECOVERY - NTP on db1015 is OK: NTP OK: Offset 0.06047487259 secs [17:14:47] !log powercycling db1013 [17:14:55] Logged the message, Master [17:15:24] New patchset: Bhartshorne; "adding eqiad ms-be hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14337 [17:15:56] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14337 [17:17:18] New patchset: preilly; "add more opera mini IPs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14163 [17:17:45] notpeter: https://gerrit.wikimedia.org/r/14338 [17:17:51] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14163 [17:18:47] RECOVERY - Host db1013 is UP: PING OK - Packet loss = 0%, RTA = 30.94 ms [17:20:30] LeslieCarr: actually can you please merge this for me https://gerrit.wikimedia.org/r/#/c/14338/ ? [17:20:40] I don't think notpeter is actually around [17:20:51] back [17:22:08] !log powercycling db1027,db1028 [17:22:17] Logged the message, Master [17:22:42] preilly checking it out [17:22:52] remind me [17:22:55] how does one create pcache tables? [17:23:14] i'm guessing you need https://gerrit.wikimedia.org/r/#/c/14163/2 as well ? 
[17:23:20] or you need to rebase 14338 [17:25:05] LeslieCarr: yes [17:25:05] RECOVERY - Host db1027 is UP: PING OK - Packet loss = 0%, RTA = 31.11 ms [17:25:05] PROBLEM - Host mw1134 is DOWN: PING CRITICAL - Packet loss = 100% [17:25:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:26:26] RECOVERY - Host db1028 is UP: PING OK - Packet loss = 0%, RTA = 30.94 ms [17:26:33] !log powercycling db1009,db1010 [17:26:42] Logged the message, Master [17:27:10] preilly: one message/change needed in 14163 [17:27:34] LeslieCarr: what is that? [17:27:48] made an inline comment, have one ip range overlapping [17:28:07] LeslieCarr: I see the virt hosts were cabled, have they been networked yet? [17:28:33] Ryan_Lane: the network is set up, however chris said that they were cabled backwards and is fixing them [17:28:38] ugh [17:28:43] heh [17:28:45] he's there? [17:28:55] ah. right. it's thursday [17:29:01] thanks [17:29:17] the network ports are up and ready whenever the cabling is fixed [17:29:17] RECOVERY - Host db1009 is UP: PING OK - Packet loss = 0%, RTA = 30.97 ms [17:30:08] ugh. my wikimania flight is monday [17:30:38] RECOVERY - Host db1010 is UP: PING OK - Packet loss = 0%, RTA = 30.89 ms [17:30:47] New review: Lcarr; "issue fixed in https://gerrit.wikimedia.org/r/#/c/14338/" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/14163 [17:30:49] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14163 [17:30:52] mine too Ryan_Lane [17:30:56] !log more powercycling and upgrading: mw1036, mw1134, mw1043 .. [17:31:03] when's yours ? [17:31:04] Logged the message, Master [17:31:25] departs at 9:48 am [17:31:43] they really need to start booking my flights later if they don't want me taking cabs [17:31:57] we're on the same flight [17:32:13] cool [17:32:14] considering traffic, if you're not getting there earlier, a cab to bart is probably your best bet [17:32:33] if I'm going to cab to bart, I may as well take it the entire way [17:33:06] You should buy a bike :) [17:33:24] Or not live miles away from any form of rail :) [17:33:35] Damianz: if I left a bike locked up for a week, for sure it would be stolen by the time I got back [17:33:48] guys, if you have the time for 1 or 2 mw servers each or so, powercycle and dist-upgrade, would be nice, i can't stay that much longer but there are lots to go and they all went down within the last couple days [17:33:52] There is that... amount of bikes I've had nicked :( [17:33:53] or this city's public transit could stop sucking [17:34:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.151 seconds [17:34:28] There are areas where it doesn't suck, you could live there instead :) But I agree it's insane that people in Orinda get to downtown and the airport faster than you [17:34:54] RoanKattouw: rent is also way higher in all of those areas too [17:35:02] Yeah, you pay for it [17:36:20] RECOVERY - Host mw1134 is UP: PING OK - Packet loss = 0%, RTA = 30.90 ms [17:37:41] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14338 [17:37:41] RECOVERY - Host mw1043 is UP: PING OK - Packet loss = 0%, RTA = 30.90 ms [17:38:26] RECOVERY - Host mw1136 is UP: PING WARNING - Packet loss = 73%, RTA = 30.88 ms [17:39:27] Hah, Embarcadero BART has a self-service bike parking station, 3 cents per hour, maximum duration 10 days [17:39:38] when did that start ? 
[17:39:51] I've known there was /something/ like that at Embarcadero for a while [17:39:57] But I never bothered to look up the details [17:40:32] Oh you have to like register and get a card with a $20 initial balance [17:40:54] So it's not exactly easy to start using but it could work for commuters [17:41:03] !log powercycling db1048, mw1136, mw1046 [17:41:06] so it's not on clipper, that's weird [17:41:13] Logged the message, Master [17:41:25] No, it's a separate system called BikeLink [17:41:45] http://bartbikestation.com/getstarted.php [17:43:22] Oh, meh, but no overnight [17:43:23] RECOVERY - Host db1048 is UP: PING OK - Packet loss = 0%, RTA = 30.93 ms [17:43:59] RECOVERY - Host mw1142 is UP: PING OK - Packet loss = 0%, RTA = 30.93 ms [17:43:59] RECOVERY - Host mw1148 is UP: PING OK - Packet loss = 0%, RTA = 30.89 ms [17:47:17] PROBLEM - mysqld processes on db1048 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [17:48:02] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [17:48:11] RECOVERY - Host mw1046 is UP: PING OK - Packet loss = 0%, RTA = 31.13 ms [17:50:22] preilly: all merged [17:50:47] New patchset: preilly; "Mobile default for sibling projects for wikiquote, wikibooks and wikiversity" [operations/debs/squid] (master) - https://gerrit.wikimedia.org/r/14341 [17:51:33] LeslieCarr: thanks [17:51:40] LeslieCarr: can you merge this one now: https://gerrit.wikimedia.org/r/#/c/14341/ [17:53:19] LeslieCarr: hi, can you check out the config for c3-pmtpa ports 11-13 when you get a chance? dhcp requests don't seem to be making it thru to brewster [17:53:31] sure [17:53:43] preilly first, then binasher [17:54:18] Change merged: Asher; [operations/debs/squid] (master) - https://gerrit.wikimedia.org/r/14341 [17:54:43] oh asher got that [17:55:02] now to actually compile it [17:55:44] LeslieCarr: is the puppet change merged and live on the cache boxes? [17:55:45] binasher: pc11-pc13 are supposed to be internal, right ? [17:56:22] LeslieCarr: just clarifying because one of the carriers isn't seeing a change [17:56:31] pc1-3, yep. well, whatever vlan db's are on, i think that's internal? [17:56:35] preilly: yep, and did a banadm [17:56:46] LeslieCarr: okay great thanks [17:57:12] binasher: yep they're on internal [17:58:24] anything else that might effect dhcp requests? [17:59:20] New patchset: Asher; "build of redirector as of https://gerrit.wikimedia.org/r/#/c/14341/" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14342 [17:59:38] these are just an addition to asw-d-pmtpa , so no new real config on those switches [17:59:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14342 [18:00:11] um, i can do a monitoring if you want to do a reboot of a machine ? [18:00:50] ok, just a sec [18:01:09] let me know which machine (and it's mac address plz) [18:02:26] LeslieCarr: it'll be pc1 via 88:43:E1:C2:4C:AA [18:02:37] thanks, let me set up the monitor [18:04:29] just powercyclyed it, but it'll be a while before it gets to the pxe boot stage [18:05:31] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14342 [18:08:15] cool, watching bootp forwarding now... [18:08:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:10:40] LeslieCarr: did you see anything? 
[18:11:09] nope [18:12:51] its trying again now [18:13:14] 88:43:e1:c2:4c:aa [18:15:52] doing tcpdump now [18:18:56] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.019 seconds [18:22:57] LeslieCarr: just did another pxe attempt, catch anything? [18:24:07] nada, but my flagging bootp doesn't seem to actually be flagging anything - sigh, i am guessing it's because it only flags if it gets to the RE, and it's all being done in the forwarding plane [18:24:10] at least my best guess [18:26:12] hmm.. can you verify that some packets are being sent from pc1's switch port? [18:26:31] yeah, one minute [18:26:57] let me turn off lldp for that port as well [18:28:36] ok, there's some broadcasty arp traffic going into that port now, nothing has yet come out [18:29:56] it should be sending requests again right now [18:30:33] or so says the console . . .*spinny cursor* [18:32:39] what's the issue? [18:32:45] i have seen 0 bytes coming into that interface [18:33:09] so i don't think it's actually sending the request (that or it's miscabled and i'm checking out the wrong interface [18:34:03] mark: pc1 isn't getting dhcp -- it's one of the new ciscos [18:34:10] mark: i'm trying to pxe boot a cisco server, and it appears to be doing a standard broadcom pxe boot request on the console, but no requests make it to brewster [18:34:38] maybe it is miscabled [18:35:01] compare MACs? [18:38:20] ah... [18:38:31] 6/0/10 i think is what it is in [18:38:57] doesn't explain why pc2 and pc3 weren't working, but at least now we can look in the right spot :) [18:39:10] * domas reacts to 'pc' [18:39:16] i haven't tried pc2 or 3, just 1 [18:39:21] * domas realizes pc2 is not pc002 [18:39:25] * domas disappears [18:39:39] RECOVERY - BGP status on cr2-eqiad is OK: OK: host 208.80.154.197, sessions up: 27, down: 0, shutdown: 1 [18:39:41] domas: it's like the pc jr. but without the chiclet keyboard. it's going to revolutionize business. [18:40:01] binasher: did you see, db40 filled up, caused July 5th fireworks [18:40:08] \o/ [18:40:30] yeah! fun [18:40:44] <^demon> Only fireworks people fire off on July 5th are the ones they forgot to fire off on the 4th. 
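[Editor's sketch] The capture being described above ([18:05]–[18:33]) boils down to watching for the client's DHCP/BOOTP broadcasts while the box retries PXE boot; something along these lines on the DHCP server (brewster), with the interface name assumed and the MAC taken from the log.

```bash
# Watch for the PXE client's DHCPDISCOVER on the DHCP server while it retries PXE boot.
# eth0 is an assumption; the MAC is the one quoted at [18:02:26].
# -e prints link-level headers so the source MAC is visible in each packet line.
tcpdump -n -e -i eth0 '(port 67 or port 68) and ether host 88:43:e1:c2:4c:aa'

# If the console says the NIC is sending but nothing shows up here, the request is being
# dropped (or the switch port is miscabled, as turned out to be the case above).
```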
[18:41:00] <^demon> db40 was just waiting for us to get back :) [18:41:10] :-) [18:43:07] for i in {0..255}; printf "CREATE TABLE pc%03d LIKE objectcache;" | mysql -h db40 parsercache [18:43:10] very good script [18:43:11] write it down [18:43:46] !log Built new pybal_1.03 package and inserted it into the precise-wikimedia APT repository [18:43:54] * Reedy finds a post-it note [18:43:55] Logged the message, Master [18:44:37] pc1 is going to replace db40, possibly joined by pc2-3 later [18:44:39] New patchset: Mark Bergsma; "Don't bailout on missing BGP next hops if no matching AF prefixes are configured" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14348 [18:44:39] New patchset: Mark Bergsma; "Add bgp-nexthop-ipv[46] examples in pybal.conf" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14349 [18:44:40] New patchset: Mark Bergsma; "Don't add AF defaults to peerings dict" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14350 [18:44:41] New patchset: Mark Bergsma; "Account for nonexisting AFs in BGPFailover.prefixes" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14351 [18:44:42] New patchset: Mark Bergsma; "pybal (1.03) precise; urgency=low" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14352 [18:44:53] ooh good call on the postit [18:45:05] is it going to run pmysql? [18:45:06] ergh [18:45:08] memsql? [18:45:19] you should use cassandra [18:45:22] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14348 [18:45:32] hmmm, or hdfs [18:45:33] memsql all the way! [18:45:46] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14349 [18:45:57] that corruption bug is awesome [18:46:12] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14350 [18:46:32] binasher: let's try this again ? [18:46:34] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14351 [18:46:58] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14352 [18:46:59] actually i'm going to have enough memcache so the sqlbagostufff db doesn't really do anything but write stuff thats never read.. then i'll switch to the blackhole storage engine [18:48:01] right [18:48:04] the wikipedia way! [18:48:12] show banners all year round [18:48:34] LeslieCarr: it works now! [18:48:59] and i saw packets go in [18:49:04] magical when we have the right ports, eh? :) [18:49:30] hah, yep. thanks for tracking that down! [18:50:20] what is pc1 hardware? [18:50:29] it doesn't let me in! [18:50:30] :) [18:50:36] my 486 [18:51:01] TROLL DETECTED [18:51:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:51:05] oh my [18:51:12] it's still negotiating keys, that's why [18:51:12] mark: did you see the beautiful banner served to me today? [18:51:33] how would I see it if it was served to you? [18:51:40] I posted a link to screenshot [18:51:43] http://flux.defau.lt/wikiwhat.png [18:51:44] ;-) [18:52:34] pc1.. and now it appears the disks aren't installed in the right order either, wee [18:52:49] mark: beautiful, isn't it [18:52:53] you should tell zack that you love it [18:53:14] hmmm [18:53:26] does 679 count include eqiad? [18:53:39] i would hope so [18:53:47] i dunno what it's based on [18:53:51] I thought we had more ;) [18:53:56] me too [18:54:01] did someone change Varnish config recently? 
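[Editor's sketch] The one-liner quoted at [18:43] above is missing the loop keywords and never passes $i to printf, so as written it would not create the numbered tables. A corrected version that actually creates pc000–pc255 would look like this (db40 and parsercache taken from the log, and it assumes an existing objectcache table to clone).

```bash
# Corrected form of the [18:43] one-liner: create pc000..pc255 as clones of objectcache.
for i in {0..255}; do
    printf 'CREATE TABLE pc%03d LIKE objectcache;\n' "$i"
done | mysql -h db40 parsercache
```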
[18:54:07] !log Upgraded pybal on all precise LVS servers [18:54:15] Logged the message, Master [18:54:22] if only we had a public version control system [18:54:24] MaxSem: mobile ? [18:54:26] for our configuration files [18:54:32] LeslieCarr, yup [18:54:44] mark: why would anyone have that? thats invitation for hackers [18:54:47] stealing your passwords [18:54:52] yes, it was MaxSem [18:59:54] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.570 seconds [19:00:26] Hello - I can no longer run Git commands; I get the error message "Permission denied (publickey)." What's the best way to troubleshoot this? [19:00:57] PROBLEM - Puppet freshness on ms3 is CRITICAL: Puppet has not run in the last 10 hours [19:01:45] Yaron: make sure you got the right user name when trying to log in [19:03:04] The username's right, I think; I'm just trying to call commands like "git pull" in a directory that already existed. [19:03:23] cmjohnson1: are you around in pmtpa today? [19:03:40] MaxSem: is there a problem ? [19:03:54] Yaron: Shouldn't you get another error then? [19:04:04] I don't know. [19:04:11] LeslieCarr, Special:MobileOptions behaves strangely. we suspect cookie problems [19:05:02] hoo - I should have noted before, I re-generated my SSH key today, because the last one wasn't working either; but I updated the record of my public key on Wikimedia Labs. [19:05:10] Yaron: mhm "[...@homeserv FlaggedRevs]$ git pull" works fine [19:05:14] preilly: see above ? [19:05:42] Yaron: Try to ssh into bastion.wmflabs.org maybe [19:06:15] LeslieCarr: I see it [19:06:17] if that works your ssh stuff is correct [19:06:23] hoo - you mean, replace gerrit.wikimedia.org with that, in the URL? [19:06:44] MaxSem: it only passes specific cookies [19:06:55] MaxSem: because it varies on cookies [19:06:57] <^demon|away> Yaron: Also need to update your key in gerrit. It doesn't pull the keys from ldap [19:06:58] just run "ssh yourName@bastion.wmflabs.org" [19:07:04] <^demon|away> It's stupid and annoying and I hate it [19:07:14] without git, just to test the ssh [19:07:19] hoo - okay. [19:07:30] ^demon|away - I'll try that too; thanks. [19:08:07] hoo - okay, that worked... I mean, I'm logged in. [19:08:11] ^demon|away: sofixit? :) [19:08:24] <^demon|away> Nah, I'd much rather nag #gerrit about it :) [19:08:33] Yaron: Great, then follow demon's advice [19:08:50] and try a command you know to work [19:08:54] Okay, I'll do that now... [19:09:20] <^demon|away> Ryan_Lane: Last I heard, redoing the authz/authn stuff to actually pull data from LDAP on the fly (rather than copying fields over...I'm not joking) is on the roadmap. [19:09:25] <^demon|away> Don't know how committed anyone is to it though. [19:09:32] * Ryan_Lane nods [19:09:35] so [19:09:42] what hardware is pc1? [19:09:48] did someone just purge logs from blondel? [19:09:53] I understand that quite a few of you replied with jokes [19:10:32] but a question was probably valid [19:10:41] domas: cisco's [19:10:42] domas: eh? [19:10:45] pc1? [19:11:00] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [19:11:09] lesliecarr: those 300GB ones? [19:11:10] Aha, that worked! hoo, ^demon|away - thanks to both of you. [19:11:13] okie [19:11:23] <^demon|away> Yaron: Yay, glad you're fixed :) [19:11:27] bbl [19:11:31] LeslieCarr: can you please merge this change https://gerrit.wikimedia.org/r/#/c/14353/ [19:11:49] I've been fixed. 
[19:11:49] You're welcome, Yaron ;) [19:13:20] domas: pc1 is a cisco w/192GB of ram, 2 of the original 300GB sas drives for the os, and six 300GB intel 710 ssd's [19:13:58] which will be striped [19:14:08] !log *somebody* purged binary logs on blondel [19:14:16] Logged the message, Master [19:21:03] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [19:21:49] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14353 [19:26:41] New patchset: preilly; "add Saudi Telecom landing page for zero domain" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14355 [19:27:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14355 [19:27:17] LeslieCarr: ^^ [19:27:26] LeslieCarr: last time I bug you I promise [19:28:11] haha [19:28:30] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14355 [19:29:03] LeslieCarr: thanks so much [19:29:32] pushing it all out to the mobile caches now... [19:31:40] LeslieCarr: thanks again [19:32:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:33:57] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [19:38:18] LeslieCarr: networking doesn't seem to be working on virt6 [19:38:38] on the secondary interface ? [19:38:50] yes [19:39:03] eth1 is up, eth1.103 is up [19:39:12] eth1.103 is added to br103 [19:39:20] the vnet device is added to br103 [19:39:22] no networking [19:40:59] sigh, wonder if it's the one off again [19:41:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.997 seconds [19:42:02] Ryan_Lane: is it working now (in 30 more seconds) [19:42:07] ok [19:42:20] will this need to be fixed on the others too? [19:42:30] well if it's the one off error, then nope [19:42:48] one off error? [19:44:28] i got a list that said something like 5-10 when in fact it was ports 4-9 [19:44:37] so the first machine wouldn't be in the proper range [19:44:46] working now ? [19:44:58] ok, finally gotta run [19:46:17] New patchset: Mark Bergsma; "Initial implementation of a DNS monitor for PyBal" [operations/debs/pybal] (monitors/dns) - https://gerrit.wikimedia.org/r/14385 [19:46:17] New patchset: Mark Bergsma; "Add the DNS monitor" [operations/debs/pybal] (monitors/dns) - https://gerrit.wikimedia.org/r/14386 [19:47:59] :( [19:48:04] it's not working still [19:53:36] I see dhcp packets being sent out, but nothing received [20:05:36] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [20:15:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:24:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.488 seconds [20:27:20] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [20:37:41] PROBLEM - Puppet freshness on cp3002 is CRITICAL: Puppet has not run in the last 10 hours [20:37:41] PROBLEM - Puppet freshness on search31 is CRITICAL: Puppet has not run in the last 10 hours [20:38:22] Ryan_Lane: all working ? 
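For context, the eth1.103/br103 arrangement being debugged is a tagged sub-interface enslaved to a bridge that the instances' vnet tap devices also join; a rough illustration of that wiring (device and VLAN names are from the conversation, the commands themselves are only a sketch, not the actual puppet steps used):

    # Tagged sub-interface for VLAN 103 on the second NIC
    ip link add link eth1 name eth1.103 type vlan id 103
    ip link set eth1 up
    ip link set eth1.103 up
    # Bridge that carries both the tagged uplink and the instances' vnetN devices
    brctl addbr br103
    brctl addif br103 eth1.103
    ip link set br103 up

None of this passes traffic unless the switch port is trunking VLAN 103 as well.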
[20:38:37] after mark tagged the interfaces, yeah ;) [20:39:18] cmjohnson1: hey, i think there's been a systemic problem on rack c3 where the plugs are off by 1 -- my guess is that your count started from 1 instead of 0 on that ? (junipers start from 0, foundry's from 1) [20:39:26] ah, didn't realize you were using tagging on the second interfaces [20:39:47] PROBLEM - Puppet freshness on sq69 is CRITICAL: Puppet has not run in the last 10 hours [20:40:41] PROBLEM - Puppet freshness on search24 is CRITICAL: Puppet has not run in the last 10 hours [20:40:41] PROBLEM - Puppet freshness on search34 is CRITICAL: Puppet has not run in the last 10 hours [20:40:41] PROBLEM - Puppet freshness on strontium is CRITICAL: Puppet has not run in the last 10 hours [20:43:23] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/14040 [20:45:47] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [20:46:41] PROBLEM - Puppet freshness on search16 is CRITICAL: Puppet has not run in the last 10 hours [20:46:42] PROBLEM - Puppet freshness on search20 is CRITICAL: Puppet has not run in the last 10 hours [20:46:42] PROBLEM - Puppet freshness on search22 is CRITICAL: Puppet has not run in the last 10 hours [20:46:42] PROBLEM - Puppet freshness on search21 is CRITICAL: Puppet has not run in the last 10 hours [20:46:42] PROBLEM - Puppet freshness on search30 is CRITICAL: Puppet has not run in the last 10 hours [20:46:42] PROBLEM - Puppet freshness on search27 is CRITICAL: Puppet has not run in the last 10 hours [20:46:42] PROBLEM - Puppet freshness on search28 is CRITICAL: Puppet has not run in the last 10 hours [20:46:43] PROBLEM - Puppet freshness on search33 is CRITICAL: Puppet has not run in the last 10 hours [20:46:44] PROBLEM - Puppet freshness on search36 is CRITICAL: Puppet has not run in the last 10 hours [20:47:34] LeslieCarr: have to use tagging [20:47:59] the device is a bridge [20:48:06] the instances need to be on the vlan [20:49:41] PROBLEM - Puppet freshness on sq70 is CRITICAL: Puppet has not run in the last 10 hours [20:51:47] PROBLEM - Puppet freshness on search26 is CRITICAL: Puppet has not run in the last 10 hours [20:52:41] PROBLEM - Puppet freshness on search18 is CRITICAL: Puppet has not run in the last 10 hours [20:52:41] PROBLEM - Puppet freshness on sq67 is CRITICAL: Puppet has not run in the last 10 hours [20:52:41] PROBLEM - Puppet freshness on sq68 is CRITICAL: Puppet has not run in the last 10 hours [20:54:47] PROBLEM - Puppet freshness on cp3001 is CRITICAL: Puppet has not run in the last 10 hours [20:55:41] PROBLEM - Puppet freshness on search13 is CRITICAL: Puppet has not run in the last 10 hours [20:55:41] PROBLEM - Puppet freshness on search25 is CRITICAL: Puppet has not run in the last 10 hours [20:55:41] PROBLEM - Puppet freshness on search17 is CRITICAL: Puppet has not run in the last 10 hours [20:58:41] PROBLEM - Puppet freshness on search15 is CRITICAL: Puppet has not run in the last 10 hours [20:59:17] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:59:53] PROBLEM - Host mw1116 is DOWN: PING CRITICAL - Packet loss = 100% [21:00:20] anyone working on mw1116 ? 
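A quick way to confirm the tagging arrangement just described, from the host side, is to check bridge membership and the VLAN id; a small sketch using the names above:

    # Expect eth1.103 plus one vnetN entry per running instance
    brctl show br103
    # Confirm the sub-interface really carries VLAN id 103
    cat /proc/net/vlan/eth1.103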
[21:02:44] PROBLEM - Puppet freshness on search14 is CRITICAL: Puppet has not run in the last 10 hours [21:02:44] PROBLEM - Puppet freshness on search19 is CRITICAL: Puppet has not run in the last 10 hours [21:02:44] PROBLEM - Puppet freshness on search23 is CRITICAL: Puppet has not run in the last 10 hours [21:02:44] PROBLEM - Puppet freshness on search29 is CRITICAL: Puppet has not run in the last 10 hours [21:04:41] PROBLEM - Puppet freshness on arsenic is CRITICAL: Puppet has not run in the last 10 hours [21:06:29] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [21:07:41] PROBLEM - Puppet freshness on palladium is CRITICAL: Puppet has not run in the last 10 hours [21:08:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.541 seconds [21:09:28] !log added new ms-be pmtpa hosts to DNS [21:09:29] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [21:09:36] Logged the message, Master [21:15:04] !log powercycling unresponsive mw1116 [21:15:15] Logged the message, Mistress of the network gear. [21:21:29] RECOVERY - Host mw1116 is UP: PING OK - Packet loss = 0%, RTA = 31.11 ms [21:25:27] hey ^demon, question about gerrit ls-projects command, some repos are not returned, like wikimedia/orgchart, mediawiki/extensions/Contest and integration/testswarm is that because these repo's are private or something like that (assuming that such a thing as private exists in gerrit) [21:26:37] <^demon|away> Yes, it won't return repos that you don't have Read permissions on. [21:26:43] <^demon|away> But those 3 you should :\ [21:27:34] strange..... [21:27:39] <^demon|away> drdee_: You can view them in the UI though? [21:27:58] yep [21:28:02] <^demon|away> That's even more bizarre :\ [21:28:40] and it's always the same repo's that not returned [21:28:41] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [21:28:52] <^demon|away> Hrm. Wonder why. Is it just those 3? [21:30:02] maybe they are at the top of the list or something like that? [21:30:35] s/are/would be/ [21:30:47] shall i paste the output in pastebin? [21:31:04] <^demon|away> Yeah that'd be good [21:31:33] oh, and I don't see why Contest extension would need to be secret, anyway [21:31:53] 1 sec [21:32:04] <^demon|away> I don't know why any of those 3 would be. Everything in mediawiki/* has Read permissions for anons. [21:32:15] <^demon|away> integration/testswarm has Read explicitly granted to anons. [21:32:52] <^demon|away> Same with wikimedia/* [21:33:39] actually, it's only orgchart (blush blush) [21:35:10] that is missing [21:35:10] <^demon|away> Hmm. Still weird, since it should have the permissions. I'll explicitly grant Read on it, can't hurt. [21:42:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:50:09] i keep getting issues with snmptt crashing on neon --- i think i'm going to try upgrading it to precise [21:50:27] ^demon: what is your output when you run ssh -p 29418 gerrit.wikimedia.org gerrit ls-projects -d [21:51:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.002 seconds [21:51:59] New patchset: Bhartshorne; "adding entries for ms-be6-12. false entries (00:00:00) for 9-12." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/14422 [21:52:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14422 [21:52:59] ignore any neon pages [21:56:28] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14422 [21:58:41] PROBLEM - SSH on neon is CRITICAL: Connection refused [21:59:17] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [22:13:39] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [22:17:21] LeslieCarr: last week, you saw that the new C2100s (ms-be hosts) were spamming DHCP requests. [22:17:24] RECOVERY - MySQL disk space on neon is OK: DISK OK [22:17:26] did you do something to squash them? [22:17:33] RECOVERY - SSH on neon is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [22:17:35] oh i may have turned the ports down on a few... [22:17:44] I'm not seeing a lease request from ms-be7 or ms-be8. [22:17:49] checking... [22:17:51] (though ms-be6 seemes to be ok. [22:17:53] ) [22:18:09] PROBLEM - NTP on neon is CRITICAL: NTP CRITICAL: Offset unknown [22:18:33] it may also be that I have the MAC address wrong. [22:20:58] hrm [22:21:07] no, it looks like i don't have ms-be7 or ms-be8 configured [22:21:35] do you have a record of what ports they're supposed to be using? or do you need cmjohnson1 to look that up? [22:22:30] have a ticket …. need to find other machines on that patch panel [22:22:30] RECOVERY - NTP on neon is OK: NTP OK: Offset 0.03774738312 secs [22:23:43] oh weird [22:23:57] ah i see what happened [22:24:11] they were all labeled ms789 instead of ms-be789 [22:24:16] and yes, those two ports are disabled [22:24:38] ok. nice to have an explanation. [22:25:16] hey, I think one just got a lease! [22:25:19] ok, now they should be good [22:25:20] yay [22:25:22] ms-be7. [22:25:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:25:34] thanks! [22:27:08] ms-be8 hasn't yet, but I want to check something there. I'll ping again if it's still not working in a bit. [22:33:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.893 seconds [22:34:12] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [22:36:04] ms-be8 has net just fine, so \o/ [22:43:15] New patchset: preilly; "fix wgZeroDisableImages issue if NOT on Zero domain" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/14426 [22:44:42] New patchset: preilly; "fix wgZeroDisableImages issue if NOT on Zero domain" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/14426 [22:45:18] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [22:45:45] PROBLEM - SSH on neon is CRITICAL: Connection refused [22:46:43] Change merged: preilly; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/14426 [22:57:45] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [23:00:42] ^demon: ok. package is built. I just need to push it to the repo [23:01:02] <^demon> Ok. I'll send out the notice to wikitech and spam irc. [23:01:08] ^demon: so, what's the command I need to run after I update the package? 
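Back on the ms-be lease requests: once the switch ports are re-enabled, the usual check on the server side is to watch brewster's DHCP log and the wire directly; a sketch, with the log path and interface name as assumptions:

    # DISCOVER/REQUEST lines from dhcpd as the hosts retry PXE
    tail -f /var/log/syslog | grep -i dhcpd
    # Or capture the requests as they arrive; -e prints the client MACs
    tcpdump -n -e -i eth0 port 67 or port 68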
[23:01:18] <^demon> Had it in front of me, one sec. [23:01:24] I stop gerrit, then run the upgrade command, then start it, right? [23:01:30] PROBLEM - NTP on neon is CRITICAL: NTP CRITICAL: No response from NTP server [23:01:35] I need to make sure to stop it on formey too :) [23:02:52] <^demon> Right [23:03:21] <^demon> Upgrade command is `java -jar gerrit-whatever.war init -d /var/lib/gerrit2/review_site --no-auto-start` [23:03:22] New patchset: preilly; "fix subdomain for carrier" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14429 [23:03:56] New patchset: Lcarr; "moving neon to precise" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14430 [23:03:57] ok [23:04:25] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14429 [23:04:26] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14430 [23:04:43] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14430 [23:07:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:07:38] ^demon: ready to start? [23:07:43] <^demon> Ready when you are. [23:07:54] !log stopping gerrit on formey and disabling puppet [23:08:02] Logged the message, Master [23:08:22] !log stopping gerrit on manganese and disabling puppet [23:08:30] Logged the message, Master [23:09:06] Ryan_Lane: do you happen to know the IPv6 addresses of the Polish toolserver? [23:09:13] nope [23:09:23] !log upgrading gerrit on manganese [23:09:30] Logged the message, Master [23:10:27] ugh, this package is not great [23:10:33] why is sumanah a member of the gerrit2 group? [23:10:50] <^demon> B) Not sure. [23:10:53] ugh [23:10:53] <^demon> A) What's wrong? [23:11:01] it's not a system group? [23:11:17] <^demon> I don't know, it's not puppetized. [23:11:42] I'm disabling ldap on manganese temporarily [23:16:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.029 seconds [23:17:06] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [23:17:16] having to fix permissions and such [23:17:27] I need to fix this package later too [23:17:46] we should really not have a gerrit2 user in ldap. heh [23:18:36] PROBLEM - Puppet freshness on nfs2 is CRITICAL: Puppet has not run in the last 10 hours [23:18:51] <^demon> Yeah, that should be fixed. Will make deploying it to labs via puppet way easier. [23:21:02] !log updating database for gerrit [23:21:10] Logged the message, Master [23:21:17] ^demon: was a backup done? [23:21:24] <^demon> No. Whoops. [23:21:27] heh [23:21:29] <^demon> Totally should've done that [23:21:32] can you do that really quicl? [23:21:36] I didn't start yet [23:21:40] <^demon> On it. [23:21:43] thanks [23:22:26] <^demon> Oh, ldap's down, I can't snatch the password from secure.config :p [23:22:34] lemme bring that back [23:22:39] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours [23:22:58] ^demon: ok [23:23:12] should work now [23:24:35] yes [23:24:35] <^demon> Dumped. [23:24:38] cool [23:26:50] ok. it's updating the db [23:27:24] <^demon> Just 3 schema changes this time. [23:27:26] it's done [23:27:30] starting gerrit [23:27:46] hm [23:27:47] failed [23:28:26] <^demon> Ouch, why? 
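Putting the steps discussed here in order, the upgrade sequence is roughly the following; only the init invocation is the one quoted above, while the backup command, war path, service script and database name are assumptions:

    # Dump the review database before touching the schema (name/credentials assumed)
    mysqldump -u gerrit -p reviewdb > reviewdb-pre-upgrade.sql
    # Stop gerrit (on formey and manganese), then upgrade the site in place
    /etc/init.d/gerrit stop
    java -jar gerrit.war init -d /var/lib/gerrit2/review_site --no-auto-start
    # Start it back up and watch the error log for startup failures
    /etc/init.d/gerrit start
    tail -f /var/lib/gerrit2/review_site/logs/error_log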
[23:29:06] lemme see [23:29:56] <^demon> Missing password for database.user or something? [23:30:06] <^demon> Caused by: java.sql.SQLException: Access denied for user 'gerrit'@'manganese.wikimedia.org' (using password: NO) [23:30:13] weird [23:30:38] it's there.... [23:30:57] oh [23:31:01] it probably can't read it [23:31:41] <^demon> Ah, that'd do it [23:31:49] hm [23:31:53] still failed [23:32:42] <^demon> I got http://p.defau.lt/?AbiGVXUbZLQeSNKe8gV_lw at the end of error_log [23:32:53] yeah [23:32:57] no clue what that means [23:33:28] <^demon> Trying to find out, one sec. [23:35:29] probably from: [repository "*"] [23:35:29] ownerGroup = Project Creators [23:35:31] ? [23:35:47] <^demon> That doesn't sound like it'd cause it, but it's harmless to remove. [23:35:49] <^demon> Can try that. [23:36:14] that was it [23:36:25] now, why that was it, I have no clue [23:36:27] Ryan_Lane, where is that line? [23:36:32] in the config [23:37:08] <^demon> Ryan_Lane: I'll dig into why. It only affects people creating new projects (ie: me, really). [23:37:08] oh, a real file? [23:37:17] wow. it's really fucking slow [23:37:32] <^demon> Caches are stale. [23:37:50] <^demon> s/stale/shutting down gerrit flushes them/ [23:38:20] !log upgrading gerrit on formey [23:38:21] I thought it was one of the files in a hidden ref [23:38:28] Logged the message, Master [23:38:53] <^demon> Platonides: No, gerrit.config. It's something I added recently. [23:39:35] it's way slower than it should be ;) [23:39:56] when I try to go to the groups screen [23:39:58] <^demon> Platonides: https://gerrit.wikimedia.org/r/Documentation/config-gerrit.html#_a_id_repository_a_section_repository, if you're interested. [23:40:03] there we go [23:40:22] <^demon> Groups was painfully slow in 2.3 as well. Once caches fill the first time it's *slightly* better. [23:40:40] yeah [23:40:48] heh [23:40:53] project creators doesn't exist [23:41:00] it's project owners [23:41:09] <^demon> https://gerrit.wikimedia.org/r/#/admin/groups/119,members [23:41:15] <^demon> It's supposed to be referring to that. [23:41:25] <^demon> Perhaps I changed the group name and forgot to update config? [23:41:35] lemme try with that group [23:41:37] on formey [23:42:33] I think it very much dislikes the & [23:43:06] <^demon> Ahhh. Could be. [23:43:10] ^demon: can you rename it using and rather than & ? [23:43:26] <^demon> Done. [23:44:22] that worked [23:44:22] <^demon> gerrit.config is a standard .git/config-style file. Perhaps &'s need to be escaped or something. [23:44:34] <^demon> I'll fix it in puppet. [23:44:59] can you remove that workaround too? [23:45:11] <^demon> Yeah I'll do that while I'm there. [23:45:20] and fix the whitespace for that ownerGroup line? :) [23:45:36] !log restarting gerrit on manganese [23:45:43] Logged the message, Master [23:49:48] ^demon: make sure to fix the package version in the manifests too :) [23:50:00] should now be ensure => "2.4.2-1" [23:50:05] <^demon> Caught me just before I pushed :p [23:50:06] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:53:25] New patchset: Demon; "Couple of fixes for Gerrit:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14432 [23:53:28] <^demon> Ryan_Lane: ^ [23:53:59] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14432 [23:54:04] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14432 [23:54:55] !log force running puppet on formey and manganese, since a config change is involved, it's going to restart [23:55:04] Logged the message, Master [23:56:12] <^demon> Ryan_Lane: Right after I tell everyone it's up :p [23:56:18] heh [23:56:50] puppet is slow as hell right now, too, it seems
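For reference, the setting that kept Gerrit from starting lives in gerrit.config, which uses git-config syntax (see the documentation link above); the working form looks roughly like this, with the group name as a placeholder since it has to match an existing Gerrit group exactly and an ampersand in the real name is what broke startup here:

    # etc/gerrit.config excerpt (path under the review site is assumed)
    [repository "*"]
        ownerGroup = Some Group Name   ; placeholder, must name an existing group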
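And the version pin Ryan asks for above would be a one-line change in the gerrit manifest; a minimal puppet sketch (the resource title and package name are assumptions, the version string is the one quoted):

    # Keep the installed gerrit package at the freshly built release
    package { 'gerrit':
        ensure => '2.4.2-1',
    }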