[00:06:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.985 seconds [00:36:25] PROBLEM - Puppet freshness on cp3002 is CRITICAL: Puppet has not run in the last 10 hours [00:36:25] PROBLEM - Puppet freshness on search31 is CRITICAL: Puppet has not run in the last 10 hours [00:38:22] PROBLEM - Puppet freshness on sq69 is CRITICAL: Puppet has not run in the last 10 hours [00:39:25] PROBLEM - Puppet freshness on search34 is CRITICAL: Puppet has not run in the last 10 hours [00:39:25] PROBLEM - Puppet freshness on strontium is CRITICAL: Puppet has not run in the last 10 hours [00:39:25] PROBLEM - Puppet freshness on search24 is CRITICAL: Puppet has not run in the last 10 hours [00:40:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:41:49] PROBLEM - Host mw1007 is DOWN: PING CRITICAL - Packet loss = 100% [00:44:22] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [00:44:22] PROBLEM - Puppet freshness on search21 is CRITICAL: Puppet has not run in the last 10 hours [00:44:22] PROBLEM - Puppet freshness on search22 is CRITICAL: Puppet has not run in the last 10 hours [00:44:22] PROBLEM - Puppet freshness on search27 is CRITICAL: Puppet has not run in the last 10 hours [00:44:22] PROBLEM - Puppet freshness on search36 is CRITICAL: Puppet has not run in the last 10 hours [00:44:23] PROBLEM - Puppet freshness on search28 is CRITICAL: Puppet has not run in the last 10 hours [00:45:25] PROBLEM - Puppet freshness on search16 is CRITICAL: Puppet has not run in the last 10 hours [00:45:25] PROBLEM - Puppet freshness on search30 is CRITICAL: Puppet has not run in the last 10 hours [00:45:25] PROBLEM - Puppet freshness on search20 is CRITICAL: Puppet has not run in the last 10 hours [00:45:25] PROBLEM - Puppet freshness on search33 is CRITICAL: Puppet has not run in the last 10 hours [00:48:25] PROBLEM - Puppet freshness on sq70 is CRITICAL: Puppet has not run in the last 10 hours [00:49:46] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.540 seconds [00:50:22] PROBLEM - Puppet freshness on search26 is CRITICAL: Puppet has not run in the last 10 hours [00:50:50] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/14200 [00:51:16] PROBLEM - Host mw1002 is DOWN: PING CRITICAL - Packet loss = 100% [00:51:25] PROBLEM - Puppet freshness on search18 is CRITICAL: Puppet has not run in the last 10 hours [00:51:25] PROBLEM - Puppet freshness on sq67 is CRITICAL: Puppet has not run in the last 10 hours [00:51:25] PROBLEM - Puppet freshness on sq68 is CRITICAL: Puppet has not run in the last 10 hours [00:53:22] PROBLEM - Puppet freshness on cp3001 is CRITICAL: Puppet has not run in the last 10 hours [00:54:25] PROBLEM - Puppet freshness on search13 is CRITICAL: Puppet has not run in the last 10 hours [00:54:25] PROBLEM - Puppet freshness on search17 is CRITICAL: Puppet has not run in the last 10 hours [00:54:25] PROBLEM - Puppet freshness on search25 is CRITICAL: Puppet has not run in the last 10 hours [00:57:25] PROBLEM - Puppet freshness on search15 is CRITICAL: Puppet has not run in the last 10 hours [00:59:22] PROBLEM - Puppet freshness on search19 is CRITICAL: Puppet has not run in the last 10 hours [00:59:22] PROBLEM - Puppet freshness on search14 is CRITICAL: Puppet has not run in the last 10 hours [00:59:22] PROBLEM - Puppet freshness on search29 is CRITICAL: Puppet has 
not run in the last 10 hours [01:00:25] PROBLEM - Puppet freshness on search23 is CRITICAL: Puppet has not run in the last 10 hours [01:02:22] PROBLEM - Puppet freshness on arsenic is CRITICAL: Puppet has not run in the last 10 hours [01:05:22] PROBLEM - Puppet freshness on palladium is CRITICAL: Puppet has not run in the last 10 hours [01:22:19] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:26:22] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [01:31:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.936 seconds [01:42:07] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 292 seconds [01:43:28] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 289 seconds [01:49:35] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 654s [01:51:05] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 9 seconds [01:52:53] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 26 seconds [01:53:56] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 1s [02:06:05] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:14:56] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.727 seconds [03:16:26] PROBLEM - Puppet freshness on nfs2 is CRITICAL: Puppet has not run in the last 10 hours [03:20:29] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours [03:53:46] can someone explain http://toolserver.org/~pathoschild/stalktoy/index.php?target=2000%3A%3A%2F4? [03:53:54] (why it keeps giving a db-related error) [03:53:58] is that server down? [03:56:20] hmm [03:56:22] looking [03:56:54] works fine here [03:56:59] maybe a temporary hickup? [03:59:58] try dropping the ? in the end though I dont think you had that in your orginal use [04:06:45] PROBLEM - Host mw1132 is DOWN: PING CRITICAL - Packet loss = 100% [05:16:30] Jasper_Deng: #wikimedia-toolserver [05:16:41] (issue is resolved now) [05:16:53] -operations is for the real servers, toolserver is a replicated read-only cluster separate from that. [05:16:56] just fyi :) [05:37:22] PROBLEM - Host mw1142 is DOWN: PING CRITICAL - Packet loss = 100% [05:38:21] wut? mediawiki.org is giving me dns not found [05:38:25] http://www.mediawiki.org/wiki/Template:Ombox [05:38:27] time out [05:38:40] Unable to resolve the server's DNS address. [05:39:01] anyone else? 
[05:39:34] loads fine here [05:39:37] that link I mean [05:39:43] I am in europe though [05:42:33] weird, works from curl for me but not in Chrome [05:42:42] dnsflush fied it [05:42:45] fixed* [05:44:42] Krinkle: if you're running a recent build and have the flag enabled, chrome doesn't use the os's getaddrinfo() [05:44:47] Krinkle: see https://plus.google.com/103382935642834907366/posts/FKot8mghkok [05:45:06] Krinkle: and "Built-in Asynchronous DNS" in chrome://flags [05:45:27] that might account for a host resolving correctly in curl but not in chrome (or vice versa) [05:46:42] os dns flush did fix it [05:46:48] could be a coincendence [05:55:40] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [06:13:50] New patchset: Raimond Spekking; "Fix for https://gerrit.wikimedia.org/r/#/c/14180/" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/14277 [06:20:35] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [06:32:17] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [06:33:29] PROBLEM - Puppet freshness on cp1017 is CRITICAL: Puppet has not run in the last 10 hours [06:33:29] PROBLEM - Puppet freshness on mw1102 is CRITICAL: Puppet has not run in the last 10 hours [06:47:31] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [06:51:25] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:47:08] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [08:30:38] PROBLEM - Host mw1040 is DOWN: PING CRITICAL - Packet loss = 100% [08:30:56] PROBLEM - Host mw1085 is DOWN: PING CRITICAL - Packet loss = 100% [09:00:13] PROBLEM - Puppet freshness on ms3 is CRITICAL: Puppet has not run in the last 10 hours [09:10:16] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [09:20:19] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [09:33:13] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [10:13:55] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [10:33:07] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [10:37:10] PROBLEM - Puppet freshness on cp3002 is CRITICAL: Puppet has not run in the last 10 hours [10:37:10] PROBLEM - Puppet freshness on search31 is CRITICAL: Puppet has not run in the last 10 hours [10:37:44] New patchset: Hashar; "nagios authdns now check nagiostest.beta.wmflabs.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14286 [10:38:20] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14286 [10:38:41] New review: Hashar; "The change fix 3 nagios errors." 
[operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/14286 [10:39:16] PROBLEM - Puppet freshness on sq69 is CRITICAL: Puppet has not run in the last 10 hours [10:40:10] PROBLEM - Puppet freshness on search24 is CRITICAL: Puppet has not run in the last 10 hours [10:40:10] PROBLEM - Puppet freshness on search34 is CRITICAL: Puppet has not run in the last 10 hours [10:40:10] PROBLEM - Puppet freshness on strontium is CRITICAL: Puppet has not run in the last 10 hours [10:45:16] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [10:45:16] PROBLEM - Puppet freshness on search21 is CRITICAL: Puppet has not run in the last 10 hours [10:45:16] PROBLEM - Puppet freshness on search22 is CRITICAL: Puppet has not run in the last 10 hours [10:45:16] PROBLEM - Puppet freshness on search27 is CRITICAL: Puppet has not run in the last 10 hours [10:45:16] PROBLEM - Puppet freshness on search36 is CRITICAL: Puppet has not run in the last 10 hours [10:45:17] PROBLEM - Puppet freshness on search28 is CRITICAL: Puppet has not run in the last 10 hours [10:46:10] PROBLEM - Puppet freshness on search16 is CRITICAL: Puppet has not run in the last 10 hours [10:46:10] PROBLEM - Puppet freshness on search20 is CRITICAL: Puppet has not run in the last 10 hours [10:46:10] PROBLEM - Puppet freshness on search30 is CRITICAL: Puppet has not run in the last 10 hours [10:46:10] PROBLEM - Puppet freshness on search33 is CRITICAL: Puppet has not run in the last 10 hours [10:49:10] PROBLEM - Puppet freshness on sq70 is CRITICAL: Puppet has not run in the last 10 hours [10:51:16] PROBLEM - Puppet freshness on search26 is CRITICAL: Puppet has not run in the last 10 hours [10:52:10] PROBLEM - Puppet freshness on sq67 is CRITICAL: Puppet has not run in the last 10 hours [10:52:10] PROBLEM - Puppet freshness on sq68 is CRITICAL: Puppet has not run in the last 10 hours [10:52:10] PROBLEM - Puppet freshness on search18 is CRITICAL: Puppet has not run in the last 10 hours [10:54:16] PROBLEM - Puppet freshness on cp3001 is CRITICAL: Puppet has not run in the last 10 hours [10:54:52] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14286 [10:55:10] PROBLEM - Puppet freshness on search17 is CRITICAL: Puppet has not run in the last 10 hours [10:55:10] PROBLEM - Puppet freshness on search13 is CRITICAL: Puppet has not run in the last 10 hours [10:55:10] PROBLEM - Puppet freshness on search25 is CRITICAL: Puppet has not run in the last 10 hours [10:55:34] New patchset: Hashar; "planet: comment that update-planets need to be changed too" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14287 [10:56:07] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14287 [10:58:10] PROBLEM - Puppet freshness on search15 is CRITICAL: Puppet has not run in the last 10 hours [11:00:00] New patchset: Mark Bergsma; "Allow servers to prefix nameservers (e.g. LVS)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14289 [11:00:34] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14289 [11:00:42] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14289 [11:01:39] PROBLEM - Puppet freshness on search14 is CRITICAL: Puppet has not run in the last 10 hours [11:01:39] PROBLEM - Puppet freshness on search23 is CRITICAL: Puppet has not run in the last 10 hours [11:01:39] PROBLEM - Puppet freshness on search19 is CRITICAL: Puppet has not run in the last 10 hours [11:01:39] PROBLEM - Puppet freshness on search29 is CRITICAL: Puppet has not run in the last 10 hours [11:03:45] PROBLEM - Puppet freshness on arsenic is CRITICAL: Puppet has not run in the last 10 hours [11:06:45] PROBLEM - Puppet freshness on palladium is CRITICAL: Puppet has not run in the last 10 hours [11:07:37] New patchset: Mark Bergsma; "Install a DNS recursor on new LVS servers after all" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14290 [11:08:07] hashar: wow, puppet has *inline* switch statements? [11:08:10] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14290 [11:08:14] directl to assignment [11:08:18] yes [11:08:19] https://gerrit.wikimedia.org/r/#/c/14289/1/manifests/realm.pp,unified [11:08:28] amazing [11:08:37] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14290 [11:08:40] beats javascript [11:09:05] <^demon|away> Don't say that too loud, the ruby enthusiasts might hear you ;-) [11:09:25] though js is not far behind [11:09:41] var nameservers = ({ "esams": [ .. ], "eqiad": [ .. ] })[ site ] || [ .. ] [11:10:04] object literals and the || default operator [11:10:12] which, contrary to php, returns the value, not boolean [11:10:53] do more with less [11:11:02] anyway, puppet can stay now ;P [11:11:20] no [11:11:27] you can't even concatenate arrays with it :P [11:11:55] its interesting though, its an odd category syntax [11:12:20] its not really a language for logic / execution. More like json/ini with with some logics built-in [11:12:36] but it looks lot like the java type of langauge [11:13:15] or C-family rather [11:13:21] it's declarative [11:13:25] yeah [11:13:34] but one wouldn't see switch statements in INI or JSON. [11:14:30] * Krinkle opens puppet for dummies [11:14:52] regexp, conditionals, inheritance, hashes, "in", "unless", nice :) [11:15:38] and then you also miss a ton of stuff [11:16:02] sure [11:16:03] and it's not consistently implemented and has many bugs [11:16:20] gonna be a bumpy ride [11:16:39] <^demon|away> Buckle up :) [11:16:43] mark: this is my "puppet" for now, during initial sketching/labs: https://labsconsole.wikimedia.org/wiki/Nova_Resource:Integration/Setup [11:16:54] good ol' shell executables [11:17:05] and some inline homebrew comments/syntax [11:17:50] !log Installed new pybal snapshot build for testing on lvs1005 [11:17:59] Logged the message, Master [11:18:40] New patchset: Dzahn; "add missing Russian locales for planet on singer" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14293 [11:19:13] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14293 [11:22:32] Krinkle: var nameservers = ({ "esams": [ .. ], "eqiad": [ .. ] })[ site ] || [ .. 
] [11:22:37] Krinkle: I like that syntax [11:22:44] Krinkle: really easy to figure out / read when properly indented [11:22:50] js :) [11:23:06] New patchset: Mark Bergsma; "Convert remaining $nameservers changes to $nameservers_prefix, install recursor on all LVS servers again" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14294 [11:23:11] yeah, javascript has 2 things that literally make out almost all of the languageL [11:23:13] objects and functions [11:23:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14294 [11:23:44] objects are arrays, hashes, and what not. functions are functions, methods, classes, modules, closures, scope.. [11:24:27] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14294 [11:24:44] mark: are you going to wikimania? [11:24:50] no [11:25:02] anyone else from the ops team as far as you know? [11:25:06] yeah many [11:25:13] okauy [11:26:20] mark: I still have to laugh when I think back about january 2011. only little over a year ago. [11:26:42] in amsterdam [11:26:56] what about it? :) [11:27:04] be trying to sounds smart about something with backbones, whatever I thought that was. [11:27:15] don't worry about it ;) [11:27:15] to you of all people [11:27:36] but I found my place, on the opposite side of the pipe, so to speak [11:27:45] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [11:27:59] i'll just be careful to not try to sound smart about javascript to you then [11:28:20] * hashar sends Krinkle in a datacenter with RobH so he learns about the pipe side that really matter :-] [11:28:23] (i'm not likely to do that anyway, I avoid web programming ;) [11:28:30] you do ruby already! [11:28:36] hehe, enjoy every minute of it [11:28:37] <^demon|away> Never admit to knowing anything :) [11:28:44] I do python whenever I can [11:28:56] * hashar knows about igniting a lighter [11:29:01] mark: well, the day comes we're going to have to install node js on wmf servers. prepare for the worst [11:29:14] won't be me then ;) [11:29:27] though ops wouldn't mind too much I suppose, that's still software side. [11:29:29] speaking of nodejs, we need to update the nodesjs -wm debian package :) [11:29:47] * ^demon|away hides server-side from our JS future [11:30:39] * Krinkle mumbles away to lunch about /usr/nodejs/common/docroot/index.js :P [11:30:40] brb later [11:52:48] PROBLEM - Host mw1141 is DOWN: PING CRITICAL - Packet loss = 100% [11:54:13] !log Inserted new pybal_1.02 package into APT distribution precise-wikimedia [11:54:22] Logged the message, Master [11:58:08] New patchset: Dzahn; "add missing Russian locales for planet on singer (ru_RU ISO-8859-5, -5 not -2)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14293 [11:58:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14293 [11:59:16] New review: Dzahn; "fix RT 3227 and bug 38198" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14293 [11:59:37] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14293 [12:03:15] RECOVERY - Auth DNS on labs-ns1.wikimedia.org is OK: DNS OK: 0.044 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.153.219 [12:04:45] RECOVERY - Auth DNS on labs-ns0.wikimedia.org is OK: DNS OK: 0.028 seconds response time. 
nagiostest.beta.wmflabs.org returns 208.80.153.219 [12:10:09] bbl [13:17:53] PROBLEM - Puppet freshness on nfs2 is CRITICAL: Puppet has not run in the last 10 hours [13:18:42] Database error: "SqlBagOStuff::set". Database returned error "1114: The table 'pc233' is full (10.0.6.50)". - https://bugzilla.wikimedia.org/38202 critical [13:18:50] @info 10.0.6.50 [13:18:52] Krinkle: [10.0.6.50: ] db40 [13:18:56] @info db40 [13:18:56] Krinkle: [db40: s7] 10.0.6.50 [13:19:01] @replag s7 [13:19:01] Krinkle: [s7] db37: 0s, db56: 0s, db58: 0s, db26: 0s [13:19:12] (obviously, not related to replag, jus checking) [13:19:17] Database error: "SqlBagOStuff::set". Database returned error "1114: The table 'pc233' is full (10.0.6.50)". - https://bugzilla.wikimedia.org/38202 critical [13:19:32] <^demon> Sounds like 10.0.6.50 might be out of space? [13:19:40] pc193 as well [13:19:44] see #wikimedia-tech [13:20:16] <^demon> pc[\d{3}] isn't actually the name of a table or server :) [13:20:35] whatever [13:20:40] O_O [13:20:45] dbbot-wm: [13:20:52] :( [13:21:56] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours [13:29:21] anyone? [13:29:23] Db error? [13:34:02] yeah, looking at it [13:34:14] sorry, I was tyiping about it but in the wrong channel [13:34:23] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [13:41:41] ^demon: should we disable the parser cache ? *grin* [13:42:02] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [13:42:59] (Cannot contact the database server: Unknown error (10.0.6.50)) [13:43:52] so what's going on now is that we're looking at whether we can change the innodb space constraint in my.cnf or whether that is going ot have undesireable side effects [13:45:02] apergos: site totally down [13:45:06] right [13:45:12] matanya: ops working on it [13:46:02] thanks [13:48:11] PROBLEM - LVS HTTP IPv4 on wiktionary-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [13:48:20] PROBLEM - LVS HTTPS IPv4 on wiktionary-lb.pmtpa.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error [13:48:38] PROBLEM - LVS HTTP IPv4 on wiktionary-lb.pmtpa.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [13:48:56] PROBLEM - LVS HTTPS IPv4 on wiktionary-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error [13:50:08] RECOVERY - LVS HTTP IPv4 on wiktionary-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 60158 bytes in 0.130 seconds [13:50:26] RECOVERY - LVS HTTPS IPv4 on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 60357 bytes in 0.933 seconds [13:51:11] RECOVERY - LVS HTTP IPv4 on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 60351 bytes in 0.884 seconds [13:51:20] RECOVERY - LVS HTTPS IPv4 on wiktionary-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 60167 bytes in 0.171 seconds [13:54:38] PROBLEM - LVS HTTP IPv4 on wiktionary-lb.pmtpa.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [13:54:56] PROBLEM - LVS HTTPS IPv4 on wiktionary-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error [13:55:41] PROBLEM - LVS HTTP IPv4 on wiktionary-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [13:55:50] PROBLEM - LVS HTTPS IPv4 on 
wiktionary-lb.pmtpa.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error [13:57:11] RECOVERY - LVS HTTP IPv4 on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 60351 bytes in 0.864 seconds [13:57:20] RECOVERY - LVS HTTPS IPv4 on wiktionary-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 60167 bytes in 0.272 seconds [13:57:38] RECOVERY - LVS HTTP IPv4 on wiktionary-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 60160 bytes in 0.130 seconds [13:57:56] RECOVERY - LVS HTTPS IPv4 on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 60358 bytes in 1.232 seconds [14:01:29] hehe http://xkcd.com/903/ comes to mind indeed [14:01:36] domas: mh? [14:01:53] YAY WIKTIONARY DOWN [14:02:02] FINALLY [14:02:17] domas: :P You did the migrate to the mysql parser cache no? [14:02:38] What was the impact on flushing it all? Will the Apaches take that? [14:03:20] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [14:03:47] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error [14:03:56] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [14:03:56] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error [14:04:50] RECOVERY - LVS HTTP IPv4 on wikiversity-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 48811 bytes in 0.140 seconds [14:05:17] RECOVERY - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 48816 bytes in 0.383 seconds [14:05:26] RECOVERY - LVS HTTP IPv4 on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 49004 bytes in 0.698 seconds [14:05:26] RECOVERY - LVS HTTPS IPv4 on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 49011 bytes in 0.917 seconds [14:06:45] http://flux.defau.lt/wikiwhat.png [14:08:17] http://wikimania2012.wikimedia.org/wiki/Google_Reception :( [14:08:22] Logged the message, Master [14:08:29] domas: yeah, randomized testing of some new banners [14:08:36] this is one of them. [14:08:37] back [14:08:42] !quote [14:08:49] domas: they're quite different, aren't they :D [14:09:03] hahaha [14:09:09] at least they are not mentioning facebook in these banners anymore [14:09:11] ha ha ha ha [14:09:31] Cause FB isn't very reliable, is it? :D [14:09:32] * hoo hids [14:10:15] so we should have asher make the determination about table space? [14:12:36] we should have people monitor stuff that matters [14:12:40] rather than wanking around [14:20:36] ok [14:21:58] I can do some innodb wizzardry or just reinitialize everything [14:22:40] what would the wizardry look like and how long would it take? [14:22:50] it can't happen anymore [14:23:12] it would be trying to engineer custom build that skips those assertions :) [14:23:23] ok hen [14:23:38] reinitialize it is [14:24:21] I'm not an expert in fsp code [14:24:24] :) [14:26:37] if anyone is feeling like it, can build a repro [14:27:12] anyway, RCA is pointing at this - http://flux.defau.lt/wikiwhat.png [14:27:25] "Google might have" [14:27:26] lol [14:28:16] surprised Yahoo still has 13000 staff. [14:31:58] domas: I am wondering which of the 5 servers are mine [14:32:42] is the site down? 
[14:32:55] it seems to work for me [14:33:25] here, see [14:33:26] http://en.wikipedia.org/wiki/User:Midom/test [14:33:37] wfm [14:33:41] ok, free space 10% initialized [14:33:52] worksforme [14:34:45] I like this though [14:34:49] every time it gets full, we just nuke it [14:34:52] with site down for a while [14:34:58] <3 [14:35:03] ok I see the change [14:35:04] thanks [14:35:05] we definitely want to monitor that so [14:35:08] domas: yeah, even wikilove is operational. and then some (!) [14:35:10] domas: maybe memsql! [14:35:11] specially if that happens from time to time [14:35:14] jeff_Green: yup [14:35:17] that would be wikipedia way [14:35:23] throwing lots of hardware at the problem [14:35:35] because paying few minutes of attention to site critical system is not necessary [14:35:36] :) [14:35:50] so if that lives in puppet someplace it would be nice to make it permanent [14:37:09] don't find it. hmm [14:37:23] <^demon> apergos: Whatcha looking for in puppet? [14:37:35] the my.cnf setting [14:37:55] <^demon> Ah [14:38:28] if the old setting isn't in there I guess it won't get overwritten :-/ [14:39:05] what setting? [14:39:15] templates/mysql$ vi generic_my.cnf.erb ? [14:39:22] why would you talk about settings [14:39:23] or does it not use the generic one [14:39:25] it is not about settings [14:39:30] it is about not allowing the site to go down [14:39:38] because of not caring about things [14:39:39] innodb_data_file_path=ibdata1:2000G [14:39:49] meh [14:40:11] t is about keepipng puppet in sync with what's live, but if this isn't in puppet then it's not a problem [14:41:26] oh ugh [14:41:27] I see [14:42:14] so we need to get table free space into our fine quality monitoring . . . [14:42:54] uh huh [14:43:13] so mutante I still don't see it in there [14:46:08] db40 fixed? [14:46:14] Note, it's got loads of disk space ;) [14:46:54] yes, domas fixed it and so now we can forget about the space issues for another several months :-P [14:47:15] * closedmouth cuddles domas [14:47:18] apergos: me neither, lots of innodb_ settings but not the one you were looking for [14:47:32] :-D [14:47:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:47:36] domas: truncating all the tables is cheating [14:47:53] created RT for monitoring [14:48:02] well t cleaned up the mess, which is what matters [14:48:54] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.187 seconds [14:49:34] I guess we could write something that collects from "show table status" [14:53:54] Jeff_Green: http://exchange.nagios.org/directory/Plugins/Databases/MySQL/MySQL-find-InnoDBs-and-check-free-space/details [14:57:12] thanks for the email update domas [15:00:01] is there a metric checking innodb table space in ganglia ? 
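[Editor's sketch] The check being asked about here (and the "show table status" idea at [14:49] plus the Nagios exchange plugin linked at [14:53]) could be approximated with a small script like the following. The host, database, threshold and available client credentials are assumptions, and Data_free is read from information_schema rather than SHOW TABLE STATUS only for convenience; on a shared ibdata file it reports free space in the common tablespace, not per table.

```bash
#!/bin/bash
# Minimal sketch of a free-space check for a shared InnoDB tablespace.
# HOST/DB/MIN_FREE_GB are placeholders; assumes mysql client credentials (e.g. ~/.my.cnf).
HOST=db40
DB=parsercache
MIN_FREE_GB=50

# DATA_FREE on an InnoDB table without file-per-table is the free space left in the
# shared ibdata tablespace, so a single row is enough to read it.
free_bytes=$(mysql -h "$HOST" -N -B -e \
  "SELECT DATA_FREE FROM information_schema.TABLES
   WHERE TABLE_SCHEMA='$DB' AND ENGINE='InnoDB' LIMIT 1")

free_gb=$(( free_bytes / 1024 / 1024 / 1024 ))
if [ "$free_gb" -lt "$MIN_FREE_GB" ]; then
    echo "CRITICAL: only ${free_gb}G free in the shared InnoDB tablespace on $HOST"
    exit 2
fi
echo "OK: ${free_gb}G free in the shared InnoDB tablespace on $HOST"
```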
[15:00:16] Cause I am pretty sure there is nagios plugin that check gmetad variables [15:00:35] I did wrote one, no idea if I open sourced it or not [15:00:38] definitely exist [15:00:51] so you could collect table space data in ganglia (yeahhh nice graphs) [15:00:58] then ask nagios/icinga to alarm on it [15:01:19] (or maybe Ganglia as a build in system to send a SNMP trap whenever a threshold is reached for a metric) [15:01:23] all the metrics i've found so far are table-specific which is a drag [15:01:45] 40% [15:02:32] Jeff_Green: you might want to raise the issue on ops list [15:02:43] ben / asher might able to set something up [15:02:48] (or you) [15:02:49] well whatever [15:02:49] domas: already up from 10%?? [15:03:10] it is definitely doable and will most probably avoid the nasty "oh we haven't seen it was going to be full" recurring issues :) [15:03:42] sure, even if we have to come up with some aggregation scheme [15:03:45] jeff_Green: thats just creating an empty ibdata [15:06:18] the nagios script just looks for a condition where any one table is over limit and notifies on that [15:06:57] in ganglia, I'm not sure what we'd do. but I don't think we'd want to graph every host+db+table combination [15:10:57] jeff_green: writing out a 2TB empty file takes a while, if you write out zeroes on top [15:11:20] yep [15:15:16] 55% [15:23:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:25:58] !log updated Jenkins configuration on gallium : Updating f407ebe..4b669b9 [15:26:07] Logged the message, Master [15:27:54] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.885 seconds [15:28:14] !log powercycling argon [15:28:22] Logged the message, Master [15:29:51] RECOVERY - Host argon is UP: PING OK - Packet loss = 0%, RTA = 30.93 ms [15:29:56] back later [15:33:09] RECOVERY - SSH on argon is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [15:33:34] !log argon (limesurvey) fscked, dist-upgrading [15:33:42] Logged the message, Master [15:34:01] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [15:37:37] !log argon back up with new kernel,mysql,grub,.. looks happy afaict [15:37:45] Logged the message, Master [15:38:07] <^demon> Happy other than the fact that limesurvey runs on it ;-) [15:38:08] Actually, that's a point [15:38:18] I logged a db40 diskspacce ticket back in October [15:38:19] https://rt.wikimedia.org/Ticket/Display.html?id=1663 [15:38:40] hashar mutante ^ [15:39:14] so did it took like 9 months to fill it ? :-D [15:39:43] ohh great [15:39:59] Reedy: thanks, merged with a new one :p [15:40:05] Jeff_Green: mutante apergos : so there is rt 1663 about monitoring table space instead of disk space :) [15:40:28] table space was the issue today [15:40:51] true. disk space != table space [15:40:58] we were under 80% for disk space [15:40:59] indeed [15:41:04] We had these issues back in October ;) [15:41:36] hashar: well, it says "db40 mysql disk space" :p [15:41:37] also, the innodb file will never shrink [15:41:53] ah, yeah we should rename [15:42:01] monitor all the disk spaces! [15:42:30] basically monitor everything :-D [15:42:36] then send ton of alerts [15:42:41] <^demon> If we use mongodb, these things wouldn't happen. 
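[Editor's sketch] For reference, the setting quoted at [14:39] (`innodb_data_file_path=ibdata1:2000G`) and the autoextend variant discussed at [15:43]–[15:44] would look roughly as below in my.cnf. The initial size, cap and file path are illustrative assumptions, not the values that were actually deployed.

```bash
# Illustrative only - sizes and path are assumptions, not the deployed configuration.
cat > /etc/mysql/conf.d/tablespace.cnf <<'EOF'
[mysqld]
# fixed-size shared tablespace, as quoted in the log:
#   innodb_data_file_path = ibdata1:2000G
# autoextending variant with a hard cap, so the file grows gradually instead of
# being pre-allocated (and still hits the same limit at the cap):
innodb_data_file_path = ibdata1:10G:autoextend:max:2000G
EOF
```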
[15:42:41] fwiw (and this is probably implied) we should monitor everywhere not just db40 [15:42:44] then have some tool to aggregate / sort them out :-] [15:42:57] ^demon: memsql [15:43:02] disk space exists: http://nagios.wikimedia.org/nagios/cgi-bin/extinfo.cgi?type=2&host=db40&service=MySQL+disk+space [15:43:10] ^demon: we still would have different issue with some limits being reached without us knowing about it [15:43:13] I hear cassandra would work too!! [15:43:25] but there's no pretty graphs? [15:43:29] <^demon> domas: Yes! Mongodb is dated already. Need the latest nosql toys :) [15:43:35] Jeff_Green: "mid-air collision" on renaming :) [15:43:41] ^demon: memsql is SQL [15:43:48] unless domas enabled autoexpand we probably won't max out disk space :-P [15:43:54] autoextend rather [15:43:58] mutante: oh noes! [15:44:03] I set it to 2000G [15:44:23] domas: but it'll cap there right? [15:44:30] yeh [15:44:40] and then hit the same bug! [15:44:41] :) [15:44:46] unless you guys get a repro! [15:44:52] in fact, when it's done filling disk with zeroes we can stop worrying about disk space! !! ! [15:51:51] re: "< Krinkle> hehe http://xkcd.com/903/ comes to mind indeed" that could work as a fundraising banner without further explanation :) [15:52:27] mutante: The image is to big probably [15:52:47] if you get it to SVG :D [15:54:53] mutante: zomg you're onto something there with the xkcd fundraising banner campaign idea [15:55:46] PROBLEM - BGP status on csw2-esams is CRITICAL: CRITICAL: host 91.198.174.244, sessions up: 3, down: 1, shutdown: 0BRPeering with AS64600 not established - BR [16:00:14] ask randall to draw a bunch of fundraising banners for us? [16:00:16] RECOVERY - BGP status on csw2-esams is OK: OK: host 91.198.174.244, sessions up: 4, down: 0, shutdown: 0 [16:01:35] nice idea maplebed [16:01:49] maplebed: yes [16:02:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:02:49] domas: who broke facebook? [16:02:57] "Your account is currently unavailable due to a site issue" [16:03:16] thats when i thought he is seriously multitasking :p [16:03:32] And it's back [16:04:17] at least one guy was like "omg, the net is down. fb and wp" earlier [16:08:46] New patchset: Dzahn; "RT #2841: decommission gilman" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14314 [16:09:23] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14314 [16:11:11] New review: Dzahn; "sigh, the gilman has to close :p" [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/14314 [16:11:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.037 seconds [16:11:31] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [16:14:04] ACKNOWLEDGEMENT - Host gilman is DOWN: CRITICAL - Host Unreachable (208.80.152.176) daniel_zahn #2841: decommission gilman [16:14:49] PROBLEM - BGP status on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, sessions up: 26, down: 1, shutdown: 1BRPeering with AS64600 not established - BR [16:18:33] !log powercycling mw1002 [16:18:41] Logged the message, Master [16:19:10] PROBLEM - Host ms-be5 is DOWN: PING CRITICAL - Packet loss = 100% [16:20:43] cmjohnson1: but nothing related to mw100* these days? [16:21:37] k,thx [16:21:40] cmjohnson1: is ms-be5 back on its way up? 
[16:22:04] I'm just glad it didn't crash on its own. [16:24:40] New patchset: Mark Bergsma; "Use 16 bit example ASNs for now, PyBal doesn't support 32 bit yet" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14316 [16:24:41] New patchset: Mark Bergsma; "New snapshot version" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14317 [16:24:42] New patchset: Mark Bergsma; "Merge branch 'malus/master'" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14318 [16:24:42] New patchset: Mark Bergsma; "Make init script wait 2 seconds for PyBal to stop" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14319 [16:24:43] New patchset: Mark Bergsma; "Merge branch 'malus/master'" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14320 [16:24:44] New patchset: Mark Bergsma; "Merge branch 'malus/master'" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14321 [16:24:44] New patchset: Mark Bergsma; "Merge branch 'malus/master'" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14322 [16:24:45] New patchset: Mark Bergsma; "Merge branch 'malus/master'" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14323 [16:24:46] New patchset: Mark Bergsma; "Merge branch 'malus/master'" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14324 [16:24:47] New patchset: Mark Bergsma; "Merge branch 'malus/master'" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14325 [16:24:47] New patchset: Mark Bergsma; "pybal (1.02~2.gbpa6789a) UNRELEASED; urgency=low" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14326 [16:24:48] New patchset: Mark Bergsma; "pybal (1.02) precise; urgency=low" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14327 [16:25:17] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14316 [16:25:41] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14317 [16:26:22] RECOVERY - Host mw1002 is UP: PING OK - Packet loss = 0%, RTA = 31.61 ms [16:27:16] RECOVERY - Host mw1007 is UP: PING OK - Packet loss = 0%, RTA = 30.96 ms [16:27:26] !log mw1002, mw1007,mw1009,mw1011 - crashed,powercycling,dist-upgrading+kernel,reboot [16:27:34] Logged the message, Master [16:27:55] New patchset: Mark Bergsma; "Use 16 bit example ASNs for now, PyBal doesn't support 32 bit yet" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14328 [16:27:56] New patchset: Mark Bergsma; "New snapshot version" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14329 [16:27:57] New patchset: Mark Bergsma; "Make init script wait 2 seconds for PyBal to stop" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14330 [16:27:58] New patchset: Mark Bergsma; "pybal (1.02~2.gbpa6789a) UNRELEASED; urgency=low" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14331 [16:27:58] New patchset: Mark Bergsma; "pybal (1.02) precise; urgency=low" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14332 [16:30:07] RECOVERY - Host ms-be5 is UP: PING OK - Packet loss = 0%, RTA = 0.50 ms [16:30:52] New patchset: Mark Bergsma; "Make init script wait 2 seconds for PyBal to stop" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14334 [16:30:53] New patchset: Mark Bergsma; "pybal (1.02~2.gbpa6789a) UNRELEASED; urgency=low" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14335 
[16:30:54] New patchset: Mark Bergsma; "pybal (1.02) precise; urgency=low" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14336 [16:31:01] PROBLEM - Host mw1002 is DOWN: PING CRITICAL - Packet loss = 100% [16:31:21] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14334 [16:31:45] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14335 [16:32:08] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14336 [16:32:28] Change abandoned: Mark Bergsma; "(no reason)" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14318 [16:32:43] Change abandoned: Mark Bergsma; "(no reason)" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14319 [16:32:53] Change abandoned: Mark Bergsma; "(no reason)" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14320 [16:32:58] RECOVERY - Host mw1009 is UP: PING OK - Packet loss = 0%, RTA = 30.96 ms [16:33:03] Change abandoned: Mark Bergsma; "(no reason)" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14321 [16:33:11] Change abandoned: Mark Bergsma; "(no reason)" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14322 [16:33:21] Change abandoned: Mark Bergsma; "(no reason)" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14323 [16:33:32] Change abandoned: Mark Bergsma; "(no reason)" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14324 [16:33:41] Change abandoned: Mark Bergsma; "(no reason)" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14325 [16:33:52] RECOVERY - Host mw1002 is UP: PING OK - Packet loss = 0%, RTA = 30.94 ms [16:33:53] Change abandoned: Mark Bergsma; "(no reason)" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14326 [16:33:57] GOOD MORNING GERRIT!!! [16:34:01] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [16:34:03] Change abandoned: Mark Bergsma; "(no reason)" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14327 [16:34:10] PROBLEM - Puppet freshness on cp1017 is CRITICAL: Puppet has not run in the last 10 hours [16:34:10] PROBLEM - Puppet freshness on mw1102 is CRITICAL: Puppet has not run in the last 10 hours [16:34:13] Change abandoned: Mark Bergsma; "(no reason)" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14328 [16:34:22] Change abandoned: Mark Bergsma; "(no reason)" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14329 [16:34:33] Change abandoned: Mark Bergsma; "(no reason)" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14330 [16:34:43] Change abandoned: Mark Bergsma; "(no reason)" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14331 [16:34:55] Change abandoned: Mark Bergsma; "(no reason)" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14332 [16:38:40] RECOVERY - Host mw1011 is UP: PING OK - Packet loss = 0%, RTA = 31.21 ms [16:42:52] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:44:22] RECOVERY - Host mw1017 is UP: PING OK - Packet loss = 0%, RTA = 31.29 ms [16:44:26] !log powercycling and upgrading more mw10xx servers, 1017,1023,1025 ... 
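[Editor's sketch] The recurring "powercycle, dist-upgrade + kernel, reboot" routine being logged for the crashed mw10xx hosts ([16:27:26] and the !log just above) amounts to something like the following once a box answers ssh again; the host name is a placeholder and this is only a rough approximation of the procedure, not the exact commands ops ran.

```bash
# Rough sketch of the post-powercycle routine mentioned in the !log entries above.
# HOST is a placeholder; run after the machine is reachable again.
HOST=mw1017
ssh root@"$HOST" '
  apt-get update &&
  DEBIAN_FRONTEND=noninteractive apt-get -y dist-upgrade &&   # pulls in the new kernel too
  reboot
'
```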
[16:44:26] who knows what the status 'n' means in gerrit, i know that A = abandoned, M is merged but lower case n is not clear to me [16:44:34] Logged the message, Master [16:44:46] RobHalsell: was ms-be1003 one of the ones you put SSDs into? [16:45:22] <^demon> drdee_: Example? [16:45:51] changeid 3694 [16:46:25] i can also look myself D [16:46:30] :D [16:46:59] <^demon> drdee_: Ah ok. One sec and I'll have the file that lists all the statuses :) [16:47:07] oh cool! [16:48:08] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [16:48:33] <^demon> drdee_: http://code.google.com/p/gerrit/source/browse/gerrit-reviewdb/src/main/java/com/google/gerrit/reviewdb/client/Change.java [16:48:43] PROBLEM - Host db1015 is DOWN: PING CRITICAL - Packet loss = 100% [16:48:52] <^demon> drdee_: Line ~180-190 [16:48:58] yep got it [16:49:02] thanks! [16:49:06] <^demon> You're welcome :) [16:49:43] so n = new [16:49:57] it's interesting that use of lowercase = open, uppercase = closed [16:50:05] RECOVERY - Host mw1020 is UP: PING OK - Packet loss = 0%, RTA = 30.97 ms [16:50:13] RECOVERY - Host mw1023 is UP: PING OK - Packet loss = 0%, RTA = 30.94 ms [16:50:13] RECOVERY - Host mw1025 is UP: PING OK - Packet loss = 0%, RTA = 30.93 ms [16:50:17] !log powercycling downed db1015 [16:50:26] Logged the message, Master [16:50:30] <^demon> Platonides: Yeah, interesting is one word for it :) [16:50:41] yep i just read that as well [16:50:46] it's good to know [16:51:43] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.397 seconds [16:53:22] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [16:53:58] PROBLEM - SSH on mw1025 is CRITICAL: Connection refused [16:54:16] RECOVERY - Host db1015 is UP: PING OK - Packet loss = 0%, RTA = 30.95 ms [16:56:17] RECOVERY - SSH on mw1025 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [16:56:35] PROBLEM - NTP on db1015 is CRITICAL: NTP CRITICAL: Offset unknown [16:56:44] RECOVERY - Host mw1036 is UP: PING OK - Packet loss = 0%, RTA = 30.93 ms [16:58:14] RECOVERY - NTP on db1015 is OK: NTP OK: Offset 0.06047487259 secs [17:14:47] !log powercycling db1013 [17:14:55] Logged the message, Master [17:15:24] New patchset: Bhartshorne; "adding eqiad ms-be hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14337 [17:15:56] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14337 [17:17:18] New patchset: preilly; "add more opera mini IPs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14163 [17:17:45] notpeter: https://gerrit.wikimedia.org/r/14338 [17:17:51] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14163 [17:18:47] RECOVERY - Host db1013 is UP: PING OK - Packet loss = 0%, RTA = 30.94 ms [17:20:30] LeslieCarr: actually can you please merge this for me https://gerrit.wikimedia.org/r/#/c/14338/ ? [17:20:40] I don't think notpeter is actually around [17:20:51] back [17:22:08] !log powercycling db1027,db1028 [17:22:17] Logged the message, Master [17:22:42] preilly checking it out [17:22:52] remind me [17:22:55] how does one create pcache tables? [17:23:14] i'm guessing you need https://gerrit.wikimedia.org/r/#/c/14163/2 as well ? 
[17:23:20] or you need to rebase 14338 [17:25:05] LeslieCarr: yes [17:25:05] RECOVERY - Host db1027 is UP: PING OK - Packet loss = 0%, RTA = 31.11 ms [17:25:05] PROBLEM - Host mw1134 is DOWN: PING CRITICAL - Packet loss = 100% [17:25:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:26:26] RECOVERY - Host db1028 is UP: PING OK - Packet loss = 0%, RTA = 30.94 ms [17:26:33] !log powercycling db1009,db1010 [17:26:42] Logged the message, Master [17:27:10] preilly: one message/change needed in 14163 [17:27:34] LeslieCarr: what is that? [17:27:48] made an inline comment, have one ip range overlapping [17:28:07] LeslieCarr: I see the virt hosts were cabled, have they been networked yet? [17:28:33] Ryan_Lane: the network is set up, however chris said that they were cabled backwards and is fixing them [17:28:38] ugh [17:28:43] heh [17:28:45] he's there? [17:28:55] ah. right. it's thursday [17:29:01] thanks [17:29:17] the network ports are up and ready whenever the cabling is fixed [17:29:17] RECOVERY - Host db1009 is UP: PING OK - Packet loss = 0%, RTA = 30.97 ms [17:30:08] ugh. my wikimania flight is monday [17:30:38] RECOVERY - Host db1010 is UP: PING OK - Packet loss = 0%, RTA = 30.89 ms [17:30:47] New review: Lcarr; "issue fixed in https://gerrit.wikimedia.org/r/#/c/14338/" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/14163 [17:30:49] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14163 [17:30:52] mine too Ryan_Lane [17:30:56] !log more powercycling and upgrading: mw1036, mw1134, mw1043 .. [17:31:03] when's yours ? [17:31:04] Logged the message, Master [17:31:25] departs at 9:48 am [17:31:43] they really need to start booking my flights later if they don't want me taking cabs [17:31:57] we're on the same flight [17:32:13] cool [17:32:14] considering traffic, if you're not getting there earlier, a cab to bart is probably your best bet [17:32:33] if I'm going to cab to bart, I may as well take it the entire way [17:33:06] You should buy a bike :) [17:33:24] Or not live miles away from any form of rail :) [17:33:35] Damianz: if I left a bike locked up for a week, for sure it would be stolen by the time I got back [17:33:48] guys, if you have the time for 1 or 2 mw servers each or so, powercycle and dist-upgrade, would be nice, i can't stay that much longer but there are lots to go and they all went down within the last couple days [17:33:52] There is that... amount of bikes I've had nicked :( [17:33:53] or this city's public transit could stop sucking [17:34:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.151 seconds [17:34:28] There are areas where it doesn't suck, you could live there instead :) But I agree it's insane that people in Orinda get to downtown and the airport faster than you [17:34:54] RoanKattouw: rent is also way higher in all of those areas too [17:35:02] Yeah, you pay for it [17:36:20] RECOVERY - Host mw1134 is UP: PING OK - Packet loss = 0%, RTA = 30.90 ms [17:37:41] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14338 [17:37:41] RECOVERY - Host mw1043 is UP: PING OK - Packet loss = 0%, RTA = 30.90 ms [17:38:26] RECOVERY - Host mw1136 is UP: PING WARNING - Packet loss = 73%, RTA = 30.88 ms [17:39:27] Hah, Embarcadero BART has a self-service bike parking station, 3 cents per hour, maximum duration 10 days [17:39:38] when did that start ? 
[17:39:51] I've known there was /something/ like that at Embarcadero for a while [17:39:57] But I never bothered to look up the details [17:40:32] Oh you have to like register and get a card with a $20 initial balance [17:40:54] So it's not exactly easy to start using but it could work for commuters [17:41:03] !log powercycling db1048, mw1136, mw1046 [17:41:06] so it's not on clipper, that's weird [17:41:13] Logged the message, Master [17:41:25] No, it's a separate system called BikeLink [17:41:45] http://bartbikestation.com/getstarted.php [17:43:22] Oh, meh, but no overnight [17:43:23] RECOVERY - Host db1048 is UP: PING OK - Packet loss = 0%, RTA = 30.93 ms [17:43:59] RECOVERY - Host mw1142 is UP: PING OK - Packet loss = 0%, RTA = 30.93 ms [17:43:59] RECOVERY - Host mw1148 is UP: PING OK - Packet loss = 0%, RTA = 30.89 ms [17:47:17] PROBLEM - mysqld processes on db1048 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [17:48:02] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [17:48:11] RECOVERY - Host mw1046 is UP: PING OK - Packet loss = 0%, RTA = 31.13 ms [17:50:22] preilly: all merged [17:50:47] New patchset: preilly; "Mobile default for sibling projects for wikiquote, wikibooks and wikiversity" [operations/debs/squid] (master) - https://gerrit.wikimedia.org/r/14341 [17:51:33] LeslieCarr: thanks [17:51:40] LeslieCarr: can you merge this one now: https://gerrit.wikimedia.org/r/#/c/14341/ [17:53:19] LeslieCarr: hi, can you check out the config for c3-pmtpa ports 11-13 when you get a chance? dhcp requests don't seem to be making it thru to brewster [17:53:31] sure [17:53:43] preilly first, then binasher [17:54:18] Change merged: Asher; [operations/debs/squid] (master) - https://gerrit.wikimedia.org/r/14341 [17:54:43] oh asher got that [17:55:02] now to actually compile it [17:55:44] LeslieCarr: is the puppet change merged and live on the cache boxes? [17:55:45] binasher: pc11-pc13 are supposed to be internal, right ? [17:56:22] LeslieCarr: just clarifying because one of the carriers isn't seeing a change [17:56:31] pc1-3, yep. well, whatever vlan db's are on, i think that's internal? [17:56:35] preilly: yep, and did a banadm [17:56:46] LeslieCarr: okay great thanks [17:57:12] binasher: yep they're on internal [17:58:24] anything else that might effect dhcp requests? [17:59:20] New patchset: Asher; "build of redirector as of https://gerrit.wikimedia.org/r/#/c/14341/" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14342 [17:59:38] these are just an addition to asw-d-pmtpa , so no new real config on those switches [17:59:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14342 [18:00:11] um, i can do a monitoring if you want to do a reboot of a machine ? [18:00:50] ok, just a sec [18:01:09] let me know which machine (and it's mac address plz) [18:02:26] LeslieCarr: it'll be pc1 via 88:43:E1:C2:4C:AA [18:02:37] thanks, let me set up the monitor [18:04:29] just powercyclyed it, but it'll be a while before it gets to the pxe boot stage [18:05:31] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14342 [18:08:15] cool, watching bootp forwarding now... [18:08:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:10:40] LeslieCarr: did you see anything? 
[18:11:09] nope [18:12:51] its trying again now [18:13:14] 88:43:e1:c2:4c:aa [18:15:52] doing tcpdump now [18:18:56] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.019 seconds [18:22:57] LeslieCarr: just did another pxe attempt, catch anything? [18:24:07] nada, but my flagging bootp doesn't seem to actually be flagging anything - sigh, i am guessing it's because it only flags if it gets to the RE, and it's all being done in the forwarding plane [18:24:10] at least my best guess [18:26:12] hmm.. can you verify that some packets are being sent from pc1's switch port? [18:26:31] yeah, one minute [18:26:57] let me turn off lldp for that port as well [18:28:36] ok, there's some broadcasty arp traffic going into that port now, nothing has yet come out [18:29:56] it should be sending requests again right now [18:30:33] or so says the console . . .*spinny cursor* [18:32:39] what's the issue? [18:32:45] i have seen 0 bytes coming into that interface [18:33:09] so i don't think it's actually sending the request (that or it's miscabled and i'm checking out the wrong interface [18:34:03] mark: pc1 isn't getting dhcp -- it's one of the new ciscos [18:34:10] mark: i'm trying to pxe boot a cisco server, and it appears to be doing a standard broadcom pxe boot request on the console, but no requests make it to brewster [18:34:38] maybe it is miscabled [18:35:01] compare MACs? [18:38:20] ah... [18:38:31] 6/0/10 i think is what it is in [18:38:57] doesn't explain why pc2 and pc3 weren't working, but at least now we can look in the right spot :) [18:39:10] * domas reacts to 'pc' [18:39:16] i haven't tried pc2 or 3, just 1 [18:39:21] * domas realizes pc2 is not pc002 [18:39:25] * domas disappears [18:39:39] RECOVERY - BGP status on cr2-eqiad is OK: OK: host 208.80.154.197, sessions up: 27, down: 0, shutdown: 1 [18:39:41] domas: it's like the pc jr. but without the chiclet keyboard. it's going to revolutionize business. [18:40:01] binasher: did you see, db40 filled up, caused July 5th fireworks [18:40:08] \o/ [18:40:30] yeah! fun [18:40:44] <^demon> Only fireworks people fire off on July 5th are the ones they forgot to fire off on the 4th. 
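[Editor's sketch] The capture being described above ([18:05]–[18:33]) boils down to watching for the client's DHCP/BOOTP broadcasts while the box retries PXE boot; something along these lines on the DHCP server (brewster), with the interface name assumed and the MAC taken from the log.

```bash
# Watch for the PXE client's DHCPDISCOVER on the DHCP server while it retries PXE boot.
# eth0 is an assumption; the MAC is the one quoted at [18:02:26].
# -e prints link-level headers so the source MAC is visible in each packet line.
tcpdump -n -e -i eth0 '(port 67 or port 68) and ether host 88:43:e1:c2:4c:aa'

# If the console says the NIC is sending but nothing shows up here, the request is being
# dropped (or the switch port is miscabled, as turned out to be the case above).
```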
[18:41:00] <^demon> db40 was just waiting for us to get back :) [18:41:10] :-) [18:43:07] for i in {0..255}; printf "CREATE TABLE pc%03d LIKE objectcache;" | mysql -h db40 parsercache [18:43:10] very good script [18:43:11] write it down [18:43:46] !log Built new pybal_1.03 package and inserted it into the precise-wikimedia APT repository [18:43:54] * Reedy finds a post-it note [18:43:55] Logged the message, Master [18:44:37] pc1 is going to replace db40, possibly joined by pc2-3 later [18:44:39] New patchset: Mark Bergsma; "Don't bailout on missing BGP next hops if no matching AF prefixes are configured" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14348 [18:44:39] New patchset: Mark Bergsma; "Add bgp-nexthop-ipv[46] examples in pybal.conf" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14349 [18:44:40] New patchset: Mark Bergsma; "Don't add AF defaults to peerings dict" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14350 [18:44:41] New patchset: Mark Bergsma; "Account for nonexisting AFs in BGPFailover.prefixes" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14351 [18:44:42] New patchset: Mark Bergsma; "pybal (1.03) precise; urgency=low" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14352 [18:44:53] ooh good call on the postit [18:45:05] is it going to run pmysql? [18:45:06] ergh [18:45:08] memsql? [18:45:19] you should use cassandra [18:45:22] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14348 [18:45:32] hmmm, or hdfs [18:45:33] memsql all the way! [18:45:46] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14349 [18:45:57] that corruption bug is awesome [18:46:12] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14350 [18:46:32] binasher: let's try this again ? [18:46:34] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14351 [18:46:58] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/14352 [18:46:59] actually i'm going to have enough memcache so the sqlbagostufff db doesn't really do anything but write stuff thats never read.. then i'll switch to the blackhole storage engine [18:48:01] right [18:48:04] the wikipedia way! [18:48:12] show banners all year round [18:48:34] LeslieCarr: it works now! [18:48:59] and i saw packets go in [18:49:04] magical when we have the right ports, eh? :) [18:49:30] hah, yep. thanks for tracking that down! [18:50:20] what is pc1 hardware? [18:50:29] it doesn't let me in! [18:50:30] :) [18:50:36] my 486 [18:51:01] TROLL DETECTED [18:51:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:51:05] oh my [18:51:12] it's still negotiating keys, that's why [18:51:12] mark: did you see the beautiful banner served to me today? [18:51:33] how would I see it if it was served to you? [18:51:40] I posted a link to screenshot [18:51:43] http://flux.defau.lt/wikiwhat.png [18:51:44] ;-) [18:52:34] pc1.. and now it appears the disks aren't installed in the right order either, wee [18:52:49] mark: beautiful, isn't it [18:52:53] you should tell zack that you love it [18:53:14] hmmm [18:53:26] does 679 count include eqiad? [18:53:39] i would hope so [18:53:47] i dunno what it's based on [18:53:51] I thought we had more ;) [18:53:56] me too [18:54:01] did someone change Varnish config recently? 
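[Editor's sketch] The one-liner quoted at [18:43] above is missing the loop keywords and never passes $i to printf, so as written it would not create the numbered tables. A corrected version that actually creates pc000–pc255 would look like this (db40 and parsercache taken from the log, and it assumes an existing objectcache table to clone).

```bash
# Corrected form of the [18:43] one-liner: create pc000..pc255 as clones of objectcache.
for i in {0..255}; do
    printf 'CREATE TABLE pc%03d LIKE objectcache;\n' "$i"
done | mysql -h db40 parsercache
```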
[18:54:07] !log Upgraded pybal on all precise LVS servers [18:54:15] Logged the message, Master [18:54:22] if only we had a public version control system [18:54:24] MaxSem: mobile ? [18:54:26] for our configuration files [18:54:32] LeslieCarr, yup [18:54:44] mark: why would anyone have that? thats invitation for hackers [18:54:47] stealing your passwords [18:54:52] yes, it was MaxSem [18:59:54] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.570 seconds [19:00:26] Hello - I can no longer run Git commands; I get the error message "Permission denied (publickey)." What's the best way to troubleshoot this? [19:00:57] PROBLEM - Puppet freshness on ms3 is CRITICAL: Puppet has not run in the last 10 hours [19:01:45] Yaron: make sure you got the right user name when trying to log in [19:03:04] The username's right, I think; I'm just trying to call commands like "git pull" in a directory that already existed. [19:03:23] cmjohnson1: are you around in pmtpa today? [19:03:40] MaxSem: is there a problem ? [19:03:54] Yaron: Shouldn't you get another error then? [19:04:04] I don't know. [19:04:11] LeslieCarr, Special:MobileOptions behaves strangely. we suspect cookie problems [19:05:02] hoo - I should have noted before, I re-generated my SSH key today, because the last one wasn't working either; but I updated the record of my public key on Wikimedia Labs. [19:05:10] Yaron: mhm "[...@homeserv FlaggedRevs]$ git pull" works fine [19:05:14] preilly: see above ? [19:05:42] Yaron: Try to ssh into bastion.wmflabs.org maybe [19:06:15] LeslieCarr: I see it [19:06:17] if that works your ssh stuff is correct [19:06:23] hoo - you mean, replace gerrit.wikimedia.org with that, in the URL? [19:06:44] MaxSem: it only passes specific cookies [19:06:55] MaxSem: because it varies on cookies [19:06:57] <^demon|away> Yaron: Also need to update your key in gerrit. It doesn't pull the keys from ldap [19:06:58] just run "ssh yourName@bastion.wmflabs.org" [19:07:04] <^demon|away> It's stupid and annoying and I hate it [19:07:14] without git, just to test the ssh [19:07:19] hoo - okay. [19:07:30] ^demon|away - I'll try that too; thanks. [19:08:07] hoo - okay, that worked... I mean, I'm logged in. [19:08:11] ^demon|away: sofixit? :) [19:08:24] <^demon|away> Nah, I'd much rather nag #gerrit about it :) [19:08:33] Yaron: Great, then follow demon's advice [19:08:50] and try a command you know to work [19:08:54] Okay, I'll do that now... [19:09:20] <^demon|away> Ryan_Lane: Last I heard, redoing the authz/authn stuff to actually pull data from LDAP on the fly (rather than copying fields over...I'm not joking) is on the roadmap. [19:09:25] <^demon|away> Don't know how committed anyone is to it though. [19:09:32] * Ryan_Lane nods [19:09:35] so [19:09:42] what hardware is pc1? [19:09:48] did someone just purge logs from blondel? [19:09:53] I understand that quite a few of you replied with jokes [19:10:32] but a question was probably valid [19:10:41] domas: cisco's [19:10:42] domas: eh? [19:10:45] pc1? [19:11:00] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [19:11:09] lesliecarr: those 300GB ones? [19:11:10] Aha, that worked! hoo, ^demon|away - thanks to both of you. [19:11:13] okie [19:11:23] <^demon|away> Yaron: Yay, glad you're fixed :) [19:11:27] bbl [19:11:31] LeslieCarr: can you please merge this change https://gerrit.wikimedia.org/r/#/c/14353/ [19:11:49] I've been fixed. 
[19:11:49] You're welcome, Yaron ;) [19:13:20] domas: pc1 is a cisco w/192GB of ram, 2 of the original 300GB sas drives for the os, and six 300GB intel 710 ssd's [19:13:58] which will be striped [19:14:08] !log *somebody* purged binary logs on blondel [19:14:16] Logged the message, Master [19:21:03] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [19:21:49] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14353 [19:26:41] New patchset: preilly; "add Saudi Telecom landing page for zero domain" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14355 [19:27:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14355 [19:27:17] LeslieCarr: ^^ [19:27:26] LeslieCarr: last time I bug you I promise [19:28:11] haha [19:28:30] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14355 [19:29:03] LeslieCarr: thanks so much [19:29:32] pushing it all out to the mobile caches now... [19:31:40] LeslieCarr: thanks again [19:32:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:33:57] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [19:38:18] LeslieCarr: networking doesn't seem to be working on virt6 [19:38:38] on the secondary interface ? [19:38:50] yes [19:39:03] eth1 is up, eth1.103 is up [19:39:12] eth1.103 is added to br103 [19:39:20] the vnet device is added to br103 [19:39:22] no networking [19:40:59] sigh, wonder if it's the one off again [19:41:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.997 seconds [19:42:02] Ryan_Lane: is it working now (in 30 more seconds) [19:42:07] ok [19:42:20] will this need to be fixed on the others too? [19:42:30] well if it's the one off error, then nope [19:42:48] one off error? [19:44:28] i got a list that said something like 5-10 when in fact it was ports 4-9 [19:44:37] so the first machine wouldn't be in the proper range [19:44:46] working now ? [19:44:58] ok, finally gotta run [19:46:17] New patchset: Mark Bergsma; "Initial implementation of a DNS monitor for PyBal" [operations/debs/pybal] (monitors/dns) - https://gerrit.wikimedia.org/r/14385 [19:46:17] New patchset: Mark Bergsma; "Add the DNS monitor" [operations/debs/pybal] (monitors/dns) - https://gerrit.wikimedia.org/r/14386 [19:47:59] :( [19:48:04] it's not working still [19:53:36] I see dhcp packets being sent out, but nothing received [20:05:36] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [20:15:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:24:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.488 seconds [20:27:20] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [20:37:41] PROBLEM - Puppet freshness on cp3002 is CRITICAL: Puppet has not run in the last 10 hours [20:37:41] PROBLEM - Puppet freshness on search31 is CRITICAL: Puppet has not run in the last 10 hours [20:38:22] Ryan_Lane: all working ? 
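For context, the eth1.103/br103 arrangement being debugged is a tagged sub-interface enslaved to a bridge that the instances' vnet tap devices also join; a rough illustration of that wiring (device and VLAN names are from the conversation, the commands themselves are only a sketch, not the actual puppet steps used):

    # Tagged sub-interface for VLAN 103 on the second NIC
    ip link add link eth1 name eth1.103 type vlan id 103
    ip link set eth1 up
    ip link set eth1.103 up
    # Bridge that carries both the tagged uplink and the instances' vnetN devices
    brctl addbr br103
    brctl addif br103 eth1.103
    ip link set br103 up

None of this passes traffic unless the switch port is trunking VLAN 103 as well.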
[20:38:37] after mark tagged the interfaces, yeah ;) [20:39:18] cmjohnson1: hey, i think there's been a systemic problem on rack c3 where the plugs are off by 1 -- my guess is that your count started from 1 instead of 0 on that ? (junipers start from 0, foundry's from 1) [20:39:26] ah, didn't realize you were using tagging on the second interfaces [20:39:47] PROBLEM - Puppet freshness on sq69 is CRITICAL: Puppet has not run in the last 10 hours [20:40:41] PROBLEM - Puppet freshness on search24 is CRITICAL: Puppet has not run in the last 10 hours [20:40:41] PROBLEM - Puppet freshness on search34 is CRITICAL: Puppet has not run in the last 10 hours [20:40:41] PROBLEM - Puppet freshness on strontium is CRITICAL: Puppet has not run in the last 10 hours [20:43:23] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/14040 [20:45:47] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [20:46:41] PROBLEM - Puppet freshness on search16 is CRITICAL: Puppet has not run in the last 10 hours [20:46:42] PROBLEM - Puppet freshness on search20 is CRITICAL: Puppet has not run in the last 10 hours [20:46:42] PROBLEM - Puppet freshness on search22 is CRITICAL: Puppet has not run in the last 10 hours [20:46:42] PROBLEM - Puppet freshness on search21 is CRITICAL: Puppet has not run in the last 10 hours [20:46:42] PROBLEM - Puppet freshness on search30 is CRITICAL: Puppet has not run in the last 10 hours [20:46:42] PROBLEM - Puppet freshness on search27 is CRITICAL: Puppet has not run in the last 10 hours [20:46:42] PROBLEM - Puppet freshness on search28 is CRITICAL: Puppet has not run in the last 10 hours [20:46:43] PROBLEM - Puppet freshness on search33 is CRITICAL: Puppet has not run in the last 10 hours [20:46:44] PROBLEM - Puppet freshness on search36 is CRITICAL: Puppet has not run in the last 10 hours [20:47:34] LeslieCarr: have to use tagging [20:47:59] the device is a bridge [20:48:06] the instances need to be on the vlan [20:49:41] PROBLEM - Puppet freshness on sq70 is CRITICAL: Puppet has not run in the last 10 hours [20:51:47] PROBLEM - Puppet freshness on search26 is CRITICAL: Puppet has not run in the last 10 hours [20:52:41] PROBLEM - Puppet freshness on search18 is CRITICAL: Puppet has not run in the last 10 hours [20:52:41] PROBLEM - Puppet freshness on sq67 is CRITICAL: Puppet has not run in the last 10 hours [20:52:41] PROBLEM - Puppet freshness on sq68 is CRITICAL: Puppet has not run in the last 10 hours [20:54:47] PROBLEM - Puppet freshness on cp3001 is CRITICAL: Puppet has not run in the last 10 hours [20:55:41] PROBLEM - Puppet freshness on search13 is CRITICAL: Puppet has not run in the last 10 hours [20:55:41] PROBLEM - Puppet freshness on search25 is CRITICAL: Puppet has not run in the last 10 hours [20:55:41] PROBLEM - Puppet freshness on search17 is CRITICAL: Puppet has not run in the last 10 hours [20:58:41] PROBLEM - Puppet freshness on search15 is CRITICAL: Puppet has not run in the last 10 hours [20:59:17] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:59:53] PROBLEM - Host mw1116 is DOWN: PING CRITICAL - Packet loss = 100% [21:00:20] anyone working on mw1116 ? 
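A quick way to confirm the tagging arrangement just described, from the host side, is to check bridge membership and the VLAN id; a small sketch using the names above:

    # Expect eth1.103 plus one vnetN entry per running instance
    brctl show br103
    # Confirm the sub-interface really carries VLAN id 103
    cat /proc/net/vlan/eth1.103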
[21:02:44] PROBLEM - Puppet freshness on search14 is CRITICAL: Puppet has not run in the last 10 hours [21:02:44] PROBLEM - Puppet freshness on search19 is CRITICAL: Puppet has not run in the last 10 hours [21:02:44] PROBLEM - Puppet freshness on search23 is CRITICAL: Puppet has not run in the last 10 hours [21:02:44] PROBLEM - Puppet freshness on search29 is CRITICAL: Puppet has not run in the last 10 hours [21:04:41] PROBLEM - Puppet freshness on arsenic is CRITICAL: Puppet has not run in the last 10 hours [21:06:29] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [21:07:41] PROBLEM - Puppet freshness on palladium is CRITICAL: Puppet has not run in the last 10 hours [21:08:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.541 seconds [21:09:28] !log added new ms-be pmtpa hosts to DNS [21:09:29] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [21:09:36] Logged the message, Master [21:15:04] !log powercycling unresponsive mw1116 [21:15:15] Logged the message, Mistress of the network gear. [21:21:29] RECOVERY - Host mw1116 is UP: PING OK - Packet loss = 0%, RTA = 31.11 ms [21:25:27] hey ^demon, question about gerrit ls-projects command, some repos are not returned, like wikimedia/orgchart, mediawiki/extensions/Contest and integration/testswarm is that because these repo's are private or something like that (assuming that such a thing as private exists in gerrit) [21:26:37] <^demon|away> Yes, it won't return repos that you don't have Read permissions on. [21:26:43] <^demon|away> But those 3 you should :\ [21:27:34] strange..... [21:27:39] <^demon|away> drdee_: You can view them in the UI though? [21:27:58] yep [21:28:02] <^demon|away> That's even more bizarre :\ [21:28:40] and it's always the same repo's that not returned [21:28:41] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [21:28:52] <^demon|away> Hrm. Wonder why. Is it just those 3? [21:30:02] maybe they are at the top of the list or something like that? [21:30:35] s/are/would be/ [21:30:47] shall i paste the output in pastebin? [21:31:04] <^demon|away> Yeah that'd be good [21:31:33] oh, and I don't see why Contest extension would need to be secret, anyway [21:31:53] 1 sec [21:32:04] <^demon|away> I don't know why any of those 3 would be. Everything in mediawiki/* has Read permissions for anons. [21:32:15] <^demon|away> integration/testswarm has Read explicitly granted to anons. [21:32:52] <^demon|away> Same with wikimedia/* [21:33:39] actually, it's only orgchart (blush blush) [21:35:10] that is missing [21:35:10] <^demon|away> Hmm. Still weird, since it should have the permissions. I'll explicitly grant Read on it, can't hurt. [21:42:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:50:09] i keep getting issues with snmptt crashing on neon --- i think i'm going to try upgrading it to precise [21:50:27] ^demon: what is your output when you run ssh -p 29418 gerrit.wikimedia.org gerrit ls-projects -d [21:51:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.002 seconds [21:51:59] New patchset: Bhartshorne; "adding entries for ms-be6-12. false entries (00:00:00) for 9-12." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/14422 [21:52:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14422 [21:52:59] ignore any neon pages [21:56:28] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14422 [21:58:41] PROBLEM - SSH on neon is CRITICAL: Connection refused [21:59:17] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [22:13:39] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [22:17:21] LeslieCarr: last week, you saw that the new C2100s (ms-be hosts) were spamming DHCP requests. [22:17:24] RECOVERY - MySQL disk space on neon is OK: DISK OK [22:17:26] did you do something to squash them? [22:17:33] RECOVERY - SSH on neon is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [22:17:35] oh i may have turned the ports down on a few... [22:17:44] I'm not seeing a lease request from ms-be7 or ms-be8. [22:17:49] checking... [22:17:51] (though ms-be6 seemes to be ok. [22:17:53] ) [22:18:09] PROBLEM - NTP on neon is CRITICAL: NTP CRITICAL: Offset unknown [22:18:33] it may also be that I have the MAC address wrong. [22:20:58] hrm [22:21:07] no, it looks like i don't have ms-be7 or ms-be8 configured [22:21:35] do you have a record of what ports they're supposed to be using? or do you need cmjohnson1 to look that up? [22:22:30] have a ticket …. need to find other machines on that patch panel [22:22:30] RECOVERY - NTP on neon is OK: NTP OK: Offset 0.03774738312 secs [22:23:43] oh weird [22:23:57] ah i see what happened [22:24:11] they were all labeled ms789 instead of ms-be789 [22:24:16] and yes, those two ports are disabled [22:24:38] ok. nice to have an explanation. [22:25:16] hey, I think one just got a lease! [22:25:19] ok, now they should be good [22:25:20] yay [22:25:22] ms-be7. [22:25:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:25:34] thanks! [22:27:08] ms-be8 hasn't yet, but I want to check something there. I'll ping again if it's still not working in a bit. [22:33:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.893 seconds [22:34:12] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [22:36:04] ms-be8 has net just fine, so \o/ [22:43:15] New patchset: preilly; "fix wgZeroDisableImages issue if NOT on Zero domain" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/14426 [22:44:42] New patchset: preilly; "fix wgZeroDisableImages issue if NOT on Zero domain" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/14426 [22:45:18] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [22:45:45] PROBLEM - SSH on neon is CRITICAL: Connection refused [22:46:43] Change merged: preilly; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/14426 [22:57:45] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [23:00:42] ^demon: ok. package is built. I just need to push it to the repo [23:01:02] <^demon> Ok. I'll send out the notice to wikitech and spam irc. [23:01:08] ^demon: so, what's the command I need to run after I update the package? 
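Back on the ms-be lease requests: once the switch ports are re-enabled, the usual check on the server side is to watch brewster's DHCP log and the wire directly; a sketch, with the log path and interface name as assumptions:

    # DISCOVER/REQUEST lines from dhcpd as the hosts retry PXE
    tail -f /var/log/syslog | grep -i dhcpd
    # Or capture the requests as they arrive; -e prints the client MACs
    tcpdump -n -e -i eth0 port 67 or port 68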
[23:01:18] <^demon> Had it in front of me, one sec. [23:01:24] I stop gerrit, then run the upgrade command, then start it, right? [23:01:30] PROBLEM - NTP on neon is CRITICAL: NTP CRITICAL: No response from NTP server [23:01:35] I need to make sure to stop it on formey too :) [23:02:52] <^demon> Right [23:03:21] <^demon> Upgrade command is `java -jar gerrit-whatever.war init -d /var/lib/gerrit2/review_site --no-auto-start` [23:03:22] New patchset: preilly; "fix subdomain for carrier" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14429 [23:03:56] New patchset: Lcarr; "moving neon to precise" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14430 [23:03:57] ok [23:04:25] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14429 [23:04:26] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14430 [23:04:43] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14430 [23:07:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:07:38] ^demon: ready to start? [23:07:43] <^demon> Ready when you are. [23:07:54] !log stopping gerrit on formey and disabling puppet [23:08:02] Logged the message, Master [23:08:22] !log stopping gerrit on manganese and disabling puppet [23:08:30] Logged the message, Master [23:09:06] Ryan_Lane: do you happen to know the IPv6 addresses of the Polish toolserver? [23:09:13] nope [23:09:23] !log upgrading gerrit on manganese [23:09:30] Logged the message, Master [23:10:27] ugh, this package is not great [23:10:33] why is sumanah a member of the gerrit2 group? [23:10:50] <^demon> B) Not sure. [23:10:53] ugh [23:10:53] <^demon> A) What's wrong? [23:11:01] it's not a system group? [23:11:17] <^demon> I don't know, it's not puppetized. [23:11:42] I'm disabling ldap on manganese temporarily [23:16:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.029 seconds [23:17:06] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [23:17:16] having to fix permissions and such [23:17:27] I need to fix this package later too [23:17:46] we should really not have a gerrit2 user in ldap. heh [23:18:36] PROBLEM - Puppet freshness on nfs2 is CRITICAL: Puppet has not run in the last 10 hours [23:18:51] <^demon> Yeah, that should be fixed. Will make deploying it to labs via puppet way easier. [23:21:02] !log updating database for gerrit [23:21:10] Logged the message, Master [23:21:17] ^demon: was a backup done? [23:21:24] <^demon> No. Whoops. [23:21:27] heh [23:21:29] <^demon> Totally should've done that [23:21:32] can you do that really quicl? [23:21:36] I didn't start yet [23:21:40] <^demon> On it. [23:21:43] thanks [23:22:26] <^demon> Oh, ldap's down, I can't snatch the password from secure.config :p [23:22:34] lemme bring that back [23:22:39] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours [23:22:58] ^demon: ok [23:23:12] should work now [23:24:35] yes [23:24:35] <^demon> Dumped. [23:24:38] cool [23:26:50] ok. it's updating the db [23:27:24] <^demon> Just 3 schema changes this time. [23:27:26] it's done [23:27:30] starting gerrit [23:27:46] hm [23:27:47] failed [23:28:26] <^demon> Ouch, why? 
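Putting the steps discussed here in order, the upgrade sequence is roughly the following; only the init invocation is the one quoted above, while the backup command, war path, service script and database name are assumptions:

    # Dump the review database before touching the schema (name/credentials assumed)
    mysqldump -u gerrit -p reviewdb > reviewdb-pre-upgrade.sql
    # Stop gerrit (on formey and manganese), then upgrade the site in place
    /etc/init.d/gerrit stop
    java -jar gerrit.war init -d /var/lib/gerrit2/review_site --no-auto-start
    # Start it back up and watch the error log for startup failures
    /etc/init.d/gerrit start
    tail -f /var/lib/gerrit2/review_site/logs/error_log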
[23:29:06] lemme see [23:29:56] <^demon> Missing password for database.user or something? [23:30:06] <^demon> Caused by: java.sql.SQLException: Access denied for user 'gerrit'@'manganese.wikimedia.org' (using password: NO) [23:30:13] weird [23:30:38] it's there.... [23:30:57] oh [23:31:01] it probably can't read it [23:31:41] <^demon> Ah, that'd do it [23:31:49] hm [23:31:53] still failed [23:32:42] <^demon> I got http://p.defau.lt/?AbiGVXUbZLQeSNKe8gV_lw at the end of error_log [23:32:53] yeah [23:32:57] no clue what that means [23:33:28] <^demon> Trying to find out, one sec. [23:35:29] probably from: [repository "*"] [23:35:29] ownerGroup = Project Creators [23:35:31] ? [23:35:47] <^demon> That doesn't sound like it'd cause it, but it's harmless to remove. [23:35:49] <^demon> Can try that. [23:36:14] that was it [23:36:25] now, why that was it, I have no clue [23:36:27] Ryan_Lane, where is that line? [23:36:32] in the config [23:37:08] <^demon> Ryan_Lane: I'll dig into why. It only affects people creating new projects (ie: me, really). [23:37:08] oh, a real file? [23:37:17] wow. it's really fucking slow [23:37:32] <^demon> Caches are stale. [23:37:50] <^demon> s/stale/shutting down gerrit flushes them/ [23:38:20] !log upgrading gerrit on formey [23:38:21] I thought it was one of the files in a hidden ref [23:38:28] Logged the message, Master [23:38:53] <^demon> Platonides: No, gerrit.config. It's something I added recently. [23:39:35] it's way slower than it should be ;) [23:39:56] when I try to go to the groups screen [23:39:58] <^demon> Platonides: https://gerrit.wikimedia.org/r/Documentation/config-gerrit.html#_a_id_repository_a_section_repository, if you're interested. [23:40:03] there we go [23:40:22] <^demon> Groups was painfully slow in 2.3 as well. Once caches fill the first time it's *slightly* better. [23:40:40] yeah [23:40:48] heh [23:40:53] project creators doesn't exist [23:41:00] it's project owners [23:41:09] <^demon> https://gerrit.wikimedia.org/r/#/admin/groups/119,members [23:41:15] <^demon> It's supposed to be referring to that. [23:41:25] <^demon> Perhaps I changed the group name and forgot to update config? [23:41:35] lemme try with that group [23:41:37] on formey [23:42:33] I think it very much dislikes the & [23:43:06] <^demon> Ahhh. Could be. [23:43:10] ^demon: can you rename it using and rather than & ? [23:43:26] <^demon> Done. [23:44:22] that worked [23:44:22] <^demon> gerrit.config is a standard .git/config-style file. Perhaps &'s need to be escaped or something. [23:44:34] <^demon> I'll fix it in puppet. [23:44:59] can you remove that workaround too? [23:45:11] <^demon> Yeah I'll do that while I'm there. [23:45:20] and fix the whitespace for that ownerGroup line? :) [23:45:36] !log restarting gerrit on manganese [23:45:43] Logged the message, Master [23:49:48] ^demon: make sure to fix the package version in the manifests too :) [23:50:00] should now be ensure => "2.4.2-1" [23:50:05] <^demon> Caught me just before I pushed :p [23:50:06] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:53:25] New patchset: Demon; "Couple of fixes for Gerrit:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14432 [23:53:28] <^demon> Ryan_Lane: ^ [23:53:59] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14432 [23:54:04] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14432 [23:54:55] !log force running puppet on formey and manganese, since a config change is involved, it's going to restart [23:55:04] Logged the message, Master [23:56:12] <^demon> Ryan_Lane: Right after I tell everyone it's up :p [23:56:18] heh [23:56:50] puppet is slow as hell right now, too, it seems
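For reference, the setting that kept Gerrit from starting lives in gerrit.config, which uses git-config syntax (see the documentation link above); the working form looks roughly like this, with the group name as a placeholder since it has to match an existing Gerrit group exactly and an ampersand in the real name is what broke startup here:

    # etc/gerrit.config excerpt (path under the review site is assumed)
    [repository "*"]
        ownerGroup = Some Group Name   ; placeholder, must name an existing group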
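And the version pin Ryan asks for above would be a one-line change in the gerrit manifest; a minimal puppet sketch (the resource title and package name are assumptions, the version string is the one quoted):

    # Keep the installed gerrit package at the freshly built release
    package { 'gerrit':
        ensure => '2.4.2-1',
    }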