[02:06:30] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 3610 MB (3% inode=99%):
[02:16:56] !log LocalisationUpdate completed (1.24wmf20) at 2014-09-21 02:16:56+00:00
[02:17:06] Logged the message, Master
[02:20:49] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0]
[02:26:49] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds
[02:29:52] !log LocalisationUpdate completed (1.24wmf21) at 2014-09-21 02:29:51+00:00
[02:29:58] Logged the message, Master
[02:36:04] (PS1) Brian Wolff: Increase account creation throttle on enwiki for Cochrane colloquium. [mediawiki-config] - https://gerrit.wikimedia.org/r/161766 (https://bugzilla.wikimedia.org/71090)
[02:37:52] If there are any folks around - User:Jmh649 apparently wants the account throttle on enwiki increased for a conference happening right now (see ---^)
[02:41:37] !log LocalisationUpdate completed (1.24wmf22) at 2014-09-21 02:41:36+00:00
[02:41:42] Logged the message, Master
[02:44:09] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[02:45:43] (CR) Ori.livneh: [C: 2] Increase account creation throttle on enwiki for Cochrane colloquium. [mediawiki-config] - https://gerrit.wikimedia.org/r/161766 (https://bugzilla.wikimedia.org/71090) (owner: Brian Wolff)
[02:45:48] (Merged) jenkins-bot: Increase account creation throttle on enwiki for Cochrane colloquium. [mediawiki-config] - https://gerrit.wikimedia.org/r/161766 (https://bugzilla.wikimedia.org/71090) (owner: Brian Wolff)
[02:46:01] Woo, ori
[02:46:43] !log ori Synchronized wmf-config/throttle.php: I7bb42b49a: Increase account creation throttle on enwiki for Cochrane colloquium. (duration: 00m 07s)
[02:46:49] Logged the message, Master
[03:01:00] RECOVERY - Disk space on virt0 is OK: DISK OK
[03:06:53] bawolff: https://bugzilla.wikimedia.org/25000
[03:08:02] Or people could just ask more than 20 minutes in advance ;)
[03:08:37] But yeah, I could see having some UI that users could fill out themselves would be nice
[03:08:45] as a low priority-ish feature
[03:09:20] Low priority?
[03:09:31] Requests involving sysadmins are expensive.
[03:09:40] All things being relative
[03:09:56] I guess it depends how common this sort of thing is
[03:10:06] * bawolff usually isn't involved in these types of requests
[03:11:45] On a totally unrelated note, I wonder if instead of totally blocking tor, it would be ok if we just set a really low account creation throttle for all of tor
[03:13:37] and disable anonymous editing?
[03:13:48] The Wikimedia Foundation is considering becoming a Tor relay.
[03:16:07] !log labsdb1001 mysqld restarted in gdb; crash loop with a labs user's table
[03:16:13] Logged the message, Master
[03:19:09] Carmela: Really?
[03:19:31] That would be kind of ironic. We'd relay your traffic, but not let you edit the site
[03:20:17] Krenair: Yeah. The main concern is people being unblockable when using tor. If we limit it to 2 account creations a day, well then at worst we'll have some vandal just ruin it for all the good tor users
[03:34:33] bawolff: https://gerrit.wikimedia.org/r/140948
[03:34:36] Took me a while to find it.
[03:35:08] That makes me happy though
[03:37:31] !log LocalisationUpdate ResourceLoader cache refresh completed at Sun Sep 21 03:37:31 UTC 2014 (duration 37m 30s)
[03:37:38] Logged the message, Master
[03:58:09] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected
[06:28:10] PROBLEM - puppet last run on mw1177 is CRITICAL: CRITICAL: Epic puppet fail
[06:28:19] PROBLEM - puppet last run on labcontrol2001 is CRITICAL: CRITICAL: Epic puppet fail
[06:28:22] PROBLEM - puppet last run on mw1054 is CRITICAL: CRITICAL: Epic puppet fail
[06:28:29] PROBLEM - puppet last run on gallium is CRITICAL: CRITICAL: Epic puppet fail
[06:28:29] PROBLEM - puppet last run on cp4014 is CRITICAL: CRITICAL: Epic puppet fail
[06:28:40] PROBLEM - puppet last run on db1021 is CRITICAL: CRITICAL: Epic puppet fail
[06:30:49] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: Puppet has 5 failures
[06:30:59] PROBLEM - puppet last run on cp1061 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:59] PROBLEM - puppet last run on db1023 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:59] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:09] PROBLEM - puppet last run on mw1172 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:19] PROBLEM - puppet last run on ms-fe1004 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:20] PROBLEM - puppet last run on search1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:20] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:20] PROBLEM - puppet last run on iron is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:20] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:20] PROBLEM - puppet last run on db1040 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:20] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:21] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:39] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:39] PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:40] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:40] PROBLEM - puppet last run on search1007 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:40] PROBLEM - puppet last run on search1018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:40] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:45:20] RECOVERY - puppet last run on ms-fe1004 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[06:45:29] RECOVERY - puppet last run on search1001 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[06:45:31] RECOVERY - puppet last run on iron is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[06:45:31] RECOVERY - puppet last run on db1040 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[06:45:39] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[06:45:49] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[06:45:49] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures
[06:45:50] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[06:45:50] RECOVERY - puppet last run on search1018 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[06:45:59] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[06:46:00] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[06:46:09] RECOVERY - puppet last run on cp1061 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[06:46:09] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[06:46:29] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[06:46:39] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[06:46:40] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[06:46:41] RECOVERY - puppet last run on gallium is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[06:46:49] RECOVERY - puppet last run on search1007 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[06:46:50] RECOVERY - puppet last run on db1021 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[06:47:20] RECOVERY - puppet last run on db1023 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[06:47:20] RECOVERY - puppet last run on mw1172 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[06:47:29] RECOVERY - puppet last run on mw1177 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[06:47:32] RECOVERY - puppet last run on labcontrol2001 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[06:47:39] RECOVERY - puppet last run on mw1054 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[06:47:50] RECOVERY - puppet last run on cp4014 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[07:03:35] (CR) Aaron Schulz: "Ahh, right, I see." [puppet] - https://gerrit.wikimedia.org/r/161473 (owner: GWicke)
[11:02:20] PROBLEM - puppet last run on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:04:33] PROBLEM - puppet last run on mw1007 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:10:50] (PS1) Gerrit Patch Uploader: Add '*.beeldbank.cultureelerfgoed.nl' to wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/161779 (https://bugzilla.wikimedia.org/70840)
[12:10:57] (CR) Gerrit Patch Uploader: "This commit was uploaded using the Gerrit Patch Uploader [1]." [mediawiki-config] - https://gerrit.wikimedia.org/r/161779 (https://bugzilla.wikimedia.org/70840) (owner: Gerrit Patch Uploader)
[12:22:53] RECOVERY - puppet last run on mw1007 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[12:24:47] (PS4) Yuvipanda: nagios_common: Refactor custom command definitions [puppet] - https://gerrit.wikimedia.org/r/161478
[12:25:33] (CR) jenkins-bot: [V: -1] nagios_common: Refactor custom command definitions [puppet] - https://gerrit.wikimedia.org/r/161478 (owner: Yuvipanda)
[12:25:50] (PS5) Yuvipanda: nagios_common: Refactor custom command definitions [puppet] - https://gerrit.wikimedia.org/r/161478
[12:28:45] PROBLEM - puppet last run on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:31:14] (PS6) Yuvipanda: [WIP] nagios_common: Refactor custom command definitions [puppet] - https://gerrit.wikimedia.org/r/161478
[12:51:13] (PS7) Yuvipanda: [WIP] nagios_common: Refactor custom command definitions [puppet] - https://gerrit.wikimedia.org/r/161478
[12:53:22] (PS8) Yuvipanda: [WIP] nagios_common: Refactor custom command definitions [puppet] - https://gerrit.wikimedia.org/r/161478
[13:04:14] PROBLEM - puppet last run on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:44:45] (PS4) Faidon Liambotis: Allocate sandbox vlans for codfw and ulsfo [dns] - https://gerrit.wikimedia.org/r/158636 (owner: Mark Bergsma)
[15:44:47] (PS3) Faidon Liambotis: Allocate IPv4/IPv6 for RIPE Atlas codfw/ulsfo [dns] - https://gerrit.wikimedia.org/r/158939
[15:44:49] (PS1) Faidon Liambotis: Renumber sandbox1-b-eqiad IPv6 to match convention [dns] - https://gerrit.wikimedia.org/r/161787
[15:46:25] (CR) Faidon Liambotis: [C: 2] Allocate sandbox vlans for codfw and ulsfo [dns] - https://gerrit.wikimedia.org/r/158636 (owner: Mark Bergsma)
[15:47:26] (CR) Faidon Liambotis: [C: 2] Renumber sandbox1-b-eqiad IPv6 to match convention [dns] - https://gerrit.wikimedia.org/r/161787 (owner: Faidon Liambotis)
[15:47:38] (CR) Faidon Liambotis: [C: 2] Allocate IPv4/IPv6 for RIPE Atlas codfw/ulsfo [dns] - https://gerrit.wikimedia.org/r/158939 (owner: Faidon Liambotis)
[18:02:26] PROBLEM - very high load average likely xfs on ms-be1008 is CRITICAL: CRITICAL - load average: 200.05, 115.05, 58.71
[19:39:12] PROBLEM - puppet last run on db2017 is CRITICAL: CRITICAL: Epic puppet fail
[19:57:27] RECOVERY - puppet last run on db2017 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[21:10:38] PROBLEM - Host ms-be1008 is DOWN: PING CRITICAL - Packet loss = 100%
[21:39:07] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0]
[21:44:18] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds
[21:59:28] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[22:22:07] RECOVERY - Host ms-be1008 is UP: PING OK - Packet loss = 0%, RTA = 1.32 ms
[22:22:47] RECOVERY - very high load average likely xfs on ms-be1008 is OK: OK - load average: 11.38, 3.24, 1.11
[22:23:47] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[22:43:39] !log ms-be1008 overloaded starting 18:00:24 UTC, syslog says "BUG: soft lockup - CPU#1 stuck for 22s! [kworker/1:1:2196]". machine became unresponsive at 21:35, coinciding with a spike of 5xxs, lasting until Coren powercycled it at 22:10.
[22:43:46] Logged the message, Master
[23:42:52] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected