[00:00:16] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected [00:00:23] it looks sane for now. there are a few things i don't love (the fact that the variable is named 'exceptions' but is now used to contain more than that) and i'd like it if it contained timing info, like the served by message in the html output [00:00:49] but saying that it could be better isn't tantamount to saying that there's anything harmful about it as-is [00:01:04] if you think it'd be useful to have that for debugging, i think it's okay to go [00:01:26] It's not in master, this is going to be reverted soon after [00:01:42] For proper debugging in the long term it should indeed contain both, and not just in case of an exception [00:01:55] specifically only one exception [00:02:05] * ori nods [00:02:22] you can get this now, though [00:02:59] Oh? [00:03:03] hmmmm, i guess not -- the headers only identify the varnishes [00:03:12] yes, been there [00:03:23] (see backscroll if you like :) ) [00:03:29] I thought it'd be there too [00:03:36] "X-Served-By: mw1041" would be nice and can be made generic [00:03:44] Yep [00:03:51] we could drop it from html/api output [00:04:01] or not, but either way [00:04:20] !log krinkle synchronized php-1.24wmf4/includes/resourceloader/ResourceLoader.php 'I718fcf23d' [00:04:24] Logged the message, Master [00:04:39] Served by: mw1151 [00:04:42] Ok, that's one [00:05:19] https://github.com/wikimedia/operations-puppet/blob/a15c8062d2/manifests/role/cache.pp#L358 [00:05:29] There are four [00:06:00] what's one? [00:06:00] Yep, always that one [00:06:02] MatmaRex: [00:06:11] always that one for what? [00:06:17] for all RL urls, or some particular URL, or? [00:06:26] ori: For some reason, on requests to that apache, ext.gadget.* does not exist [00:06:29] state missing [00:06:34] bug 65424 [00:06:43] ori: you chimed in a bit earlier as well [00:06:45] yes, so only requests to that particular apache? [00:06:51] So it would seem, yes [00:07:24] but reqs *do* get routed to the other bits app servers [00:07:29] and the other bits app servers don't manifest the issue [00:07:30] correct? [00:07:33] Indeed [00:07:37] huh. [00:07:38] I busted cache for 20 requests in a row [00:07:49] the ones with the error were mw1151, and the others are fine [00:07:52] ori: this is filed as bug https://bugzilla.wikimedia.org/show_bug.cgi?id=65424 now fwiw [00:07:54] greg-g: ^ [00:08:06] https://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&m=cpu_report&s=by+name&c=Bits+application+servers+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 [00:08:16] cpu usage on that node is up [00:08:22] abnormally so, relative to the others in its group [00:08:25] * ori sshs in to see [00:09:33] Krinkle: it'd be useful to grep various log files on fluorine for that hostname [00:09:35] PROBLEM - Puppet freshness on maerlant is CRITICAL: Last successful Puppet run was Fri May 16 06:03:33 2014 [00:10:23] ori: I grepped resourceloader.log on fluorine, nothing [00:10:24] for any hosts [00:10:28] it has disk problems [00:10:33] Krinkle: $ cat resourceloader.log | grep -v 'New definition hash' | grep -v 'request for private module' [00:11:28] !log timo identified weird RL responses as all originating in mw1151; dmesg shows ata1 disk troubles: "failed command: READ DMA EXT", "sd 0:0:0:0: [sda] Add.
Sense: Unrecovered read error - auto reallocate failed" [00:11:33] Logged the message, Master [00:11:34] i'm going to depool it [00:12:37] OK [00:13:04] I'm doing a quick check via eval.php on that local node to see what the state is [00:14:28] > return Gadget::loadStructuredList(); [00:14:28] bool(false) [00:14:36] i don't see it in pybal anywhere [00:14:40] (unlike when doing from tin or another apache, where it returns an object) [00:16:08] !log On mw1151, Gadget::loadStructuredList() returns false, memcached has no value for 'enwiki:gadgets-definition:7' and is unable to store it. [00:16:12] Logged the message, Master [00:16:30] (that's probably the same as you said but in words I understand) [00:16:39] and sans a few abstraction layers [00:16:48] you guys should totally paste that on the bug [00:16:54] or i might as well if i'm already here [00:17:01] Roger [00:17:12] doing [00:19:05] OK, i don't want to deploy a VCL change, and it looks like you can't simply tell varnish to stop forwarding requests to a specific backend. The way to do that is apparently to just make the backend stop responding to health probes [00:19:21] so I'm just going to stop apache and disable puppet so it doesn't get restarted [00:20:07] !log stopping apache and disabling puppet on mw1151 so that varnish stops forwarding reqs to it [00:20:11] Logged the message, Master [00:20:24] ori: You could also depool it in Pybal [00:20:33] RoanKattouw: it's not listed anywhere [00:20:40] do i need to *add* an entry for it? [00:20:47] there is no entry marking it as enabled [00:21:08] Meh [00:21:17] Is there no LVS for this? [00:21:23] i don't think so [00:21:28] I guess it might be load-balanced by Varnish directly? [00:21:31] Bo [00:21:33] o [00:21:34] yep [00:21:39] Consistency, we've heard of it [00:21:54] i disabled apache, i'm surprised icinga-wm isn't complaining [00:22:17] <^demon|away> RoanKattouw: Why isn't everything behind lvs? [00:22:18] <^demon|away> lvs rocks. [00:22:48] ori: mw1151 in logstash -- https://logstash.wikimedia.org/#dashboard/temp/5AvpxWezQzqHBamrN6fONQ [00:23:16] PROBLEM - Apache HTTP on mw1151 is CRITICAL: Connection refused [00:23:18] Oh, interesting [00:23:41] So the fact that resourceloader.log was "only" showing module definition invalidation would've been a clue [00:23:43] bd808: thx [00:23:48] !log varnishadm on cp1056 confirms that varnish recognizes mw1151 as "sick" [00:23:50] i'll ack the alert [00:23:53] Logged the message, Master [00:24:03] RoanKattouw: ori: .. because it was unable to store the timestamp in the definition hash mtime [00:24:03] ^demon|away: I don't know, IMO it should be, in general [00:24:11] so it kept incrementing [00:24:24] Unable how? [00:24:32] RoanKattouw: because mw1151 has no disk / memcached abilities [00:24:41] I don't know how memcached is connected to disk [00:24:42] but somehow it is [00:24:47] Any host with 1.7M log events in an hour is probably sick [00:24:51] Whoa memcached errors all over [00:25:08] "memcached-serious" [00:25:10] nice log group [00:25:17] is there a not so serious? I suppose there is [00:25:30] ACKNOWLEDGEMENT - Apache HTTP on mw1151 is CRITICAL: Connection refused ori.livneh It was serving garbage due to disk issues. I disabled Puppet and stopped Apache. [00:25:45] memcached-serious is kind of spammy. It logs every cache miss. [00:26:05] * bd808 was just told that he has to leave for dinner [00:26:38] bd808: Hm.. what would you recommend for strategy in terms of logstash and automation? It's too invisible.
I guess the thing it feeds on, or logstash itself, could perhaps be used to put alerts in here [00:26:44] via icinga perhaps [00:27:05] a 1000% increase in errors is usually a clue [00:27:24] how do missing modules affect cache headers? [00:27:38] They shouldn't [00:27:44] I'd say they don't. [00:27:45] Why? [00:28:15] https://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Bits+application+servers+eqiad&m=cpu_report&s=by+name&mc=2&g=network_report [00:28:32] bytes out went down [00:28:35] memcached is broken, that causes two things for RL: Modules that are registered based on cache (e.g. gadgets-definition is a wikitext page that gets parsed, we cache that result, and if that's broken, we can't register those modules) [00:28:44] so that's more http requests with errors that we don't cache long [00:28:46] not sure why, since the same number of requests are coming in [00:29:07] the other thing it caused is the definition-hash timestamp not being stored properly, so module timestamps were rolling over quickly [00:29:13] like the languagedatamodule, but for a different reason [00:29:14] anyways, greg-g, no deployments until ops looks at this obviously [00:29:18] i'll e-mail the ops list [00:29:31] Krinkle: right, that makes sense then [00:29:45] so that's more cache misses and thus more load, but that second thing only happens if the startup module happens to be served from the problem server [00:30:00] which probably happened as well at some point [00:30:17] but at least memcached is per-apache, so it does go back to the correct timestamp now [00:30:22] right? [00:32:16] memcached isn't per apache [00:32:36] each apache connects to 127.0.0.1 for memcached, but the thing that is listening is not a memcached instance but twemproxy [00:32:48] which proxies to the memcached cluster [00:33:01] so the fact that it's 127.0.0.1 is misleading [00:33:19] do things look ok? any user complaints? [00:33:28] i don't expect any, the remaining three bits app servers look fine, but want to make sure [00:34:55] Was that box a memc server? [00:34:57] I mean probably not [00:35:08] But it's worth checking [00:41:37] RoanKattouw: ori: Hm.. what do you mean? [00:41:43] RoanKattouw: ori: we shard memcached, right? [00:41:58] so most apaches run memcached, but it's not all keys on every apache, it's sharded [00:42:11] they happen to have both roles [00:42:11] is it like that? [00:42:12] no apache runs memcached [00:42:34] oh, ok [00:42:44] apaches run a daemon called twemproxy that pretends to be memcached but really proxies each request to a real and remote memcached server [00:43:00] in that case it's worth investigating why memcache was unable to give a value to mw1151. Could be affecting other apaches [00:43:21] e.g. if the real mem server behind it has the same issue [00:43:38] so we actually have boxes that serve only as memcached? [00:45:10] mc{xxx}.eqiad.wmnet [00:45:12] there we go [00:45:21] https://github.com/wikimedia/operations-puppet/blob/70346aecd61/manifests/site.pp#L1654 [00:45:22] nice :) [00:46:09] nn [00:46:11] why would mw1151 get more reqs? [00:46:22] there must be something about the error responses that is cached differently, no? [00:46:59] so that's more http requests with errors that we don't cache long [00:47:48] oh, is it If-Modified-Since? [00:48:24] ah, i see [00:48:35] it's ResourceLoader::sendResponseHeaders [01:23:55] PROBLEM - Disk space on labsdb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
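The exchange above pins the mw1151 breakage on two interacting effects: the app server could not read or write memcached values through its local twemproxy, and ResourceLoader kept rolling module timestamps forward, so its responses were cached for much shorter periods and more traffic fell through to the sick backend. Below is a minimal sketch of how one might verify both from an affected host; it is not a transcript of the commands used during the incident, it assumes twemproxy listens on the standard memcached port 11211 on localhost, and the key name and load.php URL are purely illustrative.

    # 1) Ask the local twemproxy (which speaks the memcached text protocol)
    #    whether a value can be stored and read back at all.
    #    "testkey" is a made-up key; port 11211 is an assumption.
    printf 'set testkey 0 60 5\r\nhello\r\nget testkey\r\nquit\r\n' | nc 127.0.0.1 11211

    # 2) Look at the cache lifetime ResourceLoader sends back; an error response
    #    is typically cached for far less time than a healthy one, which matches
    #    the extra backend traffic discussed above. The URL is illustrative.
    curl -sI 'http://bits.wikimedia.org/en.wikipedia.org/load.php?modules=startup&only=scripts' | grep -iE 'cache-control|expires'

If the first command succeeds from tin or another app server but fails only on mw1151, that points at the local twemproxy/disk problem rather than the shared memcached cluster, which is also what the eval.php check above showed.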
[01:24:05] PROBLEM - MySQL Slave Running Port 3307 on labsdb1003 is CRITICAL: Timeout while attempting connection [01:24:05] PROBLEM - MySQL Slave Delay Port 3307 on labsdb1003 is CRITICAL: Timeout while attempting connection [01:24:05] PROBLEM - MySQL Slave Running Port 3308 on labsdb1003 is CRITICAL: Timeout while attempting connection [01:24:05] PROBLEM - DPKG on labsdb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:24:05] PROBLEM - puppet disabled on labsdb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:24:05] PROBLEM - MySQL Idle Transactions Port 3308 on labsdb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:24:06] PROBLEM - MySQL Idle Transactions Port 3307 on labsdb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:24:06] PROBLEM - MySQL Slave Delay Port 3308 on labsdb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:24:07] PROBLEM - MySQL Slave Running Port 3306 on labsdb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:24:25] PROBLEM - SSH on labsdb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:24:25] PROBLEM - mysqld processes on labsdb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:24:25] heh [01:24:25] PROBLEM - check configured eth on labsdb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:24:25] PROBLEM - RAID on labsdb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:24:35] PROBLEM - MySQL Recent Restart Port 3306 on labsdb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:24:35] PROBLEM - check if dhclient is running on labsdb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:24:45] PROBLEM - MySQL Recent Restart Port 3307 on labsdb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:24:45] PROBLEM - MySQL Idle Transactions Port 3306 on labsdb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:24:46] PROBLEM - MySQL Slave Delay Port 3306 on labsdb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:24:48] it's labs and it's 6pm on a friday and i'm not ops [01:24:51] * ori waves cheerfully [01:24:55] PROBLEM - MySQL Recent Restart Port 3308 on labsdb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:25:22] wtf [01:25:34] poor springle [01:30:15] PROBLEM - Host labsdb1003 is DOWN: PING CRITICAL - Packet loss = 100% [01:50:54] ori, thanks for the mw1151 intervention. 
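Earlier in the log (around 00:19-00:23), because the bits backends were not behind LVS/pybal, mw1151 was depooled by disabling Puppet and stopping Apache so that Varnish's health probes would mark the backend sick. A rough sketch of that sequence follows, with hostnames taken from the log; the exact commands are illustrative and differ across Puppet and Varnish versions.

    # Stop serving traffic and keep Puppet from restarting Apache
    # (assumes sudo and the Puppet 3-style agent command; older agents use "puppetd").
    ssh mw1151 'sudo puppet agent --disable && sudo service apache2 stop'

    # On a frontend cache, confirm the probes now see the backend as sick
    # ("backend.list" on recent Varnish releases; older ones expose "debug.health").
    ssh cp1056 'sudo varnishadm backend.list | grep -i mw1151'

Where the installed Varnish supports it, varnishadm backend.set_health would be an alternative way to force a backend sick without touching the service itself.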
[01:58:46] !log powercycling labsdb1003 [01:58:51] Logged the message, Master [02:07:15] RECOVERY - check configured eth on labsdb1003 is OK: NRPE: Unable to read output [02:07:15] RECOVERY - RAID on labsdb1003 is OK: OK: optimal, 1 logical, 2 physical [02:07:25] RECOVERY - Host labsdb1003 is UP: PING OK - Packet loss = 0%, RTA = 1.60 ms [02:07:25] RECOVERY - MySQL Recent Restart Port 3306 on labsdb1003 is OK: OK seconds since restart [02:07:25] RECOVERY - check if dhclient is running on labsdb1003 is OK: PROCS OK: 0 processes with command name dhclient [02:07:35] RECOVERY - MySQL Recent Restart Port 3307 on labsdb1003 is OK: OK seconds since restart [02:07:35] RECOVERY - MySQL Idle Transactions Port 3306 on labsdb1003 is OK: OK longest blocking idle transaction sleeps for seconds [02:07:36] RECOVERY - MySQL Slave Delay Port 3306 on labsdb1003 is OK: OK replication delay 0 seconds [02:07:45] RECOVERY - Disk space on labsdb1003 is OK: DISK OK [02:07:55] RECOVERY - DPKG on labsdb1003 is OK: All packages OK [02:07:55] RECOVERY - puppet disabled on labsdb1003 is OK: OK [02:07:55] RECOVERY - MySQL Idle Transactions Port 3307 on labsdb1003 is OK: OK longest blocking idle transaction sleeps for seconds [02:07:55] RECOVERY - MySQL Slave Running Port 3308 on labsdb1003 is OK: OK replication [02:07:55] RECOVERY - MySQL Slave Running Port 3307 on labsdb1003 is OK: OK replication [02:07:56] RECOVERY - MySQL Slave Delay Port 3308 on labsdb1003 is OK: OK replication delay 0 seconds [02:07:56] RECOVERY - MySQL Slave Running Port 3306 on labsdb1003 is OK: OK replication [02:07:57] RECOVERY - MySQL Slave Delay Port 3307 on labsdb1003 is OK: OK replication delay 0 seconds [02:07:57] RECOVERY - MySQL Idle Transactions Port 3308 on labsdb1003 is OK: OK longest blocking idle transaction sleeps for 0 seconds [02:08:05] RECOVERY - SSH on labsdb1003 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [02:08:05] RECOVERY - mysqld processes on labsdb1003 is OK: PROCS OK: 3 processes with command name mysqld [02:12:45] RECOVERY - MySQL Recent Restart Port 3308 on labsdb1003 is OK: OK 331 seconds since restart [02:13:15] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 3790 MB (3% inode=99%): [02:15:17] !log LocalisationUpdate completed (1.24wmf4) at 2014-05-17 02:14:13+00:00 [02:15:22] Logged the message, Master [02:21:15] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 3431 MB (3% inode=99%): [02:25:49] !log LocalisationUpdate completed (1.24wmf5) at 2014-05-17 02:24:46+00:00 [02:25:54] Logged the message, Master [02:58:35] PROBLEM - Puppet freshness on mw1151 is CRITICAL: Last successful Puppet run was Fri May 16 23:57:57 2014 [03:00:15] RECOVERY - Disk space on virt0 is OK: DISK OK [03:08:18] !log LocalisationUpdate ResourceLoader cache refresh completed at Sat May 17 03:07:12 UTC 2014 (duration 7m 11s) [03:08:23] Logged the message, Master [03:10:35] PROBLEM - Puppet freshness on maerlant is CRITICAL: Last successful Puppet run was Fri May 16 06:03:33 2014 [03:14:35] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:26:25] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.003 second response time [03:35:35] PROBLEM - ElasticSearch health check on elastic1003 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2100: active_shards: 6299: relocating_shards: 2: initializing_shards: 1: unassigned_shards: 0 [03:35:35] PROBLEM - ElasticSearch health check on elastic1014 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2100: active_shards: 6299: relocating_shards: 2: initializing_shards: 1: unassigned_shards: 0 [03:36:35] RECOVERY - ElasticSearch health check on elastic1003 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2101: active_shards: 6302: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [03:36:35] RECOVERY - ElasticSearch health check on elastic1014 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2101: active_shards: 6302: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [03:46:53] ACKNOWLEDGEMENT - Puppet freshness on mw1151 is CRITICAL: Last successful Puppet run was Fri May 16 23:57:57 2014 ori.livneh deliberately disabled to prevent Puppet from restarting Apache. See https://bugzilla.wikimedia.org/show_bug.cgi?id=65424#c8. [04:22:24] (CR) Ori.livneh: [C: 2] "Can't reproduce test case, this appears to have been fixed in PHP" [operations/puppet] - https://gerrit.wikimedia.org/r/133831 (owner: Ori.livneh) [05:02:36] (CR) Dan-nl: "so $wgJobBackoffThrottling['gwtoolsetUploadMetadataJob'] = 5 / 3600 will limit each _or_ all job runners to processing 5 gwtoolsetUploadMe" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/132112 (owner: Gergő Tisza) [06:11:35] PROBLEM - Puppet freshness on maerlant is CRITICAL: Last successful Puppet run was Fri May 16 06:03:33 2014 [07:12:18] (PS1) MaxSem: Kill GeoData Solr, decom servers [operations/puppet] - https://gerrit.wikimedia.org/r/133886 [07:13:28] (CR) jenkins-bot: [V: -1] Kill GeoData Solr, decom servers [operations/puppet] - https://gerrit.wikimedia.org/r/133886 (owner: MaxSem) [07:16:16] (PS2) MaxSem: Kill GeoData Solr, decom servers [operations/puppet] - https://gerrit.wikimedia.org/r/133886 [07:19:20] (CR) MaxSem: [C: -1] "Maybe in a couple weeks:)" [operations/puppet] - https://gerrit.wikimedia.org/r/133886 (owner: MaxSem) [09:04:35] PROBLEM - Puppet freshness on db1006 is CRITICAL: Last successful Puppet run was Sat May 17 06:03:26 2014 [09:12:35] PROBLEM - Puppet freshness on maerlant is CRITICAL: Last successful Puppet run was Fri May 16 06:03:33 2014 [10:03:05] RECOVERY - Puppet freshness on db1006 is OK: puppet ran at Sat May 17 10:02:56 UTC 2014 [12:13:35] PROBLEM - Puppet freshness on maerlant is CRITICAL: Last successful Puppet run was Fri May 16 06:03:33 2014 [12:24:34] Greetings, can someone help me with a bot in #da.wikipedia? - It keeps getting "Excess Flood", so that means it keeps joining and leaving. :) [12:26:38] There's nothing we can do about it [12:26:55] Okay, have a nice day then. [13:32:52] (CR) Andrew Bogott: [C: 1] "This corresponds properly with ldap, so gets my vote."
[operations/puppet] - https://gerrit.wikimedia.org/r/133761 (owner: Dzahn) [13:50:35] PROBLEM - ElasticSearch health check on elastic1014 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2135: active_shards: 6404: relocating_shards: 1: initializing_shards: 5: unassigned_shards: 0 [13:50:35] PROBLEM - ElasticSearch health check on elastic1011 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2135: active_shards: 6404: relocating_shards: 1: initializing_shards: 5: unassigned_shards: 0 [13:50:35] PROBLEM - ElasticSearch health check on elastic1010 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2135: active_shards: 6404: relocating_shards: 1: initializing_shards: 5: unassigned_shards: 0 [13:50:35] PROBLEM - ElasticSearch health check on elastic1003 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2135: active_shards: 6404: relocating_shards: 1: initializing_shards: 5: unassigned_shards: 0 [13:50:35] PROBLEM - ElasticSearch health check on elastic1001 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2135: active_shards: 6404: relocating_shards: 1: initializing_shards: 5: unassigned_shards: 0 [13:50:36] PROBLEM - ElasticSearch health check on elastic1004 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2135: active_shards: 6404: relocating_shards: 1: initializing_shards: 5: unassigned_shards: 0 [13:51:35] RECOVERY - ElasticSearch health check on elastic1003 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2140: active_shards: 6409: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [13:51:35] RECOVERY - ElasticSearch health check on elastic1011 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2140: active_shards: 6409: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [13:51:35] RECOVERY - ElasticSearch health check on elastic1010 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2140: active_shards: 6409: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [13:51:35] RECOVERY - ElasticSearch health check on elastic1014 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2140: active_shards: 6409: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [13:51:35] RECOVERY - ElasticSearch health check on elastic1001 is OK: OK - elasticsearch (production-search-eqiad) is running.
status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2140: active_shards: 6409: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [13:51:36] RECOVERY - ElasticSearch health check on elastic1004 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2140: active_shards: 6409: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [13:56:15] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 14.29% of data exceeded the critical threshold [500.0] [14:03:35] PROBLEM - Puppet freshness on db1006 is CRITICAL: Last successful Puppet run was Sat May 17 11:02:38 2014 [14:03:45] RECOVERY - Puppet freshness on db1006 is OK: puppet ran at Sat May 17 14:03:40 UTC 2014 [14:09:15] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0] [14:30:37] to clarify: Simeondahl earlier meant that the "wikichanges-de567142-d148-47ea-ad13-4d2653ae31e5" bot is coming and going (Excess Flood) on MediaWiki's IRC channels (irc.wikimedia.org) [14:37:36] Stryn: same clarification; There's nothing we can do about it [14:44:08] can't you just ignore parts and joins on your client? [15:14:35] PROBLEM - Puppet freshness on maerlant is CRITICAL: Last successful Puppet run was Fri May 16 06:03:33 2014 [18:15:35] PROBLEM - Puppet freshness on maerlant is CRITICAL: Last successful Puppet run was Fri May 16 06:03:33 2014 [19:27:24] (PS1) Ori.livneh: Delete applicationserver::cron [operations/puppet] - https://gerrit.wikimedia.org/r/133902 [19:27:26] (PS1) Ori.livneh: Add package dependencies from wikimedia-task-appserver [operations/puppet] - https://gerrit.wikimedia.org/r/133903 [19:29:00] (CR) Ori.livneh: [C: 2] Delete applicationserver::cron [operations/puppet] - https://gerrit.wikimedia.org/r/133902 (owner: Ori.livneh) [19:29:23] (CR) Ori.livneh: [C: 2] Add package dependencies from wikimedia-task-appserver [operations/puppet] - https://gerrit.wikimedia.org/r/133903 (owner: Ori.livneh) [19:43:06] (PS1) Ori.livneh: Express additional package dependencies of wikimedia-task-appserver [operations/puppet] - https://gerrit.wikimedia.org/r/133906 [19:44:39] (CR) Ori.livneh: [C: 2] Express additional package dependencies of wikimedia-task-appserver [operations/puppet] - https://gerrit.wikimedia.org/r/133906 (owner: Ori.livneh) [19:47:22] (PS1) Ori.livneh: Correct typo in lilypond package name [operations/puppet] - https://gerrit.wikimedia.org/r/133908 [19:47:34] (CR) Ori.livneh: [C: 2 V: 2] Correct typo in lilypond package name [operations/puppet] - https://gerrit.wikimedia.org/r/133908 (owner: Ori.livneh) [19:57:15] (PS1) Ori.livneh: Remove duplicate packages from imagescaler manifest [operations/puppet] - https://gerrit.wikimedia.org/r/133932 [19:57:36] (CR) Ori.livneh: [C: 2 V: 2] Remove duplicate packages from imagescaler manifest [operations/puppet] - https://gerrit.wikimedia.org/r/133932 (owner: Ori.livneh) [21:16:35] PROBLEM - Puppet freshness on maerlant is CRITICAL: Last successful Puppet run was Fri May 16 06:03:33 2014 [23:30:35] PROBLEM - Puppet freshness on db1007 is CRITICAL: Last successful Puppet run was Sat May 17 20:30:12 2014