[00:37:32] legoktm: Yes [00:49:53] (03PS1) 10Ori.livneh: HHVM: set MALLOC_CONF envvar in upstart job def [puppet] - 10https://gerrit.wikimedia.org/r/168919 [00:51:35] PROBLEM - MySQL Replication Heartbeat on db1016 is CRITICAL: CRIT replication delay 313 seconds [00:52:13] PROBLEM - MySQL Slave Delay on db1016 is CRITICAL: CRIT replication delay 338 seconds [00:53:08] RECOVERY - MySQL Slave Delay on db1016 is OK: OK replication delay 0 seconds [00:53:46] RECOVERY - MySQL Replication Heartbeat on db1016 is OK: OK replication delay -0 seconds [01:05:59] (03PS1) 10Hoo man: Only add the "oauthadmin" group on the central wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168922 [02:17:08] !log LocalisationUpdate completed (1.25wmf4) at 2014-10-27 02:17:08+00:00 [02:17:18] Logged the message, Master [02:27:46] !log LocalisationUpdate completed (1.25wmf5) at 2014-10-27 02:27:45+00:00 [02:27:53] Logged the message, Master [03:24:17] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: puppet fail [03:28:03] (03CR) 10Ori.livneh: [C: 032] HHVM: set MALLOC_CONF envvar in upstart job def [puppet] - 10https://gerrit.wikimedia.org/r/168919 (owner: 10Ori.livneh) [03:36:39] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Oct 27 03:36:39 UTC 2014 (duration 36m 38s) [03:36:48] Logged the message, Master [03:42:46] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [04:08:50] (03PS1) 10Springle: Increase wgQueryCacheLimit for enwiki and dewiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168929 (https://bugzilla.wikimedia.org/44321) [05:01:52] PROBLEM - Disk space on ocg1001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=72%): [05:03:52] RECOVERY - Disk space on ocg1001 is OK: DISK OK [05:04:38] !log forced logrotate ocg1001 [05:04:46] Logged the message, Master [05:18:11] (03PS1) 10Springle: Increase updatequerypages frequency to twice per month for all wikis. [puppet] - 10https://gerrit.wikimedia.org/r/168933 [05:52:13] (03PS1) 10Gage: logstash: hadoop: extract and infer job, task, attempt IDs [puppet] - 10https://gerrit.wikimedia.org/r/168935 [05:56:01] (03CR) 10Gage: [C: 032] logstash: hadoop: extract and infer job, task, attempt IDs [puppet] - 10https://gerrit.wikimedia.org/r/168935 (owner: 10Gage) [06:04:47] (03PS1) 10Legoktm: Add debug log group for CentralAuthUserMerge [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168936 [06:06:34] (03PS1) 10Gage: logstash: hadoop: syntax fix [puppet] - 10https://gerrit.wikimedia.org/r/168937 [06:07:25] (03CR) 10Gage: [C: 032] logstash: hadoop: syntax fix [puppet] - 10https://gerrit.wikimedia.org/r/168937 (owner: 10Gage) [06:23:51] (03CR) 10Nemo bis: [C: 031] "Let's remember to check special pages at the next run." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/168929 (https://bugzilla.wikimedia.org/44321) (owner: 10Springle) [06:24:19] (03PS1) 10Gage: logstash: hadoop: sytax fix #2 [puppet] - 10https://gerrit.wikimedia.org/r/168938 [06:25:25] (03CR) 10Gage: [C: 032] logstash: hadoop: sytax fix #2 [puppet] - 10https://gerrit.wikimedia.org/r/168938 (owner: 10Gage) [06:28:22] PROBLEM - puppet last run on gallium is CRITICAL: CRITICAL: puppet fail [06:29:01] PROBLEM - puppet last run on db1021 is CRITICAL: CRITICAL: puppet fail [06:29:02] PROBLEM - puppet last run on db1051 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:12] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:12] PROBLEM - puppet last run on db1028 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:22] PROBLEM - puppet last run on ms-fe2001 is CRITICAL: CRITICAL: Puppet has 2 failures [06:29:25] PROBLEM - puppet last run on search1001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:31] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 2 failures [06:29:32] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:33] PROBLEM - puppet last run on dbproxy1001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:41] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:42] PROBLEM - puppet last run on lvs2004 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:42] PROBLEM - puppet last run on virt1006 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:51] PROBLEM - puppet last run on db1018 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:51] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:52] PROBLEM - puppet last run on ms-fe2003 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:52] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:01] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:04] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:05] PROBLEM - puppet last run on mw1114 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:11] PROBLEM - puppet last run on db1059 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:11] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:11] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:32] RECOVERY - puppet last run on gallium is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [06:34:23] above hosts appear fine, looks like it was just puppet catalog compilation race condition [06:35:11] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [06:45:13] RECOVERY - puppet last run on search1001 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [06:45:21] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [06:45:21] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures [06:45:31] RECOVERY - puppet last run on lvs2004 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [06:45:41] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [06:45:51] RECOVERY - puppet last run on db1059 is OK: OK: Puppet is 
currently enabled, last run 58 seconds ago with 0 failures [06:46:11] RECOVERY - puppet last run on db1028 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [06:46:12] RECOVERY - puppet last run on ms-fe2001 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [06:46:21] RECOVERY - puppet last run on dbproxy1001 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [06:46:22] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [06:46:31] RECOVERY - puppet last run on db1018 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [06:46:32] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [06:46:32] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [06:46:42] RECOVERY - puppet last run on db1021 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [06:46:51] RECOVERY - puppet last run on db1051 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [06:46:52] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [06:46:52] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [06:47:11] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [06:47:32] RECOVERY - puppet last run on virt1006 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [06:47:32] RECOVERY - puppet last run on ms-fe2003 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [06:47:51] RECOVERY - puppet last run on mw1114 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [06:56:15] <_joe__> jgage: it's the usual 6.30Z logrotate effect [06:56:27] <_joe__> jgage: what time is it there btw? [06:57:01] PROBLEM - CI: Puppet failure events on labmon1001 is CRITICAL: CRITICAL: integration.integration-slave1002.puppetagent.failed_events.value (30.00%) [07:06:55] (03CR) 10Nemo bis: [C: 031] "If they succeed once they should also succeed twice. :) Note, there is a 4 hours difference with the time non-disabled special pages are u" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/168933 (owner: 10Springle) [07:13:24] _joe__: midnight :) [07:14:11] <_joe__> oh, ok then [07:14:58] PROBLEM - Disk space on dbproxy1001 is CRITICAL: DISK CRITICAL - free space: / 284 MB (3% inode=89%): [07:16:41] <_joe__> springle: is that you? / is full on this server [07:17:17] <_joe__> yes it's you [07:17:49] yeah sorry [07:17:57] RECOVERY - Disk space on dbproxy1001 is OK: DISK OK [07:20:03] <_joe__> np [07:20:24] <_joe__> whenever I see a "disk full" alert on a db, I kinda worry :) [07:21:42] eheh [07:24:07] RECOVERY - CI: Puppet failure events on labmon1001 is OK: OK: All targets OK [08:00:48] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 915.222672751 [08:25:04] (03PS1) 10Gage: logstash: hadoop: syntax #3 [puppet] - 10https://gerrit.wikimedia.org/r/168941 [08:30:52] ^^ i see that analytics1021 is no longer the leader for any partitions, but it's still in the ISR. service is healthy so i'm going to leave it alone. 
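For the analytics1021 note just above (no longer the leader for any partition, but still in the ISR), the stock Kafka topic tool prints leader, replica and ISR assignments per partition, which is an easy way to confirm the observation. A minimal sketch, assuming a Kafka 0.8-era install with kafka-topics.sh on the path; the ZooKeeper address is a placeholder:
    # Print leader, replica and ISR broker ids for every topic/partition
    kafka-topics.sh --describe --zookeeper localhost:2181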
[08:31:08] (03CR) 10Gage: [C: 032] logstash: hadoop: syntax #3 [puppet] - 10https://gerrit.wikimedia.org/r/168941 (owner: 10Gage) [08:37:14] (03CR) 10Florianschmidtwelzow: [C: 04-1] Set $wgMFAnonymousEditing = true for Italian Wikipedia in November-December 2014 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168915 (https://bugzilla.wikimedia.org/72541) (owner: 10Nemo bis) [08:43:07] (03CR) 10Nemo bis: "Thanks Florian for looking into it. This is the format we usually use, see for instance e2ebc24b1044dd342afb7099ef91a70c9136acaf, so there" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168915 (https://bugzilla.wikimedia.org/72541) (owner: 10Nemo bis) [08:45:05] (03PS2) 10Nemo bis: Set $wgMFAnonymousEditing = true for Italian Wikipedia in November-December 2014 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168915 (https://bugzilla.wikimedia.org/72541) [08:51:51] (03CR) 10Florianschmidtwelzow: [C: 031] "> This is the format we usually use" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168915 (https://bugzilla.wikimedia.org/72541) (owner: 10Nemo bis) [08:52:52] (03PS3) 10Nemo bis: Set $wgMFAnonymousEditing = true for Italian Wikipedia in November-December 2014 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168915 (https://bugzilla.wikimedia.org/72541) [09:15:35] <_joe__> Nemo_bis: why use strtotime there? just put in the integers and set the human-readable date in a comment [09:15:42] <_joe__> :) [09:15:57] * _joe__ annoying optimizator [09:26:01] _joe_: dunno, I just followed the tradition [09:26:16] Feel free to amend, especially as you're the second person pointing it out ;) [09:26:38] <_joe_> eheh there's plenty of annoying people around here [09:26:40] <_joe_> :P [09:28:21] RECOVERY - check if salt-minion is running on virt1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [09:29:23] _joe_: http://tools.wmflabs.org/ times out for me, can you have a short peek please ? [09:34:35] <_joe_> matanya: I know zero about toollabs [09:34:44] <_joe_> let me see if someone else knows more [09:34:58] <_joe_> else, you'll have to wait for me to read through the docs [09:35:29] thanks much for whatever you choose to do :) [09:37:19] _joe_: just for the sake of more info, some tools do work with a direct link, e.g. https://tools.wmflabs.org/hay/directory/#/ [09:37:23] <_joe_> matanya: single tools work correctly btw [09:37:37] <_joe_> so I don't think this is _really_ a toolabs general issue [09:37:46] <_joe_> matanya: yeah just verified that myself [09:37:46] agree [09:38:00] will report a bug then [09:38:05] thanks for your time :) [09:38:20] <_joe_> matanya: np I am taking a further look anyway [09:39:35] (03CR) 10Florianschmidtwelzow: [C: 031] Set $wgMFAnonymousEditing = true for Italian Wikipedia in November-December 2014 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168915 (https://bugzilla.wikimedia.org/72541) (owner: 10Nemo bis) [09:46:23] (03PS2) 10Giuseppe Lavagetto: webserver: move to a module, fix and remove a few things [puppet] - 10https://gerrit.wikimedia.org/r/168604 [09:49:58] (03CR) 10Giuseppe Lavagetto: [C: 032] "verified with the puppet compiler." [puppet] - 10https://gerrit.wikimedia.org/r/168604 (owner: 10Giuseppe Lavagetto) [09:55:26] (03CR) 10Ricordisamoa: "What does prevent this change from being deployed in November?" 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168915 (https://bugzilla.wikimedia.org/72541) (owner: 10Nemo bis) [10:00:42] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: puppet fail [10:00:56] <_joe_> ewww that's probably me [10:01:52] PROBLEM - puppet last run on zirconium is CRITICAL: CRITICAL: puppet fail [10:04:43] (03PS1) 10Giuseppe Lavagetto: icinga: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/168945 [10:05:24] (03CR) 10Giuseppe Lavagetto: [C: 032] icinga: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/168945 (owner: 10Giuseppe Lavagetto) [10:11:13] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [12:27:43] PROBLEM - ElasticSearch health check on elastic1017 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.39 [12:28:04] PROBLEM - ElasticSearch health check for shards on elastic1017 is CRITICAL: CRITICAL - elasticsearch http://10.64.48.39:9200/_cluster/health error while fetching: Request timed out. [12:29:24] PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.112 [12:30:23] RECOVERY - ElasticSearch health check on elastic1005 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2031: active_shards: 6088: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [12:31:13] actually elasticsearch seems to have some problems? :/ [12:32:23] PROBLEM - ElasticSearch health check for shards on elastic1008 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.140:9200/_cluster/health error while fetching: Request timed out. [12:32:44] PROBLEM - ElasticSearch health check on elastic1006 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.113 [12:33:14] PROBLEM - ElasticSearch health check for shards on elastic1005 is CRITICAL: CRITICAL - elasticsearch http://10.64.0.112:9200/_cluster/health error while fetching: Request timed out. [12:33:33] PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.112 [12:35:03] PROBLEM - ElasticSearch health check for shards on elastic1006 is CRITICAL: CRITICAL - elasticsearch http://10.64.0.113:9200/_cluster/health error while fetching: Request timed out. [12:35:43] PROBLEM - ElasticSearch health check on elastic1008 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.140 [12:36:21] can someone from ops review https://gerrit.wikimedia.org/r/#/c/168255/ today? akosiaris would you be comfortable doing that? [12:36:25] PROBLEM - ElasticSearch health check for shards on elastic1004 is CRITICAL: CRITICAL - elasticsearch http://10.64.0.111:9200/_cluster/health error while fetching: Request timed out. 
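The flapping checks above all poll the standard Elasticsearch cluster health endpoint, so a manual spot-check against one of the nodes named in the alerts is cheap. A sketch; the -m flag caps curl's own wait so the command does not hang on an unresponsive node:
    # Cluster health as reported by elastic1005 (10.64.0.112), giving up after 10s
    curl -s -m 10 'http://10.64.0.112:9200/_cluster/health?pretty'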
[12:39:25] RECOVERY - ElasticSearch health check for shards on elastic1004 is OK: OK - elasticsearch status production-search-eqiad: status: green, number_of_nodes: 18, unassigned_shards: 0, timed_out: False, active_primary_shards: 2031, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 6088, initializing_shards: 0, number_of_data_nodes: 18 [12:40:36] RECOVERY - ElasticSearch health check for shards on elastic1008 is OK: OK - elasticsearch status production-search-eqiad: status: green, number_of_nodes: 18, unassigned_shards: 0, timed_out: False, active_primary_shards: 2031, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 6088, initializing_shards: 0, number_of_data_nodes: 18 [12:41:05] RECOVERY - ElasticSearch health check on elastic1005 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2031: active_shards: 6088: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [12:43:03] PROBLEM - ElasticSearch health check for shards on elastic1004 is CRITICAL: CRITICAL - elasticsearch http://10.64.0.111:9200/_cluster/health error while fetching: Request timed out. [12:43:43] RECOVERY - ElasticSearch health check for shards on elastic1005 is OK: OK - elasticsearch status production-search-eqiad: status: green, number_of_nodes: 18, unassigned_shards: 0, timed_out: False, active_primary_shards: 2031, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 6088, initializing_shards: 0, number_of_data_nodes: 18 [12:43:55] PROBLEM - ElasticSearch health check for shards on elastic1008 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.140:9200/_cluster/health error while fetching: Request timed out. [12:44:13] PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.112 [12:44:40] elasticsearch is still down since 20 min [12:45:15] robh: ^ [12:45:43] PROBLEM - ElasticSearch health check on elastic1004 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.111 [12:47:03] PROBLEM - ElasticSearch health check for shards on elastic1005 is CRITICAL: CRITICAL - elasticsearch http://10.64.0.112:9200/_cluster/health error while fetching: Request timed out. [12:48:43] <_joe_> mmh [12:49:24] looks not good [12:50:02] <_joe_> FlorianSW: it dues not [12:50:59] _joe_ do you know someone who can see the logs to take a look? Just with the graphs of ganglia it seems, that es starts but interrupt at any point and crashes :/ [12:51:15] <_joe_> yes [12:51:24] <_joe_> I am taking a look [12:51:31] great :) thx [12:51:33] RECOVERY - ElasticSearch health check on elastic1008 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2031: active_shards: 6088: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [12:52:23] RECOVERY - ElasticSearch health check on elastic1005 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2031: active_shards: 6088: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [12:54:33] PROBLEM - ElasticSearch health check on elastic1008 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.140 [12:54:49] (03CR) 10Anomie: "Instructions for deploying a new extension to Beta Labs appear to already be at https://www.mediawiki.org/wiki/Writing_an_extension_for_de" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168688 (https://bugzilla.wikimedia.org/72465) (owner: 10Kaldari) [12:55:43] PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.112 [12:56:34] RECOVERY - ElasticSearch health check on elastic1008 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2031: active_shards: 6088: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [12:58:19] (03PS1) 10Faidon Liambotis: Re-pool ulsfo [dns] - 10https://gerrit.wikimedia.org/r/168954 [12:58:23] (03PS2) 10Faidon Liambotis: Re-pool ulsfo [dns] - 10https://gerrit.wikimedia.org/r/168954 [12:59:54] PROBLEM - ElasticSearch health check on elastic1008 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.140 [13:01:36] RECOVERY - ElasticSearch health check for shards on elastic1005 is OK: OK - elasticsearch status production-search-eqiad: status: green, number_of_nodes: 18, unassigned_shards: 0, timed_out: False, active_primary_shards: 2031, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 6088, initializing_shards: 0, number_of_data_nodes: 18 [13:02:17] Assuming that SSL3 has been disabled on misc-web-lb.eqiad.wikimedia.org and Tool Labs, anybody knows why ssllabs.com says it hasn't? [13:02:22] (brought up in https://bugzilla.wikimedia.org/show_bug.cgi?id=72072#c2 ) [13:02:45] RECOVERY - ElasticSearch health check for shards on elastic1008 is OK: OK - elasticsearch status production-search-eqiad: status: green, number_of_nodes: 18, unassigned_shards: 0, timed_out: False, active_primary_shards: 2031, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 6088, initializing_shards: 0, number_of_data_nodes: 18 [13:02:57] for whomever to note (commons) An error has occurred while searching: Search is currently too busy. Please try again later. (presumably SFO servers) [13:03:15] been like that for a couple of minutes [13:03:15] RECOVERY - ElasticSearch health check for shards on elastic1006 is OK: OK - elasticsearch status production-search-eqiad: status: green, number_of_nodes: 18, unassigned_shards: 0, timed_out: False, active_primary_shards: 2031, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 6088, initializing_shards: 0, number_of_data_nodes: 18 [13:03:16] sDrewth: nope, it's everywhere; we're awae [13:03:16] RECOVERY - ElasticSearch health check on elastic1006 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2031: active_shards: 6088: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [13:03:20] *aware [13:03:21] k [13:03:23] thanks :) [13:03:37] ah, all the elastic search pings :-/ [13:03:55] sDrewth: already seen on all wikipedias using CirrusSerach (elasticsearch cluster) [13:04:10] excellent news [13:04:17] equality on show [13:04:18] not really :P [13:04:34] sarcasm Joyce [13:04:54] PROBLEM - ElasticSearch health check for shards on elastic1005 is CRITICAL: CRITICAL - elasticsearch http://10.64.0.112:9200/_cluster/health error while fetching: Request timed out. [13:05:19] ;) [13:05:36] let us all wake up manybubbles [13:05:42] he's around [13:05:45] I'm up [13:05:47] _joe_ have you found something inetersting? :-) [13:05:49] just started poking it [13:05:54] PROBLEM - ElasticSearch health check for shards on elastic1008 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.140:9200/_cluster/health error while fetching: Request timed out. [13:06:06] <_joe_> FlorianSW: not really, the cluster seems in a really dire state [13:06:24] PROBLEM - ElasticSearch health check for shards on elastic1006 is CRITICAL: CRITICAL - elasticsearch http://10.64.0.113:9200/_cluster/health error while fetching: Request timed out. [13:06:25] PROBLEM - ElasticSearch health check on elastic1006 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.113 [13:07:06] _joe_ :/ manybubbles: i think a restart will not solve the problem (but i hope, fingers crossed) [13:07:22] which one are we restarting? [13:07:29] <_joe_> none [13:07:53] manybubbles: oops, sorry, i read false, i thought you restarted one :) [13:07:54] RECOVERY - ElasticSearch health check for shards on elastic1008 is OK: OK - elasticsearch status production-search-eqiad: status: green, number_of_nodes: 18, unassigned_shards: 0, timed_out: False, active_primary_shards: 2031, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 6088, initializing_shards: 0, number_of_data_nodes: 18 [13:07:57] that isn't great: [2014-10-27 12:24:37,334][WARN ][monitor.jvm ] [elastic1008] [gc][young][1579767][889812] duration [1.2s], collections [1]/[1.9s], total [1.2s]/[14.3h], memory [16.3gb]->[16.7gb]/[29.9gb], all_pools {[young] [33.4mb]->[122.5mb]/[665.6mb]}{[survivor] [83.1mb]->[70.8mb]/[83.1mb]}{[old] [16.1gb]->[16.5gb]/[29.1gb]} [13:08:34] RECOVERY - ElasticSearch health check for shards on elastic1006 is OK: OK - elasticsearch status production-search-eqiad: status: green, number_of_nodes: 18, unassigned_shards: 0, timed_out: False, active_primary_shards: 2031, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 6088, initializing_shards: 0, number_of_data_nodes: 18 [13:10:25] RECOVERY - ElasticSearch health check on elastic1008 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2031: active_shards: 6088: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [13:11:04] PROBLEM - ElasticSearch health check for shards on elastic1008 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.140:9200/_cluster/health error while fetching: Request timed out. [13:11:35] PROBLEM - ElasticSearch health check for shards on elastic1006 is CRITICAL: CRITICAL - elasticsearch http://10.64.0.113:9200/_cluster/health error while fetching: Request timed out. 
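The SSLv3 question raised earlier (misc-web-lb and Tool Labs still flagged by ssllabs.com) can be cross-checked with a direct handshake attempt; a sketch using the openssl CLI, where a handshake failure means SSLv3 really is disabled on the endpoint:
    # Try to force an SSLv3-only handshake against the load balancer
    echo | openssl s_client -connect misc-web-lb.eqiad.wikimedia.org:443 -ssl3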
[13:12:59] RECOVERY - ElasticSearch health check for shards on elastic1008 is OK: OK - elasticsearch status production-search-eqiad: status: green, number_of_nodes: 18, unassigned_shards: 0, timed_out: False, active_primary_shards: 2031, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 6088, initializing_shards: 0, number_of_data_nodes: 18 [13:13:29] would anyone mind acking that in icinga? [13:13:35] PROBLEM - ElasticSearch health check on elastic1008 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.140 [13:14:09] <_joe_> manybubbles: acking won't do any good [13:14:15] <_joe_> as it's flapping [13:14:19] ah [13:14:28] <_joe_> I can disable notifications [13:15:29] nothing super clear yet [13:15:36] RECOVERY - ElasticSearch health check on elastic1008 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2031: active_shards: 6088: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [13:16:04] RECOVERY - ElasticSearch health check on elastic1004 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2031: active_shards: 6088: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [13:16:15] PROBLEM - ElasticSearch health check for shards on elastic1008 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.140:9200/_cluster/health error while fetching: Request timed out. [13:18:35] PROBLEM - Apache HTTP on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:19:00] <_joe_> and we also have a nice apache issue as well? [13:19:14] PROBLEM - ElasticSearch health check on elastic1004 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.111 [13:20:02] _joe_ this problem is new :/ [13:20:06] <_joe_> ottomata: CirrusSearch is not working as of now. [13:20:25] PROBLEM - Apache HTTP on mw1117 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:20:28] <_joe_> FlorianSW: the ES one? it surely is. [13:20:41] wha [13:20:42] uh oh [13:20:44] good mornignt! [13:20:48] morning* [13:20:55] PROBLEM - Apache HTTP on mw1147 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:21:04] PROBLEM - Apache HTTP on mw1128 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:21:07] !log elastic1008 is logging gc issues. restarting it because that might help it [13:21:14] RECOVERY - Apache HTTP on mw1117 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.096 second response time [13:21:17] Logged the message, Master [13:21:35] RECOVERY - Apache HTTP on mw1140 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.078 second response time [13:21:45] <_joe_> these errors ^^ are related to search I fear [13:21:54] PROBLEM - ElasticSearch health check on elastic1008 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.140 [13:22:04] RECOVERY - Apache HTTP on mw1147 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 1.336 second response time [13:22:08] RECOVERY - Apache HTTP on mw1128 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 1.533 second response time [13:22:15] _joe_: we shouldn't be using too many apaches. hopefully. 
we try to prevent that [13:22:46] <_joe_> manybubbles: it's the API appservers that get stuck connected to search.svc.eqiad.wmnet:9200 [13:22:49] <_joe_> :/ [13:23:12] <_joe_> just confirmed strace-ing one of the stuck apache processes [13:23:16] manybubbles: you think this is all caused by just 1008 having issues? [13:23:32] hmm, I didn't know that WD was elastic search [13:23:57] sDrewth: it was communicated as "the new search" :) [13:24:23] how silly of me to have not interpreted that [13:25:08] RECOVERY - ElasticSearch health check on elastic1005 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2031: active_shards: 6088: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [13:26:06] a lot of reports about search issues on nlwiki and commons btw [13:26:35] PROBLEM - Apache HTTP on mw1135 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:26:39] <_joe_> sjoerddebruin: search is in a very bad state right now [13:26:47] sjoerddebruin: it should be on all wmf wikis (using CirrusSearch) [13:27:03] Yeah, but are you working on it? :D [13:27:29] _joe_ mw1147 and mw1128 aren't in the Application eqiad cluster in ganglia, is this correct? [13:27:34] RECOVERY - Apache HTTP on mw1135 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 2.214 second response time [13:27:37] <_joe_> FlorianSW: API [13:27:37] PROBLEM - Apache HTTP on mw1121 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:27:37] RECOVERY - ElasticSearch health check for shards on elastic1005 is OK: OK - elasticsearch status production-search-eqiad: status: green, number_of_nodes: 18, unassigned_shards: 0, timed_out: False, active_primary_shards: 2031, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 6088, initializing_shards: 0, number_of_data_nodes: 18 [13:27:44] PROBLEM - Apache HTTP on mw1131 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:27:47] PROBLEM - Apache HTTP on mw1119 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:27:54] PROBLEM - LVS HTTP IPv4 on api.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:27:58] _joe_ ah, ok :) [13:28:04] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0] [13:28:07] <_joe_> manybubbles: is there any parameter that can set a timeout for cirrus searches? [13:28:23] manybubbles: i am looking around for issues, tell me if there is anything I can do or a place I can look...i could also scramble to get some new nodes up right now too, if you like [13:28:26] PROBLEM - Apache HTTP on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:28:26] PROBLEM - Apache HTTP on mw1129 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:28:26] PROBLEM - Apache HTTP on mw1139 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:28:26] PROBLEM - Apache HTTP on mw1127 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:28:31] (was going to start that this morning anyway...) [13:28:33] <_joe_> ottomata: ^^ [13:28:34] PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.112 [13:28:35] PROBLEM - Apache HTTP on mw1130 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:28:35] what the... 
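_joe_ confirmed the stuck API apaches by strace-ing one process; a cheaper way to gauge how many workers are blocked on the search backend is to count established connections to port 9200 from an appserver. A rough sketch, assuming it is run on one of the affected mw hosts:
    # How many sockets does this appserver hold open to Elasticsearch?
    sudo netstat -tnp | grep ':9200 ' | grep -c ESTABLISHED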
[13:28:35] PROBLEM - Apache HTTP on mw1143 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:28:35] PROBLEM - Apache HTTP on mw1147 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:28:46] RECOVERY - Apache HTTP on mw1119 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.091 second response time [13:29:09] _joe_: look at wgCirrusSearchClientSideSearchTimeout [13:29:15] RECOVERY - Apache HTTP on mw1129 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.098 second response time [13:29:33] <_joe_> manybubbles: if we set that to 0? [13:29:34] PROBLEM - Apache HTTP on mw1116 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:29:37] !log restarted elasticsearch on elastic1017 - memory was totally full there [13:29:39] now I get a page... doing the backread [13:29:43] _joe_: I'm not sure [13:29:43] Logged the message, Master [13:29:44] RECOVERY - ElasticSearch health check for shards on elastic1006 is OK: OK - elasticsearch status production-search-eqiad: status: green, number_of_nodes: 18, unassigned_shards: 0, timed_out: False, active_primary_shards: 2031, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 6088, initializing_shards: 0, number_of_data_nodes: 18 [13:29:44] RECOVERY - ElasticSearch health check on elastic1006 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2031: active_shards: 6088: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [13:30:02] <_joe_> manybubbles: ok nevermind [13:30:05] RECOVERY - LVS HTTP IPv4 on api.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 19592 bytes in 0.310 second response time [13:30:56] RECOVERY - Apache HTTP on mw1131 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.078 second response time [13:30:57] PROBLEM - Apache HTTP on mw1146 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:30:57] PROBLEM - Apache HTTP on mw1132 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:30:57] PROBLEM - Apache HTTP on mw1135 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:31:07] PROBLEM - ElasticSearch health check for shards on elastic1005 is CRITICAL: CRITICAL - elasticsearch http://10.64.0.112:9200/_cluster/health error while fetching: Request timed out. [13:31:17] PROBLEM - ElasticSearch health check for shards on elastic1016 is CRITICAL: CRITICAL - elasticsearch http://10.64.48.13:9200/_cluster/health error while fetching: Request timed out. 
[13:31:35] pfff [13:31:38] RECOVERY - Apache HTTP on mw1127 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 7.774 second response time [13:31:47] RECOVERY - Apache HTTP on mw1147 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 6.649 second response time [13:32:06] RECOVERY - Apache HTTP on mw1135 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 8.778 second response time [13:32:37] RECOVERY - Apache HTTP on mw1140 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.084 second response time [13:32:37] RECOVERY - Apache HTTP on mw1116 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 2.527 second response time [13:32:48] PROBLEM - Apache HTTP on mw1120 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:32:56] PROBLEM - ElasticSearch health check on elastic1006 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.113 [13:32:56] PROBLEM - ElasticSearch health check for shards on elastic1006 is CRITICAL: CRITICAL - elasticsearch http://10.64.0.113:9200/_cluster/health error while fetching: Request timed out. [13:33:49] RECOVERY - Apache HTTP on mw1143 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.088 second response time [13:33:56] PROBLEM - ElasticSearch health check on elastic1016 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.13 [13:34:06] RECOVERY - Apache HTTP on mw1121 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.078 second response time [13:34:06] RECOVERY - Apache HTTP on mw1132 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 2.324 second response time [13:34:37] PROBLEM - Apache HTTP on mw1123 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:34:46] PROBLEM - Apache HTTP on mw1141 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:34:46] PROBLEM - Apache HTTP on mw1129 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:35:06] RECOVERY - Apache HTTP on mw1146 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.078 second response time [13:35:56] PROBLEM - Apache HTTP on mw1122 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:36:00] PROBLEM - Apache HTTP on mw1116 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:36:07] PROBLEM - Apache HTTP on mw1147 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:36:22] since we _have_ lsearchd still should we just fail back to it for the moment? 
[13:36:27] RECOVERY - ElasticSearch health check for shards on elastic1005 is OK: OK - elasticsearch status production-search-eqiad: status: green, number_of_nodes: 18, unassigned_shards: 0, timed_out: False, active_primary_shards: 2031, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 6088, initializing_shards: 0, number_of_data_nodes: 18 [13:36:56] PROBLEM - Apache HTTP on mw1115 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:36:56] PROBLEM - Apache HTTP on mw1127 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:36:56] PROBLEM - Apache HTTP on mw1143 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:16] PROBLEM - Apache HTTP on mw1128 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:17] PROBLEM - Apache HTTP on mw1118 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:17] PROBLEM - Apache HTTP on mw1132 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:18] PROBLEM - Apache HTTP on mw1121 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:22] manybubbles: what nodes are currently master eligible? [13:37:26] PROBLEM - Apache HTTP on mw1131 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:26] PROBLEM - Apache HTTP on mw1144 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:36] PROBLEM - Apache HTTP on mw1117 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:36] PROBLEM - Apache HTTP on mw1119 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:37] PROBLEM - Apache HTTP on mw1138 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:37] RECOVERY - ElasticSearch health check for shards on elastic1004 is OK: OK - elasticsearch status production-search-eqiad: status: green, number_of_nodes: 18, unassigned_shards: 0, timed_out: False, active_primary_shards: 2031, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 6088, initializing_shards: 0, number_of_data_nodes: 18 [13:37:42] and, manybubbles, yes if we can fallback to lsearchd for a bit while we fix this, let's do it [13:37:56] manybubbles: just a my "non-professional" opnion: If it's possible to switch, i would switch :) [13:37:58] <_joe_> manybubbles: +1 for falling back [13:38:07] PROBLEM - Apache HTTP on mw1137 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:12] <_joe_> ottomata: +1 [13:38:21] PROBLEM - Apache HTTP on mw1146 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:26] PROBLEM - Apache HTTP on mw1135 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:26] <_joe_> we'd probably need to rolling restart the api apaches as well [13:38:49] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Nitpicks, otherwise LGTM" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/168698 (owner: 10Dzahn) [13:38:51] PROBLEM - ElasticSearch health check on elastic1013 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.10 [13:39:07] PROBLEM - Apache HTTP on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:39:12] <_joe_> ottomata: are you doing the fallback? 
[13:39:16] PROBLEM - Apache HTTP on mw1124 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:39:17] PROBLEM - Apache HTTP on mw1134 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:39:22] <_joe_> ottomata: at least for api for now [13:39:26] <_joe_> if it's possible [13:39:26] PROBLEM - Apache HTTP on mw1126 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:39:36] PROBLEM - Apache HTTP on mw1136 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:39:38] _joe_, i don't know how off the top of my head...looking [13:39:41] PROBLEM - ElasticSearch health check for shards on elastic1005 is CRITICAL: CRITICAL - elasticsearch http://10.64.0.112:9200/_cluster/health error while fetching: Request timed out. [13:39:41] PROBLEM - Apache HTTP on mw1145 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:39:42] RECOVERY - ElasticSearch health check on elastic1013 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2031: active_shards: 6088: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [13:39:48] PROBLEM - Apache HTTP on mw1125 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:39:57] PROBLEM - Apache HTTP on mw1148 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:40:07] PROBLEM - Apache HTTP on mw1133 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:40:10] (03PS1) 10Manybubbles: Stop sending searches to cirrus while elasticsearch is down [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168956 [13:40:21] gotcha manybubbles... [13:40:33] (03CR) 10Ottomata: [C: 032] Stop sending searches to cirrus while elasticsearch is down [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168956 (owner: 10Manybubbles) [13:40:33] someone review that please [13:40:37] <_joe_> I was about to do the same [13:40:57] PROBLEM - ElasticSearch health check for shards on elastic1004 is CRITICAL: CRITICAL - elasticsearch http://10.64.0.111:9200/_cluster/health error while fetching: Request timed out. [13:41:00] ok. should I sync it then? [13:41:01] did someone already puppet-merge it? [13:41:13] <_joe_> ottomata: it's mediawiki-config [13:41:18] oh duhhh [13:41:18] PROBLEM - Apache HTTP on mw1142 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:41:18] ha [13:41:21] yes, manybubbles, it is merged [13:41:23] go ahead [13:41:45] <_joe_> ottomata: I am 40 minutes late for lunch, grabbing something then I'll be back [13:41:50] ok [13:41:53] !log manybubbles Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 05s) [13:42:01] Logged the message, Master [13:42:04] <_joe_> ottomata: maybe a rolling restart of api appservers will be needed [13:42:14] <_joe_> manybubbles: are you sure lsearchd can handle the load? [13:42:33] _joe_: it'll be angry for a bit but it'll be better than just failing [13:42:39] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 5 below the confidence bounds [13:42:52] (03PS3) 10Ottomata: kafka - remove/replace pmtpa [puppet] - 10https://gerrit.wikimedia.org/r/168727 (owner: 10Dzahn) [13:43:06] PROBLEM - ElasticSearch health check on elastic1013 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.10 [13:43:14] _joe_, i assume that means just restart them all one at a time? [13:43:16] nothing fancy, right?
[13:43:26] _joe_: if that is required we'll have to figure out why cirrus needed that and fix it later [13:43:52] we should be mostly off of cirrus - all except for people opting into the beta feature [13:43:56] <_joe_> ottomata: yep, maybe try one first [13:44:06] PROBLEM - LVS HTTP IPv4 on api.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:44:17] ok, and, apache, right (sorry, have had very little interaction with app servers) [13:44:19] ? [13:44:46] <_joe_> ottomata: yeah, but I'm not sure that is a fix [13:44:48] RECOVERY - ElasticSearch health check on elastic1005 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2031: active_shards: 6088: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [13:44:48] <_joe_> lemme try [13:45:03] manybubbles, ottomata, _joe_ thanks for this hot fix [13:45:12] oh, i'm on 1115 now _joe_, was about to [13:45:14] RECOVERY - LVS HTTP IPv4 on api.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 19592 bytes in 0.267 second response time [13:45:19] _joe_, go ahead [13:45:29] you might know more of what to look for [13:45:55] RECOVERY - Apache HTTP on mw1129 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.065 second response time [13:46:02] * ^d catches scrollback [13:46:03] (03PS1) 10Manybubbles: Try again to disable cirrus for a bit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168957 [13:46:04] RECOVERY - Apache HTTP on mw1134 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.080 second response time [13:46:06] PROBLEM - ElasticSearch health check for shards on elastic1013 is CRITICAL: CRITICAL - elasticsearch http://10.64.48.10:9200/_cluster/health error while fetching: Request timed out. [13:46:06] (03CR) 10Ottomata: [C: 032] "Ya, this is good. I need to refactor how this works someday." [puppet] - 10https://gerrit.wikimedia.org/r/168727 (owner: 10Dzahn) [13:46:14] <_joe_> manybubbles: I still see api using elasticsearch [13:46:31] I made a mistake [13:46:33] hm, it didn't take manybubbles? [13:46:34] ok [13:46:35] RECOVERY - Apache HTTP on mw1144 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 6.359 second response time [13:46:47] (03CR) 10Ottomata: [C: 032] Try again to disable cirrus for a bit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168957 (owner: 10Manybubbles) [13:46:53] (03CR) 10Ottomata: [V: 032] Try again to disable cirrus for a bit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168957 (owner: 10Manybubbles) [13:46:54] ^d: any chance you can review that last change and deploy it while I figure out why elasticsearch is so sad? [13:46:55] RECOVERY - Apache HTTP on mw1123 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 9.877 second response time [13:46:56] merged, manybubbles [13:47:01] or I'll just do it [13:47:05] RECOVERY - Apache HTTP on mw1127 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 7.965 second response time [13:47:28] !log manybubbles Synchronized wmf-config/InitialiseSettings.php: fall back to lsearchd for a bit (duration: 00m 05s) [13:47:34] Logged the message, Master [13:47:36] <^d> manybubbles: Was about to say yes. [13:47:44] <^d> But everyone else is obviously more awake than I.
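The brief mix-up above ("did someone already puppet-merge it?" / "it's mediawiki-config") comes from the two repos deploying differently: puppet changes are pulled onto the puppetmaster with puppet-merge, while mediawiki-config changes are synced out to the appservers from the deployment host. A rough sketch of the latter, matching the two Synchronized entries in the log; the checkout location is an assumption:
    # On the deployment host, inside the mediawiki-config checkout (exact path varies)
    git pull
    # Push the changed file to all appservers, with a log message
    sync-file wmf-config/InitialiseSettings.php 'fall back to lsearchd for a bit'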
[13:47:48] hehe [13:47:56] RECOVERY - Apache HTTP on mw1116 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.101 second response time [13:48:01] ok - we're now mostly off of cirrus [13:48:11] I'm seeing curl code 28 a lot - timeout iirc [13:48:15] RECOVERY - Apache HTTP on mw1142 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 9.522 second response time [13:48:16] RECOVERY - Apache HTTP on mw1147 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 9.644 second response time [13:48:25] RECOVERY - Apache HTTP on mw1121 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.079 second response time [13:48:34] RECOVERY - Apache HTTP on mw1145 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.078 second response time [13:48:36] <_joe_> I confirm we're off [13:48:44] RECOVERY - Apache HTTP on mw1138 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 2.517 second response time [13:48:45] RECOVERY - Apache HTTP on mw1148 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 2.676 second response time [13:48:46] <_joe_> ottomata: no need to restart apaches [13:48:48] <_joe_> :) [13:48:55] RECOVERY - Apache HTTP on mw1133 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.089 second response time [13:48:55] RECOVERY - Apache HTTP on mw1141 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.116 second response time [13:48:57] manybubbles: correct, 28 -> timed out [13:48:57] ok [13:49:04] RECOVERY - Apache HTTP on mw1115 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.066 second response time [13:49:08] RECOVERY - Apache HTTP on mw1140 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.079 second response time [13:49:08] RECOVERY - Apache HTTP on mw1122 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.086 second response time [13:49:08] RECOVERY - Apache HTTP on mw1139 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.127 second response time [13:49:08] RECOVERY - Apache HTTP on mw1124 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.106 second response time [13:49:14] RECOVERY - Apache HTTP on mw1120 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.098 second response time [13:49:15] RECOVERY - Apache HTTP on mw1143 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.130 second response time [13:49:15] RECOVERY - Apache HTTP on mw1137 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.114 second response time [13:49:16] RECOVERY - Apache HTTP on mw1130 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.111 second response time [13:49:16] RECOVERY - Apache HTTP on mw1126 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.108 second response time [13:49:34] RECOVERY - Apache HTTP on mw1128 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.106 second response time [13:49:37] RECOVERY - Apache HTTP on mw1146 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.109 second response time [13:49:38] RECOVERY - Apache HTTP on mw1136 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.131 second response time [13:49:38] RECOVERY - Apache HTTP on mw1132 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.065 second response time [13:49:38] RECOVERY - Apache HTTP on mw1135 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.083 second response time [13:49:38] RECOVERY - Apache HTTP on mw1131 is OK: HTTP OK: 
HTTP/1.1 301 Moved Permanently - 400 bytes in 0.142 second response time [13:49:38] RECOVERY - Apache HTTP on mw1118 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.184 second response time [13:49:47] RECOVERY - Apache HTTP on mw1119 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.061 second response time [13:49:48] RECOVERY - Apache HTTP on mw1117 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.066 second response time [13:49:49] RECOVERY - Apache HTTP on mw1125 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.109 second response time [13:49:54] RECOVERY - ElasticSearch health check on elastic1013 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2031: active_shards: 6088: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [13:50:34] PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.112 [13:52:14] RECOVERY - ElasticSearch health check for shards on elastic1004 is OK: OK - elasticsearch status production-search-eqiad: status: green, number_of_nodes: 18, unassigned_shards: 0, timed_out: False, active_primary_shards: 2031, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 6088, initializing_shards: 0, number_of_data_nodes: 18 [13:53:14] PROBLEM - ElasticSearch health check on elastic1013 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.10 [13:54:20] fingers crossed that lsearchd can handle all the requests: http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Search+eqiad&m=cpu_report&s=by+name&mc=2&g=network_report (that's the power of wikipedia :)) [13:54:39] <^d> We haven't taken any of lsearchd's capacity away, no reason it shouldn't be able to. [13:55:14] PROBLEM - ElasticSearch health check for shards on elastic1004 is CRITICAL: CRITICAL - elasticsearch http://10.64.0.111:9200/_cluster/health error while fetching: Request timed out. [13:55:38] <^d> manybubbles: Anything I can do to help? [13:56:25] lsearchd is heavily reliant on caching so it'll probably flap for a bit. [13:56:47] ^d: I dunno, poke at things and see what is up. looks like all the machines filled up their heap all at once [13:56:59] and they started getting really long gc pauses [13:57:13] because, well, when your heap is full it takes a while just to free a little bit [13:57:35] jobrunners: http://ganglia.wikimedia.org/latest/?c=Jobrunners%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 and parsoid: http://ganglia.wikimedia.org/latest/?c=Parsoid%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 have taken a minor hit as well but it is expected with all the API instability [13:58:06] yay for cascading failures [13:58:18] yeah [13:58:32] <_joe_> well [13:58:35] domino effect :-) [13:58:46] fwiw, I'm waiting for all the panic to be over so that I can depool ulsfo [13:58:46] <_joe_> technically parsoid and jobrunners are back at work :) [13:58:47] akosiaris: yeah - cirrus heavily uses the jobs so I'd expect to see lots of failures there [13:58:50] <_joe_> they're not failing [13:58:55] I didn't turn that off [13:58:57] (so that we have the attention to deal with a second panic wave) [13:59:09] manybubbles: all of the machines started doing long GCs at the same time?
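To answer whether every node hit long GC pauses at roughly the same time, the [monitor.jvm] warnings quoted earlier can be pulled from each node's Elasticsearch log and compared by timestamp. A sketch using salt; the target glob and the log path (default file named after the cluster) are assumptions:
    # Last few GC warnings per node, to eyeball whether the pauses line up
    salt 'elastic10*' cmd.run \
      "grep 'monitor.jvm' /var/log/elasticsearch/production-search-eqiad.log | tail -n 5"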
[13:59:16] s/depool/repool/ [13:59:27] oh god [13:59:32] maybe I need coffee^Wtea first [13:59:38] ottomata: well, within a few minutes of each other I think. maybe not all either [13:59:44] _joe_: yeah, they are catching up now... [13:59:46] hm [14:00:18] RECOVERY - ElasticSearch health check on elastic1004 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2031: active_shards: 6088: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [14:00:39] <_joe_> ottomata: JVMs tend to synchronize themselves in a cluster, so that they have their disruptive GC cycles together. That's because Java is Enterprise(TM) [14:00:52] ha [14:01:32] <_joe_> (I've seen that a lot of times, the first node goes in heavy gc, so the traffic goes to the others, and that triggers more and more disruptive GC cycles everywhere) [14:01:46] RECOVERY - ElasticSearch health check on elastic1006 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2031: active_shards: 6088: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [14:01:46] (03CR) 10Alexandros Kosiaris: [C: 032] Disable l10nupdate for the duration of CLDR 26 plural migration [puppet] - 10https://gerrit.wikimedia.org/r/168255 (https://bugzilla.wikimedia.org/62861) (owner: 10Nikerabbit) [14:02:03] <_joe_> it's called "the stampede effect of cluelessness" [14:02:45] RECOVERY - ElasticSearch health check for shards on elastic1013 is OK: OK - elasticsearch status production-search-eqiad: status: green, number_of_nodes: 18, unassigned_shards: 0, timed_out: False, active_primary_shards: 2031, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 6088, initializing_shards: 0, number_of_data_nodes: 18 [14:03:55] PROBLEM - ElasticSearch health check on elastic1012 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.144 [14:04:04] PROBLEM - ElasticSearch health check for shards on elastic1012 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.144:9200/_cluster/health error while fetching: Request timed out. [14:04:35] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [14:04:40] ottomata: master eligible is 1002, 1014, 1007 [14:04:54] PROBLEM - ElasticSearch health check on elastic1006 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.113 [14:04:57] ok, i was trying to ask es for that, but couldn't get it because of timeout, i guess [14:05:01] i will update wikitech doc [14:05:54] PROBLEM - ElasticSearch health check for shards on elastic1013 is CRITICAL: CRITICAL - elasticsearch http://10.64.48.10:9200/_cluster/health error while fetching: Request timed out. [14:06:04] <_joe_> search works btw [14:06:11] <^d> ottomata: master eligible is also in puppet role. [14:06:19] <^d> (for next time) [14:06:23] <_joe_> long live lsearchd! [14:06:53] <^d> sssshhh, don't say that too loud or lsearchd might think we still actually like it ;-) [14:07:29] <^d> manybubbles: I'm going to run and grab breakfast + caffeine. Nothing is immediately jumping out at me and I have a feeling it's going to be a long morning. Back in 15ish.
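ottomata could not get the master list out of the cluster because the health call kept timing out; when a node does answer, the _cat endpoints report the elected master and each node's role in one line apiece. A sketch, assuming an Elasticsearch version with the _cat API, run from any elastic node:
    # Currently elected master, if any
    curl -s 'localhost:9200/_cat/master?v'
    # All nodes; in the master column '*' marks the elected master and 'm' master-eligible
    curl -s 'localhost:9200/_cat/nodes?v'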
[14:07:48] akosiaris: thanks [14:08:04] PROBLEM - ElasticSearch health check for shards on elastic1003 is CRITICAL: CRITICAL - elasticsearch http://10.64.0.110:9200/_cluster/health error while fetching: Request timed out. [14:08:05] PROBLEM - ElasticSearch health check on elastic1003 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.110 [14:08:43] Nikerabbit: :-) [14:09:34] PROBLEM - ElasticSearch health check on elastic1004 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.111 [14:10:05] RECOVERY - ElasticSearch health check on elastic1003 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2031: active_shards: 6088: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [14:13:24] PROBLEM - ElasticSearch health check on elastic1003 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.110 [14:14:10] <_joe_> manybubbles, ^d [2014-10-27 14:13:41,517][DEBUG][action.admin.cluster.health] [elastic1004] no known master node, scheduling a retry [14:14:18] <_joe_> after someone restarted it [14:14:22] <_joe_> not a good sign [14:16:48] <_joe_> my advice would be not to restart other nodes [14:17:07] <_joe_> or to restart them after you're pretty sure of what's going on [14:17:29] <^d> Are we split-brained? [14:18:06] <_joe_> ^d: I think we're in deep sh*t; it looks like no node is able to answer to its peers efficiently [14:18:18] RECOVERY - ElasticSearch health check on elastic1006 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2031: active_shards: 6088: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [14:18:39] _joe_, yeah, but why would that be..., i guess if they all did slow GCs and had indexing backed up, maybe they all were unresponsive enough? [14:19:01] or, maybe split brain happened like ^d suggested? could the unresponsiveness cause split brain? [14:19:17] if one master responded to some nodes, but not others? [14:19:23] <_joe_> ottomata: so the result is everything explodng? Yeah unresponsiveness can. [14:19:28] <^d> it could. [14:19:32] <_joe_> ottomata: "web scale"! [14:19:34] PROBLEM - ElasticSearch health check for shards on elastic1009 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.141:9200/_cluster/health error while fetching: Max retries exceeded for url: /_cluster/health [14:20:03] ha, welp, i have the OS (almost) installed on all 12 new nodes, i'm going to apply base puppet stuff in a second [14:20:34] PROBLEM - ElasticSearch health check on elastic1009 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.141 [14:21:37] PROBLEM - ElasticSearch health check on elastic1006 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.113 [14:24:25] (03PS1) 10Alexandros Kosiaris: nfs::netapp::home backwards compatibility [puppet] - 10https://gerrit.wikimedia.org/r/168959 [14:24:29] <^d> prelim e-mail sent to operations list so everyone's at least aware we've fallen back and things are bad. [14:24:30] <_joe_> so my hypothesis is - we were hit by a few very slow and expensive GCs, that caused some servers to become unresponsive, the master diverted traffic to the others, which got overloaded as well, and every server basically lost communications with everyone else intermittently. 
I have no idea what this means in terms of the cluster coming back in a usable state [14:25:13] (03CR) 10jenkins-bot: [V: 04-1] nfs::netapp::home backwards compatibility [puppet] - 10https://gerrit.wikimedia.org/r/168959 (owner: 10Alexandros Kosiaris) [14:25:22] (03Abandoned) 10Faidon Liambotis: icinga: add alert for high latency in api requests [puppet] - 10https://gerrit.wikimedia.org/r/118435 (owner: 10Matanya) [14:27:45] (03CR) 10Faidon Liambotis: [C: 04-1] "As others said, switch Java 7 for /usr/bin/java. Let's see what breaks." [puppet] - 10https://gerrit.wikimedia.org/r/153764 (owner: 10Hashar) [14:29:00] <_joe_> manybubbles: we should probably disable shard recovery for now, and roll-restart the whole thing and pray for the better. [14:29:31] <^d> +1 to disabling non-primary shard allocation [14:29:34] (03PS2) 10Alexandros Kosiaris: nfs::netapp::home backwards compatibility [puppet] - 10https://gerrit.wikimedia.org/r/168959 [14:30:32] <_joe_> ^d: I see a lot of [2014-10-27 14:12:50,265][WARN ][monitor.jvm ] [elastic1006] [gc][old][1590067][247] duration [2.5m], collections [1]/[2.5m], total [2.5m]/[1.7h], memory [29.8gb]->[29.2gb]/[29.9gb], all_pools {[young] [665.6mb]->[69.5mb]/[665.6mb]}{[survivor] [45.8mb]->[0b]/[83.1mb]}{[old] [29.1gb]->[29.1gb]/[29.1gb]} [14:30:45] (03CR) 10Faidon Liambotis: [C: 032] Re-pool ulsfo [dns] - 10https://gerrit.wikimedia.org/r/168954 (owner: 10Faidon Liambotis) [14:30:48] <_joe_> the duration is 2.5 microseconds or minutes? [14:30:54] <^d> i have no clue. [14:30:56] fyi, I'm repooling ulsfo [14:31:42] hah [14:31:46] bblack: right on time :) [14:31:52] hi :) [14:32:02] info: Zone wikimediacommons.net.: source rfc1035:wikimediacommons.net with serial 2014090915 loaded as authoritative [14:32:12] we get lots of that on restart with new config [14:32:20] gdnsd 2.x logging changes I suppose? [14:32:28] <^d> !log elasticsearch: disabling replica allocation, less things moving about if we restart cluster [14:32:30] yeah sorry, that needs an >/dev/null on the checkconfig in the authdns-update stuff I think [14:32:35] Logged the message, Master [14:32:53] <^d> Well shit. [14:32:54] <^d> "error" : "RemoteTransportException[[elastic1002][inet[/10.64.0.109:9300]][cluster/settings/update]]; nested: ProcessClusterEventTimeoutException[failed to process cluster event (cluster_update_settings) within 30s]; ", [14:33:24] <_joe_> ^d: curl -XPUT localhost:9200/_settings/ -d '{"cluster.routing.allocation.allow_rebalance" :"none"}' right? [14:33:42] <^d> No, "cluster.routing.allocation.enable": "primaries" [14:33:46] _joe_: m is minutes [14:33:53] <_joe_> manybubbles: oh shit. [14:34:38] <_joe_> I've seen GCs going for more than 10 minutes then [14:34:42] _joe_: I'me something like 15 minutes ahead of you guys - just trying to work through it [14:34:54] _joe_: I'm trying to find the _cause_ of the gcs [14:35:09] <_joe_> ok np [14:35:14] if we bounce the nodes that had them then we'll lose the ability to figure out why [14:35:38] <_joe_> yes, take your time [14:35:59] but that will fuck the cluster right up. [14:35:59] its not resiliant to the nodes sitting on stuff for that long. 
[14:35:59] it really ought to be given that its a distributed system [14:36:10] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Function 'require_package' does not return a value at /etc/puppet/modules/wikimania_scholarships/manifests/init.pp:41 on node zirconium.wikimedia.org [14:36:20] that's _joe_'s change [14:36:27] you get logs about how it receiving responsses for requests it sent too long ago [14:36:41] but I don't see anything wrong with it? [14:37:24] ottomata: are you dealing with the ES outage? if not, I have more bad news for you :) [14:37:41] Iterating over heap. This may take a while... [14:37:44] paravoid, i am not directly, no, i'm working on es nodes [14:37:47] whatcha got? [14:37:50] (new es nodes) [14:37:59] icinga is *full* of varnishkafka/esams delivery errors [14:38:38] ah [14:38:42] also an1021 kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 3.03162838962e-09 [14:38:45] yes, i see those, i think that is due to the ganglia server move [14:39:00] oh? [14:39:05] not sure exactly why yet, but ja [14:39:11] ganglia is missing esams data, [14:39:16] and icinga uses ganglia data for that [14:39:18] oh :( [14:39:27] akosiaris: ^ [14:40:29] <_joe_> paravoid: when did you see that error? [14:40:34] _joe_: found it, fixing it [14:40:35] <_joe_> (the puppet one) [14:40:52] <_joe_> paravoid: ugh thanks [14:41:33] regarding elasticsearch outage: one thing that we'll *have* to work on after this is the ganglia/graphite monitoring stuff. Ganglia seems pretty broken now so I can't do the kind of investigation this needs. [14:41:48] (03PS1) 10Faidon Liambotis: Fix require_package conversion typo [puppet] - 10https://gerrit.wikimedia.org/r/168963 [14:41:59] manybubbles: what is broken? [14:42:07] <_joe_> I must've missed it earlier while fixing neon [14:42:11] * _joe_ facepalms [14:42:17] PROBLEM - check if salt-minion is running on mw1119 is CRITICAL: PROCS CRITICAL: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [14:42:20] (03CR) 10Faidon Liambotis: [C: 032] Fix require_package conversion typo [puppet] - 10https://gerrit.wikimedia.org/r/168963 (owner: 10Faidon Liambotis) [14:42:40] paravoid: nothing is updating. its like its just sticking the values at a point and leaving them there forever. probably our fault, but its frustrating. [14:43:11] manybubbles: possibly not your fault, there was a big ganglia fuss on friday (european) night and we swapped into a new server [14:43:11] <_joe_> manybubbles: we do get most metrics from ES itself; when it doesn't respond, no metrics [14:43:14] RECOVERY - check if salt-minion is running on mw1119 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:43:20] or that :) [14:43:37] paravoid: I noticed the update to the ui [14:43:51] (03CR) 10Alexandros Kosiaris: [C: 032] nfs::netapp::home backwards compatibility [puppet] - 10https://gerrit.wikimedia.org/r/168959 (owner: 10Alexandros Kosiaris) [14:44:00] <^d> I've got WIP on beta to replca ganglia with graphite for those metrics. [14:44:01] (03PS1) 10BBlack: Kill info-level spam on authdns-update w/ gdnsd 2.x [puppet] - 10https://gerrit.wikimedia.org/r/168964 [14:44:04] <^d> Could do same in prod. [14:44:04] (03PS2) 10Faidon Liambotis: Fix require_package conversion typo [puppet] - 10https://gerrit.wikimedia.org/r/168963 [14:44:08] <^d> Then it's push not pull. [14:44:09] _joe_: no - its from before that. 
there are shapes in the graph it'd expect to see before something like that if this were a gradual thing. but the graphs aren't moving [14:44:09] (03CR) 10Faidon Liambotis: [V: 032] Fix require_package conversion typo [puppet] - 10https://gerrit.wikimedia.org/r/168963 (owner: 10Faidon Liambotis) [14:47:44] RECOVERY - puppet last run on zirconium is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [14:48:08] (03PS2) 10BBlack: Kill info-level spam on authdns-update w/ gdnsd 2.x [puppet] - 10https://gerrit.wikimedia.org/r/168964 [14:48:15] (03CR) 10BBlack: [C: 032 V: 032] Kill info-level spam on authdns-update w/ gdnsd 2.x [puppet] - 10https://gerrit.wikimedia.org/r/168964 (owner: 10BBlack) [14:49:55] PROBLEM - ElasticSearch health check on elastic1014 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2029: active_shards: 5074: relocating_shards: 0: initializing_shards: 48: unassigned_shards: 966 [14:50:05] PROBLEM - ElasticSearch health check on elastic1010 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2029: active_shards: 5074: relocating_shards: 0: initializing_shards: 48: unassigned_shards: 966 [14:50:23] * anomie sees nothing to SWAT this morning [14:50:25] PROBLEM - ElasticSearch health check on elastic1011 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 14: number_of_data_nodes: 14: active_primary_shards: 2008: active_shards: 4398: relocating_shards: 0: initializing_shards: 42: unassigned_shards: 1648 [14:50:25] PROBLEM - ElasticSearch health check on elastic1001 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 2008: active_shards: 4398: relocating_shards: 0: initializing_shards: 42: unassigned_shards: 1648 [14:50:36] PROBLEM - ElasticSearch health check on elastic1015 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 2008: active_shards: 4398: relocating_shards: 0: initializing_shards: 42: unassigned_shards: 1648 [14:50:36] PROBLEM - ElasticSearch health check on elastic1002 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 2008: active_shards: 4398: relocating_shards: 0: initializing_shards: 42: unassigned_shards: 1648 [14:50:37] PROBLEM - ElasticSearch health check on elastic1007 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 2008: active_shards: 4398: relocating_shards: 0: initializing_shards: 42: unassigned_shards: 1648 [14:50:44] <^d> anomie: probably good. [14:51:04] <^d> number of nodes 15? [14:51:06] <^d> wtf? 
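Going back to the 14:33 exchange between _joe_ and ^d: the setting ^d quotes, "cluster.routing.allocation.enable": "primaries", is a cluster-wide dynamic setting, so it is sent to the _cluster/settings endpoint as a transient update (which matches the cluster/settings/update timeout ^d then hit). A sketch of the call and its later reversal, again assuming a local node on 9200:

    # keep assigning primaries but stop allocating/moving replicas while nodes are bounced
    curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
      "transient": { "cluster.routing.allocation.enable": "primaries" }
    }'

    # once the cluster has settled, re-enable full allocation
    curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
      "transient": { "cluster.routing.allocation.enable": "all" }
    }'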
[14:51:24] (03PS1) 10Ottomata: Install base system on new elasticsearch nodes [puppet] - 10https://gerrit.wikimedia.org/r/168966 [14:52:04] (03PS2) 10Ottomata: Install base system on new elasticsearch nodes [puppet] - 10https://gerrit.wikimedia.org/r/168966 [14:52:04] PROBLEM - ElasticSearch health check for shards on elastic1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 2366 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 13, uunassigned_shards: 2327, utimed_out: False, uactive_primary_shards: 1944, ucluster_name: uproduction-search-eqiad, urelocating_shards: 0, uactive_shards: 3722, uinitializing_shards: 39, unumber_of_data_nodes: 13} [14:52:05] PROBLEM - ElasticSearch health check for shards on elastic1010 is CRITICAL: CRITICAL - elasticsearch inactive shards 2366 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 13, uunassigned_shards: 2327, utimed_out: False, uactive_primary_shards: 1944, ucluster_name: uproduction-search-eqiad, urelocating_shards: 0, uactive_shards: 3722, uinitializing_shards: 39, unumber_of_data_nodes: 13} [14:52:05] PROBLEM - ElasticSearch health check for shards on elastic1015 is CRITICAL: CRITICAL - elasticsearch inactive shards 2366 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 13, uunassigned_shards: 2327, utimed_out: False, uactive_primary_shards: 1944, ucluster_name: uproduction-search-eqiad, urelocating_shards: 0, uactive_shards: 3722, uinitializing_shards: 39, unumber_of_data_nodes: 13} [14:52:05] PROBLEM - ElasticSearch health check for shards on elastic1007 is CRITICAL: CRITICAL - elasticsearch inactive shards 2366 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 13, uunassigned_shards: 2327, utimed_out: False, uactive_primary_shards: 1944, ucluster_name: uproduction-search-eqiad, urelocating_shards: 0, uactive_shards: 3722, uinitializing_shards: 39, unumber_of_data_nodes: 13} [14:52:24] PROBLEM - ElasticSearch health check for shards on elastic1011 is CRITICAL: CRITICAL - elasticsearch inactive shards 2366 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 14, uunassigned_shards: 2327, utimed_out: False, uactive_primary_shards: 1944, ucluster_name: uproduction-search-eqiad, urelocating_shards: 0, uactive_shards: 3722, uinitializing_shards: 39, unumber_of_data_nodes: 14} [14:52:34] PROBLEM - ElasticSearch health check for shards on elastic1018 is CRITICAL: CRITICAL - elasticsearch inactive shards 2366 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 14, uunassigned_shards: 2327, utimed_out: False, uactive_primary_shards: 1944, ucluster_name: uproduction-search-eqiad, urelocating_shards: 0, uactive_shards: 3722, uinitializing_shards: 39, unumber_of_data_nodes: 14} [14:52:54] PROBLEM - ElasticSearch health check for shards on elastic1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 2366 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 15, uunassigned_shards: 2327, utimed_out: False, uactive_primary_shards: 1944, ucluster_name: uproduction-search-eqiad, urelocating_shards: 0, uactive_shards: 3722, uinitializing_shards: 39, unumber_of_data_nodes: 15} [14:52:54] PROBLEM - ElasticSearch health check for shards on elastic1014 is CRITICAL: CRITICAL - elasticsearch inactive shards 2366 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 15, uunassigned_shards: 2327, utimed_out: False, uactive_primary_shards: 1944, ucluster_name: uproduction-search-eqiad, urelocating_shards: 0, uactive_shards: 3722, uinitializing_shards: 39, unumber_of_data_nodes: 15} [14:53:45] ^d, 
manybubbles: not to add to your problems, but there's a report on #wikimedia-tech; geodata outage because of those ES troubles [14:53:45] (03PS3) 10Ottomata: Install base system on new elasticsearch nodes [puppet] - 10https://gerrit.wikimedia.org/r/168966 [14:54:02] paravoid: yeah - sorry - that one is just stuck I think [14:54:06] <^d> paravoid: Aware, obvious it would be. [14:54:15] yup [14:54:16] thanks [14:54:24] no solr to fallback to anymore :) [14:54:35] Look at all the SWATters, and all the zero patches. [14:54:44] good [14:54:47] <^d> 2 swatters are otherwise occupied :p [14:54:50] (03CR) 10Ottomata: [C: 032 V: 032] Install base system on new elasticsearch nodes [puppet] - 10https://gerrit.wikimedia.org/r/168966 (owner: 10Ottomata) [14:55:14] (03Abandoned) 10Rush: phab serve repos from /srv [puppet] - 10https://gerrit.wikimedia.org/r/168391 (owner: 10Rush) [15:00:05] manybubbles, anomie, ^d, marktraceur: Dear anthropoid, the time has come. Please deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141027T1500). [15:00:17] <^d> no swat, go away jouncebot. [15:00:22] ^d: that was me [15:00:23] Yay successful swat [15:00:43] <^d> manybubbles: ok. [15:01:43] (03PS1) 10Alexandros Kosiaris: Move uranium/nickel in the correct network stanza [puppet] - 10https://gerrit.wikimedia.org/r/168968 [15:02:35] !log restarted a bunch of the elasticsearch nodes that had their heap full. wasn't able to get a heap dump on any of them because they all froze while trying to get the heap dump. [15:02:41] Logged the message, Master [15:03:04] !log restarting gmond on all elasticsearch systems because stats aren't updating properly in ganglia and usually that helps [15:03:09] Logged the message, Master [15:04:03] (03CR) 10Alexandros Kosiaris: [C: 032] Move uranium/nickel in the correct network stanza [puppet] - 10https://gerrit.wikimedia.org/r/168968 (owner: 10Alexandros Kosiaris) [15:04:31] manybubbles: I have the feeling this ^ is going to help more [15:04:49] ottomata: this ^ should fix the esams ganglia data missing as well [15:04:52] akosiaris: tahnks! [15:05:00] akosiaris: ahhhh, oops [15:05:01] thanky ou [15:05:33] you would think review would catch these kind of errors... :-( [15:05:34] PROBLEM - puppet last run on cp3020 is CRITICAL: CRITICAL: Puppet has 1 failures [15:05:46] and I am the reviewer on that one :-( :-( :-( [15:06:16] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected [15:07:29] !log for posterity 10/18 of the elasticsearch servers had got the point where they couldn't free any heap. Its currently not clear to me why they did that. This caused the cluster to basically collapse. The master node kept beind unable to communicate with anyone because everyone was pausing for multiple minutes between replies. The cluster handshaking couldn't cope with that and promptly got itself into a state where nodes [15:07:29] were both part of the cluster and not part of the cluster at the same time. Thats bad. [15:07:35] Logged the message, Master [15:08:11] !log Its unclear how much of the master going haywire is something that'll be fixed in elasticsearch 1.4. They've done a lot of work there on the cluster state communication. [15:08:15] Logged the message, Master [15:08:52] ^d: good news! Elasticsearch is back to reasonably stable. Cluster is yellow and all the nodes are communicating again. 
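On the heap dumps that froze the nodes: dumping a roughly 30GB heap requires the JVM to reach a safepoint, and the live-objects-only variant additionally forces a full collection first, so a node already stuck in back-to-back old-generation GCs will appear to hang for the whole dump. A sketch of the usual invocation, with the pid and output path as placeholders:

    # find the Elasticsearch JVM pid
    pgrep -f org.elasticsearch.bootstrap.Elasticsearch

    # binary heap dump for later analysis; drop "live," to avoid forcing a full GC first
    jmap -dump:live,format=b,file=/var/tmp/es-heap.hprof <pid>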
[15:09:02] <^d> "The cluster handshaking couldn't cope with that and promptly got itself into a state where nodes were both part of the cluster and not part of the cluster at the same time. Thats bad." [15:09:08] <^d> ^ Quote of the week. [15:09:40] is yellow normal? :) [15:09:44] <^d> yellow is ok. [15:09:51] sounds like DHS :) [15:09:51] ^d: bad news! I dunno why we filled up the heap and started to fall over. I dunno why it was just on 10/18 of the nodes! [15:09:52] <^d> long as it doesn't stay yellow forever. [15:10:01] bblack: yellow is normal after you've been restarting things [15:10:04] <^d> Which 10? [15:11:04] PROBLEM - check if salt-minion is running on virt1002 is CRITICAL: PROCS CRITICAL: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [15:11:06] ^d: elastic1017, elastic1006, elastic1005, elastic1016, elastic1013, elastic1008, elastic1004, elastic1003, elastic1012, elastic1009 [15:12:01] <^d> That's all 3 racks, right. [15:12:14] PROBLEM - check if salt-minion is running on cp1037 is CRITICAL: PROCS CRITICAL: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [15:12:15] PROBLEM - check if salt-minion is running on dbstore1002 is CRITICAL: PROCS CRITICAL: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [15:12:15] PROBLEM - check if salt-minion is running on stat1002 is CRITICAL: PROCS CRITICAL: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [15:12:15] PROBLEM - check if salt-minion is running on cp1050 is CRITICAL: PROCS CRITICAL: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [15:12:33] pssh, salt whatever [15:13:24] RECOVERY - check if salt-minion is running on cp1037 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:13:24] RECOVERY - check if salt-minion is running on dbstore1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:13:24] RECOVERY - check if salt-minion is running on stat1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:13:24] RECOVERY - check if salt-minion is running on cp1050 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:14:14] RECOVERY - check if salt-minion is running on virt1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:15:04] PROBLEM - check if salt-minion is running on ocg1002 is CRITICAL: PROCS CRITICAL: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [15:15:05] PROBLEM - puppet last run on lvs2003 is CRITICAL: CRITICAL: Puppet has 1 failures [15:15:46] heading to a cafe [15:16:04] RECOVERY - check if salt-minion is running on ocg1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:18:05] PROBLEM - puppet last run on cp4012 is CRITICAL: CRITICAL: Puppet has 1 failures [15:18:28] <^d> manybubbles: Want green before switching traffic back? [15:18:36] ^d: probably a good idea [15:18:42] not _required_ but we probably should [15:19:47] <^d> We should be ready to switch traffic back though soon. Losing GeoData is bad [15:21:35] <^d> Wonder if we could take the throttle off recovery for a bit. [15:23:40] RECOVERY - puppet last run on cp3020 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [15:23:58] ^d: is geo not back? 
[15:24:12] ^d: or is it busted because we're not enabled [15:24:42] (03PS1) 10Alexandros Kosiaris: ganglia_new: remove manutius, add uranium [puppet] - 10https://gerrit.wikimedia.org/r/168970 [15:25:21] <^d> manybubbles: Busted cuz we're disabled. [15:25:26] <^d> Config could be nicer there anyway [15:25:54] do we run a dedicated master? [15:26:02] and would that help here anyway? [15:26:27] paravoid: we don't have dedicated master nodes. Its not clear if that would help. It'd be useful in a few cases but its not clear if here is one of them. [15:26:49] paravoid: we were thinking of using the zookeeper nodes in analytics as dedicated masters but no one pushed for it [15:26:54] like, doubling them up [15:27:12] I'm not sure if we'd like to mix those two roles [15:27:34] but if it's needed, we can procure extra hardware for this [15:27:41] paravoid: the trouble is that dedicated es masters dont' do much work at all. its a good job for a misc machine that you can through a bit of ram at [15:27:46] right [15:27:54] when is swat? [15:28:01] did i miss it? [15:28:07] aude: half an hour ago [15:28:08] no swat in the middle of an outage please [15:28:15] grrr [15:28:22] aude: I *think* we're safely out of the day. [15:28:22] * aude wants to abolish daylight savings [15:28:24] sorry :) [15:28:27] ok [15:28:45] we could reenable cirrus now I think. I'd wait longer if it werent' for geo. [15:29:15] (03CR) 10Alexandros Kosiaris: [C: 032] ganglia_new: remove manutius, add uranium [puppet] - 10https://gerrit.wikimedia.org/r/168970 (owner: 10Alexandros Kosiaris) [15:29:30] ^d: any way to _just_ send geo searchs to us? [15:29:54] <^d> lemme look [15:30:01] PROBLEM - check if salt-minion is running on wtp1023 is CRITICAL: PROCS CRITICAL: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [15:30:01] PROBLEM - check if salt-minion is running on db1044 is CRITICAL: PROCS CRITICAL: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [15:30:01] PROBLEM - check if salt-minion is running on cerium is CRITICAL: PROCS CRITICAL: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [15:30:01] PROBLEM - check if salt-minion is running on mc1002 is CRITICAL: PROCS CRITICAL: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [15:30:01] PROBLEM - check if salt-minion is running on labnet1001 is CRITICAL: PROCS CRITICAL: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [15:30:02] RECOVERY - puppet last run on cp4012 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [15:30:08] what's with all the salt errors? [15:30:10] apergos: that you? [15:30:40] RECOVERY - Varnishkafka log producer on amssq42 is OK: PROCS OK: 1 process with command name varnishkafka [15:30:44] <^d> manybubbles: $wmgUseCirrus || $wmgUseCirrusAsAlternative [15:30:49] <^d> Could turn on alternative everywhere. [15:30:50] (03PS1) 10Manybubbles: Reenable cirrus as an alternative after the outage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168974 [15:30:51] PROBLEM - puppet last run on amssq42 is CRITICAL: CRITICAL: Puppet has 1 failures [15:30:51] I would expect two rather than three but yes, I'm running a job [15:30:55] paravoid: [15:30:58] <^d> Heh, you're right with me. 
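For the dedicated-master discussion above: a dedicated master in elasticsearch.yml is simply a node that is master-eligible but holds no data, and with three master-eligible nodes the usual split-brain guard is a quorum of two. A sketch of the relevant settings (file path and values are assumptions, not taken from the production puppet config):

    cat >> /etc/elasticsearch/elasticsearch.yml <<'EOF'
    # dedicated master: eligible for election, stores no shards, so heavy GC on data nodes cannot take it down
    node.master: true
    node.data: false
    # quorum of master-eligible nodes (with 3 eligibles, 2) so a partitioned minority cannot elect its own master
    discovery.zen.minimum_master_nodes: 2
    EOF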
[15:30:59] oh [15:31:00] RECOVERY - check if salt-minion is running on wtp1023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:31:00] RECOVERY - check if salt-minion is running on db1044 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:31:00] RECOVERY - check if salt-minion is running on cerium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:31:00] RECOVERY - check if salt-minion is running on labnet1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:31:00] RECOVERY - check if salt-minion is running on mc1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:31:07] yeah maybe we need to adjust this check to not count jobs :) [15:31:30] root 13977 0.0 0.1 417732 41224 ? Ssl Oct20 3:00 /usr/bin/python /usr/bin/salt-minion [15:31:34] root 29263 0.0 0.1 417732 35088 ? Sl 15:31 0:00 /usr/bin/python /usr/bin/salt-minion [15:31:34] (03CR) 10Chad: [C: 032] Reenable cirrus as an alternative after the outage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168974 (owner: 10Manybubbles) [15:31:37] blergh, no way to differentiate them [15:31:40] ^d: can you sync that? [15:31:41] (03Merged) 10jenkins-bot: Reenable cirrus as an alternative after the outage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168974 (owner: 10Manybubbles) [15:31:44] <^d> on it [15:32:05] yeah [15:32:14] what we on't want is two actual minions on a host [15:32:21] nod [15:32:32] !log demon Synchronized wmf-config/InitialiseSettings.php: Enable Cirrus as secondary everywhere, brings back GeoData (duration: 00m 04s) [15:32:33] both responding (been there, seen that, been annoyed)... but they are going to look just like those [15:32:33] PROBLEM - check if salt-minion is running on snapshot1002 is CRITICAL: PROCS CRITICAL: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [15:32:37] Logged the message, Master [15:32:39] (03PS1) 10Alexandros Kosiaris: Specifiy mountpoint for bast1001's netapp::home mount [puppet] - 10https://gerrit.wikimedia.org/r/168975 [15:32:41] PROBLEM - check if salt-minion is running on mw1213 is CRITICAL: PROCS CRITICAL: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [15:32:44] yeah I've seen this too [15:32:52] 269 Matching Service Entries Displayed [15:32:58] christmas is early [15:33:00] PROBLEM - check if salt-minion is running on mw1140 is CRITICAL: PROCS CRITICAL: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [15:33:01] PROBLEM - check if salt-minion is running on lvs1001 is CRITICAL: PROCS CRITICAL: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [15:33:01] PROBLEM - check if salt-minion is running on mw1075 is CRITICAL: PROCS CRITICAL: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [15:33:07] humbug [15:33:15] elasticsearch, varnishkafka, salt [15:33:19] plus the usual suspects [15:33:30] RECOVERY - puppet last run on lvs2003 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [15:33:35] 347 Matching Service Entries Displayed [15:33:41] RECOVERY - check if salt-minion is running on mw1213 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:33:42] PROBLEM - check if salt-minion is running on mw1139 is CRITICAL: PROCS CRITICAL: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [15:33:42] PROBLEM - check if salt-minion is running on db1039 is 
CRITICAL: PROCS CRITICAL: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [15:33:42] PROBLEM - check if salt-minion is running on es1008 is CRITICAL: PROCS CRITICAL: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [15:33:51] well turning off the count for now is a good idea, this is silliness [15:33:54] (03CR) 10Alexandros Kosiaris: [C: 032] Specifiy mountpoint for bast1001's netapp::home mount [puppet] - 10https://gerrit.wikimedia.org/r/168975 (owner: 10Alexandros Kosiaris) [15:34:01] RECOVERY - puppet last run on amssq42 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [15:34:40] RECOVERY - check if salt-minion is running on snapshot1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:34:52] RECOVERY - check if salt-minion is running on mw1139 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:34:52] RECOVERY - check if salt-minion is running on db1039 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:34:52] RECOVERY - check if salt-minion is running on es1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:34:53] PROBLEM - puppet last run on mw1050 is CRITICAL: CRITICAL: Puppet has 1 failures [15:35:00] PROBLEM - puppet last run on bast1001 is CRITICAL: CRITICAL: Puppet has 26 failures [15:35:21] PROBLEM - check if salt-minion is running on mw1053 is CRITICAL: PROCS CRITICAL: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [15:35:21] PROBLEM - check if salt-minion is running on cp1057 is CRITICAL: PROCS CRITICAL: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [15:36:11] RECOVERY - check if salt-minion is running on mw1140 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:36:11] RECOVERY - check if salt-minion is running on lvs1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:36:21] PROBLEM - check if salt-minion is running on virt1004 is CRITICAL: PROCS CRITICAL: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [15:36:21] RECOVERY - check if salt-minion is running on mw1053 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:36:43] PROBLEM - check if salt-minion is running on mw1203 is CRITICAL: PROCS CRITICAL: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [15:36:43] PROBLEM - check if salt-minion is running on mw1076 is CRITICAL: PROCS CRITICAL: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [15:36:44] I could up the number for now, make it more than 4 [15:37:06] that would let the check, the original, and a couple other jobs go without mass whines [15:37:13] paravoid: ^^ ?
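For context on the flapping salt checks: the icinga check is a plain process count against a command-line regex, and apergos' fix simply widens the accepted range so that the minion plus a couple of forked job workers do not trip it. Roughly what such a check looks like (a sketch; the real command definition lives in the puppet repo):

    # OK while 1-4 matching processes exist, CRITICAL outside that range
    /usr/lib/nagios/plugins/check_procs -c 1:4 \
      --ereg-argument-array='^/usr/bin/python /usr/bin/salt-minion'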
[15:37:16] RECOVERY - check if salt-minion is running on mw1075 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:37:28] apergos: sure [15:37:30] RECOVERY - check if salt-minion is running on virt1004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:37:31] RECOVERY - check if salt-minion is running on cp1057 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:37:50] RECOVERY - check if salt-minion is running on mw1203 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:37:52] RECOVERY - check if salt-minion is running on mw1076 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:37:53] I mean, it won't catch the two-minion-daemons-running scenario [15:38:10] PROBLEM - check if salt-minion is running on elastic1017 is CRITICAL: PROCS CRITICAL: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [15:38:10] but if we can't catch it in some other way, it's better to not get spammed [15:38:17] for right now that's the thing [15:38:20] RECOVERY - puppet last run on bast1001 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [15:38:36] yup [15:39:01] RECOVERY - check if salt-minion is running on elastic1017 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:40:04] (03PS1) 10ArielGlenn: salt check: crit if over 4 processes [puppet] - 10https://gerrit.wikimedia.org/r/168978 [15:40:30] (03PS2) 10Faidon Liambotis: salt check: crit if over 4 processes [puppet] - 10https://gerrit.wikimedia.org/r/168978 (owner: 10ArielGlenn) [15:40:33] the right way is to see if two responses to a ping come back to the master [15:40:39] (03CR) 10Faidon Liambotis: [C: 032] salt check: crit if over 4 processes [puppet] - 10https://gerrit.wikimedia.org/r/168978 (owner: 10ArielGlenn) [15:41:11] (diff: "igure out" -> "figure out") [15:42:04] (03CR) 10ArielGlenn: [C: 032] salt check: crit if over 4 processes [puppet] - 10https://gerrit.wikimedia.org/r/168978 (owner: 10ArielGlenn) [15:43:13] ah, thanks [15:43:22] still need to clean this keyboard [15:43:47] :) [15:43:54] running puppet on neon [15:44:17] (03PS2) 10Glaisher: Raise account creation throttle for JNCF 2014 workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168817 (https://bugzilla.wikimedia.org/72518) [15:44:46] (03CR) 10Alexandros Kosiaris: "fixed in:" [puppet] - 10https://gerrit.wikimedia.org/r/167885 (owner: 10Dzahn) [15:50:10] RECOVERY - puppet last run on mw1050 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [15:59:58] manybubbles: on the monitoring fixes of your postmortem, could you add a "rethink the icinga failures so that we don't get an alert flood from all nodes" kind of item? :) [16:00:17] oh too late [16:03:18] (03CR) 10Andrew Bogott: [C: 032] Tampa cleanup - delete dynamicproxy::pmtpa role [puppet] - 10https://gerrit.wikimedia.org/r/168729 (owner: 10Dzahn) [16:04:12] ^d: oo, manybubbles is gone [16:04:18] 12 new ES nodes ready to go [16:04:24] haven't installed ES there yet [16:04:24] <^d> Hmm, ok [16:04:28] but we can do so whenever we are ready [16:05:02] <^d> I'd rather wait for the cluster to finish initializing and go green. [16:05:11] <^d> Otherwise, let's do it. 
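On adding the 12 new elasticsearch nodes only once the cluster is healthy: the health API can block until a target state is reached, which is the usual building block for that kind of one-node-at-a-time rollout. A sketch:

    # wait up to 10 minutes for the cluster to report green before touching the next node
    curl -s 'http://localhost:9200/_cluster/health?wait_for_status=green&timeout=10m&pretty'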
[16:05:12] yes please :) [16:05:27] thanks for the awesome response btw [16:05:36] i think waiting til the cluster is ready to go is good, if it is healing now [16:05:44] we opsens really should get some more experience with it so that we can at least do some tier 1 debugging [16:05:46] would be better to add these one at a time to a healthy cluster [16:06:18] ottomata: is https://gerrit.wikimedia.org/r/#/c/133695/ still relevant? [16:06:26] yeah, paravoid, i've been supporting for a while now, and still didn't really know what to really look at. I could see that something nasty with node communication was going on, but I wasn't sure why [16:06:28] I think we don't have a varnish (sub)module anymore :) [16:06:46] we don't, it isn't, really. i mean, i'd still like to make that happen, but somehow I doubt i will [16:06:55] abandon then? [16:06:57] i will probably give up on that and make a separate vagrant module [16:06:57] yeah... [16:07:03] I've also replied to you to https://gerrit.wikimedia.org/r/#/c/160480/ not sure if you've seen this [16:07:21] ah yeah, i did see that, i like the idea of a dry run, [16:07:39] i've been putting off working on that due to other things [16:07:51] should I abandon that for now and try again with that later? [16:08:02] <^d> https://secure.phabricator.com/P1402 [16:08:04] or should I leave it (i can remove reviewers til its ready) and work on it there [16:08:06] no it's fine [16:08:08] <^d> Bahhh [16:08:10] <^d> Wrong window. [16:08:13] <^d> And domain. [16:08:27] phabricator.com? [16:08:30] elasticsearch failures? [16:08:43] heh [16:08:51] <^d> I meant to paste to our pastebin on phab. [16:08:57] hah [16:09:08] maybe we need to theme our phabricator instance a bit [16:09:12] like a wikimedia logo :) [16:09:27] <^d> There's a task or 2 about that. [16:09:32] heh [16:10:14] <^d> Anyway, ~2200 more shards to go until green. [16:10:18] <^d> Doing about ~50 at a time. [16:15:55] (03PS1) 10Alexandros Kosiaris: Backup user home dirs [puppet] - 10https://gerrit.wikimedia.org/r/168981 [16:18:16] (03CR) 10Ottomata: Backup user home dirs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/168981 (owner: 10Alexandros Kosiaris) [16:19:28] (03CR) 10Alexandros Kosiaris: "Adding Ariel for his input regarding the Data Retention policy. Ariel bacula defaults are at role::backup::config, role::backup::host and " [puppet] - 10https://gerrit.wikimedia.org/r/168981 (owner: 10Alexandros Kosiaris) [16:20:27] (03CR) 10Alexandros Kosiaris: Backup user home dirs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/168981 (owner: 10Alexandros Kosiaris) [16:23:51] (03PS1) 10Giuseppe Lavagetto: hiera: mediawiki-based backend for labs (wip) [puppet] - 10https://gerrit.wikimedia.org/r/168984 [16:26:30] (03PS2) 10Alexandros Kosiaris: Backup user home dirs [puppet] - 10https://gerrit.wikimedia.org/r/168981 [16:27:49] (03PS2) 10Vogone: Create new user group for WMDE staff at wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168771 (https://bugzilla.wikimedia.org/72459) [16:30:48] (03CR) 10Vogone: Create new user group for WMDE staff at wikidatawiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168771 (https://bugzilla.wikimedia.org/72459) (owner: 10Vogone) [16:31:05] <^d> !log elasticsearch: temporarily raised node_concurrent_recoveries from 3 to 5. [16:31:10] Logged the message, Master [16:32:25] (03CR) 10ArielGlenn: "This looks like a good set of hosts (maybe we should check with the devs too). 
And we can tell them 'please don't keep logs or any sensit" [puppet] - 10https://gerrit.wikimedia.org/r/168981 (owner: 10Alexandros Kosiaris) [16:35:23] PROBLEM - ElasticSearch health check on elastic1017 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.39 [16:35:52] PROBLEM - ElasticSearch health check on elastic1007 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.139 [16:36:23] PROBLEM - ElasticSearch health check on elastic1011 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.143 [16:36:23] PROBLEM - ElasticSearch health check on elastic1016 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.13 [16:36:53] PROBLEM - ElasticSearch health check on elastic1010 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.142 [16:37:22] PROBLEM - ElasticSearch health check on elastic1001 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.108 [16:38:13] PROBLEM - ElasticSearch health check on elastic1015 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.12 [16:38:42] PROBLEM - ElasticSearch health check on elastic1011 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.143 [16:38:42] PROBLEM - ElasticSearch health check on elastic1016 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.13 [16:39:14] PROBLEM - ElasticSearch health check on elastic1010 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.142 [16:39:24] ^d: ?! [16:39:43] PROBLEM - ElasticSearch health check on elastic1001 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.108 [16:40:32] PROBLEM - ElasticSearch health check on elastic1015 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.12 [16:41:05] PROBLEM - ElasticSearch health check on elastic1017 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.39 [16:41:53] PROBLEM - ElasticSearch health check on elastic1001 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.108 [16:43:24] Hi, I have some issues reaching some wikipedia servers and I suspect my ISP to be a little boggus [16:43:57] does anyone have acces to 91.198.174.192 and could try to ping me? [16:44:38] ziirish: wanna send us traceroutes ? it will probably help more [16:44:48] sure [16:46:06] http://ziirish.info/p/?d3949eb9adb2d066#MagoubMkIHTvKC3evwCrxw= [16:46:26] (03PS1) 10BBlack: re-enable amssq42 text varnish backend [puppet] - 10https://gerrit.wikimedia.org/r/168990 [16:46:49] ^d: looks like some of the es servers have filled their heap again! [16:46:52] its reproducable! [16:46:58] (yay?) [16:47:04] <^d> ouch. [16:47:56] ziirish: so step 7 and onwards is killing all ICMP ? that will make debugging this way more difficult... [16:48:09] I should have one more hope (at least) at my ISP's routers (regarding another traceroute from people at the same ISP) [16:48:33] chasemp: yt [16:48:34] ? [16:48:52] hi [16:49:08] the thing is, other people are routed by londres-6k-1-po104.intf.routers.proxad.net whereas I'm rooted by londres-6k-1-po102.intf.routers.proxad.net [16:49:22] ziirish: where in the world are you? what isp? [16:49:23] so my guess is londres-6k-1-po102.intf.routers.proxad.net mess up with my packets... 
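Since akosiaris notes that hop 7 onward drops all ICMP, the usual workaround for tracing the path to a web endpoint is to probe with TCP toward a port that is actually open; and because the routers load-balance per flow (the equal-cost multipath mentioned just below), different source ports can legitimately take different paths, which is why ziirish and other customers see po102 versus po104. A sketch of TCP-based tracing toward the address ziirish pasted:

    # TCP SYN probes to port 443 instead of UDP/ICMP; needs root for raw sockets
    sudo traceroute -T -p 443 91.198.174.192

    # or interactively, with per-hop loss statistics
    sudo mtr --tcp --port 443 91.198.174.192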
[16:49:32] Free France [16:49:32] PROBLEM - ElasticSearch health check on elastic1011 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.143 [16:49:41] ziirish: probably equal cost multipath [16:49:49] that is load balancing [16:49:53] PROBLEM - ElasticSearch health check on elastic1017 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.39 [16:50:03] icinga-wm: I know.... [16:50:08] which is why others see different interface of pretty much the same router [16:51:10] oh [16:51:16] akosiaris: yes probably, but the result is I can't reach a bunch of website because of this messy route [16:51:17] hmmm [16:51:27] seems worse [16:51:30] gimme a sec [16:51:34] manybubbles, I had some GC problems with Solr - in my case GC started eating all of the CPU time - switching to MarkSweep GC helped. Are these problems related? [16:52:17] MaxSem: the default elasticsearch gc configuration is pretty optimized. The trouble is that _something_ is actually eating all the heap so I can't be swept. [16:52:26] eeh [16:52:58] (03CR) 10Aaron Schulz: [C: 031] Increase updatequerypages frequency to twice per month for all wikis. [puppet] - 10https://gerrit.wikimedia.org/r/168933 (owner: 10Springle) [16:54:12] PROBLEM - ElasticSearch health check on elastic1015 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.12 [16:56:13] (03CR) 10BBlack: [C: 032] re-enable amssq42 text varnish backend [puppet] - 10https://gerrit.wikimedia.org/r/168990 (owner: 10BBlack) [16:56:59] <^d> manybubbles: Did me raising recovery limits hurt stuff? [16:57:09] ^d: I dunno [16:57:10] probably not [16:58:42] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [17:00:53] PROBLEM - Varnishkafka log producer on amssq42 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [17:06:52] PROBLEM - ElasticSearch health check on elastic1001 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.108 [17:11:26] (03CR) 10Alexandros Kosiaris: "Not exactly. The 90 days stuff is actually irrelevant and has more to do with bacula and indexing than actually data retention. I was just" [puppet] - 10https://gerrit.wikimedia.org/r/168981 (owner: 10Alexandros Kosiaris) [17:12:13] PROBLEM - ElasticSearch health check on elastic1001 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.108 [17:13:58] (03CR) 10Alexandros Kosiaris: [C: 032] management/ipmi - move to module [puppet] - 10https://gerrit.wikimedia.org/r/168719 (owner: 10Dzahn) [17:15:07] (03PS1) 10Ottomata: Include research mysql user and password in file on stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/168993 [17:15:53] (03CR) 10jenkins-bot: [V: 04-1] Include research mysql user and password in file on stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/168993 (owner: 10Ottomata) [17:16:28] (03CR) 10Alexandros Kosiaris: [C: 032] NTP, lucene - remove pmtpa remnants [puppet] - 10https://gerrit.wikimedia.org/r/168730 (owner: 10Dzahn) [17:17:35] (03PS2) 10Ottomata: Include research mysql user and password in file on stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/168993 [17:19:51] (03PS1) 10Rush: phab update phd configuration [puppet] - 10https://gerrit.wikimedia.org/r/168994 [17:23:04] Can $instanceproject in realm.pp be used to target a specific wiki - say, the current beta wiki ? we wanted to have beta route through the new beta mx ( deployment-mx ) and that would involve changing the route_list of beta *only* to our new mx. 
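To MaxSem's question about switching collectors: the distinction manybubbles is drawing is that a different GC only helps when collection itself is the bottleneck, whereas here the old generation is full of objects that are still live, so no collector can reclaim them; the useful signal is per-node heap occupancy rather than the GC algorithm. One way to watch it, assuming the 1.x node-stats API on a local node:

    # per-node JVM stats; when jvm.mem.pools.old "used" pins at "max", there is nothing left to sweep
    curl -s 'http://localhost:9200/_nodes/stats/jvm?pretty'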
[17:23:47] (03PS1) 10Manybubbles: Set default location for heap dump for Elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/168995 [17:23:56] will something like if $realm == 'labs' { if($instanceproject == 'deployment-testing') { //change the route_list } } [17:24:02] work ? [17:26:30] (03CR) 10Ori.livneh: "ping" [puppet] - 10https://gerrit.wikimedia.org/r/165779 (owner: 10Ori.livneh) [17:26:41] (03PS1) 10Ori.livneh: update my (=ori) dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/168998 [17:26:42] pong! [17:26:43] (03PS1) 10Ori.livneh: mediawiki: tidy /tmp [puppet] - 10https://gerrit.wikimedia.org/r/168999 [17:27:22] <^d> manybubbles: Where we at? [17:27:23] (03CR) 10jenkins-bot: [V: 04-1] update my (=ori) dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/168998 (owner: 10Ori.livneh) [17:27:23] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.343333333333 [17:27:31] ^d: lost wifi and drove home [17:27:51] <^d> I see recovery halted. [17:28:12] ori: I want to -1 the 'update my dotfiles' patch saying 'you have no authority to decide what those files say' :p [17:28:14] ori: good idea, I don't like puppet doing it, though [17:28:14] I dunno why but I'm not supoer surprised [17:28:24] ori: so, package { 'tmpreaper': ensure => installed } instead? [17:28:39] I tried bouncing one of the machines to get more information about heap errors but that confused the master I think [17:28:39] paravoid: why not? [17:28:47] paravoid: i mean, why not puppet? [17:28:50] ^d: trying to capture a heap dump or _something_ [17:29:02] paravoid: (i actually kinda agree, but curious to see if we have the same rationale) [17:29:17] because I don't trust it for starters :) [17:29:28] vulnerabilities for instance, puppet runs as root [17:29:45] oh [17:29:48] paravoid: you don't wanna know how the tidy resource is implemented.. [17:30:03] it creates a File[] resource with ensure => absent for every file that matches the criteria [17:30:09] ffs [17:30:25] (03CR) 10John F. Lewis: [C: 031] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/168999 (owner: 10Ori.livneh) [17:30:36] tmpreaper should do the job just fine [17:31:09] paravoid: 'find' ditto [17:31:46] (03PS1) 10BBlack: allow zero-length scratch adds as well [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/169002 [17:32:10] ori: tmpreaper's default is even 1-week, so you don't even need to configure it [17:32:21] wfm [17:32:32] this will greatly help imagescalers [17:32:58] something that should be independently fixed though :) [17:34:58] (03CR) 10Chad: [C: 031] Set default location for heap dump for Elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/168995 (owner: 10Manybubbles) [17:36:11] the /tmp thing seems a bit scary to me [17:36:21] I can imagine all sorts of random things relying on dropping a file in /tmp and still seeing it a week later [17:36:33] like what? 
[17:36:36] (assuming no reboot or the process that created the file didn't die) [17:36:50] I don't have an example, I'm just saying, you're breaking basic assumptions that software might have [17:37:06] (that tmp dirs might get wiped on reboot, but not randomly at runtime) [17:37:34] we use relatime, which updates the atime if the current atime was more than a day ago or if mtime > atime [17:37:40] <_joe_> !log uploaded a version of jemalloc for trusty with --enable-prof [17:37:43] and puppet uses atime [17:37:51] Logged the message, Master [17:37:57] so it'd be a bit weird for a file not to be accessed but still be vital [17:38:18] not really [17:38:38] it's still reasonable for a process to drop a file in tmp, not update the atime for weeks, and then go back and look for it. [17:39:04] conceivable or reasonable? [17:39:25] reasonable, imho. your cleaner doesn't own those files. the process that created them does. [17:39:42] <_joe_> what if we did check for specific types of files that we know are being leaked? [17:39:59] <_joe_> there are similar janitor jobs for php5 btw [17:40:02] <_joe_> in the debian package [17:40:04] I don't think it's reasonable fwiw [17:40:11] or we could have those not be generic files in /tmp. put them in /tmp/my_software_sucks_and_leaves_trash_please_wipe_this_dir_periodically [17:41:07] but we should have them under a subdir yes [17:41:11] for security most of all [17:41:26] just from a basic *nix perspective, I don't think even for a tmpdir it's correct to assume you can arbitrarily wipe out files you know nothing about. and I don't think checking atime fixes that. [17:41:32] I had a discussion with Tim about libxml2 and its containment [17:42:01] (but it is reasonable to wipe a tmpdir on reboot, imho) [17:42:56] (03PS10) 10ArielGlenn: data retention audit script for logs, /root and /home dirs [software] - 10https://gerrit.wikimedia.org/r/141473 [17:43:01] (03CR) 10jenkins-bot: [V: 04-1] data retention audit script for logs, /root and /home dirs [software] - 10https://gerrit.wikimedia.org/r/141473 (owner: 10ArielGlenn) [17:44:48] (03CR) 10Dzahn: "yay! all looks good. thanks for the update. ☑☺" [puppet] - 10https://gerrit.wikimedia.org/r/167885 (owner: 10Dzahn) [17:45:05] (03PS4) 10ArielGlenn: script to monitor, clean up salt keys of deleted labs instances [puppet] - 10https://gerrit.wikimedia.org/r/168601 [17:46:59] (03PS2) 10Ori.livneh: update my (=ori) dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/168998 [17:48:16] (03CR) 10ArielGlenn: "sorry about this, I needed to be able to show those lists and it was silly to have three scripts. This is it though, no more updates to th" [puppet] - 10https://gerrit.wikimedia.org/r/168601 (owner: 10ArielGlenn) [17:48:30] (03CR) 10Aaron Schulz: "Any chance this could be deployed soon?" [puppet] - 10https://gerrit.wikimedia.org/r/167310 (owner: 10Aaron Schulz) [17:48:35] (03CR) 10Ori.livneh: [C: 032] update my (=ori) dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/168998 (owner: 10Ori.livneh) [17:51:45] PROBLEM - puppet last run on hafnium is CRITICAL: CRITICAL: Puppet has 1 failures [17:52:59] (03PS1) 10Rush: phab blackhole noreply@ emails [puppet] - 10https://gerrit.wikimedia.org/r/169008 [17:53:39] !log replacing disk /dev/sdl slot 11 ms-be1013 [17:53:44] Logged the message, Master [17:53:45] (03CR) 10Faidon Liambotis: [C: 04-1] "No reason to modify exim's conf for that, just create an alias." 
[puppet] - 10https://gerrit.wikimedia.org/r/169008 (owner: 10Rush) [17:55:05] (03PS1) 10Glaisher: Add 'mai' to langs.tmpl [dns] - 10https://gerrit.wikimedia.org/r/169011 (https://bugzilla.wikimedia.org/72346) [17:55:45] (03CR) 10Rush: "I'm happy to do it any other way, but I don't get where to create the alias?" [puppet] - 10https://gerrit.wikimedia.org/r/169008 (owner: 10Rush) [17:57:38] (03CR) 10MaxSem: "Ping!" [puppet] - 10https://gerrit.wikimedia.org/r/167453 (https://bugzilla.wikimedia.org/72186) (owner: 10MaxSem) [18:00:08] chasemp: I'll have a look after our meeting [18:00:17] hey thanks man [18:00:17] <^d> manybubbles: Another duh, but Phab search affected too. [18:00:30] yeah [18:00:56] I wonder if phab is doing something funky..... I'm just trying to figure out why it would have started this morning [18:08:02] (03PS1) 10Mforns: Add centralauth parameters to wikimetrics role [puppet] - 10https://gerrit.wikimedia.org/r/169018 [18:10:22] RECOVERY - puppet last run on hafnium is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [18:11:10] manybubbles: are you in office? [18:11:11] (03CR) 10Nuria: [C: 031] Add centralauth parameters to wikimetrics role [puppet] - 10https://gerrit.wikimedia.org/r/169018 (owner: 10Mforns) [18:11:16] matanya: I am [18:11:28] is katy love there too? [18:11:55] sorry - I mean I'm in my office [18:12:07] its nice here, but here isn't SF [18:12:43] (03CR) 10PiRSquared17: [C: 031] "I didn't find any errors." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168771 (https://bugzilla.wikimedia.org/72459) (owner: 10Vogone) [18:13:00] enjoy there manybubbles :) [18:13:29] (03CR) 10Aaron Schulz: [C: 032] Remove obsolete flags (all of them) from $wgAntiLockFlags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164012 (owner: 10PleaseStand) [18:13:57] (03Merged) 10jenkins-bot: Remove obsolete flags (all of them) from $wgAntiLockFlags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164012 (owner: 10PleaseStand) [18:15:17] !log aaron Synchronized wmf-config/CommonSettings.php: Remove obsolete flags (all of them) from $wgAntiLockFlags (duration: 00m 07s) [18:15:23] Logged the message, Master [18:15:42] PROBLEM - LVS Lucene on search-pool5.svc.eqiad.wmnet is CRITICAL: Connection timed out [18:15:55] <^d> uh oh [18:16:22] we're getting paged [18:16:27] for lsearchd [18:17:14] ^d, checking.. [18:17:19] not exactly sure what i'm looking for yet... [18:17:58] search is working.. [18:18:04] RECOVERY - LVS Lucene on search-pool5.svc.eqiad.wmnet is OK: TCP OK - 0.003 second response time on port 8123 [18:18:16] oook ^ [18:18:22] <^d> bleh, lsearchd. [18:18:30] i miss noc.wm.org/pybal... [18:18:34] weren't we gonna get that back? [18:18:43] ottomata: palladium:/srv [18:18:47] <_joe_> ottomata: config-master.wikimedia.org/pybal [18:18:52] OOO [18:18:53] yay! [18:19:14] <_joe_> didn't I update wikitech? [18:20:12] _joe_, you probably did, I didn't check it, i think i tried to find that last week but didn't try hard enough [18:20:16] PROBLEM - Apache HTTP on mw1142 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:20:26] PROBLEM - ElasticSearch health check on elastic1008 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 1972: active_shards: 3604: relocating_shards: 0: initializing_shards: 108: unassigned_shards: 2376 [18:20:27] PROBLEM - ElasticSearch health check on elastic1006 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 1972: active_shards: 3604: relocating_shards: 0: initializing_shards: 108: unassigned_shards: 2376 [18:20:27] PROBLEM - ElasticSearch health check on elastic1009 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 1972: active_shards: 3604: relocating_shards: 0: initializing_shards: 108: unassigned_shards: 2376 [18:20:50] PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 1972: active_shards: 3604: relocating_shards: 0: initializing_shards: 108: unassigned_shards: 2376 [18:20:50] PROBLEM - ElasticSearch health check on elastic1003 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 1972: active_shards: 3604: relocating_shards: 0: initializing_shards: 108: unassigned_shards: 2376 [18:20:57] PROBLEM - ElasticSearch health check on elastic1012 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 1972: active_shards: 3604: relocating_shards: 0: initializing_shards: 108: unassigned_shards: 2376 [18:21:07] PROBLEM - ElasticSearch health check on elastic1002 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 1972: active_shards: 3604: relocating_shards: 0: initializing_shards: 108: unassigned_shards: 2376 [18:21:16] PROBLEM - ElasticSearch health check on elastic1004 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 1972: active_shards: 3604: relocating_shards: 0: initializing_shards: 108: unassigned_shards: 2376 [18:21:17] PROBLEM - ElasticSearch health check on elastic1014 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 1972: active_shards: 3604: relocating_shards: 0: initializing_shards: 108: unassigned_shards: 2376 [18:21:17] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [18:21:17] PROBLEM - ElasticSearch health check on elastic1013 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 1972: active_shards: 3604: relocating_shards: 0: initializing_shards: 108: unassigned_shards: 2376 [18:21:38] <_joe_> how nice [18:21:46] PROBLEM - Apache HTTP on mw1121 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:58] PROBLEM - Apache HTTP on mw1132 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:22:10] PROBLEM - Apache HTTP on mw1125 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:22:39] PROBLEM - Apache HTTP on mw1127 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:22:42] PROBLEM - Apache HTTP on mw1129 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:22:43] PROBLEM - Apache HTTP on mw1139 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:22:43] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0] [18:22:56] RECOVERY - Apache HTTP on mw1121 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 1.229 second response time [18:22:58] PROBLEM - Apache HTTP on mw1138 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:23:17] PROBLEM - Apache HTTP on mw1131 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:23:26] PROBLEM - Apache HTTP on mw1146 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:23:26] RECOVERY - Apache HTTP on mw1127 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 5.627 second response time [18:23:27] RECOVERY - Apache HTTP on mw1129 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 3.268 second response time [18:23:27] PROBLEM - Apache HTTP on mw1122 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:23:36] RECOVERY - Apache HTTP on mw1139 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.679 second response time [18:23:46] PROBLEM - Apache HTTP on mw1126 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:23:56] PROBLEM - Apache HTTP on mw1115 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:23:57] RECOVERY - Apache HTTP on mw1138 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 1.171 second response time [18:24:07] RECOVERY - Apache HTTP on mw1132 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.094 second response time [18:24:12] RECOVERY - Apache HTTP on mw1131 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.079 second response time [18:24:12] RECOVERY - Apache HTTP on mw1125 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.108 second response time [18:24:26] RECOVERY - Apache HTTP on mw1146 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 1.036 second response time [18:24:26] RECOVERY - Apache HTTP on mw1122 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.119 second response time [18:24:57] RECOVERY - Apache HTTP on mw1115 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 9.048 second response time [18:25:37] RECOVERY - Apache HTTP on mw1142 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.081 second response time [18:25:46] RECOVERY - Apache HTTP on mw1126 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.091 second response time [18:25:46] RECOVERY - Varnishkafka log producer on amssq42 is OK: PROCS OK: 1 process with command name varnishkafka [18:26:58] (03PS2) 10Manybubbles: Elasticsearch java options for gc issues [puppet] - 10https://gerrit.wikimedia.org/r/168995 [18:27:32] any puppeter want to merge 
https://gerrit.wikimedia.org/r/#/c/168995/ for me? I want it to help figure out those elasticsearch issues [18:27:51] _joe_: it started happening again. we hadn't moved back from lsearchd [18:28:03] at least we'll be able to figure out what is causing it. OTOH it isn't a one time thing [18:28:31] blah, why do we even still have lsearchd? :) [18:28:39] (03PS3) 10BBlack: Elasticsearch java options for gc issues [puppet] - 10https://gerrit.wikimedia.org/r/168995 (owner: 10Manybubbles) [18:28:59] matanya: :) [18:29:08] (03CR) 10BBlack: [C: 032] Elasticsearch java options for gc issues [puppet] - 10https://gerrit.wikimedia.org/r/168995 (owner: 10Manybubbles) [18:29:11] <^d> MaxSem: Because otherwise nobody would have search right now :) [18:29:26] (03CR) 10BBlack: [V: 032] Elasticsearch java options for gc issues [puppet] - 10https://gerrit.wikimedia.org/r/168995 (owner: 10Manybubbles) [18:29:54] <_joe_> manybubbles: sorry we're all in a meeting [18:30:08] _joe_: its cool. thanks [18:30:27] PROBLEM - Apache HTTP on mw1136 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:31:01] (03PS2) 10BBlack: allow zero-length scratch adds as well [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/169002 [18:31:36] (03CR) 10BBlack: [C: 032 V: 032] allow zero-length scratch adds as well [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/169002 (owner: 10BBlack) [18:31:38] (03PS1) 10Aaron Schulz: Added labswiki to the dump skip list to avoid error spam [puppet] - 10https://gerrit.wikimedia.org/r/169030 [18:32:10] mw1008 mediawikiwiki Error connecting to 10.64.16.29: :real_connect(): (42000/1049): Unknown database 'mediawikiwiki' [18:32:12] hrm, odd [18:32:26] PROBLEM - Apache HTTP on mw1143 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:32:37] RECOVERY - Apache HTTP on mw1136 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 7.747 second response time [18:33:07] RECOVERY - Apache HTTP on mw1143 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 1.807 second response time [18:33:30] (03PS1) 10BBlack: Merge tag '1.0.6' into debian [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/169031 [18:33:32] (03PS1) 10BBlack: varnishkafka (1.0.6-1) trusty-wikimedia; urgency=medium [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/169032 [18:33:50] (03CR) 10BBlack: [C: 032 V: 032] Merge tag '1.0.6' into debian [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/169031 (owner: 10BBlack) [18:34:00] (03CR) 10BBlack: [C: 032 V: 032] varnishkafka (1.0.6-1) trusty-wikimedia; urgency=medium [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/169032 (owner: 10BBlack) [18:34:39] PROBLEM - Apache HTTP on mw1121 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:34:50] !log restarting elasticsearch servers to pick up new gc logging and to reset them into a "working" state so they can have their gc problem again and we can log it properly this time. [18:34:56] Logged the message, Master [18:35:36] RECOVERY - Apache HTTP on mw1121 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 6.872 second response time [18:36:56] PROBLEM - Apache HTTP on mw1136 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:37:01] !log note that this is a restart without waiting for the cluster to go green after each restart. I expect lots of whining from icinga. This will cause us to lose some updates but should otherwise be safe. 
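[Editor's note: the two !log entries above describe a rolling restart of the Elasticsearch cluster without waiting for green between nodes. For readers following along, the usual way to keep such a restart from turning into a full reshuffle is to toggle shard allocation around each node bounce and then poll cluster health. The sketch below shows that pattern against the 1.x-era cluster settings API using only the Python standard library; the localhost endpoint and the choice of a transient setting are illustrative assumptions, not the exact commands run during this incident.]

    import json
    import urllib.request

    ES = "http://localhost:9200"  # assumed local node; production traffic went via an LVS service name

    def put_cluster_setting(settings):
        """PUT a transient cluster-wide setting (dropped again on a full cluster restart)."""
        body = json.dumps({"transient": settings}).encode()
        req = urllib.request.Request(ES + "/_cluster/settings", data=body, method="PUT")
        return json.loads(urllib.request.urlopen(req).read().decode())

    # Before bouncing a node: stop the master from reallocating that node's shards elsewhere.
    put_cluster_setting({"cluster.routing.allocation.enable": "none"})

    # ... restart the Elasticsearch service on the node and wait for it to rejoin ...

    # Afterwards: allow allocation again so replicas can recover onto the restarted node.
    put_cluster_setting({"cluster.routing.allocation.enable": "all"})

    # Cluster-wide status (green/yellow/red plus shard counts) comes from the health endpoint.
    print(json.loads(urllib.request.urlopen(ES + "/_cluster/health").read().decode()))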
[18:37:06] Logged the message, Master [18:37:20] PROBLEM - Apache HTTP on mw1127 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:37:56] PROBLEM - Apache HTTP on mw1138 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:37:56] PROBLEM - Apache HTTP on mw1144 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:37:56] RECOVERY - Apache HTTP on mw1136 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.891 second response time [18:38:04] <^d> we've already lost a ton of updates as it is. [18:38:15] <^d> no big deal, we can recover that. [18:38:16] RECOVERY - Apache HTTP on mw1127 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.107 second response time [18:38:51] PROBLEM - Apache HTTP on mw1203 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:38:58] PROBLEM - Apache HTTP on mw1191 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:39:17] PROBLEM - Apache HTTP on mw1193 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:39:22] PROBLEM - Apache HTTP on mw1201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:39:57] PROBLEM - Apache HTTP on mw1120 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:40:07] PROBLEM - Apache HTTP on mw1147 is CRITICAL: Connection timed out [18:40:07] PROBLEM - Apache HTTP on mw1128 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:40:08] PROBLEM - Apache HTTP on mw1143 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:40:10] are apaches melting due to search? [18:40:16] PROBLEM - Apache HTTP on mw1207 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:40:16] PROBLEM - Apache HTTP on mw1121 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:40:17] PROBLEM - Apache HTTP on mw1196 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:40:17] PROBLEM - Apache HTTP on mw1148 is CRITICAL: Connection timed out [18:40:17] PROBLEM - Apache HTTP on mw1206 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:40:17] PROBLEM - Apache HTTP on mw1208 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:40:18] PROBLEM - Apache HTTP on mw1197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:40:30] PROBLEM - Apache HTTP on mw1125 is CRITICAL: Connection timed out [18:40:30] PROBLEM - Apache HTTP on mw1132 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:40:36] PROBLEM - Apache HTTP on mw1202 is CRITICAL: Connection timed out [18:40:36] PROBLEM - Apache HTTP on mw1133 is CRITICAL: Connection timed out [18:40:36] PROBLEM - Apache HTTP on mw1200 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:40:37] PROBLEM - Apache HTTP on mw1195 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:40:47] PROBLEM - Apache HTTP on mw1198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:40:49] PROBLEM - Apache HTTP on mw1139 is CRITICAL: Connection timed out [18:40:56] PROBLEM - Apache HTTP on mw1145 is CRITICAL: Connection timed out [18:40:56] PROBLEM - Apache HTTP on mw1135 is CRITICAL: Connection timed out [18:40:56] PROBLEM - Apache HTTP on mw1190 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:40:56] PROBLEM - Apache HTTP on mw1205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:40:57] PROBLEM - Apache HTTP on mw1194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:40:57] PROBLEM - Apache HTTP on mw1199 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:41:11] <_joe_> MaxSem: I guess so [18:41:26] PROBLEM - Apache HTTP on mw1192 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds [18:41:48] lsearchd is still on, why would app servers care so much...job runners? [18:41:54] <_joe_> manybubbles: we need to revert to lsearchd? [18:42:02] it thought we were still on lsearchd... [18:42:12] (or did I miss that) [18:42:17] _joe_: we should still be on it [18:42:23] <_joe_> earch.svc.eqiad.wmnet:9200 [18:42:26] PROBLEM - LVS HTTP IPv4 on api.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:42:29] <_joe_> manybubbles: I don't think so [18:42:39] <_joe_> at least, api servers beg to differ [18:42:47] I see a fuckton of Elastica errors in logs [18:42:53] _joe_: hmmm - I wonder if users are forcing it? [18:42:57] so definitely not on lsearchd [18:43:04] <_joe_> Oct 27 18:42:48 mw1145 apache2[25905]: PHP Warning: Unexpected connection error communicating with Elasticsearch. Curl code: 28 [Called from {closure} in /srv/mediawiki/php-1.25wmf4/extensions/Elastica/ElasticaConnection.php at line 94] in /srv/mediawiki/php-1.25wmf4/includes/debug/MWDebug.php on line 302 [18:43:07] PROBLEM - Apache HTTP on mw1127 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:43:13] PROBLEM - Apache HTTP on mw1142 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:43:13] PROBLEM - Apache HTTP on mw1134 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:43:13] RECOVERY - Apache HTTP on mw1194 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 1.016 second response time [18:43:16] RECOVERY - Apache HTTP on mw1199 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 2.788 second response time [18:43:16] PROBLEM - Apache HTTP on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:43:16] PROBLEM - Apache HTTP on mw1141 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:43:17] PROBLEM - Apache HTTP on mw1126 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:43:17] PROBLEM - Apache HTTP on mw1115 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:43:17] PROBLEM - Apache HTTP on mw1129 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:43:21] eh, geodata is still trying to use ES [18:43:29] RECOVERY - Apache HTTP on mw1207 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.059 second response time [18:43:37] PROBLEM - Apache HTTP on mw1130 is CRITICAL: Connection timed out [18:43:37] RECOVERY - Apache HTTP on mw1191 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.073 second response time [18:43:37] RECOVERY - LVS HTTP IPv4 on api.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 19592 bytes in 0.235 second response time [18:43:45] <_joe_> if we don't have other options I will blackhole connections to search [18:43:56] RECOVERY - Apache HTTP on mw1208 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 4.708 second response time [18:43:57] PROBLEM - Apache HTTP on mw1117 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:43:57] RECOVERY - Apache HTTP on mw1206 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 5.028 second response time [18:43:57] PROBLEM - Apache HTTP on mw1119 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:43:57] RECOVERY - Apache HTTP on mw1197 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.053 second response time [18:43:57] PROBLEM - Apache HTTP on mw1137 is CRITICAL: Connection timed out [18:43:57] RECOVERY - Apache HTTP on mw1193 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.104 second response time [18:43:58] 
PROBLEM - Apache HTTP on mw1136 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:44:06] RECOVERY - Apache HTTP on mw1201 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.080 second response time [18:44:07] RECOVERY - Apache HTTP on mw1200 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.099 second response time [18:44:07] RECOVERY - Apache HTTP on mw1202 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 3.487 second response time [18:44:07] RECOVERY - Apache HTTP on mw1195 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.084 second response time [18:44:15] _joe_: if its eating all the apaches we can do that. one moment [18:44:17] PROBLEM - Apache HTTP on mw1131 is CRITICAL: Connection timed out [18:44:20] PROBLEM - Apache HTTP on mw1124 is CRITICAL: Connection timed out [18:44:20] PROBLEM - Apache HTTP on mw1123 is CRITICAL: Connection timed out [18:44:26] PROBLEM - Apache HTTP on mw1146 is CRITICAL: Connection timed out [18:44:26] PROBLEM - Apache HTTP on mw1116 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:44:26] PROBLEM - Apache HTTP on mw1122 is CRITICAL: Connection timed out [18:44:27] PROBLEM - Apache HTTP on mw1118 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:44:27] RECOVERY - Apache HTTP on mw1205 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.065 second response time [18:44:33] live-hacking GD... [18:44:35] ^d: do you think we can hack something together that keeps geo from turning off but disallows selecting cirrus? [18:44:47] RECOVERY - Apache HTTP on mw1203 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.071 second response time [18:44:50] RECOVERY - Apache HTTP on mw1196 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.073 second response time [18:44:53] <_joe_> manybubbles: I'm removing all backends from the search pool maybe? [18:44:56] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 7.791 second response time [18:45:00] <^d> manybubbles: easily enough. 
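[Editor's note: the "are apaches melting due to search?" question above has a simple queueing explanation: every request that has to sit out a multi-second Elasticsearch timeout (the curl code 28 warnings, with their 10 second limit) pins a PHP worker for that long, and the Apache worker pool is finite. A back-of-the-envelope sketch with made-up request rates, purely for illustration:]

    # Little's law: busy workers ~= request rate * time each request holds a worker.
    search_rate = 200.0   # hypothetical search-backed requests per second across the app servers
    timeout_s = 10.0      # curl timeout seen in the PHP warnings above

    print(search_rate * timeout_s)   # 2000 workers stuck waiting on a hung backend
    print(search_rate * 0.05)        # ~10 workers when the backend answers in ~50 ms

[Depooling or "blackholing" the backend, as discussed above, works because it converts slow failures into fast ones, so workers are released almost immediately instead of waiting out the timeout.]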
[18:45:02] lsearchd is still the primary btw [18:45:05] <_joe_> or I can point the host to localhost [18:45:23] _joe_: I think ^d can do it a little more gracefully [18:45:24] RECOVERY - Apache HTTP on mw1198 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.063 second response time [18:45:26] <_joe_> I mean search.svc.eqiad.wmnet [18:45:35] <_joe_> manybubbles: this was the "bofh solution" [18:45:36] RECOVERY - Apache HTTP on mw1190 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.069 second response time [18:45:37] <_joe_> :) [18:47:57] !log maxsem Synchronized php-1.25wmf4/extensions/GeoData: live hack to disable geosearch (duration: 00m 04s) [18:48:05] Logged the message, Master [18:48:06] RECOVERY - Apache HTTP on mw1141 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 7.562 second response time [18:48:08] RECOVERY - Apache HTTP on mw1144 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.130 second response time [18:48:08] RECOVERY - Apache HTTP on mw1115 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 6.577 second response time [18:48:09] RECOVERY - Apache HTTP on mw1119 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 5.767 second response time [18:48:17] RECOVERY - Apache HTTP on mw1136 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 1.615 second response time [18:48:26] RECOVERY - Apache HTTP on mw1133 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.086 second response time [18:48:27] RECOVERY - Apache HTTP on mw1116 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.081 second response time [18:48:27] RECOVERY - Apache HTTP on mw1146 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.077 second response time [18:48:27] RECOVERY - Apache HTTP on mw1123 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 7.038 second response time [18:48:28] RECOVERY - Apache HTTP on mw1124 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 7.339 second response time [18:48:36] RECOVERY - Apache HTTP on mw1122 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.052 second response time [18:48:37] RECOVERY - Apache HTTP on mw1127 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.078 second response time [18:48:37] RECOVERY - Apache HTTP on mw1118 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.111 second response time [18:48:46] RECOVERY - Apache HTTP on mw1134 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.079 second response time [18:48:51] RECOVERY - Apache HTTP on mw1142 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.151 second response time [18:48:53] RECOVERY - Apache HTTP on mw1139 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.241 second response time [18:48:55] RECOVERY - Apache HTTP on mw1145 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.116 second response time [18:48:55] RECOVERY - Apache HTTP on mw1135 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.078 second response time [18:48:55] RECOVERY - Apache HTTP on mw1140 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.122 second response time [18:49:00] RECOVERY - Apache HTTP on mw1126 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.115 second response time [18:49:11] RECOVERY - Apache HTTP on mw1128 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.100 second response time [18:49:11] RECOVERY - Apache HTTP on 
mw1143 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.114 second response time [18:49:11] RECOVERY - Apache HTTP on mw1120 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.111 second response time [18:49:11] RECOVERY - Apache HTTP on mw1129 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.090 second response time [18:49:11] RECOVERY - Apache HTTP on mw1138 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.101 second response time [18:49:11] RECOVERY - Apache HTTP on mw1117 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.138 second response time [18:49:12] RECOVERY - Apache HTTP on mw1130 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.101 second response time [18:49:12] RECOVERY - Apache HTTP on mw1121 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.173 second response time [18:49:16] <_joe_> MaxSem: thanks for syncing that [18:49:21] RECOVERY - Apache HTTP on mw1147 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.148 second response time [18:49:30] RECOVERY - Apache HTTP on mw1137 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.126 second response time [18:49:33] RECOVERY - Apache HTTP on mw1148 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.147 second response time [18:49:33] RECOVERY - Apache HTTP on mw1132 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.110 second response time [18:49:33] RECOVERY - Apache HTTP on mw1125 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.124 second response time [18:49:33] RECOVERY - Apache HTTP on mw1131 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.142 second response time [18:49:35] (03CR) 10Rush: [C: 04-1] preamble script to read client address from HTTP_X_FORWARDED_FOR (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/168509 (owner: 1020after4) [18:50:10] PROBLEM - LVS HTTP IPv4 on search.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 404 bytes in 0.007 second response time [18:50:27] <_joe_> we know, icinga [18:50:44] ^d: can you disable cirrus as an option - that'll fix it for everyone who's selected it as a betafeature [18:52:04] <^d> It looks off...? [18:52:15] <^d> 'wmgUseCirrusAsAlternative' => array( [18:52:15] <^d> // 'default' => true, [18:52:15] <^d> // Falling back to lsearchd while Elasticsearch is freaking out. 
[18:52:17] <^d> 'default' => false, [18:52:19] <^d> ), [18:52:42] PROBLEM - ElasticSearch health check on logstash1001 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.138 [18:53:01] PROBLEM - ElasticSearch health check for shards on elastic1007 is CRITICAL: CRITICAL - elasticsearch inactive shards 5796 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 16, uunassigned_shards: 5748, utimed_out: False, uactive_primary_shards: 292, ucluster_name: uproduction-search-eqiad, urelocating_shards: 0, uactive_shards: 292, uinitializing_shards: 48, unumber_of_data_nodes: 16} [18:53:01] PROBLEM - ElasticSearch health check for shards on elastic1010 is CRITICAL: CRITICAL - elasticsearch inactive shards 5796 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 16, uunassigned_shards: 5748, utimed_out: False, uactive_primary_shards: 292, ucluster_name: uproduction-search-eqiad, urelocating_shards: 0, uactive_shards: 292, uinitializing_shards: 48, unumber_of_data_nodes: 16} [18:53:11] PROBLEM - ElasticSearch health check for shards on elastic1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 5674 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 16, uunassigned_shards: 5626, utimed_out: False, uactive_primary_shards: 414, ucluster_name: uproduction-search-eqiad, urelocating_shards: 0, uactive_shards: 414, uinitializing_shards: 48, unumber_of_data_nodes: 16} [18:53:11] PROBLEM - ElasticSearch health check for shards on elastic1009 is CRITICAL: CRITICAL - elasticsearch inactive shards 5630 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 16, uunassigned_shards: 5582, utimed_out: False, uactive_primary_shards: 458, ucluster_name: uproduction-search-eqiad, urelocating_shards: 0, uactive_shards: 458, uinitializing_shards: 48, unumber_of_data_nodes: 16} [18:53:11] PROBLEM - ElasticSearch health check for shards on elastic1011 is CRITICAL: CRITICAL - elasticsearch inactive shards 5630 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 16, uunassigned_shards: 5582, utimed_out: False, uactive_primary_shards: 458, ucluster_name: uproduction-search-eqiad, urelocating_shards: 0, uactive_shards: 458, uinitializing_shards: 48, unumber_of_data_nodes: 16} [18:53:12] PROBLEM - ElasticSearch health check for shards on elastic1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 5586 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 16, uunassigned_shards: 5538, utimed_out: False, uactive_primary_shards: 502, ucluster_name: uproduction-search-eqiad, urelocating_shards: 0, uactive_shards: 502, uinitializing_shards: 48, unumber_of_data_nodes: 16} [18:53:12] PROBLEM - ElasticSearch health check for shards on elastic1004 is CRITICAL: CRITICAL - elasticsearch inactive shards 5586 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 16, uunassigned_shards: 5538, utimed_out: False, uactive_primary_shards: 502, ucluster_name: uproduction-search-eqiad, urelocating_shards: 0, uactive_shards: 502, uinitializing_shards: 48, unumber_of_data_nodes: 16} [18:53:12] PROBLEM - ElasticSearch health check for shards on elastic1006 is CRITICAL: CRITICAL - elasticsearch inactive shards 5586 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 16, uunassigned_shards: 5538, utimed_out: False, uactive_primary_shards: 502, ucluster_name: uproduction-search-eqiad, urelocating_shards: 0, uactive_shards: 502, uinitializing_shards: 48, unumber_of_data_nodes: 16} [18:53:12] PROBLEM - ElasticSearch health check for shards on elastic1003 is CRITICAL: CRITICAL - elasticsearch 
inactive shards 5586 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 16, uunassigned_shards: 5538, utimed_out: False, uactive_primary_shards: 502, ucluster_name: uproduction-search-eqiad, urelocating_shards: 0, uactive_shards: 502, uinitializing_shards: 48, unumber_of_data_nodes: 16} [18:53:13] PROBLEM - ElasticSearch health check for shards on elastic1014 is CRITICAL: CRITICAL - elasticsearch inactive shards 5455 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 16, uunassigned_shards: 5407, utimed_out: False, uactive_primary_shards: 633, ucluster_name: uproduction-search-eqiad, urelocating_shards: 0, uactive_shards: 633, uinitializing_shards: 48, unumber_of_data_nodes: 16} [18:53:21] PROBLEM - ElasticSearch health check for shards on elastic1013 is CRITICAL: CRITICAL - elasticsearch inactive shards 5372 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 16, uunassigned_shards: 5324, utimed_out: False, uactive_primary_shards: 716, ucluster_name: uproduction-search-eqiad, urelocating_shards: 0, uactive_shards: 716, uinitializing_shards: 48, unumber_of_data_nodes: 16} [18:53:23] manybubbles, logstash is also affected? ^^^ [18:53:28] shouldn't be [18:53:32] RECOVERY - LVS HTTP IPv4 on search.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 387 bytes in 0.004 second response time [18:53:44] that isn't logstash [18:53:55] PROBLEM - ElasticSearch health check for shards on elastic1008 is CRITICAL: CRITICAL - elasticsearch inactive shards 5020 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 16, uunassigned_shards: 4972, utimed_out: False, uactive_primary_shards: 1068, ucluster_name: uproduction-search-eqiad, urelocating_shards: 0, uactive_shards: 1068, uinitializing_shards: 48, unumber_of_data_nodes: 16} [18:53:55] PROBLEM - ElasticSearch health check for shards on elastic1012 is CRITICAL: CRITICAL - elasticsearch inactive shards 4976 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 16, uunassigned_shards: 4928, utimed_out: False, uactive_primary_shards: 1112, ucluster_name: uproduction-search-eqiad, urelocating_shards: 0, uactive_shards: 1112, uinitializing_shards: 48, unumber_of_data_nodes: 16} [18:54:01] icinga-wm> PROBLEM - ElasticSearch health check on logstash1001 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.138 [18:54:05] PROBLEM - ElasticSearch health check for shards on elastic1015 is CRITICAL: CRITICAL - elasticsearch inactive shards 4675 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 17, uunassigned_shards: 4624, utimed_out: False, uactive_primary_shards: 1386, ucluster_name: uproduction-search-eqiad, urelocating_shards: 0, uactive_shards: 1413, uinitializing_shards: 51, unumber_of_data_nodes: 17} [18:54:13] PROBLEM - ElasticSearch health check for shards on elastic1005 is CRITICAL: CRITICAL - elasticsearch inactive shards 4638 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 17, uunassigned_shards: 4587, utimed_out: False, uactive_primary_shards: 1414, ucluster_name: uproduction-search-eqiad, urelocating_shards: 0, uactive_shards: 1450, uinitializing_shards: 51, unumber_of_data_nodes: 17} [18:54:47] (03PS1) 10Legoktm: Enable global user merge on beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169049 (https://bugzilla.wikimedia.org/47918) [18:57:21] !log completed restarting elasticsearch cluster. now it'll make a useful file on out of memory errors. 
raised the recovery throttling so it'll recover fast enough to cause oom errors [18:57:27] Logged the message, Master [19:02:02] mhm ClusterBlockException [19:03:53] (03PS3) 10Ottomata: Include research mysql user and password in file on stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/168993 [19:04:06] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [19:05:10] manybubbles, how do we know it's fixed? [19:07:45] MaxSem: hmmm - time unfortunately [19:07:52] then slowly adding things back [19:08:07] last time it took about an hour after we thought it was stable for it to spiral into crazy town [19:08:42] if that doesn't work we should add geo back I think [19:08:54] I don't expect what I did to fix it - just give me more information when it fails again [19:09:04] so if it doesn't fail again I'm going to be sad [19:23:21] (03PS1) 10Ottomata: Create analytics-roots group, add qchris [puppet] - 10https://gerrit.wikimedia.org/r/169190 [19:23:48] (03PS1) 10Aude: Get version for Wikibase cache key from the build [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169192 [19:24:01] (03CR) 10jenkins-bot: [V: 04-1] Create analytics-roots group, add qchris [puppet] - 10https://gerrit.wikimedia.org/r/169190 (owner: 10Ottomata) [19:24:59] (03CR) 10Aude: "this will work best with I5fcbb22 where we base the cache key there on the Wikidata branch version" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169192 (owner: 10Aude) [19:25:03] (03PS2) 10Ottomata: Create analytics-roots group, add qchris [puppet] - 10https://gerrit.wikimedia.org/r/169190 [19:25:41] (03PS1) 1001tonythomas: Made the beta clusters use deployment-mx for outgoing mail delivery [puppet] - 10https://gerrit.wikimedia.org/r/169194 [19:27:00] (03PS1) 10Ottomata: Create eventlogging-roots group, add qchris [puppet] - 10https://gerrit.wikimedia.org/r/169195 [19:34:19] (03PS2) 10Ottomata: Create eventlogging-roots group, add qchris [puppet] - 10https://gerrit.wikimedia.org/r/169195 [19:36:28] Keep getting this when trying to save a page: [19:36:28] Request: POST http://www.wikidata.org/w/index.php?title=Wikidata:Project_chat&action=submit, from 10.64.0.102 via cp1052 cp1052 ([10.64.32.104]:3128), Varnish XID 3190331356 Forwarded for: 145.99.155.163, 91.198.174.104, 208.80.154.77, 10.64.0.102 Error: 503, Service Unavailable at Mon, 27 Oct 2014 19:24:23 GMT [19:37:05] (03PS3) 10Ottomata: Create analytics-roots group, add qchris [puppet] - 10https://gerrit.wikimedia.org/r/169190 [19:39:26] !log after restarting elasticsearch we expected to get memory errors again. no such luck so far.... [19:39:33] Logged the message, Master [19:43:16] RECOVERY - ElasticSearch health check on logstash1001 is OK: OK - elasticsearch (production-logstash-eqiad) is running. status: green: timed_out: false: number_of_nodes: 3: number_of_data_nodes: 3: active_primary_shards: 41: active_shards: 113: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [19:50:58] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: puppet fail [20:00:04] gwicke, cscott, subbu: Respected human, time to deploy Parsoid/OCG (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141027T2000). Please do the needful. [20:07:25] MaxSem and ^d: elasticsearch is still recovering shards around but we haven't seen the issue. I suppose we have a few things we can test. 1. wait for all shards to relocate/cluster to go green. 2. reeanable geo. 3. reenable cirrus for betafeatures users. 
4.reenable cirrus for all users who used to have it [20:07:42] so I think we should do them in that order. except maybe skip #1. maybe. [20:07:58] ok, should I do GD now? [20:08:23] MaxSem: I suppose now is as good a time as any [20:08:32] if our issue comes back we'll have more information [20:08:39] both a heap dump and a variable that we changed to make it come back [20:10:10] !log maxsem Synchronized php-1.25wmf4/extensions/GeoData: GeoData back to normal (duration: 00m 03s) [20:10:11] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 61 seconds ago with 0 failures [20:10:15] manybubbles, ^ [20:10:15] Logged the message, Master [20:11:13] MaxSem: thanks. so far nice and boring [20:17:41] (03PS2) 10Dzahn: management/ipmi - move to module [puppet] - 10https://gerrit.wikimedia.org/r/168719 [20:18:41] (03CR) 10Dzahn: [C: 032] management/ipmi - move to module [puppet] - 10https://gerrit.wikimedia.org/r/168719 (owner: 10Dzahn) [20:23:05] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: Puppet has 1 failures [20:23:27] that's me [20:24:14] eh, the switch just needed 2 runs [20:24:15] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [20:26:56] !log updated OCG to version 60b15d9985f881aadaa5fdf7c945298c3d7ebeac [20:27:03] Logged the message, Master [20:27:37] (03PS1) 10Aaron Schulz: Stop GWT wgJobBackoffThrottling values from getting lost [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169207 [20:27:55] PROBLEM - puppet last run on iron is CRITICAL: CRITICAL: Puppet has 1 failures [20:28:56] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [20:31:07] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: Puppet has 1 failures [20:31:14] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [20:31:37] (03CR) 10Dzahn: rancid - move to module (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/168698 (owner: 10Dzahn) [20:32:31] (03PS4) 10Dzahn: rancid - move to module [puppet] - 10https://gerrit.wikimedia.org/r/168698 [20:33:05] (03CR) 10John F. Lewis: [C: 031] "Re-add +1" [puppet] - 10https://gerrit.wikimedia.org/r/168698 (owner: 10Dzahn) [20:35:45] !log deploy parsoid sha 617e9e61 [20:35:51] Logged the message, Master [20:36:18] i wish i could s/deploy/deployed/ .. is it okay to edit the server admin log? [20:36:47] subbu: Yes. But usually people don't bother [20:36:49] yea, edit it on wiki [20:36:56] ok. [20:36:58] i've also done it after typos [20:37:00] and there's history [20:38:05] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: Puppet has 1 failures [21:01:20] (03PS1) 10Dzahn: ipmi - fix path to file in module [puppet] - 10https://gerrit.wikimedia.org/r/169215 [21:02:19] (03CR) 10Dzahn: [C: 032] ipmi - fix path to file in module [puppet] - 10https://gerrit.wikimedia.org/r/169215 (owner: 10Dzahn) [21:03:13] PROBLEM - LVS Lucene on search-pool5.svc.eqiad.wmnet is CRITICAL: Connection timed out [21:03:30] paged again [21:03:38] manybubbles, ^d: status? [21:03:52] paravoid: huh. thats lsearchd [21:04:01] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [21:04:02] it is [21:04:07] it doesn't handle the load well [21:04:10] that's not new :) [21:04:10] :) [21:04:11] poor old horse coudn;t handle the load? [21:04:18] how's ES? 
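[Editor's note: "raised the recovery throttling" in the earlier !log refers to Elasticsearch's dynamic recovery settings; at the stock throttle, re-replicating a few thousand shards after a full-cluster bounce can take hours, which is why green is slow to arrive in the discussion that follows. The exact values used during this incident are not recorded in the log; the following is an illustrative sketch, reusing the put_cluster_setting() helper from the rolling-restart snippet above.]

    # Assumes the put_cluster_setting() helper defined in the earlier sketch.
    put_cluster_setting({
        # How fast each node may stream shard data while recovering; the default is deliberately conservative.
        "indices.recovery.max_bytes_per_sec": "200mb",
        # How many recoveries a single node will run in parallel.
        "cluster.routing.allocation.node_concurrent_recoveries": 4,
    })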
[21:04:26] alive so far [21:04:27] paravoid: fine now [21:04:42] RECOVERY - LVS Lucene on search-pool5.svc.eqiad.wmnet is OK: TCP OK - 1.000 second response time on port 8123 [21:04:44] we couldn't get it to crash again after we installed the config to get it to do the right thing when it crashes [21:04:48] hmmmmmmm [21:05:04] so are we switched to ES again? [21:05:19] nope, only GeoData and beta [21:05:26] paravoid: no - we were going to wait for the servers to go green but they are taking forever. [21:05:41] honestly we should start in on that [21:05:44] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [21:05:53] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [21:06:04] ^d: what do you think about kicking over the beta users again? [21:07:54] ^d: do you want to do the migrating stuff and watch for a while? I'm going to go out to dinner in about an hour I think. Then I'll be back this evening [21:08:21] jouncebot: next [21:08:21] In 1 hour(s) and 51 minute(s): SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141027T2300) [21:09:26] I'm still thnking those are at midnight here... damn timezones [21:09:33] (03PS2) 10Dzahn: role/deployment - remove pmtpa [puppet] - 10https://gerrit.wikimedia.org/r/167888 [21:10:29] JohnLewis: They will be again next week [21:10:46] RoanKattouw: ^ small change to deployment itself, just killing pmtpa [21:10:55] in the unlikely event ... [21:11:06] RoanKattouw: good :D [21:11:48] (03CR) 10Dzahn: [C: 032] role/deployment - remove pmtpa [puppet] - 10https://gerrit.wikimedia.org/r/167888 (owner: 10Dzahn) [21:13:34] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [21:14:02] (03PS1) 10Manybubbles: Reenable cirrus for all users that used to have it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169219 [21:14:24] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 65 data above and 9 below the confidence bounds [21:15:44] <_joe_> AaronS: what's the patch you need a review of? [21:16:05] <_joe_> AaronS: add me as a reviewer [21:16:23] _joe_: you are, it's https://gerrit.wikimedia.org/r/#/c/167310/ [21:17:00] <_joe_> AaronS: ah ok, not in my list... dunno why [21:17:29] <_joe_> AaronS: I'm not actually [21:17:35] * _joe_ giuseppe :) [21:17:43] <_joe_> I'll ping filippo tomorrow [21:17:58] <_joe_> today it was bank holiday in IE [21:22:01] _joe_: huh, I though it was both of you guys [21:25:24] RECOVERY - puppet last run on iron is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [21:26:49] (03CR) 10Chad: [C: 031] Reenable cirrus for all users that used to have it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169219 (owner: 10Manybubbles) [21:27:17] (03CR) 10Manybubbles: "You wanna deploy this and keep an eye on it tonight?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169219 (owner: 10Manybubbles) [21:28:24] <^d> manybubbles: I can ^ [21:28:35] ^d: sweet [21:28:38] <^d> Maybe we'll see what everyone else says in a minute tho. [21:30:46] manybubbles: so cluster is better now? [21:31:42] manybubbles: not sure if you saw this before, but 12 new nodes are ready [21:31:43] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: Puppet has 1 failures [21:33:51] ottomata: yeah! 
we had that outage and I didn't want add the servers in case it changed things. what do you think of added them this morning? [21:33:54] soryr, in the morning [21:34:09] icinga-wm: no [21:34:40] ^d: sounds like its ok [21:34:43] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [21:34:45] yeah, i'd like to do it with one of yall around [21:34:49] at least for the first few [21:34:52] <^d> manybubbles: Yeah, let's add it to swat. [21:34:53] to make sure i'm doing it right [21:34:54] but ja [21:34:54] sure [21:35:07] did you figure out what was up with the outage? [21:35:07] I'll be here in the morning [21:35:30] ottomata: no. not really:( we know the cause but not the cause's cause [21:35:37] <_joe_> ottomata: heap starvation, the cause of which is unknown [21:35:39] we added instrumentation so we can get further the next time [21:36:01] aye [21:36:03] but it didn't come back unfortunately [21:36:03] <_joe_> because - surprise! when the jvm is jammed you're not able to obtain an heap dump [21:36:12] _joe_: such a bitch [21:36:46] (03PS2) 10Dzahn: contacts/outreach - move to module [puppet] - 10https://gerrit.wikimedia.org/r/168713 [21:36:51] <_joe_> manybubbles: I know all the java delightful features to make the ops life horrible [21:36:53] elasticsearch is configured to dump a heap file when the heap is _totally_ fucked. but it doesn't specify where (so / maybe) nor how bad. We did that today [21:37:23] <_joe_> manybubbles: which GC are we using btw? the one ES suggests I guess? [21:37:24] _joe_: if it _weren't_ hung all up it'd be pretty nice. [21:37:28] _joe_: yeah [21:37:48] (03CR) 10Dzahn: [C: 032] "number of users is something between 0 and 1 , afaik" [puppet] - 10https://gerrit.wikimedia.org/r/168713 (owner: 10Dzahn) [21:37:59] ParNew/CMS - pretty much the default from the deb package [21:38:15] <_joe_> you know that setting is like weather forecasting, right? [21:38:56] honestly the settings _shouldn't_ be a bit deal to us. maybe helpful from a performance optimization standpoint (yeah, like weather forecasting) but what hit us today was requests that never let go [21:39:12] Are there GC implementations that don't suck? [21:39:19] <_joe_> manybubbles: yes, I don't think that was a "GC" issue [21:40:16] I'm starting to really think of using a GC as a software anti-pattern for stability/performance (which means using languages that force GCs are an anti-pattern in general). [21:40:28] bblack: the one in jvm isn't optimized for heaps > 8GB or so. they tried to make one that was better but it kinda doesn't work right. one day? who knows. The jvm's one is _probably_ the best around but they all have trouble. [21:40:38] <_joe_> bblack: is there another GC implementation that sucks as much as the jvm one? [21:40:43] * gwicke tends to agree with bblack, especially if the heaps can get large [21:41:00] <_joe_> manybubbles: I had good experiences with G1 honestly [21:41:05] <_joe_> and horrible ones as well [21:41:09] I saw a similar pattern when I stress-tested Cassandra [21:41:27] gwicke: cassandra is known the be a real bitch [21:41:29] * gwicke is keeping an eye on Rust [21:41:45] yeah - I'd like to use rust one day too. [21:42:05] <_joe_> why rust? [21:42:14] manybubbles: would more hw help to keep the load per node low enough to avoid getting into the run-off zone? 
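[Editor's note: the instrumentation referenced here (the "Elasticsearch java options for gc issues" change, gerrit 168995) comes down to standard HotSpot options for capturing evidence when the heap fills up: an explicit destination for -XX:+HeapDumpOnOutOfMemoryError plus GC logging. The exact flags in that change are not quoted in the log; the lines below are an illustrative set of the usual options, typically appended to ES_JAVA_OPTS or the service's defaults file, with explanatory comments added here.]

    -XX:+HeapDumpOnOutOfMemoryError          # write an .hprof file when an OutOfMemoryError is thrown
    -XX:HeapDumpPath=/var/log/elasticsearch  # give the dump an explicit, writable destination
    -verbose:gc -XX:+PrintGCDetails          # record every collection with pause times
    -XX:+PrintGCDateStamps -Xloggc:/var/log/elasticsearch/gc.log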
[21:42:50] this made me dig up: http://paste.debian.net/128973/ (which is from some random person's comment I remembered from an old LWN article: http://lwn.net/Articles/274859/) [21:43:23] gwicke: in this case we don't know what pushed things into the run off zone. we have 2x the servers just waiting to be installed but I didn't want to do it today because I wanted the problem to happen again if it was going to happen again so I could get a heap dump [21:44:11] in Cassandra heavy writes can trigger something like that; the only thing that helped was limiting the write parallelism [21:44:50] adding more nodes has a similar effect [21:44:51] _joe_: rust seems like the most expressive you are going to get without built in gc. [21:45:13] PROBLEM - puppet last run on zirconium is CRITICAL: CRITICAL: puppet fail [21:45:32] _joe_: it's got a pretty neat static ownership / lifetime system [21:45:53] which enables it to support pretty clean & safe code without a GC [21:46:14] RECOVERY - puppet last run on zirconium is OK: OK: Puppet is currently enabled, last run 61 seconds ago with 0 failures [21:46:25] sadly good concurrency (as opposed to IO) support is deferred until after 1.0 [21:46:39] argh, as opposed to parallelism [21:46:47] they went back to 1:1 threading [21:47:11] gwicke: ah - didn't see that they did that. oh well. [21:47:14] this is the plan for IO: https://github.com/rust-lang/rfcs/issues/388 [21:47:41] imagine if you applied GC-like ideas about managing memory resources to managing pthread lock resources. You'd have a thread library runtime that tries to infer shared memory contention based on thread access patterns and automatically performed mutex locks behind your back, and you'd never have to worry about locking again! :) [21:48:24] PROBLEM - puppet last run on iron is CRITICAL: CRITICAL: Puppet has 1 failures [21:48:33] yeah, the Rust compiler catches such issues [21:49:06] (03PS3) 10Faidon Liambotis: fix cert mismatch on mail.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/154223 (https://bugzilla.wikimedia.org/44731) (owner: 10Jeremyb) [21:50:44] PROBLEM - LVS Lucene on search-pool5.svc.eqiad.wmnet is CRITICAL: Connection timed out [21:51:48] the other pretty attractive thing about Rust is its performance: http://benchmarksgame.alioth.debian.org/u64q/benchmark.php?test=all&lang=rust&data=u64q [21:52:05] RECOVERY - LVS Lucene on search-pool5.svc.eqiad.wmnet is OK: TCP OK - 3.005 second response time on port 8123 [21:52:26] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: Puppet has 1 failures [21:53:54] (03PS4) 10Faidon Liambotis: Redirect mail.wikipedia.org to lists.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/154223 (https://bugzilla.wikimedia.org/44731) (owner: 10Jeremyb) [21:54:12] paravoid: you have the pager? we're likely going to be failing back over to cirrus soon. ^d will watch it. if something goes wrong and you get paged again can you call me or make someone in the US do it? [21:54:26] like, if ^d isn't around [21:54:27] do what? [21:54:31] revert to lsearchd? 
[21:54:44] paravoid: revert from lsearchd to cirrus [21:55:14] so if that fails and pages you again and neither of us is online [21:55:33] ironically, ES doesn't page us, lsearch does :) [21:55:39] (03PS1) 10Jalexander: Adjustments to voteWiki config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169229 (https://bugzilla.wikimedia.org/72589) [21:55:49] but we do monitoring this channel and if all hell breaks loose we get paged for API failures [21:55:53] do monitor* [21:55:55] Hey opsen, speaking of paging [21:55:58] man something's wrong with me today [21:55:59] I hear you have paging groups now? [21:56:19] oh well. just so its in the public channel - if it breaks tonight have someone call me. I'm totally on the contact page on officewiki [21:56:29] paravoid: I see your irc output buffer was written in Java [21:56:37] PROBLEM - LVS Lucene on search-pool5.svc.eqiad.wmnet is CRITICAL: Connection timed out [21:57:48] RECOVERY - LVS Lucene on search-pool5.svc.eqiad.wmnet is OK: TCP OK - 0.999 second response time on port 8123 [21:58:40] (03CR) 10John F. Lewis: Adjustments to voteWiki config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169229 (https://bugzilla.wikimedia.org/72589) (owner: 10Jalexander) [21:58:47] jamesofur ^ :) [21:59:00] If that's true then I'd like to remove myself from the "all pages" group and only get pages for Parsoid and sca or something like that [21:59:09] _joe_: here? [21:59:19] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [21:59:27] or ori maybe? [21:59:28] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: Puppet has 1 failures [21:59:40] <_joe_> paravoid: more or less [21:59:50] or mutante, I'll ask anyway :) [21:59:58] are the sync-apache docs still relevant? [22:00:12] paravoid: what's up? [22:00:16] oh, probably not. let me look. [22:00:17] with all the moving configs to modules/mediawiki work? [22:00:26] there's a modules/apachesync [22:00:28] PROBLEM - Disk space on ocg1001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=72%): [22:00:30] in puppet [22:00:59] how does one deploy changes such as https://gerrit.wikimedia.org/r/#/c/154223/4 these days? 
[22:01:18] JohnLewis: nah uh [22:01:29] by default crats can not remove/add sysop/crat [22:01:35] <_joe_> paravoid: I updated wikitech when we changed that [22:01:58] ah https://wikitech.wikimedia.org/wiki/Apaches you mean [22:02:06] there's still https://wikitech.wikimedia.org/wiki/Wikimedia_binaries#sync-apache plus modules/apachesync [22:02:16] which is still included on tin [22:02:23] <_joe_> paravoid: https://wikitech.wikimedia.org/wiki/Apaches#Deploying_config [22:02:42] <_joe_> paravoid: uhm sorry I didn't know about that [22:02:58] no worries, I'm just trying to figure things out [22:03:04] fun things happen when you're out for months :) [22:03:21] <_joe_> paravoid: eh, sorry [22:03:27] <^d> Gonna be a happy fun evening I'm sure :) [22:03:48] (03CR) 10Faidon Liambotis: [C: 032] Redirect mail.wikipedia.org to lists.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/154223 (https://bugzilla.wikimedia.org/44731) (owner: 10Jeremyb) [22:03:50] apachesync can be removed, but we wanted to keep apache-fast-test and something else from that module [22:04:00] that's why didn't want to delete the entire thing yet [22:04:30] cool that you are doing that redirect [22:04:37] jamesofur: one of us is confused and it isn't me here :) You've add 'bureaucrat => array( ..., 'sysop', 'bureaucrat' ) under a definition where sysop and bureaucrat are already defined. Looking at the wiki - what I commented is correct :) [22:04:44] *added [22:04:48] JohnLewis: ah, I switched them, that is supposed to be in the removeGroups section and [22:04:57] :D [22:04:59] vice versa [22:05:03] * jamesofur adjusts [22:05:08] fix that and it can have a +1 ;) [22:06:38] RECOVERY - puppet last run on iron is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [22:07:50] PROBLEM - LVS Lucene on search-pool5.svc.eqiad.wmnet is CRITICAL: Connection timed out [22:08:30] blar, lucene [22:09:21] RECOVERY - LVS Lucene on search-pool5.svc.eqiad.wmnet is OK: TCP OK - 0.003 second response time on port 8123 [22:10:05] (03PS2) 10Jalexander: Adjustments to voteWiki config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169229 (https://bugzilla.wikimedia.org/72589) [22:10:28] (03PS5) 10Dzahn: rancid - move to module [puppet] - 10https://gerrit.wikimedia.org/r/168698 [22:11:06] (03CR) 10John F. Lewis: [C: 031] "Looks good." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169229 (https://bugzilla.wikimedia.org/72589) (owner: 10Jalexander) [22:11:52] jamesofur ^ enjoy the magic of a +1. Could be deployed in 50 minutes during the SWAT window if you want to throw it onto it. 
[22:12:00] (03CR) 10Aaron Schulz: [C: 032] Stop GWT wgJobBackoffThrottling values from getting lost [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169207 (owner: 10Aaron Schulz) [22:12:02] * jamesofur nods [22:12:05] adding [22:12:07] (03Merged) 10jenkins-bot: Stop GWT wgJobBackoffThrottling values from getting lost [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169207 (owner: 10Aaron Schulz) [22:12:16] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: Puppet has 1 failures [22:12:24] !log aaron Synchronized wmf-config/CommonSettings.php: Stop GWT wgJobBackoffThrottling values from getting lost (duration: 00m 03s) [22:12:31] Logged the message, Master [22:14:10] (03PS3) 10Legoktm: Adjustments to votewiki config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169229 (https://bugzilla.wikimedia.org/72589) (owner: 10Jalexander) [22:15:25] <_joe_> jgage: look at the pybal log for lucene, there are probably servers that are not working but still pooled [22:15:32] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [22:16:23] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [22:17:46] (03CR) 10Dzahn: [C: 032] "ran in compiler, fixed nitpicks from Alex" [puppet] - 10https://gerrit.wikimedia.org/r/168698 (owner: 10Dzahn) [22:18:56] !log aaron Synchronized php-1.25wmf4/maintenance: 64fe61e0dbfea84d2bab4c17bf01f5dfdf5cc3b5 (duration: 00m 04s) [22:19:00] Logged the message, Master [22:21:46] !log Running cleanupBlocks.php on all wikis [22:21:52] Logged the message, Master [22:22:59] PROBLEM - puppet last run on zirconium is CRITICAL: CRITICAL: puppet fail [22:24:20] PROBLEM - puppet last run on netmon1001 is CRITICAL: CRITICAL: Puppet has 2 failures [22:28:55] (03PS1) 10Dzahn: rancid - adjust file pathes to module structure [puppet] - 10https://gerrit.wikimedia.org/r/169240 [22:29:00] RECOVERY - ElasticSearch health check for shards on elastic1014 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 18, unassigned_shards: 554, timed_out: False, active_primary_shards: 2031, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5480, initializing_shards: 54, number_of_data_nodes: 18 [22:29:01] RECOVERY - ElasticSearch health check for shards on elastic1012 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 18, unassigned_shards: 554, timed_out: False, active_primary_shards: 2031, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5480, initializing_shards: 54, number_of_data_nodes: 18 [22:29:09] RECOVERY - ElasticSearch health check for shards on elastic1005 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 18, unassigned_shards: 554, timed_out: False, active_primary_shards: 2031, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5480, initializing_shards: 54, number_of_data_nodes: 18 [22:29:11] RECOVERY - ElasticSearch health check for shards on elastic1018 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 18, unassigned_shards: 554, timed_out: False, active_primary_shards: 2031, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5480, initializing_shards: 54, number_of_data_nodes: 18 [22:29:11] RECOVERY - ElasticSearch health check for shards on elastic1008 is OK: OK - elasticsearch status 
production-search-eqiad: status: yellow, number_of_nodes: 18, unassigned_shards: 554, timed_out: False, active_primary_shards: 2031, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5480, initializing_shards: 54, number_of_data_nodes: 18 [22:29:11] RECOVERY - ElasticSearch health check for shards on elastic1015 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 18, unassigned_shards: 554, timed_out: False, active_primary_shards: 2031, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5480, initializing_shards: 54, number_of_data_nodes: 18 [22:29:12] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [22:29:20] RECOVERY - ElasticSearch health check for shards on elastic1011 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 18, unassigned_shards: 554, timed_out: False, active_primary_shards: 2031, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5480, initializing_shards: 54, number_of_data_nodes: 18 [22:29:29] RECOVERY - ElasticSearch health check for shards on elastic1001 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 18, unassigned_shards: 554, timed_out: False, active_primary_shards: 2031, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5480, initializing_shards: 54, number_of_data_nodes: 18 [22:29:29] RECOVERY - ElasticSearch health check for shards on elastic1007 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 18, unassigned_shards: 554, timed_out: False, active_primary_shards: 2031, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5480, initializing_shards: 54, number_of_data_nodes: 18 [22:29:29] RECOVERY - ElasticSearch health check for shards on elastic1003 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 18, unassigned_shards: 554, timed_out: False, active_primary_shards: 2031, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5480, initializing_shards: 54, number_of_data_nodes: 18 [22:29:30] RECOVERY - ElasticSearch health check for shards on elastic1013 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 18, unassigned_shards: 554, timed_out: False, active_primary_shards: 2031, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5480, initializing_shards: 54, number_of_data_nodes: 18 [22:29:30] RECOVERY - ElasticSearch health check for shards on elastic1006 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 18, unassigned_shards: 554, timed_out: False, active_primary_shards: 2031, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5480, initializing_shards: 54, number_of_data_nodes: 18 [22:29:30] RECOVERY - ElasticSearch health check for shards on elastic1017 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 18, unassigned_shards: 554, timed_out: False, active_primary_shards: 2031, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5480, initializing_shards: 54, number_of_data_nodes: 18 [22:29:40] RECOVERY - ElasticSearch health check for shards on elastic1004 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 18, 
unassigned_shards: 554, timed_out: False, active_primary_shards: 2031, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5480, initializing_shards: 54, number_of_data_nodes: 18 [22:29:40] RECOVERY - ElasticSearch health check for shards on elastic1002 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 18, unassigned_shards: 554, timed_out: False, active_primary_shards: 2031, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5480, initializing_shards: 54, number_of_data_nodes: 18 [22:29:40] RECOVERY - ElasticSearch health check for shards on elastic1010 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 18, unassigned_shards: 554, timed_out: False, active_primary_shards: 2031, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5480, initializing_shards: 54, number_of_data_nodes: 18 [22:29:49] RECOVERY - ElasticSearch health check for shards on elastic1016 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 18, unassigned_shards: 554, timed_out: False, active_primary_shards: 2031, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5480, initializing_shards: 54, number_of_data_nodes: 18 [22:29:49] RECOVERY - ElasticSearch health check for shards on elastic1009 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 18, unassigned_shards: 553, timed_out: False, active_primary_shards: 2031, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5481, initializing_shards: 54, number_of_data_nodes: 18 [22:29:59] (03PS2) 10Rush: phab update phd configuration [puppet] - 10https://gerrit.wikimedia.org/r/168994 [22:30:26] (03CR) 10Rush: [C: 032 V: 032] phab update phd configuration [puppet] - 10https://gerrit.wikimedia.org/r/168994 (owner: 10Rush) [22:32:11] (03PS1) 10Alexandros Kosiaris: osm export the expired tile list [puppet] - 10https://gerrit.wikimedia.org/r/169242 [22:34:25] (03CR) 10MaxSem: [C: 032] "You know you don't need to SWAT labs-only changes? ;)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169049 (https://bugzilla.wikimedia.org/47918) (owner: 10Legoktm) [22:34:33] (03Merged) 10jenkins-bot: Enable global user merge on beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169049 (https://bugzilla.wikimedia.org/47918) (owner: 10Legoktm) [22:36:35] (03CR) 10Dzahn: [C: 032] rancid - adjust file pathes to module structure [puppet] - 10https://gerrit.wikimedia.org/r/169240 (owner: 10Dzahn) [22:37:29] PROBLEM - LVS Lucene on search-pool5.svc.eqiad.wmnet is CRITICAL: Connection timed out [22:38:35] RECOVERY - puppet last run on netmon1001 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [22:38:46] RECOVERY - LVS Lucene on search-pool5.svc.eqiad.wmnet is OK: TCP OK - 1.003 second response time on port 8123 [22:38:50] (03PS1) 10Rush: phab update phd daemons 4->10 [puppet] - 10https://gerrit.wikimedia.org/r/169243 [22:39:10] (03CR) 10Rush: [C: 032 V: 032] phab update phd daemons 4->10 [puppet] - 10https://gerrit.wikimedia.org/r/169243 (owner: 10Rush) [22:39:46] so is the lucene problem that search1019 is disabled in http://config-master.wikimedia.org/pybal/eqiad/search but enabled in http://config-master.wikimedia.org/pybal/eqiad/search_pool5 ? 
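For context on the ElasticSearch recovery flood above: the icinga checks quote fields straight out of Elasticsearch's _cluster/health API (status, unassigned_shards, initializing_shards, active_shards and so on). A minimal sketch of querying that endpoint directly, assuming a cluster node reachable on the default port 9200; the localhost URL below is illustrative, not a host taken from this log:

    # Minimal sketch: fetch the same _cluster/health document that the icinga
    # "ElasticSearch health check for shards" alerts above summarise.
    # The URL is an assumption for illustration; any node in the cluster
    # serves this endpoint on port 9200 by default.
    import json
    from urllib.request import urlopen

    ES_URL = "http://localhost:9200/_cluster/health"  # hypothetical node

    def cluster_health(url=ES_URL, timeout=5):
        """Return the parsed _cluster/health response as a dict."""
        with urlopen(url, timeout=timeout) as resp:
            return json.load(resp)

    if __name__ == "__main__":
        health = cluster_health()
        # "yellow" means every primary shard is allocated but some replicas
        # are not yet, which is the state reported above while shards
        # are still initializing.
        print("{status}: {initializing_shards} initializing, "
              "{unassigned_shards} unassigned, "
              "{active_shards} active".format(**health))

The cluster only reports green once the replica shards finish initializing, which is why the checks above recover to "yellow" first.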
[22:40:13] no [22:40:26] it's that lsearch is overloaded because it wasn't supposed to get any traffic [22:42:09] so, elastic in yellow... [22:42:33] any ETA on how long it will take for the transition to green ? :-) [22:43:32] (03PS1) 10Dzahn: contacts - fix template path for module [puppet] - 10https://gerrit.wikimedia.org/r/169246 [22:44:37] (03CR) 10Dzahn: [C: 032] contacts - fix template path for module [puppet] - 10https://gerrit.wikimedia.org/r/169246 (owner: 10Dzahn) [22:45:37] <^d> akosiaris: 497 shards left to initialize. [22:45:44] <^d> So, 20-30min, tops. [22:45:53] !log activated heap profiling on mw1114 [22:46:01] Logged the message, Master [22:46:06] yay, thanks ^d [22:46:07] (03PS1) 10Jackmcbarn: Remove parameter left over from trial long since ended [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169247 (https://bugzilla.wikimedia.org/72593) [22:46:39] ^d, I'll be doing SWAT, poke me if you need search changes deployed [22:46:50] <^d> I've got one config change. [22:46:55] <^d> Lemme make sure it's on the list. [22:47:42] it doesn't have to be specifically on the SWAT list, really, because it's related to an outage [22:47:43] <^d> I think that's 8, actually. [22:47:46] (03CR) 10Aaron Schulz: "Seems like some other wiki config copy-pasted the same setting." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169247 (https://bugzilla.wikimedia.org/72593) (owner: 10Jackmcbarn) [22:49:07] (03CR) 10Jackmcbarn: "I'll handle the others in a follow-up, just in case there's any issues with them. This one is rather urgent, as enwiki just hit this limit" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169247 (https://bugzilla.wikimedia.org/72593) (owner: 10Jackmcbarn) [22:49:38] gah, that should really go out ASAP, but this swat window's full [22:49:54] (03CR) 10Aaron Schulz: [C: 032] Remove parameter left over from trial long since ended [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169247 (https://bugzilla.wikimedia.org/72593) (owner: 10Jackmcbarn) [22:50:01] (03Merged) 10jenkins-bot: Remove parameter left over from trial long since ended [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169247 (https://bugzilla.wikimedia.org/72593) (owner: 10Jackmcbarn) [22:50:30] jackmcbarn: lol, looks like it's going out :P [22:50:47] !log aaron Synchronized wmf-config/flaggedrevs.php: Removed $wgFlaggedRevsProtectQuota for enwiki (duration: 00m 03s) [22:50:53] Logged the message, Master [22:51:33] I wonder if we should've just nuked it everywhere [22:52:00] what, FR? we should force it everywhere, instead :P [22:52:15] I was meaning the 2k limit [22:52:35] MaxSem: let's not, for so many reasons :P [22:53:00] OMG BLP WHY DONCHA THINK OF THE CHILDREN [22:53:15] Krinkle: here? [22:53:17] NEEDS MOAR BEES [22:53:43] BLPs would end up being much angrier that it was taking hours/days/months for new info to be approved on all of their pages in different languages [22:53:47] (03CR) 10Hoo man: [C: 031] Get version for Wikibase cache key from the build [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169192 (owner: 10Aude) [22:54:07] limited cases they understand, wholesale would end up making them understand significantly less :) [22:58:50] (03PS1) 10Tim Landscheidt: checkhosts: Fix pyflakes warnings [software] - 10https://gerrit.wikimedia.org/r/169249 [23:00:05] RoanKattouw, ^d, marktraceur, MaxSem: Respected human, time to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141027T2300). Please do the needful.
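On the "20-30min, tops" estimate above: the same health document gives enough to ballpark the time to green yourself, by sampling the pending-shard counts twice and extrapolating from the drain rate. A rough sketch, with the same illustrative endpoint as the previous snippet; the one-minute sample window and the linear extrapolation are simplifying assumptions, not how the operators actually arrived at their figure:

    # Rough sketch: estimate time until the cluster reports green by sampling
    # _cluster/health twice and extrapolating linearly from how fast pending
    # shards drain. Endpoint, interval and the linear model are assumptions.
    import json
    import time
    from urllib.request import urlopen

    ES_URL = "http://localhost:9200/_cluster/health"  # hypothetical node

    def pending_shards():
        """Shards that still need work before the cluster can go green."""
        with urlopen(ES_URL, timeout=5) as resp:
            health = json.load(resp)
        return health["unassigned_shards"] + health["initializing_shards"]

    def estimate_seconds_to_green(interval=60):
        before = pending_shards()
        time.sleep(interval)
        after = pending_shards()
        drained = before - after
        if drained <= 0:
            return None                      # no visible progress this window
        return after / (drained / interval)  # assumes a steady drain rate

    eta = estimate_seconds_to_green()
    print("no progress yet" if eta is None else "~%d min to green" % (eta // 60))

In practice recovery throttling and relocations make the rate anything but steady, so the result is a sanity check rather than a promise.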
[23:00:07] (03PS1) 10Tim Landscheidt: fwconfigtool: Fix pyflakes warnings [software] - 10https://gerrit.wikimedia.org/r/169252 [23:00:10] on it [23:00:33] (03PS1) 10Tim Landscheidt: geturls: Fix pyflakes warnings [software] - 10https://gerrit.wikimedia.org/r/169253 [23:00:36] wow... an hour earlier now :P DST... [23:00:51] (03PS1) 10Tim Landscheidt: swiftcleaner: Fix pyflakes warnings [software] - 10https://gerrit.wikimedia.org/r/169255 [23:01:16] (03PS1) 10Tim Landscheidt: udpprofile: Fix pyflakes warnings [software] - 10https://gerrit.wikimedia.org/r/169256 [23:01:43] MaxSem: I added another one like 15 mins ago, did you see it? [23:01:55] yup [23:02:02] yay, don't have to stay up so late this time :) [23:02:25] * jamesofur waves [23:02:31] 'Max 8 patches' MaxSem: enjoy deploying 11 :D [23:02:43] (03PS1) 10Jackmcbarn: Remove quota from the rest of the wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169258 [23:02:57] aude, do your changes need to be deployed in a particular order? [23:03:10] MaxSem: not really, but maybe the config one first [23:03:14] (03CR) 10Jackmcbarn: "See also I8374b8332a5b7a943ce89d043c6d9bdebfddb7eb" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169258 (owner: 10Jackmcbarn) [23:03:16] ok [23:03:31] JohnLewis: hey, at least the last two were his :) [23:03:33] not doing config first might actually break [23:03:34] other way might temporarily affect test.wikidata [23:03:36] test only, though [23:03:41] yep, that [23:03:43] exactly [23:04:10] (03CR) 10MaxSem: [C: 032] Get version for Wikibase cache key from the build [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169192 (owner: 10Aude) [23:04:18] (03Merged) 10jenkins-bot: Get version for Wikibase cache key from the build [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169192 (owner: 10Aude) [23:04:30] jamesofur: true [23:04:48] (03PS1) 10Rush: phab allow pygmentizing [puppet] - 10https://gerrit.wikimedia.org/r/169259 [23:05:10] may I know why it checks $wmfVersionNumber === '1.25wmf3' ? :P [23:05:13] <^d> 371 to go until green. [23:05:58] !log maxsem Synchronized wmf-config/Wikibase.php: https://gerrit.wikimedia.org/r/#/c/169192/ (duration: 00m 04s) [23:05:58] (03PS2) 10Rush: phab allow pygmentizing [puppet] - 10https://gerrit.wikimedia.org/r/169259 [23:06:01] oh, wait... wikipedia and wikidata are wmf4 [23:06:02] aude, hoo ^^^ [23:06:03] (03CR) 10Rush: [C: 032 V: 032] phab allow pygmentizing [puppet] - 10https://gerrit.wikimedia.org/r/169259 (owner: 10Rush) [23:06:04] Logged the message, Master [23:06:09] no big deal [23:06:26] hoo@tin:~$ mwversionsinuse [23:06:26] 1.25wmf4 1.25wmf5 [23:06:31] * aude checked wikidata [23:06:56] (03PS1) 10Tim Landscheidt: compare-puppet-catalogs: Fix pyflakes warnings [software] - 10https://gerrit.wikimedia.org/r/169260 [23:07:01] should be good... we can remove that cruft next week [23:07:08] or with the next regular change [23:07:09] looks good [23:08:00] (03CR) 10Tim Landscheidt: [C: 04-1] "Ooops, missed one." [software] - 10https://gerrit.wikimedia.org/r/169260 (owner: 10Tim Landscheidt) [23:10:58] (03CR) 10Tim Landscheidt: [C: 04-1] "Missed two."
[software] - 10https://gerrit.wikimedia.org/r/169253 (owner: 10Tim Landscheidt) [23:11:09] PROBLEM - LVS Lucene on search-pool5.svc.eqiad.wmnet is CRITICAL: Connection timed out [23:11:23] (03PS1) 10Dzahn: Revert "contacts/outreach - move to module" [puppet] - 10https://gerrit.wikimedia.org/r/169261 [23:11:36] (03CR) 10jenkins-bot: [V: 04-1] Revert "contacts/outreach - move to module" [puppet] - 10https://gerrit.wikimedia.org/r/169261 (owner: 10Dzahn) [23:11:51] (03CR) 10Dzahn: "Error 400 on SERVER: Could not find class ::contacts for zirconium.wikimedia.org on node zirconium.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/169261 (owner: 10Dzahn) [23:12:10] RECOVERY - LVS Lucene on search-pool5.svc.eqiad.wmnet is OK: TCP OK - 0.002 second response time on port 8123 [23:12:10] (03CR) 10Dzahn: "you don't like this either jenkins? greeeaatt" [puppet] - 10https://gerrit.wikimedia.org/r/169261 (owner: 10Dzahn) [23:12:16] (03PS2) 10Faidon Liambotis: Kill wiki-mail.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/143762 [23:12:18] (03PS2) 10Faidon Liambotis: Move mail.wikipedia.org to the main cluster [dns] - 10https://gerrit.wikimedia.org/r/154222 (https://bugzilla.wikimedia.org/44731) (owner: 10Jeremyb) [23:12:20] (03PS1) 10Faidon Liambotis: Kill mail.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/169262 [23:13:03] !log maxsem Synchronized php-1.25wmf3/extensions/Wikidata/: (no message) (duration: 00m 12s) [23:13:05] (03PS5) 10Faidon Liambotis: mail: remove secondary MX role from sodium [puppet] - 10https://gerrit.wikimedia.org/r/143887 [23:13:07] (03PS1) 10Faidon Liambotis: exim: kill mail.wikipedia.org/mail.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/169263 [23:13:09] aude, hoo ^^^ [23:13:09] (03PS1) 10Faidon Liambotis: Kill role::mail::oldmx, not used anymore [puppet] - 10https://gerrit.wikimedia.org/r/169264 [23:13:10] Logged the message, Master [23:13:11] <- serial killer [23:13:13] wmf3? [23:13:31] grrr [23:13:37] * aude didn't intend to confuse [23:13:37] TimStarling: do you see any reason for mail.wiki{p,m}edia.org to exist nowadays? 
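On the mail.wikipedia.org / mail.wikimedia.org cleanup being discussed here: one quick sanity check before retiring old mail hosts is to look at what the live MX records still advertise. A minimal sketch, assuming the third-party dnspython package; nothing in the log says this is how the check was actually done:

    # Minimal sketch: list the MX records currently published for the domains
    # under discussion, to see which mail hosts are still advertised before
    # any of them are retired. Assumes dnspython (pip install dnspython);
    # this is an illustration, not the check the operators ran.
    import dns.resolver

    for domain in ("wikipedia.org", "wikimedia.org"):
        for rr in sorted(dns.resolver.resolve(domain, "MX"),
                         key=lambda r: r.preference):
            print("%-15s pref=%-3d %s" % (domain, rr.preference, rr.exchange))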
[23:13:43] as mail domains that is [23:13:58] (03PS2) 10Tim Landscheidt: geturls: Fix pyflakes warnings [software] - 10https://gerrit.wikimedia.org/r/169253 [23:13:59] !log maxsem Synchronized php-1.25wmf5/extensions/Wikidata/: (no message) (duration: 00m 10s) [23:14:02] aude, now for realz ^ [23:14:04] Logged the message, Master [23:14:05] ok [23:14:15] !log maxsem Synchronized php-1.25wmf5/extensions/VisualEditor/: (no message) (duration: 00m 05s) [23:14:20] Logged the message, Master [23:14:29] RoanKattouw, ^^ [23:14:36] that sounds like a complicated question [23:14:42] looks good [23:14:44] !log maxsem Synchronized php-1.25wmf5/extensions/MobileFrontend/: (no message) (duration: 00m 04s) [23:14:49] Logged the message, Master [23:15:00] thanks [23:16:47] !log maxsem Synchronized php-1.25wmf4/extensions/MobileFrontend/: (no message) (duration: 00m 05s) [23:16:53] Logged the message, Master [23:17:01] (03CR) 10Faidon Liambotis: [C: 032] Move mail.wikipedia.org to the main cluster [dns] - 10https://gerrit.wikimedia.org/r/154222 (https://bugzilla.wikimedia.org/44731) (owner: 10Jeremyb) [23:17:04] !log maxsem Synchronized php-1.25wmf4/extensions/VisualEditor/: (no message) (duration: 00m 04s) [23:17:07] RoanKattouw, ^ [23:17:09] Logged the message, Master [23:17:26] (03CR) 10Faidon Liambotis: [C: 032] Kill mail.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/169262 (owner: 10Faidon Liambotis) [23:18:02] MaxSem: Lydia said she'll get around to give you a resp about the group thing shortly (before the window is over definitely). [23:18:20] aha, talking to her right now [23:18:38] MaxSem: Working in wmf4, thanks! [23:18:42] MaxSem: All working. [23:18:49] <^d> MaxSem: I'll take care of my config stuff later. Still won't be done restoring until after swat probably. [23:19:11] silly UES [23:19:34] (03CR) 10Lydia Pintscher: [C: 031] "Good from my side." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168771 (https://bugzilla.wikimedia.org/72459) (owner: 10Vogone) [23:19:49] (03CR) 10MaxSem: [C: 032] Create new user group for WMDE staff at wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168771 (https://bugzilla.wikimedia.org/72459) (owner: 10Vogone) [23:20:03] (03Merged) 10jenkins-bot: Create new user group for WMDE staff at wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168771 (https://bugzilla.wikimedia.org/72459) (owner: 10Vogone) [23:20:28] !log maxsem Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/168771/ (duration: 00m 04s) [23:20:33] Logged the message, Master [23:22:14] !log on mw1114: disabled puppet, enabled Eval.PerfPidMap, restarted hhvm [23:22:20] Logged the message, Master [23:22:38] (03CR) 10MaxSem: [C: 032] Adjustments to votewiki config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169229 (https://bugzilla.wikimedia.org/72589) (owner: 10Jalexander) [23:22:43] (03PS1) 10Dzahn: phabricator - add monitoring [puppet] - 10https://gerrit.wikimedia.org/r/169265 [23:22:47] (03Merged) 10jenkins-bot: Adjustments to votewiki config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169229 (https://bugzilla.wikimedia.org/72589) (owner: 10Jalexander) [23:23:06] !log maxsem Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/169229/ (duration: 00m 04s) [23:23:11] Logged the message, Master [23:23:14] jamesofur, ^^^ [23:23:20] thanks checkin [23:25:09] (03CR) 10Chad: "Isn't this behind misc varnish? Wouldn't that handle monitoring?" 
[puppet] - 10https://gerrit.wikimedia.org/r/169265 (owner: 10Dzahn) [23:25:49] MaxSem: hmmm, not seeing the change picked up yet (when usually I see it right away) [23:26:18] (03CR) 10Dzahn: "it's behind misc varnish so we can either only check http on iridium or https on nginx, it doesn't mean anything is happening automaticall" [puppet] - 10https://gerrit.wikimedia.org/r/169265 (owner: 10Dzahn) [23:26:56] !log restarted logstash on logstash1001 [23:27:02] Logged the message, Master [23:27:33] !log maxsem Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/169229/ for reals now (duration: 00m 04s) [23:27:39] Logged the message, Master [23:27:40] jamesofur, ^^^ :P [23:27:50] heh, see them now ;) thank you [23:28:13] perfect [23:28:39] and I have enough time to deploy as many changes [23:28:51] WHO WANTS TO BREAK WIKIPEDIA TODAY??! [23:29:08] ^d apparently ;) [23:29:44] <^d> Hey, it had already broken itself quite nicely before I was awake. [23:29:48] PROBLEM - ElasticSearch health check on logstash1003 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.136 [23:29:51] fair point [23:30:15] MaxSem: If you want to deploy moar https://gerrit.wikimedia.org/r/168922 :D [23:30:29] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.136:9200/_cluster/health error while fetching: Request timed out. [23:33:39] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.138:9200/_cluster/health error while fetching: Request timed out. [23:34:08] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.137:9200/_cluster/health error while fetching: Request timed out. [23:34:38] (03CR) 10Chad: "Got it, ok :)" [puppet] - 10https://gerrit.wikimedia.org/r/169265 (owner: 10Dzahn) [23:34:57] (03PS1) 10Faidon Liambotis: apache: reload when config changes (bugfix) [puppet] - 10https://gerrit.wikimedia.org/r/169266 [23:34:58] ori: ^ [23:36:48] (03CR) 10Ori.livneh: [C: 031] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/169266 (owner: 10Faidon Liambotis) [23:37:50] (03CR) 10Faidon Liambotis: [C: 032] apache: reload when config changes (bugfix) [puppet] - 10https://gerrit.wikimedia.org/r/169266 (owner: 10Faidon Liambotis) [23:38:13] paravoid: thanks [23:38:33] the 'replaces' parameter should be removed, imo [23:38:55] ori: unfortunately for you, I started my gerrit reviews on a FIFO basis [23:39:02] and that includes forking into other things :) [23:40:04] fine by me [23:47:48] paravoid: have you ever experimented with non-contig netmasks in iptables? [23:48:10] how do you mean? [23:49:11] chasemp: can I change noreply to no-reply? [23:49:49] I have a pool of SNAT (outbound NAT) ips [23:49:59] but it's not deterministic about which IP you get [23:50:34] so for one session you might get .1, for another you'll get .2 [23:50:50] paravoid: Yes? [23:51:54] (03CR) 10Dzahn: "hrmm. i have a hard time finding any way to ask iridium for this and NOT getting a 301, no matter which combination of host header, URL, m" [puppet] - 10https://gerrit.wikimedia.org/r/169265 (owner: 10Dzahn) [23:52:39] !log Restarted logstash service on logstash1001 because I was not seeing any events from MW make it into kibana [23:52:40] I'd like to use a non contiguous netmask to make external IPs sticky...
eg --source 192.168.100.0/255.255.255.1 [23:52:45] Logged the message, Master [23:53:01] the .1 should match even (or maybe odd, I need to do my mask math) hosts only [23:53:56] http://www.gossamer-threads.com/lists/iptables/user/2732 [23:54:47] unless maybe you have another idea of how I could make SNAT sticky (same internal IP gets mapped fairly reliably to same external IP..) [23:59:10] cajoel: -j SAME? [23:59:21] Krinkle: do you have jenkins admin rights? [23:59:29] Yes [23:59:45] ok, can we change the mail relay jenkins uses? [23:59:56] paravoid: you rock [23:59:58] I found a comment on our DNS that says it uses "wiki-mail"
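To close the loop on the SNAT mask math above: iptables-style address matching is (address & mask) == (base & mask), so a non-contiguous mask of 255.255.255.1 against a base of 192.168.100.0 selects the hosts whose last octet has its low bit clear, i.e. the even addresses; a second rule with base .1 would catch the odd half. A quick sketch that only does the arithmetic (whether a particular iptables build accepts a non-contiguous mask at all is a separate question, as the thread linked above discusses):

    # Quick check of the mask math: which sources in 192.168.100.0/24 match
    # "--source 192.168.100.0/255.255.255.1", assuming the match is a plain
    # (addr & mask) == (base & mask) comparison? This only does the arithmetic;
    # it says nothing about whether a given iptables build accepts the mask.
    import ipaddress

    BASE = int(ipaddress.ip_address("192.168.100.0"))
    MASK = int(ipaddress.ip_address("255.255.255.1"))

    matching = [
        str(ipaddress.ip_address(BASE | last_octet))
        for last_octet in range(256)
        if ((BASE | last_octet) & MASK) == (BASE & MASK)
    ]

    print(len(matching), "matches, starting with", matching[:4])
    # -> 128 matches, starting with ['192.168.100.0', '192.168.100.2',
    #    '192.168.100.4', '192.168.100.6']  (the even last octets)

The -j SAME target suggested above was the old netfilter answer to sticky SNAT across connections from the same client; it was dropped from later kernels, and if memory serves the closer modern equivalent is -j SNAT with the --persistent option.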