[00:06:25] (03PS1) 10Ori.livneh: webperf: refactor navtiming.py [puppet] - 10https://gerrit.wikimedia.org/r/170260 [00:07:26] (03PS2) 10Ori.livneh: webperf: refactor navtiming.py [puppet] - 10https://gerrit.wikimedia.org/r/170260 [00:07:32] (03CR) 10Ori.livneh: [C: 032 V: 032] webperf: refactor navtiming.py [puppet] - 10https://gerrit.wikimedia.org/r/170260 (owner: 10Ori.livneh) [00:36:08] (03Abandoned) 10Dzahn: puppetception - lint [puppet] - 10https://gerrit.wikimedia.org/r/170256 (owner: 10Dzahn) [01:14:28] !log db1040 dberror spam is https://gerrit.wikimedia.org/r/#/c/169964/ only jobrunners affected, annoying but not critical [01:14:38] Logged the message, Master [02:26:24] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0] [02:30:05] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 29095 seconds ago, expected 28800 [02:35:05] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 29395 seconds ago, expected 28800 [02:40:06] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 29695 seconds ago, expected 28800 [02:45:04] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 29995 seconds ago, expected 28800 [02:45:05] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [02:50:12] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 30295 seconds ago, expected 28800 [02:55:06] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 30595 seconds ago, expected 28800 [03:00:12] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 30895 seconds ago, expected 28800 [03:05:15] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 31195 seconds ago, expected 28800 [03:10:06] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 31495 seconds ago, expected 28800 [03:15:06] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 31795 seconds ago, expected 28800 [03:20:05] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 32095 seconds ago, expected 28800 [03:25:05] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 32395 seconds ago, expected 28800 [03:29:25] PROBLEM - puppet last run on db2038 is CRITICAL: CRITICAL: puppet fail [03:30:13] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 32695 seconds ago, expected 28800 [03:35:13] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 32995 seconds ago, expected 28800 [03:40:14] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 33295 seconds ago, expected 28800 [03:45:13] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 33595 seconds ago, expected 28800 [03:47:06] RECOVERY - puppet last run on db2038 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [03:50:15] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 33895 seconds ago, expected 28800 [03:55:15] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 34195 seconds ago, expected 28800 [04:00:06] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 34495 seconds ago, expected 28800 [04:05:09] PROBLEM - check_puppetrun on pay-lvs1001 is 
CRITICAL: CRITICAL: Puppet last ran 34795 seconds ago, expected 28800 [04:10:06] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 35095 seconds ago, expected 28800 [04:11:38] paravoid: ^ frack. do we know anything about it? [04:11:44] nope [04:11:55] but it's just puppet, right? [04:12:16] hmm, alex/jeff did something with payment lvs yesterday [04:12:19] ganglia-related [04:12:38] yes, hence i haven't panicked [04:12:42] ok [04:14:46] :) [04:15:11] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 35395 seconds ago, expected 28800 [04:20:11] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 35695 seconds ago, expected 28800 [04:25:13] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 35995 seconds ago, expected 28800 [04:30:07] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 36295 seconds ago, expected 28800 [04:35:07] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 36595 seconds ago, expected 28800 [04:40:07] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 36895 seconds ago, expected 28800 [04:45:09] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 37195 seconds ago, expected 28800 [04:50:09] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 37495 seconds ago, expected 28800 [04:55:16] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 37795 seconds ago, expected 28800 [04:58:43] PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: puppet fail [05:00:13] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 38096 seconds ago, expected 28800 [05:05:09] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 38395 seconds ago, expected 28800 [05:10:08] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 38695 seconds ago, expected 28800 [05:15:09] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 38995 seconds ago, expected 28800 [05:17:19] RECOVERY - puppet last run on cp3017 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [05:20:08] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 39295 seconds ago, expected 28800 [05:25:08] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 39595 seconds ago, expected 28800 [05:30:08] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 39895 seconds ago, expected 28800 [05:35:09] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 40195 seconds ago, expected 28800 [05:40:09] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 40495 seconds ago, expected 28800 [05:42:34] PROBLEM - puppet last run on ocg1001 is CRITICAL: CRITICAL: Puppet last ran 14415 seconds ago, expected 14400 [05:45:17] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 40795 seconds ago, expected 28800 [05:49:53] PROBLEM - RAID on nickel is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
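
The check_puppetrun flood above is a freshness check: it measures how long ago Puppet last completed a run on the host and goes CRITICAL once that age exceeds the expected interval (28800 seconds, i.e. 8 hours). A minimal Python sketch of that logic, assuming the last-run time can be taken from the mtime of a Puppet state file; the path and the exact output format of the real frack check script may differ:

#!/usr/bin/env python3
"""Nagios-style Puppet freshness check, sketched after the
'Puppet last ran N seconds ago, expected 28800' alerts above."""
import os
import sys
import time

STATE_FILE = "/var/lib/puppet/state/last_run_summary.yaml"  # assumed location
EXPECTED = 28800  # 8 hours, the interval quoted in the alerts

def check(state_file=STATE_FILE, expected=EXPECTED):
    try:
        age = int(time.time() - os.path.getmtime(state_file))
    except OSError:
        print("UNKNOWN: cannot read %s" % state_file)
        return 3
    if age > expected:
        print("CRITICAL: Puppet last ran %d seconds ago, expected %d" % (age, expected))
        return 2
    print("OK: Puppet last ran %d seconds ago" % age)
    return 0

if __name__ == "__main__":
    sys.exit(check())
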
[05:50:04] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 41095 seconds ago, expected 28800 [05:50:43] RECOVERY - RAID on nickel is OK: OK: Active: 3, Working: 3, Failed: 0, Spare: 0 [05:55:08] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 41395 seconds ago, expected 28800 [06:00:15] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 41695 seconds ago, expected 28800 [06:05:13] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 41995 seconds ago, expected 28800 [06:10:11] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 42295 seconds ago, expected 28800 [06:15:15] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 42595 seconds ago, expected 28800 [06:20:11] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 42895 seconds ago, expected 28800 [06:25:12] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 43195 seconds ago, expected 28800 [06:26:33] RECOVERY - Disk space on ocg1003 is OK: DISK OK [06:27:02] RECOVERY - Disk space on ocg1002 is OK: DISK OK [06:27:05] * paravoid waits for puppetmaster failures [06:27:12] let's see if it got fixed :) [06:28:43] PROBLEM - puppet last run on db1018 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:43] PROBLEM - puppet last run on cp1061 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:14] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 43495 seconds ago, expected 28800 [06:31:57] ok, 2 failures [06:32:01] not that bad, not perfect [06:32:34] <_joe_> paravoid: :) [06:33:02] <_joe_> as I told you, some failures are to be expected [06:33:04] <_joe_> alas [06:33:25] <_joe_> but yeah, we were helping it by restarting apache repeatedly [06:35:07] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 43795 seconds ago, expected 28800 [06:40:07] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 44095 seconds ago, expected 28800 [06:42:57] PROBLEM - CI: Puppet failure events on labmon1001 is CRITICAL: CRITICAL: integration.integration-puppetmaster.puppetagent.failed_events.value (33.33%) [06:45:07] RECOVERY - puppet last run on db1018 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [06:45:10] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 44395 seconds ago, expected 28800 [06:46:00] RECOVERY - puppet last run on cp1061 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [06:50:08] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 44695 seconds ago, expected 28800 [06:55:08] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 44995 seconds ago, expected 28800 [06:56:31] PROBLEM - puppet last run on db1006 is CRITICAL: CRITICAL: Puppet has 1 failures [07:00:10] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 45295 seconds ago, expected 28800 [07:05:07] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 45595 seconds ago, expected 28800 [07:06:49] RECOVERY - CI: Puppet failure events on labmon1001 is OK: OK: All targets OK [07:10:08] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 45895 seconds ago, expected 28800 [07:13:38] RECOVERY - puppet last run on db1006 is OK: OK: Puppet is currently 
enabled, last run 24 seconds ago with 0 failures [07:15:11] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 46195 seconds ago, expected 28800 [07:20:11] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 46495 seconds ago, expected 28800 [07:25:08] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 46795 seconds ago, expected 28800 [07:30:13] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 47095 seconds ago, expected 28800 [07:35:05] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 47395 seconds ago, expected 28800 [07:40:09] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 47695 seconds ago, expected 28800 [07:45:14] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 47995 seconds ago, expected 28800 [07:50:08] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 48295 seconds ago, expected 28800 [07:55:12] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 48595 seconds ago, expected 28800 [08:00:12] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 48895 seconds ago, expected 28800 [08:05:12] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 49195 seconds ago, expected 28800 [08:10:14] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 49496 seconds ago, expected 28800 [08:15:10] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 49795 seconds ago, expected 28800 [08:20:11] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 50095 seconds ago, expected 28800 [08:25:12] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 50395 seconds ago, expected 28800 [08:30:13] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 50695 seconds ago, expected 28800 [08:35:13] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 50995 seconds ago, expected 28800 [08:40:12] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 51296 seconds ago, expected 28800 [08:45:17] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 51595 seconds ago, expected 28800 [08:47:04] (03PS2) 10Giuseppe Lavagetto: osmium: remove specific classes. [puppet] - 10https://gerrit.wikimedia.org/r/170289 [08:47:15] (03CR) 10Giuseppe Lavagetto: [C: 032] osmium: remove specific classes. 
[puppet] - 10https://gerrit.wikimedia.org/r/170289 (owner: 10Giuseppe Lavagetto) [08:50:17] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 51895 seconds ago, expected 28800 [08:55:18] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 52195 seconds ago, expected 28800 [09:00:13] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 52495 seconds ago, expected 28800 [09:02:56] RECOVERY - puppet last run on osmium is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [09:05:15] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 52795 seconds ago, expected 28800 [09:09:22] (03CR) 10Filippo Giunchedi: [C: 04-1] Add robots.txt rewrite rule where wiki is public (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/147487 (owner: 10Reedy) [09:10:11] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 53095 seconds ago, expected 28800 [09:15:13] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 53395 seconds ago, expected 28800 [09:20:16] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 53695 seconds ago, expected 28800 [09:25:09] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 53995 seconds ago, expected 28800 [09:30:10] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 54295 seconds ago, expected 28800 [09:35:12] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 54595 seconds ago, expected 28800 [09:40:16] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 54895 seconds ago, expected 28800 [09:45:16] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 55195 seconds ago, expected 28800 [09:48:30] (03CR) 10Alexandros Kosiaris: [C: 032] "Idiotic error on my part. LGTM. Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/170160 (owner: 10QChris) [09:50:03] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 55495 seconds ago, expected 28800 [09:51:26] (03CR) 10Alexandros Kosiaris: [C: 032] Beta: Restart CXServer on change in config.erb [puppet] - 10https://gerrit.wikimedia.org/r/170283 (owner: 10KartikMistry) [09:55:03] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 55795 seconds ago, expected 28800 [10:00:08] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 56095 seconds ago, expected 28800 [10:03:02] (03CR) 10Alexandros Kosiaris: [C: 032] phab: move defines into autoload layout [puppet] - 10https://gerrit.wikimedia.org/r/170273 (owner: 10Dzahn) [10:05:12] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 56395 seconds ago, expected 28800 [10:10:15] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 56695 seconds ago, expected 28800 [10:12:54] (03CR) 10Hashar: "> let's shoot for monday (nov 3rd)" [puppet] - 10https://gerrit.wikimedia.org/r/153764 (owner: 10Hashar) [10:15:06] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 56995 seconds ago, expected 28800 [10:15:12] PROBLEM - ElasticSearch health check on elastic1018 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.40 [10:16:03] PROBLEM - ElasticSearch health check for shards on elastic1018 is CRITICAL: CRITICAL - elasticsearch http://10.64.48.40:9200/_cluster/health error while fetching: Request timed out. [10:18:22] PROBLEM - ElasticSearch health check for shards on elastic1031 is CRITICAL: CRITICAL - elasticsearch http://10.64.48.55:9200/_cluster/health error while fetching: Request timed out. [10:20:12] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 57295 seconds ago, expected 28800 [10:20:58] PROBLEM - ElasticSearch health check for shards on elastic1005 is CRITICAL: CRITICAL - elasticsearch http://10.64.0.112:9200/_cluster/health error while fetching: Request timed out. [10:20:59] PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.112 [10:20:59] RECOVERY - ElasticSearch health check for shards on elastic1031 is OK: OK - elasticsearch status production-search-eqiad: status: green, number_of_nodes: 27, unassigned_shards: 0, timed_out: False, active_primary_shards: 2011, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 6028, initializing_shards: 0, number_of_data_nodes: 27 [10:21:32] PROBLEM - ElasticSearch health check on elastic1031 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.55 [10:21:33] PROBLEM - ElasticSearch health check for shards on elastic1003 is CRITICAL: CRITICAL - elasticsearch http://10.64.0.110:9200/_cluster/health error while fetching: Request timed out. [10:22:53] PROBLEM - ElasticSearch health check on elastic1029 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.53 [10:22:54] PROBLEM - ElasticSearch health check for shards on elastic1029 is CRITICAL: CRITICAL - elasticsearch http://10.64.48.53:9200/_cluster/health error while fetching: Request timed out. 
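
The ElasticSearch health checks that start firing here poll /_cluster/health on each node with a short timeout and alarm when the request times out or the connection is refused, which are exactly the two failure modes in the messages above. A rough Python equivalent of that probe (not the actual Icinga plugin; the default host is a placeholder):

#!/usr/bin/env python3
"""Probe Elasticsearch cluster health the way the checks above do."""
import json
import sys
import urllib.request

def check_health(host="localhost", port=9200, timeout=10):
    url = "http://%s:%d/_cluster/health" % (host, port)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            health = json.load(resp)
    except Exception as exc:  # timeout, connection refused, HTTP error, ...
        print("CRITICAL - elasticsearch %s error while fetching: %s" % (url, exc))
        return 2
    status = health.get("status")
    if status == "red":
        print("CRITICAL - elasticsearch status red: %s" % health)
        return 2
    if status == "yellow":
        print("WARNING - elasticsearch status yellow, unassigned_shards: %s"
              % health.get("unassigned_shards"))
        return 1
    print("OK - elasticsearch status green, active_shards: %s" % health.get("active_shards"))
    return 0

if __name__ == "__main__":
    sys.exit(check_health(sys.argv[1] if len(sys.argv) > 1 else "localhost"))
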
[10:23:32] PROBLEM - ElasticSearch health check on elastic1003 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.110 [10:24:13] PROBLEM - ElasticSearch health check for shards on elastic1031 is CRITICAL: CRITICAL - elasticsearch http://10.64.48.55:9200/_cluster/health error while fetching: Request timed out. [10:25:12] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 57595 seconds ago, expected 28800 [10:29:44] RECOVERY - ElasticSearch health check for shards on elastic1005 is OK: OK - elasticsearch status production-search-eqiad: status: green, number_of_nodes: 27, unassigned_shards: 0, timed_out: False, active_primary_shards: 2011, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 6028, initializing_shards: 0, number_of_data_nodes: 27 [10:29:47] RECOVERY - ElasticSearch health check on elastic1005 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 27: number_of_data_nodes: 27: active_primary_shards: 2011: active_shards: 6028: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [10:30:16] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 57895 seconds ago, expected 28800 [10:33:03] PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.112 [10:33:03] PROBLEM - ElasticSearch health check for shards on elastic1005 is CRITICAL: CRITICAL - elasticsearch http://10.64.0.112:9200/_cluster/health error while fetching: Request timed out. [10:35:06] <_joe_> ES is down again [10:35:14] <_joe_> how funny [10:35:18] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 58195 seconds ago, expected 28800 [10:35:36] <_joe_> godog: ping ^^ (ES down) [10:36:34] <_joe_> looks like exhausted heap again [10:38:18] RECOVERY - ElasticSearch health check on elastic1031 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 27: number_of_data_nodes: 27: active_primary_shards: 2011: active_shards: 6028: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [10:38:33] mhh let's take a look [10:39:10] <_joe_> I can help in operating on the servers if you need it [10:39:44] (03PS1) 10Alexandros Kosiaris: ganglia DNS views: Remove the colon character [puppet] - 10https://gerrit.wikimedia.org/r/170299 [10:40:06] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 58495 seconds ago, expected 28800 [10:41:57] PROBLEM - ElasticSearch health check on elastic1031 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.55 [10:41:58] yeah I'm taking a look but we might need to bounce it [10:42:05] (03PS1) 10Giuseppe Lavagetto: mediawiki: simplify apache config [puppet] - 10https://gerrit.wikimedia.org/r/170300 [10:42:06] <_joe_> I guess so [10:42:23] <_joe_> but I won't do that without guidance from someone that knows more about it [10:44:39] RECOVERY - ElasticSearch health check for shards on elastic1005 is OK: OK - elasticsearch status production-search-eqiad: status: green, number_of_nodes: 27, unassigned_shards: 0, timed_out: False, active_primary_shards: 2011, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 6028, initializing_shards: 0, number_of_data_nodes: 27 [10:44:46] RECOVERY - ElasticSearch health check on elastic1005 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 27: number_of_data_nodes: 27: active_primary_shards: 2011: active_shards: 6028: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [10:45:07] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 58795 seconds ago, expected 28800 [10:45:10] recovery ? [10:45:18] so ok the cluster is green, however on 18/29/31 there are alarms [10:45:23] ok silencing that pay-lvs1001 thing [10:45:50] also on 3, 9/10/11/12 are down for maintenance IIRC [10:46:39] ACKNOWLEDGEMENT - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 58795 seconds ago, expected 28800 alexandros kosiaris Silencing, jgreen should take a look I suppose [10:47:27] (03CR) 10Alexandros Kosiaris: [C: 032] ganglia DNS views: Remove the colon character [puppet] - 10https://gerrit.wikimedia.org/r/170299 (owner: 10Alexandros Kosiaris) [10:48:08] PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.112 [10:48:16] PROBLEM - ElasticSearch health check for shards on elastic1005 is CRITICAL: CRITICAL - elasticsearch http://10.64.0.112:9200/_cluster/health error while fetching: Request timed out. [10:49:57] ah... nice... timeout again... [10:49:59] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run [[86400, day], [3600, hour], [60, minute], [1, second]] ago with 0 failures [10:50:24] ori: ^ not what you expected, is it ? [10:51:10] I'm looking at this page btw https://ganglia.wikimedia.org/latest/?r=2hr&cs=&ce=&c=Elasticsearch+cluster+eqiad&h=&tab=m&vn=&hide-hf=false&m=es_heap_used&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name [10:52:25] !log restarting elasticsearch on elastic1031, heap exhausted at 30G [10:52:35] Logged the message, Master [10:52:45] godog: so 40mins ago , traffic went down by 60+% ... [10:52:49] not good I assume [10:53:05] <_joe_> akosiaris: same as the other time [10:53:18] do we have search engines like a few days ago? [10:53:19] not sure how to read to es_head_used graph tbh [10:53:32] es_heap_used [10:53:39] s/search engine issues/ [10:53:54] <_joe_> so, we have 4 nodes that are reported down by ganglia [10:53:55] sDrewth: seems like it, we are investigating [10:53:59] enWS : An error has occurred while searching: Search is currently too busy. Please try again later. [10:54:02] <_joe_> godog: should I cycle them? [10:54:02] k. tjx [10:54:04] thx [10:54:06] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [10:54:18] _joe_: I think those four are down in maint [10:54:31] <_joe_> oh ok lemme check the SAL [10:54:46] RECOVERY - ElasticSearch health check on elastic1005 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 27: number_of_data_nodes: 27: active_primary_shards: 2011: active_shards: 6028: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [10:55:01] akosiaris: true traffic dropped, however that might be a reindex, if you zoom out to week/month it shows [10:55:08] and still these distributed spam attacks continue [10:55:29] sDrewth: what's enws ? 
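
The es_heap_used ganglia graphs being read here, and the "heap exhausted at 30G" restart of elastic1031, correspond to JVM heap numbers that Elasticsearch also exposes through its nodes-stats API. A small sketch that prints heap usage per node; the 85% warning threshold is only an illustration, not the cluster's actual alerting rule:

#!/usr/bin/env python3
"""List JVM heap usage per Elasticsearch node, as in the es_heap_used graphs."""
import json
import urllib.request

def heap_report(host="localhost", port=9200, warn_pct=85):
    url = "http://%s:%d/_nodes/stats/jvm" % (host, port)
    with urllib.request.urlopen(url, timeout=10) as resp:
        stats = json.load(resp)
    for node_id, node in sorted(stats["nodes"].items()):
        mem = node["jvm"]["mem"]
        used_gib = mem["heap_used_in_bytes"] / 1024 ** 3
        pct = mem["heap_used_percent"]
        flag = "  <-- close to exhaustion" if pct >= warn_pct else ""
        print("%-25s heap %6.1f GiB (%3d%%)%s" % (node.get("name", node_id), used_gib, pct, flag))

if __name__ == "__main__":
    heap_report()
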
[10:55:41] _joe_: you are not going to like this HHVM busy threads [10:55:41] UNKNOWN 2014-10-31 10:53:41 0d 0h 1m 3s 1/3 UNKNOWN: execution of the check script exited with exception timed out [10:56:01] <_joe_> that's graphite [10:56:03] <_joe_> ;) [10:56:32] ah yes [10:56:34] now the error is UNKNOWN: Got status 502 from the graphite serve [10:56:47] etc etc... [10:56:51] <_joe_> yep [10:57:08] <_joe_> about ES, the drop we're seeing is due to the cluster exploding [10:57:11] <_joe_> like the other day [10:57:16] RECOVERY - ElasticSearch health check for shards on elastic1005 is OK: OK - elasticsearch status production-search-eqiad: status: green, number_of_nodes: 27, unassigned_shards: 0, timed_out: False, active_primary_shards: 2011, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 6028, initializing_shards: 0, number_of_data_nodes: 27 [10:57:29] ok so, GC/heap issues again [10:57:34] <_joe_> if a couple reboots don't cut it we should really call nik [10:57:37] <_joe_> akosiaris: yes [10:57:45] ok I'm not sure what's the right way to recover from that, restart the nodes? [10:57:48] yeah, we should page nik [10:58:44] indeed, how did we engage him the last time? [10:58:53] he was online [10:59:04] but he probably is asleep now... [10:59:19] <_joe_> lemme check the contact list [11:00:18] RECOVERY - ElasticSearch health check for shards on elastic1003 is OK: OK - elasticsearch status production-search-eqiad: status: green, number_of_nodes: 27, unassigned_shards: 0, timed_out: False, active_primary_shards: 2011, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 6028, initializing_shards: 0, number_of_data_nodes: 27 [11:00:18] RECOVERY - ElasticSearch health check on elastic1003 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 27: number_of_data_nodes: 27: active_primary_shards: 2011: active_shards: 6028: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [11:00:23] it does not seem as bad as last time btw, but it can escalate I suppose ... [11:00:27] godog: English Wikisource [11:00:56] PROBLEM - ElasticSearch health check for shards on elastic1005 is CRITICAL: CRITICAL - elasticsearch http://10.64.0.112:9200/_cluster/health error while fetching: Request timed out. [11:00:56] PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.112 [11:01:08] enWP enWQ enWikt enVoy enWN enWV [11:01:19] sDrewth: ah I see, thanks! [11:01:38] <- speaks in abbreviated jargon :-) [11:02:43] not sure either.... [11:03:38] PROBLEM - ElasticSearch health check on elastic1015 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.12 [11:03:42] PROBLEM - ElasticSearch health check on elastic1003 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.110 [11:04:37] RECOVERY - ElasticSearch health check on elastic1015 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 27: number_of_data_nodes: 27: active_primary_shards: 2011: active_shards: 6028: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [11:04:55] <_joe_> so... one solution is to use lsearchd for everything, again [11:05:03] <_joe_> I'll prepare the patch in the meanwhile [11:06:09] PROBLEM - ElasticSearch health check for shards on elastic1003 is CRITICAL: CRITICAL - elasticsearch http://10.64.0.110:9200/_cluster/health error while fetching: Request timed out. 
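
The "Slow CirrusSearch query rate" check on fluorine and the HHVM busy-threads check are graphite-backed: they fetch recent datapoints from the graphite render API and compare them against a threshold, which is why a 502 from graphite turns them UNKNOWN rather than OK or CRITICAL. A hedged sketch of that pattern; the graphite URL, metric target and threshold here are illustrative, not the production configuration:

#!/usr/bin/env python3
"""Threshold check against the Graphite render API, in the spirit of the
CirrusSearch-slow.log_line_rate and HHVM busy-threads checks above."""
import json
import urllib.error
import urllib.parse
import urllib.request

def check_metric(graphite, target, critical, minutes=10):
    qs = urllib.parse.urlencode({"target": target, "from": "-%dmin" % minutes, "format": "json"})
    url = "%s/render?%s" % (graphite.rstrip("/"), qs)
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            series = json.load(resp)
    except urllib.error.HTTPError as exc:
        print("UNKNOWN: Got status %d from the graphite server" % exc.code)
        return 3
    except Exception as exc:
        print("UNKNOWN: error fetching %s: %s" % (url, exc))
        return 3
    points = [v for v, _ts in series[0]["datapoints"] if v is not None] if series else []
    if not points:
        print("UNKNOWN: no datapoints for %s" % target)
        return 3
    value = sum(points) / len(points)
    if value > critical:
        print("CRITICAL: %s = %.4f (threshold %.4f)" % (target, value, critical))
        return 2
    print("OK: %s = %.4f" % (target, value))
    return 0

if __name__ == "__main__":
    raise SystemExit(check_metric("http://graphite.example.org", "CirrusSearch-slow.log_line_rate", critical=1.0))
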
[11:06:58] so, the part where only one jvm thread is actually active [11:07:07] it's where the GC happens ? [11:07:42] ok so now at least 5 are active... for example on elastic1003 [11:07:57] PROBLEM - ElasticSearch health check for shards on elastic1015 is CRITICAL: CRITICAL - elasticsearch http://10.64.48.12:9200/_cluster/health error while fetching: Request timed out. [11:08:05] and now only 1 again.... the rest are all sleeping ... [11:08:17] RECOVERY - ElasticSearch health check on elastic1029 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 27: number_of_data_nodes: 27: active_primary_shards: 2011: active_shards: 6028: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [11:08:48] An error has occurred while searching: Search is currently too busy. Please try again later. [11:09:07] RECOVERY - ElasticSearch health check on elastic1003 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 27: number_of_data_nodes: 27: active_primary_shards: 2011: active_shards: 6028: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [11:09:09] it.wiki search seems to be totally off [11:09:12] Nemo_bis: yeah we know, thanks [11:09:23] _joe_: falling back to lsearchd ? [11:09:44] <_joe_> akosiaris: I was searching how to do that cleanly [11:10:17] PROBLEM - ElasticSearch health check on elastic1015 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.12 [11:10:26] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 11.5333333333 [11:11:49] PROBLEM - ElasticSearch health check on elastic1029 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.53 [11:12:46] PROBLEM - ElasticSearch health check on elastic1003 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.110 [11:13:14] https://it.wikipedia.org/w/index.php?title=Speciale%3ARicerca&search=what&go=Vai&srbackend=LuceneSearch works [11:13:22] 'wmgUseCirrus' => array( [11:13:23] 'default' => true, [11:13:28] just make false? [11:16:58] akosiaris: ^ [11:17:16] (03PS1) 10Giuseppe Lavagetto: Disable cirrus as we are having troubles on ElasticSearch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170305 [11:18:07] RECOVERY - ElasticSearch health check on elastic1003 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 27: number_of_data_nodes: 27: active_primary_shards: 2011: active_shards: 6028: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [11:18:30] <_joe_> wait for it, one more PS coming [11:19:19] (03PS2) 10Giuseppe Lavagetto: Disable cirrus as we are having troubles on ElasticSearch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170305 [11:19:36] <_joe_> godog, akosiaris ^^ [11:20:00] RECOVERY - ElasticSearch health check for shards on elastic1005 is OK: OK - elasticsearch status production-search-eqiad: status: green, number_of_nodes: 27, unassigned_shards: 0, timed_out: False, active_primary_shards: 2011, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 6028, initializing_shards: 0, number_of_data_nodes: 27 [11:20:01] RECOVERY - ElasticSearch health check on elastic1005 is OK: OK - elasticsearch (production-search-eqiad) is running.
status: green: timed_out: false: number_of_nodes: 27: number_of_data_nodes: 27: active_primary_shards: 2011: active_shards: 6028: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [11:21:00] _joe_: https://gerrit.wikimedia.org/r/#/c/169219/1/wmf-config/InitialiseSettings.php [11:21:09] judging from this [11:21:17] don't disable the AsAlternative [11:21:25] and perhaps add test2wiki ? [11:21:35] <_joe_> akosiaris: look further back in history [11:21:47] PROBLEM - ElasticSearch health check on elastic1003 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.110 [11:22:06] <_joe_> akosiaris: https://gerrit.wikimedia.org/r/#/c/168956/ [11:22:21] yeah, saw it [11:23:16] <_joe_> so, you'd say disable but not for people who explicitly opted in? [11:23:18] PROBLEM - ElasticSearch health check for shards on elastic1005 is CRITICAL: CRITICAL - elasticsearch http://10.64.0.112:9200/_cluster/health error while fetching: Request timed out. [11:23:18] PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.112 [11:23:22] (03CR) 10Alexandros Kosiaris: [C: 032] Disable cirrus as we are having troubles on ElasticSearch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170305 (owner: 10Giuseppe Lavagetto) [11:23:28] <_joe_> I think no one should use something that's broken [11:23:31] (03Merged) 10jenkins-bot: Disable cirrus as we are having troubles on ElasticSearch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170305 (owner: 10Giuseppe Lavagetto) [11:23:35] <_joe_> ok, syncing [11:23:38] RECOVERY - ElasticSearch health check for shards on elastic1015 is OK: OK - elasticsearch status production-search-eqiad: status: green, number_of_nodes: 27, unassigned_shards: 0, timed_out: False, active_primary_shards: 2011, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 6028, initializing_shards: 0, number_of_data_nodes: 27 [11:23:39] so no test2wiki? [11:23:46] anyways that'll do [11:23:53] let's do that later... [11:23:58] PROBLEM - Apache HTTP on mw1137 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:24:03] for now let's unbreak search [11:24:45] !log oblivian Synchronized wmf-config/InitialiseSettings.php: ES is down, long live lsearchd (duration: 00m 09s) [11:24:49] RECOVERY - ElasticSearch health check on elastic1003 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 27: number_of_data_nodes: 27: active_primary_shards: 2011: active_shards: 6028: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [11:24:49] RECOVERY - Apache HTTP on mw1137 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.481 second response time [11:24:51] Logged the message, Master [11:25:00] heh, even search on wikitech is dead... [11:25:28] RECOVERY - ElasticSearch health check on elastic1029 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 27: number_of_data_nodes: 27: active_primary_shards: 2011: active_shards: 6028: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [11:25:28] RECOVERY - ElasticSearch health check for shards on elastic1029 is OK: OK - elasticsearch status production-search-eqiad: status: green, number_of_nodes: 27, unassigned_shards: 0, timed_out: False, active_primary_shards: 2011, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 6028, initializing_shards: 0, number_of_data_nodes: 27 [11:25:30] since we moved it into the cluster that is to be expected, of course, but still .... [11:26:03] <_joe_> there are reasons why wikitech in the cluster is a bad idea [11:26:08] RECOVERY - ElasticSearch health check on elastic1015 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 27: number_of_data_nodes: 27: active_primary_shards: 2011: active_shards: 6028: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [11:26:10] <_joe_> this is one of the most important ones [11:26:25] <_joe_> that I pointed out on day -1 :) [11:27:09] PROBLEM - ElasticSearch health check for shards on elastic1015 is CRITICAL: CRITICAL - elasticsearch http://10.64.48.12:9200/_cluster/health error while fetching: Request timed out. [11:27:57] _joe_: I can see the value of allowing elastic and old search on things like wikitech [11:28:15] there was that capacity by url to swap between both [11:28:27] PROBLEM - ElasticSearch health check on elastic1003 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.110 [11:28:27] <_joe_> sDrewth: I can see the value in keeping our documentation platform disjoint from what it's documenting [11:28:30] <_joe_> ;) [11:29:57] yes, though there are numbers of doing that [11:30:41] !log restarting gmond on elasticsearch nodes so I can get a clearer picture of them [11:30:47] Logged the message, Master [11:31:08] PROBLEM - ElasticSearch health check for shards on elastic1029 is CRITICAL: CRITICAL - elasticsearch http://10.64.48.53:9200/_cluster/health error while fetching: Request timed out. [11:31:18] PROBLEM - ElasticSearch health check on elastic1029 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.53 [11:32:07] PROBLEM - ElasticSearch health check on elastic1015 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.12 [11:32:17] RECOVERY - ElasticSearch health check for shards on elastic1029 is OK: OK - elasticsearch status production-search-eqiad: status: green, number_of_nodes: 27, unassigned_shards: 0, timed_out: False, active_primary_shards: 2011, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 6028, initializing_shards: 0, number_of_data_nodes: 27 [11:32:17] RECOVERY - ElasticSearch health check on elastic1029 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 27: number_of_data_nodes: 27: active_primary_shards: 2011: active_shards: 6028: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [11:36:07] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [11:37:59] PROBLEM - ElasticSearch health check for shards on elastic1029 is CRITICAL: CRITICAL - elasticsearch http://10.64.48.53:9200/_cluster/health error while fetching: Request timed out. 
[11:38:05] RECOVERY - ElasticSearch health check for shards on elastic1005 is OK: OK - elasticsearch status production-search-eqiad: status: green, number_of_nodes: 27, unassigned_shards: 0, timed_out: False, active_primary_shards: 2011, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 6028, initializing_shards: 0, number_of_data_nodes: 27 [11:38:06] PROBLEM - ElasticSearch health check on elastic1029 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.53 [11:40:38] RECOVERY - ElasticSearch health check on elastic1005 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 27: number_of_data_nodes: 27: active_primary_shards: 2011: active_shards: 6028: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [11:41:40] PROBLEM - ElasticSearch health check for shards on elastic1005 is CRITICAL: CRITICAL - elasticsearch http://10.64.0.112:9200/_cluster/health error while fetching: Request timed out. [11:43:17] RECOVERY - ElasticSearch health check on elastic1015 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 27: number_of_data_nodes: 27: active_primary_shards: 2011: active_shards: 6028: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [11:44:08] PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.112 [11:46:19] !log heap dumps aren't happening. Even with the config to dump them on oom errors. Restarting Elasticsearch nodes to get us back to stable and going to have to investigate from another direction. [11:46:26] Logged the message, Master [11:48:48] PROBLEM - ElasticSearch health check on elastic1015 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.12 [11:49:19] :( [11:53:07] RECOVERY - ElasticSearch health check on elastic1015 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 27: number_of_data_nodes: 27: active_primary_shards: 2011: active_shards: 6028: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [11:56:37] PROBLEM - ElasticSearch health check for shards on elastic1002 is CRITICAL: CRITICAL - elasticsearch http://10.64.0.109:9200/_cluster/health error while fetching: Request timed out. [11:57:07] PROBLEM - ElasticSearch health check on elastic1002 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.109 [11:57:07] PROBLEM - ElasticSearch health check for shards on elastic1007 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.139:9200/_cluster/health error while fetching: Request timed out. [11:57:08] PROBLEM - ElasticSearch health check on elastic1007 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.139 [11:57:49] RECOVERY - ElasticSearch health check for shards on elastic1015 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 26, unassigned_shards: 145, timed_out: False, active_primary_shards: 2011, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5805, initializing_shards: 78, number_of_data_nodes: 26 [11:58:17] ElasticbandSearch has snapped again. 
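
When heap dumps do not appear despite the dump-on-OOM configuration, the usual suspects on a JVM are the -XX:+HeapDumpOnOutOfMemoryError / -XX:HeapDumpPath flags not actually being present on the running process, or the dump path not being writable by the elasticsearch user. A Linux-only sanity check one could run, sketched here as an assumption about how to verify it rather than a record of what was actually done:

#!/usr/bin/env python3
"""Check whether running Elasticsearch JVMs carry the heap-dump-on-OOM flags."""
import glob

def elasticsearch_java_processes():
    for path in glob.glob("/proc/[0-9]*/cmdline"):
        try:
            with open(path, "rb") as f:
                argv = f.read().split(b"\0")
        except OSError:
            continue  # process exited or is not readable
        if argv and b"java" in argv[0] and any(b"elasticsearch" in a.lower() for a in argv):
            yield path.split("/")[2], [a.decode("utf-8", "replace") for a in argv]

def main():
    for pid, argv in elasticsearch_java_processes():
        has_flag = "-XX:+HeapDumpOnOutOfMemoryError" in argv
        dump_path = [a for a in argv if a.startswith("-XX:HeapDumpPath=")]
        print("pid %s: HeapDumpOnOutOfMemoryError=%s %s"
              % (pid, has_flag, dump_path[0] if dump_path else "(no HeapDumpPath set)"))

if __name__ == "__main__":
    main()
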
[11:58:29] PROBLEM - ElasticSearch health check on elastic1021 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.45 [11:58:37] PROBLEM - ElasticSearch health check on elastic1001 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.108 [11:58:58] RECOVERY - ElasticSearch health check for shards on elastic1002 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 25, unassigned_shards: 145, timed_out: False, active_primary_shards: 2011, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5805, initializing_shards: 78, number_of_data_nodes: 25 [11:58:58] PROBLEM - ElasticSearch health check on elastic1015 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.12 [11:59:14] manybubbles: anything we can help with? [11:59:17] PROBLEM - ElasticSearch health check for shards on elastic1016 is CRITICAL: CRITICAL - elasticsearch http://10.64.48.13:9200/_cluster/health error while fetching: Request timed out. [11:59:18] PROBLEM - ElasticSearch health check on elastic1016 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.13 [12:00:12] <_joe_> can I say I'm not surprised that heap dumps in ES are impossible when everything is fucked up? [12:00:15] <_joe_> :/ [12:00:17] PROBLEM - ElasticSearch health check for shards on elastic1001 is CRITICAL: CRITICAL - elasticsearch http://10.64.0.108:9200/_cluster/health error while fetching: Request timed out. [12:00:27] PROBLEM - ElasticSearch health check on elastic1002 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.109 [12:00:29] PROBLEM - ElasticSearch health check for shards on elastic1021 is CRITICAL: CRITICAL - elasticsearch http://10.64.48.45:9200/_cluster/health error while fetching: Request timed out. [12:01:50] _joe_: me neither - I'm going to take my kids to school. be back in 10 [12:03:28] PROBLEM - ElasticSearch health check for shards on elastic1015 is CRITICAL: CRITICAL - elasticsearch http://10.64.48.12:9200/_cluster/health error while fetching: Request timed out. [12:03:28] PROBLEM - ElasticSearch health check for shards on elastic1002 is CRITICAL: CRITICAL - elasticsearch http://10.64.0.109:9200/_cluster/health error while fetching: Request timed out. [12:04:57] RECOVERY - ElasticSearch health check for shards on elastic1021 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 25, unassigned_shards: 145, timed_out: False, active_primary_shards: 2011, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5805, initializing_shards: 78, number_of_data_nodes: 25 [12:05:08] PROBLEM - ElasticSearch health check on elastic1026 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.50 [12:06:08] PROBLEM - ElasticSearch health check on elastic1021 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.45 [12:07:38] PROBLEM - ElasticSearch health check for shards on elastic1026 is CRITICAL: CRITICAL - elasticsearch http://10.64.48.50:9200/_cluster/health error while fetching: Request timed out. [12:08:18] PROBLEM - ElasticSearch health check for shards on elastic1021 is CRITICAL: CRITICAL - elasticsearch http://10.64.48.45:9200/_cluster/health error while fetching: Request timed out. 
[12:08:18] PROBLEM - LVS HTTP IPv4 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:08:48] RECOVERY - ElasticSearch health check for shards on elastic1026 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 25, unassigned_shards: 145, timed_out: False, active_primary_shards: 2011, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5805, initializing_shards: 78, number_of_data_nodes: 25 [12:09:11] (03CR) 10Filippo Giunchedi: mediawiki: simplify apache config (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/170300 (owner: 10Giuseppe Lavagetto) [12:09:18] RECOVERY - LVS HTTP IPv4 on search.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 387 bytes in 0.005 second response time [12:09:45] PROBLEM - ElasticSearch health check on elastic1026 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.50 [12:10:08] so what's going on with search today? [12:11:06] mark: looks like the same problem as we had 4 days ago with elasticsearch heap going nuts [12:13:01] <_joe_> mark: we fell back to lsearchd [12:13:20] back [12:13:21] ok [12:15:19] (03CR) 10Giuseppe Lavagetto: mediawiki: simplify apache config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/170300 (owner: 10Giuseppe Lavagetto) [12:16:24] PROBLEM - ElasticSearch health check on elastic1016 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.13 [12:16:55] PROBLEM - ElasticSearch health check for shards on elastic1026 is CRITICAL: CRITICAL - elasticsearch http://10.64.48.50:9200/_cluster/health error while fetching: Request timed out. [12:19:52] (03PS1) 10Alexandros Kosiaris: Catch a couple of corner cases in check_puppetrun [puppet] - 10https://gerrit.wikimedia.org/r/170310 [12:21:14] PROBLEM - ElasticSearch health check on elastic1025 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.49 [12:21:53] (03PS1) 10Hoo man: Enable global AbuseFilter on medium sized Wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170311 [12:23:09] well at least Nik won't need to be upset [12:23:45] PROBLEM - ElasticSearch health check for shards on elastic1025 is CRITICAL: CRITICAL - elasticsearch http://10.64.48.49:9200/_cluster/health error while fetching: Request timed out. [12:24:15] PROBLEM - ElasticSearch health check on elastic1004 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.111 [12:24:15] PROBLEM - ElasticSearch health check on elastic1006 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.113 [12:24:45] PROBLEM - ElasticSearch health check on elastic1028 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 7: number_of_data_nodes: 7: active_primary_shards: 0: active_shards: 0: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [12:24:45] PROBLEM - ElasticSearch health check on elastic1017 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 4: number_of_data_nodes: 4: active_primary_shards: 0: active_shards: 0: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [12:24:45] PROBLEM - ElasticSearch health check on elastic1023 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 0: active_shards: 0: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [12:24:45] PROBLEM - ElasticSearch health check on elastic1013 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 0: active_shards: 0: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [12:24:46] mark: sadly, I'm *still* unable to get heap dumps. so all I can do is investigate the logs :( [12:24:54] PROBLEM - ElasticSearch health check on elastic1008 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.140 [12:24:55] PROBLEM - ElasticSearch health check on elastic1019 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.41 [12:24:56] where are those logs? [12:25:05] PROBLEM - ElasticSearch health check on elastic1024 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 0: active_shards: 0: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [12:25:15] PROBLEM - ElasticSearch health check on elastic1014 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 0: active_shards: 0: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [12:25:15] PROBLEM - ElasticSearch health check on elastic1027 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 0: active_shards: 0: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [12:25:15] PROBLEM - ElasticSearch health check on elastic1020 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 0: active_shards: 0: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [12:25:26] mark: lots of places. /var/log/elasticsearch on the boxes has some.
fluorine:/a/mw-log has some others [12:25:26] PROBLEM - ElasticSearch health check on elastic1030 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.54 [12:25:35] PROBLEM - ElasticSearch health check on elastic1022 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.46 [12:25:45] PROBLEM - ElasticSearch health check for shards on elastic1025 is CRITICAL: CRITICAL - elasticsearch inactive shards 6028 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 25, uunassigned_shards: 5959, utimed_out: False, uactive_primary_shards: 0, ucluster_name: uproduction-search-eqiad, urelocating_shards: 0, uactive_shards: 0, uinitializing_shards: 69, unumber_of_data_nodes: 25} [12:25:45] PROBLEM - ElasticSearch health check for shards on elastic1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 6028 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 25, uunassigned_shards: 5959, utimed_out: False, uactive_primary_shards: 0, ucluster_name: uproduction-search-eqiad, urelocating_shards: 0, uactive_shards: 0, uinitializing_shards: 69, unumber_of_data_nodes: 25} [12:25:55] PROBLEM - ElasticSearch health check for shards on elastic1029 is CRITICAL: CRITICAL - elasticsearch inactive shards 6028 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 25, uunassigned_shards: 5959, utimed_out: False, uactive_primary_shards: 0, ucluster_name: uproduction-search-eqiad, urelocating_shards: 0, uactive_shards: 0, uinitializing_shards: 69, unumber_of_data_nodes: 25} [12:25:55] PROBLEM - ElasticSearch health check for shards on elastic1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 6028 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 25, uunassigned_shards: 5959, utimed_out: False, uactive_primary_shards: 0, ucluster_name: uproduction-search-eqiad, urelocating_shards: 0, uactive_shards: 0, uinitializing_shards: 69, unumber_of_data_nodes: 25} [12:25:55] PROBLEM - ElasticSearch health check for shards on elastic1016 is CRITICAL: CRITICAL - elasticsearch http://10.64.48.13:9200/_cluster/health error while fetching: Max retries exceeded for url: /_cluster/health [12:26:04] PROBLEM - ElasticSearch health check for shards on elastic1021 is CRITICAL: CRITICAL - elasticsearch inactive shards 6028 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 25, uunassigned_shards: 5959, utimed_out: False, uactive_primary_shards: 0, ucluster_name: uproduction-search-eqiad, urelocating_shards: 0, uactive_shards: 0, uinitializing_shards: 69, unumber_of_data_nodes: 25} [12:26:05] PROBLEM - ElasticSearch health check for shards on elastic1005 is CRITICAL: CRITICAL - elasticsearch inactive shards 5962 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 24, uunassigned_shards: 5890, utimed_out: False, uactive_primary_shards: 66, ucluster_name: uproduction-search-eqiad, urelocating_shards: 0, uactive_shards: 66, uinitializing_shards: 72, unumber_of_data_nodes: 24} [12:26:20] !log restarting all elasticsearch boxes in quick sequence. when I try restarting a frozen box another one freezes up (probably an evil request being retried on it after its buddy went down). 
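
The !log entry above describes restarting all the Elasticsearch boxes in quick sequence. A sketch of that kind of rolling-restart loop, assuming SSH access, a conventional init script name, and an illustrative host list; this is not the procedure that was actually run:

#!/usr/bin/env python3
"""Rolling restart of Elasticsearch nodes, waiting for each one to answer again."""
import json
import subprocess
import time
import urllib.request

NODES = ["elastic1001.eqiad.wmnet", "elastic1002.eqiad.wmnet"]  # illustrative subset

def node_responds(host, timeout=10):
    try:
        with urllib.request.urlopen("http://%s:9200/_cluster/health" % host, timeout=timeout) as resp:
            json.load(resp)
        return True
    except Exception:
        return False

def rolling_restart(nodes=NODES, wait=600):
    for node in nodes:
        print("restarting elasticsearch on %s" % node)
        subprocess.check_call(["ssh", node, "sudo", "service", "elasticsearch", "restart"])
        deadline = time.time() + wait  # give the JVM time to come back before moving on
        while time.time() < deadline and not node_responds(node):
            time.sleep(15)

if __name__ == "__main__":
    rolling_restart()
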
[12:26:26] Logged the message, Master [12:26:29] PROBLEM - ElasticSearch health check for shards on elastic1031 is CRITICAL: CRITICAL - elasticsearch inactive shards 5971 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 23, uunassigned_shards: 5902, utimed_out: False, uactive_primary_shards: 57, ucluster_name: uproduction-search-eqiad, urelocating_shards: 0, uactive_shards: 57, uinitializing_shards: 69, unumber_of_data_nodes: 23} [12:26:34] PROBLEM - ElasticSearch health check for shards on elastic1004 is CRITICAL: CRITICAL - elasticsearch inactive shards 5914 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 24, uunassigned_shards: 5848, utimed_out: False, uactive_primary_shards: 114, ucluster_name: uproduction-search-eqiad, urelocating_shards: 0, uactive_shards: 114, uinitializing_shards: 66, unumber_of_data_nodes: 24} [12:26:44] PROBLEM - ElasticSearch health check for shards on elastic1026 is CRITICAL: CRITICAL - elasticsearch inactive shards 5914 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 24, uunassigned_shards: 5848, utimed_out: False, uactive_primary_shards: 114, ucluster_name: uproduction-search-eqiad, urelocating_shards: 0, uactive_shards: 114, uinitializing_shards: 66, unumber_of_data_nodes: 24} [12:26:54] PROBLEM - ElasticSearch health check for shards on elastic1027 is CRITICAL: CRITICAL - elasticsearch inactive shards 5776 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 25, uunassigned_shards: 5701, utimed_out: False, uactive_primary_shards: 252, ucluster_name: uproduction-search-eqiad, urelocating_shards: 0, uactive_shards: 252, uinitializing_shards: 75, unumber_of_data_nodes: 25} [12:26:58] PROBLEM - ElasticSearch health check for shards on elastic1017 is CRITICAL: CRITICAL - elasticsearch inactive shards 5776 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 25, uunassigned_shards: 5701, utimed_out: False, uactive_primary_shards: 252, ucluster_name: uproduction-search-eqiad, urelocating_shards: 0, uactive_shards: 252, uinitializing_shards: 75, unumber_of_data_nodes: 25} [12:27:04] PROBLEM - ElasticSearch health check for shards on elastic1028 is CRITICAL: CRITICAL - elasticsearch inactive shards 5630 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 26, uunassigned_shards: 5552, utimed_out: False, uactive_primary_shards: 398, ucluster_name: uproduction-search-eqiad, urelocating_shards: 0, uactive_shards: 398, uinitializing_shards: 78, unumber_of_data_nodes: 26} [12:27:04] PROBLEM - ElasticSearch health check for shards on elastic1006 is CRITICAL: CRITICAL - elasticsearch inactive shards 5630 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 26, uunassigned_shards: 5552, utimed_out: False, uactive_primary_shards: 398, ucluster_name: uproduction-search-eqiad, urelocating_shards: 0, uactive_shards: 398, uinitializing_shards: 78, unumber_of_data_nodes: 26} [12:27:04] PROBLEM - ElasticSearch health check for shards on elastic1023 is CRITICAL: CRITICAL - elasticsearch http://10.64.48.47:9200/_cluster/health error while fetching: Max retries exceeded for url: /_cluster/health [12:27:05] PROBLEM - ElasticSearch health check for shards on elastic1019 is CRITICAL: CRITICAL - elasticsearch inactive shards 5597 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 24, uunassigned_shards: 5523, utimed_out: False, uactive_primary_shards: 429, ucluster_name: uproduction-search-eqiad, urelocating_shards: 0, uactive_shards: 431, uinitializing_shards: 74, unumber_of_data_nodes: 24} [12:27:09] I have a sneaking suspicion 
its in the glorious regex matcing code I wrote a while back. but I'm not sure [12:27:14] PROBLEM - ElasticSearch health check for shards on elastic1030 is CRITICAL: CRITICAL - elasticsearch inactive shards 5597 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 24, uunassigned_shards: 5523, utimed_out: False, uactive_primary_shards: 429, ucluster_name: uproduction-search-eqiad, urelocating_shards: 0, uactive_shards: 431, uinitializing_shards: 74, unumber_of_data_nodes: 24} [12:27:14] PROBLEM - ElasticSearch health check for shards on elastic1014 is CRITICAL: CRITICAL - elasticsearch inactive shards 5597 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 24, uunassigned_shards: 5523, utimed_out: False, uactive_primary_shards: 429, ucluster_name: uproduction-search-eqiad, urelocating_shards: 0, uactive_shards: 431, uinitializing_shards: 74, unumber_of_data_nodes: 24} [12:27:19] can we turn it off? ;) [12:27:24] PROBLEM - ElasticSearch health check for shards on elastic1024 is CRITICAL: CRITICAL - elasticsearch http://10.64.48.48:9200/_cluster/health error while fetching: Max retries exceeded for url: /_cluster/health [12:27:29] mark: sure [12:27:44] PROBLEM - ElasticSearch health check for shards on elastic1008 is CRITICAL: CRITICAL - elasticsearch inactive shards 5743 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 23, uunassigned_shards: 5688, utimed_out: False, uactive_primary_shards: 283, ucluster_name: uproduction-search-eqiad, urelocating_shards: 0, uactive_shards: 285, uinitializing_shards: 55, unumber_of_data_nodes: 23} [12:27:45] PROBLEM - ElasticSearch health check for shards on elastic1022 is CRITICAL: CRITICAL - elasticsearch inactive shards 5743 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 27, uunassigned_shards: 5688, utimed_out: False, uactive_primary_shards: 283, ucluster_name: uproduction-search-eqiad, urelocating_shards: 0, uactive_shards: 285, uinitializing_shards: 55, unumber_of_data_nodes: 27} [12:27:45] PROBLEM - ElasticSearch health check for shards on elastic1013 is CRITICAL: CRITICAL - elasticsearch inactive shards 5743 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 27, uunassigned_shards: 5688, utimed_out: False, uactive_primary_shards: 283, ucluster_name: uproduction-search-eqiad, urelocating_shards: 0, uactive_shards: 285, uinitializing_shards: 55, unumber_of_data_nodes: 27} [12:27:45] i've learned to trust my own gut feeling in cases where no more substantial info is available [12:27:46] PROBLEM - ElasticSearch health check for shards on elastic1020 is CRITICAL: CRITICAL - elasticsearch inactive shards 5743 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 27, uunassigned_shards: 5688, utimed_out: False, uactive_primary_shards: 283, ucluster_name: uproduction-search-eqiad, urelocating_shards: 0, uactive_shards: 285, uinitializing_shards: 55, unumber_of_data_nodes: 27} [12:27:48] <_joe_> mark: ElasticSearch? we're doing a good work at turning it off :) [12:27:49] and i'm happy to trust yours now ;) [12:30:03] (03PS1) 10Manybubbles: Disable Cirrus' fancy regex code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170312 [12:30:58] I'm going to turn that off right now. 
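For context, the kill switch being merged here is a config-only change: flip a flag in wmf-config and sync it out, much like the Synchronized entries that follow. A rough sketch of that flow (the staging path and the variable name are illustrative, not the actual contents of change 170312):

    # on the deployment host
    cd /srv/mediawiki-staging && git pull
    # the change itself is a one-liner along the lines of:
    #   $wmgCirrusSearchAllowRegex = false;   // hypothetical flag name
    sync-file wmf-config/CirrusSearch-production.php 'Disable Cirrus regexes while we investigate'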
[12:31:03] yep [12:31:04] (03CR) 10Manybubbles: [C: 032] Disable Cirrus' fancy regex code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170312 (owner: 10Manybubbles) [12:31:11] (03Merged) 10jenkins-bot: Disable Cirrus' fancy regex code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170312 (owner: 10Manybubbles) [12:31:38] !log restart of elasticsearch nodes got them back to responsive. Cluster isn't fully healed yet but we're better then we were. Still not sure how we got this way [12:31:45] Logged the message, Master [12:32:38] !log manybubbles Synchronized wmf-config/: Disable Cirrus accelerated regexes as we *think* they might be causing outages (duration: 00m 04s) [12:32:45] Logged the message, Master [12:35:13] (03PS1) 10Manybubbles: Reenable Cirrus on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170313 [12:35:13] RECOVERY - check_puppetrun on pay-lvs1001 is OK: OK: Puppet is currently enabled, last run 187 seconds ago with 0 failures [12:35:26] (03CR) 10Manybubbles: [C: 032] Reenable Cirrus on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170313 (owner: 10Manybubbles) [12:35:33] (03Merged) 10jenkins-bot: Reenable Cirrus on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170313 (owner: 10Manybubbles) [12:36:27] !log manybubbles Synchronized wmf-config/InitialiseSettings.php: reenable cirrus on testwiki (duration: 00m 04s) [12:36:32] Logged the message, Master [12:37:20] !log cirrus is working on test2wiki - we look to be recovered save for some loss of redundancy [12:37:25] Logged the message, Master [12:38:53] (03PS1) 10Manybubbles: Reenabel cirrus as alternative everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170314 [12:39:38] any objections to me reenabling cirrus as an alternative everywhere? that should fix geo which goes down when cirrus is down (I think) [12:40:14] let's try [12:40:27] (03CR) 10Manybubbles: [C: 032] Reenabel cirrus as alternative everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170314 (owner: 10Manybubbles) [12:40:34] (03Merged) 10jenkins-bot: Reenabel cirrus as alternative everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170314 (owner: 10Manybubbles) [12:40:42] it's not like it will be less painful next week, rather the opposite, really [12:41:16] !log manybubbles Synchronized wmf-config/InitialiseSettings.php: reenable cirrus as betafeature everywhere (duration: 00m 05s) [12:41:21] Logged the message, Master [12:41:49] !log reenabled cirrus as betafeature - no spike in error logs [12:41:55] Logged the message, Master [12:42:17] <_joe_> I agree [12:42:53] <_joe_> manybubbles: we have to reenable cirrus, else we won't see fi the regex code was the issue here [12:43:01] yeah [12:43:23] _joe_: it might take days to see it again. but yeah. [12:43:46] <_joe_> manybubbles: well, better to start the clock now :) [12:44:16] I suppose :) - ok so we have 2 way redundancy back pretty much everywhere - let me verify that - but if we have it then I propose we just reenable it for everyeone that had it before [12:47:37] hmm - looks like we don't have redundancy back online for a few of the indexes yet. its coming but it takes longer than it should. 
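To see exactly which indexes are still short on replica copies while the cluster sits yellow, the cat APIs give a quick per-index and per-shard view; a small sketch (host name is illustrative):

    # Per-index health; anything not green is still missing replica copies
    curl -s 'http://elastic1001:9200/_cat/indices?v' | awk 'NR==1 || $1 != "green"'

    # Shards not yet in STARTED state (unassigned, initializing, relocating)
    curl -s 'http://elastic1001:9200/_cat/shards' | grep -v ' STARTED '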
[12:48:09] I'm going to reindex all the changes we might have lost and give the cluster an hour or so to sync itself before we cut back over from lsearchd [12:53:40] RECOVERY - ElasticSearch health check for shards on elastic1031 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 27, unassigned_shards: 531, timed_out: False, active_primary_shards: 2011, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5426, initializing_shards: 71, number_of_data_nodes: 27 [12:53:40] RECOVERY - ElasticSearch health check for shards on elastic1013 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 27, unassigned_shards: 529, timed_out: False, active_primary_shards: 2011, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5428, initializing_shards: 71, number_of_data_nodes: 27 [12:53:41] RECOVERY - ElasticSearch health check for shards on elastic1022 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 27, unassigned_shards: 529, timed_out: False, active_primary_shards: 2011, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5428, initializing_shards: 71, number_of_data_nodes: 27 [12:53:41] RECOVERY - ElasticSearch health check for shards on elastic1008 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 27, unassigned_shards: 529, timed_out: False, active_primary_shards: 2011, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5428, initializing_shards: 71, number_of_data_nodes: 27 [12:53:41] RECOVERY - ElasticSearch health check for shards on elastic1004 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 27, unassigned_shards: 529, timed_out: False, active_primary_shards: 2011, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5428, initializing_shards: 71, number_of_data_nodes: 27 [12:53:41] RECOVERY - ElasticSearch health check for shards on elastic1020 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 27, unassigned_shards: 529, timed_out: False, active_primary_shards: 2011, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5429, initializing_shards: 70, number_of_data_nodes: 27 [12:53:41] RECOVERY - ElasticSearch health check for shards on elastic1026 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 27, unassigned_shards: 529, timed_out: False, active_primary_shards: 2011, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5429, initializing_shards: 70, number_of_data_nodes: 27 [12:53:42] RECOVERY - ElasticSearch health check for shards on elastic1003 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 27, unassigned_shards: 529, timed_out: False, active_primary_shards: 2011, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5429, initializing_shards: 70, number_of_data_nodes: 27 [12:53:50] RECOVERY - ElasticSearch health check for shards on elastic1017 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 27, unassigned_shards: 526, timed_out: False, active_primary_shards: 2011, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5432, initializing_shards: 70, number_of_data_nodes: 27 [12:53:56] RECOVERY - ElasticSearch health 
check for shards on elastic1027 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 27, unassigned_shards: 526, timed_out: False, active_primary_shards: 2011, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5432, initializing_shards: 70, number_of_data_nodes: 27 [12:53:56] RECOVERY - ElasticSearch health check for shards on elastic1025 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 27, unassigned_shards: 526, timed_out: False, active_primary_shards: 2011, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5432, initializing_shards: 70, number_of_data_nodes: 27 [12:53:59] RECOVERY - ElasticSearch health check for shards on elastic1015 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 27, unassigned_shards: 525, timed_out: False, active_primary_shards: 2011, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5433, initializing_shards: 70, number_of_data_nodes: 27 [12:53:59] RECOVERY - ElasticSearch health check for shards on elastic1002 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 27, unassigned_shards: 525, timed_out: False, active_primary_shards: 2011, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5433, initializing_shards: 70, number_of_data_nodes: 27 [12:53:59] RECOVERY - ElasticSearch health check for shards on elastic1023 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 27, unassigned_shards: 524, timed_out: False, active_primary_shards: 2011, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5434, initializing_shards: 70, number_of_data_nodes: 27 [12:54:00] RECOVERY - ElasticSearch health check for shards on elastic1006 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 27, unassigned_shards: 524, timed_out: False, active_primary_shards: 2011, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5434, initializing_shards: 70, number_of_data_nodes: 27 [12:54:00] RECOVERY - ElasticSearch health check for shards on elastic1028 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 27, unassigned_shards: 524, timed_out: False, active_primary_shards: 2011, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5434, initializing_shards: 70, number_of_data_nodes: 27 [12:54:10] RECOVERY - ElasticSearch health check for shards on elastic1030 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 27, unassigned_shards: 516, timed_out: False, active_primary_shards: 2011, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5442, initializing_shards: 70, number_of_data_nodes: 27 [12:54:14] RECOVERY - ElasticSearch health check for shards on elastic1014 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 27, unassigned_shards: 516, timed_out: False, active_primary_shards: 2011, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5442, initializing_shards: 70, number_of_data_nodes: 27 [12:54:14] RECOVERY - ElasticSearch health check for shards on elastic1019 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 27, unassigned_shards: 516, timed_out: False, active_primary_shards: 2011, 
cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5442, initializing_shards: 70, number_of_data_nodes: 27 [12:54:21] RECOVERY - ElasticSearch health check for shards on elastic1016 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 27, unassigned_shards: 515, timed_out: False, active_primary_shards: 2011, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5444, initializing_shards: 69, number_of_data_nodes: 27 [12:54:21] RECOVERY - ElasticSearch health check for shards on elastic1018 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 27, unassigned_shards: 515, timed_out: False, active_primary_shards: 2011, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5444, initializing_shards: 69, number_of_data_nodes: 27 [12:54:21] RECOVERY - ElasticSearch health check for shards on elastic1001 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 27, unassigned_shards: 513, timed_out: False, active_primary_shards: 2011, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5446, initializing_shards: 69, number_of_data_nodes: 27 [12:54:21] RECOVERY - ElasticSearch health check for shards on elastic1029 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 27, unassigned_shards: 513, timed_out: False, active_primary_shards: 2011, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5446, initializing_shards: 69, number_of_data_nodes: 27 [12:54:21] RECOVERY - ElasticSearch health check for shards on elastic1021 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 27, unassigned_shards: 511, timed_out: False, active_primary_shards: 2011, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5449, initializing_shards: 68, number_of_data_nodes: 27 [12:54:22] RECOVERY - ElasticSearch health check for shards on elastic1007 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 27, unassigned_shards: 508, timed_out: False, active_primary_shards: 2011, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5452, initializing_shards: 68, number_of_data_nodes: 27 [12:54:22] RECOVERY - ElasticSearch health check for shards on elastic1005 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 27, unassigned_shards: 508, timed_out: False, active_primary_shards: 2011, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5452, initializing_shards: 68, number_of_data_nodes: 27 [12:54:30] RECOVERY - ElasticSearch health check for shards on elastic1024 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 27, unassigned_shards: 505, timed_out: False, active_primary_shards: 2011, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 5455, initializing_shards: 68, number_of_data_nodes: 27 [13:04:19] .... 
weird - its been yellow for a long long time [13:10:00] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [13:10:17] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: puppet fail [13:10:17] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [13:10:22] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: puppet fail [13:10:36] PROBLEM - check if salt-minion is running on ocg1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [13:14:56] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [13:15:06] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: puppet fail [13:15:09] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [13:15:10] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: puppet fail [13:19:37] <_joe_> mmmh ocg [13:20:00] RECOVERY - check_puppetrun on backup4001 is OK: OK: Puppet is currently enabled, last run 253 seconds ago with 0 failures [13:20:09] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: puppet fail [13:20:10] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [13:20:10] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: puppet fail [13:20:14] (03PS1) 10Manybubbles: Reenable cirrus everywhere where it was [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170316 [13:20:59] we have enough redundancy to reenable cirrus whevere where it was. ony objections [13:21:00] ? [13:21:08] *any* objections, rather? [13:21:28] RECOVERY - ElasticSearch health check on elastic1028 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 27: number_of_data_nodes: 27: active_primary_shards: 2011: active_shards: 6028: relocating_shards: 16: initializing_shards: 0: unassigned_shards: 0 [13:21:29] RECOVERY - ElasticSearch health check on elastic1023 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 27: number_of_data_nodes: 27: active_primary_shards: 2011: active_shards: 6028: relocating_shards: 16: initializing_shards: 0: unassigned_shards: 0 [13:21:29] RECOVERY - ElasticSearch health check on elastic1019 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 27: number_of_data_nodes: 27: active_primary_shards: 2011: active_shards: 6028: relocating_shards: 16: initializing_shards: 0: unassigned_shards: 0 [13:21:41] RECOVERY - ElasticSearch health check on elastic1017 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 27: number_of_data_nodes: 27: active_primary_shards: 2011: active_shards: 6028: relocating_shards: 16: initializing_shards: 0: unassigned_shards: 0 [13:21:41] RECOVERY - ElasticSearch health check on elastic1013 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 27: number_of_data_nodes: 27: active_primary_shards: 2011: active_shards: 6028: relocating_shards: 16: initializing_shards: 0: unassigned_shards: 0 [13:21:48] RECOVERY - ElasticSearch health check on elastic1008 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 27: number_of_data_nodes: 27: active_primary_shards: 2011: active_shards: 6028: relocating_shards: 16: initializing_shards: 0: unassigned_shards: 0 [13:21:49] none from me [13:21:50] <_joe_> nope [13:21:55] RECOVERY - ElasticSearch health check on elastic1002 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 27: number_of_data_nodes: 27: active_primary_shards: 2011: active_shards: 6028: relocating_shards: 16: initializing_shards: 0: unassigned_shards: 0 [13:21:55] RECOVERY - ElasticSearch health check on elastic1029 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 27: number_of_data_nodes: 27: active_primary_shards: 2011: active_shards: 6028: relocating_shards: 16: initializing_shards: 0: unassigned_shards: 0 [13:21:55] RECOVERY - ElasticSearch health check on elastic1016 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 27: number_of_data_nodes: 27: active_primary_shards: 2011: active_shards: 6028: relocating_shards: 16: initializing_shards: 0: unassigned_shards: 0 [13:22:08] RECOVERY - ElasticSearch health check on elastic1024 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 27: number_of_data_nodes: 27: active_primary_shards: 2011: active_shards: 6028: relocating_shards: 16: initializing_shards: 0: unassigned_shards: 0 [13:22:08] RECOVERY - ElasticSearch health check on elastic1014 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 27: number_of_data_nodes: 27: active_primary_shards: 2011: active_shards: 6028: relocating_shards: 16: initializing_shards: 0: unassigned_shards: 0 [13:22:09] RECOVERY - ElasticSearch health check on elastic1007 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 27: number_of_data_nodes: 27: active_primary_shards: 2011: active_shards: 6028: relocating_shards: 16: initializing_shards: 0: unassigned_shards: 0 [13:22:09] RECOVERY - ElasticSearch health check on elastic1026 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 27: number_of_data_nodes: 27: active_primary_shards: 2011: active_shards: 6028: relocating_shards: 16: initializing_shards: 0: unassigned_shards: 0 [13:22:09] RECOVERY - ElasticSearch health check on elastic1021 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 27: number_of_data_nodes: 27: active_primary_shards: 2011: active_shards: 6028: relocating_shards: 16: initializing_shards: 0: unassigned_shards: 0 [13:22:09] RECOVERY - ElasticSearch health check on elastic1030 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 27: number_of_data_nodes: 27: active_primary_shards: 2011: active_shards: 6028: relocating_shards: 16: initializing_shards: 0: unassigned_shards: 0 [13:22:09] RECOVERY - ElasticSearch health check on elastic1027 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 27: number_of_data_nodes: 27: active_primary_shards: 2011: active_shards: 6028: relocating_shards: 16: initializing_shards: 0: unassigned_shards: 0 [13:22:21] RECOVERY - ElasticSearch health check on elastic1004 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 27: number_of_data_nodes: 27: active_primary_shards: 2011: active_shards: 6028: relocating_shards: 16: initializing_shards: 0: unassigned_shards: 0 [13:22:21] RECOVERY - ElasticSearch health check on elastic1006 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 27: number_of_data_nodes: 27: active_primary_shards: 2011: active_shards: 6028: relocating_shards: 16: initializing_shards: 0: unassigned_shards: 0 [13:22:21] RECOVERY - ElasticSearch health check on elastic1005 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 27: number_of_data_nodes: 27: active_primary_shards: 2011: active_shards: 6028: relocating_shards: 16: initializing_shards: 0: unassigned_shards: 0 [13:22:22] RECOVERY - ElasticSearch health check on elastic1025 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 27: number_of_data_nodes: 27: active_primary_shards: 2011: active_shards: 6028: relocating_shards: 16: initializing_shards: 0: unassigned_shards: 0 [13:22:27] RECOVERY - ElasticSearch health check on elastic1020 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 27: number_of_data_nodes: 27: active_primary_shards: 2011: active_shards: 6028: relocating_shards: 16: initializing_shards: 0: unassigned_shards: 0 [13:22:28] RECOVERY - ElasticSearch health check on elastic1001 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 27: number_of_data_nodes: 27: active_primary_shards: 2011: active_shards: 6028: relocating_shards: 16: initializing_shards: 0: unassigned_shards: 0 [13:22:28] RECOVERY - ElasticSearch health check on elastic1022 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 27: number_of_data_nodes: 27: active_primary_shards: 2011: active_shards: 6028: relocating_shards: 16: initializing_shards: 0: unassigned_shards: 0 [13:22:28] RECOVERY - ElasticSearch health check on elastic1018 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 27: number_of_data_nodes: 27: active_primary_shards: 2011: active_shards: 6028: relocating_shards: 16: initializing_shards: 0: unassigned_shards: 0 [13:22:38] RECOVERY - ElasticSearch health check on elastic1015 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 27: number_of_data_nodes: 27: active_primary_shards: 2011: active_shards: 6028: relocating_shards: 16: initializing_shards: 0: unassigned_shards: 0 [13:22:42] RECOVERY - ElasticSearch health check on elastic1031 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 27: number_of_data_nodes: 27: active_primary_shards: 2011: active_shards: 6028: relocating_shards: 16: initializing_shards: 0: unassigned_shards: 0 [13:22:42] RECOVERY - ElasticSearch health check on elastic1003 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 27: number_of_data_nodes: 27: active_primary_shards: 2011: active_shards: 6028: relocating_shards: 16: initializing_shards: 0: unassigned_shards: 0 [13:24:57] (03CR) 10Manybubbles: [C: 032] Reenable cirrus everywhere where it was [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170316 (owner: 10Manybubbles) [13:25:04] (03Merged) 10jenkins-bot: Reenable cirrus everywhere where it was [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170316 (owner: 10Manybubbles) [13:25:08] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: puppet fail [13:25:09] RECOVERY - check_puppetrun on samarium is OK: OK: Puppet is currently enabled, last run 208 seconds ago with 0 failures [13:25:18] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: puppet fail [13:25:41] !log manybubbles Synchronized wmf-config/InitialiseSettings.php: reenable cirrus everywhere where it has been after the outage has passed (duration: 00m 03s) [13:25:47] Logged the message, Master [13:27:16] !log reenable was uneventful. good news. [13:27:22] Logged the message, Master [13:30:09] RECOVERY - check_puppetrun on pay-lvs1001 is OK: OK: Puppet is currently enabled, last run 287 seconds ago with 0 failures [13:30:10] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: puppet fail [13:35:19] RECOVERY - check_puppetrun on payments1004 is OK: OK: Puppet is currently enabled, last run 226 seconds ago with 0 failures [13:37:08] well - just filled up the heap on my laptop's elasticsearch real fast with my regex plugin. surprise surprise..... [13:37:14] _joe_: ^^^ better news [13:37:29] nice [13:38:12] I just ran the first query logged in CirrusSearch-slow.log. [13:38:29] its a regex. and its decently complex [13:38:37] so now, lets make that not happen [13:42:08] <^d> *yawn* [13:42:14] <^d> Morning folks. [13:42:59] awwwww :( [13:43:04] no regex [13:44:57] (03PS1) 10Hashar: zuul: log file at error level [puppet] - 10https://gerrit.wikimedia.org/r/170319 [13:45:23] <^d> manybubbles: And I'd just told rob.la yesterday that we'd fixed it with better logging ;-) [13:45:35] I thought we had too [13:45:54] how do you fix something by logging better? [13:46:22] <^d> You don't, but you can be snarky about it and be like "well obviously the logging change made it better :)" [13:46:38] perhaps if it slows down things enough to avoid racy conditions, deadlocks, etc ;-) [13:46:48] <^d> hehehe [13:48:10] <_joe_> bbiab, lunchtime [13:50:30] <^d> Did we get elastic1009-12 yet? [13:50:49] wow, manybubbles, yuck. curious [13:50:52] they're down in ganglia [13:50:52] <^d> They were waiting on initial puppet run last night. [13:50:56] with so many more nodes [13:51:02] did this fail as hard as it did on monday? [13:51:07] ottomata: just as hard [13:51:15] if 1009-1012 are ready, i can start on them shortly [13:51:17] I *think* I know why [13:51:25] but let me investigate [13:51:29] yeah, read your email, regex plugin, but still crazy that it would happen on all of them [13:51:29] k [13:51:32] <^d> ottomata: I think they're just ready for puppetssss [13:51:36] hokay [13:51:42] why wouldn't it happen on all of them? 
:) [13:52:05] presumably they all get the queries that cause it? [13:53:37] from the description I've heard, it cascades in someway? [13:57:46] <^d> manybubbles: Probably not good to do this on a friday post-outage, but I redid the shard/replica counts yesterday for you. Patches up :) [13:58:08] <^d> (and your per-index replica thing went out ok with swat to all branches) [14:02:25] mark: yeah - I think what happens is that the query actually gets retries some on failures too. [14:02:41] ottomata: I can make ooms at will on my laptop with it right now. not cool [14:02:43] <^d> Could we adjust retry for regex? [14:04:00] ^d: I'm not sure if it is us or elasticright now [14:04:22] <^d> I'm just wondering though if it'd mitigate things. [14:04:38] <^d> Rather than retrying on something so expensive. [14:09:41] ^d: do you think you can hack something together that turns off regex entirely? I'm aftraid that this doesn't require acceleration to blow up. [14:09:57] <^d> I'll have a look [14:13:28] I'm going to hunt down the root cause of this [14:13:54] <^d> Yeah, shouldn't take long to hack something up that'll skip it. [14:15:23] RECOVERY - check if salt-minion is running on elastic1010 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:15:42] RECOVERY - check configured eth on elastic1009 is OK: NRPE: Unable to read output [14:15:44] RECOVERY - DPKG on elastic1009 is OK: All packages OK [14:15:53] RECOVERY - Disk space on elastic1010 is OK: DISK OK [14:15:53] RECOVERY - check if dhclient is running on elastic1009 is OK: PROCS OK: 0 processes with command name dhclient [14:15:58] RECOVERY - Disk space on elastic1011 is OK: DISK OK [14:16:01] RECOVERY - RAID on elastic1009 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [14:16:12] RECOVERY - check if salt-minion is running on elastic1009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:16:13] RECOVERY - check if dhclient is running on elastic1012 is OK: PROCS OK: 0 processes with command name dhclient [14:16:14] RECOVERY - check if dhclient is running on elastic1010 is OK: PROCS OK: 0 processes with command name dhclient [14:16:26] RECOVERY - check if salt-minion is running on elastic1012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:16:31] RECOVERY - DPKG on elastic1010 is OK: All packages OK [14:16:32] RECOVERY - RAID on elastic1010 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [14:16:41] RECOVERY - check configured eth on elastic1010 is OK: NRPE: Unable to read output [14:16:42] RECOVERY - RAID on elastic1011 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [14:16:42] RECOVERY - Disk space on elastic1012 is OK: DISK OK [14:16:42] RECOVERY - DPKG on elastic1012 is OK: All packages OK [14:16:42] RECOVERY - Disk space on elastic1009 is OK: DISK OK [14:16:51] RECOVERY - RAID on elastic1012 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [14:16:51] RECOVERY - check configured eth on elastic1011 is OK: NRPE: Unable to read output [14:16:51] RECOVERY - check if salt-minion is running on elastic1011 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:16:52] RECOVERY - DPKG on elastic1011 is OK: All packages OK [14:17:01] RECOVERY - check if dhclient is running on elastic1011 is OK: PROCS OK: 0 processes with command name dhclient [14:17:03] RECOVERY - check configured eth on elastic1012 is OK: NRPE: Unable to read output [14:17:51] RECOVERY - puppet last run on elastic1009 is OK: OK: 
Puppet is currently enabled, last run 23 seconds ago with 0 failures [14:18:04] RECOVERY - puppet last run on elastic1010 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [14:18:10] RECOVERY - puppet last run on elastic1012 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [14:18:11] RECOVERY - ElasticSearch health check on elastic1010 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 28: number_of_data_nodes: 28: active_primary_shards: 2011: active_shards: 6028: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [14:18:24] RECOVERY - ElasticSearch health check on elastic1009 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 29: number_of_data_nodes: 29: active_primary_shards: 2011: active_shards: 6028: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [14:18:31] RECOVERY - ElasticSearch health check for shards on elastic1011 is OK: OK - elasticsearch status production-search-eqiad: status: green, number_of_nodes: 30, unassigned_shards: 0, timed_out: False, active_primary_shards: 2011, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 6028, initializing_shards: 0, number_of_data_nodes: 30 [14:18:31] RECOVERY - ElasticSearch health check on elastic1012 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2011: active_shards: 6028: relocating_shards: 16: initializing_shards: 0: unassigned_shards: 0 [14:18:31] RECOVERY - puppet last run on elastic1011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:18:43] RECOVERY - ElasticSearch health check for shards on elastic1012 is OK: OK - elasticsearch status production-search-eqiad: status: green, number_of_nodes: 31, unassigned_shards: 0, timed_out: False, active_primary_shards: 2011, cluster_name: production-search-eqiad, relocating_shards: 16, active_shards: 6028, initializing_shards: 0, number_of_data_nodes: 31 [14:18:50] manybubbles: ^d, 1009-1012 done [14:18:57] is that the last of them? [14:19:02] RECOVERY - ElasticSearch health check for shards on elastic1010 is OK: OK - elasticsearch status production-search-eqiad: status: green, number_of_nodes: 31, unassigned_shards: 0, timed_out: False, active_primary_shards: 2011, cluster_name: production-search-eqiad, relocating_shards: 16, active_shards: 6028, initializing_shards: 0, number_of_data_nodes: 31 [14:19:12] RECOVERY - ElasticSearch health check for shards on elastic1009 is OK: OK - elasticsearch status production-search-eqiad: status: green, number_of_nodes: 31, unassigned_shards: 0, timed_out: False, active_primary_shards: 2011, cluster_name: production-search-eqiad, relocating_shards: 16, active_shards: 6028, initializing_shards: 0, number_of_data_nodes: 31 [14:19:24] RECOVERY - ElasticSearch health check on elastic1011 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2011: active_shards: 6028: relocating_shards: 16: initializing_shards: 0: unassigned_shards: 0 [14:19:46] <^d> ottomata: Yes :D [14:19:57] <^d> Thank you thank you thank you and cmjohnson for all your work this week!!! 
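Those "check if salt-minion is running" results come from NRPE calling check_procs with an argument regex, so a flapping box can be checked by hand with the same plugin. A sketch assuming the stock plugin path; the exact warning/critical thresholds in our NRPE config may differ:

    /usr/lib/nagios/plugins/check_procs -c 1: \
        --ereg-argument-array='^/usr/bin/python /usr/bin/salt-minion'
    # expected output: PROCS OK: 1 process with regex args '^/usr/bin/python /usr/bin/salt-minion'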
[14:20:21] yup, np, it was easy from our end, yall have dealt with the hard stuff [14:20:21] ottomata: looks good to me [14:21:35] <^d> manybubbles: https://gerrit.wikimedia.org/r/#/c/170325/ [14:23:40] !log update DNS/NTP settings, add codfw on nas1001-a,b [14:23:46] Logged the message, Master [14:24:21] ^d anytime! [14:24:38] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] zuul: log file at error level [puppet] - 10https://gerrit.wikimedia.org/r/170319 (owner: 10Hashar) [14:27:52] <^d> manybubbles: I'm guessing you want that live like yesterday? :) [14:28:21] ^d: ..... yeah- the more I play with it the more it look like a bug in lucene's regex library.... [14:33:05] RECOVERY - NTP on elastic1010 is OK: NTP OK: Offset -0.01214683056 secs [14:34:06] RECOVERY - NTP on elastic1011 is OK: NTP OK: Offset -0.01416957378 secs [14:34:06] RECOVERY - NTP on elastic1012 is OK: NTP OK: Offset 0.002558946609 secs [14:34:36] RECOVERY - NTP on elastic1009 is OK: NTP OK: Offset -0.01792645454 secs [14:40:33] <^d> manybubbles: https://gerrit.wikimedia.org/r/#/q/I09777b1c2842e8f87c9e7e788b15a472ba7d4c59,n,z all ready for merging live [14:40:55] ^d: please do it when you get a chance [14:41:45] <^d> Doing it as quickly as jenkins will let me :) [14:43:42] (03PS5) 10Cscott: Allow OCG machines in Beta to be jenkins slaves. [puppet] - 10https://gerrit.wikimedia.org/r/170130 [14:43:44] (03PS3) 10Cscott: De-lint modules/ocg/manifests/decommission.pp. [puppet] - 10https://gerrit.wikimedia.org/r/170140 [14:45:33] <_joe_> cscott: we are out of disk space on ocg1001, but it's not ocg's fault [14:46:29] _joe_: good? [14:46:53] is it the logs again? i meant to turn down the console logging volume, i've got patches not yet deployed for that. [14:47:01] <_joe_> yes [14:47:08] <_joe_> but it's not really your fault [14:47:15] upstart is logging all the console output in addition to the logging we're doing to syslog and to logstash [14:47:19] <_joe_> however, deploying that patch may help [14:47:25] <_joe_> cscott: yep [14:47:29] <_joe_> upstart does that [14:47:40] _joe_: i don't think i have permissions to clean up the upstart logs, they are root only and i don't have root on the ocg machines [14:47:40] <_joe_> maybe I could just throw stdout to junk [14:47:48] <_joe_> in the upstart script [14:47:53] <_joe_> cscott: I am doing that [14:47:55] _joe_: that would also work, but i was going to do that in the ocg logging configuration. [14:48:11] <_joe_> cscott: ok so, that is cleaner [14:48:15] so that anything actually logged by upstart would then be 'interesting' (as in, it shouldn't be happening, but maybe some stderr gets through in a bug case) [14:48:26] * _joe_ nods [14:48:56] <_joe_> cscott: I agree completely; as an helpful ops, I was proposing the 'duct tape' hotfix [14:48:59] the reconfig was just waiting for me to finish https://gerrit.wikimedia.org/r/167703 which tweaks the log mechanism we are using. [14:49:25] <_joe_> "helpful ops", that almost sounded plausible :) [14:49:38] _joe_: if we can get away with waiting until the monday deploy, i don't think i need the duct tape applied. ;) [14:49:51] <_joe_> cscott: ok, I'd say I will blackhole the stdout until monday [14:50:03] <_joe_> cscott: I don't think so [14:50:13] <_joe_> and it's not your fault, it's our (ops) fault. [14:50:14] just file a bug or an RT or something to remind us to undo that after monday then? [14:50:29] <_joe_> cscott: I will take care of it [14:50:32] ok, cool. [14:50:51] <_joe_> and also, the location of the main ocg log is configurable? 
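The "duct tape" being proposed for ocg1001 amounts to telling Upstart to stop capturing the job's console output, since that is what fills /var/log/upstart/ocg.log. A sketch of what the hotfix can look like (the actual patch is Gerrit 170333; the stanza shown is illustrative):

    # In /etc/init/ocg.conf, either discard console output entirely:
    #     console none
    # or redirect it in the exec stanza:
    #     exec ... >/dev/null 2>&1
    # then reload the job definition:
    sudo service ocg restart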
[14:51:08] yeah, via puppets rsyslog config I think. [14:51:22] btw, i'm pretty happy in general with how https://ganglia.wikimedia.org/latest/?r=month&cs=&ce=&m=ocg_data_filesystem_utilization&s=by+name&c=PDF+servers+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 is looking [14:51:51] <_joe_> cscott: yes me too [14:51:58] i finally fixed the last cpu and disk leaks i looks like [14:52:05] <_joe_> I thought I could happily forget ocg was live :) [14:53:07] <_joe_> ok, on it [14:53:32] i actually found the last cpu leak yesterday, and it was a bit surprising: https://gerrit.wikimedia.org/r/170329 [14:53:56] but the cpu time limiting we deployed at the start of the week greatly mitigated the consequence of the leak. [14:55:48] RECOVERY - Disk space on ocg1001 is OK: DISK OK [14:56:00] <_joe_> !log rotated logs on ocg1001, restarted both ocg and rsyslog [14:56:05] <_joe_> MEH [14:56:07] Logged the message, Master [14:59:37] !log demon Synchronized php-1.25wmf5/extensions/CirrusSearch: (no message) (duration: 00m 04s) [14:59:42] Logged the message, Master [14:59:47] !log demon Synchronized php-1.25wmf6/extensions/CirrusSearch: (no message) (duration: 00m 04s) [14:59:49] _joe_: how is rsyslog looking? [14:59:53] Logged the message, Master [14:59:58] <^d> manybubbles: You're live. Didn't do config change yet. Stepping to grab breakfast now. [15:00:12] that should be space-limited now, but i think someone mentioned that we might need to trigger the logrotate cron job more often than daily? [15:00:22] _joe_: ^ [15:00:27] <_joe_> cscott: that makes no sense [15:00:49] <_joe_> that's a bad solution by the people that caused the problem in the first place (us) [15:00:49] it rotates when the log size grows larger than XXX Mb, but it only rotates when triggered [15:01:00] and the cron job is only triggered once/day [15:01:12] so the logs may in fact be larger than XXX Mb when they are finally rotated [15:01:29] ie, the *config* is set up so we have a maximum of 15 logs * XXX Mb each (I forget the exact number) [15:01:39] <_joe_> so, my idea is - I will reimage those servers in a sane way, but please do not log that much to stdout if possible [15:01:42] but the logs can in fact be larger then XXX Mb because they are only rotated at most once a day [15:01:56] <_joe_> cscott: yeah I got that [15:02:01] <_joe_> but don't worry [15:02:08] <_joe_> just reduce the unnecessary noise [15:02:15] <_joe_> I will take care of the rest [15:02:38] <_joe_> if a log is useful, we shouldn't throw it away because we partitioned badly [15:02:48] _joe_: see https://gerrit.wikimedia.org/r/168536 and comments on it. [15:03:36] also note that ocg's logrotate configuration is currently identical to parsoid's, so any fixes to ocg might also be appropriate to apply for parsoid. [15:04:02] (in particular, switching to a greater frequency of logrotate invocations) [15:05:43] papaul: ping [15:06:20] (03PS1) 10Giuseppe Lavagetto: ocg: blackhole the stdout in order to save disk space [puppet] - 10https://gerrit.wikimedia.org/r/170333 [15:06:30] <_joe_> cscott: ^^ duct tape! [15:09:37] <_joe_> cscott: this commit will blackhole all the stdout output. [15:13:51] (03CR) 10Cscott: [C: 031] "Seems reasonable to me. 
Revert once I reconfigure OCG to suppress all console output (so that anything upstart *would* log in that case w" [puppet] - 10https://gerrit.wikimedia.org/r/170333 (owner: 10Giuseppe Lavagetto) [15:14:45] (03PS2) 10Giuseppe Lavagetto: ocg: blackhole the stdout in order to save disk space [puppet] - 10https://gerrit.wikimedia.org/r/170333 [15:14:54] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] ocg: blackhole the stdout in order to save disk space [puppet] - 10https://gerrit.wikimedia.org/r/170333 (owner: 10Giuseppe Lavagetto) [15:16:02] <_joe_> cscott: FYI, today's and yesterday's ocg logs are in /srv/logs on ocg1001 [15:16:14] <_joe_> I will move them back on monday if possible [15:16:38] RECOVERY - check if salt-minion is running on ocg1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:17:18] (03PS1) 10Jgreen: DNS for aluminium's new public IP [dns] - 10https://gerrit.wikimedia.org/r/170334 [15:17:28] RECOVERY - puppet last run on ocg1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:18:28] <_joe_> cscott: and ocg is not logging to upstart/ocg.log anymore [15:18:34] (03CR) 10Jgreen: [C: 032 V: 031] DNS for aluminium's new public IP [dns] - 10https://gerrit.wikimedia.org/r/170334 (owner: 10Jgreen) [15:21:26] (03PS1) 10Manybubbles: Disable regex searching in cirrus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170335 [15:24:47] (03CR) 10Chad: [C: 032] Disable regex searching in cirrus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170335 (owner: 10Manybubbles) [15:24:54] (03Merged) 10jenkins-bot: Disable regex searching in cirrus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170335 (owner: 10Manybubbles) [15:25:02] ^d: did you plan to deploy that? [15:25:19] !log demon Synchronized wmf-config/CirrusSearch-production.php: (no message) (duration: 00m 04s) [15:25:20] <^d> Yep :) [15:25:23] ah cool [15:25:26] Logged the message, Master [15:25:38] (03CR) 10Hashar: De-lint modules/ocg/manifests/decommission.pp. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/170140 (owner: 10Cscott) [15:25:44] (03CR) 10Hashar: [C: 031] De-lint modules/ocg/manifests/decommission.pp. [puppet] - 10https://gerrit.wikimedia.org/r/170140 (owner: 10Cscott) [15:26:40] (03CR) 10Cscott: De-lint modules/ocg/manifests/decommission.pp. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/170140 (owner: 10Cscott) [15:26:57] (03CR) 10Hashar: "> switch deployment-pdf02 to the role::ocg::beta first," [puppet] - 10https://gerrit.wikimedia.org/r/170130 (owner: 10Cscott) [15:27:09] hey, _joe_, since you're doing puppet stuff: care to review https://gerrit.wikimedia.org/r/170140 ? I don't have +2 rights there. [15:28:34] (03CR) 10Hashar: De-lint modules/ocg/manifests/decommission.pp. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/170140 (owner: 10Cscott) [15:28:57] hey paravoid [15:29:15] can I put librdkafka-0.8.5 in our apt? [15:29:25] if so, I should build a trusty-wikimedia version, yes? [15:29:34] what should the version be? [15:29:35] 0.8.5-2~wmf1 [15:29:36] ?> [15:29:38] something like that? [15:30:22] ^d: sent email to wikitech-ambassadors@lists.wikimedia.org about disable [15:34:53] <^d> mmk [15:41:47] _joe_: if you're online: what's the current status of HHVM rollout/what's the plan for next week? 
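On the logrotate point above: the size cap is only enforced when logrotate actually runs, so a size-limited config still needs a more frequent trigger than the stock daily cron job if the log can grow quickly. One hedged way to do that (file names and the 100M figure are illustrative):

    # /etc/logrotate.d/ocg already caps size, e.g.:
    #     size 100M
    #     rotate 15
    # add an hourly trigger so the cap is checked more than once a day:
    echo '17 * * * * root /usr/sbin/logrotate /etc/logrotate.d/ocg' \
        | sudo tee /etc/cron.d/ocg-logrotate-hourly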
[15:42:53] <_joe_> greg-g: it's at 10% of anons now, but we will wait for a better patch to the memleak to move further [15:43:06] <_joe_> as the patch to that problem caused repeated segfaults [15:54:00] PROBLEM - puppet last run on mw1116 is CRITICAL: CRITICAL: Puppet has 1 failures [15:59:26] (03PS1) 10Andrew Bogott: Add class and role for Openstack Horizon [puppet] - 10https://gerrit.wikimedia.org/r/170340 [15:59:49] (03CR) 10Andrew Bogott: [C: 04-2] "Work in progress, do not merge" [puppet] - 10https://gerrit.wikimedia.org/r/170340 (owner: 10Andrew Bogott) [16:11:30] RECOVERY - puppet last run on mw1116 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [16:15:43] _joe_: ty [16:21:22] (03PS1) 10Ottomata: Install kafkatee on stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/170344 [16:23:12] (03CR) 10Ottomata: [C: 032] Install kafkatee on stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/170344 (owner: 10Ottomata) [16:24:38] (03CR) 10Ottomata: [V: 032] Install kafkatee on stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/170344 (owner: 10Ottomata) [16:41:10] (03CR) 10Ori.livneh: [C: 032] "thanks" [puppet] - 10https://gerrit.wikimedia.org/r/170310 (owner: 10Alexandros Kosiaris) [16:42:24] jenkins/zuul sees to be down, and hashar's not on irc [16:42:35] anyone here know enough about it to try giving it a kick? [16:48:00] sure, I'll give it a try [16:49:22] (03PS1) 10Reedy: Add export-0.10.xsd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170351 (https://bugzilla.wikimedia.org/72417) [16:49:27] ah: /var/log/zuul/zuul.log on gallium says "NoConnectedServersError: No connected Gearman servers" [16:49:31] godog: ^ [16:50:19] cscott: yeah I think it is the first one https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Known_issues [16:50:46] godog: ah -- i did recently add a job to jenkins, perhaps that's it. [16:51:17] (03PS2) 10Reedy: Add export-0.10.xsd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170351 (https://bugzilla.wikimedia.org/72417) [16:51:31] godog: i think gallium is also stuck (known issue #2) -- see https://integration.wikimedia.org/ci/job/integration-zuul-layoutvalidation/ [16:51:45] godog: are you going to do the suggested recovery steps, or should i? [16:53:02] cscott: I'm going to but first I'm taking a look at what is stuck exactly, no more than 10/15 min [16:53:31] godog: ok, cool. i'm in no hurry. [16:54:28] godog: do you know what steps #4 and #5 on https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Jenkins_execution_lock mean exactly? [16:54:42] godog: they could stand to be better documented, just in case I have to do this in the future. [16:55:02] cscott: I think it's disabling the plugin via the checkbox in jenkins [16:55:10] cscott: what Reedy said I think [16:55:25] on that url is a "Gearman Plugin Config" section [16:55:31] Enable Gearman [x] [16:55:56] https://integration.wikimedia.org/ci/computer/gallium/ has a box at top for 'mark this node temporarily offline', i see that. i don't know what url has the 'gearman plugin config' section though. [16:56:24] See step 1 :) [16:56:28] https://integration.wikimedia.org/ci/configure [16:57:40] Reedy: ok, https://www.mediawiki.org/w/index.php?title=Continuous_integration%2FZuul&diff=1247520&oldid=1243149 is what i understand how to do ;) [16:58:02] maybe the 'disconnect' link will turn into a 'relaunch slave agent' link once gallium is disconnected? [17:04:27] ottomata: https://gerrit.wikimedia.org/r/170140 needs a puppet-op +2. it's got a +1 already. 
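When zuul.log reports "No connected Gearman servers", a quick sanity check before walking through the recovery steps is to poke the Gearman admin protocol directly on the geard port; it answers plain-text commands. A sketch (4730 is the usual port; the timeout flag is only there so nc exits):

    # registered functions plus queued/running job counts
    echo status | nc -w 2 gallium.wikimedia.org 4730

    # workers currently connected (e.g. the Jenkins Gearman plugin's executors)
    echo workers | nc -w 2 gallium.wikimedia.org 4730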
[17:05:39] (03CR) 10Andrew Bogott: [C: 032] De-lint modules/ocg/manifests/decommission.pp. [puppet] - 10https://gerrit.wikimedia.org/r/170140 (owner: 10Cscott) [17:06:19] ottomata: never mind! another andrew got to it. [17:12:22] cscott: ok I'm going to do the recovery steps in #1 [17:13:00] godog: ok! if you don't mind, i'd like to do the steps in #2 after you are done (assuming gallium still looks stuck) to double-check that i understand the process. [17:13:15] cscott: sure [17:15:37] cscott: I'm done with steps for #! [17:15:42] #1 even [17:15:53] godog: https://integration.wikimedia.org/zuul/ looks much better already! [17:16:37] godog: gallium also unstuck itself, so I don't think #2 is necessary. [17:19:32] cscott: indeed, it started back up [17:41:43] (03PS1) 10MaxSem: Don't disable GeoData completely if CirrusSearch is having problems [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170358 (https://bugzilla.wikimedia.org/72559) [17:42:43] greg-g, I'd like to deploy ^^^ now before another outage strikes. CC ^demon|away and manybubbles [17:43:27] (03CR) 10Manybubbles: [C: 031] Don't disable GeoData completely if CirrusSearch is having problems [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170358 (https://bugzilla.wikimedia.org/72559) (owner: 10MaxSem) [17:44:05] MaxSem: we believe we've protected ourselves from the thing that caused the last cirrus outages. no reason not to deploy this anyway though [17:59:44] manybubbles: worth a friday deploy or wait until monday morning? [18:00:29] greg-g: that config change for MaxSem? Worth it I think. Just so if something scarry happens to Cirrus over the weekend max doesn't have to care much. [18:00:41] its not super likely but its a simple change [18:00:53] kk [18:00:59] MaxSem: doit at will [18:01:04] thanks:) [18:02:17] (03CR) 10MaxSem: [C: 032] Don't disable GeoData completely if CirrusSearch is having problems [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170358 (https://bugzilla.wikimedia.org/72559) (owner: 10MaxSem) [18:02:25] (03Merged) 10jenkins-bot: Don't disable GeoData completely if CirrusSearch is having problems [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170358 (https://bugzilla.wikimedia.org/72559) (owner: 10MaxSem) [18:03:23] !log maxsem Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/170358 (duration: 00m 04s) [18:03:30] Logged the message, Master [18:08:53] (03PS2) 10Ottomata: Fix comment about row of analytics1012 [puppet] - 10https://gerrit.wikimedia.org/r/170092 [18:09:00] (03CR) 10Ottomata: [C: 032 V: 032] Fix comment about row of analytics1012 [puppet] - 10https://gerrit.wikimedia.org/r/170092 (owner: 10Ottomata) [18:19:04] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [18:19:41] (03PS2) 10Rush: phab rename ext_ref as Reference [puppet] - 10https://gerrit.wikimedia.org/r/170237 [18:19:57] (03CR) 10Rush: [C: 032 V: 032] phab rename ext_ref as Reference [puppet] - 10https://gerrit.wikimedia.org/r/170237 (owner: 10Rush) [18:21:23] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [18:57:00] (03CR) 10Dzahn: [C: 031] "proper defines seem definitely better than how we handled it before, yay for that. let's see what Sean thinks though" [puppet] - 10https://gerrit.wikimedia.org/r/169722 (owner: 10Ottomata) [19:03:12] YuviPanda: small change in shinken @ 170269 [19:09:56] ccogdill: hi :p [19:10:02] hi there! 
[19:10:07] sorry about that [19:10:13] so… are you interested? [19:10:21] Yeah - people get -ops and -operations mixed up a bit [19:10:28] it’s tough [19:10:35] hey ops - we got an email from a donor who wants to donate IP Transit per https://wikimediafoundation.org/wiki/Peering [19:10:35] they also run the Philadelphia Internet Exchange, and say they’d be happy to set us up with a free port there. [19:10:57] mutante ^^ you might be knowledgable of how to take that/where to take that [19:11:02] I could ask them to email you/do an introduction if you’d like [19:18:54] ccogdill: JohnLewis : there is a special email address for people who handle peering requests. peering@wikimedia.org [19:19:58] okay I’ll ask the donor to send an email there [19:20:01] thanks mutante! [19:21:37] you're welcome [19:25:09] (03PS1) 10Ottomata: Include refinery-hive in hive's auxpath if it is deployed on the current node [puppet] - 10https://gerrit.wikimedia.org/r/170382 [19:26:15] (03PS2) 10Ottomata: Include refinery-hive in hive's auxpath if it is deployed on the current node [puppet] - 10https://gerrit.wikimedia.org/r/170382 [19:27:16] (03CR) 10Ottomata: [C: 032] Include refinery-hive in hive's auxpath if it is deployed on the current node [puppet] - 10https://gerrit.wikimedia.org/r/170382 (owner: 10Ottomata) [19:30:22] (03PS1) 10Spage: Enable Flow on officewiki on test page Talk:Flow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170383 [19:36:14] (03PS1) 10Ottomata: Make sure refinery is included before hive role so that auxpath is properly set [puppet] - 10https://gerrit.wikimedia.org/r/170384 [19:36:46] spagewmf: we're testing on private wikis now? :p [19:37:52] (03CR) 10Ottomata: [C: 032] Make sure refinery is included before hive role so that auxpath is properly set [puppet] - 10https://gerrit.wikimedia.org/r/170384 (owner: 10Ottomata) [19:38:30] JohnLewis: https://en.wikipedia.org/wiki/Dogfooding :) WMF staff really want it [19:38:47] love your department store chain btw [19:39:05] spagewmf: I can't see the discussion you guys had :( [19:39:08] and thanks :p [19:41:07] (03CR) 10John F. Lewis: [C: 031] "Patch itself looks good - can't see the discussion but why would we object to enabling flow on a page where the staff can marvel at it/sta" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170383 (owner: 10Spage) [19:41:10] JohnLewis: I don't know why they're not public. WMF has Loomio discussions for "Clean the windows", "Bike parking debacle", "Continuing Cross-Team Lunches?". Thrill city. [19:41:13] spagewmf: see my full reason :p [19:42:01] I hope I didn't reveal teh Secret Cabal [19:42:01] I'd love to see the Loomio for the clean the windows. 'I oppose unless we use open source windows cleaners.' [19:42:21] <^d> you don't want to see the loomios. [19:42:25] <^d> they're all pretty silly. [19:42:34] (03CR) 10Dzahn: [C: 031] Enable Flow on officewiki on test page Talk:Flow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170383 (owner: 10Spage) [19:42:39] ^d: that makes me want to see them even more [19:42:51] (people will ask about the loomio URL in mw-config, but good) [19:42:53] <^d> flow was part of a discussion that was like "we should stop using $lame_third_party_tool and use the wiki instead" [19:42:59] <^d> Then people were like "but I h8 talkpages" [19:43:09] mutante: I take that is a 'yeah - we agreed' +1 [19:43:09] <^d> So Flow was a compromise. A good one, because dogfood. 
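A side note on the refinery-hive auxpath change above: putting a jar on Hive's aux path just makes its UDFs available to every session without an explicit ADD JAR. Roughly the same thing by hand, with an illustrative jar path (the puppet change presumably wires an equivalent into the Hive configuration):

    # one-off: launch the CLI with the jar on the aux path
    hive --auxpath /srv/deployment/analytics/refinery/artifacts/refinery-hive.jar

    # or via the environment
    export HIVE_AUX_JARS_PATH=/srv/deployment/analytics/refinery/artifacts/refinery-hive.jar
    hive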
[19:43:30] JohnLewis: dogfood is good, yep [19:43:35] <^d> nom nom nom [19:43:56] A discussion on a third party tool to decide whether to stop using third party tools? :D [19:44:07] closed-source Flowfork will have voting on topics, we crush Loomio, team buys private island. [19:44:24] JohnLewis: yep [19:45:45] spagewmf: You unlocked an acheivement btw as that patch somehow made the channel active :D [19:49:58] (03PS1) 10Ori.livneh: $wgPercentHHVM => 15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170387 [19:50:04] <^d> JohnLewis: No, a discussion on the staff mailman list about not using a third party tool :) [19:50:38] ah mailman, my love :p [19:51:38] JohnLewis: https://bugzilla.wikimedia.org/show_bug.cgi?id=72289 [19:52:09] mutante: what about it? :) [19:52:46] (03CR) 10Ori.livneh: [C: 032] $wgPercentHHVM => 15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170387 (owner: 10Ori.livneh) [19:52:54] (03Merged) 10jenkins-bot: $wgPercentHHVM => 15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170387 (owner: 10Ori.livneh) [19:53:02] JohnLewis: the part what disabling really means to do [19:53:26] JohnLewis: just spam filter that discards all? hide from listinfo? unsubscribe all? and so on.. so many options [19:53:59] mutante: https://wikitech.wikimedia.org/wiki/Lists.wikimedia.org#Disable_a_mailing_list would be it [19:54:54] but yeah - discard all mail and hide from listinfo is what should be done [19:55:18] keeping sub'd users as disabling something allows it to be enabled at a later date easily [19:55:21] JohnLewis: heh, good link! though, i remember quite a few mails we got where people said they still get spammed from disabled lists [19:55:31] and then we tried different ways to do it properly.. didnt we [19:55:45] mutante: spammed through -owner iirc was the issue [19:55:59] ah, ok [19:56:03] you could just - remove them from the -owner spamlist (no list admins/mods) [20:28:15] greg-g: can we do 10 mins later? [20:36:13] !log aaron Synchronized php-1.25wmf6/includes/GlobalFunctions.php: 04c35b2ca42d7a186278882763eb853552d8441c (duration: 00m 04s) [20:36:20] Logged the message, Master [20:43:21] (03PS1) 10John F. 
Lewis: mailman: enable rmlist for web-list deletion [puppet] - 10https://gerrit.wikimedia.org/r/170398 [20:45:21] mutante ^^ ottomata, might also be a nice look for you above [20:46:11] !log aaron Synchronized php-1.25wmf5/includes/GlobalFunctions.php: 721435c3a6c8f7c728d3fa8ec34abb0f2ef7543d (duration: 00m 07s) [20:46:15] Logged the message, Master [20:51:02] ori: https://gerrit.wikimedia.org/r/#/c/169030/ [20:51:16] (03PS2) 10Ori.livneh: Added labswiki to the dump skip list to avoid error spam [puppet] - 10https://gerrit.wikimedia.org/r/169030 (owner: 10Aaron Schulz) [20:51:21] (03CR) 10Ori.livneh: [C: 032] Added labswiki to the dump skip list to avoid error spam [puppet] - 10https://gerrit.wikimedia.org/r/169030 (owner: 10Aaron Schulz) [20:51:29] gah, I guess you can't +2 [20:51:31] or not [21:00:56] (03PS1) 1020after4: make optional the 'real name' user profile field in phabricator [puppet] - 10https://gerrit.wikimedia.org/r/170433 [21:02:38] (03PS2) 10Hoo man: Make the 'real name' user profile field optional in phabricator [puppet] - 10https://gerrit.wikimedia.org/r/170433 (owner: 1020after4) [21:03:08] (03PS3) 10Qgil: Make the 'real name' user profile field optional in phabricator [puppet] - 10https://gerrit.wikimedia.org/r/170433 (owner: 1020after4) [21:27:25] (03PS1) 10Ori.livneh: hhvm: make source installation update whenever package is updated [puppet] - 10https://gerrit.wikimedia.org/r/170443 [21:28:47] (03CR) 10Ori.livneh: [C: 032] hhvm: make source installation update whenever package is updated [puppet] - 10https://gerrit.wikimedia.org/r/170443 (owner: 10Ori.livneh) [22:06:24] (03PS2) 10John F. Lewis: mailman: enable rmlist for web-list deletion [puppet] - 10https://gerrit.wikimedia.org/r/170398 [22:11:09] (03CR) 10Yuvipanda: [C: 031] shinken - move hosts define to own file [puppet] - 10https://gerrit.wikimedia.org/r/170269 (owner: 10Dzahn) [22:15:54] (03CR) 10Dzahn: [C: 032] shinken - move hosts define to own file [puppet] - 10https://gerrit.wikimedia.org/r/170269 (owner: 10Dzahn) [22:41:11] (03PS1) 10MaxSem: Add WikiGrok [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170450 [22:41:13] (03PS1) 10MaxSem: Enable WikiGrok on test and test2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170451 [22:41:15] (03PS1) 10MaxSem: Enable WikiGrok on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170452 [22:41:17] (03PS1) 10MaxSem: Add author campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170453 [22:41:26] (03CR) 10MaxSem: [C: 04-2] Add WikiGrok [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170450 (owner: 10MaxSem) [22:56:41] (03PS1) 10Dzahn: puppetception: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/170459 [22:59:34] (03CR) 10John F. Lewis: [C: 031] puppetception: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/170459 (owner: 10Dzahn) [23:01:09] (03CR) 10Dzahn: [C: 032] puppetception: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/170459 (owner: 10Dzahn) [23:03:11] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Puppet has 1 failures [23:09:38] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [23:10:17] (03PS1) 10John F. Lewis: admin: puppet lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/170463 [23:12:54] (03PS1) 10Dzahn: releases: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/170464 [23:22:20] (03PS1) 10John F. Lewis: apache: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/170466 [23:27:16] (03PS1) 10John F. 
Lewis: apparmor: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/170469 [23:30:14] (03PS1) 10Dzahn: puppetmaster - lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/170471 [23:31:12] (03PS1) 10John F. Lewis: archiva: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/170472 [23:33:29] (03PS1) 10John F. Lewis: authdns: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/170473 [23:34:17] (03PS4) 1020after4: Make the 'real name' user profile field optional in phabricator [puppet] - 10https://gerrit.wikimedia.org/r/170433 [23:34:55] (03PS1) 10John F. Lewis: backup: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/170474 [23:35:16] (03PS1) 10Dzahn: icinga - lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/170475 [23:35:25] mutante: so many lint fixes :p [23:35:59] (03CR) 10Dzahn: [C: 031] Make the 'real name' user profile field optional in phabricator [puppet] - 10https://gerrit.wikimedia.org/r/170433 (owner: 1020after4) [23:41:24] (03PS1) 10John F. Lewis: bacula: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/170476 [23:45:36] (03PS1) 10John F. Lewis: base: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/170477 [23:49:30] (03PS1) 10Dzahn: wikimetrics - lint fixes [puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/170478
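[Note: the run of "lint fixes" patches from 22:56 onward (puppetception, admin, releases, apache, apparmor, puppetmaster, archiva, authdns, backup, icinga, bacula, base, wikimetrics) are the mechanical cleanups puppet-lint flags. The resource below is made up, not taken from any of those changes; it shows the usual offenders -- double-quoted strings with no interpolation, an unquoted file mode, and misaligned arrows -- and what the corresponding fix looks like.]

    # Before: typical puppet-lint warnings
    file { "/etc/motd":
        ensure => present,
        owner => "root",
        mode => 444,
    }

    # After: the kind of diff a "lint fixes" commit contains
    file { '/etc/motd':
        ensure => present,
        owner  => 'root',
        mode   => '0444',
    }

Running puppet-lint over a module lists these warnings per file, and recent versions can apply many of them automatically with --fix, though the resulting diff still deserves a quick review before upload.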