[00:07:02] (03PS1) 10QChris: Configure gerrit's its-phabricator [puppet] - 10https://gerrit.wikimedia.org/r/167524 [00:08:26] (03PS2) 10QChris: Configure gerrit's its-phabricator [puppet] - 10https://gerrit.wikimedia.org/r/167524 [00:09:47] (03CR) 10QChris: [C: 04-1] "Blocking until username and password have been decided." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/167524 (owner: 10QChris) [00:12:35] RECOVERY - OCG health on ocg1002 is OK: OK: ocg_job_status 340119 msg: ocg_render_job_queue 68 msg [00:12:45] RECOVERY - OCG health on ocg1003 is OK: OK: ocg_job_status 340133 msg: ocg_render_job_queue 56 msg [00:13:05] RECOVERY - OCG health on ocg1001 is OK: OK: ocg_job_status 340144 msg: ocg_render_job_queue 29 msg [00:17:09] (03PS1) 10QChris: Add its-phabricator from d425a5ded909ee73df53d5e6d91d28014d0be375 [gerrit/plugins] - 10https://gerrit.wikimedia.org/r/167525 [00:44:25] PROBLEM - Disk space on ocg1001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=72%): [00:51:26] PROBLEM - MySQL Replication Heartbeat on db1016 is CRITICAL: CRIT replication delay 303 seconds [00:51:32] PROBLEM - MySQL Slave Delay on db1016 is CRITICAL: CRIT replication delay 309 seconds [00:52:08] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: ocg_job_status 342960 msg: ocg_render_job_queue 500 msg (=500 critical) [00:52:47] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: ocg_job_status 343003 msg: ocg_render_job_queue 502 msg (=500 critical) [00:53:39] RECOVERY - MySQL Replication Heartbeat on db1016 is OK: OK replication delay -1 seconds [00:53:39] RECOVERY - MySQL Slave Delay on db1016 is OK: OK replication delay 0 seconds [01:02:32] RECOVERY - OCG health on ocg1001 is OK: OK: ocg_job_status 343576 msg: ocg_render_job_queue 68 msg [01:02:52] RECOVERY - OCG health on ocg1002 is OK: OK: ocg_job_status 343593 msg: ocg_render_job_queue 32 msg [01:08:23] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: ocg_job_status 344585 msg: ocg_render_job_queue 621 msg (=500 critical) [01:08:42] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: ocg_job_status 344612 msg: ocg_render_job_queue 622 msg (=500 critical) [01:09:02] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: ocg_job_status 344642 msg: ocg_render_job_queue 635 msg (=500 critical) [01:24:52] RECOVERY - OCG health on ocg1002 is OK: OK: ocg_job_status 345619 msg: ocg_render_job_queue 80 msg [01:25:03] RECOVERY - OCG health on ocg1003 is OK: OK: ocg_job_status 345631 msg: ocg_render_job_queue 67 msg [01:25:32] RECOVERY - OCG health on ocg1001 is OK: OK: ocg_job_status 345653 msg: ocg_render_job_queue 57 msg [01:55:26] PROBLEM - puppet last run on ms-fe3001 is CRITICAL: CRITICAL: puppet fail [02:13:56] RECOVERY - puppet last run on ms-fe3001 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [02:16:31] !log LocalisationUpdate completed (1.25wmf3) at 2014-10-20 02:16:31+00:00 [02:16:41] Logged the message, Master [02:28:35] !log LocalisationUpdate completed (1.25wmf4) at 2014-10-20 02:28:35+00:00 [02:28:44] Logged the message, Master [02:31:14] (03PS1) 10Springle: depool db1066 for upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167527 [02:31:51] (03CR) 10Springle: [C: 032] depool db1066 for upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167527 (owner: 10Springle) [02:31:58] (03Merged) 10jenkins-bot: depool db1066 for upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167527 (owner: 10Springle) [02:34:12] !log springle Synchronized 
wmf-config/db-eqiad.php: depool db1066 (duration: 00m 06s) [02:34:21] Logged the message, Master [02:47:52] (03PS1) 10Springle: repool db1042, move vslow/dump back [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167528 [02:48:48] (03CR) 10Springle: [C: 032] repool db1042, move vslow/dump back [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167528 (owner: 10Springle) [02:48:55] (03Merged) 10jenkins-bot: repool db1042, move vslow/dump back [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167528 (owner: 10Springle) [02:49:32] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: ocg_job_status 352544 msg: ocg_render_job_queue 955 msg (=500 critical) [02:49:33] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: ocg_job_status 352550 msg: ocg_render_job_queue 941 msg (=500 critical) [02:50:02] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: ocg_job_status 352581 msg: ocg_render_job_queue 878 msg (=500 critical) [02:50:05] !log springle Synchronized wmf-config/db-eqiad.php: repool db1042 (duration: 00m 06s) [02:50:11] Logged the message, Master [02:56:23] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 872.337060099 [02:59:44] RECOVERY - OCG health on ocg1002 is OK: OK: ocg_job_status 353313 msg: ocg_render_job_queue 94 msg [03:00:03] RECOVERY - OCG health on ocg1001 is OK: OK: ocg_job_status 353340 msg: ocg_render_job_queue 77 msg [03:00:42] RECOVERY - OCG health on ocg1003 is OK: OK: ocg_job_status 353408 msg: ocg_render_job_queue 69 msg [03:42:25] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Oct 20 03:42:24 UTC 2014 (duration 42m 23s) [03:42:31] Logged the message, Master [03:51:13] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: ocg_job_status 329602 msg: ocg_render_job_queue 934 msg (=500 critical) [03:51:24] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: ocg_job_status 329618 msg: ocg_render_job_queue 914 msg (=500 critical) [03:52:04] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: ocg_job_status 329680 msg: ocg_render_job_queue 847 msg (=500 critical) [04:02:54] PROBLEM - Disk space on ocg1003 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=72%): [04:04:23] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: ocg_job_status 331248 msg: ocg_render_job_queue 759 msg (=500 critical) [04:04:33] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: ocg_job_status 331284 msg: ocg_render_job_queue 779 msg (=500 critical) [04:04:44] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: ocg_job_status 331398 msg: ocg_render_job_queue 858 msg (=500 critical) [04:19:44] RECOVERY - OCG health on ocg1001 is OK: OK: ocg_job_status 333312 msg: ocg_render_job_queue 90 msg [04:20:36] RECOVERY - OCG health on ocg1003 is OK: OK: ocg_job_status 333359 msg: ocg_render_job_queue 0 msg [04:20:36] RECOVERY - OCG health on ocg1002 is OK: OK: ocg_job_status 333363 msg: ocg_render_job_queue 0 msg [04:24:53] RECOVERY - Disk space on ocg1001 is OK: DISK OK [04:27:14] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:27:24] RECOVERY - Disk space on ocg1003 is OK: DISK OK [04:29:01] !log removed old /var/log/ocg* on ocg1001 and ocg1003 and forced logrotate, / space critical [04:29:08] Logged the message, Master [04:29:26] (03PS10) 10KartikMistry: WIP: Beta: Update cxserver to use Apertium service [puppet] - 10https://gerrit.wikimedia.org/r/157787 [04:43:44] RECOVERY - 
git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 54215 bytes in 8.027 second response time [05:01:34] PROBLEM - HHVM rendering on mw1114 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50512 bytes in 0.283 second response time [05:05:51] RECOVERY - HHVM rendering on mw1114 is OK: HTTP OK: HTTP/1.1 200 OK - 71973 bytes in 0.236 second response time [05:23:27] <_joe_> springle: thanks for looking at ocg [05:23:36] <_joe_> (good week) [05:25:44] i didn't check all the other ocg*. some may need the same attention before long [05:26:02] _joe_: (and good morning :) [05:26:59] <_joe_> springle: well, cscott should be back this week and he should address most of ocg issues (which are expected from a new service with a lot of weird input) [05:27:14] cool [06:29:11] PROBLEM - puppet last run on db1002 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:31] PROBLEM - puppet last run on search1018 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:41] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:52] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:10] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:11] PROBLEM - puppet last run on ms-fe1004 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:21] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [06:38:20] PROBLEM - puppet last run on amssq37 is CRITICAL: CRITICAL: Puppet has 1 failures [06:45:21] RECOVERY - puppet last run on db1002 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [06:45:44] <_joe_> YuviPanda|zzz: whenever you get up, ping me [06:45:57] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [06:46:11] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [06:46:22] RECOVERY - puppet last run on ms-fe1004 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [06:46:30] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [06:46:41] RECOVERY - puppet last run on search1018 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [06:46:51] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 61 seconds ago with 0 failures [06:50:50] PROBLEM - puppet last run on db1039 is CRITICAL: CRITICAL: Puppet has 1 failures [06:57:20] RECOVERY - puppet last run on amssq37 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [07:05:31] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: ocg_job_status 352322 msg: ocg_render_job_queue 2428 msg (=500 critical) [07:05:41] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: ocg_job_status 352380 msg: ocg_render_job_queue 2461 msg (=500 critical) [07:06:00] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: ocg_job_status 352949 msg: ocg_render_job_queue 2888 msg (=500 critical) [07:07:00] <_joe_> ... [07:07:36] <_joe_> !log rolling restart of ocg services [07:07:44] Logged the message, Master [07:08:11] RECOVERY - puppet last run on db1039 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [07:09:15] morning _joe_ do HAT servers serve the "strong" ssl_cipher_suite ? 
[07:12:42] <_joe_> matanya: they have nothing to do with ssl :) [07:12:53] <_joe_> the ssl frontend is the same for HAT/Zend [07:13:34] so no ssl terminetors are on trusty, or the other way around? [07:13:47] * matanya is confused [07:13:48] <_joe_> ssl terminators are still on precise [07:14:03] <_joe_> but HAT == HHVM, Apache, Trusty [07:14:15] <_joe_> so we use it for the appservers running trusty and hhvm [07:15:05] RECOVERY - OCG health on ocg1003 is OK: OK: ocg_job_status 356507 msg: ocg_render_job_queue 0 msg [07:15:13] RECOVERY - OCG health on ocg1002 is OK: OK: ocg_job_status 356510 msg: ocg_render_job_queue 0 msg [07:15:17] <_joe_> btw, https://www.ssllabs.com/ssltest/analyze.html?d=en.wikipedia.org [07:15:33] RECOVERY - OCG health on ocg1001 is OK: OK: ocg_job_status 356536 msg: ocg_render_job_queue 0 msg [07:15:38] thank you, that clears it up [07:18:52] (03PS2) 10Giuseppe Lavagetto: gerrit: move to module [puppet] - 10https://gerrit.wikimedia.org/r/167215 [07:25:45] (03CR) 10Giuseppe Lavagetto: "This is a noop:" [puppet] - 10https://gerrit.wikimedia.org/r/167215 (owner: 10Giuseppe Lavagetto) [07:48:52] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: ocg_job_status 362961 msg: ocg_render_job_queue 1366 msg (=500 critical) [07:49:03] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: ocg_job_status 363370 msg: ocg_render_job_queue 1549 msg (=500 critical) [07:49:12] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: ocg_job_status 363431 msg: ocg_render_job_queue 1558 msg (=500 critical) [07:56:38] RECOVERY - OCG health on ocg1003 is OK: OK: ocg_job_status 367576 msg: ocg_render_job_queue 10 msg [07:56:38] RECOVERY - OCG health on ocg1002 is OK: OK: ocg_job_status 367578 msg: ocg_render_job_queue 2 msg [07:57:08] RECOVERY - OCG health on ocg1001 is OK: OK: ocg_job_status 367656 msg: ocg_render_job_queue 0 msg [08:29:49] _joe_: hey [08:37:55] sigh, power cut [08:43:25] _joe_: can you +1/+2 (or CR!) https://gerrit.wikimedia.org/r/#/c/166902/ whenever you have the time? Thanks :) [08:43:37] hmm, I just saw scfc_de's comments [08:43:45] should probably put in the hour... [08:45:12] <_joe_> YuviPanda|zz: I do disagree completely [08:45:25] with my exception comments, or scfc_de's? [08:45:48] <_joe_> both, but the former is something you can make a call on [08:46:01] <_joe_> the latter... [08:46:47] <_joe_> I think that we don't have that entry, so why be explicit in setting something that will be set as a default? [08:48:03] explicit better than implicit, etc? [08:48:17] <_joe_> no way. [08:48:40] <_joe_> if you really think that, please rewrite all your manifests expliciting all the default values [08:48:53] <_joe_> or the undefs where you count on the default to be good [08:49:13] <_joe_> also, we _do_not_ write manifests to be applied on previously unpuppetized configs [08:49:28] <_joe_> we write manifests that cleanly apply and install on a vanilla server. [08:50:09] <_joe_> that makes a lot of sense, too [08:50:18] <_joe_> but I don't want to discuss this too long [08:50:28] PROBLEM - ElasticSearch health check on elastic1013 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2039: active_shards: 6112: relocating_shards: 0: initializing_shards: 1: unassigned_shards: 1 [08:50:31] <_joe_> if you want to specify hour, please specify everything [08:50:44] <_joe_> 'dayofweek' and 'dayofmonth' as well [08:51:26] greetings [08:51:39] RECOVERY - ElasticSearch health check on elastic1013 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2040: active_shards: 6115: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [08:51:44] well, it's explicitly just 'every hour', not 'every day hour every day of week', and I guess I find the specifying of the hour a bit more satisfyingly explicit. [08:51:57] but I don't particularly care, so I'll let it be :) [08:54:40] explicit vs implicit isn't black/white [08:59:13] * YuviPanda murmurs about stupid internet and power situation [09:08:28] (03CR) 10Filippo Giunchedi: "LGTM, did the puppet compiler run already? it'd be nice to compare that too" [puppet] - 10https://gerrit.wikimedia.org/r/167183 (owner: 10Giuseppe Lavagetto) [09:08:35] (03CR) 10Yuvipanda: graphite: Add labs archiver script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/166902 (owner: 10Yuvipanda) [09:09:40] (03CR) 10Filippo Giunchedi: [C: 031] "is it applied to the cluster already?" [puppet] - 10https://gerrit.wikimedia.org/r/166690 (owner: 10Chad) [09:15:53] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, just a nice-to-have suggestion but not a blocker" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/164270 (owner: 10Chad) [09:25:14] (03CR) 10Filippo Giunchedi: "out of curiosity how is temp used internally? also there seem to be "world read" acls for -temp containers, should we restrict that too?" [puppet] - 10https://gerrit.wikimedia.org/r/167310 (owner: 10Aaron Schulz) [09:26:27] hmm, I wonder if I should have the shinken config generator be a package or not... [09:26:50] probably better than putting it all in a single file... [09:39:17] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: puppet fail [09:41:28] (03CR) 10Alexandros Kosiaris: [C: 032] Beta: Apertium service configuration [puppet] - 10https://gerrit.wikimedia.org/r/165485 (owner: 10KartikMistry) [09:42:30] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: ocg_job_status 382227 msg: ocg_render_job_queue 1766 msg (=500 critical) [09:42:47] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: ocg_job_status 382689 msg: ocg_render_job_queue 2018 msg (=500 critical) [09:42:47] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: ocg_job_status 382690 msg: ocg_render_job_queue 2017 msg (=500 critical) [09:44:25] and were we go with ocg [09:48:19] sigh, is it /tmp ? 
[09:48:57] godog: no [09:49:15] but it is stuck in full user CPU again [09:49:21] restarting [09:51:49] <_joe_> wait [09:51:58] RECOVERY - OCG health on ocg1002 is OK: OK: ocg_job_status 386152 msg: ocg_render_job_queue 43 msg [09:52:00] too late [09:52:01] <_joe_> akosiaris: it's most of the time a temporary load [09:52:21] I think it was one of those times at least at one of the boxes [09:52:48] RECOVERY - OCG health on ocg1001 is OK: OK: ocg_job_status 386284 msg: ocg_render_job_queue 0 msg [09:52:58] RECOVERY - OCG health on ocg1003 is OK: OK: ocg_job_status 386307 msg: ocg_render_job_queue 0 msg [09:53:13] !log restarted ocg on ocg1001, ocg1002, ocg1003 [09:53:20] Logged the message, Master [09:53:22] <_joe_> akosiaris: as a rule of thumb, wait 10 minutes before restarting _at least_ [09:54:01] it gets stuck after that period of time? [09:54:08] kind of a heisenbug then? [09:57:58] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [10:06:12] (03PS1) 10Faidon Liambotis: Handle CORS preflight requests for upload in VCL [puppet] - 10https://gerrit.wikimedia.org/r/167542 (https://bugzilla.wikimedia.org/55631) [10:10:53] _joe_: ^ [10:19:36] <_joe_> paravoid: already CR [10:20:01] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "LGTM but we should check a few details" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/167542 (https://bugzilla.wikimedia.org/55631) (owner: 10Faidon Liambotis) [10:20:16] <_joe_> :) [10:21:08] <_joe_> CORS is a damned hell. I can never remember what every damn specific Allow- header does [10:21:29] <_joe_> at least google does [10:25:11] (03CR) 10Faidon Liambotis: Handle CORS preflight requests for upload in VCL (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/167542 (https://bugzilla.wikimedia.org/55631) (owner: 10Faidon Liambotis) [10:28:11] so since I'm on the varnish front [10:28:17] I'm reading up a bit on the whole HHVM stuff [10:28:30] <_joe_> paravoid: ugh damn gerrit [10:28:43] <_joe_> I hate how hard it is to read a whole file there [10:28:46] <_joe_> sorry :/ [10:29:10] <_joe_> paravoid: mmmh ok, the hhvm stuff should go away soon, but it's quite complicated in fact [10:29:12] am I understanding this correctly that now that we have HHVM in a significant portion of our traffic, we're effectively halving all of our varnish cache? [10:29:25] <_joe_> yes [10:29:57] but why?
[10:30:35] <_joe_> it was requested so that the bugs on hhvm would a) stay local to hhvm b) be easier to spot as hhvm-specific issues [10:30:47] <_joe_> but I guess we're at the point where we can eliminate that [10:31:25] <_joe_> not because we are deploying hhvm everywhere, but because we are relatively low on new hhvm-specific user-facing bugs [10:31:52] <_joe_> we are now in the "make hhvm be stable and not eat all resources after running for 1 day" phase [10:41:05] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 6433.62767741 [10:42:23] (03CR) 10Giuseppe Lavagetto: [C: 031] Handle CORS preflight requests for upload in VCL (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/167542 (https://bugzilla.wikimedia.org/55631) (owner: 10Faidon Liambotis) [11:02:08] (03Abandoned) 10Alexandros Kosiaris: Update config for Language pairs [puppet] - 10https://gerrit.wikimedia.org/r/163841 (owner: 10KartikMistry) [11:16:49] (03PS1) 10Alexandros Kosiaris: Add apertium module tests [puppet] - 10https://gerrit.wikimedia.org/r/167545 [11:21:38] (03CR) 10Alexandros Kosiaris: [C: 032] Add apertium module tests [puppet] - 10https://gerrit.wikimedia.org/r/167545 (owner: 10Alexandros Kosiaris) [11:34:05] akosiaris: thanks for 167545 [11:34:44] kart_: you are most welcome. Thanks for all the work on this :-) [11:37:25] PROBLEM - puppet last run on mw1091 is CRITICAL: CRITICAL: Puppet has 1 failures [11:37:55] PROBLEM - puppet last run on analytics1027 is CRITICAL: CRITICAL: Puppet has 1 failures [11:38:05] PROBLEM - puppet last run on wtp1010 is CRITICAL: CRITICAL: Puppet has 1 failures [11:38:24] PROBLEM - puppet last run on mw1142 is CRITICAL: CRITICAL: Puppet has 1 failures [11:38:25] PROBLEM - puppet last run on mw1209 is CRITICAL: CRITICAL: Puppet has 1 failures [11:38:34] PROBLEM - puppet last run on mw1112 is CRITICAL: CRITICAL: Puppet has 1 failures [11:38:37] PROBLEM - puppet last run on mw1152 is CRITICAL: CRITICAL: Puppet has 1 failures [11:38:37] PROBLEM - puppet last run on mw1071 is CRITICAL: CRITICAL: Puppet has 1 failures [11:38:44] PROBLEM - puppet last run on mw1139 is CRITICAL: CRITICAL: Puppet has 1 failures [11:38:45] PROBLEM - puppet last run on mw1219 is CRITICAL: CRITICAL: Puppet has 1 failures [11:38:45] PROBLEM - puppet last run on mw1203 is CRITICAL: CRITICAL: Puppet has 1 failures [11:38:46] PROBLEM - puppet last run on mw1090 is CRITICAL: CRITICAL: Puppet has 1 failures [11:38:54] PROBLEM - puppet last run on mw1093 is CRITICAL: CRITICAL: Puppet has 1 failures [11:39:07] PROBLEM - puppet last run on elastic1003 is CRITICAL: CRITICAL: Puppet has 1 failures [11:39:14] PROBLEM - puppet last run on mw1204 is CRITICAL: CRITICAL: Puppet has 1 failures [11:39:15] PROBLEM - puppet last run on mw1215 is CRITICAL: CRITICAL: Puppet has 1 failures [11:39:16] PROBLEM - puppet last run on mw1107 is CRITICAL: CRITICAL: Puppet has 1 failures [11:39:24] PROBLEM - puppet last run on mw1104 is CRITICAL: CRITICAL: Puppet has 1 failures [11:39:28] PROBLEM - puppet last run on wtp1008 is CRITICAL: CRITICAL: Puppet has 1 failures [11:39:28] PROBLEM - puppet last run on mw1066 is CRITICAL: CRITICAL: Puppet has 1 failures [11:39:29] PROBLEM - puppet last run on mw1220 is CRITICAL: CRITICAL: Puppet has 1 failures [11:39:29] PROBLEM - puppet last run on wtp1001 is CRITICAL: CRITICAL: Puppet has 1 failures [11:39:29] PROBLEM - puppet last run on mw1064 is CRITICAL: CRITICAL: Puppet has 1 failures [11:39:41] 
PROBLEM - puppet last run on mw1027 is CRITICAL: CRITICAL: Puppet has 1 failures [11:39:42] PROBLEM - puppet last run on mw1037 is CRITICAL: CRITICAL: Puppet has 1 failures [11:39:46] PROBLEM - puppet last run on mw1143 is CRITICAL: CRITICAL: Puppet has 1 failures [11:39:55] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures [11:39:57] PROBLEM - puppet last run on mw1086 is CRITICAL: CRITICAL: Puppet has 1 failures [11:39:57] PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: Puppet has 1 failures [11:40:04] PROBLEM - puppet last run on mw1010 is CRITICAL: CRITICAL: Puppet has 2 failures [11:40:05] PROBLEM - puppet last run on mw1193 is CRITICAL: CRITICAL: Puppet has 1 failures [11:40:14] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 1 failures [11:40:24] PROBLEM - puppet last run on mw1155 is CRITICAL: CRITICAL: Puppet has 1 failures [11:40:25] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures [11:40:44] PROBLEM - puppet last run on mw1113 is CRITICAL: CRITICAL: Puppet has 1 failures [11:40:45] PROBLEM - puppet last run on mw1021 is CRITICAL: CRITICAL: Puppet has 1 failures [11:40:45] PROBLEM - puppet last run on mw1131 is CRITICAL: CRITICAL: Puppet has 1 failures [11:40:55] PROBLEM - puppet last run on mw1128 is CRITICAL: CRITICAL: Puppet has 1 failures [11:40:58] PROBLEM - puppet last run on mw1207 is CRITICAL: CRITICAL: Puppet has 1 failures [11:41:04] PROBLEM - puppet last run on mw1154 is CRITICAL: CRITICAL: Puppet has 1 failures [11:41:09] PROBLEM - puppet last run on mw1016 is CRITICAL: CRITICAL: Puppet has 2 failures [11:41:15] PROBLEM - puppet last run on mw1199 is CRITICAL: CRITICAL: Puppet has 1 failures [11:41:26] PROBLEM - puppet last run on mw1018 is CRITICAL: CRITICAL: Puppet has 1 failures [11:41:35] PROBLEM - puppet last run on elastic1010 is CRITICAL: CRITICAL: Puppet has 1 failures [11:41:35] PROBLEM - puppet last run on elastic1009 is CRITICAL: CRITICAL: Puppet has 1 failures [11:41:35] PROBLEM - puppet last run on mw1103 is CRITICAL: CRITICAL: Puppet has 1 failures [11:41:57] PROBLEM - puppet last run on elastic1013 is CRITICAL: CRITICAL: Puppet has 1 failures [11:41:58] PROBLEM - puppet last run on mw1101 is CRITICAL: CRITICAL: Puppet has 1 failures [11:41:58] PROBLEM - puppet last run on mw1157 is CRITICAL: CRITICAL: Puppet has 1 failures [11:41:59] PROBLEM - puppet last run on mw1073 is CRITICAL: CRITICAL: Puppet has 1 failures [11:42:06] PROBLEM - puppet last run on mw1137 is CRITICAL: CRITICAL: Puppet has 1 failures [11:42:07] PROBLEM - puppet last run on mw1047 is CRITICAL: CRITICAL: Puppet has 1 failures [11:42:07] PROBLEM - puppet last run on elastic1016 is CRITICAL: CRITICAL: Puppet has 1 failures [11:42:15] PROBLEM - puppet last run on mw1085 is CRITICAL: CRITICAL: Puppet has 1 failures [11:42:15] PROBLEM - puppet last run on ocg1002 is CRITICAL: CRITICAL: Puppet has 1 failures [11:42:15] PROBLEM - puppet last run on sca1001 is CRITICAL: CRITICAL: Puppet has 2 failures [11:42:15] PROBLEM - puppet last run on mw1194 is CRITICAL: CRITICAL: Puppet has 1 failures [11:42:15] PROBLEM - puppet last run on mw1179 is CRITICAL: CRITICAL: Puppet has 1 failures [11:42:16] PROBLEM - puppet last run on mw1102 is CRITICAL: CRITICAL: Puppet has 1 failures [11:42:17] !log killed stray/old copy of diamond that was filling up conntrack on virt1000 [11:42:23] Logged the message, Master [11:42:25] PROBLEM - puppet last run on mw1017 is CRITICAL: CRITICAL: Puppet has 1 failures 
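A note on the CORS preflight change discussed above (gerrit 167542, bug 55631): a preflight is just an OPTIONS request carrying Origin and Access-Control-Request-* headers, and the cache has to answer it with matching Access-Control-Allow-* headers before the browser will issue the real request. A rough client-side sanity check could look like the following sketch; the object path is a placeholder, not a real file, and this is not a description of what the patch itself does.

    # Sketch only: send a bare CORS preflight and look at the Allow-* headers
    # that come back. The object path below is a placeholder.
    curl -s -i -X OPTIONS \
      -H 'Origin: https://en.wikipedia.org' \
      -H 'Access-Control-Request-Method: GET' \
      'https://upload.wikimedia.org/wikipedia/commons/EXAMPLE.jpg' \
      | grep -i '^access-control-'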
[11:42:30] PROBLEM - puppet last run on mw1020 is CRITICAL: CRITICAL: Puppet has 1 failures [11:42:31] PROBLEM - puppet last run on mw1075 is CRITICAL: CRITICAL: Puppet has 1 failures [11:42:36] PROBLEM - puppet last run on mw1078 is CRITICAL: CRITICAL: Puppet has 1 failures [11:42:36] PROBLEM - puppet last run on mw1058 is CRITICAL: CRITICAL: Puppet has 1 failures [11:42:56] PROBLEM - puppet last run on mw1095 is CRITICAL: CRITICAL: Puppet has 1 failures [11:43:06] PROBLEM - puppet last run on mw1182 is CRITICAL: CRITICAL: Puppet has 1 failures [11:43:15] PROBLEM - puppet last run on mw1070 is CRITICAL: CRITICAL: Puppet has 1 failures [11:43:15] PROBLEM - puppet last run on wtp1014 is CRITICAL: CRITICAL: Puppet has 1 failures [11:43:16] PROBLEM - puppet last run on mw1094 is CRITICAL: CRITICAL: Puppet has 1 failures [11:43:16] PROBLEM - puppet last run on mw1019 is CRITICAL: CRITICAL: Puppet has 1 failures [11:43:16] PROBLEM - puppet last run on tmh1001 is CRITICAL: CRITICAL: Puppet has 2 failures [11:43:35] PROBLEM - puppet last run on mw1214 is CRITICAL: CRITICAL: Puppet has 1 failures [11:43:35] PROBLEM - puppet last run on mw1127 is CRITICAL: CRITICAL: Puppet has 1 failures [11:43:36] PROBLEM - puppet last run on mw1169 is CRITICAL: CRITICAL: Puppet has 1 failures [11:43:37] PROBLEM - puppet last run on mw1196 is CRITICAL: CRITICAL: Puppet has 1 failures [11:43:37] PROBLEM - puppet last run on mw1136 is CRITICAL: CRITICAL: Puppet has 1 failures [11:43:55] mh it seems to fail with [11:43:56] Error: Could not update: Execution of '/usr/bin/salt-call --out=json grains.append deployment_target scap' returned 2: Minion failed to authenticate with the master, has the minion key been accepted? [11:44:00] Error: /Stage[main]/Mediawiki::Scap/Package[scap]/ensure: change from absent to latest failed: Could not update: Execution of '/usr/bin/salt-call --out=json grains.append deployment_target scap' returned 2: Minion failed to authenticate with the master, has the minion key been accepted? 
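The scap and parsoid grain failures above all share the same symptom: the minion on each host cannot authenticate against the salt master (as becomes clear below, the master on palladium was being upgraded at the time). For reference, the usual first checks for this error look roughly like the following sketch; the minion hostname is a placeholder.

    # On the salt master: list keys and accept the minion's key if it is still
    # pending. (Not the cause in this incident -- the in-progress master upgrade
    # was -- but it is what the error message itself suggests checking first.)
    salt-key -L
    salt-key -a mw1091.eqiad.wmnet    # placeholder minion name

    # On the affected minion: retry with debug logging to watch the auth
    # handshake, then re-run the exact call puppet was executing.
    salt-call -l debug test.ping
    salt-call --out=json grains.append deployment_target scap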
[11:44:05] PROBLEM - puppet last run on mw1184 is CRITICAL: CRITICAL: Puppet has 1 failures [11:44:06] PROBLEM - puppet last run on mw1036 is CRITICAL: CRITICAL: Puppet has 1 failures [11:44:06] PROBLEM - puppet last run on mw1191 is CRITICAL: CRITICAL: Puppet has 1 failures [11:44:06] PROBLEM - puppet last run on mw1083 is CRITICAL: CRITICAL: Puppet has 1 failures [11:44:11] (03PS1) 10QChris: Require 2 ACKs from kafka brokers for mobile caches [puppet] - 10https://gerrit.wikimedia.org/r/167550 (https://bugzilla.wikimedia.org/69667) [11:44:16] PROBLEM - puppet last run on mw1035 is CRITICAL: CRITICAL: Puppet has 1 failures [11:44:17] (03PS1) 10QChris: Require 2 ACKs from kafka brokers for text caches [puppet] - 10https://gerrit.wikimedia.org/r/167551 (https://bugzilla.wikimedia.org/69667) [11:44:23] (03PS1) 10QChris: Require 2 ACKs from kafka brokers for bits caches [puppet] - 10https://gerrit.wikimedia.org/r/167552 (https://bugzilla.wikimedia.org/69667) [11:44:27] PROBLEM - puppet last run on mw1218 is CRITICAL: CRITICAL: Puppet has 1 failures [11:44:27] PROBLEM - puppet last run on wtp1021 is CRITICAL: CRITICAL: Puppet has 1 failures [11:44:28] (03PS1) 10QChris: Require 2 ACKs from kafka brokers per default [puppet] - 10https://gerrit.wikimedia.org/r/167553 (https://bugzilla.wikimedia.org/69667) [11:44:35] PROBLEM - puppet last run on mw1216 is CRITICAL: CRITICAL: Puppet has 1 failures [11:44:36] PROBLEM - puppet last run on mw1161 is CRITICAL: CRITICAL: Puppet has 1 failures [11:44:45] PROBLEM - puppet last run on virt1000 is CRITICAL: CRITICAL: Puppet has 1 failures [11:44:45] PROBLEM - puppet last run on mw1130 is CRITICAL: CRITICAL: Puppet has 1 failures [11:44:45] PROBLEM - puppet last run on mw1015 is CRITICAL: CRITICAL: Puppet has 2 failures [11:44:45] PROBLEM - puppet last run on mw1132 is CRITICAL: CRITICAL: Puppet has 1 failures [11:44:55] (03CR) 10QChris: [C: 04-1] "Wait for deployment of I241984b375ce65c95cafd76dd3bb22bdd0aa71f7 to" [puppet] - 10https://gerrit.wikimedia.org/r/167550 (https://bugzilla.wikimedia.org/69667) (owner: 10QChris) [11:44:56] PROBLEM - puppet last run on ocg1001 is CRITICAL: CRITICAL: Puppet has 1 failures [11:44:56] PROBLEM - puppet last run on mw1040 is CRITICAL: CRITICAL: Puppet has 1 failures [11:45:10] (03CR) 10QChris: [C: 04-1] "Wait for deployment of I1cbd587c89f048dbdc75c28e4a5091e12cd19d3f to" [puppet] - 10https://gerrit.wikimedia.org/r/167551 (https://bugzilla.wikimedia.org/69667) (owner: 10QChris) [11:45:10] PROBLEM - puppet last run on mw1062 is CRITICAL: CRITICAL: Puppet has 1 failures [11:45:15] PROBLEM - puppet last run on mw1096 is CRITICAL: CRITICAL: Puppet has 1 failures [11:45:15] PROBLEM - puppet last run on tmh1002 is CRITICAL: CRITICAL: Puppet has 2 failures [11:45:21] (03CR) 10QChris: [C: 04-1] "Wait for deployment of I657916fa272e1b7977497e6fece1af7e7c0f533c to" [puppet] - 10https://gerrit.wikimedia.org/r/167552 (https://bugzilla.wikimedia.org/69667) (owner: 10QChris) [11:45:27] PROBLEM - puppet last run on mw1138 is CRITICAL: CRITICAL: Puppet has 1 failures [11:45:27] PROBLEM - puppet last run on wtp1017 is CRITICAL: CRITICAL: Puppet has 1 failures [11:45:27] PROBLEM - puppet last run on wtp1024 is CRITICAL: CRITICAL: Puppet has 1 failures [11:45:28] PROBLEM - puppet last run on mw1109 is CRITICAL: CRITICAL: Puppet has 1 failures [11:45:35] PROBLEM - puppet last run on mw1124 is CRITICAL: CRITICAL: Puppet has 1 failures [11:45:35] PROBLEM - puppet last run on mw1192 is CRITICAL: CRITICAL: Puppet has 1 failures [11:45:35] 
PROBLEM - puppet last run on mw1013 is CRITICAL: CRITICAL: Puppet has 2 failures [11:45:57] PROBLEM - puppet last run on mw1080 is CRITICAL: CRITICAL: Puppet has 1 failures [11:45:57] PROBLEM - puppet last run on mw1038 is CRITICAL: CRITICAL: Puppet has 1 failures [11:45:57] PROBLEM - puppet last run on mw1067 is CRITICAL: CRITICAL: Puppet has 1 failures [11:45:59] PROBLEM - puppet last run on mw1031 is CRITICAL: CRITICAL: Puppet has 1 failures [11:46:00] PROBLEM - puppet last run on mw1147 is CRITICAL: CRITICAL: Puppet has 1 failures [11:46:00] PROBLEM - puppet last run on mw1089 is CRITICAL: CRITICAL: Puppet has 1 failures [11:46:05] so all of this is related to some weird salt problem I am investigating [11:46:15] PROBLEM - puppet last run on wtp1019 is CRITICAL: CRITICAL: Puppet has 1 failures [11:46:16] PROBLEM - puppet last run on mw1072 is CRITICAL: CRITICAL: Puppet has 1 failures [11:46:16] PROBLEM - puppet last run on mw1178 is CRITICAL: CRITICAL: Puppet has 1 failures [11:46:17] ah! [11:46:26] PROBLEM - puppet last run on mw1028 is CRITICAL: CRITICAL: Puppet has 1 failures [11:46:28] PROBLEM - puppet last run on mw1048 is CRITICAL: CRITICAL: Puppet has 1 failures [11:46:32] godog: Error: Execution of '/usr/bin/salt-call --out=json grains.append deployment_target parsoid' returned 2: Minion failed to authenticate with the master, has the minion key been accepted? [11:46:37] PROBLEM - puppet last run on vanadium is CRITICAL: CRITICAL: Puppet has 1 failures [11:46:37] PROBLEM - puppet last run on mw1115 is CRITICAL: CRITICAL: Puppet has 1 failures [11:46:39] weird... [11:46:42] akosiaris: indeed [11:46:48] but reproducible at least [11:46:56] PROBLEM - puppet last run on wtp1020 is CRITICAL: CRITICAL: Puppet has 1 failures [11:47:01] <_joe_> akosiaris: it shouldn't even try to install scap [11:47:02] PROBLEM - puppet last run on mw1134 is CRITICAL: CRITICAL: Puppet has 1 failures [11:47:05] PROBLEM - puppet last run on mw1197 is CRITICAL: CRITICAL: Puppet has 1 failures [11:47:05] PROBLEM - puppet last run on mw1141 is CRITICAL: CRITICAL: Puppet has 1 failures [11:47:06] <_joe_> it's already there :) [11:47:16] PROBLEM - puppet last run on mw1063 is CRITICAL: CRITICAL: Puppet has 1 failures [11:47:20] scap ? [11:47:25] PROBLEM - puppet last run on wtp1009 is CRITICAL: CRITICAL: Puppet has 1 failures [11:47:26] PROBLEM - puppet last run on logstash1003 is CRITICAL: CRITICAL: Puppet has 2 failures [11:47:27] PROBLEM - puppet last run on zirconium is CRITICAL: CRITICAL: Puppet has 3 failures [11:47:32] apergos: any clues on that salt error above? 
[11:47:35] PROBLEM - puppet last run on mw1106 is CRITICAL: CRITICAL: Puppet has 1 failures [11:47:35] PROBLEM - puppet last run on mw1187 is CRITICAL: CRITICAL: Puppet has 1 failures [11:47:35] PROBLEM - puppet last run on mw1145 is CRITICAL: CRITICAL: Puppet has 1 failures [11:47:35] PROBLEM - puppet last run on mw1041 is CRITICAL: CRITICAL: Puppet has 1 failures [11:47:36] PROBLEM - puppet last run on mw1059 is CRITICAL: CRITICAL: Puppet has 1 failures [11:47:36] PROBLEM - puppet last run on mw1045 is CRITICAL: CRITICAL: Puppet has 1 failures [11:47:45] PROBLEM - puppet last run on mw1200 is CRITICAL: CRITICAL: Puppet has 1 failures [11:47:47] PROBLEM - puppet last run on wtp1006 is CRITICAL: CRITICAL: Puppet has 1 failures [11:47:51] yeah that is me, I have just tried to upgrade the master on pallaium (which implies also the minion) [11:47:55] PROBLEM - puppet last run on mw1140 is CRITICAL: CRITICAL: Puppet has 1 failures [11:47:56] PROBLEM - puppet last run on snapshot1003 is CRITICAL: CRITICAL: Puppet has 1 failures [11:48:02] and it's not happening >_< [11:48:06] PROBLEM - puppet last run on elastic1001 is CRITICAL: CRITICAL: Puppet has 1 failures [11:48:06] PROBLEM - puppet last run on elastic1004 is CRITICAL: CRITICAL: Puppet has 1 failures [11:48:15] PROBLEM - puppet last run on elastic1007 is CRITICAL: CRITICAL: Puppet has 1 failures [11:48:16] PROBLEM - puppet last run on mw1082 is CRITICAL: CRITICAL: Puppet has 1 failures [11:48:17] ah, ok [11:48:23] a !log would have been nice btw :) [11:48:33] well I was hoping it would take 3 minutes and then I would log it [11:48:35] PROBLEM - puppet last run on mw1160 is CRITICAL: CRITICAL: Puppet has 1 failures [11:48:36] PROBLEM - puppet last run on mw1099 is CRITICAL: CRITICAL: Puppet has 1 failures [11:48:36] PROBLEM - puppet last run on wtp1016 is CRITICAL: CRITICAL: Puppet has 1 failures [11:48:36] PROBLEM - puppet last run on elastic1012 is CRITICAL: CRITICAL: Puppet has 1 failures [11:48:36] PROBLEM - puppet last run on mw1176 is CRITICAL: CRITICAL: Puppet has 1 failures [11:48:45] instead are we are after 5 minutes and it's behaving badly [11:48:45] PROBLEM - puppet last run on searchidx1001 is CRITICAL: CRITICAL: Puppet has 1 failures [11:48:46] PROBLEM - puppet last run on mw1173 is CRITICAL: CRITICAL: Puppet has 1 failures [11:48:56] PROBLEM - puppet last run on mw1174 is CRITICAL: CRITICAL: Puppet has 1 failures [11:48:56] PROBLEM - puppet last run on mw1150 is CRITICAL: CRITICAL: Puppet has 1 failures [11:48:56] PROBLEM - puppet last run on mw1153 is CRITICAL: CRITICAL: Puppet has 1 failures [11:48:57] PROBLEM - puppet last run on netmon1001 is CRITICAL: CRITICAL: Puppet has 2 failures [11:48:57] PROBLEM - puppet last run on mw1005 is CRITICAL: CRITICAL: Puppet has 2 failures [11:48:57] PROBLEM - puppet last run on mw1026 is CRITICAL: CRITICAL: Puppet has 1 failures [11:48:58] PROBLEM - puppet last run on mw1088 is CRITICAL: CRITICAL: Puppet has 1 failures [11:48:58] PROBLEM - puppet last run on mw1060 is CRITICAL: CRITICAL: Puppet has 1 failures [11:49:05] PROBLEM - puppet last run on mw1046 is CRITICAL: CRITICAL: Puppet has 1 failures [11:49:07] <_joe_> should we kill icinga-wm for the moment? 
[11:49:14] yes, thank you for that [11:49:15] PROBLEM - puppet last run on logstash1001 is CRITICAL: CRITICAL: Puppet has 2 failures [11:49:15] PROBLEM - puppet last run on mw1189 is CRITICAL: CRITICAL: Puppet has 1 failures [11:49:16] PROBLEM - puppet last run on mw1217 is CRITICAL: CRITICAL: Puppet has 1 failures [11:49:16] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [11:49:16] PROBLEM - puppet last run on mw1205 is CRITICAL: CRITICAL: Puppet has 1 failures [11:49:16] PROBLEM - puppet last run on mw1100 is CRITICAL: CRITICAL: Puppet has 1 failures [11:49:20] <_joe_> I have to go fetch my step-daughter now [11:49:25] PROBLEM - puppet last run on elastic1008 is CRITICAL: CRITICAL: Puppet has 1 failures [11:49:27] PROBLEM - puppet last run on mw1120 is CRITICAL: CRITICAL: Puppet has 1 failures [11:49:27] PROBLEM - puppet last run on mw1069 is CRITICAL: CRITICAL: Puppet has 1 failures [11:49:27] PROBLEM - puppet last run on mw1006 is CRITICAL: CRITICAL: Puppet has 2 failures [11:49:27] PROBLEM - puppet last run on mw1007 is CRITICAL: CRITICAL: Puppet has 2 failures [11:49:36] _joe_: I'll take it [11:49:36] PROBLEM - puppet last run on mw1164 is CRITICAL: CRITICAL: Puppet has 1 failures [11:49:45] <_joe_> godog: thanks! [11:49:53] np [11:49:55] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: Puppet has 1 failures [11:49:56] PROBLEM - puppet last run on mw1068 is CRITICAL: CRITICAL: Puppet has 1 failures [11:49:57] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: Puppet has 1 failures [11:50:19] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 1 failures [11:50:29] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: Puppet has 1 failures [11:50:30] PROBLEM - puppet last run on mw1012 is CRITICAL: CRITICAL: Puppet has 2 failures [11:50:30] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures [11:50:39] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 1 failures [11:50:41] PROBLEM - puppet last run on mw1213 is CRITICAL: CRITICAL: Puppet has 1 failures [11:50:42] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: Puppet has 1 failures [11:50:42] PROBLEM - puppet last run on mw1117 is CRITICAL: CRITICAL: Puppet has 1 failures [11:50:42] PROBLEM - puppet last run on mw1211 is CRITICAL: CRITICAL: Puppet has 1 failures [11:50:49] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: Puppet has 1 failures [11:50:49] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures [11:50:49] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures [11:50:49] PROBLEM - puppet last run on mw1172 is CRITICAL: CRITICAL: Puppet has 1 failures [11:50:49] PROBLEM - puppet last run on mw1175 is CRITICAL: CRITICAL: Puppet has 1 failures [11:50:50] PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: Puppet has 1 failures [11:50:59] PROBLEM - puppet last run on mw1039 is CRITICAL: CRITICAL: Puppet has 1 failures [11:51:00] PROBLEM - puppet last run on sca1002 is CRITICAL: CRITICAL: Puppet has 2 failures [11:51:13] wtf, rm fails silently? [11:51:41] !log temporarily stopped ircecho/icinga-wm on neon, shower of alarms [11:51:47] Logged the message, Master [11:51:54] YuviPanda: what do you mean? 
it shouldn't [11:53:36] apergos: I've shut icinga-wm, let me know when it is good to be turned back on [11:53:44] ok, thanks [11:53:57] because this issue is new to me it may take a bit [11:54:05] godog: yeah, labs instance with a full root partition, rm doesn't actually seem to remove them when used with shell expansion... [11:54:18] sudo rm -v omniscan_* returns, and the files are still around... [11:54:21] not a big fan of having it off, it also affects labs and wikidata :( [11:54:29] if I do them individually it works fine... [11:54:38] * YuviPanda scratches head [11:55:52] YuviPanda: sounds extra weird [11:55:57] yeah [11:55:59] YuviPanda: what output does that rm give? first thing that comes to mind is that it is your user shell doing the expansion (and failing?) [11:56:27] rm doesn't give me anything, and ls with the same expansion works. [11:57:16] weird indeed [11:57:19] (03PS1) 10Glaisher: Disable new page patrol on fishbowl/private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167555 (https://bugzilla.wikimedia.org/72239) [11:57:38] wat, xargs doesn't do anything either [11:57:45] so it is not failing but it is being passed arguments [11:58:09] yeah [11:58:16] ok... YuviPanda has been hacked, emergency procedures please :P [11:58:25] this is tools-webgrid-01 [11:58:28] haha :) [11:59:00] ah, hmm. [11:59:08] so if I sudo -s and then do it, it works fine. [11:59:12] but if I do sudo rm it doesn't [12:00:28] strace it ? [12:01:21] PROBLEM - puppet last run on gallium is CRITICAL: CRITICAL: Puppet has 6 failures [12:02:06] akosiaris: have to wait for a while, all the files have been deleted now. [12:02:10] will strace next time [12:02:13] rm -v also gave me no output [12:02:40] * YuviPanda emails the perpetrator of the /tmp filling [12:03:46] kart_: btw, puppet failure in betalabs apertium server, I guess you're on it :) [12:04:25] apergos: haha puppet came around and restarted ircecho btw, I'll leave it on [12:05:15] * YuviPanda has been wanting to replace SAL, ircecho etc with something else unified. Should do at some point [12:05:29] godog.. ok, well I'm watching the master startup and trying to see what's going wrong (because something is taking much longer than it ought to) [12:06:09] YuviPanda: yes. Should be okay now. [12:06:24] kart_: cool! :) icinga complains in -qa, I wonder if I should make it complain here as well [12:06:28] YuviPanda: do we have any kind of monitoring in place? [12:06:36] kart_: ya! [12:06:45] kart_: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=labmon [12:06:53] YuviPanda: thanks! [12:07:03] kart_: yw! I can add you to be email notified if you want ;) [12:07:11] YuviPanda: please! [12:07:24] kart_: it will be for all of betalabs tho :) [12:07:31] YuviPanda: gah. [12:07:35] but that's fine. [12:07:45] will have more granular monitoring soon [12:07:48] * YuviPanda is working on shingen [12:07:56] which is shinken's equivalent to naggen [12:07:57] apergos: mh I seem to remember a similar problem I encountered with the salt master a while ago where the process that cleans jobs would die leaving many jobs behind [12:08:56] YuviPanda: cool. Uses wikitech user/password? [12:09:12] kart_: shinken.wmflabs.org. User/pass: guest/guest [12:09:16] nothing much there atm [12:11:32] kart_: oh, the icinga url?
yeah, it uses wikitech username/pass [12:11:40] (03PS1) 10Yuvipanda: labmon: Add kart to betalabs monitoring [puppet] - 10https://gerrit.wikimedia.org/r/167560 [12:11:44] well there's 3k of them but it shouldn't really take this long to stat those files [12:11:50] kart_: ^ need to add an entry to the private repo as well tho [12:15:42] apergos: what directory are you looking at btw? this returns way more than 3k: find /var/cache/salt/master/jobs -iname 'return.p' -type f [12:16:06] the number of jobs, not the number of entries [12:16:15] but I might stop the master and blow away the entire thing and let it recreate [12:16:26] we'll lose some data but that's the way it goes [12:17:23] PROBLEM - ElasticSearch health check on elastic1007 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2039: active_shards: 6112: relocating_shards: 2: initializing_shards: 1: unassigned_shards: 1 [12:17:23] PROBLEM - ElasticSearch health check on elastic1017 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2039: active_shards: 6112: relocating_shards: 2: initializing_shards: 1: unassigned_shards: 1 [12:17:25] that's what I did IIRC, move aside the jobs directory and restart the master [12:18:00] I'm just tossing it [12:18:22] RECOVERY - ElasticSearch health check on elastic1017 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2040: active_shards: 6115: relocating_shards: 1: initializing_shards: 0: unassigned_shards: 0 [12:18:24] RECOVERY - ElasticSearch health check on elastic1007 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2040: active_shards: 6115: relocating_shards: 1: initializing_shards: 0: unassigned_shards: 0 [12:18:27] deploy data is all in redis, anything else has either been returned to people or was broken so.. [12:18:48] YuviPanda: what is kart there? IRC username? [12:18:52] :) [12:19:18] My Gerrit id is: KartikMistry, IRC, kart_ etc [12:19:46] brb.
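Back to the tools-webgrid-01 puzzle from around 11:54 (sudo rm -v omniscan_* returning silently while the files stayed put, yet working under sudo -s): godog's hypothesis was that the glob is expanded by the invoking user's shell before sudo ever runs, so root's rm only sees whatever argument list the unprivileged shell produced. A small sketch of that difference, reusing the omniscan_* pattern from the log; the actual cause was never confirmed in-channel.

    cd /tmp
    # The pattern is expanded here, by the calling (unprivileged) user's shell:
    sudo rm -v omniscan_*
    # Deferring expansion to a root shell instead -- effectively what running
    # `sudo -s` and then rm did in the log:
    sudo sh -c 'rm -v omniscan_*'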
[12:19:50] apergos: if you move it though you don't have to wait for the rm to finish [12:20:49] good point [12:26:55] RECOVERY - puppet last run on wtp1005 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [12:27:14] RECOVERY - puppet last run on logstash1002 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [12:27:17] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [12:27:25] RECOVERY - puppet last run on mw1011 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [12:27:34] RECOVERY - puppet last run on mw1002 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [12:27:34] RECOVERY - puppet last run on mw1114 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [12:27:34] RECOVERY - puppet last run on mw1213 is OK: OK: Puppet is currently enabled, last run 63 seconds ago with 0 failures [12:27:35] RECOVERY - puppet last run on mw1054 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [12:27:35] RECOVERY - puppet last run on gallium is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [12:27:35] RECOVERY - puppet last run on mw1211 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [12:27:35] RECOVERY - puppet last run on mw1195 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [12:27:40] well that was painful... apparently somethign else was keeping it open [12:27:44] RECOVERY - puppet last run on mw1172 is OK: OK: Puppet is currently enabled, last run 60 seconds ago with 0 failures [12:27:46] RECOVERY - puppet last run on mw1177 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [12:27:47] RECOVERY - puppet last run on mw1175 is OK: OK: Puppet is currently enabled, last run 64 seconds ago with 0 failures [12:27:48] RECOVERY - puppet last run on mw1149 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [12:27:48] RECOVERY - puppet last run on mw1076 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [12:27:49] RECOVERY - puppet last run on mw1173 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [12:27:51] so now... 
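The move-aside approach godog describes above (so the master is not blocked while the old job cache is deleted) amounts to something like the following sketch, assuming the stock service name and the job cache path mentioned at 12:15.

    # Swap the bloated job cache out from under the master, then delete it at leisure.
    service salt-master stop
    mv /var/cache/salt/master/jobs /var/cache/salt/master/jobs.old
    service salt-master start
    # The slow deletion no longer holds up the master.
    rm -rf /var/cache/salt/master/jobs.old &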
[12:27:58] RECOVERY - puppet last run on mw1039 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [12:28:04] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [12:28:04] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [12:28:04] RECOVERY - puppet last run on mw1060 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [12:28:09] RECOVERY - puppet last run on mw1126 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [12:28:10] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [12:28:14] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [12:28:14] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [12:28:15] RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [12:28:15] RECOVERY - puppet last run on mw1162 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [12:28:24] RECOVERY - puppet last run on mw1189 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [12:28:25] RECOVERY - puppet last run on mw1205 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [12:28:25] RECOVERY - puppet last run on snapshot1001 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [12:28:29] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [12:28:29] RECOVERY - puppet last run on wtp1012 is OK: OK: Puppet is currently enabled, last run 65 seconds ago with 0 failures [12:28:29] RECOVERY - puppet last run on mw1206 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [12:28:29] RECOVERY - puppet last run on mw1120 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [12:28:34] RECOVERY - puppet last run on mw1208 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [12:28:34] RECOVERY - puppet last run on mw1055 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [12:28:34] RECOVERY - puppet last run on mw1008 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [12:28:34] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [12:28:34] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [12:28:48] !log upgraded salt master (plus minion) on palladium to 2014.1.11, all neww precise installs will get this version now, other minion upgrades to follow shortly [12:28:53] Logged the message, Master [12:28:54] RECOVERY - puppet last run on mw1117 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [12:28:54] RECOVERY - puppet last run on mw1160 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [12:28:54] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [12:28:54] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [12:28:54] 
RECOVERY - puppet last run on mw1014 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [12:28:55] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [12:28:55] RECOVERY - puppet last run on mw1098 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [12:28:56] RECOVERY - puppet last run on mw1164 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [12:28:56] RECOVERY - puppet last run on mw1129 is OK: OK: Puppet is currently enabled, last run 60 seconds ago with 0 failures [12:29:05] RECOVERY - puppet last run on mw1044 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [12:29:05] RECOVERY - puppet last run on mw1151 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [12:29:06] RECOVERY - puppet last run on mw1133 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [12:29:06] RECOVERY - puppet last run on mw1051 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [12:29:14] RECOVERY - puppet last run on mw1180 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [12:29:14] RECOVERY - puppet last run on mw1049 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [12:29:25] RECOVERY - puppet last run on mw1111 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [12:29:25] RECOVERY - puppet last run on mw1202 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [12:29:34] RECOVERY - puppet last run on mw1081 is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures [12:29:35] RECOVERY - puppet last run on hafnium is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [12:29:40] this is annoying [12:29:40] RECOVERY - puppet last run on elastic1006 is OK: OK: Puppet is currently enabled, last run 61 seconds ago with 0 failures [12:29:45] RECOVERY - puppet last run on mw1084 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [12:29:45] RECOVERY - puppet last run on mw1165 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [12:29:54] RECOVERY - puppet last run on mw1190 is OK: OK: Puppet is currently enabled, last run 66 seconds ago with 0 failures [12:29:54] RECOVERY - puppet last run on snapshot1002 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [12:29:55] RECOVERY - puppet last run on wtp1018 is OK: OK: Puppet is currently enabled, last run 62 seconds ago with 0 failures [12:29:55] RECOVERY - puppet last run on mw1125 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [12:29:55] RECOVERY - puppet last run on mw1079 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [12:29:55] RECOVERY - puppet last run on mw1034 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [12:29:55] RECOVERY - puppet last run on mw1159 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [12:29:55] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [12:29:56] RECOVERY - puppet last run on mw1168 is OK: OK: Puppet is currently enabled, last run 66 seconds ago with 0 failures [12:30:05] RECOVERY - puppet last run on wtp1022 is OK: OK: Puppet is 
currently enabled, last run 7 seconds ago with 0 failures [12:30:06] RECOVERY - puppet last run on snapshot1004 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [12:30:06] RECOVERY - puppet last run on mw1004 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [12:30:06] RECOVERY - puppet last run on mw1050 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [12:30:07] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [12:30:14] RECOVERY - puppet last run on mw1056 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [12:30:14] RECOVERY - puppet last run on mw1181 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [12:30:14] RECOVERY - puppet last run on mw1087 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [12:30:14] RECOVERY - puppet last run on elastic1011 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [12:30:24] RECOVERY - puppet last run on mw1057 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [12:30:27] RECOVERY - puppet last run on mw1146 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [12:30:35] RECOVERY - puppet last run on wtp1002 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [12:30:44] RECOVERY - puppet last run on wtp1023 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [12:30:45] RECOVERY - puppet last run on mw1210 is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures [12:30:45] RECOVERY - puppet last run on mw1029 is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures [12:30:45] RECOVERY - puppet last run on mw1116 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [12:30:54] RECOVERY - puppet last run on mw1183 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [12:30:54] RECOVERY - puppet last run on mw1156 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [12:30:54] RECOVERY - puppet last run on elastic1014 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [12:30:54] RECOVERY - puppet last run on mw1171 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [12:31:04] RECOVERY - puppet last run on mw1030 is OK: OK: Puppet is currently enabled, last run 60 seconds ago with 0 failures [12:31:05] RECOVERY - puppet last run on mw1074 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [12:31:14] RECOVERY - puppet last run on mw1097 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [12:31:16] RECOVERY - puppet last run on mw1053 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [12:31:24] RECOVERY - puppet last run on wtp1004 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [12:31:34] RECOVERY - puppet last run on wtp1011 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [12:31:44] RECOVERY - puppet last run on mw1188 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [12:31:54] RECOVERY - puppet last run on mw1023 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [12:32:37] RECOVERY - 
puppet last run on wtp1013 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [12:32:37] RECOVERY - puppet last run on rcs1002 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [12:32:44] RECOVERY - puppet last run on mw1212 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [12:32:44] RECOVERY - puppet last run on elastic1002 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [12:32:44] RECOVERY - puppet last run on rcs1001 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [12:32:54] RECOVERY - puppet last run on elastic1005 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [12:32:55] RECOVERY - puppet last run on mw1105 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [12:32:55] RECOVERY - puppet last run on mw1198 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [12:33:00] sigh, even having downtime for those won't eliminate recoveries IIRC [12:33:08] RECOVERY - puppet last run on mw1167 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [12:33:09] RECOVERY - puppet last run on ocg1003 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [12:33:09] RECOVERY - puppet last run on wtp1015 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [12:33:09] RECOVERY - puppet last run on mw1033 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [12:33:25] RECOVERY - puppet last run on wtp1007 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [12:33:26] RECOVERY - puppet last run on wtp1003 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [12:33:26] RECOVERY - puppet last run on mw1148 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [12:33:45] RECOVERY - puppet last run on mw1108 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [12:33:49] RECOVERY - puppet last run on mw1032 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [12:33:58] RECOVERY - puppet last run on mw1186 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [12:34:05] RECOVERY - puppet last run on mw1122 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [12:34:14] RECOVERY - puppet last run on elastic1017 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [12:34:14] RECOVERY - puppet last run on mw1077 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [12:34:14] RECOVERY - puppet last run on analytics1027 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [12:34:15] RECOVERY - puppet last run on mw1024 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [12:34:32] RECOVERY - puppet last run on mw1001 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [12:34:32] RECOVERY - puppet last run on virt1000 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [12:34:32] RECOVERY - puppet last run on mw1185 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [12:34:32] RECOVERY - puppet last run on mw1121 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [12:34:33] 
RECOVERY - puppet last run on mw1142 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [12:34:46] RECOVERY - puppet last run on mw1209 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [12:34:55] RECOVERY - puppet last run on mw1091 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [12:34:55] RECOVERY - puppet last run on mw1112 is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures [12:34:56] RECOVERY - puppet last run on mw1201 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [12:34:56] RECOVERY - puppet last run on mw1152 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [12:34:56] RECOVERY - puppet last run on mw1022 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [12:35:04] RECOVERY - puppet last run on mw1219 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [12:35:05] RECOVERY - puppet last run on mw1016 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [12:35:05] RECOVERY - puppet last run on mw1139 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [12:35:05] RECOVERY - puppet last run on mw1203 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [12:35:05] RECOVERY - puppet last run on mw1010 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [12:35:20] RECOVERY - puppet last run on mw1090 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [12:35:21] RECOVERY - puppet last run on mw1093 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [12:35:21] RECOVERY - puppet last run on elastic1003 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [12:35:21] RECOVERY - puppet last run on mw1043 is OK: OK: Puppet is currently enabled, last run 62 seconds ago with 0 failures [12:35:24] RECOVERY - puppet last run on mw1204 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [12:35:35] RECOVERY - puppet last run on mw1215 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [12:35:35] RECOVERY - puppet last run on wtp1010 is OK: OK: Puppet is currently enabled, last run 67 seconds ago with 0 failures [12:35:44] RECOVERY - puppet last run on mw1107 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [12:35:45] RECOVERY - puppet last run on mw1220 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [12:35:46] RECOVERY - puppet last run on wtp1001 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [12:35:46] RECOVERY - puppet last run on mw1066 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [12:35:46] RECOVERY - puppet last run on mw1064 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [12:35:56] RECOVERY - puppet last run on mw1027 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [12:36:19] RECOVERY - puppet last run on mw1143 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [12:36:19] RECOVERY - puppet last run on mw1071 is OK: OK: Puppet is currently enabled, last run 60 seconds ago with 0 failures [12:36:19] RECOVERY - puppet last run on sca1001 is OK: OK: Puppet is currently enabled, last run 56 
seconds ago with 0 failures [12:36:19] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 63 seconds ago with 0 failures [12:36:20] RECOVERY - puppet last run on mw1086 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [12:36:20] RECOVERY - puppet last run on mw1193 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [12:36:24] RECOVERY - puppet last run on mw1155 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [12:36:25] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [12:36:34] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures [12:36:48] RECOVERY - puppet last run on wtp1008 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [12:36:48] RECOVERY - puppet last run on mw1113 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [12:36:48] RECOVERY - puppet last run on mw1104 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [12:37:04] RECOVERY - puppet last run on mw1073 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [12:37:04] RECOVERY - puppet last run on mw1131 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [12:37:04] RECOVERY - puppet last run on mw1021 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [12:37:05] RECOVERY - puppet last run on mw1037 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [12:37:05] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [12:37:05] RECOVERY - puppet last run on mw1207 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [12:37:05] RECOVERY - puppet last run on mw1128 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [12:37:10] ah [12:37:14] RECOVERY - puppet last run on mw1154 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [12:37:14] RECOVERY - puppet last run on tmh1001 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [12:37:15] RECOVERY - puppet last run on mw1199 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [12:37:24] RECOVERY - puppet last run on mw1075 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [12:37:35] RECOVERY - puppet last run on mw1018 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [12:37:39] RECOVERY - puppet last run on mw1078 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [12:37:39] RECOVERY - puppet last run on elastic1010 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [12:37:39] RECOVERY - puppet last run on mw1058 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [12:37:47] RECOVERY - puppet last run on elastic1009 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [12:37:47] RECOVERY - puppet last run on mw1103 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [12:37:48] RECOVERY - puppet last run on elastic1013 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [12:37:54] RECOVERY - puppet last run on 
mw1101 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [12:38:07] RECOVERY - puppet last run on mw1070 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [12:38:07] RECOVERY - puppet last run on mw1047 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [12:38:14] RECOVERY - puppet last run on mw1137 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [12:38:14] RECOVERY - puppet last run on mw1085 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [12:38:14] RECOVERY - puppet last run on ocg1002 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [12:38:15] RECOVERY - puppet last run on wtp1014 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [12:38:15] RECOVERY - puppet last run on mw1194 is OK: OK: Puppet is currently enabled, last run 68 seconds ago with 0 failures [12:38:15] RECOVERY - puppet last run on mw1019 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [12:38:15] RECOVERY - puppet last run on mw1179 is OK: OK: Puppet is currently enabled, last run 62 seconds ago with 0 failures [12:38:25] RECOVERY - puppet last run on mw1102 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [12:38:37] RECOVERY - puppet last run on lanthanum is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [12:38:44] RECOVERY - puppet last run on mw1017 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [12:38:44] RECOVERY - puppet last run on mw1214 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [12:38:45] RECOVERY - puppet last run on mw1020 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [12:38:55] RECOVERY - puppet last run on mw1015 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [12:39:05] RECOVERY - puppet last run on mw1157 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [12:39:06] RECOVERY - puppet last run on mw1095 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [12:39:06] RECOVERY - puppet last run on mw1184 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [12:39:19] RECOVERY - puppet last run on mw1182 is OK: OK: Puppet is currently enabled, last run 66 seconds ago with 0 failures [12:39:20] RECOVERY - puppet last run on mw1083 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [12:39:20] RECOVERY - puppet last run on mw1191 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [12:39:20] RECOVERY - puppet last run on mw1096 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [12:39:24] RECOVERY - puppet last run on tmh1002 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [12:39:32] looks like what broke is wrong anyways; that salt call should all be local, no auth to master needed [12:39:34] RECOVERY - puppet last run on mw1094 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [12:39:34] RECOVERY - puppet last run on mw1138 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [12:39:35] RECOVERY - puppet last run on elastic1016 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [12:39:35] RECOVERY - 
puppet last run on mw1035 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [12:39:35] I'll see about fixing that [12:39:44] RECOVERY - puppet last run on mw1169 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [12:39:56] RECOVERY - puppet last run on mw1013 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [12:39:56] RECOVERY - puppet last run on mw1127 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [12:40:04] RECOVERY - puppet last run on mw1136 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [12:40:16] RECOVERY - puppet last run on mw1036 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [12:40:35] RECOVERY - puppet last run on zirconium is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [12:40:35] RECOVERY - puppet last run on wtp1024 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [12:40:55] RECOVERY - puppet last run on mw1218 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [12:40:55] RECOVERY - puppet last run on wtp1021 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [12:40:55] RECOVERY - puppet last run on mw1124 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [12:40:56] RECOVERY - puppet last run on mw1196 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [12:41:04] RECOVERY - puppet last run on mw1130 is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures [12:41:04] RECOVERY - puppet last run on mw1132 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [12:41:15] RECOVERY - puppet last run on mw1040 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [12:41:15] RECOVERY - puppet last run on mw1147 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [12:41:15] RECOVERY - puppet last run on mw1062 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [12:41:35] RECOVERY - puppet last run on wtp1017 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [12:41:35] RECOVERY - puppet last run on mw1048 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [12:41:44] RECOVERY - puppet last run on vanadium is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [12:41:44] RECOVERY - puppet last run on mw1109 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [12:41:45] RECOVERY - puppet last run on mw1216 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [12:41:54] RECOVERY - puppet last run on mw1161 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [12:41:56] RECOVERY - puppet last run on mw1038 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [12:41:56] RECOVERY - puppet last run on mw1192 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [12:42:14] RECOVERY - puppet last run on mw1080 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [12:42:14] RECOVERY - puppet last run on mw1005 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [12:42:14] RECOVERY - puppet last run on mw1067 is OK: OK: Puppet is currently 
enabled, last run 28 seconds ago with 0 failures [12:42:17] RECOVERY - puppet last run on mw1031 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [12:42:18] RECOVERY - puppet last run on mw1134 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [12:42:18] RECOVERY - puppet last run on mw1197 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [12:42:18] RECOVERY - puppet last run on ocg1001 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [12:42:19] RECOVERY - puppet last run on mw1089 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [12:42:25] RECOVERY - puppet last run on wtp1019 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [12:42:25] RECOVERY - puppet last run on wtp1009 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [12:42:34] RECOVERY - puppet last run on mw1178 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [12:42:34] RECOVERY - puppet last run on mw1072 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [12:42:34] RECOVERY - puppet last run on logstash1003 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [12:42:44] RECOVERY - puppet last run on mw1028 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [12:42:44] RECOVERY - puppet last run on mw1115 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [12:42:45] RECOVERY - puppet last run on mw1045 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [12:42:45] RECOVERY - puppet last run on mw1041 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [12:43:05] RECOVERY - puppet last run on mw1200 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [12:43:15] RECOVERY - puppet last run on mw1059 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [12:43:15] RECOVERY - puppet last run on mw1140 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [12:43:16] RECOVERY - puppet last run on netmon1001 is OK: OK: Puppet is currently enabled, last run 64 seconds ago with 0 failures [12:43:16] RECOVERY - puppet last run on wtp1020 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [12:43:25] RECOVERY - puppet last run on elastic1004 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [12:43:25] RECOVERY - puppet last run on elastic1001 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [12:43:25] RECOVERY - puppet last run on logstash1001 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [12:43:25] RECOVERY - puppet last run on mw1141 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [12:43:26] RECOVERY - puppet last run on elastic1007 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [12:43:26] RECOVERY - puppet last run on mw1012 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [12:43:36] RECOVERY - puppet last run on mw1063 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [12:43:36] RECOVERY - puppet last run on mw1006 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [12:43:45] 
RECOVERY - puppet last run on mw1007 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [12:43:50] RECOVERY - puppet last run on mw1106 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [12:43:50] RECOVERY - puppet last run on mw1187 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [12:43:51] RECOVERY - puppet last run on mw1145 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [12:43:55] RECOVERY - puppet last run on wtp1006 is OK: OK: Puppet is currently enabled, last run 60 seconds ago with 0 failures [12:43:56] RECOVERY - puppet last run on mw1174 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [12:44:16] RECOVERY - puppet last run on searchidx1001 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [12:44:25] RECOVERY - puppet last run on sca1002 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [12:44:25] RECOVERY - puppet last run on mw1026 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [12:44:26] RECOVERY - puppet last run on snapshot1003 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [12:44:27] RECOVERY - puppet last run on mw1217 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [12:44:35] RECOVERY - puppet last run on mw1009 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [12:44:36] RECOVERY - puppet last run on elastic1008 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [12:44:36] RECOVERY - puppet last run on mw1082 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [12:44:45] RECOVERY - puppet last run on mw1069 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [12:44:55] RECOVERY - puppet last run on wtp1016 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [12:44:55] RECOVERY - puppet last run on mw1099 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [12:44:56] RECOVERY - puppet last run on elastic1012 is OK: OK: Puppet is currently enabled, last run 60 seconds ago with 0 failures [12:45:19] RECOVERY - puppet last run on mw1150 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [12:45:19] RECOVERY - puppet last run on mw1176 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [12:45:20] RECOVERY - puppet last run on mw1153 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [12:45:27] RECOVERY - puppet last run on mw1068 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [12:45:27] RECOVERY - puppet last run on mw1088 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [12:45:35] RECOVERY - puppet last run on mw1046 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [12:45:35] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: ocg_job_status 408346 msg (=400000 warning): ocg_render_job_queue 1010 msg (=500 critical) [12:45:45] RECOVERY - puppet last run on mw1003 is OK: OK: Puppet is currently enabled, last run 66 seconds ago with 0 failures [12:45:56] RECOVERY - puppet last run on mw1100 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [12:46:02] PROBLEM - OCG health on ocg1003 is CRITICAL: 
CRITICAL: ocg_job_status 408664 msg (=400000 warning): ocg_render_job_queue 1159 msg (=500 critical) [12:46:06] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: ocg_job_status 409066 msg (=400000 warning): ocg_render_job_queue 1303 msg (=500 critical) [12:54:21] (03PS1) 10ArielGlenn: trebuchet provider: grains.get should be local [puppet] - 10https://gerrit.wikimedia.org/r/167562 [12:56:25] (03PS2) 10ArielGlenn: trebuchet provider: grains.get should be local [puppet] - 10https://gerrit.wikimedia.org/r/167562 [12:58:52] _joe_: can you respond when you've the time on https://gerrit.wikimedia.org/r/#/c/166902/? :) [12:59:44] <_joe_> YuviPanda: will do [12:59:49] ty [13:05:19] (03CR) 10QChris: "No need to block this. This is the first part of this topic." [puppet] - 10https://gerrit.wikimedia.org/r/167550 (https://bugzilla.wikimedia.org/69667) (owner: 10QChris) [13:06:24] (03CR) 10QChris: [C: 04-1] "Wait for deployment of I657916fa272e1b7977497e6fece1af7e7c0f533c to see if it goes well." [puppet] - 10https://gerrit.wikimedia.org/r/167553 (https://bugzilla.wikimedia.org/69667) (owner: 10QChris) [13:07:46] PROBLEM - BGP status on cr2-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.193 [13:09:06] PROBLEM - Host text-lb.ulsfo.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:863:ed1a::1 [13:09:15] PROBLEM - Host bits-lb.ulsfo.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:863:ed1a::1:a [13:09:21] PROBLEM - Host bits-lb.ulsfo.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [13:09:40] PROBLEM - Host text-lb.ulsfo.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [13:10:02] PROBLEM - Host upload-lb.ulsfo.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:863:ed1a::2:b [13:10:03] wth [13:10:10] ulsfo down [13:10:13] ? [13:10:15] no it's not [13:10:22] PROBLEM - Host mobile-lb.ulsfo.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:863:ed1a::1:c [13:10:22] ulsfo-eqiad down [13:10:28] (probably :)) [13:10:34] RECOVERY - Host bits-lb.ulsfo.wikimedia.org is UP: PING WARNING - Packet loss = 28%, RTA = 76.99 ms [13:10:39] flap [13:11:03] RECOVERY - Host text-lb.ulsfo.wikimedia.org_ipv6 is UP: PING WARNING - Packet loss = 73%, RTA = 75.16 ms [13:11:10] PROBLEM - Varnish HTCP daemon on cp4009 is CRITICAL: Timeout while attempting connection [13:11:10] PROBLEM - Varnish traffic logger on cp4020 is CRITICAL: Timeout while attempting connection [13:11:10] PROBLEM - check configured eth on cp4018 is CRITICAL: Timeout while attempting connection [13:11:10] PROBLEM - DPKG on cp4008 is CRITICAL: Timeout while attempting connection [13:11:10] PROBLEM - RAID on cp4010 is CRITICAL: Timeout while attempting connection [13:11:47] icinga-wm: it's too early in the am for text spam :p [13:12:05] nope not even that [13:12:45] Oct 20 13:08:13 cr2-ulsfo /kernel: rnh_get_forwarding_nh: RNH type 0 unexpected [13:12:49] Oct 20 13:08:13 cr2-ulsfo mib2d[1412]: SNMP_TRAP_LINK_DOWN: ifIndex 555, ifAdminStatus up(1), ifOperStatus down(2), ifName pe-0/0/0.32769 [13:13:06] RECOVERY - Varnish HTTP upload-frontend on cp4013 is OK: HTTP OK: HTTP/1.1 200 OK - 308 bytes in 3.156 second response time [13:13:14] is that card broken again? 
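Regarding the trebuchet change submitted above (r/167562, "grains.get should be local"): it matches the earlier observation that the grains lookup needs no authentication to the salt master, since grains are stored on the minion itself. A minimal sketch of the difference, run on any minion; the grain name `deploy_target` is a hypothetical placeholder, not taken from the actual provider code:

```
# Default minion path: salt-call may still attempt to authenticate to the
# master before answering, even though the grain data is local.
salt-call grains.get deploy_target

# Masterless mode: read only the locally available grains, no master auth needed.
salt-call --local grains.get deploy_target
```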
[13:13:15] lots of that [13:13:16] on cr2-ulsfo [13:13:25] PROBLEM - Host cp4019 is DOWN: PING CRITICAL - Packet loss = 100% [13:13:25] PROBLEM - Host cp4013 is DOWN: PING CRITICAL - Packet loss = 100% [13:13:25] PROBLEM - Host cp4009 is DOWN: PING CRITICAL - Packet loss = 100% [13:13:25] PROBLEM - Host mobile-lb.ulsfo.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [13:13:38] PROBLEM - Host cp4002 is DOWN: PING CRITICAL - Packet loss = 100% [13:13:39] PROBLEM - Host cp4007 is DOWN: PING CRITICAL - Packet loss = 100% [13:13:39] PROBLEM - Host cp4020 is DOWN: PING CRITICAL - Packet loss = 100% [13:13:39] PROBLEM - Host cp4004 is DOWN: PING CRITICAL - Packet loss = 100% [13:13:39] PROBLEM - Host cp4012 is DOWN: PING CRITICAL - Packet loss = 100% [13:13:39] PROBLEM - Host cp4018 is DOWN: PING CRITICAL - Packet loss = 100% [13:13:39] PROBLEM - Host cp4015 is DOWN: PING CRITICAL - Packet loss = 100% [13:13:50] RECOVERY - Varnish HTCP daemon on cp4009 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd [13:13:50] RECOVERY - Varnish HTCP daemon on cp4007 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd [13:13:50] RECOVERY - puppet last run on cp4019 is OK: OK: Puppet is currently enabled, last run 1548 seconds ago with 0 failures [13:13:50] RECOVERY - Varnish traffic logger on cp4020 is OK: PROCS OK: 2 processes with command name varnishncsa [13:13:50] RECOVERY - Varnishkafka log producer on cp4013 is OK: PROCS OK: 1 process with command name varnishkafka [13:13:50] RECOVERY - DPKG on cp4010 is OK: All packages OK [13:13:50] RECOVERY - check if dhclient is running on cp4017 is OK: PROCS OK: 0 processes with command name dhclient [13:13:50] is this the same card as the problematic link before? [13:15:06] I don't see anything else wrong [13:15:09] hmm apparently we lost the other link 2 days ago as well? [13:15:12] PROBLEM - LVS HTTP IPv4 on bits-lb.ulsfo.wikimedia.org is CRITICAL: Connection timed out [13:15:17] yeah giglinx is down [13:15:20] PROBLEM - BGP status on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192 [13:15:42] PROBLEM - Host mobile-lb.ulsfo.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [13:15:47] (03CR) 10QChris: "Whoops .. linked wrong change :-(" [puppet] - 10https://gerrit.wikimedia.org/r/167551 (https://bugzilla.wikimedia.org/69667) (owner: 10QChris) [13:16:05] I take it back [13:16:17] giglinx is back, gtt has insane packet loss [13:16:27] er [13:16:27] RECOVERY - Host upload-lb.ulsfo.wikimedia.org_ipv6 is UP: PING WARNING - Packet loss = 93%, RTA = 73.88 ms [13:16:29] giglinx is down [13:16:42] Oct 18 10:25:28 cr2-ulsfo mib2d[1412]: SNMP_TRAP_LINK_DOWN: ifIndex 573, ifAdminStatus up(1), ifOperStatus down(2), ifName xe-1/2/0 [13:16:43] (03CR) 10QChris: "Whoops .. linked wrong change :-( We have to wait for" [puppet] - 10https://gerrit.wikimedia.org/r/167552 (https://bugzilla.wikimedia.org/69667) (owner: 10QChris) [13:16:45] that's giglinx [13:16:48] RECOVERY - Host text-lb.ulsfo.wikimedia.org is UP: PING WARNING - Packet loss = 86%, RTA = 74.12 ms [13:16:48] just reboot the card paravoid [13:16:53] which card? [13:16:59] fpc 1 card 2 [13:17:02] Hm.. 
this looks interesting [13:17:02] http://ganglia.wikimedia.org/latest/graph.php?r=year&z=large&h=searchidx1001.eqiad.wmnet&m=cpu_report&s=descending&mc=2&g=network_report&c=Search+eqiad [13:17:02] it's bad [13:17:18] PROBLEM - LVS HTTP IPv4 on upload-lb.ulsfo.wikimedia.org is CRITICAL: Connection timed out [13:17:52] PROBLEM - Varnish traffic logger on cp4017 is CRITICAL: Timeout while attempting connection [13:17:53] PROBLEM - puppet last run on lvs4002 is CRITICAL: Timeout while attempting connection [13:17:53] PROBLEM - puppet last run on lvs4001 is CRITICAL: Timeout while attempting connection [13:17:53] PROBLEM - Host cp4006 is DOWN: PING CRITICAL - Packet loss = 100% [13:18:03] PROBLEM - Host cp4013 is DOWN: PING CRITICAL - Packet loss = 100% [13:18:03] PROBLEM - Host cp4018 is DOWN: PING CRITICAL - Packet loss = 100% [13:18:03] PROBLEM - Host mobile-lb.ulsfo.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [13:18:21] PROBLEM - Host cp4014 is DOWN: CRITICAL - Plugin timed out after 15 seconds [13:18:21] PROBLEM - Host cp4016 is DOWN: CRITICAL - Plugin timed out after 15 seconds [13:18:21] PROBLEM - Host cp4005 is DOWN: CRITICAL - Plugin timed out after 15 seconds [13:18:21] PROBLEM - Host bits-lb.ulsfo.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:863:ed1a::1:a [13:19:04] PROBLEM - Packetloss_Average on erbium is CRITICAL: packet_loss_average CRITICAL: 13.2572885542 [13:19:16] PROBLEM - LVS HTTPS IPv4 on bits-lb.ulsfo.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:19:35] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 1 below the confidence bounds [13:19:36] PROBLEM - LVS HTTPS IPv4 on upload-lb.ulsfo.wikimedia.org is CRITICAL: Connection timed out [13:19:51] PROBLEM - Host cp4002 is DOWN: CRITICAL - Plugin timed out after 15 seconds [13:19:51] PROBLEM - Host cp4019 is DOWN: CRITICAL - Plugin timed out after 15 seconds [13:19:51] PROBLEM - Host lvs4001 is DOWN: CRITICAL - Plugin timed out after 15 seconds [13:19:51] PROBLEM - Host cp4009 is DOWN: CRITICAL - Plugin timed out after 15 seconds [13:19:51] PROBLEM - Host lvs4003 is DOWN: CRITICAL - Plugin timed out after 15 seconds [13:19:52] PROBLEM - Host cp4011 is DOWN: CRITICAL - Plugin timed out after 15 seconds [13:19:52] PROBLEM - Host lvs4004 is DOWN: CRITICAL - Plugin timed out after 15 seconds [13:19:52] PROBLEM - Host cp4010 is DOWN: CRITICAL - Plugin timed out after 15 seconds [13:19:53] PROBLEM - Host lvs4002 is DOWN: CRITICAL - Plugin timed out after 15 seconds [13:19:53] PROBLEM - Host cp4003 is DOWN: CRITICAL - Plugin timed out after 15 seconds [13:19:54] PROBLEM - Host cp4008 is DOWN: CRITICAL - Plugin timed out after 15 seconds [13:19:54] PROBLEM - Host bast4001 is DOWN: CRITICAL - Plugin timed out after 15 seconds [13:19:55] PROBLEM - Host cp4007 is DOWN: CRITICAL - Plugin timed out after 15 seconds [13:19:55] PROBLEM - Host cp4020 is DOWN: CRITICAL - Plugin timed out after 15 seconds [13:19:56] PROBLEM - Host cp4012 is DOWN: CRITICAL - Plugin timed out after 15 seconds [13:19:56] PROBLEM - Host cp4015 is DOWN: CRITICAL - Plugin timed out after 15 seconds [13:20:19] RECOVERY - Host cp4005 is UP: PING WARNING - Packet loss = 93%, RTA = 73.92 ms [13:20:19] RECOVERY - Host cp4019 is UP: PING WARNING - Packet loss = 93%, RTA = 73.80 ms [13:20:20] RECOVERY - Host cp4007 is UP: PING WARNING - Packet loss = 93%, RTA = 75.56 ms [13:20:20] RECOVERY - Host lvs4004 is UP: PING WARNING - Packet loss = 93%, RTA = 
73.82 ms [13:20:20] RECOVERY - Host cp4010 is UP: PING WARNING - Packet loss = 93%, RTA = 78.04 ms [13:20:28] RECOVERY - Host cp4003 is UP: PING WARNING - Packet loss = 93%, RTA = 73.83 ms [13:20:28] RECOVERY - Host bast4001 is UP: PING WARNING - Packet loss = 93%, RTA = 74.08 ms [13:20:28] RECOVERY - Host cp4015 is UP: PING WARNING - Packet loss = 93%, RTA = 73.80 ms [13:20:28] RECOVERY - Host lvs4002 is UP: PING WARNING - Packet loss = 93%, RTA = 74.30 ms [13:20:28] RECOVERY - Host cp4002 is UP: PING WARNING - Packet loss = 93%, RTA = 73.85 ms [13:20:29] RECOVERY - Host cp4011 is UP: PING WARNING - Packet loss = 93%, RTA = 76.09 ms [13:20:29] RECOVERY - Host cp4012 is UP: PING WARNING - Packet loss = 93%, RTA = 76.17 ms [13:20:30] RECOVERY - Host cp4008 is UP: PING WARNING - Packet loss = 93%, RTA = 74.99 ms [13:20:39] RECOVERY - Host cp4013 is UP: PING WARNING - Packet loss = 93%, RTA = 75.65 ms [13:20:39] RECOVERY - Host cp4018 is UP: PING WARNING - Packet loss = 93%, RTA = 73.87 ms [13:20:49] RECOVERY - LVS HTTPS IPv4 on bits-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 4054 bytes in 0.387 second response time [13:21:23] PROBLEM - LVS HTTPS IPv6 on upload-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: Connection timed out [13:22:22] PROBLEM - puppet last run on search1021 is CRITICAL: CRITICAL: Puppet has 2 failures [13:22:26] PROBLEM - Host text-lb.ulsfo.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [13:23:28] PROBLEM - Host cp4007 is DOWN: CRITICAL - Plugin timed out after 15 seconds [13:23:28] PROBLEM - Host cp4004 is DOWN: CRITICAL - Plugin timed out after 15 seconds [13:23:28] PROBLEM - Host lvs4004 is DOWN: CRITICAL - Plugin timed out after 15 seconds [13:23:28] PROBLEM - Host cp4001 is DOWN: CRITICAL - Plugin timed out after 15 seconds [13:23:28] PROBLEM - Host cp4015 is DOWN: CRITICAL - Plugin timed out after 15 seconds [13:23:29] PROBLEM - Host cp4011 is DOWN: CRITICAL - Plugin timed out after 15 seconds [13:23:31] PROBLEM - SSH on lvs4002 is CRITICAL: Connection timed out [13:23:31] PROBLEM - HTTPS_unified on cp4003 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6 [13:23:31] RECOVERY - Host mobile-lb.ulsfo.wikimedia.org is UP: PING WARNING - Packet loss = 73%, RTA = 79.23 ms [13:23:48] RECOVERY - Host cp4020 is UP: PING WARNING - Packet loss = 54%, RTA = 74.28 ms [13:23:48] RECOVERY - Host cp4009 is UP: PING WARNING - Packet loss = 50%, RTA = 74.30 ms [13:23:49] PROBLEM - SSH on cp4017 is CRITICAL: Connection timed out [13:23:54] so when will server back online? 
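The alert flood above is the kind of situation where affected hosts would normally be put into scheduled downtime (which, as noted earlier, still lets the eventual recovery notifications through). A minimal sketch of scheduling downtime via Icinga's external command file; the command-file path, host name, author and comment are illustrative assumptions only:

```
# Schedule 2 hours of fixed downtime for all services on cp4001.
# Format: [ts] SCHEDULE_HOST_SVC_DOWNTIME;host;start;end;fixed;trigger_id;duration;author;comment
now=$(date +%s)
printf '[%d] SCHEDULE_HOST_SVC_DOWNTIME;cp4001;%d;%d;1;0;7200;robh;ulsfo link maintenance\n' \
  "$now" "$now" "$((now + 7200))" > /var/lib/icinga/rw/icinga.cmd
```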
[13:23:58] PROBLEM - RAID on cp4003 is CRITICAL: Timeout while attempting connection [13:23:59] PROBLEM - check if salt-minion is running on lvs4002 is CRITICAL: Timeout while attempting connection [13:24:09] PROBLEM - Host cp4002 is DOWN: PING CRITICAL - Packet loss = 100% [13:24:18] PROBLEM - Host bast4001 is DOWN: PING CRITICAL - Packet loss = 100% [13:24:18] PROBLEM - Host mr1-ulsfo is DOWN: CRITICAL - Plugin timed out after 15 seconds [13:24:18] PROBLEM - Host cr2-ulsfo is DOWN: CRITICAL - Plugin timed out after 15 seconds [13:24:18] PROBLEM - Host cr1-ulsfo is DOWN: CRITICAL - Plugin timed out after 15 seconds [13:24:18] PROBLEM - Host bits-lb.ulsfo.wikimedia.org is DOWN: CRITICAL - Plugin timed out after 15 seconds [13:24:24] PROBLEM - Host upload-lb.ulsfo.wikimedia.org is DOWN: CRITICAL - Plugin timed out after 15 seconds [13:24:54] PROBLEM - DPKG on cp4003 is CRITICAL: Timeout while attempting connection [13:24:54] PROBLEM - Disk space on lvs4002 is CRITICAL: Timeout while attempting connection [13:24:54] PROBLEM - Varnishkafka log producer on cp4003 is CRITICAL: Timeout while attempting connection [13:24:55] PROBLEM - check if dhclient is running on cp4003 is CRITICAL: Timeout while attempting connection [13:24:55] PROBLEM - check configured eth on cp4003 is CRITICAL: Timeout while attempting connection [13:25:04] RECOVERY - Host cp4001 is UP: PING OK - Packet loss = 0%, RTA = 75.06 ms [13:25:04] RECOVERY - Host lvs4001 is UP: PING OK - Packet loss = 0%, RTA = 74.35 ms [13:25:04] RECOVERY - Host bast4001 is UP: PING OK - Packet loss = 0%, RTA = 74.36 ms [13:25:04] RECOVERY - Host cp4011 is UP: PING OK - Packet loss = 0%, RTA = 74.28 ms [13:25:04] RECOVERY - LVS HTTP IPv4 on bits-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 4047 bytes in 0.150 second response time [13:25:18] RECOVERY - SSH on lvs4002 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [13:25:18] RECOVERY - HTTPS_unified on cp4003 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 457 days) [13:25:19] RECOVERY - Host cp4015 is UP: PING WARNING - Packet loss = 86%, RTA = 73.85 ms [13:25:19] RECOVERY - Host cp4002 is UP: PING WARNING - Packet loss = 86%, RTA = 73.88 ms [13:25:19] RECOVERY - Host cp4004 is UP: PING WARNING - Packet loss = 86%, RTA = 73.81 ms [13:25:19] RECOVERY - Host cp4007 is UP: PING WARNING - Packet loss = 86%, RTA = 73.86 ms [13:25:28] RECOVERY - Host cp4016 is UP: PING WARNING - Packet loss = 80%, RTA = 73.84 ms [13:25:29] RECOVERY - Host cp4014 is UP: PING WARNING - Packet loss = 80%, RTA = 73.85 ms [13:25:29] RECOVERY - Host bits-lb.ulsfo.wikimedia.org is UP: PING WARNING - Packet loss = 80%, RTA = 73.78 ms [13:25:48] hm, uhh [13:25:51] PROBLEM - HTTPS_unified on cp4018 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6 [13:25:51] RECOVERY - Host cp4006 is UP: PING WARNING - Packet loss = 73%, RTA = 69.34 ms [13:26:02] RECOVERY - LVS HTTPS IPv6 on upload-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 700 bytes in 0.370 second response time [13:26:08] RECOVERY - check if dhclient is running on cp4011 is OK: PROCS OK: 0 processes with command name dhclient [13:26:08] RECOVERY - Varnish traffic logger on cp4017 is OK: PROCS OK: 2 processes with command name varnishncsa [13:26:08] RECOVERY - check if salt-minion is running on cp4014 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [13:26:08] RECOVERY - DPKG on bast4001 is OK: All 
packages OK [13:26:08] RECOVERY - Host lvs4003 is UP: PING WARNING - Packet loss = 44%, RTA = 72.97 ms [13:26:09] RECOVERY - Host lvs4004 is UP: PING WARNING - Packet loss = 44%, RTA = 82.04 ms [13:26:09] RECOVERY - Varnish HTTP mobile-frontend on cp4011 is OK: HTTP OK: HTTP/1.1 200 OK - 371 bytes in 0.146 second response time [13:26:10] RECOVERY - puppet last run on cp4015 is OK: OK: Puppet is currently enabled, last run 1693 seconds ago with 0 failures [13:26:10] RECOVERY - check if salt-minion is running on lvs4003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [13:26:11] RECOVERY - DPKG on cp4003 is OK: All packages OK [13:26:11] RECOVERY - LVS HTTPS IPv4 on upload-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 700 bytes in 0.373 second response time [13:26:26] RECOVERY - check configured eth on cp4001 is OK: NRPE: Unable to read output [13:26:27] RECOVERY - RAID on lvs4003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [13:26:27] RECOVERY - DPKG on cp4014 is OK: All packages OK [13:26:27] RECOVERY - Host upload-lb.ulsfo.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 69.29 ms [13:26:27] !log cr2-ulsfo: "request chassis mic {off,on}line fpc-slot 1 mic-slot 1" to reboot broken card [13:26:36] Logged the message, Master [13:26:50] RECOVERY - Disk space on lvs4002 is OK: DISK OK [13:26:50] RECOVERY - puppet last run on cp4002 is OK: OK: Puppet is currently enabled, last run 1929 seconds ago with 0 failures [13:26:51] RECOVERY - check configured eth on cp4011 is OK: NRPE: Unable to read output [13:26:51] RECOVERY - check if dhclient is running on cp4003 is OK: PROCS OK: 0 processes with command name dhclient [13:26:51] RECOVERY - Varnishkafka log producer on cp4003 is OK: PROCS OK: 1 process with command name varnishkafka [13:26:51] RECOVERY - Router interfaces on mr1-ulsfo is OK: OK: host 10.128.128.1, interfaces up: 35, down: 0, dormant: 0, excluded: 1, unused: 0 [13:26:51] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 63, down: 0, dormant: 0, excluded: 1, unused: 0 [13:26:52] RECOVERY - SSH on bast4001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [13:26:52] RECOVERY - Varnish HTTP upload-backend on cp4006 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.141 second response time [13:26:53] RECOVERY - SSH on cp4011 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [13:26:53] RECOVERY - check configured eth on cp4003 is OK: NRPE: Unable to read output [13:26:54] RECOVERY - puppet last run on cp4011 is OK: OK: Puppet is currently enabled, last run 1793 seconds ago with 0 failures [13:27:06] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 74, down: 0, dormant: 0, excluded: 0, unused: 0 [13:27:07] RECOVERY - Host bits-lb.ulsfo.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 71.40 ms [13:27:08] link problem to ulsfo? i haven't checked email yet.... [13:27:18] mark, yt? 
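The `{off,on}line` in the !log entry above is brace shorthand for two Junos operational commands issued back to back, taking the MIC offline and bringing it back up to force a reset of the broken card; expanded with the slot numbers as logged:

```
request chassis mic offline fpc-slot 1 mic-slot 1
request chassis mic online fpc-slot 1 mic-slot 1
```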
[13:27:18] fixed [13:27:20] oh ok [13:27:24] see log above [13:27:24] RECOVERY - Host text-lb.ulsfo.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 70.28 ms [13:27:25] sorry, just signed in [13:27:28] no worries :) [13:27:31] thanks for caring :) [13:27:31] RECOVERY - Host text-lb.ulsfo.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 69.21 ms [13:27:32] :) [13:27:37] RECOVERY - Host mr1-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 70.24 ms [13:27:37] RECOVERY - Host cr1-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 75.69 ms [13:27:37] PROBLEM - Packetloss_Average on oxygen is CRITICAL: packet_loss_average CRITICAL: 25.0862913793 [13:28:02] RECOVERY - BGP status on cr1-ulsfo is OK: OK: host 198.35.26.192, sessions up: 9, down: 0, shutdown: 0 [13:28:02] RECOVERY - Host cr2-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 71.14 ms [13:28:02] RECOVERY - Host mobile-lb.ulsfo.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 69.21 ms [13:28:23] RECOVERY - SSH on cp4017 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [13:28:24] RECOVERY - LVS HTTP IPv4 on upload-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 678 bytes in 0.145 second response time [13:28:50] RECOVERY - RAID on cp4003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [13:28:54] i guess we are calling RT duty 'Ops duty' now? [13:29:00] RECOVERY - check if salt-minion is running on lvs4002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [13:29:00] RECOVERY - HTTPS_unified on cp4018 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 457 days) [13:29:01] PROBLEM - puppet last run on cp4009 is CRITICAL: CRITICAL: puppet fail [13:29:40] PROBLEM - Packetloss_Average on analytics1026 is CRITICAL: packet_loss_average CRITICAL: 14.6411538136 [13:30:20] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: puppet fail [13:31:10] PROBLEM - puppet last run on lvs4004 is CRITICAL: CRITICAL: puppet fail [13:31:43] PROBLEM - puppet last run on cp4001 is CRITICAL: CRITICAL: Puppet has 2 failures [13:31:50] PROBLEM - puppet last run on cp4013 is CRITICAL: CRITICAL: puppet fail [13:32:00] RECOVERY - Packetloss_Average on erbium is OK: packet_loss_average OKAY: 2.36820388235 [13:33:50] PROBLEM - puppet last run on cp4002 is CRITICAL: CRITICAL: puppet fail [13:34:00] PROBLEM - puppet last run on cp4020 is CRITICAL: CRITICAL: puppet fail [13:35:00] RECOVERY - puppet last run on cp4009 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [13:35:01] RECOVERY - puppet last run on cp4012 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [13:35:12] RECOVERY - puppet last run on cp4002 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [13:35:21] RECOVERY - puppet last run on cp4020 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [13:37:00] RECOVERY - puppet last run on lvs4004 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [13:37:01] PROBLEM - puppet last run on cp4015 is CRITICAL: CRITICAL: puppet fail [13:37:43] ottomata: clinic duty [13:37:51] PROBLEM - puppet last run on elastic1003 is CRITICAL: CRITICAL: Puppet has 1 failures [13:38:02] well that's what we put in our meetings notes ... 
[13:38:05] /topic says 'Ops duty' [13:38:11] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [13:38:20] and topic sounds way better [13:38:26] sorry mark :P [13:39:01] RECOVERY - puppet last run on search1021 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [13:39:30] RECOVERY - puppet last run on cp4015 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [13:39:30] RECOVERY - puppet last run on lvs4001 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [13:39:31] PROBLEM - puppet last run on cp4007 is CRITICAL: CRITICAL: puppet fail [13:39:41] PROBLEM - puppet last run on cp4011 is CRITICAL: CRITICAL: puppet fail [13:40:51] RECOVERY - puppet last run on cp4013 is OK: OK: Puppet is currently enabled, last run 69 seconds ago with 0 failures [13:41:51] RECOVERY - puppet last run on cp4007 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [13:42:06] RobH will do clinic duty this week [13:42:20] RECOVERY - Packetloss_Average on analytics1026 is OK: packet_loss_average OKAY: 2.73679050847 [13:42:28] i haz the con [13:42:42] RECOVERY - puppet last run on lvs4002 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [13:43:37] * robh begins to furiously update the maint-announce queue [13:43:47] .... [13:43:59] no chance, faidon had it before you [13:44:00] holy shit who was on duty last week? the maint announce queue is ... under control. [13:44:19] man, that totally undoes my 'lowest hanging fruit to start duty' task! [13:44:21] ;D [13:44:32] you might actually have to do real work this week [13:44:57] hey usually RT is real work about 30% into the week (after the massive amounts of triage) [13:45:04] but indeed, seems far less triage this week [13:45:07] (strange) [13:46:30] RECOVERY - Packetloss_Average on oxygen is OK: packet_loss_average OKAY: 1.89986423729 [13:47:00] PROBLEM - puppet last run on db1022 is CRITICAL: CRITICAL: Puppet has 1 failures [13:48:51] wait i thought it was me this week? is it me next week? [13:48:54] checking... [13:49:21] RECOVERY - puppet last run on cp4001 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [13:49:24] Oh, npe, me next week [13:49:27] /topic [13:49:32] cool [13:49:52] RobH, I think I still don't know what to do with maint-announce stuff, I will look at it next week and ask you [13:50:26] yea its pretty easy just update the subjects to the maint windows and merge any related tickets [13:50:37] as the announces go to the queue, and it makes a bunch of tickets for the same maint [13:50:54] it is really easy, it was my default rant about ops duty that folks didnt do ;] [13:51:08] in the event of an outage, its nice to pull that up and see if anything is in a maint window at a glance. 
[13:51:18] ya I remember the rant...:) [13:51:20] last week I made a gcal for it [13:51:29] perhaps we can start tracking maint windows on it [13:51:35] it's pretty confusing otherwise [13:51:36] i think that sounds reasonable to me [13:51:48] check your google calendar, it's called "ops maintenance & contracts" or something like that [13:51:55] i put the telia outages/maint on it last week [13:51:55] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:51:59] cool, ill update the ops duty docs [13:52:02] as I couldn't make cheese of it otherwise [13:52:04] and reflect it [13:52:10] we should mention it in the meeting today [13:52:17] and adding to etherpad now ;] [13:52:39] hrmm, i dont see it [13:52:52] neither do I [13:52:56] sec [13:53:26] * robh plans to do a slight overhaul on the wording of the ops duty doc to eliminate the old phrasing and titles anyhow [13:53:37] it was on my 'i dont feel like doing this until its my job as ops duty person to do it' list [13:53:48] it's shared with everyone in wmf [13:54:05] not sure what else I need to do [13:54:07] hrmm, can you add me specifically to the share list ? [13:54:09] * mark doesn't really get calendars [13:54:12] can do [13:54:19] make me admin and I'll handle it all for you ;D [13:54:55] i think im still admin on the major staff calendars from initial implementation support ;P [13:54:59] try now [13:55:09] i have it now [13:55:22] RECOVERY - puppet last run on elastic1003 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [13:55:42] check oct 7 for example maint windows [13:55:45] heh, i can make changes to events, but not the share settings (just fyi) [13:55:54] you can now [13:56:01] PROBLEM - puppet last run on ms-be1011 is CRITICAL: CRITICAL: Puppet has 2 failures [13:56:02] so we should mention the affected circuit IDs and RT# [13:56:21] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 54273 bytes in 0.112 second response time [13:57:07] I like it [13:57:31] RECOVERY - puppet last run on cp4011 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [13:57:32] hrmm, on the calendar i still dont have the share settings tab, but thats not big deal [13:57:41] log out I guess [13:57:51] (its make changes to events and manage sharing in the settings for it) [13:57:56] lemme login to chrome and check [13:58:13] (03CR) 10QChris: [C: 031] gerrit: Remove duplicate mirrors [puppet] - 10https://gerrit.wikimedia.org/r/167162 (https://bugzilla.wikimedia.org/68054) (owner: 10Krinkle) [13:58:28] oh [13:58:30] yea [13:58:33] chrome shows it fine [13:58:38] i guess firefox has odd caching [13:58:41] mark: thanks! [13:59:37] * ottomata also doesn't really get calendars [14:01:53] ottomata: i imagine you saying that in a very 'things happen when they happen man' kind of way ;] [14:02:09] like dude lebowski [14:02:10] haha [14:02:21] it is true [14:04:51] RECOVERY - puppet last run on db1022 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [14:05:11] PROBLEM - puppet last run on amssq43 is CRITICAL: CRITICAL: Puppet has 1 failures [14:05:28] Ok, all the opsen under the root alias now have direct permissions on the calendar [14:05:36] and should see it listed in their calendar lists in google. 
[14:05:58] we should also list contract end dates on it [14:06:12] and set notification well in advance, a month at least [14:06:20] PROBLEM - puppet last run on analytics1040 is CRITICAL: CRITICAL: Puppet has 1 failures [14:08:21] PROBLEM - puppet last run on cp1056 is CRITICAL: CRITICAL: Puppet has 2 failures [14:08:21] PROBLEM - puppet last run on cp1061 is CRITICAL: CRITICAL: Puppet has 2 failures [14:08:36] i'm happy to do so, but i really don't know all the existing contracts. (though if you do and provide me a list, I can track them down with accounting and add them this week) [14:08:51] well it's not highest prio atm [14:09:00] yea... i'll just make an rt ticket so we dont forget it =] [14:09:05] ok [14:10:06] (03CR) 10BBlack: [C: 031] Handle CORS preflight requests for upload in VCL [puppet] - 10https://gerrit.wikimedia.org/r/167542 (https://bugzilla.wikimedia.org/55631) (owner: 10Faidon Liambotis) [14:11:51] PROBLEM - puppet last run on cp4019 is CRITICAL: CRITICAL: Puppet has 2 failures [14:12:02] PROBLEM - puppet last run on cp4005 is CRITICAL: CRITICAL: Puppet has 1 failures [14:13:52] RECOVERY - puppet last run on ms-be1011 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [14:14:50] PROBLEM - puppet last run on lvs1001 is CRITICAL: CRITICAL: Puppet has 2 failures [14:18:24] 503 again [14:19:52] PROBLEM - puppet last run on tmh1001 is CRITICAL: CRITICAL: Puppet has 1 failures [14:22:10] RECOVERY - puppet last run on amssq43 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [14:22:13] those are probably caused by puppet trying to get the apt cache lock at the same time as I want it for salt minion upgrade; they clear themselves up on the next run [14:22:19] *probably [14:22:56] <_joe_> apergos: ok [14:23:10] RECOVERY - puppet last run on analytics1040 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [14:23:50] PROBLEM - puppet last run on mw1005 is CRITICAL: CRITICAL: Puppet has 2 failures [14:26:20] RECOVERY - puppet last run on cp1061 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [14:26:26] RECOVERY - puppet last run on cp1056 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [14:26:31] !log cr1-ulsfo: deactivating ospf/ospf3 on GTT ulsfo-eqiad link [14:26:38] Logged the message, Master [14:28:47] (03CR) 10Chad: More elasticsearch tools (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/164270 (owner: 10Chad) [14:28:50] RECOVERY - puppet last run on cp4019 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [14:30:00] RECOVERY - puppet last run on cp4005 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [14:31:10] (03CR) 10Chad: More elasticsearch tools (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/164270 (owner: 10Chad) [14:31:40] RECOVERY - puppet last run on lvs1001 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [14:36:32] (03CR) 10Chad: "Going to do it today in beta. Wanted config live first so it'll come up when I restart each node so I don't have to do it twice. 
Should be" [puppet] - 10https://gerrit.wikimedia.org/r/166690 (owner: 10Chad) [14:37:10] RECOVERY - puppet last run on tmh1001 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [14:39:50] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [14:42:30] RECOVERY - puppet last run on mw1005 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [14:47:16] Who wants to SWAT? [14:51:06] many, anomie, ^demon|away, any of you interested in SWAT? [14:51:10] marktraceur: I can SWAT, unless you want to [14:51:13] (manybubbles apparently doesn't exist today) [14:51:25] anomie: I got it all last week. You go. :) [14:54:08] James_F: Ping for SWAT in about 6 minutes [14:54:50] anomie: Hey. [14:54:53] (03CR) 10Filippo Giunchedi: "+ two nitpicks :P" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/166690 (owner: 10Chad) [14:55:12] ^demon|away: so ^ can be merged today, possibly after beta? [14:55:49] <^demon|away> After? Hm? That config's for beta. [14:56:15] <^demon|away> I just noticed my commit message fails to be clear on this. [15:00:04] manybubbles, anomie, ^d, marktraceur, James_F: Respected human, time to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141020T1500). Please do the needful. [15:00:17] * anomie starts SWAT [15:00:50] PROBLEM - puppet last run on mw1075 is CRITICAL: CRITICAL: Puppet has 1 failures [15:00:57] PROBLEM - puppet last run on mw1070 is CRITICAL: CRITICAL: Puppet has 1 failures [15:02:39] ^demon|away: ack, if you want to fix that and the align while we're at it I'll merge [15:03:35] (03PS1) 10Matanya: exim: remove ref to tampa [puppet] - 10https://gerrit.wikimedia.org/r/167582 [15:06:18] <^demon|away> godog: Yeah, sec. [15:07:47] (03PS4) 10Chad: Configure Beta Elasticsearch for statsd [puppet] - 10https://gerrit.wikimedia.org/r/166690 [15:08:08] !log anomie Synchronized php-1.25wmf4/extensions/VisualEditor/: SWAT: VE bug fixes [[gerrit:167577]] (duration: 00m 10s) [15:08:10] James_F: ^ test please [15:08:17] Logged the message, Master [15:08:57] * James_F waits for debug=true to work. [15:10:31] PROBLEM - puppet last run on mw1149 is CRITICAL: CRITICAL: Puppet has 1 failures [15:12:02] PROBLEM - puppet last run on mw1180 is CRITICAL: CRITICAL: Puppet has 2 failures [15:12:12] PROBLEM - puppet last run on mw1165 is CRITICAL: CRITICAL: Puppet has 2 failures [15:12:21] PROBLEM - puppet last run on mw1183 is CRITICAL: CRITICAL: Puppet has 2 failures [15:12:32] PROBLEM - puppet last run on mw1168 is CRITICAL: CRITICAL: Puppet has 2 failures [15:12:46] anomie: Good. 
[15:13:05] James_F: ok, next patch [15:15:28] RECOVERY - puppet last run on mw1180 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [15:15:42] PROBLEM - puppet last run on mw1201 is CRITICAL: CRITICAL: Puppet has 2 failures [15:17:08] (03PS5) 10Filippo Giunchedi: Configure Beta Elasticsearch for statsd [puppet] - 10https://gerrit.wikimedia.org/r/166690 (owner: 10Chad) [15:17:15] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Configure Beta Elasticsearch for statsd [puppet] - 10https://gerrit.wikimedia.org/r/166690 (owner: 10Chad) [15:18:32] RECOVERY - puppet last run on mw1075 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [15:18:41] RECOVERY - puppet last run on mw1070 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [15:19:47] !log anomie Synchronized php-1.25wmf3/extensions/VisualEditor/: SWAT: VE bug fixes [[gerrit:167344]] (duration: 00m 10s) [15:19:52] Logged the message, Master [15:20:00] !log anomie Synchronized php-1.25wmf3/resources/lib/oojs-ui/: SWAT: OOJS-UI bug fixes [[gerrit:167344]] (duration: 00m 12s) [15:20:01] James_F: ^^^ ^ test please [15:20:05] Logged the message, Master [15:20:08] <_joe_> !log disabling puppet on mw1189 to do some hhvm testing [15:20:13] Logged the message, Master [15:20:29] <_joe_> p [15:21:59] anomie: Doing. [15:26:09] anomie: Yes, all looks well. [15:26:27] * anomie is done with SWAT, at least for the moment [15:26:31] Yay. [15:26:38] (03PS1) 10Yuvipanda: Add .gitreview [software/shinkengen] - 10https://gerrit.wikimedia.org/r/167585 [15:28:12] RECOVERY - puppet last run on mw1149 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [15:29:12] RECOVERY - puppet last run on mw1168 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [15:29:14] * anomie reopens SWAT for a SecurePoll backport [15:30:02] RECOVERY - puppet last run on mw1183 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [15:30:04] RECOVERY - puppet last run on mw1165 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [15:30:45] (03CR) 10Yuvipanda: [C: 032] Add .gitreview [software/shinkengen] - 10https://gerrit.wikimedia.org/r/167585 (owner: 10Yuvipanda) [15:32:17] (03CR) 10Yuvipanda: [V: 032] Add .gitreview [software/shinkengen] - 10https://gerrit.wikimedia.org/r/167585 (owner: 10Yuvipanda) [15:33:31] RECOVERY - puppet last run on mw1201 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [15:38:55] !log anomie Synchronized php-1.25wmf4/extensions/SecurePoll/: SWAT: Update SecurePoll for testing on testwiki [[gerrit:167586]] (duration: 00m 10s) [15:38:57] anomie: ^ Test please [15:39:03] Logged the message, Master [15:39:27] bah, i18n [15:39:32] !log anomie Started scap: (no message) [15:39:38] Logged the message, Master [15:40:47] (03PS1) 10Yuvipanda: shinken: Puppetize templates.cfg [puppet] - 10https://gerrit.wikimedia.org/r/167591 [15:41:43] <_joe_> YuviPanda: what do you need me to merge with you? 
[15:42:13] _joe_: https://gerrit.wikimedia.org/r/#/c/166902/ [15:43:07] (03PS1) 10Anomie: Enable wgSecurePollUseNamespace for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167592 [15:45:34] (03PS1) 10Alexandros Kosiaris: Move upstart script to the correct place [debs/contenttranslation/apertium-apy] - 10https://gerrit.wikimedia.org/r/167594 [15:46:05] (03PS2) 10Yuvipanda: shinken: Puppetize templates.cfg [puppet] - 10https://gerrit.wikimedia.org/r/167591 [15:46:20] (03CR) 10Reedy: [C: 032] Enable wgSecurePollUseNamespace for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167592 (owner: 10Anomie) [15:46:33] (03Merged) 10jenkins-bot: Enable wgSecurePollUseNamespace for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167592 (owner: 10Anomie) [15:49:09] (03PS2) 10Alexandros Kosiaris: Move upstart script to the correct place [debs/contenttranslation/apertium-apy] - 10https://gerrit.wikimedia.org/r/167594 [15:50:51] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected [15:50:59] (03PS1) 10Yuvipanda: [WIP] Add hosts.cfg generator [software/shinkengen] - 10https://gerrit.wikimedia.org/r/167595 [15:53:22] RECOVERY - OCG health on ocg1002 is OK: OK: ocg_job_status 397319 msg: ocg_render_job_queue 0 msg [15:53:37] (03PS3) 10Alexandros Kosiaris: Move upstart script to the correct place [debs/contenttranslation/apertium-apy] - 10https://gerrit.wikimedia.org/r/167594 [15:53:51] (03PS3) 10Yuvipanda: shinken: Puppetize templates.cfg [puppet] - 10https://gerrit.wikimedia.org/r/167591 [15:54:01] RECOVERY - OCG health on ocg1003 is OK: OK: ocg_job_status 383357 msg: ocg_render_job_queue 0 msg [15:54:14] RECOVERY - OCG health on ocg1001 is OK: OK: ocg_job_status 378767 msg: ocg_render_job_queue 0 msg [15:54:36] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Move upstart script to the correct place [debs/contenttranslation/apertium-apy] - 10https://gerrit.wikimedia.org/r/167594 (owner: 10Alexandros Kosiaris) [15:57:01] (03CR) 10Andrew Bogott: [C: 032] shinken: Puppetize templates.cfg [puppet] - 10https://gerrit.wikimedia.org/r/167591 (owner: 10Yuvipanda) [15:57:02] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [15:57:37] (03PS1) 10Yuvipanda: shinken: Set default contact to guest [puppet] - 10https://gerrit.wikimedia.org/r/167597 [15:57:55] !log anomie Finished scap: (no message) (duration: 18m 22s) [15:58:00] Logged the message, Master [15:58:07] anomie: i18n works now [15:58:36] !log anomie Synchronized wmf-config: SWAT: Enable wgSecurePollUseNamespace for testwiki [[gerrit:167592]] (duration: 00m 10s) [15:58:41] Logged the message, Master [15:58:42] anomie: ^ test please [15:59:00] cute [15:59:22] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [16:00:09] anomie: Works [16:00:13] * anomie is done with SWAT again [16:02:07] (03PS1) 10Alexandros Kosiaris: Add sqlite3 add a build dependency [debs/contenttranslation/apertium-apy] - 10https://gerrit.wikimedia.org/r/167598 [16:03:17] anomie: :) :) [16:03:27] I wonder if I can create that wiki now [16:03:31] They're getting a bit impatient... [16:04:01] Looks like it [16:06:35] (03PS1) 10BBlack: depool ulsfo from DNS [dns] - 10https://gerrit.wikimedia.org/r/167599 [16:06:58] (03CR) 10BBlack: [C: 04-1] "Do not randomly merge this!" 
[dns] - 10https://gerrit.wikimedia.org/r/167599 (owner: 10BBlack) [16:07:05] heh [16:07:11] bblack: no one should randomly merge anything! [16:07:14] (just sayin) [16:07:17] :) [16:07:27] what ? [16:07:36] and how things would work then ? [16:07:43] (03PS1) 10Yuvipanda: shinken: Set up config for default host ping check [puppet] - 10https://gerrit.wikimedia.org/r/167600 [16:07:47] randomness is the key to order [16:07:50] :P [16:07:56] We need a chaos monkey to randomly merge stuff [16:08:18] Reedy: we call those 'devs that we give root' [16:08:29] see RoanKattouw_away [16:08:33] ahaha... [16:08:33] ;D [16:08:50] (in roan's defense he just recently said he should lose said root) [16:09:09] ....and i guess as ops duty person this week i should ensure that happened (or happens) [16:09:28] How about a cron that, every day, has a 1/1000 chance of just merging every single pending gerrit patch? That would motivate speedy code review. [16:09:45] man, shinken restart is also slow when it has 40 hosts to parse [16:09:56] andrewbogott: or at least encourage people to not leave half finished patches in gerrit? :) [16:10:09] robh: technically there's no ensuring unless we wipe the entire prod infrastructure. Once you're root nobody can gaurantee you're not indefinitely-root through some hack or other :) [16:10:18] correct [16:10:27] but still, polish the brass on the titanic and all that. [16:10:31] ;] [16:10:48] after all, brion doesnt have root (that we know of ;) [16:10:59] but if we ever disparage the elegant language of esperanto.... [16:11:11] (aww, he isnt in here to take the bait) [16:11:13] what's esperanto? [16:11:20] :) [16:12:07] like klingon but for renaissance nerds. [16:12:20] (i think this is the best summary of it i have ever written) [16:12:34] <_joe_> LOL [16:12:52] <_joe_> robh: this is class-A trolling [16:12:57] I am pretty proud of it. [16:13:00] <_joe_> I'll reuse it [16:13:09] Please do! [16:13:36] <_joe_> telling people that know Esperanto something like that is surely going to fire them up [16:16:00] the backstory of brion and esperanto is he got started with the movement on esperanto wikipedia [16:16:12] PROBLEM - puppet last run on fluorine is CRITICAL: CRITICAL: Puppet has 1 failures [16:16:25] (though i didnt see any esperanto folks at last wikimania, it was pretty busy) [16:16:27] (03CR) 10Andrew Bogott: [C: 032] shinken: Set default contact to guest [puppet] - 10https://gerrit.wikimedia.org/r/167597 (owner: 10Yuvipanda) [16:16:39] (03CR) 10Andrew Bogott: [C: 032] shinken: Set up config for default host ping check [puppet] - 10https://gerrit.wikimedia.org/r/167600 (owner: 10Yuvipanda) [16:17:50] nine times out of ten when I log in to palladium to merge, my fingers type 'sudo -s puppet agent -tv' despite my willing them to type 'sudo -s puppet merge' It really slows down my workflow. [16:18:21] (03CR) 10Ottomata: "Looking good! Could you add some big ol warning/caveat documention in the script about why this has to be done? And about intentions to " [puppet] - 10https://gerrit.wikimedia.org/r/167044 (owner: 10Gage) [16:18:39] andrewbogott: ditch the -s btw. No need for it [16:18:46] (03PS1) 10Alexandros Kosiaris: Simplify apertium module [puppet] - 10https://gerrit.wikimedia.org/r/167602 [16:18:49] see ? one more thing to slow you down [16:19:14] excellent! 
Over a year that'll save me a cumulative 2 seconds :) [16:19:25] :-) [16:19:59] (03CR) 10Alexandros Kosiaris: [C: 032] Simplify apertium module [puppet] - 10https://gerrit.wikimedia.org/r/167602 (owner: 10Alexandros Kosiaris) [16:21:28] gwicke: hiyyaaa [16:25:36] (03PS4) 10Chad: Decom deployment-elastic01 from beta [puppet] - 10https://gerrit.wikimedia.org/r/167010 [16:26:01] * _joe_ away [16:26:16] folks using this repo might want to check on its status, seems some whines about it on my apt-updates: http://ppa.launchpad.net/chris-lea/node.js/ubuntu/dists/precise/ [16:30:23] PROBLEM - DPKG on tungsten is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:30:38] ottomata: good morning [16:31:04] hiya! [16:31:13] (03PS4) 10Reedy: Create Oriya Wikisource (orwikisource) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/166186 (https://bugzilla.wikimedia.org/71875) (owner: 10Glaisher) [16:31:39] so, gwicke, q, have you gotten cassandra with client encryption to work before? [16:31:56] i'm trying it, and i think i'm doing it right, but i'm having trouble making cqlsh and cassandra-cli connect [16:32:09] ottomata: I have not yet used client encryption [16:32:12] PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Puppet has 1 failures [16:32:13] aye hm, ok [16:32:25] gwicke: why do we need encryption? [16:32:29] what clients will be connecting? [16:32:33] do you want client? or just server to server? [16:33:10] ottomata: we could start without encryption, as we won't do cross-DC replication initially either [16:33:32] (03CR) 10Manybubbles: [C: 031] Decom deployment-elastic01 from beta [puppet] - 10https://gerrit.wikimedia.org/r/167010 (owner: 10Chad) [16:33:34] RECOVERY - DPKG on tungsten is OK: All packages OK [16:33:37] aye, i need to check for sure, but I think the internode encryption is working [16:33:39] I'm not sure if we now do ipsec for that traffic, but if we don't then we'd definitely want encryption for cross-dc requests [16:33:46] aye [16:34:06] ok, other q, as for auth [16:34:17] i can very easily puppetize enabling authentication [16:34:20] and authorization [16:34:22] the client normally tries hard to only speak to local nodes, but I think that once we set up some encryption we should not have much more work to use it everywhere [16:34:23] RECOVERY - puppet last run on fluorine is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [16:34:54] we definitely need password auth [16:35:03] that's just a single line [16:35:16] the passwords are stored in cassandra itself, similar to MySQL [16:35:16] aye, e nabling is easy [16:35:19] ya [16:35:24] do you want me to puppetize that part? [16:35:29] setting the users and pws? [16:35:31] yes, that would be great [16:35:37] rats, ok :p :) [16:36:01] so, i'm not going to try to puppetize a generic define for this, but I can puppetize at least the 'cassandra' user [16:36:01] user and pws.. for restbase only a single user will be needed [16:36:06] hm [16:36:08] ok [16:36:11] i guess we can do that. [16:36:22] I'm not sure that doing that with puppet makes much sense [16:36:33] ok good, i dont' think so either, i mean [16:36:33] as it's a one-time only thing *per cluster* rather than per node [16:36:36] aye [16:36:39] and it is for the app [16:36:44] *nod* [16:36:49] Christian75: hey [16:36:50] why no cross-dc replication initially/ [16:36:50] er [16:36:51] ? 
[16:36:51] chasemp: hey :) [16:37:04] howdy [16:37:07] mark: remember, we are starting with three test nodes only [16:37:27] howdy :) [16:37:27] i think we should involve codfw soon though [16:37:30] gwicke: i think I also have to do something like this: [16:37:31] https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/analytics/kafkatee.pp#L120 [16:37:34] oops [16:37:36] mark: I'm more than happy to add remote replicas soon after we have that up [16:37:36] not that [16:37:37] the sooner the better [16:37:38] http://www.datastax.com/documentation/cql/3.1/cql/cql_using/update_ks_rf_t.html [16:37:41] ok [16:38:03] chasemp: not sure if you're aware of it, we get bounces from phabricator [16:38:16] Date: Mon, 20 Oct 2014 15:59:57 +0000 [16:38:16] From: noreply@phabricator.wikimedia.org [16:38:19] To: Mailer-Daemon@polonium.wikimedia.org [16:38:22] Subject: Error Processing Mail (No Receivers) [16:38:22] mark: we are already settings things up for it, with network topology aware replication etc [16:38:27] ok [16:38:52] I did see them not sure what is the right thing to do, it's ppl replying to noreply or putting in unreachable emails from what I've seen [16:39:19] mark: should I create a codfw hw request? [16:39:33] you might as well, we'll see when we can honor it [16:39:36] I was going to wait until we have some more perf data [16:39:36] you are meant to copy out a url to your browser for email verification, ppl seem to be replying (some of them) and it's get bounced back to them which seems to end up going to mailer-daemon [16:39:45] but if you'd like to be really quick then we could guess for now [16:39:47] gwicke: just put it in and note that [16:39:54] we can probably wait [16:39:59] okay [16:40:00] (03PS5) 10Reedy: Create Oriya Wikisource (orwikisource) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/166186 (https://bugzilla.wikimedia.org/71875) (owner: 10Glaisher) [16:40:00] but then it's on our radar, that's not bad [16:40:05] (03CR) 10Reedy: [C: 032] Create Oriya Wikisource (orwikisource) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/166186 (https://bugzilla.wikimedia.org/71875) (owner: 10Glaisher) [16:40:21] mark: *nod* [16:40:33] (03Merged) 10jenkins-bot: Create Oriya Wikisource (orwikisource) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/166186 (https://bugzilla.wikimedia.org/71875) (owner: 10Glaisher) [16:41:04] ottomata: yes, we'll have to automate this process [16:41:16] chasemp: is there any reason for noreply@ to not be blackholed/discarded? [16:41:19] that's one of the issues [16:41:34] that's app-level though; replication factors depend on how valuable data is [16:41:36] these are double bounces [16:42:07] in theory no, yeah that should just be swallowed up, but let me make sure noreply@ isn't used anywhere inappropriately [16:42:13] I'll make a task to do that now [16:42:19] ottomata: in restbase the idea is to set this per bucket, which models use cases [16:42:21] gwicke: well, according to that, if we are going to use cassandra auth, we should increase replciation factor of the system_auth keyspace [16:42:36] (which...I don't actually see in the install I have, i have two other keyspaces that might be relevant.) [16:42:40] ah, yes; agreed for the system_auth keyspace [16:42:58] do you know how to show the current replication strategy for a keyspace? 
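For reference on the noreply@ bounce loop being discussed: the "blackholed/discarded" option chasemp mentions would, in exim terms, just mean pointing the alias at exim's discard target instead of chaining noreply -> general -> root. This is a minimal sketch only — the alias file path is a placeholder, and it assumes nothing legitimate ever needs to reach noreply@, which is exactly what still has to be verified:

    # hypothetical entry in the aliases file for phabricator.wikimedia.org
    # exim's redirect router treats :blackhole: as "accept, then silently discard"
    noreply: :blackhole: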
[16:43:05] ottomata: we generally need to export the topology info per cluster somehow to the app [16:43:12] ah [16:43:13] hm [16:43:17] describe keyspace seems to show it, i think [16:43:19] so what happens is, phab sends an email with from & envelope-from noreply@ to: ; bounces, so exim mails noreply@ with the bounce [16:43:28] !log all precise hosts salt updated to 2014.1.11, this includes tin (deployment) and virt1000 (salt master for labs). Not updated: virt1006 (inaccessible) [16:43:33] ottomata: describe keyspace IIRC [16:43:36] Logged the message, Master [16:43:45] !log reseating pem3 cr2-eqiad [16:43:50] Logged the message, Master [16:44:04] noreply@ gets aliased to general@, phabricator is unable to parse the bounce as valid email and bounces again, to mailer-daemon@polonium [16:44:09] ok, gwicke, i see a 'system' keyspace [16:44:11] and that's aliased to root@ and we get it [16:44:14] !log reedy Synchronized database lists: orwikisource (duration: 00m 13s) [16:44:16] i assume that's what they mean, not sure [16:44:19] Logged the message, Master [16:44:26] either phabricator needs to learn how to parse bounces, or these mails should get discarded [16:44:43] !log reedy Synchronized wmf-config/: (no message) (duration: 00m 15s) [16:44:48] Logged the message, Master [16:45:11] but i can't modify it... [16:45:12] hm [16:45:36] ottomata: did you start with pw auth enabled? [16:45:44] cajoel: ping [16:46:06] paravoid: understood, only issue is now I see noreply@phabricator.wikimedia.org being used for messages that _can_ be replied to [16:46:16] uhhm... :) [16:46:18] so I need to track down where that's sourced before blackholing the whole thing [16:46:26] hmm, ha, maybe not yet gwicke! [16:47:07] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: orwikisource [16:47:14] Logged the message, Master [16:47:26] (03PS11) 10KartikMistry: Beta: Update cxserver to use Apertium service [puppet] - 10https://gerrit.wikimedia.org/r/157787 [16:48:00] ottomata: after that, cqlsh -u cassandra -p cassandra [16:48:20] AHH, got it [16:48:21] thank you. 
[16:48:29] !log reedy Synchronized wmf-config/interwiki.cdb: Updating interwiki cache (duration: 00m 14s) [16:48:34] Logged the message, Master [16:48:37] paravoid: I'm sure your correct but I need some time to dig in and probably to get more advice once I figure out why that is, as clearly bounces are handled poorly overall [16:48:45] (03PS1) 10Reedy: Add orwikisource to interwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167610 [16:48:56] (03CR) 10Reedy: [C: 032] Add orwikisource to interwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167610 (owner: 10Reedy) [16:49:02] ottomata: http://www.datastax.com/documentation/cassandra/2.0/cassandra/security/security_config_native_authenticate_t.html [16:49:02] (03Merged) 10jenkins-bot: Add orwikisource to interwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167610 (owner: 10Reedy) [16:49:26] aude: About/ [16:50:09] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [16:50:25] (03CR) 10KartikMistry: [C: 031] Add sqlite3 add a build dependency [debs/contenttranslation/apertium-apy] - 10https://gerrit.wikimedia.org/r/167598 (owner: 10Alexandros Kosiaris) [16:51:57] (03CR) 10Alexandros Kosiaris: [C: 032] exim: remove ref to tampa [puppet] - 10https://gerrit.wikimedia.org/r/167582 (owner: 10Matanya) [16:57:04] paravoid: pong [16:57:38] cajoel: we're doing some unplanned maintenance @ ulsfo; jgage is going there today [16:57:48] need a hand? [16:57:53] I could get there. [16:57:56] don't think so [16:58:06] before 11am meeting? [16:58:07] no thanks, but bgp session to the offfice wil go down [16:58:11] ah [16:58:16] it shouldn't [16:58:16] going right after meeting [16:58:22] but you should lose the eqiad routes [16:58:27] i.e. wikipedias and such [16:58:39] transit will work fine [16:58:41] ok, as long as normnal transit stays health, just fine [16:58:44] check [16:58:48] thx [16:58:55] ulsfo will just become an island as far as WMF prod is concerned [16:59:07] this is how it should go, but better be aware of it :) [16:59:12] long term situation or short term? [16:59:20] very short term [16:59:21] short, just swapping cards [16:59:23] kk [17:00:58] if they can afford it [17:01:01] oop [17:02:13] cajoel: also, VoIP is broken [17:02:20] that's no fun [17:02:22] did you change IPs? [17:02:26] Connection Information (c): IN IP4 192.168.39.76 [17:02:50] that's the RTP endpoint that asterisk is sending me [17:02:54] (in the SDP) [17:03:01] voip.corp.wikimedia.org has address 192.168.39.76 [17:03:06] hah [17:03:23] let me check the asterisk config [17:03:33] SIP works, I just get no audio [17:04:03] no idea for how long it has been broken though [17:04:47] if you're NAT'ing asterisk, it's going to be a PITA [17:05:08] either you need to tell asterisk that and have it learn its own public IP when appropriate etc. [17:05:11] or you have to use a nat helper [17:05:15] on your router [17:05:31] gwicke: i was starting to puppetize that [17:05:32] yep -- gimmie a minute -- lots of people comign in today [17:05:36] the system_auth replication strategy change [17:05:41] nf_nat_sip etc. 
[17:05:44] but i think it might be kind of difficult [17:05:48] no worries, it's no big deal [17:05:56] i'm trying to do so based on a list of datacenters given to the class [17:06:05] since NetworkTopologyStrategy needs that [17:06:15] cajoel: I can mail techsupport@ if you prefer [17:06:31] yes please -- so I don't drop it [17:06:33] so, i tested locally, and it seems if the replication can't be fulfilled [17:06:42] cqlsh can't connect and authenticate [17:06:50] AuthenticationException(why='org.apache.cassandra.exceptions.UnavailableException: Cannot achieve consistency level QUORUM') [17:07:10] i tried to do it with a dc for which there were not yet any nodes [17:07:42] ottomata: makes sense [17:07:53] hm, so, hm [17:07:56] after you up the replication factor, you need to have the nodes to actually achieve the quorum [17:08:00] yes [17:08:05] should I make puppet manage this then? [17:08:06] (03CR) 10Dzahn: [C: 032] iegreview: Enable rewrite engine in Location context [puppet] - 10https://gerrit.wikimedia.org/r/167394 (https://bugzilla.wikimedia.org/72201) (owner: 10BryanDavis) [17:08:08] seems a little dangerous [17:08:26] i can make puppet manage the cassandra password [17:08:27] i think [17:08:35] ottomata: I think it's better to separate the per-cluster stuff from the per-node stuff [17:08:37] but if i change the replication settings, things could be left in a nasty state [17:08:46] well, cassandra user will be per-cluster as well, right? [17:08:58] ottomata: yes [17:09:20] do you plan to update the password from puppet as well? [17:09:21] ok, so, i will not puppetize that stuff then? I will just puppetize enabling authentication and authorization [17:09:25] i could [17:09:33] we could store it in the private puppet repo [17:09:39] jgage: GTT has worked around the issue [17:09:43] it doesn't necessarily make sense to do that for every node [17:09:59] I think I'll reenable OSPF [17:10:06] right, it'd be hacky; an exec that checks that the user is correct, but, meh [17:10:07] dunno [17:10:14] ha, actually, this is the super user [17:10:22] if it isn't correct it won't be able to do anything anyway :p [17:10:41] gwicke: how about I don't puppetize that either, just for safety [17:10:58] 75 packets transmitted, 35 packets received, 53% packet loss [17:11:01] nah, nevermind that [17:11:10] ottomata: makes sense to me, especially as a first step [17:11:10] when the cluster is up, you can manually manage users and keyspace replication settings, ja? [17:11:11] ok [17:11:16] ottomata: we can rethink this later [17:11:33] k [17:11:38] ok, next q then [17:11:48] ottomata: we do need to think about distributing the topology info though [17:11:58] yes, that is my next q [17:12:03] do you know how all those files relate? [17:12:06] there are 3 of them [17:12:17] (03PS2) 10Faidon Liambotis: Handle CORS preflight requests for upload in VCL [puppet] - 10https://gerrit.wikimedia.org/r/167542 (https://bugzilla.wikimedia.org/55631) [17:12:24] (03CR) 10Faidon Liambotis: [C: 032] Handle CORS preflight requests for upload in VCL [puppet] - 10https://gerrit.wikimedia.org/r/167542 (https://bugzilla.wikimedia.org/55631) (owner: 10Faidon Liambotis) [17:12:28] and i had thought that the GossipingPropertyFileSnitch only used the cassandra-rackdc.properties file [17:12:39] but i see errors on startup now about missing .topology files [17:12:45] there's a .yaml and a .properties one [17:13:38] do we have any pattern of how private ips map to racks?
[17:13:56] don't think so [17:13:58] gwicke: rows [17:13:59] Reedy: ? [17:14:03] but what do you want to do exactly? [17:14:09] http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architectureSnitchRackInf_c.html [17:14:20] sec [17:14:27] gwicke: naw, it has to be manually set [17:14:40] aude: https://wikitech.wikimedia.org/wiki/Add_a_wiki#Wikidata isn't quite clear [17:14:49] (03PS2) 10Alexandros Kosiaris: Add sqlite3 as a build dependency [debs/contenttranslation/apertium-apy] - 10https://gerrit.wikimedia.org/r/167598 [17:14:52] gwicke: also, we generally don't delineate based on racks, but on rows [17:15:02] what is it missing? [17:15:07] but, for cassandra's purposes, we can let it call a row a 'rack' [17:15:24] aude: Does it need running on every group? Just the new group? [17:15:26] Should it just be foreachwikiindblist wikidataclient.dblist extensions/Wikidata/extensions/Wikibase/lib/maintenance/populateSitesTable.php --strip-protocols [17:15:29] you could delineate on racks as well, if you can do hierarchically [17:15:48] each group, sadly [17:15:53] we need to fix that [17:15:58] we could, but woudl that real;ly help us? i suppose for power failure redundancy it would [17:16:00] so surely the command I posted above makes it easier? [17:16:05] letting wikibase detect the group? [17:16:15] the site_identifiers table is dependendent on which site group the site is in [17:16:15] but, for both elasticsearch and for hadoop, i was told to delineate on the row level, as it was closer to how things were actually networked [17:16:17] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Add sqlite3 as a build dependency [debs/contenttranslation/apertium-apy] - 10https://gerrit.wikimedia.org/r/167598 (owner: 10Alexandros Kosiaris) [17:16:26] Reedy: yes, would be better [17:16:38] aude: Do we unit test teh group identification? :D [17:16:51] ottomata: the idea is to place replicas in different racks, primarily for power & switch redundancy [17:17:28] ottomata: with puppet we could actually use http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architectureSnitchPFSnitch_t.html [17:17:46] that'd be the central config that's pushed to all nodes [17:17:49] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: Puppet has 1 failures [17:17:49] i did foreachwikiindblist wikisource.dblist .... [17:18:03] aye, we could do that, GossipingPropertyFileSnitch is kinda cool though... [17:18:04] :) [17:18:09] PROBLEM - puppet last run on cp3005 is CRITICAL: CRITICAL: Puppet has 1 failures [17:18:14] and for each group, which can easily be wrapped in a shortcut script [17:18:43] ottomata: yeah, it might be nicer when adding new nodes [17:19:39] PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: Puppet has 1 failures [17:19:42] ahhh, ok ,gwicke i see. 
i'm getting this message on startup [17:19:49] GossipingPropertyFileSnitch.java (line 65) Unable to load cassandra-topology.properties; compatibility mode disabled [17:19:49] PROBLEM - puppet last run on cp1064 is CRITICAL: CRITICAL: Puppet has 1 failures [17:19:59] but i see that is for fallback [17:20:03] if Gossiping failes [17:20:05] fails* [17:20:05] PROBLEM - puppet last run on cp4015 is CRITICAL: CRITICAL: Puppet has 1 failures [17:20:16] ok, so, I think GossipingPropertyFileSnitch should work fine [17:20:20] that's what i've puppetized already anyway [17:20:35] okay [17:20:41] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Puppet has 1 failures [17:20:57] would be nice if $::rack or $::row was available automatically by default in our puppet [17:21:35] ottomata: it is not so difficult [17:21:38] PROBLEM - puppet last run on cp4013 is CRITICAL: CRITICAL: Puppet has 1 failures [17:21:44] akosiaris: no? how? [17:21:50] PROBLEM - puppet last run on cp1051 is CRITICAL: CRITICAL: Puppet has 1 failures [17:22:11] ottomata: we should perhaps talk to robh about setting up the cassandra nodes with a specific per-{DC,rack} prefix pattern, so that we can use http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architectureSnitchRackInf_c.html [17:22:18] lldp. I already got a patch for getting LLDP neighbors from the switches in a fact [17:22:30] although I suspect it'll be hard to change if we already use a different pattern [17:22:44] doing some rather crude heuristics it is possible to derive row [17:22:45] yeah, gwicke, that would be nice, i'm not sure we do that on a rack level though...do we akosiaris? [17:22:49] gwicke: so we dont do per rack ip subnetting [17:22:51] just per row [17:22:57] yeah [17:23:04] so the 'rack octet' isn't possible [17:23:14] well, not presently [17:23:21] yeah, rack is not so easy right now [17:23:41] PROBLEM - puppet last run on cp4007 is CRITICAL: CRITICAL: Puppet has 1 failures [17:23:42] if they are all internal hosts you could make an argument on cassandra specific internal ip ranges [17:24:19] they are all internal hosts [17:24:38] but then you'd be setting up no less than a possible 32 ranges [17:24:44] but they could be very very very tiny ranges [17:24:46] ;] [17:24:46] we can deal with any topology by making everything explicit, but it would save some amount of time & possible errors if we followed a consistent pattern [17:24:49] PROBLEM - puppet last run on cp3006 is CRITICAL: CRITICAL: Puppet has 1 failures [17:24:58] (03PS1) 10Faidon Liambotis: Brown paper bag fix for Varnish/upload/CORS [puppet] - 10https://gerrit.wikimedia.org/r/167619 [17:24:58] PROBLEM - puppet last run on cp1049 is CRITICAL: CRITICAL: Puppet has 1 failures [17:25:00] PROBLEM - puppet last run on cp3015 is CRITICAL: CRITICAL: Puppet has 1 failures [17:25:00] but it would be a pattern ONLY for cassandra [17:25:02] which would be odd. [17:25:04] K4: Respected human, time to deploy Fundraising (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141020T1725). Please do the needful. [17:25:06] that cp* is me [17:25:08] gwicke: how is RackInferringSnitch used with replication strategies? [17:25:15] akosiaris: do you have a lldp puppet fact? [17:25:17] i.e. how to set per-DC policies?
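akosiaris' real patch shows up later in the log (gerrit 167644/167645); purely to illustrate the approach he is describing, a custom Facter fact built on lldpd could look roughly like this. The lldpctl keyvalue output and the asw-<row>-<site> switch-naming heuristic are assumptions, not his implementation:

    # e.g. modules/base/lib/facter/lldp_switch.rb (hypothetical path; needs lldpd installed)
    Facter.add(:lldp_switch) do
      confine :kernel => 'Linux'
      setcode do
        out = Facter::Util::Resolution.exec('lldpctl -f keyvalue 2>/dev/null')
        if out
          # keyvalue output has lines like "lldp.eth0.chassis.name=asw-c-eqiad"
          line = out.lines.find { |l| l.include?('chassis.name=') }
          line.split('=', 2).last.strip if line
        end
      end
    end

    # crude heuristic: a switch named like "asw-c-eqiad" means the host sits in row c
    Facter.add(:rackrow) do
      setcode do
        sw = Facter.value(:lldp_switch)
        sw[/asw2?-([a-d])\d*-/, 1] if sw
      end
    end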
[17:25:20] ahem: Error: /usr/share/varnish/reload-vcl -n frontend && (rm /var/tmp/reload-vcl-failed-frontend; true) returned 1 instead of one of [0] [17:25:20] Error: /Stage[main]/Role::Cache::Upload/Varnish::Instance[upload-frontend]/Exec[retry-load-new-vcl-file-frontend]/returns: change from notrun to 0 failed: /u [17:25:21] llnp [17:25:22] does it auto-name the DCs? [17:25:24] (03CR) 10Faidon Liambotis: [C: 032 V: 032] Brown paper bag fix for Varnish/upload/CORS [puppet] - 10https://gerrit.wikimedia.org/r/167619 (owner: 10Faidon Liambotis) [17:25:28] akosiaris: see above [17:25:29] robh: I guess per-DC/rack prefixes would be useful for anything that cares about correlated failures [17:25:41] paravoid: k thanks [17:25:48] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Puppet has 1 failures [17:25:49] PROBLEM - puppet last run on cp4006 is CRITICAL: CRITICAL: Puppet has 1 failures [17:26:11] ottomata: it would, yes [17:26:12] cajoel: yeah I do, I have not submitted it though [17:26:33] akosiaris: neato -- I did one based on cdp. [17:26:39] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: Puppet has 1 failures [17:26:50] PROBLEM - DPKG on labmon1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:27:01] not sure if my ios supports ll yet. [17:27:41] ah yes cisco [17:27:48] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [17:27:50] PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: Puppet has 1 failures [17:27:58] ottomata: https://svn.apache.org/repos/asf/cassandra/trunk/src/java/org/apache/cassandra/locator/RackInferringSnitch.java [17:27:58] PROBLEM - puppet last run on cp1061 is CRITICAL: CRITICAL: Puppet has 1 failures [17:28:08] ottomata: see getDatacenter in there [17:28:09] let's create a weird combination of model, version, train, release and the position of the moon for the feature support matrix [17:28:32] lldp run -- yep supported [17:28:33] excellent [17:28:49] aye, cool [17:28:51] that makes sense [17:29:00] so, no, RackInferringSnitch probably won't work [17:29:18] potentially it could for rows, but better to be able to set the topology ourselves I think [17:29:39] RECOVERY - puppet last run on cp1064 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [17:29:41] having nodes in different rows isn't very useful though ;) [17:29:49] what do you mean? [17:29:57] it is, it's our network availability domain [17:30:00] RECOVERY - puppet last run on cp1061 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [17:30:06] is there correlation between failures & rows? 
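For anyone following the RackInferringSnitch tangent: getDatacenter/getRack in that class just reuse the IP octets, so the topology is inferred rather than configured, e.g.

    10.64.32.15  ->  datacenter "64", rack "32"   (2nd octet = DC, 3rd octet = rack)

Since the subnets here are carved per row rather than per rack, the inferred "rack" would really be a row and the "datacenter" would be a meaningless number instead of eqiad/codfw, which is why the conversation settles on an explicitly configured snitch instead.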
[17:30:08] RECOVERY - puppet last run on cp1049 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [17:30:09] RECOVERY - puppet last run on cp1051 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [17:30:28] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [17:30:38] (03PS1) 10Dzahn: create shell user for Joel Sahleen [puppet] - 10https://gerrit.wikimedia.org/r/167620 [17:30:48] RECOVERY - puppet last run on cp4015 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [17:30:48] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [17:30:49] RECOVERY - DPKG on labmon1001 is OK: All packages OK [17:30:54] paravoid: so is the physical networking set up per *row* rather than per *rack*? [17:30:58] RECOVERY - puppet last run on cp4007 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [17:31:08] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [17:31:10] RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [17:31:10] gwicke: yes, network failures, but, there potential for rack specific power failures, i thnik [17:31:10] RECOVERY - puppet last run on cp3006 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [17:31:10] RECOVERY - puppet last run on cp3005 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [17:31:19] RECOVERY - puppet last run on cp4013 is OK: OK: Puppet is currently enabled, last run 60 seconds ago with 0 failures [17:31:27] gwicke: it's not that simple obviously [17:31:27] so, if a single rack were to have all replicas of something, and it were to lose power, that would be bad [17:31:28] RECOVERY - puppet last run on cp3017 is OK: OK: Puppet is currently enabled, last run 65 seconds ago with 0 failures [17:31:28] RECOVERY - puppet last run on cp3015 is OK: OK: Puppet is currently enabled, last run 65 seconds ago with 0 failures [17:31:49] PROBLEM - puppet last run on labmon1001 is CRITICAL: CRITICAL: Puppet has 2 failures [17:32:27] we have one forwarding plane per rack, 1+1 control plane per row, and multiple unified L3 domains per row [17:33:00] akosiaris: tell me more about your puppet $::rack thing [17:33:20] ottomata: I 'll upload a patch and set you as a reviewer [17:33:26] me too? [17:33:30] sure [17:33:41] k [17:33:43] paravoid: so is power per row as well? [17:33:51] requrie lldpd to be running or just snag it occasionally [17:33:57] (not like it changes alot) [17:34:01] no, power is per rack -- but usually A/B [17:34:04] (03PS2) 10Dzahn: create shell user for Joel Sahleen [puppet] - 10https://gerrit.wikimedia.org/r/167620 [17:34:46] paravoid: given our current physical setup, which replica placement strategy would give us the best failure isolation in a single DC? [17:35:24] how many replicas are we going to keep within the same DC? [17:35:25] 3? [17:35:46] replicas per item is likely going to be between 1 and 5 [17:35:57] nodes per DC will likely be > 3 [17:36:03] (03PS1) 10Andrew Bogott: Reduce RAM overprovision ratio. 
[puppet] - 10https://gerrit.wikimedia.org/r/167623 [17:36:04] 3 is the minimum config [17:36:31] (03CR) 10Dzahn: [C: 032] "request is old enough, has approval, key confirmed from office wiki, uid matches labs/LDAP" [puppet] - 10https://gerrit.wikimedia.org/r/167620 (owner: 10Dzahn) [17:36:45] YuviPanda: Help me understand this report? https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=mem_free&s=by+name&c=Virtualization+cluster+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 [17:36:47] one per row at least, yeah [17:37:03] you don't want to be in the position where the algorithm has picked two or three racks in the same row [17:37:08] I don't understand why some instances are marked 'green' with tiny amounts of free memory. And others with much more free memory orange... [17:37:14] Maybe I should just ignore the colors in ganglia [17:37:18] (03CR) 10Alexandros Kosiaris: [C: 032] Beta: Update cxserver to use Apertium service [puppet] - 10https://gerrit.wikimedia.org/r/157787 (owner: 10KartikMistry) [17:37:26] if you have > 3 replicas to put, you can start allocating into different racks of the same 3 rows [17:37:47] (we have 4 rows per DC but it's /very/ unlikely we'll lose multiple of them at the same time) [17:38:05] (03CR) 10Dzahn: "welcome. https://wikimediafoundation.org/wiki/User:JSahleen_%28WMF%29" [puppet] - 10https://gerrit.wikimedia.org/r/167620 (owner: 10Dzahn) [17:38:52] paravoid: cassandra is interested in the replica set *per item*, which is a subset of all cluster nodes [17:39:21] yeah, otherwise we'd have only three machines per DC :) [17:39:33] *nod* [17:39:43] (03CR) 10coren: [C: 032] "Overprovisioning is always a risky business; this makes it more conservative and thus more safe." [puppet] - 10https://gerrit.wikimedia.org/r/167623 (owner: 10Andrew Bogott) [17:40:01] so if it's 3 replicas, better be one in each of three rows [17:40:12] and different racks I guess [17:40:30] andrewbogott: what is it based on, absolute numbers or percentages? [17:40:42] one rack belongs into only one row, always, so that's implied [17:40:52] mutante: I don't know any more than what you see. The index is marked in 'M' which suggests absolute [17:41:22] mutante: it's possible that the colors are based on relative values, which would mean that those servers have different amounts of RAM installed, which would surprise me a lot [17:41:32] paravoid: ahhh, now it makes sense [17:41:33] if we have 5 replicas per item, we don't have 5 rows, so one row will get two replicas; in that case, it should ideally be in two different racks within that same row [17:41:43] andrewbogott: because in icinga we used to get those free space warnings when >5%, and that was very different when we had terabytes of space on dataset [17:41:47] (rows are 4 per DC currently) [17:41:54] andrewbogott: oh... RAM .. nevermind [17:41:58] !log all trusty and lucid hosts now running salt 2014.1.11 (this includes labcontrol2001, salt master for future codfw labs) [17:42:03] Logged the message, Master [17:42:33] (it's not a big deal if it's the same though, if we lose that one rack with two copies, we'll three more...) [17:43:58] it's a hierarchy, DC > row > rack > node [17:44:09] yup [17:44:43] andrewbogott: I'm going to say I know as much as you do at this point. 
I've always ignored ganglia colors [17:44:44] so if we can get this info in puppet then we should be fine for now [17:44:51] no matter what the IP pattern is [17:45:01] andrewbogott: if I click through there isn't any pattern [17:45:04] YuviPanda: fair enough :) [17:46:34] andrewbogott: virt1004 has more RAM than labnet1001 [17:46:50] the graphs are confusing though [17:46:54] mutante: yeah, labnet1001 is running unrelated services [17:47:24] mutante: btw, shinken is reporting uptime for labs hosts now :) http://shinken.wmflabs.org/host/tools-login [17:47:28] now to add hostgroups... [17:47:29] it's very odd that virt1004 is red [17:47:34] MemTotal: 198028876 kB [17:47:34] MemFree: 18880016 kB [17:47:44] because that is almost not used at all [17:47:49] oh., wait [17:47:57] one more digit :p [17:48:05] well, except, gwicke, it looks like cassandra doesn't have the row level built in [17:48:07] it only has dc and rack [17:48:18] we can overload the meaning of rack here to be row, or just leave it at rack [17:48:24] gwicke: I'm not sure if we have it as a variable right now, but we could write it, inferring it from other data [17:48:29] YuviPanda: ooh. trying .. first time i see this is up :) [17:48:32] ottomata: I think it makes sense to treat rows as virtual racks [17:48:32] but, if we left it at rack, we might get all replicas in the same ro [17:48:33] w [17:48:38] aye, i agree [17:48:43] mutante: it's just going to say 'up'. No other monitoring in place yet. [17:48:52] operations/software/shinkengen [17:48:59] RECOVERY - puppet last run on labmon1001 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [17:49:00] ottomata: that also seems to be what some of the other snitches are doing [17:49:12] mutante: guest/guest uid/pw. [17:49:16] gwicke: let's go with the Gossiping one for now, its really easy to configure [17:49:17] YuviPanda: is it hooked up to LDAP? [17:49:22] ah, ok [17:49:32] i'm not going to bother supporting the PrpertyFile one in this module, s'ok? [17:49:34] mutante: nope, labs instance, and we're not sure how to do auth yet [17:49:36] if we need it we can add it later [17:49:42] No new IT problems in the last 10minutes [17:49:43] :) [17:49:43] ottomata: okay [17:49:47] passing LDAP creds through labs instances seem icky [17:49:51] mutante: yeah :) [17:49:53] ottomata: we'll need to export the topology as a propertyfile [17:49:59] or some json blob [17:50:23] so that apps can set up replication for non-system keyspaces [17:50:24] naw, not with Gossiping [17:50:25] ottomata: last I recall (from messing with Cassandra years ago), Cassandra's conception of network/dc layout is completely pluggable with code [17:50:35] eh? [17:50:53] bblack, ah, cool, so we could extend one of these and add support for another hierarchy level [17:50:53] ? [17:51:04] andrewbogott: YuviPanda : ah, at least there is a "legend" link saying a bit more about the colors https://ganglia.wikimedia.org/latest/node_legend.html [17:51:05] right [17:51:07] ottomata: alternatively, we could just provide a list of datacenters to the app [17:51:24] you probably want to google things related to Snitches and Replication Strategy [17:51:31] aye [17:51:36] gwicke: why does the app need to know? [17:51:54] but we also need a list of nodes to connect to, so just providing the full list sounds like the easiest thing [17:52:03] ottomata: e.g. 
some info here is relevant: http://www.datastax.com/docs/1.0/cluster_architecture/replication#networktopologystrategy [17:53:02] ottomata: replication is set up per keyspace, with number of replicas per DC -> need to know the names of the dcs; the app also needs to connect to some nodes that are alive -> need a list of nodes [17:53:26] we could do this in the restbase module I'm working on [17:54:21] mutante, YuviPanda I'm thinking that the axis are labeled wrong, that some of them should have 'G' and some of them 'M'? [17:54:28] ottomata: a $row/$rack global variable (or fact) would be generally useful in many of the things you do I think [17:54:44] yup, for sure [17:55:00] gwicke: still not sure why the app needs to know the topology [17:55:00] andrewbogott: yes, agreed. something is wrong about the axis/labels it looks indeed [17:55:06] nodes to connect to, sure [17:55:11] gwicke: will the app create keyspaces? [17:55:24] ottomata: yes, the app creates keyspaces [17:55:26] ah [17:55:27] k [17:55:30] hm [17:55:39] one for each logical table [17:55:51] replication factors depend on the use case [17:55:55] well, in that case. hm. maybe we should just use PropertyFile then? [17:56:03] if we have to hardcode the topology somewhere anyway [17:57:36] ottomata: we should check if config changes require a restart, or just a hufp [17:57:39] *hup [17:58:12] if they require a restart, then PropertyFile without gossip would be much less attractive [17:58:22] (03PS1) 10Dzahn: give jsahleen access to bastion and private logs [puppet] - 10https://gerrit.wikimedia.org/r/167627 [17:58:36] hmmm [17:58:40] good point [18:00:35] (03PS2) 10Dzahn: give jsahleen access to bastion and private logs [puppet] - 10https://gerrit.wikimedia.org/r/167627 [18:01:20] (03PS2) 10Yuvipanda: [WIP] Add hosts & hostgroups generator [software/shinkengen] - 10https://gerrit.wikimedia.org/r/167595 [18:03:13] (03PS3) 10Dzahn: give jsahleen access to bastion and private logs [puppet] - 10https://gerrit.wikimedia.org/r/167627 [18:03:14] hm, gwicke, i'm not sure yet, will have to test in labs witha multi node cluster i think. but, actualy, on second thought, even with a hardcoded topology in puppet somewhere, we can still use gossiping [18:03:15] ottomata: I don't see any HUP support, nor is there reload support in the init script [18:03:30] we can just set the $dc and $rack based on the stuff in a hash or something [18:03:39] aye, gwicke, i did just test with a single node [18:03:41] and changed the .properties file [18:03:45] and then did nodetool status [18:03:49] and the info was updated immediately [18:03:56] without HUPing or restarting [18:03:57] ohh, nice [18:04:07] but, that may be because I was running nodetool on the same node [18:04:08] dunno [18:04:22] but, either way, Gossip snitch is easier to configure, and works [18:04:28] so I will leave it there for now [18:04:31] *nod* [18:04:34] if we decide to change later i can build it in [18:05:10] ottomata: if the list of cassandra dcs and nodes is available to other puppet modules, then things should be fine [18:05:23] aye [18:05:30] we'll have to hardcode it either, but ja [18:06:20] cool, sounds like a plan ;) [18:08:35] (03PS6) 10Ottomata: Initial commit of Cassandra puppet module [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/166888 [18:10:07] _joe_: yt? [18:10:20] any pointers for how to use hiera in labs? [18:10:25] <_joe_> ottomata: yep [18:10:37] oh, gwicke, do you have a labs project where I can spawn some test cassandra instances for this? 
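Putting the "rows as virtual racks" agreement into config terms: with GossipingPropertyFileSnitch each node only carries a small cassandra-rackdc.properties, and the application then names the DCs when it creates its keyspaces. The values below (eqiad, row b, RF 3, keyspace name) are illustrative only:

    # /etc/cassandra/cassandra-rackdc.properties, written per node by puppet
    # "rack" is really the row, per the convention agreed above
    dc=eqiad
    rack=b

    -- per-keyspace replication, issued by the app (one keyspace per logical table)
    CREATE KEYSPACE some_bucket
      WITH replication = {'class': 'NetworkTopologyStrategy', 'eqiad': 3};

This is also why the app needs the list of DC names and some seed nodes exported to it, as gwicke notes above: the replication map is keyed on exactly those DC names.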
[18:10:43] i could use annalytics...buuuut, gimme one of yours :p [18:14:37] METTING [18:14:38] DOH! [18:16:38] ottomata: added you to the services project [18:16:54] danke [18:18:16] (03PS1) 10Faidon Liambotis: upload: add X-Content-Duration to CORS exposed headers [puppet] - 10https://gerrit.wikimedia.org/r/167632 [18:18:44] (03CR) 10Faidon Liambotis: [C: 032] upload: add X-Content-Duration to CORS exposed headers [puppet] - 10https://gerrit.wikimedia.org/r/167632 (owner: 10Faidon Liambotis) [18:19:32] (03CR) 10Faidon Liambotis: [V: 032] upload: add X-Content-Duration to CORS exposed headers [puppet] - 10https://gerrit.wikimedia.org/r/167632 (owner: 10Faidon Liambotis) [18:25:01] (03CR) 10Jsahleen: [C: 031] give jsahleen access to bastion and private logs [puppet] - 10https://gerrit.wikimedia.org/r/167627 (owner: 10Dzahn) [18:38:01] PROBLEM - puppet last run on cp3021 is CRITICAL: CRITICAL: puppet fail [18:38:44] (03PS3) 10Yuvipanda: [WIP] Add hosts & hostgroups generator [software/shinkengen] - 10https://gerrit.wikimedia.org/r/167595 [18:54:06] _joe_: I'm working on manifests/openstack today. [18:54:16] (so you don't duplicate) [18:55:08] <_joe_> andrewbogott: no openstack is one of the things that still scare me in our manifests [18:57:22] RECOVERY - puppet last run on cp3021 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [18:57:38] ottomata: yes, there's an asterisk box in the offic [19:00:04] bd808, mutante: Dear anthropoid, the time has come. Please deploy IEG review (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141020T1900). [19:00:51] cajoel: asterisk box? did I ask about that? [19:01:31] someone else did, but it was related to you :) [19:01:37] I think andrewbogott did [19:01:54] in you = in finding a solution so that all of us can join [19:02:50] jouncebot: wee :) ok [19:02:52] oh [19:03:02] I'll see if I can finagle a way to have a hangout dialin to a POTS conference bridge-- then non-video attendees can just dial in to the bridge [19:03:19] not sure what I'd use as a conf bridge.. [19:03:24] I'll file an IT ticket [19:03:35] ConfBridge() :) [19:03:45] (seriously, that's how it's called) [19:03:47] (03CR) 10Gage: [C: 032] "headed to ULSFO now for router card swap" [dns] - 10https://gerrit.wikimedia.org/r/167599 (owner: 10BBlack) [19:03:52] uhm [19:03:54] jgage: wait :) [19:04:12] too late? [19:04:25] if you -2 it jenkins should stop merging it [19:04:34] oh, yup [19:04:45] arr [19:04:45] (03PS1) 10Yuvipanda: shinken: Set up separate directory for generated configs [puppet] - 10https://gerrit.wikimedia.org/r/167640 [19:05:03] sorry paravoid. waiting for GTT? [19:05:16] GTT has no loss atm [19:05:28] i'm ready to leave the office so it seemed like the right time [19:05:30] but they're still debugging, so it's still a risk [19:05:35] better safe than sorry [19:05:36] go ahead [19:05:42] and merge it [19:07:00] i haven't done dns changes before. do i have to do something after merging in gerrit? [19:07:11] yes [19:07:15] authdns-update [19:07:29] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [19:07:30] ok. what host(s)? [19:07:37] one of the NSes [19:07:41] like a regular DNS change [19:08:40] running now [19:09:01] done on ns1 aka baham [19:10:07] should be enough [19:10:40] ok transporting body to ulsfo. will check in on irc before touching anything. should be there in <60 mins. 
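For the record, the DNS depool jgage just ran boils down to the following; the host is the one named in the log, and the comment on what authdns-update does internally is a paraphrase, not something verified here:

    # after the operations/dns change is merged in gerrit
    ssh baham.wikimedia.org     # ns1
    sudo authdns-update         # deploys the merged zone data to the authoritative nameservers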
[19:12:13] (03PS4) 10Yuvipanda: [WIP] Add hosts & hostgroups generator [software/shinkengen] - 10https://gerrit.wikimedia.org/r/167595 [19:13:00] !log importing schema, data, users into mysql for iegreview [19:13:08] Logged the message, Master [19:15:29] so, _joe_, how to use hiera in labs? [19:15:37] i think i see how to use it with self hosted puppetmaster [19:15:41] which i'll have to do for now anyway [19:15:48] but, for example [19:16:00] with the wikitech interface, i can check to include classes and set global variables [19:18:05] (03PS5) 10Yuvipanda: [WIP] Add hosts & hostgroups generator [software/shinkengen] - 10https://gerrit.wikimedia.org/r/167595 [19:23:53] (03CR) 10Andrew Bogott: [C: 032] shinken: Set up separate directory for generated configs [puppet] - 10https://gerrit.wikimedia.org/r/167640 (owner: 10Yuvipanda) [19:23:57] ottomata: I wonder if there's much point into creating a variable for every cassandra config setting [19:24:10] I mean, you already did it, so it's kinda moot now [19:24:25] but it may have been easier to just allow the caller to pass a template for the config file [19:26:28] (03PS6) 10Yuvipanda: [WIP] Add hosts & hostgroups generator [software/shinkengen] - 10https://gerrit.wikimedia.org/r/167595 [19:29:10] paravoid, i've thought about that before too, a little bit. [19:29:23] i kind of like the explicitness of the variables [19:29:36] but am not set on them [19:29:49] also, many of the variables i adapted from another puppet module, some of them we might not need to change [19:29:55] also, it is much easier to set variables in labs [19:29:56] afaik [19:30:00] at least, it has been [19:30:02] then to use hashes [19:30:11] oh sorry, paravoid, you said template [19:30:12] mehhhh [19:30:46] i dont' like that as much, i do like being able to override the template [19:30:48] i'll add that [19:30:51] like i have in hadoop classes [19:31:05] but, i woudln't want to have to provide a template every time i wanted to use the class [19:31:23] e.g., in vagrant, in labs, etc. [19:33:44] (03PS3) 10Ori.livneh: trebuchet provider: grains.get should be local [puppet] - 10https://gerrit.wikimedia.org/r/167562 (owner: 10ArielGlenn) [19:34:00] (03CR) 10Ori.livneh: [C: 032 V: 032] trebuchet provider: grains.get should be local [puppet] - 10https://gerrit.wikimedia.org/r/167562 (owner: 10ArielGlenn) [19:37:02] (03PS7) 10Ottomata: Initial commit of Cassandra puppet module [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/166888 [19:37:20] (03PS1) 10Alexandros Kosiaris: Introduce an LLDP facts [puppet] - 10https://gerrit.wikimedia.org/r/167644 [19:37:22] (03PS1) 10Alexandros Kosiaris: Introduce rack facts based on LLDP facts [puppet] - 10https://gerrit.wikimedia.org/r/167645 [19:37:32] (03PS7) 10Yuvipanda: [WIP] Add hosts & hostgroups generator [software/shinkengen] - 10https://gerrit.wikimedia.org/r/167595 [19:38:41] (03PS1) 10Yuvipanda: shinken: Make contactgroup name match project name [puppet] - 10https://gerrit.wikimedia.org/r/167646 [19:40:19] (03PS2) 10Alexandros Kosiaris: Introduce LLDP facts [puppet] - 10https://gerrit.wikimedia.org/r/167644 [19:40:21] (03PS2) 10Alexandros Kosiaris: Introduce rack/rackrow facts based on LLDP facts [puppet] - 10https://gerrit.wikimedia.org/r/167645 [19:40:29] (03CR) 10Ottomata: "Cool!" [puppet] - 10https://gerrit.wikimedia.org/r/167645 (owner: 10Alexandros Kosiaris) [19:41:03] akosiaris: still awake ? 
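To give a sense of what YuviPanda's shinkengen generator presumably writes into the new generated-config directory — a guess at the shape, not the tool's actual output — think Nagios/Shinken-style host and hostgroup blocks per labs project, with the contactgroup named after the project as in the patch above:

    # hosts.cfg (generated; host name taken from the log, address illustrative)
    define host {
        use             generic-host
        host_name       tools-login
        address         10.68.16.4
        hostgroups      tools
        contact_groups  tools
    }

    define hostgroup {
        hostgroup_name  tools
        alias           tools project instances
    }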
[19:41:48] matanya: yes [19:42:09] (03PS1) 10Ori.livneh: Revert "Revert "trebuchet: derive the grain name from the repo name"" [puppet] - 10https://gerrit.wikimedia.org/r/167648 [19:42:21] akosiaris: hi, i would like to remove the DNS/DHCP stanzas for tampa from the repo, do you endorse that ? [19:43:56] matanya: yes obviously [19:44:23] any preferred way ? [19:44:51] in small patches preferably [19:45:17] ok [19:46:27] (03PS1) 10Ori.livneh: hhvm: provision quickstack package [puppet] - 10https://gerrit.wikimedia.org/r/167650 [19:46:41] akosiaris: matanya [19:46:51] https://gerrit.wikimedia.org/r/#/c/166881/ [19:46:57] https://gerrit.wikimedia.org/r/#/c/166882/ [19:47:27] https://gerrit.wikimedia.org/r/#/c/164241/ [19:47:40] so done. akosiaris i'll just add you as a reviewer [19:47:51] cool, thanks [19:47:55] (03CR) 10Matanya: [C: 031] remove Tampa appserver DNS entries [dns] - 10https://gerrit.wikimedia.org/r/166881 (owner: 10Dzahn) [19:48:15] robh: are we still using tampa's mgmt IP addresses in codfw ? [19:49:31] (03CR) 10Andrew Bogott: [C: 032] shinken: Make contactgroup name match project name [puppet] - 10https://gerrit.wikimedia.org/r/167646 (owner: 10Yuvipanda) [19:50:13] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet has 1 failures [19:50:16] akosiaris: we redo the mgmt ip [19:50:24] but the old one may potentially work to do so remotely [19:50:28] (03CR) 10Matanya: [C: 031] "looks ok, would run through compiler to make sure." [dns] - 10https://gerrit.wikimedia.org/r/166914 (owner: 10Dzahn) [19:50:29] rather than assign to papaul to do so [19:50:50] it depends if it was mgmt reset (they were supposed to be, but many werent cuz our tech there wasnt any good and chris and i had to make up a lot of lost itme) [19:51:07] but otherwise, the old tampa ip is only used to access and updat to codfw mgmt ip [19:51:25] (with mark's network hack that will go away sometime in the near to midterm future) [19:51:29] (03PS1) 10Dzahn: DHCP - remove Tampa public services subnet [puppet] - 10https://gerrit.wikimedia.org/r/167651 [19:51:51] robh: so, when you do that, do you still use the DNS addresses, WMFXYZW.mgmt.pmtpa.wmnet ? [19:51:53] (03CR) 10Matanya: "What is this service anyway?" [puppet] - 10https://gerrit.wikimedia.org/r/164690 (owner: 10Dzahn) [19:52:09] or some other procedure ? [19:52:32] (03PS8) 10Yuvipanda: Add hosts & hostgroups generator [software/shinkengen] - 10https://gerrit.wikimedia.org/r/167595 [19:52:45] (03CR) 10Dzahn: "https://en.wikipedia.org/wiki/Wikipedia:Snuggle" [puppet] - 10https://gerrit.wikimedia.org/r/164690 (owner: 10Dzahn) [19:53:32] (03PS2) 10Ori.livneh: hhvm: provision quickstack package [puppet] - 10https://gerrit.wikimedia.org/r/167650 [19:53:37] (03CR) 10Ori.livneh: [C: 032 V: 032] hhvm: provision quickstack package [puppet] - 10https://gerrit.wikimedia.org/r/167650 (owner: 10Ori.livneh) [19:53:57] akosiaris: nope, all the mgmt dns entries were removed [19:53:58] (03CR) 10Dzahn: "" Just click on https://snuggle-en.wmflabs.org/ and get started." sooo.. is this staying a labs project or is the plan to make it product" [puppet] - 10https://gerrit.wikimedia.org/r/164690 (owner: 10Dzahn) [19:54:03] ori: why just hhvm? 
[19:54:04] i usually go back and cherry pick an old dns commit [19:54:08] and just ssh to the ip directly [19:54:10] ori: it's a tiny package, just add it to base [19:54:14] (cherrypicking to find the old ip) [19:54:30] it is mildly annoying, but seems the best way i know of [19:55:03] robh: they are actually still here https://gerrit.wikimedia.org/r/#/c/166881/1/templates/10.in-addr.arpa [19:55:13] RECOVERY - check_puppetrun on boron is OK: OK: Puppet is currently enabled, last run 282 seconds ago with 0 failures [19:55:33] robh: ok thanks [19:55:34] mutante's answer is better! (well, using that rather than git log) [19:55:46] heh [19:55:52] well, that is unmerged :) [19:56:08] mutante: well, is that every tampa mgmt? [19:56:10] paravoid: trusty only [19:56:15] which is why I was asking robh... to see if we should merge it [19:56:15] cuz those changes havbe been ongoing for 6 + months [19:56:15] but sure [19:56:19] oh? [19:56:25] Yes, merge any kill mgmt.pmtpa [19:56:32] robh: yes, he asked because i suggested to remove the rest of them [19:56:35] i misunderstood, i thought you were asking cuz you needed to access an old machine [19:56:36] ok, thanks [19:56:42] and was trying to explain how [19:56:43] heh [19:56:46] im in hardware mode =P [19:56:47] ori: fixable :) [19:56:55] yes kill all tampa mgmt entries with fire [19:57:01] (also, conditional install everywhere still better than hhvm) [19:57:21] * robh is deep within rt queue mining now [19:57:22] (03CR) 10Alexandros Kosiaris: [C: 032] "LGTM and per IRC discussion merging" [dns] - 10https://gerrit.wikimedia.org/r/166881 (owner: 10Dzahn) [19:57:29] its like minecraft but more confusing. [19:57:42] akosiaris: thx [19:58:05] a few more tampa remnants going away [19:58:09] (03PS1) 10Ori.livneh: Add quickstack to standard-packages; remove from hhvm::packages [puppet] - 10https://gerrit.wikimedia.org/r/167652 [19:58:17] ^ paravoid [19:58:31] speaking of rt tickets (we werent but now we are) [19:59:13] robh: there's a guy who built Mt. Everest .. it's crazy http://imgur.com/a/btTut#0 [19:59:15] oh, wait, answered my own question by trying to phrase the question.... [19:59:19] (03Restored) 10Alexandros Kosiaris: remove 10.0.0.0/16 Tampa subnet from DHCP [puppet] - 10https://gerrit.wikimedia.org/r/164241 (owner: 10Dzahn) [19:59:26] (03CR) 10Alexandros Kosiaris: [C: 032] remove Tampa power distribution unit entries [dns] - 10https://gerrit.wikimedia.org/r/166882 (owner: 10Dzahn) [19:59:29] :)) [20:00:04] gwicke, cscott, subbu: Respected human, time to deploy Parsoid/OCG (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141020T2000). Please do the needful. [20:00:46] speaking of cscott [20:01:01] (03PS2) 10Ori.livneh: Revert "Revert "trebuchet: derive the grain name from the repo name"" [puppet] - 10https://gerrit.wikimedia.org/r/167648 [20:01:02] cscott: who is your manager so i can poke them to approve your zuul changes access request? [20:01:10] (03CR) 10Ori.livneh: [C: 032 V: 032] Revert "Revert "trebuchet: derive the grain name from the repo name"" [puppet] - 10https://gerrit.wikimedia.org/r/167648 (owner: 10Ori.livneh) [20:01:20] cuz there is a gerrit patch all done and such... but the ticket has no actual approvals. [20:01:23] robla is now my manager. (tchay used to be) [20:01:37] !log cr1-ulsfo: reenable ospf/ospf3 (GTT is stable) [20:01:43] Logged the message, Master [20:01:44] ok, I'll cc him on the ticket for his approval. 
im on ops duty this week so i'll be pushing this through the system for ya [20:01:46] (03PS4) 10Alexandros Kosiaris: remove 10.0.0.0/16 Tampa subnet from DHCP [puppet] - 10https://gerrit.wikimedia.org/r/164241 (owner: 10Dzahn) [20:01:54] the reorg makes things kinda funny, but i assume robla is the right person. if he's not, he'll know who is. [20:02:04] * robla shrugs [20:02:05] greetings from ulsfo [20:02:05] sure [20:02:20] different wifi ssid, same password. ftw. [20:03:06] (03CR) 10Faidon Liambotis: [C: 032] Add quickstack to standard-packages; remove from hhvm::packages [puppet] - 10https://gerrit.wikimedia.org/r/167652 (owner: 10Ori.livneh) [20:03:13] thanks [20:03:24] from this end, 'manager approval' is a little weird for ops requests, because it seems like 'project leader approval' would be the more logical thing. but we don't have a solid org tree for project leaders. I seem to recall that we discussed this in terms of an RFC process, don't know if anything was ever done. [20:04:51] that is, if it's parsoid permission bits, it seems like getting gwicke or subbu's approval would be a good checklist; for zuul hashar is probably the relevant person, etc. but i guess a manager approval is a decent proxy, with the working model being that the manager knows the project structure and "checks with the right person" before +2ing. [20:04:54] * cscott goes back to hacking. [20:07:09] cscott: yea i dont disagree with any of that [20:07:25] cuz ideally we do check with whoever leads up a particular service [20:07:37] zuul is traditionally hashar but his approval is already on the patchset ;] [20:07:49] (so i dont bother to ask him on rt ticket) [20:07:50] robh: this one should be good to go , btw https://gerrit.wikimedia.org/r/#/c/167627/ it already has approval from greg [20:07:50] i think i've got a +1 from hashar on that patch in any case ;) [20:07:51] yeah [20:08:02] robh: .. and i added the user account separately.. so that was unblocked [20:08:22] also, i agree to what cscott said as well.. project owner does make sense [20:09:09] mutante: uh, is greg his manager? [20:09:27] i cannot keep up to the org changes either! [20:09:34] ;D [20:09:42] greg-g is the manager of all [20:10:01] hah [20:10:25] no srsly i dunno who his manager is! [20:10:32] who? [20:10:33] robh: hmmm.. mediawiki says still Alolita [20:10:39] and i hate assuming its always robla cuz i feel like i spam him a lot ;] [20:10:46] greg-g: Joel Sahleen [20:10:47] oh, yeah, Runa is acting director for language [20:10:53] but, figured you'd trust me :) [20:10:56] Ok, cool, I'll loop her in on ticket. [20:11:03] trust but verify! [20:11:22] i tend to get both the personnel manager and project/service lead approvals [20:11:32] cuz i dislike surprising folks with unexpected users ;] [20:11:44] !log deployed Parsoid version d4567e9f [20:11:49] Logged the message, Master [20:11:54] I realize that this may be more than other folks do, but I think I'm right anyhow so meh ;] [20:11:54] !log beginning router card swap in ulsfo [20:11:55] robh: yeah I approved cscott for zuul madness on the Gerrit change. Though not on the RT because I don't have permission to read the ticket [20:12:00] Logged the message, Master [20:12:16] hashar: no worries, its noted by mutante that you approved on gerrit so i counted it as good enough once i saw it [20:13:07] robh: great, thanks! [20:13:10] ..... accidentally closing your firefox when its set to delete all history on quit and being mid update on half a dozen tickets.... 
[20:13:12] fml [20:13:35] my paranoia has annoying consequences [20:13:46] (03CR) 10Dzahn: "this was uploaded in 2012, and pinged again in 2013, now it's 2014. somehow i think rebasing is going to be hard :)" [puppet] - 10https://gerrit.wikimedia.org/r/15561 (owner: 10Catrope) [20:13:59] robh: happened to me twice today! [20:14:35] ctrl-q instead of ctrl-w [20:14:55] PROBLEM - DPKG on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:15:04] PROBLEM - RAID on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:15:17] PROBLEM - Apache HTTP on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:15:23] _joe_: that you? [20:15:24] PROBLEM - HHVM rendering on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:15:24] PROBLEM - puppet last run on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:15:44] PROBLEM - SSH on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:15:44] PROBLEM - check if dhclient is running on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:15:45] PROBLEM - nutcracker process on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:15:56] oh, OOM [20:16:00] cr2-ulsfo card swap complete, card is back online [20:16:14] PROBLEM - nutcracker port on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:16:15] PROBLEM - check configured eth on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:16:36] RECOVERY - SSH on mw1114 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [20:16:44] RECOVERY - check if dhclient is running on mw1114 is OK: PROCS OK: 0 processes with command name dhclient [20:16:47] RECOVERY - nutcracker process on mw1114 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [20:16:57] jgage: confirmed, everything looks okay [20:16:59] !log ulsfo fpc 1 mic 1 card swap complete [20:17:02] great, thanks paravoid [20:17:04] RECOVERY - DPKG on mw1114 is OK: All packages OK [20:17:05] Logged the message, Master [20:17:07] RECOVERY - nutcracker port on mw1114 is OK: TCP OK - 0.000 second response time on port 11212 [20:17:13] yea, damn ctrl+q [20:17:14] RECOVERY - RAID on mw1114 is OK: OK: no RAID installed [20:17:15] RECOVERY - check configured eth on mw1114 is OK: NRPE: Unable to read output [20:17:25] RECOVERY - HHVM rendering on mw1114 is OK: HTTP OK: HTTP/1.1 200 OK - 72811 bytes in 4.033 second response time [20:17:26] jgage: don't forget to un-drain [20:17:30] RECOVERY - puppet last run on mw1114 is OK: OK: Puppet is currently enabled, last run 567 seconds ago with 0 failures [20:17:32] jgage: you sir are a DC tech now. [20:17:33] RECOVERY - Apache HTTP on mw1114 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.113 second response time [20:17:45] (well, you were before, but still) [20:17:49] whee [20:17:57] paravoid, will do that now [20:17:59] did you see the extended doors down the aisle? [20:18:12] yeah there's 3 more like ours in this row [20:18:16] deep dish [20:20:49] (03Abandoned) 10Faidon Liambotis: partman: separate common settings and be DRY [puppet] - 10https://gerrit.wikimedia.org/r/32366 (owner: 10Faidon Liambotis) [20:22:30] (03CR) 10Ottomata: [C: 032] "Let's do this tomorrow morning!" 
[puppet] - 10https://gerrit.wikimedia.org/r/167550 (https://bugzilla.wikimedia.org/69667) (owner: 10QChris) [20:24:21] (03CR) 10Faidon Liambotis: [C: 04-1] "There's something very dangerous about this: LLDP facts can come and go (config errors, bugs, lldpd died, cable disconnected, whatever)." [puppet] - 10https://gerrit.wikimedia.org/r/167645 (owner: 10Alexandros Kosiaris) [20:24:58] (03PS1) 10Andrew Bogott: Revert "Reduce RAM overprovision ratio." [puppet] - 10https://gerrit.wikimedia.org/r/167665 [20:26:06] (03CR) 10Andrew Bogott: [C: 032] Revert "Reduce RAM overprovision ratio." [puppet] - 10https://gerrit.wikimedia.org/r/167665 (owner: 10Andrew Bogott) [20:27:09] (03PS1) 10Gage: repool ULSFO in DNS [dns] - 10https://gerrit.wikimedia.org/r/167681 [20:27:32] ^^ paravoid bblack [20:27:48] ok :) [20:28:24] i [20:28:29] 'll take that as +1 :) [20:28:37] give me just a sec to look, but probably [20:28:42] almost overlooked the v6 changes, but caught them [20:28:53] just hit "revert" on the previous patch [20:29:11] sigh [20:29:15] (or use git revert sha1 on cmdline) [20:29:23] well, now i'm more intimately familiar with this file i geuss [20:29:23] yes :) [20:29:32] cool ok, i'll do it the proper way [20:30:11] (03Abandoned) 10Gage: repool ULSFO in DNS [dns] - 10https://gerrit.wikimedia.org/r/167681 (owner: 10Gage) [20:30:35] (03PS1) 10Gage: Revert "depool ulsfo from DNS": maintenance complete. [dns] - 10https://gerrit.wikimedia.org/r/167694 [20:30:49] coming someday to a nameserver near you: "authdns repool ulsfo" :) [20:31:07] how broke the wikis? [20:31:12] no js/css [20:31:59] bblack: :)) [20:32:39] i'm not in ulsfo :) [20:32:45] (03CR) 10Gage: [C: 032] Revert "depool ulsfo from DNS": maintenance complete. [dns] - 10https://gerrit.wikimedia.org/r/167694 (owner: 10Gage) [20:33:40] !log ulsfo repooled in dns [20:33:45] Logged the message, Master [20:34:01] oh, it is just commons. weird [20:34:19] actually I'm really not sure how best to factor out the UI for that. We'd want commands to take a datacenter offline, or just a service at a datacenter, and commands to undo those, I guess. [20:35:44] (but you could e.g. unpool ulsfo just for uploads, then do it for everything, and then say repool everything, at which point I'm not sure if reverting one or both of them is least-surprising) [20:35:55] yeah [20:36:04] not sure, needs some thought [20:36:10] also how to be sure it's in sync [20:36:15] right now we have git for that [20:36:21] sort of :) [20:36:24] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds [20:36:36] I guess regardless, we can have the command actually show a text-diff of the suppressions [20:36:46] just did an LED audit, no issues observed besides known fan issue in asw2-ulsfo. i'm ready to get out of here unless someone has a reason for me to pause? [20:37:09] _joe_: hiiiiii [20:37:40] jgage: can't think of something [20:38:04] jgage: don't forget to update #8583 :) [20:38:12] thanks :) [20:38:34] ok, i'm departing. [20:39:21] (03PS1) 10Ori.livneh: Re-remove deployment::target [puppet] - 10https://gerrit.wikimedia.org/r/167695 [20:41:52] I was planning to include a cmdline helper with gdnsd itself at some point as well, which kinda has the same questions. You can mark something down with a glob-expression, too. But then with the removal, if you use a glob pattern, does that mean "remove one line which is exactly this glob", or "remove all lines that match this glob?" 
:p [20:42:42] (the semantics of the actual state file are clear, this is just all about how you wrap it prettily for consumption without an editor) [20:42:57] :) [20:44:18] I guess for a really basic step 1, we could start pusing a raw state file around with authdns update. it would still be way better than editing the config. [20:44:25] *pushing [20:45:48] (actually for here, maybe that's all we need. The statefile itself would be in git as a bonus. Just put some permanent commentary lines at the top explaining how to format lines in it for our config) [20:46:00] same git you mean? [20:46:05] yeah [20:46:36] we'd have to handle it specially, like we do for config. But it would just be a git file that gets dropped into /var/gdnsd/admin_state on update [20:46:51] yeah [20:46:57] btw, I never switched our config to use include => [20:47:11] oh yeah, I kinda forgot about that. [20:48:20] paravoid: related: I've done 2.1.0 on trusty now with dh-systemd stuff and it just worked (and did nothing re statefiles because the default pbuilder env for trusty doesn't have systemd at all apparently) [20:48:33] I still have no idea how that works out on jessie heh [20:49:36] http://paste.debian.net/127810/ [20:49:52] (also note totally non-functional change to initscript for clarity) [20:50:35] what about the lintian-overrides? [20:51:26] those are a good question actually. why where they there in the first place? was that because they actually weren't hardened, or it just couldn't detect it? [20:51:51] in any case, I get no lintian warnings without them now (in this one attempt on trusty), and 2.1.0 did adds its own hardening flags to everything built. [20:52:10] it couldn't detect it because they're being linked dynamically and the check checked some specific function or something [20:52:19] well the helper isn't linked at all I guess [20:52:50] the helper should be linked like a normal binary, like the main daemon, basically. [20:53:01] yeah [20:53:20] I meant it's not dlopen()ed / linked dynamically [20:53:25] oh right [20:54:05] would it be more-correct, in the view of however distros do things now, if I made the modules link against shlibs explicitly? (does that even work?) [20:54:14] nah [20:55:23] I think I'm gonna try a jessie box today as well though, that's where this really matters. I just need to create one first. [20:55:32] already trying with sid :) [20:55:43] oh nice :) [20:56:58] to balance out the fact that my dev box is F20, I run my two test nameservers on precise and trusty currently to pick up issues there. I should just switch the precise one to jessie now. [21:01:10] !log updated OCG to version ea10c93aca9bc1cae34f284fd74bb05d4b6a8cc6 [21:01:16] Logged the message, Master [21:01:28] who's the git deploy expert around here? [21:01:45] noone? :) [21:01:54] cscott: what's up? [21:01:57] ocg's git deploy set still has tantalum.eqiad.wmnet in it, which has been deactivated for a while now. [21:02:12] basically, since before i ever did my first git deploy of ocg. [21:02:23] how do i remove it from the set of hosts which git deploy targets? [21:02:58] bblack: don't we need an init script fix for the logging stuff as well? [21:03:07] cscott: i can find out; just a moment [21:03:33] cscott: here you go: https://wikitech.wikimedia.org/wiki/Trebuchet#Removing_minions_from_redis [21:03:58] i like how none of the words in that url have 'git' or 'deploy' anywhere in them. ;) [21:05:58] paravoid: well that depends, I've lost track of what debian wants there. 
if you leave it alone, you still get no output like before. [21:05:59] $ redis-cli srem "deploy:ocg/ocg:minions" tantalum.eqiad.wmnet [21:05:59] (integer) 1 [21:06:01] whoo! [21:06:06] ori: thanks [21:06:12] bblack: if it's ok to show warnings/failures, then stop redirecting stderr? [21:06:14] bblack: no output is good [21:06:37] (03PS2) 10Ori.livneh: Re-remove deployment::target [puppet] - 10https://gerrit.wikimedia.org/r/167695 [21:06:43] (03CR) 10Ori.livneh: [C: 032 V: 032] Re-remove deployment::target [puppet] - 10https://gerrit.wikimedia.org/r/167695 (owner: 10Ori.livneh) [21:06:54] (03PS1) 10Legoktm: Set $wgCentralAuthEnableGlobalRenameRequest = true on beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167699 [21:06:54] oh btw, 2.0 has migrated to testing already :) [21:06:58] nutcracker too! [21:07:00] nice! [21:07:11] https://packages.debian.org/search?keywords=nutcracker [21:07:12] nice indeed! [21:07:25] also nutcracker upstream completely ignored my email [21:07:34] where I said "hey I did that -- also, maybe you want to tag a new version?" [21:07:52] oh [21:07:55] they tagged 0.4.0 [21:08:02] !log Deployed iegreview 203d509 (Disable strict variables for twig) [21:08:09] Logged the message, Master [21:08:11] !log baham running gdnsd-2.1.0 test pkg [21:08:16] Logged the message, Master [21:08:19] a new person apparently [21:08:37] !log redis-cli srem "deploy:ocg/ocg:minions" tantalum.eqiad.wmnet [21:08:42] Logged the message, Master [21:08:49] still no reply in my inbox :( [21:08:57] cscott: can you do me a favor? i'm doing 15 things atm. can you file a bug about that? it should be automatic. [21:09:57] (and more secure, for that matter..) [21:10:13] ori: meeting ping :) [21:11:53] O: gdnsd: hardening-no-fortify-functions usr/lib/x86_64-linux-gnu/gdnsd/gdnsd_extmon_helper [21:12:04] hah [21:12:18] https://lintian.debian.org/tags/hardening-no-fortify-functions.html [21:12:20] seems very odd, I wonder if it's real and I screwed something up in automake [21:12:21] this is the tag [21:12:25] (03CR) 10Dzahn: "reason was wrong VLAN - fixed by paravoid - carbon dhcpd: DHCPACK on 208.80.154.63 to 84:2b:2b:fd:bd:0e thanks" [puppet] - 10https://gerrit.wikimedia.org/r/167149 (owner: 10Dzahn) [21:12:27] N: Severity: normal, Certainty: wild-guess [21:12:50] says it all :P [21:14:02] still, it shouldn't fail there. I bet it's actually telling me something's dumb in automake for gdnsd [21:14:18] Either there are no potentially unfortified functions called by any routines, all unfortified calls have already been fully validated at compile-time, or the package was not built with the default Debian compiler flags defined by dpkg-buildflags [21:15:24] oh heh [21:15:32] it's entirely possible it doesn't call any of them :) [21:15:38] yup :) [21:15:41] (03PS1) 10Ottomata: [WIP] Add cassandra role [puppet] - 10https://gerrit.wikimedia.org/r/167700 [21:15:42] the tool shouldn't fail on that, that's stupid :p [21:15:56] build logs say that the gcc flags for extmon are fine [21:16:29] I'm not sure exactly how debian goes about setting its policy flags, but I hope it didn't downgrade the package's own, too [21:16:52] (I think the only two that woule be upgrades for most distros are -fstack-protector-all and -ftrapv) [21:17:03] -D_FORTIFY_SOURCE=2 -DNDEBUG -D_FORTIFY_SOURCE=2 -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -pthread -fstack-protector-all -ftrapv -fno-common -pipe -Wall -Wextra -Wbad-function-cast [21:17:17] (etc.) 
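Given the flag dump above, a quick way to test the lintian wild-guess against the built artifact is to inspect the helper directly; the binary path is the one from the lintian output, both tools are stock Debian, and this is only a sketch of the check rather than something run against this exact build:

$ dpkg-buildflags --get CPPFLAGS
-D_FORTIFY_SOURCE=2
$ hardening-check /usr/lib/x86_64-linux-gnu/gdnsd/gdnsd_extmon_helper

If the helper simply never calls a fortifiable libc function there is nothing for the check to find, which is the "wild-guess" case the tag description itself concedes.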
[21:17:42] seems ok, assuming gcc is sane about -strong being overriden by -all [21:17:53] speaking of gcc, did you read about the new 5.0 features? [21:18:01] 4.9/5.0 [21:18:02] http://developerblog.redhat.com/2014/10/16/gcc-undefined-behavior-sanitizer-ubsan/ [21:18:25] no [21:18:42] that sounds awesome. clang has some similar stuff already that's pretty strong too. [21:19:04] (I made some qa scripts for a couple of clang analyzer, but the codebase is failing the others and I haven't had time to track it down and fix them all, or even tell if they're real) [21:19:24] good ol' competition [21:19:44] http://clang.llvm.org/docs/UsersManual.html#controlling-code-generation [21:20:15] so with your dh-systemd changes [21:20:22] I don't see a unit file being shipped [21:20:31] nor do I see postinst/postrm to do anything systemd-related [21:20:39] that sucks [21:20:54] https://wiki.debian.org/systemd/Packaging ? [21:21:29] I didn't add a debian/gdnsd.systemd [21:21:45] yeah but it builds its own, and installs it within wherever the installation destdir is, or should [21:21:46] er, debian/gdnsd.service even [21:21:56] that's what I remembered [21:22:02] I would think dh-systemd would pick that up [21:22:02] trying to install radium, i get a DHCP ACK, but then PXE-E11: ARP timeout PXE-E38: TFTP cannot open connection . i already added to install-server/netboot.cfg .. where else do i look ? [21:22:18] syslog on carbon tells me about the DHCP working.. but then .. [21:22:21] I guess not :) [21:22:36] did the build/install log show it generating/installing it at the make level? [21:22:41] mutante: someone was complaining about that again recently; maybe tftp died or something? [21:23:02] bblack: there's two ways to prepare packages basically [21:23:27] if you just have one binary, you basically set destdir to say "debian/gdnsd", which is then packed as-is [21:23:36] since we build multiple packages though (gdnsd-dev) [21:23:40] destdir is set to debian/tmp [21:23:50] and then .install files hand-pick files into their respective packages [21:24:11] in the absence of a directive, packaging can't know where that specific file belongs under the gdnsd or gdnsd-dev package [21:24:26] ah [21:25:53] ok, so it might pick it up if we add /usr/lib/systemd/system/gdnsd.service there? [21:26:00] (or whatever the correct path is for debian?) [21:26:19] I'm not sure if it's going to pick it up like that or if I need to override dh_installinit & pass an argument [21:26:22] let me check :) [21:27:22] (or alternatively, I guess you could do something ugly like stick in a step to cp sysd/gdnsd.service -> debian/gdnsd.service [21:28:01] I thought part of the Systemd Philosophy(tm) was that upstream was supposed to ship service files though, unlike packaging for previous systems [21:32:15] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected [21:35:33] zscan_djb.c: In function ‘zscan_foreach_record’: [21:35:33] zscan_djb.c:498:11: warning: variable ‘failed’ might be clobbered by ‘longjmp’ or ‘vfork’ [-Wclobbered] bool failed = false; [21:35:36] (btw) [21:36:18] (03PS8) 10Ottomata: Initial commit of Cassandra puppet module [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/166888 [21:38:06] bblack: I don't get the service file built at all [21:39:32] are you sure --with-systemdsystemunitdir is the default? 
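The "default" being asked about there comes from pkg-config at configure time; a minimal check, assuming systemd's .pc file is actually present in the build chroot (the value shown is the usual Debian one):

$ pkg-config systemd --variable=systemdsystemunitdir
/lib/systemd/system

And even once the unit file is generated and installed into debian/tmp, the multi-package split described above still needs a matching line in debian/gdnsd.install - e.g. lib/systemd/system/gdnsd.service - so dh_install knows the file belongs to the gdnsd binary package rather than gdnsd-dev; that path is an assumption based on the Debian default just shown.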
[21:40:15] #systemdsystemunit_DATA = gdnsd.service [21:40:20] #gdnsd.service: gdnsd.service.tmpl [21:40:21] # $(AM_V_GEN)sed 's|@GDNSD_SBINDIR[@]|$(sbindir)|g' <$< >$@ [21:40:23] so yeah, commented-out [21:40:40] well that's determined from pkg-config [21:40:49] based on what? [21:41:08] oh this system doesn't have systemd installed [21:41:17] it's a chroot :) [21:41:20] well based on redhat's package building guide. I assumed since they run the systemd indoctrination program, their info was canonical [21:42:15] needs libsystemd-dev I guess [21:42:17] * paravoid checks [21:42:39] yeah but can that be distro-conditional so that it doesn't screw up building on old non-systemd distros? [21:42:54] (btw, why wouldn't systemd packages be installed in the basic build env for a distro that uses systemd?) [21:42:55] libsystemd-dev ships /usr/lib/x86_64-linux-gnu/pkgconfig/libsystemd.pc [21:43:05] systemd.pc is shipped by... systemd [21:43:54] well first of all, it's a bootstrapping problem isn't it [21:44:15] second, most packages are built in chroots [21:44:23] there's nothing dictating that systemd must be installed in the chroot [21:44:47] and third, there's a big vote going on right now about whether jessie should allow people to run different init systems if they wish so (loosely coupled and all that :) [21:44:48] yeah but "man systemd" ends up saying: [21:44:50] Packages that want to install unit files shall place them in the directory returned by pkg-config systemd --variable=systemdsystemunitdir [21:46:46] yeah I donno what the right answer is there. [21:47:01] I'll figure it out [21:47:05] do you also use libsystemd features? [21:47:20] I suppose, as a packager, since you know what install paths you'll be using and you know how dh-systemd works, it's probably ok to just copy sysd/gdnsd.service.tmpl to debian/gdnsd.service and fill in the blanks [21:47:34] no, there's no libsystemd usages [21:47:44] I could do that yeah [21:47:52] but I'm curious of how it should be done properly [21:47:55] I'll figure it out :) [21:48:10] (03PS1) 10Ori.livneh: hhvm: add hhvm-watch-mem [puppet] - 10https://gerrit.wikimedia.org/r/167710 [21:48:31] a conditional build-depends on libsystemd-dev that's conditional on "this distro is a systemd distro?" (except then how does that ever handle a distro that allows choice?) [21:48:56] well no, you can link to libsystemd even if the init system isn't systemd [21:49:17] right but you don't want to end up with a requires on systemd to boot [21:49:29] oh I guess without linker deps, it wouldn't require it? [21:50:29] (03CR) 10Aaron Schulz: "I'm changing it to being private. The ACLs could use updating (I already made https://gerrit.wikimedia.org/r/#/c/167329/ for that)." [puppet] - 10https://gerrit.wikimedia.org/r/167310 (owner: 10Aaron Schulz) [21:50:57] (03CR) 10Ori.livneh: [C: 032] "I want to make sure this is in place for debugging the current issue; I'm open to reverting or improving this later." [puppet] - 10https://gerrit.wikimedia.org/r/167710 (owner: 10Ori.livneh) [21:59:47] (03PS1) 10Andrew Bogott: DRAFT: move openstack files and manifests into a module [puppet] - 10https://gerrit.wikimedia.org/r/167713 [22:06:24] (03PS5) 10Chad: Decom deployment-elastic0[1-3] from beta [puppet] - 10https://gerrit.wikimedia.org/r/167010 [22:15:03] (03PS2) 10Andrew Bogott: Move openstack files and manifests into a module [puppet] - 10https://gerrit.wikimedia.org/r/167713 [22:16:02] (03CR) 10Andrew Bogott: "This works on labs, and is no /more/ of a mess than it was before." 
[puppet] - 10https://gerrit.wikimedia.org/r/167713 (owner: 10Andrew Bogott) [22:17:18] (03CR) 10BryanDavis: [C: 031] Set $wgCentralAuthEnableGlobalRenameRequest = true on beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167699 (owner: 10Legoktm) [22:17:45] (03CR) 10Dzahn: [C: 032] "was removed in Change-Id: I7cd3461b12c12cea" [puppet] - 10https://gerrit.wikimedia.org/r/166894 (owner: 10Dzahn) [22:28:04] !log enabled puppet on mw1189 [22:28:09] Logged the message, Master [22:28:46] PROBLEM - puppet last run on mw1189 is CRITICAL: CRITICAL: Puppet last ran 26636 seconds ago, expected 14400 [22:29:45] RECOVERY - puppet last run on mw1189 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [22:35:02] (03Abandoned) 10Dzahn: add annual.wm to misc varnish config [puppet] - 10https://gerrit.wikimedia.org/r/167165 (owner: 10Dzahn) [22:36:25] (03CR) 10Chad: [C: 032 V: 032] Add its-phabricator from d425a5ded909ee73df53d5e6d91d28014d0be375 [gerrit/plugins] - 10https://gerrit.wikimedia.org/r/167525 (owner: 10QChris) [22:37:38] !log doing load testing on mw1189 [22:37:43] Logged the message, Master [22:38:04] hi tim :) [22:39:22] bblack: the advice I got from #debian-systemd is that build-depending on "systemd" is normal [22:39:52] even though it will, under certain circumstances (e.g. mk-build-deps but not apt-get build-dep) switch your init system [22:40:17] I think I'll just copy to debian/ :) [22:41:15] hi [22:41:28] (03CR) 10Halfak: "(HI! I maintain Snuggle.)" [puppet] - 10https://gerrit.wikimedia.org/r/164690 (owner: 10Dzahn) [22:42:28] paravoid: :) [22:42:59] oh, got newer/better advice [22:43:54] please tell me it was advice on how to make a package switch their init system back off of systemd on package install :) [22:44:13] http://lists.freedesktop.org/archives/systemd-devel/2011-June/002736.html [22:44:16] man 7 daemon [22:44:24] the default configure.ac snippet that systemd recommends [22:44:40] allows for a --with-systemdsystemunitdir=/lib/systemd/system [22:44:47] yes yes, I have that [22:44:54] aaaargh :) [22:45:00] (well, something very much like it anyways, works the same) [22:45:43] (mine also accepts --without-systemdsystemunitdir to disable installation even if pkg-config says we should) [22:47:18] so if you force it to install it via --with-systemdsysteunitdir, does dh-systemd then pick it up automagically? [22:47:48] I'll try :) [22:48:08] that + .install of course [22:48:17] oh right [22:49:27] this will get even more interesting down the road, when we'll want to do a "systemctl reload gdnsd" post-install on upgrades, if the previous version was also sufficiently new, and skip the default stop -> upgrade -> start cycle. [22:52:09] (03CR) 10coren: [C: 031] "Good to die now that it has no use." [puppet] - 10https://gerrit.wikimedia.org/r/164690 (owner: 10Dzahn) [22:52:50] there's dh_installinit -r [22:52:57] then you can do magic in postrm yourself [22:53:05] there's a big transition diagram for that [22:53:21] (03CR) 10Dzahn: "Matanya: re: compiler .. it's DNS though.." 
[dns] - 10https://gerrit.wikimedia.org/r/166914 (owner: 10Dzahn) [22:53:23] https://wiki.debian.org/MaintainerScripts?action=AttachFile&do=get&target=upgrade.png [22:53:26] there :) [22:53:50] mutante: ignore me :) but merge the deletion of snuggle cert [22:54:12] after varnish's flowcharts, that diagram looks almost not worth diagramming :p [22:54:30] (03PS3) 10Dzahn: delete snuggle.wikimedia.org SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/164690 [22:54:56] i yet have to understand how a change that does nothing but delete a file still needs rebasing [22:55:19] jgit being suck? [22:55:28] (03CR) 10Dzahn: [C: 032] delete snuggle.wikimedia.org SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/164690 (owner: 10Dzahn) [22:59:54] bblack: yup, works [23:00:04] RoanKattouw, ^d, marktraceur, MaxSem: Dear anthropoid, the time has come. Please deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141020T2300). [23:00:29] Who's got it? [23:00:37] <^d> I can do it [23:00:44] <^d> legoktm: Ping for swat. [23:00:50] pong [23:01:00] <^d> Trivial, it's beta-only. [23:01:04] <^d> Just syncing for completeness. [23:01:07] bblack: but: W: gdnsd: systemd-service-file-refers-to-obsolete-target lib/systemd/system/gdnsd.service syslog.target [23:01:13] heh [23:01:18] Some targets are obsolete by now, e.g. syslog.target or dbus.target. For example, declaring After=syslog.target is unnecessary by now because syslog is socket-activated and will therefore be started when needed. [23:02:46] (03CR) 10Chad: [C: 032] Set $wgCentralAuthEnableGlobalRenameRequest = true on beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167699 (owner: 10Legoktm) [23:02:55] (03Merged) 10jenkins-bot: Set $wgCentralAuthEnableGlobalRenameRequest = true on beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167699 (owner: 10Legoktm) [23:03:42] bblack: also, are you sure you want network.target and not network-online.target? [23:03:45] (03PS4) 10Dzahn: wikimedia.org service aliases - indentation fixes [dns] - 10https://gerrit.wikimedia.org/r/166914 [23:03:58] !log demon Synchronized wmf-config/CommonSettings-labs.php: no-op, for completeness (duration: 00m 05s) [23:04:01] <^d> legoktm: ^ [23:04:04] Logged the message, Master [23:04:26] does it take a while for it to show up on beta? [23:04:43] <^d> Yeah, 5mins +/- a bit [23:04:50] ok, thanks [23:05:05] paravoid: that all seems pretty hazy to me actually. But seeing as the default is listen=any, the distinction would only matter for a non-default configuration in the first place. [23:05:19] it's live :D [23:05:20] <^d> legoktm: Nothing queued in zuul, probably live now. [23:05:21] <^d> :) [23:05:40] (03CR) 10Dzahn: [C: 032] wikimedia.org service aliases - indentation fixes [dns] - 10https://gerrit.wikimedia.org/r/166914 (owner: 10Dzahn) [23:05:56] PROBLEM - Apache HTTP on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:06:15] PROBLEM - HHVM rendering on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:06:42] (also, re: obsolete syslog.target, since systemd exposes no versioning and has no standards for upgrade paths, there's no such thing as "obsolete". It might be necessary on another system, e.g. some version of Fedora) [23:06:47] paravoid: ^ [23:06:48] (03PS1) 10Gergő Tisza: Add ImageMetrics extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167727 (https://bugzilla.wikimedia.org/70402) [23:07:11] yeah... 
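For completeness, the fix that warning points at is just dropping syslog.target from the unit's After= line; a minimal sketch of the relevant section only (the real gdnsd.service.tmpl isn't shown in this log, so the wording is illustrative):

[Unit]
Description=gdnsd authoritative DNS daemon
# no After=syslog.target needed: the syslog/journal socket is activated on demand
After=network.target

The network.target vs network-online.target question is a separate one; as noted in the same exchange, with listen=any as the default and free-bind retries for explicit addresses, neither choice creates a hard ordering dependency.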
[23:07:17] bblack: the output of authdns-update is much more verbose now ? [23:07:18] I think I'll just copy for now then and modify [23:07:27] it's fairly trivial [23:07:27] mutante: is it? [23:07:39] yea, i see all those "info" lines i don't recall [23:07:57] oh, yeah. just on baham right? [23:08:13] i just did it on rubidium [23:08:13] I need to update the scripts for the stdout-vs-stderr think I guess [23:08:24] i'm super lazy, so i ssh to ns0 :p [23:08:32] right but the verbosity was from the remote update of baham right? [23:09:17] bblack: yes, it was [23:11:00] Coren: fyi .. :) https://gerrit.wikimedia.org/r/#/c/15561/ look at the dates [23:12:10] (03PS3) 10Dzahn: gerrit: Remove duplicate mirrors [puppet] - 10https://gerrit.wikimedia.org/r/167162 (https://bugzilla.wikimedia.org/68054) (owner: 10Krinkle) [23:12:42] paravoid: actually I wouldn't muck with the network.target part [23:12:51] yeah I wo't [23:12:55] *won't [23:13:14] paravoid: (because not only is listen=any default, but we also default to retrying explicit listen addrs with IP_FREEBIND type stuff, so actually there's no hard dependency) [23:13:22] (03CR) 10Dzahn: [C: 032] gerrit: Remove duplicate mirrors [puppet] - 10https://gerrit.wikimedia.org/r/167162 (https://bugzilla.wikimedia.org/68054) (owner: 10Krinkle) [23:15:31] Wants=to-run-at-the-end-when-normal-stuff-is-working,-this-isnt-a-desktop-package-that-cares-about-20ms-more-boot-time-:p [23:15:42] (03CR) 10Dzahn: [C: 032] More elasticsearch tools [puppet] - 10https://gerrit.wikimedia.org/r/164270 (owner: 10Chad) [23:16:40] (03CR) 10Gergő Tisza: [C: 04-2] "Not to be deployed until Thursday." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167727 (https://bugzilla.wikimedia.org/70402) (owner: 10Gergő Tisza) [23:17:15] RECOVERY - Apache HTTP on mw1189 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.051 second response time [23:17:16] RECOVERY - HHVM rendering on mw1189 is OK: HTTP OK: HTTP/1.1 200 OK - 72812 bytes in 0.173 second response time [23:18:45] PROBLEM - DPKG on db1004 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:20:13] !log installing package upgrades on iron [23:20:20] Logged the message, Master [23:20:28] PROBLEM - Apache HTTP on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:20:35] PROBLEM - HHVM rendering on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:23:45] RECOVERY - HHVM rendering on mw1189 is OK: HTTP OK: HTTP/1.1 200 OK - 72812 bytes in 2.722 second response time [23:23:57] RECOVERY - DPKG on db1004 is OK: All packages OK [23:24:36] RECOVERY - Apache HTTP on mw1189 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.358 second response time [23:31:05] !log upgrade db1004 trusty and reboot [23:31:13] Logged the message, Master [23:32:41] (03CR) 10Dzahn: [C: 032] labmon: Add kart to betalabs monitoring [puppet] - 10https://gerrit.wikimedia.org/r/167560 (owner: 10Yuvipanda) [23:36:04] (03PS1) 10Legoktm: Set $wgCentralAuthAutoMigrate = false on beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167730 [23:36:11] bd808: ^ want to deploy that? [23:38:14] (03CR) 10Dzahn: "Error: Could not find any contact matching 'kart' :/" [puppet] - 10https://gerrit.wikimedia.org/r/167560 (owner: 10Yuvipanda) [23:40:31] apergos: any chance to look at labswiki dump blacklisting? 
[23:41:00] (03CR) 10Dzahn: "added contact for Kartik in private repo" [puppet] - 10https://gerrit.wikimedia.org/r/167560 (owner: 10Yuvipanda) [23:42:44] (03CR) 10Dzahn: [C: 032] "should fix https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=citoid.svc.eqiad.wmnet&nostatusheader" [puppet] - 10https://gerrit.wikimedia.org/r/165731 (owner: 10Catrope) [23:44:52] (03CR) 10Dzahn: [C: 031] "-m, --machine" [puppet] - 10https://gerrit.wikimedia.org/r/166221 (owner: 10coren) [23:48:53] (03PS3) 10Dzahn: Fix Elasticsearch in ci [puppet] - 10https://gerrit.wikimedia.org/r/160524 (owner: 10Manybubbles) [23:50:14] (03CR) 10Dzahn: [C: 032] Fix Elasticsearch in ci [puppet] - 10https://gerrit.wikimedia.org/r/160524 (owner: 10Manybubbles) [23:51:50] * bd808 sees legoktm's ping here long after the ping [23:52:43] legoktm: Deploying that will make testing the rename stuff easier. What will it break in beta? [23:53:44] (03CR) 10Dzahn: "see revert and last comment on https://gerrit.wikimedia.org/r/#/c/162520/ .. Reedy review?" [puppet] - 10https://gerrit.wikimedia.org/r/162768 (owner: 10MaxSem) [23:54:25] udp2log 49G Oct 20 23:54 messagecache.log [23:54:33] heh, that's pretty verbose :) [23:54:40] (03CR) 10Dzahn: [C: 031] Add nickel to $MONITORING_HOSTS network, rename ferm::rule icinga-all to monitoring-all [puppet] - 10https://gerrit.wikimedia.org/r/160802 (owner: 10Ottomata) [23:54:55] o_0 really verbose [23:55:22] tail -f it ;) [23:56:47] (03CR) 10Dzahn: "manybubbles, yes, it does. somebody please link to existing ticket or make a new one" [puppet] - 10https://gerrit.wikimedia.org/r/152724 (owner: 10Hoo man) [23:57:16] meh, there is a ticket [23:57:20] but it's kind of stale [23:57:34] (03Abandoned) 10Hoo man: Allow "hoo" to sudo into datasets [puppet] - 10https://gerrit.wikimedia.org/r/152724 (owner: 10Hoo man) [23:57:47] (03CR) 10Dzahn: "here it is https://rt.wikimedia.org/Ticket/Display.html?id=8286" [puppet] - 10https://gerrit.wikimedia.org/r/152724 (owner: 10Hoo man) [23:59:01] (03PS1) 10Aaron Schulz: Turn off spammy message cache log [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167731 [23:59:11] (03CR) 10Dzahn: "ticket status is neither open nor closed, it's in "stalled"" [puppet] - 10https://gerrit.wikimedia.org/r/152724 (owner: 10Hoo man)
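For the 49G messagecache.log mentioned above, a quick sanity check while the spammy-log change works its way through review might look like this; the path is an assumption for wherever udp2log writes on the log host, not a verified location:

$ ls -lh /a/mw-log/messagecache.log
$ tail -n 20 /a/mw-log/messagecache.log

i.e. confirm the size, glance at what is actually being logged, and re-check after the config change is synced to confirm growth has stopped.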