[00:28:44] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures
[00:32:54] PROBLEM - puppet last run on amssq42 is CRITICAL: CRITICAL: Puppet has 1 failures
[00:44:19] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: /mnt/tmpfs 5.8GB (= 5.0GB critical): /srv/deployment/ocg/output 4009055266B: /srv/deployment/ocg/postmortem 996963B: ocg_job_status 11532 msg: ocg_render_job_queue 0 msg
[00:45:13] RECOVERY - OCG health on ocg1003 is OK: OK: /mnt/tmpfs 0B: /srv/deployment/ocg/output 4009055266B: /srv/deployment/ocg/postmortem 1072969B: ocg_job_status 11532 msg: ocg_render_job_queue 0 msg
[00:46:02] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[00:50:17] RECOVERY - puppet last run on amssq42 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[01:31:38] (PS1) Jackmcbarn: Don't allow granting a removed group [mediawiki-config] - https://gerrit.wikimedia.org/r/160372
[01:32:47] what
[01:32:48] :D
[01:34:33] (CR) Ori.livneh: "Should this be fixed in core, instead, by pruning entries that don't correspond to an actual group? We did something similar in https://ww" [mediawiki-config] - https://gerrit.wikimedia.org/r/160372 (owner: Jackmcbarn)
[01:35:51] (CR) Ori.livneh: [C: 031] "Well, "instead" is the wrong word; it certainly doesn't hurt to do this." [mediawiki-config] - https://gerrit.wikimedia.org/r/160372 (owner: Jackmcbarn)
[01:41:10] (PS1) Hoo man: Don't leak global $path [mediawiki-config] - https://gerrit.wikimedia.org/r/160374
[01:42:22] (CR) Ori.livneh: [C: 031] Don't leak global $path [mediawiki-config] - https://gerrit.wikimedia.org/r/160374 (owner: Hoo man)
[01:46:40] (CR) Hoo man: [C: 032] "No-op" [mediawiki-config] - https://gerrit.wikimedia.org/r/160374 (owner: Hoo man)
[01:46:44] (Merged) jenkins-bot: Don't leak global $path [mediawiki-config] - https://gerrit.wikimedia.org/r/160374 (owner: Hoo man)
[01:47:20] !log hoo Synchronized wmf-config/flaggedrevs.php: Remove global $path (duration: 00m 10s)
[01:47:28] Logged the message, Master
[01:47:36] !log hoo Synchronized wmf-config/liquidthreads.php: Remove global $path (duration: 00m 07s)
[01:47:42] Logged the message, Master
[01:59:28] PROBLEM - puppet last run on mw1199 is CRITICAL: CRITICAL: Puppet has 1 failures
[02:02:17] (CR) Jackmcbarn: "I02d3f00142ca1cb0cdcbf30e79fecb3c96e96405" [mediawiki-config] - https://gerrit.wikimedia.org/r/160372 (owner: Jackmcbarn)
[02:03:22] (PS2) Jackmcbarn: Don't allow granting a removed group [mediawiki-config] - https://gerrit.wikimedia.org/r/160372
[02:03:25] !log LocalisationUpdate failed: mwversionsinuse returned empty list
[02:03:30] Logged the message, Master
[02:06:47] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 3608 MB (3% inode=99%):
[02:16:45] RECOVERY - puppet last run on mw1199 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[02:19:08] (PS1) Hoo man: Fix l10nupdate by correctly adding scap directories to $PATH [puppet] - https://gerrit.wikimedia.org/r/160380
[02:19:10] ori_: ^
[02:19:31] Tiny regression you introduced in 132038174a0e2734f660fd6d2e86dc5caf033aca and that breaks l10nupdate
[02:19:38] * l10nupdate-1
[02:21:14] (PS2) Hoo man: Add scap directories to $PATH for l10nupdate [puppet] - https://gerrit.wikimedia.org/r/160380
[02:47:43] (PS1) MZMcBride: Various tweaks to people.wikimedia.org index page [puppet] - https://gerrit.wikimedia.org/r/160383
[02:57:08] (CR) Ori.livneh: "But on tin, /usr/local/bin/mwversionsinuse is still a symlink to /srv/deployment/scap/scap/bin/mwversionsinuse. So how did this break?" [puppet] - https://gerrit.wikimedia.org/r/160380 (owner: Hoo man)
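For context on the regression being debugged here: l10nupdate shells out to scap helpers such as mwversionsinuse, so the question is what $PATH the l10nupdate user actually sees. A quick check of the kind ori is describing might look like this (a sketch, not the actual fix; runnable on tin, paths assumed):

    # does the l10nupdate user resolve the scap helper at all?
    sudo -u l10nupdate -i which mwversionsinuse || echo 'not on PATH'
    # the symlink ori mentions in his review comment:
    ls -l /usr/local/bin/mwversionsinuse   # -> /srv/deployment/scap/scap/bin/mwversionsinuse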
[03:00:55] RECOVERY - Disk space on virt0 is OK: DISK OK
[03:02:15] (CR) Legoktm: [C: 031] Various tweaks to people.wikimedia.org index page [puppet] - https://gerrit.wikimedia.org/r/160383 (owner: MZMcBride)
[03:07:02] (PS1) Springle: depool db1036 and db1062 [mediawiki-config] - https://gerrit.wikimedia.org/r/160384
[03:09:00] (CR) Springle: [C: 032] depool db1036 and db1062 [mediawiki-config] - https://gerrit.wikimedia.org/r/160384 (owner: Springle)
[03:09:05] (Merged) jenkins-bot: depool db1036 and db1062 [mediawiki-config] - https://gerrit.wikimedia.org/r/160384 (owner: Springle)
[03:09:45] !log springle Synchronized wmf-config/db-eqiad.php: depool db1036 (duration: 00m 09s)
[03:09:50] Logged the message, Master
[03:42:26] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[03:56:37] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[04:04:46] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[04:17:57] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[04:59:46] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.138:9200/_cluster/health error while fetching: Request timed out.
[04:59:56] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.137:9200/_cluster/health error while fetching: Request timed out.
[05:00:36] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.136:9200/_cluster/health error while fetching: Request timed out.
[05:01:56] PROBLEM - ElasticSearch health check on logstash1003 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.136
[05:01:57] PROBLEM - ElasticSearch health check on logstash1002 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.137
[05:02:47] PROBLEM - ElasticSearch health check on logstash1001 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.138
[05:03:37] RECOVERY - ElasticSearch health check on logstash1001 is OK: OK - elasticsearch (production-logstash-eqiad) is running. status: green: timed_out: false: number_of_nodes: 2: number_of_data_nodes: 2: active_primary_shards: 36: active_shards: 103: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0
[05:03:47] (PS1) ArielGlenn: lab db replica: don't show log_params for deleted/suppressed logs [software] - https://gerrit.wikimedia.org/r/160393
[05:03:56] RECOVERY - ElasticSearch health check on logstash1003 is OK: OK - elasticsearch (production-logstash-eqiad) is running. status: green: timed_out: false: number_of_nodes: 2: number_of_data_nodes: 2: active_primary_shards: 36: active_shards: 103: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0
[05:04:48] RECOVERY - ElasticSearch health check on logstash1002 is OK: OK - elasticsearch (production-logstash-eqiad) is running. status: green: timed_out: false: number_of_nodes: 3: number_of_data_nodes: 3: active_primary_shards: 36: active_shards: 103: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0
[05:04:58] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.138:9200/_cluster/health error while fetching: Request timed out.
[05:06:12] (CR) ArielGlenn: "Not sure if the check (log_delete nonzero) is too broad, can you have a look Coren?" [software] - https://gerrit.wikimedia.org/r/160393 (owner: ArielGlenn)
[05:07:50] PROBLEM - ElasticSearch health check on logstash1001 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.138
[05:10:36] RECOVERY - ElasticSearch health check on logstash1001 is OK: OK - elasticsearch (production-logstash-eqiad) is running. status: green: timed_out: false: number_of_nodes: 3: number_of_data_nodes: 3: active_primary_shards: 36: active_shards: 103: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0
[05:11:56] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 69 threshold =0.1% breach: {u'status': u'red', u'number_of_nodes': 3, u'unassigned_shards': 63, u'timed_out': False, u'active_primary_shards': 34, u'cluster_name': u'production-logstash-eqiad', u'relocating_shards': 0, u'active_shards': 34, u'initializing_shards': 6, u'number_of_data_nodes': 3}
[05:12:37] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 69 threshold =0.1% breach: {u'status': u'red', u'number_of_nodes': 3, u'unassigned_shards': 63, u'timed_out': False, u'active_primary_shards': 34, u'cluster_name': u'production-logstash-eqiad', u'relocating_shards': 0, u'active_shards': 34, u'initializing_shards': 6, u'number_of_data_nodes': 3}
[05:12:56] PROBLEM - puppet last run on mw1210 is CRITICAL: CRITICAL: Puppet has 1 failures
[05:12:56] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 69 threshold =0.1% breach: {u'status': u'red', u'number_of_nodes': 3, u'unassigned_shards': 63, u'timed_out': False, u'active_primary_shards': 34, u'cluster_name': u'production-logstash-eqiad', u'relocating_shards': 0, u'active_shards': 34, u'initializing_shards': 6, u'number_of_data_nodes': 3}
[05:14:47] PROBLEM - ElasticSearch health check on logstash1001 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.138
[05:15:07] PROBLEM - ElasticSearch health check on logstash1002 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.137
[05:15:11] PROBLEM - ElasticSearch health check on logstash1003 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.136
[05:21:56] PROBLEM - ElasticSearch health check on logstash1001 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.138
[05:22:16] PROBLEM - ElasticSearch health check on logstash1003 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.136
[05:24:56] PROBLEM - ElasticSearch health check on logstash1001 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.138
[05:25:17] PROBLEM - ElasticSearch health check on logstash1003 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.136
[05:30:54] (PS1) Ori.livneh: HHVM: increase JitAColdSize to 60 MiB [puppet] - https://gerrit.wikimedia.org/r/160394
[05:32:06] RECOVERY - puppet last run on mw1210 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[05:42:29] (PS3) Ori.livneh: Add scap directories to $PATH for l10nupdate [puppet] - https://gerrit.wikimedia.org/r/160380 (owner: Hoo man)
[05:48:16] PROBLEM - ElasticSearch health check on logstash1001 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.138
[05:48:27] PROBLEM - ElasticSearch health check on logstash1003 is CRITICAL: CRITICAL - elasticsearch (production-logstash-eqiad) is running. status: red: timed_out: false: number_of_nodes: 2: number_of_data_nodes: 2: active_primary_shards: 14: active_shards: 16: relocating_shards: 0: initializing_shards: 3: unassigned_shards: 84
[05:50:24] (PS1) Springle: prepare db1036 for upgrade [puppet] - https://gerrit.wikimedia.org/r/160397
[05:51:26] (CR) Springle: [C: 032] prepare db1036 for upgrade [puppet] - https://gerrit.wikimedia.org/r/160397 (owner: Springle)
[05:53:57] (CR) Ori.livneh: [C: 032] Add scap directories to $PATH for l10nupdate [puppet] - https://gerrit.wikimedia.org/r/160380 (owner: Hoo man)
[06:04:28] PROBLEM - ElasticSearch health check on logstash1001 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.138
[06:05:48] PROBLEM - ElasticSearch health check on logstash1003 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.136
[06:10:50] PROBLEM - ElasticSearch health check on logstash1003 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.136
[06:11:28] PROBLEM - ElasticSearch health check on logstash1001 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.138
[06:23:08] PROBLEM - puppet last run on mw1161 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:23:48] PROBLEM - puppet last run on fenari is CRITICAL: CRITICAL: Puppet has 1 failures
[06:27:58] PROBLEM - puppet last run on virt1006 is CRITICAL: CRITICAL: Epic puppet fail
[06:28:10] PROBLEM - puppet last run on gallium is CRITICAL: CRITICAL: Epic puppet fail
[06:28:29] PROBLEM - puppet last run on mw1114 is CRITICAL: CRITICAL: Epic puppet fail
[06:28:30] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: Epic puppet fail
[06:28:38] PROBLEM - puppet last run on db1034 is CRITICAL: CRITICAL: Epic puppet fail
[06:28:39] PROBLEM - puppet last run on mw1175 is CRITICAL: CRITICAL: Epic puppet fail
[06:28:49] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:28:50] PROBLEM - puppet last run on db1042 is CRITICAL: CRITICAL: Epic puppet fail
[06:29:08] PROBLEM - puppet last run on amssq48 is CRITICAL: CRITICAL: Epic puppet fail
[06:29:39] PROBLEM - puppet last run on cp1061 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:39] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:29:50] PROBLEM - puppet last run on labsdb1003 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:50] PROBLEM - puppet last run on db1002 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:51] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:51] PROBLEM - puppet last run on iron is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:51] PROBLEM - puppet last run on search1018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:59] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:18] PROBLEM - puppet last run on db1018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:18] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:19] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:30] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:37] <_joe_> mmmh I'm not sure this is the usual mod_passenger problem
[06:30:38] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:40] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:40] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:52] <_joe_> ori_: still here?
[06:30:59] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:59] PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:01] <_joe_> what did you change yesterday?
[06:31:08] PROBLEM - puppet last run on amssq55 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:08] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:08] RECOVERY - puppet last run on mw1161 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[06:32:10] <_joe_> mmh it just happened to be worse than usual and I found a couple of bad salt failures
[06:35:03] (PS5) Florianschmidtwelzow: Fix typos in various localizations of dvwiki configurations [mediawiki-config] - https://gerrit.wikimedia.org/r/156821 (https://bugzilla.wikimedia.org/48075) (owner: Gerrit Patch Uploader)
[06:37:20] RECOVERY - puppet last run on fenari is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[06:37:51] PROBLEM - ElasticSearch health check for shards on elastic1002 is CRITICAL: CRITICAL - elasticsearch http://10.64.0.109:9200/_cluster/health error while fetching: Max retries exceeded for url: /_cluster/health
[06:44:54] RECOVERY - puppet last run on db1018 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[06:45:32] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[06:45:34] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[06:45:34] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[06:45:34] RECOVERY - puppet last run on cp1061 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[06:45:34] RECOVERY - puppet last run on labsdb1003 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[06:45:34] RECOVERY - puppet last run on db1002 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[06:45:35] RECOVERY - puppet last run on iron is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[06:45:35] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[06:45:36] RECOVERY - puppet last run on search1018 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[06:45:41] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[06:45:54] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[06:46:21] RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[06:46:31] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[06:46:32] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[06:46:32] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[06:46:32] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[06:46:32] RECOVERY - puppet last run on db1042 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[06:46:41] RECOVERY - puppet last run on virt1006 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[06:46:41] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[06:46:42] RECOVERY - puppet last run on gallium is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[06:47:00] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[06:47:01] RECOVERY - puppet last run on mw1114 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[06:47:21] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[06:47:22] RECOVERY - puppet last run on db1034 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[06:47:22] RECOVERY - puppet last run on mw1175 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[06:47:51] RECOVERY - puppet last run on amssq48 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[06:47:51] RECOVERY - puppet last run on amssq55 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[06:57:32] PROBLEM - puppet last run on ssl3001 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:14:57] RECOVERY - puppet last run on ssl3001 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[07:59:27] https://bugzilla.wikimedia.org/show_bug.cgi?id=35534#c19
[08:06:13] _joe_: a minute of your time ?
[08:06:15] https://gerrit.wikimedia.org/r/#/c/159462/2/modules/rsync/templates/module.erb
[08:06:41] i didn't understand which is the wrong part, the right one or the left one
[08:15:21] <_joe_> matanya: 1 sec
[08:16:34] <_joe_> matanya: what is not clear in my comment?
[08:16:53] <_joe_> if @variable != :undef will always test true
[08:17:42] _joe_: that part i got, i didn't get the logic of testing it
[08:17:45] <_joe_> if @variable will test false if the $variable in puppet is either :undef or nil
[08:17:54] <_joe_> and true otherwise
[08:18:00] oh!
[08:18:03] now i get it
[08:18:27] <_joe_> it's quite tricky and I have to recheck the conditions every time
[08:18:35] * matanya is so slow lately
[08:18:49] probably nicer ways to check this
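To make _joe_'s point concrete, here is a toy reproduction runnable anywhere ruby is installed (the @port variable is made up): Puppet hands an undef variable to an ERB template as Ruby nil, so comparing against :undef is always true, while a bare truthiness test behaves as intended.

    ruby -rerb -e '
      @port = nil   # what an undef Puppet variable looks like inside ERB
      puts ERB.new("<%= @port != :undef %>").result(binding)             # => true (always; the wrong test)
      puts ERB.new("<%= @port ? \"set\" : \"unset\" %>").result(binding) # => unset (the truthiness test _joe_ describes)
    '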
[08:21:51] (PS3) Matanya: rsync: qualify vars [puppet] - https://gerrit.wikimedia.org/r/159462
[08:27:33] hello, I am back around :)
[08:27:46] o/ :D
[08:34:16] <_joe_> hashar: :)))
[08:45:20] and now have to deal with all the mail spam :/
[08:49:18] <_joe_> eh
[08:49:45] <_joe_> (btw, I love the names of both your daughters)
[08:52:18] (CR) Filippo Giunchedi: "LGTM, modulo what Daniel pointed out re: favicon." [puppet] - https://gerrit.wikimedia.org/r/147487 (owner: Reedy)
[08:53:44] (CR) Giuseppe Lavagetto: "Public wikis in general have both the favicon and the robots.txt files that can be personalized by the admins; so this makes sense." [puppet] - https://gerrit.wikimedia.org/r/147487 (owner: Reedy)
[08:55:29] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0]
[08:56:50] (PS7) Alexandros Kosiaris: Introducing Service Cluster A, hosting mathoid [puppet] - https://gerrit.wikimedia.org/r/156576 (https://bugzilla.wikimedia.org/69990) (owner: Physikerwelt)
[08:56:52] (PS1) Alexandros Kosiaris: Assign LVS IP address to mathoid [puppet] - https://gerrit.wikimedia.org/r/160412
[09:09:45] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[09:14:20] re: changing freenode passwords, they also support connecting with a ssl cert and let nickserv identify you on that https://freenode.net/certfp/
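For anyone following that link later, the CertFP setup it describes boils down to generating a long-lived self-signed client certificate and registering its fingerprint with NickServ. A rough sketch paraphrasing the linked page (the filename is arbitrary):

    openssl req -x509 -new -days 1000 -nodes -out freenode.pem -keyout freenode.pem
    # then point your IRC client at freenode.pem and, while identified:
    #   /msg NickServ CERT ADD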
[09:15:56] !log Jenkins: apt-get upgrade on prod slaves (updates php5 / libc / jdk 7)
[09:16:01] Logged the message, Master
[09:17:36] (PS2) Hashar: Please add the domain *.scienceimage.csiro.au to the wgCopyUploadsDomains whitelist. [mediawiki-config] - https://gerrit.wikimedia.org/r/159999 (https://bugzilla.wikimedia.org/70771) (owner: Dan-nl)
[09:19:00] (CR) Hashar: [C: 032] Please add the domain *.scienceimage.csiro.au to the wgCopyUploadsDomains whitelist. [mediawiki-config] - https://gerrit.wikimedia.org/r/159999 (https://bugzilla.wikimedia.org/70771) (owner: Dan-nl)
[09:19:04] (Merged) jenkins-bot: Please add the domain *.scienceimage.csiro.au to the wgCopyUploadsDomains whitelist. [mediawiki-config] - https://gerrit.wikimedia.org/r/159999 (https://bugzilla.wikimedia.org/70771) (owner: Dan-nl)
[09:20:10] !log hashar Synchronized wmf-config/InitialiseSettings.php: *.scienceimage.csiro.au to the wgCopyUploadsDomains {{gerrit|159999}} {{bug|70771}} (duration: 00m 06s)
[09:20:15] Logged the message, Master
[09:24:48] <_joe_> bbl
[09:30:42] Reedy: any thoughts on https://rt.wikimedia.org/Ticket/Display.html?id=5270 ?
[10:30:29] hashar: https://integration.wikimedia.org/ci/job/mwext-Thanks-testextensions-master/181/console any idea why that's not running against Flow master?
[10:30:48] (the exceptions are fixed on master)
[10:31:09] bah bugged :(
[10:31:31] it is supposed to fetch the latest version of flow
[10:31:39] hashar: welcome back!
[10:34:25] (PS1) Filippo Giunchedi: move metrics.wm.o and metrics-api.wm.o behind misc-web [puppet] - https://gerrit.wikimedia.org/r/160419
[10:34:54] godog: thx :)
[10:35:05] hoo: I cleared the workspaces on the slaves and retriggered the job. Same deal
[10:35:19] hoo: I copy the extension dependencies from some Gerrit replica. Maybe they are out of date
[10:35:59] mh, I see
[10:36:06] hoo: Flow is at 2d2362372dc Mon Sep 15 01:20:22 2014 +0000
[10:36:16] which seems up to date
[10:36:26] * hashar blames code
[10:36:49] That's nearly impossible
[10:37:20] Flow itself was fixed and passes now :S
[10:37:54] https://gerrit.wikimedia.org/r/160366
[10:38:17] Do I have shell on these worker slaves? I guess not
[10:42:42] hoo: trying to reproduce
[10:45:55] hashar: welcome back!
[10:47:03] hoo: I have no obvious clue. One should try to reproduce by using the extensions master branches on a fresh wiki install and see what happens / whether it can be reproduced
[10:47:21] hoo: the code seems up to date on the slaves. Definitely has the parent::tearDown() calls that have been added in Flow
[10:47:44] then there is a bunch of class inheritance, maybe one is missing a parent::tearDown() call at some point
[10:48:15] YuviPanda: thx for the icinga monitoring of the beta cluster :D
[10:48:26] hashar: :D more coming!
[10:48:32] YuviPanda: I will eventually have to whine about how the mail notifications are hard to read hehe
[10:48:38] my goal is to make it to a point where you're alerted by icinga rather than by people :)
[10:48:43] hashar: mh... Flow itself passes
[10:48:52] working on checking URLs for them being up as well
[10:49:02] hashar: if you'd like more notifications on some particular thing, do let me know
[10:49:23] YuviPanda: will look at it eventually next week :] Busy processing emails
[10:49:29] cool
[10:51:01] food time!
[10:54:08] bd808S: btw, scap metrics are on graphite.wmflabs.org
[10:58:30] PROBLEM - puppet last run on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:13:00] PROBLEM - puppet last run on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:01:42] (PS2) Filippo Giunchedi: swift: remove ganglia stats via ganglia-logtailer [puppet] - https://gerrit.wikimedia.org/r/159705
[12:25:13] PROBLEM - puppet last run on fenari is CRITICAL: CRITICAL: Puppet has 1 failures
[12:28:36] (CR) coren: [C: 031] "It probably *is* too broad; but the contents of log_param is annoyingly variable and it's difficult to determine whether it contains somet" [software] - https://gerrit.wikimedia.org/r/160393 (owner: ArielGlenn)
[12:33:23] PROBLEM - puppet last run on mw1210 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:40:34] PROBLEM - RAID on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:51:34] RECOVERY - puppet last run on mw1210 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[12:53:26] PROBLEM - check configured eth on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:53:43] RECOVERY - RAID on fenari is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0
[12:56:26] RECOVERY - check configured eth on fenari is OK: NRPE: Unable to read output
[12:57:03] PROBLEM - RAID on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:57:03] PROBLEM - HTTP on fenari is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:59:32] <_joe_> mmmh is someone shutting down fenari and not logging it?
[13:00:41] (PS1) QChris: Remove udp2log stream to Vrije Universiteit Amsterdam [puppet] - https://gerrit.wikimedia.org/r/160435
[13:01:11] <_joe_> UVA is such a nice campus
[13:01:43] :-)
[13:01:43] PROBLEM - DPKG on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:03:19] <_joe_> !log fenari is swapping hard, restarting apache who was eating up all the RAM
[13:03:25] Logged the message, Master
[13:04:34] RECOVERY - DPKG on fenari is OK: All packages OK
[13:05:05] RECOVERY - HTTP on fenari is OK: HTTP OK: HTTP/1.1 200 OK - 4775 bytes in 0.067 second response time
[13:05:05] RECOVERY - RAID on fenari is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0
[13:17:06] RECOVERY - puppet last run on fenari is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[13:44:48] (PS2) BBlack: add login-lb to eqiad protoproxy [puppet] - https://gerrit.wikimedia.org/r/160016
[13:44:50] (PS2) BBlack: Remove dead protoproxy entries completely [puppet] - https://gerrit.wikimedia.org/r/160017
[13:44:52] (PS2) BBlack: Remove dead addrs from protoproxies [puppet] - https://gerrit.wikimedia.org/r/160015
[13:44:54] (PS2) BBlack: add textsvc/uploadsvc in ulsfo for consistency [puppet] - https://gerrit.wikimedia.org/r/160014
[13:44:56] (PS2) BBlack: remove dead esams donatelbsecure [puppet] - https://gerrit.wikimedia.org/r/160013
[13:44:58] (PS2) BBlack: remove old mobile/bits addrs in eqiad+esams [puppet] - https://gerrit.wikimedia.org/r/160012
[13:45:00] (PS2) BBlack: Sanitize text-related addrs for eqiad [puppet] - https://gerrit.wikimedia.org/r/160011
[13:45:02] (PS2) BBlack: Flip ed1a::0 and ed1a::1 in protoproxy [puppet] - https://gerrit.wikimedia.org/r/160010
[13:45:39] bblack is a sleep wal^Wcommitter
[13:45:53] :)
[13:46:02] just rebasing branch
[13:59:27] (CR) BBlack: [C: 032] Flip ed1a::0 and ed1a::1 in protoproxy [puppet] - https://gerrit.wikimedia.org/r/160010 (owner: BBlack)
[13:59:43] (CR) Ottomata: [C: 031] "This is awesome! Thank you!" [puppet] - https://gerrit.wikimedia.org/r/160419 (owner: Filippo Giunchedi)
[14:00:04] aude: Dear anthropoid, the time has come. Please deploy Wikidata (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140915T1400).
[14:00:38] anthropoid :)
[14:00:43] :D
[14:00:59] better than "Dear sir"
[14:01:26] we converted them to have non-gendered greetings
[14:01:33] and they also have a list of messages they pick from
[14:01:37] the other one goes 'Dear human'
[14:01:41] more message suggestions welcome
[14:01:46] heh
[14:02:19] https://en.wikipedia.org/wiki/Anthropoid <-- many interesting different meanings...
[14:02:48] "a genus of cranes" is surprising :D
[14:03:09] haha
[14:03:55] hoo: is local crat rename happening today as planned ?
[14:04:13] Yep
[14:07:29] (PS1) Yuvipanda: db: Use the mysql class instead of the package [puppet/wikimetrics] - https://gerrit.wikimedia.org/r/160445
[14:07:32] milimetric: ^
[14:07:56] (PS1) Yuvipanda: Add .gitreview file [puppet/wikimetrics] - https://gerrit.wikimedia.org/r/160446
[14:07:57] milimetric: and ^
[14:09:03] milimetric: the db patch makes puppet itself put the db files in /srv, and also ensures that the /srv folder is mounted in the correct volume
[14:11:11] thanks much YuviPanda. ottomata & qchris should check ^^
[14:11:28] milimetric: \o/ cool. Haven't tested the first one, tho. it needs manual intervention to save the data files
[14:11:39] but that's ok on wikimetrics1 since I already did it
[14:12:41] YuviPanda: if you want to use a mysql class
[14:12:46] you should probably do it in the role
[14:12:46] https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/wikimetrics.pp#L143
[14:12:55] the module just depends on some mysql package being installed
[14:13:02] keeping it puppet module agnostic
[14:13:07] !log aude Started scap: Put test.wikidata back on mw1.24-wmf19 extension branch
[14:13:10] ah, hmm
[14:13:12] Logged the message, Master
[14:13:15] * YuviPanda considers
[14:13:34] ottomata: ok let me do that
[14:13:40] YuviPanda: the wikimetrics module is also used in mediawiki-vagrant
[14:13:42] so ja.
[14:13:51] yeah forgot about that
[14:13:57] * aude hopes with everything moved around, nothing explodes
[14:14:22] ottomata: https://gerrit.wikimedia.org/r/160446 should be trivial +2 though?
[14:14:40] (CR) Ottomata: [C: 032 V: 032] Add .gitreview file [puppet/wikimetrics] - https://gerrit.wikimedia.org/r/160446 (owner: Yuvipanda)
[14:16:19] (PS1) Yuvipanda: wikimetrics: Put mysql data in /srv, use mysql::server [puppet] - https://gerrit.wikimedia.org/r/160448
[14:16:20] ottomata: ^
[14:16:40] (Abandoned) Yuvipanda: db: Use the mysql class instead of the package [puppet/wikimetrics] - https://gerrit.wikimedia.org/r/160445 (owner: Yuvipanda)
[14:18:11] _joe_: do you have some time to merge the patch that removes the mobile::vumi code?
[14:18:17] :)
[14:18:18] hm, YuviPanda, did you already apply this manually on wikimetrics1?
[14:18:23] i'm just checking that it won't break things there
[14:18:32] ottomata: I did it by hand on wikimetrics1 :) but I can test on the dev instance if you want
[14:18:39] naw its cool, i was just checking myself
[14:18:42] cool
[14:19:03] ottomata: I logged my actions on -labs :) had to fix apparmor profile as well for mysql, which was a bit weird
[14:19:04] ok, so, you probably want to apply by hand on wikimetrics-staging1
[14:19:14] its got a /srv/wikimetrics dir
[14:19:21] yeah
[14:19:45] well, those are self hosted puppetmaster anyway, so its ok to merge this i think
[14:19:47] ottomata: since the labs srv role wasn't included, the wikimetrics folders were just on the root volume, which doesn't have too much space
[14:19:51] you can fix and apply by hand, ja?
[14:19:59] ya
[14:20:07] k
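The by-hand datadir move being discussed was, in rough outline, something like the following (a sketch with assumed Ubuntu paths, not a transcript of the actual commands; the apparmor alias is the "fix apparmor profile" step YuviPanda mentions above):

    service mysql stop
    mkdir -p /srv/mysql && rsync -a /var/lib/mysql/ /srv/mysql/
    sed -i 's|^datadir.*|datadir = /srv/mysql|' /etc/mysql/my.cnf
    # apparmor pins mysqld to /var/lib/mysql; alias the old path to the new one
    echo 'alias /var/lib/mysql/ -> /srv/mysql/,' >> /etc/apparmor.d/tunables/alias
    service apparmor reload && service mysql start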
[14:20:42] (PS2) Ottomata: wikimetrics: Put mysql data in /srv, use mysql::server [puppet] - https://gerrit.wikimedia.org/r/160448 (owner: Yuvipanda)
[14:20:51] (CR) Ottomata: [C: 032 V: 032] wikimetrics: Put mysql data in /srv, use mysql::server [puppet] - https://gerrit.wikimedia.org/r/160448 (owner: Yuvipanda)
[14:20:57] cool
[14:22:48] ottomata: trying on staging1
[14:22:55] anomie: is updating the namespace name as simple as it looks? Like I can just deploy it and be done or is there a script required? https://gerrit.wikimedia.org/r/#/c/156821/5 for reference
[14:23:31] (CR) Manybubbles: [C: 031] Remove 'renameuser' right from bureaucrats on CentralAuth wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/160158 (owner: Legoktm)
[14:23:45] ottomata: hah, fails
[14:23:58] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Must provide non empty value. on node wikimetrics-staging1.eqiad.wmflabs
[14:23:58] Warning: Not using cache on failed catalog
[14:24:01] no idea what that means
[14:24:52] YuviPanda: that sounds like one of the errors we patched
[14:25:04] in dev/staging/prod we have separate commits
[14:25:33] so we pull --rebase to keep those on top
[14:25:37] I did that as well
[14:25:46] I'm checking to see if it was my change that fucked it up
[14:25:48] nope
[14:25:52] it was fucked up before that
[14:25:56] yeah, doubt it
[14:26:02] maybe dev doesn't have the right fix
[14:26:13] but you're welcome to try in staging, puppet worked there last time i ran it
[14:26:20] milimetric: this is on staging
[14:26:21] maybe run it before and after your change
[14:26:56] milimetric: just did that (reverted my change, ran it, same error)
[14:28:01] ottomata: qchris do you know anything about this puppet error ^?
[14:28:21] * qchris reads backscroll
[14:28:36] YuviPanda: will check shortly...
[14:28:42] ok
[14:28:44] this is on staging1
[14:30:07] YuviPanda: No clue.
[14:30:14] hmm
[14:30:19] Did it work before your most recent changes?
[14:30:25] manybubbles: That's a good question. The maintenance script is namespaceDupes.php to clean up if there are any pages that are now inaccessible. We should ask Reedy about it.
[14:30:44] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[14:30:44] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[14:31:11] qchris: nope, I checked that by reverting my change and running it again
[14:31:24] :-D
[14:31:35] so it's been broken a while :)
[14:32:37] ottomata: hey, thanks for fixing the elasticsearch check! I hope to have some time this week to bring in some more tests too
[14:32:44] (PS1) BBlack: Protoproxy template variable scope fixups [puppet] - https://gerrit.wikimedia.org/r/160451
[14:33:11] * YuviPanda is writing an experimental shinken module
[14:33:22] (CR) BBlack: [C: 032] Protoproxy template variable scope fixups [puppet] - https://gerrit.wikimedia.org/r/160451 (owner: BBlack)
[14:34:48] Reedy: so, renaming namespaces? ok thing?
[14:35:13] this scap thing taking forever
[14:35:27] to sync-common to the last server
[14:35:35] :P
[14:35:37] Typical
[14:35:39] ottomata: merge the mysql thing?
[14:35:43] yep
[14:36:00] you could try to ss/ lsof to see which one is slow
[14:36:59] (CR) Manybubbles: "I'd like to SWAT this today (because it is scheduled for today) but I'm not sure what the procedure is for renaming namespaces so I'm not " [mediawiki-config] - https://gerrit.wikimedia.org/r/156821 (https://bugzilla.wikimedia.org/48075) (owner: Gerrit Patch Uploader)
[14:37:49] ottomata: I'm assuming yes then :)
[14:38:04] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge.
[14:38:04] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[14:39:28] aude: It's fenari
[14:39:36] that's broke
[14:39:51] "fenari is swapping hard, restarting apache who was eating up all the RAM"
[14:40:11] godog: yup, np,
[14:40:19] oh bblack
[14:40:19] sorry
[14:40:22] yup that's fine
[14:40:26] aude: right now it's okish
[14:40:38] r b swpd free buff cache si so bi bo in cs us sy id wa
[14:40:38] 0 1 627 2716 47 78 42 31 88 155 2 0 12 1 83 3
[14:40:56] (CR) coren: [C: 032] lab db replica: don't show log_params for deleted/suppressed logs [software] - https://gerrit.wikimedia.org/r/160393 (owner: ArielGlenn)
[14:41:33] disk is busy
[14:42:00] syncing all the things or something
[14:42:01] YuviPanda: mind if I run puppet?
[14:42:11] ottomata: sure. qchris might also be poking at it
[14:42:13] done
[14:42:14] oh
[14:42:20] ottomata: Take it :-)
[14:42:37] ottomata: I just cleaned out the cruft, and updated production branch.
[14:42:59] aude: Now everything looks fine again
[14:43:05] yep
[14:43:53] !log restarting the enwiki cirrus reindex process - it crashed over the weekend. why you crash and leave error message "1". "1" is not a useful error message.
[14:43:57] Logged the message, Master
[14:45:58] All right sports fans, there's a few multimedia backports going out, so I'll get the SWAT this morning
[14:46:12] * YuviPanda cheers for marktraceur
[14:46:24] I hope the ferry's wifi holds. :P
[14:46:27] BUT YOU DO NOT WORK FOR THE WMF NOW HOW CAN YOU SWAT SECURITY BREACH CALL THE NSA PLEASE
[14:46:34] marktraceur: are you Reedy?
[14:47:03] marktraceur: You're going to be the evil guy today, then :D
[14:47:12] And how
[14:47:16] Taking the "rename" right from 'crats ;)
[14:47:17] I don't think you actually have to call the NSA, they're already on every call :p
[14:47:19] ALL of them :)
[14:47:27] omg! :P
[14:47:39] don't take my rights away :)
[14:48:07] We won't touch your rights. Your lefts maybe.
[14:48:17] heh
[14:48:37] aude: Become part of the global cabal new global renamer group
[14:48:41] :D
[14:49:00] :)
[14:49:16] (PS1) Filippo Giunchedi: wikimedia.org: remove labsconsole CNAME [dns] - https://gerrit.wikimedia.org/r/160454
[14:50:30] manybubbles: So are you going to SWAT today?
[14:50:34] !log aude Finished scap: Put test.wikidata back on mw1.24-wmf19 extension branch (duration: 37m 27s)
[14:50:38] yay
[14:50:39] Logged the message, Master
[14:51:12] hoo: want to +2 https://gerrit.wikimedia.org/r/#/c/159948/
[14:51:13] once fenari is gone scaps will be faster again
[14:51:27] unless we are more cruel to job runners :D
[14:51:33] so that test.wikidata / test2 have clean memcached of entities etc
[14:51:39] looking
[14:51:52] (PS1) BBlack: Remove eda1::0 from esams protoproxy completely [puppet] - https://gerrit.wikimedia.org/r/160458
[14:52:09] it's ugly hacky
[14:52:44] aude: Yeah
[14:53:11] we'll remove the ugly in 1-2 weeks
[14:53:14] (CR) Hoo man: [C: 032] Bump shared cache key for test.wikidata (memcached storage of items, etc.) [mediawiki-config] - https://gerrit.wikimedia.org/r/159948 (owner: Aude)
[14:53:19] thanks
[14:53:26] (Merged) jenkins-bot: Bump shared cache key for test.wikidata (memcached storage of items, etc.) [mediawiki-config] - https://gerrit.wikimedia.org/r/159948 (owner: Aude)
[14:54:15] (PS2) BBlack: Remove eda1::0 from esams protoproxy completely [puppet] - https://gerrit.wikimedia.org/r/160458
[14:54:15] !log Updated Jenkins Job Builder fork: e5c0c61..2d74b16
[14:54:20] Logged the message, Master
[14:54:52] !log aude Synchronized wmf-config/Wikibase.php: Bump wikibase memcached key for test.wikidata, test, test2 (duration: 00m 16s)
[14:54:59] Logged the message, Master
[14:55:02] alright, done :)
[14:55:03] (CR) BBlack: [C: 032] Remove eda1::0 from esams protoproxy completely [puppet] - https://gerrit.wikimedia.org/r/160458 (owner: BBlack)
[14:55:15] aude: Just in time :)
[14:55:19] yep
[14:55:19] Verified?
[14:55:25] doing, ... yes
[14:56:05] 2 entries for 'Base lambda function' during scap
[14:56:06] :(
[14:56:12] those are gone now
[14:56:15] mh
[14:56:24] I'll keep an eye on that
[14:56:30] sure
[14:56:59] nobody yet looked at my patch to allow deployers to graceful apaches :(
[14:57:08] But that's not surprising
[14:57:11] Q22 is back :)
[14:57:17] it was broken on friday
[14:57:35] man, our icinga code is a mess
[14:57:48] As opposed to our other code?
[14:58:50] marktraceur: relatively, in the puppet repo, icinga is the most messy
[14:59:31] manybubbles: Hi :)
[15:00:04] manybubbles, anomie, ^d, marktraceur, legoktm: Dear anthropoid, the time has come. Please deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140915T1500).
[15:00:36] o/ hello
[15:00:41] (PS1) coren: Labs: merge in changes to maintain-replicas.pl [software] - https://gerrit.wikimedia.org/r/160459
[15:00:45] legoktm: Hey
[15:00:52] FlorianSW: Hey, did you do the submodule update patches for your GeoCrumbs change?
[15:00:58] If not, would you mind?
[15:01:06] I can do the config changes first
[15:01:15] Right, legoktm is first because he's so pretty
[15:01:18] manybubbles: that's why i'm here :) but i saw: https://gerrit.wikimedia.org/r/#/c/160452/1
[15:01:23] legoktm: Shall we enable user merge stuff on beta?
[15:01:38] * anomie sees marktraceur appears to be SWATting today, and goes back to code review
[15:01:52] (PS2) coren: Labs: merge in changes to maintain-replicas.pl [software] - https://gerrit.wikimedia.org/r/160459
[15:01:54] Yeah, anomie, tgr is pushing out some evil MMV code, so I figured I'd take it.
[15:02:00] (CR) Andrew Bogott: [C: 031] wikimedia.org: remove labsconsole CNAME [dns] - https://gerrit.wikimedia.org/r/160454 (owner: Filippo Giunchedi)
[15:02:43] marktraceur ähm :D Who is doing the swat? You or manybubbles?
[15:02:48] * FlorianSW confused
[15:02:52] I am
[15:02:56] UHM
[15:03:03] There's no /a directory on tin, wtf
[15:03:18] marktraceur: it's /srv/mediawiki-staging now
[15:03:18] :D
[15:03:20] marktraceur: see ops list
[15:03:23] yep
[15:03:23] dafuq
[15:03:23] i can swat
[15:03:33] manybubbles: It's OK, I got it
[15:03:35] marktraceur: emails!
[15:03:39] marktraceur: k
[15:03:44] I was building the submodule updates
[15:03:46] https://gerrit.wikimedia.org/r/#/c/160452/1 i'm fine with this :)
[15:03:46] want me to finish them?
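The submodule updates being discussed are, mechanically, commits on the core wmf branch that move the extension submodule pointer to the backported revision. A hypothetical sketch of that workflow (the branch and SHA are placeholders):

    # inside a checkout of core on the wmf/1.24wmf21 branch
    cd extensions/GeoCrumbs
    git fetch origin && git checkout <backport-sha>
    cd ../..
    git add extensions/GeoCrumbs
    git commit -m "Update GeoCrumbs for cherry-picks"
    git review wmf/1.24wmf21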
[15:03:54] manybubbles: Don't mind if I do
[15:03:56] you do*
[15:04:02] * marktraceur sips coffee
[15:04:08] hoo: we probably can do it, I wanted to get https://gerrit.wikimedia.org/r/#/c/158258/, https://gerrit.wikimedia.org/r/#/c/158311/, and https://gerrit.wikimedia.org/r/#/c/159785/ in first
[15:04:46] (CR) MarkTraceur: [C: 032] Remove 'renameuser' right from bureaucrats on CentralAuth wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/160158 (owner: Legoktm)
[15:04:51] ITT legoktm creates his cabal.
[15:04:53] (Merged) jenkins-bot: Remove 'renameuser' right from bureaucrats on CentralAuth wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/160158 (owner: Legoktm)
[15:05:10] legoktm: So... not yet?
[15:05:51] probably not unless you want to review those ;)
[15:06:17] marktraceur: https://gerrit.wikimedia.org/r/#/c/160389/1 and https://gerrit.wikimedia.org/r/#/c/160462/
[15:06:27] ty
[15:06:35] legoktm: Meh :P Have to do a lot of review for Wikidata later on... I'm actually having a break right now
[15:06:37] !log marktraceur Synchronized wmf-config/: [SWAT] Remove 'renameuser' right from bureaucrats on CentralAuth wikis (duration: 00m 09s)
[15:06:41] legoktm: Verify?
[15:06:41] Logged the message, Master
[15:06:42] although that's not very much like a break :D
[15:06:43] sorry: https://gerrit.wikimedia.org/r/#/c/160452/ and https://gerrit.wikimedia.org/r/#/c/160462/
[15:07:00] marktraceur: ^^^
[15:07:05] > You do not have permission to rename users, for the following reason: You are not allowed to execute the action you have requested.
[15:07:06] yay!
[15:07:26] marktraceur: have you ever renamed a namespace? I'm not sure what all is involved but that is something I'm wary of
[15:07:26] Cool beans
[15:07:34] manybubbles: Oh, hm, no
[15:08:08] marktraceur: yeah - I've poked Reedy about it a few times and not heard back. Maybe we punt it out of the SWAT because we're not sure about it?
[15:08:13] marktraceur: thanks!
[15:08:15] manybubbles: Which one is that?
[15:08:25] The config change for i18n?
[15:08:28] it needs to get done, but maybe we need to wait for someone to tell about it. https://gerrit.wikimedia.org/r/#/c/156821/5
[15:08:31] yeah
[15:08:38] Ah.
[15:08:44] (PS3) coren: Labs: merge in changes to maintain-replicas.pl [software] - https://gerrit.wikimedia.org/r/160459
[15:08:46] Hi
[15:08:54] Glaisher: Do you know anything about potential pitfalls there?
[15:09:23] (possibly Glaisher won't respond and we'll not be able to push it anyway)
[15:09:35] Namespace dupes is pretty good, it'll tell you what's inaccessible etc
[15:09:41] (PS3) BBlack: add login-lb to eqiad protoproxy [puppet] - https://gerrit.wikimedia.org/r/160016
[15:09:42] I'll pause it for now, discuss amongst yourselves
[15:09:42] (PS3) BBlack: Remove dead protoproxy entries completely [puppet] - https://gerrit.wikimedia.org/r/160017
[15:09:45] (PS3) BBlack: Remove dead addrs from protoproxies [puppet] - https://gerrit.wikimedia.org/r/160015
[15:09:47] (PS3) BBlack: add textsvc/uploadsvc in ulsfo for consistency [puppet] - https://gerrit.wikimedia.org/r/160014
[15:09:48] (PS3) BBlack: remove dead esams donatelbsecure [puppet] - https://gerrit.wikimedia.org/r/160013
[15:09:51] (PS3) BBlack: remove old mobile/bits addrs in eqiad+esams [puppet] - https://gerrit.wikimedia.org/r/160012
[15:09:53] (PS3) BBlack: Sanitize text-related addrs for eqiad [puppet] - https://gerrit.wikimedia.org/r/160011
[15:10:02] manybubbles, FlorianSW, your GC stuff is next
[15:10:04] A greg-g enters the room.
[15:10:26] marktraceur, manybubbles: all right, let's go :)
[15:11:28] hola
[15:11:34] What up greg-g
[15:11:54] nada mucho
[15:13:26] Cool.
[15:14:18] Jenkins is a bit sluggish this morning
[15:14:30] Ah, timing.
[15:15:51] hey Coren, if you're touching maintain-replicas... Any chance you could look at https://gerrit.wikimedia.org/r/#/c/143622/ please?
[15:15:55] ori_: hiya
[15:15:56] (PS3) Reedy: Add pr_index table from Proofread Page extension [software] - https://gerrit.wikimedia.org/r/143622
[15:16:07] (PS2) Reedy: Normalise quotes used. Sync fullviews [software] - https://gerrit.wikimedia.org/r/143649
[15:16:45] <_joe_> Reedy: I'm one step nearer to merging your big apache changes: https://github.com/lavagetto/webtest
[15:16:57] thx marktraceur & manybubbles :)
[15:17:00] yup
[15:17:13] <_joe_> it still needs some polish, but after that I'll be able to test whatever changes upon an apache config change
[15:17:16] !log marktraceur Synchronized php-1.24wmf20/extensions/GeoCrumbs/GeoCrumbs.class.php: [SWAT] Handle return value NULL of GeoCrumbs::getParserCache (duration: 00m 07s)
[15:17:21] Logged the message, Master
[15:17:23] FlorianSW: Verify on a wikipedia, please :)
[15:17:36] (PS4) coren: Labs: merge in changes to maintain-replicas.pl [software] - https://gerrit.wikimedia.org/r/160459
[15:17:49] (CR) BBlack: [C: 032] Sanitize text-related addrs for eqiad [puppet] - https://gerrit.wikimedia.org/r/160011 (owner: BBlack)
[15:17:49] Reedy: Added to my current changeset.
[15:18:24] Coren: Aha. Thanks. That looks like that makes 143649 possibly redundant too
[15:18:25] (PS1) Ottomata: Use $labs_finger as default $master finger in salt role [puppet] - https://gerrit.wikimedia.org/r/160464
[15:18:38] !log marktraceur Synchronized php-1.24wmf21/extensions/GeoCrumbs/GeoCrumbs.class.php: [SWAT] Handle return value NULL of GeoCrumbs::getParserCache (duration: 00m 07s)
[15:18:42] Logged the message, Master
[15:18:45] FlorianSW: And now, verify on mw.org or a testwiki.
[15:19:02] * qchris looks
[15:19:28] looks like just a typo problem in that commit
[15:19:38] the pick() function ended up being passed 0 non-empty arguments
[15:19:40] and that's where the error was being thrown
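That ties the morning's breakage together: stdlib's pick() returns its first non-empty argument and aborts catalog compilation when every argument is undef or empty, which matches the "Must provide non empty value" error the labs puppetmasters were throwing. A toy reproduction (a sketch; assumes puppet with puppetlabs-stdlib installed):

    puppet apply -e 'notice(pick(undef, ""))'
    # Error: Must provide non empty value. ...
    puppet apply -e 'notice(pick(undef, "fallback"))'
    # notice: fallback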
[15:19:43] Reedy: So if I want to push Glaisher's patch with some namespace renames, what do I have to do, if anything?
[15:19:51] FlorianSW: Well, then wikivoyage
[15:20:03] (CR) QChris: [C: 031] Use $labs_finger as default $master finger in salt role [puppet] - https://gerrit.wikimedia.org/r/160464 (owner: Ottomata)
[15:20:06] It would be super if we had a testwiki running it though.
[15:20:07] marktraceur: sync it, mwscript namespaceDupes.php --wiki=foobar
[15:20:14] (CR) Ottomata: "Ori, is this correct? It looks like this is what you meant to do in the first place. Not sure though." [puppet] - https://gerrit.wikimedia.org/r/160464 (owner: Ottomata)
[15:20:16] K
[15:20:19] See what the output says... You might need to run again with --fix
[15:20:20] Thanks ottomata
[15:20:27] Reedy: Thanks, I'll do that next.
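Spelled out, Reedy's recipe is a dry run first and a fixing pass only if needed (the wiki name here is illustrative; the dvwiki config change is the one under discussion):

    # report pages left unreachable by the namespace rename
    mwscript namespaceDupes.php --wiki=dvwiki
    # if it lists conflicts, run again and let it move the affected pages
    mwscript namespaceDupes.php --wiki=dvwiki --fix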
[15:20:30] Glaisher: Are you there?
[15:20:36] qchris: i manually edited that on -staging1 and ran puppet
[15:20:39] marktraceur: yeah, but this isn't my extension per se :) I just fixed this little problem, because it's a blocker :)
[15:20:49] I will undo that manual edit and wait for ori's review
[15:20:54] or...should I leave it?
[15:21:07] I'll just commit it for now in staging.
[15:21:10] Hm, hasn't talked since joining last night.
[15:21:16] ok
[15:21:18] Rebasing later on will recognize it.
[15:21:25] Glaisher: You get another pause because you aren't answering; if you're around to verify, please ping me and you can go after I sync tgr's changes to MMV.
[15:22:44] k
[15:22:58] So tgr, you're next, if you're ready to verify.
[15:22:59] YuviPanda: that error is fixed, but there are new puppet problems, mysql related
[15:23:01] take it away :)
[15:23:04] aaah
[15:23:08] those would be mine, I suppose
[15:23:12] _joe_: can you explain a bit about puppetmasters to me? It seems like puppetmaster and the apache puppetmaster server are configured to listen on the same port.
[15:23:22] Which, on virt1000 I currently can't start the puppetmaster due to a port conflict
[15:23:40] don't know you need both?
[15:23:41] andrewbogott:
[15:23:52] I do need both! But surely not on the same port?
[15:23:58] ok
[15:24:07] ottomata: completes on staging without any errors for me
[15:24:16] marktraceur: ready
[15:24:18] Sweet.
[15:24:23] oh, ok...i saw an alembic upgrade problem
[15:24:26] (migrations)
[15:24:26] but ok
[15:24:28] ottomata: ah, ok :)
[15:24:30] maybe it fixed itself on the second run
[15:24:32] ottomata: yeah
[15:24:39] ottomata: should I run on wikimetrics1?
[15:24:40] andrewbogott: Wait, we don't use the puppetmaster webserver do we? I thought we used mod_passenger on the one apache instead.
[15:24:52] YuviPanda: ask milimetric
[15:25:05] * YuviPanda considers milimetric asked
[15:25:16] andrewbogott: with a distinct vhost for puppet stuff.
[15:25:36] Right, there is a distinct vhost.
[15:25:53] * aude spent the weekend playing with puppetmaster:p
[15:26:06] Alls I know is -- the puppetmaster won't start due to a port conflict. And when I look in the puppetmaster config and the apache vhost config -- same port: 8140
[15:26:13] Which seems like… a good place to start for a port conflict
[15:26:27] seemed if you have the apache thing then you don't use puppetmaster in addition (for webserver)
[15:26:34] but don't fully understand
[15:26:39] YuviPanda: just to catch up, you guys think the puppet stuff is ready and want to run it on wikimetrics1? There's an outstanding error but it's something about alembic?
[15:27:04] milimetric: no error, the alembic stuff seems to have resolved itself
[15:27:22] k, YuviPanda, all good then
[15:27:23] the documentation for puppetmaster is confusing and lacking
[15:27:52] !log marktraceur Synchronized php-1.24wmf20/extensions/MultimediaViewer/: [SWAT] Several backports for metrics and bugfixes in Media Viewer (duration: 00m 07s)
[15:27:57] Logged the message, Master
[15:28:04] tgr: Test on a wikipedia please :)
[15:29:06] !log marktraceur Synchronized php-1.24wmf21/extensions/MultimediaViewer/: [SWAT] Several backports for metrics and bugfixes in Media Viewer (duration: 00m 07s)
[15:29:07] Aaaaand on mediawiki.org.
[15:29:10] Logged the message, Master
[15:29:58] Glaisher: We're 30 minutes into the SWAT window and you still haven't responded to pings; I'm fairly sure this means we're going to punt you to a later SWAT window. I'll tentatively stick it in the afternoon one today, but let me know if tomorrow's morning SWAT would be better.
[15:31:46] marktraceur: neither patch can be easily tested, but nothing seems to be broken at least
[15:32:03] Good enough
[15:32:33] tgr: If something isn't working the way you expect, ping me and we'll use the afternoon SWAT to fix 'er
[15:32:35] did this go out to wmf.org as well?
[15:32:43] in that case I can test it
[15:32:52] Maybe
[15:33:09] Yeah, should have done
[15:33:33] S:V disagrees, but I suspect it's cached and lying to me.
[15:33:56] yeah, the bugfix part works
[15:34:00] Sweet.
[15:34:15] can't see the new event logs, but they are sampled, so...
[15:34:24] Righto.
[15:34:35] Should have some new data by afternoon or so.
[15:36:52] (PS1) Manybubbles: Switch primary search backend for jawiki to Cirrus [mediawiki-config] - https://gerrit.wikimedia.org/r/160465
[15:37:15] (CR) Manybubbles: "Will deploy during window." [mediawiki-config] - https://gerrit.wikimedia.org/r/160465 (owner: Manybubbles)
[15:37:49] <_joe_> andrewbogott: puppetmaster should not start where we use mod_passenger
[15:37:59] <_joe_> they're alternatives, should not run together
[15:38:05] _joe_: Yep, I sorted that.
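In other words, with the passenger vhost bound to 8140, the standalone daemon has to stay stopped. A quick way to confirm who owns the port (a sketch, assuming the Ubuntu init scripts of the time):

    service puppetmaster stop    # the standalone master that wouldn't start above
    netstat -tlnp | grep 8140    # expect a single apache2 (mod_passenger) listener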
[15:38:17] The error I'm seeing is: Could not retrieve catalog from remote server: Error 400 on SERVER: Must provide non empty value.
[15:38:34] Ever see that? Googling suggests it's something to do with hiera, but if hiera is turned on for labs that's news to me
[15:38:41] <_joe_> andrewbogott: the error you've seen where and how?
[15:38:41] andrewbogott: ah, ottomata just fixed that with https://gerrit.wikimedia.org/r/#/c/160464
[15:39:04] <_joe_> ok :)
[15:39:09] <_joe_> thnx yuvi
[15:39:20] _joe_: on every single labs instance
[15:39:21] yw :) we just ran into it on wikimetrics' self hosted puppetmaster...
[15:39:29] YuviPanda: I totally can't tell how that would relate to puppet failing
[15:39:50] andrewbogott: unsure, but I do know that it fixes it (just cherry-picked that one to wikimetrics1)
[15:40:08] K, I have no more patience, I declare SWAT over
[15:40:10] How's it going to get applied to labs instances where puppet is already broken?
[15:40:13] Unless there are last minute things to do
[15:40:34] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[15:40:36] <_joe_> ???
[15:40:38] andrewbogott: the offending change is this:
[15:40:38] https://gerrit.wikimedia.org/r/#/c/153727/11
[15:40:39] I think
[15:40:42] it was merged yesterday
[15:40:49] i'm not sure if my fix is correct: https://gerrit.wikimedia.org/r/#/c/160464
[15:40:52] but i think it is
[15:40:52] andrewbogott: the breakage is on the master
[15:40:55] <_joe_> yesterday
[15:40:57] <_joe_> ....
[15:41:01] oooook
[15:41:02] * andrewbogott tries
[15:41:15] _joe_: that is consistent with puppetmaster failing since yesterday on labs
[15:41:45] <_joe_> everybody wait a minute
[15:41:52] greg-g: ugh.
[15:42:02] <_joe_> I'd probably revert all the salt work that was merged on sunday
[15:42:15] so, swat is still open? anyone wants to backport and deploy https://gerrit.wikimedia.org/r/159086 ? (see https://bugzilla.wikimedia.org/show_bug.cgi?id=69924#c28)
[15:42:22] <_joe_> It's causing problems in prod as well
[15:42:55] MatmaRex: asking those kinds of leading questions (the one on the bug) only sets you up to be told "you shoulda done it yourself" ;)
[15:43:17] <_joe_> so if you have a hotfix, apply it
[15:43:26] <_joe_> anything larger than that, please wait
[15:43:39] (PS5) coren: Labs: merge in changes to maintain-replicas.pl [software] - https://gerrit.wikimedia.org/r/160459 (https://bugzilla.wikimedia.org/54164)
[15:43:52] greg-g: okay, i won't be merging stuff that i can't commit to guiding to backport and deployment next time
[15:44:04] i wasn't even home until now
[15:44:34] Hm… since my firefox update yesterday, clicking on links doesn't open new tabs. That happening to everyone?
[15:44:35] MatmaRex: feel free to merge it, just don't ask passive aggressively if someone will do something, either say "I can't do this, can you Krinkle?" or similar
[15:45:15] "Do you even Krinkle, dude?"
[15:45:32] _joe_: Hotfixing isn't especially straightforward, but I'm ok waiting for you to sort things out.
[15:45:39] andrewbogott: checked the setting? Might be worth toggling it
[15:45:42] Or I can submit a bunch of reverts if you like
[15:46:04] <_joe_> andrewbogott: well, I'd prefer not to waste the work that has been done
[15:46:27] …I can't tell if you're advocating for reverts, or for fixes, or for neither :)
[15:46:50] If we're not going to revert then we can apply ottomata's patch (which looks right to me) and fix labs easily.
[15:47:09] +1 to fixing labs for now
[15:47:20] Reedy: hm, sure enough
[15:48:16] I would say that ottomata's patch is clearly right, it just resolves a copy/paste error
[15:49:12] andrewbogott: merge away :)
[15:49:13] (CR) Andrew Bogott: [C: 031] "I think this is correct. Not merging pending a larger discussion about the preceding change, though." [puppet] - https://gerrit.wikimedia.org/r/160464 (owner: Ottomata)
[puppet] - 10https://gerrit.wikimedia.org/r/160464 (owner: 10Ottomata) [15:49:26] it will only affect labs anyway, and labs is already broken :) [15:49:39] * andrewbogott is still observing _joe_'s request to everybody wait a minute [15:49:59] ottomata: it's just a question of not piling more patches on an existing patch series if we're about to revert everything [15:50:06] ottomata: we're probably going to revert all the recent salt work, unless joe already knows an obvious fix [15:50:09] but it's multiple commits [15:51:20] aye ok [15:51:21] cool [15:52:30] <_joe_> bblack: I don't, but give me 3 mins to get to the bottom of what I'm doing right now [15:52:35] ok [15:53:42] (03PS1) 10Giuseppe Lavagetto: Remove the last references to pybal on fenari [puppet] - 10https://gerrit.wikimedia.org/r/160467 [15:53:44] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [15:54:02] <_joe_> so: if someone can take a look at the labs issue [15:54:07] <_joe_> I can look at the prod ones [15:55:21] my personal take on the salt thing is just revert the 4x commits in a row that are salt-related from ori yesterday. Starting at "Remove salt-minion.override" + the next 3 [15:55:37] _joe_: I'm pretty sure that ottomata's patch will sort out the labs problem. It's just a copy/paste thing. If there's another problem buried under that one, it's hard to know until we merge that. [15:55:58] it's going to conflict with the reverts, I think [15:56:07] <_joe_> andrewbogott: you can try a few hosts with pcc to be sure [15:56:09] which is why I'm not merging it [15:56:16] we can revert that too [15:56:37] * andrewbogott feels like he's repeating himself a lot this morning. sorry [15:56:51] hey i got an idea, let's merge that patch! [15:56:55] harharhar jk :p [15:57:06] _joe_: YuviPanda already tested it with a local puppetmaster I believe. Worked, right? [15:57:21] andrewbogott: yea [15:57:21] yes, we tested it with the wikimetrics instances [15:57:24] andrewbogott: the one-line labs fix is fixing a problem introduced in the 4x being reverted, basically [15:57:52] but that won't fix any prod issues caused by the change [15:57:56] well, then it'll be 5x revert [15:57:57] will only fix the one labs error [15:58:10] there could be labs salt issues caused by this too [15:58:14] if it messed up stuff in prod [15:58:14] * andrewbogott gets some breakfast [16:00:04] manybubbles, ^d: Dear anthropoid, the time has come. Please deploy Search (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140915T1600). [16:00:44] marktraceur: all done with swat? [16:01:11] (03CR) 10Manybubbles: [C: 032] Switch primary search backend for jawiki to Cirrus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160465 (owner: 10Manybubbles) [16:01:18] <_joe_> If you end up having problems with scap/trebuchet, ping me [16:01:19] the pending salt issues might affect scap, I'd hold on any deploy that relies on that for a sec [16:05:58] Uh, yes. [16:06:45] hrmmm [16:06:47] I wonder how the previous deploy completed properly? [16:06:52] it was during the broken period [16:07:32] jenkins hung? [16:08:00] Salt being jacked up should not cause any direct problems for scap/sync-* [16:08:10] It will only cause a problem for trebuchet [16:08:18] s/will/should/ [16:08:21] can i try cirrus in jawiki yet?
[16:09:46] aude: as soon as jenkins +2s I guess [16:09:56] ok :) [16:10:01] aude: i mean - you can always try it with the url parameter or the betafeature [16:10:06] if you can read japanese [16:10:11] * aude loves typing in japanese, the few words that i know [16:10:26] pushing the "scrunch" button is all kinds of fun [16:12:07] (03PS1) 10Giuseppe Lavagetto: salt: fix scoping and cluster definition [puppet] - 10https://gerrit.wikimedia.org/r/160468 [16:12:44] (03Merged) 10jenkins-bot: Switch primary search backend for jawiki to Cirrus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160465 (owner: 10Manybubbles) [16:13:30] !log manybubbles Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 06s) [16:13:36] Logged the message, Master [16:13:37] yay! [16:14:18] (03CR) 10Tpt: [C: 031] Add pr_index table from Proofread Page extension [software] - 10https://gerrit.wikimedia.org/r/143622 (owner: 10Reedy) [16:14:44] what's the test.wikipedia link in the /topic for? [16:14:52] manybubbles: I've been done with SWAT for a while, yes. [16:15:00] greg-g: to test wikipedia! [16:15:04] !log jawiki now has cirrus as primary. we're back to where we were before the great cascading failure of two months ago [16:15:09] Logged the message, Master [16:15:09] YuviPanda: I didn't use scap [16:15:22] sync-dir and sync-file all the way bro [16:15:26] ah [16:15:26] Wait. [16:15:28] ol' style [16:15:35] tgr: Were there message changes in the MMV deployment? [16:15:39] scap isn't needed in most cases [16:16:21] (03CR) 10BBlack: [C: 031] salt: fix scoping and cluster definition [puppet] - 10https://gerrit.wikimedia.org/r/160468 (owner: 10Giuseppe Lavagetto) [16:16:24] (03CR) 10Giuseppe Lavagetto: [C: 032] "http://puppet-compiler.wmflabs.org/344/change/160468/html/mw1017.eqiad.wmnet.html" [puppet] - 10https://gerrit.wikimedia.org/r/160468 (owner: 10Giuseppe Lavagetto) [16:16:47] Adding a `--no-l10n` flag to scap would make it pretty easy to kill off sync-* [16:17:10] bd808: don't we want to _just_ sync the things we're intentionally changing? [16:17:44] better would be to always run a precheck, verify, then sync everything. but hey [16:17:46] manybubbles: because side effects. Like the git info for Special:Version isn't updated without a full scap [16:18:15] bd808: not saying we shouldn't do it - but --no-intl wouldn't be enough [16:18:27] and it's a false sense of security. Any cluster host can call sync-common at any time [16:18:27] _joe_: still digging? [16:18:29] we'd need to always do a dry run to verify the list of files being changed [16:18:51] (03PS2) 10Giuseppe Lavagetto: Use $labs_finger as default $master finger in salt role [puppet] - 10https://gerrit.wikimedia.org/r/160464 (owner: 10Ottomata) [16:18:51] bd808: yeah - it really really really is false - but it's something. [16:19:27] The state of /srv/mediawiki-staging could be pulled to any host in the cluster at any time [16:19:34] which is a flaw in scap in general [16:20:18] We need to fix that IMHO, but there are a few issues with Trebuchet that I'd like to see ironed out first. [16:20:51] _joe_: just curious, how does that fix the scoping problem?
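For the record, the "url parameter" aude mentions is presumably srbackend, which let a single Special:Search request be forced through Cirrus before it became the default; the parameter name is recalled from the CirrusSearch test setup of the time, so treat it as an assumption:

    # force one jawiki search through the Cirrus backend
    curl -s 'https://ja.wikipedia.org/w/index.php?title=Special:Search&search=tokyo&srbackend=CirrusSearch' >/dev/null && echo reachable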
[16:21:00] Like fanout and the general state of the `git deploy` porcelain [16:21:37] (03CR) 10Giuseppe Lavagetto: [C: 032] "given this was clearly the intent of the patch and it doesn't harm production (http://puppet-compiler.wmflabs.org/345/change/160464/html/)" [puppet] - 10https://gerrit.wikimedia.org/r/160464 (owner: 10Ottomata) [16:22:28] <_joe_> ottomata: $::cluster can only be a top-scope variable [16:22:49] <_joe_> $cluster is either a node-scope variable (which inherits top-scope) or local-scope [16:23:05] is it defined locally as $cluster? [16:23:17] <_joe_> it's one of the tricks I did to make our codebase compatible with puppet 3 without using hiera [16:23:20] <_joe_> :) [16:23:31] oh [16:23:37] so, $cluster should always be $cluster in our code? [16:24:37] <_joe_> basically yes [16:24:47] <_joe_> $::cluster right now refers to 'misc' [16:29:04] oh yeah so the labs master-finger fixup probably should be merged since we didn't revert [16:29:19] oh joe already did above [16:29:47] I'm applying on virt1000 and it's hanging on Scheduling refresh of Service[salt-minion]. Waiting to see if it works after a second run... [16:33:19] _joe_: thanks for clearing that up! [16:33:30] <_joe_> andrewbogott: np [16:33:40] <_joe_> someone else will have to thank me :P [16:36:12] (03CR) 10coren: [C: 032] "Matches current version." [software] - 10https://gerrit.wikimedia.org/r/160459 (https://bugzilla.wikimedia.org/54164) (owner: 10coren) [16:36:39] (03PS4) 10BBlack: remove old mobile/bits addrs in eqiad+esams [puppet] - 10https://gerrit.wikimedia.org/r/160012 [16:36:53] PROBLEM - puppet last run on virt1000 is CRITICAL: CRITICAL: Epic puppet fail [16:37:07] (03CR) 10BBlack: [C: 032] remove old mobile/bits addrs in eqiad+esams [puppet] - 10https://gerrit.wikimedia.org/r/160012 (owner: 10BBlack) [16:41:55] (03PS4) 10BBlack: remove dead esams donatelbsecure [puppet] - 10https://gerrit.wikimedia.org/r/160013 [16:42:28] (03Abandoned) 10Reedy: Add pr_index table from Proofread Page extension [software] - 10https://gerrit.wikimedia.org/r/143622 (owner: 10Reedy) [16:42:39] (03Abandoned) 10Reedy: Normalise quotes used. Sync fullviews [software] - 10https://gerrit.wikimedia.org/r/143649 (owner: 10Reedy) [16:43:18] (03CR) 10BBlack: [C: 032] remove dead esams donatelbsecure [puppet] - 10https://gerrit.wikimedia.org/r/160013 (owner: 10BBlack) [16:47:27] So… a hanging puppet run: where to look, and what to grep for? [16:47:44] I'm already running -v and it says nothing. Last output line says 'executed successfully' [16:48:00] strace? [16:48:06] I'm pretty sure that I've never seen puppet actually hang before [16:48:07] what do you mean by 'hanging'?
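_joe_'s $cluster vs $::cluster distinction is easy to see with a standalone puppet apply; a rough demo under puppet 3 scoping rules, with invented names:

    cat > /tmp/scope-demo.pp <<'EOF'
    $cluster = 'misc'                    # top scope
    node default {
      $cluster = 'appserver'             # node scope shadows top scope
      notice("plain:    ${cluster}")     # -> appserver
      notice("topscope: ${::cluster}")   # -> misc
    }
    EOF
    puppet apply /tmp/scope-demo.pp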
[16:48:43] strace -ff -e execve puppet agent -tv [16:48:53] will probably show it, if it's an exec type thing [16:49:04] you can also attach to the current process [16:49:09] to see where exactly it's on [16:49:23] (but if it's an exec, it might just be chilling in something waitpid-like) [16:49:34] or check the process list for children of the running puppet agent [16:49:43] PROBLEM - puppet last run on ssl1004 is CRITICAL: CRITICAL: Epic puppet fail [16:49:48] PROBLEM - puppet last run on ocg1002 is CRITICAL: CRITICAL: Epic puppet fail [16:49:53] PROBLEM - puppet last run on search1021 is CRITICAL: CRITICAL: Epic puppet fail [16:49:53] PROBLEM - puppet last run on tmh1002 is CRITICAL: CRITICAL: Epic puppet fail [16:49:53] PROBLEM - puppet last run on search1009 is CRITICAL: CRITICAL: Epic puppet fail [16:49:54] PROBLEM - puppet last run on wtp1014 is CRITICAL: CRITICAL: Epic puppet fail [16:49:54] PROBLEM - puppet last run on mw1214 is CRITICAL: CRITICAL: Epic puppet fail [16:49:57] whelp [16:49:59] uhhhhh [16:50:53] Notice: Run of Puppet configuration client already in progress; skipping (/var/lib/puppet/state/agent_catalog_run.lock exists) [16:50:56] Notice: Caught TERM; calling stop [16:50:58] ? [16:51:06] mark: So far, I don't know what I mean. Just that when I run puppet agent -tv, it outputs a few happy lines then… hangs. [16:51:15] Uhoh, I wonder if that's happening everywhere :( [16:51:16] PROBLEM - puppet last run on search1014 is CRITICAL: CRITICAL: Epic puppet fail [16:51:19] I guess that's the puppet hangs you're referring to? it's hanging on lots of random hosts? [16:51:33] the epic fail is "it can't run because it's still running from last time" [16:51:34] PROBLEM - puppet last run on wtp1021 is CRITICAL: CRITICAL: Epic puppet fail [16:51:34] I was only watching virt1000, but maybe it's happening everywhere [16:51:37] yeah strace it to see what syscall it's blocked on [16:51:43] PROBLEM - puppet last run on mw1216 is CRITICAL: CRITICAL: Epic puppet fail [16:51:44] PROBLEM - puppet last run on wtp1024 is CRITICAL: CRITICAL: Epic puppet fail [16:51:53] PROBLEM - puppet last run on ssl3003 is CRITICAL: CRITICAL: Epic puppet fail [16:51:54] PROBLEM - puppet last run on virt1002 is CRITICAL: CRITICAL: Epic puppet fail [16:52:14] PROBLEM - puppet last run on ocg1001 is CRITICAL: CRITICAL: Epic puppet fail [16:52:14] PROBLEM - puppet last run on mw1218 is CRITICAL: CRITICAL: Epic puppet fail [16:52:14] PROBLEM - puppet last run on search1020 is CRITICAL: CRITICAL: Epic puppet fail [16:52:16] <_joe_> mmmh [16:52:16] hmm ssl1004 fixed itself [16:52:23] PROBLEM - puppet last run on rdb1004 is CRITICAL: CRITICAL: Epic puppet fail [16:52:24] PROBLEM - puppet last run on pc1001 is CRITICAL: CRITICAL: Epic puppet fail [16:52:40] /usr/bin/python /usr/bin/salt-call --out=txt test.ping [16:52:46] ^ that looks suspicious [16:52:53] PROBLEM - puppet last run on wtp1017 is CRITICAL: CRITICAL: Epic puppet fail [16:53:03] PROBLEM - puppet last run on wtp1019 is CRITICAL: CRITICAL: Epic puppet fail [16:53:04] <_joe_> bblack: on some hosts salt is hanging upon restart [16:53:31] on ssl1004, that salt-call is hung in the process list, but it's been forked from whatever and has no real parent anymore [16:53:34] PROBLEM - puppet last run on wtp1009 is CRITICAL: CRITICAL: Epic puppet fail [16:53:48] <_joe_> bblack: that was my test earlier probably [16:53:53] looks like it's blocked on /sbin/start salt-minion [16:54:13] PROBLEM - puppet last run on ssl1002 is CRITICAL: CRITICAL:
Epic puppet fail [16:54:17] <_joe_> bblack: we should schedule a restart of salt on all servers sooner or later [16:54:23] PROBLEM - puppet last run on search1016 is CRITICAL: CRITICAL: Epic puppet fail [16:54:25] yeah there's a hung "/sbin/stop salt-minion" on ssl1004 from half an hour ago [16:54:43] PROBLEM - puppet last run on wtp1020 is CRITICAL: CRITICAL: Epic puppet fail [16:54:47] I think puppet gives up eventually and leaves that forked off, it just takes longer than our client interval [16:55:03] PROBLEM - puppet last run on wtp1006 is CRITICAL: CRITICAL: Epic puppet fail [16:55:04] PROBLEM - puppet last run on rbf1002 is CRITICAL: CRITICAL: Epic puppet fail [16:55:09] <_joe_> bblack: so salt minion doesn't stop cleanly [16:55:12] !log Restarted hung elasticsearch service on logstash1002 [16:55:14] PROBLEM - puppet last run on search1010 is CRITICAL: CRITICAL: Epic puppet fail [16:55:16] Logged the message, Master [16:55:23] PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: Epic puppet fail [16:55:42] oh, godog, i guess we never merged this [16:55:43] shall we? [16:55:44] https://gerrit.wikimedia.org/r/#/c/160090/1/modules/elasticsearch/files/nagios/check_elasticsearch.py [16:55:53] PROBLEM - puppet last run on mw1205 is CRITICAL: CRITICAL: Epic puppet fail [16:56:03] PROBLEM - puppet last run on mw1217 is CRITICAL: CRITICAL: Epic puppet fail [16:56:03] PROBLEM - puppet last run on wtp1016 is CRITICAL: CRITICAL: Epic puppet fail [16:56:10] Debug: Executing '/sbin/start salt-minion' [16:56:15] PROBLEM - puppet last run on virt1006 is CRITICAL: CRITICAL: Epic puppet fail [16:56:16] ^ that's where ssl1004 hangs now, so yeah [16:56:23] PROBLEM - puppet last run on search1018 is CRITICAL: CRITICAL: Epic puppet fail [16:56:41] (03PS2) 10Ottomata: Remove udp2log stream to Vrije Universiteit Amsterdam [puppet] - 10https://gerrit.wikimedia.org/r/160435 (owner: 10QChris) [16:56:50] (03CR) 10Ottomata: [C: 032 V: 032] Remove udp2log stream to Vrije Universiteit Amsterdam [puppet] - 10https://gerrit.wikimedia.org/r/160435 (owner: 10QChris) [16:57:05] PROBLEM - puppet last run on search1001 is CRITICAL: CRITICAL: Epic puppet fail [16:57:15] PROBLEM - puppet last run on search1007 is CRITICAL: CRITICAL: Epic puppet fail [16:57:17] _joe_: I've noticed on trusty some odd minion behavior, where I get two returns from minion to the master.. restarting helped some but not all of those [16:57:35] not relevant to you now I guess but fyi. [16:57:36] The minion says Master hostname: salt not found. Retrying in 30 seconds [16:57:49] <_joe_> andrewbogott: where? [16:58:04] in /var/log/salt/minion [16:58:04] PROBLEM - puppet last run on mw1213 is CRITICAL: CRITICAL: Epic puppet fail [16:58:04] PROBLEM - puppet last run on mw1002 is CRITICAL: CRITICAL: Epic puppet fail [16:58:14] PROBLEM - puppet last run on mw1211 is CRITICAL: CRITICAL: Epic puppet fail [16:58:23] PROBLEM - puppet last run on mw1208 is CRITICAL: CRITICAL: Epic puppet fail [16:58:23] PROBLEM - puppet last run on virt1003 is CRITICAL: CRITICAL: Epic puppet fail [16:58:27] <_joe_> good god.
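Attaching to the already-running agent, as suggested above, is usually the quickest way to see which child it is parked on; a sketch of that dance:

    PUPPET_PID=$(pgrep -of 'puppet agent')             # oldest matching process
    ps --ppid "$PUPPET_PID" -o pid,etime,args          # look for a stuck salt-call or /sbin/start
    sudo strace -p "$PUPPET_PID" -f -e trace=process   # typically parked in a waitpid() on the child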
[16:58:29] (03CR) 10Ottomata: For regular webstatscollector installs, use latest version (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/157878 (https://bugzilla.wikimedia.org/70295) (owner: 10QChris) [16:58:33] PROBLEM - puppet last run on mw1206 is CRITICAL: CRITICAL: Epic puppet fail [16:58:42] _joe_: It's right in the minion config though [16:58:45] PROBLEM - puppet last run on wtp1005 is CRITICAL: CRITICAL: Epic puppet fail [16:58:45] PROBLEM - puppet last run on virt1001 is CRITICAL: CRITICAL: Epic puppet fail [16:58:53] PROBLEM - puppet last run on pc1002 is CRITICAL: CRITICAL: Epic puppet fail [16:58:53] PROBLEM - puppet last run on virt1004 is CRITICAL: CRITICAL: Epic puppet fail [16:58:55] <_joe_> andrewbogott: lemme check [16:58:55] I mean, 'correct', in other words, palladium [16:58:55] on ssl1004, a manual kill of salt-minion (TERM) + manual stop worked fine [16:59:01] (03PS3) 10Ottomata: For regular webstatscollector installs, use latest version [puppet] - 10https://gerrit.wikimedia.org/r/157878 (https://bugzilla.wikimedia.org/70295) (owner: 10QChris) [16:59:03] PROBLEM - puppet last run on wtp1012 is CRITICAL: CRITICAL: Epic puppet fail [16:59:09] /sbin/stop doesn't realize what to do and thinks nothing needs to be stopped, though [16:59:13] PROBLEM - puppet last run on labstore1001 is CRITICAL: CRITICAL: Epic puppet fail [16:59:20] (03CR) 10Ottomata: [C: 032 V: 032] For regular webstatscollector installs, use latest version [puppet] - 10https://gerrit.wikimedia.org/r/157878 (https://bugzilla.wikimedia.org/70295) (owner: 10QChris) [16:59:21] just shoot and restart [16:59:23] we may have to manually kill the old minion everywhere or something? [16:59:24] PROBLEM - puppet last run on search1005 is CRITICAL: CRITICAL: Epic puppet fail [16:59:41] PROBLEM - puppet last run on mw1202 is CRITICAL: CRITICAL: Epic puppet fail [16:59:54] PROBLEM - puppet last run on ssl1005 is CRITICAL: CRITICAL: Epic puppet fail [17:00:07] PROBLEM - puppet last run on search1002 is CRITICAL: CRITICAL: Epic puppet fail [17:00:12] <_joe_> how do I start the minion? [17:00:22] /sbin/start salt-minion [17:00:26] PROBLEM - puppet last run on wtp1018 is CRITICAL: CRITICAL: Epic puppet fail [17:00:43] PROBLEM - puppet last run on rdb1001 is CRITICAL: CRITICAL: Epic puppet fail [17:00:44] PROBLEM - puppet last run on virt1007 is CRITICAL: CRITICAL: Epic puppet fail [17:00:48] (but in the host I'm looking at, there's one running from 2013, and /sbin/stop salt-minion doesn't detect it, but its existence prevents start from getting anywhere...) [17:00:49] <_joe_> bblack: I think there is a stale pidfile around [17:00:55] PROBLEM - puppet last run on search1017 is CRITICAL: CRITICAL: Epic puppet fail [17:01:04] PROBLEM - puppet last run on ssl1009 is CRITICAL: CRITICAL: Epic puppet fail [17:01:05] stale process in my case [17:01:16] PROBLEM - puppet last run on ssl3002 is CRITICAL: CRITICAL: Epic puppet fail [17:01:32] <_joe_> bblack: in mine too [17:01:32] did any of the salt refactor move the pidfile to a new path by chance?
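For the "Master hostname: salt not found" retry loop, the usual first checks are which config the minion is actually reading and whether the configured master name resolves; roughly:

    grep -r '^master:' /etc/salt/minion /etc/salt/minion.d/ 2>/dev/null
    getent hosts palladium.eqiad.wmnet salt   # which of the two names resolves?
    tail -n5 /var/log/salt/minion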
[17:01:33] PROBLEM - puppet last run on wtp1004 is CRITICAL: CRITICAL: Epic puppet fail [17:01:39] <_joe_> bblack: no [17:01:43] PROBLEM - puppet last run on wtp1023 is CRITICAL: CRITICAL: Epic puppet fail [17:01:43] PROBLEM - puppet last run on mw1210 is CRITICAL: CRITICAL: Epic puppet fail [17:01:43] PROBLEM - puppet last run on wtp1022 is CRITICAL: CRITICAL: Epic puppet fail [17:01:44] PROBLEM - puppet last run on search1023 is CRITICAL: CRITICAL: Epic puppet fail [17:01:44] <_joe_> it's the stale process [17:01:53] PROBLEM - puppet last run on mw1212 is CRITICAL: CRITICAL: Epic puppet fail [17:02:02] <_joe_> so, we just need to go around the cluster, kill all failing salt calls [17:02:04] PROBLEM - puppet last run on wtp1002 is CRITICAL: CRITICAL: Epic puppet fail [17:02:13] PROBLEM - puppet last run on search1024 is CRITICAL: CRITICAL: Epic puppet fail [17:02:16] can we use salt? :) [17:02:24] PROBLEM - puppet last run on ssl1008 is CRITICAL: CRITICAL: Epic puppet fail [17:02:43] PROBLEM - puppet last run on wtp1011 is CRITICAL: CRITICAL: Epic puppet fail [17:02:44] PROBLEM - puppet last run on search1015 is CRITICAL: CRITICAL: Epic puppet fail [17:02:44] PROBLEM - puppet last run on wtp1013 is CRITICAL: CRITICAL: Epic puppet fail [17:02:44] <_joe_> andrewbogott: nope I guess [17:02:46] !log Restarted logstash on logstash1001. I hoped this would fix the dashboards, but it looks like the backing elasticsearch cluster is too sad for them to work at the moment. [17:02:51] Logged the message, Master [17:02:55] PROBLEM - puppet last run on rdb1002 is CRITICAL: CRITICAL: Epic puppet fail [17:02:55] PROBLEM - puppet last run on ssl1004 is CRITICAL: CRITICAL: Epic puppet fail [17:03:04] PROBLEM - puppet last run on wtp1015 is CRITICAL: CRITICAL: Epic puppet fail [17:03:07] * andrewbogott gets to work on the virt cluster [17:03:09] well, we need to kill the old minion manually too [17:03:16] (and any outstanding salt-call maybe) [17:03:24] PROBLEM - puppet last run on ssl1007 is CRITICAL: CRITICAL: Epic puppet fail [17:03:34] PROBLEM - puppet last run on search1013 is CRITICAL: CRITICAL: Epic puppet fail [17:03:45] PROBLEM - puppet last run on mw1201 is CRITICAL: CRITICAL: Epic puppet fail [17:03:45] PROBLEM - puppet last run on pc1003 is CRITICAL: CRITICAL: Epic puppet fail [17:03:45] PROBLEM - puppet last run on wtp1003 is CRITICAL: CRITICAL: Epic puppet fail [17:03:53] PROBLEM - puppet last run on wtp1007 is CRITICAL: CRITICAL: Epic puppet fail [17:03:54] RECOVERY - puppet last run on ssl1004 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [17:03:56] salt still works btw [17:04:04] I'm gonna try to make it shoot its own client I guess? [17:04:10] (and then puppet will restart) [17:04:13] PROBLEM - puppet last run on rdb1003 is CRITICAL: CRITICAL: Epic puppet fail [17:04:13] PROBLEM - puppet last run on ocg1003 is CRITICAL: CRITICAL: Epic puppet fail [17:04:15] PROBLEM - puppet last run on mw1219 is CRITICAL: CRITICAL: Epic puppet fail [17:04:15] PROBLEM - puppet last run on search1006 is CRITICAL: CRITICAL: Epic puppet fail [17:04:43] PROBLEM - puppet last run on ssl1006 is CRITICAL: CRITICAL: Epic puppet fail [17:04:53] !log using salt to kill salt-minion everywhere... 
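"Using salt to kill salt-minion" presumably comes down to a cmd.run broadcast from the master; minions wedged behind a hung salt-call will never answer it, which would explain why this only catches some hosts:

    salt -t 10 '*' cmd.run 'pkill -f "salt-call --out=txt test.ping"'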
[17:04:56] PROBLEM - puppet last run on search1011 is CRITICAL: CRITICAL: Epic puppet fail [17:04:56] PROBLEM - puppet last run on ssl1001 is CRITICAL: CRITICAL: Epic puppet fail [17:04:57] PROBLEM - puppet last run on search1022 is CRITICAL: CRITICAL: Epic puppet fail [17:04:59] Logged the message, Master [17:05:14] PROBLEM - puppet last run on mw1209 is CRITICAL: CRITICAL: Epic puppet fail [17:05:14] PROBLEM - puppet last run on wtp1010 is CRITICAL: CRITICAL: Epic puppet fail [17:05:15] RECOVERY - puppet last run on virt1000 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [17:05:15] PROBLEM - puppet last run on mw1203 is CRITICAL: CRITICAL: Epic puppet fail [17:05:19] doesn't seem to have worked really [17:05:21] <_joe_> bblack: salt doesn't really work [17:05:28] <_joe_> I tried [17:05:29] PROBLEM - puppet last run on search1004 is CRITICAL: CRITICAL: Epic puppet fail [17:05:35] PROBLEM - puppet last run on ssl3001 is CRITICAL: CRITICAL: Epic puppet fail [17:05:40] <_joe_> bblack: we need a list of all the hosts wehre puppet is failing [17:05:53] RECOVERY - puppet last run on virt1001 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [17:05:54] PROBLEM - puppet last run on mw1204 is CRITICAL: CRITICAL: Epic puppet fail [17:05:54] PROBLEM - puppet last run on virt1008 is CRITICAL: CRITICAL: Epic puppet fail [17:05:54] PROBLEM - puppet last run on mw1220 is CRITICAL: CRITICAL: Epic puppet fail [17:05:58] I think what I did, did work for a bunch of hosts, just not all [17:05:59] sudo kill $(ps aux | grep '/usr/bin/python /usr/bin/salt-call --out=txt test.ping' | awk '{print $2}') [17:06:03] PROBLEM - puppet last run on rbf1001 is CRITICAL: CRITICAL: Epic puppet fail [17:06:03] PROBLEM - puppet last run on mw1215 is CRITICAL: CRITICAL: Epic puppet fail [17:06:06] just that works for me [17:06:16] PROBLEM - puppet last run on search1012 is CRITICAL: CRITICAL: Epic puppet fail [17:06:16] RECOVERY - puppet last run on virt1002 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [17:06:16] PROBLEM - puppet last run on mw1207 is CRITICAL: CRITICAL: Epic puppet fail [17:06:17] !log Restarted elasticsearch on logstash1001; 2014-09-15T06:12:09Z java.lang.OutOfMemoryError [17:06:19] (03PS2) 10Ottomata: Add ferm::service rule for zookeeper admin port [puppet] - 10https://gerrit.wikimedia.org/r/153801 [17:06:22] Logged the message, Master [17:06:24] <_joe_> andrewbogott: sudo killall salt-call [17:06:32] andrewbogott: it didn't have a stale salt-minion process that wouldn't stop, also? [17:06:43] PROBLEM - puppet last run on wtp1001 is CRITICAL: CRITICAL: Epic puppet fail [17:06:43] PROBLEM - puppet last run on wtp1008 is CRITICAL: CRITICAL: Epic puppet fail [17:06:47] <_joe_> bblack: it stops [17:06:54] ok [17:06:57] <_joe_> as soon as the salt-call dies [17:07:13] PROBLEM - puppet last run on virt1005 is CRITICAL: CRITICAL: Epic puppet fail [17:07:23] well /sbin/stop wouldn't detect it. Perhaps it was trying to stop, wiped its own pidfile, but was still hung waiting on the salt-call then [17:07:33] PROBLEM - puppet last run on search1003 is CRITICAL: CRITICAL: Epic puppet fail [17:07:43] PROBLEM - puppet last run on ssl1003 is CRITICAL: CRITICAL: Epic puppet fail [17:07:43] (and holding resources preventing another start, and then start hangs instead of failing gracefully..) 
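The ps-grep-awk one-liner above works, but it also matches its own grep inside the command substitution, so kill then complains about an already-gone pid; pkill -f is the same thing without the self-match:

    sudo pkill -f '/usr/bin/salt-call --out=txt test.ping'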
[17:07:47] what a comedy of errors [17:07:48] <_joe_> bblack: we could run an exec via puppet [17:07:54] PROBLEM - puppet last run on tmh1001 is CRITICAL: CRITICAL: Epic puppet fail [17:07:58] that's probably best, wanna do it? [17:08:03] RECOVERY - puppet last run on virt1004 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [17:08:05] (03CR) 10Ottomata: [C: 032 V: 032] "Fingers crossed!" [puppet] - 10https://gerrit.wikimedia.org/r/153801 (owner: 10Ottomata) [17:08:14] PROBLEM - puppet last run on search1019 is CRITICAL: CRITICAL: Epic puppet fail [17:08:15] except puppet won't run will it? [17:08:15] PROBLEM - puppet last run on virt1009 is CRITICAL: CRITICAL: Epic puppet fail [17:08:21] Due to a lock file due to the previous run hanging [17:08:21] puppet runs from cron [17:08:24] maybe it times out eventually [17:08:31] it does time out, eventually [17:08:37] ah, ok [17:08:43] RECOVERY - puppet last run on virt1003 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [17:08:54] PROBLEM - puppet last run on virt0 is CRITICAL: CRITICAL: Epic puppet fail [17:08:54] PROBLEM - puppet last run on search1008 is CRITICAL: CRITICAL: Epic puppet fail [17:09:18] <_joe_> !log killing salt-call on all mediawiki hosts [17:09:22] Logged the message, Master [17:09:47] <_joe_> yeah but subsequent puppet runs will not work [17:09:50] <_joe_> either [17:10:18] put it in an early stage so it hits it before it hangs on the salt-minion [17:11:06] RECOVERY - puppet last run on virt1007 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [17:11:25] RECOVERY - puppet last run on virt1009 is OK: OK: Puppet is currently enabled, last run 60 seconds ago with 0 failures [17:11:44] RECOVERY - puppet last run on ssl1008 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [17:11:49] reconfirmed even in the cases I'm looking at "killall salt-call" is enough to get things moving again [17:12:17] <_joe_> bblack: it is [17:12:37] <_joe_> we should add that to the stop stanza of our 'beautifully enhanced' salt upstart script [17:12:56] ok I'm gonna fix them all from here [17:13:04] RECOVERY - puppet last run on ssl1001 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [17:13:04] <_joe_> how? [17:13:22] <_joe_> bblack: the only way I thought of was to collect all hosts from puppet [17:13:23] RECOVERY - puppet last run on mw1207 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [17:13:23] (03CR) 10Manybubbles: [C: 031] Cast cluster health keys and values to strings before attempting to utf8 encode them [puppet] - 10https://gerrit.wikimedia.org/r/160090 (owner: 10Ottomata) [17:13:24] by listing the nodes from the puppet yaml dir on the master, and looping over ssh root@ from my local box [17:13:33] <_joe_> bblack: eh [17:13:33] RECOVERY - puppet last run on mw1219 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [17:13:38] <_joe_> that was my idea as well [17:13:40] (03PS2) 10Ottomata: Cast cluster health keys and values to strings before attempting to utf8 encode them [puppet] - 10https://gerrit.wikimedia.org/r/160090 [17:13:43] PROBLEM - ElasticSearch health check on logstash1003 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.136 [17:13:44] PROBLEM - ElasticSearch health check on logstash1002 is CRITICAL: CRITICAL - elasticsearch (production-logstash-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 2: number_of_data_nodes: 2: active_primary_shards: 35: active_shards: 39: relocating_shards: 0: initializing_shards: 7: unassigned_shards: 57 [17:13:47] (03CR) 10Ottomata: [C: 032 V: 032] Cast cluster health keys and values to strings before attempting to utf8 encode them [puppet] - 10https://gerrit.wikimedia.org/r/160090 (owner: 10Ottomata) [17:13:55] PROBLEM - ElasticSearch health check on logstash1001 is CRITICAL: CRITICAL - elasticsearch (production-logstash-eqiad) is running. status: red: timed_out: false: number_of_nodes: 2: number_of_data_nodes: 2: active_primary_shards: 35: active_shards: 39: relocating_shards: 0: initializing_shards: 7: unassigned_shards: 57 [17:13:55] RECOVERY - puppet last run on mw1201 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [17:13:57] <_joe_> bblack: but it's a waste of time [17:14:07] ? [17:14:07] <_joe_> lemme fix most hosts now via dsh [17:14:11] ok [17:14:22] we have an easy way to get the failing list after that? [17:14:28] RECOVERY - puppet last run on mw1209 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [17:14:34] !log Restarted elasticsearch on logstash1003; 2014-09-14T09:33:57Z java.lang.OutOfMemoryError [17:14:39] <_joe_> bblack: I guess so [17:14:40] Logged the message, Master [17:14:46] greg-g: fyi, with permission of the parsoid team, I've added OCG to the Parsoid deploy slot this afternoon. [17:14:59] <_joe_> bblack: or, I have a better idea [17:15:03] RECOVERY - puppet last run on mw1204 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [17:15:04] RECOVERY - puppet last run on mw1220 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [17:15:14] RECOVERY - puppet last run on mw1215 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [17:15:35] RECOVERY - puppet last run on mw1203 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [17:15:53] RECOVERY - puppet last run on wtp1001 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [17:15:53] RECOVERY - puppet last run on wtp1008 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [17:16:13] RECOVERY - puppet last run on virt1008 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [17:16:56] <_joe_> bblack: can we live with some failing hosts for some time? 
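The brute-force loop bblack describes (node list from the master's yaml dir, ssh as root) is short enough to sketch; the yaml path is the stock puppetmaster default, and the xargs variant is roughly the "10 in parallel" version he switches to a bit further down:

    # serial version, ~5s connect budget per host
    for f in /var/lib/puppet/yaml/node/*.yaml; do
      h=$(basename "$f" .yaml)
      ssh -o ConnectTimeout=5 "root@$h" 'killall salt-call' </dev/null
    done

    # 10-way parallel version of the same cleanup
    ls /var/lib/puppet/yaml/node/ | sed 's/\.yaml$//' |
      xargs -P10 -I{} ssh -o ConnectTimeout=5 root@{} 'killall salt-call'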
[17:17:01] yeah [17:17:04] RECOVERY - puppet last run on tmh1001 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [17:17:17] honestly the brute-force ssh loop wouldn't take long and likely wouldn't hurt anything else, though [17:17:26] <_joe_> ok, then go on :) [17:17:58] (03PS1) 10Ottomata: Include ferm class on zookeeper servers [puppet] - 10https://gerrit.wikimedia.org/r/160476 [17:18:00] <_joe_> I wanted to add something to salt-minion.override for the stop stanza of upstart [17:18:07] <_joe_> a killall salt-call [17:18:13] <_joe_> which we should add anyway [17:18:28] <_joe_> I should have fixed 99% of the hosts already [17:18:29] RECOVERY - puppet last run on virt1005 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [17:18:29] RECOVERY - puppet last run on wtp1014 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [17:18:34] RECOVERY - puppet last run on mw1214 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [17:18:37] (03CR) 10Ottomata: [C: 032 V: 032] Include ferm class on zookeeper servers [puppet] - 10https://gerrit.wikimedia.org/r/160476 (owner: 10Ottomata) [17:19:14] RECOVERY - puppet last run on virt0 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [17:19:24] RECOVERY - puppet last run on tmh1002 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [17:19:33] <_joe_> bblack: dsh has almost all the groups we need [17:20:35] RECOVERY - ElasticSearch health check for shards on elastic1002 is OK: OK - elasticsearch status production-search-eqiad: status: green, number_of_nodes: 18, unassigned_shards: 0, timed_out: False, active_primary_shards: 2012, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 6031, initializing_shards: 0, number_of_data_nodes: 18 [17:20:45] RECOVERY - puppet last run on mw1218 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [17:21:04] RECOVERY - puppet last run on wtp1021 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [17:21:17] RECOVERY - puppet last run on mw1216 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [17:21:17] RECOVERY - puppet last run on wtp1024 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [17:21:33] RECOVERY - puppet last run on wtp1019 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [17:21:34] RECOVERY - puppet last run on wtp1017 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [17:21:44] RECOVERY - puppet last run on search1020 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [17:22:14] it's going to take something like half an hour for the simple ssh loop, since no parallelism in it [17:22:20] not too bad for cleaning up whatever's left though [17:22:23] RECOVERY - puppet last run on wtp1009 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [17:22:33] PROBLEM - RAID on analytics1023 is CRITICAL: Timeout while attempting connection [17:22:33] PROBLEM - DPKG on analytics1023 is CRITICAL: Timeout while attempting connection [17:22:46] PROBLEM - check configured eth on analytics1023 is CRITICAL: Timeout while attempting connection [17:23:03] PROBLEM - check if dhclient is running on analytics1023 is CRITICAL: Timeout while attempting connection [17:23:03] PROBLEM - Disk space on 
analytics1023 is CRITICAL: Timeout while attempting connection [17:23:03] PROBLEM - puppet last run on analytics1023 is CRITICAL: Timeout while attempting connection [17:23:14] RECOVERY - puppet last run on wtp1020 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [17:23:34] RECOVERY - puppet last run on wtp1006 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [17:23:54] RECOVERY - puppet last run on ssl1002 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [17:23:54] RECOVERY - puppet last run on search1016 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [17:24:34] RECOVERY - puppet last run on mw1217 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [17:24:35] RECOVERY - puppet last run on wtp1016 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [17:24:45] RECOVERY - puppet last run on search1010 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [17:25:24] RECOVERY - puppet last run on mw1205 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [17:25:24] RECOVERY - puppet last run on search1001 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [17:25:59] RECOVERY - puppet last run on search1018 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [17:26:24] RECOVERY - puppet last run on mw1213 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [17:26:38] ok I made it faster, doing 10 in parallel now [17:26:53] RECOVERY - puppet last run on search1007 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [17:26:54] RECOVERY - puppet last run on virt1006 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [17:27:23] RECOVERY - puppet last run on wtp1005 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [17:27:25] RECOVERY - puppet last run on wtp1012 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [17:27:34] RECOVERY - puppet last run on mw1002 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [17:27:44] RECOVERY - puppet last run on mw1211 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [17:27:44] RECOVERY - puppet last run on labstore1001 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [17:27:53] RECOVERY - puppet last run on mw1208 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [17:28:14] RECOVERY - puppet last run on mw1206 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [17:28:33] RECOVERY - puppet last run on search1002 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [17:28:46] RECOVERY - puppet last run on ssl1005 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [17:28:54] RECOVERY - puppet last run on wtp1018 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [17:28:54] RECOVERY - puppet last run on search1005 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [17:29:13] RECOVERY - puppet last run on mw1202 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [17:29:24] RECOVERY - puppet last run on search1017 is OK: OK: Puppet is currently enabled, 
last run 14 seconds ago with 0 failures [17:29:34] RECOVERY - puppet last run on ssl1009 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [17:29:44] RECOVERY - puppet last run on ssl3002 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [17:29:54] RECOVERY - puppet last run on wtp1004 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [17:30:04] RECOVERY - puppet last run on rdb1001 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [17:30:23] RECOVERY - puppet last run on search1023 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [17:30:45] RECOVERY - puppet last run on wtp1002 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [17:30:55] ok I've hit all of them I could connect to in under 5s anyways [17:31:06] there will probably be odd ones out we can pick up from icinga after a while [17:31:14] RECOVERY - puppet last run on wtp1023 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [17:31:17] RECOVERY - puppet last run on mw1210 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [17:31:17] RECOVERY - puppet last run on wtp1022 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [17:31:23] RECOVERY - puppet last run on mw1212 is OK: OK: Puppet is currently enabled, last run 61 seconds ago with 0 failures [17:31:24] RECOVERY - puppet last run on wtp1013 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [17:31:24] RECOVERY - puppet last run on search1015 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [17:31:33] RECOVERY - puppet last run on rdb1002 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [17:31:34] RECOVERY - puppet last run on search1024 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [17:31:34] RECOVERY - puppet last run on wtp1015 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [17:32:07] RECOVERY - puppet last run on wtp1011 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [17:32:30] RECOVERY - puppet last run on wtp1003 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [17:32:33] RECOVERY - puppet last run on wtp1007 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [17:32:52] RECOVERY - puppet last run on rdb1003 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [17:32:53] RECOVERY - puppet last run on ocg1003 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [17:32:56] there's only ~30-40 left as it is, and some of those will clean up as they hit puppet-cron [17:33:04] RECOVERY - puppet last run on ssl1007 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [17:33:13] RECOVERY - puppet last run on search1013 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [17:33:23] RECOVERY - puppet last run on pc1003 is OK: OK: Puppet is currently enabled, last run 66 seconds ago with 0 failures [17:33:44] RECOVERY - puppet last run on search1022 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [17:34:03] RECOVERY - puppet last run on wtp1010 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [17:34:04]
RECOVERY - puppet last run on search1006 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [17:34:23] RECOVERY - puppet last run on ssl1006 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [17:34:24] RECOVERY - puppet last run on search1011 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [17:34:43] RECOVERY - puppet last run on rbf1001 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [17:34:53] RECOVERY - puppet last run on search1012 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [17:35:14] RECOVERY - puppet last run on search1004 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [17:35:15] PROBLEM - NTP on analytics1023 is CRITICAL: NTP CRITICAL: No response from NTP server [17:35:15] RECOVERY - puppet last run on ssl3001 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [17:37:28] RECOVERY - puppet last run on search1003 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [17:37:29] RECOVERY - puppet last run on ssl1003 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [17:38:17] RECOVERY - puppet last run on search1019 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [17:38:19] RECOVERY - puppet last run on ocg1002 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [17:38:46] RECOVERY - puppet last run on search1008 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [17:38:55] RECOVERY - puppet last run on search1021 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [17:38:55] RECOVERY - puppet last run on search1009 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [17:40:18] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [17:40:18] RECOVERY - puppet last run on search1014 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [17:40:53] RECOVERY - puppet last run on pc1001 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [17:41:24] RECOVERY - puppet last run on ssl3003 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [17:41:24] RECOVERY - puppet last run on ocg1001 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [17:41:58] RECOVERY - puppet last run on rdb1004 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [17:43:01] So many sweet, sweet recoveries [17:43:19] RECOVERY - puppet last run on rbf1002 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [17:44:31] (03PS1) 10Ottomata: Allow base::firewall to specify an 'accept' policy by default [puppet] - 10https://gerrit.wikimedia.org/r/160480 [17:44:34] RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [17:45:14] <_joe_> ottomata: mmmm don't do that [17:45:58] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "the drop all by default is a sane policy and we should progressively enforce it everywhere." 
[puppet] - 10https://gerrit.wikimedia.org/r/160480 (owner: 10Ottomata) [17:46:19] _joe_, not doing it, just sent email [17:46:21] that's just an idea [17:46:27] i'm doing the hacky fix now [17:46:59] <_joe_> :) [17:47:06] RECOVERY - puppet last run on pc1002 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [17:47:18] (03PS1) 10Ottomata: Render defs.conf for ferm on zookeeper servers [puppet] - 10https://gerrit.wikimedia.org/r/160482 [17:47:18] ottomata: count me as a -1 too :) [17:47:25] (03PS4) 10BBlack: add textsvc/uploadsvc in ulsfo for consistency [puppet] - 10https://gerrit.wikimedia.org/r/160014 [17:47:44] see email, all I want is to be able to use ferm without restricting everything [17:48:29] (03CR) 10BBlack: [C: 032] add textsvc/uploadsvc in ulsfo for consistency [puppet] - 10https://gerrit.wikimedia.org/r/160014 (owner: 10BBlack) [17:48:31] (03CR) 10Ottomata: "Aye, this is just an idea. All I want is to be able to use ferm without having to restrict everything right now. See email in ops list." [puppet] - 10https://gerrit.wikimedia.org/r/160480 (owner: 10Ottomata) [17:49:28] (03PS2) 10Ottomata: Render defs.conf for ferm on zookeeper servers [puppet] - 10https://gerrit.wikimedia.org/r/160482 [17:50:10] legoktm: your help is needed [17:50:17] mutante: around for a few icinga questions? I'm starting to write a shinken module, and wanted a few small answers :) [17:50:25] (not going to ask you to merge anything, I promise!) [17:50:41] YuviPanda: ask, someone here might know :D [17:50:48] so... [17:50:50] why do we use naggen? [17:50:59] is it because icinga expects to read one config file? [17:51:05] (03PS3) 10Ottomata: Render defs.conf for ferm on zookeeper servers [puppet] - 10https://gerrit.wikimedia.org/r/160482 [17:51:23] that one is for _joe_ in fact [17:51:25] (03CR) 10Ottomata: [C: 032 V: 032] Render defs.conf for ferm on zookeeper servers [puppet] - 10https://gerrit.wikimedia.org/r/160482 (owner: 10Ottomata) [17:51:41] who rewrote it iirc [17:52:15] oh [17:52:26] * YuviPanda wonders if there's a 'load' metric for _joe_, how high it would be [17:52:34] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [17:53:36] (03PS1) 10Ottomata: Need network::constants to render ferm defs.conf [puppet] - 10https://gerrit.wikimedia.org/r/160485 [17:53:40] (03CR) 10jenkins-bot: [V: 04-1] Need network::constants to render ferm defs.conf [puppet] - 10https://gerrit.wikimedia.org/r/160485 (owner: 10Ottomata) [17:53:43] (03PS2) 10Ottomata: Need network::constants to render ferm defs.conf [puppet] - 10https://gerrit.wikimedia.org/r/160485 [17:53:53] (03CR) 10Ottomata: [C: 032 V: 032] Need network::constants to render ferm defs.conf [puppet] - 10https://gerrit.wikimedia.org/r/160485 (owner: 10Ottomata) [17:54:00] YuviPanda: anyhow the short answer is: # Naggen takes exported resources from hosts and creates nagios [17:54:00] # configuration files [17:54:21] matanya: true, but why have naggen? why can't the resources themselves just be realized on the host, and the resources define individual files? [17:55:18] no clue :) [17:55:37] <_joe_> YuviPanda: eh? [17:55:40] <_joe_> wat?
[17:55:43] RECOVERY - DPKG on analytics1023 is OK: All packages OK [17:55:43] RECOVERY - RAID on analytics1023 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [17:55:44] RECOVERY - NTP on analytics1023 is OK: NTP OK: Offset 0.001294851303 secs [17:55:44] RECOVERY - check configured eth on analytics1023 is OK: NRPE: Unable to read output [17:55:59] <_joe_> resources realized on what host? [17:56:03] RECOVERY - Disk space on analytics1023 is OK: DISK OK [17:56:09] on neon (in prod's case) [17:56:11] RECOVERY - check if dhclient is running on analytics1023 is OK: PROCS OK: 0 processes with command name dhclient [17:56:11] RECOVERY - puppet last run on analytics1023 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [17:56:13] <_joe_> so, we export resources, and we collect them on neon [17:56:34] <_joe_> we use naggen for the collection because puppet itself only scales well to tens of servers [17:57:06] <_joe_> and using naggen2 vs naggen (which still used puppet internals) brought the single neon puppet run down from 25 mins to ~ 6 [17:57:12] aaaaahhhhhh [17:57:13] I see [17:57:24] ottomata: sure (re: merge) [17:57:55] godog, already done! :) [17:58:24] _joe_: I'm beginning to check out shinken (use for labs to begin with, possibly migrate prod later on), so wanted to understand how we do things in icinga a bit more [17:58:26] will dig [17:58:29] ottomata: haha perfect, thanks! [17:59:58] PROBLEM - puppet last run on analytics1024 is CRITICAL: Timeout while attempting connection [18:00:04] ejegg, awight: Respected human, time to deploy CentralNotice (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140915T1800). Please do the needful. [18:01:03] RECOVERY - puppet last run on analytics1024 is OK: OK: Puppet is currently enabled, last run 63 seconds ago with 0 failures [18:02:59] YuviPanda: told you ? :D [18:03:06] :) [18:13:14] (03PS2) 10Krinkle: contint: Ensure nodejs-legacy is installed [puppet] - 10https://gerrit.wikimedia.org/r/159226 [18:14:51] (03PS5) 10Krinkle: hhvm: create module + list all dev dependencies [puppet] - 10https://gerrit.wikimedia.org/r/150813 (https://bugzilla.wikimedia.org/63120) (owner: 10Hashar) [18:15:25] (03CR) 10Krinkle: "This is already deployed on the puppetmaster for integration slaves in labs." [puppet] - 10https://gerrit.wikimedia.org/r/150813 (https://bugzilla.wikimedia.org/63120) (owner: 10Hashar) [18:15:57] (03CR) 10Krinkle: "Hashar: This was already deployed on the puppetmaster for integration slaves in labs. Should it be reverted there?"
[puppet] - 10https://gerrit.wikimedia.org/r/154401 (owner: 10Hashar) [18:24:04] PROBLEM - puppet last run on lvs4002 is CRITICAL: CRITICAL: Epic puppet fail [18:24:08] showJobs.php is weird :S [18:28:58] (03CR) 10Hashar: "Timo wrote:" [puppet] - 10https://gerrit.wikimedia.org/r/150813 (https://bugzilla.wikimedia.org/63120) (owner: 10Hashar) [18:29:44] (03PS1) 10Awight: Revert "Enable FundraisingTranslateWorkflow on metawiki (take 2)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160493 [18:30:03] (03Abandoned) 10Awight: Revert "Enable FundraisingTranslateWorkflow on metawiki (take 2)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160493 (owner: 10Awight) [18:32:55] (03PS1) 10Awight: Revert "Enable FundraisingTranslateWorkflow on the beta cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160494 [18:33:48] (03PS2) 10Awight: Revert "Enable FundraisingTranslateWorkflow on the beta cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160494 [18:36:52] (03CR) 10Chmarkine: redirect http->https on racktables (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/160164 (owner: 10Dzahn) [18:40:29] !log Setting Cirrus to jawiki's primary search backend went well but Japan is mostly asleep. If Elasticsearch load takes a turn for the worse in four or five hours then we'll know how it went. [18:40:35] Logged the message, Master [18:40:43] !log Local part of the global rename of Gnumarcoo => .avgas fatally timed out on itwiki. This needs to be fixed by hand. [18:40:49] Logged the message, Master [18:40:51] legoktm: ^ [18:41:01] Be careful with that one... Nemo_bis messed with the old user name [18:43:09] !log performance tests show cirrus should handle jawiki with no problem but if load spirals out of control and I'm not around then revert https://gerrit.wikimedia.org/r/#/c/160465/ [18:43:14] Logged the message, Master [18:43:34] RECOVERY - puppet last run on lvs4002 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [18:44:01] !log ejegg Synchronized php-1.24wmf21/extensions/CentralNotice/: Update CentralNotice to remove jquery.json dependency (duration: 00m 09s) [18:44:08] Logged the message, Master [18:44:46] Dear ops, we had sync-dir errors from tmh1001 and tmh1002 [18:44:48] what are those boxen? [18:45:38] video scalers [18:45:53] hoo: yes, thanks. I'm no longer worried then :) [18:46:23] !log Sync to tmh100[12] failed, according to awight [18:46:30] Logged the message, Master [18:46:30] thx! [18:48:17] (03PS1) 10Matanya: admin: add subbu and gwicke to ocg-render-admins [puppet] - 10https://gerrit.wikimedia.org/r/160497 [18:48:55] (03CR) 10jenkins-bot: [V: 04-1] admin: add subbu and gwicke to ocg-render-admins [puppet] - 10https://gerrit.wikimedia.org/r/160497 (owner: 10Matanya) [18:49:48] !log ejegg Synchronized php-1.24wmf20/extensions/CentralNotice/: Update CentralNotice to remove jquery.json dependency (duration: 00m 23s) [18:49:55] Logged the message, Master [18:51:35] (03PS2) 10Matanya: admin: add subbu and gwicke to ocg-render-admins [puppet] - 10https://gerrit.wikimedia.org/r/160497 [18:51:38] typos ... :/ [19:00:32] (03CR) 10Greg Grossmeier: "This caused a breakage to scap in Beta Cluster, see: https://bugzilla.wikimedia.org/show_bug.cgi?id=70858" [puppet] - 10https://gerrit.wikimedia.org/r/160485 (owner: 10Ottomata) [19:01:24] legoktm: ??? [19:02:01] Did you just resolve that by hand?
[19:35:01] (03PS1) 10Gage: Hadoop logging regression fixes [puppet/cdh] - 10https://gerrit.wikimedia.org/r/160505 [19:37:46] (03CR) 10Ottomata: [C: 032 V: 032] Hadoop logging regression fixes [puppet/cdh] - 10https://gerrit.wikimedia.org/r/160505 (owner: 10Gage) [19:43:53] (03PS1) 10Gage: merge hadoop logging regression fix in modules/cdh [puppet] - 10https://gerrit.wikimedia.org/r/160507 [19:48:18] ottomata: do you know who I should ask about labs instance 'wikimetrics1'? You perhaps? [19:48:41] you can ask me, milimetric and nuria also know things [19:49:02] ok -- my question is, how hard will it be to reproduce that instance if I destroy it? [19:49:31] I need to migrate hosts off of virt1006 (where that one is). I'm /pretty sure/ I can move it safely, but I'd like a few less-valuable test subjects [19:49:31] andrewbogott: not bad except for whatever Yuvi did this morning [19:49:40] YuviPanda: ^ ? [19:49:41] I'm not 100% sure if it's perfectly puppetized [19:49:45] not hard, except there are locally applied puppet changes there [19:49:49] milimetric: it's fine now [19:49:53] ok, cool [19:49:58] but, it is a local puppet master [19:50:05] which has locally maintained changes [19:50:18] there are three changes setting passwords and patching things like phabricator [19:50:21] wikimetrics1 is probably the most valuable analytics labs instance you could think of [19:52:07] andrewbogott: milimetric whatever I did didn't change puppetization status. I committed patches to mirror the manual steps I did [19:52:53] ok, I will save this one for later when I'm more confident... [19:53:35] andrewbogott: if you want less valuable test subjects, wikimetrics-dev1 is all yours [19:53:52] milimetric: it needs to be on virt1006. Whole list is here: https://etherpad.wikimedia.org/p/virt1006migrate [19:55:04] ah, sorry don't see any of ours there [19:55:13] (except wikimetrics1) [20:00:26] (03CR) 10Gage: [C: 032] merge hadoop logging regression fix in modules/cdh [puppet] - 10https://gerrit.wikimedia.org/r/160507 (owner: 10Gage) [20:02:56] ok fix is on stat1002, both commands work. that slf4j warning is annoying, but seems harmless. [20:10:55] (03PS1) 10Hoo man: HHVM: Increase the maximum number of open files to 16384 [puppet] - 10https://gerrit.wikimedia.org/r/160510 [20:11:02] _joe_: ori_ ^ [20:11:12] mw1017 is maxing out its limit [20:11:26] <_joe_> hoo: lemme check why please [20:11:31] I think, at least [20:12:14] <_joe_> hoo: It's most certainly not [20:12:23] <_joe_> hoo: some logs in fatal.log? [20:12:38] hoo@mw1017:~$ sudo -u apache lsof | grep hhvm | wc -l [20:12:39] 2580 [20:12:44] so true [20:12:52] (03PS1) 10Dzahn: people.wm - use (smaller) image thumbnail [puppet] - 10https://gerrit.wikimedia.org/r/160511 [20:12:52] mh... hhvm.log is full of failed mysql stuff [20:13:03] <_joe_> yes, seeing that [20:13:22] <_joe_> springle: "Sep 15 20:12:27 mw1017 hhvm: message repeated 26 times: [ #012Warning: Unable to record MySQL stats with: SELECT MASTER_POS_WAIT('db1038-bin.000775', 1037193327, 10)] [20:13:26] <_joe_> Sep 15 20:12:32 mw1017 hhvm: #012Warning: Unable to record MySQL stats with: SELECT MASTER_POS_WAIT('db1038-bin.000775', 823974672, 10) [20:13:35] (03PS2) 10Dzahn: people.wm - use (smaller) image thumbnail [puppet] - 10https://gerrit.wikimedia.org/r/160511 [20:14:09] <_joe_> hoo: it has 178 open files at the moment [20:14:41] <_joe_> looks like an app problem [20:15:08] wait, mobile app?
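The lsof pipeline above counts more than the process's fd table: mem-mapped libraries, cwd/txt entries and the like all match the grep, and none of them count against RLIMIT_NOFILE. Counting /proc directly, against the limit the running process actually got, is less ambiguous; a sketch assuming a single hhvm server process:

    HHVM_PID=$(pgrep -ox hhvm)
    sudo ls /proc/$HHVM_PID/fd | wc -l              # real fd count
    grep 'Max open files' /proc/$HHVM_PID/limits    # limit applied to this process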
[20:15:21] * YuviPanda guesses not [20:16:21] (03CR) 10Aaron Schulz: [C: 031] HHVM: Increase the maximum number of open files to 16384 [puppet] - 10https://gerrit.wikimedia.org/r/160510 (owner: 10Hoo man) [20:16:25] (03PS3) 10Dzahn: people.wm - use (smaller) image thumbnail [puppet] - 10https://gerrit.wikimedia.org/r/160511 [20:16:58] _joe_: Ok... we're seeing inconsistencies again and that seemed like an obvious reason :S [20:17:03] Sorry to bother [20:17:04] (03PS4) 10BBlack: Remove dead addrs from protoproxies [puppet] - 10https://gerrit.wikimedia.org/r/160015 [20:17:13] <_joe_> hoo: no bother at all [20:17:22] <_joe_> actually, I see a lot of anon_inodes in lsof [20:18:21] (03CR) 10BBlack: [C: 032] Remove dead addrs from protoproxies [puppet] - 10https://gerrit.wikimedia.org/r/160015 (owner: 10BBlack) [20:18:40] (03PS4) 10BBlack: add login-lb to eqiad protoproxy [puppet] - 10https://gerrit.wikimedia.org/r/160016 [20:18:56] <_joe_> hoo: so, maybe you were right [20:19:04] (03CR) 10Dzahn: [C: 032] people.wm - use (smaller) image thumbnail [puppet] - 10https://gerrit.wikimedia.org/r/160511 (owner: 10Dzahn) [20:19:32] (03PS5) 10BBlack: add login-lb to eqiad protoproxy [puppet] - 10https://gerrit.wikimedia.org/r/160016 [20:19:45] _joe_: User apache was hitting the 4096... but hhvm alone not [20:19:52] (at least I didn't see it) [20:20:02] (03CR) 10BBlack: [C: 032 V: 032] add login-lb to eqiad protoproxy [puppet] - 10https://gerrit.wikimedia.org/r/160016 (owner: 10BBlack) [20:20:05] but whenever the lsof went close to 4k the mysql stuff came up [20:20:08] so I thought that's it [20:20:15] (03PS4) 10BBlack: Remove dead protoproxy entries completely [puppet] - 10https://gerrit.wikimedia.org/r/160017 [20:20:22] (03CR) 10BBlack: [C: 032 V: 032] Remove dead protoproxy entries completely [puppet] - 10https://gerrit.wikimedia.org/r/160017 (owner: 10BBlack) [20:20:31] <_joe_> hoo: anon_inodes are usually the consequence of too many open files [20:20:38] <_joe_> at least in my experience [20:21:01] mh... might be. So bump the limit and restart hhvm? [20:21:20] <_joe_> I restarted it, and the error went away [20:21:43] The open file count is way lower now also [20:22:05] <_joe_> yep [20:22:24] _joe_: those hhvm mysql errors, based on my understanding of php_mysql_do_query_general in ext_mysql.cpp... i think it's some client side issue with a regex [20:22:38] but i've only skimmed the source so far [20:22:58] <_joe_> springle: strange they stopped as soon as I restarted it [20:23:23] <_joe_> hoo: also, anon_inodes are there now as well, it's probably some trick with non-blocking io [20:27:20] mh... whatever it is, we need to keep an eye on it [20:27:29] and maybe raising the limit is still worth it [20:29:12] _joe_: https://github.com/facebook/hhvm/blob/37d09a68085bf23dc18253612ea29148fc66d22d/hphp/runtime/ext/mysql/mysql_common.cpp#L1349 ... size != 2 i guess. but why they stopped after restart, no clue [20:29:54] <_joe_> springle: eh! I'll take a look [20:30:44] it's just being too smart [20:34:13] <_joe_> springle: so that was just poisoning the error log because the query was 'unusual' [20:34:39] <_joe_> if we do happen to do a lot of such queries, we may want to patch that [20:34:41] _joe_: I think so. it may have stopped because the slave caught up [20:35:02] not because you restarted the client [20:35:11] <_joe_> springle: or because I restarted the server and it still didn't catch such a query? [20:35:23] <_joe_> springle: mmmh ok that may be [20:35:35] <_joe_> is that db serving testwiki?
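For readers following the ext_mysql tangent above: HHVM tries to classify each query (verb plus target table) for its per-table stats and warns when the parse does not yield the expected two pieces, which is what the "size != 2" remark refers to. A toy Python illustration of the failure mode; the regex below is a deliberately simplified stand-in, not HHVM's actual pattern:

    import re

    # Simplified stand-in for a stats classifier: expect a verb and a table.
    # The real logic lives in php_mysql_do_query_general in ext_mysql.
    PATTERN = re.compile(
        r'^\s*(select|insert|update|delete)\b.*?\bfrom\s+(\S+)',
        re.IGNORECASE | re.DOTALL)

    def classify(sql):
        m = PATTERN.match(sql)
        if not m:
            # The analogue of HHVM's size != 2 branch, where it logs
            # "Unable to record MySQL stats with: <query>".
            return None
        return m.group(1).lower(), m.group(2)

    print(classify("SELECT page_id FROM page WHERE page_title = 'X'"))
    # -> ('select', 'page')
    print(classify("SELECT MASTER_POS_WAIT('db1038-bin.000775', 823974672, 10)"))
    # -> None: a bare function call has no FROM clause, hence no table
    #    to attribute stats to, hence the warning spam on a lagged slave.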
[20:35:41] MW issues the MASTER_POS_WAIT when a slave lags [20:35:43] hmm [20:35:47] yes [20:36:03] <_joe_> ok that explains why it was the only hhvm appserver doing that [20:36:22] <_joe_> springle: do you think we should open an issue upstream? [20:36:42] <_joe_> I think it's worth trying to have FB people fix that [20:36:44] _joe_: https://blog.freenode.net/2014/09/server-issues-2/ [20:37:04] any action needed on the server hosted by WMF? [20:37:08] _joe_: seems like a proper bug to me [20:38:36] matanya: The server is operated by freenode, only housed by WMF [20:41:49] matanya: it's down [20:41:59] 'halted', as it were [20:42:24] thanks mutante and greg-g and hoo and all :) [20:42:47] greg-g: do i poke you for globalrename issues ? [20:43:01] matanya: legoktm or me [20:43:21] or csteipp (but he's busy often) [20:43:41] since SUL is down, i can't check local unattached accounts, can something be done about that ? [20:43:51] what did I do? [20:44:00] csteipp: Nothing, yet :D [20:44:04] matanya: SUL is down? [20:44:20] What exactly is down? [20:44:24] tools.wmflabs.org/sulinfo/sulinfo.php [20:44:35] oh [20:44:40] that's a user tool :/ [20:44:51] i get php errors: Warning: mysqli::query(): Empty query in /data/project/quentinv57-tools/public_html/tools/sulinfo.php on line 207 [20:45:18] i know, hence asking if there is a more reliable tool to check local unattached accounts [20:46:09] !log deployed Parsoid version b845bff9 [20:46:14] Logged the message, Master [20:46:16] matanya: not at all ;) [20:46:30] i'll find a reason, don't worry [20:46:34] matanya: Of non-global users, nope there's not. [20:46:51] Yep... if you have a tool labs account you can poke the slaves there [20:46:52] so i rename, and get errors, great :P [20:47:05] but that's not nice [20:49:04] csteipp: Any reason to not support this in SpecialCentralAuth ? [20:49:20] Not sure how hard that is, but I might give it a shot [20:50:45] hoo: Not really, other than sulinfo existed, so we didn't need to. I think it would be a nice addition. [20:51:04] +1 [20:51:13] Shouldn't be too hard except that its code is a big pile of mud :P [20:51:27] Well, there's that, yeah ;) [20:55:52] andrewbogott: any idea why virt0 is filling dberror.log trying to connect to virt1000 db? [20:56:02] labswiki [20:56:32] springle: virt0 has a bunch of the same classes as virt1000, it probably isn't happy with the change to the deployment train. [20:56:46] I'd say ignore it if you can stand to; I'm going to kill off virt0 this week anyway [20:56:51] sure np [20:56:54] thanks [20:57:42] (03PS1) 10Andrew Bogott: Don't copy network filters during cold migrate. [puppet] - 10https://gerrit.wikimedia.org/r/160518 [20:58:33] (03CR) 10Andrew Bogott: [C: 032] Don't copy network filters during cold migrate. [puppet] - 10https://gerrit.wikimedia.org/r/160518 (owner: 10Andrew Bogott) [21:00:56] PROBLEM - puppet last run on cp3018 is CRITICAL: CRITICAL: Puppet has 1 failures [21:02:55] 'git deploy sync' seems to be very slow on deployment-bastion.eqiad.wmflabs [21:03:02] slow ~= hang. can't tell yet. [21:03:24] from where cscott? [21:03:35] /srv/deployment/ocg/ocg [21:04:09] also: boston ;) [21:05:00] ah, no root on that host :/ [21:05:16] i need to request it. bd808? [21:05:49] is there any -v type option that will give me more information? the ctrl-c backtrace seems to indicate that it is stuck talking to trebuchet [21:05:50] matanya: where? what? why? (hello!)
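Context for the MASTER_POS_WAIT exchange above: it is MySQL's primitive for blocking until a replica has applied the master's binlog up to a given position, which is how MediaWiki gates reads behind lagging slaves. A minimal sketch of the pattern, assuming a PyMySQL-style connection; `wait_for_replica` is an illustrative name, not MediaWiki's actual LoadBalancer code:

    import pymysql

    def wait_for_replica(conn, binlog_file, binlog_pos, timeout=10):
        """Block until the replica behind `conn` has replayed the master's
        binlog up to (binlog_file, binlog_pos).

        MASTER_POS_WAIT returns the number of events waited for, 0 if
        already caught up, -1 on timeout, and NULL if replication is not
        running, so the result must be checked, not just fired off."""
        with conn.cursor() as cur:
            cur.execute("SELECT MASTER_POS_WAIT(%s, %s, %s)",
                        (binlog_file, binlog_pos, timeout))
            (result,) = cur.fetchone()
        return result is not None and result >= 0

    # Usage: after a write, read SHOW MASTER STATUS on the master, then
    # gate replica reads on wait_for_replica(replica_conn, file, pos).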
[21:06:44] hi bd808, i would like root on deployment prep [21:06:53] http://pastebin.com/RgSZXXZ2 <- backtrace [21:06:53] bd808: ^ [21:07:30] matanya: i think that's "i'd like to be a member of the deployment-prep group", right? [21:07:42] cscott: i'm already [21:07:52] oh. that's enough to give me root. [21:08:43] matanya: Can you open a bug for that please and I'll try to get it fixed in the "near" future (next hour or so) [21:08:58] sure, thanks! :) [21:09:03] matanya: "that" being sudo on deployment project [21:09:15] that is what i need [21:09:29] also, to help hashar debug some ferm issues [21:09:40] mh, this was too easy [21:10:22] matanya: put it in the wikimedia labs -> deployment prep project so someone gets to it (the request) [21:10:41] already done, but thanks! :) [21:13:53] ok, now who can help me figure out why 'git deploy' is broken? [21:15:06] sorry cscott :) though i did push a change for your RT request ... [21:15:57] RECOVERY - puppet last run on cp3018 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [21:16:40] (03PS1) 10RobH: db2006 missing from lease file, adding [puppet] - 10https://gerrit.wikimedia.org/r/160521 [21:17:29] (03CR) 10RobH: [C: 032] db2006 missing from lease file, adding [puppet] - 10https://gerrit.wikimedia.org/r/160521 (owner: 10RobH) [21:18:57] PROBLEM - Host db2010 is DOWN: PING CRITICAL - Packet loss = 100% [21:20:22] RECOVERY - Host db2010 is UP: PING OK - Packet loss = 0%, RTA = 43.06 ms [21:20:23] bd808: are you the trebuchet guru? [21:20:40] cscott: ummmm... maybe? [21:20:52] cscott: beta or prod? [21:20:58] git deploy sync is hanging (no console output at all) on beta. [21:21:16] /srv/deployment/ocg/ocg [21:21:18] lame. I wonder if salt is sad in beta [21:21:32] how can i tell? [21:21:53] (for reference, the ocg deploy procedure is nominally: https://wikitech.wikimedia.org/wiki/OCG#Deploying_the_latest_version_of_OCG ) [21:21:56] "[WARNING ] Master hostname: salt not found. Retrying in 30 second" [21:22:25] cscott: I ran `sudo salt-call saltutil.sync_all` on deployment-bastion and got that error [21:22:33] Looks like salt is borked in beta [21:22:43] "This master address: 'salt' was previously resolvable but now fails to resolve! The previously resolved ip addr will continue to be used" [21:23:04] might be a nice entry for https://wikitech.wikimedia.org/wiki/Trebuchet once we figure out the problem [21:24:18] cscott: I think that command is there somewhere? Anyway anybody have time to try and figure out why salt is dead in beta? [21:24:32] bd808: it is actually: Error: /Stage[main]/Role::Salt::Minions/Salt::Grain[instanceproject]/Exec[ensure_instanceproject_deployment-prep]/unless: Check "/usr/local/sbin/grain-ensure contains instanceproject deployment-prep" exceeded timeout [21:25:25] I get "Master hostname: salt not found. Retrying in 30 seconds" error on the salt master too :( [21:25:53] bd808: the root cause is: Warning: Unable to fetch my node definition, but the agent run will continue: [21:25:53] Warning: Connection refused - connect(2) [21:26:08] the host can't connect to puppet master [21:26:14] matanya: agreed [21:26:17] due to ferm rule changes i suspect [21:26:27] ah.
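Given the triage above (the salt master hostname failing to resolve, empty iptables on deployment-salt, and the puppet agent getting connection refused), a small probe script can separate DNS trouble from firewall or dead-daemon trouble. A sketch assuming the conventional ports, 4505/4506 for the salt master's ZeroMQ buses and 8140 for a puppet master; the hostnames are placeholders:

    import socket

    CHECKS = [
        ('salt', 4505),  # salt master publish bus (ZeroMQ)
        ('salt', 4506),  # salt master return/request bus
        ('deployment-salt.eqiad.wmflabs', 8140),  # puppet master (placeholder)
    ]

    def probe(host, port, timeout=5):
        try:
            addr = socket.gethostbyname(host)
        except socket.gaierror as exc:
            return 'DNS FAIL: %s (%s)' % (host, exc)
        try:
            with socket.create_connection((addr, port), timeout=timeout):
                return 'OK: %s (%s) port %d' % (host, addr, port)
        except OSError as exc:
            # Resolvable but unreachable points at ferm/iptables or a dead
            # daemon rather than DNS, which was the distinction here.
            return 'CONNECT FAIL: %s (%s) port %d: %s' % (host, addr, port, exc)

    for host, port in CHECKS:
        print(probe(host, port))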
[21:26:31] which hashar poked me about earlier [21:26:33] * bd808 shakes fist at ferm again [21:26:47] (03PS1) 10Ottomata: Bring more udp2log filters over to kafkatee on analytics1003 [puppet] - 10https://gerrit.wikimedia.org/r/160523 [21:26:59] and i promised to look at but had $day_job prod issue, and didn't have time and root [21:27:15] i guess i can fix it [21:27:20] matanya: `iptables -L` is empty on deployment-salt and it can't talk to itself either [21:27:38] i rest my case :) [21:28:05] !log updated OCG to version 188a3c221d927bd0601ef5e1b0c0f4a9d1cdbd31 [21:28:10] Logged the message, Master [21:30:24] pushed to prod, skipped beta for now :( [21:33:09] greg-g or hashar or ^d, is it OK if I shut down deployment-sentry2 and/or deployment-videoscaler01 and reboot them in 10 mins? I need to move them onto a different virt host. [21:37:57] andrewbogott: should be ok [21:38:11] greg-g: great. Is now an OK time? [21:38:20] andrewbogott: actually, ask in -qa [21:43:57] bd808, matanya: should i file an RT or a bugzilla for the deployment-salt issue? [21:44:17] cscott: bug [21:44:39] i'll try to sort it out, but it depends on my free time and the fact it is 1am [21:44:47] what component in bugzilla? [21:44:57] labs [21:45:09] wikimedia labs -> deployment prep project [21:46:29] https://bugzilla.wikimedia.org/show_bug.cgi?id=70868 [21:47:37] cscott: come on over to #wikimedia-qa :) [21:47:50] (03PS1) 10Manybubbles: Fix Elasticsearch in ci [puppet] - 10https://gerrit.wikimedia.org/r/160524 [21:47:52] it's where the Beta Cluster peeps/cool people hang out [21:48:11] (03CR) 10Manybubbles: "Not sure if this is right but its something." [puppet] - 10https://gerrit.wikimedia.org/r/160524 (owner: 10Manybubbles) [21:48:14] oh? btw there's also -operations and -labs ;) [21:48:26] and i thought all the *cool* people were in -parsoid ;) [21:48:30] and pdfhack and parsoid and visualeditor? [21:48:35] (03CR) 10Manybubbles: "Meant to fix https://gist.githubusercontent.com/Krinkle/fd6cf70688d110809440/raw/ ." [puppet] - 10https://gerrit.wikimedia.org/r/160524 (owner: 10Manybubbles) [21:48:38] sssh, pdfhack is our secret hideout [21:49:30] i think it was advertised actually :P [21:53:17] is there a secret channel I'm missing out on called #wikimedia-puppet-breakers ? :) [21:53:34] oh wait that's this channel! :) [21:57:55] matanya: csteipp-ish: Here you go: https://gerrit.wikimedia.org/r/160526 [21:58:03] that should bring Special:CentralAuth up to sulinfo [21:58:42] thank you hoo [22:06:26] (03CR) 10Tim Starling: "I think 64K would be better.
Still way less than any fundamental performance limit, and it gives us a bit more time to respond in the case" [puppet] - 10https://gerrit.wikimedia.org/r/160510 (owner: 10Hoo man) [22:08:39] (03PS2) 10Hoo man: HHVM: Increase the maximum number of open files to 65536 [puppet] - 10https://gerrit.wikimedia.org/r/160510 [22:09:01] (03CR) 10Hoo man: "Raised limit to 65536 (per Tim)" [puppet] - 10https://gerrit.wikimedia.org/r/160510 (owner: 10Hoo man) [22:09:40] (03PS1) 10Chmarkine: racktables - remove RewriteCond on /status [puppet] - 10https://gerrit.wikimedia.org/r/160528 [22:12:43] (03PS2) 10Chmarkine: racktables - remove RewriteCond on /status [puppet] - 10https://gerrit.wikimedia.org/r/160528 [22:13:02] bblack: no, we're in #wikimedia-qa :P [22:13:22] :) [22:17:34] (03CR) 10Krinkle: "The error this should address is this:" [puppet] - 10https://gerrit.wikimedia.org/r/160524 (owner: 10Manybubbles) [22:19:41] bblack: that is how it feels (-operations being the puppet breakers, at least for us over in -qa) ;) [22:19:53] (03CR) 10Krinkle: [C: 031] "But whatever it is and whether it is documented, I've cherry-picked this to integration-puppetmaster.eqiad.wmflabs and confirmed it fixes " [puppet] - 10https://gerrit.wikimedia.org/r/160524 (owner: 10Manybubbles) [22:28:43] (03PS1) 10Springle: assign codfw DBs to s[1-7] [puppet] - 10https://gerrit.wikimedia.org/r/160539 [22:33:27] (03CR) 10Springle: [C: 032] assign codfw DBs to s[1-7] [puppet] - 10https://gerrit.wikimedia.org/r/160539 (owner: 10Springle) [22:38:22] hoo: there was a borked global rename earlier today? [22:38:37] greg-g: do you want to revert that patch by otto ? [22:38:40] i can do that [22:38:45] yes legoktm [22:39:00] matanya: ? [22:39:13] matanya: which one? the salt one that broke prod? [22:39:17] https://gerrit.wikimedia.org/r/#/c/160485/ [22:39:24] I'm kind of out of the loop on root cause for the beta issues [22:39:38] the timing fits [22:39:48] * greg-g nods [22:40:58] matanya: who should we have review it from ops side? (the revert) [22:41:14] otto if possible [22:41:32] ott :( [22:41:38] if not, i guess any opsen will work [22:41:41] matanya: do you know why it happened and what fixed it? [22:41:45] it is a simple fix [22:41:59] legoktm: no clue, try in -stewards or ask hoo [22:42:15] ok, I'll wait for him to respond :P [22:42:34] !log dropping static routes for 2620:0:861:ed1a::[d,f,10,11] -> lvs1005 from cr[12]-eqiad (only 11 is of any consequence, misc-web-lb, and they're advertised by bgp and this is preventing failover to lvs1002) [22:42:41] Logged the message, Master [22:42:52] hmm, godog is asleep yes? [22:43:29] probably should be [22:43:45] bblack: want to review a revert that's breaking beta? [22:43:50] sure [22:43:54] well, revert the patch that is breaking beta, that is [22:44:16] which? [22:44:25] https://gerrit.wikimedia.org/r/#/c/160485/ [22:44:35] I don't know the repercussions of the revert [22:44:43] I'm taking matanya's word that it'll help us ;) [22:44:51] jeremyb: do you concur? :) [22:45:57] (03PS1) 10Matanya: Revert "Need network::constants to render ferm defs.conf" [puppet] - 10https://gerrit.wikimedia.org/r/160544 [22:46:21] (03PS1) 10Springle: repool db1036, depool db1002 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160545 [22:46:45] bblack: here [22:47:11] please review, i'm not 100% sure what otto did, or tried to do [22:47:18] seems odd that including network::constants is even a functional change, other than maybe to unbreak a broken variable ref in a template?
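On the limit bump above, a rough back-of-envelope supporting Tim's point: the kernel's own fs.file-max sizing heuristic assumes very roughly 1 KiB of kernel memory per open file, so even the raised per-process cap is trivial next to an app server's RAM. Order-of-magnitude only:

    # Back-of-envelope for the nofile values discussed here.
    PER_FILE_BYTES = 1024  # kernel heuristic, not an exact per-fd cost
    for nofile in (4096, 16384, 65536):
        print('%6d fds ~ %5.1f MiB' % (nofile, nofile * PER_FILE_BYTES / 2.0**20))
    #   4096 fds ~   4.0 MiB
    #  16384 fds ~  16.0 MiB
    #  65536 fds ~  64.0 MiB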
[22:47:29] legoktm: It fixed itself [22:47:34] automatic re-run, ftw [22:47:36] gn [22:47:43] yay, those are the best :D [22:47:45] See SAL [22:47:48] ok [22:48:00] bblack: i suspect that's what happened. [22:48:25] I'm checking it out on puppet-compiler to see the effect [22:48:35] !log Pooling the newly setup Trusty-based Jenkins slaves (integration-slave1006, integration-slave1007 and integration-slave1008) [22:48:41] Logged the message, Master [22:49:13] if "# We don't include base::firewall yet" is true [22:49:17] then how does it do anything? [22:49:19] matanya: [22:49:27] see https://gerrit.wikimedia.org/r/#/c/153801/2/modules/base/templates/firewall/defs.erb [22:49:34] !log Running sample job on integration-slave1007 and warming up npmjs.org cache [22:49:41] Logged the message, Master [22:49:53] ugh how does deployment-bastion get its node definition? [22:49:59] it's not in manifests/site.pp :p [22:50:02] (03CR) 10Springle: [C: 031] "waiting for schema change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160545 (owner: 10Springle) [22:50:23] and https://gerrit.wikimedia.org/r/#/c/160482/3/manifests/role/analytics/zookeeper.pp [22:50:25] (03CR) 10Springle: [C: 04-1] "waiting for schema change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160545 (owner: 10Springle) [22:50:32] bblack: LDAP [22:50:41] that's really really lame [22:50:46] LDAP does not do git grep [22:50:46] bblack: people hit "configure instance" and select a "puppet group" [22:50:49] so all three touch ferm [22:50:51] where puppet group = class [22:51:05] well anyways what class does it use? [22:51:10] and that is what broke, i try to do math :) [22:52:14] matanya: hmm.. wouldn't you revert the other one then? [22:52:29] the one that touches firewall/defs.erb [22:52:38] if i don't call the defs, they are not affecting anything [22:53:03] the merge comment being "fingers crossed" :) [22:53:22] yeah I donno about the revert, it doesn't smell right, and I wish I could test on puppet-compiler [22:53:38] network::constants just defines things that would otherwise generate a parse error and fail puppet completely [22:54:19] what lines on the real host are causing the real problem? [22:55:18] matanya: did you say earlier there was an unpuppetized rule on labs that disappeared? [22:56:49] i don't know, hashar said in the bug report he suspects so [22:56:53] !log Running sample job on integration-slave1008 and warming up npmjs.org cache [22:57:00] Logged the message, Master [22:57:22] well 2am, i wouldn't merge any commit by me [22:57:49] or revert both? [22:57:53] i'm off. i'll look at it in sane hours tomorrow, if it will still be relevant [22:58:13] the existing revert + revert of https://gerrit.wikimedia.org/r/#/c/160482/3/manifests/role/analytics/zookeeper.pp ? [22:58:15] bblack: or both or none [22:58:22] your call [22:58:38] is someone going to use it and see if it works if I do? :) [23:00:04] RoanKattouw, ^d, marktraceur, MaxSem, RoanKattouw: Respected human, time to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140915T2300). Please do the needful. [23:00:09] who's on this or looking or caring if matanya's off? I don't even know what the original problem is to confirm a fix [23:00:34] i think it can safely wait bblack [23:00:47] I can SWAT [23:00:53] ok, in that case otto can pick it up and look himself then probably [23:01:06] Glaisher, are you there?
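On the node-definition question above: labs instances are classified through LDAP rather than manifests/site.pp, with the "puppet groups" picked in the Configure Instance UI becoming class names. A minimal sketch of how an external node classifier (ENC) of that general shape behaves; the class names and the dict standing in for the LDAP query are invented for illustration:

    #!/usr/bin/env python3
    # Toy ENC: puppet invokes the classifier with the node name as
    # argv[1] and reads YAML from stdout (requires PyYAML).
    import sys
    import yaml

    # Placeholder for the real LDAP lookup keyed on the instance FQDN.
    FAKE_LDAP = {
        'deployment-bastion.eqiad.wmflabs': {
            'classes': ['role::labs::instance', 'role::deployment::test'],
            'parameters': {'instanceproject': 'deployment-prep'},
        },
    }

    def classify(node):
        # Unknown nodes still get a baseline class, mirroring how every
        # labs instance gets a default configuration.
        return FAKE_LDAP.get(node, {'classes': ['role::labs::instance']})

    if __name__ == '__main__':
        print(yaml.safe_dump(classify(sys.argv[1]), default_flow_style=False))

This is also why "git grep" finds nothing: the classification lives in a directory service, not in the puppet repo.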
[23:01:11] ok, thanks and good night [23:01:21] MaxSem: Want my help setting up https://gerrit.wikimedia.org/r/#/c/160076/ [23:01:23] ? [23:01:29] It requires a double submodule update [23:01:36] (nested) [23:01:57] RoanKattouw, git submodule update --init --recursive ? [23:02:04] Yes eventually [23:02:11] But the submodule update commits haven't been created yet [23:02:21] ahaha [23:02:31] so create them) [23:03:30] OK will do [23:06:07] !log Running sample job on integration-slave1006 and warming up npmjs.org cache [23:06:17] remove the leading space [23:07:33] !log Running sample job on integration-slave1006 and warming up npmjs.org cache [23:07:39] Logged the message, Master [23:07:44] Oh, ha, thx [23:07:48] :) [23:09:24] MaxSem: https://gerrit.wikimedia.org/r/#/c/160554/ [23:11:55] thx RoanKattouw, only wmf21 is needed? [23:12:24] Yeah [23:12:54] !log restarting lvs1002 for HT disable + kernel upgrade [23:13:00] Logged the message, Master [23:17:57] RoanKattouw, Submodule path 'extensions/VisualEditor': rebased into 'e621e39656887b31698ed238810629dfc6da9403' [23:18:08] Submodule path 'lib/ve': checked out 'd9658b24ceffe4533336f8d17412a6e053613baa' [23:18:42] MaxSem: Looks good [23:18:49] ok, pushing [23:19:54] !log maxsem Synchronized php-1.24wmf21/extensions/VisualEditor/: SWAT: https://gerrit.wikimedia.org/r/#/c/160554/ (duration: 00m 07s) [23:20:00] Logged the message, Master [23:20:01] RoanKattouw, ^ [23:20:25] Thanks [23:20:41] works? [23:26:23] !log restarting lvs1001 for HT disable + kernel upgrade [23:26:24] I'll ask our QA person [23:26:29] Logged the message, Master [23:26:39] The fixes were for JS errors and the reproduction instructions were somewhat elaborate [23:27:11] MaxSem: Also we have two OOUI commits in there as well [23:27:19] yup, waiting for CI [23:28:07] OK cool [23:29:21] one test appears to be hanging [23:29:39] gj jerkins [23:30:48] (03PS2) 10BBlack: add XPS for bnx2 (etc) to interface-rps.py [puppet] - 10https://gerrit.wikimedia.org/r/140376 [23:31:34] (03CR) 10BBlack: [C: 032 V: 032] "Kernels are updated now that I went back and found the v6 failover issue..." [puppet] - 10https://gerrit.wikimedia.org/r/140376 (owner: 10BBlack) [23:32:44] !log maxsem Synchronized php-1.24wmf21/resources/: SWAT: https://gerrit.wikimedia.org/r/#/c/160488/1 https://gerrit.wikimedia.org/r/#/c/160543/ (duration: 00m 06s) [23:32:50] Logged the message, Master [23:32:50] RoanKattouw, ^ [23:35:57] MaxSem: I tested test2 and it looks good, thanks [23:44:40] (03PS1) 10BBlack: bugfix for 1:1 rxq:txq mapping [puppet] - 10https://gerrit.wikimedia.org/r/160559 [23:45:00] (03CR) 10BBlack: [C: 032 V: 032] bugfix for 1:1 rxq:txq mapping [puppet] - 10https://gerrit.wikimedia.org/r/160559 (owner: 10BBlack) [23:51:18] (03PS1) 10BBlack: add ip6_mapped addr for neon [puppet] - 10https://gerrit.wikimedia.org/r/160560 [23:52:12] (03PS1) 10BBlack: use mapped v6 addr for neon DNS [dns] - 10https://gerrit.wikimedia.org/r/160561 [23:52:25] (03CR) 10BBlack: [C: 032] add ip6_mapped addr for neon [puppet] - 10https://gerrit.wikimedia.org/r/160560 (owner: 10BBlack) [23:53:29] (03CR) 10BBlack: [C: 032 V: 032] use mapped v6 addr for neon DNS [dns] - 10https://gerrit.wikimedia.org/r/160561 (owner: 10BBlack) [23:56:01] (03PS1) 10BBlack: add neon v6 mapped to net constants [puppet] - 10https://gerrit.wikimedia.org/r/160564 [23:56:17] (03CR) 10BBlack: [C: 032 V: 032] add neon v6 mapped to net constants [puppet] - 10https://gerrit.wikimedia.org/r/160564 (owner: 10BBlack)
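A closing note on the nested submodule update MaxSem and RoanKattouw walk through above: the inner pointer (lib/ve) has to be committed before the outer one (extensions/VisualEditor inside core), which is why the update commits must exist before anyone runs `git submodule update --init --recursive`. A sketch of the inner-to-outer order, assuming Python 3.5+ and already-initialized submodules; the paths and commit messages are illustrative, not the actual tooling used here:

    import subprocess

    def git(args, cwd):
        # Thin wrapper; raises if any step fails so a half-done bump is obvious.
        subprocess.run(['git'] + args, cwd=cwd, check=True)

    CORE = 'php-1.24wmf21'  # illustrative checkout path
    VE = CORE + '/extensions/VisualEditor'

    # 1. Point the innermost submodule (lib/ve) at the wanted commit
    #    (sha taken from the log above; any ref works).
    git(['-C', 'lib/ve', 'checkout',
         'd9658b24ceffe4533336f8d17412a6e053613baa'], cwd=VE)
    # 2. Commit the moved gitlink in the middle repo (the extension).
    git(['commit', '-m', 'Update lib/ve', 'lib/ve'], cwd=VE)
    # 3. Commit the extension's new sha in the outer repo (core).
    git(['commit', '-m', 'Update VisualEditor',
         'extensions/VisualEditor'], cwd=CORE)
    # Only now does `git submodule update --init --recursive` in a fresh
    # clone land every layer on the intended commits.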