[00:28:44] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures
[00:32:54] PROBLEM - puppet last run on amssq42 is CRITICAL: CRITICAL: Puppet has 1 failures
[00:44:19] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: /mnt/tmpfs 5.8GB (= 5.0GB critical): /srv/deployment/ocg/output 4009055266B: /srv/deployment/ocg/postmortem 996963B: ocg_job_status 11532 msg: ocg_render_job_queue 0 msg
[00:45:13] RECOVERY - OCG health on ocg1003 is OK: OK: /mnt/tmpfs 0B: /srv/deployment/ocg/output 4009055266B: /srv/deployment/ocg/postmortem 1072969B: ocg_job_status 11532 msg: ocg_render_job_queue 0 msg
[00:46:02] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[00:50:17] RECOVERY - puppet last run on amssq42 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[01:31:38] (PS1) Jackmcbarn: Don't allow granting a removed group [mediawiki-config] - https://gerrit.wikimedia.org/r/160372
[01:32:47] what
[01:32:48] :D
[01:34:33] (CR) Ori.livneh: "Should this be fixed in core, instead, by pruning entries that don't correspond to an actual group? We did something similar in https://ww" [mediawiki-config] - https://gerrit.wikimedia.org/r/160372 (owner: Jackmcbarn)
[01:35:51] (CR) Ori.livneh: [C: 031] "Well, "instead" is the wrong word; it certainly doesn't hurt to do this." [mediawiki-config] - https://gerrit.wikimedia.org/r/160372 (owner: Jackmcbarn)
[01:41:10] (PS1) Hoo man: Don't leak global $path [mediawiki-config] - https://gerrit.wikimedia.org/r/160374
[01:42:22] (CR) Ori.livneh: [C: 031] Don't leak global $path [mediawiki-config] - https://gerrit.wikimedia.org/r/160374 (owner: Hoo man)
[01:46:40] (CR) Hoo man: [C: 032] "No-op" [mediawiki-config] - https://gerrit.wikimedia.org/r/160374 (owner: Hoo man)
[01:46:44] (Merged) jenkins-bot: Don't leak global $path [mediawiki-config] - https://gerrit.wikimedia.org/r/160374 (owner: Hoo man)
[01:47:20] !log hoo Synchronized wmf-config/flaggedrevs.php: Remove global $path (duration: 00m 10s)
[01:47:28] Logged the message, Master
[01:47:36] !log hoo Synchronized wmf-config/liquidthreads.php: Remove global $path (duration: 00m 07s)
[01:47:42] Logged the message, Master
[01:59:28] PROBLEM - puppet last run on mw1199 is CRITICAL: CRITICAL: Puppet has 1 failures
[02:02:17] (CR) Jackmcbarn: "I02d3f00142ca1cb0cdcbf30e79fecb3c96e96405" [mediawiki-config] - https://gerrit.wikimedia.org/r/160372 (owner: Jackmcbarn)
[02:03:22] (PS2) Jackmcbarn: Don't allow granting a removed group [mediawiki-config] - https://gerrit.wikimedia.org/r/160372
[02:03:25] !log LocalisationUpdate failed: mwversionsinuse returned empty list
[02:03:30] Logged the message, Master
[02:06:47] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 3608 MB (3% inode=99%):
[02:16:45] RECOVERY - puppet last run on mw1199 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[02:19:08] (PS1) Hoo man: Fix l10nupdate by correctly adding scap directories to $PATH [puppet] - https://gerrit.wikimedia.org/r/160380
[02:19:10] ori_: ^
[02:19:31] Tiny regression you introduced in 132038174a0e2734f660fd6d2e86dc5caf033aca and that breaks l10nupdate
[02:19:38] * l10nupdate-1
[02:21:14] (PS2) Hoo man: Add scap directories to $PATH for l10nupdate [puppet] - https://gerrit.wikimedia.org/r/160380
[02:47:43] (PS1) MZMcBride: Various tweaks to people.wikimedia.org index page [puppet] - https://gerrit.wikimedia.org/r/160383
[02:57:08] (CR) Ori.livneh: "But on tin, /usr/local/bin/mwversionsinuse is still a symlink to /srv/deployment/scap/scap/bin/mwversionsinuse. So how did this break?" [puppet] - https://gerrit.wikimedia.org/r/160380 (owner: Hoo man)
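For context on the regression being debugged here: l10nupdate shells out to scap helpers such as mwversionsinuse, so the question is what $PATH the l10nupdate user actually sees. A quick check of the kind ori is describing might look like this (a sketch, not the actual fix; runnable on tin, paths assumed):

    # does the l10nupdate user resolve the scap helper at all?
    sudo -u l10nupdate -i which mwversionsinuse || echo 'not on PATH'
    # the symlink ori mentions in his review comment:
    ls -l /usr/local/bin/mwversionsinuse   # -> /srv/deployment/scap/scap/bin/mwversionsinuse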
[03:00:55] RECOVERY - Disk space on virt0 is OK: DISK OK
[03:02:15] (CR) Legoktm: [C: 031] Various tweaks to people.wikimedia.org index page [puppet] - https://gerrit.wikimedia.org/r/160383 (owner: MZMcBride)
[03:07:02] (PS1) Springle: depool db1036 and db1062 [mediawiki-config] - https://gerrit.wikimedia.org/r/160384
[03:09:00] (CR) Springle: [C: 032] depool db1036 and db1062 [mediawiki-config] - https://gerrit.wikimedia.org/r/160384 (owner: Springle)
[03:09:05] (Merged) jenkins-bot: depool db1036 and db1062 [mediawiki-config] - https://gerrit.wikimedia.org/r/160384 (owner: Springle)
[03:09:45] !log springle Synchronized wmf-config/db-eqiad.php: depool db1036 (duration: 00m 09s)
[03:09:50] Logged the message, Master
[03:42:26] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[03:56:37] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[04:04:46] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[04:17:57] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[04:59:46] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.138:9200/_cluster/health error while fetching: Request timed out.
[04:59:56] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.137:9200/_cluster/health error while fetching: Request timed out.
[05:00:36] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.136:9200/_cluster/health error while fetching: Request timed out.
[05:01:56] PROBLEM - ElasticSearch health check on logstash1003 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.136
[05:01:57] PROBLEM - ElasticSearch health check on logstash1002 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.137
[05:02:47] PROBLEM - ElasticSearch health check on logstash1001 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.138
[05:03:37] RECOVERY - ElasticSearch health check on logstash1001 is OK: OK - elasticsearch (production-logstash-eqiad) is running. status: green: timed_out: false: number_of_nodes: 2: number_of_data_nodes: 2: active_primary_shards: 36: active_shards: 103: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0
[05:03:47] (PS1) ArielGlenn: lab db replica: don't show log_params for deleted/suppressed logs [software] - https://gerrit.wikimedia.org/r/160393
[05:03:56] RECOVERY - ElasticSearch health check on logstash1003 is OK: OK - elasticsearch (production-logstash-eqiad) is running. status: green: timed_out: false: number_of_nodes: 2: number_of_data_nodes: 2: active_primary_shards: 36: active_shards: 103: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0
[05:04:48] RECOVERY - ElasticSearch health check on logstash1002 is OK: OK - elasticsearch (production-logstash-eqiad) is running. status: green: timed_out: false: number_of_nodes: 3: number_of_data_nodes: 3: active_primary_shards: 36: active_shards: 103: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0
[05:04:58] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.138:9200/_cluster/health error while fetching: Request timed out.
[05:06:12] (CR) ArielGlenn: "Not sure if the check (log_delete nonzero) is too broad, can you have a look Coren?" [software] - https://gerrit.wikimedia.org/r/160393 (owner: ArielGlenn)
[05:07:50] PROBLEM - ElasticSearch health check on logstash1001 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.138
[05:10:36] RECOVERY - ElasticSearch health check on logstash1001 is OK: OK - elasticsearch (production-logstash-eqiad) is running. status: green: timed_out: false: number_of_nodes: 3: number_of_data_nodes: 3: active_primary_shards: 36: active_shards: 103: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0
[05:11:56] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 69 threshold =0.1% breach: {u'status': u'red', u'number_of_nodes': 3, u'unassigned_shards': 63, u'timed_out': False, u'active_primary_shards': 34, u'cluster_name': u'production-logstash-eqiad', u'relocating_shards': 0, u'active_shards': 34, u'initializing_shards': 6, u'number_of_data_nodes': 3}
[05:12:37] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 69 threshold =0.1% breach: {u'status': u'red', u'number_of_nodes': 3, u'unassigned_shards': 63, u'timed_out': False, u'active_primary_shards': 34, u'cluster_name': u'production-logstash-eqiad', u'relocating_shards': 0, u'active_shards': 34, u'initializing_shards': 6, u'number_of_data_nodes': 3}
[05:12:56] PROBLEM - puppet last run on mw1210 is CRITICAL: CRITICAL: Puppet has 1 failures
[05:12:56] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 69 threshold =0.1% breach: {u'status': u'red', u'number_of_nodes': 3, u'unassigned_shards': 63, u'timed_out': False, u'active_primary_shards': 34, u'cluster_name': u'production-logstash-eqiad', u'relocating_shards': 0, u'active_shards': 34, u'initializing_shards': 6, u'number_of_data_nodes': 3}
[05:14:47] PROBLEM - ElasticSearch health check on logstash1001 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.138
[05:15:07] PROBLEM - ElasticSearch health check on logstash1002 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.137
[05:15:11] PROBLEM - ElasticSearch health check on logstash1003 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.136
[05:21:56] PROBLEM - ElasticSearch health check on logstash1001 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.138
[05:22:16] PROBLEM - ElasticSearch health check on logstash1003 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.136
[05:24:56] PROBLEM - ElasticSearch health check on logstash1001 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.138
[05:25:17] PROBLEM - ElasticSearch health check on logstash1003 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.136
[05:30:54] (PS1) Ori.livneh: HHVM: increase JitAColdSize to 60 MiB [puppet] - https://gerrit.wikimedia.org/r/160394
[05:32:06] RECOVERY - puppet last run on mw1210 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[05:42:29] (PS3) Ori.livneh: Add scap directories to $PATH for l10nupdate [puppet] - https://gerrit.wikimedia.org/r/160380 (owner: Hoo man)
[05:48:16] PROBLEM - ElasticSearch health check on logstash1001 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.138
[05:48:27] PROBLEM - ElasticSearch health check on logstash1003 is CRITICAL: CRITICAL - elasticsearch (production-logstash-eqiad) is running. status: red: timed_out: false: number_of_nodes: 2: number_of_data_nodes: 2: active_primary_shards: 14: active_shards: 16: relocating_shards: 0: initializing_shards: 3: unassigned_shards: 84
[05:50:24] (PS1) Springle: prepare db1036 for upgrade [puppet] - https://gerrit.wikimedia.org/r/160397
[05:51:26] (CR) Springle: [C: 032] prepare db1036 for upgrade [puppet] - https://gerrit.wikimedia.org/r/160397 (owner: Springle)
[05:53:57] (CR) Ori.livneh: [C: 032] Add scap directories to $PATH for l10nupdate [puppet] - https://gerrit.wikimedia.org/r/160380 (owner: Hoo man)
[06:04:28] PROBLEM - ElasticSearch health check on logstash1001 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.138
[06:05:48] PROBLEM - ElasticSearch health check on logstash1003 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.136
[06:10:50] PROBLEM - ElasticSearch health check on logstash1003 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.136
[06:11:28] PROBLEM - ElasticSearch health check on logstash1001 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.138
[06:23:08] PROBLEM - puppet last run on mw1161 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:23:48] PROBLEM - puppet last run on fenari is CRITICAL: CRITICAL: Puppet has 1 failures
[06:27:58] PROBLEM - puppet last run on virt1006 is CRITICAL: CRITICAL: Epic puppet fail
[06:28:10] PROBLEM - puppet last run on gallium is CRITICAL: CRITICAL: Epic puppet fail
[06:28:29] PROBLEM - puppet last run on mw1114 is CRITICAL: CRITICAL: Epic puppet fail
[06:28:30] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: Epic puppet fail
[06:28:38] PROBLEM - puppet last run on db1034 is CRITICAL: CRITICAL: Epic puppet fail
[06:28:39] PROBLEM - puppet last run on mw1175 is CRITICAL: CRITICAL: Epic puppet fail
[06:28:49] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:28:50] PROBLEM - puppet last run on db1042 is CRITICAL: CRITICAL: Epic puppet fail
[06:29:08] PROBLEM - puppet last run on amssq48 is CRITICAL: CRITICAL: Epic puppet fail
[06:29:39] PROBLEM - puppet last run on cp1061 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:39] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:29:50] PROBLEM - puppet last run on labsdb1003 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:50] PROBLEM - puppet last run on db1002 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:51] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:51] PROBLEM - puppet last run on iron is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:51] PROBLEM - puppet last run on search1018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:59] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:18] PROBLEM - puppet last run on db1018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:18] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:19] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:30] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:37] <_joe_> mmmh I'm not sure this is the usual mod_passenger problem
[06:30:38] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:40] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:40] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:52] <_joe_> ori_: still here?
[06:30:59] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:59] PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:01] <_joe_> what did you change yesterday?
[06:31:08] PROBLEM - puppet last run on amssq55 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:08] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:08] RECOVERY - puppet last run on mw1161 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[06:32:10] <_joe_> mmh it just happened to be worse than usual and I found a couple of bad salt failures
[06:35:03] (PS5) Florianschmidtwelzow: Fix typos in various localizations of dvwiki configurations [mediawiki-config] - https://gerrit.wikimedia.org/r/156821 (https://bugzilla.wikimedia.org/48075) (owner: Gerrit Patch Uploader)
[06:37:20] RECOVERY - puppet last run on fenari is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[06:37:51] PROBLEM - ElasticSearch health check for shards on elastic1002 is CRITICAL: CRITICAL - elasticsearch http://10.64.0.109:9200/_cluster/health error while fetching: Max retries exceeded for url: /_cluster/health
[06:44:54] RECOVERY - puppet last run on db1018 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[06:45:32] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[06:45:34] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[06:45:34] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[06:45:34] RECOVERY - puppet last run on cp1061 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[06:45:34] RECOVERY - puppet last run on labsdb1003 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[06:45:34] RECOVERY - puppet last run on db1002 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[06:45:35] RECOVERY - puppet last run on iron is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[06:45:35] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[06:45:36] RECOVERY - puppet last run on search1018 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[06:45:41] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[06:45:54] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[06:46:21] RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[06:46:31] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[06:46:32] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[06:46:32] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[06:46:32] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[06:46:32] RECOVERY - puppet last run on db1042 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[06:46:41] RECOVERY - puppet last run on virt1006 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[06:46:41] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[06:46:42] RECOVERY - puppet last run on gallium is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[06:47:00] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[06:47:01] RECOVERY - puppet last run on mw1114 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[06:47:21] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[06:47:22] RECOVERY - puppet last run on db1034 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[06:47:22] RECOVERY - puppet last run on mw1175 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[06:47:51] RECOVERY - puppet last run on amssq48 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[06:47:51] RECOVERY - puppet last run on amssq55 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[06:57:32] PROBLEM - puppet last run on ssl3001 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:14:57] RECOVERY - puppet last run on ssl3001 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[07:59:27] https://bugzilla.wikimedia.org/show_bug.cgi?id=35534#c19
[08:06:13] _joe_: a minute of your time ?
[08:06:15] https://gerrit.wikimedia.org/r/#/c/159462/2/modules/rsync/templates/module.erb
[08:06:41] i didn't understand which is the wrong part, the right one or the left one
[08:15:21] <_joe_> matanya: 1 sec
[08:16:34] <_joe_> matanya: what is not clear in my comment?
[08:16:53] <_joe_> if @variable != :undef will always test true
[08:17:42] _joe_: that part i got, i didn't get the logic of testing it
[08:17:45] <_joe_> if @variable will test false if the $variable in puppet is either :undef or nil
[08:17:54] <_joe_> and true otherwise
[08:18:00] oh!
[08:18:03] now i get it
[08:18:27] <_joe_> it's quite tricky and I have to recheck the conditions every time
[08:18:35] * matanya is so slow lately
[08:18:49] probably nicer ways to check this
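To make _joe_'s point concrete, here is a toy reproduction runnable anywhere ruby is installed (the @port variable is made up): Puppet hands an undef variable to an ERB template as Ruby nil, so comparing against :undef is always true, while a bare truthiness test behaves as intended.

    ruby -rerb -e '
      @port = nil   # what an undef Puppet variable looks like inside ERB
      puts ERB.new("<%= @port != :undef %>").result(binding)             # => true (always; the wrong test)
      puts ERB.new("<%= @port ? \"set\" : \"unset\" %>").result(binding) # => unset (the truthiness test _joe_ describes)
    '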
[08:21:51] (PS3) Matanya: rsync: qualify vars [puppet] - https://gerrit.wikimedia.org/r/159462
[08:27:33] hello, I am back around :)
[08:27:46] o/ :D
[08:34:16] <_joe_> hashar: :)))
[08:45:20] and now have to deal with all the mail spam :/
[08:49:18] <_joe_> eh
[08:49:45] <_joe_> (btw, I love the names of both your daughters)
[08:52:18] (CR) Filippo Giunchedi: "LGTM, modulo what Daniel pointed out re: favicon." [puppet] - https://gerrit.wikimedia.org/r/147487 (owner: Reedy)
[08:53:44] (CR) Giuseppe Lavagetto: "Public wikis in general have both the favicon and the robots.txt files that can be personalized by the admins; so this makes sense." [puppet] - https://gerrit.wikimedia.org/r/147487 (owner: Reedy)
[08:55:29] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0]
[08:56:50] (PS7) Alexandros Kosiaris: Introducing Service Cluster A, hosting mathoid [puppet] - https://gerrit.wikimedia.org/r/156576 (https://bugzilla.wikimedia.org/69990) (owner: Physikerwelt)
[08:56:52] (PS1) Alexandros Kosiaris: Assign LVS IP address to mathoid [puppet] - https://gerrit.wikimedia.org/r/160412
[09:09:45] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[09:14:20] re: changing freenode passwords, they also support connecting with a ssl cert and let nickserv identify you on that https://freenode.net/certfp/
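For anyone following that link later, the CertFP setup it describes boils down to generating a long-lived self-signed client certificate and registering its fingerprint with NickServ. A rough sketch paraphrasing the linked page (the filename is arbitrary):

    openssl req -x509 -new -days 1000 -nodes -out freenode.pem -keyout freenode.pem
    # then point your IRC client at freenode.pem and, while identified:
    #   /msg NickServ CERT ADD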
[09:15:56] !log Jenkins: apt-get upgrade on prod slaves (updates php5 / libc / jdk 7)
[09:16:01] Logged the message, Master
[09:17:36] (PS2) Hashar: Please add the domain *.scienceimage.csiro.au to the wgCopyUploadsDomains whitelist. [mediawiki-config] - https://gerrit.wikimedia.org/r/159999 (https://bugzilla.wikimedia.org/70771) (owner: Dan-nl)
[09:19:00] (CR) Hashar: [C: 032] Please add the domain *.scienceimage.csiro.au to the wgCopyUploadsDomains whitelist. [mediawiki-config] - https://gerrit.wikimedia.org/r/159999 (https://bugzilla.wikimedia.org/70771) (owner: Dan-nl)
[09:19:04] (Merged) jenkins-bot: Please add the domain *.scienceimage.csiro.au to the wgCopyUploadsDomains whitelist. [mediawiki-config] - https://gerrit.wikimedia.org/r/159999 (https://bugzilla.wikimedia.org/70771) (owner: Dan-nl)
[09:20:10] !log hashar Synchronized wmf-config/InitialiseSettings.php: *.scienceimage.csiro.au to the wgCopyUploadsDomains {{gerrit|159999}} {{bug|70771}} (duration: 00m 06s)
[09:20:15] Logged the message, Master
[09:24:48] <_joe_> bbl
[09:30:42] Reedy: any thoughts on https://rt.wikimedia.org/Ticket/Display.html?id=5270 ?
[10:30:29] hashar: https://integration.wikimedia.org/ci/job/mwext-Thanks-testextensions-master/181/console any idea why that's not running against Flow master?
[10:30:48] (the exceptions are fixed on master)
[10:31:09] bah bugged :(
[10:31:31] it is supposed to fetch the latest version of flow
[10:31:39] hashar: welcome back!
[10:34:25] (PS1) Filippo Giunchedi: move metrics.wm.o and metrics-api.wm.o behind misc-web [puppet] - https://gerrit.wikimedia.org/r/160419
[10:34:54] godog: thx :)
[10:35:05] hoo: I cleared the workspaces on the slaves and retriggered the job. Same deal
[10:35:19] hoo: I copy the extension dependencies from some Gerrit replica. Maybe they are out of date
[10:35:59] mh, I see
[10:36:06] hoo: Flow is at 2d2362372dc Mon Sep 15 01:20:22 2014 +0000
[10:36:16] which seems up to date
[10:36:26] * hashar blames code
[10:36:49] That's nearly impossible
[10:37:20] Flow itself was fixed and passes now :S
[10:37:54] https://gerrit.wikimedia.org/r/160366
[10:38:17] Do I have shell on these worker slaves? I guess not
[10:42:42] hoo: trying to reproduce
[10:45:55] hashar: welcome back!
[10:47:03] hoo: I have no obvious clue. One should try to reproduce by using the extensions master branches on a fresh wiki install and see what happens / whether it can be reproduced
[10:47:21] hoo: the code seems up to date on the slaves. Definitely has the parent::tearDown() calls that have been added in Flow
[10:47:44] then there is a bunch of class inheritance, maybe one is missing a parent::tearDown() call at some point
[10:48:15] YuviPanda: thx for the icinga monitoring of the beta cluster :D
[10:48:26] hashar: :D more coming!
[10:48:32] YuviPanda: I will eventually have to whine about how the mail notifications are hard to read hehe
[10:48:38] my goal is to make it to a point where you're alerted by icinga rather than by people :)
[10:48:43] hashar: mh... Flow itself passes
[10:48:52] working on checking URLs for them being up as well
[10:49:02] hashar: if you'd like more notifications on some particular thing, do let me know
[10:49:23] YuviPanda: will look at it eventually next week :] Busy processing emails
[10:49:29] cool
[10:51:01] food time!
[10:54:08] bd808S: btw, scap metrics are on graphite.wmflabs.org
[10:58:30] PROBLEM - puppet last run on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:13:00] PROBLEM - puppet last run on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:01:42] (PS2) Filippo Giunchedi: swift: remove ganglia stats via ganglia-logtailer [puppet] - https://gerrit.wikimedia.org/r/159705
[12:25:13] PROBLEM - puppet last run on fenari is CRITICAL: CRITICAL: Puppet has 1 failures
[12:28:36] (CR) coren: [C: 031] "It probably *is* too broad; but the contents of log_param is annoyingly variable and it's difficult to determine whether it contains somet" [software] - https://gerrit.wikimedia.org/r/160393 (owner: ArielGlenn)
[12:33:23] PROBLEM - puppet last run on mw1210 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:40:34] PROBLEM - RAID on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:51:34] RECOVERY - puppet last run on mw1210 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[12:53:26] PROBLEM - check configured eth on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:53:43] RECOVERY - RAID on fenari is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0
[12:56:26] RECOVERY - check configured eth on fenari is OK: NRPE: Unable to read output
[12:57:03] PROBLEM - RAID on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:57:03] PROBLEM - HTTP on fenari is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:59:32] <_joe_> mmmh is someone shutting down fenari and not logging it?
[13:00:41] (PS1) QChris: Remove udp2log stream to Vrije Universiteit Amsterdam [puppet] - https://gerrit.wikimedia.org/r/160435
[13:01:11] <_joe_> UVA is such a nice campus
[13:01:43] :-)
[13:01:43] PROBLEM - DPKG on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:03:19] <_joe_> !log fenari is swapping hard, restarting apache who was eating up all the RAM
[13:03:25] Logged the message, Master
[13:04:34] RECOVERY - DPKG on fenari is OK: All packages OK
[13:05:05] RECOVERY - HTTP on fenari is OK: HTTP OK: HTTP/1.1 200 OK - 4775 bytes in 0.067 second response time
[13:05:05] RECOVERY - RAID on fenari is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0
[13:17:06] RECOVERY - puppet last run on fenari is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[13:44:48] (PS2) BBlack: add login-lb to eqiad protoproxy [puppet] - https://gerrit.wikimedia.org/r/160016
[13:44:50] (PS2) BBlack: Remove dead protoproxy entries completely [puppet] - https://gerrit.wikimedia.org/r/160017
[13:44:52] (PS2) BBlack: Remove dead addrs from protoproxies [puppet] - https://gerrit.wikimedia.org/r/160015
[13:44:54] (PS2) BBlack: add textsvc/uploadsvc in ulsfo for consistency [puppet] - https://gerrit.wikimedia.org/r/160014
[13:44:56] (PS2) BBlack: remove dead esams donatelbsecure [puppet] - https://gerrit.wikimedia.org/r/160013
[13:44:58] (PS2) BBlack: remove old mobile/bits addrs in eqiad+esams [puppet] - https://gerrit.wikimedia.org/r/160012
[13:45:00] (PS2) BBlack: Sanitize text-related addrs for eqiad [puppet] - https://gerrit.wikimedia.org/r/160011
[13:45:02] (PS2) BBlack: Flip ed1a::0 and ed1a::1 in protoproxy [puppet] - https://gerrit.wikimedia.org/r/160010
[13:45:39] bblack is a sleep wal^Wcommitter
[13:45:53] :)
[13:46:02] just rebasing branch
[13:59:27] (CR) BBlack: [C: 032] Flip ed1a::0 and ed1a::1 in protoproxy [puppet] - https://gerrit.wikimedia.org/r/160010 (owner: BBlack)
[13:59:43] (CR) Ottomata: [C: 031] "This is awesome! Thank you!" [puppet] - https://gerrit.wikimedia.org/r/160419 (owner: Filippo Giunchedi)
[14:00:04] aude: Dear anthropoid, the time has come. Please deploy Wikidata (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140915T1400).
[14:00:38] anthropoid :)
[14:00:43] :D
[14:00:59] better than "Dear sir"
[14:01:26] we converted them to have non-gendered greetings
[14:01:33] and they also have a list of messages they pick from
[14:01:37] the other one goes 'Dear human'
[14:01:41] more message suggestions welcome
[14:01:46] heh
[14:02:19] https://en.wikipedia.org/wiki/Anthropoid <-- many interesting different meanings...
[14:02:48] "a genus of cranes" is surprising :D
[14:03:09] haha
[14:03:55] hoo: is local crat rename happening today as planned ?
[14:04:13] Yep
[14:07:29] (PS1) Yuvipanda: db: Use the mysql class instead of the package [puppet/wikimetrics] - https://gerrit.wikimedia.org/r/160445
[14:07:32] milimetric: ^
[14:07:56] (PS1) Yuvipanda: Add .gitreview file [puppet/wikimetrics] - https://gerrit.wikimedia.org/r/160446
[14:07:57] milimetric: and ^
[14:09:03] milimetric: the db patch makes puppet itself put the db files in /srv, and also ensures that the /srv folder is mounted in the correct volume
[14:11:11] thanks much YuviPanda. ottomata & qchris should check ^^
[14:11:28] milimetric: \o/ cool. Haven't tested the first one, tho. it needs manual intervention to save the data files
[14:11:39] but that's ok on wikimetrics1 since I already did it
[14:12:41] YuviPanda: if you want to use a mysql class
[14:12:46] you should probably do it in the role
[14:12:46] https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/wikimetrics.pp#L143
[14:12:55] the module just depends on some mysql package being installed
[14:13:02] keeping it puppet module agnostic
[14:13:07] !log aude Started scap: Put test.wikidata back on mw1.24-wmf19 extension branch
[14:13:10] ah, hmm
[14:13:12] Logged the message, Master
[14:13:15] * YuviPanda considers
[14:13:34] ottomata: ok let me do that
[14:13:40] YuviPanda: the wikimetrics module is also used in mediawiki-vagrant
[14:13:42] so ja.
[14:13:51] yeah forgot about that
[14:13:57] * aude hopes with everything moved around, nothing explodes
[14:14:22] ottomata: https://gerrit.wikimedia.org/r/160446 should be trivial +2 though?
[14:14:40] (CR) Ottomata: [C: 032 V: 032] Add .gitreview file [puppet/wikimetrics] - https://gerrit.wikimedia.org/r/160446 (owner: Yuvipanda)
[14:16:19] (PS1) Yuvipanda: wikimetrics: Put mysql data in /srv, use mysql::server [puppet] - https://gerrit.wikimedia.org/r/160448
[14:16:20] ottomata: ^
[14:16:40] (Abandoned) Yuvipanda: db: Use the mysql class instead of the package [puppet/wikimetrics] - https://gerrit.wikimedia.org/r/160445 (owner: Yuvipanda)
[14:18:11] _joe_: do you have some time to merge the patch that removes the mobile::vumi code?
[14:18:17] :)
[14:18:18] hm, YuviPanda, did you already apply this manually on wikimetrics1?
[14:18:23] i'm just checking that it won't break things there
[14:18:32] ottomata: I did it by hand on wikimetrics1 :) but I can test on the dev instance if you want
[14:18:39] naw its cool, i was just checking myself
[14:18:42] cool
[14:19:03] ottomata: I logged my actions on -labs :) had to fix apparmor profile as well for mysql, which was a bit weird
[14:19:04] ok, so, you probably want to apply by hand on wikimetrics-staging1
[14:19:14] its got a /srv/wikimetrics dir
[14:19:21] yeah
[14:19:45] well, those are self hosted puppetmaster anyway, so its ok to merge this i think
[14:19:47] ottomata: since the labs srv role wasn't included, the wikimetrics folders were just on the root volume, which doesn't have too much space
[14:19:51] you can fix and apply by hand, ja?
[14:19:59] ya
[14:20:07] k
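The by-hand datadir move being discussed was, in rough outline, something like the following (a sketch with assumed Ubuntu paths, not a transcript of the actual commands; the apparmor alias is the "fix apparmor profile" step YuviPanda mentions above):

    service mysql stop
    mkdir -p /srv/mysql && rsync -a /var/lib/mysql/ /srv/mysql/
    sed -i 's|^datadir.*|datadir = /srv/mysql|' /etc/mysql/my.cnf
    # apparmor pins mysqld to /var/lib/mysql; alias the old path to the new one
    echo 'alias /var/lib/mysql/ -> /srv/mysql/,' >> /etc/apparmor.d/tunables/alias
    service apparmor reload && service mysql start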
[14:20:42] (PS2) Ottomata: wikimetrics: Put mysql data in /srv, use mysql::server [puppet] - https://gerrit.wikimedia.org/r/160448 (owner: Yuvipanda)
[14:20:51] (CR) Ottomata: [C: 032 V: 032] wikimetrics: Put mysql data in /srv, use mysql::server [puppet] - https://gerrit.wikimedia.org/r/160448 (owner: Yuvipanda)
[14:20:57] cool
[14:22:48] ottomata: trying on staging1
[14:22:55] anomie: is updating the namespace name as simple as it looks? Like I can just deploy it and be done or is there a script required? https://gerrit.wikimedia.org/r/#/c/156821/5 for reference
[14:23:31] (CR) Manybubbles: [C: 031] Remove 'renameuser' right from bureaucrats on CentralAuth wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/160158 (owner: Legoktm)
[14:23:45] ottomata: hah, fails
[14:23:58] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Must provide non empty value. on node wikimetrics-staging1.eqiad.wmflabs
[14:23:58] Warning: Not using cache on failed catalog
[14:24:01] no idea what that means
[14:24:52] YuviPanda: that sounds like one of the errors we patched
[14:25:04] in dev/staging/prod we have separate commits
[14:25:33] so we pull --rebase to keep those on top
[14:25:37] I did that as well
[14:25:46] I'm checking to see if it was my change that fucked it up
[14:25:48] nope
[14:25:52] it was fucked up before that
[14:25:56] yeah, doubt it
[14:26:02] maybe dev doesn't have the right fix
[14:26:13] but you're welcome to try in staging, puppet worked there last time i ran it
[14:26:20] milimetric: this is on staging
[14:26:21] maybe run it before and after your change
[14:26:56] milimetric: just did that (reverted my change, ran it, same error)
[14:28:01] ottomata: qchris do you know anything about this puppet error ^?
[14:28:21] * qchris reads backscroll
[14:28:36] YuviPanda: will check shortly...
[14:28:42] ok
[14:28:44] this is on staging1
[14:30:07] YuviPanda: No clue.
[14:30:14] hmm
[14:30:19] Did it work before your most recent changes?
[14:30:25] manybubbles: That's a good question. The maintenance script is namespaceDupes.php to clean up if there are any pages that are now inaccessible. We should ask Reedy about it.
[14:30:44] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[14:30:44] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[14:31:11] qchris: nope, I checked that by reverting my change and running it again
[14:31:24] :-D
[14:31:35] so it's been broken a while :)
[14:32:37] ottomata: hey, thanks for fixing the elasticsearch check! I hope to have some time this week to bring in some more tests too
[14:32:44] (PS1) BBlack: Protoproxy template variable scope fixups [puppet] - https://gerrit.wikimedia.org/r/160451
[14:33:11] * YuviPanda is writing an experimental shinken module
[14:33:22] (CR) BBlack: [C: 032] Protoproxy template variable scope fixups [puppet] - https://gerrit.wikimedia.org/r/160451 (owner: BBlack)
[14:34:48] Reedy: so, renaming namespaces? ok thing?
[14:35:13] this scap thing taking forever
[14:35:27] to sync-common to the last server
[14:35:35] :P
[14:35:37] Typical
[14:35:39] ottomata: merge the mysql thing?
[14:35:43] yep
[14:36:00] you could try to ss/ lsof to see which one is slow
[14:36:59] (CR) Manybubbles: "I'd like to SWAT this today (because it is scheduled for today) but I'm not sure what the procedure is for renaming namespaces so I'm not " [mediawiki-config] - https://gerrit.wikimedia.org/r/156821 (https://bugzilla.wikimedia.org/48075) (owner: Gerrit Patch Uploader)
[14:37:49] ottomata: I'm assuming yes then :)
[14:38:04] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge.
[14:38:04] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[14:39:28] aude: It's fenari
[14:39:36] that's broke
[14:39:51] "fenari is swapping hard, restarting apache who was eating up all the RAM"
[14:40:11] godog: yup, np,
[14:40:19] oh bblack
[14:40:19] sorry
[14:40:22] yup that's fine
[14:40:26] aude: right now it's okish
[14:40:38] r b swpd free buff cache si so bi bo in cs us sy id wa
[14:40:38] 0 1 627 2716 47 78 42 31 88 155 2 0 12 1 83 3
[14:40:56] (CR) coren: [C: 032] lab db replica: don't show log_params for deleted/suppressed logs [software] - https://gerrit.wikimedia.org/r/160393 (owner: ArielGlenn)
[14:41:33] disk is busy
[14:42:00] syncing all the things or something
[14:42:01] YuviPanda: mind if I run puppet?
[14:42:11] ottomata: sure. qchris might also be poking at it
[14:42:13] done
[14:42:14] oh
[14:42:20] ottomata: Take it :-)
[14:42:37] ottomata: I just cleaned out the cruft, and updated production branch.
[14:42:59] aude: Now everything looks fine again
[14:43:05] yep
[14:43:53] !log restarting the enwiki cirrus reindex process - it crashed over the weekend. why you crash and leave error message "1". "1" is not a useful error message.
[14:43:57] Logged the message, Master
[14:45:58] All right sports fans, there's a few multimedia backports going out, so I'll get the SWAT this morning
[14:46:12] * YuviPanda cheers for marktraceur
[14:46:24] I hope the ferry's wifi holds. :P
[14:46:27] BUT YOU DO NOT WORK FOR THE WMF NOW HOW CAN YOU SWAT SECURITY BREACH CALL THE NSA PLEASE
[14:46:34] marktraceur: are you Reedy?
[14:47:03] marktraceur: You're going to be the evil guy today, then :D
[14:47:12] And how
[14:47:16] Taking the "rename" right from 'crats ;)
[14:47:17] I don't think you actually have to call the NSA, they're already on every call :p
[14:47:19] ALL of them :)
[14:47:27] omg! :P
[14:47:39] don't take my rights away :)
[14:48:07] We won't touch your rights. Your lefts maybe.
[14:48:17] heh
[14:48:37] aude: Become part of the global cabal new global renamer group
[14:48:41] :D
[14:49:00] :)
[14:49:16] (PS1) Filippo Giunchedi: wikimedia.org: remove labsconsole CNAME [dns] - https://gerrit.wikimedia.org/r/160454
[14:50:30] manybubbles: So are you going to SWAT today?
[14:50:34] !log aude Finished scap: Put test.wikidata back on mw1.24-wmf19 extension branch (duration: 37m 27s)
[14:50:38] yay
[14:50:39] Logged the message, Master
[14:51:12] hoo: want to +2 https://gerrit.wikimedia.org/r/#/c/159948/
[14:51:13] once fenari is gone scaps will be faster again
[14:51:27] unless we are more cruel to job runners :D
[14:51:33] so that test.wikidata / test2 have clean memcached of entities etc
[14:51:39] looking
[14:51:52] (PS1) BBlack: Remove eda1::0 from esams protoproxy completely [puppet] - https://gerrit.wikimedia.org/r/160458
[14:52:09] it's ugly hacky
[14:52:44] aude: Yeah
[14:53:11] we'll remove the ugly in 1-2 weeks
[14:53:14] (CR) Hoo man: [C: 032] Bump shared cache key for test.wikidata (memcached storage of items, etc.) [mediawiki-config] - https://gerrit.wikimedia.org/r/159948 (owner: Aude)
[14:53:19] thanks
[14:53:26] (Merged) jenkins-bot: Bump shared cache key for test.wikidata (memcached storage of items, etc.) [mediawiki-config] - https://gerrit.wikimedia.org/r/159948 (owner: Aude)
[14:54:15] (PS2) BBlack: Remove eda1::0 from esams protoproxy completely [puppet] - https://gerrit.wikimedia.org/r/160458
[14:54:15] !log Updated Jenkins Job Builder fork: e5c0c61..2d74b16
[14:54:20] Logged the message, Master
[14:54:52] !log aude Synchronized wmf-config/Wikibase.php: Bump wikibase memcached key for test.wikidata, test, test2 (duration: 00m 16s)
[14:54:59] Logged the message, Master
[14:55:02] alright, done :)
[14:55:03] (CR) BBlack: [C: 032] Remove eda1::0 from esams protoproxy completely [puppet] - https://gerrit.wikimedia.org/r/160458 (owner: BBlack)
[14:55:15] aude: Just in time :)
[14:55:19] yep
[14:55:19] Verified?
[14:55:25] doing, ... yes
[14:56:05] 2 entries for 'Base lambda function' during scap
[14:56:06] :(
[14:56:12] those are gone now
[14:56:15] mh
[14:56:24] I'll keep an eye on that
[14:56:30] sure
[14:56:59] nobody yet looked at my patch to allow deployers to graceful apaches :(
[14:57:08] But that's not surprising
[14:57:11] Q22 is back :)
[14:57:17] it was broken on friday
[14:57:35] man, our icinga code is a mess
[14:57:48] As opposed to our other code?
[14:58:50] marktraceur: relatively, in the puppet repo, icinga is the most messy
[14:59:31] manybubbles: Hi :)
[15:00:04] manybubbles, anomie, ^d, marktraceur, legoktm: Dear anthropoid, the time has come. Please deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140915T1500).
[15:00:36] o/ hello
[15:00:41] (PS1) coren: Labs: merge in changes to maintain-replicas.pl [software] - https://gerrit.wikimedia.org/r/160459
[15:00:45] legoktm: Hey
[15:00:52] FlorianSW: Hey, did you do the submodule update patches for your GeoCrumbs change?
[15:00:58] If not, would you mind?
[15:01:06] I can do the config changes first
[15:01:15] Right, legoktm is first because he's so pretty
[15:01:18] manybubbles: that's why i'm here :) but i saw: https://gerrit.wikimedia.org/r/#/c/160452/1
[15:01:23] legoktm: Shall we enable user merge stuff on beta?
[15:01:38] * anomie sees marktraceur appears to be SWATting today, and goes back to code review
[15:01:52] (PS2) coren: Labs: merge in changes to maintain-replicas.pl [software] - https://gerrit.wikimedia.org/r/160459
[15:01:54] Yeah, anomie, tgr is pushing out some evil MMV code, so I figured I'd take it.
[15:02:00] (CR) Andrew Bogott: [C: 031] wikimedia.org: remove labsconsole CNAME [dns] - https://gerrit.wikimedia.org/r/160454 (owner: Filippo Giunchedi)
[15:02:43] marktraceur ähm :D Who is doing the swat? You or manybubbles?
[15:02:48] * FlorianSW confused
[15:02:52] I am
[15:02:56] UHM
[15:03:03] There's no /a directory on tin, wtf
[15:03:18] marktraceur: it's /srv/mediawiki-staging now
[15:03:18] :D
[15:03:20] marktraceur: see ops list
[15:03:23] yep
[15:03:23] dafuq
[15:03:23] i can swat
[15:03:33] manybubbles: It's OK, I got it
[15:03:35] marktraceur: emails!
[15:03:39] marktraceur: k
[15:03:44] I was building the submodule updates
[15:03:46] https://gerrit.wikimedia.org/r/#/c/160452/1 i'm fine with this :)
[15:03:46] want me to finish them?
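The submodule updates being discussed are, mechanically, commits on the core wmf branch that move the extension submodule pointer to the backported revision. A hypothetical sketch of that workflow (the branch and SHA are placeholders):

    # inside a checkout of core on the wmf/1.24wmf21 branch
    cd extensions/GeoCrumbs
    git fetch origin && git checkout <backport-sha>
    cd ../..
    git add extensions/GeoCrumbs
    git commit -m "Update GeoCrumbs for cherry-picks"
    git review wmf/1.24wmf21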
[15:03:54] manybubbles: Don't mind if I do
[15:03:56] you do*
[15:04:02] * marktraceur sips coffee
[15:04:08] hoo: we probably can do it, I wanted to get https://gerrit.wikimedia.org/r/#/c/158258/, https://gerrit.wikimedia.org/r/#/c/158311/, and https://gerrit.wikimedia.org/r/#/c/159785/ in first
[15:04:46] (CR) MarkTraceur: [C: 032] Remove 'renameuser' right from bureaucrats on CentralAuth wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/160158 (owner: Legoktm)
[15:04:51] ITT legoktm creates his cabal.
[15:04:53] (Merged) jenkins-bot: Remove 'renameuser' right from bureaucrats on CentralAuth wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/160158 (owner: Legoktm)
[15:05:10] legoktm: So... not yet?
[15:05:51] probably not unless you want to review those ;)
[15:06:17] marktraceur: https://gerrit.wikimedia.org/r/#/c/160389/1 and https://gerrit.wikimedia.org/r/#/c/160462/
[15:06:27] ty
[15:06:35] legoktm: Meh :P Have to do a lot of review for Wikidata later on... I'm actually having a break right now
[15:06:37] !log marktraceur Synchronized wmf-config/: [SWAT] Remove 'renameuser' right from bureaucrats on CentralAuth wikis (duration: 00m 09s)
[15:06:41] legoktm: Verify?
[15:06:41] Logged the message, Master
[15:06:42] although that's not very much like a break :D
[15:06:43] sorry: https://gerrit.wikimedia.org/r/#/c/160452/ and https://gerrit.wikimedia.org/r/#/c/160462/
[15:07:00] marktraceur: ^^^
[15:07:05] > You do not have permission to rename users, for the following reason: You are not allowed to execute the action you have requested.
[15:07:06] yay!
[15:07:26] marktraceur: have you ever renamed a namespace? I'm not sure what all is involved but that is something I'm wary of
[15:07:26] Cool beans
[15:07:34] manybubbles: Oh, hm, no
[15:08:08] marktraceur: yeah - I've poked Reedy about it a few times and not heard back. Maybe we punt it out of the SWAT because we're not sure about it?
[15:08:13] marktraceur: thanks!
[15:08:15] manybubbles: Which one is that?
[15:08:25] The config change for i18n?
[15:08:28] it needs to get done, but maybe we need to wait for someone to tell about it. https://gerrit.wikimedia.org/r/#/c/156821/5
[15:08:31] yeah
[15:08:38] Ah.
[15:08:44] (PS3) coren: Labs: merge in changes to maintain-replicas.pl [software] - https://gerrit.wikimedia.org/r/160459
[15:08:46] Hi
[15:08:54] Glaisher: Do you know anything about potential pitfalls there?
[15:09:23] (possibly Glaisher won't respond and we'll not be able to push it anyway)
[15:09:35] Namespace dupes is pretty good, it'll tell you what's inaccessible etc
[15:09:41] (PS3) BBlack: add login-lb to eqiad protoproxy [puppet] - https://gerrit.wikimedia.org/r/160016
[15:09:42] I'll pause it for now, discuss amongst yourselves
[15:09:42] (PS3) BBlack: Remove dead protoproxy entries completely [puppet] - https://gerrit.wikimedia.org/r/160017
[15:09:45] (PS3) BBlack: Remove dead addrs from protoproxies [puppet] - https://gerrit.wikimedia.org/r/160015
[15:09:47] (PS3) BBlack: add textsvc/uploadsvc in ulsfo for consistency [puppet] - https://gerrit.wikimedia.org/r/160014
[15:09:48] (PS3) BBlack: remove dead esams donatelbsecure [puppet] - https://gerrit.wikimedia.org/r/160013
[15:09:51] (PS3) BBlack: remove old mobile/bits addrs in eqiad+esams [puppet] - https://gerrit.wikimedia.org/r/160012
[15:09:53] (PS3) BBlack: Sanitize text-related addrs for eqiad [puppet] - https://gerrit.wikimedia.org/r/160011
[15:10:02] manybubbles, FlorianSW, your GC stuff is next
[15:10:04] A greg-g enters the room.
[15:10:26] marktraceur, manybubbles: all right, let's go :)
[15:11:28] hola
[15:11:34] What up greg-g
[15:11:54] nada mucho
[15:13:26] Cool.
[15:14:18] Jenkins is a bit sluggish this morning
[15:14:30] Ah, timing.
[15:15:51] hey Coren, if you're touching maintain-replicas... Any chance you could look at https://gerrit.wikimedia.org/r/#/c/143622/ please?
[15:15:55] ori_: hiya
[15:15:56] (PS3) Reedy: Add pr_index table from Proofread Page extension [software] - https://gerrit.wikimedia.org/r/143622
[15:16:07] (PS2) Reedy: Normalise quotes used. Sync fullviews [software] - https://gerrit.wikimedia.org/r/143649
[15:16:45] <_joe_> Reedy: I'm one step nearer to merging your big apache changes: https://github.com/lavagetto/webtest
[15:16:57] thx marktraceur & manybubbles :)
[15:17:00] yup
[15:17:13] <_joe_> it still needs some polish, but after that I'll be able to test whatever changes upon an apache config change
[15:17:16] !log marktraceur Synchronized php-1.24wmf20/extensions/GeoCrumbs/GeoCrumbs.class.php: [SWAT] Handle return value NULL of GeoCrumbs::getParserCache (duration: 00m 07s)
[15:17:21] Logged the message, Master
[15:17:23] FlorianSW: Verify on a wikipedia, please :)
[15:17:36] (PS4) coren: Labs: merge in changes to maintain-replicas.pl [software] - https://gerrit.wikimedia.org/r/160459
[15:17:49] (CR) BBlack: [C: 032] Sanitize text-related addrs for eqiad [puppet] - https://gerrit.wikimedia.org/r/160011 (owner: BBlack)
[15:17:49] Reedy: Added to my current changeset.
[15:18:24] Coren: Aha. Thanks. That looks like that makes 143649 possibly redundant too
[15:18:25] (PS1) Ottomata: Use $labs_finger as default $master finger in salt role [puppet] - https://gerrit.wikimedia.org/r/160464
[15:18:38] !log marktraceur Synchronized php-1.24wmf21/extensions/GeoCrumbs/GeoCrumbs.class.php: [SWAT] Handle return value NULL of GeoCrumbs::getParserCache (duration: 00m 07s)
[15:18:42] Logged the message, Master
[15:18:45] FlorianSW: And now, verify on mw.org or a testwiki.
[15:19:02] * qchris looks
[15:19:28] looks like just a typo problem in that commit
[15:19:38] the pick() function ended up being passed 0 non-empty arguments
[15:19:40] and that's where the error was being thrown
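That ties the morning's breakage together: stdlib's pick() returns its first non-empty argument and aborts catalog compilation when every argument is undef or empty, which matches the "Must provide non empty value" error the labs puppetmasters were throwing. A toy reproduction (a sketch; assumes puppet with puppetlabs-stdlib installed):

    puppet apply -e 'notice(pick(undef, ""))'
    # Error: Must provide non empty value. ...
    puppet apply -e 'notice(pick(undef, "fallback"))'
    # notice: fallback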
[15:19:43] Reedy: So if I want to push Glaisher's patch with some namespace renames, what do I have to do, if anything?
[15:19:51] FlorianSW: Well, then wikivoyage
[15:20:03] (CR) QChris: [C: 031] Use $labs_finger as default $master finger in salt role [puppet] - https://gerrit.wikimedia.org/r/160464 (owner: Ottomata)
[15:20:06] It would be super if we had a testwiki running it though.
[15:20:07] marktraceur: sync it, mwscript namespaceDupes.php --wiki=foobar
[15:20:14] (CR) Ottomata: "Ori, is this correct? It looks like this is what you meant to do in the first place. Not sure though." [puppet] - https://gerrit.wikimedia.org/r/160464 (owner: Ottomata)
[15:20:16] K
[15:20:19] See what the output says... You might need to run again with --fix
[15:20:20] Thanks ottomata
[15:20:27] Reedy: Thanks, I'll do that next.
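Spelled out, Reedy's recipe is a dry run first and a fixing pass only if needed (the wiki name here is illustrative; the dvwiki config change is the one under discussion):

    # report pages left unreachable by the namespace rename
    mwscript namespaceDupes.php --wiki=dvwiki
    # if it lists conflicts, run again and let it move the affected pages
    mwscript namespaceDupes.php --wiki=dvwiki --fix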
[15:20:30] Glaisher: Are you there?
[15:20:36] qchris: i manually edited that on -staging1 and ran puppet
[15:20:39] marktraceur: yeah, but this isn't my extension per se :) I just fixed this little problem, because it's a blocker :)
[15:20:49] I will undo that manual edit and wait for ori's review
[15:20:54] or...should I leave it?
[15:21:07] I'll just commit it for now in staging.
[15:21:10] Hm, hasn't talked since joining last night.
[15:21:16] ok
[15:21:18] Rebasing later on will recognize it.
[15:21:25] Glaisher: You get another pause because you aren't answering; if you're around to verify, please ping me and you can go after I sync tgr's changes to MMV.
[15:22:44] k
[15:22:58] So tgr, you're next, if you're ready to verify.
[15:22:59] YuviPanda: that error is fixed, but there are new puppet problems, mysql related
[15:23:01] take it away :)
[15:23:04] aaah
[15:23:08] those would be mine, I suppose
[15:23:12] _joe_: can you explain a bit about puppetmasters to me? It seems like puppetmaster and the apache puppetmaster server are configured to listen on the same port.
[15:23:22] Which, on virt1000 I currently can't start the puppetmaster due to a port conflict
[15:23:40] don't know you need both?
[15:23:41] andrewbogott:
[15:23:52] I do need both! But surely not on the same port?
[15:23:58] ok
[15:24:07] ottomata: completes on staging without any errors for me
[15:24:16] marktraceur: ready
[15:24:18] Sweet.
[15:24:23] oh, ok...i saw an alembic upgrade problem
[15:24:26] (migrations)
[15:24:26] but ok
[15:24:28] ottomata: ah, ok :)
[15:24:30] maybe it fixed itself on the second run
[15:24:32] ottomata: yeah
[15:24:39] ottomata: should I run on wikimetrics1?
[15:24:40] andrewbogott: Wait, we don't use the puppetmaster webserver do we? I thought we used mod_passenger on the one apache instead.
[15:24:52] YuviPanda: ask milimetric
[15:25:05] * YuviPanda considers milimetric asked
[15:25:16] andrewbogott: with a distinct vhost for puppet stuff.
[15:25:36] Right, there is a distinct vhost.
[15:25:53] * aude spent the weekend playing with puppetmaster:p
[15:26:06] Alls I know is -- the puppetmaster won't start due to a port conflict. And when I look in the puppetmaster config and the apache vhost config -- same port: 8140
[15:26:13] Which seems like… a good place to start for a port conflict
[15:26:27] seemed if you have the apache thing then you don't use puppetmaster in addition (for webserver)
[15:26:34] but don't fully understand
[15:26:39] YuviPanda: just to catch up, you guys think the puppet stuff is ready and want to run it on wikimetrics1? There's an outstanding error but it's something about alembic?
[15:27:04] milimetric: no error, the alembic stuff seems to have resolved itself
[15:27:22] k, YuviPanda, all good then
[15:27:23] the documentation for puppetmaster is confusing and lacking
[15:27:52] !log marktraceur Synchronized php-1.24wmf20/extensions/MultimediaViewer/: [SWAT] Several backports for metrics and bugfixes in Media Viewer (duration: 00m 07s)
[15:27:57] Logged the message, Master
[15:28:04] tgr: Test on a wikipedia please :)
[15:29:06] !log marktraceur Synchronized php-1.24wmf21/extensions/MultimediaViewer/: [SWAT] Several backports for metrics and bugfixes in Media Viewer (duration: 00m 07s)
[15:29:07] Aaaaand on mediawiki.org.
[15:29:10] Logged the message, Master
[15:29:58] Glaisher: We're 30 minutes into the SWAT window and you still haven't responded to pings; I'm fairly sure this means we're going to punt you to a later SWAT window. I'll tentatively stick it in the afternoon one today, but let me know if tomorrow's morning SWAT would be better.
[15:31:46] marktraceur: neither patch can be easily tested, but nothing seems to be broken at least
[15:32:03] Good enough
[15:32:33] tgr: If something isn't working the way you expect, ping me and we'll use the afternoon SWAT to fix 'er
[15:32:35] did this go out to wmf.org as well?
[15:32:43] in that case I can test it
[15:32:52] Maybe
[15:33:09] Yeah, should have done
[15:33:33] S:V disagrees, but I suspect it's cached and lying to me.
[15:33:56] yeah, the bugfix part works
[15:34:00] Sweet.
[15:34:15] can't see the new event logs, but they are sampled, so...
[15:34:24] Righto.
[15:34:35] Should have some new data by afternoon or so.
[15:36:52] (PS1) Manybubbles: Switch primary search backend for jawiki to Cirrus [mediawiki-config] - https://gerrit.wikimedia.org/r/160465
[15:37:15] (CR) Manybubbles: "Will deploy during window." [mediawiki-config] - https://gerrit.wikimedia.org/r/160465 (owner: Manybubbles)
[15:37:49] <_joe_> andrewbogott: puppetmaster should not start where we use mod_passenger
[15:37:59] <_joe_> they're alternatives, should not run together
[15:38:05] _joe_: Yep, I sorted that.
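In other words, with the passenger vhost bound to 8140, the standalone daemon has to stay stopped. A quick way to confirm who owns the port (a sketch, assuming the Ubuntu init scripts of the time):

    service puppetmaster stop    # the standalone master that wouldn't start above
    netstat -tlnp | grep 8140    # expect a single apache2 (mod_passenger) listener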
[15:38:17] The error I'm seeing is: Could not retrieve catalog from remote server: Error 400 on SERVER: Must provide non empty value.
[15:38:34] Ever see that? Googling suggests it's something to do with hiera, but if hiera is turned on for labs that's news to me
[15:38:41] <_joe_> andrewbogott: the error you've seen where and how?
[15:38:41] andrewbogott: ah, ottomata just fixed that with https://gerrit.wikimedia.org/r/#/c/160464
[15:39:04] <_joe_> ok :)
[15:39:09] <_joe_> thnx yuvi
[15:39:20] _joe_: on every single labs instance
[15:39:21] yw :) we just ran into it on wikimetrics' self hosted puppetmaster...
[15:39:29] YuviPanda: I totally can't tell how that would relate to puppet failing
[15:39:50] andrewbogott: unsure, but I do know that it fixes it (just cherry-picked that one to wikimetrics1)
[15:40:08] K, I have no more patience, I declare SWAT over
[15:40:10] How's it going to get applied to labs instances where puppet is already broken?
[15:40:13] Unless there are last minute things to do
[15:40:34] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[15:40:36] <_joe_> ???
[15:40:38] andrewbogott: the offending change is this:
[15:40:38] https://gerrit.wikimedia.org/r/#/c/153727/11
[15:40:39] I think
[15:40:42] it was merged yesterday
[15:40:49] i'm not sure if my fix is correct: https://gerrit.wikimedia.org/r/#/c/160464
[15:40:52] but i think it is
[15:40:52] andrewbogott: the breakage is on the master
[15:40:55] <_joe_> yesterday
[15:40:57] <_joe_> ....
[15:41:01] oooook
[15:41:02] * andrewbogott tries
[15:41:15] _joe_: that is consistent with puppetmaster failing since yesterday on labs
[15:41:45] <_joe_> everybody wait a minute
[15:41:52] greg-g: ugh.
[15:42:02] <_joe_> I'd probably revert all the salt work that was merged on sunday
[15:42:15] so, swat is still open? anyone wants to backport and deploy https://gerrit.wikimedia.org/r/159086 ? (see https://bugzilla.wikimedia.org/show_bug.cgi?id=69924#c28)
[15:42:22] <_joe_> It's causing problems in prod as well
[15:42:55] MatmaRex: asking those kinds of leading questions (the one on the bug) only sets you up to be told "you shoulda done it yourself" ;)
[15:43:17] <_joe_> so if you have a hotfix, apply it
[15:43:26] <_joe_> anything larger than that, please wait
[15:43:39] (PS5) coren: Labs: merge in changes to maintain-replicas.pl [software] - https://gerrit.wikimedia.org/r/160459 (https://bugzilla.wikimedia.org/54164)
[15:43:52] greg-g: okay, i won't be merging stuff that i can't commit to guiding to backport and deployment next time
[15:44:04] i wasn't even home until now
[15:44:34] Hm… since my firefox update yesterday, clicking on links doesn't open new tabs. That happening to everyone?
[15:44:35] MatmaRex: feel free to merge it, just don't ask passive aggressively if someone will do something, either say "I can't do this, can you Krinkle?" or similar
[15:45:15] "Do you even Krinkle, dude?"
[15:45:32] _joe_: Hotfixing isn't especially straightforward, but I'm ok waiting for you to sort things out.
[15:45:39] andrewbogott: checked the setting? Might be worth toggling it
[15:45:42] Or I can submit a bunch of reverts if you like
[15:46:04] <_joe_> andrewbogott: well, I'd prefer not to waste the work that has been done
[15:46:27] …I can't tell if you're advocating for reverts, or for fixes, or for neither :)
[15:46:50] If we're not going to revert then we can apply ottomata's patch (which looks right to me) and fix labs easily.
[15:47:09] +1 to fixing labs for now
[15:47:20] Reedy: hm, sure enough
[15:48:16] I would say that ottomata's patch is clearly right, it just resolves a copy/paste error
[15:49:12] andrewbogott: merge away :)
[15:49:13] (CR) Andrew Bogott: [C: 031] "I think this is correct. Not merging pending a larger discussion about the preceding change, though." [puppet] - https://gerrit.wikimedia.org/r/160464 (owner: Ottomata)
[puppet] - 10https://gerrit.wikimedia.org/r/160464 (owner: 10Ottomata) [15:49:26] it will only affect labs anyway, and labs is already broken :) [15:49:39] * andrewbogott is still observing _joe_'s request to everybody wait a minute [15:49:59] ottomata: it's just a question of not piling more patches on an existing patch series if we're about to revert everything [15:50:06] ottomata: we're probably going to revert all the recent salt work, unless joe already knows an obvious fix [15:50:09] but it's multiple commits [15:51:20] aye ok [15:51:21] cool [15:52:30] <_joe_> bblack: I don't, but give me 3 mins to get to the bottom of what I'm doing right now [15:52:35] ok [15:53:42] (03PS1) 10Giuseppe Lavagetto: Remove the last references to pybal on fenari [puppet] - 10https://gerrit.wikimedia.org/r/160467 [15:53:44] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [15:54:02] <_joe_> so: if someone can take a look at the labs issue [15:54:07] <_joe_> I can look at the prod ones [15:55:21] my personal take on the salt thing is just revert the 4x commits in a row that are salt-related from ori yesterday. Starting at "Remove salt-minion.override" + the next 3 [15:55:37] _joe_: I'm pretty sure that ottomata's patch will sort out the labs problem. It's just a copy/paste thing. If there's another problem buried under that one, it's hard to know until we merge that. [15:55:58] it's going to conflict with the reverts, I think [15:56:07] <_joe_> andrewbogott: you can try a few hosts with pcc to be sure [15:56:09] which is why I'm not merging it [15:56:16] we can revert that too [15:56:37] * andrewbogott feels like he's repeating himself a lot this morning. sorry [15:56:51] hey i got an idea, let's merge that patch! [15:56:55] harharhar jk :p [15:57:06] _joe_: YuviPanda already tested it with a local puppetmaster I believe. Worked, right? [15:57:21] andrewbogott: yea [15:57:21] yes, we tested it with the wikimetrics instances [15:57:24] andrewbogott: the one-line labs fix is fixing a problem introduced in the 4x being reverted, basically [15:57:52] but that won't fix any prod issues caused by the change [15:57:56] well, then it'll be 5x revert [15:57:57] will only fix the one labs error [15:58:10] there could be labs salt issues caused by this too [15:58:14] if it messed up stuff in prod [15:58:14] * andrewbogott gets some breakfast [16:00:04] manybubbles, ^d: Dear anthropoid, the time has come. Please deploy Search (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140915T1600). [16:00:44] marktraceur: all done with swat? [16:01:11] (03CR) 10Manybubbles: [C: 032] Switch primary search backend for jawiki to Cirrus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160465 (owner: 10Manybubbles) [16:01:18] <_joe_> If you end up having problems with scap/trebuchet, ping me [16:01:19] the pending salt issues might affect scap, I'd hold on any deploy that relies on that for a sec [16:05:58] Uh, yes. [16:06:45] hrmmm [16:06:47] I wonder how the previous deploy completed properly? [16:06:52] it was during the broken period [16:07:32] jenkins hung? [16:08:00] Salt being jacked up should not cause any direct problems for scap/sync-* [16:08:10] It will only cause a problem for trebuchet [16:08:18] s/will/should/ [16:08:21] can i try cirrus in jawiki yet?
[16:09:46] aude: as soon as jenkins +2s I guess [16:09:56] ok :) [16:10:01] aude: i mean - you can always try it with the url parameter or the betafeature [16:10:06] if you can read japanese [16:10:11] * aude loves typing in japanese, the few words that i know [16:10:26] pushing the "scrunch" button is all kinds of fun [16:12:07] (03PS1) 10Giuseppe Lavagetto: salt: fix scoping and cluster definition [puppet] - 10https://gerrit.wikimedia.org/r/160468 [16:12:44] (03Merged) 10jenkins-bot: Switch primary search backend for jawiki to Cirrus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160465 (owner: 10Manybubbles) [16:13:30] !log manybubbles Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 06s) [16:13:36] Logged the message, Master [16:13:37] yay! [16:14:18] (03CR) 10Tpt: [C: 031] Add pr_index table from Proofread Page extension [software] - 10https://gerrit.wikimedia.org/r/143622 (owner: 10Reedy) [16:14:44] what's the test.wikipedia link in the /topic for? [16:14:52] manybubbles: I've been done with SWAT for a while, yes. [16:15:00] greg-g: to test wikipedia! [16:15:04] !log jawiki now has cirrus as primary. we're back to where we were before the great cascading failure of two months ago [16:15:09] Logged the message, Master [16:15:09] YuviPanda: I didn't use scap [16:15:22] sync-dir and sync-file all the way bro [16:15:26] ah [16:15:26] Wait. [16:15:28] ol' style [16:15:35] tgr: Were there message changes in the MMV deployment? [16:15:39] scap isn't needed in most cases [16:16:21] (03CR) 10BBlack: [C: 031] salt: fix scoping and cluster definition [puppet] - 10https://gerrit.wikimedia.org/r/160468 (owner: 10Giuseppe Lavagetto) [16:16:24] (03CR) 10Giuseppe Lavagetto: [C: 032] "http://puppet-compiler.wmflabs.org/344/change/160468/html/mw1017.eqiad.wmnet.html" [puppet] - 10https://gerrit.wikimedia.org/r/160468 (owner: 10Giuseppe Lavagetto) [16:16:47] Adding a `--no-l10n` flag to scap would make it pretty easy to kill off sync-* [16:17:10] bd808: don't we want to _just_ sync the things we're intentionally changing? [16:17:44] better would be to always run a precheck, verify, then sync everything. but hey [16:17:46] manybubbles: because side effects. Like the git info for Special:Version isn't updated without a full scap [16:18:15] bd808: not saying we shouldn't do it - but --no-intl wouldn't be enough [16:18:27] and it's a false sense of security. Any cluster host can call sync-common at any time [16:18:27] _joe_: still digging? [16:18:29] we'd need to always do a dry run to verify the list of files being changed [16:18:51] (03PS2) 10Giuseppe Lavagetto: Use $labs_finger as default $master finger in salt role [puppet] - 10https://gerrit.wikimedia.org/r/160464 (owner: 10Ottomata) [16:18:51] bd808: yeah - it really really really is false - but it's something. [16:19:27] The state of /srv/mediawiki-staging could be pulled to any host in the cluster at any time [16:19:34] which is a flaw in scap in general [16:20:18] We need to fix that IMHO, but there are a few issues with Trebuchet that I'd like to see ironed out first. [16:20:51] _joe_: just curious, how does that fix the scoping problem?
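For the record, the "url parameter" aude mentions is presumably srbackend, which let a single Special:Search request be forced through Cirrus before it became the default; the parameter name is recalled from the CirrusSearch test setup of the time, so treat it as an assumption:

    # force one jawiki search through the Cirrus backend
    curl -s 'https://ja.wikipedia.org/w/index.php?title=Special:Search&search=tokyo&srbackend=CirrusSearch' >/dev/null && echo reachable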
[16:21:00] Like fanout and the general state of the `git deploy` porcelain [16:21:37] (03CR) 10Giuseppe Lavagetto: [C: 032] "given this was clearly the intent of the patch and it doesn't harm production (http://puppet-compiler.wmflabs.org/345/change/160464/html/)" [puppet] - 10https://gerrit.wikimedia.org/r/160464 (owner: 10Ottomata) [16:22:28] <_joe_> ottomata: $::cluster can only be a top-scope variable [16:22:49] <_joe_> $cluster is either a node-scope variable (which inherits top-scope) or local-scope [16:23:05] is it defined locally as $cluster? [16:23:17] <_joe_> it's one of the tricks I did to make our codebase compatible with puppet 3 without using hiera [16:23:20] <_joe_> :) [16:23:31] oh [16:23:37] so, $cluster should always be $cluster in our code? [16:24:37] <_joe_> basically yes [16:24:47] <_joe_> $::cluster right now refers to 'misc' [16:29:04] oh yeah so the labs master-finger fixup probably should be merged since we didn't revert [16:29:19] oh joe already did above [16:29:47] I'm applying on virt1000 and it's hanging on Scheduling refresh of Service[salt-minion]. Waiting to see if it works after a second run... [16:33:19] _joe_: thanks for clearing that up! [16:33:30] <_joe_> andrewbogott: np [16:33:40] <_joe_> someone else will have to thank me :P [16:36:12] (03CR) 10coren: [C: 032] "Matches current version." [software] - 10https://gerrit.wikimedia.org/r/160459 (https://bugzilla.wikimedia.org/54164) (owner: 10coren) [16:36:39] (03PS4) 10BBlack: remove old mobile/bits addrs in eqiad+esams [puppet] - 10https://gerrit.wikimedia.org/r/160012 [16:36:53] PROBLEM - puppet last run on virt1000 is CRITICAL: CRITICAL: Epic puppet fail [16:37:07] (03CR) 10BBlack: [C: 032] remove old mobile/bits addrs in eqiad+esams [puppet] - 10https://gerrit.wikimedia.org/r/160012 (owner: 10BBlack) [16:41:55] (03PS4) 10BBlack: remove dead esams donatelbsecure [puppet] - 10https://gerrit.wikimedia.org/r/160013 [16:42:28] (03Abandoned) 10Reedy: Add pr_index table from Proofread Page extension [software] - 10https://gerrit.wikimedia.org/r/143622 (owner: 10Reedy) [16:42:39] (03Abandoned) 10Reedy: Normalise quotes used. Sync fullviews [software] - 10https://gerrit.wikimedia.org/r/143649 (owner: 10Reedy) [16:43:18] (03CR) 10BBlack: [C: 032] remove dead esams donatelbsecure [puppet] - 10https://gerrit.wikimedia.org/r/160013 (owner: 10BBlack) [16:47:27] So… a hanging puppet run: where to look, and what to grep for? [16:47:44] I'm already running -v and it says nothing. Last output line says 'executed successfully' [16:48:00] strace? [16:48:06] I'm pretty sure that I've never seen puppet actually hang before [16:48:07] what do you mean by 'hanging'?
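_joe_'s $cluster vs $::cluster distinction is easy to see with a standalone puppet apply; a rough demo under puppet 3 scoping rules, with invented names:

    cat > /tmp/scope-demo.pp <<'EOF'
    $cluster = 'misc'                    # top scope
    node default {
      $cluster = 'appserver'             # node scope shadows top scope
      notice("plain:    ${cluster}")     # -> appserver
      notice("topscope: ${::cluster}")   # -> misc
    }
    EOF
    puppet apply /tmp/scope-demo.pp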
[16:48:43] strace -ff -e execve puppet agent -tv [16:48:53] will probably show it, if it's an exec type thing [16:49:04] you can also attach to the current process [16:49:09] to see where exactly it's on [16:49:23] (but if it's an exec, it might just be chilling in something waitpid-like) [16:49:34] or check the process list for children of the running puppet agent [16:49:43] PROBLEM - puppet last run on ssl1004 is CRITICAL: CRITICAL: Epic puppet fail [16:49:48] PROBLEM - puppet last run on ocg1002 is CRITICAL: CRITICAL: Epic puppet fail [16:49:53] PROBLEM - puppet last run on search1021 is CRITICAL: CRITICAL: Epic puppet fail [16:49:53] PROBLEM - puppet last run on tmh1002 is CRITICAL: CRITICAL: Epic puppet fail [16:49:53] PROBLEM - puppet last run on search1009 is CRITICAL: CRITICAL: Epic puppet fail [16:49:54] PROBLEM - puppet last run on wtp1014 is CRITICAL: CRITICAL: Epic puppet fail [16:49:54] PROBLEM - puppet last run on mw1214 is CRITICAL: CRITICAL: Epic puppet fail [16:49:57] whelp [16:49:59] uhhhhh [16:50:53] Notice: Run of Puppet configuration client already in progress; skipping (/var/lib/puppet/state/agent_catalog_run.lock exists) [16:50:56] Notice: Caught TERM; calling stop [16:50:58] ? [16:51:06] mark: So far, I don't know what I mean. Just that when I run puppet agent -tv, it outputs a few happy lines then… hangs. [16:51:15] Uhoh, I wonder if that's happening everywhere :( [16:51:16] PROBLEM - puppet last run on search1014 is CRITICAL: CRITICAL: Epic puppet fail [16:51:19] I guess that's the puppet hangs you're referring to? it's hanging on lots of random hosts? [16:51:33] the epic fail is "it can't run because it's still running from last time" [16:51:34] PROBLEM - puppet last run on wtp1021 is CRITICAL: CRITICAL: Epic puppet fail [16:51:34] I was only watching virt1000, but maybe it's happening everywhere [16:51:37] yeah strace it to see what syscall it's blocked on [16:51:43] PROBLEM - puppet last run on mw1216 is CRITICAL: CRITICAL: Epic puppet fail [16:51:44] PROBLEM - puppet last run on wtp1024 is CRITICAL: CRITICAL: Epic puppet fail [16:51:53] PROBLEM - puppet last run on ssl3003 is CRITICAL: CRITICAL: Epic puppet fail [16:51:54] PROBLEM - puppet last run on virt1002 is CRITICAL: CRITICAL: Epic puppet fail [16:52:14] PROBLEM - puppet last run on ocg1001 is CRITICAL: CRITICAL: Epic puppet fail [16:52:14] PROBLEM - puppet last run on mw1218 is CRITICAL: CRITICAL: Epic puppet fail [16:52:14] PROBLEM - puppet last run on search1020 is CRITICAL: CRITICAL: Epic puppet fail [16:52:16] <_joe_> mmmh [16:52:16] hmm ssl1004 fixed itself [16:52:23] PROBLEM - puppet last run on rdb1004 is CRITICAL: CRITICAL: Epic puppet fail [16:52:24] PROBLEM - puppet last run on pc1001 is CRITICAL: CRITICAL: Epic puppet fail [16:52:40] /usr/bin/python /usr/bin/salt-call --out=txt test.ping [16:52:46] ^ that looks suspicious [16:52:53] PROBLEM - puppet last run on wtp1017 is CRITICAL: CRITICAL: Epic puppet fail [16:53:03] PROBLEM - puppet last run on wtp1019 is CRITICAL: CRITICAL: Epic puppet fail [16:53:04] <_joe_> bblack: on some hosts salt is hanging upon restart [16:53:31] on ssl1004, that salt-call is hung in the process list, but it's been forked from whatever and has no real parent anymore [16:53:34] PROBLEM - puppet last run on wtp1009 is CRITICAL: CRITICAL: Epic puppet fail [16:53:48] <_joe_> bblack: that was my test earlier probably [16:53:53] looks like it's blocked on /sbin/start salt-minion [16:54:13] PROBLEM - puppet last run on ssl1002 is CRITICAL: CRITICAL:
Epic puppet fail [16:54:17] <_joe_> bblack: we should schedule a restart of salt on all servers sooner or later [16:54:23] PROBLEM - puppet last run on search1016 is CRITICAL: CRITICAL: Epic puppet fail [16:54:25] yeah there's a hung "/sbin/stop salt-minion" on ssl1004 from half an hour ago [16:54:43] PROBLEM - puppet last run on wtp1020 is CRITICAL: CRITICAL: Epic puppet fail [16:54:47] I think puppet gives up eventually and leaves that forked off, it just takes longer than our client interval [16:55:03] PROBLEM - puppet last run on wtp1006 is CRITICAL: CRITICAL: Epic puppet fail [16:55:04] PROBLEM - puppet last run on rbf1002 is CRITICAL: CRITICAL: Epic puppet fail [16:55:09] <_joe_> bblack: so salt minion doesn't stop cleanly [16:55:12] !log Restarted hung elasticsearch service on logstash1002 [16:55:14] PROBLEM - puppet last run on search1010 is CRITICAL: CRITICAL: Epic puppet fail [16:55:16] Logged the message, Master [16:55:23] PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: Epic puppet fail [16:55:42] oh, godog, i guess we never merged this [16:55:43] shall we? [16:55:44] https://gerrit.wikimedia.org/r/#/c/160090/1/modules/elasticsearch/files/nagios/check_elasticsearch.py [16:55:53] PROBLEM - puppet last run on mw1205 is CRITICAL: CRITICAL: Epic puppet fail [16:56:03] PROBLEM - puppet last run on mw1217 is CRITICAL: CRITICAL: Epic puppet fail [16:56:03] PROBLEM - puppet last run on wtp1016 is CRITICAL: CRITICAL: Epic puppet fail [16:56:10] Debug: Executing '/sbin/start salt-minion' [16:56:15] PROBLEM - puppet last run on virt1006 is CRITICAL: CRITICAL: Epic puppet fail [16:56:16] ^ that's where ssl1004 hangs now, so yeah [16:56:23] PROBLEM - puppet last run on search1018 is CRITICAL: CRITICAL: Epic puppet fail [16:56:41] (03PS2) 10Ottomata: Remove udp2log stream to Vrije Universiteit Amsterdam [puppet] - 10https://gerrit.wikimedia.org/r/160435 (owner: 10QChris) [16:56:50] (03CR) 10Ottomata: [C: 032 V: 032] Remove udp2log stream to Vrije Universiteit Amsterdam [puppet] - 10https://gerrit.wikimedia.org/r/160435 (owner: 10QChris) [16:57:05] PROBLEM - puppet last run on search1001 is CRITICAL: CRITICAL: Epic puppet fail [16:57:15] PROBLEM - puppet last run on search1007 is CRITICAL: CRITICAL: Epic puppet fail [16:57:17] _joe_: I've noticed on trusty some odd minion behavior, where I get two returns from minion to the master.. restarting helped some but not all of those [16:57:35] not relevant to you now I guess but fyi. [16:57:36] The minion says Master hostname: salt not found. Retrying in 30 seconds [16:57:49] <_joe_> andrewbogott: where? [16:58:04] in /var/log/salt/minion [16:58:04] PROBLEM - puppet last run on mw1213 is CRITICAL: CRITICAL: Epic puppet fail [16:58:04] PROBLEM - puppet last run on mw1002 is CRITICAL: CRITICAL: Epic puppet fail [16:58:14] PROBLEM - puppet last run on mw1211 is CRITICAL: CRITICAL: Epic puppet fail [16:58:23] PROBLEM - puppet last run on mw1208 is CRITICAL: CRITICAL: Epic puppet fail [16:58:23] PROBLEM - puppet last run on virt1003 is CRITICAL: CRITICAL: Epic puppet fail [16:58:27] <_joe_> good god.
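Attaching to the already-running agent, as suggested above, is usually the quickest way to see which child it is parked on; a sketch of that dance:

    PUPPET_PID=$(pgrep -of 'puppet agent')             # oldest matching process
    ps --ppid "$PUPPET_PID" -o pid,etime,args          # look for a stuck salt-call or /sbin/start
    sudo strace -p "$PUPPET_PID" -f -e trace=process   # typically parked in a waitpid() on the child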
[16:58:29] (03CR) 10Ottomata: For regular webstatscollector installs, use latest version (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/157878 (https://bugzilla.wikimedia.org/70295) (owner: 10QChris) [16:58:33] PROBLEM - puppet last run on mw1206 is CRITICAL: CRITICAL: Epic puppet fail [16:58:42] _joe_: It's right in the minion config though [16:58:45] PROBLEM - puppet last run on wtp1005 is CRITICAL: CRITICAL: Epic puppet fail [16:58:45] PROBLEM - puppet last run on virt1001 is CRITICAL: CRITICAL: Epic puppet fail [16:58:53] PROBLEM - puppet last run on pc1002 is CRITICAL: CRITICAL: Epic puppet fail [16:58:53] PROBLEM - puppet last run on virt1004 is CRITICAL: CRITICAL: Epic puppet fail [16:58:55] <_joe_> andrewbogott: lemme check [16:58:55] I mean, 'correct', in other words, palladium [16:58:55] on ssl1004, a manual kill of salt-minion (TERM) + manual stop worked fine [16:59:01] (03PS3) 10Ottomata: For regular webstatscollector installs, use latest version [puppet] - 10https://gerrit.wikimedia.org/r/157878 (https://bugzilla.wikimedia.org/70295) (owner: 10QChris) [16:59:03] PROBLEM - puppet last run on wtp1012 is CRITICAL: CRITICAL: Epic puppet fail [16:59:09] /sbin/stop doesn't realize what to do and thinks nothing needs to be stopped, though [16:59:13] PROBLEM - puppet last run on labstore1001 is CRITICAL: CRITICAL: Epic puppet fail [16:59:20] (03CR) 10Ottomata: [C: 032 V: 032] For regular webstatscollector installs, use latest version [puppet] - 10https://gerrit.wikimedia.org/r/157878 (https://bugzilla.wikimedia.org/70295) (owner: 10QChris) [16:59:21] just shoot and restart [16:59:23] we may have to manually kill the old minion everywhere or something? [16:59:24] PROBLEM - puppet last run on search1005 is CRITICAL: CRITICAL: Epic puppet fail [16:59:41] PROBLEM - puppet last run on mw1202 is CRITICAL: CRITICAL: Epic puppet fail [16:59:54] PROBLEM - puppet last run on ssl1005 is CRITICAL: CRITICAL: Epic puppet fail [17:00:07] PROBLEM - puppet last run on search1002 is CRITICAL: CRITICAL: Epic puppet fail [17:00:12] <_joe_> how do I start the minion? [17:00:22] /sbin/start salt-minion [17:00:26] PROBLEM - puppet last run on wtp1018 is CRITICAL: CRITICAL: Epic puppet fail [17:00:43] PROBLEM - puppet last run on rdb1001 is CRITICAL: CRITICAL: Epic puppet fail [17:00:44] PROBLEM - puppet last run on virt1007 is CRITICAL: CRITICAL: Epic puppet fail [17:00:48] (but in the host I'm looking at, there's one running from 2013, and /sbin/stop salt-minion doesn't detect it, but its existence prevents start from getting anywhere...) [17:00:49] <_joe_> bblack: I think there is a stale pidfile around [17:00:55] PROBLEM - puppet last run on search1017 is CRITICAL: CRITICAL: Epic puppet fail [17:01:04] PROBLEM - puppet last run on ssl1009 is CRITICAL: CRITICAL: Epic puppet fail [17:01:05] stale process in my case [17:01:16] PROBLEM - puppet last run on ssl3002 is CRITICAL: CRITICAL: Epic puppet fail [17:01:32] <_joe_> bblack: in mine too [17:01:32] did any of the salt refactor move the pidfile to a new path by chance?
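For the "Master hostname: salt not found" retry loop, the usual first checks are which config the minion is actually reading and whether the configured master name resolves; roughly:

    grep -r '^master:' /etc/salt/minion /etc/salt/minion.d/ 2>/dev/null
    getent hosts palladium.eqiad.wmnet salt   # which of the two names resolves?
    tail -n5 /var/log/salt/minion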
[17:01:33] PROBLEM - puppet last run on wtp1004 is CRITICAL: CRITICAL: Epic puppet fail [17:01:39] <_joe_> bblack: no [17:01:43] PROBLEM - puppet last run on wtp1023 is CRITICAL: CRITICAL: Epic puppet fail [17:01:43] PROBLEM - puppet last run on mw1210 is CRITICAL: CRITICAL: Epic puppet fail [17:01:43] PROBLEM - puppet last run on wtp1022 is CRITICAL: CRITICAL: Epic puppet fail [17:01:44] PROBLEM - puppet last run on search1023 is CRITICAL: CRITICAL: Epic puppet fail [17:01:44] <_joe_> it's the stale process [17:01:53] PROBLEM - puppet last run on mw1212 is CRITICAL: CRITICAL: Epic puppet fail [17:02:02] <_joe_> so, we just need to go around the cluster, kill all failing salt calls [17:02:04] PROBLEM - puppet last run on wtp1002 is CRITICAL: CRITICAL: Epic puppet fail [17:02:13] PROBLEM - puppet last run on search1024 is CRITICAL: CRITICAL: Epic puppet fail [17:02:16] can we use salt? :) [17:02:24] PROBLEM - puppet last run on ssl1008 is CRITICAL: CRITICAL: Epic puppet fail [17:02:43] PROBLEM - puppet last run on wtp1011 is CRITICAL: CRITICAL: Epic puppet fail [17:02:44] PROBLEM - puppet last run on search1015 is CRITICAL: CRITICAL: Epic puppet fail [17:02:44] PROBLEM - puppet last run on wtp1013 is CRITICAL: CRITICAL: Epic puppet fail [17:02:44] <_joe_> andrewbogott: nope I guess [17:02:46] !log Restarted logstash on logstash1001. I hoped this would fix the dashboards, but it looks like the backing elasticsearch cluster is too sad for them to work at the moment. [17:02:51] Logged the message, Master [17:02:55] PROBLEM - puppet last run on rdb1002 is CRITICAL: CRITICAL: Epic puppet fail [17:02:55] PROBLEM - puppet last run on ssl1004 is CRITICAL: CRITICAL: Epic puppet fail [17:03:04] PROBLEM - puppet last run on wtp1015 is CRITICAL: CRITICAL: Epic puppet fail [17:03:07] * andrewbogott gets to work on the virt cluster [17:03:09] well, we need to kill the old minion manually too [17:03:16] (and any outstanding salt-call maybe) [17:03:24] PROBLEM - puppet last run on ssl1007 is CRITICAL: CRITICAL: Epic puppet fail [17:03:34] PROBLEM - puppet last run on search1013 is CRITICAL: CRITICAL: Epic puppet fail [17:03:45] PROBLEM - puppet last run on mw1201 is CRITICAL: CRITICAL: Epic puppet fail [17:03:45] PROBLEM - puppet last run on pc1003 is CRITICAL: CRITICAL: Epic puppet fail [17:03:45] PROBLEM - puppet last run on wtp1003 is CRITICAL: CRITICAL: Epic puppet fail [17:03:53] PROBLEM - puppet last run on wtp1007 is CRITICAL: CRITICAL: Epic puppet fail [17:03:54] RECOVERY - puppet last run on ssl1004 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [17:03:56] salt still works btw [17:04:04] I'm gonna try to make it shoot its own client I guess? [17:04:10] (and then puppet will restart) [17:04:13] PROBLEM - puppet last run on rdb1003 is CRITICAL: CRITICAL: Epic puppet fail [17:04:13] PROBLEM - puppet last run on ocg1003 is CRITICAL: CRITICAL: Epic puppet fail [17:04:15] PROBLEM - puppet last run on mw1219 is CRITICAL: CRITICAL: Epic puppet fail [17:04:15] PROBLEM - puppet last run on search1006 is CRITICAL: CRITICAL: Epic puppet fail [17:04:43] PROBLEM - puppet last run on ssl1006 is CRITICAL: CRITICAL: Epic puppet fail [17:04:53] !log using salt to kill salt-minion everywhere... 
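"Using salt to kill salt-minion" presumably comes down to a cmd.run broadcast from the master; minions wedged behind a hung salt-call will never answer it, which would explain why this only catches some hosts:

    salt -t 10 '*' cmd.run 'pkill -f "salt-call --out=txt test.ping"'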
[17:04:56] PROBLEM - puppet last run on search1011 is CRITICAL: CRITICAL: Epic puppet fail [17:04:56] PROBLEM - puppet last run on ssl1001 is CRITICAL: CRITICAL: Epic puppet fail [17:04:57] PROBLEM - puppet last run on search1022 is CRITICAL: CRITICAL: Epic puppet fail [17:04:59] Logged the message, Master [17:05:14] PROBLEM - puppet last run on mw1209 is CRITICAL: CRITICAL: Epic puppet fail [17:05:14] PROBLEM - puppet last run on wtp1010 is CRITICAL: CRITICAL: Epic puppet fail [17:05:15] RECOVERY - puppet last run on virt1000 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [17:05:15] PROBLEM - puppet last run on mw1203 is CRITICAL: CRITICAL: Epic puppet fail [17:05:19] doesn't seem to have worked really [17:05:21] <_joe_> bblack: salt doesn't really work [17:05:28] <_joe_> I tried [17:05:29] PROBLEM - puppet last run on search1004 is CRITICAL: CRITICAL: Epic puppet fail [17:05:35] PROBLEM - puppet last run on ssl3001 is CRITICAL: CRITICAL: Epic puppet fail [17:05:40] <_joe_> bblack: we need a list of all the hosts wehre puppet is failing [17:05:53] RECOVERY - puppet last run on virt1001 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [17:05:54] PROBLEM - puppet last run on mw1204 is CRITICAL: CRITICAL: Epic puppet fail [17:05:54] PROBLEM - puppet last run on virt1008 is CRITICAL: CRITICAL: Epic puppet fail [17:05:54] PROBLEM - puppet last run on mw1220 is CRITICAL: CRITICAL: Epic puppet fail [17:05:58] I think what I did, did work for a bunch of hosts, just not all [17:05:59] sudo kill $(ps aux | grep '/usr/bin/python /usr/bin/salt-call --out=txt test.ping' | awk '{print $2}') [17:06:03] PROBLEM - puppet last run on rbf1001 is CRITICAL: CRITICAL: Epic puppet fail [17:06:03] PROBLEM - puppet last run on mw1215 is CRITICAL: CRITICAL: Epic puppet fail [17:06:06] just that works for me [17:06:16] PROBLEM - puppet last run on search1012 is CRITICAL: CRITICAL: Epic puppet fail [17:06:16] RECOVERY - puppet last run on virt1002 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [17:06:16] PROBLEM - puppet last run on mw1207 is CRITICAL: CRITICAL: Epic puppet fail [17:06:17] !log Restarted elasticsearch on logstash1001; 2014-09-15T06:12:09Z java.lang.OutOfMemoryError [17:06:19] (03PS2) 10Ottomata: Add ferm::service rule for zookeeper admin port [puppet] - 10https://gerrit.wikimedia.org/r/153801 [17:06:22] Logged the message, Master [17:06:24] <_joe_> andrewbogott: sudo killall salt-call [17:06:32] andrewbogott: it didn't have a stale salt-minion process that wouldn't stop, also? [17:06:43] PROBLEM - puppet last run on wtp1001 is CRITICAL: CRITICAL: Epic puppet fail [17:06:43] PROBLEM - puppet last run on wtp1008 is CRITICAL: CRITICAL: Epic puppet fail [17:06:47] <_joe_> bblack: it stops [17:06:54] ok [17:06:57] <_joe_> as soon as the salt-call dies [17:07:13] PROBLEM - puppet last run on virt1005 is CRITICAL: CRITICAL: Epic puppet fail [17:07:23] well /sbin/stop wouldn't detect it. Perhaps it was trying to stop, wiped its own pidfile, but was still hung waiting on the salt-call then [17:07:33] PROBLEM - puppet last run on search1003 is CRITICAL: CRITICAL: Epic puppet fail [17:07:43] PROBLEM - puppet last run on ssl1003 is CRITICAL: CRITICAL: Epic puppet fail [17:07:43] (and holding resources preventing another start, and then start hangs instead of failing gracefully..) 
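The ps-grep-awk one-liner above works, but it also matches its own grep inside the command substitution, so kill then complains about an already-gone pid; pkill -f is the same thing without the self-match:

    sudo pkill -f '/usr/bin/salt-call --out=txt test.ping'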
[17:07:47] what a comedy of errors [17:07:48] <_joe_> bblack: we could run an exec via puppet [17:07:54] PROBLEM - puppet last run on tmh1001 is CRITICAL: CRITICAL: Epic puppet fail [17:07:58] that's probably best, wanna do it? [17:08:03] RECOVERY - puppet last run on virt1004 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [17:08:05] (03CR) 10Ottomata: [C: 032 V: 032] "Fingers crossed!" [puppet] - 10https://gerrit.wikimedia.org/r/153801 (owner: 10Ottomata) [17:08:14] PROBLEM - puppet last run on search1019 is CRITICAL: CRITICAL: Epic puppet fail [17:08:15] except puppet won't run will it? [17:08:15] PROBLEM - puppet last run on virt1009 is CRITICAL: CRITICAL: Epic puppet fail [17:08:21] Due to a lock file due to the previous run hanging [17:08:21] puppet runs from cron [17:08:24] maybe it times out eventually [17:08:31] it does time out, eventually [17:08:37] ah, ok [17:08:43] RECOVERY - puppet last run on virt1003 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [17:08:54] PROBLEM - puppet last run on virt0 is CRITICAL: CRITICAL: Epic puppet fail [17:08:54] PROBLEM - puppet last run on search1008 is CRITICAL: CRITICAL: Epic puppet fail [17:09:18] <_joe_> !log killing salt-call on all mediawiki hosts [17:09:22] Logged the message, Master [17:09:47] <_joe_> yeah but subsequent puppet runs will not work [17:09:50] <_joe_> either [17:10:18] put it in an early stage so it hits it before it hangs on the salt-minion [17:11:06] RECOVERY - puppet last run on virt1007 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [17:11:25] RECOVERY - puppet last run on virt1009 is OK: OK: Puppet is currently enabled, last run 60 seconds ago with 0 failures [17:11:44] RECOVERY - puppet last run on ssl1008 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [17:11:49] reconfirmed even in the cases I'm looking at "killall salt-call" is enough to get things moving again [17:12:17] <_joe_> bblack: it is [17:12:37] <_joe_> we should add that to the stop stanza of our 'beautifully enhanced' salt upstart script [17:12:56] ok I'm gonna fix them all from here [17:13:04] RECOVERY - puppet last run on ssl1001 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [17:13:04] <_joe_> how? [17:13:22] <_joe_> bblack: the only way I thought of was to collect all hosts from puppet [17:13:23] RECOVERY - puppet last run on mw1207 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [17:13:23] (03CR) 10Manybubbles: [C: 031] Cast cluster health keys and values to strings before attempting to utf8 encode them [puppet] - 10https://gerrit.wikimedia.org/r/160090 (owner: 10Ottomata) [17:13:24] by listing the nodes from the puppet yaml dir on the master, and looping over ssh root@ from my local box [17:13:33] <_joe_> bblack: eh [17:13:33] RECOVERY - puppet last run on mw1219 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [17:13:38] <_joe_> that was my idea as well [17:13:40] (03PS2) 10Ottomata: Cast cluster health keys and values to strings before attempting to utf8 encode them [puppet] - 10https://gerrit.wikimedia.org/r/160090 [17:13:43] PROBLEM - ElasticSearch health check on logstash1003 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.136 [17:13:44] PROBLEM - ElasticSearch health check on logstash1002 is CRITICAL: CRITICAL - elasticsearch (production-logstash-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 2: number_of_data_nodes: 2: active_primary_shards: 35: active_shards: 39: relocating_shards: 0: initializing_shards: 7: unassigned_shards: 57 [17:13:47] (03CR) 10Ottomata: [C: 032 V: 032] Cast cluster health keys and values to strings before attempting to utf8 encode them [puppet] - 10https://gerrit.wikimedia.org/r/160090 (owner: 10Ottomata) [17:13:55] PROBLEM - ElasticSearch health check on logstash1001 is CRITICAL: CRITICAL - elasticsearch (production-logstash-eqiad) is running. status: red: timed_out: false: number_of_nodes: 2: number_of_data_nodes: 2: active_primary_shards: 35: active_shards: 39: relocating_shards: 0: initializing_shards: 7: unassigned_shards: 57 [17:13:55] RECOVERY - puppet last run on mw1201 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [17:13:57] <_joe_> bblack: but it's a waste of time [17:14:07] ? [17:14:07] <_joe_> lemme fix most hosts now via dsh [17:14:11] ok [17:14:22] we have an easy way to get the failing list after that? [17:14:28] RECOVERY - puppet last run on mw1209 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [17:14:34] !log Restarted elasticsearch on logstash1003; 2014-09-14T09:33:57Z java.lang.OutOfMemoryError [17:14:39] <_joe_> bblack: I guess so [17:14:40] Logged the message, Master [17:14:46] greg-g: fyi, with permission of the parsoid team, I've added OCG to the Parsoid deploy slot this afternoon. [17:14:59] <_joe_> bblack: or, I have a better idea [17:15:03] RECOVERY - puppet last run on mw1204 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [17:15:04] RECOVERY - puppet last run on mw1220 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [17:15:14] RECOVERY - puppet last run on mw1215 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [17:15:35] RECOVERY - puppet last run on mw1203 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [17:15:53] RECOVERY - puppet last run on wtp1001 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [17:15:53] RECOVERY - puppet last run on wtp1008 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [17:16:13] RECOVERY - puppet last run on virt1008 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [17:16:56] <_joe_> bblack: can we live with some failing hosts for some time? 
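The brute-force loop bblack describes (node list from the master's yaml dir, ssh as root) is short enough to sketch; the yaml path is the stock puppetmaster default, and the xargs variant is roughly the "10 in parallel" version he switches to a bit further down:

    # serial version, ~5s connect budget per host
    for f in /var/lib/puppet/yaml/node/*.yaml; do
      h=$(basename "$f" .yaml)
      ssh -o ConnectTimeout=5 "root@$h" 'killall salt-call' </dev/null
    done

    # 10-way parallel version of the same cleanup
    ls /var/lib/puppet/yaml/node/ | sed 's/\.yaml$//' |
      xargs -P10 -I{} ssh -o ConnectTimeout=5 root@{} 'killall salt-call'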
[17:17:01] yeah [17:17:04] RECOVERY - puppet last run on tmh1001 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [17:17:17] honestly the brute-force ssh loop wouldn't take long and likely wouldn't hurt anything else, though [17:17:26] <_joe_> ok, then go on :) [17:17:58] (03PS1) 10Ottomata: Include ferm class on zookeeper servers [puppet] - 10https://gerrit.wikimedia.org/r/160476 [17:18:00] <_joe_> I wanted to add something to salt-minion.override for the stop stanza of upstart [17:18:07] <_joe_> a killall salt-call [17:18:13] <_joe_> which we should add anyway [17:18:28] <_joe_> I should have fixed 99% of the hosts already [17:18:29] RECOVERY - puppet last run on virt1005 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [17:18:29] RECOVERY - puppet last run on wtp1014 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [17:18:34] RECOVERY - puppet last run on mw1214 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [17:18:37] (03CR) 10Ottomata: [C: 032 V: 032] Include ferm class on zookeeper servers [puppet] - 10https://gerrit.wikimedia.org/r/160476 (owner: 10Ottomata) [17:19:14] RECOVERY - puppet last run on virt0 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [17:19:24] RECOVERY - puppet last run on tmh1002 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [17:19:33] <_joe_> bblack: dsh has almost all the groups we need [17:20:35] RECOVERY - ElasticSearch health check for shards on elastic1002 is OK: OK - elasticsearch status production-search-eqiad: status: green, number_of_nodes: 18, unassigned_shards: 0, timed_out: False, active_primary_shards: 2012, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 6031, initializing_shards: 0, number_of_data_nodes: 18 [17:20:45] RECOVERY - puppet last run on mw1218 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [17:21:04] RECOVERY - puppet last run on wtp1021 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [17:21:17] RECOVERY - puppet last run on mw1216 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [17:21:17] RECOVERY - puppet last run on wtp1024 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [17:21:33] RECOVERY - puppet last run on wtp1019 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [17:21:34] RECOVERY - puppet last run on wtp1017 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [17:21:44] RECOVERY - puppet last run on search1020 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [17:22:14] it's going to take something like half an hour for the simple ssh loop, since no parallelism in it [17:22:20] not too bad for cleaning up whatever's left though [17:22:23] RECOVERY - puppet last run on wtp1009 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [17:22:33] PROBLEM - RAID on analytics1023 is CRITICAL: Timeout while attempting connection [17:22:33] PROBLEM - DPKG on analytics1023 is CRITICAL: Timeout while attempting connection [17:22:46] PROBLEM - check configured eth on analytics1023 is CRITICAL: Timeout while attempting connection [17:23:03] PROBLEM - check if dhclient is running on analytics1023 is CRITICAL: Timeout while attempting connection [17:23:03] PROBLEM - Disk space on 
analytics1023 is CRITICAL: Timeout while attempting connection [17:23:03] PROBLEM - puppet last run on analytics1023 is CRITICAL: Timeout while attempting connection [17:23:14] RECOVERY - puppet last run on wtp1020 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [17:23:34] RECOVERY - puppet last run on wtp1006 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [17:23:54] RECOVERY - puppet last run on ssl1002 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [17:23:54] RECOVERY - puppet last run on search1016 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [17:24:34] RECOVERY - puppet last run on mw1217 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [17:24:35] RECOVERY - puppet last run on wtp1016 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [17:24:45] RECOVERY - puppet last run on search1010 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [17:25:24] RECOVERY - puppet last run on mw1205 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [17:25:24] RECOVERY - puppet last run on search1001 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [17:25:59] RECOVERY - puppet last run on search1018 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [17:26:24] RECOVERY - puppet last run on mw1213 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [17:26:38] ok I made it faster, doing 10 in parallel now [17:26:53] RECOVERY - puppet last run on search1007 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [17:26:54] RECOVERY - puppet last run on virt1006 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [17:27:23] RECOVERY - puppet last run on wtp1005 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [17:27:25] RECOVERY - puppet last run on wtp1012 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [17:27:34] RECOVERY - puppet last run on mw1002 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [17:27:44] RECOVERY - puppet last run on mw1211 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [17:27:44] RECOVERY - puppet last run on labstore1001 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [17:27:53] RECOVERY - puppet last run on mw1208 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [17:28:14] RECOVERY - puppet last run on mw1206 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [17:28:33] RECOVERY - puppet last run on search1002 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [17:28:46] RECOVERY - puppet last run on ssl1005 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [17:28:54] RECOVERY - puppet last run on wtp1018 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [17:28:54] RECOVERY - puppet last run on search1005 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [17:29:13] RECOVERY - puppet last run on mw1202 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [17:29:24] RECOVERY - puppet last run on search1017 is OK: OK: Puppet is currently enabled, 
last run 14 seconds ago with 0 failures [17:29:34] RECOVERY - puppet last run on ssl1009 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [17:29:44] RECOVERY - puppet last run on ssl3002 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [17:29:54] RECOVERY - puppet last run on wtp1004 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [17:30:04] RECOVERY - puppet last run on rdb1001 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [17:30:23] RECOVERY - puppet last run on search1023 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [17:30:45] RECOVERY - puppet last run on wtp1002 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [17:30:55] ok I've hit all of them I could connect to in under 5s anyways [17:31:06] there will probably be odd ones out we can pick up from icinga after a while [17:31:14] RECOVERY - puppet last run on wtp1023 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [17:31:17] RECOVERY - puppet last run on mw1210 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [17:31:17] RECOVERY - puppet last run on wtp1022 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [17:31:23] RECOVERY - puppet last run on mw1212 is OK: OK: Puppet is currently enabled, last run 61 seconds ago with 0 failures [17:31:24] RECOVERY - puppet last run on wtp1013 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [17:31:24] RECOVERY - puppet last run on search1015 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [17:31:33] RECOVERY - puppet last run on rdb1002 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [17:31:34] RECOVERY - puppet last run on search1024 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [17:31:34] RECOVERY - puppet last run on wtp1015 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [17:32:07] RECOVERY - puppet last run on wtp1011 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [17:32:30] RECOVERY - puppet last run on wtp1003 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [17:32:33] RECOVERY - puppet last run on wtp1007 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [17:32:52] RECOVERY - puppet last run on rdb1003 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [17:32:53] RECOVERY - puppet last run on ocg1003 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [17:32:56] there's only ~30-40 left as it is, and some of those will clean up as they hit puppet-cron [17:33:04] RECOVERY - puppet last run on ssl1007 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [17:33:13] RECOVERY - puppet last run on search1013 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [17:33:23] RECOVERY - puppet last run on pc1003 is OK: OK: Puppet is currently enabled, last run 66 seconds ago with 0 failures [17:33:44] RECOVERY - puppet last run on search1022 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [17:34:03] RECOVERY - puppet last run on wtp1010 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [17:34:04]
RECOVERY - puppet last run on search1006 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [17:34:23] RECOVERY - puppet last run on ssl1006 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [17:34:24] RECOVERY - puppet last run on search1011 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [17:34:43] RECOVERY - puppet last run on rbf1001 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [17:34:53] RECOVERY - puppet last run on search1012 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [17:35:14] RECOVERY - puppet last run on search1004 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [17:35:15] PROBLEM - NTP on analytics1023 is CRITICAL: NTP CRITICAL: No response from NTP server [17:35:15] RECOVERY - puppet last run on ssl3001 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [17:37:28] RECOVERY - puppet last run on search1003 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [17:37:29] RECOVERY - puppet last run on ssl1003 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [17:38:17] RECOVERY - puppet last run on search1019 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [17:38:19] RECOVERY - puppet last run on ocg1002 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [17:38:46] RECOVERY - puppet last run on search1008 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [17:38:55] RECOVERY - puppet last run on search1021 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [17:38:55] RECOVERY - puppet last run on search1009 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [17:40:18] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [17:40:18] RECOVERY - puppet last run on search1014 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [17:40:53] RECOVERY - puppet last run on pc1001 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [17:41:24] RECOVERY - puppet last run on ssl3003 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [17:41:24] RECOVERY - puppet last run on ocg1001 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [17:41:58] RECOVERY - puppet last run on rdb1004 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [17:43:01] So many sweet, sweet recoveries [17:43:19] RECOVERY - puppet last run on rbf1002 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [17:44:31] (03PS1) 10Ottomata: Allow base::firewall to specify an 'accept' policy by default [puppet] - 10https://gerrit.wikimedia.org/r/160480 [17:44:34] RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [17:45:14] <_joe_> ottomata: mmmm don't do that [17:45:58] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "the drop all by default is a sane policy and we should progressively enforce it everywhere." 
[puppet] - 10https://gerrit.wikimedia.org/r/160480 (owner: 10Ottomata) [17:46:19] _joe_, not doing it, just sent email [17:46:21] that's just an idea [17:46:27] i'm doing the hacky fix now [17:46:59] <_joe_> :) [17:47:06] RECOVERY - puppet last run on pc1002 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [17:47:18] (03PS1) 10Ottomata: Render defs.conf for ferm on zookeeper servers [puppet] - 10https://gerrit.wikimedia.org/r/160482 [17:47:18] ottomata: count me as a -1 too :) [17:47:25] (03PS4) 10BBlack: add textsvc/uploadsvc in ulsfo for consistency [puppet] - 10https://gerrit.wikimedia.org/r/160014 [17:47:44] see email, all I want is to be able to use ferm without restricting everything [17:48:29] (03CR) 10BBlack: [C: 032] add textsvc/uploadsvc in ulsfo for consistency [puppet] - 10https://gerrit.wikimedia.org/r/160014 (owner: 10BBlack) [17:48:31] (03CR) 10Ottomata: "Aye, this is just an idea. All I want is to be able to use ferm without having to restrict everything right now. See email in ops list." [puppet] - 10https://gerrit.wikimedia.org/r/160480 (owner: 10Ottomata) [17:49:28] (03PS2) 10Ottomata: Render defs.conf for ferm on zookeeper servers [puppet] - 10https://gerrit.wikimedia.org/r/160482 [17:50:10] legoktm: your help is needed [17:50:17] mutante: around for a few icinga questions? I'm starting to write a shinken module, and wanted a few small answers :) [17:50:25] (not going to ask you to merge anything, I promise!) [17:50:41] YuviPanda: ask, someone here might know :D [17:50:48] so... [17:50:50] why do we use naggen? [17:50:59] is it because icinga expects to read one config file? [17:51:05] (03PS3) 10Ottomata: Render defs.conf for ferm on zookeeper servers [puppet] - 10https://gerrit.wikimedia.org/r/160482 [17:51:23] that one is for _joe_ in fact [17:51:25] (03CR) 10Ottomata: [C: 032 V: 032] Render defs.conf for ferm on zookeeper servers [puppet] - 10https://gerrit.wikimedia.org/r/160482 (owner: 10Ottomata) [17:51:41] who rewrote it iirc [17:52:15] oh [17:52:26] * YuviPanda wonders if there's a 'load' metric for _joe_, how high it would be [17:52:34] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [17:53:36] (03PS1) 10Ottomata: Need network::constants to render ferm defs.conf [puppet] - 10https://gerrit.wikimedia.org/r/160485 [17:53:40] (03CR) 10jenkins-bot: [V: 04-1] Need network::constants to render ferm defs.conf [puppet] - 10https://gerrit.wikimedia.org/r/160485 (owner: 10Ottomata) [17:53:43] (03PS2) 10Ottomata: Need network::constants to render ferm defs.conf [puppet] - 10https://gerrit.wikimedia.org/r/160485 [17:53:53] (03CR) 10Ottomata: [C: 032 V: 032] Need network::constants to render ferm defs.conf [puppet] - 10https://gerrit.wikimedia.org/r/160485 (owner: 10Ottomata) [17:54:00] YuviPanda: anyhow the short answer is: # Naggen takes exported resources from hosts and creates nagios [17:54:00] # configuration files [17:54:21] matanya: true, but why have naggen? why can't the resources themselves just be realized on the host, and the resources define individual files? [17:55:18] no clue :) [17:55:37] <_joe_> YuviPanda: eh? [17:55:40] <_joe_> wat?
[17:55:43] RECOVERY - DPKG on analytics1023 is OK: All packages OK [17:55:43] RECOVERY - RAID on analytics1023 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [17:55:44] RECOVERY - NTP on analytics1023 is OK: NTP OK: Offset 0.001294851303 secs [17:55:44] RECOVERY - check configured eth on analytics1023 is OK: NRPE: Unable to read output [17:55:59] <_joe_> resources realized on what host? [17:56:03] RECOVERY - Disk space on analytics1023 is OK: DISK OK [17:56:09] on neon (in prod's case) [17:56:11] RECOVERY - check if dhclient is running on analytics1023 is OK: PROCS OK: 0 processes with command name dhclient [17:56:11] RECOVERY - puppet last run on analytics1023 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [17:56:13] <_joe_> so, we export resources, and we collect them on neon [17:56:34] <_joe_> we use naggen for the collection because puppet itself only scales well to tens of servers [17:57:06] <_joe_> and using naggen2 vs naggen (which still used puppet internals) brought the single neon puppet run down from 25 mins to ~ 6 [17:57:12] aaaaahhhhhh [17:57:13] I see [17:57:24] ottomata: sure (re: merge) [17:57:55] godog, already done! :) [17:58:24] _joe_: I'm beginning to check out shinken (use for labs to begin with, possibly migrate prod later on), so wanted to understand how we do things in icinga a bit more [17:58:26] will dig [17:58:29] ottomata: haha perfect, thanks! [17:59:58] PROBLEM - puppet last run on analytics1024 is CRITICAL: Timeout while attempting connection [18:00:04] ejegg, awight: Respected human, time to deploy CentralNotice (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140915T1800). Please do the needful. [18:01:03] RECOVERY - puppet last run on analytics1024 is OK: OK: Puppet is currently enabled, last run 63 seconds ago with 0 failures [18:02:59] YuviPanda: told you ? :D [18:03:06] :) [18:13:14] (03PS2) 10Krinkle: contint: Ensure nodejs-legacy is installed [puppet] - 10https://gerrit.wikimedia.org/r/159226 [18:14:51] (03PS5) 10Krinkle: hhvm: create module + list all dev dependencies [puppet] - 10https://gerrit.wikimedia.org/r/150813 (https://bugzilla.wikimedia.org/63120) (owner: 10Hashar) [18:15:25] (03CR) 10Krinkle: "This is already deployed on the puppetmaster for integration slaves in labs." [puppet] - 10https://gerrit.wikimedia.org/r/150813 (https://bugzilla.wikimedia.org/63120) (owner: 10Hashar) [18:15:57] (03CR) 10Krinkle: "Hashar: This was already deployed on the puppetmaster for integration slaves in labs. Should it be reverted there?"
[puppet] - 10https://gerrit.wikimedia.org/r/154401 (owner: 10Hashar) [18:24:04] PROBLEM - puppet last run on lvs4002 is CRITICAL: CRITICAL: Epic puppet fail [18:24:08] showJobs.php is weird :S [18:28:58] (03CR) 10Hashar: "Timo wrote:" [puppet] - 10https://gerrit.wikimedia.org/r/150813 (https://bugzilla.wikimedia.org/63120) (owner: 10Hashar) [18:29:44] (03PS1) 10Awight: Revert "Enable FundraisingTranslateWorkflow on metawiki (take 2)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160493 [18:30:03] (03Abandoned) 10Awight: Revert "Enable FundraisingTranslateWorkflow on metawiki (take 2)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160493 (owner: 10Awight) [18:32:55] (03PS1) 10Awight: Revert "Enable FundraisingTranslateWorkflow on the beta cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160494 [18:33:48] (03PS2) 10Awight: Revert "Enable FundraisingTranslateWorkflow on the beta cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160494 [18:36:52] (03CR) 10Chmarkine: redirect http->https on racktables (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/160164 (owner: 10Dzahn) [18:40:29] !log Setting Cirrus to jawiki's primary search backend went well but Japan is mostly asleep. If Elasticsearch load takes a turn for the worse in four or five hours then we'll know how it went. [18:40:35] Logged the message, Master [18:40:43] !log Local part of the global rename of Gnumarcoo => .avgas fatally timed out on itwiki. This needs to be fixed by hand. [18:40:49] Logged the message, Master [18:40:51] legoktm: ^ [18:41:01] Be careful with that one... Nemo_bis messed with the old user name [18:43:09] !log performance tests show cirrus should handle jawiki with no problem but if load spirals out of control and I'm not around then revert https://gerrit.wikimedia.org/r/#/c/160465/ [18:43:14] Logged the message, Master [18:43:34] RECOVERY - puppet last run on lvs4002 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [18:44:01] !log ejegg Synchronized php-1.24wmf21/extensions/CentralNotice/: Update CentralNotice to remove jquery.json dependency (duration: 00m 09s) [18:44:08] Logged the message, Master [18:44:46] Dear ops, we had sync-dir errors from tmh1001 and tmh1002 [18:44:48] what are those boxen? [18:45:38] video scalers [18:45:53] hoo: yes, thanks. I'm no longer worried then :) [18:46:23] !log Sync to tmh100[12] failed, according to awight [18:46:30] Logged the message, Master [18:46:30] thx! [18:48:17] (03PS1) 10Matanya: admin: add subbu and gwicke to ocg-render-admins [puppet] - 10https://gerrit.wikimedia.org/r/160497 [18:48:55] (03CR) 10jenkins-bot: [V: 04-1] admin: add subbu and gwicke to ocg-render-admins [puppet] - 10https://gerrit.wikimedia.org/r/160497 (owner: 10Matanya) [18:49:48] !log ejegg Synchronized php-1.24wmf20/extensions/CentralNotice/: Update CentralNotice to remove jquery.json dependency (duration: 00m 23s) [18:49:55] Logged the message, Master [18:51:35] (03PS2) 10Matanya: admin: add subbu and gwicke to ocg-render-admins [puppet] - 10https://gerrit.wikimedia.org/r/160497 [18:51:38] typos ... :/ [19:00:32] (03CR) 10Greg Grossmeier: "This caused a breakage to scap in Beta Cluster, see: https://bugzilla.wikimedia.org/show_bug.cgi?id=70858" [puppet] - 10https://gerrit.wikimedia.org/r/160485 (owner: 10Ottomata) [19:01:24] legoktm: ??? [19:02:01] Did you just resolve that by hand?
[19:35:01] (03PS1) 10Gage: Hadoop logging regression fixes [puppet/cdh] - 10https://gerrit.wikimedia.org/r/160505 [19:37:46] (03CR) 10Ottomata: [C: 032 V: 032] Hadoop logging regression fixes [puppet/cdh] - 10https://gerrit.wikimedia.org/r/160505 (owner: 10Gage) [19:43:53] (03PS1) 10Gage: merge hadoop logging regression fix in modules/cdh [puppet] - 10https://gerrit.wikimedia.org/r/160507 [19:48:18] ottomata: do you know who I should ask about labs instance 'wikimetrics1'? You perhaps? [19:48:41] you can ask me, milimetric and nuria also know things [19:49:02] ok -- my question is, how hard will it be to reproduce that instance if I destroy it? [19:49:31] I need to migrate hosts off of virt1006 (where that one is). I'm /pretty sure/ I can move it safely, but I'd like a few less-valuable test subjects [19:49:31] andrewbogott: not bad except for whatever Yuvi did this morning [19:49:40] YuviPanda: ^ ? [19:49:41] I'm not 100% sure if it's perfectly puppetized [19:49:45] not hard, except there are locally applied puppet changes there [19:49:49] milimetric: it's fine now [19:49:53] ok, cool [19:49:58] but, it is a local puppet master [19:50:05] which has locally maintained changes [19:50:18] there are three changes setting passwords and patching things like phabricator [19:50:21] wikimetrics1 is probably the most valuable analytics labs instance you could think of [19:52:07] andrewbogott: milimetric whatever I did didn't change puppetization status. I committed patches to mirror the manual steps I did [19:52:53] ok, I will save this one for later when I'm more confident... [19:53:35] andrewbogott: if you want less valuable test subjects, wikimetrics-dev1 is all yours [19:53:52] milimetric: it needs to be on virt1006. Whole list is here: https://etherpad.wikimedia.org/p/virt1006migrate [19:55:04] ah, sorry don't see any of ours there [19:55:13] (except wikimetrics1) [20:00:26] (03CR) 10Gage: [C: 032] merge hadoop logging regression fix in modules/cdh [puppet] - 10https://gerrit.wikimedia.org/r/160507 (owner: 10Gage) [20:02:56] ok fix is on stat1002, both commands work. that slf4j warning is annoying, but seems harmless. [20:10:55] (03PS1) 10Hoo man: HHVM: Increase the maximum number of open files to 16384 [puppet] - 10https://gerrit.wikimedia.org/r/160510 [20:11:02] _joe_: ori_ ^ [20:11:12] mw1017 is maxing out its limit [20:11:26] <_joe_> hoo: lemme check why please [20:11:31] I think, at least [20:12:14] <_joe_> hoo: It's most certainly not [20:12:23] <_joe_> hoo: some logs in fatal.log? [20:12:38] hoo@mw1017:~$ sudo -u apache lsof | grep hhvm | wc -l [20:12:39] 2580 [20:12:44] so true [20:12:52] (03PS1) 10Dzahn: people.wm - use (smaller) image thumbnail [puppet] - 10https://gerrit.wikimedia.org/r/160511 [20:12:52] mh... hhvm.log is full of failed mysql stuff [20:13:03] <_joe_> yes, seeing that [20:13:22] <_joe_> springle: "Sep 15 20:12:27 mw1017 hhvm: message repeated 26 times: [ #012Warning: Unable to record MySQL stats with: SELECT MASTER_POS_WAIT('db1038-bin.000775', 1037193327, 10)] [20:13:26] <_joe_> Sep 15 20:12:32 mw1017 hhvm: #012Warning: Unable to record MySQL stats with: SELECT MASTER_POS_WAIT('db1038-bin.000775', 823974672, 10) [20:13:35] (03PS2) 10Dzahn: people.wm - use (smaller) image thumbnail [puppet] - 10https://gerrit.wikimedia.org/r/160511 [20:14:09] <_joe_> hoo: it has 178 open files at the moment [20:14:41] <_joe_> looks like an app problem [20:15:08] wait, mobile app?
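The lsof pipeline above counts more than the process's fd table: mem-mapped libraries, cwd/txt entries and the like all match the grep, and none of them count against RLIMIT_NOFILE. Counting /proc directly, against the limit the running process actually got, is less ambiguous; a sketch assuming a single hhvm server process:

    HHVM_PID=$(pgrep -ox hhvm)
    sudo ls /proc/$HHVM_PID/fd | wc -l              # real fd count
    grep 'Max open files' /proc/$HHVM_PID/limits    # limit applied to this process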
[20:15:21] * YuviPanda guesses not [20:16:21] (03CR) 10Aaron Schulz: [C: 031] HHVM: Increase the maximum number of open files to 16384 [puppet] - 10https://gerrit.wikimedia.org/r/160510 (owner: 10Hoo man) [20:16:25] (03PS3) 10Dzahn: people.wm - use (smaller) image thumbnail [puppet] - 10https://gerrit.wikimedia.org/r/160511 [20:16:58] _joe_: Ok... we're seeing inconsistencies again and that seemed like an obvious reason :S [20:17:03] Sorry to bother [20:17:04] (03PS4) 10BBlack: Remove dead addrs from protoproxies [puppet] - 10https://gerrit.wikimedia.org/r/160015 [20:17:13] <_joe_> hoo: no bother at all [20:17:22] <_joe_> actually, I see a lot of anon_inodes in lsof [20:18:21] (03CR) 10BBlack: [C: 032] Remove dead addrs from protoproxies [puppet] - 10https://gerrit.wikimedia.org/r/160015 (owner: 10BBlack) [20:18:40] (03PS4) 10BBlack: add login-lb to eqiad protoproxy [puppet] - 10https://gerrit.wikimedia.org/r/160016 [20:18:56] <_joe_> hoo: so, maybe you were right [20:19:04] (03CR) 10Dzahn: [C: 032] people.wm - use (smaller) image thumbnail [puppet] - 10https://gerrit.wikimedia.org/r/160511 (owner: 10Dzahn) [20:19:32] (03PS5) 10BBlack: add login-lb to eqiad protoproxy [puppet] - 10https://gerrit.wikimedia.org/r/160016 [20:19:45] _joe_: User apache was hitting the 4096... but hhvm alone not [20:19:52] (at least I didn't see it) [20:20:02] (03CR) 10BBlack: [C: 032 V: 032] add login-lb to eqiad protoproxy [puppet] - 10https://gerrit.wikimedia.org/r/160016 (owner: 10BBlack) [20:20:05] but whenever the lsof went close to 4k the mysql stuff came up [20:20:08] so I thought that's it [20:20:15] (03PS4) 10BBlack: Remove dead protoproxy entries completely [puppet] - 10https://gerrit.wikimedia.org/r/160017 [20:20:22] (03CR) 10BBlack: [C: 032 V: 032] Remove dead protoproxy entries completely [puppet] - 10https://gerrit.wikimedia.org/r/160017 (owner: 10BBlack) [20:20:31] <_joe_> hoo: anon_inodes are usually the consequence of too many open files [20:20:38] <_joe_> at least in my experience [20:21:01] mh... might be. So bump the limit and restart hhvm? [20:21:20] <_joe_> I restarted it, and the error went away [20:21:43] The open file count is way lower now also [20:22:05] <_joe_> yep [20:22:24] _joe_: those hhvm mysql errors, based on my understanding of php_mysql_do_query_general in ext_mysql.cpp... i think it's some client side issue with a regex [20:22:38] but i've only skimmed the source so far [20:22:58] <_joe_> springle: strange they stopped as soon as I restarted it [20:23:23] <_joe_> hoo: also, anon_inodes are there now as well, it's probably some trick with non-blocking io [20:27:20] mh... whatever it is, we need to keep an eye on it [20:27:29] and maybe raising the limit is still worth it [20:29:12] _joe_: https://github.com/facebook/hhvm/blob/37d09a68085bf23dc18253612ea29148fc66d22d/hphp/runtime/ext/mysql/mysql_common.cpp#L1349 ... size != 2 i guess. but why they stopped after restart, no clue [20:29:54] <_joe_> springle: eh! I'll take a look [20:30:44] it's just being too smart [20:34:13] <_joe_> springle: so that was just poisoning the error log because the query was 'unusual' [20:34:39] <_joe_> if we do happen to do a lot of such queries, we may want to patch that [20:34:41] _joe_: I think so. it may have stopped because the slave caught up [20:35:02] not because you restarted the client [20:35:11] <_joe_> springle: or because I restarted the server and it still didn't catch such a query? [20:35:23] <_joe_> springle: mmmh ok that may be [20:35:35] <_joe_> is that db serving testwiki?
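For readers following the ext_mysql tangent above: HHVM tries to classify each query (verb plus target table) for its per-table stats and warns when the parse does not yield the expected two pieces, which is what the "size != 2" remark refers to. A toy Python illustration of the failure mode; the regex below is a deliberately simplified stand-in, not HHVM's actual pattern:

    import re

    # Simplified stand-in for a stats classifier: expect a verb and a table.
    # The real logic lives in php_mysql_do_query_general in ext_mysql.
    PATTERN = re.compile(
        r'^\s*(select|insert|update|delete)\b.*?\bfrom\s+(\S+)',
        re.IGNORECASE | re.DOTALL)

    def classify(sql):
        m = PATTERN.match(sql)
        if not m:
            # The analogue of HHVM's size != 2 branch, where it logs
            # "Unable to record MySQL stats with: <query>".
            return None
        return m.group(1).lower(), m.group(2)

    print(classify("SELECT page_id FROM page WHERE page_title = 'X'"))
    # -> ('select', 'page')
    print(classify("SELECT MASTER_POS_WAIT('db1038-bin.000775', 823974672, 10)"))
    # -> None: a bare function call has no FROM clause, hence no table
    #    to attribute stats to, hence the warning spam on a lagged slave.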
[20:35:41] MW issues the MASTER_POS_WAIT when a slave lags [20:35:43] hmm [20:35:47] yes [20:36:03] <_joe_> ok that explains why it was the only hhvm appserver doing that [20:36:22] <_joe_> springle: do you think we should open an issue upstream? [20:36:42] <_joe_> I think it's worth trying to have FB people fix that [20:36:44] _joe_: https://blog.freenode.net/2014/09/server-issues-2/ [20:37:04] any action needed on the server hosted by WMF? [20:37:08] _joe_: seems like a proper bug to me [20:38:36] matanya: The server is operated by freenode, only housed by WMF [20:41:49] matanya: it's down [20:41:59] 'halted', as it were [20:42:24] thanks mutante and greg-g and hoo and all :) [20:42:47] greg-g: do i poke you for globalrename issues ? [20:43:01] matanya: legoktm or me [20:43:21] or csteipp (but he's busy often) [20:43:41] since SUL is down, i can't check local unattached accounts, can something be done about that ? [20:43:51] what did I do? [20:44:00] csteipp: Nothing, yet :D [20:44:04] matanya: SUL is down? [20:44:20] What exactly is down? [20:44:24] tools.wmflabs.org/sulinfo/sulinfo.php [20:44:35] oh [20:44:40] that's a user tool :/ [20:44:51] i get php errors: Warning: mysqli::query(): Empty query in /data/project/quentinv57-tools/public_html/tools/sulinfo.php on line 207 [20:45:18] i know, hence asking if there is a more reliable tool to check local unattached accounts [20:46:09] !log deployed Parsoid version b845bff9 [20:46:14] Logged the message, Master [20:46:16] matanya: not at all ;) [20:46:30] i'll find a reason, don't worry [20:46:34] matanya: Of non-global users, nope there's not. [20:46:51] Yep... if you have a tool labs account you can poke the slaves there [20:46:52] so i rename, and get errors, great :P [20:47:05] but that's not nice [20:49:04] csteipp: Any reason to not support this in SpecialCentralAuth ? [20:49:20] Not sure how hard that is, but I might give it a shot [20:50:45] hoo: Not really, other than sulinfo existed, so we didn't need to. I think it would be a nice addition. [20:51:04] +1 [20:51:13] Shouldn't be too hard except that its code is a big pile of mud :P [20:51:27] Well, there's that, yeah ;) [20:55:52] andrewbogott: any idea why virt0 is filling dberror.log trying to connect to virt1000 db? [20:56:02] labswiki [20:56:32] springle: virt0 has a bunch of the same classes as virt1000, it probably isn't happy with the change to the deployment train. [20:56:46] I'd say ignore it if you can stand to; I'm going to kill off virt0 this week anyway [20:56:51] sure np [20:56:54] thanks [20:57:42] (03PS1) 10Andrew Bogott: Don't copy network filters during cold migrate. [puppet] - 10https://gerrit.wikimedia.org/r/160518 [20:58:33] (03CR) 10Andrew Bogott: [C: 032] Don't copy network filters during cold migrate. [puppet] - 10https://gerrit.wikimedia.org/r/160518 (owner: 10Andrew Bogott) [21:00:56] PROBLEM - puppet last run on cp3018 is CRITICAL: CRITICAL: Puppet has 1 failures [21:02:55] 'git deploy sync' seems to be very slow on deployment-bastion.eqiad.wmflabs [21:03:02] slow ~= hang. can't tell yet. [21:03:24] from where cscott? [21:03:35] /srv/deployment/ocg/ocg [21:04:09] also: boston ;) [21:05:00] ah, no root on that host :/ [21:05:16] i need to request it. bd808? [21:05:49] is there any -v type option that will give me more information? the ctrl-c backtrace seems to indicate that it is stuck talking to trebuchet [21:05:50] matanya: where? what? why? (hello!)
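Context for the MASTER_POS_WAIT exchange above: it is MySQL's primitive for blocking until a replica has applied the master's binlog up to a given position, which is how MediaWiki gates reads behind lagging slaves. A minimal sketch of the pattern, assuming a PyMySQL-style connection; `wait_for_replica` is an illustrative name, not MediaWiki's actual LoadBalancer code:

    import pymysql

    def wait_for_replica(conn, binlog_file, binlog_pos, timeout=10):
        """Block until the replica behind `conn` has replayed the master's
        binlog up to (binlog_file, binlog_pos).

        MASTER_POS_WAIT returns the number of events waited for, 0 if
        already caught up, -1 on timeout, and NULL if replication is not
        running, so the result must be checked, not just fired off."""
        with conn.cursor() as cur:
            cur.execute("SELECT MASTER_POS_WAIT(%s, %s, %s)",
                        (binlog_file, binlog_pos, timeout))
            (result,) = cur.fetchone()
        return result is not None and result >= 0

    # Usage: after a write, read SHOW MASTER STATUS on the master, then
    # gate replica reads on wait_for_replica(replica_conn, file, pos).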
[21:06:44] hi bd808, i would like root on deployment prep [21:06:53] http://pastebin.com/RgSZXXZ2 <- backtrace [21:06:53] bd808: ^ [21:07:30] matanya: i think that's "i'd like to be a member of the deployment-prep group", right? [21:07:42] cscott: i'm already [21:07:52] oh. that's enough to give me root. [21:08:43] matanya: Can you open a bug for that please and I'll try to get it fixed in the "near" future (next hour or so) [21:08:58] sure, thanks! :) [21:09:03] matanya: "that" being sudo on deployment project [21:09:15] that is what i need [21:09:29] also, to help hashar debug some ferm issues [21:09:40] mh, this was too easy [21:10:22] matanya: put it in the wikimedia labs -> deployment prep project so someone gets to it (the request) [21:10:41] already done, but thanks! :) [21:13:53] ok, now who can help me figure out why 'git deploy' is broken? [21:15:06] sorry cscott :) though i did push a change for your RT request ... [21:15:57] RECOVERY - puppet last run on cp3018 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [21:16:40] (03PS1) 10RobH: db2006 missing from lease file, adding [puppet] - 10https://gerrit.wikimedia.org/r/160521 [21:17:29] (03CR) 10RobH: [C: 032] db2006 missing from lease file, adding [puppet] - 10https://gerrit.wikimedia.org/r/160521 (owner: 10RobH) [21:18:57] PROBLEM - Host db2010 is DOWN: PING CRITICAL - Packet loss = 100% [21:20:22] RECOVERY - Host db2010 is UP: PING OK - Packet loss = 0%, RTA = 43.06 ms [21:20:23] bd808: are you the trebuchet guru? [21:20:40] cscott: ummmm... maybe? [21:20:52] cscott: beta or prod? [21:20:58] git deploy sync is hanging (no console output at all) on beta. [21:21:16] /srv/deployment/ocg/ocg [21:21:18] lame. I wonder if salt is sad in beta [21:21:32] how can i tell? [21:21:53] (for reference, the ocg deploy procedure is nominally: https://wikitech.wikimedia.org/wiki/OCG#Deploying_the_latest_version_of_OCG ) [21:21:56] "[WARNING ] Master hostname: salt not found. Retrying in 30 second" [21:22:25] cscott: I ran `sudo salt-call saltutil.sync_all` on deployment-bastion and got that error [21:22:33] Looks like salt is borked in beta [21:22:43] "This master address: 'salt' was previously resolvable but now fails to resolve! The previously resolved ip addr will continue to be used" [21:23:04] might be a nice entry for https://wikitech.wikimedia.org/wiki/Trebuchet once we figure out the problem [21:24:18] cscott: I think that command is there somewhere? Anyway anybody have time to try and figure out why salt is dead in beta? [21:24:32] bd808: it is actually: Error: /Stage[main]/Role::Salt::Minions/Salt::Grain[instanceproject]/Exec[ensure_instanceproject_deployment-prep]/unless: Check "/usr/local/sbin/grain-ensure contains instanceproject deployment-prep" exceeded timeout [21:25:25] I get "Master hostname: salt not found. Retrying in 30 seconds" error on the salt master too :( [21:25:53] bd808: the root cause is: Warning: Unable to fetch my node definition, but the agent run will continue: [21:25:53] Warning: Connection refused - connect(2) [21:26:08] the host can't connect to puppet master [21:26:14] matanya: agreed [21:26:17] due to ferm rule changes i suspect [21:26:27] ah.
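Given the triage above (the salt master hostname failing to resolve, empty iptables on deployment-salt, and the puppet agent getting connection refused), a small probe script can separate DNS trouble from firewall or dead-daemon trouble. A sketch assuming the conventional ports, 4505/4506 for the salt master's ZeroMQ buses and 8140 for a puppet master; the hostnames are placeholders:

    import socket

    CHECKS = [
        ('salt', 4505),  # salt master publish bus (ZeroMQ)
        ('salt', 4506),  # salt master return/request bus
        ('deployment-salt.eqiad.wmflabs', 8140),  # puppet master (placeholder)
    ]

    def probe(host, port, timeout=5):
        try:
            addr = socket.gethostbyname(host)
        except socket.gaierror as exc:
            return 'DNS FAIL: %s (%s)' % (host, exc)
        try:
            with socket.create_connection((addr, port), timeout=timeout):
                return 'OK: %s (%s) port %d' % (host, addr, port)
        except OSError as exc:
            # Resolvable but unreachable points at ferm/iptables or a dead
            # daemon rather than DNS, which was the distinction here.
            return 'CONNECT FAIL: %s (%s) port %d: %s' % (host, addr, port, exc)

    for host, port in CHECKS:
        print(probe(host, port))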
[21:26:31] which hashar poked me about earlier [21:26:33] * bd808 shakes fist at ferm again [21:26:47] (03PS1) 10Ottomata: Bring more udp2log filters over to kafkatee on analytics1003 [puppet] - 10https://gerrit.wikimedia.org/r/160523 [21:26:59] and i promised to look at but had $day_job prod issue, and didn't have time and root [21:27:15] i guess i can fix it [21:27:20] matanya: `iptables -L` is empty on deployment-salt and it can't talk to itself either [21:27:38] i rest my case :) [21:28:05] !log updated OCG to version 188a3c221d927bd0601ef5e1b0c0f4a9d1cdbd31 [21:28:10] Logged the message, Master [21:30:24] pushed to prod, skipped beta for now :( [21:33:09] greg-g or hashar or ^d, is it OK if I shut down deployment-sentry2 and/or deployment-videoscaler01 and reboot them in 10 mins? I need to move them onto a different virt host. [21:37:57] andrewbogott: should be ok [21:38:11] greg-g: great. Is now an OK time? [21:38:20] andrewbogott: actually, ask in -qa [21:43:57] bd808, matanya: should i file an RT or a bugzilla for the deployment-salt issue? [21:44:17] cscott: bug [21:44:39] i'll try to sort it out, but it depends on my free time and the fact it is 1am [21:44:47] what component in bugzilla? [21:44:57] labs [21:45:09] wikimedia labs -> deployment prep project [21:46:29] https://bugzilla.wikimedia.org/show_bug.cgi?id=70868 [21:47:37] cscott: come on over to #wikimedia-qa :) [21:47:50] (03PS1) 10Manybubbles: Fix Elasticsearch in ci [puppet] - 10https://gerrit.wikimedia.org/r/160524 [21:47:52] it's where the Beta Cluster peeps/cool people hang out [21:48:11] (03CR) 10Manybubbles: "Not sure if this is right but its something." [puppet] - 10https://gerrit.wikimedia.org/r/160524 (owner: 10Manybubbles) [21:48:14] oh? btw there's also -operations and -labs ;) [21:48:26] and i thought all the *cool* people were in -parsoid ;) [21:48:30] and pdfhack and parsoid and visualeditor? [21:48:35] (03CR) 10Manybubbles: "Meant to fix https://gist.githubusercontent.com/Krinkle/fd6cf70688d110809440/raw/ ." [puppet] - 10https://gerrit.wikimedia.org/r/160524 (owner: 10Manybubbles) [21:48:38] sssh, pdfhack is our secret hideout [21:49:30] i think it was advertised actually :P [21:53:17] is there a secret channel I'm missing out on called #wikimedia-puppet-breakers ? :) [21:53:34] oh wait that's this channel! :) [21:57:55] matanya: csteipp-ish: Here you go: https://gerrit.wikimedia.org/r/160526 [21:58:03] that should bring Special:CentralAuth up to sulinfo [21:58:42] thank you hoo [22:06:26] (03CR) 10Tim Starling: "I think 64K would be better.
Still way less than any fundamental performance limit, and it gives us a bit more time to respond in the case" [puppet] - 10https://gerrit.wikimedia.org/r/160510 (owner: 10Hoo man) [22:08:39] (03PS2) 10Hoo man: HHVM: Increase the maximum number of open files to 65536 [puppet] - 10https://gerrit.wikimedia.org/r/160510 [22:09:01] (03CR) 10Hoo man: "Raised limit to 65536 (per Tim)" [puppet] - 10https://gerrit.wikimedia.org/r/160510 (owner: 10Hoo man) [22:09:40] (03PS1) 10Chmarkine: racktables - remove RewriteCond on /status [puppet] - 10https://gerrit.wikimedia.org/r/160528 [22:12:43] (03PS2) 10Chmarkine: racktables - remove RewriteCond on /status [puppet] - 10https://gerrit.wikimedia.org/r/160528 [22:13:02] bblack: no, we're in #wikimedia-qa :P [22:13:22] :) [22:17:34] (03CR) 10Krinkle: "The error this should address is this:" [puppet] - 10https://gerrit.wikimedia.org/r/160524 (owner: 10Manybubbles) [22:19:41] bblack: that is how it feels (-operations being the puppet breakers, at least for us over in -qa) ;) [22:19:53] (03CR) 10Krinkle: [C: 031] "But whatever it is and whether it is documented, I've cherry-picked this to integration-puppetmaster.eqiad.wmflabs and confirmed it fixes " [puppet] - 10https://gerrit.wikimedia.org/r/160524 (owner: 10Manybubbles) [22:28:43] (03PS1) 10Springle: assign codfw DBs to s[1-7] [puppet] - 10https://gerrit.wikimedia.org/r/160539 [22:33:27] (03CR) 10Springle: [C: 032] assign codfw DBs to s[1-7] [puppet] - 10https://gerrit.wikimedia.org/r/160539 (owner: 10Springle) [22:38:22] hoo: there was a borked global rename earlier today? [22:38:37] greg-g: do you want to revert that patch by otto ? [22:38:40] i can do that [22:38:45] yes legoktm [22:39:00] matanya: ? [22:39:13] matanya: which one? the salt one that broke prod? [22:39:17] https://gerrit.wikimedia.org/r/#/c/160485/ [22:39:24] I'm kind of out of the loop on root cause for the beta issues [22:39:38] the timing fits [22:39:48] * greg-g nods [22:40:58] matanya: who should we have review it from ops side? (the revert) [22:41:14] otto if possible [22:41:32] ott :( [22:41:38] if not, i guess any opsen will work [22:41:41] matanya: do you know why it happened and what fixed it? [22:41:45] it is a simple fix [22:41:59] legoktm: no clue, try in -stewards or ask hoo [22:42:15] ok, I'll wait for him to respond :P [22:42:34] !log dropping static routes for 2620:0:861:ed1a::[d,f,10,11] -> lvs1005 from cr[12]-eqiad (only 11 is of any consequence, misc-web-lb, and they're advertised by bgp and this is preventing failover to lvs1002) [22:42:41] Logged the message, Master [22:42:52] hmm, godog is asleep yes? [22:43:29] probably should be [22:43:45] bblack: want to review a revert that's breaking beta? [22:43:50] sure [22:43:54] well, revert the patch that is breaking beta, that is [22:44:16] which? [22:44:25] https://gerrit.wikimedia.org/r/#/c/160485/ [22:44:35] I don't know the repercussions of the revert [22:44:43] I'm taking matanya's word that it'll help us ;) [22:44:51] jeremyb: do you concur? :) [22:45:57] (03PS1) 10Matanya: Revert "Need network::constants to render ferm defs.conf" [puppet] - 10https://gerrit.wikimedia.org/r/160544 [22:46:21] (03PS1) 10Springle: repool db1036, depool db1002 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160545 [22:46:45] bblack: here [22:47:11] please review, i'm not 100% sure what otto did, or tried to do [22:47:18] seems odd that including network::constants is even a functional change, other than maybe to unbreak a broken variable ref in a template?
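On the limit bump above, a rough back-of-envelope supporting Tim's point: the kernel's own fs.file-max sizing heuristic assumes very roughly 1 KiB of kernel memory per open file, so even the raised per-process cap is trivial next to an app server's RAM. Order-of-magnitude only:

    # Back-of-envelope for the nofile values discussed here.
    PER_FILE_BYTES = 1024  # kernel heuristic, not an exact per-fd cost
    for nofile in (4096, 16384, 65536):
        print('%6d fds ~ %5.1f MiB' % (nofile, nofile * PER_FILE_BYTES / 2.0**20))
    #   4096 fds ~   4.0 MiB
    #  16384 fds ~  16.0 MiB
    #  65536 fds ~  64.0 MiB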
[22:47:29] legoktm: It fixed itself [22:47:34] automatic re-run, ftw [22:47:36] gn [22:47:43] yay, those are the best :D [22:47:45] See SAL [22:47:48] ok [22:48:00] bblack: i suspect that's what happened. [22:48:25] I'm checking it out on puppet-compiler to see the effect [22:48:35] !log Pooling the newly setup Trusty-based Jenkins slaves (integration-slave1006, integration-slave1007 and integration-slave1008) [22:48:41] Logged the message, Master [22:49:13] if "# We don't include base::firewall yet" is true [22:49:17] then how does it do anything? [22:49:19] matanya: [22:49:27] see https://gerrit.wikimedia.org/r/#/c/153801/2/modules/base/templates/firewall/defs.erb [22:49:34] !log Running sample job on integration-slave1007 and warming up npmjs.org cache [22:49:41] Logged the message, Master [22:49:53] ugh how does deployment-bastion get its node definition? [22:49:59] it's not in manifests/site.pp :p [22:50:02] (03CR) 10Springle: [C: 031] "waiting for schema change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160545 (owner: 10Springle) [22:50:23] and https://gerrit.wikimedia.org/r/#/c/160482/3/manifests/role/analytics/zookeeper.pp [22:50:25] (03CR) 10Springle: [C: 04-1] "waiting for schema change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160545 (owner: 10Springle) [22:50:32] bblack: LDAP [22:50:41] that's really really lame [22:50:46] LDAP does not do git grep [22:50:46] bblack: people hit "configure instance" and select a "puppet group" [22:50:49] so all three touch ferm [22:50:51] where puppet group = class [22:51:05] well anyways what class does it use? [22:51:10] and that is what broke, i try to do math :) [22:52:14] matanya: hmm.. wouldn't you revert the other one then? [22:52:29] the one that touches firewall/defs.erb [22:52:38] if i don't call the defs, they are not affecting anything [22:53:03] the merge comment being "fingers crossed" :) [22:53:22] yeah I donno about the revert, it doesn't smell right, and I wish I could test on puppet-compiler [22:53:38] network::constants just defines things that would otherwise generate a parse error and fail puppet completely [22:54:19] what lines on the real host are causing the real problem? [22:55:18] matanya: did you say earlier there was an unpuppetized rule on labs that disappeared? [22:56:49] i don't know, hashar said in the bug report he suspects so [22:56:53] !log Running sample job on integration-slave1008 and warming up npmjs.org cache [22:57:00] Logged the message, Master [22:57:22] well 2am, i wouldn't merge any commit by me [22:57:49] or revert both? [22:57:53] i'm off. i'll look at it in sane hours tomorrow, if it will still be relevant [22:58:13] the existing revert + revert of https://gerrit.wikimedia.org/r/#/c/160482/3/manifests/role/analytics/zookeeper.pp ? [22:58:15] bblack: or both or none [22:58:22] your call [22:58:38] is someone going to use it and see if it works if I do? :) [23:00:04] RoanKattouw, ^d, marktraceur, MaxSem, RoanKattouw: Respected human, time to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140915T2300). Please do the needful. [23:00:09] who's on this or looking or caring if matanya's off? I don't even know what the original problem is to confirm a fix [23:00:34] i think it can safely wait bblack [23:00:47] I can SWAT [23:00:53] ok, in that case otto can pick it up and look himself then probably [23:01:06] Glaisher, are you there?
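On the node-definition question above: labs instances are classified through LDAP rather than manifests/site.pp, with the "puppet groups" picked in the Configure Instance UI becoming class names. A minimal sketch of how an external node classifier (ENC) of that general shape behaves; the class names and the dict standing in for the LDAP query are invented for illustration:

    #!/usr/bin/env python3
    # Toy ENC: puppet invokes the classifier with the node name as
    # argv[1] and reads YAML from stdout (requires PyYAML).
    import sys
    import yaml

    # Placeholder for the real LDAP lookup keyed on the instance FQDN.
    FAKE_LDAP = {
        'deployment-bastion.eqiad.wmflabs': {
            'classes': ['role::labs::instance', 'role::deployment::test'],
            'parameters': {'instanceproject': 'deployment-prep'},
        },
    }

    def classify(node):
        # Unknown nodes still get a baseline class, mirroring how every
        # labs instance gets a default configuration.
        return FAKE_LDAP.get(node, {'classes': ['role::labs::instance']})

    if __name__ == '__main__':
        print(yaml.safe_dump(classify(sys.argv[1]), default_flow_style=False))

This is also why "git grep" finds nothing: the classification lives in a directory service, not in the puppet repo.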
[23:01:11] ok, thanks and good night [23:01:21] MaxSem: Want my help setting up https://gerrit.wikimedia.org/r/#/c/160076/ [23:01:23] ? [23:01:29] It requires a double submodule update [23:01:36] (nested) [23:01:57] RoanKattouw, git submodule update --init --recursive ? [23:02:04] Yes eventually [23:02:11] But the submodule update commits haven't been created yet [23:02:21] ahaha [23:02:31] so create them) [23:03:30] OK will do [23:06:07] !log Running sample job on integration-slave1006 and warming up npmjs.org cache [23:06:17] remove the leading space [23:07:33] !log Running sample job on integration-slave1006 and warming up npmjs.org cache [23:07:39] Logged the message, Master [23:07:44] Oh, ha, thx [23:07:48] :) [23:09:24] MaxSem: https://gerrit.wikimedia.org/r/#/c/160554/ [23:11:55] thx RoanKattouw, only wmf21 is needed? [23:12:24] Yeah [23:12:54] !log restarting lvs1002 for HT disable + kernel upgrade [23:13:00] Logged the message, Master [23:17:57] RoanKattouw, Submodule path 'extensions/VisualEditor': rebased into 'e621e39656887b31698ed238810629dfc6da9403' [23:18:08] Submodule path 'lib/ve': checked out 'd9658b24ceffe4533336f8d17412a6e053613baa' [23:18:42] MaxSem: Looks good [23:18:49] ok, pushing [23:19:54] !log maxsem Synchronized php-1.24wmf21/extensions/VisualEditor/: SWAT: https://gerrit.wikimedia.org/r/#/c/160554/ (duration: 00m 07s) [23:20:00] Logged the message, Master [23:20:01] RoanKattouw, ^ [23:20:25] Thanks [23:20:41] works? [23:26:23] !log restarting lvs1001 for HT disable + kernel upgrade [23:26:24] I'll ask our QA person [23:26:29] Logged the message, Master [23:26:39] The fixes were for JS errors and the reproduction instructions were somewhat elaborate [23:27:11] MaxSem: Also we have two OOUI commits in there as well [23:27:19] yup, waiting for CI [23:28:07] OK cool [23:29:21] one test appears to be hanging [23:29:39] gj jerkins [23:30:48] (03PS2) 10BBlack: add XPS for bnx2 (etc) to interface-rps.py [puppet] - 10https://gerrit.wikimedia.org/r/140376 [23:31:34] (03CR) 10BBlack: [C: 032 V: 032] "Kernels are updated now that I went back and found the v6 failover issue..." [puppet] - 10https://gerrit.wikimedia.org/r/140376 (owner: 10BBlack) [23:32:44] !log maxsem Synchronized php-1.24wmf21/resources/: SWAT: https://gerrit.wikimedia.org/r/#/c/160488/1 https://gerrit.wikimedia.org/r/#/c/160543/ (duration: 00m 06s) [23:32:50] Logged the message, Master [23:32:50] RoanKattouw, ^ [23:35:57] MaxSem: I tested test2 and it looks good, thanks [23:44:40] (03PS1) 10BBlack: bugfix for 1:1 rxq:txq mapping [puppet] - 10https://gerrit.wikimedia.org/r/160559 [23:45:00] (03CR) 10BBlack: [C: 032 V: 032] bugfix for 1:1 rxq:txq mapping [puppet] - 10https://gerrit.wikimedia.org/r/160559 (owner: 10BBlack) [23:51:18] (03PS1) 10BBlack: add ip6_mapped addr for neon [puppet] - 10https://gerrit.wikimedia.org/r/160560 [23:52:12] (03PS1) 10BBlack: use mapped v6 addr for neon DNS [dns] - 10https://gerrit.wikimedia.org/r/160561 [23:52:25] (03CR) 10BBlack: [C: 032] add ip6_mapped addr for neon [puppet] - 10https://gerrit.wikimedia.org/r/160560 (owner: 10BBlack) [23:53:29] (03CR) 10BBlack: [C: 032 V: 032] use mapped v6 addr for neon DNS [dns] - 10https://gerrit.wikimedia.org/r/160561 (owner: 10BBlack) [23:56:01] (03PS1) 10BBlack: add neon v6 mapped to net constants [puppet] - 10https://gerrit.wikimedia.org/r/160564 [23:56:17] (03CR) 10BBlack: [C: 032 V: 032] add neon v6 mapped to net constants [puppet] - 10https://gerrit.wikimedia.org/r/160564 (owner: 10BBlack)
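A closing note on the nested submodule update MaxSem and RoanKattouw walk through above: the inner pointer (lib/ve) has to be committed before the outer one (extensions/VisualEditor inside core), which is why the update commits must exist before anyone runs `git submodule update --init --recursive`. A sketch of the inner-to-outer order, assuming Python 3.5+ and already-initialized submodules; the paths and commit messages are illustrative, not the actual tooling used here:

    import subprocess

    def git(args, cwd):
        # Thin wrapper; raises if any step fails so a half-done bump is obvious.
        subprocess.run(['git'] + args, cwd=cwd, check=True)

    CORE = 'php-1.24wmf21'  # illustrative checkout path
    VE = CORE + '/extensions/VisualEditor'

    # 1. Point the innermost submodule (lib/ve) at the wanted commit
    #    (sha taken from the log above; any ref works).
    git(['-C', 'lib/ve', 'checkout',
         'd9658b24ceffe4533336f8d17412a6e053613baa'], cwd=VE)
    # 2. Commit the moved gitlink in the middle repo (the extension).
    git(['commit', '-m', 'Update lib/ve', 'lib/ve'], cwd=VE)
    # 3. Commit the extension's new sha in the outer repo (core).
    git(['commit', '-m', 'Update VisualEditor',
         'extensions/VisualEditor'], cwd=CORE)
    # Only now does `git submodule update --init --recursive` in a fresh
    # clone land every layer on the intended commits.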