[00:05:06] PROBLEM - Puppet freshness on fenari is CRITICAL: Last successful Puppet run was Tue 17 Jun 2014 21:03:59 UTC [00:05:25] (CR) BryanDavis: [C: 1] graphite: remove duplicate site config [operations/puppet] - https://gerrit.wikimedia.org/r/140243 (owner: Ori.livneh) [00:14:30] (CR) Ori.livneh: [C: 1] Move the ordered_json parser function to a shared module and remove the copy-pasta found in deployment, statsd and gdash modules. [operations/puppet] - https://gerrit.wikimedia.org/r/139921 (owner: 20after4) [00:16:06] PROBLEM - Puppet freshness on logstash1002 is CRITICAL: Last successful Puppet run was Tue 17 Jun 2014 21:15:33 UTC [00:20:06] PROBLEM - Puppet freshness on tungsten is CRITICAL: Last successful Puppet run was Tue 17 Jun 2014 21:19:17 UTC [00:22:24] (PS3) Ori.livneh: graphite/kibana: remove duplicate site config [operations/puppet] - https://gerrit.wikimedia.org/r/140243 [00:29:50] (PS1) Ori.livneh: mediawiki: Dedupe libapache2-mod-php5 [operations/puppet] - https://gerrit.wikimedia.org/r/140258 [00:31:06] PROBLEM - Puppet freshness on logstash1003 is CRITICAL: Last successful Puppet run was Tue 17 Jun 2014 21:30:39 UTC [00:43:06] PROBLEM - Puppet freshness on logstash1001 is CRITICAL: Last successful Puppet run was Tue 17 Jun 2014 21:42:13 UTC [00:44:40] (CR) TTO: "Oops! Well spotted."
[operations/mediawiki-config] - https://gerrit.wikimedia.org/r/78016 (https://bugzilla.wikimedia.org/52528) (owner: TTO) [00:48:56] (PS1) TTO: Put testwiki namespaces in the right place [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/140261 [00:49:26] (CR) TTO: "See Ie3c913b9e9850f7f67e2ab933bbcc46e2f43bca0" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/78016 (https://bugzilla.wikimedia.org/52528) (owner: TTO) [00:52:06] PROBLEM - Puppet freshness on caesium is CRITICAL: Last successful Puppet run was Tue 17 Jun 2014 21:51:37 UTC [00:52:06] PROBLEM - Puppet freshness on netmon1001 is CRITICAL: Last successful Puppet run was Tue 17 Jun 2014 21:51:27 UTC [00:55:06] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Last successful Puppet run was Tue 17 Jun 2014 21:54:04 UTC [00:55:11] (PS2) Ori.livneh: mediawiki: Dedupe libapache2-mod-php5 [operations/puppet] - https://gerrit.wikimedia.org/r/140258 [00:56:18] (PS3) Ori.livneh: Fix-ups for Iddc778a28 [operations/puppet] - https://gerrit.wikimedia.org/r/140258 [01:20:06] PROBLEM - Puppet freshness on polonium is CRITICAL: Last successful Puppet run was Tue 17 Jun 2014 13:16:47 UTC [01:24:31] (CR) BBlack: [C: 2 V: 2] graphite/kibana: remove duplicate site config [operations/puppet] - https://gerrit.wikimedia.org/r/140243 (owner: Ori.livneh) [01:24:46] (PS2) BBlack: otrs: resolve duplicate def'n of libapache2-mod-perl2 [operations/puppet] - https://gerrit.wikimedia.org/r/140252 (owner: Ori.livneh) [01:25:07] (PS4) BBlack: Fix-ups for Iddc778a28 [operations/puppet] - https://gerrit.wikimedia.org/r/140258 (owner: Ori.livneh) [01:27:03] (CR) BBlack: [C: 2 V: 2] otrs: resolve duplicate def'n of libapache2-mod-perl2 [operations/puppet] - https://gerrit.wikimedia.org/r/140252 (owner: Ori.livneh) [01:27:13] (PS5) BBlack: Fix-ups for Iddc778a28 [operations/puppet] - https://gerrit.wikimedia.org/r/140258 (owner: 
Ori.livneh) [01:29:16] (CR) Withoutaname: [C: 1] Put testwiki namespaces in the right place [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/140261 (owner: TTO) [01:29:46] (CR) BBlack: [C: 2 V: 2] Fix-ups for Iddc778a28 [operations/puppet] - https://gerrit.wikimedia.org/r/140258 (owner: Ori.livneh) [01:31:36] RECOVERY - Puppet freshness on logstash1003 is OK: puppet ran at Wed Jun 18 01:31:29 UTC 2014 [01:34:16] RECOVERY - Puppet freshness on fenari is OK: puppet ran at Wed Jun 18 01:34:08 UTC 2014 [01:37:56] RECOVERY - Puppet freshness on caesium is OK: puppet ran at Wed Jun 18 01:37:51 UTC 2014 [01:41:06] RECOVERY - Puppet freshness on logstash1002 is OK: puppet ran at Wed Jun 18 01:40:58 UTC 2014 [01:41:06] RECOVERY - Puppet freshness on logstash1001 is OK: puppet ran at Wed Jun 18 01:41:03 UTC 2014 [01:42:16] RECOVERY - Puppet freshness on netmon1001 is OK: puppet ran at Wed Jun 18 01:42:14 UTC 2014 [01:42:36] RECOVERY - Puppet freshness on iodine is OK: puppet ran at Wed Jun 18 01:42:29 UTC 2014 [01:46:43] bblack: i got most of them but i missed two :/ [01:47:07] (PS1) Ori.livneh: Kill two last stray symlinks left behind by Iddc778a28 [operations/puppet] - https://gerrit.wikimedia.org/r/140269 [01:49:30] !log puppet freshness on tungsten and stat1001 can be fixed with https://gerrit.wikimedia.org/r/#/c/140269/ [01:49:36] * ori runs [01:49:39] Logged the message, Master [01:56:20] (CR) Springle: [C: 2] Make dbstore1002 handle s6 analytics queries [operations/dns] - https://gerrit.wikimedia.org/r/140236 (https://bugzilla.wikimedia.org/66068) (owner: QChris) [01:58:06] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 525 bytes in 0.002 second response time [02:00:16] Hm.. 
getting bad gateway errors when trying to update jqmigrate deprecation graphs [02:00:21] and there's icinga-wm already [02:02:30] !log graphite.wikimedia.org is down with HTTP 502 Bad Gateway errors [02:02:43] Logged the message, Master [02:03:03] This was working less than 10 minutes ago [02:04:10] Krinkle: it's overloaded [02:04:29] ori: not caused by apache upgrades? [02:04:36] no [02:04:53] so this happens a lot then? [02:05:06] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.004 second response time [02:05:07] it's been chronically overloaded and we've been piling more load [02:05:16] case in point ^ :) [02:05:47] !log Nevermind, graphite.wikimedia.org going down is due to overload which recovers eventually (it just has). Has become SNAFU/FIXME. [02:05:50] Very well [02:05:52] Logged the message, Master [02:05:59] Is there a bug or RT ticket? [02:06:10] i don't think so, but you are right to ask, because there ought to be [02:06:39] I've never hit it until now [02:07:00] Wow, this is new [02:07:08] and awesome [02:07:19] http://i.imgur.com/iWNHMSC.png [02:07:20] ori: [02:08:03] http://graphite.wikimedia.org/render/?title=_j&width=300&height=200&target=aliasByNode(mw.js.deprecate._j.rate)%2C-1)&from=-3days [02:08:09] just a dangling ), found it [02:08:13] but an interesting error [02:11:20] http://s.codepen.io/Krinkle/fullpage/cBGCl [02:12:19] Krinkle, reminds me of http://ih3.redbubble.net/image.14966978.4324/fc,550x550,white.u5.jpg :) [02:12:38] http://codepen.io/Krinkle/full/cBGCl * [02:13:48] (CR) BBlack: [C: 2 V: 2] Kill two last stray symlinks left behind by Iddc778a28 [operations/puppet] - https://gerrit.wikimedia.org/r/140269 (owner: Ori.livneh) [02:14:01] bless your heart [02:14:23] graphite works for me, e.g. 
https://graphite.wikimedia.org/render/?title=HTTP%205xx%20Responses%20-8hours&from=-8hours&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=2&lineMode=connected&target=color%28cactiStyle%28alias%28reqstats.5xx,%225xx%20resp/min%22%29%29,%22blue%22%29 [02:14:35] (I never really use the UI, I just keep pasting old URLs like those that I save) [02:14:37] !log LocalisationUpdate completed (1.24wmf8) at 2014-06-18 02:13:34+00:00 [02:14:48] Logged the message, Master [02:14:58] bblack: http://codepen.io/Krinkle/full/cBGCl for me works at the moment, but I reckon in a few minutes they'll all be broken again (like it was a few minutes ago) [02:15:42] apache issues the 502 when there are no available backend workers to delegate to [02:16:13] there frequently aren't because we now have a bunch of "anomaly detection"-type alerts that poll graphite frequently [02:16:21] heading out for dinner, back in a while [02:16:25] * ori waves [02:16:47] ori: you wouldn't happen to know a way to avoid the graphs always appearing to decline to near 0 at the right of each graph ? [02:16:58] http://codepen.io/Krinkle/full/cBGCl http://codepen.io/Krinkle/full/zyodJ [02:17:16] you can filter values above/below a certain threshold [02:17:23] but not off the top of my head, no [02:17:26] RECOVERY - Puppet freshness on stat1001 is OK: puppet ran at Wed Jun 18 02:17:22 UTC 2014 [02:17:43] e.g. 
http://graphite.wikimedia.org/render/?title=wikiGetlink&width=300&height=200&target=aliasByNode(summarize(mw.js.deprecate.wikiGetlink.rate%2C%221days%22)%2C-1)&from=-3weeks [02:17:51] I think the summary is doing it wrong [02:18:02] instead of per day (from now), it goes per actual 0-24 day [02:18:09] so the current day is probably very low [02:18:11] most of these behaviors are emergent from the union of runtime options for {graphite,carbon,txstatsd} [02:18:23] it's a full-time job to debug them and i haven't been focused on it [02:18:24] at least that's my theory [02:18:27] yeah [02:18:54] logstash works quite well and more intuitive for me. Not sure if these things have a potential overlap or whether that'd be useful. [02:19:08] modules/graphite/manifests/web.pp sets a default $uwsgi_processes to 4 [02:19:29] manifests/role/graphite.pp doesn't specify a value, so the default is used [02:19:39] 4 is way too low [02:20:45] you can fix it either by specifying a value in the role class, overriding the default, or by having a different default [02:20:47] i'd go with the latter [02:20:56] if you file a bug and submit a patch i'll +1 it [02:21:34] i'm really happy to hear logstash is working well [02:21:42] i think most people are unfamiliar with it, it could do with some socialization [02:23:06] !log searchidx1001 outta sync - running sync-common [02:23:10] Logged the message, Master [02:23:23] ori: Yep, I'm also curious what the potential is to have it (or logstash's source) be used for monitoring and alerts. [02:24:00] It's happened about a dozen times now that I had to react to an outage and both found my root cause there in plain text that I can act on, and saw a clear indicator of what would've alerted us in time. [02:24:02] ori: [02:24:24] e.g. number of memcache errors going from 100/s to 5000/s [02:24:30] or from 0.2% to 80% [02:24:32] etc. 
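The kind of push alert described here (an error rate jumping from 100/s to 5000/s, or an error ratio from 0.2% to 80%) could be sketched as a simple threshold check. This is only an illustration, not anything that exists in logstash, graphite or icinga: the `Window` type, the `should_alert` function and both thresholds are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Counts observed over one sampling window (hypothetical shape)."""
    errors: int      # error events seen in the window
    total: int       # all events seen in the window
    seconds: float   # window length in seconds

    @property
    def rate(self) -> float:
        """Errors per second."""
        return self.errors / self.seconds

    @property
    def ratio(self) -> float:
        """Fraction of events that were errors (0.0 if nothing was seen)."""
        return self.errors / self.total if self.total else 0.0


def should_alert(baseline: Window, current: Window,
                 rate_factor: float = 10.0, ratio_limit: float = 0.5) -> bool:
    """Alert if the error rate grew by rate_factor or the ratio passed ratio_limit."""
    # Floor the baseline at 1 error/s so a silent baseline does not alert on noise.
    rate_jump = current.rate > rate_factor * max(baseline.rate, 1.0)
    return rate_jump or current.ratio > ratio_limit


# The figures quoted in the log: 100/s at 0.2% versus 5000/s at 80%.
quiet = Window(errors=6_000, total=3_000_000, seconds=60)
outage = Window(errors=300_000, total=375_000, seconds=60)
print(should_alert(quiet, quiet), should_alert(quiet, outage))
```

A real check would also need a source for the counts and some smoothing over several windows; both are left out of this sketch.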
[02:24:59] right now it's very powerful, but it's all pull and manually looking into [02:25:10] graphite bug https://bugzilla.wikimedia.org/show_bug.cgi?id=66765 [02:26:06] !log LocalisationUpdate completed (1.24wmf9) at 2014-06-18 02:25:03+00:00 [02:26:12] Logged the message, Master [02:59:29] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Jun 18 02:58:22 UTC 2014 (duration 58m 21s) [02:59:34] Logged the message, Master [03:21:06] PROBLEM - Puppet freshness on tungsten is CRITICAL: Last successful Puppet run was Tue 17 Jun 2014 21:19:17 UTC [03:34:06] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data exceeded the critical threshold [500.0] [03:46:06] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0] [04:02:06] PROBLEM - Puppet freshness on mw1028 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:01:06 UTC [04:02:06] PROBLEM - Puppet freshness on mw1067 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:01:36 UTC [04:03:06] PROBLEM - Puppet freshness on mw1077 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:02:12 UTC [04:03:06] PROBLEM - Puppet freshness on mw1108 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:02:32 UTC [04:03:06] PROBLEM - Puppet freshness on mw1186 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:02:22 UTC [04:03:06] PROBLEM - Puppet freshness on mw1105 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:02:17 UTC [04:03:06] PROBLEM - Puppet freshness on mw1043 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:02:37 UTC [04:03:07] PROBLEM - Puppet freshness on mw1121 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:02:02 UTC [04:03:07] PROBLEM - Puppet freshness on mw1167 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:02:42 UTC [04:04:06] PROBLEM - Puppet freshness on mw1064 is CRITICAL: Last successful Puppet run was Wed 18 
Jun 2014 01:03:48 UTC [04:04:06] PROBLEM - Puppet freshness on mw1152 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:02:58 UTC [04:04:06] PROBLEM - Puppet freshness on mw1160 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:02:58 UTC [04:04:06] PROBLEM - Puppet freshness on mw1187 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:03:38 UTC [04:04:06] PROBLEM - Puppet freshness on mw1193 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:03:54 UTC [04:04:07] PROBLEM - Puppet freshness on mw1209 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:03:03 UTC [04:04:22] :| [04:05:06] PROBLEM - Puppet freshness on mw1071 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:04:49 UTC [04:05:06] PROBLEM - Puppet freshness on mw1099 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:04:09 UTC [04:05:06] PROBLEM - Puppet freshness on mw1088 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:04:55 UTC [04:05:06] PROBLEM - Puppet freshness on mw1117 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:04:04 UTC [04:05:06] PROBLEM - Puppet freshness on mw1100 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:04:24 UTC [04:05:07] PROBLEM - Puppet freshness on mw1164 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:04:19 UTC [04:05:07] PROBLEM - Puppet freshness on mw1176 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:04:19 UTC [04:05:08] PROBLEM - Puppet freshness on mw1217 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:04:14 UTC [04:06:06] PROBLEM - Puppet freshness on mw1037 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:05:31 UTC [04:06:06] PROBLEM - Puppet freshness on mw1042 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:05:00 UTC [04:06:06] PROBLEM - Puppet freshness on mw1052 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:05:21 UTC [04:06:06] PROBLEM - Puppet 
freshness on mw1065 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:05:10 UTC [04:06:06] PROBLEM - Puppet freshness on mw1113 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:05:10 UTC [04:06:07] PROBLEM - Puppet freshness on mw1123 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:05:36 UTC [04:06:07] PROBLEM - Puppet freshness on mw1144 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:05:36 UTC [04:06:08] PROBLEM - Puppet freshness on mw1207 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:05:15 UTC [04:06:13] Failed to apply catalog: Could not find dependency Package[libapache2-mod-php5] [04:07:06] PROBLEM - Puppet freshness on mw1018 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:06:16 UTC [04:07:06] PROBLEM - Puppet freshness on mw1103 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:06:16 UTC [04:07:06] PROBLEM - Puppet freshness on mw1114 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:05:56 UTC [04:07:06] PROBLEM - Puppet freshness on mw1126 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:05:56 UTC [04:07:06] PROBLEM - Puppet freshness on mw1175 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:06:31 UTC [04:07:21] (PS1) Ori.livneh: mediawiki: avoid spurious dep on libapache2-mod-php5 [operations/puppet] - https://gerrit.wikimedia.org/r/140278 [04:07:28] springle: ^ [04:07:48] springle: not a very satisfactory commit message, but the tl;dr is that the dependency is expressed by another route [04:08:02] (PS2) Springle: mediawiki: avoid spurious dep on libapache2-mod-php5 [operations/puppet] - https://gerrit.wikimedia.org/r/140278 (owner: Ori.livneh) [04:08:06] PROBLEM - Puppet freshness on mw1017 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:07:07 UTC [04:08:06] PROBLEM - Puppet freshness on mw1044 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:07:07 UTC [04:08:06] 
PROBLEM - Puppet freshness on mw1095 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:07:42 UTC [04:08:06] PROBLEM - Puppet freshness on mw1101 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:07:07 UTC [04:08:06] PROBLEM - Puppet freshness on mw1162 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:07:01 UTC [04:08:07] PROBLEM - Puppet freshness on mw1111 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:07:52 UTC [04:08:14] (CR) Springle: [C: 2] mediawiki: avoid spurious dep on libapache2-mod-php5 [operations/puppet] - https://gerrit.wikimedia.org/r/140278 (owner: Ori.livneh) [04:09:06] PROBLEM - Puppet freshness on mw1056 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:08:52 UTC [04:09:06] PROBLEM - Puppet freshness on mw1125 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:08:02 UTC [04:09:06] PROBLEM - Puppet freshness on mw1127 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:08:27 UTC [04:09:06] PROBLEM - Puppet freshness on mw1182 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:08:27 UTC [04:09:06] PROBLEM - Puppet freshness on mw1190 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:08:07 UTC [04:09:07] PROBLEM - Puppet freshness on mw1214 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:08:37 UTC [04:09:38] (PS3) Ori.livneh: (Final!) fix-ups for Iddc778a28 [operations/puppet] - https://gerrit.wikimedia.org/r/140278 [04:09:48] (CR) Ori.livneh: [C: 2 V: 2] (Final!) 
fix-ups for Iddc778a28 [operations/puppet] - https://gerrit.wikimedia.org/r/140278 (owner: Ori.livneh) [04:09:55] heh [04:10:06] PROBLEM - Puppet freshness on mw1057 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:09:02 UTC [04:10:06] PROBLEM - Puppet freshness on mw1081 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:09:42 UTC [04:10:06] PROBLEM - Puppet freshness on mw1146 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:09:12 UTC [04:10:06] PROBLEM - Puppet freshness on mw1159 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:09:12 UTC [04:10:06] PROBLEM - Puppet freshness on mw1196 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:09:02 UTC [04:10:36] ok, i'll stagger puppet runs on the mediawikis [04:10:40] sorry for the spam folks [04:11:06] PROBLEM - Puppet freshness on mw1087 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:10:39 UTC [04:11:06] PROBLEM - Puppet freshness on mw1124 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:10:17 UTC [04:11:06] PROBLEM - Puppet freshness on mw1130 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:10:33 UTC [04:11:06] PROBLEM - Puppet freshness on mw1147 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:10:07 UTC [04:11:06] PROBLEM - Puppet freshness on mw1161 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:10:07 UTC [04:11:07] PROBLEM - Puppet freshness on mw1171 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:10:44 UTC [04:11:07] PROBLEM - Puppet freshness on mw1210 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:10:02 UTC [04:11:16] RECOVERY - Puppet freshness on mw1044 is OK: puppet ran at Wed Jun 18 04:11:14 UTC 2014 [04:11:16] RECOVERY - Puppet freshness on mw1042 is OK: puppet ran at Wed Jun 18 04:11:14 UTC 2014 [04:11:27] RECOVERY - Puppet freshness on mw1043 is OK: puppet ran at Wed Jun 18 04:11:19 UTC 2014 [04:12:06] PROBLEM - Puppet 
freshness on mw1024 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:11:45 UTC [04:12:06] PROBLEM - Puppet freshness on mw1033 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:11:50 UTC [04:12:06] PROBLEM - Puppet freshness on mw1106 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:11:50 UTC [04:12:06] RECOVERY - Puppet freshness on mw1106 is OK: puppet ran at Wed Jun 18 04:12:05 UTC 2014 [04:12:36] RECOVERY - Puppet freshness on mw1024 is OK: puppet ran at Wed Jun 18 04:12:30 UTC 2014 [04:12:36] RECOVERY - Puppet freshness on mw1033 is OK: puppet ran at Wed Jun 18 04:12:30 UTC 2014 [04:12:46] RECOVERY - Puppet freshness on mw1052 is OK: puppet ran at Wed Jun 18 04:12:45 UTC 2014 [04:12:56] RECOVERY - Puppet freshness on mw1057 is OK: puppet ran at Wed Jun 18 04:12:50 UTC 2014 [04:13:06] RECOVERY - Puppet freshness on mw1056 is OK: puppet ran at Wed Jun 18 04:12:55 UTC 2014 [04:13:16] RECOVERY - Puppet freshness on mw1037 is OK: puppet ran at Wed Jun 18 04:13:06 UTC 2014 [04:13:46] RECOVERY - Puppet freshness on mw1028 is OK: puppet ran at Wed Jun 18 04:13:36 UTC 2014 [04:14:06] PROBLEM - Puppet freshness on mw1069 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:13:46 UTC [04:14:16] RECOVERY - Puppet freshness on mw1069 is OK: puppet ran at Wed Jun 18 04:14:11 UTC 2014 [04:15:06] PROBLEM - Puppet freshness on mw1092 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:14:42 UTC [04:15:26] RECOVERY - Puppet freshness on mw1092 is OK: puppet ran at Wed Jun 18 04:15:22 UTC 2014 [04:15:26] RECOVERY - Puppet freshness on mw1071 is OK: puppet ran at Wed Jun 18 04:15:22 UTC 2014 [04:15:26] RECOVERY - Puppet freshness on mw1077 is OK: puppet ran at Wed Jun 18 04:15:22 UTC 2014 [04:15:36] RECOVERY - Puppet freshness on mw1067 is OK: puppet ran at Wed Jun 18 04:15:27 UTC 2014 [04:15:36] RECOVERY - Puppet freshness on mw1064 is OK: puppet ran at Wed Jun 18 04:15:27 UTC 2014 [04:15:36] RECOVERY - Puppet 
freshness on mw1065 is OK: puppet ran at Wed Jun 18 04:15:32 UTC 2014 [04:16:06] RECOVERY - Puppet freshness on mw1095 is OK: puppet ran at Wed Jun 18 04:15:58 UTC 2014 [04:16:16] RECOVERY - Puppet freshness on mw1099 is OK: puppet ran at Wed Jun 18 04:16:08 UTC 2014 [04:16:36] RECOVERY - Puppet freshness on mw1101 is OK: puppet ran at Wed Jun 18 04:16:34 UTC 2014 [04:16:37] I forgive you. [04:16:46] RECOVERY - Puppet freshness on mw1108 is OK: puppet ran at Wed Jun 18 04:16:39 UTC 2014 [04:16:56] RECOVERY - Puppet freshness on mw1103 is OK: puppet ran at Wed Jun 18 04:16:49 UTC 2014 [04:16:56] RECOVERY - Puppet freshness on tungsten is OK: puppet ran at Wed Jun 18 04:16:54 UTC 2014 [04:17:07] :( [04:17:16] RECOVERY - Puppet freshness on mw1105 is OK: puppet ran at Wed Jun 18 04:17:10 UTC 2014 [04:17:16] RECOVERY - Puppet freshness on mw1100 is OK: puppet ran at Wed Jun 18 04:17:15 UTC 2014 [04:17:32] icinga-wm is not very subtle about puppet failures on the app servers [04:17:34] for good reason [04:18:36] RECOVERY - Puppet freshness on mw1088 is OK: puppet ran at Wed Jun 18 04:18:31 UTC 2014 [04:18:46] RECOVERY - Puppet freshness on mw1111 is OK: puppet ran at Wed Jun 18 04:18:36 UTC 2014 [04:18:46] RECOVERY - Puppet freshness on mw1114 is OK: puppet ran at Wed Jun 18 04:18:36 UTC 2014 [04:18:56] RECOVERY - Puppet freshness on mw1117 is OK: puppet ran at Wed Jun 18 04:18:51 UTC 2014 [04:19:06] RECOVERY - Puppet freshness on mw1087 is OK: puppet ran at Wed Jun 18 04:18:56 UTC 2014 [04:19:06] PROBLEM - Puppet freshness on mw1084 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:18:22 UTC [04:19:06] PROBLEM - Puppet freshness on mw1083 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:18:48 UTC [04:19:06] PROBLEM - Puppet freshness on mw1180 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:17:57 UTC [04:19:06] RECOVERY - Puppet freshness on mw1083 is OK: puppet ran at Wed Jun 18 04:19:01 UTC 2014 [04:19:16] RECOVERY - Puppet 
freshness on mw1017 is OK: puppet ran at Wed Jun 18 04:19:06 UTC 2014 [04:19:16] RECOVERY - Puppet freshness on mw1113 is OK: puppet ran at Wed Jun 18 04:19:06 UTC 2014 [04:19:16] RECOVERY - Puppet freshness on mw1180 is OK: puppet ran at Wed Jun 18 04:19:06 UTC 2014 [04:19:16] RECOVERY - Puppet freshness on mw1084 is OK: puppet ran at Wed Jun 18 04:19:06 UTC 2014 [04:19:16] RECOVERY - Puppet freshness on mw1081 is OK: puppet ran at Wed Jun 18 04:19:11 UTC 2014 [04:19:26] RECOVERY - Puppet freshness on mw1018 is OK: puppet ran at Wed Jun 18 04:19:21 UTC 2014 [04:19:46] RECOVERY - Puppet freshness on mw1125 is OK: puppet ran at Wed Jun 18 04:19:41 UTC 2014 [04:19:56] RECOVERY - Puppet freshness on mw1123 is OK: puppet ran at Wed Jun 18 04:19:46 UTC 2014 [04:19:56] RECOVERY - Puppet freshness on mw1124 is OK: puppet ran at Wed Jun 18 04:19:46 UTC 2014 [04:20:06] RECOVERY - Puppet freshness on mw1126 is OK: puppet ran at Wed Jun 18 04:19:56 UTC 2014 [04:20:06] PROBLEM - Puppet freshness on mw1165 is CRITICAL: Last successful Puppet run was Wed 18 Jun 2014 01:19:44 UTC [04:20:06] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data exceeded the critical threshold [500.0] [04:20:06] RECOVERY - Puppet freshness on mw1121 is OK: puppet ran at Wed Jun 18 04:20:01 UTC 2014 [04:20:26] RECOVERY - Puppet freshness on mw1165 is OK: puppet ran at Wed Jun 18 04:20:16 UTC 2014 [04:20:36] RECOVERY - Puppet freshness on mw1127 is OK: puppet ran at Wed Jun 18 04:20:27 UTC 2014 [04:21:06] PROBLEM - Puppet freshness on polonium is CRITICAL: Last successful Puppet run was Tue 17 Jun 2014 13:16:47 UTC [04:21:36] RECOVERY - Puppet freshness on mw1187 is OK: puppet ran at Wed Jun 18 04:21:32 UTC 2014 [04:21:47] RECOVERY - Puppet freshness on mw1182 is OK: puppet ran at Wed Jun 18 04:21:37 UTC 2014 [04:21:56] RECOVERY - Puppet freshness on mw1214 is OK: puppet ran at Wed Jun 18 04:21:53 UTC 2014 [04:22:06] RECOVERY - Puppet freshness on mw1186 is OK: puppet ran at Wed 
Jun 18 04:22:03 UTC 2014 [04:22:16] RECOVERY - Puppet freshness on mw1217 is OK: puppet ran at Wed Jun 18 04:22:08 UTC 2014 [04:22:16] RECOVERY - Puppet freshness on mw1209 is OK: puppet ran at Wed Jun 18 04:22:13 UTC 2014 [04:22:16] RECOVERY - Puppet freshness on mw1210 is OK: puppet ran at Wed Jun 18 04:22:13 UTC 2014 [04:22:26] RECOVERY - Puppet freshness on mw1207 is OK: puppet ran at Wed Jun 18 04:22:18 UTC 2014 [04:22:36] RECOVERY - Puppet freshness on mw1193 is OK: puppet ran at Wed Jun 18 04:22:28 UTC 2014 [04:22:36] RECOVERY - Puppet freshness on mw1190 is OK: puppet ran at Wed Jun 18 04:22:28 UTC 2014 [04:22:36] RECOVERY - Puppet freshness on mw1196 is OK: puppet ran at Wed Jun 18 04:22:33 UTC 2014 [04:23:36] RECOVERY - Puppet freshness on mw1147 is OK: puppet ran at Wed Jun 18 04:23:29 UTC 2014 [04:23:36] RECOVERY - Puppet freshness on mw1144 is OK: puppet ran at Wed Jun 18 04:23:29 UTC 2014 [04:23:46] RECOVERY - Puppet freshness on mw1146 is OK: puppet ran at Wed Jun 18 04:23:44 UTC 2014 [04:24:16] RECOVERY - Puppet freshness on mw1161 is OK: puppet ran at Wed Jun 18 04:24:15 UTC 2014 [04:24:26] RECOVERY - Puppet freshness on mw1167 is OK: puppet ran at Wed Jun 18 04:24:20 UTC 2014 [04:24:26] RECOVERY - Puppet freshness on mw1160 is OK: puppet ran at Wed Jun 18 04:24:20 UTC 2014 [04:24:26] RECOVERY - Puppet freshness on mw1164 is OK: puppet ran at Wed Jun 18 04:24:25 UTC 2014 [04:24:36] RECOVERY - Puppet freshness on mw1162 is OK: puppet ran at Wed Jun 18 04:24:35 UTC 2014 [04:25:56] RECOVERY - Puppet freshness on mw1176 is OK: puppet ran at Wed Jun 18 04:25:46 UTC 2014 [04:26:06] RECOVERY - Puppet freshness on mw1175 is OK: puppet ran at Wed Jun 18 04:25:56 UTC 2014 [04:26:06] RECOVERY - Puppet freshness on mw1171 is OK: puppet ran at Wed Jun 18 04:26:01 UTC 2014 [04:26:26] RECOVERY - Puppet freshness on mw1152 is OK: puppet ran at Wed Jun 18 04:26:21 UTC 2014 [04:26:36] RECOVERY - Puppet freshness on mw1159 is OK: puppet ran at Wed Jun 18 04:26:31 UTC 
2014 [04:27:05] ori: tnx ;) [04:27:15] super sorry about that [04:27:26] RECOVERY - Puppet freshness on mw1130 is OK: puppet ran at Wed Jun 18 04:27:16 UTC 2014 [04:27:51] 17 mins for a puppet run on all app servers, not bad [04:30:06] RECOVERY - Puppet freshness on polonium is OK: puppet ran at Wed Jun 18 04:29:58 UTC 2014 [04:30:56] !log enabled puppet on polonium (was disabled but nothing in SAL) [04:30:59] Logged the message, Master [04:32:06] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0] [04:32:58] /clear [04:37:19] (PS2) Ori.livneh: Add a lightweight apache::site resource [operations/puppet] - https://gerrit.wikimedia.org/r/140242 [04:38:27] (CR) jenkins-bot: [V: -1] Add a lightweight apache::site resource [operations/puppet] - https://gerrit.wikimedia.org/r/140242 (owner: Ori.livneh) [04:50:08] (PS3) Ori.livneh: Add a lightweight apache::site resource [operations/puppet] - https://gerrit.wikimedia.org/r/140242 [05:06:32] (PS1) Springle: Make sanitarium hosts aggregators, since they're bored. Make db1021 deaf, since it shows consistently higher load than its siblings and there isn't much else to turn off. [operations/puppet] - https://gerrit.wikimedia.org/r/140305 [05:08:13] (CR) Springle: [C: 2] Make sanitarium hosts aggregators, since they're bored. Make db1021 deaf, since it shows consistently higher load than its siblings and ther [operations/puppet] - https://gerrit.wikimedia.org/r/140305 (owner: Springle) [05:34:37] <_joe_> hey springle [05:34:48] <_joe_> good morning! [05:34:50] hi _joe_ [05:53:44] (PS1) Springle: Update ganglia data sources to match the aggregators in I61d35c978. [operations/puppet] - https://gerrit.wikimedia.org/r/140307 [05:56:07] (CR) Springle: [C: 2] Update ganglia data sources to match the aggregators in I61d35c978. 
[operations/puppet] - https://gerrit.wikimedia.org/r/140307 (owner: Springle) [06:01:17] !log restarted gmetad on nickel while unbreaking the mysql graphs I broke on ganglia [06:01:22] Logged the message, Master [06:44:10] (PS9) Nemo bis: Gather all soft-disabled uploads wikis in one config item [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/134400 [06:44:37] (CR) Nemo bis: "Thanks Anomie; switched to use groupOverrides instead." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/134400 (owner: Nemo bis) [07:43:56] PROBLEM - RAID on ms-be1007 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) [07:44:06] PROBLEM - Disk space on ms-be1007 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sde1 is not accessible: Input/output error [08:23:06] RECOVERY - Disk space on ms-be1007 is OK: DISK OK [08:24:11] <_joe_> !log stopped swift on ms-be1007, unmounting volume to check for repair [08:24:16] Logged the message, Master [08:26:29] <_joe_> !log disk is gone, powering down ms-be1007, opening ticket for disk replacement [08:26:34] Logged the message, Master [08:53:04] (CR) Alexandros Kosiaris: [C: 2] Now building against and running with openjdk-7 [operations/debs/kafka] (debian) - https://gerrit.wikimedia.org/r/140206 (owner: Ottomata) [08:57:04] (CR) Alexandros Kosiaris: [C: 2] "I have no idea what all these settings do, I suppose we can trust linkedin ? That being said, the commit message gives the impression ther" [operations/debs/kafka] (debian) - https://gerrit.wikimedia.org/r/140207 (owner: Ottomata) [08:57:58] <_joe_> akosiaris: we trust linkedin? [08:58:00] <_joe_> :) [08:58:11] well we already run their software [08:58:19] kafka [08:59:53] so, we already have committed the sin of trusting it, a commented performance setting is just a very very small step towards... 
well wherever linkedin trusting people go [09:00:28] <_joe_> eheh [09:00:35] <_joe_> it was just funny [09:01:28] * YuviPanda waves at akosiaris [09:01:32] was about to say that ... the setting is commented out ... [09:01:35] not as funny as using a message queue named after a writer whose most famous story is about a message that never arrives [09:01:50] YuviPanda: aaaah nice to see you are here. So to answer the question about the postgres in labs, new DC has some priority over that. The hardware order however for decoupling the toolserver/osm database from the rest of the postgres dbs has been sent. [09:02:13] http://www.kafka-online.info/an-imperial-message.html [09:02:27] so, the moment those SSDs get installed + a couple of days and that should be done [09:02:28] akosiaris: ah, cool! :) General ETA is in the order of weeks/months? [09:02:39] akosiaris: oh, cool. couple of days is much nicer :) [09:02:46] ah... weeks [09:02:56] akosiaris: apache_module, apache_site, apache_confd, webserver::apache::module, generic-definitions.pp are all gone :D [09:03:04] akosiaris: I have reworked the user creation patch for mysql as well, so when Coren is back should be able to get that merged as well. [09:03:06] akosiaris: cool :) [09:03:12] akosiaris: thanks! I'll bug you in a week! [09:03:25] YuviPanda: cool [09:04:55] ori: yes I saw quite a few patches. Cool! I am gonna see if this https://gerrit.wikimedia.org/r/#/c/107567/ makes zirconium happy nowadays (it used not to) soon [09:05:38] <_joe_> ori: while you're awake... apache-config [09:05:41] oh cool [09:06:02] <_joe_> ori: are most of the changes *value* changes or we add/remove directives often? 
[09:06:16] (03CR) 10Alexandros Kosiaris: [C: 032] Fix for KAFKA_MIRROR_START variable in kafka-mirror.init script [operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/140208 (owner: 10Ottomata) [09:06:21] <_joe_> I was thinking we could separate the logic for distributing the two things [09:06:46] 'git log -p' is how i'd answer that [09:06:57] <_joe_> yes, fair enough [09:07:09] not snarking, i couldn't answer that fairly off the top of my head [09:07:11] <_joe_> that's me being lazy [09:07:13] <_joe_> :) [09:07:30] <_joe_> oh I was told you always had the answers! [09:08:01] no no, i always have *some* answers [09:08:09] not the right ones [09:08:21] not usually, at any rate :) [09:08:32] hmm, https://wikitech.wikimedia.org/wiki/Special:NovaSudoer is giving me a blank page. [09:09:25] are you sure it's blank? maybe it's using very, very pale grey-on-white text [09:09:54] * ori should sleep [09:10:08] ori: :D [09:10:33] tools' crontabs are down, because tools-submit is empty, but for some reason I don't have root there, despite having root everywhere else. something's fishy. [09:10:54] it's also frustrating that the only way to figure out when something runs out of disk space is when someone complains on labs-l or IRC [09:11:02] <_joe_> ori: I was thinking that if we change more values than directives, https://github.com/kelseyhightower/confd could be a good tool (that is, distribute templates with puppet, and use confd to change values on the run) [09:11:04] and even then it's sometimes a game of 'ssh everywhere and check' [09:11:33] * YuviPanda should get icinga up on tools at some point, and configure it properly. also need to get more disk space for all nodes [09:14:18] YuviPanda: tools-submit has /var full [09:14:23] akosiaris: yeah, need to clean up [09:14:32] akosiaris: I don't have root just on that instance, and I've no idea why [09:14:39] akosiaris: do you have root there? 
[09:14:44] yes [09:14:54] well, I got root everywhere (I hope) [09:14:57] akosiaris: :D [09:15:04] akosiaris: I thought I had root everywhere for tools, but apparently not [09:16:02] 2G /var is ridiculous [09:16:09] it is the 10th time this is causing an issue [09:16:19] akosiaris: yeah [09:16:30] same with all the nodes [09:16:34] I cleaned some 328M by deleting old archived debs [09:16:35] they have tiny roots. [09:16:45] roots is are not tiny [09:16:50] more like 8G [09:17:04] the /var being only 2G makes no sense though [09:17:05] akosiaris: right, I meant no secondary partition mounted so they overall don't have enough space [09:17:13] no lvm classes applied [09:17:34] what is worse is that people requesting an 80G instance and those 70G will not be allocated automagically [09:17:42] but rest there as unpartitioned space [09:17:46] yup [09:17:56] is that an openstack thing or a our-config thing? [09:18:21] I honestly do not know. Whatever it is we need to solve it [09:18:25] yeah [09:18:47] cause all we do is tell people, hey use that puppet lvm class and blah blah but what for ? [09:19:06] it should not matter to them, it should be transparent [09:19:16] yup [09:19:24] let me file a general bug [09:19:28] I am gonna stop ranting now :-) [09:19:31] akosiaris: :D [09:19:34] thank you [09:20:30] akosiaris: hmm, so I'm going to apply the biglogs class to it, which gives it an 8G /var/log. Would that empty the current logs? [09:20:41] or rather, would we fully lose the current logs, or will they be... somewhere? [09:20:47] * YuviPanda hasn't really poked at how lvm works yet [09:22:32] * YuviPanda starts reading http://tldp.org/HOWTO/LVM-HOWTO/whatislvm.html [09:23:23] has nothing to do with lvm [09:23:33] lvm is just another way of managing partitions [09:23:43] so a new "partition" will be created and mounted on /var [09:24:01] sorry /var/log/ [09:24:13] /var/log in this case. so what happens to the files that are in /var/log right now? 
[09:24:14] yeah [09:24:27] I guess maybe it'll fail to mount saying mount point isn't empty? [09:24:28] the old /var partition will not be touched or whatever but the /var/log files will be "hidden" while the new "partition" is mounted [09:25:02] ah, right [09:25:25] so you umount /var/log, move /var/log to /var/log.old, mount /var/log, rsync /var/log.old /var/log/, rm -rf /var/log.old [09:25:34] right [09:25:48] YuviPanda: the "mount point isn't empty" is a windows thing, linux doesn't really care, as long as the mount point exists ;) [09:25:52] do you think you'll have time to do that now, or should I want till andrewbogott_afk is back and we can figure out why I don't have root? [09:25:56] and you are done. Take care to stop all log writing processes before the rsync however [09:26:34] not that urgent atm, since cron isn't failing, but I suspect it'll start failing soon. [09:26:42] I can do that now. But, Will you pretty please apply the puppet class ? I honestly don't want to go through the wikitech interface right now [09:27:12] hmm do you have the rights to do that ? [09:27:13] akosiaris: yup, applied class [09:27:31] akosiaris: yup, I've root on tools project. [09:27:35] s/root/projectadmin/ [09:27:50] akosiaris: which is why it's doubly confusing as to why I don't have root on tools-submit [09:28:18] akosiaris: not just me, scfe_de also doesn't have root there (despite being projectadmin) [09:28:39] meaning that sudo -s asks for a password right ? [09:29:00] akosiaris: yup [09:30:51] i just killed your shell btw [09:31:04] akosiaris: :D just saw [09:36:23] YuviPanda: done. /var has 1.5G now. Not much but better than before [09:37:10] akosiaris: ok. and /log as 8? [09:37:16] yes [09:37:47] akosiaris: cool. should be ok for now. [09:37:50] akosiaris: thanks! [09:37:54] akosiaris: but any idea why I can't sudo there? [09:38:08] looking at it now. 
I got curious as well [09:49:14] akosiaris: ok :) [10:01:18] !log Updated our Jenkins job builder fork: 8cbc93a..416ee7d [10:01:23] Logged the message, Master [10:05:02] (03CR) 10Hashar: [C: 031] beta: Small scap fixes [operations/puppet] - 10https://gerrit.wikimedia.org/r/140045 (owner: 10BryanDavis) [10:11:30] akosiaris: any luck? [10:12:50] YuviPanda: sudo: pam_unix(sudo:auth): conversation failed and sudo: pam_unix(sudo:auth): auth could not identify password for [yuvipanda] [10:12:59] the same for scfc [10:13:08] hmm, works for coren? [10:13:11] obviously not for my account [10:13:20] looking to see why now [10:13:30] ty! [10:21:31] YuviPanda: SUDOERS_BASE ou=sudoers,cn=,ou=projects,dc=wikimedia,dc=org [10:21:38] in /etc/sudo-ldap.conf [10:21:49] spot the missing cn= [10:21:51] right [10:22:04] I added tools there manually and you should be ok now [10:22:06] I wonder where that came from, and why puppet never fixed it. [10:22:18] same thing here [10:22:30] not sure it is puppet managed [10:22:35] well it probably is not [10:22:47] I just ran puppet and it it not revert my change [10:22:50] maybe base image. I don't that's a manual step. [10:22:51] ok [10:22:59] akosiaris: I'll make a comment somewhere so this isn't forgotten [10:23:35] akosiaris: and thanks very much! [10:23:53] you are welcome [10:24:07] akosiaris: may I also bug you to merge this super trivial, +1 package addition? Toolserver is being shut down soon and I want to give migrators as much time as possible... [10:24:14] ok if not, I'll just wait for andrewbogott later on [10:24:16] sure [10:24:22] akosiaris: ty! 
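(Editor's aside on the sudo-ldap.conf issue above: the broken SUDOERS_BASE had an empty `cn=` component, which is easy to miss by eye. The following is only a minimal sketch of how such a value could be checked; the helper names `sudoers_base` and `empty_rdns` are hypothetical, the parsing splits on plain commas and ignores escaped commas, and nothing here is taken from the actual Wikimedia tooling.)

```python
from typing import List, Optional


def sudoers_base(conf_text: str) -> Optional[str]:
    """Return the value of the first SUDOERS_BASE line, or None if absent.

    Assumes the simple "KEYWORD value" layout quoted in the log.
    """
    for line in conf_text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0].upper() == "SUDOERS_BASE":
            return parts[1].strip()
    return None


def empty_rdns(dn: str) -> List[str]:
    """Return the attribute names whose value is empty in a DN string.

    Simplification: splits on plain commas, so escaped commas and
    multi-valued components are not handled.
    """
    empty = []
    for rdn in dn.split(","):
        attr, _, value = rdn.strip().partition("=")
        if not value.strip():
            empty.append(attr)
    return empty


if __name__ == "__main__":
    broken = "SUDOERS_BASE ou=sudoers,cn=,ou=projects,dc=wikimedia,dc=org"
    print(empty_rdns(sudoers_base(broken)))  # -> ['cn']
```

(Run against the value as fixed by hand in the log, with `cn=tools`, the same check returns an empty list.)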
[10:24:30] akosiaris: https://gerrit.wikimedia.org/r/#/c/140239/1 [10:24:32] but after that your quota is done for the day :-) [10:24:48] akosiaris: indeed, I appreciate my quota being this high already :) [10:25:32] I am only merging this if Catalonia splits from the rest of spain :P [10:25:44] niah, stupid political joke, forget about it [10:25:54] (03CR) 10Alexandros Kosiaris: [C: 032] Tools: Install Catalan locale [operations/puppet] - 10https://gerrit.wikimedia.org/r/140239 (https://bugzilla.wikimedia.org/62269) (owner: 10Tim Landscheidt) [10:27:40] akosiaris: indeed, I appreciate my quota being this high already :) [10:27:42] (resending because my network crapped out) [10:28:01] akosiaris: ahaha, just saw the joke :) [10:28:15] * YuviPanda wonders if it is completely inappropriate to make an Iraq joke [10:28:16] probably is [10:28:26] akosiaris: ty for the merge [10:28:38] I wonder too. [10:29:55] akosiaris: :) unrelated, but how does one pronounce your last name? [10:30:32] ko-see-aah-ris? [10:30:38] hmm I am close to ignorant in the phonetic alhpabet [10:30:51] let me consult wikipedia [10:30:57] hehe [10:37:57] (03PS1) 10Giuseppe Lavagetto: puppet 3: convert jobrunners 1 of 2 [operations/puppet] - 10https://gerrit.wikimedia.org/r/140328 [10:37:59] (03PS1) 10Giuseppe Lavagetto: puppet 3: convert jobrunners 2 of 2 [operations/puppet] - 10https://gerrit.wikimedia.org/r/140329 [10:57:00] (03CR) 10Giuseppe Lavagetto: [C: 031] "looks good to me, I can rebase and merge if you agree." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/136325 (https://bugzilla.wikimedia.org/65868) (owner: 10Hashar) [10:59:40] (03PS6) 10Nuria: Add backup role and scripts [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/139557 (https://bugzilla.wikimedia.org/66119) (owner: 10Milimetric) [11:02:56] RECOVERY - RAID on ms-be1007 is OK: OK: optimal, 13 logical, 13 physical [11:03:06] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data exceeded the critical threshold [500.0] [11:05:49] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "One thing I'd change, apart from that, it LGTM" (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/140242 (owner: 10Ori.livneh) [11:07:47] (03CR) 10Alexandros Kosiaris: [C: 04-1] "The situation you describe fails because in line 94 of /etc/init.d/kafka-mirror you have" [operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/140209 (owner: 10Ottomata) [11:15:06] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0] [11:16:02] (03CR) 10Ori.livneh: "@Giuseppe: I wanted that too, but it is difficult to map existing usage to that, because it makes provisioning or purging a site a two-ste" [operations/puppet] - 10https://gerrit.wikimedia.org/r/140242 (owner: 10Ori.livneh) [11:18:50] <_joe_> ori: go-to-sleep [11:19:10] <_joe_> I just realized it's that time of the day [11:19:33] _joe_: what timezone is he in? 
[11:20:46] <_joe_> Trminator: you don't want to know :) [11:20:58] hehe# [11:22:14] Trminator: he's in ori timezone [11:22:26] usually involves sandwiches at late hours [11:22:32] (at least everytime I've been around) [11:23:54] (03CR) 10Giuseppe Lavagetto: "Actually, the whole point of having sites-available is that you just have to remove a link (and could leave the vhost in sites-available i" [operations/puppet] - 10https://gerrit.wikimedia.org/r/140242 (owner: 10Ori.livneh) [11:24:53] <_joe_> ori: about the link-vs-file in sites-enabled, it's a matter of taste, so if I'm the only one feeling that way I won't oppose to doing that [11:25:02] _joe_: i prefer it too, but consider [11:25:27] 1) the current apache-config thing and the Execs that rsync it, etc. are all oriented around concrete files [11:25:51] 2) the apache::vhost resource that we inherited from the puppetlabs apache module (and which is still in use) is oriented around concrete files [11:26:10] <_joe_> ok so conversion would've been even worse [11:26:17] 3) remaining apache sites across the repo were a healthy mix of concrete files in sites-enabled and symlinks [11:26:24] <_joe_> ori: it can be a two-step process, if we want to [11:26:33] oh sure, i wouldn't mind that at all [11:26:43] <_joe_> first we move everything under the new class [11:26:46] if we get the repo to a state where everyone is using apache::site, sure [11:26:50] <_joe_> across all our puppet manifests [11:26:53] yeah, i'd be happy with that [11:26:54] <_joe_> exactly [11:27:04] <_joe_> then we can do whatever we want, really [11:27:13] <_joe_> ok agreed :) [11:27:14] * YuviPanda makes _joe_ invade poland [11:27:31] <_joe_> I'll merge your patch if you can prove me you're afk :P [11:27:59] <_joe_> YuviPanda: I'd need to listen to Wagner first [11:28:16] _joe_: :) [11:28:20] <_joe_> ok, time for lunch + break it seems! 
see you later [11:29:06] (03CR) 10Filippo Giunchedi: [C: 031] puppet 3: convert jobrunners 1 of 2 [operations/puppet] - 10https://gerrit.wikimedia.org/r/140328 (owner: 10Giuseppe Lavagetto) [11:29:34] I think ori's afk for real [11:30:23] well played, ori [11:31:08] Logged the compliment, YuviPanda. [11:31:32] * YuviPanda gives logmsgbot a tuna sandwich [11:32:15] no large trouts involved? [11:33:00] godog: no large trouts were harmed in the making of this tuna sandwich [11:36:15] haha nice [12:18:21] (03CR) 10Alexandros Kosiaris: [C: 032] puppet 3: convert jobrunners 1 of 2 [operations/puppet] - 10https://gerrit.wikimedia.org/r/140328 (owner: 10Giuseppe Lavagetto) [12:22:57] (03PS1) 10Alexandros Kosiaris: puppet interval to 20 mins [operations/puppet] - 10https://gerrit.wikimedia.org/r/140337 [12:23:49] (03PS2) 10Alexandros Kosiaris: apache::vhost: get rid of logroot param [operations/puppet] - 10https://gerrit.wikimedia.org/r/140148 (owner: 10Ori.livneh) [12:23:58] (03CR) 10Alexandros Kosiaris: [C: 032] apache::vhost: get rid of logroot param [operations/puppet] - 10https://gerrit.wikimedia.org/r/140148 (owner: 10Ori.livneh) [12:36:44] err: /Stage[main]/Diamond/Package[python-diamond]/ensure: change from purged to present failed: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install python-diamond' returned 100: Reading package lists... [12:36:44] Building dependency tree... [12:36:44] Reading state information... [12:36:44] E: Couldn't find package python-diamond [12:36:56] on nickel. It is a 10.04 machine so it makes sense [12:37:26] godog: does it makes sense to also have the package for 10.04 ? [12:37:54] as in... will it take more that 10 mins ? 
[12:40:38] akosiaris: no I think it'll build the deb just fine, yes it makes sense mostly because it'll take 5 min, not sure what's your 10.04 population [12:43:05] godog: on palladium salt '*' -t 20 --out raw cmd.run 'lsb_release -r' | grep 10.04 [12:43:10] 13 machines [12:44:06] oh ok, thanks! [12:44:20] I feared it was more [12:44:29] only 13 is kind of nice [12:46:18] (03CR) 10Alexandros Kosiaris: "This did not have the intended result. The ensure => present lines have kept the files intact, as in they are still symlinks to sites-avai" [operations/puppet] - 10https://gerrit.wikimedia.org/r/140218 (owner: 10Ori.livneh) [12:50:10] (03PS2) 10Giuseppe Lavagetto: puppet 3: convert jobrunners 1 of 2 [operations/puppet] - 10https://gerrit.wikimedia.org/r/140328 [12:50:58] <_joe_> godog: you need to build python-diamond? if so, ping me [12:51:18] <_joe_> what we have in the repo atm will not build a correct package for you [12:54:41] (03PS1) 10Odder: Correct a broken link on the Weekly Report page [wikimedia/bugzilla/modifications] - 10https://gerrit.wikimedia.org/r/140341 (https://bugzilla.wikimedia.org/66778) [12:56:50] (03PS2) 10Odder: Correct broken links on the Weekly Report page [wikimedia/bugzilla/modifications] - 10https://gerrit.wikimedia.org/r/140341 (https://bugzilla.wikimedia.org/66778) [13:00:08] (03PS2) 10Giuseppe Lavagetto: puppet 3: convert jobrunners 2 of 2 [operations/puppet] - 10https://gerrit.wikimedia.org/r/140329 [13:00:23] (03CR) 10jenkins-bot: [V: 04-1] puppet 3: convert jobrunners 2 of 2 [operations/puppet] - 10https://gerrit.wikimedia.org/r/140329 (owner: 10Giuseppe Lavagetto) [13:00:25] (03CR) 10Giuseppe Lavagetto: [C: 032] puppet 3: convert jobrunners 2 of 2 [operations/puppet] - 10https://gerrit.wikimedia.org/r/140329 (owner: 10Giuseppe Lavagetto) [13:00:51] <_joe_> damn, jenkins unable to merge [13:01:11] (03CR) 10Giuseppe Lavagetto: [V: 032] puppet 3: convert jobrunners 2 of 2 [operations/puppet] - 10https://gerrit.wikimedia.org/r/140329 
(owner: 10Giuseppe Lavagetto) [13:02:51] _joe_: it was akosiaris not me btw :) [13:03:19] <_joe_> ok, whoever needs that, I can help :) [13:21:08] (03CR) 10Alexandros Kosiaris: [C: 04-1] "I am more or less of the same opinion as Giuseppe on this. It feels weird to step around that debian mechanism. I have already seen that o" [operations/puppet] - 10https://gerrit.wikimedia.org/r/140242 (owner: 10Ori.livneh) [13:23:31] (03PS1) 10Giuseppe Lavagetto: puppet3: migrate mediawiki appservers 1 of 2 [operations/puppet] - 10https://gerrit.wikimedia.org/r/140349 [13:23:33] (03PS1) 10Giuseppe Lavagetto: puppet3: migrate mediawiki appservers 2 of 2 [operations/puppet] - 10https://gerrit.wikimedia.org/r/140350 [13:25:39] !log script rt-7708.pl hitting m2-master eventlogging from terbium for RT #7708. fine to kill if necessary [13:25:44] Logged the message, Master [13:35:25] (03CR) 10Giuseppe Lavagetto: [C: 032] puppet3: migrate mediawiki appservers 1 of 2 [operations/puppet] - 10https://gerrit.wikimedia.org/r/140349 (owner: 10Giuseppe Lavagetto) [13:38:28] (03CR) 10Alexandros Kosiaris: [C: 032] puppet interval to 20 mins [operations/puppet] - 10https://gerrit.wikimedia.org/r/140337 (owner: 10Alexandros Kosiaris) [13:38:59] let's see now... will this bring the puppetmasters down or not ? [13:40:18] <_joe_> !log restarted profiler-to-carbon, stuck (again) waiting for mwprof [13:40:23] Logged the message, Master [13:40:36] <_joe_> I hope this time it's my patched version... [13:41:34] <_joe_> akosiaris: your commit just made it between my two puppet runs on the canary host... 
I've seen a change while I was expecting none and got scared :P [13:41:57] ahaha [13:42:29] (03PS2) 10Giuseppe Lavagetto: puppet3: migrate mediawiki appservers 2 of 2 [operations/puppet] - 10https://gerrit.wikimedia.org/r/140350 [13:42:31] gonna be looking at this for the next couple of hours [13:42:32] https://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&title=Puppetmaster+CPU+user&vl=%25&x=&n=&hreg[]=palladium%7Cstrontium&mreg[]=cpu_user&gtype=line&glegend=show&aggregate=1&embed=1&_=1403021312953 [13:43:19] * _joe_ prepares to launch a bitcoin miner on the puppetmasters just to retaliate [13:44:08] (03CR) 10Giuseppe Lavagetto: [C: 032] puppet3: migrate mediawiki appservers 2 of 2 [operations/puppet] - 10https://gerrit.wikimedia.org/r/140350 (owner: 10Giuseppe Lavagetto) [13:44:08] loool [13:44:20] <_joe_> or folding@home [13:44:32] <_joe_> which is at least for a noble cause [13:45:51] what if you donated the vast amounts of wealth that bitcoin miner (namely 0.0 btc in the next 10k years) would generate for you to the folding@home project ? [13:47:50] <_joe_> btw, BTC have just lost all their value, but that's another story completely [13:51:25] I wonder if https://it.wikipedia.org/wiki/Gastone_Paperone was real how many people would actually even try to mess with BTCs [13:51:55] and yes, he is called Gastone in greek too and not Gladstone [13:55:24] <_joe_> lol [13:55:47] <_joe_> 'una faccia una razza' [13:55:54] (03PS1) 10Ottomata: Fixing stat1001.wikimedia.org website [operations/puppet] - 10https://gerrit.wikimedia.org/r/140357 [13:55:55] Interesting, any relation between the publishers or translators? [13:56:05] (That quotation on top is against local policy.)
[13:56:27] <_joe_> Nemo_bis: most of the mickey mouse/disney comics were written in italy anyway since the late 1940s I think [13:56:58] (03PS2) 10QChris: Fixing stat1001.wikimedia.org website [operations/puppet] - 10https://gerrit.wikimedia.org/r/140357 (https://bugzilla.wikimedia.org/66781) (owner: 10Ottomata) [13:57:04] <_joe_> so chances are, greece just imported and translated italian stories [13:57:13] makes sense [13:57:35] (03PS1) 10BBlack: all LVS on "standard" now (w/ NTP) [operations/puppet] - 10https://gerrit.wikimedia.org/r/140358 [13:58:34] wasn't all that the work of Carl Barks? [13:58:47] <_joe_> the original stories and the characters, yes [13:58:57] <_joe_> Banks or barks? [13:59:09] (03CR) 10Ottomata: [C: 032 V: 032] Fixing stat1001.wikimedia.org website [operations/puppet] - 10https://gerrit.wikimedia.org/r/140357 (https://bugzilla.wikimedia.org/66781) (owner: 10Ottomata) [13:59:10] https://en.wikipedia.org/wiki/Carl_Barks [13:59:19] <_joe_> ok [13:59:25] (03PS2) 10BBlack: all LVS on "standard" now (w/ NTP) [operations/puppet] - 10https://gerrit.wikimedia.org/r/140358 [13:59:31] (03CR) 10BBlack: [C: 032 V: 032] all LVS on "standard" now (w/ NTP) [operations/puppet] - 10https://gerrit.wikimedia.org/r/140358 (owner: 10BBlack) [13:59:40] ottomata: hey, re: the GC settings we could take a look but I'm no expert by any stretch of imagination, best of course would be to have current GC stats to know what's going on, is it ES? 
[13:59:57] no, kafka [14:00:06] there's no problems that I know of right now [14:00:18] there are just some different recommended settings that i'm about to adopt [14:00:24] and i'd kinda like to know the difference between then [14:00:25] them [14:00:29] i've got 2 kafka brokers running right now [14:00:33] one has the new settings, and one the old [14:00:39] i've got a meeting right now, can chat in jsut a bit [14:00:45] sure [14:02:06] (03PS1) 10Giuseppe Lavagetto: puppet 3: api appservers 1 of 2 [operations/puppet] - 10https://gerrit.wikimedia.org/r/140360 [14:02:12] (03PS1) 10Giuseppe Lavagetto: puppet3: api appservers 2 of 2 [operations/puppet] - 10https://gerrit.wikimedia.org/r/140361 [14:02:12] (03PS1) 10Giuseppe Lavagetto: puppet3: bits appservers [operations/puppet] - 10https://gerrit.wikimedia.org/r/140362 [14:02:12] (03PS1) 10Giuseppe Lavagetto: puppet3: imagescalers 1 of 2 [operations/puppet] - 10https://gerrit.wikimedia.org/r/140363 [14:02:14] (03PS1) 10Giuseppe Lavagetto: puppet3: imagescalers 2 of 2 [operations/puppet] - 10https://gerrit.wikimedia.org/r/140364 [14:02:16] (03PS1) 10Giuseppe Lavagetto: puppet3: move remaining appservers [operations/puppet] - 10https://gerrit.wikimedia.org/r/140365 [14:04:47] (03CR) 10Giuseppe Lavagetto: [C: 032] puppet 3: api appservers 1 of 2 [operations/puppet] - 10https://gerrit.wikimedia.org/r/140360 (owner: 10Giuseppe Lavagetto) [14:08:48] (03CR) 10Giuseppe Lavagetto: [C: 032] puppet3: api appservers 2 of 2 [operations/puppet] - 10https://gerrit.wikimedia.org/r/140361 (owner: 10Giuseppe Lavagetto) [14:11:11] (03PS3) 10Ottomata: Add commented out recommended GC settings if using Java 7 u51 or greater [operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/140207 [14:11:37] (03CR) 10Ottomata: [C: 032 V: 032] "Added comment about Java 7 u51 +" [operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/140207 (owner: 10Ottomata) [14:11:53] (03PS3) 10Ottomata: Fix for KAFKA_MIRROR_START variable 
in kafka-mirror.init script [operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/140208 [14:12:06] (03CR) 10Ottomata: [C: 032 V: 032] Fix for KAFKA_MIRROR_START variable in kafka-mirror.init script [operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/140208 (owner: 10Ottomata) [14:12:15] (03PS3) 10Ottomata: Not starting kafka and kafka-mirror during postinstall [operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/140209 [14:13:47] (03CR) 10Ottomata: "How would moving those checks into kafka_mirror_sh() change the behavior?" [operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/140209 (owner: 10Ottomata) [14:13:57] (03CR) 10Giuseppe Lavagetto: [C: 032] puppet3: bits appservers [operations/puppet] - 10https://gerrit.wikimedia.org/r/140362 (owner: 10Giuseppe Lavagetto) [14:14:23] akosiaris: the kafka init start stuff is really annoying, and I'm not sure I understand why we would ever want kafka and especially kafka mirror to start on install [14:14:56] kafka base configs probably can start a running broker ok, IF KAFKA_START was set to 'yes' in default file [14:15:02] but kafka-mirror wouldn't [14:15:11] as kafka mirror is meant to work with at least 2 separate kafka clusters [14:15:28] and there's now way we can know what configs another cluster might have [14:15:31] no way* [14:17:43] ottomata: that is one thing which is more like a policy decision. I am not against doing it but it would not only hide the actual problem under the carpet and not solve it. We already have the no/yes flag and we are not using it which is why that bug bites us [14:20:49] (03PS2) 10Giuseppe Lavagetto: puppet3: imagescalers 1 of 2 [operations/puppet] - 10https://gerrit.wikimedia.org/r/140363 [14:20:53] OH, you are saying if I move it into kafak_mirror_sh, that the no flag should make this ok? 
[14:20:58] start flag [14:20:58] hm [14:20:59] reading [14:21:10] oh hm [14:21:10] yeah [14:21:12] it exit 0s [14:21:14] if no [14:21:15] hmm [14:21:16] ok [14:22:51] (03CR) 10Giuseppe Lavagetto: [C: 032] puppet3: imagescalers 1 of 2 [operations/puppet] - 10https://gerrit.wikimedia.org/r/140363 (owner: 10Giuseppe Lavagetto) [14:29:23] (03PS1) 10Ottomata: Parameterize redis config template [operations/puppet] - 10https://gerrit.wikimedia.org/r/140371 [14:29:54] (03PS2) 10Ottomata: Parameterize redis config template [operations/puppet] - 10https://gerrit.wikimedia.org/r/140371 [14:33:14] (03CR) 10Ottomata: [C: 032 V: 032] Parameterize redis config template [operations/puppet] - 10https://gerrit.wikimedia.org/r/140371 (owner: 10Ottomata) [14:33:50] (03CR) 10Milimetric: Add backup role and scripts (031 comment) [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/139557 (https://bugzilla.wikimedia.org/66119) (owner: 10Milimetric) [14:35:18] (03PS2) 10Giuseppe Lavagetto: puppet3: imagescalers 2 of 2 [operations/puppet] - 10https://gerrit.wikimedia.org/r/140364 [14:35:35] (03PS3) 10Giuseppe Lavagetto: puppet3: imagescalers 2 of 2 [operations/puppet] - 10https://gerrit.wikimedia.org/r/140364 [14:35:44] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] puppet3: imagescalers 2 of 2 [operations/puppet] - 10https://gerrit.wikimedia.org/r/140364 (owner: 10Giuseppe Lavagetto) [14:36:15] <_joe_> ottomata: should I merge your change? [14:37:10] <_joe_> whatever, I'm going to. 
[14:37:39] (03PS2) 10Giuseppe Lavagetto: puppet3: move remaining appservers [operations/puppet] - 10https://gerrit.wikimedia.org/r/140365 [14:38:11] (03CR) 10Giuseppe Lavagetto: [C: 032] puppet3: move remaining appservers [operations/puppet] - 10https://gerrit.wikimedia.org/r/140365 (owner: 10Giuseppe Lavagetto) [14:38:32] (03CR) 10Giuseppe Lavagetto: [V: 032] puppet3: move remaining appservers [operations/puppet] - 10https://gerrit.wikimedia.org/r/140365 (owner: 10Giuseppe Lavagetto) [14:38:55] oh sorry, ja that's cool _joe_ [14:38:55] thanks [14:39:23] <_joe_> yes looked at the change, it was a noop basically [14:40:08] ja [14:45:09] (03PS1) 10Milimetric: Revert "Stop relying on limited redis module" [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/140374 [14:45:39] (03CR) 10Milimetric: [C: 032] Revert "Stop relying on limited redis module" [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/140374 (owner: 10Milimetric) [14:46:39] (03Abandoned) 10Milimetric: Revert "Stop relying on limited redis module" [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/140374 (owner: 10Milimetric) [14:49:39] (03PS1) 10BBlack: add XPS for bnx2 (etc) to interface-rps.py [operations/puppet] - 10https://gerrit.wikimedia.org/r/140376 [14:51:30] manybubbles: I'll SWAT today [14:51:41] anomie: I saw you had something in it [14:52:01] (03CR) 10BBlack: [C: 031] "May or may not merge this, depending how kernel upgrades go on lvs100x (whether the new generic support for tx queue selection in recent k" [operations/puppet] - 10https://gerrit.wikimedia.org/r/140376 (owner: 10BBlack) [14:52:23] Yeah, may as well fix Scribunto on testwiki today instead of waiting for tomorrow. [14:52:38] are lvs100x bnx2? [14:52:45] yeah [14:53:15] they have rss/rps/xps support in theory, with a fixed set of 8 queues in each direction [14:53:28] why did I always thought we had 10G at eqiad? 
[14:53:34] beats me :) [14:54:41] related: 3.13.0-30 is *still* only in proposed. is there some normal way to see why? like, a status or ticket or something somewhere that covers whether something's held up in proposed for a reason? [14:55:54] (03CR) 10BBlack: [C: 04-1] "heh meant -1 with the above comment" [operations/puppet] - 10https://gerrit.wikimedia.org/r/140376 (owner: 10BBlack) [14:56:02] (03PS1) 10Ottomata: Fix for redis include and apache module changes [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/140378 [14:56:30] (03CR) 10Ottomata: [C: 032 V: 032] Fix for redis include and apache module changes [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/140378 (owner: 10Ottomata) [14:59:00] akosiaris: I have added in cxserver reviewer list :) [15:00:04] manybubbles, anomie: The time is nigh to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140618T1500) [15:00:12] * anomie starts SWAT [15:01:19] (03PS1) 10Ottomata: Use custom redis.conf.erb template for wikimetrics in labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/140380 [15:01:26] (03PS2) 10Ottomata: Use custom redis.conf.erb template for wikimetrics in labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/140380 [15:01:53] (03PS3) 10Ottomata: Use custom redis.conf.erb template for wikimetrics in labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/140380 [15:01:59] (03PS4) 10Ottomata: Use custom redis.conf.erb template for wikimetrics in labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/140380 [15:02:47] * anomie grumbles about anomie not properly preparing the extension update patch for the SWAT [15:04:58] (03PS5) 10Ottomata: Use custom redis.conf.erb template for wikimetrics in labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/140380 (https://bugzilla.wikimedia.org/63664) [15:05:11] (03CR) 10Ottomata: [C: 032 V: 032] Use custom redis.conf.erb template for wikimetrics in labs 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/140380 (https://bugzilla.wikimedia.org/63664) (owner: 10Ottomata) [15:10:03] !log anomie Synchronized php-1.24wmf9/extensions/Scribunto/engines/LuaCommon/SiteLibrary.php: SWAT: Fix Scribunto-related exceptions on testwiki [[gerrit:140370]] (duration: 00m 14s) [15:10:07] Logged the message, Master [15:10:08] * anomie tests [15:10:43] * anomie confirms [15:10:53] * anomie is done with SWAT [15:13:45] (03PS1) 10Ottomata: Fix duplicate require in wikimetrics role [operations/puppet] - 10https://gerrit.wikimedia.org/r/140382 [15:16:14] (03CR) 10Ottomata: [C: 032 V: 032] Fix duplicate require in wikimetrics role [operations/puppet] - 10https://gerrit.wikimedia.org/r/140382 (owner: 10Ottomata) [15:17:27] !log rebooting lvs4004 for kernel / num_queues updates [15:17:32] Logged the message, Master [15:20:18] !log rebooting lvs4003 for kernel / num_queues updates [15:20:22] Logged the message, Master [15:22:36] (03PS7) 10Nuria: Add backup role and scripts [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/139557 (https://bugzilla.wikimedia.org/66119) (owner: 10Milimetric) [15:27:41] (03PS1) 10Faidon Liambotis: Remove SPF RR records (but keep SPF TXTs) [operations/dns] - 10https://gerrit.wikimedia.org/r/140385 [15:28:09] apergos: do you take care of the dumps? can you point me to some current documentation and/or source code? I see a lot on wikitech but I'm not sure what is current [15:28:16] (03CR) 10Faidon Liambotis: [C: 032] Remove SPF RR records (but keep SPF TXTs) [operations/dns] - 10https://gerrit.wikimedia.org/r/140385 (owner: 10Faidon Liambotis) [15:36:45] manybubbles: afaik https://git.wikimedia.org/tree/operations%2Fdumps.git/refs%2Fheads%2Fariel is current but i don't know all the details :/ [15:37:03] last time i made a patch (to add a table to the dumps), it was to there [15:39:59] aude: thanks! 
I found the directory but hadn't looked in the branch [15:40:06] there are some readmes [15:40:18] don't think this makes what we want to do easy though :/ [15:43:51] (03PS1) 10BBlack: *really* disable irqbalance if doing RSS [operations/puppet] - 10https://gerrit.wikimedia.org/r/140389 [15:44:34] (03PS2) 10Faidon Liambotis: Kill $project-lb.$site.wikimedia.org and free IPs [operations/dns] - 10https://gerrit.wikimedia.org/r/140149 [15:44:36] (03PS3) 10Faidon Liambotis: Kill $project-lb.wikimedia.org IPs [operations/dns] - 10https://gerrit.wikimedia.org/r/140136 [15:44:38] (03PS1) 10Faidon Liambotis: Remove $lang.wikimediafoundation.org [operations/dns] - 10https://gerrit.wikimedia.org/r/140390 [15:44:40] (03PS1) 10Faidon Liambotis: Add text-lb.wikimedia.org and switch CNAMEs to it [operations/dns] - 10https://gerrit.wikimedia.org/r/140391 [15:46:29] (03CR) 10BBlack: [C: 032] *really* disable irqbalance if doing RSS [operations/puppet] - 10https://gerrit.wikimedia.org/r/140389 (owner: 10BBlack) [15:47:17] (03CR) 10Faidon Liambotis: [C: 032] "Obvious enough." [operations/dns] - 10https://gerrit.wikimedia.org/r/140390 (owner: 10Faidon Liambotis) [15:48:02] bblack: updated patches for -lb [15:49:10] I got a message back from Oliver Keyes about the abnormal pageview stats on a few non-existant userspace pages. He has identified a massive botnet that is repeatedly hitting those pages. [15:49:35] Wikipedia userpages are likely being used for command and control messages for these botnets, that's my theory at least. [15:50:00] DELETE! [15:50:37] His angle is mostly for fixing the stats, but we probably need to have a discussion about security/liability aspects. 
[15:50:42] Reedy: re-read "nonexistant" ;) [15:50:53] That's the target [15:50:58] not the source [15:51:08] Gigs-: I think it's a bad idea to have that discussion in public [15:51:29] it is a little beansy but clearly it's already out there [15:51:41] (03PS1) 10Ottomata: Puppetizing cp3015-cp3018 as cache uploads [operations/puppet] - 10https://gerrit.wikimedia.org/r/140392 [15:51:55] the effect is, the mitigation strategy isn't [15:51:58] user:cunard is the top page involved and it's protected and non-existent ... my theory is that there's a "default setting" of user:cunard in some canned bot [15:52:11] that the user is supposed to change but some are too stupid to [15:52:34] sure [15:53:17] (03PS2) 10Ottomata: Puppetizing cp3015-cp3018 as cache uploads [operations/puppet] - 10https://gerrit.wikimedia.org/r/140392 [15:54:47] well just making sure the right people are aware of the issue... I think you guys are the right people :P [15:55:07] I was the one to forward this to analytics, so yeah we are aware [15:55:35] can't say I care too much, but I'm looped in regardless :P [16:04:35] (03PS10) 10Nemo bis: Gather all soft-disabled uploads wikis in one config item [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/134400 [16:07:19] (03CR) 10Anomie: [C: 04-1] "Config changes that aren't mentioned in the summary:" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/134400 (owner: 10Nemo bis) [16:08:57] (03CR) 10BBlack: [C: 031] Add text-lb.wikimedia.org and switch CNAMEs to it [operations/dns] - 10https://gerrit.wikimedia.org/r/140391 (owner: 10Faidon Liambotis) [16:09:25] (03CR) 10Nemo bis: Gather all soft-disabled uploads wikis in one config item (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/134400 (owner: 10Nemo bis) [16:09:34] manybubbles: I do, the wikitech stuff is more or less good except for the directories things run out of and the user, [16:09:54] and deployment is done differently now too but as far 
as the code and all the steps etc it's all current [16:10:05] (03CR) 10Anomie: [C: 04-1] "PS10 fixes the incubatorwiki issue. The others I mentioned for PS9 remain." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/134400 (owner: 10Nemo bis) [16:10:45] (03CR) 10Nemo bis: "Sorry, I amended and commented before seeing your review." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/134400 (owner: 10Nemo bis) [16:10:51] apergos: there is a branch with newer stuff in it - origin/arial - do we deploy master or it? [16:11:11] ariel [16:11:26] that info should be plastered all over the docs [16:11:57] * Reedy hands apergos some [16:12:08] paravoid: I'd give it more like a day between the two commits on the CNAME TTL stuff. borked caches aren't that uncommon. [16:12:16] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [16:13:14] (03CR) 10Anomie: "> Sorry, I amended and commented before seeing your review." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/134400 (owner: 10Nemo bis) [16:13:21] yeah I wasn't suggesting to wait an hour [16:13:39] that's why the commit says "at least one TTL" [16:13:43] ok :) [16:15:41] godog: still around? [16:15:47] manybubbles: I have to run unfortunately, is there any specific piece of the dumps you're poking at?
[16:15:49] ottomata: yup [16:16:18] ok so [16:16:19] https://kafka.apache.org/081/ops.html [16:16:26] apergos: i see there is a phprunner thing inside the python [16:16:45] Scroll down to the Java Version section, where they talk about GC tuning [16:16:51] and then check this [16:16:51] https://gerrit.wikimedia.org/r/#/c/140207/3/debian/kafka.default [16:16:56] you can see two options there [16:17:11] the first one 'Default GC settings' is what we have been running with [16:17:21] the next one are the recommendations from that linked in page [16:17:32] the big difference is the use of a different GC algorithm [16:17:42] we currently have two brokers getting equal traffic [16:17:49] analytics1012 is running with Default [16:17:53] is it possible to hook into that for wikidata and do something special for our dumps? [16:17:55] and analytics1022 is running with newer G1GC [16:19:39] !log rebooting lvs4002 for kernel + num_queues [16:19:43] Logged the message, Master [16:19:54] bblack: can you do a careful review of these when you get the chance? [16:19:55] ottomata: yep, are we collecting gc stats already somewhere from the brokers? [16:20:00] it's not something urgent obviously [16:20:04] yes [16:20:05] i think so [16:20:07] link coming..
[16:20:18] but since it's touching basically all of our sites, it'd be nice to have another set of eyes carefully reviewing it [16:20:21] ok here is an12 [16:20:23] http://ganglia.wikimedia.org/latest/?c=Analytics%20Kafka%20cluster%20eqiad&h=analytics1012.eqiad.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2 [16:20:24] and then an22 [16:20:26] paravoid: yes [16:20:42] you can collapse all metric groups [16:20:47] and expand the jvm memory ones [16:21:12] also, whoever had the idea of having shop/store.ALL OF OUR DOMAINS.org just to redirect to shop.wikimedia.org [16:21:23] despite having a link to the shop on every goddamn wikipedia page [16:21:31] (03PS11) 10Nemo bis: Gather all soft-disabled uploads wikis in one config item [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/134400 [16:21:52] as if people manually type URLs anymore anyways [16:21:54] aude: I wouldn't go that route, I was thinking of this as a one off cron that runs once every couple weeks [16:21:56] google is the new DNS [16:22:02] hoo and I have been chatting about it [16:22:13] huh? [16:22:29] wikidata json dumps [16:22:38] apergos: the json thing is an additional thing... [16:22:54] (03CR) 10Nemo bis: "I was not sure what to do with the ruwiki reupload* rights, they smell of unintentional inheritances. 
But I don't want to dig more bugs ri" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/134400 (owner: 10Nemo bis) [16:22:59] we want to change our serialization format (internal) to be same as the one used by the api [16:23:08] (03CR) 10Dzahn: [C: 032] Re-add Santhosh Thottingal's blog to English Planet [operations/puppet] - 10https://gerrit.wikimedia.org/r/139586 (owner: 10Odder) [16:23:11] apergos: Yeah, still haven't found time to get to the bash stuffs [16:23:13] (03PS12) 10Nemo bis: Gather all soft-disabled uploads wikis in one config item [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/134400 [16:23:14] Maybe tonigh [16:23:15] t [16:23:27] we would like the contents of the xml dumps to be consistent (i suppose the new format) [16:23:28] no worries [16:23:50] mutante: there are two other planet patches by me :) [16:23:51] ottomata: cool, ganglia should be able to compare the two, I think, trying that now [16:24:09] so this sounds like you need something to happen inside Special:Export [16:24:19] paravoid: not all wikipedia pages, only en.wiki that is ~45 % [16:24:25] :p [16:24:32] Nemo_bis: ok, in a minute [16:24:40] !log rebooting lvs4001 for kernel + num_queues [16:24:44] Logged the message, Master [16:25:07] apergos: does the python work with special:export somehow? [16:25:10] if you make your wikidata-specific things happen there this will cover all exports of wikidata pages and will be picked up by the dumps automatically [16:25:12] godog, they have been running that way (with different GCs) for about 20 hours maybe? [16:25:26] there's a series of php maintenance scripts that just invoke that [16:25:42] hmmm, for wmf dumps? [16:25:52] * aude needs to look at the scripts more closely [16:25:59] the python scripts are the rest of the infrastructure (managing the various tables and data being dumped for all the projects) [16:26:03] ok [16:26:18] so you don't even need to go there. (lucky you!)
[16:26:21] then should be doable... [16:26:27] we already have a patch for special:export :) [16:26:32] great! [16:26:35] \o/ [16:26:51] ok, I really really have to run now, sorry... [16:26:55] (03PS2) 10Dzahn: Add Terry Chay to English Wikimedia Planet [operations/puppet] - 10https://gerrit.wikimedia.org/r/139791 (owner: 10Nemo bis) [16:26:57] ok, thanks [16:27:04] (03PS2) 10Dzahn: Add a couple new blogs to the English Wikimedia Planet [operations/puppet] - 10https://gerrit.wikimedia.org/r/139897 (owner: 10Nemo bis) [16:27:04] you can type in pm at me later aand I'll see it, if something comes up [16:27:32] unless I'm dog tired, I'll look at the screen when I get back and before heaing off to bed [16:27:35] see folks later! [16:27:40] later :) [16:28:29] (03CR) 10Dzahn: [C: 032] Add Terry Chay to English Wikimedia Planet [operations/puppet] - 10https://gerrit.wikimedia.org/r/139791 (owner: 10Nemo bis) [16:29:29] (03PS3) 10Dzahn: Add a couple new blogs to the English Wikimedia Planet [operations/puppet] - 10https://gerrit.wikimedia.org/r/139897 (owner: 10Nemo bis) [16:30:16] (03PS1) 10Nuria: Removing hostname from redis backup file [operations/puppet] - 10https://gerrit.wikimedia.org/r/140394 [16:31:44] (03CR) 10Dzahn: [C: 032] Add a couple new blogs to the English Wikimedia Planet [operations/puppet] - 10https://gerrit.wikimedia.org/r/139897 (owner: 10Nemo bis) [16:35:28] Nemo_bis: and ran update already [16:36:21] ottomata: yup, there doesn't seem to be any noticeable difference in the cpu stats afaict, does kafka seem to perform better? [16:36:37] not that I can tell, no [16:40:07] hi mutante [16:40:13] thanks mutante ! 
[16:40:24] (03CR) 10Dzahn: [C: 032] Correct broken links on the Weekly Report page [wikimedia/bugzilla/modifications] - 10https://gerrit.wikimedia.org/r/140341 (https://bugzilla.wikimedia.org/66778) (owner: 10Odder) [16:40:26] I was looking at some of the kafka_stats metrics and the means seem higher for 1022 but meh hard to tell anyways, it doesn't seem to make a noticeable difference for that load ottomata [16:40:49] (03CR) 10Dzahn: [V: 032] Correct broken links on the Weekly Report page [wikimedia/bugzilla/modifications] - 10https://gerrit.wikimedia.org/r/140341 (https://bugzilla.wikimedia.org/66778) (owner: 10Odder) [16:41:11] yeah, i've noticed that too [16:41:13] not exactly sure why [16:41:16] about a ms higher on an22 [16:41:21] but it was like that before the GC switch [16:41:26] and I think i've seen an12 higher too [16:41:45] i've been doing lots of restarting of an22's broker [16:41:50] i think if I restart an12's [16:41:53] while an22's is up [16:41:59] it might switch, not sure [16:42:07] at any given time one broker is the 'controller' of the cluster [16:42:10] (03CR) 10Anomie: [C: 031] "Config-wise, looks good now. I'm not 100% convinced on the added complexity, but the DRY benefit probably outweighs it." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/134400 (owner: 10Nemo bis) [16:42:17] not sure which is right now [16:42:20] maybe an22..?
[16:42:29] matanya3: hi there [16:43:27] ottomata: an22 is running its cpus at 2000MHz, vs 1200 for an12 [16:43:35] cat /proc/cpuinfo | grep MHz [16:43:38] (03CR) 10Dzahn: [C: 032] Remove padlock icon from links in Wikimedia Bugzilla installation [wikimedia/bugzilla/modifications] - 10https://gerrit.wikimedia.org/r/106761 (https://bugzilla.wikimedia.org/59893) (owner: 1001tonythomas) [16:43:52] (03CR) 10Dzahn: [V: 032] Remove padlock icon from links in Wikimedia Bugzilla installation [wikimedia/bugzilla/modifications] - 10https://gerrit.wikimedia.org/r/106761 (https://bugzilla.wikimedia.org/59893) (owner: 1001tonythomas) [16:43:52] !!! [16:43:55] per bblack's comments yesterday about cpufreq [16:44:55] what's really wacky is that on an12, i saw 7 cores @ 1200 and 1 core @ 1300 [16:45:03] hm, weird [16:45:10] er i mean 11 and 1 [16:45:17] (03CR) 10Dzahn: "icons are gone on reload" [wikimedia/bugzilla/modifications] - 10https://gerrit.wikimedia.org/r/106761 (https://bugzilla.wikimedia.org/59893) (owner: 1001tonythomas) [16:45:19] uh, it varies though [16:45:22] each time you check it [16:45:23] it changes [16:45:29] yeah [16:45:49] that is cpufreq auto frequency scaling, we can force them to run at full speed all the time if we want [16:45:55] will at least give us an even comparison to an22 [16:46:20] not sure which is better? but being consistent is important [16:46:46] for measurement, i'd say consistent clock is better [16:46:50] gah! good catch jgage, we should graph the sum of all cpu mhz [16:47:54] https://wiki.debian.org/HowTo/CpuFrequencyScaling [16:48:48] hm. 
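(Editor's sketch, not part of the channel log.) The exchange above compares per-core clock speeds with `cat /proc/cpuinfo | grep MHz` and suggests graphing the sum of all CPU MHz. A rough way to compute that figure might look like the following; the function names `core_mhz` and `summarize` are made up for illustration, and the code assumes the usual Linux `cpu MHz : <value>` line format rather than anything specific to the hosts discussed.

```python
import re


def core_mhz(cpuinfo_text):
    """Return the 'cpu MHz' value of each core found in /proc/cpuinfo-style text."""
    # One 'cpu MHz' line per logical core; separators are tabs/spaces and a colon.
    matches = re.findall(r"^cpu MHz\s*:\s*([\d.]+)", cpuinfo_text, flags=re.MULTILINE)
    return [float(value) for value in matches]


def summarize(cpuinfo_text):
    """Summarize core clocks: core count, summed MHz, slowest and fastest core."""
    freqs = core_mhz(cpuinfo_text)
    return {
        "cores": len(freqs),
        "sum_mhz": sum(freqs),
        "min_mhz": min(freqs, default=0.0),
        "max_mhz": max(freqs, default=0.0),
    }
```

On a Linux host this could be fed with `summarize(open("/proc/cpuinfo").read())`. Because frequency scaling changes the readings from one sample to the next, as noted in the conversation, a single snapshot is only indicative; sampling repeatedly and graphing the sum would give a fairer comparison between two machines.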
[16:48:49] ok [17:00:04] yurik: The time is nigh to deploy Wikipedia Zero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140618T1700) [17:07:40] ok godog, i'll fix the cpufreq thing, but I'm thikning of just going with the newer recommended GC settings [17:09:49] !log apache2ctl restart on magnesium, racktables wasn't working [17:09:54] Logged the message, RobH [17:10:00] !log magnesium back to proper function [17:10:04] Logged the message, RobH [17:21:17] (03PS1) 10Yurik: Updated Firefox OS app to master [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140407 [17:21:55] (03CR) 10Dr0ptp4kt: [C: 032] Updated Firefox OS app to master [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140407 (owner: 10Yurik) [17:22:08] (03Merged) 10jenkins-bot: Updated Firefox OS app to master [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140407 (owner: 10Yurik) [17:26:10] !log yurik Synchronized docroot/bits/WikipediaMobileFirefoxOS/: (no message) (duration: 01m 04s) [17:26:15] Logged the message, Master [17:31:12] (03PS13) 10Reedy: Gather all soft-disabled uploads wikis in one config item [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/134400 (owner: 10Nemo bis) [17:32:30] !log yurik Synchronized php-1.24wmf8/extensions/: Updating JsonConfig, ZeroBanner, ZeroPortal (duration: 01m 15s) [17:32:35] Logged the message, Master [17:34:16] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:35:47] (03PS1) 10Yurik: Updated Firefox OS app to master [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140410 [17:35:59] !log yurik Synchronized php-1.24wmf9/extensions/: Updating JsonConfig, ZeroBanner, ZeroPortal (duration: 01m 14s) [17:36:04] Logged the message, Master [17:36:16] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 9.216 second response time [17:36:41] (03CR) 10Dr0ptp4kt: [C: 032] Updated 
Firefox OS app to master [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140410 (owner: 10Yurik) [17:36:47] (03Merged) 10jenkins-bot: Updated Firefox OS app to master [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140410 (owner: 10Yurik) [17:39:52] !log yurik Synchronized docroot/bits/WikipediaMobileFirefoxOS/: (no message) (duration: 01m 09s) [17:39:58] Logged the message, Master [17:44:16] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected [17:48:42] (03CR) 10Faidon Liambotis: [C: 04-1] "a) This needs to be updated for the new Apache stuff (it still uses "httpd", for instance)." [operations/puppet] - 10https://gerrit.wikimedia.org/r/133274 (owner: 10Ori.livneh) [17:51:43] (03CR) 10Dzahn: [C: 032] beta: add role::cache::configuration::backends['labs']['bits'] [operations/puppet] - 10https://gerrit.wikimedia.org/r/140019 (owner: 10BryanDavis) [17:52:28] (03CR) 10Dzahn: "just adds deployment-apache01/02 as for the other backends in beta/labs" [operations/puppet] - 10https://gerrit.wikimedia.org/r/140019 (owner: 10BryanDavis) [18:01:06] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: Fetching origin [18:01:36] (03CR) 10Dzahn: [C: 032] bugzilla - replace generic::systemuser with user [operations/puppet] - 10https://gerrit.wikimedia.org/r/137995 (owner: 10Rush) [18:04:06] RECOVERY - Unmerged changes on repository puppet on strontium is OK: Fetching origin [18:04:33] OK: Fetching ... ok [18:09:53] andrewbogott: https://gerrit.wikimedia.org/r/#/c/138480/ if you like [18:10:47] greg-g, i have monitoring everything for the past 30 min or so, all's good. Pls ping dr0ptp4kt if anything goes crazy, he will be able to reach me. [18:11:14] why are selenium user's edits not being marked as patrolled on beta? 
[18:11:50] (03CR) 10Dzahn: [C: 032] planet - replace generic::systemuser with user [operations/puppet] - 10https://gerrit.wikimedia.org/r/137990 (owner: 10Rush) [18:11:58] yurikR: thanks sir [18:12:02] (03PS4) 10Dzahn: planet - replace generic::systemuser with user [operations/puppet] - 10https://gerrit.wikimedia.org/r/137990 (owner: 10Rush) [18:12:15] (03CR) 10Andrew Bogott: [C: 04-1] "Please leave in the site-switching logic and comments; soon tools will have hosts in both Ashburn and Texas and we'll need to be site-awar" [operations/puppet] - 10https://gerrit.wikimedia.org/r/138480 (owner: 10Tim Landscheidt) [18:13:24] ori or YuviPanda, is there any support for https in mwvagrant? [18:13:40] andrewbogott: there's a role, yeah [18:13:45] andrewbogott: uses nginx. [18:14:00] andrewbogott: why? [18:14:01] Ah, hm. That might not help. [18:14:12] andrewbogott: labs-vagrant should just use proxy tho :) [18:14:45] Sure. Just, because I'm trying to replicate wikitech and wikitech uses https… I'm not sure if the issues I'm seeing are related to that. [18:14:46] Probably not. [18:15:05] right [18:15:10] Mostly I just can't make the damn sidebar work. /that/ definitely doesn't have to do with https [18:15:45] heh [18:16:29] I am getting a ton of "Did not find alias for special page 'NovaInstance'. Perhaps no aliases are defined for it?" and I don't know what that's about or if I should care. [18:16:43] Why would there be aliases? Why does mw care? [18:17:22] * Nemo_bis has vague memories of rebuilding l10n cache helping [18:17:33] Ah, ok, how do I do that? [18:17:55] andrewbogott: yeah, that's to do with l10n aliases, not apache [18:18:07] rebuildLocalisationCache.php [18:18:08] ? [18:18:10] assuming you actually have an alias.i18n file :) [18:18:14] it's a mw maintenance script [18:18:20] yeah, waht Nemo_bis said [18:18:54] * andrewbogott tries... 
[18:19:33] (03PS3) 10Ottomata: Puppetizing cp3015-cp3018 as cache uploads [operations/puppet] - 10https://gerrit.wikimedia.org/r/140392 [18:19:59] (03CR) 10Ottomata: [C: 032 V: 032] Puppetizing cp3015-cp3018 as cache uploads [operations/puppet] - 10https://gerrit.wikimedia.org/r/140392 (owner: 10Ottomata) [18:20:10] papaul: ping [18:20:55] andrewbogott: if this is the extension, you're missing an entry here https://git.wikimedia.org/blob/mediawiki%2Fextensions%2FOpenStackManager/caa2767b4c73a4ebe233e56e7b41b5cab2ed92bd/OpenStackManager.alias.php#L13 [18:21:53] cmjohnson1, yep [18:21:58] papaul: does that setup work better? [18:22:00] and in fact I can't translate NovaInstance anywhere https://translatewiki.net/w/i.php?language=it&module=special&title=Special%3AAdvancedTranslate [18:22:32] (03PS5) 10Dzahn: planet - replace generic::systemuser with user [operations/puppet] - 10https://gerrit.wikimedia.org/r/137990 (owner: 10Rush) [18:23:04] Nemo_bis: the "Nova" part is a proper noun though [18:23:10] cmjohnson1, yes it does the msw will be difficult to access but the asw not [18:23:34] you will have to work with it so as long as you like it then make it happen [18:23:40] hm, well, that rebuild filled up /var and broke everything :) [18:23:54] Ok will do that [18:24:01] mutante: well, translations can be fixed if they don't respect that [18:24:14] Nemo_bis: true [18:24:36] You can also add a code comment so that siebrand notices when he commits the translations [18:25:33] cmjohnson1, ok about to go back in was on lunch break [18:26:08] okay...lmk if you need anything else. [18:27:38] cmjohnson1, thanks ii will once i am back again in the break room where i have connectivity [18:43:21] Nemo_bis, mutante, I'm still confused. A rebuild didn't help, and the .alias.php file is the same on my test box as in production (where we seem not to have that issue) [18:44:55] andrewbogott: on a prod server /var is like 6GB, how much did it grow on labs? 
[18:45:08] mutante: Oh, I resolved the /var issue [18:45:15] and then rebuilt again [18:47:18] andrewbogott: i dunno better than "Nemo_bis has vague memories of rebuilding l10n cache helping:" :/ [18:47:28] 'k [18:48:13] andrewbogott: try #mediawiki-i18n [18:48:20] it must be translation related [18:50:26] maybe this will make more sense after lunch :) [18:51:10] andrewbogott: it seems the translation people just have to add that string [18:51:27] on translatewiki [18:51:58] that doesn't account for the difference in prod vs. testing does it? [18:53:11] andrewbogott: maybe prod's on an ancient mw core install too? [18:53:16] andrewbogott: prod probably doesnt use "NovaInstance" [18:53:26] it just doesnt appear there i guess [18:53:55] so then the error doesnt show up either [18:54:10] what do you mean "doesn't use..." [18:54:17] the string NovaInstance [18:54:31] it's not a special page in prod [18:54:36] and not a translated string [18:54:42] Oh, well... [18:54:53] I'm cutting urls from prod and pasting them into testing [18:54:58] So, definitely the same page title. [18:55:38] http://en.wikipedia.org/wiki/Special:NovaInstance No such special page [18:55:49] https://wikitech.wikimedia.org/wiki/Special:NovaInstance [18:56:11] vs http://wikitech-test.wmflabs.org/wiki/Special:NovaInstance [18:56:52] ohh, when saying "prod" i did not have wikitech in mind [18:57:51] http://wikitech-test.wmflabs.org/wiki/Special:SpecialPages [18:57:58] that's all special pages ? [18:58:01] not just that one [18:58:58] yeps! [19:00:08] mutante: technically, mostly all [19:00:21] (some boring ones are not listed there) [19:01:34] hundreds special pages are unlisted, last time I checked (considering all extensions) [19:02:06] the special page has to be added to alias.php anyway, why bother investigating other solutions? :) [19:03:01] Nemo_bis: ok, and the reason I don't get those errors on prod is… just an old mw version? 
[19:03:03] 1.23 on prod [19:03:23] They've needed to be added to alias for ages [19:04:30] weird [19:04:33] ok, well, anyway… lunchtime [19:14:21] (03PS4) 10Ottomata: Move mirror maker argument checking into start func [operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/140209 [19:26:35] (03CR) 10Matanya: mirror.pp - retab, unquoted resource titles, lint (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/139467 (owner: 10Dzahn) [19:27:28] (03CR) 10Matanya: [C: 04-1] "dup of https://gerrit.wikimedia.org/r/#/c/139681/2" [operations/puppet] - 10https://gerrit.wikimedia.org/r/139467 (owner: 10Dzahn) [20:00:04] gwicke, subbu: The time is nigh to deploy Parsoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140618T2000) [20:21:28] grrrit-wm: . [20:23:47] !log deployed Parsoid 88a61f81 (deploy repo sha 470a5ef2) [20:23:51] Logged the message, Master [20:25:38] (03CR) 10Dzahn: "grrit-wm test message" [operations/puppet] - 10https://gerrit.wikimedia.org/r/139467 (owner: 10Dzahn) [20:25:52] odd, it's like the bot just skipped one [20:26:51] (03Abandoned) 10Dzahn: mirror.pp - retab, unquoted resource titles, lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/139467 (owner: 10Dzahn) [20:27:06] sorry about that mutante ^ [20:27:27] matanya: hey, no problem, just duplicate [20:27:47] i tried hitting rebase to see if that makes it a no-op, but meh.. [20:27:49] i'll push now the most complicated lint i ever did [20:27:53] i'm sure you did the same thing [20:27:59] oh? site.pp ? [20:28:05] (03PS1) 10Matanya: nove: lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/140561 [20:28:05] no, worse [20:28:16] andrewbogott: ^^ [20:28:25] s/nove/nova [20:28:30] yes [20:28:31] that one? 
[20:28:41] i think in number of lines changed you had larger ones [20:29:01] yes, but it was broken in so many ways [20:29:09] ah [20:29:22] site.pp was straitforeward [20:29:35] just dangerous [20:29:54] can we still remove more from admins.pp ? [20:30:13] you need to ask chasemp [20:30:39] yea yea, i already talked to him about the generic::systemuser patches and was off for 2 days [20:30:53] next getting some more of them merged [20:31:10] (03PS2) 10Matanya: nova: lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/140561 [20:31:10] getting rid of generic::systemuser that is [20:31:32] yeah, that would be a good move [20:31:52] i want to get jenkins to vote on lint [20:32:04] matanya: i have more lint changes you are added on btw :) [20:32:15] without 80chars [20:32:19] also to avoid duplicates [20:32:42] matanya: i also want that, i thought we ask for it in the moment we removed all real tabs [20:32:56] hence the etherpad with the files that still use them [20:32:59] not enough [20:33:14] you will still get errors [20:33:21] http://etherpad.wikimedia.org/p/tabsinpuppet [20:33:40] note how some of them have things that seem to be legitimate tabs [20:34:30] yes, i'm on this pad [20:34:50] :) [20:35:32] we should remove stuff that was merged [20:35:43] yes [20:36:28] topic:bye_systemuser :) https://gerrit.wikimedia.org/r/#/q/project:operations/puppet+topic:bye_systemuser,n,z [20:37:17] (03PS4) 10Dzahn: icinga - replace generic::systemuser with user [operations/puppet] - 10https://gerrit.wikimedia.org/r/138006 (owner: 10Rush) [20:37:27] so why was the back and forth on this ? 
[20:37:37] (03CR) 10jenkins-bot: [V: 04-1] icinga - replace generic::systemuser with user [operations/puppet] - 10https://gerrit.wikimedia.org/r/138006 (owner: 10Rush) [20:40:24] (03PS4) 10Awight: Meta: automatic translation workflow state changes [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137804 [20:44:16] (03CR) 10Andrew Bogott: [C: 031] "Now my eyes are tired" [operations/puppet] - 10https://gerrit.wikimedia.org/r/140561 (owner: 10Matanya) [20:44:27] matanya, have you already verified that ^ is a noop? [20:44:32] matanya: chase and i started out by just following the existing thing (generic::systemuser) expecting the abstraction layer is wanted, then paravoid and ori reviewed and pointed out we should actually just remove it everywhere. either way, the goal was to have it unified and not half and half [20:44:41] I have andrewbogott [20:44:52] ok, I will merge shortly... [20:45:26] matanya: fyi https://gerrit.wikimedia.org/r/#/c/137963/ [20:45:56] oh, i saw this one on my phone, and didn't get the reason [20:45:56] (03PS5) 10Awight: Meta: automatic translation workflow state changes [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137804 [20:45:59] now it is clear [20:46:48] matanya: yep, so exec summary is like "The right way to go here is to keep using system => true to mark system [20:46:57] users [20:47:10] ok, good to know [20:49:16] (03PS1) 10Matanya: mha: lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/140565 [20:49:28] (03PS5) 10Dzahn: icinga - replace generic::systemuser with user [operations/puppet] - 10https://gerrit.wikimedia.org/r/138006 (owner: 10Rush) [20:50:03] mutante: i have a deal for you, you do the ones in submodules, and i'll do the rest [20:51:44] hah :P the other way around, did we even check all submodules yet.. hold on, one by one [20:52:11] matanya: sorry, puppet is broken on virt1000; need to fix that before I can merge your change. [20:52:20] ori, are you around? 
I have a question about apache refactors [20:52:56] (03PS6) 10Dzahn: icinga/snmptt- replace generic::systemuser with user [operations/puppet] - 10https://gerrit.wikimedia.org/r/138006 (owner: 10Rush) [20:53:17] (03PS7) 10Dzahn: snmptt - replace generic::systemuser with user [operations/puppet] - 10https://gerrit.wikimedia.org/r/138006 (owner: 10Rush) [20:54:16] (03PS8) 10Dzahn: snmptt - replace generic::systemuser with user [operations/puppet] - 10https://gerrit.wikimedia.org/r/138006 (owner: 10Rush) [20:54:46] (03CR) 10Dzahn: [C: 032] snmptt - replace generic::systemuser with user [operations/puppet] - 10https://gerrit.wikimedia.org/r/138006 (owner: 10Rush) [20:55:38] watches icinga / snmptt monitoring stuff because of that user change [20:56:11] mutante: I see boxes where the puppet catalog is failing but icinga is showing puppet freshness as green [20:56:18] does that… surprise you? [20:56:47] andrewbogott: yes, i don't that is related to what i just merged [20:56:52] it wasnt even applied yet [20:57:01] It's not related, it's been true since yesterday. [20:57:04] only now i am running puppet on neon [20:57:14] But I'm wondering if it's one problem (puppet breakage) or two (puppet breakage + monitoring fail) [20:57:21] hmm,ok, yea, it still surprises me but i didnt see any recent changes, was off [20:57:48] hmmm, there were some mini changes to the output format of that .. but besides that [20:58:11] (03CR) 10Nikerabbit: [C: 031] Meta: automatic translation workflow state changes [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137804 (owner: 10Awight) [20:58:12] who would I ask about icinga/puppet checks? [20:58:12] andrewbogott: what's an example box where that is the case? [20:58:17] mutante: puppet masters. [20:58:21] all of 'em I think [20:59:32] andrewbogott: oh, puppet3 upgrade [20:59:42] you think that broke monitoring? 
[20:59:52] i think it's suspicious [21:00:04] yurik: The time is nigh to deploy Wikipedia Zero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140618T2100) [21:00:05] !log rebooting lvs1006 for kernel/bios stuff [21:00:10] Logged the message, Master [21:00:19] _joe_: still working? [21:00:42] andrewbogott: i see what you mean on palladium [21:00:44] Duplicate declaration: Package[libapache2-mod-passenger] is already declared [21:00:53] mutante: kafak,kafkatee and varnishkafka are in the list and they are submodules [21:01:00] that sounds very much like Apache refactor [21:01:11] please please run puppet after merge [21:01:26] mutante: yes, the failure is pretty simple. But icinga really should've told us! [21:02:18] andrewbogott: yes [21:04:21] (03PS1) 10Andrew Bogott: Don't declare Package[libapache2-mod-passenger] in puppetmaster [operations/puppet] - 10https://gerrit.wikimedia.org/r/140571 [21:05:19] (03Abandoned) 10Kaldari: Turning on Translate Extension for foundationwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140231 (owner: 10Kaldari) [21:05:27] the apache refactor happened over like 6 commits, too [21:05:53] (03CR) 10Dzahn: [C: 031] "yes please, we have Duplicate declaration: Package[libapache2-mod-passenger] on puppetmasters" [operations/puppet] - 10https://gerrit.wikimedia.org/r/140571 (owner: 10Andrew Bogott) [21:05:57] it's a pain to see it all [21:06:08] (but on the other hand, it's also a pain to review giant commits) [21:06:27] (03CR) 10Andrew Bogott: [C: 032] Don't declare Package[libapache2-mod-passenger] in puppetmaster [operations/puppet] - 10https://gerrit.wikimedia.org/r/140571 (owner: 10Andrew Bogott) [21:08:39] !log reviving puppet runs on puppetmasters, via https://gerrit.wikimedia.org/r/#/c/140571/ [21:09:10] yikes, getting a pretty big diff on palladium. Hopes I don't breaks the puppets [21:09:48] the combination with icinga check being broken.. 
grmbl :p [21:09:57] tries to figure out why it would claim OK [21:10:33] that should only happen as long as packets keep coming in, they are passive checks [21:11:11] palladium seems happy now, and the few clients I'm looking at are also happy. [21:11:31] cool [21:12:29] (03CR) 10Andrew Bogott: [C: 032] nova: lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/140561 (owner: 10Matanya) [21:12:54] wikitech is broken, is that known? [21:13:10] andrewbogott: ^ [21:13:39] paravoid: Um… puppet was broken until just a minute ago [21:13:46] so probably a working puppet run broke it. I'm looking... [21:14:02] whoah, ugly [21:14:04] wee, more issues [21:14:30] paravoid: by 'broken' you mean that it lacks static content? [21:14:34] Or are you seeing something more drastic? [21:14:42] no, it redirects to virt1000.wm.org [21:14:49] and then cert warning and such [21:14:49] by broken I mean it redirects me to https://virt1000.wm.org [21:14:58] what mutante said [21:15:00] dang [21:15:19] ok, um… does anyone know about recent apache refactors other than ori? [21:15:58] !log stopping pybal on lvs1003 to test lvs1006 setup [21:17:11] meh wikitech [21:18:04] i also got a weird blip on office. was logged in, submitted an edit, it said i needed to auth, auth page told "you're already logged in" [21:18:32] ok, wikitech is working up until the next puppet run when it breaks again :( [21:20:59] do we know when the puppet runs broke on the masters? [21:21:11] just to rule out it's been longer than the icinga check checks for [21:21:32] mutante: I don't, precisely. [21:22:07] mutante: but you can look when the change that I just fixed was merged... [21:22:12] that was yesterday I think? [21:23:35] "The approach of the new apache module is to provision files [21:23:36] directly in sites-enabled rather than symlinks to files in sites-available." [21:23:47] ewwww [21:23:48] is that right? 
[21:23:53] bad ori [21:23:59] i can't say i like it [21:24:06] I can say I don't [21:24:25] https://gerrit.wikimedia.org/r/#/c/140218/ [21:24:45] Yeah, I was just noticing/disliking that but hadn't got around to saying anything [21:25:36] MaxSem, is there a procedure for removing an ext from prod? we can obsolete ZRMA [21:26:21] (03CR) 10Faidon Liambotis: "This is widely disliked by opsens:" [operations/puppet] - 10https://gerrit.wikimedia.org/r/140218 (owner: 10Ori.livneh) [21:26:54] yurikR, remove it from mediawiki-config and make-wmf-branch [21:26:56] so, couple things noticed while looking at eqiad lvs logs: search10xx seem to be fairly unstable? they're constantly being depooled and repooled. this is part of some known stuff, right? [21:27:08] no need to physically delete it from servers [21:27:36] and also, the RunCommand checks for apache load use ssh, but they have outdated keys for some reinstalled boxes. [21:27:56] (mw1053, mw1163) [21:34:47] !log rebooting lvs1003 for kernel/bios stuff [21:34:50] Logged the message, Master [21:37:35] (03PS1) 10Awight: Ruthless kludge to provide a special workflow for Fundraising page translation [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140574 [21:37:41] ori: paravoid: Regarding assert-check.js, did you come to a conclusion? [21:37:47] https://gerrit.wikimedia.org/r/#/q/status:open+project:operations/puppet+branch:production+topic:assetcheck,n,z [21:38:30] (03PS6) 10Ottomata: Add CDH5 support, drop CDH4 support [operations/puppet/cdh4] (cdh5) - 10https://gerrit.wikimedia.org/r/135494 [21:38:35] I've come to the conclusion that this doesn't belong to the puppet repo :) [21:38:57] But it's in production now and broken [21:39:38] you can wait until one of us gains experience with it to be able to meaningfully review it? :) [21:40:22] (03PS1) 10Andrew Bogott: Move wikitech back to wikitech.wikimedia.org. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/140576 [21:40:59] sorry for the crappy response, but really, +2 should mean "I understand what this change does and I'm saying it's okay" and I don't think anyone from us can say that [21:41:22] so the way to fix this would be to move it into a separate repository that a different set of people will have +2 rights [21:41:42] (03CR) 10Andrew Bogott: kill apache_site (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/140218 (owner: 10Ori.livneh) [21:42:13] (03PS2) 10Andrew Bogott: Move wikitech back to wikitech.wikimedia.org. [operations/puppet] - 10https://gerrit.wikimedia.org/r/140576 [21:42:14] otherwise, you'll just keep submitting changes and we will keep merging them blindly, which kind of defeats the whole point of code review [21:42:58] ottomata: hey, can you leave the apache reviews to akosiaris? [21:43:13] he was actively working with ori on these, and you +1ed a couple of broken ones :) [21:43:58] 140218 is broken both in concept and in implementation (cf. wikitech breakage) [21:44:09] PROBLEM - Apache HTTP on mw1153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:44:29] PROBLEM - Apache HTTP on mw1159 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:44:34] swift outage, goddamit [21:44:39] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:44:53] (03CR) 10Andrew Bogott: [C: 032] Move wikitech back to wikitech.wikimedia.org. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/140576 (owner: 10Andrew Bogott) [21:45:09] PROBLEM - Apache HTTP on mw1160 is CRITICAL: Connection timed out [21:45:09] PROBLEM - Apache HTTP on mw1158 is CRITICAL: Connection timed out [21:45:09] PROBLEM - Apache HTTP on mw1157 is CRITICAL: Connection timed out [21:45:09] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 21.43% of data exceeded the critical threshold [500.0] [21:45:09] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:45:10] PROBLEM - Apache HTTP on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:45:47] <_joe_> hey just got paged [21:45:58] <_joe_> someone on it I see [21:46:09] PROBLEM - Apache HTTP on mw1156 is CRITICAL: Connection timed out [21:46:59] bblack: I think it's you [21:47:06] really? [21:47:11] yeah [21:47:20] ms-fe1001/1002 are not getting any traffic for some reason [21:47:26] hmmm [21:47:55] lvs1003 is not in subnet [21:48:27] was the subnet manual and not in puppet perhaps? 
[21:48:51] although it seems odd this didn't happen when I first flipped traffic to a recently-rebooted lvs1006 [21:49:02] <_joe_> meanwhile, imagescalers outage [21:49:20] (03CR) 10Nemo bis: Ruthless kludge to provide a special workflow for Fundraising page translation (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140574 (owner: 10Awight) [21:49:33] it's a swift outage [21:49:36] I depooled ms-fe1001/1002 [21:49:42] until we figured this out [21:49:49] s/d// [21:49:59] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 2.891 second response time [21:50:39] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 68553 bytes in 9.967 second response time [21:50:45] auto eth0.1017 [21:50:46] iface eth0.1017 inet static address 10.64.1.3 netmask 255.255.252.0 [21:50:48] but it's not up [21:50:50] (03PS2) 10Awight: Ruthless kludge to provide a custom translation workflow for the Fundraising Thank-you letter [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140574 [21:50:59] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.072 second response time [21:51:00] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.070 second response time [21:51:00] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.066 second response time [21:51:09] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 7.518 second response time [21:51:19] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [21:51:46] it's the only vlan on eth0, too [21:51:59] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.094 second response time [21:52:41] !log disable pybal on 
lvs1003, since 1006 seems to have all its interfaces :P [21:52:45] I'm gonna try ifup'ing, unless you're still debugging [21:52:46] Logged the message, Master [21:53:12] ack? [21:54:00] andrewbogott: wikitech is still broken for me [21:54:16] bblack: it worked... [21:54:18] re-broken after next puppet run ? [21:54:19] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.047 second response time [21:54:22] paravoid: you have a knack for checking at just the right time :) [21:54:25] it worked a few minutes ago [21:54:26] heh, sorry [21:54:30] (03PS1) 10Danny B.: Set $wgCategoryCollation to 'uca-sk' on skwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140580 [21:54:56] <_joe_> paravoid: I fear the lvs1003 thing may have to do with puppet? [21:55:00] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.067 second response time [21:55:02] _joe_: no [21:55:12] paravoid: I suspect it's some kind of race on reboot for link state on eth0 vs ifconfig of eth0.1017 or something [21:55:27] yeah quite possibly [21:55:29] <_joe_> oh ok reboot, nevermind [21:56:13] pybal should really do a "make sure this is in-subnet if it's LVS-DR" kind of check :/ [21:56:34] and then not advertise that route if not? [21:56:36] ntpd thinks it can listen normally on that interface, fwiw [21:56:40] (03PS1) 10Yurik: Removing ZeroRatedMobileAccess ext (obsolete) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140581 [21:56:44] (can we split traffic per-service-ip like that?) [21:56:45] ntpd[12691]: Listen normally on 30 eth0.1017 fe80::1a03 ... 
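[editor's note] The "make sure this is in-subnet if it's LVS-DR" check suggested at [21:56:13] could look roughly like the sketch below. It is not pybal code. The function name, the interface dictionaries and the backend addresses are hypothetical; only the eth0.1017 address and netmask are taken from the config quoted at [21:50:46].

```python
import ipaddress


def directly_reachable(backend_ip, local_interfaces):
    """Return True if backend_ip lies in a subnet configured on an interface that is up.

    local_interfaces is a hypothetical list of dicts with the keys
    "name", "address", "netmask" and "up".
    """
    addr = ipaddress.ip_address(backend_ip)
    for iface in local_interfaces:
        if not iface["up"]:
            # Configured but down, as eth0.1017 was after the reboot.
            continue
        # strict=False lets us pass a host address plus netmask.
        net = ipaddress.ip_network(
            "{}/{}".format(iface["address"], iface["netmask"]), strict=False
        )
        if addr in net:
            return True
    return False


# Interface values from the log; the "up" flag models the two states seen.
eth0_1017_up = {"name": "eth0.1017", "address": "10.64.1.3",
                "netmask": "255.255.252.0", "up": True}
eth0_1017_down = dict(eth0_1017_up, up=False)

# 10.64.0.38 is a made-up backend address inside 10.64.0.0/22.
print(directly_reachable("10.64.0.38", [eth0_1017_up]))
print(directly_reachable("10.64.0.38", [eth0_1017_down]))
```

With a check of this kind, a load balancer whose VLAN interface failed to come up would treat the backends behind it as unreachable rather than forwarding traffic it cannot deliver.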
[21:56:54] no, it would just depool ms-fe1001/1002 [21:56:59] and it'd work with ms-fe1003/1004 [21:57:04] yeah ntpd is ugly on LVS right now, it listens on *everything* [21:57:14] I don't think that hurts anything, but it's annoying and I'd like to fix it [21:57:39] now, the fact that it should switch to the backup LVS which would have visibility for both is another matter [21:57:42] and also, something on lvs1003 keeps sending SIGTERM to ntpd [21:57:46] i've suggested a solution for that [21:57:58] sigterm to ntpd is me doing ifdown/ifup cycles [21:58:05] ah [21:58:23] pybal could advertise a MED for the routes [21:58:35] that could be the sum of the weights of all the pooled hosts [21:58:53] so the routers would pick the LVS box which has the best visibility to endservers [21:59:06] think cross-row outages and such [21:59:35] (03PS1) 10Andrew Bogott: Use webserver_hostname instead of controller_hostname in wikitech.wikimedia.org.erb [operations/puppet] - 10https://gerrit.wikimedia.org/r/140582 [21:59:44] anyone, I'l repooling ms-fe1001/1002 [22:00:05] ntp option --novirtualips keeps it off the LVS interfaces at least [22:01:02] "virtual" IPs? [22:01:11] ah, interesting, i was wondering if you wanted it on all interfaces or not, btw, kudos for resolving that ancient ticket [22:01:18] (03CR) 10Andrew Bogott: kill apache_site (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/140218 (owner: 10Ori.livneh) [22:01:23] well, the LVS:foo on lo [22:01:34] what's a virtual IP? 
:) [22:01:35] it doesn't prevent listening on the vlan interfaces [22:02:24] (03CR) 10Andrew Bogott: [C: 032] Use webserver_hostname instead of controller_hostname in wikitech.wikimedia.org.erb [operations/puppet] - 10https://gerrit.wikimedia.org/r/140582 (owner: 10Andrew Bogott) [22:02:43] paravoid: I think their definition of a virtual IP is the ones that aren't in ifconfig but are in ip link [22:02:57] hmm no that can't be it either [22:04:15] anyways, I could just have ntp only listen on eth0, too [22:04:26] that should be good for all LVS? [22:04:44] "virtual IP" = eth3.SOMETHING as opposed to eth3 [22:04:50] ? [22:05:10] that would fail with e.g. bond0, which isn't a very unlikely scenario for LVS [22:05:22] yeah [22:05:37] mutante: no, because it still listens on vlans like eth0.1017 [22:05:45] it just makes it skip the lo:LVS ips [22:05:56] ah, i see [22:06:14] although I really don't know what makes those more-virtualier [22:07:11] if (!listen_to_virtual_ips && if_name != NULL [22:07:11] && (strchr(if_name, ':') != NULL)) { [22:07:23] WP definition is " an IP address assigned to multiple applications residing on a single server, multiple domain names, or multiple servers, rather than being assigned to a specific single server or network interface card (NIC)" [22:07:23] colons [22:07:33] foo:bar [22:07:36] heh [22:07:43] <_joe_> paravoid: I would have bet that was the case [22:07:46] they should call it --nocoloninterfaces [22:08:03] which are deprecated on Linux [22:08:09] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0] [22:08:16] (but which we use for ipvs!) [22:08:24] these are now optional labels [22:08:51] Each address may be tagged with a label string. In order to [22:08:54] preserve compatibility with Linux-2.0 net aliases, this string [22:08:57] must coincide with the name of the device or must be prefixed [22:09:00] with the device name followed by colon. 
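[editor's note] The ntpd condition quoted at [22:07:11] amounts to "skip any interface whose name contains a colon". A minimal Python rendering of that one test follows. The function name is hypothetical, and the interface names are the ones mentioned in the discussion above.

```python
def skipped_as_virtual(if_name, listen_to_virtual_ips=False):
    """Mirror the quoted C condition: with virtual-IP listening disabled,
    an interface is skipped only if its name contains a colon."""
    return (not listen_to_virtual_ips
            and if_name is not None
            and ":" in if_name)


for name in ("lo:LVS", "eth0.1017", "eth0", "bond0"):
    print(name, skipped_as_virtual(name))
```

This matches what was observed in the channel: colon-labelled aliases such as lo:LVS are skipped, while VLAN interfaces such as eth0.1017 are still listened on, because their names use a dot rather than a colon.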
[22:10:14] (03CR) 10Ottomata: "I don't think I mind the files directly in sites-enabled. It seems to me that the whole symlink business is more useful if one was admini" [operations/puppet] - 10https://gerrit.wikimedia.org/r/140218 (owner: 10Ori.livneh) [22:10:46] !log turning lvs1003 pybal back on [22:10:51] Logged the message, Master [22:13:59] well, on the brighter side, XPS does work for bnx2 on 3.13.0-30 [22:14:14] (03CR) 10Faidon Liambotis: "Convention is pretty important by itself and diverging from a system's default should be done for good reason, which I don't see here." [operations/puppet] - 10https://gerrit.wikimedia.org/r/140218 (owner: 10Ori.livneh) [22:14:34] they fixed it in the generic case for drivers that don't have their own .ndo_select_queue function, which bnx2 doesn't (and which is where bnx2x had its bug, in it custom selector function) [22:14:46] awesome [22:15:01] (03CR) 10MaxSem: [V: 031] Removing ZeroRatedMobileAccess ext (obsolete) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140581 (owner: 10Yurik) [22:28:42] mutante: still looking at icinga vs. freshness? [22:30:18] !log rebooting lvs1004 + lvs1005 [22:30:24] Logged the message, Master [22:31:22] andrewbogott: got distracted by something else, but i will and/or make a ticket to keep track [22:31:53] mutante: virt1008 doesn't have any actual VMs running on it right now. So i'm going to disable puppet there to provide you with a canary. [22:31:59] PROBLEM - Host misc-web-lb.eqiad.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [22:32:08] andrewbogott: ok,cool [22:32:29] bblack: that you again :) [22:32:31] eh, that is the misc lb on [22:32:35] heh [22:32:43] eh what the hell [22:32:52] I'm rebooting the backups not the live ones [22:33:09] why just IPv6 ? [22:33:38] https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=misc-web [22:34:18] v6 address didnt come back via puppet? 
[22:36:23] (03PS1) 10Faidon Liambotis: exim: sign with DKIM on the mail routers [operations/puppet] - 10https://gerrit.wikimedia.org/r/140584 [22:36:25] I had to finish biosy things, looking now [22:36:25] (03PS1) 10Faidon Liambotis: mail: move wiki-mail-eqiad IP stanzas to site.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/140585 [22:36:27] (03PS1) 10Faidon Liambotis: exim: add all of our domains to wikimedia_domains [operations/puppet] - 10https://gerrit.wikimedia.org/r/140586 [22:36:29] (03PS1) 10Faidon Liambotis: exim: get rid of the implicit secondary MX feature [operations/puppet] - 10https://gerrit.wikimedia.org/r/140587 [22:36:31] (03PS1) 10Faidon Liambotis: mail: add a root system alias to role::mail::mx [operations/puppet] - 10https://gerrit.wikimedia.org/r/140588 [22:36:44] Jeff_Green: up for some reviews? :) [22:39:29] RECOVERY - Host misc-web-lb.eqiad.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [22:40:22] misc-web-lb for v6 does in fact exist on lvs1005 (was rebooting) and lvs1002 (which should have handled it. should have been primary even, I would have expected) [22:42:01] I wonder what's wrong with lvs1002 ipv6? maybe router things? [22:53:44] paravoid: The NFS overload on labs, was that visible somewhere in ganglia or icinga wikimedia.org (e.g. the virt node maybe?) 
[22:53:55] I'd be nice to see a graph going down as a result of my change (or not) [22:53:57] yeah, virt1002 [22:54:08] Should be all done now as of 3-4 days ago [22:54:41] blergh it has a spike [22:55:29] http://ganglia.wikimedia.org/latest/?r=month&c=Virtualization+cluster+eqiad&h=virt1002.eqiad.wmnet [22:58:19] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected [22:59:08] yeah useless graph [23:00:04] RoanKattouw, mwalker, ori, MaxSem: The time is nigh to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140618T2300) [23:00:13] can do [23:00:18] awesome [23:00:25] Thanks Max [23:00:33] Krinkle: http://ganglia.wikimedia.org/latest/graph.php?h=virt1002.eqiad.wmnet&m=cpu_report&r=custom&s=by%20name&hc=4&mc=2&cs=06%2F12%2F2014%2010%3A00%20&ce=06%2F17%2F2014%2010%3A00%20&st=1403132408&g=network_report&z=large&c=Virtualization%20cluster%20eqiad [23:00:52] tgr, gonna break the site on your behalf, are you there? [23:00:57] Krinkle: note that 200M is the box's capacity [23:01:07] MaxSem: Hold on, we're adding one from me too. James accidentally put it in for Thursday instead of Wednesday [23:01:11] MaxSem: sounds exciting [23:01:17] 250 in theory, more like 220 in practice [23:01:38] Krinkle: so yeah, definitely much much better :) kudos! [23:01:56] (03CR) 10MaxSem: [C: 032] Enable MediaViewer suvery links on Commons [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140250 (owner: 10Gergő Tisza) [23:02:04] (03Merged) 10jenkins-bot: Enable MediaViewer suvery links on Commons [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140250 (owner: 10Gergő Tisza) [23:03:39] !log maxsem Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/140250/ (duration: 00m 04s) [23:03:43] Logged the message, Master [23:03:51] tgr, please test ^^ :) [23:04:58] paravoid: Hm.. 
yeah those 8P spikes looke like a bug [23:05:27] yeah they are [23:05:29] MaxSem: tested with debug=1, works [23:05:32] thanks! [23:05:43] !log maxsem Synchronized php-1.24wmf9/extensions/VisualEditor/: https://gerrit.wikimedia.org/r/#/c/140563/ (duration: 00m 03s) [23:05:47] Logged the message, Master [23:05:52] RoanKattouw, ^^ [23:06:08] andrewbogott: i was afk earlier, faidon tells me you fixed wikitech after i broke it, thanks for that [23:06:26] i'll e-mail the ops list with some context for the change [23:06:28] paravoid: Thx, nice graph. [23:06:30] ori: no worries… sent a report about it to ops [23:06:39] I didn't know about the date pickers [23:06:40] http://ganglia.wikimedia.org/latest/?r=custom&cs=06%2F13%2F2014+00%3A00+&ce=06%2F18%2F2014+00%3A00+&c=Virtualization+cluster+eqiad&h=virt1002.eqiad.wmnet&tab=m&vn=&hide-hf=false&mc=2&z=medium&metric_group=ALLGROUPS [23:06:40] nice [23:08:55] MaxSem: Thanks [23:13:25] (03CR) 10Awight: [C: 04-2] "This approach does not work. Checking with the I18n team to see if they can help." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140574 (owner: 10Awight) [23:14:40] What? The colors of what is "progress" and "needs-updating" etc are in wmf-config? [23:14:46] Can't we settle on anything? [23:15:05] Wow, I'm shocked. This is a fun one to look back on one day. 
[23:15:25] (they're all the same, so that's good) [23:20:11] (03PS1) 10Dzahn: add wsa (wikistats admin) basic shell script [operations/debs/wikistats] - 10https://gerrit.wikimedia.org/r/140601 [23:22:30] (03PS2) 10Dzahn: add wsa (wikistats admin) basic shell script [operations/debs/wikistats] - 10https://gerrit.wikimedia.org/r/140601 [23:26:33] (03PS1) 10Dzahn: add 'add' feature to wikistats admin script [operations/debs/wikistats] - 10https://gerrit.wikimedia.org/r/140604 [23:31:14] (03PS1) 10Dzahn: add maintenance functions for wikistats admins [operations/debs/wikistats] - 10https://gerrit.wikimedia.org/r/140605 [23:34:43] (03PS1) 10Dzahn: add update_functions file to wikistats [operations/debs/wikistats] - 10https://gerrit.wikimedia.org/r/140606 [23:35:31] (03PS2) 10Dzahn: add update_functions file to wikistats [operations/debs/wikistats] - 10https://gerrit.wikimedia.org/r/140606 [23:37:31] (03PS1) 10Dzahn: retab update.php and sync live hack with repo [operations/debs/wikistats] - 10https://gerrit.wikimedia.org/r/140609 [23:37:33] (03CR) 10jenkins-bot: [V: 04-1] retab update.php and sync live hack with repo [operations/debs/wikistats] - 10https://gerrit.wikimedia.org/r/140609 (owner: 10Dzahn) [23:40:04] (03PS2) 10Dzahn: retab update.php and sync live hack with repo [operations/debs/wikistats] - 10https://gerrit.wikimedia.org/r/140609 [23:40:06] (03CR) 10jenkins-bot: [V: 04-1] retab update.php and sync live hack with repo [operations/debs/wikistats] - 10https://gerrit.wikimedia.org/r/140609 (owner: 10Dzahn) [23:55:11] (03PS2) 10Dzahn: apt/pin.pp - retab and mini quoting fix [operations/puppet] - 10https://gerrit.wikimedia.org/r/139458 [23:57:08] (03CR) 10Dzahn: apt/pin.pp - retab and mini quoting fix (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/139458 (owner: 10Dzahn) [23:58:41] (03CR) 10Dzahn: misc/management.pp - retab and lint fixes (033 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/139460 (owner: 10Dzahn)